Hi. I’m having a problem with my environment setup for Qwen3-30B-A3B in Unsloth, and I’m trying to understand what is misconfigured.
I have two different environments. In both of them, I use the same model, the same code, and load_in_4bit=True, but the behavior is very different:
Environment 1 (qwen_base2 or qwen_base)
- Model: Qwen3-30B-A3B
- Setting: load_in_4bit=True
- VRAM usage after loading: 27766 MiB
- Test generate: input/output = 63 / 167 tokens (same prompt)
- Latency: 357 seconds
Environment 2 (qwen3_test)
- Model: Qwen3-30B-A3B
- Setting: load_in_4bit=True
- VRAM usage after loading: 71152 MiB
- Test generate: input/output = 63 / 167 tokens (same prompt)
- Latency: 39 seconds
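To make the two runs easier to compare, here is a small sketch that converts the numbers above into throughput and GiB. The helper names (`tokens_per_second`, `mib_to_gib`) are just mine for illustration and are not part of Unsloth or any other library.

```python
def tokens_per_second(output_tokens: int, latency_s: float) -> float:
    """Generated tokens divided by wall-clock latency."""
    return output_tokens / latency_s


def mib_to_gib(mib: float) -> float:
    """Convert the MiB figure reported by nvidia-smi to GiB."""
    return mib / 1024


# Measurements copied from the two environments described above.
runs = {
    "qwen_base": {"vram_mib": 27766, "output_tokens": 167, "latency_s": 357},
    "qwen3_test": {"vram_mib": 71152, "output_tokens": 167, "latency_s": 39},
}

for name, r in runs.items():
    tps = tokens_per_second(r["output_tokens"], r["latency_s"])
    print(f"{name}: {mib_to_gib(r['vram_mib']):.1f} GiB, {tps:.2f} tokens/s")
```

So Environment 1 generates at roughly 0.47 tokens/s with about 27.1 GiB in use, while Environment 2 runs at roughly 4.28 tokens/s with about 69.5 GiB in use: about 9x faster, but with about 2.5x the memory.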
Based on the Unsloth documentation, I expected Qwen3-30B-A3B with load_in_4bit=True to use around 18 GB of VRAM:
https://unsloth.ai/docs/models/tutorials/qwen3-coder-how-to-run-locally#run-qwen3-coder-30b-a3b-instruct
Or does this only apply to Qwen3-Coder-30B-A3B-Instruct? As far as I understand, there shouldn’t be any difference.
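For context, this is the back-of-the-envelope estimate behind my expectation. It is only a sketch: it counts weights alone, assumes a nominal 30e9 parameters (taken from the "30B" in the model name, not an exact count), and ignores quantization metadata, layers that stay in higher precision, the KV cache, activations, and the CUDA context.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: nominal parameter count from the model name ("30B").
NUM_PARAMS = 30e9

for label, bits in [("4-bit", 4), ("16-bit", 16)]:
    print(f"{label}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB for weights")
```

Under these assumptions, 4-bit weights come to about 14 GiB, which fits the ~18 GB figure in the documentation once overhead is added, and 16-bit weights alone would be about 56 GiB. Neither environment matches the 4-bit estimate, and the 71152 MiB in Environment 2 looks closer to an unquantized footprint, although I have not confirmed that this is what is actually happening.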
I attached:
- notebooks with saved outputs (including model loading, nvidia-smi, library versions, xformers.info, and other details)
- .sh install scripts for both environments
Could someone help me understand what is wrong and how to fix it?