Qwen3-30B-A3B: inconsistent VRAM usage and very different generation latency across two environments | Unsloth AI | Page 1

agile pewter Mar 27, 2026, 7:26 PM

#

Hi. I’m having a problem with my environment setup for Qwen3-30B-A3B in Unsloth, and I’m trying to understand what is misconfigured.

I have two different environments. In both of them, I use the same model, the same code, and load_in_4bit=True, but the behavior is very different:

Environment 1 (qwen_base2 or qwen_base)

Model: Qwen3-30B-A3B
Setting: load_in_4bit=True
VRAM usage after loading: 27766 MiB
Test generate: input/output = 63 / 167 tokens (same prompt)
Latency: 357 seconds

Environment 2 (qwen3_test)

Model: Qwen3-30B-A3B
Setting: load_in_4bit=True
VRAM usage after loading: 71152 MiB
Test generate: input/output = 63 / 167 tokens (same prompt)
Latency: 39 seconds
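
To put the two reports on the same scale, here is a minimal sketch that converts the figures above to GiB and to end-to-end output tokens per second. It uses only the numbers quoted in this message. The dictionary keys and the `summarize` helper are illustrative names, and the latency is assumed to cover the whole generate call, prompt processing included.

```python
# Figures copied from the two environment reports above.
MIB_PER_GIB = 1024

runs = {
    "env1 (qwen_base)": {"vram_mib": 27766, "output_tokens": 167, "latency_s": 357},
    "env2 (qwen3_test)": {"vram_mib": 71152, "output_tokens": 167, "latency_s": 39},
}


def summarize(run):
    """Return (VRAM after loading in GiB, end-to-end output tokens per second)."""
    vram_gib = run["vram_mib"] / MIB_PER_GIB
    tokens_per_s = run["output_tokens"] / run["latency_s"]
    return vram_gib, tokens_per_s


summary = {name: summarize(run) for name, run in runs.items()}
for name, (vram_gib, tokens_per_s) in summary.items():
    print(f"{name}: {vram_gib:.1f} GiB, {tokens_per_s:.2f} tok/s")
```

On these numbers, environment 2 generates roughly 9x faster while holding about 2.6x as much VRAM.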

Based on the Unsloth documentation, I expected Qwen3-30B-A3B with load_in_4bit=True to use around 18 GB of VRAM:
https://unsloth.ai/docs/models/tutorials/qwen3-coder-how-to-run-locally#run-qwen3-coder-30b-a3b-instruct

Or does this only apply to Qwen3-Coder-30B-A3B-Instruct? As far as I understand, there shouldn’t be any difference.

I attached:

notebooks with saved outputs (including loading models, nvidia-smi, library versions, xformers.info, and other details)
.sh install scripts for both environments

Could someone help me understand what is wrong and how to fix it?

sullen dagger Mar 27, 2026, 7:28 PM

#

no .sh shared

#

what is the transformers version in both environments?

agile pewter Mar 27, 2026, 7:29 PM

#

I’m trying to upload them, but the bot keeps deleting the files.

sullen dagger Mar 27, 2026, 7:29 PM

#

sullen dagger what is the transformers version in both environments?

^

agile pewter Mar 27, 2026, 7:30 PM

#

==((====))==  Unsloth 2026.3.15: Fast Qwen3_MoE patching. Transformers: 5.3.0.
   \\   /|    NVIDIA A100 80GB PCIe. Num GPUs = 1. Max memory: 79.325 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.35. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

#

==((====))==  Unsloth 2026.3.15: Fast Qwen3_MoE patching. Transformers: 4.57.6. vLLM: 0.18.0.
   \\   /|    NVIDIA A100 80GB PCIe. Num GPUs = 1. Max memory: 79.325 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.35. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth

#

first is env 2
second is env 1

sullen dagger Mar 27, 2026, 7:32 PM

#

the first one is the right one

#

Qwen3-30B-A3B is MoE, right?

agile pewter Mar 27, 2026, 7:32 PM

#

yes

#

#!/bin/bash
eval "$(conda shell.bash hook)"

env_name="qwen_base"

conda create --name ${env_name} python=3.13 cuda-toolkit=12.8 -c nvidia -y

conda activate ${env_name}

pip install --upgrade cmake ninja pip setuptools wheel && pip install uv

uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

uv pip install unsloth vllm unsloth_zoo

uv pip install pandas openpyxl tqdm ipywidgets ipykernel

python -m ipykernel install --user --name ${env_name} --display-name "Python (${env_name})"

#

=================================================================================

#

#!/bin/bash
eval "$(conda shell.bash hook)"

env_name="qwen3_test"

conda create --name ${env_name} python=3.11 cuda-toolkit=12.8 -c nvidia -y

conda activate ${env_name}
pip install --upgrade pip && pip install uv

uv pip install unsloth
uv pip install transformers==5.3.0
uv pip install --no-deps trl==0.22.2

uv pip install pandas openpyxl tqdm ipywidgets ipykernel

python -m ipykernel install --user --name ${env_name} --display-name "Python (${env_name})"

sullen dagger Mar 27, 2026, 7:34 PM

#

no wait

#

mmm

#

maybe not 5.3.0

#

and the behavior of a lot of models broke under 5.3.0 (we're still working on compatibility for that version)

#

what is the trl version in your env 1?

#

why are you pinning it in env 2 but not env 1?

agile pewter Mar 27, 2026, 7:37 PM

#

qwen_base libraries:
unsloth: 2026.3.15
torch: 2.10.0+cu128
CUDA version: 12.8
torch.cuda.is_available: True
cudnn: 91002
torchvision_version 0.25.0+cu128
torchaudio_version 2.10.0+cu128
transformers_version 4.57.6
trl_version 0.24.0
triton_version 3.6.0
make_tensor_descriptor - True
bitsandbytes_version 0.49.2
xformers_version 0.0.35
causal_conv1d import failed: No module named 'causal_conv1d'
flash_attn import failed: No module named 'fla_core'
flash_linear_attention import failed: No module named 'flash_linear_attention'

#

=============================================================================

#

qwen3_test libraries:

unsloth: 2026.3.15
torch: 2.10.0+cu128
CUDA version: 12.8
torch.cuda.is_available: True
cudnn: 91002
torchvision_version 0.25.0+cu128
torchaudio import failed: No module named 'torchaudio'
transformers_version 5.3.0
trl_version 0.22.2
triton_version 3.6.0
make_tensor_descriptor - True
bitsandbytes_version 0.49.2
xformers_version 0.0.35
causal_conv1d import failed: No module named 'causal_conv1d'
flash_attn import failed: No module named 'fla_core'
flash_linear_attention import failed: No module named 'flash_linear_attention'
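
A listing like the two above can be reduced to just the differing packages. This is a minimal standard-library sketch: `installed_versions` and `diff_versions` are hypothetical helper names, and the two dictionaries contain only versions quoted in this thread.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(env_a, env_b):
    """Return {package: (version_in_a, version_in_b)} for entries that differ."""
    names = sorted(set(env_a) | set(env_b))
    return {n: (env_a.get(n), env_b.get(n)) for n in names if env_a.get(n) != env_b.get(n)}


# Versions as reported in the two listings above (None = import failed).
qwen_base = {
    "torch": "2.10.0+cu128", "torchaudio": "2.10.0+cu128",
    "transformers": "4.57.6", "trl": "0.24.0",
    "bitsandbytes": "0.49.2", "xformers": "0.0.35",
}
qwen3_test = {
    "torch": "2.10.0+cu128", "torchaudio": None,
    "transformers": "5.3.0", "trl": "0.22.2",
    "bitsandbytes": "0.49.2", "xformers": "0.0.35",
}

for package, (base, test) in diff_versions(qwen_base, qwen3_test).items():
    print(f"{package}: qwen_base={base} qwen3_test={test}")
```

Running `installed_versions(["transformers", "trl", "torch"])` inside each kernel yields a dictionary that can be passed straight to `diff_versions`.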

sullen dagger Mar 27, 2026, 7:43 PM

#

i suspect you'll need to change some dependency versions in env1

agile pewter Mar 27, 2026, 7:48 PM

#

Yes, I agree, but at this point I honestly don’t know how. I’ve been experimenting, trying to replicate the environments used in the Unsloth Google Colab notebooks, and I’ve already spent 4–5 days on this. Do you have any guesses about which library might be causing the problem?

sullen dagger Mar 27, 2026, 7:49 PM

#

let me get back to you on this tomorrow

#

i need to consult with a colleague

#

about this first

agile pewter Mar 27, 2026, 7:50 PM

#

I can’t attach the files in the chat because the bot keeps deleting them, so I uploaded them to Google Drive for convenience
https://drive.google.com/drive/folders/1jjQ1O1OEZVKF2yDG7_3LEC2fNdz_eizl?usp=sharing

agile pewter Mar 29, 2026, 6:36 AM

#

@sullen dagger Did you get a chance to talk to your colleagues?

agile pewter Mar 29, 2026, 7:11 AM

#

sullen dagger why are you pinning it in env 2 but not env 1?

I tried to copy the env from https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_MoE.ipynb#scrollTo=DkIvEkIIkEyB


sullen dagger Mar 29, 2026, 11:08 AM

#

agile pewter @sullen dagger Did you get a chance to talk to your colleagues?

Not before monday.

#

also please don't cross post on two different channels and reddit... flooding doesn't get you a faster answer

sullen dagger Mar 29, 2026, 12:48 PM

#

@summer lagoon your help here would be appreciated when you get the time

summer lagoon Mar 29, 2026, 12:58 PM

#

agile pewter Hi. I’m having a problem with my environment setup for Qwen3-30B-A3B in Unsloth,...

so the thing is, transformers v5 changed the weight layout for MoE
so anytime you use transformers v5 with an MoE model, which stores the expert weights as nn.Parameter, bnb/peft doesn't quantize those weights
so there won't be much difference between a 4-bit and a 16-bit load for MoE on transformers v5 (~90% of the model's weights are in the MoE experts)

But in transformers v4, it was nn.ModuleList(nn.Linear(...) for _ in range(experts)), so quantization with bnb/peft was straightforward
This is more of a transformers restriction 🙁

Otoh, as written here, we did a lot of MoE-related optimisations specifically for LoRA to make things faster and more memory efficient. Unfortunately they are deeply tied to transformers v5 🙂
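
A back-of-the-envelope sketch of why the layout matters for memory. The ~90% expert share comes from the message above. The total parameter count (about 30.5 billion) and the byte sizes (2 bytes per 16-bit weight, 0.5 bytes per 4-bit weight) are assumptions, and the estimate ignores quantization metadata, activations, the KV cache, and the CUDA context. It will therefore come out below what nvidia-smi reports.

```python
GIB = 1024 ** 3


def weight_gib(total_params, quantized_fraction, bytes_16bit=2.0, bytes_4bit=0.5):
    """Approximate weight memory when only `quantized_fraction` of the
    parameters is packed to 4-bit and the rest stays in 16-bit."""
    kept_16bit = total_params * (1.0 - quantized_fraction) * bytes_16bit
    packed_4bit = total_params * quantized_fraction * bytes_4bit
    return (kept_16bit + packed_4bit) / GIB


TOTAL_PARAMS = 30.5e9  # assumption: approximate total parameter count
EXPERT_SHARE = 0.90    # from the message above: ~90% of weights are MoE experts

# v4-style layout: experts are nn.Linear modules, so everything can be quantized.
all_quantized = weight_gib(TOTAL_PARAMS, 1.0)
# v5-style layout: experts are plain parameters and stay in 16-bit.
experts_skipped = weight_gib(TOTAL_PARAMS, 1.0 - EXPERT_SHARE)
# Reference point: no quantization at all.
full_16bit = weight_gib(TOTAL_PARAMS, 0.0)

print(f"all weights 4-bit:      {all_quantized:.1f} GiB")
print(f"experts left in 16-bit: {experts_skipped:.1f} GiB")
print(f"everything in 16-bit:   {full_16bit:.1f} GiB")
```

Under these assumptions, skipping the experts saves only a few GiB relative to a full 16-bit load, while quantizing everything cuts the weights to roughly a quarter.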

agile pewter Mar 30, 2026, 6:09 AM

#

The explanation about transformers makes it sound like I can’t achieve both fast inference and low VRAM usage at the same time. Is that actually the case, or is it still possible with the right combination of library versions?

#

Then how is it possible that Qwen3-30B-A3B loads into 27,766 MiB on the qwen_base environment?

#

I already tried experimenting with UNSLOTH_MOE_BACKEND set to grouped_mm, unsloth_triton, and native_torch.

But I don’t remember which environment I tested them in, and I didn’t really notice much difference. Doesn’t Unsloth choose the optimal backend automatically by default?

summer lagoon Mar 30, 2026, 6:26 AM

#

agile pewter I already tried experimenting with UNSLOTH_MOE_BACKEND set to grouped_mm, unslot...

these are all only applicable to transformers v5
And unfortunately you can't get that speedup without it.

agile pewter Mar 30, 2026, 6:39 AM

#

Will I be unable to save VRAM even if I use unsloth/Qwen3-30B-A3B-bnb-4bit?

As I understand it, should I avoid using 4-bit entirely and stick to 16-bit, since:

“Training MoE models in 4-bit QLoRA isn’t recommended right now because BitsandBytes doesn’t support it. This isn’t specific to Unsloth. For now, use bf16 for LoRA or full fine-tuning.”

So does that mean I should train only in full precision, but then can I quantize the trained model to 4-bit afterward (maybe using an environment with Transformers v4)?

This is important for me because my inference servers use A10 GPUs with 24 GB VRAM, and I will only be able to run quantized models there.

summer lagoon Mar 30, 2026, 6:41 AM

#

oh yeah, once you've trained using transformers v5, getting all the benefits of the faster MoE I shared above, you can quantise the model post-training
you can then run inference on the quantized model under v4 if you like 🙂

summer lagoon Mar 30, 2026, 7:17 AM

#

you can try this out and see if it helps you
https://github.com/unslothai/unsloth-zoo/pull/527
but mind you this is a community PR and might not be fully complete/perfect
so take that with a grain of salt

Also when doing quantisation, I'd recommend setting HF_DEACTIVATE_ASYNC_LOAD env variable
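
A minimal sketch of setting that variable from Python. The thread only names `HF_DEACTIVATE_ASYNC_LOAD`; the value `"1"` and the choice to set it before importing any model library are assumptions here, and `set_loading_flags` is a hypothetical helper.

```python
import os


def set_loading_flags(env=os.environ):
    """Set the flag named above in `env` and return the mapping.

    Assumption: "1" is taken as the enabling value; the thread does not
    say which value the library expects.
    """
    env["HF_DEACTIVATE_ASYNC_LOAD"] = "1"
    return env


# Call this first, then import unsloth / transformers afterwards, so the
# flag is already present whenever the library reads the environment.
set_loading_flags()
print(os.environ["HF_DEACTIVATE_ASYNC_LOAD"])
```

The shell equivalent is exporting the variable before starting Python or the Jupyter kernel.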

agile pewter Apr 1, 2026, 6:07 AM

#

Oh thanks, it basically worked. I experimented with different versions of other libraries and managed to get it running. Here are the new environments for Qwen3-30B-A3B:

env18 — latency (sec): 47.47 and 32,248 MiB
env15 — latency (sec): 75.34 and 19,676 MiB

I’ll upload the env_check files to a Google Drive link if you’re interested.
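
Since the target servers mentioned earlier are 24 GB A10s, here is a quick sketch that checks each measured footprint from this thread against that budget. The budget is taken as the nominal 24 GiB, which is an assumption (the usable amount is lower), and the measurements were made on an A100 right after loading, so generation-time memory such as the KV cache is not included.

```python
MIB_PER_GIB = 1024
A10_BUDGET_MIB = 24 * MIB_PER_GIB  # assumption: nominal capacity, not usable memory

# VRAM after loading, as reported in this thread.
measured_mib = {
    "qwen_base": 27766,
    "qwen3_test": 71152,
    "env18": 32248,
    "env15": 19676,
}


def headroom_mib(used_mib, budget_mib=A10_BUDGET_MIB):
    """Remaining memory in MiB; negative means the load does not fit."""
    return budget_mib - used_mib


headroom = {name: headroom_mib(mib) for name, mib in measured_mib.items()}
for name, spare in headroom.items():
    verdict = "fits" if spare >= 0 else "does not fit"
    print(f"{name}: {verdict} ({spare:+d} MiB)")
```

Of the four environments, only env15 stays under the nominal budget.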

#

qwen_base18.sh

#!/bin/bash
eval "$(conda shell.bash hook)"

env_name="qwen_base18"

# attempt to install https://github.com/unslothai/unsloth-zoo/pull/527 to get MoE working, recommended by someone on Discord
# a cleaner environment, without any extra libraries
conda create --name ${env_name} python=3.11 cuda-toolkit=12.8 -c nvidia -y

conda activate ${env_name}

pip install --upgrade cmake ninja pip setuptools wheel && pip install uv

uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

uv pip install unsloth vllm "unsloth_zoo @ git+https://github.com/sensai99/unsloth-zoo.git@493405b"
uv pip install transformers==5.3.0
uv pip install --no-deps trl==0.22.2

# Extra libraries
uv pip install pandas openpyxl tqdm ipywidgets ipykernel


# register the conda env as a Jupyter kernel right away
python -m ipykernel install --user --name ${env_name} --display-name "Python (${env_name})"

#

qwen_base15.sh

#!/bin/bash
eval "$(conda shell.bash hook)"

env_name="qwen_base15"

# attempt to install https://github.com/unslothai/unsloth-zoo/pull/527 to get MoE working, recommended by someone on Discord
conda create --name ${env_name} python=3.11 cuda-toolkit=12.8 -c nvidia -y

conda activate ${env_name}

pip install --upgrade cmake ninja pip setuptools wheel && pip install uv

uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

uv pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer

uv pip install --no-deps "unsloth_zoo @ git+https://github.com/sensai99/unsloth-zoo.git@493405b" bitsandbytes accelerate "xformers==0.0.34" peft trl triton unsloth
uv pip install transformers==5.3.0
uv pip install --no-deps trl==0.22.2

# Extra libraries
uv pip install pandas openpyxl tqdm ipywidgets ipykernel


# register the conda env as a Jupyter kernel right away
python -m ipykernel install --user --name ${env_name} --display-name "Python (${env_name})"

#

Here is the link to Google Drive:
https://drive.google.com/drive/folders/1jjQ1O1OEZVKF2yDG7_3LEC2fNdz_eizl?usp=drive_link
