Qwen3 model outputs seem broken | Unsloth AI | Page 1

ocean parcel May 20, 2025, 2:00 PM

#

hello?

#

what happened

golden marlin May 20, 2025, 2:02 PM

#

Oops. Looks like the unsloth/Qwen3-4B-Base model is producing garbage outputs when using the "standard" Qwen3 chat template.

ocean parcel May 20, 2025, 2:03 PM

#

are you trying to lora finetune?

#

lora/qlora

golden marlin May 20, 2025, 2:03 PM

#

for some reason I am unable to upload the code 😦

ocean parcel May 20, 2025, 2:04 PM

#

try to copy and paste the code wrapped with

three reverse tickmarks

golden marlin May 20, 2025, 2:04 PM

#

```python
from unsloth import FastLanguageModel  # import unsloth before transformers
from transformers import AutoTokenizer
import torch

max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower
model_size = "4B"
BASE_MODEL = f"unsloth/Qwen3-{model_size}-Base"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = BASE_MODEL,
    max_seq_length = max_seq_length,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    device_map = "balanced",
    fast_inference = True,
    # token = "hf_...",      # use one if using gated models
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 7007,
)

# Borrow the chat template from the instruct tokenizer and apply it to the base model
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
tokenizer.chat_template = tok.chat_template
messages = [
    {"role": "user", "content": "How many r's in strawberries?"},
]
rendered = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,   # tells template to add assistant preamble
    enable_thinking=True,         # default; set False to skip reasoning block
    system_prompt="You are a helpful assistant."
)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.01,
    top_p = 0.01,
    max_tokens = 4096,
)
output = model.fast_generate(
    rendered,
    sampling_params = sampling_params,
)[0].outputs[0].text

print(output)
```

ocean parcel May 20, 2025, 2:04 PM

#

ocean parcel are you trying to lora finetune?

@golden marlin this?

golden marlin May 20, 2025, 2:05 PM

#

Yes.. but this happens before finetuning etc.

ocean parcel May 20, 2025, 2:05 PM

#

The model isn't instruction finetuned....

#

so it doesn't know how to follow instructions

#

try the it version

#

and for inference try to stick to how it's done in the notebooks.....

#

in terms of format and parameters

#

those have an impact

#

BUT definitely use the instruction finetuned (it) version

#

also don't mix code from different libraries.....

#

you're already loading the tokenizer with FastLanguageModel

#

..
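To make the point about format concrete, here is a minimal sketch of what a ChatML-style chat template renders. `render_chatml` is a hypothetical helper written for this illustration, not the real Qwen3 Jinja template (which also handles tools and the thinking block). An instruction-finetuned checkpoint has been trained to answer prompts laid out like this, while a base checkpoint has not been tuned to do so. Note that the system prompt is passed as a message with role `system`.

```python
# Simplified stand-in for a ChatML-style chat template (hypothetical helper,
# not the actual Qwen3 template shipped with the tokenizer).
def render_chatml(messages, add_generation_prompt=True):
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many r's in strawberries?"},
]
prompt = render_chatml(messages)
print(prompt)
```

In practice the tokenizer returned by `FastLanguageModel.from_pretrained` already carries its own template, so `tokenizer.apply_chat_template(messages, ...)` can be called on it directly without borrowing one from another tokenizer.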

golden marlin May 20, 2025, 2:09 PM

#

is it this one - unsloth/Qwen3-4B-unsloth-bnb-4bit

ocean parcel May 20, 2025, 2:10 PM

#

should have "it" in the name

#

if you mean the instruction finetuned variants

golden marlin May 20, 2025, 2:12 PM

#

Not seeing any here - https://huggingface.co/collections/unsloth/qwen3-680edabfb790c8c34a242f95


#

Also in this [https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb] notebook they use unsloth/Qwen3-4B-unsloth-bnb-4bit as the base model


ocean parcel May 20, 2025, 2:29 PM

#

then it's probably an instruction finetuned variant

#

and you can use it

golden marlin May 20, 2025, 2:32 PM

#

thanks Roland. This seems to work 🤞

ocean parcel May 20, 2025, 2:33 PM

#

do you mind closing this ticket then? (I can't close it myself :D)

#

you should be able to mark it as solved or something

golden marlin May 20, 2025, 2:43 PM

#

one issue I run into when I load those models is
```
ValueError Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/unsloth_zoo/vllm_utils.py in load_vllm(model_name, config, gpu_memory_utilization, max_seq_length, dtype, training, float8_kv_cache, random_state, enable_lora, max_lora_rank, max_loras, use_async, use_engine, disable_log_stats, enforce_eager, enable_prefix_caching, compilation_config, conservativeness, max_logprobs, use_bitsandbytes, return_args)
1195 else:
-> 1196 llm = LLM(**engine_args)
1197 pass

29 frames
ValueError: Duplicate layer name: model.layers.0.self_attn.attn

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/unsloth_zoo/vllm_utils.py in load_vllm(model_name, config, gpu_memory_utilization, max_seq_length, dtype, training, float8_kv_cache, random_state, enable_lora, max_lora_rank, max_loras, use_async, use_engine, disable_log_stats, enforce_eager, enable_prefix_caching, compilation_config, conservativeness, max_logprobs, use_bitsandbytes, return_args)
1206 error = str(error)
1207 if trials >= 2:
-> 1208 raise RuntimeError(error)
1209
1210 if "gpu_memory_utilization" in error or "memory" in error:

RuntimeError: Duplicate layer name: model.layers.0.self_attn.attn
```

ocean parcel May 20, 2025, 2:48 PM

#

in vllm i guess
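A note on reading that traceback: the `RuntimeError` appears to be a re-raise of the original `ValueError` from the engine constructor, so the first error is the one to investigate. The sketch below reproduces that pattern in plain Python. `build_engine` and `load_with_retries` are hypothetical names, and the retry loop is an assumption inferred from the three visible lines (`error = str(error)`, `if trials >= 2`, `raise RuntimeError(error)`), not the actual `load_vllm` implementation.

```python
def build_engine():
    # Stand-in for the failing engine construction seen in the traceback
    raise ValueError("Duplicate layer name: model.layers.0.self_attn.attn")


def load_with_retries(max_trials=2):
    # Hypothetical retry loop; only the error handling mirrors the traceback
    trials = 0
    while True:
        try:
            return build_engine()
        except Exception as error:
            trials += 1
            message = str(error)
            if trials >= max_trials:
                # Raising inside `except` chains the new error to the original,
                # which prints "During handling of the above exception, ..."
                raise RuntimeError(message)
            # Otherwise loop around and try again


try:
    load_with_retries()
except RuntimeError as wrapped:
    wrapped_message = str(wrapped)
    original = wrapped.__context__  # the ValueError that was being handled
    print(type(wrapped).__name__, "->", wrapped_message)
    print(type(original).__name__, "->", original)
```

Both lines carry the same message, which matches the duplicated "Duplicate layer name" text in the pasted output.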
