#Qwen3 model outputs seem broken
Oops. Looks like the unsloth/Qwen3-4B-Base model is producing garbage outputs when using the "standard" Qwen3 chat template.
for some reason I am unable to upload the code 😦
try to copy and paste the code wrapped in three backticks
```python
from unsloth import FastLanguageModel  # import unsloth before transformers
import torch
from transformers import AutoTokenizer

max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 64  # Larger rank = smarter, but slower
model_size = "4B"
BASE_MODEL = f"unsloth/Qwen3-{model_size}-Base"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = BASE_MODEL,
    max_seq_length = max_seq_length,  # Context length - can be longer, but uses more memory
    load_in_4bit = True,  # 4bit uses much less memory
    load_in_8bit = False,  # A bit more accurate, uses 2x memory
    full_finetuning = False,  # We have full finetuning now!
    device_map = "balanced",
    fast_inference = True,
    # token = "hf_...",  # use one if using gated models
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],  # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",  # Enable long context finetuning
    random_state = 7007,
)

# Borrow the chat template from the instruct model's tokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
tokenizer.chat_template = tok.chat_template

messages = [
    {"role": "user", "content": "How many r's in strawberries?"},
]
rendered = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,  # tells template to add assistant preamble
    enable_thinking = True,  # default; set False to skip reasoning block
    # Probably ignored by the Qwen3 template; a system message in `messages` is the usual route
    system_prompt = "You are a helpful assistant.",
)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.01,
    top_p = 0.01,
    max_tokens = 4096,
)
output = model.fast_generate(
    rendered,
    sampling_params = sampling_params,
)[0].outputs[0].text
print(output)
```
@golden marlin this?
Yes.. but this happens before finetuning etc.
The model isn't instruction finetuned....
so it doesn't know how to follow instructions
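For anyone reading later, here is a rough sketch of what a ChatML-style chat template (the format Qwen3 uses) turns the messages into. The `render_chatml` helper is hypothetical and heavily simplified: it skips tools and the thinking block, so treat it as an illustration rather than the real template. The point is that the prompt is just text with role markers, and a base model was not instruction tuned to answer in that format.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Hand-rolled, simplified ChatML rendering (hypothetical helper)."""
    parts = []
    for m in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    # Assumption: the system prompt is passed as a message, not as a keyword argument
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many r's in strawberries?"},
]
prompt = render_chatml(messages)
print(prompt)
```

Printing the rendered prompt before generating is a cheap way to check that the template did what you expected.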
try the it version
and for inference try to stick to how it's done in the notebooks.....
in terms of format and parameters
those have an impact
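On the parameters point, a small sketch of sampling settings that could replace the near-greedy `temperature = 0.01` / `top_p = 0.01` above. The numbers are the ones I recall from the Qwen3 model card (which also advises against greedy decoding in thinking mode), so treat them as an assumption and check the card or the notebook. `qwen3_sampling_kwargs` is a hypothetical helper, not part of unsloth or vLLM.

```python
def qwen3_sampling_kwargs(enable_thinking: bool) -> dict:
    """Return sampling settings as recalled from the Qwen3 model card (assumption)."""
    if enable_thinking:
        return {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}
    return {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0}


kwargs = qwen3_sampling_kwargs(enable_thinking=True)
kwargs["max_tokens"] = 4096  # same limit as in the snippet above
print(kwargs)

# With vLLM installed, these could be passed on as:
#   from vllm import SamplingParams
#   sampling_params = SamplingParams(**kwargs)
```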
BUT definitely use the instruction finetuned (it) version
also don't mix code from different libraries.....
you're already loading the tokenizer with FastLanguageModel
..
is it this one - unsloth/Qwen3-4B-unsloth-bnb-4bit
Not seeing any "it" version here - https://huggingface.co/collections/unsloth/qwen3-680edabfb790c8c34a242f95
Also in this [https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb] notebook they use unsloth/Qwen3-4B-unsloth-bnb-4bit as the base model
thanks Roland. This seems to work 🤞
do you mind closing this ticket then? (I can't close it myself :D)
you should be able to mark it as solved or something
one issue I run into when I load those models is
```
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/unsloth_zoo/vllm_utils.py in load_vllm(model_name, config, gpu_memory_utilization, max_seq_length, dtype, training, float8_kv_cache, random_state, enable_lora, max_lora_rank, max_loras, use_async, use_engine, disable_log_stats, enforce_eager, enable_prefix_caching, compilation_config, conservativeness, max_logprobs, use_bitsandbytes, return_args)
   1195         else:
-> 1196             llm = LLM(**engine_args)
   1197         pass

29 frames

ValueError: Duplicate layer name: model.layers.0.self_attn.attn

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/unsloth_zoo/vllm_utils.py in load_vllm(model_name, config, gpu_memory_utilization, max_seq_length, dtype, training, float8_kv_cache, random_state, enable_lora, max_lora_rank, max_loras, use_async, use_engine, disable_log_stats, enforce_eager, enable_prefix_caching, compilation_config, conservativeness, max_logprobs, use_bitsandbytes, return_args)
   1206             error = str(error)
   1207             if trials >= 2:
-> 1208                 raise RuntimeError(error)
   1209
   1210             if "gpu_memory_utilization" in error or "memory" in error:

RuntimeError: Duplicate layer name: model.layers.0.self_attn.attn
```
in vllm i guess