#Fast Inference vllm GRPO tuning on Qwen3VL


echo harbor

Hey all, I am trying to tune Qwen3VL 4B with unsloth on a custom dataset, and I am not able to load the model with fast inference.

import os

os.environ["UNSLOTH_VLLM_STANDBY"] = "1"  # set before importing unsloth
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    model_name="unsloth/Qwen3-VL-4B-Instruct",
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=False,
    fast_inference=True,
    max_lora_rank=LORA_RANK,
    gpu_memory_utilization=0.95,
    unsloth_vllm_standby=True,
    enable_lora=True,
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=LORA_RANK,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
    use_gradient_checkpointing="unsloth",
)

I get the following error during the init:

RuntimeError: new_init() missing 1 required positional argument: 'embedding_padding_modules'

I am quite sure it is a mismatch in the vllm version. I have vllm==0.12.0. What is the recommended version to run Qwen3VL in fast inference mode? Thanks!
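For anyone trying to reproduce this, it may help to post the exact versions installed in the environment. Below is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration, and the distribution names in the default list (`unsloth_zoo`, `torch`, `transformers`) are my assumption about which packages matter here, not a confirmed list.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(
    packages=("vllm", "unsloth", "unsloth_zoo", "torch", "transformers"),
):
    """Return {distribution name: version string, or 'not installed'}."""
    found = {}
    for name in packages:
        try:
            # Reads the installed package metadata; does not import the package.
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver}")
```

Since this reads package metadata rather than importing anything, it should run even in an environment where `from unsloth import ...` itself fails.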

heavy grove

@crimson pilot could you check this out thanks