Hi Unsloth Team 👋,
I’m currently testing the Qwen3-VL 8B Instruct model using Unsloth and would really appreciate some guidance or example code for multimodal inference (image + text).
Could you please share or confirm the correct setup for:
1. Loading the Qwen3-VL 8B Instruct model with Unsloth.
2. Running inference on both image and text inputs (e.g., image captioning or visual QA).
3. Enabling FlashAttention 2 or Unsloth’s optimized attention mechanism.
4. Measuring performance metrics: right now I’m getting around 10 tokens per second on inference. I’d like to know if that’s expected, or if there are optimizations (like FP16/BF16, batch settings, or CUDA tuning) to improve generation speed.
5. Compatibility information: specifically the required versions of torch, unsloth, transformers, and CUDA for full vision + language support.
Here’s a minimal example of what I tried so far:
from unsloth import FastLanguageModel
from PIL import Image

# Attempting to load Qwen3-VL 8B Instruct
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-VL-8B-Instruct",
    load_in_4bit=True,
    device_map="auto",
)

# Example: Visual prompt
image = Image.open("sample_image.jpg")
prompt = "Describe the image in detail."

# Inference attempt
response = model.chat(tokenizer, image=image, text=prompt)
print(response)
However, I encountered issues:
- FlashAttention didn’t seem to activate.
- The vision encoder didn’t respond to image input (text-only output).
- Token generation speed is ~10 tokens/sec on an A100 (mixed precision).
Could you please share:
1. A working inference code snippet for multimodal (text + image) prompts.
2. Recommended environment setup and best practices for optimizing speed.
3. Whether FlashAttention 2 is officially supported for the Qwen3-VL 8B Instruct model in Unsloth.
Thank you so much! 🙏
Your help will be invaluable for developers integrating Qwen3-VL models with Unsloth.