uneven rover Oct 26, 2025, 1:05 PM

#

Hi Unsloth Team 👋,

I’m currently testing the Qwen3-VL 8B Instruct model using Unsloth and would really appreciate some guidance or example code for multimodal inference (image + text).

Could you please share or confirm the correct setup for:

Loading the Qwen3-VL 8B Instruct model with Unsloth.

Running inference on both image and text inputs (e.g., image captioning or visual QA).

Enabling FlashAttention 2 or Unsloth’s optimized attention mechanism.

Measuring performance metrics — right now I’m getting around 10 tokens per second on inference. I’d like to know if that’s expected, or if there are optimizations (like FP16/BF16, batch settings, or CUDA tuning) to improve generation speed.

Compatibility information — specifically the required versions of torch, unsloth, transformers, and CUDA for full vision + language support.

Here’s a minimal example of what I tried so far:

from unsloth import FastLanguageModel
from PIL import Image

# Attempting to load Qwen3-VL 8B Instruct
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-VL-8B-Instruct",
    load_in_4bit=True,
    device_map="auto"
)

# Example: Visual prompt
image = Image.open("sample_image.jpg")
prompt = "Describe the image in detail."

# Inference attempt
response = model.chat(tokenizer, image=image, text=prompt)
print(response)

However, I encountered issues:

FlashAttention didn’t seem to activate.

The vision encoder didn’t respond to image input (text-only output).

Token generation speed is ~10 tokens/sec on an A100 (mixed precision).

Could you please share:

A working inference code snippet for multimodal (text + image) prompts.

Recommended environment setup and best practices for optimizing speed.

Whether FlashAttention2 is officially supported for the Qwen3-VL 8B Instruct model in Unsloth.

Thank you so much! 🙏
Your help will be invaluable for developers integrating Qwen3-VL models with Unsloth.

north pumiceBOT Oct 26, 2025, 1:06 PM

#

To run multimodal inference with Qwen3-VL 8B Instruct in Unsloth, use FastVisionModel (not FastLanguageModel). FlashAttention 2 is not explicitly documented as supported for Qwen3-VL 8B in Unsloth, but Unsloth’s own optimized attention is enabled by default. 4-bit quantization mainly reduces VRAM usage. use_gradient_checkpointing="unsloth" is a training-time memory setting and does not speed up inference, so it is left out below. Make sure your environment uses torch 2.6+, transformers 4.57.0+, and CUDA 12.x. Mixed precision (FP16/BF16) is supported, and BF16 is recommended on A100.

Here’s a minimal inference example for image+text:

from unsloth import FastVisionModel
from PIL import Image

# Load the pre-quantized 4-bit checkpoint (lower VRAM use).
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch the model to inference mode

image = Image.open("sample_image.jpg").convert("RGB")
# The {"type": "image"} entry marks where the image goes in the chat template.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the image in detail."}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)

10 tokens/sec is typical for 8B models in 4-bit on A100; for more, try batch inference, ensure no CPU bottlenecks, and use the latest Unsloth/transformers. For more detail or troubleshooting, let me know.
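
To check a tokens/sec figure like the one above on your own setup, a small timing helper is usually enough. The sketch below is illustrative only: measure_tokens_per_second and fake_generate are hypothetical names, and fake_generate stands in for a real model call so the snippet runs without a GPU.

```python
import time
from typing import Callable, Sequence


def measure_tokens_per_second(
    generate_fn: Callable[[], Sequence[int]],
    prompt_length: int,
) -> float:
    """Time one generation call and return newly generated tokens per second.

    generate_fn is a hypothetical zero-argument wrapper around your own
    generate call. It must return the full output sequence
    (prompt tokens followed by new tokens).
    """
    start = time.perf_counter()
    output_ids = generate_fn()
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - prompt_length
    if new_tokens <= 0 or elapsed <= 0:
        return 0.0
    return new_tokens / elapsed


def fake_generate() -> list:
    """Stand-in for a real model: 8 prompt tokens plus 32 'generated' ones."""
    time.sleep(0.05)
    return list(range(40))


rate = measure_tokens_per_second(fake_generate, prompt_length=8)
print(f"{rate:.1f} tokens/sec")
```

With a real model you would pass something like lambda: model.generate(**inputs, max_new_tokens=256)[0] as generate_fn and inputs["input_ids"].shape[1] as prompt_length. Run one warm-up generation first so one-time setup cost is not counted.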


uneven rover Oct 26, 2025, 1:13 PM

#

@north pumice Hi again 👋,

Thanks for confirming that Qwen3-VL 8B Instruct works with FastVisionModel — that’s running great now on my side.

I had one more question about performance tuning:
I’m running the model on an A100 40 GB GPU, and I currently get around 10 tokens/sec in 4-bit mode.

Could you please suggest how to increase the generation speed (ideally 40–60 tokens/sec)?
I’d love to know if there are recommended configuration changes or environment flags for faster inference.

Specifically:

Should I switch to BF16/FP16 or 8-bit for faster decoding?

Any benefit from disabling gradient checkpointing during inference?

Are there Unsloth-specific flags (like USE_TORCH_COMPILE, CUDA_GRAPH, etc.) that help with kernel optimization?

Any benchmark numbers for Qwen3-VL 8B on A100 40 GB for reference?

Goal → Reach 40–60 tokens/sec while keeping image + text inference stable.

Thanks again for your help! 🙏
Tagging @north pumice for any kernel tuning insights.

north pumiceBOT Oct 26, 2025, 1:13 PM

#

It seems likely that 10 tokens/sec is typical for Qwen3-VL 8B in 4-bit on A100, and reaching 40–60 tokens/sec is not currently expected for single-user, single-batch inference. Unsloth recommends using 4-bit quantization, BF16/FP16 for mixed precision, and disabling gradient checkpointing during inference (it’s only for training). Batch inference (multiple requests at once) can increase total throughput, but not single-stream speed. No Unsloth-specific flags like USE_TORCH_COMPILE or CUDA_GRAPH are documented to boost inference speed for this model; kernel compilation overhead is mostly amortized over longer runs.

Benchmarks from Unsloth and community users consistently report 10–15 tokens/sec for Qwen3-VL 8B on A100 40GB in 4-bit mode, with higher throughput possible only via batching or speculative decoding (not yet supported in Unsloth for this model). For further speed, ensure latest torch, transformers, and CUDA, and avoid CPU bottlenecks. If you need more detail or want to discuss multi-user batching, let me know.

Sources:

https://github.com/unslothai/unsloth/discussions/2830
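
The difference between total throughput and single-stream speed can be made concrete with a little arithmetic. This is an idealized sketch, not a benchmark: scaling_efficiency is a hypothetical factor for how well batching scales on your hardware, and the 10 tokens/sec input is simply the figure quoted in this thread.

```python
def aggregate_throughput(
    per_stream_tps: float,
    batch_size: int,
    scaling_efficiency: float = 1.0,
) -> float:
    """Total tokens/sec across all requests in a batch (idealized model)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return per_stream_tps * batch_size * scaling_efficiency


def seconds_per_reply(per_stream_tps: float, reply_tokens: int) -> float:
    """Time one user waits for a reply; batching does not shorten this."""
    return reply_tokens / per_stream_tps


single_stream = 10.0  # tokens/sec, the figure reported in this thread
total = aggregate_throughput(single_stream, batch_size=8, scaling_efficiency=0.8)
wait = seconds_per_reply(single_stream, reply_tokens=256)
print(f"total: {total:.1f} tokens/sec, per-reply wait: {wait:.1f} s")
```

In this model, batching raises the total number of tokens produced per second, while each individual reply still arrives at the single-stream rate.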

uneven rover Oct 26, 2025, 1:21 PM

#

@north pumice

📎 message.txt

north pumiceBOT Oct 26, 2025, 1:21 PM

#

Based on available benchmarks and user reports, Unsloth’s FastVisionModel for Qwen3-VL 8B typically achieves 10–12 tokens/sec on A100 40GB, while the official Hugging Face implementation can reach ~20 tokens/sec in similar conditions. This speed gap is expected: Unsloth’s optimizations focus on memory efficiency and training speed, not maximum single-stream inference throughput for vision models. There are currently no Unsloth-specific flags or settings documented that would double inference speed for Qwen3-VL 8B to match Hugging Face’s throughput.

If maximizing inference speed is your priority, the official Hugging Face pipeline is currently faster for Qwen3-VL 8B vision-language tasks. Unsloth is best used when you need low VRAM usage, longer context, or efficient fine-tuning. If you want more detail on the technical reasons for this difference or future roadmap, let me know.

Sources:

https://github.com/unslothai/unsloth/discussions/2830

slate obsidian Nov 18, 2025, 5:10 PM

#

@north pumice The Qwen3-VL-4B and 8B variants hallucinate on images when I run inference with the llama.cpp CLI. How do I fix this?

north pumiceBOT Nov 18, 2025, 5:10 PM

#

For Qwen3-VL-4B and 8B, hallucinations during image inference in llama.cpp (llama CLI) are often due to quantization issues or improper prompt formatting. Unsloth recommends using their dynamic quantized GGUFs and ensuring you use the correct chat template and prompt structure. Download the latest GGUFs from Unsloth’s Hugging Face, and always include the <|im_start|>user, <|im_end|>, and <|im_start|>assistant tags in your prompt. For vision models, use Unsloth’s dynamic quantization, as quantizing all layers to 4-bit can break accuracy—dynamic quantization selectively avoids this problem and preserves vision performance.

If you still see hallucinations, try using a higher-precision quantization (e.g., Q8_0 or Q5_K_M) and verify you are running the latest version of llama.cpp. Also, ensure your prompt includes the image token in the correct format, as shown in Unsloth’s documentation and notebooks. For best results, use the exact chat template and inference parameters recommended by Unsloth for Qwen3-VL models (see Unsloth’s dynamic quantization details and vision notebook).

Would you like a step-by-step example of the correct prompt and llama.cpp command?
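
For reference, the tag structure mentioned above can be sketched as a small helper. This is a minimal illustration of the ChatML-style layout for the text part only; build_chatml_prompt is a hypothetical helper name, and image placeholders are left to the inference tool.

```python
from typing import Optional

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(user_text: str, system_text: Optional[str] = None) -> str:
    """Wrap a user message (and optional system message) in ChatML tags.

    The result ends with an open assistant turn so the model continues from there.
    """
    parts = []
    if system_text:
        parts.append(f"{IM_START}system\n{system_text}{IM_END}\n")
    parts.append(f"{IM_START}user\n{user_text}{IM_END}\n")
    parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt("What is in this image?")
print(prompt)
```

Tools that read the chat template from the GGUF metadata build this structure for you, so hand-writing it only matters when you pass a raw prompt.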


slate obsidian Nov 18, 2025, 5:18 PM

#

@north pumice Please show real CLI examples of ur recommendations

north pumiceBOT Nov 18, 2025, 5:18 PM

#

Certainly! Here’s a real CLI example for Qwen3-VL-8B (applies to 4B as well) using llama.cpp and Unsloth’s dynamic quantized GGUF:

Download the latest dynamic GGUF:

# Example for Qwen3-VL-8B-Thinking
python -m pip install huggingface_hub hf_transfer
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='unsloth/Qwen3-VL-8B-Thinking-GGUF', local_dir='Qwen3-VL-8B-Thinking-GGUF', allow_patterns=['*UD-Q4_K_XL*'])"

Prepare your prompt (with image):

<|im_start|>user
<image>path/to/your/image.png</image>
What is in this image?
<|im_end|>
<|im_start|>assistant

Run llama.cpp with the correct chat template and context size:

./llama.cpp/llama-cli \
  --model Qwen3-VL-8B-Thinking-GGUF/UD-Q4_K_XL/Qwen3-VL-8B-Thinking-UD-Q4_K_XL-00001-of-00002.gguf \
  --threads -1 \
  --ctx-size 8192 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --repeat-penalty 1.05 \
  --image path/to/your/image.png \
  --file prompt.txt

(Where prompt.txt contains the prompt above.)

Tips:

Use the UD-Q4_K_XL or Q8_0 quant for best vision accuracy.
Always use the <|im_start|>user/assistant and <|im_end|> tags as shown.
For more, see Unsloth’s Qwen3-VL vision notebook.

Would you like more details or troubleshooting steps?

Sources:

Switching Between Thinking and Non-Thinking Mode

slate obsidian Nov 18, 2025, 5:25 PM

#

@north pumice use llama-mtmd-cli for ur examples.. I am running unsloth's Qwen3-vl-4b 8-bit quant model.

north pumiceBOT Nov 18, 2025, 5:25 PM

#

To run Unsloth’s Qwen3-VL-4B 8-bit quant model with images using llama-mtmd-cli, use the following steps and CLI example:

Download the 8-bit quantized GGUF and mmproj files (the repo and file names below assume the Instruct variant; check the repo’s file list for the exact names):

python -m pip install huggingface_hub hf_transfer
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='unsloth/Qwen3-VL-4B-Instruct-GGUF', local_dir='Qwen3-VL-4B-Instruct-GGUF', allow_patterns=['*Q8_0*', '*mmproj-F16*'])"

Pass your question with -p. llama-mtmd-cli applies the chat template stored in the GGUF (the <|im_start|>user ... <|im_end|> ... <|im_start|>assistant structure) and inserts the image for you, so you do not need to write those tags or an <image> tag by hand.

Run inference with llama-mtmd-cli:

./llama.cpp/llama-mtmd-cli \
  --model Qwen3-VL-4B-Instruct-GGUF/Qwen3-VL-4B-Instruct-Q8_0.gguf \
  --mmproj Qwen3-VL-4B-Instruct-GGUF/mmproj-F16.gguf \
  --image path/to/your/image.png \
  -p "Describe the image." \
  --ctx-size 8192 \
  --threads -1 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --repeat-penalty 1.05

Notes:

The chat template is applied automatically; do not paste raw <|im_start|>/<|im_end|> tags into -p.
Use the mmproj file from the same repo as the model; without it the model cannot read images.
Adjust --ctx-size and other parameters as needed for your hardware.

Would you like a step-by-step explanation or troubleshooting tips?
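
If you want to drive the CLI from Python, one option is to build the argument list in code and hand it to subprocess. The sketch below only assembles and prints the command. build_mtmd_command is a hypothetical helper, the file paths are placeholders, and the prompt is passed inline with -p rather than through a file.

```python
import shlex
from typing import List


def build_mtmd_command(
    model_path: str,
    mmproj_path: str,
    image_path: str,
    prompt: str,
    binary: str = "./llama.cpp/llama-mtmd-cli",
    ctx_size: int = 8192,
    temperature: float = 0.7,
    top_p: float = 0.8,
    top_k: int = 20,
) -> List[str]:
    """Return the argv list for a single image + text llama-mtmd-cli run."""
    return [
        binary,
        "--model", model_path,
        "--mmproj", mmproj_path,
        "--image", image_path,
        "-p", prompt,
        "--ctx-size", str(ctx_size),
        "--temp", str(temperature),
        "--top-p", str(top_p),
        "--top-k", str(top_k),
    ]


cmd = build_mtmd_command(
    model_path="model.gguf",      # placeholder path
    mmproj_path="mmproj.gguf",    # placeholder path
    image_path="image.png",       # placeholder path
    prompt="Describe the image.",
)
print(shlex.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True). Using a list instead of a shell string avoids quoting problems with prompts that contain spaces or quotes.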


slate obsidian Nov 18, 2025, 5:32 PM

#

@north pumice Ok show CLI to run the unsloth's Qwen-vl-4b-instruct 8-bit quant and do python code to inference with OpenAI-compatible endpoints (preferrably with ChatOpenAI). Give an image for the inference

north pumiceBOT Nov 18, 2025, 5:32 PM

#

It seems likely that to run Unsloth’s Qwen3-VL-4B-Instruct 8-bit quant model with images using llama.cpp’s llama-mtmd-cli, you would first download the image (--image expects a local file path, not a URL) and then run a command like:

wget -O demo.jpeg https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen-VL/assets/demo.jpeg

./llama.cpp/llama-mtmd-cli \
  --model Qwen3-VL-4B-Instruct-GGUF/Qwen3-VL-4B-Instruct-Q8_0.gguf \
  --mmproj Qwen3-VL-4B-Instruct-GGUF/mmproj-F16.gguf \
  --image demo.jpeg \
  -p "Describe the image." \
  --ctx-size 8192 \
  --threads -1 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --repeat-penalty 1.05

The prompt is passed with -p; llama-mtmd-cli applies the model’s chat template (<|im_start|>user ... <|im_end|>, then <|im_start|>assistant) and inserts the image automatically, so no hand-written prompt.txt is needed.

For OpenAI-compatible endpoints (e.g., using ChatOpenAI from LangChain), you need to run a local OpenAI-compatible server, such as llama.cpp’s llama-server started with the model and its mmproj file. Here’s a Python example using LangChain’s ChatOpenAI and the same image:

import base64

import requests
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-xxx",  # dummy key for local server
    model="Qwen3-VL-4B-Instruct-Q8_0.gguf",
)

# The image must be base64-encoded and sent as part of the message content.
img_url = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
img_response = requests.get(img_url, timeout=30)
img_response.raise_for_status()  # stop early if the download failed
img_data = base64.b64encode(img_response.content).decode("utf-8")
content = [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_data}"}},
    {"type": "text", "text": "Describe the image."}
]

response = llm.invoke([HumanMessage(content=content)])
print(response.content)

You must start the OpenAI-compatible server before running the Python code, for example:

./llama.cpp/llama-server \
  --model Qwen3-VL-4B-Instruct-GGUF/Qwen3-VL-4B-Instruct-Q8_0.gguf \
  --mmproj Qwen3-VL-4B-Instruct-GGUF/mmproj-F16.gguf \
  --port 8000

Would you like more detail on setting up the server or prompt formatting?
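
If you would rather not depend on LangChain, the same request can be sketched with the Python standard library alone. This assumes the server exposes the usual /v1/chat/completions route and accepts base64 data URLs for images. build_vision_payload and post_chat_completion are hypothetical helper names, and the three image bytes below are a placeholder, not a real picture.

```python
import base64
import json
import urllib.request


def build_vision_payload(
    image_bytes: bytes,
    question: str,
    model: str,
    mime_type: str = "image/jpeg",
) -> dict:
    """Build an OpenAI-style chat payload with one image and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 256,
    }


def post_chat_completion(base_url: str, payload: dict) -> str:
    """POST the payload to {base_url}/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Placeholder bytes only; read your real image with open(path, "rb").read().
payload = build_vision_payload(b"\xff\xd8\xff", "Describe the image.", "local-model")
print(json.dumps(payload, indent=2))
```

With the server running, post_chat_completion("http://localhost:8000/v1", payload) would send the request. It is not called here so the snippet runs without a server.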


slate obsidian Nov 18, 2025, 6:00 PM

#

@north pumice Also how do I run Unsloth's dynamic quantization models.

north pumiceBOT Nov 18, 2025, 6:00 PM

#

To run Unsloth's dynamic quantization models, download a GGUF file with UD- in its name (e.g., UD-Q4_K_XL or UD-Q8_K_XL) from Hugging Face, then run it with llama.cpp as usual (llama-mtmd-cli for vision models). The dynamic quantization is already baked into these GGUFs, so no extra steps or flags are needed. Example (file names assume the Instruct variant; check the repo's file list for the exact names):

./llama.cpp/llama-mtmd-cli \
  --model Qwen3-VL-4B-Instruct-GGUF/Qwen3-VL-4B-Instruct-UD-Q4_K_XL.gguf \
  --mmproj Qwen3-VL-4B-Instruct-GGUF/mmproj-F16.gguf \
  --image path/to/image.png \
  -p "Describe the image."

Just select the GGUF with UD- in its name for dynamic quantization. For more details and quantization options, see Unsloth Dynamic GGUFs and official docs.

Would you like a step-by-step download and run example?
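
Picking the dynamic quants out of a repo's file list can be sketched as a simple name filter. The file names in the example are made up for illustration, and select_dynamic_quants is a hypothetical helper. It relies only on the UD- naming convention described above.

```python
from typing import Iterable, List


def select_dynamic_quants(filenames: Iterable[str]) -> List[str]:
    """Return the .gguf files whose names carry the UD- marker, sorted."""
    return sorted(
        name
        for name in filenames
        if name.endswith(".gguf") and "UD-" in name
    )


# Hypothetical listing; replace with the real file names from the repo.
example_files = [
    "Model-Q8_0.gguf",
    "Model-UD-Q4_K_XL.gguf",
    "Model-UD-Q8_K_XL.gguf",
    "mmproj-F16.gguf",
    "README.md",
]
dynamic_files = select_dynamic_quants(example_files)
print(dynamic_files)
```

The mmproj file is not a UD- file, so a vision setup still needs it downloaded separately alongside whichever quant you pick.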


slate obsidian Nov 18, 2025, 6:04 PM

#

@north pumice yes please the example with a Qwen-vl-instruct in llama.cpp.
