To deploy your fine-tuned unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit on an RTX 4090 for fast inference, use Unsloth's FastVisionModel with 4-bit loading. This enables low VRAM usage and 2x faster inference. Example:
from unsloth import FastVisionModel
model, tokenizer = FastVisionModel.from_pretrained(
"unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit",
load_in_4bit=True,
)
FastVisionModel.for_inference(model)
This setup is optimized for your GPU and requires no extra dependencies. For even faster inference, consider using vLLM with Unsloth by setting fast_inference=True if you need high throughput batch serving. See Unsloth Inference Guide and Vision Fine-tuning Notebooks for more.
Would you like step-by-step deployment or API serving details?
Sources:
💡 **Hint:** Mention @RunLLM in the thread for followups.