For your dataset, ensure you use the vision fine-tuning workflow with Unsloth, format your data as vision Q&A pairs (image + prompt + answer), and use FastVisionModel.get_peft_model with all finetune flags enabled (vision, language, attention, MLP). Set r and lora_alpha to 16 for balanced accuracy and overfitting risk. Use UnslothVisionDataCollator and set remove_unused_columns=False, dataset_text_field="", and max_seq_length to fit your longest prompt/answer. Image sizes should be consistent (300–1000px recommended) for efficient training. If your current model is not improving, check your data quality, ensure labels are correct, and consider using train_on_responses_only to focus loss on the answer part only.
Switching to Qwen3-VL-4B may help, as it offers improved vision and OCR capabilities, but expect higher VRAM requirements. Qwen3-VL is state-of-the-art for vision-language tasks and may yield better results, especially if your current model struggles with text extraction or reasoning on images. For best results, follow the official Unsloth vision fine-tuning guide and Qwen3-VL fine-tuning instructions.
Would you like a step-by-step code example or more detail on troubleshooting poor training outcomes?
Sources: