#[solved] Qwen3 VL Index out of bounds

pseudo tundra
#
  File "/home/user/Documents/amy/.venv/lib/python3.11/site-packages/unsloth_zoo/vision_utils.py", line 826, in __call__
    batch = self.processor(**proc_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Documents/amy/.venv/lib/python3.11/site-packages/transformers/models/qwen3_vl/processing_qwen3_vl.py", line 191, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 2 is out of bounds for dimension 0 with size 2

getting this when trying to train Qwen3 VL

my data is like this (i printed dataset[0]):

{'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'Tell me what this is', 'image': None}, {'type': 'image', 'text': None, 'image': <PIL.Image.Image image mode=RGB size=512x512 at 0x73B9AE433710>}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': "<think>\nOk, the user is asking me to say what the image they sent is. Looking at the image, I see the MULTIMODAL_TRAIN_TEST symbol (a lilac hexagon with a checkerboard circle in the center) on a black (#000000) background. So, I'll answer that.\n</think>\nHii! That's the MULTIMODAL_TRAIN_TEST symbol on a pitch black background.", 'image': None}]}]}

some examples also don't have any images

kindred forgeBOT
#

The error IndexError: index 2 is out of bounds for dimension 0 with size 2 means the processor found more image placeholders in the text than images in the batch: image_grid_thw has one row per image actually passed in (two here), and the loop at line 191 of processing_qwen3_vl.py reached a third placeholder. In other words, the number of image tokens in the text does not match the number of images provided.

Qwen3-VL (and similar Unsloth vision models) require one image in the data for every image token in the prompt; a sample with no images must have no image tokens, and vice versa. Mixed batches of text-only and image+text samples are not currently supported and will cause this error during batching and processing. Make sure each sample in a batch has a consistent number of images and image tokens, or filter/preprocess your dataset to remove such mismatches before training. See the discussion and error analysis in Unsloth Issue #3463 and Unsloth Issue #2297.

Would you like more detail on how to preprocess your dataset or handle mixed modality batches?
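
One way to look for such mismatches before training, as a minimal sketch. It assumes the message layout printed above (content is a list of parts with 'type' and 'image' keys) and that each part of type 'image' ends up as one image placeholder in the prompt. The helper names image_part_stats and find_mismatched are made up for illustration and are not part of Unsloth or transformers.

```python
def image_part_stats(messages):
    """Return (declared, loaded) image counts for one conversation.

    declared: content parts whose type is "image"
    loaded:   those parts whose "image" value is not None
    """
    declared = loaded = 0
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no image parts
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image":
                declared += 1
                if part.get("image") is not None:
                    loaded += 1
    return declared, loaded


def find_mismatched(samples):
    """Indices of samples where some image part has no image behind it."""
    bad = []
    for index, sample in enumerate(samples):
        declared, loaded = image_part_stats(sample["messages"])
        if declared != loaded:
            bad.append(index)
    return bad
```

Running find_mismatched over the transformed dataset (the objects the collator will actually see, not the raw JSONL rows) should point at the samples worth inspecting first.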

pseudo tundra
#

i'm loading the dataset like this

from datasets import load_dataset
dataset = load_dataset(
    "json",
    data_files="/home/user/Documents/multimodal data/dataset.jsonl",
    split="train",
)

from PIL import Image
import os

def load_image_from_path(sample):
    if 'messages' not in sample or not sample['messages']:
        return sample

    first_turn = sample['messages'][0]
    user_message = None

    if isinstance(first_turn, list):
        if first_turn:
            user_message = first_turn[0]
    elif isinstance(first_turn, dict):
        user_message = first_turn

    if not isinstance(user_message, dict):
        return sample

    content = user_message.get('content')

    if not isinstance(content, list):
        return sample

    for content_part in content:
        if isinstance(content_part, dict) and content_part.get('type') == 'image':
            image_path = content_part.get('image')
            
            if image_path and isinstance(image_path, str):
                try:
                    content_part['image'] = Image.open(image_path).convert("RGB")
                except FileNotFoundError:
                    print(f"Warning: Image file not found at {image_path}. Setting to None.")
                    content_part['image'] = None
                break
            
    return sample

converted_dataset = dataset.with_transform(load_image_from_path)
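
Something worth checking in the loader above: Dataset.with_transform passes the function a batch (a dict of lists), so sample['messages'][0] is the first conversation in the batch, and the function appears to load at most one image, from the first message of that conversation only. Image parts in later messages or later samples would keep their path strings. A hedged sketch of a variant that walks every sample, message, and image part follows; the name load_images_in_batch and the open_image hook are hypothetical, added so the logic can be exercised without image files.

```python
def load_images_in_batch(batch, open_image=None):
    """Replace image paths with loaded images in every message of every sample.

    `batch` is the dict of lists that with_transform hands to the transform:
    batch["messages"] holds one conversation (list of messages) per sample.
    `open_image` is a hypothetical hook for testing; it defaults to PIL.
    """
    if open_image is None:
        from PIL import Image  # imported lazily so the hook can replace it

        def open_image(path):
            return Image.open(path).convert("RGB")

    for conversation in batch.get("messages") or []:
        for message in conversation or []:
            content = message.get("content") if isinstance(message, dict) else None
            if not isinstance(content, list):
                continue
            for part in content:
                if not isinstance(part, dict) or part.get("type") != "image":
                    continue
                path = part.get("image")
                if isinstance(path, str) and path:
                    # A missing file is allowed to raise here: a silent None
                    # would leave an image part with no image behind it.
                    part["image"] = open_image(path)
    return batch
```

It would be applied the same way, e.g. dataset.with_transform(load_images_in_batch). Letting a missing file raise instead of setting the image to None is a deliberate choice in this sketch, since a None image on a part of type 'image' is one way to end up with more placeholders than images.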

pseudo tundra
#

i need to be able to have a variable number of images per training example

kindred forgeBOT
#

Unsloth's Qwen3-VL currently requires that, in each training batch, the number of image tokens in the prompt matches the number of images provided. Mixed batches with variable numbers of images (including some with none) are not supported out of the box: all samples in a batch must have the same number of images, and the dataset should be preprocessed to guarantee this. Otherwise you will hit index or token mismatch errors during training. See Unsloth Issue #2297 and Unsloth Issue #3463.

Would you like more detail on how to preprocess your dataset or implement a custom collator to handle variable image counts?
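
If batches really do need a uniform image count, one preprocessing option is to bucket samples by how many images they carry and only build batches inside a bucket. A minimal sketch under the same message layout as the dataset[0] printout; count_images and uniform_batches are hypothetical helper names, and how the resulting index lists get wired into a trainer is left open.

```python
from collections import defaultdict


def count_images(messages):
    """Number of image parts in a conversation that actually carry an image."""
    total = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            total += sum(
                1 for part in content
                if isinstance(part, dict)
                and part.get("type") == "image"
                and part.get("image") is not None
            )
    return total


def uniform_batches(samples, batch_size):
    """Return lists of sample indices; every list shares one image count."""
    buckets = defaultdict(list)
    for index, sample in enumerate(samples):
        buckets[count_images(sample["messages"])].append(index)

    batches = []
    for image_count in sorted(buckets):
        indices = buckets[image_count]
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    return batches
```

A batch size of 1 makes every batch trivially uniform without any bucketing, at the cost of throughput.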

kindred forgeBOT
#

To support variable numbers of images per example (including some with none), you need a custom data collator that ensures the number of image tokens in the text matches the number of images in each batch. The UnslothVisionDataCollator does not natively support mixed batches, but you can subclass it and override the __call__ method to filter or align image tokens and images, as shown in Unsloth Issue #2297. This approach is required for Gemma3 and Qwen3-VL vision models.

Would you like a step-by-step example of such a custom collator?
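
A rough sketch of the alignment idea, written as a wrapper around an existing collator rather than a subclass so it does not depend on the collator's internals. It is not the code from the linked issue. It assumes each example has a 'messages' key in the layout shown earlier in the thread, and it only covers one case: image parts whose image is None are dropped so they cannot produce a placeholder with no image behind it. ImageAligningCollator and drop_empty_image_parts are hypothetical names.

```python
def drop_empty_image_parts(messages):
    """Return a copy of the conversation without image parts that lack an image."""
    cleaned = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            content = [
                part for part in content
                if not (
                    isinstance(part, dict)
                    and part.get("type") == "image"
                    and part.get("image") is None
                )
            ]
        cleaned.append({**message, "content": content})
    return cleaned


class ImageAligningCollator:
    """Hypothetical wrapper: clean each example, then delegate to the base collator."""

    def __init__(self, base_collator):
        # base_collator is whatever callable already turns a list of
        # examples into a batch (assumed, not a specific library API).
        self.base_collator = base_collator

    def __call__(self, examples):
        cleaned = [
            {**example, "messages": drop_empty_image_parts(example["messages"])}
            for example in examples
        ]
        return self.base_collator(cleaned)
```

The wrapper would take the place of the collator passed to the trainer, with the original collator handed in as base_collator. Whether that alone resolves the IndexError depends on where the mismatch comes from, so it is a starting point for debugging and not a confirmed fix.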

pseudo tundra
#

same issue when i only keep examples that have images. in some of them the first user message doesn't have one, but there's always an image somewhere in the conversation

pseudo tundra
#

[solved] Qwen3 VL Index out of bounds