#Qwen3VL (or any VLM) OOM from large dataset

rustic badge
#

Hey guys, I've posted about this before but never heard back and I'm hoping someone has faced this issue and dealt with it better than I am right now.

I have a training dataset of 20k forms that I'm using for OCR. Each form is already downsized to 1200x800. During training, it's not actually my VRAM that's overflowing, but my system RAM. My system RAM gets filled up more and more as training runs.

Is there a way to have images that aren't in the current batch unload themselves, or is it the accumulation of training results before back-prop that's causing system RAM to explode?

ocean talon
#

what are your Trainer config settings

rustic badge
#

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    # The data collator only exposes the answer to the loss function, so the image & instructions don't affect the loss
    data_collator = UnslothVisionDataCollator(model, tokenizer), # Must use!
    train_dataset = training_dataset,
    eval_dataset = validation_dataset,
    args = SFTConfig(
        # resume_from_checkpoint = True,
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,

        warmup_ratio = 0.1,
        # max_steps = 30,
        num_train_epochs = 2, # Set this instead of max_steps for full training runs
        learning_rate = lr,
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        fp16_full_eval = True,
        logging_steps = 0.1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = output_dir,
        report_to = "none",     # For Weights and Biases

        ######
        per_device_eval_batch_size = 1,
        eval_accumulation_steps = 4,
        metric_for_best_model = "eval_loss",
        load_best_model_at_end = True,
        save_strategy = "steps",
        eval_strategy = "steps",
        eval_steps = 0.1,
        save_steps = 0.1,
        save_total_limit = 2,
        ######

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 512,
    ),
)
#

Basically the same as in the tutorial notebook, except with a validation dataset and set to run for 2 epochs instead of 30 steps

ocean talon
#

can you try to eval less frequently and see if that lowers the usage of RAM

rustic badge
#

So like set eval_steps to 0.2 or 0.5?

placid dew
#

@topaz gate would your streaming datasets implementation help here? sounds like a problem of storing the full dataset in RAM (although system RAM filling up during training is a little odd)

topaz gate
#

Streaming doesn’t work with Qwen3-VL. They don’t batch pixel values the same way as other models, and TRL switches the dataloader type for streaming datasets. It expects index 0 to be the batch size.

#

I would first turn off saving and eval as a sanity check. If memory behavior is the same, we know it’s not that. There are dataloader kwargs you can pass, like turning off prefetch and turning down the number of worker processes. But I don’t see why the dataloader would eat up so much RAM. You could also tell the collator to resize the images even smaller.
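
A sanity-check run along those lines might use overrides like the ones below. This is only a sketch: the keys are standard `transformers.TrainingArguments` fields (which `SFTConfig` inherits), but the chosen values and the name `sanity_check_overrides` are assumptions for illustration, not something from the notebook.

```python
# Hypothetical overrides for a one-off diagnostic run. Merge them into the
# keyword arguments already passed to SFTConfig(...), replacing the
# existing eval/save settings.
sanity_check_overrides = {
    "eval_strategy": "no",               # no evaluation passes
    "save_strategy": "no",               # no checkpoint writes
    "load_best_model_at_end": False,     # needs eval + save, so it has to go too
    "dataloader_num_workers": 0,         # load batches in the main process
    "dataloader_prefetch_factor": None,  # prefetch only applies with workers > 0
    "dataloader_pin_memory": False,      # skip pinned host buffers
}


def with_overrides(base_kwargs: dict, overrides: dict) -> dict:
    """Return a new kwargs dict where values in overrides win over base_kwargs."""
    return {**base_kwargs, **overrides}
```

If system RAM still climbs with these settings, eval, checkpointing, and dataloader workers are unlikely to be the cause.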

#

The entire dataset doesn’t usually sit in ram btw. Most of it is on disk unless you create one in memory

placid dew
#

Oops, right. I'm tired lol

topaz gate
#

Maybe if on windows there’s something strange going on with multiprocessing

rustic badge
#

It's Ubuntu, and I'll try just straight up turning off eval for the next run
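
To compare runs with and without eval, it can help to log resident memory at intervals rather than watching a system monitor. A minimal Linux-only sketch (it reads `/proc/self/statm`); the name `rss_mib` is made up for this example, and it only measures the calling process, not any dataloader worker processes.

```python
import os


def rss_mib() -> float:
    """Current resident set size of this process in MiB (Linux only)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20


if __name__ == "__main__":
    print(f"RSS: {rss_mib():.1f} MiB")
```

Calling this every few hundred steps, before and after building the training list, would show whether memory grows during training or is already high before the first step.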

#

The one thing that comes to mind is that I'm not truly using the dataloader properly, or at least, in the notebook it isn't. The original dataset is in the form of a Dataset

#

But you don't actually use the dataset for training

#

There's this function in the notebook:

def convert_to_conversation(sample):
    conversation = [
        { "role": "user",
          "content" : [
            {"type" : "text",  "text"  : instruction},
            {"type" : "image", "image" : sample["image"]} ]
        },
        { "role" : "assistant",
          "content" : [
            {"type" : "text",  "text"  : sample["text"]} ]
        },
    ]
    return { "messages" : conversation }

#

And then you create converted_dataset = [convert_to_conversation(sample) for sample in dataset]

#

Converted dataset is a list of all of the individual samples in dataset that get converted into the conversation format above

#

I think that ends up loading all of the images into memory after trainer calls it
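
A back-of-the-envelope estimate of what that would cost, assuming every image ends up fully decoded as 8-bit RGB (3 bytes per pixel; that is an assumption, grayscale scans would need a third of this) and all 20k are held in the list at once:

```python
width, height, channels = 1200, 800, 3  # channels = 3 assumes 8-bit RGB
num_images = 20_000

bytes_per_image = width * height * channels
total_bytes = bytes_per_image * num_images

print(f"{bytes_per_image / 2**20:.2f} MiB per decoded image")
print(f"{total_bytes / 2**30:.1f} GiB for the whole list")
```

That works out to about 2.75 MiB per image and roughly 53.6 GiB in total, which would be enough to fill system RAM on most machines.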

#

And doesn't actively use dataset/dataloaders on-disk behaviour

topaz gate
#

yea the notebook is just an example. if you need to scale it up you can change it, or you can try dataset.map. HF datasets have more constraints when it comes to processing, so they're a bit trickier to work with for VLMs
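
One way to scale it up without materialising the whole list is a thin map-style wrapper that converts each sample only when it is indexed. This is a sketch, not the notebook's method: `LazyConversationDataset` is a hypothetical name, and whether `SFTTrainer` and `UnslothVisionDataCollator` accept it without further changes is an assumption that has not been tested here.

```python
class LazyConversationDataset:
    """Map-style wrapper that builds the conversation dict only when indexed."""

    def __init__(self, dataset, convert):
        self.dataset = dataset  # anything supporting len() and integer indexing
        self.convert = convert  # e.g. convert_to_conversation from the notebook

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # The sample (and its image) is fetched and converted on demand,
        # so this wrapper itself never holds converted samples.
        return self.convert(self.dataset[index])
```

Usage would be something like `training_dataset = LazyConversationDataset(dataset, convert_to_conversation)` in place of the list comprehension.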

#

the notebooks use a list to make it clear

rustic badge
#

I actually see that the Qwen3-VL GRPO notebook uses dataset.map