Training stops after first step | Unsloth AI | Page 1

deft prawn Mar 10, 2025, 12:06 PM

I see

I am using a Tesla V100 to train Mistral NeMo 12B.
Training stops after the first step.
I can't understand why: my dataset should be OK. I tested a few rows, e.g. `print(dataset[0])` to print the first sample, and they looked fine.
If I switch from epochs to steps only, training continues past step 1 without issues. Weird...

Here are my params. Any ideas?

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        dataset_num_proc = 2,
        packing = False, # Can make training 5x faster for short sequences.
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            warmup_steps = 5,
            max_steps = 0, # Make just the steps not all the rows.
            num_train_epochs = 1, # For longer training runs!
            learning_rate = 2e-5, # consider lowering it to 1e-5 or even 5e-6
            fp16 = not is_bfloat16_supported(),
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            optim = "adamw_8bit",
            weight_decay = 0.01,
            lr_scheduler_type = "linear",
            seed = 3407,
            output_dir = "outputs",
            report_to = "none", # Use this for WandB etc
        ),
    )

deft prawn Mar 10, 2025, 1:07 PM

Actually, this behavior happened on Vast.ai, not in Colab.
Same data.

small lintel Mar 10, 2025, 5:48 PM

I think you don't need to specify `max_steps` if you want to use epochs.

deft prawn Mar 10, 2025, 7:25 PM

small lintel I think you don't need to specify the `max_steps` if you want to use epochs

Hi.
I commented out `max_steps` and used only epochs.
Training seems to go well now!

What I don't understand:
I used it like this in the past on Vast.ai and it worked.
Also, in Colab, leaving `max_steps = 0` and setting `num_train_epochs` caused no issues.
Really weird.

rugged socket Mar 10, 2025, 8:03 PM

You have to remove `max_steps`.
If you set it to 0, it won't do any real training; it just stops right away.

If you want to use all your data, use the epochs argument (`num_train_epochs`) instead.
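
A minimal sketch of that advice, assuming the documented Hugging Face `TrainingArguments` behavior where `max_steps` defaults to -1 and a positive value overrides `num_train_epochs`. The helper `schedule_kwargs` is hypothetical (it is not part of transformers, TRL, or Unsloth); it only builds the keyword arguments so that `max_steps` is left out entirely for epoch-based runs and a value of 0 is rejected up front.

```python
from typing import Optional


def schedule_kwargs(num_train_epochs: Optional[float] = None,
                    max_steps: Optional[int] = None) -> dict:
    """Hypothetical helper: return an epoch-based OR a step-based schedule, never both."""
    if max_steps is not None:
        # 0 (or a negative number) is not a useful step budget; fail loudly instead.
        if max_steps <= 0:
            raise ValueError("max_steps must be positive; omit it to train by epochs")
        # Design choice of this sketch: refuse ambiguous configs rather than
        # relying on one argument silently overriding the other.
        if num_train_epochs is not None:
            raise ValueError("set either num_train_epochs or max_steps, not both")
        return {"max_steps": max_steps}
    if num_train_epochs is None:
        raise ValueError("set either num_train_epochs or max_steps")
    # Epoch-based run: max_steps is not passed at all, so its default applies.
    return {"num_train_epochs": num_train_epochs}


epoch_run = schedule_kwargs(num_train_epochs=1)  # full pass over the dataset
step_run = schedule_kwargs(max_steps=60)         # short smoke test
print(epoch_run)
print(step_run)
```

The result can be unpacked into the arguments from the original post, for example `TrainingArguments(output_dir = "outputs", **schedule_kwargs(num_train_epochs = 1))`.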

deft prawn Mar 10, 2025, 8:13 PM

rugged socket you have to remove max_steps if you set it to 0 it will not really do something ...

Yep, that's what I did. It's just that it works in Colab, which drives me nuts.
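
The thread does not establish why Colab behaved differently. One thing worth ruling out is a difference in installed library versions between the two environments; this is an assumption, not a confirmed cause. A minimal sketch using only the standard library, with the package list chosen here for illustration, that can be run on both machines and compared:

```python
from importlib import metadata


def installed_versions(packages=("unsloth", "trl", "transformers", "torch")):
    """Return {package name: installed version, or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Run this in both Colab and the Vast.ai instance, then diff the output.
for name, version in installed_versions().items():
    print(f"{name}: {version}")
```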
