#Training stops after first step


deft prawn
#

I see

I am using a Tesla V100 to train Mistral Nemo 12B.
Training stops after the first step.
I can't understand why: my dataset should be OK. I tested a few rows, e.g. print(dataset[0]) to print the first sample, and the output was fine.
If I switch from epochs to steps only, training continues past step 1 without issues. Weird...

Here are my params. Any ideas?

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 0, # Make just the steps not all the rows.
        num_train_epochs = 1, # For longer training runs!
        learning_rate = 2e-5, # consider lowering it to 1e-5 or even 5e-6
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

deft prawn
#

Actually, this happens on Vast.ai but not in Colab, with the same data.

small lintel
#

I think you don't need to specify max_steps if you want to train by epochs.

rugged socket
#

You have to remove max_steps.
If you set it to 0, training doesn't really do anything and stops right away.

If you want to use all your data, use the num_train_epochs argument instead.
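
To make that concrete, here is a minimal sketch in plain Python (no trainer involved). The helper names optimizer_steps_per_epoch and length_kwargs are hypothetical, the 1000-row dataset size is made up, and the step count is only an estimate: it assumes a single GPU, and the trainer's own rounding may differ slightly depending on the library version. The idea is to pass either num_train_epochs or a positive max_steps, never max_steps = 0.

```python
import math


def optimizer_steps_per_epoch(num_rows, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Rough count of optimizer updates in one pass over the dataset (estimate only)."""
    # Number of batches the dataloader yields, counting a final partial batch.
    batches = math.ceil(num_rows / (per_device_batch_size * num_devices))
    # One optimizer update happens every grad_accum_steps batches; at least one update.
    return max(1, batches // grad_accum_steps)


def length_kwargs(train_by_epochs, num_train_epochs=1, max_steps=60):
    """Hypothetical helper: return exactly one of the two training-length settings."""
    if train_by_epochs:
        # max_steps is left out entirely so it cannot interfere with epochs.
        return {"num_train_epochs": num_train_epochs}
    if max_steps <= 0:
        raise ValueError("max_steps must be positive when training by steps")
    return {"max_steps": max_steps}


# Hypothetical 1000-row dataset with the batch settings from the post above.
print(optimizer_steps_per_epoch(1000, per_device_batch_size=2, grad_accum_steps=4))
print(length_kwargs(train_by_epochs=True))
```

The returned dict is meant to be merged into the other TrainingArguments keyword arguments, so the config only ever carries one of the two length settings.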
