I see
I am using a Tesla V100 to train Mistral Nemo 12B.
Training stops after the first steps.
I can't understand why: my dataset should be OK. I tested a few rows with print(dataset[0]) (printing the first sample) and they looked fine.
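Printing one row only covers that row, so a broader check might look like the sketch below. It assumes each row behaves like a dict with a "text" field (the same field passed as dataset_text_field); find_bad_rows and sample_rows are hypothetical names used only for illustration, and sample_rows stands in for the real dataset.

```python
def find_bad_rows(rows, field="text"):
    """Return indices of rows whose `field` is missing, not a string, or blank."""
    bad = []
    for i, row in enumerate(rows):
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            bad.append(i)
    return bad


# Made-up stand-in rows; in practice the real dataset would be passed instead.
sample_rows = [
    {"text": "hello"},
    {"text": ""},
    {"text": None},
    {"other": "x"},
]
print(find_bad_rows(sample_rows))
```

An empty result would suggest the text field itself is not the problem, although it does not rule out other issues such as very long samples.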
If I switch from epochs to steps only, training continues past step 1 without issues, which is weird.
Here are my params. Any ideas?
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 0, # Make just the steps not all the rows.
        num_train_epochs = 1, # For longer training runs!
        learning_rate = 2e-5, # consider lowering it to 1e-5 or even 5e-6
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)
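
For comparing the epoch-based and step-based runs, it may help to estimate how many optimizer steps one epoch contains with these settings. The sketch below is only an approximation: the trainer's exact rounding can differ between library versions, approx_steps_per_epoch is a hypothetical helper, and the row count of 1000 is a made-up placeholder, not the real dataset size.

```python
import math


def approx_steps_per_epoch(num_rows, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Rough number of optimizer updates in one pass over the data."""
    # Each optimizer update consumes this many rows across all devices.
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_rows / effective_batch)


# per_device_train_batch_size = 2 and gradient_accumulation_steps = 4 come from
# the params above; 1000 rows is a placeholder value.
print(approx_steps_per_epoch(1000, 2, 4))
```

With a single GPU, each optimizer step here covers 2 * 4 = 8 rows, so a one-epoch run is much longer than a run capped at a handful of steps.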