#How to calculate F1 Score and Accuracy when training with Unsloth?


solid onyx
#

Hi! Is there a way to calculate accuracy and F1 score after training, for example by passing a compute_metrics function as an argument? Also, what is the standard evaluation strategy (are F1 score and accuracy the standard metrics)? If anyone has a notebook that shows the entire pipeline, that would be helpful! Thanks!

next edgeBOT
#

Yes, you can calculate accuracy and F1 score during Unsloth training by passing a compute_metrics function to the Trainer or SFTTrainer. The standard approach is to define compute_metrics using scikit-learn metrics (e.g., accuracy_score, f1_score) and pass it as an argument; accuracy is commonly used, but F1 is not the default and must be added explicitly. Example notebooks for classification with compute_metrics are available in the official Unsloth resources.

For example, the compute_metrics function can look like this:

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }

See full pipeline examples in the bert_classification notebook and Kaggle-bert_classification notebook. Would you like a step-by-step breakdown or more code details?
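
To make the two metrics concrete, here is a small worked example on toy data. It is a sketch, not the notebooks' code: the `accuracy` and `weighted_f1` helpers are hand-rolled NumPy stand-ins written for illustration (they are meant to mirror what `accuracy_score` and `f1_score(..., average="weighted")` compute), and the logits and labels are made up.

```python
import numpy as np


def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    return float((labels == preds).mean())


def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's true count."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    total = 0.0
    for cls in np.unique(labels):
        tp = np.sum((preds == cls) & (labels == cls))
        fp = np.sum((preds == cls) & (labels != cls))
        fn = np.sum((preds != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = np.sum(labels == cls)
        total += f1 * support
    return float(total / len(labels))


# Toy data (hypothetical): 4 samples, 2 classes.
toy_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
toy_labels = np.array([0, 1, 1, 1])
toy_preds = toy_logits.argmax(axis=-1)  # -> [0, 1, 0, 1]

toy_metrics = {
    "accuracy": accuracy(toy_labels, toy_preds),   # 3 of 4 correct = 0.75
    "f1": weighted_f1(toy_labels, toy_preds),      # (2/3 * 1 + 0.8 * 3) / 4
}
print(toy_metrics)
```

On this toy data the third sample is misclassified, so accuracy is 0.75, while the weighted F1 comes out slightly higher (about 0.767) because the majority class has the better per-class score.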


solid onyx
#

@next edge so following your response:

Would I pass it to the trainer, i.e. as the parameter compute_metrics=compute_metrics in the SFTTrainer call? When I do that and start training, training finishes and evaluation on the eval_dataset begins, and then RAM usage shoots up past 25 GB and crashes my kernel. How should I do this?

next edgeBOT
#

Yes, you pass compute_metrics=compute_metrics as a parameter to SFTTrainer. However, high RAM usage during evaluation is a known issue, especially with large eval datasets or when returning logits for metrics—this can cause crashes if not managed.
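
A rough back-of-the-envelope calculation shows why keeping full logits for a causal language model can exhaust memory. This is only a sketch: the eval set size, sequence length, and vocabulary size below are hypothetical placeholders, not values taken from this thread, and the real footprint depends on padding, dtype, and how the trainer gathers results.

```python
def eval_buffer_bytes(num_samples, seq_len, last_dim, bytes_per_value):
    """Size of a dense (num_samples, seq_len, last_dim) buffer in bytes."""
    return num_samples * seq_len * last_dim * bytes_per_value


# All three values are assumptions chosen for illustration.
NUM_SAMPLES = 100      # hypothetical eval set size
SEQ_LEN = 2048         # hypothetical padded sequence length
VOCAB_SIZE = 150_000   # hypothetical vocabulary size

# Full float32 logits: one score per vocabulary entry per token.
full_logits = eval_buffer_bytes(NUM_SAMPLES, SEQ_LEN, VOCAB_SIZE, 4)

# Argmax'd predictions: one int64 token id per token.
token_ids = eval_buffer_bytes(NUM_SAMPLES, SEQ_LEN, 1, 8)

print(f"full logits: {full_logits / 2**30:.1f} GiB")
print(f"token ids:   {token_ids / 2**20:.2f} MiB")
```

Under these assumptions the full logits need on the order of a hundred GiB, while the predicted token ids fit in a few MiB, which is the motivation for reducing logits before they are accumulated.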

To reduce RAM usage, set per_device_eval_batch_size=1 and eval_accumulation_steps=2 (or higher) in your training arguments. Also, use a preprocess_logits_for_metrics function to avoid storing large tensors:

def preprocess_logits_for_metrics(logits, labels):
    pred_ids = logits.argmax(axis=-1)
    return pred_ids, labels

Then, pass both compute_metrics and preprocess_logits_for_metrics to SFTTrainer. See Unsloth evaluation loop FAQ and Unsloth wiki evaluation loop for more details. Would you like a full code example?
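
One caveat about the function above: the Hugging Face Trainer documentation describes preprocess_logits_for_metrics as returning the processed logits (a single tensor), and whatever it returns becomes the predictions that compute_metrics receives. Returning a (pred_ids, labels) tuple therefore makes the predictions a tuple. A minimal sketch that returns only the predicted ids follows; the tuple check is an assumption about models that return extra outputs alongside the logits, and the demo uses NumPy arrays as stand-ins for tensors.

```python
import numpy as np


def preprocess_logits_for_metrics(logits, labels):
    # Assumption: if the model returns several outputs, the logits come first.
    if isinstance(logits, (tuple, list)):
        logits = logits[0]
    # Return a single array of predicted ids, not a (pred_ids, labels) tuple.
    return logits.argmax(axis=-1)


# Wiring sketch (not executed here; other arguments omitted):
# trainer = SFTTrainer(
#     ...,
#     compute_metrics=compute_metrics,
#     preprocess_logits_for_metrics=preprocess_logits_for_metrics,
# )

# Stand-in logits with shape (batch=2, seq_len=3, vocab=5).
fake_logits = np.zeros((2, 3, 5))
fake_logits[0, 0, 4] = 1.0
fake_logits[1, 2, 2] = 1.0

pred_ids = preprocess_logits_for_metrics(fake_logits, labels=None)
pred_ids_from_tuple = preprocess_logits_for_metrics((fake_logits, "extra"), labels=None)
print(pred_ids.shape)  # one id per token, vocabulary axis removed
```

With this version, compute_metrics receives the predicted ids directly and should not call argmax again.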


solid onyx
#

@next edge from your response:
I've passed the preprocess_logits_for_metrics as a parameter in the trainer object, and now I'm getting this error once training has finished and is calculating the metrics:

Unsloth: Not an error, but Qwen2ForCausalLM does not accept num_items_in_batch.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


AttributeError Traceback (most recent call last)

/2282580354.py in compute_metrics(eval_pred)
3 def compute_metrics(eval_pred):
4 logits, labels = eval_pred
----> 5 preds = logits.argmax(axis=-1)
6 return {
7 "accuracy": accuracy_score(labels, preds),

AttributeError: 'tuple' object has no attribute 'argmax'

next edgeBOT
#

This error occurs because when you use preprocess_logits_for_metrics, the output passed to compute_metrics is no longer (logits, labels), but (pred_ids, labels), where pred_ids is already the argmax. So, in your compute_metrics, you should remove the argmax and use preds directly:

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }

See Unsloth bert_classification notebook for similar usage. Would you like more details or troubleshooting tips?
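
If the predictions still arrive as a tuple (for example because preprocess_logits_for_metrics returned more than one value), dropping the argmax alone may not be enough. The sketch below unwraps the predictions defensively and checks shapes before scoring. It is illustrative only: `unwrap_predictions` is a hypothetical helper, accuracy is computed with plain NumPy rather than scikit-learn, and the inputs in the demo are made up.

```python
import numpy as np


def unwrap_predictions(predictions):
    """Hypothetical helper: take the first element if predictions is a tuple or list."""
    if isinstance(predictions, (tuple, list)):
        return predictions[0]
    return predictions


def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = np.asarray(unwrap_predictions(preds))
    labels = np.asarray(labels)
    # Fail early with a readable message instead of a confusing metric error.
    if preds.shape != labels.shape:
        raise ValueError(f"shape mismatch: preds {preds.shape} vs labels {labels.shape}")
    return {"accuracy": float((preds == labels).mean())}


# Made-up inputs: predictions wrapped in a tuple, as a tuple-returning
# preprocess function would produce.
demo_labels = np.array([0, 1, 1])
demo_pred_ids = np.array([0, 1, 0])
demo_result = compute_metrics(((demo_pred_ids, demo_labels), demo_labels))
print(demo_result)
```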


solid onyx
#

@next edge Following your response, I got this error (caused by my compute_metrics function):

[error]

ValueError Traceback (most recent call last)
/tmp/ipykernel_1757/773422404.py in <cell line: 0>()
----> 1 trainer_stats = trainer.train()

/tmp/ipykernel_1757/1822903631.py in compute_metrics(eval_pred)
2 preds, labels = eval_pred
3 return {
----> 4 "accuracy": accuracy_score(labels, preds),
5 "f1": f1_score(labels, preds, average="weighted"),
6 }

ValueError: Found input variables with inconsistent numbers of samples: [93, 2]
[/error]
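
The counts in this message ([93, 2]) are consistent with labels holding 93 eval samples while preds is a two-element tuple, which is what a preprocess_logits_for_metrics that returns (pred_ids, labels) would hand to compute_metrics. Separately, with a causal language model the predictions are per token rather than per example, so a class-style metric needs masking and shifting first. The sketch below computes token-level accuracy under two assumptions that are not confirmed in this thread: ignored label positions are marked with -100, and the prediction at position t corresponds to the label at position t + 1.

```python
import numpy as np

IGNORE_INDEX = -100  # assumption: label positions excluded from the loss use -100


def token_accuracy(pred_ids, labels):
    """Accuracy over non-ignored tokens, aligning position t with label t + 1."""
    pred_ids = np.asarray(pred_ids)
    labels = np.asarray(labels)
    shifted_preds = pred_ids[:, :-1]   # predictions for the next token
    shifted_labels = labels[:, 1:]     # the tokens those predictions target
    mask = shifted_labels != IGNORE_INDEX
    if mask.sum() == 0:
        return 0.0
    return float((shifted_preds[mask] == shifted_labels[mask]).mean())


# Made-up example: one sequence of four tokens, prompt tokens masked out.
example_labels = np.array([[-100, -100, 6, 7]])
example_preds = np.array([[5, 6, 9, 0]])
example_accuracy = token_accuracy(example_preds, example_labels)
print(example_accuracy)
```

For a classification-style task run through a causal model, the predicted tokens would still need to be mapped to class labels before computing F1; that mapping depends on the dataset and is not shown here.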
