Hi! Is there a way to calculate accuracy and F1 score after training, for example by passing a compute_metrics function as an argument? And what is the standard evaluation strategy: are accuracy and F1 the usual metrics? If anyone has a notebook that shows the entire pipeline, that would be helpful! Thanks!
#How to calculate F1 Score and Accuracy when training with Unsloth?
Yes, you can calculate accuracy and F1 score during Unsloth training by passing a compute_metrics function to the Trainer or SFTTrainer. The usual approach is to define compute_metrics with scikit-learn metrics (e.g., accuracy_score, f1_score) and pass it as an argument. Without compute_metrics, evaluation reports only the loss (plus runtime statistics), so accuracy and F1 both have to be added explicitly; the two are a common pairing for classification, with F1 being the more informative one when classes are imbalanced. Example notebooks for classification with compute_metrics are available in the official Unsloth resources.
For example, the compute_metrics function can look like this:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }
See full pipeline examples in the bert_classification notebook and Kaggle-bert_classification notebook. Would you like a step-by-step breakdown or more code details?
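To make the average="weighted" choice concrete, here is a minimal pure-Python sketch of what accuracy and support-weighted F1 compute. It is an illustration only, not scikit-learn's implementation, and the helper names (accuracy, per_class_f1, f1) are made up for this example. In real training code you would keep using accuracy_score and f1_score.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, treating that class as 'positive'."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1(y_true, y_pred, average="weighted"):
    """Macro or support-weighted average of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = {c: per_class_f1(y_true, y_pred, c) for c in classes}
    if average == "macro":
        # Every class counts equally, however rare it is.
        return sum(scores.values()) / len(classes)
    if average == "weighted":
        # Each class is weighted by how often it occurs in the labels.
        support = Counter(y_true)
        return sum(scores[c] * support[c] for c in classes) / len(y_true)
    raise ValueError(f"unsupported average: {average!r}")
```

With y_true = [0, 0, 0, 1] and y_pred = [0, 0, 1, 1], accuracy is 0.75, macro F1 is about 0.733 and weighted F1 is about 0.767. The gap between macro and weighted grows as classes become more imbalanced, which is why it is worth stating which average you report.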
Sources:
bert_classification.ipynb
Kaggle-bert_classification.ipynb
@next edge so following your response:
So I would pass it to the trainer, as a parameter of SFTTrainer, i.e. compute_metrics=compute_metrics? When I do that and start training, it finishes training and then evaluates on the eval_dataset, at which point RAM usage shoots past 25GB and crashes my kernel. How should I do this?
Yes, you pass compute_metrics=compute_metrics as a parameter to SFTTrainer. However, high RAM usage during evaluation is a known issue, especially with large eval datasets or when returning logits for metrics—this can cause crashes if not managed.
To reduce RAM usage, set per_device_eval_batch_size=1 and eval_accumulation_steps=2 (or higher) in your training arguments. Also, use a preprocess_logits_for_metrics function to avoid storing large tensors:
def preprocess_logits_for_metrics(logits, labels):
    pred_ids = logits.argmax(axis=-1)
    return pred_ids, labels
Then, pass both compute_metrics and preprocess_logits_for_metrics to SFTTrainer. See Unsloth evaluation loop FAQ and Unsloth wiki evaluation loop for more details. Would you like a full code example?
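A rough back-of-the-envelope calculation shows why keeping full logits for the whole eval set can exhaust RAM, and why reducing them to token ids helps. The numbers below are assumptions for illustration only: 93 eval examples, 2048 tokens per example, a vocabulary of 151,936 entries (roughly the size used by Qwen2-family models) and float32 storage. The function names are hypothetical.

```python
def full_logits_bytes(num_examples, seq_len, vocab_size, bytes_per_value=4):
    """Memory needed to keep one score per vocabulary entry per token."""
    return num_examples * seq_len * vocab_size * bytes_per_value


def pred_ids_bytes(num_examples, seq_len, bytes_per_id=8):
    """Memory needed to keep only the argmax token id per token (int64)."""
    return num_examples * seq_len * bytes_per_id


# Assumed sizes, not measurements from any particular run.
NUM_EXAMPLES = 93
SEQ_LEN = 2048
VOCAB_SIZE = 151_936

logits_total = full_logits_bytes(NUM_EXAMPLES, SEQ_LEN, VOCAB_SIZE)
ids_total = pred_ids_bytes(NUM_EXAMPLES, SEQ_LEN)

print(f"full logits: {logits_total / 2**30:.1f} GiB")
print(f"argmax ids:  {ids_total / 2**20:.2f} MiB")
```

Under these assumptions the full logits come to roughly 108 GiB, while the argmax ids need about 1.5 MiB. Real figures depend on your sequence lengths, dtype and padding, but the vocabulary dimension is what dominates.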
@next edge from your response:
I've passed the preprocess_logits_for_metrics as a parameter in the trainer object, and now I'm getting this error once training has finished and is calculating the metrics:
Unsloth: Not an error, but Qwen2ForCausalLM does not accept num_items_in_batch.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient
AttributeError Traceback (most recent call last)
/2282580354.py in compute_metrics(eval_pred)
3 def compute_metrics(eval_pred):
4 logits, labels = eval_pred
----> 5 preds = logits.argmax(axis=-1)
6 return {
7 "accuracy": accuracy_score(labels, preds),
AttributeError: 'tuple' object has no attribute 'argmax'
This error occurs because when you use preprocess_logits_for_metrics, the output passed to compute_metrics is no longer (logits, labels), but (pred_ids, labels), where pred_ids is already the argmax. So, in your compute_metrics, you should remove the argmax and use preds directly:
def compute_metrics(eval_pred):
    preds, labels = eval_pred
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }
See Unsloth bert_classification notebook for similar usage. Would you like more details or troubleshooting tips?
Sources:
Unsloth bert_classification notebook
@next edge Following your response, I got this error (caused by my compute_metrics function):
[error]
ValueError Traceback (most recent call last)
/tmp/ipykernel_1757/773422404.py in <cell line: 0>()
----> 1 trainer_stats = trainer.train()
/tmp/ipykernel_1757/1822903631.py in compute_metrics(eval_pred)
2 preds, labels = eval_pred
3 return {
----> 4 "accuracy": accuracy_score(labels, preds),
5 "f1": f1_score(labels, preds, average="weighted"),
6 }
ValueError: Found input variables with inconsistent numbers of samples: [93, 2]
[/error]
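The [93, 2] in this ValueError is consistent with preds being a two-element tuple. The Trainer forwards whatever preprocess_logits_for_metrics returns as the predictions, so when that function returns (pred_ids, labels), compute_metrics receives ((pred_ids, labels), labels), and scikit-learn ends up comparing 93 label rows against a tuple of length 2. The same tuple likely explains the earlier 'tuple' object has no attribute 'argmax' error. Below is a minimal sketch of one way around it, under these assumptions: the model is a causal LM (as Qwen2ForCausalLM suggests), labels arrive unshifted with -100 marking ignored tokens, and token-level accuracy is an acceptable metric. flatten_for_metrics is a hypothetical helper name, not part of Unsloth or Transformers.

```python
import numpy as np

IGNORE_INDEX = -100  # label value the loss skips (prompt and padding tokens)


def preprocess_logits_for_metrics(logits, labels):
    # Some models return a tuple whose first element holds the logits.
    if isinstance(logits, tuple):
        logits = logits[0]
    # Return ONE array of token ids, not (pred_ids, labels): the labels
    # reach compute_metrics separately.
    return logits.argmax(-1)


def flatten_for_metrics(eval_pred):
    """Turn (predictions, labels) into two aligned 1-D arrays."""
    preds, labels = eval_pred
    # Tolerate the older tuple-returning preprocess function as well.
    if isinstance(preds, tuple):
        preds = preds[0]
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    # In a causal LM, the prediction at position t targets token t + 1.
    preds = preds[:, :-1]
    labels = labels[:, 1:]
    keep = labels != IGNORE_INDEX  # drop masked and padded positions
    return labels[keep], preds[keep]


def compute_metrics(eval_pred):
    y_true, y_pred = flatten_for_metrics(eval_pred)
    if y_true.size == 0:
        return {"accuracy": 0.0}
    return {"accuracy": float((y_true == y_pred).mean())}
```

The flattened y_true and y_pred can also be handed to accuracy_score and f1_score exactly as in the earlier snippets. Note that these are per-token scores; if the task is classification where only the generated answer matters, you may prefer to decode the predictions and compare class labels instead.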