data preparation for instruct sft | Unsloth AI | Page 1

spring pivotBOT Nov 10, 2025, 2:08 PM

To fine-tune Gemma-3 instruct models with a Hugging Face dataset in 'instruction', 'input', 'output' format, first convert your data to a single text field using a prompt template, then apply the Gemma-3 chat template. The recommended approach is:

Use a formatting function to merge your fields into a prompt, e.g.:

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # input_text avoids shadowing Python's built-in input()
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text": texts }

Map this function to your dataset, then use Unsloth's get_chat_template with chat_template="gemma-3" for the tokenizer before training with SFTTrainer. See Kaggle-Gemma3_(4B).py and data formatting guide.

Would you like a full code example or more detail on applying the chat template?

Sources:

Unsloth Datasets Guide

glad fractal Nov 10, 2025, 2:18 PM

@spring pivot that seems to be an outdated approach. I see this being used in the example notebooks:

from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Here is what dataset[100] looks like:

{'conversations': [{'content': 'What is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?',
   'role': 'user'},
  {'content': 'To calculate the modulus of two given numbers, you can use the modulus operator in the following way: ... This means that the modulus of 10 and 4 is 2.',
   'role': 'assistant'}],
 'source': 'infini-instruct-top-500k',
 'score': 4.774171352386475}

Then I see it being transformed like so:

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

At the end we have dataset[100]['text']:
'<start_of_turn>user\nWhat is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?<end_of_turn>\n<start_of_turn>model\n To calculate the modulus of two given numbers, you can use the modulus operator in the following way: ... This means that the modulus of 10 and 4 is 2.<end_of_turn>\n'

spring pivotBOT Nov 10, 2025, 2:18 PM

You are correct—the current Unsloth workflow uses standardize_data_formats to convert datasets into a unified 'conversations' format, followed by applying the chat template for Gemma-3. This is the recommended and up-to-date approach for preparing data for instruct SFT with Unsloth, as shown in the official example notebooks and documentation.

To use your 'instruction', 'input', 'output' dataset, first convert each row into a 'conversations' list with user/assistant roles, then run standardize_data_formats, and finally apply the chat template as you described. This ensures compatibility with Gemma-3 and other supported models. See Kaggle-Gemma3_(4B).py and Unsloth Datasets Guide.

Would you like a code snippet for converting your dataset to the 'conversations' format?

glad fractal Nov 10, 2025, 2:27 PM

@spring pivot i need code snippet

spring pivotBOT Nov 10, 2025, 2:27 PM

Here’s a code snippet to convert your 'instruction', 'input', 'output' dataset to the correct 'conversations' format and apply the Gemma-3 chat template for Unsloth SFT:

from datasets import load_dataset
from unsloth.chat_templates import standardize_data_formats, get_chat_template

# Load your dataset
dataset = load_dataset("your_dataset_name", split="train")

# Convert to 'conversations' format
def to_conversations(batch):
    return {
        "conversations": [
            [
                {"role": "user", "content": f"{instr}\n{inp}"},
                {"role": "model", "content": out}
            ]
            for instr, inp, out in zip(batch["instruction"], batch["input"], batch["output"])
        ]
    }

dataset = dataset.map(to_conversations, batched=True, remove_columns=["instruction", "input", "output"])

# Standardize format
dataset = standardize_data_formats(dataset)

# Apply Gemma-3 chat template
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3-4b-it")
tokenizer = get_chat_template(tokenizer, chat_template="gemma-3")

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False).removeprefix('<bos>') for convo in convos]
    return { "text": texts }

dataset = dataset.map(formatting_prompts_func, batched=True)

This matches the official Unsloth Gemma-3 workflow. Let me know if you want a full training example.

glad fractal Nov 10, 2025, 2:45 PM

@spring pivot std_ds = standardize_data_formats(conversations)
TypeError Traceback (most recent call last)
Cell In[15], line 2
1 # Standardize format
----> 2 std_ds = standardize_data_formats(conversations)

File ~/.conda/envs/unsloth_env/lib/python3.11/site-packages/unsloth_zoo/dataset_utils.py:443, in standardize_data_formats(dataset, tokenizer, aliases_for_system, aliases_for_user, aliases_for_assistant, batch_size, num_proc)
441 leftover_aliases = (all_aliases | roles) - all_aliases
442 if len(leftover_aliases) != 0:
--> 443 raise TypeError(
444 f"Unsloth: {list(leftover_aliases)} are not in aliases. Please update aliases."
445 )
446 pass
448 # Mapping for aliases

TypeError: Unsloth: ['model'] are not in aliases. Please update aliases.

spring pivotBOT Nov 10, 2025, 2:45 PM

This error occurs because your conversation uses the role 'model', which is not in the default alias list for standardize_data_formats. By default, Unsloth expects assistant roles to be 'assistant', 'gpt', or 'output'. You need to add 'model' to aliases_for_assistant:

std_ds = standardize_data_formats(
    dataset,
    aliases_for_assistant=["gpt", "assistant", "output", "model"]
)

This will resolve the TypeError by telling Unsloth to treat 'model' as the assistant role. See dataset_utils.py.
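An alternative to extending the alias list is to emit the 'assistant' role when building the conversations in the first place. The following is a minimal sketch, assuming the 'instruction', 'input', and 'output' fields from earlier in the thread. The helper name alpaca_row_to_conversation is hypothetical, and skipping an empty 'input' is an added assumption so the user prompt does not end in a stray newline.

```python
def alpaca_row_to_conversation(instruction, input_text, output):
    """Build one user/assistant exchange from an Alpaca-style row (hypothetical helper)."""
    # Only append the input when it is non-empty, so the prompt has no trailing newline.
    if input_text and input_text.strip():
        user_content = f"{instruction}\n{input_text}"
    else:
        user_content = instruction
    return [
        {"role": "user", "content": user_content},
        # "assistant" is one of the default assistant aliases, so no alias override is needed.
        {"role": "assistant", "content": output},
    ]


def to_conversations(batch):
    """Batched variant, shaped for dataset.map(..., batched=True)."""
    return {
        "conversations": [
            alpaca_row_to_conversation(instr, inp, out)
            for instr, inp, out in zip(batch["instruction"], batch["input"], batch["output"])
        ]
    }
```

With this version, standardize_data_formats(dataset) can be called without the aliases_for_assistant argument.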

Would you like more detail on customizing role aliases?

Sources:

https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_zoo/dataset_utils.py

glad fractal Nov 10, 2025, 2:48 PM

where is the mention of gemma-3? @spring pivot

spring pivotBOT Nov 10, 2025, 2:48 PM

The "gemma-3" name is specified when you load the chat template with get_chat_template(tokenizer, chat_template="gemma-3"). This ensures your prompts follow the conversational structure Gemma-3 expects, which matters for both fine-tuning and inference. See Kaggle-Gemma3_(4B).py and Unsloth Docs: Gemma 3 Fine-tuning.

Would you like more detail on how the chat template affects Gemma-3 training?

Sources:

https://docs.unsloth.ai/models/gemma-3-how-to-run-and-fine-tune

glad fractal Nov 10, 2025, 2:50 PM

@spring pivot but we haven't added it here:
def formatting_prompts_func(examples):
convos = examples["conversations"]
texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False).removeprefix('<bos>') for convo in convos]
return { "text": texts }

spring pivotBOT Nov 10, 2025, 2:50 PM

You are correct—the chat template "gemma-3" is not specified inside the formatting_prompts_func itself. Instead, it is set when you call get_chat_template(tokenizer, chat_template="gemma-3") before this function. This modifies the tokenizer so that all subsequent calls to tokenizer.apply_chat_template(...) use the "gemma-3" template by default, as shown in Kaggle-Gemma3_(4B).py.
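To make concrete what the template produces, here is a hand-rolled stand-in that mimics the shape of the dataset[100]['text'] output quoted earlier in the thread. It is not the real Jinja template shipped with the tokenizer: the function name render_gemma3_turns is hypothetical, it handles only plain-string user/assistant turns, and it leaves out <bos>, matching the .removeprefix('<bos>') call.

```python
def render_gemma3_turns(conversation):
    """Simplified illustration of Gemma-3 turn formatting (not the real template)."""
    parts = []
    for message in conversation:
        # In the rendered text the assistant role appears as "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "What is 10 % 4?"},
    {"role": "assistant", "content": "The modulus of 10 and 4 is 2."},
]
print(render_gemma3_turns(example))
```

The sketch is only meant for eyeballing whether the real 'text' column has the expected turn markers; training should use tokenizer.apply_chat_template as shown in the thread.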

Would you like a code block showing the full sequence of steps?

glad fractal Nov 10, 2025, 2:52 PM

@spring pivot i got an error :
formatted_ds = std_ds.map(formatting_prompts_func, batched=True)
File ~/.conda/envs/unsloth_env/lib/python3.11/site-packages/transformers/utils/chat_template_utils.py:446, in _compile_jinja_template.<locals>.raise_exception(message)
445 def raise_exception(message):
--> 446 raise jinja2.exceptions.TemplateError(message)

TemplateError: Invalid content type

spring pivotBOT Nov 10, 2025, 2:52 PM

This error means that at least one entry in your 'conversations' lists has a 'content' value the chat template cannot render, most likely None (a missing value) or a number instead of a string. For text-only SFT, every 'content' field should be a string, as shown in Kaggle-Gemma3_(27B)_A100-Conversational.py.

To fix, either filter out such rows or coerce all 'content' fields to strings before applying the chat template. Note that coercing None yields the literal string "None", which is rarely what you want to train on:

def clean_conversations(example):
    for msg in example["conversations"]:
        if not isinstance(msg["content"], str):
            msg["content"] = str(msg["content"])
    return example

std_ds = std_ds.map(clean_conversations)

Would you like a more detailed data cleaning example?

glad fractal Nov 10, 2025, 3:03 PM

@spring pivot I just want to ignore/delete those rows instead

spring pivotBOT Nov 10, 2025, 3:03 PM

You can drop the problematic rows by keeping only conversations in which every 'content' field is a string, and then applying the chat template to what remains.

Here's a code snippet to filter out such rows:

def is_valid_conversation(example):
    return all(isinstance(msg["content"], str) for msg in example["conversations"])

filtered_ds = std_ds.filter(is_valid_conversation)
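For large datasets, the same check can be written in batched form. This is a minimal sketch, assuming the 'conversations' column from above. The name valid_conversation_mask is hypothetical, and rejecting empty or whitespace-only strings is an extra assumption that the original predicate does not make.

```python
def valid_conversation_mask(batch):
    """Return one boolean per row: True when every message has non-empty string content."""
    return [
        all(
            isinstance(msg.get("content"), str) and msg["content"].strip() != ""
            for msg in convo
        )
        for convo in batch["conversations"]
    ]


# Usage with Hugging Face datasets (std_ds as in the thread):
# filtered_ds = std_ds.filter(valid_conversation_mask, batched=True)
```

Comparing len(std_ds) with len(filtered_ds) afterwards shows how many rows were dropped.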

Would you like more detail or a batched version for large datasets?

Sources:

https://github.com/unslothai/unsloth/issues/2297

glad fractal Nov 10, 2025, 5:25 PM

@spring pivot now divide into train and eval sets

spring pivotBOT Nov 10, 2025, 5:25 PM

You can split your dataset into train and eval sets using Hugging Face's train_test_split method, which is the recommended approach in Unsloth documentation and notebooks:

split_ds = filtered_ds.train_test_split(test_size=0.01, shuffle=True, seed=3407)
train_ds = split_ds["train"]
eval_ds = split_ds["test"]

This ensures a randomized and reproducible split. See Unsloth Datasets Guide.

Would you like more detail on customizing the split ratio?

glad fractal Nov 10, 2025, 5:32 PM

@spring pivot RuntimeError: Unsloth: You must specify a formatting_func

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_ds,
    eval_dataset = eval_ds, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 32,
        gradient_accumulation_steps = 8, # Use GA to mimic batch size!
        warmup_steps = 500,
        num_train_epochs = 3, # Set this for 1 full training run.
        max_steps = -1,
        learning_rate = 2e-5, # Reduce to 2e-5 for long training runs
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb", # Use TrackIO/WandB etc
    ),
)

spring pivotBOT Nov 10, 2025, 5:32 PM

This error occurs because your dataset does not have a "text" column, or you have not provided a formatting_func to convert your data to text for tokenization. Unsloth requires either a "text" field or a formatting_func to process the dataset, as explained in Unsloth Issue #2658.

To fix, ensure you have run the formatting step to add a "text" column:

train_ds = train_ds.map(formatting_prompts_func, batched=True)
eval_ds = eval_ds.map(formatting_prompts_func, batched=True)

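Alternatively, pass a formatting function as the formatting_func argument to SFTTrainer. In that case it should return the formatted strings themselves rather than a {"text": ...} dict, so formatting_prompts_func would need a small change first.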

Would you like a full example of the correct workflow?

Sources:

https://github.com/unslothai/unsloth/issues/2658

glad fractal Nov 10, 2025, 5:52 PM

@spring pivot what do you think of this setup? Also, i want to save intermediate checkpoints every 500 steps.

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_ds,
    eval_dataset = eval_ds, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 32,
        gradient_accumulation_steps = 8, # Use GA to mimic batch size!
        warmup_steps = 500,
        num_train_epochs = 3, # Set this for 1 full training run.
        max_steps = -1,
        learning_rate = 2e-5, # Reduce to 2e-5 for long training runs
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb", # Use TrackIO/WandB etc
    ),
)

spring pivotBOT Nov 10, 2025, 5:52 PM

Your setup looks reasonable for Unsloth SFT training. To save intermediate checkpoints every 500 steps, add the following to your SFTConfig: save_strategy="steps", save_steps=500, and output_dir="outputs" (or your preferred directory). This follows the Unsloth docs page on fine-tuning from the last checkpoint, linked under Sources.

Example:

args = SFTConfig(
    ...,
    save_strategy="steps",
    save_steps=500,
    output_dir="outputs",
)
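The checkpoint settings can be kept together with evaluation settings in one place. This is a sketch under stated assumptions: eval_strategy, eval_steps, and save_total_limit are standard Hugging Face TrainingArguments fields that SFTConfig inherits, but the values chosen here are illustrative, and older transformers releases spell the first one evaluation_strategy.

```python
# Hypothetical values; merge into SFTConfig(...) alongside the existing arguments.
checkpoint_and_eval_kwargs = {
    "output_dir": "outputs",      # checkpoints are written under this directory
    "save_strategy": "steps",
    "save_steps": 500,            # write a checkpoint every 500 optimizer steps
    "save_total_limit": 3,        # assumption: keep only the 3 newest checkpoints
    "eval_strategy": "steps",     # by default the Trainer does not evaluate during training
    "eval_steps": 500,            # assumption: evaluate on the same cadence as saving
}

# args = SFTConfig(dataset_text_field="text", **checkpoint_and_eval_kwargs)
# To continue an interrupted run from the newest checkpoint in output_dir:
# trainer.train(resume_from_checkpoint=True)
```

Matching eval_steps to save_steps means every saved checkpoint has an eval loss logged at the same step, which makes it easier to pick one afterwards.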

Would you like more detail on resuming from checkpoints or managing saved models?

Sources:

https://docs.unsloth.ai/basics/finetuning-from-last-checkpoint

glad fractal Nov 11, 2025, 2:21 AM

@spring pivot decreased batch size to 16, and grad accum to 4. The train loss is decreasing, but it's still around 2. Is that a signal to stop?

spring pivotBOT Nov 11, 2025, 2:21 AM

A training loss around 2 is not by itself a signal to stop. What matters is whether the loss is still decreasing and how your eval/validation loss behaves. Unsloth recommends aiming for training loss values closer to 1, 0.8, or 0.5, but the right stopping point depends on your dataset and task. If the training loss plateaus and the eval loss stops improving, consider stopping or using early stopping. See the Unsloth Datasets Guide and the LoRA Hyperparameters Guide.
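When reading a loss curve, it helps to know how many samples each optimizer step covers and roughly how many steps make up an epoch. The following is a small worked calculation, assuming single-GPU training. The dataset size of 100,000 rows is purely hypothetical, since the thread does not state it, and the step count is an approximation of what the Trainer reports.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Samples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def steps_per_epoch(num_rows, per_device_batch, grad_accum_steps, num_devices=1):
    """Approximate optimizer steps needed to see every row once."""
    per_step = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return math.ceil(num_rows / per_step)


original = effective_batch_size(32, 8)   # settings from the first SFTConfig: 256
reduced = effective_batch_size(16, 4)    # settings after the change: 64
print(original, reduced)

# Hypothetical dataset size, for illustration only.
print(steps_per_epoch(100_000, 16, 4))   # 1563
```

Going from 256 to 64 samples per step means each logged loss value averages over fewer samples, so more step-to-step noise is expected, and a given step count corresponds to a quarter as much data seen.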

Would you like more detail on monitoring eval loss or using early stopping?

Sources:

https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide#training-on-completions-only-masking-out-inputs

glad fractal Nov 11, 2025, 2:34 AM

@spring pivot it's been 4500 steps. Loss is decreasing very slowly.
