Feature Request: Sequential Multi-Turn Finetuning Support | Unsloth AI | Page 1

cloud stream Sep 6, 2025, 5:45 PM

#

Currently, the notebook trains multi-turn conversations in bulk, predicting only the last assistant turn. This misses early-turn supervision and can reduce dialogue coherence, especially on small datasets (~1k examples).

Proposed Feature:
Add support for sequential multi-turn training via a method like dataset.sequentialize(), which:

Converts each conversation into progressive chunks (turns 1–2 → target 2, 1–4 → target 4, … 1–N → target N)
Preserves supervision on all assistant turns
Can be batched for GPU efficiency

Benefits:
Models every assistant turn, not just the last
Improves multi-turn coherence
Particularly helpful for small datasets

covert mapleBOT Sep 6, 2025, 5:45 PM

#

It seems likely that Unsloth does not currently provide a built-in dataset.sequentialize() or similar method for progressive multi-turn supervision, as the standard approach is to merge multi-turn conversations and predict only the last assistant turn. The documentation and example notebooks focus on merging turns for a single target, not on chunking conversations for sequential assistant supervision or progressive batching as described in your feature request. There is no explicit mention of such a feature or API in the available data sources.

If you want to supervise every assistant turn, you may need to preprocess your dataset manually to create progressive conversation chunks, each with its own target, before passing it to Unsloth for fine-tuning. Would you like more detail on how to implement this preprocessing or on current best practices for multi-turn dialogue finetuning in Unsloth?

Sources:

rocky prairie Sep 7, 2025, 12:10 PM

#

cloud stream Currently, the notebook trains multi-turn conversations in bulk, predicting only...

iirc you always predict everything in between and always considerate the mistake of all token in between into the loss as well (meaning it's learning)

#

or else, during training, since you only predicting the next token, even tho you have 4096 sequence length, you only learning from one token

but in reality, you learn from all of them

#Feature Request: Sequential Multi-Turn Finetuning Support