Hey there,
We are using the reference model implementation for Mistral 7B and the corresponding SentencePiece tokenizer from Google:
sp_model = SentencePieceProcessor(model_file=...)
We would like to use the ChatML format for fine-tuning, and we don't want to split these tokens:
<|im_start|> and <|im_end|>
What's the best way to extend or modify the SentencePiece tokenizer so that it doesn't split these tokens when encoding?
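To make the question concrete, here is a minimal sketch of one workaround we could imagine: leave the SentencePiece model untouched, split the text on the ChatML markers first, and map each marker to a reserved ID placed after the base vocabulary. Everything named here (`make_chatml_encoder`, `base_vocab_size`, the ID layout) is hypothetical and not part of the reference implementation. The `encode` argument stands in for something like `sp_model.encode`, and the model's embedding table would presumably need to grow by two rows. One caveat we are unsure about: SentencePiece may add a dummy prefix to each chunk it encodes, so chunk-wise encoding might not match encoding the whole string at once.

```python
import re
from typing import Callable, Dict, List, Sequence, Tuple


def make_chatml_encoder(
    encode: Callable[[str], List[int]],
    base_vocab_size: int,
    special_tokens: Sequence[str] = ("<|im_start|>", "<|im_end|>"),
) -> Tuple[Callable[[str], List[int]], Dict[str, int]]:
    """Wrap `encode` so the given special tokens are never split.

    Assumption: the special tokens get new IDs appended right after the
    base vocabulary (base_vocab_size, base_vocab_size + 1, ...).
    """
    special_ids = {tok: base_vocab_size + i for i, tok in enumerate(special_tokens)}
    # The capturing group makes re.split keep the markers in the result.
    pattern = re.compile("(" + "|".join(re.escape(t) for t in special_tokens) + ")")

    def encode_with_specials(text: str) -> List[int]:
        ids: List[int] = []
        for chunk in pattern.split(text):
            if not chunk:
                continue  # re.split yields empty strings around adjacent markers
            if chunk in special_ids:
                ids.append(special_ids[chunk])  # emit the marker as a single ID
            else:
                ids.extend(encode(chunk))  # ordinary text goes to the base tokenizer
        return ids

    return encode_with_specials, special_ids
```

With the real tokenizer, this would presumably be called as `make_chatml_encoder(sp_model.encode, sp_model.get_piece_size())`, but we have not verified that the resulting IDs line up with what the model expects.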
When your team trained the Instruct model, did you modify the tokenizer so that it preserved the <s> and </s> tokens?
Thank you!