How to modify the SentencePiece tokenizer to not split the chatml tokens? | Mistral AI | Page 1

thorn nimbus Nov 28, 2023, 4:39 PM

Hey there,

We are using the reference model implementation for Mistral7B, and the corresponding SentencePiece tokenizer from Google.

sp_model = SentencePieceProcessor(model_file=...)

We would like to use the ChatML format for fine-tuning, and we don't want to split these tokens:

<|im_start|> and <|im_end|>

What's the best way of adding/modifying the SentencePiece tokenizer so that it doesn't split these tokens when encoding?

When your team trained the Instruct model, did you modify the tokenizer so that it preserved the <s> and </s> tokens?

Thank you!

vagrant salmon Nov 28, 2023, 7:02 PM

I can't answer the question, but Eric Hartford (https://twitter.com/erhartford / https://huggingface.co/ehartford) has done a lot of Mistral finetunes that use ChatML format so probably a good person to ask

thorn nimbus Nov 28, 2023, 9:39 PM

Thanks @vagrant salmon, I believe he (like everyone else) uses the Hugging Face tokenizer, which makes these tasks simple. Here we want to do it explicitly with SentencePiece.

south anvil Nov 29, 2023, 11:07 AM

I have a vague memory of George Hotz doing that in his stream on porting Mistral to tinygrad: https://www.youtube.com/watch?v=2QO3vzwHXhg&t=1s&ab_channel=georgehotzarchive

around 2:00-2:20 into it

thorn nimbus Nov 29, 2023, 8:00 PM

@south anvil haha - thanks for sharing! George Hotz is amazing! I just watched the clip. He makes it work, but there must be a better solution. I don't want to parse every single input and manually add the token_ids; I wonder if the Mistral guys know how to do this?
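
For reference, the manual approach I'd like to avoid looks roughly like the sketch below (my own reconstruction, not necessarily what the stream does). It splits the text on the ChatML markers, encodes the plain chunks with whatever `encode` callable is passed in, and splices in fixed ids for the markers. The function name `encode_with_special_tokens` and the ids 32000 and 32001 are hypothetical; they assume the markers are appended after the existing vocabulary and that the model's embedding table has been resized to match.

```python
import re
from typing import Callable, Dict, List


def encode_with_special_tokens(
    text: str,
    encode: Callable[[str], List[int]],
    special_ids: Dict[str, int],
) -> List[int]:
    """Encode `text`, mapping each special token to a fixed id instead of splitting it."""
    if not special_ids:
        return list(encode(text))

    # Longest markers first, so a marker that is a prefix of another cannot shadow it.
    markers = sorted(special_ids, key=len, reverse=True)
    # The capturing group makes re.split keep the markers in the result.
    pattern = "(" + "|".join(re.escape(m) for m in markers) + ")"

    ids: List[int] = []
    for chunk in re.split(pattern, text):
        if not chunk:
            continue  # re.split yields empty strings around adjacent markers
        if chunk in special_ids:
            ids.append(special_ids[chunk])  # splice in the reserved id
        else:
            ids.extend(encode(chunk))  # ordinary text goes through the tokenizer
    return ids


# Hypothetical ids, assumed to sit just past the end of the base vocabulary.
CHATML_IDS = {"<|im_start|>": 32000, "<|im_end|>": 32001}
```

With SentencePiece, `encode` would be `sp_model.encode`. One caveat: each chunk is encoded on its own, so if the model adds a dummy whitespace prefix, every chunk gets one, and the ids may differ from encoding the full string in a single call.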

south anvil Nov 29, 2023, 8:03 PM

yeah, i watched it at 2x... he did seem to have other plans for his stream than to get this annoying detail solved with elegance 🤣
