#How to modify the SentencePiece tokenizer to not split the chatml tokens?


thorn nimbus
#

Hey there,

We are using the reference model implementation for Mistral 7B and the corresponding SentencePiece tokenizer from Google.

sp_model = SentencePieceProcessor(model_file=...)

We would like to use the ChatML format for fine-tuning, and we don't want to split these tokens:

<|im_start|> and <|im_end|>

What's the best way of adding/modifying the SentencePiece tokenizer so that it doesn't split these tokens when encoding?

When your team trained the Instruct model, did you modify the tokenizer so that it preserved the <s> and </s> tokens?

Thank you!

vagrant salmon
thorn nimbus
#

Thanks @vagrant salmon, I believe he (like everyone else) uses the Hugging Face tokenizer, which makes these tasks simple. Here we want to do it explicitly with SentencePiece.

south anvil
#

around 2:00-2:20 into it

thorn nimbus
#

@south anvil haha - thanks for sharing! George Hotz is amazing! I just watched the clip. He makes it work, but there must be a better solution. I don't want to parse every single input and manually add the token_ids. I wonder if the Mistral guys know how to do this?
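
For reference, the manual approach I mean (parse the text and add the token_ids by hand) would look roughly like the sketch below. It is only an illustration: `encode_with_special_tokens`, the `special_ids` mapping, and the ID values in the usage comment are hypothetical, and the IDs you actually use have to correspond to rows in the model's embedding table. One caveat I would expect: SentencePiece models with a dummy prefix add a leading "▁" on every encode call, so encoding segments separately may not give exactly the same ids as encoding the whole string at once.

```python
import re
from typing import Callable, Dict, List


def encode_with_special_tokens(
    text: str,
    encode: Callable[[str], List[int]],
    special_ids: Dict[str, int],
) -> List[int]:
    """Encode text while mapping each special token string to one fixed id.

    `encode` is any function that turns plain text into token ids, for
    example the `encode` method of a loaded SentencePieceProcessor.
    `special_ids` is a hypothetical mapping chosen by the caller.
    """
    if not special_ids:
        return list(encode(text))

    # Match the special tokens literally. The capture group makes re.split
    # keep the matched tokens in its output instead of dropping them.
    alternatives = sorted(special_ids, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in alternatives) + ")"

    ids: List[int] = []
    for chunk in re.split(pattern, text):
        if not chunk:
            continue  # re.split yields empty strings around adjacent matches
        if chunk in special_ids:
            ids.append(special_ids[chunk])  # never passed to the tokenizer
        else:
            ids.extend(encode(chunk))  # ordinary text, tokenized as usual
    return ids


# Usage sketch (assumes sp_model is already loaded; the ids are made up):
# special_ids = {"<|im_start|>": 32000, "<|im_end|>": 32001}
# ids = encode_with_special_tokens(prompt, sp_model.encode, special_ids)
```

It works, but every encode call has to go through the wrapper, which is exactly what I would like to avoid.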

south anvil