Hello, I'm trying to calculate the token count of text strings using the mistral-common tokenizer code. After digging through the class methods and docs, I was unable to find a specific example or method for doing this. I'm able to get a token count by passing encode_chat_completion a ChatCompletionRequest, like this:
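from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

def get_token_count(input_string):
    # Wrap the text in a single user message and encode it as a chat request
    tokenizer = MistralTokenizer.from_model("mistral-embed")
    message = {"role": "user", "content": input_string}
    tokenized = tokenizer.encode_chat_completion(ChatCompletionRequest(messages=[message]))
    tokens = tokenized.tokens
    return len(tokens)

tokens_len = get_token_count("What's the weather like today in Paris")
print(tokens_len)
# output: 17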
However, when I use the embedding model on the same text, the API reports a different token count:
import os
from mistralai.client import MistralClient

client = MistralClient(api_key=os.getenv("MISTRAL_API_KEY"))
embeddings_batch_response = client.embeddings(
    model="mistral-embed",
    input=["What's the weather like today in Paris"]
)
print(embeddings_batch_response.usage.total_tokens)
# output: 11
I'm not sure if I can just subtract 6 tokens (17 - 11) from the count for every string, or if there is a supported method I'm missing.
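
One way to test the "subtract 6" idea would be to compare both counts over several different strings and see whether the difference is always the same. The sketch below is only an illustration: constant_offset is a hypothetical helper, not part of mistral-common or the Mistral client, and it takes the two counting functions as arguments so it makes no assumptions about either library.

```python
from typing import Callable, Iterable, Optional


def constant_offset(
    local_count: Callable[[str], int],
    api_count: Callable[[str], int],
    samples: Iterable[str],
) -> Optional[int]:
    """Return the shared (local - api) difference, or None if it varies.

    Hypothetical helper: both counters are supplied by the caller.
    """
    # Collect the distinct differences seen across all sample strings
    offsets = {local_count(text) - api_count(text) for text in samples}
    # Exactly one distinct value means the overhead is fixed for these samples
    return offsets.pop() if len(offsets) == 1 else None
```

Here local_count would be get_token_count from above, and api_count would be a small wrapper (not shown, and an assumption on my part) that sends one string to the embeddings endpoint and returns usage.total_tokens. A result of None would mean a fixed subtraction is not safe, while a single number only shows the offset is constant for the strings tried, so short, long, and non-English samples are worth including.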