#Calculating token count for embedding model

knotty coral

Hello, I'm trying to calculate the token count of text strings using the mistral-common tokenizer code. After digging through the class methods and docs, I couldn't find a specific example or method for doing this. I can get a token count by passing encode_chat_completion a ChatCompletionRequest, like this:

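from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer


def get_token_count(input_string):
    # Encodes the text as a chat request, so the count includes chat-template tokens
    tokenizer = MistralTokenizer.from_model("mistral-embed")
    message = {"role": "user", "content": input_string}
    tokenized = tokenizer.encode_chat_completion(ChatCompletionRequest(messages=[message]))
    return len(tokenized.tokens)


tokens_len = get_token_count("What's the weather like today in Paris")
print(tokens_len)
# output: 17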

However, when I use the embedding model on the same text, it reports a different token count:

import os

from mistralai.client import MistralClient

client = MistralClient(api_key=os.getenv("MISTRAL_API_KEY"))

embeddings_batch_response = client.embeddings(
    model="mistral-embed",
    input=["What's the weather like today in Paris"]
)

print(embeddings_batch_response.usage.total_tokens)
# output: 11

I'm not sure whether I can just subtract 6 tokens from every string, or whether there is a supported method I'm missing.

dusk swallow

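Basically, encode_chat_completion is most likely wrapping your text in the chat template before tokenizing it, something like this (or similar):
<s>[INST] What's the weather like today in Paris[/INST]
So the extra tokens would come from the template, not from your text.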
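
If that's what is happening, one way to check is to tokenize the text without the chat template and compare the two counts. Below is a rough sketch, not an official method. `template_overhead` and `mistral_raw_encoder` are names I made up for illustration. The raw encoder reaches into `tokenizer.instruct_tokenizer.tokenizer`, which is the underlying tokenizer object in mistral-common. Whether the API's `total_tokens` includes the BOS token is an assumption you would need to verify against `usage.total_tokens`.

```python
from typing import Callable, List

# An encoder maps a string to a list of token ids.
Encoder = Callable[[str], List[int]]


def template_overhead(text: str, encode_raw: Encoder, encode_chat: Encoder) -> int:
    """Number of tokens the chat encoding adds on top of the raw text."""
    return len(encode_chat(text)) - len(encode_raw(text))


def mistral_raw_encoder(bos: bool = True) -> Encoder:
    """Hypothetical helper: encode plain text with no chat template applied."""
    # Imported lazily so template_overhead works without mistral-common installed.
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

    tokenizer = MistralTokenizer.from_model("mistral-embed")
    raw = tokenizer.instruct_tokenizer.tokenizer
    # Assumption: try bos=True and bos=False and see which one matches
    # the total_tokens value reported by the embeddings endpoint.
    return lambda text: raw.encode(text, bos=bos, eos=False)
```

Calling `len(mistral_raw_encoder()(text))` on a few strings and comparing against the API's reported usage would tell you whether the raw count lines up. I wouldn't rely on subtracting a fixed 6 without checking it on several inputs first, since the overhead depends on how the template pieces get tokenized.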