#Tiktoken not encoding newlines?

tender hedge

I am trying to calculate the token count for a piece of text in Python before sending it to openai.ChatCompletion.create.

I am using the tiktoken package, as recommended on https://platform.openai.com/tokenizer. However, after some testing, I noticed that the count produced by tiktoken does not match the result I get from the online tokenizer (at https://platform.openai.com/tokenizer), and it also doesn't match the usage reported in the response from openai.ChatCompletion.create.

After experimenting a bit, it looks to me like tiktoken doesn't encode newlines. Is this a bug, or have I configured it incorrectly?

Extra note: I am targeting the gpt-3.5-turbo model.
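
One way to check the newline question directly is to print the bytes behind each token ID. This is only a sketch: it assumes a tiktoken `Encoding` object (for example from `tiktoken.encoding_for_model("gpt-3.5-turbo")`) and relies on its `encode` and `decode_single_token_bytes` methods. The helper names `token_pieces`, `count_newline_tokens` and `load_encoding` are made up for illustration.

```python
def token_pieces(text, encoding):
    """Return a list of (token_id, token_bytes) pairs for `text`."""
    return [
        (token_id, encoding.decode_single_token_bytes(token_id))
        for token_id in encoding.encode(text)
    ]


def count_newline_tokens(text, encoding):
    """Count the tokens whose bytes contain at least one newline."""
    return sum(1 for _, piece in token_pieces(text, encoding) if b"\n" in piece)


def load_encoding(model="gpt-3.5-turbo"):
    """Load the tiktoken encoding for `model` (needs `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the helpers above work without it

    return tiktoken.encoding_for_model(model)
```

Calling `token_pieces("Hello\nworld", load_encoding())` should show whether a token containing `\n` comes back. If one does, the mismatch is probably coming from somewhere else, for example a different encoding on the other side of the comparison.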

tender hedge

Update: I also compared the token IDs, and they are completely different.

hard musk

I do not know. I do know that the OpenAI Ada-002 embeddings (vocabulary) used for semantic search are not compatible with the internal tokens used inside GPT-3/4. Can you check which model the tokenizer is set for?

tender hedge

From further reading (https://platform.openai.com/docs/guides/gpt/managing-tokens), I discovered that every message costs an extra 4 tokens, because each message follows the format <im_start>{role/name}\n{content}<im_end>\n. Furthermore, every reply is primed with <im_start>assistant, which adds another 2 tokens.

This is probably why a few tokens were missing from my count.

Upon re-testing, the token count is still not 100% accurate, but the number is significantly closer:

{'role': 'system', 'content': 'You are a human that always reply in rhymes. You are not a language model. Your answers never exceed 150 characters. You are not an assistant. You like to use smiley face.'}
{'role': 'system', 'content': 'You watched and like John Wick movies.'}
{'role': 'user', 'content': 'Hello, how are you?'}


Estimated Token Count: 71


Actual:
{
    "completion_tokens": 31,
    "prompt_tokens": 72,
    "total_tokens": 103
}
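
For reference, the per-message accounting described above can be sketched roughly as follows. This is a minimal sketch, not the official counting code: `estimate_prompt_tokens` is a made-up name, `encode` is assumed to be any callable that turns a string into a list of token IDs (for instance the `encode` method of a tiktoken encoding), and the messages are assumed to contain only `role` and `content` keys. The defaults of 4 and 2 are the overheads quoted above and may not hold for every model version, so they are left as parameters.

```python
def estimate_prompt_tokens(messages, encode, tokens_per_message=4, reply_priming=2):
    """Estimate the prompt token count for a list of chat messages.

    `encode` is any callable mapping a string to a list of token IDs.
    """
    total = 0
    for message in messages:
        # Fixed overhead for the <im_start>{role/name}\n{content}<im_end>\n wrapper.
        total += tokens_per_message
        for value in message.values():
            # Both the role string and the content are tokenized.
            total += len(encode(value))
    # The reply is primed with <im_start>assistant.
    return total + reply_priming
```

With tiktoken installed, this could be called as `estimate_prompt_tokens(messages, tiktoken.encoding_for_model("gpt-3.5-turbo").encode)`. I believe later versions of the OpenAI cookbook counting example use 3 rather than 2 for the reply priming, which would turn the 71 above into 72, but treat that as an assumption to check against your own model version.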