#What to do when the token limit is reached?

6 messages · Page 1 of 1 (latest)

grim tendon
#

So..., if I understand it correctly there is a limit to the total token count - which is about 2 and a half pages worth of text (including the full chat history).

The question is what to do next if the chat keeps going on longer? Are there any best practice** tips and tricks** for truncating or summarizing the chat and then going from there?

blissful matrix
#

I'd like to know too. Was experimenting when suddenly the last response was cut short, and I realised I was at the maximum number of tokens. I'd thought that the API would just truncate the beginning of my prompt to allow for however many tokens were allowed in the response, but I guess that was naïve of me. What's the easiest way to see how many tokens are in the current list of messages that constitute the current chat history, so that we can truncate/summarise enough to make room for a response?

grim tendon
#

There is a tokenizer library so you could calculate the number of tokens in the request.
Each response contains the token count ( request, response and total) so if you look for that ad you go back and forth with longer and longer chat history you can handle it when you're approaching the limit.
Just not sure how 🤪
Easiest would be to do FIFO and start dropping the oldest entries

blissful matrix
#

Yeah. You can also grab the total tokens from each response (in the usage.total_tokens property), so when it starts approaching 4096, you can take steps to truncate the initial messages. I'm wondering if dropping the initial messages (except for the very first 'system' message) would have detrimental effects, unless you try to get gpt to summarise them first and feed the summary as a substitute for the messages you got rid of.

blissful matrix
#

My slapdash approach is going to be:
By always looking at the token usage included in every response, I'll keep track of exactly how many tokens each message is worth
When the usage.total_token value of a response is > 3900 (maybe lower if messages are more than 196 tokens on average), decide to truncate messages
To truncate messages, look at the tokens used by the oldest few messages (other than the initial 'system' message, which I want to keep), and remove enough messages to get the total back below a certain number (maybe 3750).

That should help keep incoming responses from being cut off. That being said, it's so cheap now, you could just look at the reason why the response ended, and if it's due to running out of tokens, do your message truncation there and then, and retry afterwards.

#

Specifically, in the response, it's choices.finish_reason == "length" that we'll need to check for. If you still want to keep your costs down though, you'll want to remove enough messages to give yourself a bit of room, otherwise you'll be forced to re-send every message, effectively doubling your token usage.