My slapdash approach is going to be:
By always looking at the token usage included in every response, I'll keep track of exactly how many tokens each message is worth
When the usage.total_token value of a response is > 3900 (maybe lower if messages are more than 196 tokens on average), decide to truncate messages
To truncate messages, look at the tokens used by the oldest few messages (other than the initial 'system' message, which I want to keep), and remove enough messages to get the total back below a certain number (maybe 3750).
That should help keep incoming responses from being cut off. That being said, it's so cheap now, you could just look at the reason why the response ended, and if it's due to running out of tokens, do your message truncation there and then, and retry afterwards.