#how big can the assistant be and does it count against system or user token limit?

10 messages · Page 1 of 1 (latest)

regal sundial
#

one of our biggest battles is fighting token caps with prompt length and balancing it between the request and the response.

I believe in order for this to work the way that it seems like it works that the assistant must be consuming tokens separately then the user and if nothing else I think maybe it sends each of those others prompts first it sends the context prompt and then it sends the assistant prompt and then it sends the user prompt and each of those I'm guessing can be up to 4,096 tokens each where the first two prompts don't have any response so it can all be in the prompt and the user prompt is the only one where we have to balance our tokens.

I want to confirm that and really in general understand what the limitations are on token usage for the assistant?

spare flame
#

Hi, the second paragraph would be easier to read if it was divided into more sentences. This is what is done in this message. A possible answer to the question is given in the second message after the message from the user below.

As an experiment, I asked ChatGPT to turn that paragraph into a list of sentences while asking it to show the original text for each element in the list.
In a separate message, I asked ChatGPT to rewrite separately the second point as its answer was not good and I copy-pasted that simplified sentence into the second element of the answer from the prompt below.

Prompt and answer

(I provided the original text where I use ... below)

break the text below into a list of sentences and provide each element of the list in the format [simpler version of the sentence] <-> [original sentence]:

"I believe in order ... our tokens."

  • The assistant and the user must consume tokens separately for it to work. <-> I believe in order for this to work the way that it seems like it works that the assistant must be consuming tokens separately then the user.

  • I believe the assistant sends three types of prompts in a certain order: first, the context prompt; second, the assistant prompt; and third, the user prompt. Each prompt can have up to 4,096 tokens. <-> I think maybe it sends each of those others prompts first it sends the context prompt and then it sends the assistant prompt and then it sends the user prompt and each of those I'm guessing can be up to 4,096 tokens each.

  • The first two prompts don't require a response and can be included in the prompt itself. <-> where the first two prompts don't have any response so it can all be in the prompt.

  • The user prompt is the only one where the token count needs to be balanced. <-> and the user prompt is the only one where we have to balance our tokens.

compact folio
spare flame
#

A possible answer to your question:

Summary of text below (maybe also see the next message concerning costs): Given the quote below, It seems the entire conversation counts in the token usage

I think the documentation for the APIs on the openAI website answers the question. Specifically, the URL that ends with platform.openai.com/docs/guides/chat/introduction which is under:

Documentation>Guides>Chat completion>Introduction>Manage tokens (you can also just visit the documentation page and use the keyboard shortcut CTRL+F to search for a part of the quote below).

I highlighted particularly relevant parts.

...if your API call used 10 tokens in the message input and you received 20 tokens in the message output, you would be billed for 30 tokens.

To see how many tokens are used by an API call, check the usage field in the API response (e.g., response[‘usage’][‘total_tokens’]).

To see how many tokens are in a text string without making an API call, use OpenAI’s tiktoken Python library. Example code can be found in the OpenAI Cookbook’s guide on how to count tokens with tiktoken.

Each message passed to the API consumes the number of tokens in the content, role, and other fields, plus a few extra for behind-the-scenes formatting. This may change slightly in the future.

If a conversation has too many tokens to fit within a model’s maximum limit (e.g., more than 4096 tokens for gpt-3.5-turbo), you will have to truncate, omit, or otherwise shrink your text until it fits. Beware that if a message is removed from the messages input, the model will lose all knowledge of it.

Note too that very long conversations are more likely to receive incomplete replies. For example,** a gpt-3.5-turbo conversation that is 4090 tokens long will have its reply cut off after just 6 tokens.**

#

Hence, from the message above it seems the entire conversation counts in the token usage which limits how much can be given to the API and implies that the costs can be significant in long conversations.

Given the costs, I think it's best to send only the user prompt and maybe the system content unless you really need to take into account all of the previous messages. A user interface design that lets you choose the messages to add to the context might make that simpler.

If the costs are irrelevant to you then you can maybe implement a moving context window (I assume that is what the ChatGPT chat page does in the background).

regal sundial
#

I mean I can think of some ways to really pack a lot of historical conversation into that assistant but if I only get 4096 for all of those combined the system the assistant and the prompt then there's no difference than me just doing this myself with DaVinci 003 other than the fact they only charge like 1/10 of the cost to use this model, which definitely is influence enough, but I was hopeful for expanded capabilities that could prolong the need for fine tuning and paying six times more than DaVinci 3 costs which is like 60 times more than this costs.

regal sundial
#

Is it not possible that system is one message, assistant is one message and the user is one message with a total possible token count of 12k in a message?

#

Okay so also here evidence that my presumption may be correct. "If adding instructions in the system message doesn't work, you can also try putting them into a user message. (In the near future, we will train our models to be much more steerable via the system message. But to date, we have trained only on a few system messages, so the models pay much more attention to user example"
"https://github.com/openai/openai-python/blob/main/chatml.md"

this implies it is a different language model handling each of these, so I would think it would be a different token count each.

gilded furnace
regal sundial
#

I still don't see this is a clear outline. What this says is that the system message costs 1 token, no matter what, which make sense since we aren't actually querying some crazy powerful language model we're querying a super simple watered down system management language model it's barely been trained yet.

It shows a plus two for the assistant, does that mean the assistant only costs two tokens no matter how much we put there?

Does mention you can do this the old school way using basically soft switches in the prompt but then it can be hacked. So maybe it is all the same and the assistant is only giving us the ability to protect our prompt without having to put a bunch of extra prompt in to protect our prompt.

I'm about to put this to the test today anyway to find out for sure.

At minimum I still feel the documentation is ambiguous.

If in case it is 4096 total, and the cost of the assistant prompt is the same amount of tokens as the user prompt, it sounds like the entirety of this new model only does two things - one provide assistant as way of protecting your pre-prompt and two, reduce the cost of your queries. But it doesn't really do anything special or new regarding capabilities of chat that you couldn't already do with DaVinci already yourself by appending a pre prompt and postprompt.