#Extract the idea and question - 000001


glass spear Jan 30, 2024, 4:56 PM

For the following text, summarize the idea, extract meaningful questions and give possible answers.

I’ve been playing around with the Assistants API to build a reservation system for restaurants, in which each client has its own thread to make the experience more customized, since it remembers previous interactions.

Retrieval is enabled with a single file of ~10k tokens that contains the restaurant menu. I’m having some serious headaches getting this to production because the input tokens are way higher than expected. I can accept the input tokens from the thread context window, function calling, and the assistant prompt, but the assistant uses retrieval for every message in a thread, even though 90% of the time it isn’t necessary, and that is making it impossible for me to progress.

Is there a way to limit when the assistant uses retrieval? I tried specifying it in the prompt, but it’s not working. Would the Embeddings API work better for this use case?

Sorry if this makes no sense, very beginner dev here :sweat_smile:

inland thicketBOT Jan 30, 2024, 4:56 PM

Hi @glass spear. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

Idea Summary:

The user is developing a restaurant reservation system using the Assistants API, where each client has their own thread for a personalized experience. The system uses Retrieval with a file containing the restaurant menu (~10k tokens). However, the user is facing issues with high input token counts, as the assistant uses Retrieval for every message in a thread, even when it's not necessary most of the time. The user is seeking a way to limit when the assistant uses Retrieval and is considering if the embeddings API would be a better fit for their use case.

Extracted Questions and Possible Answers:

Question: Is there a way to limit when the assistant uses retrieval?
Answer: Not directly. There is currently no way to specify a maximum number of tokens the Assistant or its tools should use. However, OpenAI is aware of this issue and plans to introduce support for this feature soon [source (https://community.openai.com/t/534869)].
Question: Would the embeddings API work better for this use case?
Answer: The Embeddings API could be used to build a retrieval-augmented generation (RAG) system, in which an AI-powered search runs against a knowledge database to answer queries. However, this would require splitting the knowledge into chunks, embedding them, and storing them in a vector database for semantic search, which might be more complex than using the Assistants API with Retrieval [source (https://community.openai.com/t/574160)].
Question: Why are the input tokens higher than expected?
Answer: The Assistants API currently optimizes for quality by adding all relevant content to the context of each model call, which can result in high token usage. The assistant itself and its internal functions for Retrieval and other tools also consume tokens [source (https://community.openai.com/t/595953)].
Question: Can the use of retrieval be specified in the prompt?
Answer: The model decides when to retrieve content based on the user Messages. The Assistants API automatically chooses between two retrieval techniques: it either passes the file content in the prompt for short documents, or performs a vector search for longer documents [source (https://platform.openai.com/docs/assistants/tools)]. However, there is no direct way to control this behavior through the prompt.
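
To illustrate the embeddings route mentioned above, here is a minimal Python sketch of client-side retrieval that only looks up the menu when a message seems to need it, and then passes along only the best-matching chunk rather than the whole file. Everything in it is an assumption for illustration: `MENU_CHUNKS`, `MENU_KEYWORDS`, `needs_menu`, `retrieve`, and `build_context` are hypothetical names, the keyword gate is a deliberately crude heuristic, and `embed` is a toy bag-of-words stand-in rather than a call to the Embeddings API. In a real system you would likely replace `embed` with an actual embedding model and store the vectors in a database.

```python
import math
import re
from collections import Counter

# Hypothetical menu chunks; in practice these would come from splitting the menu file.
MENU_CHUNKS = [
    "Starters: tomato soup, garlic bread, caesar salad",
    "Mains: margherita pizza, mushroom risotto, grilled salmon",
    "Desserts: tiramisu, chocolate cake, lemon sorbet",
]

# Hypothetical keyword gate: only messages containing one of these words trigger retrieval.
MENU_KEYWORDS = {"menu", "dish", "dishes", "food", "vegetarian",
                 "starters", "mains", "desserts", "price"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def needs_menu(message):
    """Return True when the message mentions a menu-related keyword."""
    return any(token in MENU_KEYWORDS for token in tokenize(message))


def retrieve(message, chunks=MENU_CHUNKS, top_k=1):
    """Return the top_k chunks ranked by similarity to the message."""
    query = embed(message)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:top_k]


def build_context(message):
    """Return menu text to prepend to the prompt, or '' when none is needed."""
    if not needs_menu(message):
        return ""
    return "\n".join(retrieve(message))
```

With this shape, a message such as "Book a table for two at 8pm" adds no menu text to the prompt, while "Which desserts do you have?" adds only the desserts chunk. The trade-off is that you own the chunking, the gate, and the search yourself instead of relying on the built-in Retrieval tool.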

