For the following text, summarize the idea, extract meaningful questions and give possible answers.
I’ve been playing around with the Assistants API to build a reservation system for restaurants, in which each client has its own thread. This makes the experience more customized, since the assistant remembers previous interactions.
Retrieval is enabled with a single file of ~10k tokens that contains the restaurant menu. I’m having serious headaches getting this to production because the input token count is way higher than expected. I can accept the input tokens that come from the thread context window, function calling, and the assistant prompt. The problem is that the assistant uses retrieval for every message in a thread, even though 90% of the time it wouldn’t be necessary, and that is making it impossible for me to progress.
Is there a way to limit when the assistant uses retrieval? I tried specifying it in the prompt, but it’s not working. Would the Embeddings API work better for this use case?
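To make the embeddings idea concrete, here is a rough sketch of what I imagine: split the menu into chunks, score each chunk against the incoming message, and only attach menu text when something scores above a threshold. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding call, and `MENU_CHUNKS`, `relevant_menu_chunks`, and the `0.2` threshold are hypothetical names and values, not anything from the Assistants API.

```python
import math
import re
from collections import Counter

# Hypothetical menu chunks; in practice these would come from
# splitting the ~10k-token menu file into smaller pieces.
MENU_CHUNKS = [
    "Starters: tomato soup, garlic bread, caesar salad",
    "Mains: margherita pizza, mushroom risotto, grilled salmon",
    "Desserts: tiramisu, chocolate cake, lemon sorbet",
]


def toy_embed(text):
    """Stand-in for a real embedding call: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Embed the menu once, up front, instead of on every message.
CHUNK_VECTORS = [(chunk, toy_embed(chunk)) for chunk in MENU_CHUNKS]


def relevant_menu_chunks(message, threshold=0.2, top_k=1):
    """Return up to top_k menu chunks, or [] if nothing is similar enough."""
    query = toy_embed(message)
    scored = sorted(
        ((cosine(query, vec), chunk) for chunk, vec in CHUNK_VECTORS),
        reverse=True,
    )
    return [chunk for score, chunk in scored[:top_k] if score >= threshold]


# A menu question pulls in one small chunk; a booking request pulls in nothing.
print(relevant_menu_chunks("Do you have tiramisu or chocolate cake?"))
print(relevant_menu_chunks("Table for two at 8pm tomorrow please"))
```

If this is the right direction, only the matching chunk would be added to the message, so most reservation messages would carry no menu tokens at all.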
Sorry if this makes no sense, very beginner dev here :sweat_smile: