I want to use the Assistants API provided by OpenAI.
I have a prompt which is quite large. Using the Chat Completions API (with Python), I have to keep sending the prompt in each call, which costs us a lot.
I am looking for ways to avoid sending the prompt again and again, and found the Assistants API mentioned above. However, the documentation is a little lacking, and I wanted to know how I can do it. Can someone help me?
#Avoid sending the Prompt everytime - Assistant API
Hi @pulsar tangle. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
The Assistants API by OpenAI is indeed designed to simplify the creation of assistant-like experiences and manage conversation state. However, it's important to note that the API does not inherently save tokens or reduce costs by managing the conversation state. In fact, the messages in a thread continue to accumulate as long as the chat session continues, which can increase the expense (source (https://community.openai.com/t/702512)).
To use the Assistants API, you first create an assistant, which encapsulates a base model, instructions, tools, and context documents. Then, you create a thread, which represents the state of a conversation. Finally, you create runs, which power the execution of an assistant on a thread, including textual responses and multi-step tool use (source (https://cookbook.openai.com/examples/assistants_api_overview_python)).
Here's an example of how to create an assistant using the Python SDK:
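from openai import OpenAI
import os

# Use the OPENAI_API_KEY environment variable, or put your key in place of the placeholder.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

# Create the assistant; its instructions are stored with the assistant object.
assistant = client.beta.assistants.create(
    name="Math Tutor",
    instructions="You are a personal math tutor. Answer questions briefly, in a sentence or less.",
    model="gpt-4-1106-preview",
)

# Print the returned assistant object as JSON.
print(assistant.model_dump_json(indent=2))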
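Creating the assistant is only the first of the steps described above. The remaining ones (thread, message, run) might look roughly like the sketch below. It assumes a recent 1.x release of the openai Python SDK that provides client.beta.threads.runs.create_and_poll; ask_assistant is a hypothetical helper name, not part of the SDK. Your code no longer sends the instructions itself, but that alone does not mean they stop counting as input tokens on each run.

```python
def ask_assistant(client, assistant_id, question):
    """Send one question to an existing assistant and return its reply text.

    `client` is expected to be an `openai.OpenAI()` instance (assumption:
    an SDK version that provides `runs.create_and_poll`).
    """
    # A thread holds the conversation state on OpenAI's side.
    thread = client.beta.threads.create()

    # Add the user's message to the thread.
    client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=question,
    )

    # Start a run and wait until it reaches a terminal state.
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id,
        assistant_id=assistant_id,
    )
    if run.status != "completed":
        raise RuntimeError(f"Run ended with status {run.status!r}")

    # Messages are listed newest first by default, so index 0 is the reply.
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    return messages.data[0].content[0].text.value
```

With the client and assistant created above, a call could look like ask_assistant(client, assistant.id, "What is 2 + 2?").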
However, if your goal is to reduce the number of requests, you might want to consider the batching feature in the OpenAI API. It allows you to send multiple prompts in the same request; the prompt parameter can hold up to 20 unique prompts (source (https://platform.openai.com/docs/guides/production-best-practices)). Note that batching lowers the request count, not the number of tokens you are billed for.
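A minimal sketch of that batching idea, assuming the 20-prompt limit quoted above and an endpoint that accepts a list in the prompt parameter. chunk_prompts is a hypothetical helper, and the commented-out call with gpt-3.5-turbo-instruct is only an example of how a batch might be sent.

```python
def chunk_prompts(prompts, batch_size=20):
    """Split a list of prompts into lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]


if __name__ == "__main__":
    prompts = [f"Summarize document {n}" for n in range(45)]
    batches = chunk_prompts(prompts)
    print([len(batch) for batch in batches])  # [20, 20, 5]

    # Each batch could then go out as a single request, for example
    # (assumption: `client` is an `openai.OpenAI()` instance and the model
    # accepts the `prompt` parameter):
    # response = client.completions.create(
    #     model="gpt-3.5-turbo-instruct", prompt=batches[0]
    # )
    # Match results back to prompts via `choice.index`, not list order.
```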
Remember, each completion request is discrete, and the OpenAI model is stateless and memoryless. Every API call you make is independent. A call cannot refer to a previous generation or a previous state of the model in a way that could “save tokens” (source (https://community.openai.com/t/801650)).
If you're still unsure about how to proceed, I recommend checking out the OpenAI Cookbook (https://cookbook.openai.com/examples/assistants_api_overview_python) and the OpenAI Community Forum (https://community.openai.com/) for more examples and discussions.
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
@pulsar tangle kapa.ai is still learning and improving, please let us know how it did by reacting below
@scenic night Isn't there a way to avoid sending the prompt every time?
Let me check the knowledge base for you. One moment! :books:
With the Chat Completions API, the conversation is passed through the messages parameter instead of the prompt parameter. The messages parameter takes a list of message objects, each with a role ('system', 'user', or 'assistant') and content, which is the text of that message. Note that this does not avoid resending earlier turns: every message the model should take into account has to be in the list on each call.
Here's an example of how you can use the messages parameter:
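from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"},
    ],
)
# The reply text is on the first choice.
print(response.choices[0].message.content)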
In this example, the earlier turns are not pasted into the 'content' of the latest user message. Instead, each turn is its own entry in the list, and the model sees the prior messages only because they are included in this request.
However, it's important to note that while this method can make your code cleaner and easier to manage, it doesn't necessarily reduce the cost of the API calls. The total number of tokens in an API call still affects the cost, and the tokens in all messages (including role and other fields) count towards this total.
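To make the cost point concrete, here is a rough back-of-the-envelope sketch. The token counts and the price per 1,000 input tokens are made-up placeholders, not real OpenAI pricing, and total_input_tokens is a hypothetical helper. It only illustrates that a fixed system prompt is billed again on every call, and that the resent history adds to the bill as the chat grows.

```python
def total_input_tokens(system_tokens, tokens_per_message, calls):
    """Input tokens billed across a chat where the full history is resent.

    Simplifying assumption: every user or assistant message has the same
    length, `tokens_per_message`.
    """
    total = 0
    for call in range(1, calls + 1):
        # Call n carries the system prompt, n user messages, and the
        # n - 1 assistant replies that came before it.
        history_messages = 2 * call - 1
        total += system_tokens + history_messages * tokens_per_message
    return total


if __name__ == "__main__":
    tokens = total_input_tokens(system_tokens=2000, tokens_per_message=50, calls=10)
    print(tokens)  # 25000

    price_per_1k = 0.01  # placeholder price, not a real rate
    print(f"{tokens / 1000 * price_per_1k:.2f}")  # 0.25
```

In this made-up scenario, 20,000 of the 25,000 input tokens come from repeating the 2,000-token system prompt ten times, which is why shortening the prompt itself tends to matter most.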
For more information, you can refer to the OpenAI API documentation (https://platform.openai.com/docs/guides/chat).
@scenic night What I need is a way to avoid sending the prompt every time we call the API, because it's too expensive
Let me check the knowledge base for you. One moment! :books:
Unfortunately, the OpenAI API is stateless and does not have the capability to remember or cache the conversation history between API calls. This means that every time you make a request, you need to provide all the necessary context that the model needs to generate a response. This includes any system message, user message, and assistant message that you want the model to consider when generating its response (source (https://community.openai.com/t/801650)).
However, there are a few strategies you can use to manage the cost:
- Batching: If you are sending multiple requests to the same endpoint, you can batch the prompts to be sent in the same request. This will reduce the number of requests you need to make. The prompt parameter can hold up to 20 unique prompts (source (https://platform.openai.com/docs/guides/production-best-practices)).
- Reducing the number of tokens: You can try to reduce the number of tokens required by using shorter prompts, fine-tuning models, or caching common user queries so that they don't need to be processed repeatedly (source (https://platform.openai.com/docs/guides/production-best-practices)).
- Using embeddings: If your prompt contains a lot of static information (like laws or car models), you could consider using an embeddings-based Retrieval Augmented Generation (RAG) approach. This involves converting the static information into embedding vectors and storing them in a vector database. When you have a new query, you convert the query into an embedding and identify the most related embeddings (source (https://community.openai.com/t/703956)).
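The embeddings strategy can be sketched as follows. To keep the example runnable offline, the vectors are tiny hand-written stand-ins; in practice they would come from an embeddings model and be stored in a vector database. The names top_k and store, and the sample texts, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "vector database": (chunk of static text, its embedding).
store = [
    ("Return policy: 30 days", [0.9, 0.1, 0.0]),
    ("Shipping takes 3-5 days", [0.1, 0.9, 0.0]),
    ("Warranty covers 1 year", [0.7, 0.0, 0.3]),
]

# Stand-in for the embedding of the user's query.
query_vec = [1.0, 0.0, 0.1]

context = top_k(query_vec, store, k=2)
print(context)  # ['Return policy: 30 days', 'Warranty covers 1 year']
```

Only the retrieved chunks would then be placed in the prompt, instead of the full static text, which is where the token savings come from.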