Hello, can someone explain how my 500-word request ends up using 16,000 input tokens? From what I’ve researched, it seems that all previous answers in the conversation are sent back with each API call to maintain context. I’d like to know how I can reduce input tokens for smaller requests while still making the AI remember the conversation.
#Api input
1 messages · Page 1 of 1 (latest)
Hi @loud oar. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
It’s likely not just your 500-word prompt being counted. In multi‑turn conversations, apps commonly resend prior turns to preserve context; those tokens are billed again each call, and can snowball further when tools (e.g., web search) or reasoning models add large intermediate items. Cached tokens lower cost but still show up in usage counts. Reasoning tokens also occupy the context window, and tool outputs (e.g., scraped pages) can massively inflate inputs (even 10×–50×) (Managing the context window; How token counts are calculated; Reasoning works; Forum answer on web search inflation).
Ways to keep input tokens low while preserving memory:
- Use the Responses API with previous_response_id to thread state without manually resending long histories; note all previous input tokens in the chain are still billed, but this avoids reconstructing context and can reduce extra overhead and re‑reasoning (OpenAI APIs for conversation state).
- Periodically summarize older turns and keep only the last few verbatim (“context trim”). Replace long histories with a compact summary message (Context summarization with Realtime; Data‑intensive Realtime apps).
- Prune/tool-filter context: include only relevant snippets (e.g., RAG pruning, clean HTML), and keep a stable prompt prefix to maximize cache reuse (Use fewer input tokens).
- Limit reasoning when not needed (use non‑reasoning models for simple turns) to avoid extra reasoning tokens; if you do use reasoning, cap with max_output_tokens and prefer Responses API to reuse reasoning items efficiently (Reasoning best practices; How reasoning works; Reduce latency and tokens via caching).
- If using tools like web search, be aware retrieved page contents may be ingested into context; restrict the number/size of tool results (Forum post on web search inflation).