Problem: James web surface via OpenResponses has a 17.9s TTFT on GPT-5.4. The same model on the main hive via Discord responds in 2-3s. The run.started event fires at 0.5s, but the first run.delta doesn't arrive until 17.9s, so there's ~17s of pipeline overhead between the request hitting OpenResponses and the model actually streaming tokens back.
What we need to understand:
- What does OpenResponses do between receiving the POST and calling the model API? Is it loading bootstrap files, running memory_search (Gemini embedding), initialising the session, running hooks? Can we see a timing breakdown or enable debug logging to isolate which step is eating the time?
- Is there a "fast path" or lite mode? For web surface requests where TTFT matters (especially voice), can we skip or defer memory_search, reduce bootstrap loading, or use a pre-warmed session?
- Are other OpenClaw operators running web surfaces hitting this? What TTFT do they see? What's the expected baseline for OpenResponses → streaming first token?
- Does chat.send via WebSocket skip the overhead that OpenResponses HTTP adds? Our Krill spec uses WS for text/voice turns and OpenResponses only for file turns; would routing ALL turns through WS fix this?
- Is there a config flag to disable memory_search per-request or per-agent? Or a way to make it async (stream starts, memory loads in background)?
Context: Gateway WS text turns on the main hive are fast (2-3s). Only the OpenResponses HTTP path on the Meta You instance is slow. Both use the same OpenAI Pro token, which suggests the overhead is in the OpenResponses pipeline, not the model or the network.
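For reference, here is a minimal sketch of how the run.started / run.delta gap could be measured on the client side, so the same number can be compared across both paths. It assumes the response is streamed as Server-Sent Events with `event:` lines carrying the event name; that framing is our assumption, not something we have confirmed about OpenResponses. The helper names (`first_event_times`, `startup_gap`) are hypothetical.

```python
import time
from typing import Callable, Dict, Iterable


def first_event_times(
    lines: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, float]:
    """Seconds from the start of iteration to the first sighting of each event name.

    `lines` is any iterable of decoded text lines from a streaming response.
    Assumes SSE framing ("event: <name>"); adjust the parsing if the stream differs.
    """
    start = clock()
    seen: Dict[str, float] = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("event:"):
            continue  # skip data lines, comments and blank separators
        name = line[len("event:"):].strip()
        if name not in seen:
            # Record only the first arrival of each event type.
            seen[name] = clock() - start
    return seen


def startup_gap(
    times: Dict[str, float],
    started: str = "run.started",
    delta: str = "run.delta",
) -> float:
    """Time between the run starting and the first streamed delta."""
    return times[delta] - times[started]
```

The clock starts when iteration begins, so the function should be called immediately after the POST is sent. Passing `clock` in makes it possible to replay a recorded trace with fixed timestamps instead of a live stream.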
Any questions to help you help us?