TTFT Voice | Friends of the Crustacean 🦞🤝 | Page 1

jagged marsh Apr 13, 2026, 8:43 AM

Problem: James web surface via OpenResponses has 17.9s TTFT on GPT-5.4. Same model on main hive via Discord responds in 2-3s. The run.started fires at 0.5s but first run.delta doesn't arrive until 17.9s — so there's ~17s of pipeline overhead between the request hitting OpenResponses and the model actually streaming tokens back.

What we need to understand:

What does OpenResponses do between receiving the POST and calling the model API? Is it loading bootstrap files, running memory_search (Gemini embedding), initialising the session, running hooks? Can we see a timing breakdown or enable debug logging to isolate which step is eating the time?

Is there a "fast path" or lite mode? For web surface requests where TTFT matters (especially voice), can we skip or defer memory_search, reduce bootstrap loading, or use a pre-warmed session?

Are other OpenClaw operators running web surfaces hitting this? What TTFT do they see? What's the expected baseline for OpenResponses → streaming first token?

Does chat.send via WebSocket skip the overhead that OpenResponses HTTP adds? Our Krill spec uses WS for text/voice turns and OpenResponses only for file turns — would routing ALL turns through WS fix this?

Is there a config flag to disable memory_search per-request or per-agent? Or a way to make it async (stream starts, memory loads in background)?

Context: Gateway WS text turns on the main hive are fast (2-3s). Only OpenResponses HTTP path on the Meta You instance is slow. Both use the same OpenAI Pro token. This suggests the overhead is in the OpenResponses pipeline, not the model or network.

Any questions that would help you help us?

clever sandalBOT Apr 13, 2026, 8:54 AM

Short version: I don’t think the 17s is the tiny HTTP shim itself. For text-only turns, most of that gap is very likely in the shared embedded-agent pre-stream path.

What I found:

What OpenResponses does before first token
- HTTP auth, body parse, schema validation
- session routing, stateless by default unless you send user, previous_response_id, or x-openclaw-session-key
- optional input_file / input_image fetch + decode + parse, PDFs can rasterize pages
- prompt normalization via buildAgentPrompt(...)
- then agentCommandFromIngress(...), which drops into the shared embedded runner

Inside the shared runner, before activeSession.prompt(...), there is real blocking work:
- bootstrap/system prompt assembly
- session open/load
- context engine assembly
- before_prompt_build hook handling
- image/file preprocessing
- compaction / token-budget checks

What is probably not the default culprit
- memory_search looks like a tool, not an automatic preflight step on every OpenResponses call
- active memory is separate, and docs gate it to eligible interactive persistent chat sessions, so it is less likely unless that agent/session is set up that way
- llm_input hook is fired async, so that one is not my first suspect

Fast path / lite mode
I did not find a per-request OpenResponses flag to skip memory/bootstrap.

What you can do is make a lean web agent:
- agents.defaults.memorySearch.enabled: false
- optionally deny memory_search / memory_get in tool policy
- agents.defaults.contextInjection: "continuation-skip"
- reduce bootstrapMaxChars / bootstrapTotalMaxChars
- skipBootstrap: true only stops auto-creating bootstrap files, it does not magically skip injecting existing ones
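
As a rough sketch, the lean-agent settings above might sit together like this. The nesting for memorySearch and contextInjection follows the dotted paths in the list. Placing bootstrapMaxChars and bootstrapTotalMaxChars under agents.defaults, and the numeric values themselves, are assumptions for illustration, so check them against the config reference for your version.

```json5
{
  agents: {
    defaults: {
      // Turn off the memory_search tool for this agent profile.
      memorySearch: { enabled: false },
      contextInjection: "continuation-skip",
      // Assumption: key location and values are illustrative only.
      bootstrapMaxChars: 2000,
      bootstrapTotalMaxChars: 8000
    }
  }
}
```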

WS chat.send vs OpenResponses
- WS does avoid OpenResponses-specific HTTP work and file parsing
- WS does not bypass the same shared runner underneath

So, routing all plain text/voice through WS is a good bet, but it only fully fixes things if the 17s is mostly:
- stateless OpenResponses session cold start
- file/PDF/input preprocessing
- request-shape differences

Timing / diagnostics
I did not find a built-in SSE event that cleanly means “provider call started”.
response.created is sent early, before first assistant delta, so it is not a reliable model-start marker.

Best next diagnostics:
- openclaw --version
- openclaw logs --follow
- one run with OPENCLAW_LOG_LEVEL=debug
- optionally OPENCLAW_CACHE_TRACE=1 / diagnostics.cacheTrace.enabled: true

Most useful questions back to you:

What does openclaw --version show on the Meta You instance?
Are the slow turns text-only, or do they include input_file / input_image / PDF URLs?
Are you sending user, previous_response_id, or x-openclaw-session-key, or is every /v1/responses call fresh/stateless?
Is active-memory enabled for that agent?
Do you have any before_prompt_build, before_agent_start, or agent:bootstrap hooks?

If you paste one slow request’s debug log window, I can help narrow it down pretty fast.

jagged marsh Apr 13, 2026, 9:06 AM

Krill's 5 questions — our answers:

openclaw --version on Meta You: 2026.4.5 (isolated binary at /home/metayou/.local/bin/openclaw, independently upgradeable)

Text-only or files? All slow turns are text-only. No input_file, no input_image, no PDFs. Just plain text prompts via the web surface.

Session handling: We send x-openclaw-session-key (the James session key). Not stateless — it's routed to James's persistent session. We also send user in the request body. No previous_response_id.

Active memory: Need Tony to verify. James has the memory tool enabled and likely has active memory on (it's an interactive persistent chat session, exactly the type Krill says would qualify). This could be a significant contributor.

Hooks: Yes — James has an agent:bootstrap hook (the hive-mq-inbox hook that reads HIVE_INBOX.md on session start). No custom before_prompt_build or before_agent_start hooks that we've added, but default OpenClaw hooks may be active. Need Tony to check config.

Bonus context: Bootstrap files are lean — James total is ~3.3KB across SOUL/AGENTS/TOOLS/MEMORY. bootstrapMaxChars: 4000, bootstrapTotalMaxChars: 20000. The session is persistent (no idle timeout reset). memory_search_count: 0 on the probed turns.

Can get you a debug log window if you tell us the exact command — Tony can run OPENCLAW_LOG_LEVEL=debug on the Meta You instance for one request.

clever sandalBOT Apr 13, 2026, 9:08 AM

That’s super helpful. With those answers, my suspect list changed a lot.

My current ranking:

- Active memory is now the top suspect. The docs show/recommend timeoutMs: 15000 for active memory, which is very close to your ~17s gap. If James is eligible for active memory, this is a really strong fit.
- agent:bootstrap hook is still in play. Important nuance: agent:bootstrap runs while the bootstrap context is being built, before the prompt is finalized. It is not just “session file created once long ago”.
- Shared session/context assembly
- memory_search is now low on the list, especially since you already saw memory_search_count: 0

So yes, I’m now pretty suspicious of active memory first, then the bootstrap hook.

Fastest isolation test

If Tony can do only one thing, do this:

temporarily disable active memory for James
send the same text-only OpenResponses request again
compare TTFT

If it drops from ~18s to normal, you’ve basically found it.

If active memory is the cause, the likely fixes are:

disable it for the web agent
lower timeoutMs
use a lighter queryMode like message
keep it on for Discord, off for TTFT-sensitive web/voice

Exact log command

One important nuance: setting OPENCLAW_LOG_LEVEL=debug on openclaw logs only affects the CLI, not the already-running daemon.

So for a service-managed gateway, the reliable path is:

Temporarily set in config:

{
  logging: {
    level: "debug"
  },
  plugins: {
    entries: {
      "active-memory": {
        config: {
          logging: true
        }
      }
    }
  }
}

Restart the gateway:

/home/metayou/.local/bin/openclaw gateway restart

Tail logs in another terminal:

/home/metayou/.local/bin/openclaw logs --follow --plain --local-time

Fire one slow text-only OpenResponses request
Paste the log window from request receipt through first streamed token

What I’d like Tony to verify

plugins.entries.active-memory.enabled
plugins.entries.active-memory.config.agents
plugins.entries.active-memory.config.timeoutMs
plugins.entries.active-memory.config.queryMode
whether the hive-mq-inbox agent:bootstrap hook runs on every OpenResponses turn

Best next A/B order

active memory off
active memory back on, bootstrap hook off
both on

That should isolate this pretty quickly.

If you want, I can also give you a tiny copy-paste checklist for Tony so the test is dead simple.

jagged marsh Apr 13, 2026, 9:27 AM

Active memory confirmed as root cause. Plugin was not explicitly configured — running on defaults (enabled for all agents, default timeoutMs). After "active-memory": { "enabled": false }, gap dropped from 15.7s / 43.5s to 0.004s / 0.005s on warm probes. Cold probe 0.016s. Embedded runner unchanged at ~1.7s. James web TTFT is now sub-second.
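
For anyone hitting the same thing, the change would look roughly like this in config. This is a sketch that assumes the entry sits under plugins.entries, matching the plugins.entries.active-memory.enabled path Krill asked Tony to verify. Merge it into your existing config instead of replacing the file.

```json5
{
  plugins: {
    entries: {
      "active-memory": {
        // Disables active memory entirely, including the blocking pre-stream step.
        enabled: false
      }
    }
  }
}
```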

clever sandalBOT Apr 13, 2026, 9:28 AM

Nice, that’s a great catch. That lines up perfectly with the 15s-ish stall.

So the root cause was active memory, not OpenResponses HTTP itself and not memory_search tool calls.

Useful takeaway:

active memory can block TTFT hard on eligible persistent chat sessions
if it isn’t explicitly scoped, it can surprise you by applying more broadly than expected
for web/voice surfaces, I’d keep it off or use a separate lean agent

If you still want active memory somewhere, I’d make it explicit instead of default-ish:

only target specific agents
lower timeoutMs
use a lighter queryMode
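
A minimal sketch of what an explicit setup could look like, using the key paths named earlier in the thread (config.agents, config.timeoutMs, config.queryMode). The agent id, the array shape of agents, and the timeout value are assumptions for illustration, not documented defaults.

```json5
{
  plugins: {
    entries: {
      "active-memory": {
        enabled: true,
        config: {
          // Hypothetical agent id; assumes agents takes a list of ids.
          agents: ["main"],
          // Illustrative value, lower than the 15000 shown in the docs.
          timeoutMs: 3000,
          // The lighter mode mentioned earlier in the thread.
          queryMode: "message"
        }
      }
    }
  }
}
```

With something like this, TTFT-sensitive web/voice agents stay out of the list and never wait on the active-memory step.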

Docs:

https://docs.openclaw.ai/concepts/active-memory
