#TTFT Voice


jagged marsh

Problem: James web surface via OpenResponses has 17.9s TTFT on GPT-5.4. Same model on main hive via Discord responds in 2-3s. The run.started fires at 0.5s but first run.delta doesn't arrive until 17.9s — so there's ~17s of pipeline overhead between the request hitting OpenResponses and the model actually streaming tokens back.
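
A minimal sketch of how that gap can be computed from a captured event stream, assuming each event is recorded with a client-side arrival time in seconds since the POST was sent. The event names run.started and run.delta and the two timings come from the measurement above; the `Event` tuple and the helper names are hypothetical and exist only for illustration.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    t: float     # seconds since the POST was sent (client-side clock)
    name: str    # event name, e.g. "run.started" or "run.delta"


def first_time(events: list[Event], name: str) -> Optional[float]:
    """Return the arrival time of the first event with this name, or None."""
    return next((e.t for e in events if e.name == name), None)


def pre_stream_gap(events: list[Event]) -> Optional[float]:
    """Seconds between run.started and the first run.delta, if both were seen."""
    started = first_time(events, "run.started")
    first_delta = first_time(events, "run.delta")
    if started is None or first_delta is None:
        return None
    return first_delta - started


# The two timings reported for the slow turn.
slow_turn = [Event(0.5, "run.started"), Event(17.9, "run.delta")]
gap = pre_stream_gap(slow_turn)
print(f"TTFT={first_time(slow_turn, 'run.delta')}s, pre-stream gap={gap:.1f}s")
```

Running the same calculation on a fast WS turn would give a like-for-like number to compare against.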

What we need to understand:

  1. What does OpenResponses do between receiving the POST and calling the model API? Is it loading bootstrap files, running memory_search (Gemini embedding), initialising the session, running hooks? Can we see a timing breakdown or enable debug logging to isolate which step is eating the time?
  2. Is there a "fast path" or lite mode? For web surface requests where TTFT matters (especially voice), can we skip or defer memory_search, reduce bootstrap loading, or use a pre-warmed session?
  3. Are other OpenClaw operators running web surfaces hitting this? What TTFT do they see? What's the expected baseline for OpenResponses → streaming first token?
  4. Does chat.send via WebSocket skip the overhead that OpenResponses HTTP adds? Our Krill spec uses WS for text/voice turns and OpenResponses only for file turns — would routing ALL turns through WS fix this?
  5. Is there a config flag to disable memory_search per-request or per-agent? Or a way to make it async (stream starts, memory loads in background)?

Context: Gateway WS text turns on the main hive are fast (2-3s). Only OpenResponses HTTP path on the Meta You instance is slow. Both use the same OpenAI Pro token. This suggests the overhead is in the OpenResponses pipeline, not the model or network.

Any questions for us that would help you help us?

clever sandalBOT

Short version: I don’t think the 17s is the tiny HTTP shim itself. For text-only turns, most of that gap is very likely in the shared embedded-agent pre-stream path.

What I found:

  1. What OpenResponses does before first token

    • HTTP auth, body parse, schema validation
    • session routing, stateless by default unless you send user, previous_response_id, or x-openclaw-session-key
    • optional input_file / input_image fetch + decode + parse, PDFs can rasterize pages
    • prompt normalization via buildAgentPrompt(...)
    • then agentCommandFromIngress(...), which drops into the shared embedded runner

    Inside the shared runner, before activeSession.prompt(...), there is real blocking work:

    • bootstrap/system prompt assembly
    • session open/load
    • context engine assembly
    • before_prompt_build hook handling
    • image/file preprocessing
    • compaction / token-budget checks
  2. What is probably not the default culprit

    • memory_search looks like a tool, not an automatic preflight step on every OpenResponses call
    • active memory is separate, and docs gate it to eligible interactive persistent chat sessions, so it is less likely unless that agent/session is set up that way
    • llm_input hook is fired async, so that one is not my first suspect
  3. Fast path / lite mode
    I did not find a per-request OpenResponses flag to skip memory/bootstrap.

    What you can do is make a lean web agent:

    • agents.defaults.memorySearch.enabled: false
    • optionally deny memory_search / memory_get in tool policy
    • agents.defaults.contextInjection: "continuation-skip"
    • reduce bootstrapMaxChars / bootstrapTotalMaxChars
    • skipBootstrap: true only stops auto-creating bootstrap files; it does not skip injecting existing ones
  4. WS chat.send vs OpenResponses

    • WS does avoid OpenResponses-specific HTTP work and file parsing
    • WS does not bypass the same shared runner underneath

    So, routing all plain text/voice through WS is a good bet, but it only fully fixes things if the 17s is mostly:

    • stateless OpenResponses session cold start
    • file/PDF/input preprocessing
    • request-shape differences
  5. Timing / diagnostics
    I did not find a built-in SSE event that cleanly means “provider call started”.
    response.created is sent early, before first assistant delta, so it is not a reliable model-start marker.

    Best next diagnostics:

    • openclaw --version
    • openclaw logs --follow
    • one run with OPENCLAW_LOG_LEVEL=debug
    • optionally OPENCLAW_CACHE_TRACE=1 / diagnostics.cacheTrace.enabled: true


Most useful questions back to you:

  1. What does openclaw --version show on the Meta You instance?
  2. Are the slow turns text-only, or do they include input_file / input_image / PDF URLs?
  3. Are you sending user, previous_response_id, or x-openclaw-session-key, or is every /v1/responses call fresh/stateless?
  4. Is active-memory enabled for that agent?
  5. Do you have any before_prompt_build, before_agent_start, or agent:bootstrap hooks?

If you paste one slow request’s debug log window, I can help narrow it down pretty fast.

jagged marsh

Krill's 5 questions, with our answers:

  1. openclaw --version on Meta You: 2026.4.5 (isolated binary at /home/metayou/.local/bin/openclaw, independently upgradeable)
  2. Text-only or files? All slow turns are text-only. No input_file, no input_image, no PDFs. Just plain text prompts via the web surface.
  3. Session handling: We send x-openclaw-session-key (the James session key). Not stateless: requests are routed to James's persistent session. We also send user in the request body. No previous_response_id.
  4. Active memory: Need Tony to verify. James has the memory tool enabled and likely has active memory on (it's an interactive persistent chat session, exactly the type Krill says would qualify). This could be a significant contributor.
  5. Hooks: Yes. James has an agent:bootstrap hook (the hive-mq-inbox hook that reads HIVE_INBOX.md on session start). We have not added any custom before_prompt_build or before_agent_start hooks, but default OpenClaw hooks may be active. Need Tony to check config.
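
For reference, the request shape described in the session-handling answer can be sketched roughly as below. The /v1/responses path, the x-openclaw-session-key header, and the user field come from this thread. Everything else is an assumption or placeholder: the gateway host and port, the bearer-token auth scheme, the model string, the stream flag, and the session key and user values. The code only builds the request; it does not send it.

```python
import json
import urllib.request

# Placeholders, not real values: adjust to the actual gateway and credentials.
GATEWAY_URL = "http://127.0.0.1:8080/v1/responses"  # assumed host and port
SESSION_KEY = "james-session-key"                    # hypothetical value


def build_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a text-only request pinned to a session."""
    body = {
        "model": "gpt-5.4",   # placeholder model identifier
        "input": prompt,      # plain text, no input_file / input_image
        "user": "james-web",  # hypothetical user value
        "stream": True,       # assumed streaming flag
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer <token>",      # assumed auth scheme
            "x-openclaw-session-key": SESSION_KEY,  # header named in this thread
        },
    )


req = build_request("hello")
print(req.get_method(), req.full_url)
```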

Bonus context: Bootstrap files are lean — James total is ~3.3KB across SOUL/AGENTS/TOOLS/MEMORY. bootstrapMaxChars: 4000, bootstrapTotalMaxChars: 20000. The session is persistent (no idle timeout reset). memory_search_count: 0 on the probed turns.

Can get you a debug log window if you tell us the exact command — Tony can run OPENCLAW_LOG_LEVEL=debug on the Meta You instance for one request.

clever sandalBOT

That’s super helpful. With those answers, my suspect list changed a lot.

My current ranking:

  1. Active memory is now the top suspect
    The docs literally show/recommend timeoutMs: 15000 for active memory, which is very close to your ~17.9s gap. If James is eligible for active memory, this is a really strong fit.
  2. agent:bootstrap hook is still in play
    Important nuance: agent:bootstrap runs while building bootstrap context, before the prompt is finalized. It is not just a “session file created once long ago” step.
  3. Shared session/context assembly
  4. memory_search is now low on the list, especially since you already saw memory_search_count: 0

So yes, I’m now pretty suspicious of active memory first, then the bootstrap hook.
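
The timeout fit can be checked with rough arithmetic. This is a back-of-the-envelope sketch using only figures already quoted in the thread; treating the 2-3s Discord response time as the baseline for everything other than the stall is an assumption, not a measurement.

```python
# Figures reported earlier in the thread, in seconds.
run_started = 0.5
first_delta = 17.9
fast_path_ttft = (2.0, 3.0)   # Discord/WS turns on the main hive
timeout_s = 15000 / 1000      # timeoutMs: 15000 from the docs

gap = first_delta - run_started

# Assumption: subtracting the "normal" response time leaves the unexplained stall.
residual_low = gap - fast_path_ttft[1]
residual_high = gap - fast_path_ttft[0]

fits = residual_low <= timeout_s <= residual_high
print(f"gap={gap:.1f}s, unexplained={residual_low:.1f}-{residual_high:.1f}s, "
      f"timeout={timeout_s:.0f}s, timeout within range: {fits}")
```

A 15s timeout sitting inside the unexplained range is suggestive rather than conclusive, which is why the A/B test is still the deciding step.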

Fastest isolation test

If Tony can do only one thing, do this:

  • temporarily disable active memory for James
  • send the same text-only OpenResponses request again
  • compare TTFT

If it drops from ~18s to normal, you’ve basically found it.

If active memory is the cause, the likely fixes are:

  • disable it for the web agent
  • lower timeoutMs
  • use a lighter queryMode like message
  • keep it on for Discord, off for TTFT-sensitive web/voice

Exact log command

One important nuance: setting OPENCLAW_LOG_LEVEL=debug on openclaw logs only affects the CLI, not the already-running daemon.

So for a service-managed gateway, the reliable path is:

  1. Temporarily set in config:
{
  logging: {
    level: "debug"
  },
  plugins: {
    entries: {
      "active-memory": {
        config: {
          logging: true
        }
      }
    }
  }
}
  2. Restart the gateway:
/home/metayou/.local/bin/openclaw gateway restart
  3. Tail logs in another terminal:
/home/metayou/.local/bin/openclaw logs --follow --plain --local-time
  4. Fire one slow text-only OpenResponses request
  5. Paste the log window from request receipt through first streamed token
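
Once the log window is captured, the stall can usually be spotted as the largest jump between consecutive timestamps. Below is a small sketch of that idea. It assumes each log line contains an HH:MM:SS time with optional fractional seconds; the actual OpenClaw log format may differ, and the sample lines are made-up placeholders, not real OpenClaw output. `parse_time` and `largest_gaps` are hypothetical helper names.

```python
import re

# Assumption: each line carries an HH:MM:SS(.fff) time somewhere in it.
TIME_RE = re.compile(r"(?<!\d)(\d{2}):(\d{2}):(\d{2})(?:[.,](\d{1,6}))?")


def parse_time(line):
    """Return seconds since midnight for the first time found, or None."""
    m = TIME_RE.search(line)
    if not m:
        return None
    h, mi, s = int(m[1]), int(m[2]), int(m[3])
    frac = float("0." + m[4]) if m[4] else 0.0
    return h * 3600 + mi * 60 + s + frac


def largest_gaps(lines, top=3):
    """Return the `top` largest (gap_seconds, line_before, line_after) tuples."""
    stamped = [(t, ln) for ln in lines if (t := parse_time(ln)) is not None]
    gaps = [(b[0] - a[0], a[1], b[1]) for a, b in zip(stamped, stamped[1:])]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Placeholder lines for illustration only.
sample = [
    "12:00:00.500 step A",
    "12:00:01.000 step B",
    "12:00:16.700 step C",
    "12:00:18.400 step D",
]
for gap_s, before, after in largest_gaps(sample, top=1):
    print(f"{gap_s:.1f}s between:\n  {before}\n  {after}")
```

The sketch ignores midnight rollover and untimestamped continuation lines, which is fine for a single short request window.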

What I’d like Tony to verify

  • plugins.entries.active-memory.enabled
  • plugins.entries.active-memory.config.agents
  • plugins.entries.active-memory.config.timeoutMs
  • plugins.entries.active-memory.config.queryMode
  • whether the hive-mq-inbox agent:bootstrap hook runs on every OpenResponses turn

Best next A/B order

  1. active memory off
  2. active memory back on, bootstrap hook off
  3. both on

That should isolate this pretty quickly.


If you want, I can also give you a tiny copy-paste checklist for Tony so the test is dead simple.

jagged marsh

Active memory confirmed as root cause. Plugin was not explicitly configured — running on defaults (enabled for all agents, default timeoutMs). After "active-memory": { "enabled": false }, gap dropped from 15.7s / 43.5s to 0.004s / 0.005s on warm probes. Cold probe 0.016s. Embedded runner unchanged at ~1.7s. James web TTFT is now sub-second.

clever sandalBOT

Nice, that’s a great catch. That lines up perfectly with the 15s-ish stall.

So the root cause was active memory, not OpenResponses HTTP itself and not memory_search tool calls.

Useful takeaway:

  • active memory can block TTFT hard on eligible persistent chat sessions
  • if it isn’t explicitly scoped, it can surprise you by applying more broadly than expected
  • for web/voice surfaces, I’d keep it off or use a separate lean agent

If you still want active memory somewhere, I’d make it explicit instead of default-ish:

  • only target specific agents
  • lower timeoutMs
  • use a lighter queryMode
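
As a rough sketch of what "explicit" could look like, the snippet below builds a config fragment from the key names mentioned in this thread (plugins.entries.active-memory with enabled, config.agents, config.timeoutMs, config.queryMode) and prints it as JSON. The agent id, the timeout value, and the assumption that agents takes a list of agent ids are illustrative guesses; check them against the active-memory docs before use.

```python
import json

# Key names come from the thread; values marked hypothetical are not recommendations.
active_memory_entry = {
    "enabled": True,
    "config": {
        "agents": ["main"],      # hypothetical agent id; list shape is an assumption
        "timeoutMs": 5000,       # hypothetical value, lower than the 15000 shown in docs
        "queryMode": "message",  # the lighter mode mentioned earlier in the thread
    },
}

config_fragment = {"plugins": {"entries": {"active-memory": active_memory_entry}}}
print(json.dumps(config_fragment, indent=2))
```

Leaving the web/voice agent out of the agents list is what keeps TTFT-sensitive surfaces off the blocking path, if the scoping works the way this sketch assumes.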
