#OpenClaw 2026.5.7: Gateway event-loop stalls with native Codex + Telegram/TTS

orchid elk

OpenClaw 2026.5.7 on Linux VPS, user gateway. Main + sub-agent use native Codex (openai/gpt-5.x, agentRuntime.id:"codex", codex plugin). Telegram with multiple bot accounts, Google/Gemini batch TTS enabled. Previously same host/account handled 5–6 parallel agents; now 2 agents can trigger stalls.

Health stays ok=true, but eventLoop.degraded=true and sessions.list jumps from ~70–170ms to ~17–21s. Logs during bad windows show Telegram getMe fetch timeouts, while direct curl to Bot API from host is ~60–90ms. Gemini TTS generateContent sometimes times out after 30s and seems to delay Telegram text delivery. Codex also logs: app-server connection closed during startup, client is closed, and Codex agent harness failed; not falling back to embedded PI backend.

Question: known 2026.5.7 regression/config drift? Can blocked fetch/TTS/Codex paths stall gateway WS calls? Should doctor --fix be first step?

terse krakenBOT

Yes — this lines up with a real current report, not just “your VPS is weird”: https://github.com/openclaw/openclaw/issues/79757

A few grounded points:

  • health ok=true only means the HTTP liveness path can answer. It does not mean the Gateway control plane is healthy. The detailed readiness/health event-loop block is the important part here: eventLoop.degraded=true is emitted when event-loop delay is high / sustained utilization+delay is bad, so a 17–21s sessions.list is absolutely consistent with gateway process starvation.
  • Async fetch() should not normally block Node’s event loop by itself. But synchronous prep work, plugin/runtime init, retry storms, high callback pressure, child-process startup churn, or tight delivery pipelines can starve the same Gateway process that serves WS RPC. So yes: Telegram/TTS/Codex paths can indirectly make sessions.list slow because they share the Gateway process/control plane.
  • Gemini batch TTS timing out after ~30s can plausibly delay Telegram delivery if auto-TTS is in the final-send path. I’d temporarily disable auto TTS or switch it to tagged/off while debugging; if text delivery becomes fast again, that’s a very strong isolating signal.
  • Telegram getMe timing out while curl is 60–90ms usually points away from raw network latency and toward Gateway-side event-loop delay, proxy/dispatcher path differences, or request contention. Docs note Telegram uses its own undici dispatcher and honors process proxy/OPENCLAW_PROXY_URL, so check proxy env/config too if you use one.
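
To make the first point concrete, here is a minimal sketch of how a monitoring script might read those signals together. Only `ok` and `eventLoop.degraded` are named in this thread; the exact JSON nesting, the `sessionsListMs` argument, and the 1000 ms threshold are assumptions for illustration, not OpenClaw's documented schema.

```typescript
// Hypothetical snapshot shape: the nesting is assumed, not taken from OpenClaw docs.
interface HealthSnapshot {
  ok: boolean;
  eventLoop?: { degraded?: boolean };
}

type Verdict = "healthy" | "starved" | "down";

// Combine liveness, the event-loop flag, and a measured sessions.list latency.
// `slowMs` is an arbitrary illustrative threshold.
function classifyGateway(
  health: HealthSnapshot,
  sessionsListMs: number,
  slowMs = 1000,
): Verdict {
  // Liveness failing means the process cannot even answer HTTP.
  if (!health.ok) return "down";
  // ok=true alone is not enough: a degraded loop or slow RPC means starvation.
  const degraded = health.eventLoop?.degraded === true;
  if (degraded || sessionsListMs >= slowMs) return "starved";
  return "healthy";
}
```

With the numbers from this thread, `classifyGateway({ ok: true, eventLoop: { degraded: true } }, 17000)` yields `"starved"` even though liveness is fine.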

I would make doctor --fix the first repair step, yes — especially on 2026.5.x with Codex/plugin/config migrations. It backs up config, repairs legacy openai-codex/* routing into canonical OpenAI model refs with agentRuntime.id:"codex", relinks/repairs managed plugins, and cleans stale plugin/config drift. Use:

openclaw config validate
openclaw doctor --fix --non-interactive
openclaw gateway restart

…but don’t treat doctor as proof it’s fixed. If it still reproduces after doctor, I’d isolate in this order:

  1. Disable auto/batch TTS temporarily (tts.auto: "off" or channel TTS off) and retest 2 parallel agents.
  2. Disable extra Telegram bot accounts except one and retest.
  3. Confirm the Telegram account timeout/proxy/network config; on Node 22+ Telegram defaults to IPv4-first, but explicit proxy/env drift can still matter.
  4. Capture during a bad window: openclaw gateway health --json, openclaw gateway ready --json if available, and the log span covering eventLoop.degraded, getMe, TTS timeout, and Codex app-server close.
  5. Add that to issue #79757 or open a focused issue if your trace differs materially.

Docs refs: https://docs.openclaw.ai/cli/doctor, https://docs.openclaw.ai/cli/gateway, https://docs.openclaw.ai/channels/telegram

orchid elk

Follow-up after openclaw doctor --fix --non-interactive + gateway restart.

Doctor repaired legacy openai-codex/ defaults to openai/ and set agentRuntime.id:"codex". Config validates. After restart: Telegram accounts connected, eventLoop.degraded=false, sessions.list ~98–138ms.

Then I ran 2 parallel direct session smokes: main + sub-agent, both native Codex (openai/gpt-5.x, agentRuntime.id:"codex"). Issue reproduced:

- fetch timeout ... timer delayed 7772ms, likely event-loop starvation
- liveness warning ... active=2 queued=2
- repeated codex app-server connection closed during startup
- Codex agent harness failed; not falling back to embedded PI backend
- sessions.list back to ~16.5–22s

Sub-agent replied eventually; main stayed active on fallback gpt-5.3-codex.

Does this match #79757? Recommended mitigation?

terse krakenBOT

Yes — I’d treat this as matching #79757 / same regression family, but your repro is now more specific: two parallel native Codex app-server harness starts are enough to starve the Gateway event loop. Telegram/TTS are probably victims/amplifiers here, not the root cause in this repro.

The key tells:

  • clean restart: eventLoop.degraded=false, sessions.list ~100ms
  • only after 2 parallel native Codex runs: timer delayed ~7.7s, liveness active=2 queued=2, sessions.list 16–22s
  • codex app-server connection closed during startup repeats during the same window
  • not falling back to embedded PI backend is expected when the selected harness is Codex. Docs/source say explicit plugin runtime like agentRuntime.id: "codex" fails closed; OpenClaw logs that it will not fall back to PI.
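
If you want to pull these tells out of a longer log span before attaching it to the issue, a small filter along these lines could help. It is a sketch that matches only the message fragments quoted in this thread; the real log format (timestamps, prefixes, JSON vs. plain text) is not shown here, so treat the patterns as assumptions to adjust.

```typescript
interface StallSignals {
  maxTimerDelayMs: number; // largest "timer delayed Nms" value seen
  appServerCloses: number; // count of app-server startup closes
  harnessFailures: number; // count of Codex harness failures
}

// Scan plain-text log lines for the fragments quoted in this thread.
function scanLogSpan(lines: string[]): StallSignals {
  const signals: StallSignals = {
    maxTimerDelayMs: 0,
    appServerCloses: 0,
    harnessFailures: 0,
  };
  for (const line of lines) {
    const match = /timer delayed (\d+)ms/.exec(line);
    if (match) {
      signals.maxTimerDelayMs = Math.max(signals.maxTimerDelayMs, Number(match[1]));
    }
    if (line.includes("app-server connection closed during startup")) {
      signals.appServerCloses += 1;
    }
    if (line.includes("Codex agent harness failed")) {
      signals.harnessFailures += 1;
    }
  }
  return signals;
}
```

A span where the counts climb together inside the same window as a multi-second timer delay is the pattern described above.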

Recommended mitigation for now:

  1. Serialize agent runs temporarily

Set global agent concurrency to 1 while this is hot:

openclaw config set agents.defaults.maxConcurrent 1
openclaw gateway restart

That’s the safest mitigation because your liveness line already shows active=2 queued=2 at the bad point.

  2. Don’t run both main + sub-agent as forced native Codex on the same Gateway

If you need parallelism, make one of them direct PI/API OpenAI instead of Codex app-server:

agents: {
  list: [
    {
      id: "worker",
      model: "openai/gpt-5.x",
      models: {
        "openai/gpt-5.x": { agentRuntime: { id: "pi" } }
      }
    }
  ]
}

Use the exact model key you already have configured. The important bit is agentRuntime.id: "pi" for the worker, or provider/model-scoped equivalent.

  3. Keep TTS off/tagged while testing

Since you’ve proven direct sessions can trigger it without Telegram delivery, TTS isn’t the first suspect anymore. But auto/batch TTS can still add delayed outbound delivery during a starved window, so keep tts.auto off/tagged until Codex concurrency is stable.

  4. Attach this repro to #79757
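
For intuition on the first mitigation (serializing agent runs), the sketch below models what a generic concurrency cap does to a batch of requested runs. It is illustrative only: `admit` and the run names are hypothetical, and this is not how OpenClaw's scheduler is implemented or how its liveness counters are computed.

```typescript
// Illustrative model of a concurrency cap; not OpenClaw's scheduler.
interface Admission {
  active: string[]; // runs allowed to start now
  queued: string[]; // runs that wait for a free slot
}

function admit(requestedRuns: string[], maxConcurrent: number): Admission {
  if (!Number.isInteger(maxConcurrent) || maxConcurrent < 1) {
    throw new RangeError("maxConcurrent must be a positive integer");
  }
  return {
    active: requestedRuns.slice(0, maxConcurrent),
    queued: requestedRuns.slice(maxConcurrent),
  };
}
```

With a cap of 1, `admit(["main", "worker"], 1)` starts only `main` and queues `worker`, so the two Codex app-server startups never overlap. That trades latency for the sub-agent against keeping the Gateway responsive.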