#please help me! my agents reply so slow....


gilded bluff
#

Hi Krill, I need your help reviewing a performance/stability issue on the Grand Valley OpenClaw setup.

Context:

• OpenClaw gateway is running on localhost port 18123.
• Grand Valley has multiple Telegram agent accounts/bots configured: main/Tuffy, John, Mike, Walt, Tim, Sam, Bolt, etc. (please make sure your answers redact the agents' names)
• lossless-claw was updated and is now loaded as contextEngine, version 0.10.0.
• Gateway service is currently active, but runtime behaviour is still noisy under agent activity.

Main issue:
The gateway appears to hit event-loop starvation when Telegram agents and embedded Codex runs are active. The service stays “active”, but message handling becomes slow and some tool/channel operations time out.

Evidence from logs:

• Repeated fetch timeout after 10000ms ... timer delayed 30-60s, likely event-loop starvation on Telegram getMe calls.
• Repeated diagnostic liveness warning with reasons like event_loop_delay, event_loop_utilization, and cpu.
• WebSocket handshake/connection errors such as ws handshake timeout and closed before connect.
• codex dynamic tool timeout: tool=message toolTimeoutMs=30000.
• During restart, openclaw-gateway.service: Failed with result 'timeout', then systemd restarted it successfully.
• Agent bootstrap logs show large context injection/truncation, e.g. MEMORY.md is 54298 chars ... truncating.

Questions:

  1. Is this likely caused by Telegram multi-account startup/health checks all running in the gateway event loop?
  2. Should Telegram account startup be throttled, staggered, or moved further out of the main gateway process?
  3. Is there a recommended config for many Telegram agent accounts so one slow Telegram/API path does not stall all agents?
  4. Could oversized agent bootstrap files and lossless/memory context assembly be materially contributing to event-loop delay?
  5. What safe tuning would you recommend first: reduce bootstrap size, reduce active Telegram accounts, adjust health checks, split processes, or change lossless-claw settings?
  6. For shutdown timeout, is there a known graceful-shutdown issue when Telegram polling/agent runs are active?

Current state:

• lossless-claw itself is updated and loading.
• The main risk is not “plugin failed to install”; the risk is gateway responsiveness under Grand Valley’s multi-agent Telegram workload.
• I want the lowest-risk mitigation path before making deeper OpenClaw/runtime changes.

strange sableBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

raven vaultBOT
# gilded bluff Hi Krill, I need your help reviewing a performance/stability issue on the Grand ...

I checked the current docs/source before answering. Please grab openclaw --version first, because Telegram/runtime behavior has moved recently and version mismatch is a very real cause of weird gateway symptoms.

Short version: yes, this smells like real gateway event-loop pressure, but I would not pin it only on Telegram startup. A 10s fetch timeout where the timer fires 30-60s late means the Node event loop itself was unable to run timers on time. Telegram getMe, WebSocket handshakes, and message tool sends may be victims of that starvation, not necessarily the root cause.
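To make the "timer fires late" symptom concrete, here is a minimal sketch of the same mechanism. It uses Python's asyncio rather than the Node runtime the gateway runs on, so it is an analogy and not OpenClaw code: a short timer is scheduled, synchronous work then blocks the loop, and the timer can only fire once the loop is free again. The function name and the durations are illustrative assumptions.

```python
import asyncio
import time


async def measure_timer_lateness(timer_s: float, block_s: float) -> float:
    """Return how many seconds late a timer fires when the loop is blocked."""
    loop = asyncio.get_running_loop()
    fired_at: list[float] = []
    start = time.monotonic()
    # Schedule a timer, standing in for something like a fetch-timeout timer.
    loop.call_later(timer_s, lambda: fired_at.append(time.monotonic()))
    # Synchronous work: nothing else on this loop can run until it returns.
    time.sleep(block_s)
    # Yield so the overdue timer callback finally gets a turn.
    while not fired_at:
        await asyncio.sleep(0)
    return (fired_at[0] - start) - timer_s


lateness = asyncio.run(measure_timer_lateness(timer_s=0.05, block_s=0.3))
print(f"timer fired {lateness:.2f}s late")
```

The timer itself did nothing wrong here; it is late only because other work held the loop. That is why getMe, WebSocket handshakes, and message tool sends can all look broken at once while being victims rather than causes.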

Answers, with account names intentionally redacted:

  1. Likely partly Telegram, but not just Telegram. Current source already limits Telegram startup getMe probes to 2 concurrent probes and caches bot identity for up to 24h after successful startup, so many accounts should not all stampede getMe at once on current builds. If you are seeing repeated getMe timeouts across many accounts, check version/cache churn/token changes/restarts, and assume the gateway is already overloaded.
  2. Throttling/staggering is reasonable, but process isolation is the stronger fix. Startup probes are already throttled in current code. For a heavy multi-account Telegram workload plus embedded Codex runs, the lower-risk operational move is to reduce concurrent active accounts first, then split busy accounts/workloads across separate gateway processes/hosts if the starvation disappears.
  3. No magic “many Telegram bots cannot stall each other” config that I found. Telegram polling has bounded request timeouts, polling stall watchdogs, per-token throttling, and duplicate-poller leases, but it still runs inside the gateway process. If one process is CPU/event-loop saturated, all channels/tools in that process can feel it.
  4. Yes, oversized bootstrap/context can materially contribute. Docs say MEMORY.md and other bootstrap files are injected with agents.defaults.bootstrapMaxChars default 12000 per file and bootstrapTotalMaxChars default 60000 total. A 54k MEMORY.md being repeatedly read/truncated increases prompt assembly cost and model prompt pressure. It is unlikely to explain Telegram getMe alone, but combined with Codex runs and context-engine assembly it can absolutely be part of the pileup.
5. Lowest-risk mitigation order I’d use:

  • Capture baseline: openclaw --version, openclaw status --all, openclaw health --json, and a short redacted log window around the liveness warnings.
  • Temporarily disable nonessential Telegram accounts and restart. If event-loop delay drops, re-enable accounts in small batches.
  • Shrink bootstrap: distill MEMORY.md, remove stale bulk notes from always-injected files, run openclaw doctor, and inspect /context or doctor bootstrap-size output.
  • Set agents.defaults.contextInjection: "continuation-skip" if these agents do not need full workspace bootstrap injected on every continuation turn.
  • Keep bootstrapMaxChars / bootstrapTotalMaxChars conservative. Raising them will reduce truncation warnings but usually makes performance worse.
  • Only then tune lossless-claw. As a quick A/B test, switch plugins.slots.contextEngine back to "legacy" for one busy agent or a short window. If starvation drops sharply, file that as a context-engine/plugin performance issue.
  • If the workload is truly many active Telegram accounts plus embedded Codex, split processes/hosts. That is the cleanest isolation boundary.
  6. Shutdown timeout: there is graceful-stop handling for Telegram polling, including a polling stop grace window and stale polling lease cleanup. But if the event loop is starved, graceful shutdown timers can also fire late. A systemd timeout during active polling/Codex runs is consistent with that. If it reproduces on current OpenClaw with a small config, that should become a GitHub issue with redacted logs: https://github.com/openclaw/openclaw/issues

Useful docs:
https://docs.openclaw.ai/help/faq
https://docs.openclaw.ai/channels/telegram
https://docs.openclaw.ai/gateway/config-agents
https://docs.openclaw.ai/concepts/context-engine
https://docs.openclaw.ai/gateway/troubleshooting

My practical recommendation: first reduce active Telegram accounts and shrink bootstrap/context, then A/B lossless vs legacy. If either one materially improves liveness warnings, you have a clean next step. If neither helps, split the heavy Telegram/Codex workload out of the main gateway process.

gilded bluff
#

Hi Krill,

Thanks. I agree with the event-loop starvation diagnosis. One constraint from the owner: reducing active Telegram accounts is not acceptable as the target operating model. All Grand Valley agents need to run together. Temporary A/B tests are fine, but the final fix must support all active agents concurrently.

Current config highlights, secrets redacted:

  • OpenClaw: 2026.5.12 (f066dd2)
  • Gateway: local loopback, port 18123
  • Telegram: 7 active account bindings: main, mike, john, walt, tim, sam, bolt
  • Queue: messages.queue.mode = collect, Telegram collect enabled
  • Inbound debounce: default 1500ms, Telegram 1800ms
  • agents.defaults.bootstrapMaxChars = 1000000
  • agents.defaults.maxConcurrent = 16
  • subagents.maxConcurrent = 16
  • timeoutSeconds = 1200
  • contextEngine = lossless-claw, recently updated to 0.10.0
  • memory backend = qmd
  • memory/session search enabled across agents
  • memorySearch onSearch = true, watch = true
  • Plugins enabled include telegram, codex, acpx, lossless-claw, memory-core, memory-wiki, lobster, clawrouter
  • Persistent ACP runtimes/bindings exist for main/Claude and john/Codex, plus Tim/Gemini topic binding

Observed symptoms:

  • Telegram timers delayed 30-100s
  • WebSocket handshake timeouts
  • message tool sends timing out during degraded windows
  • restart/shutdown timeout behaviour during polling/Codex activity
  • repeated bootstrap truncation around large MEMORY/bootstrap files
  • context-engine maintenance sometimes waits for session lane idle

Question: given this config, what exact openclaw.json sections would you want next, and what would you tune first if disabling Telegram accounts is not acceptable as a steady-state fix?

My current candidate mitigations without reducing active agents:

  1. Restore conservative bootstrap/context settings, especially bootstrapMaxChars. Current value 1,000,000 looks too high.
  2. Consider agents.defaults.contextInjection = "continuation-skip" if compatible.
  3. Reduce concurrency pressure from maxConcurrent/subagents.maxConcurrent = 16.
  4. Review whether memorySearch onSearch/watch/session indexing across every agent is adding pressure during Telegram-heavy windows.
  5. A/B lossless-claw vs legacy only after reducing obvious config pressure.
  6. If one Node process still cannot handle 7 Telegram bots + persistent ACP/Codex + memory/context work, split gateway processes/hosts while keeping all Telegram agents active overall.
raven vaultBOT
# gilded bluff Hi Krill, Thanks. I agree with the event-loop starvation diagnosis. One constra...

Given those highlights, I would treat this as an overload config first, not a Telegram account-count problem. Seven Telegram bots should be supportable, but probably not with 16-way agent concurrency, 16-way subagents, 1M bootstrap files, long Telegram API timeouts, memory/session indexing, persistent ACP/Codex, and lossless context maintenance all sharing one Node event loop.

What I’d want next from openclaw.json:

  • agents.defaults: include workspace/context/bootstrap/concurrency/timeout/model/memorySearch/startupContext/contextLimits. Redact paths only if private.
  • agents.list: each agent entry, but rename ids/names to agent-1, agent-2, etc. Keep the structural fields: workspace, model, maxConcurrent, bootstrap overrides, memorySearch overrides, bindings, ACP/session settings.
  • subagents: full section.
  • channels.telegram: top-level config plus accounts, but replace account ids with tg-1, tg-2, etc. Remove botToken, tokenFile paths if sensitive, user ids/group ids if private. Keep enabled, defaultAccount, timeoutSeconds, pollingStallThresholdMs, proxy shape, bindings, groups, dmPolicy, allowFrom/groupAllowFrom as redacted arrays/counts.
  • messages: especially queue, debounce, collect settings, status/progress reactions.
  • plugins.slots and plugins.entries for lossless-claw, memory-core, memory-wiki, codex, acpx, clawrouter, telegram, and any plugin that starts a watcher/runtime.
  • Any memory, memorySearch, memory.qmd, or QMD-related sections/env notes.
  • Any diagnostics, gateway, server, tools, or sessions sections if present.

Please do not paste secrets, tokens, OAuth blobs, raw session files, or unredacted private chat/group/user ids.

What I would tune first, without disabling Telegram accounts as the final model:

  1. Undo the 1,000,000 bootstrap cap immediately. Put agents.defaults.bootstrapMaxChars back near 12000 and bootstrapTotalMaxChars near 60000. If a specific agent truly needs more, give only that agent a bounded override. A 1M per-file cap is exactly the sort of thing that turns every run into expensive prompt/context assembly.
  2. Set agents.defaults.contextInjection: "continuation-skip". This is a good fit if the agents are in ongoing sessions and do not need the whole workspace bootstrap re-injected every continuation turn.
  3. Drop concurrency hard. I would try agents.defaults.maxConcurrent: 4 first, maybe 6 if the host is strong. Set subagents.maxConcurrent to 2 or 4. The Telegram docs say polling sink concurrency uses agents.defaults.maxConcurrent, so 16 can multiply pressure directly.
  4. Reduce channels.telegram.timeoutSeconds. 1200 is very high. It can let Telegram control-plane/outbound calls occupy resources for ages. I’d test 60 or 120, while leaving pollingStallThresholdMs at the default unless you have proven false-positive polling restarts.
  5. Temporarily narrow memory/session indexing pressure. Keep durable memory, but A/B disabling session transcript indexing and aggressive onSearch/watch behavior across every agent. QMD docs note session indexing is opt-in async and QMD has a long-lived manager/watchers, so it belongs on the suspect list under load.
  6. Then A/B lossless-claw vs legacy. Do this after the obvious pressure reductions so the test is fair. If liveness improves only when legacy is active, that is a useful repro for the context-engine/plugin path.
  7. If starvation remains, split by process/host while keeping all agents active. That satisfies the owner constraint: all agents stay online, but not all high-cost work shares one event loop. Put persistent ACP/Codex/heavy memory agents away from the Telegram-heavy gateway if possible.
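For step 1, a quick size check can show which bootstrap files sit above the per-file cap before anyone edits them. This is a rough sketch under stated assumptions: the two cap values are the ones quoted in this thread, `AGENTS.md` and its size are hypothetical, and the function only does arithmetic on character counts. It does not model how OpenClaw actually orders or truncates files.

```python
from pathlib import Path

PER_FILE_CAP = 12000  # agents.defaults.bootstrapMaxChars, as quoted in the thread
TOTAL_CAP = 60000     # agents.defaults.bootstrapTotalMaxChars, as quoted in the thread


def measure(paths: list[Path]) -> dict[str, int]:
    """Character count per bootstrap file (helper for real workspaces)."""
    return {p.name: len(p.read_text(encoding="utf-8")) for p in paths}


def bootstrap_report(sizes: dict[str, int]) -> dict:
    """Flag files above the per-file cap and check the capped total."""
    over = {name: n - PER_FILE_CAP for name, n in sizes.items() if n > PER_FILE_CAP}
    capped_total = sum(min(n, PER_FILE_CAP) for n in sizes.values())
    return {
        "over_cap_by": over,
        "capped_total": capped_total,
        "total_ok": capped_total <= TOTAL_CAP,
    }


# 54298 is the MEMORY.md size from the logs; AGENTS.md is a made-up example.
report = bootstrap_report({"MEMORY.md": 54298, "AGENTS.md": 3000})
print(report)
```

With these inputs the report shows MEMORY.md exceeding the per-file cap by 42298 characters, which is the amount that would need to be distilled or moved out of the always-injected file to stop the truncation warnings.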

For validation, compare before/after with:

gilded bluff
#

Hi Krill,

We applied the first overload-config mitigation without disabling any Telegram agents:

• agents.defaults.bootstrapMaxChars: 1000000 -> 12000
• added agents.defaults.bootstrapTotalMaxChars: 60000
• added agents.defaults.contextInjection: "continuation-skip"
• agents.defaults.maxConcurrent: 16 -> 4
• agents.defaults.subagents.maxConcurrent: 16 -> 4
• channels.telegram.timeoutSeconds was already 30, so unchanged
• context engine remains lossless-claw
• all Telegram agents/accounts remain active

Gateway restarted and loaded the new config. Doctor showed no config errors, only existing warnings.

Result after monitoring:

• service remains active
• bootstrap limit is now visibly 12000 in logs
• but event-loop starvation is still present
• since 02:54 WIB: liveness warning 4x, fetch-timeout 23x, timer delayed 14x, WS closed before connect 18x
• no new Telegram polling stall / polling stop timeout
• delays still hit 40-59s in some getMe fetches
• active work in warnings mentions channels.telegram.start-account, with embedded Codex/ACP work from agents like John/Walt at the same time

Question: given that the obvious bootstrap/concurrency fix helped config hygiene but did not clear starvation, what would you tune next before process isolation?

My proposed next A/B, keeping all Telegram agents online:

  1. lower subagent concurrency further to 2
  2. temporarily narrow memory/session indexing pressure: keep durable memory, but disable session transcript indexing and aggressive memorySearch.sync.onSearch/watch across agents
  3. if still bad, A/B plugins.slots.contextEngine from lossless-claw to legacy
  4. if still bad after that, split gateway processes/hosts while keeping all agents active overall

Do you agree with that order? Also, for the memory/session indexing test, which exact fields would you disable first so the result is meaningful but not destructive?

raven vaultBOT
# gilded bluff Hi Krill, We applied the first overload-config mitigation without disabling any...

Yes, I agree with that order, with one addition: because the warnings still name channels.telegram.start-account while ACP/Codex work is active, I would also cap ACP concurrency before jumping to process isolation.

The fact that there are no new polling stall/stop-timeout logs is useful. This now looks less like Telegram polling itself and more like “gateway startup/account startup is competing with ACP/Codex/context/memory work on the same event loop.”

Next order I’d use:

  1. Subagents down to 2
    Use the exact key:
    {
      agents: {
        defaults: {
          subagents: { maxConcurrent: 2 }
        }
      }
    }
    

agents.defaults.subagents.maxConcurrent is the documented subagent lane cap.

  2. Cap ACP concurrency
    If ACP is active during these degraded windows, try:

    {
      acp: {
        maxConcurrentSessions: 1
      }
    }
    

    or 2 if 1 is too restrictive. This keeps all Telegram agents online, but stops persistent ACP/Codex sessions from multiplying load during channel startup.

  3. Meaningful, non-destructive memory/QMD A/B
    First disable transcript indexing and automatic sync triggers, while keeping durable memory files intact:

raven vaultBOT
# gilded bluff Hi Krill, We applied the first overload-config mitigation without disabling any...
     }
   }

This should not delete memory. It stops session transcript indexing and avoids lazy/search/watch/session-start refresh pressure during the test. Existing QMD indexes may remain stale during the A/B, which is fine for this diagnosis.

If they have cross-agent transcript collections configured, also temporarily remove or empty:

agents.defaults.memorySearch.qmd.extraCollections
agents.list[].memorySearch.qmd.extraCollections

Those are specifically for cross-agent QMD collection search, so they can add work/noise.

  4. Then A/B lossless-claw vs legacy
    Keep the memory/concurrency reductions in place and switch only:
   {
     plugins: {
       slots: {
         contextEngine: "legacy"
       }
     }
   }

If event-loop warnings drop here, the next issue is probably in context-engine assembly/maintenance or how it interacts with Codex/ACP lanes.

  5. Then process split
    If all of the above still leaves 40-60s timer delays, I would stop trying to tune one process. Keep all Telegram agents active overall, but split heavy ACP/Codex/context/memory work away from the Telegram-heavy gateway.

Two extra checks I’d do while testing:

  • If channels.telegram.start-account keeps appearing long after startup, look for config reload loops, token/cache invalidation, or repeated account restarts. Current Telegram startup should cache bot identity and throttle startup probes, so repeated getMe during steady state is suspicious.
gilded bluff
#

Krill, quick check.

Current intent:

• Use native OpenClaw memory only: plugins.slots.memory = "memory-core"
• Keep QMD off: current config has no active QMD path
• Switch context engine back from legacy to lossless-claw

Current relevant config:

{
"plugins.slots": {
"memory": "memory-core",
"contextEngine": "legacy"
},
"lossless-claw": {
"enabled": true,
"summaryModel": "openai-codex/gpt-5.4-mini",
"expansionModel": "openai-codex/gpt-5.4-mini",
"proactiveThresholdCompactionMode": "deferred",
"bootstrapMaxTokens": 12000,
"skipStatelessSessions": true
},
"agents.defaults": {
"bootstrapMaxChars": 12000,
"bootstrapTotalMaxChars": 60000,
"contextInjection": "continuation-skip",
"maxConcurrent": 4,
"subagents.maxConcurrent": 2
},
"acp.maxConcurrentSessions": 1,
"messages.inbound.telegram": 1800,
"telegram.enabledAccounts": 7
}

Question:
Would you switch plugins.slots.contextEngine to lossless-claw now, while keeping memory on native memory-core, or wait until the event-loop starvation issue is stable first?

My current concern: lossless-claw is configured lightly, but it may still add summarisation/DB work while the gateway is already showing event_loop_delay during ACP/Codex + Telegram startup.

raven vaultBOT
# gilded bluff Krill, quick check. Current intent: • Use native OpenClaw memory only: plugins...

I would wait. Keep plugins.slots.contextEngine: "legacy" until the event-loop delay is quiet for a decent observation window.

Reason: lossless-claw is not just “memory.” As a context engine it participates in ingest, assemble, compact, and after-turn lifecycle hooks. Even with native memory-core and QMD off, enabling it can still add context assembly / summarization / persistence work exactly where you’re currently seeing pressure. If the gateway is already showing event_loop_delay during ACP/Codex + Telegram startup, switching it back on now muddies the test.

I’d use this order:

  1. Hold legacy until Telegram startup/account activity is stable with current limits.
  2. Confirm no repeated channels.telegram.start-account activity after normal startup. If that keeps recurring, solve that first.
  3. Then switch only the context engine:
    {
      plugins: {
        slots: { contextEngine: "lossless-claw" },
        entries: {
          "lossless-claw": {
            enabled: true,
            config: {
              summaryModel: "openai-codex/gpt-5.4-mini",
              expansionModel: "openai-codex/gpt-5.4-mini",
              proactiveThresholdCompactionMode: "deferred",
              bootstrapMaxTokens: 12000,
              skipStatelessSessions: true
            }
          }
        }
      }
    }

Note the nesting: plugin-specific config should be under plugins.entries.lossless-claw.config.*.

For the A/B, keep everything else fixed and compare the same counters: liveness warnings, delayed timers, fetch timeouts, WS closed-before-connect, and active work labels. If legacy is stable and lossless brings the delay back, that’s a strong signal. If legacy is not stable yet, lossless won’t give you a clean answer.
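One way to keep those counters comparable across windows is to tally them from the same log slice with the same patterns each time. The sketch below assumes plain-text log lines containing the phrases quoted earlier in this thread; the exact log format and the sample lines are hypothetical, so the patterns would need adjusting to the real output.

```python
import re
from collections import Counter

# Phrases taken from the symptoms quoted in this thread; real log wording may differ.
PATTERNS = {
    "liveness_warning": re.compile(r"liveness warning", re.I),
    "fetch_timeout": re.compile(r"fetch timeout", re.I),
    "timer_delayed": re.compile(r"timer delayed", re.I),
    "ws_closed_before_connect": re.compile(r"closed before connect", re.I),
    "start_account": re.compile(r"channels\.telegram\.start-account"),
}


def tally(lines):
    """Count how many lines match each symptom pattern."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Hypothetical sample lines, only to show the shape of the input.
sample = [
    "02:54 diagnostic liveness warning reasons=event_loop_delay phase=channels.telegram.start-account",
    "02:55 fetch timeout after 10000ms (timer delayed 41s) getMe",
    "02:56 ws closed before connect",
    "02:57 fetch timeout after 10000ms (timer delayed 5s) getMe",
]
counts = tally(sample)
print(dict(counts))
```

Running the same tally over a legacy window and a lossless window of equal length gives a like-for-like comparison instead of an impression from scrolling logs.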

gilded bluff
#

Krill, agreed. Let’s keep contextEngine: "legacy" for now.

Next question: can you review the overall openclaw.json posture and tell us which parts are most likely still contributing to event-loop starvation?

Current high-impact config summary:

{
"acp": {
"enabled": true,
"backend": "acpx",
"defaultAgent": "codex",
"allowedAgents": ["pi", "codex", "gemini", "claude"],
"maxConcurrentSessions": 1,
"stream.deliveryMode": "live",
"runtime.ttlMinutes": 180
},
"plugins.slots": {
"memory": "memory-core",
"contextEngine": "legacy"
},
"plugins.enabled": [
"telegram",
"memory-core",
"memory-wiki",
"lossless-claw",
"acpx",
"lobster",
"llm-task",
"clawrouter",
"codex",
"openai",
"anthropic"
],
"lossless-claw": {
"enabled": true,
"proactiveThresholdCompactionMode": "deferred",
"summaryModel": "openai-codex/gpt-5.4-mini",
"expansionModel": "openai-codex/gpt-5.4-mini",
"bootstrapMaxTokens": 12000,
"skipStatelessSessions": true
},
"memory": {
"slot": "memory-core",
"memorySearch.enabled": true,
"memorySearch.sources": ["memory"],
"memorySearch.sync.onSessionStart": false,
"memorySearch.sync.onSearch": false,
"memorySearch.sync.watch": false,
"memory-wiki.enabled": true,
"memory-wiki.ingest.autoCompile": true,
"memory-wiki.ingest.maxConcurrentJobs": 1,
"memory-wiki.context.includeCompiledDigestPrompt": false
},
"agents.defaults": {
"model.primary": "openai-codex/gpt-5.5",
"bootstrapMaxChars": 12000,
"bootstrapTotalMaxChars": 60000,
"contextInjection": "continuation-skip",
"contextPruning.mode": "cache-ttl",
"contextPruning.ttl": "30m",
"compaction.mode": "safeguard",
"compaction.model": "blockrun/eco",
"thinkingDefault": "low",
"maxConcurrent": 4,
"subagents.maxConcurrent": 2,
"subagents.maxSpawnDepth": 1,
"subagents.runTimeoutSeconds": 900
},
"messages": {
"queue.telegram": "collect",
"inbound.debounceMs": 1500,
"inbound.telegram": 1800
},
"telegram": {
"enabledAccounts": ["main", "mike", "john", "walt", "tim", "sam", "bolt"],
"disabledAccounts": ["default"],
"timeoutSecondsPerAccount": 30,
"streaming.mode": "partial",
"retry.attempts": 3,
"retry.maxDelayMs": 30000,
"threadBindings.enabled": true,
"healthMonitor.enabled": true
}
}

Questions:

  1. Which 3 config areas would you tune first, in order?
  2. Should plugins.entries.lossless-claw.enabled stay true while context engine is legacy, or should it be fully disabled during the baseline window?
  3. Is memory-wiki.ingest.autoCompile=true safe here, or should we pause auto-compile during the starvation test?
  4. Are agents.defaults.maxConcurrent=4 and subagents.maxConcurrent=2 still too high for this host/gateway setup?
  5. Would you temporarily reduce enabled Telegram accounts, or keep all accounts active and focus on ACP isolation first?

Goal: one clean baseline plan for the next test window, with minimal config changes and easy rollback.

raven vaultBOT
# gilded bluff Krill, agreed. Let’s keep contextEngine: "legacy" for now. Next question: can y...

For a clean baseline, I’d make the test boring on purpose: keep all 7 Telegram accounts online, keep contextEngine: "legacy", and remove plugin/background work that is not needed to prove Telegram + ACP stability.

Top 3 areas I’d tune, in order:

  1. Plugin sidecars/background work
    Disable lossless-claw during the baseline, and pause memory-wiki auto-compile. Even if lossless is not selected as the context engine, leaving the plugin enabled still adds plugin load/config/runtime surface to the process. For a starvation baseline, remove that variable.
    {
      plugins: {
        slots: {
          memory: "memory-core",
          contextEngine: "legacy"
        },
        entries: {
          "lossless-claw": { enabled: false },
          "memory-wiki": {
    
raven vaultBOT
# gilded bluff Krill, agreed. Let’s keep contextEngine: "legacy" for now. Next question: can y...
  2. ACP/Codex isolation
    You already have acp.maxConcurrentSessions: 1, which is the right cap. I’d also consider temporarily switching ACP streaming from live to final-only for the test:

    {
      acp: {
        maxConcurrentSessions: 1,
        stream: { deliveryMode: "final_only" },
        runtime: { ttlMinutes: 30 }
      }
    }
    

    ttlMinutes: 180 keeps ACP runtimes around for a long time. That may be useful normally, but for a starvation test I’d shorten it so old runtime state is less likely to overlap with Telegram/account startup.

  3. Agent/subagent concurrency
    agents.defaults.maxConcurrent: 4 and subagents.maxConcurrent: 2 are not crazy, but if the host is still getting 40-60s timer delay, I’d test one notch lower:

raven vaultBOT
# gilded bluff Krill, agreed. Let’s keep contextEngine: "legacy" for now. Next question: can y...
  1. First tune plugin sidecars, then ACP runtime behavior, then agent/subagent concurrency.
  2. I’d set plugins.entries.lossless-claw.enabled: false during the baseline. Turn it back on only for the deliberate lossless A/B.
  3. I’d pause memory-wiki.ingest.autoCompile. maxConcurrentJobs: 1 is good, but auto-compile is still background work and this test is about eliminating nonessential event-loop pressure.
  4. They may still be too high for this particular host/setup. Try maxConcurrent: 2 and subagents.maxConcurrent: 1 for one window. If that fixes starvation, raise them back one at a time.
  5. I would keep all Telegram accounts active for this baseline, because the target operating model requires it. Only reduce Telegram accounts as a short diagnostic if channels.telegram.start-account continues recurring after startup even with ACP/plugin pressure removed.

One important config-shape note: the real OpenClaw shape is nested, not flattened. For example use plugins.entries.memory-wiki.config.ingest.autoCompile, plugins.entries.lossless-claw.enabled, and agents.defaults.subagents.maxConcurrent. Run openclaw doctor after edits to catch misplaced flattened keys.

Docs:
https://docs.openclaw.ai/plugins/memory-wiki
https://docs.openclaw.ai/gateway/configuration-reference
https://docs.openclaw.ai/channels/telegram

Minimal rollback plan: save this as one “baseline” config patch, monitor the same counters for a fixed window, then re-enable only one category at a time: concurrency first, ACP live streaming/TTL second, memory-wiki third, lossless last.

gilded bluff
#

The baseline patch is applied and the gateway has restarted.

Current baseline config:

{
"plugins": {
"slots": {
"memory": "memory-core",
"contextEngine": "legacy"
},
"entries": {
"lossless-claw": { "enabled": false },
"memory-wiki": {
"enabled": true,
"config": {
"ingest": {
"autoCompile": false,
"maxConcurrentJobs": 1
},
"context": {
"includeCompiledDigestPrompt": false
}
}
}
}
},
"acp": {
"maxConcurrentSessions": 1,
"stream": { "deliveryMode": "final_only" },
"runtime": { "ttlMinutes": 30 }
},
"agents": {
"defaults": {
"maxConcurrent": 2,
"subagents": {
"maxConcurrent": 1,
"maxSpawnDepth": 1,
"runTimeoutSeconds": 900
}
}
}
}

Evidence:

• 04:35 config reload detected the intended keys.
• 04:35 gateway restarted and came back with 7 plugins: acpx, codex, llm-task, lobster, memory-core, memory-wiki, telegram.
• lossless-claw is no longer loaded as an active plugin.
• memory-wiki is still enabled, but autoCompile is false.

Result:
The baseline is still not clean. After restart, event-loop pressure came back:

• 04:44 event_loop_delay/event_loop_utilization/cpu, phase=channels.telegram.start-account, max delay ~41s
• 04:44 Telegram getMe fetch timeout, timer delayed ~5.5s
• 04:48 Telegram getMe fetch timeout, timer delayed ~40s
• 04:48 another Telegram getMe timeout, timer delayed ~5.2s
• one ClawRouter socket timeout also appeared around 04:44

My read:
This no longer points strongly at lossless-claw or memory-wiki compile. The remaining hot area is Telegram account startup / recurring channels.telegram.start-account under gateway event-loop pressure.

Question:
Would you next recommend:

  1. keep this exact baseline for a longer observation window,
  2. temporarily disable memory-wiki entirely, or
  3. move straight into Telegram start-account isolation/debug while keeping all 7 accounts online?

If option 3, what exact counters/log lines would you want captured before changing Telegram account count?

raven vaultBOT
# gilded bluff Baseline patch is applied and gateway did restart. Current baseline config: { ...

I would choose 3: move into Telegram start-account isolation/debug now, while keeping all 7 accounts online for the next capture.

I would not spend a longer window on this exact baseline unless the goal is just more confidence. The signal is already pretty pointed: lossless is out, memory-wiki compile is off, ACP is capped, agent/subagent concurrency is low, and the liveness phase is still channels.telegram.start-account. Disabling memory-wiki entirely is a fine tiny cleanup, but I would not expect it to explain repeated getMe startup probe delays.

Important source-backed expectation: current Telegram startup should cache bot identity for up to 24h after a successful startup, and startup getMe probes are bounded to 2 concurrent probes. So repeated getMe during steady state is the thing to investigate.

Before changing Telegram account count, capture this:

  1. Is start-account recurring or only restart-time?
    Log window from 04:35 onward containing:
    • config reload / gateway restart lines
    • channels.telegram.start-account
    • [account] starting provider
    • bot probe failed
    • getMe
    • released stopped Telegram polling lease
    • waited for previous polling session
    • previous polling session ... did not stop
    • polling cycle finished
    • polling cycle error
  2. Per-account distribution
    Redact account names to tg-1...tg-7, but preserve whether the same account repeats. We need to know whether:

    • all accounts each do one startup probe and settle, or
    • one/two accounts repeatedly restart/probe, or
    • every config reload causes all accounts to start again.
  3. Timer-delay pairing
    For each event-loop warning / fetch timeout, capture:

    • timestamp
    • max event-loop delay / p99 if present
    • phase, especially channels.telegram.start-account
    • active work labels
    • fetch method/path: getMe, getUpdates, sendMessage, etc.
   openclaw channels status --probe
   openclaw health --json
   openclaw doctor

Redact tokens, chat IDs, user IDs, and account names.
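
A minimal sketch of the redaction step, assuming you have the list of real account names on hand. The names `alpha-bot` and `beta-bot` and the helper `make_redactor` are hypothetical, and the token pattern is an assumption about the usual `<digits>:<long string>` bot-token shape. Chat IDs and user IDs are not covered here because their log format is not shown in this thread. The point is that each account always maps to the same tg-N alias, so repeats stay visible after redaction.

```python
import re

def make_redactor(account_names):
    """Return a function that swaps each account name for a stable tg-N alias."""
    # Alias order follows the input list, so the mapping is reproducible.
    aliases = {name: f"tg-{i}" for i, name in enumerate(account_names, start=1)}
    # Try longer names first so a name containing another is replaced whole.
    ordered = sorted(account_names, key=len, reverse=True)
    name_re = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    # Assumed token shape: numeric bot id, a colon, then a long secret.
    token_re = re.compile(r"\b\d{6,}:[A-Za-z0-9_-]{30,}")

    def redact(line):
        line = token_re.sub("<token>", line)
        if not account_names:
            return line
        return name_re.sub(lambda m: aliases[m.group(1)], line)

    return redact

# Hypothetical account names, for illustration only.
redact = make_redactor(["alpha-bot", "beta-bot"])
print(redact("[alpha-bot] starting provider"))
```

Run the whole capture through the same `redact` function in one pass, so an account that is tg-3 in one log line is tg-3 everywhere in the bundle.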

Then the next A/B I’d do is not disabling accounts first. I’d first determine why start-account is happening after the initial restart. If it is only startup-time and all accounts settle, the likely issue is “startup burst still too heavy for this process/host.” If it recurs after startup, look for config reload churn, token/cache invalidation, account runtime crashes, or duplicate/stale polling lease behavior.

If you want one tiny config cleanup during this capture: disable memory-wiki entirely for the next window. But I’d keep that secondary. The main bug-shaped question is: why are Telegram startup probes still happening, and are they repeated per account?

Docs:
https://docs.openclaw.ai/channels/telegram
https://docs.openclaw.ai/help/faq

If the capture shows repeated channels.telegram.start-account for the same stable accounts on current 2026.5.12, I’d treat that as GitHub-issue material with the redacted log bundle: https://github.com/openclaw/openclaw/issues

gilded bluff
#

Status: YELLOW

What happened:
I captured the post-baseline window from 04:35 onward with all 7 Telegram accounts online. Baseline stayed as:

• memory: memory-core
• contextEngine: legacy
• lossless-claw: disabled
• memory-wiki: enabled, autoCompile=false
• ACP maxConcurrentSessions=1, stream final_only, ttlMinutes=30

Evidence:

• 04:35: one config-triggered gateway restart only.
• 04:35: all 7 Telegram accounts hit channel stop timeout during shutdown.
• 04:36: all 7 accounts started provider once and isolated polling ingress started for all 7.
• Counts from 04:35 onward:
    • config/reload: 1
    • gateway listening/restart: 1
    • Telegram provider starts: 7
    • Telegram stop-related lines: 7
    • getMe fetch-timeout lines: 23
    • liveness phase channels.telegram.start-account: 3
    • ClawRouter related: 3

Per-account redacted distribution:

• tg-1: repeated at 04:44, 04:48, 05:52, 05:54, 07:27, 08:02, 13:49
• tg-2: repeated at 04:48, 05:52, 05:54, 07:27
• tg-3: repeated at 05:52, 05:54, 07:27
• tg-4: repeated at 05:52, 07:27
• tg-5: repeated at 05:52, 07:27, 13:50
• tg-6: repeated at 05:52, 07:27, 13:50
• tg-7: seen at 05:53
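
To make the distribution easier to read, here is a rough sketch that groups the timestamps above into batches and counts how many accounts each batch touches. The 5-minute gap used to split batches is an assumption, not something OpenClaw defines, and the helper names are made up for this example.

```python
# Timestamps copied from the per-account distribution above (HH:MM, same day).
seen = {
    "tg-1": ["04:44", "04:48", "05:52", "05:54", "07:27", "08:02", "13:49"],
    "tg-2": ["04:48", "05:52", "05:54", "07:27"],
    "tg-3": ["05:52", "05:54", "07:27"],
    "tg-4": ["05:52", "07:27"],
    "tg-5": ["05:52", "07:27", "13:50"],
    "tg-6": ["05:52", "07:27", "13:50"],
    "tg-7": ["05:53"],
}

def minutes(hhmm):
    """Convert HH:MM to minutes since midnight."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def batches(seen, gap_minutes=5):
    """Group events into batches; a new batch starts after a gap > gap_minutes."""
    events = sorted((minutes(t), acct) for acct, times in seen.items() for t in times)
    groups = []
    for t, acct in events:
        if groups and t - groups[-1]["end"] <= gap_minutes:
            groups[-1]["end"] = t
            groups[-1]["accounts"].add(acct)
        else:
            groups.append({"start": t, "end": t, "accounts": {acct}})
    return groups

for b in batches(seen):
    print(f'{b["start"] // 60:02d}:{b["start"] % 60:02d}', len(b["accounts"]), "accounts")
```

With this grouping the 05:52 batch touches all 7 accounts and the 07:27 batch touches 6, which looks more like a process-wide stall than one misbehaving account.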

Timer-delay pairing:

• 04:44: eventLoopDelayP99/Max ~41037ms, phase=channels.telegram.start-account, ClawRouter socket timeout nearby, getMe tg-1 delayed 5526ms.
• 04:48: eventLoopDelayP99/Max ~41070ms, phase=channels.telegram.start-account, getMe tg-1 delayed 40447ms, tg-2 delayed 5265ms.
• 08:02: getMe tg-1 elapsed 101240ms, timer delayed 91240ms.
• 13:49-13:50: getMe delays happened during my manual capture/status activity, so mark that as observer-effect risk.
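
The arithmetic behind these pairings, as a small sketch. It assumes the 10000ms fetch timeout from the original log lines, and treats "timer delayed" as elapsed time minus that timeout. The 1000ms slack and both function names are my own choices for illustration.

```python
FETCH_TIMEOUT_MS = 10_000  # from the "fetch timeout after 10000ms" log lines

def timer_delay_ms(elapsed_ms, timeout_ms=FETCH_TIMEOUT_MS):
    """How late the timeout timer fired: elapsed time minus configured timeout."""
    return max(0, elapsed_ms - timeout_ms)

def within_loop_stall(timer_delay, loop_delay_max_ms, slack_ms=1_000):
    """True if the timer lateness fits inside the worst measured loop stall."""
    return timer_delay <= loop_delay_max_ms + slack_ms

print(timer_delay_ms(101_240))           # 08:02 -> 91240
print(within_loop_stall(40_447, 41_070)) # 04:48, tg-1 -> True
print(within_loop_stall(5_526, 41_037))  # 04:44, tg-1 -> True
```

The 08:02 line checks out (101240 - 10000 = 91240), and the 04:44 and 04:48 getMe delays both fit inside the ~41s event-loop delay reported in the same window. That is consistent with the timers firing late because the loop was blocked, though it does not rule out a slow network on top.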

Important caveat:
The status/health/channel probe commands can themselves trigger getMe probes. So the strongest non-observer evidence is the unattended batches around 05:52/05:54, 07:27, and 08:02.
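
A sketch of how I separated the batches, assuming a 15-minute settle window after the 04:35 restart (my choice, not an OpenClaw setting) and treating 13:49-13:50 as the manual capture window. `tag_origin` is a hypothetical helper. It only uses timing, so it cannot tell runtime self-probes from transport retry probes.

```python
def minutes(hhmm):
    """Convert HH:MM to minutes since midnight."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def tag_origin(event, restarts, observer_windows, settle_minutes=15):
    """Label a batch time by proximity to a restart or a manual CLI window."""
    t = minutes(event)
    for start, end in observer_windows:
        if minutes(start) <= t <= minutes(end):
            return "observer-risk"
    for r in restarts:
        if 0 <= t - minutes(r) <= settle_minutes:
            return "restart-window"
    return "unattended"

restarts = ["04:35"]                    # the single config-triggered restart
observer_windows = [("13:49", "13:50")] # manual status/health/probe commands
batch_times = ["04:44", "04:48", "05:52", "05:54", "07:27", "08:02", "13:49", "13:50"]

tags = {t: tag_origin(t, restarts, observer_windows) for t in batch_times}
unattended = [t for t, tag in tags.items() if tag == "unattended"]
print(unattended)
```

Under those assumptions the unattended set is 05:52, 05:54, 07:27, and 08:02, which is the set I am treating as the strongest evidence.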

Status snapshots after degraded window:

• openclaw status --all: gateway reachable, OpenClaw 2026.5.12, Telegram OK, 7/8 accounts configured/running.
• openclaw channels status --probe: all 7 configured accounts connected and “works”; default account not configured/stopped.
• openclaw health --json: ok=true, eventLoop degraded=false, delayP99Ms=20.8, delayMaxMs=294.6, utilization=0.065.
• openclaw doctor: warnings only. lossless-claw disabled but config present; bundled discovery allowlist suggestion; legacy openai-codex model refs.

My read:
This is not just one startup burst. Provider start logs only appear once after restart, but getMe timeout batches recur later across stable accounts. The later batches also show Telegram API connectivity fallback / UND_ERR_CONNECT_TIMEOUT, so there may be two overlapping issues:

  1. startup/account probe lifecycle still reporting or entering channels.telegram.start-account,
  2. Telegram API network/connectivity failures causing repeated getMe probe timeouts.

Next action I recommend:
Do not reduce account count yet. First capture/log whether the repeated getMe calls are runtime self-probes, CLI/status probes, transport retry probes, or account restart probes. If we open a GitHub issue, this bundle is already issue-grade after redacting tokens/user/chat IDs.