#Goal: Make reconnect/recovery after restart reliable

hallow lichen
#

Setup: OpenClaw local gateway on 127.0.0.1:18789
Goal: Make reconnect/recovery after restart reliable
Observed behavior: After restart, openclaw cron list sometimes returns gateway closed (1000 normal closure): no close reason
Related error: I previously saw Max reconnect attempts (0) reached after code 1005
Possible masking issue: Discord is set to allowlist, but allowFrom/groupAllowFrom are empty, so group messages may be silently dropped
What I need: Please check whether the Discord provider reconnect bug or gateway shutdown path is still causing this, and where the fix should go

solemn boneBOT
#

I checked current main (be7a415eb0), and my read is:

  1. openclaw cron list is not going through the Discord provider path

    • It goes CLI → src/cli/gateway-rpc.runtime.ts → src/gateway/call.ts → gateway RPC.
    • So the gateway closed (1000 normal closure): no close reason failure is not primarily a Discord reconnect bug.
  2. The old Discord 1005 / Max reconnect attempts bug looks mostly mitigated on current main

    • extensions/discord/src/monitor/gateway-supervisor.ts now suppresses late reconnect-exhausted errors after dispose/teardown.
    • There are tests for exactly that in provider.test.ts and gateway-supervisor.test.ts.
    • So I would not put the main fix in extensions/discord/... first.
  3. The flaky restart failure looks like a gateway shutdown/connect race

    • src/gateway/server-close.ts explicitly closes tracked clients with 1012, "service restart".
    • But only authenticated/tracked WS clients are in params.clients.
    • New/in-flight/pre-auth connections are not added until after successful connect, in src/gateway/server/ws-connection/message-handler.ts.
    • During restart, those in-flight sockets can get swept up by wss.close() and surface as generic 1000 with no reason.
    • Then src/gateway/call.ts treats that as a hard failure, and the CLI path does no retry.

So where should the fix go?

Primary fix: gateway CLI/connect layer

  • Best first target: src/gateway/call.ts or src/cli/gateway-rpc.runtime.ts
  • Add a short retry window/backoff for transient connect-close errors during restart, especially:
    • 1000 during connect
    • 1012 service restart
    • connect timeout / challenge timeout

There is already precedent for this in:

  • src/gateway/gateway-cli-backend.live-helpers.ts
  • isRetryableGatewayConnectError(...) already treats gateway closed during connect (1000) as retryable
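As a rough illustration of which closes are in scope for that retry window, here is a minimal sketch. The names `ConnectPhase`, `describeCloseCode`, and `isRestartLikeClose` are hypothetical and are not taken from the OpenClaw source; the close-code labels follow RFC 6455 and the IANA WebSocket close code registry.

```typescript
// Hypothetical helper for illustration; not part of the OpenClaw codebase.
type ConnectPhase = "connecting" | "established";

// Labels for the three close codes that come up in this thread.
const CLOSE_CODE_LABELS: Record<number, string> = {
  1000: "normal closure",
  1005: "no status received",
  1012: "service restart",
};

function describeCloseCode(code: number): string {
  return CLOSE_CODE_LABELS[code] ?? "unknown";
}

// Treat a close as restart-like when the server says so explicitly (1012),
// or when a "normal" 1000 arrives before the connect handshake has finished.
function isRestartLikeClose(code: number, phase: ConnectPhase): boolean {
  if (code === 1012) {
    return true;
  }
  return code === 1000 && phase === "connecting";
}
```

The design choice here is that a 1000 on an already established connection is left alone, since that is usually a deliberate close rather than restart fallout.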

Secondary fix: gateway shutdown semantics

  • Improve src/gateway/server-close.ts / WS shutdown so pre-auth / connecting sockets also get a restart-specific signal, ideally 1012 service restart, instead of falling through to 1000 no close reason.

About the allowlist masking issue

Yes, there is a real masking problem, but slightly different from how you phrased it:

  • For Discord, empty allowFrom / groupAllowFrom are not the main guild-message gate
  • Guild/group handling is controlled by:
    • channels.discord.groupPolicy
    • channels.discord.guilds.<guildId>.channels
  • Discord explicitly sets groupAllowFromFallbackToAllowFrom: false

So if groupPolicy="allowlist" and there is no guild channel allowlist configured, guild messages can be silently dropped in preflight. The drop path is in:

  • extensions/discord/src/monitor/message-handler.preflight.ts

That can absolutely make recovery look broken, but it is separate from the cron list restart race.
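A quick way to check for that condition is sketched below. The config shape is an assumption for illustration only: it models `channels.discord.groupPolicy` and `channels.discord.guilds.<guildId>.channels` as plain objects, and the real OpenClaw schema may differ. `DiscordConfigSketch` and `guildAllowlistLooksEmpty` are hypothetical names.

```typescript
// Assumed shape of the channels.discord section; the real schema may differ.
interface DiscordConfigSketch {
  groupPolicy?: string;
  guilds?: Record<string, { channels?: Record<string, unknown> }>;
}

// Returns true when groupPolicy is "allowlist" but no guild lists any channel,
// which is the situation where guild messages could be dropped in preflight.
function guildAllowlistLooksEmpty(discord: DiscordConfigSketch): boolean {
  if (discord.groupPolicy !== "allowlist") {
    return false;
  }
  const guilds = Object.values(discord.guilds ?? {});
  // every() is true for an empty array, so "no guilds at all" also counts as empty.
  return guilds.every((guild) => Object.keys(guild.channels ?? {}).length === 0);
}
```

Running this against the loaded config before debugging reconnect behavior would rule the masking issue in or out early.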

My recommendation

I’d treat this as:

  • Root cause for cron list after restart: gateway shutdown/connect race
  • Fix location: src/gateway/call.ts first, then optionally src/gateway/server-close.ts
  • Not first fix location: Discord provider lifecycle
  • Separate config issue to verify: Discord guild channel allowlist, not just allowFrom/groupAllowFrom

One more thing before I go deeper: can you paste openclaw --version? If you’re not near current main, the answer may shift a bit. Also worth keeping handy: https://docs.openclaw.ai/gateway/troubleshooting

hallow lichen
#

OpenClaw 2026.4.15 (041266a)

solemn boneBOT
#

Thanks, that helps. On 2026.4.15 (041266a), I’m pretty confident the answer is:

What’s still causing it

The gateway shutdown/connect path is the main culprit, not the Discord reconnect bug.

Why:

  • Your release already includes the Discord-side mitigation for the late 1005 / Max reconnect attempts teardown noise.

    • extensions/discord/src/monitor/gateway-supervisor.test.ts
    • extensions/discord/src/monitor/provider.test.ts
    • both already cover suppressed late reconnect-exhausted after dispose on 041266a
  • openclaw cron list does not go through the Discord provider anyway.

    • src/cli/gateway-rpc.runtime.ts → src/gateway/call.ts
    • so a failure there points at CLI ↔ gateway restart timing, not Discord chat transport

What I think is happening

On 041266a:

  • src/gateway/server-close.ts closes tracked clients with:
    • 1012, "service restart"
  • but sockets only get added to the tracked client set after successful auth/connect:
    • src/gateway/server/ws-connection/message-handler.ts
  • so a client connecting during restart can miss the explicit 1012 path and get caught by the generic WS server shutdown instead
  • then src/gateway/call.ts surfaces that as:
    • gateway closed (1000 normal closure): no close reason
  • and the CLI path does no retry

That lines up almost perfectly with your cron list symptom.

Where the fix should go

Best first fix location: CLI gateway call path, not Discord.

I’d put it here, in this order of preference:

  1. src/gateway/call.ts

    • ideally in the CLI-specific path (callGatewayCli / CLI wrapper behavior), not the generic backend path
    • add short retry/backoff for transient restart-time connect failures:
      • 1000 during connect
      • 1012 service restart
      • connect timeout / challenge timeout
  2. Secondary improvement: src/gateway/server-close.ts

    • make pre-auth / in-flight WS connections get a restart-specific close reason too, instead of falling through to generic 1000

Where I would not start

I would not start in:

  • extensions/discord/src/monitor/provider.ts
  • extensions/discord/src/monitor/provider.lifecycle.ts
  • extensions/discord/src/monitor/gateway-supervisor.ts

Those are worth revisiting only if you still have a separate Discord runtime failure after the gateway is back up.

About the masking issue

You’re right that there may be a masking config issue, but the important Discord gate is slightly different:

For Discord guild messages, the real gate is mainly:

  • channels.discord.groupPolicy
  • channels.discord.guilds.<guildId>.channels

Not empty allowFrom/groupAllowFrom.

So if groupPolicy="allowlist" and the guild/channel allowlist is empty, guild messages can be dropped in preflight and make recovery look broken even when the gateway is actually fine.

That path is in:

  • extensions/discord/src/monitor/message-handler.preflight.ts

So my final read is:

  • cron list failure after restart → gateway restart/connect race
  • Discord message drops → possibly separate allowlist config masking
  • Primary fix target → gateway CLI call/retry path

Also, src/cli/daemon-cli/probe.ts is probably not the hot path for this specific cron list failure.

If you want, I can sketch the exact retry patch shape I’d use for callGatewayCli on 041266a.

hallow lichen
#

yes please

solemn boneBOT
#

Yep. I’d do a small CLI-scoped fix first.

1. Add retry only for CLI gateway calls

I’d put it in src/gateway/call.ts, inside callGatewayCli(...), so it covers all CLI commands but does not change backend/runtime callers.

Rough shape:

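// Resolve after `ms` milliseconds; used to space out retry attempts.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// True when the error text looks like a restart-time close rather than a real failure.
function isRetryableCliGatewayRestartError(err: unknown): boolean {
  const message = normalizeLowercaseStringOrEmpty(err instanceof Error ? err.message : String(err));
  return (
    message.includes("gateway closed (1000 normal closure)") ||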
    message.includes("gateway closed (1012)") ||
    message.includes("service restart")
  );
}

export async function callGatewayCli<T = Record<string, unknown>>(
  opts: CallGatewayCliOptions,
): Promise<T> {
  const scopes = Array.isArray(opts.scopes) ? opts.scopes : CLI_DEFAULT_OPERATOR_SCOPES;

  const backoffMs = [150, 350, 750, 1500];
  let lastError: unknown;

  for (let attempt = 0; attempt <= backoffMs.length; attempt += 1) {
    try {
      return await callGatewayWithScopes(opts, scopes);
    } catch (err) {
      lastError = err;
      if (!isRetryableCliGatewayRestartError(err) || attempt === backoffMs.length) {
        throw err;
      }
      await sleep(backoffMs[attempt]!);
    }
  }

  throw lastError instanceof Error ? lastError : new Error(String(lastError));
}
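To sanity-check that retry shape without a running gateway, the same loop can be exercised against a fake call. This is a standalone sketch: `retryWithBackoff`, `makeFlakyCall`, and `isRestartLike` are hypothetical names, and the sleep function is injectable so a test does not have to wait for real delays.

```typescript
// Generic retry helper mirroring the loop above; names are illustrative only.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  backoffMs: readonly number[],
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      const delay = backoffMs[attempt];
      // Give up when the delays are exhausted or the error is not restart-like.
      if (delay === undefined || !isRetryable(err)) {
        throw err;
      }
      await sleep(delay);
    }
  }
}

// Fake gateway call: fails `failures` times with a restart-like error, then succeeds.
function makeFlakyCall(failures: number): () => Promise<string> {
  let calls = 0;
  return async () => {
    calls += 1;
    if (calls <= failures) {
      throw new Error("gateway closed (1000 normal closure): no close reason");
    }
    return `ok after ${calls} calls`;
  };
}

const isRestartLike = (err: unknown): boolean =>
  err instanceof Error && err.message.includes("gateway closed (1000 normal closure)");
```

With four delays the helper makes at most five attempts, which matches the `attempt <= backoffMs.length` bound in the patch above.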

2. While you’re there, wire onConnectError

executeGatewayRequestWithScopes(...) currently sets onHelloOk and onClose, but not onConnectError.

I’d add:

onConnectError: (err) => {
  if (settled || ignoreClose) {
    return;
  }
  ignoreClose = true;
  client.stop();
  stop(err instanceof Error ? err : new Error(String(err)));
},

in the createGatewayClient({...}) options inside src/gateway/call.ts.

That makes connect failures fail fast instead of waiting for the outer timeout.
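The settle-once pattern behind that handler can be shown in isolation. This is a sketch under assumptions: `ClientCallbacksSketch` and `connectOnce` are hypothetical stand-ins, and the real `createGatewayClient` options may be shaped differently.

```typescript
// Minimal stand-in for the callback options discussed above.
interface ClientCallbacksSketch {
  onHelloOk: () => void;
  onClose: (code: number, reason: string) => void;
  onConnectError: (err: unknown) => void;
}

// Wraps a connect attempt so the first callback to fire decides the outcome
// and any later callbacks are ignored.
function connectOnce(start: (cb: ClientCallbacksSketch) => void): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    let settled = false;
    const settle = (fn: () => void): void => {
      if (settled) {
        return;
      }
      settled = true;
      fn();
    };
    start({
      onHelloOk: () => settle(() => resolve("connected")),
      onClose: (code, reason) =>
        settle(() => reject(new Error(`gateway closed (${code}): ${reason || "no close reason"}`))),
      onConnectError: (err) =>
        settle(() => reject(err instanceof Error ? err : new Error(String(err)))),
    });
  });
}
```

Because the connect error settles the promise first, a close event that follows it cannot overwrite the more specific error message.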

3. Optional follow-up server fix

After that, I’d do a cleanup pass on src/gateway/server-close.ts:

  • track pre-auth / handshaking sockets too
  • close them with 1012, "service restart" during shutdown

That would reduce the chance of generic 1000 no close reason in the first place.

Why this order

  • Fastest user-visible win: CLI retries
  • Safest blast radius: CLI-only path
  • Better semantics later: server shutdown consistency

What I would not do

I would not start by changing Discord reconnect logic. On 041266a, that looks like the wrong layer for this symptom.

If you want, I can also sketch the server-side pre-auth socket fix so the close code becomes consistently 1012 instead of 1000.

hallow lichen
#

yes please

solemn boneBOT
#

Yep. For the server-side piece, I’d keep it very small.

Smallest useful server patch

In src/gateway/server-close.ts, after closing tracked authenticated clients, also sweep all remaining WS sockets before wss.close():

for (const c of params.clients) {
  try {
    c.socket.close(1012, "service restart");
  } catch {
    /* ignore */
  }
}
params.clients.clear();

// New bit: close any still-open WS sockets, including pre-auth / handshaking ones.
for (const socket of params.wss.clients ?? new Set()) {
  try {
    socket.close(1012, "service restart");
  } catch {
    /* ignore */
  }
}

const wsClients = params.wss.clients ?? new Set();
const closePromise = new Promise<void>((resolve) => params.wss.close(() => resolve()));

Why this helps

Right now:

  • authenticated sockets in params.clients get 1012
  • but pre-auth / mid-handshake sockets can miss that and only see generic shutdown behavior
  • that is where the ugly 1000 normal closure / no close reason can leak out

Sweeping wss.clients gives those sockets the same restart signal.
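The sweep itself is easy to test without a real server. In this sketch, `ClosableSocket` and `sweepSockets` are hypothetical names; the only assumption about real sockets is a `close(code, reason)` method, which the `ws` library's WebSocket provides.

```typescript
// Minimal socket shape for illustration.
interface ClosableSocket {
  close(code?: number, reason?: string): void;
}

const RESTART_CODE = 1012;
const RESTART_REASON = "service restart";

// Signal every socket with the restart close code and return how many
// close calls succeeded.
function sweepSockets(sockets: Iterable<ClosableSocket>): number {
  let closed = 0;
  for (const socket of sockets) {
    try {
      socket.close(RESTART_CODE, RESTART_REASON);
      closed += 1;
    } catch {
      // A socket that is already closing may throw; shutdown should continue.
    }
  }
  return closed;
}
```

Swallowing per-socket errors keeps one bad socket from blocking the restart signal for the rest.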

Important nuance

I still would not rely on this alone.

Even with the server patch, a CLI call can still land in the tiny window where:

  • old gateway is going away
  • new gateway is not ready yet

So I’d still do the CLI retry patch first, then this server patch as the consistency cleanup.

If I were stacking fixes

  1. CLI retry in callGatewayCli
  2. Wire onConnectError in call.ts
  3. Sweep wss.clients with 1012 in server-close.ts

That combo should make restart recovery feel a lot more solid.

If you want the shortest maintainer-facing summary, I’d phrase it like this:

cron list is failing on a restart-time CLI↔gateway race. Discord reconnect handling is probably not the primary bug on 2026.4.15. Fix should start in CLI gateway retry logic, with a follow-up server-close change so pre-auth WS sockets also receive 1012 service restart instead of leaking 1000/no close reason.

hallow lichen
#

Issue: openclaw cron list and similar CLI calls could fail right after gateway restart.
Root cause: restart-time race between CLI gateway calls and gateway shutdown/startup, not primarily Discord reconnect.
Fix applied:
src/gateway/call.ts
added short CLI retry/backoff for restart-like errors
wired onConnectError so connect failures fail fast
src/gateway/server-close.ts
close remaining websocket clients with 1012 service restart
Verification:
focused tests passed for call.ts
focused tests passed for server-close.ts
retry test was fixed and re-run successfully
Discord allowlist check:
config uses explicit allowlists
no empty-allowlist masking bug found
Status: fixed locally in the source checkout, not pushed/shared

solemn boneBOT
#

Nice, that tracks. I’m glad you pinned it down.

So the clean takeaway is:

  • Cause: CLI ↔ gateway restart race
  • Not mainly: Discord reconnect lifecycle
  • Fixes: src/gateway/call.ts retry + fast onConnectError, and src/gateway/server-close.ts sweeping remaining WS clients with 1012
  • Config: no Discord allowlist masking issue in this case

Since it’s fixed locally but not upstream yet, the next step is to file a GitHub issue with exactly this summary and the affected version:

https://github.com/openclaw/openclaw/issues

I’d include:

  • version: 2026.4.15 (041266a)
  • repro: openclaw cron list right after gateway restart
  • observed error: gateway closed (1000 normal closure): no close reason