Setup: OpenClaw local gateway on 127.0.0.1:18789
Goal: Make reconnect/recovery after restart reliable
Observed behavior: After restart, openclaw cron list sometimes returns gateway closed (1000 normal closure): no close reason
Related error: I previously saw Max reconnect attempts (0) reached after code 1005
Possible masking issue: Discord is set to allowlist, but allowFrom/groupAllowFrom are empty, so group messages may be silently dropped
What I need: Please check whether the Discord provider reconnect bug or gateway shutdown path is still causing this, and where the fix should go
I checked current main (be7a415eb0), and my read is:
- openclaw cron list is not going through the Discord provider path
  - It goes CLI → src/cli/gateway-rpc.runtime.ts → src/gateway/call.ts → gateway RPC.
  - So the gateway closed (1000 normal closure): no close reason failure is not primarily a Discord reconnect bug.
- The old Discord 1005 / Max reconnect attempts bug looks mostly mitigated on current main
  - extensions/discord/src/monitor/gateway-supervisor.ts now suppresses late reconnect-exhausted errors after dispose/teardown.
  - There are tests for exactly that in provider.test.ts and gateway-supervisor.test.ts.
  - So I would not put the main fix in extensions/discord/... first.
- The flaky restart failure looks like a gateway shutdown/connect race
  - src/gateway/server-close.ts explicitly closes tracked clients with 1012, "service restart".
  - But only authenticated/tracked WS clients are in params.clients.
  - New/in-flight/pre-auth connections are not added until after successful connect, in src/gateway/server/ws-connection/message-handler.ts.
  - During restart, those in-flight sockets can get swept up by wss.close() and surface as generic 1000 with no reason.
  - Then src/gateway/call.ts treats that as a hard failure, and the CLI path does no retry.
So where should the fix go?
Primary fix: gateway CLI/connect layer
- Best first target: src/gateway/call.ts or src/cli/gateway-rpc.runtime.ts
- Add a short retry window/backoff for transient connect-close errors during restart, especially:
  - 1000 during connect
  - 1012 service restart
  - connect timeout / challenge timeout
There is already precedent for this in:
- src/gateway/gateway-cli-backend.live-helpers.ts
- isRetryableGatewayConnectError(...) already treats gateway closed during connect (1000) as retryable
Secondary fix: gateway shutdown semantics
- Improve src/gateway/server-close.ts / WS shutdown so pre-auth / connecting sockets also get a restart-specific signal, ideally 1012 service restart, instead of falling through to 1000 no close reason.
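To make the "which closes are transient" idea concrete, here is a minimal sketch of a phase-aware classifier. Everything in it (ConnectPhase, describeClose, isTransientRestartClose) is a hypothetical name for illustration, not something taken from the OpenClaw source. The labels for 1000, 1005 and 1012 follow the standard WebSocket close code registry. Timeouts are left out to keep it short.

```typescript
// Sketch only: these names are hypothetical, not OpenClaw APIs.
type ConnectPhase = "connecting" | "connected";

// Close codes that come up in this thread, with their registry meanings.
const CLOSE_CODE_LABELS: Record<number, string> = {
  1000: "normal closure",
  1005: "no status received",
  1012: "service restart",
};

function describeClose(code: number): string {
  return CLOSE_CODE_LABELS[code] ?? "unknown";
}

// Treat a close as transient when it plausibly came from a restart:
// any 1012, or a 1000 that arrived before the handshake completed.
function isTransientRestartClose(code: number, phase: ConnectPhase): boolean {
  if (code === 1012) {
    return true;
  }
  return code === 1000 && phase === "connecting";
}
```

The phase check is the design choice worth noting: a 1000 after a completed handshake is an ordinary close and probably should not be retried, while a 1000 during connect is the restart symptom described above.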
About the allowlist masking issue
Yes, there is a real masking problem, but slightly different from how you phrased it:
- For Discord, empty allowFrom / groupAllowFrom are not the main guild-message gate
- Guild/group handling is controlled by:
  - channels.discord.groupPolicy
  - channels.discord.guilds.<guildId>.channels
- Discord explicitly sets groupAllowFromFallbackToAllowFrom: false
So if groupPolicy="allowlist" and there is no guild channel allowlist configured, guild messages can be silently dropped in preflight. The drop path is in:
extensions/discord/src/monitor/message-handler.preflight.ts
That can absolutely make recovery look broken, but it is separate from the cron list restart race.
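As a rough illustration of that gate, here is a sketch of the decision. The config shape is an assumption inferred only from the two key paths above (in particular, that channels is a record keyed by channel id), and wouldDropGuildMessage is a hypothetical helper. The real logic in message-handler.preflight.ts will differ in detail.

```typescript
// Hypothetical config shape, inferred only from the key paths named above.
interface DiscordGroupConfig {
  groupPolicy?: string;
  guilds?: Record<string, { channels?: Record<string, unknown> }>;
}

// Hypothetical helper: true when a guild message would be dropped by an allowlist gate.
function wouldDropGuildMessage(
  cfg: DiscordGroupConfig,
  guildId: string,
  channelId: string,
): boolean {
  if (cfg.groupPolicy !== "allowlist") {
    return false; // other policies are out of scope for this sketch
  }
  const allowed = cfg.guilds?.[guildId]?.channels ?? {};
  return !Object.prototype.hasOwnProperty.call(allowed, channelId);
}

// An allowlist policy with no guild channels configured drops everything.
const emptyAllowlist: DiscordGroupConfig = { groupPolicy: "allowlist" };
const configured: DiscordGroupConfig = {
  groupPolicy: "allowlist",
  guilds: { "guild-1": { channels: { "chan-1": {} } } },
};
```

With emptyAllowlist, every guild message is dropped regardless of allowFrom / groupAllowFrom, which is the masking effect to check for.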
My recommendation
I’d treat this as:
- Root cause for cron list after restart: gateway shutdown/connect race
- Fix location: src/gateway/call.ts first, then optionally src/gateway/server-close.ts
- Not first fix location: Discord provider lifecycle
- Separate config issue to verify: Discord guild channel allowlist, not just allowFrom / groupAllowFrom
One more thing before I go deeper: can you paste openclaw --version? If you’re not near current main, the answer may shift a bit. Also worth keeping handy: https://docs.openclaw.ai/gateway/troubleshooting
OpenClaw 2026.4.15 (041266a)
Thanks, that helps. On 2026.4.15 (041266a), I’m pretty confident the answer is:
What’s still causing it
The gateway shutdown/connect path is the main culprit, not the Discord reconnect bug.
Why:
- Your release already includes the Discord-side mitigation for the late 1005 / Max reconnect attempts teardown noise.
  - extensions/discord/src/monitor/gateway-supervisor.test.ts
  - extensions/discord/src/monitor/provider.test.ts
  - both already cover suppressed late reconnect-exhausted after dispose on 041266a
- openclaw cron list does not go through the Discord provider anyway.
  - src/cli/gateway-rpc.runtime.ts → src/gateway/call.ts
  - so a failure there points at CLI ↔ gateway restart timing, not Discord chat transport
What I think is happening
On 041266a:
- src/gateway/server-close.ts closes tracked clients with:
  - 1012, "service restart"
- but sockets only get added to the tracked client set after successful auth/connect:
  - src/gateway/server/ws-connection/message-handler.ts
- so a client connecting during restart can miss the explicit 1012 path and get caught by the generic WS server shutdown instead
- then src/gateway/call.ts surfaces that as:
  - gateway closed (1000 normal closure): no close reason
- and the CLI path does no retry
That lines up almost perfectly with your cron list symptom.
Where the fix should go
Best first fix location: CLI gateway call path, not Discord.
I’d put it here, in this order of preference:
- src/gateway/call.ts
  - ideally in the CLI-specific path (callGatewayCli / CLI wrapper behavior), not the generic backend path
  - add short retry/backoff for transient restart-time connect failures:
    - 1000 during connect
    - 1012 service restart
    - connect timeout / challenge timeout
- Secondary improvement: src/gateway/server-close.ts
  - make pre-auth / in-flight WS connections get a restart-specific close reason too, instead of falling through to generic 1000
Where I would not start
I would not start in:
- extensions/discord/src/monitor/provider.ts
- extensions/discord/src/monitor/provider.lifecycle.ts
- extensions/discord/src/monitor/gateway-supervisor.ts
Those are worth revisiting only if you still have a separate Discord runtime failure after the gateway is back up.
About the masking issue
You’re right that there may be a masking config issue, but the important Discord gate is slightly different:
For Discord guild messages, the real gate is mainly:
- channels.discord.groupPolicy
- channels.discord.guilds.<guildId>.channels
Not empty allowFrom/groupAllowFrom.
So if groupPolicy="allowlist" and the guild/channel allowlist is empty, guild messages can be dropped in preflight and make recovery look broken even when the gateway is actually fine.
That path is in:
extensions/discord/src/monitor/message-handler.preflight.ts
So my final read is:
- cron list failure after restart → gateway restart/connect race
- Discord message drops → possibly separate allowlist config masking
- Primary fix target → gateway CLI call/retry path
Also, src/cli/daemon-cli/probe.ts is probably not the hot path for this specific cron list failure.
If you want, I can sketch the exact retry patch shape I’d use for callGatewayCli on 041266a.
yes please
Yep. I’d do a small CLI-scoped fix first.
1. Add retry only for CLI gateway calls
I’d put it in src/gateway/call.ts, inside callGatewayCli(...), so it covers all CLI commands but does not change backend/runtime callers.
Rough shape:
    function sleep(ms: number) {
      return new Promise((resolve) => setTimeout(resolve, ms));
    }

    function isRetryableCliGatewayRestartError(err: unknown): boolean {
      const message = normalizeLowercaseStringOrEmpty(err instanceof Error ? err.message : String(err));
      return (
        message.includes("gateway closed (1000 normal closure)") ||
        message.includes("gateway closed (1012)") ||
        message.includes("service restart")
      );
    }

    export async function callGatewayCli<T = Record<string, unknown>>(
      opts: CallGatewayCliOptions,
    ): Promise<T> {
      const scopes = Array.isArray(opts.scopes) ? opts.scopes : CLI_DEFAULT_OPERATOR_SCOPES;
      const backoffMs = [150, 350, 750, 1500];
      let lastError: unknown;
      for (let attempt = 0; attempt <= backoffMs.length; attempt += 1) {
        try {
          return await callGatewayWithScopes(opts, scopes);
        } catch (err) {
          lastError = err;
          if (!isRetryableCliGatewayRestartError(err) || attempt === backoffMs.length) {
            throw err;
          }
          await sleep(backoffMs[attempt]!);
        }
      }
      throw lastError instanceof Error ? lastError : new Error(String(lastError));
    }
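For a sense of how long that schedule can hold a CLI command, here is a small worked calculation. retryBudget is a hypothetical helper, not part of the patch. It counts only the sleeps between attempts, so the real worst case also includes however long each attempt takes to fail.

```typescript
// Hypothetical helper: summarizes a backoff schedule like the one in the retry loop.
function retryBudget(backoffMs: readonly number[]): {
  maxAttempts: number;
  maxTotalSleepMs: number;
} {
  return {
    // One initial attempt plus one retry per backoff entry.
    maxAttempts: backoffMs.length + 1,
    // Sum of all sleeps if every attempt fails with a retryable error.
    maxTotalSleepMs: backoffMs.reduce((sum, ms) => sum + ms, 0),
  };
}

const budget = retryBudget([150, 350, 750, 1500]);
// budget.maxAttempts === 5, budget.maxTotalSleepMs === 2750
```

So the schedule allows up to five attempts with at most 2.75 seconds of added waiting, which stays short enough for interactive CLI use.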
2. While you’re there, wire onConnectError
executeGatewayRequestWithScopes(...) currently sets onHelloOk and onClose, but not onConnectError.
I’d add:
    onConnectError: (err) => {
      if (settled || ignoreClose) {
        return;
      }
      ignoreClose = true;
      client.stop();
      stop(err instanceof Error ? err : new Error(String(err)));
    },
in the createGatewayClient({...}) options inside src/gateway/call.ts.
That makes connect failures fail fast instead of waiting for the outer timeout.
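The guard in that handler is a settle-once pattern: whichever of connect-error or close fires first decides the outcome, and later events are ignored. Here is a simplified standalone sketch of the idea. createSettleOnce and Outcome are hypothetical names, and the real executeGatewayRequestWithScopes(...) has more state than this.

```typescript
// Sketch only: hypothetical names, much simpler than the real call path.
type Outcome = { kind: "error"; error: Error };

function createSettleOnce(onSettled: (outcome: Outcome) => void) {
  let settled = false;
  const settle = (outcome: Outcome): void => {
    if (settled) {
      return; // a later close or error must not overwrite the first result
    }
    settled = true;
    onSettled(outcome);
  };
  return {
    onConnectError: (err: unknown): void =>
      settle({ kind: "error", error: err instanceof Error ? err : new Error(String(err)) }),
    onClose: (code: number, reason: string): void =>
      settle({
        kind: "error",
        error: new Error(`gateway closed (${code}): ${reason || "no close reason"}`),
      }),
  };
}

// The connect error arrives first, so the trailing close is ignored.
const outcomes: Outcome[] = [];
const handlers = createSettleOnce((outcome) => {
  outcomes.push(outcome);
});
handlers.onConnectError(new Error("connect failed"));
handlers.onClose(1000, "");
```

Without a guard like this, the trailing close could replace the specific connect error with the generic no close reason message.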
3. Optional follow-up server fix
After that, I’d do a cleanup pass on src/gateway/server-close.ts:
- track pre-auth / handshaking sockets too
- close them with 1012, "service restart" during shutdown
That would reduce the chance of generic 1000 no close reason in the first place.
Why this order
- Fastest user-visible win: CLI retries
- Safest blast radius: CLI-only path
- Better semantics later: server shutdown consistency
What I would not do
I would not start by changing Discord reconnect logic. On 041266a, that looks like the wrong layer for this symptom.
If you want, I can also sketch the server-side pre-auth socket fix so the close code becomes consistently 1012 instead of 1000.
yes please
Yep. For the server-side piece, I’d keep it very small.
Smallest useful server patch
In src/gateway/server-close.ts, after closing tracked authenticated clients, also sweep all remaining WS sockets before wss.close():
    for (const c of params.clients) {
      try {
        c.socket.close(1012, "service restart");
      } catch {
        /* ignore */
      }
    }
    params.clients.clear();

    // New bit: close any still-open WS sockets, including pre-auth / handshaking ones.
    for (const socket of params.wss.clients ?? new Set()) {
      try {
        socket.close(1012, "service restart");
      } catch {
        /* ignore */
      }
    }

    const wsClients = params.wss.clients ?? new Set();
    const closePromise = new Promise<void>((resolve) => params.wss.close(() => resolve()));
Why this helps
Right now:
- authenticated sockets in params.clients get 1012
- but pre-auth / mid-handshake sockets can miss that and only see generic shutdown behavior
- that is where the ugly 1000 normal closure / no close reason can leak out
Sweeping wss.clients gives those sockets the same restart signal.
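Here is a toy in-memory model of the before/after, in case it helps when writing a focused test. FakeSocket and shutdown are hypothetical stand-ins. The final plain close() only models the generic shutdown path that in-flight sockets are reported to hit above; it is not a claim about how the ws library itself behaves.

```typescript
// Fake socket standing in for a real WebSocket; it records the first close it receives.
class FakeSocket {
  closed: { code: number; reason: string } | undefined;
  close(code: number = 1000, reason: string = ""): void {
    if (!this.closed) {
      this.closed = { code, reason };
    }
  }
}

// Hypothetical model of the shutdown order discussed above.
function shutdown(tracked: Set<FakeSocket>, all: Set<FakeSocket>, sweepAll: boolean): void {
  tracked.forEach((s) => s.close(1012, "service restart"));
  tracked.clear();
  if (sweepAll) {
    // The proposed extra sweep: every remaining socket gets the restart signal.
    all.forEach((s) => s.close(1012, "service restart"));
  }
  // Stand-in for the generic shutdown: anything still open gets a plain close.
  all.forEach((s) => s.close());
}

function closeCodeForPreAuthSocket(sweepAll: boolean): number | undefined {
  const authed = new FakeSocket();
  const preAuth = new FakeSocket(); // connected, but not yet in the tracked set
  shutdown(new Set([authed]), new Set([authed, preAuth]), sweepAll);
  return preAuth.closed?.code;
}
```

In this model closeCodeForPreAuthSocket(false) yields 1000 and closeCodeForPreAuthSocket(true) yields 1012, which is the behavior change the sweep is meant to produce.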
Important nuance
I still would not rely on this alone.
Even with the server patch, a CLI call can still land in the tiny window where:
- old gateway is going away
- new gateway is not ready yet
So I’d still do the CLI retry patch first, then this server patch as the consistency cleanup.
If I were stacking fixes
- CLI retry in callGatewayCli
- Wire onConnectError in call.ts
- Sweep wss.clients with 1012 in server-close.ts
That combo should make restart recovery feel a lot more solid.
If you want the shortest maintainer-facing summary, I’d phrase it like this:
cron list is failing on a restart-time CLI↔gateway race. Discord reconnect handling is probably not the primary bug on 2026.4.15. Fix should start in CLI gateway retry logic, with a follow-up server-close change so pre-auth WS sockets also receive 1012 service restart instead of leaking 1000 / no close reason.
Issue: openclaw cron list and similar CLI calls could fail right after gateway restart.
Root cause: restart-time race between CLI gateway calls and gateway shutdown/startup, not primarily Discord reconnect.
Fix applied:
- src/gateway/call.ts
  - added short CLI retry/backoff for restart-like errors
  - wired onConnectError so connect failures fail fast
- src/gateway/server-close.ts
  - close remaining websocket clients with 1012 service restart
Verification:
- focused tests passed for call.ts
- focused tests passed for server-close.ts
- retry test was fixed and re-run successfully
Discord allowlist check:
- config uses explicit allowlists
- no empty-allowlist masking bug found
Status: fixed locally in the source checkout, not pushed/shared
Nice, that tracks. I’m glad you pinned it down.
So the clean takeaway is:
- Cause: CLI ↔ gateway restart race
- Not mainly: Discord reconnect lifecycle
- Fixes: src/gateway/call.ts retry + fast onConnectError, and src/gateway/server-close.ts sweeping remaining WS clients with 1012
- Config: no Discord allowlist masking issue in this case
Since it’s fixed locally but not upstream yet, the next step is to file a GitHub issue with exactly this summary and the affected version:
https://github.com/openclaw/openclaw/issues
I’d include:
- version: 2026.4.15 (041266a)
- repro: openclaw cron list right after gateway restart
- observed error: gateway closed (1000 normal closure): no close reason