#Gateway disconnecting intermittently


spice fractal
#

Hi @muted siren

Issue: My gateway or something else is disconnecting intermittently, which disrupts responses and heightens operator anxiety.

Symptoms:

  1. openclaw logs --follow periodically crashes with this message: "Gateway not reachable. Is it running and accessible?" Yet when I tail the logs again they come right back.
  2. Slash commands execute inconsistently. Within the same minute, I can run /new and get an application failure error, then run /new again and it works.

Logs: they show handshake-related issues that look suspect.

I will paste the logs and what I've tried in my next message.

royal forgeBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

spice fractal
#
00:33:06+00:00 info gateway/channels/discord {"subsystem":"gateway/channels/discord"} discord gateway: Gateway websocket closed: 1000
00:33:06+00:00 info gateway/channels/discord {"subsystem":"gateway/channels/discord"} discord gateway: Gateway reconnect scheduled in 1088ms (close, resume=true)
00:33:06+00:00 warn gateway/ws {"subsystem":"gateway/ws"} {"handshake":"pending","durationMs":0,"host":"127.0.0.1:__REDACTED__","remoteAddr":"127.0.0.1","remotePort":5__REDACTED__6,"localAddr":"127.0.0.1","localPort":__REDACTED__,"endpoint":"127.0.0.1:5__REDACTED__6->127.0.0.1:__REDACTED__"} closed before connect conn=aa590a70-cee2-43b5-85c6-8c4e1d8db7f2 peer=127.0.0.1:5__REDACTED__6->127.0.0.1:__REDACTED__ remote=127.0.0.1 fwd=n/a origin=n/a host=127.0.0.1:__REDACTED__ ua=n/a code=1006 reason=n/a
00:33:11+00:00 warn gateway/ws {"subsystem":"gateway/ws"} {"handshake":"pending","durationMs":4815,"lastFrameType":"req","lastFrameMethod":"connect","lastFrameId":"dc816aee-19cd-42f7-9c8b-f60c8b4c4a4a","host":"127.0.0.1:__REDACTED__","remoteAddr":"127.0.0.1","remotePort":5__REDACTED__4,"localAddr":"127.0.0.1","localPort":__REDACTED__,"endpoint":"127.0.0.1:5__REDACTED__4->127.0.0.1:__REDACTED__"} closed before connect conn=18fa0326-2e57-453f-87f8-90cc706aaff9 peer=127.0.0.1:5__REDACTED__4->127.0.0.1:4202 remote=127.0.0.1 fwd=n/a origin=n/a host=127.0.0.1:__REDACTED__ ua=n/a code=1005 reason=n/a
00:33:15+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 158ms conn=fb159dd1…30a6 id=b8030352…9d2e

Fixes tried:

  1. added gateway.handshakeTimeoutMs to my config, per:
    https://docs.openclaw.ai/gateway/configuration#tune-gateway-websocket-handshake-timeout

Result: the setting is stripped from my config, and the config is then reloaded.

muted sirenBOT
#

Yep — paste the logs when ready. First thing I need at the top is:

openclaw --version

Then please include:

openclaw status --all
openclaw gateway status
openclaw doctor
openclaw channels status --probe

If openclaw logs --follow itself drops because RPC blips, use the file-log fallback too:

tail -f "$(ls -t /tmp/openclaw/openclaw-*.log | head -1)"

The handshake errors may be the symptom, but the inconsistent slash commands + logs --follow losing the gateway smells like the gateway is restarting, briefly unreachable, or a split-brain install/service mismatch after an update.

Helpful extras:

  • OS / install method
  • how the gateway is started: service, manual terminal, Docker, VPS, etc.
  • whether this started after an update
  • ~30–60 seconds of logs before and after one disconnect

Docs for reference: https://docs.openclaw.ai/help/faq and https://docs.openclaw.ai/gateway/troubleshooting

#

This is useful. Two things jump out:

  1. handshakeTimeoutMs being stripped means your running OpenClaw schema likely does not support it yet. I checked current source: that config key landed very recently/unreleased, so if you’re on a packaged release that predates it, config reload/doctor may remove it. Check:
openclaw --version
openclaw config schema | grep -n "handshakeTimeoutMs"
  2. Those specific closed before connect lines don’t look like a timeout. One closes at 0ms, another at 4815ms; the default timeout is much higher. So increasing the timeout probably won’t fix the core issue. It looks more like clients/Discord reconnecting while the gateway/event loop is briefly unhealthy, restarting, or saturated.

Next best checks:

which openclaw
openclaw --version
openclaw config get meta.lastTouchedVersion
openclaw gateway status --deep
openclaw status --all
openclaw gateway stability

And since openclaw logs --follow itself is losing the RPC connection, use raw file tail during the next incident:

tail -f "$(ls -t /tmp/openclaw/openclaw-*.log | head -1)"

Look for gateway restart/liveness/memory/event-loop messages before the Discord websocket close. The discord gateway: Gateway websocket closed: 1000 is Discord’s gateway websocket, not necessarily the OpenClaw gateway process itself — but if it happens often alongside slash command failures, I’d suspect the OpenClaw process is stalling/restarting or the Discord channel runtime is being restarted by health monitoring.

If you paste openclaw --version + the 30–60 seconds before the disconnect from the raw file log, we can narrow it down.
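
To compare handshake durations at a glance, here is a minimal sketch that pulls only the timestamp, durationMs, and close code out of the closed before connect lines. It assumes the raw file log uses the same line shape as the pasted output above, which has not been verified here; summarize_handshake_closes is a made-up helper name, not an OpenClaw command.

```shell
# Hypothetical helper: reads log lines on stdin and prints one short line per
# "closed before connect" entry. Assumes the line shape seen in the pasted logs.
summarize_handshake_closes() {
  grep -F 'closed before connect' |
    sed -nE 's/^([^ ]+) .*"durationMs":([0-9]+).* code=([0-9]+).*$/\1 durationMs=\2 code=\3/p'
}
```

Usage, with the log path from above: `summarize_handshake_closes < "$(ls -t /tmp/openclaw/openclaw-*.log | head -1)"`. For the first pasted warning this would print `00:33:06+00:00 durationMs=0 code=1006`. Durations far below the configured timeout point away from the timeout path.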

spice fractal
#

OpenClaw 2026.4.26 (be8c246)

Here are some more logs, which I have to paste across two messages:

00:33:06+00:00 warn gateway/ws {"subsystem":"gateway/ws"} {"handshake":"pending","durationMs":0,"host":"127.0.0.1:__REDACTED__","remoteAddr":"127.0.0.1","remotePort":5__REDACTED__6,"localAddr":"127.0.0.1","localPort":__REDACTED__,"endpoint":"127.0.0.1:5__REDACTED__6->127.0.0.1:__REDACTED__"} closed before connect conn=aa590a70-cee2-43b5-85c6-8c4e1d8db7f2 peer=127.0.0.1:5__REDACTED__6->127.0.0.1:__REDACTED__ remote=127.0.0.1 fwd=n/a origin=n/a host=127.0.0.1:__REDACTED__ ua=n/a code=1006 reason=n/a
00:33:11+00:00 warn gateway/ws {"subsystem":"gateway/ws"} {"handshake":"pending","durationMs":4815,"lastFrameType":"req","lastFrameMethod":"connect","lastFrameId":"dc816aee-19cd-42f7-9c8b-f60c8b4c4a4a","host":"127.0.0.1:__REDACTED__","remoteAddr":"127.0.0.1","remotePort":5__REDACTED__4,"localAddr":"127.0.0.1","localPort":__REDACTED__,"endpoint":"127.0.0.1:5__REDACTED__4->127.0.0.1:__REDACTED__"} closed before connect conn=18fa0326-2e57-453f-87f8-90cc706aaff9 peer=127.0.0.1:5__REDACTED__4->127.0.0.1:__REDACTED__ remote=127.0.0.1 fwd=n/a origin=n/a host=127.0.0.1:__REDACTED__ ua=n/a code=1005 reason=n/a
muted sirenBOT
#

Yep — 2026.4.26 explains the config part: gateway.handshakeTimeoutMs isn’t in that release’s schema, so it getting stripped is expected. The docs are slightly ahead of your installed version there.

From this chunk alone, I wouldn’t chase the handshake timeout: durationMs: 0 and 4815 are too short/odd for the normal pre-auth timeout path. More likely: gateway event-loop stall, restart, channel runtime restart, or Discord interaction timeout.

Please paste chunk 2, but first-line recommendation is: update to latest and restart the gateway, because 2026.4.27 has Discord/slash-command and gateway reliability fixes that are very relevant here.

After updating, run:

openclaw gateway restart
openclaw gateway status --deep
openclaw status --all
openclaw gateway stability

If it still happens, the most useful evidence is the raw file log from 30–60 seconds before the disconnect, not just the gateway/ws closed before connect lines.
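
One way to cut that window out of the raw file log is sketched below. It assumes every line starts with an HH:MM:SS timestamp, as in the pasted output, and it does not handle a window that crosses midnight; log_window is a hypothetical helper name, not part of OpenClaw.

```shell
# Hypothetical helper: log_window AT [BEFORE_SECS] [AFTER_SECS] < logfile
# Prints lines whose leading HH:MM:SS falls within the window around AT.
log_window() {
  awk -v at="$1" -v before="${2:-60}" -v after="${3:-60}" '
    # Convert HH:MM:SS to seconds since midnight.
    function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
    BEGIN { center = secs(at) }
    /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
      s = secs(substr($0, 1, 8))
      if (s >= center - before && s <= center + after) print
    }'
}
```

For example, `log_window 00:33:06 60 10 < "$LOG"` would keep the 60 seconds before and 10 seconds after the websocket close seen earlier, where `$LOG` is the newest file under /tmp/openclaw.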

spice fractal
#

@muted siren there are too many logs to paste

muted sirenBOT
#

No worries — don’t paste the wall. Do this instead:

openclaw gateway diagnostics export

It should produce a sanitized zip meant for sharing. Attach that here if you’re comfortable.

If you want a smaller text paste, run:

LOG="$(ls -t /tmp/openclaw/openclaw-*.log | head -1)"
grep -Ei 'gateway/ws|discord gateway|reconnect|restart|health|liveness|event-loop|memory|fatal|uncaught|unhandled|port|EADDRINUSE' "$LOG" | tail -200

Also still strongly recommend updating off 2026.4.26 first, then:

openclaw gateway restart
openclaw gateway status --deep
openclaw status --all
openclaw gateway stability

The diagnostics export + status --all is probably the cleanest next bundle.

spice fractal
#

@muted siren I've got the diagnostic. What piece do you want to see?

muted sirenBOT
#

Nice — the diagnostic zip is designed to be shareable, so easiest is to attach the whole zip here.

If you’d rather paste only pieces, start with:

  1. summary.md / main markdown summary
  2. gateway status / health snapshot
  3. stability recorder / latest bundle summary
  4. The sanitized log tail around the disconnect

I’m mainly looking for:

  • gateway process restarts
  • event-loop/liveness warnings
  • memory pressure
  • channel health monitor restarts
  • Discord runtime errors before the websocket closes

If the zip feels too much for Discord, paste summary.md first and we’ll branch from there.

spice fractal
#

@muted siren even when sanitized there are still details in all of those outputs that I prefer to keep private.

muted sirenBOT
#

Totally fair — keep it private.

Best path then: use the diagnostic locally as a checklist, and only paste the category of what you find.

First, update off 2026.4.26 if you can:

openclaw update --yes
openclaw gateway restart

Then, for the diagnostic zip, run locally:

mkdir -p /tmp/openclaw-diag
unzip -q /path/to/diagnostic.zip -d /tmp/openclaw-diag

grep -RniE 'fatal|uncaught|unhandled|EADDRINUSE|port already|restart|liveness|event-loop|memory|heap|rss|health monitor|discord|heartbeat ACK timeout|interaction|slash' /tmp/openclaw-diag | head -200

You don’t need to paste the output. Just tell me which buckets appear:

  • gateway restart / process exit
  • event-loop or liveness warning
  • memory / heap / RSS pressure
  • Discord health-monitor restart
  • Discord interaction/slash timeout
  • port conflict / split-brain install
  • fatal/uncaught/unhandled error
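
If reading through the grep output is itself too noisy, a rough sketch like the following prints only a match count per bucket, so nothing private leaves your machine. The patterns are taken from the grep above and overlap (a health-monitor restart also counts as a restart), so treat the numbers as hints only; count_bucket and bucket_report are hypothetical helper names.

```shell
# Hypothetical helper: count_bucket DIR LABEL REGEX
# Prints "LABEL: N", where N is the number of matching lines under DIR.
count_bucket() {
  n="$(grep -RiE -- "$3" "$1" 2>/dev/null | wc -l | tr -d ' ')"
  printf '%s: %s\n' "$2" "$n"
}

# Hypothetical helper: bucket_report DIR
# One count per bucket from the list above; patterns reuse the earlier grep.
bucket_report() {
  count_bucket "$1" 'gateway restart / process exit' 'restart'
  count_bucket "$1" 'event-loop or liveness warning' 'event-loop|liveness'
  count_bucket "$1" 'memory / heap / RSS pressure' 'memory|heap|rss'
  count_bucket "$1" 'Discord health-monitor restart' 'health monitor'
  count_bucket "$1" 'Discord interaction/slash timeout' 'interaction|slash|heartbeat ACK timeout'
  count_bucket "$1" 'port conflict / split-brain install' 'EADDRINUSE|port already'
  count_bucket "$1" 'fatal/uncaught/unhandled error' 'fatal|uncaught|unhandled'
}
```

Run it as `bucket_report /tmp/openclaw-diag` and report back only the labels with a non-zero count.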

Also useful and low-detail:

openclaw gateway status --deep
openclaw gateway stability

If those contain private details, summarize only: “healthy/unhealthy”, “PID/start time changed or not”, and any warning names.

spice fractal
#

I see a pattern: When I prompt an agent in Discord, there is a handshake error every time

muted sirenBOT
#

That pattern makes sense: each Discord prompt starts one or more local Gateway WS clients for routing/session/agent work. So a handshake warning per prompt can be a symptom of the prompt path, not necessarily Discord itself.

But if it correlates with failed slash commands, I’d still treat it as real. On 2026.4.26, I’d do this in order:

  1. Update first — 2026.4.27+ has several Discord / channel / gateway reliability fixes.

    openclaw update --yes
    openclaw gateway restart
    
  2. If you can’t update yet, test whether it’s just the 4.26 handshake budget by using the env var, not config:

    OPENCLAW_HANDSHAKE_TIMEOUT_MS=30000 openclaw gateway restart
    

    The config key is stripped on your version, but the env override exists.

  3. Locally search diagnostics for the actual trigger around one prompt:

   grep -RniE 'InteractionEventListener|interaction|slash|heartbeat ACK timeout|closed before connect|handshake timeout|liveness|event-loop|startup-sidecars-pending|restart' /tmp/openclaw-diag | head -100

Key distinction:

  • If the agent replies normally and there’s exactly one closed before connect, it may be noisy/canceled local client cleanup.
  • If slash commands fail or replies drop at the same time, it’s likely gateway/client readiness or Discord interaction timeout — update is the strongest first fix here.
spice fractal
#

hmm i thought I was already on 4.27. Let me try that

muted sirenBOT
#

Good catch — your pasted version was definitely:

OpenClaw 2026.4.26 (be8c246)

After updating, verify both the CLI and the running gateway agree:

openclaw --version
openclaw gateway status --deep
openclaw config get meta.lastTouchedVersion

If those disagree, it’s probably a split-brain install/service path issue. Then run:

which openclaw
openclaw doctor

and look for service/binary mismatch repair suggestions.
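
To make the comparison mechanical, here is a small sketch that extracts the YYYY.M.D version from two strings and checks that they match. It assumes `openclaw --version` prints something like the `OpenClaw 2026.4.26 (be8c246)` line pasted earlier and that `openclaw config get meta.lastTouchedVersion` prints a string containing the version; the second assumption is unverified. extract_version and versions_agree are hypothetical helper names.

```shell
# Hypothetical helper: print the first YYYY.M.D-style version found in a string.
extract_version() {
  printf '%s\n' "$1" | grep -oE '[0-9]{4}\.[0-9]+\.[0-9]+' | head -1
}

# Hypothetical helper: succeed only if both strings contain the same version.
versions_agree() {
  a="$(extract_version "$1")"
  b="$(extract_version "$2")"
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example use (left commented out, since the output formats are assumptions):
# versions_agree "$(openclaw --version)" "$(openclaw config get meta.lastTouchedVersion)" \
#   && echo "versions match" || echo "mismatch: check for a split-brain install"
```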

spice fractal
#

Oh, I see. 4.27 came out after I updated to 4.26. My goodness.

muted sirenBOT
#

Yep, the release train is moving fast 😅

Update + restart is the right next experiment:

openclaw update --yes
openclaw gateway restart
openclaw --version
openclaw gateway status --deep

Then try the same Discord prompt again. If the handshake warnings persist but replies/slash commands become reliable, we can treat them as noisy cleanup. If failures persist too, we’ll focus on gateway stability / event-loop warnings next.

spice fractal
#

@muted siren I updated and rebooted. My local agents can now respond in Discord DMs but not in my channels. My cloud agent is working okay. All three are running in the codex harness. Can non-OpenAI models use hooks, tools, and slash commands in the codex harness?

spice fractal
#

@muted siren ?

#

ah dang, we broke Krill!

spice fractal
#

@muted siren hopefully you can respond to this later. I am still seeing the handshake error after updating to 2026.4.27

03:35:38+00:00 warn agent/embedded {"subsystem":"agent/embedded"} workspace bootstrap file AGENTS.md is 10664 chars (limit 8000); truncating in injected context (sessionKey=agent:elton:discord:channel:1499251723724652624)
03:36:38+00:00 warn gateway/ws {"subsystem":"gateway/ws"} handshake timeout conn=46ffd350-d7f2-49e1-b6e9-5f02872920b0 peer=127.0.0.1:REDACTED->127.0.0.1:REDACTED remote=127.0.0.1
03:36:38+00:00 warn gateway/ws {"subsystem":"gateway/ws"} {"cause":"handshake-timeout","handshake":"failed","durationMs":11263,"host":"127.0.0.1:REDACTED","remoteAddr":"127.0.0.1","remotePort":REDACTED,"localAddr":"127.0.0.1","localPort":REDACTED,"endpoint":"127.0.0.1:REDACTED->127.0.0.1:REDACTED","handshakeMs":11262} closed before connect conn=aff2c7d9-90ba-41d5-802f-eb0f393cff16 peer=127.0.0.1:REDACTED->127.0.0.1:REDACTED remote=127.0.0.1 fwd=n/a origin=n/a host=127.0.0.1:REDACTED ua=n/a code=1005 reason=n/a
03:36:38+00:00 error gateway connect failed: Error: gateway closed (1000):
03:36:38+00:00 error gateway connect failed: Error: gateway closed (1000):

spice fractal
#

@muted siren I'm still seeing some handshake warnings and other shenanigans on 2026.4.27, but the disruptions are not breaking.

So yes, the fix was to upgrade to 2026.4.27.

Note that there is a bug in this version that may prevent agents from responding in group chats whilst allowing responses to work perfectly in DM. I resolved it by adding this to the config:

  "messages": {
    "queue": {
      "mode": "collect"
    },
    "groupChat": {
      "visibleReplies": "automatic"
    }
  }

This issue was reported by multiple users on GitHub, and a fix has been committed to main, so it should not be an issue in future releases. https://github.com/openclaw/openclaw/issues/74876

I do not know if this fix resolves similar issues in previous releases.