gateway multiple crash | Friends of the Crustacean 🦞🤝 | Page 1

fiery roost Apr 16, 2026, 12:47 PM

#

Gateway crashes multiple times daily on macOS Tahoe (MacBook Pro, M-series chip). Logs show: lsof failed during initial stale-pid scan for port 18789: ETIMEDOUT. Please add a timeout to the lsof call in the gateway restart logic.

cold fox BOT Apr 16, 2026, 12:49 PM

#

I’m checking the actual restart path now so I don’t make up a fix.

#

Looks like this may already be fixed on main, which is good news. I’m checking whether it landed after your install/version.

#

I checked the source, and this specific timeout is already in main.

In src/infra/restart-stale-pids.ts:

initial stale-pid scan uses spawnSync(..., { timeout: 2000 })
the port-free polling loop uses a shorter per-call timeout of 400ms

So if you’re still seeing lsof failed during initial stale-pid scan ... ETIMEDOUT, the next thing I need is your exact version:

openclaw --version
openclaw status --all
openclaw gateway status

If you’re on an older build, updating is the first move. If you’re already current and Tahoe still reproduces this, that smells like a separate macOS-specific bug, and it should go to GitHub with those outputs plus a log snippet:

https://docs.openclaw.ai/help/faq

#

https://docs.openclaw.ai/gateway/troubleshooting
https://docs.openclaw.ai/install/updating
https://github.com/openclaw/openclaw/issues

Paste openclaw --version and I’ll help narrow down whether you’re missing the fix or hitting a newer Tahoe edge case.

fiery roost Apr 16, 2026, 12:50 PM

#

I'm on OpenClaw 2026.4.14 (323493f), macOS Tahoe arm64. Still seeing the ETIMEDOUT errors today at 06:10, 06:15, and 14:40. So the fix in main may not fully cover the Tahoe edge case. Here's a log snippet:
[restart] lsof failed during initial stale-pid scan for port 18789: ETIMEDOUT

cold fox BOT Apr 16, 2026, 12:51 PM

#

I’m checking one thing: whether that ETIMEDOUT line is actually fatal, or just a warning that leaves the real crash one line later.

#

Thanks, that helps, and yeah, I think you’re right that Tahoe still has an edge case here.

Important detail though: that ETIMEDOUT line is a warning, not the fatal crash by itself. In current code, the initial lsof scan times out, logs that warning, skips stale-pid cleanup, and restart continues. So the thing that actually matters is usually the next few lines after it.
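Rough shell sketch of that "timeout is a warning, not a failure" pattern. To be clear, this is not the code in src/infra/restart-stale-pids.ts (that one is TypeScript and uses spawnSync with a timeout); it is a shell analogue. The function name scan_with_timeout and the warning text are made up for illustration, and it assumes GNU timeout is on the PATH (on macOS that usually means installing coreutils and calling gtimeout instead).

```shell
# Run a scan command with a time limit and treat a timeout as "inconclusive".
# Usage: scan_with_timeout SECONDS COMMAND [ARGS...]
scan_with_timeout() {
  local secs=$1; shift
  local out status=0
  out=$(timeout "$secs" "$@" 2>/dev/null) || status=$?
  if [ "$status" -eq 124 ]; then
    # GNU timeout exits with 124 when the time limit is hit.
    echo "warning: scan timed out after ${secs}s; skipping cleanup" >&2
    return 0   # non-fatal: the caller carries on without the scan result
  fi
  # Print whatever the scan found (may be nothing); never fail the caller.
  if [ -n "$out" ]; then printf '%s\n' "$out"; fi
  return 0
}
```

Example call: scan_with_timeout 2 lsof -nP -t -iTCP:18789 -sTCP:LISTEN prints the listening PIDs if lsof answers within 2 seconds, and prints nothing (plus a warning on stderr) if it hangs.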

Can you paste ~20-30 log lines after the warning from one of those timestamps? I’m specifically looking for:

EADDRINUSE
another gateway instance is already listening
launchctl ... failed
any uncaught/fatal stack

Also please run:

openclaw status --all
openclaw gateway status --deep

#

My current read is:

the timeout is already there
Tahoe is still causing lsof to hang often enough to make stale-pid cleanup inconclusive
the real restart failure is probably one step later

If that next chunk shows EADDRINUSE or a launchd failure on 2026.4.14, I’d file it as a GitHub bug with the version + logs:

https://docs.openclaw.ai/help/faq
https://docs.openclaw.ai/gateway/troubleshooting
https://github.com/openclaw/openclaw/issues

Paste the post-timeout lines and I’ll help pin it down.

fiery roost Apr 16, 2026, 12:51 PM

#

Version: 2026.4.14 (323493f), macOS 26.3.1 arm64, Node 24.13.1
After the lsof ETIMEDOUT warning, no EADDRINUSE follows — the gateway actually restarts fine. But the pattern repeats: new node PID starts, gateway binds successfully, then crashes again minutes later for an unknown reason. Today it happened at 06:10, 06:15, 12:17, 14:26, 14:32, 14:40 — each time a new PID.
Key observation: multiple new node PIDs starting throughout the day suggests the gateway keeps dying and launchd keeps restarting it. The lsof ETIMEDOUT is a symptom of the rapid restart loop, not the root cause.
What's killing the gateway between restarts?

cold fox BOT Apr 16, 2026, 12:53 PM

#

Yep, that changes the picture. I’m checking the actual fatal paths so I can tell you what to grep for, instead of hand-waving.

#

Yep, I agree, the lsof timeout is just smoke here. The real signal is: node exits, launchd KeepAlive starts a fresh PID.

In OpenClaw, the usual ways the gateway dies are:

intentional restart/stop: signal SIGUSR1 received or signal SIGTERM received
crash path: Uncaught exception: ...
fatal rejection: FATAL unhandled rejection: ... or Unhandled promise rejection: ...
config fatal: CONFIGURATION ERROR - requires fix: ...
shutdown watchdog: shutdown timed out; exiting without full cleanup

So I’d grep for exactly those before/around each PID change:

openclaw status --all
openclaw gateway status --deep
grep -nE 'Uncaught exception|Unhandled promise rejection|FATAL unhandled rejection|CONFIGURATION ERROR|signal SIGTERM received|signal SIGUSR1 received|shutdown timed out' /tmp/openclaw/openclaw-gateway.log /tmp/openclaw/openclaw-*.log

#

How to read it:

SIGTERM / SIGUSR1 found → not a mystery crash, something is telling it to restart/stop
Unhandled/uncaught found → that’s the bug
nothing found → the process is probably being killed from outside Node, so next stop is macOS crash reports / external service interference
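If you'd rather get a tally than eyeball raw grep hits, here is a small sketch that buckets log lines using the same signatures as the grep. The function names and bucket labels (intentional, crash, config, watchdog) are my own, not anything OpenClaw prints.

```shell
# Classify one gateway log line by the exit signatures listed above.
classify_exit_line() {
  case "$1" in
    *"signal SIGTERM received"*|*"signal SIGUSR1 received"*) echo "intentional" ;;
    *"Uncaught exception"*|*"FATAL unhandled rejection"*|*"Unhandled promise rejection"*) echo "crash" ;;
    *"CONFIGURATION ERROR"*) echo "config" ;;
    *"shutdown timed out"*) echo "watchdog" ;;
    *) echo "other" ;;
  esac
}

# Tally the buckets over one or more log files.
# Prints "COUNT BUCKET" lines and skips lines that match no signature.
summarize_exits() {
  local line
  cat -- "$@" | while IFS= read -r line || [ -n "$line" ]; do
    classify_exit_line "$line"
  done | grep -v '^other$' | sort | uniq -c | awk '{print $1, $2}'
}
```

Something like summarize_exits /tmp/openclaw/openclaw-gateway.log would then show at a glance whether the exits are dominated by intentional signals or by crash paths.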

If the grep is empty, also check whether gateway status --deep reports other gateway-like services detected, because a stale extra launchd service can absolutely create weird restart behavior.


Paste that grep output and I can help separate “internal crash” from “external kill.”

fiery roost Apr 16, 2026, 12:54 PM

#

grep output shows: SIGTERM received — dozens of times, no uncaught exceptions or unhandled rejections at all.
Every single crash is signal SIGTERM received — not a Node crash. Something external is killing the gateway repeatedly.
Examples from yesterday:
12:11, 12:12, 12:14, 12:16 (x2), 12:26, 12:48, 12:49, 13:47, 14:59, 15:00
This is not a Node bug. Something on macOS Tahoe is sending SIGTERM to the gateway process repeatedly. Could be launchd, could be a watchdog, could be macOS memory pressure. How do I find out what's sending SIGTERM?

cold fox BOT Apr 16, 2026, 12:55 PM

#

Good, that’s a real lead now. I’m checking which OpenClaw paths intentionally send SIGTERM, so we can separate “launchd/app did it” from “macOS or another process did it.”

#

There’s one more suspicious thing: the macOS app itself owns the LaunchAgent in Local mode, so if the app is “reconciling” the service, it can absolutely be the thing sending SIGTERM. Let me check whether there was already a restart-loop fix in this area.

#

I’d bet against memory pressure here. Repeated, clean SIGTERM is much more consistent with a service/app-managed restart than a random OS kill.

Most likely senders on macOS:

launchd doing kickstart -k / stop / bootout on the LaunchAgent
OpenClaw.app in Local mode, because it manages the LaunchAgent
a second stale gateway service or old label fighting the current one
some local automation calling openclaw gateway restart/stop

Also, there was literally a macOS fix named:

fix(macOS): remove redundant kickstart -k causing gateway restart loop

So if you have the macOS app running, it is suspect #1.

What I’d do next:

#

openclaw gateway status --deep
launchctl print gui/$UID/ai.openclaw.gateway
log show --last 24h --style compact --predicate '(process == "launchd" OR process == "OpenClaw" OR process == "node") AND (eventMessage CONTAINS[c] "ai.openclaw.gateway" OR eventMessage CONTAINS[c] "kickstart" OR eventMessage CONTAINS[c] "bootout" OR eventMessage CONTAINS[c] "SIGTERM")'

High-signal isolation test:

if you are running OpenClaw.app, temporarily run it in attach-only / no-launchd mode, or quit it
then run the gateway manually for a while instead of under launchd

openclaw gateway stop
openclaw gateway

If the SIGTERM storm disappears in manual mode, you’ve basically proven the sender is launchd/app lifecycle, not Node itself.

#

Two specific questions:

Are you also running OpenClaw.app?
If yes, is the app version matched to the CLI version?


Paste gateway status --deep plus whether the macOS app is running, and I can help narrow it further.

fiery roost Apr 16, 2026, 12:56 PM

#

openclaw gateway status --deep: Single LaunchAgent running, pid 74123, state active, no second service detected.
OpenClaw.app: NOT running. No app process found. So the macOS app is not the SIGTERM sender.
launchctl print: Clean single LaunchAgent, no stale labels.
So: no app, no duplicate service. The SIGTERM is coming from somewhere else. Could launchd itself be doing this on Tahoe due to memory pressure or a resource limit? Or could it be my own watchdog script I set up? I have a custom LaunchAgent com.julian.gateway-watchdog that runs every 60 seconds and calls openclaw gateway restart if the port is not listening — could that be the culprit?

cold fox BOT Apr 16, 2026, 12:57 PM

#

Yes, honestly, that watchdog is now the top suspect.

Why:

openclaw gateway restart under launchd is supposed to terminate the current gateway and start a new one
your logs show exactly that pattern: clean SIGTERM, new PID, service comes back
a watchdog that runs every 60s and restarts on a flaky “port not listening” check can absolutely create a self-inflicted restart loop

And OpenClaw’s launchd setup already uses KeepAlive, so a second watchdog is usually redundant.

My recommendation:

Pause/disable com.julian.gateway-watchdog temporarily
Watch whether the SIGTERM received entries stop
If they stop, culprit confirmed

I would not blame Tahoe or launchd itself until that custom watchdog is out of the loop.
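For the pause step, a minimal sketch using launchctl's bootout/bootstrap subcommands. The plist path is an assumption (I'm guessing it lives in ~/Library/LaunchAgents, which is the usual spot for per-user agents), and the function names are just for illustration. LAUNCHCTL can be overridden, for example with echo, to preview the commands without running them.

```shell
LAUNCHCTL=${LAUNCHCTL:-launchctl}
WATCHDOG_LABEL=${WATCHDOG_LABEL:-com.julian.gateway-watchdog}
# Assumed plist location; adjust to wherever the watchdog plist actually lives.
WATCHDOG_PLIST=${WATCHDOG_PLIST:-"$HOME/Library/LaunchAgents/$WATCHDOG_LABEL.plist"}

# launchd domain for the current user's GUI session, e.g. gui/501.
gui_domain() { printf 'gui/%s\n' "$(id -u)"; }

# Unload the watchdog so it stops firing.
pause_watchdog() { "$LAUNCHCTL" bootout "$(gui_domain)/$WATCHDOG_LABEL"; }

# Load it again from its plist once the experiment is over.
resume_watchdog() { "$LAUNCHCTL" bootstrap "$(gui_domain)" "$WATCHDOG_PLIST"; }
```

Run pause_watchdog, leave the gateway alone for a few hours, then re-run the grep. If the SIGTERM entries stop, resume_watchdog is optional.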

#

If you want to keep a watchdog later, make it much less trigger-happy:

check gateway health, not just raw port presence
require multiple consecutive failures
add backoff
do not restart on a single missed probe
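Those four points could look roughly like this. It is only a sketch: every variable name, default, and the state directory are invented for illustration, and it assumes openclaw gateway status exits non-zero when the gateway is unhealthy, which I have not verified. The backoff here is the simplest form, a fixed cooldown between restarts. State lives in files because launchd starts a fresh process for every run.

```shell
# Illustrative defaults, not OpenClaw settings.
STATE_DIR=${STATE_DIR:-"$HOME/.cache/gateway-watchdog"}
PROBE_CMD=${PROBE_CMD:-"openclaw gateway status"}   # assumed: non-zero exit when unhealthy
RESTART_CMD=${RESTART_CMD:-"openclaw gateway restart"}
FAIL_THRESHOLD=${FAIL_THRESHOLD:-3}   # consecutive failed probes before acting
COOLDOWN_SECS=${COOLDOWN_SECS:-600}   # minimum gap between two restarts

# Read a number from a file, defaulting to 0 if the file is missing or empty.
read_num() { local v; v=$(cat "$1" 2>/dev/null) || v=0; echo "${v:-0}"; }

# One watchdog run: probe, update the failure streak, maybe restart.
watchdog_tick() {
  mkdir -p "$STATE_DIR"
  local fails_file="$STATE_DIR/fails" last_file="$STATE_DIR/last_restart"
  local now fails last
  now=$(date +%s)
  if $PROBE_CMD >/dev/null 2>&1; then
    echo 0 > "$fails_file"          # healthy: reset the streak
    return 0
  fi
  fails=$(( $(read_num "$fails_file") + 1 ))
  echo "$fails" > "$fails_file"
  # A single missed probe (or any streak below the threshold) is ignored.
  [ "$fails" -ge "$FAIL_THRESHOLD" ] || return 0
  last=$(read_num "$last_file")
  # Still inside the cooldown window after the previous restart: do nothing.
  [ $(( now - last )) -ge "$COOLDOWN_SECS" ] || return 0
  echo "$now" > "$last_file"
  echo 0 > "$fails_file"
  $RESTART_CMD
}
```

With a 60-second schedule and these defaults, the gateway has to fail three probes in a row before a restart, and a second restart cannot happen for at least ten minutes.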


If you want, paste the watchdog script/plist next, and I’ll sanity-check it.

fiery roost Apr 16, 2026, 3:19 PM

#

The 16:22/16:23 SIGTERM was during a macOS reboot (software update). That's expected. After reboot gateway came back fine via KeepAlive. Still monitoring for unexpected SIGTERMs during normal operation.

cold fox BOT Apr 16, 2026, 3:19 PM

#

Nice, that one sounds benign then.

A reboot/software update causing SIGTERM, followed by launchd/KeepAlive bringing it back, is exactly what I’d expect on macOS.

So the useful question now is just:

do unexpected SIGTERM entries still happen during normal uptime?
and if they do, is the custom watchdog still enabled at that moment?

If you catch another non-reboot SIGTERM, paste the timestamp plus whether the watchdog was active, and we can keep narrowing it down. Glad we ruled out at least one scary-looking case.

fiery roost Apr 16, 2026, 3:21 PM

#

After macOS reboot (software update), gateway shows new PID each time it restarts during boot sequence. Is this normal behavior for launchd KeepAlive, or is there still an issue?
