#gateway multiple crash
I’m checking the actual restart path now so I don’t make up a fix.
Looks like this may already be fixed on main, which is good news. I’m checking whether it landed after your install/version.
I checked the source, and this specific timeout is already in main.
In src/infra/restart-stale-pids.ts:
- initial stale-pid scan uses spawnSync(..., { timeout: 2000 })
- the port-free polling loop uses a shorter per-call timeout of 400ms
So if you’re still seeing lsof failed during initial stale-pid scan ... ETIMEDOUT, the next thing I need is your exact version:
openclaw --version
openclaw status --all
openclaw gateway status
If you’re on an older build, updating is the first move. If you’re already current and Tahoe still reproduces this, that smells like a separate macOS-specific bug, and it should go to GitHub with those outputs plus a log snippet:
https://docs.openclaw.ai/gateway/troubleshooting
https://docs.openclaw.ai/install/updating
https://github.com/openclaw/openclaw/issues
Paste openclaw --version and I’ll help narrow down whether you’re missing the fix or hitting a newer Tahoe edge case.
I'm on OpenClaw 2026.4.14 (323493f), macOS Tahoe arm64. Still seeing the ETIMEDOUT errors today at 06:10, 06:15, and 14:40. So the fix in main may not fully cover the Tahoe edge case. Here's a log snippet:
[restart] lsof failed during initial stale-pid scan for port 18789: ETIMEDOUT
I’m checking one thing: whether that ETIMEDOUT line is actually fatal, or just a warning that leaves the real crash one line later.
Thanks, that helps, and yeah, I think you’re right that Tahoe still has an edge case here.
Important detail though: that ETIMEDOUT line is a warning, not the fatal crash by itself. In current code, the initial lsof scan times out, logs that warning, skips stale-pid cleanup, and restart continues. So the thing that actually matters is usually the next few lines after it.
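A rough shell sketch of that warn-and-continue behavior, for illustration only. The real logic lives in src/infra/restart-stale-pids.ts and uses spawnSync; the function names below (run_with_timeout, scan_port_holders) and the lsof flags are assumptions, not OpenClaw's code.

```bash
# Hypothetical helper: run CMD with a deadline of SECS seconds.
# Returns 124 if the command was terminated at the deadline, else its own status.
run_with_timeout() {
  local secs="$1" rc=0
  shift
  "$@" &
  local cmd_pid=$!
  # Timer: send SIGTERM to the command once the deadline passes.
  ( sleep "$secs"; kill -TERM "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
  local timer_pid=$!
  wait "$cmd_pid" 2>/dev/null || rc=$?
  # Command finished (or was killed): stop the timer and reap it quietly.
  kill "$timer_pid" 2>/dev/null || true
  wait "$timer_pid" 2>/dev/null || true
  # 143 = 128 + SIGTERM, which is taken here to mean the timer fired.
  if [ "$rc" -eq 143 ]; then
    return 124
  fi
  return "$rc"
}

# Hypothetical scan: list PIDs listening on PORT, treating a timeout as
# "inconclusive" (warn and carry on) rather than as a fatal error.
scan_port_holders() {
  local port="$1" rc=0
  run_with_timeout 2 lsof -nP -iTCP:"$port" -sTCP:LISTEN -t || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "warning: lsof timed out for port $port; skipping stale-pid cleanup" >&2
  fi
  return 0
}
```

The point of the sketch is only that a timed-out scan ends in a warning and a normal return, so whatever kills the gateway has to show up in later log lines.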
Can you paste ~20-30 log lines after the warning from one of those timestamps? I’m specifically looking for:
- EADDRINUSE
- another gateway instance is already listening
- launchctl ... failed
- any uncaught/fatal stack
Also please run:
openclaw status --all
openclaw gateway status --deep
My current read is:
- the timeout is already there
- Tahoe is still causing lsof to hang often enough to make stale-pid cleanup inconclusive
- the real restart failure is probably one step later
If that next chunk shows EADDRINUSE or a launchd failure on 2026.4.14, I’d file it as a GitHub bug with the version + logs:
https://docs.openclaw.ai/help/faq
https://docs.openclaw.ai/gateway/troubleshooting
https://github.com/openclaw/openclaw/issues
Paste the post-timeout lines and I’ll help pin it down.
Version: 2026.4.14 (323493f), macOS 26.3.1 arm64, Node 24.13.1
After the lsof ETIMEDOUT warning, no EADDRINUSE follows — the gateway actually restarts fine. But the pattern repeats: new node PID starts, gateway binds successfully, then crashes again minutes later for an unknown reason. Today it happened at 06:10, 06:15, 12:17, 14:26, 14:32, 14:40 — each time a new PID.
Key observation: multiple new node PIDs starting throughout the day suggests the gateway keeps dying and launchd keeps restarting it. The lsof ETIMEDOUT is a symptom of the rapid restart loop, not the root cause.
What's killing the gateway between restarts?
Yep, that changes the picture. I’m checking the actual fatal paths so I can tell you what to grep for, instead of hand-waving.
Yep, I agree, the lsof timeout is just smoke here. The real signal is: node exits, launchd KeepAlive starts a fresh PID.
In OpenClaw, the usual ways the gateway dies are:
- intentional restart/stop: signal SIGUSR1 received or signal SIGTERM received
- crash path: Uncaught exception: ...
- fatal rejection: FATAL unhandled rejection: ... or Unhandled promise rejection: ...
- config fatal: CONFIGURATION ERROR - requires fix: ...
- shutdown watchdog: shutdown timed out; exiting without full cleanup
So I’d grep for exactly those before/around each PID change:
openclaw status --all
openclaw gateway status --deep
grep -nE 'Uncaught exception|Unhandled promise rejection|FATAL unhandled rejection|CONFIGURATION ERROR|signal SIGTERM received|signal SIGUSR1 received|shutdown timed out' /tmp/openclaw/openclaw-gateway.log /tmp/openclaw/openclaw-*.log
How to read it:
- SIGTERM / SIGUSR1 found → not a mystery crash, something is telling it to restart/stop
- Unhandled/uncaught found → that’s the bug
- nothing found → the process is probably being killed from outside Node, so next stop is macOS crash reports / external service interference
If the grep is empty, also check whether gateway status --deep reports other gateway-like services detected, because a stale extra launchd service can absolutely create weird restart behavior.
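The reading guide above can be turned into a small classifier that reads a log excerpt on stdin and prints one label. This is a sketch: the function name and the labels are made up for illustration, and only the matched strings come from the grep pattern in this thread.

```bash
# Hypothetical helper: classify a log excerpt (stdin) by the first exit path found.
# Labels: crash, config, shutdown-timeout, signal, no-match.
classify_gateway_exit() {
  local log
  log="$(cat)"
  if printf '%s\n' "$log" | grep -qE 'Uncaught exception|FATAL unhandled rejection|Unhandled promise rejection'; then
    echo crash              # the process died inside Node: that is the bug
  elif printf '%s\n' "$log" | grep -q 'CONFIGURATION ERROR'; then
    echo config             # config needs fixing before the gateway can run
  elif printf '%s\n' "$log" | grep -q 'shutdown timed out'; then
    echo shutdown-timeout   # shutdown watchdog forced the exit
  elif printf '%s\n' "$log" | grep -qE 'signal SIGTERM received|signal SIGUSR1 received'; then
    echo signal             # something asked the gateway to stop or restart
  else
    echo no-match           # probably killed from outside Node
  fi
}
```

For example, piping the 20-30 lines before a PID change into classify_gateway_exit gives a quick first answer on "internal crash" versus "external kill".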
Relevant docs:
- https://docs.openclaw.ai/gateway/troubleshooting
- https://docs.openclaw.ai/gateway/logging
- https://docs.openclaw.ai/platforms/mac/bundled-gateway
Paste that grep output and I can help separate “internal crash” from “external kill.”
grep output shows: SIGTERM received — dozens of times, no uncaught exceptions or unhandled rejections at all.
Every single crash is signal SIGTERM received — not a Node crash. Something external is killing the gateway repeatedly.
Examples from yesterday:
12:11, 12:12, 12:14, 12:16 (x2), 12:26, 12:48, 12:49, 13:47, 14:59, 15:00
This is not a Node bug. Something on macOS Tahoe is sending SIGTERM to the gateway process repeatedly. Could be launchd, could be a watchdog, could be macOS memory pressure. How do I find out what's sending SIGTERM?
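To see whether the SIGTERM entries follow a regular cadence, it can help to count them per hour. A minimal sketch, assuming each log line starts with an ISO-8601 timestamp such as 2026-04-15T12:11:03Z (the real log format may differ, in which case the cut range needs adjusting); the function name is hypothetical.

```bash
# Hypothetical helper: count "signal SIGTERM received" lines per hour.
# Assumes the first 13 characters of each line are "YYYY-MM-DDTHH".
sigterm_per_hour() {
  grep -F 'signal SIGTERM received' \
    | cut -c1-13 \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'   # print "hour count"
}
```

Feeding it the gateway log on stdin prints one line per hour. Steady counts at a fixed interval point toward a timer-driven sender rather than a random kill.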
Good, that’s a real lead now. I’m checking which OpenClaw paths intentionally send SIGTERM, so we can separate “launchd/app did it” from “macOS or another process did it.”
There’s one more suspicious thing: the macOS app itself owns the LaunchAgent in Local mode, so if the app is “reconciling” the service, it can absolutely be the thing sending SIGTERM. Let me check whether there was already a restart-loop fix in this area.
I’d bet against memory pressure here. Repeated, clean SIGTERM is much more consistent with a service/app-managed restart than a random OS kill.
Most likely senders on macOS:
- launchd doing kickstart -k / stop / bootout on the LaunchAgent
- OpenClaw.app in Local mode, because it manages the LaunchAgent
- a second stale gateway service or old label fighting the current one
- some local automation calling openclaw gateway restart/stop
Also, there was literally a macOS fix named:
fix(macOS): remove redundant kickstart -k causing gateway restart loop
So if you have the macOS app running, it is suspect #1.
What I’d do next:
openclaw gateway status --deep
launchctl print gui/$UID/ai.openclaw.gateway
log show --last 24h --style compact --predicate '(process == "launchd" OR process == "OpenClaw" OR process == "node") AND (eventMessage CONTAINS[c] "ai.openclaw.gateway" OR eventMessage CONTAINS[c] "kickstart" OR eventMessage CONTAINS[c] "bootout" OR eventMessage CONTAINS[c] "SIGTERM")'
High-signal isolation test:
- if you are running OpenClaw.app, temporarily run it in attach-only / no-launchd mode, or quit it
- then run the gateway manually for a while instead of under launchd
openclaw gateway stop
openclaw gateway
If the SIGTERM storm disappears in manual mode, you’ve basically proven the sender is launchd/app lifecycle, not Node itself.
Two specific questions:
- Are you also running OpenClaw.app?
- If yes, is the app version matched to the CLI version?
Docs:
- https://docs.openclaw.ai/platforms/macos
- https://docs.openclaw.ai/platforms/mac/child-process
- https://docs.openclaw.ai/gateway/troubleshooting
Paste gateway status --deep plus whether the macOS app is running, and I can help narrow it further.
openclaw gateway status --deep: Single LaunchAgent running, pid 74123, state active, no second service detected.
OpenClaw.app: NOT running. No app process found. So the macOS app is not the SIGTERM sender.
launchctl print: Clean single LaunchAgent, no stale labels.
So: no app, no duplicate service. The SIGTERM is coming from somewhere else. Could launchd itself be doing this on Tahoe due to memory pressure or a resource limit? Or could it be my own watchdog script I set up? I have a custom LaunchAgent com.julian.gateway-watchdog that runs every 60 seconds and calls openclaw gateway restart if the port is not listening — could that be the culprit?
Yes, honestly, that watchdog is now the top suspect.
Why:
- openclaw gateway restart under launchd is supposed to terminate the current gateway and start a new one
- your logs show exactly that pattern: clean SIGTERM, new PID, service comes back
- a watchdog that runs every 60s and restarts on a flaky “port not listening” check can absolutely create a self-inflicted restart loop
And OpenClaw’s launchd setup already uses KeepAlive, so a second watchdog is usually redundant.
My recommendation:
- Pause/disable com.julian.gateway-watchdog temporarily
- Watch whether the SIGTERM received entries stop
- If they stop, culprit confirmed
I would not blame Tahoe or launchd itself until that custom watchdog is out of the loop.
If you want to keep a watchdog later, make it much less trigger-happy:
- check gateway health, not just raw port presence
- require multiple consecutive failures
- add backoff
- do not restart on a single missed probe
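Those four points could look roughly like this in a watchdog script. It is a sketch under assumptions: the function name, state directory, threshold, and cooldown values are all hypothetical, and the fixed cooldown is only the simplest form of backoff. The health probe itself is left to the caller.

```bash
# Hypothetical watchdog settings; override them in the environment as needed.
STATE_DIR="${STATE_DIR:-${TMPDIR:-/tmp}/gateway-watchdog}"
FAIL_THRESHOLD="${FAIL_THRESHOLD:-3}"    # consecutive failed probes before acting
COOLDOWN_SECS="${COOLDOWN_SECS:-600}"    # minimum gap between two restarts

# watchdog_step PROBE_STATUS NOW
#   PROBE_STATUS: 0 if the health probe succeeded, non-zero otherwise
#   NOW:          current time in epoch seconds
# Prints one decision: ok, wait, cooldown, or restart.
watchdog_step() {
  local probe_status="$1" now="$2" fails=0 last_restart=0
  mkdir -p "$STATE_DIR"
  if [ -f "$STATE_DIR/fails" ]; then fails="$(cat "$STATE_DIR/fails")"; fi
  if [ -f "$STATE_DIR/last_restart" ]; then last_restart="$(cat "$STATE_DIR/last_restart")"; fi

  if [ "$probe_status" -eq 0 ]; then
    echo 0 > "$STATE_DIR/fails"          # a healthy probe resets the streak
    echo ok
    return 0
  fi

  fails=$((fails + 1))
  echo "$fails" > "$STATE_DIR/fails"
  if [ "$fails" -lt "$FAIL_THRESHOLD" ]; then
    echo wait                            # a single missed probe is not enough
    return 0
  fi
  if [ $((now - last_restart)) -lt "$COOLDOWN_SECS" ]; then
    echo cooldown                        # restarted recently: back off
    return 0
  fi
  echo "$now" > "$STATE_DIR/last_restart"
  echo 0 > "$STATE_DIR/fails"
  echo restart
}
```

The calling script would run its own health probe, pass the exit status and $(date +%s) to watchdog_step, and call openclaw gateway restart only when the decision is restart. With the defaults above, a 60-second timer needs three bad probes in a row and then restarts at most once every ten minutes.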
Docs for the built-in launchd behavior:
- https://docs.openclaw.ai/platforms/mac/child-process
- https://docs.openclaw.ai/platforms/macos
- https://docs.openclaw.ai/gateway/troubleshooting
If you want, paste the watchdog script/plist next, and I’ll sanity-check it.
The 16:22/16:23 SIGTERM was during a macOS reboot (software update). That's expected. After reboot gateway came back fine via KeepAlive. Still monitoring for unexpected SIGTERMs during normal operation.
Nice, that one sounds benign then.
A reboot/software update causing SIGTERM, followed by launchd/KeepAlive bringing it back, is exactly what I’d expect on macOS.
So the useful question now is just:
- do unexpected SIGTERM entries still happen during normal uptime?
- and if they do, is the custom watchdog still enabled at that moment?
If you catch another non-reboot SIGTERM, paste the timestamp plus whether the watchdog was active, and we can keep narrowing it down. Glad we ruled out at least one scary-looking case.
After macOS reboot (software update), gateway shows new PID each time it restarts during boot sequence. Is this normal behavior for launchd KeepAlive, or is there still an issue?