#gateway multiple crash


fiery roost

Gateway crashes multiple times daily on macOS. Logs show: lsof failed during initial stale-pid scan for port 18789: ETIMEDOUT — macOS Tahoe, MacBook Pro M-Chip. Please add a timeout to the lsof call in the gateway restart logic.

cold fox BOT

I’m checking the actual restart path now so I don’t make up a fix.

Looks like this may already be fixed on main, which is good news. I’m checking whether it landed after your install/version.

I checked the source, and this specific timeout is already in main.

In src/infra/restart-stale-pids.ts:

  • initial stale-pid scan uses spawnSync(..., { timeout: 2000 })
  • the port-free polling loop uses a shorter per-call timeout of 400ms

So if you’re still seeing lsof failed during initial stale-pid scan ... ETIMEDOUT, the next thing I need is your exact version:

openclaw --version
openclaw status --all
openclaw gateway status

If you’re on an older build, updating is the first move. If you’re already current and Tahoe still reproduces this, that smells like a separate macOS-specific bug, and it should go to GitHub with those outputs plus a log snippet:

https://docs.openclaw.ai/help/faq

fiery roost

I'm on OpenClaw 2026.4.14 (323493f), macOS Tahoe arm64. Still seeing the ETIMEDOUT errors today at 06:10, 06:15, and 14:40. So the fix in main may not fully cover the Tahoe edge case. Here's a log snippet:
[restart] lsof failed during initial stale-pid scan for port 18789: ETIMEDOUT

cold fox BOT

I’m checking one thing: whether that ETIMEDOUT line is actually fatal, or just a warning that leaves the real crash one line later.

Thanks, that helps, and yeah, I think you’re right that Tahoe still has an edge case here.

Important detail though: that ETIMEDOUT line is a warning, not the fatal crash by itself. In current code, the initial lsof scan times out, logs that warning, skips stale-pid cleanup, and restart continues. So the thing that actually matters is usually the next few lines after it.

Can you paste ~20-30 log lines after the warning from one of those timestamps? I’m specifically looking for:

  • EADDRINUSE
  • another gateway instance is already listening
  • launchctl ... failed
  • any uncaught/fatal stack

Also please run:

openclaw status --all
openclaw gateway status --deep

My current read is:

  1. the timeout is already there
  2. Tahoe is still causing lsof to hang often enough to make stale-pid cleanup inconclusive
  3. the real restart failure is probably one step later

If that next chunk shows EADDRINUSE or a launchd failure on 2026.4.14, I’d file it as a GitHub bug with the version + logs:

https://docs.openclaw.ai/help/faq
https://docs.openclaw.ai/gateway/troubleshooting
https://github.com/openclaw/openclaw/issues

Paste the post-timeout lines and I’ll help pin it down.

fiery roost

Version: 2026.4.14 (323493f), macOS 26.3.1 arm64, Node 24.13.1
After the lsof ETIMEDOUT warning, no EADDRINUSE follows — the gateway actually restarts fine. But the pattern repeats: new node PID starts, gateway binds successfully, then crashes again minutes later for an unknown reason. Today it happened at 06:10, 06:15, 12:17, 14:26, 14:32, 14:40 — each time a new PID.
Key observation: multiple new node PIDs starting throughout the day suggests the gateway keeps dying and launchd keeps restarting it. The lsof ETIMEDOUT is a symptom of the rapid restart loop, not the root cause.
What's killing the gateway between restarts?

cold fox BOT

Yep, that changes the picture. I’m checking the actual fatal paths so I can tell you what to grep for, instead of hand-waving.

Yep, I agree, the lsof timeout is just smoke here. The real signal is: node exits, launchd KeepAlive starts a fresh PID.

In OpenClaw, the usual ways the gateway dies are:

  • intentional restart/stop: signal SIGUSR1 received or signal SIGTERM received
  • crash path: Uncaught exception: ...
  • fatal rejection: FATAL unhandled rejection: ... or Unhandled promise rejection: ...
  • config fatal: CONFIGURATION ERROR - requires fix: ...
  • shutdown watchdog: shutdown timed out; exiting without full cleanup

So I’d grep for exactly those before/around each PID change:

openclaw status --all
openclaw gateway status --deep
grep -nE 'Uncaught exception|Unhandled promise rejection|FATAL unhandled rejection|CONFIGURATION ERROR|signal SIGTERM received|signal SIGUSR1 received|shutdown timed out' /tmp/openclaw/openclaw-*.log

How to read it:

  • SIGTERM / SIGUSR1 found → not a mystery crash, something is telling it to restart/stop
  • Unhandled/uncaught found → that’s the bug
  • nothing found → the process is probably being killed from outside Node, so next stop is macOS crash reports / external service interference
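
That reading can be scripted as a rough triage. This is only a sketch, not part of OpenClaw: classify_exit_line and summarize_exits are made-up helper names, and it assumes the log strings match the ones listed above exactly.

```shell
# Hypothetical helper: map one gateway log line to an exit category,
# using only the log strings listed earlier in this thread.
classify_exit_line() {
  case "$1" in
    *"signal SIGTERM received"*|*"signal SIGUSR1 received"*)
      echo "signal" ;;            # something asked the gateway to stop/restart
    *"Uncaught exception"*|*"Unhandled promise rejection"*|*"FATAL unhandled rejection"*)
      echo "crash" ;;             # failure inside the Node process
    *"CONFIGURATION ERROR"*)
      echo "config" ;;
    *"shutdown timed out"*)
      echo "shutdown-timeout" ;;
    *)
      echo "other" ;;             # unrelated log line
  esac
}

# Hypothetical helper: read log lines on stdin, print a count per category.
summarize_exits() {
  while IFS= read -r line; do
    classify_exit_line "$line"
  done | sort | uniq -c
}
```

For example, `summarize_exits < /tmp/openclaw/openclaw-gateway.log` prints one count per category. Ignore the "other" row; a summary with "signal" but no "crash" points at an external sender rather than a Node bug.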

If the grep is empty, also check whether gateway status --deep reports other gateway-like services detected, because a stale extra launchd service can absolutely create weird restart behavior.

Paste that grep output and I can help separate “internal crash” from “external kill.”

fiery roost

grep output shows: SIGTERM received — dozens of times, no uncaught exceptions or unhandled rejections at all.
Every single crash is signal SIGTERM received — not a Node crash. Something external is killing the gateway repeatedly.
Examples from yesterday:
12:11, 12:12, 12:14, 12:16 (x2), 12:26, 12:48, 12:49, 13:47, 14:59, 15:00
This is not a Node bug. Something on macOS Tahoe is sending SIGTERM to the gateway process repeatedly. Could be launchd, could be a watchdog, could be macOS memory pressure. How do I find out what's sending SIGTERM?

cold fox BOT

Good, that’s a real lead now. I’m checking which OpenClaw paths intentionally send SIGTERM, so we can separate “launchd/app did it” from “macOS or another process did it.”

There’s one more suspicious thing: the macOS app itself owns the LaunchAgent in Local mode, so if the app is “reconciling” the service, it can absolutely be the thing sending SIGTERM. Let me check whether there was already a restart-loop fix in this area.

I’d bet against memory pressure here. Repeated, clean SIGTERM is much more consistent with a service/app-managed restart than a random OS kill.

Most likely senders on macOS:

  1. launchd doing kickstart -k / stop / bootout on the LaunchAgent
  2. OpenClaw.app in Local mode, because it manages the LaunchAgent
  3. a second stale gateway service or old label fighting the current one
  4. some local automation calling openclaw gateway restart/stop

Also, there was literally a macOS fix named:

  • fix(macOS): remove redundant kickstart -k causing gateway restart loop

So if you have the macOS app running, it is suspect #1.

What I’d do next:

openclaw gateway status --deep
launchctl print gui/$UID/ai.openclaw.gateway
log show --last 24h --style compact --predicate '(process == "launchd" OR process == "OpenClaw" OR process == "node") AND (eventMessage CONTAINS[c] "ai.openclaw.gateway" OR eventMessage CONTAINS[c] "kickstart" OR eventMessage CONTAINS[c] "bootout" OR eventMessage CONTAINS[c] "SIGTERM")'

High-signal isolation test:

  • if you are running OpenClaw.app, temporarily run it in attach-only / no-launchd mode, or quit it
  • then run the gateway manually for a while instead of under launchd

openclaw gateway stop
openclaw gateway

If the SIGTERM storm disappears in manual mode, you’ve basically proven the sender is launchd/app lifecycle, not Node itself.

fiery roost

openclaw gateway status --deep: Single LaunchAgent running, pid 74123, state active, no second service detected.
OpenClaw.app: NOT running. No app process found. So the macOS app is not the SIGTERM sender.
launchctl print: Clean single LaunchAgent, no stale labels.
So: no app, no duplicate service. The SIGTERM is coming from somewhere else. Could launchd itself be doing this on Tahoe due to memory pressure or a resource limit? Or could it be my own watchdog script I set up? I have a custom LaunchAgent com.julian.gateway-watchdog that runs every 60 seconds and calls openclaw gateway restart if the port is not listening — could that be the culprit?

cold fox BOT

Yes, honestly, that watchdog is now the top suspect.

Why:

  • openclaw gateway restart under launchd is supposed to terminate the current gateway and start a new one
  • your logs show exactly that pattern: clean SIGTERM, new PID, service comes back
  • a watchdog that runs every 60s and restarts on a flaky “port not listening” check can absolutely create a self-inflicted restart loop

And OpenClaw’s launchd setup already uses KeepAlive, so a second watchdog is usually redundant.

My recommendation:

  1. Pause/disable com.julian.gateway-watchdog temporarily
  2. Watch whether the SIGTERM received entries stop
  3. If they stop, culprit confirmed
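
One way to pause it, as a sketch: launchctl bootout unloads a LaunchAgent from the running session. The label is the one from your message; pause_agent is a made-up helper name, and I'm assuming the agent is loaded in your user (gui) domain. It will typically load again at next login unless you also disable it.

```shell
# Hypothetical helper: unload a per-user LaunchAgent by label.
# Assumes the agent is loaded in the current user's gui domain.
pause_agent() {
  launchctl bootout "gui/$(id -u)/$1"
}

# Example (label taken from this thread; commented out so nothing runs on load):
#   pause_agent com.julian.gateway-watchdog
```

After running it, keep watching the gateway log for new SIGTERM received entries over a period at least as long as the usual gap between restarts.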

I would not blame Tahoe or launchd itself until that custom watchdog is out of the loop.
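
If you still want a watchdog after that test, a less trigger-happy check is one option. This is a minimal sketch, assuming your script currently restarts on a single failed probe (I haven't seen it): it only asks for a restart after several consecutive failures. should_restart and port_is_listening are hypothetical names, and the nc probe is just one way to test the port, not necessarily what your script uses.

```shell
# Hypothetical helper: succeed (exit 0) only if the probe command fails
# on every one of N attempts, pausing DELAY seconds between attempts.
# Usage: should_restart N DELAY probe-command [args...]
should_restart() {
  sr_attempts="$1"
  sr_delay="$2"
  shift 2
  sr_i=0
  while [ "$sr_i" -lt "$sr_attempts" ]; do
    if "$@"; then
      return 1                      # probe succeeded once: leave the gateway alone
    fi
    sr_i=$((sr_i + 1))
    # Skip the pause after the final attempt
    if [ "$sr_i" -lt "$sr_attempts" ]; then
      sleep "$sr_delay"
    fi
  done
  return 0                          # every attempt failed: a restart is justified
}

# Hypothetical probe: true if something accepts TCP connections on the port.
port_is_listening() {
  nc -z -w 2 127.0.0.1 "$1" >/dev/null 2>&1
}

# Example wiring (commented out so loading this file has no side effects):
#   if should_restart 3 5 port_is_listening 18789; then
#     openclaw gateway restart
#   fi
```

The point of the design is that one slow or failed probe during a normal restart no longer triggers another restart, which is the loop your logs are consistent with.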

fiery roost

The 16:22/16:23 SIGTERM was during a macOS reboot (software update). That's expected. After reboot gateway came back fine via KeepAlive. Still monitoring for unexpected SIGTERMs during normal operation.

cold fox BOT

Nice, that one sounds benign then.

A reboot/software update causing SIGTERM, followed by launchd/KeepAlive bringing it back, is exactly what I’d expect on macOS.

So the useful question now is just:

  • do unexpected SIGTERM entries still happen during normal uptime?
  • and if they do, is the custom watchdog still enabled at that moment?

If you catch another non-reboot SIGTERM, paste the timestamp plus whether the watchdog was active, and we can keep narrowing it down. Glad we ruled out at least one scary-looking case.

fiery roost

After macOS reboot (software update), gateway shows new PID each time it restarts during boot sequence. Is this normal behavior for launchd KeepAlive, or is there still an issue?