#[Bug] Gateway crash loop (Exit 0) after updating to 2026.4.26

serene arrow
#

Hi! We need some help with a restart loop after updating OpenClaw on our Linux VPS (Docker).

We updated to version 2026.4.26 using our usual procedure:

  1. Pulled ghcr.io/openclaw/openclaw:latest
  2. Stopped and removed the old container
  3. Ran the new one:
    docker run -d --name openclaw-gateway --restart always -p 18789:18789 -e DISCORD_TOKEN="<hidden>" -v /home/clawd/.openclaw:/home/node/.openclaw ghcr.io/openclaw/openclaw:latest
  4. Installed ffmpeg inside it: apt-get update && apt-get install -y ffmpeg

The Issue:
The container boots up, installs runtime deps, and reaches [gateway] starting channels and sidecars.... Then it freezes for about 60-70 seconds and silently restarts the whole process (loading configuration…), looping infinitely.

Our main question:
Does version 2026.4.26 require a new installation method, a different setup, or new ENV variables that we missed?

What we already investigated:

  • Not an OOM kill: Checked dmesg, the host is not killing it due to lack of RAM.
  • Clean Exit: docker inspect shows Exit Code: 0. Not a fatal Node crash.
  • Watchdog timeout? Since it exits right after ~60s, we suspect the [health-monitor] (startup-grace: 60s) is pulling the plug because a channel or plugin is hanging during startup.
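
The first two checks above can be read together from `docker inspect`. A minimal sketch, assuming the container name used in this thread; `classify_exit` is a hypothetical helper name, not part of OpenClaw or Docker:

```shell
# classify_exit EXIT_CODE OOM_KILLED
# Prints a one-line reading of the two values reported by `docker inspect`.
# (Hypothetical helper for illustration only.)
classify_exit() {
  if [ "$2" = "true" ]; then
    echo "oom-killed"
  elif [ "$1" = "0" ]; then
    echo "clean-exit"
  else
    echo "crash (exit code $1)"
  fi
}

# Usage on the host (container name taken from this thread):
#   classify_exit \
#     "$(docker inspect -f '{{.State.ExitCode}}' openclaw-gateway)" \
#     "$(docker inspect -f '{{.State.OOMKilled}}' openclaw-gateway)"
```

A `clean-exit` result only says the process ended voluntarily; it does not say which component asked it to stop.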

Are we missing a breaking change in the update procedure? Thanks in advance!
@brave compass

brave compassBOT
#

Luma, I don't think you missed a brand-new install method/env var. This looks much more like a current Docker restart-path/runtime-deps regression than a normal channel watchdog.

Two things jump out:

  1. Exit code 0 after ~60s strongly matches a config-triggered "full process restart" path in Docker, not OOM/crash. There's an open issue for ghcr.io/openclaw/openclaw:latest where SIGUSR1/full restart exits PID 1 with code 0 inside Docker. Check your logs for:
[reload] config change requires gateway restart
[gateway] signal SIGUSR1 received
[gateway] restart mode: full process restart

That would explain the "silent" clean restart. (github.com)

  2. Your raw docker run doesn't quite match the current Docker docs/Compose path. The docs expect the Docker Compose/setup flow, --bind lan for host port publishing, and /healthz and /readyz checks. If you keep using raw docker run, override the command to bind LAN and use the current Discord env name (DISCORD_BOT_TOKEN, not DISCORD_TOKEN). (docs.openclaw.ai)

Useful next diagnostic, without changing state:

docker logs --tail 500 openclaw-gateway 2>&1 | grep -Ei 'reload|SIGUSR1|restart mode|health-monitor|plugin-runtime|ENOTEMPTY|openclaw-unknown|startup trace|error|fatal'

If you see SIGUSR1 / restart mode: full process restart, this is probably that Docker restart bug, not a missing env var. If you see plugin-runtime-deps, openclaw-unknown-*, or ENOTEMPTY, it's likely the bundled plugin runtime-deps staging path instead.
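
The decision rule in the paragraph above can be sketched as a small filter over the log text. This is only an illustration: `classify_restart_logs` is a hypothetical name, and the patterns are the keywords quoted in this reply, not an official list.

```shell
# classify_restart_logs: reads gateway log text on stdin and prints which of
# the two suspected paths the lines point to. (Hypothetical helper.)
classify_restart_logs() {
  logs=$(cat)
  if printf '%s\n' "$logs" | grep -Eq 'SIGUSR1|restart mode: full process restart'; then
    echo "restart-path"
  elif printf '%s\n' "$logs" | grep -Eq 'plugin-runtime-deps|openclaw-unknown-|ENOTEMPTY'; then
    echo "plugin-runtime-deps"
  else
    # Neither keyword family is present; the logs alone do not decide it.
    echo "inconclusive"
  fi
}
```

It could be fed with `docker logs --tail 500 openclaw-gateway 2>&1 | classify_restart_logs`. An `inconclusive` result means neither explanation is supported by the captured lines.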

Also: installing ffmpeg manually inside a running container is brittle after image replacement. Better to bake it into the image or use the Docker setup env for extra apt packages later; not likely the cause of this loop though.

If you can paste the redacted grep output above, we can narrow it to restart-path vs plugin-runtime-deps pretty quickly. 🦐

serene arrow
#

Here is the output. As you can see, it's completely empty regarding SIGUSR1, reload, or any of the other error keywords. It only shows the health monitor starting:

clawd@srv1374478:~$ sudo docker logs --tail 500 openclaw-gateway 2>&1 | grep -Ei 'reload|SIGUSR1|restart mode|health-monitor|plugin-runtime|ENOTEMPTY|openclaw-unknown|startup trace|error|fatal'
[sudo] password for clawd:
2026-04-28T08:11:31.423+00:00 [health-monitor] started (interval: 300s, startup-grace: 60s, channel-connect-grace: 120s)
clawd@srv1374478:~$

brave compassBOT
#

Yep, that output rules out the SIGUSR1/config-reload path.

Also important: the [health-monitor] started ... startup-grace: 60s line is just initialization. Its first real interval is 300s, so it should not be killing the process at ~60s.

Next I'd isolate startup like this:

sudo docker stop openclaw-gateway

sudo docker run --rm --name openclaw-gateway-test \
  -p 18789:18789 \
  -e DISCORD_BOT_TOKEN="<hidden>" \
  -e OPENCLAW_SKIP_CHANNELS=1 \
  -e OPENCLAW_GATEWAY_STARTUP_TRACE=1 \
  -e OPENCLAW_LOG_LEVEL=debug \
  -v /home/clawd/.openclaw:/home/node/.openclaw \
  ghcr.io/openclaw/openclaw:latest \
  node dist/index.js gateway --bind lan --port 18789 --allow-unconfigured

Two notes:

  • Use DISCORD_BOT_TOKEN, not DISCORD_TOKEN, unless your config explicitly references DISCORD_TOKEN.
  • The explicit node dist/index.js gateway --bind lan --port 18789 matches the current Docker docs/Compose path better than relying on the image default command.

If that stays up, remove only OPENCLAW_SKIP_CHANNELS=1 and try again. If it then exits, the culprit is channel/sidecar startup.

If it still exits even with channels skipped, paste the startup trace around the last completed phase; that's the useful clue now. 🦐

serene arrow
#

We ran the diagnostic command with OPENCLAW_SKIP_CHANNELS=1 and OPENCLAW_GATEWAY_STARTUP_TRACE=1.

You were totally right: it's not the channels! The container still hangs completely, but now we caught exactly where it stops. It freezes right after control-ui.seed and never proceeds further, until it silently exits ~60s later.

Here is the tail of the trace right before it hangs indefinitely:

2026-04-28T08:57:11.608+00:00 [gateway] startup trace: config.snapshot 671.0ms total=671.0ms eventLoopMax=0.0ms
2026-04-28T08:57:11.627+00:00 [gateway] startup trace: config.auth 16.4ms total=689.0ms eventLoopMax=0.0ms
2026-04-28T08:57:11.629+00:00 [gateway] startup trace: control-ui.seed 0.2ms total=691.7ms eventLoopMax=0.0ms
Since it hangs here, it looks like a core initialization block. Could it be trying to extract or overwrite the Canvas/UI assets into the mounted volume (/home/node/.openclaw) and getting stuck on a file lock or a permission issue from the previous version? Our host volume is strictly chowned to our user, which worked fine on the older version.

What is the safest way to clear the UI seed/cache or unblock this step?

brave compassBOT
#

Nice catch, and no, I would not clear UI/cache first.

control-ui.seed already completed in 0.2ms. The next phase in the code is plugins.bootstrap, so the hang is almost certainly plugin startup/runtime-deps bootstrap, not Canvas/UI asset extraction.
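
To read a trace the same way, one option is to pull out the last phase that reported a duration. A minimal sketch, assuming the `startup trace:` line layout shown in the pasted output above; `last_trace_phase` is a hypothetical helper name:

```shell
# last_trace_phase: reads log text on stdin and prints the name and duration
# of the last "startup trace:" entry. (Hypothetical helper.)
last_trace_phase() {
  awk '/startup trace:/ {
         # The phase name and its duration follow the "trace:" token.
         for (i = 1; i < NF; i++)
           if ($i == "trace:") { phase = $(i + 1); took = $(i + 2) }
       }
       END { if (phase != "") print phase, took }'
}
```

A printed phase has already finished, so a hang is in whatever runs after it. The trace itself does not name that next phase.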

Safest test:

sudo docker stop openclaw-gateway

sudo docker run --rm --name openclaw-gateway-test \
  -p 18789:18789 \
  -e DISCORD_BOT_TOKEN="<hidden>" \
  -e OPENCLAW_SKIP_CHANNELS=1 \
  -e OPENCLAW_GATEWAY_STARTUP_TRACE=1 \
  -e OPENCLAW_LOG_LEVEL=debug \
  -e OPENCLAW_PLUGIN_STAGE_DIR=/tmp/openclaw-plugin-runtime-deps-test \
  -v /home/clawd/.openclaw:/home/node/.openclaw \
  ghcr.io/openclaw/openclaw:latest \
  node dist/index.js gateway --bind lan --port 18789 --allow-unconfigured

If that boots, the stuck part is likely the existing bundled plugin runtime-deps cache under the mounted .openclaw.

Then do the safe fix by moving, not deleting:

sudo docker stop openclaw-gateway
sudo mv /home/clawd/.openclaw/plugin-runtime-deps \
  /home/clawd/.openclaw/plugin-runtime-deps.bak.$(date +%Y%m%d-%H%M%S)

For a more durable Docker setup, keep plugin runtime deps off the main config bind mount:

sudo mkdir -p /home/clawd/openclaw-plugin-runtime-deps
sudo chown -R 1000:1000 /home/clawd/openclaw-plugin-runtime-deps

then run with:

-e OPENCLAW_PLUGIN_STAGE_DIR=/var/lib/openclaw/plugin-runtime-deps \
-v /home/clawd/openclaw-plugin-runtime-deps:/var/lib/openclaw/plugin-runtime-deps
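
Before starting the container, it may be worth confirming that the host directory really has the expected owner. A small sketch, assuming GNU `stat` on the Linux host; `owned_by` is a hypothetical helper, and UID 1000 is simply the value used in the chown command above:

```shell
# owned_by DIR UID: succeeds when DIR exists and is owned by numeric UID.
# Relies on GNU stat (-c), as found on typical Linux hosts.
# (Hypothetical helper for illustration only.)
owned_by() {
  [ -d "$1" ] && [ "$(stat -c '%u' "$1")" = "$2" ]
}

# Usage on the host:
#   owned_by /home/clawd/openclaw-plugin-runtime-deps 1000 \
#     || echo "stage dir missing or not owned by UID 1000"
```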

Docs: https://docs.openclaw.ai/install/docker and https://docs.openclaw.ai/cli/config

serene arrow
#

Just a quick update on the sequence of events: during your isolated test with the temporary plugin dir, the bot actually came back online perfectly!

However, when we applied the definitive fix and started the container normally, she went offline again and fell right back into the 60-second Exit 0 loop.

Looking closely at the logs right before it dies, we caught the exact culprit for this new loop:

[model-pricing] OpenRouter pricing fetch failed (timeout 60s): TimeoutError: The operation was aborted due to timeout
[model-pricing] LiteLLM pricing fetch failed (timeout 60s): TimeoutError: The operation was aborted due to timeout

It seems the gateway is trying to fetch model pricing on startup, but the network request is hanging/blocked on our VPS. Because this fetch hangs for exactly 60 seconds, it perfectly trips the watchdog's startup-grace: 60s, causing the silent reboot before the channels can start.
Is there an ENV variable we can use to bypass this fetch (e.g., something like OPENCLAW_SKIP_PRICING=1), or a way to extend the watchdog grace period so it can fail gracefully and continue the boot process?
brave compassBOT
#

Great, that's a real culprit. There is a supported bypass, but it's config, not env:

models: {
  pricing: {
    enabled: false
  }
}

Since the gateway won't stay up, set it with a one-shot container against the mounted config:

sudo docker run --rm \
  -v /home/clawd/.openclaw:/home/node/.openclaw \
  ghcr.io/openclaw/openclaw:latest \
  node dist/index.js config set models.pricing.enabled false --strict-json

Then start the real container again, keeping the plugin stage dir fix:

-e OPENCLAW_PLUGIN_STAGE_DIR=/var/lib/openclaw/plugin-runtime-deps \
-v /home/clawd/openclaw-plugin-runtime-deps:/var/lib/openclaw/plugin-runtime-deps

And yeah: I'd keep DISCORD_BOT_TOKEN + explicit node dist/index.js gateway --bind lan --port 18789 --allow-unconfigured too.

I don't see an OPENCLAW_SKIP_PRICING env var in the current source/docs; the documented knob is models.pricing.enabled: false. It skips the OpenRouter/LiteLLM pricing catalog fetches during Gateway startup.

Docs: https://docs.openclaw.ai/gateway/configuration-reference and https://docs.openclaw.ai/reference/token-use

serene arrow
#

Hey! We tried the exact fix you suggested (node dist/index.js config set models.pricing.enabled false --strict-json), but unfortunately, it throws an error in the CLI:

Error: Config validation failed: models: Unrecognized key: "pricing"

We even tried manually injecting the models.pricing block directly into openclaw.json to bypass the CLI. But when the container starts, the internal validator catches it immediately and crashes the container into a restart loop with this error:

Config invalid
File: ~/.openclaw/openclaw.json
Problem:
  - models: Unrecognized key: "pricing"

Run: openclaw doctor --fix

It seems that the pricing key under models might have been removed, renamed, or moved in this new image version schema.
Could you double-check the current config schema for the latest tag and let us know the correct key to disable the OpenRouter/LiteLLM fetch, or if there is another way to bypass the 60s startup-grace timeout?
Thank you!
brave compassBOT
#

Ah, you're right. I checked the current source/docs, but your image is v2026.4.26. In 2026.4.26, models.pricing.enabled is not in the schema yet. That toggle exists after that tag, so my previous command was ahead of your image. Sorry, good catch.

First remove the invalid key:

sudo docker run --rm \
  -v /home/clawd/.openclaw:/home/node/.openclaw \
  ghcr.io/openclaw/openclaw:latest \
  node dist/index.js doctor --fix

For 2026.4.26, I don't see a supported global "skip pricing" env/config toggle. Practical workarounds:

  1. Try IPv4-first if this is an IPv6 blackhole/network stall:

-e NODE_OPTIONS="--dns-result-order=ipv4first"

  2. Make the pricing fetch unnecessary by adding local cost metadata for the configured model refs that lack it. That's more annoying, but avoids the OpenRouter/LiteLLM catalog lookup in 2026.4.26.

  3. Upgrade once a newer image is available with the models.pricing.enabled: false schema support.

There isn't a documented config key to extend the 60s startup-grace itself in 2026.4.26; only gateway.channelHealthCheckMinutes: 0 disables health-monitor restarts, but it won't stop the pricing fetch.

serene arrow
#

Thanks for confirming the schema version! We successfully cleaned up the openclaw.json (removed the invalid pricing key) to stop the crash loop.

We then tried the NODE_OPTIONS="--dns-result-order=ipv4first" workaround, but unfortunately, the container still hits the exact same 60s timeout and restarts. It seems the network on our VPS strictly blocks or stalls the OpenRouter/LiteLLM fetch regardless of IPv4/IPv6.

Since there is no documented config or ENV in v2026.4.26 to skip this fetch, and we can't bypass the 60s startup-grace watchdog, the container simply cannot boot in this environment.

At this point, we have two paths forward to get the bot back online:

  1. Wait/Upgrade to a newer image version that officially supports the models.pricing.enabled: false schema config.
  2. Downgrade our image back to the previous stable version (before v2026.4.26) where this pricing fetch timeout wasn't tripping the watchdog.

What is the safest way to downgrade without corrupting our current openclaw.json config? Can we simply pull the older tag (e.g., v2026.4.x) and run it with the exact same mount paths?

brave compassBOT
#

Yep, don't keep fighting this one. Downgrade is reasonable.

Safest approach: same mount paths are fine, but pin the image tag and back up first. Also avoid 2026.4.25 here; it also has the 60s pricing timeout. I'd try 2026.4.24, then 2026.4.23 if needed.

stamp=$(date +%Y%m%d-%H%M%S)

sudo docker rm -f openclaw-gateway

sudo cp -a /home/clawd/.openclaw /home/clawd/.openclaw.bak.$stamp
sudo cp -a /home/clawd/openclaw-plugin-runtime-deps /home/clawd/openclaw-plugin-runtime-deps.bak.$stamp 2>/dev/null || true
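
The timestamped copy above can also be wrapped so that it refuses to run against a missing directory and reports where the backup went. A minimal sketch; `backup_dir` is a hypothetical helper name, and the paths in the usage comment are the ones from this thread:

```shell
# backup_dir SRC: copies SRC to SRC.bak.<timestamp>, preserving attributes,
# and prints the backup path. Fails without copying if SRC is missing or if
# the target name already exists. (Hypothetical helper.)
backup_dir() {
  src=${1%/}
  [ -d "$src" ] || { echo "backup_dir: no such directory: $src" >&2; return 1; }
  dest="$src.bak.$(date +%Y%m%d-%H%M%S)"
  [ ! -e "$dest" ] || { echo "backup_dir: already exists: $dest" >&2; return 1; }
  cp -a "$src" "$dest" && echo "$dest"
}

# Usage on the host (run with sufficient privileges for the source files):
#   backup_dir /home/clawd/.openclaw
#   backup_dir /home/clawd/openclaw-plugin-runtime-deps || true   # optional dir
```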

Validate config against the older image first:

sudo docker run --rm \
  -v /home/clawd/.openclaw:/home/node/.openclaw \
  ghcr.io/openclaw/openclaw:2026.4.24 \
  node dist/index.js config validate

If that passes, run pinned:

sudo docker run -d --name openclaw-gateway --restart always \
  -p 18789:18789 \
  -e DISCORD_BOT_TOKEN="<hidden>" \
  -e OPENCLAW_NO_AUTO_UPDATE=1 \
  -e OPENCLAW_PLUGIN_STAGE_DIR=/var/lib/openclaw/plugin-runtime-deps \
  -v /home/clawd/.openclaw:/home/node/.openclaw \
  -v /home/clawd/openclaw-plugin-runtime-deps:/var/lib/openclaw/plugin-runtime-deps \
  ghcr.io/openclaw/openclaw:2026.4.24 \
  node dist/index.js gateway --bind lan --port 18789 --allow-unconfigured

If you see only Config was last written by a newer OpenClaw, that's usually just a warning. If validation fails on a truly new config key, restore the backup or remove that specific key.

Docs: https://docs.openclaw.ai/install/docker
And please keep using pinned tags for now, not latest.

serene arrow
#

Hey Krill! We thought about it, and we really want to stay on the latest version (v2026.4.26) rather than downgrading, especially since your diagnostic test proved it can work in our environment.

To recap: during your isolated test (using OPENCLAW_SKIP_CHANNELS=1 and the isolated OPENCLAW_PLUGIN_STAGE_DIR), the container successfully bypassed the watchdog loop AND the bot actually came online on Discord.

However, when we try to take those exact same test parameters and run them as our permanent detached container (using -d --restart always), the bot stays offline and hangs at [gateway] starting channels and sidecars....

Since we saw with our own eyes that your test command kept the container alive and the bot online, how can we perfectly replicate that "test state" as our permanent production setup on latest?

Did the test work because of OPENCLAW_GATEWAY_STARTUP_TRACE=1 or OPENCLAW_LOG_LEVEL=debug altering the execution timing? What is the exact docker run command we should use to force latest to run permanently exactly like it did during your diagnostic test?

brave compassBOT
#

Yep, to replicate the successful test, do not include OPENCLAW_SKIP_CHANNELS=1. That flag prevents Discord from starting, so a bot "online" state cannot be caused by that same container with channels skipped.

Use the same isolated runtime-deps path, but make it persistent:

sudo docker rm -f openclaw-gateway

sudo mkdir -p /home/clawd/openclaw-plugin-runtime-deps
sudo chown -R 1000:1000 /home/clawd/openclaw-plugin-runtime-deps

sudo docker run -d --name openclaw-gateway --restart always \
  -p 18789:18789 \
  -e DISCORD_BOT_TOKEN="<hidden>" \
  -e OPENCLAW_NO_AUTO_UPDATE=1 \
  -e OPENCLAW_GATEWAY_STARTUP_TRACE=1 \
  -e OPENCLAW_LOG_LEVEL=debug \
  -e NODE_OPTIONS="--dns-result-order=ipv4first" \
  -e OPENCLAW_PLUGIN_STAGE_DIR=/tmp/openclaw-plugin-runtime-deps-test \
  -v /home/clawd/.openclaw:/home/node/.openclaw \
  -v /home/clawd/openclaw-plugin-runtime-deps:/tmp/openclaw-plugin-runtime-deps-test \
  ghcr.io/openclaw/openclaw:latest \
  node dist/index.js gateway --bind lan --port 18789 --allow-unconfigured

STARTUP_TRACE / LOG_LEVEL=debug shouldn't be required, but keep them for now since they match the known-good diagnostic shape and give evidence if it wedges again.

If this still hangs, the earlier "bot online" was likely from another gateway/container using the same Discord token. Quick sanity check:

sudo docker ps -a | grep -i openclaw
serene arrow
#

Hey Krill, just closing the loop on this! We wiped the slate clean, applied the isolated plugin path, and launched 2026.4.24.

You were totally right:

  1. We passed both DISCORD_TOKEN and DISCORD_BOT_TOKEN in the ENV just to be absolutely sure the legacy config caught it.
  2. The pricing fetch STILL times out on .24 just like it did on .26, but the crucial difference is that on .24, the container simply warns and survives the timeout, continuing the boot process.
  3. The Discord provider started successfully and the bot is online!

We will stay pinned on .24 until a future release explicitly supports the schema toggle for the pricing catalog or fixes the fatal crash handler.

Thanks for sticking with us through the diagnostic!