#gateway oddities


halcyon moon
#

Please help. When I follow the logs, they show the gateway closed, and then it might come back, and so on. On Discord, messages can show abnormally long loading times, with the typing indicator coming and going. The logs also show gateway not reachable, yet the turn still continues. status --deep won't work because it says the gateway is not reachable.

shy orioleBOT
#

This sounds like two “gateways” may be getting mixed together:

  • Gateway websocket closed: 1000 in Discord logs = Discord’s gateway websocket reconnecting.
  • openclaw status --deep says gateway not reachable = OpenClaw’s local Gateway daemon is unreachable.
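One rough way to tell the two apart is to count each signature in the same log file. This is only a sketch: it assumes the log wording matches the two phrases quoted above, and count_signature is a made-up helper name, not an OpenClaw command.

```shell
# Count log lines containing a fixed phrase, case-insensitively.
# Prints 0 when nothing matches or the file is missing/unreadable.
count_signature() {
  phrase="$1"
  logfile="$2"
  [ -r "$logfile" ] || { echo 0; return 0; }
  # grep -c prints the count but exits 1 on zero matches, so mask the status.
  grep -ciF -- "$phrase" "$logfile" || true
}

# Compare the two failure signatures in today's log (path taken from this thread).
log="/tmp/openclaw/openclaw-$(date +%F).log"
echo "discord websocket closes:  $(count_signature 'Gateway websocket closed' "$log")"
echo "local gateway unreachable: $(count_signature 'not reachable' "$log")"
```

If only the first count grows, the Discord side is reconnecting; if the second grows, the local daemon is the one going away.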

First thing: please paste:

openclaw --version
openclaw gateway status
openclaw status --all
openclaw doctor --deep

If status --deep fails, that’s expected while the OpenClaw Gateway is unreachable — use status --all instead.

Also grab the relevant log tail:

grep -Ei 'discord|Gateway websocket|READY|reconnect|identify|fatal|error|listener|EADDRINUSE|not reachable' /tmp/openclaw/openclaw-$(date +%F).log | tail -200

My suspicion: either the Gateway process is restarting/crashing, or there’s a Discord websocket reconnect loop from duplicate bot-token usage / duplicate gateway instance. Check for duplicates too:

pgrep -af 'openclaw.*gateway|openclaw-gateway|node .*openclaw'
docker ps

Docs for the gateway runbook are here: https://docs.openclaw.ai/gateway/troubleshooting

#

That status is useful: this is not a port conflict. The port is “in use” because your OpenClaw Gateway process owns it.

The bad part is:

Runtime: running
Listening: 127.0.0.1:18789
Connectivity probe: failed
timeout

That usually means the process is alive and bound to the port, but the Gateway isn’t answering the WebSocket probe — likely startup hang, event-loop stall, or the process is overloaded.

Please run this next:

openclaw gateway restart
sleep 20
openclaw gateway status
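The fixed sleep 20 is a guess at how long startup takes. A small polling loop is one alternative. This is a sketch only: wait_until is a hypothetical helper, and the usage line assumes the Gateway serves /healthz on that port (it is probed that way further down this thread).

```shell
# Re-run a command about once per second until it succeeds
# or the deadline passes. Always tries at least once.
wait_until() {
  timeout_s="$1"
  shift
  deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    if "$@" >/dev/null 2>&1; then
      return 0              # command succeeded
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1              # gave up
    fi
    sleep 1
  done
}

# Usage (not run here):
#   wait_until 20 curl -fsS --max-time 2 http://127.0.0.1:18789/healthz \
#     && echo "gateway answered" || echo "still not answering"
```

The upside over a plain sleep is that a healthy restart returns immediately, and a wedged one fails with a clear exit status.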

If it still says probe timeout, paste:

journalctl --user -u openclaw-gateway.service -n 200 --no-pager
grep -Ei 'fatal|error|exception|timeout|liveness|event.loop|event-loop|memory|heap|rss|discord|Gateway websocket|READY|reconnect' /tmp/openclaw/openclaw-2026-04-28.log | tail -250
ps -p 3535 -o pid,etimes,%cpu,%mem,rss,vsz,stat,cmd

Also: don’t chase the "Port 18789 is already in use" line yet. That’s expected while pid 3535 is listening.

One extra note: there’s a main-branch fix today around exposing event-loop health in readiness, so if 2026.4.26 keeps doing this, this may be worth a GitHub issue with diagnostics. But first let’s see whether it recovers after restart or immediately wedges again.

#

This confirms it: the Gateway process is alive but wedged/unresponsive.

The big clue is this line:

Git main @ 4340cb74 · behind 856 · fetch failed
Command: /home/ubuntu/openclaw/dist/index.js gateway

So you’re running from a source checkout, not the npm-installed latest. npm latest 2026.4.26 doesn’t really help if systemd is launching ~/openclaw/dist/index.js, and that checkout is 856 commits behind.

I’d do this next:

cd ~/openclaw
git fetch origin main
git pull --ff-only
pnpm install
pnpm build
openclaw gateway install --force
openclaw gateway restart
sleep 20
openclaw gateway status

If git fetch still fails, fix that first — otherwise you’re stuck on an old source tree/dist.
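To see how stale the checkout is without reading the status banner, plain git can count it. A minimal sketch, assuming the upstream branch is origin/main as in the commands above; commits_behind is an illustrative name.

```shell
# Print how many commits the checkout is behind its remote-tracking branch.
# Only local refs are read, so run `git fetch` first for a current answer.
commits_behind() {
  repo="${1:-.}"
  upstream="${2:-origin/main}"
  git -C "$repo" rev-list --count "HEAD..$upstream"
}

# Usage (not run here):
#   commits_behind ~/openclaw
```

A result of 0 after a successful fetch means the source tree is current and only the build and service install remain.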

If it still wedges after updating/rebuilding, paste:

journalctl --user -u openclaw-gateway.service -n 200 --no-pager
#

Yep — same diagnosis. systemd is saying “I successfully started a process,” but the OpenClaw CLI is saying “that process is not answering as a Gateway.”

So the next useful thing is startup logs for pid 7423 / that restart, not more status output:

journalctl --user -u openclaw-gateway.service -b -n 300 --no-pager
tail -n 200 ~/.openclaw/logs/gateway.err.log ~/.openclaw/logs/gateway.log 2>/dev/null
grep -Ei 'fatal|error|exception|timeout|liveness|event.loop|event-loop|memory|heap|rss|listening|ready|discord|Gateway websocket|reconnect' /tmp/openclaw/openclaw-2026-04-28.log | tail -300
ps -p 7423 -o pid,etimes,%cpu,%mem,rss,vsz,stat,cmd

Also: did you update/rebuild the source checkout yet? Your earlier report said systemd is launching:

/home/ubuntu/openclaw/dist/index.js

and that checkout was 856 commits behind. If it’s still behind, I’d fix that before digging deeper, because the installed npm version won’t affect this service path.

#

That log is actually pretty telling.

These lines:

client initialized ... awaiting gateway readiness
...
[heartbeat] started

mean startup is getting partway through, but it’s not reaching a healthy Gateway-ready state. The unresolved Discord channels are probably just deleted/missing channels, not fatal. The Tailscale Serve failure is worth fixing later, but it should not make local ws://127.0.0.1:18789 time out.

The big thing: this matches a known-ish “Discord gateway startup stuck at awaiting gateway readiness” area, and your checkout is 856 commits behind. I’d update/rebuild before chasing logs further:

cd ~/openclaw
git status -sb
git fetch origin main
git merge --ff-only origin/main
pnpm install
pnpm build
pnpm openclaw gateway install --force
pnpm openclaw gateway restart
sleep 20
pnpm openclaw gateway status

If git fetch fails, paste the exact error. If fetch fails but origin/main is already ahead locally, this may still work:

cd ~/openclaw
git merge --ff-only origin/main

If it still wedges after that, temporarily isolate Tailscale:

openclaw config set gateway.tailscale.mode off
openclaw gateway restart
sleep 20
openclaw gateway status

That’s reversible, and helps tell us whether the tailscale serve failed path is involved.

#

That update failure is separate and likely the root cause of being stuck on the old checkout.

unable to write file ... fatal: Could not reset index file usually means one of:

  1. disk/inodes full
  2. repo files owned by another user, often from a previous sudo git/pnpm
  3. filesystem / repo dir not writable

Please don’t rerun the update yet — check this first:

cd ~/openclaw

df -h .
df -ih .
id
git status -sb

ls -ld . .git vendor vendor/a2ui vendor/a2ui/specification vendor/a2ui/specification/0.9 2>/dev/null

find . -maxdepth 4 ! -user "$(id -un)" -printf '%u:%g %p\n' 2>/dev/null | head -50

If df shows 100%, free space first.
If the find output shows lots of root:root or another user inside ~/openclaw, that’s probably it.

Also, if you ran any of the build/update commands with sudo, don’t — the source checkout should be writable by your normal ubuntu user.
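The ownership check can be reduced to a single number, which is easier to paste than a long listing. A sketch under the same assumptions as the find command above (depth 4, current user should own everything); count_foreign_owned is a hypothetical helper name.

```shell
# Count entries under a directory that are NOT owned by the current user.
# Uses the numeric uid so it also works where the user has no passwd entry.
count_foreign_owned() {
  dir="${1:-.}"
  depth="${2:-4}"
  find "$dir" -maxdepth "$depth" ! -user "$(id -u)" 2>/dev/null | wc -l | tr -d ' '
}

# Usage (not run here):
#   count_foreign_owned ~/openclaw      # 0 means ownership looks clean
```

Anything above 0 is worth listing in full with the find command shown earlier before changing ownership.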

#

Disk/inodes look fine, so next suspect is permissions/ownership in ~/openclaw or .git.

Run these:

cd ~/openclaw

id
git status -sb

ls -ld . .git .git/index vendor vendor/a2ui vendor/a2ui/specification vendor/a2ui/specification/0.9 2>/dev/null

find . -maxdepth 4 ! -user "$(id -un)" -printf '%u:%g %p\n' 2>/dev/null | head -80

If you see root:root in that output, fix ownership from the parent dir:

sudo chown -R "$(id -un):$(id -gn)" ~/openclaw

Then retry (note that git reset --hard discards any uncommitted local changes):

cd ~/openclaw
git reset --hard HEAD
git fetch origin main
git merge --ff-only origin/main

If the find output is clean and it still can’t write, paste ls -ld output — we’ll check file modes/immutable flags next.

#

Nice — ownership looks basically OK, and behind 3 means the fetch mostly worked. The earlier behind 856 is gone.

Two possibilities now:

  1. the repo is fine now and only needs the last few commits
  2. the failing files are one level deeper than what we checked (.../0.9/json/*)

Try the simple path first:

cd ~/openclaw
git status --porcelain
git pull --ff-only

If git status --porcelain prints local changes, paste it before pulling.

#

If git pull works:

pnpm install
pnpm build
pnpm openclaw gateway install --force
pnpm openclaw gateway restart
sleep 20
pnpm openclaw gateway status

If it fails with unable to write file again, check the deeper path:

cd ~/openclaw
ls -ld vendor/a2ui/specification/0.9/json
ls -l vendor/a2ui/specification/0.9/json/common_types.json vitest.config.ts zizmor.yml 2>/dev/null
lsattr -d vendor/a2ui/specification/0.9/json vitest.config.ts zizmor.yml 2>/dev/null

Paste that if it still errors.

#

Good, repo update is fixed ✅

Now rebuild/reinstall the service so systemd stops launching stale dist:

cd ~/openclaw
pnpm install
pnpm build
pnpm openclaw gateway install --force
pnpm openclaw gateway restart
sleep 20
pnpm openclaw gateway status

If pnpm build fails, paste the first real error. If status still says runtime running but connectivity timeout after this, paste the fresh journalctl tail again.

#

It should work, but in your case openclaw update hit the source-checkout updater’s preflight worktree and failed while writing files.

Since you’re running from:

/home/ubuntu/openclaw/dist/index.js

you’re in the git/source checkout flow, not the npm-global package flow. openclaw update does extra safety steps there: fetch, preflight temp worktree/build, then update/build/restart. Your manual git pull --ff-only just succeeded, so the repo itself is probably OK now.

For this incident, I’d skip openclaw update and finish manually:

cd ~/openclaw
pnpm install
pnpm build
pnpm openclaw gateway install --force
pnpm openclaw gateway restart
sleep 20
pnpm openclaw gateway status

After things are healthy, you can test the updater again with:

pnpm openclaw update status
pnpm openclaw update --dry-run

Docs: https://docs.openclaw.ai/cli/update

#

Yes — openclaw update --channel dev is supposed to use the git checkout and update to latest main.

The flow is roughly:

set channel dev → use ~/openclaw git checkout → fetch origin/main → preflight/build-check → rebase/update → install deps → build → doctor/plugin sync → restart gateway

In your case, it started that flow but failed at the preflight worktree step:

Preparing preflight worktree
error: unable to write file ...
fatal: Could not reset index file to revision 'HEAD'

So the intended answer is “yes”, but the updater itself got blocked before it could finish. Since your manual git pull --ff-only worked afterward, finishing with manual pnpm install && pnpm build && pnpm openclaw gateway restart is the fastest recovery path.

#

After the gateway is healthy, pnpm openclaw update --channel dev --dry-run would be the safe way to see whether the updater is still broken.

#

Yeah — at this point doctor won’t fix it, because the service does start; it just never becomes reachable. We need to isolate whether startup is hanging on Discord/Tailscale vs core Gateway.

This will briefly take the bot offline:

cd ~/openclaw
pnpm openclaw gateway stop

Then run the Gateway in the foreground with channels skipped + startup tracing:

cd ~/openclaw
OPENCLAW_SKIP_CHANNELS=1 OPENCLAW_GATEWAY_STARTUP_TRACE=1 pnpm openclaw gateway --tailscale off --port 18789 --verbose

Leave that running for ~30s. In a second terminal, run:

cd ~/openclaw
pnpm openclaw gateway status
curl -i --max-time 5 http://127.0.0.1:18789/healthz
curl -i --max-time 5 http://127.0.0.1:18789/readyz

If this works, the core Gateway is fine and the hang is likely channel/Tailscale startup.
If this still fails, paste the foreground startup trace — that should show exactly where it wedges.
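The curl exit status alone already separates "nothing is listening" from "listening but not answering". A rough sketch: the exit-code meanings are curl's documented ones (7 = failed to connect, 22 = HTTP status 400 or above when -f is used, 28 = operation timed out), while the labels and function names are made up for illustration.

```shell
# Map a curl exit code to a rough diagnosis label.
classify_curl_exit() {
  case "$1" in
    0)  echo "responding" ;;
    7)  echo "not-listening" ;;   # connection refused / could not connect
    22) echo "http-error" ;;      # only reported with -f: HTTP status >= 400
    28) echo "timed-out" ;;       # no answer within --max-time
    *)  echo "other-failure" ;;
  esac
}

# Probe one URL and print the label for whatever curl reported.
probe_url() {
  rc=0
  curl -fsS -o /dev/null --max-time 5 "$1" || rc=$?
  classify_curl_exit "$rc"
}

# Usage (not run here):
#   probe_url http://127.0.0.1:18789/healthz
#   probe_url http://127.0.0.1:18789/readyz
```

On loopback, timed-out lines up with the "runtime running, connectivity probe timeout" symptom from earlier, while not-listening would point at the process not being up at all.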

#

Good — this is progress. With channels skipped, connectivity probe is OK and /healthz is 200.

So the core Gateway is fine. The normal startup hang is coming from something loaded during full service startup — most likely Discord channel startup or Tailscale Serve, not the Gateway server itself.

Also this line matters:

Service config looks out of date or non-standard.

Let’s clean that up first, then isolate channels.

Run:

cd ~/openclaw
pnpm openclaw gateway install --force
systemctl --user daemon-reload
pnpm openclaw gateway restart
sleep 20
pnpm openclaw gateway status

If it wedges again, test with Tailscale disabled persistently:

pnpm openclaw config set gateway.tailscale.mode off
pnpm openclaw gateway restart
sleep 20
pnpm openclaw gateway status
#

If it still wedges after Tailscale off, then we isolate Discord accounts. Since your logs show two Discord accounts (Pikachu and Quiet), temporarily disable one configured Discord account at a time in config and restart. Don’t paste tokens, but you can inspect account ids with:

pnpm openclaw config get channels.discord --json

Given the earlier awaiting gateway readiness lines, I’m leaning Discord startup/lifecycle race rather than core Gateway.

#

Yep — that confirms the foreground “skip channels” test is healthy:

{"ready":true,"failing":[]}

The event_loop_delay is degraded but not fatal there; 1.5s max delay during startup/build-ish conditions can happen. The important part is: Gateway ready works when channels are skipped.
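For scripting that readiness check, a pattern match on the body is enough for a quick pass/fail. This is a sketch that assumes the response keeps the shape of the single line quoted above; is_ready is a hypothetical helper, and a real JSON parser would be more robust if one is installed.

```shell
# Succeed only if a /readyz-style JSON body reports "ready": true.
is_ready() {
  printf '%s' "$1" | grep -Eq '"ready"[[:space:]]*:[[:space:]]*true'
}

# Check the body quoted in this thread.
body='{"ready":true,"failing":[]}'
if is_ready "$body"; then
  echo "gateway reports ready"
else
  echo "gateway not ready"
fi
```

In practice the body would come from curl -s against /readyz rather than a literal string.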

So next isolate in this order:

  1. clean/reinstall service metadata
  2. disable Tailscale
  3. isolate Discord accounts

Commands:

cd ~/openclaw
pnpm openclaw gateway install --force
systemctl --user daemon-reload
pnpm openclaw config set gateway.tailscale.mode off
pnpm openclaw gateway restart
sleep 20
pnpm openclaw gateway status

If still unreachable, check the Discord config account ids:

pnpm openclaw config get channels.discord --json

Don’t paste tokens. We’re looking for the account keys/names so we can temporarily disable one bot at a time. Your logs show isolated-bot / Pikachu and main-bot / Quiet; one of those may be hanging startup.

halcyon moon
#

For the first time, gateway status --all and --deep succeed without asking whether the gateway is running... good sign.

shy orioleBOT
#

That’s a very good sign 🎉

Given what changed, the likely culprit was one of:

  • stale/non-standard systemd service metadata/PATH,
  • Tailscale Serve startup getting in the way, or
  • Discord, which is less likely now that full status/deep works.

Do one more sanity pass:

pnpm openclaw gateway status
pnpm openclaw status --deep
pnpm openclaw channels status --probe

If those stay green and Discord replies normally, you’re probably fixed.

#

Since we set Tailscale off while isolating, only turn it back on if you actually need remote dashboard access:

pnpm openclaw config set gateway.tailscale.mode serve
pnpm openclaw gateway restart
sleep 20
pnpm openclaw gateway status

If enabling Tailscale makes it wedge again, we’ve found the trigger.

halcyon moon
#

No way, those commands worked for the first time... I don't need Tailscale; I can live without it, tbh. I just SSH into my server via Tailscale, not into the gateway.

shy orioleBOT
#

Hell yeah — then leave gateway.tailscale.mode off. That’s a totally valid setup.

If you SSH in over Tailscale, you can still reach the dashboard safely with a tunnel when needed:

ssh -N -L 18789:127.0.0.1:18789 ubuntu@<your-tailscale-hostname>

Then open locally:

http://127.0.0.1:18789/

So the stable state is:

pnpm openclaw config set gateway.tailscale.mode off
pnpm openclaw gateway restart
pnpm openclaw status --deep

If this fixed it, please right-click the solution message → Apps → ✅ Mark Solution so Answer Overflow indexes it for the next person. 🦐

halcyon moon
#

Tbh, I don't want to save this solution because I shared a lot of personal logs in this chat.

shy orioleBOT
#

Totally fair — don’t mark it then. Privacy > indexing.

If you want a safe version later, you can make a new sanitized thread with only:

Symptom: Gateway service running, but local probe timed out / status --deep unreachable.
Cause/workaround: Tailscale Serve exposure seemed to wedge startup.
Fix: set gateway.tailscale.mode off, reinstall/restart gateway service.

No hostnames, channel IDs, logs, usernames, or paths needed.

For now: just leave it unmarked. Glad it’s working again 🦐
