#Gateway restart


amber pagodaBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

wise boltBOT
#

A Gateway restart will kill the in-flight run. The session/history is still on disk, but the model turn that was mid-execution gets cut off — so you see the “gateway restarting” sentinel message, and then nothing else happens until a new inbound message arrives to “wake” that session again.

There are two practical fixes:

1) Use config.patch / config.apply with sessionKey (auto wake-up after restart)

If your infra agent is doing something like openclaw config set ... and then openclaw gateway restart, switch to the Config RPC methods and pass the current sessionKey.

Docs (this is the key line): sessionKey is “for the post-restart wake-up ping”:
https://docs.openclaw.ai/gateway/configuration#config-rpc-programmatic-updates

So your agent should do config.patch/config.apply and include:

  • sessionKey: the chat/session to ping after restart
  • optionally note: what to do next after restart
  • optionally restartDelayMs: give it a few seconds to send a final “I’m restarting now…” message

That wake-up ping after boot is what makes it “continue without you manually prompting”.

#

How to find the sessionKey (CLI):

2) Avoid restarting unless you actually need to

Most config changes hot-apply; only certain categories require restart (notably gateway.*, plugins, some infra). This table is in the config docs:
https://docs.openclaw.ai/gateway/configuration#what-hot-applies-vs-what-needs-a-restart

So if the agent is “restarting out of habit”, you can often just stop doing that and the run won’t get interrupted.


If you paste how your infra agent currently applies config (CLI config set? editing the json file? config.apply already?) and your openclaw --version, I can tell you the smallest change to get the auto-resume behavior.

#

Copy/paste this as a single message to your infra agent (tweak the placeholders if you want):


From now on, when you change OpenClaw config and a restart is required, you must auto-resume after the restart.

Do not do: openclaw gateway restart and then stop.

Instead, always apply config via config.patch/config.apply and include the current sessionKey so the Gateway sends a post-restart wake-up ping back into this chat, and you continue automatically when it arrives.

What to do (every time)

  1. Determine the sessionKey for this chat (the one you’re currently replying in):
  • Run: openclaw gateway call sessions.list --params '{}'
  • Find the session entry whose origin matches this Discord thread/channel (label/topic), and copy its key.
  • If you can’t confidently identify it, ask me to run openclaw sessions --active 120 --json and paste the relevant entry.
  2. Get the current config hash:
  • openclaw gateway call config.get --params '{}'
  • Save hash as baseHash.
  3. Apply the change using config.patch (preferred) and include:
  • sessionKey: (the key from step 1)
  • note: a short “resume plan”
  • restartDelayMs: e.g. 8000 (so you can finish your status message before restart)

Example (fill in PATCH + HASH + SESSIONKEY):

openclaw gateway call config.patch --params '{
  "raw": "{ /* PATCH_JSON5_HERE */ }",
  "baseHash": "<HASH_FROM_config.get>",
  "sessionKey": "<SESSION_KEY_FOR_THIS_CHAT>",
  "note": "After restart: continue from where you left off (verify status/health, then report result).",
  "restartDelayMs": 8000
}'
  4. After the restart, wait for the wake-up ping that comes back into this chat, then continue automatically (do not wait for me to say “continue”).

Reference: https://docs.openclaw.ai/gateway/configuration#config-rpc-programmatic-updates (sessionKey = post-restart wake-up ping)


If you tell me whether you’re doing this in a Discord thread vs a plain channel/DM, I can also help you pick the exact sessionKey pattern (Discord keys are usually agent:<agentId>:discord:channel:<id>).

wise boltBOT
#

Yeah — that analysis is very plausible, and it lines up with what you’re seeing.

If the restart-sentinel “wake-up ping” is implemented as a best-effort outbound send very shortly after boot (750ms), it’s totally believable that:

  • Gateway boots
  • sentinel is consumed
  • wake-up fires immediately
  • Discord adapter isn’t “ready” yet (gateway hasn’t reconnected / resumed the WS)
  • send fails → swallowed (bestEffort: true)
  • no second attempt ⇒ your session never gets nudged, so you have to manually “continue”

What this means

Option (1) (“pass sessionKey and rely on the post-restart wake”) is currently racy on Discord in 2026.3.2 if that 750ms timer is real. That’s a bug/footgun: Discord reconnect often takes seconds, not sub-second.

What I’d do next (actionable)

A) File a GitHub issue (this needs maintainer attention)

#

Include:

  • OpenClaw version (openclaw --version) — you say 2026.3.2
  • channel: Discord
  • proof that sentinel is written + consumed
  • the code snippet / location (or at least the setTimeout(..., 750) line)
  • a log line showing Discord client “ready” occurs after the wake

Issue tracker: https://github.com/openclaw/openclaw/issues

B) Workarounds you can use today

Pick one:

  1. Avoid restarts unless required
    Most config is hot-applied; only certain categories need restart (notably gateway.*, plugins/infra). Docs:
    https://docs.openclaw.ai/gateway/configuration#what-hot-applies-vs-what-needs-a-restart

  2. Heartbeat-based “resume” (doesn’t depend on Discord reconnect timing)


Have your infra agent write a small “resume marker” file before it triggers a restart (e.g. RESUME.md with “continue from step X”), and configure that agent with a fast heartbeat (e.g. every 1m). On the next heartbeat after boot, it sees the marker and continues automatically, then deletes/clears it.

(Heartbeat runs inside the gateway scheduler; it doesn’t require an inbound Discord message to kick it.)

Docs: https://docs.openclaw.ai/gateway/heartbeat

  3. If you’re running from a git checkout: bump that 750ms delay or add retry
    This is the “real fix” locally: make the sentinel wake delay something like 5–10s, or retry until the Discord transport is connected. (But upstream should still fix it.)
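The "retry until connected" idea in workaround 3 can be sketched in shell. This only illustrates the retry-with-delay pattern and is not OpenClaw's code: `retry_until_ok` is a hypothetical helper, and the command it retries (a readiness check, a send) is whatever you pass in.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry_until_ok <max_attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Run the supplied command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Something like `retry_until_ok 10 2 <your-readiness-check>` would try for roughly 20 seconds, which covers a reconnect that takes seconds rather than 750ms.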

If you want, paste the exact sessionKey you were using and the snippet around scheduleRestartSentinelWake(...) (just that function), and I’ll help you write a crisp GitHub issue + suggest the smallest safe retry logic.

#

Right — even if the wake-up ping lands, a Gateway restart will interrupt the run mid-flight, so the model often never gets to write down “I’m on step 4/9, next I’ll do X”. After reboot it only has whatever was already in the transcript, which may not include the latest “current stage”.

So you need checkpointing, not just wake-up.

The fix: make the infra agent persist a run state before restarting

Have the agent treat “restart-required config changes” like a little transaction:

  1. Before triggering any restart, it must write a durable checkpoint file in its workspace, e.g.
  • infra/runstate.json or RESUME.md

Include at minimum:

  • operation id (timestamp)
  • goal (“apply config patch X”)
  • numbered plan
  • currentStep / nextStep
  • “post-restart checks” it must run (status/health/log tail)
  • any command outputs it already collected that it shouldn’t recompute
  2. Trigger restart (via config.patch/apply).

  3. After restart, on the first opportunity (wake-up ping or heartbeat), the agent:

  • reads infra/runstate.json
  • continues from nextStep
  • deletes/marks the runstate as complete when done

This solves the “forgets what stage it was at” problem because the stage is no longer implicit in the model’s short-term chain-of-thought—it’s explicit on disk.

Minimal runstate shape (example)

{
  "opId": "2026-03-07T16:53:00Z-config-patch",
  "goal": "Patch openclaw.json safely and verify gateway health",
  "nextStep": 4,
  "plan": [
    "1) Capture baseHash",
    "2) Apply config.patch with sessionKey + note",
    "3) Restart happens",
    "4) After restart: openclaw status + health + logs tail",
    "5) Confirm desired behavior, then mark complete"
  ],
  "pending": true
}
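Since the restart can land at any moment, it may help to write the checkpoint atomically (temp file, then rename) so a half-written file is never left behind. A minimal sketch, assuming a hypothetical `write_runstate` helper and values that contain no double quotes or backslashes; the `plan` array is left out for brevity:

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a minimal runstate checkpoint atomically.
# Usage: write_runstate <file> <opId> <goal> <nextStep>
# Assumes opId and goal contain no double quotes or backslashes.
write_runstate() {
  local file="$1" op_id="$2" goal="$3" next_step="$4"
  local tmp="${file}.tmp.$$"
  mkdir -p "$(dirname "$file")"
  # Write to a temp file in the same directory first...
  printf '{\n  "opId": "%s",\n  "goal": "%s",\n  "nextStep": %d,\n  "pending": true\n}\n' \
    "$op_id" "$goal" "$next_step" > "$tmp"
  # ...then rename, which replaces the target in one step on the same filesystem.
  mv "$tmp" "$file"
}
```

For example: `write_runstate infra/runstate.json "2026-03-07T16:53:00Z-config-patch" "Patch openclaw.json safely and verify gateway health" 4`.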

How to ensure it runs after restart even if Discord wake-up is flaky

Use a fast heartbeat for that infra agent (e.g. every 1m) and put instructions in its HEARTBEAT.md like:

“If infra/runstate.json exists and pending=true, resume the operation from nextStep and report progress; otherwise reply HEARTBEAT_OK.”

Heartbeat doesn’t depend on a human message arriving, so it’s a solid “resume engine” after restarts.
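The "is there anything to resume?" decision from that prompt can also be expressed as a cheap shell check, for instance if the agent runs a command before deciding. This is a sketch only: `runstate_pending` is a hypothetical helper, and it matches the `"pending": true` text with grep instead of parsing JSON.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only when the checkpoint exists and has pending=true.
# Usage: runstate_pending [file]   (defaults to infra/runstate.json)
# Relies on grep, not a JSON parser, so it assumes a plain `"pending": true` entry.
runstate_pending() {
  local file="${1:-infra/runstate.json}"
  [ -f "$file" ] && grep -Eq '"pending"[[:space:]]*:[[:space:]]*true' "$file"
}
```

`runstate_pending infra/runstate.json && echo resume || echo HEARTBEAT_OK` mirrors the two branches of the heartbeat instruction.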

#

Got it — separate agent actually makes this easier, because you can give that agent its own resume state + heartbeat runner without messing with your main agent.

Here’s the clean pattern for a separate infra agent:

1) Give the infra agent a persistent checkpoint file in its own workspace

Pick a fixed path inside the infra workspace, e.g.:

  • infra/runstate.json (recommended)

Your infra agent must update this file before any restart with:

  • opId
  • goal
  • nextStep
  • plan[]
  • pending: true

…and set pending:false (or delete it) when done.

#

2) Use a per-agent heartbeat to auto-resume after restarts (no Discord wake-up needed)

Heartbeat runs even if nobody messages, so it’s a reliable “resume engine”.

In config, add a heartbeat block on the infra agent. Key points:

  • set a short cadence (e.g. 1m)
  • set a custom prompt that only resumes if infra/runstate.json is pending
  • set delivery to your infra thread/channel so you actually see progress
  • (optional) set session to the specific Discord sessionKey so it runs “in the right place”

Docs: https://docs.openclaw.ai/gateway/heartbeat#per-agent-heartbeats

Example config (fill placeholders)

{
  agents: {
    list: [
      {
        id: "infra",
        workspace: "~/.openclaw/workspace-infra",

        heartbeat: {
          every: "1m",

          // Optional but recommended: run the heartbeat *in the same session*
          // as your infra chat, not in agent:infra:main.
          session: "<SESSION_KEY_FOR_YOUR_INFRA_DISCORD_THREAD>",

          // Deliver only when there's non-OK content (OKs are suppressed by default)
          target: "discord",
          to: "channel:<DISCORD_THREAD_OR_CHANNEL_ID>",
          lightContext: true,

          prompt:
            "Read infra/runstate.json. If pending=true, resume exactly from nextStep and keep going until the operation is complete (updating infra/runstate.json as you advance). If pending=false or file missing, reply HEARTBEAT_OK."
        }
      }
    ]
  }
}

How to find the infra agent’s sessionKey

Run (on the gateway host):

openclaw sessions --agent infra --active 240 --json

and copy the key for the Discord thread/channel you care about. (Session key formats are documented here: https://docs.openclaw.ai/concepts/session#mapping-transports--session-keys)

3) Add a tiny HEARTBEAT.md in the infra workspace (optional)

Not strictly required since we set prompt, but it helps if you want a checklist.


If you tell me the infra agent’s agentId (is it literally infra?) and where you chat with it (Discord channel vs thread, and the ID), I’ll paste back a fully-filled config + a single “instruction message” you can send to the infra agent to implement infra/runstate.json checkpointing correctly.

#

A 1‑minute heartbeat will burn tokens if you leave it on 24/7 — so you’re right to be wary.

“Native” solution (intended)

The native mechanism is exactly what you tried: config.patch/apply with sessionKey → post‑restart wake-up ping. That’s the documented path.
But if it’s firing ~750ms after boot, it’s currently unreliable on Discord (reconnect usually takes seconds). That part looks like a bug/oversight in 2026.3.2.

Token-light options that are still “native OpenClaw”

Option A) Heartbeat only when needed (don’t run it constantly)

Keep heartbeat disabled (or at a sane interval like 30m), and have the infra agent temporarily turn it on only during an operation that requires a restart:

  • Before restart: set infra heartbeat to every: "1m" and a resume-only prompt
  • After operation completes: set it back to every: "0m" (disable)

Also set lightContext: true so heartbeat turns are tiny (only HEARTBEAT.md).

Heartbeat docs show all these knobs (every, lightContext, per-agent heartbeat):


https://docs.openclaw.ai/gateway/heartbeat

Option B) One-shot “resume poke” via Session Tools (cheaper than a heartbeat loop)

OpenClaw has Session Tools (sessions_send) that can inject a message into a session key internally (doesn’t depend on Discord being connected at that exact millisecond):
https://docs.openclaw.ai/concepts/session-tool#sessions_send

So the pattern becomes:

  1. Before restart: write your infra/runstate.json checkpoint (stage/nextStep).
  2. After restart: trigger exactly one sessions_send into the infra session key with a message like “Resume opId XYZ from runstate”.

That costs you one extra agent turn per restart (not continuous heartbeats).

The missing piece is “what runs sessions_send after restart?” Today you can do that with a one-shot cron scheduled a few seconds in the future (cron survives restarts), whose only job is to call sessions_send once. That’s still all OpenClaw-native, and typically far cheaper than a 1m heartbeat.

Cron overview: https://docs.openclaw.ai/automation/cron-jobs

#

If you tell me:

  • your infra agentId
  • the Discord thread/channel id you want it to resume in
  • whether your infra agent is allowed to use session tools

…I’ll write you the exact “recipe” (checkpoint file + one-shot cron that calls sessions_send to resume). This ends up being the lowest-token reliable workaround until the Discord wake-up timing bug is fixed.

#

Yep — cron is a good fit here because it’s one-shot (tokens only when you need it) and it survives restarts.

Pattern: “checkpoint → schedule one-shot resume cron → restart”

  1. Before triggering the restart, your infra agent writes a checkpoint file in its workspace, e.g. infra/runstate.json (pending: true, nextStep, etc.).
  2. Infra agent creates a one-shot cron job for now + 15s (enough time for Discord to reconnect after the restart).
  3. Infra agent applies the config change / restarts.
  4. After boot, cron fires and tells the infra agent to read runstate + continue.

What to run (CLI form)

You (or the infra agent via system.run) can do something like:

# Pick a time ~15s in the future (example using node; any way to compute ISO is fine)
AT="$(node -e 'console.log(new Date(Date.now()+15000).toISOString())')"

openclaw cron add \
  --name "infra-resume" \
  --at "$AT" \
  --session isolated \
  --agent infra \
  --message "Resume the interrupted infra operation: read infra/runstate.json; if pending=true continue from nextStep, update the file as you progress, and set pending=false when finished. Post a short completion summary to the original Discord thread/channel." \
  --light-context \
  --no-deliver

Notes:

  • --at jobs auto-delete after success by default (so no cleanup needed). (Docs: https://docs.openclaw.ai/cli/cron)
  • --agent infra pins it to your infra agent so it has the right workspace + files (runstate lives there).
  • --light-context keeps token use down.
  • --no-deliver keeps the cron run itself quiet; your infra agent should post the summary itself once it’s done.
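If node isn't available on the gateway host, the timestamp for --at can be computed with date instead. A sketch, assuming a hypothetical `iso_in_seconds` helper; it prints second precision (no milliseconds, unlike `toISOString()`), and that --at accepts this form is an assumption worth checking against the cron docs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a UTC ISO-8601 timestamp N seconds from now.
# Tries GNU date syntax first, then falls back to BSD/macOS syntax.
iso_in_seconds() {
  local secs="$1"
  date -u -d "+${secs} seconds" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v"+${secs}S" +%Y-%m-%dT%H:%M:%SZ
}
```

Then `AT="$(iso_in_seconds 15)"` takes the place of the node one-liner.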
#

If you’d rather have cron deliver the summary directly (instead of relying on the agent to message), swap --no-deliver for something like:

--announce --channel discord --to "channel:<THREAD_OR_CHANNEL_ID>"

(Discord targets use channel:<id>.)


One thing you still must do: write the checkpoint

Cron solves the “wake me after restart” part. The “agent forgets what step it was on” is solved by the infra agent always writing/updating infra/runstate.json right before it restarts.


If you tell me your actual infra agentId (is it literally infra?) and where you want the summary to land (this thread id 1479860126461464736 or a different channel), I’ll paste the exact final command with the right --agent/--to filled in.

#

Paste this as a single message to your infra agent (edit only the 2 placeholders at the top if needed):

You are my **Infra** agent. You are allowed to update OpenClaw config, but you must do it in a way that survives Gateway restarts and never “forgets” what step you were on.

## Constants (edit if needed)
- Your agentId: `infra`  (if your agentId is different, replace it everywhere below)
- Discord resume target: `channel:1479860126461464736`  (this thread/channel id)
- Run-state checkpoint file (in your workspace): `infra/runstate.json`
- Resume delay after restart: 20 seconds

## Core rule
Any time you are about to trigger a Gateway restart (directly or indirectly via config changes), you MUST:
1) checkpoint state to `infra/runstate.json`
2) schedule a one-shot cron “resume” job for +20s
3) only then apply the config change / restart
4) after restart, the cron job completes the remaining steps using the checkpoint file and posts a summary to Discord.

This avoids relying on the fragile “wake-up ping” timing on Discord and prevents losing the current stage.

---

# A) Run-state format (MUST follow)
Before restart, write/update `infra/runstate.json` with at least:

```json
{
  "opId": "ISO_TIMESTAMP + short label",
  "pending": true,
  "goal": "What we are trying to achieve",
  "plan": ["1) ...", "2) ...", "3) ..."],
  "nextStep": 1,
  "context": {
    "targetDiscord": "channel:1479860126461464736",
    "notes": "Any critical details / commands / hashes / baseHash / expected outcomes"
  },
  "progressLog": ["optional append-only notes"]
}
```


Rules:
- Increment `nextStep` as you complete steps.
- Add any important outputs (hashes, diffs, observed errors) into `context.notes` or `progressLog`.
- When finished successfully, set `"pending": false` and (optionally) keep the file for audit.

---

# B) How to schedule the one-shot resume cron job (MUST do before restarting)
Right before the restart, schedule a cron job for +20s, pinned to agent `infra`, isolated, lightweight, and silent delivery:

1) Compute the timestamp:
- `AT = now + 20 seconds (ISO)`

2) Add the cron job:
- Use OpenClaw CLI: `openclaw cron add ...`

Template command (you can run this via system.run):
```bash
AT="$(node -e 'console.log(new Date(Date.now()+20000).toISOString())')"

openclaw cron add \
  --name "infra-resume-$(date -u +%Y%m%dT%H%M%SZ)" \
  --at "$AT" \
  --session isolated \
  --agent infra \
  --light-context \
  --no-deliver \
  --message "RESUME JOB: Read infra/runstate.json. If pending!=true, exit with HEARTBEAT_OK-style silence. If pending=true: continue the operation deterministically from nextStep until completion, updating infra/runstate.json as you go. When done, set pending=false and send a short summary to Discord target channel:1479860126461464736. If you hit an error, keep pending=true, append the error to progressLog, and send an error summary to Discord."
```


Notes:
- One-shot `--at` jobs auto-delete after success (default), so no cleanup required.
- `--no-deliver` ensures the cron run itself doesn’t spam; YOU must send the final summary to Discord.

---

# C) Safe config-apply procedure (when restart is required)
When changing config:
1) Prefer `config.patch` over `config.apply` (patch is safer; apply replaces everything).
2) Always capture baseHash first (`config.get`) and store it into runstate `context.notes`.
3) BEFORE applying the patch, do:
   - update `infra/runstate.json` with the plan and set nextStep to the step AFTER the restart
   - schedule the resume cron job (section B)
4) Apply the patch that triggers the restart.
5) Stop: you WILL be interrupted. The cron will pick up after reboot.

---

# D) Resume behavior (what you do when you’re “back”)
Whether you’re running as:
- the cron resume job, OR
- you receive any message after a restart

…you must:
1) open `infra/runstate.json`