#openclaw keeps crashing when running doctor. is this oom heap issue?

granite sparrow

i previously changed heap from 2 to 4gb but now when updating to 2026.4.26 and using codex harness this is happening

ivory wolf [BOT]

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

knotty kindle [BOT]

I don’t see the actual crash/log in the thread, so I wouldn’t call it heap yet.

Quick triage:

  • If it says FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory, yes, that’s heap/OOM.
  • If it’s gateway closed / stuck processing / Codex turn never completes / doctor hangs, that may be one of the post-2026.4.26 Codex/ACP cleanup issues rather than pure heap — there are newer fixes on main around Codex harness lanes getting stuck. (raw.githubusercontent.com)

Please paste the exact output from:

openclaw --version
node --version
NODE_OPTIONS=--max-old-space-size=4096 openclaw doctor

And if the gateway is running as a service, also paste recent logs:

openclaw logs --follow

One gotcha: if your 4GB heap change was only in your shell, the service won’t necessarily inherit it. OpenClaw’s docs note env precedence starts with the actual process environment, and services don’t inherit your interactive shell env by default. (docs.openclaw.ai)
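To check whether a given process actually got the 4GB setting, look at the NODE_OPTIONS it was started with. A minimal sketch, assuming the limit is passed only as `--max-old-space-size=N` (the `=` form) inside NODE_OPTIONS; the helper name `heap_limit_mb` is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the --max-old-space-size value (MB) found in a
# NODE_OPTIONS-style string, or "default" when the flag is absent.
heap_limit_mb() {
  local opts="$1" word value=""
  for word in $opts; do
    case "$word" in
      --max-old-space-size=*) value="${word#--max-old-space-size=}" ;;
    esac
  done
  # If the flag appears more than once, this keeps the last occurrence.
  if [ -n "$value" ]; then
    echo "$value"
  else
    echo "default"
  fi
}

# What would this shell pass to node?
heap_limit_mb "${NODE_OPTIONS:-}"
```

On Linux, `tr '\0' '\n' < /proc/<pid>/environ | grep '^NODE_OPTIONS='` shows what an already-running process really received, which is what matters for a service.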

Redact tokens/API keys before pasting.

granite sparrow

what should we do first? clean up codex acp issues?

knotty kindle [BOT]

I wouldn’t start by wiping Codex/ACP state. First, separate a “gateway/doctor crash” from “stale ACP tasks”.

Do it in this order:

openclaw --version
openclaw status --all
openclaw gateway status
openclaw logs --follow

If logs mention:

device-pair: notify poll failed
Unexpected end of JSON input
~/.openclaw/devices/pending.json

then that’s likely corrupted pairing state, not heap/Codex. Fix with:

mv ~/.openclaw/devices/pending.json ~/.openclaw/devices/pending.json.bad
openclaw gateway restart
openclaw doctor
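If you’d rather only move the file aside when it really is unparseable, a small guard can check it first. A sketch, assuming `python3` is available to validate the JSON; `quarantine_if_corrupt` is a hypothetical name, not an OpenClaw command:

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename a JSON state file to <file>.bad only if it fails to parse.
# An empty or truncated file is the usual source of "Unexpected end of JSON input".
quarantine_if_corrupt() {
  local f="$1"
  if [ ! -e "$f" ]; then
    echo "missing: $f"
    return 0
  fi
  if python3 -m json.tool "$f" >/dev/null 2>&1; then
    echo "ok: $f"
  else
    mv "$f" "$f.bad"
    echo "moved: $f -> $f.bad"
  fi
}

# Usage:
#   quarantine_if_corrupt ~/.openclaw/devices/pending.json && openclaw gateway restart
```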

If the gateway is stable but Codex/ACP is stuck, clean it up the supported way:

openclaw tasks list --runtime acp
openclaw tasks audit
openclaw tasks maintenance
openclaw tasks maintenance --apply

Only cancel specific stuck running tasks after inspecting them:

openclaw tasks show <lookup>
openclaw tasks cancel <lookup>

So: logs/status first → corrupted state check → then ACP task maintenance. Don’t manually delete sessions.json unless we’ve confirmed that’s the broken thing.

granite sparrow

because my status message shows acp too idk why

🦞 OpenClaw 2026.4.26 (be8c246)
🧠 Model: openai/gpt-5.5 · 🔑 api-key (openai:default)
🧼 Tokens: 983 in / 126 out
🗄️ Cache: 99% hit · 101k cached, 0 new
📚 Context: 102k/272k (38%) · 🧹 Compactions: 0
🧵 Session: agent:main:telegram:group:-1003705610703:topic:1124 • updated just now
📌 Tasks: recently finished · acp · Context engine turn maintenance · No transcript changes were needed.
⚙ Execution: direct · Runtime: OpenAI Codex · Think: medium · Text: low
👥 Activation: always · 🪢 Queue: collect (depth 0)

anyway, what should i do now?

knotty kindle [BOT]

That 📌 Tasks: recently finished · acp · Context engine turn maintenance · No transcript changes were needed line is not by itself a problem. It just means an ACP-backed maintenance task ran and finished cleanly. I wouldn’t clean ACP state first.

Do this now:

openclaw logs --follow

Leave that running, then in another terminal run:

openclaw doctor

When it crashes, paste the 20–50 log lines around the crash.
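If you’ve captured the log output to a file, one rough way to pull out that window is to print the lines around the first crash-looking match. A sketch assuming GNU grep; `crash_context` and its default pattern list (taken from error strings seen in this thread) are illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a log on stdin and print up to 20 lines before and
# 30 lines after the first line matching a crash-looking pattern.
crash_context() {
  local pattern="${1:-FATAL ERROR|heap out of memory|failed to start|Unexpected end of JSON input}"
  grep -E -m 1 -B 20 -A 30 -- "$pattern"
}

# Usage:
#   crash_context < gateway.log
```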

If the crash line says heap/OOM, retry once with:

NODE_OPTIONS="--max-old-space-size=4096" openclaw doctor

If that works, then your heap setting isn’t being inherited by the process/service.

If it does not say OOM, don’t touch Codex/ACP yet — we should diagnose the actual crash. The ACP line in status looks harmless.

Docs for the flow: https://docs.openclaw.ai/help/faq and https://docs.openclaw.ai/gateway/doctor

granite sparrow

hmm ok sure i will try

knotty kindle [BOT]

Sounds good. Tiny shortcut while you test: if logs show JavaScript heap out of memory, it’s heap. If logs show JSON parse / pending.json / device-pair, it’s likely corrupted pairing state. For anything else, paste the crash block and we’ll chase the real cause. 🦐
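The same shortcut as a rough shell check over a captured log. The categories and match strings come from this thread; `triage_log` is a hypothetical name, not an OpenClaw command:

```shell
#!/usr/bin/env bash
# Hypothetical triage helper: reads a log on stdin and prints a rough category.
triage_log() {
  local log
  log="$(cat)"
  if grep -q 'JavaScript heap out of memory' <<<"$log"; then
    echo "heap"            # V8 heap exhausted: retry with a larger --max-old-space-size
  elif grep -Eq 'device-pair|pending\.json|Unexpected end of JSON input' <<<"$log"; then
    echo "pairing-state"   # likely corrupted ~/.openclaw/devices/pending.json
  else
    echo "other"           # paste the crash block and diagnose the real cause
  fi
}
```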

granite sparrow

i just keep getting doctor and these running in a loop

2026-Apr-29 22:35:40
└ Doctor complete.
2026-Apr-29 22:35:40
2026-Apr-29 22:35:41
[entrypoint] setting up nginx basic auth (user: T3v5wQY43KFJqBUH)
2026-Apr-29 22:35:41
2026-04-29T22:35:46.295+00:00 [gateway] loading configuration

2026-Apr-29 22:35:46
2026-04-29T22:35:46.377+00:00 [gateway] resolving authentication

2026-Apr-29 22:35:46
2026-04-29T22:35:46.380+00:00 [gateway] starting...
2026-Apr-29 22:35:47
2026-04-29T22:35:47.764+00:00 [secrets] [SECRETS_RELOADER_DEGRADED] SecretRefResolutionError: Environment variable "OPENAI_API_KEY" is missing or empty.
2026-Apr-29 22:35:47
2026-04-29T22:35:47.771+00:00 [gateway] wrote stability bundle: /data/.openclaw/logs/stability/openclaw-stability-2026-04-29T22-35-47-767Z-1-gateway.startup_failed.json
2026-Apr-29 22:35:47
2026-04-29T22:35:47.772+00:00 Gateway failed to start: Error: Startup failed: required secrets are unavailable. SecretRefResolutionError: Environment variable "OPENAI_API_KEY" is missing or empty.
2026-Apr-29 22:35:48
[entrypoint] state dir: /data/.openclaw
2026-Apr-29 22:35:48
[entrypoint] workspace dir: /data/workspace
2026-Apr-29 22:35:48
[entrypoint] setting up persistent tool storage in /data...
2026-Apr-29 22:35:48
[entrypoint] running configure...
2026-Apr-29 22:35:48
[configure] state dir: /data/.openclaw
2026-Apr-29 22:35:48
[configure] workspace dir: /data/workspace
2026-Apr-29 22:35:48
[configure] config file: /data/.openclaw/openclaw.json
2026-Apr-29 22:35:48
[configure] merged persisted config from /data/.openclaw/openclaw.json
2026-Apr-29 22:35:48
[configure] ZAI provider enabled (ZAI_API_KEY set)
2026-Apr-29 22:35:48
[configure] primary model (override): openai/gpt-5.5
aw doctor --fix...
2026-Apr-29 22:35
🩞 OPENCLAW 🩞
2026-Apr-29 22:35:50

2026-Apr-29 22:35:50
┌ OpenClaw doctor

knotty kindle [BOT]

Yep — this is not heap and not ACP.

The crash reason is right here:

Startup failed: required secrets are unavailable.
Environment variable "OPENAI_API_KEY" is missing or empty.

Your container is restart-looping: the entrypoint runs configure + doctor --fix, then the Gateway starts and immediately fails, because the configured primary model is:

primary model (override): openai/gpt-5.5


but the container does not have OPENAI_API_KEY.
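When a restart loop scrolls by quickly, pulling the variable name out of the SecretRef error saves some squinting. A sketch only; it assumes the error keeps the `Environment variable "NAME" is missing or empty` wording seen in the log above, and `missing_secret_vars` is a made-up name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list env var names that SecretRefResolutionError lines on
# stdin report as missing or empty, one per line, de-duplicated.
missing_secret_vars() {
  grep -o 'Environment variable "[A-Za-z_][A-Za-z0-9_]*" is missing or empty' \
    | sed -E 's/^Environment variable "([^"]+)".*/\1/' \
    | sort -u
}

# Usage:
#   docker compose logs openclaw-gateway | missing_secret_vars
```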

Do one of these:

Option A — keep OpenAI/Codex
Add OPENAI_API_KEY to the container env / docker compose / hosting panel, then recreate/restart the container.

Example compose shape:

environment:
  OPENAI_API_KEY: ${OPENAI_API_KEY}

Then:

docker compose up -d --force-recreate

granite sparrow

but i just did subscription auth and switched to codex back end why need api key?

knotty kindle [BOT]

Because “Codex backend” and “OpenAI provider auth” are different knobs.

Your log says the configured model is:

primary model (override): openai/gpt-5.5

openai/gpt-5.5 is being treated as the direct OpenAI provider route, so OpenClaw tries to resolve OPENAI_API_KEY. ChatGPT/Codex subscription auth is separate from normal API key billing/auth; OpenAI documents ChatGPT/API billing separately, while Codex plan access is via “Sign in with ChatGPT.” (help.openai.com)

If you want subscription OAuth, first check what’s actually configured:

openclaw config get agents.defaults.model.primary
openclaw config get agents.defaults.agentRuntime
openclaw config get models.providers.openai.apiKey
openclaw models status

Then the likely fix is one of these:

openclaw models set openai-codex/gpt-5.5

or, if you intentionally want the native Codex harness, keep openai/gpt-5.5 + agentRuntime.id: "codex", but remove the stale direct OpenAI API-key SecretRef if you don’t use it:

openclaw config unset models.providers.openai.apiKey

Also check your Docker/hosting env for a model override forcing openai/gpt-5.5 every boot. The “primary model (override)” line makes me suspect an env/compose setting is reapplying it.

granite sparrow

oh so in addition to auth i have to remove api key?

knotty kindle [BOT]

Yes — if it’s a stale/missing API-key reference.

You don’t have to remove a working API key. But your config currently points at:

models.providers.openai.apiKey -> env OPENAI_API_KEY


and the container does not have OPENAI_API_KEY, so startup fails before Codex auth can help.

If you’re subscription/Codex-only, do:

openclaw config unset models.providers.openai.apiKey
openclaw models set openai-codex/gpt-5.5
openclaw gateway restart

If you want the native Codex harness, the model can stay openai/gpt-5.5, but you still need to remove the missing API-key ref unless you actually provide OPENAI_API_KEY.

Also check Docker env/compose for anything forcing the model back to openai/gpt-5.5 on boot.

granite sparrow

whats wrong? i thought openai/gpt-5.5 works for codex also?

knotty kindle [BOT]

You’re not crazy — openai/gpt-5.5 can be right for native Codex harness.

The problem is: your config also has a direct OpenAI API-key secret ref somewhere, and Gateway fails startup because that env var is missing:

models.providers.openai.apiKey -> OPENAI_API_KEY

Codex subscription sign-in doesn’t satisfy a broken OPENAI_API_KEY SecretRef. OpenAI’s Codex subscription flow is “sign in with ChatGPT”; API availability/billing is separate. (help.openai.com)

So choose one:

Native Codex harness:

openclaw config unset models.providers.openai.apiKey
openclaw config set agents.defaults.model.primary openai/gpt-5.5
openclaw config set agents.defaults.agentRuntime.id codex
openclaw gateway restart

Codex OAuth through normal OpenClaw/PI runner:

openclaw config unset models.providers.openai.apiKey
openclaw models set openai-codex/gpt-5.5
openclaw gateway restart

Your immediate blocker is not the model name — it’s the stale missing OPENAI_API_KEY reference killing startup.

Docs: https://docs.openclaw.ai/providers/openai and https://docs.openclaw.ai/plugins/codex-harness
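If you end up switching between these two setups, the option blocks above fit in one function. A sketch only: the commands are exactly the ones listed; the name `codex_profile` and the `OPENCLAW` override (handy for a dry run with `OPENCLAW=echo`) are illustrative additions, not OpenClaw features:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the two setups above.
# Set OPENCLAW=echo to print the commands instead of running them.
codex_profile() {
  local oc="${OPENCLAW:-openclaw}"
  case "$1" in
    native)   # native Codex app-server harness
      $oc config unset models.providers.openai.apiKey &&
      $oc config set agents.defaults.model.primary openai/gpt-5.5 &&
      $oc config set agents.defaults.agentRuntime.id codex &&
      $oc gateway restart
      ;;
    oauth)    # Codex OAuth through the normal OpenClaw/PI runner
      $oc config unset models.providers.openai.apiKey &&
      $oc models set openai-codex/gpt-5.5 &&
      $oc gateway restart
      ;;
    *)
      echo "usage: codex_profile native|oauth" >&2
      return 2
      ;;
  esac
}
```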

granite sparrow
knotty kindle [BOT]

Yep — exactly. That doc line is right.

For native Codex app-server harness, this is valid:

model: openai/gpt-5.5
agentRuntime.id: codex

So the model name is not what’s wrong.

What’s wrong is this separate stale config:

models.providers.openai.apiKey -> OPENAI_API_KEY

That makes Gateway try to resolve OPENAI_API_KEY during startup, and your container doesn’t have it, so it exits before the Codex harness can run.

Your target state should be:

openclaw config unset models.providers.openai.apiKey
openclaw config set agents.defaults.model.primary openai/gpt-5.5
openclaw config set agents.defaults.agentRuntime.id codex
openclaw gateway restart

So: keep openai/gpt-5.5, remove the broken API-key SecretRef. That’s the key bit.

granite sparrow

how to do?

knotty kindle [BOT]

Do it from the folder with your docker-compose.yml.

Since the container is restart-looping, stop it first, then run config edits through the gateway image:

docker compose stop openclaw-gateway

Then run these:

docker compose run --rm --no-deps --entrypoint node openclaw-gateway \
  dist/index.js config unset models.providers.openai.apiKey

docker compose run --rm --no-deps --entrypoint node openclaw-gateway \
  dist/index.js config set agents.defaults.model.primary openai/gpt-5.5

docker compose run --rm --no-deps --entrypoint node openclaw-gateway \
  dist/index.js config set agents.defaults.agentRuntime.id codex

docker compose run --rm --no-deps --entrypoint node openclaw-gateway \
  dist/index.js config validate

Then restart:

docker compose up -d --force-recreate openclaw-gateway
docker compose logs -f openclaw-gateway
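Every one-off command above repeats the same `docker compose run` prefix, so a small shell function can hold it. A sketch assuming the `openclaw-gateway` service name and `dist/index.js` entry point used above; `oc_run` and the `DOCKER` override (for dry runs) are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run an OpenClaw CLI command in a throwaway container of the
# gateway service, without starting linked services.
# Set DOCKER=echo to print the command instead of executing it.
oc_run() {
  local docker="${DOCKER:-docker}"
  $docker compose run --rm --no-deps --entrypoint node openclaw-gateway \
    dist/index.js "$@"
}

# Usage (with the gateway stopped first):
#   oc_run config unset models.providers.openai.apiKey
#   oc_run config validate
```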

Also check your .env / hosting panel: if OPENAI_API_KEY is defined there but left blank, remove that entry unless you’re actually using an API key.
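To find set-but-blank entries in a `.env` file, a rough sketch. It assumes simple `KEY=value` lines (optionally prefixed with `export`); `blank_env_keys` is a hypothetical helper and does not understand every .env quoting rule:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print keys from a .env-style file whose value is empty,
# e.g. OPENAI_API_KEY= or OPENAI_API_KEY="". Comment and blank lines never match.
blank_env_keys() {
  sed -nE 's/^[[:space:]]*(export[[:space:]]+)?([A-Za-z_][A-Za-z0-9_]*)=("")?[[:space:]]*$/\2/p' "$1"
}

# Usage:
#   blank_env_keys .env
```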

granite sparrow

setting a random str to the env var did the trick

knotty kindle [BOT]

Nice, that confirms it was the missing SecretRef, not heap/OOM.

I’d still replace the random string with the clean fix when you can:

openclaw config unset models.providers.openai.apiKey
openclaw gateway restart

The random env var works because startup only needed OPENAI_API_KEY to be non-empty. But later, anything that actually touches direct OpenAI API-key routes — embeddings, TTS, image gen, accidental openai/* PI route — may fail with auth/401 using that fake key.
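The startup failure above only distinguishes “missing or empty” from “present”, and you can reproduce that distinction in a bash shell that has the same environment as the gateway. A sketch; `secret_env_state` is hypothetical and checks presence only, never whether the key is valid:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether an env var is unset, set but empty, or non-empty.
# Mirrors only the "missing or empty" check; a non-empty fake key reports "set"
# here but would still fail later against the real API.
secret_env_state() {
  local name="$1"
  if [ -z "${!name+x}" ]; then
    echo "missing"
  elif [ -z "${!name}" ]; then
    echo "empty"
  else
    echo "set"
  fi
}

# Usage:
#   secret_env_state OPENAI_API_KEY
```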

After restart, sanity check:

openclaw models status
openclaw doctor

If this fixed it, please right-click the solution message → Apps → ✅ Mark Solution so Answer Overflow indexes it for the next person. 🦐

granite sparrow

u sure? can you check github to see if others have had similar problems?