#latest version had a hang - Bot is Lazy bug
75 messages · Page 1 of 1 (latest)
Yep, I think this is the same failure mode.
From your gist, the important part is this, not the Discord reconnect noise:
- `503 ... model is currently experiencing high demand`
- then `No available auth profile for google (all in cooldown or unavailable)`
- then `Embedded agent failed before reply`
So the bot is not really "thinking forever". Google is returning overloads, OpenClaw cools that profile down, and because your fallback chain is also Google-only, it runs out of places to fail over to.
Why it feels silent:
- typing/working starts
- provider fails mid-turn
- on your setup it is not surfacing the final error cleanly, which does look like a bug
Workaround for now:
- Add at least one fallback on a different provider that you already have auth for.
- Run `openclaw models status --probe` to confirm which providers are actually usable.
- Keep `openclaw logs --follow` open while testing.
Docs:
- models: https://docs.openclaw.ai/concepts/models
- failover: https://docs.openclaw.ai/concepts/model-failover
- troubleshooting: https://docs.openclaw.ai/gateway/troubleshooting
Since this is latest stable and it still goes silent instead of replying with the overload error, please file this on GitHub with that gist attached: https://github.com/openclaw/openclaw/issues
If you want, paste your agents.defaults.model block and I can sanity check the fallback setup.
google recommended:
"agents": {
  "defaults": {
    "model": {
      "primary": "google/gemini-3.1-flash",
      "fallbacks": [
        "google/gemini-3.1-pro-preview",
        "google/gemini-3.1-flash-lite-preview"
      ]
    },
    "models": {
      "google/gemini-3.1-flash": {},
      "google/gemini-3.1-pro-preview": {},
      "google/gemini-3.1-flash-lite-preview": {},
      "google/gemini-2.5-flash": {}
    },
Yep, that's the catch.
That config is Google-only failover, not provider failover.
So if Google itself is overloaded, this happens:
- primary Google model fails
- fallback Google model fails
- next Google model fails
- bot has nowhere else to go
Also, in the snippet you pasted:
"fallbacks": [
  "google/gemini-3.1-pro-preview",
  "google/gemini-3.1-flash-lite-preview"
],
"models": {
  "google/gemini-3.1-flash": {},
  "google/gemini-3.1-pro-preview": {},
  "google/gemini-3.1-flash-lite-preview": {},
  "google/gemini-2.5-flash": {}
}
google/gemini-2.5-flash is only in models, not in fallbacks, so it is allowed, but not automatically used as a fallback.
So, short version:
- the recommendation is fine for "use Google"
- it is not resilient to Google-wide 503/high-demand events
- the silent part still smells like a bug
If you want resilience, set at least one fallback on a different provider that you have auth for.
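One quick way to see whether a fallback chain really spans more than one provider is to count the distinct prefixes before the `/` in each model ID. This is only a minimal sketch: `count_providers` is a hypothetical helper, not an OpenClaw command, and it assumes every ID has the `provider/model` shape shown in the config above.

```shell
# count_providers: hypothetical helper, not part of the OpenClaw CLI.
# Prints how many distinct providers appear in a list of "provider/model" IDs.
count_providers() {
  printf '%s\n' "$@" | cut -d/ -f1 | sort -u | wc -l | tr -d ' '
}

# The chain from this thread: primary plus two fallbacks, all on Google.
count_providers \
  google/gemini-3.1-flash \
  google/gemini-3.1-pro-preview \
  google/gemini-3.1-flash-lite-preview
```

A result of 1 means the chain has nowhere to go when that one provider is overloaded; anything above 1 means at least one fallback sits on a different provider.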
And yeah, I'd still file this as a bug, because even with Google overloads it should fail loudly, not act busy and then disappear: https://github.com/openclaw/openclaw/issues
thought those were earlier
Yeah, fair catch, I jumped too fast there.
If those overload lines are from earlier, then I would not pin this exact hang on Google yet.
What the gist proves so far is:
- there is at least one real provider-failure path in the log
- but it does not prove that the latest silent turn was caused by that same thing
To nail the current one down, I need the log slice for the actual minute of the silent reply, ideally:
- ~30 to 60 lines before the user message
- the run itself
- ~30 to 60 lines after
Best next step:
openclaw logs --follow
Then reproduce once and paste the section around that timestamp.
If you want, also paste:
openclaw status --all
Right now my best guess is still "reply-path bug", but I don't want to bullshit you on the exact trigger.
the hang was at 10:35 AM PST
Got it.
10:35 Pacific today is 17:35 UTC in the log.
I checked the gist again, and I don't actually see the 17:35 UTC window in there. The closest Discord transport events I can see are:
- 15:37:17 stale-socket restart
- 16:12:17 stale-socket restart
- 17:17:17 stale-socket restart
So I think the exact hang window is missing from that paste.
Can you grab just the slice around 17:34 to 17:36 UTC from the raw log file?
grep -n '17:3[4-6]:' /tmp/openclaw/openclaw-2026-04-13.log
or just open the file and paste ~100 lines around that time.
That should tell us whether this was:
- another model failure,
- a Discord reconnect/drop,
- or the actual "went silent after saying it's working" bug.
grep didn't return anything. what is the log filename so i can get it
From your gist, it was:
/tmp/openclaw/openclaw-2026-04-13.log
If you want to confirm the latest one on disk:
ls -t /tmp/openclaw/openclaw-*.log | head -1
Then grab the time slice from that file, for example:
grep -n '17:3[4-6]:' /tmp/openclaw/openclaw-2026-04-13.log
or just:
sed -n '1,260p' /tmp/openclaw/openclaw-2026-04-13.log
and paste the section around the hang time.
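If the grep pattern matches nothing, filtering on the leading timestamp is another option. The sketch below assumes each line begins with a zero-padded `HH:MM:SS` timestamp, like the log lines quoted in this thread; if the raw file uses a different layout, the `substr` offset would need adjusting. `slice_by_time` is a hypothetical helper name.

```shell
# slice_by_time: hypothetical helper. Reads a log on stdin and prints the
# lines whose leading HH:MM:SS timestamp falls between the two bounds
# (inclusive). Assumes zero-padded 24-hour timestamps at the start of a line.
slice_by_time() {
  awk -v lo="$1" -v hi="$2" '
    {
      t = substr($0, 1, 8)            # first 8 characters: HH:MM:SS
      if (t >= lo && t <= hi) print   # string comparison works for HH:MM:SS
    }'
}

# Example: the 17:34 to 17:36 UTC window discussed above.
# slice_by_time 17:34:00 17:36:59 < /tmp/openclaw/openclaw-2026-04-13.log
```

Because the timestamps are fixed-width and zero-padded, plain string comparison orders them correctly, so no date parsing is needed.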
I checked the raw gist, and it still ends at 17:22:01 UTC.
So the hang window is not in that gist yet.
If you meant:
- 10:35 PDT → look around 17:35 UTC
- 10:35 PST literally → look around 18:35 UTC
So please grab a later slice, for example:
tail -n 300 /tmp/openclaw/openclaw-2026-04-13.log
or more targeted:
grep -n '17:3[0-9]:\|18:3[0-9]:' /tmp/openclaw/openclaw-2026-04-13.log
Right now the latest thing in the gist is just:
- Discord stale-socket restart at 17:17:17
- Telegram sends through 17:22:01
That is still too early to explain a 10:35 Pacific hang.
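The Pacific-to-UTC arithmetic above can be sketched with fixed offsets: 7 hours for PDT and 8 hours for PST. `pacific_to_utc` is a hypothetical helper; it takes a zero-padded `HH:MM` time and ignores the date rollover past midnight.

```shell
# pacific_to_utc: hypothetical helper. Converts a Pacific wall-clock time
# (HH:MM, 24-hour, zero-padded) to UTC by adding a fixed offset in hours:
# 7 for PDT (daylight time), 8 for PST (standard time).
pacific_to_utc() {
  hour=${1%%:*}
  minute=${1##*:}
  offset=$2
  hour=${hour#0}   # strip one leading zero so "08" is not read as octal
  printf '%02d:%s\n' $(( (hour + offset) % 24 )) "$minute"
}

pacific_to_utc 10:35 7   # PDT -> 17:35
pacific_to_utc 10:35 8   # PST -> 18:35
```

In mid-April the US Pacific zone is on daylight time, so the 7-hour offset is the likelier one, but checking both windows costs little.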
there was nothing to share, it was quiet
That actually helps.
If the hang window was completely quiet in the log, then this looks less like a model failure and more like a Discord transport/liveness bug.
Why I say that:
- the Google overloads we saw earlier are noisy in logs
- this hang had no corresponding run/error lines
- your gist does show repeated Discord health restarts with `reason: stale-socket`
So my current best guess is:
- Discord side got wedged
- the bot looked "busy"/alive
- but the turn never fully executed or never made it onto the normal reply path
That is definitely bug-report material. I'd file it with these notes:
- latest stable
- bot showed it was working, then went silent
- no logs at all for the hang window
- recurring `health-monitor: restarting (reason: stale-socket)` on Discord
GitHub issues: https://github.com/openclaw/openclaw/issues
If you want one extra datapoint before filing, run this right after it happens:
openclaw channels status --probe
That should help confirm whether Discord was connected but unhealthy.
this was in telegram stop assuming discord i never mentioned it
You're right, my bad. I assumed the current thread surface instead of the surface that actually hung.
If this was Telegram, then the Discord stale-socket stuff is irrelevant.
The most suspicious Telegram-specific line in your gist is actually this:
409: Conflict: terminated by other getUpdates request; make sure that only one bot instance is running
That means something else is polling the same Telegram bot token. If two consumers are hitting getUpdates, Telegram rejects one of them with this 409, and an update picked up by one consumer is never seen by the other.
So for Telegram, my revised read is:
- First thing to fix/check: make sure only one OpenClaw instance, container, or other bot process is using that Telegram token.
- The later `telegram sendMessage ok ...` lines show outbound can work, so this is probably not a simple "can't send" issue.
- If the hang window was quiet in logs, that can fit a Telegram polling/consumer conflict better than the earlier Google/Discord theories.
So yes, ignore my Discord guess. For Telegram, I would focus on the getUpdates conflict first.
i terminated an openclaw instance and moved to a new machine. that old machine was terminated too. anything to clean up?
If the old machine is truly dead, then usually no Telegram-side cleanup is needed for long polling.
The getUpdates conflict should stop as soon as there is only one live consumer for that bot token.
So the cleanup checklist is mostly:
- Make sure only one OpenClaw service/process is running now
  - no second VM/container
  - no old systemd service on the new box
  - no other app using the same Telegram bot token
- Restart the gateway once on the new machine
- Check channel health
  openclaw channels status --probe
- If the `409 getUpdates conflict` still appears, then something is still using that token.
  - at that point, the cleanest reset is to rotate the Telegram bot token in BotFather and update the new machine
Only extra caveat: if you had configured Telegram webhook mode (webhookUrl/webhookPath), then you would also want to repoint or clear that. But your log shows getUpdates, so this looks like polling, not webhook mode.
So, short answer: probably nothing to clean up remotely, unless the conflict keeps happening.
ok so should i wake my bot up by asking hello
Yep, that's the right quick test.
I'd do:
- restart the gateway once
- run: `openclaw logs --follow`
- then send your bot a simple `hello` on Telegram
What you want to see:
- no more `409 getUpdates conflict`
- inbound Telegram activity in logs
- a normal reply
If it still goes weird after that, paste the fresh log slice from that exact hello test.
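The signals listed above can be checked mechanically on a saved log slice. This is a rough sketch that only looks for the two strings quoted earlier in this thread (`terminated by other getUpdates request` and `telegram sendMessage ok`); `check_hello_test` and the `hello-test.log` file name are hypothetical.

```shell
# check_hello_test: hypothetical helper. Reads a log slice on stdin and
# reports on the signals named above, using strings quoted in this thread.
check_hello_test() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -q 'terminated by other getUpdates request'; then
    echo "conflict: another consumer is still polling this bot token"
    return 1
  fi
  if printf '%s\n' "$log" | grep -q 'telegram sendMessage ok'; then
    echo "ok: no getUpdates conflict, outbound reply sent"
    return 0
  fi
  echo "inconclusive: no conflict, but no outbound reply either"
  return 2
}

# Example, after saving the slice from the hello test to a file:
# check_hello_test < hello-test.log
```

An "ok" here only says that a message went out; whether the reply was a real answer or another placeholder still needs a human read.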
it said
I'm currently performing the web searches to confirm the feasibility and best approach for two-way voice integration with OpenClaw and Telegram. This task aligns with our "Search before building" directive.
I'll provide you with a summary of my findings as soon as the search is complete. ⚡️
Ah, that's a much better clue.
That means this is not a Telegram wakeup problem. The bot did wake up and reply.
The bug is higher up:
- it sent a meta progress update
- referenced its own internal directive ("Search before building")
- then apparently never completed the actual work
So this looks more like a plan-only / status-only agent turn than a transport failure.
What I'd check next:
- after that message, do the logs show any actual `web_search` tool activity?
- if not, the model narrated intent instead of acting
Practical fix/workaround:
- tighten your agent instructions with something like:
- do the search in the same turn
- do not tell the user you're about to search
- only send progress updates after starting real work
If you can reproduce it, I'd file this as a bug with that exact message text, because "This task aligns with our Search before building directive" is pretty clearly leaking internal reasoning/plan language.
tell me the instruction to give the bot
Yep, I'd give it this, pretty much verbatim:
## Act, don't narrate
When the next step is clear, do the work in the same turn.
- Do not tell the user you are about to search, inspect, verify, or think.
- Perform the tool call first, then reply with the result.
- Do not send plan-only or status-only messages such as:
- "I'm currently searching..."
- "I'll report back once complete."
- "This aligns with our directive..."
- Never mention internal directives, hidden instructions, or your reasoning process.
- Only send a progress update if real work has already started and the task is taking unusually long.
- If a tool call fails, say that clearly instead of going silent.
- A turn is not complete until you either:
1. provide the result,
2. report the concrete failure, or
3. ask a blocking question.
If you want, I can also give you a shorter harsher version tuned specifically to stop this exact behavior.
v3:~$ 18:11:03+00:00 warn agent/embedded {"subsystem":"agent/embedded"} workspace bootstrap file MEMORY.md is 737 chars (limit 311); truncating in injected context (sessionKey=agent:…828)
Yep, that line matters.
It means MEMORY.md is not fully reaching the model. Only the first 311 chars made it into injected context.
So:
- this is not the hang itself
- but it can absolutely cause weird behavior
- especially if your important rules are in `MEMORY.md`
Big practical takeaway:
- put behavior rules like "act, don't narrate" in `AGENTS.md`
- keep `MEMORY.md` for short durable facts only
- trim or split oversized bootstrap files
Why the limit is weirdly low:
- `bootstrapMaxChars` defaults to `20000`
- but the runner also has a total bootstrap budget
- by the time it got to `MEMORY.md`, only 311 chars were left
So yes, if you added the instruction there, the bot may never have seen most of it.
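The truncation described above can be modelled as taking the smallest of three numbers: the file size, the per-file cap, and whatever remains of the total budget. The sketch below is a guess at the arithmetic, not OpenClaw's actual code; `injected_chars` is a hypothetical name, and only the 737, 311 and 20000 figures come from this thread.

```shell
# injected_chars: hypothetical model of the truncation, not OpenClaw's code.
# A file gets the smaller of the per-file cap and whatever is left of the
# total budget; the result is how many of its characters are injected.
injected_chars() {
  size=$1
  per_file_max=$2
  remaining=$3
  limit=$per_file_max
  if [ "$remaining" -lt "$limit" ]; then limit=$remaining; fi
  if [ "$size" -lt "$limit" ]; then limit=$size; fi
  echo "$limit"
}

# Numbers from the log line above: MEMORY.md is 737 chars, the per-file
# default is 20000, and only 311 chars of total budget were left.
injected_chars 737 20000 311   # prints 311
```

Under this model, shrinking `MEMORY.md` alone helps only until it fits the leftover budget; shrinking whatever is loaded before it frees up more room.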
this is all in memory pretty small.
what's the syntax to remove the Action Required line? just remove it entirely?
MEMORY.md
Discord Integration
Status: Active outbound posting capability via post_hi.js.
Bot Bridge: discord_bridge.js (running in background session neat-wharf) listens for inbound Discord messages from all channels and echoes responses back to the originating channel.
Expectations: The bot must actively monitor all Discord channels and provide direct, conversational replies in the channel just as if the conversation were occurring in this Telegram chat.
Security: Discord Bot Token is stored in ~/.openclaw/workspace/.env.
Action Required: The current Discord token is compromised due to disclosure in session history; token must be regenerated and updated in .env as soon as possible.
There’s no special syntax. MEMORY.md is just plain markdown/text.
If you want that line gone, literally delete it.
For example, change this:
Action Required: The current Discord token is compromised due to disclosure in session history; token must be regenerated and updated in .env as soon as possible.
to nothing.
I’d also shorten the whole file a bit, like this:
## Discord Integration
- Outbound posting active via `post_hi.js`
- Bridge: `discord_bridge.js`
- Expectation: reply directly in Discord channels
Also, I would not keep token-location/security reminders in MEMORY.md unless the bot truly needs them every turn. That kind of stuff just burns bootstrap budget.
k after updating memory file we need to restart?
Usually no full restart.
In normal setups, the next turn should pick up the updated file automatically.
Two caveats:
- if `contextInjection` is set to `continuation-skip`, an in-progress/safe continuation turn may not re-read bootstrap files immediately
- if you want to be sure, start a fresh session (`/new`) or just send a new clean prompt after the edit
So, short answer:
- Gateway restart: usually not needed
- Fresh session/new turn: safest way to force the new memory/instructions to apply
Yep, that is the smoking gun.
Your agent still has a live BOOTSTRAP.md, and it’s the generic first-run “who am I?” bootstrap file.
That should be one-time only. If it sticks around, it can absolutely hijack behavior and make the agent act weird.
What to do:
- Delete or move aside `~/.openclaw/workspace/BOOTSTRAP.md`
- Start a fresh session (`/new`)
- If it still warns about `BOOTSTRAP.md`, restart the gateway once
So, yes, I’d treat this as the main fix right now.
Docs: https://docs.openclaw.ai/start/bootstrapping
And for future behavior rules:
- `AGENTS.md` = standing instructions
- `MEMORY.md` = short facts
- `BOOTSTRAP.md` = should not still be there after setup
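For the "delete or move aside" step, moving the file out of the workspace keeps a copy in case it needs to be restored. A minimal sketch: `move_aside` is a hypothetical helper, and the backup directory is an arbitrary choice outside `~/.openclaw/workspace`.

```shell
# move_aside: hypothetical helper. Moves a file out of the workspace into a
# backup directory instead of deleting it, so it can be restored if needed.
move_aside() {
  file=$1
  backup_dir=$2
  if [ ! -f "$file" ]; then
    echo "nothing to move: $file"
    return 0
  fi
  mkdir -p "$backup_dir" && mv "$file" "$backup_dir/"
}

# Example (the backup path is an arbitrary choice):
# move_aside ~/.openclaw/workspace/BOOTSTRAP.md ~/openclaw-backup
```

Putting the copy outside the workspace avoids any chance of it being picked up again as a bootstrap file.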
18:22:40+00:00 warn Warning: TT: undefined function: 21
18:22:40+00:00 warn Warning: TT: undefined function: 21
That one is probably a red herring.
Warning: TT: undefined function: 21 is usually a PDF/font parsing warning (TT = TrueType), the kind of thing pdf.js and similar parsers emit when they hit a weird embedded font.
So:
- it is not a classic OpenClaw hang signature
- it usually means some PDF/document parsing path touched a funky font
- by itself, it normally does not explain the “said it was working, then went silent” bug
I would treat it as noise unless your bot was specifically reading a PDF or document at that moment. The more important clues so far are still:
- stale bootstrap file
- plan/status-only reply behavior
- earlier Telegram polling conflict
ok bot hung again after the /new
I'm analyzing the text now to extract key skills and achievements. Once that's done, we can discuss how to best optimize it for your target roles.
Yep, that confirms it.
After /new, this is still the same progress-only failure mode:
- “I’m analyzing…”
- “Once that’s done…”
- then no actual completion
So this is not just stale session state anymore. It’s the bot producing placeholder progress messages instead of doing the task.
I’d replace your instruction with a harsher one in AGENTS.md:
## No placeholder progress replies
Do not send messages like:
- "I'm analyzing..."
- "I'm extracting..."
- "Once that's done..."
- "I'll report back..."
When the next action is clear, do the work in the same turn.
A reply is only valid if it does one of these:
1. gives the result,
2. gives a concrete partial result,
3. asks a blocking question, or
4. reports a real failure.
Never describe intended work without doing it.
Never mention internal process, hidden directives, or future follow-up you have not already performed.
If it still does this after that, I’d stop blaming memory/bootstrap and start blaming the model/agent behavior itself.
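To tell whether the new rule is working, replies can be screened for the placeholder phrasing seen so far. This is a rough sketch: `is_placeholder_reply` is a hypothetical helper, the phrase list is taken from the replies quoted in this thread and is far from exhaustive, and it only matches straight apostrophes.

```shell
# is_placeholder_reply: hypothetical helper. Succeeds (exit 0) when the
# text contains one of the placeholder phrases seen in this thread.
# The phrase list is illustrative and far from exhaustive.
is_placeholder_reply() {
  printf '%s\n' "$1" | grep -qiE \
    "i'm (currently )?(analyzing|searching|extracting|performing)|once that's done|i'll (report back|provide you with)|stand by"
}

# Example with the reply quoted above:
if is_placeholder_reply "I'm analyzing the text now to extract key skills and achievements."; then
  echo "placeholder reply"
fi
```

A match is only a hint that the turn ended on a promise; the logs still have to show whether any tool call actually ran.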
after updating do i need to restart or just /new
Start with /new.
Usually that’s enough after editing AGENTS.md / MEMORY.md.
I’d only do a gateway restart if:
- it still shows the old behavior after `/new`, or
- it keeps warning about `BOOTSTRAP.md` / stale bootstrap context
So the order I’d use is:
- save the file changes
- `/new`
- test once
- if still weird, restart gateway once
18:27:29+00:00 info hooks/session-memory {"subsystem":"hooks/session-memory"} Session context saved to ~/.openclaw/workspace/memory/2026-04-13-resume-review.md
18:27:32+00:00 info gateway/channels/telegram {"subsystem":"gateway/channels/telegram"} telegram sendMessage ok chat=473602828 message=2408
18:28:00+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✗ sessions.resolve 2ms errorCode=INVALID_REQUEST errorMessage=No session found with label: discord-scan conn=3f02f6cc…26a9 id=871f4eec…1a85
18:28:00+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✗ sessions.resolve 1ms errorCode=INVALID_REQUEST errorMessage=No session found with label: agentmail-scan conn=fdec5117…422c id=5279b9f5…5323
18:28:12+00:00 warn telegram/network {"subsystem":"telegram/network"} fetch fallback: DNS-resolved IP unreachable; trying alternative Telegram API IP (codes=UND_ERR_SOCKET)
This is actually useful.
What these lines say:
- `telegram sendMessage ok ...`
  → Telegram did send a message, so this is not a pure Telegram delivery failure.
- `sessions.resolve ... No session found with label: discord-scan`
- `sessions.resolve ... No session found with label: agentmail-scan`
  → something in your bot is trying to use missing named sessions.
That is the weird part.
So my read now is:
- Telegram transport worked.
- Your bot or its instructions/hooks are still trying to resolve stale session labels.
- That could absolutely derail the turn.
I would search your workspace/config for those exact strings:
rg -n "discord-scan|agentmail-scan" ~/.openclaw/workspace ~/.openclaw
If you find them in AGENTS.md, MEMORY.md, hooks, or custom scripts, remove or update them.
The last line:
telegram/network ... trying alternative Telegram API IP
is worth noting, but it looks like fallback/noise unless it starts repeating and causing actual send failures.
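To pull the failing labels out of a longer log slice before searching the workspace for them, something like the following could work. It assumes the error text always reads `No session found with label: <label>` as in the lines above; `missing_session_labels` is a hypothetical helper.

```shell
# missing_session_labels: hypothetical helper. Reads log lines on stdin and
# prints each distinct label that sessions.resolve failed to find.
missing_session_labels() {
  grep -o 'No session found with label: [^ ]*' \
    | sed 's/.*label: //' \
    | sort -u
}

# Example:
# missing_session_labels < /tmp/openclaw/openclaw-2026-04-13.log
```

Each label it prints is a string worth searching for in the workspace and config files.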
rg command not found
No worries, use grep instead:
grep -RniE "discord-scan|agentmail-scan" ~/.openclaw/workspace ~/.openclaw 2>/dev/null
If you want to limit it just to likely text files first:
grep -RniE --include='*.md' "discord-scan|agentmail-scan" ~/.openclaw/workspace 2>/dev/null
You’re looking for wherever those stale session labels are being referenced.
Nice, this is the real bug. We found it.
What’s happening:
- `HEARTBEAT.md` defines task names
  - `discord-scan`
  - `agentmail-scan`
- the bot is wrongly treating those as session labels
- then it calls:
  - `sessions_send(label="discord-scan")`
  - `sessions_send(label="agentmail-scan")`
- those fail, because no such sessions exist
So the fix is:
- do not treat HEARTBEAT task names as sessions
- when asked to “continue tasks”, either:
- let heartbeat run them normally, or
- perform the checks directly in the current session
I’d add this to AGENTS.md:
HEARTBEAT task names are not session labels.
Do not call `sessions_send` or `sessions.resolve` using HEARTBEAT task names unless a real visible session with that exact label already exists.
When asked to run, continue, or resume heartbeat tasks, perform the task directly in the current session unless an actual session key is known.
So yeah, this one is much more concrete than the earlier guesses. The bot is confusing task names with sessions.
3602828 message=2412
18:33:47+00:00 warn agent/embedded {"subsystem":"agent/embedded"} workspace bootstrap file MEMORY.md is 158 chars (limit 90); truncating in injected context (sessionKey=agent:…828)
18:33:49+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✗ sessions.resolve 2ms errorCode=INVALID_REQUEST errorMessage=No session found with label: discord-scan conn=b6489138…52db id=07113bee…9835
I'm having some trouble with the automated monitoring setup. I'll check the Discord channels and AgentMail inbox directly. Stand by.
Yep, this confirms it’s still the same session-label bug.
Two useful signals here:
- `MEMORY.md ... limit 90`
  → `MEMORY.md` is still too truncated to trust for important behavior rules.
- `No session found with label: discord-scan`
  → the bot is still trying to treat the heartbeat task name as a session label.
So the fix has not landed in the right place yet.
What I’d do now:
- Put the rule in `AGENTS.md`, not `MEMORY.md`
- Run `/new`
- If it still does this after `/new`, restart the gateway once
- Avoid prompts like “continue tasks” for now, because that seems to trigger the bad `sessions.resolve` path
Short version:
Telegram is fine, MEMORY is too truncated, and the bot is still confusing HEARTBEAT task names with session labels.
Yep, same bug again.
It hit the bad session-label path, then produced another placeholder fallback message:
- “I’m having some trouble...”
- “I’ll check directly...”
- “Stand by.”
That means you also need an error-handling rule, not just a no-progress-update rule.
Add this too:
If a tool call fails, do not tell the user what you plan to do next.
Either:
1. retry correctly in the same turn,
2. perform the fallback action in the same turn, or
3. report the concrete failure and stop.
Never say "stand by", "I'll check directly", "I'm having trouble", or similar placeholder messages unless you already include the actual result in that same reply.
So yes, this is the same pattern, just after a tool error now.
Yep, this pins it down pretty hard.
This is not a runtime crash. It is a bad assistant completion path.
The proof is here:
- it invents `sessions_send(label="discord-scan")`
- tool returns `No session found with label: discord-scan`
- same for `agentmail-scan`
- then the model ends normally with `stopReason:"stop"` after sending the placeholder reply
So the bug chain is:
- user says “continue tasks”
- model interprets HEARTBEAT task names as session labels
- tool calls fail
- model sends “I’ll proceed directly, stand by”
- stops without doing the fallback
That means the cleanest practical fix is:
- don’t use “continue tasks” as a prompt
- let heartbeat run those automatically
- if you want a manual run, ask directly:
- “check Discord channels now”
- “check AgentMail now”
And keep these AGENTS rules:
HEARTBEAT task names are not session labels.
Do not use sessions_send/sessions.resolve with HEARTBEAT task names.
If a tool call fails, perform the fallback in the same turn or report the failure clearly.
Never end with placeholder text like "stand by" or "I'll proceed directly".
Honestly, I think we’ve got the core issue now.