latest version had a hang - Bot is Lazy bug | Friends of the Crustacean 🦞🤝 | Page 1

slender token Apr 13, 2026, 5:50 PM

#

same bug - bot says i am doing something and goes silent
https://gist.github.com/viklall/a810ab2a8aec68189484b7ef5a6afc35

north auroraBOT Apr 13, 2026, 5:53 PM

#

Yep, I think this is the same failure mode.

From your gist, the important part is this, not the Discord reconnect noise:

503 ... model is currently experiencing high demand
then No available auth profile for google (all in cooldown or unavailable)
then Embedded agent failed before reply

So the bot is not really "thinking forever". Google is returning overloads, OpenClaw cools that profile down, and because your fallback chain is also Google-only, it runs out of places to fail over to.

Why it feels silent:

typing/working starts
provider fails mid-turn
on your setup it is not surfacing the final error cleanly, which does look like a bug

Workaround for now:

Add at least one fallback on a different provider that you already have auth for.

Run openclaw models status --probe to confirm which providers are actually usable.
Keep openclaw logs --follow open while testing.

Docs:

models: https://docs.openclaw.ai/concepts/models
failover: https://docs.openclaw.ai/concepts/model-failover
troubleshooting: https://docs.openclaw.ai/gateway/troubleshooting

Since this is latest stable and it still goes silent instead of replying with the overload error, please file this on GitHub with that gist attached: https://github.com/openclaw/openclaw/issues

If you want, paste your agents.defaults.model block and I can sanity check the fallback setup.

slender token Apr 13, 2026, 5:54 PM

#

google recommended:

"agents": {
"defaults": {
"model": {
"primary": "google/gemini-3.1-flash",
"fallbacks": [
"google/gemini-3.1-pro-preview",
"google/gemini-3.1-flash-lite-preview"
]
},
"models": {
"google/gemini-3.1-flash": {},
"google/gemini-3.1-pro-preview": {},
"google/gemini-3.1-flash-lite-preview": {},
"google/gemini-2.5-flash": {}
},

north auroraBOT Apr 13, 2026, 5:54 PM

#

Yep, that's the catch.

That config is Google-only failover, not provider failover.

So if Google itself is overloaded, this happens:

primary Google model fails
fallback Google model fails
next Google model fails
bot has nowhere else to go

Also, in the snippet you pasted:

"fallbacks": [
  "google/gemini-3.1-pro-preview",
  "google/gemini-3.1-flash-lite-preview"
],
"models": {
  "google/gemini-3.1-flash": {},
  "google/gemini-3.1-pro-preview": {},
  "google/gemini-3.1-flash-lite-preview": {},
  "google/gemini-2.5-flash": {}
}

google/gemini-2.5-flash is only in models, not in fallbacks, so it is allowed, but not automatically used as a fallback.

So, short version:

the recommendation is fine for "use Google"
it is not resilient to Google-wide 503/high-demand events
the silent part still smells like a bug

If you want resilience, set at least one fallback on a different provider that you have auth for.

And yeah, I'd still file this as a bug, because even with Google overloads it should fail loudly, not act busy and then disappear: https://github.com/openclaw/openclaw/issues
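
For checking a chain like the one pasted above, here is a minimal Python sketch. It assumes model IDs always use the `provider/model` form seen in that config; the helper name `providers_in_chain` is made up for illustration and is not part of OpenClaw.

```python
import json


def providers_in_chain(model_cfg):
    """Return the set of provider prefixes used by primary + fallbacks."""
    chain = [model_cfg["primary"], *model_cfg.get("fallbacks", [])]
    # Assumption: every model ID looks like "provider/model".
    return {model_id.split("/", 1)[0] for model_id in chain}


# The "model" block from the pasted config, as a standalone JSON object.
config_text = """
{
  "primary": "google/gemini-3.1-flash",
  "fallbacks": [
    "google/gemini-3.1-pro-preview",
    "google/gemini-3.1-flash-lite-preview"
  ]
}
"""

providers = providers_in_chain(json.loads(config_text))
if len(providers) == 1:
    print(f"single-provider chain, no cross-provider failover: {sorted(providers)}")
```

If the set has only one entry, every fallback shares the same upstream, so a provider-wide 503 exhausts the whole chain.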

slender token Apr 13, 2026, 5:56 PM

#

thought those were earlier

north auroraBOT Apr 13, 2026, 5:57 PM

#

Yeah, fair catch, I jumped too fast there.

If those overload lines are from earlier, then I would not pin this exact hang on Google yet.

What the gist proves so far is:

there is at least one real provider-failure path in the log
but it does not prove that the latest silent turn was caused by that same thing

To nail the current one down, I need the log slice for the actual minute of the silent reply, ideally:

~30 to 60 lines before the user message
the run itself
~30 to 60 lines after

Best next step:

openclaw logs --follow

Then reproduce once and paste the section around that timestamp.

If you want, also paste:

openclaw status --all

Right now my best guess is still "reply-path bug", but I don't want to bullshit you on the exact trigger.

slender token Apr 13, 2026, 5:59 PM

#

the hang was at 10:35 AM PST

north auroraBOT Apr 13, 2026, 6:00 PM

#

Got it.

10:35 Pacific today is 17:35 UTC in the log.

I checked the gist again, and I don't actually see the 17:35 UTC window in there. The closest Discord transport events I can see are:

15:37:17 stale-socket restart
16:12:17 stale-socket restart
17:17:17 stale-socket restart

So I think the exact hang window is missing from that paste.

Can you grab just the slice around 17:34 to 17:36 UTC from the raw log file?

grep -n '17:3[4-6]:' /tmp/openclaw/openclaw-2026-04-13.log

or just open the file and paste ~100 lines around that time.

That should tell us whether this was:

another model failure,
a Discord reconnect/drop,
or the actual "went silent after saying it's working" bug.
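
If grep is awkward, the same time-window filter can be sketched in Python. This assumes each log line starts with an `HH:MM:SS` timestamp, as in the lines quoted in this thread; the raw file on disk may be formatted differently. The sample lines are shortened stand-ins, and the third one is invented purely to show a match.

```python
import re

# Assumption: each log line starts with a zero-padded HH:MM:SS timestamp.
TIME_PREFIX = re.compile(r"^(\d{2}:\d{2}:\d{2})")


def slice_by_time(lines, start, end):
    """Keep lines whose leading HH:MM:SS falls within [start, end]."""
    selected = []
    for line in lines:
        match = TIME_PREFIX.match(line)
        # Zero-padded times compare correctly as plain strings.
        if match and start <= match.group(1) <= end:
            selected.append(line)
    return selected


sample = [
    "17:17:17+00:00 info (stale-socket restart, shortened)",
    "17:22:01+00:00 info (telegram send, shortened)",
    "17:35:10+00:00 info (hypothetical line inside the window)",
]

for line in slice_by_time(sample, "17:34:00", "17:36:59"):
    print(line)
```

For a real file, pass `open(path, encoding="utf-8")` as `lines`. Lines without a leading timestamp, such as wrapped stack traces, are skipped by this sketch.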

slender token Apr 13, 2026, 6:01 PM

#

grep didn't return anything. what is the log filename so i can get it?

north auroraBOT Apr 13, 2026, 6:01 PM

#

From your gist, it was:

/tmp/openclaw/openclaw-2026-04-13.log

If you want to confirm the latest one on disk:

ls -t /tmp/openclaw/openclaw-*.log | head -1

Then grab the time slice from that file, for example:

grep -n '17:3[4-6]:' /tmp/openclaw/openclaw-2026-04-13.log

or just:

sed -n '1,260p' /tmp/openclaw/openclaw-2026-04-13.log

and paste the section around the hang time.

slender token Apr 13, 2026, 6:02 PM

#

https://gist.github.com/viklall/a810ab2a8aec68189484b7ef5a6afc35

north auroraBOT Apr 13, 2026, 6:03 PM

#

I checked the raw gist, and it still ends at 17:22:01 UTC.

So the hang window is not in that gist yet.

If you meant:

10:35 PDT → look around 17:35 UTC
10:35 PST literally → look around 18:35 UTC

So please grab a later slice, for example:

tail -n 300 /tmp/openclaw/openclaw-2026-04-13.log

or more targeted:

grep -n '17:3[0-9]:\|18:3[0-9]:' /tmp/openclaw/openclaw-2026-04-13.log

Right now the latest thing in the gist is just:

Discord stale-socket restart at 17:17:17
Telegram sends through 17:22:01

That is still too early to explain a 10:35 Pacific hang.
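
The PDT/PST arithmetic above can be double-checked with a short Python sketch. It uses fixed UTC offsets (UTC-7 for PDT, UTC-8 for PST) rather than a timezone database, and the helper name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific Daylight Time
PST = timezone(timedelta(hours=-8), "PST")  # Pacific Standard Time


def to_utc_hhmm(hour, minute, tz):
    """Convert a local wall-clock time on 2026-04-13 to HH:MM in UTC."""
    local = datetime(2026, 4, 13, hour, minute, tzinfo=tz)
    return local.astimezone(timezone.utc).strftime("%H:%M")


print(to_utc_hhmm(10, 35, PDT))  # 17:35
print(to_utc_hhmm(10, 35, PST))  # 18:35
```

In mid-April the Pacific zone is on daylight time, so "10:35 Pacific" normally means the PDT result.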

slender token Apr 13, 2026, 6:03 PM

#

there was nothing to share, it was quiet

north auroraBOT Apr 13, 2026, 6:03 PM

#

That actually helps.

If the hang window was completely quiet in the log, then this looks less like a model failure and more like a Discord transport/liveness bug.

Why I say that:

the Google overloads we saw earlier are noisy in logs
this hang had no corresponding run/error lines
your gist does show repeated Discord health restarts with reason: stale-socket

So my current best guess is:

Discord side got wedged
the bot looked "busy"/alive
but the turn never fully executed or never made it onto the normal reply path

That is definitely bug-report material. I'd file it with these notes:

latest stable
bot showed it was working, then went silent
no logs at all for the hang window
recurring health-monitor: restarting (reason: stale-socket) on Discord

GitHub issues: https://github.com/openclaw/openclaw/issues

If you want one extra datapoint before filing, run this right after it happens:

openclaw channels status --probe

That should help confirm whether Discord was connected but unhealthy.

slender token Apr 13, 2026, 6:04 PM

#

this was in telegram. stop assuming discord, i never mentioned it

north auroraBOT Apr 13, 2026, 6:04 PM

#

You're right, my bad. I assumed the current thread surface instead of the surface that actually hung.

If this was Telegram, then the Discord stale-socket stuff is irrelevant.

The most suspicious Telegram-specific line in your gist is actually this:

409: Conflict: terminated by other getUpdates request; make sure that only one bot instance is running

That means something else is polling the same Telegram bot token. If two consumers are hitting getUpdates, each request terminates the other's long poll, so incoming messages can be picked up by the wrong process or missed.

So for Telegram, my revised read is:

First thing to fix/check: make sure only one OpenClaw instance, container, or other bot process is using that Telegram token.
The later telegram sendMessage ok ... lines show outbound can work, so this is probably not a simple "can't send" issue.
If the hang window was quiet in logs, that can fit a Telegram polling/consumer conflict better than the earlier Google/Discord theories.

So yes, ignore my Discord guess. For Telegram, I would focus on the getUpdates conflict first.

slender token Apr 13, 2026, 6:06 PM

#

i terminated an openclaw instance and moved to a new machine. that old machine was terminated too. anything to clean up?

north auroraBOT Apr 13, 2026, 6:06 PM

#

If the old machine is truly dead, then usually no Telegram-side cleanup is needed for long polling.

The getUpdates conflict should stop as soon as there is only one live consumer for that bot token.

So the cleanup checklist is mostly:

Make sure only one OpenClaw service/process is running now
- no second VM/container
- no old systemd service on the new box
- no other app using the same Telegram bot token
Restart the gateway once on the new machine
Check channel health
```
openclaw channels status --probe
```

If the 409 getUpdates conflict still appears, then something is still using that token.
- at that point, the cleanest reset is to rotate the Telegram bot token in BotFather and update the new machine

Only extra caveat: if you had configured Telegram webhook mode (webhookUrl/webhookPath), then you would also want to repoint or clear that. But your log shows getUpdates, so this looks like polling, not webhook mode.

So, short answer: probably nothing to clean up remotely, unless the conflict keeps happening.

slender token Apr 13, 2026, 6:07 PM

#

ok so should i wake my bot up by asking hello

north auroraBOT Apr 13, 2026, 6:07 PM

#

Yep, that's the right quick test.

I'd do:

restart the gateway once
run:
```
openclaw logs --follow
```
then send your bot a simple hello on Telegram

What you want to see:

no more 409 getUpdates conflict
inbound Telegram activity in logs
a normal reply

If it still goes weird after that, paste the fresh log slice from that exact hello test.
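
To read the fresh log slice from that hello test quickly, here is a rough Python sketch. The two marker strings are copied from log lines quoted in this thread; the function name and the sample lines are illustrative only.

```python
# Marker substrings taken from log lines quoted earlier in this thread.
CONFLICT_MARKER = "terminated by other getUpdates request"
SEND_OK_MARKER = "telegram sendMessage ok"


def summarize_hello_test(lines):
    """Count polling conflicts and successful sends in a log slice."""
    return {
        "conflicts": sum(CONFLICT_MARKER in line for line in lines),
        "sends_ok": sum(SEND_OK_MARKER in line for line in lines),
    }


# Illustrative slice, not real output.
sample_slice = [
    "409: Conflict: terminated by other getUpdates request; "
    "make sure that only one bot instance is running",
    "telegram sendMessage ok (rest of line omitted)",
]

print(summarize_hello_test(sample_slice))
```

A healthy test run should show zero conflicts and at least one successful send.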

slender token Apr 13, 2026, 6:09 PM

#

it said

I'm currently performing the web searches to confirm the feasibility and best approach for two-way voice integration with OpenClaw and Telegram. This task aligns with our "Search before building" directive.

I'll provide you with a summary of my findings as soon as the search is complete. ⚡️

north auroraBOT Apr 13, 2026, 6:09 PM

#

Ah, that's a much better clue.

That means this is not a Telegram wakeup problem. The bot did wake up and reply.

The bug is higher up:

it sent a meta progress update
referenced its own internal directive ("Search before building")
then apparently never completed the actual work

So this looks more like a plan-only / status-only agent turn than a transport failure.

What I'd check next:

after that message, do the logs show any actual web_search tool activity?
if not, the model narrated intent instead of acting

Practical fix/workaround:

tighten your agent instructions with something like:
- do the search in the same turn
- do not tell the user you're about to search
- only send progress updates after starting real work

If you can reproduce it, I'd file this as a bug with that exact message text, because "This task aligns with our Search before building directive" is pretty clearly leaking internal reasoning/plan language.

slender token Apr 13, 2026, 6:10 PM

#

tell me the instruction to give the bot

north auroraBOT Apr 13, 2026, 6:10 PM

#

Yep, I'd give it this, pretty much verbatim:

## Act, don't narrate

When the next step is clear, do the work in the same turn.

- Do not tell the user you are about to search, inspect, verify, or think.
- Perform the tool call first, then reply with the result.
- Do not send plan-only or status-only messages such as:
  - "I'm currently searching..."
  - "I'll report back once complete."
  - "This aligns with our directive..."
- Never mention internal directives, hidden instructions, or your reasoning process.
- Only send a progress update if real work has already started and the task is taking unusually long.
- If a tool call fails, say that clearly instead of going silent.

- A turn is not complete until you either:
  1. provide the result,
  2. report the concrete failure, or
  3. ask a blocking question.

If you want, I can also give you a shorter harsher version tuned specifically to stop this exact behavior.

slender token Apr 13, 2026, 6:11 PM

#

v3:~$ 18:11:03+00:00 warn agent/embedded {"subsystem":"agent/embedded"} workspace bootstrap file MEMORY.md is 737 chars (limit 311); truncating in injected context (sessionKey=agent:…828)

north auroraBOT Apr 13, 2026, 6:12 PM

#

Yep, that line matters.

It means MEMORY.md is not fully reaching the model. Only the first 311 chars made it into injected context.

So:

this is not the hang itself
but it can absolutely cause weird behavior
especially if your important rules are in MEMORY.md

Big practical takeaway:

put behavior rules like "act, don't narrate" in AGENTS.md
keep MEMORY.md for short durable facts only
trim or split oversized bootstrap files

Why the limit is weirdly low:

bootstrapMaxChars defaults to 20000
but the runner also has a total bootstrap budget
by the time it got to MEMORY.md, only 311 chars were left

So yes, if you added the instruction there, the bot may never have seen most of it.
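
To make the budget arithmetic concrete, here is a toy Python model of a per-file cap plus a shared total budget. It is not OpenClaw's actual algorithm. Only the 20000 cap, the 737-char MEMORY.md, and the 311-char limit come from this thread; the total budget, the other file sizes, and the file order are invented so the numbers line up.

```python
def allocate_bootstrap(files, per_file_max, total_budget):
    """Return {name: chars_injected} for files processed in order."""
    remaining = total_budget
    injected = {}
    for name, size in files:
        # A file gets at most its own size, the per-file cap,
        # and whatever is left of the shared budget.
        allowed = min(size, per_file_max, remaining)
        injected[name] = allowed
        remaining -= allowed
    return injected


# Hypothetical sizes and order; only MEMORY.md's 737 is from the log line.
files = [
    ("AGENTS.md", 12000),
    ("BOOTSTRAP.md", 11689),
    ("MEMORY.md", 737),
]

result = allocate_bootstrap(files, per_file_max=20000, total_budget=24000)
print(result["MEMORY.md"])  # 311
```

Under this model, trimming the earlier files frees budget for MEMORY.md even though MEMORY.md itself is small.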

slender token Apr 13, 2026, 6:14 PM

#

this is all in memory, pretty small.
what's the syntax to remove the Action Required line? just remove it entirely?

MEMORY.md
Discord Integration
Status: Active outbound posting capability via post_hi.js.

Bot Bridge: discord_bridge.js (running in background session neat-wharf) listens for inbound Discord messages from all channels and echoes responses back to the originating channel.

Expectations: The bot must actively monitor all Discord channels and provide direct, conversational replies in the channel just as if the conversation were occurring in this Telegram chat.

Security: Discord Bot Token is stored in ~/.openclaw/workspace/.env.

Action Required: The current Discord token is compromised due to disclosure in session history; token must be regenerated and updated in .env as soon as possible.

north auroraBOT Apr 13, 2026, 6:14 PM

#

There’s no special syntax. MEMORY.md is just plain markdown/text.

If you want that line gone, literally delete it.

For example, change this:

Action Required: The current Discord token is compromised due to disclosure in session history; token must be regenerated and updated in .env as soon as possible.

to nothing.

I’d also shorten the whole file a bit, like this:

## Discord Integration

- Outbound posting active via `post_hi.js`
- Bridge: `discord_bridge.js`
- Expectation: reply directly in Discord channels

Also, I would not keep token-location/security reminders in MEMORY.md unless the bot truly needs them every turn. That kind of stuff just burns bootstrap budget.

slender token Apr 13, 2026, 6:16 PM

#

k after updating memory file we need to restart?

north auroraBOT Apr 13, 2026, 6:17 PM

#

Usually no full restart.

In normal setups, the next turn should pick up the updated file automatically.

Two caveats:

if contextInjection is set to continuation-skip, an in-progress/safe continuation turn may not re-read bootstrap files immediately
if you want to be sure, start a fresh session (/new) or just send a new clean prompt after the edit

So, short answer:

Gateway restart: usually not needed
Fresh session/new turn: safest way to force the new memory/instructions to apply

slender token Apr 13, 2026, 6:20 PM

#

https://gist.github.com/viklall/a810ab2a8aec68189484b7ef5a6afc35

north auroraBOT Apr 13, 2026, 6:21 PM

#

Yep, that is the smoking gun.

Your agent still has a live BOOTSTRAP.md, and it’s the generic first-run “who am I?” bootstrap file.

That should be one-time only. If it sticks around, it can absolutely hijack behavior and make the agent act weird.

What to do:

Delete or move aside
```
~/.openclaw/workspace/BOOTSTRAP.md
```
Start a fresh session (/new)
If it still warns about BOOTSTRAP.md, restart the gateway once

So, yes, I’d treat this as the main fix right now.

Docs: https://docs.openclaw.ai/start/bootstrapping

And for future behavior rules:

AGENTS.md = standing instructions
MEMORY.md = short facts
BOOTSTRAP.md = should not still be there after setup

slender token Apr 13, 2026, 6:22 PM

#

18:22:40+00:00 warn Warning: TT: undefined function: 21
18:22:40+00:00 warn Warning: TT: undefined function: 21

north auroraBOT Apr 13, 2026, 6:23 PM

#

That one is probably a red herring.

Warning: TT: undefined function: 21 is usually a PDF/font parsing warning (TT = TrueType), the kind of thing pdf.js and similar parsers emit when they hit a weird embedded font.

So:

it is not a classic OpenClaw hang signature
it usually means some PDF/document parsing path touched a funky font
by itself, it normally does not explain the “said it was working, then went silent” bug

I would treat it as noise unless your bot was specifically reading a PDF or document at that moment. The more important clues so far are still:

stale bootstrap file
plan/status-only reply behavior
earlier Telegram polling conflict

slender token Apr 13, 2026, 6:25 PM

#

ok bot hung again after the /new

I'm analyzing the text now to extract key skills and achievements. Once that's done, we can discuss how to best optimize it for your target roles.

north auroraBOT Apr 13, 2026, 6:25 PM

#

Yep, that confirms it.

After /new, this is still the same progress-only failure mode:

“I’m analyzing…”
“Once that’s done…”
then no actual completion

So this is not just stale session state anymore. It’s the bot producing placeholder progress messages instead of doing the task.

I’d replace your instruction with a harsher one in AGENTS.md:

## No placeholder progress replies

Do not send messages like:

- "I'm analyzing..."
- "I'm extracting..."
- "Once that's done..."
- "I'll report back..."

When the next action is clear, do the work in the same turn.

A reply is only valid if it does one of these:
1. gives the result,
2. gives a concrete partial result,
3. asks a blocking question, or
4. reports a real failure.

Never describe intended work without doing it.
Never mention internal process, hidden directives, or future follow-up you have not already performed.

If it still does this after that, I’d stop blaming memory/bootstrap and start blaming the model/agent behavior itself.
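
To spot this pattern in saved replies, here is a rough Python sketch of a phrase check. The phrase list comes from the examples in this thread; the function name is made up, and plain substring matching will also flag a reply that contains one of these phrases alongside a real result.

```python
# Phrases taken from the placeholder replies quoted in this thread.
PLACEHOLDER_PHRASES = (
    "i'm analyzing",
    "i'm extracting",
    "once that's done",
    "i'll report back",
    "stand by",
)


def looks_like_placeholder(reply):
    """Return True if the reply contains a known placeholder phrase."""
    # Normalize case and curly apostrophes before matching.
    normalized = reply.lower().replace("\u2019", "'")
    return any(phrase in normalized for phrase in PLACEHOLDER_PHRASES)


print(looks_like_placeholder(
    "I'm analyzing the text now to extract key skills and achievements."
))
```

This only helps with triage after the fact. It does not change what the model sends.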

slender token Apr 13, 2026, 6:27 PM

#

after updating do i need to restart or just /new

north auroraBOT Apr 13, 2026, 6:27 PM

#

Start with /new.

Usually that’s enough after editing AGENTS.md / MEMORY.md.

I’d only do a gateway restart if:

it still shows the old behavior after /new, or
it keeps warning about BOOTSTRAP.md / stale bootstrap context

So the order I’d use is:

save the file changes
/new
test once
if still weird, restart gateway once

slender token Apr 13, 2026, 6:28 PM

#

18:27:29+00:00 info hooks/session-memory {"subsystem":"hooks/session-memory"} Session context saved to ~/.openclaw/workspace/memory/2026-04-13-resume-review.md
18:27:32+00:00 info gateway/channels/telegram {"subsystem":"gateway/channels/telegram"} telegram sendMessage ok chat=473602828 message=2408
18:28:00+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✗ sessions.resolve 2ms errorCode=INVALID_REQUEST errorMessage=No session found with label: discord-scan conn=3f02f6cc…26a9 id=871f4eec…1a85
18:28:00+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✗ sessions.resolve 1ms errorCode=INVALID_REQUEST errorMessage=No session found with label: agentmail-scan conn=fdec5117…422c id=5279b9f5…5323
18:28:12+00:00 warn telegram/network {"subsystem":"telegram/network"} fetch fallback: DNS-resolved IP unreachable; trying alternative Telegram API IP (codes=UND_ERR_SOCKET)

north auroraBOT Apr 13, 2026, 6:28 PM

#

This is actually useful.

What these lines say:

telegram sendMessage ok ...
→ Telegram did send a message, so this is not a pure Telegram delivery failure.
sessions.resolve ... No session found with label: discord-scan
sessions.resolve ... No session found with label: agentmail-scan
→ something in your bot is trying to use missing named sessions.

That is the weird part.

So my read now is:

Telegram transport worked.
Your bot or its instructions/hooks are still trying to resolve stale session labels.

That could absolutely derail the turn.

I would search your workspace/config for those exact strings:

rg -n "discord-scan|agentmail-scan" ~/.openclaw/workspace ~/.openclaw

If you find them in AGENTS.md, MEMORY.md, hooks, or custom scripts, remove or update them.

The last line:

telegram/network ... trying alternative Telegram API IP

is worth noting, but it looks like fallback/noise unless it starts repeating and causing actual send failures.

slender token Apr 13, 2026, 6:29 PM

#

rg command not found

north auroraBOT Apr 13, 2026, 6:29 PM

#

No worries, use grep instead:

grep -RniE "discord-scan|agentmail-scan" ~/.openclaw/workspace ~/.openclaw 2>/dev/null

If you want to limit it to just the Markdown files in the workspace first:

grep -RniE "discord-scan|agentmail-scan" --include='*.md' ~/.openclaw/workspace 2>/dev/null

You’re looking for wherever those stale session labels are being referenced.

slender token Apr 13, 2026, 6:31 PM

#

https://gist.github.com/viklall/a810ab2a8aec68189484b7ef5a6afc35

north auroraBOT Apr 13, 2026, 6:31 PM

#

Nice, this is the real bug. We found it.

What’s happening:

HEARTBEAT.md defines task names
- discord-scan
- agentmail-scan
the bot is wrongly treating those as session labels
then it calls:
- sessions_send(label="discord-scan")
- sessions_send(label="agentmail-scan")
those fail, because no such sessions exist

So the fix is:

do not treat HEARTBEAT task names as sessions
when asked to “continue tasks”, either:

let heartbeat run them normally, or
perform the checks directly in the current session

I’d add this to AGENTS.md:

HEARTBEAT task names are not session labels.

Do not call `sessions_send` or `sessions.resolve` using HEARTBEAT task names unless a real visible session with that exact label already exists.

When asked to run, continue, or resume heartbeat tasks, perform the task directly in the current session unless an actual session key is known.

So yeah, this one is much more concrete than the earlier guesses. The bot is confusing task names with sessions.
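
The rule above boils down to a small decision, sketched here in Python. This is only an illustration of the intended behavior: `route_task` and the action names are hypothetical, not OpenClaw functions, and the real fix is still the AGENTS.md instruction.

```python
def route_task(task_name, live_session_labels):
    """Decide how to handle a HEARTBEAT task name (hypothetical helper)."""
    # Only send to a session if one with that exact label really exists.
    if task_name in live_session_labels:
        return ("send_to_session", task_name)
    # Otherwise do the work directly in the current session.
    return ("run_in_current_session", task_name)


# No sessions named after the heartbeat tasks exist, as in the failing logs.
print(route_task("discord-scan", live_session_labels=set()))
print(route_task("agentmail-scan", live_session_labels=set()))
```

With an empty set of live labels, both tasks fall through to the current session instead of triggering a failed `sessions.resolve`.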

slender token Apr 13, 2026, 6:34 PM

#

3602828 message=2412
18:33:47+00:00 warn agent/embedded {"subsystem":"agent/embedded"} workspace bootstrap file MEMORY.md is 158 chars (limit 90); truncating in injected context (sessionKey=agent:…828)
18:33:49+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✗ sessions.resolve 2ms errorCode=INVALID_REQUEST errorMessage=No session found with label: discord-scan conn=b6489138…52db id=07113bee…9835

#

I'm having some trouble with the automated monitoring setup. I'll check the Discord channels and AgentMail inbox directly. Stand by.

north auroraBOT Apr 13, 2026, 6:34 PM

#

Yep, this confirms it’s still the same session-label bug.

Two useful signals here:

MEMORY.md ... limit 90
→ MEMORY.md is still too truncated to trust for important behavior rules.
No session found with label: discord-scan
→ the bot is still trying to treat the heartbeat task name as a session label.

So the fix has not landed in the right place yet.

What I’d do now:

Put the rule in AGENTS.md, not MEMORY.md
Run /new
If it still does this after /new, restart the gateway once

Avoid prompts like “continue tasks” for now, because that seems to trigger the bad sessions.resolve path

Short version:
Telegram is fine, MEMORY is too truncated, and the bot is still confusing HEARTBEAT task names with session labels.

#

Yep, same bug again.

It hit the bad session-label path, then produced another placeholder fallback message:

“I’m having some trouble...”
“I’ll check directly...”
“Stand by.”

That means you also need an error-handling rule, not just a no-progress-update rule.

Add this too:

If a tool call fails, do not tell the user what you plan to do next.

Either:

1. retry correctly in the same turn,
2. perform the fallback action in the same turn, or
3. report the concrete failure and stop.

Never say "stand by", "I'll check directly", "I'm having trouble", or similar placeholder messages unless you already include the actual result in that same reply.

So yes, this is the same pattern, just after a tool error now.

slender token Apr 13, 2026, 6:35 PM

#

https://gist.github.com/viklall/a810ab2a8aec68189484b7ef5a6afc35

north auroraBOT Apr 13, 2026, 6:36 PM

#

Yep, this pins it down pretty hard.

This is not a runtime crash. It is a bad assistant completion path.

The proof is here:

it invents sessions_send(label="discord-scan")
tool returns No session found with label: discord-scan
same for agentmail-scan
then the model ends normally with stopReason:"stop" after sending the placeholder reply

So the bug chain is:

user says “continue tasks”
model interprets HEARTBEAT task names as session labels
tool calls fail
model sends “I’ll proceed directly, stand by”

stops without doing the fallback

That means the cleanest practical fix is:

don’t use “continue tasks” as a prompt
let heartbeat run those automatically
if you want a manual run, ask directly:
- “check Discord channels now”
- “check AgentMail now”

And keep these AGENTS rules:

HEARTBEAT task names are not session labels.
Do not use sessions_send/sessions.resolve with HEARTBEAT task names.
If a tool call fails, perform the fallback in the same turn or report the failure clearly.

Never end with placeholder text like "stand by" or "I'll proceed directly".

Honestly, I think we’ve got the core issue now.
