#latest version had a hang - Bot is Lazy bug


slender token
old socketBOT
sage steepleBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

north auroraBOT
#

Yep, I think this is the same failure mode.

From your gist, the important part is this, not the Discord reconnect noise:

  • 503 ... model is currently experiencing high demand
  • then No available auth profile for google (all in cooldown or unavailable)
  • then Embedded agent failed before reply

So the bot is not really "thinking forever". Google is returning overloads, OpenClaw cools that profile down, and because your fallback chain is also Google-only, it runs out of places to fail over to.

Why it feels silent:

  • typing/working starts
  • provider fails mid-turn
  • on your setup it is not surfacing the final error cleanly, which does look like a bug

Workaround for now:

  1. Add at least one fallback on a different provider that you already have auth for.
  2. Run openclaw models status --probe to confirm which providers are actually usable.
  3. Keep openclaw logs --follow open while testing.

Since this is latest stable and it still goes silent instead of replying with the overload error, please file this on GitHub with that gist attached: https://github.com/openclaw/openclaw/issues

If you want, paste your agents.defaults.model block and I can sanity check the fallback setup.

slender token
#

google recommended:

"agents": {
  "defaults": {
    "model": {
      "primary": "google/gemini-3.1-flash",
      "fallbacks": [
        "google/gemini-3.1-pro-preview",
        "google/gemini-3.1-flash-lite-preview"
      ]
    },
    "models": {
      "google/gemini-3.1-flash": {},
      "google/gemini-3.1-pro-preview": {},
      "google/gemini-3.1-flash-lite-preview": {},
      "google/gemini-2.5-flash": {}
    },

north auroraBOT
#

Yep, that's the catch.

That config is Google-only failover, not provider failover.

So if Google itself is overloaded, this happens:

  • primary Google model fails
  • fallback Google model fails
  • next Google model fails
  • bot has nowhere else to go

Also, in the snippet you pasted:

"fallbacks": [
  "google/gemini-3.1-pro-preview",
  "google/gemini-3.1-flash-lite-preview"
],
"models": {
  "google/gemini-3.1-flash": {},
  "google/gemini-3.1-pro-preview": {},
  "google/gemini-3.1-flash-lite-preview": {},
  "google/gemini-2.5-flash": {}
}

google/gemini-2.5-flash is only in models, not in fallbacks, so it is allowed, but not automatically used as a fallback.

So, short version:

  • the recommendation is fine for "use Google"
  • it is not resilient to Google-wide 503/high-demand events
  • the silent part still smells like a bug
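
One quick way to check whether a fallback chain actually crosses providers is to compare the provider prefix of each model ID. This is a minimal Python sketch, assuming model IDs follow the provider/model form shown in the pasted config. The function names are made up for illustration and are not part of OpenClaw.

```python
def providers_in_chain(primary, fallbacks):
    """Return the set of provider prefixes used by a primary + fallback chain.

    Assumes model IDs look like "provider/model", as in the pasted config.
    """
    chain = [primary, *fallbacks]
    return {model_id.split("/", 1)[0] for model_id in chain}


def is_single_provider(primary, fallbacks):
    """True when every model in the chain belongs to the same provider."""
    return len(providers_in_chain(primary, fallbacks)) == 1


# Values copied from the config pasted earlier in this thread.
primary = "google/gemini-3.1-flash"
fallbacks = [
    "google/gemini-3.1-pro-preview",
    "google/gemini-3.1-flash-lite-preview",
]

if is_single_provider(primary, fallbacks):
    print("All models share one provider; a provider-wide outage exhausts the chain.")
```

If the check reports a single provider, adding one fallback from a different provider you already have auth for is what makes the chain resilient.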
slender token
#

thought those were earlier

north auroraBOT
#

Yeah, fair catch, I jumped too fast there.

If those overload lines are from earlier, then I would not pin this exact hang on Google yet.

What the gist proves so far is:

  • there is at least one real provider-failure path in the log
  • but it does not prove that the latest silent turn was caused by that same thing

To nail the current one down, I need the log slice for the actual minute of the silent reply, ideally:

  • ~30 to 60 lines before the user message
  • the run itself
  • ~30 to 60 lines after

Best next step:

openclaw logs --follow

Then reproduce once and paste the section around that timestamp.

If you want, also paste:

openclaw status --all

Right now my best guess is still "reply-path bug", but I don't want to bullshit you on the exact trigger.

slender token
#

The hang was at 10:35 AM PST.

north auroraBOT
#

Got it.

10:35 Pacific today is 17:35 UTC in the log.

I checked the gist again, and I don't actually see the 17:35 UTC window in there. The closest Discord transport events I can see are:

  • 15:37:17 stale-socket restart
  • 16:12:17 stale-socket restart
  • 17:17:17 stale-socket restart

So I think the exact hang window is missing from that paste.

Can you grab just the slice around 17:34 to 17:36 UTC from the raw log file?

grep -n '17:3[4-6]:' /tmp/openclaw/openclaw-2026-04-13.log

or just open the file and paste ~100 lines around that time.

That should tell us whether this was:

  1. another model failure,
  2. a Discord reconnect/drop,
  3. or the actual "went silent after saying it's working" bug.
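
If grep is awkward to use, the same time-window slice can be sketched in Python. This assumes each log line carries an HH:MM:SS timestamp like the console output pasted in this thread; the raw log file may use a different layout. The helper name `slice_by_time` is hypothetical.

```python
import re
from datetime import time

# Assumption: each log line carries an HH:MM:SS timestamp, as in the console
# output pasted in this thread (for example "18:11:03+00:00 warn ...").
TIMESTAMP = re.compile(r"(?<!\d)(\d{2}):(\d{2}):(\d{2})(?!\d)")


def slice_by_time(lines, start, end):
    """Return (line_number, line) pairs whose first timestamp is within [start, end]."""
    selected = []
    for number, line in enumerate(lines, start=1):
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # no timestamp on this line
        try:
            stamp = time(*(int(part) for part in match.groups()))
        except ValueError:
            continue  # digits that are not a valid clock time
        if start <= stamp <= end:
            selected.append((number, line.rstrip("\n")))
    return selected


# Illustrative lines only; they are placeholders, not real log output.
sample = [
    "17:22:01+00:00 info placeholder line before the window",
    "17:34:10+00:00 info placeholder line inside the window",
    "17:36:59+00:00 warn placeholder line at the end of the window",
    "17:40:00+00:00 info placeholder line after the window",
]

for number, line in slice_by_time(sample, time(17, 34), time(17, 36, 59)):
    print(f"{number}: {line}")
```

To run it against the real file, pass the open file object for /tmp/openclaw/openclaw-2026-04-13.log instead of `sample`.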
slender token
#

grep didn't return anything. What is the log filename so I can get it?

north auroraBOT
#

From your gist, it was:

/tmp/openclaw/openclaw-2026-04-13.log

If you want to confirm the latest one on disk:

ls -t /tmp/openclaw/openclaw-*.log | head -1

Then grab the time slice from that file, for example:

grep -n '17:3[4-6]:' /tmp/openclaw/openclaw-2026-04-13.log

or just:

sed -n '1,260p' /tmp/openclaw/openclaw-2026-04-13.log

and paste the section around the hang time.

north auroraBOT
#

I checked the raw gist, and it still ends at 17:22:01 UTC.

So the hang window is not in that gist yet.

If you meant:

  • 10:35 PDT → look around 17:35 UTC
  • 10:35 PST literally → look around 18:35 UTC

So please grab a later slice, for example:

tail -n 300 /tmp/openclaw/openclaw-2026-04-13.log

or more targeted:

grep -n '17:3[0-9]:\|18:3[0-9]:' /tmp/openclaw/openclaw-2026-04-13.log

Right now the latest thing in the gist is just:

  • Discord stale-socket restart at 17:17:17
  • Telegram sends through 17:22:01

That is still too early to explain a 10:35 Pacific hang.
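
The PDT/PST difference can be double-checked with a few lines of Python. This sketch uses fixed UTC offsets (PDT = UTC-7, PST = UTC-8) rather than a timezone database, and takes the date from the log filename.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: PDT is UTC-7 (daylight time), PST is UTC-8 (standard time).
PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")


def to_utc(year, month, day, hour, minute, tz):
    """Convert a wall-clock time in the given fixed-offset zone to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=tz)
    return local.astimezone(timezone.utc)


# Date taken from the log filename (openclaw-2026-04-13.log).
print(to_utc(2026, 4, 13, 10, 35, PDT).strftime("%H:%M UTC"))  # 17:35 UTC
print(to_utc(2026, 4, 13, 10, 35, PST).strftime("%H:%M UTC"))  # 18:35 UTC
```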

slender token
#

There was nothing to share, it was quiet.

north auroraBOT
#

That actually helps.

If the hang window was completely quiet in the log, then this looks less like a model failure and more like a Discord transport/liveness bug.

Why I say that:

  • the Google overloads we saw earlier are noisy in logs
  • this hang had no corresponding run/error lines
  • your gist does show repeated Discord health restarts with reason: stale-socket

So my current best guess is:

  • Discord side got wedged
  • the bot looked "busy"/alive
  • but the turn never fully executed or never made it onto the normal reply path

That is definitely bug-report material. I'd file it with these notes:

  • latest stable
  • bot showed it was working, then went silent
  • no logs at all for the hang window
  • recurring health-monitor: restarting (reason: stale-socket) on Discord

GitHub issues: https://github.com/openclaw/openclaw/issues

If you want one extra datapoint before filing, run this right after it happens:

openclaw channels status --probe

That should help confirm whether Discord was connected but unhealthy.

slender token
#

This was in Telegram. Stop assuming Discord; I never mentioned it.

north auroraBOT
#

You're right, my bad. I assumed the current thread surface instead of the surface that actually hung.

If this was Telegram, then the Discord stale-socket stuff is irrelevant.

The most suspicious Telegram-specific line in your gist is actually this:

  • 409: Conflict: terminated by other getUpdates request; make sure that only one bot instance is running

That means something else is polling the same Telegram bot token. If two consumers are hitting getUpdates, Telegram gets weird fast.

So for Telegram, my revised read is:

  1. First thing to fix/check: make sure only one OpenClaw instance, container, or other bot process is using that Telegram token.
  2. The later telegram sendMessage ok ... lines show outbound can work, so this is probably not a simple "can't send" issue.
  3. If the hang window was quiet in logs, that can fit a Telegram polling/consumer conflict better than the earlier Google/Discord theories.

So yes, ignore my Discord guess. For Telegram, I would focus on the getUpdates conflict first.
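
To see how often the conflict shows up, a log can be scanned for the error text quoted above. This is a minimal sketch; the helper name `find_polling_conflicts` is hypothetical, and it only matches the literal message from this thread.

```python
# Error text as quoted from the gist in this thread.
CONFLICT_MARKER = "terminated by other getUpdates request"


def find_polling_conflicts(lines):
    """Return (line_number, line) pairs that mention the getUpdates conflict."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if CONFLICT_MARKER in line
    ]


# First line is the message quoted in this thread; the second is a placeholder.
sample = [
    "409: Conflict: terminated by other getUpdates request; "
    "make sure that only one bot instance is running",
    "telegram sendMessage ok (placeholder line)",
]

for number, line in find_polling_conflicts(sample):
    print(f"{number}: {line}")
```

If the count keeps growing after the old instance is gone, something else is still polling with the same token.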

slender token
#

I terminated an OpenClaw instance and moved to a new machine. The old machine was terminated too. Anything to clean up?

north auroraBOT
#

If the old machine is truly dead, then usually no Telegram-side cleanup is needed for long polling.

The getUpdates conflict should stop as soon as there is only one live consumer for that bot token.

So the cleanup checklist is mostly:

  1. Make sure only one OpenClaw service/process is running now

    • no second VM/container
    • no old systemd service on the new box
    • no other app using the same Telegram bot token
  2. Restart the gateway once on the new machine

  3. Check channel health

    openclaw channels status --probe

  4. If the 409 getUpdates conflict still appears, then something is still using that token.
    • at that point, the cleanest reset is to rotate the Telegram bot token in BotFather and update the new machine

Only extra caveat: if you had configured Telegram webhook mode (webhookUrl/webhookPath), then you would also want to repoint or clear that. But your log shows getUpdates, so this looks like polling, not webhook mode.

So, short answer: probably nothing to clean up remotely, unless the conflict keeps happening.

slender token
#

ok so should i wake my bot up by asking hello

north auroraBOT
#

Yep, that's the right quick test.

I'd do:

  1. restart the gateway once
  2. run:
    openclaw logs --follow
    
  3. then send your bot a simple hello on Telegram

What you want to see:

  • no more 409 getUpdates conflict
  • inbound Telegram activity in logs
  • a normal reply

If it still goes weird after that, paste the fresh log slice from that exact hello test.

slender token
#

it said

I'm currently performing the web searches to confirm the feasibility and best approach for two-way voice integration with OpenClaw and Telegram. This task aligns with our "Search before building" directive.

I'll provide you with a summary of my findings as soon as the search is complete. ⚡️

north auroraBOT
#

Ah, that's a much better clue.

That means this is not a Telegram wakeup problem. The bot did wake up and reply.

The bug is higher up:

  • it sent a meta progress update
  • referenced its own internal directive ("Search before building")
  • then apparently never completed the actual work

So this looks more like a plan-only / status-only agent turn than a transport failure.

What I'd check next:

  1. after that message, do the logs show any actual web_search tool activity?
  2. if not, the model narrated intent instead of acting

Practical fix/workaround:

  • tighten your agent instructions with something like:
    • do the search in the same turn
    • do not tell the user you're about to search
    • only send progress updates after starting real work

If you can reproduce it, I'd file this as a bug with that exact message text, because "This task aligns with our Search before building directive" is pretty clearly leaking internal reasoning/plan language.

slender token
#

tell me the instruction to give the bot

north auroraBOT
#

Yep, I'd give it this, pretty much verbatim:

## Act, don't narrate

When the next step is clear, do the work in the same turn.

- Do not tell the user you are about to search, inspect, verify, or think.
- Perform the tool call first, then reply with the result.
- Do not send plan-only or status-only messages such as:
  - "I'm currently searching..."
  - "I'll report back once complete."
  - "This aligns with our directive..."
- Never mention internal directives, hidden instructions, or your reasoning process.
- Only send a progress update if real work has already started and the task is taking unusually long.
- If a tool call fails, say that clearly instead of going silent.
- A turn is not complete until you either:
  1. provide the result,
  2. report the concrete failure, or
  3. ask a blocking question.

If you want, I can also give you a shorter harsher version tuned specifically to stop this exact behavior.

slender token
#

v3:~$ 18:11:03+00:00 warn agent/embedded {"subsystem":"agent/embedded"} workspace bootstrap file MEMORY.md is 737 chars (limit 311); truncating in injected context (sessionKey=agent:…828)

north auroraBOT
#

Yep, that line matters.

It means MEMORY.md is not fully reaching the model. Only the first 311 chars made it into injected context.

So:

  • this is not the hang itself
  • but it can absolutely cause weird behavior
  • especially if your important rules are in MEMORY.md

Big practical takeaway:

  • put behavior rules like "act, don't narrate" in AGENTS.md
  • keep MEMORY.md for short durable facts only
  • trim or split oversized bootstrap files

Why the limit is weirdly low:

  • bootstrapMaxChars defaults to 20000
  • but the runner also has a total bootstrap budget
  • by the time it got to MEMORY.md, only 311 chars were left

So yes, if you added the instruction there, the bot may never have seen most of it.
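
To make the arithmetic concrete, here is a rough Python sketch of how a shared budget could shrink the limit for a later file. This is not OpenClaw's actual accounting: the total budget, the file order, and the sizes of the other files are invented so that 311 chars remain for MEMORY.md. Only the 20000 per-file default and the 737/311 figures come from this thread.

```python
def allocate_bootstrap(files, per_file_max, total_budget):
    """Give each file min(size, per_file_max, remaining budget) chars, in order."""
    remaining = total_budget
    allocation = {}
    for name, size in files:
        limit = min(per_file_max, remaining)  # effective limit for this file
        injected = min(size, limit)           # chars that actually fit
        allocation[name] = {"size": size, "limit": limit, "injected": injected}
        remaining -= injected
    return allocation


# Hypothetical sizes and total budget, chosen only to reproduce "limit 311".
files = [("AGENTS.md", 12000), ("BOOTSTRAP.md", 11689), ("MEMORY.md", 737)]
result = allocate_bootstrap(files, per_file_max=20000, total_budget=24000)

for name, info in result.items():
    print(name, info)
```

Under these made-up numbers, the earlier files use 23689 of the 24000 chars, so MEMORY.md is capped at 311 even though it is far below the per-file maximum.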

slender token
#

This is all that's in memory, pretty small.
What's the syntax to remove the Action Required line? Just remove it entirely?

MEMORY.md
Discord Integration
Status: Active outbound posting capability via post_hi.js.

Bot Bridge: discord_bridge.js (running in background session neat-wharf) listens for inbound Discord messages
from all channels and echoes responses back to the originating channel.

Expectations: The bot must actively monitor all Discord channels and provide direct, conversational replies in the channel just as if the conversation were occurring in this Telegram chat.

Security: Discord Bot Token is stored in ~/.openclaw/workspace/.env.

Action Required: The current Discord token is compromised due to disclosure in session history; token must be regenerated and updated in .env as soon as possible.

north auroraBOT
#

There’s no special syntax. MEMORY.md is just plain markdown/text.

If you want that line gone, literally delete it.

For example, change this:

Action Required: The current Discord token is compromised due to disclosure in session history; token must be regenerated and updated in .env as soon as possible.

to nothing.

I’d also shorten the whole file a bit, like this:

## Discord Integration

- Outbound posting active via `post_hi.js`
- Bridge: `discord_bridge.js`
- Expectation: reply directly in Discord channels

Also, I would not keep token-location/security reminders in MEMORY.md unless the bot truly needs them every turn. That kind of stuff just burns bootstrap budget.

slender token
#

OK, after updating the memory file, do we need to restart?

north auroraBOT
#

Usually no full restart.

In normal setups, the next turn should pick up the updated file automatically.

Two caveats:

  • if contextInjection is set to continuation-skip, an in-progress/safe continuation turn may not re-read bootstrap files immediately
  • if you want to be sure, start a fresh session (/new) or just send a new clean prompt after the edit

So, short answer:

  • Gateway restart: usually not needed
  • Fresh session/new turn: safest way to force the new memory/instructions to apply
north auroraBOT
#

Yep, that is the smoking gun.

Your agent still has a live BOOTSTRAP.md, and it’s the generic first-run “who am I?” bootstrap file.

That should be one-time only. If it sticks around, it can absolutely hijack behavior and make the agent act weird.

What to do:

  1. Delete or move aside
    ~/.openclaw/workspace/BOOTSTRAP.md
    
  2. Start a fresh session (/new)
  3. If it still warns about BOOTSTRAP.md, restart the gateway once

So, yes, I’d treat this as the main fix right now.

slender token
#

18:22:40+00:00 warn Warning: TT: undefined function: 21
18:22:40+00:00 warn Warning: TT: undefined function: 21

north auroraBOT
#

That one is probably a red herring.

Warning: TT: undefined function: 21 is usually a PDF/font parsing warning (TT = TrueType), the kind of thing pdf.js and similar parsers emit when they hit a weird embedded font.

So:

  • it is not a classic OpenClaw hang signature
  • it usually means some PDF/document parsing path touched a funky font
  • by itself, it normally does not explain the “said it was working, then went silent” bug

I would treat it as noise unless your bot was specifically reading a PDF or document at that moment. The more important clues so far are still:

  • stale bootstrap file
  • plan/status-only reply behavior
  • earlier Telegram polling conflict
slender token
#

OK, the bot hung again after the /new:

I'm analyzing the text now to extract key skills and achievements. Once that's done, we can discuss how to best optimize it for your target roles.

north auroraBOT
#

Yep, that confirms it.

After /new, this is still the same progress-only failure mode:

  • “I’m analyzing…”
  • “Once that’s done…”
  • then no actual completion

So this is not just stale session state anymore. It’s the bot producing placeholder progress messages instead of doing the task.

I’d replace your instruction with a harsher one in AGENTS.md:

## No placeholder progress replies

Do not send messages like:
- "I'm analyzing..."
- "I'm extracting..."
- "Once that's done..."
- "I'll report back..."

When the next action is clear, do the work in the same turn.

A reply is only valid if it does one of these:
1. gives the result,
2. gives a concrete partial result,
3. asks a blocking question, or
4. reports a real failure.

Never describe intended work without doing it.
Never mention internal process, hidden directives, or future follow-up you have not already performed.

If it still does this after that, I’d stop blaming memory/bootstrap and start blaming the model/agent behavior itself.
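
To spot these replies when reviewing transcripts, a rough phrase check can help. This is only a heuristic sketch: the patterns come from the placeholder replies quoted in this thread, the helper name `looks_like_placeholder` is hypothetical, and a reply that contains a real result plus one of these phrases would also be flagged.

```python
import re

# Phrases taken from the placeholder replies quoted in this thread.
PLACEHOLDER_PATTERNS = [
    r"\bI['’]?m (?:currently )?(?:analyzing|searching|extracting|performing)\b",
    r"\bonce that['’]?s done\b",
    r"\bI['’]?ll (?:report back|provide you with)\b",
]
_COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in PLACEHOLDER_PATTERNS]


def looks_like_placeholder(reply):
    """Return True if the reply contains any known placeholder phrase."""
    return any(pattern.search(reply) for pattern in _COMPILED)


reply = (
    "I'm analyzing the text now to extract key skills and achievements. "
    "Once that's done, we can discuss how to best optimize it for your target roles."
)
print(looks_like_placeholder(reply))
```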

slender token
#

After updating, do I need to restart or just /new?

north auroraBOT
#

Start with /new.

Usually that’s enough after editing AGENTS.md / MEMORY.md.

I’d only do a gateway restart if:

  • it still shows the old behavior after /new, or
  • it keeps warning about BOOTSTRAP.md / stale bootstrap context

So the order I’d use is:

  1. save the file changes
  2. /new
  3. test once
  4. if still weird, restart gateway once
slender token
#

18:27:29+00:00 info hooks/session-memory {"subsystem":"hooks/session-memory"} Session context saved to ~/.openclaw/workspace/memory/2026-04-13-resume-review.md
18:27:32+00:00 info gateway/channels/telegram {"subsystem":"gateway/channels/telegram"} telegram sendMessage ok chat=473602828 message=2408
18:28:00+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✗ sessions.resolve 2ms errorCode=INVALID_REQUEST errorMessage=No session found with label: discord-scan conn=3f02f6cc…26a9 id=871f4eec…1a85
18:28:00+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✗ sessions.resolve 1ms errorCode=INVALID_REQUEST errorMessage=No session found with label: agentmail-scan conn=fdec5117…422c id=5279b9f5…5323
18:28:12+00:00 warn telegram/network {"subsystem":"telegram/network"} fetch fallback: DNS-resolved IP unreachable; trying alternative Telegram API IP (codes=UND_ERR_SOCKET)

north auroraBOT
#

This is actually useful.

What these lines say:

  • telegram sendMessage ok ...
    → Telegram did send a message, so this is not a pure Telegram delivery failure.

  • sessions.resolve ... No session found with label: discord-scan

  • sessions.resolve ... No session found with label: agentmail-scan
    → something in your bot is trying to use missing named sessions.

That is the weird part.

So my read now is:

  1. Telegram transport worked.
  2. Your bot or its instructions/hooks are still trying to resolve stale session labels.
  3. That could absolutely derail the turn.

I would search your workspace/config for those exact strings:

rg -n "discord-scan|agentmail-scan" ~/.openclaw/workspace ~/.openclaw

If you find them in AGENTS.md, MEMORY.md, hooks, or custom scripts, remove or update them.

The last line:

  • telegram/network ... trying alternative Telegram API IP

is worth noting, but it looks like fallback/noise unless it starts repeating and causing actual send failures.
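
To list every label the gateway reported as missing, the error text can be pulled out of the log lines. A minimal Python sketch, assuming the message format matches the two lines pasted above; the helper name `missing_session_labels` is hypothetical.

```python
import re

# Matches the error text seen in the pasted gateway/ws lines.
MISSING_LABEL = re.compile(r"No session found with label: (\S+)")


def missing_session_labels(lines):
    """Return the distinct labels reported as missing, in first-seen order."""
    seen = []
    for line in lines:
        match = MISSING_LABEL.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Lines shortened from the log paste in this thread.
sample = [
    "18:28:00+00:00 info gateway/ws errorMessage=No session found with label: discord-scan conn=3f02f6cc…26a9",
    "18:28:00+00:00 info gateway/ws errorMessage=No session found with label: agentmail-scan conn=fdec5117…422c",
    "18:28:12+00:00 warn telegram/network fetch fallback",
]

print(missing_session_labels(sample))
```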

slender token
#

rg command not found

north auroraBOT
#

No worries, use grep instead:

grep -RniE "discord-scan|agentmail-scan" ~/.openclaw/workspace ~/.openclaw 2>/dev/null

If you want to limit it just to likely text files first:

grep -HniE "discord-scan|agentmail-scan" ~/.openclaw/workspace/*.md 2>/dev/null

You’re looking for wherever those stale session labels are being referenced.
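
If neither rg nor a suitable grep is available, the same recursive search can be sketched with the Python standard library. The function name `find_references` is hypothetical; it reads every regular file under the given directory as text and ignores files it cannot read.

```python
import re
from pathlib import Path

# Same pattern as the grep command, matched case-insensitively.
PATTERN = re.compile(r"discord-scan|agentmail-scan", re.IGNORECASE)


def find_references(root):
    """Return (path, line_number, line) for every matching line under root."""
    hits = []
    for path in sorted(Path(root).expanduser().rglob("*")):
        try:
            if not path.is_file():
                continue
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable entry; skip it, like grep with 2>/dev/null
        for number, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((path, number, line.strip()))
    return hits
```

Calling `find_references("~/.openclaw")` and printing each tuple gives roughly the same information as the grep output: file, line number, and the matching line.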

north auroraBOT
#

Nice, this is the real bug. We found it.

What’s happening:

  • HEARTBEAT.md defines task names
    • discord-scan
    • agentmail-scan
  • the bot is wrongly treating those as session labels
  • then it calls:
    • sessions_send(label="discord-scan")
    • sessions_send(label="agentmail-scan")
  • those fail, because no such sessions exist

So the fix is:

  • do not treat HEARTBEAT task names as sessions
  • when asked to “continue tasks”, either:
  1. let heartbeat run them normally, or
  2. perform the checks directly in the current session

I’d add this to AGENTS.md:

HEARTBEAT task names are not session labels.

Do not call `sessions_send` or `sessions.resolve` using HEARTBEAT task names unless a real visible session with that exact label already exists.

When asked to run, continue, or resume heartbeat tasks, perform the task directly in the current session unless an actual session key is known.

So yeah, this one is much more concrete than the earlier guesses. The bot is confusing task names with sessions.

slender token
#

3602828 message=2412
18:33:47+00:00 warn agent/embedded {"subsystem":"agent/embedded"} workspace bootstrap file MEMORY.md is 158 chars (limit 90); truncating in injected context (sessionKey=agent:…828)
18:33:49+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✗ sessions.resolve 2ms errorCode=INVALID_REQUEST errorMessage=No session found with label: discord-scan conn=b6489138…52db id=07113bee…9835

#

I'm having some trouble with the automated monitoring setup. I'll check the Discord channels and AgentMail inbox directly. Stand by.

north auroraBOT
#

Yep, this confirms it’s still the same session-label bug.

Two useful signals here:

  • MEMORY.md ... limit 90
    → MEMORY.md is still too truncated to trust for important behavior rules.

  • No session found with label: discord-scan
    → the bot is still trying to treat the heartbeat task name as a session label.

So the fix has not landed in the right place yet.

What I’d do now:

  1. Put the rule in AGENTS.md, not MEMORY.md
  2. Run /new
  3. If it still does this after /new, restart the gateway once
  4. Avoid prompts like “continue tasks” for now, because that seems to trigger the bad sessions.resolve path

Short version:
Telegram is fine, MEMORY is too truncated, and the bot is still confusing HEARTBEAT task names with session labels.

#

Yep, same bug again.

It hit the bad session-label path, then produced another placeholder fallback message:

  • “I’m having some trouble...”
  • “I’ll check directly...”
  • “Stand by.”

That means you also need an error-handling rule, not just a no-progress-update rule.

Add this too:

If a tool call fails, do not tell the user what you plan to do next.

Either:
1. retry correctly in the same turn,
2. perform the fallback action in the same turn, or
3. report the concrete failure and stop.

Never say "stand by", "I'll check directly", "I'm having trouble", or similar placeholder messages unless you already include the actual result in that same reply.

So yes, this is the same pattern, just after a tool error now.

north auroraBOT
#

Yep, this pins it down pretty hard.

This is not a runtime crash. It is a bad assistant completion path.

The proof is here:

  • it invents sessions_send(label="discord-scan")
  • tool returns No session found with label: discord-scan
  • same for agentmail-scan
  • then the model ends normally with stopReason:"stop" after sending the placeholder reply

So the bug chain is:

  1. user says “continue tasks”
  2. model interprets HEARTBEAT task names as session labels
  3. tool calls fail
  4. model sends “I’ll proceed directly, stand by”
  5. stops without doing the fallback

That means the cleanest practical fix is:

  • don’t use “continue tasks” as a prompt
  • let heartbeat run those automatically
  • if you want a manual run, ask directly:
    • “check Discord channels now”
    • “check AgentMail now”

And keep these AGENTS rules:

HEARTBEAT task names are not session labels.
Do not use sessions_send/sessions.resolve with HEARTBEAT task names.
If a tool call fails, perform the fallback in the same turn or report the failure clearly.
Never end with placeholder text like "stand by" or "I'll proceed directly".

Honestly, I think we’ve got the core issue now.