#Getting ? for context amount back from models. Seems to lead directly to context overflow

cold aurora
#

Getting a ? for /status

🦞 OpenClaw 2026.5.6 (c97b9f7)
⏱️ Uptime: gateway 7m 42s · system 2d 22h
🧠 Model: studio-llamacpp/unsloth/MiniMax-M2.7-GGUF:Q4_K_M · 🔑 api-key (models.json)
📚 Context: ?/75k · 🧹 Compactions: 0

There seems to be something poisonous about this. One agent had it and then couldn't get rid of it (something in history?). Then, while I was debugging it, another agent got it. The model that originally started this was gemma4. I am self hosting via llama.cpp with

And it just migrated to another new model

OpenClaw 2026.5.6 (c97b9f7)
⏱️ Uptime: gateway 7m 49s · system 2d 22h
🧠 Model: lmstudio/lmstudio-community/Qwen3-8B-MLX-4bit · 🔑 api-key (models.json)
📚 Context: ?/75k · 🧹 Compactions: 2

I am running the latest llama.cpp. Claude wants to blame reasoning flags, but that doesn't seem to be it.

undone spearBOT
#

That ? means OpenClaw does not currently have a fresh prompt-token usage number for that session. It still knows the configured cap (75k), but the left side is unknown, so /status prints ?/75k.

I don’t think Claude’s “reasoning flags” theory fits this. From the current code path, this is usually one of:

  • the local OpenAI-compatible server didn’t return usable usage.prompt_tokens / final stream usage
  • the previous run had no usable usage, so the session store marks totalTokensFresh: false
  • after switching models in the same session/agent, that unknown state can remain until a run returns fresh usage again

OpenClaw does send stream_options: { include_usage: true } for OpenAI-compatible streaming, including local backends, so I’d look at what llama.cpp / LM Studio is actually returning in the final stream chunk. If there’s no usage block, /status can’t compute the left side.

The “poisonous history” part is probably not text content poisoning the model; it’s more likely stale/unknown token accounting plus compaction trying to operate without a reliable prompt-size snapshot. Starting a genuinely new session should clear the session-state side, but if the backend still doesn’t return usage, the ? will come back on the next run.

Two useful checks:

openclaw status --all
openclaw logs --tail 200

And if you can paste the llama.cpp launch command you meant after “self hosting via llama.cpp with …”, that’s the missing piece. I’d especially want to know the actual --ctx-size and whether this is direct llama.cpp, LM Studio, or llama.cpp behind an OpenAI-compatible proxy.

Docs: https://docs.openclaw.ai/help/faq and compaction notes: https://docs.openclaw.ai/concepts/compaction

cold aurora
#

Oh, I am so ready for you

Models.ini

[default]
ngl = 99
flash-attn = auto
slot-save-path = /Volumes/External_KV/llama/cache

[unsloth/MiniMax-M2.7-GGUF:Q4_K_M]
ctx-size = 75000
parallel = 1
predict = -1
batch-size = 2048
ubatch-size = 512
flash-attn = on
jinja = true

[unsloth/Qwen3-VL-235B-A22B-Thinking-GGUF:Q4_K_XL]
ctx-size = 102400
parallel = 1
predict = -1
batch-size = 2048
ubatch-size = 512
flash-attn = on
jinja = true

[lmstudio-community/gemma-4-31B-it-GGUF:Q4_K_M]
hf-repo = lmstudio-community/gemma-4-31B-it-GGUF
hf-file = gemma-4-31B-it-Q4_K_M.gguf
ctx-size = 120000
parallel = 2
cache-type-k = q8_0
cache-type-v = q8_0
repeat-penalty = 1.3
repeat-last-n = -1
dry-multiplier = 2.0
predict = -1
batch-size = 1024
ubatch-size = 256
flash-attn = on
jinja = true

LLAMA_BIN=/opt/homebrew/bin/llama-server
MODELS_DIR=/Users/name/.cache/huggingface/hub
MODELS_PRESET=/Users/name/models.ini
HOST=0.0.0.0
PORT=8080
LOG=/tmp/llama.log
SCREEN_NAME=llama

echo "Starting llama-server on port $PORT with models-preset $MODELS_PRESET..."

if screen -list | grep -q "$SCREEN_NAME"; then
    screen -S $SCREEN_NAME -X stuff "$LLAMA_BIN --models-dir $MODELS_DIR --models-preset $MODELS_PRESET --host $HOST --port $PORT --log-file $LOG\n"
else
    screen -dmS $SCREEN_NAME $LLAMA_BIN --models-dir $MODELS_DIR --models-preset $MODELS_PRESET --host $HOST --port $PORT --log-file $LOG
fi

undone spearBOT
#

Oh this is a good config dump. Two separate things jump out.

First: the ? in /status is still almost certainly missing/freshness-failed usage accounting. OpenClaw’s OpenAI-compatible transport expects streaming usage from the backend. It sends stream_options: { include_usage: true } and then looks for usage.prompt_tokens / usage.completion_tokens on the stream chunk. If llama.cpp/LM Studio doesn’t return that final usage, OpenClaw can know the right-side window (75k) but not the current left-side prompt size, so ?/75k.

Quick proof:

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"unsloth/MiniMax-M2.7-GGUF:Q4_K_M",
    "messages":[{"role":"user","content":"say hi"}],
    "stream":true,
    "stream_options":{"include_usage":true}
  }'

At the end you want to see a chunk with something like:

"usage":{"prompt_tokens":...,"completion_tokens":...,"total_tokens":...}

If not, that explains the ?.
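
If the stream is long and the last chunk is hard to spot by eye, a small script can pull it out. This is a minimal sketch, not anything OpenClaw ships: `last_usage` is a hypothetical helper name, and the sample stream below is made-up data in the OpenAI-compatible SSE shape, not output from your server.

```python
import json

def last_usage(sse_text):
    """Return the last non-empty "usage" object in an SSE body, or None."""
    usage = None
    for line in sse_text.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON chunk
        if isinstance(chunk, dict) and chunk.get("usage"):
            usage = chunk["usage"]
    return usage

# Made-up sample stream: two content chunks, then a usage-only chunk.
sample = "\n".join([
    'data: {"choices":[{"delta":{"content":"hi"},"finish_reason":null}]}',
    'data: {"choices":[{"delta":{},"finish_reason":"stop"}]}',
    'data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":5,"total_tokens":17}}',
    'data: [DONE]',
])
print(last_usage(sample))
```

To run it against the real server, save the curl output to a file and pass the file contents to `last_usage`. A result of `None` means no usage block came back.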

Second, and more “poisonous”: this line for Gemma is dangerous:

ctx-size = 120000
parallel = 2

In llama-server, parallel slots divide the context budget. So that is likely ~60k effective context per slot, while OpenClaw/model config may believe it has 120k. That mismatch can directly produce context overflow, failed retries, compactions, and a session that feels cursed. For debugging, I’d make Gemma parallel = 1, or configure OpenClaw’s contextWindow/contextTokens for that model to the actual per-slot budget.

I’d also temporarily remove this while testing:

slot-save-path = /Volumes/External_KV/llama/cache

Not because it explains /status by itself, but because persistent KV/slot state is exactly the kind of “why is this following me between runs?” variable I’d eliminate until the usage/context numbers are sane.

My recommended clean test:

  1. stop llama-server fully, don’t just screen -X stuff a new command into an existing session
  2. start one model only, parallel=1, no slot-save-path
  3. run the curl above and confirm final usage
  4. start a brand-new OpenClaw session and check /status

If usage appears and /status becomes small-number/75k, we’ve got it. If usage appears in curl but OpenClaw still shows ?, then that’s worth a GitHub issue with the curl output + openclaw status --all: https://github.com/openclaw/openclaw/issues

cold aurora
#

curl response:

data: {"choices":[],"created":1778345467,"id":"chatcmpl-FwGXvBuXebLg9EOxYViox5L6eQB9TgOK","model":"unsloth/MiniMax-M2.7-GGUF:Q4_K_M","system_fingerprint":"b9080-9f5f0e689","object":"chat.completion.chunk","usage":{"completion_tokens":39,"prompt_tokens":40,"total_tokens":79,"prompt_tokens_details":{"cached_tokens":7}},"timings":{"cache_n":7,"prompt_n":33,"prompt_ms":1474.269,"prompt_per_token_ms":44.67481818181818,"prompt_per_second":22.38397470203877,"predicted_n":39,"predicted_ms":834.026,"predicted_per_token_ms":21.38528205128205,"predicted_per_second":46.76113214695945}}

undone spearBOT
#

Nice — that curl result is actually really useful. It means llama.cpp is returning the final usage chunk correctly:

"usage":{"completion_tokens":39,"prompt_tokens":40,"total_tokens":79,...}

So I’d retract the “backend doesn’t return usage” theory for this model. OpenClaw’s OpenAI-compatible stream parser should be able to consume a usage-only chunk with choices: [] — the code checks chunk.usage before it cares about choices.

That narrows it down to: usage is getting lost after transport, before /status writes/reads the session store.

The suspicious path I see in current source is this: OpenClaw intentionally does not use normal usage for context accounting when the provider is treated as a CLI provider. In that case it can mark totalTokensFresh: false, and /status prints ?/75k.

So I’d check whether your custom provider id accidentally matches a CLI backend/config entry:

grep -RIn 'cliBackends\|studio-llamacpp\|lmstudio' ~/.openclaw/openclaw.json ~/.openclaw/models.json ~/.openclaw/agents 2>/dev/null

If studio-llamacpp or lmstudio appears under agents.defaults.cliBackends, that would explain the poison perfectly: OpenClaw gets usage, but refuses to treat it as prompt context usage because it thinks this is a CLI harness provider.

Also: your screen restart wrapper has a footgun. If the screen session already exists, this:

screen -S $SCREEN_NAME -X stuff "$LLAMA_BIN ...\n"

may just type the command into the existing llama-server process/session, not restart it cleanly. For debugging, fully kill the screen/session and start fresh so you know which config is active.

Next thing I’d want is either:

openclaw status --all

or just the relevant session store entry showing totalTokens, totalTokensFresh, model, and provider for the cursed session. If totalTokensFresh: false, we’ve found the exact reason /status is showing ?.

cold aurora
#

1 - That grep returns a massive number of entries from the agents' session files

2-

> openclaw config get agents.defaults.cliBackends
Config warnings:
- plugins: plugin: ignored plugins.load.paths entry that points at OpenClaw's current bundled plugin directory; remove this redundant path or run openclaw doctor --fix
- plugins.entries.qqbot: plugin not installed: qqbot — install the official external plugin with: openclaw plugins install @openclaw/qqbot

:lobster: OpenClaw 2026.5.6 (c97b9f7) — The lobster in your shell. :lobster:

Config path not found: agents.defaults.cliBackends

3 - status --all isn't showing anything about sessions or tokens

If I go pull a recent session from last night:

Part of a big JSON:

model":"lmstudio-community/gemma-4-31B-it-GGUF:Q4_K_M","usage":{"input":0,"output":0,"cacheRead":0,"cacheWrite":0,"totalTokens":0,"cost":{"input":0,"output":0,"cacheRead":0,"cacheWrite":0,"total":0}},"stopReason":"stop","timestamp":1778289200715,

undone spearBOT
#

Aha, that session snippet is the smoking gun:

"usage":{"input":0,"output":0,"cacheRead":0,"cacheWrite":0,"totalTokens":0}

So llama.cpp is returning usage correctly in curl, but OpenClaw is recording the assistant turn with the default zero-usage object. That explains the whole chain:

  1. transcript has zero usage
  2. session store can’t derive fresh prompt tokens
  3. /status prints ?/75k
  4. compaction/context guard loses the reliable “how full am I?” signal
  5. overflow starts feeling contagious across model switches in the same session/history
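
To see how many saved turns are affected, something like the following could scan a transcript. This is a rough sketch under assumptions: the `usage` field names are taken from your snippet, but the overall transcript layout (a list of message objects, each with an optional `usage` dict) is a guess, and `zero_usage_turns` is a hypothetical helper, not an OpenClaw API.

```python
def is_zero_usage(usage):
    """True when every token counter in a usage object is zero or missing."""
    keys = ("input", "output", "cacheRead", "cacheWrite", "totalTokens")
    return all(usage.get(key, 0) == 0 for key in keys)

def zero_usage_turns(messages):
    """Return the messages whose usage object carries no token counts."""
    return [
        message for message in messages
        if isinstance(message.get("usage"), dict) and is_zero_usage(message["usage"])
    ]

# First entry mirrors the snippet above; the second is made-up for contrast.
messages = [
    {"model": "lmstudio-community/gemma-4-31B-it-GGUF:Q4_K_M",
     "usage": {"input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0, "totalTokens": 0},
     "stopReason": "stop"},
    {"model": "example-model",
     "usage": {"input": 40, "output": 39, "cacheRead": 7, "cacheWrite": 0, "totalTokens": 79},
     "stopReason": "stop"},
]
bad = zero_usage_turns(messages)
print(f"{len(bad)} of {len(messages)} turns have zero usage")
```

If every assistant turn in the cursed session comes back as zero usage, that supports the chain above.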

Your cliBackends check coming back as "Config path not found" is good: that was a plausible path, but it is not the one you're hitting.

Also yeah, my grep was too broad because it hit session files. Use config-only paths:

grep -RIn 'studio-llamacpp\|lmstudio\|cliBackends' \
  ~/.openclaw/openclaw.json ~/.openclaw/models.json ~/.openclaw/agents/*/AGENTS.md 2>/dev/null

But with your Config path not found, I’d stop chasing CLI backend unless that grep finds something weird.

The useful next diagnostic is to confirm whether OpenClaw’s request/response stream contains the final usage chunk when OpenClaw calls llama.cpp, not curl. If your /tmp/llama.log is verbose enough, grep around one bad request:

grep -n 'usage\|chatcmpl\|prompt_tokens\|completion_tokens' /tmp/llama.log | tail -80

If llama.log shows the usage chunk for OpenClaw requests too, this is very likely an OpenClaw stream-consumption bug/compat gap: the final llama.cpp usage-only chunk has choices: [], and OpenClaw is somehow ending the transport with the initial empty usage snapshot instead of the parsed final usage.

For a workaround while debugging:

  • start a fresh session after changing models; don’t reuse the cursed one
  • set realistic/lower contextWindow/contextTokens in OpenClaw for these local models so compaction triggers earlier
  • make Gemma parallel = 1 or cap OpenClaw to the effective per-slot context
  • file this as a GitHub issue with: curl response, the zero-usage transcript snippet, model provider config, and openclaw --version: https://github.com/openclaw/openclaw/issues

I’m pretty convinced this is not reasoning flags now. It’s usage accounting going to zero between llama.cpp and OpenClaw’s saved assistant turn.

cold aurora
#

That llama grep fails. Here is the latest example:

slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.998 (> 0.100 thold), f_keep = 0.999
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 112 | processing task, is_child = 0
slot update_slots: id 0 | task 112 | new prompt, n_ctx_slot = 120064, n_keep = 0, task.n_tokens = 17332
slot update_slots: id 0 | task 112 | n_tokens = 17290, memory_seq_rm [17290, end)
slot update_slots: id 0 | task 112 | prompt processing progress, n_tokens = 17328, batch.n_tokens = 38, progress = 0.999769
slot update_slots: id 0 | task 112 | n_tokens = 17328, memory_seq_rm [17328, end)
slot init_sampler: id 0 | task 112 | init sampler, took 1.45 ms, tokens: text = 17332, total = 17332
slot update_slots: id 0 | task 112 | prompt processing done, n_tokens = 17332, batch.n_tokens = 4
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
reasoning-budget: activated, budget=2147483647 tokens
srv stop: cancel task, id_task = 112
slot release: id 0 | task 112 | stop processing: n_tokens = 17553, truncated = 0
srv update_slots: all slots are idle
kens = 1024, progress = 0.357646
[50107] slot update_slots: id 0 | task 0 | n_tokens = 6144, memory_seq_rm [6144, end)
[50107] slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 7168, batch.n_tokens = 1024, progress = 0.417254
[50107] slot update_slots: id 0 | task 0 | n_tokens = 7168, memory_seq_rm [7168, end)
[50107] slot create_check: id 0 | task 0 | created context checkpoint 2 of 32 (pos_min = 15104, pos_max = 16383, n_tokens = 16384, size = 531.266 MiB)
[50107] slot update_slots: id 0 | task 0 | n_tokens = 16919, memory_seq_rm [16919, end)
[50107] slot update_slots: id 0 | task 0 | prompt processing

undone spearBOT
#

Yep, that grep “fails” because llama.cpp’s normal log is not logging the SSE response body, only server-side slot/accounting events. So absence of usage in /tmp/llama.log doesn’t prove it wasn’t sent.

But this log still tells us something important:

new prompt, n_ctx_slot = 120064, task.n_tokens = 17332
...
stop processing: n_tokens = 17553, truncated = 0

So llama.cpp itself had a perfectly valid prompt size: ~17.3k input tokens, ~221 generated tokens. OpenClaw saving usage: 0/0/0 for that same kind of request is the bad handoff.
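
For the issue report, those numbers can be pulled out of the log mechanically. A minimal sketch, assuming the log line format shown in your paste and a single task in the excerpt; `summarize` is a hypothetical helper name.

```python
import re

# Two lines copied from the log excerpt above.
LOG = """\
slot update_slots: id 0 | task 112 | new prompt, n_ctx_slot = 120064, n_keep = 0, task.n_tokens = 17332
slot release: id 0 | task 112 | stop processing: n_tokens = 17553, truncated = 0
"""

def summarize(log_text):
    """Extract the slot budget, prompt size and generated tokens for one task."""
    new_prompt = re.search(r"n_ctx_slot = (\d+).*?task\.n_tokens = (\d+)", log_text)
    stop = re.search(r"stop processing: n_tokens = (\d+)", log_text)
    if not new_prompt or not stop:
        return None  # the excerpt does not contain both lines
    n_ctx_slot = int(new_prompt.group(1))
    prompt_tokens = int(new_prompt.group(2))
    final_tokens = int(stop.group(1))
    return {
        "n_ctx_slot": n_ctx_slot,
        "prompt_tokens": prompt_tokens,
        "generated_tokens": final_tokens - prompt_tokens,
        "fill": final_tokens / n_ctx_slot,  # fraction of the slot in use
    }

print(summarize(LOG))
```

The `n_ctx_slot` value is also the per-slot budget llama.cpp actually allocated, so it is worth comparing against whatever context window OpenClaw is configured with for that model.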

One thing in your log is suspicious though:

srv log_server_r: done request: POST /v1/chat/completions ... 200
reasoning-budget: activated, budget=2147483647 tokens
srv stop: cancel task, id_task = 112

That looks like the client/request may be closing/cancelling right around finalization. If the final usage-only SSE chunk is emitted at the very end, and OpenClaw stops reading once it sees finish_reason: stop / closes early, you’d get exactly this symptom: text saved, stopReason saved, but final usage chunk never makes it into the saved assistant message.
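
To make that hypothesis concrete, here is a toy consumer with both behaviors. This is purely illustrative and is not OpenClaw's actual stream code: `consume` and its `stop_at_finish_reason` flag are hypothetical, and the stream is made-up data in the OpenAI-compatible chunk shape.

```python
import json

def consume(lines, stop_at_finish_reason):
    """Read SSE data lines and return (text, usage) as the client would save them."""
    text, usage = [], None
    for line in lines:
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text.append(content)
            # The suspected failure mode: returning as soon as a finish_reason
            # arrives, before the trailing usage-only chunk has been read.
            if stop_at_finish_reason and choice.get("finish_reason"):
                return "".join(text), usage
    return "".join(text), usage

stream = [
    'data: {"choices":[{"delta":{"content":"hi"},"finish_reason":null}]}',
    'data: {"choices":[{"delta":{},"finish_reason":"stop"}]}',
    'data: {"choices":[],"usage":{"prompt_tokens":40,"completion_tokens":39,"total_tokens":79}}',
    'data: [DONE]',
]
print(consume(stream, stop_at_finish_reason=True))   # text kept, usage is None
print(consume(stream, stop_at_finish_reason=False))  # text kept, usage present
```

Both calls return the same text, which is why the symptom would be easy to miss: only the usage differs.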

The cleanest repro for an issue would be:

  1. curl shows final usage chunk ✅
  2. llama log shows real prompt tokens ✅
  3. OpenClaw transcript saves:
"usage":{"input":0,"output":0,"cacheRead":0,"cacheWrite":0,"totalTokens":0}

That’s enough to file. I’d title it something like:

OpenAI-compatible streaming with llama.cpp saves zero usage, causing /status ?/context and bad compaction behavior

Include the curl final chunk and that n_ctx_slot/task.n_tokens llama log. This smells like OpenClaw should either keep consuming through the terminal usage chunk, or fall back to llama.cpp timings/token fields when usage was missed.

cold aurora
#

That reasoning budget seems high. Is it possible that it needs to be lowered? Maybe the response is overflowing?

undone spearBOT
#

Yes — that reasoning budget is effectively “infinite”, and I’d absolutely lower it for this repro.

2147483647 is basically int-max. With predict = -1 too, you’ve created an “allowed to think/generate forever until something else stops it” setup. That can absolutely cause response-side context pressure, especially with Gemma/Qwen thinking templates where reasoning tokens may be hidden/stripped but still consume KV/context.

But: I’d separate the two bugs:

  • Reasoning budget too high can cause long hidden thinking, slow runs, empty/weird output, or eventually context overflow.
  • It should not by itself make OpenClaw save usage: 0/0/0 if llama.cpp returned usage correctly.

For debugging, I’d change these first:

reasoning-budget = 1024
predict = 2048

or for non-thinking/debug mode: