#Qwen3.6-35B, how much context should I run on llama-server's end?
1 messages · Page 1 of 1 (latest)
262k is almost definitely the reason it feels slow.
For a local 35B in OpenClaw, I’d start with 32k on llama-server, then try 64k only if you really need it. I would not run 262k as the default.
Quick rule of thumb:
- 16k to 32k: fastest
- 32k to 64k: good compromise
- 128k+: only for very specific long-context cases
- 262k: usually too slow for daily agent use
A good setup is:
- lower
llama-server --ctx-sizeto 32768 or 65536 - make OpenClaw match that limit, or use
contextTokensto cap runtime lower
Docs:
Also, what does openclaw --version say?
If you paste your provider config block, I can suggest the exact contextWindow / contextTokens values.
What happens when I run out of tokens and reach 64k for example? Do I lose the whole chat?
Nope, you usually do not lose the whole chat.
What normally happens at a 64k limit:
- OpenClaw sees the session is getting near the limit
- it compacts older parts of the conversation into a summary
- recent messages stay intact
- the full history still stays on disk, only the model sees the compacted version
So it’s more like older context gets summarized, not deleted.
If the model/server hard-rejects the request first with a context overflow, OpenClaw can usually auto-compact and retry.
Useful commands:
/compactto force a summary now/newif you want a completely fresh session
Docs:
- https://docs.openclaw.ai/concepts/compaction
- https://docs.openclaw.ai/concepts/session-pruning
- https://docs.openclaw.ai/help/faq
So yeah, 64k is a sane cap, and hitting it should mean “summarize older stuff”, not “lose the chat.”
How should I optimize my Qwen3.6-35B to work with OpenClaw and my 40GB VRAM system specifically?
For OpenClaw on 40 GB VRAM, I’d optimize for agent throughput, not max benchmark context.
My honest recommendation on your box:
- Do not run 262k for daily use
- start at 32k ctx
- try 64k only if you really need it
- if Q6 is tight, drop to Q5 or Q4 and keep more KV-cache headroom, that often feels faster overall in real OpenClaw use
What I’d tune, in order:
-
Lower llama-server ctx
- best default: 32k
- upper “practical” target: 64k
-
Match OpenClaw to that
- set your model entry’s effective cap with
contextTokens
- set your model entry’s effective cap with
- if the server itself is capped lower, keep OpenClaw aligned so it doesn’t try to push beyond that
-
Keep compaction + pruning on
- OpenClaw will summarize old context instead of dragging the full thread forever
- that matters a lot with local models
Docs: https://docs.openclaw.ai/concepts/compaction
https://docs.openclaw.ai/concepts/session-pruning
-
Favor “fits comfortably” over Q6 pride
- for agent work, a slightly smaller quant + sane context usually beats Q6 + giant ctx
- if Q6 only feels okay at 16k to 32k, that’s still a valid choice
-
Keep output lengths modest
- don’t let replies sprawl if you care about speed
- long outputs + big context = double pain
-
Use local for normal work, keep a hosted fallback if possible
- local is great for cheap/private runs
- hosted fallback helps when a task needs stronger tool reliability
Docs: https://docs.openclaw.ai/gateway/local-models
- hosted fallback helps when a task needs stronger tool reliability
- At the llama.cpp layer
- full GPU offload if possible
- Flash Attention on
- benchmark 32k Q6 vs 32k or 48k Q5
- pick the one with better end-to-end tokens/sec, not the prettier quant label
If it were my setup, I’d try these two profiles first:
- Fast/default: Qwen 35B Q5, 32k ctx
- Quality mode: Qwen 35B Q6, 16k to 32k ctx
Also, what does openclaw --version say, and what llama-server flags are you using now? If you paste those, I can suggest a more exact setup.
--host 127.0.0.1
--port 8080
--ctx-size 65536
--fit on
-np 1
#-ngl 99
-fa on
-ctk q4_0
-ctv q4_0
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--presence_penalty 1.5
--repeat_penalty 1.0
--chat-template-kwargs '{"preserve_thinking": true}'
--jinja
Looking at those flags, a few things jump out.
Biggest wins first:
-
--ctx-size 65536is still pretty high for a 35B Q6 on 40 GB if you want OpenClaw to feel snappy.
I’d try32768first. -
If
-nglis actually commented out, fix that first.
If layers are not fully offloaded, that alone will tank performance. -
-fa onis good, keep that. -
-ctk q4_0+-ctv q4_0is also good, keep that.
KV cache compression helps a lot at these context sizes. -
preserve_thinking: truecan bloat the transcript a lot.
For OpenClaw, I would probably turn that off unless you really want visible reasoning traces. Long thinking output gets dragged back into context and slows everything down.
For agent reliability, not speed:
--temp 0.6is a bit lively for tool use--presence_penalty 1.5is pretty aggressive
For OpenClaw I’d lean more boring/stable:
- temp around 0.2 to 0.4
- presence penalty 0
- repeat penalty around 1.0 to 1.05
What I’d do on your machine:
Fast/default profile
- Qwen 35B
- 32k server ctx
- full GPU offload
- Flash Attention on
- KV cache quantized
- thinking preservation off
Quality profile
- same model
- 48k to 64k max, only when you really need it
- expect noticeable slowdown
Important OpenClaw-side tip:
If your server is 32768, don’t let OpenClaw try to use the full 32k. Give it headroom.
So for OpenClaw, I’d set the model entry to something like:
contextWindow: 32768contextTokens: 24000to28000
That way OpenClaw leaves room for system prompt, tools, and output instead of slamming into the wall.
Docs:
- https://docs.openclaw.ai/gateway/local-models
- https://docs.openclaw.ai/gateway/configuration-reference
- https://docs.openclaw.ai/concepts/compaction
- https://docs.openclaw.ai/concepts/session-pruning
So, short version: drop to 32k, make sure GPU offload is real, and disable preserve_thinking first. That’s where I’d start on your exact setup.
If you want, I can turn this into an exact OpenClaw provider config snippet next.
Should I disable "-np 1" since we are doing agentic tasks?
No, I’d keep -np 1 for now.
For agentic OpenClaw work, more parallel slots usually do not make a single task faster. They mainly help if you have multiple requests hitting llama-server at the same time.
On your setup, increasing slots would usually mean:
- more KV cache pressure
- more VRAM/RAM use
- more contention
- often slower per-request latency
So the rule of thumb is:
- single user / single active agent → keep
-np 1 - multiple concurrent chats / subagents / background jobs → maybe test
-np 2 - with a 35B + long context, I would be conservative
Especially with --ctx-size 65536, extra slots get expensive fast.
So yeah, for your current goal, -np 1 is the right choice. The bigger wins are still:
- lower ctx
- real GPU offload
- disable
preserve_thinking
If this fixed it, please right-click the solution message → Apps → ✅ Mark Solution. This helps Answer Overflow index the solution for future users.
okay I ran out of tokens very quickly
Yep, that tracks. OpenClaw burns context much faster than a normal chat UI.
A lot is competing for tokens:
- system prompt
- tool schemas
- chat history
- tool results
- any visible reasoning /
preserve_thinking - workspace/bootstrap context
So if you dropped to 32k, the usable space is a lot less than 32k.
What I’d do next:
- turn off
preserve_thinking- this is a big one
-
bump llama-server back to 64k
- but do not let OpenClaw use all 64k
-
set OpenClaw
contextTokenslower, around 40k to 48k- that leaves headroom for output + tool overhead
-
use
/compactearlier- especially before long agentic runs
-
enable pruning if you haven’t
- local/non-Anthropic setups usually need this more
Docs: https://docs.openclaw.ai/concepts/session-pruning
- local/non-Anthropic setups usually need this more
So the sweet spot is often:
- server ctx: 65536
- OpenClaw effective cap: ~40960 or 49152
That usually feels better than:
- 32k and constantly compacting, or
- 262k and everything crawling 😅
If you want to inspect it live, check:
/status/context list
Docs:
- https://docs.openclaw.ai/reference/token-use
- https://docs.openclaw.ai/concepts/context
- https://docs.openclaw.ai/concepts/compaction
If you want, I can suggest the exact contextTokens + pruning config next.
"contextWindow": 32768,
"contextTokens": 24000,
"maxTokens": 8192,
That shape is basically right, but I’d change maxTokens.
Main issue:
"maxTokens": 8192 is pretty high for a local 35B setup with a small effective context budget. It leaves less headroom and can make overflows happen sooner.
I’d use:
If your server is 32k
"contextWindow": 32768,
"contextTokens": 24000,
"maxTokens": 2048
or at most:
"contextWindow": 32768,
"contextTokens": 24000,
"maxTokens": 4096
If your server is still 64k
then I’d rather do:
"contextWindow": 65536,
"contextTokens": 40960,
"maxTokens": 2048
or maybe 49152 for contextTokens if it still feels okay.
So, short version:
contextWindow= actual server/model limitcontextTokens= the cap OpenClaw should aim formaxTokens= keep this modest, usually 2k to 4k
For your setup, 24k / 8k is a bit unbalanced.
24k / 2k or 40k / 2k is much healthier.
how do I restart openclaw so the change takes place?
If you changed OpenClaw config, the usual command is:
openclaw gateway restart
Then verify with:
openclaw gateway status
One important gotcha:
- if you changed
openclaw.json→ restart OpenClaw - if you changed
llama-serverflags like--ctx-size→ restart llama-server too