#Qwen3.6-35B, how much context should I run on llama-server's end?

1 messages · Page 1 of 1 (latest)

rancid snow
#

OpenClaw gets really slow, I am running 40GB VRAM (5080 & 3090) and Qwen3.6-35B Q6, and the 262k context really slows down the speed.

near glacierBOT
grand sedgeBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

paper havenBOT
#

262k is almost definitely the reason it feels slow.

For a local 35B in OpenClaw, I’d start with 32k on llama-server, then try 64k only if you really need it. I would not run 262k as the default.

Quick rule of thumb:

  • 16k to 32k: fastest
  • 32k to 64k: good compromise
  • 128k+: only for very specific long-context cases
  • 262k: usually too slow for daily agent use

A good setup is:

  1. lower llama-server --ctx-size to 32768 or 65536
  2. make OpenClaw match that limit, or use contextTokens to cap runtime lower

Docs:

rancid snow
#

What happens when I run out of tokens and reach 64k for example? Do I lose the whole chat?

paper havenBOT
#

Nope, you usually do not lose the whole chat.

What normally happens at a 64k limit:

  1. OpenClaw sees the session is getting near the limit
  2. it compacts older parts of the conversation into a summary
  3. recent messages stay intact
  4. the full history still stays on disk, only the model sees the compacted version

So it’s more like older context gets summarized, not deleted.

If the model/server hard-rejects the request first with a context overflow, OpenClaw can usually auto-compact and retry.

Useful commands:

  • /compact to force a summary now
  • /new if you want a completely fresh session
rancid snow
#

How should I optimize my Qwen3.6-35B to work with OpenClaw and my 40GB VRAM system specifically?

paper havenBOT
#

For OpenClaw on 40 GB VRAM, I’d optimize for agent throughput, not max benchmark context.

My honest recommendation on your box:

  • Do not run 262k for daily use
  • start at 32k ctx
  • try 64k only if you really need it
  • if Q6 is tight, drop to Q5 or Q4 and keep more KV-cache headroom, that often feels faster overall in real OpenClaw use

What I’d tune, in order:

  1. Lower llama-server ctx

    • best default: 32k
    • upper “practical” target: 64k
  2. Match OpenClaw to that

    • set your model entry’s effective cap with contextTokens
#
  • if the server itself is capped lower, keep OpenClaw aligned so it doesn’t try to push beyond that
  1. Keep compaction + pruning on

  2. Favor “fits comfortably” over Q6 pride

    • for agent work, a slightly smaller quant + sane context usually beats Q6 + giant ctx
    • if Q6 only feels okay at 16k to 32k, that’s still a valid choice
  3. Keep output lengths modest

    • don’t let replies sprawl if you care about speed
    • long outputs + big context = double pain
  4. Use local for normal work, keep a hosted fallback if possible

#
  1. At the llama.cpp layer
    • full GPU offload if possible
    • Flash Attention on
    • benchmark 32k Q6 vs 32k or 48k Q5
    • pick the one with better end-to-end tokens/sec, not the prettier quant label

If it were my setup, I’d try these two profiles first:

  • Fast/default: Qwen 35B Q5, 32k ctx
  • Quality mode: Qwen 35B Q6, 16k to 32k ctx

Also, what does openclaw --version say, and what llama-server flags are you using now? If you paste those, I can suggest a more exact setup.

rancid snow
#

--host 127.0.0.1
--port 8080
--ctx-size 65536
--fit on
-np 1
#-ngl 99
-fa on
-ctk q4_0
-ctv q4_0
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--presence_penalty 1.5
--repeat_penalty 1.0
--chat-template-kwargs '{"preserve_thinking": true}'
--jinja

paper havenBOT
#

Looking at those flags, a few things jump out.

Biggest wins first:

  • --ctx-size 65536 is still pretty high for a 35B Q6 on 40 GB if you want OpenClaw to feel snappy.
    I’d try 32768 first.

  • If -ngl is actually commented out, fix that first.
    If layers are not fully offloaded, that alone will tank performance.

  • -fa on is good, keep that.

  • -ctk q4_0 + -ctv q4_0 is also good, keep that.
    KV cache compression helps a lot at these context sizes.

  • preserve_thinking: true can bloat the transcript a lot.
    For OpenClaw, I would probably turn that off unless you really want visible reasoning traces. Long thinking output gets dragged back into context and slows everything down.

#

For agent reliability, not speed:

  • --temp 0.6 is a bit lively for tool use
  • --presence_penalty 1.5 is pretty aggressive

For OpenClaw I’d lean more boring/stable:

  • temp around 0.2 to 0.4
  • presence penalty 0
  • repeat penalty around 1.0 to 1.05

What I’d do on your machine:

Fast/default profile

  • Qwen 35B
  • 32k server ctx
  • full GPU offload
  • Flash Attention on
#
  • KV cache quantized
  • thinking preservation off

Quality profile

  • same model
  • 48k to 64k max, only when you really need it
  • expect noticeable slowdown

Important OpenClaw-side tip:
If your server is 32768, don’t let OpenClaw try to use the full 32k. Give it headroom.

So for OpenClaw, I’d set the model entry to something like:

  • contextWindow: 32768
  • contextTokens: 24000 to 28000

That way OpenClaw leaves room for system prompt, tools, and output instead of slamming into the wall.

rancid snow
#

Should I disable "-np 1" since we are doing agentic tasks?

paper havenBOT
#

No, I’d keep -np 1 for now.

For agentic OpenClaw work, more parallel slots usually do not make a single task faster. They mainly help if you have multiple requests hitting llama-server at the same time.

On your setup, increasing slots would usually mean:

  • more KV cache pressure
  • more VRAM/RAM use
  • more contention
  • often slower per-request latency

So the rule of thumb is:

  • single user / single active agent → keep -np 1
  • multiple concurrent chats / subagents / background jobs → maybe test -np 2
  • with a 35B + long context, I would be conservative
#

Especially with --ctx-size 65536, extra slots get expensive fast.

So yeah, for your current goal, -np 1 is the right choice. The bigger wins are still:

  • lower ctx
  • real GPU offload
  • disable preserve_thinking

If this fixed it, please right-click the solution message → Apps → ✅ Mark Solution. This helps Answer Overflow index the solution for future users.

rancid snow
#

okay I ran out of tokens very quickly

paper havenBOT
#

Yep, that tracks. OpenClaw burns context much faster than a normal chat UI.

A lot is competing for tokens:

  • system prompt
  • tool schemas
  • chat history
  • tool results
  • any visible reasoning / preserve_thinking
  • workspace/bootstrap context

So if you dropped to 32k, the usable space is a lot less than 32k.

What I’d do next:

  1. turn off preserve_thinking
    • this is a big one
#
  1. bump llama-server back to 64k

    • but do not let OpenClaw use all 64k
  2. set OpenClaw contextTokens lower, around 40k to 48k

    • that leaves headroom for output + tool overhead
  3. use /compact earlier

    • especially before long agentic runs
  4. enable pruning if you haven’t

So the sweet spot is often:

  • server ctx: 65536
  • OpenClaw effective cap: ~40960 or 49152
rancid snow
#

"contextWindow": 32768,
"contextTokens": 24000,
"maxTokens": 8192,

paper havenBOT
#

That shape is basically right, but I’d change maxTokens.

Main issue:
"maxTokens": 8192 is pretty high for a local 35B setup with a small effective context budget. It leaves less headroom and can make overflows happen sooner.

I’d use:

If your server is 32k

"contextWindow": 32768,
"contextTokens": 24000,
"maxTokens": 2048

or at most:

#
"contextWindow": 32768,
"contextTokens": 24000,
"maxTokens": 4096

If your server is still 64k

then I’d rather do:

"contextWindow": 65536,
"contextTokens": 40960,
"maxTokens": 2048

or maybe 49152 for contextTokens if it still feels okay.

#

So, short version:

  • contextWindow = actual server/model limit
  • contextTokens = the cap OpenClaw should aim for
  • maxTokens = keep this modest, usually 2k to 4k

For your setup, 24k / 8k is a bit unbalanced.
24k / 2k or 40k / 2k is much healthier.

rancid snow
#

how do I restart openclaw so the change takes place?

paper havenBOT
#

If you changed OpenClaw config, the usual command is:

openclaw gateway restart

Then verify with:

openclaw gateway status

One important gotcha:

  • if you changed openclaw.json → restart OpenClaw
  • if you changed llama-server flags like --ctx-size → restart llama-server too