I've started using Qwen 3.5 using llama.cpp. I turned off the thinking capabilities. I'm using:
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_XL --host 0.0.0.0 --port 30001 --reasoning off --jinja
To connect HA to it, I'm using the Local OpenAI LLM custom integration.
I must say I'm quite happy with the result. I was wondering however if there is a way to run llama-server without the --reasoning off and have the thinking turned off from the what is sent to the server. I've read somewhere that one could send "thinking" : off. I do not know where to do that. Has anyone done that with this or another model?
#Qwen 3.5 and thinking mode
1 messages · Page 1 of 1 (latest)
I think you can just add /no_think or /nothink to your prompt. But I am not sure on the specifics for your exact setup.
At the very end of the prompt?
I don't think it matters where it is, but you can experiment a bit.
That does not seem to work for Qwen 3.5. What I'm finding online is that I should add some parameters:
"chat_template_kwargs": {
"enable_thinking": false
}
But I don't know if/how Local OpenAI LLM supports it.
Funny you mention this, I put together some benchmarking against llama.cpp (my old GTX 1080 8GB), https://github.com/Drizzt321/ha-voiceagent-llm-benchmark, and with /no_think it significantly improved performance https://github.com/Drizzt321/ha-voiceagent-llm-benchmark/blob/main/reports/benchmark-run-analysis-2026-03.md
@knotty solstice You have not tried Qwen 3.5 yet, right?
Ah, I haven't, just 3, there's a 3.5 now? Might need to redo the tests. I also noticed a new Gemma 4 with some smaller models, might need to try those as well
There is a 3.5 But it looks like it doesn't support /no_think.
I disabled thinking from the llama.cpp command line but was wondering if I could do it from the data sent to the server on a request per request basis.
Look at my prompts in the report, you can send /no_think as part of the HA prompt
There were a few other minor prompt changes which did help a good bid overall
Have a look at https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1559 It states that /think /no_think is not supported by Qwen3.5. I did try it and indeed it does nothing.
At least for qwen 3
Oh, sorry, I kept thinking in the v3 that I tried
Which specific model/hugging-face quant version are you using?
Hm, looks like this is a way to disable thinking https://huggingface.co/Qwen/Qwen3.5-9B#instruct-or-non-thinking-mode
Hm. Looks like the HACs add-on I'm using doesn't allow for specifying a chat_template_kwargs param
unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_XL
35B, oh wow
Which add-on are you using?
you must have some good sized GPU hardware
I have a Nvidia DGX Spark.
They have a nvfp4 version that should use much less memory but I haven't tried it yet.
I'm using the same add-on.
The results are quite good (and fun).
Oh wait!
Chat Template Arguments allow you to provide custom arguments to your model
Arguments are supplied as key/value pairs and provided to the chat_template_kwargs request parameter
Values support Jinja2 templates, in order to provide non-string and more complex data structures
Arguments differ per model, and not all models make use of user-provided arguments
See your models documentation for what arguments are available to be used
Looks like they key/value pairs within chat_template_kwargs!!
So you should be able to add "enable_thinking": False as part of the add-on
It'd be amazing to have something half as capable myself, but in this case, it's intended to solely be a voice control, with minimal real thinking needed. I'm going to end up (somehow) hooking up a Claude Code background agent (use that cheaper subscription!) as a more fully fledged assistant type, just need to figure out how. Probably using some kind of MQTT message consumer setup.
I'm going to try this. I've set up a new Llama.cpp without the command line argument turning off thinking.
What do you think I should put in the template?
Just "enable_thinking": false ?
It gives me "Error talking to API"
On the server side I have:
{"error":{"code":400,"message":"invalid type for "enable_thinking" (expected boolean, got string)","type":"invalid_request_error"}}
Ah. Now I got it working.
Had to put the double quotes around the enable_thinking
prompt eval time = 125.32 ms / 71 tokens ( 1.77 ms per token, 566.56 tokens per second)
eval time = 9482.62 ms / 540 tokens ( 17.56 ms per token, 56.95 tokens per second)
total time = 9607.94 ms / 611 tokens
This is weird: Even in the agent where I do not turn off the thinking, it does not do any reasoning....
That seems to come from Local OpenAI LLM. When I put the same prompt directly to llama.cpp, it does think for a very long time.
Oh interesting, this might be relevant, a release on the Add-on https://github.com/skye-harris/hass_local_openai_llm/pull/51
This is what is working for me: Qwen3.5-35B-A3B-Q4_K_M in my compose.yml:
- —jinja
- —reasoning-format
- deepseek