#Qwen 3.5 and thinking mode

1 messages · Page 1 of 1 (latest)

thick hazel
#

I've started using Qwen 3.5 using llama.cpp. I turned off the thinking capabilities. I'm using:
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_XL --host 0.0.0.0 --port 30001 --reasoning off --jinja
To connect HA to it, I'm using the Local OpenAI LLM custom integration.
I must say I'm quite happy with the result. I was wondering however if there is a way to run llama-server without the --reasoning off and have the thinking turned off from the what is sent to the server. I've read somewhere that one could send "thinking" : off. I do not know where to do that. Has anyone done that with this or another model?

trail cave
thick hazel
#

At the very end of the prompt?

trail cave
thick hazel
#

That does not seem to work for Qwen 3.5. What I'm finding online is that I should add some parameters:

"chat_template_kwargs": {
        "enable_thinking": false
      }

But I don't know if/how Local OpenAI LLM supports it.

knotty solstice
#

Funny you mention this, I put together some benchmarking against llama.cpp (my old GTX 1080 8GB), https://github.com/Drizzt321/ha-voiceagent-llm-benchmark, and with /no_think it significantly improved performance https://github.com/Drizzt321/ha-voiceagent-llm-benchmark/blob/main/reports/benchmark-run-analysis-2026-03.md

GitHub

A Home-Assistant Voice Agent LLM benchmark setup, designed to performance and quality test LLMs running on a remote llama.cpp server to score and determine quality of results. - Drizzt321/ha-voice...

thick hazel
#

@knotty solstice You have not tried Qwen 3.5 yet, right?

knotty solstice
#

Ah, I haven't, just 3, there's a 3.5 now? Might need to redo the tests. I also noticed a new Gemma 4 with some smaller models, might need to try those as well

thick hazel
#

There is a 3.5 But it looks like it doesn't support /no_think.

#

I disabled thinking from the llama.cpp command line but was wondering if I could do it from the data sent to the server on a request per request basis.

knotty solstice
#

Look at my prompts in the report, you can send /no_think as part of the HA prompt

#

There were a few other minor prompt changes which did help a good bid overall

thick hazel
knotty solstice
#

At least for qwen 3

#

Oh, sorry, I kept thinking in the v3 that I tried

#

Which specific model/hugging-face quant version are you using?

#

Hm. Looks like the HACs add-on I'm using doesn't allow for specifying a chat_template_kwargs param

thick hazel
#

unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_XL

knotty solstice
#

35B, oh wow

thick hazel
#

Which add-on are you using?

knotty solstice
#

you must have some good sized GPU hardware

thick hazel
#

I have a Nvidia DGX Spark.

knotty solstice
#

dude, nice

#

all I've got is my old GTX 1080 8GB, lol

#

Local OpenAI LLM

thick hazel
#

They have a nvfp4 version that should use much less memory but I haven't tried it yet.

#

I'm using the same add-on.

knotty solstice
thick hazel
#

The results are quite good (and fun).

knotty solstice
#

Oh wait!

Chat Template Arguments allow you to provide custom arguments to your model

    Arguments are supplied as key/value pairs and provided to the chat_template_kwargs request parameter
    Values support Jinja2 templates, in order to provide non-string and more complex data structures
    Arguments differ per model, and not all models make use of user-provided arguments
    See your models documentation for what arguments are available to be used
#

Looks like they key/value pairs within chat_template_kwargs!!

#

So you should be able to add "enable_thinking": False as part of the add-on

#

It'd be amazing to have something half as capable myself, but in this case, it's intended to solely be a voice control, with minimal real thinking needed. I'm going to end up (somehow) hooking up a Claude Code background agent (use that cheaper subscription!) as a more fully fledged assistant type, just need to figure out how. Probably using some kind of MQTT message consumer setup.

thick hazel
#

I'm going to try this. I've set up a new Llama.cpp without the command line argument turning off thinking.
What do you think I should put in the template?

#

Just "enable_thinking": false ?

#

It gives me "Error talking to API"

#

On the server side I have:
{"error":{"code":400,"message":"invalid type for "enable_thinking" (expected boolean, got string)","type":"invalid_request_error"}}

#

Ah. Now I got it working.

#

Had to put the double quotes around the enable_thinking

#
prompt eval time =     125.32 ms /    71 tokens (    1.77 ms per token,   566.56 tokens per second)
       eval time =    9482.62 ms /   540 tokens (   17.56 ms per token,    56.95 tokens per second)
      total time =    9607.94 ms /   611 tokens
#

This is weird: Even in the agent where I do not turn off the thinking, it does not do any reasoning....

#

That seems to come from Local OpenAI LLM. When I put the same prompt directly to llama.cpp, it does think for a very long time.

knotty solstice
lusty tide