#Qwen
Commenting here as I am interested in the best providers for this model and self hosting tuning or local benchmarks for different hardware
Qwen 3.5 plus is multi modal
Has anyone tried the small Qwen3.5 models yet? They came out yesterday for us card-poor people.
Same question. I can't test the new local models this week (27b and 35b-a3b, Q4_K_M) 🥺
I want the best AI I can run on a 5070 Ti for light coding, chat, translation, and heartbeat. I can use Ollama cloud for free and I already pay for Gemini Pro; I also took the Qwen coding plan this month.
Oh I have a 5070 TI 😛
Last model I used was the Qwen3 14B Q6
I am hoping that 3.5 9B Q8 would beat it.
My focus is exclusively HEARTBEAT: I can main Claude Sonnet and cheat on the heartbeat with Qwen on the down low. At least that is the plan.
Just gave 122b a whirl, q4 on a 6000ada with main memory spillover. Much slower but maybe also much smarter
Have been trying out 122b and 35b on my M4 Max 128GB, but so far it's been pretty terrible at tool calling and keeps getting stuck in loops
did you apply the tool calling fix template?
Yeah I don’t know what you mean, my 122b is super smart and solving tons of stuff that 35b was just getting stuck on
4070 (12 GB) here. qwen3:8b and 14b work for my tasks; 14b is a little tight but seems to work (though since 8b works, there's no reason to use 14b). I started playing with qwen3.5:9b the other day, and that worked well too, so I've been using it in place of qwen3:8b. I'm going to replace the 4070 with a 3090 (24 GB), hoping 24 GB will be the sweet spot. I could then try qwen3.5:27b and some other larger models. I'm running both with Ollama. I should add that I'm currently using the MiniMax starter plan (about $9 a month) as my main driver, but for scheduled tasks qwen3.5 is doing the work. It takes a little longer than MiniMax, but I'm not waiting on it, so that doesn't matter. I did find that when I went from MiniMax to the Qwen variants, I had to tighten up my instructions to be a little clearer. Where MiniMax would just figure out what I meant when I wasn't clear, Qwen would get confused, give up, or misinterpret. Being more explicit got things working.
ClawEval just released tests of all the small Qwen 3.5 models across 59 OpenClaw agent roles. They also added 8GB, 12GB, and 16GB VRAM models on top of the 24GB and bigger ones: https://github.com/explaindio/ClawEval
are you on the latest version? llamacpp is broken with tool calling, https://github.com/openclaw/openclaw/issues/32916
try 3.1 for now
I'm serving the models with lmstudio, the tool calls themselves seem to work, just hasn't been good at recovering from errors and figuring out how to use them correctly
using qwen3.5:27b, it can't seem to remember its own soul
What Quantization?
I'm also having pretty meh results with qwen 3.5 9b. Often doesn't reply unless I ask if successful. I did have to apply the template fix for the multi step tool bug so at least it responds sometimes
And at least it is mostly successful for simple queries, just don't know why it chooses to be silent about results
what do you mean? 32k ctxt
Same problem with the 3.5 Plus coding plan: it doesn't respond most of the time
But the work is done
I just tried switching to openai-responses to see if that does better tool calling. I think it was getting stuck in loops.
For the Ollama provider (on another host) I'm following the openclaw.ai docs https://docs.openclaw.ai/providers/ollama#streaming-configuration, specifically not using the v1 endpoint, using the ollama api (not openai-completions), and large contexts. My openclaw config info:
"models": {
  "mode": "merge",
  "providers": {
    "ollama": {
      "baseUrl": "http://192.168.86.60:11434",
      "apiKey": {
        "source": "env",
        "provider": "default",
        "id": "OLLAMA_API_KEY"
      },
      "api": "ollama",
      "models": [
        {
          "id": "qwen3.5:9b",
          "name": "Qwen3.5 9B (Ollama, local GPU)",
          "reasoning": true,
          "input": [
            "text"
          ],
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          },
          "contextWindow": 262144,
          "maxTokens": 2621440
        },
The context length is taken from the value returned by the api/show endpoint, e.g.
curl http://localhost:11434/api/show -d '{"name":"qwen3.5:9b"}' | python -m json.tool
. . .
"model_info": {
    "general.architecture": "qwen35",
    "general.file_type": 15,
    "general.parameter_count": 9653104368,
    "general.quantization_version": 2,
    "qwen35.attention.head_count": 16,
    "qwen35.attention.head_count_kv": null,
    "qwen35.attention.key_length": 256,
    "qwen35.attention.layer_norm_rms_epsilon": 1e-06,
    "qwen35.attention.value_length": 256,
    "qwen35.block_count": 32,
    "qwen35.context_length": 262144,
. . .
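The same lookup can be scripted. This is a minimal sketch, assuming an Ollama server at http://localhost:11434 and a reply shaped like the excerpt above; the helper names `extract_context_length` and `fetch_show` are hypothetical.

```python
import json
import urllib.request


def extract_context_length(show_response: dict):
    """Return the '<architecture>.context_length' value from an /api/show reply, or None."""
    info = show_response.get("model_info", {})
    arch = info.get("general.architecture")
    if arch is None:
        return None
    return info.get(f"{arch}.context_length")


def fetch_show(model: str, base_url: str = "http://localhost:11434") -> dict:
    """POST to /api/show, mirroring the curl command above (needs a running server)."""
    req = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"name": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Values copied from the excerpt above, so this runs without a server.
    sample = {
        "model_info": {
            "general.architecture": "qwen35",
            "qwen35.block_count": 32,
            "qwen35.context_length": 262144,
        }
    }
    print(extract_context_length(sample))
```

Against a live server, `extract_context_length(fetch_show("qwen3.5:9b"))` would return whatever that server reports for the model.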
Was that fixed by Unsloth or did I miss a fix that dropped?
No I found the fix in a posted GitHub issue. I'm not home right now but if you can't find it I'll try to find for you. I'm still having trouble where the prompts complete in lm studio with no errors but I get no response in telegram
It's intermittent, not sure how to debug or if it's just a problem for the poors
Or you might be able to find posts made by me here, I posted the fix in a krill support thread
Will take a look! I noticed similar issues with the 35b variant and the 122b one. Funnily enough 9b does pretty well
I'm using 9b with 48k context and it ranges from impressive to absolute ass. I'm prob shoving too much context into it
The unresponsiveness is driving me nuts tho. I ask if it was successful and sometimes it responds with all the stuff it did (accurately mostly)
Did you manage to get 122b or 397b working with reliable tool calls and responses? I'm currently using minimax m2.5 and looking for an upgrade.
qwenny is not all that bright 🙂 two things that help. in heart beat I put in "After tools run check and process output". that was something I found in a reference how they made qwen perform (quite a lot better) in a benchmark by just telling it to check its own output.
you need to check the output of the browser ui with thinking enabled. Sometimes you have to be very precise what it needs to do, e.g. I change my one command to Execute EXACTLY: bash /home/me/scripts/run_pipeline.sh. (CRITICAL: DO NOT use python3 to run a .sh file!) because it got confused.
if you say "trigger cron job XXX" it also takes 5 steps running a "trigger" command then openclaw help. instead say run cron XXX. then it works straight up.
so debug what is going on or be very explicit with tools.
ClawEval just released a guide: 14 AI agents running 100% locally on a single RTX 3090, using the 9B and 2B models with agent KV cache offloading. Each agent has its own cache for speed. https://github.com/explaindio/ClawEval/blob/master/docs/OpenClaw_Backend_Local_on_3090.pdf
you did that in heartbeat or somewhere else? Like I can see it being useful in heartbeat, "did a tool run and then you didn't follow up? pick up that thread"
but that'd have to hit a lot of sessions to be useful
I had it working for 397b on llamacpp, i'm doing some bringup on vllm-mlx rn for the text only side
initial results v promising
Yes in heartbeat.md. seems to work at least for "what Cron jobs do I have" test
after the update. seems like things are not as stable as before. have not seen output like this before
"The script is still running. Let me wait a bit and poll it again.
<tool_call>
poll
mild-river
30000
</tool_call>"
switched to llama.cpp with qwen and things are working better
better perf and more reliable tools. although it did overwrite my entire skills folder once.... luckily it was a git repo
it did restore it after I pointed it out
"All done! Your skills directory is fully restored. That was my fault - when I did cp -r for the new skill, it overwrote the entire directory instead of just adding the new folder. "
did qwen just remove their promo on their coding plans? crap i was just going to go for it to try it out.
yeah right now its a really good summary model, with some basic tooling if you are specific about what it should do
Switched to llamacpp from ollama, what a difference: now getting 126 tps with qwen3.5-35b-a3b (3000 tps prompt load), vs 33 tps (1000 prompt) with qwen 3.5 27b. Ollama speeds were half those numbers. 35b is usable as primary at that speed (65K context). This is on a 3090 (24 GB). With 27b I was able to get up to around 150k context.
so still struggling to get qwen to be reproducible. e.g. "append your conclusion to conclusions.json" is interpreted very differently every time. sometimes it does append, sometimes it writes to conclusions_new.json or other files. so that sort of thing is not great
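To put those rates in wall-clock terms, here is a small arithmetic sketch. The rates are the ones reported in the post; the assumptions are that the whole 65K prompt is uncached and that a reply is 500 tokens long (a made-up figure for illustration).

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a steady rate."""
    return tokens / tokens_per_second


# Rates reported in the post for a 3090 on llama.cpp.
SETUPS = {
    "qwen3.5-35b-a3b": {"prompt_tps": 3000, "gen_tps": 126},
    "qwen3.5-27b": {"prompt_tps": 1000, "gen_tps": 33},
}
PROMPT_TOKENS = 65_000  # "65K context", assumed to be completely uncached
REPLY_TOKENS = 500      # hypothetical reply length

for name, rates in SETUPS.items():
    prefill = seconds_for(PROMPT_TOKENS, rates["prompt_tps"])
    reply = seconds_for(REPLY_TOKENS, rates["gen_tps"])
    print(f"{name}: prefill {prefill:.1f}s, reply {reply:.1f}s")
```

Under these assumptions a cold 65K prompt takes roughly 22 seconds to load on the 35b-a3b setup versus 65 seconds on 27b, which is why prompt caching matters so much for agent loops.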
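One way to take the ambiguity out of an instruction like this is to hand the model a single script to call, in the spirit of the "be very explicit with tools" advice earlier in the thread. A minimal sketch, assuming conclusions.json holds a JSON array; the function name and file layout are assumptions, not something from the thread.

```python
import json
from pathlib import Path


def append_conclusion(path, conclusion: str) -> int:
    """Append one conclusion to a JSON array file and return the new entry count."""
    p = Path(path)
    # Start a new list when the file does not exist yet.
    entries = json.loads(p.read_text(encoding="utf-8")) if p.exists() else []
    if not isinstance(entries, list):
        raise ValueError(f"{p} does not contain a JSON array")
    entries.append(conclusion)
    p.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return len(entries)
```

The prompt can then name the exact command to run instead of describing the edit, so the file name and the append behavior no longer depend on how the model reads the sentence.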
which qwen
my qwen 3.5 9b on my mac mini m4 pro 24gb of ram keeps stalling. it just stops mid way through a task and I have to keep reminding it to complete the task. is there any fix for this? also I get 25tok/s
I would love someones help with this. I am new to this and love learning and passing along info to others to help as well
With the 24gb you might be able to run the 35b-a3b, or the 27b, though it depends on how much memory your OS leaves free, so maybe not. Is it actually stalling, or getting into a thinking loop? What are you running the model on: ollama, lm studio, llamacpp, etc.? I've found a few things to be helpful. I have openclaw running on an ubuntu machine, but my GPU is in a windows machine, with a 24gb gpu. I was using ollama originally but have switched to llamacpp. I had claude code write me a proxy that I can run in front of either ollama or llamacpp so I can watch the traffic. So if openclaw is pointing to 11434 for ollama, I run ollama on port 11435 and have the proxy listen on 11434; it logs and forwards to 11435. I've put a copy of it here: https://github.com/khaney64/llm-stuff. So while debugging a skill or script or task in openclaw, I'll watch the proxy output. I run it with something like this:
**node .\proxy.js --filter-thinking --dump-messages --message-size 500 --default-ctx 40960 --thinking --log-file --backend llamacpp**
I'll watch that, and I'll watch the GPU and CPU percentage in the windows task manager performance tab. When it's thinking, I'll see GPU ram go up for ollama (or it's already loaded for llamacpp), and I'll see GPU percentage go up. If it seems stuck and the GPU is still going, either it's still thinking and I need to wait, or it's in an internal loop. If GPU goes to 0%, it's done. In that case I'll try to change the prompt to be more descriptive, e.g. create a list of tasks, don't stop until they are done, update the status. Or I'll paste the prompt and output and what I'm seeing and ask claude code to improve the prompt.
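For readers who want to see the idea without the linked repo, here is a minimal Python sketch of a log-and-forward proxy. It is not the proxy.js from the post: it only uses the port layout described there (listen on 11434, forward to 11435), it buffers each response fully so streaming is not preserved, and every name in it is hypothetical.

```python
import http.server
import sys
import urllib.error
import urllib.request

UPSTREAM = "http://127.0.0.1:11435"  # where the real backend listens (from the post)
LISTEN_PORT = 11434                  # the port openclaw already points at


def preview(body: bytes, limit: int = 500) -> str:
    """Decode a body for logging and cut it to `limit` characters."""
    text = body.decode("utf-8", errors="replace")
    return text if len(text) <= limit else text[:limit] + "...[truncated]"


class LoggingProxy(http.server.BaseHTTPRequestHandler):
    def _forward(self):
        length = int(self.headers.get("Content-Length") or 0)
        body = self.rfile.read(length) if length else None
        print(f">> {self.command} {self.path} {preview(body or b'')}", file=sys.stderr)
        req = urllib.request.Request(UPSTREAM + self.path, data=body, method=self.command)
        if self.headers.get("Content-Type"):
            req.add_header("Content-Type", self.headers["Content-Type"])
        try:
            with urllib.request.urlopen(req) as resp:
                status, payload = resp.status, resp.read()
                ctype = resp.headers.get("Content-Type", "application/json")
        except urllib.error.HTTPError as err:
            # Pass upstream error responses through instead of hiding them.
            status, payload = err.code, err.read()
            ctype = err.headers.get("Content-Type", "application/json")
        print(f"<< {status} {preview(payload)}", file=sys.stderr)
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    do_GET = _forward
    do_POST = _forward

    def log_message(self, *args):
        pass  # silence the default per-request access log


if __name__ == "__main__" and "--serve" in sys.argv:
    http.server.ThreadingHTTPServer(("0.0.0.0", LISTEN_PORT), LoggingProxy).serve_forever()
```

Started with `--serve`, it prints each request and response body (cut to 500 characters) to stderr, which is enough to see what the agent is actually sending to the model.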
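The same busy-or-idle check can be scripted instead of watched by eye. A rough sketch, assuming an NVIDIA card with `nvidia-smi` on the PATH; the 5% idle threshold, the 2-second sampling interval, and the function names are arbitrary choices for illustration.

```python
import subprocess
import sys
import time


def parse_gpu_line(line: str):
    """Parse 'utilization, memory_used' from nvidia-smi CSV output (noheader, nounits)."""
    util, mem = (int(part.strip()) for part in line.split(","))
    return util, mem


def classify(samples, idle_threshold: int = 5) -> str:
    """'idle' if every recent utilization sample is at or below the threshold, else 'busy'."""
    return "idle" if samples and all(u <= idle_threshold for u in samples) else "busy"


def read_gpu():
    """Read utilization (%) and memory used (MiB) for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_line(out.splitlines()[0])


if __name__ == "__main__" and "--watch" in sys.argv:
    recent = []
    while True:
        util, mem = read_gpu()
        recent = (recent + [util])[-5:]  # keep the last five samples
        print(f"gpu {util}% mem {mem} MiB -> {classify(recent)}")
        time.sleep(2)
```

As the post notes, "busy" alone cannot tell useful thinking from an internal loop; the check only confirms whether the model has stopped.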
A few notes on the proxy: the original purpose was to add the think:false or think:true flag to the ollama request, because openclaw isn't (or at least wasn't) passing any thinking settings on to ollama. I wanted to see if it made any difference. I don't think it matters much honestly: think: false turns off the think dialog, so it's a little faster, but it also turns off thinking itself, which might impact the task. My worry was that the think response was going back into the context (it doesn't), so I'm leaving think on.
The other thing the proxy helped me with was seeing if my context was too small, or my prompt too big, or if I was hitting the context size and it had to be compacted. If that happens, things are being removed and your model may lose context and get confused/lost. It also helped me realize how absolutely huge the prompt size is, with just the stuff ollama adds. Make sure your AGENT.md, TOOLs.md, etc. are compact. Go into the dashboard and disable ANY skills you aren't using, otherwise instructions are sent for each one, using up context. I also find it useful to watch the conversation back and forth between the model and openclaw during processing. I've found cases where it was not finding my script, or my tool, and it would spend several iterations trying to find it. I had to update my instructions to give it specific locations, and that reduced the processing further.
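Injecting the flag amounts to rewriting the JSON body before it is forwarded. A minimal sketch, assuming the request body is a JSON object and that Ollama reads a top-level `think` field, as the release note quoted later in this thread describes; `set_think` is a hypothetical helper name.

```python
import json


def set_think(raw_body: bytes, think: bool) -> bytes:
    """Return the request body with a top-level 'think' flag set or overwritten."""
    payload = json.loads(raw_body)
    payload["think"] = think
    return json.dumps(payload).encode("utf-8")


if __name__ == "__main__":
    original = b'{"model": "qwen3.5:9b", "messages": []}'
    print(set_think(original, False).decode("utf-8"))
```

A proxy would call this on each chat request on its way to the backend and leave every other field untouched.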
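A quick way to spot the "context too small" case is to compare the token counts in each response with the configured window. A sketch, assuming Ollama-style responses that carry `prompt_eval_count` and `eval_count` (these can be missing, for example when a prompt is served from cache), and using the 40960 window from the proxy command above; the 0.9 warning threshold is arbitrary.

```python
def context_usage(response: dict, num_ctx: int) -> float:
    """Fraction of the context window used by prompt plus generated tokens."""
    used = response.get("prompt_eval_count", 0) + response.get("eval_count", 0)
    return used / num_ctx


def near_limit(response: dict, num_ctx: int, threshold: float = 0.9) -> bool:
    """True when usage is at or above the warning threshold."""
    return context_usage(response, num_ctx) >= threshold


if __name__ == "__main__":
    # Hypothetical counts, only to show the calculation.
    reply = {"prompt_eval_count": 30000, "eval_count": 720}
    print(context_usage(reply, 40960), near_limit(reply, 40960))
```

If the fraction keeps creeping toward 1.0 across turns, trimming AGENT.md, TOOLs.md, and unused skills is the first thing to try, as described above.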
The think response does go into the context if it’s returned to openclaw - it just disappears on the next user turn (unless this has changed in latest OC)
It’s kind of an annoying behavior because of caching
My understanding was that it is returned to openclaw (in a <think> tag(?)), but ignored and not added to the context, so the only hit if it is on is the cost of transmitting the think data. I'll have to take a closer look. I was looking at this in the context of ollama though and have since switched to llamacpp, so not sure if that changes things.
Actually, rethinking that, I'm confusing context and the prompt data. I guess technically it's using up space temporarily in the context. It would be nice if you could ask it to think quietly to itself!
Well it can’t really do that, since it needs the thinking tokens in context in order to make decisions based on them.
Like I said I don’t know if it’s changed since v3.2 but I have a local change that tries to work around this issue. (I haven’t managed to actually solve it; the best I’ve done is clearing the contents of the think tags in later turns. But the think tags stay, so, it doesn’t really solve the issue.) The problem is that you’ll build up a bunch of agent turns “on top of” that thinking block in the prompt and then your next user turn removes that block, busting the cache and needing to prefill all of the agent turns again.
Remember that there is no difference between the “prompt” and “context”; they are one and the same.
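The cache-busting effect described above can be seen with a toy model of prefix caching. This is a simplification that treats each message as one unit rather than counting real tokens, and the message labels are invented for illustration: a cache can only be reused up to the first position where the new prompt differs from the old one.

```python
def common_prefix_len(a, b) -> int:
    """Number of leading items that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# One agent turn: the think block sits underneath several tool steps.
cached = ["system", "user1", "<think>plan</think>",
          "tool_call1", "tool_result1", "tool_call2", "tool_result2"]

# Next user turn with the think block removed from the history.
stripped = [m for m in cached if not m.startswith("<think>")] + ["user2"]
# Next user turn with the history left untouched.
kept = cached + ["user2"]

for label, prompt in (("think block removed", stripped), ("think block kept", kept)):
    reused = common_prefix_len(cached, prompt)
    print(f"{label}: reuse {reused}, prefill {len(prompt) - reused}")
```

Removing the block leaves only the two messages before it reusable, so every tool step stacked after it has to be prefilled again; leaving the history intact means only the new user message is processed.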
How good is it?
How good is ... ? Ollama? The qwen35 model? As mentioned in other post I've switched to llamacpp from ollama for better performance. The qwen35-35b-a3b is the one I've been using, it's fast and succeeds with the tasks I give it. I've iteratively tweaked the prompts as needed, but generally things just work for what I'm doing (gathering some data via APIs, checking emails, summarizing, emailing or discording status). I should probably retry 9b with these prompts and see how it does.
Sometimes i use ollama to force certain models into cpu/ram and keep llama.cpp and lmstudio for the ones i want in gpu or split a certain way
yeah, that's the bit I'm trying to figure out now.. with llamacpp and qwen3.5:35b-a3b on a 3090 (24gb), the GPU is pretty much maxed.. so I can't just swap models on the fly from openclaw depending on task, they all use 35b. but I do also have ollama running on this host: I use the grepai mcp server from claude code and VS code, and it uses ollama and the nomic-embed-text model to index files, and that'll end up going to CPU. that's actually an idea, maybe I just launch ollama with the env. file that says "no gpu" and just let it rely on CPU... I've got 64 gb system ram in this machine, so ollama can do smaller models for simpler tasks via CPU... it'll take longer, but I'm not waiting for them, they're cron jobs. Ollama can also use the "free" tier cloud models from openclaw. I could configure an agent to use a cloud model as primary and fall back to a local model when I run out of free usage.
Qwen3.6-Plus is free on OpenRouter right now, good rating on ClawEval https://github.com/explaindio/ClawEval
Enjoyed qwen3.6 in Hermes agent. It helped me sort out my issues with openclaw upgrade from 2.26 to 3.31 breaking my clawsuite gateway connection. But now it works
It turns out that the behavior I was talking about is actually fixed seemingly in 3.28
Are you referring to this, or something else?
"Ollama/thinking off: route thinkingLevel=off through the live Ollama extension request path so thinking-capable Ollama models now receive top-level think: false instead of silently generating hidden reasoning tokens. (#53200) Thanks @BruceMacD"
OPEN SOURCE COMMUNITY ON FIRE 🤯

This is the third iteration in the viral Qwen Distilled series (previously known as the massive “Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled” models). The name has officially changed to simply Qwopus 27B

On paper this one is another big leap over v2. Most impressive: it’s the first in the series to outperform the base Qwen 27B on HumanEval while still delivering significant efficiency gains in thinking speed.

Overall benefits include stronger reasoning stability, better correctness, improved cross-task generalization (especially programming), more sophisticated chain-of-thought, and much more usable long-context handling without the base model’s excessive thinking time.

Check it out on hugging face 👇🏻
No, I was talking about the thing I was previously talking about, where it leaves reasoning blocks in the context for the entire agent turn
Which it no longer does so the cache is a lot less busted
I was just curious how you knew it was in 3.28, was looking for something in the release notes.
I used it and observed the behavior
It happened sometime between those two releases
hello! I am brand new to local AI systems and have been battling with my build. This is what I have and what is happening: I'm using a Raspberry Pi 5 with 16GB RAM and a 256 GB SSD (I am on a budget for the first build). I used Ollama to install the 3.28 version of Openclaw and initially selected Qwen 3.5:9b as the model. The model had issues (response time exceeded 7 minutes) and, working with Claude Code, I downgraded to Qwen3.5:4b. This had similar issues. I downgraded to llama3.2:3b and still had response times over 5 minutes. I did end up upgrading to the Openclaw 4.2 release in hopes that it would help. No luck. The question I have for you, the experts, is: "what is the recommended configuration for local models (not just Ollama) on low-power hardware to minimize system prompt overload?" (yes, Claude helped me write that...no shame here). If you have a setup recommendation for something under USD $1200 that does not use the Apple ecosystem (Linux is preferred), I would be most grateful for the information. I work in the M&A world (Gen Xer) as an independent contractor and need multiple agents (and subs) to do some of my repetitive work. Many TIA! 🙂
Any idea what kind of system setup this model would need to run on?
I'm no Raspberry Pi expert, but that's probably just way too weak to run Qwen models. There's no GPU VRAM, right? You'll need to use cloud compute for it
no gpu. i can't use cloud due to sensitive document handling
you're gonna need a beefier GPU / computer to run that, or the Apple ecosystem even if you don't like it 😄. You can probably get a Mac mini with 24-32GB RAM for that money?
i am looking into other possibilities. my husband usually builds our computers and he is all Linux. the last apple i used was an apple 2E... any idea what kind of specs i would need? that Qwopus 27B model looks excellent.
what kind of processing will you need ? you might be able to write scripts that can handle it offline without any LLM at all, if you are extracting text etc
the types of agents I need are scraping, research, database keeper, social posting, and probably some that I can't think of now. not sure if that answers your questions
lead generation is a big part of what i need as well. email management too
rtx 3090 24gb type graphic card is probably what youll want to get, that can run Qwen 3 14B, Gemma 4 26B MoE, etc. at usable speeds
take this with a grain of salt as I had to consult my openclaw 😁 i did check and prices for these cards are in the 800-1000 range on ebay
if you went mac mini route, you are just looking for any M series model that has 24-32gb or more memory and see what you can get for this money. their unified memory is shared with their GPU, which means it can run models decently using system memory. this applies to macs specifically
i asked claude to explain your suggestion of scripts to me. He responded that this may give me 80% of what i need and was an excellent suggestion. strange that validation comes from a terminal now... the question is about openclaw skills and Cron functionality for offline task automation. where would you suggest i glean this type of information? I think my husband and I are realizing that we need to build another system for this purpose. we can't believe the price of chips among other needed items. yikes!
thanks for your expertise and time!
happy to help, report back once you make something, will be interesting to see what you did!
Do I have to use QWen locally?
Nope, you are just limited to models that will fit in your vram (GPU or Mac memory, etc)
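A back-of-the-envelope check for "will it fit" is parameter count times bits per weight. This is a rough sketch: the 9B parameter count is the one reported by api/show earlier in this thread, while the 4.85 bits per weight for a Q4_K_M-style quant and the 2 GB allowance for KV cache and runtime buffers are approximations assumed for illustration. Real usage grows with context length.

```python
def weight_gb(param_count: int, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return param_count * bits_per_weight / 8 / 1e9


def fits(param_count: int, bits_per_weight: float, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if the weights plus a fixed overhead allowance fit in the given VRAM."""
    return weight_gb(param_count, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    QWEN35_9B = 9_653_104_368     # general.parameter_count from api/show
    NOMINAL_27B = 27_000_000_000  # nominal size, not an exact count
    for name, params in (("9b", QWEN35_9B), ("27b", NOMINAL_27B)):
        for vram in (12, 16, 24):
            print(name, f"{weight_gb(params, 4.85):.1f} GB weights,",
                  f"{vram} GB card:", fits(params, 4.85, vram))
```

The estimate lines up with the experience reported in the thread: a 9B model at 4-bit sits comfortably on a 12 GB card, while a 27B model needs the 24 GB class.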
I think you will ultimately need GPU
I'm afraid, AI models that are more capable than judging whether an email might be SPAM need BRAIN and POWER. If you start from scratch and not with an existing desktop PC "needing some upgrade", then $1200 is a very tight budget. If you are really dedicated, you might check out the last chance to get a reduced price for the https://tiiny.ai/ . More like $2000 if you add tax, but definitely better than any RasPi.
I bought a used 16GB VRAM RX 7800 XT for $400 in 2024 (for gaming, not AI) . The card is noted as ~40TOPS on the AMD website.
It turned out I can run Qwen 3.5:9b at ~41 tps, gemma4:e4b at 70 tps, or gpt-oss:20b at 90 tps. That's quite usable. The lowest price I have seen recently for this 16GB card was $480. USED(!)
Still the cheaper option compared to the NVIDIA brothers. Ollama runs perfectly fine on AMD-ROCm.
But speed is not the primary limiting factor. You can run a sufficiently tiny model like qwen3.5:0.8b on a RasPi too without timeouts on every prompt. Quality is. All of those mentioned models feel quite stupid compared to even a cheap cloud model in the 40b range and up, like gpt-oss:120b (needs at least 80GB VRAM) or even Claude Sonnet. 😢
If you are that tight on budget, you'd better wait 3-5 years. Prices will settle, models will grow more efficient and hardware too. But expectations will grow likewise. You might not want to run Qwen 3.5:9b on a $150 device in 2031...
You can however expect models to get generally more efficient on the same hardware
Meanwhile I stumbled across an AI hat (PI AI HAT+ 2 with an HAILO-10H chip) that promises useful "edge device AI" (https://www.youtube.com/watch?v=8dwVnmcZ9v0)
While the grinning, over emotional, artificial presenter is creepy 😖 , the product seems usable for a defined set of work: Image/Video analysis and voice commands. So it depends on your use case. General agentic tasks are not possible with 8GB model space. But if you hook it up to a camera it certainly can do interesting things.
They used Qwen2.5 1.5B model on the chip for the elevator demo.
for those using qwen35 w/ llama.cpp, I've had good luck finally with some releases / fixes from yesterday. I've been on b8466 (from March), was just about to post an issue re: "Failed to parse input at pos 250: <tool_call>" that was introduced in b8467 when chatter about a possible fix showed up, and was merged yesterday. I'm on b8733 now and that problem is gone.
also, silly complaint mode on, but I hate the name qwen or qwen35. I find it the most awkward model name to type, and I always spell it wrong unless I carefully type slowly. usually it comes out qw3n ... complaint mode off!
I posted a note in the local / self-hosted AI channel since it isn't just about qwen, but tl;dr: qwen35 35b a3b continues to be the winner for me on my 24gb card. #1478210877541974157 message
Been trying openqwen for the past couple days. It's very poor with basic instructions and often goes explicitly against things I have previously instructed it on. It's like a little baby that has to have its hand held every step of the way qwen3.5-plus
ps this isn't just to talk smack.. if there's something I'm missing, please enlighten me
ClawEval added Qwen3.6-35B-A3B tests for 59 agent types, and they now have a new token efficiency ranking: https://github.com/AIgenteur/ClawEval
I have yet to see Qwen3.5-14b that is on the short list to be tested.
Do you mean qwen3 14b? For my early 12 gig gpu testing 14b did well, but I found qwen35 9b to be faster, smaller and could get larger context. 14b was a little too tight for 12 gig but would be better on 16?
No, I reference that:
Model Testing Roadmap
This is a living benchmark. We're continuously adding new models as they release. Here's what's on the radar — and we take requests.
16GB VRAM tier (coming soon)
Qwen3-8B, **Qwen3.5-14B**
Phi-4 14B, Phi-4-mini 3.8B
Mistral Small 3.1 24B (tight fit)
GPT-OSS-20B MXFP4
Gemma 3 12B
Ministral 3B (ultra-fast routing/triage)