#Qwen


abstract harbor
mortal roost
#

Commenting here as I am interested in the best providers for this model and self hosting tuning or local benchmarks for different hardware

opaque carbon
#

Qwen 3.5 plus is multi modal

sturdy gyro
#

Has anyone tried the small Qwen3.5 models yet? They came out yesterday for us card-poor people

opaque carbon
#

Same question. I can't test the new local models this week: 27b and 35b a3b q4km 🥺

#

I want the best AI for a 5070 Ti for light coding, chat, translation, heartbeat... I can use ollama cloud for free and I already pay for gemini pro. I also took the qwen coding plan this month

sturdy gyro
#

Oh I have a 5070 TI 😛

#

Last model I used was the Qwen3 14B Q6

#

I am hoping that 3.5 9B Q8 would beat it.

#

My focus is HEARTBEAT exclusively. I can main claude sonnet and cheat on the heartbeat with Qwen on the down low. At least that is the plan

brittle garden
#

Just gave 122b a whirl, q4 on a 6000ada with main memory spillover. Much slower but maybe also much smarter

drifting mountain
#

Have been trying out 122b and 35b on my M4 Max 128GB, but so far it's been pretty terrible at tool calling and keeps getting stuck in loops

elfin quest
#

did you apply the tool calling fix template?

brittle garden
#

Yeah I don’t know what you mean, my 122b is super smart and solving tons of stuff that 35b was just getting stuck on

near ridge
#

4070 (12 gb) here, qwen3:8b and 14b work for me for my tasks.. 14b is a little tight but seems to work (but since 8b seems to work not a reason to use 14b). I started playing with qwen3.5:9b the other day, and that worked well too, so I've been using it in place of qwen3:8b. I'm going to be replacing the 4070 with a 3090 (24 gb), hoping 24 gb will be the sweet spot. I could then try qwen3.5:27b and some other larger models. I'm using both with ollama. I should add that I'm currently using the minimax starter plan (about $9 a month) as my main driver, but for scheduled tasks, qwen3.5 is doing the work. it'll take a little longer than minimax but I'm not waiting on it, so doesn't matter. I did find when I went from minimax to qwen variants, I had to tighten up my instructions to be a little clearer. where minimax would just figure out what I meant when not clear, qwen would get confused or give up or misinterpret. being more clear got things working.

tight minnow
brittle garden
#

try 3.1 for now

drifting mountain
#

I'm serving the models with lmstudio, the tool calls themselves seem to work, just hasn't been good at recovering from errors and figuring out how to use them correctly

west vale
#

using qwen3.5:27b cant seem to remember its own soul

sturdy gyro
modest hound
#

I'm also having pretty meh results with qwen 3.5 9b. Often doesn't reply unless I ask if successful. I did have to apply the template fix for the multi step tool bug so at least it responds sometimes

#

And at least it is mostly successful for simple queries, just don't know why it chooses to be silent about results

west vale
opaque carbon
#

But work is done

modest hound
#

I just tried switching to openai-responses to see if that does better tool calling. I think it was getting stuck in loops.

near ridge
#

for the ollama provider (on another host) I'm following the openclaw.ai docs https://docs.openclaw.ai/providers/ollama#streaming-configuration, specifically not using the v1 endpoint, using the ollama api (not openai-completions), and using large contexts. my openclaw config info:
"models": {
  "mode": "merge",
  "providers": {
    "ollama": {
      "baseUrl": "http://192.168.86.60:11434",
      "apiKey": {
        "source": "env",
        "provider": "default",
        "id": "OLLAMA_API_KEY"
      },
      "api": "ollama",
      "models": [
        {
          "id": "qwen3.5:9b",
          "name": "Qwen3.5 9B (Ollama, local GPU)",
          "reasoning": true,
          "input": [
            "text"
          ],
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          },
          "contextWindow": 262144,
          "maxTokens": 2621440
        },

context length is determined from the context length returned from the api/show endpoint. e.g.
curl http://localhost:11434/api/show -d '{"name":"qwen3.5:9b"}' | python -m json.tool
. . .
"model_info": {
  "general.architecture": "qwen35",
  "general.file_type": 15,
  "general.parameter_count": 9653104368,
  "general.quantization_version": 2,
  "qwen35.attention.head_count": 16,
  "qwen35.attention.head_count_kv": null,
  "qwen35.attention.key_length": 256,
  "qwen35.attention.layer_norm_rms_epsilon": 1e-06,
  "qwen35.attention.value_length": 256,
  "qwen35.block_count": 32,
  "qwen35.context_length": 262144,
. . .
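For anyone who wants to read that value from a script rather than eyeballing the JSON, here is a minimal Python sketch. It assumes the response has the shape pasted above (a `model_info` object whose context key is prefixed with the architecture name). The helper names `extract_context_length` and `fetch_show` are made up for this example.

```python
import json
import urllib.request


def extract_context_length(show_response: dict):
    """Return the '<architecture>.context_length' value from an /api/show reply, or None."""
    info = show_response.get("model_info", {})
    arch = info.get("general.architecture")
    if arch is not None and f"{arch}.context_length" in info:
        return info[f"{arch}.context_length"]
    # Fall back to any key ending in '.context_length' if the architecture key is missing.
    for key, value in info.items():
        if key.endswith(".context_length"):
            return value
    return None


def fetch_show(base_url: str, model: str) -> dict:
    """POST to /api/show with the same body as the curl command above."""
    payload = json.dumps({"name": model}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Trimmed copy of the output pasted above, so the parsing can be tried offline.
sample = {
    "model_info": {
        "general.architecture": "qwen35",
        "qwen35.block_count": 32,
        "qwen35.context_length": 262144,
    }
}
print(extract_context_length(sample))  # 262144
```

Against a live server it would be called as `extract_context_length(fetch_show("http://localhost:11434", "qwen3.5:9b"))`.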

opaque depot
modest hound
#

It's intermittent, not sure how to debug or if it's just a problem for the poors

#

Or you might be able to find posts made by me here, I posted the fix in a krill support thread

opaque depot
modest hound
#

I'm using 9b with 48k context and it ranges from impressive to absolute ass. I'm prob shoving too much context into it

#

The unresponsiveness is driving me nuts tho. I ask if it was successful and sometimes it responds with all the stuff it did (accurately mostly)

cunning wharf
#

Did you manage to get 122b or 397b working with reliable tool calls and responses? I'm currently using minimax m2.5 and looking for an upgrade.

acoustic holly
#

qwenny is not all that bright 🙂 two things that help. in heart beat I put in "After tools run check and process output". that was something I found in a reference how they made qwen perform (quite a lot better) in a benchmark by just telling it to check its own output.

you need to check the output of the browser ui with thinking enabled. Sometimes you have to be very precise what it needs to do, e.g. I change my one command to Execute EXACTLY: bash /home/me/scripts/run_pipeline.sh. (CRITICAL: DO NOT use python3 to run a .sh file!) because it got confused.

if you say "trigger cron job XXX" it also takes 5 steps running a "trigger" command then openclaw help. instead say run cron XXX. then it works straight up.

so debug what is going on or be very explicit with tools.

tight minnow
brittle garden
#

but that'd have to hit a lot of sessions to be useful

brittle garden
#

initial results v promising

acoustic holly
acoustic holly
#

after the update. seems like things are not as stable as before. have not seen output like this before

"The script is still running. Let me wait a bit and poll it again.

<tool_call>

poll

mild-river

30000

</tool_call>"

acoustic holly
#

switched to llama.cpp with qwen and things are working better

#

better perf and more reliable tools. although it did overwrite my entire skills folder once.... luckily it was a git repo

#

it did restore it after I pointed it out

"All done! Your skills directory is fully restored. That was my fault - when I did cp -r for the new skill, it overwrote the entire directory instead of just adding the new folder. "

neat phoenix
#

did qwen just remove their promo on their coding plans? crap i was just going to go for it to try it out.

acoustic holly
#

yeah right now its a really good summary model, with some basic tooling if you are specific about what it should do

near ridge
#

Switched to llamacpp from ollama, what a difference: now getting 126 tps with qwen3.5-35b-a3b (3000 tps prompt load), vs 33 tps (1000 tps prompt load) with qwen 3.5 27b. Ollama speeds were half those numbers. 35b is usable as primary at that speed (65K context). This is on a 3090 (24 gb). With 27b I was able to get up to around 150k context.
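To put those speeds in terms of wall-clock time per turn, a rough back-of-the-envelope sketch in Python follows. The tps figures are the ones quoted above; the 20,000-token prompt and 500-token reply are invented workload numbers, and the estimate ignores prompt caching and any thinking tokens.

```python
def estimate_turn_seconds(prompt_tokens, output_tokens, prompt_tps, generation_tps):
    """Seconds to ingest the prompt plus seconds to generate the reply."""
    return prompt_tokens / prompt_tps + output_tokens / generation_tps


# Speeds quoted above; the prompt and reply sizes are made-up workload numbers.
moe_35b = estimate_turn_seconds(20_000, 500, prompt_tps=3000, generation_tps=126)
dense_27b = estimate_turn_seconds(20_000, 500, prompt_tps=1000, generation_tps=33)
print(f"35b-a3b: {moe_35b:.1f}s, 27b: {dense_27b:.1f}s")
```

Under those assumptions the 35b-a3b turn takes roughly 11 seconds and the 27b turn roughly 35 seconds, which is about the difference between "usable as primary" and "fine for cron jobs".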

acoustic holly
#

so still struggling to get qwen to be reproducible. e.g. "append your conclusion to conclusions.json" is interpreted very differently every time. sometimes it does append, sometimes it writes to conclusions_new.json or other files. so that sort of thing is not great
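One way to take the interpretation out of a step like that is to give the model a single script to run instead of a free-form file instruction. Here is a minimal Python sketch, assuming `conclusions.json` holds a JSON array (the thread does not say what the file looks like). The function name `append_conclusion` is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def append_conclusion(path, conclusion):
    """Append one entry to a JSON array file, creating the file if needed."""
    target = Path(path)
    entries = json.loads(target.read_text(encoding="utf-8")) if target.exists() else []
    if not isinstance(entries, list):
        raise ValueError(f"{target} does not contain a JSON array")
    entries.append(conclusion)
    # Write to a temp file in the same directory, then rename it into place,
    # so a crash never leaves a half-written file behind.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(entries, handle, indent=2)
    os.replace(tmp_name, target)
    return len(entries)
```

The prompt can then say something like "run the append script with your conclusion" and the file name is no longer up to the model.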

brittle garden
#

which qwen

dapper breach
#

my qwen 3.5 9b on my mac mini m4 pro 24gb of ram keeps stalling. it just stops mid way through a task and I have to keep reminding it to complete the task. is there any fix for this? also I get 25tok/s

#

I would love someone's help with this. I am new to this and love learning and passing along info to others to help as well

near ridge
# dapper breach my qwen 3.5 9b on my mac mini m4 pro 24gb of ram keeps stalling. it just stops m...

With the 24gb you might be able to run the 35b-a3b, or the 27b, though it depends on how much your OS is leaving so maybe not. Is it actually stalling, or getting into a thinking loop? What are you running the model on: ollama, lm studio, llamacpp, etc.? I've found a few things to be helpful. I have openclaw running on an ubuntu machine, but my GPU is in a windows machine, with 24gb gpu. I was using ollama originally but have switched to llamacpp. I had claude code write me a proxy that I can run in front of either ollama or llamacpp so I can watch the traffic. so if openclaw is pointing to 11434 for ollama, I run ollama on port 11435 and have the proxy listen on 11434, and it logs and forwards to 11435. I've put a copy of it here: https://github.com/khaney64/llm-stuff. So while debugging a skill or script or task in openclaw, I'll watch the proxy output. I run it with something like this:
**node .\proxy.js --filter-thinking --dump-messages --message-size 500 --default-ctx 40960 --thinking --log-file --backend llamacpp**
I'll watch that, and I'll watch the GPU and CPU percentage in the windows task manager performance tab. when it's thinking, I'll see GPU ram go up for ollama (or it's already loaded for llamacpp), and I'll see GPU percentage go up. If it seems stuck and the GPU is still going, either it's still thinking and I need to wait, or it's in an internal loop. if GPU goes to 0%, it's done. in that case I'll try to change the prompt to be more descriptive, e.g. create a list of tasks, don't stop until they are done, update the status. Or I'll paste the prompt and output and what I'm seeing and ask claude code to improve the prompt.
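For readers who just want to see the shape of the idea, here is a minimal Python sketch of a log-and-forward proxy. It is not the code from the linked repo. It assumes the port layout described above (listen on 11434, real server on 11435) and only handles POST requests with JSON bodies. The names `summarize_request`, `LoggingProxy`, and `make_proxy` are made up for this example.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BACKEND = "http://127.0.0.1:11435"  # where the real model server was moved to


def summarize_request(body: bytes, message_size: int = 500) -> str:
    """One log line per chat message: role plus the first message_size characters."""
    try:
        payload = json.loads(body or b"{}")
    except ValueError:
        return f"<non-JSON body, {len(body)} bytes>"
    if not isinstance(payload, dict):
        return "<JSON body is not an object>"
    lines = [f"model={payload.get('model')}"]
    for message in payload.get("messages", []):
        content = str(message.get("content", ""))
        lines.append(f"  {message.get('role')}: {content[:message_size]}")
    return "\n".join(lines)


class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(summarize_request(body))
        upstream = urllib.request.Request(
            BACKEND + self.path,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        # Error responses from the backend raise here; a real proxy would relay them.
        with urllib.request.urlopen(upstream) as response:
            self.send_response(response.status)
            self.send_header(
                "Content-Type",
                response.headers.get("Content-Type", "application/json"),
            )
            self.end_headers()
            # Relay line by line so newline-delimited streamed replies keep flowing.
            for line in response:
                self.wfile.write(line)
                self.wfile.flush()


def make_proxy(listen_port: int = 11434) -> ThreadingHTTPServer:
    """Build the server; call .serve_forever() on the result to start it."""
    return ThreadingHTTPServer(("0.0.0.0", listen_port), LoggingProxy)
```

Starting it is `make_proxy().serve_forever()`. GET endpoints, backend error codes, and thinking filters are left out to keep the sketch short.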

near ridge
# dapper breach I would love someones help with this. I am new to this and love learning and pas...

a few notes on the proxy - the original purpose was to add the think:false or think:true flag to the ollama prompt, because openclaw isn't (or at least hasn't been) passing on any thinking settings to ollama. I wanted to see if it made any difference. I don't think it matters much honestly. think: false turns off the think dialog, so it's a little faster, but it also turns off thinking, which might impact the task. my worry was that the think response was going back into the context (it doesn't), so I'm leaving think on.
The other thing the proxy helped me with was seeing if my context was too small, or my prompt too big, or I was hitting context size and it had to be compacted - if that happens, things are being removed and your model may lose context and get confused/lost. it also helped me realize how absolutely huge the prompt size is, with just the stuff ollama adds. make sure your AGENT.md, TOOLS.md, etc. are compact. go into the dashboard and disable ANY skills you aren't using, otherwise instructions are sent for each one, using up context size. I also find it useful to watch the conversation/back and forth between the model and openclaw during the processing. I've found cases where it was not finding my script, or my tool, and it would spend several iterations trying to find it. I had to update my instructions to give it specific locations, and that reduced the processing more.
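The think-flag part of that is small enough to sketch on its own. The Python below is not the proxy's actual code. It assumes the backend accepts a top-level boolean `think` field in the JSON request body, as described above, and the function name `set_think_flag` is hypothetical.

```python
import json


def set_think_flag(body: bytes, think: bool) -> bytes:
    """Return the request body with a top-level "think" field set.

    Bodies that are not JSON objects are passed through unchanged.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return body
    if not isinstance(payload, dict):
        return body
    payload["think"] = think
    return json.dumps(payload).encode("utf-8")
```

A proxy would call this on each request body just before forwarding it, which makes it easy to flip the setting and compare runs without touching the openclaw config.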

brittle garden
#

The think response does go into the context if it’s returned to openclaw - it just disappears on the next user turn (unless this has changed in latest OC)

#

It’s kind of an annoying behavior because of caching

near ridge
#

Actually, rethinking that, I'm confusing context and the prompt data. I guess technically it's using up space temporarily in the context. It would be nice if you could ask it to think quietly to itself!

brittle garden
# near ridge Actually, rethinking that, I'm confusing context and the prompt data. I guess t...

Well it can’t really do that, since it needs the thinking tokens in context in order to make decisions based on them.

Like I said I don’t know if it’s changed since v3.2 but I have a local change that tries to work around this issue. (I haven’t managed to actually solve it; the best I’ve done is clearing the contents of the think tags in later turns. But the think tags stay, so, it doesn’t really solve the issue.) The problem is that you’ll build up a bunch of agent turns “on top of” that thinking block in the prompt and then your next user turn removes that block, busting the cache and needing to prefill all of the agent turns again.

Remember that there is no difference between the “prompt” and “context”; they are one and the same.
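A toy Python illustration of the cache-busting effect described above. It assumes the server can only reuse the longest common prefix of the previous and current prompt, and it uses one string per block as a stand-in for that block's tokens. The block names are invented.

```python
def shared_prefix_len(previous, current):
    """Number of leading items that are identical in both sequences."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


# Each string stands in for the tokens of one block of the prompt.
before_user_turn = ["system", "user-1", "<think>plan</think>", "tool-call-1", "tool-result-1", "answer-1"]
# On the next user turn the thinking block is dropped and a new message is added.
after_user_turn = ["system", "user-1", "tool-call-1", "tool-result-1", "answer-1", "user-2"]

reusable = shared_prefix_len(before_user_turn, after_user_turn)
print(f"reusable blocks: {reusable} of {len(after_user_turn)}")  # reusable blocks: 2 of 6
```

Only `system` and `user-1` survive as a cached prefix. The tool call, tool result, and answer are unchanged but sit after the removed block, so they all have to be prefilled again.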

near ridge
# glacial needle How good is it?

How good is ... ? Ollama? The qwen35 model? As mentioned in other post I've switched to llamacpp from ollama for better performance. The qwen35-35b-a3b is the one I've been using, it's fast and succeeds with the tasks I give it. I've iteratively tweaked the prompts as needed, but generally things just work for what I'm doing (gathering some data via APIs, checking emails, summarizing, emailing or discording status). I should probably retry 9b with these prompts and see how it does.

mortal roost
#

Sometimes i use ollama to force certain models into cpu/ram and keep llama.cpp and lmstudio for the ones i want in gpu or split a certain way

near ridge
# mortal roost Sometimes i use ollama to force certain models into cpu/ram and keep llama.cpp a...

yeah, that's the bit I'm trying to figure out now.. with llamacpp and qwen3.5:35b-a3b on the 3090 (24gb) the GPU is pretty much maxed.. so I can't just swap models on the fly from openclaw depending on task, they all use 35b. but I do also have ollama running on this host. I use the grepai mcp server from claude code and VS code, and it uses ollama and the nomic-embed-text model to index files, and that'll end up going to CPU. that's actually an idea, maybe I just launch ollama with the env. file that says "no gpu" and just let it rely on CPU... I've got 64 gb system ram in this machine - let ollama do smaller models for simpler tasks via CPU... it'll take longer, but I'm not waiting for them, they're cron jobs. Ollama can also use the "free" tier cloud models from openclaw. I could configure an agent to use a cloud model as primary and fall back to a local model when I run out of free.
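A sketch of what that "no gpu" launch could look like from Python. It assumes an NVIDIA card, where pointing `CUDA_VISIBLE_DEVICES` at an invalid ID hides the GPU from the process, and it assumes the second instance should listen on its own address via `OLLAMA_HOST`. The port 11436 and the function name `cpu_only_env` are made up for this example.

```python
import os


def cpu_only_env(base_env=None, host="127.0.0.1:11436"):
    """Copy of the environment that hides CUDA GPUs and moves the listen address."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = "-1"  # invalid GPU ID, so no NVIDIA GPU is visible
    env["OLLAMA_HOST"] = host           # hypothetical address for the CPU-only instance
    return env


# Usage (not run here):
#   import subprocess
#   subprocess.Popen(["ollama", "serve"], env=cpu_only_env())
```

The GPU-bound llamacpp server keeps its own port, and the CPU-only instance can be registered in openclaw as a separate provider for the cron-style jobs.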

tight minnow
mortal roost
#

Enjoyed qwen3.6 in Hermes agent. It helped me sort out my issues with openclaw upgrade from 2.26 to 3.31 breaking my clawsuite gateway connection. But now it works

brittle garden
near ridge
icy zenithBOT
#

OPEN SOURCE COMMUNITY ON FIRE 🤯

This is the third iteration in the viral Qwen Distilled series (previously known as the massive “Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled” models). The name has officially changed to simply Qwopus 27B

On paper this one is another big leap over v2. Most impressive: it’s the first in the series to outperform the base Qwen 27B on HumanEval while still delivering significant efficiency gains in thinking speed.

Overall benefits include stronger reasoning stability, better correctness, improved cross-task generalization (especially programming), more sophisticated chain-of-thought, and much more usable long-context handling without the base model’s excessive thinking time.

Check it out on hugging face 👇🏻


brittle garden
#

Which it no longer does so the cache is a lot less busted

near ridge
brittle garden
#

I used it and observed the behavior

#

It happened sometime between those two releases

snow saffron
#

hello! i am brand new to local AI systems and have been battling with my build. this is what I have and what is happening: I'm using a Raspberry Pi 5 with 16GB RAM and a 256 GB SSD (I am on a budget for the first build). I used Ollama to install the 3.28 version of Openclaw and initially selected Qwen 3.5:9b as the model. the model had issues (response time exceeded 7 minutes) and, working with Claude code, I downgraded to Qwen3.5:4b. This had similar issues. I downgraded to llama3.2:3b and still had response times over 5 minutes. I did end up upgrading to the Openclaw 4.2 release in hopes that it would help. No luck. The question I have for you, the experts, is: "what is the recommended configuration for local models (not just Ollama) on low-power hardware to minimize system prompt overload?" (yes, claude helped me write that...no shame here). If you have a setup recommendation for something under $USD1200 that does not use the Apple ecosystem (Linux is preferred), I would be most grateful for the information. I work in the M&A world (Gen Xer) as an independent contractor and need multiple agents (and subs) to do some of my repetitive work. Many TIA! 🙂

snow saffron
grand nymph
#

I'm no Raspberry Pi expert but that's probably just way too weak to run qwen models. there's no gpu vram, right? you'll need to use cloud compute for it

snow saffron
grand nymph
snow saffron
grand nymph
#

what kind of processing will you need ? you might be able to write scripts that can handle it offline without any LLM at all, if you are extracting text etc

snow saffron
#

the types of agents I need are scraping, research, database keeper, social posting, and probably some that I can't think of now. not sure if that answers your questions

#

lead generation is a big part of what i need as well. email management too

grand nymph
#

rtx 3090 24gb type graphic card is probably what youll want to get, that can run Qwen 3 14B, Gemma 4 26B MoE, etc. at usable speeds

take this with a grain of salt as I had to consult my openclaw 😁 i did check and prices for these cards are in the 800-1000 range on ebay

if you went mac mini route, you are just looking for any M series model that has 24-32gb or more memory and see what you can get for this money. their unified memory is shared with their GPU, which means it can run models decently using system memory. this applies to macs specifically
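A rough way to sanity-check whether a model fits in a given amount of VRAM or unified memory is to estimate the size of the weights from the parameter count and the quantization. The Python sketch below uses the `general.parameter_count` that `/api/show` reported for qwen3.5:9b earlier in the thread. The bits-per-weight figures are assumed averages, not measured values, and the KV cache and runtime overhead come on top and grow with context length.

```python
def weight_memory_gb(parameter_count, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return parameter_count * bits_per_weight / 8 / 1e9


# parameter_count for qwen3.5:9b as reported by /api/show earlier in the thread.
# 4.8 and 8.5 bits per weight are assumed averages for 4-bit and 8-bit quants.
q4 = weight_memory_gb(9_653_104_368, 4.8)
q8 = weight_memory_gb(9_653_104_368, 8.5)
print(f"~{q4:.1f} GB at ~4.8 bpw, ~{q8:.1f} GB at ~8.5 bpw")
```

With those assumptions the 9b weights come to a bit under 6 GB at 4-bit and a bit over 10 GB at 8-bit, which is consistent with the 12 GB and 16 GB card reports above once some context is added.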

snow saffron
#

i asked claude to explain your suggestion of scripts to me. He responded that this may give me 80% of what i need and was an excellent suggestion. strange that validation comes from a terminal now... the question is about openclaw skills and Cron functionality for offline task automation. where would you suggest i glean this type of information? I think my husband and I are realizing that we need to build another system for this purpose. we can't believe the price of chips among other needed items. yikes!

snow saffron
grand nymph
grim oak
#

Do I have to use QWen locally?

near ridge
glacial needle
muted anchor
# snow saffron hello! i am brand new to local AI systems and have been battling with my build....

I'm afraid, AI models that are more capable than judging whether an email might be SPAM need BRAIN and POWER. If you start from scratch and not with an existing desktop PC "needing some upgrade", then $1200 is a very tight budget. If you are really dedicated, you might check out the last chance to get a reduced price for the https://tiiny.ai/ . More like $2000 if you add tax, but definitely better than any RasPi.
I bought a used 16GB VRAM RX 7800 XT for $400 in 2024 (for gaming, not AI) . The card is noted as ~40TOPS on the AMD website.

It turned out I can run Qwen 3.5:9b at ~41Tps, gemma4:e4b at 70Tps or gpt-oss:20b at 90Tps. That's quite usable. The lowest I have seen recently for this 16GB card was $480. USED(!)
Still the cheaper option compared to the NVIDIA brothers. Ollama runs perfectly fine on AMD-ROCm.

But speed is not the primary limiting factor. You can run a sufficiently tiny model like qwen3.5:0.8b on a RasPi too without timeouts on every prompt. Quality is. All of those mentioned models feel quite stupid compared to even a cheap cloud model in the 40b range and up, like gpt-oss:120b (needs at least 80GB VRAM) or even Claude Sonnet. 😢

If you are that tight on budget, you'd better wait 3-5 years. Prices will settle, models will grow more efficient and hardware too. But expectations will grow likewise. You might not want to run Qwen 3.5:9b on a $150 device in 2031...

brittle garden
#

You can however expect models to get generally more efficient on the same hardware

muted anchor
# snow saffron hello! i am brand new to local AI systems and have been battling with my build....

Meanwhile I stumbled across an AI hat (PI AI HAT+ 2 with an HAILO-10H chip) that promises useful "edge device AI" (https://www.youtube.com/watch?v=8dwVnmcZ9v0)
While the grinning, over emotional, artificial presenter is creepy 😖 , the product seems usable for a defined set of work: Image/Video analysis and voice commands. So it depends on your use case. General agentic tasks are not possible with 8GB model space. But if you hook it up to a camera it certainly can do interesting things.

They used Qwen2.5 1.5B model on the chip for the elevator demo.

near ridge
#

for those using qwen35 w/ llama.cpp, I've had good luck finally with some releases / fixes from yesterday. I've been on b8466 (from March), was just about to post an issue re: "Failed to parse input at pos 250: <tool_call>" that was introduced in b8467 when chatter about a possible fix showed up, and was merged yesterday. I'm on b8733 now and that problem is gone.

#

also, silly complaint mode on, but I hate the name qwen or qwen35 - I find it the most awkward model name to type, and I always spell it wrong unless I carefully type slowly. usually it comes out qw3n ... complaint mode off!

near ridge
#

a note in the local / selfhosted AI channel as it isn't just about qwen, but tldr; qwen35 35b a3b continues to be the winner for me on my 24gb card. #1478210877541974157 message

quartz gazelle
#

Been trying openqwen for the past couple days. It's very poor with basic instructions and often goes explicitly against things I have previously instructed it on. It's like a little baby that has to have its hand held every step of the way qwen3.5-plus

ps this isn't just to talk smack.. if there's something I'm missing, please enlighten me

tight minnow
muted anchor
#

I have yet to see Qwen3.5-14b that is on the short list to be tested.

near ridge
muted anchor
brittle garden
#

Ok but qwen3.6-35b is really good

#

Like I'm used to my fast model being vapid and just basically a note taker for my slow model, but this one is actually having legitimate insights about the design I'm working on

#

Alibaba, release the Qwen3.6-397b