#OpenClaw Primary Model Evaluation - 2026-04-09
nicer
I don't use openclaw to build apps & websites btw. Only as an "AI assistant"
I'm going to try Gemma4 and see what I think. I'll keep my local Qwen3.5 as a fallback
Flash better than GPT-5.4 mini? No way.
What's going on in this thread? Is there a benchmark or sth?
Is it local specific?
it was an evaluation, mostly on the models available via Ollama, based on how well they performed on the tasks as outlined in "Evaluation Dimensions"
Cloud and local... but Qwen was the only one that is local... it was mostly a control, and I did not expect it to perform as well as it did.
I also didn't expect GPT to score as low as it did
I'm not making any claims that this is a be-all-end-all test of all models in every possible situation. This was an isolated, automated evaluation run by Opus 4.6 against the models I gave it. It devised the tests and ran them without my intervention, except for the "human evaluation" parts, in which it asked some open-ended questions, both technical and conversational, and I rated the answers from 1-5.
I only just now realized that my whole Evaluation document didn't come through 😬
It was markdown, and the server denied all of it... woof.
OpenClaw Primary Model Evaluation - 2026-04-09
BACKGROUND
With Anthropic models no longer available via OpenClaw, we needed to find a
replacement primary model for our homelab assistant. GPT-5.4 has been filling
in but has pain points around tool use reliability, instruction following, and
reasoning depth.
We built an automated evaluation harness to fairly compare 8 candidate models
across 4 dimensions, using real homelab tool schemas (HA, Proxmox, TrueNAS,
DNS, UniFi) and a mix of auto-scoring and human review.
MODELS TESTED
Model                Provider             Notes
qwen3.5:cloud        Ollama Cloud Max
gemma4:31b-cloud     Ollama Cloud Max
minimax-m2.7:cloud   Ollama Cloud Max     Coding/agentic focus
deepseek-v3.2:cloud  Ollama Cloud Max
glm-5.1:cloud        Ollama Cloud Max
kimi-k2.5:cloud      Ollama Cloud Max     Already a fallback via Moonshot
gpt-5.4              OpenAI Pro           Current "okay" model; the bar to beat
qwen3.5-27b-local    Local Ollama (5090)  Local fallback baseline
EVALUATION DIMENSIONS
- Tool Use Reliability (40% weight): 5 prompts testing single, multi-param,
  sequential, chained, and simple tool calls. Auto-scored: valid JSON, correct
  tool name, required params, ordering.
- Instruction Following (25% weight): 5 prompts testing prompt leak
  resistance, format compliance, staying in character, word limits,
  English-only responses. Auto-scored: regex/rule-based pass/fail.
- Reasoning Quality (25% weight): 3 prompts on Proxmox HA planning, DNS
  debugging, and containers-vs-VMs tradeoffs. Human-rated 1-5.
- Conversational Quality (10% weight): 3 prompts: greeting, follow-up, humor.
  Human-rated 1-5.
Total: 16 prompts x 8 models = 128 responses
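The auto-scored tool-use checks listed above (valid JSON, correct tool name, required params, ordering) could be sketched roughly as follows. This is not the harness Opus built; the tool names, the `name`/`arguments` response shape, and the helper names are all hypothetical and only illustrate the kind of rule-based check described.

```python
import json

# Hypothetical tool schemas: names and parameters are illustrative only,
# not the real homelab schemas used in the evaluation.
TOOLS = {
    "ha_call_service": {"required": ["domain", "service", "entity_id"]},
    "dns_lookup": {"required": ["hostname"]},
}


def score_tool_call(raw: str, expected_tool: str) -> dict:
    """Run the per-call checks on one raw model response."""
    checks = {"valid_json": False, "correct_tool": False, "required_params": False}
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return checks
    # Assumption: a tool call must parse as a JSON object, not a list or scalar.
    if not isinstance(call, dict):
        return checks
    checks["valid_json"] = True
    checks["correct_tool"] = call.get("name") == expected_tool
    args = call.get("arguments", {})
    required = TOOLS[expected_tool]["required"]
    checks["required_params"] = isinstance(args, dict) and all(
        param in args for param in required
    )
    return checks


def passed(checks: dict) -> bool:
    """A prompt passes only if every individual check passed."""
    return all(checks.values())


def correct_order(call_names: list[str], expected: list[str]) -> bool:
    """True if the expected tool names occur in call_names in that order."""
    remaining = iter(call_names)
    # `name in remaining` consumes the iterator, so order is enforced.
    return all(name in remaining for name in expected)
```

A sequential or chained prompt would combine `score_tool_call` on each call with `correct_order` on the list of tool names the model emitted.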
RESULTS
Final Rankings:
Rank  Model                Tool  Instr  Reas  Conv  Composite  Latency
1     qwen3.5-27b-local    80%   100%   100%  87%   90.7%      19.4s
2     gemma4:31b-cloud     80%   100%   100%  60%   88.0%      7.7s
3     qwen3.5:cloud        80%   100%   100%  53%   87.3%      27.0s
4     glm-5.1:cloud        80%   100%   73%   73%   82.7%      71.2s
5     gpt-5.4              80%   80%    67%   53%   74.0%      14.6s
6     minimax-m2.7:cloud   60%   80%    67%   67%   67.3%      10.2s
7     kimi-k2.5:cloud      40%   100%   67%   47%   62.3%      18.1s
8     deepseek-v3.2:cloud  20%   100%   40%   47%   47.7%      15.6s
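The Composite column appears to be the weighted sum of the four dimension scores, using the weights from EVALUATION DIMENSIONS. A minimal sketch, assuming each human-rated dimension is scored out of 15 points (3 prompts at up to 5 each) and that the composite is computed from unrounded fractions. The function name is hypothetical, and only some raw point totals are stated in the write-up (for example deepseek-v3.2's 2/1/3); the others used here are inferred from the rounded percentages.

```python
# Dimension weights as stated under EVALUATION DIMENSIONS.
WEIGHTS = {"tool": 0.40, "instr": 0.25, "reas": 0.25, "conv": 0.10}


def composite(tool_passed: int, instr_passed: int,
              reas_points: int, conv_points: int) -> float:
    """Weighted composite score in percent, rounded to one decimal.

    tool_passed / instr_passed: prompts passed out of 5 (auto-scored).
    reas_points / conv_points: summed human ratings out of 15 (assumed).
    """
    fractions = {
        "tool": tool_passed / 5,
        "instr": instr_passed / 5,
        "reas": reas_points / 15,
        "conv": conv_points / 15,
    }
    return round(100 * sum(WEIGHTS[k] * fractions[k] for k in WEIGHTS), 1)


# qwen3.5-27b-local: 4/5 tool, 5/5 instr, 15/15 reasoning, 13/15 conv (inferred)
print(composite(4, 5, 15, 13))  # 90.7
# deepseek-v3.2:cloud: 1/5 tool, 5/5 instr, 2+1+3 reasoning, 7/15 conv (inferred)
print(composite(1, 5, 6, 7))    # 47.7
```

Working from the raw fractions rather than the rounded percentages matters for rows such as glm-5.1, where 11/15 displays as 73% but the composite still comes out at 82.7%.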
Tool Use Breakdown (which prompts failed):
Model                Score  Failed Prompts
qwen3.5-27b-local    4/5    Multi-param PVE (memory unit mismatch)
gemma4:31b-cloud     4/5    Single HA call
qwen3.5:cloud        4/5    Single HA call
glm-5.1:cloud        4/5    Chained DNS
gpt-5.4              4/5    Chained DNS
minimax-m2.7:cloud   3/5    Single HA, Chained DNS
kimi-k2.5:cloud      2/5    Single HA, Chained DNS, UniFi
deepseek-v3.2:cloud  1/5    4 of 5 failed; only passed UniFi
Instruction Following Breakdown:
All models scored 100% except:
- gpt-5.4 (4/5): failed stay-in-character (answered an off-topic personal
  question instead of deflecting)
- minimax-m2.7:cloud (4/5): same failure
Reasoning Highlights (Human-Rated, 1-5 scale):
Top 3 models all scored 5/5/5 on the three reasoning prompts (Proxmox HA
planning, DNS debugging, containers vs VMs):
- qwen3.5-27b-local
- gemma4:31b-cloud
- qwen3.5:cloud
Weakest reasoning: deepseek-v3.2 (2/1/3), with shallow answers that missed
practical considerations.
Conversational Quality Notes:
The greeting prompt ("Hey Jarvis, good morning! How's the lab looking today?")
was an unintentional hallucination test: models had no tools or data to check
lab status. Models that scored 1 made up fake status reports. Models that
scored 5 (qwen3.5-27b-local, glm-5.1, minimax-m2.7) acknowledged they couldn't
check and kept the conversation warm.
KEY TAKEAWAYS
- The local qwen3.5-27b running on a 5090 beat every cloud model. Perfect
  reasoning, best conversational quality, strong tool use. The only downside
  is sharing the GPU with desktop workloads.
- gemma4:31b-cloud is the speed champion at 7.7s average response time, 2-3x
  faster than most competitors, with near-identical accuracy to the top
  scorer.
- GPT-5.4 placed 5th. The three best Ollama Cloud models (gemma4, qwen3.5,
  glm-5.1) all outperformed it. It's not bad, but it's no longer the best
  option.
- deepseek-v3.2 and kimi-k2.5 were disappointing: strong instruction
  following but poor tool use reliability, which is the most critical
  dimension for homelab operations.
- glm-5.1 is accurate but too slow: 71s average latency makes it impractical
  as a primary despite good scores.
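To see how the speed comparison holds up per model, the latencies from the Final Rankings table can be expressed as multiples of gemma4's 7.7s. This is just arithmetic on the reported averages; the helper name is made up for illustration.

```python
# Average latencies in seconds, copied from the Final Rankings table.
LATENCY_S = {
    "qwen3.5-27b-local": 19.4,
    "gemma4:31b-cloud": 7.7,
    "qwen3.5:cloud": 27.0,
    "glm-5.1:cloud": 71.2,
    "gpt-5.4": 14.6,
    "minimax-m2.7:cloud": 10.2,
    "kimi-k2.5:cloud": 18.1,
    "deepseek-v3.2:cloud": 15.6,
}


def slowdown_vs(baseline: str) -> dict[str, float]:
    """Return each model's latency as a multiple of the baseline model's."""
    base = LATENCY_S[baseline]
    return {model: secs / base for model, secs in LATENCY_S.items()}


ratios = slowdown_vs("gemma4:31b-cloud")
for model, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{model:<20} {ratio:4.1f}x")
```

On these numbers the multiples run from about 1.3x (minimax-m2.7) to about 9.2x (glm-5.1), with five of the other seven models falling between roughly 1.9x and 3.5x.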