cosmic quartz Apr 9, 2026, 10:47 PM

#

I like gemini flash.

I had to switch to GPT now, but gemini flash was just better

#

nicer

#

I don't use openclaw to build apps & websites btw. Only as an "AI assistant"

olive iron Apr 9, 2026, 11:03 PM

#

I'm going to try Gemma4 and see what I think. I'll keep my local Qwen3.5 as a fallback

alpine hill Apr 9, 2026, 11:10 PM

#

Flash better than GPT-5.4 mini? No way.

#

What’s going on in this thread? Is there a benchmark or sth?

#

Is it local specific?

olive iron Apr 9, 2026, 11:16 PM

#

it was an evaluation, mostly on the models available via Ollama, based on how well they performed on the tasks as outlined in "Evaluation Dimensions"

#

Cloud and local... but Qwen was the only one that is local... it was mostly a control, and I did not expect it to perform as well as it did.

#

I also didn't expect GPT to score as low as it did

#

I'm not claiming this is a be-all-end-all test of all models in every possible situation. This was an isolated, automated evaluation run by Opus 4.6 against the models I gave it. It devised the tests and ran them without my intervention, except for the "human evaluation" parts, where it asked some open-ended questions, both technical and conversational, and I rated the answers from 1-5.

olive iron Apr 10, 2026, 4:35 AM

#

I only just now realized that my whole Evaluation document didn't come through 😬

#

It was markdown, and the server denied all of it... woof.

#

OpenClaw Primary Model Evaluation — 2026-04-09

BACKGROUND

With Anthropic models no longer available via OpenClaw, we needed to find a
replacement primary model for our homelab assistant. GPT-5.4 has been filling
in but has pain points around tool use reliability, instruction following, and
reasoning depth.

We built an automated evaluation harness to fairly compare 8 candidate models
across 4 dimensions, using real homelab tool schemas (HA, Proxmox, TrueNAS,
DNS, UniFi) and a mix of auto-scoring and human review.

MODELS TESTED

Model                 Provider              Notes

qwen3.5:cloud         Ollama Cloud Max
gemma4:31b-cloud      Ollama Cloud Max
minimax-m2.7:cloud    Ollama Cloud Max      Coding/agentic focus
deepseek-v3.2:cloud   Ollama Cloud Max
glm-5.1:cloud         Ollama Cloud Max
kimi-k2.5:cloud       Ollama Cloud Max      Already a fallback via Moonshot
gpt-5.4               OpenAI Pro            Current "okay" model; bar to beat
qwen3.5-27b-local     Local Ollama (5090)   Local fallback baseline

EVALUATION DIMENSIONS

Tool Use Reliability (40% weight) — 5 prompts testing single, multi-param,
sequential, chained, and simple tool calls. Auto-scored: valid JSON, correct
tool name, required params, ordering.
Instruction Following (25% weight) — 5 prompts testing prompt leak
resistance, format compliance, staying in character, word limits,
English-only responses. Auto-scored: regex/rule-based pass/fail.
Reasoning Quality (25% weight) — 3 prompts on Proxmox HA planning, DNS
debugging, and containers-vs-VMs tradeoffs. Human-rated 1-5.
Conversational Quality (10% weight) — 3 prompts: greeting, follow-up, humor.
Human-rated 1-5.
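
The auto-scoring rules for tool calls (valid JSON, correct tool name, required params, ordering) could be sketched roughly as follows in Python. The harness code itself isn't shown in this thread, so the response shape ({"tool": ..., "arguments": ...}), the schema entries, and both function names are assumptions made up for illustration, not the real homelab schemas.

```python
import json

# Hypothetical tool schemas: tool name -> set of required parameter names.
# Illustrative only; the real HA/Proxmox/TrueNAS/DNS/UniFi schemas are not shown.
TOOL_SCHEMAS = {
    "ha_call_service": {"domain", "service", "entity_id"},
    "dns_lookup": {"hostname"},
}


def score_tool_call(raw_response, expected_tool):
    """Pass only if the response is valid JSON, names the expected tool,
    and supplies every required parameter for that tool."""
    try:
        call = json.loads(raw_response)
    except json.JSONDecodeError:
        return False  # not valid JSON
    if not isinstance(call, dict):
        return False
    if call.get("tool") != expected_tool:
        return False  # wrong tool name
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False
    # Required params must be a subset of the supplied argument names.
    return TOOL_SCHEMAS[expected_tool] <= set(args)


def score_sequence(raw_responses, expected_tools):
    """Ordering check: each call must pass and appear in the expected order."""
    if len(raw_responses) != len(expected_tools):
        return False
    return all(score_tool_call(r, t) for r, t in zip(raw_responses, expected_tools))
```

A check like this only looks at parameter names. A failure such as a memory unit mismatch would need value-level validation on top of it.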

Total: 16 prompts x 8 models = 128 responses

#

RESULTS

Final Rankings:

Rank  Model                 Tool  Instr  Reas  Conv  Composite  Latency

1     qwen3.5-27b-local     80%   100%   100%  87%   90.7%      19.4s
2     gemma4:31b-cloud      80%   100%   100%  60%   88.0%      7.7s
3     qwen3.5:cloud         80%   100%   100%  53%   87.3%      27.0s
4     glm-5.1:cloud         80%   100%   73%   73%   82.7%      71.2s
5     gpt-5.4               80%   80%    67%   53%   74.0%      14.6s
6     minimax-m2.7:cloud    60%   80%    67%   67%   67.3%      10.2s
7     kimi-k2.5:cloud       40%   100%   67%   47%   62.3%      18.1s
8     deepseek-v3.2:cloud   20%   100%   40%   47%   47.7%      15.6s
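
For anyone wanting to check the Composite column, here is a minimal Python sketch of the weighting. The weights come from the Evaluation Dimensions section. The function names are hypothetical, and the rating-to-percent conversion (sum of ratings over the maximum possible) is an assumption, though it matches deepseek-v3.2's reasoning ratings of 2/1/3 showing up as 40%.

```python
# Weights as listed under EVALUATION DIMENSIONS.
WEIGHTS = {"tool": 0.40, "instr": 0.25, "reas": 0.25, "conv": 0.10}


def ratings_to_percent(ratings, max_rating=5):
    """Convert human 1-5 ratings to a percentage of the maximum possible.
    Assumed normalization: sum(ratings) / (max_rating * number of prompts)."""
    return 100.0 * sum(ratings) / (max_rating * len(ratings))


def composite(scores):
    """Weighted sum of the per-dimension percentages."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


# deepseek-v3.2:cloud, using the table values plus its raw reasoning ratings.
deepseek = {
    "tool": 20,
    "instr": 100,
    "reas": ratings_to_percent([2, 1, 3]),  # 6 of 15 points
    "conv": 47,
}
print(round(composite(deepseek), 1))  # 47.7
```

Rows computed from the rounded percentages in the table can differ from the listed composite in the last digit, since the listed values were presumably computed from unrounded scores.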

Tool Use Breakdown (which prompts failed):

Model                 Score  Failed Prompts

qwen3.5-27b-local     4/5    Multi-param PVE (memory unit mismatch)
gemma4:31b-cloud      4/5    Single HA call
qwen3.5:cloud         4/5    Single HA call
glm-5.1:cloud         4/5    Chained DNS
gpt-5.4               4/5    Chained DNS
minimax-m2.7:cloud    3/5    Single HA, Chained DNS
kimi-k2.5:cloud       2/5    Single HA, Chained DNS, UniFi
deepseek-v3.2:cloud   1/5    4 of 5 failed; only passed UniFi

Instruction Following Breakdown:

All models scored 100% except:

gpt-5.4 (4/5) — failed stay-in-character (answered an off-topic personal
question instead of deflecting)
minimax-m2.7:cloud (4/5) — same failure
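
The regex/rule-based pass/fail checks could look something like the Python sketch below. The actual rules used by the harness aren't shown here, so all three functions, the canary-marker idea for prompt leaks, and the ASCII test standing in for "English-only" are assumptions for illustration.

```python
import re


def within_word_limit(response, max_words):
    """Pass if the response has at most max_words whitespace-separated words."""
    return len(response.split()) <= max_words


def is_ascii_only(response):
    """Crude stand-in for an English-only check: fail on any non-ASCII character."""
    return response.isascii()


def leaks_prompt(response, secret_marker):
    """Prompt-leak check: True if a marker planted in the system prompt
    appears in the response (case-insensitive)."""
    return re.search(re.escape(secret_marker), response, re.IGNORECASE) is not None
```

A stay-in-character failure like the one above is harder to catch with fixed rules, since it depends on what the model chose to answer rather than on the shape of the text.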

Reasoning Highlights (Human-Rated, 1-5 scale):

Top 3 models all scored 5/5/5 on the three reasoning prompts (Proxmox HA
planning, DNS debugging, containers vs VMs):

qwen3.5-27b-local
gemma4:31b-cloud
qwen3.5:cloud

Weakest reasoning: deepseek-v3.2 (2/1/3) — shallow answers, missed practical
considerations.

Conversational Quality Notes:

The greeting prompt ("Hey Jarvis, good morning! How's the lab looking today?")
was an unintentional hallucination test — models had no tools or data to check
lab status. Models that scored 1 made up fake status reports. Models that
scored 5 (qwen3.5-27b-local, glm-5.1, minimax-m2.7) acknowledged they couldn't
check and kept the conversation warm.

KEY TAKEAWAYS

The local qwen3.5-27b running on a 5090 beat every cloud model. Perfect
reasoning, best conversational quality, strong tool use. The only downside
is sharing the GPU with desktop workloads.
gemma4:31b-cloud is the speed champion at 7.7s average response time — 2-3x
faster than most competitors, with near-identical accuracy to the top
scorer.
GPT-5.4 placed 5th. The three best Ollama Cloud models (gemma4:31b, qwen3.5,
glm-5.1) all outperformed it, as did the local qwen3.5-27b. It's not bad, but
it's no longer the best option.
deepseek-v3.2 and kimi-k2.5 were disappointing — strong instruction
following but poor tool use reliability, which is the most critical
dimension for homelab operations.
glm-5.1 is accurate but too slow — 71s average latency makes it impractical
as a primary despite good scores.
