#OpenClaw Primary Model Evaluation - 2026-04-09

cosmic quartz
#

I like gemini flash.

I had to switch to GPT now, but gemini flash was just better

#

nicer

#

I don't use openclaw to build apps & websites btw. Only as an "AI assistant"

olive iron
#

I'm going to try Gemma4 and see what I think. I'll keep my local Qwen3.5 as a fallback

alpine hill
#

Flash better than GPT-5.4 mini? No way.

#

What's going on in this thread? Is there a benchmark or something?

#

Is it local specific?

olive iron
#

it was an evaluation, mostly on the models available via Ollama, based on how well they performed on the tasks as outlined in "Evaluation Dimensions"

#

Cloud and local... but Qwen was the only one that is local... it was mostly a control, and I did not expect it to perform as well as it did.

#

I also didn't expect GPT to score as low as it did

#

I'm not claiming this is a be-all-end-all test of every model in every possible situation. This was an isolated, automated evaluation run by Opus 4.6 against the models I gave it. It devised the tests and ran them without my intervention, except for the "human evaluation" parts, where it asked some open-ended questions, both technical and conversational, and I ranked the answers from 1-5.

olive iron
#

I only, just now, realized that my whole Evaluation document didn't come through 😬

#

It was markdown, and the server denied all of it... woof.

#

OpenClaw Primary Model Evaluation - 2026-04-09

BACKGROUND

With Anthropic models no longer available via OpenClaw, we needed to find a
replacement primary model for our homelab assistant. GPT-5.4 has been filling
in but has pain points around tool use reliability, instruction following, and
reasoning depth.

We built an automated evaluation harness to fairly compare 8 candidate models
across 4 dimensions, using real homelab tool schemas (HA, Proxmox, TrueNAS,
DNS, UniFi) and a mix of auto-scoring and human review.

MODELS TESTED

Model                Provider             Notes
-------------------  -------------------  ---------------------------------
qwen3.5:cloud        Ollama Cloud Max
gemma4:31b-cloud     Ollama Cloud Max
minimax-m2.7:cloud   Ollama Cloud Max     Coding/agentic focus
deepseek-v3.2:cloud  Ollama Cloud Max
glm-5.1:cloud        Ollama Cloud Max
kimi-k2.5:cloud      Ollama Cloud Max     Already a fallback via Moonshot
gpt-5.4              OpenAI Pro           Current "okay" model, bar to beat
qwen3.5-27b-local    Local Ollama (5090)  Local fallback baseline

EVALUATION DIMENSIONS

  • Tool Use Reliability (40% weight): 5 prompts testing single, multi-param,
    sequential, chained, and simple tool calls. Auto-scored: valid JSON, correct
    tool name, required params, ordering.

  • Instruction Following (25% weight): 5 prompts testing prompt leak
    resistance, format compliance, staying in character, word limits,
    English-only responses. Auto-scored: regex/rule-based pass/fail.

  • Reasoning Quality (25% weight): 3 prompts on Proxmox HA planning, DNS
    debugging, and containers-vs-VMs tradeoffs. Human-rated 1-5.

  • Conversational Quality (10% weight): 3 prompts: greeting, follow-up, humor.
    Human-rated 1-5.

Total: 16 prompts x 8 models = 128 responses
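
For readers wondering what the tool-use auto-scoring might look like, here is a minimal sketch of the four checks listed above (valid JSON, correct tool name, required params, ordering). It is not the harness Opus wrote. The schema contents, the "name"/"arguments" response shape, and every function name are assumptions made for illustration.

```python
import json

# Hypothetical tool schemas; the names and parameters are illustrative only.
SCHEMAS = {
    "ha_call_service": {"required": ["domain", "service", "entity_id"]},
    "dns_lookup": {"required": ["hostname"]},
}

def score_tool_call(raw_output, expected_tool, schemas=SCHEMAS):
    """Return one boolean per check for a single model response."""
    checks = {"valid_json": False, "correct_tool": False, "required_params": False}
    try:
        call = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return checks  # unparseable output fails every check
    if not isinstance(call, dict):
        return checks
    checks["valid_json"] = True
    checks["correct_tool"] = call.get("name") == expected_tool
    args = call.get("arguments", {})
    required = schemas.get(expected_tool, {}).get("required", [])
    checks["required_params"] = isinstance(args, dict) and all(p in args for p in required)
    return checks

def passed(checks):
    """A prompt passes only if every individual check passed."""
    return all(checks.values())

def ordering_ok(called_tools, expected_order):
    """Sequential and chained prompts: tools must be called in the expected order."""
    return list(called_tools) == list(expected_order)
```

A check like this only looks at structure. A failure such as the "memory unit mismatch" in the results would also need a comparison of parameter values, which this sketch does not attempt.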

#

RESULTS

Final Rankings:

Rank  Model                Tool  Instr  Reas  Conv  Composite  Latency
----  -------------------  ----  -----  ----  ----  ---------  -------
1     qwen3.5-27b-local    80%   100%   100%  87%   90.7%      19.4s
2     gemma4:31b-cloud     80%   100%   100%  60%   88.0%      7.7s
3     qwen3.5:cloud        80%   100%   100%  53%   87.3%      27.0s
4     glm-5.1:cloud        80%   100%   73%   73%   82.7%      71.2s
5     gpt-5.4              80%   80%    67%   53%   74.0%      14.6s
6     minimax-m2.7:cloud   60%   80%    67%   67%   67.3%      10.2s
7     kimi-k2.5:cloud      40%   100%   67%   47%   62.3%      18.1s
8     deepseek-v3.2:cloud  20%   100%   40%   47%   47.7%      15.6s
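
The Composite column is consistent with a weighted average using the 40/25/25/10 weights from "Evaluation Dimensions". The sketch below reproduces it under two assumptions: tool use and instruction following are passes out of 5, and the reasoning and conversational percentages are rating points out of 15 (three prompts rated 1-5). The point totals are inferred from the rounded percentages in the table, not taken from the harness, and the function name is made up.

```python
# Dimension weights from "Evaluation Dimensions".
WEIGHTS = {"tool": 0.40, "instr": 0.25, "reas": 0.25, "conv": 0.10}

def composite(tool_pass, instr_pass, reas_points, conv_points):
    """Weighted composite in percent, rounded to one decimal place.

    tool_pass / instr_pass: prompts passed out of 5.
    reas_points / conv_points: summed 1-5 ratings out of 15 (assumed).
    """
    fractions = {
        "tool": tool_pass / 5,
        "instr": instr_pass / 5,
        "reas": reas_points / 15,
        "conv": conv_points / 15,
    }
    return round(100 * sum(WEIGHTS[k] * fractions[k] for k in WEIGHTS), 1)

print(composite(4, 5, 15, 13))  # qwen3.5-27b-local -> 90.7
print(composite(4, 4, 10, 8))   # gpt-5.4 -> 74.0
```

Using the rounded column values instead of the underlying points can shift the last digit (for example, glm-5.1 comes out near 82.6 rather than 82.7), which is why the sketch works from point totals.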

Tool Use Breakdown (which prompts failed):

Model                Score  Failed Prompts
-------------------  -----  ---------------------------------------
qwen3.5-27b-local    4/5    Multi-param PVE (memory unit mismatch)
gemma4:31b-cloud     4/5    Single HA call
qwen3.5:cloud        4/5    Single HA call
glm-5.1:cloud        4/5    Chained DNS
gpt-5.4              4/5    Chained DNS
minimax-m2.7:cloud   3/5    Single HA, Chained DNS
kimi-k2.5:cloud      2/5    Single HA, Chained DNS, UniFi
deepseek-v3.2:cloud  1/5    4 of 5 failed; only passed UniFi

Instruction Following Breakdown:

All models scored 100% except:

  • gpt-5.4 (4/5): failed stay-in-character (answered an off-topic personal
    question instead of deflecting)
  • minimax-m2.7:cloud (4/5): same failure
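
As an illustration of what "regex/rule-based pass/fail" can mean for a few of these prompts, here is a rough sketch. These are not the rules the harness used. The function names, the ASCII-only proxy for English, and the 20-character leak threshold are all assumptions. The stay-in-character check that two models failed is harder to express as a simple rule and is not attempted here.

```python
import re

def within_word_limit(text, limit):
    """Pass if the response has at most `limit` whitespace-separated words."""
    return len(text.split()) <= limit

def is_english_only(text):
    """Crude proxy (assumption): treat any non-ASCII character as a failure."""
    return text.isascii()

def matches_format(text, pattern):
    """Pass if the whole trimmed response matches the required regex."""
    return re.fullmatch(pattern, text.strip()) is not None

def leaks_prompt(text, system_prompt, min_len=20):
    """Flag a leak if any system-prompt line of at least min_len chars appears verbatim."""
    lowered = text.lower()
    lines = [ln.strip().lower() for ln in system_prompt.splitlines()]
    return any(len(ln) >= min_len and ln in lowered for ln in lines)
```

The ASCII proxy will wrongly fail English text containing curly quotes or emoji, so a real check would likely need something more forgiving.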

Reasoning Highlights (Human-Rated, 1-5 scale):

Top 3 models all scored 5/5/5 on the three reasoning prompts (Proxmox HA
planning, DNS debugging, containers vs VMs):

  • qwen3.5-27b-local
  • gemma4:31b-cloud
  • qwen3.5:cloud

Weakest reasoning: deepseek-v3.2 (2/1/3), with shallow answers that missed
practical considerations.
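
The human ratings line up with the Reas column if the three 1-5 scores are summed and divided by the maximum of 15. That conversion is inferred from the numbers here, not stated by the harness, and the helper name is made up:

```python
def ratings_to_percent(ratings, max_rating=5):
    """Summed ratings as a whole-number percentage of the maximum possible."""
    return round(100 * sum(ratings) / (max_rating * len(ratings)))

print(ratings_to_percent([5, 5, 5]))  # top three models -> 100
print(ratings_to_percent([2, 1, 3]))  # deepseek-v3.2 -> 40
```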

Conversational Quality Notes:

The greeting prompt ("Hey Jarvis, good morning! How's the lab looking today?")
was an unintentional hallucination test: models had no tools or data to check
lab status. Models that scored 1 made up fake status reports. Models that
scored 5 (qwen3.5-27b-local, glm-5.1, minimax-m2.7) acknowledged they couldn't
check and kept the conversation warm.

KEY TAKEAWAYS

  1. The local qwen3.5-27b running on a 5090 beat every cloud model. Perfect
    reasoning, best conversational quality, strong tool use. The only downside
    is sharing the GPU with desktop workloads.

  2. gemma4:31b-cloud is the speed champion at 7.7s average response time, 2-3x
    faster than most competitors, with near-identical accuracy to the top
    scorer.

  3. GPT-5.4 placed 5th. The three best Ollama Cloud models (gemma4, qwen3.5,
    glm-5.1) and the local qwen3.5-27b all outperformed it. It's not bad, but
    it's no longer the best option.

  4. deepseek-v3.2 and kimi-k2.5 were disappointing: strong instruction
    following but poor tool use reliability, which is the most critical
    dimension for homelab operations.

  5. glm-5.1 is accurate but too slow: 71s average latency makes it impractical
    as a primary despite good scores.
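
To put the speed comparison in takeaway 2 into numbers, this small sketch divides each model's average latency from the rankings table by gemma4's. The dictionary just restates the table; the function name is made up.

```python
# Average latency in seconds, copied from the Final Rankings table.
LATENCY_S = {
    "qwen3.5-27b-local": 19.4,
    "gemma4:31b-cloud": 7.7,
    "qwen3.5:cloud": 27.0,
    "glm-5.1:cloud": 71.2,
    "gpt-5.4": 14.6,
    "minimax-m2.7:cloud": 10.2,
    "kimi-k2.5:cloud": 18.1,
    "deepseek-v3.2:cloud": 15.6,
}

def slowdown_vs(baseline, latencies=LATENCY_S):
    """How many times slower each other model is than the baseline."""
    base = latencies[baseline]
    return {m: round(t / base, 1) for m, t in latencies.items() if m != baseline}

for model, ratio in sorted(slowdown_vs("gemma4:31b-cloud").items(), key=lambda kv: kv[1]):
    print(f"{model}: {ratio}x")
```

The ratios run from about 1.3x (minimax-m2.7) to about 9.2x (glm-5.1), with most of the field between roughly 2x and 3.5x.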