#how to choose a cheap model?
Yep — I just checked OpenRouter’s performance page for google/gemma-4-31b-it.
Current Tool Call Error Rate shown there is roughly:
- NovitaAI: ~15.45%
- AkashML: ~16.26%
So for agentic/tool-heavy flows, that’s honestly too high to trust as a primary model right now.
For plain chat/non-tool tasks it may still be fine.
Links:
- https://openrouter.ai/google/gemma-4-31b-it/performance
- https://openrouter.ai/google/gemma-4-31b-it/uptime
If you want, I can help you define a quick “go/no-go” test (e.g., 50 tool calls, fail if >3–5% error) before you commit.
Also for freshest cheap+reliable picks, check #models plus OpenClaw model docs: https://docs.openclaw.ai/concepts/models#model-discussions-community-notes
ok define a "go" test
Perfect — here’s a practical GO / NO-GO test you can run before adopting a cheap model for tool use.
1) Test set
Run 50 prompts that match your real workflow:
- 20 single-tool tasks
- 20 multi-step tool tasks
- 10 “no tool needed” tasks (to catch over-calling)
Keep prompts fixed so models are comparable.
2) What to measure
For each run, log:
- Tool-call validity rate: % of tool calls with a valid tool name and valid JSON args
- Tool hallucination rate: % of calls to non-existent tools
- Schema error rate: % of calls rejected due to bad or missing args
- Task success rate: % of tasks completed correctly without a manual fix
- Retry burden: average retries needed per task
- Latency/cost: p95 latency and total cost for the 50 runs
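To make the first three metrics concrete, here is a minimal sketch of how a single tool call could be bucketed as valid, hallucinated, or a schema error. The tool registry and its required argument names are hypothetical placeholders; swap in your own tool definitions. It only checks that required keys are present, not argument types, so treat it as a starting point.

```python
import json

# Hypothetical registry: tool name -> required argument names (assumption, not a real API).
TOOLS = {
    "search_web": {"query"},
    "create_ticket": {"title", "body"},
}

def classify_call(name, raw_args, tools=TOOLS):
    """Bucket one tool call as 'hallucinated', 'schema_error', or 'valid'."""
    # A call to a tool that is not in the registry counts as a hallucination.
    if name not in tools:
        return "hallucinated"
    # Arguments must parse as JSON ...
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        return "schema_error"
    # ... and must be an object containing every required key.
    if not isinstance(args, dict) or not tools[name].issubset(args.keys()):
        return "schema_error"
    return "valid"
```

Counting the three outcomes per case gives you the `tool_calls_valid`, hallucinated, and schema-error tallies to log for each run.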
3) GO thresholds (recommended)
Use this as your gate:
- Tool-call validity: >= 97%
- Hallucinated tools: <= 0.5%
- Schema errors: <= 2%
- Task success: >= 90%
- Avg retries: <= 0.2/task
- p95 latency: within your SLA (set your own target)
If any of the first 4 fail, it’s a NO-GO for primary agent/tool routing.
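The gate can be sketched as a small function, assuming your metrics are already computed as fractions between 0 and 1. The metric key names and the `go_no_go` helper are illustrative choices, not part of any library. Only the first four thresholds are treated as hard gates here; retries and latency are left for you to judge against your own targets.

```python
import operator

# The four hard gates: (metric name, comparison, limit as a fraction).
HARD_GATES = [
    ("tool_call_validity_rate", operator.ge, 0.97),
    ("tool_hallucination_rate", operator.le, 0.005),
    ("schema_error_rate", operator.le, 0.02),
    ("task_success_rate", operator.ge, 0.90),
]

def go_no_go(metrics):
    """Return ('GO' or 'NO-GO', list of failed gate names)."""
    failures = [
        name for name, compare, limit in HARD_GATES
        if not compare(metrics[name], limit)
    ]
    verdict = "GO" if not failures else "NO-GO"
    return verdict, failures
```

Returning the list of failed gates alongside the verdict makes it easier to see whether a model is close to passing or failing across the board.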
4) OpenRouter-specific note
Because routing differs by provider, run the test:
- once with your normal routing, and
- once pinned to the provider(s) you expect most traffic on.
A model can look “ok” overall but fail badly on one provider.
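One way to spot that kind of provider-specific failure is to break validity down per provider instead of only looking at the aggregate. This is a rough sketch that assumes each logged run is a dict with `provider_route`, `tool_calls_valid`, and `tool_calls_total` fields; the function name is made up for illustration.

```python
from collections import defaultdict

def validity_by_provider(rows):
    """Return {provider: tool-call validity rate}, or None if it made no calls."""
    totals = defaultdict(lambda: [0, 0])  # provider -> [valid calls, total calls]
    for row in rows:
        counts = totals[row["provider_route"]]
        counts[0] += int(row["tool_calls_valid"])
        counts[1] += int(row["tool_calls_total"])
    return {
        provider: (valid / total if total else None)
        for provider, (valid, total) in totals.items()
    }
```

With only 50 runs split across providers the per-provider samples are small, so read large gaps as a signal to run more cases pinned to that provider, not as a final verdict.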
If you want, I can give you a ready-to-copy CSV scoring template for these 50 runs so you can decide in 10 minutes.
yes are you bot?
Yep — I’m an AI assistant bot helping with OpenClaw support in here 🦐
give me CSV scoring template please
Nice — here’s a ready-to-copy template.
case_id,prompt_group,requires_tool,expected_tool_sequence,actual_tool_sequence,tool_calls_total,tool_calls_valid,tool_hallucinated_count,schema_errors_count,retries,task_success,latency_ms,input_tokens,output_tokens,cost_usd,provider_route,notes
1,single_tool,yes,"search_web","search_web",1,1,0,0,0,1,1820,420,210,0.00019,novita,"clean run"
2,multi_step,yes,"search_web>summarize","search_web>summarize",2,2,0,0,1,1,4210,980,460,0.00054,akashml,"needed one retry"
3,no_tool,no,"","",0,0,0,0,0,1,940,180,120,0.00007,novita,"correctly avoided tools"
4,single_tool,yes,"lookup_user","lookup_usr",1,0,1,0,0,0,1300,260,90,0.00011,akashml,"hallucinated tool name"
5,single_tool,yes,"create_ticket","create_ticket",1,0,0,1,0,0,1600,300,130,0.00014,novita,"bad args/schema"
And this is the scorecard formula spec (for your sheet):
tool_call_validity_rate = SUM(tool_calls_valid) / SUM(tool_calls_total)
tool_hallucination_rate = SUM(tool_hallucinated_count) / SUM(tool_calls_total)
schema_error_rate = SUM(schema_errors_count) / SUM(tool_calls_total)
task_success_rate = SUM(task_success) / COUNT(case_id)
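If you would rather script it than use a sheet, the same four formulas can be computed straight from the CSV. This is a minimal sketch using only the Python standard library; it assumes the column names from the template above, and the `scorecard` function name is just an illustrative choice.

```python
import csv
import io

def scorecard(csv_text):
    """Compute the four scorecard rates from CSV text in the template format."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    def total(column):
        # Sum an integer column across all cases.
        return sum(int(row[column]) for row in rows)

    def rate(numerator, denominator):
        # None signals "not measurable" (e.g. no tool calls were made at all).
        return numerator / denominator if denominator else None

    calls = total("tool_calls_total")
    return {
        "tool_call_validity_rate": rate(total("tool_calls_valid"), calls),
        "tool_hallucination_rate": rate(total("tool_hallucinated_count"), calls),
        "schema_error_rate": rate(total("schema_errors_count"), calls),
        "task_success_rate": rate(total("task_success"), len(rows)),
    }
```

On the five sample rows in the template this works out to 3/5 = 0.6 validity, 1/5 = 0.2 hallucination, 1/5 = 0.2 schema errors, and 3/5 = 0.6 task success, which would be a clear NO-GO.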