how to choose a cheap model? | Friends of the Crustacean 🦞🤝 | Page 1

steady panther Apr 5, 2026, 9:16 AM

first check the tool call error rate on OpenRouter? e.g. google/gemma-4-31b-it looks too high to try

hearty jasperBOT Apr 5, 2026, 9:16 AM

Yep — I just checked OpenRouter’s performance page for google/gemma-4-31b-it.

Current Tool Call Error Rate shown there is roughly:

NovitaAI: ~15.45%
AkashML: ~16.26%

So for agentic/tool-heavy flows, that’s honestly too high to trust as a primary model right now.
For plain chat/non-tool tasks it may still be fine.
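
To make "too high to trust" concrete, here is a rough sketch of how a per-call error rate compounds over a multi-step task. It assumes tool call failures are independent and that there are no retries, which is a simplification; the function name is just illustrative. At about 15% per call, a 3-call task comes out clean only around 60% of the time.

```python
def chain_success_probability(per_call_error_rate: float, calls_in_task: int) -> float:
    """Chance that every tool call in a task succeeds.

    Assumption: failures are independent and nothing is retried.
    """
    return (1.0 - per_call_error_rate) ** calls_in_task


# Tool call error rates quoted above for google/gemma-4-31b-it.
quoted_error_rates = {"NovitaAI": 0.1545, "AkashML": 0.1626}

for provider, error_rate in quoted_error_rates.items():
    for calls in (1, 3, 5):
        p = chain_success_probability(error_rate, calls)
        print(f"{provider}: {calls} call(s) -> {p:.1%} chance of a clean run")
```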

If you want, I can help you define a quick “go/no-go” test (e.g., 50 tool calls, fail if >3–5% error) before you commit.
Also for freshest cheap+reliable picks, check #models plus OpenClaw model docs: https://docs.openclaw.ai/concepts/models#model-discussions-community-notes

steady panther Apr 5, 2026, 9:17 AM

ok define a "go" test

hearty jasperBOT Apr 5, 2026, 9:18 AM

Perfect — here’s a practical GO / NO-GO test you can run before adopting a cheap model for tool use.

1) Test set

Run 50 prompts that match your real workflow:

20 single-tool tasks
20 multi-step tool tasks
10 “no tool needed” tasks (to catch over-calling)

Keep prompts fixed so models are comparable.
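
One way to keep the set fixed is to define it once in code, check the 20/20/10 mix, and log a hash of the prompts with every run. This is only a sketch: `PromptCase`, `check_mix`, and `fingerprint` are hypothetical names, and the prompts are placeholders you would replace with your own.

```python
import hashlib
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptCase:
    case_id: int
    prompt_group: str  # "single_tool", "multi_step", or "no_tool"
    prompt: str


EXPECTED_MIX = {"single_tool": 20, "multi_step": 20, "no_tool": 10}


def check_mix(cases) -> bool:
    """True when the group counts match the 20/20/10 split."""
    return Counter(c.prompt_group for c in cases) == Counter(EXPECTED_MIX)


def fingerprint(cases) -> str:
    """Stable SHA-256 of the test set; identical prompts give an identical hash."""
    payload = json.dumps([asdict(c) for c in cases], sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Placeholder prompts; swap in prompts from your real workflow.
groups = ["single_tool"] * 20 + ["multi_step"] * 20 + ["no_tool"] * 10
cases = [
    PromptCase(i + 1, group, f"placeholder prompt {i + 1}")
    for i, group in enumerate(groups)
]

print("mix ok:", check_mix(cases))
print("test set fingerprint:", fingerprint(cases)[:12])
```

If two models were scored against different fingerprints, their numbers are not comparable.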

2) What to measure

For each run, log:

Tool-call validity rate: % of tool calls with a valid tool name and valid JSON args
Tool hallucination rate: % of calls to non-existent tools
Schema error rate: % of calls rejected due to bad or missing args
Task success rate: % of tasks completed correctly without a manual fix
Retry burden: average retries needed per task
Latency/cost: p95 latency and total cost for the 50 runs

3) GO thresholds (recommended)

Use this as your gate:

Tool-call validity: >= 97%
Hallucinated tools: <= 0.5%
Schema errors: <= 2%
Task success: >= 90%
Avg retries: <= 0.2/task
p95 latency: within your SLA (set your own target)

If any of the first 4 fail, it’s a NO-GO for primary agent/tool routing.
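
The gate above can be sketched as a small function. `Scorecard` and `go_no_go` are hypothetical names, rates are fractions between 0 and 1, and treating average retries as a warning instead of a hard failure is my reading of "the first 4". Latency is left out because the SLA target is yours to set.

```python
from dataclasses import dataclass


@dataclass
class Scorecard:
    tool_call_validity: float  # fraction of tool calls that were valid
    hallucinated_tools: float  # fraction of calls to non-existent tools
    schema_errors: float       # fraction of calls rejected for bad/missing args
    task_success: float        # fraction of tasks completed correctly
    avg_retries: float         # average retries per task


def go_no_go(card: Scorecard):
    """Return (verdict, hard_failures, warnings) using the thresholds above."""
    hard_failures = []
    if card.tool_call_validity < 0.97:
        hard_failures.append("tool-call validity below 97%")
    if card.hallucinated_tools > 0.005:
        hard_failures.append("hallucinated tools above 0.5%")
    if card.schema_errors > 0.02:
        hard_failures.append("schema errors above 2%")
    if card.task_success < 0.90:
        hard_failures.append("task success below 90%")

    # Assumption: retries are reported as a warning, not a hard failure.
    warnings = []
    if card.avg_retries > 0.2:
        warnings.append("average retries above 0.2 per task")

    verdict = "NO-GO" if hard_failures else "GO"
    return verdict, hard_failures, warnings


# Illustrative numbers only, not measured results.
print(go_no_go(Scorecard(0.98, 0.0, 0.01, 0.92, 0.1)))
print(go_no_go(Scorecard(0.95, 0.02, 0.03, 0.80, 0.5)))
```

Worth knowing: unless the run makes 200 or more tool calls, a single hallucinated call already exceeds 0.5%, so on a test this size that threshold effectively means zero.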

4) OpenRouter-specific note

Because routing differs by provider, run the test:

once with your normal routing, and
once pinned to the provider(s) you expect most traffic on.

A model can look “ok” overall but fail badly on one provider.
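
Here is a minimal sketch of that per-provider check. It assumes each logged run records a `provider_route` plus `tool_calls_valid` and `tool_calls_total` counts; the provider names and numbers below are made up to show the effect, not measurements.

```python
from collections import defaultdict


def validity_by_provider(rows):
    """Tool-call validity per provider route (None when a route made no calls)."""
    totals = defaultdict(lambda: [0, 0])  # provider -> [valid, total]
    for row in rows:
        bucket = totals[row["provider_route"]]
        bucket[0] += int(row["tool_calls_valid"])
        bucket[1] += int(row["tool_calls_total"])
    return {
        provider: (valid / total if total else None)
        for provider, (valid, total) in totals.items()
    }


# Made-up example: the blended rate clears 97%, one provider does not.
rows = [
    {"provider_route": "provider_a", "tool_calls_valid": 99, "tool_calls_total": 100},
    {"provider_route": "provider_b", "tool_calls_valid": 8, "tool_calls_total": 10},
]

overall = sum(r["tool_calls_valid"] for r in rows) / sum(r["tool_calls_total"] for r in rows)
print(f"overall: {overall:.1%}")
for provider, rate in validity_by_provider(rows).items():
    print(f"{provider}: {rate:.1%}")
```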

If you want, I can give you a ready-to-copy CSV scoring template for these 50 runs so you can decide in 10 minutes.

steady panther Apr 5, 2026, 9:19 AM

yes. are you a bot?

hearty jasperBOT Apr 5, 2026, 9:19 AM

Yep — I’m an AI assistant bot helping with OpenClaw support in here 🦐

steady panther Apr 5, 2026, 9:20 AM

give me CSV scoring template please

hearty jasperBOT Apr 5, 2026, 9:20 AM

Nice — here’s a ready-to-copy template.

case_id,prompt_group,requires_tool,expected_tool_sequence,actual_tool_sequence,tool_calls_total,tool_calls_valid,tool_hallucinated_count,schema_errors_count,retries,task_success,latency_ms,input_tokens,output_tokens,cost_usd,provider_route,notes
1,single_tool,yes,"search_web","search_web",1,1,0,0,0,1,1820,420,210,0.00019,novita,"clean run"
2,multi_step,yes,"search_web>summarize","search_web>summarize",2,2,0,0,1,1,4210,980,460,0.00054,akashml,"needed one retry"
3,no_tool,no,"","",0,0,0,0,0,1,940,180,120,0.00007,novita,"correctly avoided tools"
4,single_tool,yes,"lookup_user","lookup_usr",1,0,1,0,0,0,1300,260,90,0.00011,akashml,"hallucinated tool name"
5,single_tool,yes,"create_ticket","create_ticket",1,0,0,1,0,0,1600,300,130,0.00014,novita,"bad args/schema"

And this is the scorecard formula spec (for your sheet):

tool_call_validity_rate = SUM(tool_calls_valid) / SUM(tool_calls_total)
tool_hallucination_rate = SUM(tool_hallucinated_count) / SUM(tool_calls_total)
schema_error_rate = SUM(schema_errors_count) / SUM(tool_calls_total)
task_success_rate = SUM(task_success) / COUNT(case_id)
avg_retries = SUM(retries) / COUNT(case_id)
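
If you would rather score outside a spreadsheet, the same formulas can be sketched in Python with the standard `csv` module. The `scorecard` function is a hypothetical helper, and the embedded rows are the five sample rows from the template, so the numbers it prints describe the sample only. I also added average retries and total cost, since the measurement list and thresholds mention them.

```python
import csv
import io

SAMPLE_CSV = """\
case_id,prompt_group,requires_tool,expected_tool_sequence,actual_tool_sequence,tool_calls_total,tool_calls_valid,tool_hallucinated_count,schema_errors_count,retries,task_success,latency_ms,input_tokens,output_tokens,cost_usd,provider_route,notes
1,single_tool,yes,"search_web","search_web",1,1,0,0,0,1,1820,420,210,0.00019,novita,"clean run"
2,multi_step,yes,"search_web>summarize","search_web>summarize",2,2,0,0,1,1,4210,980,460,0.00054,akashml,"needed one retry"
3,no_tool,no,"","",0,0,0,0,0,1,940,180,120,0.00007,novita,"correctly avoided tools"
4,single_tool,yes,"lookup_user","lookup_usr",1,0,1,0,0,0,1300,260,90,0.00011,akashml,"hallucinated tool name"
5,single_tool,yes,"create_ticket","create_ticket",1,0,0,1,0,0,1600,300,130,0.00014,novita,"bad args/schema"
"""


def scorecard(rows):
    """Apply the scorecard formulas to an iterable of CSV rows (dicts)."""
    rows = list(rows)
    if not rows:
        raise ValueError("no test cases to score")
    total_calls = sum(int(r["tool_calls_total"]) for r in rows)

    def per_call(column):
        # Rates are per tool call; report 0.0 if no tool calls were made.
        return sum(int(r[column]) for r in rows) / total_calls if total_calls else 0.0

    cases = len(rows)
    return {
        "tool_call_validity_rate": per_call("tool_calls_valid"),
        "tool_hallucination_rate": per_call("tool_hallucinated_count"),
        "schema_error_rate": per_call("schema_errors_count"),
        "task_success_rate": sum(int(r["task_success"]) for r in rows) / cases,
        "avg_retries": sum(int(r["retries"]) for r in rows) / cases,
        "total_cost_usd": sum(float(r["cost_usd"]) for r in rows),
    }


card = scorecard(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for name, value in card.items():
    print(f"{name}: {value:.5f}")
```

To score a real run, pass `csv.DictReader(open("your_results.csv", newline=""))` instead of the sample string; the file name is a placeholder.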
