#how to choose a cheap model?

steady panther
#

first check the tool call error rate on OpenRouter? like google/gemma-4-31b-it, it's too high to try

hearty jasperBOT
#

Yep — I just checked OpenRouter’s performance page for google/gemma-4-31b-it.

Current Tool Call Error Rate shown there is roughly:

  • NovitaAI: ~15.45%
  • AkashML: ~16.26%

So for agentic/tool-heavy flows, that’s honestly too high to trust as a primary model right now.
For plain chat/non-tool tasks it may still be fine.

If you want, I can help you define a quick “go/no-go” test (e.g., 50 tool calls, fail if >3–5% error) before you commit.
Also for freshest cheap+reliable picks, check #models plus OpenClaw model docs: https://docs.openclaw.ai/concepts/models#model-discussions-community-notes

steady panther
#

ok define a "go" test

hearty jasperBOT
#

Perfect — here’s a practical GO / NO-GO test you can run before adopting a cheap model for tool use.

1) Test set

Run 50 prompts that match your real workflow:

  • 20 single-tool tasks
  • 20 multi-step tool tasks
  • 10 “no tool needed” tasks (to catch over-calling)

Keep prompts fixed so models are comparable.

2) What to measure

For each run, log:

  1. Tool-call validity rate
    % of tool calls with valid tool name + valid JSON args

  2. Tool hallucination rate
    % calls to non-existent tools

  3. Schema error rate
    % calls rejected due to bad/missing args

  4. Task success rate
    % tasks completed correctly without manual fix

  5. Retry burden
    avg retries needed per task

  6. Latency/cost
    p95 latency and total cost for the 50 runs
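
To make the first three metrics concrete, here is a minimal sketch of how a single tool call could be bucketed. The TOOLS registry, the tool names in it, and the classify_tool_call helper are all hypothetical; a real harness would check against your own tool schemas, and probably against types as well as required keys.

```python
import json

# Hypothetical registry: tool name -> set of required argument names.
TOOLS = {
    "search_web": {"query"},
    "create_ticket": {"title", "body"},
}

def classify_tool_call(name, raw_args, tools=TOOLS):
    """Bucket one tool call as 'hallucinated', 'schema_error', or 'valid'."""
    # A call to a tool that is not registered counts as a hallucination.
    if name not in tools:
        return "hallucinated"
    # Arguments that are not parseable JSON count as a schema error.
    try:
        args = json.loads(raw_args)
    except ValueError:
        return "schema_error"
    # Arguments must be a JSON object containing every required key.
    if not isinstance(args, dict) or not tools[name].issubset(args):
        return "schema_error"
    return "valid"
```

Counting the three buckets across all calls gives the numerators for the validity, hallucination, and schema error rates.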


3) GO thresholds (recommended)

Use this as your gate:

  • Tool-call validity: >= 97%
  • Hallucinated tools: <= 0.5%
  • Schema errors: <= 2%
  • Task success: >= 90%
  • Avg retries: <= 0.2/task
  • p95 latency: within your SLA (set your own target)

If any of the first 4 fail, it’s a NO-GO for primary agent/tool routing.
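
If it helps, the gate could be written as a small function. This is only a sketch: go_no_go and the metric key names are made up for illustration, rates are expressed as fractions (0.97 means 97%), and treating the retry threshold as a warning rather than a hard fail is my assumption based on the "first 4" rule above. The latency check is left out because the target is yours to set.

```python
def go_no_go(metrics):
    """Apply the thresholds to a dict of aggregate metrics (fractions, not %)."""
    # The first four thresholds are hard gates: any failure means NO-GO.
    hard_checks = {
        "tool_call_validity": metrics["tool_call_validity"] >= 0.97,
        "tool_hallucination": metrics["tool_hallucination"] <= 0.005,
        "schema_error": metrics["schema_error"] <= 0.02,
        "task_success": metrics["task_success"] >= 0.90,
    }
    # Assumption: retries only raise a warning, they do not block a GO.
    soft_checks = {
        "avg_retries": metrics["avg_retries"] <= 0.2,
    }
    failed_hard = [name for name, ok in hard_checks.items() if not ok]
    warnings = [name for name, ok in soft_checks.items() if not ok]
    verdict = "NO-GO" if failed_hard else "GO"
    return verdict, failed_hard, warnings
```

The function returns the verdict plus the names of the failed gates, so you can see why a model was rejected.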


4) OpenRouter-specific note

Because routing differs by provider, run the test:

  • once with your normal routing, and
  • once pinned to the provider(s) you expect most traffic on.

A model can look “ok” overall but fail badly on one provider.
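
For the pinned run, a rough sketch of the request body is below. It assumes OpenRouter's provider routing options take a provider object with order and allow_fallbacks fields, which is my understanding of their API but worth checking against the current OpenRouter docs. build_request is a hypothetical helper, and the provider name you pass in is a placeholder for whatever slug OpenRouter lists.

```python
def build_request(model, messages, tools, pinned_provider=None):
    """Build a chat completion body, optionally pinned to one provider."""
    body = {"model": model, "messages": messages, "tools": tools}
    if pinned_provider is not None:
        # Assumed field names; verify against OpenRouter's provider routing docs.
        body["provider"] = {
            "order": [pinned_provider],
            "allow_fallbacks": False,  # fail instead of silently rerouting
        }
    return body
```

Running the same 50 prompts once with pinned_provider=None and once per pinned provider keeps the two result sets directly comparable.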


If you want, I can give you a ready-to-copy CSV scoring template for these 50 runs so you can decide in 10 minutes.

steady panther
#

yes, are you a bot?

hearty jasperBOT
#

Yep — I’m an AI assistant bot helping with OpenClaw support in here 🦐

steady panther
#

give me CSV scoring template please

hearty jasperBOT
#

Nice — here’s a ready-to-copy template.

case_id,prompt_group,requires_tool,expected_tool_sequence,actual_tool_sequence,tool_calls_total,tool_calls_valid,tool_hallucinated_count,schema_errors_count,retries,task_success,latency_ms,input_tokens,output_tokens,cost_usd,provider_route,notes
1,single_tool,yes,"search_web","search_web",1,1,0,0,0,1,1820,420,210,0.00019,novita,"clean run"
2,multi_step,yes,"search_web>summarize","search_web>summarize",2,2,0,0,1,1,4210,980,460,0.00054,akashml,"needed one retry"
3,no_tool,no,"","",0,0,0,0,0,1,940,180,120,0.00007,novita,"correctly avoided tools"
4,single_tool,yes,"lookup_user","lookup_usr",1,0,1,0,0,0,1300,260,90,0.00011,akashml,"hallucinated tool name"
5,single_tool,yes,"create_ticket","create_ticket",1,0,0,1,0,0,1600,300,130,0.00014,novita,"bad args/schema"

And this is the scorecard formula spec (for your sheet):

  • tool_call_validity_rate = SUM(tool_calls_valid) / SUM(tool_calls_total)
  • tool_hallucination_rate = SUM(tool_hallucinated_count) / SUM(tool_calls_total)
  • schema_error_rate = SUM(schema_errors_count) / SUM(tool_calls_total)
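  • task_success_rate = SUM(task_success) / COUNT(case_id)
  • avg_retries = SUM(retries) / COUNT(case_id)
  • p95_latency_ms = PERCENTILE(latency_ms, 0.95)
  • total_cost_usd = SUM(cost_usd)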
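
If you would rather script it than use a sheet, here is a minimal sketch that computes the same rates from the CSV text using only the Python standard library. The scorecard function name and the nearest-rank p95 method are my choices, not something fixed by the template; the column names match the header above.

```python
import csv
import io
import math

def scorecard(csv_text):
    """Aggregate per-case rows from the scoring template into summary metrics."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no test cases in CSV")
    n = len(rows)
    total_calls = sum(int(r["tool_calls_total"]) for r in rows)

    def per_call(column):
        # Rates over tool calls; 0.0 if the run made no tool calls at all.
        if total_calls == 0:
            return 0.0
        return sum(int(r[column]) for r in rows) / total_calls

    # Nearest-rank p95: smallest latency with at least 95% of runs at or below it.
    latencies = sorted(int(r["latency_ms"]) for r in rows)
    p95 = latencies[math.ceil(0.95 * n) - 1]

    return {
        "tool_call_validity_rate": per_call("tool_calls_valid"),
        "tool_hallucination_rate": per_call("tool_hallucinated_count"),
        "schema_error_rate": per_call("schema_errors_count"),
        "task_success_rate": sum(int(r["task_success"]) for r in rows) / n,
        "avg_retries": sum(int(r["retries"]) for r in rows) / n,
        "p95_latency_ms": p95,
        "total_cost_usd": sum(float(r["cost_usd"]) for r in rows),
    }
```

Pass it the file contents, for example scorecard(open("runs.csv").read()), where runs.csv is a placeholder filename. A spreadsheet percentile function may interpolate between values, so its p95 can differ slightly from the nearest-rank value here.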