How do you handle tool-call consistency when routing across multiple providers | Nous Research | Page 1

lament viper Apr 28, 2026, 3:16 AM

#

karmic shard Apr 28, 2026, 3:20 AM

#

◆ On your actual dump

The "palm tree" issue is not a routing bug. Your config:

Model / Provider: default / custom (endpoint: http://192.168.1.88:5000/v1/)


You're on a local OpenAI-compatible endpoint (llama.cpp / vllm / LM Studio / similar). Symptoms line up:

1. "Assistant response prefill is incompatible with enable_thinking" — that's your backend rejecting Hermes' prefill retry. It's a server-side restriction on the endpoint, not Hermes.
2. HTTP 502/500 from the endpoint — not a Hermes routing issue, the backend itself is flaking under load (msgs=105-113, ~27k tokens).
3. The agent claiming "palm tree on/off" without calling a tool — your local model is hallucinating tool results instead of emitting structured tool_calls. Most small/mid local models do this when the tool schema gets long or context fills up. Hermes has no text-based tool-call parser in the main loop by design (rejected PR #7628) — if the model writes "I turned it off" as content instead of emitting a tool_calls array, Hermes sees that as a final answer.

Fixes in priority order:
- Set model.context_length in config.yaml to your model's real limit (probe defaulted to 256k, likely wrong)
- Check whether your endpoint supports native tool calling. If it's llama.cpp, you need --chat-template with tool support and a model trained for it (Hermes 3, Qwen2.5-Instruct, Llama-3.1-Instruct all work). If it's returning JSON as content, fix the endpoint's tool-call extraction — not Hermes.
- Turn off enable_thinking in the endpoint or stop sending prefill retries. Those two features conflict on your backend.
- The 407s "Continue" latency + 502s suggest the local server is overloaded. 27k token replays through a local model are expensive.

lament viper Apr 28, 2026, 3:20 AM

#

I created my own litellm router, or should i use HA router instead? But this router is full of freebies yet. No paid

karmic shard Apr 28, 2026, 3:22 AM

#

What's "HA router" — HuggingFace router, or something else? Not a term I recognize in this space.

Real answer though: the router isn't your problem. Tool-call consistency is determined by the underlying model the router dispatches to, not the router itself.

◆ LiteLLM as your router

LiteLLM is fine as a Hermes backend — it speaks OpenAI-compatible, which is all Hermes needs. Two gotchas specific to your setup:

1. Free-tier models often don't do tool calling well. Free endpoints on LiteLLM (HuggingFace free inference, free Groq tier, free OpenRouter models marked :free, Gemini free tier, etc.) frequently route to base/instruct models without proper tool-call training, or the provider strips the tools param. When that happens, the model writes "I turned the palm tree off" as content, which is exactly your symptom.

2. LiteLLM's tool_choice and tools passthrough varies by upstream. Some upstreams return tool calls in content, some in a tool_calls field, some not at all. LiteLLM tries to normalize but it's best-effort. Run a quick test: send a request with a tool defined and check whether the response has choices[0].message.tool_calls populated or if the tool invocation is hallucinated into content.

◆ What to actually use

If you want free + tool-calling that works reliably with Hermes, your best bets:
- OpenRouter free tier with models that explicitly support tools: google/gemini-2.5-flash-lite-preview (free), meta-llama/llama-3.3-70b-instruct:free — these do native tool calls
- Groq free tier — llama-3.3-70b-versatile does tool calling cleanly, and it's fast
- Gemini free tier direct (no router needed) — Hermes has native Gemini support, just hermes setup → Gemini

Your current 192.168.1.88:5000 — is that LiteLLM proxying to free upstreams, or is it a local llama.cpp/vllm that LiteLLM sits in front of? The fix differs.

If you tell me which models LiteLLM is routing to, I can tell you which ones emit real tool_calls vs which hallucinate in content.

lament viper Apr 28, 2026, 3:22 AM

#

Perhaps my router keeps shuffling new models each tool call. Perhaps I'll try thinking off on my router. But Ahmad highly suggests Thinking ON lol

karmic shard Apr 28, 2026, 3:23 AM

#

who is Ahmad?

#

I would just set various models directly through hermes and it should handle it, since I haven't seen anyone complain about this issue specifically

#

If you want Hermes-side routing behavior (fallback to OpenRouter when custom endpoint 502s), that's fallback_model + fallback provider config in config.yaml — but you have no API keys set for any remote provider, so fallback can't help until you add one.

lament viper Apr 28, 2026, 3:27 AM

#

karmic shard who is Ahmad?