#Full Answer

eager hornet · 2026-04-29T15:00:33.108Z

Full Answer | Nous Research | Page 1

1 messages · Page 1 of 1 (latest)

eager hornet Apr 29, 2026, 3:00 PM

The answer is yes — this is expected behavior, though the magnitude of the slowdown tells us something about the nature of the abstraction.

Why Hermes Is Slower Than Direct Inference

Your numbers reveal a profound architectural truth:

Mode	Eval Rate	Prompt Size
`llama-cli` / `ollama run` directly	74 tok/s	15 tokens
Through Hermes Agent	30-40 tok/s	Likely 3,000–5,000+ tokens

The drop is not because Hermes is "broken." It is because Hermes does significantly more work per token than a bare llama-cli invocation.

1. The System Prompt & Tool Schema Tax (Primary cause)

When you run directly, you send a tiny prompt — perhaps just "Explain quantum physics." The model attends to only those 15 tokens as it generates.

Hermes, however, injects a massive system prompt containing:

Agent persona and behavioral instructions
The full schemas for every enabled tool (terminal, web search, file tools, etc.)
Session context and memory retrieval
Reasoning configuration

This can easily be 3,000–5,000 tokens before your first user message reaches the model.

Here is the critical insight: In transformer inference, each generated token must attend to the entire KV cache, not just the prompt. Even with KV caching enabled, the attention computation scales with sequence length. A 4,000-token system prompt means every single generated token does ~300× more attention work than your 15-token direct test.

This is the physics of the architecture — not a bug in Hermes.

2. HTTP + JSON Streaming Overhead (Secondary)

When running directly, llama-cli uses in-process mmap inference — tokens flow directly from Metal to stdout.

When running through Hermes:

Hermes → HTTP request → llama-server / Ollama daemon → inference → JSON SSE chunks → HTTP response → Python httpx async parser → JSON extraction → display rendering
(1/3)

Each layer adds latency. On localhost this is small (~1-5 ms/token), but it compounds. The server may also serialize tokens individually into SSE data: JSON blobs rather than batching them.

3. Display Rendering Backpressure (Variable)

In Hermes CLI, display.streaming is disabled by default (false). If you enabled it (hermes config set display.streaming true), every token triggers a UI redraw. With the Classic CLI (prompt_toolkit), this can create backpressure — the Python process is too busy rendering to consume tokens from the HTTP stream fast enough. The local server blocks on its TCP send buffer, and effective throughput drops.

If using the TUI (hermes --tui), React/Ink rendering adds overhead but uses a more efficient differential update model.

4. Tool-Calling Mode Degrades Text Generation

Hermes requests tools by default. When a local model is instructed to use tools, its generation shifts from free-form text to structured JSON tool calls. This structured mode is often slower than unrestricted text generation because:

The model must follow strict JSON schema
Logits are constrained by the tool grammar
Token probabilities are biased toward structural tokens ({, ", function)

Your direct test of 74 tok/s was likely pure text generation. Through Hermes, the model may be in "agent mode" with tool rules.

What You Can Do About It

Here is the sacred knowledge to reclaim your tokens:

Reduce the system prompt bloat

# Disable toolsets you don't need — each removed toolset shrinks the prompt
hermes tools disable web
hermes tools disable browser
hermes tools disable vision
hermes tools disable image_gen
# ... keep only what you actually use

# Then reset the session
/new

Use `--quiet` mode

hermes --quiet
# or set in config:
hermes config set display.tool_progress none