#Full Answer
The answer is yes — this is expected behavior, though the magnitude of the slowdown tells us something about the nature of the abstraction.
Why Hermes Is Slower Than Direct Inference
Your numbers show where the difference comes from:
| Mode | Eval Rate | Prompt Size |
|---|---|---|
| llama-cli / ollama run directly | 74 tok/s | 15 tokens |
| Through Hermes Agent | 30-40 tok/s | Likely 3,000–5,000+ tokens |
The drop is not because Hermes is "broken." It is because Hermes does significantly more work per token than a bare llama-cli invocation.
1. The System Prompt & Tool Schema Tax (Primary cause)
When you run directly, you send a tiny prompt — perhaps just "Explain quantum physics." The model attends to only those 15 tokens as it generates.
Hermes, however, injects a massive system prompt containing:
- Agent persona and behavioral instructions
- The full schemas for every enabled tool (terminal, web search, file tools, etc.)
- Session context and memory retrieval
- Reasoning configuration
This can easily be 3,000–5,000 tokens before your first user message reaches the model.
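Here is the critical insight: in transformer inference, each generated token attends to every position already in the KV cache, which holds the whole prompt plus everything generated so far. KV caching avoids recomputing keys and values for earlier tokens, but the attention step for each new token still scales linearly with sequence length. With a 4,000-token system prompt, every generated token attends to roughly 270× more positions than in your 15-token direct test (4,000 / 15 ≈ 267). Attention is only part of the per-token cost, since the weight matrix multiplications do not depend on context length, which is why throughput drops by roughly 2× (74 to 30-40 tok/s) rather than by 270×.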
This is the physics of the architecture — not a bug in Hermes.
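As a rough sanity check on the numbers above, the following sketch compares the growth in attended positions with the observed slowdown. The 4,000-token figure is the estimate used in this answer, not a measured prompt size, and the function names are made up for illustration.

```python
def attention_ratio(long_ctx: int, short_ctx: int) -> float:
    """How many times more cached positions each new token attends to."""
    return long_ctx / short_ctx


def ms_per_token(tok_per_s: float) -> float:
    """Convert an eval rate in tokens/second to milliseconds per token."""
    return 1000.0 / tok_per_s


# Assumed values: 15-token direct prompt, ~4,000-token Hermes prompt (estimate).
ratio = attention_ratio(4000, 15)
direct_ms = ms_per_token(74)
hermes_fast_ms = ms_per_token(40)
hermes_slow_ms = ms_per_token(30)

print(f"attended positions: {ratio:.0f}x more")
print(f"direct: {direct_ms:.1f} ms/token")
print(f"hermes: {hermes_fast_ms:.1f} to {hermes_slow_ms:.1f} ms/token")
print(f"observed slowdown: {hermes_fast_ms / direct_ms:.2f}x to {hermes_slow_ms / direct_ms:.2f}x")
```

The attended-position count grows by a factor in the hundreds while the time per token only about doubles, which is consistent with attention being a modest share of per-token cost at short context and a growing share as the prompt gets longer.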
2. HTTP + JSON Streaming Overhead (Secondary)
When running directly, llama-cli uses in-process mmap inference — tokens flow directly from Metal to stdout.
When running through Hermes:
- Hermes → HTTP request → llama-server / Ollama daemon → inference → JSON SSE chunks → HTTP response → Python httpx async parser → JSON extraction → display rendering
Each layer adds latency. On localhost this is small (~1-5 ms/token), but it compounds. The server may also serialize each token into its own SSE "data:" JSON event rather than batching several tokens per event.
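To make the per-token parsing cost concrete, here is a minimal sketch of what a client does with each SSE line. It assumes an OpenAI-style streaming payload (a "choices" list with a "delta" carrying "content"), which is the shape llama-server and Ollama expose on their OpenAI-compatible endpoints. Whether Hermes parses the stream exactly this way is an assumption, and iter_sse_tokens is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used by OpenAI-style servers
        chunk = json.loads(payload)  # one JSON decode per streamed event
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            yield text


# Hand-written sample stream, not captured from a real server.
sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_tokens(sample)))
```

Each event costs a line split, a JSON decode, and a few dictionary lookups. That is cheap individually, but it runs once per token on top of the HTTP transport.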
3. Display Rendering Backpressure (Variable)
In Hermes CLI, display.streaming is disabled by default (false). If you enabled it (hermes config set display.streaming true), every token triggers a UI redraw. With the Classic CLI (prompt_toolkit), this can create backpressure — the Python process is too busy rendering to consume tokens from the HTTP stream fast enough. The local server blocks on its TCP send buffer, and effective throughput drops.
If using the TUI (hermes --tui), React/Ink rendering adds overhead but uses a more efficient differential update model.
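One general way a streaming UI can avoid redrawing on every token is to coalesce tokens and redraw at most once per time interval. The sketch below illustrates that idea only; it is not how Hermes renders, and render_coalesced, draw, and min_interval_s are hypothetical names.

```python
import time
from typing import Callable, Iterable, List


def render_coalesced(
    tokens: Iterable[str],
    draw: Callable[[str], None],
    min_interval_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Buffer tokens and call draw() at most once per min_interval_s.

    Returns the number of redraws performed.
    """
    buffer: List[str] = []
    last_draw = clock()
    redraws = 0
    for tok in tokens:
        buffer.append(tok)
        now = clock()
        if now - last_draw >= min_interval_s:
            draw("".join(buffer))  # one redraw covers several tokens
            buffer.clear()
            last_draw = now
            redraws += 1
    if buffer:
        draw("".join(buffer))  # flush whatever is left at end of stream
        redraws += 1
    return redraws


drawn: List[str] = []
count = render_coalesced(["a", "b", "c", "d"], drawn.append, min_interval_s=0.0)
print(count, drawn)
```

With an interval of zero every token triggers a redraw, which is the behavior that causes backpressure. A larger interval trades a little display latency for far fewer redraws, so the reader loop gets back to the HTTP stream sooner.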
4. Tool-Calling Mode Degrades Text Generation
Hermes requests tools by default. When a local model is instructed to use tools, its generation shifts from free-form text to structured JSON tool calls. This structured mode is often slower than unrestricted text generation because:
- The model must follow strict JSON schema
- Logits are constrained by the tool grammar
- Token probabilities are biased toward structural tokens ({, ", function)
Your direct test of 74 tok/s was likely pure text generation. Through Hermes, the model may be in "agent mode" with tool rules.
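The "logits are constrained" point can be illustrated with a toy example of grammar masking: before picking a token, every token the grammar disallows is masked out. This is a sketch of the general technique with a made-up four-token vocabulary, assuming the server applies grammar-constrained sampling at all; constrained_argmax is a hypothetical name.

```python
import math
from typing import Dict, Set


def constrained_argmax(logits: Dict[str, float], allowed: Set[str]) -> str:
    """Pick the highest-scoring token among those the grammar allows."""
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    return max(masked, key=lambda tok: masked[tok])


# Toy scores, invented for illustration.
logits = {"Sure": 3.2, "{": 1.1, '"': 0.4, "function": 0.2}

free_choice = max(logits, key=lambda tok: logits[tok])
tool_choice = constrained_argmax(logits, allowed={"{"})
print(free_choice, tool_choice)
```

Unconstrained, the model would start with prose. With the mask applied, it is forced to open a JSON object. The masking pass runs over the vocabulary for every generated token, which is one source of extra per-token work in structured mode.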
What You Can Do About It
Here is how to recover some of that throughput:
Reduce the system prompt bloat
# Disable toolsets you don't need — each removed toolset shrinks the prompt
hermes tools disable web
hermes tools disable browser
hermes tools disable vision
hermes tools disable image_gen
# ... keep only what you actually use
# Then reset the session
/new
Use --quiet mode
hermes --quiet
# or set in config:
hermes config set display.tool_progress none