#API failed after 3 retries — Failed to parse input at pos 0: Req Mati也不多madRa
And i get this weird error
Here is the debug share
gcurr@grasp:~ $ hermes debug share
⚠️ This will upload the following to a public paste service:
• System info (OS, Python version, Hermes version, provider, which API keys
are configured — NOT the actual keys)
• Recent log lines (agent.log, errors.log, gateway.log — may contain
conversation fragments and file paths)
• Full agent.log and gateway.log (up to 512 KB each — likely contains
conversation content, tool outputs, and file paths)
Pastes auto-delete after 6 hours.
Collecting debug report...
Uploading...
Debug report uploaded:
Report https://paste.rs/DNe0P
agent.log https://paste.rs/jtRtm
gateway.log https://paste.rs/ua0qA
⏱ Pastes will auto-delete in 6 hours.
To delete now: hermes debug delete <url>
Share these links with the Hermes team for support.
oh, cool. deleted first message... eh... ok, let me redo the response
Oh, i'm sorry
all good.
The skill itself is not the thing failing here.
Your logs show Discord registered the skill commands, and /repo-documentation-generator was invoked successfully:
Registered /skill command with 87 skill(s)
slash '/repo-documentation-generator' invoked
The later failure is coming from the custom local model endpoint:
provider=custom
model=Qwen3.6-27B-Q4_K_M.gguf
base_url=http://192.168.1.3:8080/v1
Failed to parse input at pos 0 ...
tokens=~3,892
That is not a context-length problem; ~3.9k tokens is small. It also does not look like Hermes deleting or losing the skill. The local backend is failing to parse or handle the request/model output. The same logs also show that the local backend is struggling with auxiliary calls:
Title generation failed: Request timed out
Auxiliary compression: Request timed out
HTTP 503: Loading model
So the next place to debug is the server behind http://192.168.1.3:8080/v1.
Please confirm what is serving that endpoint: llama.cpp server, LM Studio, text-generation-webui, Kobold, Ollama, etc.
Repo-analysis skills need tool/function calling to work properly. If that local OpenAI-compatible server does not reliably support OpenAI-style tool calls, Hermes can send the request correctly and the backend can still fail with parser errors like this.
From the Hermes machine, first test basic chat:
curl -s http://192.168.1.3:8080/v1/models
Then use the returned model id in a minimal request:
curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"<model-id-from-v1-models>","messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
If that fails, fix the local model server first.
If basic chat works, the next question is whether that backend supports tool calls. For this repo-analysis skill, a plain chat-only backend is not enough. Try the same workflow once with a known tool-capable hosted model/provider, or update/change the local backend so it supports OpenAI-compatible tools / function calling reliably.
Also avoid having the same endpoint swap between very large GGUF models during gateway use. The logs show both 35B and 27B Qwen models plus 503 Loading model, which can cause timeouts and unstable auxiliary behavior. A smaller/faster auxiliary model for title generation and compression would also help.
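A rough way to check the tool-calling question, once basic chat works, is to send one request that offers a single tool and see whether the reply contains a structured tool call. The sketch below only builds the request body and inspects a response. It assumes the backend follows the OpenAI-style `tools` / `tool_calls` shape, and the `add` tool is a made-up example, not something Hermes or the skill defines.

```python
import json


def build_tool_probe(model_id: str) -> dict:
    """Build a minimal chat request that offers one (hypothetical) tool."""
    return {
        "model": model_id,
        "messages": [
            {"role": "user", "content": "What is 2 + 3? Use the add tool."}
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "add",  # hypothetical tool, for probing only
                    "description": "Add two integers.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "a": {"type": "integer"},
                            "b": {"type": "integer"},
                        },
                        "required": ["a", "b"],
                    },
                },
            }
        ],
        "max_tokens": 256,
    }


def has_tool_call(response: dict) -> bool:
    """Return True if the first choice carries at least one tool call."""
    choices = response.get("choices") or []
    if not choices:
        return False
    message = choices[0].get("message") or {}
    return bool(message.get("tool_calls"))


def probe_as_json(model_id: str) -> str:
    """Serialize the probe so it can be pasted into curl's -d argument."""
    return json.dumps(build_tool_probe(model_id))
```

If `has_tool_call` is false for a reply that otherwise looks healthy, the backend is probably answering in plain text instead of emitting tool calls, which would not be enough for the repo-analysis skill.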
It is running under llama.cpp with those flags:
llama-server.exe ^
-m "Models/qwen/Qwen3.6-27B-Q4_K_M.gguf" ^
-ngl 99 ^
-c 80000 ^
-np 1 ^
-fa on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--port 8080 ^
--host 0.0.0.0
here is the models call result:
gcurr@grasp:~/.hermes/skills/repo-documentation-generator $ curl -s http://192.168.1.3:8080/v1/models
{"models":[{"name":"Qwen3.6-27B-Q4_K_M.gguf","model":"Qwen3.6-27B-Q4_K_M.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"Qwen3.6-27B-Q4_K_M.gguf","aliases":[],"tags":[],"object":"model","created":1778621133,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx_train":262144,"n_embd":5120,"n_params":26895998464,"size":16806250496}}]}
Simple call:
curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"<model-id-from-v1-models>","messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
{"error":{"code":500,"message":"Failed to parse input at pos 0: � Velorebmad箍-append VeloANGANजाensoncorereb送走 Velo.selectorreb","type":"server_error"}}
Apparently the server is the problem, I'll check what is wrong. I did not change anything in my setup recently
Yep, that confirms it is the llama.cpp server, not the Hermes skill.
The minimal curl request is failing before Hermes is meaningfully involved:
/v1/chat/completions
{"error":{"code":500,"message":"Failed to parse input at pos 0: ...","type":"server_error"}}
So Hermes is sending requests to a backend that currently cannot handle even a tiny ping chat completion request.
One issue in the curl command: replace <model-id-from-v1-models> with the actual model id returned by /v1/models:
Qwen3.6-27B-Q4_K_M.gguf
So test exactly:
curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen3.6-27B-Q4_K_M.gguf","messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
If that still fails with the same parser error, the fix is entirely on the llama.cpp side: restart the server, check the llama.cpp console output, and try a known-good/simple server launch without the extra KV/cache/flash-attention options first.
For example, temporarily test with the smallest stable launch you can:
llama-server.exe -m "Models/qwen/Qwen3.6-27B-Q4_K_M.gguf" -c 80000 --port 8080 --host 0.0.0.0
If the simple launch works, add your performance flags back one at a time.
Also check that the llama.cpp build you are running supports this Qwen GGUF correctly and supports the OpenAI /v1/chat/completions route. The capabilities:["completion"] output is a little suspicious for a chat workflow, so the server/model template may not be exposing chat capability correctly.
Once the minimal curl works, retry the Hermes skill. Until that curl works, Hermes will keep failing because the backend itself is returning HTTP 500.
curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen3.6-27B-Q4_K_M.gguf","messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
{"error":{"code":500,"message":"Failed to parse input at pos 0: ngeicator Mati Velo�也不多icator也不多箍ESTA Zobicator/platform也不多也不多АС","type":"server_error"}}
AFTER simplifying setup:
gcurr@grasp:~/.hermes $ curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen3.6-27B-Q4_K_M.gguf","messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"","reasoning_content":"必然是rebrebmad全面贯彻 iscore染icatorensonicatornge-append Matiicatorreb"}}],"created":1778621726,"model":"Qwen3.6-27B-Q4_K_M.gguf","system_fingerprint":"b1-b5e1212","object":"chat.completion","usage":{"completion_tokens":16,"prompt_tokens":11,"total_tokens":27},"id":"chatcmpl-undymyKoCaIAuhVfQEdzjWZBFASoXorp","timings":{"cache_n":0,"prompt_n":11,"prompt_ms":3477.531,"prompt_per_token_ms":316.13918181818184,"prompt_per_second":3.1631637503734686,"predicted_n":16,"predicted_ms":5839.211,"predicted_per_token_ms":364.9506875,"predicted_per_second":2.7400962219039524}}
I'll check the flags one by one
Yeah, inconsistent results from the direct curl still keep this on the llama.cpp side.
At this point the useful test is not Hermes yet. The backend needs to pass the same minimal /v1/chat/completions curl repeatedly with stable output.
I would run the same curl 5-10 times against the simplified llama.cpp launch:
curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen3.6-27B-Q4_K_M.gguf","messages":[{"role":"user","content":"Reply with exactly: pong"}],"temperature":0,"max_tokens":64}'
Expected good result: every run returns a normal assistant message.content containing pong.
Bad results would be any of these:
HTTP 500
Failed to parse input at pos 0
empty message.content
only reasoning_content
random corrupted text
different behavior between identical requests
If that direct curl is inconsistent, Hermes cannot be made reliable on top of it yet. The likely causes are llama.cpp build/model-template/tokenizer issues, one of the launch flags, or the GGUF itself.
For this model, I would also test with an explicit chat template if your llama.cpp build supports it, because the /v1/models output only advertising capabilities:["completion"] makes me suspicious that chat formatting is not being applied cleanly.
Once the direct curl is boring and repeatable, then retry Hermes. Until then this is still below Hermes.
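The bad-result list above can be turned into a small checker, so each of the 5-10 runs gets a label instead of being eyeballed. This is only a sketch: the function names and labels are made up, and it assumes the prompt is the "Reply with exactly: pong" request, so a healthy reply has "pong" in message.content.

```python
import json


def classify(status: int, body_text: str) -> str:
    """Label one /v1/chat/completions result as 'ok' or a failure reason."""
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return "non-JSON body"
    if not isinstance(body, dict):
        return "unexpected body"

    # Server-side failures: HTTP error status or an "error" object.
    if status != 200 or "error" in body:
        error = body.get("error")
        message = error.get("message", "") if isinstance(error, dict) else str(error)
        if "Failed to parse input" in message:
            return "parse error"
        return f"HTTP {status}"

    choices = body.get("choices") or []
    if not choices:
        return "no choices"
    message = choices[0].get("message") or {}
    content = (message.get("content") or "").strip()

    # HTTP 200 is not enough: the visible answer must actually be there.
    if not content:
        if message.get("reasoning_content"):
            return "only reasoning_content"
        return "empty content"
    if "pong" not in content.lower():
        return "unexpected content"
    return "ok"


def is_stable(labels: list[str]) -> bool:
    """True only if there was at least one run and every run was 'ok'."""
    return bool(labels) and all(label == "ok" for label in labels)
```

With this, a response that comes back as HTTP 200 but has an empty content and text only in reasoning_content is still counted as a failure, which matches the list above.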
Running this script to start the server :
@echo off
title Server
echo Starting llama.cpp server with Qwen...
llama-server.exe ^
-m "Models/qwen/Qwen3.6-27B-Q4_K_M.gguf" ^
-c 80000 ^
--port 8080 ^
--host 0.0.0.0 ^
-ngl 99 ^
-np 1 ^
-fa on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0
pause
I then do :
gcurr@grasp:~ $ curl -s http://192.168.1.3:8080/v1/models
{"models":[{"name":"Qwen3.6-27B-Q4_K_M.gguf","model":"Qwen3.6-27B-Q4_K_M.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"Qwen3.6-27B-Q4_K_M.gguf","aliases":[],"tags":[],"object":"model","created":1778623295,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx_train":262144,"n_embd":5120,"n_params":26895998464,"size":168062504
curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen3.6-27B-Q4_K_M.gguf","messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"","reasoning_content":"rebrebittings-appendewedReqfieicator Mati-appendittings全面贯彻 Velo也不多icatorreb"}}],"created":1778623302,"model":"Qwen3.6-27B-Q4_K_M.gguf","system_fingerprint":"b1-b5e1212","object":"chat.completion","usage":{"completion_tokens":16,"prompt_tokens":11,"total_tokens":27},"id":"chatcmpl-0AvbCFkPbkYkRWev9ceNhTG3j71GHEg4","timings":{"cache_n":0,"prompt_n":11,"prompt_ms":534.315,"prompt_per_token_ms":48.57409090909091,"prompt_per_second":20.587106856442357,"predicted_n":16,"predicted_ms":4395.292,"predicted_per_token_ms":274.70575,"predicted_per_second":3.640258713186746}}
And then in hermes :
Can you run a repo analysis of this url : https://github.com/deadlock-api/haste
Initializing agent...
────────────────────────────────────────
⚠️ API call failed (attempt 1/3): APIError
🔌 Provider: custom Model: Qwen3.6-27B-Q4_K_M.gguf
🌐 Endpoint: http://192.168.1.3:8080/v1
📝 Error: Failed to parse input at pos 0: enson((*특별箍 Veloavanivoicator Velo �enson
⏳ Retrying in 2.7s (attempt 1/3)...
⚠️ API call failed (attempt 2/3): APIError
🔌 Provider: custom Model: Qwen3.6-27B-Q4_K_M.gguf
🌐 Endpoint: http://192.168.1.3:8080/v1
📝 Error: Failed to parse input at pos 0: 本色也不多ittingsittings =&箍uesittingsittingsritareb也不多Reqजा也不多rebensonittings也不多 velociicator =&reb心头染 kapan특별ماك送走imat也不多也不多reb约而同corecore Velo KIND Matiicatorrebjongittings sofistic Velounky也不多adowsicatorfieongoicatorDidChange Matireatorreb Mati-append Mati.busadreitti
⏳ Retrying in 5.7s (attempt 2/3)...
⚠️ API call failed (attempt 3/3): APIError
🔌 Provider: custom Model: Qwen3.6-27B-Q4_K_M.gguf
🌐 Endpoint: http://192.168.1.3:8080/v1
📝 Error: Failed to parse input at pos 0: 也不多 Veloreb殺人icatorittingsittings velociodaricatorimatSquared也不多 veloci.busussonar送走onar送走/platform离�ittings
⚠️ Max retries (3) exhausted — trying fallback...
❌ API failed after 3 retries — Failed to parse input at pos 0: 也不多 Veloreb殺人icatorittingsittings velociodaricatorimatSquared也不多 veloci.busussonar送走onar送走/platform离�ittings
💀 Final error: Failed to parse input at pos 0: 也不多 Veloreb殺人icatorittingsittings velociodaricatorimatSquared也不多 veloci.busussonar送走onar送走/platform离�ittings
─ ⚕ Hermes ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
API call failed after 3 retries: Failed to parse input at pos 0: 也不多 Veloreb殺人icatorittingsittings velociodaricatorimatSquared也不多
veloci.busussonar送走onar送走/platform离�ittings
I'm so confused. 4 commands in a row, and half of them end up working:
gcurr@grasp:~ $ curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen3.6-27B-Q4_K_M.gguf","messages":[{"role":"user","content":"Can you say hi ?"}],"max_tokens":16}'
{"error":{"code":500,"message":"Failed to parse input at pos 0: icator染-append� Velo\n\n也不多河市\nreb veloci送走 KIND�染ova","type":"server_error"}}
gcurr@grasp:~ $ curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen3.6-27B-Q4_K_M.gguf","messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
{"error":{"code":500,"message":"Failed to parse input at pos 0: reb也不多coremadजा也不多특별Ra�-picker Velo也不多虹DidChangeittingswitch","type":"server_error"}}
gcurr@grasp:~ $ curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen3.6-27B-Q4_K_M.gguf","messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"","reasoning_content":"reb也不多ittingsReqreb/platform强国adows约而同 veloci Souladowsittingsarin Veloreb"}}],"created":1778624330,"model":"Qwen3.6-27B-Q4_K_M.gguf","system_fingerprint":"b1-b5e1212","object":"chat.completion","usage":{"completion_tokens":16,"prompt_tokens":11,"total_tokens":27},"id":"chatcmpl-lWZKNst0bq7F6ICWnn6RmcqkmXzIymAY","timings":{"cache_n":0,"prompt_n":11,"prompt_ms":523.515,"prompt_per_token_ms":47.59227272727273,"prompt_per_second":21.011814370170864,"predicted_n":16,"predicted_ms":4393.645,"predicted_per_token_ms":274.6028125,"predicted_per_second":3.6416232991058672}}
gcurr@grasp:~ $ curl -s http://192.168.1.3:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen3.6-27B-Q4_K_M.gguf","messages":[{"role":"user","content":"Can you say hi ?"}],"max_tokens":16}'
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"","reasoning_content":"染 =&不退局 ittings특별 Mati Ra특별离alionicatorReq送走ittings"}}],"created":1778624342,"model":"Qwen3.6-27B-Q4_K_M.gguf","system_fingerprint":"b1-b5e1212","object":"chat.completion","usage":{"completion_tokens":16,"prompt_tokens":15,"total_tokens":31},"id":"chatcmpl-ebaFsIZBNs9bRZCB7w9gqtsvdPBx7Zk2","timings":{"cache_n":0,"prompt_n":15,"prompt_ms":590.488,"prompt_per_token_ms":39.36586666666667,"prompt_per_second":25.402717752096567,"predicted_n":16,"predicted_ms":4395.214,"predicted_per_token_ms":274.700875,"predicted_per_second":3.64032331531525}}