# Hermes crashed frequently today. Hermes built its own diagnostic tool, and this is the diagnostic log.
I forgot to redact the log, so I'm uploading it again now.
Redaction is done.
Thanks, the redacted log is sufficient.
Several things can be confirmed from this log. First, the error in session_search_tool.py is real: the TypeError: '<=' not supported between instances of 'int' and 'type' in the log matches the code path, and we will fix it. Second, the custom provider path qwen3.5-plus showed clear request failures during the same period, including httpx.ReadTimeout, openai.APITimeoutError, and two occurrences of API call failed after 3 retries: [Errno 2] No such file or directory. Third, the Feishu gateway was connected at the time, so nothing in this log indicates that the gateway itself went down.
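To illustrate what that TypeError message means, here is a minimal sketch. It is not the actual code in session_search_tool.py; `clamp_limit`, `safe_clamp_limit` and their parameters are hypothetical names made up for this example. Python raises exactly this message when an `int` is compared with a class object (for example the type `int` itself), which typically happens when a type leaks into a place where a value was expected.

```python
def clamp_limit(limit, maximum=50):
    """Hypothetical helper: keep a result limit within 1..maximum."""
    # If `limit` is a class such as `int` rather than a number,
    # this comparison raises the TypeError quoted from the log.
    if 1 <= limit:
        return min(limit, maximum)
    return 1


def safe_clamp_limit(limit, maximum=50, default=10):
    """Same helper, but falls back to a default for non-integer input."""
    if isinstance(limit, bool) or not isinstance(limit, int):
        return default
    return clamp_limit(limit, maximum)


if __name__ == "__main__":
    try:
        clamp_limit(int)  # a class, not a value
    except TypeError as exc:
        print(exc)
    print(safe_clamp_limit(int))
```

Validating the argument type before comparing, as in `safe_clamp_limit`, is one possible way to turn this kind of failure into a harmless fallback; the real fix may look different.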
But this log alone does not support the conclusion that "uv-managed Python is broken". Hermes records the session_search TypeError as a warning, so it does not directly explain the final Segmentation fault (core dumped). The checkpoint_manager git error is also a real problem, but it looks more like a workspace/checkpoint state issue than the direct root cause of a native segfault.
To pin down the actual segfault, we need the raw native crash information. Please reproduce the crash once in the same WSL instance and immediately send the output of the following:
dmesg | tail -100
If coredumpctl is available on the system, also send:
coredumpctl info --no-pager
If a core file was generated in the current directory or your home directory, please also send the file name together with the output of this command:
file core*
Also, please confirm whether the Segmentation fault hits only the interactive CLI venv/bin/hermes, or whether gateway run crashes along with it. The existing log looks more like the former.
If you want to work around the confirmed non-crash problems in the meantime, avoid triggering session_search for now and watch whether the custom provider path qwen3.5-plus still reports timeouts / [Errno 2] at the same time. We will fix session_search_tool.py first based on what is already confirmed, but the real root cause of the process dying with a Segmentation fault depends on the native crash output requested above.
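If it is easier to gather everything in one go, the three commands requested above can be wrapped in a small script. This is only a sketch: the function names `run_if_available` and `collect_crash_info` are made up for this example, and it looks for core files in the current directory only.

```python
import glob
import shutil
import subprocess


def run_if_available(argv, timeout=30):
    """Run a command and return its combined output, or a note if it is missing."""
    if shutil.which(argv[0]) is None:
        return f"[skipped: {argv[0]} not installed]"
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout, check=False
        )
    except subprocess.TimeoutExpired:
        return f"[timed out after {timeout}s]"
    return (proc.stdout + proc.stderr).strip()


def collect_crash_info():
    """Gather the requested outputs into one report string."""
    sections = []
    # Equivalent of `dmesg | tail -100`.
    dmesg_lines = run_if_available(["dmesg"]).splitlines()[-100:]
    sections.append(("dmesg | tail -100", "\n".join(dmesg_lines)))
    sections.append(
        ("coredumpctl info --no-pager",
         run_if_available(["coredumpctl", "info", "--no-pager"]))
    )
    cores = sorted(glob.glob("core*"))
    if cores:
        sections.append(("file core*", run_if_available(["file", *cores])))
    else:
        sections.append(("file core*", "[no core* files in the current directory]"))
    return "\n\n".join(f"== {title} ==\n{body}" for title, body in sections)


if __name__ == "__main__":
    print(collect_crash_info())
```

Run it right after a crash, from the directory where the crash happened, and redact the output before posting it.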
coredumpctl was not installed before; I installed it only after this crash, so I can provide coredumpctl info --no-pager for any later crash. The process that segfaults is the interactive CLI, venv/bin/hermes.
Here is the last terminal output from this crash:
─────────────────────────────────────
● We are now continuing the previous ACP Router v2 task here. You need to check the current progress again. The previous progress was synchronized in the Feishu session.
─────────────────────────────────────
┊ 💬 Let me search for progress synchronization records and previous work in Feishu sessions:
💾 Memory updated
┊ 🔍 recall "ACP Router v2 WebSocket development progress" 38.7s
┊ 💬 Let me check the current status and files of Router v2:
┊ 💻 $ ls -la /home/lodo/acp-router-v2/ && echo "" && ps aux | grep -E "(acp-router|router-v2)" | grep -v grep 0.5s
┊ 💻 $ cat /tmp/router-v2.log 2>/dev/null | tail -30 && echo "" && curl -s http://localhost:8768/health 2>/dev/null || echo "Router v2 is not running" 0.2s
┊ 💻 $ ls -la /home/lodo/acp-router-v2/tests/ && cat /home/lodo/acp-router-v2/tests/test_e2e.py 2>/dev/null | head -50 0.3s
┊ 📚 skill acp-router-v2-development 0.0s
┊ 💬 Okay, I know the progress now. Let me verify the current status and continue:
┊ 💻 $ echo "=== ACP Router v2 progress verification ===" && echo "" && echo "1. Router v2 status:" && curl -s http://localhost:8768/health | python3 -m json.tool && echo "" && echo "2. Production Router (v1) status:" && curl -s http://localhost:8766/status | python3 -m json.tool 2>/dev/null || echo " v1 is not running" && echo "" && echo "3. Registered agents (v1):" && curl -s http://localhost:8766/agents | python3 -m json.tool 2>/dev/null | head -20 || echo "Unable to obtain" 0.3s
┊ 📋 plan reading tasks 0.0s
┊ 💬 Let me create a TODO list and continue the task:
┊ 📋 plan 5 task(s) 0.0s
📋 planning 5 task(s)
⚕ qwen3.5-plus │ 121K/1M │ [█░░░░░░░░░] 12% │ 12h
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
⚕ ❯ Segmentation fault (core dumped)upt, Ctrl+C to cancel
lodo@DESKTOP-VP65LK2:~/.hermes-agent$ venv/bin/hermes───
I repaired Hermes using openclaw. Here is openclaw's repair report:
2026-04-15 - ACP Router v2 development and acp_a2a_bridge crash fix
Problem description
During the development and testing of ACP Router v2, acp_a2a_bridge.py crashed frequently:
Segmentation fault (core dumped)
PermissionError: [Errno 1] Operation not permitted
ValueError: Pipe transport is only for pipes, sockets and character devices
Root cause
1. Runtime restrictions of acp_a2a_bridge.py
- Uses acp.run_agent() to communicate with ACP clients via stdio
- Requires a real terminal or pipe (stdin/stdout)
- Cannot run in subprocess, code_execution or background mode
- When launched through a hermes tool or script, the file descriptors are not real pipes, so asyncio cannot register them
2. A2A Hub is not implemented
- acp_a2a_bridge.py depends on A2A Hub (port 8765)
- The service is currently not implemented/not running
- Port 8765 is actually the default port of the Feishu webhook, not a standalone A2A Hub
3. Router v1 and v2 coexistence
- Router v1 (acp_router.py) runs on port 8766
- Router v2 runs on port 8768
- Their functionality overlaps; the recommendation is to clean up v1 and focus on v2
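The stdio restriction in root cause 1 can be checked before starting a stdio service. The sketch below classifies file descriptors 0 and 1 by type. As far as I understand it, asyncio's Unix pipe transports accept only pipes, sockets and character devices, which is what the ValueError in the report says, and a regular file cannot be registered with epoll, which would fit the PermissionError: [Errno 1]. The helper names `fd_kind` and `stdio_usable_for_pipe_transport` are hypothetical and are not part of acp or Hermes.

```python
import os
import stat


def fd_kind(fd):
    """Classify a file descriptor by the kind of object it refers to."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISCHR(mode):
        return "character device"  # includes terminals
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


def stdio_usable_for_pipe_transport():
    """True if stdin (0) and stdout (1) are both a pipe, socket or character device."""
    allowed = {"pipe", "socket", "character device"}
    try:
        return all(fd_kind(fd) in allowed for fd in (0, 1))
    except OSError:
        return False  # a descriptor is closed or otherwise unusable


if __name__ == "__main__":
    for name, fd in (("stdin", 0), ("stdout", 1)):
        try:
            print(f"{name}: {fd_kind(fd)}")
        except OSError as exc:
            print(f"{name}: unavailable ({exc})")
    print("usable for stdio transport:", stdio_usable_for_pipe_transport())
```

Running this from the same launcher that starts acp_a2a_bridge.py shows what the bridge actually receives as stdin and stdout. A result of "regular file" would be consistent with the errors in the report.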
Solution
Immediate measures
- Shut down Router v1:
  pkill -f "acp_router.py"
- Start acp_a2a_bridge.py the correct way (if needed):
  # Run under tmux so the process has a real terminal
  tmux new-session -d -s acp_bridge \
    "cd ~/.hermes-agent && source venv/bin/activate && python acp_a2a_bridge.py"
  # View the log
  tmux attach -t acp_bridge
  # Detach: Ctrl+B, then D
- Or temporarily disable the bridge service:
  - Focus on the core Router v2 functionality first
  - Implement A2A Hub separately before testing the bridge
Long-term plan
1. Implement the A2A Hub service (port 8765)
2. Modify acp_a2a_bridge.py to support an HTTP mode (optional)
3. Clean up Router v1 code and configuration
Current service status
| Service | Port | Status |
|---|---|---|
| Hermes Dashboard | 9119 | ✅ |
| Hermes Gateway | - | ✅ |
| ACP Router v2 | 8768 | ✅ |
| ACP Router v1 | 8766 | ❌ Closed |
| A2A Hub | 8765 | ❌ Not implemented |
| acp_a2a_bridge | - | ⚠️ Requires tmux to run |
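The port column of this table can be re-checked at any time with a short script. This is a sketch that only tests whether something accepts TCP connections on localhost; it cannot tell which program owns the port. The port numbers are taken from the table, and the helper name `is_listening` is made up for this example.

```python
import socket

# Ports taken from the status table above.
SERVICES = {
    "Hermes Dashboard": 9119,
    "ACP Router v2": 8768,
    "ACP Router v1": 8766,
    "A2A Hub": 8765,
}


def is_listening(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "listening" if is_listening(port) else "not listening"
        print(f"{name:<18} {port:<5} {state}")
```

Because the report notes that 8765 is also the default Feishu webhook port, a listener on 8765 does not by itself mean that an A2A Hub is running.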
Related files
- /home/lodo/.hermes-agent/acp_a2a_bridge.py - bridge script
- /home/lodo/.hermes-agent/acp_router.py - Router v1
- /home/lodo/acp-router-v2/ - Router v2 project
- /home/lodo/.hermes-agent/ACP_A2A_BRIDGE.md - bridge documentation
Lessons learned
- Scripts that use acp.run_agent() must run in a real terminal environment
- Do not start stdio services through hermes code_execution or subprocess
- Background services should run under tmux or systemd
- Router v1 and v2 should not run at the same time, to avoid port and configuration conflicts
Adding Hermes's self-diagnosis report as a supplement.
Thanks, this log is helpful.
What we can now confirm more clearly is that Hermes Gateway was still running normally when the crash happened and the Feishu connection was online, so what died this time was again the interactive venv/bin/hermes CLI, not the whole gateway.
This log also confirms that two known problems are still present:
TypeError: '<=' not supported between instances of 'int' and 'type' in session_search_tool.py
and the timeouts and [Errno 2] No such file or directory on the qwen3.5-plus custom provider path
But this diagnostic report by itself still cannot directly explain the native Segmentation fault. The most useful evidence for the native crash remains the dmesg output you sent earlier, because it already shows that a Python process under WSL is receiving SIGSEGV.
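One optional way to get more out of a SIGSEGV in a Python process is the standard-library faulthandler module, which dumps the Python-level traceback of every thread when a fatal signal arrives. The sketch below shows how it is enabled. Whether the Hermes CLI already does this is not something this log shows, and `enable_crash_traceback` and the log path are hypothetical names chosen for this example.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Ask CPython to dump Python tracebacks on fatal signals such as SIGSEGV."""
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    # The file must stay open for the lifetime of the process, because
    # faulthandler writes to its file descriptor at crash time.
    log_file = open(log_path, "a")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    handle = enable_crash_traceback("/tmp/hermes-faulthandler.log")  # example path
    print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without code changes by setting PYTHONFAULTHANDLER=1 in the environment or starting the interpreter with -X faulthandler. The traceback shows which Python code was running at the time of the crash, which complements the native information from dmesg and coredumpctl but does not replace it.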
There is now one new key point to verify: your earlier dmesg output showed python3.12, but the Python stack in this log clearly points to the uv-managed cpython-3.11.15 path. Please send the output of these two commands, run in the same environment that crashes:
venv/bin/python -V
readlink -f venv/bin/python
If the crash reproduces again, now that coredumpctl is installed, please also send this right away:
coredumpctl info --no-pager
That will let us determine whether a single interpreter is crashing or whether two different Python paths have been mixed into the environment.
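The same question can also be answered from inside the interpreter, which removes any doubt about which binary a wrapper script really launches. A minimal sketch, with the function name `interpreter_identity` made up for this example:

```python
import json
import os
import sys


def interpreter_identity():
    """Describe the interpreter that is actually running this code."""
    return {
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": sys.executable,
        # Follows symlinks, comparable to `readlink -f`.
        "resolved": os.path.realpath(sys.executable),
        "prefix": sys.prefix,
        "base_prefix": sys.base_prefix,
        # Inside a virtual environment, prefix and base_prefix differ.
        "in_venv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    print(json.dumps(interpreter_identity(), indent=2))
```

Save it to a file and run it with venv/bin/python from the crashing environment. If "version" says 3.12 while "resolved" points at a cpython-3.11 directory, or the other way around, that confirms the mismatch.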
Thanks. After going back over the debug link you posted this time, these pieces of information now point in a clearer direction.
This now looks less like two independent problems and more like the scholar environment itself being in a confused state, with several symptoms surfacing at once.
The most important new signal is this:
The top of the dump shows that the current runtime is Python 3.12.3, yet package paths inside the same scholar profile show up as ~/.hermes/profiles/scholar/.venv/lib/python3.11/site-packages/...
This suggests that the scholar virtual environment has most likely picked up content from two different Python versions. It is not simply a case of "profiles are independent, so versions can differ"; the profile's own .venv is no longer in a clean state. That alone is enough to explain why you are seeing configuration/import errors, random hangs, and finally the CLI segfault all at the same time.
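A quick way to check for this kind of mix is to list the lib/pythonX.Y directories inside the virtual environment and compare them with the interpreter that is running. The sketch below assumes the usual Linux venv layout (`<venv>/lib/pythonX.Y/site-packages`); the function names are hypothetical, and the .venv path is the one from the dump.

```python
import os
import re
import sys
from pathlib import Path


def venv_python_dirs(venv_path):
    """Return the X.Y versions that have a lib/pythonX.Y directory in the venv."""
    lib = Path(venv_path) / "lib"
    versions = set()
    if lib.is_dir():
        for entry in lib.iterdir():
            match = re.fullmatch(r"python(\d+\.\d+)", entry.name)
            if match and entry.is_dir():
                versions.add(match.group(1))
    return sorted(versions)


def check_venv(venv_path, running=None):
    """Return a list of human-readable problems; an empty list means no mismatch found."""
    if running is None:
        running = "{}.{}".format(*sys.version_info[:2])
    found = venv_python_dirs(venv_path)
    problems = []
    if len(found) > 1:
        problems.append("multiple Python versions in lib/: " + ", ".join(found))
    if found and running not in found:
        problems.append(f"running Python {running} has no lib/python{running} directory")
    return problems


if __name__ == "__main__":
    venv = os.path.expanduser("~/.hermes/profiles/scholar/.venv")
    for problem in check_venv(venv) or ["no mismatch found"]:
        print(problem)
```

Note that the default for `running` is the version of whichever interpreter executes the script, so it should be run with the same Python that Hermes uses for the scholar profile.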
In addition, a few more things can be confirmed from this batch of logs:
The session_search TypeError is a real problem, but we already have a follow-up fix for it.
The custom qwen3.5-plus provider path was indeed unstable at the time.
Feishu gateway conflicts also occurred earlier, so this does not look like a single point of failure.
So the most sensible order now is:
Update to the latest version first and try again.
Your dump is still on v0.9.0 (2026.4.13), and the areas you ran into all received fixes in the days after that release, including session_search, interrupt, profile/HERMES_HOME paths, and some terminal hang paths.
But I don't recommend stopping at "update and see", because the log itself suggests that the scholar profile's .venv contains a mix of Python 3.11 and 3.12.
So after updating, please run these directly in the scholar environment that crashes and send the output:
venv/bin/python -V
readlink -f venv/bin/python
python3 -V
readlink -f "$(command -v python3)"
If the interpreter and the .venv still turn out to be inconsistent, the next step is not more guessing but rebuilding the virtual environment of the scholar profile.
If the crash reproduces again after the update, send these right away as well:
dmesg | tail -100
coredumpctl info --no-pager
This batch of information is already enough to show that the core of the problem is not any single ticket; the runtime environment of the scholar profile itself needs to be straightened out first.