#codex harness
Any update on native plugging into desktop codex? Openclaw and desktop codex work like magic together. I had to hack it together with 2 plugins + cli anything to make it work with thread management and refreshing (refreshing is the hardest part), but it's really easy to collab with your claw on projects and threads in desktop view (very consumer friendly)
Most importantly, it gives the user mobile continuity and remote management of desktop codex, and lets them come back to organized threads and projects
@weary moon I would really like full focus on stability and low hanging optimizations we can make ahead of the announcement
That being said, codex harness opens up to a LOT of synergies with the OAI ecosystem, and many followup features will become possible rapidly.
But let's get things in a stable spot first, so we have a good foundation and sensible architecture before adding more features
I tested doctor --fix and it removes all the pi embedded pins, which finally results in codex auto-routing. Ran 4 live tests on my Mac against current main:
- OAuth-only host (no OPENAI_API_KEY) → openai-codex profile picked, codex harness used
- Mixed-profile (codex OAuth + raw openai:default api-key) with no pin → openai-codex still picked
- Mixed-profile + agents.defaults.agentRuntime.id="pi" pin → doctor strips the pin on --fix, then codex auto-routes
- Mixed-profile + per-agent agents.list[main].agentRuntime.id="pi" pin → same, doctor strips it, codex auto-routes
So the migration path itself looks safe
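For anyone following along, the pin-stripping behavior tested above can be pictured roughly like this. This is only a sketch under assumptions: the config is modeled as a plain dict, agents.list is assumed to be a list of entries keyed by an "id" field, and strip_pi_pins is a hypothetical helper name, not the actual doctor implementation.

```python
import copy


def strip_pi_pins(config: dict) -> tuple[dict, list[str]]:
    """Return a copy of config without agentRuntime pins set to "pi".

    Assumed shape (hypothetical): agents.defaults.agentRuntime.id and
    agents.list[<n>].agentRuntime.id, where each list entry has an "id".
    """
    fixed = copy.deepcopy(config)  # never mutate the caller's config
    removed: list[str] = []
    agents = fixed.get("agents", {})

    # Global default pin: agents.defaults.agentRuntime.id == "pi"
    defaults = agents.get("defaults", {})
    if defaults.get("agentRuntime", {}).get("id") == "pi":
        del defaults["agentRuntime"]
        removed.append("agents.defaults.agentRuntime")

    # Per-agent pins: agents.list[<agent>].agentRuntime.id == "pi"
    for agent in agents.get("list", []):
        if agent.get("agentRuntime", {}).get("id") == "pi":
            del agent["agentRuntime"]
            removed.append(f"agents.list[{agent.get('id', '?')}].agentRuntime")

    return fixed, removed
```

With both pins gone, nothing in the config forces the pi runtime, so whatever auto-routing logic exists is free to pick codex. Pins with any other value are left untouched.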
That's awesome. I am testing more edge cases now.
We’re where I said we’d be a few weeks ago: rebuild and rip out/switch over so I can test. @warm talon the best thing is parity harness comparisons with the same prompts
At this point, full parity QA of pi versus codex is probably best. Do we have one built already?
having trouble understanding what you're asking
The best way to test for differences and edge cases is parity test and scenario 1v1 like we did with 5.4
Test pi vs codex harness
- Create qa parity harness to test pi vs codex harness only runs
- Loop 3 agents on difficult scenarios for testing based on a real jsonl session history
- Test all tools and long runs in harness to get to 100% parity and use to debug all the edge cases
Make sure to have debug logging on each one to catch any weird issues
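A rough sketch of what such a parity comparison could look like, assuming each runtime can be wrapped as a callable that takes a prompt and returns its final text, tool-call names, and token count. RunResult, Runner, and compare_runtimes are hypothetical names for illustration, not an existing harness.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunResult:
    final_text: str
    tool_calls: list[str] = field(default_factory=list)  # tool names, in call order
    tokens: int = 0                                       # total tokens for the run


# A runner wraps one runtime (pi or codex) behind the same interface.
Runner = Callable[[str], RunResult]


def compare_runtimes(prompts: list[str], pi: Runner, codex: Runner) -> list[dict]:
    """Run every prompt through both runtimes and record where they differ."""
    report = []
    for prompt in prompts:
        a, b = pi(prompt), codex(prompt)
        report.append({
            "prompt": prompt,
            "same_tools": a.tool_calls == b.tool_calls,
            "same_final": a.final_text.strip() == b.final_text.strip(),
            "token_delta": b.tokens - a.tokens,  # positive: codex used more
        })
    return report
```

Exact text equality is too strict for real model output, so in practice same_final would need a looser judge. The token_delta column is the piece that would feed a token efficiency report.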
Yeah, not opposed to doing a QA/eval run for this
Do you want to take that project on @weary moon ?
Yeah of course, it's easy for me, I already run em. I just need context, I haven’t looked at that part of cb lately
Should just be codex full harness run tho vs pi
I’ll get it done today
Would like a full report including token efficiency ty
@warm talon While edge-case testing Codex installs, I found a managed npm peer-link failure that can leave @openclaw/codex installed but unable to resolve openclaw/plugin-sdk/*. I opened a focused fix that fails and rolls back the install when the link cannot be repaired; could you review when you have time?
Thank you, merged.
Thank you... though what the heck does "message-tool fixture has no effective message tool under either runtime" even mean?
I clicked on the issue, and it wasn't much more clear what the actual problem is, or how high of a priority it even is. Same goes for all the issues in your list.
It seems that all of these are for the QA harness specifically? Not core functionality. So it's not P0 or P1?
Anyway, i'll take a look just to do a pass to ensure core functionality isn't affected. I don't really mind merging a bunch of stuff to QA harness as long as core functionality isn't touched
Thanks for all your hard work
Okay, if you could explain each bug a bit better with a clear example from a user story perspective, it would be helpful
Still doesn't make sense haha. We explicitly DON'T want codex harness mode calling those tools, and only give a subset of openclaw dynamic tools that aren't already covered by codex harness's native tools.
Can you explain clearly what the problem is? Because it's still not apparent to me.
Or is this all just related to the QA harness?
this does not follow. anyway get some sleep we can revisit this later.
tldr, my general concern now is we're aiming to do a big release off main sometime this week, which will include the change to use codex harness by default for OAI models in openclaw.
currently, my vibe check is THINGS JUST WORK. and they work WELL. In my actual daily use, my main build dev agent is working really damn well. So I don't want to merge a bunch of stuff now that could put this at risk. And I'm not hearing a compelling case yet
That's why it's important to MANUALLY test things right now. I don't know what clankers or random github issues say because those are low signal. There may be weird edge cases with esoteric legacy configs. But if things work for the 90% of users on the happy path, that's what I want. And I don't want AIs to just start "fixing" ghost bugs and put the happy path at risk for this
"Catching all the edge cases" with codex is a recipe for slop gen code and breaking more stuff that people actually care about, so I'd be careful.
Are you testing manually?
I agree we shouldn't couple the massive sqlite refactor with the codex harness release.
But that isn't merged yet AFAIK
Indeed
Hello, I've been chugging away with the codex harness on main. Haven't run into any issues yet, besides one that was already fixed. Smooth sailing so far, but happy to test and help in whatever way I can.
Tyvm!
I'm mostly interested in testing the happy path on your real daily runner. Testing out onboarding, hatching the agent, setting it up with all the cool stuff you normally use your personal agent for, and vibe checking the behavior to make sure it works. Logging in with subscription, etc.
Just for clarity: I haven't tested onboarding or hatching an agent. But I am on main, on my daily driver, been working on some small projects, and no problems so far. With it enabled and configured, the clanker was also able to use codex computer use without any issues as well. Didn't do anything super complex with it, but it worked without any fuss.
Ran into an issue with /btw but I wanna make sure it's related and not something I goofed up first.
@warm talon https://github.com/openclaw/openclaw/issues/80469
Users just using OAuth and not an API key would likely run into this, I think. Might slip by people with an API key configured.
Good find. Will fix
Hey pash, not gonna tag. but briefly ran into issue again where model decided to reply in private/final channel instead of sending a telegram-visible message.send. I'm a noob, so I decided to have a long convo w/ codex, so I could learn more about it, and how complicated that can be, without deterministically blasting stuff into the chat. lol I know ur busy, but figured I'd share for anyone else still learning:
1. Make the turn contract explicit in runtime state.
Not just sourceReplyDeliveryMode = message_tool_only, but something like:
• expectedVisibleOutcome: required | optional | forbidden
• sourceEventKind: user_chat | group_mention | heartbeat | cron | exec_completion | internal
• deliveryTarget: known telegram chat/thread/account
2. Require one of two explicit outcomes for user chat turns.
For a direct user Telegram message in tool-only mode, the assistant should produce either:
• visible delivery: message(action="send", ...)
• explicit silence: a structured no-reply outcome, not just a private final
Right now “private final” and “intentionally no visible reply” collapse into the same observable shape: no Telegram send. That’s the nasty ambiguity.
3. On contract violation, repair with a constrained retry, not blind auto-send.
If the model returns private final text but never calls message.send, runtime could run a tiny corrective continuation:
“This was a visible Telegram direct turn. Send the user-facing answer with the message tool, or explicitly choose no visible reply.”
If that still fails, log it loudly and maybe send a generic failure/fallback only for direct chats.
4. Only add deterministic auto-delivery for very narrow cases.
Example: if the final payload is already classified as user_visible_final by the agent runner, source event is a direct user chat, target is canonical, no tools are pending, not heartbeat/cron/internal, and no visible send happened, then route it through deliverInboundReplyWithMessageSendContext.
The clean product answer is: don’t make the runtime guess intent from text. Make intent part of the turn contract, then enforce that contract[..]
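To make points 1 to 3 concrete, here is a minimal sketch of a turn contract with a constrained retry. The field values come from the list above; everything else (TurnContract, TurnResult, check_turn, enforce_contract, and the run_turn callable) is a hypothetical shape for illustration, not OpenClaw's actual runtime API.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Optional

ExpectedVisibleOutcome = Literal["required", "optional", "forbidden"]
SourceEventKind = Literal[
    "user_chat", "group_mention", "heartbeat", "cron", "exec_completion", "internal"
]

# Corrective continuation text, as quoted in point 3.
CORRECTIVE_PROMPT = (
    "This was a visible Telegram direct turn. Send the user-facing answer "
    "with the message tool, or explicitly choose no visible reply."
)


@dataclass(frozen=True)
class TurnContract:
    expected_visible_outcome: ExpectedVisibleOutcome
    source_event_kind: SourceEventKind
    delivery_target: Optional[str]  # e.g. a known Telegram chat/thread/account


@dataclass(frozen=True)
class TurnResult:
    sent_visible_message: bool  # the model called message(action="send", ...)
    explicit_no_reply: bool     # the model returned a structured no-reply outcome
    private_final_text: str = ""


def check_turn(contract: TurnContract, result: TurnResult) -> str:
    """Return "ok", "retry", or "violation" for one finished turn."""
    if contract.expected_visible_outcome == "forbidden":
        return "violation" if result.sent_visible_message else "ok"
    if contract.expected_visible_outcome == "optional":
        return "ok"
    # "required": a visible send or an explicit silence both satisfy the contract.
    if result.sent_visible_message or result.explicit_no_reply:
        return "ok"
    return "retry"  # private final only: ambiguous, ask the model to fix it


def enforce_contract(
    contract: TurnContract,
    run_turn: Callable[[Optional[str]], TurnResult],
    max_retries: int = 1,
) -> tuple[str, TurnResult]:
    """Run a turn, then re-run with the corrective prompt if the contract is unmet."""
    result = run_turn(None)
    verdict = check_turn(contract, result)
    retries = 0
    while verdict == "retry" and retries < max_retries:
        result = run_turn(CORRECTIVE_PROMPT)
        verdict = check_turn(contract, result)
        retries += 1
    if verdict == "retry":
        verdict = "failed"  # caller should log loudly and maybe send a fallback
    return verdict, result
```

The point of the sketch is that the runtime never inspects the final text to guess intent. It only checks which structured outcome the turn produced against what the contract expected.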
I observed that my config is a variable in this behavior, so I asked, so I could understand that better:
• Your config has:
  • messages.groupChat.visibleReplies = "message_tool"
  • no global messages.visibleReplies
• OpenClaw’s base default is:
  • direct chats: automatic
  • groups/channels: message_tool
• But the Codex harness overrides its source delivery default to:
  • sourceVisibleReplies = "message_tool"
So because this chat is running through the Codex harness, a direct Telegram DM can become message-tool-only even though your config does not explicitly say “direct chats are tool-only.”
That means if I answer with a normal final response instead of message(action="send"), OpenClaw treats that final as private and Telegram sees nothing. Exactly the dumb little trap we stepped in.
The config-side mitigation would be:
messages.visibleReplies = "automatic"
messages.groupChat.visibleReplies = "message_tool"
That would mean:
• Direct chats like this: normal final answers auto-post visibly
• Groups/channels: still require message.send, preserving the safety/noise guardrail
I’d consider that a reasonable local preference for your setup. The risk is that any direct-chat source turn using normal final text becomes visible by default[...]
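The precedence described above can be sketched as a small resolver. This is an assumption-heavy illustration: the lookup order is inferred only from what codex described in this conversation, and resolve_visible_replies, chat_kind, and harness are hypothetical names rather than real OpenClaw identifiers.

```python
def resolve_visible_replies(messages_config: dict, chat_kind: str, harness: str) -> str:
    """Return "automatic" or "message_tool" for a source turn.

    messages_config models the `messages` section of the config.
    chat_kind is "direct" or "group"; harness is "codex" or "pi".
    """
    if chat_kind == "group":
        # Groups: explicit groupChat setting wins. Otherwise both the base
        # default and the Codex harness default are message_tool.
        explicit = messages_config.get("groupChat", {}).get("visibleReplies")
        return explicit or "message_tool"

    # Direct chats: an explicit global setting wins over any harness default.
    explicit = messages_config.get("visibleReplies")
    if explicit:
        return explicit
    if harness == "codex":
        return "message_tool"  # the harness override described above
    return "automatic"         # base default for direct chats


# The config as described: group setting only, no global messages.visibleReplies.
current = {"groupChat": {"visibleReplies": "message_tool"}}
# The suggested mitigation: make direct chats explicitly automatic.
mitigated = {
    "visibleReplies": "automatic",
    "groupChat": {"visibleReplies": "message_tool"},
}

print(resolve_visible_replies(current, "direct", "codex"))    # message_tool
print(resolve_visible_replies(mitigated, "direct", "codex"))  # automatic
print(resolve_visible_replies(mitigated, "group", "codex"))   # message_tool
```

Under these assumptions, the only case that changes with the mitigation is a direct chat running through the Codex harness, which is exactly the case that went silent.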
Anyway, I learned a lot and it really made me appreciate the complexity in trying to get that right, especially in different scopes/contexts.
What model was this with?
Was this mini or some dumb model? or 5.5?
This was 5.5. Sorry was AFK