#codex harness

weary moon
#

Any update on native plugging into desktop codex? Openclaw and desktop codex work like magic together. I had to hack it together with 2 plugins + cli anything to make it work with thread management and refreshing (refreshing is the hardest part), but it's really easy to collab with your claw on projects and threads in desktop view (very consumer friendly)

#

Most importantly, it gives the user mobile continuity and remote management of desktop codex, and lets them come back to organized threads and projects

warm talon
#

@weary moon I would really like full focus on stability and low-hanging optimizations we can make ahead of the announcement

That being said, codex harness opens up a LOT of synergies with the OAI ecosystem, and many followup features will become possible rapidly.

But let's get things in a stable spot first, so we have a good foundation and sensible architecture before adding more features

little wedge
#

I tested doctor --fix and it removes all the pi embedded pins, which finally results in codex auto-routing. Ran 4 live tests on my Mac against current main:

  1. OAuth-only host (no OPENAI_API_KEY) → openai-codex profile picked, codex harness used
  2. Mixed-profile (codex OAuth + raw openai:default api-key) with no pin → openai-codex still picked
  3. Mixed-profile + agents.defaults.agentRuntime.id="pi" pin → doctor strips the pin on --fix, then codex auto-routes
  4. Mixed-profile + per-agent agents.list[main].agentRuntime.id="pi" pin → same, doctor strips it, codex auto-routes

So the migration path itself looks safe
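
The four cases above can be sketched as a pin-stripping step followed by a runtime pick. This is a minimal illustration, not the actual doctor implementation: `strip_pi_pins` and `pick_runtime` are hypothetical names, `agents.list` is assumed to be a mapping from agent name to settings, and the fallback to pi when no codex profile exists is also an assumption.

```python
import copy


def strip_pi_pins(config):
    """Return a copy of config with every agentRuntime pin whose id is "pi" removed."""
    fixed = copy.deepcopy(config)
    agents = fixed.get("agents", {})
    # Assumption: agents.list maps agent name -> per-agent settings.
    scopes = [agents.get("defaults", {})] + list(agents.get("list", {}).values())
    for scope in scopes:
        if scope.get("agentRuntime", {}).get("id") == "pi":
            del scope["agentRuntime"]
    return fixed


def pick_runtime(config, profiles, agent="main"):
    """Honor an explicit pin first; otherwise auto-route to codex when a codex profile exists."""
    agents = config.get("agents", {})
    per_agent = agents.get("list", {}).get(agent, {}).get("agentRuntime", {}).get("id")
    default = agents.get("defaults", {}).get("agentRuntime", {}).get("id")
    pinned = per_agent or default
    if pinned:
        return pinned
    # Hypothetical fallback: pi when no openai-codex profile is available.
    return "codex" if "openai-codex" in profiles else "pi"


mixed = ["openai-codex", "openai:default"]
pinned = {"agents": {"list": {"main": {"agentRuntime": {"id": "pi"}}}}}
print(pick_runtime(pinned, mixed))                 # pi
print(pick_runtime(strip_pi_pins(pinned), mixed))  # codex
```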

#

That's awesome. I am testing more edge cases now.

weary moon
#

We’re where I said we’d be a few weeks ago: rebuild and rip out/switch over so I can test. @warm talon the best thing is parity harness comparisons with the same prompts

#

At this point a full parity QA, pi versus codex, is probably best. Do we have one built already?

weary moon
#

The best way to test for differences and edge cases is a parity test with 1v1 scenarios, like we did with 5.4

#

Test pi vs codex harness

#
  1. Create a QA parity harness to test pi vs codex harness-only runs
  2. Loop 3 agents on difficult test scenarios based on a real JSONL session history
  3. Test all tools and long runs in the harness to get to 100% parity, and use it to debug all the edge cases
#

Make sure to have debug logging on each one to catch any weird issues
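
A rough sketch of what such a parity loop could look like, with debug logging on every scenario. Everything here is hypothetical: `RunResult`, `diff_runs`, and `run_parity` are illustrative names, and the two runner callables stand in for whatever actually drives the pi and codex harnesses. Exact-match comparison of final text is a simplification, since real model output is not deterministic.

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger("parity")


@dataclass
class RunResult:
    final_text: str
    tool_calls: list = field(default_factory=list)
    tokens: int = 0


def diff_runs(pi, codex):
    """List the fields where the two runs disagree."""
    diffs = []
    if pi.tool_calls != codex.tool_calls:
        diffs.append("tool_calls")
    if pi.final_text.strip() != codex.final_text.strip():
        diffs.append("final_text")
    return diffs


def run_parity(scenarios, run_pi, run_codex):
    """Run each prompt through both harnesses and collect a per-scenario report."""
    report = []
    for prompt in scenarios:
        pi, codex = run_pi(prompt), run_codex(prompt)
        diffs = diff_runs(pi, codex)
        # Debug logging on every scenario so odd divergences are traceable later.
        log.debug("scenario=%r pi=%r codex=%r diffs=%s", prompt, pi, codex, diffs)
        report.append({
            "scenario": prompt,
            "diffs": diffs,
            "token_delta": codex.tokens - pi.tokens,
        })
    return report
```

Passing the runners in as callables keeps the comparison logic independent of how each harness is launched, so the same loop can replay prompts taken from a recorded session.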

warm talon
#

Yeah, not opposed to doing a QA/eval run for this

#

Do you want to take that project on @weary moon ?

weary moon
#

Yeah, of course. It's easy for me, I already run em. I just need context, I haven’t looked at that part of cb lately

#

Should just be codex full harness run tho vs pi

#

I’ll get it done today

warm talon
#

Would like a full report including token efficiency ty

little wedge
#

@warm talon While edge-case testing Codex installs, I found a managed npm peer-link failure that can leave @openclaw/codex installed but unable to resolve openclaw/plugin-sdk/*. I opened a focused fix that fails and rolls back the install when the link cannot be repaired; could you review when you have time?

https://github.com/openclaw/openclaw/pull/79494
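
The fail-and-roll-back behavior described above follows a common pattern, sketched here in generic form. This is not the code from the PR: `install_with_rollback`, `PeerLinkError`, and the three callables are hypothetical stand-ins for the real install, peer-link repair, and uninstall steps.

```python
class PeerLinkError(Exception):
    """Raised when the peer link cannot be repaired (hypothetical error type)."""


def install_with_rollback(install, repair_peer_link, uninstall):
    """Install, then verify the peer link; undo the install if the link cannot be repaired."""
    install()
    try:
        repair_peer_link()
    except PeerLinkError:
        # Never leave a half-working plugin behind: remove it and surface the error.
        uninstall()
        raise
```

Re-raising after the rollback means the caller still sees the failure instead of a silently missing plugin.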

warm talon
#

Thank you... though what the heck does "message-tool fixture has no effective message tool under either runtime" even mean?

I clicked on the issue, and it wasn't much more clear what the actual problem is, or how high of a priority it even is. Same goes for all the issues in your list.

warm talon
#

It seems that all of these are for the QA harness specifically? Not core functionality. So it's not P0 or P1?

#

Anyway, i'll take a look just to do a pass to ensure core functionality isn't affected. I don't really mind merging a bunch of stuff to QA harness as long as core functionality isn't touched

#

Thanks for all your hard work

#

Okay, if you could explain each bug a bit better with a clear example from a user story perspective, it would be helpful

warm talon
#

Still doesn't make sense haha. We explicitly DON'T want codex harness mode calling those tools, and only give a subset of openclaw dynamic tools that aren't already covered by codex harness's native tools.

Can you explain clearly what the problem is? Because it's still not apparent to me.

#

Or is this all just related to the QA harness?

warm talon
#

this does not follow. anyway get some sleep we can revisit this later.

tldr, my general concern now is we're aiming to do a big release off main sometime this week, which will include the change to use codex harness by default for OAI models in openclaw.

currently, my vibe check is THINGS JUST WORK. and they work WELL. In my actual daily use, my main build dev agent is working really damn well. So I don't want to merge a bunch of stuff now that could put this at risk. And I'm not hearing a compelling case yet

#

That's why it's important to MANUALLY test things right now. I don't know what clankers or random github issues say because those are low signal. There may be weird edge cases with esoteric legacy configs. But if things work for the 90% of users on the happy path, that's what I want. And I don't want AIs to just start "fixing" ghost bugs and put the happy path at risk for this

#

"Catching all the edge cases" with codex is a recipe for slop gen code and breaking more stuff that people actually care about, so I'd be careful.

#

Are you testing manually?

#

I agree we shouldn't couple the massive sqlite refactor with the codex harness release.

#

But that isn't merged yet AFAIK

#

Indeed

gaunt geode
#

Hello, I've been chugging away with the codex harness on main. Haven't run into any issues yet, besides one that was already fixed. Smooth sailing so far, but happy to test and help in whatever way I can.

gaunt geode
#

Ran into an issue with /btw but I wanna make sure it's related and not something I goofed up first.

gaunt geode
#

Hey pash, not gonna tag, but I briefly ran into the issue again where the model decided to reply in the private/final channel instead of sending a telegram-visible message.send. I'm a noob, so I decided to have a long convo w/ codex so I could learn more about it, and how complicated that can be, without deterministically blasting stuff into the chat. lol I know ur busy, but figured I'd share for anyone else still learning:


1. Make the turn contract explicit in runtime state.
Not just sourceReplyDeliveryMode = message_tool_only, but something like:

  • expectedVisibleOutcome: required | optional | forbidden
  • sourceEventKind: user_chat | group_mention | heartbeat | cron | exec_completion | internal
  • deliveryTarget: known telegram chat/thread/account

2. Require one of two explicit outcomes for user chat turns.
For a direct user Telegram message in tool-only mode, the assistant should produce either:

  • visible delivery: message(action="send", ...)
  • explicit silence: a structured no-reply outcome, not just a private final

Right now “private final” and “intentionally no visible reply” collapse into the same observable shape: no Telegram send. That’s the nasty ambiguity.

3. On contract violation, repair with a constrained retry, not blind auto-send.
If the model returns private final text but never calls message.send, runtime could run a tiny corrective continuation:
“This was a visible Telegram direct turn. Send the user-facing answer with the message tool, or explicitly choose no visible reply.”

If that still fails, log it loudly and maybe send a generic failure/fallback only for direct chats.

4. Only add deterministic auto-delivery for very narrow cases.
Example: if the final payload is already classified as user_visible_final by the agent runner, source event is a direct user chat, target is canonical, no tools are pending, not heartbeat/cron/internal, and no visible send happened, then route it through deliverInboundReplyWithMessageSendContext.

The clean product answer is: don’t make the runtime guess intent from text. Make intent part of the turn contract, then enforce that contract[..]
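
For anyone who wants to see the shape of points 1 to 3 in code, here is a small sketch. It is not OpenClaw's runtime: `TurnContract`, `TurnResult`, `satisfies`, and `enforce` are hypothetical names, and treating an explicit no-reply as satisfying a "required" contract is one reading of point 2.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("turn_contract")

CORRECTIVE_PROMPT = (
    "This was a visible Telegram direct turn. Send the user-facing answer "
    "with the message tool, or explicitly choose no visible reply."
)


@dataclass
class TurnContract:
    expected_visible_outcome: str  # "required" | "optional" | "forbidden"
    source_event_kind: str         # e.g. "user_chat", "heartbeat", "cron"
    delivery_target: str


@dataclass
class TurnResult:
    sent_visible: bool = False
    explicit_no_reply: bool = False
    private_final: str = ""


def satisfies(contract, result):
    """A private final alone never satisfies a contract that expects a visible outcome."""
    if contract.expected_visible_outcome == "forbidden":
        return not result.sent_visible
    if contract.expected_visible_outcome == "required":
        return result.sent_visible or result.explicit_no_reply
    return True


def enforce(contract, run_turn):
    """Run the turn; on a missing visible outcome, retry once with a corrective prompt."""
    if satisfies(contract, run_turn(None)):
        return "ok"
    if contract.expected_visible_outcome == "forbidden":
        # A visible send cannot be undone, so there is nothing to repair.
        log.error("visible send on a forbidden turn: %s", contract)
        return "failed"
    if satisfies(contract, run_turn(CORRECTIVE_PROMPT)):
        return "repaired"
    log.error("turn contract still violated after retry: %s", contract)
    return "failed"
```

The retry is constrained on purpose: the runtime never sends the private text itself, it only asks the model once more to make its intent explicit.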

I noticed that my config is a variable in this behavior, so I asked about it to understand that better:

• Your config has:

  • messages.groupChat.visibleReplies = "message_tool"
  • no global messages.visibleReplies
• OpenClaw’s base default is:

  • direct chats: automatic
  • groups/channels: message_tool
• But the Codex harness overrides its source delivery default to:

  • sourceVisibleReplies = "message_tool"

So because this chat is running through the Codex harness, a direct Telegram DM can become message-tool-only even though your config does not explicitly say “direct chats are tool-only.”

That means if I answer with a normal final response instead of message(action="send"), OpenClaw treats that final as private and Telegram sees nothing. Exactly the dumb little trap we stepped in.

The config-side mitigation would be:

messages.visibleReplies = "automatic"
messages.groupChat.visibleReplies = "message_tool"

That would mean:

• Direct chats like this: normal final answers auto-post visibly
• Groups/channels: still require message.send, preserving the safety/noise guardrail

I’d consider that a reasonable local preference for your setup. The risk is that any direct-chat source turn using normal final text becomes visible by default[...]
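
The resolution order described in that answer can be sketched like this. It is an illustration only: `resolve_visible_replies` is a hypothetical function, and the precedence (explicit config, then harness default, then base default, with the global setting applying to groups only when no groupChat value is set) is an assumption drawn from the explanation above, not from the OpenClaw source.

```python
# Defaults as described in the explanation above.
BASE_DEFAULTS = {"direct": "automatic", "group": "message_tool"}
HARNESS_DEFAULTS = {"codex": "message_tool"}


def resolve_visible_replies(config, chat_kind, harness):
    """Resolve the visible-reply mode for a chat kind ("direct" or "group")."""
    messages = config.get("messages", {})
    if chat_kind == "group":
        explicit = messages.get("groupChat", {}).get("visibleReplies")
        if explicit:
            return explicit
    if "visibleReplies" in messages:
        return messages["visibleReplies"]
    # Nothing explicit: the harness default wins over the base default (assumed order).
    return HARNESS_DEFAULTS.get(harness, BASE_DEFAULTS[chat_kind])


current = {"messages": {"groupChat": {"visibleReplies": "message_tool"}}}
mitigated = {"messages": {"visibleReplies": "automatic",
                          "groupChat": {"visibleReplies": "message_tool"}}}

print(resolve_visible_replies(current, "direct", "codex"))    # message_tool
print(resolve_visible_replies(mitigated, "direct", "codex"))  # automatic
print(resolve_visible_replies(mitigated, "group", "codex"))   # message_tool
```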

Anyway, I learned a lot and it really made me appreciate the complexity in trying to get that right, especially in different scopes/contexts.

warm talon
#

Was this mini or some dumb model? or 5.5?

gaunt geode