Hi 👋, I built the hea-world system, a SaaS platform for building and deploying governed AI agents. Today it mainly uses GPT-4.1 mini and GPT-4.1 nano, and it works very well 😃. To get ahead of future changes, I want to qualify the latest generation of models for my application before using them in production, so I built a model qualification tool. So far, I am not getting the same level of reliability with the GPT-5 generation of models. Any ideas why? Codex optimized the model calls, so I assume that part is done well, and I am still scratching my head. I am based in Europe, and most of my other servers/cloud providers are in Frankfurt.