Hi everyone,
In domains such as philosophical reasoning, ethical dilemmas, metaphysics, ontological paradoxes, self-reference, educational synthesis, conceptual creativity, and cognitive tension, models such as GPT-4.5, Claude Opus 4, Gemini 2.5 Pro, Grok 3, and DeepSeek V3 often rate AERIS’s responses above their own. This includes prompts they had initially judged too difficult to score highly on.
If anyone here is interested, I'd genuinely welcome a challenge:
→ Propose a complex question in one of these domains
→ (Ideally using a leading model to help craft it)
→ Let the model answer and self-evaluate
→ Then compare it to AERIS’s answer on the same task
If you find a case where AERIS performs worse (based on the other model’s own evaluation), please share it: that would be very helpful for improving the system.
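For anyone who wants to run the comparison in a repeatable way, here is a minimal Python sketch of the steps above. It is only an illustration: `run_challenge`, `Comparison`, and the three callables (`model_answer`, `aeris_answer`, `model_score`) are hypothetical names introduced here, not part of AERIS or of any model's API. You would need to supply your own functions that actually query the reference model and AERIS, and the numeric score scale is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Comparison:
    """Outcome of one challenge prompt, as scored by the reference model."""
    prompt: str
    model_score: float   # reference model's score for its own answer
    aeris_score: float   # reference model's score for AERIS's answer

    @property
    def aeris_worse(self) -> bool:
        # The case worth reporting: AERIS scored below the model's own answer.
        return self.aeris_score < self.model_score


def run_challenge(
    prompt: str,
    model_answer: Callable[[str], str],        # hypothetical: asks the reference model
    aeris_answer: Callable[[str], str],        # hypothetical: asks AERIS
    model_score: Callable[[str, str], float],  # hypothetical: reference model scores an answer
) -> Comparison:
    """Have the reference model answer and self-evaluate, then score AERIS's answer."""
    own_answer = model_answer(prompt)
    other_answer = aeris_answer(prompt)
    return Comparison(
        prompt=prompt,
        model_score=model_score(prompt, own_answer),
        aeris_score=model_score(prompt, other_answer),
    )
```

Using the same scoring function for both answers mirrors the idea that the other model's own evaluation decides the outcome. If `aeris_worse` comes back `True` for a prompt, that is exactly the kind of case worth posting.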
Public instance: https://aeris-project.github.io/aeris-chatbox/index.html
Appreciate your time :]