#evals with no system prompt


rancid jackal

i'll caveat this a bit: Claude only runs 3 attempts per eval, vs. Gemini (10) and OpenAI (5)

still very interesting though - it didn't just stumble its way through, it nailed almost every attempt and used our tools exactly how we'd want
