#evals with no system prompt
i'll caveat this a bit: Claude only runs 3 attempts per eval, vs. Gemini (10) and OpenAI (5)
still very interesting though - it didn't just stumble its way through, it nailed almost every attempt and used our tools exactly how we'd want
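to put a rough number on that caveat: below is a minimal sketch, not taken from our eval harness, of how much less a pass rate tells you at 3 attempts than at 5 or 10. it uses a standard Wilson score interval; the `wilson_interval` helper and the pass counts are hypothetical, picked only to illustrate the attempt counts above.

```python
import math


def wilson_interval(passes: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Return an approximate 95% Wilson score interval for a pass rate.

    `passes` and `attempts` are hypothetical inputs; z=1.96 corresponds
    to a two-sided 95% confidence level.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= passes <= attempts:
        raise ValueError("passes must be between 0 and attempts")

    p = passes / attempts
    denom = 1 + z**2 / attempts
    center = (p + z**2 / (2 * attempts)) / denom
    half_width = z * math.sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom

    # Clamp to [0, 1] to absorb floating-point drift at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical: a perfect score at each provider's attempt count.
    for label, attempts in [("3 attempts", 3), ("5 attempts", 5), ("10 attempts", 10)]:
        low, high = wilson_interval(attempts, attempts)
        print(f"{label}: pass rate plausibly between {low:.2f} and {high:.2f}")
```

under these assumptions, a perfect 3/3 only bounds the true pass rate from below at roughly 0.44, versus roughly 0.72 for a perfect 10/10, so a clean run at 3 attempts is encouraging but not directly comparable.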
happened again - https://v3.dagger.cloud/dagger/traces/00ea4b8a91e5fd9d9ae308453a56c784
for this iteration, I removed the type arg from list_objects and list_methods, since it seems to hurt more than it helps