A benchmark for multimodal game agents, comprising 34 browser games, 170 tasks, and 18 model-interface pairs, with outcome-based, state-verifiable evaluation.