https://arxiv.org/abs/2308.03688
I've been really interested in LLM evaluation lately and enjoyed this paper. It presents AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments for assessing the reasoning and decision-making abilities of LLMs acting as agents.
The part that stands out to me is the "evolving" concept. I'm interested in how Kaggle might host evolving benchmarks as evergreen competitions, constantly evaluating on new unseen data. Maybe contributing data to a benchmark is something the community would be interested in -- progression points for contributing valuable new data to a benchmark! 😉
The performance gap between open-source and proprietary models is also interesting. There really are "two tiers" of performance, with the proprietary models in the better tier.
I'm also loving the growing use of radar charts to visualize results on multi-dimensional benchmarks.
Large Language Models (LLMs) are becoming increasingly smart and autonomous,
targeting real-world pragmatic missions beyond traditional NLP tasks. As a
result, there has been an urgent need to evaluate LLMs as agents on challenging
tasks in interactive environments. We present AgentBench, a multi-dimensional
evolving benchmark that currently con...