analog creek Jun 4, 2023, 1:39 AM

#

========================================
[102, 197, 252, 400, 565]

1. Select the [first], [third], and [fifth] numbers.
2. Calculate the following: [first] [add] [fifth] [divide] [third].
3. Report the [second] decimal place of the answer from step 2.

Step by step answer: [102, 252, 565] | 104.24206 | 4
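The arithmetic in this first example can be checked with a short Python sketch. The names `evaluate` and `decimal_digit` are hypothetical helpers, not part of the proposal, and the sketch assumes normal operator precedence (division before addition). Reading the expression strictly left to right gives a different value (about 2.6468), although its second decimal place also happens to be 4.

```python
from fractions import Fraction
import operator

OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}
PRECEDENCE = {"add": 1, "subtract": 1, "multiply": 2, "divide": 2}


def evaluate(a, op1, b, op2, c):
    """Evaluate `a op1 b op2 c` exactly, using normal operator precedence."""
    a, b, c = Fraction(a), Fraction(b), Fraction(c)
    if PRECEDENCE[op2] > PRECEDENCE[op1]:
        return OPS[op1](a, OPS[op2](b, c))
    return OPS[op2](OPS[op1](a, b), c)


def decimal_digit(value, place):
    """Return the digit at the given decimal place (1 = tenths), truncating."""
    return int(abs(value) * 10 ** place) % 10


numbers = [102, 197, 252, 400, 565]
selected = [numbers[0], numbers[2], numbers[4]]  # first, third, fifth

# [first] [add] [fifth] [divide] [third]
result = evaluate(selected[0], "add", selected[2], "divide", selected[1])
digit = decimal_digit(result, 2)

print(selected, round(float(result), 5), digit)  # [102, 252, 565] 104.24206 4
```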

[2, 4, 8, 3, 5]

1. Sort in descending order.
2. Select the [first], [third], and [fifth] numbers.
3. Write a sentence where each word's length corresponds to the selected sequence of numbers.

Step by step answer: [8, 5, 4, 3, 2] | [8, 4, 2] | "Daughter eats it."
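Because the final step is open-ended, it cannot be graded by exact match, but the word-length constraint itself can be checked mechanically. The sketch below is one possible checker; `select_after_sort`, `word_lengths`, and `sentence_matches` are hypothetical names, and stripping punctuation before counting letters is an assumption about how lengths should be measured.

```python
import string


def select_after_sort(numbers, positions):
    """Sort descending, then pick the 1-based positions from the sorted list."""
    ordered = sorted(numbers, reverse=True)
    return ordered, [ordered[p - 1] for p in positions]


def word_lengths(sentence):
    """Length of each word, ignoring leading and trailing punctuation."""
    return [len(word.strip(string.punctuation)) for word in sentence.split()]


def sentence_matches(sentence, target):
    """True if the word lengths equal the target sequence exactly."""
    return word_lengths(sentence) == target


ordered, target = select_after_sort([2, 4, 8, 3, 5], [1, 3, 5])
print(ordered, target)                                # [8, 5, 4, 3, 2] [8, 4, 2]
print(sentence_matches("Daughter eats it.", target))  # True
```

A checker like this only verifies the length constraint; whether the sentence is grammatical or meaningful would still need a separate judgment.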

This approach presents several distinct advantages:

(1) With an extended input list of numbers and randomized bracket selections, each question type generates ample variations. For example, question 1 with a list of 5 numbers and 4 types of mathematical symbols (+ - / *) produces 5 * 5 * 5 * 4 * 4 = 2,000 variations, mitigating the risk of data contamination.

(2) The step-by-step answer format allows for a more precise evaluation of a model's ability to follow instructions. Assessment criteria could be based on the number of instructions correctly followed or a similar metric.

(3) This method enables a comprehensive assessment of numerical reasoning abilities, providing insight into an LLM's computational accuracy as well as its creative problem-solving capabilities.
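Points (1) and (2) can be made concrete with a small sketch. It enumerates the bracket choices for question 1 to confirm the variation count, and scores a step-by-step answer by the fraction of steps that match a reference. All names here (`enumerate_variants`, `render`, `step_score`) are hypothetical, the count allows the same position to be chosen more than once, and exact string matching per step is an assumed metric that would not suit the open-ended sentence step in question 2.

```python
from itertools import product

POSITIONS = ["first", "second", "third", "fourth", "fifth"]
OPERATORS = ["add", "subtract", "multiply", "divide"]


def enumerate_variants():
    """All (pos, pos, pos, op, op) choices for question 1's brackets."""
    return list(product(POSITIONS, POSITIONS, POSITIONS, OPERATORS, OPERATORS))


def render(p1, p2, p3, op1, op2):
    """Fill the step-2 template with one choice of brackets."""
    return f"Calculate the following: [{p1}] [{op1}] [{p2}] [{op2}] [{p3}]."


def split_steps(answer):
    """Split a 'step | step | step' answer into stripped step strings."""
    return [part.strip() for part in answer.split("|")]


def step_score(model_answer, reference_answer):
    """Fraction of reference steps that the model answer matches exactly."""
    reference = split_steps(reference_answer)
    given = split_steps(model_answer)
    correct = sum(1 for ref, got in zip(reference, given) if ref == got)
    return correct / len(reference)


print(len(enumerate_variants()))  # 2000

reference = "[102, 252, 565] | 104.24206 | 4"
# Hypothetical model output with a wrong intermediate value in step 2.
model_output = "[102, 252, 565] | 2.64683 | 4"
print(round(step_score(model_output, reference), 2))  # 0.67
```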

As I am relatively new to AI research, I am eager to gather feedback on this concept. Please feel free to provide any critique or suggestions. The attached image shows an example of ChatGPT and OASST attempting the problem.

pastel wagon Jun 4, 2023, 1:54 AM

#

analog creek ======================================== [102, 197, 252, 400, 565] 1. Select th...

I quite like the idea, but I can see 2 problems with this:

1. If it becomes a widespread benchmark, data contamination can still be an issue because the model could learn "the task" rather than the specific answers.
2. It's far from what users usually ask models to do, and it measures only a small number of a model's desired features.

vital pawn Jun 4, 2023, 1:54 AM

#

FYI “Hugging Face’s” Open LLM Leaderboard is actually an evaluation library we developed and maintain: https://github.com/EleutherAI/lm-evaluation-harness

GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models.

pastel wagon Jun 4, 2023, 1:55 AM

#

vital pawn FYI “Hugging Face’s” Open LLM Leaderboard is actually an evaluation library we d...

TIL

analog creek Jun 4, 2023, 1:57 AM

#

pastel wagon I quite like the idea, but I can see 2 problems with this: 1. If it becomes a wides...

These are valid concerns.

1. It's true that if this task format becomes widespread, models may start learning the task patterns. However, the element of randomness in task generation creates a vast space of possible tasks, which I believe is an advancement compared to past work.

analog creek Jun 4, 2023, 1:59 AM

#

vital pawn FYI “Hugging Face’s” Open LLM Leaderboard is actually an evaluation library we d...

Thank you for the correction, and my apologies.

pastel wagon Jun 4, 2023, 2:00 AM

#

More on point 2:
This mainly measures the model's reasoning ability.
Other features that we may want to measure in LLMs (not a comprehensive list):

knowledge
bias
recall ability (long-, medium-, and short-term)
rephrasing/summarisation ability

vital pawn Jun 4, 2023, 2:01 AM

#

analog creek These are valid concerns, 1. It's true that if this task format becomes wides...

> the element of randomness in task generation creates a vast space of possible tasks, which I believe is an advancement compared to past work.
That’s not really true. Several previous eval metrics are randomized in this sense; they’re just not very popular because people generally think that consistency is more important.

analog creek Jun 4, 2023, 2:01 AM

#

analog creek These are valid concerns, 1. It's true that if this task format becomes wides...

Alignment with User Queries: It's correct that these tasks might not mirror the exact types of queries users typically pose to language models. However, the abilities they test — following a series of instructions, performing calculations, understanding numerical reasoning, and generating creative linguistic output — are all skills that contribute to a model's overall utility. Moreover, personally I think it is infeasible for a single benchmark to cover all the features.

pastel wagon Jun 4, 2023, 2:03 AM

#

> Moreover, personally I think it is infeasible for a single benchmark to cover all the features.
I 100% think the same

analog creek Jun 4, 2023, 2:05 AM

#

vital pawn > the element of randomness in task generation creates a vast space of possible ...

How about we create a number of tasks, randomly generate a large number of variants that are consistent in structure but varied in specifics, and then evaluate a model's performance based on its average score? This way, we maintain the advantages of consistency while leveraging the diversity offered by randomization to conduct a more robust evaluation of the model's abilities.
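A minimal sketch of that idea, for the selection step only: generate many variants with the same structure but different numbers and positions, then report the mean score. `model_fn` is a hypothetical stand-in for whatever calls the model under test, and the value range, list length, and variant count are arbitrary assumptions.

```python
import random


def make_variant(rng, list_length=5):
    """One variant: a sorted list of distinct numbers and three 0-based positions."""
    numbers = sorted(rng.sample(range(1, 1000), list_length))
    positions = rng.sample(range(list_length), 3)
    return numbers, positions


def expected_selection(numbers, positions):
    """Ground truth for the 'select these positions' step."""
    return [numbers[p] for p in positions]


def average_score(model_fn, n_variants=100, seed=0):
    """Mean accuracy of model_fn over n_variants randomly generated variants."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_variants):
        numbers, positions = make_variant(rng)
        if model_fn(numbers, positions) == expected_selection(numbers, positions):
            correct += 1
    return correct / n_variants


# A perfect stub "model" scores 1.0; a stub that returns nothing scores 0.0.
print(average_score(expected_selection))       # 1.0
print(average_score(lambda nums, pos: []))     # 0.0
```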

pastel wagon Jun 4, 2023, 2:11 AM

#

analog creek How about we create a number of tasks, randomly generate a large number of vari...

I think the problem with random evaluation is that it removes reproducibility, which is important for scientific papers, and we could see benchmark scores biased upward if authors cherry-pick the best of multiple runs.
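One possible mitigation, sketched here as an assumption rather than something proposed in the thread, is to fix the random seed and report it together with a checksum of the generated items, so that others can regenerate and verify the exact same set. `generate_items` and `fingerprint` are hypothetical names, and identical output across machines assumes the same Python version and generator code.

```python
import hashlib
import json
import random


def generate_items(seed, n_items=100, list_length=5):
    """Deterministically generate the benchmark's number lists from a seed."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(1, 1000), list_length)) for _ in range(n_items)]


def fingerprint(items):
    """SHA-256 checksum of the generated items, suitable for reporting."""
    payload = json.dumps(items).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


run_a = generate_items(seed=2023)
run_b = generate_items(seed=2023)
print(run_a == run_b)                            # True
print(fingerprint(run_a) == fingerprint(run_b))  # True
```

This addresses regeneration of the test set only; it does not by itself prevent authors from trying several seeds and reporting the best one.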

#

It's a tradeoff for the scientific community. If it can be proven to be a good benchmark with low contamination risk, I think it has a chance of being used for research.

vital pawn Jun 4, 2023, 2:15 AM

#

pastel wagon I think the problem with random evaluation is that it removes reproducibilit...

I mean, most benchmark scores aren’t reproducible

#

At least for the most powerful LLMs

pastel wagon Jun 4, 2023, 2:16 AM

#

vital pawn I mean, most benchmark scores aren’t reproducible

Meaning sampling?
Or training randomness?
Both?

vital pawn Jun 4, 2023, 2:23 AM

#

pastel wagon Meaning sampling? Or training randomness? Both?

Meaning people don’t document how they evaluate models and often don’t release models

analog creek Jun 4, 2023, 2:25 AM

#

pastel wagon It's a tradeoff for the scientific community. If it can be proven to be a good ben...

Do you have some ideas how this can be done?

pastel wagon Jun 4, 2023, 2:29 AM

#

analog creek Do you have some ideas how this can be done?

I'm not sure, but one way I can think of is fine-tuning a model on those problems (intentional contamination) and comparing it to the baseline. If the baseline and the fine-tuned version perform similarly, your benchmark is contamination-proof. (You would also need to show, in the same way, that traditional benchmarks are not contamination-proof.)
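The comparison could be sketched roughly as follows. The per-item results are made-up placeholders rather than real measurements, the function names are hypothetical, and the 0.05 tolerance for "close" is an arbitrary assumption that a real study would replace with a proper significance test.

```python
def accuracy(results):
    """Fraction of items answered correctly (results are booleans or 0/1)."""
    return sum(results) / len(results)


def contamination_gap(baseline_results, finetuned_results):
    """Accuracy gained by fine-tuning on benchmark-style problems."""
    return accuracy(finetuned_results) - accuracy(baseline_results)


def looks_contamination_resistant(baseline_results, finetuned_results, tolerance=0.05):
    """True if the two models score within `tolerance` of each other."""
    gap = contamination_gap(baseline_results, finetuned_results)
    return abs(gap) <= tolerance


# Placeholder per-item correctness on freshly generated variants.
baseline = [1, 0, 1, 1]
finetuned = [1, 1, 1, 1]

print(contamination_gap(baseline, finetuned))              # 0.25
print(looks_contamination_resistant(baseline, finetuned))  # False
```

Running the same comparison on a traditional fixed benchmark would give the contrast described above: a large gap there and a small gap here would support the claim.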

#dynamic benchmark for assessment of instruction-following & reasoning abilities
