#dynamic benchmark for assessment of instruction-following & reasoning abilities

19 messages · Page 1 of 1 (latest)

analog creek
#

========================================
[102, 197, 252, 400, 565]

  1. Select the [first], [third], and [fifth] number.
  2. Calculate the following: [first] [add] [fifth] [divide] [third].
  3. Report the [second] decimal place of the answer from step 2.

Step by step answer: [102, 252, 565] | 104.24206 | 4
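The first example can be checked mechanically. The following is a minimal sketch, not part of the original proposal: the helper name `solve` and the ordinal lookup table are made up for illustration, and it assumes normal operator precedence (division before addition), which is what reproduces the 104.24206 above.

```python
from fractions import Fraction
import operator

# Hypothetical lookup tables for the bracketed slots in the template.
ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3, "fifth": 4}
OPS = {"add": operator.add, "subtract": operator.sub,
       "multiply": operator.mul, "divide": operator.truediv}
HIGH_PRECEDENCE = {"multiply", "divide"}

def solve(numbers, x, op1, y, op2, z, place):
    """Evaluate [x] [op1] [y] [op2] [z] and return (value, requested decimal digit)."""
    a, b, c = (Fraction(numbers[ORDINALS[k]]) for k in (x, y, z))
    # Assumption: standard precedence, so "a + b / c" means a + (b / c).
    if op2 in HIGH_PRECEDENCE and op1 not in HIGH_PRECEDENCE:
        value = OPS[op1](a, OPS[op2](b, c))
    else:
        value = OPS[op2](OPS[op1](a, b), c)
    # Exact arithmetic avoids float noise when picking out a single digit.
    digit = int(abs(value) * 10 ** (ORDINALS[place] + 1)) % 10
    return float(value), digit

value, digit = solve([102, 197, 252, 400, 565],
                     "first", "add", "fifth", "divide", "third", "second")
print(round(value, 5), digit)  # 104.24206 4
```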

[2, 4, 8, 3, 5]

  1. Sort in descending order.
  2. Select [first], [third] and [fifth] number.
  3. Write a sentence where each word's length corresponds to the selected sequence of numbers.

Step by step answer: [8, 5, 4, 3, 2] | [8, 4, 2] | "Daughter eats it."
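The sentence step is open-ended, but it can still be graded automatically by comparing word lengths with the selected numbers. A rough sketch, assuming only alphabetic characters count toward a word's length (so trailing punctuation is ignored); the function names and the candidate sentence are illustrative only.

```python
def word_lengths(sentence):
    # Count letters only, so "it." has length 2 rather than 3.
    return [sum(ch.isalpha() for ch in word) for word in sentence.split()]

def check_sentence(numbers, positions, sentence):
    """Return (target lengths, whether the sentence matches them)."""
    ranked = sorted(numbers, reverse=True)          # step 1: descending order
    target = [ranked[p - 1] for p in positions]     # step 2: 1-based selection
    return target, word_lengths(sentence) == target

target, ok = check_sentence([2, 4, 8, 3, 5], (1, 3, 5), "Everyone sees it.")
print(target, ok)  # [8, 4, 2] True
```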

This approach presents several distinct advantages:

(1) With a longer input list and randomized bracket selections, each question type generates many variations. For example, question 1 with a list of 5 numbers and 4 arithmetic operators (+ - / *) has 5 * 5 * 5 choices for the three positions used in step 2 and 4 * 4 choices for the two operators, i.e. 5 * 5 * 5 * 4 * 4 = 2,000 variations before the numbers themselves are changed, which mitigates the risk of data contamination.
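The count for question 1 can be confirmed by enumerating every slot combination. This is a sketch under the same counting assumption as the text (positions may repeat, and the decimal-place slot is held fixed); `render` is a hypothetical helper that fills the step 2 template.

```python
from itertools import product
import random

ORDINALS = ["first", "second", "third", "fourth", "fifth"]
OPERATORS = ["add", "subtract", "multiply", "divide"]

# One entry per way of filling the five bracketed slots in step 2.
variants = list(product(ORDINALS, OPERATORS, ORDINALS, OPERATORS, ORDINALS))
print(len(variants))  # 2000

def render(variant):
    x, op1, y, op2, z = variant
    return f"Calculate the following: [{x}] [{op1}] [{y}] [{op2}] [{z}]."

# A seeded generator makes a sampled benchmark instance reproducible.
rng = random.Random(0)
print(render(rng.choice(variants)))
```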

(2) The step-by-step answer format allows for a more precise evaluation of a model's ability to follow instructions. Assessment criteria could be based on the number of instructions correctly followed or a similar metric.
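One possible version of such a metric is the fraction of steps that match a reference answer. The sketch below assumes answers use the `|`-separated format shown in the examples and that a step counts as correct only on an exact string match; both are simplifying assumptions, and `score_steps` is a made-up name.

```python
def score_steps(model_answer, reference, sep="|"):
    """Fraction of reference steps that the model answer reproduces exactly."""
    got = [part.strip() for part in model_answer.split(sep)]
    want = [part.strip() for part in reference.split(sep)]
    # zip stops at the shorter list, so missing steps simply score zero.
    correct = sum(g == w for g, w in zip(got, want))
    return correct / len(want)

reference = "[102, 252, 565] | 104.24206 | 4"
print(score_steps("[102, 252, 565] | 104.24206 | 2", reference))  # 2 of 3 steps match
```

Exact matching would not work for the free-form sentence step, which needs a property check on word lengths instead.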

(3) This method enables a comprehensive assessment of numerical reasoning abilities, providing insight into an LLM's computational accuracy as well as its creative problem-solving capabilities.

As I am relatively new to AI research, I am eager to gather feedback on this concept. Please feel free to provide any critique or suggestions. The image is an example of ChatGPT and oasst attempting the problem.

pastel wagon
vital pawn
analog creek
analog creek
pastel wagon
#

More on point 2:
This mainly measures the model's ability to reason.
Other features that we may want to measure in LLMs (not comprehensive at all):

  1. knowledge
  2. bias
  3. recall ability (long, medium, and short term)
  4. rephrasing/summarisation ability
vital pawn
analog creek
# analog creek These are valid concerns, 1. It's true that if this task format becomes wides...
  1. Alignment with User Queries: It's correct that these tasks might not mirror the exact types of queries users typically pose to language models. However, the abilities they test — following a series of instructions, performing calculations, understanding numerical reasoning, and generating creative linguistic output — are all skills that contribute to a model's overall utility. Moreover, personally I think it is infeasible for a single benchmark to cover all the features.
pastel wagon
#

"Moreover, personally I think it is infeasible for a single benchmark to cover all the features."
I 100% think the same

analog creek
pastel wagon
#

It's a tradeoff for the scientific community. If it can be proven to be a good benchmark with a low risk of contamination, I think it has a chance of being used for research.

vital pawn
#

At least for the most powerful LLMs

pastel wagon
vital pawn
analog creek
pastel wagon
# analog creek Do you have some ideas how this can be done?

Not for sure, but one way I can think of is fine-tuning a model on those problems (intentional contamination) and comparing it to the baseline. If the performance of the baseline and the fine-tuned version is close, your benchmark is contamination-proof. You would also need to show, in the same way, that traditional benchmarks are not contamination-proof.
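The comparison described here boils down to an accuracy gap between two models on freshly generated questions. A minimal sketch, assuming each model's results are available as a list of per-question pass/fail flags; the function name and the `tolerance` threshold are hypothetical, and the example inputs are placeholders rather than measurements.

```python
def contamination_gap(baseline_correct, finetuned_correct):
    """Accuracy of the intentionally contaminated model minus the baseline's."""
    base_acc = sum(baseline_correct) / len(baseline_correct)
    tuned_acc = sum(finetuned_correct) / len(finetuned_correct)
    return tuned_acc - base_acc

def looks_contamination_proof(baseline_correct, finetuned_correct, tolerance=0.02):
    # Assumption: a gap within `tolerance` counts as "close"; the right
    # threshold would have to come from a proper significance test.
    return abs(contamination_gap(baseline_correct, finetuned_correct)) <= tolerance

# Placeholder pass/fail flags, only to show the call shape.
baseline = [True, False, True, False]
finetuned = [True, True, True, False]
print(contamination_gap(baseline, finetuned))
```

Running the same comparison on a traditional static benchmark would give the contrast the message asks for.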