#ai-mathematical-olympiad-prize


weary kelp
#

This should help boost performance through prompting. I’m excited to see how people can push the performance of LLMs in math.

sleek plank
#

A few weeks ago, I downloaded and parsed the AIME / AMC12 problems from AoPS. I have already shared them on Lean forum, so let me share them here too: http://olsak.net/mirek/aimo-datasets/

valid shoal
gritty oxide
valid shoal
final breach
#

silly question here..but this competition is effectively only for LLMs right

normal eagle
#

new to kaggle and programming, so didn't realize this was for local models only. spent half a day (literally 12+ hours playing gpt4 and claude 3 models)

valid shoal
plush crater
#

Today I learned: Kaggle CPU-only notebooks run on an Ubuntu-20.04 image, while GPU T4 x 2 notebooks run on an Ubuntu-22.04 image.

valid shoal
#

I would greatly appreciate it if someone could benchmark the performance of various LLMs

round obsidian
#

Hello, sorry if this is off-topic. I created a question on Manifold Markets so you can bet play money on the winning score: https://manifold.markets/EmanuelR/how-many-problems-will-be-solved-by?r=RW1hbnVlbFI The market determines the probability distribution of the winner's score.

Manifold

This is the leaderboard https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/leaderboard

The competition consists of creating an AI that runs on a maximum of 9 hours on 2 Tesla T4 or 1 P100 GPU's, and solves a hidden list of 50 math problems with an answer between 0 and 999, anything that runs on the Kaggle VM that doesn't use the...

normal eagle
#

weird but whenever I add "think step by step" to my prompts the performance usually goes down 😅.

normal eagle
#

Also, somehow all explanations for correct answers are around 400 words. Often when it goes over that, the answer goes wildly wrong. GPT-4 (via ChatGPT) recovered some of those but none of the local model runs have done that.

abstract fjord
#

Can we install and run llama index in the notebook?

serene fiber
#

I guess you can use anything that is available for the public on Kaggle

true moat
#

Stay tuned for further updates to the dataset, as I have processed only half of the parsed data with answers in \boxed. The other part needs more careful or even manual preprocessing.

hidden isle
#

how to start it , I am new here

storm pond
hidden isle
#

Okay I'll try

storm pond
#

gl

sinful raft
#

why is the current cap 17/50? does it have to do with transformers or prompting?

lusty moon
#

It's meant to be tough

finite stump
#

Can we use a mixture of NLP and ATP, or should we use just NLP?

sinful raft
lusty moon
#

Then deepseek math 7b instruct made a jump

#

Then self consistency made another jump

#

Can get 15-16 rn myself

#

Someone made 18

sinful raft
#

ohhhhh gatcha, i have the deepseek math bookmarked for reading. ill get on that. also do you know what made the 17-18 jump possible?

#

wow yeah they did it today

#

i guess my other question would be gemma 2b and it's applications in models vs transformers

torpid rapids
#

I have a basic question about code competitions. What models are we allowed to use exactly? Is it basically the list that shows up if I click on "Add input" in the right hand side bar on the edit code page? Also, how do I check if the model was available before Feb 24th or not?

upper knot
vital cave
lusty moon
#

Also you can use any pretrained model, just no internet access right?

wintry juniper
#

the rules state it had to have been available before feb 23, 2024

lusty moon
#

So no fine tuning?

wintry juniper
#

that's a grey area, not sure

clever heart
#

how much time does it take to see submitted results?

lusty moon
upper knot
#

mmos further finetuned on top of deepseek rl

clever heart
vital cave
hybrid arrow
#

I have a background in Maths, mostly developing content, and have participated in these olympiads; it's easier for me to solve problems than to have LLM do it.
Anyway, I have created some basic models in the past, but this is entirely new to me.
Anyone looking to team up, I can contribute vastly to types of problems, datasets for the same and tweaking things.

hidden isle
#

Anyone aware of wizardmath model ??

upper knot
vital cave
upper knot
#

interesting

upper knot
# vital cave It is mentioned only in the title as I can understand from the attached comment. ...

i think the comment author is mistaken. it's unclear now if the weights are the original model or mmos finetune, but it clearly isn't directly from hf.

if you look on hf, the original model is split into 2 safetensors files while the author's upload is split into 3 files. either the author did the safetensors conversion themselves or they are using a different model and forgot to change the name

vital cave
vital cave
upper knot
#

okay awesome thanks

tranquil flame
torpid rapids
#

btw all the top notebooks seem to be running self-consistency with only 8 samples whereas the original deepseek-math paper reports 64 samples. I wonder how many samples can one squeeze into the 9 hr time limit
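
A rough way to estimate this is to split the time budget evenly across the problems. The sketch below assumes the 9-hour limit and the 50 hidden problems mentioned earlier in this channel; `seconds_per_sample` is a hypothetical average you would have to measure yourself, and `startup_seconds` is an assumed allowance for loading the model.

```python
def max_samples_per_problem(seconds_per_sample, n_problems=50,
                            time_limit_hours=9.0, startup_seconds=0.0):
    """Estimate how many self-consistency samples fit per problem.

    Assumes every problem gets an equal share of the time budget and that
    every sample takes about `seconds_per_sample` seconds (both assumptions).
    """
    budget = time_limit_hours * 3600 - startup_seconds
    per_problem = budget / n_problems  # 648 s per problem with the defaults
    return int(per_problem // seconds_per_sample)


# Hypothetical timings, not measurements:
print(max_samples_per_problem(40))  # 16
print(max_samples_per_problem(10))  # 64
```

Under these assumptions, 64 samples per problem would need roughly 10 seconds per sample, so batching the samples is probably the only way to get close.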

lusty moon
boreal dew
#

Hi,
I have a question around the problem id 246d26

Each of the three-digits numbers  111 to $999...

where the suggested correct answer should be 250.
But my model gave me a larger value, 267; since the problem asks for a maximum, I guess 267 should be the correct answer. Here is the analysis from my LLM:

The solution using modular arithmetic appears optimal based on the problem's constraints:

Selection Criteria: Yellow numbers were selected based on their last digits under modulo 10 arithmetic. The set of valid last digits chosen was {1, 3, 7}. This choice ensures that the sums of any two yellow numbers do not have a last digit that falls back into this set, fulfilling the requirement that their sums must be blue.

Verification of Sums: The calculations confirmed that no sums of two yellow numbers end in the digits 1, 3, or 7. Therefore, all such sums must be blue numbers as required by the problem's constraints.

Outcome:

Number of Yellow Numbers: 267, which is maximized under the chosen method.
Sums Compliance: There were zero invalid sums, indicating full compliance with the problem's condition.
This approach effectively uses a simple modular arithmetic concept to partition the set of numbers from 111 to 999 into yellow and blue categories such that the sum of any two yellow numbers always results in a blue number. It maximizes the number of yellow numbers while ensuring compliance with the given sum condition.
The analysis from my model seems correct to me `267 = (100 - 11) * 3`. But always good to have an additional brain on it 😄 . 
Would it be possible to double-check my result with a provider of the problem?
Thanks 🙂
normal eagle
boreal dew
normal eagle
#

Tbh, this is the only problem in the public 10 for which I don't have a plausible standardized solution.

But one correct line of logic is

Max (Y) < 999/2.
=> 500 is min Blue such that it is higher than any yellow.
=> Min (Y) = 250.

So 250 to 499 are all Yellows.
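
Both candidate colourings can be checked by brute force. This is only a sketch: the problem statement is truncated above, so it assumes the rule is that the sum of any two (not necessarily different) yellow numbers must itself be a blue number in 111..999. The function name is made up for illustration.

```python
LOW, HIGH = 111, 999  # the three-digit numbers being coloured


def is_valid_yellow_set(yellow):
    """True if every sum of two yellow numbers is an in-range, non-yellow number.

    Assumed rule: the sum of two yellow numbers must be a blue number,
    so it has to lie in LOW..HIGH and must not be yellow itself.
    """
    yellow = set(yellow)
    for a in yellow:
        for b in yellow:
            total = a + b
            if not (LOW <= total <= HIGH) or total in yellow:
                return False
    return True


interval = range(250, 500)
last_digit_137 = [n for n in range(LOW, HIGH + 1) if n % 10 in (1, 3, 7)]

print(len(interval), is_valid_yellow_set(interval))              # 250 True
print(len(last_digit_137), is_valid_yellow_set(last_digit_137))  # 267 False
```

Under that reading, the last-digit set has 267 members but fails, because sums such as 997 + 997 = 1994 are not three-digit numbers at all. The check confirms that 250..499 is valid; it does not prove that 250 is the maximum.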

boreal dew
#

It is an interesting problem for sure, love it. I'll let the competition host answer this one. I think it needs a peer review for sure.

normal eagle
#

What I meant by standardized solution here is that I can't seem to recall a problem-solving strategy that forces you to count backwards from 499 instead of starting from 111.

For other problems example: in problem 1 (sum of squares of distances) -> the key insight is to use the sum and product of roots of a quadratic because difference of roots is given. or say in problem 3 (sparkle) the key insight is sum of digits < 3 which then automatically brings up combinatorics as a valid strategy.

haughty mountain
#

Would it violate contest rules for me to have a math friend help? I have a close friend who is a longtime MO competitor, and I passed along this question to him

#

By “help” I mean try to give his answer to the question

boreal dew
#

not at all @haughty mountain . That is the beauty of Mathematics. Share the love 🙂

haughty mountain
#

I am an organizer from a different contest so not the authority that my name color might suggest

#

I have successfully passed along the question but not received an answer yet...

dapper cliff
#

@normal eagle I think you are close, then maybe adding the fact that sum of any two odd numbers is even you can remove even yellows which is roughly 250. Now need to polish the edges. 😉

normal eagle
valid shoal
normal eagle
#

Actually, I found the gap in my solution (which was making it seem random). It's not a single possible distribution.

N = [111,999] = dY+dB+mY+mB (d, m => definitely and maybe)
So you calculate dB and dY sets in parts by elimination from N. Then apply box principle/conditional sorting to mY and mB.
How to get dB = if 2*i in N is not in N, it's in dB.
dB = [500,999], eliminate this from N
How to get dY = if i + N is in dB, i is in dY,
which means [389,499] is dY.

now for [111, 388] if any number is added mY or mB, one corresponding number has to be added to another set. => 278/2 = 139 maybe Yellows.

Total yellows are 111+139 = 250.

valid shoal
normal eagle
valid shoal
normal eagle
lethal arch
#

You don't have to prove this with any theorem. In this kind of math competition you just need an educated guess: if you find an f(n) function that satisfies the conditions, you can guess the solution without proving it.

normal eagle
shy tundra
#

is VLLM allowed in this competition?

#

or exllama, llama.cpp, etc

dapper cliff
#

yes, but with vllm, for example, it is tricky to make it work on multiple GPUs and it does not support P100

#

oh, it doesn't support TPU either

dapper cliff
#

When I am loading a model with vllm I can load it on one GPU, and it takes around 13.8GB, but then everything crashes when I try to load it on two GPUs.

neon trench
#

Is the kaggle TPU option (vm v3-8) allowed for the submissions? I only see people submitting with the P100 and T4 options, as opposed to the kaggle TPU option, but I don't recall seeing anything about this in the rules?

lusty moon
#

Any offline inference form is fine

vital cave
#

I just dropped 4 places in 1 sec :/ from 30 to 34 :/ I hope there is no private sharing ...

#

now 35th 😦

shy tundra
#

well, the more attempts you get the scores go up because people get luckier

bitter kernel
#

LLAMA3 came out

floral barn
#

if you thought grok was big to run

sweet bluff
#

Seems like the 20/50 prize has been won with a very underwhelming solution to say the least.

vital cave
#

What will happen if the current 1st shares their notebook? what happens if there is a current scoring notebook that was submitted say 8 hours ago and scored 20 ? it will directly be 1st.

#

Kaggle scoring mechanism depends on submission time not execution time, so I still think this is not over (to be fair)

sweet bluff
#

Be the first to publish a public notebook scoring at least 20/50 on the leaderboard before April 22, 2024 11:59PM UTC.
By this I personally understand that only which notebook is published first matters, but idk.

vital cave
sweet bluff
#

Oh alright. A bit weird wording then.

vital cave
#

If my understanding is correct, there is time for surprises

lethal arch
#

But does this also mean that the current no. 1 can wait to make her notebook public until Sunday?

vital cave
sweet bluff
#

Hopefully it will not be a linear probed/hardcoded problem numbers solution.

lethal arch
sweet bluff
#

Oh yeah, I meant the one from the first person on the leaderboard

vital cave
lethal arch
#

When a week ago I reached 18 points at least I thought I added something a little bit useful to the base code, not just probing the near-miss problem_ids

vital cave
#

probing was expected: there are 2 months with 60 subs, and probing might reach further, and then comes the private LB. But winning 10K with probing is .... well ... sad 😦

#

currently, Natalia Ogneva submitted 7 hours ago. If, 2 hours from now, no one has managed to become first (20 or 20+ but larger submit time), then she will be the first to get to 20! And for this to be fair, she has the right to share her notebook until Sunday

#

if she did not share the notebook then the current one will be first.

bitter kernel
#

gosh what a smart people

dapper cliff
plush crater
broken vale
#

So when will the submissions be open again?

broken vale
#

Are llama 3 weights allowed. They were released after the competition start, but the model is listed in the models page on kaggle

dapper cliff
#

Which models page?

broken vale
#

the models tab that's on every competition

rare sandal
#

LLAMA3 cannot be used. Model weights were released on 2024-04-18. The rules state that

Teams may only use AI models and tools that are open source and were released prior to 23 February 2024.
broken vale
#

The leaderboard has been cleared 😄

broken vale
#

submissions seem to have opened again!

floral sentinel
#

is there a notebook explaining the new submissions system yet?

#

the pinned one is not clear and seems to have issues

rose nest
#

for some reason writing submission.csv raises a PermissionError
here is a sample code and its output error

import pandas as pd
import random

sn = []
for i in range(len(test_data)):
    r_num = random.randint(0,999)
    sn.append((test_data[i][0],r_num))

snd = pd.DataFrame(sn, columns=['id', 'answer'])
snd.to_csv("submission.csv", index=False)

this is throwing the error

PermissionError: [Errno 1] Operation not permitted: 'submission.csv'
broken vale
hoary rampart
#

how you guys been running the llms? do you pay for the vertex ai subscription or use private hardware?

floral sentinel
#

are you running out of memory when loading in a notebook?

rose nest
#

Exception: You can only iterate over `iter_test()` once.

is there a way to avoid this, so I can test the code multiple times without restarting the kernel each time?

broken vale
hoary rampart
#

I have used llms in the past but i am not sure if for this competition my submission should run on the kaggle notebook or can use something else

floral sentinel
#

there are couple of notebooks in the code section of the competition showing how people used inference of llms

#

this is good notebook, but keep in mind that there is a new API for submission

floral sentinel
hoary rampart
#

wow thanks for that, i will defo check it out

#

@floral sentinel how is your submission going?

floral sentinel
hoary rampart
#

i just submitted one of the examples you sent me to see how it goes. its been an hour and it is still executing :/

floral sentinel
#

it depends on the variable "n partitions". Because of the new API, it answers the questions sequentially: it solves the first question with n partitions, then moves on to the next, and so on, which makes it slow

floral sentinel
# hoary rampart haha same brother. why is that?

Cause this is my first time getting into LLMs field, and honestly I don't find myself flexible using them, plus I only recently learned about hugging face, so I am kinda new to all of this stuff.

broken vale
#

anyone know how to decode bitsandbytes 4-bit weights back into 16-bit weights?

fresh gust
broken vale
fresh gust
broken vale
# fresh gust

You might have to add the ai-mathematical-olympiad-prize competition dataset first

#

is that a kaggle kernel btw?

#

looks like a colab notebook

fresh gust
#

yep, sorry I created the notebook without adding AIMO competition dataset

#

it's working now

broken vale
#

nice 👍

fresh gust
#

kernel

fresh gust
#

is that it for submission?

broken vale
fresh gust
#

@broken vale this is the first time I'm trying to submit in a competition on kaggle. I ran my submission script twice yesterday and got error running them, so look like my submissions were wasted.

I think the reason is that I was using huggingface api to load the model within the notebook, but while submitting it is required to turn off the internet connection in the notebook, maybe that's why my notebook threw error, as it wasn't loading the model via api call.

Just to be clear on this, do I have to download the model first and then add it to notebook so it can be loaded without internet?

broken vale
floral sentinel
# fresh gust is that it for submission?

add .values[0] next to test['problem'] in sample_submission['answer']=predict(test['problem'])

cause for some reason it throws error when using only predict(test['problem'])

#

should be predict(test['problem'].values[0])

fresh gust
#

What does .values[0] do in this case?

floral sentinel
#

test['problem'] returns <class 'pandas.core.series.Series'>

#

so test['problem'].values[0] returns the actual problem string

floral sentinel
#

or any other model u wanna use

fresh gust
fresh gust
broken vale
floral sentinel
#

"So using safetensors weights for DeepSeekMath is definitely in the spirit of the rules and is within the rules as these are not new weights."

fresh gust
floral sentinel
#

Competition host said it's okay

fresh gust
# floral sentinel "So using safetensors weights for DeepSeekMath is definitely in the spirit of th...

oh yeah about that, the host is okay as long as everything remains open source (code, dataset, model weights). So if someone finetunes the deepseek-math model, he may upload the finetuning data as open source, but he will still not have the pretraining data of deepseek-math. In this case, will he still be eligible for the prize? (where the pretraining data is not open source, only the finetuning data is)

fresh gust
floral sentinel
#

ok

#

good luck bro

fresh gust
#

you too bro

tough olive
#

Is there any limit on model size in this competition? Or is it just a limit on GPU run time? Are there any obvious differences between the three GPUs available (T4x2, P100, TPU)?

wintry juniper
#

No limit on model size; the limit is on the date the weights were published. You can't use TPU for submission. P100 is slightly faster but would need a quantized version due to less VRAM, leading to a loss in accuracy.

tough olive
#

Thanks a lot!

rare sandal
#

story of this competition

#

GCD of 2 EVEN numbers is 1 by DeepSeekMath 🤣

floral sentinel
fresh gust
#

How are you guys loading models without internet?

broken vale
#

you got two choices, either as a dataset or as a model

fresh gust
#

I've uploaded the model from huggingface to kaggle as a dataset, but I'm not able to use AutoTokenizer or AutoModelForCausalLM when I load it from Input

#

They work just fine when I load them from huggingface

#

is this the right way of uploading it?

#

@broken vale

broken vale
#

that looks correct, now click the "Copy file path" icon on the model, and use that as the model_path

fresh gust
#

that's exactly what I'm doing but I'm getting error

broken vale
#

what does the error say?

#

copy the file path of the model folder, not the files inside the folder

fresh gust
#

wait... it just worked in a new notebook

#

but I'm confused

#

the new notebook imports transformers 4.93.3, whereas the old one imports transformers 4.20.1, by default

#

why so?

broken vale
#

maybe the old one is pinned to the original environment

fresh gust
#

and what about these libraries? are they imported with internet off?

broken vale
#

they should be part of the kaggle environment

#

bitsandbytes is not, so if you are doing nf4 quant you need to add the wheels for bnb and pip install the whl

fresh gust
#

I see, thanks for the help

broken vale
#

np

fresh gust
#

@broken vale one more thing. when I run this code, a csv file is generated in /kaggle/working (output directory). is this the output file that I have to select while submitting my notebook version?

broken vale
#

no, you gotta use the aimo api, no files should be saved

#

the env.predict step is the submission of that problem

#

you are using the api i see, everything after the env.predict is not needed

#

also, test['problem'] returns a series, you need to access the problem with test['problem'][0] to get the problem string

fresh gust
#

but when submitting. it asks for output file. what is that?

#

you mean like this?
predict(test['problem'][0], one_shot_context)

broken vale
#

yes

#

you should try to run it in test before you submit

#

might have other errors

fresh gust
broken vale
#

there should be no output file.

#

use the submit button in the right tab modal in the kaggle notebook

fresh gust
#

okay I see, the submit button in the notebook shows a different sidebar than the one on the competition page

#

got it

broken vale
#

👍

fresh gust
#

I see. I'm uploading now, will see if this error occurs

#

and also, I was not able to do 5-shot learning in context due to kaggle's limited GPU VRAM. Is this possible in submission too? How do I know that the submission will not run out of CUDA memory?

broken vale
#

you dont, if its happening in your notebook it will probably happen in the submission too. it usually does

#

try using a 4bit quant

#

or shard the layers on multiple gpus

fresh gust
#

so submission is using the same kaggle GPU?

broken vale
#

if you use 2 T4 gpus with device_map="auto" on a 7B model that should be enough even in float16
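
A back-of-the-envelope check of that claim, as a sketch only: it counts the weights alone and ignores the KV cache, activations, and CUDA overhead, which all need extra room. The helper name is made up, and the 16 GB per T4 figure is an assumption about the hardware, not something read from the Kaggle docs.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory taken by the model weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


T4_GB = 16  # assumed memory per T4

print(weight_memory_gb(7e9, 16))  # 14.0 -> tight on one T4, fine across two
print(weight_memory_gb(7e9, 4))   # 3.5  -> a 4-bit quant leaves far more headroom
```

The 14 GB estimate for float16 is in line with the roughly 13.8GB reported earlier in this channel for a single-GPU load.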

#

yeah

fresh gust
#

I see

#

got it

fresh gust
#

well that's interesting. I wonder how do I figure out what went wrong there

broken vale
#

haha, nice. got that for some of my notebooks too

fresh gust
#

and I have nothing on top of my head how to debug it 🤷‍♂️

broken vale
#

you just didnt get any answers right

#

run it on the training set in the notebook

#

youll probably see that you wont get any right

proper ferry
#

When we run a notebook, the result only shows the script without the notebook and output. Why?

#

I guess the error will make the notebook invisible right?

broken vale
floral sentinel
#

Most forked notebooks had unorganized code structure, and the import libraries were scattered here and there, I cleaned up the code and only left necessary stuff

proper ferry
broken vale
floral sentinel
#

Deepseek generated Chinese characters...

candid holly
#

how to upload models from huggingface to kaggle?

candid holly
#

The notebook runs successfully but submission failed. What happened and how to fix it?😪

floral sentinel
stray mulch
#

Is a model that is completely open source (but was released after 23 February 2024) allowed?

west prism
#

Why do some of my notebooks get cancelled before they finish in Olympiad?

floral sentinel
stray mulch
floral sentinel
stray mulch
floral sentinel
#

that's just a tab that shows models that people used in their notebooks, it doesn't necessarily mean it can be used according to the rules...

stray mulch
#

lets hope the hosts answer

floral sentinel
#

found something similar to your question

stray mulch
#

aw man 😦

#

what about things like xLSTMs? which are just modifications of LSTMs. Are they allowed?

floral sentinel
#

idk about that tbh

#

i think they only put limits on LLM models

#

xLSTM is research based i guess

stray mulch
stray mulch
floral sentinel
stray mulch
#

Good idea

boreal dew
#

Do the Python modules used for the submission also need to have been released before the 23rd of February, 2024?

minor crystal
#

Is it possible to find more training math problems somewhere that are the same type as the official ones?

floral sentinel
#

the above 2 need a bit of data cleaning before you proceed to use them

candid holly
floral sentinel
broken vale
#

everybody seems to be using deepseek-math, are there any other models that are okayish?

floral sentinel
#

I saw someone using MathGenie, but people in the comments said that it's a model released after February 23rd

proper ferry
#

So the new iter_test API can only be initialized and iterated once, and then it dies? The only way to reuse it is to restart the notebook? What...

floral sentinel
#

cause it doesn't allow you to go next question without using that function to evaluate your answer

proper ferry
#

Thank you, yes, I used predict. And my test result is: in one notebook, you can only initialize the env for one time, and only can iterate it for one time. The next iteration will get a null response.

broken vale
proper ferry
#

Yes, the only way to reuse the iterate environment is to restart the notebook LOL

fresh gust
#

is it possible to submit notebook using GPU T4 x2 if my weekly quota for T4 x2 is finished?

floral sentinel
fresh gust
#

@floral sentinel I used your notebook for submission as well, but the score I got was 18 not 20. What do you think is causing discrepancy in results?

floral sentinel
# fresh gust @floral sentinel I used your notebook for submission as well, but the score...

Yeah that's what I noticed as well. I submitted multiple times, and each time it gave different results; you can even check the version history of the notebook.

I think it's because of model generation. We use the "self consistency" method, where the model takes in the problem and solves it n_repetitions times, and we take the most frequent answer.

Example

n_repetitions = 7

problem = "{math problem}"  # placeholder for the problem text

answers = []

for _ in range(n_repetitions):
    output_text = model(problem)
    
    model_answer = process_output(output_text)

    answers.append(model_answer)

and let's say the answers list is [52, 52, 21, 222, -1, 52, 0 ]

from this answers list we see that 52 is the most consistent one, so we take that and submit it
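
The voting step can be sketched with `collections.Counter`. This is a minimal illustration, not the notebook's actual code: `most_consistent_answer` is a made-up name, and treating anything outside 0..999 (such as -1) as a failed parse is an assumption.

```python
from collections import Counter


def most_consistent_answer(answers, fallback=0):
    """Return the most frequent valid answer, or `fallback` if none is valid.

    Answers outside 0..999 are dropped before voting (assumed to mark
    failed generations). Ties go to the answer that appeared first.
    """
    valid = [a for a in answers if 0 <= a <= 999]
    if not valid:
        return fallback
    return Counter(valid).most_common(1)[0][0]


print(most_consistent_answer([52, 52, 21, 222, -1, 52, 0]))  # 52
```

Because each sample is drawn at random, a different run can produce a different list and therefore a different winner, which is one reason two submissions of the same notebook can score differently.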

#

so for my case when I submitted I got [52, 52, 21, 222, -1, 52, 0 ]

#

but when you submit for example ...you get [9, 9, 52, 52, 9, 0, 22], in which case 9 would be the most consistent one in your case

#

another factor would be top_p and temperature parameters

#

honestly deep-seek model is weird

fresh gust
#

I see, I noticed that too. Maybe it's because the next token is sampled randomly, so every time it runs, a different output is produced

#

also in your notebook the maximum time for notebook is set to 31500, which is equal to 8.75 hours. Is this the maximum time the notebook runs in submission? Have you inferred this, or is this mentioned in the competition or maybe it's something everyone knows? I'm new to competitions so still figuring this out.

#

Also, how would I know if my submission exhausted all the time allowed?

floral sentinel
#

the reason for 8.75, is because maximum hours of notebook runtime that is allowed is 9 hours... that's why we used 31500 seconds, and in the code itself we put a condition that if it exceeds the limit, then just return 0, which means it will submit 0 as an answer
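
A minimal sketch of that kind of guard, assuming the check happens once before each problem is attempted. `answer_or_zero` and `solve` are hypothetical names; the `now` and `start` parameters exist only so the behaviour can be tested without waiting.

```python
import time

TIME_LIMIT = 31500  # seconds, i.e. 8.75 hours, leaving a margin under the 9-hour cap
START = time.time()


def answer_or_zero(problem, solve, now=time.time, start=None, limit=TIME_LIMIT):
    """Call `solve(problem)` unless the time budget is already spent.

    Once more than `limit` seconds have elapsed since `start`, return 0
    immediately so the remaining problems still get an answer submitted.
    """
    start = START if start is None else start
    if now() - start > limit:
        return 0
    return solve(problem)
```

Note that this only checks the clock before starting a problem; a single very slow `solve` call could still push the run past the limit.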

floral sentinel
#

in "normal running" mode people set TIME_LIMIT = 1, basically just run the code quickly as possible so it can move into "submission" mode, where TIME_LIMIT becomes 31500

#

only "submission" or "private" mode matters for scoring
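
One way to write that switch, as a sketch: it relies on the `KAGGLE_IS_COMPETITION_RERUN` environment variable that shared notebooks for this competition check, and `pick_time_limit` is a made-up helper name.

```python
import os


def pick_time_limit(environ=os.environ):
    """Return (private, time_limit_seconds) for the current run.

    Assumes KAGGLE_IS_COMPETITION_RERUN is set only during the scored rerun.
    """
    private = bool(environ.get("KAGGLE_IS_COMPETITION_RERUN"))
    # 31500 s for the scored rerun, 1 s so the normal run finishes quickly.
    return private, (31500 if private else 1)


PRIVATE, TIME_LIMIT = pick_time_limit()
```

Keeping the normal-mode limit tiny means the interactive run ends almost immediately and does not eat into the weekly GPU quota.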

fresh gust
#

And how do I check that if my notebook ran completely in less than 9 hours and didn't return answers as 0?

floral sentinel
#

idk how else to check

fresh gust
#

where does it show time?

floral sentinel
#

after you click submit, 2 things run

#

one your notebook, other your submission

#

click on the submission one, and it shows for how long it was running

fresh gust
#

in my submission TIME_LIMIT = 31500 if PRIVATE else 31500. This means that it will first completely run on train data (inside the 9 hour time limit) and then on test data?

floral sentinel
#

but you don't need to do 31500 for normal one tbh

#

that one doesn't score

fresh gust
#

I was just using 31500 for normal to test prediction on training data. Didn't know it would have an effect on the submission as well

floral sentinel
#

oh

#

if you use 31500 for normal it will use your quota

#

gpu quota

fresh gust
#

interesting

#

that's where my GPU quota has been going then

floral sentinel
#

besides the score doesn't depend on normal one

floral sentinel
fresh gust
#

this is my first time participating in competition so this was meant to be lol

#

but I still don't understand how to check time for the submission notebook

#

for example

#

this is my submission running

floral sentinel
#

ok it's been running for 2h now

#

wait... why the normal one has been running for 2h?

#

you used 31500 there?

fresh gust
#

that's why

floral sentinel
#

ok i see

fresh gust
#

or the entire submission?

floral sentinel
#

private one is above

#

the one writing Scoring...

fresh gust
#

it doesn't show time

floral sentinel
#

it shows next to it

#

on the right

#

been running 2h as well

#

both start at the same time

#

but because you used 31500, it will run approx. 8.80 hours

#

for both

#

gotta wait 6.8 more hours

fresh gust
floral sentinel
fresh gust
#

but it only has 10 problems to solve

floral sentinel
#

oh yeah

#

you are right

fresh gust
#

shouldn't it take 5 times less time?

floral sentinel
#

yeah

#

forgot that normal one had 10

fresh gust
#

but okay. when normal one ends, a new version is saved, for which I can see the time in version history

floral sentinel
fresh gust
#

LIKE THIS

floral sentinel
#

sadly the private one doesn't show the run time

floral sentinel
fresh gust
#

yes so 7353 was normal submission time last time i submitted

fresh gust
floral sentinel
fresh gust
floral sentinel
#

someone probed the LB to get to 20 when they announced early prize

#

that's why they created this new submission system

fresh gust
#

I see

#

maybe i'll just have to check manually then

#

by the way, while one of my submissions is still scoring, can I submit another one or do I have to wait for it to finish?

floral sentinel
#

but you have normal one running which uses GPU

#

so i guess u gotta wait for normal one to finish

#

but idk, you can try

#

try submitting another one

fresh gust
#

I'll do that after the normal one has finished running. I'm already low on quota so I'll run accordingly

floral sentinel
#

it will take 370 seconds for normal one

fresh gust
#

yea that's right

#

I will

#

thanks bro

floral sentinel
rocky kayak
#

Do we know where the models get the solution wrong at the moment? Is it 1) at the beginning where it tries to understand the problem and lets say come up with necessary equations; 2) Follow on algebra or equation manipulation; 3) Calculating values; 4) or something else that I am not thinking of here?

candid holly
marble dagger
#

Does anyone know why the submission failed while the new version is still running?

valid shoal
floral sentinel
#

new api evaluates the answer directly, as far as I know

#

no need to create submission.csv

floral sentinel
#

maybe due to some connection error it might've not run

#

otherwise I don't know the reason

marble dagger
#

import os

if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    PRIVATE = True
else:
    PRIVATE = False

if not PRIVATE:
    import pandas as pd

    class train_env():
        def __init__(self, randomize=False):
            self.randomlize = randomize

            self.df = pd.read_csv('/kaggle/input/ai-mathematical-olympiad-prize/train.csv')
            self.df['ground_truth'] = self.df['answer']
            self.df['answer'] = -1

            if self.randomlize:
                self.df = self.df.reset_index().sample(frac=1).reset_index(drop=True)

            self.predict_called = True
            self.counter = 0
            self.len = len(self.df)

        def iter_test(self):
            while self.counter<self.len:
                if self.predict_called:
                    self.predict_called = False
                    yield (self.df.loc[[self.counter]][['id','problem']]),(self.df.loc[[self.counter]][['id','answer']])
                else:
                    print("You must call `predict()` successfully before you can continue with `iter_test()`")
                    yield None

        def predict(self, answer):
            self.df.loc[self.counter, ('answer')] = answer['answer'].values[0]
            self.predict_called = True
            self.counter+=1

    env = train_env(randomize=True)
    iter_test = env.iter_test()

else:
    # Set up the evaluation API
    import aimo

    env = aimo.make_env()
    iter_test = env.iter_test()

for test, sample_submission in iter_test:
    problem: str = test['problem'].to_string(index=False)
    sample_submission['answer'] = get_answer(problem)
    env.predict(sample_submission)
    print(test)
    print(sample_submission, '\n')

#

About the submission part of my code: get_answer() returns a number, so there is nothing wrong, right?

floral sentinel
#

also it should be test['problem'].values[0] to return a number

marble dagger
#

I think these two ways do the same thing?

floral sentinel
#

I did test['problem'].values[0] and it works just fine

#

this is the first time I've seen someone do str = test['problem'].to_string(index=False)

marble dagger
#

I tried test['problem'].values[0], nothing changed

marble dagger
floral sentinel
#

get_answer() are you sure this function returns a number?

marble dagger
#

there is something I didn't notice

floral sentinel
#

that's just printing number

#

can you check the type of the output

#

type(get_answer(problem))

#

it might be returning string value instead of int
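
A minimal sketch of the type check being suggested here, assuming get_answer() may hand back a string, a float, or None. The helper name to_valid_answer and the fallback value 0 are hypothetical, and the final % 1000 assumes answers must land in the 0 to 999 range mentioned in the competition description.

```python
import re


def to_valid_answer(raw, default=0):
    """Coerce a model output to an int in [0, 999]; fall back to `default`."""
    try:
        if isinstance(raw, str):
            # Pull the first number out of free text such as "answer: 1702".
            match = re.search(r"-?\d+(?:\.\d+)?", raw)
            if match is None:
                return default
            raw = match.group(0)
        # NaN and infinity raise here and are handled by the except clause.
        return int(round(float(raw))) % 1000  # assumption: answers are 0..999
    except (TypeError, ValueError, OverflowError):
        return default


print(type(to_valid_answer("52")), to_valid_answer("52"))
```

Wrapping the call as to_valid_answer(get_answer(problem)) would keep a stray string or None from reaching env.predict().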

marble dagger
#

When I use test['problem'].values[0], the version run fails

floral sentinel
#

check the return type

#

I wanna make sure whether it returns string or no

#

without submitting

#

on your notebook, run the cell

marble dagger
#

Ok I'll try it , but according to the code , it should be num

rare sandal
#

The CV/LB correlation is totally not existent in this comp 😐

#

I get a question correct on the Colab L4 (answered correctly over multiple runs too)...then I try testing on Kaggle T4, it becomes a wrong answer. I set a seed too 🤣

rare sandal
#

don't know what to trust at this point...made a change that is supposed to be a clear improvement

Gives 2 points boost on TWO different AIME datasets (both of them) - over multiple runs with random ordering, no difference. But the change drops the LB from 22 to 13 holyfuck

floral sentinel
#

cause people in discussion mentioned that deepseek is unstable

#

some tried to fine tune it, but it made it worse

rare sandal
#

I haven’t tried finetuning

floral sentinel
#

Olga tsymboi decided to use deepseek

#

then the whole competition members followed her using the same model

#

I am really confused on why nobody tried a new method or a model for a past month and a half

#

maximum score was 27 for a whole freaking month

#

a whole month

#

0 change

rare sandal
#

I did try other models released before Feb 23, but I couldn’t score more than 16 points

floral sentinel
#

everybody was brute forcing prompting the deepseek

floral sentinel
#

or model

rare sandal
#

If my team gets a good result in the private LB I will share it after the competition ends

floral sentinel
#

the name

rare sandal
#

Rank 70 right now on public

floral sentinel
#

frozen timtam

#

ooh you are from singapore

#

nice

rare sandal
floral sentinel
teal lake
floral sentinel
teal lake
#

what's it about

floral sentinel
#

using LLMs to solve math questions

teal lake
#

ohhk

floral sentinel
#

basically public dataset has 50 questions

#

and maximum we got is 27 score

teal lake
#

27/50 correct ?

floral sentinel
#

or to be more precise a team got 27 score

floral sentinel
#

for a whole month

#

that was the maximum score

#

no change at all

#

or more like 1 month and a half

teal lake
#

LLM and math is something which doesn't get along well

#

so what are the rules ?

#

can use any LLM ,

floral sentinel
floral sentinel
#

only

#

the thing is, they made this competition so they can motivate the research on improving the reasoning of LLMs

#

but I've barely seen any new research

teal lake
#

i read Llemma paper where they claimed to have a model dedicated to math

floral sentinel
#

people just been using self consistency prompting on deep seek, and that's it

floral sentinel
teal lake
#

it was released in 2023

teal lake
#

the whole reason behind LLMs being bad in math is tokenization strategy

floral sentinel
floral sentinel
#

I mean if we think about what LLMs are, they are just predicting what's the next word based on the current input sentence

#

they are good at generating text

#

it's missing the "reasoning" part

teal lake
#

current LLMs can't do math because they can't see the number as whole

#

which is short coming of tokenization

floral sentinel
teal lake
#

you can try tokenizing a random number

teal lake
floral sentinel
#

nah I don't think

teal lake
#

you can't add 12008 if you see it as 12 00 8
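
A toy sketch of that split, assuming a greedy longest-match tokenizer over a made-up vocabulary. TOY_VOCAB and greedy_tokenize are hypothetical and do not reproduce any real model's tokenizer; they only show how one number can end up as several unrelated pieces.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabulary, chosen only to reproduce the split in the chat.
TOY_VOCAB = {"12", "00", "8", "1", "2", "0"}
print(greedy_tokenize("12008", TOY_VOCAB))
```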

floral sentinel
#

even if they saw numbers

#

there is still the "logical" or "reasoning" part to perform action based on the input

#

which doesn't exist

#

yet

floral sentinel
teal lake
#

you can do all the reasoning but when it comes to do add / sub / mul / div it wasted

floral sentinel
#

but tokenization isn't the only thing holding back the LLMs when it comes to logical problem solving

teal lake
#

yes

#

i'm talking about math part

#

it can reason a question very well

#

but can't do it by itself

floral sentinel
#

how

#

and wdym by "reason a question very well", you mean they understand it?

teal lake
#

self.high = 0 + 5 * 16 // 7 - 1 = 11

floral sentinel
#

PEMDAS rule

teal lake
#

and i only know bodmas

#

its i believe standard

#

but here ans should be 10

floral sentinel
#

0 + 5 * 16 / 7 -1
according to PEMDAS rule:
step one we look if there are any parentheses:
Checking....
No
step two
Look if there are any exponents:
Checking...
No
Step three
Look if there are any Multiplication:
Checking....
Found "5 * 16":
Solving:
ans = 80 -> save answer in memory;
current equation = "0 + 80/7 - 1"
Equation changed... go back to step one...
This thing will go on loops until equation is reduced
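
For reference, a short sketch of how Python itself evaluates the expression. Note that the original line used floor division (//) while the walkthrough above uses true division (/), so the two give different results. The variable names are only for illustration.

```python
# * and // share the same precedence and associate left to right,
# so 5 * 16 is computed first, then divided by 7, then 1 is subtracted.
product = 5 * 16            # 80
quotient = product // 7     # floor(80 / 7) = 11
floor_result = 0 + quotient - 1

# The one-line forms agree with the step-by-step version.
assert floor_result == 0 + 5 * 16 // 7 - 1

true_result = 0 + 5 * 16 / 7 - 1   # true division keeps the fraction

print(floor_result)
print(round(true_result, 2))
```

With floor division the value is 10, which matches the "ans should be 10" remark; the 11 in the quoted line is the intermediate 80 // 7 before subtracting 1.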

#

and the final answer is found

#

this is the logic behind solving a simple equation

#

Human mind understands rules, and apply them step by step. We don't think "what's the next token comes " after seeing the equation

#

@teal lake

#

LLM would see that equation and output the token that comes after with the highest probability. That's why I said there is no reasoning

#

it doesn't think

teal lake
#

i will have hard time but let me find the paper

floral sentinel
#

alright

#

but you know, thinking logically and step by step about the equation you gave...
I think human brain works like a tree graph

#

somehow I understood that the equation was math, so my brain went to a "math" node, and there are branches of mathematics that I know based on my experience, for example:
"Calculus, Algebra, Arithmetic...etc", and I picked the topic related to the equation, and went to that node, and from that node more branches came...

#

and picked one that is closest to the problem aka "high probability", and extracted the steps from there and applied it step by step until the problem is solved.

teal lake
floral sentinel
#

so I think instead of training LLMs to solve math, I think we should focus on a neural Graph tree model something like that, that actually maps the experience as a tree and branches

teal lake
#

i don't have much idea about graphs

floral sentinel
floral sentinel
#

that doesn't necessarily mean they are able to think

teal lake
#

llama also did so did mistral

floral sentinel
#

they just brute forced the knowledge into LLMs

floral sentinel
teal lake
teal lake
floral sentinel
#

yeah it still doesn't prove anything

teal lake
floral sentinel
teal lake
#

yes

floral sentinel
#

if you ask me

#

the closest thing to reasoning that was done was Deepmind's alpha geometry

#

they used neurosymbolic approach

#

I would still believe reasoning was used there, unlike with pure LLMs

teal lake
#

chatgpt solved this question when prompted correctly
think step by step and reason before giving answer and consider all the information we gathered

#

can't prove more than this now

floral sentinel
#

for how long chatgpt was trained?

#

besides how many datasets it's seen?

#

how do I explain this

teal lake
#

is that the question here

floral sentinel
#

what I'm trynna say is

teal lake
#

you mean chatGPT might have encountered this passage before

floral sentinel
#

chatgpt is overfit

floral sentinel
#

and trained on it multiple times

#

it's memorized the problems

#

doesn't necessarily mean it understands them

teal lake
#

why didn't it answer in first few prompts then ?

floral sentinel
#

so it's no surprise that they are able to solve typical stuff

floral sentinel
teal lake
#

its not typical generation of text

floral sentinel
#

there is no reasoning in LLMs

teal lake
#

LLMs are not brain to begin with, just some random numbers and calculations

#

if you think that way then no LLM can do reasoning

floral sentinel
#

wait I'm cooking something

#

lemme show u

#

here are the questions

#

I copied first 3 questions to Chatgpt to solve

#

it got all 3 wrong

#

The answers should be as seen in the table:
52, 250, 702

floral sentinel
floral sentinel
teal lake
floral sentinel
#

or what do you call it

#

"hallucinate"

teal lake
floral sentinel
#

but it really is useless when steps mean nothing

floral sentinel
teal lake
floral sentinel
teal lake
#

this not how you should argue 🙃

floral sentinel
#

i am not arguing, I'm just trynna prove my point

#

sure,, with little prompting I can make it get better answers

teal lake
#

its the same

floral sentinel
#

but what I'm saying is LLMs don't have reasoning

#

they can't reason

teal lake
floral sentinel
#

what

teal lake
#

if it can answer then its fine

floral sentinel
#

dude

#

it's not about giving answer

teal lake
#

how do you define reasoning

floral sentinel
#

it's about

#

giving right answer

teal lake
#

really ?

floral sentinel
#

what's the point of giving a wrong answer

floral sentinel
#

I can literally just code a function to output random number

#

there it gives answer

#

does it mean it's able to reason now?

teal lake
floral sentinel
teal lake
#

reasoning is steps to reach there

floral sentinel
#

cause we need right answers

teal lake
floral sentinel
#

or rather "random reasoning"

#

LLMs do random reasoning

teal lake
floral sentinel
#

they don't even do reasoning at all honestly

teal lake
#

don't see answer

floral sentinel
#

no it doesn't do reason

#

it just generates text

#

based on what it believes is the correct token

teal lake
#

i give up

#

you win

floral sentinel
#

it's not about me winning or losing

#

we got a problem now

#

there are 50 math questions

#

people were only able to solve 27 of them

#

for a month now, nothing changed

#

how the hell we gonna increase the score

#

15 days left for competition...

teal lake
#
  • prompting is all you can do now
  • times up for experimenting things with architecture
marble dagger
#

27/50 is good enough for a 7B model

floral sentinel
marble dagger
#

model=Anren()

marble dagger
#

@floral sentinel I had set TIME_LIMIT in the loop, so why does the submission still time out? So confused

floral sentinel
#

one moment

floral sentinel
marble dagger
#

10h

#

no

#

How can I check the time? The time limit is 9h, and it started 10h ago, so it's over 9.

floral sentinel
#

yeah it's same

#

idk

#

couldn't figure out the problem

#

you writing code from scratch?

marble dagger
#

Fine, I'll reduce n_reps

#

No I copy from some one

floral sentinel
#

why u getting errors then

marble dagger
#

But the code is quite easy

floral sentinel
#

the api and predictions are mixed with the LLM solving

#

hard to debug

#

I'd rather create a function that takes problem as input and outputs answer.

floral sentinel
marble dagger
#

Idk what you really mean

#

is there open code that is as good as you say

floral sentinel
#

wait lemme send it

#

it's buried down in the code section

marble dagger
#

Elegant! so you mean I should put these code in the predict().

#

I do make a function. the problem is input and the answer as output

floral sentinel
rare sandal
#

I changed the seed and my 22 points submission dropped to 16 points 🤣

rancid compass
#

hi folks, in submission tab it showed throwing exception, however clicking into the notebook it said successfully ran in 132s. Is the submission successful?

serene fiber
rare sandal
proper ferry
#

what is the temperature parameter and the top k means?

serene fiber
rare sandal
floral sentinel
marble dagger
#

@floral sentinel have you tested how much time your code takes?
I mean you use 19 repetitions per question and there are 50 questions. Is 31500s enough to generate them all?
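
A rough sketch of the arithmetic behind that question, plus one way to spread a fixed budget over the remaining problems. The figures 31500, 50, and 19 come from the message above; solve_with_deadline, sample_once, and the clock parameter are hypothetical names for illustration.

```python
import time

TOTAL_BUDGET_S = 31500  # budget quoted in the chat (8.75 hours)
N_PROBLEMS = 50
N_REPS = 19

per_question = TOTAL_BUDGET_S / N_PROBLEMS   # seconds available per problem
per_rep = per_question / N_REPS              # seconds available per repetition


def solve_with_deadline(problems, sample_once, total_budget_s=TOTAL_BUDGET_S,
                        n_reps=N_REPS, clock=time.monotonic):
    """Collect up to n_reps samples per problem without exceeding the budget."""
    start = clock()
    results = []
    for idx, problem in enumerate(problems):
        # Re-split whatever time is left evenly over the problems still to do.
        remaining_problems = len(problems) - idx
        remaining_time = total_budget_s - (clock() - start)
        per_problem = max(remaining_time, 0.0) / remaining_problems
        problem_deadline = clock() + per_problem
        samples = []
        for _ in range(n_reps):
            if clock() >= problem_deadline:
                break
            samples.append(sample_once(problem))
        results.append(samples)
    return results


print(per_question, round(per_rep, 1))
```

A check like this only runs between repetitions, so one very long generation can still overrun. Capping the tokens generated per sample would also be needed, which may be why a TIME_LIMIT inside the loop is not enough on its own.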

silk sandal
#

People reported that changing the seed leads to a different score. But I have reasons to believe that even with the same seed (submitting the same notebook twice) one can get a different score. It happened to me. Did this happen to you?

rare sandal
#

+- 1 for me

#

but I would be scared also...my best seed for validation is not the same as my best seed for public LB

#

also when comparing different AIME sets...all different optimal seed 🥴

#

In other words, I tested 3 seeds on public LB, AIME old, AIME new datasets

The first seed is best for public LB
The second seed is best for AIME old
The third seed is best for AIME new
and the gap is not small

won't give more details until the comp finishes 😅

marble dagger
floral sentinel
#

wdym by decode strategy?

marble dagger
#

these part will skip top several loop except first

#

and about decode the right answer from text or code_output . I think the code output is more convincing, answer from text should be dropped . But you take them both

floral sentinel
#

about code output and text output

#

sometimes text output gives right answer

#

that's why people consider taking both answers

floral sentinel
#

to prevent that, he added skipping thingy that u see

marble dagger
#

for this function, if the answer was wrapped in \boxed{}, the answer will be counted twice. It makes no sense
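
A minimal sketch of one way to avoid the double count, assuming each sample yields a text completion and an optional code output. The function names, the BOXED regex, and the % 1000 step are assumptions for illustration, not the public notebook's actual code.

```python
import re
from collections import Counter

BOXED = re.compile(r"\\boxed\{(\d+)\}")


def candidates_from_sample(text, code_output=None):
    """Return the distinct answers one sample supports (each counted once)."""
    found = set()
    boxed = BOXED.findall(text)
    if boxed:
        found.add(int(boxed[-1]) % 1000)
    if code_output is not None:
        try:
            found.add(int(round(float(code_output))) % 1000)
        except (TypeError, ValueError, OverflowError):
            pass  # unparsable code output contributes no vote
    return found


def majority_vote(samples, default=0):
    """Self-consistency vote over (text, code_output) pairs."""
    votes = Counter()
    for text, code_output in samples:
        # A set per sample means text and code agreeing is still one vote.
        votes.update(candidates_from_sample(text, code_output))
    if not votes:
        return default
    return votes.most_common(1)[0][0]
```

Whether agreement between the text and the code should count as one vote or two is a design choice. This sketch treats it as one so that a single sample cannot outvote two independent ones.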

floral sentinel
marble dagger
floral sentinel
#

bro, the model itself is not reliable, let alone the answer xD

marble dagger
marble dagger
#

to calculate

rare sandal
#

Great work holyfuck

serene fiber
floral sentinel
serene fiber
serene fiber
#

Obviously, a lot of ways to fix that better not to discuss it right now 😅😅

floral sentinel
#

i think it's perfect time to discuss that

#

someone got 28 score

#

wow it took god damn 1 month and a half to get +1

#

bruuhu

serene fiber
serene fiber
#

24th on LB right now

floral sentinel
rare sandal
rare sandal
silk sandal
#

By the way, did anyone try using two llms, like deepseek-math 7b-rl+deepseek-coder2b?

#

I could not get it running

serene fiber
silk sandal
#

Hi, thanks for your reply. If I get it right, no more idea sharing here until the competition ends?

serene fiber
dusky narwhal
#

e.g. when models are evaluated on a single sample (maj@1 aka pass@1) usually greedy decoding is used (temperature 0)
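
A small sketch of what temperature and top-k do at sampling time, written in plain Python rather than with any particular inference library. The function sample_token and its parameters are illustrative assumptions.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick an index from `logits` using temperature scaling and optional top-k."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k best candidates
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax
    return rng.choices(ranked, weights=weights, k=1)[0]


print(sample_token([1.0, 3.0, 2.0], temperature=0))
```

With temperature 0 every run returns the same token, which is why single-sample evaluation tends to use it. Majority voting over many samples needs a temperature above 0 so that the samples differ.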

#

unless you're actually in the running for a prize (or gold medal, I suppose), I don't see why you wouldn't share anything

serene fiber
rare sandal
#

Maybe even 20

#

with some luck obviously

serene fiber
rare sandal
#

Mine is very dependent on the seed, sadly

#

Stable across same seed, but if I change the seed it drops a lot

serene fiber
#

Kinda hard to comment, could get lucky as well.

Btw one of our submissions is with that hope of luck.

#

My intuition says it should be fine as the code is already executed for all 100 questions, and as you got lucky in those 50, incase of same distribution the remaining 50 should also give similar results.

If it was Kaggle running the code again, then it would have been a big red flag

vital cave
#

The final subs you select will run over the new 50 only, so two of your submissions will only re-run once.
Shake is coming. I hope I can keep a good rank in the end 😅

serene fiber
vital cave
#

"Because of the limited number of problems available, we are taking special precautions to secure the test set against probing. Among other things, during the submission period the test set will comprise only the 50 public set problems. Once the competition ends, when we rerun submissions, the test set will comprise only the 50 private set problems. You should attempt to make sure your submission will complete successfully on the 50 new private set problems. This may mean ensuring your submission is robust to unexpected inputs, or managing runtime and memory usage."
This from data section

serene fiber
#

Oh ok posted from the competition's page?

vital cave
#

My biggest biggest fear is that for some reason my subs fail (hit errors that I didn't catch during validation or public subs)

vital cave
serene fiber
#

Oh ok my bad, I didn't read that, then yea seems like private lb will be fun

vital cave
#

yes a scary movie is coming within 7 days 😄 , I can handle any score drop - I am preparing myself for this , but I can't handle failed subs ! it will be a total waste!

serene fiber
vital cave
#

doesn't matter , my fear comes from this line in the section I shared : "This may mean ensuring your submission is robust to unexpected inputs" what unexpected inputs 😄

serene fiber
#

Shouldn't be an issue, but timeout can be, it completely depends on what sort of code you wrote

vital cave
#

I did face rare timeouts that i can't yet explain.
My advice (I hope I stick with it - and stop trying new ideas) is that I should focus on the subs I want to select and run them again and again to make sure they are stable

dusky narwhal
#

BTW I just noticed you're 2nd in the ARC contest! Wow

#

that you even have time to work on both!

#

how much work was it to get to that score?

#

and yeah, I'm very paranoid about errors on the private LB; I'm going to try to make my 2nd submission quite different and rather conservative

vital cave
#

Ya I suffered timeouts 😅 My current (in mind) 2 subs work fine with different validation questions but I am still afraid of unexpected inputs or errors that might kill the code in private, If the private was calculated in parallel with public and we see 50% of the score then it would have been much better (no sudden errors !)
I have no expectations currently regarding my final rank ! I just hope to stay in gold ! I need the solo gold 🙂

Regarding ARC I like such competitions, the math one, ARC, CTF, and Santa problems are my favorites.
Regarding my current score ARC 26.5 it took me about a week. I have some ideas to try. BUT no way to reach the guys at 38 ! They are professional ARC developers (if that can be a job title!)

dusky narwhal
#

well, you have half a year to come up with and implement ideas! Good ideas to tackle such hard problems are very rare

#

I've heard of ARC before but I'm not familiar with it, and I only just saw ARCathon mentioned. I suppose that's what you're referring to

#

I haven't looked at it much because I'm too busy with AIMO! I'm attempting to write an entirely new solution to AIMO in the next week! Yesterday I put together the first pieces of code and managed to solve the first problems. Haven't submitted yet, but I think it'd score about 4/50 :P

#

also yesterday I finally got a score of 23; it only took me a month to match the score of the top public notebook, arghh! What a massive waste of time

vital cave
#

Try to test the stability of the 23 ! if you ran the code again will you get 23 ? I think stability is very important currently.

#

I have 2 subs with 26 (from "somehow" different approaches) and 4 subs with 25 and some good 24s and 23s .... I can't say my 26 and 25 are stable, but the avg score for those notebooks is 24. I also have some code that turns into a roller coaster when rerun, from 19 to 24 to 25 even ... I somehow know why but I can't say much now

dusky narwhal
#

I haven't yet but will. However I don't have many submissions left and will need to use most for my new solution

vital cave
#

Good luck! never give up! ... who knows ... you might see higher scores in private
My stable 24 might be a stable 4 in private or an error 😄 ....

dusky narwhal
#

in theory, the LLM could write a script that runs killall python

#

more realistically, it could fork bomb; I did see it try to use multiprocessing after I informed it its solution was too slow

vital cave
#

hhhh well the private will be fun to see ... I just hope hard work pay off and not a lucky hit (which I am afraid will put some people in good ranks with maybe public notebooks)

#

Imagine I enter the competition , sorted notebooks by score, ran the notebook got luck with 23 or 24 ... then same luck happens in private and with more luck to put me on top ... wow

dusky narwhal
#

well, the score gap between the top team vs what you could get by submitting the public notebook 50 times is just 5 points

vital cave
#

I am pretty sure that you may even get 24 with the public notebook

dusky narwhal
#

yes, if it can get 23 probably it could get 24 very rarely

#

I got 22 with it the very first time I set the timelimit to 8.5 hours, with minimal changes

#

and then took weeks to get another 22

#

it turns out that almost all my submissions during the contest have had some major bug or other

vital cave
#

My 4th rerun (just finished) of my first 25 got me 23 , so its (25, 24, 24, 23) , I have -+2 variance I guess , I hope to keep it this way
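
A quick sketch of the spread in those four reruns, plus a back-of-envelope figure for a fresh 50-question set. The binomial part assumes each question is solved independently with the same probability, which is a strong simplification and only meant as a rough yardstick.

```python
import math
from statistics import mean, pstdev

scores = [25, 24, 24, 23]  # the four reruns quoted above

avg = mean(scores)
spread = pstdev(scores)    # rerun-to-rerun spread on the SAME 50 questions

# Assumption: a new set of 50 questions, each solved with probability avg/50.
n = 50
p = avg / n
binomial_sd = math.sqrt(n * p * (1 - p))

print(avg, round(spread, 2), round(binomial_sd, 2))
```

Under that assumption the noise from drawing a different set of questions would be several times larger than the rerun noise, which fits the worry in this chat about a private LB shake-up.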

rare sandal
#

Lmao I have been stuck at 22 since the first week, but that was a lucky 22. Cleaning up the code dropped the score by so much and I had to slowly climb back up to a stable 22 holyfuck only to realise it once again drops to 18 with a change in the seed 🤣

#

I’m not sure if the old AIME cv is correlating because my latest stable 21/22 public scores 27/50 there as well 😐 [with a DIFFERENT seed], but as usual the AIME new questions it just sucks, out of 61 I got just 9 points 😓

vital cave
# serene fiber For GM?

I need gold in general for competition Master level, but a solo will help for later of course (GM) so I am trying to hit two goals at once. I was so close to get it in "LLM - Detect AI Generated Text" competition, if I only selected the correct sub 😦 (I finished 2nd public board, 21 "silver" private board , where I was able to secure 6th place 😅 if I was wiser)

serene fiber
#

Anyways check your DM please

spare cypress
dusky narwhal
#

I didn't realise there are actually two notebooks with a score of 23

#

I believe they're virtually identical though

#

I notice a very amusing pattern on the scoreboard (when I looked a few hours ago), the number of teams with a score of at least N goes:
28: 1
27: 4
26: 8
25: 16
24: 31
23: 76
22: 166
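
A small sketch that turns the "at least N" counts into "exactly N" counts and step-to-step ratios. Presumably the amusing pattern is that the cumulative count roughly doubles with each point lost. The numbers are copied from the list above; the variable names are illustrative.

```python
at_least = {28: 1, 27: 4, 26: 8, 25: 16, 24: 31, 23: 76, 22: 166}

scores = sorted(at_least, reverse=True)

# Teams with exactly N = teams with >= N minus teams with >= N + 1.
exactly = {}
previous = 0
for score in scores:
    exactly[score] = at_least[score] - previous
    previous = at_least[score]

# Growth factor of the cumulative count for each point lower.
ratios = {
    low: at_least[low] / at_least[high]
    for high, low in zip(scores, scores[1:])
}

for score in scores:
    print(score, at_least[score], exactly[score], ratios.get(score))
```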

dusky narwhal
#

oh BTW, I see everyone using AIME for validation but it doesn't match the composition of the LB at all, which has a much wider range of difficulties. I meant to mention this on the forums

#

so if you think your submission score is stable, to a large degree it's actually because there aren't that many problems of the edge of what it can solve

celest bough
#

can't join anymore flameseyeroll

rare sandal
rare sandal
#

I don’t even know what I can trust now

#

Haven’t seen anything like this in previous competitions

floral sentinel
#

no seriously wtf is this

rare sandal
#

My predictions for the private LB gold zone:

27 27 26 24 22 21 20 19 19 18 17 17

Cutoff for gold: 17
Cutoff for silver: 13
Cutoff for bronze: 12

I may be wrong, this is my gut feeling based on my validation tests and the data description

celest bough
#

Why can't we join this comp anymore

rare sandal
celest bough
floral sentinel
celest bough
floral sentinel
celest bough
#

Still only troll answer?

floral sentinel
#

and who is the one trolling here with silly questions?

celest bough
#

Someone wants to participate but is kept from participating due to unexplained rules. Question is good. Answer is not

floral sentinel
#

wdym by "unexplained rules", the guy told you the reason

#

there is a deadline for joining competitions

#

what part of it you didn't understand?

#

the competition started around 3 months ago

celest bough
#

What part of my question are you struggling with?

vital cave
#

There are two deadlines: a deadline to enter the competition, which has passed, and a deadline to submit (4 days left)

celest bough
vital cave
#

Ok, does this explain why you can’t join ?

serene fiber
#

Wondering why don't folks just ignore it, the discussion is already self explanatory

floral sentinel
#

He is trolling, ignore him

serene fiber
celest bough
#

I don't get why you don't try and be a good person instead of a troll

valid shoal
#

Exactly when is the last submission time?

rare sandal
#

I'm getting only 24 CV with the setting that gave 22 on LB, but if I change the seed I get 27 CV and 18 LB. also lowering temperature doesn't worsen my CV, only +- 1

#

I guess we will know in a week...but terrible CV/LB correlation I'm not confident in our submissions 😐

#

In all the comps I have participated this is the one I'm least confident of even getting a medal out of it

serene fiber
summer bolt
#

no one mentioning rag in discussion, does that mean it didn't work or worked too well?

silk sandal
#

Is it possible that the test set is already online somewhere?

silk sandal
#

By the way, do you think it is possible some submissions throw an exception on the private test set? I thought a submission would be marked successful if it was able to run on both public and private test sets

floral sentinel
#

I got 10 last submission about 12 hours ago...

summer bolt
#

answer of similar problem from aops, same contest, similar contests, apparently not exact answer of the exact problem, but the reasoning steps or key ideas can be re-used for the main problem

floral sentinel
#

so stacking questions and answers will lead to high token count

summer bolt
#

yea so few things can be tried: summarize/shorten the solution, or use longer context model

#

3 shorten examples + prompts might take around 900-1.5k tokens, can be shorter depends on how compressed the summary is, anyway deepseek doesn't utilize those very well (or my implementation was bad)

vital cave
floral sentinel
floral sentinel
#

it threw error exception?

#

but the drop from 25 to 20 tho...

vital cave
#

it took it 5 runs to go wild ! No ... the notebook finish usually around 7 hrs ... say its now around 8hrs due to latency in showing the score !

#

Its punishing me for repeating it 😄

floral sentinel
#

the private leaderboard will be lit 🔥 🔥

#

imagine it throws error exception on all the submissions 😄

#

the private leaderboard

#

no winners lmfao

#

3 months will go to waste

vital cave
#

hhhhhh oh god ... the selection of the final 2 subs is the hardest thing to do now

floral sentinel
#

the highest scores

vital cave
#

No you can select which two you want (before deadline) if you don't then Kaggle will select (usually highest two)

floral sentinel
vital cave
#

because I am looking for the stable ones ... which I don't know now

floral sentinel
#

ooh...

vital cave
#

I have three (somehow) approaches and all score up to 26 ... and another approach that is stable around 24 (so far)

#

6 subs left ... one should use them wisely !
But in all cases the private leaderboard is going to be crazy .... I dropped my prediction about winning a gold medal now

#

By the way results on colab (regardless the questions) are perfectly stable ! (I mean the repetition of any exp gives same score ) rare times I faced +-1 variance
I use L4 on colab

bright lotus
#

this competition only gamble

#

🥹

rare sandal
#

my 22 notebook turns into 14 with only slight change…😱 I was so surprised by the score. Usually these type of small change still can get 20-21

rare sandal
#

It’s always 24-25 points on old AIME,and 10-11 points on new AIME

#

Doing things that make sense improve validation and reduce LB...

#

Haven't seen a competition like this 🤣

serene fiber
valid shoal
#

wtf my code tried to execute this

silk sandal
rare sandal
#

For me, I randomized the order and restarted the notebook

summer bolt
#

curious what people are using for their experiment, this is my first kaggle competition so I started out maximizing kaggle gpu hours, then moved to runpod because it looks cheaper than colab

celest bough
floral sentinel
#
config = transformers.AutoConfig.from_pretrained(model_path)
config.gradient_checkpointing = True

What does gradient_checkpointing really do?

bright lotus
floral sentinel
proper ferry
#

The ddl is 28 0:00 or 29 0:00?

floral sentinel
floral sentinel
#

did my two last submissions, let's see how it goes

#

goodluck to everyone

vital cave
#

Me too, I will finalize my selections once my last sub finishes, Good luck everyone.

floral sentinel
#

bruh the second phase will start in 2 months...

#

so soon

rare sandal
# floral sentinel

Yeah I probably need to build my own solution next time rather then using open source models

#

I didn't have time this month to explore on how to do that..
Bet some top teams did try research paper worthy things...let's see tomorrow

floral sentinel
#

3 months ain't enough for researching and discovering new things

rare sandal
#

I doubt one can reach 29 without fine tuning, I may be wrong

rare sandal
floral sentinel
#

for 1 month and a half we've been stuck in 27 score

#

which really shows that nobody created anything new

#

everyone was using deepseek and tryna prompt it correctly to get it to work

rare sandal
#

I don’t think the 28 and 29 are achieved without fine tuning, and my guess is they will be stable on private LB too

floral sentinel
rare sandal
#

well let’s see tomorrow, I don’t want to jump into conclusions now

floral sentinel
rare sandal
floral sentinel
#

some said it's in a deep local minima

#

which made the results worse by fine tuning

floral sentinel
vital cave
#

Also this means that you might overfit more !

rare sandal
#

haha when I fix 1 problem I break 2 problems

vital cave
#

How many outputs you generate for each problem ?

rare sandal
#

seed set at 42

vital cave
#

I generate between 100 and 120 based on time

#

I selected a notebook that generates around 100 at most, since it was my first 25 on the LB, and the added value of an earlier submission pushed me to select it

serene fiber
#

I am looking for a team for season 2 of this competition. Given that it is officially ended and we can do 100 submissions per day, I believe this is the perfect time for exploration. Kindly let me know if anyone's interested.

PS: I would prefer someone who has actively participated in this season

rare sandal
#

How do you generate 100 tries ?

#

vLLM Async Engine ?

vital cave
#

vLLM and 2 T4

rare sandal
#

Even with vLLM, my speed isn’t so fast to run 100 tries on a problem

#

You need AsyncEngine I think

#

Transformers = 13 token/s, vLLM: 28 token/s

vital cave
#

If I reached money prizes I will share more regarding this.

But within 9hrs I think you can reach more maybe up to 140

#

But the risk of timeout is always there

#

Early discussions and notebooks on this helped me a lot

rare sandal
#

My submission definitely has a risk of timeout also

#

But my score isn’t good so I risked it

#

I didn’t end up selecting the 22 that drops with every small change that I made. My opinion is it overfitted public LB from past experience

vital cave
#

My sec one do, I kinda gambled with the sec one, based on my analysis it should be the most stable one but with a risk

serene fiber
vital cave
#

Ya in one of the approaches I set around 450 sec per question

rare sandal
#

Also vLLM didn’t improve my score and just lowered the floor that I could get, ended up not selecting it

serene fiber
vital cave
#

Low temps stablize around 20 to 22 …

serene fiber
#

Oh thats good my stable was between 18-21

rare sandal
serene fiber
#

btw till when will final lb get declared?

serene fiber
vital cave
#

I have hopes for my second sub which I scored yesterday ( 26) but Because of the first comes first ranked I had to select the first 25 ( with variance of 2 most of the time) also

rare sandal
#

Submitted on May 25 and June 1

vital cave
serene fiber
#

Oh ok ok

rare sandal
#

But I have no confidence of the private result 😓

vital cave
#

My selected subs are one from 2 months ago and one from yesterday

serene fiber
#

Anyone of you in for season 2 of this competition?

vital cave
serene fiber
rare sandal
#

In the last week of the comp I tried some “out of the box” ideas, but they didn’t have a significantly better CV, I ended up not choosing them. Even though public 18-21

In any other comp 99% of the time I will choose this as 1 sub

vital cave
#

Also a lot, and I mean a lot, of experienced grandmasters avoided this phase (maybe due to the 23-limit)

rare sandal
#

tiebreaker is a menace

#

😞

serene fiber
#

you mean public lb score?

vital cave
#

23-feb

serene fiber
#

23 Feb didn't get you

rare sandal
#

Yeah that one holyfuck

vital cave
#

The rule that forbids you from using anything released after 23-Feb

serene fiber
#

Oh ok

vital cave
#

Second phase will have mostly a rule for 1-sep 🙂

rare sandal
#

I hope not, and the rule is 1-Mar-2025 instead

#

Since the comp ends 1-May-2025

vital cave
#

I guess its maybe going to be end of year 2024 rule to give everyone a chance to test and finetune

rare sandal
#

You only need 1 day to fork an open source model, try it on the comp dataset and submit

#

I learnt a lot on the behaviour of LLMs on math problems from this comp though 😄

serene fiber
#

btw anyone of you tried finetuning?

rare sandal
vital cave
#

I did finetune reached up to 15 I guess but dropped it

rare sandal
#

Me and my teammate tried 6 different RAG strategies

serene fiber
#

data for RAG?