#ai-mathematical-olympiad-prize
This should help boost performance through prompting. I’m excited to see how people can push the performance of LLMs in math.
A few weeks ago, I downloaded and parsed the AIME / AMC12 problems from AoPS. I have already shared them on Lean forum, so let me share them here too: http://olsak.net/mirek/aimo-datasets/
I have published https://www.kaggle.com/code/huikang/code-interpreter-baseline
lol, by dumb luck my copy of your kernel (without changes) scored 6
The current best score is a public notebook
https://www.kaggle.com/code/olyatsimboy/aimo-zero-shot-sc-mmos-deepseekmath
silly question here..but this competition is effectively only for LLMs right
new to kaggle and programming, so didn't realize this was for local models only. spent half a day (literally 12+ hours playing gpt4 and claude 3 models)
You can do anything that is offline, which includes executing code
Today I learned: Kaggle CPU-only notebooks run on an Ubuntu-20.04 image, while GPU T4 x 2 notebooks run on an Ubuntu-22.04 image.
I will greatly appreciate if someone could produce and benchmark performance of various LLMs
Hello, sorry if this is off-topic. I created a question on Manifold Markets so you can bet play money on the winning score: https://manifold.markets/EmanuelR/how-many-problems-will-be-solved-by?r=RW1hbnVlbFI The market determines the probability distribution of the winner's score.
This is the leaderboard https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/leaderboard
The competition consists of creating an AI that runs on a maximum of 9 hours on 2 Tesla T4 or 1 P100 GPU's, and solves a hidden list of 50 math problems with an answer between 0 and 999, anything that runs on the Kaggle VM that doesn't use the...
weird but whenever I add "think step by step" to my prompts the performance usually goes down 😅.
Also, somehow all explanations for correct answers are around 400 words. Often when it goes over that, the answer goes wildly wrong. GPT-4 (via ChatGPT) recovered from some of those, but none of the local model runs have done that.
Can we install and run LlamaIndex in the notebook?
I guess you can use anything that is available for the public on Kaggle
❗️New External Dataset Alert - I have parsed 8.8k solutions for 3.7k problems from ArtOfProblemSolving website and it’s now publicly available: https://www.kaggle.com/datasets/alexryzhkov/amio-parsed-art-of-problem-solving-website
Stay tuned for further updates to the dataset, as I have processed only half of the parsed data (the part with answers in \boxed). The other part needs more careful or even manual preprocessing.
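For anyone cleaning solution text like this, here is a minimal sketch of pulling the final answer out of `\boxed{...}`. The name `extract_boxed` is made up for illustration and is not taken from the dataset's own code. The regex deliberately skips nested braces such as `\boxed{\frac{1}{2}}`, which is the kind of case that would need the more careful or manual pass mentioned above.

```python
import re

# Matches \boxed{...} only when the content has no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_boxed(solution_text):
    """Return the content of the last simple \\boxed{...}, or None if there is none."""
    matches = BOXED.findall(solution_text)
    return matches[-1].strip() if matches else None

print(extract_boxed(r"So the sum is $\boxed{042}$."))
print(extract_boxed("no final answer here"))
```

Taking the last match is a judgment call: solutions sometimes box intermediate results before the final one.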
how to start it , I am new here
copy a baseline and run all
Okay I'll try
gl
why is the current cap 17/50? does it have to do with transformers or prompting?
We're talking about one of the highest levels of math olympiad
It's meant to be tough
Can we use a mixture between nlp and ATP or we should use just NLP
let me rephrase that. There was a notebook that scored 13/50 a while back and i was wondering how the score has jumped by 4 since then. Whether it be something specific or maybe it was just that more fine tuned models came out since then
Oh they first used like mistral or similar
Then deepseek math 7b instruct made a jump
Then self consistency made another jump
Can get 15-16 rn myself
Someone made 18
ohhhhh gatcha, i have the deepseek math bookmarked for reading. ill get on that. also do you know what made the 17-18 jump possible?
wow yeah they did it today
i guess my other question would be gemma 2b and it's applications in models vs transformers
I have a basic question about code competitions. What models are we allowed to use exactly? Is it basically the list that shows up if I click on "Add input" in the right hand side bar on the edit code page? Also, how do I check if the model was available before Feb 24th or not?
afaik weights for mmos deepseek math were released feb 28. isn't that disqualified? https://pan.quark.cn/s/b939a0510658#/list/share/009acba8182d4a26a7bc44bd2a46933f-MMOS*101DeepSeekMath 7B
on huggingface they were released feb-5
https://huggingface.co/deepseek-ai/deepseek-math-7b-rl/tree/main
The safetensors were published 25 days ago , but I don't think this causes disqualification because the original weights were there on Feb-5
Doesn't look like it, considering that they are being used all over
Also you can use any pretrained model, just no internet access right?
the rules state it had to have been available before feb 23, 2024
Oh
So no fine tuning?
that's a grey area, not sure
how much time does it take to see submitted results?
30 mins to an hour
that's the base model not the mmos finetune which was released feb 28
mmos further finetuned on top of deepseek rl
Oh Sorry, yes this is disqualified.
If you are referring to the public notebooks (13+ score) they use deepseek-ai/deepseek-math-7b-rl
However, there is another problem here :/
check: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/493554
Solve national-level math challenges using artificial intelligence models
I have a background in Maths, mostly developing content, and have participated in these olympiads; it's easier for me to solve problems than to have LLM do it.
Anyway, I have created some basic models in the past, but this is entirely new to me.
Anyone looking to team up, I can contribute vastly to types of problems, datasets for the same and tweaking things.
Anyone aware of wizardmath model ??
if you look at the top 5 scoring public notebooks they all use the mmos finetune, not deepseek-ai/deepseek-math-7b-rl
It is mentioned only in the title, as far as I can understand from the attached comment.
mmos version is not on huggingface as far as I can tell.
In any case, I asked and am waiting for a reply.
interesting
i think the comment author is mistaken. it's unclear now if the weights are the original model or mmos finetune, but it clearly isn't directly from hf.
if you look on hf, the original model is split into 2 safetensors files while the author's upload is split into 3 files. either the author did the safetensors conversion themselves or they are using a different model and forgot to change the name
I've noticed and I asked for clarification
The Author confirmed the use of deepseek-math-7b-rl.
I also uploaded my own version (directly from hugging face) and got exact log for both.
okay awesome thanks
You can use this if you want: https://www.kaggle.com/models/yashbhalgat/deepseek-math-7b-rl
This is the original deepseek-math-7b-rl from HuggingFace (not the MMOS model)
From HuggingFace: https://huggingface.co/deepseek-ai/deepseek-math-7b-rl
btw all the top notebooks seem to be running self-consistency with only 8 samples whereas the original deepseek-math paper reports 64 samples. I wonder how many samples can one squeeze into the 9 hr time limit
vLLM would definitely give you more leeway. It's probably 10x faster than transformers, and most notebooks take around 2000 sec, so you would be fine
Hi,
I have a question around the problem id 246d26
Each of the three-digits numbers 111 to $999...
where the suggested correct answer should be 250.
But after working on it, my model gave me a better solution, 267. Since that is larger, I guess 267 should be the correct answer. Here is the analysis from my LLM:
The solution using modular arithmetic appears optimal based on the problem's constraints:
Selection Criteria: Yellow numbers were selected based on their last digits under modulo 10 arithmetic. The set of valid last digits chosen was {1, 3, 7}. This choice ensures that the sums of any two yellow numbers do not have a last digit that falls back into this set, fulfilling the requirement that their sums must be blue.
Verification of Sums: The calculations confirmed that no sums of two yellow numbers end in the digits 1, 3, or 7. Therefore, all such sums must be blue numbers as required by the problem's constraints.
Outcome:
Number of Yellow Numbers: 267, which is maximized under the chosen method.
Sums Compliance: There were zero invalid sums, indicating full compliance with the problem's condition.
This approach effectively uses a simple modular arithmetic concept to partition the set of numbers from 111 to 999 into yellow and blue categories such that the sum of any two yellow numbers always results in a blue number. It maximizes the number of yellow numbers while ensuring compliance with the given sum condition.
The analysis from my model seems correct to me `267 = (100 - 11) * 3`. But always good to have an additional brain on it 😄 .
Would it be possible to double-check my result with a provider of the problem?
Thanks 🙂
Hi, this ignores that even if the last digit doesn't fall in {1,3,7}, the sum can be outside [111,999]. eg: 537
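The objection can be checked by brute force. This is a small sketch, assuming the rule as paraphrased in this thread: numbers 111 to 999 are coloured, and the sum of any two yellow numbers (not necessarily different) must itself be a blue number, so it has to land inside [111, 999]. The helper name `bad_pairs` is made up for illustration.

```python
LOW, HIGH = 111, 999

def bad_pairs(yellow):
    """Pairs (a, b) of yellow numbers, a <= b, whose sum is not a blue number."""
    yellow_set = set(yellow)
    ordered = sorted(yellow_set)
    bad = []
    for i, a in enumerate(ordered):
        for b in ordered[i:]:
            s = a + b
            # A blue number must lie in [LOW, HIGH] and must not be yellow.
            if not (LOW <= s <= HIGH) or s in yellow_set:
                bad.append((a, b))
    return bad

# The colouring proposed by the model: yellow = last digit in {1, 3, 7}.
last_digit_yellow = [n for n in range(LOW, HIGH + 1) if n % 10 in (1, 3, 7)]
bad = bad_pairs(last_digit_yellow)

print(len(last_digit_yellow))           # 267
print((537, 537) in bad)                # True: 537 + 537 = 1074 is out of range
print(all(a + b > HIGH for a, b in bad))  # True: every failure is an overflow
```

So the last-digit argument itself holds, but the set only has 267 elements because it ignores sums that leave the range.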
I didn't want to go too deep into analysing this, but I was wondering if using another base could also get a better result. I just feel the answer looked wrong anyway. 🙂 . Was quite happy my model figured this out actually 😄
Tbh, this is the only problem in the public 10 for which I don't have a plausible standardized solution.
But one correct line of logic is
Max (Y) < 999/2.
=> 500 is min Blue such that it is higher than any yellow.
=> Min (Y) = 250.
So 250 to 499 are all Yellows.
It is an interesting problem for sure, love it. I'll let the competition host answer this one. I think it needs a peer review for sure.
What I meant by standardized solution here is that I can't seem to recall a problem-solving strategy that forces you to count backwards from 499 instead of starting from 111.
For other problems example: in problem 1 (sum of squares of distances) -> the key insight is to use the sum and product of roots of a quadratic because difference of roots is given. or say in problem 3 (sparkle) the key insight is sum of digits < 3 which then automatically brings up combinatorics as a valid strategy.
Would it violate contest rules for me to have a math friend help? I have a close friend who is a longtime MO competitor, and I passed along this question to him
By “help” I mean try to give his answer to the question
not at all @haughty mountain . That is the beauty of Mathematics. Share the love 🙂
I am an organizer from a different contest so not the authority that my name color might suggest
I have successfully passed along the question but not received an answer yet...
@normal eagle I think you are close, then maybe adding the fact that sum of any two odd numbers is even you can remove even yellows which is roughly 250. Now need to polish the edges. 😉
Nope, can't go by the even yellows logic. Because the distribution with Max Yellows is this: [111,249] blue, [250,499] yellow, [500,999] blue. The thing that's bothering me is how one arrives at the insight "max distribution of yellow is a continuous set." Otherwise we (and the model) keep trying to solve it using the box principle.
Well could you produce a set of 267 numbers?
Actually, I found the gap in my solution (which was making it seem random). It's not a single possible distribution.
N = [111,999] = dY + dB + mY + mB (d = definitely, m = maybe)
So you calculate the dB and dY sets in parts by elimination from N. Then apply the box principle/conditional sorting to mY and mB.
How to get dB: if i is in N but 2*i is not in N, i is in dB.
dB = [500,999], eliminate this from N
How to get dY: if i + N is in dB, i is in dY,
which means [389,499] is dY.
Now for [111,388], if any number is added to mY or mB, one corresponding number has to be added to the other set. => 278/2 = 139 maybe-yellows.
Total yellows are 111+139 = 250.
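A quick sanity check of the numbers in this argument, as a sketch only. It confirms that the interval sizes add up to 250 and that the colouring [250,499] yellow is valid under the rule as paraphrased in this thread. It does not prove that 250 is the maximum.

```python
LOW, HIGH = 111, 999

# Interval sizes used in the counting argument above.
definitely_yellow = len(range(389, 500))   # [389, 499]
undecided = len(range(111, 389))           # [111, 388]
maybe_yellow = undecided // 2
print(definitely_yellow, undecided, maybe_yellow, definitely_yellow + maybe_yellow)

# The colouring with [250, 499] yellow and everything else blue.
yellow = set(range(250, 500))
blue = set(range(LOW, HIGH + 1)) - yellow
valid = all((a + b) in blue for a in yellow for b in yellow)
print(len(yellow), valid)
```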
even in the last question, the simple insight that f(n) is linear = an + b makes it solvable using SymPy.
What is the name of this theorem?
can't remember a specific name but the degree of composite functions is the product of degrees of individual functions.
You don't have to prove this with any theorem. In this kind of math competition you just need an educated guess: if you find one such f(n) that satisfies the conditions, you can guess the solution without proving it
ps: in AIMO it might work, but in olympiads you actually have to prove that's the general form for all f(n)s for that case.
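To make the "educated guess" concrete, here is a sketch under a stated assumption: the thread does not quote the problem, so the two conditions below, f(f(f(n))) = 8n - 7 and f(2n) = 2f(n) + 1 with f(100) asked for, are recalled from the public problem set and should be checked against it. The sketch guesses f(n) = a*n + b and searches small integer a, b by brute force. The same equations could be handed to SymPy as suggested above; plain Python is used here only to avoid a dependency.

```python
def fits(a, b, n_max=50):
    """Check the two assumed conditions for f(n) = a*n + b on n = 1..n_max."""
    f = lambda n: a * n + b
    return all(
        f(f(f(n))) == 8 * n - 7 and f(2 * n) == 2 * f(n) + 1
        for n in range(1, n_max + 1)
    )

# Small search window for the coefficients; the window size is arbitrary.
candidates = [(a, b) for a in range(-10, 11) for b in range(-10, 11) if fits(a, b)]
print(candidates)

a, b = candidates[0]
print(a * 100 + b)
```

As noted in the messages above, this only finds a function that fits. It is a guess-and-check, not a proof that every solution is linear.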
yes, but vLLM, for example, is tricky to make work on multiple GPUs and it does not support P100
oh, and it doesn't support TPU either
When I am loading a model with vllm I can load it on one GPU, and it takes around 13.8GB, but then everything crashes when I try to load it on two GPUs.
Is the kaggle TPU option (vm v3-8) allowed for the submissions? I only see people submitting with the P100 and T4 options, as opposed to the kaggle TPU option, but I don't recall seeing anything about this in the rules?
Yes
Any offline inference form is fine
I just dropped 4 places in 1 sec :/ from 30 to 34 :/ I hope there is no private sharing ...
now 35th 😦
well, the more attempts people get, the more the scores go up, because people get luckier
LLAMA3 came out
and it has a 400B+ variant too
if you thought grok was big to run
Seems like the 20/50 prize has been won with a very underwhelming solution to say the least.
What will happen if the current 1st shares their notebook? what happens if there is a current scoring notebook that was submitted say 8 hours ago and scored 20 ? it will directly be 1st.
Kaggle scoring mechanism depends on submission time not execution time, so I still think this is not over (to be fair)
Be the first to publish a public notebook scoring at least 20/50 on the leaderboard before April 22, 2024 11:59PM UTC.
By this I personally understand that the notebook being published first matters only but idk.
Oh alright. A bit weird wording then.
If my understanding is correct, there is time for surprises
But does this also mean that the current no. 1 can wait to make her notebook public until Sunday?
Yes ! unless another user scored 20+ (and before her submission) and decided to share
Hopefully it will not be a linear probed/hardcoded problem numbers solution.
It is, I mean the current publicly shared one is
Oh yeah, I meant the one from the first person on the leaderboard
I think the current 1st is not aware yet of that 😄
When a week ago I reached 18 points at least I thought I added something a little bit useful to the base code, not just probing the near-miss problem_ids
probing was expected , there is 2 months with 60 subs and probing might reach further, and here comes private LB , but winning 10K with probing is .... well ... sad 😦
currently, Natalia Ogneva submitted 7 hours ago, after 2 hours from now if no one is able to become first (20 or 20+ but larger submit time) , then she will be the first to get to 20 ! and for this to be fair she has the right to share her notebook till Sunday
if she did not share the notebook then the current one will be first.
gosh what a smart people
You will not know it before the competition ends.
Now all new notebooks I create are Ubuntu 20.04. Is there a way to force a new notebook to be created with a Ubuntu 22.04 image?
So when will the submissions be open again?
Are llama 3 weights allowed. They were released after the competition start, but the model is listed in the models page on kaggle
Which models page?
the models tab that's on every competition
LLAMA3 cannot be used. Model weights were released on 2024-04-18. The rules state that
Teams may only use AI models and tools that are open source and were released prior to 23 February 2024.
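The date part of that rule is simple to check mechanically. A small sketch follows: the function name is made up, and it covers only the release date, not the open-source requirement.

```python
from datetime import date

CUTOFF = date(2024, 2, 23)  # "released prior to 23 February 2024"

def released_before_cutoff(release_date):
    """True if the release date is strictly before the cutoff."""
    return release_date < CUTOFF

print(released_before_cutoff(date(2024, 4, 18)))  # LLAMA3 weights: False
print(released_before_cutoff(date(2024, 2, 5)))   # deepseek-math on HF: True
```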
The leaderboard has been cleared 😄
submissions seems to have opened again!
is there a notebook explaining the new submissions system yet?
the pinned one is not clear and seems to have issues
for some reason writing submission.csv raises a PermissionError
here is a sample code and its output error
import pandas as pd
import random
sn = []
for i in range(len(test_data)):
    r_num = random.randint(0,999)
    sn.append((test_data[i][0],r_num))
snd = pd.DataFrame(sn, columns=['id', 'answer'])
snd.to_csv("submission.csv", index=False)
this is throwing the error
PermissionError: [Errno 1] Operation not permitted: 'submission.csv'
There is a new API; that's not how you submit your solution any longer
how you guys been running the llms? do you pay for the vertex ai subscription or use private hardware?
wdym by runnning llms?
are you running out of memory when loading in a notebook?
I see, thanks
Exception: You can only iterate over `iter_test()` once.
is there a way to avoid this to test the code for multiple times without requiring to restart the kernel each time ?
Doesnt look like it. What you can do is load test or train set the old way, but when you do submissions you use the new api
I see
yes exactly. I have tried now with the keras version of Gemma2b but it is going extremely slow maybe it is not even working.
I have used llms in the past but i am not sure if for this competition my submission should run on the kaggle notebook or can use something else
I don't know about keras, but using hugging face, you can apply quantization to load the llms efficiently and actually run it
there are couple of notebooks in the code section of the competition showing how people used inference of llms
this is good notebook, but keep in mind that there is a new API for submission
this is another notebook with same implementation except for changing device numbers in device map, and the guy also implemented predict function for submitting on new API
wow thanks for that, I will defo check it out
@floral sentinel how is your submission going?
it's bad honestly
haha same brother. why is that?
i just submitted one of the examples you sent me to see how it goes. its been an hour and it is still executing :/
that's normal, mine took 8 hours
it depends on the variable "n partitions", and because of the new API it answers the questions sequentially: it solves the first question with n partitions, then moves on to the next, and so on, which makes it slow
Cause this is my first time getting into LLMs field, and honestly I don't find myself flexible using them, plus I only recently learned about hugging face, so I am kinda new to all of this stuff.
:/
sadgeeee
so sad
anyone know how to decode bitsandbytes 4-bit weights back into 16-bit weights?
how to download package to import the new API?
It's already installed in the kaggle kernel, just import it
You might have to add the ai-mathematical-olympiad-prize competition dataset first
is that a kaggle kernel btw?
looks like a colab notebook
yep, sorry I created the notebook without adding AIMO competition dataset
it's working now
nice 👍
is that it for submission?
yeah, that should work
@broken vale this is the first time I'm trying to submit in a competition on kaggle. I ran my submission script twice yesterday and got errors running them, so it looks like my submissions were wasted.
I think the reason is that I was using huggingface api to load the model within the notebook, but while submitting it is required to turn off the internet connection in the notebook, maybe that's why my notebook threw error, as it wasn't loading the model via api call.
Just to be clear on this, do I have to download the model first and then add it to notebook so it can be loaded without internet?
You gotta have internet turned off. If you have a finetuned model, you need to upload the weights as a model or dataset to kaggle and add it to your notebook.
got it thanks!
add .values[0] next to test['problem'] in sample_submission['answer']=predict(test['problem'])
cause for some reason it throws error when using only predict(test['problem'])
should be predict(test['problem'].values[0])
What does .values[0] do in this case?
test['problem'] returns <class 'pandas.core.series.Series'>
so test['problem'].values[0] returns the actual problem string
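A tiny illustration of the difference, using a made-up one-row frame in place of what the API hands over (the id and problem text here are invented):

```python
import pandas as pd

# Stand-in for the one-row `test` frame; contents are made up.
test = pd.DataFrame({"id": ["000aaa"], "problem": ["What is 1 + 1?"]})

column = test["problem"]            # a pandas Series, not a string
problem_text = column.values[0]     # first element of the underlying array
same_text = column.iloc[0]          # positional access, equivalent here

print(type(column).__name__)
print(problem_text)
```

`.iloc[0]` is usually the safer habit, since plain `[0]` looks up the index label 0 rather than the first position.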
also add this as an input to your notebook, so you can load the model offline
https://www.kaggle.com/datasets/olyatsimboy/deepseek-math
or any other model u wanna use
oh great, thanks for the heads up
about this, this is the safe tensor version of deepseek-math, which was released after 23rd feb. is it okay to use? I'm still a little confused on this
i believe that is just another format, its the same weights as the original
"So using safetensors weights for DeepSeekMath is definitely in the spirit of the rules and is within the rules as these are not new weights."
yeah I read this thread and have posted a question as well. still confused
Competition host said it's okay
oh yeah about that, the host is okay as long as everything remains open source (code, dataset, model weights). So if someone finetunes the deepseek-math model, they may upload the finetuning data as open source, but they still won't have the pretraining data of deepseek-math. In this case, will they still be eligible for the prize? (where the pretraining data is not open source, only the finetuning data is)
idk about that lol
Yeah I am confused about that. I've posted the question in the discussion forum, just waiting for the reply. I'll let you know about the response too
you too bro
Is there any limit on model size in this competition? Or is it just a limit on GPU run time? Are there any obvious differences between the three GPUs available (T4x2, P100, TPU)?
No limit on model size; the limit is on the date the weights were published. You can't use TPU for submission. P100 is slightly faster but would need a quantized version due to less VRAM, leading to a loss in accuracy.
Thanks a lot!
it might be because of "6227020800", that's a huge number
How are you guys loading models without internet?
by uploading it to kaggle
you got two choices, either as a dataset or as a model
I've uploaded a model from huggingface to kaggle as a dataset, but I'm not able to use AutoTokenizer or AutoModelForCausalLM when I load them from Input
They work just fine when I load them from huggingface
is this the right way of uploading it?
@broken vale
that looks correct, now click the "Copy file path" icon on the model, and use that as the model_path
that's exactly what I'm doing but I'm getting error
what does the error say?
copy the file path of the model folder, not the files inside the folder
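A hedged sketch of what that looks like in code. `model_dir_from_path` and `load_offline` are names made up for this example; the `from_pretrained` arguments are standard transformers options, and `device_map="auto"` additionally needs the accelerate package. The transformers import sits inside the function so the path helper can be tried on its own.

```python
import os

def model_dir_from_path(path):
    """Return the folder to use as model_path.

    If `path` is a file inside the model folder (for example config.json),
    the containing folder is returned instead.
    """
    return path if os.path.isdir(path) else os.path.dirname(path)

def load_offline(path):
    """Load tokenizer and model from local files only (no Hub access)."""
    # Imported lazily so the helper above works without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_dir = model_dir_from_path(path)
    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        device_map="auto",       # spread layers over the available GPUs
        torch_dtype="auto",      # use the dtype stored in the checkpoint
        local_files_only=True,   # fail fast instead of trying the internet
    )
    return tokenizer, model
```

In a notebook, `path` would be whatever "Copy file path" gives for the uploaded model under the Input section.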
wait... it just worked in a new notebook
but I'm confused
the new notebook imports transformers 4.93.3, whereas the old one imports transformers 4.20.1, by default
why so?
maybe the old one is pinned to the original environment
they should be part of the kaggle environment
bitsandbytes is not, so if you are doing nf4 quant you need to add the wheels for bnb and pip install the whl
I see, thanks for the help
np
@broken vale one more thing. when I run this code, a csv file is generated in /kaggle/working (output directory). is this the output file that I have to select while submitting my notebook version?
no, you gotta use the aimo api, no files should be saved
the env.predict step is the submission of that problem
you are using the api i see, everything after the env.predict is not needed
also, test['problem'] returns a series, you need to access the problem with test['problem'][0] to get the problem string
but when submitting. it asks for output file. what is that?
you mean like this?
predict(test['problem'][0], one_shot_context)
okay and what about the output file?
there should be no output file.
use the submit button in the right-hand sidebar of the kaggle notebook
okay I see, the submit button in the notebook shows a different sidebar than the one on the competition page
got it
👍
also, ive had some trouble with the api lately. see https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/506041 and the answer from Chan
I see. I'm uploading now, will see if this error occurs
and also, I was not able to do 5-shot learning in context due to kaggle's limited GPU VRAM. Is this possible in submission too? How do I know that the submission will not run out of CUDA memory?
you dont, if its happening in your notebook it will probably happen in the submission too. it usually does
try using a 4bit quant
or shard the layers on multiple gpus
so submission is using the same kaggle GPU?
if you use 2 T4 gpus with device_map="auto" on a 7B model that should be enough even in float16
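The rough arithmetic behind that, as a sketch. It counts weights only and ignores the KV cache, activations and framework overhead, so real usage is higher. The round figure of 7e9 parameters is an assumption for a "7B" model.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

print(weight_memory_gb(7e9, 16))  # float16: 14.0
print(weight_memory_gb(7e9, 4))   # 4-bit quant: 3.5
```

With each T4 being a 16 GB card, float16 weights leave little headroom on a single GPU, which fits the advice here to either shard across both T4s or use a 4-bit quant.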
yeah
well that's interesting. I wonder how do I figure out what went wrong there
haha, nice. got that for some of my notebooks too
and I have nothing on top of my head how to debug it 🤷♂️
you just didnt get any answers right
run it on the training set in the notebook
youll probably see that you wont get any right
When we run a notebook, the result only shows the script without the notebook and output. Why?
I guess the error will make the notebook invisible right?
because otherwise you could cheat...
Guys, I have written a notebook where I organized the functions into one class from Abdurafae's notebook, here if anyone wants to check
https://www.kaggle.com/code/anrenk/aimo-llm-class
Most forked notebooks had unorganized code structure, and the import libraries were scattered here and there, I cleaned up the code and only left necessary stuff
How to understand the "could cheat". When I run it without error, everything is ok as before.
Thought you meant when submitting your code.
how to upload models from huggingface to kaggle?
The notebook runs successfully but submission failed. What happened and how to fix it?😪
what error did submission give
Is a model that is completely open source (but was released after 23 February 2024) allowed?
Why do some of my notebooks get cancelled before they finish in Olympiad?
as far as I know it's not allowed.
throwing error?
But they seem to have allowed Llama 3 which was released this year
where did you see that? can you provide a source?
that's just a tab that shows models that people used in their notebooks, it doesn't necessarily mean it can be used according to the rules...
Oh 💀
Well I have just posted on discussion
lets hope the hosts answer
found something similar to your question
aw man 😦
what about things like xLSTMs? which are just modifications of LSTMs. Are they allowed?
xLSTM were released only recently, you already thinking about using them xD?
idk about that tbh
i think they only put limits on LLM models
xLSTM is research based i guess
well I thought if most of the older models have been tried, why not try new ones
yeah, xLSTM is a new type of layer/whatever; I'd have to build the entire LLM/architecture if I plan on using it
fair enough
maybe specify that you want to use xLSTM in the post
Good idea
Do the Python modules used for the submission also need to have been released before the 23rd of February, 2024?
Is it possible to find somewhere more training math challenges of the same type as the official ones?
but this one has answers that are not modulo 1000 (positive integer), it has algebraic expression answers
the above 2 need a little bit of data cleaning before you proceed to use them
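One possible shape for that cleaning step, as a sketch. The helper name is made up. It keeps only answers that are plain non-negative integers and reduces them modulo 1000 to match the 0 to 999 answer format, dropping algebraic-expression answers by returning None.

```python
def to_competition_answer(raw):
    """Return an int in 0..999, or None if `raw` is not a plain non-negative integer."""
    text = str(raw).strip()
    if not (text.isascii() and text.isdigit()):
        return None          # expressions, fractions, negatives, empty strings
    return int(text) % 1000

print(to_competition_answer("42"))
print(to_competition_answer("1234"))
print(to_competition_answer(r"\frac{1}{2}"))
```

Whether to reduce large answers modulo 1000 or to drop them entirely is a choice; dropping them keeps the training data closer to problems whose true answer is already in range.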
submission scoring error.
everybody seems to be using deepseek-math, are there any other models that are okayish?
Nothing as performant
honestly that was the only model from before February 23rd that is doing well. I tried others like Mixtral 8x7B and Gemma, and they did terribly in my experience
I saw someone using MathGenie, but people in the comments said that it's a model released after February 23rd
Can the new iter_test API only be initialized and iterated once? And then it dies? Is the only way to reuse it to restart the notebook? What...
did you use predict function after iterating on a question?
cause it doesn't allow you to go next question without using that function to evaluate your answer
Thank you, yes, I used predict. And my test result is: in one notebook, you can only initialize the env for one time, and only can iterate it for one time. The next iteration will get a null response.
can you show me the code?
yes, its true, you can only iterate over it once. you can make your own iterator for the train set and use that
Yes, the only way to reuse the iterate environment is to restart the notebook LOL
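For local testing, the "make your own iterator" idea could look like the sketch below. `LocalEnv` is a hypothetical stand-in written for this example, not the competition API: it uses plain dicts instead of DataFrames and only imitates the two behaviours described in this thread (one pass per env, and a predict call required before the next problem). A fresh instance is created per run, so nothing needs restarting.

```python
class LocalEnv:
    """Hypothetical local stand-in for the competition env (testing only)."""

    def __init__(self, rows):
        self.rows = list(rows)          # each row: {"id": ..., "problem": ...}
        self.predictions = {}
        self._awaiting_predict = False

    def iter_test(self):
        for row in self.rows:
            self._awaiting_predict = True
            yield row, {"id": row["id"], "answer": 0}
            if self._awaiting_predict:
                raise RuntimeError("call predict() before the next problem")

    def predict(self, submission):
        self.predictions[submission["id"]] = submission["answer"]
        self._awaiting_predict = False

def run(rows, solver):
    """Run `solver` over `rows` with a fresh env, so it can be called repeatedly."""
    env = LocalEnv(rows)
    for test, sample_submission in env.iter_test():
        sample_submission["answer"] = solver(test["problem"])
        env.predict(sample_submission)
    return env.predictions
```

The loop body has the same shape as the real one, so the solver code can be developed against the train set and then moved under the actual API for submission.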
is it possible to submit notebook using GPU T4 x2 if my weekly quota for T4 x2 is finished?
yeah it's possible, cuz submission doesn't use ur quota
@floral sentinel I used your notebook for submission as well, but the score I got was 18 not 20. What do you think is causing discrepancy in results?
Yeah, that's what I noticed as well. I submitted multiple times, and each time it gave different results; you can even check the version history of the notebook.
I think it's because of model generation. We use the "self consistency" method, where the model takes in the problem, solves it n_repetitions times, and takes the most repeated answer.
Example
n_repetitions = 7
problem = "{math problem}"  # placeholder for the problem text
answers = []
for _ in range(n_repetitions):
    output_text = model(problem)
    model_answer = process_output(output_text)
    answers.append(model_answer)
and let's say the answers list is [52, 52, 21, 222, -1, 52, 0 ]
from this answers list we see that 52 is the most consistent one, so we take that and submit it
so for my case when I submitted I got [52, 52, 21, 222, -1, 52, 0 ]
but when you submit for example ...you get [9, 9, 52, 52, 9, 0, 22], in which case 9 would be the most consistent one in your case
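The "take the most consistent one" step can be sketched with `collections.Counter`. The function name is made up, and treating negative values as failed parses is an assumption based on the -1 in the example lists.

```python
from collections import Counter

def most_consistent(answers):
    """Majority vote over sampled answers; returns 0 if nothing usable was sampled."""
    valid = [a for a in answers if a >= 0]   # assumption: negatives mark failed parses
    if not valid:
        return 0
    return Counter(valid).most_common(1)[0][0]

print(most_consistent([52, 52, 21, 222, -1, 52, 0]))  # 52
print(most_consistent([9, 9, 52, 52, 9, 0, 22]))      # 9
```

On a tie, `most_common` returns the answer that was seen first, so the sampling order can also change the submitted value.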
another factor would be top_p and temperature parameters
honestly deep-seek model is weird
I see I noticed that too. Maybe it's because of the random selection of next token, that everytime it runs, a different output is produced
also in your notebook the maximum time for notebook is set to 31500, which is equal to 8.75 hours. Is this the maximum time the notebook runs in submission? Have you inferred this, or is this mentioned in the competition or maybe it's something everyone knows? I'm new to competitions so still figuring this out.
Also, how would I know if my submission exhausted all the time allowed?
the reason for 8.75, is because maximum hours of notebook runtime that is allowed is 9 hours... that's why we used 31500 seconds, and in the code itself we put a condition that if it exceeds the limit, then just return 0, which means it will submit 0 as an answer
the notebook runs in both submission and normal running mode
in "normal running" mode people set TIME_LIMIT = 1, basically just run the code quickly as possible so it can move into "submission" mode, where TIME_LIMIT becomes 31500
only "submission" or "private" mode matters for scoring
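A minimal sketch of the time-budget pattern described above. `PRIVATE` is taken from the notebooks being discussed; how it gets set is not shown in this thread, so here it is just a hand-set flag. The `start`, `limit` and `now` parameters are additions for this example so the cutoff can be tried without waiting.

```python
import time

PRIVATE = False                        # assumption: True only in the scored run
TIME_LIMIT = 31500 if PRIVATE else 1   # seconds; 31500 s = 8.75 h
NOTEBOOK_START = time.time()

def answer_within_budget(problem, solver, start=NOTEBOOK_START,
                         limit=TIME_LIMIT, now=time.time):
    """Return the solver's answer mod 1000, or 0 once the time budget is spent."""
    if now() - start > limit:
        return 0                       # out of time: submit 0 for this problem
    return solver(problem) % 1000

# Simulated clock: 5 s elapsed is inside a 10 s budget, 11 s is outside.
print(answer_within_budget("p", lambda p: 1234, start=0, limit=10, now=lambda: 5))
print(answer_within_budget("p", lambda p: 1234, start=0, limit=10, now=lambda: 11))
```

The check runs before each problem, so a single slow problem can still overrun; that is one reason to keep the limit below the full 9 hours.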
Does this mean that in submission, the notebook first runs as not PRIVATE and then runs as PRIVATE?
And how do I check that if my notebook ran completely in less than 9 hours and didn't return answers as 0?
yeah
eeeh i guess the time the submission took
idk how else to check
where does it show time?
after you click submit, 2 things run
one your notebook, other your submission
click on the submission one, and it shows for how long it was running
in my submission TIME_LIMIT = 31500 if PRIVATE else 31500. This means that it will first completely run on train data (inside the 9 hour time limit) and then on test data?
it will run both 31500 for private and normal one
but you don't need to do 31500 for normal one tbh
that one doesn't score
I was just using 31500 for normal to test prediction on training data. Didn't know it would have an effect on the submission as well
besides the score doesn't depend on normal one
bruuuh
yep i understood that
this is my first time participating in competition so this was meant to be lol
but I still don't understand how to check time for the submission notebook
for example
this is my submission running
ok it's been running for 2h now
wait... why the normal one has been running for 2h?
you used 31500 there?
yes... TIME_LIMIT = 31500 if PRIVATE else 31500
that's why
ok i see
but this is the normal one. what about PRIVATE?
or the entire submission?
it doesn't show time
it shows next to it
on the right
been running 2h as well
both start at the same time
but because you used 31500, it will run approx. 8.75 hours
for both
gotta wait 6.75 more hours
the normal one will end before time reaches 31500
nope. it will take same time as the private one
but it only has 10 problems to solve
shouldn't it take 5 times less time?
but okay. when normal one ends, a new version is saved, for which I can see the time in version history
yeah but that time only for normal notebook
sadly the private one doesn't show the run time
yup
yes so 7353 was normal submission time last time i submitted
yep exactly what I was asking
yeah
or maybe there is, but Idk how to check it
exactly what's been bothering me lol. I don't even know if I've got 16 answers correct and 34 wrong, or if I got 16 correct, a few wrong, and a few just ran out of time so returned 0
yeah that's problem with new api
someone probed the LB to get to 20 when they announced early prize
that's why they created this new submission system
I see
maybe i'll just have to check manually then
by the way, while one of my submissions is still scoring, can I submit another one or do I have to wait for it to finish?
normally you could
but you have normal one running which uses GPU
so i guess u gotta wait for normal one to finish
but idk, you can try
try submitting another one
I'll do that after the normal one has finished running. I'm already low on quota so I'll run accordingly
but this time do TIME_LIMIT = 31500 if PRIVATE else 1
it will take 370 seconds for normal one
welcome bro
Do we know where the models get the solution wrong at the moment? Is it 1) at the beginning where it tries to understand the problem and lets say come up with necessary equations; 2) Follow on algebra or equation manipulation; 3) Calculating values; 4) or something else that I am not thinking of here?
OK,thanks👍
re: screenshot - do we still have submission.csv?
no
as far as I know, the new API evaluates the answer directly
no need to create submission.csv
that's weird
can you check whether it used your remaining submissions?
maybe due to some connection error it might've not run
otherwise I don't know the reason
import os
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    PRIVATE = True
else:
    PRIVATE = False
if not PRIVATE:
    import pandas as pd
    class train_env():
        def __init__(self, randomize=False):
            self.randomlize = randomize
            self.df = pd.read_csv('/kaggle/input/ai-mathematical-olympiad-prize/train.csv')
            self.df['ground_truth'] = self.df['answer']
            self.df['answer'] = -1
            if self.randomlize:
                self.df = self.df.reset_index().sample(frac=1).reset_index(drop=True)
            self.predict_called = True
            self.counter = 0
            self.len = len(self.df)
        def iter_test(self):
            while self.counter<self.len:
                if self.predict_called:
                    self.predict_called = False
                    yield (self.df.loc[[self.counter]][['id','problem']]),(self.df.loc[[self.counter]][['id','answer']])
                else:
                    print("You must call `predict()` successfully before you can continue with `iter_test()`")
                    yield None
        def predict(self, answer):
            self.df.loc[self.counter, ('answer')] = answer['answer'].values[0]
            self.predict_called = True
            self.counter+=1
    env = train_env(randomize=True)
    iter_test = env.iter_test()
else:
    # Set up the evaluation API
    import aimo
    env = aimo.make_env()
    iter_test = env.iter_test()
for test, sample_submission in iter_test:
    problem: str = test['problem'].to_string(index=False)
    sample_submission['answer'] = get_answer(problem)
    env.predict(sample_submission)
    print(test)
    print(sample_submission, '\n')
About the submission part of my code: get_answer() returns a number, so there is nothing wrong, right?
problem: str = test['problem'].to_string(index=False)
sample_submission['answer'] = get_answer(problem)
env.predict(sample_submission)
can you tell me why you are converting test["problem"] to a string?
also it should be test['problem'].values[0] to get the raw problem text
I think these two ways do the same thing?
Is that right? I'll try it
I did test['problem'].values[0] and it works just fine
this is the first time I've seen someone do str = test['problem'].to_string(index=False)
I tried test['problem'].values[0], nothing changed
thanks for your time, maybe I should copy someone's notebook which submitted
wait hold on
get_answer() are you sure this function returns a number?
yeah, I printed it, get_answer() returns a number.
there is something I didn't notice
that's just printing number
can you check the type of the output
type(get_answer(problem))
it might be returning string value instead of int
When I use test['problem'].values[0], the version run failed
ok do this
check the return type
I wanna make sure whether it returns a string or not
without submitting
on your notebook, run the cell
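For the type check being asked about, a minimal sketch of a coercion helper. `to_valid_answer` is a hypothetical name, and reducing modulo 1000 is an assumption based on answers being between 0 and 999; it is not taken from the competition API.

```python
import numbers


def to_valid_answer(raw, default=0):
    """Coerce a solver result to a plain int in [0, 999], else return default."""
    try:
        if isinstance(raw, str):
            # Accept "52" as well as "52.0"
            value = int(float(raw.strip()))
        elif isinstance(raw, numbers.Real):
            value = int(raw)
        else:
            return default
    except (ValueError, OverflowError):
        # Non-numeric strings, NaN and infinity end up here
        return default
    # Assumption: answers are expected in the range 0..999
    return value % 1000


# Printing a value hides its type, so check the type explicitly.
for candidate in ("52", 250.0, 1702, "abc", None):
    print(repr(candidate), type(candidate).__name__, "->", to_valid_answer(candidate))
```

Running the loop on your own get_answer() output shows at a glance whether a string or float is sneaking through.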
The CV/LB correlation is totally not existent in this comp 😐
I get a question correct on the Colab L4 (answered correctly over multiple runs too)...then I try testing on Kaggle T4, it becomes a wrong answer. I set a seed too 🤣
don't know what to trust at this point...made a change that is supposed to be a clear improvement
Gives 2 points boost on TWO different AIME datasets (both of them) - over multiple runs with random ordering, no difference. But the change drops the LB from 22 to 13 
you using deepseek?
cause people in discussion mentioned that deepseek is unstable
some tried to fine tune it, but it made it worse
Olga tsymboi decided to use deepseek
then the whole competition members followed her using the same model
I am really confused on why nobody tried a new method or a model for a past month and a half
maximum score was 27 for a whole freaking month
a whole month
0 change
I did try other models released before Feb 23, but I couldn’t score more than 16 points
everybody was brute forcing prompting the deepseek
what was the method?
or model
If my team gets a good result in the private LB I will share it after the competition ends
fair enough
what's your team?
the name
Rank 70 right now on public
Private LB will be a wild shakeup no doubt
I like chaos
what is it about ?
ai math olympiad
what's it about
using LLMs to solve math questions
ohhk
27/50 correct ?
or to be more precise a team got 27 score
yeah
for a whole month
that was the maximum score
no change at all
or more like 1 month and a half
LLM and math is something which doesn't get along well
so what are the rules ?
can use any LLM ,
yeah it's kinda sad...
we are allowed to use models released before February 23rd
only
the thing is, they made this competition so they can motivate the research on improving the reasoning of LLMs
but I've barely seen any new research
i read Llemma paper where they claimed to have a model dedicated to math
people just been using self consistency prompting on deep seek, and that's it
is it released before february 23rd?
it was released in 2023
this is something hard to individuals or beginners ,
the whole reason behind LLMs being bad in math is tokenization strategy
eehhh I doubt that's the case
I mean if we think about what LLMs are, they are just predicting what's the next word based on the current input sentence
they are good at generating text
it's missing the "reasoning" part
current LLMs can't do math because they can't see the number as whole
which is a shortcoming of tokenization
oh yeah that one too
you can try tokenizing a random number
its the main reason
nah I don't think
you can't add 12008 if you see it as 12 00 8
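A toy illustration of that split, assuming a made-up vocabulary and a simple greedy longest-match rule. Real tokenizers (BPE and friends) work differently and will split numbers in their own way; this only shows how a number can end up as unrelated pieces.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches.

    Single characters are always accepted as a fallback.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical vocabulary, chosen only to reproduce the split mentioned above.
TOY_VOCAB = {"12", "00", "8", "0", "1", "2"}

print(greedy_tokenize("12008", TOY_VOCAB))  # ['12', '00', '8']
```

The model then has to learn place value from pieces like these rather than from the number as a whole.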
even if they saw numbers
there is still the "logical" or "reasoning" part to perform action based on the input
which doesn't exist
yet
yeah I get what you are saying
you can do all the reasoning but when it comes to doing add / sub / mul / div it's wasted
but tokenization isn't the only thing holding back the LLMs when it comes to logical problem solving
yes
i'm talking about math part
it can reason a question very well
but can't do it by itself
self.high = 0 + 5 * 16 // 7 - 1 = 11
construct a logic for it
PEMDAS rule
explain
and i only know bodmas
its i believe standard
but here ans should be 10
0 + 5 * 16 / 7 -1
according to PEMDAS rule:
step one we look if there are any parentheses:
Checking....
No
step two
Look if there are any exponents:
Checking...
No
Step three
Look if there is any multiplication:
Checking....
Found "5 * 16":
Solving:
ans = 80 -> save answer in memory;
current equation = "0 + 80/7 - 1"
Equation changed... go back to step one...
This thing will keep looping until the equation is fully reduced
and the final answer is found
this is the logic behind solving a simple equation
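The loop described above can be sketched in a few lines. This is an illustrative reducer, not how Python itself evaluates expressions: it handles only non-negative integers with `+ - * / //` and no parentheses or exponents, and the name `reduce_step_by_step` is made up.

```python
import operator
import re

# Precedence tiers, highest first. Operators in one tier are applied left to right.
TIERS = [
    {"*": operator.mul, "//": operator.floordiv, "/": operator.truediv},
    {"+": operator.add, "-": operator.sub},
]


def reduce_step_by_step(expression):
    """Evaluate a parenthesis-free expression, returning (value, list of steps)."""
    tokens = re.findall(r"\d+|//|[-+*/]", expression)
    items = [int(t) if t.isdigit() else t for t in tokens]
    steps = []
    for tier in TIERS:
        i = 1  # operators sit at the odd positions
        while i < len(items):
            if items[i] in tier:
                result = tier[items[i]](items[i - 1], items[i + 1])
                items[i - 1:i + 2] = [result]  # replace "a op b" with its value
                steps.append(" ".join(str(x) for x in items))
            else:
                i += 2
    return items[0], steps


value, steps = reduce_step_by_step("0 + 5 * 16 // 7 - 1")
for step in steps:
    print(step)
```

With floor division, `5 * 16 // 7` reduces to 11 and the whole expression to 10, which matches both numbers quoted above: 11 is the intermediate value, 10 the final one.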
The human mind understands rules and applies them step by step. We don't think "what token comes next" after seeing the equation
@teal lake
LLM would see that equation and output the token that comes after with the highest probability. That's why I said there is no reasoning
it doesn't think
i will have hard time but let me find the paper
alright
but you know, thinking logically and step by step about the equation you gave...
I think human brain works like a tree graph
somehow I understood that the equation was math, so my brain went to a "math" node, and there are branches of mathematics that I know based on my experience, for example:
"Calculus, Algebra, Arithmetic...etc", and I picked the topic related to the equation, and went to that node, and from that node more branches came...
and picked one that is closest to the problem aka "high probability", and extracted the steps from there and applied it step by step until the problem is solved.
if you don't have time , just jump to page 36
https://arxiv.org/pdf/2406.02061
so I think instead of training LLMs to solve math, I think we should focus on a neural Graph tree model something like that, that actually maps the experience as a tree and branches
i believe people have already tried , but you can try if you think it can make it good in math
i don't have much idea about graphs
can you show me if there is already anything done?
that doesn't necessarily mean they are able to think
llama also did so did mistral
they just brute forced the knowledge into LLMs
but it says here LLama and claude gave wrong answers...
check below as well
then how do you think of it
yeah it still doesn't prove anything
i don't think people will show failure
as I said, they are just next token generators
yes
fair enough, but i thought maybe there were some works done and published
if you ask me
the closest thing to reasoning that was done was Deepmind's alpha geometry
they used neurosymbolic approach
I would still believe reasoning was used there, unlike with pure LLMs
chatgpt solved this question when prompted correctly
think step by step and reason before giving answer and consider all the information we gathered
can't prove more than this now
dude
for how long chatgpt was trained?
besides how many datasets it's seen?
how do I explain this
is that the question here
what I'm trynna say is
you mean chatGPT might have encountered this passage before
chatgpt is overfit
yeah
and trained on it multiple times
it's memorized the problems
doesn't necessarily mean it understands them
why didn't it answer in first few prompts then ?
so it's no surprise that they are able to solve typical stuff
which prompts?
i only wanted to show that they can do reasoning ,
its not typical generation of text
they are faking the "reasoning"
there is no reasoning in LLMs
LLMs are not brain to begin with, just some random numbers and calculations
if you think that way then no LLM can do reasoning
wait I'm cooking something
lemme show u
here are the questions
I copied first 3 questions to Chatgpt to solve
it got all 3 wrong
The answers should be as seen in the table:
52, 250, 702
just like any ANN models out there
bruh.. they are literally decoders whose only purpose is to generate token correctly
i'm not good in math , does the reasoning sounds good to you ?
you can fake the steps
or what do you call it
"hallucinate"
then why are you hoping for reasoning
but it really is useless when steps mean nothing
yeah....
but how are the steps
idk lol, i didn't read them
this is not how you should argue 🙃
bruh
i am not arguing, I'm just trynna prove my point
sure,, with little prompting I can make it get better answers
its the same
then isn't it like saying earth is flat
if it can answer then its fine
bruh
dude
it's not about giving answer
how do you define reasoning
really ?
what's the point of giving a wrong answer
i don't need LLM to give me answer
I can literally just code a function to output random number
there it gives answer
does it mean it's able to reason now?
first of all, answer doesn't mean anything
using logic to solve problem
reasoning is steps to reach there
well honestly in solving math competition it does matter
cause we need right answers
that's what i'm saying from the beginning
but there is difference between correct reasoning and wrong reasoning
or rather "random reasoning"
LLMs do random reasoning
i'm asking for reasoning only , is it correct or not
they don't even do reasoning at all honestly
don't see answer
no it doesn't do reason
it just generates text
based on what it believes is the correct token
bruuh
it's not about me winning or losing
we got a problem now
there are 50 math questions
people were only able to solve 27 of them
for a month now, nothing changed
how the hell we gonna increase the score
15 days left for competition...
- prompting is all you can do now
- times up for experimenting things with architecture
27/50 is good enough for a 7B model
i want 50/50
model=Anren()
@floral sentinel I had set TIME_LIMIT in the loop, so why did the submission still time out? so confused
one moment
how long did your submission take?
10h
no
How can I check the time? The time limit is 9 hours, and it started 10h ago, so it's over 9
is this based off Lewis' notebook?
yeah it's same
idk
couldn't figure out the problem
you writing code from scratch?
But the code is quite easy
honestly I don't like this method of solving math problem
the api and predictions are mixed with the LLM solving
hard to debug
I'd rather create a function that takes problem as input and outputs answer.
way better than doing this
yeah I created an organized notebook
wait lemme send it
it's buried down in the code section
Elegant! so you mean I should put this code in predict().
I did make a function: the problem is the input and the answer is the output
yeah kinda
i mean it's already in predict
I changed the seed and my 22 points submission dropped to 16 points 🤣
hi folks, in submission tab it showed throwing exception, however clicking into the notebook it said successfully ran in 132s. Is the submission successful?
If your temperature was high then it's expected
I haven't tried to modify this temperature, top_p setting, all my tests are with 0.9 and 1.0
what do the temperature parameter and top k mean?
temp = 0.9 will give highly stochastic responses, no wonder with the change in results
Maybe because it was even pinned on the discussion forum 
Nah, people always choose the easy path
@floral sentinel have you tested how much time your code takes?
I mean you run 19 repetitions per question, and the number of questions is 50. Is 31500s enough to generate them all?
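The arithmetic behind that question, using only the numbers from this thread. It ignores model loading and other overhead, so the real per-sample budget is somewhat lower.

```python
TIME_LIMIT_S = 31500   # total budget discussed above
NUM_QUESTIONS = 50     # size of the hidden test set
REPETITIONS = 19       # samples per question in the notebook being discussed

per_question_s = TIME_LIMIT_S / NUM_QUESTIONS
per_sample_s = per_question_s / REPETITIONS

# prints: 630 s per question, 33.2 s per sample
print(f"{per_question_s:.0f} s per question, {per_sample_s:.1f} s per sample")
```

So every generation, including any code execution it triggers, has to average about half a minute.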
People reported that changing the seed leads to a different score. But I have reasons to believe that even with the same seed (submitting the same notebook twice) one can get a different score. It happened to me. Did this happen to you?
+- 1 for me
but I would be scared also...my best seed for validation is not the same as my best seed for public LB
also when comparing different AIME sets...all different optimal seed 🥴
In other words, I tested 3 seeds on public LB, AIME old, AIME new datasets
The first seed is best for public LB
The second seed is best for AIME old
The third seed is best for AIME new
and the gap is not small
won't give more details until the comp finishes 😅
no, i didn't test
I think there are some issues in the early stop strategy and the decode strategy in your notebook
explain more in detail
wdym by decode strategy?
this part will skip several of the top loops, all except the first
and about decoding the right answer from text or code_output: I think the code output is more convincing and the answer from text should be dropped. But you take them both
Honestly I don't understand 50% of code myself, as I mentioned, I just organized people's forked notebooks and added them together in a nice way
about code output and text output
sometimes text output gives right answer
that's why people consider taking both answers
the reason why abdulrafea did that "skipping" is because notebooks threw exception errors when submitting
to prevent that, he added skipping thingy that u see
in this function, if the answer is wrapped in \boxed{}, the answer will be counted twice. It makes no sense
alright then test out your ideas and see if they work
answer from text is unreliable
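One way to avoid the double counting is to let each sample cast at most one vote. The sketch below is a hypothetical helper, not the code from the notebook under discussion: it prefers the code output when it is a clean integer and only falls back to the last `\boxed{}` value in the text.

```python
import re
from collections import Counter

BOXED = re.compile(r"\\boxed\{(\d+)\}")


def sample_vote(text, code_output=None):
    """Return a single vote per sample, preferring the code output over the text."""
    if code_output is not None:
        stripped = str(code_output).strip()
        if stripped.isdigit():
            return int(stripped) % 1000
    matches = BOXED.findall(text)
    return int(matches[-1]) % 1000 if matches else None


def majority_answer(samples, default=0):
    """samples is a list of (text, code_output) pairs; each gives at most one vote."""
    votes = [v for v in (sample_vote(t, c) for t, c in samples) if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else default


samples = [
    ("so the answer is \\boxed{52}", "52"),  # counted once, not twice
    ("I get \\boxed{7}", None),
    ("\\boxed{52}", None),
]
print(majority_answer(samples))  # 52
```

Dropping the text fallback entirely, as suggested above, is a one-line change: return None when the code output is unusable.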
bro, the model itself is not reliable, let alone the answer xD
there is another early stop in the code
I mean it's not reliable to count on the LLM
to calculate
Happens, had such experience with gpt and Gemini as well
what the hell...
The problem happens with basic algebraic calculations, and it's universal across multiple LLMs
we gonna fix that one day
Obviously, a lot of ways to fix that better not to discuss it right now 😅😅
why
i think it's perfect time to discuss that
someone got 28 score
wow it took god damn 1 month and a half to get +1
bruuhu
Lol if I do that it's like talking about my strategy, would not like to drop my lb rankings
oh
what's your score
24th on LB right now
cool
It's not just math, it happens in text generation with lots of small LLM. I have seen Llama 2 7b do that at work, it's frustrating I know 
I haven't seen GPT do anything similar
discuss in 9 days after competition close 😅
Then what is a good value for temperature? I experimented even with 0.1 but the score got worse:)
By the way, did anyone try using two llms, like deepseek-math 7b-rl+deepseek-coder2b?
I could not get it running
The higher the temp, the higher the stochasticity, and maybe even the quality for some tasks.
Again as such commenting on exact things without knowing what you exactly did would be hard, and also it is not the correct time to do so
Hi, thanks for your reply. If I get it right, no more idea sharing here until the competition ends?
I am not sure about others, but yea atleast from my end.
Certainly would be up for open discussions, but stuffs directly related to competition would be hard
the more samples you take from the model, the higher the temperature should be, as otherwise all the samples will be too similar
e.g. when models are evaluated on a single sample (maj@1 aka pass@1) usually greedy decoding is used (temperature 0)
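A small sketch of what temperature does to the next-token distribution, using made-up logits for three candidate tokens. Treating temperature 0 as argmax mirrors the greedy decoding mentioned above; actual inference libraries may handle that case differently.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature 0 is treated as greedy (argmax)."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores, for illustration only
for t in (0, 0.1, 0.9):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At low temperature nearly all the probability sits on the top token, so repeated samples look alike and majority voting has little to vote on.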
unless you're actually in the running for a prize (or gold medal, I suppose), I don't see why you wouldn't share anything
Obv, I am into Kaggle with some aim (ranking matters for me), for learning purposes I am already pursuing a PhD in the same field
Because I believe there will be a huge shakeup and can see stable 21-22 scores move all the way up to gold
Maybe even 20
with some luck obviously
For some lucky submissions yea eg: In one case we were lucky to get 22, after that for that submission it dropped to 17-18 and constantly remained around that. But for those who are not just dependent on the prompt, temp or any hyperparams things should be fine I guess
Mine is very dependent on the seed, sadly
Stable across same seed, but if I change the seed it drops a lot
Kinda hard to comment, could get lucky as well.
Btw one of our submissions is with that hope of luck.
My intuition says it should be fine as the code is already executed for all 100 questions, and as you got lucky in those 50, in case of the same distribution the remaining 50 should also give similar results.
If it was Kaggle running the code again, then it would have been a big red flag
The final subs you select will run over the new 50 only, so two of your submissions will only re-run once.
Shake is coming. I hope I can keep a good rank in the end 😅
I am new to Kaggle but afaik they won't rerun your code, atleast they didn't in the competition that I had earlier participated in, and same was implied from one of the discussions as well.
But yea if they do, shake is definitely coming, particularly from the score of 25
"Because of the limited number of problems available, we are taking special precautions to secure the test set against probing. Among other things, during the submission period the test set will comprise only the 50 public set problems. Once the competition ends, when we rerun submissions, the test set will comprise only the 50 private set problems. You should attempt to make sure your submission will complete successfully on the 50 new private set problems. This may mean ensuring your submission is robust to unexpected inputs, or managing runtime and memory usage."
This from data section
Oh ok posted from the competition's page?
My biggest biggest fear is that for some reason my subs fail (hit errors that I didn't catch during validation or public subs)
from data section
Oh ok my bad, I didn't read that, then yea seems like private lb will be fun
yes a scary movie is coming within 7 days 😄 , I can handle any score drop - I am preparing myself for this , but I can't handle failed subs ! it will be a total waste!
It won't drop much, maybe ±3-4 questions, but yea the lb can completely get flipped
doesn't matter , my fear comes from this line in the section I shared : "This may mean ensuring your submission is robust to unexpected inputs" what unexpected inputs 😄
Shouldn't be an issue, but timeout can be, it completely depends on what sort of code you wrote
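For the "robust to unexpected inputs" worry, a minimal defensive wrapper. `solver` is a hypothetical callable and the fallback answer of 0 is an assumption; note that this catches exceptions only, it does nothing about timeouts.

```python
def safe_answer(problem, solver, default=0):
    """Run solver(problem) and fall back to default if it raises or returns junk."""
    try:
        result = solver(problem)
        return int(result) % 1000
    except Exception:
        # Any failure on an unexpected input becomes a default answer
        # instead of killing the whole submission.
        return default


print(safe_answer("2+2?", lambda p: "4"))      # 4
print(safe_answer("weird input", lambda p: 1 / 0))  # 0
```

A single uncaught error on one private problem costs the entire run, while a wrong default costs at most that one problem.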
I did face rare timeouts that i can't yet explain.
My advice (I hope I stick with it - and stop trying new ideas) is that I should focus on the subs I want to select and run them again and again to make sure they are stable
even you suffered timeouts? And didn't later find any bugs in the code to explain it? I'm glad I solved my freezes and timeout (as posted on the forum), even if it did cost me days; that could have been a disaster during the private LB rerun. Instead I have the extra stress due to less time!
BTW I just noticed you're 2nd in the ARC contest! Wow
that you even have time to work on both!
how much work was it to get to that score?
and yeah, I've very paranoid about errors on the private LB; I'm going to try to make my 2nd submission quite different and rather conservative
Ya I suffered timeouts 😅 My current (in mind) 2 subs work fine with different validation questions but I am still afraid of unexpected inputs or errors that might kill the code in private, If the private was calculated in parallel with public and we see 50% of the score then it would have been much better (no sudden errors !)
I have no expectations currently regarding my final rank ! I just hope to stay in gold ! I need the solo gold 🙂
Regarding ARC I like such competitions, the math one, ARC, CTF, and Santa problems are my favorites.
Regarding my current score ARC 26.5 it took me about a week. I have some ideas to try. BUT no way to reach the guys at 38 ! They are professional ARC developers (if that can be a job title!)
well, you have half a year to come up with and implement ideas! Good ideas to tackle such hard problems are very rare
I've heard of ARC before but I'm not familiar with it, and I only just saw ARCathon mentioned. I suppose that's what you're referring to
I haven't looked at it much because I'm too busy with AIMO! I'm attempting to write an entirely new solution to AIMO in the next week! Yesterday I put together the first pieces of code and managed to solve the first problems. Haven't submitted yet, but I think it'd score about 4/50 :P
also yesterday I finally got a score of 23; it only took me a month to match the score of the top public notebook, arghh! What a massive waste of time
Try to test the stability of the 23 ! if you ran the code again will you get 23 ? I think stability is very important currently.
I have 2 subs with 26 (from "somehow" different approaches) and 4 subs with 25 and some good 24s and 23s .... I can't say my 26 and 25 are stable , but the avg score for those notebooks is 24 , but I have some codes that when re-run make a roller coaster ! from 19 to 24 to 25 even ... I somehow know why but I can't say much now
I haven't yet but will. However I don't have many submissions left and will need to use most for my new solution
Good luck! never give up! ... who knows ... you might see higher scores in private
My stable 24 might be a stable 4 in private or an error 😄 ....
in theory, the LLM could write a script that runs killall python
more realistically, it could fork bomb; I did see it try to use multiprocessing after I informed it its solution was too slow
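A minimal sketch of running generated code in a separate process with a time limit. This is an assumption about how one might do it, not anyone's actual setup, and it is not a sandbox: it bounds the wall time of the direct child only, so it would not stop `killall python` or a fork bomb.

```python
import subprocess
import sys


def run_generated_code(code, timeout_s=5):
    """Run code in a separate Python process.

    Returns stripped stdout, or None on a non-zero exit code or a timeout.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if completed.returncode != 0:
        return None
    return completed.stdout.strip()


print(run_generated_code("print(1 + 1)"))
```

Keeping the generated code out of the notebook's own interpreter at least means a crash or an infinite loop costs one sample rather than the whole run.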
hhhh well the private will be fun to see ... I just hope hard work pays off and not a lucky hit (which I am afraid will put some people in good ranks with maybe public notebooks)
Imagine I enter the competition , sort notebooks by score, run the notebook, get lucky with 23 or 24 ... then the same luck happens in private and with more luck puts me on top ... wow
well, the score gap between the top team vs what you could get by submitting the public notebook 50 times is just 5 points
I am pretty sure that you may even get 24 with the public notebook
yes, if it can get 23 probably it could get 24 very rarely
I got 22 with it the very first time I set the timelimit to 8.5 hours, with minimal changes
and then took weeks to get another 22
it turns out that almost all my submissions during the contest have had some major bug or other
My 4th rerun (just finished) of my first 25 got me 23 , so its (25, 24, 24, 23) , I have -+2 variance I guess , I hope to keep it this way
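A quick summary of those four reruns. Four samples are far too few for a reliable estimate, so the spread below is only a rough indication.

```python
import statistics

scores = [25, 24, 24, 23]  # the four reruns quoted above

mean = statistics.mean(scores)
spread = statistics.stdev(scores)  # sample standard deviation

print(f"mean={mean}, stdev={spread:.2f}, range={min(scores)}-{max(scores)}")
```

With a mean of 24 and a standard deviation under one point, a later run several points lower would sit well outside what these four runs suggest.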
Lmao I have been stuck at 22 since the first week, but that was a lucky 22. Cleaning up the code dropped the score by so much and I had to slowly climb back up to a stable 22
only to realise it once again drops to 18 with a change in the seed 🤣
I’m not sure if the old AIME cv is correlating because my latest stable 21/22 public scores 27/50 there as well 😐 [with a DIFFERENT seed], but as usual the AIME new questions it just sucks, out of 61 I got just 9 points 😓
ooh Ali hii
For GM?
I need gold in general for competition Master level, but a solo will help for later of course (GM) so I am trying to hit two goals at once. I was so close to get it in "LLM - Detect AI Generated Text" competition, if I only selected the correct sub 😦 (I finished 2nd public board, 21 "silver" private board , where I was able to secure 6th place 😅 if I was wiser)
This honestly sucks.... Even I wanted to try for a gold, could have been fastest Master on the platform 😅😅
But it seems unlikely considering my score right now, and unstable performances
Anyways check your DM please
which is the top public notebook??
you can sort by public score https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/code?competitionId=73231&sortBy=scoreDescending&excludeNonAccessedDatasources=true
I didn't realise there are actually two notebooks with a score of 23
I believe they're virtually identical though
I notice a very amusing pattern on the scoreboard (when I looked a few hours ago), the number of teams with a score of at least N goes:
28: 1
27: 4
26: 8
25: 16
24: 31
23: 76
22: 166
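One way to look at those numbers: convert the "at least N" counts into counts at exactly N, and compute the ratio between neighbouring rows. The figures are the ones listed above; reading the pattern as "roughly doubling per point" is my interpretation.

```python
at_least = {28: 1, 27: 4, 26: 8, 25: 16, 24: 31, 23: 76, 22: 166}

scores = sorted(at_least, reverse=True)
exactly = {scores[0]: at_least[scores[0]]}
ratios = {}
for higher, lower in zip(scores, scores[1:]):
    exactly[lower] = at_least[lower] - at_least[higher]   # teams at exactly this score
    ratios[lower] = at_least[lower] / at_least[higher]    # growth per point lower

for score in scores[1:]:
    print(score, exactly[score], round(ratios[score], 2))
```

Every step down at least roughly doubles the number of teams, so a swing of a couple of points moves a team across a large share of the field.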
join the club. There must be SO many people in the same boat, because that code was awful. It took me weeks to get back up there, because I kept breaking new things, or I had bugs that would destroy the performance half the time making it even more unstable
oh BTW, I see everyone using AIME for validation but it doesn't match the composition of the LB at all, which has a much wider range of difficulties. I meant to mention this on the forums
so if you think your submission score is stable, to a large degree it's actually because there aren't that many problems on the edge of what it can solve
can't join anymore 
I tried to fix the dumb behavior…but most of the time it ends up fixing X problems at the expense of making Y problems wrong
where X < Y
AIME 2022-24 (new ones) gets 10 or 11/61 on CV for me regardless…it gets the same for every experiment I have tried over the past 2 weeks, regardless how the LB score is
, my LB can be between 13 and 22 and the new AIME val score cannot differentiate it
I don’t even know what I can trust now
Haven’t seen anything like this in previous competitions
no seriously wtf is this
My predictions for the private LB gold zone:
27 27 26 24 22 21 20 19 19 18 17 17
Cutoff for gold: 17
Cutoff for silver: 13
Cutoff for bronze: 12
I may be wrong, this is my gut feeling based on my validation tests and the data description
Why can't we join this comp anymore
Entry deadline is June 20...Read the competition description under Timeline
Correct, the entry deadline is 20 June 2024. I don't know what this contributes to my question?
bruh, there are 4 days left for competition, and u coming now to join?
huh?
Yesterday there were 5 days left
wooow big brain
Still only troll answer?
and who is the one trolling here with silly questions?
Someone wants to participate but is kept from participating due to unexplained rules. Question is good. Answer is not
wdym by "unexplained rules", the guy told you the reason
there is a deadline for joining competitions
what part of it you didn't understand?
the competition started around 3 months ago
So why is there a deadline when the submissions are open?
What part of my question are you struggling with?
There are two deadlines: a deadline to enter the competition, which is over now, and a deadline to submit, which is 4 days away
correct. this information can be found on the kaggle page of this competition.
Ok, does this explains why you can’t join ?
Wondering why don't folks just ignore it, the discussion is already self explanatory
He is trolling, ignore him
I doubt it, while our best submission has some stochasticity, our third best which is around 21 does have some sort of stability because of lower temperature, and yea I am not expecting a gold from it
Again, not giving answer, only trying to be toxic for no reason.
I don't get why you don't try and be a good person instead of a troll
Exactly when is the last submission time?
That's a good sign...we have a 22 but it has temperature 0.9, I changed the seed it drops to 16-18, I lowered the temperature, same thing. I don't even know if it's good 🤣
I'm getting only 24 CV with the setting that gave 22 on LB, but if I change the seed I get 27 CV and 18 LB. also lowering temperature doesn't worsen my CV, only +- 1
I guess we will know in a week...but terrible CV/LB correlation I'm not confident in our submissions 😐
In all the comps I have participated this is the one I'm least confident of even getting a medal out of it
Yea same here, our 24 and 22 ones use a temperature of 0.9, but we are unable to reproduce the results with lower temperatures
no one mentioning rag in discussion, does that mean it didn't work or worked too well?
I do not see how RAG would work here. From what can we retrieve the answer?
Is it possible that the test set is already online somewhere?
Frustrating right😁. I got similar nonsense
By the way, do you think it is possible some submissions throw an exception on the private test set? I thought a submission would be marked successful if it was able to run on both public and private test sets
Yeah... 3 days left and the model giving me nonsense answers
I got 10 last submission about 12 hours ago...
answer of similar problem from aops, same contest, similar contests, apparently not exact answer of the exact problem, but the reasoning steps or key ideas can be re-used for the main problem
deepseek has an input limit of 4096 tokens
so stacking questions and answers will lead to high token count
yea so a few things can be tried: summarize/shorten the solutions, or use a longer-context model
3 shortened examples + prompt might take around 900-1,500 tokens, or fewer depending on how compressed the summaries are. Anyway, deepseek doesn't utilize those very well (or my implementation was bad)
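A minimal sketch of the token-budget idea discussed above: pack retrieved example solutions into the prompt only while they fit under the 4096-token limit. The function names, the 4-characters-per-token estimate, and the `reserve_for_output` value are assumptions made for illustration; a real run would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption: ~4 characters per token)."""
    return max(1, len(text) // 4)


def build_prompt(problem: str, examples: list[str],
                 context_limit: int = 4096,
                 reserve_for_output: int = 2048) -> str:
    """Greedily add examples (assumed sorted by relevance) while they fit.

    reserve_for_output is a hypothetical allowance for the generated solution.
    """
    budget = context_limit - reserve_for_output - estimate_tokens(problem)
    chosen = []
    for example in examples:
        cost = estimate_tokens(example)
        if cost > budget:
            continue  # skip examples that would overflow the budget
        chosen.append(example)
        budget -= cost
    # Examples first, the actual problem last.
    return "\n\n".join(chosen + [problem])
```

Summarizing each retrieved solution before packing, as suggested above, simply lowers each example's cost so that more of them fit.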
Nothing is stable 🙂 Remember this... well, it's 25, 24, 24, 23, 24 .... 20 .. ❌
no you are right, deepseek doesn't utilize context well... it's better off giving it a short request to just solve the problem step by step
bruh...
did it throw an exception?
but the drop from 25 to 20 tho...
It took 5 runs to go wild! No... the notebook usually finishes in around 7 hrs... say around 8 hrs now, due to latency in showing the score!
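To put a number on that instability, here is a small sketch that summarizes the repeated scores quoted above. It assumes the six listed values (25, 24, 24, 23, 24, 20) are the complete set of runs; the `summarize` helper is made up for this example.

```python
import statistics

# Scores quoted in the message above (assumed to be the full list of reruns).
scores = [25, 24, 24, 23, 24, 20]


def summarize(values):
    """Return simple spread statistics for repeated leaderboard scores."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = summarize(scores)
print(summary)
```

With only 50 public problems, a spread of a few points between identical reruns is the kind of noise people in this thread are describing.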
Its punishing me for repeating it 😄
the private leaderboard will be lit 🔥 🔥
imagine it throws an exception on all the submissions 😄
on the private leaderboard
no winners lmfao
3 months will go to waste
hhhhhh oh god ... the selection of the final 2 subs is the hardest thing to do now
why, isn't it selected automatically?
the highest scores
No you can select which two you want (before deadline) if you don't then Kaggle will select (usually highest two)
yeah i know we can select it manually, why not just let it select automatically?
because I am looking for the stable ones... and right now I don't know which those are
ooh...
I have three (somewhat different) approaches that all score up to 26... and another approach that is stable around 24 (so far)
6 subs left ... one should use them wisely !
But in any case the private leaderboard is going to be crazy... I have dropped my prediction about winning a gold medal now
By the way, results on Colab (regardless of the questions) are perfectly stable! (I mean repeating any experiment gives the same score.) Only rarely did I see a ±1 variance
I use L4 on colab
my 22 notebook turns into 14 with only a slight change… 😱 I was so surprised by the score. Usually this type of small change can still get 20-21
Same for me 😄
It’s always 24-25 points on old AIME, and 10-11 points on new AIME
Doing things that make sense improves validation and reduces LB...
Haven't seen a competition like this 🤣
#learning-agency-lab-automated-essay-scoring-2 is way worse than this
wtf my code tried to execute this
Great observation. By repetition do you mean restarting the notebook and running or just solving the same problem multiple times within the same session?
For me, I randomized the order and restarted the notebook
Restarting the notebook
curious what people are using for their experiment, this is my first kaggle competition so I started out maximizing kaggle gpu hours, then moved to runpod because it looks cheaper than colab
you can also try vast.ai. It's like runpod, but often has good prices
config = transformers.AutoConfig.from_pretrained(model_path)
config.gradient_checkpointing = True
What does gradient_checkpointing really do?
It saves memory by recomputing activations during the backward pass, so it only helps when training.
thank you
Is the deadline the 28th at 0:00 or the 29th at 0:00?
Convert 11:59 PM UTC to your local time zone
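A quick sketch of that conversion using only the standard library. The date below is a placeholder chosen for illustration, not the competition's actual deadline, and `deadline_in_local` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def deadline_in_local(deadline_utc: datetime, utc_offset_hours: float) -> datetime:
    """Convert a UTC deadline to a fixed-offset local time."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return deadline_utc.astimezone(local_tz)


# Placeholder date (assumption): 11:59 PM UTC on 1 January 2024.
placeholder_deadline = datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc)
local = deadline_in_local(placeholder_deadline, 8)  # e.g. UTC+8
print(local.isoformat())
```

For anyone east of UTC, 11:59 PM UTC lands on the morning of the next calendar day, which is why the deadline can look like it falls on two different dates.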
Me too, I will finalize my selections once my last sub finishes, Good luck everyone.
bruh the second phase will start in 2 months...
so soon
Yeah I probably need to build my own solution next time rather than using open source models
I didn't have time this month to explore how to do that..
Bet some top teams did try research paper worthy things...let's see tomorrow
i don't think anybody had time to do new stuff
3 months ain't enough for researching and discovering new things
I doubt one can reach 29 without fine tuning, I may be wrong
I guess you’re underestimating the ability of the top-notch in this field. It is definitely possible if your team is really good
I'm not underestimating anything, the facts are out there
for 1 month and a half we've been stuck in 27 score
which really shows that nobody created anything new
everyone was using deepseek and tryna prompt it correctly to get it to work
I don’t think the 28 and 29 are achieved without fine tuning, and my guess is they will be stable on private LB too
well fine tuning isn't something new?
well let’s see tomorrow, I don’t want to jump into conclusions now
yeah that's the best thing to do
If you had followed the public discussions, many people have said they tried and failed. So doing it right to improve the score is an achievement itself
yeah i followed that
some said it's in a deep local minimum
so fine-tuning made the results worse
hmm, yeah in those terms you are right
I think it may be possible.
Earlier I decided to test the ideal case for my main strategy. By ideal I mean that the model generated a correct solution to a problem at least once. On average I counted around 35/50 (on validation, of course), whereas my best actual score was up to 28/50... so if you can get a better judging/scoring approach (name it as you like), then yes, you might reach more...
Also this means that you might overfit more!
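The gap described above (roughly 35/50 "ideal" versus 28/50 actual) can be sketched as oracle coverage versus a selection rule. This sketch assumes plain majority voting as the selection rule, which may not be what the poster used; the function names and toy data are illustrative only.

```python
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent sampled answer for one problem."""
    return Counter(answers).most_common(1)[0][0]


def oracle_and_voted(samples_per_problem, truths):
    """Count problems solved by at least one sample vs. by the voted answer."""
    oracle = 0
    voted = 0
    for samples, truth in zip(samples_per_problem, truths):
        if not samples:
            continue  # no output generated for this problem
        if truth in samples:
            oracle += 1
        if majority_vote(samples) == truth:
            voted += 1
    return oracle, voted


# Toy data (made up): three problems, three sampled answers each.
samples = [[1, 1, 2], [3, 4, 4], [5, 6, 7]]
truths = [1, 3, 9]
print(oracle_and_voted(samples, truths))
```

The difference between the two counts is the headroom a better judging or scoring step could recover without changing the generator.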
haha when I fix 1 problem I break 2 problems
How many outputs do you generate for each problem?
I set time limit = (remaining time) / (number of remaining problems)
I did not have any logic that breaks out of the for loop if the time exceeds 31500 (in my runs it has never timed out)
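A rough sketch of that time allocation, assuming a total budget of 31500 seconds as mentioned above. The `solve_one` callback and the fallback answer of 0 are hypothetical, and the guard for an exhausted budget is an addition that the poster says they did not have.

```python
import time

TOTAL_BUDGET_S = 31500  # figure quoted in the message above


def per_problem_limit(elapsed_s, problems_left, total_budget_s=TOTAL_BUDGET_S):
    """time limit = (remaining time) / (number of remaining problems)."""
    if problems_left <= 0:
        return 0.0
    return max(0.0, total_budget_s - elapsed_s) / problems_left


def run_all(problems, solve_one, total_budget_s=TOTAL_BUDGET_S, clock=time.monotonic):
    """Solve each problem with a limit recomputed from the time actually left.

    solve_one(problem, limit_s) is a hypothetical callback that must respect limit_s.
    """
    start = clock()
    answers = []
    for index, problem in enumerate(problems):
        limit = per_problem_limit(clock() - start, len(problems) - index, total_budget_s)
        if limit <= 0:
            answers.append(0)  # added guard: out of time, fall back to a guess
            continue
        answers.append(solve_one(problem, limit))
    return answers
```

Because the limit is recomputed before every problem, time saved on easy problems is automatically redistributed to the ones that remain.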
seed set at 42
I generate between 100 and 120 based on time
I selected a notebook that generates at most around 100, since it was the first 25 on the LB overall, and the added value of an earlier submission pushed me to select it
I am looking for a team for season 2 of this competition. Given that it has officially ended and we can do 100 submissions per day, I believe this is the perfect time for exploration. Kindly let me know if anyone's interested.
PS: I would prefer someone who has actively participated in this season
vLLM and 2 T4
Even with vLLM, my speed isn’t so fast to run 100 tries on a problem
You need AsyncEngine I think
Transformers = 13 token/s, vLLM: 28 token/s
If I reach the money prizes I will share more about this.
But within 9hrs I think you can reach more maybe up to 140
But the risk of timeout is always there
Early discussions and notebooks on this helped me a lot
My submission definitely has a risk of timeout also
But my score isn’t good so I risked it
I didn’t end up selecting the 22 that drops with every small change that I made. My opinion is it overfitted public LB from past experience
My second one does. I kinda gambled with the second one; based on my analysis it should be the most stable one, but with a risk
don't you allocate a specific time for every question?
Ya in one of the approaches I set around 450 sec per question
Also vLLM didn’t improve my score and just lowered the floor that I could get, ended up not selecting it
such solutions are kind of stable with low temp
Low temps stabilize around 20 to 22 …
Oh that's good, my stable ones were between 18-21
Same for me
btw when will the final LB be announced?
whats your top score?
I have hopes for my second sub, which scored 26 yesterday, but because ties are ranked first come, first served, I also had to select the first 25 (with a variance of 2 most of the time)
I selected a 20 and a 21
Both have been resubmitted twice with the same score all 3 times...stable ones
Submitted on May 25 and June 1
I guess it will take a week or more
Oh ok ok
But I have no confidence of the private result 😓
My selected subs are one from 2 months ago and one from yesterday
Anyone of you in for season 2 of this competition?
It depends on this phase final output
Would definitely like to know more about 26 one
In the last week of the comp I tried some “out of the box” ideas, but they didn’t have a significantly better CV, so I ended up not choosing them, even though they scored 18-21 on public
In any other comp, 99% of the time I would choose this as one of my subs
Also a lot, and I mean a lot, of experienced grandmasters avoided this phase (maybe due to the 23-limit)
23?
you mean public lb score?
23-feb
23 Feb? I didn't get you
Yeah that one 
The rule that forbids you from using anything released after 23-Feb
Oh ok
The second phase will most likely have a similar rule with a 1-Sep cutoff 🙂
I guess it's maybe going to be an end-of-2024 rule, to give everyone a chance to test and finetune
You only need 1 day to fork an open source model, try it on the comp dataset and submit
I learnt a lot about the behaviour of LLMs on math problems from this comp though 😄
btw anyone of you tried finetuning?
nope
I did finetune; reached up to 15, I guess, but dropped it
Me and my teammate tried 6 different RAG strategies
data for RAG?