#ai-mathematical-olympiad-prize
I didn’t play much with prompt and ended with the one shared in the 20 points notebook before sub api
lol for us prompt helped reach 24
I didn’t try to fine tune prompt LOL
Actually I did, but it solves 1 problem and breaks 2 problems haha in validation
So I didn’t end up choosing
The problem is it's so slow to try fine-tuning the prompt lol
I had local GPU so was not a big deal
In my validation set for new AIME I added like 10 problems which I created by myself 
I would say none of these are more than 3/10 difficulty for the human
I resubmitted my notebook from yesterday. If my calc is correct, it should score 26 again
LOl, do let us know the score 🙂
Will take 8 hrs but sure I will
My prediction for my private score is 13 points, it depends a lot on how much the public notebook will score…from my validation, I don’t see that it is significantly better
(the 21 that is trending in notebooks is the public NB I’m referring to)
🤣
I have no predictions … I want a solo gold and a money prize will be nice hhhh but I expect 2 timeouts just to not be in shock if this happens
Oh aiming for GM
Ya one day 🙂
Great
? What lol
I assume it's just for the time being, it will change.
Lol if this happens most of us will get a free medal
Not really, just identified some keywords for difficulty and gave those questions some extra time.
Though I thought of using Gemma to explain DeepSeek the question but it didn't work (VRAM exceeded)
lol did we all get trolled?
i.e. we should have overfitted on public lb score?
I wish that happens, would get a free silver and can be the fastest expert 😅😅
It's just the Kaggle system
I guess it will take a week to have the final update
Lol my bad it's on the comment
Hhhh unfortunately didn’t 🙂
Deleting
No worries
It's silver for the competition discussion and not the lb 🤣🤣🤣
Hhhhh on other issue I don’t think the 100 sub will stay much after competition
As far as I understand the test data will be the same in phase2
I just reran my final subs again
Yea that's the reason I plan to start it now irrespective of the results, just exploiting it as much as possible.
Ya because I dont think this window of 100 subs will stay much
Let's see, ideally it shouldn't but hoping too
Solve national-level math challenges using artificial intelligence models
lol this is insane
shows how volatile the model is
This though…I’m not sure how it would hold on private, there is a chance it shakes down if the prompt is fitted to public…
Fitting prompt for public lb is kind of impossible. The prompt would give similar performance across problems given that you use the same model
Is your prompt that gave 24 also the one that gave the best local validation score?
For our team, it's not correlated
I have a prompt which scores 26 on CV but only 9 on LB. But the baseline scores 22 on CV and 21 on LB, for example
If it does you're probably safe
I tried a few times to tune the prompt to fit 2 or 3 validation problems where the model is nearly correct, it always performs worse than the public prompt
You turn 2 or 3 from wrong to correct, but instead turn 5 from correct to wrong
Can you try resubmitting those solutions, I would expect it to be unstable
Hmm I set a seed 🤔
You can just remove that and run it a few times, coz if this is true then a lot of research is just a waste of time 🤣🤣🤣
No seed I tried before it's not stable and max - min can be up to 9 points
By research I mean actual research, the one which proper academia and industry folks do
I didn't submit too many times for the same code though but 24-25 is definitely possible with a lucky run
That's what I would expect, what was your temp for those solutions and did you effectively manage the time?
temp is 0.9 and time limit is 630 per problem
This is true, but the interpretation is completely different i.e prompt overfitting on lb
One of the reasons for unstable and again why it can be lucky to get 24-25
assumed the seed was set
my bad, I saw the discussion post and assumed it was a stable 25
Yea still I mean the underlying stochasticity with higher temperatures is high. Also I assume the top p would be something around 1 for the submission
If any of the higher temp solution gets a good score, I would just assume it's got to be lucky irrespective of the prompt or any other modifications
I think scores may be released tomorrow
Last time Optiver comp, it took 2-3 days to run 2 submissions each for 3.4K teams
That one was compute intensive and submissions take very long to score as well
Hope so
Makes sense, btw if you don't mind can you make your finetune code public? Would like to experiment over that
?? I didn't finetune any model
as for our submission code, I will make public if our team makes the gold zone in private
I mean the RAG solutions
" The private leaderboard is calculated over the same rows as the public leaderboard in this competition. " Does this mean that the validation set is the same as public leaderboard?
Also I don't know the ceiling of my solutions
Those 2 our team selected, were all submitted thrice (only minor changes between them), and they got the same score all 3 times
For high temp, I wouldn't expect it to give the same score unless the seed is fixed in such a way that it always outputs the same token
But how can the seed be set to achieve this?
The following does not seem to work:
import os
import numpy as np
import torch

def seed_everything():
    seed = 200
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TOKENIZERS_PARALLELISM'] = 'true'
    # Seed the NumPy and PyTorch global generators
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    print('-----Seed Set!-----')
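A toy sketch of what "fixing the seed" has to achieve for sampling at temperature 0.9: the sampler itself must draw from a generator that is seeded identically on every run. Seeding the NumPy and PyTorch global generators only helps if the generation backend actually samples from those generators, which is worth checking for whatever engine is in use. The logits and the helper names `sample_token` and `generate` below are made up for illustration and are not taken from any notebook in this thread.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample one index from `logits` after temperature scaling."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    # Softmax numerators; random.choices normalises the weights itself.
    weights = [math.exp(value - top) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(logits, temperature, seed, steps=20):
    """Draw `steps` tokens using a private generator seeded with `seed`."""
    rng = random.Random(seed)  # not the global generator
    return [sample_token(logits, temperature, rng) for _ in range(steps)]

toy_logits = [2.0, 1.5, 0.5, 0.1]  # hypothetical values
run_a = generate(toy_logits, temperature=0.9, seed=200)
run_b = generate(toy_logits, temperature=0.9, seed=200)
print(run_a == run_b)  # True: same seed, same sampler, same tokens
```

If two runs with the same seed still differ, the randomness is coming from somewhere this seed does not reach.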
Scored 26 again, will run it a third time, I think it will score 26 again ...
The idea I implemented two days ago is proving stability so far but there is a catch ( I will discuss it more when I know what score it gives on PB)
Also I am a bit afraid of a timeout since it finishes in around 8 hours (I have safety time monitor code, but I am not sure if it's bug-free)
Is that 26 on the private score ? I think it was said that after the competition, the hidden test set will be the 50 private LB problems
No public of course (doing after deadline tests)
I don't think they ran the private set yet
I doubt
I hope to score something near 26 with this code in private ( they said that private is similar in distribution as public) but we have to wait and see
I am waiting for the re-run of my 2-month-old 25 to give a score but it's taking more than usual!
26, I guess you got to get a monetary prize for sure
Congratulations
I think it will take time to run all notebooks on private data and then do validation, because if you are using external data or finetuning they need to check the process, the data, the "date limit - 23-feb" and so on. If you are using other models, those need to be checked for dates too ... it will take time
I will not jump to that just yet! As I said there is a catch ... (besides, the code is from a 2-day-old sub so it's kinda new and will not hold up in a tie)
that's why I also selected the first 25 notebook (which was also the first on the LB)
Hi @vital cave: If you do not mind answering, did you finetune an LLM to get the 26? I am just curious and not looking for details:)
No, finetuning didn't work (or I am not experienced enough to make it work just yet 🙂). My solution is based more on repeats and stability
Thanks for the reply
I do think 26 private score will be a prize winner
Also I think those successful fine tunes will do better than the zero shot on the private LB, significantly ~+4-5 points gap
Say 2 people get 23 on public, one fine tuned and one zero shot, my guess is the fine tuned gets a 23 on private while the zero shot gets a 15-17
Monetary prize will surely be 25+
haven’t seen an unstable method robust on private score before in past comps…with some rare exceptions
If many people can't make it work with a certain method, the small proportion that makes it work wins out most of the time
Not really, I recently executed my unstable code again and it is able to maintain a score of ≥20.
We didn't experiment it much but the highest score it had was 24 on public lb
That’s public score, private score may not be 20+
Yea I understand the difference, what I am talking here is about the variance.
The variance of public lb is not much, so we shouldn't expect more than ±2 on private provided the question type remains the same
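A rough sanity check on the ±2 figure, under a strong simplifying assumption that is not from the thread: treat each of the 50 problems as an independent coin flip with the same solve probability, estimated from the public score. The helper name `score_std` is made up for this sketch.

```python
import math

def score_std(num_problems, solve_rate):
    """Standard deviation of a binomial score with a fixed solve rate."""
    return math.sqrt(num_problems * solve_rate * (1.0 - solve_rate))

for public_score in (20, 24, 26):
    solve_rate = public_score / 50
    print(public_score, round(score_std(50, solve_rate), 2))
```

Under this model one standard deviation is already about 3.5 points for scores in the 20 to 26 range, so a swing larger than ±2 on a fresh set of 50 problems would not be unusual. Real problems are neither independent nor equally hard, so treat this only as an order-of-magnitude guide.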
My 25 points 2 month old notebook is on its way to timeout ! I tested it (re-run it) for 5 times .. this is the first timeout ... will resubmit again
You did consider time management right?
I did, but there is a rare situation where vLLM gets stuck in some kind of inner loop. I didn’t look deeper into this. The notebook hasn't stopped so far but it should (this may be because it's an after-deadline submission, so the 9 hrs don’t apply)
Oh ok
It just failed (timeout). I didn’t face a timeout before with this notebook, which is weird
Any way I submitted again
Even I am getting weird observations lol.
I executed both codes multiple times, and most of the time they gave a stable score of 20, but sometimes 14-15
Did you try after deadline ?
Yea
Out of 5 tries, the unstable one with score 24 gives 20 thrice and 15 twice, whereas the stable one with score 21 gives 20 four times and 14 once
So the scores of 15 and 14 happened only after the deadline?
yep
Hmmm
And also I didn't expect 20 in 3 or 4 consecutive submissions. This behaviour shows how much luck would matter
Luck will play a part even in gold I guess
But why now the 15 and 14
It's a far, far possibility, but are we still testing on the public dataset? Hmm, because I never faced a timeout with that submission and now I did
Yea something to wonder, I even wonder about the 20 ones how come it's so stable
Anyway I resubmitted both my selections
I have hopes for my second selection ( I think it will score 26 again on public) I hope to give good result on private
I resubmitted both our choices too
Is 15 and 14 the executions on the private dataset 🤷♂️
I'm fully expecting my submissions to score like this on private
If my code cannot find an answer for every question within 630 seconds, it will get timeout...those buffers add up
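To put numbers on "those buffers add up": a minimal sketch of the time budget, using the 9-hour limit, 50 problems and 630 seconds per problem quoted in this chat. The 300-second safety margin and the function names are assumptions for illustration, not anyone's actual submission code.

```python
TOTAL_LIMIT_S = 9 * 3600   # 9-hour notebook limit mentioned in the chat
NUM_PROBLEMS = 50
FIXED_BUDGET_S = 630       # per-problem limit quoted above

def fixed_plan_slack():
    """Seconds left for model loading etc. if every problem uses its full budget."""
    return TOTAL_LIMIT_S - NUM_PROBLEMS * FIXED_BUDGET_S

def adaptive_budget(elapsed_s, problems_left, safety_s=300, cap_s=FIXED_BUDGET_S):
    """Per-problem budget that shrinks as the remaining time runs out.

    safety_s is an assumed margin kept back for writing the submission.
    """
    remaining = TOTAL_LIMIT_S - safety_s - elapsed_s
    if problems_left <= 0 or remaining <= 0:
        return 0.0
    return min(cap_s, remaining / problems_left)

print(fixed_plan_slack())          # 900
print(adaptive_budget(0, 50))      # 630
print(adaptive_budget(30000, 10))  # 210.0
```

With a fixed 630 seconds per problem only 900 seconds of slack remain in the worst case, so recomputing the budget from the wall clock before each problem is one way to avoid overrunning.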
let's see
Idk as stated it gave 20 thrice and four times before giving that 14 and 15
My bad it's 20 thrice for both of them and 19 once, let's see what it gives next
Ok so I guess it’s still the public dataset that is there
No way with different problems the scores are 1-1 mapping
I doubt, my 24 one is highly unstable and it seems to give 20 most of the times now.
In this competition the public score will always equal the private score on the Kaggle leaderboard since it's designed that way
The whole dataset will be changed to new one (no split)
So it will be new public/private
Yea thats fine, the doubt remains the same, how come such unexpected scores. I expected my 24 solution to always give different answers between 14-23, and 21 one to be between 18 and 21, but things are more stable than expected
BTW earlier in discussion, someone talked about finetuning. Yea the current lb topmost solution is based on finetuning
I am not sure how but suddenly our rank has increased by 2, anyone else experienced it?
Sorry for them. They were so close to winning the Jackpot:)
Submissions have now been disabled
My 2nd sub scored 26 again (3 times in a row, so it's stable as I thought). I hope it scores well on PB
My 1st sub is still scoring so I will wait and see
Now the first sub gives submission scoring error, maybe because of the rerun of private data currently
Anyone has active subs scoring and failed ?
Most likely the ids changed and the scoring function couldn’t match ids with ids in the sub
So I hope this means my selection didn’t timeout this time
Yes mine shows scoring error as well
Ok so now we wait 🙄🙏 good luck to you all
I noticed that I dropped 2 ranks (rank 6 now) with score 26, the guys that were after me are now ahead of me hmmm
is it possible that it was re-run and I scored 26 again, which dropped me to last among the 26s ... (in case of a tie I will lose rank .... since it's a 2-day-old sub) but also the notebook needs 8 hours to finish and it's early!
any one noticed changes in ranks or selected scores ?
with same score ?
Yea I can't see change in score on lb
I've seen others with lower scores
I've noticed a user with score 22 now and was 25
and there is fewer 25s now
Oh, my score remains the same as of now, i.e. 24
then its either they didn't run your notebook yet or it scored 24 in private or I don't know ... I am super impatient ! 🙂
Hmm, not sure but I wouldn't expect 24 on private lb tbh 😅😅
but they said they will shortly start (was 5 hours ago ...) ... notebooks didn't finish yet ....or they started earlier
Probably didn't finish yet, but not sure why sudden change in ranks
hmm then your higher rank is due to others' decrease in score .... it's going to be two looooong days
or started earlier or we were testing (after deadline) on private data
since my score didn't change I am still 26 but I ranked last in 26 this means that its a new 26 (newer than the others) so its a re-run
Or current board reflect only our selections
Yea rerun is for sure, I guess they must have started it earlier, not sure.
@rare sandal did you select 21 as your best sub ?
What do you mean?
I selected a new 26 and an old 25, thus my new 26 ranked last among the 26s because it's newer than the others
Then this wouldn't have happened
unless they didn't select the 25 🙂
that's why I am asking @rare sandal, because I think they scored 22 as best and now they are 21
Oh yea yea my bad
I think they reflect only the 2 selected subs (value and time of submission) in this way they will need only to replace those notebooks values with new values of the private dataset and the leaderboard will auto sort
But this will be bad, they should have directly ordered it based on private lb
Ordering based on public lb, checking private scores, reordering ties based on time, things just get more complex
This is to fix the ties, currently it's ordered on the two selected subs and ties are resolved by time
That's why I dropped 2 ranks
Now they need to rerun our notebooks and your score will shift you
This means at worst I will rank 4th among the 26s
This will happen only if other 4 get 26 as well
Or the other possibility, which I guess you are aware of
Basically ties should be fixed after having evaluated on private lb, that's what I wanted to say.
On public lb it makes no sense
Btw this seems bad, private lb is calculated with all data
The time management we have is for 50 questions, and not 100
No private with new 50 questions only
Oh then it's fine lol
Yes
Are your latest submissions running indefinitely?
Mine hasn't finished even after 12 hours
Yes, it gave me a fail notification but it still running or it is actually "pending", this is because Kaggle stopped all running submissions, it is a UI bug I guess
The private results are out and will not change much, right?
No they are not
Currently your selected score is shown (best of 2) but that's on the public set
Thanks Ali
ah, I saw a public code today…it’s so easy to remove docstrings lol, just 2 lines 😁
Meanwhile mine is, over-engineered, if-else statements, go through token by token in the generation to search for triple quote, and then I believe it still doesn’t cover all cases with no improvement in the CV score…ended up not selecting it due to the later submission time (by 2 weeks) 💀 😓
don’t know what I was thinking back then 😑
why is it taking so much time to display private scores?
I thought they needed only 9h to get the results for all submissions
9 hours * 1393 teams * 2
It's a lot
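Spelling out that multiplication as a worst-case estimate: it assumes every selected submission uses the full 9 hours, which is an upper bound rather than a measurement. The machine counts are hypothetical, since the size of Kaggle's pool is not stated anywhere in this thread.

```python
HOURS_PER_RUN = 9     # notebook time limit quoted in the chat
TEAMS = 1393
SELECTED_SUBS = 2     # submissions re-scored per team

def total_machine_hours():
    """Upper bound on compute if every run takes the full limit."""
    return HOURS_PER_RUN * TEAMS * SELECTED_SUBS

def wall_clock_days(machines):
    """Days needed when `machines` runs proceed in parallel."""
    return total_machine_hours() / machines / 24

print(total_machine_hours())  # 25074
for machines in (50, 200, 1000):  # hypothetical pool sizes
    print(machines, round(wall_clock_days(machines), 1))
```

So the runs can proceed in parallel, but the elapsed time depends entirely on how many machines are assigned to the job.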
Well, Kaggle got multiple GPUs right, all submissions can run in parallel or am I missing something?
I think they should have
there was an update yesterday in the pinned topic
UPDATE 07/01/2024: Submissions are still scoring on the private set. We're now targeting sometime tomorrow (Tuesday) for finalization. We've increased our pool of machines for this job as well, so progress should accelerate.
Thanks for the update
hope today ends and we have the private LB guys! i am curious to see if there is shakeup
Let's hope there isn't 🥲🥲
Is it possible that some notebooks are shutting down the compute resources for the competition with commands like sudo reboot or killall -u kaggle --nopassword. It is taking too long😁
Who had a dream about the Private Leaderboard? I had one, saw my ranking, and it was surprising!
I woke up today to find out that the competition was gone from my home page ! I was in shock! This could mean one thing, my subs failed thus I have no subs ! then after a couple of minutes I realized that the competition is "completed" (according to Kaggle system) and thus will be removed from the home page! .... that's a nightmare more than a dream 
When I saw that happened I thought the LB is finalized 😅
lol I'm just hoping that my "code repairing and postprocessing" works out on private. There are some cases where it fails on the CV but I'm banking on it being a net positive 😅
CV says net positive but public LB is same
i think it is nasty not to have an update on what's happening, but i understand. maybe we get news soon
My 2 month notebook only scored 😦 My stable and should finish in time notebook seems not to score! This is the 2nd time in a competition I face 1 sub scoring only! The other one till now I don't know why it didn't score. I hope at least I get feedback on this 😦
Congrats to you all 🙂
Today I got a silver and a bronze competition medals ... gold yet to be reached .... again the curse of rank 21 ....
I need a break
And I am currently ranked 99 worldwide in competitions, a milestone achieved 🙂
12...
Our selected subs have one 18 and one 12 LOL
and they both have the same CV on old AIME ...
24/50
The model is so great that repairing those "timeout 7" and code issues reduced the CV score
this still doesn't make sense to me
what do you mean by "repairing those "timeout 7" and code issues"? Telling the LLM to fix them? I worked on that too, telling it the code timed out, or giving it a truncated backtrace on exception
but I have very little evidence that it actually helped at all, due to instability and other bugs which most of my submissions suffered
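A minimal sketch of that feedback step: run the generated code in a subprocess with a time limit, and on failure build a short message (timeout notice or truncated traceback) that could be appended to the next prompt. The 7-second limit mirrors the "timeout 7" mentioned in this chat. The function name and message wording are made up for illustration.

```python
import subprocess
import sys

def run_generated_code(code, timeout_s=7):
    """Run model-written code; return (ok, output_or_feedback)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, (
            f"The code did not finish within {timeout_s} seconds. "
            "Try a more efficient approach."
        )
    if proc.returncode != 0:
        # Keep only the last lines of the traceback to save prompt tokens.
        tail = "\n".join(proc.stderr.strip().splitlines()[-3:])
        return False, "The code raised an error:\n" + tail
    return True, proc.stdout.strip()

ok, result = run_generated_code("print(1 + 1)")
print(ok, result)  # True 2
```

Whether feeding this message back actually improves the next attempt is exactly what the thread is unsure about, so it is worth measuring against simply discarding the failed attempt and resampling.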
I'm really amazed by the big gap between 1st and 2nd. Does it show that all the stuff we've been doing such as I just described is a waste of time, and only training the LLM further actually creates an improvement that generalises?
How did you check the performances of both of them?
I ended up getting 15 lol, not sure if it was a bad run or an issue
I see my public lb score in private as well
Telling the LLM to use a more efficient approach
The public notebook, it doesn't do anything to hint the LLM that the code has issues so it just generates the exact same thing again and gets REPEATED ERRORS
and ya it doesn't help, which I'm surprised
There's only 2 submissions that will show your exact private score
No, it shows the public lb score for me even on those submissions
Oh ok my bad got it, it shows in recently executed one's, my bad
Is your 15 the stable or unstable notebook ?
Unstable
What did the stable one get ?
It's damn bad, just 8
I guess lower temp, did show its effect there. Lack of creativity in the generated solution
20 to 8 is insane
I wouldn't have imagined getting single digits from a 20 point public submission given same topics and difficulty
It was 21, 20 was stable on public lb
I am just waiting for Numina to drop their solution, they are the only ones who were able to have a stable score...
29 both in public and private
Just wondering how do y’all debug vLLM when your CV is getting perfectly normal results but the LB is like 10-15. Yes, perfectly normal even when committing on Kaggle T4
there’s no clue on what is wrong
I see vLLM solutions that made to 20+ now
can’t just trial and error when there’s only 1 or 2 submissions a day…
I am really disappointed that my stable vLLM notebook with 26 on public LB didn't score. I was afraid of the one that did score :/ .... I've set each question to 460 secs, I don't see a reason for timeout unless something inside vLLM got stuck 😦 ... evaluating the private board on one rerun was not in my favor. I can't say it's unfair now because we know from the beginning it will be tested once, but I can't feel good about it. I am pretty sure after many many tests that this notebook would have scored nice if it ran successfully. .... maybe next time will do better ...
have you tried reaching out to competition hosts and explaining them your situation?
they might give it a shot to run your notebook
I don't think this will work, besides if they gave me then they need to give everybody a new chance and so on. This competition (Phase1) is now complete.
also, the rules mention that it's our job to make sure our subs work both public and private so ..
The one thing I really wish to know is the error in my 2nd sub (is it timeout, internal error ...) because I am planning to start from it in phase2.
My advice to myself and all is to come prepared to the next phase! I don't think 2nd phase will run without famous grandmasters joining the race.
telling the LLM that the code was too slow definitely changes its behaviour, though. I see it trying to change its code as a result, saying "We need to do this in a more efficient way", etc. However I haven't inspected how often it actually succeeds, but I think nearly never
and ultimately that's the reason that all these attempted tricks, such as getting it to verify, just didn't help significantly: if its solution/working has gone off track it's better to just throw it away and start over clean
so on the private LB you just get an inspecific submission failed message? I don't see any reason why they would intentionally make it vaguer than on public LB
people were reporting unusual timeouts in the last days of the contest presumably due to high load so I guess the same happened when rerunning all submissions at once
but I'm not filled with confidence in vLLM
Did you try fixing the docstrings issue ?
I.e. the docstring starts with 3 quotes and ends with 1. Once you fix, you will get an answer
I tested it and the CV dropped by 2 points
I disallowed it from generating docstrings entirely. However, repeating the question in the docstring might have helped it focus, I don't know how much removing it helped beyond fixing the quotes problem
helped or hindered
I just let it generate and if it throws the dreaded "timeout 7 returned non-zero exit status" then I re-execute it with the docstring removed
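A heuristic sketch of the "remove the docstring and re-execute" idea, covering the malformed case described above where a docstring opens with three quotes but closes with one. This is not the public notebook's two-line version nor anyone's submission code. The function name is made up, and the closing rule (a line ending in a single matching quote also ends the docstring) is an assumption that can misfire when a docstring line legitimately ends with a quote character.

```python
def remove_docstrings(code):
    """Replace statement-level triple-quoted blocks with `pass`."""
    out = []
    delim = None  # the triple quote currently open, if any
    for line in code.splitlines():
        stripped = line.strip()
        if delim is None:
            opener = next((q for q in ('"""', "'''") if stripped.startswith(q)), None)
            if opener is None:
                out.append(line)
                continue
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + "pass")  # keeps the enclosing block non-empty
            rest = stripped[3:]
            closed = opener in rest or (rest != "" and rest.endswith(opener[0]))
            if not closed:
                delim = opener
        elif delim in stripped or stripped.endswith(delim[0]):
            delim = None  # proper close, or the malformed single-quote close
    return "\n".join(out)

broken = (
    "def square(n):\n"
    '    """Return n squared.\n'
    '    Uses multiplication."\n'
    "    return n * n\n"
)
cleaned = remove_docstrings(broken)
```

Here `broken` fails to compile because the triple quote is never closed, while `cleaned` compiles and keeps the function body. Triple-quoted strings used as values (for example on the right-hand side of an assignment) are left untouched because their lines do not start with a quote.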
It has turned problems from wrong to correct, but also the other way round
The only thing I found that gives net positive almost all the time is postprocessing the result (in fact its >= +0 100% of the time)
Wondering why it would turn correct to wrong, did you verify your docstring code?
Because the problem itself cannot be solved with code
All the correct responses are from CoT answers
It only helps if the code answer turns out to be correct
We didn't even get an error msg! I can see only one selected sub scored and nothing else, I have no clue what happened with the second submission. It's as if the second sub doesn't exist. So getting at least the "inspecific submission failed message" would be helpful, but we are not even getting that.
Oh ok, I thought it was causing some errors, but if that's the case then how come it was fine when docstrings were there? I assume you have a separate function to remove docstrings and not just ask deepseek to remove those
Yeah
Then I don't get why it was initially correct and then wrong, coz deepseek is gonna generate the same code
The wrong code answer counts towards the votes
Say your initial logic gives [(52, 3), (36, 2)] and 52 is correct
But you fix the code and it turned two tries from failed to 36, it becomes [(36, 4), (52, 3)] and 36 becomes the maximum vote
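The same example as a tiny majority-vote sketch. `None` stands for an attempt whose code failed and produced no answer. The seven-attempt lists are made up to match the counts above.

```python
from collections import Counter

def vote(answers):
    """Return (answer, count) pairs, most common first; failed attempts are skipped."""
    return Counter(a for a in answers if a is not None).most_common()

before_fix = [52, 52, 52, 36, 36, None, None]  # two attempts crashed
after_fix = [52, 52, 52, 36, 36, 36, 36]       # the repaired code now answers 36

print(vote(before_fix))  # [(52, 3), (36, 2)]
print(vote(after_fix))   # [(36, 4), (52, 3)]
```

Repairing failed code only helps the vote if the repaired runs land on the right answer; otherwise they add weight to a wrong one.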
Cool, makes sense
Why did the leaderboard scores drop after the competition was over? My best score was 18 and now it's 15.
This is still the biggest question mark for me. Why some user’s implementation of vLLM cannot score above 15, while others can get 20+. I see the code for vLLM generation part, it isn’t much different
Our best vLLM sub iirc is 25/50 CV and 13/50 LB…and that was after trial and error fixing two timeouts
I do think the CV cannot differentiate these “normal scoring” vLLM subs and “poor scoring” vLLM subs like it cannot tell why my notebook gets timeout
With transformers if I get a 25/50 CV it usually gets LB 20 or 21
My hypothesis is when it got re-run after submission, it ran into lots of OOMs before generating the answer, idk if it’s right
Why did the leaderboard scores drop after the competition was over? My best score was 18 and now it's 15. And I saw the same for all other top contenders.
Coz the data it was evaluated has been changed
you got higher than me lol
lol what was your final score?
what is public score and private score here in this version? Are these scores of both datasets that were used before and after competitions completion?
Before saying 15 is bad, my RAG and hint retrieval submission got 12 points 😂
did you post your implementation of RAG?
I was looking for one before the competition was over
That's very surprising to me, I know the algorithm is imperfect but I didn't expect to turn out so badly
nope...didn't get a gold so didn't share code or solution
can you share the notebook with me?
If I share I will share publicly
There are 1 or 2 implementations in public notebook
But low scores
I don't like private sharing, if you do that in competition it is considered cheating
18 without RAG and 12 with RAG, everything else is the same, for the 2 that our team selected
that's interesting
I don't understand. My best score is 17 but the one on leader board is 15
12
and I did it with RAG
without rag i would've gotten 10
for some reason my RAG submissions always scored with a score+2
The more solutions coming out the more I feel bad about this competition. It's a game of luck for me. I am pretty sure that a freeze in vLLM somewhere lost me the opportunity to score my second sub! Which never froze on colab.
Memory issues with 2xT4 , many runs at the same time , Servers super busy with re-runs ... all of this can play a part.
I wasted many hours stabilizing my work with 27/28 validation and 26 LB to be beaten by some hardware issue 😦
If the approach for 2nd phase will be the same ( one rerun and let luck be your friend ) then I am out.
Also, I think the scoring should be weights and points, harder problems should have more points and easier problems should have fewer points and the scoring should be calculated even if the notebook froze. (scoring 0 points on the rest of the questions) .
yeah it kinda sucks that when you've been working on something for so long and it just goes down the drain cause of luck...
but hey, there is always a next time
phase 1 is finished, it's in the past. Better focus on second phase
I know there is always a next time, I just don't want next time to be the same as phase 1 (an extended phase 1) 🙂
nah it won't be the same
Agreed, it is crazy to spend 8 months of effort only for your submission to fail on the rerun 😭
wasn't it 3 months?
Phase 2 seems to be 8 months from the host's comments
oh i thought you talking about phase 1
damn 8 months is nice
we can freely experiment
although the hype will be so low
Yeah, I hope they don't put a limit on open source based on the start date lol
Better to put it 2 months before the end date if they want to
e.g. Sep 1 2024 to May 1 2025, can cut off open source on Feb 23 2025
lmao
@rare sandal you were right, they fine tuned deepseek and created their own model
numina just dropped a solution
they also have some nice resources 🙂
I am testing their model now, I replaced Deepseek original one with theirs in my code (I haven't used their approach yet) I want to see how validation changes (originally 28/50 with my approach)
That's very weird. I used self-consistency copied from your notebook and I got a higher score. But there's one important addition I made. I implemented the idea of question rephrasing just like in the MetaMath paper. I guess that helped a lot, but I'm still not sure if it really did. Need to run some tests to better understand.
Check this post of mine to see what I did to score better: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/519363
yeah the 8 x H100 GPUs sent me waaay off the cliff
they also implemented Tree of Thoughts... which i struggled to wrap my head around it for 3 weeks
ooooh... that's interesting approach
didn't know about metamath...
"3 may 2024"
bruh it's a new paper...
the paper's new, but I just took their idea of question rephrasing and used it in decoding
it's an idea they used to augment data
congrats on the score
I wish I had more time for this competition, might have implemented RAG or vLLMs
I put my last hope into Learning RAG from scratch and just focus on it the last 7 days
the phase 2 will be wild
Yep I think so too
so better prepare slowly from now
I want to do that too
I was so looking forward to implementing RAG but didn't have time
it scored 30/50 on my validation set, so an improvement of 2, but they confirmed using the dataset in their training so there is overfitting (if the 28/50 isn't already overfitting)
I will try to run their approach in full, not only the model
I was testing Qwen2 on the train data, I just saw this. What kind of brilliance was that ?
Btw, this isn’t the full code, there are still nested loops after that
4/10 with transformers and 6/10 with vLLM (on colab) 😀