#ai-mathematical-olympiad-prize | Kaggle | Page 2

rare sandal Jun 28, 2024, 12:38 AM

#

Publicly shared data and some textbook corpus from GitHub

#

If I end up doing well I will share

vital cave Jun 28, 2024, 12:38 AM

#

I didn’t play much with prompt and ended with the one shared in the 20 points notebook before sub api

serene fiber Jun 28, 2024, 12:39 AM

#

lol for us prompt helped reach 24

rare sandal Jun 28, 2024, 12:39 AM

#

I didn’t try to fine tune prompt LOL

#

Actually I did, but it solves 1 problem and breaks 2 problems haha in validation

#

So I didn’t end up choosing

vital cave Jun 28, 2024, 12:40 AM

#

The problem is it's so slow to try fine-tuning the prompt lol

serene fiber Jun 28, 2024, 12:41 AM

#

I had local GPU so was not a big deal

rare sandal Jun 28, 2024, 12:41 AM

#

In my validation set for new AIME I added like 10 problems which I created by myself holyfuck

#

I would say none of these are more than 3/10 difficulty for the human

vital cave Jun 28, 2024, 12:43 AM

#

I resubmitted my notebook from yesterday if my calc is correct it should rescore 26 again

serene fiber Jun 28, 2024, 12:44 AM

#

vital cave I resubmitted my notebook from yesterday if my calc is correct it should rescore...

Lol, do let us know the score 🙂

vital cave Jun 28, 2024, 12:44 AM

#

Will take 8 hrs but sure I will

rare sandal Jun 28, 2024, 12:45 AM

#

My prediction for my private score is 13 points, it depends a lot on how much the public notebook will score…from my validation, I don’t see that it is significantly better

(the 21 that is trending in notebooks is the public NB I’m referring to)

#

🤣

vital cave Jun 28, 2024, 12:46 AM

#

I have no predictions … I want a solo gold and a money prize will be nice hhhh but I expect 2 timeouts just to not be in shock if this happens

serene fiber Jun 28, 2024, 12:47 AM

#

vital cave I have no predictions … I want a solo gold and a money prize will be nice hhhh b...

Oh aiming for GM

vital cave Jun 28, 2024, 12:47 AM

#

Ya one day 🙂

serene fiber Jun 28, 2024, 12:49 AM

#

Great

rare sandal Jun 28, 2024, 12:50 AM

#

? What lol

serene fiber Jun 28, 2024, 12:51 AM

#

rare sandal ? What lol

I assume it's for the time being, it will change.
Lol if this happens most of us will get a free medal

rare sandal Jun 28, 2024, 1:03 AM

#

Did yall do any postprocessing ?

#

I did lol

serene fiber Jun 28, 2024, 1:04 AM

#

rare sandal Did yall do any postprocessing ?

Not really, just identified some keywords for difficulty and gave those questions some extra time.

Though I thought of using Gemma to explain DeepSeek the question but it didn't work (VRAM exceeded)
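
A minimal sketch of the keyword-based time allocation mentioned above. The keyword list, the base budget, and the bonus are placeholder assumptions, since the actual values were not shared in the chat.

```python
# All three constants are hypothetical values chosen for illustration.
BASE_SECONDS = 600
EXTRA_SECONDS = 120
HARD_KEYWORDS = ("polynomial", "probability", "geometry")


def time_budget(question: str) -> int:
    """Return the number of seconds a question is allowed to use."""
    text = question.lower()
    if any(keyword in text for keyword in HARD_KEYWORDS):
        # Questions that look hard get the base budget plus a bonus.
        return BASE_SECONDS + EXTRA_SECONDS
    return BASE_SECONDS


print(time_budget("Find the probability that two dice sum to 7."))
print(time_budget("What is 2 + 2?"))
```

Any bonus handed out this way still has to fit inside the overall notebook time limit, so the number of flagged questions matters as much as the size of the bonus.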

valid shoal Jun 28, 2024, 1:04 AM

#

rare sandal ? What lol

lol did we all get trolled?

#

i.e. we should have overfitted on public lb score?

serene fiber Jun 28, 2024, 1:05 AM

#

I wish that happens, would get a free silver and can be the fastest expert 😅😅

vital cave Jun 28, 2024, 1:11 AM

#

Its just kaggle system
I guess it will take a week to have the final update

serene fiber Jun 28, 2024, 1:12 AM

#

Lol my bad it's on the comment

vital cave Jun 28, 2024, 1:12 AM

#

Hhhh unfortunately didn’t 🙂

serene fiber Jun 28, 2024, 1:12 AM

#

Deleting

vital cave Jun 28, 2024, 1:12 AM

#

No worries

serene fiber Jun 28, 2024, 1:13 AM

#

It's silver for the competition discussion and not the lb 🤣🤣🤣

vital cave Jun 28, 2024, 1:14 AM

#

Hhhhh on other issue I don’t think the 100 sub will stay much after competition
As far as I understand the test data will be the same in phase2

#

I just reran my final subs again

serene fiber Jun 28, 2024, 1:15 AM

#

Yea that's the reason I plan to start it now irrespective of the results, just exploiting it as much as possible.

vital cave Jun 28, 2024, 1:16 AM

#

Ya because I dont think this window of 100 subs will stay much

serene fiber Jun 28, 2024, 1:16 AM

#

Let's see, ideally it shouldn't but hoping too

rare sandal Jun 28, 2024, 1:53 AM

#

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/515379#2893667

AI Mathematical Olympiad - Progress Prize 1

Solve national-level math challenges using artificial intelligence models

#

lol this is insane

#

shows how volatile the model is

#

This though…I’m not sure how it would hold on private, there is a chance it shakes down if the prompt is fitted to public…

serene fiber Jun 28, 2024, 2:01 AM

#

rare sandal This though…I’m not sure how it would hold on private, there is a chance it shak...

Fitting prompt for public lb is kind of impossible. The prompt would give similar performance across problems given that you use the same model

rare sandal Jun 28, 2024, 2:04 AM

#

Is your prompt that gave 24 also the one that gave the best local validation score?

#

For our team, it's not correlated

#

I have a prompt which scores 26 on CV but only 9 on LB. But the baseline scores 22 on CV and 21 on LB, for example

#

If it does you're probably safe

#

I tried a few times to tune the prompt to fit 2 or 3 validation problems where the model is nearly correct, it always performs worse than the public prompt

#

You turn 2 or 3 from wrong to correct, but instead turn 5 from correct to wrong

serene fiber Jun 28, 2024, 2:08 AM

#

Can you try resubmitting those solutions, I would expect it to be unstable

rare sandal Jun 28, 2024, 2:09 AM

#

Hmm I set a seed 🤔

serene fiber Jun 28, 2024, 2:10 AM

#

You can just remove that and run it a few times, coz if this is true then a lot of research is just a waste of time 🤣🤣🤣

rare sandal Jun 28, 2024, 2:11 AM

#

No seed I tried before it's not stable and max - min can be up to 9 points

serene fiber Jun 28, 2024, 2:11 AM

#

By research I mean actual research, the one which proper academia and industry folks do

rare sandal Jun 28, 2024, 2:12 AM

#

I didn't submit too many times for the same code though but 24-25 is definitely possible with a lucky run

serene fiber Jun 28, 2024, 2:13 AM

#

rare sandal No seed I tried before it's not stable and max - min can be up to 9 points

That's what I would expect, what was your temp for those solutions and did you effectively manage the time?

rare sandal Jun 28, 2024, 2:13 AM

#

serene fiber That's what I would expect, what was your temp for those solutions and did you e...

temp is 0.9 and time limit is 630 per problem

serene fiber Jun 28, 2024, 2:13 AM

#

rare sandal I didn't submit too many times for the same code though but 24-25 is definitely ...

This is true, but the interpretation is completely different i.e prompt overfitting on lb

serene fiber Jun 28, 2024, 2:14 AM

#

rare sandal temp is 0.9 and time limit is 630 per problem

One of the reasons for unstable and again why it can be lucky to get 24-25

rare sandal Jun 28, 2024, 2:17 AM

#

assumed the seed was set

#

my bad, I saw the discussion post and assumed it was a stable 25

serene fiber Jun 28, 2024, 2:20 AM

#

Yea still I mean the underlying stochasticity with higher temperatures is high. Also I assume the top p would be something around 1 for the submission

#

If any of the higher temp solution gets a good score, I would just assume it's got to be lucky irrespective of the prompt or any other modifications

rare sandal Jun 28, 2024, 3:23 AM

#

ah he said his notebook is not stable

#

I got it wrong then

rare sandal Jun 28, 2024, 3:46 AM

#

I think scores may be released tomorrow

#

Last time Optiver comp, it took 2-3 days to run 2 submissions each for 3.4K teams

#

That one was compute intensive and submissions take very long to score as well

serene fiber Jun 28, 2024, 3:52 AM

#

rare sandal I think scores may be released tomorrow

Hope so

serene fiber Jun 28, 2024, 3:52 AM

#

rare sandal ah he said his notebook is not stable

Makes sense, btw if you don't mind can you make your finetune code public? Would like to experiment over that

rare sandal Jun 28, 2024, 3:56 AM

#

serene fiber Makes sense, btw if you don't mind can you make your finetune code public? Would...

?? I didn't finetune any model

#

as for our submission code, I will make public if our team makes the gold zone in private

serene fiber Jun 28, 2024, 4:05 AM

#

I mean the RAG solutions

proper ferry Jun 28, 2024, 6:15 AM

#

" The private leaderboard is calculated over the same rows as the public leaderboard in this competition. " Does this mean that the validation set is the same as public leaderboard?

rare sandal Jun 28, 2024, 6:29 AM

#

serene fiber Can you try resubmitting those solutions, I would expect it to be unstable

Also I don't know the ceiling of my solutions

Those 2 our team selected, were all submitted thrice (only minor changes between them), and they got the same score all 3 times

serene fiber Jun 28, 2024, 7:26 AM

#

For high temp, I won't expect it to give the same score unless the seed is fixed in such a way that it always outputs the same token

silk sandal Jun 28, 2024, 7:37 AM

#

But how can the seed be set to achieve this?

#

The following does not seem to work:

import os

import numpy as np
import torch

def seed_everything():
    seed = 200
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TOKENIZERS_PARALLELISM'] = 'true'
    # Seed the NumPy and PyTorch (CPU) generators
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        # Seed the current GPU and all other visible GPUs
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    print('-----Seed Set!-----')
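
A toy illustration of what "fixing the seed so it always outputs the same token" would mean at temperature 0.9. It does not use vLLM or any real model: the logits are made up and `sample_tokens` is a hypothetical helper. The point is only that a sampler is reproducible when the generator it actually draws from is seeded. Whether an inference engine honours globally set seeds, or keeps its own random state, is a separate question that this sketch does not answer.

```python
import math
import random


def sample_tokens(logits, temperature, n_tokens, seed=None):
    """Draw n_tokens indices from softmax(logits / temperature)."""
    rng = random.Random(seed)  # local generator, unaffected by global seeding
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=n_tokens)


logits = [2.0, 1.5, 0.3, -1.0]  # made-up scores for a 4-token vocabulary
run_a = sample_tokens(logits, temperature=0.9, n_tokens=20, seed=200)
run_b = sample_tokens(logits, temperature=0.9, n_tokens=20, seed=200)
print(run_a == run_b)  # same seed on the sampler itself gives the same draws
```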

vital cave Jun 28, 2024, 8:17 AM

#

vital cave I resubmitted my notebook from yesterday if my calc is correct it should rescore...

Scored 26 again, will run it again for a third time, I think it will score 26 again ...
The idea I implemented two days ago is proving stability so far but there is a catch ( I will discuss it more when I know what score it gives on PB)

#

Also I am a bit afraid of timeout since it finishes around 8 hours (I have safety time monitor code - but I am not sure if its bug free)

rare sandal Jun 28, 2024, 8:20 AM

#

Is that 26 on the private score ? I think it was said that after the competition, the hidden test set will be the 50 private LB problems

vital cave Jun 28, 2024, 8:20 AM

#

No public of course (doing after deadline tests)

#

I don't think they ran the private set yet

rare sandal Jun 28, 2024, 8:20 AM

#

I don't know if Kaggle has changed the test set

#

It can be private now

vital cave Jun 28, 2024, 8:21 AM

#

I doubt

#

I hope to score something near 26 with this code in private ( they said that private is similar in distribution as public) but we have to wait and see

#

I am waiting for the re-run of my 2-month-old 25 to give a score but it's taking more than usual!

serene fiber Jun 28, 2024, 8:29 AM

#

26, I guess you got to get a monetary prize for sure

Congratulations

vital cave Jun 28, 2024, 8:29 AM

#

I think it will take time to run all notebooks on private data and then do validation, because if you are using external data or finetuning they need to check the process, the data, the "date limit - 23-feb" and so on. If you are using other models those need to be checked for dates also ... it will take time

vital cave Jun 28, 2024, 8:30 AM

#

serene fiber 26, I guess you got to get a monetary prize for sure Congratulations

I will not jump to that just yet ! as I said there is a catch ... (beside that the code is from 2 days sub so its kinda new and will not stand in a tie)

#

thats why I also selected the first 25 (which also was the first on LB) notebook

silk sandal Jun 28, 2024, 8:37 AM

#

Hi @vital cave: If you do not mind answering, did you finetune an LLM to get the 26? I am just curious and not looking for details:)

vital cave Jun 28, 2024, 8:40 AM

#

No, finetuning didn't work ( or I am not experienced enough to make it work just yet 🙂 ), my solution is based more on repeats and stability
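
The details of "repeats" were not shared. One common reading, and only an assumption here, is sampling several answers per question and keeping the most frequent one. A minimal sketch of that idea, with `majority_answer` as a hypothetical helper:

```python
from collections import Counter


def majority_answer(answers, default=0):
    """Pick the most frequent answer among repeated attempts.

    Attempts that produced no answer are passed as None and ignored.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return default  # fallback when every attempt failed
    answer, _count = Counter(valid).most_common(1)[0]
    return answer


# Five repeated attempts at one question, one of which failed.
print(majority_answer([52, 52, None, 17, 52]))
```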

silk sandal Jun 28, 2024, 8:41 AM

#

vital cave No finetuning didn't work ( or I am not experienced enough to make it work just ...

Thanks for the reply

rare sandal Jun 28, 2024, 9:09 AM

#

I do think 26 private score will be a prize winner

#

Also I think those successful fine tunes will do better than the zero shot on the private LB, significantly ~+4-5 points gap

#

Say 2 people get 23 on public, one fine tuned and one zero shot, my guess is the fine tuned gets a 23 on private while the zero shot gets a 15-17

serene fiber Jun 28, 2024, 9:13 AM

#

Monetary prize will surely be 25+

rare sandal Jun 28, 2024, 9:14 AM

#

haven’t seen an unstable method robust on private score before in past comps…with some rare exceptions

If many people can't make it work with a certain method, the small proportion that makes it work wins out most of the time

serene fiber Jun 28, 2024, 9:14 AM

#

rare sandal Say 2 people get 23 on public, one fine tuned and one zero shot, my guess is the...

Not really, I recently executed my unstable code again and it is able to maintain a score of ≥20.

We didn't experiment it much but the highest score it had was 24 on public lb

rare sandal Jun 28, 2024, 9:15 AM

#

That’s public score, private score may not be 20+

serene fiber Jun 28, 2024, 9:16 AM

#

Yea I understand the difference, what I am talking here is about the variance.
The variance of public lb is not much, so we shouldn't expect more than ±2 on private provided the question type remains the same

vital cave Jun 28, 2024, 9:24 AM

#

My 25 points 2 month old notebook is on its way to timeout ! I tested it (re-run it) for 5 times .. this is the first timeout ... will resubmit again

serene fiber Jun 28, 2024, 9:58 AM

#

vital cave My 25 points 2 month old notebook is on its way to timeout ! I tested it (re-ru...

You did consider time management right?

vital cave Jun 28, 2024, 10:21 AM

#

I did, but there is a rare situation where vLLM gets stuck in some kind of inner loop, I didn’t look deeper into this, the notebook didn't stop so far but it should ( this may be due to after deadline submission, the 9 hrs don’t stand out)

serene fiber Jun 28, 2024, 10:29 AM

#

Oh ok

vital cave Jun 28, 2024, 10:32 AM

#

It just failed (timeout), I didn’t face a timeout before with this notebook, which is weird
Any way I submitted again

serene fiber Jun 28, 2024, 10:59 AM

#

vital cave It just failed (timeout), I didn’t face a timeout before with this notebook, whic...

Even I am getting weird observations lol.
I ran both notebooks multiple times, and most of the time they gave a stable score of 20, but sometimes they give 14-15

vital cave Jun 28, 2024, 11:00 AM

#

Did you try after deadline ?

serene fiber Jun 28, 2024, 11:01 AM

#

Yea

#

Out of 5 tries, the unstable one with score 24 gives 20 thrice and 15 twice, whereas the stable one with score 21 gives 20 four times and 14 once

vital cave Jun 28, 2024, 11:05 AM

#

So the scores of 15 and 14 happened only after the deadline?

serene fiber Jun 28, 2024, 11:05 AM

#

yep

vital cave Jun 28, 2024, 11:05 AM

#

Hmmm

serene fiber Jun 28, 2024, 11:06 AM

#

And also I didn't expect 20 in continuously 3 or 4 submissions. This behaviour shows how luck would matter

vital cave Jun 28, 2024, 11:08 AM

#

Luck will play a part even in gold I guess
But why now the 15 and 14
Its a far far possibility but are we still testing public dataset ! Hmm because I never faced timeout with that submission and now I did

serene fiber Jun 28, 2024, 11:09 AM

#

Yea something to wonder, I even wonder about the 20 ones how come it's so stable

vital cave Jun 28, 2024, 11:13 AM

#

Anyway I resubmitted both my selections
I have hopes for my second selection ( I think it will score 26 again on public) I hope to give good result on private

rare sandal Jun 28, 2024, 11:19 AM

#

I resubmitted both our choices too

rare sandal Jun 28, 2024, 11:21 AM

#

serene fiber Out of 5 tries, the unstable one with score 24 gives 20 thrice and 15 twice, whe...

Is 15 and 14 the executions on the private dataset 🤷‍♂️

#

I'm fully expecting my submissions to score like this on private

#

If my code cannot find an answer for every question within 630 seconds, it will get timeout...those buffers add up
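
A back-of-the-envelope check of how tight this is, using the numbers mentioned in the chat (630 seconds per question, 50 questions, a 9-hour notebook limit). The `reserve` value for model loading and other overhead is an assumption, and `remaining_budget` is a hypothetical helper.

```python
NUM_QUESTIONS = 50
PER_QUESTION_LIMIT = 630      # seconds, as stated in the chat
NOTEBOOK_LIMIT = 9 * 3600     # 9-hour run limit in seconds

worst_case = NUM_QUESTIONS * PER_QUESTION_LIMIT
slack = NOTEBOOK_LIMIT - worst_case
print(worst_case, slack)  # slack is all that is left for loading the model etc.


def remaining_budget(elapsed, questions_left, reserve=300):
    """Seconds the next question may use so that the rest still fit.

    `reserve` is an assumed safety margin, not a value from the chat.
    """
    if questions_left <= 0:
        raise ValueError("questions_left must be positive")
    usable = NOTEBOOK_LIMIT - reserve - elapsed
    return max(0.0, min(PER_QUESTION_LIMIT, usable / questions_left))
```

Recomputing the budget before each question, instead of using a fixed 630 seconds, means earlier overruns shrink later budgets rather than pushing the whole run past the limit.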

#

let's see

serene fiber Jun 28, 2024, 1:31 PM

#

rare sandal Is 15 and 14 the executions on the private dataset 🤷‍♂️

Idk as stated it gave 20 thrice and four times before giving that 14 and 15

#

Screenshot_2024-06-28-19-02-14-54_40deb401b9ffe8e1df2f1cc5ba480b12.jpg

#

My bad it's 20 thrice for both of them and 19 once, let's see what it gives next

rare sandal Jun 28, 2024, 1:51 PM

#

Ok so I guess it’s still the public dataset that is there

#

No way with different problems the scores are 1-1 mapping

serene fiber Jun 28, 2024, 2:07 PM

#

rare sandal Ok so I guess it’s still the public dataset that is there

I doubt, my 24 one is highly unstable and it seems to give 20 most of the times now.

vital cave Jun 28, 2024, 2:07 PM

#

In this competition the public will always equal to private score on kaggle leaderboard since its designed that way

#

The whole dataset will be changed to new one (no split)

#

So it will be new public/private

serene fiber Jun 28, 2024, 2:21 PM

#

Yea thats fine, the doubt remains the same, how come such unexpected scores. I expected my 24 solution to always give different answers between 14-23, and 21 one to be between 18 and 21, but things are more stable than expected

#

BTW earlier in discussion, someone talked about finetuning. Yea the current lb topmost solution is based on finetuning

serene fiber Jun 28, 2024, 2:45 PM

#

I am not sure how but suddenly our rank has increased by 2, anyone else experienced it?

vital cave Jun 28, 2024, 2:48 PM

#

The 2nd team was removed

#

And the deleted account

#

I guess they might be related

silk sandal Jun 28, 2024, 2:57 PM

#

Sorry for them. They were so close to winning the Jackpot:)

silk sandal Jun 28, 2024, 4:10 PM

#

Submissions have now been disabled

vital cave Jun 28, 2024, 4:13 PM

#

My 2nd sub scored 26 again ( 3 times in a row, so it's stable as I thought) I hope it scores well on PB

#

My 1st sub is still scoring so I will wait and see

rare sandal Jun 28, 2024, 4:15 PM

#

serene fiber I am not sure how but suddenly our rank has increased by 2, anyone else experien...

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/511414

vital cave Jun 28, 2024, 4:39 PM

#

Now the first sub gives submission scoring error, maybe because of the rerun of private data currently

#

Anyone has active subs scoring and failed ?

#

Most likely the ids changed and the scoring function couldn’t match ids with ids in the sub
So I hope this means my selection didn’t timeout this time

silk sandal Jun 28, 2024, 4:41 PM

#

Yes mine shows scoring error as well

vital cave Jun 28, 2024, 4:42 PM

#

Ok so now we wait 🙄🙏 good luck to you all

vital cave Jun 28, 2024, 7:21 PM

#

I noticed that I dropped 2 ranks (rank 6 now) with score 26, the guys that were after me are now before me hmmm

#

is it possible that the re-run happened and I scored 26 again, which dropped me to last 26 ... (in case of a tie I will lose rank .... since it's a 2 days old sub) but also the notebook needs 8 hours to finish and it's early !

#

any one noticed changes in ranks or selected scores ?

serene fiber Jun 28, 2024, 7:32 PM

#

Yea right now 24th, was initially 32nd on public lb

#

Scores remain the same*

vital cave Jun 28, 2024, 7:32 PM

#

with same score ?

serene fiber Jun 28, 2024, 7:33 PM

#

Yea I can't see change in score on lb

vital cave Jun 28, 2024, 7:33 PM

#

I've seen others with lower scores

#

I've noticed a user with score 22 now and was 25

#

and there are fewer 25s now

serene fiber Jun 28, 2024, 7:34 PM

#

Oh, my score remains the same as of now i.e 24

vital cave Jun 28, 2024, 7:35 PM

#

then its either they didn't run your notebook yet or it scored 24 in private or I don't know ... I am super impatient ! 🙂

serene fiber Jun 28, 2024, 7:36 PM

#

Hmm, not sure but I wouldn't expect 24 on private lb tbh 😅😅

vital cave Jun 28, 2024, 7:36 PM

#

but they said they will shortly start (was 5 hours ago ...) ... notebooks didn't finish yet ....or they started earlier

serene fiber Jun 28, 2024, 7:36 PM

#

vital cave but they said they will shortly start (was 5 hours ago ...) ... notebooks didn't...

Probably didn't finish yet, but not sure why sudden change in ranks

vital cave Jun 28, 2024, 7:36 PM

#

hmm then your higher rank is due to others' decrease in score .... it's going to be two looooong days

serene fiber Jun 28, 2024, 7:37 PM

#

Still it's weird someone's notebook gets executed in just 5-6 hours

#

So fast....

vital cave Jun 28, 2024, 7:38 PM

#

or started earlier or we were testing (after deadline) on private data

#

since my score didn't change I am still 26 but I ranked last in 26 this means that its a new 26 (newer than the others) so its a re-run

#

Or current board reflect only our selections

serene fiber Jun 28, 2024, 7:39 PM

#

Yea rerun is for sure, I guess they must have started it earlier, not sure.

vital cave Jun 28, 2024, 7:39 PM

#

@rare sandal did you select 21 as your best sub ?

serene fiber Jun 28, 2024, 7:39 PM

#

vital cave Or current board reflect only our selections

What do you mean?

vital cave Jun 28, 2024, 7:40 PM

#

I selected a new 26 and old 25, thus my new 26 ranked last 26 because it's newer than the others

serene fiber Jun 28, 2024, 7:40 PM

#

vital cave I've noticed a user with score 22 now and was 25

Then this wouldn't have happened

vital cave Jun 28, 2024, 7:40 PM

#

unless they didn't select the 25 🙂

#

that's why I am asking @rare sandal because I think they scored 22 as best and now they are 21

serene fiber Jun 28, 2024, 7:41 PM

#

Oh yea yea my bad

vital cave Jun 28, 2024, 7:56 PM

#

I think they reflect only the 2 selected subs (value and time of submission) in this way they will need only to replace those notebooks values with new values of the private dataset and the leaderboard will auto sort

serene fiber Jun 28, 2024, 8:13 PM

#

vital cave I think they reflect only the 2 selected subs (value and time of submission) in ...

But this will be bad, they should have directly ordered it based on private lb

Ordering based on public lb, checking private scores, reordering ties based on time, things just get more complex

vital cave Jun 28, 2024, 8:20 PM

#

This is to fix the ties, currently it's ordered on the two selected subs and ties are resolved by time

#

That's why I dropped 2 ranks

#

Now they need to rerun our notebooks and your score will shift you

#

~~This means at worst I will rank 4th in 26~~

serene fiber Jun 28, 2024, 8:23 PM

#

vital cave ~~This means at worst I will rank 4th in 26~~

This will happen only if other 4 get 26 as well

#

Or the other possibility, which I guess you are aware of

#

Basically ties should be fixed after having evaluated on private lb, that's what I wanted to say.

On public lb it makes no sense

#

Btw this seems bad, private lb is calculated with all data

#

The time management we have is for 50 questions, and not 100

vital cave Jun 28, 2024, 8:42 PM

#

No private with new 50 questions only

serene fiber Jun 28, 2024, 8:50 PM

#

Oh then it's fine lol

rare sandal Jun 29, 2024, 12:36 AM

#

vital cave @rare sandal did you select 21 as your best sub ?

Yes

#

Are your latest submissions running indefinitely?

#

Mine hasn't finished even after 12 hours

vital cave Jun 29, 2024, 12:52 AM

#

Yes, it gave me a fail notification but it still running or it is actually "pending", this is because Kaggle stopped all running submissions, it is a UI bug I guess

proper ferry Jun 29, 2024, 1:08 AM

#

The private results are out and will not change much, right?

vital cave Jun 29, 2024, 1:39 AM

#

No they are not
Currently your selected score is shown (best of 2), but that's on the public set

proper ferry Jun 29, 2024, 1:40 AM

#

Thanks Ali

rare sandal Jun 30, 2024, 3:20 PM

#

ah, I saw a public code today…it’s so easy to remove docstrings lol, just 2 lines 😁

Meanwhile mine is, over-engineered, if-else statements, go through token by token in the generation to search for triple quote, and then I believe it still doesn’t cover all cases with no improvement in the CV score…ended up not selecting it due to the later submission time (by 2 weeks) 💀 😓

#

don’t know what I was thinking back then 😑

silk sandal Jul 2, 2024, 9:43 AM

#

why is it taking so much time to display private scores?

#

I thought they needed only 9h to get the results for all submissions

rare sandal Jul 2, 2024, 9:54 AM

#

silk sandal I thought they needed only 9h to get the results for all submissions

9 hours * 1393 teams * 2

#

It's a lot
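
Spelling out that product: it is an upper bound, since 9 hours is the maximum run time and many notebooks finish sooner. The machine count passed to `wall_clock_days` is purely hypothetical, because the size of Kaggle's pool is not stated anywhere in the chat.

```python
HOURS_PER_RUN = 9      # maximum run time per submission
TEAMS = 1393
SELECTED_SUBS = 2      # each team selects two submissions

total_hours = HOURS_PER_RUN * TEAMS * SELECTED_SUBS
print(total_hours)  # total notebook-hours in the worst case


def wall_clock_days(machines):
    """Days needed if `machines` submissions can run in parallel."""
    return total_hours / (machines * 24)


# Hypothetical pool size, for illustration only.
print(round(wall_clock_days(100), 2))
```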

silk sandal Jul 2, 2024, 9:55 AM

#

Well, Kaggle got multiple GPUs right, all submissions can run in parallel or am I missing something?

rare sandal Jul 2, 2024, 9:55 AM

#

silk sandal Well, Kaggle got multiple GPUs right, all submissions can run in parallel or am ...

I think they should have

vital cave Jul 2, 2024, 10:06 AM

#

there was an update yesterday in the pinned topic

UPDATE 07/01/2024: Submissions are still scoring on the private set. We're now targeting sometime tomorrow (Tuesday) for finalization. We've increased our pool of machines for this job as well, so progress should accelerate.

silk sandal Jul 2, 2024, 10:08 AM

#

Thanks for the update

south anchor Jul 2, 2024, 6:17 PM

#

hope today ends and we have the private LB guys! i am curious to see if there is shakeup

serene fiber Jul 2, 2024, 6:27 PM

#

Let's hope there isn't 🥲🥲

silk sandal Jul 3, 2024, 7:16 AM

#

Is it possible that some notebooks are shutting down the compute resources for the competition with commands like sudo reboot or killall -u kaggle --nopassword. It is taking too long😁

#

Who had a dream about the Private Leaderboard? I had one, saw my ranking, and it was surprising!

vital cave Jul 3, 2024, 9:05 AM

#

I woke up today to find out that the competition was gone from my home page ! I was in shock! This could mean one thing, my subs failed thus I have no subs ! then after a couple of minutes I realized that the competition is "completed" (according to Kaggle system) and thus will be removed from the home page! .... that's a nightmare more than a dream harold

rare sandal Jul 3, 2024, 9:23 AM

#

When I saw that happened I thought the LB is finalized 😅

#

lol I'm just hoping that my "code repairing and postprocessing" works out on private. There are some cases where it fails on the CV but I'm banking on it being a net positive 😅

#

CV says net positive but public LB is same

south anchor Jul 3, 2024, 5:28 PM

#

i think it is nasty not to have an update on what's happening, but i understand. maybe we get news soon

vital cave Jul 3, 2024, 6:58 PM

#

Only my 2-month-old notebook scored 😦 My stable notebook, which should finish in time, seems not to have scored! This is the 2nd time in a competition I face only 1 sub scoring! For the other one, I still don't know why it didn't score. I hope at least I get feedback on this 😦

Congrats to you all 🙂

Today I got a silver and a bronze competition medals ... gold yet to be reached .... again the curse of rank 21 ....

I need a break

vital cave Jul 3, 2024, 7:14 PM

#

And I am currently ranked 99 worldwide in competitions, a milestone achieved 🙂

floral sentinel Jul 3, 2024, 7:53 PM

#

12...

rare sandal Jul 3, 2024, 11:19 PM

#

ah my prediction for gold zone not really off

#

huge gap between 1st and 2nd lol

rare sandal Jul 3, 2024, 11:19 PM

#

floral sentinel 12...

Our selected subs has one 18 and one 12 LOL

#

and they both have the same CV on old AIME ...

#

24/50

rare sandal Jul 4, 2024, 12:20 AM

#

The model is so great that repairing those "timeout 7" and code issues reduced the CV score holyfuck this still doesn't make sense to me

dusky narwhal Jul 4, 2024, 3:19 AM

#

what do you mean by "repairing those "timeout 7" and code issues"? Telling the LLM to fix them? I worked on that too, telling it the code timed out, or giving it a truncated backtrace on exception

#

but I have very little evidence that it actually helped at all, due to instability and other bugs which most of my submissions suffered

#

I'm really amazed by the big gap between 1st and 2nd. Does it show that all the stuff we've been doing such as I just described is a waste of time, and only training the LLM further actually creates an improvement that generalises?

serene fiber Jul 4, 2024, 3:58 AM

#

rare sandal Our selected subs has one 18 and one 12 LOL

How did you check the performances of both of them?

rare sandal Jul 4, 2024, 3:59 AM

#

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/submissions

serene fiber Jul 4, 2024, 3:59 AM

#

serene fiber Idk as stated it gave 20 thrice and four times before giving that 14 and 15

I ended up getting 15 lol, not sure if it was a bad run or an issue

serene fiber Jul 4, 2024, 4:00 AM

#

rare sandal https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/submissions

I see my public lb score in private as well

rare sandal Jul 4, 2024, 4:01 AM

#

dusky narwhal what do you mean by "repairing those "timeout 7" and code issues"? Telling the L...

Telling the LLM to use a more efficient approach

#

The public notebook, it doesn't do anything to hint the LLM that the code has issues so it just generates the exact same thing again and gets REPEATED ERRORS
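
A rough sketch of the feedback idea discussed here: run the generated code with a short timeout and, on failure, build a message to append to the next prompt. This is not the public notebook's code or anyone's actual submission. `run_and_feedback` and the message wording are hypothetical, and the 7-second default only mirrors the "timeout 7" mentioned above.

```python
import subprocess
import sys


def run_and_feedback(code, timeout_s=7, max_trace_chars=500):
    """Run generated code; return (output, feedback_for_the_model)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None, "The code timed out. Use a more efficient approach."
    if proc.returncode != 0:
        # Keep only the tail of the traceback so the prompt stays short.
        return None, "The code raised an error:\n" + proc.stderr[-max_trace_chars:]
    return proc.stdout.strip(), None


print(run_and_feedback("print(6 * 7)"))
```

Without a feedback message, resampling with an unchanged prompt can reproduce the same failing code, which is the repeated-error behaviour described above.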

rare sandal Jul 4, 2024, 4:40 AM

#

and ya it doesn't help, which I'm surprised

rare sandal Jul 4, 2024, 4:43 AM

#

serene fiber I see my public lb score in private as well

There's only 2 submissions that will show your exact private score

serene fiber Jul 4, 2024, 4:49 AM

#

No, it shows the public lb score for me even on those submissions

#

Oh ok my bad got it, it shows in recently executed ones, my bad

rare sandal Jul 4, 2024, 4:52 AM

#

Is your 15 the stable or unstable notebook ?

serene fiber Jul 4, 2024, 4:53 AM

#

Unstable

rare sandal Jul 4, 2024, 4:53 AM

#

What did the stable one get ?

serene fiber Jul 4, 2024, 4:54 AM

#

It's damn bad, just 8

#

I guess lower temp, did show its effect there. Lack of creativity in the generated solution

rare sandal Jul 4, 2024, 5:02 AM

#

20 to 8 is insane

#

I wouldn't have imagined getting single digits from a 20 point public submission given same topics and difficulty

serene fiber Jul 4, 2024, 5:07 AM

#

It was 21, 20 was stable on public lb

floral sentinel Jul 4, 2024, 1:50 PM

#

I am just waiting for Numina to drop their solution, they are the only ones who were able to have a stable score...

#

29 both in public and private

rare sandal Jul 5, 2024, 12:19 PM

#

Just wondering how do y’all debug vLLM when your CV is getting perfectly normal results but the LB is like 10-15. Yes, perfectly normal even when committing on Kaggle T4

#

there’s no clue on what is wrong

#

I see vLLM solutions that made it to 20+ now

#

can’t just trial and error when there’s only 1 or 2 submissions a day…

vital cave Jul 5, 2024, 5:51 PM

#

I am really disappointed that my stable vLLM notebook with 26 on public LB didn't score. I was afraid of the one that did score :/ .... I've set each question to 460 secs, I don't see a reason for timeout unless something inside vLLM got stuck 😦 ... evaluating the private board on one rerun was not in my favor. I can't say it's unfair now because we know from the beginning it will be tested once, but I can't feel good about it. I am pretty sure after many many tests that this notebook would have scored nice if it ran successfully. .... maybe next time will do better ...

floral sentinel Jul 5, 2024, 7:21 PM

#

vital cave I am really disappointed that my stable vLLM notebook with 26 on public LB didn'...

have you tried reaching out to competition hosts and explaining them your situation?

#

they might give it a shot to run your notebook

vital cave Jul 5, 2024, 7:54 PM

#

I don't think this will work. Besides, if they gave me another chance then they would need to give everybody one, and so on. This competition (Phase1) is now complete.
also, the rules mention that it's our job to make sure our subs work both public and private so ..
The one thing I really wish to know is the error in my 2nd sub (is it timeout, internal error ...) because I am planning to start from it in phase2.

My advice to myself and all is to come prepared to the next phase! I don't think 2nd phase will run without famous grandmasters joining the race.

dusky narwhal Jul 6, 2024, 1:01 AM

#

rare sandal and ya it doesn't help, which I'm surprised

telling the LLM that the code was too slow definitely changes its behaviour, though. I see it trying to change its code as a result, saying "We need to do this in a more efficient way", etc. However I haven't inspected how often it actually succeeds, but I think nearly never

#

and ultimately that's the reason that all these attempted tricks, such as getting it to verify, just didn't help significantly: if its solution/working has gone off track it's better to just throw it away and start over clean
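
A minimal sketch of the "throw it away and start over clean" idea, as opposed to feeding the error back and asking for a repair. `generate` and `execute` are hypothetical callables standing in for the LLM call and the sandboxed code runner; neither comes from any notebook in this thread.

```python
from typing import Callable, List, Optional


def solve_with_restarts(
    generate: Callable[[], str],
    execute: Callable[[str], Optional[int]],
    max_attempts: int = 4,
) -> List[int]:
    """Collect answers from independent attempts, discarding any that fail."""
    answers = []
    for _ in range(max_attempts):
        code = generate()       # fresh sample, with no history of earlier failures
        result = execute(code)  # assumed to return None on error or timeout
        if result is not None:
            answers.append(result)
        # A failed attempt is simply dropped; nothing is passed to the next one.
    return answers
```

The surviving answers can then go into whatever voting scheme the notebook already uses.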

#

so on the private LB you just get an inspecific submission failed message? I don't see any reason why they would intentionally make it vaguer than on public LB

#

people were reporting unusual timeouts in the last days of the contest presumably due to high load so I guess the same happened when rerunning all submissions at once

#

but I'm not filled with confidence in vLLM

rare sandal Jul 6, 2024, 1:37 AM

#

dusky narwhal telling the LLM that the code was too slow definitely changes its behaviour, tho...

Did you try fixing the docstrings issue ?

I.e. the docstring starts with 3 quotes and ends with 1. Once you fix, you will get an answer

rare sandal Jul 6, 2024, 1:38 AM

#

dusky narwhal telling the LLM that the code was too slow definitely changes its behaviour, tho...

I tested it and the CV dropped by 2 points

dusky narwhal Jul 6, 2024, 1:57 AM

#

I disallowed it from generating docstrings entirely. However, repeating the question in the docstring might have helped it focus, I don't know how much removing it helped beyond fixing the quotes problem

#

helped or hindered

rare sandal Jul 6, 2024, 2:49 AM

#

I just let it generate and if it throws the dreaded "timeout 7 returned non-zero exit status" then I re-execute it with the docstring removed
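
A rough sketch of what "re-execute it with the docstring removed" could look like, assuming the malformed docstring opens with three double quotes and closes with either three or one. `strip_docstring` is a hypothetical helper written for illustration, not the code used in any of the notebooks discussed here.

```python
def strip_docstring(code: str) -> str:
    """Drop a (possibly malformed) triple-quoted docstring from generated code."""
    kept = []
    inside = False
    for line in code.splitlines():
        stripped = line.strip()
        if not inside:
            if stripped.startswith('"""'):
                # A docstring that opens and closes on the same line is skipped whole.
                closed = len(stripped) > 3 and stripped.endswith('"')
                inside = not closed
                continue
            kept.append(line)
        elif stripped.endswith('"'):
            # Any line ending in a quote counts as the close, so a docstring
            # that ends with a single " is handled as well.
            inside = False
    return "\n".join(kept)


broken = 'def solve():\n    """Find the answer.\n    Some question text"\n    return 42\n\nprint(solve())'
print(strip_docstring(broken))
```

This is deliberately crude: a docstring line that happens to end in a quote closes the docstring early, and a function whose body is only a docstring would be left without a body.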

#

It has turned problems from wrong to correct, but also the other way round

#

The only thing I found that gives net positive almost all the time is postprocessing the result (in fact its >= +0 100% of the time)

serene fiber Jul 6, 2024, 4:19 AM

#

Wondering why it would turn correct to wrong, did you verify your docstring code?

rare sandal Jul 6, 2024, 4:26 AM

#

serene fiber Wondering why it would turn correct to wrong, did you verify your docstring code...

Because the problem itself cannot be solved with code

#

All the correct responses are from CoT answers

#

It only helps if the code answer turns out to be correct

vital cave Jul 6, 2024, 4:32 AM

#

dusky narwhal people were reporting unusual timeouts in the last days of the contest presumabl...

We didn't even get an error msg! I can see only one selected sub scored and nothing else, I have no clue what happened with the second submission. It's as if the second sub doesn't exist. So getting at least the "inspecific submission failed message" is helpful but we are not even getting it.

serene fiber Jul 6, 2024, 4:50 AM

#

rare sandal Because the problem itself cannot be solved with code

Oh ok, I thought it was causing some errors, but if that's the case then how come it was fine when docstrings were there? I assume you have a separate function to remove docstrings and not just ask deepseek to remove those

rare sandal Jul 6, 2024, 4:50 AM

#

serene fiber Oh ok, I thought it was causing some errors, but if that's the case then how com...

Yeah

serene fiber Jul 6, 2024, 4:51 AM

#

Then I don't get why it was initially correct and then wrong, coz deepseek is gonna generate the same code

rare sandal Jul 6, 2024, 5:18 AM

#

serene fiber Then I don't get why it was initially correct and then wrong, coz deepseek is go...

The wrong code answer counts towards the votes
Say your initial logic gives [(52, 3), (36, 2)] and 52 is correct

But you fix the code and it turned two tries from failed to 36, it becomes [(36, 4), (52, 3)] and 36 becomes the maximum vote
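
The vote flip described above can be reproduced with a few lines, assuming plain majority voting in which failed runs cast no vote. The answer lists are the illustrative numbers from the message, with `None` standing in for a failed run.

```python
from collections import Counter


def tally(answers):
    """Return (answer, votes) pairs sorted by votes, ignoring failed runs (None)."""
    valid = [a for a in answers if a is not None]
    return Counter(valid).most_common()


# Before the docstring fix: two code runs fail and cast no vote.
before = [52, 52, 52, 36, 36, None, None]
# After the fix: the same two runs now execute, but return the wrong answer.
after = [52, 52, 52, 36, 36, 36, 36]

print(tally(before))  # [(52, 3), (36, 2)]
print(tally(after))   # [(36, 4), (52, 3)]
```

So making more code runnable only helps when the newly runnable code is right more often than it is wrong.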

serene fiber Jul 6, 2024, 7:18 AM

#

Cool, makes sense

fresh gust Jul 6, 2024, 8:27 AM

#

Why did the leaderboard scores drop after the competition was over? My best score was 18 and now it's 15.

rare sandal Jul 6, 2024, 8:31 AM

#

rare sandal Just wondering how do y’all debug vLLM when your CV is getting perfectly normal ...

This is still the biggest question mark for me: why some users’ vLLM implementations cannot score above 15 while others get 20+. I looked at the code for the vLLM generation part and it isn’t much different

#

Our best vLLM sub iirc is 25/50 CV and 13/50 LB…and that was after trial and error fixing two timeouts

#

I do think the CV cannot differentiate these “normal scoring” vLLM subs and “poor scoring” vLLM subs like it cannot tell why my notebook gets timeout

#

With transformers if I get a 25/50 CV it usually gets LB 20 or 21

rare sandal Jul 6, 2024, 8:43 AM

#

rare sandal This is still the biggest question mark for me. Why some user’s implementation o...

My hypothesis is when it got re-run after submission, it ran into lots of OOMs before generating the answer, idk if it’s right

fresh gust Jul 7, 2024, 7:58 AM

#

Why did the leaderboard scores drop after the competition was over? My best score was 18 and now it's 15. And I saw the same for all other top contenders.

serene fiber Jul 7, 2024, 9:46 AM

#

fresh gust Why did the leaderboard scores drop after the competition was over? My best scor...

Coz the data it was evaluated on has changed

floral sentinel Jul 7, 2024, 1:27 PM

#

fresh gust Why did the leaderboard scores drop after the competition was over? My best scor...

you got higher than me lol

fresh gust Jul 8, 2024, 9:56 AM

#

floral sentinel you got higher than me lol

lol what was your final score?

#

what is public score and private score here in this version? Are these scores of both datasets that were used before and after competitions completion?

rare sandal Jul 8, 2024, 12:18 PM

#

Before saying 15 is bad, my RAG and hint retrieval submission got 12 points 😂

fresh gust Jul 8, 2024, 12:27 PM

#

did you post your implementation of RAG?

#

I was looking for one before the competition was over

rare sandal Jul 8, 2024, 12:27 PM

#

That's very surprising to me, I know the algorithm is imperfect but I didn't expect it to turn out so badly

rare sandal Jul 8, 2024, 12:27 PM

#

fresh gust did you post your implementation of RAG?

nope...didn't get a gold so didn't share code or solution

fresh gust Jul 8, 2024, 12:28 PM

#

can you share the notebook with me?

rare sandal Jul 8, 2024, 12:28 PM

#

If I share I will share publicly

fresh gust Jul 8, 2024, 12:28 PM

#

let me know if you share it publicly

#

i'd like to see the implementation of RAG

rare sandal Jul 8, 2024, 12:28 PM

#

fresh gust did you post your implementation of RAG?

There are 1 or 2 implementations in public notebook

#

But low scores

fresh gust Jul 8, 2024, 12:28 PM

#

ok I'll check them

#

what is your score on the leaderboard?

rare sandal Jul 8, 2024, 12:29 PM

#

I don't like private sharing, if you do that in competition it is considered cheating

rare sandal Jul 8, 2024, 12:30 PM

#

fresh gust what is your score on the leaderboard?

18 without RAG and 12 with RAG, everything else is the same, for the 2 that our team selected

fresh gust Jul 8, 2024, 12:31 PM

#

that's interesting

#

I don't understand. My best score is 17 but the one on leader board is 15

floral sentinel Jul 8, 2024, 3:25 PM

#

fresh gust lol what was your final score?

12

#

and I did it with RAG

#

without rag i would've gotten 10

#

for some reason my RAG submissions always scored with a score+2

floral sentinel Jul 8, 2024, 3:26 PM

#

fresh gust did you post your implementation of RAG?

https://www.kaggle.com/code/anrenk/deepseek-class-rag-question

DeepSeek Class RAG + Question

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

vital cave Jul 9, 2024, 5:07 AM

#

The more solutions coming out the more I feel bad about this competition. It's a game of luck for me. I am pretty sure that a freeze in vLLM somewhere lost me the opportunity to score my second sub! Which never froze on colab.
Memory issues with 2xT4 , many runs at the same time , Servers super busy with re-runs ... all of this can play a part.

I wasted many hours stabilizing my work with 27/28 validation and 26 LB to be beaten by some hardware issue 😦
If the approach for 2nd phase will be the same ( one rerun and let luck be your friend ) then I am out.

#

Also, I think the scoring should be weighted: harder problems should be worth more points and easier problems fewer, and the score should still be calculated even if the notebook froze (scoring 0 points on the rest of the questions).
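
A small sketch of the scoring rule proposed here, purely for illustration. The weights are invented; the competition did not publish per-problem difficulty weights, and this is not how the leaderboard is computed.

```python
def weighted_score(weights, correct):
    """Sum the weights of correctly answered problems.

    `correct` may be shorter than `weights` if the notebook froze part-way;
    zip stops at the shorter list, so unanswered problems score 0.
    """
    return sum(w for w, ok in zip(weights, correct) if ok)


# Hypothetical difficulty weights for four problems.
weights = [1, 1, 2, 3]
# The notebook answered three problems, then froze before the fourth.
correct = [True, False, True]

print(weighted_score(weights, correct))  # 3
```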

floral sentinel Jul 9, 2024, 5:52 AM

#

vital cave The more solutions coming out the more I feel bad about this competition. It's a...

yeah it kinda sucks that when you've been working on something for so long and it just goes down the drain cause of luck...

#

but hey, there is always a next time

#

phase 1 is finished, it's in the past. Better focus on second phase

vital cave Jul 9, 2024, 6:14 AM

#

I know there is always next time, I just don't want next time to be the same as phase1 (extended phase1) 🙂

floral sentinel Jul 9, 2024, 6:59 AM

#

vital cave I know there is always next time , I just don't want next time to be as same as ...

nah it won't be the same

rare sandal Jul 9, 2024, 7:36 AM

#

Agreed, it is crazy to spend 8 months of effort only for your submission to fail on the rerun 😭

floral sentinel Jul 9, 2024, 1:51 PM

#

rare sandal Agreed, it is crazy to spend 8 months of effort only for your submission to fail...

wasn't it 3 months?

rare sandal Jul 9, 2024, 2:34 PM

#

floral sentinel wasn't it 3 months?

Phase 2 seems to be 8 months from the host's comments

floral sentinel Jul 9, 2024, 2:34 PM

#

rare sandal Phase 2 seems to be 8 months from the host's comments

oh i thought you were talking about phase 1

#

damn 8 months is nice

#

we can freely experiment

#

although the hype will be so low

rare sandal Jul 9, 2024, 2:35 PM

#

Yeah, I hope they don't put a limit on open source based on the start date lol

Better to put it 2 months before the end date if they want to

e.g. Sep 1 2024 to May 1 2025, can cut off open source on Feb 23 2025

floral sentinel Jul 9, 2024, 2:39 PM

#

rare sandal Yeah, I hope they don't put a limit on open source based on the start date lol ...

lmao

floral sentinel Jul 10, 2024, 6:09 PM

#

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/519303

AI Mathematical Olympiad - Progress Prize 1

Solve national-level math challenges using artificial intelligence models

#

@rare sandal you were right, they fine tuned deepseek and created their own model

#

numina just dropped a solution

vital cave Jul 10, 2024, 6:44 PM

#

they also have nice resources 🙂
I am testing their model now, I replaced Deepseek original one with theirs in my code (I haven't used their approach yet) I want to see how validation changes (originally 28/50 with my approach)

fresh gust Jul 10, 2024, 7:34 PM

#

floral sentinel without rag i would've gotten 10

That's very weird. I used self-consistency copied from your notebook and I got higher score. But there's one important addition I made. I implemented the idea of question rephrasing just like in the paper MetaMath. I guess that helped a lot, but I'm still not sure if it really did. Need to run some tests to better understand.

Check this post of mine to see what I did to score better: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/519363

AI Mathematical Olympiad - Progress Prize 1

Solve national-level math challenges using artificial intelligence models
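
A minimal sketch of combining question rephrasing with self-consistency at decoding time, under the assumption that each rephrasing is sampled separately and all answers go into one majority vote. `rephrase` and `solve` are hypothetical callables standing in for LLM calls; this is not the implementation from the linked post.

```python
from collections import Counter
from typing import Callable, Optional


def vote_over_rephrasings(
    question: str,
    rephrase: Callable[[str], str],
    solve: Callable[[str], Optional[int]],
    n_rephrasings: int = 2,
    samples_per_prompt: int = 3,
) -> Optional[int]:
    """Sample answers for the original question and its rephrasings, then vote."""
    prompts = [question] + [rephrase(question) for _ in range(n_rephrasings)]
    answers = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            answer = solve(prompt)  # assumed to return None when no answer is parsed
            if answer is not None:
                answers.append(answer)
    if not answers:
        return None
    # Majority vote across every sample, regardless of which prompt produced it.
    return Counter(answers).most_common(1)[0][0]
```

Whether rephrased prompts should count the same as the original in the vote is a design choice this sketch does not settle.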

floral sentinel Jul 10, 2024, 7:34 PM

#

vital cave they also have nice resources 🙂 I am testing their model now, I replaced Deep...

yeah the 8 x H100 gpu sent me waaay off the cliff

#

they also implemented Tree of Thoughts... which i struggled to wrap my head around for 3 weeks

floral sentinel Jul 10, 2024, 7:36 PM

#

fresh gust That's very weird. I used self-consistency copied from your notebook and I got h...

ooooh... that's interesting approach

#

didn't know about metamath...

floral sentinel Jul 10, 2024, 7:36 PM

#

fresh gust That's very weird. I used self-consistency copied from your notebook and I got h...

"3 may 2024"

#

bruh it's a new paper...

fresh gust Jul 10, 2024, 7:37 PM

#

the paper's new, but I just took their idea of question rephrasing and used it in decoding

#

it's an idea they used to augment data

floral sentinel Jul 10, 2024, 7:37 PM

#

congrats on the score

fresh gust Jul 10, 2024, 7:38 PM

#

I wish I had more time for this competition, might have implemented RAG or vLLMs

floral sentinel Jul 10, 2024, 7:38 PM

#

I put my last hope into Learning RAG from scratch and just focus on it the last 7 days

floral sentinel Jul 10, 2024, 7:38 PM

#

fresh gust I wish I had more time for this competition, might have implemented RAG or vLLMs

the phase 2 will be wild

fresh gust Jul 10, 2024, 7:38 PM

#

Yep I think so too

floral sentinel Jul 10, 2024, 7:38 PM

#

so better prepare slowly from now

fresh gust Jul 10, 2024, 7:40 PM

#

floral sentinel I put my last hope into Learning RAG from scratch and just focus on it the last ...

I want to do that too

#

I was so looking forward to implementing RAG but didn't have time

vital cave Jul 11, 2024, 4:03 AM

#

vital cave they also have nice resources 🙂 I am testing their model now, I replaced Deep...

it scored 30/50 on my validation set, so an improvement of 2, but they confirmed using the dataset in their training so there is overfitting (if not, the 28/50 is already overfitting)

#

I will try to run their full approach, not only the model

rare sandal Aug 9, 2024, 3:44 PM

#

I was testing Qwen2 on the train data, I just saw this. What kind of brilliance was that ?

#

rare sandal Aug 9, 2024, 3:45 PM

#

rare sandal

Btw, this isn’t the full code, there are still nested loops after that

#

4/10 with transformers and 6/10 with vLLM (on colab) 😀