#ai-mathematical-olympiad-prize


rare sandal
#

Publicly shared data and some textbook corpus from GitHub

#

If I end up doing well I will share

vital cave
#

I didn’t play much with prompt and ended with the one shared in the 20 points notebook before sub api

serene fiber
#

lol for us prompt helped reach 24

rare sandal
#

I didn’t try to fine tune prompt LOL

#

Actually I did, but it solves 1 problem and breaks 2 problems haha in validation

#

So I didn’t end up choosing

vital cave
#

The problem is it's so slow to try fine-tuning the prompt lol

serene fiber
#

I had local GPU so was not a big deal

rare sandal
#

In my validation set for new AIME I added like 10 problems which I created by myself

#

I would say none of these are more than 3/10 difficulty for the human

vital cave
#

I resubmitted my notebook from yesterday. If my calc is correct it should score 26 again

vital cave
#

Will take 8 hrs but sure I will

rare sandal
#

My prediction for my private score is 13 points, it depends a lot on how much the public notebook will score…from my validation, I don’t see that it is significantly better

(the 21 that is trending in notebooks is the public NB I’m referring to)

#

🤣

vital cave
#

I have no predictions … I want a solo gold and a money prize will be nice hhhh but I expect 2 timeouts just to not be in shock if this happens

vital cave
#

Ya one day 🙂

serene fiber
#

Great

rare sandal
#

? What lol

rare sandal
#

Did yall do any postprocessing ?

#

I did lol

serene fiber
# rare sandal Did yall do any postprocessing ?

Not really, just identified some keywords for difficulty and gave those questions some extra time.
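A rough sketch of what "keywords for difficulty plus extra time" could look like. The keyword list and the time values here are hypothetical, picked only for illustration; the actual ones were not shared.

```python
# Hypothetical keywords and time values, for illustration only.
HARD_KEYWORDS = ("polynomial", "probability", "triangle", "remainder")
BASE_SECONDS = 400
BONUS_SECONDS = 120


def time_budget(question: str) -> int:
    """Return a per-question time budget in seconds."""
    text = question.lower()
    looks_hard = any(keyword in text for keyword in HARD_KEYWORDS)
    # A single flat bonus keeps one question from eating the whole run.
    return BASE_SECONDS + (BONUS_SECONDS if looks_hard else 0)


print(time_budget("Find the remainder when 2^100 is divided by 7."))
print(time_budget("What is 2 + 2?"))
```

Any bonus handed out this way has to come out of the total run time, so the base value would need to shrink accordingly.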

Though I thought of using Gemma to explain DeepSeek the question but it didn't work (VRAM exceeded)

valid shoal
#

i.e. we should have overfitted on public lb score?

serene fiber
#

I wish that happens, would get a free silver and can be the fastest expert 😅😅

vital cave
#

It's just the Kaggle system
I guess it will take a week to have the final update

serene fiber
#

Lol my bad it's on the comment

vital cave
#

Hhhh unfortunately didn’t 🙂

serene fiber
#

Deleting

vital cave
#

No worries

serene fiber
#

It's silver for the competition discussion and not the lb 🤣🤣🤣

vital cave
#

Hhhhh on other issue I don’t think the 100 sub will stay much after competition
As far as I understand the test data will be the same in phase2

#

I just reran my final subs again

serene fiber
#

Yea that's the reason I plan to start it now irrespective of the results, just exploiting it as much as possible.

vital cave
#

Ya because I dont think this window of 100 subs will stay much

serene fiber
#

Let's see, ideally it shouldn't but hoping too

rare sandal
#

lol this is insane

#

shows how volatile the model is

#

This though…I’m not sure how it would hold on private, there is a chance it shakes down if the prompt is fitted to public…

rare sandal
#

Is your prompt that gave 24 also the one that gave the best local validation score?

#

For our team, it's not correlated

#

I have a prompt which scores 26 on CV but only 9 on LB. But the baseline scores 22 on CV and 21 on LB, for example

#

If it does you're probably safe

#

I tried a few times to tune the prompt to fit 2 or 3 validation problems where the model is nearly correct, it always performs worse than the public prompt

#

You turn 2 or 3 from wrong to correct, but instead turn 5 from correct to wrong
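One way to see this effect on a validation set is to count per-problem flips between two prompts instead of only comparing totals. A minimal sketch, with made-up problem ids and results:

```python
def count_flips(baseline: dict, candidate: dict) -> tuple:
    """Return (fixed, broken): problems that went wrong->correct and correct->wrong."""
    fixed = sum(1 for pid, ok in baseline.items() if not ok and candidate.get(pid, False))
    broken = sum(1 for pid, ok in baseline.items() if ok and not candidate.get(pid, False))
    return fixed, broken


# Hypothetical per-problem correctness for two prompts on the same validation set.
baseline = {"p1": True, "p2": True, "p3": False, "p4": True, "p5": False}
candidate = {"p1": False, "p2": True, "p3": True, "p4": False, "p5": False}

fixed, broken = count_flips(baseline, candidate)
print(f"fixed={fixed} broken={broken} net={fixed - broken}")
```

With sampling noise in the mix, a flip can also happen with the prompt unchanged, so the same comparison between two runs of the baseline gives a useful noise floor.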

serene fiber
#

Can you try resubmitting those solutions, I would expect it to be unstable

rare sandal
#

Hmm I set a seed 🤔

serene fiber
#

You can just remove that and run it a few times, coz if this is true then a lot of research is just a waste of time 🤣🤣🤣

rare sandal
#

No seed I tried before it's not stable and max - min can be up to 9 points

serene fiber
#

By research I mean actual research, the one which proper academia and industry folks do

rare sandal
#

I didn't submit too many times for the same code though but 24-25 is definitely possible with a lucky run

rare sandal
#

assumed the seed was set

#

my bad, I saw the discussion post and assumed it was a stable 25

serene fiber
#

Yea still I mean the underlying stochasticity with higher temperatures is high. Also I assume the top p would be something around 1 for the submission

#

If any of the higher temp solution gets a good score, I would just assume it's got to be lucky irrespective of the prompt or any other modifications

rare sandal
#

ah he said his notebook is not stable

#

I got it wrong then

rare sandal
#

I think scores may be released tomorrow

#

Last time Optiver comp, it took 2-3 days to run 2 submissions each for 3.4K teams

#

That one was compute intensive and submissions take very long to score as well

rare sandal
#

as for our submission code, I will make public if our team makes the gold zone in private

serene fiber
#

I mean the RAG solutions

proper ferry
#

" The private leaderboard is calculated over the same rows as the public leaderboard in this competition. " Does this mean that the validation set is the same as public leaderboard?

serene fiber
#

For high temp, I wouldn't expect it to give the same score unless the seed is fixed in such a way that it always outputs the same tokens

silk sandal
#

But how can the seed be set to achieve this?

#

The following does not seem to work:

import os

import numpy as np
import torch

def seed_everything():
    # Seed Python hashing, NumPy and PyTorch (CPU and all GPUs).
    seed = 200
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TOKENIZERS_PARALLELISM'] = 'true'
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    print('-----Seed Set!-----')
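For intuition only, here is a toy temperature sampler in pure Python (not vLLM or transformers; all names are made up). It shows that a seed makes sampling repeatable only when the generator that actually draws the tokens is the one being seeded. Whether that explains the behaviour above is a guess: an inference engine may keep its own generator or use nondeterministic GPU kernels that global seeding does not reach.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample one token index from logits after temperature scaling."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]  # unnormalised softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits, temperature, seed, n_tokens=20):
    """Draw n_tokens samples using a private generator built from the seed."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(n_tokens)]


toy_logits = [2.0, 1.5, 0.5, 0.1]
run_a = generate(toy_logits, temperature=0.9, seed=200)
run_b = generate(toy_logits, temperature=0.9, seed=200)
print(run_a == run_b)  # same seed and same private generator give identical samples
```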

vital cave
#

Also I am a bit afraid of a timeout since it finishes around 8 hours (I have safety time monitor code, but I am not sure if it's bug free)

rare sandal
#

Is that 26 on the private score ? I think it was said that after the competition, the hidden test set will be the 50 private LB problems

vital cave
#

No public of course (doing after deadline tests)

#

I don't think they ran the private set yet

rare sandal
#

I don't know if Kaggle has changed the test set

#

It can be private now

vital cave
#

I doubt

#

I hope to score something near 26 with this code in private ( they said that private is similar in distribution as public) but we have to wait and see

#

I am waiting for the re-run of my 2-month-old 25 to give a score but it's taking more than usual!

serene fiber
#

26, I guess you got to get a monetary prize for sure

Congratulations

vital cave
#

I think it will take time to run all notebooks on private data and then do validation, because if you are using external data or finetuning they need to check the process, the data, the "date limit - 23-feb" and so on. If you are using other models those need to be checked for dates also ... it will take time

vital cave
#

that's why I also selected the first 25 notebook (which was also the first on the LB)

silk sandal
#

Hi @vital cave: If you do not mind answering, did you finetune an LLM to get the 26? I am just curious and not looking for details:)

vital cave
#

No, finetuning didn't work (or I am not experienced enough to make it work just yet 🙂). My solution is based more on repeats and stability
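No details were shared, so the following is only a guess at what "repeats" might mean: sample the same question several times and take a majority vote over the answers that could be parsed. A minimal sketch with made-up sample values:

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[int]]) -> Optional[int]:
    """Return the most frequent non-None answer, or None if nothing was parsed."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Hypothetical answers from 7 repeated generations; None marks a failed run.
samples = [52, None, 52, 17, 52, 17, None]
print(majority_vote(samples))
```

More repeats reduce run-to-run variance but cost time per question, which is the trade-off the timeout worries in this thread are about.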

rare sandal
#

I do think 26 private score will be a prize winner

#

Also I think those successful fine tunes will do better than the zero shot on the private LB, significantly ~+4-5 points gap

#

Say 2 people get 23 on public, one fine tuned and one zero shot, my guess is the fine tuned gets a 23 on private while the zero shot gets a 15-17

serene fiber
#

Monetary prize will surely be 25+

rare sandal
#

haven’t seen an unstable method robust on private score before in past comps…with some rare exceptions

If many people can't make it work with a certain method, the small proportion that makes it work wins out most of the time

rare sandal
#

That’s public score, private score may not be 20+

serene fiber
#

Yea I understand the difference, what I am talking here is about the variance.
The variance of public lb is not much, so we shouldn't expect more than ±2 on private provided the question type remains the same
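A rough way to sanity-check a ±2 guess is to treat each of the 50 problems as an independent coin flip with the same solve rate. That is a strong simplification (problems differ in difficulty and the private set is different), so take it only as an order-of-magnitude estimate:

```python
import math


def binomial_std(n_problems: int, solve_rate: float) -> float:
    """Standard deviation of the score if each problem is solved independently."""
    return math.sqrt(n_problems * solve_rate * (1.0 - solve_rate))


# Assumed solve rate: 20 correct out of 50, as an example score from this thread.
std = binomial_std(50, 20 / 50)
print(round(std, 2))
```

Under this toy model one standard deviation is about 3.5 points, so swings larger than ±2 would not be unusual even with the same question type.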

vital cave
#

My 25 points 2 month old notebook is on its way to timeout ! I tested it (re-run it) for 5 times .. this is the first timeout ... will resubmit again

vital cave
#

I did, but there is a rare situation where vLLM gets stuck in some kind of inner loop. I didn't look deeper into this. The notebook hasn't stopped so far but it should (this may be due to the after-deadline submission, where the 9 hrs limit may not apply)

serene fiber
#

Oh ok

vital cave
#

It just failed (timeout). I didn't face a timeout before with this notebook, which is weird
Anyway I submitted again

vital cave
#

Did you try after deadline ?

serene fiber
#

Yea

#

Out of 5 tries, the unstable one with score 24 gives 20 thrice and 15 twice, whereas the stable one with score 21 gives 20 four times and 14 once
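Summarising repeated runs like these with mean and spread makes the comparison easier than eyeballing single scores. A small sketch using the numbers as first reported in the message above:

```python
from statistics import mean, pstdev


def summarize(scores):
    """Return basic spread statistics for repeated submissions of one notebook."""
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "pstdev": round(pstdev(scores), 2),
    }


unstable = [20, 20, 20, 15, 15]  # the notebook whose best public score was 24
stable = [20, 20, 20, 20, 14]    # the notebook whose best public score was 21

print("unstable:", summarize(unstable))
print("stable:  ", summarize(stable))
```

Five runs is a small sample, so the spread estimates themselves are noisy.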

vital cave
#

So the scores of 15 and 14 happened only after the deadline?

serene fiber
#

yep

vital cave
#

Hmmm

serene fiber
#

And also I didn't expect 20 in 3 or 4 submissions in a row. This behaviour shows how much luck would matter

vital cave
#

Luck will play a part even in gold I guess
But why now the 15 and 14
It's a far-fetched possibility, but are we still testing on the public dataset? Hmm, because I never faced a timeout with that submission and now I did

serene fiber
#

Yea something to wonder, I even wonder about the 20 ones how come it's so stable

vital cave
#

Anyway I resubmitted both my selections
I have hopes for my second selection (I think it will score 26 again on public). I hope it gives a good result on private

rare sandal
#

I resubmitted both our choices too

rare sandal
#

I'm fully expecting my submissions to score like this on private

#

If my code cannot find an answer for every question within 630 seconds, it will get timeout...those buffers add up
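The arithmetic behind a figure like 630: with the 9-hour limit and 50 questions mentioned in this thread, an even split is 9 × 3600 / 50 = 648 seconds per question, so 630 leaves about 900 seconds of total slack. A minimal sketch of a budget tracker that rolls unused time over to later questions; the class name and the 900-second reserve are hypothetical, not the code actually used:

```python
import time


class TimeBudget:
    """Split the remaining wall-clock time evenly over the remaining questions."""

    def __init__(self, total_seconds, n_questions, reserve_seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_seconds - reserve_seconds
        self._questions_left = n_questions

    def next_allowance(self):
        """Return the seconds the next question may use (0.0 when nothing is left)."""
        if self._questions_left <= 0:
            return 0.0
        remaining = max(0.0, self._deadline - self._clock())
        allowance = remaining / self._questions_left
        self._questions_left -= 1
        return allowance


# Demo with a fake clock so the arithmetic is easy to follow.
fake_now = [0.0]
budget = TimeBudget(9 * 3600, 50, reserve_seconds=900.0, clock=lambda: fake_now[0])
first = budget.next_allowance()    # (32400 - 900) / 50 seconds
fake_now[0] += 100.0               # pretend question 1 finished after 100 s
second = budget.next_allowance()   # unused time rolls over to later questions
print(first, second)
```

This only helps if the per-question work can actually be interrupted at the allowance; a hang inside the inference engine would still blow through it.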

#

let's see

serene fiber
#

My bad it's 20 thrice for both of them and 19 once, let's see what it gives next

rare sandal
#

Ok so I guess it’s still the public dataset that is there

#

No way with different problems the scores are 1-1 mapping

vital cave
#

In this competition the public score will always equal the private score on the Kaggle leaderboard since it's designed that way

#

The whole dataset will be changed to new one (no split)

#

So it will be new public/private

serene fiber
#

Yea thats fine, the doubt remains the same, how come such unexpected scores. I expected my 24 solution to always give different answers between 14-23, and 21 one to be between 18 and 21, but things are more stable than expected

#

BTW earlier in discussion, someone talked about finetuning. Yea the current lb topmost solution is based on finetuning

serene fiber
#

I am not sure how but suddenly our rank has increased by 2, anyone else experienced it?

vital cave
#

The 2nd team was removed

#

And the deleted account

#

I guess they might be related

silk sandal
#

Sorry for them. They were so close to winning the Jackpot:)

silk sandal
#

Submissions have now been disabled

vital cave
#

My 2nd sub scored 26 again (3 times in a row, so it's stable as I thought). I hope it scores well on PB

#

My 1st sub is still scoring so I will wait and see

vital cave
#

Now the first sub gives submission scoring error, maybe because of the rerun of private data currently

#

Anyone has active subs scoring and failed ?

#

Most likely the ids changed and the scoring function couldn’t match ids with ids in the sub
So I hope this means my selection didn’t timeout this time

silk sandal
#

Yes mine shows scoring error as well

vital cave
#

Ok so now we wait 🙄🙏 good luck to you all

vital cave
#

I noticed that I dropped 2 ranks (rank 6 now) with score 26, the guys that were after me are now before me hmmm

#

is it possible that it was re-run and I scored 26 again, which dropped me to last among the 26s ... (in case of a tie I will lose rank .... since it's a 2 days old sub) but also the notebook needs 8 hours to finish and it's early!

#

any one noticed changes in ranks or selected scores ?

serene fiber
#

Yea right now 24th, was initially 32nd on public lb

#

Scores remain the same*

vital cave
#

with same score ?

serene fiber
#

Yea I can't see change in score on lb

vital cave
#

I've seen others with lower scores

#

I've noticed a user with score 22 now who was at 25

#

and there are fewer 25s now

serene fiber
#

Oh, my score remains the same as of now, i.e. 24

vital cave
#

then its either they didn't run your notebook yet or it scored 24 in private or I don't know ... I am super impatient ! 🙂

serene fiber
#

Hmm, not sure but I wouldn't expect 24 on private lb tbh 😅😅

vital cave
#

but they said they will shortly start (was 5 hours ago ...) ... notebooks didn't finish yet ....or they started earlier

vital cave
#

hmm then your higher rank is due to others' decrease in score .... it's going to be two looooong days

serene fiber
#

Still it's weird someone's notebook gets executed in just 5-6 hours

#

So fast....

vital cave
#

or started earlier or we were testing (after deadline) on private data

#

since my score didn't change I am still 26 but I ranked last in 26 this means that its a new 26 (newer than the others) so its a re-run

#

Or current board reflect only our selections

serene fiber
#

Yea rerun is for sure, I guess they must have started it earlier, not sure.

vital cave
#

@rare sandal did you select 21 as your best sub ?

vital cave
#

I selected a new 26 and an old 25, thus my new 26 ranked last among the 26s because it's newer than the others

vital cave
#

unless they didn't select the 25 🙂

#

that's why I am asking @rare sandal, because I think they scored 22 as best and now they are 21

serene fiber
#

Oh yea yea my bad

vital cave
#

I think they reflect only the 2 selected subs (value and time of submission) in this way they will need only to replace those notebooks values with new values of the private dataset and the leaderboard will auto sort

vital cave
#

This is to fix the ties: currently the board is ordered on the two selected subs, and ties are resolved by time

#

That's why I dropped 2 ranks

#

Now they need to rerun our notebooks and your score will shift you

#

This means at worst I will rank 4th among the 26s
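The ordering guessed at here (higher score first, earlier submission wins a tie) can be written as a two-part sort key. This is only a sketch of that guess, not Kaggle's actual implementation; team names and dates are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Entry:
    team: str
    score: int
    submitted_at: datetime


def rank(entries):
    """Sort by score descending, then by submission time ascending."""
    return sorted(entries, key=lambda e: (-e.score, e.submitted_at))


board = rank([
    Entry("team_a", 26, datetime(2024, 6, 26)),  # newest 26
    Entry("team_b", 25, datetime(2024, 5, 1)),
    Entry("team_c", 26, datetime(2024, 6, 20)),  # older 26, wins the tie
])
print([e.team for e in board])
```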

serene fiber
#

Or the other possibility, which I guess you are aware of

#

Basically ties should be fixed after having evaluated on private lb, that's what I wanted to say.

On public lb it makes no sense

#

Btw this seems bad, private lb is calculated with all data

#

The time management we have is for 50 questions, and not 100

vital cave
#

No private with new 50 questions only

serene fiber
#

Oh then it's fine lol

rare sandal
#

Are your latest submissions running indefinitely?

#

Mine hasn't finished even after 12 hours

vital cave
#

Yes, it gave me a fail notification but it's still running, or it is actually "pending". This is because Kaggle stopped all running submissions; it is a UI bug I guess

proper ferry
#

The private results are out and will not change much, right?

vital cave
#

No they are not
Currently your selected score is shown (best of 2) but that's on the public set

proper ferry
#

Thanks Ali

rare sandal
#

ah, I saw a public code today…it’s so easy to remove docstrings lol, just 2 lines 😁

Meanwhile mine is over-engineered: if-else statements that go through the generation token by token to search for triple quotes, and I believe it still doesn't cover all cases, with no improvement in the CV score…ended up not selecting it due to the later submission time (by 2 weeks) 💀 😓

#

don’t know what I was thinking back then 😑

silk sandal
#

why is it taking so much time to display private scores?

#

I thought they needed only 9h to get the results for all submissions

rare sandal
#

It's a lot

silk sandal
#

Well, Kaggle got multiple GPUs right, all submissions can run in parallel or am I missing something?

vital cave
#

there was an update yesterday in the pinned topic

UPDATE 07/01/2024: Submissions are still scoring on the private set. We're now targeting sometime tomorrow (Tuesday) for finalization. We've increased our pool of machines for this job as well, so progress should accelerate.

silk sandal
#

Thanks for the update

south anchor
#

hope today ends and we have the private LB guys! i am curious to see if there is shakeup

serene fiber
#

Let's hope there isn't 🥲🥲

silk sandal
#

Is it possible that some notebooks are shutting down the compute resources for the competition with commands like sudo reboot or killall -u kaggle --nopassword. It is taking too long😁

#

Who had a dream about the Private Leaderboard? I had one, saw my ranking, and it was surprising!

vital cave
#

I woke up today to find out that the competition was gone from my home page! I was in shock! This could mean one thing: my subs failed, thus I have no subs! Then after a couple of minutes I realized that the competition is "completed" (according to the Kaggle system) and thus will be removed from the home page! .... that's a nightmare more than a dream

rare sandal
#

When I saw that happened I thought the LB is finalized 😅

#

lol I'm just hoping that my "code repairing and postprocessing" works out on private. There are some cases where it fails on the CV but I'm banking on it being a net positive 😅

#

CV says net positive but public LB is same

south anchor
#

i think it is nasty not to have an update on what's happening, but i understand. maybe we get news soon

vital cave
#

Only my 2-month-old notebook scored 😦 My stable notebook, which should finish in time, seems not to have scored! This is the 2nd time in a competition that only 1 of my subs scored! For the other one, I still don't know why it didn't score. I hope I at least get feedback on this 😦

Congrats to you all 🙂

Today I got a silver and a bronze competition medals ... gold yet to be reached .... again the curse of rank 21 ....

I need a break

vital cave
#

And I am currently ranked 99 worldwide in competitions, a milestone achieved 🙂

floral sentinel
#

12...

rare sandal
#

ah my prediction for gold zone not really off

#

huge gap between 1st and 2nd lol

rare sandal
#

and they both have the same CV on old AIME ...

#

24/50

rare sandal
#

The model is so great that repairing those "timeout 7" and code issues reduced the CV score. This still doesn't make sense to me

dusky narwhal
#

what do you mean by "repairing those "timeout 7" and code issues"? Telling the LLM to fix them? I worked on that too, telling it the code timed out, or giving it a truncated backtrace on exception

#

but I have very little evidence that it actually helped at all, due to instability and other bugs which most of my submissions suffered
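For readers who want to try the same thing, a minimal sketch of building that feedback: a fixed message on timeout, otherwise the last few lines of the traceback. The function names and message wording are made up, and `exec` on model-written code is unsafe outside a sandbox, so treat this purely as an illustration.

```python
import traceback


def run_snippet(code: str):
    """Run code in a fresh namespace; return the traceback text on failure, else None."""
    try:
        exec(code, {})  # unsafe for untrusted code; illustration only
        return None
    except Exception:
        return traceback.format_exc()


def build_feedback(error_kind: str, traceback_text: str = "", max_lines: int = 5) -> str:
    """Build a short message to append to the prompt before asking for a retry."""
    if error_kind == "timeout":
        return "The code timed out. Rewrite it to run more efficiently."
    tail = traceback_text.strip().splitlines()[-max_lines:]
    return "The code raised an exception:\n" + "\n".join(tail) + "\nFix the code."


tb = run_snippet("x = 1 / 0")
feedback = build_feedback("exception", tb)
print(feedback)
```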

#

I'm really amazed by the big gap between 1st and 2nd. Does it show that all the stuff we've been doing such as I just described is a waste of time, and only training the LLM further actually creates an improvement that generalises?

rare sandal
#

The public notebook, it doesn't do anything to hint the LLM that the code has issues so it just generates the exact same thing again and gets REPEATED ERRORS

rare sandal
#

and ya it doesn't help, which I'm surprised

serene fiber
#

No, it shows the public lb score for me even on those submissions

#

Oh ok my bad, got it, it shows in the recently executed ones

rare sandal
#

Is your 15 the stable or unstable notebook ?

serene fiber
#

Unstable

rare sandal
#

What did the stable one get ?

serene fiber
#

It's damn bad, just 8

#

I guess lower temp, did show its effect there. Lack of creativity in the generated solution

rare sandal
#

20 to 8 is insane

#

I wouldn't have imagined getting single digits from a 20 point public submission given same topics and difficulty

serene fiber
#

It was 21, 20 was stable on public lb

floral sentinel
#

I am just waiting for Numina to drop their solution, they are the only ones who were able to have a stable score...

#

29 both in public and private

rare sandal
#

Just wondering how do y’all debug vLLM when your CV is getting perfectly normal results but the LB is like 10-15. Yes, perfectly normal even when committing on Kaggle T4

#

there’s no clue on what is wrong

#

I see vLLM solutions that made to 20+ now

#

can’t just trial and error when there’s only 1 or 2 submissions a day…

vital cave
#

I am really disappointed that my stable vLLM notebook with 26 on public LB didn't score. I was afraid of the one that did score :/ .... I've set each question to 460 secs, I don't see a reason for timeout unless something inside vLLM got stuck 😦 ... evaluating the private board on one rerun was not in my favor. I can't say it's unfair now because we know from the beginning it will be tested once, but I can't feel good about it. I am pretty sure after many many tests that this notebook would have scored nice if it ran successfully. .... maybe next time will do better ...

floral sentinel
#

they might give it a shot to run your notebook

vital cave
#

I don't think this will work, besides if they gave me then they need to give everybody a new chance and so on. This competition (Phase1) is now complete.
also, the rules mention that it's our job to make sure our subs work both public and private so ..
The one thing I really wish to know is the error in my 2nd sub (is it timeout, internal error ...) because I am planning to start from it in phase2.

My advice to myself and all is to come prepared to the next phase! I don't think 2nd phase will run without famous grandmasters joining the race.

dusky narwhal
# rare sandal and ya it doesn't help, which I'm surprised

telling the LLM that the code was too slow definitely changes its behaviour, though. I see it trying to change its code as a result, saying "We need to do this in a more efficient way", etc. However I haven't inspected how often it actually succeeds, but I think nearly never

#

and ultimately that's the reason that all these attempted tricks, such as getting it to verify, just didn't help significantly: if its solution/working has gone off track it's better to just throw it away and start over clean

#

so on the private LB you just get an unspecific "submission failed" message? I don't see any reason why they would intentionally make it vaguer than on the public LB

#

people were reporting unusual timeouts in the last days of the contest presumably due to high load so I guess the same happened when rerunning all submissions at once

#

but I'm not filled with confidence in vLLM

dusky narwhal
#

I disallowed it from generating docstrings entirely. However, repeating the question in the docstring might have helped it focus, I don't know how much removing it helped beyond fixing the quotes problem

#

helped or hindered

rare sandal
#

I just let it generate and if it throws the dreaded "timeout 7 returned non-zero exit status" then I re-execute it with the docstring removed
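A short regex version of "remove the docstring and re-execute" might look like the sketch below. It is not the code from any notebook in this thread, and it is deliberately crude: it deletes every triple-quoted string, so it would break code that assigns one to a variable or a function whose body is only a docstring.

```python
import re

# Matches """...""" or '''...''' across lines, non-greedy.
_TRIPLE_QUOTED = re.compile(r'"""[\s\S]*?"""' + "|" + r"'''[\s\S]*?'''")


def strip_docstrings(code: str) -> str:
    """Delete all triple-quoted strings from generated code."""
    return _TRIPLE_QUOTED.sub("", code)


# Hypothetical model output with a multi-line docstring.
generated = (
    "def solve():\n"
    '    """Find the answer.\n'
    '    Uses brute force."""\n'
    "    return sum(range(10))\n"
    "\n"
    "result = solve()\n"
)

cleaned = strip_docstrings(generated)
namespace = {}
exec(cleaned, namespace)  # illustration only; sandbox real model output
print(namespace["result"])
```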

#

It has turned problems from wrong to correct, but also the other way round

#

The only thing I found that gives net positive almost all the time is postprocessing the result (in fact its >= +0 100% of the time)

serene fiber
#

Wondering why it would turn correct to wrong, did you verify your docstring code?

rare sandal
#

All the correct responses are from CoT answers

#

It only helps if the code answer turns out to be correct

serene fiber
#

Then I don't get why it was initially correct and then wrong, coz deepseek is gonna generate the same code

serene fiber
#

Cool, makes sense

fresh gust
#

Why did the leaderboard scores drop after the competition was over? My best score was 18 and now it's 15.

rare sandal
#

Our best vLLM sub iirc is 25/50 CV and 13/50 LB…and that was after trial and error fixing two timeouts

#

I do think the CV cannot differentiate these “normal scoring” vLLM subs and “poor scoring” vLLM subs like it cannot tell why my notebook gets timeout

#

With transformers if I get a 25/50 CV it usually gets LB 20 or 21

fresh gust
#

Why did the leaderboard scores drop after the competition was over? My best score was 18 and now it's 15. And I saw the same for all other top contenders.

fresh gust
#

what is public score and private score here in this version? Are these scores of both datasets that were used before and after competitions completion?

rare sandal
#

Before saying 15 is bad, my RAG and hint retrieval submission got 12 points 😂

fresh gust
#

did you post your implementation of RAG?

#

I was looking for one before the competition was over

rare sandal
#

That's very surprising to me, I know the algorithm is imperfect but I didn't expect to turn out so badly

fresh gust
#

can you share the notebook with me?

rare sandal
#

If I share I will share publicly

fresh gust
#

let me know if you share it publicly

#

i'd like to see the implementation of RAG

rare sandal
#

But low scores

fresh gust
#

ok I'll check them

#

what is your score on the leaderboard?

rare sandal
#

I don't like private sharing, if you do that in competition it is considered cheating

fresh gust
#

that's interesting

#

I don't understand. My best score is 17 but the one on leader board is 15

floral sentinel
#

and I did it with RAG

#

without rag i would've gotten 10

#

for some reason my RAG submissions always scored with a score+2

vital cave
#

The more solutions coming out the more I feel bad about this competition. It's a game of luck for me. I am pretty sure that a freeze in vLLM somewhere lost me the opportunity to score my second sub! Which never froze on colab.
Memory issues with 2xT4 , many runs at the same time , Servers super busy with re-runs ... all of this can play a part.

I wasted many hours stabilizing my work with 27/28 validation and 26 LB to be beaten by some hardware issue 😦
If the approach for 2nd phase will be the same ( one rerun and let luck be your friend ) then I am out.

#

Also, I think the scoring should be weights and points, harder problems should have more points and easier problems should have fewer points and the scoring should be calculated even if the notebook froze. (scoring 0 points on the rest of the questions) .
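To make the proposal concrete, a sketch of weighted scoring where unanswered questions simply count as zero, so a frozen notebook still gets credit for what it finished. The weights and question ids are invented for illustration; nothing like this exists in the competition's actual metric.

```python
def weighted_score(weights: dict, correct: dict) -> int:
    """Sum the weights of correctly answered questions; missing answers score 0."""
    return sum(weight for qid, weight in weights.items() if correct.get(qid, False))


# Hypothetical weights: harder problems are worth more points.
weights = {"q1": 1, "q2": 2, "q3": 3}

# The notebook froze before reaching q3, so it has no entry at all.
answered = {"q1": True, "q2": True}

print(weighted_score(weights, answered))
```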

floral sentinel
#

but hey, there is always a next time

#

phase 1 is finished, it's in the past. Better focus on second phase

vital cave
#

I know there is always a next time, I just don't want next time to be the same as phase1 (an extended phase1) 🙂

rare sandal
#

Agreed, it is crazy to spend 8 months of effort only for your submission to fail on the rerun 😭

floral sentinel
#

damn 8 months is nice

#

we can freely experiment

#

although the hype will be so low

rare sandal
#

Yeah, I hope they don't put a limit on open source based on the start date lol

Better to put it 2 months before the end date if they want to

e.g. Sep 1 2024 to May 1 2025, can cut off open source on Feb 23 2025

floral sentinel
#

@rare sandal you were right, they fine tuned deepseek and created their own model

#

numina just dropped a solution

vital cave
#

they also have nice resources 🙂
I am testing their model now, I replaced Deepseek original one with theirs in my code (I haven't used their approach yet) I want to see how validation changes (originally 28/50 with my approach)

fresh gust
# floral sentinel without rag i would've gotten 10

That's very weird. I used self-consistency copied from your notebook and I got a higher score. But there's one important addition I made. I implemented the idea of question rephrasing just like in the MetaMath paper. I guess that helped a lot, but I'm still not sure if it really did. Need to run some tests to better understand.

Check this post of mine to see what I did to score better: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/519363

floral sentinel
#

they also implemented Tree of Thoughts... which i struggled to wrap my head around for 3 weeks

floral sentinel
#

didn't know about metamath...

floral sentinel
#

bruh it's a new paper...

fresh gust
#

the paper's new, but I just took their idea of question rephrasing and used it in decoding

#

it's an idea they used to augment data

floral sentinel
#

congrats on the score

fresh gust
#

I wish I had more time for this competition, might have implemented RAG or vLLMs

floral sentinel
#

I put my last hope into Learning RAG from scratch and just focus on it the last 7 days

fresh gust
#

Yep I think so too

floral sentinel
#

so better prepare slowly from now

fresh gust
#

I was so looking forward to implementing RAG but didn't have time

vital cave
#

I will try to run their full approach, not only the model

rare sandal
#

I was testing Qwen2 on the train data, I just saw this. What kind of brilliance was that ?

rare sandal
# rare sandal

Btw, this isn’t the full code, there are still nested loops after that

#

4/10 with transformers and 6/10 with vLLM (on colab) 😀