#ai-mathematical-olympiad-prize | Kaggle | Page 1

weary kelp Apr 2, 2024, 7:21 PM

#

Hey everyone! Since the competition is similar to AIME, I gathered some information on math olympiad topics. It’s here: https://docs.google.com/document/u/0/d/1ZYD-6jUJS9ndLm2ewuVO3Q26zMsXfrGlGmBb41SQYVA/mobilebasic

#

This should help boost performance through prompting. I’m excited to see how people can push the performance of LLMs in math.

sleek plank Apr 2, 2024, 10:09 PM

#

A few weeks ago, I downloaded and parsed the AIME / AMC12 problems from AoPS. I have already shared them on the Lean forum, so let me share them here too: http://olsak.net/mirek/aimo-datasets/

valid shoal Apr 4, 2024, 4:09 PM

#

I have published https://www.kaggle.com/code/huikang/code-interpreter-baseline

Code Interpreter Baseline

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

gritty oxide Apr 4, 2024, 8:26 PM

#

valid shoal I have published https://www.kaggle.com/code/huikang/code-interpreter-baseline

lol, by dumb luck my copy of your kernel (without changes) scored 6

valid shoal Apr 4, 2024, 10:39 PM

#

The current best score is a public notebook

https://www.kaggle.com/code/olyatsimboy/aimo-zero-shot-sc-mmos-deepseekmath

AIMO Zero-Shot SC MMOS-DeepSeekMath

final breach Apr 4, 2024, 11:08 PM

#

silly question here..but this competition is effectively only for LLMs right

normal eagle Apr 5, 2024, 4:35 AM

#

new to kaggle and programming, so didn't realize this was for local models only. spent half a day (literally 12+ hours) playing with GPT-4 and Claude 3 models

valid shoal Apr 5, 2024, 5:19 AM

#

final breach silly question here..but this competition is effectively only for LLMs right

You can do anything that is offline, which includes executing code

plush crater Apr 5, 2024, 8:17 PM

#

Today I learned: Kaggle CPU-only notebooks run on an Ubuntu-20.04 image, while GPU T4 x 2 notebooks run on an Ubuntu-22.04 image.

valid shoal Apr 5, 2024, 11:28 PM

#

I would greatly appreciate it if someone could benchmark the performance of various LLMs

round obsidian Apr 6, 2024, 3:19 AM

#

Hello, sorry if this is off-topic. I created a question on Manifold Markets so you can bet play money on the winning score: https://manifold.markets/EmanuelR/how-many-problems-will-be-solved-by?r=RW1hbnVlbFI The market determines the probability distribution of the winner's score.

Manifold

How many problems will be solved by the winning solution in the Kag...

This is the leaderboard https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/leaderboard

The competition consists of creating an AI that runs on a maximum of 9 hours on 2 Tesla T4 or 1 P100 GPU's, and solves a hidden list of 50 math problems with an answer between 0 and 999, anything that runs on the Kaggle VM that doesn't use the...

normal eagle Apr 7, 2024, 11:17 PM

#

weird but whenever I add "think step by step" to my prompts the performance usually goes down 😅.

normal eagle Apr 8, 2024, 1:13 AM

#

Also, somehow all explanations for correct answers are around 400 words. Often when it goes over that, the answer goes wildly wrong. GPT-4 (via ChatGPT) recovered some of those but none of the local model runs have done that.

abstract fjord Apr 9, 2024, 5:13 PM

#

Can we install and run LlamaIndex in the notebook?

serene fiber Apr 10, 2024, 2:02 AM

#

I guess you can use anything that is available for the public on Kaggle

true moat Apr 11, 2024, 2:39 PM

#

❗️New External Dataset Alert - I have parsed 8.8k solutions for 3.7k problems from ArtOfProblemSolving website and it’s now publicly available: https://www.kaggle.com/datasets/alexryzhkov/amio-parsed-art-of-problem-solving-website

AMIO parsed "Art Of Problem Solving" website

Mathematical Olympiads problems with solutions

#

Stay tuned to the dataset for further updates, as I have processed only the half of the parsed data with answers in \boxed. The other part needs more careful or even manual preprocessing.

hidden isle Apr 12, 2024, 2:58 PM

#

how to start it , I am new here

storm pond Apr 12, 2024, 4:02 PM

#

hidden isle how to start it , I am new here

copy a baseline and run all

hidden isle Apr 12, 2024, 4:07 PM

#

Okay I'll try

storm pond Apr 12, 2024, 4:07 PM

#

gl

sinful raft Apr 12, 2024, 7:49 PM

#

why is the current cap 17/50? does it have to do with transformers or prompting?

lusty moon Apr 12, 2024, 8:56 PM

#

sinful raft why is the current cap 17/50? does it have to do with transformers or prompting?

We're talking about one of the highest levels of math olympiad

#

It's meant to be tough

finite stump Apr 12, 2024, 10:12 PM

#

Can we use a mixture of NLP and ATP, or should we use just NLP?

sinful raft Apr 13, 2024, 12:04 AM

#

lusty moon We're talking about one of the highest levels of math olympiad

let me rephrase that. There was a notebook that scored 13/50 a while back and i was wondering how the score has jumped by 4 since then. Whether it be something specific or maybe it was just that more fine tuned models came out since then

lusty moon Apr 13, 2024, 12:05 AM

#

sinful raft let me rephrase that. There was a notebook that scored 13/50 a while back and i ...

Oh they first used like mistral or similar

#

Then deepseek math 7b instruct made a jump

#

Then self consistency made another jump

#

Can get 15-16 rn myself

#

Someone made 18

sinful raft Apr 13, 2024, 12:06 AM

#

ohhhhh gotcha, i have the deepseek math bookmarked for reading. i'll get on that. also do you know what made the 17-18 jump possible?

#

wow yeah they did it today

#

i guess my other question would be gemma 2b and its applications in models vs transformers

torpid rapids Apr 13, 2024, 4:41 AM

#

I have a basic question about code competitions. What models are we allowed to use exactly? Is it basically the list that shows up if I click on "Add input" in the right hand side bar on the edit code page? Also, how do I check if the model was available before Feb 24th or not?

upper knot Apr 13, 2024, 11:53 AM

#

afaik weights for mmos deepseek math were released feb 28. isn't that disqualified? https://pan.quark.cn/s/b939a0510658#/list/share/009acba8182d4a26a7bc44bd2a46933f-MMOS*101DeepSeekMath 7B

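Quark Cloud Drive share

Quark Cloud Drive is a cloud service from Quark whose features include cloud storage, HD video streaming, online file extraction, and one-click PDF conversion. With Quark Cloud Drive you can manage and use photos, documents, and phone data anytime, anywhere. It currently supports Android, iOS, PC, and iPad.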

vital cave Apr 13, 2024, 2:17 PM

#

upper knot afaik weights for mmos deepseek math were released feb 28. isn't that disqualifi...

on huggingface they were released feb-5
https://huggingface.co/deepseek-ai/deepseek-math-7b-rl/tree/main

The safetensors were published 25 days ago, but I don't think this causes disqualification because the original weights were there on Feb-5

deepseek-ai/deepseek-math-7b-rl at main

lusty moon Apr 13, 2024, 3:08 PM

#

upper knot afaik weights for mmos deepseek math were released feb 28. isn't that disqualifi...

Doesn't look like it, considering that they are being used all over

#

Also you can use any pretrained model, just no internet access right?

wintry juniper Apr 13, 2024, 3:27 PM

#

the rules state it had to have been available before feb 23, 2024

lusty moon Apr 13, 2024, 7:15 PM

#

wintry juniper the rules state it had to have been available before feb 23, 2024

Oh

#

So no fine tuning?

wintry juniper Apr 13, 2024, 7:16 PM

#

that's a grey area, not sure

clever heart Apr 13, 2024, 10:58 PM

#

how much time does it take to see submitted results?

lusty moon Apr 14, 2024, 12:57 AM

#

clever heart how much time does it take to see submitted results?

30 mins to an hour

upper knot Apr 14, 2024, 1:04 AM

#

vital cave on huggingface they were released feb-5 https://huggingface.co/deepseek-ai/deeps...

that's the base model not the mmos finetune which was released feb 28

#

mmos further finetuned on top of deepseek rl

clever heart Apr 14, 2024, 1:22 AM

#

vital cave Apr 14, 2024, 5:05 AM

#

upper knot that's the base model not the mmos finetune which was released feb 28

Oh Sorry, yes this is disqualified.

#

If you are referring to the public notebooks (13+ score) they use deepseek-ai/deepseek-math-7b-rl
However, there is another problem here :/
check: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/493554

AI Mathematical Olympiad - Progress Prize 1

Solve national-level math challenges using artificial intelligence models

hybrid arrow Apr 14, 2024, 7:03 AM

#

I have a background in Maths, mostly developing content, and have participated in these olympiads; it's easier for me to solve problems than to have LLM do it.
Anyway, I have created some basic models in the past, but this is entirely new to me.
Anyone looking to team up, I can contribute vastly to types of problems, datasets for the same and tweaking things.

hidden isle Apr 14, 2024, 10:23 AM

#

Is anyone aware of the WizardMath model?

upper knot Apr 14, 2024, 11:19 AM

#

vital cave If you are referring to the public notebooks (13+ score) they use deepseek-ai/d...

if you look at the top 5 scoring public notebooks they all use the mmos finetune, not deepseek-ai/deepseek-math-7b-rl

vital cave Apr 14, 2024, 12:12 PM

#

upper knot if you look at the top 5 scoring public notebooks they all use the mmos finetune...

It is mentioned only in the title, as far as I can understand from the attached comment.
The mmos version is not on huggingface as far as I can tell.

In any case, I asked and am waiting for a reply.

upper knot Apr 14, 2024, 1:44 PM

#

interesting

upper knot Apr 14, 2024, 2:02 PM

#

vital cave It is mentioned only in the title, as far as I can understand from the attached comment. ...

i think the comment author is mistaken. it's unclear now if the weights are the original model or mmos finetune, but it clearly isn't directly from hf.

if you look on hf, the original model is split into 2 safetensors files while the author's upload is split into 3 files. either the author did the safetensors conversion themselves or they are using a different model and forgot to change the name

vital cave Apr 14, 2024, 2:03 PM

#

upper knot i think the comment author is mistaken. it's unclear now if the weights are the ...

I've noticed and I asked for clarification

vital cave Apr 14, 2024, 3:59 PM

#

vital cave I've noticed and I asked for clarification

The Author confirmed the use of deepseek-math-7b-rl.
I also uploaded my own version (directly from Hugging Face) and got the exact same log for both.

upper knot Apr 14, 2024, 4:12 PM

#

okay awesome thanks

tranquil flame Apr 14, 2024, 8:42 PM

#

You can use this if you want: https://www.kaggle.com/models/yashbhalgat/deepseek-math-7b-rl
This is the original deepseek-math-7b-rl from HuggingFace (not the MMOS model)

deepseek-math-7b-rl

From HuggingFace: https://huggingface.co/deepseek-ai/deepseek-math-7b-rl

torpid rapids Apr 15, 2024, 9:11 PM

#

btw all the top notebooks seem to be running self-consistency with only 8 samples whereas the original deepseek-math paper reports 64 samples. I wonder how many samples can one squeeze into the 9 hr time limit
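
For anyone new to the term, self-consistency here just means sampling several answers per problem and keeping the most common one. A minimal sketch of the voting step only, assuming each sample has already been parsed to an integer (or `None` when parsing failed); the function name and the sample values are made up for illustration, not taken from any of the notebooks:

```python
from collections import Counter
from typing import Iterable, Optional


def majority_answer(samples: Iterable[Optional[int]]) -> Optional[int]:
    """Return the most common non-None answer, or None if every sample failed."""
    votes = Counter(a for a in samples if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Eight hypothetical sampled answers for one problem (None = no parsable answer).
sampled = [52, 52, 250, None, 52, 17, 250, 52]
best = majority_answer(sampled)
print(best)
```

More samples give the vote more to work with, which is why the sample count trades off directly against the time limit.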

lusty moon Apr 16, 2024, 12:13 AM

#

torpid rapids btw all the top notebooks seem to be running self-consistency with only 8 sample...

vLLM would definitely give you more leeway. It's probably 10x faster than transformers, and most notebooks take like 2000 sec, so you would be fine

boreal dew Apr 16, 2024, 9:39 PM

#

Hi,
I have a question around the problem id 246d26

Each of the three-digits numbers 111 to 999...

where the suggested correct answer should be 250.
But after working on it, my model gave me a better solution, 267. Since we want a maximum, I guess 267 should be the correct answer. Here is the analysis from my LLM:

The solution using modular arithmetic appears optimal based on the problem's constraints:

Selection Criteria: Yellow numbers were selected based on their last digits under modulo 10 arithmetic. The set of valid last digits chosen was {1, 3, 7}. This choice ensures that the sums of any two yellow numbers do not have a last digit that falls back into this set, fulfilling the requirement that their sums must be blue.

Verification of Sums: The calculations confirmed that no sums of two yellow numbers end in the digits 1, 3, or 7. Therefore, all such sums must be blue numbers as required by the problem's constraints.

Outcome:

Number of Yellow Numbers: 267, which is maximized under the chosen method.
Sums Compliance: There were zero invalid sums, indicating full compliance with the problem's condition.
This approach effectively uses a simple modular arithmetic concept to partition the set of numbers from 111 to 999 into yellow and blue categories such that the sum of any two yellow numbers always results in a blue number. It maximizes the number of yellow numbers while ensuring compliance with the given sum condition.
The analysis from my model seems correct to me `267 = (100 - 11) * 3`. But always good to have an additional brain on it 😄 . 
Would it be possible to double-check my result with a provider of the problem?
Thanks 🙂

normal eagle Apr 16, 2024, 9:47 PM

#

boreal dew Hi, I have a question around the problem id `246d26` ``` Each of the three-digi...

Hi, this ignores that even if the last digit doesn't fall in {1,3,7}, the sum can be outside [111,999]. eg: 537

boreal dew Apr 16, 2024, 9:50 PM

#

normal eagle Hi, this ignores that even if the last digit doesn't fall in {1,3,7}, the sum ca...

I didn't want to go too deep into analysing this, but I was wondering if using another base could also get a better result. I just feel the answer looked wrong anyway. 🙂 . Was quite happy my model figured this out actually 😄

normal eagle Apr 16, 2024, 9:54 PM

#

Tbh, this is the only problem in the public 10 for which I don't have a plausible standardized solution.

But one correct line of logic is

Max (Y) < 999/2.
=> 500 is min Blue such that it is higher than any yellow.
=> Min (Y) = 250.

So 250 to 499 are all Yellows.

boreal dew Apr 16, 2024, 10:03 PM

#

It is an interesting problem for sure, love it. I'll let the competition host answer this one. I think it needs a peer review for sure.

normal eagle Apr 16, 2024, 10:03 PM

#

What I meant by standardized solution here is that I can't seem to recall a problem solving strategy that forces you to count backwards from 499 instead of starting from 111.

For other problems example: in problem 1 (sum of squares of distances) -> the key insight is to use the sum and product of roots of a quadratic because difference of roots is given. or say in problem 3 (sparkle) the key insight is sum of digits < 3 which then automatically brings up combinatorics as a valid strategy.

haughty mountain Apr 16, 2024, 10:04 PM

#

Would it violate contest rules for me to have a math friend help? I have a close friend who is a longtime MO competitor, and I passed along this question to him

#

By “help” I mean try to give his answer to the question

boreal dew Apr 16, 2024, 10:05 PM

#

not at all @haughty mountain . That is the beauty of Mathematics. Share the love 🙂

haughty mountain Apr 16, 2024, 10:06 PM

#

I am an organizer from a different contest so not the authority that my name color might suggest

#

I have successfully passed along the question but not received an answer yet...

dapper cliff Apr 16, 2024, 10:30 PM

#

@normal eagle I think you are close, then maybe adding the fact that sum of any two odd numbers is even you can remove even yellows which is roughly 250. Now need to polish the edges. 😉

normal eagle Apr 16, 2024, 10:35 PM

#

dapper cliff @normal eagle I think you are close, then maybe adding the fact that sum...

Nope, can't go by the even yellows logic. Because the distribution with Max Yellows is this: [111,249] blue, [250,499] yellow, [500,999] blue. The thing that's bothering me is how does one arrive at the insight "max distribution of yellow is a continuous set." Otherwise we (and the model) keep trying to solve it using box principle.

valid shoal Apr 17, 2024, 12:07 AM

#

boreal dew Hi, I have a question around the problem id `246d26` ``` Each of the three-digi...

Well could you produce a set of 267 numbers?
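
One way to make that concrete is a small brute-force checker for any candidate yellow set. A rough sketch, under one reading of the condition only: the sum of any two yellows (repeats allowed) that lands in 111..999 must not itself be yellow. The full problem statement is cut off above, so this reading is an assumption, and the function name is made up:

```python
LO, HI = 111, 999


def is_valid_yellow_set(yellow: set) -> bool:
    """True if no sum of two yellows (repeats allowed) within LO..HI is itself yellow."""
    ordered = sorted(yellow)
    for i, a in enumerate(ordered):
        for b in ordered[i:]:
            s = a + b
            if s > HI:
                break  # sums only grow as b grows, so stop early
            if s in yellow:
                return False
    return True


# The two candidate sets discussed in the thread.
interval = set(range(250, 500))
last_digit = {n for n in range(LO, HI + 1) if n % 10 in (1, 3, 7)}

print(len(interval), len(last_digit))
```

Calling `is_valid_yellow_set` on each candidate shows how it fares under this reading only. It does not settle which reading the problem intends.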

normal eagle Apr 17, 2024, 12:09 AM

#

Actually, I found the gap in my solution (which was making it seem random). It's not a single possible distribution.

N = [111,999] = dY + dB + mY + mB (d = definitely, m = maybe)
So you calculate the dB and dY sets in parts by elimination from N. Then apply box principle/conditional sorting to mY and mB.
How to get dB: for i in N, if 2*i is not in N, i is in dB.
dB = [500,999], eliminate this from N
How to get dY: if i + N is in dB, i is in dY,
which means [389,499] is dY.

now for [111, 388], if any number is added to mY or mB, one corresponding number has to be added to the other set. => 278/2 = 139 maybe Yellows.

Total yellows are 111+139 = 250.
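
The counts in that argument are easy to get off by one, so here is a tiny check of the interval sizes. It only verifies the arithmetic (111, 278, 139, 250), not the argument itself; the variable names are made up:

```python
definitely_blue = range(500, 1000)    # [500, 999]
definitely_yellow = range(389, 500)   # [389, 499]
maybe = range(111, 389)               # [111, 388]

maybe_yellow = len(maybe) // 2        # half of the "maybe" numbers
total_yellow = len(definitely_yellow) + maybe_yellow

print(len(definitely_yellow), len(maybe), maybe_yellow, total_yellow)
```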

valid shoal Apr 18, 2024, 6:37 AM

#

I wrote this https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/494713

AI Mathematical Olympiad - Progress Prize 1

Solve national-level math challenges using artificial intelligence models

normal eagle Apr 18, 2024, 8:57 AM

#

valid shoal I wrote this https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/...

even in the last question, the simple insight that f(n) is linear = an + b makes it solvable using SymPy.

valid shoal Apr 18, 2024, 10:05 AM

#

normal eagle even in the last question, the simple insight that f(n) is linear = an + b makes...

What is the name of this theorem?

normal eagle Apr 18, 2024, 10:09 AM

#

valid shoal What is the name of this theorem?

can't remember a specific name but the degree of composite functions is the product of degrees of individual functions.

lethal arch Apr 18, 2024, 10:23 AM

#

You don't have to prove this with any theorem. In this kind of math competition you just need an educated guess: if you find one f(n) that satisfies the conditions, you can guess the solution without proving it.
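
The "educated guess" route can be sketched as a brute-force search over linear candidates f(n) = a*n + b. The thread does not quote the problem, so the two conditions below are placeholders chosen for illustration (an assumption), and this uses plain Python rather than SymPy just to stay dependency-free:

```python
from itertools import product


def condition_holds(f, n: int) -> bool:
    """Illustrative conditions only (an assumption, not quoted from the thread)."""
    return f(f(f(n))) == 8 * n - 7 and f(2 * n) == 2 * f(n) + 1


def find_linear_candidates(coef_range=range(-10, 11), checks=range(1, 50)):
    """Brute-force integer (a, b) such that f(n) = a*n + b passes every check."""
    found = []
    for a, b in product(coef_range, repeat=2):
        f = lambda n, a=a, b=b: a * n + b  # bind a, b per candidate
        if all(condition_holds(f, n) for n in checks):
            found.append((a, b))
    return found


candidates = find_linear_candidates()
print(candidates)
```

As noted in the thread, this only finds a function that fits. It says nothing about whether every valid f has that form.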

normal eagle Apr 18, 2024, 11:46 AM

#

lethal arch You don't have to prove this with any theorem, in these kind of math competition...

ps: in AIMO it might work, but in olympiads you actually have to prove that's the general form for all f(n)s for that case.

shy tundra Apr 18, 2024, 2:04 PM

#

is VLLM allowed in this competition?

#

or exllama, llama.cpp, etc

dapper cliff Apr 18, 2024, 4:12 PM

#

yes, but with vLLM, for example, it is tricky to make it work on multiple GPUs, and it does not support the P100

#

oh, and it doesn't support TPU either

dapper cliff Apr 18, 2024, 4:29 PM

#

When I am loading a model with vllm I can load it on one GPU, and it takes around 13.8GB, but then everything crashes when I try to load it on two GPUs.

neon trench Apr 18, 2024, 4:41 PM

#

Is the kaggle TPU option (vm v3-8) allowed for the submissions? I only see people submitting with the P100 and T4 options, as opposed to the kaggle TPU option, but I don't recall seeing anything about this in the rules?

lusty moon Apr 18, 2024, 5:25 PM

#

shy tundra is VLLM allowed in this competition?

Yes

#

Any offline inference form is fine

vital cave Apr 18, 2024, 6:19 PM

#

I just dropped 4 places in 1 sec :/ from 30 to 34 :/ I hope there is no private sharing ...

#

now 35th 😦

shy tundra Apr 18, 2024, 6:43 PM

#

well, the more attempts you get the scores go up because people get luckier

bitter kernel Apr 19, 2024, 2:56 AM

#

LLAMA3 came out

floral barn Apr 19, 2024, 5:01 AM

#

bitter kernel LLAMA3 came out

and it has a 400B+ variant too

#

if you thought grok was big to run

sweet bluff Apr 19, 2024, 10:54 AM

#

Seems like the 20/50 prize has been won with a very underwhelming solution to say the least.

vital cave Apr 19, 2024, 11:27 AM

#

What will happen if the current 1st shares their notebook? what happens if there is a current scoring notebook that was submitted say 8 hours ago and scored 20 ? it will directly be 1st.

#

Kaggle scoring mechanism depends on submission time not execution time, so I still think this is not over (to be fair)

sweet bluff Apr 19, 2024, 11:30 AM

#

Be the first to publish a public notebook scoring at least 20/50 on the leaderboard before April 22, 2024 11:59PM UTC.
I personally read this as: the only thing that matters is which notebook is published first, but idk.

vital cave Apr 19, 2024, 11:31 AM

#

sweet bluff Apr 19, 2024, 11:31 AM

#

Oh alright. A bit weird wording then.

vital cave Apr 19, 2024, 11:32 AM

#

If my understanding is correct, there is time for surprises

lethal arch Apr 19, 2024, 11:33 AM

#

But does this also mean that the current no. 1 can wait to make her notebook public until Sunday?

vital cave Apr 19, 2024, 11:34 AM

#

lethal arch But does this also mean that the current no. 1 can wait to make her notebook pub...

Yes ! unless another user scored 20+ (and before her submission) and decided to share

sweet bluff Apr 19, 2024, 11:34 AM

#

Hopefully it will not be a linear probed/hardcoded problem numbers solution.

lethal arch Apr 19, 2024, 11:34 AM

#

sweet bluff Hopefully it will not be a linear probed/hardcoded problem numbers solution.

It is, I mean the current publicly shared one is

sweet bluff Apr 19, 2024, 11:35 AM

#

Oh yeah, I meant the one from the first person on the leaderboard

vital cave Apr 19, 2024, 11:36 AM

#

sweet bluff Oh yeah, I meant the one from the first person on the leaderboard

I think the current 1st is not aware yet of that 😄

lethal arch Apr 19, 2024, 11:37 AM

#

When a week ago I reached 18 points at least I thought I added something a little bit useful to the base code, not just probing the near-miss problem_ids

vital cave Apr 19, 2024, 11:39 AM

#

probing was expected , there is 2 months with 60 subs and probing might reach further, and here comes private LB , but winning 10K with probing is .... well ... sad 😦

#

currently, Natalia Ogneva submitted 7 hours ago, after 2 hours from now if no one is able to become first (20 or 20+ but larger submit time) , then she will be the first to get to 20 ! and for this to be fair she has the right to share her notebook till Sunday

#

if she did not share the notebook then the current one will be first.

bitter kernel Apr 19, 2024, 12:18 PM

#

gosh what a smart people

dapper cliff Apr 19, 2024, 12:19 PM

#

sweet bluff Seems like the 20/50 prize has been won with a very underwhelming solution to sa...

You will not know it before the competition ends.

plush crater Apr 19, 2024, 9:10 PM

#

plush crater Today I learned: Kaggle CPU-only notebooks run on an Ubuntu-20.04 image, while G...

Now all new notebooks I create are Ubuntu 20.04. Is there a way to force a new notebook to be created with a Ubuntu 22.04 image?

broken vale Apr 20, 2024, 1:53 PM

#

So when will the submissions be open again?

broken vale Apr 23, 2024, 12:56 PM

#

Are llama 3 weights allowed? They were released after the competition start, but the model is listed in the models page on kaggle

dapper cliff Apr 24, 2024, 1:00 PM

#

Which models page?

broken vale Apr 24, 2024, 6:25 PM

#

the models tab that's on every competition

rare sandal Apr 25, 2024, 5:15 AM

#

LLAMA3 cannot be used. Model weights were released on 2024-04-18. The rules state that

Teams may only use AI models and tools that are open source and were released prior to 23 February 2024.

broken vale Apr 26, 2024, 4:12 PM

#

The leaderboard has been cleared 😄

broken vale Apr 26, 2024, 6:40 PM

#

submissions seem to have opened again!

floral sentinel Apr 28, 2024, 2:41 PM

#

is there a notebook explaining the new submissions system yet?

#

the pinned one is not clear and seems to have issues

rose nest Apr 29, 2024, 7:53 AM

#

for some reason writing submission.csv raises a PermissionError
here is some sample code and the error it outputs

import pandas as pd
import random

sn = []
for i in range(len(test_data)):
    r_num = random.randint(0,999)
    sn.append((test_data[i][0],r_num))

snd = pd.DataFrame(sn, columns=['id', 'answer'])
snd.to_csv("submission.csv", index=False)

this is throwing the error

PermissionError: [Errno 1] Operation not permitted: 'submission.csv'

broken vale May 1, 2024, 10:26 AM

#

rose nest for some reason the submission.csv is making permissionError here is a sample co...

There is a new API, that's not how you submit your solution any longer

hoary rampart May 1, 2024, 5:49 PM

#

how you guys been running the llms? do you pay for the vertex ai subscription or use private hardware?

floral sentinel May 2, 2024, 11:35 AM

#

hoary rampart how you guys been running the llms? do you pay for the vertex ai subscription or...

wdym by running llms?

#

are you running out of memory when loading in a notebook?

rose nest May 2, 2024, 3:19 PM

#

broken vale There is a new API, that's not how you submit your solution any longer

I see, thanks

rose nest May 2, 2024, 6:54 PM

#

Exception: You can only iterate over `iter_test()` once.

is there a way to avoid this, so I can test the code multiple times without having to restart the kernel each time?

broken vale May 3, 2024, 9:20 AM

#

rose nest ```Exception: You can only iterate over `iter_test()` once.``` is there a way t...

Doesn't look like it. What you can do is load the test or train set the old way, but when you do submissions you use the new API
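
Roughly, that split could look like the sketch below: a re-runnable local loop over a CSV for development, and a separate one-shot loop for the real submission. `solve` is a hypothetical stand-in for the model call. `iter_test()`, `predict`, `test['problem']`, and `sample_submission['answer']` come from this thread, but the module name `aimo` and `make_env()` are assumptions here, so check the pinned notebook for the exact calls:

```python
import csv
import io


def solve(problem: str) -> int:
    """Hypothetical stand-in for the real model call; always answers 0."""
    return 0


def run_local(csv_text: str) -> dict:
    """Re-runnable dev loop over CSV rows that have `id` and `problem` columns."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {row["id"]: solve(row["problem"]) for row in rows}


def run_submission() -> None:
    """One-shot loop through the competition API; only works inside the Kaggle kernel."""
    import aimo  # assumed module name, provided via the competition dataset

    env = aimo.make_env()  # assumed constructor; verify against the pinned notebook
    for test, sample_submission in env.iter_test():  # can only be iterated once
        sample_submission["answer"] = solve(test["problem"].values[0])
        env.predict(sample_submission)


# Local testing can be repeated as often as needed without restarting the kernel.
sample = "id,problem\nabc123,What is 1+1?\n"
print(run_local(sample))
```

Keeping `solve` separate from both loops means the same function gets exercised locally and in the submission run.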

rose nest May 3, 2024, 9:22 AM

#

broken vale Doesn't look like it. What you can do is load the test or train set the old way, but ...

I see

hoary rampart May 5, 2024, 2:58 PM

#

floral sentinel are you running out of memory when loading in a notebook?

yes exactly. I have tried now with the keras version of Gemma2b but it is going extremely slow maybe it is not even working.

#

I have used llms in the past but i am not sure if for this competition my submission should run on the kaggle notebook or can use something else

floral sentinel May 5, 2024, 3:14 PM

#

hoary rampart yes exactly. I have tried now with the keras version of Gemma2b but it is going ...

I don't know about keras, but using hugging face, you can apply quantization to load the llms efficiently and actually run it

#

there are a couple of notebooks in the code section of the competition showing how people ran LLM inference

#

https://www.kaggle.com/code/abdurrafae/improved-code-interpretation

Improved Code Interpretation

#

this is a good notebook, but keep in mind that there is a new API for submission

#

https://www.kaggle.com/code/dnyaneshwalwadkar/submission-with-the-best-nb-new-api

Submission with the best NB+ new API

floral sentinel May 5, 2024, 3:17 PM

#

floral sentinel https://www.kaggle.com/code/dnyaneshwalwadkar/submission-with-the-best-nb-new-ap...

this is another notebook with same implementation except for changing device numbers in device map, and the guy also implemented predict function for submitting on new API

hoary rampart May 5, 2024, 3:52 PM

#

wow thanks for that i will defo check it out

#

@floral sentinel how is your submission going?

floral sentinel May 5, 2024, 4:45 PM

#

hoary rampart @floral sentinel how is your submission going?

it's bad honestly

hoary rampart May 5, 2024, 5:18 PM

#

floral sentinel it's bad honestly

haha same brother. why is that?

#

i just submitted one of the examples you sent me to see how it goes. its been an hour and it is still executing :/

floral sentinel May 5, 2024, 5:19 PM

#

hoary rampart i just submitted one of the examples you sent me to see how it goes. its been an...

that's normal, mine took 8 hours

#

it depends on the variable "n partitions", and because of the new API it answers the questions sequentially: it solves the first question with n partitions, then moves on to the next, and so on, which makes it slow
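
Since the questions run one after another, a back-of-the-envelope budget helps when picking that variable. A small sketch using the 9-hour limit and 50 problems mentioned earlier in the channel; the 40 seconds per sample is a made-up figure for illustration, so measure your own:

```python
TIME_LIMIT_S = 9 * 3600   # 9-hour notebook limit
N_PROBLEMS = 50           # hidden test set size

per_problem_s = TIME_LIMIT_S / N_PROBLEMS  # average seconds available per problem


def max_samples(seconds_per_sample: float, reserve_s: float = 0.0) -> int:
    """Samples per problem that fit in the limit, after an optional reserve for model loading."""
    usable = TIME_LIMIT_S - reserve_s
    return int(usable // (N_PROBLEMS * seconds_per_sample))


print(per_problem_s)
print(max_samples(40.0))
```

Leaving a reserve for model loading and the occasional slow generation is probably wise, since a timeout costs the whole submission.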

floral sentinel May 5, 2024, 5:21 PM

#

hoary rampart haha same brother. why is that?

Cause this is my first time getting into LLMs field, and honestly I don't find myself flexible using them, plus I only recently learned about hugging face, so I am kinda new to all of this stuff.

floral barn May 7, 2024, 7:02 AM

#

rare sandal LLAMA3 cannot be used. Model weights were released on 2024-04-18. The rules stat...

:/

#

sadgeeee

#

so sad

broken vale May 9, 2024, 9:36 PM

#

anyone know how to decode bitsandbytes 4-bit weights back into 16-bit weights?

fresh gust May 13, 2024, 5:01 PM

#

broken vale There is a new API, that's not how you submit your solution any longer

how to download package to import the new API?

broken vale May 13, 2024, 5:01 PM

#

fresh gust how to download package to import the new API?

It's already installed in the kaggle kernel, just import it

fresh gust May 13, 2024, 5:02 PM

#

broken vale May 13, 2024, 5:03 PM

#

fresh gust

You might have to add the ai-mathematical-olympiad-prize competition dataset first

#

is that a kaggle kernel btw?

#

looks like a colab notebook

fresh gust May 13, 2024, 5:04 PM

#

yep, sorry I created the notebook without adding AIMO competition dataset

#

it's working now

broken vale May 13, 2024, 5:04 PM

#

nice 👍

fresh gust May 13, 2024, 5:04 PM

#

broken vale is that a kaggle kernel btw?

it is kaggles kernet

#

kernel

fresh gust May 15, 2024, 9:43 PM

#

is that it for submission?

broken vale May 16, 2024, 12:40 AM

#

fresh gust is that it for submission?

yeah, that should work

fresh gust May 16, 2024, 2:29 PM

#

@broken vale this is the first time I'm trying to submit in a competition on kaggle. I ran my submission script twice yesterday and got errors both times, so it looks like my submissions were wasted.

I think the reason is that I was using huggingface api to load the model within the notebook, but while submitting it is required to turn off the internet connection in the notebook, maybe that's why my notebook threw error, as it wasn't loading the model via api call.

Just to be clear on this, do I have to download the model first and then add it to notebook so it can be loaded without internet?

broken vale May 16, 2024, 2:34 PM

#

fresh gust @broken vale this is the first time I'm trying to submit in a competit...

You gotta have internet turned off. If you have a finetuned model, you need to upload the weights as a model or dataset to kaggle and add it to your notebook.

fresh gust May 16, 2024, 4:09 PM

#

broken vale You gotta have internet turned off. If you have a finetuned model, you need to u...

got it thanks!

floral sentinel May 16, 2024, 4:23 PM

#

fresh gust is that it for submission?

add .values[0] next to test['problem'] in sample_submission['answer']=predict(test['problem'])

cause for some reason it throws error when using only predict(test['problem'])

#

should be predict(test['problem'].values[0])

fresh gust May 16, 2024, 4:24 PM

#

What does .values[0] do in this case?

floral sentinel May 16, 2024, 4:28 PM

#

fresh gust What does `.values[0]` do in this case?

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/498135

AI Mathematical Olympiad - Progress Prize 1

Solve national-level math challenges using artificial intelligence models

#

test['problem'] returns <class 'pandas.core.series.Series'>

#

so test['problem'].values[0] returns the actual problem text (the first element of that Series)

floral sentinel May 16, 2024, 4:30 PM

#

fresh gust @broken vale this is the first time I'm trying to submit in a competit...

also add this as an input to your notebook, so you can load the model offline
https://www.kaggle.com/datasets/olyatsimboy/deepseek-math

deepseek-math

#

or any other model u wanna use

fresh gust May 16, 2024, 4:31 PM

#

floral sentinel https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/49...

oh great, thanks for the heads up

fresh gust May 16, 2024, 4:32 PM

#

floral sentinel also add this as an input to your notebook, so you can load the model offline ht...

about this, this is the safe tensor version of deepseek-math, which was released after 23rd feb. is it okay to use? I'm still a little confused on this

floral sentinel May 16, 2024, 4:33 PM

#

fresh gust about this, this is the safe tensor version of deepseek-math, which was released...

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/493554

broken vale May 16, 2024, 4:33 PM

#

fresh gust about this, this is the safe tensor version of deepseek-math, which was released...

i believe that is just another format, its the same weights as the original

floral sentinel May 16, 2024, 4:34 PM

#

"So using safetensors weights for DeepSeekMath is definitely in the spirit of the rules and is within the rules as these are not new weights."

fresh gust May 16, 2024, 4:34 PM

#

floral sentinel https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/49...

yeah I read this thread and have posted a question as well. still confused

floral sentinel May 16, 2024, 4:34 PM

#

Competition host said it's okay

fresh gust May 16, 2024, 4:35 PM

#

floral sentinel "So using safetensors weights for DeepSeekMath is definitely in the spirit of th...

oh yeah about that, the host is okay as long as everything remains open source (code, dataset, model weights). So if someone finetunes the deepseek-math model, they can open-source the finetuning data, but they still won't have the pretraining data of deepseek-math. In this case, will they still be eligible for the prize? (where the pretraining data is not open source, only the finetuning data is)

floral sentinel May 16, 2024, 4:36 PM

#

fresh gust oh yeah about that, the host is okay as long as everything remains opensource (c...

idk about that lol

fresh gust May 16, 2024, 4:37 PM

#

floral sentinel idk about that lol

Yeah I am confused about that. I've posted the question in the discussion forum and am just waiting for the reply. I'll let you know about the response too

floral sentinel May 16, 2024, 4:39 PM

#

ok

#

good luck bro

fresh gust May 16, 2024, 4:41 PM

#

you too bro

tough olive May 19, 2024, 10:48 AM

#

Is there any limit on model size in this competition? Or is it just a limit on GPU run time? Are there any obvious differences between the three GPUs available (T4x2, P100, TPU)?

wintry juniper May 19, 2024, 1:20 PM

#

No limit on model size; the limit is on the weights' publication date. You can't use TPU for submission. P100 is slightly faster, but it would need a quantized version because of its smaller VRAM, which costs accuracy.

tough olive May 20, 2024, 1:53 AM

#

Thanks a lot!

rare sandal May 20, 2024, 9:11 AM

#

story of this competition

#

GCD of 2 EVEN numbers is 1 by DeepSeekMath 🤣

floral sentinel May 20, 2024, 11:32 AM

#

rare sandal story of this competition

it might be because of "6227020800", that's a huge number

fresh gust May 21, 2024, 10:19 PM

#

How are you guys loading models without internet?

broken vale May 21, 2024, 10:29 PM

#

fresh gust How are you guys loading models without internet?

by uploading it to kaggle

#

you got two choices, either as a dataset or as a model

fresh gust May 21, 2024, 10:31 PM

#

I've uploaded the model from huggingface to kaggle as a dataset, but I'm not able to use AutoTokenizer or AutoModelForCausalLM when I load them from Input

#

They work just fine when I load them from huggingface

#

is this the right way of uploading it?

#

@broken vale

broken vale May 21, 2024, 10:41 PM

#

that looks correct, now click the "Copy file path" icon on the model, and use that as the model_path

fresh gust May 21, 2024, 10:41 PM

#

that's exactly what I'm doing but I'm getting error

broken vale May 21, 2024, 10:41 PM

#

what does the error say?

#

copy the file path of the model folder, not the files inside the folder
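A minimal sketch of the "folder, not file" advice, assuming the uploaded model is a standard Hugging Face checkpoint with a `config.json` at its top level. The helper names `resolve_model_dir` and `load_offline` are hypothetical; `local_files_only=True` is the `from_pretrained` argument that keeps transformers from contacting the Hub.

```python
from pathlib import Path


def resolve_model_dir(path: str) -> Path:
    """Return the model folder for `path`, accepting either the folder or a file inside it."""
    p = Path(path)
    folder = p if p.is_dir() else p.parent
    if not (folder / "config.json").is_file():
        raise FileNotFoundError(f"no config.json found in {folder}")
    return folder


def load_offline(path: str):
    """Load tokenizer and model from local files only (no Hugging Face Hub access)."""
    # Imported lazily so the path check above works even without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    folder = str(resolve_model_dir(path))
    tokenizer = AutoTokenizer.from_pretrained(folder, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(folder, local_files_only=True)
    return tokenizer, model
```

In a notebook this would be called with whatever "Copy file path" gives for the attached model under `/kaggle/input/`.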

fresh gust May 21, 2024, 10:45 PM

#

wait... it just worked in a new notebook

#

but I'm confused

#

the new notebook imports transformers 4.93.3, whereas the old one imports transformers 4.20.1, by default

#

why so?

broken vale May 21, 2024, 10:58 PM

#

maybe the old one is pinned to the original environment

fresh gust May 21, 2024, 11:22 PM

#

and what about these libraries? are they imported with internet off?

#

broken vale May 21, 2024, 11:33 PM

#

they should be part of the kaggle environment

#

bitsandbytes is not, so if you are doing nf4 quant you need to add the wheels for bnb and pip install the whl
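One way to do that install with internet off is sketched below. The dataset path `/kaggle/input/bnb-wheels` and the helper names are assumptions; `--no-index` and `--find-links` are standard pip options for installing from a local folder of wheels.

```python
import subprocess
import sys


def offline_pip_command(package: str, wheel_dir: str) -> list[str]:
    """Build a pip command that installs `package` only from wheels in `wheel_dir`."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                 # never contact PyPI
        f"--find-links={wheel_dir}",  # look for wheels in the attached dataset
        package,
    ]


def install_offline(package: str, wheel_dir: str) -> None:
    """Run the install; raises CalledProcessError if pip fails."""
    subprocess.run(offline_pip_command(package, wheel_dir), check=True)


# Hypothetical dataset path; replace with the folder that holds your .whl files.
WHEEL_DIR = "/kaggle/input/bnb-wheels"
cmd = offline_pip_command("bitsandbytes", WHEEL_DIR)
print(" ".join(cmd))
```

The wheel folder also needs wheels for any dependencies that are not already in the Kaggle image.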

fresh gust May 21, 2024, 11:49 PM

#

I see, thanks for the help

broken vale May 21, 2024, 11:49 PM

#

np

fresh gust May 22, 2024, 12:04 AM

#

@broken vale one more thing. when I run this code, a csv file is generated in /kaggle/working (output directory). is this the output file that I have to select while submitting my notebook version?

#

broken vale May 22, 2024, 12:05 AM

#

no, you gotta use the aimo api, no files should be saved

#

the env.predict step is the submission of that problem

#

you are using the api i see, everything after the env.predict is not needed

#

also, test['problem'] returns a series, you need to access the problem with test['problem'][0] to get the problem string

fresh gust May 22, 2024, 12:07 AM

#

but when submitting. it asks for output file. what is that?

#

you mean like this?
predict(test['problem'][0], one_shot_context)

broken vale May 22, 2024, 12:08 AM

#

yes

#

you should try to run it in test before you submit

#

might have other errors

fresh gust May 22, 2024, 12:09 AM

#

fresh gust but when submitting. it asks for output file. what is that?

okay and what about the output file?

broken vale May 22, 2024, 12:09 AM

#

there should be no output file.

#

use the submit button in the right-hand panel of the kaggle notebook

fresh gust May 22, 2024, 12:11 AM

#

okay I see, the submit button in the notebook shows a different sidebar than the one on the competition page

#

got it

broken vale May 22, 2024, 12:11 AM

#

👍

#

also, ive had some trouble with the api lately. see https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/506041 and the answer from Chan

fresh gust May 22, 2024, 12:13 AM

#

I see. I'm uploading now, will see if this error occurs

#

and also, I was not able to do 5-shot learning in context due to kaggle's limited GPU VRAM. Is this possible in submission too? How do I know that the submission will not run out of CUDA memory?

broken vale May 22, 2024, 12:17 AM

#

you dont, if its happening in your notebook it will probably happen in the submission too. it usually does

#

try using a 4bit quant

#

or shard the layers on multiple gpus

fresh gust May 22, 2024, 12:18 AM

#

so submission is using the same kaggle GPU?

broken vale May 22, 2024, 12:18 AM

#

if you use 2 T4 gpus with device_map="auto" on 1 7B model that should be enough even in float16

#

yeah
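A back-of-the-envelope check of that claim, counting weights only. The 7e9 parameter count is the nominal "7B" figure and the 16 GiB per T4 is the card's nominal memory; activations and the KV cache are not included, so real headroom is smaller.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB (activations and KV cache excluded)."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 7e9        # nominal size of a "7B" model
T4_MEMORY_GIB = 16    # nominal memory of one NVIDIA T4; usable memory is somewhat less

fp16 = weight_memory_gib(N_PARAMS, 2)     # float16: 2 bytes per parameter
nf4 = weight_memory_gib(N_PARAMS, 0.5)    # 4-bit: roughly half a byte per parameter

print(f"float16 weights: {fp16:.1f} GiB, 4-bit weights: {nf4:.1f} GiB")
print(f"two T4s: {2 * T4_MEMORY_GIB} GiB nominal")
```

The float16 weights come to about 13 GiB, which leaves little room on a single T4 but fits comfortably when `device_map="auto"` spreads the layers over two.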

fresh gust May 22, 2024, 12:20 AM

#

I see

#

got it

fresh gust May 22, 2024, 12:50 AM

#

well that's interesting. I wonder how do I figure out what went wrong there

broken vale May 22, 2024, 1:03 AM

#

haha, nice. got that for some of my notebooks too

fresh gust May 22, 2024, 1:19 AM

#

and I have nothing on top of my head how to debug it 🤷‍♂️

broken vale May 22, 2024, 1:39 AM

#

you just didnt get any answers right

#

run it on the training set in the notebook

#

youll probably see that you wont get any right

proper ferry May 22, 2024, 7:28 AM

#

When we run a notebook, the result only shows the script without the notebook and output. Why?

#

I guess the error will make the notebook invisible right?

broken vale May 22, 2024, 1:34 PM

#

proper ferry When we run a notebook, the result only shows the script without the notebook an...

because otherwise you could cheat...

floral sentinel May 22, 2024, 7:49 PM

#

Guys, I have written a notebook where I organized the functions into one class from Abdurafae's notebook, here if anyone wants to check
https://www.kaggle.com/code/anrenk/aimo-llm-class

AIMO LLM CLASS

#

Most forked notebooks had unorganized code structure, and the import libraries were scattered here and there, I cleaned up the code and only left necessary stuff

proper ferry May 23, 2024, 12:05 AM

#

broken vale because otherwise you could cheat...

How to understand the "could cheat". When I run it without error, everything is ok as before.

broken vale May 23, 2024, 12:07 AM

#

proper ferry How to understand the "could cheat". When I run it without error, everything is ...

Thought you meant when submitting your code.

floral sentinel May 24, 2024, 1:12 PM

#

Deepseek generated Chinese characters...

#

bruh

candid holly May 26, 2024, 8:22 AM

#

how to upload models from huggingface to kaggle?

candid holly May 28, 2024, 12:01 PM

#

The notebook runs successfully but submission failed. What happened and how to fix it?😪

floral sentinel May 28, 2024, 1:03 PM

#

candid holly The notebook runs successfully but submission failed. What happened and how to f...

what error did submission give

stray mulch May 30, 2024, 4:16 AM

#

Is a model that is completely open source (but was released after February 2023) allowed?

west prism May 30, 2024, 6:26 AM

#

Why do some of my notebooks get cancelled before they finish in Olympiad?

floral sentinel May 30, 2024, 6:39 AM

#

stray mulch Is a model that is completely open source (but was released after February 2023)...

as far as I know it's not allowed.

floral sentinel May 30, 2024, 6:40 AM

#

west prism Why do some of my notebooks get cancelled before they finish in Olympiad?

throwing error?

stray mulch May 30, 2024, 7:55 AM

#

floral sentinel as far as I know it's not allowed.

But they seem to have allowed Llama 3 which was released this year

floral sentinel May 30, 2024, 8:01 AM

#

stray mulch But they seem to have allowed Llama 3 which was released this year

where did you see that? can you provide source/

stray mulch May 30, 2024, 8:44 AM

#

floral sentinel May 30, 2024, 11:14 AM

#

that's just a tab that shows models that people used in their notebooks, it doesn't necessarily mean it can be used according to the rules...

stray mulch May 30, 2024, 1:31 PM

#

floral sentinel that's just a tab that shows models that people used in their notebooks, it does...

Oh 💀

stray mulch May 30, 2024, 1:59 PM

#

floral sentinel that's just a tab that shows models that people used in their notebooks, it does...

Well I have just posted on discussion

#

lets hope the hosts answer

floral sentinel May 30, 2024, 2:02 PM

#

stray mulch Well I have just posted on discussion

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/495321

#

found something similar to your question

stray mulch May 30, 2024, 2:03 PM

#

aw man 😦

#

what about things like xLSTMs? which are just modifications of LSTMs. Are they allowed?

floral sentinel May 30, 2024, 2:11 PM

#

stray mulch what about things like xLSTMs? which are just modifications of LSTMs. Are they a...

xLSTM were released only recently, you already thinking about using them xD?

#

idk about that tbh

#

i think they only put limits on LLM models

#

xLSTM is research based i guess

stray mulch May 30, 2024, 2:12 PM

#

floral sentinel xLSTM were released only recently, you already thinking about using them xD?

well I thought if most of the older models have been tried, why not try new ones

stray mulch May 30, 2024, 2:12 PM

#

floral sentinel xLSTM is research based i guess

yeah, xLSTM is a new type of layer/whatever; I'd have to build the entire LLM/architecture if I plan on using it

floral sentinel May 30, 2024, 2:18 PM

#

stray mulch well I thought if most of the older models have been tried, why not try new ones

fair enough

#

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/508684

floral sentinel May 30, 2024, 2:19 PM

#

stray mulch yeah, xLSTM is a new type of layer/whatever; I'd have to build the entire LLM/ar...

maybe specify that you want to use xLSTM in the post

stray mulch May 30, 2024, 2:20 PM

#

Good idea

boreal dew May 30, 2024, 8:39 PM

#

Do the Python modules used for the submission also need to have been released before the 23rd of February, 2024?

minor crystal May 31, 2024, 12:29 PM

#

Is it possible to find more training math challenges somewhere that are the same type as the official ones?

floral sentinel May 31, 2024, 2:30 PM

#

minor crystal Is it possible to find somewhere more trainig math challenges that are same type...

https://www.kaggle.com/datasets/hemishveeraboina/aime-problem-set-1983-2024

AIME Problem Set: 1983-2024

The American Invitational Mathematics Examination Dataset from 1983 to 2024

#

https://www.kaggle.com/datasets/alexryzhkov/amio-parsed-art-of-problem-solving-website

AMIO parsed "Art Of Problem Solving" website

Mathematical Olympiads problems with solutions

#

https://www.kaggle.com/datasets/alejopaullier/aimo-external-dataset

AIMO External Dataset

External Dataset to be used in the AI Mathematical Olympiad Kaggle competition

floral sentinel May 31, 2024, 2:31 PM

#

floral sentinel https://www.kaggle.com/datasets/alejopaullier/aimo-external-dataset

but this one has answers that are not modulo 1000 (positive integer), it has algebraic expression answers

#

the above 2 need a little bit of data cleaning before you proceed to use them

candid holly Jun 1, 2024, 10:08 AM

#

floral sentinel what error did submission give

submission scoring error.

floral sentinel Jun 1, 2024, 3:43 PM

#

https://www.kaggle.com/code-competition-debugging

Code Competitions - Errors & Debugging Tips

floral sentinel Jun 1, 2024, 3:43 PM

#

candid holly submission scoring error.

broken vale Jun 1, 2024, 10:13 PM

#

everybody seems to be using deepseek-math, are there any other models that are okayish?

lusty moon Jun 2, 2024, 9:24 PM

#

broken vale everybody seems to be using deepseek-math, are there any other models that are o...

Nothing as performant

floral sentinel Jun 3, 2024, 6:49 AM

#

broken vale everybody seems to be using deepseek-math, are there any other models that are o...

honestly that was the only model from before February 23rd that is doing well. I tried others like Mixtral 8x7B and Gemma; they did terribly in my experience

#

I saw someone using MathGenie, but people in the comments said that it's a model released after February 23rd

proper ferry Jun 3, 2024, 7:40 AM

#

the new api of iter_test can only be initialized, iterated only once? And then it died? The only chance we reuse it is to restart the notebook? What...

floral sentinel Jun 3, 2024, 9:53 AM

#

proper ferry the new api of iter_test can only be initialized, iterated only once? And then i...

did you use predict function after iterating on a question?

#

cause it doesn't allow you to go next question without using that function to evaluate your answer

proper ferry Jun 4, 2024, 2:27 AM

#

Thank you, yes, I used predict. And my test result is: in one notebook, you can only initialize the env for one time, and only can iterate it for one time. The next iteration will get a null response.

floral sentinel Jun 4, 2024, 6:58 AM

#

proper ferry Thank you, yes, I used predict. And my test result is: in one notebook, you can ...

can you show me the code?

broken vale Jun 5, 2024, 2:11 AM

#

proper ferry Thank you, yes, I used predict. And my test result is: in one notebook, you can ...

yes, its true, you can only iterate over it once. you can make your own iterator for the train set and use that
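The "iterate once" behaviour is what a plain Python generator does, and the workaround is to build a fresh one whenever you need another pass. The sketch below uses a hypothetical `make_iter` stand-in; whether the aimo environment is implemented as a generator internally is an assumption.

```python
def make_iter(problems):
    """Hypothetical stand-in for env.iter_test(): a generator that can be consumed once."""
    for problem in problems:
        yield problem


train_problems = ["problem A", "problem B"]   # placeholder data

iter_test = make_iter(train_problems)
first_pass = list(iter_test)     # consumes the generator
second_pass = list(iter_test)    # already exhausted, so nothing comes out

fresh_pass = list(make_iter(train_problems))  # a new generator starts from the beginning

print(first_pass, second_pass, fresh_pass)
```

For local testing on the train set, wrapping the data in a factory like this avoids restarting the notebook between runs.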

proper ferry Jun 5, 2024, 7:53 AM

#

Yes, the only way to reuse the iterate environment is to restart the notebook LOL

fresh gust Jun 5, 2024, 10:34 AM

#

is it possible to submit notebook using GPU T4 x2 if my weekly quota for T4 x2 is finished?

floral sentinel Jun 5, 2024, 10:44 AM

#

fresh gust is it possible to submit notebook using GPU T4 x2 if my weekly quota for T4 x2 i...

yeah it's possible, cuz submission doesn't use ur quota

fresh gust Jun 5, 2024, 11:03 AM

#

@floral sentinel I used your notebook for submission as well, but the score I got was 18 not 20. What do you think is causing discrepancy in results?

floral sentinel Jun 5, 2024, 11:14 AM

#

fresh gust @floral sentinel I used your notebook for submission as well, but the score...

Yeah that's what I noticed as well. I submitted multiple times, and each time it gave different results; you can even check the version history of the notebook.

I think it's because of model generation. We use the "self-consistency" method: the model takes the problem, solves it n_repetitions times, and we take the most frequent answer.

Example

n_repetitions = 7

problem = {math problem}

answers = []

for _ in range(n_repetitions):
    output_text = model(problem)
    
    model_answer = process_output(output_text)

    answers.append(model_answer)

and let's say the answers list is [52, 52, 21, 222, -1, 52, 0 ]

from this answers list we see that 52 is the most consistent one, so we take that and submit it

#

so for my case when I submitted I got [52, 52, 21, 222, -1, 52, 0 ]

#

but when you submit for example ...you get [9, 9, 52, 52, 9, 0, 22], in which case 9 would be the most consistent one in your case

#

another factor would be top_p and temperature parameters

#

honestly deep-seek model is weird
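The voting step described above can be sketched with `collections.Counter`. Treating negative values such as -1 as failed generations, and falling back to 0 when nothing valid remains, are assumptions for this sketch rather than something taken from the notebook.

```python
from collections import Counter


def most_consistent(answers, fallback=0):
    """Return the most frequent valid answer, or `fallback` if none is valid."""
    # Drop failed generations (assumed here to be marked with negative values).
    valid = [a for a in answers if isinstance(a, int) and a >= 0]
    if not valid:
        return fallback
    return Counter(valid).most_common(1)[0][0]


run_a = [52, 52, 21, 222, -1, 52, 0]   # the first example list from the messages above
run_b = [9, 9, 52, 52, 9, 0, 22]       # the second example list

print(most_consistent(run_a), most_consistent(run_b))
```

Because each run samples different generations, the two lists can elect different winners, which is one source of the score differences between submissions.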

fresh gust Jun 5, 2024, 11:21 AM

#

I see I noticed that too. Maybe it's because of the random selection of next token, that everytime it runs, a different output is produced

#

also in your notebook the maximum time for notebook is set to 31500, which is equal to 8.75 hours. Is this the maximum time the notebook runs in submission? Have you inferred this, or is this mentioned in the competition or maybe it's something everyone knows? I'm new to competitions so still figuring this out.

#

Also, how would I know if my submission exhausted all the time allowed?

floral sentinel Jun 5, 2024, 12:19 PM

#

the reason for 8.75, is because maximum hours of notebook runtime that is allowed is 9 hours... that's why we used 31500 seconds, and in the code itself we put a condition that if it exceeds the limit, then just return 0, which means it will submit 0 as an answer

floral sentinel Jun 5, 2024, 12:20 PM

#

fresh gust also in your notebook the maximum time for notebook is set to 31500, which is eq...

the notebook runs in both submission and normal running mode

#

in "normal running" mode people set TIME_LIMIT = 1, basically just run the code quickly as possible so it can move into "submission" mode, where TIME_LIMIT becomes 31500

#

only "submission" or "private" mode matters for scoring
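A sketch of the time-budget pattern described here. The `KAGGLE_IS_COMPETITION_RERUN` check is the one used in notebooks shared in this channel; the helper `answer_with_budget` and its parameters are hypothetical.

```python
import os
import time

# Set during the scored rerun; unset in a normal interactive or saved run.
PRIVATE = bool(os.getenv("KAGGLE_IS_COMPETITION_RERUN"))
TIME_LIMIT = 31500 if PRIVATE else 1   # seconds; 31500 s = 8.75 h, under the 9 h cap

START = time.time()


def answer_with_budget(solve, problem, start=START, limit=TIME_LIMIT, now=time.time):
    """Call solve(problem) unless the time budget is spent, in which case return 0."""
    if now() - start > limit:
        return 0
    return solve(problem)
```

With `TIME_LIMIT = 1` in the normal run, almost every problem falls through to 0 right away, so the unscored run finishes quickly and uses little GPU quota.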

fresh gust Jun 5, 2024, 1:10 PM

#

floral sentinel in "normal running" mode people set TIME_LIMIT = 1, basically just run the code ...

Does this mean that in submission, the notebook first runs as not PRIVATE and then runs as PRIVATE?

#

And how do I check that if my notebook ran completely in less than 9 hours and didn't return answers as 0?

floral sentinel Jun 5, 2024, 1:15 PM

#

fresh gust Does this mean that in submission, the notebook first runs as not PRIVATE and th...

yeah

floral sentinel Jun 5, 2024, 1:15 PM

#

fresh gust And how do I check that if my notebook ran completely in less than 9 hours and d...

eeeh i guess the time the submission took

#

idk how else to check

fresh gust Jun 5, 2024, 1:15 PM

#

where does it show time?

floral sentinel Jun 5, 2024, 1:15 PM

#

after you click submit, 2 things run

#

one your notebook, other your submission

#

click on the submission one, and it shows for how long it was running

fresh gust Jun 5, 2024, 1:16 PM

#

in my submission TIME_LIMIT = 31500 if PRIVATE else 31500. This means that it will first completely run on train data (inside the 9 hour time limit) and then on test data?

floral sentinel Jun 5, 2024, 1:17 PM

#

fresh gust in my submission `TIME_LIMIT = 31500 if PRIVATE else 31500`. This means that it ...

it will run both 31500 for private and normal one

#

but you don't need to do 31500 for normal one tbh

#

that one doesn't score

fresh gust Jun 5, 2024, 1:18 PM

#

I was just using 31500 for normal to test prediction on training data. Didn't know it would affect the submission as well

floral sentinel Jun 5, 2024, 1:18 PM

#

oh

#

if you use 31500 for normal it will use your quota

#

gpu quota

fresh gust Jun 5, 2024, 1:19 PM

#

interesting

#

that's where my GPU quota has been going then

floral sentinel Jun 5, 2024, 1:19 PM

#

besides the score doesn't depend on normal one

floral sentinel Jun 5, 2024, 1:19 PM

#

fresh gust that's where my GPU quota has been going then

bruuuh

fresh gust Jun 5, 2024, 1:19 PM

#

floral sentinel besides the score doesn't depend on normal one

yep i understood that

#

this is my first time participating in competition so this was meant to be lol

#

but I still don't understand how to check time for the submission notebook

#

for example

#

this is my submission running

floral sentinel Jun 5, 2024, 1:21 PM

#

ok it's been running for 2h now

#

wait... why the normal one has been running for 2h?

#

you used 31500 there?

fresh gust Jun 5, 2024, 1:22 PM

#

floral sentinel it will run both 31500 for private and normal one

yes... TIME_LIMIT = 31500 if PRIVATE else 31500

#

that's why

floral sentinel Jun 5, 2024, 1:22 PM

#

ok i see

fresh gust Jun 5, 2024, 1:22 PM

#

floral sentinel ok it's been running for 2h now

but this is the normal one. what about PRIVATE?

#

or the entire submission?

floral sentinel Jun 5, 2024, 1:23 PM

#

private one is above

#

the one writing Scoring...

fresh gust Jun 5, 2024, 1:23 PM

#

it doesn't show time

floral sentinel Jun 5, 2024, 1:23 PM

#

it shows next to it

#

on the right

#

been running 2h as well

#

both start at the same time

#

but because you used 31500, it will run approx for 8.80 hours

#

for both

#

gotta wait 6.8 more hours

fresh gust Jun 5, 2024, 1:24 PM

#

floral sentinel but because you used 31500, it will run approx for 8.80 hours

the normal one will end before time reaches 31500

floral sentinel Jun 5, 2024, 1:24 PM

#

fresh gust the normal one will end before time reaches 31500

nope. it will take same time as the private one

fresh gust Jun 5, 2024, 1:25 PM

#

but it only has 10 problems to solve

floral sentinel Jun 5, 2024, 1:25 PM

#

oh yeah

#

you are right

fresh gust Jun 5, 2024, 1:25 PM

#

shouldn't it take 5 times less time?

floral sentinel Jun 5, 2024, 1:25 PM

#

yeah

#

forgot that normal one had 10

fresh gust Jun 5, 2024, 1:26 PM

#

but okay. when normal one ends, a new version is saved, for which I can see the time in version history

floral sentinel Jun 5, 2024, 1:26 PM

#

fresh gust but okay. when normal one ends, a new version is saved, for which I can see the ...

yeah but that time only for normal notebook

fresh gust Jun 5, 2024, 1:26 PM

#

LIKE THIS

#

floral sentinel Jun 5, 2024, 1:26 PM

#

sadly the private one doesn't show the run time

floral sentinel Jun 5, 2024, 1:26 PM

#

fresh gust

yup

fresh gust Jun 5, 2024, 1:26 PM

#

yes so 7353 was normal submission time last time i submitted

fresh gust Jun 5, 2024, 1:26 PM

#

floral sentinel sadly the private one doesn't show the run time

yep exactly what I was asking

floral sentinel Jun 5, 2024, 1:26 PM

#

fresh gust yes so 7353 was normal submission time last time i submitted

yeah

floral sentinel Jun 5, 2024, 1:27 PM

#

fresh gust yep exactly what I was asking

or maybe there is, but Idk how to check it

fresh gust Jun 5, 2024, 1:28 PM

#

floral sentinel or maybe there is, but Idk how to check it

exactly what's been bothering me lol. I don't even know if I've got 16 answers correct and 34 wrong, or if I got 16 correct, a few wrong, and a few just ran out of time so returned 0

floral sentinel Jun 5, 2024, 1:29 PM

#

fresh gust exactly what's been bothering me lol. I don't even know if I've got 16 answers c...

yeah that's problem with new api

#

someone probed the LB to get to 20 when they announced early prize

#

that's why they created this new submission system

fresh gust Jun 5, 2024, 1:31 PM

#

I see

#

maybe i'll just have to check manually then

#

by the way, while one of my submissions is still scoring, can I submit another one or do I have to wait for it to finish?

floral sentinel Jun 5, 2024, 1:33 PM

#

fresh gust by the way, while one of my submissions is still scoring, can I submit another o...

normally you could

#

but you have normal one running which uses GPU

#

so i guess u gotta wait for normal one to finish

#

but idk, you can try

#

try submitting another one

fresh gust Jun 5, 2024, 1:34 PM

#

I'll do that after the normal one has finished running. I'm already low on quota so I'll run accordingly

floral sentinel Jun 5, 2024, 1:35 PM

#

fresh gust I'll do that after normal one has finished running. I'll already low on quota so...

but this time do TIME_LIMIT = 31500 if PRIVATE else 1

#

it will take 370 seconds for normal one

fresh gust Jun 5, 2024, 1:35 PM

#

yea that's right

#

I will

#

thanks bro

floral sentinel Jun 5, 2024, 1:36 PM

#

fresh gust thanks bro

welcome bro

rocky kayak Jun 5, 2024, 10:40 PM

#

Do we know where the models get the solution wrong at the moment? Is it 1) at the beginning where it tries to understand the problem and lets say come up with necessary equations; 2) Follow on algebra or equation manipulation; 3) Calculating values; 4) or something else that I am not thinking of here?

candid holly Jun 6, 2024, 12:42 PM

#

floral sentinel

OK,thanks👍

marble dagger Jun 8, 2024, 7:47 AM

#

Does anyone know why the submission failed while the new version was still running?

#

valid shoal Jun 8, 2024, 4:11 PM

#

floral sentinel the reason for 8.75, is because maximum hours of notebook runtime that is allowe...

re: screenshot - do we still have submission.csv?

floral sentinel Jun 8, 2024, 5:14 PM

#

valid shoal re: screenshot - do we still have submission.csv?

no

#

new api evaluates the answer directly as far as I know

#

no need to create submission.csv

floral sentinel Jun 8, 2024, 5:19 PM

#

marble dagger Does anyone know why the submission failed while the new version was still running?

that's weird

floral sentinel Jun 8, 2024, 5:19 PM

#

marble dagger Does anyone know why the submission failed while the new version was still running?

can you check whether it used your remaining submissions?

#

maybe due to some connection error it might've not run

#

otherwise I don't know the reason

marble dagger Jun 9, 2024, 2:42 AM

#

https://www.kaggle.com/code/hanochhu/deepseek-math/notebook

Deepseek Math

#

import os

# KAGGLE_IS_COMPETITION_RERUN is only set during the scored rerun
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    PRIVATE = True
else:
    PRIVATE = False

if not PRIVATE:
    import pandas as pd

    class train_env():
        # Local stand-in for the aimo environment, backed by train.csv
        def __init__(self, randomize=False):
            self.randomize = randomize

            self.df = pd.read_csv('/kaggle/input/ai-mathematical-olympiad-prize/train.csv')
            self.df['ground_truth'] = self.df['answer']
            self.df['answer'] = -1

            if self.randomize:
                self.df = self.df.reset_index().sample(frac=1).reset_index(drop=True)

            self.predict_called = True
            self.counter = 0
            self.len = len(self.df)

        def iter_test(self):
            while self.counter < self.len:
                if self.predict_called:
                    self.predict_called = False
                    yield (self.df.loc[[self.counter]][['id', 'problem']]), (self.df.loc[[self.counter]][['id', 'answer']])
                else:
                    print("You must call `predict()` successfully before you can continue with `iter_test()`")
                    yield None

        def predict(self, answer):
            self.df.loc[self.counter, ('answer')] = answer['answer'].values[0]
            self.predict_called = True
            self.counter += 1

    env = train_env(randomize=True)
    iter_test = env.iter_test()

else:
    # Set up the evaluation API
    import aimo

    env = aimo.make_env()
    iter_test = env.iter_test()

# get_answer() is defined elsewhere in the notebook
for test, sample_submission in iter_test:
    problem: str = test['problem'].to_string(index=False)
    sample_submission['answer'] = get_answer(problem)
    env.predict(sample_submission)
    print(test)
    print(sample_submission, '\n')

#

About the submission part of my code: get_answer() returns a number, so there is nothing wrong, right?

floral sentinel Jun 10, 2024, 6:52 AM

#

marble dagger import os if os.getenv('KAGGLE_IS_COMPETITION_RERUN'): PRIVATE = True else:...

 
problem: str = test['problem'].to_string(index=False)
sample_submission['answer'] = get_answer(problem)
env.predict(sample_submission)

can you tell me why you are converting test["problem"] to string?

#

also it should be test['problem'].values[0] to return the problem string

marble dagger Jun 10, 2024, 9:54 AM

#

I think these two ways do the same thing?

marble dagger Jun 10, 2024, 9:57 AM

#

floral sentinel ``` problem: str = test['problem'].to_string(index=False) sample_submission['a...

Is that right? I'll try it

floral sentinel Jun 10, 2024, 10:14 AM

#

I did test['problem'].values[0] and it works just fine

#

this is the first time i see someone do str = test['problem'].to_string(index=False)

marble dagger Jun 10, 2024, 1:25 PM

#

I tried test['problem'].values[0], nothing changed

marble dagger Jun 10, 2024, 1:35 PM

#

floral sentinel I did `test['problem'].values[0]` and it works just fine

thanks for your time, maybe I should copy someone's notebook which submitted

floral sentinel Jun 10, 2024, 1:36 PM

#

marble dagger thanks for your time, maybe I should copy someone's notebook which submitted

wait hold on

#

get_answer() are you sure this function returns a number?

marble dagger Jun 10, 2024, 1:43 PM

#

floral sentinel wait hold on

yeah, I print it, get_answer() return a number.

#

there is something I didn't notice

floral sentinel Jun 10, 2024, 1:44 PM

#

that's just printing number

#

can you check the type of the output

#

type(get_answer(problem))

#

it might be returning string value instead of int
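A defensive way to rule that out is to coerce whatever `get_answer` returns before writing it into `sample_submission`. The helper `to_submission_int` is hypothetical, and the modulo-1000 step and the fallback of 0 assume the competition's non-negative integer answer format.

```python
def to_submission_int(value, fallback=0):
    """Coerce a model answer (int, float, or numeric string) to a plain int in [0, 999]."""
    try:
        number = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return fallback          # non-numeric text, None, NaN, or infinity
    if number < 0:
        return fallback          # negative values are treated as failed answers
    return number % 1000


for raw in ["52", 1052.0, "abc", None, -1]:
    print(repr(raw), "->", to_submission_int(raw), type(to_submission_int(raw)).__name__)
```

Printing a value hides its type, so `"52"` and `52` look identical in the output; checking `type(...)` or coercing like this removes the ambiguity.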

marble dagger Jun 10, 2024, 1:45 PM

#

When I use test['problem'].values[0], the version run failed

floral sentinel Jun 10, 2024, 1:45 PM

#

floral sentinel `type(get_answer(problem))`

ok do this

#

check the return type

#

I wanna make sure whether it returns string or no

#

without submitting

#

on your notebook, run the cell

marble dagger Jun 10, 2024, 1:48 PM

#

Ok I'll try it , but according to the code , it should be num

#

rare sandal Jun 12, 2024, 11:23 AM

#

The CV/LB correlation is totally not existent in this comp 😐

#

I get a question correct on the Colab L4 (answered correctly over multiple runs too)...then I try testing on Kaggle T4, it becomes a wrong answer. I set a seed too 🤣

rare sandal Jun 12, 2024, 11:42 AM

#

don't know what to trust at this point...made a change that is supposed to be a clear improvement

Gives 2 points boost on TWO different AIME datasets (both of them) - over multiple runs with random ordering, no difference. But the change drops the LB from 22 to 13 holyfuck

floral sentinel Jun 12, 2024, 2:46 PM

#

rare sandal don't know what to trust at this point...made a change that is supposed to be a ...

you using deepseek?

#

cause people in discussion mentioned that deepseek is unstable

#

some tried to fine tune it, but it made it worse

rare sandal Jun 12, 2024, 2:50 PM

#

floral sentinel you using deepseek?

Yes

#

I haven’t tried finetuning

floral sentinel Jun 12, 2024, 2:50 PM

#

Olga tsymboi decided to use deepseek

#

then the whole competition members followed her using the same model

#

I am really confused about why nobody has tried a new method or model for the past month and a half

#

maximum score was 27 for a whole freaking month

#

a whole month

#

0 change

rare sandal Jun 12, 2024, 2:52 PM

#

I did try other models released before Feb 23, but I couldn’t score more than 16 points

floral sentinel Jun 12, 2024, 2:52 PM

#

everybody was brute-forcing prompts on deepseek

floral sentinel Jun 12, 2024, 2:52 PM

#

rare sandal I did try other models released before Feb 23, but I couldn’t score more than 16...

what was the method?

#

or model

rare sandal Jun 12, 2024, 2:53 PM

#

If my team gets a good result in the private LB I will share it after the competition ends

floral sentinel Jun 12, 2024, 2:53 PM

#

rare sandal If my team gets a good result in the private LB I will share it after the compet...

fair enough

floral sentinel Jun 12, 2024, 2:54 PM

#

rare sandal If my team gets a good result in the private LB I will share it after the compet...

what's your team?

#

the name

rare sandal Jun 12, 2024, 2:54 PM

#

Rank 70 right now on public

floral sentinel Jun 12, 2024, 2:55 PM

#

frozen timtam

#

ooh you are from singapore

#

nice

rare sandal Jun 12, 2024, 3:21 PM

#

floral sentinel cause people in discussion mentioned that deepseek is unstable

Private LB will be a wild shakeup no doubt

floral sentinel Jun 12, 2024, 3:28 PM

#

rare sandal Private LB will be a wild shakeup no doubt

I like chaos

teal lake Jun 12, 2024, 3:53 PM

#

floral sentinel maximum score was 27 for a whole freaking month

what is it about ?

floral sentinel Jun 12, 2024, 4:20 PM

#

teal lake what is it about ?

ai math olympiad

teal lake Jun 12, 2024, 4:20 PM

#

what's it about

floral sentinel Jun 12, 2024, 4:21 PM

#

using LLMs to solve math questions

teal lake Jun 12, 2024, 4:21 PM

#

ohhk

floral sentinel Jun 12, 2024, 4:21 PM

#

basically public dataset has 50 questions

#

and maximum we got is 27 score

teal lake Jun 12, 2024, 4:21 PM

#

27/50 correct ?

floral sentinel Jun 12, 2024, 4:21 PM

#

or to be more precise a team got 27 score

floral sentinel Jun 12, 2024, 4:21 PM

#

teal lake 27/50 correct ?

yeah

#

for a whole month

#

that was the maximum score

#

no change at all

#

or more like 1 month and a half

teal lake Jun 12, 2024, 4:22 PM

#

LLMs and math don't get along well

#

so what are the rules ?

#

can use any LLM ,

floral sentinel Jun 12, 2024, 4:23 PM

#

teal lake LLM and math is something which doesn't get along well

yeah it's kinda sad...

floral sentinel Jun 12, 2024, 4:23 PM

#

teal lake so what are the rules ?

we are allowed to use models released before February 23rd

#

only

#

the thing is, they made this competition so they can motivate the research on improving the reasoning of LLMs

#

but I've barely seen any new research

teal lake Jun 12, 2024, 4:24 PM

#

i read Llemma paper where they claimed to have a model dedicated to math

floral sentinel Jun 12, 2024, 4:24 PM

#

people have just been using self-consistency prompting on deepseek, and that's it

floral sentinel Jun 12, 2024, 4:25 PM

#

teal lake i read Llemma paper where they claimed to have a model dedicated to math

is it released before february 23rd?

teal lake Jun 12, 2024, 4:25 PM

#

it was released in 2023

teal lake Jun 12, 2024, 4:26 PM

#

floral sentinel the thing is, they made this competition so they can motivate the research on im...

this is hard for individuals or beginners,

#

the whole reason behind LLMs being bad in math is tokenization strategy

floral sentinel Jun 12, 2024, 4:27 PM

#

teal lake the whole reason behind LLMs being bad in math is tokenization strategy

eehhh I doubt that's the case

teal lake Jun 12, 2024, 4:27 PM

#

https://huggingface.co/EleutherAI/llemma_7b

EleutherAI/llemma_7b · Hugging Face

floral sentinel Jun 12, 2024, 4:27 PM

#

I mean if we think about what LLMs are, they are just predicting what's the next word based on the current input sentence

#

they are good at generating text

#

it's missing the "reasoning" part

teal lake Jun 12, 2024, 4:28 PM

#

current LLMs can't do math because they can't see a number as a whole

#

which is a shortcoming of tokenization

floral sentinel Jun 12, 2024, 4:28 PM

#

teal lake current LLMs can't do math because they can't see the number as whole

oh yeah that one too

teal lake Jun 12, 2024, 4:28 PM

#

you can try tokenizing a random number

teal lake Jun 12, 2024, 4:28 PM

#

floral sentinel oh yeah that one too

its the main reason

floral sentinel Jun 12, 2024, 4:29 PM

#

nah I don't think

teal lake Jun 12, 2024, 4:29 PM

#

you can't add 12008 if you see it as 12 00 8
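
To make teal lake's point concrete, here is a toy sketch of how a subword tokenizer can split a number into pieces. The vocabulary and the greedy longest-match rule are assumptions chosen for illustration; real tokenizers (BPE and similar) use learned vocabularies and may split the same number differently.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest pieces found in vocab, scanning left to right."""
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabulary: single digits plus two multi-digit pieces.
TOY_VOCAB = set("0123456789") | {"12", "00"}

print(greedy_tokenize("12008", TOY_VOCAB))  # ['12', '00', '8']
```

With this toy vocabulary the model never receives 12008 as one unit, only the three fragments.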

floral sentinel Jun 12, 2024, 4:29 PM

#

even if they saw numbers

#

there is still the "logical" or "reasoning" part to perform action based on the input

#

which doesn't exist

#

yet

floral sentinel Jun 12, 2024, 4:29 PM

#

teal lake you can't add 12008 if you see it as 12 00 8

yeah I get what you are saying

teal lake Jun 12, 2024, 4:30 PM

#

you can do all the reasoning, but when it comes to doing add / sub / mul / div it's wasted

floral sentinel Jun 12, 2024, 4:30 PM

#

but tokenization isn't the only thing holding back the LLMs when it comes to logical problem solving

teal lake Jun 12, 2024, 4:30 PM

#

yes

#

i'm talking about math part

#

it can reason a question very well

#

but can't do it by itself

floral sentinel Jun 12, 2024, 4:31 PM

#

how

#

and wdym by "reason a question very well", you mean they understand it?

teal lake Jun 12, 2024, 4:32 PM

#

self.high = 0 + 5 * 16 // 7 - 1 = 11

teal lake Jun 12, 2024, 4:32 PM

#

floral sentinel and wdym by "reason a question very well", you mean they understand it?

construct a logic for it

floral sentinel Jun 12, 2024, 4:32 PM

#

PEMDAS rule

teal lake Jun 12, 2024, 4:33 PM

#

floral sentinel PEMDAS rule

explain

#

and i only know bodmas

#

it's the standard, I believe

#

but here ans should be 10
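
Evaluating the expression with Python's own precedence rules confirms the value of 10. The variable names below are illustrative only.

```python
# Multiplication and floor division share a precedence level and run left to right,
# before the additions and subtractions.
product = 5 * 16            # 80
quotient = product // 7     # 11 (80 / 7 is about 11.43, floored)
result = 0 + quotient - 1   # 10

print(product, quotient, result)
print(0 + 5 * 16 // 7 - 1)  # same expression evaluated in one go
```

The 11 in the quoted line matches the intermediate quotient, before the final `- 1` is applied.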

floral sentinel Jun 12, 2024, 4:38 PM

#

0 + 5 * 16 / 7 - 1
According to the PEMDAS rule:
Step one: check whether there are any parentheses.
Checking...
No.
Step two: check whether there are any exponents.
Checking...
No.
Step three: check whether there is any multiplication.
Checking...
Found "5 * 16".
Solving:
ans = 80 -> save answer in memory;
current equation = "0 + 80 / 7 - 1"
Equation changed... go back to step one...
This keeps looping until the equation is fully reduced

#

and the final answer is found

#

this is the logic behind solving a simple equation
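
The loop described above can be sketched as a small reducer. This is a simplified illustration, not anyone's competition code: it assumes a space-separated expression of integers and binary operators, handles neither parentheses nor exponents, and the function name `reduce_step_by_step` is made up for this example.

```python
import operator

OPS = {
    "*": operator.mul,
    "/": operator.truediv,
    "//": operator.floordiv,
    "+": operator.add,
    "-": operator.sub,
}
# Higher-precedence operators first; within a level, the leftmost one is reduced first.
PRECEDENCE_LEVELS = [("*", "/", "//"), ("+", "-")]


def reduce_step_by_step(expression):
    """Reduce one operator at a time and return (final value, list of intermediate forms)."""
    values = [tok if tok in OPS else int(tok) for tok in expression.split()]
    steps = []
    while len(values) > 1:
        idx = None
        for level in PRECEDENCE_LEVELS:
            idx = next(
                (i for i, tok in enumerate(values) if isinstance(tok, str) and tok in level),
                None,
            )
            if idx is not None:
                break
        left, op, right = values[idx - 1], values[idx], values[idx + 1]
        values[idx - 1:idx + 2] = [OPS[op](left, right)]
        steps.append(" ".join(str(v) for v in values))
    return values[0], steps


value, steps = reduce_step_by_step("0 + 5 * 16 // 7 - 1")
for step in steps:
    print(step)
print("answer:", value)
```

Each pass rewrites the expression once and starts over, which mirrors the "equation changed, go back to step one" description.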

#

The human mind understands rules and applies them step by step. We don't think "what's the next token that comes" after seeing the equation

#

@teal lake

#

LLM would see that equation and output the token that comes after with the highest probability. That's why I said there is no reasoning

#

it doesn't think

teal lake Jun 12, 2024, 4:41 PM

#

i will have a hard time but let me find the paper

floral sentinel Jun 12, 2024, 4:41 PM

#

alright

#

but you know, thinking logically and step by step about the equation you gave...
I think human brain works like a tree graph

#

somehow I understood that the equation was math, so my brain went to a "math" node, and there are branches of mathematics that I know based on my experience, for example:
"Calculus, Algebra, Arithmetic...etc", and I picked the topic related to the equation, and went to that node, and from that node more branches came...

#

and picked one that is closest to the problem aka "high probability", and extracted the steps from there and applied it step by step until the problem is solved.

teal lake Jun 12, 2024, 4:46 PM

#

if you don't have time , just jump to page 36
https://arxiv.org/pdf/2406.02061

floral sentinel Jun 12, 2024, 4:46 PM

#

so I think instead of training LLMs to solve math, I think we should focus on a neural Graph tree model something like that, that actually maps the experience as a tree and branches

teal lake Jun 12, 2024, 4:47 PM

#

floral sentinel so I think instead of training LLMs to solve math, I think we should focus on a ...

i believe people have already tried , but you can try if you think it can make it good in math

#

i don't have much idea about graphs

floral sentinel Jun 12, 2024, 4:48 PM

#

teal lake i believe people have already tried , but you can try if you think it can make i...

can you show me if anything has already been done?

floral sentinel Jun 12, 2024, 4:48 PM

#

teal lake if you don't have time , just jump to page 36 https://arxiv.org/pdf/2406.02061

#

that doesn't necessarily mean they are able to think

teal lake Jun 12, 2024, 4:49 PM

#

llama also did so did mistral

floral sentinel Jun 12, 2024, 4:49 PM

#

they just brute forced the knowledge into LLMs

floral sentinel Jun 12, 2024, 4:49 PM

#

teal lake llama also did so did mistral

but it says here LLama and claude gave wrong answers...

teal lake Jun 12, 2024, 4:50 PM

#

floral sentinel but it says here LLama and claude gave wrong answers...

check below as well

teal lake Jun 12, 2024, 4:50 PM

#

floral sentinel that doesn't necessarily mean they are able to think

then how do you think of it

floral sentinel Jun 12, 2024, 4:50 PM

#

yeah it still doesn't prove anything

teal lake Jun 12, 2024, 4:50 PM

#

floral sentinel can you show me if there are already anytihng done?

i don't think people will show failure

floral sentinel Jun 12, 2024, 4:51 PM

#

teal lake then how do you think of it

as I said, they are just next token generators

teal lake Jun 12, 2024, 4:51 PM

#

yes

floral sentinel Jun 12, 2024, 4:51 PM

#

teal lake i don't think people will show failure

fair enough, but i thought maybe there were some works done and published

#

if you ask me

#

the closest thing to reasoning that was done was Deepmind's alpha geometry

#

they used neurosymbolic approach

#

https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/

Google DeepMind

AlphaGeometry: An Olympiad-level AI system for geometry

Our AI system surpasses the state-of-the-art approach for geometry problems, advancing AI reasoning in mathematics

#

I still believe reasoning was used here, unlike with pure LLMs

teal lake Jun 12, 2024, 4:54 PM

#

chatgpt solved this question when prompted correctly
think step by step and reason before giving answer and consider all the information we gathered

#

https://chatgpt.com/share/5bd33efc-dc99-47d6-af22-83459c91ce03

ChatGPT

A conversational AI system that listens, learns, and challenges

#

can't prove more than this now

floral sentinel Jun 12, 2024, 4:55 PM

#

teal lake https://chatgpt.com/share/5bd33efc-dc99-47d6-af22-83459c91ce03

dude

#

how long was chatgpt trained for?

#

besides how many datasets it's seen?

#

how do I explain this

teal lake Jun 12, 2024, 4:55 PM

#

is that the question here

floral sentinel Jun 12, 2024, 4:55 PM

#

what I'm trynna say is

teal lake Jun 12, 2024, 4:55 PM

#

you mean chatGPT might have encountered this passage before

floral sentinel Jun 12, 2024, 4:55 PM

#

chatgpt is overfit

floral sentinel Jun 12, 2024, 4:56 PM

#

teal lake you mean chatGPT might have encountered this passage before

yeah

#

and trained on it multiple times

#

it's memorized the problems

#

doesn't necessarily mean it understands them

teal lake Jun 12, 2024, 4:56 PM

#

why didn't it answer in first few prompts then ?

floral sentinel Jun 12, 2024, 4:56 PM

#

so it's no surprise that they are able to solve typical stuff

floral sentinel Jun 12, 2024, 4:56 PM

#

teal lake why didn't it answer in first few prompts then ?

which prompts?

teal lake Jun 12, 2024, 4:57 PM

#

floral sentinel so it's no surprise that they are able to solve typical stuff

i only wanted to show that they can do reasoning ,

#

its not typical generation of text

floral sentinel Jun 12, 2024, 4:57 PM

#

teal lake i only wanted to show that they can do reasoning ,

they are faking the "reasoning"

#

there is no reasoning in LLMs

teal lake Jun 12, 2024, 4:59 PM

#

LLMs are not brains to begin with, just some random numbers and calculations

#

if you think that way then no LLM can do reasoning

floral sentinel Jun 12, 2024, 5:00 PM

#

wait I'm cooking something

#

lemme show u

#

here are the questions

#

I copied first 3 questions to Chatgpt to solve

#

it got all 3 wrong

#

https://chatgpt.com/share/e98ea5e1-02e2-4896-9711-c5efbfc714e1

ChatGPT

A conversational AI system that listens, learns, and challenges

#

The answers should be as seen in the table:
52, 250, 702

floral sentinel Jun 12, 2024, 5:02 PM

#

teal lake LLMs are not brain to begin with, just some random numbers and calculations

just like any ANN models out there

floral sentinel Jun 12, 2024, 5:03 PM

#

teal lake its not typical generation of text

bruh.. they are literally decoders whose only purpose is to generate token correctly

teal lake Jun 12, 2024, 5:03 PM

#

floral sentinel The answers should be as seen in the table: 52, 250, 702

i'm not good at math, does the reasoning sound good to you?

floral sentinel Jun 12, 2024, 5:03 PM

#

teal lake i'm not good in math , does the reasoning sounds good to you ?

you can fake the steps

#

or what do you call it

#

"hallucinate"

teal lake Jun 12, 2024, 5:03 PM

#

floral sentinel bruh.. they are literally decoders whose only purpose is to generate token corre...

then why are you hoping for reasoning

floral sentinel Jun 12, 2024, 5:03 PM

#

but it really is useless when the steps mean nothing

floral sentinel Jun 12, 2024, 5:03 PM

#

teal lake then why are you hoping for reasoning

yeah....

teal lake Jun 12, 2024, 5:03 PM

#

floral sentinel but it really is uselss when steps mean nothing

but how are the steps

floral sentinel Jun 12, 2024, 5:04 PM

#

teal lake but how are the steps

idk lol, i didn't read them

teal lake Jun 12, 2024, 5:04 PM

#

this is not how you should argue 🙃

floral sentinel Jun 12, 2024, 5:04 PM

#

teal lake this not how you should argue 🙃

bruh

#

i am not arguing, I'm just trynna prove my point

#

sure, with a little prompting I can make it get better answers

teal lake Jun 12, 2024, 5:04 PM

#

its the same

floral sentinel Jun 12, 2024, 5:05 PM

#

but what I'm saying is LLMs don't have reasoning

#

they can't reason

teal lake Jun 12, 2024, 5:05 PM

#

floral sentinel sure,, with little prompting I can make it get better answers

then isn't it like saying earth is flat

floral sentinel Jun 12, 2024, 5:05 PM

#

teal lake then isn't it like saying earth is flat

😐

#

what

teal lake Jun 12, 2024, 5:05 PM

#

if it can answer then its fine

floral sentinel Jun 12, 2024, 5:05 PM

#

teal lake if it can answer then its fine

bruh

#

dude

#

it's not about giving answer

teal lake Jun 12, 2024, 5:05 PM

#

how do you define reasoning

floral sentinel Jun 12, 2024, 5:05 PM

#

it's about

#

giving right answer

teal lake Jun 12, 2024, 5:06 PM

#

really ?

floral sentinel Jun 12, 2024, 5:06 PM

#

what's the point of giving a wrong answer

floral sentinel Jun 12, 2024, 5:06 PM

#

teal lake really ?

i don't need LLM to give me answer

#

I can literally just code a function to output random number

#

there it gives answer

#

does it mean it's able to reason now?

teal lake Jun 12, 2024, 5:07 PM

#

floral sentinel does it mean it's able to reason now?

first of all, answer doesn't mean anything

floral sentinel Jun 12, 2024, 5:07 PM

#

teal lake how do you define reasoning

using logic to solve problem

teal lake Jun 12, 2024, 5:07 PM

#

reasoning is steps to reach there

floral sentinel Jun 12, 2024, 5:07 PM

#

teal lake first of all, answer doesn't mean anything

well honestly in solving math competition it does matter

#

cause we need right answers

teal lake Jun 12, 2024, 5:07 PM

#

floral sentinel using logic to solve problem

that's what i've been saying from the beginning

floral sentinel Jun 12, 2024, 5:07 PM

#

teal lake thats what i'm saying from the beginnning

but there is difference between correct reasoning and wrong reasoning

#

or rather "random reasoning"

#

LLMs do random reasoning

teal lake Jun 12, 2024, 5:08 PM

#

floral sentinel but there is difference between correct reasoning and wrong reasoning

i'm asking for reasoning only , is it correct or not

floral sentinel Jun 12, 2024, 5:08 PM

#

they don't even do reasoning at all honestly

teal lake Jun 12, 2024, 5:08 PM

#

don't see answer

floral sentinel Jun 12, 2024, 5:08 PM

#

no it doesn't do reason

#

it just generates text

#

based on what it believes is the correct token

teal lake Jun 12, 2024, 5:08 PM

#

i give up

#

you win

floral sentinel Jun 12, 2024, 5:11 PM

#

teal lake you win

bruuh

#

it's not about me winning or losing

#

we got a problem now

#

there are 50 math questions

#

people were only able to solve 27 of them

#

for a month now, nothing changed

#

how the hell we gonna increase the score

#

15 days left for competition...

teal lake Jun 12, 2024, 5:22 PM

#

prompting is all you can do now
time's up for experimenting with architecture

marble dagger Jun 13, 2024, 8:51 AM

#

27/50 is good enough for a 7B model

floral sentinel Jun 13, 2024, 10:00 AM

#

marble dagger 27/50 is good enough for a 7B model

i want 50/50

marble dagger Jun 13, 2024, 10:31 AM

#

model=Anren()

marble dagger Jun 13, 2024, 11:51 AM

#

@floral sentinel I had set TIME_LIMIT in the loop, so why does the submission still time out? so confused

floral sentinel Jun 13, 2024, 11:52 AM

#

one moment

floral sentinel Jun 13, 2024, 11:52 AM

#

marble dagger @floral sentinel I had setted TIME_LIMIT in the loop why the submit still ...

how long did your submission take?

marble dagger Jun 13, 2024, 11:52 AM

#

10h

#

no

#

How can I check the time? The time limit is 9 h, and it started 10 h ago, so it's over 9
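
The cause of this timeout is never identified in the chat. One common pitfall, shown here as a hedged sketch, is that a time check only runs between generations, so the budget has to be split per problem and leave slack for the last generation. `run_with_budget`, `solve_once`, and the `clock` parameter are hypothetical names for this example, not the notebook's API.

```python
import time


def run_with_budget(problems, solve_once, n_reps, time_limit_s, clock=time.monotonic):
    """Give each problem an equal share of the remaining time and stop sampling when it is used up."""
    start = clock()
    all_candidates = []
    for idx, problem in enumerate(problems):
        remaining = max(time_limit_s - (clock() - start), 0.0)
        per_problem = remaining / (len(problems) - idx)
        problem_start = clock()
        candidates = []
        for _ in range(n_reps):
            # The check happens only between generations: one slow generation
            # can still overrun, so time_limit_s should sit below the hard limit.
            if clock() - problem_start >= per_problem:
                break
            candidates.append(solve_once(problem))
        all_candidates.append(candidates)
    return all_candidates


# Trivial usage with a dummy solver that always answers 0.
print(run_with_budget(["p1", "p2"], lambda problem: 0, n_reps=3, time_limit_s=60.0))
```

Elapsed time can be logged the same way, by printing `clock() - start` after each problem.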

floral sentinel Jun 13, 2024, 11:56 AM

#

marble dagger @floral sentinel I had setted TIME_LIMIT in the loop why the submit still ...

is this based off Lewis' notebook?

#

yeah it's same

#

idk

#

couldn't figure out the problem

#

you writing code from scratch?

marble dagger Jun 13, 2024, 11:58 AM

#

Fine, I'll reduce n_reps

#

No, I copied it from someone

floral sentinel Jun 13, 2024, 11:59 AM

#

marble dagger No I copy from some one

bruh...

#

why u getting errors then

marble dagger Jun 13, 2024, 11:59 AM

#

But the code is quite easy

floral sentinel Jun 13, 2024, 12:00 PM

#

floral sentinel

honestly I don't like this method of solving math problem

#

the api and predictions are mixed with the LLM solving

#

hard to debug

#

I'd rather create a function that takes problem as input and outputs answer.

floral sentinel Jun 13, 2024, 12:01 PM

#

marble dagger @floral sentinel I had setted TIME_LIMIT in the loop why the submit still ...

way better than doing this

marble dagger Jun 13, 2024, 12:02 PM

#

Idk what you really mean

#

is there an open notebook that is as good as you say?

floral sentinel Jun 13, 2024, 12:04 PM

#

marble dagger is there a open code which is good as you say

yeah I created an organized notebook

#

wait lemme send it

#

https://www.kaggle.com/code/anrenk/aimo-llm-class-deepseek

AIMO LLM Class DeepSeek

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

#

it's buried down in the code section

marble dagger Jun 13, 2024, 12:23 PM

#

Elegant! So you mean I should put this code in predict().

#

I did make a function: the problem is the input and the answer is the output

floral sentinel Jun 13, 2024, 12:45 PM

#

marble dagger Elegant! so you mean I should put these code in the predict().

yeah kinda

floral sentinel Jun 13, 2024, 12:45 PM

#

marble dagger Elegant! so you mean I should put these code in the predict().

i mean it's already in predict

rare sandal Jun 14, 2024, 3:12 AM

#

I changed the seed and my 22 points submission dropped to 16 points 🤣

rancid compass Jun 14, 2024, 4:05 AM

#

hi folks, in submission tab it showed throwing exception, however clicking into the notebook it said successfully ran in 132s. Is the submission successful?

serene fiber Jun 14, 2024, 4:29 AM

#

rare sandal I changed the seed and my 22 points submission dropped to 16 points 🤣

If your temperature was high then it's expected

rare sandal Jun 14, 2024, 5:37 AM

#

serene fiber If your temperature was high then it's expected

I haven't tried to modify this temperature, top_p setting, all my tests are with 0.9 and 1.0

proper ferry Jun 14, 2024, 7:05 AM

#

what do the temperature and top-k parameters mean?

serene fiber Jun 14, 2024, 10:18 AM

#

rare sandal I haven't tried to modify this temperature, top_p setting, all my tests are with...

temp = 0.9 will give highly stochastic responses, no wonder the results change

rare sandal Jun 14, 2024, 11:07 PM

#

floral sentinel then the whole competition members followed her using the same model

Maybe because it was even pinned on the discussion forum holyfuck

floral sentinel Jun 15, 2024, 5:33 PM

#

rare sandal Maybe because it was even pinned on the discussion forum holyfuck

Nah, people always choose the easy path

marble dagger Jun 16, 2024, 2:39 AM

#

@floral sentinel have you tested how much time your code takes?
I mean, you use 19 repetitions per question and there are 50 questions. Is 31500s enough to generate them all?
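
The budget in marble dagger's question works out as follows, using only the numbers quoted in the message (19 repetitions, 50 questions, 31500 s). The variable names are illustrative.

```python
total_seconds = 31_500
n_questions = 50
n_reps = 19

generations = n_questions * n_reps                   # 950 generations in total
seconds_per_generation = total_seconds / generations

print(f"{generations} generations, about {seconds_per_generation:.1f} s each")
```

So every generation, including any code execution, would need to finish in roughly 33 seconds on average for all repetitions to fit.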

silk sandal Jun 16, 2024, 9:48 AM

#

People reported that changing the seed leads to a different score. But I have reasons to believe that even with the same seed (submitting the same notebook twice) one can get a different score. It happened to me. Did this happen to you?

rare sandal Jun 16, 2024, 10:17 AM

#

+- 1 for me

#

but I would be scared also...my best seed for validation is not the same as my best seed for public LB

#

also when comparing different AIME sets...all different optimal seed 🥴

#

In other words, I tested 3 seeds on public LB, AIME old, AIME new datasets

The first seed is best for public LB
The second seed is best for AIME old
The third seed is best for AIME new
and the gap is not small

won't give more details until the comp finishes 😅

floral sentinel Jun 17, 2024, 3:03 PM

#

marble dagger @floral sentinel have you tested that the time that your code spend? I ...

no, i didn't test

marble dagger Jun 17, 2024, 3:24 PM

#

floral sentinel no, i didn't test

I think there are some issues in the early-stop strategy and the decoding strategy in your notebook

floral sentinel Jun 17, 2024, 3:25 PM

#

marble dagger I think there are some issue in early stop strategy and decode strategy in your ...

explain more in detail

#

wdym by decode strategy?

marble dagger Jun 17, 2024, 3:29 PM

#

this part will skip several loop iterations, except the first

#

and about decoding the right answer from the text or from code_output: I think the code output is more convincing, and the answer from the text should be dropped. But you take them both

floral sentinel Jun 17, 2024, 3:37 PM

#

marble dagger and about decode the right answer from text or code_output . I think the code ou...

Honestly I don't understand 50% of code myself, as I mentioned, I just organized people's forked notebooks and added them together in a nice way

#

about code output and text output

#

sometimes text output gives right answer

#

that's why people consider taking both answers

floral sentinel Jun 17, 2024, 3:39 PM

#

marble dagger these part will skip top several loop except first

the reason why abdulrafea did that "skipping" is because notebooks threw exception errors when submitting

#

to prevent that, he added skipping thingy that u see

marble dagger Jun 17, 2024, 3:39 PM

#

for this function, if the answer is wrapped in \boxed{}, it will be counted twice. That makes no sense
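
One way to express marble dagger's suggestion is to give each repetition a single vote, preferring the code output and falling back to the text answer only when no code output exists. This is a sketch of that idea, not the notebook's actual function; `pick_vote` and `majority_answer` are hypothetical names, and whether dropping text answers helps the score is not settled in the chat.

```python
from collections import Counter


def pick_vote(code_answer, text_answer):
    """One vote per repetition: code output if present, otherwise the boxed text answer."""
    return code_answer if code_answer is not None else text_answer


def majority_answer(repetitions, default=0):
    """repetitions is a list of (code_answer, text_answer) pairs; None means 'no answer'."""
    votes = Counter()
    for code_answer, text_answer in repetitions:
        vote = pick_vote(code_answer, text_answer)
        if vote is not None:
            votes[vote] += 1
    if not votes:
        return default
    return votes.most_common(1)[0][0]


# The first repetition produced 52 in both places; it still counts once.
reps = [(52, 52), (7, None), (7, None)]
print(majority_answer(reps))  # 7
```

If the first repetition were counted twice, 52 and 7 would tie at two votes each, which is the distortion being pointed out.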

floral sentinel Jun 17, 2024, 3:41 PM

#

marble dagger for this function , if the answer was covered by \\boxed{}, the answer will be c...

alright then test out your ideas and see if they work

marble dagger Jun 17, 2024, 3:41 PM

#

floral sentinel the reason why abdulrafea did that "skipping" is because notebooks threw excepti...

answer from text is unreliable

floral sentinel Jun 17, 2024, 3:42 PM

#

bro, the model itself is not reliable, let alone the answer xD

marble dagger Jun 17, 2024, 3:43 PM

#

floral sentinel the reason why abdulrafea did that "skipping" is because notebooks threw excepti...

there is another early stop in the code

marble dagger Jun 17, 2024, 3:44 PM

#

floral sentinel bro, the model itself is not reliable, let alone the answer xD

I mean it's not reliable to count on the LLM

#

to calculate

rare sandal Jun 18, 2024, 1:39 PM

#

Great work holyfuck

serene fiber Jun 18, 2024, 1:46 PM

#

rare sandal Great work holyfuck

Happens, had such experience with gpt and Gemini as well

floral sentinel Jun 18, 2024, 3:15 PM

#

rare sandal

what the hell...

serene fiber Jun 18, 2024, 6:37 PM

#

floral sentinel what the hell...

The problem happens with basic algebraic calculations, and it's universal across multiple LLMs

floral sentinel Jun 18, 2024, 6:52 PM

#

serene fiber The problem happens with basic algebraic calculations, and it's universal across...

we gonna fix that one day

serene fiber Jun 18, 2024, 6:54 PM

#

Obviously, there are a lot of ways to fix that, but better not to discuss it right now 😅😅

floral sentinel Jun 18, 2024, 6:55 PM

#

serene fiber Obviously, a lot of ways to fix that better not to discuss it right now 😅😅

why

#

i think it's perfect time to discuss that

#

someone got 28 score

#

wow it took god damn 1 month and a half to get +1

#

bruuhu

serene fiber Jun 18, 2024, 7:07 PM

#

floral sentinel i think it's perfect time to discuss that

Lol if I do that it's like talking about my strategy, would not like to drop my lb rankings

floral sentinel Jun 18, 2024, 9:46 PM

#

serene fiber Lol if I do that it's like talking about my strategy, would not like to drop my ...

oh

floral sentinel Jun 18, 2024, 9:46 PM

#

serene fiber Lol if I do that it's like talking about my strategy, would not like to drop my ...

what's your score

serene fiber Jun 18, 2024, 9:56 PM

#

24th on LB right now

floral sentinel Jun 18, 2024, 10:10 PM

#

serene fiber 24th on LB right now

cool

rare sandal Jun 19, 2024, 11:48 AM

#

serene fiber The problem happens with basic algebraic calculations, and it's universal across...

It's not just math, it happens in text generation with lots of small LLMs. I have seen Llama 2 7b do that at work, it's frustrating I know holyfuck

I haven't seen GPT do anything similar

rare sandal Jun 19, 2024, 11:59 AM

#

floral sentinel i think it's perfect time to discuss that

discuss in 9 days after competition close 😅

silk sandal Jun 20, 2024, 3:31 PM

#

serene fiber temp = 0.9 will give highly stochastic responses, no wonder with the change in r...

Then what is a good value for temperature? I experimented even with 0.1 but the score got worse:)

#

By the way, did anyone try using two llms, like deepseek-math 7b-rl+deepseek-coder2b?

#

I could not get it running

serene fiber Jun 20, 2024, 3:36 PM

#

silk sandal Then what is a good value for temperature? I experimented even with 0.1 but the ...

The higher the temp, the higher the stochasticity, and maybe even the quality for some tasks.

Again, commenting on specifics without knowing exactly what you did would be hard, and it is also not the right time to do so

silk sandal Jun 20, 2024, 3:38 PM

#

Hi, thanks for your reply. If I get it right, no more idea sharing here until the competition ends?

serene fiber Jun 20, 2024, 3:39 PM

#

silk sandal Hi, thanks for your reply. If I get it right, no more idea sharing here until th...

I am not sure about others, but yeah, at least from my end.
I'd certainly be up for open discussions, but anything directly related to the competition would be hard

dusky narwhal Jun 20, 2024, 4:02 PM

#

silk sandal Then what is a good value for temperature? I experimented even with 0.1 but the ...

the more samples you take from the model, the higher the temperature should be, as otherwise all the samples will be too similar

#

e.g. when models are evaluated on a single sample (maj@1 aka pass@1) usually greedy decoding is used (temperature 0)
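
The effect of temperature that dusky narwhal describes can be sketched with a plain softmax over made-up logits. The logit values are arbitrary and the function name is illustrative; real inference libraries apply the same scaling internally before sampling.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if not temperature > 0:
        raise ValueError("temperature must be positive; greedy decoding is the limit as it approaches 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # arbitrary scores for three candidate tokens
for temperature in (0.1, 0.9, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature nearly all probability sits on the top token, so repeated samples look alike; at higher temperatures the samples diverge, which is what majority voting over many samples relies on.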

#

unless you're actually in the running for a prize (or gold medal, I suppose), I don't see why you wouldn't share anything

serene fiber Jun 20, 2024, 4:08 PM

#

dusky narwhal unless you're actually in the running for a prize (or gold medal, I suppose), I ...

Obv, I am into Kaggle with some aim (ranking matters for me), for learning purposes I am already pursuing a PhD in the same field

rare sandal Jun 21, 2024, 2:46 AM

#

dusky narwhal unless you're actually in the running for a prize (or gold medal, I suppose), I ...

Because I believe there will be a huge shakeup and can see stable 21-22 scores move all the way up to gold

#

Maybe even 20

#

with some luck obviously

serene fiber Jun 21, 2024, 2:55 AM

#

rare sandal Because I believe there will be a huge shakeup and can see stable 21-22 scores m...

For some lucky submissions yea eg: In one case we were lucky to get 22, after that for that submission it dropped to 17-18 and constantly remained around that. But for those who are not just dependent on the prompt, temp or any hyperparams things should be fine I guess

rare sandal Jun 21, 2024, 2:59 AM

#

Mine is very dependent on the seed, sadly

#

Stable across same seed, but if I change the seed it drops a lot

serene fiber Jun 21, 2024, 3:02 AM

#

Kinda hard to comment, could get lucky as well.

Btw one of our submissions is with that hope of luck.

#

My intuition says it should be fine as the code is already executed for all 100 questions, and as you got lucky in those 50, in case of the same distribution the remaining 50 should also give similar results.

If it was Kaggle running the code again, then it would have been a big red flag

vital cave Jun 21, 2024, 3:42 AM

#

The final subs you select will run over the new 50 only, so two of your submissions will only re-run once.
Shake is coming. I hope I can keep a good rank in the end 😅

serene fiber Jun 21, 2024, 3:50 AM

#

vital cave The final subs you select will run over the new 50 only, so two of your submiss...

I am new to Kaggle but afaik they won't rerun your code, at least they didn't in the competition that I participated in earlier, and the same was implied in one of the discussions as well.

But yea if they do, shake is definitely coming, particularly from the score of 25

vital cave Jun 21, 2024, 3:50 AM

#

"Because of the limited number of problems available, we are taking special precautions to secure the test set against probing. Among other things, during the submission period the test set will comprise only the 50 public set problems. Once the competition ends, when we rerun submissions, the test set will comprise only the 50 private set problems. You should attempt to make sure your submission will complete successfully on the 50 new private set problems. This may mean ensuring your submission is robust to unexpected inputs, or managing runtime and memory usage."
This is from the data section

serene fiber Jun 21, 2024, 3:51 AM

#

Oh ok posted from the competition's page?

vital cave Jun 21, 2024, 3:51 AM

#

My biggest biggest fear is that for some reason my subs fail (hit errors that I didn't catch during validation or public subs)

vital cave Jun 21, 2024, 3:51 AM

#

serene fiber Oh ok posted from the competition's page?

from data section

serene fiber Jun 21, 2024, 3:52 AM

#

Oh ok my bad, I didn't read that, then yea seems like private lb will be fun

vital cave Jun 21, 2024, 3:54 AM

#

yes a scary movie is coming within 7 days 😄 , I can handle any score drop - I am preparing myself for this , but I can't handle failed subs ! it will be a total waste!

serene fiber Jun 21, 2024, 3:55 AM

#

vital cave yes a scary movie is coming within 7 days 😄 , I can handle any score drop - I a...

It won't drop much, maybe ±3-4 questions, but yea the lb can completely get flipped
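A rough sanity check on the "±3-4 questions" estimate, as a minimal sketch: it assumes each of the 50 problems is an independent pass/fail trial with the same solve probability, which is a simplification and not something the competition states. Under that assumption the score is binomial and its standard deviation is sqrt(n·p·(1−p)). The function name `score_std` is only illustrative.

```python
import math


def score_std(n_problems: int, p_solve: float) -> float:
    """Standard deviation of a binomial score over n_problems independent problems."""
    return math.sqrt(n_problems * p_solve * (1.0 - p_solve))


# Assumption: a public score of 25/50 is used as the per-problem solve rate.
std = score_std(50, 25 / 50)
print(f"std = {std:.2f} problems")
```

With a solve rate near one half, the spread comes out at roughly three to four problems, which is in line with the estimate above.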

vital cave Jun 21, 2024, 3:56 AM

#

doesn't matter , my fear comes from this line in the section I shared : "This may mean ensuring your submission is robust to unexpected inputs" what unexpected inputs 😄

serene fiber Jun 21, 2024, 3:57 AM

#

Shouldn't be an issue, but timeout can be, it completely depends on what sort of code you wrote

vital cave Jun 21, 2024, 3:59 AM

#

I did face rare timeouts that i can't yet explain.
My advice (I hope I stick with it - and stop trying new ideas) is that I should focus on the subs I want to select and run them again and again to make sure they are stable

dusky narwhal Jun 21, 2024, 10:28 AM

#

vital cave I did face rare timeouts that i can't yet explain. My advice (I hope I stick wi...

even you suffered timeouts? And didn't later find any bugs in the code to explain it? I'm glad I solved my freezes and timeout (as posted on the forum), even if it did cost me days; that could have been a disaster during the private LB rerun. Instead I have the extra stress due to less time!

#

BTW I just noticed you're 2nd in the ARC contest! Wow

#

that you even have time to work on both!

#

how much work was it to get to that score?

#

and yeah, I'm very paranoid about errors on the private LB; I'm going to try to make my 2nd submission quite different and rather conservative

vital cave Jun 21, 2024, 10:44 AM

#

Ya I suffered timeouts 😅 My current (in mind) 2 subs work fine with different validation questions, but I am still afraid of unexpected inputs or errors that might kill the code in private. If private had been calculated in parallel with public, with us seeing only 50% of the score, it would have been much better (no sudden errors!)
I have no expectations currently regarding my final rank! I just hope to stay in gold! I need the solo gold 🙂

Regarding ARC I like such competitions, the math one, ARC, CTF, and Santa problems are my favorites.
Regarding my current score ARC 26.5 it took me about a week. I have some ideas to try. BUT no way to reach the guys at 38 ! They are professional ARC developers (if that can be a job title!)

dusky narwhal Jun 21, 2024, 10:48 AM

#

well, you have half a year to come up with and implement ideas! Good ideas to tackle such hard problems are very rare

#

I've heard of ARC before but I'm not familiar with it, and I only just saw ARCathon mentioned. I suppose that's what you're referring to

#

I haven't looked at it much because I'm too busy with AIMO! I'm attempting to write an entirely new solution to AIMO in the next week! Yesterday I put together the first pieces of code and managed to solve the first problems. Haven't submitted yet, but I think it'd score about 4/50 :P

#

also yesterday I finally got a score of 23; it only took me a month to match the score of the top public notebook, arghh! What a massive waste of time

vital cave Jun 21, 2024, 10:55 AM

#

Try to test the stability of the 23 ! if you ran the code again will you get 23 ? I think stability is very important currently.

#

I have 2 subs with 26 (from "somehow" different approaches), 4 subs with 25, and some good 24s and 23s .... I can't say my 26 and 25 are stable, but the avg score for those notebooks is 24. I also have some code that turns into a roller coaster when re-run! From 19 to 24 to even 25 ... I somehow know why but I can't say much now

dusky narwhal Jun 21, 2024, 10:59 AM

#

I haven't yet but will. However I don't have many submissions left and will need to use most for my new solution

vital cave Jun 21, 2024, 11:00 AM

#

Good luck! never give up! ... who knows ... you might see higher scores in private
My stable 24 might be a stable 4 in private or an error 😄 ....

dusky narwhal Jun 21, 2024, 11:01 AM

#

in theory, the LLM could write a script that runs killall python

#

more realistically, it could fork bomb; I did see it try to use multiprocessing after I informed it its solution was too slow
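One common way to limit the damage from model-written code is to run it in a child process with a timeout. The sketch below is an assumption about how that could look, not anyone's actual submission; the helper name `run_generated_code` is made up. Note that a timeout alone does not stop `killall python` or a fork bomb; those would need OS-level limits on top of this.

```python
import subprocess
import sys


def run_generated_code(code: str, timeout_s: float = 10.0):
    """Run model-written code in a child process; return its stdout, or None on failure."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None  # the generated code raised or exited with an error
    return result.stdout.strip()


print(run_generated_code("print(6 * 7)"))
```

Returning None for both timeouts and crashes lets the caller simply discard that attempt instead of taking down the whole notebook.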

vital cave Jun 21, 2024, 11:03 AM

#

hhhh well the private will be fun to see ... I just hope hard work pays off and not a lucky hit (which I am afraid will put some people in good ranks, maybe with public notebooks)

#

Imagine I enter the competition, sort notebooks by score, run the notebook and get lucky with 23 or 24 ... then the same luck happens in private, with even more luck to put me on top ... wow

dusky narwhal Jun 21, 2024, 11:05 AM

#

well, the score gap between the top team vs what you could get by submitting the public notebook 50 times is just 5 points

vital cave Jun 21, 2024, 11:06 AM

#

I am pretty sure that you may even get 24 with the public notebook

dusky narwhal Jun 21, 2024, 11:06 AM

#

yes, if it can get 23 probably it could get 24 very rarely

#

I got 22 with it the very first time I set the timelimit to 8.5 hours, with minimal changes

#

and then took weeks to get another 22

#

it turns out that almost all my submissions during the contest have had some major bug or other

vital cave Jun 21, 2024, 11:10 AM

#

My 4th rerun (just finished) of my first 25 got me 23, so it's (25, 24, 24, 23). I have ±2 variance I guess, I hope to keep it this way
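To put a number on that spread, here is a minimal sketch that summarizes the four rerun scores quoted above. Four runs is a very small sample, so the result is only a rough indication.

```python
import statistics

rerun_scores = [25, 24, 24, 23]  # the four scores quoted above

mean = statistics.mean(rerun_scores)
spread = statistics.stdev(rerun_scores)  # sample standard deviation
print(
    f"mean = {mean}, stdev = {spread:.2f}, "
    f"range = {min(rerun_scores)}-{max(rerun_scores)}"
)
```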

rare sandal Jun 21, 2024, 1:38 PM

#

Lmao I have been stuck at 22 since the first week, but that was a lucky 22. Cleaning up the code dropped the score by so much and I had to slowly climb back up to a stable 22 holyfuck only to realise it once again drops to 18 with a change in the seed 🤣

#

I’m not sure if the old AIME CV is correlating, because my latest stable 21/22 public also scores 27/50 there 😐 [with a DIFFERENT seed], but as usual it just sucks on the new AIME questions: out of 61 I got just 9 points 😓

floral sentinel Jun 21, 2024, 2:06 PM

#

vital cave My 4th rerun (just finished) of my first 25 got me 23 , so its (25, 24, 24, 23) ...

ooh Ali hii

serene fiber Jun 21, 2024, 4:27 PM

#

vital cave Ya I suffered timeouts 😅 My current (in mind) 2 subs work fine with different ...

For GM?

vital cave Jun 21, 2024, 4:38 PM

#

serene fiber For GM?

I need gold in general for competition Master level, but a solo will help later of course (GM), so I am trying to hit two goals at once. I was so close to getting it in the "LLM - Detect AI Generated Text" competition, if only I had selected the correct sub 😦 (I finished 2nd on the public board and 21st, "silver", on the private board, where I could have secured 6th place 😅 if I had been wiser)

serene fiber Jun 21, 2024, 4:41 PM

#

vital cave I need gold in general for competition Master level, but a solo will help for la...

This honestly sucks.... Even I wanted to try for a gold, could have been fastest Master on the platform 😅😅
But it seems unlikely considering my score right now, and unstable performances

#

Anyways check your DM please

spare cypress Jun 22, 2024, 12:17 AM

#

dusky narwhal also yesterday I *finally* got a score of 23; it only took me a month to match t...

which is the top public notebook??

dusky narwhal Jun 22, 2024, 4:42 PM

#

spare cypress which is the top public notebook??

you can sort by public score https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/code?competitionId=73231&sortBy=scoreDescending&excludeNonAccessedDatasources=true

AI Mathematical Olympiad - Progress Prize 1

Solve national-level math challenges using artificial intelligence models

#

I didn't realise there are actually two notebooks with a score of 23

#

I believe they're virtually identical though

#

I notice a very amusing pattern on the scoreboard (when I looked a few hours ago), the number of teams with a score of at least N goes:
28: 1
27: 4
26: 8
25: 16
24: 31
23: 76
22: 166
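One reading of the pattern is that the count roughly doubles with each point lower. A small sketch to check that from the numbers listed above (the variable names are illustrative):

```python
# Cumulative counts quoted above: number of teams with a score of at least N.
teams_at_least = {28: 1, 27: 4, 26: 8, 25: 16, 24: 31, 23: 76, 22: 166}

scores = sorted(teams_at_least, reverse=True)
# Ratio of each row to the row one point higher.
ratios = {
    lower: teams_at_least[lower] / teams_at_least[higher]
    for higher, lower in zip(scores, scores[1:])
}
for score, ratio in ratios.items():
    print(f">= {score}: {teams_at_least[score]:>3} teams ({ratio:.2f}x the row above)")
```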

dusky narwhal Jun 22, 2024, 4:48 PM

#

rare sandal Lmao I have been stuck at 22 since the first week, but that was a lucky 22. Clea...

join the club. There must be SO many people in the same boat, because that code was awful. It took me weeks to get back up there, because I kept breaking new things, or I had bugs that would destroy the performance half the time making it even more unstable

#

oh BTW, I see everyone using AIME for validation but it doesn't match the composition of the LB at all, which has a much wider range of difficulties. I meant to mention this on the forums

#

so if you think your submission score is stable, to a large degree it's actually because there aren't that many problems on the edge of what it can solve

celest bough Jun 22, 2024, 5:59 PM

#

can't join anymore flameseyeroll

rare sandal Jun 22, 2024, 11:20 PM

#

dusky narwhal join the club. There must be SO many people in the same boat, because that code ...

I tried to fix the dumb behavior…but most of the time it ends up fixing X problems at the expense of making Y problems wrong holyfuck where X < Y

rare sandal Jun 22, 2024, 11:23 PM

#

dusky narwhal so if you think your submission score is stable, to a large degree it's actually...

AIME 2022-24 (new ones) gets 10 or 11/61 on CV for me regardless…it gets the same for every experiment I have tried over the past 2 weeks, regardless how the LB score is holyfuck , my LB can be between 13 and 22 and the new AIME val score cannot differentiate it

#

I don’t even know what I can trust now

#

Haven’t seen anything like this in previous competitions

floral sentinel Jun 23, 2024, 9:40 AM

#

no seriously wtf is this

rare sandal Jun 23, 2024, 12:28 PM

#

My predictions for the private LB gold zone:

27 27 26 24 22 21 20 19 19 18 17 17

Cutoff for gold: 17
Cutoff for silver: 13
Cutoff for bronze: 12

I may be wrong, this is my gut feeling based on my validation tests and the data description

celest bough Jun 23, 2024, 12:31 PM

#

Why can't we join this comp anymore

rare sandal Jun 23, 2024, 12:32 PM

#

celest bough Why can't we join this comp anymore

Entry deadline is June 20...Read the competition description under Timeline

celest bough Jun 23, 2024, 12:50 PM

#

rare sandal Entry deadline is June 20...Read the competition description under Timeline

Correct, the entry deadline is 20 June 2024. I don't know what this contributes to my question?

floral sentinel Jun 23, 2024, 12:50 PM

#

celest bough Why can't we join this comp anymore

bruh, there are 4 days left for competition, and u coming now to join?

floral sentinel Jun 23, 2024, 12:50 PM

#

celest bough Correct, the entry deadline is 20 June 2024. I don't know what this contributes ...

huh?

celest bough Jun 23, 2024, 12:50 PM

#

floral sentinel bruh, there are 4 days left for competition, and u coming now to join?

Yesterday there were 5 days left

floral sentinel Jun 23, 2024, 12:50 PM

#

celest bough Yesterday there were 5 days left

wooow big brain

celest bough Jun 23, 2024, 12:51 PM

#

Still only troll answer?

floral sentinel Jun 23, 2024, 12:51 PM

#

and who is the one trolling here with silly questions?

celest bough Jun 23, 2024, 12:52 PM

#

Someone wants to participate but is kept from participating due to unexplained rules. Question is good. Answer is not

floral sentinel Jun 23, 2024, 12:53 PM

#

wdym by "unexplained rules", the guy told you the reason

#

there is a deadline for joining competitions

#

what part of it you didn't understand?

#

the competition started around 3 months ago

celest bough Jun 23, 2024, 1:23 PM

#

floral sentinel there is a deadline for joining competitions

So why is there a deadline when the submissions are open?

#

What part of my question are you struggling with?

vital cave Jun 23, 2024, 2:27 PM

#

There are two deadlines: a deadline to enter the competition, which is over now, and a deadline to submit, which is 4 days away

celest bough Jun 23, 2024, 2:45 PM

#

vital cave There is two deadlines , a deadline to enter the competition which is over now, ...

correct. this information can be found on the kaggle page of this competition.

vital cave Jun 23, 2024, 2:49 PM

#

Ok, does this explain why you can’t join?

serene fiber Jun 23, 2024, 3:58 PM

#

Wondering why don't folks just ignore it, the discussion is already self explanatory

floral sentinel Jun 23, 2024, 4:01 PM

#

He is trolling, ignore him

serene fiber Jun 23, 2024, 4:22 PM

#

rare sandal My predictions for the private LB gold zone: 27 27 26 24 22 21 20 19 19 18 17 1...

I doubt it. While our best submission has some stochasticity, our third best, which is around 21, does have some sort of stability coz of lower temperature, and yea I am not expecting a gold from it

celest bough Jun 23, 2024, 4:29 PM

#

floral sentinel He is trolling, ignore him

Again, not giving answer, only trying to be toxic for no reason.

#

I don't get why you don't try and be a good person instead of a troll

valid shoal Jun 24, 2024, 12:12 AM

#

Exactly when is the last submission time?

rare sandal Jun 24, 2024, 3:53 AM

#

serene fiber I doubt on it, while our best submission has some stochasticity, our third best ...

That's a good sign...we have a 22 but it has temperature 0.9, I changed the seed it drops to 16-18, I lowered the temperature, same thing. I don't even know if it's good 🤣

#

I'm getting only 24 CV with the setting that gave 22 on LB, but if I change the seed I get 27 CV and 18 LB. also lowering temperature doesn't worsen my CV, only +- 1

#

I guess we will know in a week...but terrible CV/LB correlation I'm not confident in our submissions 😐

#

In all the comps I have participated this is the one I'm least confident of even getting a medal out of it

serene fiber Jun 24, 2024, 4:11 AM

#

rare sandal That's a good sign...we have a 22 but it has temperature 0.9, I changed the seed...

Yea same here, our 24 and 22 ones have a temperature of 0.9, but we are unable to reproduce the results with lower temperatures
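On why lower temperature tends to make runs more repeatable: temperature divides the logits before the softmax, so a lower value concentrates probability on the top token and leaves less room for the seed to matter. A minimal sketch with made-up logits (the values and the function name are illustrative, not taken from any model):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens (illustrative values only).
logits = [2.0, 1.0, 0.5]
for t in (0.9, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperature makes sampling closer to greedy decoding, which trades diversity across attempts for repeatability.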

summer bolt Jun 24, 2024, 10:50 AM

#

no one mentioning rag in discussion, does that mean it didn't work or worked too well?

silk sandal Jun 24, 2024, 12:58 PM

#

summer bolt no one mentioning rag in discussion, does that mean it didn't work or worked too...

I do not see how RAG would work here. From what can we retrieve the answer?

#

Is it possible that the test set is already online somewhere?

silk sandal Jun 24, 2024, 12:59 PM

#

floral sentinel no seriously wtf is this

Frustrating right😁. I got similar nonsense

#

By the way, do you think it is possible some submissions throw an exception on the private test set? I thought a submission would be marked successful if it was able to run on both public and private test sets

floral sentinel Jun 24, 2024, 1:07 PM

#

silk sandal Frustrating right😁. I got similar nonsense

Yeah... 3 days left and the model giving me nonsense answers

#

I got 10 last submission about 12 hours ago...

summer bolt Jun 24, 2024, 2:05 PM

#

answers to similar problems from AoPS, from the same contest or similar contests. Obviously not the exact answer to the exact problem, but the reasoning steps or key ideas can be reused for the main problem

floral sentinel Jun 24, 2024, 2:29 PM

#

summer bolt answer of similar problem from aops, same contest, similar contests, apparently ...

deepseek has an input limit of 4096 tokens

#

so stacking questions and answers will lead to high token count

summer bolt Jun 24, 2024, 2:34 PM

#

yea so few things can be tried: summarize/shorten the solution, or use longer context model

#

3 shortened examples + prompts might take around 900-1.5k tokens; it can be shorter depending on how compressed the summary is. Anyway, deepseek doesn't utilize those very well (or my implementation was bad)
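A minimal sketch of the budgeting problem discussed above: keep adding retrieved examples only while the prompt still fits. The 4096 figure is the limit quoted in the chat; the 4-characters-per-token estimate, the output reserve, and the helper names are assumptions for illustration, and a real setup would count tokens with the model's own tokenizer.

```python
CONTEXT_LIMIT = 4096  # limit quoted above for the DeepSeek model


def estimate_tokens(text: str) -> int:
    """Crude estimate of ~4 characters per token (an assumption, not the real tokenizer)."""
    return max(1, len(text) // 4)


def pick_examples(examples, problem, reserve_for_output=2048):
    """Greedily keep worked examples while the prompt stays within the budget.

    reserve_for_output assumes the limit is shared between prompt and generation.
    """
    budget = CONTEXT_LIMIT - reserve_for_output - estimate_tokens(problem)
    kept = []
    for example in examples:
        cost = estimate_tokens(example)
        if cost > budget:
            break
        kept.append(example)
        budget -= cost
    return kept


# Three hypothetical summarized solutions of about 1000 estimated tokens each.
demo_examples = ["a" * 4000, "b" * 4000, "c" * 4000]
print(len(pick_examples(demo_examples, "x" * 400)))
```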

vital cave Jun 24, 2024, 4:55 PM

#

vital cave My 4th rerun (just finished) of my first 25 got me 23 , so its (25, 24, 24, 23) ...

Nothing is stable 🙂 Remember this ... well it's 25, 24, 24, 23, 24 .... 20 .. ❌

floral sentinel Jun 24, 2024, 4:55 PM

#

summer bolt 3 shorten examples + prompts might take around 900-1.5k tokens, can be shorter d...

no you are right, deepseek doesn't utilize context well... it's better off giving it a short request to just solve the problem step by step

floral sentinel Jun 24, 2024, 4:56 PM

#

vital cave Nothing is stable 🙂 Remember this ... well it's 25, 24, 24, 23, 24 .... 20 .. ❌

bruh...

#

it threw error exception?

#

but the drop from 25 to 20 tho...

vital cave Jun 24, 2024, 4:57 PM

#

it took it 5 runs to go wild ! No ... the notebook usually finishes in around 7 hrs ... say it's now around 8 hrs due to latency in showing the score !

#

Its punishing me for repeating it 😄

floral sentinel Jun 24, 2024, 4:58 PM

#

the private leaderboard will be lit 🔥 🔥

#

imagine it throws error exception on all the submissions 😄

#

the private leaderboard

#

no winners lmfao

#

3 months will go to waste

vital cave Jun 24, 2024, 4:59 PM

#

hhhhhh oh god ... the selection of the final 2 subs is the hardest thing to do now

floral sentinel Jun 24, 2024, 5:00 PM

#

vital cave hhhhhh oh god ... the selection of the final 2 subs is the hardest thing to do n...

why, isn't it selected automatically?

#

the highest scores

vital cave Jun 24, 2024, 5:00 PM

#

No you can select which two you want (before deadline) if you don't then Kaggle will select (usually highest two)

floral sentinel Jun 24, 2024, 5:01 PM

#

vital cave No you can select which two you want (before deadline) if you don't then Kaggle ...

yeah i know we can select it manually, why not just let it select automatically?

vital cave Jun 24, 2024, 5:01 PM

#

because I am looking for the stable ones ... which I don't know now

floral sentinel Jun 24, 2024, 5:01 PM

#

ooh...

vital cave Jun 24, 2024, 5:02 PM

#

I have three (somehow) approaches and all score up to 26 ... and another approach that is stable around 24 (so far)

#

6 subs left ... one should use them wisely !
But in all cases the private leaderboard is going to be crazy .... I have dropped my prediction about winning a gold medal now

#

By the way, results on Colab (regardless of the questions) are perfectly stable! (I mean repeating any experiment gives the same score.) Only rarely did I see ±1 variance
I use L4 on colab

bright lotus Jun 25, 2024, 1:01 AM

#

this competition only gamble

#

🥹

rare sandal Jun 25, 2024, 2:42 AM

#

my 22 notebook turns into 14 with only a slight change…😱 I was so surprised by the score. Usually this type of small change can still get 20-21

rare sandal Jun 25, 2024, 2:43 AM

#

vital cave By the way results on colab (regardless the questions) are perfectly stable ! (I...

Same for me 😄

#

It’s always 24-25 points on old AIME,and 10-11 points on new AIME

#

Doing things that make sense improve validation and reduce LB...

#

Haven't seen a competition like this 🤣

serene fiber Jun 25, 2024, 3:27 AM

#

#learning-agency-lab-automated-essay-scoring-2 is way worse than this

valid shoal Jun 25, 2024, 7:38 AM

#

wtf my code tried to execute this

silk sandal Jun 25, 2024, 10:49 AM

#

vital cave By the way results on colab (regardless the questions) are perfectly stable ! (I...

Great observation. By repetition do you mean restarting the notebook and running or just solving the same problem multiple times within the same session?

rare sandal Jun 25, 2024, 11:27 AM

#

For me, I randomized the order and restarted the notebook

vital cave Jun 25, 2024, 11:43 AM

#

silk sandal Great observation. By repetition do you mean restarting the notebook and running...

Restarting the notebook

summer bolt Jun 25, 2024, 11:59 AM

#

curious what people are using for their experiment, this is my first kaggle competition so I started out maximizing kaggle gpu hours, then moved to runpod because it looks cheaper than colab

celest bough Jun 25, 2024, 1:49 PM

#

summer bolt curious what people are using for their experiment, this is my first kaggle comp...

you can also try vast.ai. It's like runpod, but often has good prices

floral sentinel Jun 25, 2024, 4:28 PM

#

```py
config = transformers.AutoConfig.from_pretrained(model_path)
config.gradient_checkpointing = True
```

What does gradient_checkpointing really do?

bright lotus Jun 26, 2024, 4:28 AM

#

floral sentinel ```py config = transformers.AutoConfig.from_pretrained(model_path) config.gradie...

It's a memory-saving trick that is only helpful when training (it recomputes activations during the backward pass instead of storing them).

floral sentinel Jun 26, 2024, 1:43 PM

#

bright lotus helpful memory racks when only training.

thank you

proper ferry Jun 26, 2024, 2:00 PM

#

Is the deadline the 28th at 0:00 or the 29th at 0:00?

floral sentinel Jun 26, 2024, 2:43 PM

#

proper ferry The ddl is 28 0:00 or 29 0:00?

convert 11:59 PM UTC to the time zone of your area

floral sentinel Jun 27, 2024, 9:39 AM

#

did my two last submissions, let's see how it goes

#

goodluck to everyone

vital cave Jun 27, 2024, 11:05 AM

#

Me too, I will finalize my selections once my last sub finishes, Good luck everyone.

floral sentinel Jun 27, 2024, 2:57 PM

#

bruh the second phase will start in 2 months...

#

so soon

#

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/515001

AI Mathematical Olympiad - Progress Prize 1

Solve national-level math challenges using artificial intelligence models

rare sandal Jun 27, 2024, 3:21 PM

#

floral sentinel

Yeah I probably need to build my own solution next time rather than using open source models

#

I didn't have time this month to explore on how to do that..
Bet some top teams did try research paper worthy things...let's see tomorrow

floral sentinel Jun 27, 2024, 3:40 PM

#

rare sandal I didn't have time this month to explore on how to do that.. Bet some top teams ...

i don't think anybody had time to do new stuff

#

3 months ain't enough for researching and discovering new things

rare sandal Jun 27, 2024, 3:43 PM

#

I doubt one can reach 29 without fine tuning, I may be wrong

rare sandal Jun 27, 2024, 3:46 PM

#

floral sentinel 3 months ain't enough for researching and discovering new things

I guess you’re underestimating the ability of the top-notch in this field. It is definitely possible if your team is really good

floral sentinel Jun 27, 2024, 3:47 PM

#

rare sandal I guess you’re underestimating the ability of the top-notch in this field. It is...

I'm not underestimating anything, the facts are out there

#

for 1 month and a half we've been stuck in 27 score

#

which really shows that nobody created anything new

#

everyone was using deepseek and tryna prompt it correctly to get it to work

rare sandal Jun 27, 2024, 3:48 PM

#

I don’t think the 28 and 29 are achieved without fine tuning, and my guess is they will be stable on private LB too

floral sentinel Jun 27, 2024, 3:48 PM

#

rare sandal I don’t think the 28 and 29 are achieved without fine tuning, and my guess is th...

well fine tuning isn't something new?

rare sandal Jun 27, 2024, 3:48 PM

#

well let’s see tomorrow, I don’t want to jump into conclusions now

floral sentinel Jun 27, 2024, 3:48 PM

#

rare sandal well let’s see tomorrow, I don’t want to jump into conclusions now

yeah that's the best thing to do

rare sandal Jun 27, 2024, 3:49 PM

#

floral sentinel well fine tuning isn't something new?

If you had followed the public discussions, many people have said they tried and failed. So doing it right to improve the score is an achievement itself

floral sentinel Jun 27, 2024, 3:49 PM

#

rare sandal If you had followed the public discussions, many people have said they tried and...

yeah i followed that

#

some said it's in a deep local minimum

#

which made the results worse by fine tuning

floral sentinel Jun 27, 2024, 3:50 PM

#

rare sandal If you had followed the public discussions, many people have said they tried and...

hmm, yeah in that terms you are right

vital cave Jun 27, 2024, 8:09 PM

#

rare sandal I don’t think the 28 and 29 are achieved without fine tuning, and my guess is th...

I think it may be possible.
Earlier I decided to test the ideal solution for my main strategy. By ideal I mean that the model generated a correct solution to the problem at least once. On average I was counting around 35/50 (validation, of course), whereas my best was up to 28/50 ... so if you can come up with a better judging/scoring (name it as you like) approach, then yes, you might reach more ...
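The gap described above can be sketched as the difference between an "oracle" score (the right answer appears somewhere among the samples) and the score after picking one answer per problem. This reads "ideal" as "correct at least once" and uses plain majority voting as the judging step; both are assumptions, since the poster's actual method is not stated, and the data below is made up.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer among the sampled outputs (None if empty)."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def oracle_and_voted_scores(samples_per_problem, truths):
    """Count problems where any sample is right (oracle) vs. where the vote is right."""
    pairs = list(zip(samples_per_problem, truths))
    oracle = sum(truth in samples for samples, truth in pairs)
    voted = sum(majority_vote(samples) == truth for samples, truth in pairs)
    return oracle, voted


# Hypothetical sampled answers for three problems (made-up data for illustration).
samples = [[7, 7, 3], [12, 5, 5], [40, 41, 42]]
truths = [7, 12, 99]
print(oracle_and_voted_scores(samples, truths))
```

The second problem is the interesting case: the correct answer was generated, but the vote picked a different one, which is exactly the kind of loss a better judging step would recover.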

#

Also this means that you might overfit more !

rare sandal Jun 27, 2024, 11:49 PM

#

haha when I fix 1 problem I break 2 problems

vital cave Jun 28, 2024, 12:08 AM

#

How many outputs you generate for each problem ?

rare sandal Jun 28, 2024, 12:11 AM

#

vital cave How many outputs you generate for each problem ?

I set time limit = (remaining time) / (number of remaining problems)
I did not have any logic that breaks out of the for loop if the time exceeds 31500 (in my runs it has never timed out)
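The rule above can be sketched as follows. This is a minimal illustration, assuming 31500 is a budget in seconds; the function names and the `solve_one` callback are hypothetical. The early `break` is the hard stop that the description above says was missing.

```python
import time

TOTAL_BUDGET_S = 31500  # overall limit mentioned above, assumed to be in seconds


def per_problem_limit(start_time, n_remaining, now=None, total_budget_s=TOTAL_BUDGET_S):
    """Split the time still available evenly across the remaining problems."""
    now = time.time() if now is None else now
    remaining = max(0.0, total_budget_s - (now - start_time))
    return remaining / n_remaining if n_remaining > 0 else 0.0


def solve_all(problems, solve_one, total_budget_s=TOTAL_BUDGET_S):
    """Call solve_one(problem, limit_s) per problem; stop early if the budget runs out."""
    start = time.time()
    answers = {}
    for i, problem in enumerate(problems):
        limit = per_problem_limit(start, len(problems) - i, total_budget_s=total_budget_s)
        if limit <= 0:
            break  # hard stop once the overall budget is exhausted
        answers[problem] = solve_one(problem, limit)
    return answers
```

Because the limit is recomputed before every problem, time saved on easy problems is automatically redistributed to the ones that remain.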

#

seed set at 42

vital cave Jun 28, 2024, 12:12 AM

#

I generate between 100 and 120 based on time

#

I selected a notebook that generates max around 100, since it was the first 25 on the LB overall, and the added value of it being an earlier sub pushed me to select it

serene fiber Jun 28, 2024, 12:14 AM

#

I am looking for a team for season 2 of this competition. Given that it is officially ended and we can do 100 submissions per day, I believe this is the perfect time for exploration. Kindly let me know if anyone's interested.

PS: I would prefer someone who has actively participated in this season

rare sandal Jun 28, 2024, 12:14 AM

#

How do you generate 100 tries ?

#

vLLM Async Engine ?

vital cave Jun 28, 2024, 12:15 AM

#

vLLM and 2 T4

rare sandal Jun 28, 2024, 12:16 AM

#

Even with vLLM, my speed isn’t so fast to run 100 tries on a problem

#

You need AsyncEngine I think

#

Transformers = 13 token/s, vLLM: 28 token/s
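A back-of-the-envelope check on those speeds, as a sketch under stated assumptions: the figures are treated as single-sequence decode speeds, the run is taken to be 9 hours split evenly over 50 problems, and 500 tokens per attempt is a hypothetical average. None of these come from a measured setup.

```python
def attempts_within_budget(tokens_per_s, budget_s, tokens_per_attempt):
    """How many sequential attempts fit in the time budget at a given decode speed."""
    return int(tokens_per_s * budget_s // tokens_per_attempt)


BUDGET_S = 9 * 3600 / 50   # assumed 9-hour run split evenly over 50 problems
TOKENS_PER_ATTEMPT = 500   # hypothetical average solution length

for name, speed in [("Transformers", 13), ("vLLM", 28)]:
    n = attempts_within_budget(speed, BUDGET_S, TOKENS_PER_ATTEMPT)
    print(f"{name}: about {n} attempts per problem")
```

Under these assumptions neither speed gets near 100 attempts one after another, which suggests that reaching that count depends on generating many sequences in parallel rather than on raw per-sequence speed.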

vital cave Jun 28, 2024, 12:19 AM

#

If I reached money prizes I will share more regarding this.

But within 9hrs I think you can reach more maybe up to 140

#

But the risk of timeout is always there

#

Early discussions and notebooks on this helped me a lot

rare sandal Jun 28, 2024, 12:20 AM

#

My submission definitely has a risk of timeout also

#

But my score isn’t good so I risked it

#

I didn’t end up selecting the 22 that drops with every small change that I made. My opinion is it overfitted public LB from past experience

vital cave Jun 28, 2024, 12:21 AM

#

My second one does. I kinda gambled with the second one; based on my analysis it should be the most stable one, but with a risk

serene fiber Jun 28, 2024, 12:21 AM

#

vital cave If I reached money prizes I will share more regarding this. But within 9hrs I t...

don't you allocate a specific time for every question?

vital cave Jun 28, 2024, 12:22 AM

#

Ya in one of the approaches I set around 450 sec per question

rare sandal Jun 28, 2024, 12:22 AM

#

Also vLLM didn’t improve my score and just lowered the floor that I could get, ended up not selecting it

serene fiber Jun 28, 2024, 12:22 AM

#

vital cave Ya in one of the approaches I set around 450 sec per question

such solutions are kind of stable with low temp

vital cave Jun 28, 2024, 12:23 AM

#

Low temps stabilize around 20 to 22 …

serene fiber Jun 28, 2024, 12:24 AM

#

Oh thats good my stable was between 18-21

rare sandal Jun 28, 2024, 12:25 AM

#

serene fiber Oh thats good my stable was between 18-21

Same for me

serene fiber Jun 28, 2024, 12:25 AM

#

btw till when will final lb get declared?

serene fiber Jun 28, 2024, 12:25 AM

#

rare sandal Same for me

whats your top score?

vital cave Jun 28, 2024, 12:25 AM

#

I have hopes for my second sub, which scored 26 yesterday, but because ties are ranked by who submitted first, I also had to select the first 25 (with a variance of 2 most of the time)

rare sandal Jun 28, 2024, 12:26 AM

#

serene fiber whats your top score?

I selected a 20 and a 21
Both have been resubmitted twice with the same score all 3 times...stable ones

#

Submitted on May 25 and June 1

vital cave Jun 28, 2024, 12:26 AM

#

serene fiber btw till when will final lb get declared?

I guess it will take a week or more

serene fiber Jun 28, 2024, 12:26 AM

#

Oh ok ok

rare sandal Jun 28, 2024, 12:27 AM

#

But I have no confidence of the private result 😓

vital cave Jun 28, 2024, 12:27 AM

#

My selected subs are one from 2 months ago and one from yesterday

serene fiber Jun 28, 2024, 12:28 AM

#

Anyone of you in for season 2 of this competition?

vital cave Jun 28, 2024, 12:28 AM

#

serene fiber Anyone of you in for season 2 of this competition?

It depends on this phase final output

serene fiber Jun 28, 2024, 12:28 AM

#

vital cave My selected subs are one from 2 month ago and yestreday

Would definitely like to know more about 26 one

rare sandal Jun 28, 2024, 12:29 AM

#

In the last week of the comp I tried some “out of the box” ideas, but they didn’t have a significantly better CV, I ended up not choosing them. Even though public 18-21

In any other comp 99% of the time I will choose this as 1 sub

vital cave Jun 28, 2024, 12:29 AM

#

Also a lot, and I mean a lot, of experienced grandmasters avoided this phase (maybe due to the 23-limit)

rare sandal Jun 28, 2024, 12:29 AM

#

tiebreaker is a menace

#

😞

serene fiber Jun 28, 2024, 12:29 AM

#

vital cave Alos a lot and I mean a lot of experiencrd grand masters avoided this phase ( ma...

23?

#

you mean public lb score?

vital cave Jun 28, 2024, 12:30 AM

#

23-feb

serene fiber Jun 28, 2024, 12:30 AM

#

23 Feb didn't get you

rare sandal Jun 28, 2024, 12:31 AM

#

Yeah that one holyfuck

vital cave Jun 28, 2024, 12:31 AM

#

The rule that forbids you from using anything released after 23-feb

serene fiber Jun 28, 2024, 12:32 AM

#

Oh ok

vital cave Jun 28, 2024, 12:33 AM

#

Second phase will most likely have a rule for 1-sep 🙂

rare sandal Jun 28, 2024, 12:33 AM

#

I hope not, and the rule is 1-Mar-2025 instead

#

Since the comp ends 1-May-2025

vital cave Jun 28, 2024, 12:35 AM

#

I guess its maybe going to be end of year 2024 rule to give everyone a chance to test and finetune

rare sandal Jun 28, 2024, 12:35 AM

#

You only need 1 day to fork an open source model, try it on the comp dataset and submit

#

I learnt a lot on the behaviour of LLMs on math problems from this comp though 😄

serene fiber Jun 28, 2024, 12:36 AM

#

btw anyone of you tried finetuning?

rare sandal Jun 28, 2024, 12:36 AM

#

serene fiber btw anyone of you tried finetuning?

nope

vital cave Jun 28, 2024, 12:37 AM

#

I did finetune; it reached up to 15 I guess, but I dropped it

rare sandal Jun 28, 2024, 12:37 AM

#

Me and my teammate tried 6 different RAG strategies

serene fiber Jun 28, 2024, 12:37 AM

#

data for RAG?