#arc-prize-2025
I'm down
I'll dm
I forget to ask u lol 💀
Thought u will find it complicated rather than interesting
oh
I forgot my planned algo it'd been so much time 🤧
This is the Powerball of all competitions @harsh dagger 😂 🤔
Everything else is like playing Pick 3, or, Pick 4
Can someone tell me what this actually is and how the model selection process may work for this competition?
It's about abstraction. You are presented with input grids and must predict their corresponding output grids. In each example set, multiple input-output pairs are presented to you, in which a consistent rule or set of rules is applied to transform input grids into output grids. So your challenge is to, given that (heuristic) example set, predict the output that pairs with a novel input. You are given 3-5 examples to abstract a rule for a particular (heuristic) set and apply it to an input whose output you don't know. Hopefully that helps.
Try going to https://arcprize.org/play and solve the puzzles. Basically, you are building an AI/ML system that can do this
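For anyone who prefers code to words, here is a toy sketch of what a task looks like and what "finding the rule" means. The grids and the `mirror_rows` rule are made up for illustration; only the `train` / `test` / `input` / `output` layout follows the competition JSON.

```python
# Toy ARC-style task. The hidden rule here is "mirror each row left to right".
task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [{"input": [[5, 0, 0, 0]]}],
}


def mirror_rows(grid):
    """Candidate rule (hypothetical): reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def explains_all_examples(rule, task):
    """A rule is only plausible if it reproduces every training output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


predictions = []
if explains_all_examples(mirror_rows, task):
    # Apply the rule to each test input whose output we do not know.
    predictions = [mirror_rows(t["input"]) for t in task["test"]]
```

A real solver has to discover the rule instead of having it hard-coded, which is the hard part.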
But the rule for the score seems a bit strict. It's not really 2 attempts per task, since ur score for the task will be the average of the two attempts
It should've been max of 2 attempts per task
it is not average of 2 attempts
it is best of 2
if any one of the answers is correct you get the score for it
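To make the best-of-2 rule concrete, a minimal sketch. The function name is made up and this is not the official scorer, just the logic as described above:

```python
def score_test_output(attempt_1, attempt_2, truth):
    """Return 1 if either attempt matches the ground-truth grid exactly, else 0.

    Grids are nested lists, so == compares both shape and every cell.
    """
    return 1 if attempt_1 == truth or attempt_2 == truth else 0


truth = [[1, 2], [3, 4]]
wrong = [[1, 2], [3, 0]]  # a single wrong cell makes the whole attempt wrong

both_wrong = score_test_output(wrong, wrong, truth)
second_right = score_test_output(wrong, truth, truth)
```

So a wrong first attempt costs nothing as long as the second one is exact.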
How ??
Oh right i overlooked it 😅
Thnx to clarify
Sure!
ok
Hey pals, I'm Anser. About to graduate from NUST in Pakistan. I've been working for the past few days on this challenge. I am really feeling the lack of some team. Yk more minds, more possibilities. If you are good at communicating, hit the DM if you wanna work with me. Let's try to beat it and publish a paper.
I'm new in kaggle. Does this competition have any restrictions on model size or gpu memory usage?
The problem I see with this challenge is that for some harder puzzles the training data is too limited to admit only one possible solution (even for humans). E.g. puzzle 1ae2feb7 (first one in public evaluation set v2): judging by the training data, another solution would be that the second color on the left (counting from the wall) starts at the far right of the grid and goes to the left
So more training data or simple unambiguous puzzles are needed.
I agree that some puzzles are ambiguous, but it's also about simplicity of the rules you find
Ok reasonable
I see that you have to mix a lot of strategies and come up with some novel solution to this. If you check the interview with the creators, it is clearly not a challenge where you just use some gigantic deep learning model, throw data inside and pray to solve it. There is symbolic reasoning, heuristics and a lot more involved here.
That is why you are allowed to submit multiple solutions also
Competitors don't always use the techniques the creators initially thought about
Last year the winning solution was just an LLM finetuned on ARC grids with test time finetuning, augmentation & funny inference
so i was wondering about the evaluation metrics
is the score on the leaderboard the actual score, or would they apply a metric so the number gets transformed in some manner?
like for ex 7.92 becomes like 60 % or what ?
the score on the leaderboard should be the percentage, I believe
Do we have architecture of last year solution? I already have access to code which is there on Kaggle.
What do you mean architecture?
you mean a higher level description of how their code works?
Yes That's correct.
Last year, there was a stated limit to the size of the grid. I haven't found one this year. Have I just missed it, or it's really not there?
I think it is the same this year. Grids are at most 30x30 grids, or smaller. Of course some are even, say 3x4, etc.
Is that specified anywhere?
yes, here: https://www.kaggle.com/competitions/arc-prize-2025/data 30x30 is the maximum size for a grid.
I have an approach - I have come up with a new "think" technique for a neural network to defeat ARC-AGI-2
So look, the AI has 3 stages. Stage 1 chunks a question or task and decides how many "think" steps there will be. Let's say a regular neural network has a module where it thinks according to its own scheme; stage 1 instead splits the input into chunks and thinks about each chunk separately. Stage 2 combines some chunks together, uses what she has already thought before, and then thinks again. Stage 3 combines everything she thought about all the chunks. Let's say the question is "come up with creativity for a neural network". Stage 1 breaks the words down into chunks that mean something: for "come up with", she thinks, what does it mean to come up with? Then "creativity": what does creativity mean? Then "for a neural network": what does it mean for a neural network? If you can do something, then you need to apply your knowledge in this area. Stage 2 merges the chunks of words that she thinks fit together better in meaning, to advance the goal of answering, using knowledge from past reflections. Stage 3 uses the knowledge of everything she thought before and tries to find the answer. And for the first stage, so that she doesn't think for a long time, I think you can use an approach where she interprets words into the meanings that she understands best, or use a separate neural network for stage 1 so that it responds quickly; say it has 1 billion parameters and will teach the main neural network to give the correct answer to the question. What do you guys think?
-# sorry she's silly
is this very different from how multi headed attention works?
seems similar conceptually
Try ViT encoder with usual text based transformer decoder
guys is there any way to download the training and evaluation data sets?
like arc org has 1520 problems
so i was wondering if there is a way to download all of them at once
or does the data in the data section of the competition have all those
Umm. First you join the competition. Then you can download all the datasets with a simple click on "Data" page. Look, just this time, I'll save you the trouble; here is the dataset (a set of JSON files) for ARC 2024 challenge. Make sure to join though before the deadline for joining, if you want to be entered into this competition. Getting datasets for competitions, in general: this applies for most (all?) competitions, as a rule.
no like i know that
but is that all the data from the website asw
or is there an api i have to call or write a script to scrape it
It is a multitude of JSON files that contain the data you have to work with in developing your system. Except for the last, which gives you a sample submission file (which also needs to be in JSON format). From the site:
i don't know what you mean about scraping. You have to read and use these files in your code. This is the data you have to work with. Nothing needs to be scraped off web from any website
kk thanks
there's some data on github
https://github.com/arcprize/ARC-AGI-2
Idk if this is the right place to post, but I’m looking for teammates to work on this challenge.
I'm interested
I would like to join your team if you are still looking
Hi everyone,
I'm currently looking to join a team for the ARC Prize competition. I bring strong skills in Python, deep learning, and reasoning-based AI challenges, and I’m eager to actively contribute to both the modeling and experimentation aspects of the project.
I’ve been following ARC-style problems for a while and am particularly excited about the unique reasoning challenges this competition presents. I'm also comfortable with collaboration tools (Git, Notebooks, etc.) and can commit time regularly throughout the competition.
If you're looking for a motivated teammate who’ll pull their weight, I’d love to connect.
You can check out my Kaggle profile here: https://www.kaggle.com/sathishkumartheta
Feel free to DM or reply here!
Hi, I would like to team up with anyone to work in the arc prize 2025.
I have worked in Finetuning LLM, building agents, Rag systems, Core ML, and RL. Would like to team up with anyone to work in collaboration
Would love to team up if you are still up for it
😉
Hi I'm trying to use Qwen3 with vLLM, can anyone share a starter notebook for this?
I recommend just taking a starter notebook that uses vLLM and swapping in Qwen3
Hi all, has anyone tried out the LPN-based approach (search latent program spaces : https://arxiv.org/pdf/2411.08706) on ARC AGI 2 yet? I am looking to make some improvements to this.
Warning: Task 'bbb1b8b6' has 7 clues. Expected 2 to 5.
What's the deal with the possibility of there being more than 1 solution for some challenges? Does anyone understand this? Do we have to present two different solutions? I found in the training data that there were sometimes multiple solutions, btw, too.
I think you're misunderstanding the task data format.
There is a demonstration set (train set), and an inference set (test set). Each set contains N tasks where each task contains both an input and an output
the goal is to predict ALL the Inference outputs given the Demonstration Inputs, Demonstration Outputs, and Inference Inputs
not even one pixel can be incorrect and the shape cannot be wrong either
you also get two attempts per output predicted
so if one of the attempts is correct, it's fully correct
JUST A FRIENDLY HEADS UP as related to the Challenge: Some of the Tasks have up to 10 clue/heuristic pairs (Training input and output pairs). Also, some Tasks have multiple Test inputs and Solutions. In these cases, there will always be as many Solutions to figure out, as there will be Test inputs to be processed, using the same heuristic pair sets. And when submitting, for those Tasks, they will need to be submitted, in a list, in order (test[0] -> solution[0]), for that specific Task. That said, GOOD LUCK!
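A small sketch of that ordering rule for tasks with several test inputs. The task id, the grids and the `identity_and_flip` placeholder solver are made up; the `attempt_1` / `attempt_2` keys follow the sample submission format discussed in this channel.

```python
import json


def identity_and_flip(grid):
    """Hypothetical placeholder solver: returns two guesses for one test input."""
    return grid, [row[::-1] for row in grid]


def build_task_entry(test_inputs, predict_two):
    """Build one dict per test input, in the same order as the task's test list."""
    entry = []
    for test_input in test_inputs:
        first, second = predict_two(test_input)
        entry.append({"attempt_1": first, "attempt_2": second})
    return entry


# One task with two test inputs, so its submission entry needs two dicts.
challenges = {
    "00000000": {
        "train": [],
        "test": [{"input": [[1, 2]]}, {"input": [[3], [4]]}],
    }
}

submission = {
    task_id: build_task_entry([t["input"] for t in task["test"]], identity_and_flip)
    for task_id, task in challenges.items()
}
submission_json = json.dumps(submission)
```

Here `submission["00000000"][0]` answers `test[0]` and `submission["00000000"][1]` answers `test[1]`.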
Hey everyone. I understand the data format for this challenge, but I have some questions :
If you are training your own model, could you share how you are dealing with padding & grid_size / class imbalance / loss ?
(I'm not necessarily asking about your best approach of course, even just ideas from your baseline would be ok.)
There are mainly 3 ways for predicting shape.
- Upsampling/Interpolation
- Dedicated Prediction Heads (usually CrossEntropy over the possible 900 shapes)
- Boundary detection using Softmax on logits whose probability exceeds a certain threshold.
For padding there are 2 obvious methods.
- Shift all colors by 1 and use 0 padding and 0-padding tensors to collate.
- Same as above but use negative numbers for padding safety.
Class imbalance can be solved by Focal Loss
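For reference, a minimal pure-Python sketch of focal loss for a single cell, using the standard definition FL = -(1 - p_t)^gamma * log(p_t). The function names and `gamma=2.0` default are illustrative choices; in a real training loop you would write the same thing with your framework's tensor ops.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one prediction; gamma=0 reduces to plain cross-entropy."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = [5.0, 0.0, 0.0]  # confident and correct when target is class 0
hard = [0.0, 5.0, 0.0]  # confident and wrong when target is class 0

# Ratio of focal loss to cross-entropy shows how much each example is down-weighted.
easy_ratio = focal_loss(easy, 0) / focal_loss(easy, 0, gamma=0.0)
hard_ratio = focal_loss(hard, 0) / focal_loss(hard, 0, gamma=0.0)
```

The easy example (typically the dominant background color) is scaled down far more than the hard one, which is the point when one class dominates.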
If you want to pad, or more precisely, STANDARDIZE all grids: for every grid, create a grid that is 30x30 and filled entirely with padding (give it value = 10, for instance). Then overwrite that grid's squares with your data grid's square values, row by row and column by column, starting from the top left corner. So all your grids will be uniform. Relatively simple in code. Hope that makes sense. Proceed from there.
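The padding recipe above, as a small sketch in plain Python lists. The names `pad_grid` / `unpad_grid` are made up, and 10 is just the example pad value from the message (it works because real colors are 0-9).

```python
MAX_SIZE = 30   # maximum grid side length in the competition data
PAD_VALUE = 10  # any value outside the 0-9 color range works


def pad_grid(grid, size=MAX_SIZE, pad_value=PAD_VALUE):
    """Place grid in the top-left corner of a size x size grid filled with pad_value."""
    if len(grid) > size or any(len(row) > size for row in grid):
        raise ValueError("grid is larger than the target size")
    padded = [[pad_value] * size for _ in range(size)]
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            padded[r][c] = value
    return padded


def unpad_grid(padded, pad_value=PAD_VALUE):
    """Inverse of pad_grid for rectangular, non-empty grids: drop all padding cells."""
    rows = [[v for v in row if v != pad_value] for row in padded]
    return [row for row in rows if row]


original = [[1, 2, 3], [4, 5, 6]]
padded = pad_grid(original)
restored = unpad_grid(padded)
```

Because the pad value is its own class, a model trained on padded grids can also predict the output shape by predicting where the padding starts.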
Is this year's challenge harder than the 2024 one?
probably
Is it at all possible to work on this while not being very proficient in python? What’s the best tool to translate English to python? Or is AI not there yet..
python is the bare minimum not english
Im having issues with fragmented stop word tokenization in transformers library. Using a qwen2.5 model to write code and i want to stop on "```\n" after a python block, so i can inject the code result before continuing generation.
but i cant get it to work. anyone faced the same issue, and resolved it? Thanks.
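One workaround sketch, assuming you decode the newly generated tokens back to text (e.g. with `tokenizer.decode`) and check the text instead of matching token IDs, since the same stop string can be split into tokens in several ways. The helper name is made up and it is pure string handling, no transformers calls. I believe newer transformers releases also accept a `stop_strings` argument in `generate()` (with the tokenizer passed in), but check the docs for your version.

```python
FENCE = "`" * 3          # three backticks, built this way to keep the example readable
STOP = FENCE + "\n"      # closing fence followed by a newline


def cut_at_stop(generated_text, stop=STOP, start=0):
    """Return (text up to and including the first stop string, True), or (text, False).

    `start` lets you skip the opening fence if it is also directly followed by a newline.
    """
    idx = generated_text.find(stop, start)
    if idx == -1:
        return generated_text, False
    return generated_text[: idx + len(stop)], True


# The opening fence is followed by "python", so only the closing fence matches.
decoded = FENCE + "python\nprint(1)\n" + FENCE + "\nsome trailing text"
code_block, found = cut_at_stop(decoded)
```

Once `found` is true you can stop generating, run the code, append the result to `code_block`, and continue from there.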
I build just via LLM collaboration, letting the LLM produce the code, then I just kinda coordinate that. You need to understand ML fundamentals but you could just let AI teach you about AI essentially haha--edit not a replacement for coding yourself, but if the learning curve is too steep just have fun! Whatever works! 
Yes
How?
Google it
I did already. Is there a specific recommendation
Get a really smart partner
You mean get me a partner that knows Python. Yes, I considered that when I asked. Since I’ve learned absolutely no new information from this back and forth I’ll explore other avenues.
That was partially a joke
Idk
Can anyone tell what's arc prize ? And what's the problem
The ARC Prize is a non-profit dedicated to accelerating the development of Artificial General Intelligence (AGI). We believe that true AGI requires more than just scaling up existing AI models.
please see more detail below.
https://arcprize.org/
Brother tell me more about the which type of problem is they ask to be solved?
Basically pattern recognition but they give too little input data so current transformer models can't solve it
U should give the input data a peek u will get a better idea
Hey everyone!
Long before the launch of Kaggle’s Game Arena, I developed a real-time AI chess battle platform where you can watch different AI models play against each other.
🔗 Try it here: https://adaptive-ai-chess-game.streamlit.app/
🎮 Modes available:
Human vs Adaptive AI
Adaptive AI vs Hyperbolic AI
It’s a fun and insightful way to observe how different AI strategies behave on the board — ideal for anyone into AI, reinforcement learning, or game theory.
📩 If you're interested in collaborating, expanding, or integrating it into more formal benchmarks, feel free to connect:
👤 Karthikeya Guduru
📧 karthikeyaguduru19@gmail.com
🔗 https://www.linkedin.com/in/karthikeya-guduru-70227b262/
is the deadline for ARCPRIZE different here than the one on kaggle?
does anyone know why?
I’ve noticed the deadline for ARCPRIZE here doesn’t match the one on Kaggle either.
It might be due to differences in time zones, updates on one platform that haven't been reflected on the other yet, or simply an error in listing.
so which one should we go with?
I’d recommend going with the deadline listed on Kaggle, since that’s the official competition host and the one submissions will be judged by. The ARCPRIZE site might just be showing it in a different time zone or hasn’t updated yet. You can confirm by checking the “Timeline” section on the Kaggle competition page or asking the organizers directly in the Kaggle Q&A forum.
Thanks, are you also participating?
not now .
Does anybody know the transformation (logic) for task 09629e4f ? - Cannot solve it for the life of me...
pick the cell with 4 colored squares, then enlarge it?
that seems to be it, I was able to solve it
for arc agi 3 rules seems to be changed very much
be careful this is a scam message
can we write function p as lambda function, p=lambda x:
or should it be def p(x)
submitting lambda worked in the kaggle, just wanted to clarify here am i not missing some rules
⭐ For those who want to solve this challenge with cellular automata
I've made a puzzle game: https://caped.ferenczi.eu/
Enjoy!
Hi, is there any chance that you can provide an additional Python environment with a higher Python version >= 3.12? The current version is 3.11, which is not compatible with some new models in the HF Transformers libs.
That was one of my pain points as well. I just reproduced the same 3.11 environment and used that, but it is far from convenient.
that would break a lot of old code, but i agree its time to upgrade to 3.12 or later. The python kernels should be possible to set like any other jupyter notebook
but its like kaggle is no longer under development, so many old bugs from 10 years ago still looming around
Anyone trying TRM?
Is this true ??
I have successfully replicated trm and am now at war with the kaggle ui to submit
were u able to solve first few arc agi puzzles
Like, me personally? or the model?
did you do the whole $500 training run or a smaller version?
i meant the preview puzzles
this comp we cant use arc agi agents there?
no, it's ARC-AGI 2 competition
the kaggle one ?
yes
oh right thnx n its still going
13 or so days left
Hello - I have a question regarding the Kaggle challenge. Our team has developed a system and evaluated it using the arc-agi_evaluation-challenges.json and evaluated that we conform to the submission format by validating it against the arc-agi_evaluation-solutions.json.
More specifically we test this using the pass@2 metric:
# Calculating the pass@2 metric
# (submissions and solutions are the dicts loaded from the two JSON files)
import numpy as np

total = len(submissions)
correct = 0
for challenge_name in solutions:
    test_solutions = [
        np.array(test_solution)
        for test_solution in solutions[challenge_name]
    ]
    submission_solutions = [{
        "attempt_1": np.array(test_submission["attempt_1"]),
        "attempt_2": np.array(test_submission["attempt_2"]),
    } for test_submission in submissions[challenge_name]]
    if len(test_solutions) != len(submission_solutions):
        # Wrong number of test outputs: the task counts as unsolved
        continue
    else:
        correct_tests = 0
        for test_solution, submission_solution in zip(test_solutions, submission_solutions):
            # A test output counts if either attempt matches exactly
            if np.array_equal(test_solution, submission_solution["attempt_1"]):
                correct_tests += 1
            elif np.array_equal(test_solution, submission_solution["attempt_2"]):
                correct_tests += 1
        # A task only counts when all of its test outputs are correct
        if correct_tests == len(test_solutions):
            correct += 1
print(f"pass@2: {correct}/{total}")
We know when our model is "correct" (we can verify that it creates correct solutions) and when we submitted to Kaggle we indeed solved some of the puzzles. We investigated the solutions and are confident they are correct. However, the Kaggle leaderboard has rated us at 0.0.
From this we conclude that something must be wrong in our submission format.
Is there any more information on how the submission.json is used?
Specifically, for instance, what is the order of numpy lists (row from 0 to n). Can we assume numpy.tolist() produces a correct output?
Maybe for starters and clarification, is the challenge actually read from:
/kaggle/input/arc-prize-2025/arc-agi_test_challenges.json
Sorry if this question is a bit trivial 😄 but I'm really confused
Uff - is it possible that if you only solve tasks in the 50% that Kaggle doesn't choose for the leaderboard, you could get 0%, even though you actually solve tasks?
/claim 316
no it has to be a grid, not a flat list of ints
numpy.tolist() produces a nested list of lists and we also load it via np.array(input_mat), e.g:
[[2, 0, 2], [0, 0, 0], [0, 0, 0]]
This is a 3x3 matrix from row 0 (TOP) to bottom, correct?
yes that file is swapped during private evaluation
ah in that case yes that's fine
so what exactly is the issue? most likely the solution is just wrong; even 1 pixel wrong is wrong
what is your evaluation perfect task accuracy?
16/120 ~ 13%
The test challenge evaluates 240 tasks - if I understand correctly 120 of those are selected for the public leaderboard. Is it that we could be extremely unlucky and only solve "private" puzzles?
Are the puzzles used for the public leaderboard randomised? If so, I guess submitting again tomorrow should show if that is the case
so 16/120 means it's not a very good chance of score unfortunately..but it's pretty good ngl on eval if you're getting 16/120
well, good or not, we want to figure out why the leaderboard shows 0 😄
i suggest keep improving to 30 or above and if it stil shows no score then it's an issue
16 can be explained by variance
variance in which way?
the variance of the distribution of the eval sets, according to the hosts both sets have the same distribution of difficulty
"roughly"
so if you get good score on eval set, you "should" get similar score on leaderboard
I'm a bit confused. Are the outputs generated from notebooks submitted to the challenge not the ones with the actual test input? Because we know when a solution is correct
That's the whole point of the model we're writing, explainable AI 😄
The score might be low but we're fairly certain that the tasks that are logged as "solved" in submitted notebooks are in fact solved
yeah they all have test input
> The score might be low but we're fairly certain that the tasks that are logged as "solved" in submitted notebooks are in fact solved
This is where i'm confused, what do you mean solved? where do you see this on kaggle?
the train and eval sets have "solutions"
but are you talkin specifically about your specific submission run? or just the kaggle arc data itself?
I would guess my specific run
In the output section of your notebook, you receive a log. We have a per-task log. Our model informs us in this log if a task can be solved, as in, if a consistent algorithm can be found that explains everything (all test examples and train examples)
We solve a certain amount of tasks in the submission - and receive a 0 on the leaderboard
So we're a bit puzzled as to what exactly the issue is
It could of course be that this log is not correct, that it was generated on a public dataset instead of the private one, for instance
ahh that's misleading because the notebook is run twice: once on dummy data, which is just the training data copied into the test data,
then a second time for the private run, where the test data is swapped with the private hidden test data, which is not the copied training data but actual evaluation-like data
you can see the logs for the first run, but not the second
AHA
the second run is also not accessible via internet to prevent leaks via probing 🙂
That makes a lot more sense - so indeed the logs we were reading are not the actual submission logs, and debugging them in any way doesn't make sense?
In this case I would assume our algorithms haven't found solutions - good to know as well
yups
Thanks for your clarifications @digital forge
np 🙂
anyone tried using TRM?
https://www.kaggle.com/code/seconds0/trm-arc-agi-2-inference-py311-offline/notebook why this one got 0 score
ARC Prize 2025: Submission Scoring Error Help Needed
Problem: Getting "Submission Scoring Error" despite passing all local validation checks.
Verified Requirements (from Kaggle discussion):
✅ Each test input → test output: 240 tasks, 259 entries (223×1 + 15×2 + 2×3)
✅ Two attempts per output: All entries have attempt_1 and attempt_2
✅ Output length = Input length: Matches exactly
✅ Grid format: Rectangular, integers 0-9, matching shapes
My Format:
{"00000000": [{"attempt_1": [[0,0,0],[0,0,0],[0,0,0]], "attempt_2": [[0,0,0],[0,0,0],[0,0,0]]}]}
Matches sample_submission.json structure exactly (except values).
What I've Tried:
- Validated against sample format ✅
- Verified 240 tasks, 259 entries match test inputs ✅
- Compact JSON: separators=(',', ':') ✅
- Saved to /kaggle/working/submission.json ✅
- All grids: rectangular, 0-9 integers, matching shapes ✅
Notebook Setup:
Script loads from /kaggle/input/arc-submissions-2025/submission.json, validates, formats (compact JSON, sorted keys), saves to /kaggle/working/submission.json. Includes auto-fixing for shape mismatches, invalid values, size limits.
Questions:
- Subtle format differences causing rejection?
- File saving issues (encoding/line endings)?
- Wrong path/filename?
- Hidden characters/whitespace problems?
- JSON format requirements (compact vs pretty)?
- Multiple test inputs handling issue?
Context: ARC Prize 2025, Notebook v7, local validation passes. Model achieved 33.98% on EVALUATION set (validation, 120 tasks). Submission uses TEST set (240 tasks) - actual test accuracy unknown until Kaggle scores it.
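For comparison, a rough sketch of the checks listed above as code. The function names are made up and this only mirrors the checklist in this post; passing it does not guarantee Kaggle will score the file.

```python
def is_valid_grid(grid, max_size=30):
    """True for a non-empty rectangular list of lists of ints 0-9, at most max_size per side."""
    if not isinstance(grid, list) or not (1 <= len(grid) <= max_size):
        return False
    width = len(grid[0]) if isinstance(grid[0], list) else 0
    if not (1 <= width <= max_size):
        return False
    return all(
        isinstance(row, list)
        and len(row) == width
        and all(isinstance(v, int) and not isinstance(v, bool) and 0 <= v <= 9 for v in row)
        for row in grid
    )


def validate_submission(submission, challenges, max_size=30):
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    for task_id, task in challenges.items():
        entries = submission.get(task_id)
        expected = len(task["test"])
        if not isinstance(entries, list) or len(entries) != expected:
            problems.append(f"{task_id}: expected {expected} entries")
            continue
        for i, entry in enumerate(entries):
            for key in ("attempt_1", "attempt_2"):
                grid = entry.get(key) if isinstance(entry, dict) else None
                if not is_valid_grid(grid, max_size):
                    problems.append(f"{task_id}[{i}].{key}: not a valid grid")
    for task_id in submission:
        if task_id not in challenges:
            problems.append(f"{task_id}: not in the test challenges")
    return problems


challenges = {"00000000": {"test": [{"input": [[0]]}]}}
good = {"00000000": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[1]]}]}
flat = {"00000000": [{"attempt_1": [0, 0, 0], "attempt_2": [[1]]}]}  # flat list, not a grid

good_problems = validate_submission(good, challenges)
flat_problems = validate_submission(flat, challenges)
```

Note the task ids are checked against whatever `challenges` you pass in, so a file validated against the public test file says nothing about ids in the hidden rerun.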
Any guidance is appreciated, as I have been trying to for almost a week get this submitted.
"Script loads from /kaggle/input/arc-submissions-2025/submission.json, validates, formats (compact JSON, sorted keys), saves to /kaggle/working/submission.json. "
Don't do that, just rewrite attempt_1 and attempt_2 with [[0, 0], [0, 0]] for each task/test (or scoring will fail)
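A sketch of one way to read that advice: build the submission inside the notebook from whatever test file is mounted, and fall back to the `[[0, 0], [0, 0]]` placeholder whenever there is no prediction. `make_submission` and `flaky_solver` are made-up names; the two paths in the trailing comment are the ones quoted in this channel.

```python
import json

FALLBACK = [[0, 0], [0, 0]]  # placeholder grid used when no prediction is available


def make_submission(challenges, solve):
    """Produce an entry for every task and every test input, even if solving fails."""
    submission = {}
    for task_id, task in challenges.items():
        entries = []
        for test in task["test"]:
            try:
                first, second = solve(task["train"], test["input"])
            except Exception:
                first, second = None, None
            entries.append({
                "attempt_1": first or [row[:] for row in FALLBACK],
                "attempt_2": second or [row[:] for row in FALLBACK],
            })
        submission[task_id] = entries
    return submission


def flaky_solver(train_pairs, test_input):
    """Hypothetical solver that echoes 1-row inputs and crashes on anything else."""
    if len(test_input) > 1:
        raise RuntimeError("no rule found")
    return test_input, test_input


challenges = {"a": {"train": [], "test": [{"input": [[1]]}, {"input": [[1], [2]]}]}}
submission = make_submission(challenges, flaky_solver)
compact = json.dumps(submission, separators=(",", ":"))

# In the notebook this would read /kaggle/input/arc-prize-2025/arc-agi_test_challenges.json
# and write the result to /kaggle/working/submission.json.
```

That way the file always covers the task ids of the run being scored, including the hidden rerun.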
Hey, would we be able to do one last submission tomorrow? When making the submission, I was able to get the scoring to work, but it was replacing my predictions. They were being reformatted into similar dimensions to the inputs. I think I got a working submissions file now, but have to wait until tomorrow to resubmit. I was able to RL train my model and reached 35% accuracy and when reloading for evaluating I got around 30%-32%. was hoping to be able to submit one last one. If not oh wells.
Is using [[0, 0], [0, 0]] really a requirement? ... uff
when can i see the result at the private LB?
Hi
https://media.discordapp.net/attachments/1436719817624256534/1436719913518633010/1.JPG?ex=6910a130&is=690f4fb0&hm=6a48397700e40b701b7defba0bc73ccc590e83e58af09eb7035cae318e9fb319&=&format=webp&width=515&height=687
https://media.discordapp.net/attachments/1436719817624256534/1436719914034659408/2.jpg?ex=6910a130&is=690f4fb0&hm=5d3c01e3db0b2fe7135969c69c22cbf49db07bae5ed8cb9a98ac3e18d3c73ce5&=&format=webp&width=515&height=687
https://media.discordapp.net/attachments/1436719817624256534/1436719914512547951/3.jpg?ex=6910a130&is=690f4fb0&hm=59a326eaa4d74733a406431b5c2eb8ee07f6b78d95094102deb1153d2e261407&=&format=webp&width=515&height=687
Hi Kaggle team,
I wanted to report that I was banned from the ARC Prize Discord after sharing part of my research idea.
I was accused of using AI-generated text, which isn’t true. I write in Portuguese and translate my thoughts into English myself.
I believe this may have been a misunderstanding, but I’d like to clarify that my contributions were original and based on my own research. (Books from 2011, published in 2019)
I’m continuing to develop my work independently, and I hope Kaggle remains a space that values openness to new approaches and fair discussion.
And I'm not frustrated. I'm simply questioning something deeper:
how can a group aiming to build general AI be so resistant to the very type of reasoning that could lead to it?
It seems paradoxical that a community dedicated to the advancement of intelligence rejects approaches that challenge its own premises.
Openness to epistemic diversity should be part of the path towards true general artificial intelligence.
Which wallet do you have
If you make a proper submission on kaggle and get a score, that will always be respected