#arc-prize-2025

frigid thistle
#

Anyone want to collaborate?

fringe skiff
frigid thistle
harsh dagger
#

Thought u will find it complicated rather than interesting

harsh dagger
#

I forgot my planned algo it'd been so much time 🤧

short sable
#

This is the Powerball of all competitions @harsh dagger 😂 🤔

#

Everything else is like playing Pick 3, or, Pick 4

covert sun
#

Can someone tell me what this actually is and how the model selection process may work for this competition?

short sable
# covert sun Can someone tell me what this actually is and how the model selection process ma...

It's about abstracting. You are presented with input grids and must predict their corresponding output grids. In each example set, multiple input-output pairs are presented to you, in which a consistent rule or set of rules is applied to transform input grids into output grids. So your challenge is, given that (heuristic) example set, to predict the output that pairs with a novel input. You are given 3-5 examples to abstract a learning rule for a particular (heuristic) set and apply it to an input whose output you don't know. Hopefully that helps.
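
To make that concrete, here is a minimal sketch of one task in the JSON layout the competition files use ("train" and "test" lists of grids). The toy grids and the `mirror_left_right` rule are made up for illustration; real tasks are far less obvious.

```python
# Toy task: "train" holds demonstration pairs, "test" holds inputs to solve.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 5], [4, 0]]}],
}


def mirror_left_right(grid):
    """Hypothetical candidate rule: reverse every row."""
    return [list(reversed(row)) for row in grid]


def fits_all_examples(rule, task):
    """A rule is only plausible if it reproduces every demonstration output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


predictions = []
if fits_all_examples(mirror_left_right, task):
    # Apply the rule that explained the examples to the unseen test inputs.
    predictions = [mirror_left_right(item["input"]) for item in task["test"]]
```

The hard part of the challenge is finding the rule; checking a candidate against the examples, as above, is the easy part.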

harsh dagger
harsh dagger
#

It shouldn't ve been max of 2 attempts per task

lunar sky
#

it is best of 2

#

if any one of the answers are correct you get the score for it

harsh dagger
lunar sky
harsh dagger
#

Thnx to clarify

onyx prairie
frigid thistle
tame bane
#

Hey pals, I'm Anser. About to graduate from NUST in Pakistan. I've been working for the past few days on this challenge. I am really feeling the lack of some team. Yk more minds, more possibilities. If you are good at communicating, hit the DM if you wanna work with me. Let's try to beat it and publish a paper.

brave estuary
#

I'm new in kaggle. Does this competition have any restrictions on model size or gpu memory usage?

jovial lily
#

The problem I see with this challenge is that for some harder puzzles the training data is too sparse to pin down a single possible solution (even for humans). E.g. for puzzle 1ae2feb7 (first one in public evaluation set v2), judging from the training data, another solution would be that the second color on the left (counting from the wall) starts at the far right of the grid and goes left.

#

So more training data or simple unambiguous puzzles are needed.

dusty imp
jovial lily
#

Ok reasonable

onyx prairie
#

I see that you have to mix a lot of strategies and come up with some novel solution to this. If you check the interview with the creators, it is clearly not a challenge where you just use some gigantic deep learning model, throw data inside and pray to solve it. There is symbolic reasoning, heuristics and a lot more involved here.

steep rain
dusty imp
#

Last year the winning solution was just an LLM finetuned on ARC grids with test time finetuning, augmentation & funny inference
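
For anyone wondering what "augmentation" can mean on ARC grids, a rough sketch follows. The message above does not say which augmentations were used, so the rotations, flips and color permutations here are an assumption, and all function names are illustrative.

```python
def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_lr(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def permute_colors(grid, mapping):
    """Relabel colors; colors missing from the mapping stay unchanged."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]


def augment_pair(pair, transform):
    """Apply the same transform to input and output so the pair stays consistent."""
    return {
        "input": transform(pair["input"]),
        "output": transform(pair["output"]),
    }
```

Each transformed pair is a new, still self-consistent example, which gives a finetuning loop more data from the same handful of demonstrations.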

pale crest
#

so i was wondering about the evaluation metrics
is the score on the leaderboard the actual score, or would they apply a metric so the number gets transformed in some manner?
like for ex 7.92 becomes like 60 % or what ?

dusty imp
rain kite
#

Do we have architecture of last year solution? I already have access to code which is there on Kaggle.

dusty imp
#

you mean a higher level description of how their code works?

rain kite
grim parcel
#

Last year, there was a stated limit to the size of the grid. I haven't found one this year. Have I just missed it, or it's really not there?

short sable
grim parcel
#

Is that specified anywhere?

grim parcel
crisp dew
#

I have an approach: I have come up with a new "think" technique for a neural network to defeat ARC-AGI-2.
So look, the AI has 3 stages. Stage 1 chunks a question or task and decides how many "thinks" there will be. Let's say a regular neural network has a module where it thinks according to its own scheme; stage 1 instead splits the input into chunks and thinks about each chunk separately. Stage 2 combines some chunks together and thinks again, using what she has already thought before. Stage 3 combines everything she thought about all the chunks.

Let's say the question is "come up with creativity for a neural network". Stage 1 breaks the words down into chunks that mean something. For "come up with" she thinks: what does it mean to come up with something? Then "creativity": what does creativity mean? Then "for a neural network": what does that mean for a neural network? If you can do something, then you need to apply your knowledge in this area. Stage 2 merges the chunks of words that she thinks go better together in meaning, to advance the goal of answering, using knowledge from past reflections. Stage 3 uses the knowledge of everything she thought before and tries to find the answer.

For the first stage, so that she doesn't think for a long time, I think you can use an approach where she interprets words into the meanings that she understands best, or use a separate neural network for stage 1 so that it responds quickly; say it has 1 billion parameters and is trained to help the main neural network give the correct answer to the question. What do you guys think?

harsh dagger
#

-# sorry she's silly

barren horizon
#

seems similar conceptually

harsh dagger
#

Try ViT encoder with usual text based transformer decoder

unborn moat
#

guys is there anyway to download the training and evaluation data sets?

#

like arc org has 1520 problems

#

so i was wondering if there is a way to download all of them at once
or does the data in the data section of the competition have all those

short sable
# unborn moat guys is there anyway to download the training and evaluation data sets?

Umm. First you join the competition. Then you can download all the datasets with a simple click on "Data" page. Look, just this time, I'll save you the trouble; here is the dataset (a set of JSON files) for ARC 2024 challenge. Make sure to join though before the deadline for joining, if you want to be entered into this competition. Getting datasets for competitions, in general: this applies for most (all?) competitions, as a rule.

unborn moat
#

but is that all the data from the website asw

#

or is there an api i have to call or write a script to scrape it

short sable
# unborn moat no like i know that

It is a multitude of JSON files that contain the data you have to work with in developing your system. Except for the last, which gives you a sample submission file (which also needs to be in JSON format). From the site:

#

i don't know what you mean about scraping. You have to read and use these files in your code. This is the data you have to work with. Nothing needs to be scraped off web from any website

crisp dew
shy orbit
#

Idk if this is the right place to post, but I’m looking for teammates to work on this challenge.

wicked summit
#

Hi everyone,
I'm currently looking to join a team for the ARC Prize competition. I bring strong skills in Python, deep learning, and reasoning-based AI challenges, and I’m eager to actively contribute to both the modeling and experimentation aspects of the project.

I’ve been following ARC-style problems for a while and am particularly excited about the unique reasoning challenges this competition presents. I'm also comfortable with collaboration tools (Git, Notebooks, etc.) and can commit time regularly throughout the competition.

If you're looking for a motivated teammate who’ll pull their weight, I’d love to connect.
You can check out my Kaggle profile here: https://www.kaggle.com/sathishkumartheta

Feel free to DM or reply here!

pseudo hawk
#

Hi, I would like to team up with anyone to work in the arc prize 2025.

I have worked in Finetuning LLM, building agents, Rag systems, Core ML, and RL. Would like to team up with anyone to work in collaboration

pseudo hawk
fallen marsh
sly meteor
#

Hi I'm trying to use Qwen3 with vLLM, can anyone share a starter notebook for this?

dusty imp
cinder kite
#

Hi all, has anyone tried out the LPN-based approach (search latent program spaces : https://arxiv.org/pdf/2411.08706) on ARC AGI 2 yet? I am looking to make some improvements to this.

short sable
#

Warning: Task 'bbb1b8b6' has 7 clues. Expected 2 to 5.

short sable
#

What's the deal with the possibility of there being more than 1 solution for some challenges? Does anyone understand this? Do we have to present two different solutions? Found in the training data that sometimes there were multiple solutions, btw, too.

digital forge
#

There is a demonstration set (train set), and an inference set (test set). Each set contains N tasks, where each task contains both an input and an output

#

the goal is to predict ALL the Inference outputs given the Demonstration Inputs, Demonstration Outputs, and Inference Inputs

#

not even one pixel can be incorrect and the shape cannot also be wrong

#

you also get two attempts per output predicted

#

so if one of the attempts is correct, it's fully correct
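
A minimal sketch of that rule for a single test output, using plain nested lists. The function name is made up, and how per-output results are aggregated into the leaderboard number is not covered here.

```python
def output_is_solved(truth, attempt_1, attempt_2):
    """Best of 2: the output counts if either attempt matches the truth exactly.

    List equality compares shape and every cell, so a wrong size or a single
    wrong pixel makes an attempt fail.
    """
    return attempt_1 == truth or attempt_2 == truth


truth = [[1, 2], [3, 4]]
wrong_pixel = [[1, 2], [3, 0]]
wrong_shape = [[1, 2, 0], [3, 4, 0]]
```

Here `output_is_solved(truth, wrong_pixel, truth)` is true because the second attempt is exact, while `output_is_solved(truth, wrong_pixel, wrong_shape)` is false.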

short sable
#

JUST A FRIENDLY HEADS UP as related to the Challenge: Some of the Tasks have up to 10 clue/heuristic pairs (Training input and output pairs). Also, some Tasks have multiple Test inputs and Solutions. In these cases, there will always be as many Solutions to figure out, as there will be Test inputs to be processed, using the same heuristic pair sets. And when submitting, for those Tasks, they will need to be submitted, in a list, in order (test[0] -> solution[0]), for that specific Task. That said, GOOD LUCK!
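
The ordering requirement above can be sketched like this. `solver` is a hypothetical callable that returns two guesses for one test input; the `attempt_1` / `attempt_2` keys are the ones used in the sample submission file.

```python
import json


def build_submission(test_tasks, solver):
    """Build {task_id: [entry for test[0], entry for test[1], ...]}.

    `solver(train_pairs, test_input)` is a placeholder for your own model and
    must return a (first_guess, second_guess) tuple of grids.
    """
    submission = {}
    for task_id, task in test_tasks.items():
        entries = []
        # Keep the same order as the test inputs: test[0] -> entries[0], etc.
        for test_item in task["test"]:
            first, second = solver(task["train"], test_item["input"])
            entries.append({"attempt_1": first, "attempt_2": second})
        submission[task_id] = entries
    return submission


def save_submission(submission, path):
    """Write the submission dict to `path` as JSON."""
    with open(path, "w") as handle:
        json.dump(submission, handle)
```

A task with three test inputs therefore gets a list of three entries, even if the model has nothing useful to say about some of them.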

copper pagoda
#

Hey everyone. I understand the data format for this challenge, but I have some questions :

If you are training your own model, could you share how you are dealing with padding & grid_size / class imbalance / loss ?

(I'm not necessarily asking about your best approach of course, even just ideas from your baseline would be ok.)

digital forge
# copper pagoda Hey everyone. I understand the data format for this challenge, but I have some q...

There are mainly 3 ways for predicting shape.

  1. Upsampling/Interpolation
  2. Dedicated Prediction Heads (usually CrossEntropy over the possible 900 shapes)
  3. Boundary detection using Softmax on logits whose probability exceeds a certain threshold.

For padding there are 2 obvious methods.

  1. Shift all colors by 1 and use 0 as the padding value when collating tensors.
  2. Same as above but use negative numbers for padding safety.

Class imbalance can be solved by Focal Loss
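
A framework-free sketch of the first padding method and of focal loss for a single cell, written with plain lists so the arithmetic is visible. In practice this would live in a tensor library; the function names and the `gamma=2.0` default are assumptions for illustration.

```python
import math

PAD = 0  # reserved for padding; real colors 0-9 are shifted to 1-10


def pad_batch(grids):
    """Shift colors by 1 and pad every grid to the largest height and width."""
    height = max(len(grid) for grid in grids)
    width = max(len(grid[0]) for grid in grids)
    batch = []
    for grid in grids:
        rows = [[cell + 1 for cell in row] + [PAD] * (width - len(row)) for row in grid]
        rows += [[PAD] * width for _ in range(height - len(grid))]
        batch.append(rows)
    return batch


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one cell: -(1 - p_t)^gamma * log(p_t).

    `probs` is the predicted distribution over classes and `target` the true
    class index. With gamma=0 this reduces to ordinary cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

The `(1 - p_t)^gamma` factor shrinks the loss on cells the model already predicts confidently (typically the dominant background color), so rarer colors contribute relatively more to the gradient.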

short sable
sinful egret
#

Is this year's challenge harder than the 2024 one?

harsh dagger
brittle shadow
#

Is it at all possible to work on this while not being very proficient in python? What’s the best tool to translate English to python? Or is AI not there yet..

harsh dagger
fresh zealot
#

Im having issues with fragmented stop word tokenization in transformers library. Using a qwen2.5 model to write code and i want to stop on "```\n" after a python block, so i can inject the code result before continuing generation.

but i cant get it to work. anyone faced the same issue, and resolved it? Thanks.
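
One workaround that avoids token-level matching altogether: decode only the newly generated tokens after each step and look for the stop string in the text. The sketch below shows just the text-side check; wiring it into generation (for example through a custom `StoppingCriteria` in transformers) is left out, and `cut_at_stop` is a made-up helper name.

```python
FENCE = "`" * 3 + "\n"  # the closing code fence followed by a newline


def cut_at_stop(new_text, stop=FENCE):
    """Return (text up to and including the first stop string, True) if found.

    Matching on decoded text is robust to the fence being split across tokens
    or merged with neighbouring characters. If the stop string is absent, the
    text is returned unchanged together with False.
    """
    index = new_text.find(stop)
    if index == -1:
        return new_text, False
    return new_text[: index + len(stop)], True
```

An opening fence with a language tag does not match, because the tag sits between the backticks and the newline. A bare opening fence would match early, so this assumes the model always tags its code blocks.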

elder haven
# harsh dagger python is the bare minimum not english

I build just via LLM collaboration, letting the LLM produce the code, then I just kinda coordinate that. You need to understand ML fundamentals but you could just let AI teach you about AI essentially haha--edit: not a replacement for coding yourself, but if the learning curve is too steep just have fun! Whatever works! you_got_it_dude

brittle shadow
burnt token
brittle shadow
burnt token
brittle shadow
# burnt token Get a really smart partner

You mean get me a partner that knows Python. Yes, I considered that when I asked. Since I’ve learned absolutely no new information from this back and forth I’ll explore other avenues.

astral vessel
#

Can anyone tell what's arc prize ? And what's the problem

celest sparrow
astral vessel
sacred vapor
#

Basically pattern recognition but they give too little input data so current transformer models can't solve it

#

U should give the input data a peek u will get a better idea

subtle cloud
#

Hey everyone!

Long before the launch of Kaggle’s Game Arena, I developed a real-time AI chess battle platform where you can watch different AI models play against each other.

🔗 Try it here: https://adaptive-ai-chess-game.streamlit.app/

🎮 Modes available:

Human vs Adaptive AI

Adaptive AI vs Hyperbolic AI

It’s a fun and insightful way to observe how different AI strategies behave on the board — ideal for anyone into AI, reinforcement learning, or game theory.

📩 If you're interested in collaborating, expanding, or integrating it into more formal benchmarks, feel free to connect:

👤 Karthikeya Guduru
📧 karthikeyaguduru19@gmail.com
🔗 https://www.linkedin.com/in/karthikeya-guduru-70227b262/

tall bramble
celest sparrow
celest sparrow
# tall bramble so which one should we go with?

I’d recommend going with the deadline listed on Kaggle, since that’s the official competition host and the one submissions will be judged by. The ARCPRIZE site might just be showing it in a different time zone or hasn’t updated yet. You can confirm by checking the “Timeline” section on the Kaggle competition page or asking the organizers directly in the Kaggle Q&A forum.

tall bramble
celest sparrow
median maple
#

Does anybody know the transformation (logic) for task 09629e4f ? - Cannot solve it for the life of me...

narrow crow
#

pick the cell with 4 colored squares, then enlarge it?

#

that seems to be it, I was able to solve it

harsh dagger
narrow crow
#

be careful this is a scam message

uneven sapphire
#

can we write function p as lambda function, p=lambda x:

#

or should it be def p(x)

#

submitting lambda worked in the kaggle, just wanted to clarify here am i not missing some rules

median maple
livid furnace
#

Hi, is there any chance that you can provide an additional Python environment with a higher Python version >= 3.12? The current version is 3.11, which is not compatible with some new models in the HF Transformers libs.

median maple
#

That was one of my pain points as well. I just reproduced the same 3.11 environment and used that, but it is far from convenient.

fresh zealot
#

that would break a lot of old code, but i agree its time to upgrade to 3.12 or later. The python kernels should be possible to set like any other jupyter notebook

#

but its like kaggle is no longer under development, so many old bugs from 10 years ago still looming around

harsh dagger
#

Anyone trying TRM?

south plover
#

Is this true ??

fringe beacon
harsh dagger
fringe beacon
#

Like, me personally? or the model?

dusty imp
harsh dagger
harsh dagger
#

this comp we cant use arc agi agents there?

digital forge
harsh dagger
digital forge
#

yes

harsh dagger
#

oh right thnx n its still going

digital forge
#

13 or so days left

agile geyser
#

Hello - I have a question regarding the Kaggle challenge. Our team has developed a system, evaluated it using the arc-agi_evaluation-challenges.json, and verified that we conform to the submission format by validating it against the arc-agi_evaluation-solutions.json.
More specifically we test this using the pass@2 metric:

# Calculating the pass@2 metric
# `solutions` is loaded from arc-agi_evaluation-solutions.json;
# `submissions` is our submission dict, keyed by the same task ids
import numpy as np

total = len(submissions)
correct = 0

for challenge_name in solutions:
    test_solutions = [
        np.array(test_solution)
        for test_solution in solutions[challenge_name]
    ]

    submission_solutions = [{
        "attempt_1": np.array(test_submission["attempt_1"]),
        "attempt_2": np.array(test_submission["attempt_2"]),
    } for test_submission in submissions[challenge_name]]

    # A task with the wrong number of predicted outputs cannot count as solved
    if len(test_solutions) != len(submission_solutions):
        continue

    # Count the test outputs matched exactly by either attempt
    correct_tests = 0
    for test_solution, submission_solution in zip(test_solutions, submission_solutions):
        if (np.array_equal(test_solution, submission_solution["attempt_1"])
                or np.array_equal(test_solution, submission_solution["attempt_2"])):
            correct_tests += 1

    # The task is solved only if every one of its test outputs is matched
    if correct_tests == len(test_solutions):
        correct += 1

print(f"pass@2: {correct}/{total}")

We know when our model is "correct" (we can verify that it creates correct solutions) and when we submitted to Kaggle we indeed solved some of the puzzles. We investigated the solutions and are confident they are correct. However, the Kaggle leaderboard has rated us at 0.0.
From this we conclude that something must be wrong in our submission format.
Is there any more information on how the submission.json is used?

#

Specifically, for instance, what is the order of numpy lists (row from 0 to n). Can we assume numpy.tolist() produces a correct output?

agile geyser
#

Maybe for starters and clarification, is the challenge actually read from:
/kaggle/input/arc-prize-2025/arc-agi_test_challenges.json
Sorry if this question is a bit trivial 😄 but I'm really confused

#

Uff - is it possible that if you only solve tasks in the 50% that Kaggle doesn't choose for the leaderboard, you could get 0%, even though you actually solve tasks?

golden vortex
#

/claim 316

digital forge
agile geyser
#

numpy.tolist() produces a nested list of lists and we also load it via np.array(input_mat), e.g:

[[2, 0, 2], [0, 0, 0], [0, 0, 0]]
#

This is a 3x3 matrix from row 0 (TOP) to bottom, correct?
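
For reference, the ARC JSON files store each grid as a list of rows from top to bottom, so `grid[r][c]` is row `r` counted from the top and column `c` counted from the left. A quick check with plain lists (no NumPy needed; `ndarray.tolist()` yields the same nested-list structure with plain Python ints):

```python
import json

grid = [[2, 0, 2], [0, 0, 0], [0, 0, 0]]

top_row = grid[0]        # first inner list is the top row
top_right = grid[0][2]   # row 0, column 2
height, width = len(grid), len(grid[0])

# Serialising and parsing again keeps the row order unchanged.
round_trip = json.loads(json.dumps(grid))
```

So if predictions are built as row-major arrays and converted with `tolist()`, the orientation should not be what breaks a submission.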

digital forge
digital forge
#

so what exactly is the issue? most likely the solution is just wrong; even 1 wrong pixel means the whole output is wrong

#

what is your evaluation perfect task accuracy?

agile geyser
#

16/120 ~ 13%

#

The test challenge evaluates 240 tasks - if I understand correctly 120 of those are selected for the public leaderboard. Is it that we could be extremely unlucky and only solve "private" puzzles?

#

Are the puzzles used for the public leaderboard randomised? If so, I guess submitting again tomorrow should show if that is the case
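
The "extremely unlucky" scenario can be put in numbers. This is a back-of-the-envelope sketch that assumes the public subset is a uniformly random 120 of the 240 tasks, which nobody in this thread has confirmed; the function name is made up.

```python
from math import comb


def prob_zero_public(total, public, solved):
    """Chance that none of `solved` tasks falls in a random public subset.

    Hypergeometric: choose `public` tasks out of `total` uniformly at random
    and ask for the probability that all of them avoid the solved ones.
    """
    return comb(total - solved, public) / comb(total, public)
```

With one solved task, `prob_zero_public(240, 120, 1)` is exactly 0.5, and the chance roughly halves with every additional solved task; with ten it is already below one in a thousand. A true solve rate anywhere near 16/120 would make a 0.0 from bad luck alone very unlikely under this assumption.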

digital forge
#

so 16/120 means it's not a very good chance of score unfortunately..but it's pretty good ngl on eval if you're getting 16/120

agile geyser
#

well, good or not, we want to figure out why the leaderboard shows 0 😄

digital forge
#

i suggest you keep improving to 30 or above and if it still shows no score then it's an issue

#

16 can be explained by variance

agile geyser
#

variance in which way?

digital forge
#

the variance of the distribution of the eval sets, according to the hosts both sets have the same distribution of difficulty

#

"roughly"

#

so if you get good score on eval set, you "should" get similar score on leaderboard

agile geyser
#

I'm a bit confused. Are the outputs generated from notebooks submitted to the challenge not the ones with the actual test input? Because we know when a solution is correct

#

That's the whole point of the model we're writing, explainable AI 😄

#

The score might be low but we're fairly certain that the tasks that are logged as "solved" in submitted notebooks are in fact solved

digital forge
#

yeah they all have test input

#

The score might be low but we're fairly certain that the tasks that are logged as "solved" in submitted notebooks are in fact solved
This is where i'm confused, what do you mean solved? where do you see this on kaggle?

#

the train and eval sets have "solutions"

#

but are you talkin specifically about your specific submission run? or just the kaggle arc data itself?

agile geyser
#

I would guess my specific run

#

In the output section of your notebook, you receive a log. We have a per-task log. Our model informs us in this log if a task can be solved, as in, if a consistent algorithm can be found that explains everything (all test examples and train examples)

#

We solve a certain amount of tasks in the submission - and receive a 0 on the leaderboard

#

So we're a bit puzzled as to what exactly the issue is

#

It could of course be that this log is not correct, that it was generated on a public dataset instead of the private one, for instance

digital forge
#

ahh that's misleading because the notebook is run twice, once on dummy data which is just the training data copied into the test data, and then it's run

then a second time for the private run, where the test data is swapped with the private hidden test data, which is not the copied training data but actual evaluation-like data

#

you can see the logs for the first run, but not the second
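
Given that, it can help to have the notebook label its own logs. A rough sketch: compare the task ids in the test file against the ids you already know from the public files. This assumes the hidden tasks use ids that do not appear in the public data; the helper name is made up.

```python
def looks_like_public_rerun(test_tasks, public_tasks):
    """True when every test task id is already known from the public data.

    Assumption: the hidden test set uses ids that are absent from the public
    files, so a full overlap suggests the visible dummy run.
    """
    return bool(test_tasks) and set(test_tasks) <= set(public_tasks)


def run_label(test_tasks, public_tasks):
    """Short tag to prefix log lines with."""
    if looks_like_public_rerun(test_tasks, public_tasks):
        return "DUMMY RUN (public data, not scored)"
    return "HIDDEN RUN (scored, logs not visible)"
```

Any "solved" message printed under the dummy label says nothing about the leaderboard score.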

agile geyser
#

AHA

digital forge
#

the second run is also not accessible via internet to prevent leaks via probing 🙂

agile geyser
#

That makes a lot more sense - so indeed the logs we were reading are not the actual submission logs, and debugging them in any way doesn't make sense?

#

In this case I would assume our algorithms haven't found solutions - good to know as well

digital forge
#

yups

agile geyser
#

Thanks for your clarifications @digital forge

digital forge
#

np 🙂

harsh dagger
#

anyone tried using TRM?

fringe beacon
#

Because i am fucking up the eval pipeline somehow

#

and trying to diagnose

somber elm
#

ARC Prize 2025: Submission Scoring Error Help Needed

Problem: Getting "Submission Scoring Error" despite passing all local validation checks.

Verified Requirements (from Kaggle discussion):
✅ Each test input → test output: 240 tasks, 259 entries (223×1 + 15×2 + 2×3)
✅ Two attempts per output: All entries have attempt_1 and attempt_2
✅ Output length = Input length: Matches exactly
✅ Grid format: Rectangular, integers 0-9, matching shapes

My Format:

{"00000000": [{"attempt_1": [[0,0,0],[0,0,0],[0,0,0]], "attempt_2": [[0,0,0],[0,0,0],[0,0,0]]}]}

Matches sample_submission.json structure exactly (except values).
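
For comparison, the checks listed above can be written as a small validator. This is only a sketch of those checks with made-up function names; passing it does not guarantee Kaggle will score the file, and it has to run against the test file the notebook actually sees at scoring time rather than a saved copy.

```python
def is_valid_grid(grid):
    """Non-empty rectangular list of lists containing only integers 0-9."""
    if not isinstance(grid, list) or not grid:
        return False
    if not all(isinstance(row, list) and row for row in grid):
        return False
    width = len(grid[0])
    return all(
        len(row) == width
        and all(isinstance(c, int) and not isinstance(c, bool) and 0 <= c <= 9 for c in row)
        for row in grid
    )


def validate_submission(submission, test_tasks):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for task_id, task in test_tasks.items():
        entries = submission.get(task_id)
        expected = len(task["test"])
        if not isinstance(entries, list) or len(entries) != expected:
            problems.append(f"{task_id}: expected {expected} entries")
            continue
        for index, entry in enumerate(entries):
            for key in ("attempt_1", "attempt_2"):
                grid = entry.get(key) if isinstance(entry, dict) else None
                if not is_valid_grid(grid):
                    problems.append(f"{task_id}[{index}].{key}: invalid grid")
    # Task ids in the submission that the test file does not contain.
    for task_id in sorted(set(submission) - set(test_tasks)):
        problems.append(f"{task_id}: not in test file")
    return problems
```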

What I've Tried:

  • Validated against sample format ✅
  • Verified 240 tasks, 259 entries match test inputs ✅
  • Compact JSON: separators=(',', ':')
  • Saved to /kaggle/working/submission.json
  • All grids: rectangular, 0-9 integers, matching shapes ✅

Notebook Setup:
Script loads from /kaggle/input/arc-submissions-2025/submission.json, validates, formats (compact JSON, sorted keys), saves to /kaggle/working/submission.json. Includes auto-fixing for shape mismatches, invalid values, size limits.

Questions:

  1. Subtle format differences causing rejection?
  2. File saving issues (encoding/line endings)?
  3. Wrong path/filename?
  4. Hidden characters/whitespace problems?
  5. JSON format requirements (compact vs pretty)?
  6. Multiple test inputs handling issue?

Context: ARC Prize 2025, Notebook v7, local validation passes. Model achieved 33.98% on EVALUATION set (validation, 120 tasks). Submission uses TEST set (240 tasks) - actual test accuracy unknown until Kaggle scores it.

Any guidance is appreciated, as I have been trying for almost a week to get this submitted.

lavish river
#

"Script loads from /kaggle/input/arc-submissions-2025/submission.json, validates, formats (compact JSON, sorted keys), saves to /kaggle/working/submission.json. "

Don't do that, just rewrite attempt_1 and attempt_2 with [[0, 0], [0, 0]] for each tasks/tests (Or Scoring will fail)

somber elm
#

Hey, would we be able to do one last submission tomorrow? When making the submission, I was able to get the scoring to work, but it was replacing my predictions. They were being reformatted into similar dimensions to the inputs. I think I got a working submissions file now, but have to wait until tomorrow to resubmit. I was able to RL train my model and reached 35% accuracy and when reloading for evaluating I got around 30%-32%. was hoping to be able to submit one last one. If not oh wells.

agile geyser
#

Is using [[0, 0], [0, 0]] really a requirement? ... uff

glass pier
#

when can i see the result at the private LB?

minor oriole
#

Hi

raven mural
#

Hi Kaggle team,

I wanted to report that I was banned from the ARC Prize Discord after sharing part of my research idea.
I was accused of using AI-generated text, which isn’t true. I write in Portuguese and translate my thoughts into English myself.
I believe this may have been a misunderstanding, but I’d like to clarify that my contributions were original and based on my own research. (Books from 2011, published in 2019)

I’m continuing to develop my work independently, and I hope Kaggle remains a space that values openness to new approaches and fair discussion.

#

And I'm not frustrated. I'm simply questioning something deeper:
how can a group aiming to build general AI be so resistant to the very type of reasoning that could lead to it?

It seems paradoxical that a community dedicated to the advancement of intelligence rejects approaches that challenge its own premises.
Openness to epistemic diversity should be part of the path towards true general artificial intelligence.

dusty imp