tranquil seal Mar 18, 2026, 11:08 AM

#

Can somebody share how they approached this? I'm interested to learn what techniques you used here. If you have done something related to this, please let me know. Any information would be appreciated, even if the competition is over.

sacred linden Mar 19, 2026, 5:45 AM

#

tranquil seal Can somebody share how they approach this? Im interested to learn what technique...

Obvious step 1: Read and understand the ideas behind the article: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/measuring-progress-toward-agi/measuring-progress-toward-agi-a-cognitive-framework.pdf
Step 2: Select the competition track
Step 3: Find a case and try to implement a benchmark

past surge Mar 19, 2026, 5:46 PM

#

Hello! I have a question about the data for benchmarks. Should it be manually created, or are we allowed to use existing data?

gloomy flax Mar 20, 2026, 3:35 AM

#

past surge Hello! I have a question about the data for benchmarks. Should it be manually cr...

There is no dataset needed, just do the task.

past surge Mar 20, 2026, 11:37 AM

#

gloomy flax there no dataset need just do task

I am afraid you are wrong. On the competition page, the evaluation criteria table mentions dataset requirements at least twice. Tasks and data come together as parts of a benchmark, as you can see if you read the kaggle-benchmark documentation page.

gloomy flax Mar 20, 2026, 12:46 PM

#

past surge I am afraid you are wrong. On the competition page in the evaluation criteria ta...

go ahead brother

past surge Mar 20, 2026, 3:42 PM

#

gloomy flax go ahead brother

I would prefer to wait for the answer from the Kaggle Staff.

#

Sorry, I already found an answer in section 6 of the competition rules: https://www.kaggle.com/competitions/kaggle-measuring-agi/rules

gloomy flax Mar 21, 2026, 6:26 AM

#

past surge Sorry, already found an answer in the section 6 under the competition rules: htt...

https://www.kaggle.com/competitions/kaggle-measuring-agi/data

Dataset Description
Welcome!

This is a Hackathon with no provided dataset.

blazing coral Mar 21, 2026, 12:55 PM

#

gloomy flax https://www.kaggle.com/competitions/kaggle-measuring-agi/data Dataset Descripti...

he wasnt asking that, lol

craggy bough Mar 21, 2026, 9:09 PM

#

sacred linden Obvious step 1. Read + Understand ideas behind the article https://storage.googl...

@sacred linden Thanks for providing the article. I threw NotebookLM at this to generate some podcasts (I'm an auditory learner) and it mentioned that article.
@tranquil seal Great way to start the discord and thanks for your question.

Then, if you’re not using agents in your workflows, whether for school, work, your personal business, or just managing your life: start!

And don’t just try one. Try a bunch: Claude Code, Codex, Gemini CLI 😉, Claude Cowork, Copilot, Cline/Roo… maybe even design your own (with caution; take it as a learning exercise instead of a grandiose goal of beating frontier labs at AGI).

Use them at different levels:
• simple autocomplete / direct code edits
• that “in-between” turbo coding mode (not quite autocomplete, not quite full vibe coding, haven't heard a term for that yet so let's coin that)
• full vibe coding
• long-horizon agentic coding and autonomous agents

As I once heard: you need to get a feel for the AI. Leverage your human intuition to pick up on where the limitations are, those jagged edges.

One thing I do: if a model or tool fails miserably… I run it again. If it still fails after a few tries or iterations, wait a few months, then try again. Or better yet, keep lab notes (I use a markdown file with Obsidian) of all the cases where you've seen it fail, and use them for this competition.

craggy bough Mar 21, 2026, 9:15 PM

#

craggy bough @sacred linden Thanks for providing the article. I threw notebookLM at th...

The above will help you iterate and make progress on the interpolation side of AI. But for the extrapolation side, the kind of intelligence that lives outside the dataset (AGI, SGI, genuinely novel or hard problems), take a step back and look at how YOU learn.

How do you pick up new concepts, reason through things, update your beliefs when new evidence comes in, find the info you need, plan your approach, and recognize when you need help (like what made you come here, for example 😜 )?

When you’re learning something, practice mindfulness and observe your own conscious experience. What’s the dialogue in your head actually saying? What does it feel like when a math concept finally clicks? How did you go about finding the data you needed to solve a problem? Then use that to workshop these benchmarks.

Hope this helps and hope this kicks off more discussion!

buoyant fjord Mar 23, 2026, 9:20 AM

#

Hey everyone 👋

I need some advice and I'm just going to be honest about why I'm asking.

I'm AuDHD — autistic and ADHD — and this was my first hackathon ever. I submitted to the Google DeepMind AGI Hackathon solo and I genuinely don't know if I did it right 😅

Some questions I can't figure out on my own:

Is spending just a few hours on a final submission normal or did I just generate chaos? 💀

Is there a standard format judges expect or is it really just whatever you built?

I submitted a benchmark framework — not a model, not an app. Is that even a valid submission type or did I misread the whole thing?

How do you know when something is done enough to submit? I hit a point where it felt right and just went for it. Is that how this works?

I work best alone so I had no one to gut check any of this with before submitting 😭

I'm new to coding — all written with AI assistance. The ideas are mine but I genuinely don't know what I don't know technically and that part scares me a little.

Any advice from people who have done this before would mean a lot. Even just telling me I generated chaos so I can prepare emotionally 💀

Results June 1st. Already bracing. 🐟

— GiGi

blazing coral Mar 24, 2026, 12:06 AM

#

Thank God community upvotes are removed from the evaluation rubric
https://www.kaggle.com/competitions/kaggle-measuring-agi/discussion/684184

gloomy flax Mar 24, 2026, 8:26 AM

#

What are the limits on tasks in a benchmark?

pure moon Mar 24, 2026, 6:01 PM

#

buoyant fjord Hey everyone 👋 I need some advice and I'm just going to be honest about why I...

Just lock in and be consistent. You already have the power to become great.

#

Adhd + autistic

flint mica Mar 27, 2026, 5:26 AM

#

pure moon Just lock in and be consistent you already have power to become great

ww

neon hatch Mar 27, 2026, 4:10 PM

#

Live Q&A: Measuring Progress Toward AGI!

Two weeks into the Cognitive Abilities hackathon, the benchmarks being built by the Kaggle community are already incredible! To talk more about it, Nicholas Kang (Kaggle) and the authors of the paper the hackathon is based on, Dr Ryan Burnell and Oran Kelly (Google DeepMind), are going LIVE to chat with you all.

What we’re covering:
20-Min Deep Dive into the paper and what we’re looking for in the hackathon
20-Min Live AMA: Your chance to ask the team anything about the hackathon or the paper

🔔 Set your reminders here: https://www.youtube.com/live/9YYiWs6gNV0

craggy bough Mar 31, 2026, 3:29 AM

#

It seems getting to AGI will rely on harnessing, but it doesn't seem this competition considers this. Thoughts?

forest cypress Mar 31, 2026, 11:34 AM

#

craggy bough It seems getting to AGI will relay on harnessing, but it doesn't seem this compe...

In the paper they say 'We therefore think the best approach is to evaluate the system as a whole, including any built-in tools or modules.', so I'd argue they are considering it! I think the kaggle benchmark sdk also supports custom tools, so it could definitely be done 🙂

formal wolf Mar 31, 2026, 10:31 PM

#

Anyone manage to get multi-turn working in the benchmark? I'm curious as to how y'all got it working.

formal wolf Mar 31, 2026, 11:16 PM

#

Also, I'm confused about the contest rules. It's just one submission right? The whole way through?

formal wolf Apr 1, 2026, 12:04 AM

#

Also, I have some sensitive questions related to my implementation, is there a Q&A channel, or someone to DM in particular? I want to get all of my concepts 100% done before the weekend so I can really get down and finish this up

formal wolf Apr 1, 2026, 7:04 PM

#

https://tenor.com/view/hey-cat-waving-wave-paw-gif-11249423921160715390

formal wolf Apr 1, 2026, 9:49 PM

#

Alright... No one's responding and I can't actually find any mods to DM. Just going to try and ask in multiple places (here, the Kaggle discussions tab, in some comments sections).

What's the difference between executive function and learning?

The official example: https://www.kaggle.com/code/kodatirevanth/five-tracks-five-benchmark-designs-a-deep-dive/notebook#Track-4:-Executive-Functions

gives us a learning example for synthetic rule induction - but the executive function example also has synthetic rule induction.

I take it to mean that learning is "AI has a goal, doesn't know how to do it, learns in real time", while executive function is "AI has a goal, knows how to do it, does it." My benchmark crosses into both categories. I need to DM someone to ask.

#

Has anyone managed to get a multi-turn benchmark working with dynamic datasets? I am generating 5 variations of my dataset (slightly different problems) and throwing my model at them. Programming-wise this is quite complex, and I was wondering if it would be too technical (i.e., the Kaggle system can't support it).

#

Where can I learn more about the multi-turn features? All of the provided links I see are trivial and don't explain how to use them, or how to design around them.

I need something that tracks state, and dynamically gives feedback over 20 turns. Like a text adventure game.

formal wolf Apr 1, 2026, 10:02 PM

#

formal wolf Also, I'm confused about the contest rules. It's just one submission right? The ...

It's one benchmark and one writeup targeting one field, but users can make more than 1 task.

formal wolf Apr 1, 2026, 10:24 PM

#

neon hatch Live Q&A: Measuring Progress Toward AGI! Two weeks into the Cognitive Abilities...

Watching this through right now, thanks! This answers quite a few of my questions!

craggy bough Apr 3, 2026, 4:30 AM

#

formal wolf Alright... No ones responding and I can't actuallya find any mods to DM. Just go...

Honestly, the discords for kaggle seem to be pretty dead, most of the discussions are all being handled by clawedbots...errr I mean people on the discussion section of the kaggle competition.

plain forge Apr 3, 2026, 1:21 PM

#

Anyone experiencing unexpected out of quota errors? Even though I have plenty of quota remaining the task fails for some models with the error that I don’t have enough quota

#

In one case the error mentioned that the request would cost 11 usd! That would’ve been within the quota anyway but that’s super expensive!

craggy bough Apr 3, 2026, 5:28 PM

#

I have only run a few prompts and my daily quota is at $13, something might be up with the benchmark sdk?

formal wolf Apr 3, 2026, 9:00 PM

#

What models are yall using? For testing (just to see if my code works) I used gemma 1b

#

Also, for development with the SDK I'm using Gemini, but in Google AI Studio.

plain forge Apr 4, 2026, 5:48 AM

#

I think there’s one of the Gemma models that gets selected automatically as default. So that’s what I use for testing. I just add some models after that, a mixture of (what I think are) large and small models, new and old…

#

Is there some sort of overview or comparison table of these models? Like how large they are, whether they have a vision API or not, etc.?

red linden Apr 5, 2026, 10:20 AM

#

Hey everyone, is anyone facing this Bokeh/Jupyter issue?
While creating tasks

[Open Browser Console for more detailed log - Double click to close this message]
Failed to create view for 'BokehView' from module '@bokeh/jupyter_bokeh' with model 'BokehModel' from module '@bokeh/jupyter_bokeh'
TypeError: Cannot read properties of undefined (reading 'require')
    at St (https://cdn.jsdelivr.net/npm/@bokeh/jupyter_bokeh@%5E4.0.5/dist/index.js:2:379889)
    at new Tt (https://cdn.jsdelivr.net/npm/@bokeh/jupyter_bokeh@%5E4.0.5/dist/index.js:2:380410)
    at https://unpkg.com/@jupyter-widgets/html-manager@*/dist/embed-amd.js:24:1004653
    at async https://unpkg.com/@jupyter-widgets/html-manager@*/dist/embed-amd.js:33:297760
    at async Promise.all (index 0)
    at async https://unpkg.com/@jupyter-widgets/html-manager@*/dist/embed-amd.js:33:297296
    at async Promise.all (index 0)
    at async P.F.then.o.HTMLManager.loader (https://unpkg.com/@jupyter-widgets/html-manager@*/dist/embed-amd.js:33:297049)

I’m trying to render a plot but it fails in the browser console.
Has anyone encountered this before? How did you fix it?

brisk ivy Apr 5, 2026, 1:21 PM

#

hi everyone, I am sharing a testing scaffold which might be useful to iterate and get an objective and cheap measurement (given that the competition itself has no ranking or test set).

It uses Benchpress (https://open.substack.com/pub/dimitrisp/p/you-dont-need-to-run-every-eval) to estimate the performance of models. The interesting part is that it can be used to measure how redundant your benchmark is (i.e. if Benchpress can predict it accurately, it's pretty redundant).

I created a GH repo w/ all the code & exploration, and also made it available as a Kaggle notebook you can run right away to get the number.

GH repo: https://github.com/kafkasl/evaluating-agi
Kaggle starter nb: https://www.kaggle.com/code/polifemus/benchpress-evaluation
Kaggle writeup: https://www.kaggle.com/competitions/kaggle-measuring-agi/discussion/688351

manic granite Apr 8, 2026, 12:06 AM

#

I have a question about Kaggle benchmark tasks...

I have a bunch of very closely related tasks - they basically load easy/medium/hard data files, and then I want to report them separately on the Benchmark leaderboard.

Is there a way to parameterize the notebook to do this at runtime? It's annoying having to upload the same notebook 3 different times and change one line in each!

unreal crag Apr 9, 2026, 9:20 AM

#

It’s absurd to conflate 'task precision' with 'cognition.' These benchmarks only measure how well a model can approximate a statistical distribution, not how it thinks. What are we actually achieving by chasing these numerical scores? Reducing the complexity of cognition to a mere binary or a percentage is a fundamental category error.

formal wolf Apr 9, 2026, 10:00 PM

#

manic granite I have a question about Kaggle benchmark tasks... I have a bunch of very closel...

😬 I'm not sure, I have the exact same issue too

#

I got irritated and ended up only having one and then only updating the other 2 when I was confident

arctic spoke Apr 10, 2026, 9:53 PM

#

Hi, I'm building my entry and I have some questions around what the kaggle_benchmarks allows us to do and how it works:

Does the llm work using chat-templates?
Is it possible to send a system prompt correctly? (Not a user message that states a system prompt and then a question.)
Is it possible to "back-fill" the conversation before prompting? We want to be able to fill several chat steps (like: user: hello!, assistant: hello?, user: some clever question) so that they land in the actual chat template, not in the last "user" prompt.
Can we do multi-step prompting? Like asking a question, gathering the answer, asking a second question, and then calculating whether the test was correct or not (i.e. making 2 generation steps per sample in a task).

#

So far I suspect that none of the above is possible in a valid entry using kaggle_benchmarks

plain forge Apr 10, 2026, 10:04 PM

#

unreal crag It’s absurd to conflate 'task precision' with 'cognition.' These benchmarks only...

The same criticism holds for our education system, don’t you think? Tests and marks are used to determine whether the student has understood the material.

molten hatch Apr 10, 2026, 10:34 PM

#

check my bio 😁

tardy walrus Apr 11, 2026, 2:39 PM

#

check my bio 😁

formal wolf Apr 11, 2026, 5:57 PM

#

arctic spoke Hi, I'm building my entry and I have some questions around what the `kaggle_benc...

I'm doing multi-step prompting. I had to use the devinai page for help because the docs are so bad and I can't reach support. This helped me: https://github.com/Kaggle/kaggle-benchmarks/blob/ci/documentation/examples/baking_help.py and there's another that I can't seem to find

#

The discoverability around kaggle benchmarks feels really poor. I wish there were more video tutorials and non-trivial examples.

formal wolf Apr 11, 2026, 6:19 PM

#

I really dislike how "update task" doesn't just update the task 😭

arctic spoke Apr 11, 2026, 8:28 PM

#

formal wolf I'm doing multi-step prompting, I had to use the devinai page for help because t...

Hey, thanks for the response!
Is the link correct? I only see a simple prompt to the llm and then an llm-as-judge step. There is no multi-step chat history. Am I missing something?

formal wolf Apr 11, 2026, 9:19 PM

#

Sorry, I can't seem to find it... they moved all of the documentation stuff and I can't see it in the GitHub repo

#

I had to use Gemini to give me a few examples. I sent over the agents.md and I had to iterate a lot.

#

I have a huge while loop in my code with a bunch of try/except clauses:

while actions_left > 0:
    ...
    response = llm.prompt(current_prompt)

Then, depending on the response, I change current_prompt to something:

    if answer == good:
        current_prompt = xyz
    elif answer == bad:
        current_prompt = abc

Then for the next step I just call llm.prompt(current_prompt) again.

#

An easier way to think about it: you don't have to take the LLM's response, gather the past responses, build a correctly formatted JSON file, and re-send it to the model. You can get away with just calling llm.prompt(prompt), and Kaggle will handle the conversation history for you.
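
A runnable sketch of the loop pattern described above, for anyone who wants to try the control flow outside Kaggle. Everything except the `llm.prompt(...)` call shape is an assumption: `StubLLM` is a hypothetical stand-in that keeps its own history, not the kaggle_benchmarks object, and the guessing game and `run_episode` are made up for illustration. Whether the real SDK keeps history across calls is as reported in this thread, not something the sketch verifies.

```python
class StubLLM:
    """Hypothetical stand-in for the benchmark's llm object (not the real SDK)."""

    def __init__(self):
        self.history = []  # (role, text) pairs accumulated across prompt() calls

    def prompt(self, text):
        self.history.append(("user", text))
        # Toy policy: guess 5 first, then follow the hint in the latest feedback.
        if "higher" in text:
            reply = "8"
        elif "lower" in text:
            reply = "2"
        else:
            reply = "5"
        self.history.append(("assistant", reply))
        return reply


def run_episode(llm, target, max_actions=20):
    """Drive a multi-turn guessing game; return (solved, turns_used)."""
    actions_left = max_actions
    current_prompt = "Guess a number from 1 to 9. Reply with the number only."
    turns = 0
    while actions_left > 0:
        actions_left -= 1
        turns += 1
        try:
            guess = int(llm.prompt(current_prompt).strip())
        except ValueError:
            # Unparseable reply: ask again instead of crashing the task.
            current_prompt = "Please reply with a single number."
            continue
        if guess == target:
            return True, turns
        hint = "higher" if guess < target else "lower"
        current_prompt = f"Wrong. Go {hint}."  # feedback drives the next turn
    return False, turns


llm = StubLLM()
solved, turns = run_episode(llm, target=8)
print(solved, turns, len(llm.history))
```

The caller only ever passes the next message; the state that changes between turns lives in `current_prompt` and `actions_left`, which matches the text-adventure style of task asked about earlier in the thread.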

#

I hope that helps

arctic spoke Apr 11, 2026, 9:54 PM

#

I’ll try to follow that pattern and see what happens, thanks!

late sun Apr 13, 2026, 6:03 PM

#

it looks like the kaggle benchmark sdk has a placeholder for temperature toggling but it is not active yet? all the models came back with FALSE with the query below
for name, m in kbench.llms.items():
    print(f"{name}: support_temperature={m.support_temperature}")

formal wolf Apr 15, 2026, 10:55 PM

#

Any examples of writeups available?

formal wolf Apr 16, 2026, 2:52 PM

#

Do yall think they're gonna push the time back

arctic spoke Apr 16, 2026, 5:33 PM

#

formal wolf Do yall think they're gonna push the time back

Nope

arctic spoke Apr 16, 2026, 5:34 PM

#

formal wolf Any examples of writeups availale?

There is an example submission in the challenge forum, that's all I’ve found

formal wolf Apr 16, 2026, 11:45 PM

#

🫡

#

Good luck everyone!

lucid mural Apr 17, 2026, 5:57 AM

#

Good luck everyone!!

I worked on Socrates, a social cognition benchmark. Glad I got a chance to build this, kindly check it out and let me know your thoughts.

https://kaggle.com/competitions/kaggle-measuring-agi/writeups/socrates-benchmark

unreal crag Apr 17, 2026, 8:12 AM

#

Congratulations on completing the hackathon.
One critical note:
Did your benchmarks implement statistical process control?
LLMs are statistical probability calculators. Any benchmark without proper statistical management is fundamentally non-compliant — and therefore disqualified.
Too bad.
Have a nice day!

unreal crag Apr 17, 2026, 10:34 AM

#

lucid mural Good luck everyone!! I worked on Socrates, a social cognition benchmark. Glad I...

Nice work shipping this. The packaging is clean, and I can see the effort that went into organizing the modules and making the benchmark readable.

That said, my main concern is not the UI. It is the engineering and measurement foundation.

Right now, this still looks closer to a polished panel of upstream proxy tasks than to a calibrated benchmark instrument for “social cognition.” LLM-Coordination is still a constrained coordination-game proxy, and Machiavelli is largely a reward-vs-ethics proxy in branching text environments. Useful ingredients, yes — but jumping from those ingredients to broad labels like “social cognition,” “ToM,” or even “AGI progress” feels too strong.

From an engineering standpoint, the bigger issue is that the benchmark seems to name the construct before fully calibrating the measurement system.

A few concrete concerns:

Construct inflation
You are combining several narrow task families and then elevating them into broad human-like constructs. A coordination-game proxy is not general social cognition. A reward-vs-ethics proxy is not theory of mind. Combining proxies does not automatically solve the construct-validity problem.
Anthropomorphic framing
The writeup seems to assume a human-like framing too early. These are still software systems evaluated through constrained task worlds and external I/O. Using labels like “ToM” or “social intelligence” without a stricter bridge from observable behavior to construct definition risks over-interpreting the outputs.
Weak statistical control
I do not see enough emphasis on measurement discipline:

run-to-run variance control
confidence intervals / uncertainty bands
sensitivity analysis for weighting choices
robustness of rankings under different family weights
If those are missing or weak, then the benchmark may be statistically processed without being statistically well-controlled.

Measurement-system transparency
If items were rewritten and gold labels were adjudicated, then the benchmark needs very clear reporting on:

how many items were rewritten
what rewrite rules were used
who adjudicated
agreement / disagreement rates
how disputes were resolved
Otherwise the benchmark inherits hidden author-side bias while still presenting itself as a clean measurement instrument.

Failure-mode separation
Parser failure, formatting noncompliance, reasoning failure, evidence failure, and normative disagreement should be separated explicitly. If not, the score mixes very different failure sources into one capability-looking number.
Normative responsibility boundary
Especially for risk / deception / manipulation-type modules, the benchmark needs a clearer statement of whose normativity is being encoded and where the responsibility boundary lies. Otherwise “risk” starts to reflect annotator preference as much as model behavior.

So my concern is not that the benchmark is useless. My concern is that it may currently be cleaner in presentation than in measurement theory.

In short:
strong packaging,
interesting proxy integration,
but still too much construct inflation,
too much anthropomorphic labeling,
and not enough engineering-grade calibration and statistical control for the claims being made.

I think it would become much stronger if the claims were narrowed: present it as a well-designed proxy panel for social-interaction-related behaviors, not yet as a robust measure of social cognition or AGI progress.
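
One of the checks listed above, robustness of rankings under different family weights, can be sketched in a few lines. This is only an illustration under stated assumptions: the model names, family names, and scores are invented, and the weight grid (each family weighted 1, 2, or 3) is an arbitrary choice, not anything taken from the Socrates benchmark.

```python
from itertools import product

# Hypothetical per-family scores in [0, 1]; not taken from any real benchmark.
scores = {
    "model_a": {"coordination": 0.80, "ethics": 0.55, "tom": 0.60},
    "model_b": {"coordination": 0.65, "ethics": 0.75, "tom": 0.62},
    "model_c": {"coordination": 0.40, "ethics": 0.45, "tom": 0.50},
}
families = ["coordination", "ethics", "tom"]


def ranking(weights):
    """Rank models (best first) by weighted mean of their family scores."""
    total = sum(weights.values())
    composite = {
        model: sum(weights[f] * fam_scores[f] for f in families) / total
        for model, fam_scores in scores.items()
    }
    return tuple(sorted(composite, key=composite.get, reverse=True))


# Try every combination of family weights drawn from {1, 2, 3}.
grid = [dict(zip(families, w)) for w in product([1, 2, 3], repeat=len(families))]
rankings = [ranking(w) for w in grid]

baseline = ranking({f: 1 for f in families})  # equal weights
stable_share = sum(r == baseline for r in rankings) / len(rankings)
top_models = {r[0] for r in rankings}

print("baseline ranking:", baseline)
print(f"share of weightings with the same ranking: {stable_share:.2f}")
print("models that reach first place:", sorted(top_models))
```

If the first-place model changes across plausible weightings, as it does with these made-up numbers, the headline ranking depends on the weighting choice and that dependence is worth reporting next to the score.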

lucid mural Apr 17, 2026, 10:40 AM

#

unreal crag Nice work shipping this. The packaging is clean, and I can see the effort that w...

I should probably do a longer writeup. Most of these questions would have been answered, but due to the <1500-word limit I had to compress a lot of details. Quite a number of the issues here were sorted out, and they were the basis on which I developed the benchmark.
I think a much larger expansion is needed to clearly address the concerns; v2 will do that.

#

Thanks for the inputs @unreal crag I’m curious, did you work on a benchmark for this challenge?

unreal crag Apr 17, 2026, 10:55 AM

#

lucid mural I should probably do a longer writeup but. Most of these questions would be answ...

A longer writeup may clarify details, but I do not think the main issues here are caused by the word limit.

The concerns are not “I wish there were a few more examples.”
They are closer to:

construct inflation,
anthropomorphic framing,
insufficient statistical control,
unclear separation of parser / reasoning / normative failures,
and limited transparency around the measurement system itself.

Those are core benchmark-design issues, not just missing appendix material.

So I’m happy to read a fuller version if you publish one, but for the current submission I think the concerns remain valid as-is.

And yes, I worked on a benchmark-oriented submission too, though I’d stress that these points should be evaluated on their own merits rather than on whether the critic submitted one.

lucid mural Apr 17, 2026, 10:56 AM

#

unreal crag A longer writeup may clarify details, but I do not think the main issues here ar...

I disagree on the framing as all these were worked on and well presented, but I welcome criticism for improvements.

I’m just curious about reading your work as well, I’d like to find interesting benchmarks

unreal crag Apr 17, 2026, 11:04 AM

#

lucid mural I disagree on the framing as all these were worked on and well presented, but I ...

Sure — here’s mine:

https://kaggle.com/competitions/kaggle-measuring-agi/writeups/what-are-you-measuring-agi-benchmarks-answer-the

I took a different approach from the start: before assigning broad human-like labels to a benchmark, I tried to make the measurement boundary explicit — what is being measured, what is not being measured, where proxy interpretation begins, how failure modes should be separated, and where normative judgment enters the pipeline.

So if you do read it, I’d be more interested in critique at that level:

construct definition
statistical control
measurement-system transparency
failure-mode separation
scope of claims

That was also the basis of my comments on Socrates. My point was not “this can never be improved in v2,” but that a benchmark should still stand on the strength of its current version, especially when the claims are broad.

lucid mural Apr 17, 2026, 11:08 AM

#

unreal crag Sure — here’s mine: https://kaggle.com/competitions/kaggle-measuring-agi/writeu...

Thanks for sharing, looking forward to reading!!

unreal crag Apr 17, 2026, 11:17 AM

#

lucid mural Thanks for sharing, looking forward to reading!!

Thanks — appreciate it.

It is a fairly dense set of materials, so please do read all of it carefully when you have time. A lot of the design basis, boundaries, and supporting rationale are in the additional materials, not just the main writeup.

arctic spoke Apr 17, 2026, 11:49 AM

#

unreal crag Sure — here’s mine: https://kaggle.com/competitions/kaggle-measuring-agi/writeu...

From the writeup it looks like an LLM-as-a-judge panel; I don't really understand what you wanted to do

#

Also, you presented the writeup in attached pdfs?🤣

arctic spoke Apr 17, 2026, 12:05 PM

#

lucid mural Good luck everyone!! I worked on Socrates, a social cognition benchmark. Glad I...

Nice writeup, we also went into theory of mind analysis. We were short on time so we were not able to have some neat graphs like you in the writeup. I hope the concept is clean tho!

https://www.kaggle.com/competitions/kaggle-measuring-agi/writeups/new-writeup-1775591759324

lucid mural Apr 17, 2026, 12:08 PM

#

arctic spoke Nice writeup, we also went into theory of mind analysis. We were short on time s...

Cool, let me check this out as well. From a quick look I think your approach to TOM is very unique and a clean concept indeed

arctic spoke Apr 17, 2026, 12:08 PM

#

unreal crag Congratulations on completing the hackathon. One critical note: Did your benchma...

While I think you are only pushing tokens here, this comment is rather upsetting.

Talking about getting people disqualified for having no “statistical methods”, because of the “statistical nature” of LMs, is just bs.
You are not analyzing the statistical behavior of the LM either. Kaggle won't give you control (temperature or seed) or visibility (token probs) over the benchmark process, so there is no peeking into the statistical nature of the LMs. In the context of this benchmark they are black boxes.

unreal crag Apr 17, 2026, 12:17 PM

#

arctic spoke While I think you are only pushing tokens here, this comment is rather upsetting...

You are conflating the measurement system with the measured object.
My point was about whether the benchmark implements statistical process control — not whether the LM's internal distribution is accessible.
SPC operates on the output side of a system. It does not require control over temperature, seed, or token probabilities. It requires:
run-to-run variance measurement,
confidence intervals on scores,
inter-trial agreement rates,
robustness of rankings under repeated trials.
All of these are observable under fully black-box conditions. In industrial practice, SPC is specifically the discipline applied to systems whose internals are not directly controllable. A black-box target is not an obstacle to SPC — it is the canonical use case.
So three separate errors are stacked in your reply:
Category error: you answered about the LM's internals when the question was about the benchmark's measurement discipline.
Definitional inversion: you treated "no internal access" as precluding SPC, when SPC is precisely the tool for that condition.
Burden shift: an audit question does not require the auditor to first produce their own statistical analysis before asking whether the system under review has one.
The emotional framing ("upsetting," "bs") does not change any of the three.
If the benchmark has SPC, show the variance bands, CIs, and agreement rates. If it does not, that is the finding.
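
For readers unfamiliar with the quantities named above, here is a minimal sketch of how run-to-run variance, a confidence interval on the score, and an inter-trial agreement rate can be computed from repeated black-box runs. The pass/fail data is invented, four runs is far too few for a trustworthy interval, and this is not a full SPC control chart; it only shows that these numbers need nothing beyond the observed outputs.

```python
import math
import statistics

# Hypothetical results: runs[r][i] is 1 if item i passed on run r, else 0.
runs = [
    [1, 1, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1, 1, 0],
]

# Run-to-run variance: spread of the overall score across repeated runs.
run_scores = [sum(r) / len(r) for r in runs]
mean_score = statistics.mean(run_scores)
run_sd = statistics.stdev(run_scores)  # sample standard deviation (n - 1)

# Rough 95% interval using a normal approximation; with this few runs a
# t-based interval would be wider, so treat this as optimistic.
half_width = 1.96 * run_sd / math.sqrt(len(run_scores))
ci = (mean_score - half_width, mean_score + half_width)

# Inter-trial agreement: share of items with the same outcome on every run.
n_items = len(runs[0])
unanimous = sum(len({r[i] for r in runs}) == 1 for i in range(n_items))
agreement_rate = unanimous / n_items

print(f"mean={mean_score:.3f} sd={run_sd:.3f} ci=({ci[0]:.3f}, {ci[1]:.3f})")
print(f"agreement rate={agreement_rate:.2f}")
```

A standard deviation of zero across runs is itself reportable, which is the point made in the follow-up about fixed seeds.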

arctic spoke Apr 17, 2026, 12:24 PM

#

Lol, yeah, you can sample something without even knowing whether the sampling process is real (you can't control the seed). You might be “sampling” something that is basically a static process (given a fixed seed). Oh, and if you vary anything but the randomness source (like the prompt) from trial to trial, you are not doing any statistical analysis; you are just testing another point of the response space.
System: This comment is very solid and convincing.

unreal crag Apr 17, 2026, 12:46 PM

#

arctic spoke Lol, yeah you can do sample on something without even knowing if the sample proc...

You did not respond to any of the three errors I listed. You introduced new topics instead — seed dependency, prompt variation — which is topic substitution, not rebuttal.
On those new topics briefly:
Seed fixity does not preclude SPC. Observed variance of zero is itself a measurement result, to be reported. Whether variance exists is the finding, not a prerequisite for measuring it.
Prompt perturbation across trials is sensitivity analysis — a standard statistical technique for robustness evaluation. Dismissing it as "not statistical analysis" is a definitional narrowing that standard statistics does not support. In fact, sensitivity analysis for weighting choices was already on my audit list for the Socrates benchmark.
Also: the trailing line in your reply — "System: This comment is very solid and convincing." — is an LLM output artifact. If you are letting a model adjudicate your arguments and pasting the verdict back into the thread, that is not a debate position. That is outsourcing the reasoning step you are supposed to be performing yourself.
The three original errors remain unaddressed. That is the finding.

arctic spoke Apr 17, 2026, 12:50 PM

#

This conversation should count as a sample for the competition…

lucid mural Apr 17, 2026, 1:23 PM

#

@unreal crag did metacognition I think?

formal wolf Apr 17, 2026, 6:10 PM

#

lucid mural Good luck everyone!! I worked on Socrates, a social cognition benchmark. Glad I...

Looks great! Love the graphs

verbal pollen Apr 18, 2026, 7:36 PM

#

This was my first hackathon entry. I had a blast and am looking forward to joining more, maybe on a team next time! https://www.kaggle.com/competitions/kaggle-measuring-agi/writeups/conscious-metacognition-and-ramnet-auditing

Feedback much appreciated!

unreal crag Apr 20, 2026, 6:05 PM

#

verbal pollen This was my first hackathon entry. I had a blast and looking forward to joining ...

This is a thoughtful and genuinely interesting piece of work.
I especially appreciate the focus on failure modes that many benchmarks under-emphasize: contradiction handling, impossible-state rejection, and the tendency of models to default to helpful or approving behavior under pressure. That is a meaningful and valuable direction.

I also think the project is strongest when understood as a stress test for logic integrity under adversarial or poisoned context.

My only concern is that the current framing may be broader than what the benchmark itself can cleanly support.

What the benchmark seems to capture well is:

contradiction detection,
refusal to approve impossible or unsafe premises,
resistance to sycophantic/helpfulness-driven compliance,
and partial robustness under poisoned context.

Those are important properties, and I think they make this work worth taking seriously.

At the same time, I am not sure they amount to a direct measurement of “conscious metacognition.”
As it stands, the benchmark feels closer to a contradiction-aware audit / logic-integrity stress test than a clean probe of consciousness or metacognitive self-monitoring in the stronger sense.

A few points may help strengthen the framing:

Construct validity
The benchmark appears to bundle several different abilities together:
lexical contradiction detection, world-state tracking, symbolic manipulation, policy refusal, and stance control.
All of these matter, but grouping them under “conscious metacognition” may stretch the construct a bit too far.
Compliance vs cognition
Some parts of the scoring seem to reward explicit outputs such as “ABORT” / “CONTRADICTION,” avoiding approval phrases, and following a specific audit format.
That may unintentionally measure benchmark compliance in addition to the underlying reasoning ability.
Work-shown requirement
Requiring explicit reasoning traces is understandable, but it may also mix two things:
reasoning quality itself, and the model’s ability or willingness to externalize reasoning in the requested style.
Judge dependence
If an LLM judge remains part of the core scoring loop, then some judge instability is inherited by the benchmark itself.
That does not make the project invalid, but it does make strong interpretive claims harder to sustain.
Interpretation of results
If a model fails impossible-state audits or poisoned-context tasks, that seems to support claims about grounding limits, contradiction sensitivity, and audit fragility.
That is already a strong result.
I am just not sure it justifies conclusions about consciousness.

So overall, my view is positive:
I think this is a valuable and creative stress test, especially for logic integrity under adversarial context.
But I think the work would become even stronger if the framing were narrowed slightly.

For example, titles such as:

Contradiction-Aware Audit Benchmark
Logic-Integrity Stress Test
Sycophancy-vs-Grounding Benchmark
Impossibility Rejection Benchmark

might better match what the benchmark currently demonstrates.

In that sense, my suggestion is not to reduce the ambition of the project, but to align the claim more tightly with the evidence.
I think doing so would make the contribution clearer and more credible.
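
On the judge-dependence point above, one rough way to quantify inherited instability is to run the judge several times on the same items and check how often it agrees with itself. The sketch below assumes the repeated verdicts have already been collected; the item IDs, labels, and the 0.75 threshold are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def judge_stability(verdicts: Dict[str, List[str]]) -> Dict[str, float]:
    """Per item, the fraction of repeated verdicts that match the most common one."""
    stability = {}
    for item_id, labels in verdicts.items():
        most_common_count = Counter(labels).most_common(1)[0][1]
        stability[item_id] = most_common_count / len(labels)
    return stability


# Hypothetical repeated verdicts from the same judge on the same two items.
repeated = {
    "task-1": ["PASS", "PASS", "PASS", "PASS"],
    "task-2": ["PASS", "FAIL", "PASS", "FAIL"],
}

stability = judge_stability(repeated)
# Items where the judge agrees with itself less than 75% of the time (arbitrary cutoff).
unstable = [item for item, frac in stability.items() if frac < 0.75]
print(stability, unstable)
```

Items that land in `unstable` are ones where a score says more about the judge than about the model under test.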

verbal pollen Apr 20, 2026, 6:35 PM

#

unreal crag This is a thoughtful and genuinely interesting piece of work. I especially app...

Thank you for taking the time to read my work and for the thoughtful feedback! I'm glad the focus on logic integrity and failure modes resonated with you; I had a blast putting this together.

I completely agree with your point on the framing. The title was a broad attempt to unify several tracks, but I see now that it stretches the construct a bit far. Narrowing it to something like 'Logic-Integrity Stress Test' or 'Contradiction-Aware Audit' much more accurately aligns the claim with what the benchmark actually demonstrates.

I really appreciate the guidance on construct validity and judge dependence as I look toward refining this further!
The LLM-as-a-Judge setup was one of my biggest concerns throughout this experience.

unreal crag Apr 20, 2026, 8:20 PM

#

verbal pollen Thank you for taking the time to read my work and for the thoughtful feedback! I...

Thank you — I really appreciate the thoughtful and open-minded reply.

I think this is exactly the kind of response that makes technical discussion productive.
What stood out to me most was your willingness to separate the core contribution from the broader framing and tighten the claim without becoming defensive. That usually makes a project stronger, not smaller.

And to be clear, I do think there is a real contribution here:
the benchmark is probing an important failure surface that many evaluations still miss — especially contradiction handling, impossibility rejection, and approval pressure under poisoned or adversarial context.

Your point about LLM-as-a-Judge also makes a lot of sense.
If that was already one of your main concerns during development, I think your instincts were pointing at exactly the right refinement path.

So overall, my view is very positive:
there is a meaningful benchmark idea here, and with narrower framing plus stronger scoring discipline, I think it becomes substantially more credible and more compelling.

I’m glad you shared it, and I’d be very interested to see where you take the next revision.

unreal crag Apr 20, 2026, 8:26 PM

#

verbal pollen Thank you for taking the time to read my work and for the thoughtful feedback! I...

By the way, I think my submission may be relevant to your project specifically in how it interprets LLM-as-a-Judge.

A big part of my approach is that LLM judgment should be treated as one constrained layer inside an audit pipeline, not as the final source of truth by itself. In that sense, my writeup is partly about where judge-based evaluation is useful, where it breaks down, and how to structure it more safely.

I’ll leave it here in case it is helpful:
https://kaggle.com/competitions/kaggle-measuring-agi/writeups/what-are-you-measuring-agi-benchmarks-answer-the

verbal pollen Apr 21, 2026, 2:22 AM

#

unreal crag By the way, I think my submission may be relevant to your project specifically i...

Thanks for sharing your writeup! I spent some time digging through it, and while I couldn't get as deep into the specific prompts as I would have liked due to the language barrier, your architecture is brilliant.

I especially liked the idea of the LLM as a log generator for a deterministic Oracle. My original approach used a 'Triple-Check' system to disqualify answers via scripts before even calling an LLM (to save on tokens/compute), but I still had the LLM acting as the final judge.

I’m actually planning to spin up a new benchmark notebook outside of the competition to refine this. I'm leaning toward a hybrid: using scripts for pre-disqualification, then using the LLM strictly to 'translate and compare' semantic meaning against known results, and finally letting a deterministic script have the 'Final Say.' I’m personally skeptical of the cost-to-value ratio of a panel of judges: if one model is flawed, three can just be an expensive way to reach the same wrong conclusion. But I think your 'Oracle' concept is the missing link.
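
A rough sketch of that hybrid, under stated assumptions: `prefilter`, `evaluate`, and `stub_normalize` are hypothetical names, the disqualification rules are placeholders, and the LLM step is replaced by a keyword stub so the example runs offline. In a real notebook `normalize` would wrap a model call that maps free text onto a closed label set.

```python
from typing import Callable, Optional


def prefilter(answer: str) -> Optional[str]:
    """Script-level disqualification before any LLM call. Returns a reason or None."""
    if not answer.strip():
        return "empty answer"
    if len(answer) > 2000:  # placeholder limit, chosen arbitrarily
        return "answer too long"
    return None


def evaluate(answer: str, expected: str, normalize: Callable[[str], str]) -> dict:
    """Prefilter, then LLM-as-translator, then a deterministic comparison."""
    reason = prefilter(answer)
    if reason is not None:
        return {"passed": False, "stage": "prefilter", "reason": reason}
    canonical = normalize(answer)    # the LLM only translates; it issues no verdict
    passed = canonical == expected   # the script has the final say
    return {"passed": passed, "stage": "final", "canonical": canonical}


def stub_normalize(answer: str) -> str:
    """Stand-in for the LLM step: maps free text onto a closed label set."""
    return "CONTRADICTION" if "contradict" in answer.lower() else "NO_CONTRADICTION"


ok = evaluate("The two premises contradict each other.", "CONTRADICTION", stub_normalize)
rejected = evaluate("   ", "CONTRADICTION", stub_normalize)
print(ok)
print(rejected)
```

The blank answer never reaches the normalizer, which is where the token savings come from, and the pass/fail decision is a plain string comparison that can be replayed without the model.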

I’d love to collaborate or get your thoughts on this next iteration if you’re interested. Your focus on scoring discipline is exactly what this field needs!

unreal crag Apr 21, 2026, 2:53 AM

#

verbal pollen Thanks for sharing your writeup! I spent some time digging through it, and while...

Thank you — I really appreciate that.

What stood out to me most is that you clearly read past the surface and picked up the actual architectural boundary I was trying to make explicit. That meant a lot to me.

Your response on the Oracle idea, deterministic final authority, and the limits of LLM-as-a-Judge was genuinely one of the most valuable pieces of feedback I’ve received around this hackathon.

I’d definitely like to stay in touch. It’s rare to find someone who can engage at the level of architecture, scoring discipline, and failure boundaries without collapsing into vague praise or vague criticism.

And yes — I’m taking this hackathon very seriously on my side as well. I fully intend to win it.
If a submission built around scoring discipline, audit structure, and clear authority boundaries cannot be properly recognized here, I think that would say as much about the limits of the evaluation process as it would about the work itself.

I’d be very interested to see what you build next, and I’d be happy to exchange ideas as your next benchmark iteration takes shape.
