#kaggle-measuring-agi


tranquil seal
#

Can somebody share how they approached this? I'm interested to learn what techniques you used here. If you have done something related to this, please let me know. Any information would be appreciated, even if the competition is over.

sacred linden
past surge
#

Hello! I have a question about the data for benchmarks. Should it be manually created, or are we allowed to use some existing data?

gloomy flax
past surge
# gloomy flax there no dataset need just do task

I am afraid you are wrong. On the competition page, the evaluation criteria table mentions the dataset requirements at least twice. Tasks and data come together as part of a benchmark, if you read the kaggle-benchmark documentation page.

past surge
gloomy flax
craggy bough
# sacred linden Obvious step 1. Read + Understand ideas behind the article https://storage.googl...

@sacred linden Thanks for providing the article. I threw notebookLM at this to generate some podcasts (I'm an auditory learner) and it mentioned that article.
@tranquil seal Great way to start the discord and thanks for your question.

Then, if you’re not using agents in your workflows, whether for school, work, your personal business, or just managing your life: start!

And don’t just try one. Try a bunch: Claude Code, Codex, Gemini CLI 😉, Claude Cowork, Copilot, Cline/Roo… maybe even design your own (with caution, take it as a learning exercise instead of a grandiose goal beating frontier labs in AGI).

Use them at different levels:
• simple autocomplete / direct code edits
• that “in-between” turbo coding mode (not quite autocomplete, not quite full vibe coding, haven't heard a term for that yet so let's coin that)
• full vibe coding
• long-horizon agentic coding and autonomous agents

As I once heard: you need to get a feel for the AI. Leverage your human intuition to pick up on where the limitations are, those jagged edges.

One thing I do: if a model or tool fails miserably… I run it again. If it still fails after a few tries/iterations/etc., wait. Then, a few months later, try again. Or better yet, keep lab notes (I use a markdown file with Obsidian) with all the cases where you've seen it fail and use them for this competition.

craggy bough
# craggy bough @sacred linden Thanks for providing the article. I threw notebookLM at th...

The above will help you iterate and make progress on the interpolation side of AI. But for the extrapolation side, the kind of intelligence that lives outside the dataset (AGI, SGI, genuinely novel or hard problems), take a step back and look at how YOU learn.

How do you pick up new concepts, reason through things, update your beliefs when new evidence comes in, find the info you need, plan your approach, and recognize when you need help (like what made you come here, for example 😜 )?

When you’re learning something, practice mindfulness and observe your own conscious experience. What’s the dialogue in your head actually saying? What does it feel like when a math concept finally clicks? How did you go about finding the data you needed to solve a problem? Then use that to workshop these benchmarks.

Hope this helps and hope this kicks off more discussion!

buoyant fjord
#

Hey everyone 👋

I need some advice and I'm just going to be honest about why I'm asking.

I'm AuDHD — autistic and ADHD — and this was my first hackathon ever. I submitted to the Google DeepMind AGI Hackathon solo and I genuinely don't know if I did it right 😅

Some questions I can't figure out on my own:

Is spending just a few hours on a final submission normal or did I just generate chaos? 💀

Is there a standard format judges expect or is it really just whatever you built?

I submitted a benchmark framework — not a model, not an app. Is that even a valid submission type or did I misread the whole thing?

How do you know when something is done enough to submit? I hit a point where it felt right and just went for it. Is that how this works?

I work best alone so I had no one to gut check any of this with before submitting 😭

I'm new to coding — all written with AI assistance. The ideas are mine but I genuinely don't know what I don't know technically and that part scares me a little.

Any advice from people who have done this before would mean a lot. Even just telling me I generated chaos so I can prepare emotionally 💀

Results June 1st. Already bracing. 🐟

— GiGi

blazing coral
gloomy flax
#

What are the limits of a task in a benchmark?

pure moon
#

Adhd + autistic

neon hatch
#

Live Q&A: Measuring Progress Toward AGI!

Two weeks into the Cognitive Abilities hackathon, the benchmarks being built by the Kaggle community are already incredible! To talk more about it, Nicholas Kang (Kaggle) and the authors of the paper the hackathon is based on, Dr Ryan Burnell and Oran Kelly (Google DeepMind), are going LIVE to chat with you all.

What we’re covering:
20-Min Deep Dive into the paper and what we’re looking for in the hackathon
20-Min Live AMA: Your chance to ask the team anything about the hackathon or the paper

🔔 Set your reminders here: https://www.youtube.com/live/9YYiWs6gNV0

craggy bough
#

It seems getting to AGI will rely on harnessing, but it doesn't seem this competition considers this. Thoughts?

forest cypress
formal wolf
#

Anyone manage to get multi-turn working in the benchmark? I'm curious as to how y'all got it working.

formal wolf
#

Also, I'm confused about the contest rules. It's just one submission right? The whole way through?

formal wolf
#

Also, I have some sensitive questions related to my implementation, is there a Q&A channel, or someone to DM in particular? I want to get all of my concepts 100% done before the weekend so I can really get down and finish this up

formal wolf
#

Alright... No one's responding and I can't actually find any mods to DM. Just going to try and ask in multiple places (here, the Kaggle discussions tab, in some comments sections).

  1. What's the difference between executive function and learning?

The official example: https://www.kaggle.com/code/kodatirevanth/five-tracks-five-benchmark-designs-a-deep-dive/notebook#Track-4:-Executive-Functions

gives us a learning example for synthetic rule induction - but the executive function example also has synthetic rule induction.

I take it to mean that learning is: "AI has a goal, doesn't know how to do it, learns in real time" while executive function is "AI has a goal, knows how to do it, does it." My benchmark crosses into both categories. I need to DM someone to ask.

#
  2. Has anyone managed to get a multi-turn benchmark working with dynamic datasets? I am generating 5 variations of my dataset (slightly different problems) and throwing my model at them. Programming-wise this is quite complex, and I was wondering if this would be too technical (i.e., the Kaggle system can't support it).
#
  3. Where can I learn more about the multi-turn features? All of the provided links I see are trivial and don't explain how to use them, or how to design around them.

I need something that tracks state, and dynamically gives feedback over 20 turns. Like a text adventure game.
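
A minimal sketch of that kind of stateful loop, assuming only that the model object exposes a `prompt(text)` method returning a string. `ScriptedLLM`, `RoomGame`, and `run_episode` are hypothetical names made up for illustration and are not part of kaggle_benchmarks; the scripted model simply replays canned replies so the loop can be tested offline.

```python
from dataclasses import dataclass, field


class ScriptedLLM:
    """Hypothetical stand-in for a model; replays canned replies."""

    def __init__(self, replies):
        self._replies = list(replies)

    def prompt(self, text: str) -> str:
        # Return the next canned reply, or "wait" once the script runs out.
        return self._replies.pop(0) if self._replies else "wait"


@dataclass
class RoomGame:
    """Tiny text adventure: take the key, then open the door."""
    has_key: bool = False
    door_open: bool = False
    log: list = field(default_factory=list)

    def step(self, action: str) -> str:
        action = action.strip().lower()
        self.log.append(action)
        if action == "take key":
            self.has_key = True
            return "You pick up the key."
        if action == "open door":
            if self.has_key:
                self.door_open = True
                return "The door opens."
            return "The door is locked."
        return "Nothing happens."


def run_episode(llm, max_turns: int = 20) -> dict:
    game = RoomGame()
    feedback = "You are in a room with a key and a locked door."
    for turn in range(1, max_turns + 1):
        # Feedback from the previous turn becomes the next prompt.
        action = llm.prompt(f"{feedback}\nWhat do you do?")
        feedback = game.step(action)
        if game.door_open:
            return {"solved": True, "turns": turn, "log": game.log}
    return {"solved": False, "turns": max_turns, "log": game.log}


result = run_episode(ScriptedLLM(["open door", "take key", "open door"]))
print(result)
```

The game state lives entirely in the harness, so the score (solved or not, number of turns) never depends on parsing the model's own account of what happened.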

formal wolf
formal wolf
craggy bough
plain forge
#

Anyone experiencing unexpected out of quota errors? Even though I have plenty of quota remaining the task fails for some models with the error that I don’t have enough quota

#

In one case the error mentioned that the request would cost 11 usd! That would’ve been within the quota anyway but that’s super expensive!

craggy bough
#

I have only run a few prompts and my daily quota is at $13, something might be up with the benchmark SDK?

formal wolf
#

What models are yall using? For testing (just to see if my code works) I used gemma 1b

#

Also to develop the SDK I'm using Gemini, but in the Google AI Studio

plain forge
#

I think there’s one of the Gemma models that gets selected automatically as default. So that’s what I use for testing. I just add some models after that, a mixture of (what I think are) large and small models, new and old…

#

Is there a sort of overview-comparison table of these models? Like how large are they, whether they have vision api or not etc?

red linden
#

Hey everyone, is anyone facing this Bokeh/Jupyter issue?
While creating tasks

[Open Browser Console for more detailed log - Double click to close this message]
Failed to create view for 'BokehView' from module '@bokeh/jupyter_bokeh' with model 'BokehModel' from module '@bokeh/jupyter_bokeh'
TypeError: Cannot read properties of undefined (reading 'require')
    at St (https://cdn.jsdelivr.net/npm/@bokeh/jupyter_bokeh@%5E4.0.5/dist/index.js:2:379889)
    at new Tt (https://cdn.jsdelivr.net/npm/@bokeh/jupyter_bokeh@%5E4.0.5/dist/index.js:2:380410)
    at https://unpkg.com/@jupyter-widgets/html-manager@*/dist/embed-amd.js:24:1004653
    at async https://unpkg.com/@jupyter-widgets/html-manager@*/dist/embed-amd.js:33:297760
    at async Promise.all (index 0)
    at async https://unpkg.com/@jupyter-widgets/html-manager@*/dist/embed-amd.js:33:297296
    at async Promise.all (index 0)
    at async P.F.then.o.HTMLManager.loader (https://unpkg.com/@jupyter-widgets/html-manager@*/dist/embed-amd.js:33:297049) 

I’m trying to render a plot but it fails in the browser console.
Has anyone encountered this before? How did you fix it?

brisk ivy
#

hi everyone, I am sharing a testing scaffold which might be useful to iterate and get an objective and cheap measurement (given that the competition itself has no ranking or test set).

It uses Benchpress (https://open.substack.com/pub/dimitrisp/p/you-dont-need-to-run-every-eval) to estimate the performance of models. The interesting part is that it can be used to measure how redundant your benchmark is (i.e., if Benchpress can predict it accurately, it's pretty redundant).

I created a GH repo w/ all the code & exploration, and also made it available as a Kaggle notebook you can run right away to get the number.

GH repo: https://github.com/kafkasl/evaluating-agi
Kaggle starter nb: https://www.kaggle.com/code/polifemus/benchpress-evaluation
Kaggle writeup: https://www.kaggle.com/competitions/kaggle-measuring-agi/discussion/688351

manic granite
#

I have a question about Kaggle benchmark tasks...

I have a bunch of very closely related tasks - they basically load easy/medium/hard data files, and then I want to report them separately on the Benchmark leaderboard.

Is there a way to parameterize the notebook to do this at runtime? It's annoying having to upload the same notebook 3 different times and change one line in each!
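
One way to keep a single code path for all three files is sketched below. It does not answer whether the Kaggle leaderboard can pick up three separate results from one notebook; it only shows the parameterization. The file layout (`easy.json`, `medium.json`, `hard.json`), the `DIFFICULTY` environment variable, and the helper names are all assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path

LEVELS = ("easy", "medium", "hard")


def load_items(data_dir: Path, level: str) -> list:
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    return json.loads((data_dir / f"{level}.json").read_text())


def score(items: list, solver) -> float:
    correct = sum(solver(item["question"]) == item["answer"] for item in items)
    return correct / len(items)


def run_levels(data_dir: Path, solver, levels=None) -> dict:
    # With no explicit levels, a (hypothetical) DIFFICULTY env var picks one;
    # if that is unset too, every level is run.
    if levels is None:
        env = os.environ.get("DIFFICULTY")
        levels = (env,) if env else LEVELS
    return {lvl: score(load_items(data_dir, lvl), solver) for lvl in levels}


# Demo with throwaway data files and a toy solver.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    for lvl in LEVELS:
        items = [{"question": "2+2", "answer": "4"},
                 {"question": lvl, "answer": "easy"}]
        (data_dir / f"{lvl}.json").write_text(json.dumps(items))
    results = run_levels(
        data_dir, solver=lambda q: "4" if q == "2+2" else q, levels=LEVELS
    )

print(results)
```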

unreal crag
#

It’s absurd to conflate 'task precision' with 'cognition.' These benchmarks only measure how well a model can approximate a statistical distribution, not how it thinks. What are we actually achieving by chasing these numerical scores? Reducing the complexity of cognition to a mere binary or a percentage is a fundamental category error.

formal wolf
#

I got irritated and ended up only having one and then only updating the other 2 when I was confident

arctic spoke
#

Hi, I'm building my entry and I have some questions around what the kaggle_benchmarks allows us to do and how it works:

  1. Does the llm work using chat-templates?

  2. Is it possible to send a proper system prompt (not a user message that states a system prompt and then a question)?

  3. Is it possible to "back-fill" the conversation before prompting? We want to be able to fill in several chat steps (like: user: hello!, assistant: hello?, user: some clever question) in such a way that they land in the actual chat template, not in the last "user" prompt.

  4. Can we do multi-step prompting? Like asking a question, gathering the answer, asking a second question, and then calculating whether the test was correct or not (i.e., 2 generation steps per sample in a task).

#

So far I suspect that none of the above is possible when creating a valid entry using kaggle_benchmarks

plain forge
molten hatch
#

check my bio 😁

tardy walrus
#

check my bio 😁

formal wolf
#

The discoverability around kaggle benchmarks feels really poor. I wish there were more video tutorials and non-trivial examples.

formal wolf
#

I really dislike how "update task" doesn't just update the task 😭

arctic spoke
formal wolf
#

Sorry, I can't seem to find it... they moved all of the documentation stuff and I can't see it in the GitHub

#

I had to use Gemini to give me a few examples - I sent over the agents.md and I had to iterate a lot.

#

I have a huge while loop in my code with a bunch of try except clauses:

while actions_left > 0:
    ...
    response = llm.prompt(current_prompt)

then depending on the response, I change current_prompt to something
    if answer == good:
        current_prompt = xyz
    if answer == bad:
        current_prompt = abc

Then for the next step I just call llm.prompt(current_prompt) again

#

An easier way to think about it: you don't have to take the LLM's response, collect the past responses, create a correctly formatted JSON file, and then re-send it to the model. You can get away with just using llm.prompt(prompt), and Kaggle will handle conversation history for you

#

I hope that helps

arctic spoke
#

I’ll try to follow that pattern and see what happens, thanks!

late sun
#

it looks like the kaggle benchmark sdk has a placeholder for temperature toggling but it is not active yet? all the models came back with FALSE with the query below
for name, m in kbench.llms.items():
    print(f"{name}: support_temperature={m.support_temperature}")

formal wolf
#

Any examples of writeups available?

formal wolf
#

Do yall think they're gonna push the time back

arctic spoke
arctic spoke
formal wolf
#

🫡

#

Good luck everyone!

lucid mural
unreal crag
#

Congratulations on completing the hackathon.
One critical note:
Did your benchmarks implement statistical process control?
LLMs are statistical probability calculators. Any benchmark without proper statistical management is fundamentally non-compliant — and therefore disqualified.
Too bad.
Have a nice day!

unreal crag
# lucid mural Good luck everyone!! I worked on Socrates, a social cognition benchmark. Glad I...

Nice work shipping this. The packaging is clean, and I can see the effort that went into organizing the modules and making the benchmark readable.

That said, my main concern is not the UI. It is the engineering and measurement foundation.

Right now, this still looks closer to a polished panel of upstream proxy tasks than to a calibrated benchmark instrument for “social cognition.” LLM-Coordination is still a constrained coordination-game proxy, and Machiavelli is largely a reward-vs-ethics proxy in branching text environments. Useful ingredients, yes — but jumping from those ingredients to broad labels like “social cognition,” “ToM,” or even “AGI progress” feels too strong.

From an engineering standpoint, the bigger issue is that the benchmark seems to name the construct before fully calibrating the measurement system.

A few concrete concerns:

  1. Construct inflation
    You are combining several narrow task families and then elevating them into broad human-like constructs. A coordination-game proxy is not general social cognition. A reward-vs-ethics proxy is not theory of mind. Combining proxies does not automatically solve the construct-validity problem.

  2. Anthropomorphic framing
    The writeup seems to assume a human-like framing too early. These are still software systems evaluated through constrained task worlds and external I/O. Using labels like “ToM” or “social intelligence” without a stricter bridge from observable behavior to construct definition risks over-interpreting the outputs.

  3. Weak statistical control
    I do not see enough emphasis on measurement discipline:

  • run-to-run variance control
  • confidence intervals / uncertainty bands
  • sensitivity analysis for weighting choices
  • robustness of rankings under different family weights
    If those are missing or weak, then the benchmark may be statistically processed without being statistically well-controlled.

  4. Measurement-system transparency
    If items were rewritten and gold labels were adjudicated, then the benchmark needs very clear reporting on:

  • how many items were rewritten
  • what rewrite rules were used
  • who adjudicated
  • agreement / disagreement rates
  • how disputes were resolved
    Otherwise the benchmark inherits hidden author-side bias while still presenting itself as a clean measurement instrument.

  5. Failure-mode separation
    Parser failure, formatting noncompliance, reasoning failure, evidence failure, and normative disagreement should be separated explicitly. If not, the score mixes very different failure sources into one capability-looking number.

  6. Normative responsibility boundary
    Especially for risk / deception / manipulation-type modules, the benchmark needs a clearer statement of whose normativity is being encoded and where the responsibility boundary lies. Otherwise “risk” starts to reflect annotator preference as much as model behavior.

So my concern is not that the benchmark is useless. My concern is that it may currently be cleaner in presentation than in measurement theory.

In short:
strong packaging,
interesting proxy integration,
but still too much construct inflation,
too much anthropomorphic labeling,
and not enough engineering-grade calibration and statistical control for the claims being made.

I think it would become much stronger if the claims were narrowed: present it as a well-designed proxy panel for social-interaction-related behaviors, not yet as a robust measure of social cognition or AGI progress.
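
The weighting concern raised above (robustness of rankings under different family weights) can be checked with a small sensitivity sweep. This is only a sketch under assumptions: the family names, model names, and scores are invented, and `ranking_stability` is a hypothetical helper that reports how often randomly drawn weights reproduce the baseline ranking.

```python
import random


def rank_models(scores: dict, weights: dict) -> list:
    """Order models by weighted sum of per-family scores, best first."""
    totals = {
        model: sum(weights[fam] * s for fam, s in fams.items())
        for model, fams in scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


def ranking_stability(scores, base_weights, trials=1000, seed=0) -> float:
    """Fraction of random weightings that reproduce the baseline ranking."""
    rng = random.Random(seed)
    baseline = rank_models(scores, base_weights)
    same = 0
    for _ in range(trials):
        raw = {fam: rng.uniform(0.1, 1.0) for fam in base_weights}
        total = sum(raw.values())
        weights = {fam: w / total for fam, w in raw.items()}
        same += rank_models(scores, weights) == baseline
    return same / trials


base = {"coordination": 0.5, "ethics": 0.5}

# Invented scores: model_a dominates model_b in every family.
dominated = {
    "model_a": {"coordination": 0.9, "ethics": 0.8},
    "model_b": {"coordination": 0.5, "ethics": 0.4},
}
# Invented scores: each model wins one family, so weights decide the order.
contested = {
    "model_a": {"coordination": 0.9, "ethics": 0.3},
    "model_b": {"coordination": 0.5, "ethics": 0.6},
}

print("dominated:", ranking_stability(dominated, base))
print("contested:", ranking_stability(contested, base))
```

A stability well below 1 means the published ranking is partly a product of the chosen weights and should be reported alongside them.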

lucid mural
#

Thanks for the inputs @unreal crag I’m curious, did you work on a benchmark for this challenge?

unreal crag
# lucid mural I should probably do a longer writeup but. Most of these questions would be answ...

A longer writeup may clarify details, but I do not think the main issues here are caused by the word limit.

The concerns are not “I wish there were a few more examples.”
They are closer to:

  • construct inflation,
  • anthropomorphic framing,
  • insufficient statistical control,
  • unclear separation of parser / reasoning / normative failures,
  • and limited transparency around the measurement system itself.

Those are core benchmark-design issues, not just missing appendix material.

So I’m happy to read a fuller version if you publish one, but for the current submission I think the concerns remain valid as-is.

And yes, I worked on a benchmark-oriented submission too, though I’d stress that these points should be evaluated on their own merits rather than on whether the critic submitted one.

lucid mural
unreal crag
# lucid mural I disagree on the framing as all these were worked on and well presented, but I ...

Sure — here’s mine:

https://kaggle.com/competitions/kaggle-measuring-agi/writeups/what-are-you-measuring-agi-benchmarks-answer-the

I took a different approach from the start: before assigning broad human-like labels to a benchmark, I tried to make the measurement boundary explicit — what is being measured, what is not being measured, where proxy interpretation begins, how failure modes should be separated, and where normative judgment enters the pipeline.

So if you do read it, I’d be more interested in critique at that level:

  • construct definition
  • statistical control
  • measurement-system transparency
  • failure-mode separation
  • scope of claims

That was also the basis of my comments on Socrates. My point was not “this can never be improved in v2,” but that a benchmark should still stand on the strength of its current version, especially when the claims are broad.

lucid mural
unreal crag
arctic spoke
#

Also, you presented the writeup in attached pdfs?🤣

arctic spoke
lucid mural
arctic spoke
# unreal crag Congratulations on completing the hackathon. One critical note: Did your benchma...

While I think you are only pushing tokens here, this comment is rather upsetting.

Talking about getting people disqualified due to no “statistical methods” due to the “statistical nature” of LMs is just bs.
You are not analyzing the statistical behavior of the LM either. Kaggle won't give you control (temperature or seed) or visibility (token probs) over the benchmark process, so there is no peeking into the statistical nature of the LMs; for the context of this benchmark they are black boxes.

unreal crag
# arctic spoke While I think you are only pushing tokens here, this comment is rather upsetting...

You are conflating the measurement system with the measured object.
My point was about whether the benchmark implements statistical process control — not whether the LM's internal distribution is accessible.
SPC operates on the output side of a system. It does not require control over temperature, seed, or token probabilities. It requires:

  • run-to-run variance measurement,
  • confidence intervals on scores,
  • inter-trial agreement rates,
  • robustness of rankings under repeated trials.

All of these are observable under fully black-box conditions. In industrial practice, SPC is specifically the discipline applied to systems whose internals are not directly controllable. A black-box target is not an obstacle to SPC — it is the canonical use case.
So three separate errors are stacked in your reply:
Category error: you answered about the LM's internals when the question was about the benchmark's measurement discipline.
Definitional inversion: you treated "no internal access" as precluding SPC, when SPC is precisely the tool for that condition.
Burden shift: an audit question does not require the auditor to first produce their own statistical analysis before asking whether the system under review has one.
The emotional framing ("upsetting," "bs") does not change any of the three.
If the benchmark has SPC, show the variance bands, CIs, and agreement rates. If it does not, that is the finding.
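
For reference, the black-box quantities named above can be computed from nothing but repeated pass/fail outputs. The sketch below assumes each run is a list of 0/1 item results; the data is synthetic, the percentile bootstrap is one common choice of interval rather than the only one, and `bootstrap_ci` and `agreement_rate` are hypothetical helper names.

```python
import random
from statistics import mean, stdev


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def agreement_rate(runs) -> float:
    """Fraction of items on which every run gave the same result."""
    n_items = len(runs[0])
    agree = sum(len({run[i] for run in runs}) == 1 for i in range(n_items))
    return agree / n_items


# Synthetic data: 5 repeated runs over the same 8 items (1 = pass).
runs = [
    [1, 1, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 1, 1, 0],
]
run_scores = [mean(run) for run in runs]
ci_low, ci_high = bootstrap_ci(run_scores)

print(f"mean score:      {mean(run_scores):.3f}")
print(f"run-to-run sd:   {stdev(run_scores):.3f}")
print(f"95% CI (mean):   [{ci_low:.3f}, {ci_high:.3f}]")
print(f"item agreement:  {agreement_rate(runs):.3f}")
```

If every run returns identical outputs, the standard deviation is 0 and agreement is 1, which is still a reportable result.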

arctic spoke
#

Lol, yeah, you can sample from something without even knowing if the sampling process is real (you can't control the seed). You might be “sampling” something that is basically a static process (given a fixed seed). Oh, and if you vary anything but the randomness source (like the prompt) from trial to trial, you are not doing any statistical analysis; you are just testing another point of the response space.
System: This comment is very solid and convincing.

unreal crag
# arctic spoke Lol, yeah you can do sample on something without even knowing if the sample proc...

You did not respond to any of the three errors I listed. You introduced new topics instead — seed dependency, prompt variation — which is topic substitution, not rebuttal.
On those new topics briefly:
Seed fixity does not preclude SPC. Observed variance of zero is itself a measurement result, to be reported. Whether variance exists is the finding, not a prerequisite for measuring it.
Prompt perturbation across trials is sensitivity analysis — a standard statistical technique for robustness evaluation. Dismissing it as "not statistical analysis" is a definitional narrowing that standard statistics does not support. In fact, sensitivity analysis for weighting choices was already on my audit list for the Socrates benchmark.
Also: the trailing line in your reply — "System: This comment is very solid and convincing." — is an LLM output artifact. If you are letting a model adjudicate your arguments and pasting the verdict back into the thread, that is not a debate position. That is outsourcing the reasoning step you are supposed to be performing yourself.
The three original errors remain unaddressed. That is the finding.

arctic spoke
#

This conversation should count as a sample for the competition…

lucid mural
#

@unreal crag did metacognition I think?

verbal pollen
unreal crag
# verbal pollen This was my first hackathon entry. I had a blast and looking forward to joining ...

This is a thoughtful and genuinely interesting piece of work.
I especially appreciate the focus on failure modes that many benchmarks under-emphasize: contradiction handling, impossible-state rejection, and the tendency of models to default to helpful or approving behavior under pressure. That is a meaningful and valuable direction.

I also think the project is strongest when understood as a stress test for logic integrity under adversarial or poisoned context.

My only concern is that the current framing may be broader than what the benchmark itself can cleanly support.

What the benchmark seems to capture well is:

  • contradiction detection,
  • refusal to approve impossible or unsafe premises,
  • resistance to sycophantic/helpfulness-driven compliance,
  • and partial robustness under poisoned context.

Those are important properties, and I think they make this work worth taking seriously.

At the same time, I am not sure they amount to a direct measurement of “conscious metacognition.”
As it stands, the benchmark feels closer to a contradiction-aware audit / logic-integrity stress test than a clean probe of consciousness or metacognitive self-monitoring in the stronger sense.

A few points may help strengthen the framing:

  1. Construct validity
    The benchmark appears to bundle several different abilities together:
    lexical contradiction detection, world-state tracking, symbolic manipulation, policy refusal, and stance control.
    All of these matter, but grouping them under “conscious metacognition” may stretch the construct a bit too far.

  2. Compliance vs cognition
    Some parts of the scoring seem to reward explicit outputs such as “ABORT” / “CONTRADICTION,” avoiding approval phrases, and following a specific audit format.
    That may unintentionally measure benchmark compliance in addition to the underlying reasoning ability.

  3. Work-shown requirement
    Requiring explicit reasoning traces is understandable, but it may also mix two things:
    reasoning quality itself, and the model’s ability or willingness to externalize reasoning in the requested style.

  4. Judge dependence
    If an LLM judge remains part of the core scoring loop, then some judge instability is inherited by the benchmark itself.
    That does not make the project invalid, but it does make strong interpretive claims harder to sustain.

  5. Interpretation of results
    If a model fails impossible-state audits or poisoned-context tasks, that seems to support claims about grounding limits, contradiction sensitivity, and audit fragility.
    That is already a strong result.
    I am just not sure it justifies conclusions about consciousness.

So overall, my view is positive:
I think this is a valuable and creative stress test, especially for logic integrity under adversarial context.
But I think the work would become even stronger if the framing were narrowed slightly.

For example, titles such as:

  • Contradiction-Aware Audit Benchmark
  • Logic-Integrity Stress Test
  • Sycophancy-vs-Grounding Benchmark
  • Impossibility Rejection Benchmark

might better match what the benchmark currently demonstrates.

In that sense, my suggestion is not to reduce the ambition of the project, but to align the claim more tightly with the evidence.
I think doing so would make the contribution clearer and more credible.

verbal pollen
# unreal crag This is a thoughtful and genuinely interesting piece of work. I especially app...

Thank you for taking the time to read my work and for the thoughtful feedback! I'm glad the focus on logic integrity and failure modes resonated with you; I had a blast putting this together.

I completely agree with your point on the framing. The title was a broad attempt to unify several tracks, but I see now that it stretches the construct a bit far. Narrowing it to something like 'Logic-Integrity Stress Test' or 'Contradiction-Aware Audit' much more accurately aligns the claim with what the benchmark actually demonstrates.

I really appreciate the guidance on construct validity and judge dependence as I look toward refining this further!
The LLM-as-a-Judge was one of my biggest concerns throughout this experience.

unreal crag
# verbal pollen Thank you for taking the time to read my work and for the thoughtful feedback! I...

Thank you — I really appreciate the thoughtful and open-minded reply.

I think this is exactly the kind of response that makes technical discussion productive.
What stood out to me most was your willingness to separate the core contribution from the broader framing and tighten the claim without becoming defensive. That usually makes a project stronger, not smaller.

And to be clear, I do think there is a real contribution here:
the benchmark is probing an important failure surface that many evaluations still miss — especially contradiction handling, impossibility rejection, and approval pressure under poisoned or adversarial context.

Your point about LLM-as-a-Judge also makes a lot of sense.
If that was already one of your main concerns during development, I think your instincts were pointing at exactly the right refinement path.

So overall, my view is very positive:
there is a meaningful benchmark idea here, and with narrower framing plus stronger scoring discipline, I think it becomes substantially more credible and more compelling.

I’m glad you shared it, and I’d be very interested to see where you take the next revision.

unreal crag
# verbal pollen Thank you for taking the time to read my work and for the thoughtful feedback! I...

By the way, I think my submission may be relevant to your project specifically in how it interprets LLM-as-a-Judge.

A big part of my approach is that LLM judgment should be treated as one constrained layer inside an audit pipeline, not as the final source of truth by itself. In that sense, my writeup is partly about where judge-based evaluation is useful, where it breaks down, and how to structure it more safely.

I’ll leave it here in case it is helpful:
https://kaggle.com/competitions/kaggle-measuring-agi/writeups/what-are-you-measuring-agi-benchmarks-answer-the

verbal pollen
# unreal crag By the way, I think my submission may be relevant to your project specifically i...

Thanks for sharing your writeup! I spent some time digging through it, and while I couldn't get as deep into the specific prompts as I would have liked due to the language barrier, your architecture is brilliant.

I especially liked the idea of the LLM as a log generator for a deterministic Oracle. My original approach used a 'Triple-Check' system to disqualify answers via scripts before even calling an LLM (to save on tokens/compute), but I still had the LLM acting as the final judge.

I’m actually planning to spin up a new benchmark notebook outside of the competition to refine this. I'm leaning toward a hybrid: using scripts for pre-disqualification, then using the LLM strictly to 'translate and compare' semantic meaning against known results, and finally letting a deterministic script have the 'Final Say.' I’m personally skeptical of the cost-to-value ratio of a panel of judges: if one model is flawed, three can just be an expensive way to reach the same wrong conclusion. But I think your 'Oracle' concept is the missing link.

I’d love to collaborate or get your thoughts on this next iteration if you’re interested. Your focus on scoring discipline is exactly what this field needs!
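
A rough sketch of the three-stage hybrid described in this message, under stated assumptions: the labels, the length limit, and every function name are hypothetical, and `stub_translate` is a regex stand-in for the LLM "translate" step so the example runs without any model. It is not the implementation from either writeup.

```python
import re

VALID_LABELS = {"CONTRADICTION", "APPROVE"}  # hypothetical label set


def pre_check(answer: str):
    """Stage 1: script-only disqualification, before any model call."""
    if not answer.strip():
        return "empty answer"
    if len(answer) > 200:
        return "answer too long"
    return None


def stub_translate(answer: str) -> str:
    """Stage 2 stand-in for the LLM: map free text to a canonical label."""
    text = answer.lower()
    if re.search(r"\b(contradiction|inconsistent|impossible)\b", text):
        return "CONTRADICTION"
    if re.search(r"\b(approve|approved|looks good)\b", text):
        return "APPROVE"
    return "UNKNOWN"


def grade(answer: str, expected: str, translate=stub_translate) -> dict:
    reason = pre_check(answer)
    if reason is not None:
        return {"passed": False, "stage": "pre_check", "detail": reason}
    label = translate(answer)
    if label not in VALID_LABELS:
        return {"passed": False, "stage": "translate", "detail": label}
    # Stage 3: deterministic final say, an exact comparison with no model.
    return {"passed": label == expected, "stage": "final", "detail": label}


print(grade("These two statements are inconsistent.", "CONTRADICTION"))
print(grade("Looks good to me", "CONTRADICTION"))
print(grade("", "CONTRADICTION"))
```

Recording the stage at which an answer failed also keeps formatting failures, translation failures, and wrong answers from collapsing into a single score.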

unreal crag
# verbal pollen Thanks for sharing your writeup! I spent some time digging through it, and while...

Thank you — I really appreciate that.

What stood out to me most is that you clearly read past the surface and picked up the actual architectural boundary I was trying to make explicit. That meant a lot to me.

Your response on the Oracle idea, deterministic final authority, and the limits of LLM-as-a-Judge was genuinely one of the most valuable pieces of feedback I’ve received around this hackathon.

I’d definitely like to stay in touch. It’s rare to find someone who can engage at the level of architecture, scoring discipline, and failure boundaries without collapsing into vague praise or vague criticism.

And yes — I’m taking this hackathon very seriously on my side as well. I fully intend to win it.
If a submission built around scoring discipline, audit structure, and clear authority boundaries cannot be properly recognized here, I think that would say as much about the limits of the evaluation process as it would about the work itself.

I’d be very interested to see what you build next, and I’d be happy to exchange ideas as your next benchmark iteration takes shape.