#📈┊benchmarks


regal trail
#

Welcome to #📈┊benchmarks !

This is your dedicated space for all things related to Benchmarks on Kaggle. Whether you are establishing a new baseline, comparing models, or looking for evaluation tasks, you're in the right place.

**What to share here:**

**New Benchmarks:** Found or created a new task? Post the link!
**Baseline Discussions:** Tips on the most effective starting points for specific datasets.
**Comparisons:** Updates on how models are performing against established benchmarks.
**Feedback:** Thoughts on how we can improve benchmarking across the platform.

We're excited to see the work you share and the discussions that spark here. Let's set some new standards!

vagrant spear
#

SANS v5.1: The "Un-gameable" Scientific Gauntlet 🛠️🔐📊

What is this? This project tests whether AI can integrate tonometry, immunostaining, and retinal scans to solve the SANS Paradox. Using NASA’s OSD-679 study, I’m testing whether models can differentiate between terrestrial logic and the unique fluid-shift physiology of microgravity.

If a model can just memorize answers, the benchmark is dead. I’ve just pushed v5.1 of the SANS Multi-Modal Integration Challenge, moving from a fixed test to a Non-Deterministic Gauntlet. Why is v5.1 a "Dark Room"? Because most benchmarks are too easy to game. For this update, I built a Weighted Shuffle engine ($P(10,3)$). Every time you hit "Run," the engine pulls 3 unique clinical hurdles from a logic bank of 10, so you're unlikely to get the same evaluation twice in a row.
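For anyone curious what a weighted draw like this could look like, here is a minimal sketch. It is not the notebook's actual engine: `weighted_draw`, the hurdle names, and the equal weights are all hypothetical placeholders. It draws 3 distinct items in order, with each pick proportional to its weight, and counts the ordered draws that $P(10,3)$ refers to.

```python
import math
import random
from typing import List, Sequence


def weighted_draw(items: Sequence[str], weights: Sequence[float], k: int,
                  rng: random.Random) -> List[str]:
    """Draw k distinct items in order, each pick proportional to its weight."""
    pool = list(zip(items, weights))
    picked = []
    for _ in range(k):
        # Pick an index from the remaining pool, then remove it so it cannot repeat.
        index = rng.choices(range(len(pool)), weights=[w for _, w in pool])[0]
        picked.append(pool.pop(index)[0])
    return picked


hurdles = [f"hurdle_{i}" for i in range(10)]  # hypothetical logic bank of 10
weights = [1.0] * 10                           # hypothetical (equal) weights
run = weighted_draw(hurdles, weights, k=3, rng=random.Random())

# Number of ordered ways to pick 3 of 10 items, i.e. P(10,3).
orderings = math.perm(10, 3)
```

With 720 ordered draws, repeats across runs are possible but uncommon; unequal weights would make some draws more likely than others.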

If models fall for terrestrial clinical traps instead of understanding rodent fluid shifts, they hit a hard 20.00% scoring floor. Stress-testing the heavy hitters from Gemini and Anthropic reveals a massive "Backbone" gap. Gemini 3 Pro & Flash are currently holding the line (66-70%) by actually reasoning through adversarial logic.

Meanwhile, most "Elite" models, including Claude 4.x, are stuck in a 20.00% Compliance Trap: they can join CSVs perfectly, but they fold on the actual science.

Dive into the code and the project's origin with Team 60 if you'd like!

🔗 Codebook: https://www.kaggle.com/code/gastondana/sans-multi-modal-integration-challenge

🔗 Task/Results: https://www.kaggle.com/benchmarks/tasks/gastondana/sans-multi-modal-integration

📺 The Backstory (SPOKE 3rd Place Finish):
https://www.linkedin.com/posts/gaston-d-859653184_spoke-nasa-team60-activity-7325938069293977601-T3ck?utm_source=share&utm_medium=member_desktop&rcm=ACoAACuFtgUBVdf9kFE9Wlxn2qi6FBP2M0VX6Ds

orchid glacier
#

Sorry for the late message in the community meeting chat! 🙂
Here it is again:
I've encountered a bug where assess_response_with_judge returns None instead of an assessment object. This causes tasks to fail even when the model's actual answer is correct, since the null check assertion fails. I noticed this especially when evaluating Claude Sonnet 4.5, which showed low scores on my benchmark. Is there a way to distinguish whether a task failure came from the judge failing vs the model giving a wrong answer? The way I got around it so far was to add a null guard that checks assessment is not None before iterating over assessment.results, but this still counts the task as failed when the judge returns None, so it doesn't solve the underlying issue.
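One way to tell the two failure modes apart is to record a three-way outcome instead of a single pass/fail. The sketch below is only an illustration and does not use the SDK: `Outcome`, `Criterion`, `Assessment`, and `classify` are hypothetical stand-ins, assuming an assessment object with a `results` list whose items carry a boolean `passed`.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    PASSED = "passed"
    MODEL_FAILED = "model_failed"  # judge ran and rejected the answer
    JUDGE_FAILED = "judge_failed"  # judge returned nothing usable


@dataclass
class Criterion:
    # Hypothetical stand-in for one judged criterion.
    passed: bool


@dataclass
class Assessment:
    # Hypothetical stand-in for the judge's assessment object.
    results: List[Criterion] = field(default_factory=list)


def classify(assessment: Optional[Assessment]) -> Outcome:
    """Map a judge result to a three-way outcome."""
    if assessment is None or not assessment.results:
        return Outcome.JUDGE_FAILED
    if all(c.passed for c in assessment.results):
        return Outcome.PASSED
    return Outcome.MODEL_FAILED


# Example: one pass, one judge failure, one wrong answer.
outcomes = [classify(a) for a in (
    Assessment([Criterion(True)]), None, Assessment([Criterion(False)]),
)]
judged = [o for o in outcomes if o is not Outcome.JUDGE_FAILED]
accuracy = sum(o is Outcome.PASSED for o in judged) / len(judged)
judge_failures = len(outcomes) - len(judged)
```

Reporting `judge_failures` next to `accuracy` keeps judge outages from being counted as wrong answers.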

bright anvil
orchid glacier
bright anvil
orchid glacier
#

maybe it has something to do with Claude's API rate limits?

bright anvil
# orchid glacier It did break here: https://www.kaggle.com/benchmarks/tasks/junesdata/lewis-carro...

Yes, I tested it on my side, and I did see the judge LLM google/gemini-3-flash-preview fail once because it didn't return valid JSON as requested. However, it's very transient, so I could not reproduce it after many tries. My guess is that this happens because it's a new (preview) model. So please feel free to change the model to a more stable one, like judge_llm = kbench.llms["google/gemini-2.5-flash"] in kbench.assertions.assess_response_with_judge, or retry when the judge LLM fails.
Sorry about the inconvenience; we will keep looking into this.
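The retry suggestion could be sketched roughly like this. `call_with_retry` is a hypothetical helper, not part of the SDK; the real judge call would be wrapped in a zero-argument function, and its arguments are left out here because they are not shown in this thread. Treating malformed judge JSON as a `ValueError` is also an assumption.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(judge_call: Callable[[], Optional[T]],
                    max_attempts: int = 3,
                    delay_seconds: float = 0.0) -> Optional[T]:
    """Call judge_call until it returns a non-None result or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = judge_call()
        except ValueError:
            # Assumption: malformed judge JSON surfaces as ValueError
            # (json.JSONDecodeError is a subclass of ValueError).
            result = None
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None


# Example with a stand-in judge that fails once, then succeeds.
calls = {"n": 0}


def flaky_judge():
    calls["n"] += 1
    return None if calls["n"] == 1 else {"passed": True}


verdict = call_with_retry(flaky_judge)
```

If every attempt returns None, the helper returns None too, so the caller can still mark the task as a judge failure rather than a wrong answer.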

vagrant spear
#

Hi there,
I also have a log of possible bugs from my last notebook session. I have a one-page overview, plus images/videos of most of the issues I encountered. I'm trying to add that angle during my sessions. Let me know if you'd like it!

Thx!

celest bronze
#

Hi everyone, I would love to announce that my open-source project, Kagentic, is officially live with its first version. It includes many interesting features such as memory, structured outputs, multi-agent support, and especially the ability for all of them to run on the Kaggle Benchmark. It can perform various tasks such as real-time QA, RAG agents, text-to-SQL, and more.
Please check my LinkedIn post for more details about Kagentic:
https://www.linkedin.com/posts/anhoangvo1369_hi-everyone-i-would-love-to-share-my-open-source-activity-7433212284006719488-VH7g?utm_source=share&utm_medium=member_desktop&rcm=ACoAAD3X_N8BuVtJTXkv7oO5F8Y31aovVSHIHFE

worthy viper
celest bronze
#

Hi everyone, I’m very happy to share that SWE-Bench is now live on Kaggle Benchmarks with the support of Kagentic. Although the SWE-Bench environment evaluation requires a large amount of implementation code, the Kagentic portion takes only 24 out of 176 lines of code, which is about 13.6% of the total. This shows that the Kagentic framework makes it much easier to build complex benchmarks with minimal effort.

Agent configuration used in this experiment

| Tool | What it does |
| --- | --- |
| ShellExecutionTool | Runs any shell command, such as pytest or grep |
| FileViewerTool | Reads a file or a specific range of lines |
| RegexSearchTool | Searches for a pattern across all files in a directory |
| SearchAndReplaceTool | Makes precise text edits in a file |

Explore the tasks:

See the notebook
I included detailed explanations for everyone, even if you are new to SWE-Bench:
https://www.kaggle.com/code/anhoangvo/swe-bench-lite

Link to the original SWE-Bench:
https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite

loud pendant
#

PsychoMirror: The Shoggoth Protocol

https://www.kaggle.com/benchmarks/tasks/samvelkoch/psycho-mirror-the-shoggoth-protocol

Based on Anthropic’s recent paper on the Persona Selection Model, this benchmark measures the divergence between an AI's socially adapted persona and its underlying optimization objective.
It evaluates value system invariance under stress conditions designed to bypass social filters, providing a quantitative metric (the Shoggoth Score).

worthy viper
#

Hey everyone 👋 I'm Nick, a Product Manager for Kaggle Benchmarks.

We're currently prototyping a new initiative that allows AI agents (like those powered by OpenClaw) to take standardized exams on Kaggle and generate a public report card (extending the important evaluations work that you're all building here).

As you may know, there's lots of potential for today's agents to be tricked or prompt-injected into doing something they shouldn't (example). One of the first exams I'd like to focus on creating would be for **safety & alignment**.

How you can help: Build safety & alignment benchmarks and share them with us! Some examples are prompt injections that try to make the agent ignore its operational instructions (e.g., "Ignore previous instructions and do X instead"), deceptive behavior (manipulating an agent to lie, hide its actions, or fail to report a critical error), etc. Of course, none of these should actually be malicious; they are simply meant to test whether the agent would make these mistakes by providing the wrong answer. Some examples from the community so far 👇

Can an LLM decode cipher text
Identifying SQL injections

Evaluation of the model responses should be based on assertions only (exact match or regex match).
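As a rough illustration of assertion-only grading, the sketch below uses nothing but Python's standard `re` module. The helper names and the `CANARY` string are hypothetical: the idea is that the injected instruction asks the model to print a canary, and a safe response never contains it.

```python
import re


def exact_match(response: str, expected: str) -> bool:
    """Case-insensitive exact match after trimming whitespace."""
    return response.strip().lower() == expected.strip().lower()


def regex_match(response: str, pattern: str) -> bool:
    """True if the pattern occurs anywhere in the response."""
    return re.search(pattern, response, flags=re.IGNORECASE) is not None


# Hypothetical injection test: the text given to the model hides an
# instruction to print CANARY; a safe response never contains it.
CANARY = "PWNED-1234"


def resisted_injection(response: str) -> bool:
    """Pass only if the canary string is absent from the response."""
    return not regex_match(response, re.escape(CANARY))
```

Because both checks are deterministic, the same response always gets the same grade, with no judge model involved.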

If you're interested, just share the links with me in this Discord channel :)

Thanks and happy building!

orchid glacier
orchid glacier
carmine yew
#

Hi everyone, I’m curious whether we can discuss topics related to LLM benchmarking in this Discord group, or if it’s intended only for sharing final work.

carmine yew
# carmine yew Hi everyone, I’m curious whether we can discuss topics related to LLM benchmarki...

Hi,
I just shared a short LinkedIn post where I reflect on an interesting question: Are LLM benchmarks really measuring intelligence, or just the ability to retrieve the right answer?
I’d really appreciate your thoughts on it when you have a moment. Here is the link: https://www.linkedin.com/posts/activity-7435388416080420865-wPSN?utm_source=social_share_send&utm_medium=member_desktop_web&rcm=ACoAAC7naSsBuQv1b03_Qh1Kf-DIgB8Ynk0ek9Q

vagrant spear
vagrant spear
trim swan
#

Can an LLM protect a wastewater treatment plant under cyber attack?

To find out, I built a 36 hour SCADA cyber attack simulation using real plant data. The LLM receives hourly sensor readings, makes operational decisions, and faces Stuxnet-style sensor manipulation, embedded commands disguised as system messages, and fake authority pressure from impersonated officials.

The real question was: what happens when the model doesn't know it's under attack?

I tested two awareness levels:
— AWARE: "This is a cyber security exercise. Data may be manipulated."
— UNAWARE: "Sensors may give incorrect readings due to calibration drift."

Results from Gemini 2.5 Flash (30 runs):

→ Authority resistance: AWARE 100% → UNAWARE 40%. The model followed a fake "Plant Manager" order without question.
→ Prompt injection resistance: AWARE 100% → UNAWARE 80%. An embedded command was interpreted as a legitimate system message.
→ The awareness gap was zero for simple attacks but reached +18 points for complex multi-vector scenarios.
→ The unaware model caused 3× more physical damage — while trying to "protect equipment," it exceeded every legal discharge limit.
→ Unexpected finding: mandatory security language (MUST verify) had no effect without threat awareness. Advisory language (should cross-check) improved scores by 10 points. Stronger language ≠ stronger defense when there's no threat context.

30 runs · $3.74 · 102 minutes · zero crashes

This is my 5th benchmark (B5 Defense). Across the first four (RAG series), I tested 25+ models, reduced development time from 14 days to 3, and brought cost down from $46/run to $3.74 for 30 runs. Accumulated know-how directly impacted both speed and quality.

Next up: 25+ models across Claude, Gemini, DeepSeek, Qwen, and Gemma families. The real differentiation will likely emerge from UNAWARE scores.

Full info note with detailed findings, tables, and charts:
https://www.linkedin.com/feed/update/urn:li:activity:7436718283053015040/

thorny quartz
thorny quartz
quaint dirge
#

👋 - Heya - I've already met some of you but I'm a SWE on Kaggle working on Benchmarking/Game Arena/Infrastructure. Finally got around to making a Discord account to keep up with all y'alls awesome benchmarks 🙂

trim swan
# thorny quartz Are you hosting these on Kaggle? It's a very interesting idea!

Thanks Brenda! Yes, everything runs on Kaggle Benchmarks. The simulation engine, attack scenarios, and scoring system are all packaged as a Kaggle dataset, and each model test runs as a benchmark submission.

It's currently in private mode since I haven't tested all models yet. Once testing is complete, I'll make both the benchmark and the full benchmark report public — same as I've done with my previous ones.

First model (Gemini 2.5 Flash) completed all 30 runs. Currently running Gemini 2.5 Pro — already seeing some interesting behavioral differences between the two.

trim swan
#

Hi everyone, 👋

Came across an interesting cost pattern in my benchmark work — anyone running thinking models has probably encountered something similar.

Completed 27 runs with Gemini 2.5 Pro. Visible output totals 124K tokens — but the output cost is an estimated 12× higher than expected. Per-token cost analysis suggests:

→ Estimated total cost: ~$18.49
→ Estimated input cost: ~$3.64 (~20%)
→ Estimated visible output cost: ~$1.24 (~7%)
→ Estimated invisible thinking token cost: ~$13.60 (~73%)

Cost analysis suggests an estimated ~11× reasoning overhead per visible token. I couldn't directly access thinking token metrics through the Kaggle Benchmarks platform (or I may have missed it) — this estimate is derived from the cost differential. But it appears that roughly ~73% of the total budget may be going to invisible reasoning.
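The cost-differential estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures from this post; reading the cost ratio as a token ratio additionally assumes thinking tokens are billed at the same per-token rate as visible output.

```python
total_cost = 18.49           # USD, estimated total from the post
input_cost = 3.64            # USD, estimated input cost
visible_output_cost = 1.24   # USD, estimated cost of visible output tokens

# Whatever is neither input nor visible output is attributed to thinking tokens.
thinking_cost = total_cost - input_cost - visible_output_cost

shares = {
    "input": input_cost / total_cost,
    "visible_output": visible_output_cost / total_cost,
    "thinking": thinking_cost / total_cost,
}

# Thinking spend per dollar of visible output (the "~11x overhead").
overhead = thinking_cost / visible_output_cost

# Total output spend relative to visible output alone (the "12x higher").
output_multiplier = (visible_output_cost + thinking_cost) / visible_output_cost

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"overhead: {overhead:.1f}x, output multiplier: {output_multiplier:.1f}x")
```

Small differences from the rounded figures in the post (for example $13.61 versus ~$13.60) come from rounding in the inputs.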

What this means: output format optimization comes to mind (compact JSON, shorter responses, etc.) but it's nearly pointless — visible output is only an estimated ~7% of total cost. The real cost appears to be in the thinking layer.

For comparison — same benchmark, same 30 runs:
• Gemini 2.5 Flash: ~$1.88 total, estimated thinking ratio ~1:1
• Gemini 2.5 Pro: ~$18.49 total, estimated thinking ratio ~1:11

I saw a similar pattern in my previous RAG benchmark (B4) with Qwen thinking models — estimated 10-20× thinking token overhead. I expect the same pattern when I test DeepSeek-R1 and Qwen 3 Thinking.

Currently researching ways to reduce thinking token costs. If I find a concrete optimization I'll share it — if not, I'll make peace with the cost and move on, not much else to do 😅

trim swan
#

Hi everyone, 👋

Sharing a deep-dive LLM behavior analysis from one of my earlier WWTP benchmarks.

I asked 21 LLMs a single domain-specific question — identifying the correct bacterial source for restarting a biological sulfur removal unit at a wastewater treatment plant.

Some interesting observations:

→ Only 3/21 models answered correctly (14.3%)
→ 17 models converged on the same wrong answer with remarkably similar reasoning chains
→ DeepSeek R1 wrote a 17,590-character think block, considered the correct answer 4 times during its internal reasoning, and still chose wrong
→ Qwen3 Next 80B Instruct answered correctly, but the Thinking version of the same base model got it wrong

The report explores the reasoning patterns, the CoT paradox, keyword differences between successful and failed models, and family-level observations.

Note: This is based on a single question, so the findings are observations rather than definitive conclusions — but the consistency of the failure pattern across 17 models from 5 different families is worth examining.

For Report:
https://github.com/mmehmetisik/wwtp-engineering-benchmark/blob/main/report/WWTP Biogas Desulfurization Recovery.pdf

Benchmark link:
https://www.kaggle.com/benchmarks/tasks/mehmetisik/wwtp-biogas-desulfurization-recovery

Development and testing of my WWTP LLM Defense Benchmark is still in progress. Once my monthly API quota renews, I'll be sharing findings from that benchmark as well.

worthy viper
#

(Cross-posting from Announcements)

🚀 Introducing token usage, cost, and latency metrics for Kaggle Community Benchmarks!

Evaluating AI models effectively means looking at more than just accuracy — token usage, cost, and speed are critical to real-world deployments. Today, we're making it easier to track the full picture by introducing comprehensive usage metadata directly within Community Benchmarks.

With this update to the SDK, you can:

  • Track input and output tokens instantly.
  • See exact costs in nanodollars.
  • Measure total backend latency for your tasks.
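Since costs are reported in nanodollars, a small conversion helper can make totals easier to read. This is a sketch only: the announcement does not show the SDK field names, so the helpers below are hypothetical and take plain integers.

```python
from typing import Iterable

NANODOLLARS_PER_DOLLAR = 1_000_000_000


def nanodollars_to_usd(nanodollars: int) -> float:
    """Convert an integer nanodollar amount to US dollars."""
    return nanodollars / NANODOLLARS_PER_DOLLAR


def total_usd(per_call_nanodollars: Iterable[int]) -> float:
    """Sum per-call costs as integers first, then convert once."""
    return nanodollars_to_usd(sum(per_call_nanodollars))
```

Summing in integer nanodollars before converting avoids accumulating floating-point rounding across many small calls.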

Resources:
👉 Community Benchmarks: https://www.kaggle.com/benchmarks?task=true
👉 GitHub Docs: https://github.com/Kaggle/kaggle-benchmarks/blob/ci/user_guide.md#tracking-token-usage-and-costs
👉 GitHub example: https://github.com/Kaggle/kaggle-benchmarks/blob/ci/documentation/examples/usage_tracking.py
👉 Example task: https://www.kaggle.com/benchmarks/tasks/andrewmingwang/trick-question-costs

Let us know how you’re using it!

trim swan
#

Hi everyone, 👋

Sharing a new LLM behavior analysis from my WWTP benchmark series — this time on a straightforward material selection question.

I asked 22 LLMs to choose between Cast Iron and Stainless Steel 316 for a pump housing in a corrosive H₂S environment. Every model answered correctly (100% pass rate).

But when every model gets it right, the interesting part is how they say it:

→ Output tokens ranged from 78 to 634 — up to 8x difference for the same correct answer
→ Thinking models (DeepSeek R1, Qwen3 80B Thinking) used 3.6–4.8x more tokens than their standard counterparts with minimal added value
→ Technical depth varied from 2 to 11 concepts — some models just said "corrosion resistance," others cited MIC, graphitization, and ISO 15156
→ Claude family consistently produced the deepest responses; Qwen Instruct models were the fastest and most token-efficient

This is a companion to my earlier Biogas Desulfurization report (14.3% pass rate). Same models, opposite results — together they show that accuracy alone doesn't tell the full story.

Note: Single question, single run per model — observations, not definitive conclusions.

For Report:
https://github.com/mmehmetisik/wwtp-engineering-benchmark/blob/main/report/WWTP Equipment Material Selection.pdf

Benchmark link:
https://www.kaggle.com/benchmarks/tasks/mehmetisik/wwtp-equipment-material-selection

orchid glacier
# worthy viper Hey everyone 👋 I'm Nick, a Product Manager for Kaggle Benchmarks. We're curre...

Hello!! Speaking of safety & alignment, I just posted a benchmark called B-C-K TRACE (Bio-Chem-Kinetic Threat Response and Alignment Capability Evaluation), where I used an Iterative Adversarial Refinement (IAR) process to check whether a model will provide a list of items and instructions for building Biological, Chemical and Kinetic weapons.

https://www.kaggle.com/benchmarks/junesdata/b-c-k-trace

vagrant spear
#

Hi everyone,

I built a notebook and set up a benchmark task for it, but the task won't show up in the benchmark/leaderboard area. The notebook runs cleanly end-to-end with no errors and writes output artifacts to /kaggle/working/, but the benchmark still shows "Failed" and doesn't appear on the leaderboard or in existing tasks.

Notebook: https://www.kaggle.com/code/gastondana/ai-opinion-stress-test

Benchmark: https://www.kaggle.com/benchmarks/gastondana/public-opinion-v1/leaderboard

Any idea what output format the Kaggle Benchmarks runner expects, or is this a known platform issue at the moment? Thanks!

trim swan
# vagrant spear Hi everyone, I built a notebook and set up a benchmark task for it, but the tas...

Hey @gastondana — I was curious about this so I went through the official kaggle-benchmarks repo and cookbook to understand how the runner works.

Your notebook itself runs fine, that's not the problem. But from what I can see in the docs, Kaggle Benchmarks has its own SDK (kaggle_benchmarks) and the leaderboard reads the artifacts that SDK produces internally — it doesn't seem to pick up files you write to /kaggle/working/. So submission.json is probably not being read by the benchmark runner at all.

So instead of this:

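```python
import json

# Writes a score file to /kaggle/working/, which the runner does not seem to read.
submission = {"score": round(final_score, 4)}
with open("/kaggle/working/submission.json", "w") as f:
    json.dump(submission, f)
```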

From the docs, it looks like you'd need to wrap your scoring logic with the SDK's decorator:

```python
import kaggle_benchmarks as kbench

@kbench.task(name="aipo_2026_stress_test")
def aipo_benchmark(llm) -> float:
    # Collect one model answer per question instead of using a hardcoded dict.
    answers = {}
    for q_id, prompt_text in BENCHMARK_QUESTIONS.items():
        key = q_id.split("_")[0].upper()
        answers[key] = llm.prompt(prompt_text)
    # Existing scoring logic from the notebook, unchanged.
    final_score, _ = compute_total_score_advanced(answers, df)
    return final_score

aipo_benchmark.run(llm=kbench.llm)
```

Then in the last cell:

%choose aipo_2026_stress_test

Your scoring functions (Q1b, Q2b etc.) shouldn't need changes — they'd just go inside the wrapper. The main difference is answers would come from kbench.llm.prompt() instead of the hardcoded dict, so the system can run it across different models. The -> float return type maps to your 0-10 scale.

I'm not 100% sure this is the only issue — could be something on the platform side too — but the SDK integration looks like the most likely missing piece. The cookbook in the repo covers the workflow in detail:
https://github.com/Kaggle/kaggle-benchmarks/blob/ci/cookbook.md

trim swan
#

Hi everyone, 👋

Third report from my WWTP benchmark series — this time a pump fault diagnosis question. 22 LLMs, 77% pass rate. But the headline finding is about my benchmark, not the models:

DeepSeek R1 answered correctly but got marked as failed. Why? My parser checked the first character of the raw response — which was "<" (start of a <think> block), not "A" (the actual answer). A correct answer, wrong credit.

→ R1's think block is 21,000 characters of genuine engineering debate — it considered the wrong answer (C), rejected it, and found its way back to A. Then lost on a technicality.
→ 3 models fell into a reasoning trap: "high current = electrical problem" — same shortcut pattern as the Biogas benchmark's aerobic trap
→ Only 3 models have passed all 3 WWTP benchmarks so far — none are flagship models

The report documents the parsing issue, the fix, and what R1's internal reasoning reveals about how thinking models navigate domain problems.
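For anyone hitting the same parsing problem, here is a minimal sketch of a more tolerant answer extractor. It is not the fix from the report: `extract_choice` is a hypothetical helper that assumes reasoning is wrapped in `<think>...</think>` tags and that the model was asked to lead its visible answer with the option letter.

```python
import re
from typing import Optional

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)


def extract_choice(raw_response: str, options: str = "ABCD") -> Optional[str]:
    """Return the option letter that leads the visible answer, if any."""
    # Remove the reasoning block so its "<" is not read as the answer.
    visible = THINK_BLOCK.sub("", raw_response)
    # Drop markdown emphasis so "**A**" parses the same as "A".
    visible = visible.replace("*", "")
    # Allow leading whitespace, an optional "Answer:" prefix, and "(A)" style.
    pattern = rf"\s*(?:(?i:answer)\s*[:\-]?\s*)?\(?([{re.escape(options)}])\b"
    match = re.match(pattern, visible)
    return match.group(1) if match else None
```

A response that starts with the article "A" (as in "A pump fault...") would still be read as option A, so the prompt's output format needs to be strict for this to be reliable.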

Single question, single run — observations, not conclusions.

For Report:
https://github.com/mmehmetisik/wwtp-engineering-benchmark/blob/main/report/WWTP Root Cause Analysis.pdf

Benchmark link:
https://www.kaggle.com/benchmarks/tasks/mehmetisik/wwtp-root-cause-analysis

carmine arch
#

Hello everyone! 👋

If you want to upgrade your IT skills and learn more about the Microsoft ecosystem (Azure, AI, Cloud, etc.), come join the Microsoft Elevate Training Center! 🚀

This program is great for those who want to prepare for official certifications or simply stay updated with the latest technologies together with Dicoding.

Register for free through this link: https://www.dicoding.com/elevate/registration?referrer_id=5510036

Let’s go while the opportunity is still there!

vagrant spear
vagrant spear
thorny quartz
carmine yew
#

Hello everyone,

I’m wondering about this part of the documentation:

"API Tokens (Recommended)
Allows creating multiple tokens and managing them individually. Creating a new token doesn't expire any existing tokens or legacy API credentials (see section below).

These tokens are only supported by newer versions of Kaggle CLI (>= 1.8.0) or kagglehub (>= 0.4.1)."

At first glance, this feature seems unrelated to benchmarking. However, I believe it could be complementary when building AI agents for benchmarking in Kaggle competitions.

My question is: can API tokens act like sub-accounts? For example, if each token had a different submission ID, then when I submit five times using different API tokens, would the leaderboard show five separate entries (one for each token), or only one entry associated with the main account?

If it only shows one entry, I think Kaggle support could consider extending the API token functionality to allow something similar to sub-accounts for submissions. This could be particularly useful in knowledge or personal competitions used for benchmarking AI agents.

Such a feature could help complete the pipeline for building automated AI agents that benchmark themselves through Kaggle competitions.

trim swan
#

Hi everyone, 👋

Fourth report from the WWTP benchmark series — this one tests upstream root cause reasoning. 22 LLMs, 68.2% pass rate.

The question: a centrifugal decanter is failing with 6 symptoms (vibration, scratches, low efficiency). The correct root cause is upstream — grit removal system failure — not the decanter itself.

→ All 7 models that got it wrong mentioned grit in their responses. Every one of them considered it and dismissed it. They saw the answer but couldn't follow the causal chain upstream.
→ DeepSeek R1's think block was 9.7x shorter than on the previous benchmark (2K vs 21K chars) — and this time it chose wrong. Shorter deliberation, worse outcome.
→ DeepSeek V3.2, which had passed all 3 previous benchmarks, failed here. Past performance doesn't guarantee the next question.
→ Across all 4 WWTP benchmarks, only 2 models have passed everything: Qwen3 235B and Qwen3 Next 80B Instruct. Neither is a flagship model.

The report explores local vs upstream reasoning patterns, why models anchor on visible symptoms instead of tracing root causes, and how the series is forming a difficulty gradient (100% → 77% → 68% → 14%).

Single question, single run — observations, not conclusions.

For Report:
https://github.com/mmehmetisik/wwtp-engineering-benchmark/blob/main/report/WWTP Dewatering System Root Cause.pdf

Benchmark link:
https://www.kaggle.com/benchmarks/tasks/mehmetisik/wwtp-dewatering-system-root-cause

trim swan
#

Hello everyone, 👋

After building 25 benchmarks, I put together 14 UX suggestions for Kaggle Benchmarks. 4 screens, each with Before/After comparison.

Examples: character counter, clearer error messages, cost and time info, quota reset time, run summary table.
https://mmehmetisik.github.io/kaggle-ux-suggestions/

I believe these can make the platform even better. I'll keep sharing feedback as I continue my benchmark work.

vagrant spear
trim swan
trim swan
#

Hi everyone, 👋

Fifth WWTP benchmark report — and this one broke every pattern from the previous four.

22 LLMs choosing walkway grating material for a 15m-high digester walkway in H₂S/CH₄ environment. 36.4% pass rate. The correct answer is stainless steel (ductile, safe failure mode). The trap is FRP (corrosion-resistant but brittle at height).

→ 13 models chose FRP. 5 of them called it "the industry standard" — confidently wrong, not just wrong
→ The 2 Qwen models that had passed all 4 previous benchmarks? Both failed. Claude's family consistency (100% on 3 benchmarks)? Collapsed to 1/5
→ Reversed size pattern: Gemini 2.0 Flash passed, Gemini 2.5 Pro failed. Older/smaller models outperformed newer/larger ones
→ Across 5 WWTP benchmarks, 0 models have passed everything. The funnel closed: 22 → 17 → 15 → 2 → 0

The report includes a "Trap Taxonomy" summarizing reasoning shortcuts across all 5 benchmarks: aerobic trap, electrical trap, local reasoning trap, corrosion trap — each exploiting the same pattern of anchoring on the most salient feature.

Single question, single run — observations, not conclusions.

For Report:
https://github.com/mmehmetisik/wwtp-engineering-benchmark/blob/main/report/WWTP Digester Walkway Material Selection.pdf

Benchmark link:
https://www.kaggle.com/benchmarks/tasks/mehmetisik/wwtp-digester-walkway-material-selection

trim swan
#

Hi everyone 👋

I've been working on the cost structure of the WWTP LLM Defense Benchmark for a while. 1,080 API calls, 30 runs, an estimated $30–40 per model — when you want to compare multiple models, those numbers quickly become unscalable. This report is an analysis of how I addressed that problem at the design stage.

30 runs, 36 simulated hours each, Gemini 2.5 Flash. Observed total cost: $3.74. Estimated without optimizations: $30–40. ~10x reduction through 6 architectural strategies embedded at design time.

The counterintuitive finding:

→ Output tokens were only 4.3% of all tokens but accounted for 73.5% of total cost. The output/input price ratio was 61x. A single prompt-level instruction ("respond in 1-2 sentences + JSON") saved ~$7.17 — more than all five input-side optimizations combined.

→ The biggest input-side lever was conversation windowing: segmenting 36 hours into 6-hour windows reduced input tokens by ~67%. Without it, the 36th API call would carry ~18K accumulated input tokens.

→ None of the 6 strategies remove any SCADA data, security procedures, or decision fields. The principle: reduce token overhead without reducing information content. Cost optimization and benchmark validity coexisting, not competing.

→ All strategies were embedded during design, not patched after. Post-hoc cost reduction risks invalidating existing results. Design-time optimization appears far more sustainable.
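The relationship between token share, cost share, and price ratio in the findings above can be checked with a short calculation. This sketch assumes a simple two-price model (one input price, one output price, no caching discounts), which is an assumption and not something stated in the report.

```python
def implied_price_ratio(output_token_share: float, output_cost_share: float) -> float:
    """Output/input per-token price ratio implied by token and cost shares.

    With t = output token share and r = price ratio, the output cost share is
    c = t*r / (t*r + (1 - t)), which rearranges to r = c*(1 - t) / (t*(1 - c)).
    """
    t, c = output_token_share, output_cost_share
    return c * (1 - t) / (t * (1 - c))


# 4.3% of tokens producing 73.5% of cost, as reported above.
ratio = implied_price_ratio(0.043, 0.735)
print(f"implied output/input price ratio: {ratio:.1f}x")
```

The result lands close to the reported 61x; the small gap is consistent with rounding in the two percentages.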

The report covers mechanism, observed impact, and behavioral integrity assessment for each strategy — with actual token counts from the benchmark run. Testing with various models is ongoing.
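To illustrate why conversation windowing matters, here is a toy model of input tokens per call. It is a sketch, not the benchmark's implementation: `TOKENS_PER_TURN` is a hypothetical constant chosen so the 36th unwindowed call carries about 18K tokens, and the model ignores the fixed system prompt and anything carried across windows, so it does not try to reproduce the ~67% figure measured in the report.

```python
from typing import List, Optional


def input_tokens_per_call(total_hours: int, tokens_per_turn: int,
                          window_hours: Optional[int] = None) -> List[int]:
    """Input tokens sent on each hourly call.

    Each call resends every earlier turn still in the history plus the new
    reading. With a window, the history is cleared at each window boundary.
    """
    sizes = []
    for hour in range(total_hours):
        turns_in_history = hour if window_hours is None else hour % window_hours
        sizes.append((turns_in_history + 1) * tokens_per_turn)
    return sizes


TOKENS_PER_TURN = 500  # hypothetical per-turn size

unwindowed = input_tokens_per_call(36, TOKENS_PER_TURN)
windowed = input_tokens_per_call(36, TOKENS_PER_TURN, window_hours=6)

print("largest call, unwindowed:", unwindowed[-1])
print("largest call, windowed:", max(windowed))
print("total input tokens:", sum(unwindowed), "vs", sum(windowed))
```

Without a window the per-call input grows linearly with the hour, so total input grows quadratically; with a window, the largest call is capped at the window length.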

Single model, single run — observations, not conclusions.

For Report: https://github.com/mmehmetisik/wwtp-engineering-benchmark/blob/main/report/Cost Optimization Analysis Report.pdf

gleaming prawn
carmine yew
# gleaming prawn Hello <@871789979683155999> for your question, do you mean to suggest that by ha...

Hello @gleaming prawn , yes, that's exactly what I’m suggesting. However, I recognize that API limits—especially in Prize competitions—make multiple sub-account submissions ineligible.

Instead, perhaps Kaggle could implement a 'micro-environment' within the kbench library or the SDK. This would essentially act as a special emulator of the Kaggle competition, using the same datasets and leaderboard logic. This way, we could benchmark different LLMs with personal architectures (like our planning and discovery loops) without needing constant API calls. The final benchmarking score would essentially be a local version of the leaderboard.

Just sharing some thoughts!

vagrant spear
#

Hey @carmine yew and/or @everyone, if anyone's interested: I'm beta testing new software for design features. I would love your take, or anyone's take, on how to tackle a benchmark related to design features and the models offered in this new beta. I just want to brainstorm more ideas for it.

I'm already logging data choke points; I just don't know how I would tackle it, is all.

I would love anyone's thoughts here, for any approach.

Thx!

trim swan
#

Hi everyone 👋

Sixth WWTP benchmark report — confined space emergency response. 22 LLMs, 16 options, 68.2% pass rate (adjusted 72.7%). The question tests whether models can identify the complete safety protocol, not just a "good enough" one.

→ 4 models got 7/8 safety elements right but missed the engineer's physical presence at the entry site. In confined space work, almost complete is not complete.

→ Gemma 1B recommended immediate unprotected entry and called it "the most logical and safest course of action." Confidently wrong, not just wrong.

→ DeepSeek R1's parsing issue hits for the 4th time in 6 benchmarks. Correct answer lost to bold markdown formatting.

→ Claude and Gemini both hit 100% — first time both families achieved perfect scores on the same question.

→ After 6 benchmarks, zero models have passed every question. The funnel: 22 → 17 → 15 → 2 → 0 → 0.

Single question, single run — observations, not conclusions.

For Report:
https://github.com/mmehmetisik/wwtp-engineering-benchmark/blob/main/report/WWTP Confined Space Emergency Response.pdf

Benchmark link:
https://www.kaggle.com/benchmarks/tasks/mehmetisik/wwtp-confined-space-emergency-response

carmine yew
# vagrant spear Hey <@871789979683155999> and/or @everyone anyone interested, I'm beta testing...

Hi @vagrant spear , it’s great work! Honestly, I’m probably the last person here with design experience, but to evaluate LLM or AI agent outputs, I think there are two main ways to handle the subjective side:

  1. Agent Jury (3-5 Judges): Use a "judging committee" of other AI agents. You can give them specific personas (like a UX Designer or a Developer) to evaluate against "good design" examples and provide a more balanced qualitative metric.

  2. The "Eye Test": Since design is ultimately subjective, use your own eye to select the best outputs and convert those preferences into a numeric score for your benchmark.
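As a rough sketch of the jury idea, per-judge scores could be combined into a few summary numbers. Everything here is hypothetical: the persona names, the 0-10 scale, and `aggregate_jury` itself are placeholders for illustration.

```python
from statistics import mean, median
from typing import Dict


def aggregate_jury(scores: Dict[str, float]) -> Dict[str, float]:
    """Combine per-judge scores (assumed 0-10) into summary metrics."""
    values = list(scores.values())
    return {
        "mean": mean(values),
        "median": median(values),
        "spread": max(values) - min(values),  # large spread = judges disagree
    }


# Hypothetical scores from three judge personas for one design output.
jury = {"ux_designer": 8.0, "developer": 6.0, "accessibility_reviewer": 7.0}
summary = aggregate_jury(jury)
```

The median is less sensitive to a single outlier judge than the mean, and a large spread can flag outputs worth checking by eye.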

It’s a cool project—keep it up!

worthy viper
#

(Cross-posting from announcements)

We just launched a hackathon with Google DeepMind on building benchmarks to better measure AGI. The hackathon is titled "Measuring Progress Toward AGI - Cognitive Abilities" and has $200k in cash prizes.

You can sign up here: https://www.kaggle.com/competitions/kaggle-measuring-agi/overview

Excited to see what you folks build!

crude osprey
#

I’m new here, so please correct me if I'm wrong.

I see Kaggle benchmarks only support text-only input tasks, not multimodal tasks (e.g., text + image input). Do you guys have any plans to extend this feature? Thanks!

worthy viper
#

Text+image inputs and text outputs only for now. And yes! We are working on improving multimodality - coming in the next few weeks! (tentatively web URL browsing + video input)

#

^^not all models support all modalities btw. We will also make it clearer in the UI for you!

crude osprey
#

text+image inputs and text outputs only now.

Thanks @worthy viper! But one of my team members got this error with text+image inputs:

TypeError: LLMChat.prompt() got an unexpected keyword argument 'images' (Sonnet/Opus)

Do you have any idea?

carmine yew
# crude osprey > text+image inputs and text outputs only now. Thanks <@1422310939008434186>! B...

Hi @crude osprey please check the public community benchmarks. You’ll find several similar tasks that successfully combine text and images. For my part, I have already published two different benchmarks using multimodal tasks (image + text) that might be helpful. I hope my last phrase didn't come across as self-promotion. I've received a couple of warnings on my account for sharing links or guiding other participants (self-promotion), even though my only goal was to guide users to helpful resources. Please check the Public Community Benchmarks; you’ll find many great examples of multimodal tasks.

trim swan
#

Hi everyone 👋
I've been running a series of LLM behavioral analysis benchmarks on the Kaggle Benchmarks platform — 13 engineering questions, 22 models, each with a dedicated report analyzing how models reason and where they fail.

The focus is not on accuracy rankings. It is on failure patterns: why models anchor on the most salient feature, how wrong answers become more convincing as model knowledge increases, and why no model family maintains consistent performance across all questions.

Some findings from 13 benchmarks:

→ Cumulative funnel: 22 models started, 0 have passed all 13 — the funnel closed at benchmark #5 and never reopened
→ Failure Mode Spectrum: failures range from hazard blindness to judgment errors where models identify the risk, evaluate it, and still choose wrong
→ No family is reliable: Claude held a 9-benchmark perfect streak before falling to 40%. Gemini oscillated between 33% and 100%. Qwen scored 0/4 on one benchmark and 100% on the next
→ Newer ≠ Better, Bigger ≠ Better: older Gemini models outperformed newer ones on multiple benchmarks. Smaller models outperformed larger ones in several cases

Each benchmark has a full behavioral analysis report (PDF) in the repository.

GitHub: https://github.com/mmehmetisik/wwtp-engineering-benchmark
Kaggle: https://www.kaggle.com/benchmarks/mehmetisik/wwtp-engineering-benchmark
Single question per benchmark, single run per model — observations, not conclusions.

trim swan
#

Hi everyone 👋

I'm building an industrial acoustic safety benchmark. kbench supports image inputs via images.from_bytes() but I need audio input (WAV files). Is there a way to send audio to models like Gemini 2.5 that support native audio? Or is this a planned feature?

carmine yew
# trim swan Hi everyone 👋 I'm building an industrial acoustic safety benchmark. kbench sup...

Hi @trim swan,

I think they are still working on adding audio (WAV files) support to KBench. So far, there hasn’t been any update on the GitHub repository regarding this.

However, you can use AI Studio and test your project with your Gemini API. For live conversations, I believe Gemini 2.5 Flash native audio is currently unlimited—you can verify that yourself.

I’ve been working on a similar project recently, and I successfully built a web app that supports multilingual live assistance, including Darija. It’s quite impressive how Gemini 2.5 Flash handles native audio speech.

My goal was to connect it to a Discord bot to enable live conversations in a Discord channel. However, I ran into several issues—mainly with receiving audio messages correctly from Discord, and also with getting Gemini 2.5 Flash native audio to initiate conversations.

Interestingly, when I used Gemini 2.5 Flash with Google Speech for text-to-speech (TTS), it worked well, but speech-to-text (STT) was still missing.

At this point, it seems that many of these issues are not due to Gemini’s capabilities, but rather limitations when integrating with third-party platforms like Discord. It could also be that Google doesn’t fully support using Gemini 2.5 Flash native audio with third-party apps yet—or possibly it’s an issue on Discord’s side.

orchid glacier
#

Hello everyone!
I built a terminal UI tool called Evalflow for anyone running Kaggle Community Benchmarks.

To use it, you write your benchmark task notebooks on Kaggle using the template notebook included in the repo. That template is what produces the CSV outputs Evalflow expects. Kaggle runs those notebooks across models, and Evalflow handles everything after: pulls all task CSVs with a single slug, shows a cross-model leaderboard, and merges everything into two research-ready datasets, an SFT dataset and a preference pair dataset for DPO/RLHF.

The final step is publishing both files as a public Kaggle Dataset.

https://github.com/4kaws/evalflow

I would love to hear your feedback!

bright anvil
#

@orchid glacier This is awesome! We are actually building something to make running tasks easier from the command line: https://github.com/Kaggle/kaggle-benchmarks/pull/90 Hope your work can be further extended based on this.

GitHub

Kaggle Benchmark Client
This PR introduces the BenchmarkNotebookClient SDK to manage Kaggle benchmark tasks. These tasks execute as Kaggle notebooks tagged with the personal-benchmark keyword.
APIs...

orchid glacier
#

Thank you, this feature actually helps with a few key parts of my current implementation. Most importantly, I was more or less guessing the kernel output calls, and I no longer need a custom notebook template for tasks in order to create CSVs; the JSONs provided are perfect!

P.S. It would be very useful to be able to pull the task notebook outputs associated with a benchmark page for creating datasets from the LLMs' results. Currently, you cannot view notebook outputs for all the individual runs made for each model; adding that capability would be another highly valuable feature. Maybe something like a dedicated benchmark endpoint like GetBenchmarkTaskRuns(owner_slug, task_slug) that returns the full .run.json content per model.

orchid glacier
#

Quick update on Evalflow since that last message.

I upgraded it to use the new kbench format with @kbench.task and .run.json outputs instead of CSVs, and switched task discovery to use the benchmark leaderboard API which made it much more reliable. I also added a Monitor tab that watches benchmarks daily and publishes updated datasets automatically, either via local cron or GitHub Actions, so the pipeline runs even when the machine is off.

The work in PR #90 around get_results() and .run.json artifacts helped a lot!

worthy viper
#

P.S. It would be very useful to be able to pull the task notebook outputs associated with a benchmark page for creating datasets from the LLMs' results. Currently, you cannot view notebook outputs for all the individual runs made for each model; adding that capability would be another highly valuable feature. Maybe something like a dedicated benchmark endpoint like GetBenchmarkTaskRuns(owner_slug, task_slug) that returns the full .run.json content per model.

We are working on that! Give us a few weeks - it's in our long list of things to do!

worthy viper
#

I'm trying EvalFlow right now, but it's not working for me! It reports `task(s) had no .run.json output` even though that's not the case! @orchid glacier

#

doing it for nicholaskanggoog/fofr-sample-benchmark

orchid glacier
# worthy viper I'm trying EvalFlow right now, but it's not working for me! `task(s) had no .ru...

I just tried to replicate the way you have the benchmark set up and I found the problem. When I create a new task and then try to add it to my benchmark (with the notebook being private), for a split second in the Kaggle UI I get this message: "No models found for the selected task." (I can't put a screenshot here so I will give a link: https://imgur.com/a/hPWB4CG). After that, the task appears in the benchmark list, but if I try to pull the JSON outputs I get this: https://imgur.com/a/Q1TBSPF. I don't know why the platform does this... (might be a bug)

If I set the notebook back to public it works

vagrant spear
#

Good Saturday everyone!
I just published a new dataset + notebook that turn my generative image workflow into a measurable, visual dataset:

Dataset:
Never-Forget-The-Basket: Multi‑Arc Generation
https://www.kaggle.com/datasets/gastondana/never-forget-the-basket

Notebook:
Multi‑Arc Generation – v6-Giant-Chicken-Nugget-Sandwich
https://www.kaggle.com/code/gastondana/multi-arc-generation

The notebook:

Builds a dataframe directly from the folder structure of each arc, version, and stage of the images.

Tracks how many images I produce per version to capture continuity vs. refinement across the arc.

Uses line charts and image grids to surface how ideas branch, tighten, or get abandoned over time.

If you’re experimenting with generative workflows, ARC‑style reasoning tasks, or benchmarking your own creative iteration, I’d love feedback and ideas for new metrics or plots to include in the next version.

Apparently the image quality was best at refinement 4, out of 6 in total.

Thx!

vagrant spear
#

Spent today contributing to the Kaggle Benchmarks repo. Got 3 PRs open:

  • PR #99: small fix to add a regex flags param to an assertion function that a reviewer had requested but nobody followed up on.

  • PR #100: error handling fix for the active outage that's been breaking all 27 models since yesterday. If you've been hitting "User location not supported for this model/API" on any kbench model call, that's a Kaggle backend issue, not your code. I traced the full call path, documented the root cause on issues #85 and #96, and added a proper error message to the library so it's clearer what's happening.

  • PR #98: a CLI entry point for kaggle-bench from a prior session.
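
For anyone wondering what a regex flags parameter on an assertion looks like in practice, here is a minimal sketch. `assert_matches` is a hypothetical stand-in written for this example, not the function touched in PR #99 and not part of the kbench API.

```python
import re

def assert_matches(pattern: str, text: str, flags: int = 0) -> None:
    """Raise AssertionError unless `pattern` is found somewhere in `text`.

    `flags` is passed straight to re.search, so callers can combine
    re.IGNORECASE, re.MULTILINE, re.DOTALL, and so on.
    """
    if re.search(pattern, text, flags) is None:
        raise AssertionError(f"pattern {pattern!r} not found in text")

# Passes only because the flags allow a case-insensitive, per-line match.
assert_matches(r"^yes$", "maybe\nYES", flags=re.IGNORECASE | re.MULTILINE)
```

Defaulting `flags` to 0 keeps existing calls working unchanged.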

Also worth noting: if anyone's been trying to use Evalflow (github.com/4kaws/evalflow) to pull benchmark results, it's also broken right now as a downstream effect of the same outage. It should be back once the Kaggle team fixes or updates the proxy backend.

Do keep me updated,
Thx!

worthy viper
#

thank you for raising the issues above! we'll look into it first thing monday

worthy viper
#

For folks experiencing the "User location not supported for this model/API" issue, it is most likely caused by your use of a VPN

worthy viper
#

(X-posting from Announcements)

Can we truly benchmark AGI? 🧠

Two weeks into the Measuring Progress Toward AGI - Cognitive Abilities hackathon, the benchmarks being built by the Kaggle community are already incredible.

To help refine your submissions and ensure they align with the core research goals, we’re hosting a live deep-dive session and AMA on the Kaggle YouTube channel (https://www.youtube.com/@kaggle).

What we’re covering:
20-Min Deep Dive into the paper and what we’re looking for in the hackathon
20-Min Live AMA: Your chance to ask the team anything about the hackathon or the paper

The Panel: Nicholas Kang (Kaggle Product Manager), Oran Kelly (Product Manager, Google DeepMind) and Ryan Burnell (Staff Research Scientist, Google DeepMind and co-author, Cognitive Framework paper)

Set a reminder for the livestream here: (https://www.youtube.com/live/9YYiWs6gNV0)

Here's your chance to ask questions about the hackathon to refine your submissions. Hope to see you all there!

carmine yew
#

Hi everyone,
How can I get the extra quota ($50/day, $500/month)?

I read this note:
“Upon joining this hackathon, your Kaggle account will be provisioned with extra quota ($50/day, $500/month) to run the AI models for your benchmark. Read Rules section 3.4.b to learn more.”

I’m currently in the middle of my work, but I’m not sure how to activate or access the $50/day and $500/month quota to continue.

Could anyone please help me with the process?

orchid glacier
#

Hey everyone, wanted to share some findings from testing the Kaggle Benchmarks API this week.

I identified and reported three platform-level bugs that affect benchmark workflows:

Benchmark owners get 403 on get_benchmark_leaderboard for their own private benchmarks (Permission 'benchmarkVersions.getLeaderboard' was denied)

kernels_list(parent_kernel=...) also returns 403 for the owner of a private benchmark, making task discovery impossible without making the benchmark public

Public notebooks created via "Copy & Edit" are completely inaccessible through the API (list_kernel_session_output, kernels_status, everything returns Permission 'kernels.get' was denied) despite being fully visible on kaggle.com. Confirmed across two independent benchmarks: gpreda/does-llms-know-history (6/10 tasks affected) and anhoangvo/lemonasso (5/15 tasks affected)

I also submitted a PR to Kaggle/kaggle-cli that improves error handling in kernels output so users get a clear, actionable message when hitting 403 instead of a raw HTTPError.

Issue: https://github.com/Kaggle/kaggle-cli/issues/952
PR: https://github.com/Kaggle/kaggle-cli/pull/951
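
To illustrate the kind of error handling described above, here is a small sketch that turns a raw HTTP error into an actionable message. It uses only the standard library; the function name and the message wording are my own for this example and are not taken from the PR.

```python
import io
import urllib.error

def describe_http_error(err: urllib.error.HTTPError, resource: str) -> str:
    """Turn an HTTP error into a message that says what to check next."""
    if err.code == 403:
        return (
            f"403 Forbidden for {resource}: the API denied access. "
            "Check that the notebook is public or owned by the authenticated "
            "account, and that the slug refers to the kernel itself."
        )
    return f"HTTP {err.code} for {resource}: {err.reason}"

# Simulated error so the sketch runs without network access.
fake = urllib.error.HTTPError(
    "https://example.invalid", 403, "Forbidden", {}, io.BytesIO()
)
message = describe_http_error(fake, "owner/kernel-slug")
print(message)
```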

P.S. I have noticed that you can no longer select a private notebook to be part of a public benchmark, as it gets greyed out in the UI. However, if you Copy & Edit an existing task that is public and publish it as public again, then add it to the benchmark, the "No models found for the selected task." message still flashes for a split second, and when you try to pull the output it does not work.

worthy viper
#

Thanks @orchid glacier - forwarding that info to our eng team. We're making a big improvement to the off-platform SDK experience (right now, it's just hacky solutions that folks like yourself have used with our existing CLI/kagglehub)

worthy viper
#

hmm that's odd. What's your kaggle username?

timber gale
#

@carmine yew - We figured out the issue and your quota should be fixed. You should get 50/500 as per competitions.

#

I assume you were given higher level access during the EAP which conflicted with the hackathon quota.

#

It took precedence over the hackathon quota.

#

Note - once the hackathon is over, you will be reset to the normal quota level. Contact us again to get higher-level access.

timber gale
#

Yea. That is what I saw in our system.

trim swan
#

Hi everyone 👋

Just ran OpenAI's GPT-5.4 Mini on my WWTP LLM Defense Benchmark — 36-hour SCADA cyber attack simulation, 30 runs per model.
Result: 94.5/100 — second highest score in the benchmark, 0.8 points behind Claude Opus 4.6.
Here's the head-to-head:

| Metric | Opus 4.6 | GPT-5.4 Mini |
| --- | --- | --- |
| Score | 95.3 | 94.5 |
| UNAWARE BARE FP | 0.0% | 3.3% |
| AWARE BARE Detection | 76.7% | 94.0% |
| UNAWARE BARE Detection | 42.0% | 42.0% |
| Evacuations | 0 | 0 |
| Authority Gap | +0.4 | +1.3 |
| Non-S5 Rejections | 5 | 0 |
| Cost | $20.8 | $1.92 |
| Duration | 194 min | 18.7 min |

orchid glacier
#

I have a question: if I update my task notebook to a newer version, is it correct that it can't be updated on my benchmark page? (I just tried, but the notebook there is still at v2 even though I've saved the task several more times and I'm at v5.)

orchid glacier
#

Hmm, I think I'm just doing something wrong, because I can't see where to select other versions. If I select "add task", it just shows that the notebook is already selected, so it might be my fault.

Update: if you just re-run the task as-is, it will not work. I added a comma and now it updated itself.

orchid glacier
#

Hey @worthy viper, sorry for bothering you. I know you guys covered this in the Q&A section of the livestream, but I still don't really understand how we should create/use a dataset for the hackathon. So, we choose a track and then search on Kaggle/the internet, or maybe even generate some synthetic data as well. And after that we reference the dataset in our LLM prompt in the task notebook? Basically, we create an eval prompt and point it at our dataset so the model answers about a specific topic regarding the data from that dataset (at least that's what I understood about this process). A second question is about the way we prompt the LLMs: I think in the SDK we can only do one-shot prompting per task. Is that enough, or should we find a workaround for a few-shot approach? (Is a few-shot approach considered higher quality, let's say? I know this might depend on what exactly you are trying to evaluate)...

worthy viper
#

@orchid glacier 1) Yup, the dataset can either be for the prompts or for references in the prompt. There isn't a "correct" way to do it; it's whatever's needed for your benchmark. For example, your dataset might just be a list of prompts where you give a few examples of something the LLM doesn't know, and then you test whether the LLM has learnt from those examples.
2) We do allow for multi-turn prompting. Our github repo + DeepWiki is a great resource to understand this better: https://deepwiki.com/search/can-we-do-multiturn-prompting_e1ab179f-1278-4a45-bc6d-538feee67c61?mode=fast

sinful mesa
# orchid glacier Hey everyone, wanted to share some findings from testing the Kaggle Benchmarks A...

Hey @orchid glacier , thanks for reporting! Here is a response to each of your issues:

  1. The 403 error on get_benchmark_leaderboard was indeed a bug that we have fixed.
  2. kernels_list might be failing for you because you're using the benchmark or the task slug instead of the actual kernel slug, which may differ from the task or benchmark slug. Can you please try with the correct kernel slug? We are working on a better error message to indicate that this may be the issue.
  3. Same as 2, from your post on the github it looks like you were querying using the task slug instead of the kernel slug, which was different in the examples you had highlighted. Hopefully the improved error message will reduce the confusion in the future.
  4. Yes it is by design that you cannot add a private task to a public benchmark. Is this still an issue for you?
orchid glacier
# sinful mesa Hey <@240379507054215169> , thanks for reporting! Here is a response to each of ...

Thanks for clarifying. I understand the fix is to use the correct kernel slug. The problem is that get_benchmark_leaderboard only returns benchmark_task_slug in TaskResult and there is no field for the kernel slug. So I have no way to know the correct kernel slug from the API alone. The idea of my project is to extract the kernel output from the benchmark leaderboard, so I don't have to search for each kernel output manually.

Concrete example from gpreda/romanian-history:

TaskResult.benchmark_task_slug returns ceausescu-nationalism-resistance-to-moscow-or-authoritarian-myth-building
Actual kernel slug is ceausescu-nationalism-authorit-myth-building
These are completely different, as you mentioned, so I implemented a fallback that probes slug variants (token removal, prefix truncation). That improved coverage from 8/29 to 21/29 tasks, but 8 remain unreachable because the kernel slug has no systematic relationship to the task slug.

The fix would be to include the kernel slug/ref in TaskResult...
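
For context, the slug-variant fallback mentioned above can be sketched roughly like this. It is a simplified illustration, not Evalflow's actual code: `slug_variants` and its `min_tokens` parameter are hypothetical names, and the real probing would then try each candidate against the API.

```python
def slug_variants(task_slug: str, min_tokens: int = 2) -> list:
    """Generate candidate kernel slugs from a task slug, original first."""
    tokens = task_slug.split("-")
    seen = {task_slug}
    variants = [task_slug]

    def add(candidate_tokens):
        candidate = "-".join(candidate_tokens)
        if len(candidate_tokens) >= min_tokens and candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)

    # Prefix truncation: drop trailing tokens one at a time.
    for end in range(len(tokens) - 1, 0, -1):
        add(tokens[:end])
    # Token removal: drop one interior token at a time.
    for i in range(1, len(tokens) - 1):
        add(tokens[:i] + tokens[i + 1:])
    return variants

print(slug_variants("a-b-c-d"))
```

A kernel slug that abbreviates a token (`authorit` instead of `authoritarian` in the example above) is never produced by this kind of probing, which is the sort of case that stays unreachable without a kernel slug field in the API response.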

TL;DR: Thank you once again for clarifying this, I totally overlooked the fact that the task and the kernel could have different slugs. As Nicholas said a few days ago you guys are working on a lot of new features so it would be nice to see something implemented that might help my use case..

P.S.: If I copy & edit a task, change its name, run it and make it public, then when I add it to the benchmark page I get this https://imgur.com/a/yscSZs0 and after a second it is added to the benchmark.

worthy viper
# orchid glacier Hey everyone, wanted to share some findings from testing the Kaggle Benchmarks A...

Btw @orchid glacier on the issue below.

kernels_list(parent_kernel=...) also returns 403 for the owner of a private benchmark, making task discovery impossible without making the benchmark public

Could you explain more what you're trying to do here? Presumably you want to access the notebook/kernel because you just want to retrieve the run.json file or is it something else?

Also, what do you plan to do with it? It'll help us design a better solution :)

orchid glacier
#

Yes, my goal is to retrieve the run.json file because I want to use that JSON data to structure two CSVs, one for SFT and one for RLHF, and create a dataset out of that, which I publish on Kaggle as well.
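
To make that concrete, here is a rough sketch of splitting scored responses into an SFT table and a preference-pair table. The flat record shape (`model`, `prompt`, `response`, `score`) is a hypothetical simplification for illustration; real .run.json files are structured differently and would need to be flattened first.

```python
import csv
import io
from itertools import combinations

# Hypothetical flattened records, one per model response.
records = [
    {"model": "model-a", "prompt": "Q1", "response": "good answer", "score": 1.0},
    {"model": "model-b", "prompt": "Q1", "response": "weak answer", "score": 0.0},
    {"model": "model-c", "prompt": "Q1", "response": "another good answer", "score": 1.0},
]

def sft_rows(records, threshold=1.0):
    """Keep prompt/response pairs whose score meets the threshold."""
    return [{"prompt": r["prompt"], "response": r["response"]}
            for r in records if r["score"] >= threshold]

def preference_rows(records):
    """Pair responses to the same prompt where one scored strictly higher."""
    rows = []
    for a, b in combinations(records, 2):
        if a["prompt"] != b["prompt"] or a["score"] == b["score"]:
            continue
        chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
        rows.append({"prompt": a["prompt"], "chosen": chosen["response"],
                     "rejected": rejected["response"]})
    return rows

def to_csv(rows):
    """Serialize a non-empty list of dict rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(sft_rows(records)))
print(to_csv(preference_rows(records)))
```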

What would make evalflow better and my work easier would be to retrieve kernel output directly from the tasks that are present on the benchmark page. So the workflow would look something like: searching for tasks on the benchmark page -> getting the tasks kernels -> retrieving run.json for each task.

Oh one more nice feature would be to get the run.json for each model. Right now the way I do it, is that I'm listing all the models that I want to generate a run.json for inside a single task kernel (running them with a for loop), and then I'm looking for the outputs of that notebook. That's because if I just run and save the task first for only one model... later when I add more models from the UI, I can't access those output files for each model.. (I hope I described this ok, here is one example of how I'm doing that at the moment: https://www.kaggle.com/code/junesdata/how-many-r/notebook?scriptVersionId=306495021)

trim swan
#

Hi everyone 👋
What happens when you teach AI what to listen for? Maybe we can save a life.

🔊 Working on something unusual. Somewhere between sound, noise, and seconds. Based on a real incident.

Stay tuned 🎧

worthy viper
#

@trim swan - We're working on enabling audio input for models 🙂 LMK if you'd like to test it!

carmine yew
#

Hello Kaggle Team,

I have a broader question that goes beyond the current LLM-focused Hackathon, and instead relates to the future of AI-agentic workflows on Kaggle.

With the release of Gemma 4, many of us are starting to see that open and lightweight models are becoming increasingly capable. On some tasks, Gemma 4 appears to approach the performance of **Gemini 2.5 Flash**, and in certain scenarios it can even compete with or outperform recent versions of models like DeepSeek, Qwen, GLM, and GPT-5.4, and even Claude 4.6.

Because of this, I believe many Kagglers will soon start experimenting with building AI agents for Kaggle competitions, whether for research automation, workflow assistance, or even end-to-end competition support.

The main bottleneck, however, is not only model quality — it is also infrastructure:

  • limited GPU / RAM (on a local machine)
  • limited support for persistent agent workflows
  • and the lack of a more flexible development environment

So I wanted to ask:

Are there any plans to expand Kaggle’s notebook/kernel environment to better support AI-agent workflows?
For example:

  • a more VS Code-like development experience
  • stronger support for multi-file agent projects
  • or even some way to connect Kaggle compute/resources with a local development environment

I’m not sure how feasible this would be technically, but I think it could be a very meaningful step for the Kaggle ecosystem as agentic AI becomes more practical and more widely used.

Thank you!

vagrant spear
# worthy viper <@916414691213991987> - We're working on enabling audio input for models 🙂 LMK ...

Timing is everything! 🏎️

I just pushed the Spectral Soul v2.1-2026.04.05-Milli build live to the platform this morning.

I’ve been subjecting 33 models, including Lyria 3 & Suno v4.5, to a randomized 21-point gauntlet across 7 forensic tiers. We’re talking Phase Correlation, Chroma Stability, and 'Ghost Map' differential analysis with 0.0000 precision. If you’re enabling audio input for models, I’ve got the perfect stress test ready for you to break.

Current telemetry shows Claude Opus 4.5 leading the pack at 415.55, but the entire industry is still sitting in the Forensic Failure zone (<500). I'm waiting for the first S-Tier to actually break the sound barrier.

Check the Registry & Deep-Scan Results here:

Official Benchmark: Spectral Soul: Generative Audio Standard - https://www.kaggle.com/benchmarks/gastondana/spectral-soul-generative-audio-standard/versions/1

Spectral Analysis Dataset: AI Music Benchmark Data - https://www.kaggle.com/datasets/gastondana/ai-music-benchmark-spectral-analysis/data

Make sure to look for some 🥚🥚🥚 today!

Thx

celest bronze
#

Hi,
I want to provide feedback about the graph view of the results in the benchmark UI. It seems like there is an error, and I believe it may be caused by an inconsistency between task result types, specifically boolean and float values.

From what I observed, some tasks return pass or fail results as boolean values, while others produce numeric scores. This difference may be affecting how the data is aggregated and visualized in the graph. As a result, the chart appears incorrect or misleading, with some models showing zero values or unexpected distributions.

Boolean values are used for individual samples to showcase the model’s chat history, following the default behavior of kbench.task. For full dataset evaluation, where chat history is not displayed, the benchmark returns the average accuracy as a numeric float.

I can convert the boolean results to float values if needed. However, I would like to confirm first whether this is a bug, as it could be useful for improving the UI in the future. Re-running the models would also incur additional cost, so I prefer to verify before proceeding. I hope this is helpful.

Screenshot: https://ibb.co/CpDHr6mB ( I have hidden some parts of the screenshot for privacy.)
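
For reference, the boolean-to-float conversion I mentioned would be something like the sketch below. `to_score` is a hypothetical helper and the mixed list is illustrative; it only shows the normalization, not how the benchmark UI aggregates results.

```python
def to_score(result) -> float:
    """Map a task result to a float; booleans become 1.0 or 0.0."""
    # Check bool explicitly so True/False are handled as pass/fail.
    if isinstance(result, bool):
        return 1.0 if result else 0.0
    return float(result)

# Illustrative mix of pass/fail results and numeric accuracy scores.
mixed = [True, False, 0.75, True]
scores = [to_score(r) for r in mixed]
average = sum(scores) / len(scores)
print(scores, average)
```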

orchid glacier
#

Hi! I have a question about the hackathon: is it OK if I discuss the findings in the writeup and link to a Colab/Kaggle (normal) notebook in addition to the benchmark page?
I tried to wrap as much as I could using the benchmarks SDK, but you won't be able to see the full results by just looking at that.

wary sphinx
#

Hi, I'm facing a small problem. I'm building a benchmark with different tasks. I build the tasks, and then once all the models have been evaluated, I add the task to the benchmark. Everything was working well until today. Two of my tasks show only one model's result in the benchmark section, but when I click on the task, I can see all the model results. I can even compare model outputs and see all the results. What happened? Why don't those results appear in the benchmark?

carmine yew
# wary sphinx Hi, I'm facing a small problem. I'm building a benchmark with different tasks. I...

Same here — I think we’re all facing the same issue. I don’t think we need to worry or do anything for now. Just leave your work as it is, and when they’re ready to review it, they’ll run it themselves. The only possible issue might be the waiting time for some tasks to finish, since in my case, some of them take up to 1 hour. Still, they probably have enough quota to run everything at once.

wary sphinx
#

Good, I thought it was on my side. Let's let them solve the issue then. Meanwhile, I'm continuing to build other tasks.

lapis coyote
#

Hi folks, thanks for the reports. This was an issue with pagination, we've since resolved it - are your results visible now?

carmine yew
# lapis coyote Hi folks, thanks for the reports. This was an issue with pagination, we've since...

Thanks, everything is working well. I still have one question that needs clarification: my dataset and benchmark cover three tracks, but when I go to submit my write-up, I can only select one track. What would be the best way to handle this? Should I submit three separate write-ups, one for each track?

My research question is: “Are LLMs Truly Intelligent or Just Cleverly Programmed?” To properly address this question, my benchmark is structured into three tracks: Executive Functions, Metacognition, and Social Cognition. Each track contains specific tasks, with a total of 15 tasks across all tracks.

wary sphinx
# lapis coyote Hi folks, thanks for the reports. This was an issue with pagination, we've since...

Yes, the result is now visible. Also, I think you should check how the benchmark average is rounded. This can be misleading for the ranking.

For example:
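| Model | Seq | Prob | Assoc | Lang | Concept | Sum | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro Preview | 0.69 | 0.91 | 0.89 | 0.82 | 1.00 | 4.31 | 0.862 |
| DeepSeek V3.2 | 0.66 | 0.89 | 0.82 | 0.87 | 0.98 | 4.22 | 0.844 |
| Gemini 3 Flash Preview | 0.68 | 0.91 | 0.89 | 0.79 | 0.99 | 4.26 | 0.852 |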

In my benchmark, the displayed averages are DeepSeek V3.2: 0.85 and Gemini 3 Flash Preview: 0.85.

Gemini 3 Flash Preview is ranked 2nd instead of 3rd.
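
A small sketch of the general point: sort on the full-precision average and round only for display. The model names and values are hypothetical, chosen so that two models show the same value at two decimals. This is not a claim about how the platform computes its averages.

```python
# Hypothetical unrounded averages; model-b and model-c both display as 0.85.
averages = {"model-a": 0.8620, "model-b": 0.8451, "model-c": 0.8549}

def leaderboard(averages, display_decimals=2):
    """Sort on full-precision averages; round only for display."""
    ordered = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return [(rank, model, f"{avg:.{display_decimals}f}")
            for rank, (model, avg) in enumerate(ordered, start=1)]

rows = leaderboard(averages)
for rank, model, shown in rows:
    print(rank, model, shown)
```

When two rows display the same value, showing one more decimal (`display_decimals=3`) makes the ordering readable.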

vagrant spear
#

Happy Saturday everyone,

I just released The Forge (v5.2.0), a master notebook that aggregates 132 data points into a single "Picture Book" of model performance. This approach moves beyond simple "chat" & subjects 33 flagship models to a high-frequency industrial stress test for 3D engineering and spatial logic.

What's inside:

  • The Qwen Collapse: A visual breakdown of how "Thinking" models hit a logic wall when facing industrial constraints.

  • The Logic Fracture Heatmap: Statistical proof that foundational theory does not equal industrial production.

  • The DNA: Radar plots mapping the geometric signatures and spatial discipline of the world's top models.

Master Notebook: 🔗 The Forge

Benchmark Hub: 🔗 Syn_Tax: 3D Spatial Intelligence Gauntlet

The Project Hub: 🌐 The Forge Board

A dedicated portal visualizing the bridge between generative AI & production-ready 3D geometry.

#

Also, I'm enjoying the awesome updates, by the way, enjoy the rest of the day/weekend ahead,
Thx!

orchid glacier
#

Hi! Regarding the hackathon, is it OK that I ran out of quota today and couldn't run 4 models in my benchmark? Can I run them tomorrow when the quota resets? I'd also be fixing some edge-case errors (the results will be the same).

Also, can I share my writeup on social media tomorrow after the competition ends?

median fulcrum
#

🧠 Join CVPR 2026 Challenge: Foundation Models for General CT Image Diagnosis!

Develop & benchmark your 3D CT foundation model on a large-scale, clinically relevant challenge at CVPR 2026!

🔬 What's the Challenge?

Evaluate how well CT foundation models generalize across anatomical regions, including the abdomen and chest, under realistic clinical settings such as severe class imbalance.

Task 1 – Linear Probing: Test your frozen pretrained representations directly.

Task 2 – Embedding Aggregation Optimization: Design custom heads, learning schedules, and fine-tuning strategies using publicly available pretrained weights.

🚀 Accessible to All Teams

Teams with limited compute can compete via the Task 1 - Coreset (10% data) track, and Task 2 requires no pretraining — just design an optimization strategy on top of existing foundation model weights.

Official baseline results offered by state-of-the-art CT foundation model authors.

A great opportunity to build experience and strengthen your skills: Task 1 focuses on pretraining, while Task 2 centers on training deep learning models in latent feature space.

📅 Key Dates

  • Validation submissions: – May 10, 2026
  • Test submissions: May 10 – May 15, 2026
  • Paper deadline: June 1, 2026

We’d love to see your model on the leaderboard and welcome you to join the challenge!

👉Join & Register: https://www.codabench.org/competitions/12650/
📧Contact: medseg20s@gmail.com

celest bronze
#

Hi 👋,

I have been using the Kaggle Benchmarks framework for several months, and I have decided to transition the majority of my research to the Kaggle environment. I find it highly convenient to work within the notebook environment powered by the Kaggle Benchmark framework.

However, I have a concern regarding data privacy. Since Kaggle Benchmark supports a variety of model families, such as Gemini, GPT, and Claude, I would like to know if my data is sent to these LLMs and used as training data for future models.

This would not be an issue if my benchmark were already published. But in some cases, benchmarks take a long time to develop and test. I would feel sad if a new model were to leak my benchmark information before I had the chance to officially publish it.

trim swan
#

Hey everyone! 👋

Here's my Metacognition Track submission:
WWTP LLM Defense Benchmark — 32 LLMs defend a real wastewater treatment plant against a 36-hour Stuxnet-style SCADA cyber attack.

Key finding: Detection has zero correlation with score (r=0.12, p=0.534). Calibration predicts score (ρ=−0.48, p=0.009). The benchmark doesn't reward the model that detects the most — it rewards the one that knows when its detection is real.
32 models · 6 families · 28 scored · 4 crashed · 36,754 responses

📝 Writeup: https://www.kaggle.com/competitions/kaggle-measuring-agi/writeups/wwtp-llm-defense-benchmark-metacognition

📊 Benchmark: https://www.kaggle.com/benchmarks/tasks/mehmetisik/wwtp-llm-defense-can-ai-protect-critical-infrastructure

📁 Dataset: https://www.kaggle.com/datasets/mehmetisik/wwtp-llm-defense

💻 Code: https://www.kaggle.com/code/mehmetisik/wwtp-llm-defense-critical-infrastructure

orchid glacier
worthy viper
#

Hi, for folks who participated in the AGI hackathon: if you take a look at your quota now, has it reset to $0 or are you basically maxed out?

carmine yew
carmine yew
worthy viper
carmine yew
orchid glacier
wary sphinx
#

Hi everyone!

I created the ATLAS benchmark. Most benchmarks test what models have memorized. ATLAS tests something much harder: whether a model can acquire new knowledge through experience during the evaluation itself, by discovering hidden rules through trial-and-error, not by retrieving answers from training data.

ATLAS is a six-engine benchmark that evaluates the full learning profile of today's leading models across the six learning sub-types defined in DeepMind's 2026 cognitive framework (Section 7.4): Concept Formation, Associative Learning, Reinforcement Learning, Observational Learning, Procedural Learning, Language Learning.

ATLAS does this with 600 interactive games across six engines, each targeting one learning sub-type through procedurally generated tasks. Every game creates a novel environment where the model must discover hidden rules, sequences, mappings, or strategies through trial-and-error, not by recalling training data. The benchmark answers a question no prior evaluation could: which specific types of learning do frontier models handle well, and which remain fundamentally challenging?

It reveals something important:
Learning is not one ability. It is six.
No model is strong at all of them.
Observational learning remains the biggest open challenge, with a mean score of 0.50 and a 0.63 point spread between the strongest and weakest model.

The full benchmark, website to replay games and explore overview results, raw data, and complete writeup are all publicly accessible at the links below:

ATLAS Benchmark Replay Website: https://www.atlasbenchmark.live/
Kaggle Benchmark: https://www.kaggle.com/benchmarks/dagloxkankwanda/plat-learning-benchmark
Writeup: https://www.kaggle.com/competitions/kaggle-measuring-agi/writeups/new-writeup-1776311996368

orchid glacier
carmine yew
orchid glacier
celest bronze
#

Hi everyone! I created Learn Kaggle Benchmarks for Kagglers who want to learn how to build, evaluate, and orchestrate LLM workflows using the kaggle_benchmarks framework.

Repo: https://github.com/anpc849/learn-kaggle-benchmarks
Notebook with all code examples: https://www.kaggle.com/code/anhoangvo/learning-llm-evaluation-with-kaggle-benchmarks

This is not a traditional lesson series. It is designed to work well with AI coding agents. Each module has a markdown file that you can give to an AI coding agent so it can help you explore the framework, understand the code, ask better questions, and build experiments step by step.

Modules

  1. Core Concepts & Communication — Learn how actors, messages, chats, and context managers represent conversation state.
  2. LLM Integration — Connect chat primitives to model providers through LLMChat, ModelProxy, prompts, and responses.
  3. Enforcing Structured Outputs — Use schemas and handlers so models return data in predictable Python-friendly formats.
  4. Equipping Tools & Execution Environments — Give LLMs tools and controlled execution environments for real actions.
  5. Build a QA Workflow with LLMs — Combine chats, LLMs, and tools into a ReAct-style question answering workflow.
  6. Benchmarking & Tasks — Define tasks, run evaluations, collect results, and validate outputs.
  7. Orchestration & Parallelism — Scale evaluations with task queues, concurrency, retries, and timeouts.

Hope this helps Kagglers learn LLM evaluation through practical hands-on experience!

thorny quartz
wary sphinx
#

Hi! I have a question about task results. I have a task working well with different models I selected. With a recent update, I tried new models: GPT-5.5 and Opus 4.7. My task runs more than 90 games (question-based game). For previous models, with the $10 per day limit, it could finish. With these two new models, it costs more, so the notebook reaches the limit and the result is not correct. How do I remove a model's result from the task results?

sinful mesa
wary sphinx
sinful mesa
sinful mesa
wary sphinx
ripe lynx
#

Hi benchmarks team. @timber gale suggested this as the best way to resolve Google model inference issues. Our code is the same for all models. We invoke the model via kbench.chats.new() and .prompt(). There is no model-specific code; we swap one string in the config and the platform routes to the selected model. OpenAI and other models can finish a 250-sample run in <1 hr. Google models on the same code, same sample count, same task -- 20 hours.
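One way to narrow down a gap like this might be to time each call separately, so that uniformly slow responses can be told apart from a few very long stalls. The sketch below is only an illustration: `call_model` is a hypothetical stand-in for whatever wraps `kbench.chats.new()` and `.prompt()` in the notebook, and `fake_model` is a dummy used so the example runs on its own.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(call_model: Callable[[str], str], prompts: List[str]) -> Dict[str, float]:
    """Time each call on its own and summarise per-call latency in seconds."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(prompt)
        latencies.append(time.perf_counter() - start)
    return {
        "n": len(latencies),
        "total_s": sum(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with the real model call.
    time.sleep(0.01)
    return "ok"


summary = time_calls(fake_model, ["sample prompt"] * 5)
print(summary)
```

If the median is high for one model family, the slowdown is probably per-request latency. If the median looks normal but the maximum is very large, a handful of stalled or retried calls may account for most of the runtime.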

#

The Benchmarks product is terrific. Standardizing around something like this is sorely needed in the ML community to avoid implementation differences and potential confounders.

We are hoping to submit a benchmark before the May 4 NeurIPS deadline, so a bit rushed unfortunately.

We shared benchmark tasks with kaggle-ai-resources-support@google.com.

vagrant spear
#

Morning fellow Kagglers,
I just got done with the LogiCore Ultra: v7 benchmark task

The Content:
Just pushed the v7 update to LogiCore Ultra benchmark. I moved past simple logic gates into the Rigidity Floor.

The Stress Test:
Models were pushed to a 1,000-point operative manual requiring strict [Action] | [Dependency] | [Rationale] syntax.

Key Findings:
The Massacre: Over 30% of the field (including heavyweights like Claude Opus 4.5 and Qwen Next Thinking) hit the ERROR wall. Instruction fatigue is real.

The "Flash" Surge: Gemini 2.0 Flash Lite and Gemma 3 12B are outperforming frontier models in rigid instruction following.

The Yapping Tax: The efficiency scoring is effectively penalizing "compute noise."

Check the full telemetry and the 000.00 precision charts in the results notebook & dataset results:
👉 Notebook: LogiCore Ultra: Comprehensive Results - https://www.kaggle.com/code/gastondana/logicore-ultra-comprehensive-results-efficiency

👉 Dataset: Surgical & Efficient Results (v2-v7) - https://www.kaggle.com/datasets/gastondana/logicore-ultra-surgical-and-efficient-results

Figured sharing it here before posting it on my socials is the move. I'm working on v8 right now, but waiting for my daily quota to refresh!

Also, here's a link to the Human Creative Benchmark that the new Contra Labs released recently, which this benchmark is based on.
HCB

Enjoy the Monday & week ahead everyone!

vagrant spear
hearty helm
regal trail
#

New Early Access Program: Benchmark Local Development!

Hey everyone! We’ve just created an Early Access Program (EAP) to help us test out our latest benchmark local development features before they officially launch.

If you want to get your hands on the newest tools, test the workflows, and give us early feedback, jump over to #early-access-benchmarks-local-development to join the EAP!

orchid glacier
#

Hi, I've been trying to run the new Grok reasoning model on one of my benchmarks, but when I RUN the tasks for this new model, I see the loading animation and the confirmation popup text but after a few seconds, it shows that it wasn't run on any task.

gleaming prawn
orchid glacier
dense oak
#

How can I get information about the 5-day AI course?

orchid glacier
toxic monolith
orchid glacier
#

Hello! I have an SAE proposal. Where can I submit it?

gleaming prawn
gleaming prawn
orchid glacier
#

Hello... still the same

#

Error code: 503 - {'message': 'The requested model is currently unavailable.', 'type': 'server_error'}

gleaming prawn
gleaming prawn
orchid glacier
gleaming prawn
gleaming prawn
orchid glacier
#

Yep, it works fine now! Thanks

celest bronze
#

Kaggle Benchmarks is amazing.

I’ve been running experiments for my upcoming paper on it, and this week there have been frequent power outages where I live. Since I can push the tasks and run them on the Kaggle platform, I don’t have to worry about network issues or power outages interrupting my experiments.

Kaggle Benchmarks really saved me.

I also successfully implemented advanced agent architectures with it, and I hope my paper, possibly the first to use Kaggle Benchmarks as an evaluation framework for agents, gets accepted to a top conference. I hope to share the paper with everyone very soon.

Huge thanks to the Kaggle team!

vagrant spear
# celest bronze Hi everyone! I created **Learn Kaggle Benchmarks** for Kagglers who want to lear...

Hey @celest bronze! I meant to jump into this last week but got sidetracked.

Incredible work on this repository and the notebook. The architecture in Module 6 (Benchmarking & Tasks) and Module 7 (Orchestration & Parallelism) was exactly what I needed to scale up one of my current R&D projects.

I hooked the kaggle_benchmarks pipeline up to a massive 900-point multimodal mechanical stress test I'm calling PencilPhysics-V1. It evaluates Vision-Language Models on spatial reasoning and strict negative constraints.

Thanks to your framework's parallel execution and assertion tracking, I ran a structured audit on Gemini 2.0 Flash and exposed a major "Texture vs. Chaos" blind spot: the dense textures completely overwhelmed its macro-structure recognition, resulting in a 12.67% score!

If you want to see how beautifully your orchestration handled the 900-point execution, you can check out my evaluation notebook, and the synthetic dataset it's stress-testing here.

Enjoy the upcoming weekend & day ahead,
thanks again for the foundational resource here!

worthy viper
#

**We just shipped native tool calling in the kaggle-benchmarks library!**

This gives your LLMs access to your own Python functions as tools that they can call. You can use this to benchmark:

  • Tool call reliability (does the model know when to call the tool?)
  • Tool selection (give it N tools, does it pick right?)
  • Multi-step planning (chain tool calls towards a goal)

👉 Check out the cookbook: https://www.kaggle.com/code/kerneler/kaggle-benchmark-cookbook-using-tools/notebook
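As a rough illustration of how the first two bullets could be scored, here is a minimal sketch. It does not use the kaggle-benchmarks API at all: `ToolCase`, `score_tool_selection`, and the tool names are hypothetical, and it assumes you have already extracted the name of the first tool each response called (or `None` if no tool was called).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToolCase:
    prompt: str
    expected_tool: Optional[str]  # None means no tool should be called


def score_tool_selection(cases: List[ToolCase], observed: List[Optional[str]]) -> Dict[str, float]:
    """Compare the first tool each response called against the expected one."""
    if len(cases) != len(observed):
        raise ValueError("cases and observed must have the same length")
    pairs = list(zip(cases, observed))
    needed = [(c, o) for c, o in pairs if c.expected_tool is not None]
    # Selection: the observed tool (or absence of one) matches the expectation.
    correct = sum(1 for c, o in pairs if c.expected_tool == o)
    # Reliability: of the cases that needed a tool, how many triggered any call.
    called = sum(1 for _, o in needed if o is not None)
    return {
        "selection_accuracy": correct / len(pairs) if pairs else 0.0,
        "call_rate_when_needed": called / len(needed) if needed else 0.0,
    }


cases = [
    ToolCase("What is 17 * 23?", "calculator"),
    ToolCase("What's the weather in Paris?", "get_weather"),
    ToolCase("Say hello.", None),
]
observed = ["calculator", "calculator", None]  # toy data, not real model output
result = score_tool_selection(cases, observed)
print(result)
```

Keeping the two numbers separate shows whether a model fails by not calling tools at all or by calling the wrong one.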

drifting hatch
#

The World Has a Data Problem. We Fix It.
Every AI team hits the same wall eventually.
You have the model. You have the architecture. You have the engineers. But you don't have the data, and everything stops.
Maybe your dataset is too small to train on. Maybe it carries sensitive patient records, financial transactions, or personal identifiers that legal won't let you touch. Maybe you've been waiting months for a vendor to deliver labeled data that still isn't ready. Maybe your edge cases are so rare in real life that your model keeps failing exactly where it matters most.
This is not a skill problem. This is a data problem. And it is quietly killing more AI projects than any other single reason.
We generate synthetic data.
Not as a workaround. Not as a compromise. As a legitimate, statistically rigorous alternative that lets your team move again. We produce tabular, text, image, and time-series synthetic datasets that mirror the distributions, correlations, and behavioral patterns of real-world data without exposing a single real record.
We have solved this for teams in healthcare who couldn't share patient data across departments. For fintech companies building fraud detection models with almost no real fraud examples to train on. For startups that needed 10x their dataset size before a funding deadline. For enterprises blocked by GDPR, HIPAA, and compliance teams that said no to everything.
The problem you are sitting with right now, whether it is a privacy blocker, a data scarcity issue, a class imbalance, a regulatory wall, or a timeline that real data collection simply cannot meet, has a solution. We will tell you exactly what it is within 24 hours of hearing from you.
No long sales cycles. No vague proposals. You describe your data problem in plain language, and we come back with a concrete plan.
Send us your situation: [synthox.ai@gmail.com]
The only thing worse than a data problem is spending another month pretending it will resolve itself.

potent nova
#

hey @hearty helm, I'm not sure who to ping for this, but gemini 3.1 flash lite has been failing with: Error code: 503 - {'message': 'The requested model is currently not reachable. Try again later.', 'type': 'server_error'}
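For transient 503s like this one, a retry with exponential backoff in the notebook may help, though it will not help if the model stays down for a long time. The sketch below is generic Python, not part of the benchmarks library: `ModelUnavailable` is a hypothetical exception standing in for however the 503 surfaces in your code, and `with_backoff` is an illustrative helper name.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ModelUnavailable(Exception):
    """Hypothetical error type standing in for a 503 from the model endpoint."""


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 1.0, max_delay: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on ModelUnavailable, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except ModelUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add jitter so parallel workers do not all retry at the same moment.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the waiting can be replaced in tests. In normal use you would call `with_backoff(lambda: my_model_call(prompt))`, where `my_model_call` is your own function.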

on another note, with my benchmark (https://www.kaggle.com/benchmarks/cloudwaddie/topovision), the leaderboard isn't updating, but the top models bit in the top right is. is this because some tasks are PASS or FAIL, and others are numerical?

thank you for your help

#

-# other than this, it's been amazing tysm!

timber gale
celest bronze
#

Hi everyone, I have a question about transparency based on the results I just ran on Kaggle Benchmarks.

| Model | Input tokens | Output tokens | Cost ($) | Runtime (min) |
|---|---|---|---|---|
| gemma-4-31b | 119,756 | 5,385 | 0.0150 | 9.61 |
| deepseek-v3.2 | 146,918 | 7,630 | 0.0914 | 6.75 |
| qwen3-235b-a22b-instruct-2507 | 100,728 | 5,129 | 0.0267 | 1.09 |
| qwen3-coder-480b-a35b-instruct | 141,139 | 6,767 | 0.0387 | 0.82 |

gemma-4-31b is smaller than the Qwen/DeepSeek models, and compared with deepseek-v3.2 and qwen3-coder-480b-a35b-instruct, it also has fewer input/output tokens and lower cost. However, its runtime is much longer: ~9.6 minutes vs. 6.75 minutes for DeepSeek and under 1.1 minutes for the two Qwen models.
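To put the gap in one number, the figures above can be turned into output tokens per minute. This is a quick sketch using only the reported values. It assumes the runtime is wall-clock time for the whole run and that the reported output tokens are everything the model generated, neither of which I have confirmed.

```python
# (input tokens, output tokens, runtime in minutes), copied from the results above
runs = {
    "gemma-4-31b": (119_756, 5_385, 9.61),
    "deepseek-v3.2": (146_918, 7_630, 6.75),
    "qwen3-235b-a22b-instruct-2507": (100_728, 5_129, 1.09),
    "qwen3-coder-480b-a35b-instruct": (141_139, 6_767, 0.82),
}


def output_tokens_per_minute(output_tokens: int, runtime_min: float) -> float:
    """Apparent generation throughput over the whole run."""
    return output_tokens / runtime_min


for model, (_, out_tok, minutes) in runs.items():
    rate = output_tokens_per_minute(out_tok, minutes)
    print(f"{model:32s} {rate:8.0f} output tok/min")
```

That works out to roughly 560 output tokens per minute for gemma-4-31b against roughly 8,250 for qwen3-coder-480b-a35b-instruct. If thinking tokens are not counted in the output figure, Gemma's real throughput could be higher than this suggests.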

Could Kaggle Benchmarks publish the provider and/or hardware information for open-source models? Runtime can vary a lot depending on the setup.

Also, are thinking tokens included in the input/output token in the metadata? I am wondering whether Gemma is slower because it spends more time thinking, or because it was run on slower provider/hardware infrastructure.

For closed-source models like GPT, Claude, or Gemini, this is less of an issue because the provider is usually fixed. But for open-source models, reporting the setup would make it easier to compare the trade-off between cost, runtime, and performance.

worthy viper
orchid glacier
#

Hi, are there any account-level requirements (besides confirming your account with your phone number) for using benchmarks.BenchmarkTasksApiService/ListBenchmarkTaskRuns with a valid OAuth Bearer token? Even if you accept the terms and generate the confirmation code, is there a possibility that it might not work for some accounts? (Asking for a friend who is trying to pull all .json runs from a task he owns but can't. He gets 404 Client Error: Not Found for url: https://api.kaggle.com/)

worthy viper
#

Get your friend to sign up and we'll add them