#early-access-benchmarks-local-development | Kaggle | Page 1

honest arch May 1, 2026, 10:21 PM

#

Hi benchmarks team. @neon prawn suggested this as the best way to resolve Google model inference issues.

Our code is the same for all models.

We invoke the model via kbench.chats.new() and .prompt(). There is no model-specific code; we swap one string in the config and the platform routes to the selected model. OpenAI and other models can finish a 250-sample run in <1 hr. Google models on the same code, same sample count, same task -- 20 hours.
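
For a sense of scale, the run times quoted above can be turned into per-sample latencies. This is only back-of-the-envelope arithmetic on the numbers in the message. It assumes "<1 hr" means a full hour, so the non-Google figure is an upper bound, and it says nothing about where the time is actually spent.

```python
# Back-of-the-envelope per-sample latency from the run times quoted above.
SAMPLES = 250


def seconds_per_sample(total_hours: float, samples: int = SAMPLES) -> float:
    """Average wall-clock seconds per sample for a run of the given length."""
    return total_hours * 3600 / samples


other_models = seconds_per_sample(1)    # "<1 hr", so this is an upper bound
google_models = seconds_per_sample(20)  # "20 hours"
slowdown = google_models / other_models

print(f"other models:  <= {other_models:.1f} s/sample")
print(f"Google models: ~{google_models:.1f} s/sample")
print(f"slowdown:      >= {slowdown:.0f}x")
```

So the gap is at least a factor of 20 per sample on identical code.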

The Benchmarks product is great. Standardizing around something like this is sorely needed in the ML community to avoid implementation differences and potential confounders.

We are aiming to submit a benchmark before the May 4 NeurIPS deadline, so a bit rushed unfortunately.

We shared the tasks with kaggle-ai-resources-support@google.com.

upper barn May 4, 2026, 2:44 AM

#

..

scarlet spade May 6, 2026, 5:31 AM

#

**Welcome to Early Access Benchmarks Local Development!**

We are bringing local development to the Kaggle Benchmarks Python library so you can finally build tasks using your favorite IDEs, debuggers, and coding agents!

**What you get access to:**

Create and publish tasks written locally
Run model evaluations remotely on Kaggle via CLI
Track runs and download results

**How to get access:** To officially join the Early Access Program, please fill out our sign-up form.

**What is this channel for?** We want to hear from you! We will be using this space to gather your feedback before our broader launch. Please use this channel to:

Share your work
Discuss your experiences using the new tools
Drop bug reports and feature requests

We are excited to see what you build. Let us know if you have any questions!

green ivy May 11, 2026, 6:32 AM

#

Hello everyone! 👋

Form submitted — really excited about this.

I've been building a benchmark series at the intersection of LLMs and water / environmental technologies, focused on health, safety, and operator decision support. Beyond raw task performance, I'm interested in measuring model awareness, capability boundaries, and behavior under pressure (authority, ambiguity, safety-critical scenarios).

The bigger question behind all of this: how far can we actually trust LLMs in industrial settings — from heavy industry to day-to-day operations?

oblique lichen May 15, 2026, 4:15 PM

#

have folks been using the local development feature? lmk what you think of it -- want to hear the good and (especially) the bad!

proper flare May 16, 2026, 8:06 PM

#

Hello!
I recently opened a PR for the Kaggle CLI repo. Which channel would be the right place to ask for feedback/review on it?

red hound May 17, 2026, 5:31 PM

#

Hello! I'm building a new benchmark using the early access features, and I have two questions:

1. Is it possible to link a Kaggle dataset as input to a task kernel directly from the CLI?
2. Can the benchmark page itself be created, and tasks linked to it, entirely through the CLI, using local development only?

oblique lichen May 18, 2026, 4:02 PM

#

@red hound

1. No, not yet, but it's on our radar!
2. No, the benchmark page can't be created from the CLI yet either, but it's on our roadmap!

oblique lichen May 18, 2026, 4:03 PM

#

proper flare Hello! I recently opened a PR for the Kaggle CLI repo. Which channel would be th...

could you share the URL? we get notifications when you open the PR and one of our engineers will review

red hound May 18, 2026, 4:07 PM

#

I have also asked Addison Howard via email, but I'll ask here as well: could I get a daily quota increase of about $200, please? I have to run 2250 prompts and I ran out of my daily quota pretty fast... I know it's a lot and I understand if it's not possible 🙂
I'm trying to evaluate the models against a bunch of games and I have to prompt them multiple times per game for different scenarios..

Thanks

proper flare May 18, 2026, 4:13 PM

#

oblique lichen could you share the URL? we get notifications when you open the PR and one of ou...

https://github.com/Kaggle/kaggle-cli/pull/1013

This PR adds a small validation check to prevent the CLI from hanging in non-interactive environments.
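
As a generic illustration of that kind of guard (not the code from the PR, which isn't reproduced here), a prompt helper can refuse to wait for input when stdin is not a terminal. The names `confirm`, `NonInteractiveError`, and `assume_yes` are hypothetical and chosen only for this sketch.

```python
import sys


class NonInteractiveError(RuntimeError):
    """Raised when a prompt is needed but no terminal is attached."""


def confirm(question: str, assume_yes: bool = False, stream=None) -> bool:
    """Ask a yes/no question, failing fast instead of hanging without a TTY."""
    stream = sys.stdin if stream is None else stream
    if assume_yes:
        # Caller opted out of prompting (e.g. CI), so never read from stdin.
        return True
    if stream is None or not stream.isatty():
        # In pipes, cron jobs, or CI a blocking read could wait forever.
        raise NonInteractiveError(
            f"Cannot ask {question!r}: stdin is not a terminal. "
            "Pass assume_yes=True to proceed without prompting."
        )
    print(f"{question} [y/N] ", end="", flush=True)
    return stream.readline().strip().lower() in {"y", "yes"}
```

The point of the design is that a non-interactive caller gets an immediate, actionable error rather than a process that silently blocks on input.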

oblique lichen May 18, 2026, 5:40 PM

#

red hound I have also asked Addison Howard via email about it but I'll ask here as well, c...

could you share a link of your benchmark for me to review?

#

All, could you give a thumbs up if you've had a chance to use the local development capabilities (i.e., the benchmark commands in our Kaggle CLI)? I'd love to get feedback either through this channel or via a quick Google Meet call :) To thank you for your time, I'll be happy to provide more quota so you can do more cool work :)

red hound May 18, 2026, 5:51 PM

#

oblique lichen could you share a link of your benchmark for me to review?

https://github.com/4kaws/childhoodgamesbenchmark i just uploaded it here (i only had it locally)

red hound May 18, 2026, 6:37 PM

#

Here are a few things I've stumbled upon so far while using the early access features:

A missing .evaluate() call fails silently. A task that's only decorated pushes, creates, and runs without error; it just produces no results. A validation step on push would catch this instantly.
"RUNNING (creating)" status is ambiguous. After a push, the task is provisioning but looks like it's running. Firing kaggle b t run too early fails without a clear explanation. A distinct CREATING state would help.
No server-side task deletion. Throwaway tasks (e.g. a probe) stay on the account permanently and can be mistaken for real benchmark tasks later.
The push-to-run feedback loop is slow for investigating API behavior. Each experiment (e.g. "does dataset mounting work?") costs a full push -> wait -> run -> wait -> download cycle. A faster sandbox or dry-run mode would help early iteration.
Credential loading isn't surfaced. Running a task locally without first loading the .env gives a cryptic auth error. A hint like "run kaggle b init -y and source the .env" in the error output would save time.
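
To make the first suggestion concrete, here is a minimal sketch of what a pre-push check for a missing .evaluate() call could look like. It is only an assumption about how such validation might work, not anything the CLI does today. `has_evaluate_call` and the decorator name in the sample sources are hypothetical, and the check is purely static, so it would miss calls made through getattr or inside imported modules.

```python
import ast


def has_evaluate_call(source: str) -> bool:
    """Return True if the source contains any `<something>.evaluate(...)` call."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "evaluate"
        ):
            return True
    return False


# Hypothetical task sources; the decorator name is a placeholder and is
# never executed, only parsed.
DECORATED_ONLY = '''
@some_task_decorator
def my_task(llm):
    return 1
'''
WITH_EVALUATE = DECORATED_ONLY + "\nmy_task.evaluate()\n"

for name, src in [("decorated only", DECORATED_ONLY), ("with evaluate", WITH_EVALUATE)]:
    status = "ok" if has_evaluate_call(src) else "warning: no .evaluate() call found"
    print(f"{name}: {status}")
```

A warning rather than a hard error is probably the safer default, since a static check like this can produce false negatives.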

ps: if there's a call happening I would love to join 🙂

oblique lichen May 18, 2026, 10:25 PM

#

Thank you! Noting everything down
On the last item, credential loading: what error do you see? Is it AttributeError: module 'kaggle_benchmarks' has no attribute 'llm'?

red hound May 19, 2026, 6:57 PM

#

oblique lichen Thank you! Noting everything down On the last error on credential loading. What ...

The error is a 401 expired token from the model (doesn't matter which one), with no hint that the fix is kaggle b init -y && source .env.
That's the message to improve, either by catching AuthenticationError at the task runner level and re-raising with the validate_model_proxy_config hint, or by calling validate_model_proxy_config(raise_on_error=True) eagerly when kbench.llm is configured.

So, as a possible fix, it would be nice to implement something like this:

Missing credentials entirely: validate_model_proxy_config(raise_on_error=True) at startup catches it early with a good message.
Expired token: catch the auth error at the task runner level and re-raise with the kaggle b init -y && source .env hint, since the credentials look valid at startup but fail at call time.
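
The expired-token case could be sketched roughly like this. It is an illustration of the catch-and-re-raise idea only, not the actual task runner code: `AuthenticationError` here is a local stand-in for whatever exception the model client raises on a 401, and `run_with_auth_hint` and `CredentialHintError` are hypothetical names.

```python
class AuthenticationError(Exception):
    """Stand-in for the exception a model client raises on a 401."""


class CredentialHintError(RuntimeError):
    """Auth failure re-raised with a hint about how to refresh credentials."""


HINT = "credentials may be missing or expired; try: kaggle b init -y && source .env"


def run_with_auth_hint(task_fn, *args, **kwargs):
    """Run a task and attach an actionable hint to any auth failure."""
    try:
        return task_fn(*args, **kwargs)
    except AuthenticationError as exc:
        # Keep the original error as the cause so the 401 details stay visible.
        raise CredentialHintError(f"{exc} ({HINT})") from exc


def expired_task():
    # Simulates a model call that fails at call time with an expired token.
    raise AuthenticationError("401 expired token")


try:
    run_with_auth_hint(expired_task)
except CredentialHintError as err:
    print(err)
```

Chaining with `from exc` means the traceback still shows the underlying 401, while the top-level message tells the user what to run.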

oblique lichen May 20, 2026, 2:47 PM

#

Our eng has fixed it: https://github.com/Kaggle/kaggle-benchmarks/pull/164/changes !

deft shuttle May 22, 2026, 4:18 PM

#

Hey team,
I'm building a local telemetry dashboard using the kaggle-benchmarks SDK and CLI. I've encountered a UI bug in the leaderboard comparison view: selecting 3+ models triggers a full window reload rather than showing the model comparisons.

I was wondering if anyone noticed this as well.

I’m happy to share logs or a screen recording if it helps the team track this down, just let me know!
Thx!

oblique lichen May 23, 2026, 10:08 PM

#

yes please share a screen recording if you can!

deft shuttle May 24, 2026, 12:28 AM

#

oblique lichen yes please share a screen recording if you can!

The chat here won't allow video/image uploads 😭

deft shuttle May 24, 2026, 12:36 AM

#

oblique lichen yes please share a screen recording if you can!

I think it may have been fixed. I was running some tests earlier today, and it was working fine.

Just double-checked and now it's working perfectly, odd but it works fine now!

deft shuttle May 24, 2026, 3:29 AM

#

Hi again team,

I’ve been diving deep into the local benchmarking setup today and wanted to share a telemetry dashboard pipeline I built on top of it.

Because tracking token efficiency and execution speed across a high volume of models gets heavy manually, I put together an automated headless architecture to handle the data orchestration.

How it works:

Playwright Async Engine: A local script steps through the comparison DOM sequentially, bypassing the React state-caching constraints to strictly isolate active-tab metrics (Input Tokens, Output Tokens, and Time) directly from the UI.

Next.js Frontend: Ingests the hot-reloaded JSON/CSV data and renders a live, type-safe distribution matrix.

What it surfaces:
It instantly maps the cost-to-performance ratio. For example, looking at the distribution graph, models like deepseek-r1-0528 and gemma-4-26b show massive efficiency relative to compute cost, while gpt-oss-120b holds the absolute SOTA ceiling for spatial reasoning.
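
For anyone curious what a ratio like that can look like in code, here is a minimal sketch. The dashboard's actual formula isn't stated above, so "score per 1k tokens" is just one plausible assumption, and the `Run` fields, model names, and numbers are made-up placeholders rather than real benchmark results.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    input_tokens: int
    output_tokens: int
    seconds: float
    score: float  # task score in [0, 1]


def score_per_1k_tokens(run: Run) -> float:
    """Score earned per 1,000 tokens consumed (input plus output)."""
    total = run.input_tokens + run.output_tokens
    if total == 0:
        raise ValueError(f"{run.model}: run has no tokens")
    return run.score / (total / 1000)


# Placeholder data for illustration only.
runs = [
    Run("model-a", input_tokens=1200, output_tokens=800, seconds=4.0, score=0.60),
    Run("model-b", input_tokens=1500, output_tokens=6500, seconds=30.0, score=0.80),
]

ranked = sorted(runs, key=score_per_1k_tokens, reverse=True)
for r in ranked:
    print(f"{r.model}: {score_per_1k_tokens(r):.3f} score per 1k tokens, {r.seconds:.1f} s")
```

In this toy data the model with the lower absolute score ranks first on efficiency, which is the kind of trade-off a cost-normalized view is meant to expose.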

Here is a quick, unlisted 60-second walkthrough showing the live UI and the normalized cost distribution chart in action:
🎥 https://youtu.be/nnQPXsLdqoE

Context on "Pencil Physics":
(Evaluates if multimodal models obey strict physical laws and spatial logic, not just aesthetics). Here are the underlying assets being benchmarked:

Official Benchmark: https://www.kaggle.com/benchmarks/gastondana/pencil-physics-mechanical-logic-benchmark/leaderboard

Benchmark Task: https://www.kaggle.com/benchmarks/tasks/gastondana/pencil-physics-mechanical-constraint-test/2

Eval Notebook: https://www.kaggle.com/code/gastondana/pencil-physics-mechanical-constraint-test

Dataset: https://www.kaggle.com/datasets/gastondana/pencilphysics-v1

The code is fully decoupled from the UI, so whenever new evaluations finish, running the scraper automatically pushes fresh snapshots to the dashboard. If anyone else is working on automated local extraction layers for these metrics, I'd love to see what you got going on.
Thx!