#early-access-benchmarks-local-development


honest arch
#

Hi benchmarks team. @neon prawn suggested this as the best way to resolve Google model inference issues.

Our code is the same for all models.

We invoke the model via kbench.chats.new() and .prompt(). There is no model-specific code; we swap one string in the config and the platform routes to the selected model. OpenAI and other models can finish a 250-sample run in <1 hr. Google models on the same code, same sample count, same task -- 20 hours.
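For scale, the run times above can be turned into rough per-sample latencies. This is only a back-of-envelope sketch: it assumes "<1 hr" means at most 3600 seconds and that the samples run one after another, neither of which is confirmed in the message.

```python
# Back-of-envelope per-sample latency from the run times quoted above.
SAMPLES = 250  # sample count stated in the message


def seconds_per_sample(total_hours: float, samples: int = SAMPLES) -> float:
    """Average wall-clock seconds per sample for a run of the given length."""
    return total_hours * 3600 / samples


fast = seconds_per_sample(1)   # upper bound for the "<1 hr" runs (assumption)
slow = seconds_per_sample(20)  # the 20-hour run on Google models

print(f"at most {fast:.1f} s/sample vs {slow:.1f} s/sample "
      f"(at least {slow / fast:.0f}x slower)")
```

Because the fast figure is an upper bound, the real gap per sample is at least this large.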

The Benchmarks product is great. Standardizing around something like this is sorely needed in the ML community to avoid implementation differences and potential confounders.

We are aiming to submit a benchmark before the May 4 NeurIPS deadline, so a bit rushed unfortunately.

We shared the tasks with kaggle-ai-resources-support@google.com.

upper barn
#

..

scarlet spade
#

**Welcome to Early Access Benchmarks Local Development!**

We are bringing local development to the Kaggle Benchmarks Python library so you can finally build tasks using your favorite IDEs, debuggers, and coding agents!

**What you get access to:**

  • Create and publish tasks written locally
  • Run model evaluations remotely on Kaggle via CLI
  • Track runs and download results

How to get access: To officially join the Early Access Program, please fill out our sign-up form

**What is this channel for?** We want to hear from you! We will be using this space to gather your feedback before our broader launch. Please use this channel to:

  • Share your work
  • Discuss your experiences using the new tools
  • Drop bug reports and feature requests

We are excited to see what you build. Let us know if you have any questions!

green ivy
#

Hello everyone! 👋

Form submitted, really excited about this.

I've been building a benchmark series at the intersection of LLMs and water / environmental technologies, focused on health, safety, and operator decision support. Beyond raw task performance, I'm interested in measuring model awareness, capability boundaries, and behavior under pressure (authority, ambiguity, safety-critical scenarios).

The bigger question behind all of this: how far can we actually trust LLMs in industrial settings, from heavy industry to day-to-day operations?

oblique lichen
#

have folks been using the local development feature? lmk what you think of it -- want to hear the good and (especially) the bad!

proper flare
#

Hello!
I recently opened a PR for the Kaggle CLI repo. Which channel would be the right place to ask for feedback/review on it?

red hound
#

Hello! I'm building a new benchmark using the early access features, and I have two questions:

  1. Is it possible to link a Kaggle dataset as input to a task kernel directly from the CLI?
  2. Can the benchmark page itself be created, and tasks linked to it, entirely through the CLI, using local development only?
oblique lichen
#

@red hound

  1. No not yet, but it's on our radar!
  2. No, the benchmark page can't be created through the CLI yet, but it's on our roadmap!
red hound
#

I have also asked Addison Howard via email about this, but I'll ask here as well: could I get a daily quota increase of about $200, please? I have to run 2250 prompts and I ran out of my daily quota pretty fast... I know it's a lot and I understand if it's not possible 🙂
I'm trying to evaluate the models against a bunch of games, and I have to prompt them multiple times per game for different scenarios.

Thanks

oblique lichen
#

All, could you give a thumbs up if you've had a chance to use the local development capabilities (i.e., the benchmark commands in our kaggle CLI)? I'd love to get feedback either through this channel or via a quick Google Meet call :) To thank you for your time, I'll be happy to provide more quota so you can do more cool work :)

red hound
#

Here are a few things I've stumbled upon so far while using the early access features:

  • A missing .evaluate() call fails silently. A task that's only decorated pushes, creates, and runs without
    error; it just produces no results. A validation step on push would catch this instantly.
  • "RUNNING (creating)" status is ambiguous. After a push, the task is provisioning but looks like it's
    running. Firing kaggle b t run too early fails without a clear explanation. A distinct CREATING state
    would help.
  • No server-side task deletion. Throwaway tasks (e.g. a probe) stay on the account permanently and can be
    mistaken for real benchmark tasks later.
  • The push-to-run feedback loop is slow for investigating API behavior. Each experiment (e.g. "does
    dataset mounting work?") costs a full push -> wait -> run -> wait -> download cycle. A faster sandbox or
    dry-run mode would help early iteration.
  • Credential loading isn't surfaced. Running a task locally without first loading the .env gives a
    cryptic auth error. A hint like "run kaggle b init -y and source the .env" in the error output would save
    time.
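The first point above, a validation step on push, could be approximated locally with a static check. The sketch below uses only Python's standard `ast` module and looks for any `<something>.evaluate(...)` call in a task file. The function name `has_evaluate_call`, the `@task` decorator, and the sample task sources are hypothetical and are not part of the kaggle-benchmarks library or the CLI.

```python
import ast


def has_evaluate_call(source: str) -> bool:
    """Return True if the source contains at least one `<obj>.evaluate(...)` call."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "evaluate"
        ):
            return True
    return False


# Hypothetical task files, used only to exercise the check.
# `task` is a placeholder decorator name, not a real library symbol.
decorated_only = "@task\ndef my_task(llm):\n    return llm.prompt('hi')\n"
with_evaluate = decorated_only + "my_task.evaluate(data)\n"

for name, src in [("decorated_only", decorated_only), ("with_evaluate", with_evaluate)]:
    status = "ok" if has_evaluate_call(src) else "WARNING: no .evaluate() call found"
    print(f"{name}: {status}")
```

A check like this only parses the file and never runs it, so it could sit in a pre-push hook without needing credentials. It would miss calls made indirectly (for example through `getattr`), so it is a heuristic rather than a guarantee.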

ps: if there's a call happening I would love to join 🙂

oblique lichen
#

Thank you! Noting everything down
On the last error on credential loading. What error do you see? Is it AttributeError: module 'kaggle_benchmarks' has no attribute 'llm'

red hound
# oblique lichen Thank you! Noting everything down On the last error on credential loading. What ...

The error is 401 expired token from the model (doesn't matter which one) with no hint that the fix is kaggle b init -y && source .env.
That's the message to improve, either by catching AuthenticationError at the task runner level and re-raising with the validate_model_proxy_config hint, or by calling validate_model_proxy_config(raise_on_error=True) eagerly when kbench.llm is configured.

So, as a possible fix, it would be nice to implement something like this:

  • Missing credentials entirely: validate_model_proxy_config(raise_on_error=True) at startup catches it
    early with a good message.
  • Expired token: catch the auth error at the task runner level and re-raise with the kaggle b init -y && source .env hint, since the credentials look valid at startup but fail at call time.
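The expired-token case could look roughly like the sketch below. It is not the library's code: `AuthenticationError` here is a stand-in class defined locally, and `run_with_auth_hint` and `failing_task` are hypothetical names. The real exception type raised by the model client may differ.

```python
class AuthenticationError(Exception):
    """Stand-in for whatever exception the model client raises on a 401 (assumption)."""


HINT = "Credentials may be missing or expired. Try: kaggle b init -y && source .env"


def run_with_auth_hint(task_fn, *args, **kwargs):
    """Run a task and re-raise auth failures with an actionable hint appended."""
    try:
        return task_fn(*args, **kwargs)
    except AuthenticationError as exc:
        # Chain the original error so the underlying 401 stays visible in the traceback.
        raise RuntimeError(f"{exc}\n{HINT}") from exc


def failing_task():
    """Hypothetical task that simulates a token expiring at call time."""
    raise AuthenticationError("401 expired token")


try:
    run_with_auth_hint(failing_task)
except RuntimeError as err:
    print(err)
```

Wrapping at the runner level means the hint appears regardless of which model raised the 401, which matches the observation that the error does not depend on the model.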
deft shuttle
#

Hey team,
I'm building a local telemetry dashboard using the kaggle-benchmarks SDK and CLI. I've encountered a UI bug in the leaderboard comparison view: selecting 3+ models triggers a full window reload rather than showing the model comparisons.

I was wondering if anyone noticed this as well.

I'm happy to share logs or a screen recording if it helps the team track this down, just let me know!
Thx!

oblique lichen
#

yes please share a screen recording if you can!

deft shuttle
#

Hi again team,

I've been diving deep into the local benchmarking setup today and wanted to share a telemetry dashboard pipeline I built on top of it.

Because tracking token efficiency and execution speed across a high volume of models gets heavy manually, I put together an automated headless architecture to handle the data orchestration.

How it works:

Playwright Async Engine: A local script steps through the comparison DOM sequentially, bypassing the React state-caching constraints to strictly isolate active-tab metrics (Input Tokens, Output Tokens, and Time) directly from the UI.

Next.js Frontend: Ingests the hot-reloaded JSON/CSV data and renders a live, type-safe distribution matrix.
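To make the data hand-off concrete, here is a minimal sketch of how a scraped CSV snapshot could be parsed into typed records before rendering. It is written in Python for illustration only; the column names (`model`, `input_tokens`, `output_tokens`, `seconds`), the `RunMetrics` type, and the sample rows are all assumptions, and the values are placeholders rather than real benchmark results.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """One row of scraped metrics (field names are assumed, not the real schema)."""
    model: str
    input_tokens: int
    output_tokens: int
    seconds: float

    @property
    def output_tokens_per_second(self) -> float:
        # Guard against a zero or missing duration.
        return self.output_tokens / self.seconds if self.seconds > 0 else 0.0


def parse_snapshot(csv_text: str) -> list[RunMetrics]:
    """Parse a CSV snapshot into typed records, failing loudly on bad numbers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        RunMetrics(
            model=row["model"],
            input_tokens=int(row["input_tokens"]),
            output_tokens=int(row["output_tokens"]),
            seconds=float(row["seconds"]),
        )
        for row in reader
    ]


# Placeholder rows, not real benchmark results.
SNAPSHOT = (
    "model,input_tokens,output_tokens,seconds\n"
    "model-a,100,50,2.0\n"
    "model-b,100,200,4.0\n"
)

rows = parse_snapshot(SNAPSHOT)
for r in rows:
    print(r.model, r.output_tokens_per_second)
```

Converting to typed records at the boundary means a malformed scrape raises immediately instead of silently producing a misleading chart.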

What it surfaces:
It instantly maps the cost-to-performance ratio. For example, looking at the distribution graph, models like deepseek-r1-0528 and gemma-4-26b show massive efficiency relative to compute cost, while gpt-oss-120b holds the absolute SOTA ceiling for spatial reasoning.

Here is a quick, unlisted 60-second walkthrough showing the live UI and the normalized cost distribution chart in action:
🎥 https://youtu.be/nnQPXsLdqoE

Context on "Pencil Physics":
(Evaluates if multimodal models obey strict physical laws and spatial logic, not just aesthetics). Here are the underlying assets being benchmarked:

Official Benchmark: https://www.kaggle.com/benchmarks/gastondana/pencil-physics-mechanical-logic-benchmark/leaderboard

Benchmark Task: https://www.kaggle.com/benchmarks/tasks/gastondana/pencil-physics-mechanical-constraint-test/2

Eval Notebook: https://www.kaggle.com/code/gastondana/pencil-physics-mechanical-constraint-test

Dataset: https://www.kaggle.com/datasets/gastondana/pencilphysics-v1

The code is fully decoupled from the UI, so whenever new evaluations finish, running the scraper automatically pushes fresh snapshots to the dashboard. If anyone else is working on automated local extraction layers for these metrics, I'd love to see what you got going on.
Thx!