Hi again team,
I've been diving deep into the local benchmarking setup today and wanted to share a telemetry dashboard pipeline I built on top of it.
Because tracking token efficiency and execution speed across a high volume of models gets tedious by hand, I put together an automated headless pipeline to collect and organize the data.
How it works:
Playwright Async Engine: A local script steps through the tabs of the comparison view one at a time, working around React's cached state so that only the active tab's metrics (Input Tokens, Output Tokens, and Time) are read from the UI.
Next.js Frontend: Ingests the hot-reloaded JSON/CSV data and renders a live, type-safe distribution matrix.
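For anyone who wants to try something similar, here is a minimal sketch of the extraction step in Python. It is not the actual script: the `tab` and `tabpanel` role selectors, the label format ("Input Tokens: 1,234", "Time: 12.5s"), and the names `TabMetrics`, `parse_metrics`, and `scrape_tabs` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class TabMetrics:
    model: str
    input_tokens: int
    output_tokens: int
    seconds: float


# Hypothetical label format, e.g. "Input Tokens: 1,234" and "Time: 12.5s".
_PATTERNS = {
    "input_tokens": re.compile(r"Input Tokens\D*(\d[\d,]*)", re.I),
    "output_tokens": re.compile(r"Output Tokens\D*(\d[\d,]*)", re.I),
    "seconds": re.compile(r"Time\D*(\d+(?:\.\d+)?)", re.I),
}


def parse_metrics(model: str, panel_text: str) -> TabMetrics:
    """Turn the visible text of one tab panel into a typed record."""
    found = {}
    for key, pattern in _PATTERNS.items():
        match = pattern.search(panel_text)
        if match is None:
            raise ValueError(f"{key} not found for {model}")
        found[key] = match.group(1).replace(",", "")
    return TabMetrics(
        model=model,
        input_tokens=int(found["input_tokens"]),
        output_tokens=int(found["output_tokens"]),
        seconds=float(found["seconds"]),
    )


async def scrape_tabs(url: str) -> list[TabMetrics]:
    # Imported here so parse_metrics can be used without Playwright installed.
    from playwright.async_api import async_playwright

    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        tabs = page.get_by_role("tab")  # assumed: tabs expose role="tab"
        for i in range(await tabs.count()):
            tab = tabs.nth(i)
            await tab.click()
            # Read only the panel that is visible after the click, so values
            # left over from a previously opened tab are not picked up.
            panel = page.locator('[role="tabpanel"]:visible').first
            name = (await tab.inner_text()).strip()
            results.append(parse_metrics(name, await panel.inner_text()))
        await browser.close()
    return results
```

Running it would look like `asyncio.run(scrape_tabs(url))` with whatever URL the comparison view lives at. Keeping the parsing in a plain function means it can be checked against sample text without launching a browser.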
What it surfaces:
It maps the cost-to-performance ratio at a glance. For example, in the distribution graph, models like deepseek-r1-0528 and gemma-4-26b show high efficiency relative to compute cost, while gpt-oss-120b holds the top score for spatial reasoning.
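One way a normalized ratio like this could be computed is sketched below. The cost model here is an assumption: it uses total tokens as a stand-in for compute cost, and the `ModelRun` fields, the `normalized_efficiency` name, and the placeholder model names and numbers are hypothetical, not taken from the dashboard.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    model: str
    score: float        # benchmark score, higher is better (assumed scale)
    input_tokens: int
    output_tokens: int


def normalized_efficiency(runs):
    """Score per token, scaled so the most efficient model gets 1.0."""
    raw = {}
    for run in runs:
        total_tokens = run.input_tokens + run.output_tokens
        if total_tokens > 0:  # skip runs with no token data
            raw[run.model] = run.score / total_tokens
    best = max(raw.values(), default=0.0)
    if best <= 0:
        return {model: 0.0 for model in raw}
    return {model: value / best for model, value in raw.items()}


# Placeholder data, not real benchmark results.
runs = [
    ModelRun("model-a", score=0.8, input_tokens=1000, output_tokens=1000),
    ModelRun("model-b", score=0.9, input_tokens=2000, output_tokens=4000),
]
efficiency = normalized_efficiency(runs)
```

With a real price per token, the denominator could be swapped for dollars instead of raw token counts.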
Here is a quick, unlisted 60-second walkthrough showing the live UI and the normalized cost distribution chart in action:
🎥 https://youtu.be/nnQPXsLdqoE
Context on "Pencil Physics":
It evaluates whether multimodal models obey strict physical laws and spatial logic, not just aesthetics. Here are the underlying assets being benchmarked:
Official Benchmark: https://www.kaggle.com/benchmarks/gastondana/pencil-physics-mechanical-logic-benchmark/leaderboard
Benchmark Task: https://www.kaggle.com/benchmarks/tasks/gastondana/pencil-physics-mechanical-constraint-test/2
Eval Notebook: https://www.kaggle.com/code/gastondana/pencil-physics-mechanical-constraint-test
Dataset: https://www.kaggle.com/datasets/gastondana/pencilphysics-v1
The code is fully decoupled from the UI, so whenever new evaluations finish, running the scraper automatically pushes fresh snapshots to the dashboard. If anyone else is working on automated local extraction layers for these metrics, I'd love to see what you've got going on.
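A minimal sketch of how a snapshot push like that could work, assuming the dashboard simply watches a JSON file on disk. The file layout, the `generated_at` field, and the `write_snapshot` name are illustrative assumptions. The temp-file-then-rename step is there so a hot-reloading frontend never reads a half-written file.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def write_snapshot(records, path):
    """Write records to `path` as JSON, replacing any old snapshot atomically."""
    payload = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "records": records,
    }
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

Calling `write_snapshot(rows, "snapshot.json")` after each scrape, with `rows` as a list of plain dicts, would be enough for a frontend that reloads when the file changes.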
Thx!