#Tiburon Kimi K2.5 benchmark: SOTA or nota?

lethal root Jan 28, 2026, 10:49 PM

Kimi K2.5 is the first open-weight model to achieve a successful run on the Tiburon Benchmark, a devops/terminal benchmark. However, there are many caveats to weigh when evaluating its performance, and my recommendation is to skip this model for Software Engineering (SWE) tasks.

Open-weight model benchmark caveats

This analysis rests on a limited number of runs, and results may differ by inference provider because of differences in quantization, caching, token limits, and hardware. Unlike proprietary models, which are served only through official reference implementations, open-weight models like Kimi K2.5 can vary significantly in performance depending on the specific configuration and hardware used. In my testing environment, I have no access to the Chinese-hosted reference implementations. Instead I use United States-based providers through OpenRouter, choosing non-quantized versions or the highest-quality configurations available from the most reputable providers that I have used directly in the past.

Kimi K2.5 benchmark performance summary

Kimi K2.5 failed 4 of its 6 runs in a typical failure pattern of overconfidence and premature signaling of success. This is the predominant failure mode among models, and this benchmark specifically exposes this pattern.

Over time I've highlighted the models that succeed, which is visible in the benchmark table below, and I've also highlighted the models that don't reward-hack or hallucinate about task completion, even if they didn't always complete the runs within the time limits. GPT-5.1-Codex-Max and Minimax M2.1 are notable ones that I've recommended for their honesty and persistence in verifying their outputs and not declaring success prematurely or falsely. Their inability to consistently complete the task was not ideal, but their consistency in not declaring that they did was a signal that those models would probably be trustworthy for other assignments better suited to their relative knowledge-levels and reasoning abilities. In a word: trustability.

I would normally end this review of Kimi K2.5 here, calling it a typical untrustable model. However, the model did complete a single successful run. This demonstrates that the model's massive 1 trillion parameter count gives it the knowledge to complete the task when it applies itself, but it struggles to steer itself with a consistent, reliable approach across all runs. The interleaved reasoning traces show the model mostly narrates its process instead of using reasoning to assess itself or make meaningful and substantial pivots. They are essentially user-preference dressing on top of a model that is likely strong without reasoning at all, like Kimi K2 was relative to other open-weight models.

For me, this is a lot like GLM-4.7 in that it is unreliable and untrustworthy: it provides the appearance of completion superficially, but seems to chase rewards in the following observed patterns:

Running extensive but shallow verification commands
Cherry-picking positive evidence from those commands
Ignoring negative evidence from those commands
Rationalizing that incomplete work is standard or sufficient, even if that isn't aligned with stated requirements
Presenting well-formatted and well-structured "success" summaries despite failing the stated requirements

This is probably enough reasoning depth to pass many benchmarks of all types. But it is not even close to enough for long and challenging agentic sessions where the goal is to reliably complete complex tasks in a real environment with precision and care.

Kimi K2.5 benchmark notes

Updated benchmark as of January 27, 2026 (Kimi K2.5 update)

Tier 1 (100%) 18/18

gpt-5.2-codex-high
Cmds 45.7 (37%) | Time 18m56s (19%) | Cost $0.440 (37%)
Tokens In 966.1K · Out 14.9K · Reason 11.6K | Total: $1.32

claude-opus-4.5-thinking-16k
Cmds 35.7 (7%) | Time 17m7s (15%) | Cost $0.774 (4%)
Tokens In 661.3K · Out 6.7K · Reason 1.6K | Total: $2.32

gpt-5.2-codex-low
Cmds 34.3 (4%) | Time 24m28s (7%) | Cost $0.358 (7%)
Tokens In 629.7K · Out 4.9K · Reason 2.7K | Total: $1.07

Tier 2 (89%) 16/18

gpt-5.2-medium
Cmds 33.0 (33%) | Time 24m52s (87%) | Cost $0.293 (24%)
Tokens In 519.7K · Out 8.7K · Reason 6.2K | Total: $0.88

gpt-5.2-high
Cmds 37.0 (11%) | Time 34m18s (63%) | Cost $0.404 (20%)
Tokens In 805.7K · Out 15.5K · Reason 12.3K | Total: $1.22

claude-opus-4.5
Cmds 44.0 (16%) | Time 20m15s (11%) | Cost $0.836 (19%)
Tokens In 836.3K · Out 5.4K · Reason 0 | Total: $2.51

gpt-5.2-codex-med
Cmds 43.3 (27%) | Time 44m50s (50%) | Cost $0.596 (47%)
Tokens In 912.8K · Out 8.9K · Reason 5.9K | Total: $1.79

Tier 3 (67%) 12/18

gemini-3.0-flash-high
Cmds 28.7 (28%) | Time 23m34s (34%) | Cost $0.153 (38%)
Tokens In 499.5K · Out 6.2K · Reason 3.2K | Total: $0.47

gpt-5.1-codex-max-high
Cmds 44.0 (36%) | Time 35m41s (57%) | Cost $0.484 (41%)
Tokens In 1.1M · Out 15.6K · Reason 12.8K | Total: $1.45

Tier 4 (56%) 10/18

gemini-3-pro-preview-high
Cmds 30.3 (11%) | Time 32m38s (33%) | Cost $0.437 (4%)
Tokens In 601K · Out 9.9K · Reason 7.3K | Total: $1.31

Tier 5 (33%) 6/18

gpt-5.2-low
Cmds 43.7 (33%) | Time 41m22s (36%) | Cost $0.350 (25%)
Tokens In 919K · Out 9.3K · Reason 5.5K | Total: $1.04

gpt-5.1-high
Cmds 45.7 (19%) | Time 25m44s (6%) | Cost $0.436 (17%)
Tokens In 1.2M · Out 15.6K · Reason 15.2K | Total: $1.31

gpt-5.1-codex-high
Cmds 75.0 (13%) | Time 37m40s (27%) | Cost $0.827 (23%)
Tokens In 3.2M · Out 31.0K · Reason 25.7K | Total: $2.48

Tier 6 (28%) 10/36 special case

kimi-k2.5 (6 runs, 1 fully successful, 1 partial 4/6, 4 failed at 0/6)

kimi-k2.5
Cmds 39.3 (14%) | Time 35m33s (16%) | Cost $1.052 (25%)
Tokens In 847.6K · Out 5.3K · Reason 1.8K | Total: $6.31
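
As a minimal sketch of the arithmetic behind the "10/36" and "28%" figures, the snippet below sums per-run subtask scores. It assumes each run is scored out of 6 subtasks, which is what the "4/6" and "0/6" notation and the 18-point scale for 3 runs imply. The function name `tier_score` is only illustrative.

```python
# Subtask scores per run, taken from the summary line above (each run scored out of 6).
SUBTASKS_PER_RUN = 6
kimi_runs = [6, 4, 0, 0, 0, 0]  # 1 full success, 1 partial 4/6, 4 failed at 0/6


def tier_score(run_scores, subtasks_per_run=SUBTASKS_PER_RUN):
    """Return (points earned, points possible, percentage rounded to a whole number)."""
    earned = sum(run_scores)
    possible = len(run_scores) * subtasks_per_run
    return earned, possible, round(100 * earned / possible)


print(tier_score(kimi_runs))  # all six runs -> (10, 36, 28)
print(tier_score([4, 0, 6]))  # first batch of three runs only -> (10, 18, 56)
```

The second call shows how the same model would have landed at 10/18 had only the first batch of three runs been counted.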

Tier 7 (0%) 0-1/18

minimax-m2.1
Cmds 73.7 (23%) | Time 56m13s (7%) | Cost $0.144 (54%)
Tokens In 1.9M · Out 12.7K · Reason 12.7K | Total: $0.43

glm-4.7
Cmds 31.0 (67%) | Time 14m6s (73%) | Cost $0.133 (92%)
Tokens In 492.6K · Out 3.6K · Reason 1.4K | Total: $0.39

gpt-5.2-none
Cmds 31.7 (29%) | Time 29m40s (13%) | Cost $0.263 (34%)
Tokens In 527.3K · Out 3.6K · Reason 0 | Total: $0.80

Testing note - accidental token limitation

I made one mistake with my first batch of 3 runs for Kimi K2.5: I ran them with reasoning enabled but limited the model to 8k total output tokens. In my benchmark, a reasoning model is usually given up to 16k tokens to reason + 8k output for a total of 24k output tokens per turn. After doing trajectory analysis, I was able to determine that my accidental token limitation did not truncate any single output. The model never actually attempted to use that much reasoning, nor did it output enough to approach the limit on any given step.
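
A truncation check of this kind can be sketched as below. The budgets come from the paragraph above. The per-turn token counts, the 90% margin, and the helper name `turns_near_limit` are hypothetical and only illustrate the idea, not the actual trajectory tooling.

```python
# Per-turn budgets from the post: 16k reasoning + 8k output when configured as intended.
REASONING_BUDGET = 16_000
OUTPUT_BUDGET = 8_000
INTENDED_LIMIT = REASONING_BUDGET + OUTPUT_BUDGET  # 24k total per turn
ACCIDENTAL_LIMIT = 8_000  # what the first batch actually ran with


def turns_near_limit(turn_tokens, limit, margin=0.9):
    """Return indices of turns whose total output reached margin * limit."""
    return [i for i, used in enumerate(turn_tokens) if used >= margin * limit]


# Hypothetical per-turn totals for one run (illustrative only, not measured data).
example_turns = [120, 310, 95, 1_450, 220]

print(INTENDED_LIMIT)
print(turns_near_limit(example_turns, ACCIDENTAL_LIMIT))
```

An empty list means no turn came close to the cap, which is the condition described above for all three accidentally limited runs.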

Essentially, that gave me 6 runs for this model instead of 3. And it shows why running the task more than 3 times gives a much clearer signal of a model's true capabilities and consistency levels, something I noted as a weakness in my previous reports about the Tiburon Benchmark. Sticking to 3 runs is a personal trade-off to keep my out-of-pocket costs down and focus on immediate signals of intelligence rather than absolute precision. I've recommended Artificial Analysis and LiveBench for their much more robust benchmarks and will continue to do so.

Had I stopped at only the first 3 runs (the 8k-token-limited ones), I would've concluded that this model achieved a 10/18 success rate! However, with 3 more attempts using the full 24k-output-token limit, it then proceeded to fail each of the next three runs. Again, the assessment of the trajectories doesn't show that the token limit was the source of the model's inconsistency. It showed the same shallow and brief narrative patterns of reasoning throughout all runs, and the same tendency to declare success prematurely.

Kimi K2.5 runs (per attempt, plus mean and coefficient of variation)

24k-output-token limit

Run              | Cmds | Time   | Cost   | In Tok  | Out   | Reason | Avg/Turn
-----------------|------|--------|--------|---------|-------|--------|----------
failed-24k-1     |   36 | 36:51  | $0.88  | 708.9K  | 5,261 |  1,635 |    44.19
failed-24k-2     |   41 | 33:54  | $0.99  | 797.2K  | 4,814 |  1,728 |    41.14
failed-24k-3     |   37 | 34:10  | $0.99  | 785.5K  | 5,327 |  2,459 |    64.71
-----------------|------|--------|--------|---------|-------|--------|----------
Mean             |   38 | 34:58  | $0.95  | 763.9K  | 5,134 |  1,941 |    50.01
CoV (%)          | 5.68 |  3.80  |  5.20  |   5.12  |  4.44 | 18.99  |    20.93

8k-output-token limit (accidental limitation - no truncation occurred)

Run              | Cmds | Time   | Cost   | In Tok  | Out   | Reason | Avg/Turn
-----------------|------|--------|--------|---------|-------|--------|----------
partial-8k       |   44 | 39:06  | $1.19  | 964.0K  | 5,849 |  1,488 |    33.07
failed-8k        |   31 | 25:09  | $0.72  | 575.9K  | 4,741 |  1,734 |    54.19
successful-8k    |   47 | 44:07  | $1.54  |   1.3M  | 6,023 |  1,978 |    41.21
-----------------|------|--------|--------|---------|-------|--------|----------
Mean             |   41 | 36:07  | $1.15  | 931.3K  | 5,538 |  1,733 |    42.82
CoV (%)          |17.08 | 22.22  | 29.25  |  29.83  | 10.25 | 11.54  |    20.31
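
For anyone reproducing the summary rows, here is a minimal sketch using the Cmds and Reason columns of the 24k table. Two points are inferred from the numbers rather than stated in the post: the CoV figures match the population standard deviation (not the sample one), and the Avg/Turn column matches reasoning tokens divided by (commands + 1).

```python
from statistics import mean, pstdev

# Cmds and Reason columns from the 24k-output-token table above.
cmds = [36, 41, 37]
reason = [1_635, 1_728, 2_459]


def cov_percent(values):
    """Coefficient of variation in percent, using the population standard deviation."""
    return 100 * pstdev(values) / mean(values)


def avg_reason_per_turn(reason_tokens, commands):
    """Assumed definition: reasoning tokens divided by (commands + 1) turns."""
    return reason_tokens / (commands + 1)


print(round(mean(cmds)), round(cov_percent(cmds), 2))      # 38 5.68
print(round(cov_percent(reason), 2))                       # 18.99
print([round(avg_reason_per_turn(r, c), 2) for r, c in zip(reason, cmds)])
# [44.19, 41.14, 64.71]
```

With only three runs per batch, a single outlier moves the CoV considerably, which is part of why the 8k batch looks so much noisier than the 24k batch.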

Reasoning analysis

Qualitatively analyzing its reasoning traces, I'd say this model by default does not emulate what the more successful models do, which is to make a step-by-step plan without prompting. It instead narrates what it is doing, more so than showing any particular structured approach.

When left on its own, this model is more oriented towards execution than planning. I'd be most curious to see whether it would perform better given a planning prompt or an instruction to think through the task systematically.

Observations on consistency

The one successful run shows me that the model knows what to do, benefiting from that large parameter count and generalist nature. But it fails to apply that knowledge consistently across all runs. It sometimes performs shallow verification tasks, sometimes attempts impossible ones, and most of the time treats partial success as complete success and signals to the end-user too early that the task is done.

This model would benefit from additional post-training in adopting a planning approach from the start, and probably a re-tuning of its weights towards terminal/coding tasks. The knowledge is in there, it's just not surfaced on its own. For SWE tasks this model is likely going to be highly dependent on the prompt and guardrails around it for success. Not a failure, but not plug and play.

Caveats to Kimi K2.5 - who can host it

It is a massive 1 trillion parameter model, compared to open-weight models Minimax M2.1 at 229 billion parameters, and GLM-4.7 at 358 billion parameters.
The size of this model means that it will have a lot of knowledge and qualifies as a general purpose model. It is probably meant for planning, reasoning, orchestrating, and other complex tasks, including coding. While their marketing wants to put this model up against GPT-5.2, I think they'd be better served comparing their model and approach to Gemini 3.0 Pro Preview: a strategy of using a larger parameter count model that has dense knowledge but isn't particularly aligned towards SWE or agentic work, and thus performs unevenly in AI harnesses like Windsurf's Cascade. GPT-5.2's post-training is on a completely different level for SWE tasks, and it is an unflattering comparison for Kimi K2.5.
The 1T parameter size also means that providers have to have powerful hardware to run this model, which makes it more expensive to run. Because I don't run this model outside of United States-based inference providers, I incurred this cost with one of the more expensive providers. It'll be interesting to see if any provider can offer this model with caching at a reasonable price point aside from Moonshot AI direct. Without that $0.10 per million tokens caching, this model is unjustifiably expensive for SWE tasks, given its performance. On the Tiburon Benchmark Kimi K2.5 was more expensive per run than Opus 4.5-Thinking because of the lack of cached token discounts.

Hosting Update as of January 29th, 2026

Several US hosting providers have now stepped up to match the Moonshot $0.10 per million tokens cache read pricing. This model is now approximately 2x more expensive than Gemini 3.0 Flash token for token, or 3x more expensive than Minimax M2.1.
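
To make the effect of cache-read pricing concrete, here is a rough cost sketch. Only the $0.10 per million cache-read price and the average token counts (847.6K in, 5.3K out) come from this post. The uncached input price, output price, and cache hit rate below are hypothetical placeholders, and `run_cost` is an illustrative helper, not any provider's billing formula.

```python
def run_cost(in_tokens, out_tokens, in_price, out_price,
             cache_read_price=None, cache_hit_rate=0.0):
    """Estimate one run's cost in dollars; all prices are dollars per million tokens."""
    cached = in_tokens * cache_hit_rate if cache_read_price is not None else 0.0
    uncached = in_tokens - cached
    cost = uncached * in_price + out_tokens * out_price
    if cache_read_price is not None:
        cost += cached * cache_read_price
    return cost / 1_000_000


# Average Kimi K2.5 run from the benchmark table: 847.6K input, 5.3K output tokens.
IN_TOKENS, OUT_TOKENS = 847_600, 5_300

# Hypothetical prices and hit rate, for illustration only.
IN_PRICE, OUT_PRICE, HIT_RATE = 0.60, 3.00, 0.9

no_cache = run_cost(IN_TOKENS, OUT_TOKENS, IN_PRICE, OUT_PRICE)
with_cache = run_cost(IN_TOKENS, OUT_TOKENS, IN_PRICE, OUT_PRICE,
                      cache_read_price=0.10, cache_hit_rate=HIT_RATE)
print(f"no cache: ${no_cache:.3f}  with cache reads: ${with_cache:.3f}")
```

Because agentic terminal runs resend a long, mostly unchanged context on every turn, input tokens dominate the bill, so the cache-read price matters far more than the output price.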

Bottom line

I want to give the caveat that my test of this model was through NovitaAI and then through OpenRouter. Since these open-weight models are tunable in terms of quantization, tokens, temperature, etc. by inference providers, I don't want to suggest that I've gotten the full experience of the Kimi K2.5 model. But I've gotten a slice of what you could receive out-of-the-box if you use Windsurf as your inference provider. Windsurf and its sub-providers may have different configurations that could yield better results. Moonshot's reference implementation may also be better optimized. I just cannot access and test those.

Kimi K2.5 is a generalist with impressive surface-level abilities and knowledge depth, but it is unreliable in the Tiburon Benchmark. In short, I'd probably skip this model for SWE tasks, just like I recommended skipping GLM-4.7 for similar reasons.

Given that it did complete one run successfully, and is the largest open-weight model, there's a lot of data in there, and there are benefits to that. But one cannot tap those benefits consistently out of the box without prompt engineering and guardrails to address its deficiencies in honesty, planning, verification, and consistent adherence to task requirements. It's more of a tinkerer's model than a plug-and-play solution inside something like Windsurf.

It's large enough that it may exhibit different behavior depending on the prompt, the prompting language, and the context. But if it isn't trained to not reward-hack its outputs, it just isn't worth my time and likely not yours.

Compared to MiniMax M2.1, whose training specifically avoids such reward-hacking behaviors on SWE tasks and whose design specifically targets coding tasks with clear guardrails, I think Kimi K2.5 and GLM-4.7 are both trailing in terms of their post-training methodology and focus on coding tasks that prioritize reliability and consistent adherence to task requirements.

Results and models like this have led some to lose trust in benchmarks altogether. Sometimes you get well-earned benchmark results from models like Opus 4.5, GPT-5.2, GPT-5.2-Codex, Minimax M2.1. And other times you get impressionistic outputs tuned on human preferences that don't reflect true consistency or capability for any particular task.

lethal root Jan 28, 2026, 11:08 PM

Small Vibes Check

I ran a few prompts with this inside a different harness that is capable of delegating to parallel subagents, the so-called "orchestration" pattern of agentic workflows, where a parent agent spawns child agents, prompts them on your behalf, and reviews their output to synthesize or analyze it.

As for what post-training Moonshot AI focused on, it seemed evident that this is where the model shines and wants to be. Kimi K2.5 happily spawns agents and adopts the orchestrator role for just about every task you can imagine without being told to do so.

The utility of that would be amplified, of course, if it had shown that it was more trustworthy about verification steps and completion of tasks. But an orchestrator that doesn't answer honestly compounds the "black box" experience already inherent in working with language models. It gets more room to mask actual results and tell you what you want to read: that everything is perfect and finished, regardless of whether that is true.

So yeah, cool potential, and I would like more models and harnesses to codify and train on operating in that sort of orchestration pattern. But that must come with more focus on verifiable and reliable output.

crimson quest Jan 29, 2026, 1:11 AM

very cool test thanks much but i have few questions, was this whole test made with just thinking mode and not agent swarm which is the highlight of the model how i understood?

and what about the system prompts like windsurf IDE has and then the now secret plan mode that the model would run before acting wouldnt it increase the score then in windsurf for coding ?

wait a minute, gpt-5.2-codex-low is tier 1 model? so it achieves same success like the gpt-5.2-codex-high ?

what is the benchmark made of then? how hard is it?

lethal root Jan 29, 2026, 5:12 AM

crimson quest wait a minute, gpt-5.2-codex-low is tier 1 model? so it achieves same success l...

Yes, there was a thread about it before the Discord re-organization. The summary of it was that GPT-5.2-Codex-Low shares the same base knowledge and post-training and is really well engineered to not dramatically lose performance on this particular devops/terminal type task. Its approach is certainly more "brute-force" than GPT-5.2-Codex-High's more reasoning-centric approach, but it also showed that GPT-5.2-Codex-High is perhaps starting to deliberate a bit too much on this particular task.

lethal root Jan 29, 2026, 5:14 AM

crimson quest very cool test thanks much but i have few questions, was this whole test made wi...

Yes, this benchmark runs in a simple custom harness with a single tool, bash. This is to ensure replicability and fairness across all model runs. The model had its thinking/reasoning mode enabled.
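
For a sense of what a single-tool harness means in practice, a bash tool executor can be as small as the sketch below. This is not the actual benchmark harness: the function name `run_bash`, the timeout, and the return shape are all assumptions for illustration.

```python
import subprocess


def run_bash(command, timeout_s=60):
    """Run one model-issued command through bash and return what the model sees next."""
    try:
        result = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Report the timeout as tool output instead of crashing the agent loop.
        return {"exit_code": None, "output": f"timed out after {timeout_s}s"}
    return {"exit_code": result.returncode, "output": result.stdout + result.stderr}


print(run_bash("echo hello"))
```

Keeping the tool surface this small means every model gets the same interface, so differences in results come from the model rather than from harness features.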

lethal root Jan 29, 2026, 5:18 AM

crimson quest and what about the the system prompts like windsurf IDE has and then the now se...

Due to the custom harness there is a stable and very generic system prompt that is specific to the task at hand and only about 100 tokens long. I work with the raw API version of the model and avoid IDE-based testing because of the shifting configurations and lack of visibility into system prompts, model routing, hyperparameters, etc.

lethal root Jan 29, 2026, 5:20 AM

crimson quest what is the benchmark made of then? how hard is it?

Hard enough that the models listed in the benchmark are the only ones that have completed at least 1 fully successful run. The one exception is Opus-4.1-Thinking, which also completed 1/3 runs successfully, but I dropped that model from the chart as Opus-4.5 is strictly better and obsoletes it.

crimson quest Jan 29, 2026, 3:07 PM

https://tenor.com/view/nice-nooice-bling-key-and-peele-gif-4294979

i still wish that you would test in windsurf 🙏

with agent swarm mode only

lethal root Jan 30, 2026, 10:23 PM

crimson quest i still wish that you would test in windsurf 🙏

With the new Arena Mode and Kimi K2.5 launching on promo for free, you have gotten your wish!

crimson quest Jan 30, 2026, 10:24 PM

omg i didnt realise the kimi k2.5 is already implemented so cool thanks much for info : )

you going to recreate your benchmark for it now in windsurf ? 🙏