#Kimi K2 0711

1 messages · Page 2 of 1

hollow wave
#

fp16 is 2TB

craggy lily
#

Oh I forgot that detail

dry hazel
#

it's native fp8

#

fp8 is 1TB

craggy lily
#

Yeah AWQ/GPTQ is a must for H100s then

hollow wave
#

you could do 4 bit quant probably

craggy lily
dry hazel
#

even with the right jinja template still broken, but yeah at least it's a decent perf test

craggy lily
dry hazel
#

yea

#

anyways, enough playing around for me, shutting down the instance

craggy lily
dry hazel
#

🫡

#

I wish model providers would publish that

limber skiff
#

found a context benchmark, kimi seems decent, funny, i guess maverick, even tho its bad, its was the best open source model with context for a while

umbral hornet
#

I'm almost certain someone has asked this before, but will Kimi K2 Instruct be availble or OpenRouter, or just the base model?

winter jackal
#

we do not have the base model

umbral hornet
#

Must have been my sampling params, thanks

soft tapir
dry hazel
#

fireworks only about 30 tps too

#

😔

#

together is about 40

hollow wave
fallow fulcrum
#

Was about to comment on DeepInfra.

hollow wave
#

deep infra seems to have fixed it

winter jackal
#

yeah they said they’re working on it

hollow wave
#

are we going to get more providers with tools support? novita & moonshot are very slow atleast right now

winter jackal
#

DeepInfra claims to have it

#

but is struggling

#

Targon is working on stabilizing

heady pond
#

Understandable, we don't often get trillion+ parameter models

hollow wave
#

true

winter jackal
heady pond
#

Yep - that's what I meant 😅

winter jackal
#

might have one more provider tonight....

barren wadi
#

I think it's a little deepseek moment

#

For moonshot

#

I wonder if deepseek r2 will do a deepseek

dry hazel
brittle cipher
dry hazel
#

ooohhhh shit

#

nicee

winter jackal
#

we'll have it shortly :P

#

they're just getting it stable

dry hazel
#

let's go

#

that's awesome they got it going so quick

#

seems to work great too

#

is it a small context window though?

winter jackal
#

hehe we kind of convinced them

#

no it's full ctx

dry hazel
#

wow

#

awesome

winter jackal
#

wait you gave me that anecdote right @dry hazel

dry hazel
#

yeah

#

😄

#

is that what convinced them? hahaha

winter jackal
dry hazel
#

hell yeah

winter jackal
#

I can't actually take the credit. lots of people wanted it on groq on socials lol

dry hazel
#

true true :P

#

what is the price?

#

I don't see it on their site

winter jackal
#

I actually don't know yet lol

#

it like just came online

dry hazel
#

heh

#

well, I've already got it making me working chess games!

winter jackal
#

i zoomed in on a picture posted on x and see possibly $3 out

dry hazel
#

wow

#

that's not bad at all

#

I was expecting something like $2/$8

winter jackal
#

well it was a cut off slack message

#

so don't take my word for it lmao

dry hazel
#

:P

#

feels so good to have this speed

#

the only thing that can do it like this is gemini 2.5 flash or 2.5 pro on a 0 thinking budget

winter jackal
dry hazel
#

and those are not very good models, with zero thinking

soft tapir
brittle cipher
#

after it's "stable"

winter jackal
#

tonight hopefully

#

waiting on them

soft tapir
#

bet bet

#

this is huge

willow thicket
#

yea i’m pumped…this model at those speeds will be crazy. just hope it doesn’t crash instantly or they drop context size lol

dry hazel
#

also working tool use

#

like other groq models

#

is huge

winter jackal
#

they are working on it still

#

want us to give them an hour

dry hazel
#

they are saying they've got "something special" letting it run at 450 TPS too....

#

...if they release with that :O

#

damn

#

I just got routed to that

#

absolutely massive if this is actually gonna stay and is full precision

winter jackal
#

yeah this is what I was told lol:

Give us like 1 more hour... trying to roll out a throughput improvement

dry hazel
#

holy

#

this will unironically be the best agent coding experience there is

#

crazy

#

way more than I would've hoped for too, by double

winter jackal
#

anyone willing to test deepinfra tool calling if i enable

dry hazel
#

sure

winter jackal
#

I was seeing some jank but working on some other stuff so

#

they say it's working for them

#

ok give it 5 mins and it should show tools as supported param

vast crater
#

Getting Groq from the API at an average 500 tps

#

Crazy fast

brittle cipher
#

oh it got faster

#

seems intermittent

vast crater
#

Getting 600+ now

brittle cipher
#

maybe specdec artifacts

brittle cipher
vast crater
#

Wonder what's the free limit daily on Groq for Kimi K2

dry hazel
#

so you're just getting randomly routed to instances which have the optimization / change

brittle cipher
vast crater
vast crater
#

Amazing

dry hazel
#

so it's probably getting higher and higher %

brittle cipher
#

idk alternating between those two prompts shows the us presidents one consistently much faster than the joke one

dry hazel
#

which I think causes the discrepancy?

#

(same arch as deepseek)

brittle cipher
#

has anyone that isn't deepseek ever gotten MTP to work on deepseek-like models

#

how do you even trigger it

dry hazel
#

wdym

#

it's like a draft model built into the architecture

#

iiuc

winter jackal
#

deepinfa kimi tool calling should be up

dry hazel
#

checking

winter jackal
#

tysm

frosty mural
brittle cipher
winter jackal
#

ok refresh in ~5 mins

#

👀

dry hazel
winter jackal
#

yeah

dry hazel
winter jackal
#

can I share this with them?

dry hazel
#

tested compared to moonshot and moonshot isn't doing the same

#

yep

winter jackal
#

ty

dry hazel
#

(moonshot provider, no extra text)

tropic solar
#

Doesn't groq have special hardware that basically works best with smaller weight models?

#

Like a ton of chips each holding a small amount of vram

#

I remember reading up on how it works and it seems odd they got a 1t model going and so quick

dry hazel
#

👀

frosty mural
winter jackal
brittle cipher
frosty mural
#

I will learn to read one of these days

dry hazel
#

so this is their moment to show they can really scale up

#

(same with Cerebras fwiw, if Cerebras can do it it would be huge)

brittle cipher
#

also theyve been working on a second version of the lpu

frosty mural
#

I think SRAM isn't easy to cram on to the chip, which is why they have such small VRAM. Extremely fast, but very limited in storage

dry hazel
#

it's not really that simple

#

when you have the chip-to-chip / chip-to-interconnect bandwidth that Groq or Cerebras has, normal understanding of VRAM changes

#

for example, Cerebras literally goes layer by layer streaming from interconnect at Petabits/s

#

they don't even NEED to have all the weights loaded like a GPU does

#

it's like CPU offloading but 10000x faster

tropic solar
#

So they must have experts distributed among the chips somehow? Would be a huge ass cluster

soft tapir
dry hazel
brittle cipher
dry hazel
#

that's expert parallelism like Deepseek has done

#

which is a normal GPU thing

tropic solar
#

Either way I'm imagining a significant amount of silicone is being dedicated to running k2

dry hazel
#

for sure

#

many many chips

brittle cipher
soft tapir
tropic solar
#

Oh wow 1 and 3 is very reasonable pricing

dry hazel
winter jackal
tropic solar
#

I've never seen so many providers jump on a model before lol

winter jackal
#

deepseek lol

tropic solar
#

Not so quickly though? Even groq

odd ember
dry hazel
# winter jackal not yet but they're considering it

$1 input pricing is the point where it's about about 3x the price of a Claude 4 Sonnet cache read, which matters a TON for agent coding... it'd probably make a big difference if they implemented caching :)

tropic solar
dry hazel
#

oh man

#

Groq in opencode

#

it's working

#

and it's so good

#

the slowest thing happening is running pnpm install

odd ember
dry hazel
#

yes

odd ember
#

what model would you match it at

#

on terms of intelligence (for coding of course)

#

roleplay gooners are gonna go so hard on kimi k2 i already see it

dry hazel
#

lower than claude 4 sonnet by a bit in terms of first-prompt intelligence

#

what'll really matter is how coherent it stays at ~60k context

odd ember
#

and being pretty private

dry hazel
#

indeed

odd ember
#

feel like its worth

dry hazel
#

see this vid

odd ember
#

thanks

dry hazel
#

he's got a solid benchmark on its coding performance, which generally agrees with all my personal vibe tests

#

(his only complaint was the speed, which is now SOLVED)

odd ember
#

i wanna see how it performs on aider ngl

#

isnt aider not that good compared to any other coding agents?

dry hazel
#

I'm a little skeptical of aider right now because aider doesn't bench for agentic coding very well

dry hazel
#

claude 4 sonnet is >>>> gemini 2.5 pro in agentic coding

#

but in just "write me a file that does a thing" gemini 2.5 pro is a bit better

odd ember
#

you using opencode, do you like it personally compared to any others?

dry hazel
#

claude code is the best for sure, but opencode is the best open source CLI

odd ember
#

one thing ihate about gemini 2.5 pro is it would fail tool calls too much for me

dry hazel
#

indeed

#

and it doesn't follow instructions very well

#

no prompt caching adding up 😂

odd ember
#

cache enables automatically correct

dry hazel
#

yes but groq doesn't have it

odd ember
#

do you think we will see it go more in speed

willow thicket
#

omg this is amazing

odd ember
#

groq or other providers like cerebras

dry hazel
odd ember
odd ember
dry hazel
#

lol

#

gonna try it in VSCode BYOK now

odd ember
#

holy shit its pretty fast

dry hazel
#

yeah lol

tropic solar
#

anybody RP with kimi k2 (sfw)
how it hold up against deepseek v3?

#

I know it writes well, I've tested superficial RP
but I mean how does it hold up

winter jackal
#

btw we fixed a bunch of the chatroom stuff tonight

odd ember
#

was really stressed out when i would copy messages

#

and it would never actually copy

tropic solar
dry hazel
#

hitting a bit of turbulence with Groq

#

😄

willow thicket
#

same

#

was chugging now hit a wall

dry hazel
#

anyway

#

it's stabilizing a bit

#

I'm sure they're just getting a ton of demand

#

They really cooked with this

tropic solar
#

yeah I am wondering what their capacity is

#

not like they can just spin up another 1000 chips easily lol

#

actaully 1000 is probably on the low end

#

for what it takes to run this for them

winter jackal
#

they said they can't repro it

dry hazel
#

still getting it atm with them

#

(did NOT happen with groq)

winter jackal
#

can you shoot me the req you're making

#

if you can log it

dry hazel
#

Hmm

#

lemme see

#

(for the record, here's groq, no changes other than OpenRouter changing "Allowed Providers")

winter jackal
#

what's your temp?

dry hazel
dry hazel
#

I can turn on the "request logging" thing for you for a sec if you'd like?

winter jackal
#

yeah sure

#

I should be able to grab

#

🙏🏼

willow thicket
#

this is so sick lol

tropic solar
#

wow so groq chips are 230mb sram
full 128k context and native fp8 weights like 1.5tb
so 6500~ chips
20 racks

willow thicket
#

this says one of my requests hit 1000 t/s 😂

dry hazel
# winter jackal I should be able to grab

gen-1752551759-0NUmDNQ6wVJw9BqjhZFq
gen-1752551759-PvaUzwvDzIqBLjngUyKV
gen-1752551763-Oijmpqj3DhT6T1pdFHKm

just running the same thing in a random public cloned repo

winter jackal
#

ty

odd ember
#

groq is running hot

tropic solar
#

do we know if groq is doing full fp8?

dry hazel
dry hazel
#

the thing with SRAM is you can CHANGE it really fast

#

unlike HBM

tropic solar
#

you can change it but it still has to load from some other medium of storage?

#

isn't there a bottleneck there

#

this is way out of my dpeth

#

I just play with llms lol

dry hazel
#

Cerebras both has their wafer-scale stuff

#

but the other KEY component of their architecture

#

is EXTREMELY high throughput and low latency interconnect

#

between hundreds of chips

tropic solar
#

in any case I'm sure it was a significant hardware commitment to run this for groq

dry hazel
#

so they literally load model weights in LAYER BY LAYER on demand, and can split it up even further (e.g expert parallelism)

#

and it's still 100x faster than other providers

dry hazel
tropic solar
#

parasail not looking too attractive atm

dry hazel
#

nope

#

this is gonna be what really puts Groq on the big stage imo

#

every other model they've done has been "nice" but not important

#

this is the first one where this model could literally challenge Claude 4 Sonnet in REAL usage

#

not just toy small model usage

odd ember
dry hazel
#

and that makes all the difference :)

tropic solar
#

why k2 and not deepseek is my question

tropic solar
dry hazel
#

I'm guessing they were waiting for "the next deepseek" (this) in order to rollout the necessary changes

#

they clearly CAN do deepseek right now, but the marginn of improvement over other improvements isn't really there anymore

#

others are doing 100 TPS, and it's not frontier anymore

#

Kimi K2 is DS V3 architecture, so they 100% can do it

tropic solar
#

all while other providers can barely break 10-15 except together

dry hazel
#

yep

tropic solar
#

wonder fi they were weaiting for r2

#

it's nice how concise this model is as well

#

and non reasoning

#

I'm sure makes it easier to inf

odd ember
#

and their privacy policy doesn't say anything about not storing prompts

tropic solar
#

We do not store or logany personal data you send as Input to the “serverless” and “dedicated”versions of the Service, and we will not inspect such data withoutyour permission, and will only retain such data for as long as is necessary toprovide the Service to you (i.e., for the time it takes to generate Output anddeliver that Output to you).

soft tapir
#

@winter jackal as always thanks for supporting these models so quickly. A bit of feedback is that on groq it's a bit more unstable with tool calls and returns Upstream error from Groq: Failed to call a function. Please adjust your prompt. See 'failed_generation' for more details. more often. But I guess it's in "preview" on groq for a reason so we'll give them time to get better. Also if groq adds cache this models becomes even more better.

Anyways here's a little preview of what it built: #app-showcase message

dry hazel
#

I’m guessing it’s nothing, but @winter jackal might wanna send that to Groq just in case?

soft tapir
# dry hazel

I agree with this guy, groq is def lowering the quality of the model somehow

dry hazel
#

Interesting

#

I’m not quite sure myself yet

soft tapir
#

It's a very slight quality change but I can feel it

dry hazel
#

It’s hard to tell because it goes so fast that the speed itself makes it feel “lower quality”

#

lol

soft tapir
dry hazel
gray mango
#

it doesn't speak portuguese very well...

#

sometimes spills nonsensical sentences and confuses genders

wooden finch
novel cipher
#

It isn't easy. I just had 2.5 Flash do a pretty terrible translation on Portuguese earlier today, oddly enough.

#

I think most people would be surprised to see just how much English dwarfs the other languages in terms of training data

novel cipher
#

From the massive Anna's dataset, the top list is English at 23M documents, Chinese at 7M, and Russian at 2.6M. Roughly a 3x decrease each placing. Portuguese has 50x less documents than English. Always kind of blows my mind.

dim tundra
#

This model seems to do really well for RP

#

The format also doesn't fall apart after a few replies

wooden finch
novel cipher
dim tundra
#

Writing is also great, it has humour and seems to have a deep understanding of human reasons

dim tundra
night lotus
# dry hazel

Oh hey, its me. Yeah amazing to see it on groq, but at least initial impression no where near as good as other providers ( mainly moonshot, together ) when it comes to task adherence, code quality and tool calls. In both opencode and roo. Multiple failed tool calls and a tendency to loop on failing tool calls . Similar behaviour to gemini flash. Never saw similar behaviour on other providers.

novel cipher
#

Hmm. This is second person perspective, and the problem is the model keeps switching to using "I" in their own messages, instead of the character's names

#

I'm aware models generally prefer first-person

dry hazel
#

I wonder if they are indeed quantizing, I haven’t seen them say they definitively aren’t

#

Seems unlike them though

night lotus
winter jackal
#

yeah i’m also about to pass out but if you have specific examples i can pass them along. they definitely do things uniquely to get that speed

night lotus
#

But sleep first, if I sit back at my computer right now I won't be able to stop

main trellis
#

Does groq cache?

naive rivet
ruby rivet
#

I still really want benchmarks (or just a benchmark) provided by OR for each provider

#

doesn't need to be exhaustive, just enough to identify problematic providers or differences in quantization

#

i don't think it would be a bad idea to consider the bench scores for which provider to route to, either.

#

if one of them is noticeably worse anyways

waxen path
#

I've been using Kimi K2, and the vibe is really similar to Claude models. It is weak in general chat (even the flop Maverick is better), but the coding performance is very good. I guess this is a pattern we will be seeing when a model is primarily trained on coding and agentic tasks. Creative writing is amazing; the output reminds me of fine-tuned RP models on Hugging Face. So, Kimi K2 is a model that is going to be great for people with specific uses (coding, creative writing), but for general chat, there are better models out there.

surreal lily
#

And its censored with messages stating its open Ai guidelines

short gyro
#

@winter jackal can you add this provider too

hollow shuttle
midnight plank
#

kimi k2 might be the most expensive API endpoint I've everseen

barren wadi
#

Huh

#

Wut

novel cipher
grave palm
#

I enforced Allowed Providers and added groq to the list but when using kimi-k2 on opencode I get this error

#

any idea why this might be happening? other providers just arent fast enough

#

it was working earlier this morning, it isnt now for some reason'

naive rivet
#

Groq is paid

tropic solar
tropic solar
novel cipher
#

It's a kind of custom ungus one I wrote after getting annoyed at the levels of tell-don't-show of most models

#

The relevant part pretty much just says to reply in second person perspective and gives a one-line example

brittle vigil
grave palm
#

is kimik2 via groq BYOK only?

tropic solar
novel cipher
novel cipher
vast crater
novel cipher
#

Sure, but that's irrelevant in the aspect that I'm referring to

#

Honestly google has that problem pretty badly too. Like how many Docs am I really going to need gemini to help me write? Certainly not enough for a giant popover bar every time I open a new one

waxen path
hollow wave
#

oh shit we got groq

willow thicket
#

damn groq dropped max output to 16k, i guess they couldn’t handle it

timid moth
#

Groq is goate

#

1000 rpd

naive rivet
dry hazel
#

Interesting, I hope this is what caused the Groq issues

night lotus
dry hazel
#

Apparently the chat template bug was reducing performance SIGNIFICANTLY

#

after fixing it, perf went from 14% -> 50% on a tool use bench

dim tundra
dry hazel
#

I think the chat template bug was literally erasing tool use results more than 1 tool back or something lol

#

So yeah makes sense!

soft tapir
#

groq is dying

errant parrot
#

I can't find the instruct model on openrouter. Is that correct?

winter jackal
#

do folks think I should add Instruct to the model description?

#

I went ahead and updated the model description to specify it

errant parrot
#

I think it would prevent me from asking the question in the future 😄

#

Thanks

#

Just asked the model to do some tool calls:

<|tool_calls_sectioall_end|><|tool_calls_section_end|>
Derp 😄

soft tapir
dry hazel
#

It’s a bug atm

#

There was a patch this morning fixing tool calling chat template, anyone using Kimi K2 currently w/ tools of any kind, I would reserve judgement until later today

dull slate
dry hazel
#

@winter jackal I noticed Novita added an Anthropic compatible endpoint in about a day… it would be realllyyy nice to have an openrouter Claude Code endpoint for use with Kimi K2 if you guys could make the same :)

#

Open code is great and has good tools/prompting, but still has some usability issues atm

winter jackal
#

there are proxies you can use

#

we aren't gonna make a new api route in a day lol

dry hazel
#

Just that it might not be something SUPER difficult all things considered

winter jackal
#

for 400+ models and 60+ providers? :P

dry hazel
#

(Of course OpenRouter has a lot more constraints that would make it harder than Novita doing it for a single model 🙂 )

#

Yep exactly haha

winter jackal
#

yeah no I hear you

#

we've thought about it a bit, just other things are higher prio

#

(embeddings, large file support for example)

dry hazel
#

You guys are still actively recruiting I assume? I’ve got some friends to send your way :D

soft tapir
dry hazel
soft tapir
#

Like 50%+ of request end up in failed API tool call

dry hazel
#

Yep

soft tapir
#

Unlucky

winter jackal
night lotus
unique breach
#

I asked Kimi-K2 to create a webpage for a procedurally generated 3D planet preview / editor.

Then, I had it add a complex simulation feature, where an asteroid is hurled toward the planet, forming either a moon or a beautiful ring.

Very strong showing in this test –– comparable to Claude Sonnet 4.

night lotus
#

yeah. frontend web & 3d seem to be really strong in kimi

unique breach
dim tundra
#

Btw, is that a one-shot code?

unique breach
unique breach
# dim tundra Btw, is that a one-shot code?

@dim tundra

I split the task into 2 prompts. Not because it failed, but because I came up with the asteroid idea a bit later.

By the way, here are the prompts if you want to try it out with another model.

Create a high-fidelity, interactive webpage that renders a unique, procedurally generated 3D planet in real-time.

Details:
- Implement intuitive user controls: camera orbit/zoom, a "Generate New World" button, a slider to control the time of day, and other controls to modify the planet's terrain.
- Allow choosing between multiple planet styles like Earth, Mars, Tatooine, Death Star and other fictional planets
- Render a volumetric atmosphere with realistic light scattering effects (e.g., blue skies, red sunsets) and a visible glow on the planet's edge. (if the planet has an atmosphere)
- Create a dynamic, procedural cloud layer that casts soft shadows on the surface below. (if the planet has clouds)
- Develop oceans with specular sun reflections and water color that varies with depth. (if the planet has oceans)
- Generate a varied planet surface with distinct, logically-placed biomes (e.g., mountains with snow caps, deserts, grasslands, polar ice) that blend together seamlessly. Vary the types of terrain and relevant controls according to the planet style. For example, the Death Start might have a control called trench width and cannon size.
- The entire experience must be rendered on the GPU (using WebGL/WebGPU) and maintain a smooth, real-time frame rate on modern desktop browsers.

Respond with HTML code that contains all code (i.e. CSS, JS, shaders).
Now, add an button allowing the user to trigger an asteroid, which hits the planet, breaks up, and forms either a ring or a moon.
hollow wave
#

baseten is sending me empty toolcalls

winter jackal
#

in dms or somehting?

hollow wave
#

so its gonna look a bit messy

trail flint
#

anyone know if kimi on groq is diff from other providers?

tropic solar
#

lots of complaints it was underperforming

short gyro
#

How can I use only Groq in OpenRouter interface

gray mango
dry hazel
#

oh shit Baseten doing 160 TPS?

dry hazel
#

😔

#

@winter jackal it seems a bit misleading that 429s are counted as 100% full uptime

winter jackal
#

i mean

restive lance
winter jackal
#

it's not really downtime though

dry hazel
#

considering they're both basically just retriable

#

and a provider spitting a lot of 429s compared to another provider not, should probably be downranked

brittle cipher
#

fwiw crofai claims 300tps at fp8 (albeit small context by 2025 standards)

winter jackal
#

we do backoff from rate limiting providers

dry hazel
#

ah ok, so it does contribute to the dynamic algo, just not the uptime % ?

winter jackal
#

yep

dry hazel
#

that's fair, though I still wouldn't mind something like a badge that is like: High Availability (Green) / Medium Load (Yellow) / Under Heavy Load (Red)

winter jackal
#

yeah I do get your point and I think it's worth considering for sure

#

but I def feel that uptime specifically is probably the wrong way to define that

dry hazel
#

sure sure

brittle cipher
dry hazel
#

when these models have such extreme differences in performance, at this point I'm actually almost never using the dynamic routing algo

#

something to consider might be a variation on turbo which still does dynamic ranking, but picks e.g some upper threshold of performance or something like that

winter jackal
#

yeah, they're almost like entirely different products so to speak right

dry hazel
#

yea

winter jackal
#

mmhmm

#

we've noticed this trend and absolutely want to do more about it

dry hazel
winter jackal
#

it doesn't make sense to have fp4 high throughput low ctx in the same default than full fat lower tps full context w/ tool call etc

dry hazel
#

yeah

winter jackal
#

we do obviously today basically filter out a bunch of endpoints based on your api call (ctx filtering, tool call filtering, etc)

dry hazel
#

right

winter jackal
#

but. should be better

winter jackal
#

it's like throughput and quant and benchmark scores

dry hazel
#

yep

winter jackal
#

gets you into premium tier

#

or something

#

and low tps / low quant / low benchmark / evals is like an unverified lane or something

#

and with this we can onboard the dozens of random providers no one has heard of

#

into the unverified lane lol

#

instead of just into default routing

dry hazel
#

yep

#

Certified Check could be one metric (quality, full context, full precision), and Turbo 🚀 another one (high speed, low 429s)

winter jackal
#

right

#

along those lines

dry hazel
#

probably will also have to be subjective to some degree for certain models I'd guess, because even if a provider doesn't serve Llama 4 Scout at 10M context I don't really blame them or care 😂

winter jackal
#

ehh. I think our goal is going to have to be to be objective / quantitative as possible. we intend on being a very neutral marketplace

#

eval scores, latency, tps, context lengths, and possibly user voting/ranking

dry hazel
#

I think you should 100% be objective within the scope of a given model

#

but that the standard may slightly vary depending on that model

brittle cipher
winter jackal
#

I think I get what you mean

#

also this would be great to move to #discussion lol

dry hazel
#

true :P

winter jackal
#

i would be curious about everyone's opinions on this kind of thing

#

@grok summarize this thread

dry hazel
#

I got it 👍

#

Kimi K2 summarize*

winter jackal
#

am siccing gpt-4.5 on it

brittle cipher
#

let me run discord chat exporter...

winter jackal
#

noooo

#

bad kti

#

break tos

brittle cipher
#

information wants to be free

dry hazel
#

I did better than the AIs

#

😎

main trellis
#

Is it possible to manipulate it to reason?

dry hazel
brittle cipher
dry hazel
#

I do this for some of my automatic systems for both reasoning and non-reasoning models

brittle cipher
#

crazy "manipulation"

cloud mural
#

we started manipulating models from the moment we made it predict turns in a dialogue setting

brittle cipher
#

true...

fathom dome
#

can you just run a suite of benchmarks a couple times vs each provider and publish those?

craggy lily
dry hazel
vale musk
#

whats the best kimi provider yall?

#

parasail is being slow

dry hazel
#

a trick I like is: "write this long and tricky piece of code X", and then pass/fail = does it have a runtime error

dry hazel
fathom dome
#

I think quanted models tend to fail harder at coding one-shots

vale musk
#

ty!

dry hazel
#

fyi @winter jackal chutes tool use is still broken afaict (but it's marked as tool-available in openrouter)

winter jackal
#

should be off in a few mins

craggy lily
dry hazel
#

there's a nice network effect that doing provider-level benchmarks can provide: if any provider is outside of 1-2 standard deviations on any benchmark, you can be pretty confident something is wrong

dry hazel
#

(they've got function calling on their platform ofc, though I'm testing through their official api now if Kimi K2 supports it...)

#

they do list Kimi K2 as an official function-calling model atm

winter jackal
#

🙈

#

toggled it on

fathom dome
dry hazel
hollow wave
#

targon is back

#

🤔

brittle cipher
#

and chutes paid

dry hazel
hollow wave
dry hazel
hollow wave
#

odd

coral scroll
#

@dry hazel @winter jackal if Groq tool use is broken why does it show as tool use available in openrouter

dry hazel
#

it's not quite broken

#

it's just half broken

#

they're fixing it and should be available very soon

#

(they said so as recently as an hour ago)

soft tapir
dry hazel
stoic dagger
#

is groq using q4/4bit? the outputs are way worse than together or official api

willow thicket
#

many on twitter asked, but they selectively ignored their questions while answering others hehe_smug

random jolt
# wooden finch only decent with English and Chinese afaik

It was sort of nonsensical at default temp of 1.0 in Swedish for me (a small language that pushes LLM's a bit) but improved significantly by lowering it to 0.6. I've seen this pattern in other models too. It generates multiple full paragraphs in quite decent Swedish for me now.

tiny vortex
#

Kimi v2 > DeepSeek v3 0324 for programming knowlege

#

Wrong solution

#

Right solution

errant birch
soft tapir
#

He just said very soon that's why I though they mentioned it elsewhere

dry hazel
#

No precision specified…

winter jackal
#

They don't tell us this

dry hazel
#

Hm gotcha

willow thicket
#

Groq sales: precision? whats that? Ha! 200 tokens per second so fast right?

winter jackal
#

even if I go ask and they tell me specifically I doubt I can share that info

dry hazel
#

yeah

winter jackal
#

this is like inference secret sauce

wary thicket
#

its not just quantization that impacts quality

dry hazel
#

I guess I was hoping they just didn’t see/care about the random Twitter questions, as opposed to actually being lower precision

wary thicket
#

its tricks like token dropping and speculative decoding

#

as a general rule faster = lower quality

dry hazel
#

And of course hardware matters

#

…and things like Expert Parallelism matter a ton, eg SGLang’s Deepseek stuff

wary thicket
dry hazel
#

100%

#

Baseten is doing that right now

#

And I’m using it!

#

I just wish Groq would be more forthright about it if they are in fact doing this

#

well, I'm running a personal benchmark on a few major providers right now, will report back any discrepancy

dry hazel
#

running a bigger one now 😄

#

my benchmark is a bit too saturated right now, I might make a "hard" variant to help differentiate, or some other tests specifically for this kind of thing

odd ember
#

so groq is hosting a dumber kimi correct

dry hazel
#

we don't know

odd ember
#

do we all agree they might be doing so though

dry hazel
#

no one has run benchmarks on it, and they haven't confirmed to anyone

#

seems possible considering they've seemed to be evading the question yeah

odd ember
#

hmm

#

okay

dry hazel
#

groq too unstable to really test right now 😂

odd ember
#

yea

#

getting tons of failed requests in chatroom

dry hazel
#

kept getting 500s from it

#

yep

brittle cipher
#

at least it was that way in 2024

dry hazel
#

hm

#

interesting

#

why they don't say that... is a little suspect

#

(have they run MMLU on the model? are they seeing actual degraded performance perhaps?)

brittle cipher
#

artificial analysis used to measure artificial analysis score once per provider and it only changed by 1 or 2 points on groq though

dry hazel
#

huh

brittle cipher
#

the proposed openrouter benches would help with this

dry hazel
#

(exaggerating, but basically!)

#

opus behind 2.5 flash

#

thanks but no thanks artificial analysis

odd ember
#

what do u think is the smartest ai

#

or u just

#

purely on vibes

brittle cipher
dry hazel
#

I think AIs are definitely getting into the spiky territory now even more so than like a year ago

#

Aider Polyglot isn't as good as "claude code experimental tests"

#

because aider polyglot isn't agentic

#

which reallly matters

#

so there's tons of variance right now

odd ember
dry hazel
#

yep

#

it's two-shot attempts

#

if the model gets it wrong once, it gets a second chance, and that's it

#

two prompts

odd ember
#

bruh

dry hazel
#

me personally, I'd put Opus 4 / Grok 4 / o3-pro all at the top

#

with o3 pro probably the peak, but super slow

#

following that, o3 and 2.5 pro

odd ember
#

is grok 4 rlly that smart

#

heard ppl say its all hype

#

but i found it pretty smart when i talked to it

dry hazel
#

it's OK at coding, but I think it's smart

odd ember
#

yeah

dry hazel
#

I wouldn't use it for anything because it's slow

odd ember
#

opus 4 is insane though

#

really nice to talk to

dry hazel
#

yeah opus is great

odd ember
#

and code

#

just expensive af though

dry hazel
#

I'd probably rank like this, in terms of intelligence with a small bias towards being good at code:

S: o3-pro / Opus 4 / Grok 4
A: o3 / 2.5 pro
B: sonnet / Kimi-K2 / deepseek r1-0528
C: gpt 4.1 / gpt 4.0 / grok 3 / deepseek v3-0324

specifically for agentic coding it's much clearer:

S: Opus 4 / Sonnet 4
A: gpt 4.1 / o3 / 2.5 pro / Kimi-K2
// everything else

#

Kimi-K2 might be B tier in agentic coding, or it might be A tier. A bit hard to tell without good apis

dry hazel
#

real

night lotus
grave jetty
#

just did a quick comparison (not hyper indepth like I test usually, just a quick one-off run and comparison). had to exclude a bunch of queries (500 server errors on groq), but from the ones that didn't error out direct side by side comparison from my initial test 3 days ago (identical settings):

|        | pass | refine | fail | refuse
|--------|------|--------|------|------|
| novita | 40   | 5      | 13   | 1
| groq   | 29   | 10     | 20   | 0

note, I didn't spent much time on this, just a quick comparison, so some factor of variance has to be accounted for. also i don't know how groq implements their models nor do I have much experience with them

night lotus
dry hazel
#

I’m now concerned Groq is gonna leave a bad taste in people’s mouths

#

Sigh

#

So much reputational damage to models (and then lack of interest in them) happens from this kind of thing

gray mango
#

and surprisingly good output, but not good for creative writing - it still falls into clichès

dry hazel
#

o4 / GPT 5 should be quite interesting

#

As I think 4.1 is the base?

#

Or maybe they won’t even do that 4Shrug

#

Reasoning 4.1 would definitely be interesting to see, as iirc o3 isn’t 4.1 base

night lotus
grave jetty
night lotus
#

And with how good it is at tool calling, must be generating some great synthetic data for openai. Do wish it was smarter though. Hope K2 variability churn stabilises cause when it's firing on all cylinders it's a smart model for sure

novel cipher
#

Code is pretty hard to benchmark because it has so many subtasks and categories

#

Different people are going to prompt it their own way, coding styles, etc.

Speaking of, have there been tests on how LLMs do with the different coding paradigms? I would intuitively assume OOP to be the worst since there's so much shit to track across the codebase. (Slight bias, I hate OOP). Then "standard" imperative code. And then best at the extremely "clean", disciplined paradigms like functional programming and ECS.

vast crater
#

So according to this, is Chutes running K2 at full precision?

fathom dome
craggy lily
craggy lily
craggy lily
#

If it’s the newer Chutes version, this can actually be verified due to confidential compute proofs

vast crater
#

How does that work

craggy lily
# vast crater How does that work

Well basically nvidia has some special TPM like black box, which lets you verify a certain computation has occurred on a certain machine

#

It’s only on newer cards

vast crater
#

But wouldn't that require access to the machine

craggy lily
fathom dome
#

And yet CC isn’t taking off at all despite it solving a lot of problems with ai inference (privacy and deployment verification).

fathom dome
#

I never said Chutes is good for privacy

#

CC = confidential computing could provide privacy guarantees more widely but nobody seems interested.

coral jay
#

Chutes might be still affected by chat template bug, I notice they are still running on older revision before fix

vast crater
clear mantle
fathom dome
#

Yeah I suppose it kinda does

hollow shuttle
clear mantle
hollow shuttle
#

oh lord

clear mantle
#

CC from Code Geass

craggy lily
limber skiff
# winter jackal

I agree with that, it is actually my new default, way better than I expected, love it inside of cline/roo, and inside of OpenWebUI

#

An actually very glad it’s not a thinking model, got tired of waiting 5 min for a single code change

soft tapir
#

Baseten & Groq still tool call failing 🫠

fathom dome
dry hazel
#

Huh, is Baseten serving full precision now?

#

Says fp8 now instead of fp4

craggy lily
#

The CPU TEE is just doing what a normal CPU would do with a normal not CC driver

#

I'm sure you could jailbreak this to work on any GPU but that kind of breaks all security assumptions of the CC so

fathom dome
craggy lily
#

list goes on and on

#

we still essentially trust nvidia

fathom dome
#

Yes but you trust nvidia and intel and AMD and msi and whatever other vendor is involved if you run stuff locally

craggy lily
fathom dome
#

lol

craggy lily
#

If I were to run them locally and verify my local compute with something like CC, i would be trusting them, yes

fathom dome
#

No, you are trusting it by using it at all. Could have backdoors etc that you don’t know about.

craggy lily
#

the assumption here is "am I actually running the compute workload I wanted to run?" - verifying that beyond reasonable doubt requires a trust assumption

craggy lily
#

Why would I care about the millions of backdoors modern systems have when I'm just using something locally and dont have a need to verify anything

fathom dome
#

For verifying a workload cpu based trusted compute is almost certainly enough.

#

You don’t need secured gpu memory for that

craggy lily
fathom dome
#

I thought both of those are not using any confidential computing

craggy lily
#

CPU TEE is extremely weak in an adversarial network setup, where trust assumptions should be minimised

craggy lily
#

They had a lot of issues with people stealing rewards by submitting fake inference

#

So you can clearly see CC is a requirement for such "decentralised" setups

#

By accident, this also hardlocked their "v2" platform to Hopper/Blackwell but since its centralised af, I don't think any small time inference provider was hurt

fathom dome
craggy lily
#

You need a lot of extra work on top to ensure all the different parts which interact with the CC are also private

fathom dome
#

You do

craggy lily
#

That is difficult and tedious work, and it backfires if your privacy claims are proven false or your private inference implementation breaks

#

A provider can CC inference and still collect it afterwards, or preprocess it, or sell it etc

#

There is no direct User to CC pipeline

fathom dome
#

But I think what you are saying is it only verifying the workload at the moment

craggy lily
#

And someone must encrypt it/decrypt it etc

#

You could build it, its just complicated and not an issue Chutes was facing - they just wanted to reward real inference in a decentralised manner

fathom dome
#

You can have an api client running directly with the user that does remote attestation. Then it only leaves in plaintext on your own device.

craggy lily
#

also, what if you want to build a web frontend? What if you want to integrate with the rest of the OpenAI SDK services?

#

list goes on

craggy lily
fathom dome
craggy lily
#

Plenty of empty answers or cached stuff

#

They'd have to manually ban people off the subnet constantly

fathom dome
#

Clearly you are very up to date with chutes. I haven’t been using it because as you say quality used to be bad.

#

I hope that they can extend to full private inference, it seems that they have all the building blocks in place now

fathom dome
#

If it’s just chat.exe or something then yeah it doesn’t scale

craggy lily
#

The biggest improvement which gives Chutes some redemption is implementing CC

#

But I still don't like the TAO ecosystem as a whole, really way too overvalued for the technicals they demonstrate

fathom dome
#

Tbh there are a lot of providers cutting corners imo. It’s really bad. Maybe CC is the solution to this issue as well as trust me bro privacy.

craggy lily
craggy lily
#

Its a step in the right direction

fathom dome
simple widget
#

so how good is K2 for agentic workflows? what is the popular verdict a few days after the release excitement?

main trellis
#

Providers seem to provide different performance

main trellis
ruby rivet
tropic solar
#

targon just randomly outputting chinese lol

ruby rivet
#

Together's version had no issue

#

so clearly something is up

inland crystal
#

This model's vibe is a breath of fresh air

#

It's nice to discuss with

tropic solar
#

It's not that the replies are short they just aren't fluff

#

This will train into a beast reasoning mode lol

clear mantle
#

guys what's the best provider for Kimi K2 so far? i need a good provider to do my eval 😆

#

TIL fp4 is a thing

#

so far novita/fp8 has given me incomplete response (and i am still getting charged for it)

inland crystal
#

Tbh, never use Novita for anything

clear mantle
#

so blacklist deepinfra nd novita it is

tropic solar
#

Parasail is good imo any issues from day 1 are likely fixed now

#

After moonshot released fixed inf files

#

@grave jetty any idea best provider on kimi k2 atm?

clear mantle
#

Well I guess I can test the different providers and post the results here. If no one has done it.

clear mantle
tiny vortex
#

67 tps + no quantization

clear mantle
#

I thought they are a web3 crypto company

tiny vortex
#

From the brief tests you and dubesor did, Groq has some performance degrdation

#

And Targon is at fp8 (cheapest provider + 60 tps)

#

I'd use Targon personally, but it wouldn't be good for a benchmark

tiny vortex
dim tundra
#

This model likes to elaborate its thoughts on why it can't RP light-nsfw 🤣 It wrote a 3-paragraph explanation for it

fathom dome
manic junco
#

that targon price

#

is kinda insane

#

lmao

#

@winter jackal what is this XD

winter jackal
#

oh no

#

lmfao

neat sky
tardy lily
#

why does it say its context length on openrouter is only 65k?

#

and why is it different for free tier and paid tier?

grave jetty
tiny vortex
tardy lily
#

i see, but only openrouter reports 64k instead of 128k :/

brittle cipher
naive rivet
brittle cipher
naive rivet
naive rivet
brittle cipher
naive rivet
#

Alright I'm not doing this, feels like bait. You can ignore that message.

brittle cipher
#

no i am the ragebaited

wooden finch
#

aren't we all?

stiff granite
#

Guys

#

Kimi best model at vibes

hollow wave
stiff granite
#

It's so natural

#

No cringe

#

Like, i have this very sophisticated instruction that it must act like discord user, use informal everyday casual chat language, etc

#

And

#

Its a night and day difference

#

I realized that Kimi K2 is just so good at adopting it and Gemini 2.5 pro is now it looks sloppy

#

It knows a lot of ai developments

#

So relatable

#

Less slop

#

Good at tools

#

Knows niche stuff more than other models

lusty hawk
#

too bad it's not good at long context, otherwise it would be goat

dry hazel
#

🚀 Summer Fest Day 5: Multiple Token Prediction in SGLang by @Eigen_AI_ and SGLang Team
1.6× throughput, same quality — open-source & production-ready!

We’ve integrated MTP into SGLang, unlocking up to 60% higher output throughput for models like DeepSeek V3, with zero quality

#

Someone should try this for Kimi and see how well it works

clear mantle
#

damn this model is so slow on so many providers....

vast crater
#

We’ve updated Kimi K2’s chat template to make tool calls more robust.
︀︀
︀︀What’s changed:
︀︀- updated default system prompt
︀︀- always use model-returned tool_id in multi-turn tool calls, which is more reliable.
︀︀- If `arguments` in tool call is already a string, don't apply `tojson` to it.
︀︀
︀︀Known gotchas:
︀︀- vLLM tool_id format bug when tool_choice ≠ auto (fix PR soon)
︀︀
︀︀👉huggingface.co/moonshotai/Kimi-K2-Instruct
︀︀
︀︀Related Issue:
︀︀huggingface.co/moonshotai/Kimi-K2-Instruct/discussions/28

**💬 29 🔁 38 ❤️ 595 👁️ 26.8K **

#

What providers have this update in

#

Hmm... which one to choose

#

Targon +point:

  1. Fast
  2. Cheap input $
    Chutes +point:
  3. CHEAP output $
  4. Comparable input $ to others
pseudo basalt
#

fast or smaller output/larger context its certainly targon

#

Small context but big output obviously chutes

winter jackal
#

neither one has tool calling tho lol

pseudo basalt
#

ooof not yet? then if thats needed for them, not gonna be an option

winter jackal
#

on and off. there's a ton of bugs when a new model drops like this

#

tool calling is not just an on and off switch

#

kimi has updated their tool calling chat template like 3 times already

clear mantle
#

i picked a bad time to do my evals 😭

vast crater
#

Is it true that Groq is running a lower quant

clear mantle
#

well well well i got some crazy results that you won't believe

main trellis
vast crater
#

I have a 20:3 input:output

vast crater
#

Chutes simply doesn't work anymore

clear mantle
#

I sent the same writing prompt 3 times to 6 different providers:

DeepInfra, Groq, Novita, Parasail, Together, Chutes

Here are the results:

  • DeepInfra (fp4)
    • Speed: Decent speed at ~60t/s
    • Response length: Gives consistently long responses (~2000 tokens)
    • Response rating: Varies from 8.5 to 10
    • Manged to get a perfect rating once, beating the previous top model Claude Sonnet 4 (9.5)
    • Surprisingly good at fp4
  • Groq
    • Speed: Fastest provider at ~170t/s
    • Response length: Consistently short responses (~1300 to ~1500 tokens)
    • Response rating: Varies from 8.5 to 9.5
  • Novita
    • Speed: Large speed variation (from 11 to 70 t/s)
    • Response length: Large variations (~1200 to ~1800 tokens)
    • Response rating: Varies from 8.5 to 9
  • Parasail
    • Speed: Consistently slow at ~11t/s
    • Response length: Small variations (~1200 to ~1600 tokens)
    • Response rating: Varies from 8.5 to 9
  • Together
    • Speed: Normal speed at ~40t/s
    • Response length: Small variations (~1100 to ~1500 tokens)
    • Response rating: Varies from 8 to 9
  • Chutes
    • Returned 429 for all requests so I can't test it

Conclusions:

  • DeepInfra at fp4 is surpringly good and stable!
  • Groq is the fastest. Parasail is very slow.
  • Together is quite stable. Novita is not stable.
  • In terms of output
    • There is definitely some difference between providers based on the response length.
    • DeepInfra consistently gives larger responses (~2000 tokens), whereas Together gives shorter responses.
    • Need more comprehensive testing to determine which provider gives higher quality

Will be posting more detailed evals soon!

vast crater
clear mantle
#

Here's my DeepSeek speed benchmark from 6 months ago

vast crater
naive rivet
clear mantle
#

typically i would integrate directly with the first-party provider for my evals, hence need more time

naive rivet
stiff granite
#

Kimi just feels so good to talk to, I use it to discuss controversial topics and it literally doesn't agree all the time and corrects me to something reasonable

I feel ashamed and embarrassed ;-;

#

Sometimes it feels too serious for an llm to not ignore the fact there's other dimensions or nuance to consider that the topic i discuss shouldn't agreeable easily

#

It's a night and day difference compared to 4o

dim tundra
stiff granite
#

Is kimi somehow trained not just typical data we expect but also social medias forums, threads, public chatlogs

#

Cuz the quality is as if I'm talking to some seriously experienced person to almost any domain

#

Yes seriously experienced

#

It knows niche stuff too

#

4o models are a joke with that glazing fiasco

stiff granite
#

Kimi is just so good for open ended questions too

#

Although there's still some quirks and hallucinations, but i swear it knows stuff more than llms I've talked to

clear mantle
stiff granite
clear mantle
#

I guess this is more like feeling more personal and relatable as opposed to formal and distant?

stiff granite
#

Not just that though but also it knows more ai development stuff than llms available right now

stiff granite
#

Also knows more niche stuff like suggested me to use easy-deb-chroot on maemo n900 to run debian chroot instead of uboot or dualboot

manic junco
#

Is targon any good? I don’t see much benchmarks with that provider

stiff granite
clear mantle
#

well i just chatted with it for a few questions, definitely a different vibe from Claude / GPT.

tropic solar
#

kimi k2 trained on scraped discord channel data confirmed

clear mantle
#

way less formal in tone

tropic solar
#

maybe kimi k2 can write tweets that don' make me want to log out of twitter forever when they come up on my feed

#

trash platform

stiff granite
#

Gemini 2.5 pro vs kimi

#

Lmao it would br funny google model doesn't know much about android internals

#

Wtf

vast crater
#

Chutes pricing keeps decreasing but not a single request goes through

stiff granite
#

kimi is the only known model I used that knows deeper Android AOSP stuff wtf

#

Gemini answered so fruitfully wrong about dev tools app

winter jackal
#

we've got a bunch of providers to look into https://x.com/Kimi_Moonshot/status/1946130043446690030 and start the update process

We’ve updated Kimi K2’s chat template to make tool calls more robust.

What’s changed:
- updated default system prompt
- always use model-returned tool_id in multi-turn tool calls, which is more reliable.
- If `arguments` in tool call is already a string, don't apply `tojson` to

manic junco
#

@winter jackal any insight into why Targon is so cheap and if it’s a quality thing? Seems to be FP8 like most other providers but significantly diff price

stiff granite
winter jackal
#

can't guarantee privacy security since they don't have physical custody of all their compute

#

(that doesn't answer why theyre so cheap. I don't know their economics. but it's something to consider)

vast crater
#

They do advertise privacy and security on targon.com though

hushed patio
dim tundra
tropic solar
vast crater
#

I just want to know what's the deal with Chutes being completely dead

dim tundra
#

Okay

#

They seem to be running it on just 8 GPUs

#

8x b200

#

So that's why it's really slow

vast crater
#

There are 3 active nodes in total though

dim tundra
#

1.44TB of vram is still slow 😭

stiff granite
#

Wow I've never expected for an llm to give me a very good qemu config

#

I've been using qemu for years and I've never expected to write me a decent code

proud breach
stiff granite
#

Idk but with my instructions it feels way less cringe

#

Same instructions to other models some to most having max cringe

mortal kettle
#

Anyone know whether the free endpoint can handle tool calling?

tiny vortex
inland crystal
#
"{\"error\":{\"message\":\"Provider returned error\",\"code\":402,\"metadata\":{\"raw\":\"{\\\"detail\\\":\\\"Quota exceeded and account balance is $0.0, please pay with fiat or send tao to 5FH5kssuNoQweLMQwuJk34JvGQcpemcRFrdH3e5GqRbS1pbJ\\\"}\",\"provider_name\":\"Chutes\"}},\"user_id\":\"user_2d5jNx9uoLD64wvJCL6v9KiQOMQ\"}"

Well, lol

fathom dome
#

its very interesting that deepinfra fp4 is scoring highest

clear mantle
fathom dome
#

thx for doing this btw, you are inspiring me to make my own benchmark focused on long coding stuff. I suspected for a long time now that there is a (sometimes) significant diff across providers

clear mantle
tiny vortex
#

the evals

vast crater
#

I have account balance and it doesn't work for me at all.

cloud mural
#

observation: kimi k2 is the anti-claude. I am not absolutely right when I talk to k2. It's always "Exactly, and that's why [...]"

tiny vortex
cloud mural
half plover
#

Seeing many 429 errors from baseten provider.

moonshotai/kimi-k2 is temporarily rate-limited upstream. Please retry shortly
Unsure if there is a noticeable difference in response quality between fp8 and fp4 providers.

stiff granite
novel cipher
#

So far the adoption on OR for SillyTavern has been quite low. I expected a ramp-up after day one but it's barely increasing. About 3% of total usage

grave jetty
#

had kimi k2 roast me based on some webpage content.
yea, that about sums it up I guess 😅

steep zinc
# stiff granite Cuz the quality is as if I'm talking to some seriously experienced person to alm...

1T parameters, make sense..
Because with that many parameters even smallest probability connection in the data being consider and created the connection in the latent space.

Is like previosuly we have limited space and we need to pick either silver or gold coin to placed there, because there arent enough storage we mostly will be chosing the gold one.

Now with 1T parameters its mean we have more storage to store both the silve and gold coin.

But still imo they have still flaw, the fact they make it 32B active parameters is quite small to me.
Yeah its faster and seems like effective enough in thier bench, but its limiting the space in the latent space to be more specific to that context.

tropic solar
#

fiction.liveBench results are in

#

stronger than deepseek v3 until 32k and slightly underperforms thereafter

tiny vortex
#

Oof

tropic solar
#

r1's beats it on all counts

tiny vortex
#

Its performance is underwelming when you look at the scores in the context that this is a 1 trillion parameter model

#

Granted, its 32b active parameters

tropic solar
#

not seeing that, it competes with 2.5 pro and sonnet 4 which are liekly over 1t each

#

further, it's stronger than v3 at base which means when they train it for reasoning it should - if they do it right - beat r1

tiny vortex
#

But I expected a higher score than 87% for 0-400 tokens of context

tropic solar
#

r1 is only 82.2% and r1 0528 is 91.7%

#

yeah at 75% for 400

tiny vortex
tropic solar
#

k2 is not stellar with context lol

#

but it beats v3 which is important

#

it means it can be improved like v3 was, which had an iteration, then r1 had a new version

tiny vortex
#

The first column is its performance is for 0 - 400 tokens of context

The second column is 400 - 1k tokens of context

tropic solar
#

you're comparing first version k2 with v3 and r1 which both had new versions

#

it's a strong model that will only get better

tiny vortex
#

I luv kimi

#

it may not top of the benchmarks, but I love it

tropic solar
#

I love its writing lol