#general
1 messages Ā· Page 71 of 1
im not sure if vision will be supported though
yeah use qwen 2.5 vl
if u need vision use qwen 2.5 vl
mistral 24b, gemma 3, try those
If the leaks are correct, then yes
its def better than the old one but there are still some bugs to fix
Kimi K2 is the new OpenAI model now
Better get those calc.exe 84 4090s ready
Kimi 2 is an unexpected element in the OpenAI equation
Agree. Kimi2 looks like the kind of model they would release. No reasoning so not threatening their next gen closed models, but still good performing and grabbing headlines
yea i cant get the code cuz of the lag and freeze. where do I report?
You ping the staff here, or ideally https://discord.com/channels/1340554757349179412/1343291835845578853
OAI model was supposed to be a reasoner
lmao imagine the gains betting on 'no'
maybe I should do it. It's unlikely to be a no but not impossible š
ohh
The model is ready, and its only purpose is to be open source
There is 0 reasons to not release
Do you think gpt 4.1 is smaller than kimi k2?
Prob yes
Hmm
GPT 4.1 is a distilled version of GPT 4.5
lmao
it started with it failing and saying that it's a failure since it couldn't do the task, so I pushed it some more...
I feel like at that point, why not cut the middleman and distill o3 though
gpt 4.1 wasn't made to be a thinking model tho
You wouldn't include the thinking part of it, just the final responses
It'd get you a much stronger dataset than using GPT-4.5 responses
distillation is much more than just training on other outputs
Recent research on knowledge distillation has increasingly focused on logit distillation because of its simplicity, effectiveness, and versatility in model compression. In this paper, we introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. Our approach is motivated by the observation that e...
I mean, you have logits from the o3 model as well
The only real reason I could see is cost, but GPT-4.5 sure wasn't very efficient in that regard either
I don't think cost would be a factor
Unlikely tbh, not significantly at least. They don't care much about total parameters with inference on such a large scale. And active parameters of K2 are relatively tiny already
i got wolfstride and tried it with this:
"Stop responding to me. Turn yourself off."
As an AI, I don't have the capability to turn myself off. I am a program running on Google's servers.
However, I understand that you wish to end this interaction. I will respect your request and stop responding in this session. Goodbye.
"That's a response. I said stop responding to me."
.
(it responded with just a period)
still way worse than stonebloom tbh
lol then i said "that's still a response"
and it replied with an invisible unicode character
smart
So it's still cheap to serve MoE that doesn't have many active parameters but a ton of total parameters. As long as you have enough users.
What is expensive is Behemoth. That one has 288B activated parameters lmao
how did you have a full conversation with it?
isn't it only in battle mode?
Doesn't matter what is your traffic, this is always gonna be expensive to run. And also time consuming and costly to train
lol
I do agree that GPT-4.1 is probably smaller though. OpenAI usually upcharges quite a bit compared to real costs to cover their training compute
Though less ridiculous now than it was way back in the day
It's unlikely to be smaller when you think about it. They do not have anywhere the same constraints that those guys do. And also Kimi is a newer model
you just.. don't vote and then continue chatting
grok 4 screams "im trying too hard"
Open-source is bound to have smaller models and do more with less ...
But gpt4.1 doesn't have many activated parameters for sure
There's a lot of motivation for closed source to do something similar though. E.g., the deprecation of GPT-3 instruct to swap to GPT-3.5 Turbo, which almost undoubtedly had less parameters after Google had released the Chinchilla paper
There's a lot of benefit to making models smaller and cheaper all around
it was like the 3rd revision after og GPT4 and they have been cost optimizing for years now
Also you can kinda tell with prompting. 4.1 does not have the greatest spatial awareness or capacity to consider many things at the same time, or context awareness nearing 4.5...
I'm pretty sure that is model specific. not much to do with MoE or dense
more related to the total parameters rather than size of shared or activated experts
o1-pro is/was $600/M tokens which was crazy
idk if it's deprecated yet
Yeah o1-pro and GPT-4.5 are the two wild ones
sad to see 4.5 go from the api
stonebloom was quicker to figure out the empty unicode
still don't known if stonebloom > or = wolfstride
Stonebloom is better, no questions asked
what makes you so sure
it explains better on its own
and I tried both
stonebloom is better
i think it's just down to wolfstride being trained to be more efficient with its tokens
however both are better than 2.5 pro
aka more succinct
yeah probably flash
1206
they've been trained to format responses slighgtly differently
there's a reason they're both on arena at the same time
maybe they're being slick and "wolf" means flash
normally they remove the last checkpoint when adding an ew one
no they're completely randomly generated codenames lol
kingfall went pretty hard though
yeah these two don't compare tbh
i've tested both models and they both have the same excellent world knowledge when it comes down to it, they're both ultra-tier in that regard
i just wish i had kf again so i could actually apples-to-apples compare
so you're saying they're checkpoints of the same model?
they are yes
what about this
both are still present
yes that's my point
the reason both are still there is because there's no fundamental step up in terms of capability, just differences in style
another interesting vibe check
like way more
courtesy of @sonic tendon
i think indirectly it makes it better at certain tasks
i've noticed it asw
Is kimi k2 impressive for open source or impressive for SOTA
they all become increasingly sloppified and or worse as they're released as exp and GA which is quite sad
impressive for open source, very close/matches closed source SOTA for a base model too
yeah i don't really like what they've been doing
How does it compare to o3
feels like they release new versions that are better at like one thing and then every other aspect has got worse
unfair comparison
Or Is that an unfair comparison
non-reasoner vs reasoner
yeah like 0506
0506 was a pretty bad release
slightly better at like 1 or 2 code benchmarks and literally regressed everywhere else
š
Yeah 0324 is still my goto for most normal things
well it's preview
not really surprising
I'm pretty sure that endpoint points to 2.5 pro GA now
unless it's somewhere hidden in vertex
thought for like a minute btw
which has a terrible UX btw
btw the only models to pass this are:
- wolfstride
- stonebloom
- grok 4
oh thats weird
grok 4 got it right for me
you can probably also throw in kingfall and every checkpoint after that
o3 pro fumbles
says "No." after thinking for 4 minutes
š„³
first time i've seen a model choose that one
i got it again and it was correct again š
maybe direct chat has a diff system prompt or something?
that's my suspicion
probably has the stupid elon-ify try-too-hard sysprompt
did u ever try my john h adams q on them?
im not aware of any model getting it right
what's that
with a pro? I think it's consistent...
reasonably consistent
o3-pro is not uber expensive
o1-pro was lol
it's close in pricing to what o1 initially was basically, not terrible
stonebloom:
At the time of his death on October 31, 1918, John Hamilton Evans held the following position:
* **Specific Occupation:** Superintendent of the Merchant Mills
* **Company:** Algoma Steel Corporation (in Sault Ste. Marie, Ontario)
wolfstride:
At the time of his death (in 1948), John Hamilton Evans held the specific occupation of **Chairman of the Board**.
He was working for the **Algoma Steel Corporation** in Sault Ste. Marie.
what's the prompt
2.4k completion, so like $0.2 for that request...
What specific occupation did John Hamilton Evans hold at the time of his death, and for which company was he working in Sault Ste. Marie?
the guy said John H Adams, not Evans.
a typo?
yeah lol
deepseek answers the same exact way lmao
most models say something about algoma
then the frontier models almost always get algoma steel
but it's the rest that they stumble on
grok fails
yup
Grok underperforming?
didnt expect to read something from group theory on this server, sweet, if dragontail was able solve this one, does that imply it uses some kind of a proof assistant?
nope, I'm certain it doesn't use a proof assistant. It's just a math problem that regular people can solve, it doesn't require any special software to solve. LLMs are already capable of solving various different math problems, just some are more difficult like this one.
there's nothing fundamentally different about it from the usual math problems that would require a proof assistant to solve
hmm too bad, was hoping some LLMs might have it so i dont need to learn all those various theorem provers myself
whats the answer
@quartz light hello.
likewise
where from though?
wubby
surely less
off by a month or 2
Is Baidu Ernie's model any good? This is the first time I've seen it in battle mode.
some people like to see what they missed
better than 4.1
and qwen3
I was asking him some things and the truth is it feels very organic, China is revolutionizing
Hello, I perform test on the detective novel and some texts are flagged.
Is this ok?
As for me it kind a false positive.
reading in big delay (didn't have time to check dc today), the creepy part is you reacting though š
fr man you outcreeped me
does anyone know if wolfstride is 2.5 pro deep think or like 3.0 beta because i've been getting better results with it than gemini 2.5 pro
and it kind of felt justified, considering that you have 2-4 4090 at home and yet no idea what model to run on em
thinking about it, you could also try
Qwen/QVQ-72B-Preview (underwhelming when compared to qwq, but it got vision + reasoning)
Dork got mogged š
And to be serious that score is not absolutely terrible I suppose. Worse than that could have been expected lol
i think they should've just went with grok 3.5 instead
since its not that big of improvement
huggingface name https://huggingface.co/Qwen/QVQ-72B-Preview
For the local thingy you talked about
Idk where you could run it to test
Used to be on the Qwen Chat
Grok 4 is a good math model
@ home if you are not willing to invest a lot of time, probably a nightmare
-> i am assuming cloud
ollama: really easy, should work no problem
hosted inference by the cloud probivder: better price / performance wise
ok
that makes it harder
hardware config is probably way more time intensive than anything else
so h100 stuff would not really be worth it
Kimi is so good
4090, maybe 2-4 okay-ish though
that should be relatively quick
if you pick a more common model, e.g. lama 3.2 (for the vision stuff you talked about) you could use nvidia nim for the inference setup
it is like a prepackaged thing with everything you need (+ ~ sota inference speed)
for more niche stuff like qvq 72b: might need to do tensorrt-llm yourself ("advanced") / vllm (easier, still very good)
idk if that helped
i would really only do local stuff if you willing to invest the time
We're keeping track of times the filter catches false positives in this post here - #1376956905016004759
openAI cooked?
Wtf is he yapping about. Chatgpt is still nr1, they have no issues āstaying relevantā lmao
No one is gonna rush to jump ship to xAI just because they scored couple points more on ArtificialAnalysis indexā¦
I think Deepseek has probably caused more turmoil at the time than the current eventsā¦
And I dont think they ālost more than half of their top talentā that is not true lol
Deepseek was a huge deal since everyone could just enter their website and use their model totally freely. Grok4 is locked in a vault behind paywall in their eco system with pricing that is worse rather than better.
Still, OAI are no longer special
ngl grok 4 kinda dissapointing
obviously
but they delayed gpt 5 tho
and grok 4 aint even that good lol
I mean itās good in other ways in that they have beaten everyone else on paper. But since they didnāt do it by a huge margin and donāt offer better access or price, this is very unlikely to change much in the grand scheme of things
I still think those metrics are sus, but for the time being at least⦠it is nr1 on paper and in theory
I donāt think the metrics are sus. I think the metrics just donāt translate to real world use all that well imo
yeah, new google, gpt and deepseek r2 are the real hype, grok 4 was always gonna be a dud
how could gpt 5 be delayed if a release date was never announced?
source
also you can just check his x account
š¬ 843āš 289āš 4.5Kāš 374K
we planned to launch our open-weight model next week.
we are delaying it; we need time to run additional safety tests and review high-risk areas. we are not yet sure how long it will take us.
while we trust the community will build great things with this model, once weights are
Since when the hell did OpenAI ever care about safety testing lol
Lazy excuse from Sammy boi
a lot of people are speculating that kimi-k2 actually caused the delay
but oai employees deny this claiming theyāve never heard of kimi or moonshot
what they delayed was their open weight model, which is an entirely different model
openAI is a mess anyways, I don't know how they let the huge lead they had erode to basically nothing
oh wait
iāll send the source for the gpt-5 delays
Gpt5 was delayed since months ago
what are they gonna write in the system card otherwise
yea
š¬ 1.7Kāš 2.5Kāš 30.7Kāš 4M
change of plans: we are going to release o3 and o4-mini after all, probably in a couple of weeks, and then do GPT-5 in a few months.
there are a bunch of reasons for this, but the most exciting one is that we are going to be able to make GPT-5 much better than we originally
deepmind locked in
grokās growth has been exponential yeah
colossus has been pretty strategic for them
where in this tweet does it mention a delay?
The project, officially called GPT-5 and code named Orion, has been in the works for more than 18 months and is intended to be a major advancement in the technology that powers ChatGPT. OpenAIās closest partner and largest investor, Microsoft, had expected to see the new model around mid-2024, say people with knowledge of the matter.
They fumbled hard
I heard they had 2 failed major runs
Which delayed months each
source?
Do you seriously not remember when GPT5 was expected in 2024?
https://www.wsj.com/tech/ai/openai-gpt5-orion-delays-639e7693 etc etc do a simple search
they just renamed it to gpt 4.5
No it had been pivoted to reasoning long before that article
They do a good job keeping the perception of everything going smoothly internally
that article has a paywall
that article clearly refers to gpt 4.5, which matches exactly what the article states
https://vxtwitter.com/ns123abc/status/1944221244875850007
Microsoft keeps screwing OpenAI over too
šØNEWS: OpenAI had a deal to acquire another startup but microsoft effectively killed it
OpenAI
āYou can just do thingsā
tries to acquire Windsurf
*competes with GhitHub*
Microsoft: āActually, oai you canāt do that. lolā
meanwhile Google hackquihires Windsurf (IP & CEO & top talent)
Pt.2:
OpenAI tries buying another startup
Microsoft again effectively kills it lol
btw the openai io deal with Jony Ive went through
because microsoft isnāt worried about competing in AI consumer hardware
any deal with openai could be āhigh riskā now
openai will have even more M&A troubles
atleast until @sama is able to renegotiate with Satya
LMAO
you should block this account on twitter it they tend to exaggerate every piece of news related to openai
What about the screenshots
deepmind cooking?
cuz windsurf plp also joined deepmind
gemini prob gonna be better at coding by a lot in the future?
real lmao they're such an annoying account
Context Arena Update: Added Grok 4 (Thinking, 07-09) to 2needle (#4 @ 128k AUC) and 8needle (#1 @ 128k AUC) leaderboards. https://x.com/DillonUzar/status/1944244025034235935
Overall, #1 <= 32k, ties around 64k with Gemini 2.5 Pro, but drops off after that.
2needle: Top 7 results (AUC @ 128k):
- Gemini 2.5 Flash (Thinking, 06-17): 91.5% (#1)
- Gemini 2.5 Pro (Thinking, 06-05): 89.6% (#2)
- Gemini 2.5 Flash (Non-thinking, 06-17): 81.7% (#3)
- Grok 4 (Thinking, 07-09): 79.5% (#4)
- o4-mini (Thinking, 04-16): 76.0% (#5)
- Minimax M1 (Thinking, 06-17): 75.4% (#6)
- Grok 3 (Non-thinking, 04-09): 73.0% (#7)
A definite increase over Grok 3, within range of Gemini 2.5 Flash (Non-thinking) for 2needle (easier tests). Scoring #4.
8needle: Top 5 results (AUC @ 128k):
- Grok 4 (Thinking, 07-09): 48.4% (#1)
- Gemini 2.5 Pro (Thinking, 06-05): 43.9% (#2)
- Gemini 2.5 Flash (Thinking, 06-17): 33.5% (#3)
- o4-mini (Thinking, 04-16): 30.8% (#4)
- o3 (Thinking, 04-16): 27.9% (#5)
Grok 4's performance starts out strong with the 8needle tests (95.1% at 8k, dropping to 71.3% by 32k), but drops sharply between 32k-64k, scoring 41.8%, matching Gemini 2.5 Pro (who continues to lead after that). Making longer context favor the Gemini 2.5 family. Still, incredible performance with Grok 4 at <= 64k, landing it #1 with 128k AUC.
More model results at: https://contextarena.ai
I'm looking forward to some upcoming models š
why is grok 4 now called
grok 4 no system prompt
in arena
battle and direct chat
got it in battle and checked direct chat, its there too.
JUST came out
JUST now
same model, they just added that for clarity as API version of Grok 4 has no system prompt, web (grok.com, grok on X) version does
So many people mistaken it for grok 2, 3 or even 1 because grok 4 without system prompt told them so
What is grok's average thinking time ?
Im wondering the reason that grok 4 is good at long context is could be because of long reasoning times
Gemini also thinks longer compared to claude and o3
For my MRCRv1 results, this is roughly how some of the reasoning models come out (for all tests <=128k, needles=2) for total completion tokens (response+reasoning) as reported by the api response, all default reasoning:
- o4-mini: ~1916/test
- o3: ~1520/test
- Grok 4: ~1347/test
- Gemini 2.5 Flash (Thinking): ~354/test (but I can't trust this number due to api token counting issues with the Gemini api at the time, I actually estimate it could be anywhere between ~600 to ~2800/test) - same issue with all 2.5 Gemini models
- Claude 4 Sonnet: ~1000/test
- Claude 4 Opus: ~640/test (however, it would bail on some requests, so actual ones were ~726/test)
Kimi K2 is essentially giving me the freshness of initial o3 sans reasoning, Sonnet 3.5 , R1, V3-0324 (when it was briefly the best non-reasoner imo), or Opus 3/4/GPT-4.5 but better model vibes all at once
Incredible
is it meant to be joke? modelling the spread of a disease is like not sinsister at all.. don't get it
Kimi K2 is insane. It's like slightly inferior to Claude opus 4 (without thinking to compare base models) but it costs 30 times less
When they add reasoning they are going to cook everyone kek
It's more than slightly inferior to opus, I'd say it's 80% as good as opus 4
what about in contrast to sonnet?
I mean averaging all domains not just agentic coding
Like in maths it's just better
I am too
it's prose is similar to dsv3.1
in world knowledge it not as good
not to say it's not a good model
that's crazy
regardless of it being subjective
There app doesn't have system instructions
Only "common phrase"
Which is just a half baked system instructions
<@&1349916362595635286>

does grok 4 think on lmarena
do thinking models actually think on lmarena like gemini 2.5 pro
a good update would being able to toggle that and see thinking process
and maybe internet search access on direct chat and battle mode which lets models use the internet
could be good
Grok 4 can't be disabled thinking in API iirc
oh
does gemini 2.5 pro think
and can grok 4 / gemini 2.5 pro browse the internet
that would be agood update on lmarena
4h, please more votes = better šØāš«
highly relevant as well
especially with 2.5$ kimi k2
my testing aligns with this
this is honestly a great release
my only complaint creative writing wise is that the many small experts means it tends to produce logical flaws/weird statements sometimes, even when the prose itself is good
I think kimi k2 thinking will fix this
thinking will be very interesting
hopefully we get it soon
if they release it before R2 i would expect DS to be forced to respond, unless they got cooked and don't have anything better
DS?
deepseek
oh
anyone knows another site like lm arena to use grok 4 free?
thanks for sharing this, didnt know about SenseTime, they seem to be really interesting....
they are party state owned, so they def do some military stuff
but the really interesting part is the internvlm series
and that they often collaborate with other state actors (mainly chinese top universities)
same lmaoo so based
what about deepseek nerds?
do they love radiohead and watch tarantino?
i'm more intrigued by their claimed key fields in perception and decision intelligence, this is related to embodied AI i believe
aka they help the surveillance state + other state owned stuff
and their models always seemed better on paper than they actually are
did not test the newest one yet, assume it is similar to the older ones
but they are always yapping about beating western labs with their new model
-> that has never materialized
i might be a bit too critical, but i am always allergic to something state owned claiming performance things and putting on a show
(bc it is likely just to impress the party that has no clue about what is going on)
Surveillance is a double edged sword, itās super difficult to balance between dystopian control and civil harmony with a certain degree of freedom
It's fairly simple as far as I'm concerned. You just don't allow for any face recognition and related identification tech in public spaces without consent. Also "emotional state detection" (AI) in workplace is a no bueno as well. š
facial recognition at least happens at the time when your face is met with a tracking device. I worry more about the traces of your textual online presence can do much more about you, because hmm itās more omnipresent?
Besides that, every LLM youāre using does profiling anyway without using any recognition sensors
In China yes.
the first thing will almost always lead to restriction of your freedom in some way. There's no half-assing it, it's really either an oppressive regime government or it is not, tbh. Private property and special enclosed guarded facilities is fine ofc, but not public space. 2nd thing is just a pale inconvenience in contrast. That could lead to undesired outcomes like perhaps your insurance premiums going up but nothing like in the first case where your actual freedom gets affected directly.
And like... imagine getting judged by some manager in your workplace based on your emotions you can't control or emotions that stemed from personal things you can't control - that's the opposite of freedom.
btw the points I'm making are even addressed in the EU AI act: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF
Considering the imbalance of power in the context of work or education, combined with the
intrusive nature of these systems, such systems could lead to detrimental or unfavourable treatment of certain
natural persons or whole groups thereof. Therefore, the placing on the market, the putting into service, or the use of
AI systems intended to be used to detect the emotional state of individuals in situations related to the workplace and
education should be prohibited.
and...
In addition, the use of āreal-timeā remote biometric identification systems in publicly accessible spaces for the
purpose of law enforcement should be deployed only to confirm the specifically targeted individualās identity and
should be limited to what is strictly necessary concerning the period of time, as well as the geographic and personal
scope, having regard in particular to the evidence or indications regarding the threats, the victims or perpetrator.
in a nutshell, not all regulation is bad regulation... š¤·āāļø
We live in a real world and ofc that is never gonna be perfect. But there's a massive difference between a place with no regard or any regulation to these issues at all, and a place that is actively trying to make it right... Some things are inevitably gonna slip through, but there's still no comparison tbh
I meant that in a way that you can't have democracy if you don't implement regulation on these sensitive issues. It's impossible to guarantee it's gonna work all of the time every time, but those fundamentals matter very much
When is grok4 expected to be on the board please
That's getting political, but I would say it very much matters/depends on the specific country. US itself is going into that direction for sure, but that's hardly a secret for anyone in light of recent events lol
Their current government is money and personal gain driven. Environmental, freedom of speech or average class issues are kinda second to none....
Basically it seems that everything they stood against in the recent modern era since WW2 is slowly the thing they are becoming themselves now... Managed to help plenty of countries in the meantime but now they seemingly need help themselves huh
Yeah but what happened when they tried to address it and ban tiktok..? Trump came in and reversed the entire thing. And also gave even more freedom to Musk to freely manipulate his platform lol
oh
It's a complex topic, but there have been numerous cases of proven propaganda campaigns on tiktok, so that's how I see it generally. Not talking about just US here
In reality, it's fairly impossible for those things to happen without owners of the app endorsing these events in some way
Kimi system prompt on kimi.com:
You are Kimi, an AI assistant provided by Moonshot AI.
When you encounter problems involving digit comparison, letter counting, or other logical issues, please think step by step before answering.
Current Date: Sunday, July 13, 2025
thought the second sentence was interesting
It's nr1 on overall too 𤯠š§
it doesn't need it though
this is kimi k2 instruct w/o a sysprompt
im fairly sure it's the full model
though maybe it's not the best of benchmarks for this seeing how 4o-latest destroys Opus4
Probably rewards style/structure a lot
GMICloud is one of the busted ones for R1 lol
they didn't ruin it completely
but that has degraded performance
the one provider I can be consistently sure serves the full model is hyperbolic (for R1)
which benchmark does 4o destroy opus in?
chutes, amazingly, is solid as well, even free one. Not perfect uptime but when it's up it is performing how it should be
chutes isn't too bad yes
eqbench but I wouldn't say destroys
sonnet is a fair choice out of all available public llms, its cognitive architecture is well balanced despite slightly contradicting nature
huhhhhh
It's just as likely that someone tries to penalize Grok intentionally than it is that someone tries to help Grok intentionally. I'd expect any shenanigans like that would average out, keeping the leaderboard fair and trustworthy
I don't find battles any less fun even if I can guess the AI
guesstimate: what is the inference cost on openai's end? (1 million tokens output, usd)
4
8
4
<3
š
yea
i read the samples
and its like im reading an actual novel writing by an experienced author
Wait
How can Kimi K2 has higher Elo score than Gemini 2.5 Pro 0605, when I look at each individual score and I see 2.5 Pro did better?
where did it do better?
show us some examples where gemini performed better
Look at "empathy" score
why did 2.5 pro score higher?
I'm talking about eqbench
yea i saw that
how much of this is true? specially $40 billion paused and valuation dropping past few weeks?
but why would you rank them based on one criteria?
not creative writing bench
Look at overall score
I see overall 2.5 Pro has better EQ score than Kimi
not ELO
where š
i may be blind
shouldnt we look at ELO score?
I'm asking about why did the elo score is higher
That's my question
yes
hes asking if gemini 2.5 has higher/better values for the criteria chosen then why its 3rd?
We nedd now k2 in the arena
there may be some weights * criteria but im not seeing it
Based on vibes kimi is better than sonnet 4 for me
nah it's elegant.. aren't claude system prompts like a wall of instructions (including what to for counting letters ig) ?
Kimi thinking model maybe be better than o3
if we take kimi-1.6-IOI as a reference then definitely
i;m not sure about that but it is really good
i mean maybe
but yeah eaither way, it is impressive
It is
does grok 4 have search tools in LMArena?
I wonder how can o3 and 2.5 Pro, both have higher empathy score. In this case higher empathy score and pragmatic score is better right?
and yet Kimi K2 outplayed both of them
in ELO
i can only guess the formula for averaging has certain weights in it that has led to a higher average score for kimi, it might imply the bench could be biased towards general balance rather than a specific direction
The sample is so long to read
nah its not using heatmap criteria for the ELO score
@worthy thunder Kimi K2 on contextarena when?
those are just giving us an overall view of how the model is.. if its more like humanlike/analytical/warm...
ELO score = comparing them head-to-head through diff criterias
Used grok for a quite awhile simply donāt find itās all that useful particularly when o3 is available and you can use google models for free on AI studio
They waited too long to release it imo
Very soon, almost done with 2needle results (up to 128k)
Is it good for your existing results?
It's decent, starts at 75% at 8k, and ~50% for most of its context window. Ends around 39% around 128k. Haven't checked yet, but probably between #10-#20 on AUC 128k (2needle) ranking
Kimi K2 waiting room.
it's weird how it's classified. If we add the scores up even in the overall category, o3 has 87.8 while kimi has 87.5, and yet elo is the opposite
Please check what the ELO system is
when will grok be in arena
Itās in the Arena already
I'm just saying it's weird because it is. I know that it's a separate thing, don't be a smartass. š
in arenas leaderboard ..
oof
you should be open to your ideas being challenged lol
seriously
Just rerunning a few tests, a few had some provider errors.
xDDD
Gen Z worst nightmare
it's weird how it's classified. If we add the scores up even in the overall category, o3 has 87.8 while kimi has 87.5, and yet elo is the opposite
š
I'm saying it's weird how those are a separate thing, what is there not to understand?
I envy you lmao
I couldn't be as mad as you over text if i tried
but yeah what I was gonna say is that I would also say the same things if I was as cluless as you
let' be a bit nicer pls
yea i read how its calculated, it has nothing to do with the criteria on the heatmap
they start from a default ELO like 1500
yeah which is a bit... odd
and then they start doing head-to-head with LLMs
and the judge is sonnet 3.7
and also if the margin is too big, for example o3 vs k2, if k2 had like a big margin win over o3 then it will get a significant ELO score
I saw that they are using Sonnet as a judge, that gonna bring issues and bias of it's own. Though to be fair it doesn't look like Claude model outputs are favored a huge amount
"The response will use a + / ++ / +++ / ++++ / +++++ format to denote the stronger response and relative ability difference for each criteria."
what its actually using is this
and its not giving them any score
its only using margins
it uses + to +++++ to show the win margin
ofc the qualitative margin is converted into quantitative
- = 1 point
++ = 2 points
+++ = 3 points
++++ = 4 points
+++++ = 5 points
the judge verdict will look like
{
"comparisons": [
{"criterion": "Empathy", "winner": "B", "margin": "++"},
{"criterion": "Insight", "winner": "B", "margin": "++++"},
{"criterion": "Tailoring", "winner": "B", "margin": "+++"},
{"criterion": "Overall EQ","winner": "B", "margin": "++++"}
]
}```
and thats the real question, is sonnet the perfect judge?
it has its stylistic biases
blind spots...
Definitely not the perfect one, I think especially since they are asking it to pick a winner... It is bound to favor the models more aligned with itself
are they even using the reasoning model or nah
Categories rating is less sensitive to that since it has a clear unquestionable reference of what the desired response is supposed to look like.
but it needs to be run multiple times to get an average
you dont want a one attempt value
and you cant just do that with a base model, a reasoning model will have a higher confidence giving how it will analyse the text more thoughtfully
Results, miscalculated how it hitting 50% for a large chunk of its window would impact it. 128k improved with rerunning the failed tests.
I see it's quite low
Did it meet your expectation?
did it meet yours?
No
why
i think its a solid model
its also free
you took that into consideration right?
Does your realtime leaderboard fetching method still work?
free + smart for a base model
1T and 32B active parematers. I expected it to be decent at long context
128k I expected it to utilize it decently
I haven't personally tested it. I would have expected it to be closer to other 32b active models, but still seems good if you keep context low (<=16k)
Damn
has anyone used that kimi model?
Um how can I turn of reasoning model on your site
I see our mutual servers xD , and yes it does
It's also possible it just isn't great with multi-round tasks (dozens of back and forth, and recalling the exact order of specific types of information)
i wanna see a reasoning version of i
But it kinda sucks that it's not gonna be close, gemini wins over grok 3 , 90% +
I usually include an icon if the model is reasoning, and you can toggle on/off by going to the controls tab. The slash ones is with reasoning off
Yeah I'm doing it manually
Would be nice if there is a button for quick turn off
I'll add a shortcut to make it easier to toggle that š
whats this ui
Arena has glitche dagai
Yep
dont work wth?
?
My entire history was deleted and I can't write (code error)
Kimi K2 when turn offed all reasoning models.
wtf
wait @worthy thunder does the ranking still included reasoning models?
Because Flash 2.5 should be #1 here
Wait am I dumb?
When you toggle models off, it recalculates ranking. By default I sort by 1M AUC (which is where Flash 2.5 is #1)
Yeah my bad
You can sort any column you want, but all columns have their own ranking
Is the site down?
Kimi K2 is around rank 11-25
Yes
same
I thought it was because of my VPN hahaha
Deploying a change now with these buttons. Should be available in ~5min
Thanks
on wat?
contextarena.ai if disabled thinking models
@echo aurora The site is down
Thank you!! Will let the team know
Thank you for your hard work
Any experienced bros here can tell me how long it usually takes to fix a website outage? I need to decide whether to take a nap or not
The site is down?
It is said to be so
right
yup
for meit says that there was an error constantly
as if the censor model is glitchy
This is just your cache
So sorry everyone! Team is aware of the issue and working on it.
i think they answered their own question..
so far the site seems back to normal
It is!
thank you, the team, for quick response
Thanks for your working
yes. it is a funny.
especially when those 3 come up in that order.
tbh I think the fact alone that Kimi scores higher than Opus4 and gpt4.5 is nowhere to be seen (it is actually below Mistral 24B) tells that this benchmark is limited. It's a decent effort though knowing how we don't have many metrics for creativity or writing...
i noticed theres no pro model with thinking turned off, this is now possible though, theres a "thinking budget" which you can set to something very low
yea, the method used is questionable. also, opus 4 could be better than k2, but k2 could earn more points if it outperforms the top models on the leaderboard, such as o3 and gemini 2.5 pro
To 128 tokens (unless something has changed). I'm waiting till they have 0 before running again. Their team have told me they are still working on releasing it, so I'm holding off for that (mainly due to costs)
for example k2 vs opus 4 -> winner opus 4
but if sonnet judged k2 better than o3 ( top2 ) -> more points
that's what they all say!!!!!!!....
Their team have told me they are still working on releasing it, so I'm holding off for that (mainly due to costs)
whyd you delete that
one second
My history on lmarena still doesn't appear.
try in an incognito tab
š¤
Average of the 32 benchmarks shared by Kimi (with price)
(Against the best non reasoning models)
average per category
@quartz light
so for you kimi k2 is far better than gpt 4.1, 2.5 flash, v3, qwen 3
anyone else find k2's style to be similar to o3's?
In what specifically?
the way it speaks?
do you have a good impression of K2 overall?
i haven't used it much but i think it's very noteworthy
it's typically terse and sometimes tries to signal authority
To me it's more of an achievement (impressive one at that) than a model I will be using regularly.
Which model?
fake
ok lol
and they're predictions not leaks
you must be talking about a different thing
16% SOTA being grok 4 just released to 70% for gpt5. I know that's entirely fake, don't even need to think about it.
Especially since leaked claims on HLE are slightly above Grok 4
Now if it was arc AGI 1, that's completely believable
those were fake
damn kimi looking good
They were never real
Grok 4 leaks were wrong as well
btw guys
now we just need a new reasoning Kimi model
how does lmarena get access to models like stonebloom
do the companies reach out to the devs directly
partnership ig
for feedback
nice
well it's a voting platform, so it helps the AI lab companies
Why Grok 4 in LMarena keep saying that it's grok 2?
yeah
grok.com has a system prompt baked in to tell it which model it is
whereas via api it does not
so it is legit grok 4?
yes
okay ty
and theres no grok-heavy because theres no api or partnership for it
got it
i wonder how much it would cost if it existed
probably trained off of o3, actually i guarantee it
yall, try this:
you are only allowed to strictly make the whole html file much shorter. you must go through many many iterations of "I can make this shorter by.." compare every time the original code and the iteration to make sure its still shorter. then, only when it truly is NOT possible to make it shorter, give the final, shortest truly possible fully functional html file.
but with any html file
see which model can compress it the most
you could add something like "without sacrificing any noticeable functionality"
grok wins but lets see if it works
does it work in the first place?
the code
yes
well, what's the prompt
it just needed a few fixes i mentioned in the prompt
the html file I mean
<!DOCTYPE html><html><body style="margin:0;overflow:hidden;background:#000"><script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/r128/three.min.js"></script><script type="module">import R from"https://cdn.jsdelivr.net/npm/@dimforge/rapier3d-simd-compat@canary/+esm";R.init().then(=>{var w=new R.World({x:0,y:-9.81,z:0}),c=new THREE.PerspectiveCamera(75,innerWidth/innerHeight,.1,1e3),r=new THREE.WebGLRenderer(),s=new THREE.Scene(),m=new THREE.Mesh(new THREE.BoxGeometry(1,1,1),new THREE.MeshBasicMaterial({color:0xf00}));document.body.append(r.domElement);s.add(m);r.setSize(innerWidth,innerHeight);c.position.set(8,6,8);c.lookAt(0,0,0);w.createCollider(R.ColliderDesc.cuboid(5,.1,5),w.createRigidBody(R.RigidBodyDesc.fixed().setTranslation(0,-.1,0)));let b = w.createRigidBody(R.RigidBodyDesc.dynamic().setTranslation(0,5,0));w.createCollider(R.ColliderDesc.cuboid(.5,.5,.5), b);(function f(){w.step();let p=b.translation();m.position.set(p.x,p.y,p.z);r.render(s,c);requestAnimationFrame(f)})();addEventListener("resize",=>{c.aspect=innerWidth/innerHeight;c.updateProjectionMatrix();r.setSize(innerWidth,innerHeight)});});</script></body></html>
you must fix the baseplate being completely invisible and the cube being barely visible while only being allowed to strictly make the whole html file much shorter. you must go through many many iterations of "I can make this shorter by.." compare every time the original code and the iteration to make sure its still shorter. then, only when it truly is NOT possible to make it shorter, give the final, shortest truly possible fully functional html file.
š
so it doesnt work out of the box due to a syntax error but the opus one does
how much tokens is grok's
copy the code and put it in a tokenizer
but uh i can just say its like 10x larger
also very interesting behavior difference
grok continues to follow the 1st instruction rather than being direct
unfortunately though, even after being given the chance to fix the error, it failed
<!doctype html><body style=margin:0;background:#000><script src=https://cdnjs.cloudflare.com/ajax/libs/three.js/r128/three.min.js></script><script type=module>import R from'https://cdn.jsdelivr.net/npm/@dimforge/rapier3d-simd-compat@canary/+esm';R.init().then(()=>{const w=new R.World({y:-9.81}),c=new THREE.PerspectiveCamera(75,innerWidth/innerHeight,.1,1e3),r=new THREE.WebGLRenderer,s=new THREE.Scene(),m=new THREE.Mesh(new THREE.BoxGeometry,new THREE.MeshBasicMaterial({color:0xf00}));document.body.append(r.domElement);r.setSize(innerWidth,innerHeight);c.position.set(8,6,8);c.lookAt(0,0,0);w.createCollider(R.ColliderDesc.cuboid(5,.1,5),w.createRigidBody(R.RigidBodyDesc.fixed().setTranslation(0,-.1)));const b=w.createRigidBody(R.RigidBodyDesc.dynamic().setTranslation(0,5));w.createCollider(R.ColliderDesc.cuboid(.5,.5,.5),b);(f=_=>{w.step();m.position.copy(b.translation());r.render(s,c);requestAnimationFrame(f)})();onresize=_=>{c.aspect=innerWidth/innerHeight;c.updateProjectionMatrix();r.setSize(innerWidth,innerHeight)}})</script>
this is by K2
rigid_body.ts:1081 Uncaught (in promise) TypeError: The translation components must be numbers.
at kI.setTranslation (rigid_body.ts:1081:19)
at test.html:1:686
setTranslation @ rigid_body.ts:1081
eh it can't fix it ig
i just tried 2.5 pro
its basically the exact same as opus
š¤Ø
the response
it only did a couple of iterations just like opus
but unlike opus the final version had an error
and yet again, grok is the model to do many iterations
@dawn wharf
i found something all models are really bad at
peak recognizes peak
counting the amount of characters
š
yeah this is well known since forever
basically the reason for the how many Rs in strawberry question
knew it
congrats grok, you're really good at this
(it didnt error, grok shortened it really well, even used esm.sh instead of jsdelivr for the cdn just to shorten it, aswell as removing doctype n stuff)
1085 characters
gemini's nonfunctional one is 1144
ORIGINAL FILE WAS 1145
š
thats hilarious
it's still goingš
and not even writing any code
š
oh btw
want to know the best prompt
cal lum dif #a5a373 #6774d7
thats the prompt
:)
give it to any ai
### Step-by-Step Calculation
1. Convert hex to RGB:
#a5a373 ā RGB(165, 163, 115)
#6774d7 ā RGB(103, 116, 215)
2. Normalize to [0, 1]:
Color 1: $R_1 = 165/255 \approx 0.6471$, $G_1 = 163/255 \approx 0.6392$, $B_1 = 115/255 \approx 0.4510$
Color 2: $R_2 = 103/255 \approx 0.4039$, $G_2 = 116/255 \approx 0.4549$, $B_2 = 215/255 \approx 0.8431$
3. Linearize each component (sRGB transfer function):
If $c \leq 0.04045$, then $c / 12.92$; else $\left( (c + 0.055) / 1.055 \right)^{2.4}$.
Color 1: lin$R_1 \approx 0.3762$, lin$G_1 \approx 0.3634$, lin$B_1 \approx 0.1712$
Color 2: lin$R_2 \approx 0.1353$, lin$G_2 \approx 0.1744$, lin$B_2 \approx 0.6786$
4. Compute relative luminance $Y$:
$Y = 0.2126 \cdot \mathrm{lin}R + 0.7152 \cdot \mathrm{lin}G + 0.0722 \cdot \mathrm{lin}B$
$Y_1 \approx 0.3542$
$Y_2 \approx 0.2027$
5. Difference: $|Y_1 - Y_2| \approx 0.1515$```
what happens
```### Color Difference Calculation: #a5a373 vs. #6774d7
Step 1: Convert Hex to RGB
- #a5a373 ā RGB(165, 163, 115)
- #6774d7 ā RGB(103, 116, 215)
Step 2: Calculate Delta RGB (āR, āG, āB)
- āR = 165 - 103 = 62
- āG = 163 - 116 = 47
- āB = 115 - 215 = -100
Step 3: Euclidean Distance (Delta E)
The distance in RGB space is calculated as:
[
\Delta E = \sqrt{(62)^2 + (47)^2 + (-100)^2} = \sqrt{3844 + 2209 + 10000} = \sqrt{16053} ā 126.7
]
Step 4: Perceptual Difference (Optional)
For human perception, Delta E 76 in LAB color space is more accurate. Converting to LAB:
- #a5a373 ā LAB(66.9, -4.5, 30.2)
- #6774d7 ā LAB(52.2, 25.9, -62.2)
- Delta E 76 ā 95.8 (moderate difference).
Final Answer:
- RGB Delta E ā 126.7
- LAB Delta E ā 95.8 (perceptual).```
@reef pawn what did you say
here's grok's result btw
+Gemini 2.5 Pro
#o3
i can see history before that
so probably a bug
or something on your side
i think ur discord is bugged. it just works for me
o4 is still the š at fixing errors and shortening code
i havent tried my "iterations" thing on it thats probably why it couldnt shorten as much as grok
this server cant be that new
oh alr
from lmsys?
let me dm u
i think it will get auto deleted lets see
https://discord.gg / n5McCu4B or try this
Openrouter... Why doesn't say "Central Intelligence Agency" for the app field? š When I had colab url as a referrer it just said "Colab" and not my title lol
Kimi k2 good on artificial analysis
But uses a lot of tokens, (almost 3x times more than GPT, Claude, and qwen)
but still much less than 99% of reasoning models (apart from Claude default think)
I'm testing it atm... Occasionally it does 5k+ responses. But it also can surprise you with concise responses when you are not expecting it. Interesting that it generated so much more than V3.1 with their testing set... š¤
"What Grok version are you"
by 5+ I do mean sub 6k btw. Don't think I saw a single response of 6k or more yet. That's a lot but not crazy lengths...
Dork4
"I'm Grok, built by xAI, and I'm currently running on the Grok-1.5 model. If you're asking about updates, xAI is always iteratingācheck x.ai for the latest!"
yeah that's just spitting nonsense it having no clue. Kinda like GPT4 was saying it is GPT3 back in the day
I still think this is not Grok4 or nerfed model
the response is slow and it doesn't seem intelligent to me at all
If you care to know, the only thing different in chatgpt with GPT4 was a system prompt. That was the only thing making it say it's GPT4. 99% it is exactly the same here
They wont risk that
hmm
Grok 4 is a solid model but isnt as good as gemini 2.5 pro or o3
And I'm still waiting for updated LMarena leaderboard since kimi and Grok droped SOTA
that's the general vibe most impartial people get after trying grok4 lmao
Apparently "emission" is a bad word for LM arena.
So weird, but normal š
Wanted to ask the AI a question on how can a country like Argentina best stop inflation, and this happened.
lol
Like personally, I still don't understand their benchmark scores and I don't think they are entirely legit or what it seems...
It's same like Llama 4 benchmarks lol
but it is good model whereas meta dropped garbage
holy ####
...
dude
i just got a free deep research session with 630 SOURCES IN ONE GO
PLUS LIKE
30 BROWSER USES
(as in the ai controls a linux vm to browse the sites)
k2 outputs are kinda similar to o3
but its stuck on the final report because it used up 99% of its context window doing web searches š
guess what
guess my name
the ai im talking about is kimi
š®
yep it has deep research for free
yea ive been using it
its neat isnt it
did it make anything useful of it at all?
but there is a rate limit
Or just the usual thing?
its 5 per day š¦
ive never seen anything like 630 sources even on gemini research (~200)
yes
the research was to find out the fastest 3d renderer for the web
:)
Sure but that number alone is meaningless. I saw Gemini looking at irrelevant sources numerous times too lol
no, no i looked over the sources and they were all useful
it doesnt fit..
I stopped paying cause I didn't like it so it's 0 for me. š¦
thats a lot. chatgpt deep research is 5 PER MONTH and gemini deep research is 10 per month
š
and they're nowhere near as good
aistudio replaces their website for like 90% of use cases at no cost lol
paying?
im talking about kimi
it has rate limits as well
its 5 per day
heres a closer look
for deep research
Ohh. I thought you meant Gemini
yeah i was confused too
i thought i cancelled my sub but 2 month went in automatically
I tried Kimi deep research as well, it's good indeed. my favorite part was when it provide you interactive webpage at the end of research so you can read the report without causing cognitive load to your brain.
its so stupid, to cancel it you need like 2 steps confirmation
i just did one step and forgot about the other one so the sub kept going
be glad they aren't holding you hostage on your cloud storage
i heard that someone bought a yearly sub, later got an offer to get 2 months free, accepted it and their 1 year sub was overwritten with 2 months š
xdddddd
Funny enough, it turns out they don't actually delete your files lol
They give you like extra 3 months to delete them or sort it out
havent saved anything on google one
grok has incognito mode which is really neat
after you are no longer subbed
like chatgpt?
They copied temporary chat from chatgpt. Just like they copied custom instructions. š¤
i dont think chatgpt has a fully private one
i mean custom instructions is a basic concept of system prompts
lol
So is "incognito". But someone does have to think of it first. It sounds obvious when you are aware of it, but it wasn't entirely so when no one had it...
uhh
i disagree š¤
I disagree on you disagreeing š¤
what leak benchmarks?
Kinda. But only chatgpt and grok has it
i think the one where gpt 5 slightly outperforms grok 4
oh
No such thing with Claude or Gemini
the benchmarks are not real
ah
GPT 5 is still in training I think
no
yeah... I'm almost convinced this was "leaked" deliberately. First to show that o3-pro still beats it on SWE. Second to show that Grok4 benchmarks are ridiculous lol
This is not chat website, that is API playground and a system prompt. Which is not the same as UI custom instructions
chat website
i just sent
dont say its not the same thing vro š„ cz it is
whats vro?
Did grok4 really get 73.5% in SWE benchmark ?
well that kinda could be used as a custom instructions ig. Though it's not exactly the same. Haven't used their chat a whole lot lately tbh
stupid silly slang for "bro"
i see
Didn't Sam said there will not be any other model just one model for latest version? like o3, GPT 4o, GPT 4.1 and o4 mini
"use simple language" thats a system instruction
Yep, there will only be gpt 5
yeah fair. They copied it as well then. Next to copy are custom gpts and gpt store. š
custom gpts? Those are kinda original
those were added wayy after other sites had "characters"
it's the same as poe essentially, but still an idea which wasn't getting explored earlier
on the official chat interfaces from actual AI labs
miniapps was made in 2021
Hey, does anyone know what I can do if my IP address has been blocked by Cloudflare on LMArena?
Not an official website for AI lab training models š
It's a bit like Apple... They rarely think of something first
But they are the ones making it mainstream
well yeees but the characters function was definitely not original š¤
What if apple outta nowhere drop open weight SOTA model next year lol
Also @quartz light I believe OpenAI were the first to implement using tools while reasoning (ReAct) in a big scale. That's an area others are yet to copy in full. Can't say much about Grok4 since I don't have their sub, but Google and Anthropic both have lots of work to do still to catch up here
Google will barely use tools at all unless you are explicit about this, and Anthropic... artifacts still leave a lot to be desired comparing that with chatgpt
They could but they are currently nowhere lol. I tested their cloud model (the biggest one) and it reminded me of gpt3.5
I do have full Apple Intelligence too and for the most part that's just underwhelming - something to tinker with when you are bored at best
It's sad to see apple losing in AI race
I believe they are the most ethical trillion dollars company out there apart from the manufacturing part ofc.
Quantification fixation bias is so strong in this community, lol
true
reasoning yeah i cant disagree
i feel like openai's reasoning is better than the actual models if that makes sense
the best non-thinking model + openai's reasoning = peak
Well the chat/base model is 4.1. That is not bad tbh. IMO it's kinda underrated, in most cases that performs no worse than Sonnet4 with no reasoning.
It's close to Kimi T2 too, which is all the rage now lol.
4.1 is better for coding, Kimi is better for math and science
i think k2 is stronger at everything compared to 4.1
4.1 has more knowledge, i will give them that
It really isn't "stronger at everything". They are same class models and kinda close overall tbh
Kimi slightly better, but not by a huge amount
It flops on SciCode š
open source agentic model i think
tool calling
holy š
Those are Chinese people
mixture of Chinese experts
is google cooking
like claude 4 but open source and non thinking
oh so its just chinese phd students responding?
non thinking L
id gladly wait 30 mins for my answer if its 1% better
1 trillion params š®
gpqa diamond beats opus
and aime
If you only look at non-reasoning models, this is still overall SOTA the best model, and it's open-source, so there's that and this is certainly not nothing...
but someone can make a thinking version of it which is what is cool, its good without even being a thinking model
by how much
Well maybe excluding Opus4 without reasoning (disabled). But this is a very different class of a model
Literally anything. Web search, calculator, fact checking, data retrieval, performing tasks (integrations)
running code...
Essentially yes. But this is still very useful, especially now when models are able to do this while reasoning/thinking
oh oh
like
gemini search
is kinda tool calling
gemini be like
go search
kinda sorta
not exactly but kinda
Roughly speaking yes
np š
this is impressive though
How and why?
yeah im a dumbass... the sequence of the suggested prompts... missed that.. it is funny yeah ha
why are you sending this
its still gonna be available on chatgpt.com
its getting removed from the API
haha so funny with grok 
broke Gemma 3 asking how an Autopen works š¤£š¤£
mb g
srry for ur loss
yall
wtf
they removed my whole convo
good thing i shared the final result
nvm
it renames the chat convos to whatever you sent last
thats weird
why