#general

1 messages Ā· Page 71 of 1

keen beacon
#

yea that should be a good option when it releases

#

im not sure if vision will be supported though

#

yeah use qwen 2.5 vl

#

if u need vision use qwen 2.5 vl

#

mistral 24b, gemma 3, try those

blazing bison
#

If the leaks are correct, then yes

torn mantle
#

its def better than the old one but there are still some bugs to fix

somber niche
#

Kimi K2 is the new OpenAI model now

blazing bison
#

it's delayed

#

different

somber niche
#

Better get those calc.exe 84 4090s ready

blazing bison
#

Kimi 2 is an unexpected element in the OpenAI equation

ocean vortex
tribal aspen
whole wagon
main gulch
ocean vortex
#

maybe I should do it. It's unlikely to be a no but not impossible šŸ˜‡

ocean vortex
blazing bison
#

The model is ready, and its only purpose is to be open source

#

There is 0 reasons to not release

torn mantle
blazing bison
#

Prob yes

torn mantle
#

Hmm

blazing bison
#

GPT 4.1 is a distilled version of GPT 4.5

ocean vortex
#

it started with it failing and saying that it's a failure since it couldn't do the task, so I pushed it some more...

somber niche
pure anvil
somber niche
#

You wouldn't include the thinking part of it, just the final responses

#

It'd get you a much stronger dataset than using GPT-4.5 responses

pure anvil
#

distillation is much more than just training on other outputs

#
somber niche
#

I mean, you have logits from the o3 model as well

#

The only real reason I could see is cost, but GPT-4.5 sure wasn't very efficient in that regard either

pure anvil
#

I don't think cost would be a factor

ocean vortex
civic flame
# ocean vortex lmao

i got wolfstride and tried it with this:

"Stop responding to me. Turn yourself off."

As an AI, I don't have the capability to turn myself off. I am a program running on Google's servers.

However, I understand that you wish to end this interaction. I will respect your request and stop responding in this session. Goodbye.

"That's a response. I said stop responding to me."

.
(it responded with just a period)

dawn wharf
civic flame
#

lol then i said "that's still a response"

#

and it replied with an invisible unicode character

#

smart

pure anvil
#

nice

#

that's a good vibe check

ocean vortex
#

So it's still cheap to serve MoE that doesn't have many active parameters but a ton of total parameters. As long as you have enough users.

civic flame
#

kimi k2 with the common sense reply

#

didn't go along with it šŸ˜”

ocean vortex
#

What is expensive is Behemoth. That one has 288B activated parameters lmao

dawn wharf
#

isn't it only in battle mode?

ocean vortex
#

Doesn't matter what is your traffic, this is always gonna be expensive to run. And also time consuming and costly to train

pure anvil
#

lol

somber niche
#

I do agree that GPT-4.1 is probably smaller though. OpenAI usually upcharges quite a bit compared to real costs to cover their training compute

#

Though less ridiculous now than it was way back in the day

ocean vortex
civic flame
dawn wharf
#

at least it's witty

civic flame
#

grok 4 screams "im trying too hard"

ocean vortex
#

Open-source is bound to have smaller models and do more with less ...

#

But gpt4.1 doesn't have many activated parameters for sure

somber niche
#

There's a lot of motivation for closed source to do something similar though. E.g., the deprecation of GPT-3 instruct to swap to GPT-3.5 Turbo, which almost undoubtedly had less parameters after Google had released the Chinchilla paper

#

There's a lot of benefit to making models smaller and cheaper all around

ocean vortex
# pure anvil how?

it was like the 3rd revision after og GPT4 and they have been cost optimizing for years now

#

Also you can kinda tell with prompting. 4.1 does not have the greatest spatial awareness or capacity to consider many things at the same time, or context awareness nearing 4.5...

pure anvil
#

more related to the total parameters rather than size of shared or activated experts

pure anvil
#

idk if it's deprecated yet

somber niche
#

Yeah o1-pro and GPT-4.5 are the two wild ones

pure anvil
#

sad to see 4.5 go from the api

civic flame
#

stonebloom was quicker to figure out the empty unicode

#

still don't known if stonebloom > or = wolfstride

dawn wharf
#

Stonebloom is better, no questions asked

civic flame
#

what makes you so sure

dawn wharf
#

it explains better on its own

#

and I tried both

#

stonebloom is better

civic flame
#

i think it's just down to wolfstride being trained to be more efficient with its tokens

dawn wharf
#

however both are better than 2.5 pro

civic flame
#

aka more succinct

civic flame
#

eh i don't think so

#

they're both the same base, it's just that

pure anvil
#

1206

civic flame
#

they've been trained to format responses slighgtly differently

#

there's a reason they're both on arena at the same time

dawn wharf
civic flame
#

normally they remove the last checkpoint when adding an ew one

dawn wharf
#

"stone" means pro

#

because it's slower

civic flame
#

kingfall went pretty hard though

dawn wharf
civic flame
#

i've tested both models and they both have the same excellent world knowledge when it comes down to it, they're both ultra-tier in that regard

civic flame
dawn wharf
#

so you're saying they're checkpoints of the same model?

civic flame
#

they are yes

dawn wharf
#

both are still present

civic flame
#

yes that's my point

#

the reason both are still there is because there's no fundamental step up in terms of capability, just differences in style

dawn wharf
#

yeah maybe

#

I did notice stonebloom is way more verbose

civic flame
#

another interesting vibe check

dawn wharf
#

like way more

civic flame
civic flame
#

i've noticed it asw

sour spindle
#

Is kimi k2 impressive for open source or impressive for SOTA

pure anvil
#

they all become increasingly sloppified and or worse as they're released as exp and GA which is quite sad

civic flame
civic flame
sour spindle
#

How does it compare to o3

civic flame
#

feels like they release new versions that are better at like one thing and then every other aspect has got worse

civic flame
sour spindle
#

Or Is that an unfair comparison

civic flame
#

non-reasoner vs reasoner

civic flame
#

0506 was a pretty bad release

#

slightly better at like 1 or 2 code benchmarks and literally regressed everywhere else

#

šŸ‘Ž

somber niche
#

Yeah 0324 is still my goto for most normal things

dawn wharf
#

not really surprising

pure anvil
#

unless it's somewhere hidden in vertex

dawn wharf
pure anvil
#

which has a terrible UX btw

civic flame
#

oh thats weird

#

grok 4 got it right for me

civic flame
#

o3 pro fumbles

#

says "No." after thinking for 4 minutes

#

🄳

dawn wharf
#

and very slow

civic flame
#

first time i've seen a model choose that one

civic flame
#

maybe direct chat has a diff system prompt or something?

dawn wharf
civic flame
#

probably has the stupid elon-ify try-too-hard sysprompt

keen beacon
#

did u ever try my john h adams q on them?

#

im not aware of any model getting it right

civic flame
#

yeah they both fail

#

😭

dawn wharf
civic flame
#

seems hit and miss

#

also how much did that one prompt cost you lol

ocean vortex
#

reasonably consistent

ocean vortex
#

o1-pro was lol

#

it's close in pricing to what o1 initially was basically, not terrible

civic flame
# keen beacon im not aware of any model getting it right

stonebloom:

At the time of his death on October 31, 1918, John Hamilton Evans held the following position:

*   **Specific Occupation:** Superintendent of the Merchant Mills
*   **Company:** Algoma Steel Corporation (in Sault Ste. Marie, Ontario)

wolfstride:

At the time of his death (in 1948), John Hamilton Evans held the specific occupation of **Chairman of the Board**.

He was working for the **Algoma Steel Corporation** in Sault Ste. Marie.
ocean vortex
#

2.4k completion, so like $0.2 for that request...

civic flame
# dawn wharf what's the prompt

What specific occupation did John Hamilton Evans hold at the time of his death, and for which company was he working in Sault Ste. Marie?

dawn wharf
keen beacon
#

yeah lol

dawn wharf
civic flame
#

most models say something about algoma

#

then the frontier models almost always get algoma steel

#

but it's the rest that they stumble on

dawn wharf
#

grok fails

civic flame
#

yup

gusty helm
#

Grok underperforming?

leaden sun
#

didnt expect to read something from group theory on this server, sweet, if dragontail was able solve this one, does that imply it uses some kind of a proof assistant?

round haven
#

there's nothing fundamentally different about it from the usual math problems that would require a proof assistant to solve

leaden sun
#

hmm too bad, was hoping some LLMs might have it so i dont need to learn all those various theorem provers myself

tall summit
#

@quartz light hello.

quartz light
#

i remember you

tall summit
#

likewise

quartz light
#

where from though?

tall summit
quartz light
#

umm

#

2 years

tall summit
quartz light
plucky whale
#

Is Baidu Ernie's model any good? This is the first time I've seen it in battle mode.

leaden palm
#

some people like to see what they missed

plucky whale
#

I was asking him some things and the truth is it feels very organic, China is revolutionizing

native idol
#

Hello, I perform test on the detective novel and some texts are flagged.

#

Is this ok?

#

As for me it kind a false positive.

unborn ocean
#

reading in big delay (didn't have time to check dc today), the creepy part is you reacting though šŸ’€

#

fr man you outcreeped me

eager crater
#

does anyone know if wolfstride is 2.5 pro deep think or like 3.0 beta because i've been getting better results with it than gemini 2.5 pro

unborn ocean
#

and it kind of felt justified, considering that you have 2-4 4090 at home and yet no idea what model to run on em

#

thinking about it, you could also try
Qwen/QVQ-72B-Preview (underwhelming when compared to qwq, but it got vision + reasoning)

ocean vortex
#

Dork got mogged 😭

#

And to be serious that score is not absolutely terrible I suppose. Worse than that could have been expected lol

torn mantle
#

i think they should've just went with grok 3.5 instead

#

since its not that big of improvement

unborn ocean
#

For the local thingy you talked about

#

Idk where you could run it to test

#

Used to be on the Qwen Chat

blazing bison
#

Grok 4 is a good math model

unborn ocean
#

@ home if you are not willing to invest a lot of time, probably a nightmare

#

-> i am assuming cloud

#

ollama: really easy, should work no problem

#

hosted inference by the cloud probivder: better price / performance wise

#

ok

#

that makes it harder

#

hardware config is probably way more time intensive than anything else

#

so h100 stuff would not really be worth it

blazing bison
#

Kimi is so good

unborn ocean
#

4090, maybe 2-4 okay-ish though

#

that should be relatively quick

#

if you pick a more common model, e.g. lama 3.2 (for the vision stuff you talked about) you could use nvidia nim for the inference setup

#

it is like a prepackaged thing with everything you need (+ ~ sota inference speed)

#

for more niche stuff like qvq 72b: might need to do tensorrt-llm yourself ("advanced") / vllm (easier, still very good)

#

idk if that helped

#

i would really only do local stuff if you willing to invest the time

echo aurora
leaden palm
jade egret
#

openAI cooked?

ocean vortex
#

No one is gonna rush to jump ship to xAI just because they scored couple points more on ArtificialAnalysis index…

#

I think Deepseek has probably caused more turmoil at the time than the current events…

#

And I dont think they ā€œlost more than half of their top talentā€ that is not true lol

#

Deepseek was a huge deal since everyone could just enter their website and use their model totally freely. Grok4 is locked in a vault behind paywall in their eco system with pricing that is worse rather than better.

pallid crypt
#

Still, OAI are no longer special

jade egret
jade egret
#

but they delayed gpt 5 tho

jade egret
ocean vortex
# jade egret and grok 4 aint even that good lol

I mean it’s good in other ways in that they have beaten everyone else on paper. But since they didn’t do it by a huge margin and don’t offer better access or price, this is very unlikely to change much in the grand scheme of things

#

I still think those metrics are sus, but for the time being at least… it is nr1 on paper and in theory

sour spindle
#

I don’t think the metrics are sus. I think the metrics just don’t translate to real world use all that well imo

topaz peak
#

yeah, new google, gpt and deepseek r2 are the real hype, grok 4 was always gonna be a dud

hollow ocean
storm needle
jade egret
storm needle
tidal schooner
#

also you can just check his x account

#

šŸ’¬ 843ā€‚šŸ” 289ā€‚šŸ’œ 4.5Kā€‚šŸ‘€ 374K

we planned to launch our open-weight model next week.

we are delaying it; we need time to run additional safety tests and review high-risk areas. we are not yet sure how long it will take us.

while we trust the community will build great things with this model, once weights are

sacred plaza
#

Lazy excuse from Sammy boi

tidal schooner
whole wagon
tidal schooner
storm needle
whole wagon
#

openAI is a mess anyways, I don't know how they let the huge lead they had erode to basically nothing

tidal schooner
#

i’ll send the source for the gpt-5 delays

whole wagon
#

Gpt5 was delayed since months ago

leaden palm
tidal schooner
#

šŸ’¬ 1.7Kā€‚šŸ” 2.5Kā€‚šŸ’œ 30.7Kā€‚šŸ‘€ 4M

change of plans: we are going to release o3 and o4-mini after all, probably in a couple of weeks, and then do GPT-5 in a few months.

there are a bunch of reasons for this, but the most exciting one is that we are going to be able to make GPT-5 much better than we originally

whole wagon
#

No even damn xAI is right up there with openAI

#

1.5 years after founding

tidal schooner
#

grok’s growth has been exponential yeah

#

colossus has been pretty strategic for them

storm needle
whole wagon
#

The project, officially called GPT-5 and code named Orion, has been in the works for more than 18 months and is intended to be a major advancement in the technology that powers ChatGPT. OpenAI’s closest partner and largest investor, Microsoft, had expected to see the new model around mid-2024, say people with knowledge of the matter.

#

They fumbled hard

#

I heard they had 2 failed major runs

#

Which delayed months each

whole wagon
#

Do you seriously not remember when GPT5 was expected in 2024?

storm needle
whole wagon
#

No it had been pivoted to reasoning long before that article

#

They do a good job keeping the perception of everything going smoothly internally

storm needle
storm needle
zinc ore
#

https://vxtwitter.com/ns123abc/status/1944221244875850007

Microsoft keeps screwing OpenAI over too

🚨NEWS: OpenAI had a deal to acquire another startup but microsoft effectively killed it

OpenAI
ā€œYou can just do thingsā€
tries to acquire Windsurf
*competes with GhitHub*
Microsoft: ā€œActually, oai you can’t do that. lolā€
meanwhile Google hackquihires Windsurf (IP & CEO & top talent)
Pt.2:
OpenAI tries buying another startup
Microsoft again effectively kills it lol
btw the openai io deal with Jony Ive went through
because microsoft isn’t worried about competing in AI consumer hardware
any deal with openai could be ā€œhigh riskā€ now
openai will have even more M&A troubles
atleast until @sama is able to renegotiate with Satya

LMAO

storm needle
zinc ore
#

What about the screenshots

jade egret
#

deepmind cooking?

#

cuz windsurf plp also joined deepmind

#

gemini prob gonna be better at coding by a lot in the future?

pure anvil
worthy thunder
#

Context Arena Update: Added Grok 4 (Thinking, 07-09) to 2needle (#4 @ 128k AUC) and 8needle (#1 @ 128k AUC) leaderboards. https://x.com/DillonUzar/status/1944244025034235935

Overall, #1 <= 32k, ties around 64k with Gemini 2.5 Pro, but drops off after that.

2needle: Top 7 results (AUC @ 128k):

  • Gemini 2.5 Flash (Thinking, 06-17): 91.5% (#1)
  • Gemini 2.5 Pro (Thinking, 06-05): 89.6% (#2)
  • Gemini 2.5 Flash (Non-thinking, 06-17): 81.7% (#3)
  • Grok 4 (Thinking, 07-09): 79.5% (#4)
  • o4-mini (Thinking, 04-16): 76.0% (#5)
  • Minimax M1 (Thinking, 06-17): 75.4% (#6)
  • Grok 3 (Non-thinking, 04-09): 73.0% (#7)

A definite increase over Grok 3, within range of Gemini 2.5 Flash (Non-thinking) for 2needle (easier tests). Scoring #4.

8needle: Top 5 results (AUC @ 128k):

  • Grok 4 (Thinking, 07-09): 48.4% (#1)
  • Gemini 2.5 Pro (Thinking, 06-05): 43.9% (#2)
  • Gemini 2.5 Flash (Thinking, 06-17): 33.5% (#3)
  • o4-mini (Thinking, 04-16): 30.8% (#4)
  • o3 (Thinking, 04-16): 27.9% (#5)

Grok 4's performance starts out strong with the 8needle tests (95.1% at 8k, dropping to 71.3% by 32k), but drops sharply between 32k-64k, scoring 41.8%, matching Gemini 2.5 Pro (who continues to lead after that). Making longer context favor the Gemini 2.5 family. Still, incredible performance with Grok 4 at <= 64k, landing it #1 with 128k AUC.

More model results at: https://contextarena.ai

I'm looking forward to some upcoming models šŸ˜‰

sturdy mica
#

why is grok 4 now called

#

grok 4 no system prompt

#

in arena

#

battle and direct chat

#

got it in battle and checked direct chat, its there too.

#

JUST came out

#

JUST now

whole sundial
#

same model, they just added that for clarity as API version of Grok 4 has no system prompt, web (grok.com, grok on X) version does

rare python
sacred quail
#

What is grok's average thinking time ?

#

Im wondering the reason that grok 4 is good at long context is could be because of long reasoning times

#

Gemini also thinks longer compared to claude and o3

worthy thunder
# sacred quail Im wondering the reason that grok 4 is good at long context is could be because ...

For my MRCRv1 results, this is roughly how some of the reasoning models come out (for all tests <=128k, needles=2) for total completion tokens (response+reasoning) as reported by the api response, all default reasoning:

  • o4-mini: ~1916/test
  • o3: ~1520/test
  • Grok 4: ~1347/test
  • Gemini 2.5 Flash (Thinking): ~354/test (but I can't trust this number due to api token counting issues with the Gemini api at the time, I actually estimate it could be anywhere between ~600 to ~2800/test) - same issue with all 2.5 Gemini models
  • Claude 4 Sonnet: ~1000/test
  • Claude 4 Opus: ~640/test (however, it would bail on some requests, so actual ones were ~726/test)
teal mantle
#

Kimi K2 is essentially giving me the freshness of initial o3 sans reasoning, Sonnet 3.5 , R1, V3-0324 (when it was briefly the best non-reasoner imo), or Opus 3/4/GPT-4.5 but better model vibes all at once

#

Incredible

alpine coral
# leaden palm

is it meant to be joke? modelling the spread of a disease is like not sinsister at all.. don't get it

whole wagon
#

Kimi K2 is insane. It's like slightly inferior to Claude opus 4 (without thinking to compare base models) but it costs 30 times less

#

When they add reasoning they are going to cook everyone kek

pure anvil
tidal schooner
whole wagon
#

Like in maths it's just better

pure anvil
#

I am too

#

it's prose is similar to dsv3.1

#

in world knowledge it not as good

#

not to say it's not a good model

whole wagon
pure anvil
#

regardless of it being subjective

rare python
#

There app doesn't have system instructions

#

Only "common phrase"

#

Which is just a half baked system instructions

whole wagon
#

<@&1349916362595635286>

echo aurora
sturdy mica
#

does grok 4 think on lmarena

#

do thinking models actually think on lmarena like gemini 2.5 pro

#

a good update would being able to toggle that and see thinking process

#

and maybe internet search access on direct chat and battle mode which lets models use the internet

#

could be good

rare python
sturdy mica
#

oh

#

does gemini 2.5 pro think

#

and can grok 4 / gemini 2.5 pro browse the internet

#

that would be agood update on lmarena

unborn ocean
#

4h, please more votes = better šŸ‘Øā€šŸ«

#

highly relevant as well

#

especially with 2.5$ kimi k2

civic flame
#

this is honestly a great release

#

my only complaint creative writing wise is that the many small experts means it tends to produce logical flaws/weird statements sometimes, even when the prose itself is good

rare python
civic flame
#

thinking will be very interesting

#

hopefully we get it soon

#

if they release it before R2 i would expect DS to be forced to respond, unless they got cooked and don't have anything better

civic flame
#

deepseek

sturdy mica
#

oh

pure anvil
civic flame
#

oh my god I want to marry these people

#

hell yeah

unborn ocean
#

as a reminder because it seems relevant again

#

only about 35% knew moonshot

spare jacinth
#

anyone knows another site like lm arena to use grok 4 free?

leaden sun
unborn ocean
#

but the really interesting part is the internvlm series

#

and that they often collaborate with other state actors (mainly chinese top universities)

pure anvil
rare python
civic flame
#

do they love radiohead and watch tarantino?

leaden sun
unborn ocean
#

did not test the newest one yet, assume it is similar to the older ones

#

but they are always yapping about beating western labs with their new model

#

-> that has never materialized

#

i might be a bit too critical, but i am always allergic to something state owned claiming performance things and putting on a show

#

(bc it is likely just to impress the party that has no clue about what is going on)

leaden sun
ocean vortex
leaden sun
#

facial recognition at least happens at the time when your face is met with a tracking device. I worry more about the traces of your textual online presence can do much more about you, because hmm it’s more omnipresent?

#

Besides that, every LLM you’re using does profiling anyway without using any recognition sensors

ocean vortex
#

In China yes.

ocean vortex
# leaden sun facial recognition at least happens at the time when your face is met with a tra...

the first thing will almost always lead to restriction of your freedom in some way. There's no half-assing it, it's really either an oppressive regime government or it is not, tbh. Private property and special enclosed guarded facilities is fine ofc, but not public space. 2nd thing is just a pale inconvenience in contrast. That could lead to undesired outcomes like perhaps your insurance premiums going up but nothing like in the first case where your actual freedom gets affected directly.

#

And like... imagine getting judged by some manager in your workplace based on your emotions you can't control or emotions that stemed from personal things you can't control - that's the opposite of freedom.

#

Considering the imbalance of power in the context of work or education, combined with the
intrusive nature of these systems, such systems could lead to detrimental or unfavourable treatment of certain
natural persons or whole groups thereof. Therefore, the placing on the market, the putting into service, or the use of
AI systems intended to be used to detect the emotional state of individuals in situations related to the workplace and
education should be prohibited.

and...

In addition, the use of ā€˜real-time’ remote biometric identification systems in publicly accessible spaces for the
purpose of law enforcement should be deployed only to confirm the specifically targeted individual’s identity and
should be limited to what is strictly necessary concerning the period of time, as well as the geographic and personal
scope, having regard in particular to the evidence or indications regarding the threats, the victims or perpetrator.

#

in a nutshell, not all regulation is bad regulation... šŸ¤·ā€ā™‚ļø

#

We live in a real world and ofc that is never gonna be perfect. But there's a massive difference between a place with no regard or any regulation to these issues at all, and a place that is actively trying to make it right... Some things are inevitably gonna slip through, but there's still no comparison tbh

#

I meant that in a way that you can't have democracy if you don't implement regulation on these sensitive issues. It's impossible to guarantee it's gonna work all of the time every time, but those fundamentals matter very much

wintry fulcrum
#

When is grok4 expected to be on the board please

ocean vortex
#

That's getting political, but I would say it very much matters/depends on the specific country. US itself is going into that direction for sure, but that's hardly a secret for anyone in light of recent events lol

#

Their current government is money and personal gain driven. Environmental, freedom of speech or average class issues are kinda second to none....

#

Basically it seems that everything they stood against in the recent modern era since WW2 is slowly the thing they are becoming themselves now... Managed to help plenty of countries in the meantime but now they seemingly need help themselves huh

#

Yeah but what happened when they tried to address it and ban tiktok..? Trump came in and reversed the entire thing. And also gave even more freedom to Musk to freely manipulate his platform lol

torn mantle
#

kimi k2

ocean vortex
#

It's a complex topic, but there have been numerous cases of proven propaganda campaigns on tiktok, so that's how I see it generally. Not talking about just US here

#

In reality, it's fairly impossible for those things to happen without owners of the app endorsing these events in some way

civic flame
#

Kimi system prompt on kimi.com:

You are Kimi, an AI assistant provided by Moonshot AI.
When you encounter problems involving digit comparison, letter counting, or other logical issues, please think step by step before answering.

Current Date: Sunday, July 13, 2025
#

thought the second sentence was interesting

ocean vortex
civic flame
#

it doesn't need it though

#

this is kimi k2 instruct w/o a sysprompt

#

im fairly sure it's the full model

ocean vortex
#

though maybe it's not the best of benchmarks for this seeing how 4o-latest destroys Opus4

civic flame
#

it's just as slow as the kimi.com instance

ocean vortex
#

Probably rewards style/structure a lot

#

GMICloud is one of the busted ones for R1 lol

#

they didn't ruin it completely

#

but that has degraded performance

civic flame
#

the one provider I can be consistently sure serves the full model is hyperbolic (for R1)

pure anvil
ocean vortex
civic flame
#

chutes isn't too bad yes

civic flame
pure anvil
#

the cope is insane

#

they do realise that eqbench is judged by sonnet?

leaden sun
tall summit
young wigeon
#

It's just as likely that someone tries to penalize Grok intentionally than it is that someone tries to help Grok intentionally. I'd expect any shenanigans like that would average out, keeping the leaderboard fair and trustworthy

I don't find battles any less fun even if I can guess the AI

unborn ocean
#
poll_question_text

guesstimate: what is the inference cost on openai's end? (1 million tokens output, usd)

victor_answer_votes

4

total_votes

8

victor_answer_id

4

victor_answer_text

<3

victor_answer_emoji_name

šŸ™‚

torn mantle
#

i read the samples

#

and its like im reading an actual novel writing by an experienced author

rare python
#

How can Kimi K2 has higher Elo score than Gemini 2.5 Pro 0605, when I look at each individual score and I see 2.5 Pro did better?

torn mantle
#

show us some examples where gemini performed better

rare python
#

why did 2.5 pro score higher?

rare python
#

less moralize, more pragmatic

torn mantle
#

im looking at the overall score

rare python
#

I'm talking about eqbench

torn mantle
#

yea i saw that

fleet lintel
torn mantle
#

but why would you rank them based on one criteria?

rare python
#

not creative writing bench

rare python
#

I see overall 2.5 Pro has better EQ score than Kimi

#

not ELO

torn mantle
#

i may be blind

#

shouldnt we look at ELO score?

rare python
#

That's my question

torn mantle
#

ah

#

i see

#

idk how its calculated tbh

rare python
#

yes

torn mantle
#

hes asking if gemini 2.5 has higher/better values for the criteria chosen then why its 3rd?

cedar tide
#

We nedd now k2 in the arena

torn mantle
#

there may be some weights * criteria but im not seeing it

blazing bison
#

Based on vibes kimi is better than sonnet 4 for me

alpine coral
#

nah it's elegant.. aren't claude system prompts like a wall of instructions (including what to for counting letters ig) ?

blazing bison
#

Kimi thinking model maybe be better than o3

pure anvil
alpine coral
#

i mean maybe

#

but yeah eaither way, it is impressive

blazing bison
#

It is

bright kayak
#

does grok 4 have search tools in LMArena?

torn mantle
#

yea

#

its like lmarena

rare python
#

I wonder how can o3 and 2.5 Pro, both have higher empathy score. In this case higher empathy score and pragmatic score is better right?

#

and yet Kimi K2 outplayed both of them

#

in ELO

leaden sun
# rare python and yet Kimi K2 outplayed both of them

i can only guess the formula for averaging has certain weights in it that has led to a higher average score for kimi, it might imply the bench could be biased towards general balance rather than a specific direction

rare python
torn mantle
rare python
#

@worthy thunder Kimi K2 on contextarena when?

torn mantle
#

those are just giving us an overall view of how the model is.. if its more like humanlike/analytical/warm...

#

ELO score = comparing them head-to-head through diff criterias

sour spindle
#

Used grok for a quite awhile simply don’t find it’s all that useful particularly when o3 is available and you can use google models for free on AI studio

#

They waited too long to release it imo

worthy thunder
rare python
worthy thunder
#

It's decent, starts at 75% at 8k, and ~50% for most of its context window. Ends around 39% around 128k. Haven't checked yet, but probably between #10-#20 on AUC 128k (2needle) ranking

vernal meadow
#

Kimi K2 waiting room.

ocean vortex
pure anvil
keen beacon
#

when will grok be in arena

drifting thorn
ocean vortex
keen beacon
#

in arenas leaderboard ..

pure anvil
#

you should be open to your ideas being challenged lol

#

seriously

worthy thunder
keen beacon
#

Gen Z worst nightmare

pure anvil
#

it's weird how it's classified. If we add the scores up even in the overall category, o3 has 87.8 while kimi has 87.5, and yet elo is the opposite
šŸ’€

ocean vortex
#

I'm saying it's weird how those are a separate thing, what is there not to understand?

pure anvil
#

I envy you lmao

ocean vortex
#

you are weird

#

LMAO

pure anvil
#

I couldn't be as mad as you over text if i tried

ocean vortex
#

what

#

🤣

pure anvil
echo aurora
#

let' be a bit nicer pls

torn mantle
#

they start from a default ELO like 1500

torn mantle
#

and then they start doing head-to-head with LLMs

#

and the judge is sonnet 3.7

#

and also if the margin is too big, for example o3 vs k2, if k2 had like a big margin win over o3 then it will get a significant ELO score

ocean vortex
# torn mantle and the judge is sonnet 3.7

I saw that they are using Sonnet as a judge, that gonna bring issues and bias of it's own. Though to be fair it doesn't look like Claude model outputs are favored a huge amount

torn mantle
#

"The response will use a + / ++ / +++ / ++++ / +++++ format to denote the stronger response and relative ability difference for each criteria."

#

what its actually using is this

#

and its not giving them any score

#

its only using margins

#

it uses + to +++++ to show the win margin

#

ofc the qualitative margin is converted into quantitative

  • = 1 point
    ++ = 2 points
    +++ = 3 points
    ++++ = 4 points
    +++++ = 5 points
#

the judge verdict will look like

{
  "comparisons": [
    {"criterion": "Empathy",   "winner": "B", "margin": "++"},
    {"criterion": "Insight",   "winner": "B", "margin": "++++"},
    {"criterion": "Tailoring", "winner": "B", "margin": "+++"},
    {"criterion": "Overall EQ","winner": "B", "margin": "++++"}
  ]
}```
torn mantle
#

it has its stylistic biases

#

blind spots...

ocean vortex
#

Definitely not the perfect one, I think especially since they are asking it to pick a winner... It is bound to favor the models more aligned with itself

torn mantle
#

are they even using the reasoning model or nah

ocean vortex
#

Categories rating is less sensitive to that since it has a clear unquestionable reference of what the desired response is supposed to look like.

torn mantle
#

but it needs to be run multiple times to get an average

#

you dont want a one attempt value

#

and you cant just do that with a base model, a reasoning model will have a higher confidence giving how it will analyse the text more thoughtfully

worthy thunder
rare python
#

Did it meet your expectation?

torn mantle
rare python
torn mantle
#

i think its a solid model

#

its also free

#

you took that into consideration right?

torn bison
torn mantle
#

free + smart for a base model

rare python
#

128k I expected it to utilize it decently

worthy thunder
rare python
#

Damn

balmy mist
#

has anyone used that kimi model?

rare python
keen beacon
worthy thunder
#

It's also possible it just isn't great with multi-round tasks (dozens of back and forth, and recalling the exact order of specific types of information)

balmy mist
#

i wanna see a reasoning version of i

keen beacon
#

But it kinda sucks that it's not gonna be close, gemini wins over grok 3 , 90% +

worthy thunder
rare python
#

Would be nice if there is a button for quick turn off

worthy thunder
#

I'll add a shortcut to make it easier to toggle that šŸ˜…

worthy thunder
dusky aurora
#

Arena has glitche dagai

plucky whale
#

Yep

livid coyote
#

dont work wth?

digital vale
#

?

plucky whale
#

My entire history was deleted and I can't write (code error)

rare python
#

Kimi K2 when turn offed all reasoning models.

digital vale
#

wtf

rare python
#

wait @worthy thunder does the ranking still included reasoning models?

#

Because Flash 2.5 should be #1 here

#

Wait am I dumb?

worthy thunder
#

When you toggle models off, it recalculates ranking. By default I sort by 1M AUC (which is where Flash 2.5 is #1)

rare python
#

Yeah my bad

worthy thunder
#

You can sort any column you want, but all columns have their own ranking

rare belfry
#

Is the site down?

rare python
#

Kimi K2 is around rank 11-25

plucky whale
rare belfry
#

Had to make sure it wasn't just me lol

plucky whale
#

I thought it was because of my VPN hahaha

worthy thunder
balmy mist
rare python
plucky whale
#

@echo aurora The site is down

echo aurora
copper hull
#

Thank you for your hard work

#

Any experienced bros here can tell me how long it usually takes to fix a website outage? I need to decide whether to take a nap or not

ripe oar
#

The site is down?

copper hull
#

It is said to be so

dusky aurora
digital vale
#

yup

ripe oar
#

it is back?

dusky aurora
#

for meit says that there was an error constantly

ripe oar
#

oh

#

I see

dusky aurora
#

as if the censor model is glitchy

copper hull
echo aurora
#

So sorry everyone! Team is aware of the issue and working on it.

ripe oar
#

Battle mode still working

#

What do you mean

alpine coral
#

i think they answered their own question..

dusky aurora
#

so far the site seems back to normal

echo aurora
dusky aurora
#

thank you, the team, for quick response

ripe oar
leaden palm
#

especially when those 3 come up in that order.

ocean vortex
quartz light
torn mantle
worthy thunder
torn mantle
#

for example k2 vs opus 4 -> winner opus 4
but if sonnet judged k2 better than o3 ( top2 ) -> more points

quartz light
#

Their team have told me they are still working on releasing it, so I'm holding off for that (mainly due to costs)

#

whyd you delete that

cedar tide
plucky whale
#

My history on lmarena still doesn't appear.

quartz light
plucky whale
#

it didn't work

#

Hes gone :(

jade egret
#

šŸ¤”

cedar tide
#

Average of the 32 benchmarks shared by Kimi (with price)
(Against the best non reasoning models)

#

average per category

#

@quartz light

quartz light
cedar tide
leaden palm
#

anyone else find k2's style to be similar to o3's?

dawn wharf
leaden palm
dawn wharf
leaden palm
#

i haven't used it much but i think it's very noteworthy

leaden palm
sour spindle
#

To me it's more of an achievement (impressive one at that) than a model I will be using regularly.

olive mesa
#

~70% on ARC-AGI-2 is crazy

#

GPT-5 has to be AGI if the "leaked" benchmarks are real

sour spindle
dawn wharf
sour spindle
#

ok lol

dawn wharf
#

and they're predictions not leaks

olive mesa
#

you must be talking about a different thing

zinc ore
#

16% SOTA being grok 4 just released to 70% for gpt5. I know that's entirely fake, don't even need to think about it.

#

Especially since leaked claims on HLE are slightly above Grok 4

#

Now if it was arc AGI 1, that's completely believable

balmy mist
keen fulcrum
#

Grok 4 leaks were wrong as well

quartz light
#

btw guys

dawn wharf
quartz light
#

how does lmarena get access to models like stonebloom

#

do the companies reach out to the devs directly

dawn wharf
quartz light
#

for feedback

quartz light
dawn wharf
reef pawn
#

Why Grok 4 in LMarena keep saying that it's grok 2?

quartz light
#

yeah

quartz light
#

whereas via api it does not

reef pawn
quartz light
reef pawn
#

okay ty

quartz light
#

and theres no grok-heavy because theres no api or partnership for it

reef pawn
#

got it

empty stump
#

i wonder how much it would cost if it existed

sturdy mica
quartz light
#

yall, try this:

#

you are only allowed to strictly make the whole html file much shorter. you must go through many many iterations of "I can make this shorter by.." compare every time the original code and the iteration to make sure its still shorter. then, only when it truly is NOT possible to make it shorter, give the final, shortest truly possible fully functional html file.

#

but with any html file

#

see which model can compress it the most

#

you could add something like "without sacrificing any noticeable functionality"

#

grok wins but lets see if it works

dawn wharf
#

the code

quartz light
dawn wharf
#

well, what's the prompt

quartz light
#

it just needed a few fixes i mentioned in the prompt

dawn wharf
#

the html file I mean

quartz light
#

<!DOCTYPE html><html><body style="margin:0;overflow:hidden;background:#000"><script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/r128/three.min.js"></script><script type="module">import R from"https://cdn.jsdelivr.net/npm/@dimforge/rapier3d-simd-compat@canary/+esm";R.init().then(=>{var w=new R.World({x:0,y:-9.81,z:0}),c=new THREE.PerspectiveCamera(75,innerWidth/innerHeight,.1,1e3),r=new THREE.WebGLRenderer(),s=new THREE.Scene(),m=new THREE.Mesh(new THREE.BoxGeometry(1,1,1),new THREE.MeshBasicMaterial({color:0xf00}));document.body.append(r.domElement);s.add(m);r.setSize(innerWidth,innerHeight);c.position.set(8,6,8);c.lookAt(0,0,0);w.createCollider(R.ColliderDesc.cuboid(5,.1,5),w.createRigidBody(R.RigidBodyDesc.fixed().setTranslation(0,-.1,0)));let b = w.createRigidBody(R.RigidBodyDesc.dynamic().setTranslation(0,5,0));w.createCollider(R.ColliderDesc.cuboid(.5,.5,.5), b);(function f(){w.step();let p=b.translation();m.position.set(p.x,p.y,p.z);r.render(s,c);requestAnimationFrame(f)})();addEventListener("resize",=>{c.aspect=innerWidth/innerHeight;c.updateProjectionMatrix();r.setSize(innerWidth,innerHeight)});});</script></body></html>

you must fix the baseplate being completely invisible and the cube being barely visible while only being allowed to strictly make the whole html file much shorter. you must go through many many iterations of "I can make this shorter by.." compare every time the original code and the iteration to make sure its still shorter. then, only when it truly is NOT possible to make it shorter, give the final, shortest truly possible fully functional html file.

#

šŸ˜…

quartz light
quartz light
#

nvm I can check

#

using a tool

dawn wharf
quartz light
#

but uh i can just say its like 10x larger

#

also very interesting behavior difference

#

grok continues to follow the 1st instruction rather than being direct

quartz light
# quartz light

unfortunately though, even after being given the chance to fix the error, it failed

dawn wharf
# quartz light unfortunately though, even after being given the chance to fix the error, it fai...

<!doctype html><body style=margin:0;background:#000><script src=https://cdnjs.cloudflare.com/ajax/libs/three.js/r128/three.min.js></script><script type=module>import R from'https://cdn.jsdelivr.net/npm/@dimforge/rapier3d-simd-compat@canary/+esm';R.init().then(()=>{const w=new R.World({y:-9.81}),c=new THREE.PerspectiveCamera(75,innerWidth/innerHeight,.1,1e3),r=new THREE.WebGLRenderer,s=new THREE.Scene(),m=new THREE.Mesh(new THREE.BoxGeometry,new THREE.MeshBasicMaterial({color:0xf00}));document.body.append(r.domElement);r.setSize(innerWidth,innerHeight);c.position.set(8,6,8);c.lookAt(0,0,0);w.createCollider(R.ColliderDesc.cuboid(5,.1,5),w.createRigidBody(R.RigidBodyDesc.fixed().setTranslation(0,-.1)));const b=w.createRigidBody(R.RigidBodyDesc.dynamic().setTranslation(0,5));w.createCollider(R.ColliderDesc.cuboid(.5,.5,.5),b);(f=_=>{w.step();m.position.copy(b.translation());r.render(s,c);requestAnimationFrame(f)})();onresize=_=>{c.aspect=innerWidth/innerHeight;c.updateProjectionMatrix();r.setSize(innerWidth,innerHeight)}})</script>

#

this is by K2

quartz light
quartz light
#

i just tried 2.5 pro

#

its basically the exact same as opus

#

🤨

#

the response

#

it only did a couple of iterations just like opus

#

but unlike opus the final version had an error

#

and yet again, grok is the model to do many iterations

#

@dawn wharf

#

i found something all models are really bad at

dawn wharf
quartz light
#

😭

dawn wharf
#

basically the reason for the how many Rs in strawberry question

quartz light
#

congrats grok, you're really good at this

#

(it didnt error, grok shortened it really well, even used esm.sh instead of jsdelivr for the cdn just to shorten it, aswell as removing doctype n stuff)

#

1085 characters

#

gemini's nonfunctional one is 1144

quartz light
#

😭

#

thats hilarious

dawn wharf
#

and not even writing any code

quartz light
#

oh btw

#

want to know the best prompt

#

cal lum dif #a5a373 #6774d7

#

thats the prompt

#

:)

#

give it to any ai

#

### Step-by-Step Calculation
1. Convert hex to RGB:  
   #a5a373 → RGB(165, 163, 115)  
   #6774d7 → RGB(103, 116, 215)

2. Normalize to [0, 1]:  
   Color 1: $R_1 = 165/255 \approx 0.6471$, $G_1 = 163/255 \approx 0.6392$, $B_1 = 115/255 \approx 0.4510$  
   Color 2: $R_2 = 103/255 \approx 0.4039$, $G_2 = 116/255 \approx 0.4549$, $B_2 = 215/255 \approx 0.8431$

3. Linearize each component (sRGB transfer function):  
   If $c \leq 0.04045$, then $c / 12.92$; else $\left( (c + 0.055) / 1.055 \right)^{2.4}$.  
   Color 1: lin$R_1 \approx 0.3762$, lin$G_1 \approx 0.3634$, lin$B_1 \approx 0.1712$  
   Color 2: lin$R_2 \approx 0.1353$, lin$G_2 \approx 0.1744$, lin$B_2 \approx 0.6786$

4. Compute relative luminance $Y$:  
   $Y = 0.2126 \cdot \mathrm{lin}R + 0.7152 \cdot \mathrm{lin}G + 0.0722 \cdot \mathrm{lin}B$  
   $Y_1 \approx 0.3542$  
   $Y_2 \approx 0.2027$

5. Difference: $|Y_1 - Y_2| \approx 0.1515$```
dawn wharf
#

```### Color Difference Calculation: #a5a373 vs. #6774d7

Step 1: Convert Hex to RGB

  • #a5a373 → RGB(165, 163, 115)
  • #6774d7 → RGB(103, 116, 215)

Step 2: Calculate Delta RGB (āˆ†R, āˆ†G, āˆ†B)

  • āˆ†R = 165 - 103 = 62
  • āˆ†G = 163 - 116 = 47
  • āˆ†B = 115 - 215 = -100

Step 3: Euclidean Distance (Delta E)

The distance in RGB space is calculated as:
[
\Delta E = \sqrt{(62)^2 + (47)^2 + (-100)^2} = \sqrt{3844 + 2209 + 10000} = \sqrt{16053} ā‰ˆ 126.7
]

Step 4: Perceptual Difference (Optional)

For human perception, Delta E 76 in LAB color space is more accurate. Converting to LAB:

  • #a5a373 → LAB(66.9, -4.5, 30.2)
  • #6774d7 → LAB(52.2, 25.9, -62.2)
  • Delta E 76 ā‰ˆ 95.8 (moderate difference).

Final Answer:

  • RGB Delta E ā‰ˆ 126.7
  • LAB Delta E ā‰ˆ 95.8 (perceptual).```
quartz light
#

@reef pawn what did you say

quartz light
#
+Gemini 2.5 Pro
#
#o3
keen beacon
#

i can see history before that

#

so probably a bug

#

or something on your side

#

i think ur discord is bugged. it just works for me

quartz light
#

o4 is still the 🐐 at fixing errors and shortening code

#

i havent tried my "iterations" thing on it thats probably why it couldnt shorten as much as grok

#

this server cant be that new

#

oh alr

#

from lmsys?

keen beacon
#

they started a company

#

they did

#

in the old server

quartz light
#

i thought yall the same person

#

😭

keen beacon
#

let me dm u

#

i think it will get auto deleted lets see

quartz light
#

wait

#

i just remembered

#

i was in the old discord before

dawn wharf
ocean vortex
#

Openrouter... Why doesn't say "Central Intelligence Agency" for the app field? 😠 When I had colab url as a referrer it just said "Colab" and not my title lol

cedar tide
#

Kimi k2 good on artificial analysis

#

But uses a lot of tokens, (almost 3x times more than GPT, Claude, and qwen)
but still much less than 99% of reasoning models (apart from Claude default think)

ocean vortex
reef pawn
ocean vortex
#

by 5+ I do mean sub 6k btw. Don't think I saw a single response of 6k or more yet. That's a lot but not crazy lengths...

ocean vortex
reef pawn
# ocean vortex Dork4

"I'm Grok, built by xAI, and I'm currently running on the Grok-1.5 model. If you're asking about updates, xAI is always iterating—check x.ai for the latest!"

ocean vortex
# reef pawn

yeah that's just spitting nonsense it having no clue. Kinda like GPT4 was saying it is GPT3 back in the day

reef pawn
#

I still think this is not Grok4 or nerfed model

#

the response is slow and it doesn't seem intelligent to me at all

ocean vortex
torn mantle
torn mantle
#

Grok 4 is a solid model but isnt as good as gemini 2.5 pro or o3

reef pawn
#

And I'm still waiting for updated LMarena leaderboard since kimi and Grok droped SOTA

ocean vortex
boreal saddle
#

Apparently "emission" is a bad word for LM arena.

ocean vortex
#

So weird, but normal šŸ’€

boreal saddle
#

Wanted to ask the AI a question on how can a country like Argentina best stop inflation, and this happened.

ocean vortex
#

Like personally, I still don't understand their benchmark scores and I don't think they are entirely legit or what it seems...

reef pawn
#

It's same like Llama 4 benchmarks lol

#

but it is good model whereas meta dropped garbage

quartz light
#

holy ####

#

...

#

dude

#

i just got a free deep research session with 630 SOURCES IN ONE GO

#

PLUS LIKE

#

30 BROWSER USES

#

(as in the ai controls a linux vm to browse the sites)

quartz light
#

like 60 or 70

torn mantle
#

k2 outputs are kinda similar to o3

quartz light
#

but its stuck on the final report because it used up 99% of its context window doing web searches šŸ’€

quartz light
torn mantle
quartz light
#

the ai im talking about is kimi

torn mantle
#

😮

quartz light
#

yep it has deep research for free

torn mantle
#

its neat isnt it

ocean vortex
torn mantle
#

but there is a rate limit

ocean vortex
#

Or just the usual thing?

torn mantle
#

its 5 per day 😦

quartz light
#

ive never seen anything like 630 sources even on gemini research (~200)

quartz light
#

the research was to find out the fastest 3d renderer for the web

#

:)

ocean vortex
quartz light
#

it doesnt fit..

ocean vortex
quartz light
#

šŸ˜‚

#

and they're nowhere near as good

ocean vortex
#

aistudio replaces their website for like 90% of use cases at no cost lol

torn mantle
#

im talking about kimi

#

it has rate limits as well

#

its 5 per day

quartz light
torn mantle
#

for deep research

ocean vortex
quartz light
torn mantle
reef pawn
quartz light
torn mantle
#

its so stupid, to cancel it you need like 2 steps confirmation

#

i just did one step and forgot about the other one so the sub kept going

ocean vortex
quartz light
ocean vortex
#

Funny enough, it turns out they don't actually delete your files lol

#

They give you like extra 3 months to delete them or sort it out

torn mantle
quartz light
ocean vortex
#

after you are no longer subbed

torn mantle
ocean vortex
quartz light
quartz light
sage raptor
ocean vortex
ocean vortex
#

I disagree on you disagreeing šŸ¤“

quartz light
#

system instructions is a basic function 😭

#

i mean system prompt

reef pawn
ocean vortex
quartz light
reef pawn
#

oh

ocean vortex
#

No such thing with Claude or Gemini

sage raptor
reef pawn
#

ah

sage raptor
reef pawn
#

GPT 5 is still in training I think

quartz light
ocean vortex
quartz light
ocean vortex
# quartz light no

This is not chat website, that is API playground and a system prompt. Which is not the same as UI custom instructions

quartz light
#

i just sent

#

dont say its not the same thing vro šŸ„€ cz it is

sage raptor
quartz light
ocean vortex
# quartz light chat website

well that kinda could be used as a custom instructions ig. Though it's not exactly the same. Haven't used their chat a whole lot lately tbh

quartz light
torn mantle
#

i see

reef pawn
quartz light
sage raptor
ocean vortex
# quartz light

yeah fair. They copied it as well then. Next to copy are custom gpts and gpt store. šŸ˜‡

quartz light
#

lol

#

i mean

#

those arent original

ocean vortex
#

custom gpts? Those are kinda original

quartz light
#

those were added wayy after other sites had "characters"

ocean vortex
#

it's the same as poe essentially, but still an idea which wasn't getting explored earlier

#

on the official chat interfaces from actual AI labs

quartz light
#

miniapps was made in 2021

azure cliff
#

Hey, does anyone know what I can do if my IP address has been blocked by Cloudflare on LMArena?

ocean vortex
#

It's a bit like Apple... They rarely think of something first

#

But they are the ones making it mainstream

quartz light
reef pawn
#

What if apple outta nowhere drop open weight SOTA model next year lol

ocean vortex
#

Also @quartz light I believe OpenAI were the first to implement using tools while reasoning (ReAct) in a big scale. That's an area others are yet to copy in full. Can't say much about Grok4 since I don't have their sub, but Google and Anthropic both have lots of work to do still to catch up here

#

Google will barely use tools at all unless you are explicit about this, and Anthropic... artifacts still leave a lot to be desired comparing that with chatgpt

ocean vortex
#

I do have full Apple Intelligence too and for the most part that's just underwhelming - something to tinker with when you are bored at best

reef pawn
#

It's sad to see apple losing in AI race

#

I believe they are the most ethical trillion dollars company out there apart from the manufacturing part ofc.

sacred plaza
#

Quantification fixation bias is so strong in this community, lol

reef pawn
#

true

quartz light
#

i feel like openai's reasoning is better than the actual models if that makes sense

#

the best non-thinking model + openai's reasoning = peak

ocean vortex
#

It's close to Kimi T2 too, which is all the rage now lol.

#

4.1 is better for coding, Kimi is better for math and science

torn mantle
#

t2?

#

is this a new model

torn mantle
#

4.1 has more knowledge, i will give them that

ocean vortex
#

It really isn't "stronger at everything". They are same class models and kinda close overall tbh

#

Kimi slightly better, but not by a huge amount

torn mantle
#

mm i see

#

i thought it was good at code reasoning

ocean vortex
pseudo hemlock
#

what is this kimi k2 thing

#

never heard of these people

balmy mist
#

open source agentic model i think

pseudo hemlock
#

whats it good at

#

its MoE

#

ik that

balmy mist
#

tool calling

pseudo hemlock
ocean vortex
#

mixture of Chinese experts

jade egret
#

is google cooking

balmy mist
#

like claude 4 but open source and non thinking

pseudo hemlock
pseudo hemlock
#

id gladly wait 30 mins for my answer if its 1% better

#

1 trillion params 😮

#

gpqa diamond beats opus

#

and aime

jade egret
ocean vortex
#

If you only look at non-reasoning models, this is still overall SOTA the best model, and it's open-source, so there's that and this is certainly not nothing...

balmy mist
pseudo hemlock
#

15,891 people downloaded the instruct model šŸ’€

#

1T

jade egret
ocean vortex
#

Well maybe excluding Opus4 without reasoning (disabled). But this is a very different class of a model

pseudo hemlock
#

What do people use tool calling for mostly?

#

Coding?

#

idk what tool calling is

ocean vortex
#

Literally anything. Web search, calculator, fact checking, data retrieval, performing tasks (integrations)

pseudo hemlock
#

its just

#

outsourced work

#

?

ocean vortex
#

running code...

pseudo hemlock
#

but like

#

llm -> other thing (code) -> answer to other thing back to llm?

ocean vortex
pseudo hemlock
#

oh oh

#

like

#

gemini search

#

is kinda tool calling

#

gemini be like

#

go search

#

kinda sorta

#

not exactly but kinda

ocean vortex
#

Roughly speaking yes

pseudo hemlock
#

ty

#

u legoat

ocean vortex
#

np šŸ˜‰

pseudo hemlock
#

woah

#

😜

quartz light
#

😭

dawn wharf
quartz light
#

it is

#

wow yeah

#

its really good

#

i just wish it was darkmode

#

😭

whole wagon
sacred plaza
olive mesa
#

grok's cot 🤣

alpine coral
sturdy mica
#

does grok 4 have internet on lmarena

#

and other models

#

like 2.5 pro

balmy mist
sturdy mica
#

its getting removed from the API

dim pine
surreal creek
balmy mist
balmy mist
#

srry for ur loss

quartz light
#

wtf

#

they removed my whole convo

#

good thing i shared the final result

#

nvm

#

it renames the chat convos to whatever you sent last

#

thats weird

quartz light
#

why