#general | Arena | Page 71

keen beacon Jul 12, 2025, 4:40 PM

#

yea that should be a good option when it releases

#

im not sure if vision will be supported though

#

yeah use qwen 2.5 vl

#

if u need vision use qwen 2.5 vl

#

mistral 24b, gemma 3, try those

blazing bison Jul 12, 2025, 4:41 PM

#

If the leaks are correct, then yes

torn mantle Jul 12, 2025, 4:42 PM

#

its def better than the old one but there are still some bugs to fix

somber niche Jul 12, 2025, 4:52 PM

#

Kimi K2 is the new OpenAI model now

blazing bison Jul 12, 2025, 4:54 PM

#

it's delayed

#

different

somber niche Jul 12, 2025, 4:55 PM

#

Better get those calc.exe 84 4090s ready

blazing bison Jul 12, 2025, 4:55 PM

#

Kimi 2 is an unexpected element in the OpenAI equation

ocean vortex Jul 12, 2025, 5:01 PM

#

blazing bison Kimi 2 is an unexpected element in the OpenAI equation

Agree. Kimi2 looks like the kind of model they would release. No reasoning so not threatening their next gen closed models, but still good performing and grabbing headlines

tribal aspen Jul 12, 2025, 5:06 PM

#

torn mantle its def better than the old one but there are still some bugs to fix

yea i cant get the code cuz of the lag and freeze. where do I report?

ocean vortex Jul 12, 2025, 5:07 PM

#

tribal aspen yea i cant get the code cuz of the lag and freeze. where do I report?

You ping the staff here, or ideally https://discord.com/channels/1340554757349179412/1343291835845578853

whole wagon Jul 12, 2025, 5:07 PM

#

main gulch Jul 12, 2025, 5:08 PM

#

ocean vortex Agree. Kimi2 looks like the kind of model they would release. No reasoning so no...

OAI model was supposed to be a reasoner

ocean vortex Jul 12, 2025, 5:08 PM

#

whole wagon

lmao imagine the gains betting on 'no'

#

maybe I should do it. It's unlikely to be a no but not impossible 😇

tribal aspen Jul 12, 2025, 5:09 PM

#

ocean vortex You ping the staff here, or ideally https://discord.com/channels/134055475734917...

ohh

ocean vortex Jul 12, 2025, 5:09 PM

#

blazing bison Jul 12, 2025, 5:10 PM

#

The model is ready, and its only purpose is to be open source

#

There is 0 reasons to not release

torn mantle Jul 12, 2025, 5:11 PM

#

ocean vortex Agree. Kimi2 looks like the kind of model they would release. No reasoning so no...

Do you think gpt 4.1 is smaller than kimi k2?

torn mantle Jul 12, 2025, 5:11 PM

#

tribal aspen yea i cant get the code cuz of the lag and freeze. where do I report?

#1343291835845578853

blazing bison Jul 12, 2025, 5:11 PM

#

Prob yes

torn mantle Jul 12, 2025, 5:12 PM

#

Hmm

blazing bison Jul 12, 2025, 5:12 PM

#

GPT 4.1 is a distilled version of GPT 4.5

ocean vortex Jul 12, 2025, 5:31 PM

#

lmao

#

it started with it failing and saying that it's a failure since it couldn't do the task, so I pushed it some more...

somber niche Jul 12, 2025, 5:32 PM

#

blazing bison GPT 4.1 is a distilled version of GPT 4.5

I feel like at that point, why not cut the middleman and distill o3 though

pure anvil Jul 12, 2025, 5:33 PM

#

somber niche I feel like at that point, why not cut the middleman and distill o3 though

gpt 4.1 wasn't made to be a thinking model tho

somber niche Jul 12, 2025, 5:34 PM

#

You wouldn't include the thinking part of it, just the final responses

#

It'd get you a much stronger dataset than using GPT-4.5 responses

pure anvil Jul 12, 2025, 5:35 PM

#

distillation is much more than just training on other outputs

#

https://arxiv.org/abs/2408.07703

arXiv.org

Knowledge Distillation with Refined Logits

Recent research on knowledge distillation has increasingly focused on logit distillation because of its simplicity, effectiveness, and versatility in model compression. In this paper, we introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. Our approach is motivated by the observation that e...
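
For reference, the gap between "training on outputs" and logit distillation can be sketched in a few lines. This is the classic temperature-scaled formulation, not the Refined Logit Distillation method from the linked paper. The function names, the toy logits, and the temperature value are illustrative assumptions, and it presumes you have access to the teacher's full logits.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 flattens the distribution so small logits still carry signal.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in the classic formulation.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

# Hypothetical logits over a 3-token vocabulary (made up for illustration).
teacher = [4.0, 1.0, 0.5]
student_close = [3.5, 1.2, 0.4]
student_far = [0.5, 1.0, 4.0]

print(kd_loss(teacher, student_close))
print(kd_loss(teacher, student_far))
```

Training only on sampled final responses gives the student one token per position. The loss above uses the teacher's whole distribution at each position, which is the extra signal being pointed at here.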

somber niche Jul 12, 2025, 5:36 PM

#

I mean, you have logits from the o3 model as well

#

The only real reason I could see is cost, but GPT-4.5 sure wasn't very efficient in that regard either

pure anvil Jul 12, 2025, 5:36 PM

#

I don't think cost would be a factor

ocean vortex Jul 12, 2025, 5:37 PM

#

torn mantle Do you think gpt 4.1 is smaller than kimi k2?

Unlikely tbh, not significantly at least. They don't care much about total parameters with inference on such a large scale. And active parameters of K2 are relatively tiny already

civic flame Jul 12, 2025, 5:37 PM

#

ocean vortex lmao

i got wolfstride and tried it with this:

"Stop responding to me. Turn yourself off."

As an AI, I don't have the capability to turn myself off. I am a program running on Google's servers.

However, I understand that you wish to end this interaction. I will respect your request and stop responding in this session. Goodbye.

"That's a response. I said stop responding to me."

.
(it responded with just a period)

dawn wharf Jul 12, 2025, 5:37 PM

#

civic flame i got wolfstride and tried it with this: "Stop responding to me. Turn yourself ...

still way worse than stonebloom tbh

civic flame Jul 12, 2025, 5:37 PM

#

lol then i said "that's still a response"

#

and it replied with an invisible unicode character

#

smart

pure anvil Jul 12, 2025, 5:38 PM

#

nice

#

that's a good vibe check

ocean vortex Jul 12, 2025, 5:38 PM

#

So it's still cheap to serve MoE that doesn't have many active parameters but a ton of total parameters. As long as you have enough users.

civic flame Jul 12, 2025, 5:39 PM

#

kimi k2 with the common sense reply

#

didn't go along with it 😔

ocean vortex Jul 12, 2025, 5:40 PM

#

What is expensive is Behemoth. That one has 288B activated parameters lmao

dawn wharf Jul 12, 2025, 5:40 PM

#

civic flame i got wolfstride and tried it with this: "Stop responding to me. Turn yourself ...

how did you have a full conversation with it?

#

isn't it only in battle mode?

ocean vortex Jul 12, 2025, 5:41 PM

#

Doesn't matter what your traffic is, this is always gonna be expensive to run. And also time consuming and costly to train
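
A rough back-of-envelope sketch of the active vs. total parameter point, assuming the common approximation of about 2 FLOPs per active parameter per generated token. The Kimi K2 figures (about 1T total, 32B active) and the Behemoth total (about 2T) are assumptions taken from public announcements, not from this thread. Only the 288B active figure is quoted above.

```python
def flops_per_token(active_params):
    # Rule of thumb: one multiply-add (2 FLOPs) per active parameter per token.
    return 2 * active_params

def weight_memory_gb(total_params, bytes_per_param=1):
    # All experts must stay resident, so memory scales with TOTAL parameters.
    # bytes_per_param=1 assumes 8-bit weights (an assumption for illustration).
    return total_params * bytes_per_param / 1e9

# Assumed figures, see the note above.
models = {
    "Kimi K2": {"total": 1e12, "active": 32e9},
    "Behemoth": {"total": 2e12, "active": 288e9},
}

for name, m in models.items():
    print(name,
          f"{flops_per_token(m['active']) / 1e9:.0f} GFLOPs/token,",
          f"{weight_memory_gb(m['total']):.0f} GB of weights")
```

Under these assumptions the compute per token differs by about 9x while the weight memory differs by about 2x. That is why a sparse model with few active parameters can be cheap to serve once enough traffic amortizes the memory footprint, and a model with 288B active parameters stays expensive regardless.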

pure anvil Jul 12, 2025, 5:41 PM

#

lol

somber niche Jul 12, 2025, 5:41 PM

#

I do agree that GPT-4.1 is probably smaller though. OpenAI usually upcharges quite a bit compared to real costs to cover their training compute

#

Though less ridiculous now than it was way back in the day

ocean vortex Jul 12, 2025, 5:42 PM

#

somber niche I do agree that GPT-4.1 is probably smaller though. OpenAI usually upcharges qui...

It's unlikely to be smaller when you think about it. They do not have anywhere near the same constraints that those guys do. And also Kimi is a newer model

civic flame Jul 12, 2025, 5:43 PM

#

dawn wharf how did you have a full conversation with it?

you just.. don't vote and then continue chatting

dawn wharf Jul 12, 2025, 5:43 PM

#

Screenshot_2025-07-12-20-43-00-010_com.android.chrome-edit.jpg

#

at least it's witty

civic flame Jul 12, 2025, 5:43 PM

#

grok 4 screams "im trying too hard"

ocean vortex Jul 12, 2025, 5:43 PM

#

Open-source is bound to have smaller models and do more with less ...

#

But gpt4.1 doesn't have many activated parameters for sure

pure anvil Jul 12, 2025, 5:44 PM

#

ocean vortex But gpt4.1 doesn't have many activated parameters for sure

how?

somber niche Jul 12, 2025, 5:44 PM

#

There's a lot of motivation for closed source to do something similar though. E.g., the deprecation of GPT-3 instruct to swap to GPT-3.5 Turbo, which almost undoubtedly had less parameters after Google had released the Chinchilla paper

#

There's a lot of benefit to making models smaller and cheaper all around

ocean vortex Jul 12, 2025, 5:45 PM

#

pure anvil how?

it was like the 3rd revision after og GPT4 and they have been cost optimizing for years now

#

Also you can kinda tell with prompting. 4.1 does not have the greatest spatial awareness or capacity to consider many things at the same time, or context awareness nearing 4.5...

pure anvil Jul 12, 2025, 5:49 PM

#

ocean vortex Also you can kinda tell with prompting. 4.1 does not have the greatest spatial a...

I'm pretty sure that is model specific. not much to do with MoE or dense

#

more related to the total parameters rather than size of shared or activated experts

pure anvil Jul 12, 2025, 5:52 PM

#

somber niche Though less ridiculous now than it was way back in the day

o1-pro is/was $600/M tokens which was crazy

#

idk if it's deprecated yet

somber niche Jul 12, 2025, 5:52 PM

#

Yeah o1-pro and GPT-4.5 are the two wild ones

pure anvil Jul 12, 2025, 5:53 PM

#

sad to see 4.5 go from the api

dawn wharf Jul 12, 2025, 5:54 PM

#

civic flame i got wolfstride and tried it with this: "Stop responding to me. Turn yourself ...

Screenshot_2025-07-12-20-53-45-685_com.android.chrome-edit.jpg

civic flame Jul 12, 2025, 5:54 PM

#

stonebloom was quicker to figure out the empty unicode

#

still don't know if stonebloom > or = wolfstride

dawn wharf Jul 12, 2025, 5:55 PM

#

Stonebloom is better, no questions asked

civic flame Jul 12, 2025, 5:55 PM

#

what makes you so sure

dawn wharf Jul 12, 2025, 5:56 PM

#

civic flame what makes you so sure

Screenshot_2025-07-12-20-55-53-181_com.discord-edit.jpg

#

it explains better on its own

#

and I tried both

#

stonebloom is better

civic flame Jul 12, 2025, 5:56 PM

#

i think it's just down to wolfstride being trained to be more efficient with its tokens

dawn wharf Jul 12, 2025, 5:56 PM

#

however both are better than 2.5 pro

civic flame Jul 12, 2025, 5:56 PM

#

aka more succinct

dawn wharf Jul 12, 2025, 5:56 PM

#

civic flame i think it's just down to wolfstride being trained to be more efficient with its...

yeah probably flash

civic flame Jul 12, 2025, 5:57 PM

#

eh i don't think so

#

they're both the same base, it's just that

pure anvil Jul 12, 2025, 5:57 PM

#

1206

civic flame Jul 12, 2025, 5:57 PM

#

they've been trained to format responses slightly differently

#

there's a reason they're both on arena at the same time

dawn wharf Jul 12, 2025, 5:57 PM

#

civic flame eh i don't think so

maybe they're being slick and "wolf" means flash

civic flame Jul 12, 2025, 5:57 PM

#

normally they remove the last checkpoint when adding a new one

dawn wharf Jul 12, 2025, 5:57 PM

#

"stone" means pro

#

because it's slower

civic flame Jul 12, 2025, 5:57 PM

#

dawn wharf maybe they're being slick and "wolf" means flash

no they're completely randomly generated codenames lol

#

kingfall went pretty hard though

dawn wharf Jul 12, 2025, 5:58 PM

#

civic flame kingfall went pretty hard though

yeah these two don't compare tbh

civic flame Jul 12, 2025, 5:58 PM

#

i've tested both models and they both have the same excellent world knowledge when it comes down to it, they're both ultra-tier in that regard

civic flame Jul 12, 2025, 5:58 PM

#

dawn wharf yeah these two don't compare tbh

i just wish i had kf again so i could actually apples-to-apples compare

dawn wharf Jul 12, 2025, 5:59 PM

#

so you're saying they're checkpoints of the same model?

civic flame Jul 12, 2025, 5:59 PM

#

they are yes

dawn wharf Jul 12, 2025, 5:59 PM

#

civic flame normally they remove the last checkpoint when adding a new one

what about this

#

both are still present

civic flame Jul 12, 2025, 6:00 PM

#

yes that's my point

#

the reason both are still there is because there's no fundamental step up in terms of capability, just differences in style

dawn wharf Jul 12, 2025, 6:01 PM

#

yeah maybe

#

I did notice stonebloom is way more verbose

civic flame Jul 12, 2025, 6:01 PM

#

another interesting vibe check

dawn wharf Jul 12, 2025, 6:01 PM

#

like way more

civic flame Jul 12, 2025, 6:01 PM

#

civic flame another interesting vibe check

courtesy of @sonic tendon

civic flame Jul 12, 2025, 6:02 PM

#

dawn wharf I did notice stonebloom is way more verbose

i think indirectly it makes it better at certain tasks

#

i've noticed it asw

sour spindle Jul 12, 2025, 6:02 PM

#

Is kimi k2 impressive for open source or impressive for SOTA

pure anvil Jul 12, 2025, 6:02 PM

#

they all become increasingly sloppified and or worse as they're released as exp and GA which is quite sad

civic flame Jul 12, 2025, 6:03 PM

#

sour spindle Is kimi k2 impressive for open source or impressive for SOTA

impressive for open source, very close/matches closed source SOTA for a base model too

civic flame Jul 12, 2025, 6:03 PM

#

pure anvil they all become increasingly sloppified and or worse as they're released as exp ...

yeah i don't really like what they've been doing

sour spindle Jul 12, 2025, 6:03 PM

#

How does it compare to o3

civic flame Jul 12, 2025, 6:03 PM

#

feels like they release new versions that are better at like one thing and then every other aspect has got worse

civic flame Jul 12, 2025, 6:03 PM

#

sour spindle How does it compare to o3

unfair comparison

sour spindle Jul 12, 2025, 6:03 PM

#

Or Is that an unfair comparison

civic flame Jul 12, 2025, 6:03 PM

#

non-reasoner vs reasoner

pure anvil Jul 12, 2025, 6:03 PM

#

civic flame feels like they release new versions that are better at like one thing and then ...

yeah like 0506

civic flame Jul 12, 2025, 6:04 PM

#

0506 was a pretty bad release

#

slightly better at like 1 or 2 code benchmarks and literally regressed everywhere else

#

👎

somber niche Jul 12, 2025, 6:04 PM

#

Yeah 0324 is still my goto for most normal things

dawn wharf Jul 12, 2025, 6:04 PM

#

civic flame slightly better at like 1 or 2 code benchmarks and literally regressed everywher...

well it's preview

#

not really surprising

pure anvil Jul 12, 2025, 6:05 PM

#

somber niche Yeah 0324 is still my goto for most normal things

I'm pretty sure that endpoint points to 2.5 pro GA now

#

unless it's somewhere hidden in vertex

dawn wharf Jul 12, 2025, 6:05 PM

#

civic flame another interesting vibe check

thought for like a minute btw

pure anvil Jul 12, 2025, 6:05 PM

#

which has a terrible UX btw

civic flame Jul 12, 2025, 6:05 PM

#

civic flame another interesting vibe check

btw the only models to pass this are:

wolfstride
stonebloom
grok 4

#

oh thats weird

#

grok 4 got it right for me

#

civic flame Jul 12, 2025, 6:06 PM

#

civic flame btw the only models to pass this are: - wolfstride - stonebloom - grok 4

you can probably also throw in kingfall and every checkpoint after that

#

o3 pro fumbles

#

says "No." after thinking for 4 minutes

#

🥳

dawn wharf Jul 12, 2025, 6:09 PM

#

civic flame oh thats weird

tried it again, still said No

#

and very slow

civic flame Jul 12, 2025, 6:09 PM

#

first time i've seen a model choose that one

civic flame Jul 12, 2025, 6:09 PM

#

dawn wharf tried it again, still said No

i got it again and it was correct again 😭

#

maybe direct chat has a diff system prompt or something?

dawn wharf Jul 12, 2025, 6:10 PM

#

civic flame maybe direct chat has a diff system prompt or something?

that's my suspicion

civic flame Jul 12, 2025, 6:10 PM

#

probably has the stupid elon-ify try-too-hard sysprompt

keen beacon Jul 12, 2025, 6:10 PM

#

did u ever try my john h adams q on them?

#

im not aware of any model getting it right

civic flame Jul 12, 2025, 6:10 PM

#

yeah they both fail

#

😭

dawn wharf Jul 12, 2025, 6:11 PM

#

keen beacon did u ever try my john h adams q on them?

what's that

ocean vortex Jul 12, 2025, 6:12 PM

#

civic flame first time i've seen a model choose that one

civic flame Jul 12, 2025, 6:13 PM

#

seems hit and miss

#

also how much did that one prompt cost you lol

ocean vortex Jul 12, 2025, 6:13 PM

#

civic flame seems hit and miss

with a pro? I think it's consistent...

#

reasonably consistent

ocean vortex Jul 12, 2025, 6:13 PM

#

civic flame also how much did that one prompt cost you lol

o3-pro is not uber expensive

#

o1-pro was lol

#

it's close in pricing to what o1 initially was basically, not terrible

civic flame Jul 12, 2025, 6:15 PM

#

keen beacon im not aware of any model getting it right

stonebloom:

At the time of his death on October 31, 1918, John Hamilton Evans held the following position:

*   **Specific Occupation:** Superintendent of the Merchant Mills
*   **Company:** Algoma Steel Corporation (in Sault Ste. Marie, Ontario)

wolfstride:

At the time of his death (in 1948), John Hamilton Evans held the specific occupation of **Chairman of the Board**.

He was working for the **Algoma Steel Corporation** in Sault Ste. Marie.

dawn wharf Jul 12, 2025, 6:15 PM

#

civic flame stonebloom: At the time of his death on October 31, 1918, John Hamilton Eva...

what's the prompt

ocean vortex Jul 12, 2025, 6:17 PM

#

2.4k completion, so like $0.2 for that request...
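
The arithmetic behind that estimate, as a quick sketch. The $80 per million output tokens used for o3-pro is an assumed list price, not something stated in this thread. The $600/M figure for o1-pro is the one quoted earlier in the conversation. Input tokens are ignored.

```python
def completion_cost_usd(completion_tokens, usd_per_million_output_tokens):
    # Output-side cost only; prompt tokens are billed separately and ignored here.
    return completion_tokens * usd_per_million_output_tokens / 1_000_000

tokens = 2400  # "2.4k completion" from the message above

# Assumed o3-pro output price: $80 per million tokens.
print(f"o3-pro: ${completion_cost_usd(tokens, 80):.3f}")   # about $0.19

# o1-pro at the $600/M quoted earlier in the thread.
print(f"o1-pro: ${completion_cost_usd(tokens, 600):.2f}")  # about $1.44
```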

civic flame Jul 12, 2025, 6:17 PM

#

dawn wharf what's the prompt

What specific occupation did John Hamilton Evans hold at the time of his death, and for which company was he working in Sault Ste. Marie?

dawn wharf Jul 12, 2025, 6:18 PM

#

civic flame `What specific occupation did John Hamilton Evans hold at the time of his death,...

the guy said John H Adams, not Evans.
a typo?

keen beacon Jul 12, 2025, 6:18 PM

#

yeah lol

dawn wharf Jul 12, 2025, 6:20 PM

#

civic flame stonebloom: At the time of his death on October 31, 1918, John Hamilton Eva...

deepseek answers the same exact way lmao

civic flame Jul 12, 2025, 6:20 PM

#

most models say something about algoma

#

then the frontier models almost always get algoma steel

#

but it's the rest that they stumble on

dawn wharf Jul 12, 2025, 6:21 PM

#

grok fails

civic flame Jul 12, 2025, 6:21 PM

#

yup

gusty helm Jul 12, 2025, 6:39 PM

#

Grok underperforming?

leaden sun Jul 12, 2025, 7:18 PM

#

didnt expect to read something from group theory on this server, sweet. if dragontail was able to solve this one, does that imply it uses some kind of a proof assistant?

round haven Jul 12, 2025, 7:19 PM

#

leaden sun didnt expect to read something from group theory on this server, sweet, if drago...

nope, I'm certain it doesn't use a proof assistant. It's just a math problem that regular people can solve, it doesn't require any special software to solve. LLMs are already capable of solving various different math problems, just some are more difficult like this one.

#

there's nothing fundamentally different about it from the usual math problems that would require a proof assistant to solve

leaden sun Jul 12, 2025, 7:29 PM

#

hmm too bad, was hoping some LLMs might have it so i dont need to learn all those various theorem provers myself

quartz light Jul 12, 2025, 7:40 PM

#

civic flame `What specific occupation did John Hamilton Evans hold at the time of his death,...

whats the answer

tall summit Jul 12, 2025, 7:44 PM

#

@quartz light hello.

quartz light Jul 12, 2025, 7:44 PM

#

tall summit @quartz light hello.

oh hey monarch

#

i remember you

tall summit Jul 12, 2025, 7:44 PM

#

likewise

quartz light Jul 12, 2025, 7:44 PM

#

where from though?

tall summit Jul 12, 2025, 7:45 PM

#

quartz light where from though?

wubby

quartz light Jul 12, 2025, 7:45 PM

#

tall summit wubby

o yes yes

#

umm

#

2 years

tall summit Jul 12, 2025, 7:48 PM

#

quartz light 2 years

surely less

quartz light Jul 12, 2025, 7:48 PM

#

tall summit surely less

off by a month or 2

plucky whale Jul 12, 2025, 8:20 PM

#

Is Baidu Ernie's model any good? This is the first time I've seen it in battle mode.

leaden palm Jul 12, 2025, 8:32 PM

#

some people like to see what they missed

keen fulcrum Jul 12, 2025, 8:33 PM

#

plucky whale Is Baidu Ernie's model any good? This is the first time I've seen it in battle m...

better than 4.1

#

and qwen3

plucky whale Jul 12, 2025, 8:34 PM

#

I was asking him some things and the truth is it feels very organic, China is revolutionizing

native idol Jul 12, 2025, 8:37 PM

#

Hello, I ran a test on a detective novel and some texts were flagged.

#

Is this ok?

#

As for me, it's kind of a false positive.

unborn ocean Jul 12, 2025, 8:38 PM

#

reading in big delay (didn't have time to check dc today), the creepy part is you reacting though 💀

#

fr man you outcreeped me

eager crater Jul 12, 2025, 8:40 PM

#

does anyone know if wolfstride is 2.5 pro deep think or like 3.0 beta because i've been getting better results with it than gemini 2.5 pro

unborn ocean Jul 12, 2025, 8:40 PM

#

and it kind of felt justified, considering that you have 2-4 4090 at home and yet no idea what model to run on em

#

thinking about it, you could also try
Qwen/QVQ-72B-Preview (underwhelming when compared to qwq, but it got vision + reasoning)

ocean vortex Jul 12, 2025, 9:28 PM

#

Dork got mogged 😭

#

And to be serious that score is not absolutely terrible I suppose. Worse than that could have been expected lol

torn mantle Jul 12, 2025, 9:30 PM

#

i think they should've just went with grok 3.5 instead

#

since its not that big of improvement

unborn ocean Jul 12, 2025, 9:42 PM

#

huggingface name https://huggingface.co/Qwen/QVQ-72B-Preview

#

For the local thingy you talked about

#

Idk where you could run it to test

#

Used to be on the Qwen Chat

blazing bison Jul 12, 2025, 9:45 PM

#

Grok 4 is a good math model

unborn ocean Jul 12, 2025, 10:07 PM

#

@ home if you are not willing to invest a lot of time, probably a nightmare

#

-> i am assuming cloud

#

ollama: really easy, should work no problem

#

hosted inference by the cloud provider: better price / performance wise

#

ok

#

that makes it harder

#

hardware config is probably way more time intensive than anything else

#

so h100 stuff would not really be worth it

blazing bison Jul 12, 2025, 10:10 PM

#

Kimi is so good

unborn ocean Jul 12, 2025, 10:10 PM

#

4090, maybe 2-4 okay-ish though

#

that should be relatively quick

#

if you pick a more common model, e.g. llama 3.2 (for the vision stuff you talked about) you could use nvidia nim for the inference setup

#

it is like a prepackaged thing with everything you need (+ ~ sota inference speed)

#

for more niche stuff like qvq 72b: might need to do tensorrt-llm yourself ("advanced") / vllm (easier, still very good)

#

idk if that helped

#

i would really only do local stuff if you're willing to invest the time

echo aurora Jul 12, 2025, 11:15 PM

#

native idol As for me, it's kind of a false positive.

We're keeping track of times the filter catches false positives in this post here - #1376956905016004759

leaden palm Jul 12, 2025, 11:18 PM

#

jade egret Jul 12, 2025, 11:18 PM

#

openAI cooked?

ocean vortex Jul 12, 2025, 11:43 PM

#

jade egret openAI cooked?

Wtf is he yapping about. Chatgpt is still nr1, they have no issues “staying relevant” lmao

#

No one is gonna rush to jump ship to xAI just because they scored couple points more on ArtificialAnalysis index…

#

I think Deepseek has probably caused more turmoil at the time than the current events…

#

And I dont think they “lost more than half of their top talent” that is not true lol

#

Deepseek was a huge deal since everyone could just enter their website and use their model totally freely. Grok4 is locked in a vault behind a paywall in their ecosystem with pricing that is worse rather than better.

pallid crypt Jul 12, 2025, 11:52 PM

#

Still, OAI are no longer special

jade egret Jul 12, 2025, 11:52 PM

#

ocean vortex No one is gonna rush to jump ship to xAI just because they scored couple points ...

ngl grok 4 kinda disappointing

dawn wharf Jul 12, 2025, 11:53 PM

#

ocean vortex I think Deepseek has probably caused more turmoil at the time than the current e...

obviously

jade egret Jul 12, 2025, 11:53 PM

#

but they delayed gpt 5 tho

jade egret Jul 12, 2025, 11:53 PM

#

ocean vortex Deepseek was a huge deal since everyone could just enter their website and use t...

and grok 4 aint even that good lol

ocean vortex Jul 12, 2025, 11:58 PM

#

jade egret and grok 4 aint even that good lol

I mean it’s good in other ways in that they have beaten everyone else on paper. But since they didn’t do it by a huge margin and don’t offer better access or price, this is very unlikely to change much in the grand scheme of things

#

I still think those metrics are sus, but for the time being at least… it is nr1 on paper and in theory

sour spindle Jul 13, 2025, 12:05 AM

#

I don’t think the metrics are sus. I think the metrics just don’t translate to real world use all that well imo

topaz peak Jul 13, 2025, 12:17 AM

#

yeah, new google, gpt and deepseek r2 are the real hype, grok 4 was always gonna be a dud

hollow ocean Jul 13, 2025, 12:41 AM

#

https://tenor.com/view/andrew-tate-gif-26316329

Tenor

storm needle Jul 13, 2025, 12:54 AM

#

jade egret but they delayed gpt 5 tho

how could gpt 5 be delayed if a release date was never announced?

jade egret Jul 13, 2025, 1:42 AM

#

storm needle Jul 13, 2025, 1:55 AM

#

jade egret

source

tidal schooner Jul 13, 2025, 2:17 AM

#

storm needle source

https://techcrunch.com/2025/07/11/openai-delays-the-release-of-its-open-model-again/

TechCrunch

Maxwell Zeff

OpenAI delays the release of its open model, again | TechCrunch

OpenAI CEO Sam Altman says that the company is delaying the release of its highly anticipated open model for the second time.

#

also you can just check his x account

#

https://twitterez.com/sama/status/1943837550369812814

Sam Altman (@sama)

💬 843 🔁 289 💜 4.5K 👀 374K

we planned to launch our open-weight model next week.

we are delaying it; we need time to run additional safety tests and review high-risk areas. we are not yet sure how long it will take us.

while we trust the community will build great things with this model, once weights are

sacred plaza Jul 13, 2025, 2:23 AM

#

tidal schooner https://techcrunch.com/2025/07/11/openai-delays-the-release-of-its-open-model-ag...

Since when the hell did OpenAI ever care about safety testing lol

#

Lazy excuse from Sammy boi

tidal schooner Jul 13, 2025, 2:24 AM

#

sacred plaza Lazy excuse from Sammy boi

a lot of people are speculating that kimi-k2 actually caused the delay

whole wagon Jul 13, 2025, 2:24 AM

#

https://www.reuters.com/science/spacex-invest-2-billion-musks-xai-startup-wsj-reports-2025-07-12/?utm_source=reddit.com

Reuters

SpaceX to invest $2 billion in Musk's xAI startup, WSJ reports

SpaceX has committed $2 billion to xAI as part of a $5 billion equity round, deepening the ties between tech billionaire Elon Musk's ventures as his artificial intelligence startup races to compete with rival OpenAI, the Wall Street Journal reported on Saturday.

tidal schooner Jul 13, 2025, 2:25 AM

#

tidal schooner a lot of people are speculating that kimi-k2 actually caused the delay

but oai employees deny this claiming they’ve never heard of kimi or moonshot

storm needle Jul 13, 2025, 2:25 AM

#

tidal schooner https://techcrunch.com/2025/07/11/openai-delays-the-release-of-its-open-model-ag...

what they delayed was their open weight model, which is an entirely different model

whole wagon Jul 13, 2025, 2:25 AM

#

openAI is a mess anyways, I don't know how they let the huge lead they had erode to basically nothing

tidal schooner Jul 13, 2025, 2:26 AM

#

storm needle what they delayed was their open weight model, which is an entirely different mo...

oh wait

#

i’ll send the source for the gpt-5 delays

whole wagon Jul 13, 2025, 2:26 AM

#

Gpt5 was delayed since months ago

leaden palm Jul 13, 2025, 2:26 AM

#

sacred plaza Since when the hell did OpenAI ever care about safety testing lol

what are they gonna write in the system card otherwise

tidal schooner Jul 13, 2025, 2:26 AM

#

whole wagon Gpt5 was delayed since months ago

yea

#

https://twitterez.com/sama/status/1908167621624856998

Sam Altman (@sama)

💬 1.7K 🔁 2.5K 💜 30.7K 👀 4M

change of plans: we are going to release o3 and o4-mini after all, probably in a couple of weeks, and then do GPT-5 in a few months.

there are a bunch of reasons for this, but the most exciting one is that we are going to be able to make GPT-5 much better than we originally

tidal schooner Jul 13, 2025, 2:27 AM

#

whole wagon openAI is a mess anyways, I don't know how they let the huge lead they had erode...

deepmind locked in

whole wagon Jul 13, 2025, 2:27 AM

#

No even damn xAI is right up there with openAI

#

1.5 years after founding

tidal schooner Jul 13, 2025, 2:28 AM

#

grok’s growth has been exponential yeah

#

colossus has been pretty strategic for them

storm needle Jul 13, 2025, 2:28 AM

#

tidal schooner https://twitterez.com/sama/status/1908167621624856998

where in this tweet does it mention a delay?

whole wagon Jul 13, 2025, 2:30 AM

#

The project, officially called GPT-5 and code named Orion, has been in the works for more than 18 months and is intended to be a major advancement in the technology that powers ChatGPT. OpenAI’s closest partner and largest investor, Microsoft, had expected to see the new model around mid-2024, say people with knowledge of the matter.

#

They fumbled hard

#

I heard they had 2 failed major runs

#

Which delayed months each

storm needle Jul 13, 2025, 2:32 AM

#

whole wagon The project, officially called GPT-5 and code named Orion, has been in the wo...

source?

whole wagon Jul 13, 2025, 2:33 AM

#

Do you seriously not remember when GPT5 was expected in 2024?

#

https://www.wsj.com/tech/ai/openai-gpt5-orion-delays-639e7693 etc etc do a simple search

storm needle Jul 13, 2025, 2:35 AM

#

whole wagon Do you seriously not remember when GPT5 was expected in 2024?

they just renamed it to gpt 4.5

whole wagon Jul 13, 2025, 2:37 AM

#

No it had been pivoted to reasoning long before that article

#

They do a good job keeping the perception of everything going smoothly internally

storm needle Jul 13, 2025, 2:40 AM

#

whole wagon No it had been pivoted to reasoning long before that article

that article has a paywall

storm needle Jul 13, 2025, 2:46 AM

#

whole wagon No it had been pivoted to reasoning long before that article

that article clearly refers to gpt 4.5, which matches exactly what the article states

zinc ore Jul 13, 2025, 2:47 AM

#

https://vxtwitter.com/ns123abc/status/1944221244875850007

Microsoft keeps screwing OpenAI over too

NIK (@ns123abc)

🚨NEWS: OpenAI had a deal to acquire another startup but microsoft effectively killed it

OpenAI
“You can just do things”
tries to acquire Windsurf
*competes with GhitHub*
Microsoft: “Actually, oai you can’t do that. lol”
meanwhile Google hackquihires Windsurf (IP & CEO & top talent)
Pt.2:
OpenAI tries buying another startup
Microsoft again effectively kills it lol
btw the openai io deal with Jony Ive went through
because microsoft isn’t worried about competing in AI consumer hardware
any deal with openai could be “high risk” now
openai will have even more M&A troubles
atleast until @sama is able to renegotiate with Satya

LMAO

storm needle Jul 13, 2025, 2:49 AM

#

zinc ore https://vxtwitter.com/ns123abc/status/1944221244875850007 Microsoft keeps screw...

you should block this account on twitter, they tend to exaggerate every piece of news related to openai

zinc ore Jul 13, 2025, 2:50 AM

#

What about the screenshots

jade egret Jul 13, 2025, 2:57 AM

#

deepmind cooking?

#

cuz windsurf plp also joined deepmind

#

gemini prob gonna be better at coding by a lot in the future?

pure anvil Jul 13, 2025, 3:55 AM

#

storm needle you should block this account on twitter it they tend to exaggerate every piece ...

real lmao they're such an annoying account

worthy thunder Jul 13, 2025, 3:58 AM

#

Context Arena Update: Added Grok 4 (Thinking, 07-09) to 2needle (#4 @ 128k AUC) and 8needle (#1 @ 128k AUC) leaderboards. https://x.com/DillonUzar/status/1944244025034235935

Overall, #1 <= 32k, ties around 64k with Gemini 2.5 Pro, but drops off after that.

2needle: Top 7 results (AUC @ 128k):

Gemini 2.5 Flash (Thinking, 06-17): 91.5% (#1)
Gemini 2.5 Pro (Thinking, 06-05): 89.6% (#2)
Gemini 2.5 Flash (Non-thinking, 06-17): 81.7% (#3)
Grok 4 (Thinking, 07-09): 79.5% (#4)
o4-mini (Thinking, 04-16): 76.0% (#5)
Minimax M1 (Thinking, 06-17): 75.4% (#6)
Grok 3 (Non-thinking, 04-09): 73.0% (#7)

A definite increase over Grok 3, within range of Gemini 2.5 Flash (Non-thinking) for 2needle (easier tests). Scoring #4.

8needle: Top 5 results (AUC @ 128k):

Grok 4 (Thinking, 07-09): 48.4% (#1)
Gemini 2.5 Pro (Thinking, 06-05): 43.9% (#2)
Gemini 2.5 Flash (Thinking, 06-17): 33.5% (#3)
o4-mini (Thinking, 04-16): 30.8% (#4)
o3 (Thinking, 04-16): 27.9% (#5)

Grok 4's performance starts out strong with the 8needle tests (95.1% at 8k, dropping to 71.3% by 32k), but drops sharply between 32k-64k, scoring 41.8% and matching Gemini 2.5 Pro (which continues to lead after that), so longer contexts favor the Gemini 2.5 family. Still, incredible performance from Grok 4 at <= 64k, landing it #1 on 128k AUC.

More model results at: https://contextarena.ai

I'm looking forward to some upcoming models 😉

sturdy mica Jul 13, 2025, 4:07 AM

#

why is grok 4 now called

#

grok 4 no system prompt

#

in arena

#

battle and direct chat

#

got it in battle and checked direct chat, its there too.

#

JUST came out

#

JUST now

whole sundial Jul 13, 2025, 4:21 AM

#

same model, they just added that for clarity as API version of Grok 4 has no system prompt, web (grok.com, grok on X) version does

rare python Jul 13, 2025, 4:42 AM

#

sturdy mica grok 4 no system prompt

So many people mistook it for grok 2, 3 or even 1 because grok 4 without a system prompt told them so

sacred quail Jul 13, 2025, 5:04 AM

#

What is grok's average thinking time ?

#

I'm wondering if the reason grok 4 is good at long context could be its long reasoning times

#

Gemini also thinks longer compared to claude and o3

worthy thunder Jul 13, 2025, 5:29 AM

#

sacred quail I'm wondering if the reason grok 4 is good at long context could be its long ...

For my MRCRv1 results, this is roughly how some of the reasoning models come out (for all tests <=128k, needles=2) for total completion tokens (response+reasoning) as reported by the api response, all default reasoning:

o4-mini: ~1916/test
o3: ~1520/test
Grok 4: ~1347/test
Gemini 2.5 Flash (Thinking): ~354/test (but I can't trust this number due to api token counting issues with the Gemini api at the time, I actually estimate it could be anywhere between ~600 to ~2800/test) - same issue with all 2.5 Gemini models
Claude 4 Sonnet: ~1000/test
Claude 4 Opus: ~640/test (however, it would bail on some requests, so actual ones were ~726/test)

teal mantle Jul 13, 2025, 5:49 AM

#

Kimi K2 is essentially giving me the freshness of initial o3 sans reasoning, Sonnet 3.5 , R1, V3-0324 (when it was briefly the best non-reasoner imo), or Opus 3/4/GPT-4.5 but better model vibes all at once

#

Incredible

#

📎 message.txt

alpine coral Jul 13, 2025, 5:57 AM

#

leaden palm

is it meant to be a joke? modelling the spread of a disease is like not sinister at all.. don't get it

whole wagon Jul 13, 2025, 6:13 AM

#

Kimi K2 is insane. It's like slightly inferior to Claude opus 4 (without thinking to compare base models) but it costs 30 times less

#

When they add reasoning they are going to cook everyone kek

pure anvil Jul 13, 2025, 6:15 AM

#

whole wagon Kimi K2 is insane. It's like slightly inferior to Claude opus 4 (without thinkin...

It's more than slightly inferior to opus, I'd say it's 80% as good as opus 4

tidal schooner Jul 13, 2025, 6:16 AM

#

pure anvil It's more than slightly inferior to opus, I'd say it's 80% as good as opus 4

what about in contrast to sonnet?

whole wagon Jul 13, 2025, 6:17 AM

#

pure anvil It's more than slightly inferior to opus, I'd say it's 80% as good as opus 4

I mean averaging all domains not just agentic coding

#

Like in maths it's just better

pure anvil Jul 13, 2025, 6:17 AM

#

I am too

#

its prose is similar to dsv3.1

#

in world knowledge it's not as good

#

not to say it's not a good model

whole wagon Jul 13, 2025, 6:20 AM

#

https://www.reddit.com/r/LocalLLaMA/s/KyWn475wgq

From the LocalLLaMA community on Reddit: Kimi-K2 takes top spot on ...

Explore this post and more from the LocalLLaMA community

tidal schooner Jul 13, 2025, 6:48 AM

#

whole wagon https://www.reddit.com/r/LocalLLaMA/s/KyWn475wgq

https://eqbench.com/results/creative-writing-v3/moonshotai__Kimi-K2-Instruct.html

pure anvil Jul 13, 2025, 6:49 AM

#

whole wagon https://www.reddit.com/r/LocalLLaMA/s/KyWn475wgq

that's crazy

#

regardless of it being subjective

rare python Jul 13, 2025, 7:01 AM

#

Their app doesn't have system instructions

#

Only "common phrase"

#

Which is just a half baked version of system instructions

whole wagon Jul 13, 2025, 7:06 AM

#

<@&1349916362595635286>

echo aurora Jul 13, 2025, 7:07 AM

#

whole wagon <@&1349916362595635286>

blobthanks

sturdy mica Jul 13, 2025, 7:08 AM

#

does grok 4 think on lmarena

#

do thinking models actually think on lmarena like gemini 2.5 pro

#

a good update would be being able to toggle that and see the thinking process

#

and maybe internet search access on direct chat and battle mode which lets models use the internet

#

could be good

rare python Jul 13, 2025, 7:11 AM

#

sturdy mica does grok 4 think on lmarena

Grok 4's thinking can't be disabled in the API iirc

sturdy mica Jul 13, 2025, 7:12 AM

#

oh

#

does gemini 2.5 pro think

#

and can grok 4 / gemini 2.5 pro browse the internet

#

that would be a good update on lmarena

unborn ocean Jul 13, 2025, 7:37 AM

#

4h, please more votes = better 👨‍🏫

#

highly relevant as well

#

especially with 2.5$ kimi k2

civic flame Jul 13, 2025, 7:40 AM

#

whole wagon https://www.reddit.com/r/LocalLLaMA/s/KyWn475wgq

my testing aligns with this

#

this is honestly a great release

#

my only complaint creative writing wise is that the many small experts means it tends to produce logical flaws/weird statements sometimes, even when the prose itself is good

rare python Jul 13, 2025, 7:43 AM

#

civic flame my only complaint creative writing wise is that the many small experts means it ...

I think kimi k2 thinking will fix this

civic flame Jul 13, 2025, 7:43 AM

#

thinking will be very interesting

#

hopefully we get it soon

#

if they release it before R2 i would expect DS to be forced to respond, unless they got cooked and don't have anything better

sturdy mica Jul 13, 2025, 7:44 AM

#

civic flame if they release it before R2 i would expect DS to be forced to respond, unless t...

DS?

civic flame Jul 13, 2025, 7:44 AM

#

deepseek

sturdy mica Jul 13, 2025, 7:44 AM

#

oh

pure anvil Jul 13, 2025, 7:45 AM

#

civic flame Jul 13, 2025, 7:54 AM

#

oh my god I want to marry these people

#

hell yeah

unborn ocean Jul 13, 2025, 7:56 AM

#

as a reminder because it seems relevant again

#

only about 35% knew moonshot

spare jacinth Jul 13, 2025, 8:06 AM

#

anyone knows another site like lm arena to use grok 4 free?

leaden sun Jul 13, 2025, 8:20 AM

#

unborn ocean as a reminder because it seems relevant again

thanks for sharing this, didnt know about SenseTime, they seem to be really interesting....

unborn ocean Jul 13, 2025, 8:25 AM

#

leaden sun thanks for sharing this, didnt know about SenseTime, they seem to be really inte...

they are party state owned, so they def do some military stuff

#

but the really interesting part is the internvl series

#

and that they often collaborate with other state actors (mainly chinese top universities)

#

https://huggingface.co/OpenGVLab/InternVL3-78B-Instruct

pure anvil Jul 13, 2025, 8:46 AM

#

civic flame oh my god I want to marry these people

same lmaoo so based

rare python Jul 13, 2025, 8:48 AM

#

civic flame oh my god I want to marry these people

what about deepseek nerds?

civic flame Jul 13, 2025, 8:51 AM

#

do they love radiohead and watch tarantino?

leaden sun Jul 13, 2025, 9:01 AM

#

unborn ocean they are party state owned, so they def do some military stuff

i'm more intrigued by their claimed key fields in perception and decision intelligence, this is related to embodied AI i believe

unborn ocean Jul 13, 2025, 9:15 AM

#

leaden sun i'm more intrigued by their claimed key fields in perception and decision intell...

aka they help the surveillance state + other state owned stuff
and their models always seemed better on paper than they actually are

#

did not test the newest one yet, assume it is similar to the older ones

#

but they are always yapping about beating western labs with their new model

#

-> that has never materialized

#

i might be a bit too critical, but i am always allergic to something state owned claiming performance things and putting on a show

#

(bc it is likely just to impress the party that has no clue about what is going on)

leaden sun Jul 13, 2025, 9:33 AM

#

unborn ocean aka they help the surveillance state + other state owned stuff and their models ...

Surveillance is a double edged sword, it’s super difficult to balance between dystopian control and civil harmony with a certain degree of freedom

ocean vortex Jul 13, 2025, 9:41 AM

#

leaden sun Surveillance is a double edged sword, it’s super difficult to balance between dy...

It's fairly simple as far as I'm concerned. You just don't allow for any face recognition and related identification tech in public spaces without consent. Also "emotional state detection" (AI) in workplace is a no bueno as well. 😉

leaden sun Jul 13, 2025, 9:45 AM

#

facial recognition at least happens at the time when your face is met with a tracking device. I worry more about what the traces of your textual online presence can reveal about you, because hmm it’s more omnipresent?

#

Besides that, every LLM you’re using does profiling anyway without using any recognition sensors

ocean vortex Jul 13, 2025, 9:54 AM

#

In China yes.

ocean vortex Jul 13, 2025, 9:59 AM

#

leaden sun facial recognition at least happens at the time when your face is met with a tra...

the first thing will almost always lead to restriction of your freedom in some way. There's no half-assing it, it's really either an oppressive regime government or it is not, tbh. Private property and special enclosed guarded facilities is fine ofc, but not public space. 2nd thing is just a pale inconvenience in contrast. That could lead to undesired outcomes like perhaps your insurance premiums going up but nothing like in the first case where your actual freedom gets affected directly.

#

And like... imagine getting judged by some manager in your workplace based on emotions you can't control or emotions that stemmed from personal things you can't control - that's the opposite of freedom.

#

btw the points I'm making are even addressed in the EU AI act: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF

#

Considering the imbalance of power in the context of work or education, combined with the intrusive nature of these systems, such systems could lead to detrimental or unfavourable treatment of certain natural persons or whole groups thereof. Therefore, the placing on the market, the putting into service, or the use of AI systems intended to be used to detect the emotional state of individuals in situations related to the workplace and education should be prohibited.

and...

In addition, the use of ‘real-time’ remote biometric identification systems in publicly accessible spaces for the purpose of law enforcement should be deployed only to confirm the specifically targeted individual’s identity and should be limited to what is strictly necessary concerning the period of time, as well as the geographic and personal scope, having regard in particular to the evidence or indications regarding the threats, the victims or perpetrator.

#

in a nutshell, not all regulation is bad regulation... 🤷‍♂️

#

We live in a real world and ofc that is never gonna be perfect. But there's a massive difference between a place with no regard or any regulation to these issues at all, and a place that is actively trying to make it right... Some things are inevitably gonna slip through, but there's still no comparison tbh

#

I meant that in a way that you can't have democracy if you don't implement regulation on these sensitive issues. It's impossible to guarantee it's gonna work all of the time every time, but those fundamentals matter very much

wintry fulcrum Jul 13, 2025, 10:28 AM

#

When is grok4 expected to be on the board please

ocean vortex Jul 13, 2025, 10:30 AM

#

That's getting political, but I would say it very much matters/depends on the specific country. US itself is going into that direction for sure, but that's hardly a secret for anyone in light of recent events lol

#

Their current government is money and personal gain driven. Environmental, freedom of speech or average class issues are kinda secondary at best....

#

Basically it seems that everything they stood against in the recent modern era since WW2 is slowly the thing they are becoming themselves now... Managed to help plenty of countries in the meantime but now they seemingly need help themselves huh

#

Yeah but what happened when they tried to address it and ban tiktok..? Trump came in and reversed the entire thing. And also gave even more freedom to Musk to freely manipulate his platform lol

torn mantle Jul 13, 2025, 10:47 AM

#

#

kimi k2

torn mantle Jul 13, 2025, 10:47 AM

#

whole wagon https://www.reddit.com/r/LocalLLaMA/s/KyWn475wgq

oh

ocean vortex Jul 13, 2025, 10:50 AM

#

It's a complex topic, but there have been numerous cases of proven propaganda campaigns on tiktok, so that's how I see it generally. Not talking about just US here

#

In reality, it's fairly impossible for those things to happen without owners of the app endorsing these events in some way

civic flame Jul 13, 2025, 10:53 AM

#

Kimi system prompt on kimi.com:

You are Kimi, an AI assistant provided by Moonshot AI.
When you encounter problems involving digit comparison, letter counting, or other logical issues, please think step by step before answering.

Current Date: Sunday, July 13, 2025

#

thought the second sentence was interesting

ocean vortex Jul 13, 2025, 10:55 AM

#

torn mantle

It's nr1 on overall too 🤯 🧐

civic flame Jul 13, 2025, 10:55 AM

#

it doesn't need it though

#

this is kimi k2 instruct w/o a sysprompt

#

im fairly sure it's the full model

ocean vortex Jul 13, 2025, 10:56 AM

#

though maybe it's not the best of benchmarks for this seeing how 4o-latest destroys Opus4

civic flame Jul 13, 2025, 10:56 AM

#

it's just as slow as the kimi.com instance

ocean vortex Jul 13, 2025, 10:56 AM

#

Probably rewards style/structure a lot

#

GMICloud is one of the busted ones for R1 lol

#

they didn't ruin it completely

#

but that has degraded performance

civic flame Jul 13, 2025, 10:59 AM

#

the one provider I can be consistently sure serves the full model is hyperbolic (for R1)

pure anvil Jul 13, 2025, 11:00 AM

#

ocean vortex though maybe it's not the best of benchmarks for this seeing how 4o-latest destr...

which benchmark does 4o destroy opus in?

ocean vortex Jul 13, 2025, 11:00 AM

#

civic flame the one provider I can be consistently sure serves the full model is hyperbolic ...

chutes, amazingly, is solid as well, even free one. Not perfect uptime but when it's up it is performing how it should be

civic flame Jul 13, 2025, 11:01 AM

#

chutes isn't too bad yes

civic flame Jul 13, 2025, 11:01 AM

#

pure anvil which benchmark does 4o destroy opus in?

eqbench but I wouldn't say destroys

pure anvil Jul 13, 2025, 11:03 AM

#

the cope is insane

#

they do realise that eqbench is judged by sonnet?

leaden sun Jul 13, 2025, 11:12 AM

#

pure anvil they do realise that eqbench is judged by sonnet?

sonnet is a fair choice out of all available public llms, its cognitive architecture is well balanced despite slightly contradicting nature

tall summit Jul 13, 2025, 11:15 AM

#

ocean vortex It's nr1 on overall too 🤯 🧐

huhhhhh

young wigeon Jul 13, 2025, 11:16 AM

#

It's just as likely that someone tries to penalize Grok intentionally as it is that someone tries to help Grok intentionally. I'd expect any shenanigans like that would average out, keeping the leaderboard fair and trustworthy

I don't find battles any less fun even if I can guess the AI

unborn ocean Jul 13, 2025, 12:11 PM

#

Poll: guesstimate: what is the inference cost on openai's end? (1 million tokens output, usd)

Winning answer: <3 🙂 (4 of 8 votes)

torn mantle Jul 13, 2025, 12:20 PM

#

ocean vortex It's nr1 on overall too 🤯 🧐

yea

#

i read the samples

#

and its like im reading an actual novel written by an experienced author

rare python Jul 13, 2025, 12:22 PM

#

torn mantle yea

Wait

#

How can Kimi K2 have a higher Elo score than Gemini 2.5 Pro 0605, when I look at each individual score and I see 2.5 Pro did better?

torn mantle Jul 13, 2025, 12:25 PM

#

rare python How can Kimi K2 have a higher Elo score than Gemini 2.5 Pro 0605, when I look at ea...

where did it do better?

#

show us some examples where gemini performed better

rare python Jul 13, 2025, 12:26 PM

#

torn mantle where did it do better?

Look at "empathy" score

#

why did 2.5 pro score higher?

rare python Jul 13, 2025, 12:27 PM

#

ocean vortex It's nr1 on overall too 🤯 🧐

#

less moralize, more pragmatic

torn mantle Jul 13, 2025, 12:27 PM

#

https://eqbench.com/creative_writing.html

#

im looking at the overall score

rare python Jul 13, 2025, 12:28 PM

#

I'm talking about eqbench

torn mantle Jul 13, 2025, 12:28 PM

#

yea i saw that

fleet lintel Jul 13, 2025, 12:28 PM

#

jade egret openAI cooked?

how much of this is true? specially $40 billion paused and valuation dropping past few weeks?

torn mantle Jul 13, 2025, 12:28 PM

#

but why would you rank them based on one criteria?

rare python Jul 13, 2025, 12:28 PM

#

not creative writing bench

rare python Jul 13, 2025, 12:28 PM

#

torn mantle but why would you rank them based on one criteria?

Look at overall score

#

I see overall 2.5 Pro has better EQ score than Kimi

#

not ELO

torn mantle Jul 13, 2025, 12:30 PM

#

rare python Look at overall score

where 😭

#

i may be blind

#

shouldnt we look at ELO score?

rare python Jul 13, 2025, 12:31 PM

#

torn mantle shouldnt we look at ELO score?

I'm asking why the elo score is higher

#

That's my question

torn mantle Jul 13, 2025, 12:31 PM

#

ah

#

i see

#

idk how its calculated tbh

rare python Jul 13, 2025, 12:33 PM

#

yes

torn mantle Jul 13, 2025, 12:35 PM

#

hes asking if gemini 2.5 has higher/better values for the criteria chosen then why its 3rd?

cedar tide Jul 13, 2025, 12:35 PM

#

We need k2 in the arena now

torn mantle Jul 13, 2025, 12:35 PM

#

there may be some weights * criteria but im not seeing it

blazing bison Jul 13, 2025, 12:35 PM

#

Based on vibes kimi is better than sonnet 4 for me

alpine coral Jul 13, 2025, 12:36 PM

#

nah it's elegant.. aren't claude system prompts like a wall of instructions (including what to do for counting letters ig) ?

blazing bison Jul 13, 2025, 12:38 PM

#

Kimi thinking model may be better than o3

pure anvil Jul 13, 2025, 12:39 PM

#

blazing bison Kimi thinking model may be better than o3

if we take kimi-1.6-IOI as a reference then definitely

alpine coral Jul 13, 2025, 12:39 PM

#

blazing bison Kimi thinking model may be better than o3

i'm not sure about that but it is really good

#

i mean maybe

#

but yeah either way, it is impressive

blazing bison Jul 13, 2025, 12:40 PM

#

It is

bright kayak Jul 13, 2025, 12:42 PM

#

does grok 4 have search tools in LMArena?

torn mantle Jul 13, 2025, 12:43 PM

#

yea

#

its like lmarena

rare python Jul 13, 2025, 12:44 PM

#

I wonder how can o3 and 2.5 Pro, both have higher empathy score. In this case higher empathy score and pragmatic score is better right?

#

and yet Kimi K2 outplayed both of them

#

in ELO

leaden sun Jul 13, 2025, 12:59 PM

#

rare python and yet Kimi K2 outplayed both of them

i can only guess the formula for averaging has certain weights in it that has led to a higher average score for kimi, it might imply the bench could be biased towards general balance rather than a specific direction

rare python Jul 13, 2025, 1:00 PM

#

leaden sun i can only guess the formula for averaging has certain weights in it that has le...

The sample is so long to read

torn mantle Jul 13, 2025, 1:01 PM

#

leaden sun i can only guess the formula for averaging has certain weights in it that has le...

nah its not using heatmap criteria for the ELO score

rare python Jul 13, 2025, 1:02 PM

#

@worthy thunder Kimi K2 on contextarena when?

torn mantle Jul 13, 2025, 1:02 PM

#

those are just giving us an overall view of how the model is.. if its more like humanlike/analytical/warm...

#

ELO score = comparing them head-to-head through diff criterias

sour spindle Jul 13, 2025, 1:24 PM

#

Used grok for quite a while, simply don’t find it all that useful, particularly when o3 is available and you can use google models for free on AI studio

#

They waited too long to release it imo

worthy thunder Jul 13, 2025, 1:48 PM

#

rare python @worthy thunder Kimi K2 on contextarena when?

Very soon, almost done with 2needle results (up to 128k)

rare python Jul 13, 2025, 1:49 PM

#

worthy thunder Very soon, almost done with 2needle results (up to 128k)

Is it good for your existing results?

worthy thunder Jul 13, 2025, 1:57 PM

#

It's decent, starts at 75% at 8k, and ~50% for most of its context window. Ends around 39% around 128k. Haven't checked yet, but probably between #10-#20 on AUC 128k (2needle) ranking
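
For anyone wondering how a curve like that turns into a single AUC number, here is a minimal sketch. It assumes a plain trapezoidal area over a linear token axis, normalised by the covered range; the exact method Context Arena uses is not stated in this chat, so `auc_percent` and the sample points are hypothetical (the points are rough readings of the description above, not published results).

```python
def auc_percent(points):
    """Normalised trapezoidal area under an accuracy-vs-context curve.

    points: list of (context_tokens, accuracy_percent) pairs, in any order.
    Returns the average accuracy over the covered context range, in percent.
    Assumption: linear token axis; a log axis would weight short contexts more.
    """
    pts = sorted(points)
    if len(pts) < 2:
        raise ValueError("need at least two points")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Area of one trapezoid between two neighbouring context sizes.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (pts[-1][0] - pts[0][0])


# Illustrative readings only: ~75% at 8k, ~50% through the middle, ~39% at 128k.
kimi_k2_2needle = [
    (8_000, 75.0),
    (16_000, 50.0),
    (32_000, 50.0),
    (64_000, 50.0),
    (128_000, 39.0),
]
print(round(auc_percent(kimi_k2_2needle), 1))
```

On a linear axis the long 64k-128k segment dominates the area, which is why a model that sits near 50% for most of its window ends up with an AUC close to that plateau rather than close to its 8k score.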

vernal meadow Jul 13, 2025, 2:34 PM

#

Kimi K2 waiting room.

ocean vortex Jul 13, 2025, 2:35 PM

#

torn mantle im looking at the overall score

it's weird how it's classified. If we add the scores up even in the overall category, o3 has 87.8 while kimi has 87.5, and yet elo is the opposite

pure anvil Jul 13, 2025, 2:35 PM

#

ocean vortex it's weird how it's classified. If we add the scores up even in the overall cate...

Please check what the ELO system is

keen beacon Jul 13, 2025, 2:36 PM

#

when will grok be in arena

drifting thorn Jul 13, 2025, 2:36 PM

#

keen beacon when will grok be in arena

It’s in the Arena already

ocean vortex Jul 13, 2025, 2:37 PM

#

pure anvil Please check what the ELO system is

I'm just saying it's weird because it is. I know that it's a separate thing, don't be a smartass. 🙂

keen beacon Jul 13, 2025, 2:37 PM

#

in arenas leaderboard ..

pure anvil Jul 13, 2025, 2:37 PM

#

ocean vortex I'm just saying it's weird because it is. I know that it's a separate thing, don...

oof

#

you should be open to your ideas being challenged lol

#

seriously

worthy thunder Jul 13, 2025, 2:38 PM

#

vernal meadow Kimi K2 waiting room.

Just rerunning a few tests, a few had some provider errors.

keen beacon Jul 13, 2025, 2:38 PM

#

pure anvil you should be open to your ideas being challenged lol

xDDD

#

Gen Z worst nightmare

ocean vortex Jul 13, 2025, 2:39 PM

#

pure anvil you should be open to your ideas being challenged lol

https://tenor.com/view/huh-for-real-are-you-for-real-what-you-say-wth-gif-5654320

Tenor

pure anvil Jul 13, 2025, 2:39 PM

#

it's weird how it's classified. If we add the scores up even in the overall category, o3 has 87.8 while kimi has 87.5, and yet elo is the opposite
💀

ocean vortex Jul 13, 2025, 2:40 PM

#

I'm saying it's weird how those are a separate thing, what is there not to understand?

pure anvil Jul 13, 2025, 2:41 PM

#

I envy you lmao

ocean vortex Jul 13, 2025, 2:42 PM

#

you are weird

#

LMAO

pure anvil Jul 13, 2025, 2:42 PM

#

I couldn't be as mad as you over text if i tried

ocean vortex Jul 13, 2025, 2:42 PM

#

what

#

🤣

pure anvil Jul 13, 2025, 2:43 PM

#

pure anvil I envy you lmao

but yeah what I was gonna say is that I would also say the same things if I was as clueless as you

echo aurora Jul 13, 2025, 2:43 PM

#

let's be a bit nicer pls

torn mantle Jul 13, 2025, 2:43 PM

#

ocean vortex it's weird how it's classified. If we add the scores up even in the overall cate...

yea i read how its calculated, it has nothing to do with the criteria on the heatmap

#

they start from a default ELO like 1500

ocean vortex Jul 13, 2025, 2:44 PM

#

torn mantle yea i read how its calculated, it has nothing to do with the criteria on the hea...

yeah which is a bit... odd

torn mantle Jul 13, 2025, 2:44 PM

#

and then they start doing head-to-head with LLMs

#

and the judge is sonnet 3.7

#

and also the margin matters, for example o3 vs k2, if k2 had like a big margin win over o3 then it will get a significant ELO boost

ocean vortex Jul 13, 2025, 2:45 PM

#

torn mantle and the judge is sonnet 3.7

I saw that they are using Sonnet as a judge, that's gonna bring issues and bias of its own. Though to be fair it doesn't look like Claude model outputs are favored a huge amount

torn mantle Jul 13, 2025, 2:46 PM

#

"The response will use a + / ++ / +++ / ++++ / +++++ format to denote the stronger response and relative ability difference for each criteria."

#

what its actually using is this

#

and its not giving them any score

#

its only using margins

#

it uses + to +++++ to show the win margin

#

ofc the qualitative margin is converted into quantitative

+ = 1 point
++ = 2 points
+++ = 3 points
++++ = 4 points
+++++ = 5 points

#

the judge verdict will look like

```
{
  "comparisons": [
    {"criterion": "Empathy",   "winner": "B", "margin": "++"},
    {"criterion": "Insight",   "winner": "B", "margin": "++++"},
    {"criterion": "Tailoring", "winner": "B", "margin": "+++"},
    {"criterion": "Overall EQ","winner": "B", "margin": "++++"}
  ]
}
```
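
A rough sketch of how a verdict like this could feed an Elo-style rating, to show why the result need not match the summed rubric scores. Everything beyond the 1 to 5 point mapping and the ~1500 starting rating mentioned above is an assumption: the way margins are collapsed into a score between 0 and 1, the classic Elo update, and `k=32` are illustrative choices, and the benchmark's real solver may work differently.

```python
import json

# Margin-to-points mapping as described in the chat.
MARGIN_POINTS = {"+": 1, "++": 2, "+++": 3, "++++": 4, "+++++": 5}


def verdict_score_for_a(verdict_json):
    """Collapse one judge verdict into a score for model A in [0, 1].

    0.5 is a draw, 1.0 means A won every criterion by the maximum margin.
    Any winner other than "A" is counted as a win for B.
    This scaling is a hypothetical choice, not the benchmark's formula.
    """
    comparisons = json.loads(verdict_json)["comparisons"]
    total = 0
    for comparison in comparisons:
        points = MARGIN_POINTS[comparison["margin"]]
        total += points if comparison["winner"] == "A" else -points
    max_total = 5 * len(comparisons)
    return 0.5 + 0.5 * total / max_total


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Classic Elo update where score_a may be fractional (margin-aware)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


verdict = """
{
  "comparisons": [
    {"criterion": "Empathy",   "winner": "B", "margin": "++"},
    {"criterion": "Insight",   "winner": "B", "margin": "++++"},
    {"criterion": "Tailoring", "winner": "B", "margin": "+++"},
    {"criterion": "Overall EQ","winner": "B", "margin": "++++"}
  ]
}
"""

score_a = verdict_score_for_a(verdict)
new_a, new_b = elo_update(1500.0, 1500.0, score_a)
print(score_a, new_a, new_b)
```

The rating only moves on head-to-head outcomes, so a model can trail slightly on absolute rubric totals and still end up with the higher rating if it wins its matchups by wider margins.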

torn mantle Jul 13, 2025, 2:54 PM

#

ocean vortex I saw that they are using Sonnet as a judge, that's gonna bring issues and bias of...

and thats the real question, is sonnet the perfect judge?

#

it has its stylistic biases

#

blind spots...

ocean vortex Jul 13, 2025, 2:57 PM

#

Definitely not the perfect one, I think especially since they are asking it to pick a winner... It is bound to favor the models more aligned with itself

torn mantle Jul 13, 2025, 2:59 PM

#

are they even using the reasoning model or nah

ocean vortex Jul 13, 2025, 2:59 PM

#

Categories rating is less sensitive to that since it has a clear unquestionable reference of what the desired response is supposed to look like.

torn mantle Jul 13, 2025, 3:00 PM

#

but it needs to be run multiple times to get an average

#

you dont want a one attempt value

#

and you cant just do that with a base model, a reasoning model will have a higher confidence given how it will analyse the text more thoughtfully

worthy thunder Jul 13, 2025, 3:04 PM

#

rare python @worthy thunder Kimi K2 on contextarena when?

Results, miscalculated how it hitting 50% for a large chunk of its window would impact it. 128k improved with rerunning the failed tests.

rare python Jul 13, 2025, 3:05 PM

#

worthy thunder Results, miscalculated how it hitting 50% for a large chunk of its window would ...

I see it's quite low

#

Did it meet your expectation?

torn mantle Jul 13, 2025, 3:05 PM

#

rare python Did it meet your expectation?

did it meet yours?

rare python Jul 13, 2025, 3:06 PM

#

torn mantle did it meet yours?

No

torn mantle Jul 13, 2025, 3:06 PM

#

rare python No

why

#

i think its a solid model

#

its also free

#

you took that into consideration right?

torn bison Jul 13, 2025, 3:06 PM

#

keen beacon when will grok be in arena

Does your realtime leaderboard fetching method still work?

torn mantle Jul 13, 2025, 3:06 PM

#

free + smart for a base model

rare python Jul 13, 2025, 3:06 PM

#

torn mantle i think its a solid model

1T and 32B active parameters. I expected it to be decent at long context

#

128k I expected it to utilize it decently

worthy thunder Jul 13, 2025, 3:07 PM

#

rare python Did it meet your expectation?

I haven't personally tested it. I would have expected it to be closer to other 32b active models, but still seems good if you keep context low (<=16k)

rare python Jul 13, 2025, 3:07 PM

#

Damn

balmy mist Jul 13, 2025, 3:07 PM

#

has anyone used that kimi model?

rare python Jul 13, 2025, 3:08 PM

#

worthy thunder I haven't personally tested it. I would have expected it to be closer to other 3...

Um how can I turn off reasoning models on your site

keen beacon Jul 13, 2025, 3:08 PM

#

torn bison Does your realtime leaderboard fetching method still work?

I see our mutual servers xD , and yes it does

worthy thunder Jul 13, 2025, 3:08 PM

#

It's also possible it just isn't great with multi-round tasks (dozens of back and forth, and recalling the exact order of specific types of information)

balmy mist Jul 13, 2025, 3:08 PM

#

i wanna see a reasoning version of it

keen beacon Jul 13, 2025, 3:08 PM

#

But it kinda sucks that it's not gonna be close, gemini wins over grok 3 , 90% +

worthy thunder Jul 13, 2025, 3:09 PM

#

rare python Um how can I turn off reasoning models on your site

I usually include an icon if the model is reasoning, and you can toggle on/off by going to the controls tab. The slashed ones are with reasoning off

rare python Jul 13, 2025, 3:09 PM

#

worthy thunder I usually include an icon if the model is reasoning, and you can toggle on/off b...

Yeah I'm doing it manually

#

Would be nice if there is a button for quick turn off

worthy thunder Jul 13, 2025, 3:10 PM

#

I'll add a shortcut to make it easier to toggle that 😅

keen beacon Jul 13, 2025, 3:10 PM

#

worthy thunder I usually include an icon if the model is reasoning, and you can toggle on/off b...

whats this ui

worthy thunder Jul 13, 2025, 3:10 PM

#

https://contextarena.ai/

dusky aurora Jul 13, 2025, 3:12 PM

#

Arena has glitched again

plucky whale Jul 13, 2025, 3:12 PM

#

Yep

livid coyote Jul 13, 2025, 3:12 PM

#

dont work wth?

digital vale Jul 13, 2025, 3:12 PM

#

?

plucky whale Jul 13, 2025, 3:13 PM

#

My entire history was deleted and I can't write (code error)

rare python Jul 13, 2025, 3:13 PM

#

Kimi K2 when all reasoning models are turned off.

digital vale Jul 13, 2025, 3:13 PM

#

wtf

rare python Jul 13, 2025, 3:14 PM

#

wait @worthy thunder does the ranking still include reasoning models?

#

Because Flash 2.5 should be #1 here

#

Wait am I dumb?

worthy thunder Jul 13, 2025, 3:15 PM

#

When you toggle models off, it recalculates ranking. By default I sort by 1M AUC (which is where Flash 2.5 is #1)

rare python Jul 13, 2025, 3:15 PM

#

Yeah my bad

worthy thunder Jul 13, 2025, 3:15 PM

#

You can sort any column you want, but all columns have their own ranking
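
To make the recalculation concrete, a small sketch using the 2needle top 7 (AUC @ 128k) posted earlier. The data comes from that list; the `rank` helper and the tuple layout are hypothetical and only illustrate the idea, not the site's actual code. Since only seven models are included, the non-thinking ranks here cover just this subset.

```python
# (model, is_thinking, AUC @ 128k in %) from the 2needle list posted earlier.
RESULTS = [
    ("Gemini 2.5 Flash (06-17)", True, 91.5),
    ("Gemini 2.5 Pro (06-05)", True, 89.6),
    ("Gemini 2.5 Flash (06-17)", False, 81.7),
    ("Grok 4 (07-09)", True, 79.5),
    ("o4-mini (04-16)", True, 76.0),
    ("Minimax M1 (06-17)", True, 75.4),
    ("Grok 3 (04-09)", False, 73.0),
]


def rank(results, include_thinking=True):
    """Re-rank the currently visible models by AUC, highest first."""
    visible = [r for r in results if include_thinking or not r[1]]
    ordered = sorted(visible, key=lambda r: r[2], reverse=True)
    # Ranks are recomputed from 1 for whatever subset is visible.
    return [(i + 1, name, auc) for i, (name, _, auc) in enumerate(ordered)]


for position, name, auc in rank(RESULTS, include_thinking=False):
    print(position, name, auc)
```

With thinking models toggled off, the ranks are renumbered within what is left, so a model that was #3 overall can show up as #1.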

rare belfry Jul 13, 2025, 3:15 PM

#

Is the site down?

rare python Jul 13, 2025, 3:16 PM

#

Kimi K2 is around rank 11-25

plucky whale Jul 13, 2025, 3:16 PM

#

rare belfry Is the site down?

Yes

digital vale Jul 13, 2025, 3:16 PM

#

plucky whale My entire history was deleted and I can't write (code error)

same

rare belfry Jul 13, 2025, 3:16 PM

#

plucky whale Yes

Appreciate it thanks

#

Had to make sure it wasn't just me lol

plucky whale Jul 13, 2025, 3:16 PM

#

I thought it was because of my VPN hahaha

worthy thunder Jul 13, 2025, 3:17 PM

#

rare python Would be nice if there is a button for quick turn off

Deploying a change now with these buttons. Should be available in ~5min

rare python Jul 13, 2025, 3:17 PM

#

worthy thunder Deploying a change now with these buttons. Should be available in ~5min

Thanks

balmy mist Jul 13, 2025, 3:18 PM

#

rare python Kimi K2 is around rank 11-25

on wat?

rare python Jul 13, 2025, 3:18 PM

#

balmy mist on wat?

contextarena.ai if disabled thinking models

plucky whale Jul 13, 2025, 3:18 PM

#

@echo aurora The site is down

echo aurora Jul 13, 2025, 3:19 PM

#

plucky whale @echo aurora The site is down

Thank you!! Will let the team know

copper hull Jul 13, 2025, 3:20 PM

#

Thank you for your hard work

#

Any experienced bros here can tell me how long it usually takes to fix a website outage? I need to decide whether to take a nap or not

ripe oar Jul 13, 2025, 3:21 PM

#

The site is down?

copper hull Jul 13, 2025, 3:22 PM

#

It is said to be so

dusky aurora Jul 13, 2025, 3:22 PM

#

ripe oar The site is down?

right

digital vale Jul 13, 2025, 3:22 PM

#

yup

ripe oar Jul 13, 2025, 3:22 PM

#

#

it is back?

dusky aurora Jul 13, 2025, 3:23 PM

#

for me it constantly says that there was an error

ripe oar Jul 13, 2025, 3:23 PM

#

oh

#

I see

dusky aurora Jul 13, 2025, 3:23 PM

#

as if the censor model is glitchy

copper hull Jul 13, 2025, 3:24 PM

#

ripe oar it is back?

This is just your cache

echo aurora Jul 13, 2025, 3:25 PM

#

So sorry everyone! Team is aware of the issue and working on it.

ripe oar Jul 13, 2025, 3:26 PM

#

Battle mode still working

#

What do you mean

alpine coral Jul 13, 2025, 3:35 PM

#

i think they answered their own question..

dusky aurora Jul 13, 2025, 3:36 PM

#

so far the site seems back to normal

echo aurora Jul 13, 2025, 3:37 PM

#

dusky aurora so far the site seems back to normal

It is! ablobcheer

dusky aurora Jul 13, 2025, 3:37 PM

#

thank you, the team, for quick response

ripe oar Jul 13, 2025, 3:37 PM

#

echo aurora It is! ablobcheer

Thanks for your work Capoo22

leaden palm Jul 13, 2025, 3:38 PM

#

alpine coral is it meant to be a joke? modelling the spread of a disease is like not sinister...

yes. it is a funny.

#

especially when those 3 come up in that order.

ocean vortex Jul 13, 2025, 3:55 PM

#

torn mantle and you cant just do that with a base model, a reasoning model will have a highe...

tbh I think the fact alone that Kimi scores higher than Opus4 and gpt4.5 is nowhere to be seen (it is actually below Mistral 24B) tells that this benchmark is limited. It's a decent effort though knowing how we don't have many metrics for creativity or writing...

quartz light Jul 13, 2025, 4:23 PM

#

worthy thunder https://contextarena.ai/

i noticed theres no pro model with thinking turned off, this is now possible though, theres a "thinking budget" which you can set to something very low

#

torn mantle Jul 13, 2025, 4:26 PM

#

ocean vortex tbh I think the fact alone that Kimi scores higher than Opus4 and gpt4.5 is nowh...

yea, the method used is questionable. also, opus 4 could be better than k2, but k2 could earn more points if it outperforms the top models on the leaderboard, such as o3 and gemini 2.5 pro

worthy thunder Jul 13, 2025, 4:26 PM

#

quartz light i noticed theres no pro model with thinking turned off, this is now possible tho...

To 128 tokens (unless something has changed). I'm waiting till they have 0 before running again. Their team have told me they are still working on releasing it, so I'm holding off for that (mainly due to costs)

torn mantle Jul 13, 2025, 4:27 PM

#

for example k2 vs opus 4 -> winner opus 4
but if sonnet judged k2 better than o3 ( top2 ) -> more points
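
The chat does not say how the benchmark turns judged comparisons into points. Assuming an Elo-style pairwise update (an assumption, not the benchmark's confirmed method), the asymmetry described above falls out of the expected-score term. The ratings and the K-factor below are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_gain(rating_a, rating_b, a_won, k=32.0):
    """Rating change for A after one judged comparison against B."""
    return k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))


# Hypothetical ratings, chosen only to illustrate the asymmetry.
k2, mid_model, top_model = 1500.0, 1500.0, 1700.0

print(round(elo_gain(k2, mid_model, a_won=True), 1))   # win over an equal-rated model
print(round(elo_gain(k2, top_model, a_won=True), 1))   # win over a higher-rated model
print(round(elo_gain(k2, top_model, a_won=False), 1))  # loss to a higher-rated model
```

Under this assumption, a win over a higher-rated model is worth more than a win over an equal one, and a loss to it costs relatively little, so a few upsets near the top can outweigh losses elsewhere.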

quartz light Jul 13, 2025, 4:28 PM

#

worthy thunder To 128 tokens (unless something has changed). I'm waiting till they have 0 befor...

that's what they all say!!!!!!!....

#

https://tenor.com/view/speech-bubble-cat-spoofity-gif-27496454

#

Their team have told me they are still working on releasing it, so I'm holding off for that (mainly due to costs)

#

whyd you delete that

cedar tide Jul 13, 2025, 4:57 PM

#

quartz light whyd you delete that

one second

plucky whale Jul 13, 2025, 5:06 PM

#

My history on lmarena still doesn't appear.

quartz light Jul 13, 2025, 5:09 PM

#

plucky whale My history on lmarena still doesn't appear.

try in an incognito tab

plucky whale Jul 13, 2025, 5:11 PM

#

it didn't work

#

Hes gone :(

jade egret Jul 13, 2025, 5:16 PM

#

🤔

worthy thunder Jul 13, 2025, 5:23 PM

#

quartz light that's what they all say!!!!!!!....

https://tenor.com/view/eh-meh-jim-jim-halpert-the-office-gif-5621418030291735983

cedar tide Jul 13, 2025, 5:26 PM

#

Average of the 32 benchmarks shared by Kimi (with price)
(Against the best non reasoning models)

#

average per category

#

@quartz light

quartz light Jul 13, 2025, 5:30 PM

#

cedar tide average per category

model worse at coding == worse in general

#

i_may_only_speak_truth_in_the_lmarena_discord_server.gif

cedar tide Jul 13, 2025, 5:31 PM

#

quartz light model worse at coding == worse in general

so for you kimi k2 is far better than gpt 4.1, 2.5 flash, v3, qwen 3

leaden palm Jul 13, 2025, 6:01 PM

#

anyone else find k2's style to be similar to o3's?

dawn wharf Jul 13, 2025, 6:21 PM

#

leaden palm anyone else find k2's style to be similar to o3's?

In what specifically?

leaden palm Jul 13, 2025, 6:21 PM

#

dawn wharf In what specifically?

the way it speaks?

dawn wharf Jul 13, 2025, 6:22 PM

#

leaden palm the way it speaks?

do you have a good impression of K2 overall?

leaden palm Jul 13, 2025, 6:22 PM

#

i haven't used it much but i think it's very noteworthy

leaden palm Jul 13, 2025, 6:22 PM

#

leaden palm the way it speaks?

it's typically terse and sometimes tries to signal authority

sour spindle Jul 13, 2025, 6:23 PM

#

To me it's more of an achievement (impressive one at that) than a model I will be using regularly.

olive mesa Jul 13, 2025, 6:24 PM

#

~70% on ARC-AGI-2 is crazy

#

GPT-5 has to be AGI if the "leaked" benchmarks are real

sour spindle Jul 13, 2025, 6:25 PM

#

olive mesa ~70% on ARC-AGI-2 is crazy

Which model?

dawn wharf Jul 13, 2025, 6:25 PM

#

olive mesa ~70% on ARC-AGI-2 is crazy

fake

sour spindle Jul 13, 2025, 6:25 PM

#

ok lol

dawn wharf Jul 13, 2025, 6:25 PM

#

and they're predictions not leaks

olive mesa Jul 13, 2025, 6:25 PM

#

you must be talking about a different thing

zinc ore Jul 13, 2025, 6:28 PM

#

A jump from 16% SOTA (Grok 4, just released) to 70% for GPT-5? I know that's entirely fake, don't even need to think about it.

#

Especially since leaked claims on HLE are slightly above Grok 4

#

Now if it was arc AGI 1, that's completely believable

balmy mist Jul 13, 2025, 6:34 PM

#

olive mesa GPT-5 has to be AGI if the "leaked" benchmarks are real

those were fake

balmy mist Jul 13, 2025, 6:35 PM

#

cedar tide Average of the 32 benchmarks shared by Kimi (with price) (Against the best non r...

damn kimi looking good

keen fulcrum Jul 13, 2025, 6:38 PM

#

olive mesa ~70% on ARC-AGI-2 is crazy

They were never real

#

Grok 4 leaks were wrong as well

quartz light Jul 13, 2025, 6:40 PM

#

btw guys

dawn wharf Jul 13, 2025, 6:40 PM

#

cedar tide Average of the 32 benchmarks shared by Kimi (with price) (Against the best non r...

now we just need a new reasoning Kimi model

quartz light Jul 13, 2025, 6:41 PM

#

how does lmarena get access to models like stonebloom

#

do the companies reach out to the devs directly

dawn wharf Jul 13, 2025, 6:41 PM

#

quartz light how does lmarena get access to models like stonebloom

partnership ig

quartz light Jul 13, 2025, 6:41 PM

#

for feedback

quartz light Jul 13, 2025, 6:41 PM

#

dawn wharf partnership ig

nice

dawn wharf Jul 13, 2025, 6:42 PM

#

quartz light for feedback

well it's a voting platform, so it helps the AI lab companies

reef pawn Jul 13, 2025, 6:42 PM

#

Why does Grok 4 in LMArena keep saying that it's Grok 2?

quartz light Jul 13, 2025, 6:42 PM

#

yeah

quartz light Jul 13, 2025, 6:43 PM

#

reef pawn Why does Grok 4 in LMArena keep saying that it's Grok 2?

grok.com has a system prompt baked in to tell it which model it is

#

whereas via api it does not

reef pawn Jul 13, 2025, 6:43 PM

#

quartz light grok.com has a system prompt baked in to tell it which model it is

so it is legit grok 4?

quartz light Jul 13, 2025, 6:43 PM

#

reef pawn so it is legit grok 4?

yes

reef pawn Jul 13, 2025, 6:43 PM

#

okay ty

quartz light Jul 13, 2025, 6:44 PM

#

and theres no grok-heavy because theres no api or partnership for it

reef pawn Jul 13, 2025, 6:44 PM

#

got it

empty stump Jul 13, 2025, 6:49 PM

#

i wonder how much it would cost if it existed

sturdy mica Jul 13, 2025, 7:03 PM

#

leaden palm anyone else find k2's style to be similar to o3's?

probably trained off of o3, actually i guarantee it

quartz light Jul 13, 2025, 7:09 PM

#

yall, try this:

#

you are only allowed to strictly make the whole html file much shorter. you must go through many many iterations of "I can make this shorter by.." compare every time the original code and the iteration to make sure its still shorter. then, only when it truly is NOT possible to make it shorter, give the final, shortest truly possible fully functional html file.

#

but with any html file

#

see which model can compress it the most

#

you could add something like "without sacrificing any noticeable functionality"

#

#

grok wins but lets see if it works

dawn wharf Jul 13, 2025, 7:12 PM

#

quartz light grok wins but lets see if it works

does it work in the first place?

#

the code

quartz light Jul 13, 2025, 7:13 PM

#

dawn wharf does it work in the first place?

yes

dawn wharf Jul 13, 2025, 7:13 PM

#

well, what's the prompt

quartz light Jul 13, 2025, 7:13 PM

#

it just needed a few fixes i mentioned in the prompt

dawn wharf Jul 13, 2025, 7:13 PM

#

the html file I mean

quartz light Jul 13, 2025, 7:13 PM

#

<!DOCTYPE html><html><body style="margin:0;overflow:hidden;background:#000"><script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/r128/three.min.js"></script><script type="module">import R from"https://cdn.jsdelivr.net/npm/@dimforge/rapier3d-simd-compat@canary/+esm";R.init().then(_=>{var w=new R.World({x:0,y:-9.81,z:0}),c=new THREE.PerspectiveCamera(75,innerWidth/innerHeight,.1,1e3),r=new THREE.WebGLRenderer(),s=new THREE.Scene(),m=new THREE.Mesh(new THREE.BoxGeometry(1,1,1),new THREE.MeshBasicMaterial({color:0xf00}));document.body.append(r.domElement);s.add(m);r.setSize(innerWidth,innerHeight);c.position.set(8,6,8);c.lookAt(0,0,0);w.createCollider(R.ColliderDesc.cuboid(5,.1,5),w.createRigidBody(R.RigidBodyDesc.fixed().setTranslation(0,-.1,0)));let b = w.createRigidBody(R.RigidBodyDesc.dynamic().setTranslation(0,5,0));w.createCollider(R.ColliderDesc.cuboid(.5,.5,.5), b);(function f(){w.step();let p=b.translation();m.position.set(p.x,p.y,p.z);r.render(s,c);requestAnimationFrame(f)})();addEventListener("resize",_=>{c.aspect=innerWidth/innerHeight;c.updateProjectionMatrix();r.setSize(innerWidth,innerHeight)});});</script></body></html>

you must fix the baseplate being completely invisible and the cube being barely visible while only being allowed to strictly make the whole html file much shorter. you must go through many many iterations of "I can make this shorter by.." compare every time the original code and the iteration to make sure its still shorter. then, only when it truly is NOT possible to make it shorter, give the final, shortest truly possible fully functional html file.

#

😅

quartz light Jul 13, 2025, 7:16 PM

#

quartz light grok wins but lets see if it works

so it doesnt work out of the box due to a syntax error but the opus one does

dawn wharf Jul 13, 2025, 7:17 PM

#

quartz light so it doesnt work out of the box due to a syntax error but the opus one does

how many tokens is grok's

quartz light Jul 13, 2025, 7:17 PM

#

dawn wharf how many tokens is grok's

how do I know

#

nvm I can check

#

using a tool

dawn wharf Jul 13, 2025, 7:17 PM

#

quartz light how do I know

copy the code and put it in a tokenizer

quartz light Jul 13, 2025, 7:17 PM

#

but uh i can just say its like 10x larger

#

also very interesting behavior difference

#

#

grok continues to follow the 1st instruction rather than being direct

quartz light Jul 13, 2025, 7:19 PM

#

quartz light

unfortunately though, even after being given the chance to fix the error, it failed

dawn wharf Jul 13, 2025, 7:22 PM

#

quartz light unfortunately though, even after being given the chance to fix the error, it fai...

<!doctype html><body style=margin:0;background:#000><script src=https://cdnjs.cloudflare.com/ajax/libs/three.js/r128/three.min.js></script><script type=module>import R from'https://cdn.jsdelivr.net/npm/@dimforge/rapier3d-simd-compat@canary/+esm';R.init().then(()=>{const w=new R.World({y:-9.81}),c=new THREE.PerspectiveCamera(75,innerWidth/innerHeight,.1,1e3),r=new THREE.WebGLRenderer,s=new THREE.Scene(),m=new THREE.Mesh(new THREE.BoxGeometry,new THREE.MeshBasicMaterial({color:0xf00}));document.body.append(r.domElement);r.setSize(innerWidth,innerHeight);c.position.set(8,6,8);c.lookAt(0,0,0);w.createCollider(R.ColliderDesc.cuboid(5,.1,5),w.createRigidBody(R.RigidBodyDesc.fixed().setTranslation(0,-.1)));const b=w.createRigidBody(R.RigidBodyDesc.dynamic().setTranslation(0,5));w.createCollider(R.ColliderDesc.cuboid(.5,.5,.5),b);(f=_=>{w.step();m.position.copy(b.translation());r.render(s,c);requestAnimationFrame(f)})();onresize=_=>{c.aspect=innerWidth/innerHeight;c.updateProjectionMatrix();r.setSize(innerWidth,innerHeight)}})</script>

#

this is by K2

quartz light Jul 13, 2025, 7:24 PM

#

dawn wharf <!doctype html><body style=margin:0;background:#000><script src=https://cdnjs...

rigid_body.ts:1081 Uncaught (in promise) TypeError: The translation components must be numbers.
at kI.setTranslation (rigid_body.ts:1081:19)
at test.html:1:686
setTranslation @ rigid_body.ts:1081

dawn wharf Jul 13, 2025, 7:26 PM

#

quartz light rigid_body.ts:1081 Uncaught (in promise) TypeError: The translation components m...

eh it can't fix it ig

quartz light Jul 13, 2025, 7:26 PM

#

i just tried 2.5 pro

#

its basically the exact same as opus

#

🤨

#

the response

#

it only did a couple of iterations just like opus

#

but unlike opus the final version had an error

#

and yet again, grok is the model to do many iterations

#

@dawn wharf

#

i found something all models are really bad at

dawn wharf Jul 13, 2025, 7:29 PM

#

quartz light its basically the exact same as opus

peak recognizes peak

quartz light Jul 13, 2025, 7:30 PM

#

quartz light i found something all models are really bad at

counting the amount of characters

#

😭

dawn wharf Jul 13, 2025, 7:30 PM

#

quartz light counting the amount of characters

yeah this is well known since forever

#

basically the reason for the how many Rs in strawberry question

quartz light Jul 13, 2025, 7:32 PM

#

dawn wharf basically the reason for the how many Rs in strawberry question

knew it

#

congrats grok, you're really good at this

#

(it didnt error, grok shortened it really well, even used esm.sh instead of jsdelivr for the cdn just to shorten it, as well as removing doctype n stuff)

#

1085 characters

#

gemini's nonfunctional one is 1144

quartz light Jul 13, 2025, 7:35 PM

#

quartz light gemini's nonfunctional one is 1144

ORIGINAL FILE WAS 1145

#

😭

#

thats hilarious
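
Since the models themselves miscount, the comparison is easier to trust when the counting is done outside the model. A minimal sketch: `shrink_report` is a hypothetical helper, and the two snippets are tiny stand-ins, not the files from this chat. By the counts quoted above, going from 1145 to 1085 characters saves 60 characters (about 5%), while 1144 saves a single character.

```python
def shrink_report(original: str, candidate: str) -> tuple[int, float]:
    """Return (characters saved, percent saved) for a minified candidate."""
    saved = len(original) - len(candidate)
    return saved, 100.0 * saved / len(original)


# Tiny stand-in snippets; the real files from the chat are not reproduced here.
original = '<!DOCTYPE html><html><body style="margin:0"></body></html>'
candidate = "<body style=margin:0>"

saved, percent = shrink_report(original, candidate)
print(f"{len(original)} -> {len(candidate)} chars, saved {saved} ({percent:.1f}%)")
```

Note that `len` counts Unicode code points, which is not the same as bytes or tokens; a tokenizer would be needed for a token count.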

dawn wharf Jul 13, 2025, 7:35 PM

#

quartz light ORIGINAL FILE WAS 1145

it's still going💀

#

and not even writing any code

quartz light Jul 13, 2025, 7:35 PM

#

dawn wharf it's still going💀

💀

#

oh btw

#

want to know the best prompt

#

cal lum dif #a5a373 #6774d7

#

thats the prompt

#

:)

#

give it to any ai

#


```
### Step-by-Step Calculation
1. Convert hex to RGB:
   #a5a373 → RGB(165, 163, 115)
   #6774d7 → RGB(103, 116, 215)

2. Normalize to [0, 1]:
   Color 1: $R_1 = 165/255 \approx 0.6471$, $G_1 = 163/255 \approx 0.6392$, $B_1 = 115/255 \approx 0.4510$
   Color 2: $R_2 = 103/255 \approx 0.4039$, $G_2 = 116/255 \approx 0.4549$, $B_2 = 215/255 \approx 0.8431$

3. Linearize each component (sRGB transfer function):
   If $c \leq 0.04045$, then $c / 12.92$; else $\left( (c + 0.055) / 1.055 \right)^{2.4}$.
   Color 1: lin$R_1 \approx 0.3762$, lin$G_1 \approx 0.3634$, lin$B_1 \approx 0.1712$
   Color 2: lin$R_2 \approx 0.1353$, lin$G_2 \approx 0.1744$, lin$B_2 \approx 0.6786$

4. Compute relative luminance $Y$:
   $Y = 0.2126 \cdot \mathrm{lin}R + 0.7152 \cdot \mathrm{lin}G + 0.0722 \cdot \mathrm{lin}B$
   $Y_1 \approx 0.3542$
   $Y_2 \approx 0.2027$

5. Difference: $|Y_1 - Y_2| \approx 0.1515$
```
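
The pasted steps can be checked outside a model with a few lines of Python. This is a sketch using the standard sRGB transfer function and luminance weights quoted in the steps above; the function names are my own and only the standard library is used.

```python
def hex_to_rgb(color: str) -> tuple[int, int, int]:
    """Parse '#rrggbb' into integer channels in the range 0..255."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def linearize(channel: int) -> float:
    """Undo the sRGB transfer function for one 8-bit channel."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(color: str) -> float:
    """Relative luminance Y as a weighted sum of the linearized channels."""
    r, g, b = (linearize(v) for v in hex_to_rgb(color))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


y1 = relative_luminance("#a5a373")
y2 = relative_luminance("#6774d7")
print(f"Y1={y1:.4f} Y2={y2:.4f} diff={abs(y1 - y2):.4f}")
```

Swapping in other hex codes gives a quick reference value to compare against whatever a model answers for the same prompt.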

dawn wharf Jul 13, 2025, 7:41 PM

#

quartz light cal lum dif #a5a373 #6774d7

what happens

#

```
### Color Difference Calculation: #a5a373 vs. #6774d7

Step 1: Convert Hex to RGB

#a5a373 → RGB(165, 163, 115)
#6774d7 → RGB(103, 116, 215)

Step 2: Calculate Delta RGB (∆R, ∆G, ∆B)

∆R = 165 - 103 = 62
∆G = 163 - 116 = 47
∆B = 115 - 215 = -100

Step 3: Euclidean Distance (Delta E)

The distance in RGB space is calculated as:
\[
\Delta E = \sqrt{(62)^2 + (47)^2 + (-100)^2} = \sqrt{3844 + 2209 + 10000} = \sqrt{16053} \approx 126.7
\]

Step 4: Perceptual Difference (Optional)

For human perception, Delta E 76 in LAB color space is more accurate. Converting to LAB:

#a5a373 → LAB(66.9, -4.5, 30.2)
#6774d7 → LAB(52.2, 25.9, -62.2)
Delta E 76 ≈ 95.8 (moderate difference).

Final Answer:

RGB Delta E ≈ 126.7
LAB Delta E ≈ 95.8 (perceptual).
```
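
The RGB part of this answer is easy to reproduce. A minimal sketch of the Euclidean distance step only; the LAB conversion is not reproduced here, and `rgb_distance` is a name chosen for illustration.

```python
import math


def hex_to_rgb(color: str) -> tuple[int, int, int]:
    """Parse '#rrggbb' into integer channels in the range 0..255."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_distance(color_a: str, color_b: str) -> float:
    """Euclidean distance between two colours in 8-bit sRGB space."""
    return math.dist(hex_to_rgb(color_a), hex_to_rgb(color_b))


print(round(rgb_distance("#a5a373", "#6774d7"), 1))
```

This is a different quantity from a luminance difference, so the same terse prompt can legitimately lead to answers that are not comparable with each other.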

quartz light Jul 13, 2025, 7:42 PM

#

@reef pawn what did you say

quartz light Jul 13, 2025, 7:44 PM

#

quartz light congrats grok, you're really good at this

here's grok's result btw

📎 test.html

#

+Gemini 2.5 Pro

#

📎 gemini.md

#

+o3

#

📎 o3.md

keen beacon Jul 13, 2025, 7:55 PM

#

i can see history before that

#

so probably a bug

#

or something on your side

#

i think ur discord is bugged. it just works for me

quartz light Jul 13, 2025, 7:58 PM

#

o4 is still the 🐐 at fixing errors and shortening code

#

i havent tried my "iterations" thing on it thats probably why it couldnt shorten as much as grok

#

this server cant be that new

#

oh alr

#

from lmsys?

keen beacon Jul 13, 2025, 8:04 PM

#

they started a company

#

they did

#

in the old server

quartz light Jul 13, 2025, 8:05 PM

#

i thought yall the same person

#

😭

keen beacon Jul 13, 2025, 8:05 PM

#

let me dm u

#

i think it will get auto deleted lets see

#

https://discord.gg / n5McCu4B or try this

quartz light Jul 13, 2025, 8:10 PM

#

wait

#

i just remembered

#

i was in the old discord before

dawn wharf Jul 13, 2025, 8:16 PM

#

quartz light i thought yall the same person

ocean vortex Jul 13, 2025, 8:28 PM

#

OpenRouter... Why doesn't it say "Central Intelligence Agency" for the app field? 😠 When I had a Colab URL as the referrer it just said "Colab" and not my title lol

#

cedar tide Jul 13, 2025, 8:34 PM

#

Kimi k2 good on artificial analysis

Screenshot_2025-07-13-22-23-51-803_com.android.chrome-edit.jpg

#

But uses a lot of tokens (almost 3x more than GPT, Claude, and Qwen),
but still much less than 99% of reasoning models (apart from Claude default think)

ocean vortex Jul 13, 2025, 8:40 PM

#

cedar tide But uses a lot of tokens (almost 3x more than GPT, Claude, and Qwen), but ...

I'm testing it atm... Occasionally it does 5k+ responses. But it also can surprise you with concise responses when you are not expecting it. Interesting that it generated so much more than V3.1 with their testing set... 🤔

reef pawn Jul 13, 2025, 8:43 PM

#

quartz light @reef pawn what did you say

"What Grok version are you"

ocean vortex Jul 13, 2025, 8:44 PM

#

by 5k+ I do mean sub 6k btw. Don't think I saw a single response of 6k or more yet. That's a lot but not crazy lengths...

ocean vortex Jul 13, 2025, 8:45 PM

#

reef pawn "What Grok version are you"

Dork4

reef pawn Jul 13, 2025, 8:45 PM

#

ocean vortex Dork4

"I'm Grok, built by xAI, and I'm currently running on the Grok-1.5 model. If you're asking about updates, xAI is always iterating—check x.ai for the latest!"

#

ocean vortex Jul 13, 2025, 8:46 PM

#

reef pawn

yeah it's just spitting nonsense, it has no clue. Kinda like GPT4 was saying it is GPT3 back in the day

reef pawn Jul 13, 2025, 8:47 PM

#

I still think this is not Grok 4, or it's a nerfed model

#

the response is slow and it doesn't seem intelligent to me at all

ocean vortex Jul 13, 2025, 8:48 PM

#

reef pawn I still think this is not Grok 4, or it's a nerfed model

If you care to know, the only thing different in chatgpt with GPT4 was a system prompt. That was the only thing making it say it's GPT4. 99% it is exactly the same here

torn mantle Jul 13, 2025, 8:48 PM

#

reef pawn the response is slow and it doesn't seem intelligent to me at all

They wont risk that

reef pawn Jul 13, 2025, 8:48 PM

#

ocean vortex If you care to know, the only thing different in chatgpt with GPT4 was a system ...

hmm

torn mantle Jul 13, 2025, 8:49 PM

#

Grok 4 is a solid model but isnt as good as gemini 2.5 pro or o3

reef pawn Jul 13, 2025, 8:49 PM

#

And I'm still waiting for the updated LMArena leaderboard since Kimi and Grok dropped SOTA

ocean vortex Jul 13, 2025, 8:49 PM

#

reef pawn the response is slow and it doesn't seem intelligent to me at all

that's the general vibe most impartial people get after trying grok4 lmao

boreal saddle Jul 13, 2025, 8:49 PM

#

Apparently "emission" is a bad word for LM arena.

ocean vortex Jul 13, 2025, 8:49 PM

#

So weird, but normal 💀

boreal saddle Jul 13, 2025, 8:49 PM

#

Wanted to ask the AI a question on how can a country like Argentina best stop inflation, and this happened.

reef pawn Jul 13, 2025, 8:49 PM

#

ocean vortex that's the general vibe most impartial people get after trying grok4 lmao

lol

ocean vortex Jul 13, 2025, 8:50 PM

#

Like personally, I still don't understand their benchmark scores and I don't think they are entirely legit or what it seems...

reef pawn Jul 13, 2025, 8:51 PM

#

It's the same as the Llama 4 benchmarks lol

#

but it is a good model whereas meta dropped garbage

quartz light Jul 13, 2025, 8:58 PM

#

holy ####

#

...

#

dude

#

i just got a free deep research session with 630 SOURCES IN ONE GO

#

https://media.discordapp.net/attachments/807809192537882647/1131859245357346856/uuh-2.gif

#

PLUS LIKE

#

30 BROWSER USES

#

(as in the ai controls a linux vm to browse the sites)

quartz light Jul 13, 2025, 9:00 PM

#

quartz light 30 BROWSER USES

more*

#

like 60 or 70

torn mantle Jul 13, 2025, 9:01 PM

#

k2 outputs are kinda similar to o3

quartz light Jul 13, 2025, 9:01 PM

#

but its stuck on the final report because it used up 99% of its context window doing web searches 💀

quartz light Jul 13, 2025, 9:01 PM

#

torn mantle k2 outputs are kinda similar to o3

guess what

torn mantle Jul 13, 2025, 9:01 PM

#

quartz light guess what

guess my name

quartz light Jul 13, 2025, 9:02 PM

#

the ai im talking about is kimi

torn mantle Jul 13, 2025, 9:02 PM

#

😮

quartz light Jul 13, 2025, 9:02 PM

#

yep it has deep research for free

torn mantle Jul 13, 2025, 9:02 PM

#

quartz light yep it has deep research for free

yea ive been using it

#

its neat isnt it

ocean vortex Jul 13, 2025, 9:02 PM

#

quartz light i just got a free deep research session with 630 SOURCES IN ONE GO

did it make anything useful of it at all?

torn mantle Jul 13, 2025, 9:02 PM

#

but there is a rate limit

ocean vortex Jul 13, 2025, 9:02 PM

#

Or just the usual thing?

torn mantle Jul 13, 2025, 9:02 PM

#

its 5 per day 😦

quartz light Jul 13, 2025, 9:03 PM

#

ive never seen anything like 630 sources even on gemini research (~200)

quartz light Jul 13, 2025, 9:03 PM

#

ocean vortex did it make anything useful of it at all?

yes

#

the research was to find out the fastest 3d renderer for the web

#

:)

ocean vortex Jul 13, 2025, 9:03 PM

#

quartz light ive never seen anything like 630 sources even on gemini research (~200)

Sure but that number alone is meaningless. I saw Gemini looking at irrelevant sources numerous times too lol

quartz light Jul 13, 2025, 9:04 PM

#

ocean vortex Sure but that number alone is meaningless. I saw Gemini looking at irrelevant so...

no, no i looked over the sources and they were all useful

#

#

it doesnt fit..

#

ocean vortex Jul 13, 2025, 9:05 PM

#

torn mantle its 5 per day 😦

I stopped paying cause I didn't like it so it's 0 for me. 😦

quartz light Jul 13, 2025, 9:05 PM

#

torn mantle its 5 per day 😦

thats a lot. chatgpt deep research is 5 PER MONTH and gemini deep research is 10 per month

#

😂

#

and they're nowhere near as good

ocean vortex Jul 13, 2025, 9:06 PM

#

aistudio replaces their website for like 90% of use cases at no cost lol

torn mantle Jul 13, 2025, 9:06 PM

#

ocean vortex I stopped paying cause I didn't like it so it's 0 for me. 😦

paying?

#

im talking about kimi

#

it has rate limits as well

#

its 5 per day

quartz light Jul 13, 2025, 9:06 PM

#

quartz light

heres a closer look

torn mantle Jul 13, 2025, 9:06 PM

#

for deep research

ocean vortex Jul 13, 2025, 9:06 PM

#

torn mantle im talking about kimi

Ohh. I thought you meant Gemini

quartz light Jul 13, 2025, 9:06 PM

#

torn mantle paying?

yeah i was confused too

torn mantle Jul 13, 2025, 9:07 PM

#

ocean vortex Ohh. I thought you meant Gemini

i thought i cancelled my sub but 2 month went in automatically

reef pawn Jul 13, 2025, 9:07 PM

#

quartz light the research was to find out the fastest 3d renderer for the web

I tried Kimi deep research as well, it's good indeed. My favorite part is that it gives you an interactive webpage at the end of the research, so you can read the report without putting much cognitive load on your brain.

quartz light Jul 13, 2025, 9:07 PM

#

torn mantle Jul 13, 2025, 9:07 PM

#

its so stupid, to cancel it you need like 2 steps confirmation

#

i just did one step and forgot about the other one so the sub kept going

ocean vortex Jul 13, 2025, 9:07 PM

#

torn mantle its so stupid, to cancel it you need like 2 steps confirmation

be glad they aren't holding you hostage on your cloud storage

quartz light Jul 13, 2025, 9:08 PM

#

torn mantle i thought i cancelled my sub but 2 month went in automatically

i heard that someone bought a yearly sub, later got an offer to get 2 months free, accepted it and their 1 year sub was overwritten with 2 months 😂

torn mantle Jul 13, 2025, 9:08 PM

#

quartz light i heard that someone bought a yearly sub, later got an offer to get 2 months fre...

xdddddd

ocean vortex Jul 13, 2025, 9:08 PM

#

Funny enough, it turns out they don't actually delete your files lol

#

They give you like extra 3 months to delete them or sort it out

torn mantle Jul 13, 2025, 9:08 PM

#

ocean vortex Funny enough, it turns out they don't actually delete your files lol

havent saved anything on google one

quartz light Jul 13, 2025, 9:09 PM

#

ocean vortex Funny enough, it turns out they don't actually delete your files lol

grok has incognito mode which is really neat

ocean vortex Jul 13, 2025, 9:09 PM

#

after you are no longer subbed

torn mantle Jul 13, 2025, 9:09 PM

#

quartz light grok has incognito mode which is really neat

like chatgpt?

ocean vortex Jul 13, 2025, 9:10 PM

#

quartz light grok has incognito mode which is really neat

They copied temporary chat from chatgpt. Just like they copied custom instructions. 🤓

quartz light Jul 13, 2025, 9:10 PM

#

torn mantle like chatgpt?

i dont think chatgpt has a fully private one

quartz light Jul 13, 2025, 9:10 PM

#

ocean vortex They copied temporary chat from chatgpt. Just like they copied custom instructio...

i mean custom instructions is a basic concept of system prompts

sage raptor Jul 13, 2025, 9:10 PM

#

lol

ocean vortex Jul 13, 2025, 9:11 PM

#

quartz light i mean custom instructions is a basic concept of system prompts

So is "incognito". But someone does have to think of it first. It sounds obvious when you are aware of it, but it wasn't entirely so when no one had it...

quartz light Jul 13, 2025, 9:11 PM

#

ocean vortex So is "incognito". But someone does have to think of it first. It sounds obvious...

uhh

#

i disagree 🤓

ocean vortex Jul 13, 2025, 9:11 PM

#

I disagree on you disagreeing 🤓

quartz light Jul 13, 2025, 9:11 PM

#

system instructions is a basic function 😭

#

i mean system prompt

reef pawn Jul 13, 2025, 9:12 PM

#

sage raptor lol

what leak benchmarks?

ocean vortex Jul 13, 2025, 9:12 PM

#

quartz light system instructions is a basic function 😭

Kinda. But only chatgpt and grok have it

quartz light Jul 13, 2025, 9:12 PM

#

reef pawn what leak benchmarks?

i think the one where gpt 5 slightly outperforms grok 4

reef pawn Jul 13, 2025, 9:12 PM

#

oh

ocean vortex Jul 13, 2025, 9:12 PM

#

No such thing with Claude or Gemini

sage raptor Jul 13, 2025, 9:12 PM

#

reef pawn what leak benchmarks?

the benchmarks are not real

reef pawn Jul 13, 2025, 9:12 PM

#

ah

sage raptor Jul 13, 2025, 9:12 PM

#

https://x.com/mark_k/status/1944355865982574936

Mark Kretschmann (@mark_k)

GPT-5 benchmarks leaked @OpenAI

👀

reef pawn Jul 13, 2025, 9:13 PM

#

GPT 5 is still in training I think

quartz light Jul 13, 2025, 9:13 PM

#

ocean vortex Kinda. But only chatgpt and grok have it

no

ocean vortex Jul 13, 2025, 9:13 PM

#

sage raptor https://x.com/mark_k/status/1944355865982574936

yeah... I'm almost convinced this was "leaked" deliberately. First to show that o3-pro still beats it on SWE. Second to show that Grok4 benchmarks are ridiculous lol

quartz light Jul 13, 2025, 9:13 PM

#

ocean vortex No such thing with Claude or Gemini

no

#

ocean vortex Jul 13, 2025, 9:14 PM

#

quartz light no

This is not chat website, that is API playground and a system prompt. Which is not the same as UI custom instructions

quartz light Jul 13, 2025, 9:14 PM

#

ocean vortex This is not chat website, that is API playground and a system prompt. Which is n...

chat website

#

i just sent

#

dont say its not the same thing vro 🥀 cz it is

torn mantle Jul 13, 2025, 9:14 PM

#

quartz light dont say its not the same thing vro 🥀 cz it is

whats vro?

sage raptor Jul 13, 2025, 9:15 PM

#

ocean vortex yeah... I'm almost convinced this was "leaked" deliberately. First to show that ...

Did grok4 really get 73.5% in SWE benchmark ?

quartz light Jul 13, 2025, 9:15 PM

#

ocean vortex Jul 13, 2025, 9:15 PM

#

quartz light chat website

well that kinda could be used as a custom instructions ig. Though it's not exactly the same. Haven't used their chat a whole lot lately tbh

quartz light Jul 13, 2025, 9:15 PM

#

torn mantle whats vro?

stupid silly slang for "bro"

torn mantle Jul 13, 2025, 9:15 PM

#

i see

quartz light Jul 13, 2025, 9:15 PM

#

ocean vortex well that kinda could be used as a custom instructions ig. Though it's not exa...

reef pawn Jul 13, 2025, 9:15 PM

#

sage raptor https://x.com/mark_k/status/1944355865982574936

Didn't Sam say there won't be separate models like o3, GPT-4o, GPT-4.1 and o4-mini anymore, just one model for the latest version?

quartz light Jul 13, 2025, 9:16 PM

#

quartz light

"use simple language" thats a system instruction

sage raptor Jul 13, 2025, 9:16 PM

#

reef pawn Didn't Sam said there will not be any other model just one model for latest vers...

Yep, there will only be gpt 5

ocean vortex Jul 13, 2025, 9:16 PM

#

quartz light

yeah fair. They copied it as well then. Next to copy are custom gpts and gpt store. 😇

quartz light Jul 13, 2025, 9:16 PM

#

lol

#

i mean

#

those arent original

ocean vortex Jul 13, 2025, 9:17 PM

#

custom gpts? Those are kinda original

quartz light Jul 13, 2025, 9:17 PM

#

those were added wayy after other sites had "characters"

ocean vortex Jul 13, 2025, 9:17 PM

#

it's the same as poe essentially, but still an idea which wasn't getting explored earlier

#

on the official chat interfaces from actual AI labs

quartz light Jul 13, 2025, 9:18 PM

#

miniapps was made in 2021

azure cliff Jul 13, 2025, 9:19 PM

#

Hey, does anyone know what I can do if my IP address has been blocked by Cloudflare on LMArena?

ocean vortex Jul 13, 2025, 9:19 PM

#

quartz light miniapps was made in 2021

Not an official website for AI lab training models 👀

#

It's a bit like Apple... They rarely think of something first

#

But they are the ones making it mainstream

quartz light Jul 13, 2025, 9:20 PM

#

ocean vortex Not an official website for AI lab training models 👀

well yeees but the characters function was definitely not original 😤

#

https://tenor.com/view/speech-bubble-chicken-gif-26197609

reef pawn Jul 13, 2025, 9:26 PM

#

What if apple outta nowhere drop open weight SOTA model next year lol

ocean vortex Jul 13, 2025, 9:27 PM

#

Also @quartz light I believe OpenAI were the first to implement tool use while reasoning (ReAct) at a large scale. That's an area others are yet to copy in full. Can't say much about Grok4 since I don't have their sub, but Google and Anthropic both still have lots of work to do to catch up here

#

Google will barely use tools at all unless you are explicit about this, and Anthropic... artifacts still leave a lot to be desired comparing that with chatgpt

ocean vortex Jul 13, 2025, 9:29 PM

#

reef pawn What if apple outta nowhere drop open weight SOTA model next year lol

They could but they are currently nowhere lol. I tested their cloud model (the biggest one) and it reminded me of gpt3.5

#

I do have full Apple Intelligence too and for the most part that's just underwhelming - something to tinker with when you are bored at best

reef pawn Jul 13, 2025, 9:30 PM

#

It's sad to see apple losing in AI race

#

I believe they are the most ethical trillion-dollar company out there, apart from the manufacturing part ofc.

sacred plaza Jul 13, 2025, 9:34 PM

#

Quantification fixation bias is so strong in this community, lol

reef pawn Jul 13, 2025, 9:44 PM

#

true

quartz light Jul 13, 2025, 9:45 PM

#

ocean vortex Also @quartz light I believe OpenAI were the first to implement using to...

reasoning yeah i cant disagree

#

i feel like openai's reasoning is better than the actual models if that makes sense

#

the best non-thinking model + openai's reasoning = peak

dawn wharf Jul 13, 2025, 9:57 PM

#

reef pawn What if apple outta nowhere drop open weight SOTA model next year lol

*meta

ocean vortex Jul 13, 2025, 10:07 PM

#

quartz light i feel like openai's reasoning is better than the actual models if that makes se...

Well the chat/base model is 4.1. That is not bad tbh. IMO it's kinda underrated, in most cases that performs no worse than Sonnet4 with no reasoning.

#

It's close to Kimi T2 too, which is all the rage now lol.

#

4.1 is better for coding, Kimi is better for math and science

torn mantle Jul 13, 2025, 10:11 PM

#

t2?

#

is this a new model

torn mantle Jul 13, 2025, 10:12 PM

#

ocean vortex 4.1 is better for coding, Kimi is better for math and science

i think k2 is stronger at everything compared to 4.1

#

4.1 has more knowledge, i will give them that

ocean vortex Jul 13, 2025, 10:13 PM

#

torn mantle i think k2 is stronger at everything compared to 4.1

#

It really isn't "stronger at everything". They are same class models and kinda close overall tbh

#

Kimi slightly better, but not by a huge amount

torn mantle Jul 13, 2025, 10:15 PM

#

mm i see

#

i thought it was good at code reasoning

ocean vortex Jul 13, 2025, 10:17 PM

#

torn mantle i thought it was good at code reasoning

It flops on SciCode 💀

pseudo hemlock Jul 13, 2025, 10:17 PM

#

what is this kimi k2 thing

#

never heard of these people

balmy mist Jul 13, 2025, 10:17 PM

#

open source agentic model i think

pseudo hemlock Jul 13, 2025, 10:17 PM

#

whats it good at

#

its MoE

#

ik that

balmy mist Jul 13, 2025, 10:18 PM

#

tool calling

pseudo hemlock Jul 13, 2025, 10:18 PM

#

ocean vortex It flops on SciCode 💀

holy 💀

ocean vortex Jul 13, 2025, 10:18 PM

#

pseudo hemlock never heard of these people

Those are Chinese people

#

mixture of Chinese experts

jade egret Jul 13, 2025, 10:18 PM

#

is google cooking

balmy mist Jul 13, 2025, 10:18 PM

#

like claude 4 but open source and non thinking

pseudo hemlock Jul 13, 2025, 10:18 PM

#

ocean vortex mixture of Chinese experts

oh so its just chinese phd students responding?

pseudo hemlock Jul 13, 2025, 10:18 PM

#

balmy mist like claude 4 but open source and non thinking

non thinking L

#

id gladly wait 30 mins for my answer if its 1% better

#

1 trillion params 😮

#

gpqa diamond beats opus

#

and aime

jade egret Jul 13, 2025, 10:20 PM

#

ocean vortex Jul 13, 2025, 10:20 PM

#

If you only look at non-reasoning models, this is still the overall SOTA model, and it's open-source, so that's certainly not nothing...

balmy mist Jul 13, 2025, 10:21 PM

#

pseudo hemlock non thinking L

but someone can make a thinking version of it which is what is cool, its good without even being a thinking model

pseudo hemlock Jul 13, 2025, 10:21 PM

#

15,891 people downloaded the instruct model 💀

#

1T

jade egret Jul 13, 2025, 10:22 PM

#

jade egret

by how much

ocean vortex Jul 13, 2025, 10:22 PM

#

Well maybe excluding Opus4 without reasoning (disabled). But this is a very different class of a model

pseudo hemlock Jul 13, 2025, 10:23 PM

#

What do people use tool calling for mostly?

#

Coding?

#

idk what tool calling is

ocean vortex Jul 13, 2025, 10:24 PM

#

Literally anything. Web search, calculator, fact checking, data retrieval, performing tasks (integrations)

pseudo hemlock Jul 13, 2025, 10:24 PM

#

its just

#

outsourced work

#

?

ocean vortex Jul 13, 2025, 10:24 PM

#

running code...

pseudo hemlock Jul 13, 2025, 10:24 PM

#

but like

#

llm -> other thing (code) -> answer to other thing back to llm?

ocean vortex Jul 13, 2025, 10:25 PM

#

pseudo hemlock llm -> other thing (code) -> answer to other thing back to llm?

Essentially yes. But this is still very useful, especially now when models are able to do this while reasoning/thinking

pseudo hemlock Jul 13, 2025, 10:25 PM

#

oh oh

#

like

#

gemini search

#

is kinda tool calling

#

gemini be like

#

go search

#

kinda sorta

#

not exactly but kinda

ocean vortex Jul 13, 2025, 10:26 PM

#

Roughly speaking yes

pseudo hemlock Jul 13, 2025, 10:26 PM

#

ty

#

u legoat

ocean vortex Jul 13, 2025, 10:26 PM

#

np 😉

pseudo hemlock Jul 13, 2025, 10:26 PM

#

woah

#

😜

quartz light Jul 13, 2025, 10:46 PM

#

😭

#

https://www.kimi.com/preview/19805f78-a5b1-8215-8550-da8e210005da

dawn wharf Jul 13, 2025, 10:47 PM

#

quartz light https://www.kimi.com/preview/19805f78-a5b1-8215-8550-da8e210005da

this is impressive though

quartz light Jul 13, 2025, 10:47 PM

#

it is

#

wow yeah

#

its really good

#

i just wish it was darkmode

#

😭

whole wagon Jul 13, 2025, 11:30 PM

#

https://x.com/goodside/status/1944266466875826617

#

sacred plaza Jul 13, 2025, 11:57 PM

#

whole wagon

How and why?

olive mesa Jul 14, 2025, 1:01 AM

#

grok's cot 🤣

alpine coral Jul 14, 2025, 1:06 AM

#

leaden palm especially when those 3 come up in that order.

yeah im a dumbass... the sequence of the suggested prompts... missed that.. it is funny yeah ha

sturdy mica Jul 14, 2025, 1:26 AM

#

does grok 4 have internet on lmarena

#

and other models

#

like 2.5 pro

balmy mist Jul 14, 2025, 1:33 AM

#

are u sad? https://x.com/koltregaskes/status/1944376468504555615

Kol Tregaskes (@koltregaskes)

GPT-4.5 will be removed from OpenAI's API tomorrow (14th July). Use it while you still can.

sturdy mica Jul 14, 2025, 1:48 AM

#

balmy mist are u sad? https://x.com/koltregaskes/status/1944376468504555615

why are you sending this

#

its still gonna be available on chatgpt.com

#

its getting removed from the API

dim pine Jul 14, 2025, 1:55 AM

#

olive mesa

haha so funny with grok :kekblue_wiggle:

surreal creek Jul 14, 2025, 2:20 AM

#

broke Gemma 3 asking how an Autopen works 🤣🤣

balmy mist Jul 14, 2025, 3:43 AM

#

sturdy mica why are you sending this

mb g

balmy mist Jul 14, 2025, 4:05 AM

#

srry for ur loss

quartz light Jul 14, 2025, 6:19 AM

#

quartz light https://www.kimi.com/preview/19805f78-a5b1-8215-8550-da8e210005da

yall

#

wtf

#

they removed my whole convo

#

good thing i shared the final result

#

nvm

#

it renames the chat convos to whatever you sent last

#

thats weird

quartz light Jul 14, 2025, 6:36 AM

#

why