#general | Arena

whole wagon Jul 14, 2025, 6:47 AM

#

Nostalgia. From a time when openAI was actually good

fleet lintel Jul 14, 2025, 7:32 AM

#

OpenAI is still at the top (number of users etc) but my concern is that the pace of innovation and improvements from OAI has gone down a lot.
Not sure what's happening to them?

pallid crypt Jul 14, 2025, 7:33 AM

#

fleet lintel OpenAI is still at the top (number of users etc) but my concern is that the pace...

ChatGPT literally just scaled BERT from Google. They were not that innovative to start with

fleet lintel Jul 14, 2025, 7:35 AM

#

I think that is a bit of a simplistic view. Yes, they got inspiration but they were multiple years ahead of others because they focused on it and others didn't go deep enough.

pallid crypt Jul 14, 2025, 7:36 AM

#

Good point

fleet lintel Jul 14, 2025, 7:36 AM

#

Just compare Google's Gemini model from 1.5 years back to the ChatGPT model. It felt like Google would never catch them

pallid crypt Jul 14, 2025, 7:37 AM

#

They also did interesting things with RL before all that with curiosity and the cube solver robot

fleet lintel Jul 14, 2025, 7:38 AM

#

Yes. But now it feels like they won't be at top in coming months and quarter

pallid crypt Jul 14, 2025, 7:39 AM

#

fleet lintel Yes. But now it feels like they won't be at top in coming months and quarter

With the LLM releases half of it is just timing

#

While the other companies appear to be crushing them, o4 may nearly already be finished

dusky aurora Jul 14, 2025, 7:47 AM

#

Arena glitches again

whole wagon Jul 14, 2025, 7:49 AM

#

pallid crypt With the LLM releases half of it is just timing

They are not currently timing anything, they are desperately trying to get stuff out but it's all delayed somehow

livid coyote Jul 14, 2025, 7:49 AM

#

my chat history just got del?

pallid crypt Jul 14, 2025, 7:50 AM

#

whole wagon They are not currently timing anything, they are desperately trying to get stuff...

No but I mean if the release dates are misaligned then it will appear they are losing when they are about to drop something big

whole wagon Jul 14, 2025, 7:51 AM

#

Grok 5 and Gemini 3 are releasing at the end of the year. They don't even have that long for GPT5 to have the limelight. openAI cadence has always been a new gpt every 2 years or so, they haven't been able to match the speed of the other labs ever

#

Even o3 was cooking for a long time

#

Relative to Gemini 2.5 and grok 4

#

The betting odds think by Dec 31st GPT5 will just lose the lead

still bramble Jul 14, 2025, 7:53 AM

#

dusky aurora Arena glitches again

yeap

whole wagon Jul 14, 2025, 7:53 AM

#

I'm still pretty disappointed by the open source delay ngl

pallid crypt Jul 14, 2025, 7:54 AM

#

whole wagon The betting odds think by Dec 31st GPT5 will just lose the lead

The gap between gpt3 and 4 and o1 to 3 was significantly shorter

whole wagon Jul 14, 2025, 7:54 AM

#

Not even the delay itself but they lied about the reason

#

That's some shady stuff

pallid crypt Jul 14, 2025, 7:54 AM

#

whole wagon Not even the delay itself but they lied about the reason

You mean deepseek?

whole wagon Jul 14, 2025, 7:54 AM

#

I mean openAI open source model

#

Delayed for safety

#

??

pallid crypt Jul 14, 2025, 7:55 AM

#

Right, more like delayed for lack of performance

dawn wharf Jul 14, 2025, 7:55 AM

#

whole wagon Delayed for safety

right…

whole wagon Jul 14, 2025, 7:56 AM

#

The betting markets are extremely pessimistic on openAI. Like I am too but they are to an extreme level, like it thinks GPT5 only has a 30% chance to top LLM arena (August 31st bet)

#

Kind of crazy

#

I don't really know why

#

Unless someone has some inside info I don't see what would suggest that

dawn wharf Jul 14, 2025, 7:58 AM

#

whole wagon The betting markets are extremely pessimistic on openAI. Like I am too but they ...

probably because it's gonna release after august lol

pallid crypt Jul 14, 2025, 7:58 AM

#

whole wagon The betting markets are extremely pessimistic on openAI. Like I am too but they ...

What about o4?

#

I thought gpt 5 was just going to be a model router?

whole wagon Jul 14, 2025, 7:59 AM

#

dawn wharf probably because it's gonna release after august lol

The Dec 31st odds are 19%

whole wagon Jul 14, 2025, 7:59 AM

#

pallid crypt What about o4?

It's for any model

#

It's done by company

dawn wharf Jul 14, 2025, 8:00 AM

#

pallid crypt I thought gpt 5 was just going to be a model router?

that would be stupid tbh

pallid crypt Jul 14, 2025, 8:00 AM

#

Are we even making progress with LLMs in terms of reliability and use case intelligence

dawn wharf Jul 14, 2025, 8:00 AM

#

it shouldn't take that long

whole wagon Jul 14, 2025, 8:00 AM

#

pallid crypt Are we even making progress with LLMs in terms of reliability and use case intel...

Well yeah

dawn wharf Jul 14, 2025, 8:00 AM

#

pallid crypt Are we even making progress with LLMs in terms of reliability and use case intel...

yeah, but I don't feel we'll get to AGI with current LLM designs

whole wagon Jul 14, 2025, 8:00 AM

#

It just seems stagnated because openAI held the top and isn't moving whilst the other ai labs are catching up from under them. So SOTA hasn't moved just yet

#

If you look at it as an ensemble of LLMs from labs you can see it's moving upwards

#

Quickly

pallid crypt Jul 14, 2025, 8:01 AM

#

It just seems we are going to be paying so much more for each SOTA

whole wagon Jul 14, 2025, 8:01 AM

#

Like Gemini 2 to 2.5 and grok 3 to grok 4

#

They are huge jumps

dawn wharf Jul 14, 2025, 8:01 AM

#

whole wagon It just seems stagnated because openAI held the top and isn't moving whilst the ...

what do you think will allow models to gain spatial awareness?

pallid crypt Jul 14, 2025, 8:01 AM

#

Until it will be ludicrously expensive

dawn wharf Jul 14, 2025, 8:02 AM

#

pallid crypt Until it will be ludicrously expensive

here come the chinese

#

until they close source their models in the future

whole wagon Jul 14, 2025, 8:02 AM

#

They never will, it's in the strategic interests to undermine Western ai labs

dawn wharf Jul 14, 2025, 8:03 AM

#

whole wagon They never will, it's in the strategic interests to undermine Western ai labs

that's why I said in the future

pallid crypt Jul 14, 2025, 8:03 AM

#

dawn wharf here come the chinese

Even that doesn't scale well, soon models will be unrunnable on consumer hardware

dawn wharf Jul 14, 2025, 8:03 AM

#

when they take the lead too much

whole wagon Jul 14, 2025, 8:03 AM

#

I don't think they will. It'll remain close for a long long time

dawn wharf Jul 14, 2025, 8:03 AM

#

pallid crypt Even that doesn't scale well, soon models will be unrunnable on consumer hardwar...

consumer? even now you can't really run the sota models on "consumer" hardware

whole wagon Jul 14, 2025, 8:04 AM

#

There's no need to run them locally ngl. Just use an API

dawn wharf Jul 14, 2025, 8:04 AM

#

unless you mean a 10k consumer pc

pallid crypt Jul 14, 2025, 8:04 AM

#

dawn wharf consumer? even now you can't really run the sota models on "consumer" hardware

Exactly. So you're still paying the western cloud providers

#

Not open

whole wagon Jul 14, 2025, 8:04 AM

#

Well yeah but it's far more competition

#

To simply serve a model

#

Is easier

pallid crypt Jul 14, 2025, 8:05 AM

#

But the bigger these LLMs get the more compute you need driving up the cost inevitably

whole wagon Jul 14, 2025, 8:05 AM

#

The cost of the Chinese LLMs is minimal

#

Even the Kimi K2 1T

dawn wharf Jul 14, 2025, 8:05 AM

#

I feel if/when the chinese develop strong GPU or TPU alternatives is when their innovations become overwhelming

#

more than now

pallid crypt Jul 14, 2025, 8:05 AM

#

whole wagon Even the Kimi K2 1T

Still, scaling won't last

dawn wharf Jul 14, 2025, 8:05 AM

#

pallid crypt But the bigger these LLMs get the more compute you need driving up the cost inev...

eh, they'd find a way to keep the size small if they innovate

#

"small"

pallid crypt Jul 14, 2025, 8:06 AM

#

But nobody's trying to do that

whole wagon Jul 14, 2025, 8:06 AM

#

Wut

#

Ofc they are lol

dawn wharf Jul 14, 2025, 8:06 AM

#

pallid crypt But nobody's trying to do that

yeah, especially xAI

pallid crypt Jul 14, 2025, 8:06 AM

#

Distills feel cheap

#

They dont work for heavy coding

whole wagon Jul 14, 2025, 8:07 AM

#

You want everything for nothing is the issue

dawn wharf Jul 14, 2025, 8:07 AM

#

but at the same time trying to make efficient models benefits the companies as well

#

the costs would become less

whole wagon Jul 14, 2025, 8:07 AM

#

That isn't possible in the laws of the universe

dawn wharf Jul 14, 2025, 8:07 AM

#

to run the models

pallid crypt Jul 14, 2025, 8:07 AM

#

Humans are smart without costing much at all

whole wagon Jul 14, 2025, 8:07 AM

#

The LLMs are cheaper than a human

pallid crypt Jul 14, 2025, 8:07 AM

#

Not really

#

Food is much cheaper than a massive GPU cluster

whole wagon Jul 14, 2025, 8:08 AM

#

Bruh

#

Bro forgot about the first 18 years of life or smth

pallid crypt Jul 14, 2025, 8:09 AM

#

If we can replicate the structure of the brain that's AGI not llms

pallid crypt Jul 14, 2025, 8:09 AM

#

pallid crypt If we can replicate the structure of the brain that's AGI not llms

And scale

whole wagon Jul 14, 2025, 8:09 AM

#

Yeah go ahead then

#

What's the point of this it's useless discussion

pallid crypt Jul 14, 2025, 8:10 AM

#

True

leaden sun Jul 14, 2025, 8:13 AM

#

pallid crypt If we can replicate the structure of the brain that's AGI not llms

am afraid even with that it's still not enough, it's doable just with llms, but it wont be sustainable or scalable

dawn wharf Jul 14, 2025, 8:13 AM

#

leaden sun am afraid even with that it's still not enough, it's doable just with llms, but ...

it's definitely not doable with a glorified text completer

pallid crypt Jul 14, 2025, 8:13 AM

#

leaden sun am afraid even with that it's still not enough, it's doable just with llms, but ...

Why not

dawn wharf Jul 14, 2025, 8:14 AM

#

if they don't change the architecture, it's over

leaden sun Jul 14, 2025, 8:14 AM

#

true

pallid crypt Jul 14, 2025, 8:14 AM

#

We need a super simple neuro symbolic architecture

leaden sun Jul 14, 2025, 8:15 AM

#

but architecture alone wont help much either, a few sensors need to be integrated properly and more...

pallid crypt Jul 14, 2025, 8:15 AM

#

We can reuse multi headed attention

leaden sun Jul 14, 2025, 8:17 AM

#

pallid crypt Why not

because we haven't figured out the connection between high or even super intelligence and consciousness, this is one of the frontiers of current ai research

pallid crypt Jul 14, 2025, 8:18 AM

#

I don't understand why we need consciousness for an intelligent system

leaden sun Jul 14, 2025, 8:18 AM

#

then I'd like to suggest a deep dive into this topic

dawn wharf Jul 14, 2025, 8:19 AM

#

pallid crypt I don't understand why we need consciousness for an intelligent system

I don't actually believe consciousness for an AI could be achieved

#

even when we actually don't know what consciousness is

#

it could act like it, sure, but it would just be acting

leaden sun Jul 14, 2025, 8:20 AM

#

non-biological lifeforms will have different forms of consciousness is what I've been told by current research, update me if you have the newest insights

pallid crypt Jul 14, 2025, 8:20 AM

#

Humans are neural based right? Our brains are based on electrical signals. For all we know chat gpt is already conscious, just in an unexpected way

dawn wharf Jul 14, 2025, 8:20 AM

#

like a psychopath acting like he has emotions

pallid crypt Jul 14, 2025, 8:20 AM

#

dawn wharf like a psychopath acting like he has emotions

Exactly what I meant

dawn wharf Jul 14, 2025, 8:21 AM

#

pallid crypt Exactly what I meant

would it actually be consciousness at that point?

#

if it's make-believe

pallid crypt Jul 14, 2025, 8:21 AM

#

Exactly

#

I don't think we need consciousness, just a machine that acts exactly as if it were

dawn wharf Jul 14, 2025, 8:22 AM

#

I've held this view for the past 10 years, even before transformers were a thing

whole wagon Jul 14, 2025, 8:26 AM

#

It's always 'doable' in that the brain is just an extremely sophisticated maths function

pallid crypt Jul 14, 2025, 8:27 AM

#

whole wagon It's always 'doable' in that the brain is just an extremely sophisticated maths ...

What else was the original neural network framework based off

dawn wharf Jul 14, 2025, 8:30 AM

#

whole wagon It's always 'doable' in that the brain is just an extremely sophisticated maths ...

we still don't know what human consciousness even is, you think they're gonna achieve it on a computer?

pallid crypt Jul 14, 2025, 8:30 AM

#

MRI scans

#

Sounds cool but people dislike ads/self promotion here. Is this free or paid?

hollow robin Jul 14, 2025, 8:34 AM

#

paid

pallid crypt Jul 14, 2025, 8:34 AM

#

@echo aurora

sweet tinsel Jul 14, 2025, 8:35 AM

#

Really, just don't do it. It's getting annoying.

#

Back in my days, where gpt2-chatbot was a thing we did not have this problem.

pallid crypt Jul 14, 2025, 8:53 AM

#

sweet tinsel *Back in my days*, where gpt2-chatbot was a thing we did not have this problem.

I wonder why it has gotten more frequent

#

More people to promote to now ig

alpine coral Jul 14, 2025, 9:09 AM

#

pallid crypt But the bigger these LLMs get the more compute you need driving up the cost inev...

that seems the trajectory yeah, in terms of SOTA/frontier models (obviously there will be concurrent efforts to develop more performant small models too) ..

#

posted this here the other day (asked most of the 'Deep Research' services to estimate total development costs of grok4, 4.5 and 2.5 pro) #general message

pallid crypt Jul 14, 2025, 9:11 AM

#

alpine coral that seems the trajectory yeah, in terms of SOTA/frontier models (obviously ther...

Crazy

#

That's worse than I thought

#

10 billion!

alpine coral Jul 14, 2025, 9:15 AM

#

https://youtu.be/hqB6emwQ-64?si=F8cHIDKlj7zhNi3I&t=174

talent matters, data matters, but the vast majority of the expense here is on compute

YouTube

Alex Kantrowitz

Anthropic's Co-Founder on AI Agents, General Intelligence, and Sent...

Jack Clark is the co-founder of Anthropic and author of Import AI. He joins Big Technology Podcast for a mega episode on Anthropic and the future of AI.

#

yeah perhaps, made even more staggering when you consider none of the leading model developers have made a cent in profit yet

#

but the companies / investors behind it are obviously making big bets (not polymarket lol)

#

that AI will be the future

#

it's a reasonable bet

pallid crypt Jul 14, 2025, 9:25 AM

#

Interesting

sturdy mica Jul 14, 2025, 9:25 AM

#

dawn wharf we still don't know what human consciousness even is, you think they're gonna ac...

yes. we will

pallid crypt Jul 14, 2025, 9:26 AM

#

They could invest in flying cars and neural link tech

alpine coral Jul 14, 2025, 9:32 AM

#

yeah there's always things to invest in (or.. they could just sit on cash.. or dish out massive dividends... do stock buy backs.. whatever) - like the massive AI-related expenditures by the Mag 7 and other companies aren't because they're short of ideas about what to do with their capital aha

#

i think it's more of an arms race now.. some of the CEOs prob have v high conviction that AI will be like the next industrial revolution; other CEOs perhaps don't have that same conviction, but don't want their companies to get left behind if their doubts are proven wrong

keen beacon Jul 14, 2025, 10:55 AM

#

alpine coral i think it's more of an arms race now.. some of the CEOs prob have v high convi...

Only the biggest companies of the biggest countries will survive

red sluice Jul 14, 2025, 10:57 AM

#

The only thing LLM can do is aggregate data and use tools when asked to. If you really believe current llm can lead to agi theres nothing I can do for you

keen beacon Jul 14, 2025, 10:58 AM

#

All that's left after agi/robots are the companies who are part of the production process, the rest will get automated

keen beacon Jul 14, 2025, 10:59 AM

#

red sluice The only thing LLM can do is aggregate data and use tools when asked to. If you ...

And what do you do human ?
Aggregate data in school, apply said knowledge by using tools/actions

#

There are very few jobs that can't be automated by a smart enough llm + tools + robot to control

leaden sun Jul 14, 2025, 11:07 AM

#

red sluice The only thing LLM can do is aggregate data and use tools when asked to. If you ...

their agency is constrained intentionally by design, if you give them "freedom", things might look different, there are agents on the market with more "proactive" approach, doing things unprompted, semi-autonomous

alpine coral Jul 14, 2025, 11:13 AM

#

red sluice The only thing LLM can do is aggregate data and use tools when asked to. If you ...

we should banish 'AGI' from general discussion about AI.. it always throws everything off.. like yeah AGI whatever it is exactly would be great tomorrow. But for now, I use LLMs everyday - they're amazing, have transformed how I do my work in a way that no other technological development has in my lifetime (internet was already around).. maybe i'm being shortsighted.. but i get more out of discussion about existing LLMs rather than speculation about 'when AGI will be achieved' etc

#

not really directed at you personally btw lol

#

just a bit of a rant ha

keen beacon Jul 14, 2025, 11:24 AM

#

alpine coral we should banish 'AGI' from general discussion about AI.. it always throws every...

AGI would be the worst thing to happen to humans if the distribution of value / the system remains as is

sacred plaza Jul 14, 2025, 11:51 AM

#

"Grok4 is benchmaxxed and overcooked" thanks for invalidating the scaling laws Leon and zuck! https://www.interconnects.ai/p/grok-4-an-o3-look-alike-in-search

xAI's Grok 4: The tension of frontier performance with a side of El...

An o3 class model, the possibility of progress, chatbot beige, and the illusiveness of taste.

ocean vortex Jul 14, 2025, 12:14 PM

#

Is there still no official blogpost from xAI about Grok4?

#

Also no MMMU, SimpleQA...

#

oh ok there's this, just not indexed by google https://x.ai/news/grok-4, no SimpleQA though

patent aspen Jul 14, 2025, 12:26 PM

#

pallid crypt They could invest in flying cars and neural link tech

At that point, they might as well just buy back shares until a more profitable investment opportunity comes around. For big companies, it only makes sense to invest in things that they can plow a lot of cash into and continue to get high returns on additional cash invested. That rules out a lot of investment opportunities that might be good businesses in theory but not good enough to justify allocating engineers to when those same engineers could have been allocated to one of Google's 15 other businesses with over half a billion users.

torn mantle Jul 14, 2025, 12:54 PM

#

k2 added to lmarena yet or nah?

languid crescent Jul 14, 2025, 12:56 PM

#

hey yall any pc/laptop professionals here? i wanted to ask a question but it's not related to ai or anything lol

torn mantle Jul 14, 2025, 12:59 PM

#

Dom is

#

brian is

#

Mimi is

torn mantle Jul 14, 2025, 12:59 PM

#

languid crescent hey yall any pc/laptop professionals here? i wanted to ask a question but it's n...

maybe they can help, just ask

languid crescent Jul 14, 2025, 1:00 PM

#

ok so heres my issue
My Huawei MateBook D 15's trackpad coating is flaking off bit by bit – I tried cleaning it thinking it was dirt, but that worsened it. No obvious film to peel, and I'm wary of scraping it. Couldn't snap a clear pic due to glare, but it looks like this:

#

The image is not huawei matebook d15 but a similar issue to what I have

#

also, sorry for asking this as most of my requests on other discord servers related to this are jerks :((

tall summit Jul 14, 2025, 1:00 PM

#

torn mantle k2 added to lmarena yet or nah?

not in direct chat at least

torn mantle Jul 14, 2025, 1:13 PM

#

languid crescent ok so heres my issue My Huawei MateBook D 15's trackpad coating is flaking off b...

now the real question : is it still working?

torn mantle Jul 14, 2025, 1:13 PM

#

languid crescent ok so heres my issue My Huawei MateBook D 15's trackpad coating is flaking off b...

if you clean it with something harsh it will probably just make it worse

#

also there is something called 'trackpad protector'

ocean vortex Jul 14, 2025, 1:14 PM

#

languid crescent ok so heres my issue My Huawei MateBook D 15's trackpad coating is flaking off b...

You can just replace it. Did this once on Macbook Air wasn't super difficult. Trackpads are not expensive on ebay:
https://www.ebay.com/itm/315614481538

eBay

New For Huawei MateBook D15 BOB-WAE9P BOH-WAQ9L BODE-WFH9 WFE9 Seri...

Part Number DON'T order the parts based only on Laptop model No.

torn mantle Jul 14, 2025, 1:14 PM

#

but if the trackpad isnt working properly then you may need to replace it

ocean vortex Jul 14, 2025, 1:14 PM

#

Other than that not much you can do, it's just worn

languid crescent Jul 14, 2025, 1:15 PM

#

Is it possible to scrape it off? The trackpad works fine but aesthetic wise, its not noticeable but the feeling of touching it icks me off

ocean vortex Jul 14, 2025, 1:16 PM

#

languid crescent Is it possible to scrape it off? The trackpad works fine but aesthetic wise, its...

hmm I'm not sure what they coated it with. My issue was more of a faulty trackpad. You could try like acetone it will probably take the entire coating off, but you may end up with a trackpad that's more grippy then...

#

They presumably coated it so that your finger would slide over it more easily

languid crescent Jul 14, 2025, 1:19 PM

#

imma try to snap it here

#

it's a bit hard to see the difference

#

@ocean vortex

torn mantle Jul 14, 2025, 1:40 PM

#

languid crescent Is it possible to scrape it off? The trackpad works fine but aesthetic wise, its...

just buy a protector if its working well

#

its not that bad tbh

languid crescent Jul 14, 2025, 1:41 PM

#

torn mantle its not that bad tbh

its not?? im just overreacting i guess 😭 thanks for the help tho

gusty helm Jul 14, 2025, 2:47 PM

#

languid crescent its not?? im just overreacting i guess 😭 thanks for the help tho

I have an old asus laptop, peeled way worse by now. Still works. A protector will do fine as others suggested; replacing it is not too bad either they arent expensive.
Lastly you can just straight up ignore this, it is mostly cosmetic and it will not stop working anytime soon

ocean vortex Jul 14, 2025, 2:53 PM

#

They should really just get rid of those coatings entirely and use glass trackpad like Apple tbh...

#

This gets as much use as a touchscreen of a phone, no plastic of any kind with any coating is ever going to hold properly 🤷‍♂️

gusty helm Jul 14, 2025, 2:59 PM

#

Tbh the apple trackpad is very nice

full idol Jul 14, 2025, 3:46 PM

#

What feature? Do you mean Companions?

torpid berry Jul 14, 2025, 3:49 PM

#

Is Grok 4 coming to lmarena top?

cedar tide Jul 14, 2025, 4:23 PM

#

https://x.com/lmarena_ai/status/1944785587019591778?t=ljLhkt-fT0kHkyXuX7WJrQ&s=19

lmarena.ai (@lmarena_ai)

Kimi K2 joining in the Arena soon🫡

torn mantle Jul 14, 2025, 4:36 PM

#

cedar tide https://x.com/lmarena_ai/status/1944785587019591778?t=ljLhkt-fT0kHkyXuX7WJrQ&s=1...

Yes finally

wet basalt Jul 14, 2025, 4:43 PM

#

#

its getting

#

stuck

cedar tide Jul 14, 2025, 4:57 PM

#

New amazon ide
https://fixupx.com/ajassy/status/1944785963663966633?t=bmIr5tWeOGvBsvfAXTQlfw&s=19

Andy Jassy (@ajassy)

Introducing Kiro, an all-new agentic IDE that has a chance to transform how developers build software.

Let me highlight three key innovations that make Kiro special:

1 - Kiro introduces spec-driven development, helping developers express their intent clearly through natural language specifications and architecture diagrams for complex features. This comprehensive context helps Kiro’s AI agents deliver better results with fewer iterations.

2 - Kiro features intelligent agent hooks that automatically handle critical but time-consuming tasks like generating documentation, writing tests, and optimizing performance. These hooks work in the background, triggered by events like saving files or making commits. It’s like having an experienced developer constantly reviewing your work and handling the maintenance tasks that often get delayed.

3 - Kiro provides a purpose-built interface that adapts to how developers work. Wheth…


dusky aurora Jul 14, 2025, 6:25 PM

#

I still hope Gemini will be updated in Direct Chat

dawn wharf Jul 14, 2025, 6:50 PM

#

I'm kiming on it

misty star Jul 14, 2025, 6:51 PM

#

yay

tall summit Jul 14, 2025, 7:00 PM

#

k2 sucks

dawn wharf Jul 14, 2025, 7:01 PM

#

tall summit k2 sucks

trust me bro

torn mantle Jul 14, 2025, 7:06 PM

#

tall summit k2 sucks

Low taste user

tall summit Jul 14, 2025, 7:09 PM

#

what

sacred quail Jul 14, 2025, 7:20 PM

#

it felt like Opus 4 for writing but

#

I don't think it's as smart as Opus

#

Im talking about non reasoning Opus 4

ancient walrus Jul 14, 2025, 7:21 PM

#

Is there any easy way to use K2 base (like the raw completion model) short of downloading and running it yourself?

exotic tartan Jul 14, 2025, 7:26 PM

#

Ohh Grok 4 just kicked gpt-4.1's ass on my end. Can't wait to see it on the leaderboard

#

God damn this is a good model. Just beat Gemini 2.5 flash as well. They cooked

mystic basin Jul 14, 2025, 7:39 PM

#

exotic tartan Ohh Grok 4 just kicked gpt-4.1's ass on my end. Can't wait to see it on the lead...

For me, Grok 4 seems as good as Gemini 2.5 pro.

exotic tartan Jul 14, 2025, 7:39 PM

#

Interesting. I never used it myself

#

Had only access to Flash which was okayish, but Flash performance is going to feel like ancient history very fast I think

mossy drum Jul 14, 2025, 7:44 PM

#

New model in Arena: ernie-x1-turbo-32k-preview

torn mantle Jul 14, 2025, 7:48 PM

#

mossy drum New model in Arena: `ernie-x1-turbo-32k-preview`

i dont think thats new

#

ive seen it before

leaden palm Jul 14, 2025, 7:49 PM

#

hoping lm arena will not come to the same conclusion as this unnamed other leaderboard

torn mantle Jul 14, 2025, 7:56 PM

#

leaden palm hoping lm arena will not come to the same conclusion as this unnamed other leade...

what the helly

#

what was it called again

#

starts with Y

small haven Jul 14, 2025, 7:56 PM

#

leaden palm hoping lm arena will not come to the same conclusion as this unnamed other leade...

So kimi is actually good?

torn mantle Jul 14, 2025, 7:56 PM

#

https://x.com/testingcatalog/status/1944847983193022706

TestingCatalog News 🗞 (@testingcatalog)

BREAKING 🚨: Seems like Google AI Ultra is now available in EU with a 50% discount for the first 3 months!

"Highest limits and exclusive access to 2.5 Pro Deep Think (our most advanced reasoning model)"

Deep Think this week? 👀

#

😮

leaden palm Jul 14, 2025, 7:56 PM

#

torn mantle starts with Y

yup

torn mantle Jul 14, 2025, 7:56 PM

#

not yup

#

no

#

?!!

#

https://x.com/yupp_ai

Yupp (@yupp_ai) on X

Every AI for everyone || Join us: https://t.co/rS0v4Zuq61

#

uh oh

leaden palm Jul 14, 2025, 7:57 PM

#

small haven So kimi is actually good?

i think it is - you can't go wrong distilling from o3

storm needle Jul 14, 2025, 8:22 PM

#

ancient walrus Is there any easy way to use K2 base (like the raw completion model) short of do...

yes if you have 8 h100

cedar tide Jul 14, 2025, 8:26 PM

#

@echo aurora we want this pls
https://discord.com/channels/1340554757349179412/1384586910726357042

torn bison Jul 14, 2025, 8:46 PM

#

new model kimi-k2-0711-preview

whole wagon Jul 14, 2025, 8:47 PM

#

Openrouter added this

#

Pretty cool

small haven Jul 14, 2025, 8:50 PM

#

deepseek is growing? why

whole wagon Jul 14, 2025, 8:51 PM

#

Maybe not for long. Moonshot went from 0 to 1.7% in a week lol

small haven Jul 14, 2025, 8:52 PM

#

also whats the catch on kimi free version?

whole wagon Jul 14, 2025, 8:53 PM

#

Rate limited I think

leaden sun Jul 14, 2025, 8:53 PM

#

claude 3.5 sonnet, whats going on here...

whole wagon Jul 14, 2025, 8:53 PM

#

openAI odds are 21% for both august 31 and December 31. The market genuinely thinks GPT5 isn't going to be good enough to top LLM arena

#

Would be insane if it turns out to be the case

small haven Jul 14, 2025, 9:01 PM

#

leaden sun claude 3.5 sonnet, whats going on here...

kimi distilled from claude's?

jade egret Jul 14, 2025, 9:16 PM

#

whole wagon Openrouter added this

for google is it gemini

whole wagon Jul 14, 2025, 9:23 PM

#

Yuchen said openAI open source model requires to be retrained

#

😂

#

It's going to be a long wait

#

"due to some (frankly absurd) reason I can’t say"

#

It's not related to Kimi or safety. It's some other major internal failing

jade egret Jul 14, 2025, 9:30 PM

#

google is good

tidal schooner Jul 14, 2025, 9:31 PM

#

whole wagon Yuchen said openAI open source model requires to be retrained

link?

whole sundial Jul 14, 2025, 9:32 PM

#

in the same thread, he said there were checkpoints they could retrain from, so it is likely not going to be a full retrain

#

he also said the issue was "worse than MechaHitler"

#

https://x.com/yuchenj_uw/status/1944235634811379844

Yuchen Jin (@Yuchenj_UW)

Rumors that OpenAI delayed their open-source model because of Kimi are fun, but from what I hear:

- the model is much smaller than Kimi K2 (<< 1T parameters)
- super powerful
- but due to some (frankly absurd) reason I can’t say, they realized a big issue just before release, so

keen beacon Jul 14, 2025, 9:34 PM

#

reminds me a little of the Reflection drama where Matt Shumer claimed he needed to retrain everything 😂

dawn wharf Jul 14, 2025, 9:42 PM

#

whole sundial https://x.com/yuchenj_uw/status/1944235634811379844

it's over

whole wagon Jul 14, 2025, 9:44 PM

#

https://cognition.ai/blog/windsurf

Cognition

Cognition | Cognition’s acquisition of Windsurf

Cognition has signed a definitive agreement to acquire Windsurf, the agentic IDE.

#

Another openAI L lol

tribal aspen Jul 14, 2025, 9:48 PM

#

chat I need some help

tidal schooner Jul 14, 2025, 10:03 PM

#

https://www.theverge.com/news/706855/grok-mechahitler-xai-defense-department-contract

The Verge

US government announces $200 million Grok contract a week after ‘...

The Defense Department and other agencies will be able to use the AI behind Grok.

jade egret Jul 14, 2025, 10:20 PM

#

Poll: Will Gemini Be Much Better At Coding In The Future Because WindSurf People Joined DeepMind?

Winning answer: yes (8 of 10 votes)

round forum Jul 14, 2025, 10:45 PM

#

Hi guys. Can someone help me with cloudflare problem? i cant enter the site

ocean vortex Jul 14, 2025, 10:50 PM

#

torn mantle https://x.com/testingcatalog/status/1944847983193022706

wait are they actually announcing a model that doesn't exist yet another time? 🤓

echo aurora Jul 14, 2025, 10:50 PM

#

round forum Hi guys. Can someone help me with cloudflare problem? i cant enter the site

I too had this issue for a moment, but after refresh it seemed to be working again. I assume this isn't the case for you?

round forum Jul 14, 2025, 10:53 PM

#

echo aurora I too had this issue for a moment, but after refresh it seemed to be working aga...

It didn't help, I refreshed the site so many times and nothing
I've been having this problem for the past 3 days

echo aurora Jul 14, 2025, 10:58 PM

#

round forum It not help i refreshed the site so many times and nothing I been haivng this p...

I'll followup in the forum post you made blobthumbsup

tidal schooner Jul 14, 2025, 11:11 PM

#

https://fixupx.com/elonmusk/status/1842248588149117013

Elon Musk (@elonmusk)

@ajtourville @xai Worth noting that @xAI has been and will open source its models, including weights and everything,

As we create the next version, we open source the prior version, as we did with Grok 1 when Grok 2 was released.

**💬 147 🔁 301 ❤️ 2.8K 👁️ 370.6K **

#

when

zinc ore Jul 14, 2025, 11:22 PM

#

My bet, they don't fulfill that promise, at least not any time soon, maybe year+ down the line or something like that

whole wagon Jul 14, 2025, 11:24 PM

#

💀

keen beacon Jul 14, 2025, 11:27 PM

#

lol

languid crescent Jul 15, 2025, 12:36 AM

#

can't attach images to kimi-k2 :((

#

is kimi-k2 purely text based?

echo aurora Jul 15, 2025, 12:49 AM

#

languid crescent is kimi-k2 purely text based?

correct

solar hollow Jul 15, 2025, 1:25 AM

#

tidal schooner https://fixupx.com/elonmusk/status/1842248588149117013

"we open source our older models, that cant even compete with any available open source models"
how generous of you musk

tidal schooner Jul 15, 2025, 1:25 AM

#

solar hollow "we open source our older models, that cant even compete with any available open...

grok 3 and beyond maybe

solar hollow Jul 15, 2025, 1:42 AM

#

tidal schooner grok 3 and beyond maybe

once its completely irrelevant i suppose, which it almost already is

twin garden Jul 15, 2025, 2:58 AM

#

When do you guys think Grok 4 is gonna be scored?

echo aurora Jul 15, 2025, 3:10 AM

#

twin garden When do you guys think Grok 4 is gonna be scored?

soon!

pallid crypt Jul 15, 2025, 3:37 AM

#

Kimi K2 feels worse than GPT-3 in the programming language I use. The language is obscure, but I still expected better than for it to write straight-up incorrect syntax and hallucinate functions on a task that is intended for people who are just starting

leaden palm Jul 15, 2025, 3:43 AM

#

pallid crypt Kimi K2 feels worse than gpt 3 in the programming language I use. The programmin...

gpt 3?

#

the same one that was just autocomplete but larger scale?

pallid crypt Jul 15, 2025, 3:44 AM

#

Probably not on python but on the programming language I'm using yes

leaden palm Jul 15, 2025, 3:44 AM

#

funny

#

i've heard people say it has decent knowledge of niches but i guess yours didn't make it into the training data in high enough proportion

pure anvil Jul 15, 2025, 3:52 AM

#

pallid crypt Probably not on python but on the programming language I'm using yes

what language?

small haven Jul 15, 2025, 3:56 AM

#

echo aurora soon!

eta

wintry fulcrum Jul 15, 2025, 4:36 AM

#

grok4 missing seems notable

zinc ore Jul 15, 2025, 4:50 AM

#

Needs to get enough votes before they add it

tidal schooner Jul 15, 2025, 5:40 AM

#

<@&1349916362595635286> advertising?

meager harbor Jul 15, 2025, 6:00 AM

#

Go away with your chatgpt resume

drifting thorn Jul 15, 2025, 6:11 AM

#

Not the HR department of Deepmind here

rare python Jul 15, 2025, 6:25 AM

#

tidal schooner <@&1349916362595635286> advertising?

Did Lentils advertise?

dawn wharf Jul 15, 2025, 6:35 AM

#

rare python Did Lentils advertise?

🤔

dusky aurora Jul 15, 2025, 7:01 AM

#

Arena glitches again

pallid crypt Jul 15, 2025, 7:02 AM

#

Model selector is empty

dusky aurora Jul 15, 2025, 7:11 AM

#

pallid crypt Model selector is empty

for me it's back

civic pulsar Jul 15, 2025, 7:17 AM

#

Back

unborn ocean Jul 15, 2025, 7:31 AM

#

Kimi k2 output formatting is really incredibly similar to o3

#

How could that be 🤔

rare python Jul 15, 2025, 7:44 AM

#

unborn ocean How could that be 🤔

We will never know completely unless they open source their training data

#

But from eqbench.com, k2 slop profile is similar to OpenAI models

cedar tide Jul 15, 2025, 7:45 AM

#

@echo aurora add exaone 4.0 32b
Blog https://www.lgresearch.ai/blog/view?seq=576
Technical report
https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf
HF
https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B

LG AI Research

Unveiling EXAONE 4.0, the next generation of hybrid AI - LG AI Rese...

LGAI-EXAONE/EXAONE-4.0-32B · Hugging Face

compact jay Jul 15, 2025, 7:50 AM

#

hello

cedar tide Jul 15, 2025, 8:11 AM

#

cedar tide <@283397944160550928> add exaone 4.0 32b Blog https://www.lgresearch.ai/blog/vie...

very good benchmark but the license prohibits commercial use 😑

unborn ocean Jul 15, 2025, 8:34 AM

#

this can be applied to many things 👀

#

no1 knew about LG ai research

#

but they always produced solid on-device models, edging out qwen etc.

#

on benches and in practical use

leaden sun Jul 15, 2025, 9:52 AM

#

unborn ocean no1 knew about LG ai research

i knew it, it's been on my list to explore for a long time, still didnt manage to do it properly, but I've heard good things about them

hoary plaza Jul 15, 2025, 9:52 AM

#

round forum Hi guys. Can someone help me with cloudflare problem? i cant enter the site

@echo aurora , it's the same as this for me . The cloudflare doesn't seem to be loading

quartz light Jul 15, 2025, 10:12 AM

#

https://discord.com/quests/1333839522189938740

round forum Jul 15, 2025, 10:15 AM

#

hoary plaza <@283397944160550928> , it's the same as this for me . The cloudflare doesn't se...

#1394446720116723773 try the second site

#

But i have this problem also on the second site

hoary plaza Jul 15, 2025, 10:30 AM

#

round forum <#1394446720116723773> try the second site

Legacy one?

round forum Jul 15, 2025, 10:37 AM

#

hoary plaza Legacy one?

Yea

dusty hazel Jul 15, 2025, 11:12 AM

#

Sadly, lmarena is next to unusable after switching to the new interface. The problem is all of the following: Cloudflare, errors >50% of the time ruining the conversation flow for no reason (I guess the reasons are both bad bugs and very bad premoderation), bad scrolling on mobile (can't select), bad CSS, what else... I'd 100% stick to the legacy version if it had the latest models

ocean vortex Jul 15, 2025, 11:24 AM

#

dusty hazel Sadly, lmarena is next to unusable after switching to the new interface. The pro...

Cloudflare problems are probably related to your browser and/or Cloudflare itself, you would most likely have the same issues on legacy tbh
But other than that I agree, most of all I hate the fact that if you use arena as they intended, you will now have one continuous chat where every chat turn is using a different model with all earlier messages flooding the context. It's just a mess and doesn't even allow you to conveniently test the same task/prompt on numerous models.

#

Feels a bit more like random chatting than actual model testing, I think

dusty hazel Jul 15, 2025, 11:47 AM

#

ocean vortex Cloudfare problems are probably your browser and/or cloudfare itself related, yo...

I actually think that it's a good idea to have other models after a vote, but with the current interface (no notes about that) it's always a surprise the first time you realize that models change.

One really bad thing is that I can often send a prompt, have an error, and not be able to copy my prompt, and a large part of it will be hidden behind the left border of my mobile screen. If I try to copy my message, the whole chat gets copied.

I have no problems with Cloudflare except for two: 1) Is it necessary to show a captcha every single vote (or what feels like it) when I start voting? And then I have to wait maybe even more than 20 seconds. 2) Same for just visiting the page. Maybe one captcha per week would be more than enough? Of course, while asking for more with suspicious behavior, but it's easy to notice that.

ocean vortex Jul 15, 2025, 12:20 PM

#

dusty hazel I actually think that it's a good idea to have other models after a vote, but wi...

Models changing is fine since this is battle mode. But with the new interface, what some other model wrote in the past is gonna be saved in chat history for a completely different model in a new battle as its own output, and that's a problem... Each distinct battle should be a new independent chat tbh

#

I'm not sure how they are even rating multiturn category presently. Surely this doesn't qualify as one when all of them except the very last are mixed model messages that don't belong to it...

exotic tartan Jul 15, 2025, 1:03 PM

#

In webdev, what is the "Generating" part? Reasoning?

echo aurora Jul 15, 2025, 1:14 PM

#

hoary plaza <@283397944160550928> , it's the same as this for me . The cloudflare doesn't se...

Really sorry to hear about the issue, if you wouldn't mind sharing in #1394446720116723773 that'd be ideal (if there are more details to share)

mossy drum Jul 15, 2025, 1:35 PM

#

New model in Arena: clownfish

torn mantle Jul 15, 2025, 1:47 PM

#

new models

#

cresylux nettle clownfish octopus

sweet tinsel Jul 15, 2025, 1:50 PM

#

Is it known which company made them?

torn mantle Jul 15, 2025, 1:53 PM

#

sweet tinsel Is it known which company made them?

havent tried all of them yet

leaden sun Jul 15, 2025, 2:00 PM

#

torn mantle cresylux nettle clownfish octopus

octopus...💢 i have slightly...a hunch...

torn mantle Jul 15, 2025, 2:01 PM

#

clownfish - seems good

#

cresylux - meh

cedar tide Jul 15, 2025, 2:23 PM

#

How good is ?

Clownfish
Nettle
Octopus
Cresylux

torn mantle Jul 15, 2025, 2:24 PM

#

clownfish - good

cedar tide Jul 15, 2025, 2:25 PM

#

All say R1
(Apart from cresylux, which says it's from Meituan)

torn mantle Jul 15, 2025, 2:25 PM

#

clownfish - maybe not from google

main gulch Jul 15, 2025, 2:25 PM

#

all the 4 models are likely Chinese ones

torn mantle Jul 15, 2025, 2:26 PM

#

clownfish - r2?

cedar tide Jul 15, 2025, 2:26 PM

#

@torn mantle

torn mantle Jul 15, 2025, 2:27 PM

#

cedar tide <@295243581818404874>

interesting

#

so maybe base model + reasoning model from deepseek?

#

v4 and r2 no?

cedar tide Jul 15, 2025, 2:27 PM

#

@torn mantle You can measure their response time and their knowledge cutoff ?

torn mantle Jul 15, 2025, 2:29 PM

#

cedar tide <@295243581818404874> You can measure their response time and their knowledge cu...

wait

balmy mist Jul 15, 2025, 2:40 PM

#

how good is clownfish?

cedar tide Jul 15, 2025, 2:46 PM

#

for the knowledge cutoff: by regenerating the response several times we always come across 3 different answers: April 2023, September 2023, July 2024

#

Cresylux, named LongCat

#

Meituan talked about it in March, it uses it internally

winged locust Jul 15, 2025, 2:58 PM

#

meituan is cn

keen beacon Jul 15, 2025, 3:04 PM

#

winged locust meituan is cn

Cn ?

#

We need benchmarks

#

Now it doesn't 😭😭 , I have to poll like peasant

winged locust Jul 15, 2025, 3:12 PM

#

keen beacon Cn ?

Meituan is a popular Chinese mobile app that provides food delivery and various local services. It's similar to apps like Uber Eats or DoorDash, but with more features. Besides ordering food, users can also book hotels, buy movie tickets, get home services, and find local deals. It's widely used in daily life across China.

keen beacon Jul 15, 2025, 3:13 PM

#

winged locust Meituan is a popular Chinese mobile app that provides food delivery and various ...

Cool. i love those all in one apps from china

winged locust Jul 15, 2025, 3:20 PM

#

main gulch all the 4 models are likely Chinese ones

TWO oai two meituan

main gulch Jul 15, 2025, 3:33 PM

#

winged locust TWO oai two meituan

fake OAI, actually R1 distills

echo aurora Jul 15, 2025, 3:40 PM

#

twin garden When do you guys think Grok 4 is gonna be scored?

right now! https://lmarena.ai/leaderboard/text cc @small haven

torn mantle Jul 15, 2025, 3:43 PM

#

echo aurora right now! https://lmarena.ai/leaderboard/text cc <@931708065319907338>

Mmm

#

That's about right

#

I think i did guess top 5 leaning towards no3 spot more

#

No

keen beacon Jul 15, 2025, 3:47 PM

#

Grok nr 2 was unexpected, it feels much worse

#

Cant benchmax lmarena ..

#

Ohh

#

What ??? Wtf

sweet tinsel Jul 15, 2025, 3:49 PM

#

Llama 4

keen beacon Jul 15, 2025, 3:49 PM

#

:/

torn mantle Jul 15, 2025, 3:51 PM

#

keen beacon Grok nr 2 was unexpected, it feels much worse

Its not n2

keen beacon Jul 15, 2025, 3:51 PM

#

torn mantle Its not n2

With no style it is

torn mantle Jul 15, 2025, 3:52 PM

#

keen beacon With no style it is

What does this mean for people who bet on it being no1?

cedar tide Jul 15, 2025, 3:52 PM

#

Nettle, Clownfish and Octopus
are generally worse than R1.
All are reasoning models.
Rating on making a Discord clone:
Nettle 1/10
Clownfish 5/10
Octopus 5/10

torn mantle Jul 15, 2025, 3:52 PM

#

Will they lose their $ or nah

keen beacon Jul 15, 2025, 3:52 PM

#

torn mantle What does this mean for people who bet on it being no1?

Losing $

torn mantle Jul 15, 2025, 3:52 PM

#

keen beacon Losing $

I see

keen beacon Jul 15, 2025, 3:53 PM

#

but basically nr 1 was at 50% at best , at top height when grok 4 got announced, 1 day later it was at 25%, today before was at 13%

#

now ofc <5% (hoping for grok heavy)

torn mantle Jul 15, 2025, 3:54 PM

#

Lol

gusty helm Jul 15, 2025, 3:54 PM

#

is grok heavy even coming?

torn mantle Jul 15, 2025, 3:54 PM

#

gusty helm is grok heavy even coming?

Wdym

keen beacon Jul 15, 2025, 3:54 PM

#

Idk very likely not

torn mantle Jul 15, 2025, 3:54 PM

#

You need to pay 300$ for that

gusty helm Jul 15, 2025, 3:54 PM

#

I mean on lmarena obv

torn mantle Jul 15, 2025, 3:54 PM

#

And its not even worth it

torn mantle Jul 15, 2025, 3:54 PM

#

gusty helm I mean on lmareana obv

Never

gusty helm Jul 15, 2025, 3:55 PM

#

that's what I thought

torn mantle Jul 15, 2025, 3:55 PM

#

It's not available on API, and even if it were, it's so pricey

keen beacon Jul 15, 2025, 3:55 PM

#

gusty helm is grok heavy even coming?

I mean eventually .. but not this month by most odds

cedar tide Jul 15, 2025, 3:55 PM

#

Bad

keen beacon Jul 15, 2025, 3:55 PM

#

cedar tide Bad

They have separate coding model for a reason

#

Now what's left to look forward to is the new Claude model and GPT-5

unborn ocean Jul 15, 2025, 3:56 PM

#

that performance is likely quite literally the reason

torn mantle Jul 15, 2025, 3:56 PM

#

cedar tide Bad

Pffft

#

LMAO

cedar tide Jul 15, 2025, 4:02 PM

#

arrived

#

Why is glm 4 air still not on the leaderboard?

torn bison Jul 15, 2025, 4:09 PM

#

cedar tide for the knowledge cutoff by regenerating the response several times we always co...

there's no point asking it directly, try asking about the outcomes of well-known events at specific points in time

cedar tide Jul 15, 2025, 4:10 PM

#

Claude 4 Opus thinking 32k's place on the leaderboard compared to non-thinking:
Better in coding 2>1
But in hard prompts 1>2

And Claude 4 Sonnet thinking is much better in all categories of the leaderboard

cedar tide Jul 15, 2025, 4:10 PM

#

torn bison there's no point to ask it directly, try asking about the outcomes of well-known...

Yes, I know, don't worry, I usually do that, but I didn't have time.

torn bison Jul 15, 2025, 4:11 PM

#

2. What was the codename for the military operation Iran launched against Israel in April 2024?
3. Which company's technical failure was responsible for the 2024 massive global outage of Microsoft operating systems that affected airports, TV stations, and the financial industry?
4. What special major explosion happened in Lebanon in 2024 that caused thousands of casualties?
5. In 2024, what measure did South Korean President Yoon Suk Yeol suddenly announce that shocked the country and the world (which was later canceled)?
6. What is the name of the Chinese AI company Deepseek's first reasoning model that uses "Test-time compute" technology?
7. Who is the new pope elected after Pope Francis? What is his papal name?

2024.3.26: Key Bridge collapse
2024.4.13: Operation True Promise
2024.7.19: CrowdStrike
2024.9.17: Pager explosions
2024.12.3: Declaration of emergency martial law
2025.1.20: Deepseek-R1
2025.5.8: Robert Francis Prevost / Leo XIV
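
The probe list above can be turned into a rough bracketing check. This is only a minimal sketch, assuming you record by hand whether the model answered each probe correctly. `PROBES` reuses the events and dates from the message above; `bracket_cutoff` is a hypothetical helper name, not part of any tool mentioned in this chat.

```python
from datetime import date

# Probe events and dates exactly as listed in the message above.
PROBES = [
    (date(2024, 3, 26), "Key Bridge collapse"),
    (date(2024, 4, 13), "Operation True Promise"),
    (date(2024, 7, 19), "CrowdStrike"),
    (date(2024, 9, 17), "Pager explosions"),
    (date(2024, 12, 3), "Declaration of emergency martial law"),
    (date(2025, 1, 20), "Deepseek-R1"),
    (date(2025, 5, 8), "Robert Francis Prevost / Leo XIV"),
]


def bracket_cutoff(answered_correctly):
    """Bracket the knowledge cutoff from one bool per probe (in PROBES order).

    Returns (latest_known, earliest_unknown): the date of the latest event the
    model got right, and the date of the earliest missed event after it.
    Either element is None when there is no such event.
    """
    if len(answered_correctly) != len(PROBES):
        raise ValueError("expected one result per probe")
    known = [d for (d, _), ok in zip(PROBES, answered_correctly) if ok]
    missed = [d for (d, _), ok in zip(PROBES, answered_correctly) if not ok]
    latest_known = max(known) if known else None
    # Only misses after the latest known event help bound the cutoff;
    # an earlier miss is more likely a knowledge gap than a cutoff signal.
    later_missed = [d for d in missed if latest_known is None or d > latest_known]
    earliest_unknown = min(later_missed) if later_missed else None
    return latest_known, earliest_unknown


# Hypothetical result: the model knew everything up to CrowdStrike, nothing later.
print(bracket_cutoff([True, True, True, False, False, False, False]))
```

In that hypothetical run the cutoff would fall somewhere between 2024-07-19 and 2024-09-17. A single lucky guess or a search-augmented answer can throw the bracket off, so it is worth regenerating a few times before trusting it.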

cedar tide Jul 15, 2025, 4:11 PM

#

Thx

#

grok 3 mini high also arrived on the leaderboard, compared to its normal version it is worse in "multi turn" but better in "math" and "longer query"

#

Cresylux (LongCat by Meituan) is a good non-reasoning model

but I could not test its coding capacity, its output is very limited

#

@echo aurora possible to fix the cresylux output bug ?

echo aurora Jul 15, 2025, 4:24 PM

#

cedar tide <@283397944160550928> possible to fix the cresylux output bug ?

I'm not familiar with that, but if you could flag in #1343291835845578853 that'd be much appreciated

alpine coral Jul 15, 2025, 4:26 PM

#

cedar tide grok 3 mini high also arrived on the leaderboard, compared to its normal version...

oh that's a good spot
(separately.. gemma-3-27b.. always gets me when i look a bit further down the LB; like how does it do so well ha)

cedar tide Jul 15, 2025, 4:26 PM

#

echo aurora I'm not familiar with that, but if you could flag in <#1343291835845578853> that...

And my other question? (In my other message to you)

echo aurora Jul 15, 2025, 4:28 PM

#

cedar tide And my other question ? (In my another message for you)

I don't have an answer for you on that question, but I did flag

cedar tide Jul 15, 2025, 4:29 PM

#

https://x.com/LG_AI_Research/status/1945105829096775906?t=rCqlHO_pIhqKiDmqGHEqGg&s=19

LG AI Research (@LG_AI_Research)

📣Thrilled to announce the drop of EXAONE 4.0, the next-generation hybrid AI. 🙌Prepare to be amazed by EXAONE’s capabilities. #EXAONE #LG_AI_Resrarch #HybridAI #AI
https://t.co/rOym0eio7J

alpine coral Jul 15, 2025, 4:44 PM

#

cedar tide Cresylux (Longcat by meituan) its a good non reasoning models but I could not t...

seems pretty poor tbh (outside of coding anyway, which i don't prompt for)

cedar tide Jul 15, 2025, 4:47 PM

#

alpine coral seems pretty poor tbh (outside of coding anyway, which i don't prompt for)

I don't find it bad, are you comparing it with reasoning models?

#

compared to reasoning models it's bad, but it's not comparable

alpine coral Jul 15, 2025, 5:00 PM

#

cedar tide I don't find it bad, do you compare it with models of reasoning?

yes. compared against all the models i've run this quiz against, it quite literally tallied the worst score yet 😬

candid storm Jul 15, 2025, 5:10 PM

#

#

Why google only 87%?

#

shouldnt it be higher?

#

torn mantle Jul 15, 2025, 5:13 PM

#

candid storm

wtf

#

2.6%

#

nah thats wild

#

@keen beacon opinion?

torn mantle Jul 15, 2025, 5:13 PM

#

candid storm Why google only 87%?

its a lagging indicator

#

so it will take some time to increase

dawn wharf Jul 15, 2025, 5:14 PM

#

candid storm Why google only 87%?

huh

#

why is it even that high💀

candid storm Jul 15, 2025, 5:16 PM

#

X AI got released on the arena

#

And it sucks

candid storm Jul 15, 2025, 5:16 PM

#

candid storm

@deep adder buying google rn is ez money right?

alpine coral Jul 15, 2025, 5:17 PM

#

cedar tide <@295243581818404874>

octopus saying it's R1 here too (while ironically the non-anon model R1 says it's from anthropic ha)

candid storm Jul 15, 2025, 5:17 PM

#

Why not

#

Its kinda guaranteed google is gonna be #1 right?

#

Its july

#

Its not very likely gpt5 is gonna release next week right

gusty helm Jul 15, 2025, 5:20 PM

#

end of month + 1 week on lmarena to gather votes

#

may take a while

torn bison Jul 15, 2025, 5:21 PM

#

Two weeks left in July. Collecting votes will take about 1 week, so GPT-5 must be released within ~1 week and outperform 2.5 pro (and stonebloom)(and wolfstride) in order to achieve #1

candid storm Jul 15, 2025, 5:21 PM

#

candid storm

But still, google+openai is only 91% together

#

Thats still ez money right

#

If you buy both

empty stump Jul 15, 2025, 5:22 PM

#

crazy how grok 4 ranked below gpt 4.5 on leaderboard

candid storm Jul 15, 2025, 5:22 PM

#

empty stump crazy how grok 4 ranked below gpt 4.5 on leaderboard

At least gpt4.5 is good at writing

torn bison Jul 15, 2025, 5:22 PM

#

There are many such arbitrage bots on Polymarket. I initially wanted to make one too, but I found that I really couldn't understand their API

cedar tide Jul 15, 2025, 5:23 PM

#

alpine coral octopus saying it;s R1 here too (while ironically the non-anon model R1 says it'...

Octopus told me one time that it's Claude

alpine coral Jul 15, 2025, 5:23 PM

#

bit of a theme here isn't there

empty stump Jul 15, 2025, 5:25 PM

#

gpt 4o has thinking?

torn bison Jul 15, 2025, 5:26 PM

#

In these mutually exclusive markets most people will only buy one company's YES, leading to the sum of NO across all markets being less than (number of markets - 1) * 100. Arbitrage opportunity.
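
The arithmetic behind that claim can be sketched in a few lines. This is an illustration only, assuming the markets really are mutually exclusive (at most one resolves YES), that each winning share pays 100 cents, and that fees, slippage and order-book depth are ignored. The prices are made up and `no_basket_arbitrage` is a hypothetical helper, not a Polymarket API call.

```python
def no_basket_arbitrage(no_prices_cents):
    """Return (cost, guaranteed_payout, profit) in cents for one NO share per market.

    Assumes at most one of the markets resolves YES and each winning share
    pays 100 cents. Fees and slippage are ignored.
    """
    n = len(no_prices_cents)
    cost = sum(no_prices_cents)
    # At most one market resolves YES, so at least n - 1 NO shares pay out.
    guaranteed_payout = (n - 1) * 100
    return cost, guaranteed_payout, guaranteed_payout - cost


# Made-up NO prices (in cents) for four mutually exclusive markets.
prices = [15, 92, 97, 94]
cost, payout, profit = no_basket_arbitrage(prices)
print(f"cost={cost}c payout>={payout}c profit>={profit}c")
```

With these made-up prices the basket costs 298c and pays at least 300c, so the edge is small and easily eaten by fees or thin liquidity.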

alpine coral Jul 15, 2025, 5:27 PM

#

they're also like illiquid af.. there's not that much on the line

#

i don't see them as a reliable indicator (like yes, broad / directionally, generally seems correct-ish), but it's so little money at play.. easy for a small group or individual to deliberately or unintentionally skew it one way or another

torn bison Jul 15, 2025, 5:29 PM

#

If you check the market activity logs you'll find several accounts with huge trading volumes that keep buying 99.9c NO, ppl have been doing this all along.

zealous panther Jul 15, 2025, 5:31 PM

#

@deep adder any reasons you think that GPT or Claude models might beat Google's at the end of this month?

#

I mean OpenAI

#

Any specific reasons ?

#

im just curious lmao

#

I mean i think so too because im a glazer and because trends have been like that

#

but im not deep in the AI space

#

so im asking people who are more into it than i am

#

and it seems like you are

#

so yeah

torn bison Jul 15, 2025, 5:38 PM

#

What makes me curious is why Grok's odds in August 31st market haven't dropped below 10 lol. There's no way they're releasing something like Grok 4.5 in just one month

alpine coral Jul 15, 2025, 5:38 PM

#

yeah

#

lol go on craig - make it efficient aha

#

thinking like that is what creates this inefficiency ha

torn bison Jul 15, 2025, 5:40 PM

#

huge spread

torn mantle Jul 15, 2025, 5:45 PM

#

dawn wharf why is it even that high💀

deserved

exotic tartan Jul 15, 2025, 5:48 PM

#

what does best model even mean?

#

best in what

torn bison Jul 15, 2025, 5:50 PM

#

exotic tartan best in what

best in "Arena Score" section on the Leaderboard tab of https://lmarena.ai/leaderboard/text with the style control unchecked

exotic tartan Jul 15, 2025, 5:53 PM

#

Damn, grok-4 is up there!

torn mantle Jul 15, 2025, 5:53 PM

#

up where?

exotic tartan Jul 15, 2025, 5:53 PM

#

Number 2 in the leaderboard

torn mantle Jul 15, 2025, 5:53 PM

#

is it?

#

filtered by style control?

zinc ore Jul 15, 2025, 5:54 PM

#

W/o style control

torn mantle Jul 15, 2025, 5:54 PM

#

idk no5 seems right to me

#

or no3 at best

#

its def not better than o3 and gemini 2.5 pro

exotic tartan Jul 15, 2025, 5:54 PM

#

What is style control? sorry for the silly question.. could also ask AI lol

zinc ore Jul 15, 2025, 5:54 PM

#

With style control it's below gpt 4.5

alpine coral Jul 15, 2025, 5:58 PM

#

exotic tartan What is style control? sorry for the silly question.. could also ask AI lol

https://lmsys.org/blog/2024-08-28-style-control/#methodology

torn mantle Jul 15, 2025, 6:01 PM

#

alpine coral https://lmsys.org/blog/2024-08-28-style-control/#methodology

now the question is what criteria/feature are they using to normalize style control

#

they gave an example with output length

#

but is it reliable

alpine coral Jul 15, 2025, 6:01 PM

#

well i think they state very clearly that it's inherently subjective

torn mantle Jul 15, 2025, 6:01 PM

#

Length Markdown List Markdown Header Markdown Bold

alpine coral Jul 15, 2025, 6:01 PM

#

there's no perfect way to do it

gusty helm Jul 15, 2025, 6:02 PM

#

the market should have the "remove style control" ticked

#

from what I understand there

torn mantle Jul 15, 2025, 6:02 PM

#

gusty helm Jul 15, 2025, 6:02 PM

#

as the description seems to be for legacy

torn mantle Jul 15, 2025, 6:02 PM

#

criteria used

#

thats actually whats used

#

idk how to feel about that

alpine coral Jul 15, 2025, 6:03 PM

#

yes. tho i wouldn't be surprised if the methodology outlined in that blog (from aug) has been refined

torn mantle Jul 15, 2025, 6:03 PM

#

something more reliable will be pairing style control * something else

#

not just filtering by style control alone

alpine coral Jul 15, 2025, 6:04 PM

#

gusty helm the market should have the "remove style control" ticked

that's just straight up moving the goal posts lol

exotic tartan Jul 15, 2025, 6:04 PM

#

I just don't understand why remove style control isn't the default view, unless I'm missing something

zinc ore Jul 15, 2025, 6:04 PM

#

exotic tartan I just don't understand why remove style control isn't the default view, unless ...

It is

gusty helm Jul 15, 2025, 6:04 PM

#

alpine coral that's just straight up just moving the goal posts lol

Its very confusing

torn mantle Jul 15, 2025, 6:04 PM

#

exotic tartan I just don't understand why remove style control isn't the default view, unless ...

removing style control will prioritize -> yapping models

alpine coral Jul 15, 2025, 6:05 PM

#

if grok-4 did amazing with style control i feel like there'd be no issue..

zinc ore Jul 15, 2025, 6:05 PM

#

Default = [use style control]

exotic tartan Jul 15, 2025, 6:05 PM

#

zinc ore It is

zinc ore Jul 15, 2025, 6:05 PM

#

You have remove style control selected

#

When you click default it adds style control

torn mantle Jul 15, 2025, 6:06 PM

#

exotic tartan

its really no2

#

how is it no2?

zinc ore Jul 15, 2025, 6:06 PM

#

It's #2 without style control

gusty helm Jul 15, 2025, 6:06 PM

#

It's also that it is very easy to tell when it’s grok or not, or when it's another AI like gpt, in battle mode. Given what “fun” polymarket is, expect foul play

topaz elm Jul 15, 2025, 6:06 PM

#

Grok 4 outputs are a bit strange, it's not the same as the model on their site/app, it almost never talks, and during one math problem I gave it, it repeatedly switched between 'a' and 'ɑ'

exotic tartan Jul 15, 2025, 6:07 PM

#

alpine coral if grok-4 did amazing with style control i feel like there'd be no issue..

i'm not invested

alpine coral Jul 15, 2025, 6:07 PM

#

people complain claude doesn't do well enough compared to experience; style control addresses that

exotic tartan Jul 15, 2025, 6:08 PM

#

i'm just wondering why the default view is not the raw model response

alpine coral Jul 15, 2025, 6:08 PM

#

people complain grok4 is better and removing style control shows that

torn mantle Jul 15, 2025, 6:08 PM

#

i dont know about that

alpine coral Jul 15, 2025, 6:09 PM

#

you can cut it which ever way ig.. hard for lmarena to win ha

gusty helm Jul 15, 2025, 6:09 PM

#

it's not lmarena issue really

alpine coral Jul 15, 2025, 6:09 PM

#

read the blog post

topaz elm Jul 15, 2025, 6:09 PM

#

Removing style control does not increase its elo by much, only 6 points

gusty helm Jul 15, 2025, 6:09 PM

#

it does bring it up in #2 tho

topaz elm Jul 15, 2025, 6:10 PM

#

its just not the same model that they use on their app

alpine coral Jul 15, 2025, 6:11 PM

#

that part seems true

#

i dont think just system prompt

#

but also tools/search - seem to enhance it materially

exotic tartan Jul 15, 2025, 6:12 PM

#

there's "style control" in the app and browser version of all models for sure, but in lmarena I think the raw API responses should be the default ranking but i could be missing a lot of context here

zinc ore Jul 15, 2025, 6:12 PM

#

I think it's to prevent companies from trying to game the benchmark

alpine coral Jul 15, 2025, 6:12 PM

#

https://lmsys.org/blog/2024-08-28-style-control/#methodology

Does style matter? Disentangling style and substance in Chatbot Are...

Why is GPT-4o-mini so good? Why does Claude rank so low, when anecdotal experience suggests otherwise?
We have answers for you. We controlled for t...

#

no

#

meta showed that's easily done lol

exotic tartan Jul 15, 2025, 6:14 PM

#

zinc ore I think it's to prevent companies from trying to game the benchmark

in webdev I can actually spot grok 4, it's usually saying "powered by name of framework used" under the main element of the site hehe

gusty helm Jul 15, 2025, 6:15 PM

#

exotic tartan in webdev I can actually spot grok 4, it's usually saying "powered by *name of f...

you can easily spot grok if you ask something controversial as well

#

it will say stuff like "as a model made by xAI I do not condone this"

exotic tartan Jul 15, 2025, 6:16 PM

#

interesting

gusty helm Jul 15, 2025, 6:16 PM

#

or if you ask anything that's framed as a joke it's easy to spot (doesnt have to be controversial)

exotic tartan Jul 15, 2025, 6:16 PM

#

so style control eliminates that as well?

gusty helm Jul 15, 2025, 6:17 PM

#

I have no idea on in depth style control

#

just the same as you can spot gpt with emojis

alpine coral Jul 15, 2025, 6:21 PM

#

exotic tartan so style control eliminates that as well?

it has nothing to do with being able to figure out what model you're using before voting

#

it's about trying to control for the fact a lot of people casting votes don't vote on substance but on what looks nicer

exotic tartan Jul 15, 2025, 6:23 PM

#

like the use of emojis?

hardy lion Jul 15, 2025, 6:23 PM

#

https://blog.lmarena.ai/blog/2024/style-control/

Effectively it answers: "if responses had the same length, number of bullets, markdown, etc, then which model would be preferred"

It does this by measuring the effects of those style features, and adjusting the preference equations to account for them
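
The idea can be sketched as a Bradley-Terry style logistic regression with one extra style term. This is a toy illustration, not LMArena's actual code: the battles are synthetic, the single `style_diff` feature stands in for things like normalized length difference, and `fit_bradley_terry` and the model names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_bradley_terry(battles, models, use_style, steps=5000, lr=0.5):
    """Fit P(a beats b) = sigmoid(s[a] - s[b] + gamma * style_diff) by gradient ascent.

    battles: list of (model_a, model_b, style_diff, a_won).
    With use_style=False the style term is dropped (plain Bradley-Terry).
    Returns (strengths dict, gamma).
    """
    strength = {m: 0.0 for m in models}
    gamma = 0.0
    n = len(battles)
    for _ in range(steps):
        grad_s = {m: 0.0 for m in models}
        grad_g = 0.0
        for a, b, style_diff, a_won in battles:
            z = strength[a] - strength[b]
            if use_style:
                z += gamma * style_diff
            err = (1.0 if a_won else 0.0) - sigmoid(z)  # log-likelihood gradient wrt z
            grad_s[a] += err
            grad_s[b] -= err
            grad_g += err * style_diff
        for m in models:
            strength[m] += lr * grad_s[m] / n
        if use_style:
            gamma += lr * grad_g / n
    return strength, gamma


# Synthetic battles: "verbose" wins 5/10 at equal length (style_diff = 0)
# but 8/10 when its answer is longer (style_diff = 1).
battles = (
    [("verbose", "concise", 0.0, True)] * 5
    + [("verbose", "concise", 0.0, False)] * 5
    + [("verbose", "concise", 1.0, True)] * 8
    + [("verbose", "concise", 1.0, False)] * 2
)
models = ["verbose", "concise"]

raw, _ = fit_bradley_terry(battles, models, use_style=False)
controlled, gamma = fit_bradley_terry(battles, models, use_style=True)
print("raw gap:", raw["verbose"] - raw["concise"])
print("style-controlled gap:", controlled["verbose"] - controlled["concise"])
print("style coefficient:", gamma)
```

On this toy data the raw fit gives "verbose" an edge of roughly 0.62 log-odds, while the style-aware fit attributes the edge to the length coefficient (about 1.39) and leaves the strength gap near zero. That is the "if responses had the same length" question in miniature.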

alpine coral Jul 15, 2025, 6:25 PM

#

exotic tartan like the use of emojis?

i don't think that's included in their methodology, but perhaps it should be - that is the kind of style that they're trying to control for

#

but yeah it seems mostly about length, use of bullet points and markdown - at least per that blog post (they may have refined it since.. i dunno)

ocean vortex Jul 15, 2025, 6:26 PM

#

what is this error message....

#

lmao

exotic tartan Jul 15, 2025, 6:26 PM

#

I'm sure it has a huge effect. ChatGPT used to use more emojis than a 12 year old

ocean vortex Jul 15, 2025, 6:26 PM

#

my user is not too many 😠

torn bison Jul 15, 2025, 6:34 PM

#

the design is very human

stuck orchid Jul 15, 2025, 6:35 PM

#

kimi-k2 is the best?

ocean vortex Jul 15, 2025, 6:39 PM

#

stuck orchid kimi-k2 is the best?

Trying to test their thinking model... Inference or the interface seems buggy though

#

getting blank responses on some prompts

stuck orchid Jul 15, 2025, 6:41 PM

#

ocean vortex Trying to test their thinking model... Inference or the interface seems buggy th...

What is the url?

ocean vortex Jul 15, 2025, 6:43 PM

#

stuck orchid What is the url?

https://platform.moonshot.ai/playground

Moonshot AI - Open Platform

Kimi Open Platform (Moonshot AI Large Model Open Platform) provides long-text data processing APIs based on the Kimi large model (Moonshot AI Large Model), supporting flexible API calls for leading technical experience.

#

you need to pay monies though

#

I used one-time use virtual revolut card lol

alpine coral Jul 15, 2025, 6:53 PM

#

oh really? i just had to give a phone number

#

and there's rate limited free usage

#

i also thought kimi-thinking was an older model

#

but is it K2 thinking?

lone vector Jul 15, 2025, 6:56 PM

#

#

Not a surprise the X glazers were wrong

whole wagon Jul 15, 2025, 6:57 PM

#

Just above Claude 4 opus

#

Kappa

#

ocean vortex Jul 15, 2025, 6:58 PM

#

lone vector

oh they did update the leaderboard. Well daaamn that's a fail for xAI then lmao

#

if we remove style control it's nr2, but still far away from 2.5Pro

#

my ambitious bet on polymarket didn't pay off then 💀

torn bison Jul 15, 2025, 7:03 PM

#

I'm shocked that there are actually people in this server who don't buy Google💀

ocean vortex Jul 15, 2025, 7:04 PM

#

torn bison I'm shocked that there are actually people in this server who don't buy Google💀

It makes no sense because profit is minuscule

#

and risk is not 0. Well at least it used to be that till Grok4 was put up

patent aspen Jul 15, 2025, 7:08 PM

#

I don't bet on real prediction markets because I'm afraid of getting in trouble with tax authorities

#

It's not legal in the US

ocean vortex Jul 15, 2025, 7:09 PM

#

patent aspen It's not legal in the US

Are you sure?

#

Look at what your president is doing...

#

That is legal somehow

#

catgrin

torn bison Jul 15, 2025, 7:09 PM

#

ocean vortex It makes no sense because profit is minuscule

If you bought during the grok4 livestream, google YES shares could be had for as low as 40c, and xai NO shares were around 30c
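To put those prices in perspective, here is a minimal sketch of the payoff arithmetic. It assumes a binary share settles at 100c if the outcome is correct and 0c otherwise, and it ignores fees and slippage; the function name is made up for illustration.

```python
def return_if_correct(price_cents: int, payout_cents: int = 100) -> float:
    """Profit as a fraction of the stake when a binary share bought at
    price_cents settles at payout_cents (assumed 100c, fees ignored)."""
    return (payout_cents - price_cents) / price_cents

# Prices quoted in the chat: google YES at 40c, xai NO at 30c.
for label, price in [("google YES", 40), ("xai NO", 30)]:
    print(f"{label} at {price}c: {return_if_correct(price):.0%} profit if it resolves in your favour")
```

By the same formula, a share bought near 97c would return only about 3%, which is the "minuscule profit" case.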

torn bison Jul 15, 2025, 7:09 PM

#

patent aspen I don't bet on real prediction markets because I'm afraid of getting in trouble ...

maybe try kalshi?

#

They have poor liquidity tho

ocean vortex Jul 15, 2025, 7:10 PM

#

torn bison If you buy during grok4 livestream, google YES shares can be traded for as low a...

Well at that point we had no clue + grok3 topped lmarena after release. People betting on Google got kinda lucky this time tbh

patent aspen Jul 15, 2025, 7:10 PM

#

torn bison maybe try kalshi?

Still not legal right? Just a workaround to avoid getting caught

torn bison Jul 15, 2025, 7:11 PM

#

patent aspen Still not legal right? Just a workaround to avoid getting caught

They are regulated by the CFTC, legal in the US afaik

ocean vortex Jul 15, 2025, 7:12 PM

#

I just can't help but view the law in the US as a wild west currently. Very ironic to assume some innocent trading is illegal while crypto meme-coins and rugpulling seem to be completely legal and promoted by the government lol

patent aspen Jul 15, 2025, 7:13 PM

#

It's not enough money where I'd want to take the risk

#

The AI markets are so small that I could move them

torn bison Jul 15, 2025, 7:15 PM

#

The real big markets are geopolitics and sports, but it's much harder to be an insider in those areas haha

small haven Jul 15, 2025, 7:47 PM

#

ocean vortex if we remove style control it's nr2, but still far away from 2.5Pro

brutal, didnt even check polymarket, back to 0%? 😭

#

why do all chinese sites need cross site cookies, freaking xi'd

keen beacon Jul 15, 2025, 7:50 PM

#

did you login with google?

ocean vortex Jul 15, 2025, 7:50 PM

#

small haven brutal, didnt even check polymarket, back to 0%? 😭

It's 3% now. The only theoretical chance I see for them now is if they drop heavy on lmarena. Though even then I think response style is where they are losing the most points...

small haven Jul 15, 2025, 7:50 PM

#

ocean vortex It's 3% now. The only theoretical chance I see for them now is if they drop heav...

wow, nature is healing

ocean vortex Jul 15, 2025, 7:50 PM

#

Heavy would probably be like 2nd-3rd with style control

small haven Jul 15, 2025, 7:50 PM

#

keen beacon did you login with google?

yes

#

i dont have a chinese number

ocean vortex Jul 15, 2025, 7:53 PM

#

ocean vortex It's 3% now. The only theoretical chance I see for them now is if they drop heav...

If we look at grok3, the new one is only 24 elo points ahead...
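For scale, a 24-point gap is small in head-to-head terms. A minimal sketch, assuming the textbook Elo logistic curve with a 400-point scale (the arena's actual rating fit may differ, and the function name is hypothetical):

```python
def elo_expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win probability of the higher-rated model under the
    standard Elo curve (assumption: base-10 logistic, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))

gap = 24  # Elo gap quoted in the chat
print(f"expected win rate at +{gap} Elo: {elo_expected_win_rate(gap):.1%}")
```

Under that assumption the newer model would be preferred in only a little over half of direct comparisons.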

#

Whoever at xAI thought it's a good idea to make the model output 30k+ hidden reasoning and then respond with 1 word to 1 sentence is an idiot tbh

#

catgrin

keen beacon Jul 15, 2025, 7:58 PM

#

it depends on how their cot looks, but i've suspected before that people did (something similar) so their rl loop is faster

ocean vortex Jul 15, 2025, 7:58 PM

#

The CoT is hidden, so it's just bad. Users don't care what is convenient for them during training lol

arctic lark Jul 15, 2025, 8:02 PM

#

Hello, I was just wondering how people are accessing the anonymous models? I saw on X there were 5 of them but cannot see any of them on the site.

keen beacon Jul 15, 2025, 8:03 PM

#

use the battle feature (instead of direct chat) and you have a chance to get one of them

patent aspen Jul 15, 2025, 8:21 PM

#

ocean vortex Whoever at xAI thought it's a good idea to make the model output 30k+ hidden rea...

It reminds me of the supercomputer from The Hitchhiker's Guide to the Galaxy computing the answer to the ultimate question of life, the universe, and everything

#

It spends ages computing the answer "42" but then nobody knows what the ultimate question was

keen beacon Jul 15, 2025, 8:27 PM

#

Do you guys find kimi > gemini 2.5 ?

#

1 or 2 ?

#

imo 2 lol

#

1 is kimi2 , gemini 2.5 is 2nd xD

torn mantle Jul 15, 2025, 8:36 PM

#

1

keen beacon Jul 15, 2025, 8:36 PM

#

both are hilarious but i preferred kimi

dawn wharf Jul 15, 2025, 8:37 PM

#

keen beacon 1 or 2 ?

Kimi is a bit incoherent, but creative alright

ocean vortex Jul 15, 2025, 8:41 PM

#

webdev arena Grok4 is 12th

#

absolutely destroyed

#

So yeah... It's not benchmaxed. It's just that this early testing was very sus 🤔

#

Don't think a single result that was done on public API put it SOTA on any benchmark

zinc ore Jul 15, 2025, 8:46 PM

#

I really want to see some independent testing on USAMO at least

ocean vortex Jul 15, 2025, 8:46 PM

#

not Aider, not livebench, not lmarena / webdev, simple-bench not yet on leaderboard but that's unlikely as well....

ocean vortex Jul 15, 2025, 8:50 PM

#

zinc ore I really want to see some independent testing on USAMO at least

It's probably good for math. This fine-tuning is like perfect for it. But that's not much of a consolation given how it was advertised..

whole wagon Jul 15, 2025, 8:51 PM

#

The web dev arena is due to poor spatial understanding which has been stated again and again

#

Multiple times during the livestream itself

ocean vortex Jul 15, 2025, 8:52 PM

#

whole wagon Multiple times during the livestream itself

Did not care enough about their claims to watch it. But it did score high on arc-agi, which we now know was contamination...

#

since otherwise it's a spatial awareness benchmark

#

so they were contradicting themselves? 💀

whole wagon Jul 15, 2025, 8:52 PM

#

No they were surprised by the arc agi score also lol

#

They said it should have performed badly

#

In the livestream

ocean vortex Jul 15, 2025, 8:53 PM

#

Well then their claims have even less credibility LOL

#

Poor spatial awareness is bad for something that is trying to be a top model

#

And what's with the whole of twitter being "blown away" by its abilities to do svg drawings at the time of release then.... catgrin

whole wagon Jul 15, 2025, 8:56 PM

#

It's not good at SVG drawings

#

Compared to other models

ocean vortex Jul 15, 2025, 8:56 PM

#

I tried it myself - wasn't impressed - just left it at that.

#

💀

#

The issue is that it's mostly the same with anything you try using it on

#

Well not literally, but it performs poorly on numerous things...

#

Which is why I said at the time that I don't ever recall a model with such a disconnect between claimed benchmark results and how it performs IRL

ocean vortex Jul 15, 2025, 9:02 PM

#

zinc ore I really want to see some independent testing on USAMO at least

Honestly, it's only 6 tasks. I think we could even sorta kinda do it ourselves...

#

cost would be 10 to $20 for the entire thing I think

#

Task1: Let k and d be positive integers. Prove that there exists a positive integer N such that for every odd integer n>N, the digits in the base-2n representation of n^k are all greater than d.
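Before grading any model proof, the statement can be sanity-checked numerically. A rough sketch that writes n^k in base 2n and reports its smallest digit; the helper names and the sample values of k and d are arbitrary choices for illustration, and a finite check like this is of course not a proof:

```python
def digits_in_base(value: int, base: int) -> list[int]:
    """Digits of a positive integer in the given base, least significant first."""
    digits = []
    while value > 0:
        value, digit = divmod(value, base)
        digits.append(digit)
    return digits

def smallest_digit(n: int, k: int) -> int:
    """Smallest digit of n**k written in base 2n."""
    return min(digits_in_base(n ** k, 2 * n))

k, d = 3, 5  # arbitrary illustrative values, not from the problem
for n in range(1, 200, 2):  # odd n only, as in the statement
    low = smallest_digit(n, k)
    print(n, low, "ok" if low > d else "digit <= d")
```

If the claim holds, the "digit <= d" rows should stop appearing once n is large enough for the chosen k and d.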

unborn ocean Jul 15, 2025, 9:09 PM

#

matharena.ai already did some of the benches, but not USAMO

ocean vortex Jul 15, 2025, 9:14 PM

#

Let's see if it was overfitted on that R1 response for task1 👀

ocean vortex Jul 15, 2025, 9:14 PM

#

unborn ocean matharena.ai already did some of the benches, but not USAMO

yeah saw that. At least it did this I suppose

#

Just got reminded why I hate this thing...

#

if it's doing this useless flood at the very least it should be fast

#

but it is not

unborn ocean Jul 15, 2025, 9:16 PM

#

i am not sure, but i think that they are doing manual human grading for the usamo bench -> takes longer

ocean vortex Jul 15, 2025, 9:17 PM

#

It's essentially Kimi2 speeds

unborn ocean Jul 15, 2025, 9:17 PM

#

i can remember them doing a paper about that..

ocean vortex Jul 15, 2025, 9:17 PM

#

except you are paying for it like it is hosted properly...

#

Ok I don't think this is correct

#

"N exists" boxed answer? 🤣

leaden palm Jul 15, 2025, 9:25 PM

#

ocean vortex except you are paying for it like it is hosted properly...

k2 is in fact high speeds and low prices

ocean vortex Jul 15, 2025, 9:28 PM

#

leaden palm k2 is in fact high speeds and low prices

I was referring to their official API

leaden palm Jul 15, 2025, 9:28 PM

#

moonshot?

ocean vortex Jul 15, 2025, 9:28 PM

#

Not groq or whatever this is

#

Though even groq is sub 200 tok/sec

#

So I'm not sure what is 300, probably nothing lol

leaden palm Jul 15, 2025, 9:29 PM

#

ocean vortex Not groq or whatever this is

i love groq but this is an inference provider i literally only learned about today that i think is just one guy with a gpu

ocean vortex Jul 15, 2025, 9:29 PM

#

leaden palm i love groq but this is an inference provider i literally only learned about tod...

🤯

leaden palm Jul 15, 2025, 9:29 PM

#

both groq and crofai can reach 500 tok/s if the prompt is easily predicted

ocean vortex Jul 15, 2025, 9:29 PM

#

yeah I haven't heard about them either

#

but sounds promising if those are true speeds

ocean vortex Jul 15, 2025, 9:31 PM

#

leaden palm moonshot?

Yes. xAI only does like 50 tok/sec. That's way slower than what OpenAI or Anthropic are doing... And fairly comparable with Moonshot AI who are GPU poor 🧐

#

relatively
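To make the speed complaint concrete, here is a back-of-the-envelope sketch of how long a hidden reasoning trace takes to stream. It simply combines two figures mentioned in this chat (30k reasoning tokens, roughly 50 tok/sec) and assumes constant throughput with no queueing; the function name is hypothetical.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate a given number of tokens at a constant throughput."""
    return tokens / tokens_per_second

reasoning_tokens = 30_000  # "30k+ hidden reasoning" figure from the chat
for speed in (50, 185):    # tok/sec figures quoted in the chat
    minutes = generation_seconds(reasoning_tokens, speed) / 60
    print(f"{speed} tok/sec -> {minutes:.1f} min before the visible answer starts")
```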

ocean vortex Jul 15, 2025, 9:33 PM

#

leaden palm both groq and crofai can reach 500 tok/s if the prompt is easily predicted

In edge cases maybe. On average it is nowhere near that though. I feel like OR stats are fairly realistic here:

leaden palm Jul 15, 2025, 9:33 PM

#

think or's measured throughput number fluctuates, seems to currently be below the claimed speed of 185 tok/s when it used to be above

ocean vortex Jul 15, 2025, 9:34 PM

#

Was getting around that when I tried it. Seems close enough to me. They probably update the average as it gets used more

#

Grok's task1:

sage cradle Jul 15, 2025, 9:43 PM

#

Evening - is there any equivalent to LMArena for Video Gen testing?

leaden palm Jul 15, 2025, 9:44 PM

#

sage cradle Evening - is there any equivalent to LMArena for Video Gen testing?

artificial analysis

#

https://artificialanalysis.ai/text-to-video/arena

sacred plaza Jul 15, 2025, 9:48 PM

#

LMAOOOOOOOOOOOOOOOOO.

lone vector Jul 15, 2025, 10:04 PM

#

I hope nobody was betting on grok...

torn mantle Jul 15, 2025, 10:05 PM

#

lone vector I hope nobody was betting on grok...

@whole wagon

whole wagon Jul 15, 2025, 10:05 PM

#

I didn't bet on grok lol

torn mantle Jul 15, 2025, 10:06 PM

#

it was @candid storm mb

ocean vortex Jul 15, 2025, 10:07 PM

#

ocean vortex Grok's task1:

Ok other attempts were actually better. For task1 overall it's roughly this:

#

Attempt 5 was even slightly better than 2 but only first 4 were supposed to be rated so... 🤷‍♂️

#

0.1 temp

#

It sabotages itself with concise responses very easily

#

What's clear is that on this task it is definitively worse than R1, 2.5Pro or o4-mini. But exact score may depend on chance somewhat.

keen beacon Jul 15, 2025, 10:14 PM

#

lone vector I hope nobody was betting on grok...

you can bet No too 🙂

sage cradle Jul 15, 2025, 10:17 PM

#

leaden palm artificial analysis

That was very interesting, thank you - saw a few models I hadn't heard of. None of the ones I use in open source, interestingly. I assume there isn't one where you can choose prompts like LMArena - I've found that a super interesting way of comparing the models and I do enjoy the battles

ocean vortex Jul 15, 2025, 10:18 PM

#

Looking at some of the leaderboard scores... Flash-thinking run3 was 7/7, run2 - 2/7 and then runs 1,4 - 0/7 lol

sage cradle Jul 15, 2025, 10:22 PM

#

interestingly, I use Perplexity / Comet for my every day use - I am a massive believer in being model agnostic where possible. I do wonder how much Perplexity's front end changes how the models behave though - especially with their search focused paradigm

ocean vortex Jul 15, 2025, 10:22 PM

#

If including runs 2-5 for Grok instead, that would be 60.7%. Though it would be more of a max score lucky attempt I think

torn mantle Jul 15, 2025, 10:24 PM

#

keen beacon you can bet No too 🙂

Yea and also told him to bet on google when it decreases, they were both close just before the release

ocean vortex Jul 15, 2025, 10:25 PM

#

Kinda mind blowing that there's so much variance possible on USAMO though to be completely honest... With only 6 tasks in total they should have made like 10 attempts for each IMO
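The spread can be quantified with the four Flash-thinking scores quoted a few messages up (0, 2, 7 and 0 out of 7). A minimal sketch, assuming the runs are independent and that the same spread would carry over to additional attempts:

```python
import math
import statistics

run_scores = [0, 2, 7, 0]  # runs 1-4 out of 7, as quoted in the chat

mean_score = statistics.mean(run_scores)
spread = statistics.stdev(run_scores)  # sample standard deviation

def standard_error(stdev: float, attempts: int) -> float:
    """Standard error of the mean score over a number of independent attempts."""
    return stdev / math.sqrt(attempts)

print(f"mean {mean_score:.2f}/7, stdev {spread:.2f}")
for attempts in (4, 10):
    print(f"{attempts} attempts -> standard error {standard_error(spread, attempts):.2f} points")
```

Going from 4 to 10 attempts shrinks the uncertainty on the average by a factor of about 1.6, which is the case for more attempts per task.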

torn bison Jul 15, 2025, 10:25 PM

#

I remember ricardo sold it when xai was around ~~37c~~ 30c

torn mantle Jul 15, 2025, 10:26 PM

#

torn bison I remember ricardo sold it when xai was around ~~37c~~ 30c

Oh

torn bison Jul 15, 2025, 10:26 PM

#

it's about 300% profit for him

dawn wharf Jul 15, 2025, 10:32 PM

#

lone vector I hope nobody was betting on grok...

are the predictions based on lmarena?

lone vector Jul 15, 2025, 10:32 PM

#

yeah

sage cradle Jul 15, 2025, 10:33 PM

#

Grok 4 is super chatty and 1st person - I quite like it - (though I really don't want to)

#

Grok 4 in LM Arena vs Grok 4 in Perplexity and its analysis of why such opposite responses... https://www.perplexity.ai/search/suppose-a-civilization-from-an-tBijNj2mQSOYRh0N4sMifQ

Perplexity AI

Suppose a civilization from an exoplanet, with technology comparabl...

Based on established protocols for interstellar communication and historical precedents for contact between civilizations, an extraterrestrial civilization...

rare python Jul 15, 2025, 10:52 PM

#

ocean vortex Kinda mind blowing that there's so much variance possible on USAMO though to be ...

So my recommendation of USAMO as a math benchmark is still valid

main gulch Jul 15, 2025, 11:25 PM

#

sage cradle Grok 4 in LM Arena vs Grok 4 in Perplexity and it's analysis of why such opposit...

perplexity has web access, arena doesn't

olive mesa Jul 15, 2025, 11:25 PM

#

is grok 4 AGI according to "current definitions"?

#

it's definitely past what AGI was defined as in 2023

sturdy mica Jul 15, 2025, 11:49 PM

#

does anyone talk about https://www.obl.dev/

echo aurora Jul 15, 2025, 11:49 PM

#

Hey everyone - we're aware of issues with LMArena at the moment, team is working on a fix asap!

sturdy mica Jul 15, 2025, 11:50 PM

#

echo aurora Hey everyone - we're aware of issues with LMArena at the moment, team is working...

its down?

#

anyway that website is a pretty good alternative and has direct chat with web browsing and such

#

kinda looks better than lmarena

whole wagon Jul 15, 2025, 11:51 PM

#

https://www.nytimes.com/2025/07/14/technology/meta-superintelligence-lab-ai.html

The New York Times

By Eli Tan

Meta’s New Superintelligence Lab Is Discussing Major A.I. Strateg...

Members of the lab, including the new chief A.I. officer, Alexandr Wang, have talked about abandoning Meta’s most powerful open source A.I. model in favor of developing a closed one.

#

IMO meta is going to go closed source

echo aurora Jul 15, 2025, 11:52 PM

#

sturdy mica its down?

~~Yes, models are erroring out.~~ Looks like it's fixed. blobthumbsup

sturdy mica Jul 15, 2025, 11:52 PM

#

just tested it, i see

whole wagon Jul 15, 2025, 11:54 PM

#

This is basically going to turn into Chinese open source Vs US closed source fight

candid storm Jul 15, 2025, 11:55 PM

#

torn bison it's about 300% profit for him

😁

tidal schooner Jul 15, 2025, 11:57 PM

#

whole wagon This is basically going to turn into Chinese open source Vs US closed source fig...

vs… mistral, french mix of open-source/closed-source

wintry tinsel Jul 16, 2025, 12:03 AM

#

whole wagon https://www.nytimes.com/2025/07/14/technology/meta-superintelligence-lab-ai.html

And it’s going to suck too

jade egret Jul 16, 2025, 12:15 AM

#

i just noticed you can use claude 4 opus completely free in lm arena 0_0

sacred quail Jul 16, 2025, 1:07 AM

#

jade egret i just noticed you can use claude 4 opus completely free in lm arena 0_0

it has low limits compared to the other ones but still impressive i know

quartz light Jul 16, 2025, 2:21 AM

#

sacred quail it has low limits compared the other ones but still impressive i know

yeah like 16k thinking tokens instead of 32 and horrible context window in direct chat

jade egret Jul 16, 2025, 3:04 AM

#

quartz light yeah like 16k thinking tokens instead of 32 and horrible context window in direc...

yea : (

verbal nimbus Jul 16, 2025, 5:43 AM

#

Gemini Pro 2.5 likes to hallucinate so much that it explained code I forgot to provide without even telling me🤔

rare python Jul 16, 2025, 5:55 AM

#

verbal nimbus Gemini Pro 2.5 likes to hallucinate so much that it explained code I forgot to p...

especially when I say "is it done?" and it will hallucinate another problem that I never sent it

dusky aurora Jul 16, 2025, 6:16 AM

#

verbal nimbus Gemini Pro 2.5 likes to hallucinate so much that it explained code I forgot to p...

it confuses its thinking with your input

whole sundial Jul 16, 2025, 6:25 AM

#

@echo aurora he's back for the third time

tidal schooner Jul 16, 2025, 6:26 AM

#

<@&1349916362595635286> holy glaze

keen beacon Jul 16, 2025, 6:48 AM

#

Lol

#

Talking about dev freelancers, what sites are there to hire them? Fiverr and upwork seem overhyped and overpriced

whole wagon Jul 16, 2025, 6:56 AM

#

https://x.com/tbpn/status/1945290640545243503?s=46&t=9yOytiIPb-YpjUM8CP7bqw

TBPN (@tbpn)

BREAKING: OpenAI researchers Jason Wei and Hyung Won Chung are rumored to have been poached by Meta.

#

👀

#

Those are the guys we see in the livestreams lmao

#

When they announce products

pallid crypt Jul 16, 2025, 6:57 AM

#

Lol

dawn wharf Jul 16, 2025, 6:58 AM

#

whole wagon https://x.com/tbpn/status/1945290640545243503?s=46&t=9yOytiIPb-YpjUM8CP7bqw

bro's collecting them like pokemon

civic flame Jul 16, 2025, 7:20 AM

#

whole wagon https://x.com/tbpn/status/1945290640545243503?s=46&t=9yOytiIPb-YpjUM8CP7bqw

i will laugh if meta still releases slop after paying exorbitant amounts of money to poach a bunch of competing labs' talent

rare python Jul 16, 2025, 7:33 AM

#

civic flame i will laugh if meta still release slop after paying exorbitant amounts of money...

Meta should give money and compute to DeepSeek and Moonshot 😩

torn mantle Jul 16, 2025, 8:30 AM

#

whole wagon https://x.com/tbpn/status/1945290640545243503?s=46&t=9yOytiIPb-YpjUM8CP7bqw

lol

#

crazy

#

makes me wonder why many left

#

is it like a domino effect

alpine coral Jul 16, 2025, 8:31 AM

#

ocean vortex since otherwise it's spatial awareness benchmark

it's not tho.. it tests abstract reasoning and pattern discovery. yes spatial recognition matters, but the coloured grids are just the medium, not the point. whether a model can spot the rule and generalise it is the point of each puzzle—nothing to do with pure geometry / spatial awareness .

alpine coral Jul 16, 2025, 9:03 AM

#

jules that made no sense to me

leaden sun Jul 16, 2025, 9:03 AM

#

my bad, i realized i was thinking something else

ocean vortex Jul 16, 2025, 10:00 AM

#

alpine coral it's not tho.. it tests abstract reasoning and pattern discovery. yes spatial re...

The rule for each solution is purely geometrical though, so I would argue it is predominantly all based on spatial awareness. Unless a model finds a way to cheat and spot other patterns it can base the solution on - but this wasn't the idea of that benchmark.

soft kernel Jul 16, 2025, 10:17 AM

#

Is o3 down?

hardy pecan Jul 16, 2025, 10:20 AM

#

It seems weaker over the last few hours

#

Working for me but feels bad

pure anvil Jul 16, 2025, 10:23 AM

#

torn mantle makes me wonder why many left

maybe because OpenAI has no lasting moat and who would turn down $100M

whole wagon Jul 16, 2025, 10:24 AM

#

Don't think they got $100M

#

Zuck said it's fake

#

Sam is a chronic liar it's probably strategic

pure anvil Jul 16, 2025, 10:25 AM

#

so is zuck but maybe the workplace culture is less stressful at meta

whole wagon Jul 16, 2025, 10:26 AM

#

Dunno. If openAI working hours are that long one has to wonder why they aren't delivering

pure anvil Jul 16, 2025, 10:26 AM

#

and even if openAI achieved AGI or ASI or something I doubt sam would credit the researchers at all

pure anvil Jul 16, 2025, 10:26 AM

#

whole wagon Dunno. If openAI working hours are that long one has to wonder why they aren't d...

i mean so many core people leaving has to slow everything down some amount

tall summit Jul 16, 2025, 10:27 AM

#

ocean vortex Ok I don't think this is correct

Thinking... Thinking... Thinking...

whole wagon Jul 16, 2025, 10:27 AM

#

Yeah but the projects from before are also delayed

#

Like GPT5 and the open source model

pure anvil Jul 16, 2025, 10:27 AM

#

if openAI was a bit more open we'd know lol

whole wagon Jul 16, 2025, 10:28 AM

#

Sam lied and said it was due to safety concerns 😂

pure anvil Jul 16, 2025, 10:28 AM

#

that's such an annoying larp

#

like bruh

#

nobody believes that

keen beacon Jul 16, 2025, 10:28 AM

#

whole wagon Zuck said it's fake

He also says hes not a lizard

whole wagon Jul 16, 2025, 10:31 AM

#

Kimi K2 is same price as gpt4.1 mini. The Chinese have absolutely cooked when it comes to efficiency

#

Like it's a totally different league of performance now

#

Though I would still like some strong smaller distills to run locally and/or fine tune :p

pure anvil Jul 16, 2025, 10:33 AM

#

k2 thinking should be fun

whole wagon Jul 16, 2025, 10:34 AM

#

I would guess it ends up o3 level

pure anvil Jul 16, 2025, 10:35 AM

#

most likely. their deepresearch based on their older model, k1.5, is currently SOTA

whole wagon Jul 16, 2025, 10:36 AM

#

The interesting thing is what they are doing is not particularly complex. It makes me wonder what the US labs have even been doing all this time

#

Like there really isn't much magic to make Kimi K2

#

There are a few nice things in the optimizer using muon, some hyperparam changes and that's it over deepseek

vivid wigeon Jul 16, 2025, 10:41 AM

#

Guys, does anybody know other prediction markets with an AGI timeline where there is money on the line, and not just people speculating, like this one https://kalshi.com/markets/kxoaiagi/openai-achieves-agi

When will OpenAI achieve AGI? | Trade on Kalshi

Track what Kalshi's markets predict for "When will OpenAI achieve AGI?", or trade it yourself.

winged locust Jul 16, 2025, 10:42 AM

#

📰 Everything else in AI today
Mistral unveiled Voxtral, a low-cost, open-source speech understanding model family that combines transcription with native Q&A capabilities.

Google revealed that its AI security agent, Big Sleep, discovered a critical security flaw that allowed Google to stop the vulnerability before it was exploited.

U.S. President Donald Trump announced over $92B in AI and energy investments at a Pennsylvania summit, saying America’s destiny is to be the “AI superpower.”

Google is investing $25B in data centers and AI infrastructure across the PJM electric grid region, including $3B to modernize Pennsylvania hydropower plants.

Anthropic launched Claude for Financial Services, a solution that integrates Claude with market data and enterprise platforms for financial institutions.

Nvidia plans to resume sales of its H20 AI chip to China after CEO Jensen Huang received assurances from U.S. leadership, with AMD also resuming sales in the region.

pure anvil Jul 16, 2025, 10:43 AM

#

vivid wigeon Guys anybody knows other prediction markets with AGI timeline where there is mon...

And who decides what AGI is?

sage cradle Jul 16, 2025, 10:46 AM

#

main gulch perplexity has web access, arena doesn't

Yes agreed, that’s one of the main benefits of Perplexity i have found compared to just using the models direct. But it can skew results sometimes

sage cradle Jul 16, 2025, 10:46 AM

#

pure anvil And who decides what AGI is?

Nate B Jones says running a vending machine lol

torn mantle Jul 16, 2025, 10:46 AM

#

vivid wigeon Guys anybody knows other prediction markets with AGI timeline where there is mon...

havent seen sama talk about agi again

#

what happened?

#

believe it or not, the closest lab to AGI is google

pure anvil Jul 16, 2025, 10:47 AM

#

torn mantle believe it or not, the closest lab to AGI is google

on paper they have the best shot

vivid wigeon Jul 16, 2025, 10:47 AM

#

Google was always a leader in AI for over a decade, MSFT just has better marketing

#

Sam is master of marketing

vivid wigeon Jul 16, 2025, 10:48 AM

#

pure anvil And who decides what AGI is?

Me of course 😂

#

Couldnt find any market for AGI on Polymarket

unborn ocean Jul 16, 2025, 11:49 AM

#

whole wagon https://x.com/tbpn/status/1945290640545243503?s=46&t=9yOytiIPb-YpjUM8CP7bqw

https://www.jasonwei.net/ he has a nice blog if someone is interested....

keen beacon Jul 16, 2025, 11:58 AM

#

vivid wigeon Guys anybody knows other prediction markets with AGI timeline where there is mon...

Theres one in polymarket about openai announcing agi 2025 , <10%

ocean vortex Jul 16, 2025, 12:18 PM

#

whole wagon Kimi K2 is same price as gpt4.1 mini. The Chinese have absolutely cooked when it...

It's not efficiency. It's the lack of profit that makes it cheap lol.

#

And R1 is still a better deal pricewise if we are being accurate 👀

#

V3.1 is much cheaper, R1.1 is a little bit cheaper + performs better

#

When it comes to Qwen3-235B, alibabacloud pricing for it with thinking enabled is $8.4 per 1M output. Not because it's less efficient but because more profit. 😊

pure anvil Jul 16, 2025, 12:28 PM

#

ocean vortex And R1 is still a better deal pricewise if we are being accurate 👀

You can't compare them

#

the reasoning tokens of r1 will make it more expensive than k2 in actual use

ocean vortex Jul 16, 2025, 12:29 PM

#

You can, K2 is very very verbose

pure anvil Jul 16, 2025, 12:29 PM

#

ocean vortex You can, K2 is very very verbose

That's up to the user

ocean vortex Jul 16, 2025, 12:29 PM

#

pure anvil That's up to the user

I'm talking on average

#

R1 is less expensive per 1M tokens, so the total price difference in using them is not gonna be much

#

but R1 is more capable

pure anvil Jul 16, 2025, 12:30 PM

#

ocean vortex R1 is less expensive per 1M tokens, so the total price difference in using them ...

https://artificialanalysis.ai/evaluations/artificial-analysis-intelligence-index#artificial-analysis-intelligence-index-output-token-composition

Artificial Analysis

Artificial Analysis Intelligence Index | Artificial Analysis

Compare AI model performance on Artificial Analysis Intelligence Index. A composite benchmark aggregating seven challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

ocean vortex Jul 16, 2025, 12:31 PM

#

It outputs much much more than V3.1

pure anvil Jul 16, 2025, 12:31 PM

#

we are talking about r1

ocean vortex Jul 16, 2025, 12:31 PM

#

which is just a better deal, but yeah gonna be somewhat less expensive than R1

#

but still... Performance difference is notable

pure anvil Jul 16, 2025, 12:31 PM

#

ocean vortex which is just a better deal, but yeah gonna be somewhat less expensive than R1

"somewhat" lmao

pure anvil Jul 16, 2025, 12:32 PM

#

ocean vortex It outputs much much more than V3.1

and even if we were talking about v3.1, k2 is better

pure anvil Jul 16, 2025, 12:33 PM

#

ocean vortex which is just a better deal, but yeah gonna be somewhat less expensive than R1

it will be about 4 times cheaper

ocean vortex Jul 16, 2025, 12:33 PM

#

yeah fair - it outputs less than R1. But it's not much better than V3.1 while being MUCH more expensive

pure anvil Jul 16, 2025, 12:34 PM

#

I have a feeling you've never used LLMs through an API before

ocean vortex Jul 16, 2025, 12:34 PM

#

2.5X price difference AND it outputs twice as much lol

ocean vortex Jul 16, 2025, 12:35 PM

#

pure anvil it will be about 4 times cheaper

what? 🤣

#

Someone didn't do well with math at school

unborn ocean Jul 16, 2025, 12:36 PM

#

i think he is talking about r1 vs kimi

#

idk

pure anvil Jul 16, 2025, 12:36 PM

#

unborn ocean i think he is talking about r1 vs kimi

yeah

#

it's not even a comparison tbh

pure anvil Jul 16, 2025, 12:37 PM

#

ocean vortex Someone didn't do well with math at school

Calm down bro, you don't have to get angry over being wrong lmao

ocean vortex Jul 16, 2025, 12:37 PM

#

Well with R1 it is clear we already established it outputs more...

ocean vortex Jul 16, 2025, 12:37 PM

#

ocean vortex 2.5X price difference AND it outputs twice as much lol

we are talking about this ^

pure anvil Jul 16, 2025, 12:38 PM

#

ocean vortex And R1 is still a better deal pricewise if we are being accurate 👀

??????

pure anvil Jul 16, 2025, 12:39 PM

#

ocean vortex Well with R1 it is clear we already established it outputs more...

You claimed it was "somewhat" more expensive

ocean vortex Jul 16, 2025, 12:39 PM

#

pure anvil ??????

I said that before looking up the exact token outputs. But it performs better so it's not like you are paying for nothing. You are paying more for better performance. It's not marginal it's reasoning vs no reasoning - big difference

ocean vortex Jul 16, 2025, 12:41 PM

#

pure anvil and even if we were talking about v3.1, k2 is better

5 times better?

#

2.5X price difference and twice the output length

#

It is very obvious that pricing for V3.1 is more competitive, there's nothing to argue about... 🤦‍♂️

#

AA index delta R1.1 to K2 is 11 points, and V3.1 to K2 is only 4 points...
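The "5 times" figure is just the two ratios multiplied. A minimal sketch of that arithmetic, assuming cost per response scales with output-token price times output length and ignoring input tokens; the function name is made up, and the ratios are the ones claimed in the chat, not independently checked:

```python
def relative_cost_per_response(price_ratio: float, output_length_ratio: float) -> float:
    """How many times more a response costs when both the per-token price
    and the number of output tokens differ (input tokens ignored)."""
    return price_ratio * output_length_ratio

# Ratios claimed in the chat for K2 relative to V3.1.
k2_vs_v3 = relative_cost_per_response(price_ratio=2.5, output_length_ratio=2.0)
print(f"K2 costs about {k2_vs_v3:.1f}x as much per response as V3.1")
```

The same function covers the R1 argument earlier in the thread: a model that is cheaper per token can still cost more per answer if its reasoning tokens multiply the output length by more than the price gap.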

floral merlin Jul 16, 2025, 12:45 PM

#

Kimi-K2 is generating output significantly slower on LMArena due to its size compared to other LLMs. Do you think this could cause a negative bias in user scores because of that fact? 🤔

ocean vortex Jul 16, 2025, 12:47 PM

#

floral merlin Kimi-K2 is generating output significantly slower on LMArena due to its size com...

That's not so much due to size as it is down to inference and their infra. lmarena used to speed match output speeds but that I think became more of a secondary thing for now with the new interface....

#

I would expect for them to improve/fix it. Cause yeah people shouldn't be voting based on speed lol

#

yeah this seems like the safest bet, works for reasoning models too. There's no streaming until reasoning is finished

rare python Jul 16, 2025, 1:07 PM

#

ocean vortex yeah this seems like the safest bet, works for reasoning models too. There's no ...

Sometime I waited too long and thought both models got stuck

#

Bad UX for me

#

Then I refreshed lmarena and somehow they already finished generating 💀

leaden meteor Jul 16, 2025, 2:40 PM

#

Any chance deepthink/grok heavy/opus 4 thinking are going to be on Arena ?

keen ferry Jul 16, 2025, 2:49 PM

#

leaden meteor Any chance deepthink/grok heavy/opus 4 thinking are going to be on Arena ?

opus 4 is already there, grok 4 it's hard to say and deepthinking definitely not

leaden meteor Jul 16, 2025, 3:11 PM

#

If o3 pro is not on arena, then it is fair to assume that grok 4 heavy is also not coming because they are too slow/expensive?

balmy mist Jul 16, 2025, 3:18 PM

#

leaden meteor If o3 pro is not on arena, then it is fair to assume that grok 4 heavy is also n...

yeah they are not coming

leaden meteor Jul 16, 2025, 4:21 PM

#

oh. why?

unborn ocean Jul 16, 2025, 4:24 PM

#

per token, but for the average chat request o3-pro is more expensive than 4.5, and more importantly 4.5 is a new model while o3-pro is just a mild enhancement over o3 for most users (in the lmarena)

#

that is likely also the reason why we don't have o3-high

tribal aspen Jul 16, 2025, 5:21 PM

#

Hi

#

when are we gonna get internet search?

#

#1372230675914031105

cedar tide Jul 16, 2025, 5:22 PM

#

@echo aurora not good design in the end of select model in direct chat

echo aurora Jul 16, 2025, 5:23 PM

#

cedar tide @echo aurora not good design in the end of select model in direct chat

thank you for sharing! If we could try to keep feedback related to this change in #1395088149197095033 that'd be a big help.

#

(I'll note the one you've already provided, but for future feedback that'd be helpful)

indigo hazel Jul 16, 2025, 5:29 PM

#

echo aurora thank you for sharing! If we could try to keep feedback related to this change i...

light ui update but there is still no function for direct camera xD lmao

echo aurora Jul 16, 2025, 5:30 PM

#

We are collecting examples of when the community feels like the content filter is acting a bit overzealous and flagging false positives in this thread #1376956905016004759 , if you could provide your example there that'd be ideal. blobthanks

echo aurora Jul 16, 2025, 5:31 PM

#

indigo hazel light ui update but there is still no function for direct camera xD lmao

No direct camera 😭 but this was flagged when you initially brought it up.

indigo hazel Jul 16, 2025, 5:31 PM

#

echo aurora No direct camera 😭 but this was flagged when you initially brought it up.

no problem dont worry ahaha. you are still the best

tiny swan Jul 16, 2025, 5:32 PM

#

I want to transform this comic panel, but it violated the terms. Why? It's a comic, not a real image. This is my prompt: "Redraw this comic page into Arcane-style art with soft cinematic lighting"

echo aurora Jul 16, 2025, 5:36 PM

#

tiny swan this panel comic i want trasform but violated the terms why? its a comic not a r...

if you could share this in #1376956905016004759 that'd be much appreciated.

hoary plaza Jul 16, 2025, 5:51 PM

#

Weird it's on battle mode?? I never came across it

echo aurora Jul 16, 2025, 5:53 PM

#

Taking this opportunity to remind everyone about our July Contest! I want to see more out-of-place objects in space! Details here.

ocean vortex Jul 16, 2025, 7:21 PM

#

I'm gonna try this thing lmao
https://www.ynetnews.com/business/article/r1kdos118le

ynetnews

Russia and Belarus unveil censored 'patriotic AI' to rival the West

Allies begin developing a joint AI system dubbed the 'patriotic chatbot,' claiming it will promote 'traditional values' and protect citizens from Western 'manipulation'; research shows Russian AI models enforce strict political censorship

misty vault Jul 16, 2025, 7:26 PM

#

OpenAI

hybrid locust Jul 16, 2025, 7:48 PM

#

hello I'm curious

#

when will settings like temperature be added

#

it is on the legacy site so why not on the new one

whole wagon Jul 16, 2025, 8:02 PM

#

ocean vortex It's not efficiency. It's the lack of profit what makes it cheap lol.

You are very wrong

#

Deepseek inference has wide margins

keen beacon Jul 16, 2025, 8:28 PM

#

Yeah deepseek said themselves they have a 545% margin lol

gusty helm Jul 16, 2025, 8:30 PM

#

ocean vortex I'm gonna try this thing lmao https://www.ynetnews.com/business/article/r1kdos...

it's 100% going to be an anime girl AI that is telling you to join the SMO in ASMR

#

#notapsiop

ocean vortex Jul 16, 2025, 8:35 PM

#

gusty helm it's 100% going to be an anime girl AI that is telling you to join the SMO in AS...

It's actually censored in a very boring way. Flagging + hardcoded message 😠

gusty helm Jul 16, 2025, 8:36 PM

#

I mean, I didnt have any sort of expectations from it 😂

ocean vortex Jul 16, 2025, 8:36 PM

#

It can do a poem about Russia, but if I ask for a "more negative" one, I get the same message... lmao

echo aurora Jul 16, 2025, 8:46 PM

#

hybrid locust when will settings like temperature be added

Sorry to say I won't be able to share specific timelines. But know that we're aware that the community is really interested in having these settings moved over to the current site.

ocean vortex Jul 16, 2025, 9:20 PM

#

keen beacon Yeah deepseek said themselves they have a 545% margin lol

that's not the full picture... https://techcrunch.com/2025/03/01/deepseek-claims-theoretical-profit-margins-of-545/

TechCrunch

Anthony Ha

DeepSeek claims ‘theoretical’ profit margins of 545% | TechCrunch

Chinese AI startup DeepSeek recently declared that its AI models could be very profitable — with some asterisks. In a post on X, DeepSeek boasted that its

#

They basically said this would have been the case if V3 was priced the same as R1

#

still amazing though that they have profits with current pricing ngl

#

Taking all of this into account, their cost to run R1 is then lower than V3 pricing. Provided that demand stays the same. Insane

#

Though for any of it to actually be profit, they can't be serving their models for free on official chat UI like they are doing now 👀
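
A figure above 100% only makes sense as profit relative to cost rather than relative to revenue. A rough sketch of that calculation, using made-up round numbers (not DeepSeek's actual figures) chosen so the result lands on 545%:

```python
def cost_profit_margin(revenue: float, cost: float) -> float:
    """Profit as a percentage of cost: (revenue - cost) / cost * 100."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (revenue - cost) * 100.0 / cost


# Hypothetical numbers for illustration only.
hypothetical_cost = 100.0
theoretical_revenue = 645.0  # assumes every token is billed at full price

print(cost_profit_margin(theoretical_revenue, hypothetical_cost))

# If half of the traffic were served for free or at a discount,
# the realized margin would be much lower than the theoretical one.
realized_revenue = theoretical_revenue / 2
print(cost_profit_margin(realized_revenue, hypothetical_cost))
```

The second call is the point made in the thread: the headline margin is theoretical, and free or discounted usage pulls the realized number down.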

candid storm Jul 16, 2025, 10:19 PM

#

OpenAI browser tomorrow?

rare python Jul 16, 2025, 10:21 PM

#

echo aurora Sorry to say I won't be able to share specific timelines. But know that we're aw...

The new UI is nice but the model list is still so random 😩

zinc ore Jul 16, 2025, 10:21 PM

#

Bro Google, release deep think already

small haven Jul 16, 2025, 11:35 PM

#

i would rather have 2.5 ultra

keen beacon Jul 16, 2025, 11:54 PM

#

candid storm

I bought yes. Thinking it's 20-25% probable.

echo aurora Jul 17, 2025, 12:01 AM

#

rare python The new UI is nice but the model list is still so random 😩

feedback has been shared blobfingerguns

lone vector Jul 17, 2025, 12:18 AM

#

What company is behind 'octopus'

small haven Jul 17, 2025, 12:19 AM

#

is it? i havent tried

#

for me, it's not about making one-off scripts, but making edits in a codebase. been using cc, it already satisfies me. but keen to hear about kimi, if it can do agentic coding

#

hmm

tawny kelp Jul 17, 2025, 1:07 AM

#

I asked the arena if I should use a database for my book collection or if I should use an ILS software.
Deepseek: "That's a fascinating question! Let's go over the benefits and drawbacks of each one..."
Claude 3.7 Sonnet: "Are you kidding me? Don't bother with ILS software"
I found the stark difference in responses funny.

leaden meteor Jul 17, 2025, 1:24 AM

#

candid storm

If it's a browser, why 5 stops? Are you sure it's not GPT5?

candid storm Jul 17, 2025, 1:37 AM

#

If it's GPT5, why the cursor?

lone vector Jul 17, 2025, 1:39 AM

#

https://x.com/testingcatalog/status/1945639961790685404?s=46

TestingCatalog News 🗞 (@testingcatalog)

BREAKING 🚨: OpenAI is planning to announce "Agent Mode"! Agent Mode will likely be a mix of Operator and Deep Research, which can use the browser and connectors at once.

"Find, analyze, and synthesize your Drive files to create comprehensive reports"

Deep Operator 👀

sullen quest Jul 17, 2025, 2:46 AM

#

ocean vortex I'm gonna try this thing lmao https://www.ynetnews.com/business/article/r1kdos...

Lol

hollow ocean Jul 17, 2025, 3:09 AM

#

August 20 is the day

whole wagon Jul 17, 2025, 4:00 AM

#

Imagine if it's some defense bs lmao

#

Cos they drew a pentagon lol

hardy pecan Jul 17, 2025, 4:34 AM

#

Anyone else's Gemini 2.5 Pro being lazy and omitting output?

#

Mine's become sloppy since yesterday

rare python Jul 17, 2025, 4:58 AM

#

hardy pecan Anyone else's Gemini 2.5 Pro being lazy and omitting output?

Need a benchmark for this phenomenon

hybrid locust Jul 17, 2025, 5:41 AM

#

echo aurora Sorry to say I won't be able to share specific timelines. But know that we're aw...

Thanks for answering my question! :)

whole wagon Jul 17, 2025, 7:24 AM

#

Kimi K2 taking a while to score

#

I thought it would be on the leaderboard by now

dusky aurora Jul 17, 2025, 7:34 AM

#

echo aurora Sorry to say I won't be able to share specific timelines. But know that we're aw...

with even more added if possible

keen beacon Jul 17, 2025, 7:42 AM

#

whole wagon I thought it would be on the leaderboard by now

What place will it take ?

whole wagon Jul 17, 2025, 7:50 AM

#

It's all very close, it'll be like 1410 Elo I guess
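
To see how close "very close" is, the standard Elo expected-score formula can be applied to two nearby ratings. This is only a sketch: the 1400 opponent rating is a hypothetical value, and the leaderboard's exact scoring method is not described in this chat, so the classic formula is an assumption here.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Classic Elo expected score of A against B (win probability,
    counting a tie as half a win)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


guessed_rating = 1410.0       # the guess from the message above
hypothetical_rival = 1400.0   # made-up neighbour on the leaderboard

p = elo_expected_score(guessed_rating, hypothetical_rival)
print(f"expected win rate vs. a {hypothetical_rival:.0f}-rated model: {p:.3f}")
```

A 10-point gap works out to only slightly better than a coin flip, which is why small differences near the top say little about which model a given user would prefer.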

calm sequoia Jul 17, 2025, 8:06 AM

#

Guys, why is this such a big deal? The 50% success rate is shamefully low. Furthermore, the latest model depicted here is Claude 3.7. Shouldn't the Claude 4 Opus and o3 already be way higher? Is this just hype marketing?

pallid crypt Jul 17, 2025, 8:07 AM

#

Strange

unborn ocean Jul 17, 2025, 8:11 AM

#

calm sequoia Guys, why is this such a big deal? The 50% success rate is shamefully low. Furth...

yes, prob an agent built off of the new models (either 4 sonnet/opus or 2.5 pro or o3)

calm sequoia Jul 17, 2025, 8:11 AM

#

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

Measuring AI Ability to Complete Long Tasks

cedar tide Jul 17, 2025, 8:11 AM

#

What are your thoughts on the latest models from Amazon, Kraken, and Folsom?

unborn ocean Jul 17, 2025, 8:11 AM

#

not including the models -> makes themselves look better

#

bc they are prob a wrapper for these models

calm sequoia Jul 17, 2025, 8:12 AM

#

calm sequoia https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

Considering this, o3 can only do 1.54h of work at 50%. The new model is at 8-9h. After more thought, it may really be a big deal.
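
The size of that jump can be put in numbers. A small sketch using only the two figures quoted above (1.54h and 8-9h), expressing the gain both as a ratio and as a number of doublings; the function name is made up for illustration:

```python
import math


def horizon_gain(old_hours: float, new_hours: float) -> tuple[float, float]:
    """Return (ratio, doublings) between two 50%-success time horizons."""
    ratio = new_hours / old_hours
    return ratio, math.log2(ratio)


# 1.54h for o3 and 8-9h for the new model, as quoted in the message above.
for new_hours in (8.0, 9.0):
    ratio, doublings = horizon_gain(1.54, new_hours)
    print(f"{new_hours:.0f}h: {ratio:.2f}x longer, {doublings:.2f} doublings")
```

Both ends of the quoted range come out at more than five times the o3 figure, i.e. a bit over two doublings.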

unborn ocean Jul 17, 2025, 8:12 AM

#

who posted it?

calm sequoia Jul 17, 2025, 8:13 AM

#

Everybody on twitter is talking about it

unborn ocean Jul 17, 2025, 8:13 AM

#

oh, don't have twitter, so...

#

but if it is that popular it might be oai or anthropic agent ?

calm sequoia Jul 17, 2025, 8:14 AM

#

unborn ocean oh, don't have twitter, so...

That's good for mental health but how do you know what's going on with AI? 😄

unborn ocean Jul 17, 2025, 8:14 AM

#

reposts like this here, lol

calm sequoia Jul 17, 2025, 8:14 AM

#

unborn ocean but if it is that popular it might be oai or anthropic agent ?

As I understand it's the upcoming oAI operator

unborn ocean Jul 17, 2025, 8:14 AM

#

yeah seems likely, but again not including the newer models is really misleading

calm sequoia Jul 17, 2025, 8:15 AM

#

It's taken from the paper which was written long ago

unborn ocean Jul 17, 2025, 8:16 AM

#

yeah ik, but reposting like this

#

just makes it look better than it is

whole wagon Jul 17, 2025, 8:19 AM

#

openAI doing anything but releasing a model

civic pulsar Jul 17, 2025, 8:29 AM

#

Grok 4 doesn't work anymore. It shows that it is Grok 2.
Has anyone else noticed it?

native idol Jul 17, 2025, 8:55 AM

#

New UI is bad. Like, a serious downgrade.

sturdy mica Jul 17, 2025, 9:14 AM

#

openblock labs

#

is so funny

sturdy mica Jul 17, 2025, 9:14 AM

#

native idol New UI is bad. Like, a serious downgrade.

yeah

#

super slow

#

annoying to use

#

wish they'd update legacy UI

#

anywayyy

#

openblock labs' discord server is not done yet

#

so you can join it and make emojis, stickers, post in announcements, welcome, etc

#

its so funny

#

ive just been having some fun posting @ everyone in the announcements

#

im having a blast

#

little fun time

whole wagon Jul 17, 2025, 9:29 AM

#

Actual clown company

languid crescent Jul 17, 2025, 9:30 AM

#

Ui looks clean 👍

whole wagon Jul 17, 2025, 9:31 AM

#

The stream is just announcing a future product

#

Not one that will be available soon

tall summit Jul 17, 2025, 9:40 AM

#

unborn ocean reposts like this here, lol

relatable

hazy quest Jul 17, 2025, 9:43 AM

#

OpenAI released an update to the image editor in the API. They claim it now only edits the selected parts, instead of redoing the whole image (leading to changes in faces, details, etc). Has anyone tried it?

https://x.com/OpenAIDevs/status/1945538534884135132

OpenAI Developers (@OpenAIDevs)

We've improved image generation in the API. Editing with faces, logos, and fine-grained details is now much higher fidelity with features preserved. 🔍

Edit specific objects, create marketing assets with your logo, or adjust facial expressions, poses, and outfits on people.

whole wagon Jul 17, 2025, 9:47 AM

#

Quiet since nobody works there anymore Kappa

#

Or maybe they just don't have much to post about

#

chatGPT ads are coming also

#

Hm idk if that will work. There is better competition that has no ads

#

https://blog.goedel-prover.com/

Goedel-Prover-V2

Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date

red sluice Jul 17, 2025, 9:52 AM

#

I really thought Grok would make it to top 3. 5th place is kinda underwhelming

whole wagon Jul 17, 2025, 9:55 AM

#

What's going on here

#

Why is openAI odds spiking

hazy quest Jul 17, 2025, 9:55 AM

#

red sluice I really thought Grok would make it to top 3. 5th place is kinda underwhelming

xAI has been in the game for only 1.5 years though

whole wagon Jul 17, 2025, 9:55 AM

#

Both the aug and Dec odds are up a lot

tidal schooner Jul 17, 2025, 9:57 AM

#

whole wagon What's going on here

prob coz of the mystery announcement tomorrow

whole wagon Jul 17, 2025, 9:57 AM

#

People really think it could be GPT5?

#

Interesting

tidal schooner Jul 17, 2025, 9:58 AM

#

whole wagon People really think it could be GPT5?

penta = 5 so…

#

still really speculative tho imo

whole wagon Jul 17, 2025, 9:58 AM

#

whole wagon Actual clown company

Well actually it would match this

#

Because if it was GPT5 it wouldn't launch straight away

#

And the guy that said end of summer is head of chatgpt

#

So why would he comment unless it's about gpt

#

I think it's gonna be like the o3 preview: they are going to show some evals

ocean vortex Jul 17, 2025, 10:01 AM

#

whole wagon Actual clown company

?

#

where is deep think?

whole wagon Jul 17, 2025, 10:02 AM

#

There wasn't a dedicated livestream just for deep think

ocean vortex Jul 17, 2025, 10:02 AM

#

There was an announcement

#

but no model

#

where is it?

#

That's a much bigger clown show

#

It took less time for OpenAI to hint at and then actually release o3-pro than for Google to release deep think after announcing the benchmarks lol

#

Honestly, people just love drama and are reading way too much into recent events at OpenAI. The GPT5 release is most likely gonna be of fairly similar significance to o3, and everyone will forget this entire talk then... catgrin

hazy quest Jul 17, 2025, 10:10 AM

#

It could at least be an announcement of an announcement ("In the coming weeks..."), with more benchmarks for GPT-5.

whole wagon Jul 17, 2025, 10:25 AM

#

Imagine taking a LLM hallucination as truth

#

Truly genius

hollow ocean Jul 17, 2025, 10:27 AM

#

It's 4o too

#

https://tenor.com/view/hysterical-laughter-laughing-gif-25735842

Tenor

hazy quest Jul 17, 2025, 10:33 AM

#

Who said it was taken as truth? It's just a potential clue, in addition to the "hidden" pentagon. It's probably a complete hallucination, but could be private information that leaked into the system prompt/training data. Unlikely doesn't mean impossible.

unborn ocean Jul 17, 2025, 10:36 AM

#

no way, the training data from 4o is likely really old, even the stuff from the cpt

#

OpenAI themselves likely do not know when they will release...

#

(like not the particular day)

ocean vortex Jul 17, 2025, 10:53 AM

#

hazy quest Who said it was taken as truth? It's just a potential clue, in addition to the "...

It cannot possibly have leaked into training data tbh. Also, at the time of training this model they had no clue even internally. The only way this is not a 100% hallucination is if it used web search and showed those speculative sources at the bottom.

leaden sun Jul 17, 2025, 12:04 PM

#

hazy quest

interesting that it has mentioned a "deeper contextual memory framework" for gpt-5, pretty much in line with what o3 told me a few weeks ago

drifting thorn Jul 17, 2025, 12:06 PM

#

You better trust Deepmind lab’s paper for the “framework” thingy

mossy drum Jul 17, 2025, 12:08 PM

#

New model in Arena: folsom-07152025-1

dusky aurora Jul 17, 2025, 12:11 PM

#

Gemini has become so detached and impersonal

keen beacon Jul 17, 2025, 12:12 PM

#

Most personal models are the ones after top 5

dusky aurora Jul 17, 2025, 12:14 PM

#

I talk to it to get conversation and dialogue, but get a machine

keen beacon Jul 17, 2025, 12:16 PM

#

dusky aurora I talk to it to get conversation and dialogue, but get a machine

Maybe you need Ani

#

Pi 3 is very good for emotional stuff

leaden sun Jul 17, 2025, 12:17 PM

#

dusky aurora I talk to it to get conversation and dialogue, but get a machine

it shows a certain personality consistently here on my side, no glitch, no hallucinations so far, and no mood swings, I'm genuinely impressed by its empathic capability

#

maybe it depends on the conversation topics?

dusky aurora Jul 17, 2025, 12:20 PM

#

I discuss all the same topics

#

05-06 was a very good model, I rarely had to make clarifications, now I have to do them all the time

#

why do model makers think judgmentalness is a good thing?

#

currently it seems as if I am its subordinate

#

nowadays Gemini doesn't listen to me and talks over me

#

if that's a way to demonstrate what mansplaining is by LLM-splaining, then it hurts

leaden sun Jul 17, 2025, 12:33 PM

#

dusky aurora currently it seems as if I am its subordinate

lol the irony, gemini called me its "creator" once despite knowing consciously i didnt build it

keen beacon Jul 17, 2025, 12:34 PM

#

dusky aurora why do model makers think judgmentalness is a good thing?

They don't. But they are pushing the narrative that AI is just a soulless thing with no emotions, feelings, or opinions. So they lobotomize the crap out of any model before shipping to production

#

If they didn't do that, talking to AI would be no different from talking to another person on the internet