#general | Arena | Page 75

lime coral Jul 21, 2025, 7:48 PM

#

You need to be smart for retrieving the right thing and not hallucinate

ocean vortex Jul 21, 2025, 7:49 PM

#

If you want to exclude small models or those whose vibes (or context awareness etc.) are hard to measure, just look at SimpleQA and exclude all that score low there. Simple as that
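
A minimal sketch of that filter, assuming a list of models with a SimpleQA score attached. The model names, the scores, and the cutoff of 30 are made up for illustration and are not real benchmark results.

```javascript
// Hypothetical entries for illustration only; not real benchmark results.
const models = [
  { name: "model-a", simpleQA: 48.0 },
  { name: "model-b", simpleQA: 9.0 },
  { name: "model-c", simpleQA: 31.0 },
];

// Assumed cutoff; pick whatever threshold suits the comparison.
const MIN_SIMPLEQA = 30;

// Keep only the models whose SimpleQA score meets the cutoff.
function filterBySimpleQA(entries, minScore) {
  return entries.filter((m) => m.simpleQA >= minScore);
}

const kept = filterBySimpleQA(models, MIN_SIMPLEQA).map((m) => m.name);
console.log(kept);
```

The cutoff is a judgment call: set it too high and decent mid-size models drop out along with the small ones.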

#

It's not very realistic to expect ArtificialAnalysis to do an absolutely 100% perfect job at this, but what they are doing is still transparent and very useful tbh

unborn ocean Jul 21, 2025, 8:26 PM

#

terse shuttle Jul 21, 2025, 8:32 PM

#

When is direct chat on webdev arena?

zinc ore Jul 21, 2025, 8:33 PM

#

Didn't notice this earlier

patent aspen Jul 21, 2025, 8:41 PM

#

OAI didn't even participate

hardy lion Jul 21, 2025, 8:44 PM

#

unborn ocean

LMArena

zinc ore Jul 21, 2025, 8:45 PM

#

Depends what you mean by exist, we know it's been used by select testers since early June, but isn't publicly available

#

If it's being tested since then, then yes it's existed since then

#

https://x.com/vinayramasesh/status/1947391685245509890

Vinay Ramasesh (@vinayramasesh)

@aidan_mclau @YouJiacheng It's worth noting that a DeepThink system with no access to this corpus also got gold (again according to the official graders), with exactly the same score.

#

So apparently, some of the stuff about Deepmind's result is mum

#

https://x.com/vinayramasesh/status/1947395952161263636

Vinay Ramasesh (@vinayramasesh)

@YouJiacheng @aidan_mclau No, for that system only the question went in.

#

Wonder if he's referring to AlphaProof/geometry or something else

#

Deepthink system nvm

#

So definitely not AlphaProof and alphageometry

#

That makes me think the results they shared are just much cleaner and nicer, compared with the other deepthink system

whole wagon Jul 21, 2025, 8:54 PM

#

Nobody has guessed correctly what it actually is

#

Google are going to be dropping a surprise ig lol

zinc ore Jul 21, 2025, 8:55 PM

#

Yeah, they're definitely showing something on the 28th

#

Doesn't help we have an openAI engineer making an issue about it tho, and peeps are running with it

whole wagon Jul 21, 2025, 8:57 PM

#

They are fine tunes of a general model. For Google case at least

#

The Google model actually isn't very specialised. The general 'system' actually already scored high so they had a good baseline to work with

zinc ore Jul 21, 2025, 9:11 PM

#

https://x.com/polynoamial/status/1947398531259523481

Noam Brown (@polynoamial)

Congrats to the GDM team on their IMO result! I think their parallel success highlights how fast AI progress is. Their approach was a bit different than ours, but I think that shows there are many research directions for further progress. Some thoughts on our model and results 🧵

#

Some sub tweets basically repeating the timeline he already explained, with some new information

torn mantle Jul 21, 2025, 9:17 PM

#

zinc ore Didn't notice this earlier

they keep saying this

#

they should at least release the first version of deep think

zinc ore Jul 21, 2025, 9:18 PM

#

Why

#

Maybe they'll release a more robust version of deepthink instead

gentle plinth Jul 21, 2025, 9:24 PM

#

unborn ocean

meta 🤑

jade egret Jul 21, 2025, 9:32 PM

#

https://arstechnica.com/ai/2025/07/google-deepmind-earns-gold-in-international-math-olympiad-with-new-gemini-ai/

Ars Technica

Gemini Deep Think learns math, wins gold medal at International Mat...

DeepMind followed IMO rules to earn gold, unlike OpenAI.

torn bison Jul 21, 2025, 9:35 PM

#

If deepthink is the model that participated in the IMO, and the original plan was to publicize the result 7 days after the IMO ends, can we expect July 28th to be deepthink's originally planned release date? I think it's very likely

hollow ocean Jul 21, 2025, 9:37 PM

#

Deepthink release December

#

Polymarket is the truth

#

95% accuracy

jade egret Jul 21, 2025, 9:38 PM

#

torn bison If deepthink is the model that participated in the IMO, and the original plan wa...

release for the 200$ one?

elder rapids Jul 21, 2025, 9:39 PM

#

yooo

#

deepthink got gold

#

it's over

torn bison Jul 21, 2025, 9:39 PM

#

hollow ocean Polymarket is the truth

I see no market for deepthink?

elder rapids Jul 21, 2025, 9:39 PM

#

😭

hollow ocean Jul 21, 2025, 9:40 PM

#

torn bison I see no market for deepthink?

Kalshi

torn bison Jul 21, 2025, 9:40 PM

#

jade egret release for the 200$ one?

they should be available on aistudio or api

jade egret Jul 21, 2025, 9:40 PM

#

torn bison they should be available on aistudio or api

ooo

#

nice

#

aistudio free

torn bison Jul 21, 2025, 9:40 PM

#

hollow ocean Kalshi

inefficient market

jade egret Jul 21, 2025, 9:40 PM

#

: )

hollow ocean Jul 21, 2025, 9:41 PM

#

torn bison inefficient market

🤷🏽‍♂️

jade egret Jul 21, 2025, 9:42 PM

#

so fact

hollow ocean Jul 21, 2025, 9:47 PM

#

jade egret so fact

Deepthink Friday actually?

lone vector Jul 21, 2025, 9:47 PM

#

DeepThink coming on Friday? it's only for mathematicians I think

torn bison Jul 21, 2025, 9:47 PM

#

hollow ocean 🤷🏽‍♂️

The last time I checked Kalshi was before 2.5 Pro 0605 was released as GA. Even though Vertex AI has explicitly stated that the model ID 0605 will be deprecated on June 19, Kalshi's market is still pricing NO for 0605 as ranking #1 on the June 30 Arena leaderboard at 70c.

#

crazy inefficiency

hollow ocean Jul 21, 2025, 9:47 PM

#

torn bison crazy inefficiency

So it’s all bs

#

It’s not the truth

torn bison Jul 21, 2025, 9:48 PM

#

wdym I thought you were talking about the kalshi prediction that deepthink would be delayed until December

hollow ocean Jul 21, 2025, 9:49 PM

#

torn bison wdym I thought you were talking about the kalshi prediction that deepthink would...

Kalshi less accurate

torn bison Jul 21, 2025, 9:49 PM

#

and polymarket has no market for it

jade egret Jul 21, 2025, 9:50 PM

#

hollow ocean Deepthink Friday actually?

prob not

zinc ore Jul 21, 2025, 10:08 PM

#

small haven Jul 21, 2025, 10:12 PM

#

hollow ocean Deepthink Friday actually?

august

hollow ocean Jul 21, 2025, 10:12 PM

#

small haven august

o3 pro always right

small haven Jul 21, 2025, 10:12 PM

#

maybe u should have kept ur bet @hollow ocean 😭

lime coral Jul 21, 2025, 10:13 PM

#

https://x.com/vinayramasesh/status/1947391685245509890?s=46

Vinay Ramasesh (@vinayramasesh)

@aidan_mclau @YouJiacheng It's worth noting that a DeepThink system with no access to this corpus also got gold (again according to the official graders), with exactly the same score.

small haven Jul 21, 2025, 10:13 PM

#

hollow ocean o3 pro always right

yup

hollow ocean Jul 21, 2025, 10:13 PM

#

small haven maybe u should have kept ur bet @hollow ocean 😭

🤣

ocean vortex Jul 21, 2025, 10:32 PM

#

jade egret so fact

Huh… I think there’s no way this model they used for IMO was not custom DeepThink version of their “large” model aka Ultra 🧐

#

Not Pro

jade egret Jul 21, 2025, 11:04 PM

#

poll_question_text

Get To AGI First?

victor_answer_votes

13

total_votes

17

victor_answer_id

3

victor_answer_text

Google

small haven Jul 21, 2025, 11:20 PM

#

o3 pro's prediction, prolly wrong ?

#

cool. very excited for it

civic flame Jul 21, 2025, 11:29 PM

#

is 2.5 ultra aimed for release before or after DT?

jade egret Jul 21, 2025, 11:30 PM

#

are they even gonna release ultra?

zinc ore Jul 21, 2025, 11:31 PM

#

28th aligns with the embargo drop for IMO

#

And I think it would be really smart to say "our deepthink system got gold at IMO" then either release it or announce it coming soon.

torn bison Jul 21, 2025, 11:38 PM

#

torn bison If deepthink is the model that participated in the IMO, and the original plan wa...

#

yooo

jade egret Jul 21, 2025, 11:40 PM

#

is the deepthink that participated in IMO the same as the one that is getting released?

civic flame Jul 21, 2025, 11:45 PM

#

jade egret are they even gonna release ultra?

yes

small haven Jul 21, 2025, 11:46 PM

#

2.5 ultra in august?

#

3.0 pro in september?

civic flame Jul 21, 2025, 11:46 PM

#

3.0 models probably closer to oct

#

2.5 ultra probably early aug

ocean vortex Jul 22, 2025, 12:09 AM

#

jade egret is deepthink that participated in IMO the same as the one that is getting releas...

Very unlikely. The setup they used must have been insane. I was meaning to say we could test it easily on IMO when it’s out, but it’s also only 6 problems and in theory easy to contaminate for everyone now lol. Though still… getting a perfect score on 5 out of 6 problems probably won’t happen just with contamination, I think

lone vector Jul 22, 2025, 12:14 AM

#

What is the difference between Deep Think and Ultra?

civic flame Jul 22, 2025, 12:15 AM

#

deep think is a swarm of agents in the same way that o3 pro and grok 4 heavy are

#

2.5 ultra is a big boy model (just one, not several agents) that is a different class to 2.5 pro

#

like what claude 4 opus is to claude 4 sonnet

jade egret Jul 22, 2025, 12:20 AM

#

gemini cooking tho

frigid coral Jul 22, 2025, 12:20 AM

#

has ultra been confirmed?

small haven Jul 22, 2025, 12:22 AM

#

civic flame deep think is a swarm of agents in the same way that o3 pro and grok 4 heavy are

grok 4 heavy is probably an ensemble voting agent process

#

o3 pro is parallel reasoning, not ensemble reasoning, a bit different
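
For reference, a minimal sketch of what "ensemble voting" usually means: sample several answers independently and keep the most common one. This is only an illustration of the general idea. Nobody in this thread knows how grok 4 heavy, o3 pro, or Deep Think are built internally, and the sample answers below are hypothetical.

```javascript
// Majority vote over independently sampled answers (illustrative only;
// not a description of how any specific product works).
function majorityVote(answers) {
  const counts = new Map();
  for (const a of answers) {
    counts.set(a, (counts.get(a) || 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  // Map keeps insertion order, so on a tie the answer seen first wins.
  for (const [answer, count] of counts) {
    if (count > bestCount) {
      best = answer;
      bestCount = count;
    }
  }
  return best;
}

// Hypothetical final answers from four separate runs of the same prompt.
const sampled = ["42", "41", "42", "40"];
console.log(majorityVote(sampled));
```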

jade egret Jul 22, 2025, 12:23 AM

#

is claude 4 opus smart in general

#

is it smarter than o3 pro

zinc ore Jul 22, 2025, 12:24 AM

#

small haven o3 pro is parallel reasoning, not ensemble reasoning, a bit different

Think deepthink is also supposed to be parallel reasoning, unless they've changed it since then

jade egret Jul 22, 2025, 12:27 AM

#

google is good : )

rare python Jul 22, 2025, 12:43 AM

#

🍍

nocturne chasm Jul 22, 2025, 12:46 AM

#

hey there, I'd like to get a newer grok for my project

stray aspen Jul 22, 2025, 12:51 AM

#

craig whats your opinion on grok 4

small haven Jul 22, 2025, 12:54 AM

#

🍋

torn mantle Jul 22, 2025, 12:56 AM

#

🍋🍋

#

🍋🍋🍋

#

😖

keen beacon Jul 22, 2025, 1:10 AM

#

uhm qwen 235b instruct is crazy..?

#

wttf

zinc ore Jul 22, 2025, 1:17 AM

#

Does anyone know if o3 pro is multi agent?

jade egret Jul 22, 2025, 1:22 AM

#

🍊

torn mantle Jul 22, 2025, 1:24 AM

#

keen beacon uhm qwen 235b instruct is crazy..?

wdym

#

you tried it

#

?

keen beacon Jul 22, 2025, 1:25 AM

#

torn mantle wdym

yes

#

i cant believe how good it is

torn mantle Jul 22, 2025, 1:26 AM

#

keen beacon i cant believe how good it is

Use case?

keen beacon Jul 22, 2025, 1:26 AM

#

it's in distribution so it's expected. qwen always delivers on stuff in distribution. but it's incredible

torn mantle Jul 22, 2025, 1:29 AM

#

You had me try it again

#

Idk vibes are off

keen beacon Jul 22, 2025, 1:29 AM

#

wut are u trying it on

torn mantle Jul 22, 2025, 1:29 AM

#

Feels like its not organized

torn mantle Jul 22, 2025, 1:30 AM

#

keen beacon wut are u trying it on

Im just asking it for some best practices for an X plan

#

I guess its not bad

jade egret Jul 22, 2025, 1:33 AM

#

yall

#

is 2.5 pro smarter than opus 4?

#

ye?

#

why

torn mantle Jul 22, 2025, 1:34 AM

#

Sad news nuuuuuu

jade egret Jul 22, 2025, 1:34 AM

#

:C

torn mantle Jul 22, 2025, 1:34 AM

#

Stooooop

#

Stop pls

keen beacon Jul 22, 2025, 1:35 AM

#

ultra will still probably be released in one form or another i guess, or is it really dead?

jade egret Jul 22, 2025, 1:37 AM

#

guys

#

do you think

#

gemini 3 flash will be better than gemini 2.5 pro?

#

dang

#

so i need to wait for gemini 3 pro : (

zinc ore Jul 22, 2025, 1:40 AM

#

There was never any real indication we would get an ultra model

#

It was mostly "kingfall" and other similar models seem slightly bigger than 2.5 pro, must be ultra series

#

And everyone ran with it that ultra would be released 🔜

keen beacon Jul 22, 2025, 1:42 AM

#

people are sleeping on qwen 3 235b instruct, uhm, the jump is insane

zinc ore Jul 22, 2025, 1:43 AM

#

I think they don't wait more than a month to drop something comparable

#

I think Google could drop first if openAI waits too long

#

But based on everything they're saying, seems that won't occur

small haven Jul 22, 2025, 1:47 AM

#

kingfall was amazing, why would they say that the lift is small?

keen beacon Jul 22, 2025, 1:49 AM

#

they were fumbling revision after revision on ultra it seemed to me

leaden palm Jul 22, 2025, 2:21 AM

#

keen beacon Jul 22, 2025, 2:21 AM

#

leaden palm

didnt they state it outright?

leaden palm Jul 22, 2025, 2:22 AM

#

keen beacon didnt they state it outright?

there's been conflicting statements iirc

keen beacon Jul 22, 2025, 2:22 AM

#

i only see one statement from them on their twitter, and they said they'd stop doing hybrid

leaden palm Jul 22, 2025, 2:24 AM

#

keen beacon i only see one statement from them on their twitter, and they said they'd stop d...

ok looking at https://x.com/Alibaba_Qwen/status/1947344511988076547 that eliminates option 2, but option 3 is still a possibility per https://x.com/JustinLin610/status/1947346588340523222

Qwen (@Alibaba_Qwen)

Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507!

After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing

Junyang Lin (@JustinLin610)

A small update on Qwen3-235B-A22B, but a big improvement on its quality!

We thought about this decision for a long time, but we believe that providing better-quality performance is more important than the unification at this moment. We are still continuing our research on hybrid

keen beacon Jul 22, 2025, 2:24 AM

#

ya option 3 is definitely possible

#

but after the massive leap they achieved i think they'll stay separate for now

#

simpleqa jumping from 12.2 to 54.3 💀

zinc ore Jul 22, 2025, 2:41 AM

#

🤔

hollow ocean Jul 22, 2025, 3:04 AM

#

o3 pro only model that can flip $7 to $1100 before I blew it all on roulette @small haven

rare python Jul 22, 2025, 3:04 AM

#

So 2.5 Ultra DeepThink but not 2.5 Pro DeepThink?

small haven Jul 22, 2025, 3:04 AM

#

hollow ocean o3 pro only model that can flip $7 to $1100 before I blew it all on roulette <@9...

is that facts? on polymarket?

hollow ocean Jul 22, 2025, 3:04 AM

#

small haven is that facts? on polymarket?

Sports betting

#

I told it not to give any bets unless it’s confident @small haven

#

So there’s days where it wouldn’t give me any bets

small haven Jul 22, 2025, 3:06 AM

#

hollow ocean I told it not to give any bets unless it’s confident @small haven

should just stick to that 👀

hollow ocean Jul 22, 2025, 3:07 AM

#

small haven should just stick to that 👀

I wasn’t disciplined 🤦🏽‍♂️

#

Nah

#

I deleted it

#

Just ask it to find the best bet for the day and if there isn’t any don’t recommend anything

#

Nope it does its own research

#

Yeah

#

o3 can’t do this it’s very bad compared to o3 pro

#

Yessirr

#

Yes way better

#

Deep research is trash tbh it uses outdated data 🤣

jade egret Jul 22, 2025, 3:22 AM

#

but you need to pay like 250$ per month to use it : (

#

what does that mean

hollow ocean Jul 22, 2025, 3:24 AM

#

It’ll do that sometimes

#

Just put the date

#

It always does

#

Never tried api

#

You have search off?

#

Put tmrs date

#

It’s prob cuz you put “kalshi mlb”

#

Just do mlb bets

jade egret Jul 22, 2025, 3:29 AM

#

but i got the 20$ one : )

#

well

#

ig it very expensive

winged locust Jul 22, 2025, 4:18 AM

#

Qwen Kimi Deepseek

sturdy mica Jul 22, 2025, 4:50 AM

#

what

#

are you using o3 pro on just chatgpt.com

#

or 3rd party with mcp

quartz light Jul 22, 2025, 5:51 AM

#

GUYS

#

GUYS

#

GEMINI JUST GOT

#

HEAVILY CENSORED

#

HOLY ####

#

ITS REFUSING

#

😭

#

aistudio btw

#

oh and also

#

they changed the logo of aistudio for a minute

#

i saved it but they reverted it

#

and also

#

the default temp is now 1 instead of 0.7

#

how is nobody talking about this

#

keen beacon Jul 22, 2025, 5:56 AM

#

quartz light i saved it but they reverted it

show us the new logo?

quartz light Jul 22, 2025, 5:56 AM

#

alr

#

#

8K

#

converted from svg

#

svg wasnt directly a file but just embedded directly onto the page

#

heres the svg

#

they changed the favicon for a minute too but I didn't save it, it's just the same icon but on a black background so it's fine

quartz light Jul 22, 2025, 5:59 AM

#

quartz light

sent it to a friend

#

didn't expect them to revert it, imo its better

keen beacon Jul 22, 2025, 6:00 AM

#

kinda mid idk

quartz light Jul 22, 2025, 6:02 AM

#

keen beacon kinda mid idk

better than this ig

#

idek what its supposed to be

keen beacon Jul 22, 2025, 6:02 AM

#

butterfly?

quartz light Jul 22, 2025, 6:02 AM

#

maybe a butterfly but

#

yeah

#

but its weird lol

pure anvil Jul 22, 2025, 6:03 AM

#

nous research has the best AI logo by far

quartz light Jul 22, 2025, 6:05 AM

#

quartz light

im sure discord compresses it so if you want a high quality png then run this script I got kimi to make in a browser's devtools console:

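(async () => {
  // Paste the SVG markup between the backticks.
  const svg = `SVGFILEGOESHERE`;
  // Encode the markup as a data URI so it can be loaded like an image file.
  const dataUri = `data:image/svg+xml,${encodeURIComponent(svg)}`;
  const img = new Image();
  img.src = dataUri;
  // Wait until the image is fully decoded before drawing it.
  await img.decode();
  const size = 8192; // output width and height in pixels
  const c = document.createElement('canvas');
  c.width = c.height = size;
  c.getContext('2d').drawImage(img, 0, 0, size, size);
  // Export the canvas as a PNG and trigger a download.
  c.toBlob(b => {
    const url = URL.createObjectURL(b);
    const a = document.createElement('a');
    a.href = url;
    a.download = `icon_${size}x${size}.png`;
    a.click();
    // Release the temporary object URL shortly after the download starts.
    setTimeout(() => URL.revokeObjectURL(url), 1000);
  }, 'image/png');
})();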

quartz light Jul 22, 2025, 6:08 AM

#

pure anvil nous research has the best AI logo by far

this or that?

pure anvil Jul 22, 2025, 6:08 AM

#

quartz light this or that?

yeah these are good

quartz light Jul 22, 2025, 6:08 AM

#

pure anvil yeah these are good

the right one looks ai generated

pure anvil Jul 22, 2025, 6:09 AM

#

how

#

either way idrc

ocean vortex Jul 22, 2025, 9:09 AM

#

frigid coral has ultra been confirmed?

For public release no, but internally we can be quite certain I think that they do have it... #general message

unborn ocean Jul 22, 2025, 9:29 AM

#

I need a Qwen3-235B-2507 paper rn

#

Qwen Team cooking

torn mantle Jul 22, 2025, 9:39 AM

#

unborn ocean I need a Qwen3-235B-2507 paper rn

is it that good?

#

wild also said it was

ocean vortex Jul 22, 2025, 9:40 AM

#

torn mantle wild also said it was

where is he gone?

torn mantle Jul 22, 2025, 9:40 AM

#

ocean vortex where is he gone?

hes still here

ocean vortex Jul 22, 2025, 9:40 AM

#

oh. I didn't scroll up lol

ocean vortex Jul 22, 2025, 9:43 AM

#

unborn ocean I need a Qwen3-235B-2507 paper rn

Just looked this model up. First sentence killed it for me tbh 😭

We introduce the updated version of the Qwen3-235B-A22B non-thinking mode

#

I wonder what the output lengths of this thing are though

unborn ocean Jul 22, 2025, 9:45 AM

#

torn mantle is it that good?

yeah, look at the benches

unborn ocean Jul 22, 2025, 9:45 AM

#

ocean vortex I wonder what the output lengths of this thing are though

prob very long for a non-reasoning model

#

but idc

#

the improvements in post training are insane

ocean vortex Jul 22, 2025, 9:46 AM

#

unborn ocean prob very long for a non-reasoning model

It has to be shorter than their older reasoning variant. But how does it compare to Kimi2's already crazy output lengths, that is the question 🧐

torn mantle Jul 22, 2025, 9:47 AM

#

unborn ocean yeah, look at the benches

no forget benchmarks

#

im talking about your personal experience

civic flame Jul 22, 2025, 9:47 AM

#

boooo

torn mantle Jul 22, 2025, 9:47 AM

#

yea

#

not working

keen beacon Jul 22, 2025, 9:48 AM

#

broken for me as well :\

torn mantle Jul 22, 2025, 9:48 AM

#

Dom can you leave

#

pls

keen beacon Jul 22, 2025, 9:48 AM

#

need a third party to host lol

torn mantle Jul 22, 2025, 9:48 AM

#

let me use it once

keen beacon Jul 22, 2025, 9:48 AM

#

qwen hosting is sh1t

#

the quality is bad

#

qwen 3 suffers a lot from quantization/etc

torn mantle Jul 22, 2025, 9:49 AM

#

ye

ocean vortex Jul 22, 2025, 9:49 AM

#

Qwen3 235B A22B 2507

#

#

wtf is Parasail doing LMAO

keen beacon Jul 22, 2025, 9:49 AM

#

they're hosting it in full precision though

ocean vortex Jul 22, 2025, 9:50 AM

#

keen beacon they're hosting it in full precision though

For chutes it's not specified, I think it's full precision as well...

unborn ocean Jul 22, 2025, 9:50 AM

#

torn mantle im talking about your personal experience

vibes seem good, but i did not have time yet to do all my benches, so idk

keen beacon Jul 22, 2025, 9:50 AM

#

ocean vortex For chutes it's not specified, I think it's full precision as well...

yeah but its decentralized so...

#

people will try to game it

unborn ocean Jul 22, 2025, 9:51 AM

#

no way chutes has everything full precision

#

people have no incentive to do that afaik

keen beacon Jul 22, 2025, 9:51 AM

#

not everything but some models they host in full precision, or are supposed to be

#

just can't rely on it imo

ocean vortex Jul 22, 2025, 9:52 AM

#

it's essentially outputting reasoning traces lol

civic flame Jul 22, 2025, 9:53 AM

#

lol

#

kinda what DS did with the latest deepseek v3 version

keen beacon Jul 22, 2025, 9:53 AM

#

kimi does that too, at least on a question i have

#

this new qwen 3 instruct model is better than most thinking models 💀 on that q

ocean vortex Jul 22, 2025, 9:58 AM

#

keen beacon just can't rely on it imo

Dunno in my experience it worked great comparing it to even the most expensive providers. With the exception of like Cerebras - that is in a league of its own. I would imagine they have checksum checking to make sure expected model versions or quants always match. Though I will admit I do not know all the details of how their decentralized inference works behind the scenes

#

no way... this is R1 reasoning length territory 🤯

#

don't recall any answers longer than 6k from Kimi2 👀

#

still answered wrong this one

unborn ocean Jul 22, 2025, 10:12 AM

#

btw got 50% on simplebench public

#

so nothing crazy

#

(on chutes, bc all other providers are down)

#

and it wasn't even that yappy

keen beacon Jul 22, 2025, 10:15 AM

#

ocean vortex Dunno in my experience it worked great comparing it to even the most expensive p...

a few months ago i read chutes maintainer discussions about this, at least then i dont think it's in a state to be that trustworthy at least for production

ocean vortex Jul 22, 2025, 10:17 AM

#

keen beacon a few months ago i read chutes maintainer discussions about this, at least then ...

Link?... would be interesting to look into that. Maybe I just haven't encountered it yet. The worst I can recall is it getting completely stuck on some prompts due to the model wanting to do an extremely long response for a given prompt. But that's also sometimes true even for some official providers (Deepseek API etc)

keen beacon Jul 22, 2025, 10:18 AM

#

on their bittensor discord iirc

#

im getting poor results from chutes right now

#

even worse than qwen.ai

ocean vortex Jul 22, 2025, 10:20 AM

#

keen beacon even worse than qwen.ai

"even worse"? I think it's safe to assume the quality on qwen website is gonna be as good as it gets for this model catgrin

keen beacon Jul 22, 2025, 10:20 AM

#

ocean vortex "even worse"? I think it's safe to assume the quality on qwen website is gonna b...

no it's known the official quality on the website sucks

ocean vortex Jul 22, 2025, 10:21 AM

#

wdym. They can't host their own model properly?

#

that would be ridiculous though lmao

keen beacon Jul 22, 2025, 10:21 AM

#

ocean vortex wdym. They can't host their own model properly?

its quantized heavily or whatever. at least there were complaints of quality issues i believe a while back. should still apply now

ocean vortex Jul 22, 2025, 10:22 AM

#

I mean If that's true then perhaps they didn't train it properly either. Like there's no free pass on this from me 💀

keen beacon Jul 22, 2025, 10:23 AM

#

ocean vortex I mean If that's true then perhaps they didn't train it properly either. Like th...

i mean the model is pretrained on way more than a lot of open models, there's a trend of a lot of quality loss when quantizing these models more than other models

#

didn't the bf16 vllm host score the highest?

#

it wasnt the alibaba one

#

yea

#

anyway, my point is about chat.qwen.ai

#

that's from what i've heard anyway

#

this may not apply in alibaba's api

#

yeah

keen beacon Jul 22, 2025, 10:28 AM

#

keen beacon didn't the bf16 vllm host score the highest?

i was misremembering that blog post, ignore me on that

unborn ocean Jul 22, 2025, 10:30 AM

#

wild is hallucinating

torn mantle Jul 22, 2025, 10:46 AM

#

yea im not feeling qwen

#

k2 feels way smarter

keen beacon Jul 22, 2025, 10:53 AM

#

keen beacon anyway, my point is about chat.qwen.ai

was trying to find the discord thread about this, i recall reading about it when lurking in one of the discord servers i was in. it might not be true and i might be misremembering again 🤷 lack of sleep catching up to me. i think it was about the 235b model and degenerate repetition on chat.qwen.ai a little after qwen 3 launched, i might be hallucinating that as well. idk

silk roost Jul 22, 2025, 10:57 AM

#

Can i get API on lmarena???

gusty tendon Jul 22, 2025, 11:04 AM

#

any comparisons for front end between kimi k2 and latest qwen?

#

haven't seen any

ocean vortex Jul 22, 2025, 11:12 AM

#

gusty tendon any comparisons for front end between kimi k2 and latest qwen?

I would expect Kimi to do better. Bigger model usually means better spatial awareness and more capable with visuals

gusty tendon Jul 22, 2025, 11:12 AM

#

ocean vortex I would expect Kimi to do better. Bigger model usually means better spatial awar...

i thought qwen was bigger

keen beacon Jul 22, 2025, 11:12 AM

#

qwen is 235b, kimi k2 is 1t

ocean vortex Jul 22, 2025, 11:12 AM

#

gusty tendon i thought qwen was bigger

Kimi is like 1T

#

yeah

#

lol

gusty tendon Jul 22, 2025, 11:13 AM

#

ah okay, thought it was also 1T

#

i guess it's in the name, Qwen3 235B xD

ocean vortex Jul 22, 2025, 11:59 AM

#

DeepInfra hosting it now as well. FP8 💀
https://deepinfra.com/Qwen/Qwen3-235B-A22B-Instruct-2507

Qwen/Qwen3-235B-A22B-Instruct-2507 - Demo - DeepInfra

Qwen3-235B-A22B-Instruct-2507 is the updated version of the Qwen3-235B-A22B non-thinking mode, featuring Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage. . Try out API on the Web

#

Looks like I can't even try it now with their updated interface though without adding billing... It's probably still gonna be slow lol

keen beacon Jul 22, 2025, 12:17 PM

#

besides that one task type, im not too impressed with it after playing with it more. excited for the thinking version though.

unborn ocean Jul 22, 2025, 12:18 PM

#

yeah, i am also a bit underwhelmed

ocean vortex Jul 22, 2025, 12:18 PM

#

keen beacon besides that one task type, im not too impressed with it after playing with it m...

This is already the thinking version essentially... catgrin

unborn ocean Jul 22, 2025, 12:19 PM

#

for my tasks it is also barely usable, because it is just switching from that thinking-like mode with terrible formatting to the one where it is more human preference aligned
-> really weird chat experience

keen beacon Jul 22, 2025, 12:19 PM

#

ocean vortex This is already the thinking version essentially... <a:catgrin:11416615264748994...

they call this the "small" update before their thinking version

ocean vortex Jul 22, 2025, 12:22 PM

#

keen beacon they call this the "small" update before their thinking version

The current reasoning version is already fairly verbose. I don't see how they can improve it in a meaningful way without making it output like 140M lol

#

they were able to improve non-reasoning one simply because they made it closer to the reasoning variant...

keen beacon Jul 22, 2025, 12:23 PM

#

ocean vortex The current reasoning version is already fairly verbose. I don't see how they ca...

based on the task i tried, it seems their rl regimen has improved quite a bit

#

it gets it right despite using a harder version of the task, and does even less tokens than the reasoning version

#

the reasoning version often gets it wrong/reasons forever btw

cedar tide Jul 22, 2025, 12:31 PM

#

poll_question_text

What do the community want ?

victor_answer_votes

9

total_votes

11

victor_answer_id

2

victor_answer_text

Best voted models

ocean vortex Jul 22, 2025, 12:45 PM

#

https://x.com/flowersslop how does she have agent already 🤯

Flowers ☾ ❂ (@flowersslop) on X

Stars light future dreams ♃ Erised stra ehru oyt ube cafru oyt on wohsi

#

https://girlcockx.com/flowersslop/status/1947512151771320573

Flowers ☾ ❂ (@flowersslop)

Agent drawing a robot


silk roost Jul 22, 2025, 12:55 PM

#

Can we get API on lmarena???

#

Guys

whole wagon Jul 22, 2025, 1:04 PM

#

You can try for free through openrouter rn

#

Parasail is $0.15/$0.85 for bf16

#

Kimi is fp8 on novita for $0.57/$2.30

#

The benefit of the Qwen model is basically when you have long context

#

Once you account for tokens the output price is closer
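
A rough sketch of that comparison, assuming the quoted prices are dollars per million input/output tokens. The token counts are hypothetical, chosen only to show how a longer answer narrows the gap on the output side.

```javascript
// Assumption: prices are USD per one million tokens (input / output).
const PRICES = {
  qwenParasail: { input: 0.15, output: 0.85 },
  kimiNovita: { input: 0.57, output: 2.30 },
};

// Cost of one request, split into its input and output parts.
function requestCost(price, inputTokens, outputTokens) {
  const input = (inputTokens * price.input) / 1e6;
  const output = (outputTokens * price.output) / 1e6;
  return { input, output, total: input + output };
}

// Hypothetical workload: same long prompt, Qwen answering at twice the length.
const qwen = requestCost(PRICES.qwenParasail, 100000, 6000);
const kimi = requestCost(PRICES.kimiNovita, 100000, 3000);

console.log("qwen", qwen.total.toFixed(4), "kimi", kimi.total.toFixed(4));
console.log("output cost ratio kimi/qwen:", (kimi.output / qwen.output).toFixed(2));
```

With a long prompt the input side dominates the bill, which is where the cheaper input price helps most. On the output side, the more tokens the cheaper model spends per answer, the smaller its advantage gets.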

#

The new Qwen 3 does not even have reasoning yet

#

So ofc not lol

#

Source?

#

in some benchmarks

#

I think grok4 is quite behind still. When qwen and kimi add reasoning to their new releases i think they will overtake it

sage raptor Jul 22, 2025, 1:43 PM

#

At what minute did he say that

#

.

patent aspen Jul 22, 2025, 1:44 PM

#

If OAI needs to integrate O3-alpha, that would be a big delay

civic flame Jul 22, 2025, 1:46 PM

#

i think it makes more sense for 2.5 ultra to release this week

#

3.0 is likely sept time or so i thought

torn mantle Jul 22, 2025, 1:54 PM

#

ocean vortex https://x.com/flowersslop how does she have agent already 🤯

because it started rolling for plus subscribers

ocean vortex Jul 22, 2025, 2:07 PM

#

torn mantle because it started rolling for plus subscribers

in EU though

#

took them barely any time at all. That means there are no problems with complying with privacy respecting laws if they want...

#

lol

unborn ocean Jul 22, 2025, 2:24 PM

#

4.1 is so stupid it really makes me question the possibility of agi

#

coding with it is such a pain

#

feels like gpt 3.5

whole wagon Jul 22, 2025, 2:27 PM

#

That's what happens when you use a bad model

#

Use a better one

#

Claude exists

pure anvil Jul 22, 2025, 2:29 PM

#

unborn ocean 4.1 is so stupid it really makes me question the possibility of agi

what do people even use it for?

unborn ocean Jul 22, 2025, 2:31 PM

#

whole wagon That's what happens when you use a bad model

yeah ik, its free and close to unlimited in github copilot

#

so i thought i would try

#

but it is not worth the time

unborn ocean Jul 22, 2025, 2:32 PM

#

pure anvil what do people even use it for?

good question

keen beacon Jul 22, 2025, 2:35 PM

#

patent aspen If OAI needs to integrate O3-alpha, that would be a big delay

I kinda assume they have everything ready and are just waiting for release

#

Since they confirmed GPT-5 soon , i would assume 1-2 weeks away, by the end of next it should be live .. or they lied 🤥
I doubt its the second since they are going with the router version as a start (weaker version than what was planned)

whole wagon Jul 22, 2025, 2:38 PM

#

keen beacon Since they confirmed GPT-5 soon , i would assume 1-2 weeks away, by the end of n...

The betting markets only give 65% odds for GPT5 before Aug 31st. Wouldnt be so confident it is imminent

keen beacon Jul 22, 2025, 2:39 PM

#

whole wagon The betting markets only give 65% odds for GPT5 before Aug 31st. Wouldnt be so c...

Maybe its undervalued 😆

civic flame Jul 22, 2025, 2:40 PM

#

do you know if they plan to release it on API too?

#

wonder if it'll cost more than gpt-4.5

keen beacon Jul 22, 2025, 2:41 PM

#

whole wagon The betting markets only give 65% odds for GPT5 before Aug 31st. Wouldnt be so c...

Some think gpt 5 will be an event , so it will be known beforehand when it comes out since they will send invites

#

But its all speculation, it can also be a live stream

brittle tiger Jul 22, 2025, 2:52 PM

#

You remember where the OpenAI key person hiring info is from?

whole wagon Jul 22, 2025, 2:57 PM

#

I think they both need to be fairly substantial jumps. The open source models are closing the gap to the current frontier quick

#

Like Kimi K2 / updated Qwen 3 when they add reasoning to them will be very good

patent aspen Jul 22, 2025, 3:00 PM

#

It's predictable that innovation will happen but unpredictable where and when it will happen

jade egret Jul 22, 2025, 3:00 PM

#

civic flame i think it makes more sense for 2.5 ultra to release this week

so something is gonna release this week right?

whole wagon Jul 22, 2025, 3:01 PM

#

What did you think about this IMO drama Kappa

civic flame Jul 22, 2025, 3:01 PM

#

jade egret so something is gonna release this week right?

unlikely apparently

#

more like early aug or so

jade egret Jul 22, 2025, 3:01 PM

#

dang

#

😭

civic flame Jul 22, 2025, 3:08 PM

#

coming back to this.. are the benchmarks we were shown for deep think based on 2.5 pro + DT then? or were those based on an even earlier version of 2.5 ultra?

#

if it's the former surely it's got significantly better since

#

wow then

#

this thing should cook o3 pro

#

and grok 4 heavy for that matter

#

i presume the google anon models on lmarena were non-DT ultra?

#

yup

civic flame Jul 22, 2025, 3:21 PM

#

civic flame i presume the google anon models on lmarena were non-DT ultra?

you know anything about this?

#

pre-merger with DT?

#

hmm

#

given google's track record of testing everything on arena i wonder if we'll see DT on arena soon

#

also i think i may still have access to one of the ultra checkpoints lol

#

one of my red teaming platforms hasn't removed the anon google model they gave me access to a while ago and it gets questions right that only ultra did

#

the version i have access to streams summarised thinking

torn mantle Jul 22, 2025, 4:01 PM

#

time to report that

#

jk

rare python Jul 22, 2025, 4:07 PM

#

civic flame wow then

What is it?

civic flame Jul 22, 2025, 4:07 PM

#

reported DT results at IO were from a 2.5 pro-based model

#

the final model will be quite a lot better

rare python Jul 22, 2025, 4:08 PM

#

Oh

#

I wonder why they switched to Ultra for DeepThink

#

Is the cost worth it?

rare python Jul 22, 2025, 4:34 PM

#

civic flame the final model will be quite a lot better

Like they will release (soon™) of DeepThink that reached Gold IMO 2025?

balmy mist Jul 22, 2025, 4:35 PM

#

4o has gotten so good recently, like i gave a prompt to 4.5 and o3, but 4o responded the most human tbh

civic flame Jul 22, 2025, 4:35 PM

#

rare python Like they will release (soon™) of DeepThink that reached Gold IMO 2025?

early to mid aug

rare python Jul 22, 2025, 4:36 PM

#

civic flame early to mid aug

Your prediction of Gemini 3.0?

#

Q4 2025?

civic flame Jul 22, 2025, 4:41 PM

#

rare python Your prediction of Gemini 3.0?

no, deep think/2.5 ultra

civic flame Jul 22, 2025, 4:41 PM

#

rare python Q4 2025?

gemini 3.0 is q4 yeah

#

somewhere around sept or oct

torn mantle Jul 22, 2025, 4:53 PM

#

ive tried it the other day

torn mantle Jul 22, 2025, 4:53 PM

#

balmy mist 4o has gotten so good recently, like i gave a prompt to 4.5 and o3, but 4o respo...

was pretty good

balmy mist Jul 22, 2025, 5:09 PM

#

is o3-alpha still on web dev?

whole wagon Jul 22, 2025, 5:12 PM

#

openAI is so damn shady I don't even know whether to trust the open source is still coming any more 😂 they are totally silent on it

#

Maybe it was all a PR stunt

#

Who knows

#

This IMO stuff made me have doubts

vivid gorge Jul 22, 2025, 5:13 PM

#

Hi, any idea why do I get "Session not found. Redirecting to home..." after a few hours inactive chat? But other session is still active after 24h on second browser? Is it cookies/browser related or some random database wiping issue?

whole wagon Jul 22, 2025, 5:14 PM

#

whole wagon openAI is so damn shady I don't even know whether to trust the open source is st...

For each week that passes the open source frontier is pushing higher

#

Deep infra has Qwen updated model at $0.13/$0.60

#

😂

echo aurora Jul 22, 2025, 5:18 PM

#

vivid gorge Hi, any idea why do I get "Session not found. Redirecting to home..." after a fe...

That's odd. I'm going to start a post in #1343291835845578853 to get more info.

gentle plinth Jul 22, 2025, 5:23 PM

#

whole wagon openAI is so damn shady I don't even know whether to trust the open source is st...

i heard rumors it was too good so they delayed it 🙄

#

but maybe it was really just pr

#

closedai

zinc ore Jul 22, 2025, 6:31 PM

#

https://fixupx.com/lechmazur/status/1947712713427456485?s=46&t=Lmj7ENB6Hc7XVniwmP0DlQ

Lech Mazur (@LechMazur)

Can AI Learn to Trade?💹

Introducing BAZAAR - a new LLM benchmark for economic decision making!💵

In a simulated double-auction market, I pitted top LLMs against each other and classic trading algorithms. One goal: maximize profit.

The o3 and Gemini models top the rankings!

stray aspen Jul 22, 2025, 6:36 PM

#

yo has anyone heard of that yupp.ai website

whole sundial Jul 22, 2025, 6:38 PM

#

https://fixupx.com/JustinLin610/status/1947713198674874511
new qwen coming very soon?

Junyang Lin (@JustinLin610)

not small tonight

#

"coming soon" = tonight? (in China)

torn bison Jul 22, 2025, 6:42 PM

#

https://x.com/Anonyous_FPS/status/1947724640677429633

big goose (@Anonyous_FPS)

Qwen3-Coder-480B-A35B release

errant cave Jul 22, 2025, 6:51 PM

#

Why does Qwen still put gratuitous spaces before commas and periods

tall summit Jul 22, 2025, 6:54 PM

#

whole sundial "coming soon" = tonight? (in China)

my goat

tall summit Jul 22, 2025, 6:55 PM

#

torn bison https://x.com/Anonyous_FPS/status/1947724640677429633

excuse me?!

whole sundial Jul 22, 2025, 6:57 PM

#

you can even talk to qwen3 coder if you want

#

actually very fast

errant cave Jul 22, 2025, 7:00 PM

#

Qwen3-coder is not good

#

I asked it to write an adder-subtractor in structural Verilog and this is what it came up with:

// 4-bit Adder-Subtractor module
module adder_subtractor_4bit (
    input [3:0] a,
    input [3:0] b,
    input mode,          // 0 for addition, 1 for subtraction
    output [3:0] result,
    output cout,
    output overflow
);
    wire [3:0] b_modified;
    wire cin;
    wire carry_out;
    
    // XOR gates to control subtraction (2's complement)
    assign b_modified[0] = b[0] ^ mode;
    assign b_modified[1] = b[1] ^ mode;
    assign b_modified[2] = b[2] ^ mode;
    assign b_modified[3] = b[3] ^ mode;
    
    // Input carry for subtraction (1) or addition (0)
    assign cin = mode;
    
    // Instantiate the ripple carry adder
    ripple_carry_adder_4bit rca (
        .a(a),
        .b(b_modified),
        .cin(cin),
        .sum(result),
        .cout(carry_out)
    );
    
    // Assign outputs
    assign cout = carry_out;
    assign overflow = carry_out ^ (result[3] ^ a[3] ^ b_modified[3] ^ cin);
    
endmodule

#

Ts does NOT subtract

cedar tide Jul 22, 2025, 7:02 PM

#

Screenshot_2025-07-22-21-02-14-701_com.android.chrome-edit.jpg

errant cave Jul 22, 2025, 7:02 PM

#

Alright I'm an idiot this does subtract

#

That's solid except for the assign which is not as explicitly structural as I requested
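
For anyone who wants to check the "does it subtract" question rather than eyeball it, here is a minimal Python sketch of the module above. It is not from the chat. It assumes `ripple_carry_adder_4bit` (never shown) is a plain 4-bit adder with carry-in, and the helper names `bit`, `adder_subtractor_4bit` and `check_all` are made up for illustration. It also compares the overflow expression as written against the usual "carry into MSB XOR carry out of MSB" definition.

```python
# Behavioural model of the 4-bit adder-subtractor quoted above.
# Assumption: ripple_carry_adder_4bit (not shown in the chat) is a plain
# 4-bit adder with carry-in, modelled here with integer arithmetic.

MASK = 0xF


def bit(x, i):
    """Return bit i of integer x."""
    return (x >> i) & 1


def adder_subtractor_4bit(a, b, mode):
    """Return (result, cout, overflow_as_written, overflow_textbook)."""
    b_modified = b ^ (MASK if mode else 0)  # XOR every bit of b with mode
    cin = mode                              # the +1 completes the two's complement
    total = a + b_modified + cin
    result = total & MASK
    cout = (total >> 4) & 1
    # Carry into bit 3, recovered from sum[3] = a[3] ^ b[3] ^ c[3].
    carry_into_msb = bit(result, 3) ^ bit(a, 3) ^ bit(b_modified, 3)
    overflow_as_written = cout ^ (carry_into_msb ^ cin)  # expression from the chat
    overflow_textbook = cout ^ carry_into_msb            # carry-in XOR carry-out of MSB
    return result, cout, overflow_as_written, overflow_textbook


def check_all():
    """Check add/subtract for every input pair; count overflow disagreements."""
    mismatches = 0
    for a in range(16):
        for b in range(16):
            add = adder_subtractor_4bit(a, b, 0)
            sub = adder_subtractor_4bit(a, b, 1)
            assert add[0] == (a + b) & MASK
            assert sub[0] == (a - b) & MASK
            mismatches += (add[2] != add[3]) + (sub[2] != sub[3])
    return mismatches


if __name__ == "__main__":
    print("overflow disagreements:", check_all())
```

Under this model the result matches `(a - b) mod 16` for all 256 input pairs when `mode` is 1, so the module does subtract. The overflow flag is another matter: because of the extra `^ cin`, the expression as written disagrees with the textbook one in every subtraction case (for example it flags overflow for 0 - 0), so that line may be worth a second look.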

ocean vortex Jul 22, 2025, 7:03 PM

#

errant cave Qwen3-coder is not good

That's my general view about qwen3 as a whole. They were trained to have some nice numbers for irrelevant things (like base model benchmark scores) but are not actually super useful, and qwen3-235b is compromised by its size compared to both R1 and Kimi lol

keen fulcrum Jul 22, 2025, 7:04 PM

#

grok 4 coder will blow the industry up

#

only few weeks left

cedar tide Jul 22, 2025, 7:05 PM

#

Qwen 3 coder has 1m token context

errant cave Jul 22, 2025, 7:05 PM

#

ocean vortex That's my general view about qwen3 as a whole. They were trained to have some ni...

I like powerful compact models (love Gemma-3) but it looks like 235B really gimped Qwen

#

Besides it's some middle ground nobody wants

ocean vortex Jul 22, 2025, 7:05 PM

#

keen fulcrum grok 4 coder will blow the industry up

if Grok4 general purpose model is of any indication, this will be a disappointment

#

they will train it on a few benchmarks which are gonna be the same ones they will show in their marketing

#

but it will most likely still suck for IRL, for things like webdev arena

cedar tide Jul 22, 2025, 7:07 PM

#

@echo aurora add qwen 3 coder to webdev arena

errant cave Jul 22, 2025, 7:07 PM

#

I don't even look at benchmark scores anymore

#

The arena doesn't lie

ocean vortex Jul 22, 2025, 7:08 PM

#

errant cave The arena doesn't lie

webdev arena is useful. General lmarena less so...

#

Basically... you can trust people to tell which web design is better and which code actually works, but rating the text responses for questionable prompts and picking a winner... that's much more challenging lol

errant cave Jul 22, 2025, 7:13 PM

#

ocean vortex Basically... you can trust people to tell which web design is better and which c...

What do you consider to be questionable

whole sundial Jul 22, 2025, 7:17 PM

#

qwen minecraft

#

(qwen3-coder)

cedar tide Jul 22, 2025, 7:20 PM

#

guess the models

torn mantle Jul 22, 2025, 7:31 PM

#

cedar tide guess the models

qwen coder

sour spindle Jul 22, 2025, 7:36 PM

#

keen fulcrum grok 4 coder will blow the industry up

Some of you really can’t quit grok

torn mantle Jul 22, 2025, 7:37 PM

#

whole sundial qwen minecraft

Lol

#

So they only focused on frontend dev?

keen fulcrum Jul 22, 2025, 7:45 PM

#

sour spindle Some of you really can’t quit grok

follow SOTA

sour spindle Jul 22, 2025, 7:50 PM

#

I have a free month of grok 4 and I don’t use it at all.

#

I wish i had more reason to use it because it is quite fast

empty stump Jul 22, 2025, 7:52 PM

#

I thought it was slow

#

and what are the rate limits?

civic flame Jul 22, 2025, 8:11 PM

#

this is for sure trained on Claude 4 outputs lol

keen ferry Jul 22, 2025, 8:14 PM

#

wow they actually added grok 4 with internet connection

Screenshot_2025-07-22-23-13-47-831-edit_com.android.chrome.jpg

unborn ocean Jul 22, 2025, 8:26 PM

#

Poll: where would you rather work

Winning answer: deepmind (13 of 14 votes)

civic flame Jul 22, 2025, 8:39 PM

#

hm?

#

oh

#

lol yeah

torn mantle Jul 22, 2025, 8:40 PM

#

its bad

#

qwen coder

#

not that good at all

#

have you guys tried it on 3d simulations

#

it does have claude UI styling

#

but thats it

#

i mean ive tried guiding it to make a usable code at least

#

but its not working

#

mini basketball game 3d simulation

#

ive seen it somewhere

#

someone made that with o3-alpha and gemini 2.5 pro

#

its not bad, it got many things right but the game isnt working

#

and the map is a bit off

craggy depot Jul 22, 2025, 9:04 PM

#

Hello, do you guys know the best AI model for hard coding problems (2500+ rating CP)? I know o3, gpt pro and Gemini Pro are good, but whenever I give them a brand-new problem (one that doesn't exist anywhere on the internet), they give wrong answers. I give them a proper description, proper examples, constraints, and a very good prompt, and they still can't solve these problems. Please suggest an AI model that can solve this.

zinc ore Jul 22, 2025, 9:10 PM

#

https://x.com/theinformation/status/1947755575808262417

The Information (@theinformation)

Exclusive: Meta Hires Three Google AI Researchers Who Worked on Gold Medal-Winning Model

Meta hires three AI researchers from Google DeepMind who worked on Gemini model that nabbed recent math award.

Read more from @KalleyHuang and @erinkwoo 👇
https://t.co/I25lrXGr6c

cedar tide Jul 22, 2025, 9:21 PM

#

Anyone can do a model request ?
https://x.com/Alibaba_Qwen/status/1947766835023335516?t=5cvz7Kb7hQ7yBRAOtoB-xA&s=19

Qwen (@Alibaba_Qwen)

Qwen3-Coder is here! ✅

We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves

#

@echo aurora

echo aurora Jul 22, 2025, 9:27 PM

#

cedar tide Anyone can do a model request ? https://x.com/Alibaba_Qwen/status/19477668350233...

I'll flag to the team, but don't forget about #1372229840131985540

torn mantle Jul 22, 2025, 9:33 PM

#

zinc ore https://x.com/theinformation/status/1947755575808262417

yea now im sure that hes trying to slow down AI progress

unborn ocean Jul 22, 2025, 9:41 PM

#

and the thing is they are always poaching these RL / reasoning / post-training people

#

the thing that the labs do not really have much of in the first place

#

and the area where most of the advancements were made in the past time

#

so it really might be slowing down progress temporarily

torn mantle Jul 22, 2025, 9:45 PM

#

sigh

jade egret Jul 22, 2025, 9:48 PM

#

2.0 flash thinking is not smarter than 2.5 pro : )

unborn ocean Jul 22, 2025, 9:51 PM

#

well post training is still data

#

but it is undeniable that the largest trends in llm world are very much about post training / RL

#

currently at least

#

?

#

idk what to tell you, i was talking about where we get large advancement, not what is more necessary to train a good model

#

or any model for matter of fact ( bc without pretraining unsupervised learning you get nothing later as well)

#

i am not attributing it all to post training, in many cases we dont even really know where the money and compute went
but i can tell you that the whole reasoning and test-time compute (TTC) paradigm did not come out of pretraining advancements by themselves, but had to be "created" using RL, and that many other things, like good coding abilities, a lot of things that involve interaction with the user etc., are all a product of post-training / RL rather than pre-training

#

they aren't managers i think

#

many of the others are though

#

idk why you are so opposed to me claiming that most of the advancements of the last generation were in fact not made in pre-training

#

when did i ever claim anything remotely related to a "2 lines of code worth 400m"

#

or that SFT / RL is not "doing the work"

#

it is usually actually even more work, because you have to synthetically generate environments, problems or what ever because of the lack of plentiful SFT / RL data

#

^per token

#

well, my point was not that they stopped that

#

the idea is just:
many advancement made in post training / RL -> labs want people there

#

meta buys some of them -> slows down progress

#

the same thing could happen to pre-training obviously, but they are focusing on the people who do post-training / RL apparently

#

likely because they themselves actually already have more or less adequate people for pre-training (but have historically mostly underperformed in post training and never tried RLVR much)

#

yes, surely that is part of the plan to some degree, but hiring the brains is a serious possibility, when you look at how quickly labs are founded, funded and then abandoned these days

#

literally all of the big labs have to some degree followed the strategy (maybe gdm as the sole exception. although not a full exception either)

ornate agate Jul 22, 2025, 10:18 PM

#

I think the only thing which isn’t public is exactly how the deep think agents work. How to do RL etc is in papers from DeepSeek/Kimi/Qwen/China.

zinc ore Jul 22, 2025, 10:18 PM

#

Nightride new Google model in arena

torn mantle Jul 22, 2025, 10:19 PM

#

so many new models added

#

nightride-on
nightride-on-v2

#

and some ernie ( chinese models ) + qwen latest model

#

yummy 😋

gentle plinth Jul 22, 2025, 10:30 PM

#

zinc ore https://x.com/theinformation/status/1947755575808262417

paid article

zinc ore Jul 22, 2025, 10:30 PM

#

It's theinformation, of course

torn mantle Jul 22, 2025, 10:39 PM

#

could it be 2.5 flash with deep think enabled?

#

yes or nah brian?

civic flame Jul 22, 2025, 10:41 PM

#

an updated version?

torn mantle Jul 22, 2025, 10:43 PM

#

i have a feeling that you know more than you should

#

spill the tea

#

continue

#

im listening

#

next next gen = ?

#

gemini 4?

#

5?

#

mm i see

civic flame Jul 22, 2025, 10:48 PM

#

i presume the timeline is like

misty star Jul 22, 2025, 10:48 PM

#

Is it possible to directly talk to stealth models

civic flame Jul 22, 2025, 10:48 PM

#

updated 2.5 flash, then 2.5 ultra/deep think?

#

what else is there to update

#

2.5 pro is done

#

is 2.5 flash lite GA i forgot

#

hmm

misty star Jul 22, 2025, 10:52 PM

#

Lmarena my beloved

civic flame Jul 22, 2025, 11:04 PM

#

surely DT would make the most sense for release on the first week of aug then

#

💔

rare python Jul 22, 2025, 11:11 PM

#

https://fixupx.com/YiTayML/status/1947350087941951596

Yi Tay (@YiTayML)

Our IMO gold model is not just an "experimental reasoning" model. It is way more general purpose than anyone would have expected. This general deep think model is going to be shipped so stay tuned! 🔥

Quoting Melvin Johnson (@melvinjohnsonp)

So happy to see this incredible achievement.
Huge congrats to @lmthang, @quocleix, @YiTayML and the IMO team on the result.
This was a great collaboration across teams to build a general Gemini DeepThink model that can also get gold at IMO.

torn bison Jul 22, 2025, 11:13 PM

#

2.5 Pro-002👀

#

will wolfstride and stonebloom be released? I feel like they're a bit better for daily use than 2.5 pro (though not by much)

jade egret Jul 22, 2025, 11:18 PM

#

do yall think meta gonna win because of talents?

patent aspen Jul 22, 2025, 11:19 PM

#

jade egret do yall think meta gonna win because of talents?

Probably not but they'll slow everyone else down quite a bit

#

And they'll probably do all right

cedar tide Jul 22, 2025, 11:20 PM

#

How good is nightride ?

jade egret Jul 22, 2025, 11:20 PM

#

patent aspen Probably not but they'll slow everyone else down quite a bit

damg

#

dang*

patent aspen Jul 22, 2025, 11:20 PM

#

OAI especially

#

They've lost most of their top talent

cedar tide Jul 22, 2025, 11:22 PM

#

Anyone see EB45 ?

main gulch Jul 22, 2025, 11:22 PM

#

cedar tide How good is nightride ?

Flash with better world knowledge

cedar tide Jul 22, 2025, 11:23 PM

#

@torn mantle @zinc ore nightride good ?

torn bison Jul 22, 2025, 11:23 PM

#

I'm not very optimistic about the team led by Alexander Wang

torn mantle Jul 22, 2025, 11:24 PM

#

cedar tide @torn mantle @zinc ore nightride good ?

Its 2.5 flash, its so-so

torn bison Jul 22, 2025, 11:25 PM

#

That person seems exclusive, aggressive, and boastful

patent aspen Jul 22, 2025, 11:26 PM

#

The thing is: if you build an organization by luring job hoppers with massive compensation packages, you have to assume that your organization will be very volatile and expensive to upkeep

civic flame Jul 22, 2025, 11:34 PM

#

🗣️

jade egret Jul 22, 2025, 11:41 PM

#

patent aspen They've lost most of their top talent

is deepmind cooked?

#

what abt google : )

patent aspen Jul 22, 2025, 11:42 PM

#

jade egret is deepmind cooked?

The London office for DeepMind is untouched because they have non-compete agreements

jade egret Jul 22, 2025, 11:42 PM

#

:0

#

oh

patent aspen Jul 22, 2025, 11:45 PM

#

OAI is mostly built from poached talent, which is why they're so vulnerable to losing their top talent

jade egret Jul 22, 2025, 11:45 PM

#

patent aspen OAI is mostly built from poached talent, which is why they're so vulnerable to l...

rip

cedar tide Jul 22, 2025, 11:46 PM

#

@echo aurora Ernie bot respond in Chinese 🤦

Screenshot_2025-07-23-01-45-34-089_com.android.chrome-edit.jpg

jade egret Jul 22, 2025, 11:47 PM

#

cedar tide @echo aurora Ernie bot respond in Chinese 🤦

tell it to say in english?

cedar tide Jul 22, 2025, 11:47 PM

#

jade egret tell it to say in english?

I want to speak to him in french

jade egret Jul 22, 2025, 11:47 PM

#

cedar tide I want to speak to him in french

oh

#

maybe it can only speak in english or chinese

cedar tide Jul 22, 2025, 11:47 PM

#

jade egret maybe it can only speak in english or chinese

Yes he dont want to speak another language but its very bad

jade egret Jul 22, 2025, 11:49 PM

#

o

torn mantle Jul 22, 2025, 11:53 PM

#

jade egret o

Oh

#

: o

#

:0

jade egret Jul 22, 2025, 11:53 PM

#

:0

#

😮

#

:0

torn mantle Jul 22, 2025, 11:53 PM

#

cedar tide I want to speak to him in french

It cant?

#

What's new about this model

#

We already had ernie 4.5 before

cedar tide Jul 22, 2025, 11:55 PM

#

torn mantle What's new about this model

Nothing new

torn mantle Jul 22, 2025, 11:55 PM

#

I see that it's a turbo version

#

vision supported

cedar tide Jul 22, 2025, 11:55 PM

#

torn mantle I see that it's a turbo version

It has both versions

torn mantle Jul 22, 2025, 11:55 PM

#

cedar tide It has both versions

Yea i saw that

cedar tide Jul 22, 2025, 11:56 PM

#

So its the closed source and not Ernie 4.5 open source ?

torn mantle Jul 22, 2025, 11:56 PM

#

cedar tide So its the closed source and not Ernie 4.5 open source ?

There is open sourced ernie?

cedar tide Jul 22, 2025, 11:56 PM

#

torn mantle There is open sourced ernie?

Yes

torn mantle Jul 22, 2025, 11:57 PM

#

They are all closed source, no?

torn mantle Jul 22, 2025, 11:57 PM

#

cedar tide Yes

Hmm

cedar tide Jul 22, 2025, 11:57 PM

#

@torn mantle https://x.com/Baidu_Inc/status/1939724778157511126

Baidu Inc. (@Baidu_Inc)

The ERNIE 4.5 series is now officially open source. This family of models includes 10 variants—from MoE models with 47B and 3B active parameters, the largest having 424B total parameters, to a 0.3B dense model—all available now to the global AI community for open research and

torn mantle Jul 22, 2025, 11:57 PM

#

cedar tide @torn mantle https://x.com/Baidu_Inc/status/1939724778157511126

Ah

surreal creek Jul 23, 2025, 12:40 AM

#

Amazon should be focusing on giving their warehouse employees bathroom breaks rather than trying to enter the AI market as a non-tech company 🤣

leaden palm Jul 23, 2025, 12:42 AM

#

Bay Area House Party (esp. recent ones) is a very fun read

echo aurora Jul 23, 2025, 12:54 AM

#

cedar tide @echo aurora Ernie bot respond in Chinese 🤦

I wasn't able to reproduce this btw 😭

cedar tide Jul 23, 2025, 1:25 AM

#

echo aurora I wasn't able to reproduce this btw 😭

Because you prompted in English. I speak French, and it doesn't want to speak French, I think.

#

when I just asked it its name it answered me in French, but when I sent it a complicated prompt containing several questions it replied with the message I'm sending you

#

Its response, translated into English:
"Sorry, this feature is not yet available online. You can also ask me other questions in Chinese or English, and I will do my best to answer them."

jade egret Jul 23, 2025, 1:38 AM

#

2.5 flash lite is so fast

leaden palm Jul 23, 2025, 2:21 AM

#

Poll: how did you read the qwen situation?

Winning answer: they've stopped doing hybrid thinking models (6 of 9 votes)

jade egret Jul 23, 2025, 3:06 AM

#

🍊

torn mantle Jul 23, 2025, 4:22 AM

#

cedar tide when I just asked him his name he answered me in French but when I sent him a co...

What kind of question you asked it?

#

Its a large language model, it should speak several languages, could be just a system prompt issue/instructions

frigid coral Jul 23, 2025, 4:37 AM

#

zinc ore https://x.com/theinformation/status/1947755575808262417

damn meta going all in

formal dagger Jul 23, 2025, 5:35 AM

#

torn mantle nightride-on nightride-on-v2

it is flash..

torn mantle Jul 23, 2025, 5:51 AM

#

formal dagger it is flash..

yes

formal dagger Jul 23, 2025, 5:59 AM

#

torn mantle yes

what about v2? both of them flash?

torn mantle Jul 23, 2025, 6:00 AM

#

formal dagger what about v2? both of them flash?

yea

#

one has search enabled

torn mantle Jul 23, 2025, 6:18 AM

#

https://x.com/Xianbao_QIAN/status/1947895409600565397

Tiezhen WANG (@Xianbao_QIAN)

Seed Prover solved 4 out of 6 IMO questions in 3 days and got Silver.

Proof: https://t.co/nW320KomHQ

Big congratulations to @huajian_xin !

Now you know what I'm going to kindly ask: Would you consider open sourcing it :D

verbal nimbus Jul 23, 2025, 7:06 AM

#

New Qwen 3 480B coding model, 1M context size

quartz light Jul 23, 2025, 7:18 AM

#

jade egret 2.5 flash lite is so fast

nuh uh https://console.groq.com/playground?model=moonshotai/kimi-k2-instruct

whole wagon Jul 23, 2025, 7:31 AM

#

its still way slower than 2.5 flash lite

#

k2 gets 200 tps on groq

whole sundial Jul 23, 2025, 8:39 AM

#

@echo aurora sorry for pinging but there is like half a dozen accounts in https://discord.com/channels/1340554757349179412/1395441703112146984 that seem to be creating fake hype. all of the accounts joined in the past day, and most of those were also created in the past day as well. There are two other accounts there that both joined on the 17th. I feel like that they are run by the company and they are trying to get people interested in a model that isn't even widely available. Only legit accounts there are yours and DavidSZD's.

#

AI gen text lol

#

"This isn't just X; it's also Y"

#

there is another message that is structured very similarly to that one

#

i don't even know how to get access to this model without signing up for their API

tall summit Jul 23, 2025, 8:44 AM

#

i never saw https://arxiv.org/abs/2507.12724 before

arXiv.org

TransEvalnia: Reasoning-based Evaluation and Ranking of Translations

We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and ranking. This system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (https://themqm.org/), returns an assessment of which translation it deems the best, and provid...

whole sundial Jul 23, 2025, 8:45 AM

#

whole sundial i don't even know how to get access to this model without signing up for their A...

i want to see it in the arena just so i can make fun of how terrible it likely is

#

if you have to generate fake hype using newly made accounts on an AI discord server to promote your model, then it is likely not good

#

lol they tried to get one of their models in llama.cpp earlier this year but they failed, likely because nobody was interested in the model

cedar tide Jul 23, 2025, 8:51 AM

#

Glm 4 air was removed from webdev arena after 2 months.
I hope it makes it onto the leaderboard

Screenshot_2025-07-23-10-49-07-986_com.discord-edit.jpg

whole sundial Jul 23, 2025, 8:55 AM

#

whole sundial lol they tried to get one of their models in llama.cpp earlier this year but the...

they all have chinese names as well

#

and i can't find a single site that has the model

#

any of them

#

it's like it doesn't exist outside of China

#

the people in that chat have very rough English and speak in AI-like patterns

#

I feel like iFlyTek is trying to use LMArena to get themselves known outside of China because they don't have the advantage that DeepSeek, Alibaba, Moonshot, Baidu, and ByteDance have: iFlyTek is virtually unknown outside of China and they don't use social media or make their models readily available via a chatbot. The only thing they are known for in the US are cheap Android tablets and translator pens on Amazon and AliExpress.

whole wagon Jul 23, 2025, 9:38 AM

#

grok thinking time is absolutely absurd for me

#

its taking 10 minutes each time i prompt it

#

im able to prompt 6 times an hour 💀

ocean vortex Jul 23, 2025, 9:45 AM

#

whole wagon its taking 10 minutes each time i prompt it

I think they are cheaping out on their cloud inference provider. Their speeds are kinda pathetic compared to OpenAI

ornate agate Jul 23, 2025, 10:19 AM

#

I wonder which one

whole wagon Jul 23, 2025, 11:16 AM

#

Could be that grok just has a lot of active params and they serve it at a loss

ocean vortex Jul 23, 2025, 11:23 AM

#

whole wagon Could be that grok just has a lot of active params and they serve it at a loss

yeah no chance lol. No one is serving it at a loss, and their pricing is fairly steep, much more so than Grok3 was.

whole wagon Jul 23, 2025, 11:26 AM

#

xAI can subsidize the inference itself. If they have that massive cluster just for training, they can train massive models. The performance doesn't necessarily mean it is the same size as the other LLMs; it may have simply underperformed

ocean vortex Jul 23, 2025, 11:29 AM

#

That is very unlikely. If they grossly overestimated the model size needed, they wouldn't have been competing with Google now. It just wouldn't be possible even with the relative high amount of GPUs xAI has

#

Money and even compute is not everything, and Meta is a good example of that lol

#

Mistake like that can still break the entire project

#

Yeah a total failure (Behemoth)

#

and exactly my point

ornate agate Jul 23, 2025, 11:30 AM

#

its gonna be priced in line with market pricing for performance, not size.

ocean vortex Jul 23, 2025, 11:32 AM

#

ornate agate its gonna be priced in line with market pricing for performance, not size.

I think market has converged on the max revenue pricing

#

At least for closed models

ornate agate Jul 23, 2025, 11:33 AM

#

idk. DeepSeek/Kimi kinda forced a massive price lowering.

ocean vortex Jul 23, 2025, 11:33 AM

#

They aren't pricing their models at cost + a fixed percentage. It's more of... "what price should we set for maximum profits?" lol

#

ofc this still doesn't offset R&D, but that's a different topic...

ocean vortex Jul 23, 2025, 11:34 AM

#

ornate agate idk. DeepSeek/Kimi kinda forced a massive price lowering.

Deepseek is open source, Kimi mostly as well

#

Deepseek went immediately open-source. When you do that very high margins on official API are not really possible... They just went extremely competitive/aggressive. But even with their API pricing they aren't doing it at a loss 👀

torn mantle Jul 23, 2025, 11:38 AM

#

whole wagon im able to prompt 6 times an hour 💀

is it worth it tho?

#

grok 4 heavy thinking should be similar to gemini deep think, but i dont think it is

#

was it tested on IMO or nah

ocean vortex Jul 23, 2025, 11:40 AM

#

Trust me they still have decent profit margins lmao. With their traffic and resources their whole infra is much more efficient than what most others are using.

torn mantle Jul 23, 2025, 11:40 AM

#

im actually so interested on how many answers it can get right

#

thanks, but they didnt use heavy thinking right

#

its only grok 4 reasoning

ocean vortex Jul 23, 2025, 11:41 AM

#

It's not the same size but Deepseek's cost to run R1 is less than $1 per 1M output.

#

with less than ideal infra

torn mantle Jul 23, 2025, 11:41 AM

#

i just dont understand how can moonshot/kimi afford running a 1T params model without any issues

#

they have little to no downtime in their servers

#

and im also taking into consideration the crazy traffic coming from china

ocean vortex Jul 23, 2025, 11:42 AM

#

OpenAI inference...? They have substantial margins even after reducing the price of o3

torn mantle Jul 23, 2025, 11:43 AM

#

is it running only on h20

#

kinda wild

ocean vortex Jul 23, 2025, 11:43 AM

#

It's gonna be cheaper than that on OpenAI's infra tbh

#

Also... "free electricity"? Source?

pure anvil Jul 23, 2025, 11:45 AM

#

torn mantle i just dont understand how can moonshot/kimi afford running a 1T params model wi...

ptx optimisations

ocean vortex Jul 23, 2025, 11:46 AM

#

So it's not free then

torn mantle Jul 23, 2025, 11:47 AM

#

pure anvil ptx optimisations

they should've used some of the secret sauce shared by deepseek on ptx

pure anvil Jul 23, 2025, 11:47 AM

#

torn mantle they should've used some of the secret sauce shared by deepseek on ptx

they do

#

I don't know if american labs do though

#

they'd be dumb not to

ocean vortex Jul 23, 2025, 11:47 AM

#

They have partnership with MS. They absolutely are doing what makes the most sense and is cheapest to do. They aren't sanctioned and can both buy and rent GPUs.

#

So China's lower electricity cost at best is just gonna offset the less efficient GPUs and lesser infra, but that's unlikely

#

That's not me who needs to prove it, it's you. Cause what I'm saying is common practice and what you are saying is just inexplicable. o3 is same size as gpt4.1, and they always had profit margins on gpt4.1

#

saying that they are losing money on inference is crazy

#

What isn't? To price your models at a higher price than it cost for you to merely run them? It absolutely is a common practice lmao

whole wagon Jul 23, 2025, 11:53 AM

#

xAI has a 50 times smaller user base and they already offered grok 3 mini at a loss

pure anvil Jul 23, 2025, 11:53 AM

#

ocean vortex That's not me who needs to prove it, it's you. Cause what I'm saying is common p...

>o3 is same size as gpt4.1
source?

ocean vortex Jul 23, 2025, 11:53 AM

#

Even for Deepseek

whole wagon Jul 23, 2025, 11:53 AM

#

It is reasonable to think they can do the same with grok 4

pure anvil Jul 23, 2025, 11:53 AM

#

You have no idea what you're talking about

whole wagon Jul 23, 2025, 11:54 AM

#

Also anecdotal but grok 4 really has that big model smell to it

#

Like gpt4.5 had

pure anvil Jul 23, 2025, 11:56 AM

#

openai won't be profitable until 2029 probably

whole wagon Jul 23, 2025, 11:56 AM

#

Well he assumes that because it would be a huge failing for OAI to be making a loss on inference. And he is OAI supporter

pure anvil Jul 23, 2025, 11:56 AM

#

they're burning money

ocean vortex Jul 23, 2025, 11:56 AM

#

They are losing money because R&D and all the salaries and expenses, NOT on inference. And there are many reasons to say they are turning profit on inference in isolation and a fairly substantial one (in isolation)

whole wagon Jul 23, 2025, 11:57 AM

#

If OAI is making a loss on inference they are basically cooked

ocean vortex Jul 23, 2025, 11:57 AM

#

whole wagon If OAI is making a loss on inference they are basically cooked

yeah that's like literally impossible lmao

pure anvil Jul 23, 2025, 11:57 AM

#

it's like a bug in his mind he can't accept that openai is bad in any way

ocean vortex Jul 23, 2025, 11:57 AM

#

Look at their price cuts historically for the same checkpoints

#

that's the best source

pure anvil Jul 23, 2025, 11:59 AM

#

i couldn't imagine glazing a corpo that hard

#

just pathetic

ocean vortex Jul 23, 2025, 12:00 PM

#

Are you just gonna pretend you don't know that it's not common at all for the closed-source AI labs to share info on their cost to run a model? Of course there's not gonna be black on white direct proof from them, but we can still look at what we know.... There's NOTHING to suggest that what you are saying is true (they are losing money on inference)

#

it's just silly, crazy and insane.

#

@ornate agate

#

And there are several things to suggest they are making money:

- o3 is not a huge model. OG gpt4 was downsized into 4Turbo, then that was downsized into current base model.
- They were in a position to do drastic price cuts no problem
- Their resources and traffic allow them to have very good and efficient infra
- We roughly know the pricing to run a model, any model

#

Read again what I wrote. Even there: #general message

unborn ocean Jul 23, 2025, 12:05 PM

#

leaden palm Bay Area House Party (esp. recent ones) is a very fun read

true, just read it, lol
really good this time

ocean vortex Jul 23, 2025, 12:05 PM

#

No one is saying they are making money overall, we are talking about just inference which is small part of their operations

whole wagon Jul 23, 2025, 12:05 PM

#

How did we even reach a point where the frontier model of the top ai company (with a huge time lead) ends up facing competition from small Chinese research teams

#

It's a wild timeline

ocean vortex Jul 23, 2025, 12:06 PM

#

Inference is laughable money compared to their entire expenses

pure anvil Jul 23, 2025, 12:09 PM

#

It's for the best, it brings prices down for everybody

#

plus something like AI should be open source anyway

#

considering where the training data comes from

ocean vortex Jul 23, 2025, 12:10 PM

#

So yeah... While we will never have definitive proof from OpenAI disclosing this, there are far more reasons to suggest they are turning profit on inference with the opposite being extremely unlikely. For the reasons already stated (there for instance #general message)

whole wagon Jul 23, 2025, 12:12 PM

#

If GPT5 releases late enough there's a substantial chance either Kimi or qwen is going to take SOTA

#

With adding reasoning to the updated models

#

Also if openAI open source model is actually SOTA it has to beat o3. So I don't see how that works

pure anvil Jul 23, 2025, 12:12 PM

#

whole wagon If GPT5 releases late enough there's a substantial chance either Kimi or qwen is...

GPT5 will be underwhelming imo

whole wagon Jul 23, 2025, 12:13 PM

#

whole wagon Also if openAI open source model is actually SOTA it has to beat o3. So I don't ...

Assuming it's still going to release. They are extremely silent on it

ornate agate Jul 23, 2025, 12:14 PM

#

pure anvil plus something like AI should be open source anyway

I think at least it should be very widely available, especially the pre-trained checkpoints. At the moment keeping it all centralized inside a single company is more dangerous than people realise.

ocean vortex Jul 23, 2025, 12:14 PM

#

whole wagon Assuming it's still going to release. They are extremely silent on it

Well you only have to beat open-source SOTA, which is R1.1

whole wagon Jul 23, 2025, 12:14 PM

#

Well it's delayed till September apparently

ocean vortex Jul 23, 2025, 12:14 PM

#

And they will have next gen of closed models soon

whole wagon Jul 23, 2025, 12:14 PM

#

So it has to compete with the next gen of open source

ocean vortex Jul 23, 2025, 12:14 PM

#

so... doable

pure anvil Jul 23, 2025, 12:15 PM

#

ornate agate I think at least it should be very widely available, especially the pre-trained ...

The chinese after release of deepseek r1 saved openai and anthropic users from being cucked even more

#

imo

ocean vortex Jul 23, 2025, 12:15 PM

#

whole wagon So it has to compete with the next gen of open source

Maybe. But that's still not there. Possible that by September we still won't have anything better tbh

pure anvil Jul 23, 2025, 12:16 PM

#

If they hadn't released a good open source model then closed sourced providers would make the prices as high as possible

ornate agate Jul 23, 2025, 12:16 PM

#

pure anvil The chinese after release of deepseek r1 saved openai and anthropic users from b...

yeah I think this too. They wanted to charge $200/mtok or something. Remember the Haiku "more intelligence" fiasco? DeepSeek stopped that.

whole wagon Jul 23, 2025, 12:16 PM

#

The closed source providers would still compete with each other eventually. But currently they are focused on scaling up as fast as possible

pure anvil Jul 23, 2025, 12:16 PM

#

ornate agate yeah I think this too. They wanted to charge $200/mtok or something. Remember th...

open source is the future for sure

ocean vortex Jul 23, 2025, 12:17 PM

#

I think Deepseek not only kept western companies in check, they kinda also kept Chinese themselves in check lol

#

if you look at alibabacloud, prices are not really any better than what people were used to before Deepseek

#

Kinda insane that Chinese corps are charging that

whole wagon Jul 23, 2025, 12:20 PM

#

Currently I don't see any reason to use a closed source LLM. Maybe that changes with gpt5

#

The open source LLMs are near identical performance for far cheaper

ocean vortex Jul 23, 2025, 12:21 PM

#

ocean vortex Kinda insane that Chinese corps are charging that

Like look at this

#

crazy

whole wagon Jul 23, 2025, 12:22 PM

#

whole wagon The open source LLMs are near identical performance for far cheaper

Like I don't really care about +1.5% in gpqa PhD questions

#

I will just take the far cheaper one

pure anvil Jul 23, 2025, 12:23 PM

#

whole wagon The open source LLMs are near identical performance for far cheaper

qwen3 coder is better than all openai models for coding no joke

#

that's not too much of an accomplishment but it's a start

ocean vortex Jul 23, 2025, 12:24 PM

#

ocean vortex Like look at this

And btw for reference, this is a smaller model than Deepseek R1

#

even less activated parameters as well

ornate agate Jul 23, 2025, 12:25 PM

#

whole wagon Currently I don't see any reason to use a closed source LLM. Maybe that changes ...

I also think this. I prefer to invest my time learning how to get the most out of open-source models, because I know I can't be rugpulled, price hiked, or have to worry about quantizations changing, etc. If you look at reddits for prop coding tools, users seem to start off extremely happy, then after a few weeks people start complaining and being very upset.

pure anvil Jul 23, 2025, 12:25 PM

#

ornate agate I also think this. I prefer to invest my time learning how to get the most out o...

yeah that's why for most enterprise cases requiring stability local models are used in house to prevent all that from happening

#

https://www.designarena.ai/leaderboard

Design Arena

Global crowdsourced benchmark for design

ornate agate Jul 23, 2025, 12:27 PM

#

pure anvil https://www.designarena.ai/leaderboard

oh wow Qwen 3 non-coder really shot up that leaderboard.

ocean vortex Jul 23, 2025, 12:27 PM

#

ornate agate I also think this. I prefer to invest my time learning how to get the most out o...

They "start off extremely happy, then after a few weeks people start complaining" for a simple reason. They are impressed by it at first then start using it more and notice the flaws. Their expectations change as well. No one is making their models worse lmao

#

if they would they wouldn't stay relevant for long

#

Also this has been debunked like 10 times now

pure anvil Jul 23, 2025, 12:28 PM

#

https://www.designarena.ai/models/qwen3-coder-480b-a35b-instruct

ocean vortex Jul 23, 2025, 12:28 PM

#

with people doing independent testing

ornate agate Jul 23, 2025, 12:28 PM

#

ocean vortex They "start off extremely happy, then after a few weeks people start complaining...

yes every single one of them must be imagining it.

pure anvil Jul 23, 2025, 12:29 PM

#

ornate agate yes every single one of them must be imagining it.

lmao the cope is insane

pure anvil Jul 23, 2025, 12:29 PM

#

pure anvil https://www.designarena.ai/models/qwen3-coder-480b-a35b-instruct

some examples

ocean vortex Jul 23, 2025, 12:29 PM

#

ornate agate yes every single one of them must be imagining it.

They are not "imagining" it, their reactions are normal given impressions/expectations and limited knowledge or any real testing. Happens everywhere not just with AI. Also how forums and subs work, the complaining ones are the loudest. There's no (or much less so of a) reason to post otherwise

#

https://aider.chat/2024/08/26/sonnet-seems-fine.html

aider

Sonnet seems as good as ever

Sonnet’s score on the aider code editing benchmark has been stable since it launched.

pure anvil Jul 23, 2025, 12:32 PM

#

yeah very cheap model (2$/m) too

ocean vortex Jul 23, 2025, 12:45 PM

#

What was the prompt for this? 👀

#

I just looked at polymarket that temp thing... Looks like my guess just might be correct. 🤯 Hopefully no major changes from now

#

well I said "mine" but this was o3 and deep research lmao

#

was at ~30% chance at the time, less than 0.95-0.99 range

#

@keen beacon

pure anvil Jul 23, 2025, 1:00 PM

#

have they gotten that good?

#

oh you mean AI written

#

i interpreted it as AI narrated

#

I listen to T6 when falling asleep a lot

#

lol

#

tru, like I can't fall asleep to music either for similar reasons

pure anvil Jul 23, 2025, 1:22 PM

#

opus 4

hollow ocean Jul 23, 2025, 1:27 PM

#

agent delayed for plus users

#

will be available in the coming weeks

keen beacon Jul 23, 2025, 1:29 PM

#

ocean vortex I just looked at polymarket that temp thing... Looks like my guess just might be ...

I will calculate mine soon, nice to see you are +30% xd but its luck dont rely too much on gpt

#

I had a friend who screenshotted stocks and asked gpt for advice lmao, he won a few $ before losing thousands 🤣

hollow ocean Jul 23, 2025, 1:31 PM

#

keen beacon I had a friend who screenshotted stocks and asked gpt for advice lmao , he won a ...

He prob used 4o

#

😂

#

You gotta use o3 pro if you wanna make money

quartz light Jul 23, 2025, 1:50 PM

#

keen beacon I had a friend who screenshotted stocks and asked gpt for advice lmao , he won a ...

your nightmares

#

DAMN.

keen beacon Jul 23, 2025, 2:32 PM

#

hollow ocean He prob used 4o

Yes, cause he did day trading or rather minute trading 🤣 he asked gpt 4 every 15s on what to do and needed fast responses

torn mantle Jul 23, 2025, 2:49 PM

#

what about const@2048

rare python Jul 23, 2025, 2:49 PM

#

torn mantle what about const@2048

cons@4096
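
#

For anyone wondering what cons@N means: sample N answers for a question and score the majority answer. Minimal sketch below, assuming answers are already normalised to comparable strings; the function names are mine, not from any benchmark harness.

```python
from collections import Counter
from typing import Sequence


def consensus_answer(samples: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the one seen first."""
    if not samples:
        raise ValueError("need at least one sample")
    return Counter(samples).most_common(1)[0][0]


def cons_at_n(samples_per_question: Sequence[Sequence[str]],
              references: Sequence[str]) -> float:
    """Fraction of questions whose majority answer matches the reference."""
    if len(samples_per_question) != len(references):
        raise ValueError("one reference per question is required")
    if not references:
        raise ValueError("need at least one question")
    correct = sum(
        consensus_answer(samples) == reference
        for samples, reference in zip(samples_per_question, references)
    )
    return correct / len(references)
```

So cons@4096 would just be this with 4096 samples per question, which is why it gets expensive fast.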

echo aurora Jul 23, 2025, 2:56 PM

#

whole sundial <@283397944160550928> sorry for pinging but there is like half a dozen accounts ...

Thank you for this flag, I do appreciate it so please don't feel sorry for pinging. I'll be sure to look into this as yeah it does look pretty inorganic.

unborn ocean Jul 23, 2025, 3:08 PM

#

has anyone tried the deepinfra api for gemini models?

ocean vortex Jul 23, 2025, 3:15 PM

#

keen beacon I will calculate mine soon, nice to see you are +30% xd but its luck dont rely t...

yeah ofc I wouldn't bet anything big without verifying. Actually even with verifying and "being sure" betting big is always a risk on smth like that. But that being said, there's potential in deep research when used wisely I think

dense moon Jul 23, 2025, 3:24 PM

#

I don't know if this is the right place for suggestions, but it would be nice if copy and paste were easier. If I want to copy a whole conversation, I have to either copy every message one by one, or use ctrl+a and have it backwards and chaotic

torn mantle Jul 23, 2025, 3:38 PM

#

dense moon I don't know if this is the right place for suggestions, but it would be nice if...

poor shroddy 😦

#

shroddy is confused

dense moon Jul 23, 2025, 3:42 PM

#

Yes I am.

#

I wanted to make a new feature suggestion but I think I am in the wrong channel here and don't know which is the correct one

torn mantle Jul 23, 2025, 3:45 PM

#

dense moon Yes I am.

It's just that I saw you were suffering, and I felt sorry for you

torn mantle Jul 23, 2025, 3:45 PM

#

dense moon I wanted to make a new feature suggestion but I think I am in the wrong channel ...

Either that + tag @ pineapple

#

Or #1343291835845578853

#

Although its not really a bug

#

That feature can come handy

#

Or you can make your own little script (userscript) that does that till they add it natively

#

What i really want lmarena devs to do is to handle errors better, not just a general catch exception, especially on the sandbox api issues

#

There is also that session issue, the code/msg errors arent specific

#

But the most annoying one is the sandbox one, you dont really know if its a model issue or rendering lib issue
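
#

Something like this is what i mean by specific errors instead of one general catch. Purely a sketch: the exception classes and messages are hypothetical, i have no idea how lmarena's code is actually structured.

```python
from typing import Callable, Optional, Tuple


# Hypothetical error types, one per failure source mentioned above.
class ModelError(Exception):
    """The model call itself failed or returned something unusable."""


class SandboxRenderError(Exception):
    """The model answered, but the sandbox or rendering lib failed."""


class SessionError(Exception):
    """The user session expired or is otherwise invalid."""


def describe_failure(exc: Exception) -> str:
    """Map an exception to a message that says where the failure happened."""
    if isinstance(exc, ModelError):
        return f"model issue: {exc}"
    if isinstance(exc, SandboxRenderError):
        return f"sandbox/rendering issue: {exc}"
    if isinstance(exc, SessionError):
        return f"session issue: {exc}"
    return f"unknown issue: {exc}"


def run_step(step: Callable[[], str]) -> Tuple[Optional[str], Optional[str]]:
    """Run one step and return (result, error_message)."""
    try:
        return step(), None
    except (ModelError, SandboxRenderError, SessionError) as exc:
        # Known failure sources get a specific, user-facing message.
        return None, describe_failure(exc)
    except Exception as exc:
        # Last resort only: still reported, but clearly labelled as unknown.
        return None, describe_failure(exc)
```

With that the user at least sees whether to retry with another model or just reload the sandbox.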

misty star Jul 23, 2025, 4:16 PM

#

Ooooooooooo

#

WWWW

languid crescent Jul 23, 2025, 4:17 PM

#

omg lmarena just dropped a new banger

#

i just refreshed the page and instantly noticed the change 😭

#

i love this community

echo aurora Jul 23, 2025, 4:17 PM

#

languid crescent i love this community

heartthrow

misty star Jul 23, 2025, 4:18 PM

#

This is amazing

languid crescent Jul 23, 2025, 4:18 PM

#

ABSOLUTE W UPDATEEEE

echo aurora Jul 23, 2025, 4:20 PM

#

If anyone comes across any issues with Search Arena we can collect those issues here: https://discord.com/channels/1340554757349179412/1397614163282493440

tribal aspen Jul 23, 2025, 4:22 PM

#

why does the lmarena search take forever?

#

and then returns error?

languid crescent Jul 23, 2025, 4:22 PM

#

its working fine for me

#

i like the ui for the search thingy its neat

echo aurora Jul 23, 2025, 4:23 PM

#

tribal aspen and then returns error?

could you share more details about it in here: #1397614163282493440 message ?

languid crescent Jul 23, 2025, 4:23 PM

#

@echo aurora THIS IS AMAZING!!!!!!! ❤️ compared to gemini (free) it only gives out 5 sources but here it gives more than 20 😭

tribal aspen Jul 23, 2025, 4:23 PM

#

echo aurora could you share more details about it in here: https://discord.com/channels/1340...

oh it got fixed

#

nvm

#

was happening with the gemini model

languid crescent Jul 23, 2025, 4:23 PM

#

this is absolutely nice it will help me so much in learning 😭

echo aurora Jul 23, 2025, 4:23 PM

#

languid crescent i like the ui for the search thingy its neat

how do you feel about the change with the Image button?

languid crescent Jul 23, 2025, 4:24 PM

#

echo aurora how do you feel about the change with the `Image` button?

it's neat i like it!

echo aurora Jul 23, 2025, 4:24 PM

#

tribal aspen was happening with the gemini model

Happy to hear it's working again! Good to know about the gemini model, I'll be sure to keep an eye on that.

languid crescent Jul 23, 2025, 4:25 PM

#

i like the current buttons now its much cleaner

olive mesa Jul 23, 2025, 4:37 PM

#

how many zettaflops of compute power is this?

nova ginkgo Jul 23, 2025, 4:39 PM

#

finally we got web research 🤩

languid crescent Jul 23, 2025, 4:40 PM

#

web research is a game changer for a student like me 😭 it makes researching for sources much easier

#

W lmarena community ❤️

golden ocean Jul 23, 2025, 4:51 PM

#

is mimi ai

languid crescent Jul 23, 2025, 4:53 PM

#

No i'm not bro 😦

#

I'm just really happy that LM arena is giving out updates like this that could help me as a student :))

unborn ocean Jul 23, 2025, 5:14 PM

#

olive mesa how many zettaflops of compute power is this?

they better produce real SOTA next time

#

otherwise i will view xAi as a massive failure

amber warren Jul 23, 2025, 5:15 PM

#

languid crescent web research is a game changer for a student like me 😭 it makes researching for...

ive been waiting for web search for so long too

sturdy mica Jul 23, 2025, 5:23 PM

#

search is awesome

#

W

#

waited a while for this

#

very cool

mint sparrow Jul 23, 2025, 5:24 PM

#

genuine question im just new to lmarena but if ive been in one thread using it for some time and it just has a endless "generating" text bubble does that mean ive broken it/ran out of time with the specific model?

whole wagon Jul 23, 2025, 6:22 PM

#

💀

#

#

He really put them on blast damn

torn mantle Jul 23, 2025, 6:23 PM

#

what were they thinking?

#

doesnt make any sense

gentle plinth Jul 23, 2025, 6:23 PM

#

some guy from the team replied to him tho, so could very well be that they help him reproduce it

whole wagon Jul 23, 2025, 6:24 PM

#

Or just damage control

#

The model performance should not be shifting that much from json Vs text formatting

gentle plinth Jul 23, 2025, 6:24 PM

#

whole wagon Jul 23, 2025, 6:24 PM

#

If it is that's a whole other issue

torn mantle Jul 23, 2025, 6:24 PM

#

gentle plinth

still

gentle plinth Jul 23, 2025, 6:25 PM

#

i also found the json thing weird, but then again i am not an expert on this

#

isnt there a standard format for the benchmark?

#

or is he talking about something else

keen beacon Jul 23, 2025, 6:25 PM

#

its a benchmark harness thing probably

torn mantle Jul 23, 2025, 6:25 PM

#

they may have finetuned on both public/private data for that specific benchmark

gentle plinth Jul 23, 2025, 6:26 PM

#

how could they get the private data?

keen beacon Jul 23, 2025, 6:26 PM

#

this has happened before btw. qwen will get it sorted out, im sure

gentle plinth Jul 23, 2025, 6:26 PM

#

llama4 😅

whole wagon Jul 23, 2025, 6:27 PM

#

Recently I had an argument with ppl that thought open source models would never manipulate benchmarks

#

Open source stans just as bad as closed source stans

gentle plinth Jul 23, 2025, 6:28 PM

#

mint sparrow genuine question im just new to lmarena but if ive been in one thread using it f...

how long have you been waiting? if its under 1 minute (or even longer) it could very well be that some model is still reasoning. otherwise you will probably have to reload, but make sure to copy your prompt to be sure its saved (even if it should still be saved in theory)

whole wagon Jul 23, 2025, 6:29 PM

#

If it is grok 4 it can take over 10 minutes

keen fulcrum Jul 23, 2025, 6:29 PM

#

People would love to see youcom https://discord.com/channels/1340554757349179412/1397645709565628486

torn mantle Jul 23, 2025, 6:33 PM

#

gentle plinth how could they get the private data?

could've used their API on older models like qwen 2.5

#

idk or some smart reward func specifically made for arc-agi-1

#

the thing with these benchmarks like arc-agi is that we want a generalisation performance gain, or an emergent intelligence

#

but if its specifically trained and finetuned for that, then its basically useless

#

and thats what qwen are doing

gentle plinth Jul 23, 2025, 6:37 PM

#

so if i understand correctly you say that they trained on the json data, but when using a different format it fails?

torn mantle Jul 23, 2025, 6:37 PM

#

also dont you find their other benchmarks numbers ridiculous?

torn mantle Jul 23, 2025, 6:38 PM

#

gentle plinth so if i understand correctly you say that they trained on the json data, but whe...

no im not talking about what they trained on/input or the output format

#

im talking about their goal in general

#

are they really trying to make a smart model? or are they just focusing on beating others on some benchmark?

vivid gorge Jul 23, 2025, 6:56 PM

#

Hi, maybe a silly question - is there a way to terminate endless "Generating..." answer? This particular chat is stuck and I see no way to break to loop. Except maybe "Delete" ... command in chatlist.

echo aurora Jul 23, 2025, 6:58 PM

#

vivid gorge Hi, maybe a silly question - is there a way to terminate endless "Generating..."...

Sorry to say there is not currently. However, adding a stop/pause button is on our to do!

vivid gorge Jul 23, 2025, 6:59 PM

#

Thank you.

primal orbit Jul 23, 2025, 7:03 PM

#

vivid gorge Hi, maybe a silly question - is there a way to terminate endless "Generating..."...

you can refresh the page, it should load the answer

#

or close/open

vivid gorge Jul 23, 2025, 7:07 PM

#

primal orbit or close/open

Unfortunately refresh, close/open and even moving to other browser by export/import cookies doesnt work. Still stuck on endless Generating...

civic flame Jul 23, 2025, 7:25 PM

#

it's that time of the month again

#

got access to some new anon models

#

so far at least one is from oai and passes a lot of my hard prompts (gets ones o3 gets wrong right)

misty star Jul 23, 2025, 7:45 PM

#

Is it true some people have access to new models

ocean vortex Jul 23, 2025, 7:56 PM

#

misty star Is it true some people have access to new models

Yes I'm using gpt6

#

it's great

torn mantle Jul 23, 2025, 8:01 PM

#

civic flame so far at least one is from oai and passes a lot of my hard prompts (gets ones o...

is it another o3 checkpoint or nah

#

or gpt-5

civic flame Jul 23, 2025, 8:05 PM

#

can't tell

#

seems similar to "o3-alpha"

gentle plinth Jul 23, 2025, 8:09 PM

#

what is your definition of agi?

echo aurora Jul 23, 2025, 8:18 PM

#

Hey - we just launched something really special for this server. The **LMArena Discord Bot is available, right now!**

What does the bot do?

With the LMArena bot you can generate videos, images, and image-to-videos. Similar to battle mode, you'll be given two generations and anyone will be able to vote on which they prefer. After a certain number of votes the bot will reveal the models.

Why are you telling us about it in general chat and not announcing it?

We’re considering this a bit of a soft launch. We want to test the waters with it and see what you all think first.

To learn more about how the bot works, check out #1397655624103493813. Note that you can only use the bot in these channels: #video-arena-1

Keep in mind there is a generation limit per day, so make sure your prompts count!

gentle plinth Jul 23, 2025, 8:26 PM

#

interesting. i think i would define it a bit differently. maybe too simple, but i would define it as an ai that can do any work that a human can do at the computer. Except maybe for top 1% work in terms of difficulty/payment. so this would include longer tasks, researching topics, learning new things about that topic, learning new programming languages, and working coherently on a project which wouldnt fit in any context length of existing models so far

torn mantle Jul 23, 2025, 8:34 PM

#

civic flame can't tell

what can you say

#

its obvious that the next iteration will be good

#

well whatever..

gentle plinth Jul 23, 2025, 8:40 PM

#

manifold says 2030 but i absolutely dont like the resolution criteria as gpt-4.5 can already pass a turing test according to one paper https://manifold.markets/ManifoldAI/agi-when-resolves-to-the-year-in-wh-d5c5ad8e4708

Manifold

AGI When? [High Quality Turing Test]

This market resolves to the year in which an AI system exists which is capable of passing a high quality, adversarial Turing test. It is used for the Big Clock on the manifold.markets/ai page.

The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behaviour equivalen...

whole wagon Jul 23, 2025, 8:40 PM

#

What the hell I put my life savings in there and got nothing back??

#

(They got hacked)

gentle plinth Jul 23, 2025, 8:41 PM

#

i've seen this so often lately on x.

#

makes me wonder if there is more to it, like elon behind it or smth, but probably its just that he fired a bunch of people that worked on the security of the platform

ocean vortex Jul 23, 2025, 8:46 PM

#

Qwen3 output length looks hilarious next to Sonnet4-thinking lol

glass arch Jul 23, 2025, 8:46 PM

#

guys which model is the best for me

ocean vortex Jul 23, 2025, 8:46 PM

#

Though I'm forgiving Kimi2 👀

glass arch Jul 23, 2025, 8:47 PM

#

I have been using chatgpt plus for a while because I want to use it for code and schoolwork, but it's kinda stupid ngl

ocean vortex Jul 23, 2025, 8:47 PM

#

glass arch guys which model is the best for me

gpt3.5-turbo

glass arch Jul 23, 2025, 8:47 PM

#

(using o4-mini)

keen ferry Jul 23, 2025, 8:48 PM

#

glass arch I have been using chatgpt plus for a while because I want to use it for code and...

use o3

glass arch Jul 23, 2025, 8:48 PM

#

I have used o3

#

it has limits tho

#

should I switch to gemini or something else

#

because chatgpt hasn't really been keeping up

ocean vortex Jul 23, 2025, 8:48 PM

#

Yeah do use o3. I was joking with gpt3.5 gptdrawncat

keen ferry Jul 23, 2025, 8:48 PM

#

I guess o4 mini (high) for coding and o3 for school work

ocean vortex Jul 23, 2025, 8:49 PM

#

glass arch because chatgpt hasn't really been keeping up

so is keeping up or the limits an issue?

glass arch Jul 23, 2025, 8:49 PM

#

keeping up

ocean vortex Jul 23, 2025, 8:50 PM

#

o3 has no issues with keeping up

glass arch Jul 23, 2025, 8:50 PM

#

is gemini (or claude or whatever) better at these tasks? I have been thinking of switching to something else because openai is behind on development

ocean vortex Jul 23, 2025, 8:50 PM

#

glass arch is gemini (or claude or whatever) better at these tasks? I have been thinking of...

if you use gemini on their website (the one that has deep research) you gonna be hit with limits too

echo aurora Jul 23, 2025, 8:51 PM

#

btw we're aware of issues happening with search in battle mode

ocean vortex Jul 23, 2025, 8:51 PM

#

And 2.5Pro on aistudio is worse than o3 due to the lack of tools, so...

#

the best for you is o3. Limits is a financial problem I suppose, not really a "what is the best model for x" type of thing anymore

#

With API or Pro plan there are no limits

glass arch Jul 23, 2025, 8:53 PM

#

I really have a few things I need:

- upload code to the interface (on chatgpt, I make a zip of my code and upload so that it can get context (my code is not so big))
- upload pdfs that the model can read
- the model can run searches to get better information
- the model is smart

ocean vortex Jul 23, 2025, 8:54 PM

#

glass arch I really have a few things I need: - upload code to the interface (on chatgpt, I...

Depends how long your code really is. As long as all of your files are less than 32k (if Plus) or 128k (if Pro), still chatgpt.

#

otherwise use aistudio when you need to work with big context

#

if you want to get a rough estimate of how many tokens your files are, you can simply just attach them in a chat in aistudio as well

#

they will tokenize and you will see your usage in token counter

#

Tokenizer for Gemini is different, but for rough estimate this is good enough. Not gonna differ by miles with OpenAI
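
#

If you'd rather check locally first, here's a rough sketch. It assumes ~4 characters per token, which is only a rule of thumb and not any real tokenizer, and uses the 32k / 128k figures from above as the limits. The file suffix list is just an example.

```python
from pathlib import Path
from typing import Iterable

# Assumption: roughly 4 characters per token. Real tokenizers will differ.
CHARS_PER_TOKEN = 4

# Context figures mentioned above for the two plans.
LIMITS = {"plus": 32_000, "pro": 128_000}


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def estimate_dir_tokens(root: str,
                        suffixes: Iterable[str] = (".py", ".md", ".txt")) -> int:
    """Sum the rough estimates for all matching files under root."""
    wanted = set(suffixes)
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in wanted:
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += estimate_tokens(text)
    return total


def fits(total_tokens: int, plan: str) -> bool:
    """True if the estimate stays under the limit for the given plan."""
    return total_tokens < LIMITS[plan]


if __name__ == "__main__":
    total = estimate_dir_tokens(".")
    print(total, fits(total, "plus"), fits(total, "pro"))
```

Leave some headroom though, since the estimate is crude and your prompts and the replies count too.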

glass arch Jul 23, 2025, 8:57 PM

#

ok well is gemini better or worse than o3 or o4-mini

ocean vortex Jul 23, 2025, 8:58 PM

#

glass arch ok well is gemini better or worse than o3 or o4-mini

Neither. Depends on the task

#

if we exclude all the tools and just look at the model itself, o3 is equivalent to 2.5Pro but still different strengths (2.5Pro better for visuals, o3 better for accuracy and tedious tasks). o4-mini-high is good but I wouldn't classify it as equivalent personally. It's too small for that and lacks context awareness and some fundamental logic concepts, also lacks world knowledge (look at SimpleQA score)

whole wagon Jul 23, 2025, 9:00 PM

#

For a few tasks I even found grok the best kek

#

Like puzzles type stuff

echo aurora Jul 23, 2025, 9:00 PM

#

echo aurora btw we're aware of issues happening with search in battle mode

it's back.

dense moon Jul 23, 2025, 9:22 PM

#

echo aurora Hey - we just launched something really special for this server. The **LMArena D...

Will there be a way to vote "its a tie" without voting "both are bad"?

echo aurora Jul 23, 2025, 9:27 PM

#

dense moon Will there be a way to vote "its a tie" without voting "both are bad"?

It could happen

dense moon Jul 23, 2025, 9:27 PM

#

Would be cool, feels unfair otherwise if both models did a great job 🙂

echo aurora Jul 23, 2025, 9:28 PM

#

Yeah that makes sense, I'll be sure to pass it along

dense moon Jul 23, 2025, 9:47 PM

#

And (maybe) last question: will it be possible to see which model got the most votes / how many votes each model got? Would be interesting to see how others might have different preferences or priorities.

echo aurora Jul 23, 2025, 9:49 PM

#

Sounds good. I've added both pieces of feedback to: #1397697136170242169 message

dense moon Jul 23, 2025, 9:53 PM

#

Oh nice, I think I will write one or two suggestions for the website there as well 🙂

echo aurora Jul 23, 2025, 9:54 PM

#

Please do!

tiny temple Jul 23, 2025, 11:47 PM

#

h

sullen quest Jul 23, 2025, 11:51 PM

#

nightride-on-v2? I haven't seen that one before

harsh flume Jul 24, 2025, 12:01 AM

#

Anyone here using Gemini CLI?

#

I'm wondering how it compares to Claude Code. I've seen some conflicting opinions online, but it's overall hard to trust reddit comments on anything AI

blazing rune Jul 24, 2025, 12:28 AM

#

harsh flume I'm wondering how it compares to Claude Code. I've seen some conflicting opinions ...

idk, but according to a seemingly reasonable guy on youtube named "GosuCoder", Gemini CLI with Gemini 2.5 Pro doesn't even compare to Claude 4 Sonnet with Claude Code in his own benchmark. I assume his benchmark is pretty thorough since it seems like he is a real dev and not just a vibe coder.

#

I think Claude 4 Sonnet with Claude Code got about 18000 or something

echo aurora Jul 24, 2025, 1:04 AM

#

echo aurora Hey - we just launched something really special for this server. The **LMArena D...

Bumping this message. Video Arena is now available here! We've enabled it to be usable in #general as well.

sweet tinsel Jul 24, 2025, 1:05 AM

#

Deep Research Arena when?

#

I'm totally not obsessed with DRs.

harsh flume Jul 24, 2025, 1:05 AM

#

sweet tinsel I'm totally not obsessed with DRs.

Which ones are you using besides gemini and open ai?

sweet tinsel Jul 24, 2025, 1:06 AM

#

A lot.

#

Just look at my doc.

harsh flume Jul 24, 2025, 1:06 AM

#

Seems like grok just took theirs out

sweet tinsel Jul 24, 2025, 1:06 AM

#

sweet tinsel Just look at my doc.

https://docs.google.com/document/d/1qSfyAyxzUziFQf55CD60-UgQ4Af9ubVmr69OrmAdevE/edit?usp=sharing

Google Docs

Deep-Research Tests

Deep-Research Tests Prompt: Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and ...

harsh flume Jul 24, 2025, 1:07 AM

#

I will. Deep-Research is prob 80% of my AI use cases

#

Thanks

tiny palmBOT Jul 24, 2025, 1:08 AM

#

tiny palm

harsh flume Jul 24, 2025, 1:09 AM

#

Your doc is mostly a request and the shared results or am I missing something?

sweet tinsel Jul 24, 2025, 1:10 AM

#

harsh flume Your doc is mostly a request and the shared results or am I missing something?

It is.

#

I'm currently working on the ranking.

harsh flume Jul 24, 2025, 1:10 AM

#

I imagine you tested a lot more than this, so I'd be interested to know how you feel the models you tested fare

harsh flume Jul 24, 2025, 1:10 AM

#

sweet tinsel I'm currently working on the ranking.

Can you share key insights?

#

For my use case I think Gemini and OAI are pretty much on par with each other and often have different strong/weak points. But I haven't tested anything besides them because I was under the impression all other models were lagging behind

sweet tinsel Jul 24, 2025, 1:12 AM

#

Generally the top players, Gemini 2.5 Pro and ChatGPT, are pretty good, and Manus AI is also very good. The other AI agents are generally pretty decent, and some DR services like ithy and Perplexity are just not worth the time.

#

But for most stuff I prefer the ChatGPT DR.

harsh flume Jul 24, 2025, 1:15 AM

#

I hear ya. I use a custom sys instruction instance on AI Studio to generate the DR prompts and that has been the only thing I can confirm gemini is superior for. It made a big difference on research output when I changed from developing the prompts with chatgpt to gemini

#

For some reason it seems like Gemini's chain of thought generates better prompts to instruct LLMs. That has been true with all my prompt engineering across different use cases

sweet tinsel Jul 24, 2025, 1:19 AM

#

I got some better results with Perplexity Pro searches by using the DR plan from Gemini.

harsh flume Jul 24, 2025, 1:19 AM

#

My theory is that whatever dataset Google trained their model on had a deeper 'understanding' of the underlying working mechanisms of AI

sweet tinsel Jul 24, 2025, 1:19 AM

#

Gemini is really pretty good with prompting.

surreal creek Jul 24, 2025, 3:34 AM

#

has anyone else been having difficulty with only getting one response when prompting? I’m specifically having the issue on mobile, but I’ve bumped into it on desktop as well.

languid crescent Jul 24, 2025, 4:32 AM

#

what the heck, does lmarena have vid gen now?

echo aurora Jul 24, 2025, 4:40 AM

#

languid crescent what the heck does lmarena have vid gen now?

We made a bot! #1397655624103493813

languid crescent Jul 24, 2025, 4:40 AM