#general


lime coral
#

You need to be smart to retrieve the right thing and not hallucinate

ocean vortex
#

If you want to exclude small models, or ones whose vibes (or context awareness etc.) are hard to measure, just look at SimpleQA and exclude everything that scores low there. Simple as that

#

It's not very realistic to expect ArtificialAnalysis to do an absolutely 100% perfect job at this, but what they are doing is still transparent and very useful tbh

unborn ocean
terse shuttle
#

When is direct chat on webdev arena?

zinc ore
#

Didn't notice this earlier

patent aspen
#

OAI didn't even participate

hardy lion
zinc ore
#

Depends what you mean by exist, we know it's been used by select testers since early June, but isn't publicly available

#

If it's being tested since then, then yes it's existed since then

#

So apparently, some of the stuff about Deepmind's result is mum

#

Wonder if he's referring to AlphaProof/geometry or something else

#

Deepthink system nvm

#

So definitely not AlphaProof and alphageometry

#

That makes me think the results they shared are just much cleaner and nicer, compared with the other deepthink system

whole wagon
#

Nobody has guessed correctly what it actually is

#

Google are going to be dropping a surprise ig lol

zinc ore
#

Yeah, they're definitely showing something on the 28th

#

Doesn't help we have an openAI engineer making an issue about it tho, and peeps are running with it

whole wagon
#

They are fine tunes of a general model. For Google case at least

#

The Google model actually isn't very specialised. The general 'system' actually already scored high so they had a good baseline to work with

zinc ore
#

Congrats to the GDM team on their IMO result! I think their parallel success highlights how fast AI progress is. Their approach was a bit different than ours, but I think that shows there are many research directions for further progress. Some thoughts on our model and results 🧵

#

Some sub tweets basically repeating the timeline he already explained, with some new information

torn mantle
#

they should at least release the first version of deep think

zinc ore
#

Why

#

Maybe they'll release a more robust version of deepthink instead

gentle plinth
torn bison
#

If deepthink is the model that participated in the IMO, and the original plan was to publicize the result 7 days after the IMO ends, can we expect July 28th to be deepthink's originally planned release date? I think it's very likely

hollow ocean
#

Deepthink release December

#

Polymarket is the truth

#

95% accuracy

elder rapids
#

yooo

#

deepthink got gold

#

it's over

torn bison
elder rapids
#

😭

hollow ocean
torn bison
jade egret
#

nice

#

aistudio free

torn bison
jade egret
#

: )

hollow ocean
jade egret
#

so fact

hollow ocean
lone vector
#

DeepThink coming on Friday? it's only for mathematicians I think

torn bison
# hollow ocean 🤷🏽‍♂️

The last time I checked Kalshi was before 2.5 Pro 0605 was released as GA. Even though Vertex AI has explicitly stated that the model ID 0605 will be deprecated on June 19, Kalshi's market is still pricing NO for 0605 as ranking #1 on the June 30 Arena leaderboard at 70c.

#

crazy inefficiency

hollow ocean
#

It’s not the truth

torn bison
#

wdym I thought you were talking about the kalshi prediction that deepthink would be delayed until December

torn bison
#

and polymarket has no market for it

jade egret
zinc ore
small haven
hollow ocean
small haven
#

maybe u should have kept ur bet @hollow ocean 😭

lime coral
small haven
ocean vortex
# jade egret so fact

Huh… I think there’s no way this model they used for IMO was not custom DeepThink version of their “large” model aka Ultra 🧐

#

Not Pro

jade egret
#
Poll: Get To AGI First?
Winning answer (option 3): Google, 13 of 17 votes

small haven
#

o3 pro's prediction, prolly wrong ?

#

cool. very excited for it

civic flame
#

is 2.5 ultra aimed for release before or after DT?

jade egret
#

are they even gonna release ultra?

zinc ore
#

28th aligns with the embargo drop for IMO

#

And I think it would be really smart to say "our deepthink system got gold at IMO" then either release it or announce it coming soon.

jade egret
#

is the deepthink that participated in IMO the same as the one that is getting released?

civic flame
small haven
#

2.5 ultra in august?

#

3.0 pro in september?

civic flame
#

3.0 models probably closer to oct

#

2.5 ultra probably early aug

ocean vortex
lone vector
#

What is the difference between Deep Think and Ultra?

civic flame
#

deep think is a swarm of agents in the same way that o3 pro and grok 4 heavy are

#

2.5 ultra is a big boy model (just one, not several agents) that is a different class to 2.5 pro

#

like what claude 4 opus is to claude 4 sonnet

jade egret
#

gemini cooking tho

frigid coral
#

has ultra been confirmed?

small haven
#

o3 pro is parallel reasoning, not ensemble reasoning, a bit different

jade egret
#

is claude 4 opus smart in general

#

is it smarter than o3 pro

zinc ore
jade egret
#

google is good : )

rare python
#

🍍

nocturne chasm
#

hey there, I'd like to get a newer grok for my project

stray aspen
#

craig whats your opinion on grok 4

small haven
#

🍋

torn mantle
#

🍋🍋

#

🍋🍋🍋

#

😖

keen beacon
#

uhm qwen 235b instruct is crazy..?

#

wttf

zinc ore
#

Does anyone know if o3 pro is multi agent?

jade egret
#

🍊

torn mantle
#

you tried it?

#

?

keen beacon
#

i cant believe how good it is

torn mantle
keen beacon
#

it's in distribution so it's expected. qwen always delivers on stuff in distribution. but it's incredible

torn mantle
#

You had me try it again

#

Idk vibes are off

keen beacon
#

wut are u trying it on

torn mantle
#

Feels like its not organized

torn mantle
#

I guess its not bad

jade egret
#

yall

#

is 2.5 pro smarter than opus 4?

#

ye?

#

why

torn mantle
#

Sad news nuuuuuu

jade egret
#

:C

torn mantle
#

Stooooop

#

Stop pls

keen beacon
#

ultra will still probably be released in one form or another i guess, or is it really dead?

jade egret
#

guys

#

do you think

#

gemini 3 flash will be better than gemini 2.5 pro?

#

dang

#

so i need to wait for gemini 3 pro : (

zinc ore
#

There was never any real indication we would get an ultra model

#

It was mostly "kingfall" and other similar models seem slightly bigger than 2.5 pro, must be ultra series

#

And everyone ran with it that ultra would be released 🔜

keen beacon
#

people are sleeping on qwen 3 235b instruct, uhm, the jump is insane

zinc ore
#

I think they don't wait more than a month to drop something comparable

#

I think Google could drop first if openAI waits too long

#

But based on everything they're saying, seems that won't occur

small haven
#

kingfall was amazing, why would they say that the lift is small?

keen beacon
#

they were fumbling revision after revision on ultra it seemed to me

leaden palm
keen beacon
leaden palm
keen beacon
#

i only see one statement from them on their twitter, and they said they'd stop doing hybrid

leaden palm
# keen beacon i only see one statement from them on their twitter, and they said they'd stop d...

ok looking at https://x.com/Alibaba_Qwen/status/1947344511988076547 that eliminates option 2, but option 3 is still a possibility per https://x.com/JustinLin610/status/1947346588340523222

Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507!

After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing

A small update on Qwen3-235B-A22B, but a big improvement on its quality!

We thought about this decision for a long time, but we believe that providing better-quality performance is more important than the unification at this moment. We are still continuing our research on hybrid

keen beacon
#

ya option 3 is definitely possible

#

but after the massive leap they achieved i think they'll stay separate for now

#

simpleqa jumping from 12.2 to 54.3 💀

zinc ore
#

🤔

hollow ocean
#

o3 pro only model that can flip $7 to $1100 before I blew it all on roulette @small haven

rare python
#

So 2.5 Ultra DeepThink but not 2.5 Pro DeepThink?

hollow ocean
#

I told it not to give any bets unless it’s confident @small haven

#

So there’s days where it wouldn’t give me any bets

small haven
hollow ocean
#

Nah

#

I deleted it

#

Just ask it to find the best bet for the day and if there isn’t any don’t recommend anything

#

Nope it does its own research

#

Yeah

#

o3 can’t do this it’s very bad compared to o3 pro

#

Yessirr

#

Yes way better

#

Deep research is trash tbh it uses outdated data 🤣

jade egret
#

but you need to pay like 250$ per month to use it : (

#

what does that mean

hollow ocean
#

It’ll do that sometimes

#

Just put the date

#

It always does

#

Never tried api

#

You have search off?

#

Put tmrs date

#

It’s prob cuz you put “kalshi mlb”

#

Just do mlb bets

jade egret
#

but i got the 20$ one : )

#

well

#

ig it very expensive

winged locust
sturdy mica
#

what

#

or 3rd party with mcp

quartz light
#

GUYS

#

GUYS

#

GEMINI JUST GOT

#

HEAVILY CENSORED

#

HOLY ####

#

ITS REFUSING

#

😭

#

aistudio btw

#

oh and also

#

they changed the logo of aistudio for a minute

#

i saved it but they reverted it

#

and also

#

the default temp is now 1 instead of 0.7

#

how is nobody talking about this

keen beacon
quartz light
#

alr

#

8K

#

converted from svg

#

svg wasnt directly a file but just embedded directly onto the page

#

heres the svg

#

they changed the favicon for a minute too but I didn't save it, it's just the same icon but on a black background so it's fine

quartz light
#

didn't expect them to revert it, imo its better

keen beacon
#

kinda mid idk

quartz light
#

idek what its supposed to be

keen beacon
#

butterfly?

quartz light
#

maybe a butterfly but

#

yeah

#

but its weird lol

pure anvil
#

nous research has the best AI logo by far

quartz light
# quartz light

im sure discord compresses it so if you want a high quality png then run this script I got kimi to make in a browsers devtools console:

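(async () => {
  // Paste the raw SVG markup between the backticks.
  const svg = `SVGFILEGOESHERE`;
  const size = 8192; // output width and height in pixels
  // Wrap the markup in a data URI so the browser can load it as an image.
  const dataUri = `data:image/svg+xml;charset=utf-8,${encodeURIComponent(svg)}`;
  const img = new Image();
  img.src = dataUri;
  await img.decode(); // wait until the SVG has been parsed
  // Rasterize the image onto a square canvas.
  const c = document.createElement('canvas');
  c.width = c.height = size;
  c.getContext('2d').drawImage(img, 0, 0, size, size);
  // Export the canvas as PNG and trigger a download through a temporary link.
  c.toBlob(b => {
    const a = document.createElement('a');
    a.href = URL.createObjectURL(b);
    a.download = `icon_${size}x${size}.png`;
    a.click();
  }, 'image/png');
})();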
quartz light
pure anvil
quartz light
pure anvil
#

how

#

either way idrc

ocean vortex
unborn ocean
#

I need a Qwen3-235B-2507 paper rn

#

Qwen Team cooking

torn mantle
#

wild also said it was

ocean vortex
torn mantle
ocean vortex
#

oh. I didn't scroll up lol

ocean vortex
#

I wonder what the output lengths of this thing are though

unborn ocean
unborn ocean
#

but idc

#

the improvements in post training are insane

ocean vortex
torn mantle
#

im talking about your personal experience

civic flame
torn mantle
#

yea

#

not working

keen beacon
#

broken for me as well :\

torn mantle
#

Dom can you leave

#

pls

keen beacon
#

need a third party to host lol

torn mantle
#

let me use it once

keen beacon
#

qwen hosting is sh1t

#

the quality is bad

#

qwen 3 suffers a lot from quantization/etc

torn mantle
#

ye

ocean vortex
#

Qwen3 235B A22B 2507

#

wtf is Parasail doing LMAO

keen beacon
#

they're hosting it in full precision though

ocean vortex
unborn ocean
keen beacon
#

people will try to game it

unborn ocean
#

no way chutes has everything full precision

#

people have no incentive to do that afaik

keen beacon
#

not everything but some models they host in full precision, or are supposed to be

#

just can't rely on it imo

ocean vortex
#

it's essentially outputting reasoning traces lol

civic flame
#

lol

#

kinda what DS did with the latest deepseek v3 version

keen beacon
#

kimi does that too, at least on a question i have

#

this new qwen 3 instruct model is better than most thinking models 💀 on that q

ocean vortex
# keen beacon just can't rely on it imo

Dunno, in my experience it worked great compared to even the most expensive providers, with the exception of Cerebras, which is in a league of its own. I would imagine they have checksum checks to make sure the expected model versions or quants always match. Though I will admit I do not know all the details of how their decentralized inference works behind the scenes

#

no way... this is R1 reasoning length territory 🤯

#

don't recall any answers longer than 6k from Kimi2 👀

#

still answered wrong this one

unborn ocean
#

btw got 50% on simplebench public

#

so nothing crazy

#

(on chutes, bc all other providers are down)

#

and it wasn't even that yappy

keen beacon
ocean vortex
keen beacon
#

on their bittensor discord iirc

#

im getting poor results from chutes right now

ocean vortex
keen beacon
ocean vortex
#

wdym. They can't host their own model properly?

#

that would be ridiculous though lmao

keen beacon
ocean vortex
#

I mean If that's true then perhaps they didn't train it properly either. Like there's no free pass on this from me 💀

keen beacon
#

didn't the bf16 vllm host score the highest?

#

it wasnt the alibaba one

#

yea

#

that's from what i've heard anyway

#

this may not apply in alibaba's api

#

yeah

keen beacon
unborn ocean
#

wild is hallucinating

torn mantle
#

yea im not feeling qwen

#

k2 feels way smarter

keen beacon
# keen beacon anyway, my point is about chat.qwen.ai

was trying to find the discord thread about this, i recall reading about it when lurking in one of the discord servers i was in. it might not be true and i might be misremembering again 🤷 lack of sleep catching up to me. i think it was about the 235b model and degenerate repetition on chat.qwen.ai a little after qwen 3 launched, i might be hallucinating that as well. idk

silk roost
#

Can i get API on lmarena???

gusty tendon
#

any comparisons for front end between kimi k2 and latest qwen?

#

haven't seen any

ocean vortex
keen beacon
#

qwen is 235b, kimi k2 is 1t

ocean vortex
#

yeah

#

lol

gusty tendon
#

ah okay, thought it was also 1T

#

i guess it's in the name, Qwen3 235B xD

ocean vortex
#

Looks like I can't even try it now with their updated interface though without adding billing... It's probably still gonna be slow lol

keen beacon
#

besides that one task type, im not too impressed with it after playing with it more. excited for the thinking version though.

unborn ocean
#

yeah, i am also a bit underwhelmed

ocean vortex
unborn ocean
#

for my tasks it is also barely usable, because it is just switching from that thinking-like mode with terrible formatting to the one where it is more human preference aligned
-> really weird chat experience

keen beacon
ocean vortex
#

they were able to improve non-reasoning one simply because they made it closer to the reasoning variant...

keen beacon
#

it gets it right despite using a harder version of the task, and does even less tokens than the reasoning version

#

the reasoning version often gets it wrong/reasons forever btw

cedar tide
#
Poll: What do the community want ?
Winning answer (option 2): Best voted models, 9 of 11 votes

ocean vortex
silk roost
#

Can we get API on lmarena???

#

Guys

whole wagon
#

You can try for free through openrouter rn

#

Parasail is $0.15/$0.85 for bf16

#

Kimi is fp8 on novita for $0.57/$2.30

#

The benefit of the Qwen model is basically when you have long context

#

Once you account for tokens the output price is closer
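
The prices quoted above can be turned into a rough per-request cost. This is a minimal sketch, assuming the figures are USD per million input/output tokens (the messages do not state the unit) and that "accounting for tokens" refers to Qwen producing longer outputs. The function name `request_cost`, the workload names, and all token counts are hypothetical and chosen only for illustration.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD of one request, with prices given per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Prices quoted in the chat (assumed to be USD per 1M input / output tokens).
QWEN_PARASAIL = (0.15, 0.85)
KIMI_NOVITA = (0.57, 2.30)

# Hypothetical workloads: (input tokens, Qwen output tokens, Kimi output tokens).
# The 3x longer Qwen output is an illustrative assumption, not a measurement.
WORKLOADS = {
    "short prompt, long answer": (2_000, 6_000, 2_000),
    "long context, short answer": (100_000, 1_000, 1_000),
}


def compare():
    """Return {workload: (qwen_cost, kimi_cost)} for the workloads above."""
    rows = {}
    for name, (inp, qwen_out, kimi_out) in WORKLOADS.items():
        rows[name] = (
            request_cost(inp, qwen_out, *QWEN_PARASAIL),
            request_cost(inp, kimi_out, *KIMI_NOVITA),
        )
    return rows


if __name__ == "__main__":
    for name, (qwen, kimi) in compare().items():
        print(f"{name}: Qwen ${qwen:.5f} vs Kimi ${kimi:.5f}")
```

With these made-up token counts the two models land close together on the short prompt (about $0.0054 vs $0.0057), while the long-context request is much cheaper on Qwen (about $0.016 vs $0.059), which fits the point about long context.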

#

The new Qwen 3 does not even have reasoning yet

#

So ofc not lol

#

Source?

#

in some benchmarks

#

I think grok4 is quite behind still. When qwen and kimi add reasoning to their new releases i think they will overtake it

sage raptor
#

At what minute did he say that

#

.

patent aspen
#

If OAI needs to integrate O3-alpha, that would be a big delay

civic flame
#

i think it makes more sense for 2.5 ultra to release this week

#

3.0 is likely sept time or so i thought

torn mantle
ocean vortex
#

took them barely any time at all. That means there are no problems with complying with privacy respecting laws if they want...

#

lol

unborn ocean
#

4.1 is so stupid it really makes me question the possibility of agi

#

coding with it is such a pain

#

feels like gpt 3.5

whole wagon
#

That's what happens when you use a bad model

#

Use a better one

#

Claude exists

pure anvil
unborn ocean
#

so i thought i would try

#

but it is not worth the time

unborn ocean
keen beacon
#

Since they confirmed GPT-5 soon , i would assume 1-2 weeks away, by the end of next it should be live .. or they lied 🤥
I doubt its the second since they are going with the router version as a start (weaker version than what was planned)

whole wagon
civic flame
#

do you know if they plan to release it on API too?

#

wonder if it'll cost more than gpt-4.5

keen beacon
#

But its all speculation, it can also be a live stream

brittle tiger
#

You remember where the OpenAI key person hiring info is from?

whole wagon
#

I think they both need to be fairly substantial jumps. The open source models are closing the gap to the current frontier quick

#

Like Kimi K2 / updated Qwen 3 when they add reasoning to them will be very good

patent aspen
#

It's predictable that innovation will happen but unpredictable where and when it will happen

jade egret
whole wagon
#

What did you think about this IMO drama Kappa

civic flame
#

y

#

more like early aug or so

jade egret
#

dang

#

😭

civic flame
#

coming back to this.. are the benchmarks we were shown for deep think based on 2.5 pro + DT then? or were those based on an even earlier version of 2.5 ultra?

#

if it's the former surely it's got significantly better since

#

wow then

#

this thing should cook o3 pro

#

and grok 4 heavy for that matter

#

i presume the google anon models on lmarena were non-DT ultra?

#

yup

civic flame
#

pre-merger with DT?

#

hmm

#

given google's track record of testing everything on arena i wonder if we'll see DT on arena soon

#

also i think i may still have access to one of the ultra checkpoints lol

#

one of my red teaming platforms hasn't removed the anon google model they gave me access to a while ago and it gets questions right that only ultra did

#

the version i have access to streams summarised thinking

torn mantle
#

time to report that

#

jk

rare python
civic flame
#

reported DT results at IO were from a 2.5 pro-based model

#

the final model will be quite a lot better

rare python
#

Oh

#

I wonder why they switched to Ultra for DeepThink

#

Is the cost worth it?

rare python
balmy mist
#

4o has gotten so good recently, like i gave a prompt to 4.5 and o3, but 4o responded the most human tbh

rare python
#

Q4 2025?

civic flame
civic flame
#

somewhere around sept or oct

torn mantle
#

ive tried it the other day

balmy mist
#

is o3-alpha still on web dev?

whole wagon
#

openAI is so damn shady I don't even know whether to trust the open source is still coming any more 😂 they are totally silent on it

#

Maybe it was all a PR stunt

#

Who knows

#

This IMO stuff made me have doubts

vivid gorge
#

Hi, any idea why do I get "Session not found. Redirecting to home..." after a few hours inactive chat? But other session is still active after 24h on second browser? Is it cookies/browser related or some random database wiping issue?

whole wagon
#

Deep infra has Qwen updated model at $0.13/$0.60

#

😂

echo aurora
gentle plinth
#

but maybe it was really just pr

#

closedai

zinc ore
#

Can AI Learn to Trade?💹

Introducing BAZAAR - a new LLM benchmark for economic decision making!💵

In a simulated double-auction market, I pitted top LLMs against each other and classic trading algorithms. One goal: maximize profit.

The o3 and Gemini models top the rankings!


stray aspen
#

yo has anyone heard of that yupp.ai website

whole sundial
#

"coming soon" = tonight? (in China)

torn bison
errant cave
#

Why does Qwen still put gratuitous spaces before commas and periods

tall summit
tall summit
whole sundial
#

you can even talk to qwen3 coder if you want

#

actually very fast

errant cave
#

Qwen3-coder is not good

#

I asked it to write an adder-subtractor in structural Verilog and this is what it came up with:

// 4-bit Adder-Subtractor module
module adder_subtractor_4bit (
    input [3:0] a,
    input [3:0] b,
    input mode,          // 0 for addition, 1 for subtraction
    output [3:0] result,
    output cout,
    output overflow
);
    wire [3:0] b_modified;
    wire cin;
    wire carry_out;
    
    // XOR gates to control subtraction (2's complement)
    assign b_modified[0] = b[0] ^ mode;
    assign b_modified[1] = b[1] ^ mode;
    assign b_modified[2] = b[2] ^ mode;
    assign b_modified[3] = b[3] ^ mode;
    
    // Input carry for subtraction (1) or addition (0)
    assign cin = mode;
    
    // Instantiate the ripple carry adder
    ripple_carry_adder_4bit rca (
        .a(a),
        .b(b_modified),
        .cin(cin),
        .sum(result),
        .cout(carry_out)
    );
    
    // Assign outputs
    assign cout = carry_out;
    assign overflow = carry_out ^ (result[3] ^ a[3] ^ b_modified[3] ^ cin);
    
endmodule
#

Ts does NOT subtract

cedar tide
errant cave
#

Alright I'm an idiot this does subtract

#

That's solid except for the assign which is not as explicitly structural as I requested
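
Whether the posted module subtracts can be checked exhaustively, since there are only 256 input pairs per mode. The sketch below models the datapath in Python. It assumes the `ripple_carry_adder_4bit` submodule (not shown in the chat) is a standard 4-bit adder with carry-in and carry-out; the helper names `to_signed` and `check` are hypothetical.

```python
def ripple_carry_adder_4bit(a, b, cin):
    """Assumed behaviour of the unshown adder: 4-bit sum plus carry-out."""
    total = a + b + cin
    return total & 0xF, (total >> 4) & 1


def bit3(x):
    """Most significant bit of a 4-bit value."""
    return (x >> 3) & 1


def adder_subtractor_4bit(a, b, mode):
    """Mirror the posted module: XOR b with mode, feed mode in as carry-in."""
    b_modified = b ^ (0xF if mode else 0x0)
    result, carry_out = ripple_carry_adder_4bit(a, b_modified, mode)
    # Overflow expression exactly as written in the posted code.
    overflow = carry_out ^ (bit3(result) ^ bit3(a) ^ bit3(b_modified) ^ mode)
    return result, carry_out, overflow


def to_signed(x):
    """Interpret a 4-bit value as two's complement."""
    return x - 16 if x & 0x8 else x


def check(mode):
    """Return (wrong results, overflow-flag mismatches) over all 256 inputs."""
    wrong_result = 0
    overflow_mismatch = 0
    for a in range(16):
        for b in range(16):
            result, _, overflow = adder_subtractor_4bit(a, b, mode)
            expected = (a - b) if mode else (a + b)
            if result != (expected & 0xF):
                wrong_result += 1
            exact = (to_signed(a) - to_signed(b)) if mode else (to_signed(a) + to_signed(b))
            true_overflow = int(not -8 <= exact <= 7)
            if overflow != true_overflow:
                overflow_mismatch += 1
    return wrong_result, overflow_mismatch


if __name__ == "__main__":
    print("add:", check(0))
    print("sub:", check(1))
```

Under that assumption the `result` output is correct in both modes, so the design does subtract. The `overflow` flag matches signed overflow in addition mode but comes out inverted for every input in subtraction mode, because the expression XORs in `cin` on top of the usual carry-in-to-MSB XOR carry-out check.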

ocean vortex
# errant cave Qwen3-coder is not good

That's my general view about qwen3 as a whole. They were trained to have some nice numbers for irrelevant things (like base model benchmark scores) but are not actually super useful, and qwen3-235b is compromised by size compared to both R1 and Kimi lol

keen fulcrum
#

grok 4 coder will blow the industry up

#

only few weeks left

cedar tide
#

Qwen 3 coder has 1m token context

errant cave
#

Besides it's some middle ground nobody wants

ocean vortex
#

they will train it on a few benchmarks which are gonna be the same ones they will show in their marketing

#

but it will most likely still suck for IRL, for things like webdev arena

cedar tide
#

@echo aurora add qwen 3 coder to webdev arena

errant cave
#

I don't even look at benchmark scores anymore

#

The arena doesn't lie

ocean vortex
#

Basically... you can trust people to tell which web design is better and which code actually works, but rating the text responses for questionable prompts and picking a winner... that's much more challenging lol

errant cave
whole sundial
#

qwen minecraft

#

(qwen3-coder)

cedar tide
#

guess the models

torn mantle
sour spindle
torn mantle
#

So they only focused on frontend dev?

keen fulcrum
sour spindle
#

I have a free month of grok 4 and I don’t use it at all.

#

I wish i had more reason to use it because it is quite fast

empty stump
#

I thought it was slow

#

and what are the rate limits?

civic flame
#

this is for sure trained on Claude 4 outputs lol

keen ferry
#

wow they actually added grok 4 with internet connection

unborn ocean
# unborn ocean
Poll: where would you rather work
Winning answer (option 1): deepmind, 13 of 14 votes

civic flame
#

hm?

#

oh

#

lol yeah

torn mantle
#

its bad

#

qwen coder

#

not that good at all

#

have you guys tried it on 3d simulations

#

it does have claude UI styling

#

but thats it

#

i mean ive tried guiding it to make a usable code at least

#

but its not working

#

mini basketball game 3d simulation

#

ive seen it somewhere

#

someone made that with o3-alpha and gemini 2.5 pro

#

its not bad, it got many things right but the game isnt working

#

and the map is a bit off

craggy depot
#

Hello, do you guys know the best AI model for hard coding problems (2500+ rating CP)? I know o3, GPT pro and Gemini Pro are good, but whenever I give them a completely new problem (one that doesn't exist anywhere on the internet), they give wrong answers. Even with a proper description, proper examples, constraints, and a very good prompt, they can't solve those problems. Please suggest an AI model that can solve this.

zinc ore
cedar tide
#

@echo aurora

echo aurora
torn mantle
unborn ocean
#

and the thing is they are always poaching these RL / reasoning / post-training people

#

the thing that the labs do not really have much of in the first place

#

and the area where most of the advancements were made in the past time

#

so it really might be slowing down progress temporarily

torn mantle
#

sigh

jade egret
#

2.0 flash thinking is not smarter than 2.5 pro : )

unborn ocean
#

well post training is still data

#

but it is undeniable that the largest trends in llm world are very much about post training / RL

#

currently at least

#

?

#

idk what to tell you, i was talking about where we get large advancement, not what is more necessary to train a good model

#

or any model for that matter (bc without unsupervised pretraining you get nothing later either)

#

i am not attributing it all to post training, in many cases we dont even really know where the money and compute went
but i can tell you that the whole reasoning and ttc compute paradigm did not come out of pretraining advancements by itself, but had to be "created" using RL and that many other things, like good coding abilities, a lot of things that involve interaction with the user etc. are all a product of post-training / RL rather than pre-training

#

they aren't managers i think

#

many of the others are though

#

idk why you are so opposed to me claiming that most of the advancements of the last generation were in fact not made in pre-training

#

when did i ever claim anything remotely related to a "2 lines of code worth 400m"

#

or that SFT / RL is not "doing the work"

#

it is usually actually even more work, because you have to synthetically generate environments, problems or what ever because of the lack of plentiful SFT / RL data

#

^per token

#

well, my point was not that they stopped that

#

the idea is just:
many advancements made in post training / RL -> labs want people there

#

meta buys some of them -> slows down progress

#

the same thing could happen to pre-training obviously, but they are focusing on the people who do post-training / RL apparently

#

likely because they themselves actually already have more or less adequate people for pre-training (but have historically mostly underperformed in post training and never tried RLVR much)

#

yes, surely that is part of the plan to some degree, but hiring the brains is a serious possibility, when you look at how quickly labs are founded, funded and then abandoned these days

#

literally all of the big labs have to some degree followed the strategy (maybe gdm as the sole exception. although not a full exception either)

ornate agate
#

I think the only thing which isn’t public is exactly how the deep think agents work. How to do RL etc is in papers from DeepSeek/Kimi/Qwen/China.

zinc ore
#

Nightride new Google model in arena

torn mantle
#

so many new models added

#

nightride-on
nightride-on-v2

#

and some ernie ( chinese models ) + qwen latest model

#

yummy 😋

gentle plinth
zinc ore
#

It's theinformation, of course

torn mantle
#

could it be 2.5 flash with deep think enabled?

#

yes or nah brian?

civic flame
#

an updated version?

torn mantle
#

i have a feeling that you know more than you should

#

spill the tea

#

continue

#

im listening

#

next next gen = ?

#

gemini 4?

#

5?

#

mm i see

civic flame
#

i presume the timeline is like

misty star
#

Is it possible to directly talk to stealth models

civic flame
#

updated 2.5 flash, then 2.5 ultra/deep think?

#

what else is there to update

#

2.5 pro is done

#

is 2.5 flash lite GA i forgot

#

hmm

misty star
#

Lmarena my beloved

civic flame
#

surely DT would make the most sense for release on the first week of aug then

#

💔

rare python
#

Our IMO gold model is not just an "experimental reasoning" model. It is way more general purpose than anyone would have expected. This general deep think model is going to be shipped so stay tuned! 🔥

Quoting Melvin Johnson (@melvinjohnsonp)

So happy to see this incredible achievement.
Huge congrats to @lmthang, @quocleix, @YiTayML and the IMO team on the result.
This was a great collaboration across teams to build a general Gemini DeepThink model that can also get gold at IMO.


torn bison
#

2.5 Pro-002👀

#

will wolfstride and stonebloom be released? I feel like they're a bit better for daily use than 2.5 pro (though not by much)

jade egret
#

do yall think meta gonna win because of talents?

patent aspen
#

And they'll probably do all right

cedar tide
#

How good is nightride ?

patent aspen
#

OAI especially

#

They've lost most of their top talent

cedar tide
#

Anyone see EB45 ?

main gulch
cedar tide
#

@torn mantle @zinc ore nightride good ?

torn bison
#

I'm not very optimistic about the team led by Alexander Wang

torn mantle
torn bison
#

That person seems exclusive, aggressive, and boastful

patent aspen
#

The thing is: if you build an organization by luring job hoppers with massive compensation packages, you have to assume that your organization will be very volatile and expensive to upkeep

civic flame
#

🗣️

jade egret
#

what abt google : )

patent aspen
jade egret
#

:0

#

oh

patent aspen
#

OAI is mostly built from poached talent, which is why they're so vulnerable to losing their top talent

cedar tide
#

@echo aurora Ernie bot responds in Chinese 🤦

jade egret
cedar tide
jade egret
#

maybe it can only speak in english or chinese

cedar tide
jade egret
#

o

torn mantle
#

: o

#

:0

jade egret
#

:0

#

😮

#

:0

torn mantle
#

What's new about this model

#

We already had ernie 4.5 before

cedar tide
torn mantle
#

I see that it's a turbo version

#
  • vision supported
cedar tide
torn mantle
cedar tide
#

So its the closed source and not Ernie 4.5 open source ?

torn mantle
cedar tide
torn mantle
#

They are all closed source, no?

torn mantle
cedar tide
#

The ERNIE 4.5 series is now officially open source. This family of models includes 10 variants—from MoE models with 47B and 3B active parameters, the largest having 424B total parameters, to a 0.3B dense model—all available now to the global AI community for open research and

surreal creek
#

Amazon should be focusing on giving their warehouse employees bathroom breaks rather than trying to enter the AI market as a non-tech company 🤣

leaden palm
#

Bay Area House Party (esp. recent ones) is a very fun read

echo aurora
cedar tide
#

when I just asked it its name it answered me in French, but when I sent it a complicated prompt containing several questions it replied with what I sent you

#

Its response, translated into English:
"Sorry, this feature is not yet available online. You can also ask me other questions in Chinese or English, and I will do my best to answer them."

jade egret
#

2.5 flash lite is so fast

leaden palm
# leaden palm
poll_question_text

how did you read the qwen situation?

victor_answer_votes

6

total_votes

9

victor_answer_id

1

victor_answer_text

they've stopped doing hybrid thinking models

jade egret
#

🍊

torn mantle
#

Its a large language model, it should speak several languages, could be just a system prompt issue/instructions

frigid coral
formal dagger
torn mantle
formal dagger
torn mantle
#

one has search enabled

torn mantle
verbal nimbus
#

New Qwen 3 480B coding model, 1M context size

whole wagon
#

its still way slower than 2.5 flash lite

#

k2 gets 200 tps on groq

whole sundial
#

@echo aurora sorry for pinging but there are like half a dozen accounts in https://discord.com/channels/1340554757349179412/1395441703112146984 that seem to be creating fake hype. all of the accounts joined in the past day, and most of those were also created in the past day as well. There are two other accounts there that both joined on the 17th. I feel like they are run by the company and they are trying to get people interested in a model that isn't even widely available. Only legit accounts there are yours and DavidSZD's.

#

AI gen text lol

#

"This isn't just X; it's also Y"

#

there is another message that is structured very similarly to that one

#

i don't even know how to get access to this model without signing up for their API

tall summit
whole sundial
#

if you have to generate fake hype using newly made accounts on an AI discord server to promote your model, then it is likely not good

#

lol they tried to get one of their models in llama.cpp earlier this year but they failed, likely because nobody was interested in the model

cedar tide
#

GLM 4 Air was removed from WebDev Arena after 2 months,
I hope it makes it onto the leaderboard

whole sundial
#

and i can't find a single site that has the model

#

any of them

#

it's like it doesn't exist outside of China

#

the people in that chat have very rough English and speak in AI-like patterns

#

I feel like iFlyTek is trying to use LMArena to get themselves known outside of China because they don't have the advantage that DeepSeek, Alibaba, Moonshot, Baidu, and ByteDance have: iFlyTek is virtually unknown outside of China and they don't use social media or make their models readily available via a chatbot. The only thing they are known for in the US are cheap Android tablets and translator pens on Amazon and AliExpress.

whole wagon
#

grok thinking time is absolutely absurd for me

#

its taking 10 minutes each time i prompt it

#

im able to prompt 6 times an hour 💀

ocean vortex
ornate agate
#

I wonder which one

whole wagon
#

Could be that grok just has a lot of active params and they serve it at a loss

ocean vortex
whole wagon
#

xAI can subsidize the inference itself. If they have that massive cluster just for training they can train massive models. The performance doesn't necessarily mean it is the same size as the other LLMs; it may have simply underperformed

ocean vortex
#

That is very unlikely. If they grossly overestimated the model size needed, they wouldn't have been competing with Google now. It just wouldn't be possible even with the relatively high number of GPUs xAI has

#

Money and even compute is not everything, and Meta is a good example of that lol

#

A mistake like that can still break the entire project

#

Yeah a total failure (Behemoth)

#

and exactly my point

ornate agate
#

its gonna be priced in line with market pricing for performance, not size.

ocean vortex
#

At least for closed models

ornate agate
#

idk. DeepSeek/Kimi kinda forced a massive price lowering.

ocean vortex
#

They aren't pricing their models at cost + a fixed percentage. It's more of... "what price should we set for maximum profits?" lol

#

ofc this still doesn't offset R&D, but that's a different topic...

ocean vortex
#

Deepseek went immediately open-source. When you do that very high margins on official API are not really possible... They just went extremely competitive/aggressive. But even with their API pricing they aren't doing it at a loss 👀

torn mantle
#

grok 4 heavy thinking should be similar to gemini deep think, but i dont think it is

#

was it tested on IMO or nah

ocean vortex
#

Trust me they still have decent profit margins lmao. With their traffic and resources their whole infra is much more efficient than what most others are using.

torn mantle
#

im actually so interested in how many answers it can get right

#

thanks, but they didnt use heavy thinking right

#

its only grok 4 reasoning

ocean vortex
#

It's not the same size but Deepseek's cost to run R1 is less than $1 per 1M output.

#

with less than ideal infra

torn mantle
#

i just dont understand how can moonshot/kimi afford running a 1T params model without any issues

#

they have little to no downtime in their servers

#

and im also taking into consideration the crazy traffic coming from china

ocean vortex
#

OpenAI inference...? They have substantial margins even after reducing the price of o3

torn mantle
#

is it running only on h20

#

kinda wild

ocean vortex
#

It's gonna be cheaper than that on OpenAI's infra tbh

#

Also... "free electricity"? Source?

ocean vortex
#

So it's not free then

torn mantle
pure anvil
#

I don't know if american labs do though

#

they'd be dumb not to

ocean vortex
#

They have a partnership with MS. They absolutely are doing what makes the most sense and is cheapest to do. They aren't sanctioned and can both buy and rent GPUs.

#

So China's lower electricity cost at best is just gonna offset the less efficient GPUs and lesser infra, but that's unlikely

#

It's not me who needs to prove it, it's you. Cause what I'm saying is common practice and what you are saying is just inexplicable. o3 is the same size as gpt4.1, and they always had profit margins on gpt4.1

#

saying that they are losing money on inference is crazy

#

What isn't? To price your models higher than what it costs you to merely run them? It absolutely is a common practice lmao

whole wagon
#

xAI has a 50 times smaller user base and they already offered grok 3 mini at a loss

pure anvil
ocean vortex
#

Even for Deepseek

whole wagon
#

It is reasonable to think they can do the same with grok 4

pure anvil
#

You have no idea what you're talking about

whole wagon
#

Also anecdotal but grok 4 really has that big model smell to it

#

Like gpt4.5 had

pure anvil
#

openai won't be profitable until 2029 probably

whole wagon
#

Well he assumes that because it would be a huge failing for OAI to be making a loss on inference. And he is an OAI supporter

pure anvil
#

they're burning money

ocean vortex
#

They are losing money because of R&D and all the salaries and expenses, NOT on inference. And there are many reasons to say they are turning a profit on inference in isolation, and a fairly substantial one (in isolation)

whole wagon
#

If OAI is making a loss on inference they are basically cooked

ocean vortex
pure anvil
#

it's like a bug in his mind he can't accept that openai is bad in any way

ocean vortex
#

Look at their price cuts historically for the same checkpoints

#

that's the best source

pure anvil
#

i couldn't imagine glazing a corpo that hard

#

just pathetic

ocean vortex
#

Are you just gonna pretend you don't know that it's not common at all for the closed-source AI labs to share info on their cost to run a model? Of course there's not gonna be black-and-white direct proof from them, but we can still look at what we know.... There's NOTHING to suggest that what you are saying is true (they are losing money on inference)

#

it's just silly, crazy and insane.

#

@ornate agate

#

And there are several things to suggest they are making money:

  1. o3 is not a huge model. OG gpt4 was downsized into 4Turbo, then that was downsized into current base model.
  2. They were in a position to do drastic price cuts no problem
  3. Their resources and traffic allow them to have very good and efficient infra
  4. We roughly know the pricing to run a model, any model
unborn ocean
ocean vortex
#

No one is saying they are making money overall, we are talking about just inference which is small part of their operations

whole wagon
#

How did we even reach a point where the frontier model of the top ai company (with a huge time lead) ends up facing competition from small Chinese research teams

#

It's a wild timeline

ocean vortex
#

Inference is laughable money compared to their entire expenses

pure anvil
#

It's for the best, it brings prices down for everybody

#

plus something like AI should be open source anyway

#

considering where the training data comes from

ocean vortex
#

So yeah... While we will never have definitive proof from OpenAI disclosing this, there are far more reasons to suggest they are turning a profit on inference, with the opposite being extremely unlikely. For the reasons already stated (for instance in that #general message)

whole wagon
#

If GPT5 releases late enough there's a substantial chance either Kimi or qwen is going to take SOTA

#

With adding reasoning to the updated models

#

Also if OpenAI's open source model is actually SOTA it has to beat o3. So I don't see how that works

pure anvil
whole wagon
ornate agate
ocean vortex
whole wagon
#

Well it's delayed till September apparently

ocean vortex
#

And they will have next gen of closed models soon

whole wagon
#

So it has to compete with the next gen of open source

ocean vortex
#

so... doable

pure anvil
#

imo

ocean vortex
pure anvil
#

If they hadn't released a good open source model then closed sourced providers would make the prices as high as possible

ornate agate
whole wagon
#

The closed source providers would still compete with each other eventually. But currently they are focused on scaling up as fast as possible

pure anvil
ocean vortex
#

I think Deepseek not only kept western companies in check, they kinda also kept Chinese themselves in check lol

#

if you look at alibabacloud, prices are not really any better than what people were used to before Deepseek

#

Kinda insane that Chinese corps are charging that

whole wagon
#

Currently I don't see any reason to use a closed source LLM. Maybe that changes with gpt5

#

The open source LLMs are near identical performance for far cheaper

ocean vortex
#

crazy

whole wagon
#

I will just take the far cheaper one

pure anvil
#

that's not too much of an accomplishment but it's a start

ocean vortex
#

even less activated parameters as well

ornate agate
pure anvil
ornate agate
ocean vortex
#

if they did, they wouldn't stay relevant for long

#

Also this has been debunked like 10 times now

ocean vortex
#

with people doing independent testing

ornate agate
pure anvil
ocean vortex
# ornate agate yes every single one of them must be imagining it.

They are not "imagining" it, their reactions are normal given impressions/expectations and limited knowledge, without any real testing. Happens everywhere, not just with AI. It's also how forums and subs work: the complaining ones are the loudest. There's no (or much less of a) reason to post otherwise

pure anvil
#

yeah very cheap model (2$/m) too

ocean vortex
#

What was the prompt for this? 👀

#

I just looked at that temp thing on polymarket... Looks like my guess just might be correct. 🤯 Hopefully no major changes from now

#

well I said "mine" but this was o3 and deep research lmao

#

was at ~30% chance at the time, less than 0.95-0.99 range

#

@keen beacon

pure anvil
#

have they gotten that good?

#

oh you mean AI written

#

i interpreted it as AI narrated

#

I listen to T6 when falling asleep a lot

#

lol

#

tru, like I can't fall asleep to music either for similar reasons

pure anvil
#

opus 4

hollow ocean
#

agent delayed for plus users

#

will be available in the coming weeks

keen beacon
#

I had a friend who screenshotted stocks and asked gpt for advice lmao, he won a few $ before losing thousands 🤣

hollow ocean
#

😂

#

You gotta use o3 pro if you wanna make money

keen beacon
# hollow ocean He prob used 4o

Yes, cause he did day trading or rather minute trading 🤣 he asked gpt 4 every 15s on what to do and needed fast responses

torn mantle
#

what about const@2048

rare python
echo aurora
unborn ocean
#

has anyone tried the deepinfra api for gemini models?

ocean vortex
dense moon
#

I don't know if this is the right place for suggestions, but it would be nice if copy and paste were easier. If I want to copy a whole conversation, I have to either copy every message one by one, or use ctrl+a and have it backwards and chaotic

torn mantle
#

shroddy is confused

dense moon
#

Yes I am.

#

I wanted to make a new feature suggestion but I think I am in the wrong channel here and don't know which is the correct one

torn mantle
torn mantle
#

Although it's not really a bug

#

That feature can come in handy

#

Or you can make your own little script (userscript) that does that till they add it natively

#

What i really want lmarena devs to do is to handle errors better, not just a general catch-all exception, especially on the sandbox api issues

#

There is also that session issue, the code/msg errors arent specific

#

But the most annoying one is the sandbox one, you don't really know if it's a model issue or a rendering lib issue
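A minimal sketch of what "more specific than a general catch" could look like, in Python. Everything here is an assumption for illustration: `ModelError`, `RenderError`, `run_sandbox`, and the two callables are made-up names, not anything from LMArena's actual code.

```python
class ModelError(Exception):
    """Hypothetical: the model produced no usable output."""


class RenderError(Exception):
    """Hypothetical: the sandbox could not render otherwise valid output."""


def run_sandbox(generate, render):
    """Run generate() then render(), reporting which stage failed.

    Both callables are placeholders for illustration, not a real LMArena API.
    """
    try:
        code = generate()
    except ModelError as exc:
        # The failure happened before rendering, so blame the model stage.
        return f"model issue: {exc}"
    try:
        return render(code)
    except RenderError as exc:
        # Generation succeeded, so the rendering stage is at fault.
        return f"rendering issue: {exc}"


def broken_render(code):
    # Simulates a rendering library failing on valid model output.
    raise RenderError("library failed to load")


print(run_sandbox(lambda: "<div>hi</div>", broken_render))
```

Splitting the two stages into separate `try` blocks is what lets the message say which side failed, instead of one generic error for both.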

misty star
#

Ooooooooooo

#

WWWW

languid crescent
#

omg lmarena just dropped a new banger

#

i just refreshed the page and instantly noticed the change 😭

#

i love this community

echo aurora
misty star
#

This is amazing

languid crescent
#

ABSOLUTE W UPDATEEEE

echo aurora
tribal aspen
#

why does the lmarena search take forever?

#

and then returns error?

languid crescent
#

its working fine for me

#

i like the ui for the search thingy its neat

echo aurora
languid crescent
#

@echo aurora THIS IS AMAZING!!!!!!! ❤️ compared to gemini (free) it only gives out 5 sources but here it gives more than 20 😭

tribal aspen
#

nvm

#

was happening with the gemini model

languid crescent
#

this is absolutely nice it will help me so much in learning 😭

echo aurora
languid crescent
echo aurora
languid crescent
#

i like the current buttons now its much cleaner

olive mesa
#

how many zettaflops of compute power is this?

nova ginkgo
#

finally we got web research 🤩

languid crescent
#

web research is a game changer for a student like me 😭 it makes researching for sources much easier

#

W lmarena community ❤️

golden ocean
#

is mimi ai

languid crescent
#

No i'm not bro 😦

#

I'm just really happy that LM arena is giving out updates like this that could help me as a student :))

unborn ocean
#

otherwise i will view xAi as a massive failure

amber warren
sturdy mica
#

search is awesome

#

W

#

waited a while for this

#

very cool

mint sparrow
#

genuine question, im just new to lmarena, but if ive been in one thread using it for some time and it just has an endless "generating" text bubble, does that mean ive broken it/run out of time with the specific model?

whole wagon
#

💀

#

He really put them on blast damn

torn mantle
#

what were they thinking?

#

doesnt make any sense

gentle plinth
#

some guy from the team replied to him tho, so it could very well be that they help him reproduce it

whole wagon
#

Or just damage control

#

The model performance should not be shifting that much between JSON and text formatting

gentle plinth
whole wagon
#

If it is that's a whole other issue

torn mantle
gentle plinth
#

i also found the json thing weird, but then again i am not an expert on this

#

isnt there a standard format for the benchmark?

#

or is he talking about something else

keen beacon
#

its a benchmark harness thing probably

torn mantle
#

they may have finetuned on both public/private data for that specific benchmark

gentle plinth
#

how could they get the private data?

keen beacon
#

this has happened before btw. qwen will get it sorted out, im sure

gentle plinth
#

llama4 😅

whole wagon
#

Recently I had an argument with ppl that thought open source models would never manipulate benchmarks

#

Open source stans just as bad as closed source stans

gentle plinth
whole wagon
#

If it is grok 4 it can take over 10 minutes

keen fulcrum
torn mantle
#

idk or some smart reward func specifically made for arc-agi-1

#

the thing with these benchmarks like arc-agi is that we want a generalisation performance gain, or an emergent intelligence

#

but if its specifically trained and finetuned for that, then its basically useless

#

and thats what qwen are doing

gentle plinth
#

so if i understand correctly you say that they trained on the json data, but when using a different format it fails?

torn mantle
#

also dont you find their other benchmark numbers ridiculous?

torn mantle
#

im talking about their goal in general

#

are they really trying to make a smart model? or are they just focusing on beating others on some benchmark?

vivid gorge
#

Hi, maybe a silly question - is there a way to terminate an endless "Generating..." answer? This particular chat is stuck and I see no way to break the loop. Except maybe the "Delete" ... command in the chat list.

echo aurora
vivid gorge
#

Thank you.

primal orbit
#

or close/open

vivid gorge
# primal orbit or close/open

Unfortunately refresh, close/open and even moving to other browser by export/import cookies doesnt work. Still stuck on endless Generating...

civic flame
#

it's that time of the month again

#

got access to some new anon models

#

so far at least one is from oai and passes a lot of my hard prompts (gets ones o3 gets wrong right)

misty star
#

Is it true some people have access to new models

ocean vortex
#

it's great

torn mantle
#

or gpt-5

civic flame
#

can't tell

#

seems similar to "o3-alpha"

gentle plinth
#

what is your definition of agi?

echo aurora
#

Hey - we just launched something really special for this server. **The LMArena Discord Bot is available, right now!**

What does the bot do?

  • With the LMArena bot you can generate videos, images, and image-to-videos. Similar to battle mode, you’ll be given two generations and anyone will be able to vote on which they prefer. After a certain number of votes the bot will reveal the models.

Why are you telling us about it in general chat and not announcing it?

  • We’re considering this a bit of a soft launch. We want to test the waters with it and see what you all think first.

To learn more about how the bot works, check out #1397655624103493813. Note that you can only use the bot in these channels: #video-arena-1

Keep in mind there is a generation limit per day, so make sure your prompts count!

gentle plinth
#

interesting. i think i would define it a bit differently. maybe too simple, but i would define it as an ai that can do any work that a human can do at the computer. Except maybe for top 1% work in terms of difficulty/payment. so this would include longer tasks, researching topics, learning new things about that topic, learning new programming languages, and working coherently on a project which wouldnt fit in any context length of existing models so far

torn mantle
#

its obvious that the next iteration will be good

#

well whatever..

gentle plinth
#

manifold says 2030 but i absolutely dont like the resolution criteria, as gpt-4.5 can already pass a turing test according to one paper https://manifold.markets/ManifoldAI/agi-when-resolves-to-the-year-in-wh-d5c5ad8e4708

Manifold

This market resolves to the year in which an AI system exists which is capable of passing a high quality, adversarial Turing test. It is used for the Big Clock on the manifold.markets/ai page.

The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behaviour equivalen...

whole wagon
#

What the hell I put my life savings in there and got nothing back??

#

(They got hacked)

gentle plinth
#

i've seen this so often lately on x.

#

makes me wonder if there is more to it, like elon behind it or smth, but probably its just that he fired a bunch of people that worked on the security of the platform

ocean vortex
#

Qwen3 output length looks hilarious next to Sonnet4-thinking lol

glass arch
#

guys which model is the best for me

ocean vortex
#

Though I'm forgiving Kimi2 👀

glass arch
#

I have been using chatgpt plus for a while because I want to use it for code and schoolwork, but it's kinda stupid ngl

ocean vortex
glass arch
#

(using o4-mini)

glass arch
#

I have used o3

#

it has limits tho

#

should I switch to gemini or something else

#

because chatgpt hasn't really been keeping up

ocean vortex
#

Yeah do use o3. I was joking with gpt3.5 :gptdrawncat:

keen ferry
#

I guess o4 mini (high) for coding and o3 for school work

ocean vortex
glass arch
#

keeping up

ocean vortex
#

o3 has no issues with keeping up

glass arch
#

is gemini (or claude or whatever) better at these tasks? I have been thinking of switching to something else because openai is behind on development

ocean vortex
echo aurora
#

btw we're aware of issues happening with search in battle mode

ocean vortex
#

And 2.5Pro on aistudio is worse than o3 due to the lack of tools, so...

#

the best for you is o3. Limits are a financial problem I suppose, not really a "what is the best model for x" type of thing anymore

#

With API or Pro plan there are no limits

glass arch
#

I really have a few things I need:

  • upload code to the interface (on chatgpt, I make a zip of my code and upload so that it can get context (my code is not so big))
  • upload pdfs that the model can read
  • the model can run searches to get better information
  • the model is smart
ocean vortex
#

otherwise use aistudio when you need to work with big context

#

if you want to get a rough estimate of how many tokens your files are, you can simply just attach them in a chat in aistudio as well

#

they will tokenize and you will see your usage in token counter

#

Tokenizer for Gemini is different, but for a rough estimate this is good enough. Not gonna differ by miles from OpenAI's
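If you'd rather get a ballpark without uploading anything, here is a rough Python sketch. It assumes about 4 characters per token, which is only a common rule of thumb for English text and not the real ratio of Gemini's or OpenAI's tokenizer; `estimate_tokens` and `fits_in_context` are made-up helper names, and code or non-English text can land well off this estimate.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count.

    chars_per_token=4.0 is an assumed rule of thumb for English text,
    not the ratio of any specific tokenizer.
    """
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(texts, context_limit: int) -> bool:
    """Check whether the combined estimate stays within a given token limit."""
    return sum(estimate_tokens(t) for t in texts) <= context_limit


# Example: estimate the combined size of two small source files.
files = ["def add(a, b):\n    return a + b\n", "print(add(2, 3))\n"]
print(sum(estimate_tokens(f) for f in files))
```

For anything close to a model's limit, the token counter in aistudio mentioned above is the better check, since it uses an actual tokenizer.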

glass arch
#

ok well is gemini better or worse than o3 or o4-mini

ocean vortex
#

if we exclude all the tools and just look at the model itself, o3 is equivalent to 2.5Pro but still different strengths (2.5Pro better for visuals, o3 better for accuracy and tedious tasks). o4-mini-high is good but I wouldn't classify it as equivalent personally. It's too small for that and lacks context awareness and some fundamental logic concepts, also lacks world knowledge (look at SimpleQA score)

whole wagon
#

For a few tasks I even found grok the best kek

#

Like puzzles type stuff

dense moon
dense moon
#

Would be cool, feels unfair otherwise if both models did a great job 🙂

echo aurora
#

Yeah that makes sense, I'll be sure to pass it along

dense moon
#

And (maybe) last question: will it be possible to see which model got the most votes / how many votes each model got? Would be interesting to see how others might have different preferences or priorities.

echo aurora
dense moon
#

Oh nice, I think I will write one or two suggestions for the website there as well 🙂

echo aurora
#

Please do!

tiny temple
#

h

sullen quest
#

nightride-on-v2? I haven't seen that one before

harsh flume
#

Anyone here using Gemini CLI?

#

Im wondering how it compares to claude code, ive seen some conflicting opinions online but its overall hard to trust reddit comments on anything AI

blazing rune
#

I think Claude 4 Sonnet with Claude Code got about 18000 or something

echo aurora
tiny palmBOT
sweet tinsel
#

Deep Research Arena when?

#

I'm totally not obsessed with DRs.

harsh flume
sweet tinsel
#

A lot.

#

Just look at my doc.

harsh flume
#

Seems like grok just took theirs out

sweet tinsel
harsh flume
#

I will. Deep-Research is prob 80% of my AI use cases

#

Thanks

tiny palmBOT
harsh flume
#

Your doc is mostly a request and the shared results or am I missing something?

sweet tinsel
#

Im currently working on the ranking.

harsh flume
#

I imagine you tested a lot more than this so I'd be interested to know how do you feel the models you tested fare

harsh flume
#

For my use case I think gemini and OAI are pretty much on par with each other and often have diff strong/weak points. But I haven't tested anything besides them cause I was under the impression all other models were lagging behind

sweet tinsel
#

Generally the top players are pretty good, with Gemini 2.5 Pro and ChatGPT, Manus AI is also very good. Generally the other AI Agents are pretty decent and some DR services like ithy and perplexity are just not worth the time.

#

But for most stuff I prefer the ChatGPT DR.

harsh flume
#

I hear ya. I use a custom sys instruction instance on AI Studio to generate the DR prompts and that has been the only thing I can confirm gemini is superior for. It made a big difference on research output when I changed from developing the prompts with chatgpt to gemini

#

For some reason it seems like gemini's chain of thought generates better prompts to instruct LLMs, that has been true with all my prompt engineering across different use cases

sweet tinsel
#

I got some better results with Perplexity Pro searches by using the DR plan from Gemini.

harsh flume
#

My theory is that whatever dataset Google trained their model on had a deeper 'understanding' of the underlying working mechanisms of AI

sweet tinsel
#

Gemini is really pretty good with prompting.

surreal creek
#

has anyone else been having difficulty with only getting one response when prompting? I’m specifically having the issue on mobile, but I’ve also bumped into it on desktop as well.

tiny palmBOT
tiny palmBOT
languid crescent
#

what the heck does lmarena have vid gen now?

languid crescent