#general | Arena | Page 33

ocean vortex Apr 29, 2025, 7:17 PM

#

it's also true for most of the other benchmarks. If you focus on just 1 you degrade something else

torn mantle Apr 29, 2025, 7:18 PM

#

xdd

small haven Apr 29, 2025, 7:39 PM

#

day 13 of no o3 pro

#

day 1 of not caring about grok 3.5

#

big brain is 404

#

im retired

#

not interested in introducing virus to my system

#

ok buddy

torn mantle Apr 29, 2025, 8:09 PM

#

agree

eager crater Apr 29, 2025, 8:14 PM

#

Is it normal that the Q3_K_M quant is bigger than the Q3_K_XL quant? it's unsloth's

balmy mist Apr 29, 2025, 8:30 PM

#

torn mantle agree

wym virus?

torn mantle Apr 29, 2025, 9:11 PM

#

balmy mist wym virus?

malware

#

seems like i cant get enough of o3

#

this thing is so crazy

small haven Apr 29, 2025, 9:23 PM

#

the good type of virus

#

especially when it has these in the traces, gets me hard

torn mantle Apr 29, 2025, 9:24 PM

#

nah

#

its so smart

#

it totally feels like a different type model from all other LLMs

hollow ocean Apr 29, 2025, 9:25 PM

#

O3 pro only for rich p2w users

torn mantle Apr 29, 2025, 9:25 PM

#

hollow ocean O3 pro only for rich p2w users

yea

#

😦

#

sad

small haven Apr 29, 2025, 9:25 PM

#

thank you o3

golden ocean Apr 29, 2025, 9:26 PM

#

life without free sota models access method

small haven Apr 29, 2025, 9:28 PM

#

dw o3 pro is gonna be free in a year

#

but by then o5 pro is sota

torn mantle Apr 29, 2025, 9:31 PM

#

actually r2 may have some interesting outputs

#

hopefully its not like qwen where they only focus on fixing old bugs

small haven Apr 29, 2025, 9:56 PM

#

huh what is this arrow.. doesn't trigger anything

harsh flume Apr 29, 2025, 9:58 PM

#

Been away, how's new qwen performance?

#

Any sota vibe in it?

#

Or just good enough for a Chinese model?

keen beacon Apr 29, 2025, 9:58 PM

#

it isnt sota but its solid

harsh flume Apr 29, 2025, 9:59 PM

#

Not even top 3 contender for arena?

#

Any particular strong use case for it?

keen beacon Apr 29, 2025, 10:00 PM

#

harsh flume Not even top 3 contender for arena?

probably not

keen beacon Apr 29, 2025, 10:00 PM

#

harsh flume Any particular strong use case for it?

running locally i guess

small haven Apr 29, 2025, 10:04 PM

#

yes the media is forcing o3 pro lesssgoo

torn mantle Apr 29, 2025, 10:05 PM

#

we dont care anymore about anthropic

#

like seriously their models are unusable with the rate limits and they nerfed it

#

writes short/concise outputs all the time = so lazy

keen beacon Apr 29, 2025, 10:07 PM

#

they didn't nerf anything

#

it has been like that since day 1

torn mantle Apr 29, 2025, 10:08 PM

#

seems like you werent using it since day 1

keen beacon Apr 29, 2025, 10:08 PM

#

facepalm

#

i have been using it both on claude.ai and via their api since day 1

torn mantle Apr 29, 2025, 10:08 PM

#

apparently not

small haven Apr 29, 2025, 10:08 PM

#

3.7 is literally unusable

keen beacon Apr 29, 2025, 10:08 PM

#

it also seems fine to me 🤷

#

that goes for claude 2, claude instant, claude 3 opus, claude 3.5 sonnet, claude 3.7 sonnet

#

i use a lot of claude (still)

#

yeah i've been interested in them since the first ever claude

#

when the only way you could access their models publicly was with a crappy slack bot

#

claude 1 and claude instant were great creatively tbh

#

they just lost coherency over long context 😔

torn mantle Apr 29, 2025, 10:09 PM

#

day 1 :

keen beacon Apr 29, 2025, 10:09 PM

#

first company to introduce 100k context windows i think

torn mantle Apr 29, 2025, 10:10 PM

#

months later :

keen beacon Apr 29, 2025, 10:10 PM

#

it could be: system prompt changes/randomness

torn mantle Apr 29, 2025, 10:10 PM

#

same prompt

keen beacon Apr 29, 2025, 10:10 PM

#

if you're serious about getting the most out of these models you wouldn't use them over their consumer UIs

#

use the api dawg

torn mantle Apr 29, 2025, 10:10 PM

#

im not rich like you

#

that will cost me a lot

keen beacon Apr 29, 2025, 10:10 PM

#

ppl still complain about them nerfing it on the api anyway

#

people are schizo

torn mantle Apr 29, 2025, 10:11 PM

#

are you calling me schizo?

keen beacon Apr 29, 2025, 10:11 PM

#

no lmao

#

if you looked at the context

#

he said on the api

torn mantle Apr 29, 2025, 10:11 PM

#

@keen beacon is he calling me schizo?

#

????????????

keen beacon Apr 29, 2025, 10:11 PM

#

there have been so many instances in the ai industry of people claiming degradation when there have been no changes and it's like one giant placebo

#

no, he was talking about people who still complain claude is being nerfed on the api

torn mantle Apr 29, 2025, 10:12 PM

#

im jk

#

but in all seriousness i do think they played with the system prompt ( default one ) to output shorter results

small haven Apr 29, 2025, 10:12 PM

#

guys o3 marginal cost is zero

torn mantle Apr 29, 2025, 10:12 PM

#

small haven guys o3 marginal cost is zero

o3 this o3 that

#

you probably also dream of o3 pro

#

enough

#

use gemini 2.5 pro a bit

#

or qwen 3

small haven Apr 29, 2025, 10:13 PM

#

torn mantle you probably also dream of o3 pro

https://tenor.com/view/cuando-me-pagan-en-la-escuela-gif-13826598667065540218

keen fulcrum Apr 29, 2025, 10:15 PM

#

Poll: Did Qwen3 live up to what you expected?

Winning answer (option 2): Met expectations, 11 of 18 votes

ocean vortex Apr 29, 2025, 10:16 PM

#

no pro is more like you having 10 clones of yourself trying to do the same thing individually and then all of those are merged into 1 solution lol
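
The "10 clones merged into 1 solution" picture above can be sketched as plain majority voting over independent attempts. This is only the analogy written as code, not a confirmed description of how any pro model works; `solve_once`, its 60% success rate, and the answer labels are hypothetical stand-ins.

```python
import random
from collections import Counter


def solve_once(rng):
    """Hypothetical single attempt: correct 60% of the time (made-up rate)."""
    if rng.random() < 0.6:
        return "right"
    # Wrong attempts disagree with each other, so they rarely outvote "right".
    return rng.choice(["wrong_a", "wrong_b", "wrong_c"])


def merged_answer(n_clones, rng):
    """Run n independent attempts and keep the most common answer."""
    votes = Counter(solve_once(rng) for _ in range(n_clones))
    return votes.most_common(1)[0][0]


def accuracy(n_clones, trials=2000, seed=0):
    """Fraction of trials in which the merged answer is the correct one."""
    rng = random.Random(seed)
    hits = sum(merged_answer(n_clones, rng) == "right" for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("1 clone  :", accuracy(1))
    print("10 clones:", accuracy(10))
```

Under these assumptions the merged answer is right noticeably more often than a single attempt, which is the whole point of the analogy.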

small haven Apr 29, 2025, 10:18 PM

#

^

#

also big brain never existed

#

but internally

#

any plus have o1 pro yet?

ocean vortex Apr 29, 2025, 11:12 PM

#

small haven any plus have o1 pro yet?

Interesting. The only way this makes sense in my head is that they are disguising upcoming o3-pro like that. They used to do this with alpha releases (like gpt4-all-tools) where select users that normally use lower tier models (free gpt3.5) got access to upcoming best models/features.

#

Probably for data collection among users not terribly familiar with the models most closely related to what they are releasing

brittle tiger Apr 29, 2025, 11:18 PM

#

nvmd im idiot. i have pro plan

full kite Apr 29, 2025, 11:22 PM

#

Guys how much better is qwen 3 compared to Gemini 2.5 pro? 3-4%?

#

Help oOoOO

keen beacon Apr 29, 2025, 11:22 PM

#

it isnt better

full kite Apr 29, 2025, 11:22 PM

#

Ok ty

keen beacon Apr 29, 2025, 11:22 PM

#

its good if u want something to run locally

full kite Apr 29, 2025, 11:25 PM

#

keen beacon its good if u want something to run locally

How do you run 700 billions model

keen beacon Apr 29, 2025, 11:26 PM

#

there is no 700b model lol

#

in qwen 3

full kite Apr 29, 2025, 11:26 PM

#

Ok how much

keen beacon Apr 29, 2025, 11:26 PM

#

idk there are a lot, just look it up

full kite Apr 29, 2025, 11:26 PM

#

😡

full kite Apr 29, 2025, 11:45 PM

#

That not an answer

small haven Apr 30, 2025, 12:12 AM

#

ocean vortex Interesting. The only way this makes sense in my head is that they are disguisin...

well the o1 pro right now is still as 💩 , so dont think theyre disguising it as o3 pro

#

thank u o1 pro...

small haven Apr 30, 2025, 12:44 AM

#

ok... thanks for the info. keeps using o3

brittle tiger Apr 30, 2025, 12:44 AM

#

omg this o3-pro on api is insane. hope you get to experience it someday @small haven

small haven Apr 30, 2025, 12:45 AM

#

?

#

small haven Apr 30, 2025, 1:09 AM

#

ok buddy show me ur ass on oai playground, u wont

#

o3 keeps passing these tests, its like orgasmic

#

this button looks new in dr

#

screenshot or troll

balmy mist Apr 30, 2025, 1:22 AM

#

screenshare please

small haven Apr 30, 2025, 1:31 AM

#

buddy had to disable the notifs haha

#

$cr33n$h0t pl0x

raven void Apr 30, 2025, 2:17 AM

#

Grok 3.5 will win lmarena in May

small haven Apr 30, 2025, 2:28 AM

#

life is good with operator

high egret Apr 30, 2025, 4:00 AM

#

what do you think will the biggest model of qwen3 score on lmarena leaderboard ?

#

btw once the model is added, generally, how much time does it take to have enough votes to be on leaderboard ?

elder rapids Apr 30, 2025, 4:04 AM

#

raven void Grok 3.5 will win lmarena in May

could ye

leaden palm Apr 30, 2025, 4:05 AM

#

high egret what do you think will the biggest model of qwen3 score on lmarena leaderboard ?

if it has the same gain as gemini 2 -> 2.5 flash (adding thinking and updating the tuning) it would rank just slightly above 0324

high egret Apr 30, 2025, 4:20 AM

#

high egret btw once the model is added, generally, how much time does it take to have enough ...

if someone has an approximate answer for this question

small haven Apr 30, 2025, 4:24 AM

#

qwen 3 is not gonna be #1 lol

hardy pecan Apr 30, 2025, 4:31 AM

#

high egret if someone has an approximate answer for this question

mr gambler, ill tell you, qwen has no chance of topping the leaderboard for your betting site, sorry

high egret Apr 30, 2025, 4:31 AM

#

haha thx guys

#

i was just curious

high egret Apr 30, 2025, 4:32 AM

#

hardy pecan mr gambler, ill tell you, qwen has no chance of topping the leaderboard for your...

but not for topping the leaderboard, honestly I would really be impressed if it topped 2.5 flash

#

and honestly the smaller models like 4B are really much more impressive than the 235B one

hardy pecan Apr 30, 2025, 4:33 AM

#

It'll do fine, but it wont stand out

high egret Apr 30, 2025, 4:33 AM

#

yeah, I feel that alibaba is better with smaller model

hollow ocean Apr 30, 2025, 4:33 AM

#

small haven

Why you use deep research to answer that

high egret Apr 30, 2025, 4:33 AM

#

like QwQ was already more impressive than qwen 2.5 max

high egret Apr 30, 2025, 4:34 AM

#

hollow ocean Why you use deep research to answer that

He wants a full analysis of the whole web to be 100% sure

#

@hardy pecan and I was meaning the delay to appear on leaderboard, not the score, that was the question I wanted an answer

keen beacon Apr 30, 2025, 4:35 AM

#

Qwen 3 235b has a surprisingly high arena hard score, I wouldn't be surprised if it performs well. But it won't top the leaderboard

hardy pecan Apr 30, 2025, 4:35 AM

#

Right, the models generally need above 2000 votes to appear on the leaderboard

#

Generally this takes a while, say 5 or 6 days depending on activity and hype generated around the released model

hollow ocean Apr 30, 2025, 4:36 AM

#

high egret He wants a full analysis of the whole web to be 100% sure

Regular o3 would be enough

high egret Apr 30, 2025, 4:37 AM

#

hollow ocean Regular o3 would be enough

Just kidding I think he made a misclick

high egret Apr 30, 2025, 4:37 AM

#

hardy pecan Generally this takes a while, say 5 or 6 days depending on activity and hype gener...

ok thx

#

The more I use gemini 2.5 pro the more I realise it's crazy

#

it gave me a full linear algebra course based on the video of my uni course

#

Capture_decran_2025-04-30_a_06.49.16.png

#

it's so perfect

small haven Apr 30, 2025, 4:50 AM

#

hollow ocean Why you use deep research to answer that

Its o

#

1 pro

hollow ocean Apr 30, 2025, 4:50 AM

#

small haven Its o

Ohh why use it over o3

small haven Apr 30, 2025, 4:51 AM

#

Its just troll idc lol

hollow ocean Apr 30, 2025, 4:57 AM

#

high egret it gave me a full linear algebra course based on the video of my uni course

O3 could do the same

keen beacon Apr 30, 2025, 5:03 AM

#

O3 can take in videos?

leaden palm Apr 30, 2025, 5:04 AM

#

hollow ocean Why you use deep research to answer that

that is not deep research

leaden palm Apr 30, 2025, 5:04 AM

#

high egret He wants a full analysis of the whole web to be 100% sure

that is not deep research

#

it is o1 pro / o3 pro

high egret Apr 30, 2025, 5:05 AM

#

my bad

leaden palm Apr 30, 2025, 5:05 AM

#

small haven well the o1 pro right now is still as 💩 , so dont think theyre disguising it as...

the reason he brought it up was pro discussion

raven void Apr 30, 2025, 6:04 AM

#

Xiaomi dropped a pretty good 7b model

fleet lintel Apr 30, 2025, 6:07 AM

#

high egret The more I use gemini 2.5 pro the more I realise it's crazy

100%. I just don't see a reason anymore to have chatgpt subscription.

calm sequoia Apr 30, 2025, 7:05 AM

#

Just tested the new Qwen. At math it's at a similar level to DeepSeek R1 but much worse than Claude. At world knowledge and logic it is at a similar level to models such as GPT-4o or Grok. Not SOTA.

calm sequoia Apr 30, 2025, 7:41 AM

#

Is Gemini 2.5 PRO via the Windsurf chat the same as via, e.g. AIstudio?

fleet lintel Apr 30, 2025, 7:43 AM

#

Poll: Do you run models locally?

Winning answer (option 2): No, 7 of 10 votes

small haven Apr 30, 2025, 7:45 AM

#

Poll: are u excited for o3 pro

Winning answer (option 2): shut up, 10 of 13 votes

alpine coral Apr 30, 2025, 8:07 AM

#

included it for this question set

#

generally struggles.. tho with thinking enabled, it does quite well in one run

keen fulcrum Apr 30, 2025, 8:25 AM

#

alpine coral included it for this question set

So o1 still great

alpine coral Apr 30, 2025, 8:25 AM

#

on that particular question set, yeah it does really well

#

but on others, it isn't at the top (though always up there)

#

didn't realise they reverted to an older 4o amid this personality backlash ha

We have rolled back last week’s GPT‑4o update in ChatGPT so people are now using an earlier version with more balanced behavior. The update we removed was overly flattering or agreeable—often described as sycophantic.
https://openai.com/index/sycophancy-in-gpt-4o/

still mason Apr 30, 2025, 8:39 AM

#

Guys, IDK all the technical stuff. I'm just curious how different the rating/ranking methodologies are across all the different AI rating/ranking sites?

I'm seeing quite a significant difference in ranking, is that expected?

calm sequoia Apr 30, 2025, 8:42 AM

#

https://arxiv.org/pdf/2504.20879

#

Some beef for lmarena

#

I've said it before - it would be more beneficial for lmarena to rebrand as an RLHF site instead of a benchmark

alpine coral Apr 30, 2025, 8:45 AM

#

yeah and it's increasingly felt like it's become testing ground for the big labs.. soo many anon models over like the past 12 months.. i'm not surprised to see a paper make (and though i haven't read it, try to prove) these kinds of claims

calm sequoia Apr 30, 2025, 8:47 AM

#

TBH this critique is not fair. Nobody wants to test the open source models as they were much worse for most of the time. The LMarena also has to think about user side (voters)

#

This is very very legit

alpine coral Apr 30, 2025, 8:48 AM

#

yeah totally

#

we'd see farrrrrr fewer anon models if they all had to be published / added to leaderboard

#

would be just like the good ol im-a-good-chatgpt days again ha

#

rather than anon model spam

calm sequoia Apr 30, 2025, 8:50 AM

#

It's the same as everywhere - stuff gets ruined where the money comes in.

#

They have a short time window to do something before a new lmarena appears (real open source)

alpine coral Apr 30, 2025, 8:51 AM

#

calm sequoia It's the same as everywhere - stuff gets ruined where the money comes in.

yeah that's also why i don't really like (though i totally understand from the perspective of the project creators) the 'graduation' from an academic project to a commercial (hedge fund backed) startup

calm sequoia Apr 30, 2025, 8:55 AM

#

Sadly this is just some lengthy corporate speech https://x.com/lmarena_ai/status/1917492084359192890

lmarena.ai (formerly lmsys.org) (@lmarena_ai) on X

Thanks for the authors’ feedback, we’re always looking to improve the platform!

If a model does well on LMArena, it means that our community likes it! Yes, pre-release testing helps model providers identify which variant our community likes best. But this doesn’t mean the

#

Nobody would blame lmarena if they had a separate benchmark for anonymous models + data gathering. The data might even be better if no betting were involved 😄

alpine coral Apr 30, 2025, 9:11 AM

#

hmm between the paper and their response there's a bit to unpack and digest ha

teal mantle Apr 30, 2025, 9:39 AM

#

alpine coral didn't realise they reverted to an older 4o amid this personality backlash ha > ...

I ask sensitive question so it gets less sycophantic

#

But then I am thinking should I get supergrok if 3.5 is good next week

#

And shame to know o3 access is only through API (impossible for me) or their subscription, I am contemplating

brittle tiger Apr 30, 2025, 9:53 AM

#

calm sequoia This is very very legit

This is dumb. Don't we want to see models being tested? Everyone instantly saw through the meta BS

calm sequoia Apr 30, 2025, 9:54 AM

#

No we didn't. It wasn't more surprising than Grok getting top 1.

brittle tiger Apr 30, 2025, 9:54 AM

#

Grok is actually good tho

#

We wouldn't know nightwhisper exists if that rule were in place. I don't like it

calm sequoia Apr 30, 2025, 9:55 AM

#

Right here we go, maverick is good for some who like emojis and yapping

#

The issue is that I want to know if I'm testing model for the benchmark or for RLHF in advance

alpine coral Apr 30, 2025, 10:19 AM

#

alpine coral hmm between the paper and their response there's a bit to unpack and digest ha

the perfect job for an LLM aha.. i gave the paper along with LMArena's response and referenced blog posts to gem pro 2.5, sonn-3.7 and o3.. and asked them to evaluate who 'has the upper ground'. they all come to the same conclusions (which kinda confirm my surface-level take): the paper is empirical, LMArena's response is largely rhetorical and doesn't address the core claims made by the paper.. iow the paper wins

#

Paper: Transparently states scrape period, method, code reference, full tables and simulations; backs each claim with quantitative evidence and stress-tests core BT assumptions.

LMArena response: Offers principles (human preference matters, open-source ethos) and restates existing policy. Provides no new datasets, no re-analysis of the authors’ code, and no point-by-point numerical rebuttal. The single concrete figure (41 % open-source battles) is aggregate, methodology-free, and does not address provider-level skew.

#

The rebuttal is largely rhetorical; it neither falsifies the documented selection-bias mechanism nor supplies counter-analysis of sampling or deprecation effects. By contrast, the paper demonstrates those effects with real and simulated data and exposes structural deviations from BT assumptions essential for a fair leaderboard.

Bottom Line
On the evidence presently available, the authors of ‘The Leaderboard Illusion’ hold the upper ground. Their critique is data-driven and methodologically explicit, while LMArena’s response asserts intentions and policies without supplying empirically grounded refutations.
o3

#

fwiw here's sonnet and 2.5's takes

keen beacon Apr 30, 2025, 10:30 AM

#

calm sequoia The issue is that I want to know if I'm testing model for the benchmark or for R...

Personally I'm fine with it if the anon models are high quality

#

It makes the arena interesting

calm sequoia Apr 30, 2025, 10:35 AM

#

keen beacon Personally I'm fine with it if the anon models are high quality

Wasn't this only the case with nightwhisper? AFAIK others were just meh

#

Or dragontail

#

Can't remember

keen beacon Apr 30, 2025, 10:35 AM

#

calm sequoia Wasn't this only the case with nightwhisper? AFAIK others were just meh

Gemini 2.5 pro as nebula, pre release Gemini 2 I think too

calm sequoia Apr 30, 2025, 10:36 AM

#

Yeah, it was peak lmarena. But they were released. We are talking about unreleased models - ones that appear in the arena as anonymous and never see daylight after.

keen beacon Apr 30, 2025, 10:36 AM

#

calm sequoia Yeah, it was peak lmarena. But they were released. We are talking about unrealea...

There is also eureka chatbot 🤣 from google

#

That was like that

keen beacon Apr 30, 2025, 10:37 AM

#

calm sequoia Yeah, it was peak lmarena. But they were released. We are talking about unrealea...

when they were testing prerelease Gemini 2, some of those iterations weren't released I think

#

Same goes for now

#

Google makes the arena interesting

#

The only abject abuse of anon models I feel are the meta models

alpine coral Apr 30, 2025, 10:40 AM

#

keen beacon Personally I'm fine with it if the anon models are high quality

yeah in principle i am too (it's fun ha) - but quality is kinda key.. we remember the handful of notable/interesting ones, nebula, sus-columns etc, but the arena is swamped by them now.. and they're not all high quality

calm sequoia Apr 30, 2025, 10:40 AM

#

Hmm after prolonged thought I can agree that it's better to have them in lmarena. The results still should be publicised. Maybe while keeping the anonymous name.

alpine coral Apr 30, 2025, 10:40 AM

#

like :

🤖 400+ models on the leaderboards!
📊 300+ pre-release evaluations!

#

how many of those 300 are actually unveiled? i feel it's like ten maybe

keen beacon Apr 30, 2025, 10:46 AM

#

calm sequoia Hmm after prolonged thought I can agree that it's better to have them in lmarena...

I don't think having some of them on the leaderboard helps really tbh. Depending on when they remove the model, the number might not be valid, with an extremely wide CI
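
To put rough numbers on the CI point: a back-of-the-envelope sketch of how the confidence interval narrows as votes accumulate. It assumes a single head-to-head win rate with a normal approximation, converted to an Elo-style gap; the actual leaderboard fits a Bradley-Terry model, so treat this as an illustration only. The 55% win rate and the vote counts are made-up inputs.

```python
import math


def winrate_ci(wins, votes, z=1.96):
    """Normal-approximation 95% interval for a win rate."""
    p = wins / votes
    half = z * math.sqrt(p * (1 - p) / votes)
    return max(0.0, p - half), min(1.0, p + half)


def elo_gap(p):
    """Elo-style rating gap implied by win probability p (0 < p < 1)."""
    return 400 * math.log10(p / (1 - p))


def ci_width_elo(votes, p=0.55):
    """Width of the interval in Elo points for a model winning a share p."""
    lo, hi = winrate_ci(round(p * votes), votes)
    return elo_gap(hi) - elo_gap(lo)


if __name__ == "__main__":
    for votes in (200, 2000, 20000):
        print(f"{votes:>6} votes -> CI width ~ {ci_width_elo(votes):.0f} Elo points")
```

A model pulled after a few hundred votes would carry an interval several times wider than one that stayed long enough to collect thousands.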

#

It makes the leaderboard unnecessarily confusing. I feel like they've got a good thing going esp with google anon models

#

Just enforce a little more criteria and vetting on the models

alpine coral Apr 30, 2025, 10:48 AM

#

i mean we're the ones providing the data.. currently it just goes to google, oai, meta and xai.. i agree the LB would become a mess if they were all added

#

but still, they could do disclosures.. like a ballpark elo or whatever. . which lab the model was from etc

keen beacon Apr 30, 2025, 10:51 AM

#

alpine coral but still, they could do disclosures.. like a ballpark elo or whatever. . which ...

It's already obvious most of the time. Most of them train their company in. It removes some of the mystique of the arena if they do that

alpine coral Apr 30, 2025, 10:54 AM

#

yeah i mean after they're pulled from the arena ofc.. i dunno like a monthly roundup

#

would reduce the incentive for these companies to just spam it with slight iterations of the same thing

versed flare Apr 30, 2025, 10:56 AM

#

hey, how do the qwen3 32b and 30a3b models perform?

#

idk if its the right place to ask but i cant find much stuff yet

alpine coral Apr 30, 2025, 10:57 AM

#

there's been a bit of discussion here - maybe scroll up and have a read

versed flare Apr 30, 2025, 10:57 AM

#

can you send a message link to lead me there?

alpine coral Apr 30, 2025, 10:57 AM

#

but i think the general take is quite decent but nothing groundbreaking (in terms of pure performance)

alpine coral Apr 30, 2025, 10:59 AM

#

versed flare can you send a message link to lead me there?

somewhere around here #general message

final flame Apr 30, 2025, 11:27 AM

#

guys

#

will there be a leaderboard update today?

drifting thorn Apr 30, 2025, 11:42 AM

#

Oh Cohere had just made a multimodal RAG

full kite Apr 30, 2025, 12:09 PM

#

how many grok exist

#

not better than me

#

nuhuh

golden ocean Apr 30, 2025, 12:18 PM

#

gemini is so horrible at listening

full kite Apr 30, 2025, 12:18 PM

#

he can't touch you

full kite Apr 30, 2025, 12:19 PM

#

golden ocean gemini is so horrible at listening

listening what

golden ocean Apr 30, 2025, 12:19 PM

#

full kite listening what

following instructions

full kite Apr 30, 2025, 12:20 PM

#

golden ocean following instructions

you need to talk to him nicely

#

say please daddyyyy

#

I'm controlled by elon

balmy mist Apr 30, 2025, 12:26 PM

#

swear

#

send details

cedar tide Apr 30, 2025, 12:33 PM

#

Waiting for o3 and o4 mini on the webdev arena

full kite Apr 30, 2025, 12:33 PM

#

cedar tide Waiting for o3 and o4 mini on the webdev arena

what is o3

#

didn't that already come out like 1 month ago with the ghibli images

cedar tide Apr 30, 2025, 12:35 PM

#

full kite what is o3

Serious ?

full kite Apr 30, 2025, 12:35 PM

#

cedar tide Serious ?

what

cedar tide Apr 30, 2025, 12:36 PM

#

full kite what

You're on the lm arena discord but you don't follow the releases of the best llms?

#

Do you know the reasoning models?

full kite Apr 30, 2025, 12:37 PM

#

i use gemini on google ai studio

#

for homework

cedar tide Apr 30, 2025, 12:38 PM

#

@full kite https://openai.com/index/introducing-o3-and-o4-mini/

full kite Apr 30, 2025, 12:38 PM

#

bro is it the same sht that ppl used to make gibli images

#

this

#

cedar tide Apr 30, 2025, 12:39 PM

#

full kite bro is it the same sht that ppl used to make gibli images

Nope

full kite Apr 30, 2025, 12:39 PM

#

ok

cedar tide Apr 30, 2025, 12:39 PM

#

This GPT 1 image or GPT 4o

full kite Apr 30, 2025, 12:39 PM

#

so it's better than 3o

cedar tide Apr 30, 2025, 12:39 PM

#

full kite so it's better than 3o

3o dont exist

full kite Apr 30, 2025, 12:39 PM

#

why is there 2 4o

cedar tide Apr 30, 2025, 12:40 PM

#

Its o3

full kite Apr 30, 2025, 12:40 PM

#

that doesnt make any sense

cedar tide Apr 30, 2025, 12:40 PM

#

And its with gemini 2.5 pro the best llm ever

cedar tide Apr 30, 2025, 12:40 PM

#

full kite why is there 2 4o

What ?

#

there is only one 4o

full kite Apr 30, 2025, 12:43 PM

#

Ngga what

#

I'm blocking you

golden ocean Apr 30, 2025, 12:45 PM

#

full kite say please daddyyyy

It only listened when I told it that I was going to invade europe

full kite Apr 30, 2025, 12:50 PM

#

golden ocean It only listened when I told it that I was going to invade europe

Uhhh

#

Are you

tall summit Apr 30, 2025, 12:50 PM

#

cedar tide there is only one 4o

4o vs o4 dummy

full kite Apr 30, 2025, 12:51 PM

#

tall summit 4o vs o4 dummy

Yeah he's fkn rtrd

drifting thorn Apr 30, 2025, 12:52 PM

#

Poll: Which LLM has the most performance gain with good prompts?

Winning answer (option 1): 2.5 Pro, 10 of 19 votes

tall summit Apr 30, 2025, 12:57 PM

#

full kite Yeah he's fkn rtrd

there are two 4o because openai can't name things

balmy mist Apr 30, 2025, 1:01 PM

#

but 4o and o4 are completely different lol

#

where is o3 pro tho

#

@deep adder you sad about gpt4 leaving?

full kite Apr 30, 2025, 1:04 PM

#

What's gpt4

fast osprey Apr 30, 2025, 1:06 PM

#

Hey Hi
I am from BharatGen, we have built our own LLM with Indic nuance
May I know how we can integrate our model API into chatbot arena to get a better comparison

balmy mist Apr 30, 2025, 1:07 PM

#

full kite What's gpt4

GPT-4 is OpenAI's most advanced AI language model. It's way smarter, more creative, and better at reasoning than previous versions. It can handle much longer text and can even understand images (not just text). You find it powering things like ChatGPT Plus and Microsoft Copilot - Gemini 2.5 pro 😂

#

wait why dont we have an ai bot in here for questions and searching?

#

that would be so clutch

#

it should be

#

its just different syntax at the end of the day and llms are beasts at languages

#

if anything give it docs and info on any language you want and guide it

#

what you been making?

golden ocean Apr 30, 2025, 1:11 PM

#

balmy mist @deep adder you sad about gpt4 leaving?

yes

balmy mist Apr 30, 2025, 1:12 PM

#

Whatever happened to that world-building prompt project you had?

full kite Apr 30, 2025, 1:19 PM

#

balmy mist GPT-4 is OpenAI's most advanced AI language model. It's way smarter, more creati...

gemini isnt powered by chatgpt

#

get a life

#

Thats what I do

#

Why is he saying random sht

#

wdym

bleak silo Apr 30, 2025, 1:20 PM

#

"undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired" true or false?
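
One way to see why the quoted claim would matter if true: a toy simulation of best-of-N selection. Every private variant is given the same true skill, each measured score carries some noise, and only the best-scoring variant gets published. The skill value, noise level, and variant counts are invented for illustration and are not taken from the paper.

```python
import random
import statistics


def published_score(n_variants, rng, true_skill=1200.0, noise_sd=15.0):
    """Score of the best-looking variant among n equally skilled ones."""
    return max(rng.gauss(true_skill, noise_sd) for _ in range(n_variants))


def mean_published(n_variants, trials=5000, seed=0):
    """Average published score when a lab privately tests n variants."""
    rng = random.Random(seed)
    return statistics.mean(
        published_score(n_variants, rng) for _ in range(trials)
    )


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(f"{n:>2} variants tested -> mean published score {mean_published(n):.1f}")
```

Under these assumptions the published number drifts upward with the number of private variants even though no variant is actually better, which is the selection effect the quote describes.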

full kite Apr 30, 2025, 1:21 PM

#

what programming lil bro

#

Im not a weird geek

#

bro ,whar

#

how

#

can i make minecraft

#

what is pythin

#

ngga what is all of that

#

im using google to make my school homework

#

a what??

#

It already did that

#

like 10 years ago

#

bro google was created in 2008 or sm sht

#

tf are you talking about

#

google made chatgpt

#

no

drifting thorn Apr 30, 2025, 1:29 PM

#

I'm excited to be able to use Grok 3.5 on Poe

#

Like if I can

#

Don't take it as an info bruh

#

I'm just guessing

calm sequoia Apr 30, 2025, 1:31 PM

#

But andrey is wrong on Openrouter. It's not better than lmarena. Interestingly it's skewed the other way around. Maybe an aggregate of openrouter and lmarena is the answer

barren prairie Apr 30, 2025, 1:34 PM

#

Who is folsom-exp-v1? He answered 95% of my questions correctly

#

O4mini failed it

#

Small mistake but interesting

full kite Apr 30, 2025, 1:37 PM

#

google did it

#

chatgpt

#

why

#

elon musk

#

what is grok

#

what does grok mean

#

google created elon musk

cedar tide Apr 30, 2025, 1:46 PM

#

barren prairie Who is folsom-exp-v1? He answered 95% of my questions correctly

Amazon

alpine coral Apr 30, 2025, 1:47 PM

#

i haven't found it particularly impressive

#

what are these questions ha

full kite Apr 30, 2025, 1:48 PM

#

what does neologism mean

alpine coral Apr 30, 2025, 1:48 PM

#

everyone and their datacentre, yes ha

full kite Apr 30, 2025, 1:48 PM

#

what does that mean

alpine coral Apr 30, 2025, 1:48 PM

#

literally just that

full kite Apr 30, 2025, 1:49 PM

#

neologism is a old word

cedar tide Apr 30, 2025, 1:49 PM

#

barren prairie Who is folsom-exp-v1? He answered 95% of my questions correctly

Is thinking ? @barren prairie

alpine coral Apr 30, 2025, 1:49 PM

#

a new word (or neologism) is by definition invented recently ha

full kite Apr 30, 2025, 1:49 PM

#

what is neologism

alpine coral Apr 30, 2025, 1:50 PM

#

it's my favourite llm

#

no neologism

#

it's very powerful

full kite Apr 30, 2025, 1:51 PM

#

alpine coral a new word (or neologism) is by definition invented recently ha

I thought it was a political thing

alpine coral Apr 30, 2025, 1:52 PM

#

lol jfc you weren't trolling.. um 'neologism' is just a fancy word for describing a new word

#

that's it

torn mantle Apr 30, 2025, 1:52 PM

#

barren prairie Who is folsom-exp-v1? He answered 95% of my questions correctly

nah this model is not good

#

i had it many times

#

when

full kite Apr 30, 2025, 1:55 PM

#

o

#

k

#

like skibidi is a neologism

#

is grok 3.5 better than chatgpt

#

or google

#

no

#

Prove it

drifting thorn Apr 30, 2025, 2:17 PM

#

Wow I can merely recognise Winter's face

sonic tendon Apr 30, 2025, 2:24 PM

#

deezpeek

balmy mist Apr 30, 2025, 2:32 PM

#

grok 3.5 is out?

balmy mist Apr 30, 2025, 2:33 PM

#

full kite im using google to make my school homework

in college?

#

lets do a petition to keep gpt4 @deep adder

#

why did they have to do a big thing about it leaving

#

they should have just removed it, now im sad

#

ahh

#

nvm i dont care

#

when is grok 3.5 coming out

full kite Apr 30, 2025, 2:35 PM

#

balmy mist in college?

No

balmy mist Apr 30, 2025, 2:36 PM

#

i gotta say if that comes out fast then XAI really cooking with their releases

balmy mist Apr 30, 2025, 2:36 PM

#

full kite No

oh high school, i see

full kite Apr 30, 2025, 2:36 PM

#

It's hard

#

No it will not be better

#

It will need a subscription

#

Gemini is free

balmy mist Apr 30, 2025, 2:38 PM

#

idk i dont think you can say o3 is better than 2.5 pro

#

they are better in different ways

#

like o3 has its own lane

#

but for a general model, 2.5 is the best

#

like plug and play

full kite Apr 30, 2025, 2:53 PM

#

Google is better cause they are rich

torn mantle Apr 30, 2025, 2:59 PM

#

Grok 3.5 will probably be added friday

#

On lmarena

full kite Apr 30, 2025, 3:08 PM

#

Chatgpt 4 already exist

calm sequoia Apr 30, 2025, 3:25 PM

#

Poll: Your opinion of AGI vs Political take

Winning answer: 4 of 14 total votes

balmy mist Apr 30, 2025, 3:50 PM

#

https://x.com/OpenAI/status/1917607109853872183

OpenAI (@OpenAI) on X

AMA at 9:30am PT with Head of Model Behavior @joannejang to talk all things ChatGPT's personality.

https://t.co/thPLEMMADY

#

bruhh they playing

#

release o3 pro man

ocean vortex Apr 30, 2025, 4:07 PM

#

disagree. OG gpt4 was undertrained and they scaled gpt4.5 the same way, so it's undertrained relative to its size too. Which means it's mostly wasteful with no return

tall summit Apr 30, 2025, 4:07 PM

#

terrence tao is gonna be all over the deepseek prover

#

I didn't see this at the time but if you are "very curious" I'm not concerned about AGI and a fencesitter about trump

ocean vortex Apr 30, 2025, 4:09 PM

#

you could train og gpt4 sized model on lots of data and make it better than gpt4.5 in every way for sure

upper wolf Apr 30, 2025, 4:09 PM

#

does anyone else hate how the new chatgpts glaze you every 5 seconds and add 20 emojis to every message? I’m at work bro… like can you take yourself seriously please???

tall summit Apr 30, 2025, 4:09 PM

#

upper wolf does anyone else hate how the new chatgpts glaze you every 5 seconds and add 20 ...

yes this is a well documented problem, system prompts mitigate it a bit

ocean vortex Apr 30, 2025, 4:11 PM

#

upper wolf does anyone else hate how the new chatgpts glaze you every 5 seconds and add 20 ...

they admitted glazing to be a problem, so it's gonna be fixed eventually...

upper wolf Apr 30, 2025, 4:11 PM

#

you know what? you…….. you’re right

ocean vortex Apr 30, 2025, 4:11 PM

#

Maybe they trained it on 2.5 outputs or smth lmao

#

gemini was always the worst at this

upper wolf Apr 30, 2025, 4:11 PM

#

2.5 doesn’t yap nearly as much

ocean vortex Apr 30, 2025, 4:12 PM

#

though 2.5 is better than their earlier models

upper wolf Apr 30, 2025, 4:12 PM

#

ocean vortex gemini was always the worst at this

crazy how it’s basically switched in the past month

#

now gemini is the one talking like a nih document

#

Gemini keeps it real

ocean vortex Apr 30, 2025, 4:13 PM

#

upper wolf 2.5 doesn’t yap nearly as much

I was referring to "glazing". Phrases like "you are right to call me out!", "you are absolutely correct to question this!", even when you are writing nonsense deliberately

#

this was gemini thing

upper wolf Apr 30, 2025, 4:14 PM

#

on the same token, tho, i feel like 2.5 corrects you a lot more if there’s something wrong with your approach or you make an inaccurate statement. gpt-4o kinda just plays along.

tall summit Apr 30, 2025, 4:16 PM

#

upper wolf 2.5 doesn’t yap nearly as much

yes it does

ocean vortex Apr 30, 2025, 4:17 PM

#

ok so this is how it should be (4.5):

#

gemini:

#

4o/4.1:

alpine coral Apr 30, 2025, 4:18 PM

#

upper wolf does anyone else hate how the new chatgpts glaze you every 5 seconds and add 20 ...

oai said they reverted back to older 4o on chatgpt cause of these problems https://openai.com/index/sycophancy-in-gpt-4o/

ocean vortex Apr 30, 2025, 4:18 PM

#

so both 4o and gemini are bad as far as I'm concerned lol

full kite Apr 30, 2025, 4:27 PM

#

I'm on top

thorny drum Apr 30, 2025, 4:51 PM

#

its basically unusable imo

#

the model is very smart but will never correct you (which turns into a game of how can you prompt the model in a way that introduces no bias) and then has some BS yap score that limits its responses

ocean vortex Apr 30, 2025, 4:57 PM

#

thorny drum the model is very smart but will never correct you (which turns into a game of h...

yap score thing only applies to o3 and o4-mini. Some other models have something loosely similar but it's less specific, there's no specific score number for them

brittle tiger Apr 30, 2025, 5:21 PM

#

This dude is vibe coding final boss. Insane he did this in canvas

https://x.com/algo_diver/status/1917584289887322415

chansung (@algo_diver) on X

Experiment #20 of using Gemini 2.5 pro Canvas
@GoogleDeepMind @GeminiApp

thorny drum Apr 30, 2025, 5:24 PM

#

ocean vortex yap score thing only applies to o3 and o4-mini. Some other models have something...

yeah i meant o3 and o4mini

#

for me, when i look at the thinking it almost always is just discussing 'how can i say this using as few tokens as possible'

#

especially for coding tasks

ocean vortex Apr 30, 2025, 5:26 PM

#

thorny drum yeah i meant o3 and o4mini

they kinda fcked people over as soon as they introduced "developer message" tbh. It's now clear that they did this so they could take away system role from you and only use it themselves lol

#

developer message ≠ system message

#

it's weaker with less weight

balmy mist Apr 30, 2025, 5:40 PM

#

full kite I'm on top

u good lol?

small haven Apr 30, 2025, 5:49 PM

#

officially two weeks of no o3 pro

balmy mist Apr 30, 2025, 6:19 PM

#

i thought it was 3?

ocean vortex Apr 30, 2025, 6:41 PM

#

small haven officially two weeks of no o3 pro

why are you so obsessed with it. No one else is doing it because it doesn't make sense price wise to use. It's still the same model just used differently. You could probably code smth similar yourself with the existing o3 api.

small haven Apr 30, 2025, 6:41 PM

#

balmy mist i thought it was 3?

3 weeks?

small haven Apr 30, 2025, 6:41 PM

#

ocean vortex why are you so obsessed with it. No one else is doing it because it doesn't make...

ngmi

upper wolf Apr 30, 2025, 6:48 PM

#

small haven officially two weeks of no o3 pro

Bro is feinin for o3 😂 😂 😂

small haven Apr 30, 2025, 6:49 PM

#

upper wolf Bro is feinin for o3 😂 😂 😂

so is everyone else

#

if u had access to o3 pro and o3, im prtty sure, ur gonna spam o3 pro come on now

brittle tiger Apr 30, 2025, 6:54 PM

#

https://x.com/bindureddy/status/1917627887332778057

Bindu Reddy (@bindureddy) on X

LiveBench AI - Coding Category Re-Haul

We have changed the coding category questions to be way more complicated. This change was made to reflect real-life coding scenarios.

You will see that Sonnet scores much higher, and OpenAI's models do very well.

In real life, Sonnet 3.7

#

We changed questions to reflect real world use cases
Results don't reflect real world usage at all

zinc ore Apr 30, 2025, 7:05 PM

#

Yeah I find livebench highly suspect

#

They keep completely changing their question sets then saying they are better. Also lol @ 2.5 not even being there.

torn mantle Apr 30, 2025, 7:08 PM

#

brittle tiger https://x.com/bindureddy/status/1917627887332778057

good

#

yea we should orient more to real world scenarios

brittle tiger Apr 30, 2025, 7:11 PM

#

https://x.com/legit_api/status/1917657429334188272?t=5M__AnkMFjsBpk_901MBMw&s=19

ʟᴇɢɪᴛ (@legit_api) on X

New Gemini Ultra confirmed!

keen beacon Apr 30, 2025, 7:11 PM

#

WTF

brittle tiger Apr 30, 2025, 7:11 PM

#

Ultra coming at I/O

keen beacon Apr 30, 2025, 7:12 PM

#

brittle tiger https://x.com/legit_api/status/1917657429334188272?t=5M__AnkMFjsBpk_901MBMw&s=19

is this

#

fr

torn mantle Apr 30, 2025, 7:15 PM

#

brittle tiger https://x.com/legit_api/status/1917657429334188272?t=5M__AnkMFjsBpk_901MBMw&s=19

wtf

small haven Apr 30, 2025, 7:15 PM

#

wut

torn mantle Apr 30, 2025, 7:17 PM

#

please lets not make sunstrike = gemini 2.5 ultra

brittle tiger Apr 30, 2025, 7:19 PM

#

Really doubt. I don't think they tried ultra internally until very recently

small haven Apr 30, 2025, 7:25 PM

#

i dont like saying this, but o1 pro > o3 still

keen fulcrum Apr 30, 2025, 7:26 PM

#

Qwen mcp coming

keen beacon Apr 30, 2025, 7:31 PM

#

brittle tiger https://x.com/legit_api/status/1917657429334188272?t=5M__AnkMFjsBpk_901MBMw&s=19

woah

brittle tiger Apr 30, 2025, 7:32 PM

#

Ultra isn't too surprising. They wouldn't be tweeting those cryptic brick wall emojis to troll

keen beacon Apr 30, 2025, 7:34 PM

#

i think io this year is gonna be one of the best they've ever done

#

gemini 2.5 ultra, imagen 4, upgrades to ai in google search, lots of updates to gemini integration in google products

#

possibly veo 3 or a preview of it as well

ember rapids Apr 30, 2025, 7:36 PM

#

Someone under nda said google has 2 novel things to show off that r pretty insane

#

I don’t doubt it they’re cooking

zinc ore Apr 30, 2025, 7:39 PM

#

ember rapids Someone under nda said google has 2 novel things to show off that r pretty insan...

At io or just sometime this year?

ocean vortex Apr 30, 2025, 7:40 PM

#

zinc ore That keep completely changing their question sets then saying they are better. ...

to be fair though it's becoming a bit more challenging to assess these models lately. Reasoning added a lot more variables, so if you focus on something too specific you can quite easily end up with one model unreasonably high and the other very low. Also I do think the progress OpenAI made from o1 to o3 is somewhat overblown tbh

brittle tiger Apr 30, 2025, 7:42 PM

#

ember rapids Someone under nda said google has 2 novel things to show off that r pretty insan...

Sounds like they don't understand ndas 😂

#

That's cool info tho

ocean vortex Apr 30, 2025, 7:45 PM

#

OpenAI made ReAct agents with o3/o4, that part I think is more impressive

small haven Apr 30, 2025, 7:45 PM

#

brittle tiger Ultra isn't too surprising. They wouldn't be tweeting those cryptic brick wall e...

i think ultra gonna def be matching o3, i remember gemini 1.0 ultra was more reliable than pro back then, pretraining matter

sage raptor Apr 30, 2025, 7:46 PM

#

small haven i think ultra gonna def be matching o3, i remember gemini 1.0 ultra was more rel...

2.5 pro already is matching o3

keen beacon Apr 30, 2025, 7:46 PM

#

2.5 ultra = agi

#

trust

sage raptor Apr 30, 2025, 7:47 PM

#

what are the chances for 2.5 ultra this week ?

bright kayak Apr 30, 2025, 7:47 PM

#

i don't think there's going to be an ultra

small haven Apr 30, 2025, 7:47 PM

#

sage raptor 2.5 pro already is matching o3

thats cope, its not

bright kayak Apr 30, 2025, 7:47 PM

#

if they're already making them free on ai studio w/o api then they probably won't spend a ton on another much more expensive model

ocean vortex Apr 30, 2025, 7:47 PM

#

sage raptor 2.5 pro already is matching o3

yeah I agree. It's sometimes too concise and taking shortcuts for the worse, but that gets compensated with the fact that it's a more capable base model

#

more capacity and some things it can still do the same or even better while generating less

balmy mist Apr 30, 2025, 7:49 PM

#

brittle tiger https://x.com/legit_api/status/1917657429334188272?t=5M__AnkMFjsBpk_901MBMw&s=19

omggggg plss

#

gg openai

ember rapids Apr 30, 2025, 7:51 PM

#

brittle tiger Sounds like they don't understand ndas 😂

They were very vague

ocean vortex Apr 30, 2025, 7:51 PM

#

sage raptor what are the chances for 2.5 ultra this week ?

I doubt it would make much sense, but we are yet to see a truly enormous reasoning model so who knows... Google is certainly in position to do so with TPUs. They could just charge like Sonnet API prices for a model that is much much much bigger. Instead of it being free

brittle tiger Apr 30, 2025, 7:52 PM

#

sage raptor what are the chances for 2.5 ultra this week ?

If this code string means Gemini 2.5 Ultra, they are definitely saving it for I/O on the 20th. would be shocked if it came out before, only chance would be anon in arena and i doubt that as well

ember rapids Apr 30, 2025, 7:53 PM

#

Definitely for I/O

ocean vortex Apr 30, 2025, 7:53 PM

#

brittle tiger If this code string means Gemini 2.5 Ultra, they are definitely saving it for I/...

if their track record is anything to go by... They will announce on I/O something which came live 2 weeks earlier lmao

brittle tiger Apr 30, 2025, 7:54 PM

#

My only question is do they announce or release at I/O. Mainly a question of getting it ready. I think just announcement would be deflating so betting on release

ocean vortex Apr 30, 2025, 7:54 PM

#

or like a nothing burger update to context size / pricing / availability and whatnot

small haven Apr 30, 2025, 7:55 PM

#

cursor just soft launched cloud agents

ocean vortex Apr 30, 2025, 7:56 PM

#

brittle tiger My only question is do they announce or release at I/O. Mainly a question of get...

what did they announce at I/O 2024? It was nothing special iirc

#

their biggest updates were before and after

small haven Apr 30, 2025, 7:56 PM

#

ocean vortex Apr 30, 2025, 7:57 PM

#

context size 1M->2M I think it was

#

so maybe they will release 2.5 pro for the 3rd time at I/O. Was free but now 50% cheaper for those who insist on paying 😂

balmy mist Apr 30, 2025, 8:00 PM

#

so ultra before r2?

ocean vortex Apr 30, 2025, 8:01 PM

#

balmy mist so ultra before r2?

flash 2.5 lite

torn mantle Apr 30, 2025, 8:06 PM

#

balmy mist so ultra before r2?

around that time probably

#

I/O event is 22 May if im not wrong

brittle tiger Apr 30, 2025, 8:10 PM

#

476 hrs til ultra

#

keen beacon Apr 30, 2025, 8:14 PM

#

is dayhush

#

still

#

in

#

webdev arena

zinc ore Apr 30, 2025, 8:17 PM

#

Do y'all think one of the models we've been seeing in arena is ultra? Or that it's not been tested on there yet?

brittle tiger Apr 30, 2025, 8:22 PM

#

zinc ore Do y'all think one of the models we've been seeing in arena is ultra? Or that i...

Definitely not

#

https://x.com/lmarena_ai/status/1917668731481907527

lmarena.ai (formerly lmsys.org) (@lmarena_ai) on X

We thank the authors' for their feedback. However, there are a number of factual errors and misleading statements in this writeup:

Regarding the statement that some model providers are not treated fairly:
- This is not true. Given our capacity, we have always tried to honor all

keen beacon Apr 30, 2025, 8:34 PM

#

zinc ore Do y'all think one of the models we've been seeing in arena is ultra? Or that i...

its nightwhisper

raven void Apr 30, 2025, 8:44 PM

#

nightwhisper won't be as intelligent as o3, bearish on google

brittle tiger Apr 30, 2025, 8:45 PM

#

nw was in arena way before the gdm hypeposting began so I don't think so. and it wouldn't make sense to put ultra in arena for a couple days then release it a month and a half later

keen beacon Apr 30, 2025, 8:46 PM

#

So what could

#

Nightwhisper be

sonic tendon Apr 30, 2025, 8:48 PM

#

lol

sonic tendon Apr 30, 2025, 8:49 PM

#

raven void nightwhisper won't be as intelligent as o3, bearish on google

gemini ultra tho

kind cloud Apr 30, 2025, 8:56 PM

#

Google test models seem to be gone.
Surely they will start a new testing season.

sonic tendon Apr 30, 2025, 8:57 PM

#

meta system prompt leak

oblique flint Apr 30, 2025, 9:03 PM

#

Ultra is just a subscription huh

keen beacon Apr 30, 2025, 9:04 PM

#

any info on qwen 3? Just learned about it

#

is it gonna be added to lmarena

fleet lintel Apr 30, 2025, 9:05 PM

#

keen beacon any info on qwen 3? Just learned about it

launched yesterday.. nothing interesting in my view. mostly ignore

small haven Apr 30, 2025, 9:07 PM

#

oblique flint Ultra is just a subscription huh

yea it didnt say "gemini 2.5 ultra" lol

torn mantle Apr 30, 2025, 9:07 PM

#

oblique flint Ultra is just a subscription huh

xd

#

as i thought

cedar tide Apr 30, 2025, 9:15 PM

#

keen beacon is dayhush

Yes

Screenshot_2025-04-30-14-27-00-041_com.android.chrome-edit.jpg

keen beacon Apr 30, 2025, 9:16 PM

#

cedar tide Yes

w

keen beacon Apr 30, 2025, 9:16 PM

#

keen beacon any info on qwen 3? Just learned about it

its shset

#

shet

cedar tide Apr 30, 2025, 9:21 PM

#

kind cloud Google test models seem to be gone. Surely they will start a new testing season.

No

Screenshot_2025-04-30-23-20-50-017_com.android.chrome-edit.jpg

hardy pecan Apr 30, 2025, 9:26 PM

#

grats on the 5 dollars

raven void Apr 30, 2025, 9:28 PM

#

#

Google is cooked

sonic tendon Apr 30, 2025, 9:29 PM

#

keen beacon is it gonna be added to lmarena

I've gotten it a couple times. sucks imo

#

congrats

raven void Apr 30, 2025, 9:30 PM

#

worse than 4.1 mini and claude 3.5 rip 🙏🏻

sonic tendon Apr 30, 2025, 9:32 PM

#

np :p

keen beacon Apr 30, 2025, 9:35 PM

#

I underestimated

#

Yeah

#

I underestimated o3

#

it’s so good

#

Now

#

I can’t use gemini

#

Cuz its so asss

#

it types weird

#

idk how to explain it

#

o3 follows my instruction better

#

Yeah

#

What’s the difference between o4 mini and o3

raven void Apr 30, 2025, 9:38 PM

#

O3 architect and debugger, Gemini coder

keen beacon Apr 30, 2025, 9:41 PM

#

yea makes sense I saw a difference but I couldn’t tell what it exactly was

keen beacon Apr 30, 2025, 9:41 PM

#

raven void O3 architect and debugger, Gemini coder

yea

zinc ore Apr 30, 2025, 9:51 PM

#

Don't rely on a single benchmark site to assess models

#

Livebench is pretty limited in what it tests anyway

#

Very narrow set of questions they ask, it absolutely doesn't represent the breadth of programming tasks

#

And they keep changing the questions entirely every couple weeks and changing the scores

elder rapids Apr 30, 2025, 9:57 PM

#

Gemini ultra?

elder rapids Apr 30, 2025, 9:57 PM

#

keen beacon I can’t use gemini

are you using the app version

elder rapids Apr 30, 2025, 9:58 PM

#

elder rapids Gemini ultra?

I suspect it might not be an ultra model but an ultra tier ?

#

or both

elder rapids Apr 30, 2025, 9:59 PM

#

raven void worse than 4.1 mini and claude 3.5 rip 🙏🏻

in "coding" lmao, that's what you filtered + that's why it's so "low"

#

while it's arguably the most versatile coder there

keen beacon Apr 30, 2025, 10:00 PM

#

elder rapids are you using the app version

no

#

ai studio

elder rapids Apr 30, 2025, 10:00 PM

#

then cap asf

#

o3 can't follow instructions for nothing

#

ye

#

it has the nerfed models

#

wait you didn't know that?

#

this whole time?

#

yo no wonder you're getting a different experience

#

because the models are worse, but ye ofc integration is important

#

you just won't have access to the actual model everyone has been talking about and praising

#

system instructions/guardrails

#

and prob less thinking time

#

on your data?

#

the app does too

#

unless you have advanced I'm p sure

#

but not even that's absolutely confirmed

#

afaik

torn mantle Apr 30, 2025, 10:22 PM

#

nah

#

Best overall is gemini 2.5 pro

#

And if you disagree then you are biased

#

Or you work for anthropic

#

How much did they pay you?

#

I dont need that

golden ocean Apr 30, 2025, 10:24 PM

#

torn mantle How much did they pay you?

it just is

#

for coding at least

wind torrent Apr 30, 2025, 10:33 PM

#

LOL

small haven Apr 30, 2025, 10:43 PM

#

wait a min, use o3 to fix a test, dont pass, use o1 pro, dont pass, use o4 mini high, passes... ok

torn mantle Apr 30, 2025, 10:53 PM

#

small haven wait a min, use o3 to fix a test, dont pass, use o1 pro, dont pass, use o4 mini ...

ok

small haven Apr 30, 2025, 10:57 PM

#

torn mantle ok

use o3 pro, passes.. ok

#

mail for the gemini boys:

leaden palm Apr 30, 2025, 11:04 PM

#

small haven mail for the gemini boys:

if that's cursor isn't that like $0.3/request

#

while gemini is literally free on vertex/ai studio

small haven Apr 30, 2025, 11:05 PM

#

leaden palm if that's cursor isn't that like $0.3/request

p2w

leaden palm Apr 30, 2025, 11:05 PM

#

lmao

#

and then it asks you

small haven Apr 30, 2025, 11:06 PM

#

im not sure but i think every tool call is $0.3 if using o3

golden ocean Apr 30, 2025, 11:20 PM

#

the more i use gemini the more stupid things i notice it doing

bright kayak Apr 30, 2025, 11:24 PM

#

???

elder rapids Apr 30, 2025, 11:27 PM

#

this is a bad prompt altogether

golden ocean Apr 30, 2025, 11:28 PM

#

bright kayak ???

#

reference

#

bright kayak Apr 30, 2025, 11:29 PM

#

golden ocean reference

that's what happened to me too

#

the second response was even longer

elder rapids Apr 30, 2025, 11:30 PM

#

oh wow 2.5 pro does very well on this

#

even though the prompt is bad

#

🤷

#

send yours and I'll send mine

kind cloud Apr 30, 2025, 11:32 PM

#

https://fxtwitter.com/legit_api/status/1916855709167235542?t=bGetlgzwVzEW7_wh9ieoiA&s=19

ʟᴇɢɪᴛ (@legit_api)

Discovery Tool server is now open

Quoting ʟᴇɢɪᴛ (@legit_api)
launching tomorrow in Beta
Dev Mode is just placeholder server name

**💬 5 🔁 7 ❤️ 77 👁️ 10.8K **

golden ocean Apr 30, 2025, 11:32 PM

#

elder rapids Apr 30, 2025, 11:33 PM

#

oh nvm I misunderstood

#

but that's not how it works

#

gpt zero is outdated

#

its weighted too much on punctuation

#

that's o3s attempt?

#

holy that's way worse

#

wtf

#

that's cap lmao

#

look at mine

#

base

#

1

#

📎 25phumanattempt.txt

golden ocean Apr 30, 2025, 11:36 PM

#

bright kayak the second response was even longer

#

elder rapids Apr 30, 2025, 11:37 PM

#

not sure tbh, its structure is too academic

elder rapids Apr 30, 2025, 11:38 PM

#

elder rapids

damn this one is the best one by far

#

cinematic asf

#

that's necessarily true, but thats not what I'm saying

raven void Apr 30, 2025, 11:58 PM

#

I feel like aistudio adds delay to responses at large context length

balmy mist May 1, 2025, 12:19 AM

#

kind cloud https://fxtwitter.com/legit_api/status/1916855709167235542?t=bGetlgzwVzEW7_wh9ie...

Anyone buying this lol?

#

Also any news?

torn mantle May 1, 2025, 12:20 AM

#

kind cloud https://fxtwitter.com/legit_api/status/1916855709167235542?t=bGetlgzwVzEW7_wh9ie...

What's this

torn mantle May 1, 2025, 12:21 AM

#

small haven mail for the gemini boys:

Funny that this happened to me with gemini instead of o3

balmy mist May 1, 2025, 12:30 AM

#

send screen shot

#

wait nvm

#

idk why i fall for your nonsense lol

small haven May 1, 2025, 12:40 AM

#

torn mantle Funny that thus happened to me with gemini instead of o3

ok buddy

#

yall can stay on gemini, while i fast track with o3

small haven May 1, 2025, 1:16 AM

#

who wanna pay $200 for unlimited gemini 2.5 pro ?

small haven May 1, 2025, 1:36 AM

#

craig ur gonna stash up $200 for gemini ultra, arent u

worthy thunder May 1, 2025, 1:41 AM

#

I ran OpenAI-MRCR against Qwen3 (working on 8B and 14B). The smaller models (<8B) will NOT be included because their max context lengths are less than 128k. Took a while to run due to rate limits initially. (https://x.com/DillonUzar/status/1917754730857504966)

I used the default settings for each model (fyi - thinking mode is enabled by default).

AUC @ 128k Score:

Llama 4 Maverick: 52.7%
GPT-4.1 Nano: 42.6%
Qwen3-30B-A3B: 39.1%
Llama 4 Scout: 38.1%
Qwen3-32B: 36.5%
Qwen3-235B-A22B: 29.6%
Qwen-Turbo: 24.5%

See more: https://contextarena.ai/

Qwen3-235B-A22B consistently performed better at lower context lengths, but rapidly decreased closer to its limit, which was different compared to Qwen3-30B-A3B. Will eventually dive deeper into why and examine the results closer.

Note: There's been some subtle updates to the website over the last few days, will cover that later. I have a couple of big changes pending.

Enjoy.
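For readers wondering what an "AUC @ 128k" number could mean: the message above does not define it, so the following is only a minimal sketch under an assumption, namely that the score is the trapezoidal area under the accuracy-vs-context-length curve up to the limit, normalized by the covered range. The function name `auc_at_limit` and the sample points are hypothetical and are not taken from contextarena.ai.

```python
def auc_at_limit(points, limit):
    """Normalized trapezoidal area under an accuracy curve.

    points: iterable of (context_tokens, accuracy) pairs, accuracy in [0, 1].
    limit:  largest context length to include (e.g. 128_000).
    ASSUMPTION: this definition is illustrative; the site's real formula
    is not stated in the chat.
    """
    pts = sorted(p for p in points if p[0] <= limit)
    if len(pts) < 2:
        raise ValueError("need at least two bins at or below the limit")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Trapezoid between two neighbouring context bins.
        area += (x1 - x0) * (y0 + y1) / 2.0
    # Divide by the covered range so the result stays in [0, 1].
    return area / (pts[-1][0] - pts[0][0])


# Hypothetical bins (not real benchmark results).
sample = [(8_000, 0.9), (32_000, 0.7), (128_000, 0.3)]
score = auc_at_limit(sample, 128_000)
```

Under this definition a model that holds up at short contexts but collapses near its limit loses most of its area in the widest bins, which is at least consistent with the Qwen3-235B-A22B observation above.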

#

No problem. Lmk if there are any other models you want me to try besides o1-pro and other higher-priced models. Limited budget atm :/

small haven May 1, 2025, 1:43 AM

#

omg some common sense !

worthy thunder May 1, 2025, 1:44 AM

#

I have it on my Todo to look much closer at those results (and a handful of a few others). I imagine its a quirk.

#

I've burned past both respectively 😛

small haven May 1, 2025, 1:45 AM

#

wen o1 pro my guy

brisk turret May 1, 2025, 1:46 AM

#

Anyone seen that paper about how lmarena is junk?

small haven May 1, 2025, 1:46 AM

#

in terms of mrcr?

brisk turret May 1, 2025, 1:46 AM

#

https://arxiv.org/abs/2504.20879

arXiv.org

The Leaderboard Illusion

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted...

small haven May 1, 2025, 1:46 AM

#

huh

worthy thunder May 1, 2025, 1:46 AM

#

small haven wen o1 pro my guy

I might run a smaller sample size for o1-pro to get an idea of how it might perform, but not a full sample (~$38k+).

small haven May 1, 2025, 1:46 AM

#

thats why we gotta test it out

small haven May 1, 2025, 1:46 AM

#

worthy thunder I might run a smaller sample size for o1-pro to get an idea of how it might perf...

appreciate it

brisk turret May 1, 2025, 1:47 AM

#

It's a junk leaderboard, lmarena.ai would rather gain favour from Google and meta than have a fair leaderboard

#

Corruption is the term for that kind of behaviour

worthy thunder May 1, 2025, 1:48 AM

#

GPT-4.1's family also doesn't drop like that. I imagine it's cause the o-series wasn't focused on long context. Should be interesting to see what it might be like when OpenAI merges the GPT and o series.

alpine coral May 1, 2025, 2:30 AM

#

worthy thunder GPT-4.1's family also doesn't drop like that. I imagine it's cause the o-series ...

seems a bit like that doesn't it. can see the same drop off happens with o4-mini at essentially the same point (i.e. when the context bin hits 64k tokens)

#

also qwen-turbo's recall seems literally unusable for anything close to 1m tokens

#

love your work btw!

worthy thunder May 1, 2025, 2:34 AM

#

alpine coral also qwen-turbo's recall seems literally unusable for anything close to 1m token...

Yeah... Basically after 128k, all it gets right is the key it is told to prepend the answer with (counts for 1-5% of the total score per test if it gets it right) 😅 . You can check out its answers by hovering over a score cell and clicking the button that shows up.

worthy thunder May 1, 2025, 2:35 AM

#

alpine coral love your work btw!

Thanks! Really working hard to make this a nice site. More to come soon. 🙂
And as always, open to suggestions about nearly anything to improve it.

alpine coral May 1, 2025, 2:37 AM

#

worthy thunder Yeah... Basically after 128k, all it gets right is the key it is told to prepend...

aha i see. and nice! didn't realise could explore the answers - nice touch

worthy thunder May 1, 2025, 2:37 AM

#

Note: I still grade a model that stops a response due to hitting its context limit, which generally only happens for the last bin it has results for (usually only a small handful of tests, which may impact the score by 1-5%). More noticeable for reasoning models (due to their reasoning tokens adding more to the output, pushing it past its limits). An inherited issue from OpenAI-MRCR, which I have some ideas to improve upon.

worthy thunder May 1, 2025, 2:38 AM

#

alpine coral aha i see. and nice! didn't realise could explore the answers - nice touch

One of the upcoming changes (likely out this weekend) is a built-in diff viewer for the results (when you open the popup). Pretty bare bones atm.

alpine coral May 1, 2025, 2:39 AM

#

aha yeah that would be the way to go for sure:)

alpine coral May 1, 2025, 2:42 AM

#

worthy thunder Note: I still grade a model that stops a response due to hitting its context lim...

is that less of an issue for oai's reasoning models? like given the reasoning tokens for those models aren't added to the actual output nor accumulate in the context, but are discarded?

#

ah actually nvm.. ig most of the context window is actually being occupied by the input, come to think of it

worthy thunder May 1, 2025, 2:46 AM

#

alpine coral ah actually nvm.. ig most of the context window is actually being occupied by th...

bingo 😉

ocean panther May 1, 2025, 2:50 AM

#

Hi, I'm new to LMArena. I tried lmarena.ai, entered a prompt, got responses from model A and B. Now I'd like to vote which one I like better - except for the life of me, I can't find a button to that end.

How does one vote on this site?

worthy thunder May 1, 2025, 2:50 AM

#

alpine coral is that less of an issue for oai's reasoning models? like given the reasoning to...

Grok 3 Mini (High) is when I noticed that issue. I spent 3+ days fighting with it above 32k input tokens (especially for 8needle tests), where it would happily output 30k+ reasoning tokens (which made 64k+ bins near impossible to run). I don't grade outputs that return 0 output tokens (reasoning tokens don't get counted in that check). I almost gave up on testing that variant.
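The grading rules mentioned in the last few messages (the answer must start with a required key, a bare key still earns a few percent, and responses with 0 output tokens are not graded) can be sketched roughly as below. This is an assumption-laden illustration, not the site's actual grader: the names `should_grade` and `grade` are hypothetical, and using `difflib.SequenceMatcher` with the key left inside the compared strings is a guess chosen to match the "1-5%" remark.

```python
from difflib import SequenceMatcher


def should_grade(output_tokens: int) -> bool:
    # Per the chat: responses with 0 output tokens are skipped.
    # Reasoning tokens are assumed NOT to be part of this count.
    return output_tokens > 0


def grade(response: str, answer: str, key: str) -> float:
    """Similarity score in [0, 1], gated on the required prefix key.

    ASSUMPTION: the key stays in both strings, so a response that
    contains only the key still gets a small non-zero score.
    """
    if not response.startswith(key):
        return 0.0
    return SequenceMatcher(None, response, answer).ratio()


# Hypothetical example: the model emits the key and nothing else.
key = "KEY"
answer = key + "x" * 97
key_only_score = grade(key, answer, key)  # small, but above zero
```

With a design like this, a model that has lost the needle entirely can still show a low non-zero score in the longest bins, so those cells are better read as "key only" than as partial recall.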

worthy thunder May 1, 2025, 2:55 AM

#

ocean panther Hi, I'm new to LMArena. I tried lmarena.ai, entered a prompt, got responses from...

The buttons should show up below the chat window (and only after both models finished responding). Are you able to share a pic?

ocean panther May 1, 2025, 2:55 AM

#

alpine coral May 1, 2025, 2:55 AM

#

worthy thunder Grok 3 Mini (High) is when I noticed that issue. I spent 3+ days fighting with i...

ah yeah i see what you mean - must've been banging your head against the wall ha

ocean panther May 1, 2025, 2:57 AM

#

I'm using Firefox on Linux

alpine coral May 1, 2025, 2:57 AM

#

ocean panther

that's odd.. perhaps a bug 🤷‍♂️ maybe try just typing "thanks" and hitting send - perhaps the buttons will appear then

#

but yeah otherwise maybe just refresh (or try a different browser), which would mean losing the interaction.. (hopefully wasn't a really good response from one that you're eager to know what model it is)

ocean panther May 1, 2025, 2:59 AM

#

alpine coral that's odd.. perhaps a bug 🤷‍♂️ maybe try just typing "thanks" and hitting send...

Now it worked:

#

I would have given it to model B, but now its response to my "Thanks" annoyed me. So model A it is now.

#

Oh, now it tells me the models' real names, too. After voting.

alpine coral May 1, 2025, 3:04 AM

#

ocean panther I would have given it to model B, but now its response to my "Thanks" annoyed me...

ha a real time demonstration of how human preferences do not neatly map onto raw quality

ocean panther May 1, 2025, 3:06 AM

#

alpine coral ha a real time demonstration of how human preferences do not neatly map onto raw...

One of the models (Model A) was gemini flash preview or something like that; the other was called "claybrook". Now Google doesn't show anything under that name - is it safe to assume that it's an undercover model being tested by some large provider prior to release?

alpine coral May 1, 2025, 3:07 AM

#

your assumption is spot on

#

(I believe claybrook is a model from google, just fwiw, but yes it's unreleased and being anonymously tested in the arena under that pseudonym)

small haven May 1, 2025, 3:28 AM

#

life must be good when o3 pro comes out

#

real growth gdp will rise by 1% by 3rd quarter

knotty scroll May 1, 2025, 3:28 AM

#

When does the leaderboard usually update

brisk turret May 1, 2025, 3:36 AM

#

When it would please a megacorp

small haven May 1, 2025, 4:09 AM

#

oh my goodness gracious

keen beacon May 1, 2025, 5:06 AM

#

worthy thunder I ran OpenAI-MRCR against Qwen3 (working on 8B and 14B). The smaller models (<8B...

btw where did u get api access to qwen 3? chutes/:free variant on openrouter?

#

great work btw

worthy thunder May 1, 2025, 5:08 AM

#

keen beacon btw where did u get api access to qwen 3? chutes/`:free` variant on openrouter?

Paid variants of the model on OpenRouter. Some providers offer 128k endpoints

#

(primarily NovitaAI)

raven void May 1, 2025, 5:38 AM

#

OpenAI launched 4.1 to tell Google they're cooked

olive mesa May 1, 2025, 5:42 AM

#

I just found out about this, rlly cool

#

It can restart my device... it also knows all the apps i have downloaded

raven void May 1, 2025, 5:46 AM

#

it's quite slow for an assistant tbh

patent bane May 1, 2025, 6:07 AM

#

what app is this?

raven void May 1, 2025, 6:08 AM

#

Cursor (design concept)

raven void May 1, 2025, 6:09 AM

#

raven void OpenAI launched 4.1 to tell Google they're cooked

https://fixvx.com/TheyCallMeCheng/status/1917691791882547258

Cheng - (@TheyCallMeCheng@x.com)

I was wrong, gpt-4.1 is actually the best model to code with rn.
I'm sorry but gemini 2.5 pro is so overrated.

keen beacon May 1, 2025, 6:14 AM

#

that's bs

olive mesa May 1, 2025, 6:15 AM

#

fr.

olive mesa May 1, 2025, 6:18 AM

#

small haven oh my goodness gracious

that's why i use this lol https://www.reddit.com/r/ChatGPT/comments/1k9bxdk/the_prompt_that_makes_chatgpt_go_cold/

cedar tide May 1, 2025, 7:56 AM

#

here is the worst model
https://fixupx.com/AmazonScience/status/1917738059346633132?t=8SsYZx_dk7mkmE26Ee7leQ&s=19

Amazon Science (@AmazonScience)

🚀 Amazon Nova Premier, our most capable teacher model for creating custom distilled models, is now available on Amazon Bedrock!

Built for complex tasks like Retrieval-Augmented Generation (RAG), function calling, and agentic coding, its one-million-token context window enables analysis of large datasets while being the most cost-effective proprietary model in its intelligence tier.

Learn more: amzn.to/4jwmV2l

**🔁 15 ❤️ 52 👁️ 5.0K **

keen beacon May 1, 2025, 7:59 AM

#

cedar tide here is the worst model https://fixupx.com/AmazonScience/status/1917738059346633...

wow it sucks

cedar tide May 1, 2025, 8:00 AM

#

even Maverick and Gemini Flash are better 👀

cedar tide May 1, 2025, 8:01 AM

#

cedar tide here is the worst model https://fixupx.com/AmazonScience/status/1917738059346633...

Cost 2.5$ input
12.5$ output

alpine coral May 1, 2025, 8:01 AM

#

olive mesa that's why i use this lol https://www.reddit.com/r/ChatGPT/comments/1k9bxdk/the_...

oh that's really good - great fun ha

#

(applied on the right..)

keen beacon May 1, 2025, 8:03 AM

#

cedar tide Cost 2.5$ input 12.5$ output

Lmao

alpine coral May 1, 2025, 8:04 AM

#

yeah what wowee lol

cedar tide May 1, 2025, 8:05 AM

#

https://fixupx.com/suriyagnskr/status/1917731754515013772?t=yQeTFTkCfRkl0ZhQJ2k-tQ&s=19

Suriya Gunasekar (@suriyagnskr)

I am thrilled to share our newest Phi models. This time we went all in on post-training to produce Phi-4-reasoning (SFT only) and Phi-4-reasoning-plus (SFT + a touch of RL) — both 14B models that pack a punch in a small size across reasoning and general purpose benchmarks🧵

**💬 3 🔁 19 ❤️ 62 👁️ 15.0K **

keen beacon May 1, 2025, 8:06 AM

#

tbh qwen 3 4b destroys everyone at that size range

#

its crazy

#

i thought phi 4 mini reasoning was crazy but qwen 3 4b is next level good at the size

#

it makes all the other research attempts at around that size a joke

alpine coral May 1, 2025, 8:06 AM

#

cedar tide here is the worst model https://fixupx.com/AmazonScience/status/1917738059346633...

︀︀Built for complex tasks like Retrieval-Augmented Generation (RAG), function calling, and agentic coding...
that's invariably how models that were built with the highest hopes (to be a genuine all-purpose frontier model) but fall well short get described, i feel ha: RAG and agents

keen beacon May 1, 2025, 8:07 AM

#

alpine coral > Built for complex tasks like Retrieval-Augmented Generation (RAG), function ...

they also released simpleqa benchmarks, but with tool use 🤣

alpine coral May 1, 2025, 8:07 AM

#

lmfao

keen beacon May 1, 2025, 8:12 AM

#

its also curious qwen didnt release qwen 3 32b base/235b base. the scores are very impressive for it. probably dont want their competitors having it for now

#

im also assuming this is a ss from their wip technical report. (it was in the qwen 3 blog post)

#

those numbers are old the released phi 4 mini reasoning performs even better

#

Anyway qwen 3 4b destroys everyone in the 4b size range

cedar tide May 1, 2025, 8:13 AM

#

keen beacon those numbers are old the released phi 4 mini reasoning performs even better

Ah sorry, i shared the wrong tweet

#

https://fixupx.com/WeizhuChen/status/1917810053077426570?t=asAf4qz4Lv5QUbkdBIDNxg&s=19

Weizhu Chen (@WeizhuChen)

Glad to see the team used a 3.8B model (Phi-4-mini-reasoning) to achieve 94.6 in Math-500 and 57.5 in AIME-24.
arxiv: arxiv.org/pdf/2504.21233
hf: huggingface.co/microsoft/Phi-4-mini-reasoning
Azure: aka.ms/phi4-mini-reasoning/azure

**💬 1 🔁 3 ❤️ 12 👁️ 500 **

keen beacon May 1, 2025, 8:15 AM

#

meanwhile qwen 3 4b has like 20 points over that in aime lol

cedar tide May 1, 2025, 8:15 AM

#

keen beacon Anyway qwen 3 4b destroys everyone in the 4b size range

Do you have a table comparing qwen with phi?

keen beacon May 1, 2025, 8:16 AM

#

no but i will make one in a bit

cedar tide May 1, 2025, 8:16 AM

#

Thx

glad jackal May 1, 2025, 8:21 AM

#

when will qwen3 appear in lmarena?

alpine coral May 1, 2025, 8:22 AM

#

keen beacon its also curious qwen didnt release qwen 3 32b base/235b base. the scores are ve...

yeah does particularly well in maths and coding i see

torn mantle May 1, 2025, 8:22 AM

#

cedar tide https://fixupx.com/suriyagnskr/status/1917731754515013772?t=yQeTFTkCfRkl0ZhQJ2k-...

Seems solid no?

#

Where can i try it?

keen beacon May 1, 2025, 8:22 AM

#

yeah the phi 4 reasoning plus 14b is seemingly extremely impressive

#

phi 4 mini reasoning not so much

cedar tide May 1, 2025, 8:23 AM

#

torn mantle Where can i try it?

On your pc, or in azure api

glad jackal May 1, 2025, 8:23 AM

#

torn mantle Seems solid no?

i have tried qwen3 for RAG it is actually good

torn mantle May 1, 2025, 8:23 AM

#

cedar tide On your pc, or in azure api

My pc won't run it

alpine coral May 1, 2025, 8:23 AM

#

alpine coral yeah does particularly well in maths and coding i see

which prob explains a bit why i find the models kinda underwhelming.. i don't do any testing for coding - it's all language/comprehension and reasoning with a bit of instruction following

cedar tide May 1, 2025, 8:23 AM

#

torn mantle Seems solid no?

I'm waiting to see against qwen 14b

keen beacon May 1, 2025, 8:23 AM

#

alpine coral which prob explains a bit why i find the models kinda underwhelming.. i don't do...

i think the tuning on the base model (the post trained versions we use) was inadequate

#

the pretrained models seem extremely strong

alpine coral May 1, 2025, 8:23 AM

#

yeah that too

#

but also in terms of the base model

#

just looking at those evals

keen beacon May 1, 2025, 8:24 AM

#

i think this is why they didnt release the frontier ones, the 32b base and 235b moe

#

the base model versions

alpine coral May 1, 2025, 8:24 AM

#

yeah that would make sense

cedar tide May 1, 2025, 8:24 AM

#

cedar tide I'm waiting to see against qwen 14b

the problem is that qwen have not shared a bench of this model

alpine coral May 1, 2025, 8:25 AM

#

i mean unreleased basically = proprietary (unless they change their mind) - can hardly blame them

cedar tide May 1, 2025, 8:25 AM

#

cedar tide the problem is that qwen have not shared a bench of this model

So we will compare with 30b a3b

keen beacon May 1, 2025, 8:25 AM

#

cedar tide So we will compare with 30b a3b

the 14b should be stronger than the 30b a3b

cedar tide May 1, 2025, 8:26 AM

#

keen beacon the 14b should be stronger than the 30b a3b

Yes but we dont have bench

glad jackal May 1, 2025, 8:26 AM

#

keen beacon the 14b should be stronger than the 30b a3b

i dont think it will be stronger as a3b is moe right?

keen beacon May 1, 2025, 8:26 AM

#

glad jackal i dont think it will be stronger as a3b is moe right?

it should be stronger conventionally speaking (the 14b)

#

the 30b moe is extremely interesting though

glad jackal May 1, 2025, 8:27 AM

#

i am kinda new i always thought that the architecture plays a key role

keen beacon May 1, 2025, 8:28 AM

#

glad jackal i am kinda new i always thought that the architecture plays a key role

im not sure about qwen 3 30b moe but a rule of thumb is sqrt(total params * active params) ~= dense model perf. but qwen 3 30b seems to break that
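A quick way to play with that rule of thumb: the sketch below just takes the geometric mean of total and active parameters (in billions). The function name is made up for illustration, and the rule itself is only a rough heuristic, not something the Qwen team has published.

```python
import math


def dense_equivalent_b(total_params_b: float, active_params_b: float) -> float:
    """Rule-of-thumb dense-equivalent size (billions): sqrt(total * active)."""
    return math.sqrt(total_params_b * active_params_b)


# Parameter counts are read off the model names (30B-A3B, 235B-A22B).
for name, total, active in [("Qwen3-30B-A3B", 30, 3), ("Qwen3-235B-A22B", 235, 22)]:
    estimate = dense_equivalent_b(total, active)
    print(f"{name}: ~{estimate:.1f}B dense-equivalent")
```

By this estimate the 30B-A3B lands below a 14B dense model, which is why the 14B "should" be stronger if the heuristic holds.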

glad jackal May 1, 2025, 8:28 AM

#

i mean parameters are important too

glad jackal May 1, 2025, 8:29 AM

#

keen beacon im not sure about qwen 3 30b moe but a rule of thumb is sqrt(total params * acti...

oh

alpine coral May 1, 2025, 8:30 AM

#

cedar tide https://fixupx.com/suriyagnskr/status/1917731754515013772?t=yQeTFTkCfRkl0ZhQJ2k-...

what model is 'gemini 2 thinking' in those charts?

glad jackal May 1, 2025, 8:30 AM

#

yellow

flint sand May 1, 2025, 8:30 AM

#

probably 2.0

keen beacon May 1, 2025, 8:30 AM

#

might be 2 flash thinking

#

only gemini 2 thinking model

alpine coral May 1, 2025, 8:31 AM

#

keen beacon might be 2 flash thinking

ahh right

#

i thought it was only since 2.5

#

was 2 flash thinking only ever experimental?

keen beacon May 1, 2025, 8:31 AM

#

yea

glad jackal May 1, 2025, 8:33 AM

#

@keen beacon i have doubt about the sqrt one?

#

so if we take 30B-A3B we get 9.5B as the dense equivalent right?

#

does it mean that qwen3 14B is equivalent to qwen3 30B A3B?

keen beacon May 1, 2025, 8:35 AM

#

glad jackal does it mean that qwen3 14B is equivalent to qwen3 30B A3B?

it should be slightly stronger if we use that rule of thumb

#

qwen 30b moe seemingly breaks that rule though

glad jackal May 1, 2025, 8:36 AM

#

how?

#

wrt benchmarks?

keen beacon May 1, 2025, 8:37 AM

#

#

also @cedar tide

#

open evals from hf

cedar tide May 1, 2025, 8:37 AM

#

Screenshot_2025-05-01-10-37-40-102_com.android.chrome-edit.jpg

cedar tide May 1, 2025, 8:38 AM

#

keen beacon

Where you find this ?

keen beacon May 1, 2025, 8:38 AM

#

https://xcancel.com/nathanhabib1011/status/1917230699582751157

Nathan (@nathanhabib1011)

@huggingface's OPEN EVALS DROPPED FOR @Alibaba_Qwen it appears the models are great thinkers that know nothing 🐺

finished running qwen3 model on a selection of evals, results look great except simpleQA

235B-A22B wins the race with 🔥 results for 22B active params 👑

1/5

glad jackal May 1, 2025, 8:39 AM

#

seems like it is true the moe is performing better than the dense equi

keen beacon May 1, 2025, 8:39 AM

#

glad jackal seems like it is true the moe is performing better than the dense equi

yes

#

thats why i said it breaks the rule

#

the simpleqa score is telling

#

despite lower active params, but higher total params, it stores more world knowledge compared to the 14b

glad jackal May 1, 2025, 8:41 AM

#

ok

#

how does it perform wrt gemini 2.5?

#

moe model vs gemini 2.5

keen beacon May 1, 2025, 8:42 AM

#

gemini 2.5 is moe too

keen beacon May 1, 2025, 8:42 AM

#

glad jackal moe model vs gemini 2.5

none of them are gonna beat gemini 2.5

#

i would say its primarily the post training though that is the most important part

keen beacon May 1, 2025, 8:44 AM

#

keen beacon https://xcancel.com/nathanhabib1011/status/1917230699582751157

i also would bet the base models would score much higher on simpleqa, the post trained versions seem damaged heavily

glad jackal May 1, 2025, 8:46 AM

#

@keen beacon nowadays i am kinda performing more vibe coding than before how do i reduce it
i use it cause it gives me more optimal sol

keen beacon May 1, 2025, 8:47 AM

#

thats primarily a you problem, idk how to fix that, it requires a personal solution. you could force the conditions though (if u really want to stop vibe coding rather than be productive), learn an obscure language (that isn't supported well in llms) and start coding in that

#

personally day-to-day i use a language that isnt supported that well by llms yet, so i can't really vibe code

glad jackal May 1, 2025, 8:50 AM

#

but i try to do it in a normal way it takes more than 2 to 3hrs as i need to read the docs and then implement it

keen beacon May 1, 2025, 8:51 AM

#

glad jackal but i try to do it in a normal way it takes more than 2 to 3hrs as i need to rea...

do u want to stop vibe coding or do you want to do your work faster? (assuming its a job)

#

if you want to do your work faster, then keep vibe coding or smthing (assuming your resulting code is of satisfactory quality)

glad jackal May 1, 2025, 8:52 AM

#

i want to reduce vibe and also be productive

keen beacon May 1, 2025, 8:53 AM

#

glad jackal i want to reduce vibe and also be productive

u need to examine how you use ai/your workflow/etc. then try to work things out and strike a better balance. 🤷

alpine coral May 1, 2025, 8:56 AM

#

cedar tide

should add a column for qwen3-4b (there's data here) if you can be bothered aha

keen beacon May 1, 2025, 8:59 AM

#

i wish alibaba had a qwen 3 api lol. im noticing bad degradation on some of the providers

#

it does extremely well on the tasks i need data for

unborn ocean May 1, 2025, 9:00 AM

#

keen beacon i wish alibaba had a qwen 3 api lol. im noticing bad degradation on some of the ...

Have you tested fireworks yet ? Thinking about using it

#

For the big MoE

alpine coral May 1, 2025, 9:01 AM

#

alpine coral should add a column for qwen3-4b (there's data [here](https://www.reddit.com/med...

actually nvm i was being lazy and forgot we have these fancy AI machines ha

#

v impressive for such a tiny model

#

imagine if scaling literally worked in terms of parameter size.. like extrapolating this 4b param model's scores out to something the size of GPT-4.5 or whatever

#

well.. i think most those benchmarks are bounded by 100%.. so ig you just hit that quickly ha

keen beacon May 1, 2025, 9:04 AM

#

unborn ocean Have you tested fireworks yet ? Thinking about using it

i havent checked it thoroughly still waiting for finalized pricing

#

early on w deepinfra i noticed outputs were garbled after a while/massively degraded on long responses

#

(it doesnt happen on chat.qwen.ai or even chutes, but chutes is weird)

keen beacon May 1, 2025, 9:11 AM

#

keen beacon i havent checked it thoroughly still waiting for finalized pricing

rn qwen 3 30b moe is $0.9/M tokens on fireworks, qwen 3 235b is $0.1/M tokens lol. and they stated theyre still working on the pricing. not gonna use it until they finalize it even if quality is good

keen beacon May 1, 2025, 9:39 AM

#

they used o3 mini traces hmm

#

ig u can see phi 4 reasoning traces to see how o3 mini reasons in raw form

#

im surprised oai allowed this lol

alpine coral May 1, 2025, 9:53 AM

#

microsoft have invested like $30bn into oai

keen beacon May 1, 2025, 9:54 AM

#

alpine coral microsoft have invested like $30bn into oai

yeah but it could have an impact on their investment wrt competitors

alpine coral May 1, 2025, 9:54 AM

#

microsoft is practically (if not literally) majority shareholder of oai

keen beacon May 1, 2025, 9:54 AM

#

competitors can now see o3 mini traces, style, etc., when openai tried to hide any of the new stuff hard

alpine coral May 1, 2025, 9:55 AM

#

so they're not competitors (microsoft/phi and oai)

calm sequoia May 1, 2025, 9:55 AM

#

Just had an idea. The ELO score is determined by the battles a model fought in the arena. There always exists a SET of models to choose from as an opponent in any given battle. This set consists of two sub-sets: a better model subset and a worse model subset. The ratio between those determines your ELO. If you provide a new model for testing, together with a number of objectively worse models or variants (e.g., mini, nano, flash, etc.), the subset of worse models increases. The ratio of worse-to-better models increases in your favor from the start. Therefore, you get an ELO boost by giving the arena a model series instead of a single model.

#

Any ideas against this?
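One thing worth checking against that idea is how a textbook Elo update weights the opponent's rating. The sketch below uses the classic online Elo formula with K=32; the ratings are made-up numbers, and the Arena's actual rating computation may differ from this, so treat it as an illustration only.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Classic Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """New rating for A after one game (score_a: 1 = win, 0.5 = tie, 0 = loss)."""
    return r_a + k * (score_a - expected_score(r_a, r_b))


strong = 1400.0  # hypothetical rating of the flagship model
gain_vs_weak = elo_update(strong, 1000.0, 1.0) - strong  # win over a much weaker variant
gain_vs_peer = elo_update(strong, 1400.0, 1.0) - strong  # win over an equally rated model
print(f"gain vs weak: {gain_vs_weak:.2f}, gain vs peer: {gain_vs_peer:.2f}")
```

Under this formula a win over a much weaker model is already mostly "expected", so it moves the rating far less than a win over a peer. Adding weak siblings to the pool changes who you face, not how much each win is worth.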

alpine coral May 1, 2025, 9:55 AM

#

keen beacon competitors can now see o3 mini traces, style, etc., when openai tried to hide a...

oh i see what you're saying now

keen beacon May 1, 2025, 9:55 AM

#

alpine coral so they're not competitors (microsoft/phi and oai)

i mean if openai is valued less itll impact ms

alpine coral May 1, 2025, 9:55 AM

#

yeah gotcha- misunderstood your initial point

#

perhaps oai is kinda resigned to the fact that those traces will always be sought by their competitors - and to some extent, extracted

#

like they're outliers by even hiding it

#

but ig they think (or know) they have some kinda special sauce goin on there

#

so yeah i do see your point

keen beacon May 1, 2025, 9:58 AM

#

alpine coral perhaps oai is kinda resigned to the fact that those traces will always be sough...

they really dont want this happening they have a lot of extreme safeguards about it, unlike any other company. i doubt deepseek or qwen really extracted anything and used it

#

its kinda crazy how far they've gone in trying to hide it

#

then theyre doing the identity thing for newer models on top of that

alpine coral May 1, 2025, 10:05 AM

#

fwiw i'd personally be shocked if none of the major labs had tried to reverse engineer / extract as much as they possibly can from each others' models ha

#

but also, raw reasoning excerpts exist, e.g. https://www.reddit.com/r/ChatGPT/comments/1fussvn/o1_preview_accidentally_gave_me_its_entire/

#

and yeah just the style.. "hmm let me reconsider.." etc

#

it's all kinda familiar (seems drawn from a similar source)

#

i'd be surprised if deepseek/qwen (or google's thinking models for that matter) R1/qwq were totally in-house creations

keen beacon May 1, 2025, 10:13 AM

#

alpine coral i'd be surprised if deepseek/qwen (or google's thinking models for that matter) ...

google did not use any openai/deepseek/qwen traces, im pretty sure. their cold start is distinct compared to anyone else, u can see in the style.
im pretty sure qwq 32b preview didn't use o1 preview traces, at least in the final preview model. its one of the most unique models ive encountered tbh, a lot of "qwqisms" and "second language behavior" (based on my extensive usage of it. plus code switching)
although they share a resemblance to o1 preview traces, im pretty sure deepseek made their own coldstart as well.
qwen modeled their traces after deepseek after r1, though, im still pretty sure its still partially made by them/even if it wasn't completely by them (might've used r1, im not sure)

#

speculation obviously fwiw

alpine coral May 1, 2025, 10:14 AM

#

yeah i mean i'm just speculating - no proof / all anecdotal (and don't really use qwen or deepseek models ever aside from playing around)

#

ha oh

#

i feel you're more informed on it. it's just what i've always thought - but haven't looked into it

#

google for sure has a totally distinct feeling

#

in terms of its thinking models

keen beacon May 1, 2025, 10:16 AM

#

tbh additionally making ur own cold start isnt difficult, i dont really see an incentive to not make your own if you're a frontier lab. bootstrapping off of someone else isn't a sustainable strategy

#

i found the complaints about deepseek using leaked traces (i believe) really disingenuous

#

and based on superficial patterns after working with a ton of reasoning models, processing traces/etc

alpine coral May 1, 2025, 10:17 AM

#

i'm not familiar with them

alpine coral May 1, 2025, 10:19 AM

#

keen beacon tbh additionally making ur own cold start isnt difficult, i dont really see an i...

not bootstrapping - just borrowing elements of how they implemented the actual 'reasoning' part. like it's more than just giving it more inference opportunity. it basically has an internal monologue (v similar to anthropic's 'scratchpad', ive always thought)

#

having visibility of how the first company did it would be useful.

#

but not essential

keen beacon May 1, 2025, 10:20 AM

#

alpine coral not bootstrapping - just borrowing elements of how they implemented the actual '...

they definitely took inspiration off of o1 preview traces, to generate their own cold start, but i doubt they leaked and trained on them on a massive scale

alpine coral May 1, 2025, 10:20 AM

#

yeah i'm not suggesting they tried to copy it or anything

#

just yeah, where they could, take inspiration from (at the very very least)

keen beacon May 1, 2025, 10:20 AM

#

alpine coral not bootstrapping - just borrowing elements of how they implemented the actual '...

the reasoning style is highly dependent on cold start btw. you are not getting gemini style, qwqisms, etc., randomly from rl

unborn ocean May 1, 2025, 10:20 AM

#

keen beacon rn qwen 3 30b moe is 0.9m/tok on fireworks, qwen 3 235b is 0.1m/tok lol. and the...

yeah, i was also mainly surprised by the low price, so also waiting for the final statements

10 cents per million tokens would be real nice, but i guess we don't live in paradise do we

unborn ocean May 1, 2025, 10:21 AM

#

alpine coral microsoft have invested like $30bn into oai

i think the microsoft and openai had some heavy beef, so it might just be literally a way to get back at them lol

keen beacon May 1, 2025, 10:21 AM

#

alpine coral yeah i'm not suggesting they tried to copy it or anything

yeah when r1 was released people were complaining about it

#

some lab folks i believe

unborn ocean May 1, 2025, 10:22 AM

#

alpine coral microsoft is practically (if not literally) majority shareholder of oai

49% as far as I know, but non-voting shares

alpine coral May 1, 2025, 10:22 AM

#

keen beacon yeah when r1 was released people were complaining about it

oh really? this is honestly just my personal speculation ha

keen beacon May 1, 2025, 10:23 AM

#

alpine coral oh really? this is honestly just my personal speculation ha

yea (i believe). i just found it disingenuous to claim that when u worked with traces a lot u probably know it isnt true

#

off of superficial stuff

#

thats why i thought u were talking about that

alpine coral May 1, 2025, 10:23 AM

#

oh nope aha

#

i just assume most progress is iterative

#

it's rare for something brand new to ever come along

#

i guess the argument is over the extent

#

rather than the fact of iteration

#

anyway.. separately ha it's been interesting to have a proper look through that paper which criticises the Arena for allowing a handful of providers to flood it with anonymous models (and then choose which to use as the public release and, by extension, get added to the LB, while the rest are discarded, resulting in a sampling bias).
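To make the sampling-bias point concrete, here is a toy simulation of the best-of-N effect. Every number in it (true score, noise level, number of trials) is invented for illustration and is not taken from the paper: each private variant has the same true skill, the measured score is noisy, and only the best-looking variant gets published.

```python
import random
import statistics


def best_of_n_mean(n_variants: int, true_score: float = 1300.0,
                   noise_sd: float = 15.0, trials: int = 2000, seed: int = 0) -> float:
    """Average published score when only the best of n noisy measurements is kept."""
    rng = random.Random(seed)
    published = []
    for _ in range(trials):
        # Every variant has identical true skill; only the measurement noise differs.
        measured = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
        published.append(max(measured))
    return statistics.mean(published)


for n in (1, 3, 10):
    print(f"n={n}: mean published score ~ {best_of_n_mean(n):.1f}")
```

With a single submission the published score averages out to the true score. Picking the maximum of several noisy measurements pushes the published number upward even though no variant is actually better.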

keen beacon May 1, 2025, 10:33 AM

#

Is there a way to find old leaderboards? For example the leaderboard on the 14th of April

alpine coral May 1, 2025, 10:33 AM

#

the appendix includes a table with all the anon models they identified

#

there is also a section 'unknown' (and, curiously, they say qwen had a private model in the arena, albeit not under a pseudonym, so kinda different but still would be a first for a chinese lab afaik)

keen beacon May 1, 2025, 10:35 AM

#

ya i dont really count that tbh

#

qwen plus exp

hollow ocean May 1, 2025, 10:36 AM

#

keen beacon May 1, 2025, 10:36 AM

#

ya cool huh?

hollow ocean May 1, 2025, 10:36 AM

#

alpine coral May 1, 2025, 10:37 AM

#

keen beacon ya i dont really count that tbh

was just as a sidenote really

alpine coral May 1, 2025, 10:44 AM

#

keen beacon Is there a way to find old leaderboards? For example the leaderboard on the 14th...

https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard/tree/main

brittle tiger May 1, 2025, 10:48 AM

#

alpine coral anyway.. separately ha it's been interesting to have a proper look through that ...

i thought the paper was very weak. felt they had an axe to grind more than doing a proper fair take. they mostly used meta being shady bc llama was weak to tarnish arena but that was already addressed by lmsys with policy update. and botched a ton of stats according to lmsys

alpine coral May 1, 2025, 10:50 AM

#

i honestly think lmarena's response is laughable - they don't address the actual points made in the paper

#

they make some totally irrelevant analogy about steph curry

#

weak af

#

it might be fairly said that cohere has an axe to grind with the arena

#

but i don't think that discredits the paper.. it's imperfect but rigorous from what i can tell. lmarena has only provided rhetoric in response...nothing to actually refute the paper's best-of-N strategy.. their data covering proprietary vs open models covers the entirety of the Arena's existence, rather than the few months the authors of the paper were scraping / analysing

brittle tiger May 1, 2025, 11:06 AM

#

I still think the central claim of their conclusion unfairly uses meta shadiness to make a much broader ridiculous point: "The widespread and apparent willful participation in the gamification of arena scores from a handful of top-tier industry labs is undoubtedly a new low for the AI research field. As scientists, we must do better. As a community, we must demand better."

#

"undoubtedly a new low for the AI research field" gimme a break that's laughable. and openai and google are not willfully gaming scores

unborn ocean May 1, 2025, 11:36 AM

#

I think much of the paper feels like you are asking a very verbose reasoning model to criticize your project and it just starts coming up with many (and partly obvious) weaknesses, most of which you will straight up ignore.

Otherwise the softer and more obvious criticisms, like most of the data going to big tech (inc. openai), are very valid, but that is also why the lmarena is able to exist at this scale in the first place (imo). (Because it is not just a benchmark, but also a high quality and agile data source for the companies)

#

But I am glad that they chose this approach of writing a paper over the article approach most other labs take to address issues. This is something we should have more of.

keen beacon May 1, 2025, 11:39 AM

#

Any news on augment code?

cedar tide May 1, 2025, 12:02 PM

#

alpine coral the appendix includes a table with all the anon models they identified

Where do you find this ?

alpine coral May 1, 2025, 12:14 PM

#

cedar tide Where do you find this ?

https://arxiv.org/pdf/2504.20879

cedar tide May 1, 2025, 12:14 PM

#

Thx

alpine coral May 1, 2025, 12:21 PM

#

brittle tiger "undoubtedly a new low for the AI research field" gimme a break that's laughable...

yeah agree.. that language is a bit much (even though i think the paper seems otherwise sound)

#

anyway.. was just posting those screenshots it's interesting to see all the anon models catalogued like that

#

they are a fun part of the arena - no doubt...

#

in fact.. it wouldn't be nearly as interesting without them

#

but yeah it's kinda gotten out of hand.. and i'm sympathetic to this 'best of n' argument and that being distortive and in some ways arguably unfair (even though there's technically nothing stopping any lab, however small, from spamming the arena; they would just need to ask as often as google, meta etc do.. ig)

keen beacon May 1, 2025, 12:28 PM

#

alpine coral but yeah it's kinda gotten out of hand.. and i'm sympathetic to this 'best of n'...

its been a long time but i think there was drama about it a long time ago. something about how they wont add models or smthing and starting a competitor to lmsys. (never came to fruition i think)

torn mantle May 1, 2025, 12:28 PM

#

alpine coral https://arxiv.org/pdf/2504.20879

crazy how many models Meta tried

#

and they were all bad

alpine coral May 1, 2025, 12:29 PM

#

ha yeah that's kinda what i mean by out of hand

keen beacon May 1, 2025, 12:31 PM

#

keen beacon its been a long time but i think there was drama about it a long time ago. somet...

alpine coral May 1, 2025, 12:31 PM

#

aha oh drama from April 23

#

that's the real early days of this AI boom ha

keen beacon May 1, 2025, 12:32 PM

#

june 4 2023 xd (i dont think its relevant at all today, and presumably it was resolved quickly since i think nothing happened after that)

alpine coral May 1, 2025, 12:32 PM

#

lol geez i struggled there

#

i won't edit again - let it stand ha

#

but yeah i mean i hope lmarena doesn't become a victim of its own success (it needs integrity to be meaningful.. but between betting and playing footsies with literally some of the biggest companies in the world.. there may come a breaking point)

#

but yeah they are the definition of having a first-mover advantage

#

and all the scale (votes) that brings with it

#

be very hard for another crowdsource project to displace it (more likely the arena falls one day i'd think)

keen beacon May 1, 2025, 12:35 PM

#

alpine coral be very hard for another crowdsource project to displace it (more likely the are...

early days i think they couldve had a chance

alpine coral May 1, 2025, 12:35 PM

#

yup - competing first movers

#

but now yeah they're like the google of ai chatbot crowdsource voting ha

keen beacon May 1, 2025, 12:45 PM

#

So i have a choice

#

My parents are making me decide now

#

Change my major to artificial intelligence and pursue a technology career

#

or stick to marketing

#

with lovable and cursor becoming the top performers i might have to 💔 gpt wrapper method

#

svelte

#

easy choice

#

or LuaU because LuaU is super simplified

#

so JS

#

yea thats the stuff i like the most web development

#

app creation is a whole other field

#

yes

#

working on a project now

#

#

2.5 pro for debugging yes

#

And Workflow/MVP planning

#

Azure backend

#

mvp = minimum viable product for early customers

#

like skeleton build to test theory/logistics

#

I get github pro free for two years so i have access to 4.1, gemini, claude 3.7 for free

#

For my resume

#

I'm a marketing student rn

#

Making a trend finder + tracker for cross platform data analysis

#

would look good on my resume and portfolio cuz im also doing business administration so i need to know how to lead a company, etc

#

Great I dont like AI studio that much

#

If you sign up on the Gemini website

#

there will be a free 1 month trial

#

Nah bad ui gemini official website better

#

When they end just make a new account

#

They have no KYC or nothing

#

u cant branch on the gemini product i think amongst other things. the gemini product seems bad compared to aistudio, which still has its own issues

#

they can but they dont

#

i've done it for awhile

keen beacon May 1, 2025, 1:06 PM

#

keen beacon u cant branch on the gemini product i think amongst other things. the gemini pro...

wym by branch?

#

ah

keen beacon May 1, 2025, 1:06 PM

#

keen beacon wym by branch?

on chatgpt u can edit ur message and immediately start a new branch without losing the previous branch/conversation

#

that explains it

#

whenever i get close to running out of tokens

#

i just tell it to copy itself in a prompt

#

then use gemini's gem manager so it remembers the base project

#

then that prompt acts like its loading a save

#

restores token context count + allows you to continue where you left off

#

just my little work around

#

i guess it really is a preference thing at the end of the day ai studio is more for people who understand ai, etc

I just need bare bones maybe add a few prints here and there or give me some other solutions

balmy mist May 1, 2025, 1:14 PM

#

is qwen 3 good at tool calling like o3?

#

like does it have tool calling built in the reasoning as well?

keen beacon May 1, 2025, 1:16 PM

#

balmy mist like does it have tool calling built in the reasoning as well?

i think

keen beacon May 1, 2025, 1:16 PM

#

balmy mist is qwen 3 good at tool calling like o3?

maybe not as good but pretty good i believe

#

where u are getting access to it btw

#

qwen3 api

balmy mist May 1, 2025, 1:20 PM

#

i havent tried it yet, just wanna do a project with it

#

wait can we even use the tool calling magic in api?

#

for o3 or qwen?

keen beacon May 1, 2025, 1:20 PM

#

i think yes for both

balmy mist May 1, 2025, 1:20 PM

#

i wish we could put our own tools in the reasoning

keen beacon May 1, 2025, 1:20 PM

#

you can

#

i believe

balmy mist May 1, 2025, 1:20 PM

#

teach me please 🙂

#

only for o3?

#

or both

#

o3 too expensive

keen beacon May 1, 2025, 1:21 PM

#

both

balmy mist May 1, 2025, 1:21 PM

#

whatt

keen beacon May 1, 2025, 1:21 PM

#

https://platform.openai.com/docs/guides/tools?api-mode=responses see function calling

keen beacon May 1, 2025, 1:22 PM

#

balmy mist whatt

qwen3 doesnt have built in tools i think, its all your own functions.
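
A minimal sketch of the "your own functions" side of function calling, assuming the model hands back a tool name plus a JSON string of arguments. Nothing here hits a real API: `get_weather`, `REGISTRY`, `dispatch`, and `fake_call` are made up for illustration, and the exact tool schema wrapper differs between providers, so check the docs for whichever one you use.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local function the model would be allowed to call."""
    return f"sunny in {city}"  # placeholder, no real lookup


# Tool description in JSON-schema style. The outer fields are an assumption;
# providers wrap this differently.
WEATHER_TOOL = {
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Maps tool names to the Python callables that implement them.
REGISTRY = {"get_weather": get_weather}


def dispatch(call: dict) -> str:
    """Run the function the model asked for and return a JSON result string."""
    func = REGISTRY.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    args = json.loads(call["arguments"])  # arguments arrive as a JSON string
    return json.dumps({"result": func(**args)})


# Simulated tool call, standing in for what a model would emit.
fake_call = {"name": "get_weather", "arguments": '{"city": "Oslo"}'}
print(dispatch(fake_call))
```

In a real loop the string from `dispatch` would be sent back to the model as the tool result so it can keep going.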

balmy mist May 1, 2025, 1:23 PM

#

so i can add lets say 50 more tools to o3 with function calling?

keen beacon May 1, 2025, 1:25 PM

#

balmy mist so i can add lets say 50 more tools to o3 with function calling?

i dont think this is a good idea but i guess lol

balmy mist May 1, 2025, 1:28 PM

#

why not lol?

keen beacon May 1, 2025, 1:28 PM

#

they said qwen is ass

#

it will take a lot of tokens to review and decide which tool to use 🤣
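
A back-of-the-envelope sketch of why 50 tools gets pricey, assuming every tool definition is sent along with each request and using a crude "about 4 characters per token" rule of thumb. Real tokenizers will give different numbers, and `make_tool` just generates placeholder definitions.

```python
import json


def make_tool(i: int) -> dict:
    """Build a made-up tool definition; names and fields are placeholders."""
    return {
        "type": "function",
        "name": f"tool_{i}",
        "description": f"Does thing number {i} for the user.",
        "parameters": {
            "type": "object",
            "properties": {"input": {"type": "string"}},
            "required": ["input"],
        },
    }


def rough_tokens(obj) -> int:
    """Heuristic only: serialized length divided by 4 characters per token."""
    return len(json.dumps(obj)) // 4


one = rough_tokens([make_tool(0)])
fifty = rough_tokens([make_tool(i) for i in range(50)])

print(f"1 tool: ~{one} tokens per request")
print(f"50 tools: ~{fifty} tokens per request")
```

The overhead scales roughly linearly with the number of tools, before the model spends any reasoning tokens picking one.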

keen beacon May 1, 2025, 1:28 PM

#

keen beacon they said qwen is ass

it isnt

balmy mist May 1, 2025, 1:29 PM

#

keen beacon it will take a lot of tokens to review and decide which tool to use 🤣

what do you think is worth more time: figuring out what tool to use, or training models to mimic those tools? like instead of having a model that can generate 3d animations, why not just have a model use the blender mcp?

#

i think this is what OpenAI is noticing lowkey

#

we already have solid models and benchmarks are getting cooked

#

so why not just focus on building more tools around our models, which i guess is what is happening lol

keen beacon May 1, 2025, 1:30 PM

#

balmy mist what do you think is worth more time figuring out what tool to use or training m...

both are valid avenues

balmy mist May 1, 2025, 1:30 PM

#

yeah

#

def gonna be a mix of both tho

keen beacon May 1, 2025, 1:31 PM

#

the mcp thing is gonna matter significantly more going forward

balmy mist May 1, 2025, 1:31 PM

#

i want to see if i can do it a lil on a small scale with mcps

keen beacon May 1, 2025, 1:31 PM

#

it indirectly impacts a model that can generate 3d animations

balmy mist May 1, 2025, 1:31 PM

#

but have them use the mcps in the thinking

balmy mist May 1, 2025, 1:31 PM

#

keen beacon it indirectly impacts a model that can generate 3d animations

exactly

#

the only thing is having to update the model's thinking base with new mcps and tools

keen beacon May 1, 2025, 1:32 PM

#

keen beacon it indirectly impacts a model that can generate 3d animations

this data can be used for a model that can generate 3d animations directly

balmy mist May 1, 2025, 1:32 PM

#

keen beacon this data can be used for a model that can generate 3d animations directly

i see what you mean

#

but what tool calling does qwen have?

#

ppl saying the 30b model is really good

keen beacon May 1, 2025, 1:33 PM

#

balmy mist but what tool calling does qwen have?

idk u can use mcp i believe. even the official qwen website is gonna support it ootb