#general

ocean vortex
#

it's also true for most of the other benchmarks. If you focus on just 1 you degrade something else

torn mantle
#

xdd

small haven
#

day 13 of no o3 pro

#

day 1 of not caring about grok 3.5

#

big brain is 404

#

im retired

#

not interested in introducing virus to my system

#

ok buddy

torn mantle
#

agree

eager crater
#

Is it normal that the Q3_K_M quant is bigger than the Q3_K_XL quant? it's unsloth's

balmy mist
torn mantle
#

seems like i cant get enough of o3

#

this thing is so crazy

small haven
#

the good type of virus

#

especially when it has these in the traces, gets me hard

torn mantle
#

nah

#

its so smart

#

it totally feels like a different type model from all other LLMs

hollow ocean
#

O3 pro only for rich p2w users

torn mantle
#

😦

#

sad

small haven
#

thank you o3

golden ocean
#

life without free sota models access method

small haven
#

dw o3 pro is gonna be free in a year

#

but by then o5 pro is sota

torn mantle
#

actually r2 may have some interesting outputs

#

hopefully its not like qwen where they only focus on fixing old bugs

small haven
#

huh what is this arrow.. doesn't trigger anything

harsh flume
#

Been away, how's new qwen performance?

#

Any sota vibe in it?

#

Or just good enough for a Chinese model?

keen beacon
#

it isnt sota but its solid

harsh flume
#

Not even top 3 contender for arena?

#

Any particular strong use case for it?

keen beacon
keen beacon
small haven
#

yes the media is forcing o3 pro lesssgoo

torn mantle
#

we dont care anymore about anthropic

#

like seriously their models are unusable with the rate limits and they nerfed it

#

writes short/concise outputs all the time = so lazy

keen beacon
#

they didn't nerf anything

#

it has been like that since day 1

torn mantle
#

seems like you werent using it since day 1

keen beacon
#

i have been using it both on claude.ai and via their api since day 1

torn mantle
#

apparently not

small haven
#

3.7 is literally unusable

keen beacon
#

it also seems fine to me 🤷

#

that goes for claude 2, claude instant, claude 3 opus, claude 3.5 sonnet, claude 3.7 sonnet

#

i use a lot of claude (still)

#

yeah i've been interested in them since the first ever claude

#

when the only way you could access their models publicly was with a crappy slack bot

#

claude 1 and claude instant were great creatively tbh

#

they just lost coherency over long context 😔

torn mantle
#

day 1 :

keen beacon
#

first company to introduce 100k context windows i think

torn mantle
#

months later :

keen beacon
#

it could be: system prompt changes/randomness

torn mantle
#

same prompt

keen beacon
#

if you're serious about getting the most out of these models you wouldn't use them over their consumer UIs

#

use the api dawg

torn mantle
#

im not rich like you

#

that will cost me a lot

keen beacon
#

ppl still complain about them nerfing it on the api anyway

#

people are schizo

torn mantle
#

are you calling me schizo?

keen beacon
#

no lmao

#

if you looked at the context

#

he said on the api

torn mantle
#

@keen beacon is he calling me schizo?

#

????????????

keen beacon
#

there have been so many instances in the ai industry of people claiming degradation when there have been no changes and it's like one giant placebo

#

no, he was talking about people who still complain claude is being nerfed on the api

torn mantle
#

im jk

#

but in all seriousness i do think they played with the system prompt ( default one ) to output shorter results

small haven
#

guys o3 marginal cost is zero

torn mantle
#

you probably also dream of o3 pro

#

enough

#

use gemini 2.5 pro a bit

#

or qwen 3

keen fulcrum
#
Poll: Did Qwen3 live up to what you expected?
Winning answer: "Met expectations" (answer id 2, 11 of 18 votes)

ocean vortex
#

no pro is more like you having 10 clones of yourself trying to do the same thing individually and then all of those are merged into 1 solution lol
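
A minimal sketch of the "10 clones merged into 1 solution" idea, assuming the merge step is a simple majority vote over the sampled answers. How the pro modes actually combine runs is not public, so this is only an illustration. `ask` is a hypothetical callable standing in for whatever model API you use; it is not a real library function.

```python
from collections import Counter
from typing import Callable, List


def merge_by_majority(candidates: List[str]) -> str:
    """Return the most common answer; ties go to the earliest-seen answer."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    counts = Counter(candidates)
    best = max(counts.values())
    for answer in candidates:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always has the best count")


def solve_with_clones(ask: Callable[[str], str], prompt: str, n: int = 10) -> str:
    """Sample n independent answers to the same prompt, then merge them.

    `ask` is a hypothetical stand-in for a single model call (assumption).
    """
    answers = [ask(prompt) for _ in range(n)]
    return merge_by_majority(answers)
```

Majority voting only works when answers can be compared exactly (a number, a multiple-choice letter). For free-form output the merge step would need something else, for example a separate judging call.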

small haven
#

^

#

also big brain never existed

#

but internally

#

any plus have o1 pro yet?

ocean vortex
# small haven any plus have o1 pro yet?

Interesting. The only way this makes sense in my head is that they are disguising upcoming o3-pro like that. They used to do this with alpha releases (like gpt4-all-tools) where select users that normally use lower tier models (free gpt3.5) got access to upcoming best models/features.

#

Probably for data collection among users not terribly familiar with the models related the most to what they are releasing

brittle tiger
#

nvmd im idiot. i have pro plan

full kite
#

Guys how much better is qwen 3 compared to Gemini 2.5 pro? 3-4%?

#

Help oOoOO

keen beacon
#

it isnt better

full kite
#

Ok ty

keen beacon
#

its good if u want something to run locally

full kite
keen beacon
#

there is no 700b model lol

#

in qwen 3

full kite
#

Ok how much

keen beacon
#

idk there are a lot, just look it up

full kite
#

😡

full kite
#

That not an answer

small haven
#

thank u o1 pro...

small haven
#

ok... thanks for the info. keeps using o3

brittle tiger
#

omg this o3-pro on api is insane. hope you get to experience it someday @small haven

small haven
#

?

small haven
#

ok buddy show me ur ass on oai playground, u wont

#

o3 keeps passing these tests, its like orgasmic

#

this button looks new in dr

#

screenshot or troll

balmy mist
#

screenshare please

small haven
#

buddy had to disable the notifs haha

#

$cr33n$h0t pl0x

raven void
#

Grok 3.5 will win lmarena in May

small haven
#

life is good with operator

high egret
#

what do you think will the biggest model of qwen3 score on lmarena leaderboard ?

#

btw once the model is added, generally, how much time does it take to have enough votes to be on the leaderboard?

elder rapids
leaden palm
high egret
small haven
#

qwen 3 is not gonna be #1 lol

hardy pecan
high egret
#

haha thx guys

#

i was just curious

high egret
#

and honestly the smaller model like 4B are really much more impressive than the 235B one

hardy pecan
#

It'll do fine, but it wont stand out

high egret
#

yeah, I feel that alibaba is better with smaller model

hollow ocean
high egret
#

like QwQ was already more impressive than qwen 2.5 max

high egret
#

@hardy pecan and I was meaning the delay to appear on leaderboard, not the score, that was the question I wanted an answer

keen beacon
#

Qwen 3 235b has a surprisingly high arena hard score, I wouldn't be surprised if it performs well. But it won't top the leaderboard

hardy pecan
#

Right, the models generally need above 2000 votes to appear on the leaderboard

#

Generally this takes a while, say 5 or 6 days depending on activity and hype generated around the released model

hollow ocean
high egret
high egret
#

The more I use gemini 2.5 pro the more I realise it's crazy

#

it gave me a full linear algebra course based on the video of my uni course

#

it's so perfect

small haven
#

1 pro

hollow ocean
small haven
#

Its just troll idc lol

keen beacon
#

O3 can take in videos?

leaden palm
leaden palm
#

it is o1 pro / o3 pro

high egret
#

my bad

leaden palm
raven void
#

Xiaomi dropped a pretty good 7b model

fleet lintel
calm sequoia
#

Just tested the new QWEN. At math it's at similar level as DeepSeek R1 but much worse than Claude. At world-knowledge and logic it is at similar level as models, such as GPT 4o or Grok. Not SOTA.

calm sequoia
#

Is Gemini 2.5 PRO via the Windsurf chat the same as via, e.g. AIstudio?

fleet lintel
#
Poll: Do you run models locally?
Winning answer: "No" (answer id 2, 7 of 10 votes)

small haven
#
Poll: are u excited for o3 pro
Winning answer: "shut up" (answer id 2, 10 of 13 votes)

alpine coral
#

included it for this question set

#

generally struggles.. tho with thinking enabled, it does quite well in one run

keen fulcrum
alpine coral
#

on that particular question set, yeah it does really well

#

but on others, it isn't at the top (though always up there)

#

didn't realise they reverted to an older 4o amid this personality backlash ha

We have rolled back last week’s GPT‑4o update in ChatGPT so people are now using an earlier version with more balanced behavior. The update we removed was overly flattering or agreeable—often described as sycophantic.
https://openai.com/index/sycophancy-in-gpt-4o/

still mason
#

Guys, IDK all the technical stuff. I'm just curious how different are the rating/ranking methodology for all the different AI rating/ranking sites?

I'm seeing quite a significant difference in ranking, is that expected?

calm sequoia
#

Some beef for lmarena

#

I've said it before - it would be more beneficial for lmarena to rebrand as RLHF site instead of benchmark

alpine coral
#

yeah and it's increasingly felt like it's become testing ground for the big labs.. soo many anon models over like the past 12 months.. i'm not surprised to see a paper make (and though i haven't read it, try to prove) these kinds of claims

calm sequoia
#

TBH this critique is not fair. Nobody wants to test the open source models as they were much worse for most of the time. The LMarena also has to think about user side (voters)

#

This is very very legit

alpine coral
#

yeah totally

#

we'd see farrrrrr fewer anon models if they all had to be published / added to leaderboard

#

would be just like the good ol im-a-good-chatgpt days again ha

#

rather than anon model spam

calm sequoia
#

It's the same as everywhere - stuff gets ruined where the money comes in.

#

They have a short time window to do something before a new lmarena appears (real open source)

alpine coral
calm sequoia
#

Nobody would blame lmarena if you would have separate benchmark for anonymous models + data gathering. The data may even be better if no betting would be involved 😄

alpine coral
#

hmm between the paper and their response there's a bit to unpack and digest ha

teal mantle
#

But then I am thinking should I get supergrok if 3.5 is good next week

#

And shame to know o3 access is only through API (impossible for me) or their subscription, I am contemplating

brittle tiger
calm sequoia
#

No we didn't. It wasn't more surprising than Grok getting top 1.

brittle tiger
#

Grok is actually good tho

#

We don't know nightwhisper exists if that rule is in place. I don't like

calm sequoia
#

Right here we go, maverick is good for some who like emojis and yapping

#

The issue is that I want to know if I'm testing model for the benchmark or for RLHF in advance

alpine coral
# alpine coral hmm between the paper and their response there's a bit to unpack and digest ha

the perfect job for an LLM aha.. i gave the paper along with LMArena's response and referenced blog posts to gem pro 2.5, sonn-3.7 and o3.. and asked them to evaluate who 'has the upper ground'. they all come to the same conclusions (which kinda confirm my surface-level take): the paper is empirical, LMArena's response is largely rhetorical and doesn't address the core claims made by the paper.. iow the paper wins

#

Paper: Transparently states scrape period, method, code reference, full tables and simulations; backs each claim with quantitative evidence and stress-tests core BT assumptions.

LMArena response: Offers principles (human preference matters, open-source ethos) and restates existing policy. Provides no new datasets, no re-analysis of the authors’ code, and no point-by-point numerical rebuttal. The single concrete figure (41 % open-source battles) is aggregate, methodology-free, and does not address provider-level skew.

#

The rebuttal is largely rhetorical; it neither falsifies the documented selection-bias mechanism nor supplies counter-analysis of sampling or deprecation effects. By contrast, the paper demonstrates those effects with real and simulated data and exposes structural deviations from BT assumptions essential for a fair leaderboard.

Bottom Line
On the evidence presently available, the authors of ‘The Leaderboard Illusion’ hold the upper ground. Their critique is data-driven and methodologically explicit, while LMArena’s response asserts intentions and policies without supplying empirically grounded refutations.
o3
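
For anyone unsure what the "BT assumptions" refer to: Bradley-Terry (BT) models each battle as a coin flip where model i beats model j with probability p_i / (p_i + p_j). Below is a minimal sketch of fitting those strengths from a win matrix with the classic iterative (MM) update. It is not LMArena's actual pipeline, and the function names and the toy win matrix are illustrative assumptions.

```python
from typing import List


def fit_bradley_terry(wins: List[List[int]], iters: int = 200) -> List[float]:
    """Estimate BT strengths; wins[i][j] = number of times model i beat model j."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes its game count divided by the summed strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        if scale == 0:
            return p  # no wins recorded at all; keep the previous estimate
        # Strengths are only defined up to a constant, so rescale to mean 1.
        p = [x * n / scale for x in new_p]
    return p


def win_probability(p_i: float, p_j: float) -> float:
    """BT probability that the model with strength p_i beats the one with p_j."""
    return p_i / (p_i + p_j)


# Toy data (made up): model 0 beat model 1 three times and lost once.
strengths = fit_bradley_terry([[0, 3], [1, 0]])
print(win_probability(strengths[0], strengths[1]))
```

The model treats battles as independent draws from fixed strengths, which is why uneven sampling of match-ups or silently retired models can skew the fitted ranking.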

#

fwiw here's sonnet and 2.5's takes

keen beacon
#

It makes the arena interesting

calm sequoia
#

Or dragontail

#

Can't remember

keen beacon
calm sequoia
#

Yeah, it was peak lmarena. But they were released. We are talking about unreleased models: one appears in the arena as anonymous and never sees daylight after.

keen beacon
#

That was like that

keen beacon
#

Same goes for now

#

Google makes the arena interesting

#

The only abject abuse of anon models I feel are the meta models

alpine coral
calm sequoia
#

Hmm after prolonged thought I can agree that it's better to have them in lmarena. The results still should be publicised. Maybe while keeping the anonymous name.

alpine coral
#

like :

🤖 400+ models on the leaderboards!
📊 300+ pre-release evaluations!

#

how many of those 300 are actually unveiled? i feel it's like ten maybe

keen beacon
#

It makes the leaderboard unnecessarily confusing. I feel like they've got a good thing going esp with google anon models

#

Just enforce a little more criteria and vetting on the models

alpine coral
#

i mean we're the ones providing the data.. currently it just goes to google, oai, meta and xai.. i agree the LB would become a mess if they were all added

#

but still, they could do disclosures.. like a ballpark elo or whatever. . which lab the model was from etc

keen beacon
alpine coral
#

yeah i mean after they're pulled from the arena ofc.. i dunno like a monthly roundup

#

would reduce the incentive for these companies to just spam it with slight iterations of the same thing

versed flare
#

hey, how do the qwen3 32b and 30a3b models perform?

#

idk if its the right place to ask but i cant find much stuff yet

alpine coral
#

there's been a bit of discussion here - maybe scroll up and have a read

versed flare
#

can you send a message link to lead me there?

alpine coral
#

but i think the general take is quite decent but nothing groundbreaking (in terms of pure performance)

alpine coral
final flame
#

guys

#

will there be a leaderboard update today?

drifting thorn
#

Oh Cohere had just made a multimodal RAG

full kite
#

how many grok exist

#

not better than me

#

nuhuh

golden ocean
#

gemini is so horrible at listening

full kite
#

he can't touch you

full kite
golden ocean
full kite
#

say please daddyyyy

#

I'm controlled by elon

balmy mist
#

swear

#

send details

cedar tide
#

Waiting for o3 and o4 mini on the webdev arena

full kite
#

isn't that already come up like 1 month ago with the ghibli images

cedar tide
full kite
cedar tide
# full kite what

You're on the lm arena discord but you don't follow the releases of the best llms?

#

Do you know the reasoning models?

full kite
#

i use gemini on googel studio ai

#

for homework

cedar tide
full kite
#

bro is it the same sht that ppl used to make gibli images

#

this

full kite
#

ok

cedar tide
#

This GPT 1 image or GPT 4o

full kite
#

so it's better than 3o

cedar tide
full kite
#

why is there 2 4o

cedar tide
#

Its o3

full kite
#

that doesnt make any sens

cedar tide
#

And its with gemini 2.5 pro the best llm ever

cedar tide
#

there is only one 4o

full kite
#

Ngga what

#

I'm blocking you

golden ocean
tall summit
full kite
drifting thorn
#
Poll: Which LLM has the most performance gain with good prompts?
Winning answer: "2.5 Pro" (answer id 1, 10 of 19 votes)

tall summit
balmy mist
#

but 4o and o4 are completely different lol

#

where is o3 pro tho

#

@deep adder you sad about gpt4 leaving?

full kite
#

What's gpt4

fast osprey
#

Hey Hi
I am from BharatGen, we have built our own LLM with Indic nuance
May I know how we can integrate our model API into Chatbot Arena to get a better comparison?

balmy mist
# full kite What's gpt4

GPT-4 is OpenAI's most advanced AI language model. It's way smarter, more creative, and better at reasoning than previous versions. It can handle much longer text and can even understand images (not just text). You find it powering things like ChatGPT Plus and Microsoft Copilot - Gemini 2.5 pro 😂

#

wait why dont we have an ai bot in here for questions and searching?

#

that would be so clutch

#

it should be

#

its just different syntax at the end of the day and llms are beasts at languages

#

if anything give it docs and info on any language you want and guide it

#

what you been making?

balmy mist
#

Whatever happened to that world-building prompt project you had?

full kite
#

get a life

#

Thats what I do

#

Why is he saying random sht

#

wdym

bleak silo
#

"undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired" true or false?

full kite
#

what programming lil bro

#

Im not a weird geek

#

bro ,whar

#

how

#

can i make minecraft

#

what is pythin

#

ngga what is all of that

#

im using google to make my school homework

#

a what??

#

It already did that

#

like 10 years ago

#

bro google was created in 2008 or sm sht

#

tf are you talking about

#

google made chatgpt

#

no

drifting thorn
#

I'm excited to be able to use Grok 3.5 on Poe

#

Like if I can

#

Don't take it as an info bruh

#

I'm just guessing

calm sequoia
#

But andrey is wrong on Openrouter. It's not better than lmarena. Interestingly it's skewed the other way around. Maybe an aggregate of openrouter and lmarena is the answer

barren prairie
#

Who is
folsom-exp-v1? He answered 95% of my questions correctly

#

O4mini failed it

#

Small mistake but interesting

full kite
#

google did it

#

chatgpt

#

why

#

elon musk

#

what is grok

#

what does grok mean

#

google created elon musk

alpine coral
#

i haven't found it particularly impressive

#

what are these questions ha

full kite
#

what does neologism mean

alpine coral
#

everyone and their datacentre, yes ha

full kite
#

what does that mean

alpine coral
#

literally just that

full kite
#

neologism is a old word

cedar tide
alpine coral
#

a new word (or neologism) is by definition invented recently ha

full kite
#

what is neologism

alpine coral
#

it's my favourite llm

#

no neologism

#

it's very powerful

full kite
alpine coral
#

lol jfc you weren't trolling.. um 'neologism' is just a fancy word for a new word

#

that's it

torn mantle
#

i had it many times

#

when

full kite
#

o

#

k

#

like skibidi is neologysm

#

is grok 3.5 better than chatgpt

#

or google

#

no

#

Prove it

drifting thorn
#

Wow I can merely recognise Winter's face

sonic tendon
#

deezpeek

balmy mist
#

grok 3.5 is out?

balmy mist
#

lets do a petition to keep gpt4 @deep adder

#

why did they have to do a big thing about it leaving

#

they should have just removed it, now im sad

#

ahh

#

nvm i dont care

#

when is grok 3.5 coming out

full kite
balmy mist
#

i gotta say if that comes out fast then XAI really cooking with their releases

balmy mist
full kite
#

It's hard

#

No it will not be better

#

It will need a subscription

#

Gemini is free

balmy mist
#

idk i dont think you can say o3 is better than 2.5 pro

#

they are better in different ways

#

like o3 has its own lane

#

but for a general model, 2.5 is the best

#

like plug and play

full kite
#

Google is better cause they are rich

torn mantle
#

Grok 3.5 will probably be added friday

#

On lmarena

full kite
#

Chatgpt 4 already exist

calm sequoia
#
Poll: Your opinion of AGI vs Political take
Winning answer received 4 of 14 votes

balmy mist
#

bruhh they playing

#

release o3 pro man

ocean vortex
#

disagree. OG gpt4 was undertrained and they scaled gpt4.5 the same way, so it's undertrained relative to its size too. Which means it's just mostly wasteful with no return

tall summit
#

terrence tao is gonna be all over the deepseek prover

#

I didn't see this at the time but if you are "very curious" I'm not concerned about AGI and a fencesitter about trump

ocean vortex
#

you could train og gpt4 sized model on lots of data and make it better than gpt4.5 in every way for sure

upper wolf
#

does anyone else hate how the new chatgpts glaze you every 5 seconds and add 20 emojis to every message? I’m at work bro… like can you take yourself seriously please???

tall summit
ocean vortex
upper wolf
#

you know what? you…….. you’re right

ocean vortex
#

Maybe they trained it on 2.5 outputs or smth lmao

#

gemini was always the worst at this

upper wolf
#

2.5 doesn’t yap nearly as much

ocean vortex
#

though 2.5 is better than their earlier models

upper wolf
#

now gemini is the one talking like a nih document

#

Gemini keeps it real

ocean vortex
#

this was gemini thing

upper wolf
#

on the same token, tho, i feel like 2.5 corrects you a lot more if there’s something wrong with your approach or you make an inaccurate statement. gpt-4o kinda just plays along.

tall summit
ocean vortex
#

ok so this is how it should be (4.5):

#

gemini:

#

4o/4.1:

alpine coral
ocean vortex
#

so both 4o and gemini are bad as far as I'm concerned lol

full kite
#

I'm on top

thorny drum
#

its basically unusable imo

#

the model is very smart but will never correct you (which turns into a game of how can you prompt the model in a way that introduces no bias) and then has some BS yap score that limits its responses

ocean vortex
brittle tiger
thorny drum
#

for me, when i look at the thinking it almost always is just discussing 'how can i say this using as few tokens as possible'

#

especially for coding tasks

ocean vortex
# thorny drum yeah i meant o3 and o4mini

they kinda fcked people over as soon as they introduced "developer message" tbh. It's now clear that they did this so they could take away system role from you and only use it themselves lol

#

developer message ≠ system message

#

it's weaker with less weight

balmy mist
small haven
#

officially two weeks of no o3 pro

balmy mist
#

i thought it was 3?

ocean vortex
# small haven officially two weeks of no o3 pro

why are you so obsessed with it. No one else is doing it because it doesn't make sense price wise to use. It's still the same model just used differently. You could probably code smth similar yourself with the existing o3 api.

small haven
upper wolf
small haven
#

if u had access to o3 pro and o3, im prtty sure, ur gonna spam o3 pro come on now

brittle tiger
#

We changed questions to reflect real world use cases
Results don't reflect real world usage at all

zinc ore
#

Yeah I find livebench highly suspect

#

They keep completely changing their question sets then saying they are better. Also lol @ 2.5 not even being there.

torn mantle
#

yea we should orient more to real world scenarios

brittle tiger
keen beacon
#

WTF

brittle tiger
#

Ultra coming at I/O

small haven
#

wut

torn mantle
#

please lets not make sunstrike = gemini 2.5 ultra

brittle tiger
#

Really doubt. I don't think they tried ultra internally until very recently

small haven
#

i dont like saying this, but o1 pro > o3 still

keen fulcrum
#

Qwen mcp coming

brittle tiger
#

Ultra isn't too surprising. They wouldn't be tweeting those cryptic brick wall emojis to troll

keen beacon
#

i think io this year is gonna be one of the best they've ever done

#

gemini 2.5 ultra, imagen 4, upgrades to ai in google search, lots of updates to gemini integration in google products

#

possibly veo 3 or a preview of it as well

ember rapids
#

Someone under nda said google has 2 novel things to show off that r pretty insane

#

I don’t doubt it they’re cooking

zinc ore
ocean vortex
brittle tiger
#

That's cool info tho

ocean vortex
#

OpenAI made ReAct agents with o3/o4, that part I think is more impressive

small haven
sage raptor
keen beacon
#

2.5 ultra = agi

#

trust

sage raptor
#

what are the chances for 2.5 ultra this week ?

bright kayak
#

i don't think there's going to be an ultra

small haven
bright kayak
#

if they're already making them free on ai studio w/o api then they probably won't spend a ton on another much more expensive model

ocean vortex
#

more capacity and some things it can still do the same or even better while generating less

ember rapids
ocean vortex
# sage raptor what are the chances for 2.5 ultra this week ?

I doubt it would make much sense, but we are yet to see a truly enormous reasoning model so who knows... Google is certainly in position to do so with TPUs. They could just charge like Sonnet API prices for a model that is much much much bigger. Instead of it being free

brittle tiger
ember rapids
#

Definitely for I/O

ocean vortex
brittle tiger
#

My only question is do they announce or release at I/O. Mainly a question of getting it ready. I think just announcement would be deflating so betting on release

ocean vortex
#

or like a nothing burger update to context size / pricing / availability and whatnot

small haven
#

cursor just soft launched cloud agents

ocean vortex
#

their biggest updates were before and after

small haven
ocean vortex
#

context size 1M->2M I think it was

#

so maybe they will release 2.5 pro for the 3rd time at I/O. Was free but now 50% cheaper for those who insist on paying 😂

balmy mist
#

so ultra before r2?

ocean vortex
torn mantle
#

I/O event is 22 May if im not wrong

brittle tiger
#

476 hrs til ultra

keen beacon
#

is dayhush

#

still

#

in

#

webdev arena

zinc ore
#

Do y'all think one of the models we've been seeing in arena is ultra? Or that it's not been tested on there yet?

brittle tiger
raven void
#

nightwhisper won't be as intelligent as o3, bearish on google

brittle tiger
#

nw was in arena way before the gdm hypeposting began so I don't think so. and it wouldn't make sense to put ultra in arena for a couple days then release it a month and a half later

keen beacon
#

So what could

#

Nightwhisper be

sonic tendon
sonic tendon
kind cloud
#

Google test models seem to be gone.
Surely they will start a new testing season.

sonic tendon
#

meta system prompt leak

oblique flint
#

Ultra is just a subscription huh

keen beacon
#

any info on qwen 3? Just learned about it

#

is it gonna be added to lmarena

fleet lintel
small haven
torn mantle
#

as i thought

cedar tide
keen beacon
keen beacon
#

shet

hardy pecan
#

grats on the 5 dollars

raven void
#

Google is cooked

sonic tendon
#

congrats

raven void
#

worse than 4.1 mini and claude 3.5 rip 🙏🏻

sonic tendon
#

np :p

keen beacon
#

I underestimated

#

Yeah

#

I underestimated o3

#

it’s so good

#

Now

#

I can’t use gemini

#

Cuz its so asss

#

it types weird

#

idk how to explain it

#

o3 follows my instruction better

#

Yeah

#

What’s the difference between o4 mini and o3

raven void
#

O3 architect and debugger, Gemini coder

keen beacon
#

yea makes sense I saw a difference but I couldn’t tell what it exactly was

keen beacon
zinc ore
#

Don't rely on a single benchmark site to assess models

#

Livebench is pretty limited in what it tests anyway

#

Very narrow set of questions they ask, it absolutely doesn't represent the breadth of programming tasks

#

And they keep changing the questions entirely every couple weeks and changing the scores

elder rapids
#

Gemini ultra?

elder rapids
elder rapids
#

or both

elder rapids
#

while it's arguably the most versatile coder there

keen beacon
#

ai studio

elder rapids
#

then cap asf

#

o3 can't follow instructions for nothing

#

ye

#

it has the nerfed models

#

wait you didn't know that?

#

this whole time?

#

yo no wonder you're getting a different experience

#

because the models are worse, but ye ofc integration is important

#

you just won't have access to the actual model everyone has been talking about and praising

#

system instructions/guardrails

#

and prob less thinking time

#

on your data?

#

the app does too

#

unless you have advanced I'm p sure

#

but not even that's absolutely confirmed

#

afaik

torn mantle
#

nah

#

Best overall is gemini 2.5 pro

#

And if you disagree then you are biased

#

Or you work for anthropic

#

How much did they pay you?

#

I dont need that

golden ocean
#

for coding at least

wind torrent
small haven
#

wait a min, use o3 to fix a test, dont pass, use o1 pro, dont pass, use o4 mini high, passes... ok

small haven
#

mail for the gemini boys:

leaden palm
#

while gemini is literally free on vertex/ai studio

small haven
leaden palm
#

lmao

#

and then it asks you

small haven
#

im not sure but i think every tool call is $0.3 if using o3

golden ocean
#

the more i use gemini the more stupid things i notice it doing

bright kayak
elder rapids
#

this is a bad prompt altogether

golden ocean
#

reference

bright kayak
#

the second response was even longer

elder rapids
#

oh wow 2.5 pro does very well on this

#

even though the prompt is bad

#

🤷

#

send yours and I'll send mine

kind cloud
golden ocean
elder rapids
#

oh nvm I misunderstood

#

but that's not how it works

#

gpt zero is outdated

#

its weighted too much on punctuation

#

that's o3s attempt?

#

holy that's way worse

#

wtf

#

that's cap lmao

#

look at mine

#

base

#

1

golden ocean
elder rapids
#

not sure tbh, its structure is too academic

elder rapids
#

cinematic asf

#

that's necessarily true, but thats not what I'm saying

raven void
#

I feel like aistudio adds delay to responses at large context length

balmy mist
#

Also any news?

torn mantle
balmy mist
#

send screen shot

#

wait nvm

#

idk why i fall for your nonsense lol

small haven
#

yall can stay on gemini, while i fast track with o3

small haven
#

who wanna pay $200 for unlimited gemini 2.5 pro ?

small haven
#

craig ur gonna stash up $200 for gemini ultra, arent u

worthy thunder
#

I ran OpenAI-MRCR against Qwen3 (working on 8B and 14B). The smaller models (<8B) will NOT be included because their max context lengths are less than 128k. Took a while to run due to rate limits initially. (https://x.com/DillonUzar/status/1917754730857504966)

I used the default settings for each model (fyi - thinking mode is enabled by default).

AUC @ 128k Score:

  • Llama 4 Maverick: 52.7%
  • GPT-4.1 Nano: 42.6%
  • Qwen3-30B-A3B: 39.1%
  • Llama 4 Scout: 38.1%
  • Qwen3-32B: 36.5%
  • Qwen3-235B-A22B: 29.6%
  • Qwen-Turbo: 24.5%

See more: https://contextarena.ai/
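
For readers wondering how an "AUC @ 128k" style number can be produced, here is a rough sketch. It assumes the score is the trapezoidal area under the accuracy-vs-context-length curve, divided by the length range so the result stays between 0 and 1. That is an assumption for illustration, not necessarily the exact formula the site uses, and the sample points in the usage note are made up.

```python
from typing import List, Tuple


def normalized_auc(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under (context_length, accuracy) points, divided by the x-range.

    Accuracy values are expected in [0, 1]; points may be given in any order.
    """
    pts = sorted(points)
    if len(pts) < 2:
        raise ValueError("need at least two points")
    span = pts[-1][0] - pts[0][0]
    if span == 0:
        raise ValueError("context lengths must not all be equal")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Area of the trapezoid between two neighbouring bins.
        area += (x1 - x0) * (y0 + y1) / 2
    return area / span
```

With hypothetical bins such as `[(8_000, 0.9), (64_000, 0.5), (128_000, 0.1)]` this gives about 0.487. A model that holds up at short lengths but collapses near its limit can still land on a middling score.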

Qwen3-235B-A22B consistently performed better at lower context lengths, but rapidly decreased closer to its limit, which was different compared to Qwen3-30B-A3B. Will eventually dive deeper into why and examine the results closer.

Note: There's been some subtle updates to the website over the last few days, will cover that later. I have a couple of big changes pending.

Enjoy.

#

No problem. Lmk if there are any other models you want me to try besides o1-pro and other higher-priced models. Limited budget atm :/

small haven
#

omg some common sense !

worthy thunder
#

I have it on my Todo to look much closer at those results (and a handful of others). I imagine it's a quirk.

#

I've burned past both respectively 😛

small haven
#

wen o1 pro my guy

brisk turret
#

Anyone seen that paper about how lmarena is junk?

small haven
#

in terms of mrcr?

brisk turret
#
small haven
#

huh

worthy thunder
small haven
#

thats why we gotta test it out

brisk turret
#

It's a junk leaderboard, lmarena.ai would rather gain favour from Google and meta than have a fair leaderboard

#

Corruption is the term for that kind of behaviour

worthy thunder
#

GPT-4.1's family also doesn't drop like that. I imagine it's cause the o-series wasn't focused on long context. Should be interesting to see what it might be like when OpenAI merges the GPT and o series.

alpine coral
#

also qwen-turbo's recall seems literally unusable for anything close to 1m tokens

#

love your work btw!

worthy thunder
worthy thunder
# alpine coral love your work btw!

Thanks! Really working hard to make this a nice site. More to come soon. 🙂
And as always, open to suggestions about nearly anything to improve it.

alpine coral
worthy thunder
#

Note: I still grade a model that stops a response due to hitting its context limit, which generally only happens for the last bin it has results for (usually only a small handful of tests, which may impact the score by 1-5%). More noticeable for reasoning models (due to their reasoning tokens adding more to the output, pushing it past its limits). An inherited issue from OpenAI-MRCR, which I have some ideas to improve upon.

worthy thunder
alpine coral
#

aha yeah that would be the way to go for sure:)

alpine coral
#

ah actually nvm.. ig most of the context window is actually being occupied by the input, come to think of it

ocean panther
#

Hi, I'm new to LMArena. I tried lmarena.ai, entered a prompt, got responses from model A and B. Now I'd like to vote which one I like better - except for the life of me, I can't find a button to that end.

How does one vote on this site?

worthy thunder
worthy thunder
ocean panther
alpine coral
ocean panther
#

I'm using Firefox on Linux

alpine coral
# ocean panther

that's odd.. perhaps a bug 🤷‍♂️ maybe try just typing "thanks" and hitting send - perhaps the buttons will appear then

#

but yeah otherwise maybe just refresh (or try a different browser), which would mean losing the interaction.. (hopefully wasn't a really good response from one that you're eager to know what model it is)

ocean panther
#

I would have given it to model B, but now its response to my "Thanks" annoyed me. So model A it is now.

#

Oh, now it tells me the models' real names, too. After voting.

alpine coral
ocean panther
alpine coral
#

your assumption is spot on

#

(I believe claybrook is a model from google, just fwiw, but yes it's unreleased and being anonymously tested in the arena under that pseudonym)

small haven
#

life must be good when o3 pro comes out

#

real growth gdp will rise by 1% by 3rd quarter

knotty scroll
#

When does the leaderboard usually update

brisk turret
#

When it would please a megacorp

small haven
#

oh my goodness gracious

keen beacon
#

great work btw

worthy thunder
#

(primarily NovitaAI)

raven void
#

OpenAI launched 4.1 to tell Google they're cooked

olive mesa
#

I just found out about this, rlly cool

#

It can restart my device... it also knows all the apps i have downloaded

raven void
#

it's quite slow for an assistant tbh

patent bane
#

what app is this?

raven void
#

Cursor (design concept)

raven void
keen beacon
#

that's bs

olive mesa
#

fr.

olive mesa
cedar tide
#

🚀 Amazon Nova Premier, our most capable teacher model for creating custom distilled models, is now available on Amazon Bedrock!

Built for complex tasks like Retrieval-Augmented Generation (RAG), function calling, and agentic coding, its one-million-token context window enables analysis of large datasets while being the most cost-effective proprietary model in its intelligence tier.

Learn more: amzn.to/4jwmV2l

cedar tide
#

even Maverick and Gemini Flash are better 👀

alpine coral
#

(applied on the right..)

keen beacon
alpine coral
#

yeah what wowee lol

cedar tide
keen beacon
#

tbh qwen 3 4b destroys everyone at that size range

#

its crazy

#

i thought phi 4 mini reasoning was crazy but qwen 3 4b is next level good at the size

#

it makes all the other research attempts at around that size a joke

alpine coral
keen beacon
alpine coral
#

lmfao

keen beacon
#

its also curious qwen didnt release qwen 3 32b base/235b base. the scores are very impressive for it. probably dont want their competitors having it for now

#

im also assuming this is a ss from their wip technical report. (it was in the qwen 3 blog post)

#

those numbers are old the released phi 4 mini reasoning performs even better

#

Anyway qwen 3 4b destroys everyone in the 4b size range

cedar tide
keen beacon
#

meanwhile qwen 3 4b has like 20 points over that in aime lol

cedar tide
keen beacon
#

no but i will make one in a bit

cedar tide
#

Thx

glad jackal
#

when will qwen3 appear in lmarena?

alpine coral
torn mantle
#

Where can i try it?

keen beacon
#

yeah the phi 4 reasoning plus 14b is seemingly extremely impressive

#

phi 4 mini reasoning not so much

cedar tide
glad jackal
torn mantle
alpine coral
cedar tide
keen beacon
#

the pretrained models seem extremely strong

alpine coral
#

yeah that too

#

but also in terms of the base model

#

just looking at those evals

keen beacon
#

i think this is why they didnt release the frontier ones, the 32b base and 235b moe

#

the base model versions

alpine coral
#

yeah that would make sense

cedar tide
alpine coral
#

i mean unreleased basically = proprietary (unless they change their mind) - can hardly blame them

cedar tide
keen beacon
cedar tide
glad jackal
keen beacon
#

the 30b moe is extremely interesting though

glad jackal
#

i am kinda new i always thought that the architecture plays a key role

keen beacon
glad jackal
#

i mean parameters are important too

alpine coral
glad jackal
#

yellow

flint sand
#

probably 2.0

keen beacon
#

might be 2 flash thinking

#

only gemini 2 thinking model

alpine coral
#

i thought it was only since 2.5

#

was 2 flash thinking only ever experimental?

keen beacon
#

yea

glad jackal
#

@keen beacon i have doubt about the sqrt one?

#

so if we take 30B-A3B we get 9.5B as the dense equivalent right?

#

does it mean that qwen3 14B is equivalent to qwen3 30B A3B?
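
For reference, the "sqrt one" here appears to be the rule of thumb that an MoE model's dense-equivalent size is the geometric mean of its total and active parameter counts. A minimal sketch under that assumption (the function name is made up for illustration, and this is a heuristic, not a law):

```python
import math

def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Rule-of-thumb dense-equivalent size, in billions of parameters.

    Assumes the heuristic is sqrt(total params * active params).
    """
    return math.sqrt(total_b * active_b)

# Qwen3-30B-A3B: 30B total, 3B active
print(round(dense_equivalent_b(30, 3), 1))    # 9.5
# Qwen3-235B-A22B: 235B total, 22B active
print(round(dense_equivalent_b(235, 22), 1))  # 71.9
```

By this estimate the 30B-A3B model lands below the 14B dense model, so the rule alone would predict the 14B to be the stronger of the two.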

keen beacon
#

qwen 30b moe seemingly breaks that rule though

glad jackal
#

how?

#

wrt to benchmarks?

keen beacon
#

also @cedar tide

#

open evals from hf

cedar tide
cedar tide
keen beacon
glad jackal
#

seems like it is true the moe is performing better than the dense equi

keen beacon
#

thats why i said it breaks the rule

#

the simpleqa score is telling

#

despite lower active params, but higher total params, it stores more world knowledge compared to the 14b

glad jackal
#

ok

#

how does it perform wrt gemini 2.5?

#

moe model vs gemini 2.5

keen beacon
#

gemini 2.5 is moe too

keen beacon
#

i would say its primarily the post training though that is the most important part

keen beacon
glad jackal
#

@keen beacon nowadays i am kinda performing more vibe coding than before how do i reduce it
i use it cause it gives me more optimal sol

keen beacon
#

thats primarily a you problem, idk how to fix that, it requires a personal solution. you could force the conditions though (if u really want to stop vibe coding rather than be productive), learn an obscure language (that isn't supported well in llms) and start coding in that

#

personally day-to-day i use a language that isnt supported that well by llms yet, so i can't really vibe code

glad jackal
#

but i try to do it in a normal way it takes more than 2 to 3hrs as i need to read the docs and then implement it

keen beacon
#

if you want to do your work faster, then keep vibe coding or smthing (assuming your resulting code is of satisfactory quality)

glad jackal
#

i want to reduce vibe and also be productive

keen beacon
alpine coral
# cedar tide

should add a column for qwen3-4b (there's data here) if you can be bothered aha

keen beacon
#

i wish alibaba had a qwen 3 api lol. im noticing bad degradation on some of the providers

#

it does extremely well on the tasks i need data for

unborn ocean
#

For the big MoE

alpine coral
#

v impressive for such a tiny model

#

imagine if scaling literally worked in terms of parameter size.. like extrapolating this 4b param model's scores out to something the size of GPT-4.5 or whatever

#

well.. i think most those benchmarks are bounded by 100%.. so ig you just hit that quickly ha

keen beacon
#

early on w deepinfra i noticed outputs were garbled after a while/massively degraded on long responses

#

(it doesnt happen on chat.qwen.ai or even chutes, but chutes is weird)

keen beacon
keen beacon
#

they used o3 mini traces hmm

#

ig u can see phi 4 reasoning traces to see how o3 mini reasons in raw form

#

im surprised oai allowed this lol

alpine coral
#

microsoft have invested like $30bn into oai

keen beacon
alpine coral
#

microsoft is practically (if not literally) majority shareholder of oai

keen beacon
#

competitors can now see o3 mini traces, style, etc., when openai tried to hide any of the new stuff hard

alpine coral
#

so they're not competitors (microsoft/phi and oai)

calm sequoia
#

Just had an idea. The ELO score is determined by the battles a model fought in the arena. There always exists a SET of models to choose from as an opponent in any given battle. This set consists of two sub-sets: a better model subset and a worse model subset. The ratio between those determines your ELO. If you provide a new model for testing, together with a number of objectively worse models or variants (e.g., mini, nano, flash, etc.), the subset of worse models increases. The ratio of worse-to-better models increases in your favor from the start. Therefore, you get an ELO boost by giving the arena a model series instead of a single model.

#

Any ideas against this?
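
One way to sanity-check this is the textbook Elo update, where the rating gained from a win shrinks as the opponent gets weaker. A minimal sketch, assuming classic Elo with K=32 (the arena's actual rating method may differ, and the function names are illustrative only):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """New rating for A after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - expected_score(r_a, r_b))

# Gain from beating an equally rated opponent
print(update(1500, 1500, 1.0) - 1500)  # 16.0
# Gain from beating an opponent rated 400 points lower (much smaller)
print(update(1500, 1100, 1.0) - 1500)
# Loss from losing to that same weaker opponent (much larger)
print(update(1500, 1100, 0.0) - 1500)
```

Under this model, extra wins against weaker siblings add little and upsets cost a lot, so padding the pool with worse variants would not obviously raise the flagship's rating.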

alpine coral
keen beacon
alpine coral
#

yeah gotcha- misunderstood your initial point

#

perhaps oai is kinda resigned to the fact that those traces will always be sought by their competitors - and to some extent, extracted

#

like they're outliers by even hiding it

#

but ig they think (or know) they have some kinda special sauce goin on there

#

so yeah i do see your point

keen beacon
#

its kinda crazy how far they've gone in trying to hide it

#

then theyre doing the identity thing for newer models on top of that

alpine coral
#

fwiw i'd personally be shocked if none of the major labs had tried to reverse engineer / extract as much as they possibly can from each others' models ha

#

and yeah just the style.. "hmm let me reconsider.." etc

#

it's all kinda familar (seems drawn from a similar source)

#

i'd be surprised if deepseek/qwen (or google's thinking models for that matter) R1/qwq were totally in-house creations

keen beacon
# alpine coral i'd be surprised if deepseek/qwen (or google's thinking models for that matter) ...
  1. google did not use any openai/deepseek/qwen traces, im pretty sure. their cold start is distinct compared to anyone else, u can see in the style.
  2. im pretty sure qwq 32b preview didn't use o1 preview traces, at least in the final preview model. its one of the most unique models ive encountered tbh, a lot of "qwqisms" and "second language behavior" (based on my extensive usage of it. plus code switching)
  3. although they share a resemblance to o1 preview traces, im pretty sure deepseek made their own coldstart as well.
  4. qwen modeled their traces after deepseek after r1, though, im still pretty sure its still partially made by them/even if it wasn't completely by them (might've used r1, im not sure)
#

speculation obviously fwiw

alpine coral
#

yeah i mean i'm just speculating - no proof / all anecdotal (and don't really use qwen or deepseek models ever aside from playing around)

#

ha oh

#

i feel you're more informed on it. it's just what i've always thought - but haven't looked into it

#

google for sure has a totally distinct feeling

#

in terms of its thinking models

keen beacon
#

tbh additionally making ur own cold start isnt difficult, i dont really see an incentive to not make your own if you're a frontier lab. bootstrapping off of someone else isn't a sustainable strategy

#

i found the complaints about deepseek using leaked traces (i believe) really disingenuous

#

and based on superficial patterns after working with a ton of reasoning models, processing traces/etc

alpine coral
#

i'm not familiar with them

alpine coral
#

having visibility of how the first company did it would be useful.

#

but not essential

keen beacon
alpine coral
#

yeah i'm not suggesting they tried to copy it or anything

#

just yeah, where they could, take inspiration from (at the very very least)

keen beacon
unborn ocean
unborn ocean
keen beacon
#

some lab folks i believe

unborn ocean
alpine coral
keen beacon
#

off of superficial stuff

#

thats why i thought u were talking about that

alpine coral
#

oh nope aha

#

i just assume most progress is iterative

#

it's rare for something brand new to ever come along

#

i guess the argument is over the extent

#

rather than the fact of iteration

#

anyway.. separately ha it's been interesting to have a proper look through that paper which criticises the Arena for allowing a handful of providers to flood it with anonymous models (and then choose which to use as the public release and, by extension, get added to the LB, while the rest are discarded, resulting in a sampling bias).
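
The selection effect described there can be illustrated with a toy simulation: several private variants of identical true strength get noisy measured scores, and only the best one is published. A minimal sketch with made-up numbers (true skill 1300, noise SD 20; neither comes from the paper):

```python
import random
import statistics

def measured_score(rng: random.Random, true_skill: float = 1300.0,
                   noise_sd: float = 20.0) -> float:
    """One noisy arena measurement of a model with fixed true skill."""
    return rng.gauss(true_skill, noise_sd)

def mean_published_score(n_variants: int, trials: int = 2000, seed: int = 0) -> float:
    """Average published score when only the best of n_variants is kept."""
    rng = random.Random(seed)
    best = [max(measured_score(rng) for _ in range(n_variants))
            for _ in range(trials)]
    return statistics.mean(best)

single = mean_published_score(1)
best_of_10 = mean_published_score(10)
print(f"single submission: {single:.1f}")
print(f"best of 10:        {best_of_10:.1f}")
```

Every variant here is equally good, yet the best-of-10 average comes out higher than the single-submission average purely from picking the luckiest measurement.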

keen beacon
#

Is there a way to find old leaderboards? For example the leaderboard on the 14th of April

alpine coral
#

the appendix includes a table with all the anon models they identified

#

there is also a section 'unknown' (and, curiously, they say qwen had a private model in the arena, albeit not under a pseudonym, so kinda different but still would be a first for a chinese lab afaik)

keen beacon
#

ya i dont really count that tbh

#

qwen plus exp

hollow ocean
keen beacon
#

ya cool huh?

hollow ocean
alpine coral
brittle tiger
alpine coral
#

i honestly think lmarena's response is laughable - they don't address the actual points made in the paper

#

they make some totally irrelevant analogy about steph curry

#

weak af

#

it might be fairly said that cohere has an axe to grind with the arena

#

but i don't think that discredits the paper.. it's imperfect but rigorous from what i can tell. lmarena has only provided rhetoric in response...nothing to actually refute the paper's best-of-N strategy.. their data covering proprietary vs open models covers the entirety of the Arena's existence, rather than the few months the authors of the paper were scraping / analysing

brittle tiger
#

I still think the central claim of their conclusion unfairly uses meta shadiness to make a much broader, ridiculous point: "The widespread and apparent willful participation in the gamification of arena scores from a handful of top-tier industry labs is undoubtedly a new low for the AI research field. As scientists, we must do better. As a community, we must demand better."

#

"undoubtedly a new low for the AI research field" gimme a break that's laughable. and openai and google are not willfully gaming scores

unborn ocean
#

I think much of the paper feels like you are asking a very verbose reasoning model to criticize your project and it just starts coming up with many (and partly obvious) weaknesses, most of which you will straight up ignore.

Otherwise the more soft and obvious critics, like most of the data going to big tech (inc. openai) are very valid, but also why the lmarena is able to exist at this scale in the first place (imo). (Because it is not just a benchmark, but also a high quality and agile data source for the companies)

#

But I am glad that they chose this approach of writing a paper over the article approach most other labs take to address issues. This is something we should have more of.

keen beacon
#

Any news on augment code?

cedar tide
cedar tide
#

Thx

alpine coral
#

anyway.. was just posting those screenshots because it's interesting to see all the anon models catalogued like that

#

they are a fun part of the arena - no doubt...

#

in fact.. it wouldn't be nearly as interesting without them

#

but yeah it's kinda gotten out of hand.. and i'm sympathetic to this 'best of n' argument and that being distortive and in some ways arguably unfair (even though there's technically nothing stop any lab, however small, from spamming the arena; they would just need to ask as often as google, meta etc do.. ig)

keen beacon
torn mantle
#

and they were all bad

alpine coral
#

ha yeah that's kinda what i mean by out of hand

alpine coral
#

aha oh drama from April 23

#

that's the real early days of this AI boom ha

keen beacon
#

june 4 2023 xd (i dont think its relevant at all today, and presumably it was resolved quickly since i think nothing happened after that)

alpine coral
#

lol geez i struggled there

#

i won't edit again - let it stand ha

#

but yeah i mean i hope lmarena doesn't become a victim of its own success (it needs integrity to be meaningful.. but between betting and playing footsies with literally some of the biggest companies in the world.. there may come a breaking point)

#

but yeah they are the definition of having a first-mover advantage

#

and all the scale (votes) that brings with it

#

be very hard for another crowdsource project to displace it (more likely the arena falls one day i'd think)

keen beacon
alpine coral
#

yup - competing first movers

#

but now yeah they're the like the google of ai chatbot crowdsource voting ha

keen beacon
#

So i have a choice

#

My parents are making me decide now

#

Change my major to artificial intelligence and pursue a technology career

#

or stick to marketing

#

with lovable and cursor becoming the top performers i might have to 💔 gpt wrapper method

#

svelte

#

easy choice

#

or LuaU because LuaU is super simplified

#

so JS

#

yea thats the stuff i like the most web development

#

app creation is a whole other field

#

yes

#

working on a project now

#

2.5 pro for debugging yes

#

And Workflow/MVP planning

#

Azure backend

#

mvp = minimum viable product for early customers

#

like skeleton build to test theory/logistics

#

I get github pro free for two years so i have access to 4.1, gemini, claude 3.7 for free

#

For my resume

#

I'm a marketing student rn

#

Making a trend finder + tracker for cross platform data analysis

#

would look good on my resume and portfolio cuz im also doing business administration so i need to know how to lead a company, etc

#

Great I dont like AI studio that much

#

If you sign up on the Gemini website

#

there will be a free 1 month trial

#

Nah bad ui gemini official website better

#

When they end just make a new account

#

They have no KYC or nothing

#

u cant branch on the gemini product i think amongst other things. the gemini product seems bad compared to aistudio, which still has its own issues

#

they can but they dont

#

i've done it for awhile

keen beacon
# keen beacon wym by branch?

on chatgpt u can edit ur message and immediately start a new branch without losing the previous branch/conversation

#

that explains it

#

whenever i get close to running out of tokens

#

i just tell it to copy itself in a prompt

#

then use gemini's gem manager so it remembers the base project

#

then that prompt acts like its loading a save

#

restores token context count + allows you to continue where you left off

#

just my little work around

#

i guess it really is a preference thing at the end of the day ai studio is more for people who understand ai, etc

I just need bare bones maybe add a few prints here and there or give me some other solutions

balmy mist
#

is qwen 3 good at tool calling like o3?

#

like does it have tool calling built in the reasoning as well?

keen beacon
#

where u are getting access to it btw

#

qwen3 api

balmy mist
#

i havent tried it yet, just wanna do a project with it

#

wait can we even use the tool calling magic in api?

#

for o3 or qwen?

keen beacon
#

i think yes for both

balmy mist
#

i wish we could put our own tools in the reasoning

keen beacon
#

you can

#

i believe

balmy mist
#

teach me please 🙂

#

only for o3?

#

or both

#

o3 too expensive

keen beacon
#

both

balmy mist
#

whatt

keen beacon
keen beacon
balmy mist
#

so i can add lets say 50 more tools to o3 with function calling?

keen beacon
balmy mist
#

why not lol?

keen beacon
#

they said qwen is ass

#

it will take a lot of tokens to review and decide which tool to use 🤣
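
To get a rough feel for that overhead, here's a sketch that builds 50 placeholder tool definitions in the JSON-schema style used by OpenAI-compatible chat APIs and measures their serialized size. The tool names, descriptions, and the single `query` parameter are all made up, and character count is only a crude proxy for tokens:

```python
import json

def make_tool(i: int) -> dict:
    """Build one hypothetical function-calling tool definition."""
    return {
        "type": "function",
        "function": {
            "name": f"tool_{i}",
            "description": f"Hypothetical tool number {i}.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Input for the tool."}
                },
                "required": ["query"],
            },
        },
    }

# 50 tools, as in the question above
tools = [make_tool(i) for i in range(50)]

# The tool list is passed along with the request, so its size is paid
# again on each call, before the model has produced anything.
payload_chars = len(json.dumps(tools))
print(len(tools), "tools,", payload_chars, "characters of schema per request")
```

Real tools usually carry longer descriptions and more parameters than these stubs, so the actual overhead would likely be higher.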

keen beacon
balmy mist
#

i think this is what openAi is noticing lowkey

#

we already have solid models and benchmarks are getting cooked

#

so why not just focus on building more tools around our models, which i guess is what is happening lol

balmy mist
#

yeah

#

def gonna be a mix of both tho

keen beacon
#

the mcp thing is gonna matter significantly more going forward

balmy mist
#

i want to see if i can do it a lil on a small scale with mcps

keen beacon
#

it indirectly impacts a model that can generate 3d animations

balmy mist
#

but have them use the mcps in the thinking

balmy mist
#

the only thing is having to update the models thinking base of new mcps and tools

keen beacon
balmy mist
#

but what tool calling does qwen have?

#

ppl saying the 30b model is really good

keen beacon