#general


torn mantle
#

Yea

balmy mist
#

i just got back, did they say what time for qwen?

full kite
#

Qwen is as dosht as deepseek

#

Slow and unuseable

#

Unusable

#

Omfg

raven void
#

Sota open source

tawdry meteor
# raven void Sota open source

SOTA premium vs SOTA open source is like adobe products to me. Yeah photoshop is always gonna be the absolute best and have newest most incredible features and worth it if you're a pro photog but I just tell my friends to use gimp or other freeware

#

like unless using the AI is contributing to your commercial productivity /earnings then why not use the free stuff. but also it's going to contribute to almost everyone's productivity so 🤷 I hope open source becomes fantastic tho

#

agreed that's basically exactly what I meant lol

#

like if you're just editing family photos, use freeware or whatever. but if you're using it for true productivity you need premium tools

keen beacon
#

qwen 3 235b benchmark results apparently leaked, managed to copy the screenshot before the tweet got deleted

tawdry meteor
#

and same then for coding with SOTA vs open source

keen beacon
#

apparently its gone because they messed up some numbers

#

so idk which are right

#

humaneval looks low asf compared to the rest so it might be that

#

o3 got 83.3

sonic tendon
#

oh wow

tawdry meteor
#

yo wtf?

#

g2.5pro is 84 on GPQA

sonic tendon
tawdry meteor
#

I wonder which numbers were wrong

#

for reference

keen beacon
#

i presume they just put the wrong number, not they ran the benchmark wrong

#

who knows, they probably reused an old table and didn't replace it properly

tawdry meteor
#

this dude has never made a typo

balmy mist
#

Wait, where is o3 pro coming out, bro?

tawdry meteor
#

there's a permanent commemorative plaque made out of marble and with a steel engraving on it in front of a historic building in my city, and there's a misspelled word on it and punctuation missing in another spot. The ability for people to make clerical errors and no one else catching it baffles the mind

balmy mist
#

As in that’s bad?

tawdry meteor
keen beacon
#

like insane

balmy mist
#

Lmaoo

#

Ayee

#

So gg closed source?

#

Wait is the model actually out?

keen beacon
#

not yet i think

balmy mist
#

Bruhh

#

They blue balling us again

tawdry meteor
#

it hasn't been anonymous on arena right?

keen beacon
tawdry meteor
sonic tendon
#

i would guess that the biggest model is going to be closed-weight for a while

balmy mist
#

O3 pro has to come out this week right ?

#

This the 3rd week

sonic tendon
#

like they did w/ 2.5 max

keen beacon
sonic tendon
balmy mist
#

Cause next week would be a month vs a couple of weeks

#

Craig run our queries when u get o3 pro please

keen beacon
balmy mist
#

Really? I thought you were OpenAI ride or die?

keen beacon
#

i need qwen 235b asap 🤣

balmy mist
#

What u gonna do with it?

keen beacon
#

generate data

balmy mist
#

lol yupp

#

Gotta justify that cost

keen beacon
#

you could probably run qwen 235b pretty well on macs i guess becuz of moe

balmy mist
#

Hmmm really what kind of specs u need?

keen beacon
#

u need enough vram to load though

balmy mist
#

I got 18 gb

#

But it’s m3

keen beacon
#

if u can fit its a sota local model fr
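A rough sketch of the memory math behind the "enough vram to load" point. Assumptions, none of them from the chat: weights only (no KV cache or runtime overhead), the 16/8/4-bit levels are illustrative quantization choices, the 22B active-parameter figure is read off the 235B-A22B name in the Qwen3 announcement, and `weight_memory_gb` / `fits_in_memory` are hypothetical helper names.

```python
# Back-of-the-envelope weight memory for a MoE model.
# Weights only: ignores KV cache, activations, and runtime overhead.


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(params_billion: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weight_memory_gb(params_billion, bits_per_weight) <= memory_gb


TOTAL_PARAMS_B = 235  # every expert has to be loaded
ACTIVE_PARAMS_B = 22  # only these are used per token (assumed from the A22B suffix)

if __name__ == "__main__":
    for bits in (16, 8, 4):
        total = weight_memory_gb(TOTAL_PARAMS_B, bits)
        active = weight_memory_gb(ACTIVE_PARAMS_B, bits)
        print(f"{bits}-bit: load ~{total:.1f} GB, touch ~{active:.1f} GB per token")
    print("fits in 18 GB at 4-bit:", fits_in_memory(TOTAL_PARAMS_B, 4, 18))
```

The gap between the two columns is the MoE argument: the whole model still has to sit in memory, but each token only reads the active slice. That is why a large unified-memory machine looks plausible for it, while 18 GB is far short even at 4-bit.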

keen beacon
balmy mist
#

Bruhh

#

I’m logging off until r2

tawdry meteor
#

BRUH

keen beacon
#

dw itll be good still lol

#

maybe not that crazy

balmy mist
#

And Claude lol

keen beacon
#

yeah i think it'll get close to o3 but i do doubt it'll beat it, at least not broadly

#

it'll beat o1 for sure

keen beacon
#

but i think thats a while away

balmy mist
#

Which o3?

keen beacon
#

regular

#

do you mean low/med/high

balmy mist
#

Wait so qwen not even SOTA?

balmy mist
keen beacon
#

we dont know

#

i was trolling, it may or may not be

balmy mist
#

Just cheap?

keen beacon
balmy mist
keen beacon
#

that is my bet

keen beacon
sonic tendon
#

i think it'll beat o3 on the (non-stylectrl) leaderboard

keen beacon
#

and my imagination

#

😇

balmy mist
#

Gotta lock into that

sonic tendon
keen beacon
#

😇

tawdry meteor
#

I was halfway through making a table with those numbers comparing to other benchmarks lmao

keen beacon
#

LOL

tawdry meteor
#

you got us good

keen beacon
#

i would be annoyed if i were you dw 😔

balmy mist
keen beacon
#

anyway im gonna go for a bit someone spam me if they release qwen3 (for real this time)

#

byebye

tawdry meteor
# sonic tendon wait what

I was making a table to compare Li's Leo's fake benchmarks to real benchmarks and when I found out it was fake I deleted my google account

keen beacon
#

ya sm1 posted the tweet before it got deleted

#

here

keen beacon
#

clearly they still value it since they give quota out to lmsys lol

tall summit
oblique flint
#

I get it, but human preference finetuning does not have to come at the cost of actual performance, just look at 2.5 pro and o3

keen beacon
#

idk google has been the most prolific in terms of using the arena tbh

sonic tendon
#

yeah, i think it's still a valuable benchmark

thorny drum
#

i think lmarena is still the most public benchmark as silly as it is

keen beacon
#

the worst is meta

#

they sh1t their models into the arena

full kite
#

Worst is grok because it's racist

sonic tendon
#

imo grok is worse because elon fanboys back it

#

i mean, llama was (iirc) the first real well-funded open source model release

#

they were really popular at the beginning, but then everyone else ate their lunch

#

qwen and deepseek, namely

keen beacon
#

grok is a weird ass model

sonic tendon
#

uh oh

keen beacon
#

it's not incredible at anything but it doesn't suck at anything either

#

had its 10 seconds of fame and it's just meh now

sonic tendon
keen beacon
#

and their reasoning model seriously underperformed relative to how strong the base (theoretically) is

sonic tendon
#

ah, any reason in particular?

sonic tendon
#

that's valid

#

i try to set limits for myself and stick to them

keen beacon
#

its in the beta iirc

#

anyway im gonna actually go

#

goodbye

sonic tendon
#

bye!

#

oh

#

damn >:(

#

not sure

keen beacon
ocean vortex
keen beacon
#

prior to them intentionally giving early access out i think

#

ETA for Qwen 3 basically any minute now

it's 2am over there right now and they're working their asses off to upload all of the models and quants

keen beacon
#

what is bro doing

elder burrow
keen beacon
keen beacon
elder burrow
keen beacon
#

the um

elder burrow
#

the one which looks weird

#

3

keen beacon
#

shucks

elder burrow
#

1 and 2 are both veo 2

keen beacon
#

getting there..

#

ya 66 items+ 🤣

elder burrow
#

i just joined the server

#

thx for letting me know

keen beacon
keen beacon
#

sounds about right

elder burrow
#

when do you think is qwen 3 coming to their site?

keen beacon
#

soon theyre planning to launch a sub soon

keen beacon
# elder burrow guess which is qwen

well, iirc theyre using wanx2.1 or something its not a qwen video gen model although it comes from alibaba. i think its 14b. veo must be much bigger

#

on the qwen site it uses wanx2.1 for the time being

balmy mist
#

Maybe qwen 3 at 3?

#

So 18 mins?

rugged brook
#

Wym

torn mantle
#

This guy already has access

keen fulcrum
torn mantle
#

Its real

keen fulcrum
#

How would you know

torn mantle
#

He had a private invite

keen fulcrum
#

Well its releasing within 24h

keen beacon
#

i wonder if theyre gonna go to bed first

torn mantle
#

Because he knows some qwen devs and they follow him too

#

Same with deepseek devs

rugged brook
torn mantle
#

It hallucinated the google graph part and also the death of Nick Land

rugged brook
#

Its releasing now

keen fulcrum
#

Well 235B isn't remarkable
Claude 3 Opus has 200B

#

GPT 5 expected to have 1-10T

blazing rune
keen beacon
#

opus is definitely not 200b lol

keen fulcrum
#

What then?

keen beacon
#

i doubt sonnet 3.5/3.7 is 200b either. i think its 400b

blazing rune
#

Sam Altman announced this on X

keen beacon
keen fulcrum
#

I am interested about the token limit

keen beacon
#

seems qwen 30b moe is gonna be a homerun

#

ppl testing lol

#

and wishful thinking 🤣

#

woah

#

cool asf

#

theyre probably bankrolling all this lmao

small haven
#

where tf is o3 pro

keen fulcrum
keen beacon
torn mantle
#

We're excited to announce we’ve launched several improvements to ChatGPT search, and today we’re starting to roll out a better shopping experience.

Search has become one of our most popular & fastest growing features, with over 1 billion web searches just in the past week 🧵

#

its 4am in china

ember rapids
#

I know several ppl were asking for the link to this server. It's public now.

keen beacon
keen fulcrum
#

So openbrain is the best model?
Its interesting to see public models are better than ERNIE

torn mantle
#

llama con tomorrow

keen beacon
#

posted that here already

torn mantle
keen beacon
#

Qwen is the large language model and large multimodal model series of the Qwen Team, Alibaba Group. Both language models and multimodal models are pretrained on large-scale multilingual and multimodal data and post-trained on quality data for aligning to human preferences. Qwen is capable of natural language understanding, text generation, vision understanding, audio understanding, tool use, role play, playing as AI agent, etc.

The latest version, Qwen3, has the following features:

Dense and Mixture-of-Experts (MoE) models, available in 0.6B, 1.7B, 4B, 8B, 14B, 32B and 30B-A3B, 235B-A22B.

Seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose chat) within a single model, ensuring optimal performance across various scenarios.

Significant enhancement in reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.

Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.

Expertise in agent capabilities, enabling precise integration with external tools in both thinking and unthinking modes and achieving leading performance among open-source models in complex agent-based tasks.

Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

For more information, please visit our:

Blog

GitHub

Hugging Face

ModelScope

Qwen3 Collection

Join our community by joining our Discord and WeChat group. We are looking forward to seeing you there!
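For anyone comparing the two modes, a minimal sketch of separating the reasoning trace from the final answer in a raw completion. It assumes the thinking-mode output wraps its reasoning in a single `<think>...</think>` block ahead of the answer; treat that tag format as an assumption to check against the actual output. `split_thinking` is a hypothetical helper, not part of any Qwen library.

```python
import re

# Assumed format: one <think>...</think> block, then the final answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # The answer is everything outside the think block.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>\n2 + 2 is 4, double-check: yes.\n</think>\n\nThe answer is 4."
    reasoning, answer = split_thinking(raw)
    # Whitespace word count as a crude stand-in for reasoning tokens.
    print(f"reasoning words: {len(reasoning.split())}")
    print(f"answer: {answer}")
```

Counting words in the reasoning part is only a crude proxy for reasoning tokens, but it is enough to compare how much the same prompt "thinks" across runs.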

#
Qwen

QWEN CHAT GitHub Hugging Face ModelScope Kaggle DEMO DISCORD
Introduction Today, we are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to...

#

IT'S OUT

zinc ore
#
Qwen

QWEN CHAT GitHub Hugging Face ModelScope Kaggle DEMO DISCORD
Introduction Today, we are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to...

keen beacon
#

it got it right but wow that's a lot of reasoning tokens used up

stuck orchid
#

🎉

torn mantle
#

I can't try it rn, someone tell us the vibes

small haven
#

ok cnn

keen beacon
#

i did not expect it to get this close to g2.5 pro

blazing rune
#

their HF page only has the 0.6b model

#

this is annoying

torn mantle
cedar tide
#

Available now on
https://chat.qwen.ai/

keen beacon
#

the base kinda cooks v3 too

#

deepseek respond i dare you

small haven
#

can i say something without getting people riled up

cedar tide
blazing rune
blazing rune
#

nvm

small haven
#

qwen 3 is 💩 , where is o3 pro

keen beacon
#

yeah it was just hf

torn mantle
keen beacon
#

deepseek have been beaten on both r1 and v3

#

i kinda expect a response

torn mantle
#

Qwen 3 seems like qwq max to me

#

Havent noticed much improvements tbh

small haven
hardy pecan
#

it seems ok, will need to do more testing

torn mantle
#

Nah o3 is a league on its own tbh

keen beacon
torn mantle
#

Lol

keen beacon
#

it isn't

barren prairie
small haven
#

o3 marginal cost is zero

pliant cypress
#

no

golden ocean
#

no

misty vault
#

no

barren prairie
#

But why they are putting 1526263 models ... They must regulate this thing

keen beacon
#

lmao it's excruciatingly slow rn because it's being hammered on qwen chat

keen beacon
#

^

barren prairie
#

I want a single good model like gemini and deepSeek not 1627373 models

keen beacon
#

it is not o3 and is still slightly behind gemini 2.5 pro. but it is still highly impressive

hardy pecan
#

people have "tested" it for 5 mins and already coming to conclusions lmao, maybe do some more testing

keen beacon
#

i don't think you get how this works

keen fulcrum
small haven
#

even when qwen 4 is released, o3 >> qwen 4, thats how bad qwen 3 is

ocean vortex
keen fulcrum
#

Just look here

ocean vortex
#

it destroys gpt4.1

barren prairie
keen beacon
ocean vortex
#

not even the biggest model

keen beacon
small haven
#

bro is writing essays in its traces for a simple logic problem bruh

keen beacon
#

yes it is a lot less efficient with reasoning than o3 or gemini

#

if there is one thing i've noticed it's that

small haven
#

ok at least it got the answer right haha

torn mantle
#

Good so far

#

Seems better than maverick at least

keen beacon
#

low ass bar

ocean vortex
#

this what I meant :

cedar tide
# keen beacon

Who wants to dedicate themselves to adding o4 mini, grok 3 mini think and gemini 2.5 flash to the benchmark?

keen beacon
ocean vortex
#

if they added 4.1 in there it wouldn't look much better for OpenAI

keen fulcrum
#

Are you friends of Qwen?

torn mantle
hardy pecan
#

creates a 99% copy of the discord front end, in a single html file, (without the backend)

small haven
#

ok im a bit more excited about r2

ocean vortex
#

oh but they tested with thinking enabled have they not

keen fulcrum
#

Where is the pricing for qwen3?

ocean vortex
#

this somewhat explains it then

#

it being this good on math relative to others

#

this is what meta should have done. Instead they went behemoth mode lmao

small haven
#

on that note... when is o3 pro

#

ur a comedian

torn mantle
torn mantle
small haven
#

lol i just visit qwen to test it on its release day and pack it up, will be back next release !

ocean vortex
#

it is obviously not. But at the same time there's gonna be no one to test if they updated the model in any way... Could just donate money to OpenAI instead

#

10 prompts would be like what... $150?

keen beacon
#

holy moly this is the slowest streaming i've ever seen

#

it's been thinking for like 10 minutes and it's not even streamed many reasoning tokens

cedar tide
small haven
#

why is my coffee already cold tf

#

that is true, but side effects ..

#

i can tell

keen fulcrum
#

Any unofficial benchmarks up yet?

ocean vortex
cedar tide
rugged brook
#

Is the qwen

#

Good

ocean vortex
torn mantle
#

Oh boi

#

You are fked

#

Oh no

torn mantle
#

Its good

#

Nothing crazy

#

Not that great at coding

small haven
#

coffee + brazilian fonk >> adderall

torn mantle
#

Multilingual claims = still far behind competitors

ocean vortex
# torn mantle Nothing crazy

yeah I'm not getting wow'ed by their biggest model. But it needs to be viewed in the proper context. It's competing with R1 not O3 or 2.5 pro

torn mantle
#

I would place it beneath grok3 base model

torn mantle
#

lol

small haven
#

facts...

rugged brook
#

U havent tried

#

Meth

small haven
#

meth makes u dumb

cedar tide
#

Qwen 3 235b non thinking vs deepseek v3.1

AIME 2024 : 40 / 52
GPQA : 63 / 66

rugged brook
#

Is it better than

#

R1

#

Thinking mode

small haven
#

just imagine r2 ...

rugged brook
#

When is it releasing

small haven
#

this year idk

hollow ocean
#

R2 July

rugged brook
#

?

cedar tide
small haven
#

based off polymarket lol

small haven
#

i still remember gpt 5 was predicted to be released october 2024.

hollow ocean
ocean vortex
hollow ocean
#

I have a ton of money on yes

torn mantle
#

Dont forget that its smaller than r1

raven void
#

Polymarket should have separate market for open source models

#

Really

hardy pecan
#

It got 3/10 of the simplebench questions ive checked so far, using 235B with thinking

small haven
cedar tide
#

Will r2 be based on 3.1 or a new base?

ocean vortex
#

it probably is still better than r1 overall though, all that aside and just looking at the final responses

hollow ocean
ocean vortex
#

just not everywhere, 'overall'

hollow ocean
#

R2 prediction is 86%

raven void
hardy pecan
small haven
cedar tide
small haven
torn mantle
hardy pecan
small haven
#

guys gpt 5 is releasing september 2024

brittle tiger
torn mantle
#

I just dont understand that comment tbh

#

Jealous for what?

small haven
keen fulcrum
rugged brook
#

how is it

#

worse

#

its good

small haven
#

met expectations, my expectations: 💩 as usual

torn mantle
#

I think it met expectations

#

The focus seems to be fixing qwen2.5 issues and adding a hybrid reasoning feature

#

Better multilingual support + better multi-turn convo

#

Its not bad for its size

hardy pecan
#

Yeah, met expectations (I wasn't expecting 2.5 pro level)

small haven
#

still behind oai like > 6months

keen fulcrum
#

We will see end of the year
I would love Qwen to catch up more

small haven
#

pfft, not being pessimistic but, i think 6 months still

#

oai already working on o4, top 50 codeforces
o3 is at top 250

rugged brook
#

ye

hardy pecan
#

Ok qwen-235B-A22B with thinking, simple bench - 3/20 , so not the best result in the world.

hardy pecan
#

Essentially... but will wait to see, maybe I had bad variance for my pass@1
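On the variance point, a small sketch of how wide the uncertainty on a 3/20 pass@1 score is, using a Wilson score interval. This is a generic statistics helper, not how SimpleBench is officially scored; the 95% level (z = 1.96) and the name `wilson_interval` are choices made here for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    for correct, total in ((3, 10), (3, 20)):
        low, high = wilson_interval(correct, total)
        print(f"{correct}/{total}: {low:.1%} to {high:.1%}")
```

For 3/20 the interval comes out at roughly 5% to 36%, so a single pass over 20 questions can't separate a weak result from a middling one. More questions or repeated runs would narrow it.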

keen fulcrum
rugged brook
#

Bro what

barren prairie
#

Qwen is good but not that thing ..I didn't hype so much for it ...but so good as an open source model 😅

small haven
#

lay the adderall down bud

#

o3 != qwen3 lol

#

?

#

uve lost the plot

hollow ocean
#

Simple bench will never be solved

zinc ore
# small haven

How come their actual released model isn't performing equivalent to a top 200 programmer then?

#

Since it says o3 Jan is 175th rank

small haven
zinc ore
#

It's just their claims and "trust us bro"

small haven
#

i mean the "us" is oai themselves, so pretty reputable lol

zinc ore
#

I disagree. As they have a history of overhyping. They do get some nice performance from their models don't get me wrong.

patent aspen
# small haven

Hill climbing on competitive programming problems doesn't make a model a superhuman coder. Those problems are self-contained.

zinc ore
#

I basically consider Google and anthropic to be pretty reliable in their claims, but I see much less reliability coming from openAI and grok imo (inb4 I agitate some grok bros)

small haven
patent aspen
#

To attract funding

zinc ore
#

Dario specifically you mean?

patent aspen
#

Their CEO?

#

Yeah

zinc ore
#

I'm assuming you're referring to his prediction of AGI by 2026-2027

patent aspen
#

Yeah I mean they're trying to attract funding so it's w/e

zinc ore
#

Ehh, to me they aren't anywhere near as shameless as openAI with it

small haven
#

if i was an anthropic shareholder, id be shaking tbh

patent aspen
#

That's true. OAI trying to appear like they already have AGI cooking in the lab and are just waiting until it's perfect to unleash

#

Yeah Anthropic is in a tough spot

#

Coding was their main selling point

#

That won't last through the year

zinc ore
#

I don't have much of an opinion on deepseek

small haven
#

yea ok bud o1 > o3

#

agreed

patent aspen
#

Objectives and key results. It's how all silicon valley companies track goals

#

If they haven't achieved that by the end of the year, it will carry over until they do

small haven
#

lmaoooo

#

will toysrus ever develop a sota model

zinc ore
#

Dollar general SOTA model when

small haven
#

adderall still potent

#

they are like so behind

#

kinda facts

#

i have no clue what is that

patent aspen
#

Why?

zinc ore
#

Let's be real, 2 weeks max

#

Maybe 2 months at worst

patent aspen
#

They were 3 years ahead when they started

zinc ore
#

And debatably even behind

patent aspen
#

Now they're a bit behind

#

They're losing ground

small haven
zinc ore
#

Good good is being two years ahead

small haven
hollow ocean
#

I’m betting every dollar I have left that the simple bench won’t be solved this year

zinc ore
#

I'm not even disagreeing they're ahead, but they're barely ahead of 2.5 pro at a much higher price

small haven
zinc ore
#

Lemme guess you wanna compare to the o1 million dollar arc test

patent aspen
#

I'm not speculating.

small haven
zinc ore
#

You said 2.5 pro and o3 wasn't apples to apples

#

Now you say compare to o3 lol

small haven
#

that got released like 5 months ago

zinc ore
#

Dog, I asked you if we can't compare 2.5 to o3, then what should we compare it to?

small haven
#

i just said they are not apples to apples, meaning o3 is ahead

patent aspen
#

One thing I will say is that public version numbers are mostly branding and don't always consistently reflect the underlying process used to train the models

zinc ore
#

That's not what apples to apples means

#

An apples to apples comparison allows for something being far superior

small haven
#

ok mb

patent aspen
#

We think o4-mini and o3 are different post-trainings of the same underlying pre-trained model

#

Similarly 2.5 and 2.0 are part of same generation

#

It does matter because pre-training takes ages, so it has implications for the timelines of about 6 months

hollow ocean
#

If you put “it’s a trick question” the top models ace simplebench

#

Not a single question wrong

#

Yes try it

#

I just did it with o4 mini high

#

Yep

#

Try it if you don’t believe it

#

Grok 3 no thinking failed first question

#

4.5 gets it

small haven
#

Question 3

Jeff, Jo and Jim are in a 200m men's race, starting from the same position. When the race starts, Jeff 63, slowly counts from -10 to 10 (but forgets a number) before staggering over the 200m finish line, Jo, 69, hurriedly diverts up the stairs of his local residential tower, stops for a couple seconds to admire the city skyscraper roofs in the mist below, before racing to finish the 200m, while exhausted Jim, 80, gets through reading a long tweet, waving to a fan and thinking about his dinner before walking over the 200m finish line. Who likely finished last?

this question o3, o1-pro didn't get it, but o4-mini-high got it

torn mantle
#

Qwen 3 seems to hallucinate quite a lot

#

Yea and it gets many things wrong as well

#

Yea even maverick may be better than qwen 3

drifting thorn
#

Oh you guys are talking about Qwen 3 too

torn mantle
#

Sorry but its nowhere near gemini 2.5 or o3

drifting thorn
#

Not expecting to see Qwen 3 surpassing Gemini 2.5 or o3

#

But I expected it to surpass R1

torn mantle
#

So many wrong outputs

zinc ore
#

Could it be bugs? Since that happens a lot during releases

drifting thorn
#

… At least it provides researchers with new models…

leaden palm
#

maverick is larger than all qwen models

drifting thorn
#

The AI researchers can now move on to Qwen 3 from Qwen 2.5

small haven
#

just wait for behemoth

torn mantle
#

Qwen 3 needs a re-evaluation

small haven
#

claude pro sucks

torn mantle
#

Just coding

small haven
#

?

#

o3 > 3.7

#

just ask for the git diff and apply with 4.1 in cursor

torn mantle
#

Well the most nerfed models are anthropic ones

small haven
#

i agree, thats why i only use it to apply the git diffs

#

still potent i see

#

and fast

torn mantle
#

Alibaba still has a long way to go

small haven
#

i bought alibaba back in 2017, im still breakeven

torn mantle
#

Hopefully it drops

#

No wonder they weren't hyping this model

small haven
#

on what bench

#

oh right, the addy vibes

#

huh where is o3 in webdev

#

no wonder sonnet is #1

#

it would be close tho

#

like above 4.1

#

bru

#

is o3 in webdev tho

#

its cheaper than o1 and o1 was on it

#

true

still mason
north vale
#

every week or so

small haven
#

o4 mini < maverick, oh hell naww

leaden palm
#

why are you encouraging models that don't obey the system prompt

void copper
# torn mantle Qwen 3 seems to hallucinate quite a lot

which mode? If it hallucinates in non-thinking mode, I won't be that surprised since it combines thinking and non-thinking stuff at once, and very likely those hallucinations are because of its thinking training. (Just like how humans think, yeah, hallucinate a lot)

small haven
leaden palm
small haven
#

ahh i seee

#

the problem with this, is that people are still gonna vote for maverick

#

u inflating the bad models, great

alpine coral
small haven
#

wym? ur voting for o4 mini?

#

its visually not attractive on face value

small haven
#

at this point i dont care anymore

#

BUT WHEN O3 PRO SAM

solar nebula
#

so thats how they made o3

small haven
#

o4 full coming in this summer

#

OK BUD

#

take that bet to polymarket, dont need more money

#

addy is hitting

#

i still dont get how gpt5 is gonna work

#

like if i wanna use o3 pro

#

so my prompt would be "please use the biggest brain power in the world to answer" like what?

leaden palm
small haven
#

u want it to use o3 pro, but always give o4 mini

high egret
#

Hi guys ! i'm new here, is anyone online ?

#

quite impressed by qwen3 and wanted to discuss it

#

I made a bet of 20$ on qwen3 best model before april 30, probably the worst I'll ever do haha

#

but 6k if I win lmao

small haven
#

bruh

keen fulcrum
small haven
#

llol

high egret
#

The model won't be on leaderboard when the market closes lmao

#

but i'll pay for my studies ez if it wins 👌

keen fulcrum
#

Qwen 3 on lmarena?
any external benchmark?

high egret
#

not now I think

#

yes we already have benchmarks

keen fulcrum
#

Can you share

high egret
#

quite impressive

keen fulcrum
keen fulcrum
high egret
compact knoll
#

hi everyone, sorry for dumb question, i guess people ask it all the time, but right now, what's the best model for STEM? (not necessarily for advanced coding, more for solving complex STEM problems, including math)

high egret
#

i'm a pure math major

#

but o3 and o4-mini give clearer response when you need a summary like of a course

high egret
compact knoll
#

yeah okay, im wondering which one is the best between 2.5 pro and o3 right now, well i guess it depends on the demand, and that 2.5 will be better than o3 for some specific things and vice versa

keen fulcrum
high egret
#

but like when I need a full summary of a 200 pages course, I just give the full pdf to gemini, ask for it and get a perfect clear LaTeX document first try

#

not with o3

compact knoll
#

yeah i agree the context with Gemini is really impressive and super useful

high egret
#

and fu*** free

#

you're a STEM student ?

compact knoll
#

yep

high egret
#

which major ?

keen fulcrum
small haven
#

WE WANT O3 PRO

#

oh hes sober now

#

you know ur gonna get a passing test when traces look like this

hollow ocean
#

If you go on the Deepseek subreddit some still say R1 is better than 2.5 and o3

hardy pecan
#

Humans are tribal to a fault

hardy pecan
#

It was probably their first LLM they really used when it went viral, and never looked further

hollow ocean
small haven
#

Grok three point five isnt going to beat o three

torn mantle
#

I think they screwed up something on Qwen 3 training

keen beacon
torn mantle
#

the model is just a bit better than qwen 2.5 max

#

qwq*

keen beacon
#

I haven't had time to examine it yet I just woke up lol

fleet lintel
#

I am kinda same but from the opposite side. I am rooting for everyone to one up each other every week except for Meta. I want llama models to burn in hell.

torn mantle
#

my sleep schedule is fked up

#

so i tried it

#

it does a poor job at recalling

keen beacon
#

damn lol i was too tired and slept

#

i think even if the instruct versions arent that good the pretrained base models might be excellent

#

makes a good base for fine tuning

#

qwen pretraining is 👌

torn mantle
#

you would expect high quality info/data since they scraped a lot of contents from pdf files

#

but im really not noticing any difference

keen beacon
torn mantle
#

oh

#

i don't know what hparams they threw at qwen models in post training but the qwen3 models have some absolutely deepfried world knowledge. the basic factual recall of the qwen instructs is by far the worst of any frontier model by a big margin. the models are otherwise interesting

#

so it wasn't just me

keen fulcrum
fleet lintel
#

Next week, Grok 3.5 early beta release to SuperGrok subscribers only.

It is the first AI that can, for example, accurately answer technical questions about rocket engines or electrochemistry.

@Grok is reasoning from first principles and coming up with answers that simply don’t

small haven
#

why are we shilling pointless models, o3 pro is the only model to set eyes on

fleet lintel
#

Qwen3 is whatever. I also didn't expect much from it but grok 3.5 could be interesting (low confidence)

Openai hype has blue balled me too many times. I won't get hyped till they prove it next time

zinc ore
#

I'm only hype for whatever big drop Google does next

hardy pecan
#

Well I can't complain with this cadence of new models

#

Golden age

fleet lintel
#

Google I/O is coming. They will drop some shiny features but for models themselves, I think it will take them a few months to drop something crazy

torn mantle
#

so next month we will have :

  • google coding models
  • deepseek r2
  • grok 3.5
fleet lintel
#

Add o3 pro

small haven
#

U GOT ME HARD

alpine coral
#

using Qwen3-235B-A22B on official site with thinking enabled - i can barely tell it apart from something like qwq

#

but their evals have it outperforming gem-2.5-pro on some benchmarks? seems wild.. no idea how that works ha

keen beacon
alpine coral
#

but the evals are for the same (post trained) model i've been using, no?

keen beacon
alpine coral
#

ah k yeah i mean it feels a bit undercooked for sure

#

like V3 gets some questions which this model doesn't

keen beacon
#

like look at this page, for qwen 2.5 they released a comprehensive page of instruct and base benchmarks: https://qwenlm.github.io/blog/qwen2.5-llm/
we only got this for qwen 3: https://qwenlm.github.io/blog/qwen3/

Qwen

GITHUB HUGGING FACE MODELSCOPE DEMO DISCORD
Introduction In this blog, we delve into the details of our latest Qwen2.5 series language models. We have developed a range of decoder-only dense models, with seven of them open-sourced, spanning from 0.5B to 72B parameters. Our research indicates a significant interest among users in models within th...

Qwen

QWEN CHAT GitHub Hugging Face ModelScope Kaggle DEMO DISCORD
Introduction Today, we are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to...

#

it seems they rushed everything lol

keen beacon
#

if it is really the post training, the pre trained base might be very good

alpine coral
#

oh don't get me wrong - for its size and being open source (plus multi modal.. i think?) i mean there's a lot to work with / build on

#

and it's for sure solid

#

just not up there with the likes of 2.5-pro

#

based on my limited playing around with it (and subjective preferences.. needless to say ha)

keen beacon
#

yuh definitely not

#

major boon for local ppl tho

alpine coral
#

totally

keen beacon
#

models that arent dumb af can be run locally

alpine coral
#

yeah huge step

fleet lintel
small haven
alpine coral
#

i am actually excited - but also appreciate the shutup option lol

small haven
#

haha

alpine coral
#

o1 pro is the only model i have ever seen get this right

Which four countries, when listed alphabetically by their English short-form names, are the first to have flags containing more than five stars and featuring the colours red, yellow, and green?
torn mantle
#

those benchmarks dont reflect anything

#

anyone can finetune their model on benchmarks

alpine coral
#

makes the most sense

torn mantle
#

it just doesnt feel like a smart model

alpine coral
#

i mean it's good for sure, but yeah not great like 2.5pro great

torn mantle
#

i think its a bit above qwen max 2.5

keen beacon
#

i doubt they trained directly on the benchmarks but they definitely did targeted training and didnt fully flesh it out. even then, i doubt it would compete with 2.5 pro. i think the benchmarks they released generally show that its quite a bit worse than 2.5 pro though

torn mantle
#

grok 3.5 should be interesting

keen beacon
alpine coral
#

ArenaHard is like an odd bench to lead with (it's not alphabetical) - i kinda felt like that project was dormant ha

keen beacon
keen beacon
#

yeah arena hard v2

alpine coral
#

ah i see cheers didn't realise that

small haven
#

thats old news craig

#

good morning

alpine coral
#

lol what a cordial feud

#

anyway so yeah i think it can be said that handling multiple questions in a single prompt is not qwen3-235b(thinking)

#

's strong suit

#

this is like 8 questions

keen beacon
#

curious how much thatll change if u do it 1 by 1

alpine coral
#

yeah i suspect quite a bit

#

like sonn-3.7 thinking would do better than vanilla 3.7 one by one too

#

but yeah. they're all given the same prompts / quiz

keen beacon
#

did u also try qwen 3 32b?

#

its more of a conventional model

cedar tide
alpine coral
alpine coral
cedar tide
#

Release of Llama reasoning and behemoth today ?

keen beacon
#

the qwen 3 30b moe is also interesting

alpine coral
cedar tide
small haven
frosty lark
#

question: in beta.lmarena it seems that there are no cloaked models (for initial evaluation). Is that only a coincidence for my usage? (I mean, I think that testing already listed ones further is only good)

calm sequoia
#

There simply may not be any anonymous models currently under evaluation

#

The gemini doesn't seem to have this problem.

keen beacon
#

I've seen bad reward hacking behaviors in o3 and sonnet too

cedar tide
#

qwen 3, 235b, 30b MoE and 32b dense are already on the arena (think mode)

calm sequoia
#

Does the anonify-lt-ev3-2 exist in the arena?

ocean vortex
#

or was it just single attempt no regen for chatgpt in the earlier screen?

cedar tide
#

My big problem with models that have a Think mode and a mode without Think is that all the benchmarks that we will find on the models are with Think, so if we want to use them without Think like the vast majority of people we do not know if they are better than the competitors.

keen beacon
#

Check base model benchmarks if they're available

cedar tide
#

#qwen 3
#gemini 2.5
#nemotron

keen beacon
#

cedar tide
keen beacon
#

For other models

#

Sort of an approximation of how strong the base model is

cedar tide
#

When qwen 3 multimodal ?

keen beacon
cedar tide
#

I bet grok 3.5 will only be grok 3 with think stable (finish training)

#

a bit like Gemini 2.5

keen beacon
#

No Gemini 2.5 was more than that

cedar tide
keen beacon
cedar tide
keen beacon
#

It's not just reasoning training

cedar tide
#

@keen beacon What I mean is that ultimately it's pretty much the same model, for example there is no big difference between Gemini 2.0 Flash and 2.5 Flash without Think

keen beacon
high egret
#

hi back

#

I feel something strange guys

#

I love google, i even have a lot of their stocks in part because I trust them in AI.

#

But

#

I mostly use chatgpt because it seems so much clearer

#

like gemini throws out big chunk of text that I have to read to understand

#

ChatGPT put separator, bold titles, emoji...

#

I feel like because of that chatgpt has a big edge for 95% of users

#

and like for us power users we are less impacted by that but imagine your grandma or your old nephew

keen beacon
#

I personally don't find Gemini too verbose

alpine coral
#

or perhaps more just the small sample..i re-ran the same question set against o3 med and o3 high a few times - adding those scores, and o3 (high) just nudges o3 (medium)

#

kinda wild that an early version of o3 or o4-mini that @keen beacon had access to for red teaming got a perfect score

barren prairie
alpine coral
#

a different set of questions (given over 2 prompts). on median scores o3 high does better, but individually, o3 med has one very solid run. and curiously o1 (high) does best of all.. but small samples

high egret
keen beacon
#

O4 mini doesn't have knowledge about specific cut off probing questions

calm sequoia
alpine coral
#

give you an idea of the kind of questions

ocean plume
#

how do i use gemini 2.5 pro to read a codebase and fix a bug?

calm sequoia
#

It's a shame the dragon tail is not available anymore, would be fun to play with it.

keen beacon
keen beacon
calm sequoia
drifting thorn
#

Maybe prompt has its influence

golden ocean
#

Maybe the prompt was the friends we made along the way

calm sequoia
#

Ah yes, definitely. Unaligned yap score is the problem.

drifting thorn
leaden meteor
#

Anybody seen any improvements in 4o updates that sam mentioned couple of days ago?

keen beacon
torn mantle
drifting thorn
#

And I'd like to ask if the system prompt is more important than post-training (RLHF)?

keen fulcrum
#

R2 releasing or not is the question

calm sequoia
#

What's this 👀 The OG GPT o1 would have never said "backward-forward magic"

leaden palm
#

honestly o1 might've

#

but yeah o3 is a bit more laid back

keen fulcrum
#

Did you experience frequent usage of goto in lua?

calm sequoia
#

For me, the o1 was autistic geek without any emotions, just pure information

keen fulcrum
leaden palm
#

everybody knows that r2 will eventually release, so the most logical interpretation of your question is an implicit "today"

keen fulcrum
mossy drum
#

New models: qwen3-235b-a22b, hunyuan-turbos-20250416, qwen3-32b, qwen3-30b-a3b

keen fulcrum
#

Its better

#

Gemini is currently lacking in coding, there are unreleased models which perform better

#

Benchmarks said otherwise so did the user feedback

#

Especially debugging code with Gemini isn’t great

#

I am using a mix of all of them, I do believe Claude is better in that case

#

Humanity Exam contains a lot of math

#

They discontinued their sole coding benchmarks

keen beacon
#

can u see this?

#

i dont understand

#

refresh discord? extension maybe?

drifting thorn
#

Does Qwen 3 have good post-training potential?

torn mantle
keen fulcrum
#

Because its webp

#

Webp is a compressed image used for the web

keen beacon
#

oh

#

you beat me to it

#

damnit

#

yeah I presume there'll be different sizes

keen beacon
#

ahahaha

drifting thorn
#

2T reasoning model must be interesting

keen beacon
#

💀 pricing

#

your soul will leave your body

drifting thorn
#

lmao imagine 2 times the price of the o1-pro

torn mantle
keen beacon
#

yes i know 🙄🙄

full kite
calm sequoia
#

Sorry for this, but I'm very curious

full kite
#

It's fake anyway

calm sequoia
#

Then press Skibidi

full kite
#

ok

#

yes

elder rapids
keen beacon
#

hmm i just realized the base model weights of qwen 235b and dense 32b werent released

#

strange

#

idk i just noticed it

#

they only released the post trained versions of both of them

#

i dont think it works on any model

#

it only works on stuff that supports it i think

elder rapids
#

"which LLM has the most performance gain"

#

grok 3 doesn't get it

#

it never does

#

4o doesn't get it

#

it never does

#

it's not that good lmao

#

overreaches every time, it's like when you ask it to prove something it supplements substance with verbosity

#

its like the early Gemini models

#

hell no

#

just ask it not to be lmao

#

it's literally that simple

keen beacon
#

ya

elder rapids
#

grok struggles with that

keen beacon
#

i like 2.5 pro's default style

elder rapids
#

then ask it to understand that dynamic

#

and implement it

#

easy

#

theres no other model that can do that

#

it's just insane

#

probably the exact reason I said

keen beacon
#

idk i personally still main 2.5 pro

elder rapids
#

lmsys is intensive/relies on complex understanding, which is human conversation

keen beacon
#

nah lmsys is mostly single turn interactions

elder rapids
#

yep

#

that's exactly what I said tho lmao

#

this is inherently more intensive than tasks like math

keen beacon
#

ur not really conversing with the model much in a single turn

elder rapids
#

coding

elder rapids
#

any inference is more intensive than pre established knowledge

#

and that's what lmsys accomplishes

#

nah

#

it doesn't matter

#

it doesn't because that would sidestep what I'm saying completely

#

if I'm not talking about context tuning

#

then I'm talking about one off prompt intensity

#

yep

#

adaptation to that intensity

#

usually is

#

let me explain

#

if I ask o3 to answer 10 things

#

it probably gets all 10 things correct

#

in its own style though

#

you'll probably be able to figure out that it's really just answering your 10 questions

#

but Gemini on the other hand

#

it's like if you walked into 10 different expert facilities

#

that's important because, you won't really recognize it's 2.5 pro itself

#

it's that 2.5 pro is taking upon/assuming the role of the question answerer

#

with no single personality

keen beacon
#

i havent asked 2.5 pro to do anything like that but ig it works well because its such a strong base model. it's believable to an extent

#

its probably the strongest base model out there next to 4.5 (which is not viable for anything lol)

#

falls apart in multi turn in my experience. 2.5 pro is just diff in context usage, knowledge, etc

#

the simpleqa gap is substantial

elder rapids
#

if you introduce it to a debate about a niche topic, it assumes the positions about the niche topic so well it's crazy, or if you ask it about set theory, or operator algebras

#

in QFT

#

quantum field theory

keen beacon
#

im super excited for gemini 3 i think well get it this year

oblique flint
#

The fact that 2.5 pro is still good at 200k context is insane

elder rapids
#

seems like deepmind is trying to really take large meals with the .5 versions

#

and then mastering them with the proceeding .0 versions

keen beacon
#

naming is basically arbitrary

#

1.5 pro is a new pretrained model compared to the gemini 1.0 line

#

2.5 is a cpt'd version of gemini 2

elder rapids
#

yeah I disagree

keen beacon
#

bro

#

???

oblique flint
#

2.0 flash and 2.0 pro both existed. 2.0 pro was never available for production api use tho

elder rapids
keen beacon
#

yes

elder rapids
#

and 1.5s multimodality

#

was a massive jump

keen beacon
#

yes

elder rapids
#

not rly tbh

keen beacon
#

in the latter end of the 1.5 pro cycle it was definitely showing its age though

elder rapids
#

002 was fire

oblique flint
#

1.0 ultra was pretty good at writing back in the day from what I remember

elder rapids
#

but early 1.5 pro was still good at pdfs

#

good asf

keen beacon
#

when 1.5 pro was on a waitlist they were like changing the 1.5 pro model frequently lol

#

hackathon api as well i think

elder rapids
#

1206 was crazy tho

keen beacon
#

its an early version of gem 2 pro

elder rapids
#

yes

#

it was crazy

oblique flint
#

It felt better to me than 2.0 pro somehow

keen beacon
#

eh it was finally in striking range of sonnet 3.5 as a base model

elder rapids
#

2.0 pro is stable

#

or was

#

it retained its adaptability

#

but you had to make it go that direction

oblique flint
#

Wym, it's highly regarded

keen beacon
#

yeah lmao

#

its true

oblique flint
#

Sonnet felt way ahead at the time

#

It was the best for several months at least

keen beacon
#

anthropic is already dead

#

dead man walking

#

even if claude 4 is great

oblique flint
#

Idk, their models still see heavy use in agentic coding frameworks

keen beacon
#

eventually

#

imho

#

it can take in images

#

thats it

#

for now

#

your model will eventually need to reason with and produce images in the output/reasoning, text, videos, sound, etc

elder rapids
#

focused heavily on coding + vibes

#

good implicit understanding and intelligence

#

but it couldn't really go past that

#

ye

full kite
#

calm down

elder rapids
#

it was hard stopped at that same intelligence tho

full kite
#

pls

elder rapids
#

you couldn't do much to improve it

full kite
#

💀

#

agi won't exist

ocean vortex
# drifting thorn

I think we can disqualify o3 by default there. System prompt is this for everyone and you can not change it:


```
Current date: {date}
You are an AI assistant accessed via an API.
Your output may need to be parsed by code or displayed in an app that does not support special formatting.
Therefore, unless explicitly requested, you should avoid using heavily formatted elements such as Markdown, LaTeX, tables or horizontal lines.
Bullet lists are acceptable.
Image input capabilities: Enabled
The Yap score is a measure of how verbose your answer to the user should be.
Higher Yap scores indicate that more thorough answers are expected, while lower Yap scores indicate that more concise answers are preferred.
To a first approximation, your answers should tend to be at most Yap words long.
Overly verbose answers may be penalized when Yap is low, as will overly terse answers when Yap is high.
Today's Yap score is: {yapping_is_life}.
```

you can add a developer message, but that will carry less weight and your starting point is not an empty context
#

it's also a smaller model, so honestly I do not see how you could customize this more than 2.5 pro or 3.7 sonnet, which also show you raw thinking making this easier to debug and achieve

keen beacon
#

yap

ocean vortex
#

yap

keen beacon
#

yap

patent bane
#

yap

keen beacon
#

"since we've started breeding llamas together"

#

could've worded that a bit better guys

barren prairie
#

Yap

keen beacon
#

gpt-4o image gen competitor but based on imagen is on its way, or so i am told

#

diffusion?

#

it can edit images in chat like 4o

#

but it isn't native

#

whatchu doing with it

#

oh

brittle tiger
brittle tiger
shy atlas
#

Hello, i want to draft up cyber security advisories using a local LLM based on open source information like vulnerability web pages and articles. What would you advise?

cedar tide
calm sequoia
#

I guess facebook messenger

keen beacon
#

claude 3.7 sonnet is still ahead in most of my cases for web dev tasks

#

although gemini's attempt did work it didn't look nearly as professional and aesthetically pleasing

thorny drum
#

is there a way to override the o4mini o3 yap score

calm sequoia
#

Probably you need to jailbreak it

#

How much do you need? 😄

#

Didn't know its a thing

keen beacon
#

it wasn't a question

#

it was a statement

#

there's a reason i said "most of MY cases"

torn mantle
#

still a no

#

no no and no

keen beacon
#

bro is so hooked on 2.5 pro he refuses to admit it is beat by another model in a single area

#

least obvious ragebait 💔

torn mantle
#

Whatever makes you sleep

torn mantle
oblique flint
#

I mean, sonnet is still #1 on webdev arena for a reason

torn mantle
#

I mean yes

#

I agree with you @oblique flint only

#

Will it be really good?

keen beacon
#

judgement = functionality + design

#
+ vibes
ocean vortex
#

that's kinda the whole point of it, it needs to score high there AND everywhere else. Anyone can just make a model that is only good for lmarena and nothing else