#general
I wouldn't go as far as that, as the trend lately has been downsizing... So maybe in the long run the trend/progress will catch up to it and it's not gonna be too small. But for now it does seem suboptimal, as people just can't get spatial awareness at a top level without a decently big model size
though we also need to keep in mind that we just do not know what could potentially be done with RL training using notably bigger model. Maybe it's diminishing returns, but maybe the gains are actually more substantial
anyone else have the problem of Gemini 2.5 Pro not using terminal in Cursor?
Claude 3.7 runs it 20 times and does everything I ask it, Gemini 2.5 Pro is unsure and asks to install libraries etc
prompted it in user rules too no change
it would be unlikely to scale the same way as normal classic chat LLMs
so I would think "perfect size" for that is something different (bigger), even with the current metrics and their limitations
how is this real... 😭💀
for o1 it's called juice
to my surprise I'm getting the same 8192 score nonsense on playground with completely empty user defined system prompt for o3-high. Now that's 1 way to ruin a model...
we still aint got no new stuff?
people gonna have to worry about this stupid yap score now when trying to eval openai models 💀💀
what happened to r2?
there were rumors about it today
probably wont be released super soon tho
polymarket: 9% chance in apr, 90% chance apr/may/jun
manifold: 25% chance in apr, 88% chance in apr/may, 93% chance in apr/may/june
there are only 3 days left in April though lol
@keen beacon alt?
ive never seen a model do this 🤣
It still kinda sucks in cursor unfortunately, just not as good at toolcalling
that discrepancy in apr
how tf is that not being arbitraged
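For reference, a rough sketch of the arithmetic behind that April gap. It assumes both markets pay out 1 unit of the same real currency and ignores fees, liquidity and differences in resolution rules. Those assumptions are probably why the gap persists (Manifold, for one, is largely play money). `arbitrage_profit` is a made-up helper name.

```python
def arbitrage_profit(yes_price_a: float, yes_price_b: float) -> float:
    """Guaranteed profit per 1-unit payout from buying YES on the cheaper
    market and NO on the more expensive one (fees and slippage ignored)."""
    cheap, expensive = sorted((yes_price_a, yes_price_b))
    cost = cheap + (1.0 - expensive)  # YES at the low price + NO at the high price
    return 1.0 - cost                 # exactly one of the two legs pays out 1.0


profit = arbitrage_profit(0.09, 0.25)  # April prices quoted above
print(f"{profit:.2f} profit per 1.00 payout")
```

A positive result means the two prices disagree enough that holding both legs would lock in a gain, if the positions could actually be bought and settled in the same currency.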
@olive mesa hi twin
ah forgot that exists
Is there a reason why they don’t include Claude’s web search function for the web search leaderboard
yea to make sure it hits 1345 words. i didnt request that 🤣
Claude.ai but premium
Ok
But it’s only in premium just so you know
What do you mean
The lmarena api
Ok
So do you know why they haven’t added it
Oh
No it’s really good
I have perplexity pro and I feel like Claude has gotten many more answers right
For real
Oop
Well my dad uses perplexity too
Should I tell him to stop paying for it
it's crazy how media is so far off from the reality of these products
and fake news is the reason why a lot of them still live
perplexity should ong be dead rn
dawg
why are you saying it's a scam
because you agree with me
😭
ye
that's what I'm saying
u used to get like 600 messages a day on perplexity
My dad uses its API, he runs a music player company and he uses it to find creators numbers and their emails and tells them stuff. Is he cooked or nah
with a lot of models, even if u dont use it for search
nah
there's always going to be replacements
in the AI industry
maybe even made by openAI
No like do you think it’s getting it wrong
oh
depends
that kind of stuff is hard to get wrong
especially for an LLM
Ok thank god bro
My dad has been working on ts for 10 years and I’m scared ai is gonna screw him up
nah it's alr
if he adapts
he'll be elevated
there's not much downsides
still depends
Dw he’s already a millionaire but I’m just scared he is gonna be sad if this idea fails
he could be impacted yeah, but he might find a whole new interest in AI itself
who knows
ong
gemini is absolutely not impressed by sonnet 3.7:thinking chess moves it seems 😄
if you ask 2.5 pro to reason through the spatial task of the board + think like a grandmaster and focus on recollection, itll stomp these other models with the same prompt
He’s in the screen burning industry and he doesn’t use ai for that but now he’s branching to a music player website and app which uses ai for song recommendations but I think he will be fine even if it fails because he is one of the leaders in screen burning services.
it seems like the best chess model currently tbh, though as base
the gpt models
might be the best
especially 4.5
i am currently running them against one another, with full info and full reasoning, so I'll see. definitely a huge difference in behaviour though, gemini is confident and claude is constantly passive/doubting itself
Are u talking bout studio
the industry doesn't allow that
AIstudio yeah
but I mean
prompting wise
not temp control
and other dev stuff
Temp is also important tho
nice
he'll find his way
Yeah
Disappointing
See this is why screen burning is gonna fail bro
ye but when it comes to thinking models this becomes more arbitrary, especially puzzle tasks and spatial tasks like chess
Yeah I only use it for research tasks so I don’t face those issues
Say goodbye to the headaches of print prep and hello to a streamlined process that allows you to fulfill more orders in less time. Our screen print screens are designed to simplify your workflow, so you can focus on what matters most: delivering exceptional products to your customers.
Proudly black owned business if that interests anyone at all💀
send that to him not me 💯
Bro come on
yea 🤣 qwq 32b preview
Blacks I wild, just call us black people bro😭
why proudly smh
they added a sh1t ton of rl on top of it probably more than qwq full tbh
👹 evil
Dw I don’t really care
deadass
can't even say type sh**
you can suspect it, but not reasonably believe it imo
they're not in the same position DeepMind is
they don't have the data scientists
they don't have the researchers
ye but I think we can assume they have the researchers, but not insane data scientists
nah not in comparison
just as it is
ye ofc
but it's not necessarily sacrifice
in the way it's suggested
deepmind
anthropic
private institutions
universities
ion think anyone who works at these companies primarily subscribe to the ideals
ye
I would say only the really top guys
that represent those ideals
which is inherent to the ideology themselves
twink
I mean, if I were a standard worker
I wouldn't care about these things
I'm trying to work hard and get research in lmao
for money
since generally, specific researchers aren't valued in the sense they can continually output high level stuff
but can output quality for the direction the company intends
ion know about the specific situation too much with Ilya
but that's prob what happened, and he likely shifted
no "secret sauce". Just a head start when it mattered + userbase and some really smart engineers. Funding helps as well ofc
this is the same thing someone sent earlier lmao
"(This content is from public information and is for reference only and does not constitute investment advice) Investment is risky, please be cautious when entering the market!"
just in a different format
or actually prob where they got it from
damn u know chinese lol? or just guessed it immediately
I believe this to be the case
lets await next week and hopefully get some news
R2 and Qwen 3 are due to release soon
the qwen 3 release seems to be significant, they did llama cpp prs/transformers prs/vllm prs/mobile apps/etc far before the release of qwen 3
I believe Google will drop theirs soon after R2
im hoping they release a qwen 3 reasoning model off the bat, but im most excited for new pretrained base models for fine-tuning, etc. qwen 2.5 was exceptional
The coder models
it literally says it's not a leak, just an accumulation of already public information
which undermines its value as a concept stock
since it's not new Information
yea theyre releasing smaller models too
maybe a 32b alternative but moe so itll inference faster
I truly don't think it's going to get that much better from deepseek
Oh indeed browser integrated llms soon to be the next thing
get rid of 2.5 pro, get rid of o3s and o4 minis release
let r2 release
do you seriously think the gap would've become THAT wide
i dont know what to expect with r2 tbh
There is a real possibility qwen 3 will outperform R2, lets see
I can't believe deepseek would've accomplished that
i dont think r2 will outperform 2.5 pro at least in simpleqa i think
let alone at the level of o3
i use 2.5 pro on stuff that requires a lot of world knowledge/niche world knowledge
its exceptional (compared to other reasoning models)
especially when r1 wasn't really that good
R1 forced the industry to release their newest models
that's straight up propaganda
😭 🙏
Why?
AI got significantly better as soon as R1
nothing occurred when deepseek released that was meaningful
that's literally impossible
the time period is too narrow
that means they weren't planning on releasing o3 mini after the announcement
and it takes a ton of time
for them to prepare it
won't release it on a whim like that
unless it's truly done
especially with how integrated it was
take a look at 2.0 flash thinking
lol
same thing
it defo was the reason and the reason why they started working on improving the reasoning summary
this is exactly the only thing they did tho
i dont recall r1/o3 mini timelines that much tbh ive no idea about timeline
since people were whining about it
you cant attribute any of these AI things to the release of deepseek r1
timelines don't add up
- understanding what even goes on behind these ai
and what entails the integration
vision model and btw thats a bad way to test lol
grok has good team from what i can tell. i heard they pay way more than other labs, basically a working for elon tax
I cant get enough of o3
and that's a serious factor
I can, ts cannot comprehend a thing im saying 😭 🙏
jkjkjk
nahhh
I think it's actually starting
you cant get the jump from 2.0 flash thinking to 2.5 pro
without a major breakthrough
oh
wait wym?
era of huge growth
oh ye
if r2 doesn't close the gap tbh
I can reasonably assume
open source is going to be pretty bad
for a little while
until they come up with something
nah the qwen team will deliver
its their time this time
hopefully
im not sure about deepseek but the qwen chat website was updated with strings of a qwen plus sub with video gen, image gen, access to qwen 3, etc
Yo what ai should I use for research rn
I’ve tried everything aside from deepseek tbh
No like not deep research
Just like general search
What’s tavily?
How do I use it
bruh if u pay those prices lol
paying for o4 mini/o3's api and enabling first party web tools, etc on the api is more reasonable tbh
https://lichess1.org/game/export/gif/white/mLb7m0Zn.gif?theme=brown&piece=cburnett
The game between Gemini 2.5 Pro Preview and Claude 3.7 Sonnet Thinking finally concluded, and with 8 and 7 blunders respectively, ultimately ended in a draw!
grok or AI studio 2.5 pro grounding
probably to be expected without any prompting
I got 2.5 pro to play at around 1900~ ish
as that's below my elo
bro what is happening to chatgpt, everything is 64k context max
Cervical Spine Risk: Rotating your head 180 degrees is generally not recommended. It puts significant stress on the cervical vertebrae, discs, ligaments, and potentially the vertebral arteries that run through the neck bones to supply the brain. Doing this forcefully or if you have underlying neck issues (even unknown ones) could risk injury, nerve impingement, or (rarely) vascular problems causing dizziness or pain.
hi!
this is with prompting, and also there is no way 2.5 Pro (or any language model) comes even remotely close to 1900 ELO. I have tested and played matches and tournament around 200 times by now (using all types of different methods), and the strongest any LLM ever came was GPT-3.5 Instruct in movetext continuation (aka Chess notation recall from training data). Other language models play more in the 400 Elo range, even SOTA.
where were you 💔
my discord was muted 💔
don't leave me like that smh..
sorry lol.. ill try not to in the future
lmao ok 😭
In my calculations gpt 4o has 1000 and o3 mini has 1400 at THE MOST (likely 250 lower)
Relying a lot on book moves or gimmick moves anyway
tell me where to play against 1000 elo gpt 4o, and I show you its not nearly 1000 elo
Put some games I found into the analysis and it said 1000 elo
For some reason at defense it holds very well
which analysis?
The chess.com one that gives the estimated rating
Just like the gif shows, it is great at development but in the endgame it falls apart
That must be the reason it gave a high value
I don't know of any system that can take a game, and determine "ELO" based on it.... that would be super inaccurate. Elo is based on your opponents' strength, and the outcome, not on how good your moves looked in isolation..... Either way, I have tested a ton, and recorded a lot of games, and most SOTA models play around 400 ELO level (when compared to Lichess opponents), and are unable to beat the weakest Stockfish 14 level (sub 800 ELO)
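To make the "opponents' strength and the outcome" point concrete, here is a minimal sketch of a performance-rating estimate using the standard Elo expected-score formula. The function names, the bisection bounds and the sample record are assumptions for illustration and are not data from the games mentioned above.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def performance_rating(opponents: list[float], scores: list[float]) -> float:
    """Rating at which the expected total score equals the actual total.

    `scores` uses 1 for a win, 0.5 for a draw, 0 for a loss.
    Found by bisection; clamped to [0, 4000] for all-win / all-loss records.
    """
    actual = sum(scores)
    lo, hi = 0.0, 4000.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if sum(expected_score(mid, opp) for opp in opponents) < actual:
            lo = mid  # rating too low to explain the results
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical record: one draw and three losses against 800-rated opponents.
estimate = performance_rating([800, 800, 800, 800], [0.5, 0, 0, 0])
print(round(estimate))
```

The estimate only uses who was played and what the result was, which is why a single game's move quality cannot produce an Elo number by itself.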
Oh
What a twist
I think the 1000 elo number is somewhat accurate but limited to hard matches, since chess youtubers played against it
And for some reason it plays way better in the middle game, likely due to memorizing moves
yeah this isn't true at all lmao, minimum 1400 if you prompt it right + urge it to use opening repertoire, emphasizing move recollection, these models can easily be that good. can you give me your prompts lol
you either don't know what level they really are playing at with lack of experience in chess or you don't know how to prompt
has to be one of those two
talk is cheap, give me your 1400 minimum ELO prompt, and I can directly show you its not 1400 ELO.
(also you just lowered your threshold by 500 ELO, impressive)
yk what just get up a game lmao
no?
that's an entirely different claim lmao
I'm not saying 2.5 pro plays at that level
GPT 5 at home
I'm saying if you urge recollection, it'll be at least 1400
not restrictive to 2.5 pro
I got 2.5 pro to play at around 1900~ ish
great now can you quote the claim made in the larger passage
really? so I can get any model to play 2k elo also. they played E4
no, it loses too much context
well unlike you I already provided I have collected data (169 games as of now, between multiple modes). would love to see anything despite baseless claims about your 1900 or 1400 elo LLMs
🤷
just saying
not a lot of people are that good at understanding models
entire runs can easily be invalidated if you don't adjust prompt techniques respective to the model
i am not interested in troll statements. either provide proof for baseless claims or I got nothing to discuss with you
saying c5 is not proof of LLM playing 1400 or previously claimed 1900 elo.
I'll just send all the thought processes
and outputs
lmao
it's not that deep
dude actually blocked me lmaoooo
😭
anyone want to go against 2.5 pro?
just for fun, and for the sake of testing
They need to delete this new 4o personality
ye
it's creative
but saying anything to it poisons the well
adjusting to the user is cool, that's what I like about 2.5 pro
but goddamn
I have to prompt it everytime I talk to it
to be the way I like it
2.5 pro is great to talk to in comparison. They are trying way too hard with the new 4o etc. I didn't think they'd keep trying to force it this hard
I'm seeing people say they really like how warm it is
and how it's better than 3.7 sonnet
etc
but it's kind of surreal
Maybe they like the sycophancy lol
where
subreddits and stuff
if o3 is deep research lite whats o3 pro
if Google releases another model after that tho
ion know what I'm gonna do
imagine anthropic releases 4.0
May 20
o4 mini is deep research lite lol
o3 is deep research
ye io
o3 pro is deep anus research
have you guys tested the deep researches enough
I don't use the Gemini one or the openAI one that much
yes
so I'm not sure which one is better
somehow my stupid scaffold with exa and gemini flash+pro outperforms all the other free ones
oai deep research is meta rn
nothing beats it
damn
What do y'all usually use deep research for
oh ye
but it just spits out mumbo jumbo, not wurf
oai deep research has high entropy info for every sentence
gemini dr is just stale and a bunch of unneeded detail
alr give me a prompt
like any prompt
Hmm what are you expecting out of "demo results" btw lol
it just loads a pregenerated one (im the one who set up that ui)
I disagree a ton with the density of info, but the formatting seems more consistent in openAI DR
they seem too similar to compare on that part, or its unnecessary comparison
since there's necessarily limited amount of info for a topic
u are pure coping, its not
gemini dr is like a student trying to fill up the minimum word count
wym coping? if I were asking about that I wouldn't be dismissing what you said lmao
oh mb i read it wrong
I'm asking which one is better, or which one sufficiently describes the information, not summarize it
in openAI DR it fetches good asf insight
and I've seen the same with Geminis
Gemini seems to verbosify a ton tho
but it doesn't seem like I'm getting less info
just more yap
meanwhile perplexity:
"prioritize verbosity" loool
well great o3 solved a unit test where o1 pro couldnt... nice
o3 is just so good
Gemini got the answer to my problem wrong even on the meta synthesis only o3 solved it
whats the problem
when they have o5 pro internally 🤣
code for a part of my project
to be working at oai, that must be insane
I feel that the vibe of o3 is too easy to recognize as well.
April was kind of disappointing.. no new top model was released. bunch of hype but nothing materialised.
yea compared to march where we got Gemma 2 27B, Mistral Small 3.1, QwQ, Gemini 2.5 Pro, and Nemotron 49B, it's quite an uneventful month.
Llama 4 and GPT-4.1 are kind of duds, at least for me.
Gemini 2.5 Pro, o3, o4-mini-high: 🤨🤨🤨
2.5 Pro is the SOTA...
yes
03-25, it doesn't count as April release (even though it got rebranded from exp to preview)
2.5 pro was in march. and nothing really better came after that
o3, o4-mini-high : disappointing. OAI tried to play game by releasing a bit prematurely to take on google but honestly they are just meh compared to 2.5 pro
Oh, right. He was in March.
From this post, you can understand where Google gets several models in LMarena.
So much more in such large numbers
Where is this from and what is synthesis
@balmy mist i think i may be right again, we may get r2 this monday / next week
lot of hints as well
he said a few weeks. have more patience.
does anyone know of insider Apple info? Are they planning to compete in AI space at all?
New model in Arena: llama-4-scout-17b-16e-instruct
easy to say when ur on a plus sub
Complaining about things you can't change is a wasted effort. Would be good to see it out at the end of this week, provided they are on time
Pollution is everywhere
They had this research for a while now apparently: https://arxiv.org/pdf/2308.06103
Maybe they already used it for the early Gemini 2.0 iterations back in 2024 (like all the ones that were aistudio only)
Deepseek R2 before May?
7
17
2
No ❌
for what model? I'm absolutely blown away after I found out this silly score applies to playground and API too. If what you are saying is true that means any benchmarks that people did earlier might not even be possible to reproduce now. Changing stuff like that can always have unintended consequences, even if in theory higher value should be better. There is no place in API for crazy sht like that lol
Guys I'm better
hi all
hi !
depends , but I use gemini*, o4-mini-high , and claude 3.*
gemini 2.5 pro and claude 3.7
the api was what i found most curious about it when we were discussing it yesterday.. you could have been blown away 12hrs earlier if you read my messages lol
this is consistent with oai's chain of command i guess
- System / platform (oai)
- Developer (who can add what we call a system message)
- User
if there's a conflict, the higher one has authority
as it explains in its reasoning (and hence why it said 8192; the 2903 in the developer message was overridden)
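A toy sketch of that precedence rule, assuming a simple "highest authority wins per setting" resolver. The level names mirror the list above, but `resolve`, `Instruction` and the `yap_score` key are hypothetical names for illustration and not OpenAI's actual implementation.

```python
from dataclasses import dataclass

# Lower number = higher authority. Illustrative only, not OpenAI's code.
PRIORITY = {"platform": 0, "developer": 1, "user": 2}


@dataclass
class Instruction:
    source: str  # "platform", "developer" or "user"
    key: str     # the setting being specified (hypothetical key names)
    value: int


def resolve(instructions: list[Instruction]) -> dict[str, int]:
    """For each key, keep the value from the highest-authority source."""
    winners: dict[str, Instruction] = {}
    for inst in instructions:
        current = winners.get(inst.key)
        if current is None or PRIORITY[inst.source] < PRIORITY[current.source]:
            winners[inst.key] = inst
    return {key: inst.value for key, inst in winners.items()}


settings = resolve([
    Instruction("developer", "yap_score", 2903),
    Instruction("platform", "yap_score", 8192),
])
print(settings)
```

With these inputs the platform value wins, which matches the behaviour described: the developer's 2903 is dropped in favour of 8192.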
this seems about right
Is yap score unique for each user preference or are they just changing it for everyone on the fly looking for a sweet spot?
i think they might be from the same platform-level prompt (oai's instructions, which [are meant to] override anything given in the Developer Prompt (API) as well as end user messages)
the 8192 comes from the platform-level prompt (as do, I think, those instructions about mirroring style etc)
What website is this
Ok thanks bro
np
what the fk is a yap score
I NEED SMART MIND HERE Please
I have this test tomorrow
It's like the final test of the year right
But it's a mock one
Idk sht about what the subjects are
I don't go to classes and everything
What would be the best method to know how to solve them
Like I have access to all the past mock tests from the last 4 years
and the correction of them
And also videos about ppl solving them
Like Idk what to do chat
Please 😭
Get ready @full kite
ready for what
Are you a smart person
To be unemployed
read through them then
😭 😭 😭 😭
And then what
Is that good
that's like good
ye
i have a spare laptop i put in the corner
then use parsec on main desktop
remote into it
gemini free trial gives u unlimited screenshots too
won't work for people with relative grading
also don't cheat dude
whats that
what free trial are you talking about
I'm using google ai studio
your gpa isn't directly based on your marks, it's based on how much you scored relative to the highest marks obtained in class
go to https://gemini.google.com/app
top right corner you should see something like this
click it
1 month free trial no payment
cancel before they charge u
p sure ai studio's 2.5 pro without premium is only like 500k tokens then you get rate limited for a day
or use the openrouter + vertex + ai studio + copilot stack
or use the official student free trial if you're in university
bro what tf
Dude 2.5 pro is free and 1 million token per chat
what is pro even about 🙏 😭
I have the flash one too
faster
2.5 pro in ai studio is 250k/min and 25/day
what does RPM mean
requests per minute
actually maybe ai studio has higher limits than the ai studio api... idk
lmao
😡
yeah sorry didn't see that 👀
I think it's more on the fly thing, especially if this was at 4k earlier. I also very much doubt it was evaled with this in the context. But regardless it's unwanted flood for API as far as I'm concerned which makes "developer message" much less powerful and relevant
Guys does anybody have a chatgpt premium account that could be shared with me?
No. But I can run a prompt for you if you want
why do you pay for chatgpt
Well a "prompt" wouldn't solve everything
I need full access
😭😭😭😭
well then buy it
what do you need to do lil bro
I'M NOT SPENDING 20 DOLLARS ON IT, I ONLY NEED IT FOR LIKE 4 DAYS
Its whatever
well then @full kite could give it to you maybe
WHAT DO YOU NEED CHATGPT FOR
REALLY
😡
he doesn't need his it seems
school stuff.....
well I'm doing a research
I just need chatgpt to help me
the normal one doesn't accept all of my files
and so on
how many files
have you tried aistudio?
yeah
it's free and the king for file uploads
but like
No you did not
I DID
okay what happen
wait
🤨 🫃
Dude
listen I'm going to quit
if you don't tell me
😡
WHAT HAPPEN WITH GOOGLE GEMINI
IT HAS 2 000 000 TOKENS FREE
PER CHAT
CHATGPT PRO IS 128 000
😇
😡
Ok we should start over
I'll help you
I want to help you
@frail thorn
?
😔
😀
Okay so do you need to upload a large amount of pdfs?
Are you pregnant?
just forget it I already bought a subscription
🙄
yes
Now you gotta use it overtime so they lose money on your sub
hell yeah
I'M NOT SPENDING 20 DOLLARS ON IT, I ONLY NEED IT FOR LIKE 4 DAYS
I'll include the thank you
yeah well I did
wait does the app still lack the 1m context from 4.1 lmao
🫃
someones mad
✡️
🤗
🥀
you greedy juice...
say that to sam the juice
I wouldn't really say chatgpt is better for what you are trying to do even... But ok whatever 😂
yeah 20 dollars or free
kids in Africa could have eaten those dollars
LOLLLLLLL
WE GOT ONE
now that made me giggle
coins clipper
gold detector
🤗
thing
what does that mean
👹
diana12493_32 has been warned.
oy vey stop noticing
which is why if you build anything and want it to be free you have to get like 3 other providers
25 rpd does not serve 2-3 users actively building things
25 rpd is the documented limit
and i've ran into it before
💀
wtf happened here
😢 🖕
🫃
lmao
Logan just hinted at 2.5 ultra and removed the tweet very fast 😆
what did he say
screenshot or didnt happen
buddy perma works at google and says this
Something about making custom t-shirts with text "1400+ ELO club", "smell big model", "AGI when". But maybe its just coping
maybe he is drunk
1400 elo?
what does that mean
Yo gemini 2.5 pro in perplexity is peak
Also I hate perplexity but its really good with the llm
whats perplexity
New gen
omfg
why are you in this server if you arent an arena user lol
deepwiki's deep research is underrated
the one from devin?
bro I just use arena to see the scoreboard I'm not jerkng off to it
so you do use the leaderboard
which is... elo data
yeah what about it
aok
I thought it was a chess thing, they were talking about that earlier
anything with pairwise comparisons can be measured with elo
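A minimal sketch of that idea: an online Elo update over pairwise votes. The K-factor, the starting rating and the model names are arbitrary placeholders, and this is not claimed to be how the arena actually computes its leaderboard.

```python
from collections import defaultdict


def update_ratings(battles, k: float = 32.0, base: float = 1000.0) -> dict[str, float]:
    """Online Elo over (model_a, model_b, winner) votes.

    `winner` is "a", "b" or "tie". K-factor and starting rating are
    arbitrary choices for this sketch.
    """
    ratings: dict[str, float] = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = k * (outcome[winner] - expected_a)
        ratings[model_a] = ra + delta  # winner gains what the loser gives up
        ratings[model_b] = rb - delta
    return dict(ratings)


# Made-up votes between placeholder model names.
votes = [("model-x", "model-y", "a"), ("model-y", "model-z", "tie"), ("model-x", "model-z", "a")]
print(update_ratings(votes))
```

Each vote is just a win, loss or tie between two entries, so the same update works for chess games, chat responses or generated images.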
like llm will never be able to play chess
no but fr what else can we do on arena thing
can we play on it or something
you can chat with llms and vote to help build the leaderboard
you can also compare other things:
- llms using github repos
- llms using search
- llms making websites
- llms writing code
- image generation
where
oh its probably riverhollow/sunstrike
im not impressed with these models tbh
they just seem like gemini 2.5 pro 03
I don't know what a repos is
thats ok, a lot of lm arena is for coders and you might not be one
I'm a coder
have you used github before?
to download yt dlp
I know about the green board
contributions sht
my iq is dropping
where's Nw 🙏😭
soon
in the void
the longer these models are unreleased
the more there's a chance people catch up
and it's unimpressive
lets dissect urs so we can start the experiments immediately
That would be cheating I’m already AGI
more like artificial general stupidity
Ok that sounds more like a direct insult rather than a silly joke
no more funny
Are LLM’s capable of being stupid?
Perhaps stupidity is a byproduct of intelligence
Well you have a point
nah ion think so
a major part of what made o1 pro so good was its ability for pure longer context reasoning
ye im using o3 more than o1 pro..
what we know is its 10x compute (so thinking for 10x longer than o1)
lol
and the fact it is outputting as a one shot answer rather than streaming tokens, means it is using an internal canvas in the backend
so it is constantly iterating its final answer, thats my hypothesis
and internal canvas
specifically to force a longer reasoning chain + better initial instruction retention
internal canvas, checking its own answer on a pass@1, then reiterates it, until it is satisfied with a hard limit of 10x compute
great for cleaning my ass. yes. thats it.
can sam stop riding his husband and release o3 pro
ceos dont have a healthy relationship
its a ceo lol?
watch movies
then u know
rich people are very unhappy
thats why i am happy
nonetheless you are not rich
cause its adverse to not say it
youre stuffed with adverse lies
/s
ceo's dont live on prairies, therefore they are unhappy creatures.
Should we expect a new image generation model from openAI in the arena?
you have cute attributes, federighi
ceos dont have small things.
see it as necessary.
we said the same word... lol
._. we are models?
i will weigh 81kg after water fluctuations in exactly 30 days
now i am 82.6kg
having diseases which make losing weight slower is bad
i think u just wanna believe they are as happy as a nanny in a village
cause u wanna be rich too
so u have a goal to go for right now
No it’s more like multiple attempts and consensus system. You can still set low-med-high for pro.
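How the pro mode actually works is not public, so the following is only a generic sketch of a "multiple attempts plus consensus" scheme: sample several answers and keep the most common one. `consensus_answer` and the canned sampler are hypothetical stand-ins for real model calls.

```python
from collections import Counter
from typing import Callable


def consensus_answer(sample: Callable[[], str], attempts: int = 5) -> tuple[str, float]:
    """Draw several independent answers and return the most common one
    together with the share of attempts that agreed with it."""
    answers = [sample() for _ in range(attempts)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / attempts


# Stand-in for a model call: replays a fixed list of answers.
canned = iter(["42", "41", "42", "42", "17"])
answer, agreement = consensus_answer(lambda: next(canned), attempts=5)
print(answer, agreement)
```

The agreement share gives a crude confidence signal: a low value suggests the attempts disagree and the majority answer should be trusted less.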
oh hell yeah #announcements cute
omg sam tweeted, plz be o3 pro
its him riding his husband
hi
probably a refreshed pretrain
its likely to be a cpt of 2.0 pro
it isnt. even openai cant increase simpleqa that much yet thru reasoning
hi
just got folsom-exp-v1, which i haven't seen or heard of before - new anon model?
presumably related to cobalt, apricot
so amazon ig
when o3 shows these traces in the cot, i kinda leak a bit
Amazon reasoning model no?
you do what now
Any news on possible specs?
why is o3 limited at 64k context, absolute dogwater
omg qwen 3
apparently pre trained on 36 trillion tokens 😮 (2x qwen 2.5). multiple moe models?
Hi.
Is there a chat limit in https://beta.lmarena.ai? Or is it possible to test ai even on long requests (10K+ tokens)?
curious about the super low amnt of active params on the moe models. qwen 15b in the huggingface pr has 2b active, qwen 30b has 3b active
So llama 4 reasoning models werent added on lmarena given the recent controversies
And its kinda weird too qwen3 isnt added yet in an anonymous battle mode
em?
gpt-4.1-2025-04-14 --> 41.6%
o3-low --> 49.4%
in both cases just shy of 20% of the earlier score increase
2.5 flash got a worse simpleqa tho compared to gemini 2 flash, i would expect openai to lead there in terms of factual reasoning. but yeah i guess its out of date, since we have o3. i was comparing it with o1 when i formed that opinion initially
They were
https://code-arena.fly.dev
what is this link ?
Omggg
Sunstrike is not generating code at all
Anybody tested qwen3?
its not out yet
the 235b moe (apparently) with reasoning mode might be very impressive
No electricity in europe 😭
no lmao
it's just that those were the ones that accidentally appeared on modelscope briefly
anyway, to summarise, this week we are getting:
- qwen 3 (likely today)
- amazon nova premier (wed)
- deepseek R2 (somewhat likely, depends on qwen 3)
Would be cool if r2 went on arena before debuting
Yea the big one is 235b
i expect that to be a very good model
ah okay, its like we get a fire week, then break, then back to fire lol
i kinda like that, give us time to collect ourselves
Qwen 3 is what people expected llama 4 to be like
The GPT 4o doesn't allow altering real photos due to their stupid policies. Anyone have the jailbreak code or the alternative tool?
oh damn
was wondering if qwen3 was going to include a big model
seems sort of odd that they didn't include it in the huggingface pr from a week or two ago, but maybe they're trying to keep it closed?
a lot of models werent in the pr
ah
they just briefly put out a fp8 quantized qwen 3 0.6b on hf and removed it
Yea
Will 235b be the biggest model released by them so far?
Or they released smth bigger before
yea. 110b dense was their largest before
im so hyped rn
I think nothing will be better for at least a month
If anyone will need the same: https://sider.ai
Sider, the most advanced AI assistant, helps you to chat, write, read, translate, explain, text to image with AI, including GPT-4.1 & GPT-4.1 mini, Gemini and Claude, on any webpage.
it should at least be mostly as good as 2.5 I think.
probably. But it's not gonna be worth using it given the price that's for sure lol
technically sota now is o3 anyway
i dont think they will put o3 pro in the arena
Maybe to beat Gemini 🙂
imho the most likely thing to dethrone gemini 2.5 pro will be gemini 2.5 pro (at least in the arena) 🤣
In all seriousness o3 model is a lot better than gemini 2.5 pro at general tasks
You can still feel the robotic vibes from gemini
Just got r2 it's good
old model I'm already using r3
bro what
Guys can IA do my homework pls
it's a math test
Why did you change your username from Mango to Diana? 🙂
cause diana rules
if R2 > Behemot, Zuck is cooked
thats obviously gonna be the case tbh
what is behemot
Obvious for you, but not for Zuck or his investors 😄
what is behemot
behemot is agi
yeah zero chance
Behemot is minotaur shrek
Hi, I’m not a pro, will Qwen3 compete with Qwen 2.5 Max at the top of the leaderboard or its not the same category?
agi doesn't exist
nuhuh
qwen 3 is a line of models its not even released
or in the arena
Yea i mean one of them
yes at least one of them will compete i think
probably the big one will be competitive in the leaderboard
Okay thanks!
Qwen 3 may be better than Gemini-2.5-pro 👍
qwen is slow asf
this
this is a bear
it's behemoth
The Behemoth (Megasus mammothoides, name meaning "great mammoth pig") is a species of large land mammal that originally didn't exist, but was created by SciiFii and introduced to the African...
I think we both are. I hope he does too lol
lol
try it yourself lol
direct chat / side by side?
i think side by side was configured differently a while back it might have better limits
yea
i dunno about coding (let alone for a specific programming language), but i feel 2.5 pro is just solid af all round
its incredible i still main it over anything else
its definitely seen less c++ than the others and its more things to manage
yea
if you look at it historically, probably. it depends on how much of it was curated, but i think its likely to be even less
im not very sure about the others. but 1 and 2 are highly likely to be python/javascript (not sure which one is which though)
it depends really but with a gc and such its generally slower/much slower i think
did u try?
2.5 pro has the best context retention/usage in a model i think anyway, it helps a lot
you should probably upload the code into the repo instead of it being in a zip
do u have git installed?
i think u can also upload folders directly on the website
maybe ask 2.5 pro to teach u git
its more convenient and allows u to have version control
i'm getting so many errors in the arena atm
i feel like the yap score has been there all along (and at 8192).. recently discovered / noticed, rather than added..
also the other models' responses.. 3.7-sonn and v3 do well; sunstrike also (though verbose af)
folsom-exp-v1 assumes it's a bitcoin or something - pretty terrible response imo
what.
oai reasoning models have a top-level prompt that gives guidance about how to interact and includes this part about a 'yap score' (which has always been at 8192, so far as i can tell)
is that real.
yeah i mean, to the extent there are some oai-imposed instructions that include this thing called a yap score, it's real
wow okay.
whether it's concerning though im not really sure tbh
like esp if it's been there all along
it may just be for stylistic purposes
but yeah that said.. my initial reaction was to think it was a way for oai to dynamically throttle outputs on o models in chatgpt, to like manage costs / compute
yeah i only learnt about it here a couple of days ago
but now i think it's likely been there all along (and am not really bothered by it.. like there's no indication of it being used to nerf the models or whatever... yet anyway aha)
i kinda wonder if, in cases where lots of reasoning tokens are used especially, the final outputs could be super lengthy, and this was just their solution (prompting), rather than it actually being meant to be dynamic
2.5 Pro tops another benchmark
GeoBench is an LLM/LVLM benchmark for GeoGuessr.
Will Qwen 3 235b be better than deepseek 3.1 671b? (And Maverick 400b)
Probably
The question is: will the smaller models be even better than Maverick?
Imagine 30b > 400b
below 235b there is only a dense of 14b and a moe of 30b
I don't think it can be better than maverick
apart from their version with reasoning
i think all/most of them are hybrid reasoning models
👀
https://x.com/alexalbert__/status/1916874027756666981?t=eC1sQXsZWHold8XGlIlxbw&s=19
https://x.com/alexalbert__/status/1916874039769120904?t=UGLhieKvS3ariM5WPQxCwQ&s=19
not meant to be a jab at any one lab in particular, just highlighting a particularly bad incentive structure I see rn. there's a reason you don't find Claude at #1 on chat slop leaderboards. it's the LLM equivalent of optimizing for video watch time in a social media algo.
Yes, it has already been leaked, we know that.
there is 15b moe according to the hf pr
might be a placeholder but the pr also mentioned the 8b model which was confirmed
LOOL he deleted them
Yes, I was thinking the same thing and I said it to several people, but in the end, whether it's the leaks from ModelScope or Hugging Face, there’s no trace of this model anywhere.
ya i guess its a placeholder
wont be surprised if its real tho
.
Qwen 3 dropping
yup they're trickling in
Gemini 2.5 pro is a solid model
Its either o3 or gemini 2.5 pro that deserves #1 spot tbh
Qwen 3 released

Qwen3-8B

Qwen3 Highlights

Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Building upon extensive advancements in training data, model architecture, and optimization techniques, Qwen3 delivers the following key improvements over the previously released Qwen2.5:

•Expanded Higher-Quality Pre-training Corpus: Qwen3 is pre-trained on 36 trillion tokens across 119 languages — tripling the language coverage of Qwen2.5 — with a much richer mix of high-quality data, including coding, STEM, reasoning, book, multilingual, and synthetic data.
•Training Techniques and Model Architecture: Qwen3 incorporates a series of training techniques and architectural refinements, including global-batch load balancing loss for MoE models and qk layernorm for all models, leading to improved stability and ov…
lmao someone leaked qwen 3 32b it seems
seemingly one of the randos (non qwen team) in the hf qwen org
32b where?
eh they've started dropping them now anyway
doesn't matter much
not many details there tho
why are there random people in the qwen hf org lol
it'll be officially released literally in the next hour probably
I wonder if we also see something from deepseek this week
it would make sense but who knows
the 235b possibly
depends on how strong their reasoner is
yeah
I do expect it to beat R1 minimum really
if they can't even do that it's a flop
llama 4 reasoning releasing at llamacon tomorrow by the looks of it, frontend is ready
i would also expect behemoth
i think o3 is better at translation than g2.5pro
sounds fun
its gonna be a huge flop
if it's anything like the rest 
there will be a lot of memes about qwen and llama i suspect
idk about qwen
i have a lot more faith in them than i do meta
yann lecooked
yea i meant clowning on meta lol
how qwen was the llama 4 people expected
oh
yeah nevermind
https://modelscope.cn/collections/Qwen3-9743180bdc6b48
still no entries
its a shame that guy is uploading the ggufs publicly before qwen officially announces it
Must be some employee with access to the system
no its a random guy in the hf org i believe
Oh lol
I also hope that 15b moe is real, it's an awesome size
It's insane how many models they are gonna release. With 36 trillion tokens in pretraining not to mention the reasoning training etc
Is R2 coming out this week too?
the DeepSeek team always works silently, no one knows what they are planning 🙃
There was a leak for deepseek too
if its not this week, must be next one
qwen 2.5 is already higher than maverick lmao
Behemoth
one of the chief researchers is tweeting aggressively about it
i would personally expect q3 within the next 24 hours
Yea
Weird they didnt add any new models on lmarena