#general
as they say, it's a codename for the memory feature (weird one indeed tho ha)
gpt-4-turbo is released?
Well its not something eatable
Great name for a brand selling productivity snacks
12 is missing
I wouldn't straightly put gpt 4o as number 3
o3 will eventually be number 1 with more votes
so are 5 and 6
I wonder how that happens
pretty sure it's how it's meant to be
Isn't it some very simple logic?
Or does that happen when models get removed?
actually yeah i dunno
This leaderboard is very poorly managed
The “holes” you’re seeing in the Rank * (UB) column aren’t a bug in the UI, they’re a side-effect of how that leaderboard does its UB‐ranking: it groups all models whose Arena-Scores are not statistically distinguishable (based on the 95 % CIs) into the same bucket, gives that bucket a rank number, then only bumps to the next rank when a model is significantly worse. If no model ever fell into the “bucket” that would have been called rank 5 (or 6, 10, 12, etc.) then you simply won’t see anyone labeled with that rank. In short, the missing numbers are just empty significance‐tiers—no model earned them.
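A minimal sketch of how a rank like this can end up with holes. It assumes the rank is one plus the number of models whose lower CI bound sits above this model's upper bound; that definition and all model names and intervals below are assumptions for illustration, not taken from the leaderboard's code.

```python
from typing import Dict, Tuple

def ub_ranks(intervals: Dict[str, Tuple[float, float]]) -> Dict[str, int]:
    """Rank = 1 + number of models that are significantly better.

    A model counts as significantly better when its lower bound is
    strictly above the target model's upper bound.
    """
    ranks = {}
    for name, (_, upper) in intervals.items():
        better = sum(
            1
            for other, (lower, _) in intervals.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + better
    return ranks

# Hypothetical (lower, upper) ends of 95% confidence intervals.
intervals = {
    "model-a": (1430, 1450),
    "model-b": (1400, 1425),
    "model-c": (1395, 1420),
    "model-d": (1390, 1415),
    "model-e": (1350, 1380),
}

ranks = ub_ranks(intervals)
print(ranks)  # model-a: 1, model-b/c/d: 2, model-e: 5
```

With these made-up numbers nobody is labeled 3 or 4: model-e has four models clearly above it, so it lands directly on 5.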
ty
Whenever two or more models get folded into the same “bucket” because their 95 % confidence intervals overlap, they all take the lowest ordinal rank of that bucket, and the next model’s rank number jumps ahead past all of them. In other words:
• No one owns rank 5 or 6 because the models that would have occupied positions 5 and 6 were tied with higher-ranked buckets and so got pulled up into those buckets.
• Likewise, the three models tied at “9” consume positions 9, 10 and 11 (so you never see a standalone 10 or 11), and leave the next distinct model at rank 12.
It’s just the standard “competition-style” (or “min”) ranking with ties: you collapse tied items into the same rank, then skip over all of the internal slot numbers that are swallowed by the tie.
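The "competition-style" ranking described above can be sketched in a few lines. This is only an illustration with made-up scores, and it treats a tie as exactly equal scores, whereas the leaderboard would be tying on overlapping confidence intervals.

```python
from typing import List

def competition_ranks(scores: List[float]) -> List[int]:
    """Standard competition ("min") ranking.

    Tied scores share the lowest rank of their group, and the rank
    numbers swallowed by the tie are skipped for the next item.
    """
    ordered = sorted(scores, reverse=True)
    # index() returns the first position of a score, i.e. the lowest rank in a tie.
    return [ordered.index(s) + 1 for s in scores]

scores = [98, 95, 95, 95, 90]  # hypothetical scores
print(competition_ranks(scores))  # [1, 2, 2, 2, 5]
```

Ranks 3 and 4 never appear because the three-way tie at 2 consumed them.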
Thanks AI!!!!
It's stupid that's not how ranking works
They overcomplicated it
There are no ties anyway
There should not be buckets
Some idiot tried too hard to seem smart
And vastly overcomplicated a very simple task
i've definitely had ties where both models generate essentially the same ui
recently had grok 3 and maverick do that
found this interesting
surprising to see 2.5 flash cost more, but i suppose it depends on the task
wait until you see phi-4-reasoning-plus, which will be 2-3x of the next entry
Which model do you prefer?
6
9
5
Nightwhisper
do we know what model nightwhisper is?
interesting - cheers!
butt cheeks
I think it's dumb they don't include repeats for cost or output tokens. That's a big part of evals and regular use. Aider's benchmark does, which makes its cost comparisons more valuable
Does anyone who uses NotebookLM know why the new 2.5 update (until yesterday it was still using 2.0) doesn't affect the audio overview (podcast)?
Haha the cost per token is small on the flash but it takes too much tokens to think. I guess the llama models aren't actually so bad. They just presented it wrong.
Also it's surprising that the o3 spending is so high. The artificial analysis must be hell of a benchmark
Doesn't make any sense if NW is an old checkpoint
general purpose vs finetuned for web dev
It wasn't just good for web dev
It could do complex reasoning in 1 shot
- accurate visualisation
being finetuned on something doesnt always necessitate losing abilities on other stuff
Good at following prompts as well
did sunstrike appear in the general arena as well?
Frostwind/sunstrike... Are all finetuned on web dev imo
hmm
i'm expecting qwen 3 235B to be a very efficient model, on this price vs elo plot i'd expect it to be right where my cursor is
non log scale edition:
well maybe my cursor was too far to the left but you get the idea
whats the metric of the x axis
oh i thought the scale only went from 0 to 1 for a moment considering you cropped it
I think it's because o4 mini high is lazy, especially with code
so it's naturally more concise
dario said something that loosely translates to if they didnt cook they wont call it claude 4
eww no
yeah ive had it [sunstriker] a few times in the general arena - iirc thought it was v similar to 2.5 pro, though perhaps a bit weaker
though a week or so ago.. not sure if it's still around
idk, thinking it might be a bit higher price wise, because I am not sure if the current prices (of 0,1 per million in and out from deepinfra and fireworks) will stick
(especially not the fireworks one, as they literally said they are working on the pricing)
well if things change lmb will update
but im expecting deepinfra to stay low
it has a small # of active params and itll only get more optimized after all
yeah but llama price is higher and they have lower active param
deepseek is also close in active param (27B idk for sure though)
as long as its under $0.5 ill be happy
Deepinfra price matched fireworks which has temp pricing but I guess deepinfra didn't realize that and/or still want to compete. I think it might encourage lower prices overall though because of this
it will probably depend on the size of the userbase, as MoE deployment efficiency scales with the userbase
me to
Qwen 235b is slow as molasses on deepinfra, fireworks is an unknown quant iirc (but much faster)
True, 20 tokens/s + thinking Model = 💩
this guy still has a platform?
did he just take the pic shared by techdevnotes, change the background color and make it his own?
Dudes shameless glazing is kinda impressive ngl
well he did some modifications
lmao
you can actually see it in pixels
yea 🤣 🤣
he has no shame
hes a massive troll
he was in chatgpt discord server and we called him out
o
wait this is so funny, i applied some css filters and you can tell that he used the "draw on image" functionality of his screenshot editor to explicitly cover up the background text
omg 🤣
lol this is classic
he's just taking the piss at that point isn't he?
wait we have grok 3.5 alrdy?
yeah and o3 pro on the o1 pro api
the joke isn't hitting no more
i have supergrok, no 3.5..
He's been doing this since last year, it gets him views and he even openly admits that he's lying (at least he has in tweets a couple of times), and laughs at all the people who believe him.
ha yeah i mean kinda fair
He's so openly dishonest that I kinda like that account (not for anything serious though)
strawberry man doesnt even have grok 3.5, buddy woulda screenshot..
yea he left out the ones on customize grok section tho
i cant stand him
this is actually the worst bait of all time
elon isn't gonna let bro tap
this is the most that's happened so far
he has actually a condition called pseudologia fantastica
elon is so stupid it actually annoys me
pseudologia fantastica: 'Pathological lying, also known as pseudologia fantastica, is a chronic behavior characterized by the habitual or compulsive tendency to lie.'
okay buddy you're dumb but now it's everybody else's problem because you have so much influence
go away
🤣
lol I think he's completely lost it
That's literally par for how he talks, he was doing this for the "strawberry" models for upcoming openAI models, last year.
dude is a joke
if it truly is not as good as he says i am finally gonna unfollow him
You follow him unironically??
yo can these fools release o3 pro:
https://x.com/sama/status/1918735773098004680
i have been on a shopping bender this morning, this is much better than i expected!
Is it that crazy?
maybe they are waiting for another lab to release smth first
HAHAHAHAHA
elon ruined it with this engagement thingy
dumbass
https://x.com/btibor91/status/1918700964967796973
claude advanced research ranked at #4
tbh kinda agree about claude deep research
takes a lot of time with a mid-quality report
less details
the only thing thats good is that it can use any mcp
ai font
claude & gemini & grok are more like:
- study #1 found 50% improvements.
- study #2 found 40% improvements.
but they don't really compare studies and give a conclusion, oai dr does that, it goes into different studies, compares them together and gives you the conclusion, it even looks for the same factors/parameters used and list them, it actually understands what it needs to do.
i think it could really be good if they are a proper base model like o3, but they're stuck with 3.7 sonnet
like fetching almost 1k in sources is very impressive
eLon musk sand God confirmed
Guys . the o3 offered in alpha lmarena . Is it high , medium or low ?
Prompt : Based on a critical synthesis of recent, high-quality human clinical trials and systematic reviews, determine which compound – Berberine, Propolis, or Resveratrol – demonstrates the most compelling evidence for promoting overall health.
https://claude.ai/public/artifacts/a8a2d065-ddf4-4acf-853d-ddfd9a2fe15e
https://chatgpt.com/share/6813e578-eb4c-8012-976b-f07d475cdac9
https://grok.com/share/bGVnYWN5_03b16e10-b0c0-45b3-8014-ab796009f7b0
thanks to @small haven
grok chose the easiest road, it selected a study that includes 54 systematic reviews and it expanded on that, but its not really a deep research
but its still better than gemini
oai went in and understood what parameters to compare and analysed them and then came to a conclusion.
which is pretty neat tbh
imma run this on my gemini pipeline
this would take you a lot of time to do
run it boi
you have gemini advanced?
are we sure that the free version uses gemini 2.5 pro for deep research?
no i just used their api + exa
looks like it only did 3 searches and didn't make any tables
let me see what happens if i tell it to search deeper and use tables
go 2
Free version doesn't use 2.5
no, its just engagement bait
what do you think
probably no
gpt-4-32k-0314 access
I didn't know resveratrol reduced BP. Thanks!
sorry noob question, what does Gemini paid give vs the free version? will i see an improvement on the gemini 2.5 pro i am using on AI Studio
They're talking about deep research
Free version of deep research doesn't use 2.5, is my understanding. But maybe my information is outdated and that's changed over the past month.
ok so its the same 2.5 on paid and free models , not the low ,medium high crap Open AI uses
LMAO this is obvious satire
pretty sure he's satirizing the people who really do claim they have early access for engagement bait
ya lol
HAHAHAHAHA
now the question is does he know thats satire and is he playing dumb or
you can tell whether they have early access because they say concrete things
lemme guess, you have access
i like how u can use o3 to quickly test hypothesis... im hard.
No, he knows, he openly mocks all the people that believe him
That's why it's hilarious seeing people take him serious, even here
@torn mantle here's 2.5 deep research on your prompt
https://docs.google.com/document/d/15ZthQT7kwJu0MLgqFTyqGjkX5pdol3qz2YwV922x0m4/edit?usp=sharing
A Comparative Evaluation of Berberine, Propolis, and Resveratrol for Overall Health Promotion Based on Human Clinical Evidence (2015-2025) 1. Introduction 1.1 Overview Berberine, an isoquinoline alkaloid derived from plants like Coptis chinensis and Berberis species; Propolis, a complex resinous ...
it's not "crap". 2.5 is equivalent to low-med (slightly longer than low but notably shorter thinking than medium) and you are stuck with this 1 option. With OpenAI you can choose to have longer or shorter reasoning than that. 2.5 is better base model, but o3-high gets much more done with test-time compute alone than 2.5. This means that for precision complex recursive tasks o3 is simply better.
i have used 2.5 and o3 for medical research; at least in my case scenario, i found 2.5 to be vastly superior
You were probably testing knowledge mostly then, for that you rarely need reasoning...
it does require some reasoning, not as much as maths or CS but yeah
ok which version of o3 is available for plus users
medium, it's the "default"
so is the deep research the high model?
as in, if not said otherwise it's always that
deep research is something custom made to work for that. It was using a version of o3 even before it was released as standalone model
so we dont have access to o3 high at all as plus users right
and any idea if grok3.5 would be integrated to LM arena board
of course it'll hit the leaderboards
whether it will be in direct chat is a different question
This is a high quality report
any bets if grok 3.5 will top the board ?
finally, an actual use for those incessant prediction markets
@brittle tiger it may actually be better than the oai dr version
OAI has a dr version ?, elaborate pls
they werent replying to you
i always runs both on prompts for my usecases. usually prefer 2.5 like 60% of the time but sometimes oai is way better. the one click to google docs is very nice with 2.5 and sourcing is done better
I think the one that was shared was based on o4mini ver
no way?
Nonetheless it was good as well
Thid i
Grok 3.5 is real
eLon ASI confirmed?
I was getting worried for a second it wasn't real that's a relief
i know AGI what is ASI ?
I expect no less from the mastermind Braniac Elon himself
super intelligence ?
grok 3.5 is the greatest event of the millenium
grok 3.5 is a scam?! Shocks the entire industry
damn the greatest model of the millenium has 20% odds of being #1 on arena by end of may
Hey everyone!
I'm new to Arena and excited to join the community! I wanted to ask about how to add the DeepSeek-R1T-Chimera model to the arena.
There are indications that it performs better than DeepSeek-R1, and it would be super interesting to see if it outcompetes other models.
Any guidance on how to get started would be greatly appreciated!
Thanks, DK
(It's been in the top 10 trending models on Hugging Face this week, https://huggingface.co/tngtech/DeepSeek-R1T-Chimera, https://x.com/tngtech/status/1916284566127444468)
such a fkced leaderboard, like oai should be first, not google
Are you saying market odds for top score at end of May are bad or lmarena is?
the latter
I'll ask o3-pro api for some ideas on how to improve
doesnt hit no more
are these new generations or from the archive
surely bing chat no longer exists
just get extra accounts buddy
That user is on discord
it is automated, check x.com/@gork
Actually you may be right, there's a gork on discord that always links to the Twitter account, but now I'm not sure they're even related. Would be interesting if that is 3.5
what
they're basically saying grok 3.5 is around ASI level
lmao might as well have said 1000-5000x because they would easily be able to make self-improving ai
Why is elon retweeting his posts?
Yea i saw that yesterday
One of xai devs asked people to follow @gork or smth like that
Didn't think too much about it tbh
it's probably only slightly better than grok 3
and i bet their reasoning implementation still sucks
i hope they keep gork around
Is gork by x
also looks like gork can use the web/twitter for up to date info
Is it finetuned to respond so sarcastically
it looks like it
elon has been promoting it and it only appeared under a week ago
it's basically just the @grok account but with a proper personality
but the rumour is it's also them piloting 3.5
X is filled with parody accounts, makes it hard to find out whether the message came from the real person
it's either Elon himself or some another bot
its not possible to reply that fast as a human
hes jobless
some new neuralink tech implanted in his brain
100%
understanding a comment/thread and typing it all out. and the sheer amount of replies. its not possible for a human
its a bot 99% likely. grok 3.5 i guess. it makes sense for a marketing stunt
yea i didnt get that until u typed neuralink lol
xd
so we should have like grok 3.5 mini ( the one prob used on gork bot acc ) and reasoning + instruct model?
or more likely a finetuned ver for x
yea this is probably the case
a finetuned ver for x
isnt it just more rate limits
i hope the situation is similar to 2.5 pro but if the rate limits are far worse then its unusable
if it's a huge model like 1.0 ultra the rate limits will be harsh
on ai studio it'll probably be something like 10 per day
So lm arena just stopped updating
hopefully its just slightly larger compared to 2.5 pro and they can still host it reasonably
e.g. sonnet 3 -> sonnet 3.5 increase in size. gemini 1.5 flash -> gemini 2.0 flash size increase. (gem 2 flash lite is seemingly the same size of gemini 1.5 flash)
I guess Google TPU will help them a lot
So even with a large model (>1T parameters) the cost is reasonable
why did you vote for Behemoth that's like easily the least promising one of the bunch 💀
I'm the believer of this paper
And I guess stop caring bout AI that much rn until the intelligence explosion
is a good decision
im not sure they would do a >1t model though. its kinda risky. even 1.0 ultra wasnt 1t+
a model that is slightly larger than 2.5 pro makes more sense, but its still very plausible it could be above or around 1t
Grok 3.5 > R2
from the current situation, my prediction is that larger parameters=better SimpleQA performance
yes this is generally true
but i dont think they made 2.0 pro larger compared to 2.5 pro and there was like a 10% increase in simpleqa
so I think with proper training and architecture, the bigger the better
since LLMs and LMMs don't know what is "fact"
yeah i mean who knows.. but as far as i can tell it could just be a fine tune of an existing grok or whatever (nothing particularly special about the responses; they're just super snarky )
but maybe it is grok 3.5 (presumably fine tuned or with some system prompt) and part of a marketing stunt
Well the old ultra was monolithic and it’s safe to say the new one won’t be. Furthermore a total parameter count of >1t seems very likely considering that most models these days get awfully close to that number (and are significantly weaker than ultra will likely be, e.g. deepseek @ around 700b)
I would honestly suspect that 2.5 pro is already above the 1t parameter count (but that really is just me guessing)
In the age of complicated MoE or mixture of modalities or mixture of interleaved experts and complex adaptive quantisation strategies and complicated spec decode algorithms the total parameter count does not mean as much as it used to.
https://world.org what is this and why is openai promoting it
is this one of the dozens of defi apps
This may be one of Sam's startups
It is fine for me 10 rpd
Does anyone here really believe the new Qwen is better than o1 at coding?
it isnt even thinking there lol
i dont know how to feel about that
Idk maybe because it is not diff, but whole edits
But it really seems a bit too high
AI systems: do you use them or not? Share your opinion! A short survey (7-10 minutes) for psychology research: https://forms.gle/u7EdUQ2DR9BuYVJQA Thank you!
LMAO
When can we expect grok 3.5 on arena? even under anonymous name... since I dont see anyone talking about it yet, I am assuming it is not on arena yet even as anonymous model..
why was GLM-4-32B-0414 not added to the arena?
Lol dude, you need months not weeks
Until o3 pro that coming soon
according to who is a model selected to be added to lmarena?
where is that
wow
he doesn't know about ubi
but there is 4 weeks vs 1 month
HAHAHA
...
hmm this isnt 1024x1536
New gamini 5.6 ultra coder is so good
wtf is this
You don't need crypto for ubi
when are we getting qwen3 on lmarena plzzzz
How are you using it?
in testing? in direct chat? on the leaderboard?
omg
real
fake grok 3.5 is asi++
it scores 100% on every benchmark even on questions with incorrect ground truths because it knows that
"first model that can reason from first principles and come up with answers that simply don’t exist on the Internet"
only 7.8% improvement on google proof QA
the scores are somewhat plausible
I didn't find it
was this just added today?
for the record:
You may only use this website for your personal or internal business purposes. You must not access the website programmatically, scrape or extract data, manipulate any leaderboard or ranking, or authorize or pay others to access or use the website on your behalf. Unauthorized use may result in suspension or termination of your access, including access by your organization.
So do we just cancel all our ai services if it's true?
Indeed
SuperGrok
Fake as hell
Is the way to go
Those grok 3 eval numbers are all wrong
I don't think so
Its realistic
Gemini 2.5 Ultra isn't released yet
they arent
seems to line up with grok 3 thinking on the blog post
link? am i looking at wrong numbers?
look at grok 3 beta thinking and hover ur cursor over the scores to see non cons 64
but not exactly...
84.2 is not 83.9, 77.1 is not 77.3, 80.4 is not 80.2, 76.2 is not 76
yea variance is expected
if they remeasured it
the simpleqa score increase is sus, but i don't know
if it is fake the person who made it understands how scores work a little though its somewhat believable
comparing cons@64 to pass@1 for 2.5 and o3 is dumb though
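For anyone unsure why the two aren't comparable, here's a rough sketch. It assumes cons@k means majority voting over k sampled answers and that pass@1 is estimated as the fraction of individual samples that are correct; the sample answers are made up.

```python
from collections import Counter
from typing import List

def pass_at_1(samples: List[str], truth: str) -> float:
    """Estimate pass@1 as the fraction of independent samples that are correct."""
    return sum(s == truth for s in samples) / len(samples)

def cons_at_k(samples: List[str], truth: str) -> bool:
    """cons@k: grade only the most common answer across the k samples.

    Ties between equally common answers go to whichever appeared first.
    """
    majority, _ = Counter(samples).most_common(1)[0]
    return majority == truth

# Five hypothetical samples for one question whose correct answer is "42".
samples = ["42", "41", "42", "17", "42"]
print(pass_at_1(samples, "42"))  # 0.6
print(cons_at_k(samples, "42"))  # True
```

Same model, same samples: the single-shot estimate is 60% on this question while the voted answer counts as fully correct, so a cons@64 number next to someone else's pass@1 flatters the voted model.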
+100% wen?
true
its probably that
bigbrain = you
typing so fast
wait
are you @gork ?
Gork is indeed by xAI
keyboard practice+ai
I pinged gork and grok replied
yea gork is grok 3.5
A tweaked 3.5
if it is
Why would they bring an old grok 3 bot to life?
It came around only a week ago, we believe for testing the humour of LLMs
They decided to run a live test instead of lmarena
hopefully tomorrow on lmarena too
where did you get this from?
those are pretty insane numbers for a base model (that is... if those benchmarks are for the base)
scroll up "arbitrary leaks" apparently
oh
i mean, the design of the table does seem in line with xAI
this is the reasoning variant if it is real im pretty sure
not much else to go off of tho
This is not a credible source, so don't expect this to be true
i doubt that
nah
the variance in the grok 3 scores could indicate remeasuring, its quite a minor detail for someone to notice/fake
at least by default
does anyone happen to have any messages back from when people were still doubting reasoning
Lmao
yeah I doubt any run of the mill troll would be paying that close attention
How much context and parameters will grok 3.5 have?
I want 1M context
nah elons probably been hounding the engineers
"api on launch"
"make it so i can put all of the x codebase in there"
given the increasing pace of stuff about 3.5 trickling out, i would say a release in the next few days is quite likely
Ive been using grok 3 on and off, I've actually found its gotten worse and hallucinating more
openai, anthropic, xai... thats what they always say
to this date there hasnt been any real evidence in #1278144735264903178
anthropic discord
finally i have a reason to mod
yeah. this week seems reasonable. i bet team is moving extra hard to get it out before I/O. even if 3.5 is better than I/O releases they wouldn't want to take the chance
tf
i swear llms are always so cringe trying to act "gen z"
if that acc is using grok 3.5 then i have low expectations
the best model i've ever tested for doing it while not making me want to die is claude 3 opus
jailbreaked opus was so good
Is opus no longer served? Rip
Btw should I get supergrok or chatgpt plus/pro?
One for grok 3.5 another for o3
I wonder whether AI will be given a voice
gork voice would be fortnite zoomer vocal fry fr fr
o3 pro next week i can feel it
pretty sure this guy is responsible for 90% of gork hahah
Yep
Its him
I was going to say that but i forgot
Its probably him
Wdym looks like agi to me
is that not a real person
rip turing test
Dude is honestly weird, even changing his name to gorklon rust was whew
So cringe
thats our era's edison right there
What in tarnation
Btw should I get supergrok for 3.5 or chatgpt plus or pro for o3?
Need both a frontier reasoner and deep research
But my usage is not enough to exhaust both
the edge is obviously chatgpt
Can’t get gemini advanced, I made my account in unsupported countries
Good job
😭
Imply I picked where to live in
The Google accounts I have are few years old
Skill issue
Its better than paying $200/month
unpopular opinion but chatgpt pro is cheaper than gemini advanced/claude max/supergrok combined, iykyk
from all the oligarchs in the world he is acting like the dumbest of them all
like he is regarded or smth 
Free hype for the normies
the fact he retweeted that makes me believe those numbers are true?
u proly right haha
is grok 3.5 in arena now?
But did other labs use cons@64?
if not it mean it is bogus
they have in the past
anthropic did for sonnet 3.7 i think its not clear how many samples they used
they report pass@1 and "parallel test time compute" (seemingly cons-like, as they mention they sample several sequences)
weird how they dont mention the sample count at all
the language used seems to be trying to obfuscate the nature of it
there are so many footnotes on the 3.7 sonnet benchmark graphic lol
it's probably not that new, prolly o3 is built on 4.1 and 4.1 is a few months old
bc they have same knowledge cutoff
it is. they retrained o3 though
they are pretty good and have a lot of compute but they haven't rly closed the gap, like research wise they will just have fewer algo improvements than oai, gdm or anthropic will
cus elon is talent poaching across his 3 companies, thats his moat lol
i doubt its an early checkpoint if its real
this is the version theyre gonna release soon
"early checkpoint" is being used as a marketing term there
did they ever release grok 3 (full) reasoning btw?
the question is o3 pro ≷ grok 3.5 ?
jesus christ
ok coming clean i made that chart
i wanted to get my empty twt account to some amount of followers so it wouldn't get shadow banned
lol are you fr
yea
send a ss of the profile
Because of prior experiences
it's just an edit of the grok 3 blog post chart
with numbers copy+pasted from the gemini 2.5 pro blog post that i noised a little
Hey guys, I have been working on a project for systematic LLM jailbreaking and red-teaming. Would appreciate any feedback!
the leaderboard hasnt been updated for 13 days
the next update should not have grok3.5 on it i guess?
grok3.5 isnt an anon model afaik
yeah so 2.5 pro will still be 1st...
that dude is a clown tho right?
ill take a look, nice work tho
New model in Arena: cobalt-exp-beta-v10
New meta model?
Elon retweeting fake screenshots from known grifters im dead
it worries me that we will have 2 basically relevant models that are both referred to as 3.5
with claude and grok
simpleqa needs to be 100% too
its asi++
so i call it fake
im joking but i agree
i doubt theyre gonna do this (make it larger)
maybe. some benchmarks it might not be possible to reach 100% w/o contamination either. (wrong questions/ground truths, so it'd have to be trained on the questions to get the right/wrong answer)
2.5 ultra might beat gpt 4.5 in simpleqa, but we'll see
im super interested in it
scroll up
how far
ok well we were generally skeptic from the start
i was talking about this one post
that wave guy might be the same guy
idk' lol im gonna go to bed
That's Elon's thing. Promote fabricated things and call everything real you don't like propaganda/fake
Guess what hes doing rn?
Reposting that fake scs
I refuse to believe its better than o3 and gemini 2.5
Xai are so loud when it comes to their product, if it was really a big leap they would just straight up say that
openai:
Openai is on another level of hyping stuff
i mean atp someone from xai would have had said to him its fake, but hes still retweeting it mins ago.. lol
polymarket odds dropped biggly as of 20 mins ago, very interesting..
lmfao
Cuz of elon's post
Should the best models be P2W or no
market is never wrong!
It was wrong a couple times last year
Everyone that did their research didn’t bet on OpenAI to have the best
yea im kiddin obv lol
theres daily "quests" and you could buy more
just like in games, you can cash in but you can't cash out
well you used to be able to donate it to charity and they had a short stint with "sweepcash" but that's gone now
unfortunately not
lol no
not sure what ev means in this context
if youre talking about what you get out of it:
its just intellectual stimulation
you want to be a good predictor
you want to make the leaderboards
you dont want to go broke
you want to see number go up
LMAO, damn people are gullible
u can't be serious.. 😭
It ain't even coming this month
smart money would buy lowkey
can u make one for o3 pro with expected day
amazon i think. they have been releasing iterations pretty steadily (originally it was cobalt-exp-v1, last week or so it was up to v7, now v10 ig). hasn't been remarkable at all in my experience, but seems like they're in the process of getting something ready to release
I don't know yet. Will you harm me if I harm you first?
spit it out @worthy thunder
Context Arena Update: Added several Mistral LLMs to the MRCR 2needle leaderboard. (https://x.com/DillonUzar/status/1919191240123289920)
AUC @ 128k Results (Mistral Models):
- mistral-small-3.1-24b-instruct: 47.7%
- mistral-large-2411: 24.1%
- ministral-8b: 22.8%
- ministral-3b: 13.9%
See all results at: https://contextarena.ai/
Mistral Small 3.1 is currently performing between GPT-4.1 Mini and o3-mini based on the AUC @ 128k metric. For comparison, Gemini 2.5 (the current leader) and GPT-4.1 have been added to the main chart.
NOTE: The results for Mistral Small 3.1 (2503) and Mistral-8B (2503) are dated March 2025, while the others are from October/November 2024. Tests were conducted using endpoints from @OpenRouterAI and @MistralAI (presumably BF16).
More to come: Other model results will be released sporadically over the next few weekdays (including some 4needle and 8needle results), alongside new UI features. I've been pretty focused on analyzing some of the model results recently and figuring out ways to provide better insights and options for grading, so model results have slowed in the process. I will provide a summary once the rollout is complete.
Some UI enhancements have already partially rolled out, such as new information displays on hover and updated hover effects. A new diff viewer has also partially rolled out, with further improvements planned.
Enjoy.
are you just posting the archives or do you somehow still have real access
nah that's not consistent in front of technological progression
in front of any
looks like a daily limit on top of the monthly one
yea looks like it, it just becomes more stricter as time passes, not better
lol
they are severely gpu constrained
you pay 200?
ya not wurf it
Dum question
So theres the qwen3 235b a22b gguf model right
Which means there are 235b parameters in total but only 22b parameters are activated at once
Can i run it with the same amount of ram as a regular 22b gguf model?
Or do i still need enough ram to fit an entire 235b model inside?
Or something in between 🤔
You generally still need enough RAM to fit the entire 235B model (like in this model)
afaik MoE doesn't activate experts based on what prompt you gave, it's done on a per token basis so a single prompt can involve a lot of different experts. So if you can't fit the model into ram fully, it'll have to load from your disk to load in the right experts, that will be much much slower than loading from ram
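Back-of-the-envelope numbers for the above, as a sketch only. It assumes a flat 4 bits per weight (real GGUF quants vary and are often a bit higher) and ignores KV cache, activations and runtime overhead.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

# Assumed quantisation: 4 bits per weight.
total_gb = weights_gb(235, 4)   # every expert has to be resident
active_gb = weights_gb(22, 4)   # what is actually touched per token

print(total_gb)   # 117.5
print(active_gb)  # 11.0
```

So the RAM requirement tracks the 235B total, while the 22B active figure mostly helps with speed per token.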
Gotchu
Some context for where the Grok 3.5 stands
Seems like the reasoning is SOTA, but the base model is not as good as OG GPT 4.5
The next generation of models (GPT 5, Gemini 3?) will saturate most benchmarks 😄
those are fake benchmarks
Your opinion or confirmed?
kinda surprising that its not added yet on lmarena
Then why musk retweeted
Haha if he really retweeted fake benchmarks for his product he's at least braindead
Missing Elon of 2017 so much
I feel like that's how he came into power (politics) in the first place. By promoting the party he's aligned with, for the most part with misinformation on his social media platform. I don't think he gets the concept of verifying the info, or he just deliberately chooses not to do it way too often and one-sidedly.
What's with this space. Avengers of misinformation
typical unemployed person living off twitter engagements
Elon (Adrian Dittman) seems quite employed 😄
Maybe wait for a credible source first lol, literally nothing confirmed
Holy coalmine of ian miles cheong
He is one of the last people, like ever, to be informed on AI
will we ever get o3 pro?
it is
Does anyone know how to continue chatting with the model after voting and finding out its name? I know how, but is that method well-known?
yes
lmaoo
When tf is grok in arena
r2 will be the cheapest there
might not beat the others
i didnt test it much/at all. (my impression of it is primarily through the comparison images, im largely uninterested in webdev)
i largely dont
idk i dont really code using ai yet so idk about best practices
yea
probably making it more verbose. It's relatively concise. This could indirectly prolong thinking as well
I mean it is relatively speaking, esp if you consider the entire output thinking included and compare that to medium/high
it's a thinking model, it needs to output a ton if you are after maximizing every last bit of performance 👀
o3 medium does less, at least based on the thinking tokens used to run the artificial analysis gamut
link? The test I did myself earlier put it well below medium
I think this could be measuring smth else. Otherwise this makes no sense lol
?
oh
it isnt
it makes sense if u include the pricing, o3 is still the most expensive model to run there
even though the tokens outputted are less
it's weird. I'm sure there's some kind of explanation why they came up with this. But we know for a fact IRL R1 outputs less thinking tokens than O3, not even talking about 2.5...
? where did u come with that
the r1 thing
if you look at the cost to run it makes sense
eh?
there are ton of providers that cap it at 8k output. Also from my experience it rarely goes beyond that. While for o3 it's not unusual at all to see 20k+
when they cap it at 8k its usually the output response tokens not including cot output tokens
it's literally how many tokens are used to run artificialanalysis' benchmark. it's not based on or meant to represent regular usage or specific usage like coding
also note that it's 2.5 flash at the far left (2.5 pro also uses way more tokens than o series models, but not as many)
it's including the thinking. Saw it getting cut when that was capped at 4k (sambanova?)
this is how deepseek does it
thats different
and its no longer the case
yeah tbf it gets capped / cut-off on OR too sometimes - but i think that's on OR's side rather than deepseek's
its more but deepseek host it only at 64k i think
they have a reasoning effort param
which seems poorly implemented (trying to do it across providers)
i dunno
that's how Deepseek does it but almost no one else hosting R1 does I believe
it's capped for the entire thing
nah not a prompt.. not sure tbh
eh i dunno but it would be hilarious if that was their implementation
yeah they usually just pass through
i dont think or does anything
if they do its not intended
yeah i know what you're saying
they are proxying all of the requests they can mess with it
i highly, highly doubt they are
i think they're assuming google etc will have actual reasoning budget hyper parameters at some point
i dunno what it does in the meantime
@keen beacon maybe OpenAI are better at overfitting 💀
but like I said it's weird, will do some more testing
How would they get the data if you are running it locally?
5g covid something like that (/s)
Oh, it has been like that for years already
Llama 4 was trained on Facebook data iirc
whatsapp is encrypted right? i assume if u communicate with meta bots/etc then ur data is collected
for sharing war plans
they are capping entire output just like before, but they did bump that limit from 4k to max configurable 8k now
o4-mini-high and o3 would be essentially unusable even with 8k
interface limit
my point is that this number includes thinking
and also that there are still a ton of providers including azure, with low limit where r1 is still very much usable
now compare that to this:
R1 would never generate anywhere near that
yeh because its capped at 32k reasoning tokens, at least on deepseek api
I have never seen it reaching that cap personally
and that task requires a lot of reasoning tokens. deepseek usually does 15k+ on some of my problems
sometimes more
it has no problem providing the answer with sambanova 8k cap
is it right tho?
o4 mini tries harder
even if it is wrong
who cares if it's right. We are talking which one outputs more thinking LOL
ok its nonsensical then
ur argument
u have to take into account whether the answer is right or not, or its arbitrary
not if you can't get R1 to get anywhere close to that no matter the prompt
Also, this is not claude. Model has no clue about thinking budget, it will only get cut off. If it doesn't then it used all the tokens it could want
So like... Try sambanova interface and come up with a prompt for R1 that would use more for the thinking than 8k cap. I'm sure you will find this is next to impossible lol
@keen beacon
while for both o4-mini and o3, this is super easy
? its so easy LMAO
There are 5 houses, numbered 1 to 5 from left to right, as seen from across the street. Each house is occupied by a different person. Each house has a unique attribute for each of the following characteristics:
- Each person has a unique name: Eric, Alice, Peter, Bob, Arnold
- Each person has a unique type of pet: hamster, fish, cat, dog, bird
- Each person has an occupation: doctor, engineer, artist, lawyer, teacher
- Each person has a favorite color: green, blue, yellow, red, white
- The people keep unique animals: bird, cat, horse, fish, dog
Clues:
- The person who owns a dog is Arnold.
- The bird keeper is in the fourth house.
- The person who keeps a pet bird is directly left of the dog owner.
- The person who loves white is somewhere to the left of the person who is a lawyer.
- The person who loves yellow is directly left of the person whose favorite color is green.
- The cat lover is Bob.
- The cat lover is somewhere to the left of Eric.
- The person who keeps horses is in the fifth house.
- The person who is a lawyer is directly left of the person who is a teacher.
- The person who is a doctor is in the first house.
- Alice is the person who loves yellow.
- The person who loves blue is directly left of the person with an aquarium of fish.
- The person who loves yellow is in the first house.
- The person with a pet hamster is the person who is an artist.
- Eric is the dog owner.
ok I stand corrected. But all attempts I did of this are still below 16k. So my point stands. I just showed 50k with o4-mini-high above without even trying 😄
ok how many tokens did o4 mini take though lol (on that problem)
also its a single prompt. look at artificial analysis where they ran several benchmarks. plus with your prompt it didnt even get the answer right with less tokens, so its arbitrary in that instance too. might as well call og gpt 4 so much more efficient if u just look at that. (it does not make sense if you do not consider correctness)
how youre contesting the token counts from several standardized benchmark runs just because of that 1 prompt is ridiculous
and the comparison/reasoning for that 1 prompt does not make sense
less, but this is task specific... I focus on the maximum instead, which is way more accurate I believe. If you take a subset of tasks, a certain model could generate more purely because it's a different model, which is probably how artificial analysis arrived there. What's more important is how far it can go
when R1 generates more, it's not by 40k more lol
and with OpenAI it is
ok but u have no data to prove that u literally have a single anomalous prompt. then ur justifying it with specious reasoning
wdym. There are plenty of prompts where OpenAI will generate beyond 16k, you don't even need to be trying...
I know this sounds stupid but,is 4.5 still available on the web?(battle part,not direct chat)
I think 1 month ago it was available
those questions are probably a biased sample unless they're representative. those standardized benchmarks that artificialanalysis runs are way more representative of actual usage than potentially cherry picked problems that cause specific model outliers. and does r1 even get the questions right? (if it uses less tokens but gets the answer wrong it is also meaningless)
I don't think it's meaningless since it could get it wrong specifically because it's reluctant to output more.
I think the truth here is somewhere in the middle to be completely honest. Maybe on average it's that - OpenAI models fairly mindful of the usage and don't output a lot at all times. But when the model sees a task it can't solve without outputting a ton.. O3/O4 will do it and R1 or 2.5 most likely not, saw Gemini taking shortcuts to arrive at the answer faster more than once. And as for R1, once again it seemingly can't even get past 16k
So maybe that average metric does not tell us all that much. You wouldn't want a model that outputs more when that's not needed and less when it is needed. That kind of artificially bumps up the average (relative to what it is capable of performance wise with test-time compute)
Esp since we have quite a few of the models wasting tokens on 2nd guessing themselves rather than doing the task efficiently lol
so yeah... if we took like 100 test prompts and model1 would do say 50 attempts around 90k generated while the other 50 around 10k;
while model2 would generate consistently around 55k for all of them leading up to slightly higher average...
I would still say model1 is much less test-time compute limited and better optimised tbh
it would obviously be something way more random, but for the sake of simplicity assuming easy to reference numbers
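Quick sketch of that hypothetical with the same made-up numbers (100 prompts, model1 split 50/50 between ~90k and ~10k, model2 flat at ~55k). The 80k cutoff for "long runs" is an arbitrary assumption just for illustration:

```python
from statistics import mean

# Made-up token counts from the hypothetical above, not measured data.
model1 = [90_000] * 50 + [10_000] * 50
model2 = [55_000] * 100

def summarize(tokens, long_run_cutoff=80_000):
    """Mean, peak, and share of runs above an (arbitrary) cutoff."""
    return {
        "mean": mean(tokens),
        "peak": max(tokens),
        "share_long": sum(t > long_run_cutoff for t in tokens) / len(tokens),
    }

stats1 = summarize(model1)
stats2 = summarize(model2)
print("model1", stats1)
print("model2", stats2)
```

model2 wins on the average (55k vs 50k) while model1 has the higher peak and is the only one that ever goes long, which is the gap between "average tokens" and "how far can it go".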
looking at 2.5 some more digging into thinking, it does have a weird habit of rewriting the thinking into final response, even the parts where it self-corrects pretending it made the same mistake all over again... So you do have a healthy amount of generated data that is completely pointless 
but if you made it even more verbose, that would very likely help with the actual useful problem solving part as well
Back in the day, the Prolog programming language would have been a great tool to solve this kind of problem
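Not Prolog, but the same idea as a brute-force sketch in Python. Heavy assumptions: it only models names and pets, uses the hamster/fish/cat/dog/bird list as the pet set, and only encodes the clues about names, pets and house positions (colors, occupations and the horse clue are left out):

```python
from itertools import permutations

NAMES = ("Eric", "Alice", "Peter", "Bob", "Arnold")
PETS = ("hamster", "fish", "cat", "dog", "bird")

def count_solutions(include_eric_dog_clue: bool = True) -> int:
    """Count name/pet assignments that satisfy the encoded subset of clues."""
    count = 0
    for names in permutations(NAMES):      # names[i] lives in house i + 1
        for pets in permutations(PETS):    # pets[i] is kept in house i + 1
            dog, bird, cat = pets.index("dog"), pets.index("bird"), pets.index("cat")
            if names[dog] != "Arnold":      # the person who owns a dog is Arnold
                continue
            if bird != 3:                   # the bird keeper is in the fourth house
                continue
            if bird + 1 != dog:             # bird is directly left of the dog owner
                continue
            if names[cat] != "Bob":         # the cat lover is Bob
                continue
            if cat >= names.index("Eric"):  # cat lover is somewhere left of Eric
                continue
            if include_eric_dog_clue and names[dog] != "Eric":  # Eric is the dog owner
                continue
            count += 1
    return count

print(count_solutions())       # all encoded clues
print(count_solutions(False))  # without "Eric is the dog owner"
```

As pasted, the clue list gives the dog to both Arnold and Eric, so under these assumptions the full subset has no valid assignment, and dropping the Eric clue makes it satisfiable again.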
i wonder if they'll release on api or not
imagine if they release the scores for grok 3.5 full reasoning but never release it lol like grok 3 full reasoning
i have a feeling the reason they didn't release grok 3 full reasoning was because it sucked
hopefully grok 3.5's reasoning implementation does not
so they wouldn't have a reason to
i wonder if they stopped using qwq preview traces 🤣 (very suspect)
probably didn't help things ☠️
maybe theyre using r1 for cold start now 🤣
lmaoo
did they? totally missed it
never spend much time on xAI
yes. take it with a grain of salt tho (obviously speculation) but i am reasonably confident in it. not gonna explain it again tho
Hi all, I just joined so sorry if this is the wrong channel to ask this question. But I was curious if anyone has a plot of ELO vs model release date. Thank you so much!
someone does
its vaporware
It’s confirmed
How come they are releasing before letting it loose anonymously on arena? I am guessing it's not meant for arena questions...
elon probably wouldve wanted to flex the arena score if it was better than 2.5 pro. its probably meh/fine, and the xai guys didnt think they should put it on as a pre-release (if it cant beat 2.5 pro)
maybe grok 3.5 will be great though, idk. even if it is good, its gonna be unusable for me because of the X peddling 😭
what time?
Evening
which timezone?
Pt probably
PT
maybe next monday
nahh it has to be this week
it was based on this post https://x.com/veggie_eric/status/1919420805987082535
It's gonna be an awesome week I can feel it. Have a great Monday!
he seems excited for the release
same
i had access for 2h
then it glitched out
how was it?
yes
finally we will see Gemini dethroned
2.5 pro level?
like better than 2.5?
i doubt 2.5 pro will be dethroned tbh
not sure, needs more testing
well someone lied, lets follow that lie
it may cost more but it will be better
we should start asking Craig
ahahaha
sure
unlike O series model Grok models are versatile like Claude
wow
isnt grok 3.5 asi?
they skipped agi like o1 -> o3
it better be asi or i am unfollowing that guy finally lol
wait what
are you following him?
Strawberry guy is a middle schooler
yeah, for entertainment
bruh what
nah this is sus
Grok 3.5 Is new Elo king fr
wym?
My friend knows 🍓 guy irl he’s in middle school
i follow a lot of ppl, and he was right about o4 in april
Do you see any Ultra releasing today? No? Then stop tickling my impatient nuts
he saw it in the UI
added for 1 sec
Alright '
fr?
I hope it ultra sucks cuz gemini is not based
can i stomp on your nuts instead?
Yeah his name is Jonathan
source: schizophrenia
if it is for AGI ok
why would you follow a grifter, a liar, an attention seeker?
he really doesnt have any insider infos
sydney gpt4
Because he has seen AGI fr
grok 3 already has X ads
🍓 guy is from Alaska and he’s in 7th grade
/\_____/\
/ o o \
( == ^ == )
) (
( )
( ( ) ( ) )
(__(__)___(__)__)
i follow elon lol and a whole bunch of other ppl like that
i have them blocked tbh
you blocked elon?
you can do that
i blocked him
i didnt even know you can block ppl on X
Yeah
strwb? or the other one
Strawberry
His real name is Jonathan
My friend has a picture of him logged into his account on his laptop
I’ll ask him for it
Skinny guy from middle school
Yeah
No reply yet he’s busy
What the strawberry guy from twitter?
“iruletheworldmo”
Still stuck in 2000s
Qwen 3 on webdev is it also the thinking version?
for those who want to know what the 4 new leaderboard models are
Qwen 3 235b
Gemma 3 12b
Gemma 3 4b
Olmo 2 32b
We waiting for nova premier on the arena (to make fun of him a little 😅)
https://x.com/gork so this is....
day 19 with no o3 pro
in other words, peak reasoning token counts and frequency of those counts is way more important than the total average for determining how much test-time compute a model can use.
i like this guy
how good is qwen 3 at translation?
It's asi ofc
You can try a prompt for me ?
ofc he can
im still waiting on a ss of o3 pro on o1 pro api
Are you kidding me ?
no?
is deepsearch or deeperresearch powered by grok 3 or grok 3 mini?
Is think ?
Ah, I just saw that he no longer has access.
What's up with this variance 🤔
Grok 3 , the think option is grok 3 mini . If i remember well
he never had access
he should make an account like iruletheworldmo
qwen3 does extremely well in distribution
craig is iruletheworldmo confirmed
It's probably first time model is so good at math and so bad at style control
And you too ?
no one has access
actually thats plausible
he talks like him too
shows its strength
with less common languages 30b and 32b seems quite bad. 235b is decent
it can surprise you sometimes, but it's way too inconsistent overall and can fail easy prompts you wouldn't expect. That's the consensus on it I would say.
quantity wise grok's deep research beats chatgpt
quality wise the other way around
quantity is irrelevant though if it can't deliver lol
if it is full grok 3 then it is cheap for them to do deep research on grok 3 as opposed to OpenAI using o3
cost wise I doubt it's cheaper for them. It's a bigger model but potentially less test-time compute. I would say comparable if not more expensive...
OpenAI's deepresearch could be something like o3-high except with even longer outputs finetuned for search specifically
then I wonder how grok can offer a much higher limit
then it means the test time compute is defo more on o3
grok API pricing is super competitive too. While OpenAI is in the position now where they can kinda just charge more and get away with it... 🤷♂️
which languages did you test?
day 19 since o3 and it is still as magical
I'm getting pleasantly surprised at LLMs' ability to translate into less common languages
@ocean vortex can you PM me? You seem to be blocked this
Trust me, I was testing it on a niche language that only 300k or something people speak in the world (nobody counted us). Big models know most of the words and can make poems.
Funny thing is that literature is almost non existent. I can't understand why it is in distribution. Maybe emergent property.
what language btw
if u dont mind
well... if you can google texts in that language it was probably trained on it 👀
So , no grok 3.5 today ? 🙄
I'm waiting for dork 4.0
when will they release gork
elon ma said next week on apr 29
grok 3.5 in june
gork > grok
When you're such a good model you max out on incorrect benchmarks (MMLU).

