#general


keen fulcrum
#

What is Teamfood?

alpine coral
patent bane
#

gpt-4-turbo is released?

keen fulcrum
brisk turret
#

Jesus Christ update the leaderboard

#

You had one job!

keen fulcrum
#

12 is missing

#

I wouldn't straight up put gpt 4o at number 3
o3 will eventually be number 1 with more votes

alpine coral
keen fulcrum
#

I wonder how that happens

alpine coral
#

pretty sure it's how it's meant to be

keen fulcrum
#

Isn't it some very simple logic?
Or does that happen when models get removed?

alpine coral
#

actually yeah i dunno

brisk turret
#

This leaderboard is very poorly managed

alpine coral
#

nah they periodically update it..

#

calm down lol

hardy pecan
#

The “holes” you’re seeing in the Rank * (UB) column aren’t a bug in the UI, they’re a side-effect of how that leaderboard does its UB‐ranking: it groups all models whose Arena-Scores are not statistically distinguishable (based on the 95 % CIs) into the same bucket, gives that bucket a rank number, then only bumps to the next rank when a model is significantly worse. If no model ever fell into the “bucket” that would have been called rank 5 (or 6, 10, 12, etc.) then you simply won’t see anyone labeled with that rank. In short, the missing numbers are just empty significance‐tiers—no model earned them.

alpine coral
#

ty

hardy pecan
#

Whenever two or more models get folded into the same “bucket” because their 95 % confidence intervals overlap, they all take the lowest ordinal rank of that bucket, and the next model’s rank number jumps ahead past all of them. In other words:

• No one owns rank 5 or 6 because the models that would have occupied positions 5 and 6 were tied with higher-ranked buckets and so got pulled up into those buckets.
• Likewise, the three models tied at “9” consume positions 10 and 11 (so you never see a standalone 10 or 11), and leave the next distinct model at rank 12.

It’s just the standard “competition-style” (or “min”) ranking with ties: you collapse tied items into the same rank, then skip over all of the internal slot numbers that are swallowed by the tie.
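
A minimal sketch of how those gaps can arise. It assumes Rank (UB) is computed as one plus the number of models whose confidence-interval lower bound sits above this model's upper bound; that definition, the function name `rank_ub`, and the scores below are illustrative assumptions, not the leaderboard's actual code or data.

```python
def rank_ub(intervals):
    """Map each model to 1 + the number of models that are statistically better.

    intervals: dict of model name -> (ci_lower, ci_upper) for its score.
    A model counts as statistically better when its lower bound is above
    the target model's upper bound (assumed definition).
    """
    ranks = {}
    for name, (_, upper) in intervals.items():
        better = sum(
            1
            for other, (other_lower, _) in intervals.items()
            if other != name and other_lower > upper
        )
        ranks[name] = 1 + better
    return ranks


# Made-up intervals: the first three overlap, the fourth is clearly lower.
scores = {
    "model_a": (1400, 1420),
    "model_b": (1395, 1415),
    "model_c": (1390, 1410),
    "model_d": (1350, 1370),
}

ranks = rank_ub(scores)
print(ranks)
```

With these made-up numbers the three overlapping models all share rank 1 and the fourth lands on rank 4, so ranks 2 and 3 never appear.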

#

Thanks AI!!!!

brisk turret
#

It's stupid that's not how ranking works

#

They overcomplicated it

#

There are no ties anyway

#

There should not be buckets

#

Some idiot tried too hard to seem smart

#

And vastly overcomplicated a very simple task

north horizon
#

i've definitely had ties where both models generate essentially the same ui

#

recently had grok 3 and maverick do that

keen beacon
#

found this interesting

#

surprising to see 2.5 flash cost more, but i suppose it depends on the task

alpine coral
earnest parcel
# alpine coral

wait until you see phi-4-reasoning-plus, which will be 2-3x of the next entry

keen fulcrum
#
Poll: Which model do you prefer?
Winner: Nightwhisper (6 of 9 votes)

solar hollow
#

do we know what model nightwhisper is?

alpine coral
#

interesting - cheers!

golden ocean
#

butt cheeks

brittle tiger
# keen beacon found this interesting

I think it's dumb they don't include repeats for cost or output tokens. That's a big part of evals and regular use. Aider bench does, which makes its cost comparisons more valuable

ornate stump
#

Does anyone who uses NotebookLM know why the new 2.5 update (until yesterday it was still using 2.0) doesn't affect the audio overview (podcast)?

calm sequoia
#

Also it's surprising that the o3 spending is so high. Artificial Analysis must be a hell of a benchmark

keen fulcrum
#

Wow only two voted for sunstrike

#

Sunstrike is probably gemini 2.5 ultra

torn mantle
keen beacon
#

general purpose vs finetuned for web dev

torn mantle
#

It wasn't just good for web dev

#

It could do complex reasoning in 1 shot

#

accurate visualisation

keen beacon
#

being finetuned on something doesnt always necessitate losing abilities on other stuff

torn mantle
#

Good at following prompts as well

keen beacon
#

did sunstrike appear in the general arena as well?

torn mantle
#

Frostwind/sunstrike... Are all finetuned on web dev imo

keen beacon
#

hmm

leaden palm
#

i'm expecting qwen 3 235B to be a very efficient model, on this price vs elo plot i'd expect it to be right where my cursor is

#

non log scale edition:

#

well maybe my cursor was too far to the left but you get the idea

tall summit
leaden palm
#

price

#

/ mtok

#

mixed

tall summit
# leaden palm price

oh i thought the scale only went from 0 to 1 for a moment considering you cropped it

blazing rune
#

so it's naturally more concise

leaden palm
#

dario said something that loosely translates to if they didnt cook they wont call it claude 4

small haven
#

eww no

alpine coral
# torn mantle Yea

yeah ive had it [sunstrike] a few times in the general arena - iirc thought it was v similar to 2.5 pro, though perhaps a bit weaker

#

though a week or so ago.. not sure if it's still around

unborn ocean
#

idk, thinking it might be a bit higher price wise, because I am not sure if the current prices (of 0,1 per million in and out from deepinfra and fireworks) will stick

unborn ocean
leaden palm
#

well if things change lmb will update

#

but im expecting deepinfra to stay low

#

it has a small # of active params and itll only get more optimized after all

unborn ocean
#

deepseek is also close in active param (27B idk for sure though)

leaden palm
keen beacon
#

Deepinfra price matched fireworks which has temp pricing but I guess deepinfra didn't realize that and/or still want to compete. I think it might encourage lower prices overall though because of this

unborn ocean
#

it will probably depend on the size of the userbase, as MoE deployment efficiency scales with the userbase

unborn ocean
keen beacon
#

Qwen 235b is slow as molasses on deepinfra, fireworks is an unknown quant iirc (but much faster)

unborn ocean
torn mantle
#

this guy still has a platform?

#

did he just take the pic shared by techdevnotes, change the background color and make it his own?

zinc ore
#

Dudes shameless glazing is kinda impressive ngl

torn mantle
#

well he did some modifications

torn mantle
keen beacon
#

yea 🤣 🤣

torn mantle
#

he has no shame

keen beacon
torn mantle
keen beacon
#

o

leaden palm
alpine coral
#

lol this is classic

sage raptor
alpine coral
#

he's just taking the piss at that point isn't he?

small haven
#

wait we have grok 3.5 alrdy?

keen beacon
#

yeah and o3 pro on the o1 pro api

small haven
#

the joke isn't hitting no more

alpine coral
small haven
#

i have supergrok, no 3.5..

zinc ore
alpine coral
#

ha yeah i mean kinda fair

zinc ore
#

He's so openly dishonest that I kinda like that account (not for anything serious though)

small haven
# alpine coral

strawberry man doesnt even have grok 3.5, buddy woulda screenshot..

torn mantle
#

i cant stand him

torn mantle
#

so true

keen beacon
#

elon isn't gonna let bro tap

torn mantle
#

nah this is crazy

leaden palm
torn mantle
#

he actually has a condition called pseudologia fantastica

keen beacon
#

elon is so stupid it actually annoys me

torn mantle
#

pseudologia fantastica: 'Pathological lying, also known as pseudologia fantastica, is a chronic behavior characterized by the habitual or compulsive tendency to lie.'

keen beacon
#

okay buddy you're dumb but now it's everybody else's problem because you have so much influence

#

go away

torn mantle
#

and probably combined with a personality disorder

#

those types of people scare me

sage raptor
keen beacon
#

lol I think he's completely lost it

zinc ore
#

That's literally par for how he talks, he was doing this for the "strawberry" models for upcoming openAI models, last year.

balmy mist
#

if it truly is not as good as he says i am finally gonna unfollow him

zinc ore
#

You follow him unironically??

balmy mist
keen fulcrum
torn mantle
torn mantle
#

how desperate can someone be

tall summit
torn mantle
#

elon ruined it with this engagement thingy

torn mantle
#

no wonder

tall summit
ocean vortex
small haven
#

https://x.com/btibor91/status/1918700964967796973

claude advanced research ranked at #4

Summary of my findings after comparing more Deep Research reports (these are my personal opinions based on my own tests and experience)

- No perfect solution yet, since all current deep research tools have limitations and make occasional errors, so you need to verify the

torn mantle
#

takes a lot of time with a mid-quality report

#

less details

leaden palm
#

the only thing thats good is that it can use any mcp

torn mantle
#

claude & gemini & grok are more like:

  • study #1 found 50% improvements.
  • study #2 found 40% improvements.

but they don't really compare studies and give a conclusion. oai dr does that: it goes into different studies, compares them and gives you the conclusion. it even looks for the same factors/parameters used and lists them. it actually understands what it needs to do.

small haven
#

like fetching almost 1k in sources is very impressive

wintry tinsel
#

eLon musk sand God confirmed

gilded drift
#

Guys, the o3 offered in alpha lmarena: is it high, medium or low?

torn mantle
#

Prompt : Based on a critical synthesis of recent, high-quality human clinical trials and systematic reviews, determine which compound – Berberine, Propolis, or Resveratrol – demonstrates the most compelling evidence for promoting overall health.

https://claude.ai/public/artifacts/a8a2d065-ddf4-4acf-853d-ddfd9a2fe15e
https://chatgpt.com/share/6813e578-eb4c-8012-976b-f07d475cdac9
https://grok.com/share/bGVnYWN5_03b16e10-b0c0-45b3-8014-ab796009f7b0

#

thanks to @small haven

#

grok chose the easiest road, it selected a study that includes 54 systematic reviews and it expanded on that, but its not really a deep research

#

but its still better than gemini

#

oai went in and understood what parameters to compare and analysed them and then came to a conclusion.

#

which is pretty neat tbh

leaden palm
#

imma run this on my gemini pipeline

torn mantle
#

this would take you a lot of time to do

torn mantle
#

you have gemini advanced?

#

are we sure that the free version uses gemini 2.5 pro for deep research?

leaden palm
#

no i just used their api + exa

#

looks like it only did 3 searches and didn't make any tables

#

let me see what happens if i tell it to search deeper and use tables

torn mantle
#

yea this is better than my gemini report

zinc ore
sage raptor
#

do they really have access to grok 3.5 ?

small haven
#

no, its just engagement bait

leaden palm
sage raptor
golden ocean
#

gpt-4-32k-0314 access

calm sequoia
remote niche
zinc ore
#

They're talking about deep research

#

Free version of deep research doesn't use 2.5, is my understanding. But maybe my information is outdated and that's changed over the past month.

remote niche
#

ok so its the same 2.5 on paid and free models , not the low ,medium high crap Open AI uses

tall summit
#

pretty sure he's satirizing the people who really do claim they have early access for engagement bait

tall summit
#

HAHAHAHAHA

#

now the question is does he know thats satire and is he playing dumb or

#

you can tell whether they have early access because they say concrete things

small haven
#

lemme guess, you have access

#

i like how u can use o3 to quickly test hypothesis... im hard.

zinc ore
#

That's why it's hilarious seeing people take him serious, even here

brittle tiger
#
vivid oyster
ocean vortex
# remote niche ok so its the same 2.5 on paid and free models , not the low ,medium high crap O...

it's not "crap". 2.5 is equivalent to low-med (slightly longer than low but notably shorter thinking than medium) and you are stuck with this 1 option. With OpenAI you can choose to have longer or shorter reasoning than that. 2.5 is a better base model, but o3-high gets much more done with test-time compute alone than 2.5. This means that for precise, complex, recursive tasks o3 is simply better.

remote niche
ocean vortex
remote niche
#

it does require some reasoning, not as much as maths or CS but yeah

#

ok which version of o3 is available for plus users

ocean vortex
remote niche
#

so is deep research the high model?

ocean vortex
#

as in, if not said otherwise it's always that

#

deep research is something custom made to work for that. It was using a version of o3 even before it was released as standalone model

remote niche
#

so we dont have access to o3 high at all as plus users right

#

and any idea if grok3.5 would be integrated to LM arena board

leaden palm
#

whether it will be in direct chat is a different question

remote niche
#

any bets if grok 3.5 will top the board ?

leaden palm
torn mantle
#

@brittle tiger it may actually be better than the oai dr version

remote niche
#

OAI has a dr version ?, elaborate pls

leaden palm
brittle tiger
torn mantle
wintry tinsel
torn mantle
#

Nonetheless it was good as well

wintry tinsel
#

Thid i

#

Grok 3.5 is real

#

eLon ASI confirmed?

#

I was getting worried for a second it wasn't real that's a relief

sour spindle
#

Grok 3 was my favorite model for a month or so

#

Haven’t touched it in awhile

torn mantle
#

Not even AGI

#

Straight up ASI

remote niche
#

i know AGI what is ASI ?

wintry tinsel
#

I expect no less from the mastermind Braniac Elon himself

remote niche
#

super intelligence ?

wintry tinsel
#

grok 3.5 is the greatest event of the millennium

golden ocean
#

source

#

in alternate timeline oai made gpt-5 agi instead of 4o

wintry tinsel
#

grok 3.5 is a scam?! Shocks the entire industry

brittle tiger
brittle tiger
bleak venture
#

Hey everyone!
I'm new to Arena and excited to join the community! I wanted to ask about how to add the DeepSeek-R1T-Chimera model to the arena.
There are indications that it performs better than DeepSeek-R1, and it would be super interesting to see if it outcompetes other models.
Any guidance on how to get started would be greatly appreciated!
Thanks, DK

small haven
# brittle tiger

such a fkced leaderboard, like oai should be first, not google

brittle tiger
brittle tiger
small haven
#

doesnt hit no more

misty vault
leaden palm
#

surely bing chat no longer exists

small haven
#

theyre offloading gpus for o3 pro, i respect that.. they will need it

small haven
#

just get extra accounts buddy

small haven
#

gork is using grok 3.5 aint it ..

#

like what hahha

zinc ore
#

That user is on discord

small haven
zinc ore
#

Actually you may be right, there's a gork on discord that always links to the Twitter account, but now I'm not sure they're even related. Would be interesting if that is 3.5

olive mesa
#

they're basically saying grok 3.5 is around ASI level

#

lmao might as well have said 1000-5000x because they would easily be able to make self-improving ai

torn mantle
torn mantle
#

One of xai devs asked people to follow @gork or smth like that

#

Didn't think too much about it tbh

keen beacon
#

and i bet their reasoning implementation still sucks

keen beacon
#

i hope they keep gork around

keen fulcrum
keen beacon
#

also looks like gork can use the web/twitter for up to date info

keen fulcrum
#

Is it finetuned to respond so sarcastically

keen beacon
#

elon has been promoting it and it only appeared under a week ago

#

it's basically just the @grok account but with a proper personality

#

but the rumour is it's also them piloting 3.5

keen fulcrum
#

X is filled with parody accounts, makes it hard to find out whether the message came from the real person

keen beacon
#

it's not a real person lol

#

can't be

ocean vortex
torn mantle
#

its probably elon

#

he already runs multiple accounts

keen beacon
#

its not possible to reply that fast as a human

torn mantle
#

some new neuralink tech implanted in his brain

#

100%

keen beacon
# torn mantle hes jobless

understanding a comment/thread and typing it all out. and the sheer amount of replies. its not possible for a human

#

its a bot 99% likely. grok 3.5 i guess. it makes sense for a marketing stunt

torn mantle
#

ofc its grok 3.5

#

we are just trolling

keen beacon
#

yea i didnt get that until u typed neuralink lol

torn mantle
#

xd

#

so we should have like grok 3.5 mini ( the one prob used on gork bot acc ) and reasoning + instruct model?

keen beacon
#

maybe

#

im more excited for 2.5 ultra though

torn mantle
#

or more likely a finetuned ver for x

#

yea this is probably the case

#

a finetuned ver for x

torn mantle
keen beacon
#

i hope the situation is similar to 2.5 pro but if the rate limits are far worse then its unusable

#

if it's a huge model like 1.0 ultra the rate limits will be harsh

#

on ai studio it'll probably be something like 10 per day

drifting thorn
brisk turret
#

So lm arena just stopped updating

keen beacon
drifting thorn
#

I guess Google TPU will help them a lot

#

So even with a large model (>1T parameters) the cost is reasonable

ocean vortex
drifting thorn
#

And I guess stop caring bout AI that much rn until the intelligence explosion

#

is a good decision

keen beacon
#

a model that is slightly larger than 2.5 pro makes more sense, but its still very plausible it could be above or around 1t

keen fulcrum
drifting thorn
#

from the current situation, my prediction is that larger parameters=better SimpleQA performance

keen beacon
#

but i dont think they made 2.0 pro larger compared to 2.5 pro and there was like a 10% increase in simpleqa

drifting thorn
#

so I think with proper training and architecture, the bigger the better

#

since LLMs and LMMs don't know what is "fact"

alpine coral
#

but maybe it is grok 3.5 (presumably fine tuned or with some system prompt) and part of a marketing stunt

unborn ocean
#

I would honestly suspect that 2.5 pro is already above the 1t parameter count (but that really is just me guessing)

#

In the age of complicated MoE or mixture of modalities or mixture of interleaved experts and complex adaptive quantisation strategies and complicated spec decode algorithms the total parameter count does not mean as much as it used to.

keen fulcrum
calm sequoia
#

This may be one of Sam's startups

barren prairie
calm sequoia
#

Does anyone here really believe the new Qwen is better than o1 at coding?

keen beacon
#

i dont know how to feel about that

unborn ocean
#

But it really seems a bit too high

dapper raven
misty vault
torn mantle
#

LMAO

leaden meteor
#

When can we expect grok 3.5 on arena? even under anonymous name... since I dont see anyone talking about it yet, I am assuming it is not on arena yet even as anonymous model..

pastel monolith
#

why was GLM-4-32B-0414 not added to the arena?

calm sequoia
#

Lol dude, you need months not weeks

cedar tide
#

Until o3 pro, that's coming soon

brittle tiger
pastel monolith
#

who decides which models get added to lmarena?

tall summit
keen beacon
sturdy mica
#

wow

golden ocean
leaden palm
#

but there is 4 weeks vs 1 month

tall summit
leaden palm
#

...

sturdy mica
#

i sent a photo of me

#

im so handsome right

golden ocean
#

this is my photo

#

im so handsome right

leaden palm
#

hmm this isnt 1024x1536

sturdy mica
#

thats like 1.9 mc

#

the colors are weird

vivid oyster
#

New gamini 5.6 ultra coder is so good

misty vault
sturdy mica
misty vault
#

idk

#

hitler bed wars perhaps

brittle tiger
vivid oyster
#

Gemain 2.5 pro tweaking

high egret
#

when are we getting qwen3 on lmarena plzzzz

balmy mist
rugged brook
#

havent u heard of it

rugged brook
#

its in ai studio

leaden palm
balmy mist
#

send screenshot

sage raptor
#

its real

keen beacon
#

omg

high egret
#

omg it's real

#

it's a special model they put it in Times New Roman

leaden palm
wheat onyx
#

Imagine this is real

keen beacon
#

fake grok 3.5 is asi++

#

it scores 100% on every benchmark even on questions with incorrect ground truths because it knows that

leaden palm
#

"first model that can reason from first principles and come up with answers that simply don’t exist on the Internet"

only 7.8% improvement on google proof QA

keen beacon
barren prairie
leaden palm
#

was this just added today?

#

for the record:

You may only use this website for your personal or internal business purposes. You must not access the website programmatically, scrape or extract data, manipulate any leaderboard or ranking, or authorize or pay others to access or use the website on your behalf. Unauthorized use may result in suspension or termination of your access, including access by your organization.

keen fulcrum
wheat onyx
#

So do we just cancel all our ai services if it's true?

keen fulcrum
#

Indeed
SuperGrok

brittle tiger
#

Fake as hell

keen fulcrum
#

Is the way to go

brittle tiger
#

Those grok 3 eval numbers are all wrong

keen fulcrum
#

I don't think so
Its realistic
Gemini 2.5 Ultra isn't released yet

keen beacon
#

seems to line up with grok 3 thinking on the blog post

brittle tiger
keen beacon
leaden palm
keen beacon
#

yea variance is expected

#

if they remeasured it

#

the simpleqa score increase is sus, but i don't know

#

if it is fake, the person who made it understands how scores work a little though. its somewhat believable

brittle tiger
#

comparing cons@64 to pass@1 for 2.5 and o3 is dumb though

torn mantle
#

true

#

its probably that

#

bigbrain = you

#

typing so fast

#

wait

#

are you @gork ?

keen fulcrum
sturdy mica
#

keyboard practice+ai

keen fulcrum
#

I pinged gork and grok replied

sturdy mica
#

=*

#

yes

torn mantle
keen fulcrum
#

Why would they bring an old grok 3 bot to life?

#

It came around only a week ago, we believe for testing the humour of LLMs

torn mantle
#

elon been hyping it since yesterday

#

some xai devs as well

keen fulcrum
#

They decided to run a live test instead of lmarena
hopefully tomorrow on lmarena too

keen beacon
#

those are pretty insane numbers for a base model (that is... if those benchmarks are for the base)

#

scroll up "arbitrary leaks" apparently

keen fulcrum
keen beacon
#

i mean, the design of the table does seem in line with xAI

keen beacon
#

not much else to go off of tho

keen fulcrum
#

This is a very unreliable source, so don't expect this to be true

keen beacon
#

i doubt that

#

nah

#

the variance in the grok 3 scores could indicate remeasuring, its quite a minor detail for someone to notice/fake

leaden palm
#

at least by default

#

does anyone happen to have any messages back from when people were still doubting reasoning

keen beacon
#

Lmao

keen beacon
keen fulcrum
#

How much context and parameters will grok 3.5 have?

keen beacon
#

probably similar to 3

#

so ~1T

keen fulcrum
#

I want 1M context

leaden palm
#

nah elons probably been hounding the engineers

#

"api on launch"

#

"make it so i can put all of the x codebase in there"

keen beacon
#

given the increasing pace of stuff about 3.5 trickling out, i would say a release in the next few days is quite likely

wheat onyx
#

Ive been using grok 3 on and off, I've actually found its gotten worse and hallucinating more

leaden palm
leaden palm
keen beacon
#

where did ulink

#

its unknown for me

leaden palm
keen beacon
#

oh

#

Grok 3.5

torn mantle
#

I want 10M context

#

nah claude max is meh

vivid oyster
#

ر

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

#

زز

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

leaden palm
#

finally i have a reason to mod

keen beacon
#

banana

torn mantle
#

trust center

#

hmmmmmmmmmmmmmm

#

mmmmmmmmmmmmmmmmmmm

brittle tiger
torn mantle
#

what am i reading...

#

70B valuation for this

olive mesa
#

i swear llms are always so cringe trying to act "gen z"

#

if that acc is using grok 3.5 then i have low expectations

keen beacon
#

jailbreaked opus was so good

teal mantle
#

Btw should I get supergrok or chatgpt plus/pro?

#

One for grok 3.5 another for o3

keen fulcrum
#

I wonder whether AI will be given a voice

brittle tiger
#

gork voice would be fortnite zoomer vocal fry fr fr

small haven
#

o3 pro next week i can feel it

small haven
torn mantle
#

Its him

#

I was going to say that but i forgot

#

Its probably him

ember rapids
tall summit
small haven
#

rip turing test

keen beacon
#

lowk gonna kms

#

this is a grown man 💔

zinc ore
#

Dude is honestly weird, even changing his name to gorklon rust was whew

torn mantle
#

So cringe

small haven
#

thats our era's edison right there

teal mantle
#

Btw should I get supergrok for 3.5 or chatgpt plus or pro for o3?

#

Need both a frontier reasoner and deep research

But my usage is not enough to exhaust both

torn mantle
#

Deep research, you can go with gemini advanced

#

Grok isnt worth it atm

small haven
#

the edge is obviously chatgpt

teal mantle
teal mantle
#

The Google accounts I have are few years old

zinc ore
#

Skill issue

torn mantle
#

Just create a new acc and use vpn or smth

#

You get 2 months for free

torn mantle
small haven
#

unpopular opinion but chatgpt pro is cheaper than gemini advanced/claude max/supergrok combined, iykyk

ocean vortex
#

like he is regarded or smth catgrin

keen beacon
#

elon retweeted the grifter sk

#

including the benchmark screenshot

#

weird

zinc ore
#

Free hype for the normies

small haven
#

u proly right haha

late path
#

is grok 3.5 in arena now?

teal mantle
#

But did other labs use cons@64?
if not, it means it is bogus

keen beacon
#

they have in the past

#

anthropic did for sonnet 3.7 i think. its not clear how many samples they used

#

they report pass@1 and "parallel test time compute" (seemingly cons-like, as they mention they sample several sequences)

#

weird how they dont mention the sample count at all

#

the language used seems to be trying to obfuscate the nature of it
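
A rough sketch of why the two numbers aren't comparable, assuming cons@k means "take a majority vote over k sampled answers" and pass@1 means "expected accuracy of a single sample". The function names and the sampled answers are made up for illustration, not taken from any lab's eval harness.

```python
from collections import Counter


def pass_at_1(samples, truth):
    """Fraction of individual samples that are correct (single-try accuracy)."""
    return sum(answer == truth for answer in samples) / len(samples)


def cons_at_k(samples, truth):
    """1.0 if the most common answer across all samples is correct, else 0.0."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return 1.0 if majority_answer == truth else 0.0


# Hypothetical answers from 8 samples on one question whose answer is "42".
samples = ["42", "42", "41", "42", "17", "42", "41", "42"]

single_try = pass_at_1(samples, "42")
voted = cons_at_k(samples, "42")
print(single_try, voted)
```

Here a single try is right 5 times out of 8, but the majority vote is right, so the question counts as fully solved under cons@8. Averaged over a benchmark, voting can lift the score well above single-sample accuracy for the same model.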

ocean vortex
#

it's not, it's new. based on gpt4.1

#

o3-preview was old

keen beacon
#

there are so many footnotes on the 3.7 sonnet benchmark graphic lol

north vale
#

it's probably not that new, prolly o3 is built on 4.1 and 4.1 is a few months old

#

bc they have same knowledge cutoff

keen beacon
#

it is. they retrained o3 though

north vale
#

they are pretty good and have a lot of compute but they haven't rly closed the gap, like research wise they will just have fewer algo improvements than oai, gdm or anthropic will

small haven
#

cus elon is talent poaching across his 3 companies, thats his moat lol

keen beacon
#

i doubt its an early checkpoint if its real

#

this is the version theyre gonna release soon

#

"early checkpoint" is being used as a marketing term there

#

did they ever release grok 3 (full) reasoning btw?

small haven
#

the question is o3 pro ≷ grok 3.5 ?

keen beacon
#

they reported scores though

#

vaporware lol

#

it probably does

#

it sucks

wintry locust
#

ok coming clean i made that chart

#

i wanted to get my empty twt account to some amount of followers so it wouldn't get shadow banned

keen beacon
#

lol are you fr

wintry locust
#

yea

keen beacon
#

send a ss of the profile

wintry locust
teal mantle
#

Because of prior experiences

wintry locust
#

it's just an edit of the grok 3 blog post chart

#

with numbers copy+pasted from the gemini 2.5 pro blog post that i noised a little

lean whale
late path
#

the leaderboard hasnt been updated for 13 days

#

the next update should not have grok3.5 on it i guess?

keen beacon
#

grok3.5 isnt an anon model afaik

late path
#

yeah so 2.5 pro will still be 1st...

balmy mist
#

that dude is a clown tho right?

mossy drum
#

New model in Arena: cobalt-exp-beta-v10

keen fulcrum
ember rapids
#

Elon retweeting fake screenshots from known grifters im dead

keen beacon
#

3.5 is probably worse than they expected/aimed for

#

itll be fine probably

#

thats it

wintry locust
#

it worries me that we will have 2 basically relevant models that are both referred to as 3.5

#

with claude and grok

keen beacon
#

simpleqa needs to be 100% too

#

its asi++

#

so i call it fake

#

im joking but i agree

#

i doubt theyre gonna do this (make it larger)

#

maybe. some benchmarks it might not be possible to reach 100% w/o contamination either. (wrong questions/ground truths, so it'd have to be trained on the questions to get the right/wrong answer)

#

2.5 ultra might beat gpt 4.5 in simpleqa, but we'll see

#

im super interested in it

ocean vortex
#

lmfao

#

4.0 is gork

#

you must be confusing it with a dork

#

easy mistake to make

leaden palm
#

did we get this yet

keen beacon
leaden palm
#

how far

keen beacon
leaden palm
#

ok well we were generally skeptical from the start

leaden palm
keen beacon
#

idk' lol im gonna go to bed

ocean vortex
#

That's Elon's thing. Promote fabricated things and call everything real you don't like propaganda/fake

torn mantle
#

Reposting that fake scs

#

I refuse to believe its better than o3 and gemini 2.5

#

Xai are so loud when it comes to their product, if it was really a big leap they would just straight up say that

torn mantle
small haven
misty vault
hardy pecan
#

polymarket odds dropped biggly as of 20 mins ago, very interesting..

hollow ocean
#

Should the best models be P2W or no

small haven
#

market is never wrong!

hollow ocean
late path
#

just bought google at 66 3h ago💀

#

stupid market

hollow ocean
#

Everyone that did their research didn’t bet on OpenAI to have the best

small haven
keen fulcrum
leaden palm
#

theres daily "quests" and you could buy more

#

just like in games, you can cash in but you can't cash out

#

well you used to be able to donate it to charity and they had a short stint with "sweepcash" but that's gone now

#

unfortunately not

#

lol no

#

not sure what ev means in this context

#

if youre talking about what you get out of it:

#

its just intellectual stimulation

#

you want to be a good predictor

#

you want to make the leaderboards

#

you dont want to go broke

#

you want to see number go up

topaz peak
torn mantle
#

evpi

#

evm

#

ev

#

evsi

small haven
#

u can't be serious.. 😭

zinc ore
#

It ain't even coming this month

small haven
#

smart money would buy lowkey

small haven
alpine coral
# keen fulcrum New meta model?

amazon i think. they have been releasing iterations pretty steadily (originally it was cobalt-exp-v1, last week or so it was up to v7, now v10 ig). hasn't been remarkable at all in my experience, but seems like they're in the process of getting something ready to release

misty vault
#

I don't know yet. Will you harm me if I harm you first?

golden ocean
#

spit it out @worthy thunder

worthy thunder
#

Context Arena Update: Added several Mistral LLMs to the MRCR 2needle leaderboard. (https://x.com/DillonUzar/status/1919191240123289920)

AUC @ 128k Results (Mistral Models):

  • mistral-small-3.1-24b-instruct: 47.7%
  • mistral-large-2411: 24.1%
  • ministral-8b: 22.8%
  • ministral-3b: 13.9%

See all results at: https://contextarena.ai/

Mistral Small 3.1 is currently performing between GPT-4.1 Mini and o3-mini based on the AUC @ 128k metric. For comparison, Gemini 2.5 (the current leader) and GPT-4.1 have been added to the main chart.

NOTE: The results for Mistral Small 3.1 (2503) and Mistral-8B (2503) are dated March 2025, while the others are from October/November 2024. Tests were conducted using endpoints from @OpenRouterAI and @MistralAI (presumably BF16).

More to come: Other model results will be released sporadically over the next few weekdays (including some 4needle and 8needle results), alongside new UI features. I've been pretty focused on analyzing some of the model results recently and figuring out ways to provide better insights and options for grading, so model results have slowed in the process. I will provide a summary once the rollout is complete.

Some UI enhancements have already partially rolled out, such as new information displays on hover and updated hover effects. A new diff viewer has also partially rolled out, with further improvements planned.

Enjoy.

misty vault
leaden palm
# misty vault

are you just posting the archives or do you somehow still have real access

elder rapids
#

in front of any

small haven
#

pro is a scam wtf

#

had like 70~ reqs, now 0

zinc ore
leaden palm
small haven
balmy mist
small haven
#

they are severely gpu constrained

balmy mist
#

you pay 200?

small haven
trim vale
#

Dum question

#

So theres the qwen3 235b a22b gguf model right

#

Which means there are 235b parameters in total but only 22b parameters are activated at once

#

Can i run it with the same amount of ram as a regular 22b gguf model?

#

Or do i still need enough ram to fit an entire 235b model inside?

#

Or something in between 🤔

worthy thunder
#

You generally still need enough RAM to fit the entire 235B model (like in this model)

oblique flint
#

afaik MoE doesn't activate experts based on what prompt you gave, it's done on a per token basis so a single prompt can involve a lot of different experts. So if you can't fit the model into ram fully, it'll have to load from your disk to load in the right experts, that will be much much slower than loading from ram
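
rough back-of-the-envelope sketch of why, in Python. Assumptions here are mine, not from the model card: weights dominate memory, a flat 4 bits per weight, and no KV cache or runtime overhead counted. `weight_gb` is a made-up helper, not from any library.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params (in billions) * bits per weight / 8 bits per byte = gigabytes
    return params_billions * bits_per_weight / 8


# Assumed 4-bit quantization; real GGUF quants vary in bits per weight.
total = weight_gb(235, 4)   # every expert has to be resident somewhere
active = weight_gb(22, 4)   # roughly what one token's forward pass reads

print(f"resident weights: ~{total:.1f} GB")
print(f"touched per token: ~{active:.1f} GB")
```

so the 22b number is mostly about speed per token, while the 235b number sets how much memory you need. treat the figures as ballpark only.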

trim vale
#

Gotchu

calm sequoia
#

Some context for where the Grok 3.5 stands

#

Seems like the reasoning is SOTA, but the base model is not as good as OG GPT 4.5

#

The next generation of models (GPT 5, Gemini 3?) will saturate most benchmarks 😄

torn mantle
calm sequoia
#

Your opinion or confirmed?

torn mantle
#

kinda surprising that its not added yet on lmarena

torn mantle
#

by the guy who shared it

calm sequoia
#

Then why musk retweeted

torn mantle
#

he deleted it btw

calm sequoia
#

Haha if he really retweeted fake benchmarks for his product he's at least braindead

#

Missing Elon of 2017 so much

ocean vortex
calm sequoia
#

What's with this space. Avengers of misinformation

torn mantle
calm sequoia
#

Elon (Adrian Dittman) seems quite employed 😄

hardy pecan
#

Maybe wait for a credible source first lol, literally nothing confirmed

teal mantle
teal mantle
#

He is one of the last people, like ever, to be informed in AI

balmy mist
#

will we ever get o3 pro?

torn mantle
#

it is

kind cloud
#

Does anyone know how to continue chatting with the model after voting and finding out its name? I know how, but is that method well-known?

golden ocean
#

yes

golden ocean
#

lmaoo

vivid oyster
#

When tf is grok in arena

keen beacon
#

r2 will be the cheapest there

#

might not beat the others

#

i didnt test it much/at all. (my impression of it is primarily through the comparison images, im largely uninterested in webdev)

#

i largely dont

#

idk i dont really code using ai yet so idk about best practices

#

yea

ocean vortex
#

probably making it more verbose. It's relatively concise. This could indirectly prolong thinking as well

keen beacon
#

technically no its not on the dot

#

1,048,576

ocean vortex
#

I mean it is relatively speaking, esp if you consider the entire output thinking included and compare that to medium/high

#

it's a thinking model, it needs to output a ton if you are after maximizing every last bit of performance 👀

keen beacon
ocean vortex
keen beacon
#

i was surprised about it myself

#

o3 costs the most though

ocean vortex
# keen beacon

I think this could be measuring smth else. Otherwise this makes no sense lol

keen beacon
#

it makes sense if u include the pricing, o3 is still the most expensive model to run there

#

even though the tokens outputted are less

ocean vortex
# keen beacon

it's weird. I'm sure there's some kind of explanation why they came up with this. But we know for a fact IRL R1 outputs less thinking tokens than O3, not even talking about 2.5...

keen beacon
#

the r1 thing

#

if you look at the cost to run it makes sense

ocean vortex
keen beacon
alpine coral
ocean vortex
alpine coral
keen beacon
#

its more but deepseek host it only at 64k i think

alpine coral
#

they have a reasoning effort param

#

which seems poorly implemented (trying to do it across providers)

#

i dunno

ocean vortex
#

it's capped for the entire thing

alpine coral
#

nah not a prompt.. not sure tbh

#

eh i dunno but it would be hilarious if that was their implementation

keen beacon
#

yeah they usually just pass through

#

i dont think OR does anything

#

if they do its not intended

alpine coral
#

yeah i know what you're saying

keen beacon
#

they are proxying all of the requests they can mess with it

#

i highly, highly doubt they are

alpine coral
#

i think they're assuming google etc will have actual reasoning budget hyper parameters at some point

#

i dunno what it does in the meantime

ocean vortex
#

@keen beacon maybe OpenAI are better at overfitting 💀

#

but like I said it's weird, will do some more testing

blazing rune
#

How would they get the data if you are running it locally?

keen beacon
#

5g covid something like that (/s)

blazing rune
#

Oh, it has been like that for years already

#

Llama 4 was trained on Facebook data iirc

keen beacon
#

whatsapp is encrypted right? i assume if u communicate with meta bots/etc then ur data is collected

#

for sharing war plans

ocean vortex
#

o4-mini-high and o3 would be essentially unusable even with 8k

ocean vortex
#

my point is that this number includes thinking

#

and also that there are still a ton of providers including azure, with low limit where r1 is still very much usable

#

now compare that to this:

#

R1 would never generate anywhere near that

keen beacon
ocean vortex
keen beacon
#

and that task requires a lot of reasoning tokens. deepseek usually does 15k+ on some of my problems

#

sometimes more

ocean vortex
keen beacon
#

o4 mini tries harder

#

even if it is wrong

ocean vortex
#

who cares if it's right. We are talking which one outputs more thinking LOL

keen beacon
#

ur argument

ocean vortex
#

wdym

#

My argument is that o4-mini and o3 output more thinking tokens, that's it

keen beacon
#

u have to take into account whether the answer is right or not, or its arbitrary

ocean vortex
#

nothing to do with which performs better

#

on any given task

ocean vortex
#

Also, this is not claude. Model has no clue about thinking budget, it will only get cut off. If it doesn't then it used all the tokens it could want

keen beacon
#

is it releasing?

#

talking requests? 🙏

#

what

ocean vortex
#

So like... Try sambanova interface and come up with a prompt for R1 that would use more for the thinking than 8k cap. I'm sure you will find this is next to impossible lol

@keen beacon

#

while for both o4-mini and o3, this is super easy

keen beacon
#

There are 5 houses, numbered 1 to 5 from left to right, as seen from across the street. Each house is occupied by a different person. Each house has a unique attribute for each of the following characteristics:

  • Each person has a unique name: Eric, Alice, Peter, Bob, Arnold
  • Each person has a unique type of pet: hamster, fish, cat, dog, bird
  • Each person has an occupation: doctor, engineer, artist, lawyer, teacher
  • Each person has a favorite color: green, blue, yellow, red, white
  • The people keep unique animals: bird, cat, horse, fish, dog

Clues:

  1. The person who owns a dog is Arnold.
  2. The bird keeper is in the fourth house.
  3. The person who keeps a pet bird is directly left of the dog owner.
  4. The person who loves white is somewhere to the left of the person who is a lawyer.
  5. The person who loves yellow is directly left of the person whose favorite color is green.
  6. The cat lover is Bob.
  7. The cat lover is somewhere to the left of Eric.
  8. The person who keeps horses is in the fifth house.
  9. The person who is a lawyer is directly left of the person who is a teacher.
  10. The person who is a doctor is in the first house.
  11. Alice is the person who loves yellow.
  12. The person who loves blue is directly left of the person with an aquarium of fish.
  13. The person who loves yellow is in the first house.
  14. The person with a pet hamster is the person who is an artist.
  15. Eric is the dog owner.
ocean vortex
keen beacon
#

also its a single prompt. look at artificial analysis where they ran several benchmarks. plus with your prompt it didnt even get the answer right with less tokens, so its arbitrary in that instance too. might as well call og gpt 4 so much more efficient if u just look at that. (it does not make sense if you do not consider correctness)

#

how youre contesting running several standardized benchmarks and seeing the amount of tokens used just because of that 1 prompt is ridiculous

#

and the comparison/reasoning for that 1 prompt does not make sense

ocean vortex
#

when R1 generates more, it's not by 40k more lol

#

and with OpenAI it is

keen beacon
ocean vortex
soft kernel
#

I know this sounds stupid but,is 4.5 still available on the web?(battle part,not direct chat)
I think 1 month ago it was available

keen beacon
# ocean vortex wdym. There are plenty of prompts where OpenAI will generate beyond 16k, you don...

those questions are probably a ~~'biased distribution'~~ biased sample unless its representative. those standardized benchmarks that artificialanalysis runs are way more representative for actual usage than potentially specific cherry picked problems that cause specific model outliers. and does r1 even get the questions right? (if it uses less tokens but gets the answer wrong it is also meaningless)

ocean vortex
# keen beacon those questions are probably a ~~'biased distribution'~~ biased sample unless it...

I don't think it's meaningless since it could get it wrong specifically because it's reluctant to output more.

I think the truth here is somewhere in the middle to be completely honest. Maybe on average it's that - OpenAI models are fairly mindful of the usage and don't output a lot at all times. But when the model sees a task it can't solve without outputting a ton... O3/O4 will do it and R1 or 2.5 most likely not, saw Gemini taking shortcuts to arrive at the answer faster more than once. And as for R1, once again it seemingly can't even get past 16k

#

So maybe that average metric does not tell us all that much. You wouldn't want a model that outputs more when that's not needed and less when it is needed. A bit artificially bumping up the average (relative to what it is capable of performance wise with test-time compute)

#

Esp since we have quite a few of the models wasting tokens on 2nd guessing themselves rather than doing the task efficiently lol

#

so yeah... if we took like 100 test prompts and model1 would do say 50 attempts around 90k generated while the other 50 around 10k;
while model2 would generate consistently around 55k for all of them leading up to slightly higher average...

I would still say model1 is much less test-time compute limited and better optimised tbh
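
quick sketch of that arithmetic in Python, using the same made-up numbers as above (100 hypothetical prompts, token counts invented purely for illustration, not measured from any model):

```python
# Hypothetical generated-token counts per prompt, not real measurements.
model1 = [90_000] * 50 + [10_000] * 50   # long when needed, short otherwise
model2 = [55_000] * 100                  # roughly the same length every time


def mean(xs):
    """Plain arithmetic mean."""
    return sum(xs) / len(xs)


print(f"model1: mean {mean(model1):,.0f}, max {max(model1):,}")
print(f"model2: mean {mean(model2):,.0f}, max {max(model2):,}")
```

model2 ends up with the higher average even though model1 goes much further on the prompts that need it, so the mean alone hides the spread.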

#

it would obviously be something way more random, but for the sake of simplicity assuming easy to reference numbers

ocean vortex
#

looking at 2.5 some more digging into thinking, it does have a weird habit of rewriting the thinking into final response, even the parts where it self-corrects pretending it made the same mistake all over again... So you do have a healthy amount of generated data that is completely pointless catgrin

#

but if you made it even more verbose, that would very likely help with the actual useful problem solving part as well

patent aspen
sonic tendon
#

oo, source?

#

ah

keen beacon
#

i wonder if they'll release on api or not

keen beacon
#

i have a feeling the reason they didn't release grok 3 full reasoning was because it sucked

#

hopefully grok 3.5's reasoning implementation does not

#

so they wouldn't have a reason to

#

i wonder if they stopped using qwq preview traces 🤣 (very suspect)

#

probably didn't help things ☠️

keen beacon
#

lmaoo

unborn ocean
#

never spend much time on xAI

keen beacon
terse ermine
#

Hi all, I just joined so sorry if this is the wrong channel to ask this question. But I was curious if anyone has a plot of ELO vs model release date. Thank you so much!

balmy mist
#

griter?

keen beacon
#

its vaporware

hollow ocean
leaden meteor
#

How come they are releasing before letting it loose anonymously on arena? I am guessing it's not meant for arena questions...

keen beacon
#

maybe grok 3.5 will be great though, idk. even if it is good, its gonna be unusable for me because of the X peddling 😭

balmy mist
hollow ocean
balmy mist
sage raptor
#

Pt probably

hollow ocean
balmy mist
#

oh yeah they never released that

torn mantle
balmy mist
torn mantle
#

he seems excited for the release

#

same

#

i had access for 2h

#

then it glitched out

balmy mist
torn mantle
#

yes

wintry tinsel
#

finally we will see Gemini dethroned

torn mantle
#

same

#

it was good

keen beacon
balmy mist
keen beacon
#

i doubt 2.5 pro will be dethroned tbh

wintry locust
#

did you really...

torn mantle
torn mantle
wintry tinsel
torn mantle
#

we should start asking Craig

keen beacon
#

ahahaha

torn mantle
#

sure

wintry tinsel
#

unlike O series models, Grok models are versatile like Claude

balmy mist
#

wow

keen beacon
#

isnt grok 3.5 asi?

torn mantle
#

yes

keen beacon
#

they skipped agi like o1 -> o3

balmy mist
torn mantle
#

are you following him?

hollow ocean
balmy mist
torn mantle
wintry tinsel
#

Grok 3.5 Is new Elo king fr

balmy mist
hollow ocean
#

My friend knows 🍓 guy irl he’s in middle school

balmy mist
#

i follow a lot of ppl, and he was right about o4 in april

wintry tinsel
#

Do you see any Ultra releasing today? No than stop tickling my impatient nuts

torn mantle
#

added for 1 sec

wintry tinsel
#

Alright '

wintry tinsel
#

I hope it ultra sucks cuz gemini is not based

torn mantle
#

@deep adder how was ultra?

#

was it good?

misty vault
hollow ocean
misty vault
wintry tinsel
torn mantle
#

he really doesnt have any insider infos

misty vault
wintry tinsel
keen beacon
#

not just AGI, he has seen ASI

#

i think u meant behemoth

torn mantle
keen beacon
#

grok 3 already has X ads

hollow ocean
#

🍓 guy is from Alaska and he’s in 7th grade

torn mantle
#

chatgpt is already surpassing x on daily users

#

let elon focus on his gork new bot

misty vault
#
      /\_____/\
     /  o   o  \
    ( ==  ^  == )
     )         (
    (           )
   ( (  )   (  ) )
  (__(__)___(__)__)
balmy mist
torn mantle
balmy mist
#

you blocked elon?

torn mantle
#

cant stand them

#

yea

balmy mist
#

you can do that

torn mantle
#

i blocked him

keen beacon
#

x is a cesspool

#

he didnt even mean it

balmy mist
#

i didnt even know you can block ppl on X

keen beacon
#

probably his dreams of an everything site or smthing

#

why hes calling everything x

hollow ocean
#

Yeah

sonic tendon
hollow ocean
#

His real name is Jonathan

#

My friend has a picture of him logged into his account on his laptop

#

I’ll ask him for it

#

Skinny guy from middle school

#

Yeah

#

No reply yet he’s busy

soft kernel
hollow ocean
soft kernel
#

Still stuck in 2000s

cedar tide
#

Qwen 3 on webdev is it also the thinking version?

#

for those who want to know what the 4 new leaderboard models are
Qwen 3 235b
Gemma 3 12b
Gemma 3 4b
Olmo 2 32b

#

We waiting for nova premier on the arena (to make fun of him a little 😅)

balmy mist
small haven
#

day 19 with no o3 pro

ocean vortex
tall summit
#

how good is qwen 3 at translation?

keen beacon
#

It's asi ofc

cedar tide
#

You can try a prompt for me ?

torn mantle
small haven
#

im still waiting on a ss of o3 pro on o1 pro api

cedar tide
torn mantle
teal mantle
#

is deepsearch or deeperresearch powered by grok 3 or grok 3 mini?

cedar tide
cedar tide
calm sequoia
#

What's up with this variance 🤔

gilded drift
torn mantle
#

he should make an account like iruletheworldmo

keen beacon
small haven
#

craig is iruletheworldmo confirmed

calm sequoia
#

It's probably first time model is so good at math and so bad at style control

cedar tide
torn mantle
torn mantle
#

he talks like him too

wintry locust
ocean vortex
ocean vortex
teal mantle
ocean vortex
teal mantle
#

if it is full grok 3 then it is cheap for them to do deep research on grok 3 as opposed to OpenAI using o3

ocean vortex
#

OpenAI's deepresearch could be something like o3-high except with even longer outputs finetuned for search specifically

teal mantle
ocean vortex
tall summit
ocean vortex
#

Lithuanian 😅

#

235b is around gpt4.1 level, smaller models are considerably worse

small haven
#

day 19 since o3 and it is still as magical

tall summit
calm sequoia
#

@ocean vortex can you PM me? You seem to be blocked this

calm sequoia
#

Funny thing is that literature is almost non existent. I can't understand why is it in distribution. Maybe emergent property.

keen beacon
#

if u dont mind

ocean vortex
gilded drift
#

So , no grok 3.5 today ? 🙄

ocean vortex
sage raptor
#

when will they release gork

small haven
#

elon ma said next week on apr 29

balmy mist
#

grok 3.5 in june

sage raptor
#

gork > grok

keen beacon
#

no it's asi

#

strawberry man

#

fake its asi

#

asi would have 100% on simpleqa

small haven
#

look in the mirror

#

so is every ai company except nvidia lol

zinc ore
#

When you're such a good model you max out on incorrect benchmarks (MMLU).