#general

balmy mist
#

i still dont understand this chart:

#

how is grok mini so high

#

this says its on par with o4 and 2.5 pro

#

but dirt cheap

#

how is that not the best model?

#

wait claude has the highest output?

#

at 128k?

#

then grok?

alpine coral
wintry tinsel
#

When are we getting Claude 4 already

balmy mist
#

u think its SOTA?

wintry tinsel
#

I think only real good models are google, Claude, and deep seek

fleet lintel
#

Dayhush looks comparable to NW

wintry tinsel
#

Google is competitive on price and best in some categories, Claude is best overall, and deep seek is open source

#

OpenAI is too expensive and too censored/stiff, don’t like it

#

Grok is kind of funky, it’s not best on anything right now, so not important

hollow tinsel
#

Anyone getting Claybrook? the pokemon battle simulator blew me away

balmy mist
#

i like that claude output is 128k

#

i might start using it again

fleet lintel
hollow tinsel
#

No idea, how do you check for that?

balmy mist
#

wait i thought you were an openai fanboy?

#

you cheating on them?

cloud meadow
#

Reka.ai seems to be doing some sort of a new beta for their site.

#

Idk what model this is but it's so bad lmao.

hollow tinsel
fleet lintel
#

but folks here are very smart and they can look at configs etc to figure out the company

hollow tinsel
hollow tinsel
winged steeple
#

Hello all! I have been playing with the arena battle and I have a very silly question: are models picked entirely at random, or are they weighted somehow? I'm wondering because I often see the same models, despite the huge pool.

calm sequoia
oblique flint
#

does anyone have an explanation as to why 2.5 flash reasoning is more expensive per output token compared to 2.5 flash? Doesn't it still use the same model when it's reasoning? Or is it like a different quantization maybe?

cloud meadow
#

do you mean 2.0?

oblique flint
#

I mean 2.5 flash with reasoning enabled vs with reasoning disabled

cloud meadow
#

Oh right, it's better.

#

There you go

#

And also more expensive to run for Google.

oblique flint
#

so it's 2 different models under the same name?

cloud meadow
#

R1 and V3 situation I believe.

#

Idk

oblique flint
#

all in all it seems like the reasoning version is massively more expensive. Not only is the reasoning version like 6 times more expensive per output tokens, it's going to generate way more output tokens. Initially I thought it would only increase the number of tokens generated, but I didn't expect they would also charge more per token on the reasoning version
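A rough sketch of why the two effects compound. The base price and token counts below are made-up placeholders, not real provider pricing; only the "about 6x per output token" multiplier comes from the message above. It also assumes thinking tokens are billed at the same rate as the rest of the output.

```python
# Hypothetical numbers for illustration only; not actual provider pricing.
BASE_PRICE_PER_M_OUTPUT = 1.00   # assumed USD per 1M output tokens, reasoning off
REASONING_PRICE_MULTIPLIER = 6   # "like 6 times more expensive" (from the chat)


def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a given number of output tokens."""
    return tokens / 1_000_000 * price_per_million


def compare(answer_tokens: int, thinking_tokens: int) -> tuple[float, float]:
    """Return (cost without reasoning, cost with reasoning) for one response."""
    plain = output_cost(answer_tokens, BASE_PRICE_PER_M_OUTPUT)
    # Assumption: thinking tokens are billed as output tokens at the higher rate.
    reasoning = output_cost(
        answer_tokens + thinking_tokens,
        BASE_PRICE_PER_M_OUTPUT * REASONING_PRICE_MULTIPLIER,
    )
    return plain, reasoning


plain, reasoning = compare(answer_tokens=500, thinking_tokens=4_500)
print(f"plain: ${plain:.4f}  reasoning: ${reasoning:.4f}  ratio: {reasoning / plain:.0f}x")
```

With these placeholder counts the response has 10x the output tokens at 6x the price per token, so the per-response cost is about 60x, not 6x.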

cloud meadow
alpine coral
# balmy mist u think its SOTA?

nah fwiw i've found it a bit underwhelming - it's not up there with gem pro 2.5 (for me / this is all obviously subjective..)

oblique flint
#

not gonna lie I havent even tried the model yet lol

alpine coral
#

but solid, for sure

plain zinc
oblique flint
#

so grok 3 mini is really 1/7th the output token price of 2.5 flash thinking lol

#

idk why google made 2.5 flash reasoning so overpriced

worthy thunder
#

@calm sequoia as requested I added ranking to the tables (based on AUC calcs, sorted by 1M since my company is focused on very long context). I made changes to the charts but not including them here until I have new models to run it on. Enjoy

willow grail
worthy thunder
# keen fulcrum Flash better?

There's a slightly higher score at 32k and 128k context for 2.5 Flash (Thinking enabled) compared to 2.5 Pro, but it's within margin of error. Honestly, both 2.5 Flash (Thinking) and 2.5 Pro have practically the same scores.

worthy thunder
#

I found it interesting that they perform the same. That isn't true for all other model families with this bench so far.

keen fulcrum
#

Gemini 2.5 pro is too confident in its own errors and didn’t want to undergo another investigation of the prompt

#

Had a hard time disproving the llm

worthy thunder
#

Working on it. Somehow my org is locked out of o3. OpenAI support says we should have access and are looking into it

keen fulcrum
worthy thunder
#

US

worthy thunder
#

Yeah, we can't even get to that step on our account

earnest parcel
torpid berry
#

hello,
do you know when we will have o3 and o4-mini-high in leaderboard?

torpid berry
worthy thunder
#

We literally get a server error when we try to verify, so we can't even upload an ID. Support thinks it's an issue on their end and is looking into it. 🤷
I'm interested in the o3 results too, so hoping they resolve ASAP.

torpid berry
#

I tried this in all models I know and they all failed

#

expected result:

#

tried with gemini 2.5 pro and new openai models, all failed

keen fulcrum
ember rapids
keen fulcrum
torpid berry
#

This chart does not include Gemini 2.5 pro

#

all Gemini models are pretty bad until Gemini 2.5 Pro, which is the best or one of the best models yet

worthy thunder
keen fulcrum
#

This will change

#

Bard wasn’t competitive

keen fulcrum
wintry tinsel
#

The O models are stem problem solvers, pretty terrible for general world knowledge, conversations, writing ability etc

#

If you are not in a technical field, or college, the O models probably aren’t useful to you

#

Now Claude is the best overall in my opinion since it feels the most grounded and consistent, Gemini would be the better frontier model, but it’s less capable of revising its mistakes, writing, and world understanding I’d say

cloud meadow
#

Spiritually cucked to anthropic.

cloud meadow
#

Gemini 2.5 is great at noticing mistakes

#

As for writing, your prompt may just be bad. Deepseek does a better job at creative writing for me than Anthropic

#

Though it does entirely depend on what you are asking for

balmy mist
#

this might be old:

keen fulcrum
#

Indeed you are late

blazing rune
keen beacon
#

ai vs ai CTF

#

soon?

blazing rune
keen beacon
#

March 31, 2025

#

wow i really am late

blazing rune
#

Like when I ask Gemini 2.5 Pro to write code with no comments, it still uses tons of comments, or if I tell it to be concise, it doesn't listen at all. It's just verbose with no way to fix it. Claude 3.7 Sonnet doesn't have these issues in my experience.

cloud meadow
blazing rune
#

I use temperature 0.5 all the time

#

so it's not that

#

it just doesn't listen to prompting nearly as well as Claude 3.7 Sonnet

pliant cypress
#

Connection errored out.

tall summit
#

BURPSUITE AI

calm sequoia
calm sequoia
calm sequoia
tall summit
#

Though I used to think it was

#

lmao i voted gpt 4.1 nano over claude 3.5 haiku

golden ocean
#

Hacking is now actually going to be easy for skids

balmy mist
#

yo @tall summit you were the one talking about making a chess game right?

balmy mist
# tall summit yeah

Welcome to the AI Chess Battle OpenAI o3 and Gemini 2.5 Pro the state of the art models will be playing a chess match against each other. Let's see which model actually wins this.

tall summit
#

what about it

#

cute

balmy mist
#

this dude is livestreaming them going against each other

tall summit
#

YJxAI with 1.5k subs

#

it's free content for blogs and channels, huh?

harsh flume
#

Hey guys, I haven't been online for a couple of days. How have oAI's new models been faring in the arena?

tall summit
#

and twitter accounts of course

harsh flume
#

I just got 4.1 in my first prompt, is o3 full and any version of o4 contending as well rn?

tall summit
tall summit
#

as well as 4.1 full mini nano

#

they're all even on direct chat

tall summit
#

aww 31-29

#

very even

balmy mist
#

yo i think we can build this better

#

but can you upload images to gemini 2.5 pro api?

tall summit
#

this is match 1

#

ill finish my match before his

balmy mist
#

im kinda mad he made it before us 😦

#

we dead was talking about this

#

the thing about chess is that the context window does not matter as much

#

since you can always give the current game state to the model and nothing else

#

and start new convos with each model for every move

#

having memory of the previous moves doesn't matter in this case
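A minimal sketch of the stateless setup described above: each move is requested in a fresh conversation that contains only the current position as FEN. `ask_model` is a hypothetical stand-in for whatever API client is used, and the prompt wording is an assumption, not a tested recipe.

```python
from typing import Callable


def build_move_prompt(fen: str) -> str:
    """Build a self-contained prompt from the current position only."""
    side = "White" if fen.split()[1] == "w" else "Black"
    return (
        "You are playing chess. The current position in FEN is:\n"
        f"{fen}\n"
        f"It is {side} to move. Reply with one legal move in UCI notation (e.g. e2e4)."
    )


def next_move(fen: str, ask_model: Callable[[str], str]) -> str:
    """Ask for one move in a fresh conversation; no earlier moves are sent."""
    return ask_model(build_move_prompt(fen)).strip()


# Stand-in for a real API call so the sketch runs offline (hypothetical).
def fake_model(prompt: str) -> str:
    return " e2e4\n"


START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(next_move(START_FEN, fake_model))
```

FEN already carries side to move, castling rights, the en passant square and the move counters, so a snapshot covers most of the rules state. It does not record earlier positions, so threefold repetition (and any read on the opponent's style) would still need the move history.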

tall summit
#

it helps according to that blog post

balmy mist
#

i can see it helping, but an intelligent model can understand the game based on a snapshot of the current state

#

like for any chess player

#

yeah having prior knowledge will help to understand how the other player plays

thorny drum
#

gemini formatting is still weak

#

especially with code

balmy mist
#

maybe then logging why it made each move and having that as knowledge base can be a good addition

balmy mist
tall summit
#

(according to the single and informal test)

balmy mist
#

how is that possible

#

maybe you might have to ground it with search as well

tall summit
tall summit
upper wolf
#

old gpt-4 was pretty good at chess

tall summit
#

it thinks the black king can move to b6, a8, or b8

upper wolf
#

i gave it an fen once and it even guessed the opening

tall summit
#

and that the white king has no legal moves

balmy mist
tall summit
#

not how good it can search about chess

balmy mist
tall summit
balmy mist
#

if they do not know how to play without help with search then give them the access to search until they can play chess properly without help

tall summit
balmy mist
#

cause currently you are saying they cant even read FEN properly
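If the models misread FEN, one option is to expand it into an explicit square-by-square piece list before prompting, or to check the model's own restatement of the board against it. A small sketch of that expansion, using only the piece-placement field of the FEN string (the function name is made up for illustration):

```python
def parse_fen_board(fen: str) -> dict[str, str]:
    """Map squares like 'e1' to piece letters (uppercase = White, lowercase = Black)."""
    placement = fen.split()[0]
    ranks = placement.split("/")
    if len(ranks) != 8:
        raise ValueError("FEN piece placement must have 8 ranks")
    board: dict[str, str] = {}
    for rank_index, rank in enumerate(ranks):
        rank_number = 8 - rank_index  # FEN lists rank 8 first
        file_index = 0
        for ch in rank:
            if ch.isdigit():
                file_index += int(ch)  # a digit means that many empty squares
            else:
                board["abcdefgh"[file_index] + str(rank_number)] = ch
                file_index += 1
        if file_index != 8:
            raise ValueError(f"rank {rank_number} does not describe 8 files")
    return board


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
board = parse_fen_board(start)
print(board["e1"], board["e8"], len(board))
```

This only decodes the board. It does not generate or validate legal moves, so claims like "the king can move to b6" would still need a real chess library or engine to check.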

tall summit
#

i wonder what would happen if i gave the pgn

upper wolf
#

gpt-4T was cracked at chess. It was extremely knowledgeable about very specific opening moves. also, it was able to play several full games without any illegal moves, which most current llms cannot do

balmy mist
#

so you think the models got worse?

upper wolf
#

the new models r a lot better at 99% of tasks i wanna say

#

idk why they fall off in chess ability

#

something about having more parameters would be my guess

balmy mist
#

what happened to alpha go?

tall summit
balmy mist
#

how are llms so much worse than alpha go

tall summit
#

for some reason

upper wolf
#

Lmao i havent heard that name in ages

tall summit
#

wait a second

wintry tinsel
tall summit
#

even after it wrote:
White: King on a1, Queen on a5, Rook on c1, Bishop on c4, Pawns on a2, c3, d4, e5, f4, g3.

tall summit
#

so it's even worse

wintry tinsel
balmy mist
#

i dont understand why they are so bad

wintry tinsel
#

If I need a problem solved, Ill use Gemini 2.5 pro, for everything else, Claude

narrow elbow
balmy mist
#

like alpha go is SOTA in specialization, why dont we specialize our current SOTA to different tasks on the level that alpha go was for go and League?

upper wolf
#

that’s not how it works

#

playing a game isn’t like generating text

balmy mist
#

but the thinking that goes into play chess should be similar to the thinking that goes into figuring out stuff

#

like reasoning

upper wolf
#

yes, they use similar logic

wintry tinsel
#

Nothing exciting releasing

upper wolf
#

but chess is much more 1-dimensional

wintry tinsel
#

Except maybe R2

#

It’s slow going for AI, what might be called Snail Season

narrow elbow
#

haha

torn mantle
#

which is win a game

balmy mist
#

and the reward function is set during the training phase right? can we have an additional one during finetuning?

novel flame
novel flame
# balmy mist like alpha go is SOTA in specialization, why dont we specialize our current SOTA ...

AlphaGo, AlphaZero, AlphaCode ... are all vastly different architectures. A Transformer LLM cannot be trained the way Alpha* is, and if it could be, it doesn't operate in a way that would allow it to use that training anyway. VASTLY different.

Take AlphaCode for instance. It achieved/achieves SOTA on particular types of competitive coding by doing a ludicrous number of attempts (>100k, in some instances millions IIRC) and uses search to find the best/correct/optimal solution. It does work, but it takes a lot of compute and it is definitely not 'reasoning' in a humanlike way.

AlphaZero evaluates 10000s of moves per position, which is actually far less than it would need to do a brute-force search; part of the genius of AlphaZero is its ability to prune the search space down from millions of moves to just tens of thousands. But it's still search.

Current-gen reasoning LLMs don't do policy search, not in that way and not at that scale. The main thing they can do is spew more tokens in thinking mode in a controlled manner, effectively stuffing their own context with a small number of possible plans/approaches/solutions, refining this iteratively until their thinking budget is exhausted.

As most things in LLMs, Thinking Mode is just a primitive, degenerate form of ~~bending~~ policy search.

Thinking Mode gives the LLM a few dozen shots at getting to a better solution, but it's neither systematic nor exhaustive, nor does it have a world model or persistent state to simulate or probe internally. As such, there is no way to adapt a (current gen) LLM to do what AlphaZero does.

tall summit
#

some benchmarks are graded by llms and that's a whole new can of worms to open

torn mantle
#

is this like the official lmarena server?

#

webdev bugs is kinda pissing me off ngl

tall summit
#

yes..?

torn mantle
#

it also wastes so many tokens for them

tall summit
torn mantle
#

rendering issues

tall summit
#

barely anyone uses any of the feature text channels lol

torn mantle
#

people already issued that

torn mantle
oblique flint
tall summit
#

.

tall summit
raven void
tall summit
novel flame
# tall summit .

Actually I just updated the link to the paper, the one I wanted to link to was from DeepMind and it's a much better read.

torn mantle
#

whats the rate limit of o3 in chatgpt pro?

earnest parcel
blazing coyote
#

pro - unlimited , plus - 50 a week

earnest parcel
balmy mist
#

but you cant use gemini models 😦

#

and only claude 3.5

#

interesting why they only used these three

#

4.1 is a good coding model and grok 3 mini as well as 3.5, but why no 2.5

#

i heard people say non reasoning models are better at coding so maybe why the SOTA is not here, but why not 3.7 or v3?

#

@hollow ivy what was the system prompt for nw again?

bright kayak
#

when does the team decide when to update the leaderboard?

tall summit
#

when there are enough votes!!!

#

and isn't it automatic?

bright kayak
#

i don't know, that's why im asking

novel flame
balmy mist
#

yeah this theo guy seems to really like it, im about to use it to create the arena for ai to go against each other now

#

hopefully it can do it

novel flame
#

OK Chef is pretty cool. It generates a working app with DB and backend, and all the features it built actually worked.... though it only built half of the prompt, and it overall looked pretty crappy. Still, I'd say it's one of the best I've seen in terms of actually working

primal orbit
#

guys, any new strong anonymous models in normal arena? been away for a week

novel flame
primal orbit
#

in normal arena or webdev?

novel flame
#

webdev I believe

primal orbit
#

well, I don't code, so..

#

like to just play with them

novel flame
#

Other Google models: claybrook (okay), dragontail (very strong, likely a Pro update or something)

primal orbit
#

dragontail I have tried, claybrook I'm gonna try to catch, thx

torn mantle
#

which one is better

#

sonnet 3.7
gpt4.1
dayhush
claybrook

#

not in order

pliant cypress
#

bottom left imo

torn mantle
#

yea its the most accurate one

sweet tinsel
#

Would say bottom left too, which one was it? I would guess dayhush.

torn mantle
#

pretty neat

keen beacon
#

dayhush is good

sonic tendon
#

dayhush

keen beacon
#

I do think we're getting Gemini cider

sonic tendon
#

new google model?

keen beacon
#

coder*

#

not cider

#

lmfao

keen beacon
#

nightwhisper, dayhush

#

:3

sonic tendon
#

gemini cider sounds fun

zinc ore
#

A Google employee has already confirmed coder coming

keen beacon
#

lmao

sonic tendon
#

:3

keen beacon
sonic tendon
#

google is cooking

balmy mist
#

yeah they need to release it tonight as a treat for us

sonic tendon
#

when did that happen

keen beacon
#

first time they've even talked about ultra for over a year

keen beacon
sonic tendon
balmy mist
sonic tendon
#

wh

#

ere

balmy mist
keen beacon
#

tulsee is director and product lead for Gemini models

sonic tendon
#

probably good for writing, then

keen beacon
#

it better be

#

I want my baby back

sonic tendon
#

at least, going by leo's impression of

sonic tendon
#

real

keen beacon
#

if ultra is as good as pro but with expected scaling improvements then sign me up

sonic tendon
#

i am equally curious about pricing

balmy mist
sonic tendon
#

given that the free rate limits for pro are 25req/day, gemini ultra is probably gonna be like 2

#

well, more realistically, 0

#

personally, i kinda like that we stopped trying to scale for a little bit and focused on releasing models of similar sizes but with more refined techniques

#

at least, from my impression (of frontier models)

#

4.5 got dialed back, opus 3.5/3.7 (if it exists at all) hasn't been publicly released

sweet tinsel
#

The really large models are really good for me, so I'd love 2.5 Ultra as I'm already a large fan of 4.5 and I'm hitting limits for 4.5 weekly. It's just really good at writing.

keen beacon
#

i do think huge models are impractical now, but that doesn't mean they have 0 value

balmy mist
keen beacon
balmy mist
#

interesting

#

how big do yall think o3 is?

keen beacon
#

it's 4.1

balmy mist
#

you think we can have a big reasoning model?

sonic tendon
#

i worry that, in increasing the max available model size, we'll end up inadvertently heavily increasing the total environmental impact/needed compute because people will use the biggest model possible for tasks that don't need it

keen beacon
#

wait let me check what a buddy at oai said

#

i am told around 200B

balmy mist
#

ahh

#

do you think we can have a reasoner at 1 trill?

#

or more?

keen beacon
#

absolutely

#

but again

#

it would be impractical

#

and probably prohibitively expensive

balmy mist
#

what is the biggest model rn? 4.5?

sonic tendon
keen beacon
#

yes

#

4.5 is over 4T params lol

balmy mist
#

lol

keen beacon
#

beats opus by quite a margin

#

probably?

sonic tendon
balmy mist
#

and gpt5 will be using that ?

sonic tendon
#

writing?

balmy mist
#

as the base?

keen beacon
#

no no

#

total params

#

opus had just over 1T

sonic tendon
#

ohhhh

#

two convos at once 😭

#

sorry guys

balmy mist
#

but wouldnt gpt5 be a large reasoning model?

sonic tendon
keen beacon
#

gpt-4 was just over 1T yed

#

yes

#

it's crazy how we've absolutely dunked on 4 performance wise with so much fewer active parameters

sonic tendon
#

yeah

#

remember when we thought gpt-4 was the ceiling? lol

keen beacon
#

🤣

sonic tendon
#

gemma 3 27b

#

literally can run that on my laptop (barely, at less than a token per second)

#

2023 (?) hn commenters were right

#

not sure how longer context would improve performance that much

#

super long context doesn't seem to be good for many tasks atm, surprisingly

keen beacon
#

oh yeah im told work is well underway on claude 4.0

sonic tendon
#

least of all coding

sonic tendon
#

might've pondered this in this chat before

novel flame
#

Reasoning in latent space without outputting tokens. Much more efficient, much less information lost to the token representation, should be much more humanlike

ember rapids
#

Wait I just realized google i/o is next month

#

Gemini ultra late may

sonic tendon
calm sequoia
#

The 2.5 Pro still blows my mind every day and it's not even SOTA anymore 👀

sonic tendon
keen beacon
#

yeah I remember seeing "Gemini 2.5 Pro" in a twitter notification, tapping it, looking at the benchmark image and my jaw dropped

#

it was the last thing I expected from deepmind

balmy mist
calm sequoia
#

They simply looked at what Meta is doing and did exact opposite

keen beacon
#

yann lecooked

novel flame
calm sequoia
raven void
#

Is grok 3 mini on lmarena?

sonic tendon
#

reasoning in a non-language-based abstract space was something i felt would be the endgame of LLMs since GPT-4, but i never knew enough to really formalize my ideas let alone work towards creating them

keen beacon
#

arena updated Apr 18, 2025
and no o3 yet :/

sonic tendon
#

it's cool to see that people are pursuing that now

calm sequoia
keen fulcrum
novel flame
sonic tendon
#

not exactly an ML person at the moment, but wouldn't you want to build some sort of transformer that had a way of tokenizing "thoughts"

#

if that makes any sense

#

i uh

#

j

novel flame
calm sequoia
sonic tendon
sonic tendon
#

i was under the impression that we specifically optimized against that happening during RL for interpretability reasons

novel flame
calm sequoia
#

The more gibberish the better

#

It would be smart to train a small 3B model to translate gibberish to language if seeing thinking is so interesting to people

sonic tendon
#

maybe if you do RL enough, it'll just figure out a language of pure thought on its own that looks like unreadable random data to someone reading the CoT

calm sequoia
#

Yes, because it needs to learn compression. Unless English is the most efficient language that can exist. Which isn't the case.

keen beacon
#

oh yeah

#

I can't wait for Imagen 4 :3

sonic tendon
novel flame
# sonic tendon maybe if you do RL enough, it'll just figure out a language of pure thought on i...

Yes, or if you just don't use language tokens for reasoning in the first place. You can create your embeddings from language to latent vectors, then operate on the vectors all the way to the answer, and then translate back to language at the end/output. Apple Intelligence allegedly did this using i-JEPA internally, then using something small like Phi-3 trained to describe the result in natural language.

tall summit
#

time for ithkuil to shine

sonic tendon
#

i ought to read about this stuff more

brittle tiger
#

I wonder if polymarket becomes a problem for lmsys arena. Close to $3M has been bet riding on o3 or o4-mini surpassing 2.5 pro or not by end of April. Wouldn't be surprised if that's why we see new accounts asking about when it will be added

sonic tendon
tall summit
#

wait that polymarket question uses lmarena as its metric?

sonic tendon
#

yep

#

style control off

tall summit
#

never knew

thorny drum
#

It’s closer to like 100k or so

sonic tendon
thorny drum
#

The total volume traded is a poor indicator for money riding on this

sonic tendon
thorny drum
#

You should look at open interest

#

Say I buy a share from you then sell it back

#

We technically traded 2 shares but there’s no money riding on the outcome
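The volume vs open interest distinction can be sketched with a toy ledger. This is a simplified model for illustration, not how Polymarket actually tracks positions: it handles a single outcome token and treats a seller without inventory as going short.

```python
from collections import defaultdict


def volume_and_open_interest(trades: list[tuple[str, str, int]]) -> tuple[int, int]:
    """trades: (buyer, seller, shares) for a single outcome token.

    Simplified model: a seller with no inventory ends up short, which on a real
    venue would correspond to holding the opposite outcome.
    """
    positions: dict[str, int] = defaultdict(int)
    volume = 0
    for buyer, seller, shares in trades:
        positions[buyer] += shares
        positions[seller] -= shares
        volume += shares
    # Open interest: shares still held by someone after all trades net out.
    open_interest = sum(p for p in positions.values() if p > 0)
    return volume, open_interest


# The example from the chat: I buy a share from you, then sell it back.
print(volume_and_open_interest([("me", "you", 1), ("you", "me", 1)]))  # → (2, 0)
```

Two shares changed hands, but nobody holds a position at the end, so nothing is riding on the outcome.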

sonic tendon
#

the total money in the market is the same as the total number of (yes+no)/2 shares in circulation, no?

thorny drum
#

More like the # of yes shares due to an intricacy with how Polymarket works

#

NO shares is a poor indicator

sonic tendon
#

wdym? i was under the impression that yes or no shares couldn't exist independently w/o a corresponding opposite share

thorny drum
#

Yeah but say I buy NO on every company

#

Then I don’t have any exposure to the outcome

#

Polymarket lets you get your money back by merging those shares into USDC

#

and the no shares effectively disappear

sonic tendon
#

ohhhh, merging can happen across different options in a market?

#

my mistake

#

i thought it was just yes+no of one specific option, nothing else

#

apologies

thorny drum
#

It’s a very understandable mistake

#

Not very intuitive

sonic tendon
#

well, that's somewhat relieving, then

#

yeah

brittle tiger
#

Even if volume is only six figures, money riding on it is clearly an incentive to try and game the system. I doubt people trying that would be advanced enough to get past lmsys systems but they should be aware.

sonic tendon
#

yeah

thorny drum
#

I think they are and I don’t see any indication that people are rigging it

sonic tendon
#

or buy both a yes and no share at the same time for $1

thorny drum
#

Hopefully stays that way

#

There was that one Reddit post but I’m pretty sure it was fake

sonic tendon
#

@deep adder we're operating under the assumptions you're asserting - i think this is a misunderstanding

#

yes

#

@brittle tiger correct me if i'm wrong, but

thorny drum
#

Technically you can’t sell shares on Polymarket

#

But I don’t think there is any misunderstanding here

#

It’s just semantics

#

Not intuitive or meaningful

keen beacon
thorny drum
#

Sung Jin I think you’re clueless

#

No offense

calm sequoia
thorny drum
#

Some people made bad trades and you could’ve traded against them and made money

#

That’s all that happened

raven void
#

Skibidi

sonic tendon
#

yeah

#

i misunderstood what they said

thorny drum
#

It’s not ‘manipulation’

brittle tiger
keen beacon
thorny drum
#

What do you mean fictitious

sonic tendon
#

i imagine it still depends on the country you're in

thorny drum
#

I traded a few thousand shares against them and made money

keen fulcrum
#

Hi, I do believe we need to add geographical understanding to lmarena

#

Where AIs play geoguessr

thorny drum
#

They just made a bad trade

calm sequoia
thorny drum
# sonic tendon wdym?

When you ‘sell’ YES you’re technically buying NO and then merging the shares into USDC. The mechanism isn’t too important
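The arithmetic behind that description, as a toy sketch. It assumes one YES plus one NO merges into exactly $1.00 of USDC and that YES and NO prices sum to 100 cents (no spread, no fees). Both are simplifications, and the function name is made up.

```python
PAYOUT_CENTS = 100  # assumed: one YES plus one NO merges into $1.00 of USDC


def sell_yes_via_merge(yes_price_cents: int) -> int:
    """Net cash (in cents) from 'selling' one YES share at the given price."""
    no_price_cents = PAYOUT_CENTS - yes_price_cents  # cost of the matching NO share
    cash = -no_price_cents   # buy one NO
    cash += PAYOUT_CENTS     # merge YES + NO back into USDC
    return cash


print(sell_yes_via_merge(31))  # → 31
```

Under those assumptions the holder ends up with the YES price in cash and no shares, the same result as a plain sale.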

keen beacon
thorny drum
#

There was a Reddit post from someone who claimed to rig the arena but it seemed fake

sonic tendon
#

sorry for drumming up an argument lol

#

i thought craig was joking when he said he traded a ton on prediction markets

keen beacon
#

question, what happens if 50 shares are being sold at 31 cent, but i place buy order for 50 shares at 60 cents (stupid ofc) but what would happen

#

would i be buying from those selling at 60 cents or 31cent ?

sonic tendon
#

i think you'd just sweep through the order book
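A sketch of what sweeping the book looks like under ordinary price priority, where a buy limit order fills against the cheapest resting asks first and the limit is only a ceiling. This assumes a standard limit order book. Whether Polymarket matches exactly this way is not confirmed in the chat.

```python
def match_buy(asks: list[tuple[int, int]], quantity: int, limit_cents: int) -> list[tuple[int, int]]:
    """Fill a buy limit order against resting asks given as (price_cents, shares).

    Returns the fills as (price_cents, shares), cheapest asks first.
    """
    fills = []
    remaining = quantity
    for price, available in sorted(asks):
        if remaining == 0 or price > limit_cents:
            break
        taken = min(remaining, available)
        fills.append((price, taken))
        remaining -= taken
    return fills


# The case asked about: 50 shares offered at 31c, buy order for 50 at a 60c limit.
print(match_buy([(31, 50), (60, 50)], quantity=50, limit_cents=60))  # → [(31, 50)]
```

Under that assumption the order fills entirely at 31c. The 60c asks would only be touched if the order were bigger than what is available at 31c.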

#

sounds p fun

keen beacon
#

rare to see people sit and only talk about $

#

very logical and rare, i wonder why that is

sonic tendon
#

i should eep

keen fulcrum
#

Grok3 mini so great

calm sequoia
#

Grok is known to be good at benchmarks but not in practice. I'm not claiming it's underperforming, but there hasn't been a single instance where Grok 3 is best at my use cases. Maybe the new thinking model will be different, but there are too many options for coding at this level anyway.

keen beacon
#

atm o3>2.5>claude 3.7> the rest

keen fulcrum
#

You did forget o4-mini
Performs on-par with o3

#

The problem being people believe o3 and o4-mini are the same
There is a reasonable difference

calm sequoia
#

@keen beacon @sonic tendon would you share your experiences in betting on benchmarks? Was it successful?

keen fulcrum
#

A grok fan long time no see

keen beacon
#

I do ignore 99% of markets. most are betting, i pick the 1% where i can invest based on information

calm sequoia
#

I thought only insiders make money there

keen beacon
#

they make the most $ for sure

#

i assume others also $ launder

#

but there are others ways
just one example i will give

#

for rotten tomatoes score for movies, i literally trained an ai model to predict final score, based on previous movie data

#

and i constantly scraped the rotten tomatoes site for new reviews, that gave me an edge over the market

calm sequoia
#

Very impressive

misty vault
balmy mist
#

nahh grok is not better than 2.5 lol

#

i wouldnt even put it better than 3.7 thats debatable

#

i have bro, but i will test again, i did not test mini yet

#

also yall need to try roo code with the different modes

#

its kinda cracked

#

i was only using boomerang mode

#

but you can do much much more

keen beacon
balmy mist
#

basically your own custom ai agent system that uses specific models for certain tasks and roles

fringe carbon
#

but this is the wrong channel

zinc ore
#

New chart just dropped.

fringe carbon
#

u have a task llm set up?

keen beacon
#

all i can say is that i bet on very few markets that are very news driven, llm gets news all the time, when it's good news then it auto invests

torn mantle
#

Elon is such a liar

#

He said grok 3 will be updated soon

#

But i havent noticed a difference at all

#

I guess those updates will be on grok 3.5

#

They really should fix multi turn

golden ocean
#

what about writing

balmy mist
#

@torn mantle im trying to make the best system prompt for a UI/UX designer, what do you use for your agents? also does apple have the undisputed best looking apps? like user friendly and stuff?

hardy pecan
#

UHM guys, are you able to convince gemini 2.5 pro, that todays date is in 2025?

#

Mine is ADAMANT that its 2024 April 19th

#

and i give it websites and screenshots and its gaslighting me into thinking i doctored the screenshots lol

#

maybe it cant read the website and its hallucinating

misty vault
torn mantle
torn mantle
#

you can experiment with neumorphic

#

its pretty good too

#

for threejs i just ask it to make it realistic/HDR env/in-depth shadows & details

#

something like that

ocean vortex
umbral crypt
#

are there any rumors for nightwhisper or dragontail release dates?

ocean vortex
#

it should be on $0

torn mantle
#

dayhush is good but i still prefer nw

balmy mist
#

nw follows directions perfectly

#

and the ui design is slightly better

#

without having to nudge it as much

tall summit
#

redditors are screaming that o3's geoguessr performance is because of metadata

thorny drum
#

google rigging the benchmarks once again!

balmy mist
#

chill google is king

tall summit
#

referring to this twitter post btw

#

also brain fart that's not google

olive mesa
#

i honestly want google to create ASI before openai

brittle tiger
#

I had no idea you only get 32k context with Plus for chatgpt. Apparently it just does RAG after that

https://x.com/TheXeophon/status/1913120160753332703?t=6arSE6mYeArtrSWOTpzvNQ&s=19

Some @ChatGPTapp footguns:
- The model for Deep Research does not matter. It always uses o3 (Deep Research version)
- The model for image generation *does* matter - image gen is a tool; chosen model writes the prompt
- Pro has 128K context, Plus (+Teams) 32K, free 8K

keen beacon
#

vibe coding vs vibe pentesting

keen beacon
#

plus they already have the competition on lock

#

they sell LLM hosting for gpt and many other llms on azure

#

the amount of data they're probably getting rn is crazy

calm sequoia
# calm sequoia
Poll: Have you ever bet money on the bench outcome?
Winning answer: No (4 of 8 votes)

worthy thunder
# keen fulcrum

Anyone know if they (xAI) are still doing consensus@64 for their benchmarks, or moved to pass@1? Genuinely curious, not paid attention since the original announcement

elder rapids
#

bruh?

#

are you talking about 2.5 flash

balmy mist
elder rapids
#

I'm calling for clarity

#

but I'm wondering why he said grok 3 and not grok 3 mini

#

or grok 3 vs 2.5 flash

#

knowing 2.0 pro would be a better base, and that 2.5 is even better than that

#

just saying

keen beacon
#

2.5 pro has a significantly better simpleqa score

#

its the closest thing to 4.5

#

afaik

#

ye

#

its like 10% better than grok 3 (simpleqa, which is quite significant)

elder rapids
#

2.0 pro is the best model out of that bunch lol

#

or was grok 3 updated recently

#

bruh?

keen beacon
#

ya. and anyway their reasoning doesnt seem to increase simpleqa yet, 2.5 flash actually has a lower simpleqa score than gem 2 flash

elder rapids
#

btw can we lower reasoning for 2.5 pro yet

#

that's not what base means

keen beacon
#

ive been seriously impressed with how much 2.5 pro knows tbh

elder rapids
#

that would just be inherent to the model size

#

2.0 pro is a stronger model than grok 3

#

and can reason better when asked to

#

therefore is a better "base model"

#

actually hollon, 2.0 pro is STILL the highest performing non reasoning model besides 4.5 on livebench??

#

yo wtf

#

Lowkey thought there was gonna be another model that I forgot about

#

can you screenshot

#

?

#

oh you mean this one

#

what does that mean tho?

#

lmarena isn't a performance bench it's a vibe bench

#

so?

#

because it is lol

#

and it doesn't write better

#

this is something I specifically look for

#

grok 3 shoehorns itself narratively

#

buns writing, good philosophical understanding

#

becomes the balance

#

base ye

#

the best model I can pull the potential out of tho

#

is 2.5 pro

#

ye

#

it's creative but I disagree a ton

#

it doesn't know real writing conventions

#

it doesn't know how to conform to instructions

#

ie, a prose

#

or a poem

#

how would telling you be informative if I'm doing the same for all these models

#

4.5 is def human like

#

it doesn't get stuck on repetition

#

it doesn't overemphasize

#

it doesn't become swayed so much by context

#

but it isn't so good for writing, just "ideas"

#

since it lacks rigor

#

or academic conformity

#

just give it a quiz or something

#

like a writing quiz

#

it's going to get stuck on the potential ambiguity

#

and just choose whatever

#

rationalizing every answer the best it can

#

like "pick a sentence in source #2 that supports and states the relevant premise to source #2 in source #1"

#

given 2 sources

#

a hard one ofc

#

it'll sufficiently rationalize each sentence in the second source

#

but not be able to choose the "correct" sentence

#

with rigor

#

flash 2.0 gets these right

#

lol

#

but I'm saying it's a very general thing, you'll encounter it or you probably have

#

and didn't notice

#

that it lacks academic rigor

#

yep 4.5 has great implicit understanding

#

but again, this is still something that most other models lack, that 2.0 pro doesn't

#

although behind gpt 4.5 it's 3.5 sonnet

ember rapids
#

4.5 is the first model to have a somewhat decent sense of humor

elder rapids
#

ye

#

unprompted

#

on what

#

oh mb

#

nah I have

#

it hallucinates too much

#

it's smart

#

but damn

#

doesn't really understand prompts too well

small haven
#

can o3 pro come any sooner 😭

alpine coral
#

grok-3-mini(high) has also done quite poorly against a few question sets i've been using lately. will share some screenshots in a min. but yeah in terms of comprehension / verbal reasoning (wordplay etc), understanding of real-world implications / spatial reasoning, and emotional intelligence, it's pretty crap from what i'm seeing tbh

balmy mist
#

im trying to see which model i should use for my low level agents, im thinking between: grok 3 mini, o4 mini, 4.1 mini, or 2.5 flash

keen beacon
#

prob o4 mini since its good at calling tools

balmy mist
#

and o4 is cheap right?

#

the only thing is grok 3 mini gives you free credits every month $150

keen beacon
#

does it support natively calling tools tho?

keen beacon
balmy mist
#

you are right

#

no tools

keen beacon
#

i mean u can still work out a solution its not as efficient but u have a ton of free credits and im assuming pricing is quite cheap on grok 3 mini

#

e.g. in the final response it returns the json of the tool call
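
A rough sketch of that workaround, assuming you prompt the model to end its reply with a JSON object shaped like `{"tool": ..., "args": ...}`. The schema, the `get_weather` tool, and the `extract_tool_call` helper are all hypothetical and only here for illustration; none of this is an actual xAI API.

```python
import json

def extract_tool_call(reply: str):
    """Return the last JSON object in the reply that has a "tool" key, or None."""
    decoder = json.JSONDecoder()
    found = None
    idx = reply.find("{")
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting at idx and
            # returns it together with the index where it ended.
            obj, end = decoder.raw_decode(reply, idx)
        except json.JSONDecodeError:
            idx = reply.find("{", idx + 1)
            continue
        if isinstance(obj, dict) and "tool" in obj:
            found = obj
        idx = reply.find("{", end)
    return found

# Hypothetical tool registry on the client side.
TOOLS = {"get_weather": lambda city: f"sunny in {city}"}

# Hypothetical model reply: free text followed by the JSON tool call.
reply = 'I should check the weather.\n{"tool": "get_weather", "args": {"city": "Paris"}}'

call = extract_tool_call(reply)
if call is not None:
    result = TOOLS[call["tool"]](**call.get("args", {}))
    print(result)
```

It costs an extra round trip per tool call compared with native tool calling, since the tool result has to be sent back as a normal message.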

balmy mist
#

and you got this chart:

balmy mist
#

thats where i am using these agents

keen beacon
#

oh idk then lol

#

if it supports a custom openai server u could probably make ur own proxy server that calls the grok api and emulates tool calls

#

it really depends but im not familiar with that stuff at all (ai ides)

balmy mist
#

what is the artificial analysis intelligence?

#

if this is true then grok 3 mini is the best model

keen beacon
balmy mist
#

that cant be real tho?

#

like its higher than 3.7

keen beacon
#

i mean its charting the cost/performance ratio

balmy mist
#

yeah but vertical is performance

#

it destroys on cost

#

but then on performance its like top 3

#

besides o3 cause its not on the chart

keen beacon
#

ya i think its because of the benchmarks they choose

#

obviously depending on the task it varies

#

i think u should look at swebench or something

#

u are not throwing graduate level google proof questions

torn mantle
balmy mist
#

ahh okay, anyone have an updated version for swe?

#

i think its worth locking in the best small model (semi small)

keen beacon
#

wrt artificial analysis intelligence, i think the benchmarks there, xai probably applied a lot of rl proportionally to similar questions

balmy mist
keen beacon
#

so it may be misleading

balmy mist
torn mantle
#

so rusty

balmy mist
#

and bc its free

torn mantle
#

if you spend more time explaining to a model how to do things then you're better off with another one

#

i found myself re-explaining my prompt

#

over and over

#

and reminding grok the context

#

it wasnt a good experience

#

it will generate good code but after the 2nd prompt it wont follow instructions well

elder rapids
#

this website struggles with some stuff

balmy mist
keen beacon
elder rapids
#

price to performance is redundant in a vacuum

torn mantle
#

yea

keen beacon
#

of ur use case

torn mantle
#

o3 is really good

#

its the first time i felt a model is really smart

zinc ore
#

Grok 3 mini should be a bit below o3 mini @1 pass

#

On most benchmarks

balmy mist
#

wtf? they used two models?

balmy mist
keen beacon
#

ya theres a configuration to do that in aider

#

architect mode

balmy mist
#

wow thats dope

#

but still o3 too expensive

torn mantle
#

but its like the most reliable model

alpine coral
balmy mist
#

do yall like 2.5 flash?

#

im settling on it for my small model

#

o4 is too expensive and only slightly better than flash, what yall think?

torn mantle
#

im still trying dayhush xd

novel flame
#

Grok 3 mini juiced on the benchmarks, that’s the only explanation for that ludicrous position on the Artificial Analysis chart. I tested the model myself last night and it’s pretty mid. It hallucinated in my grounded recall test, and did a good job on my realistic coding test, though it only barely broke my top ten of coding models.

In my harder coding test it was below average among the contenders (I have only given this test to the 15 or so most promising coding models so ‘average’ isn’t terrible, but Grok 3 mini was below average).

The Aider leaderboard looks pretty much correct in terms of real world coding usefulness of those models, matches my tests and experience quite well (though I still find that o3 and o4-mini behave inconsistently).

Grok 3 Mini might have some good uses as a cheap capable model for non-coding tasks. It had perfect scores for me in wordplay and associative logic.

calm sequoia
torn mantle
#

we all know it underperformed

#

also

#

what happened to "Big Brain" feature?

balmy mist
#

yeah that might not come

torn mantle
#

unfortunately dayhush doesnt seem that good at complex physical simulations

#

if its a recent checkpoint of nightwhisper then idk how to feel about that

#

i mean its better than other models but its not what i had in mind

novel flame
#

Hmm..... I managed to get Gemini 2.5 Flash to make a browser game for my coding test. With the default settings it basically told me to pound sand because the task was too big for a single prompt (which is fair), but I wanted it to give it a try so I reduced the temperature to 0.5 and lo and behold, it did it! ...... poorly. The result was buggy and generally underwhelming, on par with Grok 3 Mini, Llama 4 Maverick, o3, and o3 Mini (all of which did poorly on this test).

The top tier on this test is Claude 3.7 Sonnet quite far ahead of the pack; and the second tier consists of Grok 3, Riveroaks, Gemini 2.5 Pro, o4-mini-high, and DeepSeek V3 0324.

raven void
#

o3 is really good, 5-10 IQ points higher than the next smartest model
it's hard to believe OpenAI will have o5 pro internally in a few months

raven void
plain zinc
#

Dayhush - Google's coding model

raven void
#

From my testing, o3 answers are better than 2.5 most of the time, I would long OpenAI for April

#

In tests of Code Taste, Problem Analysis and Fiction

plain zinc
raven void
#

In general use also?🤔

#

I haven't checked lmarena recently tbh

keen beacon
drifting thorn
#

What? Dayhush?

novel flame
#

Maybe someone here knows…. I don’t have a paid Grok account so I tested Grok 3 Mini through the OpenRouter Playground with default settings. And as mentioned it performed poorly. But it occurs to me that maybe the default settings on OpenRouter are low reasoning effort? If so then maybe I judged Grok 3 Mini too quickly?

plain zinc
#

You'll remember my words when he arrives.

keen beacon
#

ik that 2.5 ultra would beat even o3 generally

#

hell o3 barely beats 2.5 pro, so no chance vs the ultra model

umbral crypt
keen beacon
umbral crypt
#

oh

late path
#

o3 reply with emoji

umbral crypt
#

☺️

#

like this

wraith tulip
#

HELLO. Nothing else to say. Just out of curiosity...🙂

waxen arch
#

beep boop

hollow comet
#

Does anyone know a working prompt for humanizing text?

tall summit
#

hello friends

tall summit
#

a surprising amount of people have asked this question and i get so confused

#

humanizing texts are one of llm's best skills

#

and o3 is quite amazing at prose so i think it's improving even at that

ornate stump
#

Never used Grok because I hated Twitter even before Elon, but I’ve found out that voice chat in grok is more useful than Gemini Live and chatgpt advanced voice.

tall summit
calm sequoia
#

Poll: King of the kings
Winning answer: 🔥 o3 (5 of 14 votes)

hollow comet
# tall summit humanizing texts are one of llm's best skills

For me it doesn't work, at least not the way I want, even with a more complex prompt. It works when it writes in my language and the AI detector gives 0%, but in English it doesn't work in most cases and the detector says 100%. And even text in my language that the detector classifies as human gets flagged as AI after translation via Google Translate

unborn ocean
#

U can try to tell it to write in the style of a specific well-known author.

glass arch
#

interesting

#

it seems chatgpt can show you images?

#

since when has it done this

#

also guys, only 100 million more years until we have to update the 13.8 billion years number!

barren prairie
calm sequoia
#

Looks like o3 also has sampling issues. It got my prompt right only 4/5 times. Variability usually was a problem only with the mini models.

plain zinc
#

It's coding model

#

Just simple coding model

ocean vortex
#

the way I'm looking at this is if it's correct most of the time that means it's correct period. But if you have 2 models you are comparing that are incredibly close, you could look at the exact success rate percentage as well for more in-depth info.

#

though at that point it must be said it would probably be smarter to expand the number of your test prompts, rather than hyper focusing on just this 1 or a few
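
To put a number on why a handful of runs is thin evidence, here is a minimal sketch that turns a pass count into a success rate with a Wilson score interval. The `wilson_interval` helper is written here for illustration, and the 4/5 figure is the one mentioned above; the 95% level (z = 1.96) is an assumption.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

low, high = wilson_interval(4, 5)
print(f"4/5 passes: rate 0.80, plausible range {low:.2f} to {high:.2f}")

low_big, high_big = wilson_interval(80, 100)
print(f"80/100 passes: rate 0.80, plausible range {low_big:.2f} to {high_big:.2f}")
```

With only 5 runs the range is roughly 0.38 to 0.96, so two models that score 4/5 and 5/5 on one prompt are not really distinguishable; more prompts or more runs narrow it.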

ocean vortex
pliant cypress
#

webdev arena is so broken right now or something is wrong with my pc/internet?

#

like 90% generations fail

wheat onyx
#

Analyzing images like this is a gamechanger

wheat onyx
leaden meteor
#

Poll: Will Gemini-2.5-Pro-Exp-03-25 drop from #1 on LMArena leaderboard in next 7 days
Winning answer: Yes (11 of 20 votes)

calm sequoia
calm sequoia
keen beacon
#

O3 and o4 mini are there

#

They used to ignore the value but it seems they don't anymore (setting api temp)

calm sequoia
#

Can you do it during battles? 👀👀

keen beacon
#

Nah

calm sequoia
#

Sadly anonymous models cannot be selected

#

It always passes on direct chat. Maybe they are testing multiple temperature variants

keen beacon
#

I heard the quality was different on the api rn but I don't use the models so idk

#

Something is messed up on chatgpt to some extent I've heard

ocean vortex
#

but for openai or anthropic, I do not believe there is a way with their libraries currently

tall summit
wheat onyx
#

I think those immediately DOA fitness mirrors are going to make a comeback soon

ocean vortex
#

for reference this is how it looks when that's not restricted:

heady kiln
#

anybody facing same issue? web lmarena
not returning any response

it says "generating" then "generating" disappears and makes me vote but, no output appears

ocean vortex
#

it wouldn't work with plus (32k), and I'm not sure pro has the full 200k either... 🤔
And with API it doesn't make sense to be paying this much, just dump it into 2.5 pro

balmy mist
# umbral crypt how do i get this?

Check out the NinjaChat AI platform over here : https://www.ninjachat.ai/

In this video, I'll be telling you about Grok 3 and Grok-3 Mini new API that allows you to use it for free with Cline and RooCode.


Resources:

Grok AI Console: console.x.ai
Requesty: https://app.requesty.ai/join?ref=4581bcf6


Key Takeaways:

🚀 Grok-3 API is ...

ocean vortex
#

try it

heady kiln
#

ok, i'll try it again

ocean vortex
#

oh you just quoted the wrong person lol

#

I think it discards the cache and a refresh

heady kiln
#

anybody compared dayhush, claybrook, dragontail and nightwhisper yet?

heady kiln
#

sorry btw, wrong reply

calm sequoia
keen beacon
#

Hardware floating point precision optimizations etc

keen beacon
#

In reality, because you can't have perfect numerical precision (hardware etc.), there is inherent nondeterminism

ocean vortex
keen beacon
#

Greedy decoding assuming all sampling is disabled does not introduce randomness theoretically

ocean vortex
#

each training run will normally result in a different outcome (model) since random starting seed is usually that - random.

keen beacon
#

Ong

#

Omg

#

Not this again

calm sequoia
#

May be compiler thing

#

Choosing different path based on some conditions

#

Server thing also

calm sequoia
#

I asked o3: "Short answer: Not necessarily.
Under today’s large‑language‑model (LLM) stacks, sending exactly the same prompt 100 times can yield identical, slightly different, or wildly different completions depending on four controllable factors (decoding parameters, software, hardware, model snapshot) and three unavoidable sources of entropy (floating‑point nondeterminism, multithreading, and vendor weight updates)"

#

I dont see how floating point can be involved here, but multithreading could be a cause

umbral crypt
calm sequoia
#

Architectural & theoretical backdrop

Decoding strategies deliberately inject randomness to avoid repetition and exposure bias, and recent surveys show variance growing with model size and alignment techniques.

#

Extrapolation you say

keen beacon
#

LLMs can play chess. The skill is destroyed in the instruct process

leaden palm
calm sequoia
#

Could you imagine a new color, paws?

keen beacon
#

0.1+0.1+0.1 != 0.3

#

Can happen

calm sequoia
#

I spent a lot of time programming floating points in C. In my practice, it's deterministic. Imprecise but deterministic.
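
Both points can hold at once. A small sketch: the same floating-point operations in the same order give the same result every time, but floating-point addition is not associative, so a reduction that adds the same numbers in a different order (as parallel hardware may do) can land on a slightly different value. The regrouping below only stands in for a different summation order; it is not an actual GPU kernel.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # one summation order
right = a + (b + c)   # same numbers, different grouping

print(left == right)            # the two groupings disagree in the last bit
print((a + b) + c == left)      # repeating the same order is fully reproducible
print(0.1 + 0.1 + 0.1 == 0.3)   # rounding error: the sum is not exactly 0.3
```

A last-bit difference in a logit is usually harmless, but when two candidate tokens are nearly tied it can flip which one gets picked, and everything after that token diverges.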

#

It wouldnt be hard if we would have feedback loops

#

Why do you think data != algorithms

#

Instruction sets are data

keen beacon
#

Multi threading is done even in gpu inference (like scheduling etc maybe) but I don't think they cause the non determinism in actual decoding

calm sequoia
#

We need help from @wooden mulch What causes same models to produce different results with the same input on the arena?

keen beacon
calm sequoia
#

I always wonder if they are making any calculations locally

#

To reduce server load

keen beacon
#

Not possible lol

balmy mist
#

damn

ocean vortex
#

it "considers" a ton of tokens before settling on a specific one. And that process is for the most part random by design and including a lot of potential outcomes if you leave everything at default settings.

calm sequoia
#

You may be an order of magnitude more informed than me, or less informed

#

Anyway, Ill read one book on llms to catch up

ocean vortex
#

if we made LLMs deterministic they would be very boring and not really pleasant to interact with

calm sequoia
#

Was it since the beginning or introduced after gpt 3.5?

oblique flint
#

but temperature isnt

calm sequoia
#

I dont understand why they didn't release o3 as GPT 5. It does feel like a gpt 3.5 to gpt 4 improvement.

#

What are we waiting for next this quarter? 2.5 Ultra, R2 and qwen max?

#

I wonder if they will release o4 at some point or will call it GPT5

ocean vortex
# oblique flint llms are deterministic wym

well the way we implemented transformer arch and how we are running them, they are not deterministic. Was talking about end result here when you are already interacting with the model. Non-deterministic is desired behavior so when that's lacking it is simply artificially added at the given stage.

keen beacon
keen beacon
#

Adding a little bit of temperature changes it a lot more I think

ocean vortex
keen beacon
#

Greedy decoding is no sampling
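
A minimal sketch of the distinction, using made-up logits for a three-token vocabulary. Greedy decoding takes the argmax and involves no random draw; temperature sampling draws from the softmax distribution, so repeated runs can differ. The helper names and the seed are illustrative only.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Pick the highest-logit token; no randomness involved."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature, rng):
    """Draw one token index according to the temperature-scaled probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.5, 0.3]   # hypothetical next-token logits
rng = random.Random(0)     # seeded only so the demo is repeatable

print([greedy(logits) for _ in range(5)])            # always token 0
print([sample(logits, 1.0, rng) for _ in range(5)])  # a mix of tokens
print(softmax(logits, 1.0)[0], softmax(logits, 0.1)[0])  # low temperature approaches greedy
```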

ocean vortex
#

not in the context of you interacting with the models on lmarena

keen beacon
#

Yeah but you were talking about LLMs in general

#

I understand what u mean now

ocean vortex
#

the original question was why they are not deterministic on arena. So excluding the sampling would just be selectively misleading now lol

#

99% of the time users are interacting with LLMs on any platform it includes sampling anyway 👀

sour spindle
#

Wondering if anyone can explain this to me. Is o3 on arena the same o3 that you have on your ChatGPT account.

#

I feel like when I use o3 on my account I get more in-depth answers and o3 on arena feels watered down.

#

How are they able to give o3 access on arena? Does OAI subsidize usage?

leaden palm
#

o3 in arena is raw api version

sour spindle
#

Even then how are they able to allow so much access to it on arena

keen beacon
#

They didn't do it for o1 but added o3 mini tho (direct chat)

#

I'm assuming o3 is just cheaper to run

leaden palm
keen beacon
#

Yeah

leaden palm
#

o3 is $40/mtok while o1 was $60/mtok out
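
As a quick worked example of what that difference means per reply, using the two output prices quoted above. The 10,000-token reply size is an arbitrary assumption, and input-token cost is ignored.

```python
def output_cost_usd(output_tokens: int, price_per_mtok: float) -> float:
    """Cost of the output tokens alone, given a price per million tokens."""
    return output_tokens * price_per_mtok / 1_000_000

reply_tokens = 10_000  # hypothetical reply length, including reasoning tokens

o3_cost = output_cost_usd(reply_tokens, 40.0)
o1_cost = output_cost_usd(reply_tokens, 60.0)

print(f"o3: ${o3_cost:.2f}  o1: ${o1_cost:.2f}")
```

So a long reasoning reply is about a third cheaper at the o3 price, which adds up quickly at arena volumes.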

keen beacon
#

It's also probably more marked up (despite lower pricing) I think

keen beacon
leaden palm
#

m

sour spindle
#

Thanks for answering my questions KT

keen fulcrum
ember rapids
#

How does claybrook compare to nightwhisper

#

Just saw this

brittle tiger
#

but the gaps aren't major

alpine coral
#

it doesn't seem to do anything tbh.. like setting it to max (100k tokens), and it uses fewer reasoning tokens than when set to 50k, or 1k.. (and performs worse compared to 50k.. though i'm pretty sure this parameter, at least on OR, isn't doing anything, and it's just variability)

keen beacon
#

Does it stop at 1 token if u set it to that

alpine coral
#

the value set in the param doesn't seem to correspond to how long / many tokens it uses for reasoning

keen beacon
#

No I guess

wheat onyx
alpine coral
#

it always reasons, and coherently/completely

#

like no cut offs

keen ferry
#

which ai is on Gemini Live on the app im using webcam and its kinda good id say?

keen beacon
alpine coral
#

yeah was gonna post something there actually

keen beacon
#

It should be possible since I think they support it for openai reasoning models

alpine coral
#

it feels like the param is doing nothing / doesn't exist

torn mantle
#

i cant

torn mantle
#

claybrook gave me good results as well

#

so im not sure if its better than dragontail

#

bruh these names ...

alpine coral
#

i see yeah OR's implementation is off then

drifting thorn
#

Waiting for NW with better native tool use

torn mantle
#

this is actually a cool prompt

#

Gemini 2.5 Flash demolishes my Galton Board test, I could not get o4-mini, o4-mini-high, or o3 to produce this. I found that Gemini 2.5 Flash understands my intents almost instantly, code produced is tight and neat. The prompt is a merging of various steps. It took me 5 steps to

#

see if you can make dayhush generate that

keen fulcrum
#

So there is Dayhush and Claybrook from Google
Which is better?

torn mantle
#

dayhush

#

is a coding model

keen fulcrum
#

Even better than nightwhisper?

torn mantle
#

claybrook is probably a recent checkpoint of gemini 2.5 pro

torn mantle
#

i cant tell

#

its really hard

#

but ive noticed some small issues with dayhush

#
  • overlapping elements
  • text out of the containers elements
  • it needs a clear guidance
keen fulcrum
#

Perhaps a mini model

torn mantle
#

its probably the slowest model

keen fulcrum
torn mantle
#

it takes a lot of time thinking

drifting thorn
#

But knowledge graph is more useful than just embedding RAG

balmy mist
hybrid locust
#

lmarena beta has no rate limits wow

keen beacon
#

these mfs are being bankrolled in lab credits

hybrid locust
#

they're backed by lots of orgs tho

#

it's probably insignificant to them

alpine coral
#

well.. Sequoia Capital is an organisation of sorts lol

golden ocean
#

kinda ass cheeks

keen beacon
#

o3 mini high > 4.1 > dayhush

#

google getting too much hype

raven void
keen beacon
#

just gotta wait for o3 nr 1 to drop

#

any day now 😢

torn mantle
#

thats the issue

#

it may be good at UI/UX

#

but it still lacks on complex reasoning and implementation

#

i got it to work once/6 tries

torn mantle
#

but it hallucinates a lot

ocean vortex
keen beacon
balmy mist
calm sequoia
#

Chinese endgame

keen beacon
#

i asked gemini and o3 to prepare 3 hard questions for a very smart rival ai. and also to solve these questions so that i knew the solutions.

Both models solved each other's questions, but gemini took way longer to do o3's questions and its solution was way less elegant

patent bane
#

wdym by less elegant

keen beacon
#

more complicated and longer than it need be

#

i also made a game where they use their own algorithms to command their player and shoot at each other, o3 won 8-2

elder rapids
#

these aren't even benchmarks

#

🙏 😭

keen beacon
#

these are better benchmarks than official benchmarks that models already know the answers to

zinc ore
elder rapids
#

they couldn't be

#

they're not quantifying any real metric

zinc ore
#

We're in the benchmark wars of AI

keen beacon
#

even silly ones

elder rapids
#

thats great but that means I can say 2.5 pro beats o3 in many many different areas

#

and therefore more general

zinc ore
#

What we need is better benchmarks