#general | Arena | Page 27

balmy mist Apr 18, 2025, 2:44 PM

#

i still dont understand this chart:

#

how is grok mini so high

#

this says its on par with o4 and 2.5 pro

#

but dirt cheap

#

how is that not the best model?

#

wait claude has the highest output?

#

at 128k?

#

then grok?

alpine coral Apr 18, 2025, 2:47 PM

#

balmy mist how is grok mini so high

not really sure what their methodology is.. i think they mainly aggregate evals, but also conduct their own (or maybe it's the other way round) https://artificialanalysis.ai/models/grok-3-mini-reasoning

wintry tinsel Apr 18, 2025, 2:48 PM

#

When we getting Claude 4 already

balmy mist Apr 18, 2025, 2:48 PM

#

alpine coral not really sure what their methodology is.. i think they mainly aggregate evals,...

what do you think about grok 3 mini?

#

u think its SOTA?

wintry tinsel Apr 18, 2025, 2:49 PM

#

I think only real good models are google, Claude, and deep seek

fleet lintel Apr 18, 2025, 2:49 PM

#

Dayhush looks comparable to NW

wintry tinsel Apr 18, 2025, 2:50 PM

#

Google is competitive on price and best in some categories, Claude is best overall, and deep seek is open source

#

OpenAI is too expensive and too censored/stiff, don’t like it

#

Grok is kind of funky, it’s not best on anything right now, so not important

hollow tinsel Apr 18, 2025, 2:51 PM

#

Anyone getting Claybrook? the pokemon battle simulator blew me away

balmy mist Apr 18, 2025, 2:51 PM

#

i like that claude output is 128k

#

i might start using it again

fleet lintel Apr 18, 2025, 2:51 PM

#

hollow tinsel Anyone getting Claybrook? the pokemon battle simulator blew me away

which company model is claybrook?

hollow tinsel Apr 18, 2025, 2:52 PM

#

No idea, how do you check for that?

balmy mist Apr 18, 2025, 2:52 PM

#

wait i thought you were an openai fanboy?

#

you cheating on them?

cloud meadow Apr 18, 2025, 2:56 PM

#

Reka.ai seems to be doing some sort of a new beta for their site.

#

Idk what model this is but it's so bad lmao.

hollow tinsel Apr 18, 2025, 2:56 PM

#

fleet lintel which company model is claybrook?

how do we check which company its from?

fleet lintel Apr 18, 2025, 2:57 PM

#

hollow tinsel how do we check which company its from?

generally you can just ask them... like which company model are you?

#

but folks here are very smart and they can look at configs etc to figure out the company

hollow tinsel Apr 18, 2025, 2:58 PM

#

fleet lintel generally you can just ask them... like which company model are you?

hmm only got Claybrook on webdev arena. Let me see if Claybrook shows up

hollow tinsel Apr 18, 2025, 3:07 PM

#

fleet lintel which company model is claybrook?

its google

winged steeple Apr 18, 2025, 3:08 PM

#

Hello all! I have been playing with the arena battle and I have a very silly question— are models picked entirely at random, or are they weighed somehow? I'm wondering because I often see the same models, despite the huge pool.

calm sequoia Apr 18, 2025, 3:09 PM

#

wintry tinsel Google is competitive on price and best in some categories, Claude is best over...

"Claude is best overall". By overall you probably mean "vibes"? 😄 Tell me how it's better than o3, other than pricing

oblique flint Apr 18, 2025, 3:11 PM

#

does anyone have an explanation as to why 2.5 flash reasoning is more expensive per output token compared to 2.5 flash? Doesn't it still use the same model when it's reasoning? Or is it like a different quantization maybe?

cloud meadow Apr 18, 2025, 3:12 PM

#

oblique flint does anyone have an explanation as to why 2.5 flash reasoning is more expensive ...

2.5 flash vs 2.5 flash?

#

do you mean 2.0?

oblique flint Apr 18, 2025, 3:12 PM

#

I mean 2.5 flash with reasoning enabled vs with reasoning disabled

cloud meadow Apr 18, 2025, 3:12 PM

#

Oh right, it's better.

#

There you go

#

And also more expensive to run for Google.

oblique flint Apr 18, 2025, 3:13 PM

#

so it's 2 different models under the same name?

cloud meadow Apr 18, 2025, 3:14 PM

#

R1 and V3 situation I believe.

#

Idk

oblique flint Apr 18, 2025, 3:16 PM

#

all in all it seems like the reasoning version is massively more expensive. Not only is the reasoning version like 6 times more expensive per output tokens, it's going to generate way more output tokens. Initially I thought it would only increase the number of tokens generated, but I didn't expect they would also charge more per token on the reasoning version
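
A rough sketch of how those two effects multiply. Only the "6 times" per-token multiplier comes from the message above; the unit price and both token counts are made-up placeholders, not real Gemini pricing or measured usage.

```python
def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost of generating `tokens` output tokens at a per-1M-token price."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical unit price for the non-reasoning mode (placeholder, not a real quote).
base_price = 1.0
# "like 6 times more expensive per output token", taken from the chat message.
reasoning_price = 6 * base_price

# Hypothetical token counts: the visible answer, plus extra thinking tokens
# that are billed as output when reasoning is enabled.
answer_tokens = 500
thinking_tokens = 4_000

plain_cost = output_cost(answer_tokens, base_price)
reasoning_cost = output_cost(answer_tokens + thinking_tokens, reasoning_price)

# The price multiplier and the token multiplier compound.
ratio = reasoning_cost / plain_cost
print(f"reasoning request costs about {ratio:.0f}x the plain request")
```

With these placeholder numbers the request ends up far more than 6x as expensive, because the higher per-token price applies to the thinking tokens too.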

cloud meadow Apr 18, 2025, 3:17 PM

#

oblique flint all in all it seems like the reasoning version is massively more expensive. Not ...

Would you say that the output is 6x better?

alpine coral Apr 18, 2025, 3:19 PM

#

balmy mist u think its SOTA?

nah fwiw i've found it a bit underwhelming - it's not up there with gem pro 2.5 (for me / this is all obviously subjective..)

oblique flint Apr 18, 2025, 3:19 PM

#

not gonna lie I havent even tried the model yet lol

alpine coral Apr 18, 2025, 3:19 PM

#

but solid, for sure

plain zinc Apr 18, 2025, 3:20 PM

#

https://x.com/OfficialLoganK/status/1912968696265343424

Logan Kilpatrick (@OfficialLoganK) on X

@theLance Time for Gemini code

oblique flint Apr 18, 2025, 3:22 PM

#

so grok 3 mini is really 1/7th the output token price of 2.5 flash thinking lol

#

idk why google made 2.5 flash reasoning so overpriced

worthy thunder Apr 18, 2025, 3:23 PM

#

@calm sequoia as requested I added ranking to the tables (based on AUC calcs, sorted by 1M since my company is focused on very long context). I made changes to the charts but not including them here until I have new models to run it on. Enjoy

willow grail Apr 18, 2025, 3:26 PM

#

plain zinc https://x.com/OfficialLoganK/status/1912968696265343424

🍆 🫦 me and 2.5 coder

keen fulcrum Apr 18, 2025, 3:26 PM

#

worthy thunder @calm sequoia as requested I added ranking to the tables (based on AUC ...

Flash better?

worthy thunder Apr 18, 2025, 3:29 PM

#

keen fulcrum Flash better?

There's a slightly higher score at 32k and 128k context for 2.5 Flash (Thinking enabled) compared to 2.5 Pro, but it's within margin of error. Honestly, both 2.5 Flash (Thinking) and 2.5 Pro have practically the same scores.

keen fulcrum Apr 18, 2025, 3:30 PM

#

worthy thunder There's a slightly higher score at 32k and 128k context for 2.5 Flash (Thinking ...

One is significantly cheaper

worthy thunder Apr 18, 2025, 3:30 PM

#

keen fulcrum One is significantly cheaper

Yes, if you compare cost and only care about context recall, then yes it is better 😉

#

I found it interesting that they perform the same. That isn't true for all other model families with this bench so far.

keen fulcrum Apr 18, 2025, 3:31 PM

#

Gemini 2.5 pro is too confident in its own errors and didn’t want to undergo another investigation of the prompt

#

Had a hard time disproving the llm

worthy thunder Apr 18, 2025, 3:32 PM

#

Working on it. Somehow my org is locked out of o3. OpenAI support says we should have access and are looking into it

keen fulcrum Apr 18, 2025, 3:32 PM

#

worthy thunder Working on it. Somehow my org is locked out of o3. OpenAI support says we should...

You need to verify your account

worthy thunder Apr 18, 2025, 3:32 PM

#

US

worthy thunder Apr 18, 2025, 3:33 PM

#

keen fulcrum You need to verify your account

We get an error message when we try to verify, before the Persona link is generated

#

Yeah, we can't even get to that step on our account

earnest parcel Apr 18, 2025, 3:33 PM

#

worthy thunder Working on it. Somehow my org is locked out of o3. OpenAI support says we should...

i also don't have access to it, despite having tier 4, it's a broad issue. Just kills hype imo, but if they think it helps them excluding users, so be it. I am not exactly starving for models, lol.

torpid berry Apr 18, 2025, 3:34 PM

#

hello,
do you know when we will have o3 and o4-mini-high in leaderboard?

keen fulcrum Apr 18, 2025, 3:34 PM

#

torpid berry hello, do you know when we will have o3 and o4-mini-high in leaderboard?

Enough data

torpid berry Apr 18, 2025, 3:36 PM

#

keen fulcrum Enough data

are they better than Gemini 2.5 pro?

worthy thunder Apr 18, 2025, 3:37 PM

#

We literally get a server error when we try to verify, so we can't even upload an ID. Support thinks it's an issue on their end and is looking into it. 🤷
I'm interested in the o3 results too, so hoping they resolve ASAP.

torpid berry Apr 18, 2025, 3:37 PM

#

I tried this in all models I know and they all failed

Imagen_de_WhatsApp_2025-04-17_a_las_20.36.06_d6720114.jpg

#

expected result:

Imagen_de_WhatsApp_2025-04-18_a_las_15.21.27_58dc8c5b.jpg

#

tried with gemini 2.5 pro and new openai models, all failed

keen fulcrum Apr 18, 2025, 3:41 PM

#

torpid berry are they better than Gemini 2.5 pro?

Both are better

ember rapids Apr 18, 2025, 3:43 PM

#

keen fulcrum Apr 18, 2025, 3:44 PM

#

ember rapids

Mau?

torpid berry Apr 18, 2025, 3:52 PM

#

This chart does not include Gemini 2.5 pro

#

all Gemini models were pretty bad until Gemini 2.5 Pro, which is the best or one of the best models yet

worthy thunder Apr 18, 2025, 3:54 PM

#

keen fulcrum Mau?

Monthly Active Users is my guess

keen fulcrum Apr 18, 2025, 3:59 PM

#

This will change

#

Bard wasn’t competitive

keen fulcrum Apr 18, 2025, 4:00 PM

#

ember rapids

I don’t see any spike though

wintry tinsel Apr 18, 2025, 4:12 PM

#

calm sequoia "Claude is best overall". By overall you probably mean "vibes"? 😄 Tell me how's...

If you are not using it for math or coding, it’s better in every conceivable way

#

The O models are stem problem solvers, pretty terrible for general world knowledge, conversations, writing ability etc

#

If you are not in a technical field, or college, the O models probably aren’t useful to you

#

Now Claude is the best overall in my opinion since it feels the most grounded and consistent, Gemini would be the better frontier model, but it’s less capable of revising its mistakes, writing, and world understanding I’d say

cloud meadow Apr 18, 2025, 4:15 PM

#

wintry tinsel If you are not using it for math or coding, it’s better in every conceivable wa...

Lmfao

#

Spiritually cucked to anthropic.

cloud meadow Apr 18, 2025, 4:16 PM

#

wintry tinsel Now Claude is the best overall in my opinion since it feels the most grounded an...

it’s less capable of revising its mistakes, writing, and world understanding I’d say
I'd love to see an example of this.

#

Gemini 2.5 is great at noticing mistakes

#

As for writing, your prompt may just be bad. Deepseek does a better job at creative writing for me than Anthropic

#

Though it does entirely depend on what you are asking for

balmy mist Apr 18, 2025, 4:19 PM

#

plain zinc https://x.com/OfficialLoganK/status/1912968696265343424

why you post this?

#

this might be old:

#

https://x.com/kimmonismus/status/1913266003187826761

Chubby♨️ (@kimmonismus) on X

Gemini ultra confirmed?

keen fulcrum Apr 18, 2025, 4:22 PM

#

Indeed you are late

blazing rune Apr 18, 2025, 4:22 PM

#

cloud meadow > it’s less capable of revising its mistakes, writing, and world understanding...

I don't have an example, but I have noticed it doesn't seem to understand certain ML concepts as well as Claude 3.7 Sonnet

keen beacon Apr 18, 2025, 4:22 PM

#

https://portswigger.net/burp/documentation/desktop/burp-ai

Burp AI

Burp Suite includes AI-powered features designed to enhance your security testing workflow. They enable you to uncover vulnerabilities more efficiently, ...

#

ai vs ai CTF

#

soon?

blazing rune Apr 18, 2025, 4:23 PM

#

blazing rune I don't have an example, but I have noticed it doesn't seem to understand certai...

and it doesn't follow instructions nearly as well as sonnet

keen beacon Apr 18, 2025, 4:23 PM

#

March 31, 2025

#

wow i really am late

blazing rune Apr 18, 2025, 4:24 PM

#

Like when I ask Gemini 2.5 Pro to write code with no comments, it still uses tons of comments, or if I tell it to be concise, it doesn't listen at all. It's just verbose with no way to fix it. Claude 3.7 Sonnet doesn't have these issues in my experience.

cloud meadow Apr 18, 2025, 4:31 PM

#

blazing rune Like when I ask Gemini 2.5 Pro to write code with no comments, it still uses ton...

You might need a lower temperature

blazing rune Apr 18, 2025, 4:32 PM

#

I use temperature 0.5 all the time

#

so it's not that

#

it just doesn't listen to prompting nearly as well as Claude 3.7 Sonnet

pliant cypress Apr 18, 2025, 4:49 PM

#

Connection errored out.

tall summit Apr 18, 2025, 4:50 PM

#

keen beacon https://portswigger.net/burp/documentation/desktop/burp-ai

WHAT

#

BURPSUITE AI

calm sequoia Apr 18, 2025, 4:51 PM

#

worthy thunder @calm sequoia as requested I added ranking to the tables (based on AUC ...

This looks really good! Keep up the good work 👍 Somehow on twitter there's a table with completely different result for long context. I wonder why 🤔 Will look for it for comparison.

calm sequoia Apr 18, 2025, 4:53 PM

#

torpid berry I tried this in all models I know and they all failed

Try english and maybe text format

calm sequoia Apr 18, 2025, 4:54 PM

#

wintry tinsel Now Claude is the best overall in my opinion since it feels the most grounded an...

Interesting. Are you using it for mental support?

tall summit Apr 18, 2025, 4:58 PM

#

wintry tinsel The O models are stem problem solvers, pretty terrible for general world knowled...

It seems according to benchmarks that this is not the case

#

Though I used to think it was

#

lmao i voted gpt 4.1 nano over claude 3.5 haiku

golden ocean Apr 18, 2025, 5:01 PM

#

keen beacon https://portswigger.net/burp/documentation/desktop/burp-ai

Lmfao

#

Hacking now ACTually going to be easy for skids

balmy mist Apr 18, 2025, 5:02 PM

#

yo @tall summit you were the one talking about making a chess game right?

tall summit Apr 18, 2025, 5:02 PM

#

balmy mist yo @tall summit you were the one talking about making a chess game rig...

yeah

balmy mist Apr 18, 2025, 5:02 PM

#

tall summit yeah

https://www.youtube.com/watch?v=NXKUDrnu5o8&ab_channel=YJxAI

YouTube

YJxAI

Openai o3 vs Gemini 2.5 Pro. The Battle of Chess (Match 01)

Welcome to the AI Chess Battle OpenAI o3 and Gemini 2.5 Pro the state of the art models will be playing a chess match against each other. Let's see which model actually wins this.

tall summit Apr 18, 2025, 5:02 PM

#

what about it

#

cute

balmy mist Apr 18, 2025, 5:02 PM

#

this dude is livestreaming them going against each other

tall summit Apr 18, 2025, 5:03 PM

#

YJxAI with 1.5k subs

#

it's free content for blogs and channels, huh?

harsh flume Apr 18, 2025, 5:03 PM

#

Hey guys, I haven't been online for a couple of days. How have oAI's new models been faring in the arena?

tall summit Apr 18, 2025, 5:04 PM

#

and twitter accounts of course

harsh flume Apr 18, 2025, 5:04 PM

#

I just got 4.1 in my first prompt, is o3 full and any version of o4 contending as well rn?

tall summit Apr 18, 2025, 5:04 PM

#

harsh flume Hey guys, I haven't been online for a couple of days. How have oAI's new models ...

mostly correlating with their benchmark stats.

tall summit Apr 18, 2025, 5:04 PM

#

harsh flume I just got 4.1 in my first prompt, is o3 full and any version of o4 contending a...

yes o4-mini (not high i don't think someone correct me) and o3 full are contending

#

as well as 4.1 full mini nano

#

they're all even on direct chat

tall summit Apr 18, 2025, 5:06 PM

#

balmy mist https://www.youtube.com/watch?v=NXKUDrnu5o8&ab_channel=YJxAI

HAHA THE USER POLL OF WHO WILL WIN IS 29 TO 29

#

aww 31-29

#

very even

balmy mist Apr 18, 2025, 5:07 PM

#

yo i think we can build this better

#

but can you upload images to gemini 2.5 pro api?

tall summit Apr 18, 2025, 5:08 PM

#

this is match 1

#

ill finish my match before his

balmy mist Apr 18, 2025, 5:10 PM

#

im kinda mad he made it before us 😦

#

we dead was talking about this

#

the thing about chess is that context window does not matter as much

#

since you can always give the current game state to the model and nothing else

#

and start new convos with each model for every move

#

having memory of the previous moves doesn't matter in this case
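
A minimal sketch of that stateless setup: every move gets a brand-new prompt built from the current position only, with no earlier moves carried over. `build_move_prompt`, `next_move`, and the `ask_model` callable are hypothetical names; `ask_model` stands in for whatever API client is used and is not a real library call.

```python
from typing import Callable

# Standard chess starting position in FEN.
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"


def build_move_prompt(fen: str) -> str:
    """Build a self-contained prompt from the current position only."""
    # The second FEN field is the side to move: "w" or "b".
    side = "White" if fen.split()[1] == "w" else "Black"
    return (
        f"You are playing chess as {side}.\n"
        f"Current position (FEN): {fen}\n"
        "Reply with exactly one legal move in UCI notation, e.g. e2e4."
    )


def next_move(fen: str, ask_model: Callable[[str], str]) -> str:
    """Ask the model for a move in a fresh conversation (no prior history)."""
    # ask_model is a hypothetical stand-in: it takes a prompt, returns the reply text.
    return ask_model(build_move_prompt(fen)).strip()
```

The reply would still need a legality check before being played, since the model can return an illegal move.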

tall summit Apr 18, 2025, 5:11 PM

#

it helps according to that blog post

balmy mist Apr 18, 2025, 5:13 PM

#

i can see it helping, but an intelligent model can understand the game based on a snapshot of the current state

#

like for any chess player

#

yeah having prior knowledge will help to understand how the other player plays

thorny drum Apr 18, 2025, 5:14 PM

#

gemini formatting is still weak

#

especially with code

balmy mist Apr 18, 2025, 5:14 PM

#

maybe then logging why it made each move and having that as knowledge base can be a good addition

balmy mist Apr 18, 2025, 5:14 PM

#

thorny drum gemini formatting is still weak

you gotta add stuff like that in the SP

tall summit Apr 18, 2025, 5:15 PM

#

balmy mist i can see it helping, but an intelligent model can understand the game based on ...

gemini 2.5 couldnt interpret a FEN

#

(according to the single and informal test)
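
For reference, reading the piece placement out of a FEN is mechanical. A small sketch, assuming only the first FEN field matters here; `parse_fen_board` is a made-up helper name, not from any chess library. Ranks are listed from 8 down to 1, digits mean that many empty squares, uppercase is White and lowercase is Black.

```python
def parse_fen_board(fen: str) -> dict[str, str]:
    """Map occupied squares (e.g. "e1") to piece letters (e.g. "K") from a FEN string."""
    placement = fen.split()[0]  # first field: piece placement
    ranks = placement.split("/")
    if len(ranks) != 8:
        raise ValueError("FEN piece placement must have 8 ranks")

    board: dict[str, str] = {}
    for rank_index, rank in enumerate(ranks):
        rank_number = 8 - rank_index  # FEN starts at rank 8
        file_index = 0
        for ch in rank:
            if ch.isdigit():
                file_index += int(ch)  # run of empty squares
            else:
                if file_index > 7:
                    raise ValueError(f"rank {rank_number} has too many squares")
                square = "abcdefgh"[file_index] + str(rank_number)
                board[square] = ch  # uppercase = White, lowercase = Black
                file_index += 1
        if file_index != 8:
            raise ValueError(f"rank {rank_number} does not cover 8 files")
    return board
```

Comparing a model's own description of the position against a parse like this is one way to tell a board-reading mistake apart from a rules mistake.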

balmy mist Apr 18, 2025, 5:16 PM

#

tall summit gemini 2.5 couldnt interpret a FEN

really

#

how is that possible

#

maybe you might have to ground it with search as well

tall summit Apr 18, 2025, 5:25 PM

#

balmy mist maybe you might have to ground it with search as well

might be a bit rigged, i wouldn't let it use search for chess.

tall summit Apr 18, 2025, 5:25 PM

#

balmy mist how is that possible

just tried (in the arena) and it still can't or it just doesn't understand chess

upper wolf Apr 18, 2025, 5:26 PM

#

old gpt-4 was pretty good at chess

tall summit Apr 18, 2025, 5:26 PM

#

it thinks the black king can move to b6, a8, or b8

Screenshot_2025-04-18-20-25-30-235_org.mozilla.firefox-edit.jpg

upper wolf Apr 18, 2025, 5:26 PM

#

i gave it an fen once and it even guessed the opening

tall summit Apr 18, 2025, 5:27 PM

#

and that the white king has no legal moves

balmy mist Apr 18, 2025, 5:27 PM

#

tall summit might be a bit rigged, i wouldn't let it use search for chess.

if they both have access to search wouldnt it be fair?

tall summit Apr 18, 2025, 5:27 PM

#

balmy mist if they both have access to search wouldnt it be fair?

the point is how good the llm is

#

not how good it can search about chess

balmy mist Apr 18, 2025, 5:28 PM

#

tall summit the point is how good the llm is

but that will show how well it can use tools to achieve a goal,

tall summit Apr 18, 2025, 5:28 PM

#

balmy mist but that will show how well it can use tools to achieve a goal,

sure if you want to measure that

balmy mist Apr 18, 2025, 5:28 PM

#

if they do not know how to play without help with search then give them the access to search until they can play chess properly without help

tall summit Apr 18, 2025, 5:28 PM

#

tall summit and that the white king has no legal moves

which are both wrong btw

balmy mist Apr 18, 2025, 5:28 PM

#

cause currently you are saying they cant even read FEN properly

tall summit Apr 18, 2025, 5:28 PM

#

i wonder what would happen if i gave the pgn

upper wolf Apr 18, 2025, 5:29 PM

#

gpt-4T was cracked at chess. It was extremely knowledgeable about very specific opening moves. also, it was able to play several full games without any illegal moves, which most current llms cannot do

balmy mist Apr 18, 2025, 5:29 PM

#

upper wolf gpt-4T was cracked at chess. It was extremely knowledgeable about very specific ...

wow i never tried it

#

so you think the models got worse?

upper wolf Apr 18, 2025, 5:30 PM

#

the new models r a lot better at 99% of tasks i wanna say

#

idk why they fall off in chess ability

#

something about having more parameters would be my guess

balmy mist Apr 18, 2025, 5:33 PM

#

what happened to alpha go?

tall summit Apr 18, 2025, 5:33 PM

#

tall summit i wonder what would happen if i gave the pgn

ok i gave it a pgn and it knew

balmy mist Apr 18, 2025, 5:33 PM

#

how are llms so much worse than alpha go

tall summit Apr 18, 2025, 5:33 PM

#

for some reason

upper wolf Apr 18, 2025, 5:33 PM

#

Lmao i havent heard that name in ages

tall summit Apr 18, 2025, 5:34 PM

#

wait a second

wintry tinsel Apr 18, 2025, 5:34 PM

#

cloud meadow As for writing, your prompt may just be bad. Deepseek does a better job at creat...

You don’t understand writing very well if you think Deep Seek is good, it's just overly emphatic and random

tall summit Apr 18, 2025, 5:34 PM

#

tall summit it thinks the black king can move to b6, a8, or b8

@balmy mist gemini 2.5 pro thinks the white king can move to... a2

#

even after it wrote:
White: King on a1, Queen on a5, Rook on c1, Bishop on c4, Pawns on a2, c3, d4, e5, f4, g3.

wintry tinsel Apr 18, 2025, 5:35 PM

#

calm sequoia Interesting. Are you using it for mental support?

Wtf

tall summit Apr 18, 2025, 5:35 PM

#

tall summit even after it wrote: White: King on a1, Queen on a5, Rook on c1, Bishop on c4, P...

though this is wrong the pawn on d4 is black and there is no pawn on c3

#

so it's even worse

wintry tinsel Apr 18, 2025, 5:36 PM

#

tall summit It seems according to benchmarks that this is not the case

Benchmarks are just general use estimations, doesn’t mean they are accurate

balmy mist Apr 18, 2025, 5:36 PM

#

i dont understand why they are so bad

cloud meadow Apr 18, 2025, 5:36 PM

#

wintry tinsel You don’t understand writing very well if you think Deep Seek is good, it's just o...

Werks for me.

wintry tinsel Apr 18, 2025, 5:36 PM

#

If I need a problem solved, I'll use Gemini 2.5 pro, for everything else, Claude

narrow elbow Apr 18, 2025, 5:37 PM

#

upper wolf gpt-4T was cracked at chess. It was extremely knowledgeable about very specific ...

GPT-4 was really strong when it first came out. At that time, Anthropic might not have split from OpenAI yet

wintry tinsel Apr 18, 2025, 5:37 PM

#

narrow elbow GPT-4 was really strong when it first came out. At that time, Anthropic might no...

GPT more like PP

#

🥶

balmy mist Apr 18, 2025, 5:38 PM

#

like alpha go is SOTA in specialization, why dont we specialize our current SOTA to different tasks on the level that alpha go was for go and League?

upper wolf Apr 18, 2025, 5:38 PM

#

that’s not how it works

#

playing a game isn’t like generating text

balmy mist Apr 18, 2025, 5:39 PM

#

but the thinking that goes into play chess should be similar to the thinking that goes into figuring out stuff

#

like reasoning

upper wolf Apr 18, 2025, 5:39 PM

#

yes, they use similar logic

wintry tinsel Apr 18, 2025, 5:39 PM

#

Nothing exciting releasing

upper wolf Apr 18, 2025, 5:39 PM

#

but chess is much more 1-dimensional

wintry tinsel Apr 18, 2025, 5:39 PM

#

Except maybe R2

#

It’s slow going for AI, what might be called Snail Season

narrow elbow Apr 18, 2025, 5:40 PM

#

haha

torn mantle Apr 18, 2025, 6:01 PM

#

balmy mist like alpha go is SOTA in specialization, why dont we specialize our current SOTA ...

because alpha go had an easy reward function

#

which is win a game

balmy mist Apr 18, 2025, 6:21 PM

#

and the reward function is set during the training phase right? can we have an additional one during finetuning?

novel flame Apr 18, 2025, 6:24 PM

#

oblique flint all in all it seems like the reasoning version is massively more expensive. Not ...

This makes perfect sense if the reasoning model uses entropy-based logit search as linked to earlier today. I don't know if they do, but I was expecting to see exactly this kind of pricing anomaly for models that do.

novel flame Apr 18, 2025, 6:50 PM

#

balmy mist like alpha go is SOTA in specialization, why dont we specialize our current SOTA ...

AlphaGo, AlphaZero, AlphaCode ... are all vastly different architectures. A Transformer LLM cannot be trained the way Alpha* is, and if it could be, it doesn't operate in a way that would allow it to use that training anyway. VASTLY different.

Take AlphaCode for instance. It achieved/achieves SOTA on particular types of competitive coding by doing a ludicrous number of attempts (>100k, in some instances millions IIRC) and uses search to find the best/correct/optimal solution. It does work, but it takes a lot of compute and it is definitely not 'reasoning' in a humanlike way.

AlphaZero evaluates 10000s of moves per position, which is actually far less than it would need to do a brute-force search; part of the genius of AlphaZero is its ability to prune the search space down from millions of moves to just tens of thousands. But it's still search.

Current-gen reasoning LLMs don't do policy search, not in that way and not at that scale. The main thing they can do is spew more tokens in thinking mode in a controlled manner, effectively stuffing their own context with a small number of possible plans/approaches/solutions, refining this iteratively until their thinking budget is exhausted.

As with most things in LLMs, Thinking Mode is just a primitive, degenerate form of ~~bending~~ policy search.

Thinking Mode gives the LLM a few dozen shots at getting to a better solution, but it's neither systematic nor exhaustive, nor does it have a world model or persistent state to simulate or probe internally. As such, there is no way to adapt a (current gen) LLM to do what AlphaZero does.
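
To make "it's still search" concrete, here is a toy sketch: exhaustive negamax on a tiny take-away game (one pile of stones, take 1 to 3 per turn, whoever takes the last stone wins), where the only reward is win or lose. The game, the function names, and the choice of plain negamax are all assumptions for illustration. AlphaZero itself uses a learned network to guide Monte Carlo tree search, which this does not try to reproduce.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # toy rule: take 1, 2 or 3 stones per turn


@lru_cache(maxsize=None)
def value(stones: int) -> int:
    """Return +1 if the player to move can force a win, else -1."""
    if stones == 0:
        # The opponent just took the last stone, so the player to move has lost.
        return -1
    # Negamax: my value is the best of the negated values of the opponent's positions.
    return max(-value(stones - take) for take in MOVES if take <= stones)


def best_move(stones: int) -> int:
    """Systematically evaluate every legal move and pick the one with the best outcome."""
    legal = [take for take in MOVES if take <= stones]
    return max(legal, key=lambda take: -value(stones - take))
```

Every position reachable from the start gets evaluated against a simple win/lose signal. That systematic, exhaustive quality is what a handful of sampled thinking passes does not have.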

tall summit Apr 18, 2025, 6:52 PM

#

wintry tinsel Benchmarks are just general use estimations, doesn’t mean they are accurate

they are accurate for the things that they test, for example the benchmarks that test writing ability are relatively accurate for testing writing ability
though it varies a lot benchmark to benchmark

#

some benchmarks are graded by llms and that's a whole new can of worms to open

torn mantle Apr 18, 2025, 6:59 PM

#

is this like the official lmarena server?

#

webdev bugs is kinda pissing me off ngl

tall summit Apr 18, 2025, 6:59 PM

#

yes..?

torn mantle Apr 18, 2025, 6:59 PM

#

it also wastes so many tokens for them

tall summit Apr 18, 2025, 6:59 PM

#

#arena-feedback

torn mantle Apr 18, 2025, 6:59 PM

#

rendering issues

tall summit Apr 18, 2025, 7:00 PM

#

barely anyone uses any of the feature text channels lol

torn mantle Apr 18, 2025, 7:00 PM

#

people already issued that

torn mantle Apr 18, 2025, 7:00 PM

#

tall summit barely anyone uses any of the feature text channels lol

havent seen any mods tbh

oblique flint Apr 18, 2025, 7:01 PM

#

novel flame This makes perfect sense if the reasoning model uses entropy-based logit search ...

which paper are you referring to? The swe search one?

tall summit Apr 18, 2025, 7:01 PM

#

.

tall summit Apr 18, 2025, 7:02 PM

#

torn mantle havent seen any mods tbh

they don't care to monitor this server

raven void Apr 18, 2025, 7:03 PM

#

tall summit they don't care to monitor this server

Post something sus to get them to monitor

tall summit Apr 18, 2025, 7:03 PM

#

raven void Post something sus to get them to monitor

you can do that

novel flame Apr 18, 2025, 7:09 PM

#

tall summit .

Actually I just updated the link to the paper, the one I wanted to link to was from DeepMind and it's a much better read.

torn mantle Apr 18, 2025, 7:17 PM

#

whats the rate limit of o3 in chatgpt pro?

earnest parcel Apr 18, 2025, 7:18 PM

#

torn mantle whats the rate limit of o3 in chatgpt pro?

~~50/week~~ that's for plus, my bad

blazing coyote Apr 18, 2025, 7:18 PM

#

pro - unlimited , plus - 50 a week

#

https://help.openai.com/en/articles/9824962-openai-o3-o4-mini-and-o3-mini-usage-limits-on-chatgpt-and-the-api

earnest parcel Apr 18, 2025, 7:19 PM

#

blazing coyote pro - unlimited , plus - 50 a week

"unlimited" within reason, aka limited but high limit

balmy mist Apr 18, 2025, 7:20 PM

#

https://www.youtube.com/watch?v=hZeprLzd6xM

YouTube

Theo - t3․gg

Don’t sleep on Chef (I can’t believe it works this well)

There are so many AI app builders these days, but none of them can really do backend stuff at all. At least until Convex dropped theirs...

SOURCE
https://chef.convex.dev/

#

anyone tried https://chef.convex.dev/

Chef by Convex | Generate realtime full‑stack apps

Cook up something hot with Chef, the full-stack AI coding agent from Convex

#

but you cant use gemini models 😦

#

and only claude 3.5

#

interesting why they only used these three

#

4.1 is a good coding model and grok 3 mini as well as 3.5, but why no 2.5

#

i heard people say non reasoning models are better at coding so maybe why the SOTA is not here, but why not 3.7 or v3?

#

@hollow ivy what was the system prompt for nw again?

bright kayak Apr 18, 2025, 7:32 PM

#

when does the team decide when to update the leaderboard?

tall summit Apr 18, 2025, 7:35 PM

#

when there are enough votes!!!

#

and isn't it automatic?

bright kayak Apr 18, 2025, 7:35 PM

#

i don't know, that's why im asking

novel flame Apr 18, 2025, 7:42 PM

#

balmy mist anyone tried https://chef.convex.dev/

Hadn't yet, but web app codegen is my jam (and my side project) so I have to try it now. Seems like they do proper backend too, which is rare (app.base44.com, Probz.ai are a few notable exceptions).

balmy mist Apr 18, 2025, 7:48 PM

#

yeah this theo guy seems to really like it, im about to use it to create the arena for ai to go against each other now

#

hopefully it can do it

novel flame Apr 18, 2025, 7:51 PM

#

OK Chef is pretty cool. It generates a working app with DB and backend, and all the features it built actually worked.... though it only built half of the prompt, and it overall looked pretty crappy. Still, I'd say it's one of the best I've seen in terms of actually working

primal orbit Apr 18, 2025, 7:53 PM

#

guys, any new strong anonymous models in normal arena? been away for a week

novel flame Apr 18, 2025, 7:54 PM

#

primal orbit guys, any new strong anonymous models in normal arena? been away for a week

A WEEK? OMG you're in for a treat. nightwhisper is back with a new checkpoint as dayhush

primal orbit Apr 18, 2025, 7:54 PM

#

in normal arena or webdev?

novel flame Apr 18, 2025, 7:54 PM

#

webdev I believe

primal orbit Apr 18, 2025, 7:54 PM

#

well, I don't code, so..

#

like to just play with them

novel flame Apr 18, 2025, 7:55 PM

#

Other Google models: claybrook (okay), dragontail (very strong, likely a Pro update or something)

primal orbit Apr 18, 2025, 7:57 PM

#

dragontail I have tried, claybrook I'm gonna try to catch, thx

torn mantle Apr 18, 2025, 8:01 PM

#

twitch clone

#

which one is better

#

sonnet 3.7
gpt4.1
dayhush
claybrook

#

not in order

pliant cypress Apr 18, 2025, 8:03 PM

#

bottom left imo

torn mantle Apr 18, 2025, 8:04 PM

#

yea its the most accurate one

#

https://3000-infg8ze4q31xxe3p3f9ol-814968e9.e2b-foxtrot.dev

sweet tinsel Apr 18, 2025, 8:05 PM

#

Would say bottom left too, which one was it? I would guess dayhush.

torn mantle Apr 18, 2025, 8:05 PM

#

pretty neat

torn mantle Apr 18, 2025, 8:05 PM

#

sweet tinsel Would say bottom left too, which one was it? I would guess dayhush.

yea its dayhush

keen beacon Apr 18, 2025, 8:06 PM

#

dayhush is good

sonic tendon Apr 18, 2025, 8:06 PM

#

dayhush

keen beacon Apr 18, 2025, 8:06 PM

#

I do think we're getting Gemini cider

sonic tendon Apr 18, 2025, 8:06 PM

#

new google model?

keen beacon Apr 18, 2025, 8:06 PM

#

coder*

#

not cider

#

lmfao

keen beacon Apr 18, 2025, 8:06 PM

#

sonic tendon new google model?

yes

#

nightwhisper, dayhush

#

:3

sonic tendon Apr 18, 2025, 8:06 PM

#

gemini cider sounds fun

zinc ore Apr 18, 2025, 8:06 PM

#

A Google employee has already confirmed coder coming

keen beacon Apr 18, 2025, 8:06 PM

#

lmao

sonic tendon Apr 18, 2025, 8:06 PM

#

:3

keen beacon Apr 18, 2025, 8:07 PM

#

zinc ore A Google employee has already confirmed coder coming

I'm still happy we're actually possibly getting ultra

sonic tendon Apr 18, 2025, 8:07 PM

#

google is cooking

balmy mist Apr 18, 2025, 8:07 PM

#

yeah they need to release it tonight as a treat for us

sonic tendon Apr 18, 2025, 8:07 PM

#

keen beacon I'm still happy we're actually possibly getting ultra

wait, fr?

#

when did that happen

keen beacon Apr 18, 2025, 8:07 PM

#

first time they've even talked about ultra for over a year

keen beacon Apr 18, 2025, 8:07 PM

#

sonic tendon wait, fr?

have you not seen

sonic tendon Apr 18, 2025, 8:07 PM

#

keen beacon have you not seen

nope

balmy mist Apr 18, 2025, 8:07 PM

#

sonic tendon wait, fr?

check relevant ai news channel

sonic tendon Apr 18, 2025, 8:07 PM

#

wh

#

ere

balmy mist Apr 18, 2025, 8:07 PM

#

keen beacon Apr 18, 2025, 8:08 PM

#

tulsee is director and product lead for Gemini models

sonic tendon Apr 18, 2025, 8:09 PM

#

probably good for writing, then

keen beacon Apr 18, 2025, 8:09 PM

#

it better be

#

I want my baby back

sonic tendon Apr 18, 2025, 8:09 PM

#

at least, going by leo's impression of

sonic tendon Apr 18, 2025, 8:09 PM

#

keen beacon I want my baby back

LMAOOO

#

real

keen beacon Apr 18, 2025, 8:10 PM

#

if ultra is as good as pro but with expected scaling improvements then sign me up

sonic tendon Apr 18, 2025, 8:10 PM

#

i am equally curious about pricing

balmy mist Apr 18, 2025, 8:10 PM

#

keen beacon tulsee is director and product lead for Gemini models

im shocked he replied to that lol

sonic tendon Apr 18, 2025, 8:11 PM

#

given that the free rate limits for pro are 25req/day, gemini ultra is probably gonna be like 2

#

well, more realistically, 0

#

personally, i kinda like that we stopped trying to scale for a little bit and focused on releasing models of similar sizes but with more refined techniques

#

at least, from my impression (of frontier models)

#

4.5 got dialed back, opus 3.5/3.7 (if it exists at all) hasn't been publicly released

sweet tinsel Apr 18, 2025, 8:14 PM

#

The really large models are really good for me, so I'd love 2.5 Ultra as I'm already a big fan of 4.5 and I'm hitting limits for 4.5 weekly. It's just really good at writing.

keen beacon Apr 18, 2025, 8:14 PM

#

i do think huge models are impractical now, but that doesn't mean they have 0 value

balmy mist Apr 18, 2025, 8:15 PM

#

keen beacon i do think huge models are impractical now, but that doesn't mean they have 0 va...

what do you think their value is now?

keen beacon Apr 18, 2025, 8:15 PM

#

balmy mist what do you think their value is now?

mainly in creative tasks and emergent capabilities

balmy mist Apr 18, 2025, 8:15 PM

#

interesting

#

how big do yall think o3 is?

keen beacon Apr 18, 2025, 8:16 PM

#

it's 4.1

balmy mist Apr 18, 2025, 8:16 PM

#

you think we can have a big reasoning model?

sonic tendon Apr 18, 2025, 8:16 PM

#

i worry that, in increasing the max available model size, we'll end up inadvertently heavily increasing the total environmental impact/needed compute because people will use the biggest model possible for tasks that don't need it

keen beacon Apr 18, 2025, 8:16 PM

#

wait let me check what a buddy at oai said

#

i am told around 200B

balmy mist Apr 18, 2025, 8:18 PM

#

ahh

#

do you think we can have a reasoner at 1 trill?

#

or more?

keen beacon Apr 18, 2025, 8:18 PM

#

absolutely

#

but again

#

it would be impractical

#

and probably prohibitively expensive

balmy mist Apr 18, 2025, 8:19 PM

#

what is the biggest model rn? 4.5?

sonic tendon Apr 18, 2025, 8:19 PM

#

sonic tendon i worry that, in increasing the max available model size, we'll end up inadverte...

maybe that's a fallacy of some kind lol

keen beacon Apr 18, 2025, 8:19 PM

#

yes

#

4.5 is over 4T params lol

balmy mist Apr 18, 2025, 8:19 PM

#

lol

keen beacon Apr 18, 2025, 8:19 PM

#

beats opus by quite a margin

#

probably?

sonic tendon Apr 18, 2025, 8:19 PM

#

keen beacon beats opus by quite a margin

how so?

balmy mist Apr 18, 2025, 8:19 PM

#

and gpt5 will be using that ?

sonic tendon Apr 18, 2025, 8:19 PM

#

writing?

balmy mist Apr 18, 2025, 8:19 PM

#

as the base?

keen beacon Apr 18, 2025, 8:19 PM

#

no no

#

total params

#

opus had just over 1T

sonic tendon Apr 18, 2025, 8:19 PM

#

ohhhh

#

two convos at once 😭

#

sorry guys

balmy mist Apr 18, 2025, 8:20 PM

#

but wouldnt gpt5 be a large reasoning model?

sonic tendon Apr 18, 2025, 8:20 PM

#

keen beacon opus had just over 1T

isn't gpt-4 allegedly like 1.2T

keen beacon Apr 18, 2025, 8:20 PM

#

gpt-4 was just over 1T yed

#

yes

#

it's crazy how we've absolutely dunked on 4 performance wise with so much fewer active parameters

sonic tendon Apr 18, 2025, 8:20 PM

#

yeah

#

remember when we thought gpt-4 was the ceiling? lol

keen beacon Apr 18, 2025, 8:21 PM

#

🤣

sonic tendon Apr 18, 2025, 8:21 PM

#

gemma 3 27b

#

literally can run that on my laptop (barely, at less than a token per second)

#

2023 (?) hn commenters were right

#

not sure how longer context would improve performance that much

#

super long context doesn't seem to be good for many tasks atm, surprisingly

keen beacon Apr 18, 2025, 8:23 PM

#

oh yeah im told work is well underway on claude 4.0

sonic tendon Apr 18, 2025, 8:23 PM

#

least of all coding

sonic tendon Apr 18, 2025, 8:23 PM

#

keen beacon oh yeah im told work is well underway on claude 4.0

wonder if it'll top lmarena? probably not unless

#

might've pondered this in this chat before

novel flame Apr 18, 2025, 8:23 PM

#

Reasoning in latent space without outputting tokens. Much more efficient, much less information lost to the token representation, should be much more humanlike

ember rapids Apr 18, 2025, 8:23 PM

#

Wait I just realized google i/o is next month

#

Gemini ultra late may

sonic tendon Apr 18, 2025, 8:23 PM

#

novel flame Reasoning in latent space without outputting tokens. Much more efficient, much l...

yeah didn't meta (for all their faults) publish a paper on that recently

calm sequoia Apr 18, 2025, 8:23 PM

#

The 2.5 Pro still blows my mind every day and it's not even SOTA anymore 👀

sonic tendon Apr 18, 2025, 8:24 PM

#

novel flame Reasoning in latent space without outputting tokens. Much more efficient, much l...

not great for interpretability purposes, though (probably)

keen beacon Apr 18, 2025, 8:24 PM

#

yeah I remember seeing "Gemini 2.5 Pro" in a twitter notification, tapping it, looking at the benchmark image and my jaw dropped

#

it was the last thing I expected from deepmind

balmy mist Apr 18, 2025, 8:24 PM

#

novel flame Reasoning in latent space without outputting tokens. Much more efficient, much l...

what about the thing meta was saying about reasoning without using tokens?

calm sequoia Apr 18, 2025, 8:25 PM

#

They simply looked at what Meta is doing and did exact opposite

keen beacon Apr 18, 2025, 8:25 PM

#

yann lecooked

novel flame Apr 18, 2025, 8:25 PM

#

sonic tendon yeah didn't meta (for all their faults) publish a paper on that recently

It's similar to Yann LeCun's theory of mind, so FAIR is researching this kind of thing. You can look up JEPA

calm sequoia Apr 18, 2025, 8:25 PM

#

keen beacon yeah I remember seeing "Gemini 2.5 Pro" in a twitter notification, tapping it, l...

Still having dopamine cravings since that time

raven void Apr 18, 2025, 8:26 PM

#

Is grok 3 mini on lmarena?

keen beacon Apr 18, 2025, 8:27 PM

#

calm sequoia The 2.5 Pro still blows my mind every day and it's not even SOTA anymore 👀

wheres o3

sonic tendon Apr 18, 2025, 8:27 PM

#

reasoning in a non-language-based abstract space was something i felt would be the endgame of LLMs since GPT-4, but i never knew enough to really formalize my ideas let alone work towards creating them

keen beacon Apr 18, 2025, 8:27 PM

#

arena updated Apr 18, 2025
and no o3 yet :/

sonic tendon Apr 18, 2025, 8:27 PM

#

it's cool to see that people are pursuing that now

calm sequoia Apr 18, 2025, 8:28 PM

#

keen beacon arena updated Apr 18, 2025 and no o3 yet :/

Go vote

keen fulcrum Apr 18, 2025, 8:28 PM

#

sonic tendon well, more realistically, 0

How about 0 for a time as with 2.5

novel flame Apr 18, 2025, 8:28 PM

#

sonic tendon reasoning in a non-language-based abstract space was something i felt would be t...

I mean it's totally doable -- build your LLM around a sparse autoencoder and you've got just that -- but getting it to scale and perform as well as a Transformer LLM is another thing

sonic tendon Apr 18, 2025, 8:29 PM

#

novel flame I mean it's totally doable -- build your LLM around a sparse autoencoder and you...

hm

#

not exactly an ML person at the moment, but wouldn't you want to build some sort of transformer that had a way of tokenizing "thoughts"

#

if that makes any sense

#

i uh

#

j

novel flame Apr 18, 2025, 8:31 PM

#

calm sequoia Apr 18, 2025, 8:32 PM

#

well-this-aged-like-wine-another-w-for-karpathy-v0-7rsni9uv8a4e1.jpg

sonic tendon Apr 18, 2025, 8:32 PM

#

novel flame

@keen beacon

sonic tendon Apr 18, 2025, 8:32 PM

#

calm sequoia

language mixing? or just gibberish

#

i was under the impression that we specifically optimized against that happening during RL for interpretability reasons

novel flame Apr 18, 2025, 8:33 PM

#

sonic tendon not exactly an ML person at the moment, but wouldn't you want to build some sort...

Sure, a small transformer on either side of the AE would do the trick. It's not the only way, but it would be the most obvious one today

calm sequoia Apr 18, 2025, 8:33 PM

#

The more gibberish the better

#

It would be smart to train a small 3B model to translate gibberish to language if seeing thinking is so interesting to people

sonic tendon Apr 18, 2025, 8:34 PM

#

maybe if you do RL enough, it'll just figure out a language of pure thought on its own that looks like unreadable random data to someone reading the CoT

calm sequoia Apr 18, 2025, 8:34 PM

#

Yes, because it needs to learn compression. Unless English is the most efficient language that can exist. Which isn't the case.

keen beacon Apr 18, 2025, 8:35 PM

#

novel flame

LOL

#

oh yeah

#

I can't wait for Imagen 4 :3

sonic tendon Apr 18, 2025, 8:37 PM

#

sonic tendon maybe if you do RL enough, it'll just figure out a language of pure thought on i...

i'd put like 15 parentheticals here but i think this idea is mostly vibe-based anyway

tall summit Apr 18, 2025, 8:38 PM

#

calm sequoia Yes, because it needs to learn compression. Unless English is the most efficient...

ooh ooh ithkuil

novel flame Apr 18, 2025, 8:38 PM

#

sonic tendon maybe if you do RL enough, it'll just figure out a language of pure thought on i...

Yes, or if you just don't use language tokens for reasoning in the first place. You can create your embeddings from language to latent vectors, then operate on the vectors all the way to the answer, and then translate back to language at the end/output. Apple Intelligence allegedly did this using i-JEPA internally, then using something small like Phi-3 trained to describe the result in natural language.

tall summit Apr 18, 2025, 8:38 PM

#

time for ithkuil to shine

sonic tendon Apr 18, 2025, 8:38 PM

#

novel flame Yes, or if you just don't use language tokens for reasoning in the first place. ...

interesting

#

i ought to read about this stuff more

brittle tiger Apr 18, 2025, 8:40 PM

#

I wonder if polymarket becomes a problem for lmsys arena. Close to $3M has been bet riding on o3 or o4-mini surpassing 2.5 pro or not by end of April. Wouldn't be surprised if that's why we see new accounts asking about when it will be added

sonic tendon Apr 18, 2025, 8:42 PM

#

brittle tiger I wonder if polymarket becomes a problem for lmsys arena. Close to $3M has been ...

yeah, i worry about the dynamics that might arise when the betting market founded on the results of this benchmark handles vastly more money than the actual organization responsible for it

tall summit Apr 18, 2025, 8:42 PM

#

wait that polymarket question uses lmarena as its metric?

sonic tendon Apr 18, 2025, 8:43 PM

#

yep

#

style control off

tall summit Apr 18, 2025, 8:43 PM

#

never knew

thorny drum Apr 18, 2025, 8:43 PM

#

It’s closer to like 100k or so

novel flame Apr 18, 2025, 8:43 PM

#

sonic tendon yeah, i worry about the dynamics that might arise when the betting market founde...

So basically Goodhart's Law, but with money?

sonic tendon Apr 18, 2025, 8:44 PM

#

thorny drum It’s closer to like 100k or so

?

thorny drum Apr 18, 2025, 8:44 PM

#

The total volume traded is a poor indicator for money riding on this

sonic tendon Apr 18, 2025, 8:44 PM

#

thorny drum The total volume traded is a poor indicator for money riding on this

oh, i was under the impression that it was the volume currently in the market

thorny drum Apr 18, 2025, 8:44 PM

#

You should look at open interest

#

Say I buy a share from you then sell it back

#

We technically traded 2 shares but there’s no money riding on the outcome
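A minimal sketch of the volume vs. open interest point, using a toy ledger. The function name and the trade tuples are hypothetical, and this is not Polymarket's API or its exact share model; it only shows why a round trip adds to traded volume while leaving nothing at stake.

```python
from collections import defaultdict

def volume_and_open_interest(trades):
    """trades: list of (buyer, seller, shares). Returns (volume, open_interest)."""
    positions = defaultdict(int)  # net shares per trader (negative = net seller)
    volume = 0
    for buyer, seller, shares in trades:
        volume += shares              # every trade adds to volume
        positions[buyer] += shares
        positions[seller] -= shares
    # Open interest here: total of outstanding net long positions
    open_interest = sum(p for p in positions.values() if p > 0)
    return volume, open_interest

# "I buy a share from you then sell it back"
round_trip = [("me", "you", 1), ("you", "me", 1)]
print(volume_and_open_interest(round_trip))  # (2, 0)
```

Two shares traded, zero exposure left, which is why total volume overstates the money actually riding on the outcome.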

sonic tendon Apr 18, 2025, 8:45 PM

#

the total money in the market is the same as the total number of (yes+no)/2 shares in circulation, no?

thorny drum Apr 18, 2025, 8:46 PM

#

More like the # of yes shares due to an intricacy with how Polymarket works

#

NO shares is a poor indicator

sonic tendon Apr 18, 2025, 8:47 PM

#

wdym? i was under the impression that yes or no shares couldn't exist independently w/o a corresponding opposite share

thorny drum Apr 18, 2025, 8:47 PM

#

Yeah but say I buy NO on every company

#

Then I don’t have any exposure to the outcome

#

Polymarket lets you get your money back by merging those shares into USDC

#

and the no shares effectively disappear

sonic tendon Apr 18, 2025, 8:48 PM

#

ohhhh, merging can happen across different options in a market?

#

my mistake

#

i thought it was just yes+no of one specific option, nothing else

#

apologies

thorny drum Apr 18, 2025, 8:49 PM

#

It’s a very understandable mistake

#

Not very intuitive

sonic tendon Apr 18, 2025, 8:49 PM

#

well, that's somewhat relieving, then

#

yeah

brittle tiger Apr 18, 2025, 8:50 PM

#

Even if the volume is only six figures, the money riding on it is clearly an incentive to try to game the system. I doubt people trying that would be advanced enough to get past lmsys's systems, but they should be aware.

sonic tendon Apr 18, 2025, 8:50 PM

#

yeah

thorny drum Apr 18, 2025, 8:50 PM

#

I think they are and I don’t see any indication that people are rigging it

sonic tendon Apr 18, 2025, 8:50 PM

#

or buy both a yes and no share at the same time for $1

thorny drum Apr 18, 2025, 8:50 PM

#

Hopefully stays that way

#

There was that one Reddit post but I’m pretty sure it was fake

sonic tendon Apr 18, 2025, 8:51 PM

#

@deep adder we're operating under the assumptions you're asserting - i think this is a misunderstanding

#

yes

#

@brittle tiger correct me if i'm wrong, but

thorny drum Apr 18, 2025, 8:53 PM

#

Technically you can’t sell shares on Polymarket

#

But I don’t think there is any misunderstanding here

#

It’s just semantics

#

Not intuitive or meaningful

keen beacon Apr 18, 2025, 8:54 PM

#

brittle tiger Even if volume is only six figures, money riding on it is clearly incentive to t...

i think there was some market manipulation happening after openai stream, dumping to lower the prices and buying cheap

thorny drum Apr 18, 2025, 8:55 PM

#

Sung Jin I think you're clueless

#

No offense

calm sequoia Apr 18, 2025, 8:55 PM

#

thorny drum Apr 18, 2025, 8:55 PM

#

Some people made bad trades and you could’ve traded against them and made money

#

That’s all that happened

raven void Apr 18, 2025, 8:55 PM

#

Skibidi

sonic tendon Apr 18, 2025, 8:55 PM

#

yeah

#

i misunderstood what they said

thorny drum Apr 18, 2025, 8:56 PM

#

It’s not ‘manipulation’

brittle tiger Apr 18, 2025, 8:56 PM

#

keen beacon i think there was some market manipulation happening after openai stream, dumpin...

I think the issue is around manipulation of arena, not polymarket where that's going to be an issue in every market

keen beacon Apr 18, 2025, 8:57 PM

#

thorny drum Some people made bad trades and you could’ve traded against them and made money

some people may have placed fictitious orders that they never meant to execute, which can drive prices up or down. that's market manipulation, and people have gone to jail for it, though only in rare cases when they were too good at it xD

sonic tendon Apr 18, 2025, 8:57 PM

#

thorny drum Technically you can’t sell shares on Polymarket

wdym?

thorny drum Apr 18, 2025, 8:57 PM

#

What do you mean fictitious

sonic tendon Apr 18, 2025, 8:57 PM

#

i imagine it still depends on the country you're in

thorny drum Apr 18, 2025, 8:58 PM

#

I traded a few thousand shares against them and made money

keen fulcrum Apr 18, 2025, 8:58 PM

#

Hi, I do believe we need to add geographical understanding to lmarena

#

#

Where AIs play geoguessr

thorny drum Apr 18, 2025, 8:58 PM

#

They just made a bad trade

calm sequoia Apr 18, 2025, 8:58 PM

#

keen fulcrum

Mountains are easy

thorny drum Apr 18, 2025, 8:59 PM

#

sonic tendon wdym?

When you ‘sell’ yes you're technically buying NO and then merging the shares into USDC. The mechanism isn’t too important
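A rough arithmetic sketch of that mechanism, assuming a YES+NO pair always merges back into exactly $1.00 and ignoring fees. The function name is hypothetical and prices are in integer cents; this is not taken from Polymarket's contracts.

```python
def sell_yes_via_merge(yes_price_cents):
    """Net proceeds (in cents) of 'selling' one YES share by buying NO and merging."""
    no_cost = 100 - yes_price_cents  # price of the complementary NO share
    merge_payout = 100               # assumed: one YES + one NO redeems for $1.00
    return merge_payout - no_cost    # what the holder ends up with

print(sell_yes_via_merge(31))  # 31
```

Under these assumptions the holder nets the YES price, so it looks the same as an ordinary sale.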

keen beacon Apr 18, 2025, 8:59 PM

#

brittle tiger I think the issue is around manipulation of arena, not polymarket where that's g...

but id imagine that if someone would go through the effort to rig arena score, theyd do it for o3 , since profit would be bigger

sonic tendon Apr 18, 2025, 8:59 PM

#

thorny drum When you ‘sell’ yes your technically buying NO and then merging the shares into ...

ohh

#

my mistake

thorny drum Apr 18, 2025, 9:00 PM

#

There was a Reddit post from someone who claimed to rig the arena but it seemed fake

sonic tendon Apr 18, 2025, 9:00 PM

#

sorry for drumming up an argument lol

#

i thought craig was joking when he said he traded a ton on prediction markets

keen beacon Apr 18, 2025, 9:02 PM

#

question, what happens if 50 shares are being sold at 31 cent, but i place buy order for 50 shares at 60 cents (stupid ofc) but what would happen

#

would i be buying from those selling at 60 cents or 31cent ?

sonic tendon Apr 18, 2025, 9:02 PM

#

i think you'd just sweep through the order book

#

sounds p fun

keen beacon Apr 18, 2025, 9:04 PM

#

rare to see people sit and only talk about $

#

very logical and rare, i wonder why that is

sonic tendon Apr 18, 2025, 9:19 PM

#

i should eep

keen fulcrum Apr 18, 2025, 9:23 PM

#

#

Grok3 mini so great

calm sequoia Apr 18, 2025, 9:27 PM

#

Grok is known to be good at benchmarks but not in practice. I'm not claiming it's underperforming, but there hasn't been a single instance where Grok 3 is best at my use cases. Maybe the new thinking model will be different, but there are too many options for coding at this level anyway.

keen beacon Apr 18, 2025, 9:29 PM

#

atm o3>2.5>claude 3.7> the rest

keen fulcrum Apr 18, 2025, 9:31 PM

#

You did forget o4-mini
Performs on-par with o3

#

The problem being people believe o3 and o4-mini are the same
There is a reasonable difference

calm sequoia Apr 18, 2025, 9:33 PM

#

@keen beacon @sonic tendon would you share your experiences in betting on benchmarks? Was it successful?

keen fulcrum Apr 18, 2025, 9:33 PM

#

A grok fan long time no see

keen beacon Apr 18, 2025, 9:34 PM

#

calm sequoia @keen beacon @sonic tendon would you share your experiences in ...

Very successful so far. But i cant share any sauce xd as to how

#

I do ignore 99% of markets. most are betting, i pick the 1% where i can invest based on information

calm sequoia Apr 18, 2025, 9:36 PM

#

I thought only insiders make money there

keen beacon Apr 18, 2025, 9:37 PM

#

they make the most $ for sure

#

i assume others also $ launder

#

but there are others ways
just one example i will give

#

for rotten tomatoes score for movies, i literally trained an ai model to predict final score, based on previous movie data

#

and i constantly scraped the rotten tomatoes site for new reviews, that gave me an edge over the market

calm sequoia Apr 18, 2025, 9:39 PM

#

Very impressive

misty vault Apr 18, 2025, 9:42 PM

#

keen beacon and i constantly scraped the rotten tomatoes site for new reviews, that gave me ...

how do u bet on things
like on a site or something

balmy mist Apr 18, 2025, 9:43 PM

#

nahh grok is not better than 2.5 lol

#

i wouldnt even put it better than 3.7 thats debatable

#

i have bro, but i will test again, i did not test mini yet

#

also yall need to try roo code with the different modes

#

its kinda cracked

#

i was only using boomerang mode

#

but you can do much much more

keen beacon Apr 18, 2025, 9:44 PM

#

misty vault how do u bet on things like on a site or something

i either get alerted to verify before buy, or if i specified it then ai buys automatically via api

balmy mist Apr 18, 2025, 9:44 PM

#

basically your own custom ai agent system that uses specific models for certain tasks and roles

fringe carbon Apr 18, 2025, 9:45 PM

#

keen beacon question, what happens if 50 shares are being sold at 31 cent, but i place buy o...

you would buy for 31
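A small sketch of why the fill happens at 31, assuming a standard limit order book where a marketable buy fills against the cheapest resting asks at their prices. The function and the order tuples are hypothetical and this is not verified against Polymarket's matching engine; prices are in integer cents.

```python
def fill_buy_limit(asks, limit_price, quantity):
    """asks: list of (price, size) resting sell orders.
    Returns (fills, unfilled_quantity), filling at the resting ask prices."""
    fills = []
    remaining = quantity
    for price, size in sorted(asks):  # cheapest asks first
        if remaining == 0 or price > limit_price:
            break
        take = min(size, remaining)
        fills.append((price, take))
        remaining -= take
    return fills, remaining

# 50 shares offered at 31c, another 50 at 60c; buy order for 50 shares at 60c
asks = [(60, 50), (31, 50)]
print(fill_buy_limit(asks, 60, 50))  # ([(31, 50)], 0)
```

The 60c limit is only a ceiling. A larger order would sweep the 31c level first and then move up the book.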

#

but this is the wrong channel

zinc ore Apr 18, 2025, 9:46 PM

#

New chart just dropped.

misty vault Apr 18, 2025, 9:46 PM

#

keen beacon i either get alerted to verify before buy, or if i specified it then ai buys aut...

but like where do u bet

fringe carbon Apr 18, 2025, 9:46 PM

#

keen beacon i either get alerted to verify before buy, or if i specified it then ai buys aut...

u keep saying ai buys/sells wdym by that

#

u have a task llm set up?

keen beacon Apr 18, 2025, 9:47 PM

#

all i can say is that i bet on very few markets that are very news driven, llm gets news all the time, when its good news then it auto invests

torn mantle Apr 18, 2025, 9:48 PM

#

Elon is such a liar

#

He said grok 3 will be updated soon

#

But i havent noticed any difference at all

#

I guess those updates will be on grok 3.5

#

They really should fix multi turn

golden ocean Apr 18, 2025, 9:53 PM

#

what about writing

balmy mist Apr 18, 2025, 9:56 PM

#

@torn mantle im trying to make the best system prompt for a UI/UX designer, what do you use for your agents? also is apple the undisputed best looking apps? like user friendly and stuff?

hardy pecan Apr 18, 2025, 9:57 PM

#

UHM guys, are you able to convince gemini 2.5 pro, that todays date is in 2025?

#

Mine is ADAMANT that its 2024 April 19th

#

and i give it websites and screenshots and its gaslighting me into thinking i doctored the screenshots lol

#

and i had it go to https://www.timeanddate.com/ and they STILL think its 2024

#

maybe it cant read the website and its hallucinating

misty vault Apr 18, 2025, 10:08 PM

#

hardy pecan and i give it websites and screenshots and its gaslighting me into thinking i do...

literally bing chat reference

torn mantle Apr 18, 2025, 10:15 PM

#

balmy mist @torn mantle im trying to make the best system prompt for a UI/UX desig...

i do use a lot of keywords tbh

torn mantle Apr 18, 2025, 10:17 PM

#

balmy mist @torn mantle im trying to make the best system prompt for a UI/UX desig...

material UI design
apple design
neumorphic design
skeuomorphic
glassmorphism
beautiful gradients
premium design

#

you can experiment with neumorphic

#

its pretty good too

#

for threejs i just ask it to make it realistic/HDR env/in-depth shadows & details

#

something like that

ocean vortex Apr 18, 2025, 10:23 PM

#

zinc ore New chart just dropped.

no one is paying for personal use of 2.5 pro though

umbral crypt Apr 18, 2025, 10:24 PM

#

are there any rumors for nightwhisper or dragontail release dates?

ocean vortex Apr 18, 2025, 10:24 PM

#

it should be on $0

torn mantle Apr 18, 2025, 10:35 PM

#

@balmy mist https://3000-ijmfmzi8fdwu5f966g91q-65401bad.e2b-foxtrot.dev

#

dayhush is good but i still prefer nw

balmy mist Apr 18, 2025, 10:45 PM

#

torn mantle dayhush is good but i still prefer nw

same

#

nw follows directions perfectly

#

and the ui design is slightly better

#

without having to nudge it as much

tall summit Apr 18, 2025, 10:47 PM

#

redditors are screaming that o3's geoguessr performance is because of metadata

thorny drum Apr 18, 2025, 10:48 PM

#

google rigging the benchmarks once again!

balmy mist Apr 18, 2025, 10:48 PM

#

chill google is king

tall summit Apr 18, 2025, 10:49 PM

#

referring to this twitter post btw

#

also brain fart that's not google

olive mesa Apr 18, 2025, 11:01 PM

#

i honestly want google to create ASI before openai

brittle tiger Apr 18, 2025, 11:23 PM

#

I had no idea you only get 32k context with Plus for chatgpt. Apparently it just does RAG after that

https://x.com/TheXeophon/status/1913120160753332703?t=6arSE6mYeArtrSWOTpzvNQ&s=19

Xeophon (@TheXeophon) on X

Some @ChatGPTapp footguns:
- The model for Deep Research does not matter. It always uses o3 (Deep Research version)
- The model for image generation *does* matter - image gen is a tool; chosen model writes the prompt
- Pro has 128K context, Plus (+Teams) 32K, free 8K

keen beacon Apr 19, 2025, 12:17 AM

#

golden ocean Lmfao

Yea and its burp

#

vibe coding vs vibe pentesting

keen beacon Apr 19, 2025, 12:18 AM

#

zinc ore New chart just dropped.

google is undoubtedly the best bang for ur buck

#

plus they already have the competition on lock

#

they sell LLM hosting for gpt and many other llms on azure

#

the amount of data they're probably getting rn is crazy

calm sequoia Apr 19, 2025, 12:55 AM

#

calm sequoia

Poll: Have you ever bet money on the bench outcome?
Winning answer: No (4 of 8 votes)

worthy thunder Apr 19, 2025, 1:41 AM

#

keen fulcrum

Anyone know if they (xAI) are still doing consensus@64 for their benchmarks, or moved to pass@1? Genuinely curious, not paid attention since the original announcement
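For anyone unfamiliar with the two metrics, a minimal sketch of how they differ on a single question. The function names and the sample answers are made up for illustration and are not taken from any lab's evaluation code.

```python
from collections import Counter

def pass_at_1(samples, correct):
    """Fraction of individual samples that are correct,
    i.e. the expected accuracy of a single attempt."""
    return sum(s == correct for s in samples) / len(samples)

def consensus_at_k(samples, correct):
    """1 if the most common answer among the k samples is correct, else 0."""
    majority, _ = Counter(samples).most_common(1)[0]
    return int(majority == correct)

# 64 hypothetical samples for one question whose correct answer is "42"
samples = ["42"] * 30 + ["41"] * 20 + ["40"] * 14
print(pass_at_1(samples, "42"))       # 0.46875
print(consensus_at_k(samples, "42"))  # 1
```

A model that is right less than half the time per attempt can still score the question as solved under majority voting, which is why the two numbers are not directly comparable.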

elder rapids Apr 19, 2025, 2:10 AM

#

bruh?

#

are you talking about 2.5 flash

balmy mist Apr 19, 2025, 2:12 AM

#

elder rapids are you talking about 2.5 flash

yeah how did you know?

elder rapids Apr 19, 2025, 2:13 AM

#

I'm calling for clarity

#

but I'm wondering why he said grok 3 and not grok 3 mini

#

or grok 3 vs 2.5 flash

#

knowing 2.0 pro would be a better base, and that 2.5 is even better than that

#

just saying

keen beacon Apr 19, 2025, 2:19 AM

#

2.5 pro has a significantly better simpleqa score

#

its the closest thing to 4.5

#

afaik

#

ye

#

its like 10% better than grok 3 (simpleqa, which is quite significant)

elder rapids Apr 19, 2025, 2:21 AM

#

2.0 pro is the best model out of that bunch lol

#

or was grok 3 updated recently

#

bruh?

keen beacon Apr 19, 2025, 2:22 AM

#

ya. and anyway their reasoning doesnt seem to increase simpleqa yet, 2.5 flash actually has a lower simpleqa score than gem 2 flash

elder rapids Apr 19, 2025, 2:22 AM

#

btw can we lower reasoning for 2.5 pro yet

#

that's not what base means

keen beacon Apr 19, 2025, 2:22 AM

#

ive been seriously impressed with how much 2.5 pro knows tbh

elder rapids Apr 19, 2025, 2:22 AM

#

that would just be inherent to the model size

#

2.0 pro is a stronger model than grok 3

#

and can reason better when asked to

#

therefore is a better "base model"

#

actually hollon, 2.0 pro is STILL the highest performing non reasoning model besides 4.5 on livebench??

#

yo wtf

#

Lowkey thought there was gonna be another model that I forgot about

#

can you screenshot

#

?

#

oh you mean this one

#

what does that mean tho?

#

lmarena isn't a performance bench it's a vibe bench

#

so?

#

because it is lol

#

and it doesn't write better

#

this is something I specifically look for

#

grok 3 shoehorns itself narratively

#

buns writing, good philosophical understanding

#

becomes the balance

#

base ye

#

the best model I can pull the potential out of tho

#

is 2.5 pro

#

ye

#

it's creative but I disagree a ton

#

it doesn't know real writing conventions

#

it doesn't know how to conform to instructions

#

ie, a prose

#

or a poem

#

how would telling you be informative if I'm doing the same for all these models

#

4.5 is def human like

#

it doesn't get stuck on repetition

#

it doesn't overemphasize

#

it doesn't become swayed so much by context

#

but it isn't so good for writing, just "ideas"

#

since it lacks rigor

#

or academic conformity

#

just give it a quiz or something

#

like a writing quiz

#

it's going to get stuck on the potential ambiguity

#

and just choose whatever

#

rationalizing every answer the best it can

#

like "pick a sentence in source #2 that supports and states the relevant premise to source #2 in source #1"

#

given 2 sources

#

a hard one ofc

#

it'll sufficiently rationalize each sentence in the second source

#

but not be able to choose the "correct" sentence

#

with rigor

#

flash 2.0 gets these right

#

lol

#

but I'm saying it's a very general thing, you'll encounter it or you probably have

#

and didn't notice

#

that it lacks academic rigor

#

yep 4.5 has great implicit understanding

#

but again, this is still something that most other models lack, that 2.0 pro doesn't

#

although behind gpt 4.5 it's 3.5 sonnet

ember rapids Apr 19, 2025, 2:49 AM

#

4.5 is the first model to have a somewhat decent sense of humor

elder rapids Apr 19, 2025, 2:50 AM

#

ye

#

unprompted

#

on what

#

oh mb

#

nah I have

#

it hallucinates too much

#

it's smart

#

but damn

#

doesn't really understand prompts too well

small haven Apr 19, 2025, 3:07 AM

#

can o3 pro come any sooner 😭

alpine coral Apr 19, 2025, 3:47 AM

#

keen beacon ya. and anyway their reasoning doesnt seem to increase simpleqa yet, 2.5 flash a...

i'm not surprised by that.. i'm getting pretty crappy results from flash2.5. With thinking enabled it performs better, but it feels very underdeveloped (experimental ig ha)

#

grok-3-mini(high) has also done quite poorly against a few question sets i've been using lately. will share some screenshots in a min. but yeah in terms of comprehension / verbal reasoning (wordplays etc), and understanding of real-world implications / spatial reasoning as well as emotional intelligence – it's pretty crap from what i'm seeing tbh

balmy mist Apr 19, 2025, 4:08 AM

#

alpine coral grok-3-mini(high) has also done quite poorly against a few question sets i've be...

what was the thing you showed where grok 3 mini was really good?

#

im trying to see which model i should use for my low level agents, im thinking between: grok 3 mini, o4 mini, 4.1 mini, or 2.5 flash

keen beacon Apr 19, 2025, 4:12 AM

#

prob o4 mini since its good at calling tools

balmy mist Apr 19, 2025, 4:15 AM

#

and o4 is cheap right?

#

the only thing is grok 3 mini gives you free credits every month $150

keen beacon Apr 19, 2025, 4:16 AM

#

does it support natively calling tools tho?

keen beacon Apr 19, 2025, 4:16 AM

#

balmy mist the only thing is grok 3 mini gives you free credits every month $150

yeah free is better than everything else tho

balmy mist Apr 19, 2025, 4:17 AM

#

you are right

#

no tools

keen beacon Apr 19, 2025, 4:17 AM

#

i mean u can still work out a solution its not as efficient but u have a ton of free credits and im assuming pricing is quite cheap on grok 3 mini

#

e.g. in the final response it returns the json of the tool call
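A minimal sketch of that workaround, assuming the model is prompted to answer with a JSON object like {"tool": ..., "arguments": ...} when it wants a tool. The JSON shape, the TOOLS registry, and the function names are all hypothetical; nothing here is part of the xAI or OpenAI APIs.

```python
import json

# Hypothetical tool registry; a real agent would register its own functions here
TOOLS = {"add": lambda a, b: a + b}

def extract_tool_call(response_text):
    """Return (name, arguments) if the response is a JSON tool call, else None."""
    try:
        payload = json.loads(response_text.strip())
    except json.JSONDecodeError:
        return None  # ordinary text answer, not a tool call
    if not isinstance(payload, dict) or "tool" not in payload:
        return None
    return payload["tool"], payload.get("arguments", {})

def run_tool_call(response_text):
    """Execute the emulated tool call, or return None for a plain text response."""
    call = extract_tool_call(response_text)
    if call is None:
        return None
    name, args = call
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**args)

print(run_tool_call('{"tool": "add", "arguments": {"a": 2, "b": 3}}'))  # 5
print(run_tool_call("plain text answer"))  # None
```

It costs extra tokens and is more fragile than native tool calling, since the model can emit malformed JSON, but it works with any model that follows formatting instructions.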

balmy mist Apr 19, 2025, 4:19 AM

#

and you got this chart:

balmy mist Apr 19, 2025, 4:19 AM

#

keen beacon e.g. in the final response it returns the json of the tool call

have you tried using the models in ai ide like roo code?

#

thats where i am using these agents

keen beacon Apr 19, 2025, 4:19 AM

#

oh idk then lol

#

if it supports a custom openai server u could probably make ur own proxy server that calls the grok api and emulates tool calls

#

it really depends but im not familiar with that stuff at all (ai ides)

balmy mist Apr 19, 2025, 4:24 AM

#

what is the artificial analysis intelligence?

#

if this is true then grok 3 mini is the best model

keen beacon Apr 19, 2025, 4:26 AM

#

balmy mist what is the artificial analysis intelligence?

its a score based on a combination of several benchmarks

balmy mist Apr 19, 2025, 4:26 AM

#

that cant be real tho?

#

like its higher than 3.7

keen beacon Apr 19, 2025, 4:26 AM

#

i mean its charting the cost/performance ratio

balmy mist Apr 19, 2025, 4:26 AM

#

yeah but vertical is performance

#

it destroys on cost

#

but then on performance its like top 3

#

besides o3 cause its not on the chart

keen beacon Apr 19, 2025, 4:27 AM

#

ya i think its because of the benchmarks they choose

#

obviously depending on the task it varies

#

i think u should look at swebench or something

#

u are not throwing graduate level google proof questions

torn mantle Apr 19, 2025, 4:28 AM

#

balmy mist and you got this chart:

i dont know how to feel about grok 3 mini tbh

balmy mist Apr 19, 2025, 4:28 AM

#

ahh okay, anyone have an updated version for swe?

#

i think its worth locking in the best small model (semi small)

keen beacon Apr 19, 2025, 4:29 AM

#

wrt artificial analysis intelligence, i think the benchmarks there, xai probably applied a lot of rl proportionally to similar questions

balmy mist Apr 19, 2025, 4:29 AM

#

torn mantle i dont know how to feel about grok 3 mini tbh

what your tests show?

keen beacon Apr 19, 2025, 4:29 AM

#

so it may be misleading

balmy mist Apr 19, 2025, 4:29 AM

#

keen beacon wrt artificial analysis intelligence, i think the benchmarks there, xai probably...

what do you think is the best small model?

torn mantle Apr 19, 2025, 4:29 AM

#

balmy mist what your tests show?

i stopped using grok 3 month ago

#

so rusty

balmy mist Apr 19, 2025, 4:30 AM

#

torn mantle i stopped using grok 3 month ago

same, i only am considering it now because of that chart

#

and bc its free

torn mantle Apr 19, 2025, 4:30 AM

#

if you spend more time explaining to a model how to do things then you better off with another one

#

i found myself re-explaining my prompt

#

over and over

#

and reminding grok the context

#

it wasnt a good experience

#

it will generate a good code but after the 2nd prompt it wont follow instructions well

elder rapids Apr 19, 2025, 4:31 AM

#

balmy mist and you got this chart:

hold on tho

#

this website struggles with some stuff

balmy mist Apr 19, 2025, 4:31 AM

#

torn mantle it will generate a good code but after the 2nd prompt it wont follow instruction...

ppl say the api for it is better

keen beacon Apr 19, 2025, 4:31 AM

#

this is probably more representative btw: https://aider.chat/docs/leaderboards/

aider

Aider LLM Leaderboards

Quantitative benchmarks of LLM code editing skill.

elder rapids Apr 19, 2025, 4:32 AM

#

price to performance is redundant in a vacuum

torn mantle Apr 19, 2025, 4:32 AM

#

yea

keen beacon Apr 19, 2025, 4:32 AM

#

of ur use case

torn mantle Apr 19, 2025, 4:32 AM

#

o3 is really good

#

its the first time i felt a model is really smart

zinc ore Apr 19, 2025, 4:32 AM

#

Grok 3 mini should be a bit below o3 mini @1 pass

#

On most benchmarks

balmy mist Apr 19, 2025, 4:33 AM

#

wtf? they used two models?

Screenshot_2025-04-19_at_12.32.55_AM.png

balmy mist Apr 19, 2025, 4:33 AM

#

torn mantle o3 is really good

but its too expensive

keen beacon Apr 19, 2025, 4:33 AM

#

ya theres a configuration to do that in aider

#

architect mode

balmy mist Apr 19, 2025, 4:33 AM

#

wow thats dope

#

but still o3 too expensive

torn mantle Apr 19, 2025, 4:33 AM

#

balmy mist but its too expensive

yea

#

but its like the most reliable model

alpine coral Apr 19, 2025, 4:58 AM

#

keen beacon wrt artificial analysis intelligence, i think the benchmarks there, xai probably...

i suspect something along these lines

balmy mist Apr 19, 2025, 5:02 AM

#

do yall like 2.5 flash?

#

im settling on it for my small model

#

o4 is too expensive and only slightly better than flash, what yall think?

torn mantle Apr 19, 2025, 5:20 AM

#

im still trying dayhush xd

novel flame Apr 19, 2025, 5:44 AM

#

Grok 3 mini juiced on the benchmarks, that’s the only explanation for that ludicrous position on the Artificial Analysis chart. I tested the model myself last night and it’s pretty mid. It hallucinated in my grounded recall test, and did a good job on my realistic coding test, though it only barely broke my top ten of coding models.

In my harder coding test it was below average among the contenders (I have only given this test to the 15 or so most promising coding models so ‘average’ isn’t terrible, but Grok 3 mini was below average).

The Aider leaderboard looks pretty much correct in terms of real world coding usefulness of those models, matches my tests and experience quite well (though I still find that o3 and o4-mini behave inconsistently).

Grok 3 Mini might have some good uses as a cheap capable model for non-coding tasks. It had perfect scores for me in wordplay and associative logic.

calm sequoia Apr 19, 2025, 6:27 AM

#

novel flame Grok 3 mini juiced on the benchmarks, that’s the only explanation for that ludic...

That's true. As I said many times before, Grok shouldn't have taken the No. 1 spot even on general benchmarks. I think they did a similar thing to what llama did.

torn mantle Apr 19, 2025, 6:37 AM

#

calm sequoia That's true. As I said many times before, Grok shouldn't have taken the No. 1 sp...

the only ones that believe those benchmarks are grok devs

#

we all know it underperformed

#

also

#

what happened to "Big Brain" feature?

balmy mist Apr 19, 2025, 6:48 AM

#

yeah that might not come

torn mantle Apr 19, 2025, 6:57 AM

#

unfortunately dayhush doesnt seem that good at complex physical simulations

#

if its a recent checkpoint of nightwhisper then idk how to feel about that

#

i mean its better than other models but its not what i had in mind

novel flame Apr 19, 2025, 7:32 AM

#

Hmm..... I managed to get Gemini 2.5 Flash to make a browser game for my coding test. With the default settings it basically told me to pound sand because the task was too big for a single prompt (which is fair), but I wanted it to give it a try so I reduced the temperature to 0.5 and lo and behold, it did it! ...... poorly. The result was buggy and generally underwhelming, on par with Grok 3 Mini, Llama 4 Maverick, o3, and o3 Mini (all of which did poorly on this test).

The top tier on this test is Claude 3.7 Sonnet quite far ahead of the pack; and the second tier consists of Grok 3, Riveroaks, Gemini 2.5 Pro, o4-mini-high, and DeepSeek V3 0324.

raven void Apr 19, 2025, 7:42 AM

#

o3 is really good, 5-10 IQ points higher than the next smartest model
it's hard to believe OpenAI will have o5 pro internally in a few months

raven void Apr 19, 2025, 7:43 AM

#

torn mantle if its a recent checkpoint of nightwhisper then idk how to feel about that

so a 1206 to 2.0 situation 🤔

plain zinc Apr 19, 2025, 7:44 AM

#

Dayhush - Google's coding model

raven void Apr 19, 2025, 7:53 AM

#

From my testing, o3 answers are better than 2.5 most of the time, I would long OpenAI for April

#

In tests of Code Taste, Problem Analysis and Fiction

plain zinc Apr 19, 2025, 7:55 AM

#

raven void From my testing, o3 answers are better than 2.5 most of the time, I would long O...

Dayhush wins o3 by two heads above

raven void Apr 19, 2025, 7:56 AM

#

In general use also?🤔

#

I haven't checked lmarena recently tbh

keen beacon Apr 19, 2025, 8:13 AM

#

plain zinc Dayhush wins o3 by two heads above

is Dayhush reasoning model ?
i remember that google would release a coding model, and yes indeed it may be better at coding but i doubt its generally better than o3.

drifting thorn Apr 19, 2025, 9:02 AM

#

What? Dayhush?

umbral crypt Apr 19, 2025, 9:13 AM

#

balmy mist the only thing is grok 3 mini gives you free credits every month $150

how do i get this?

novel flame Apr 19, 2025, 9:26 AM

#

Maybe someone here knows…. I don’t have a paid Grok account so I tested Grok 3 Mini through the OpenRouter Playground with default settings. And as mentioned it performed poorly. But it occurs to me that maybe the default settings on OpenRouter are low reasoning effort? If so then maybe I judged Grok 3 Mini too quickly?

plain zinc Apr 19, 2025, 9:29 AM

#

keen beacon is Dayhush reasoning model ? i remember that google would release a coding mode...

Overall Gemini 2.5 Ultra will be good.

#

You'll remember my words when he arrives.

keen beacon Apr 19, 2025, 9:30 AM

#

plain zinc Overall Gemini 2.5 Ultra will be good.

ik but you think Dayhush is 2.5 ultra ?

#

ik that 2.5 ultra would beat even o3 generally

#

hell o3 barely beats 2.5 pro, so no chance vs the ultra model

umbral crypt Apr 19, 2025, 9:43 AM

#

keen beacon hell o3 barely beats 2.5 pro, so no chance vs the ultra model

what do you mean by barely? there's 7% diff in aider polyglot

keen beacon Apr 19, 2025, 9:44 AM

#

umbral crypt what do you mean by barely? there's 7% diff in aider polyglot

o3 beats it in every benchmark , but lmarena = avg people asking for funny jokes and 8th grade math homeworks, which is more about general knowledge, structure, syntax. there the difference is very small

umbral crypt Apr 19, 2025, 9:45 AM

#

oh

late path Apr 19, 2025, 9:45 AM

#

o3 reply with emoji

umbral crypt Apr 19, 2025, 9:46 AM

#

☺️

#

like this

wraith tulip Apr 19, 2025, 10:10 AM

#

HELLO. Nothing else to say. Just out of curiosity...🙂

waxen arch Apr 19, 2025, 10:11 AM

#

beep boop

hollow comet Apr 19, 2025, 10:15 AM

#

Does anyone know a working prompt for humanizing text?

tall summit Apr 19, 2025, 10:23 AM

#

hello friends

tall summit Apr 19, 2025, 10:23 AM

#

hollow comet Does anyone know a working prompt for humanizing text?

just ask it to humanize the text

#

a surprising amount of people have asked this question and i get so confused

#

humanizing texts are one of llm's best skills

#

and o3 is quite amazing at prose so i think it's improving even at that

ornate stump Apr 19, 2025, 10:24 AM

#

Never used Grok because I hated Twitter even before Elon, but I’ve found out that voice chat in grok is more useful than Gemini Live and chatgpt advanced voice.

tall summit Apr 19, 2025, 10:25 AM

#

plain zinc

that's incredible

calm sequoia Apr 19, 2025, 10:29 AM

#

Poll: King of the kings

Winning answer: 🔥 o3 (5 of 14 votes)

hollow comet Apr 19, 2025, 10:32 AM

#

tall summit humanizing texts are one of llm's best skills

For me it doesn't work at least as I want even though I used a more complex prompt. It works when it writes in my language and the AI detector gives 0% but in English it doesn't work in most cases and the detector writes 100%. And even the text in my language which the detector defines as human after translation via Google Translate is also defined as AI

unborn ocean Apr 19, 2025, 11:19 AM

#

U can try to tell it to write in the style of a specific well-known author.

glass arch Apr 19, 2025, 11:22 AM

#

https://chatgpt.com/share/6803878d-d344-8010-b7ae-08a9eab81113

ChatGPT

ChatGPT - Big Bang Overview

Shared via ChatGPT

#

interesting

#

it seems chatgpt can show you images?

#

since when has it done this

#

also guys, only 100 million more years until we have to update the 13.8 billion years number!

barren prairie Apr 19, 2025, 11:25 AM

#

glass arch it seems chatgpt can show you images?

Gemini used to do that too

calm sequoia Apr 19, 2025, 11:38 AM

#

Looks like o3 also has sampling issues. It got my prompt right only 4/5 times. Variability usually was a problem only with the mini models.

plain zinc Apr 19, 2025, 11:58 AM

#

keen beacon ik but you think Dayhush is 2.5 ultra ?

Nope

#

It's coding model

#

Just simple coding model

tall summit Apr 19, 2025, 12:43 PM

#

calm sequoia Looks like o3 also has sampling issues. It got my prompt right only 4/5 tim...

how about temperature

ocean vortex Apr 19, 2025, 12:47 PM

#

calm sequoia Looks like o3 also has sampling issues. It got my prompt right only 4/5 tim...

that's not a sampling issue, more an indicator of performance. Before a model can do something consistently right, it needs to go through a phase of getting it right at least occasionally. While a true 100% success rate might not even be possible at all unless you are overfitting

#

the way I'm looking at this is if it's correct most of the time that means it's correct period. But if you have 2 models you are comparing that are incredibly close, you could look at the exact success rate percentage as well for more in-depth info.

#

though at that point it must be said it would probably be smarter to expand the number of your test prompts, rather than hyper focusing on just this 1 or a few
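
To put a number on why a handful of runs says little about the exact success rate, here is a small sketch that computes a 95% Wilson score interval for a pass rate. The choice of the Wilson interval and the function name `wilson_interval` are my own for illustration; nobody in the thread proposed a specific method.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# 4 passes out of 5 runs of the same prompt.
low, high = wilson_interval(4, 5)
print(f"4/5 passes: plausible true pass rate {low:.2f} to {high:.2f}")

# Same observed rate, ten times as many runs: the interval narrows.
low50, high50 = wilson_interval(40, 50)
print(f"40/50 passes: plausible true pass rate {low50:.2f} to {high50:.2f}")
```

With only five runs the interval spans roughly 0.38 to 0.96, so two models that both score 4/5 cannot really be told apart; more prompts or more repetitions are needed before the percentage means much.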

ocean vortex Apr 19, 2025, 12:56 PM

#

glass arch https://chatgpt.com/share/6803878d-d344-8010-b7ae-08a9eab81113

yeah those are just the sources it found with web search. You can click through them when you expand the image. Still useful though

pliant cypress Apr 19, 2025, 1:10 PM

#

webdev arena is so broken right now or something is wrong with my pc/internet?

#

like 90% generations fail

wheat onyx Apr 19, 2025, 1:12 PM

#

I am super impressed with o3 (haven't tested o4mini)

https://chatgpt.com/share/6802ef37-efbc-8011-b47e-3ddf1669260f

ChatGPT

ChatGPT - TV not turning on troubleshooting

Shared via ChatGPT

#

Analyzing images like this is a gamechanger

wheat onyx Apr 19, 2025, 1:13 PM

#

plain zinc It's coding model

That will be an interesting one to see. Google models are pretty cheap, and another SOTA coders I'm always watching

leaden meteor Apr 19, 2025, 1:20 PM

#

Poll: Will Gemini-2.5-Pro-Exp-03-25 drop from #1 on LMArena leaderboard in next 7 days

Winning answer: Yes (11 of 20 votes)

calm sequoia Apr 19, 2025, 1:28 PM

#

tall summit how about temperature

How to set temperature on lmarena or chatgpt web?

calm sequoia Apr 19, 2025, 1:28 PM

#

ocean vortex that's not a sampling issue, more an indicator of performance. Before a model ca...

Why would the model be non deterministic if the input is the same?

keen beacon Apr 19, 2025, 1:30 PM

#

calm sequoia How to set temperature on lmarena or chatgpt web?

U can set it in direct chat on lmsys

#

O3 and o4 mini are there

#

They used to ignore the value but it seems they don't anymore (setting api temp)

calm sequoia Apr 19, 2025, 1:32 PM

#

Can you do it during battles? 👀👀

keen beacon Apr 19, 2025, 1:32 PM

#

Nah

calm sequoia Apr 19, 2025, 1:32 PM

#

Sadly anonymous models cannot be selected

#

It always passes on direct chat. Maybe they are testing multiple temperature variants

keen beacon Apr 19, 2025, 1:34 PM

#

I heard the quality was different on the api rn but I don't use the models so idk

#

Something is messed up on chatgpt to some extent I've heard

ocean vortex Apr 19, 2025, 1:34 PM

#

calm sequoia Why would the model be non deterministic if the input is the same?

output is always indeterministic technically. Even with temp0 it's not really 100% deterministic. You can only do it deterministic with additional parameters (seed and whatnot...) when it's implemented

#

but for openai or anthropic, I do not believe there is a way with their libraries currently

tall summit Apr 19, 2025, 1:36 PM

#

wheat onyx I am super impressed with o3 (haven't tested o4mini) https://chatgpt.com/share/...

i think this is the most incredible thing i've seen o3 do

wheat onyx Apr 19, 2025, 1:38 PM

#

tall summit i think this is the most incredible thing i've seen o3 do

Yeah I really can't be more impressed with it.

Think of all the applications. Endless

#

I think those immediately DOA fitness mirrors are going to make a comeback soon

ocean vortex Apr 19, 2025, 1:42 PM

#

keen beacon They used to ignore the value but it seems they don't anymore (setting api temp)

if they aren't that's not by design. They clearly don't want you to be changing that. Haven't tried it now but with o1/o3 API also used to return an error if you tried passing the temp parameter

#

for reference this is how it looks when that's not restricted:

heady kiln Apr 19, 2025, 1:46 PM

#

anybody facing same issue? web lmarena
not returning any response

it says "generating" then "generating" disappears and makes me vote but, no output appears

ocean vortex Apr 19, 2025, 1:48 PM

#

it wouldn't work with plus (32k), and I'm not sure pro has the full 200k either... 🤔
And with API it doesn't make sense to be paying this much, just dump it into 2.5 pro

balmy mist Apr 19, 2025, 1:48 PM

#

umbral crypt how do i get this?

https://www.youtube.com/watch?v=ex86YzTlBdo&t=4s&ab_channel=AICodeKing

YouTube

AICodeKing

Grok-3 & Grok-3 Mini (Free API) + Cline & RooCode : DON'T MISS THIS...

Check out the NinjaChat AI platform over here : https://www.ninjachat.ai/

In this video, I'll be telling you about Grok 3 and Grok-3 Mini new API that allows you to use it for free with Cline and RooCode.

Resources:

Grok AI Console: console.x.ai
Requesty: https://app.requesty.ai/join?ref=4581bcf6

Key Takeaways:

🚀 Grok-3 API is ...

ocean vortex Apr 19, 2025, 1:49 PM

#

try it

heady kiln Apr 19, 2025, 1:50 PM

#

ok, i'll try it again

ocean vortex Apr 19, 2025, 1:51 PM

#

oh you just quoted the wrong person lol

#

I think it discards the cache and a refresh

heady kiln Apr 19, 2025, 1:51 PM

#

anybody compared dayhush, claybrook, dragontail and nightwhisper yet?

heady kiln Apr 19, 2025, 1:52 PM

#

ocean vortex I think it discards the cache and a refresh

sometimes only one model generates the webpage I request

#

sorry btw, wrong reply

calm sequoia Apr 19, 2025, 1:53 PM

#

ocean vortex output is always indeterministic technically. Even with temp0 it's not really 10...

But where does the randomness come from? The simple LLMs can be derived into a set of nonlinear functions. Meaning that the same input shall result in same output. I'm missing something. Is it tricks in the new architectures? Systemic issues?

keen beacon Apr 19, 2025, 1:54 PM

#

Hardware floating point precision optimizations etc

keen beacon Apr 19, 2025, 1:54 PM

#

calm sequoia But where does the randomness come from? The simple LLMs can be derived into a s...

But yes there is no inherent randomness otherwise

#

In reality because you can't have perfect precision numerically, hardware, etc there is inherent nondeterminism

ocean vortex Apr 19, 2025, 1:55 PM

#

calm sequoia But where does the randomness come from? The simple LLMs can be derived into a s...

bluntly speaking, at its core it's artificially introduced randomness where all of this is coming from. But temperature itself, as per its definition, tells the model to consider more or fewer probabilities for the next token - that by itself is non-determinism.

keen beacon Apr 19, 2025, 1:56 PM

#

Greedy decoding assuming all sampling is disabled does not introduce randomness theoretically
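
The difference between the two positions can be shown with a toy next-token picker. This is only a sketch with made-up logits for three candidate tokens, not how any provider actually serves a model: greedy decoding always returns the highest-scoring token, while temperature sampling draws from the softmax distribution and is repeatable only when the random seed is fixed.

```python
import math
import random


def softmax(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Greedy decoding: no sampling, always the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature, rng):
    """Temperature sampling: draw one token index according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def draw(seed, n=20, temperature=1.0):
    """Sample n tokens with a fixed seed, so the sequence can be reproduced."""
    rng = random.Random(seed)
    return [sample(LOGITS, temperature, rng) for _ in range(n)]


LOGITS = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens

greedy_picks = {greedy(LOGITS) for _ in range(100)}
first_draws = [draw(seed, n=1)[0] for seed in range(200)]

print("greedy picks:", greedy_picks)
print("distinct sampled tokens across seeds:", sorted(set(first_draws)))
print("same seed reproduces:", draw(42) == draw(42))
```

In this toy setting the randomness comes entirely from the sampler, which is the point made about greedy decoding; whatever residual variation real deployments show with sampling disabled would have to come from somewhere else, such as the numerics.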

ocean vortex Apr 19, 2025, 1:56 PM

#

each training run will normally result in a different outcome (model) since random starting seed is usually that - random.

keen beacon Apr 19, 2025, 1:57 PM

#

Ong

#

Omg

#

Not this again

calm sequoia Apr 19, 2025, 1:57 PM

#

ocean vortex each training run will normally result in a different outcome (model) since rand...

Training yes. But we are talking about inference.

#

May be compiler thing

#

Choosing different path based on some conditions

#

Server thing also

calm sequoia Apr 19, 2025, 1:59 PM

#

keen beacon Not this again

Nobody cares if they reason as long as they produce good code 😄

#

I asked o3: "Short answer: Not necessarily.
Under today’s large‑language‑model (LLM) stacks, sending exactly the same prompt 100 times can yield identical, slightly different, or wildly different completions depending on four controllable factors (decoding parameters, software, hardware, model snapshot) and three unavoidable sources of entropy (floating‑point nondeterminism, multithreading, and vendor weight updates)"

#

I dont see how floating point can be involved here, but multithreading could be a cause

umbral crypt Apr 19, 2025, 2:01 PM

#

balmy mist https://www.youtube.com/watch?v=ex86YzTlBdo&t=4s&ab_channel=AICodeKing

That is the most disgusting voice I have ever heard

calm sequoia Apr 19, 2025, 2:01 PM

#

Architectural & theoretical backdrop

Decoding strategies deliberately inject randomness to avoid repetition and exposure bias, and recent surveys show variance growing with model size and alignment techniques.

#

Extrapolation you say

keen beacon Apr 19, 2025, 2:03 PM

#

LLMs can play chess. The skill is destroyed in the instruct process

leaden palm Apr 19, 2025, 2:04 PM

#

calm sequoia Apr 19, 2025, 2:05 PM

#

Could you imagine a new color, paws?

keen beacon Apr 19, 2025, 2:05 PM

#

calm sequoia I dont see how floating point can be involved here, but multithreading could be ...

Floating point numbers are extremely weird

#

0.1+0.1+0.1 != 0.3

#

Can happen
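
A quick illustration of the floating-point point, as a sketch rather than a claim about any particular inference stack. Each individual operation is deterministic, but floating-point addition is not associative, so summing the same numbers in a different order can give a different result. The assumption here (not confirmed anywhere in this thread) is that parallel hardware may not always add partial results in the same order from run to run.

```python
def add_in_order(values):
    """Sum values strictly left to right, one addition at a time."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]

forward = add_in_order(values)             # ((0.1 + 0.2) + 0.3)
backward = add_in_order(reversed(values))  # ((0.3 + 0.2) + 0.1)

print("0.1 + 0.1 + 0.1 == 0.3 ?", 0.1 + 0.1 + 0.1 == 0.3)
print("forward :", repr(forward))
print("backward:", repr(backward))
print("same result regardless of order?", forward == backward)
```

Running the same order twice always gives the same answer, which matches the "imprecise but deterministic" experience from C. The two orders differ only in the last bit, but a tiny difference in a logit can be enough to flip which token scores highest when two candidates are nearly tied.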

calm sequoia Apr 19, 2025, 2:06 PM

#

I spent a lot of time programming floating points in C. In my practice, it's deterministic. Imprecise but deterministic.

#

It wouldnt be hard if we would have feedback loops

#

Why do you think data != algorithms

#

Instruction sets are data

keen beacon Apr 19, 2025, 2:08 PM

#

calm sequoia I dont see how floating point can be involved here, but multithreading could be ...

Multi threading isn't really applicable unless you're doing cpu inference I believe

#

Multi threading is done even in gpu inference (like scheduling etc maybe) but I don't think they cause the non determinism in actual decoding

calm sequoia Apr 19, 2025, 2:10 PM

#

We need help from @wooden mulch What causes same models to produce different results with the same input on the arena?

keen beacon Apr 19, 2025, 2:11 PM

#

calm sequoia I spent a lot of time programming floating points in C. In my practice, it's dete...

Afaik it's still somewhat dependent on your setup, hardware etc. but I might be wrong. E.g. comparing results from different hardware might lead to differences

calm sequoia Apr 19, 2025, 2:13 PM

#

I always wonder if they are making any calculations locally

#

To reduce server load

keen beacon Apr 19, 2025, 2:14 PM

#

Not possible lol

balmy mist Apr 19, 2025, 2:15 PM

#

damn

ocean vortex Apr 19, 2025, 2:17 PM

#

calm sequoia We need help from @wooden mulch What causes same models to produce diffe...

tbh I think you just need to learn the basics on how LLMs work and how the next token is being generated in the context of token pools.

#

it "considers" a ton of tokens before settling on a specific one. And that process is for the most part random by design and including a lot of potential outcomes if you leave everything at default settings.

calm sequoia Apr 19, 2025, 2:20 PM

#

You may be an order of magnitude more informed than me or less informed

#

Anyway, Ill read one book on llms to catch up

ocean vortex Apr 19, 2025, 2:21 PM

#

if we made LLMs deterministic they would be very boring and not really pleasant to interact with

calm sequoia Apr 19, 2025, 2:23 PM

#

Was it since the beginning or introduced after the gpt 3.5?

oblique flint Apr 19, 2025, 2:24 PM

#

ocean vortex if we made LLMs deterministic they would be very boring and not really pleasant ...

llms are deterministic wym

#

but temperature isnt

calm sequoia Apr 19, 2025, 2:26 PM

#

I dont understand why they didn't release o3 as GPT 5. It does feel like a gpt 3.5 to gpt 4 improvement.

#

What next are we waiting for this quarter? 2.5 Ultra, R2 and qwen max?

#

I wonder if they will release o4 at some point or will call it GPT5

ocean vortex Apr 19, 2025, 2:37 PM

#

oblique flint llms are deterministic wym

well the way we implemented transformer arch and how we are running them, they are not deterministic. Was talking about end result here when you are already interacting with the model. Non-deterministic is desired behavior so when that's lacking it is simply artificially added at the given stage.

keen beacon Apr 19, 2025, 2:37 PM

#

oblique flint llms are deterministic wym

They are in theory if you use no sampling. Practically they are not

keen beacon Apr 19, 2025, 2:38 PM

#

ocean vortex well the way we implemented transformer arch and how we are running them, they a...

The small amount of non determinism without sampling I doubt would be very noticeable by the user in most cases tho

#

Adding a little bit of temperature changes it a lot more I think

ocean vortex Apr 19, 2025, 2:40 PM

#

keen beacon The small amount of non determinism without sampling I doubt would be very notic...

sampling is the biggest factor for sure. But we can't really talk about LLMs excluding the sampling can we...

keen beacon Apr 19, 2025, 2:40 PM

#

ocean vortex sampling is the biggest factor for sure. But we can't really talk about LLMs exc...

You can

#

Greedy decoding is no sampling

ocean vortex Apr 19, 2025, 2:40 PM

#

not in the context of you interacting with the models on lmarena

keen beacon Apr 19, 2025, 2:41 PM

#

Yeah but you were talking about LLMs in general

#

I understand what u mean now

ocean vortex Apr 19, 2025, 2:41 PM

#

the original question was why they are not deterministic on arena. So excluding the sampling would just be selectively misleading now lol

#

99% of the time users are interacting with LLMs on any platform it includes sampling anyway 👀

sour spindle Apr 19, 2025, 3:17 PM

#

Wondering if anyone can explain this to me. Is o3 on arena the same o3 that you have on your ChatGPT account.

#

I feel like when I use o3 on my account I get more in-depth answers and o3 on arena feels watered down.

#

How are they able to give o3 access of arena does OAI subsidize usage?

leaden palm Apr 19, 2025, 3:17 PM

#

sour spindle I feel like when I use o3 on my account I get more in-depth answers and o3 on ar...

o3 on chatgpt has access to python and searching

#

o3 in arena is raw api version

sour spindle Apr 19, 2025, 3:19 PM

#

Even then how are they able to allow so much access to it on arena

leaden palm Apr 19, 2025, 3:19 PM

#

sour spindle How are they able to give o3 access of arena does OAI subsidize usage?

probably this

keen beacon Apr 19, 2025, 3:20 PM

#

They didn't do it for o1 but added o3 mini tho (direct chat)

#

I'm assuming o3 is just cheaper to run

leaden palm Apr 19, 2025, 3:21 PM

#

keen beacon They didn't do it for o1 but added o3 mini tho (direct chat)

they added full o3 and o4 mini

keen beacon Apr 19, 2025, 3:21 PM

#

Yeah

leaden palm Apr 19, 2025, 3:21 PM

#

o3 is $40/mtok while o1 was $60/mtok out

keen beacon Apr 19, 2025, 3:21 PM

#

It's also probably more marked up (despite lower pricing) I think

keen beacon Apr 19, 2025, 3:22 PM

#

leaden palm they added full o3 and o4 mini

I meant instead of o1 they added o3 mini

leaden palm Apr 19, 2025, 3:22 PM

#

m

sour spindle Apr 19, 2025, 3:22 PM

#

Thanks for answering my questions KT

keen fulcrum Apr 19, 2025, 3:25 PM

#

https://fixupx.com/abustin/status/1912912748746514541

Bustin (@abustin)

Report this! 📢

Grok.com now generates reports from uploaded CSV data.

Share your feedback -- How can I further increase its usefulness?

💬 64 🔁 93 ❤️ 605 👁️ 2.07M

#

https://fixupx.com/cb_doge/status/1913508655137280312

DogeDesigner (@cb_doge)

Grok is the smartest A.I. out there and the only one with real-time access to world events, thanks to 𝕏

💬 578 🔁 341 ❤️ 2.2K 👁️ 18.69M

ember rapids Apr 19, 2025, 3:42 PM

#

How does claybrook compare to nightwhisper

#

Just saw this

brittle tiger Apr 19, 2025, 3:47 PM

#

ember rapids How does claybrook compare to nightwhisper

in my testing nw > dayhush> claybrook

#

but the gaps aren't major

alpine coral Apr 19, 2025, 3:53 PM

#

novel flame Maybe someone here knows…. I don’t have a paid Grok account so I tested Grok 3 M...

you can adjust it on openrouter. it's weird though, the default is 0, but there clearly are tokens allocated for reasoning because it always reasons no matter what the value is set at, inc 0 (or 1, 10 etc)

#

it doesn't seem to do anything tbh.. like setting it to max (100k tokens), and it uses fewer reasoning tokens than when set to 50k, or 1k.. (and performs worse compared to 50k.. though i'm pretty sure this parameter, at least on OR, isn't doing anything, and it's just variability)

keen beacon Apr 19, 2025, 3:55 PM

#

Does it stop at 1 token if u set it to that

alpine coral Apr 19, 2025, 3:56 PM

#

keen beacon Does it stop at 1 token if u set it to that

i've tried 1, 5 - doesn't make a difference. it always reasons on OR, as far as i can tell anyway

#

the value set in the param doesn't seem to correspond to how long / many tokens it uses for reasoning

keen beacon Apr 19, 2025, 3:57 PM

#

alpine coral i've tried 1, 5 - doesn't make a difference. it always reasons on OR, as far as ...

Ya but does it ever cut off on that amt

#

No I guess

wheat onyx Apr 19, 2025, 3:57 PM

#

brittle tiger in my testing nw > dayhush> claybrook

Weird that google keeps testing coders that are progressively worse. Mini, mini mini versions?

alpine coral Apr 19, 2025, 3:58 PM

#

keen beacon Ya but does it ever cut off on that amt

no.if you mean like reason for 1 token then proceed to the answer

#

it always reasons and coherently/completely

#

like no cut offs

keen ferry Apr 19, 2025, 3:59 PM

#

which ai is on Gemini Live on the app im using webcam and its kinda good id say?

keen beacon Apr 19, 2025, 4:01 PM

#

alpine coral no.if you mean like reason for 1 token then proceed to the answer

Hmm I skimmed the grok api docs on my phone it seems they support reasoning efforts and not really a reasoning token limit. Maybe u need to pass that on somehow if possible thru openrouter

alpine coral Apr 19, 2025, 4:01 PM

#

yeah was gonna post something there actually

keen beacon Apr 19, 2025, 4:01 PM

#

It should be possible since I think they support it for openai reasoning models

alpine coral Apr 19, 2025, 4:01 PM

#

it feels like the param is doing nothing / doesn't exist

torn mantle Apr 19, 2025, 4:01 PM

#

keen fulcrum https://fixupx.com/cb_doge/status/1913508655137280312

puhahahahahaha

#

i cant

torn mantle Apr 19, 2025, 4:03 PM

#

ember rapids Just saw this

NW = DH
claybrook = Dragontail (?)
Gemini 2.5 pro 03
Gemini 2.5 flash thinking

#

claybrook gave me good results as well

#

so im not sure if its better than dragontail

#

bruh these names ...

alpine coral Apr 19, 2025, 4:10 PM

#

i see yeah OR's implementation is off then

drifting thorn Apr 19, 2025, 4:11 PM

#

Waiting for NW with better native tool use

torn mantle Apr 19, 2025, 4:11 PM

#

this is actually a cool prompt

#

https://x.com/rezmeram/status/1912973797206155424

RameshR (@rezmeram) on X

Gemini 2.5 Flash demolishes my Galton Board test, I could not get 4omini, 4o mini high, or 03 to produce this. I found that Gemini 2.5 Flash understands my intents almost instantly, code produced is tight and neat. The prompt is a merging of various steps. It took me 5 steps to

#

see if you can make dayhush generate that

keen fulcrum Apr 19, 2025, 4:12 PM

#

So there is Dayhush and Claybrook from Google
Which is better?

torn mantle Apr 19, 2025, 4:12 PM

#

dayhush

#

is a coding model

keen fulcrum Apr 19, 2025, 4:12 PM

#

Even better than nightwhisper?

torn mantle Apr 19, 2025, 4:12 PM

#

claybrook is probably a recent checkpoint of gemini 2.5 pro

torn mantle Apr 19, 2025, 4:12 PM

#

keen fulcrum Even better than nightwhisper?

i find it similar tbh

#

i cant tell

#

its really hard

#

but ive noticed some small issues with dayhush

#

overlapping elements
text overflowing its container elements
it needs clear guidance

keen fulcrum Apr 19, 2025, 4:15 PM

#

Perhaps a mini model

torn mantle Apr 19, 2025, 4:16 PM

#

keen fulcrum Perhaps a mini model

nah it definitely didnt feel like a mini version

#

its probably the slowest model

keen fulcrum Apr 19, 2025, 4:16 PM

#

torn mantle Apr 19, 2025, 4:16 PM

#

it takes a lot of time thinking

drifting thorn Apr 19, 2025, 4:51 PM

#

keen fulcrum

Would love that

#

But knowledge graph is more useful than just embedding RAG

balmy mist Apr 19, 2025, 5:05 PM

#

torn mantle 1. NW = DH 2. claybrook = Dragontail (?) 3. Gemini 2.5 pro 03 4. Gemini 2.5 fla...

so you think flash is the best model overall? like including cost efficiency (performance and cost)?

calm sequoia Apr 19, 2025, 5:08 PM

#

torn mantle claybrook is probably a recent checkpoint of gemini 2.5 pro

Its worse

hybrid locust Apr 19, 2025, 5:09 PM

#

lmarena beta has no rate limits wow

keen beacon Apr 19, 2025, 5:10 PM

#

these mfs are being bankrolled in lab credits

hybrid locust Apr 19, 2025, 5:16 PM

#

keen beacon these mfs are being bankrolled in lab credits

lol

#

they're backed by lots of orgs tho

#

it's probably insignificant to them

alpine coral Apr 19, 2025, 5:17 PM

#

well.. Sequoia Capital is an organisation of sorts lol

golden ocean Apr 19, 2025, 5:24 PM

#

torn mantle https://x.com/rezmeram/status/1912973797206155424

https://3000-i1vc3uau4u936xz5lbljl-96f8ee60.e2b-foxtrot.dev/ by dayhush

#

kinda ass cheeks

#

https://3000-ic8ffl64if1ey2lpz839s-b4a4855c.e2b-foxtrot.dev/ this one by gpt 4.1 is good
I mean there's a lil error at the bottom: they can't go into the middle slots

#

https://3000-iw42qa863msalyyv8tzqi-0ce36864.e2b-foxtrot.dev/ by o3 mini high

keen beacon Apr 19, 2025, 5:26 PM

#

o3 mini high > 4.1 > dayhush

#

google getting too much hype

raven void Apr 19, 2025, 5:29 PM

#

keen beacon google getting too much hype

you cooked these people

keen beacon Apr 19, 2025, 5:30 PM

#

just gotta wait for o3 nr 1 to drop

#

any day now 😢

torn mantle Apr 19, 2025, 5:30 PM

#

golden ocean https://3000-i1vc3uau4u936xz5lbljl-96f8ee60.e2b-foxtrot.dev/ by dayhush

yea

#

thats the issue

#

it may be good at UI/UX

#

but it still lacks on complex reasoning and implementation

#

i got it to work once in 6 tries

torn mantle Apr 19, 2025, 5:31 PM

#

balmy mist so you think flash is the best model overall? like including cost efficiency(perf...

its good yea

#

but it hallucinates a lot

ocean vortex Apr 19, 2025, 5:40 PM

#

keen beacon just gotta wait for o3 nr 1 to drop

I don't think it's gonna go this high. But we will see.. 👀

keen beacon Apr 19, 2025, 5:43 PM

#

ocean vortex I don't think it's gonna go this high. But we will see.. 👀

that high * is the correct term, but why ?

balmy mist Apr 19, 2025, 5:46 PM

#

keen beacon o3 mini high > 4.1 > dayhush

yeah idk about that

calm sequoia Apr 19, 2025, 5:46 PM

#

ocean vortex I don't think it's gonna go this high. But we will see.. 👀

The latest 4o is very close and o3 is an order of magnitude better. How's it not the new sota?

#

Chinese endgame

keen beacon Apr 19, 2025, 5:51 PM

#

i asked gemini and o3 to prepare 3 hard questions for a very smart rival ai. and also to solve these questions so that i knew the solutions.

Both models solved each other's questions, but gemini took way longer to do o3's questions and its solutions were way less elegant

patent bane Apr 19, 2025, 5:53 PM

#

wdym by less elegant

keen beacon Apr 19, 2025, 5:53 PM

#

more complicated and longer than it needs to be

#

i also made a game where they use their own algorithms to command their player and shoot at each other, o3 won 8-2
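
A rough sketch of how much an 8-2 score says by itself, assuming ten independent games with no draws (neither is stated above). It computes the chance of a result at least that lopsided between two equally strong players, and the rating gap the score would imply under the usual logistic Elo formula. The function names are made up for illustration.

```python
from math import comb, log10


def p_at_least(wins, games, p=0.5):
    """Probability of `wins` or more wins in `games` independent games,
    where each game is won with probability `p`."""
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(wins, games + 1)
    )


def implied_elo_gap(wins, games):
    """Rating gap implied by a score, using the logistic Elo formula.
    Undefined for a clean sweep, so that case is rejected."""
    score = wins / games
    if not 0 < score < 1:
        raise ValueError("score must be strictly between 0 and 1")
    return 400 * log10(score / (1 - score))


one_sided = p_at_least(8, 10)          # 56/1024 if both players are equal
two_sided = min(1.0, 2 * one_sided)    # either player winning 8 or more
print(f"P(>=8 wins of 10 | equal skill) = {one_sided:.4f}")
print(f"two-sided                       = {two_sided:.4f}")
print(f"implied Elo gap                 = {implied_elo_gap(8, 10):.0f}")
```

With only ten games the one-sided figure is about 5%, so the result is suggestive rather than conclusive; more games would tighten it.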

elder rapids Apr 19, 2025, 5:55 PM

#

these aren't even benchmarks

#

🙏 😭

keen beacon Apr 19, 2025, 5:55 PM

#

these are better benchmarks than the official ones, which models already know the answers to

zinc ore Apr 19, 2025, 5:55 PM

#

keen beacon i asked gemini and o3 to prepare 3 hard questions for a very smart rival ai. and...

Tbf, if o3 is smart it would write questions it believes it can answer better

elder rapids Apr 19, 2025, 5:55 PM

#

they couldn't be

#

they're not quantifying any real metric

zinc ore Apr 19, 2025, 5:56 PM

#

We're in the benchmark wars of AI

keen beacon Apr 19, 2025, 5:56 PM

#

elder rapids they're not quantifying any real metric

general intelligence to perform great across any domain ..

#

even silly ones

elder rapids Apr 19, 2025, 5:56 PM

#

thats great but that means I can say 2.5 pro beats o3 in many many different areas

#

and is therefore more general

zinc ore Apr 19, 2025, 5:57 PM

#

What we need is better benchmarks