#general

1 messages Β· Page 29 of 1

keen beacon
#

but you'll see

balmy mist
#

omgg im getting so excited

keen beacon
balmy mist
#

i never wanted another country to win so bad before lol

keen beacon
#

but it's very possible they finally introduce a subscription

fleet lintel
balmy mist
#

all i want is that api

balmy mist
keen beacon
#

not many at deepseek tho

balmy mist
#

liike that dude from GOT

keen beacon
#

mainly just word of mouth

oblique flint
#

I just hope it wont go in circles as much as R1

balmy mist
#

its funny cause r1 crashed the market a lil

#

and rn the market is trash

keen beacon
balmy mist
#

imagine if r2 comes out this week lmaoo

keen beacon
#

part of the training involved making the model better at determining how many reasoning tokens to use

#

although apparently it's still not as good at that as some other models

#

deepseek has always been a bit more

#

brute force-y

oblique flint
#

you got any info on qwen 3?

keen beacon
#

nope

#

unfortunately lmao

#

i do expect it in the next 2 weeks or so

#

but that's about it

balmy mist
#

i aint gonna lie this channel might be one of the leading spaces for ai news lmaoo

sonic tendon
#

i mean, that's probably mostly leo lol

sonic tendon
balmy mist
keen beacon
sonic tendon
#

tbh i'd be surprised if it doesn't outperform

balmy mist
keen beacon
#

well yes i have some but it's very much in the single digits!

balmy mist
#

dont forget deepseek dont got the same funds as OA

keen beacon
#

you'd be surprised

oblique flint
#

if r2 manages to outperform o3 at a way lower cost, I'd start to believe deepseek is actually the leading AI lab

fleet lintel
keen beacon
balmy mist
sonic tendon
balmy mist
#

after it crashed US markets they saw the vision lol

keen beacon
#

the people over at deepseek also have insane work ethic

#

as you might expect

#

most of them barely sleep

#

especially since the release date for R2 was changed to "ASAP"

balmy mist
#

they should def call it R3 tho

keen beacon
fleet lintel
balmy mist
#

troll OA a lil

keen beacon
#

😭

sonic tendon
balmy mist
sonic tendon
#

i imagine a lot of LM development is just throwing stuff at the wall rather than trying to imagine the next breakthrough

fleet lintel
balmy mist
#

they prob have an insanse workflow now, look at google's workflow

sonic tendon
#

the entire CN education system, I think

balmy mist
sonic tendon
oblique flint
#

btw I wonder, why didnt the other labs adopt deepseeks MLA (or did they?). I guess that was one of the main cost saving measures they implemented right?

sonic tendon
#

have a few friends there

sonic tendon
#

but not an expert

brittle tiger
#

They are insanely cracked and I'm not certain it's happening with AI companies yet but western big tech firms don't have the advantage of NSA coming to them saying "hey check out the new methods they're working on across the ocean". Instead us govt is trying to break up big tech. Cooperation with govt is major slept on advantage over there. They're insanely talented regardless tho

balmy mist
oblique flint
sonic tendon
#

but, the chinese govt is absolutely bankrolling the hell out of AI development rn, from what I've heard

brittle tiger
fleet lintel
#

is R2 on LMArena? Are they testing it here?

keen beacon
#

no

alpine coral
#

and a welcome one.. China hasn't bounced back since covid.. it hasn't been firing on all cylinders

keen beacon
#

this race is a huge opportunity for china

#

if they win the ai war it will be truly be the beginning of the end for the idea of the US as the leader of the world

alpine coral
#

but i think it's a false hope of sorts.. like deepseek did innovate, but they more so iterated

#

yeah i mean it's kinda wild and seemingly hyperbolic.. but i thnk it could plausibly said that the country dominates AI (/first achieves AGI whatvever that means) will have an advantage across everything, military included

fleet lintel
balmy mist
fleet lintel
balmy mist
#

studio is free

keen beacon
#

although yes, enterprise will definitely be something chinese labs struggle with in the west

balmy mist
keen beacon
#

currently there is a significant chance the US government bans deepseek

#

so it'll be interesting to see if that happens

keen beacon
#

because normally whatever the US government do (annoyingly) much of the west follows

#

like in the case of huawei

balmy mist
#

like providers will stop provding lol

#

what about if i download it locally?

keen beacon
#

well, it'll be illegal for providers to provide deepseek services in the US

#

so they won't if they don't want to break the law

alpine coral
balmy mist
#

why

fleet lintel
keen beacon
alpine coral
keen beacon
#

but if you didn't, downloading deepseek models would also be hard because serving it would be illegal

fleet lintel
balmy mist
#

if i fine tune r2 5 times over, is it the same model?

fleet lintel
alpine coral
#

but the race to AI supremecy i think gives space.. but still.. the current AI tech seems incompatible with a one-party authortarian state that censors everything (or tries to)

balmy mist
#

a company can provide a model that was fine tuned from r2

#

or will that also be banned?

#

and how would they even know it was fine tuned from r2?

fleet lintel
balmy mist
#

idk man, its like the ban on pirating, its illegal but its so common

#

and you cant prove it

#

banning it would just be a political statement

#

nothing more

#

people will still be using their models

#

even big companies

#

cause they will fine tune on it

fleet lintel
balmy mist
#

okay but you cant prove you have a deepseek model tbh, and if a company wants to save money they will use deepseek cheap models, and fine tune on them, models hallucinate a lot so you cant trust what a models says if you try and figure out the base model, and a lot of new companies are being made because of ai and for ai

#

they are most likely going to use deep seek models, and soon become mid level companies

#

its not going to just be individuals, even if it is just individuals, those people can make milliona dollar companies

#

bc of ai

ocean vortex
fleet lintel
#

may be.. but as soon as they become big, they need to pivot to something legal.

I

balmy mist
fleet lintel
#

i think govt ban is unlikely but a ban IMO would undoubtly kill the business in that country.

narrow elbow
ocean vortex
balmy mist
#

but you are right a ban would kill the business bc they would be just copying secrets of deepseek and applying to other models and no longer using deepseek models, just at first to distill the knowledge, and deepseek is opensource to

fleet lintel
brittle tiger
#

4.1 showing impressive here

narrow elbow
#

and US is also going to ban TikTokπŸ₯²

fleet lintel
narrow elbow
#

so sad

ember rapids
#

7 fig comp tho

fleet lintel
balmy mist
#

man if deepseek can launch this week i will be so happy man

#

as soon as yall see news about it please send here πŸ™‚

cedar tide
#

neither cobalt nor apricot are nova premier πŸ₯΄

ocean vortex
# narrow elbow yea, a legal ban just like banning TikTokπŸ˜†

banning anything is a dangerous game and ideally nothing at all should be banned in this context. But at the same time I kinda understand the arguments for banning tiktok. It has been used more than once for political campaigns now of questionable origins and the fact it is controlled by China's government makes it high risk

deep adder
#

country

ocean vortex
#

Imagine an app developed in US or any other western country blowing up in popularity in China. Impossible because all of them are banned by default

narrow elbow
ocean vortex
gentle plinth
narrow elbow
#

No, Americans are smart, you are racist

balmy mist
#

@gentle plinth have you tested this?

#

i just tested it and doesnt i didnt see this

brittle tiger
sage raptor
ocean vortex
#

just tried it and can confirm o4-mini does this πŸ‘€

ember rapids
#

I thought they always did that

brittle tiger
ocean vortex
#

you could even fine-tune your own gpt with their official tools to embed any hidden symbols you choose into it's responses

#

as far as the model is concerned, there's no big difference whichever space it uses anyway, in this case

#

but it saw enough of those special characters to use them some of the time

#

and the rationale is quite solid I think. Most people do not gonna bother to sanitize the text, so this makes it easy to detect the text generated by openai.

balmy mist
rapid merlin
#

This is not really related, but has anyone tried o4 in chatgpt? it seems LOBOTOMIZED into saving space. I asked it five times directly to give me a full script, and it "omits it for brevity", like what?

#

give me the full code
"omitted for brevity" (parts of it commented as that)
hm. did you miss something here? pastes the script it gave me
yes, i missed this, this and this
then what are you doing?? i told you to give me the full code!
"ommited for brevity" (parts of it commented as that)

#

to be fair, the code was indeed pretty long, but the fact it just starts completely ignoring commands is annoying

#

Yes

willow grail
#

which discord server has this bot?

balmy mist
#

well he aint release it yet

#

but it will be paid soon

willow grail
#

who is he @balmy mist πŸ˜„

balmy mist
#

he is a dev that is able to get details on new models

#

he has some tool he built that gives him alerts

#

you think you can built it for us?

upper wolf
#

most models on mobile have a system prompt that tell them to keep it brief

#

chatgpt included

tall summit
#

didn't know that

opaque adder
#

dude when is nightwhisper comng out

#

i'm tired of 2.5 pro

#

mm tbh

#

it is actually good at things like making animations in HTML
and you can use this in thread/banner designs

#

although what i will say is
no ai is ready for coding outside of web development yet

#

well yea very simple

#

ai right now in terms of programming is a front end specalist

#

go on

#

tell me some projects it can do in c++

#

that isnt a simple snake game

small haven
opaque adder
#

lol, you can't even prove a project it did

#

because it can't

opaque adder
small haven
#

ive been limited with o3 and o1 pro already so....

opaque adder
#

im asking projects it has built

small haven
opaque adder
#

not hard

#

yes give me a project

#

4th time asking

small haven
#

gonna waste a deep research req, idgaf

opaque adder
#

no dude
if you are building advanced projects, it simply just won't understand

#

ai isn't there yet

#

lol
go ahead and prompt it to build you a um/km that communicates with eachother

#

wont happen for a good few years

#

usermode
and kernelmode

#

both have interest in one another

#

no

#

no ai will be able do windows internals

#

at a even decent level

#

for a long time

#

yea

#

well thats the truth

#

ai just isnt at that level..

small haven
#

i want o3 pro so bad holy fck

opaque adder
#

well i work around it

#

why so

#

in what way exactly πŸ˜„

#

yes ive used 3.7
used 2.5 pro

tall summit
#

craig you sound like 3.7 or 2.5 pro

opaque adder
#

its okay

tall summit
#

in this current conversation

opaque adder
#

this is jsut the start of ai

#

ai in terms of coding has literally just begun

#

so its expected for it to be mid asf

#

no it isnt xdDD

tall summit
#

99% of devs is crazy

#

how bad do you think devs are

opaque adder
#

this dude..

#

you might aswell just say to people stop learning coding

#

what statement is that

balmy mist
#

that levels game i think

opaque adder
#

πŸ˜‚

#

now its getting obvious with r bait

#

next thing you'll tell me is that your jewish

#

its way easier to learn

#

ai is at the tip of ur hands for help

tall summit
#

because making things is fun

opaque adder
#

ai cant do backend for its life

tall summit
#

people will still become artists

#

well tbh ai art is still terrible

opaque adder
#

just tell people to stop being graphics designers

#

ai is here

tall summit
#

"stop doing codeforces, ai can solve them all"

opaque adder
#

you were tryna make jokes

tall summit
#

bait

opaque adder
#

prety funny if u were to be jewish

tall summit
#

thats the most boring joke of all time

opaque adder
#

exactly

tall summit
#

oh there we go @alpine pasture

#

deleted message moment

opaque adder
#

you have common traits with them

#

restriction of speech

worthy thunder
#

OpenAI-MRCR results on Llama 4: https://x.com/DillonUzar/status/1914415635582607770

  • Llama 4 Scout performs similar to GPT-4.1 Nano at higher context lengths.
  • Llama 4 Maverick is similar to (but slightly underperforms) GPT-4.1 Mini.

I ran these just in case ppl needed it. It's probably not a top priority for people, but sharing nonetheless.

Enjoy.

Update to benchmark setup - Noticed various models had some missing test results due to various server errors returned, or oddities in API outputs. Also some endpoints didn't support candidate outputs, so some models were missing multiple runs to smooth the output. Fixed those and reran most models, and confirmed all tests completed successfully except for those that exceeded model limits. Certain models have seen a decent change in results (see tables). Notably Gemini 2.5 Flash (thinking enabled) seemed to have been lucky with the original results, and now more in-line with what I was expecting.

Grok 3 results should be next, and hopefully ready in a few hours. It's been surprisingly difficult to run them without server timeout errors (almost behaves like some kind of throttling).

Any other models people are interested in?

small haven
#

lmao everyone just shut up

tall summit
#

i look forward to your charts now

small haven
#

we got unskippable ads in discord

tall summit
#

my ass

worthy thunder
# small haven can you benchmark o1 pro

Always the super expensive models with everyone πŸ˜‚
I'll see about adding it to the TODO. Have to budget based on what my company is willing to cover

opaque adder
#

y?

small haven
balmy mist
worthy thunder
# small haven i mean its unlimited (kind of) thru chatgpt

Unfortunately the benchmark sets up a history of chat messages between the user and the LLM before asking the benchmark question, so I'd need the ability to set up what was said both from the user and the model.
Plus ChatGPT has some system instructions and tooling behind the scenes which can impact the results. Too many uncontrolled variables :/

small haven
#

fair enough

native shoreBOT
#

dynoSuccess runo001 has been warned.

rapid merlin
#

cancelled the openai sub for claude, he aced it 0 shot

#

without any laziness

torn mantle
#

its a

#

ss

ocean vortex
#

total size offsets it but not nearly enough

ocean vortex
#

Probably something from HLE or simple-bench test set. o4-mini had big gains over o3-mini there. Though I don’t have anything specific without testing

balmy mist
#

Has anybody used O3 within the API for OpenAI? I'm scared that I might get wrecked by cost.

balmy mist
#

Yo, does anybody have any more o3 requests? I just ran out until like two days and I have a pressing request for my root code. Custom modes, I needed to be consolidated and I want to use o3.

#

Oh, wo I just saw your comment. Actually, I might try it. I'm gonna try it and hopefully I don't get wrecked.

#

grok 3 it's booty, bro, I'ma be honest. Like, for complicated, high-level tasks, even Gemini isn't as good, but I would rather use o3 for really high-level reasoning. Like, nothing is on its level.

#

nahh i need the best of the best lmaoo

#

Oh, that's actually smart. I didn't think about that. Instead of paying $200, you basically paying $60 wow. Or not even $60, maybe just $20. Maybe just $40 all you need. You just rotate. When you run out with that one, you just use the other one. That is genius, bro. You are-- oh my gosh. You're playing next level chess.

#

lol i used super whisper to dictate that so it sounds a bit off lmaoo

#

I turned on the data sharing but I don't see the free tokens. Like I said, it was a message for free tokens. I don't see it after I enabled it.

small haven
#

Share a chatgpt pro with 3 friends >

keen beacon
#

on roblox studio gemini really understands both backend and frontend

#

i'd say it needs more help on frontend because it doesnt fully comprehend 3d shapes and connecting colors to colors etc

#

i always start my prompts with

#

"always use module scripts, start with firstly your module loader, then your modules, create the core aspect of the game highly optimized, keep in mind servers will be full so your script scalability should change, etc"

keen beacon
#

cant u disable it?

small haven
#

Who tf care about memory loool

#

Its more of a gimmick

languid forum
#

Hey everyone, we are xanthorax Ai , Mods please drop me a Pm we want to get bench marked πŸ¦Ήβ€β™‚οΈ

#

does anyone know how to get darkbert if you dont have a non prof org ? am dying to get my hands on it

balmy mist
#

to use o3 in the api yall had to identify your org with the persona app?

#

they doing a lot lol

kind cloud
#

Claybrook always returns an API error now.

balmy mist
keen beacon
kind cloud
#

I think I saw a similar situation before Google released gemini-2.0-flash or something like that.

#

don't trust me

small haven
#

ok yea 50 o3 reqs/week + memory > unlimtied o3 reqs !!

calm sequoia
#

Sadly, the pro didn't got into the benchmarks last time due to price issues. I wonder what would be the ELO.

#

Is there any difference in 2.5 PRO performance with and without subscription?

keen beacon
#

the gemini product has a super long degrading sys prompt i believe

#

u should use it on aistudio

#

it doesnt matter at all on aistudio whether ur subbed or not

ocean vortex
calm sequoia
keen beacon
calm sequoia
#

The free version of chatgpt seems nerfed, that's why I'm asking for gemini

keen beacon
#

than gpt 4.1 on free chatgpt

calm sequoia
#

That's a fact

#

I have subscription on chatgpt though. May switch to gemini if they continue the pace

#

It seems that claybrook likes to draw penises instead of what I'm asking. Very human-like behavior.

keen beacon
#

Lmao

#

what did u ask

#

Lmao 🀣

keen beacon
#

do u have a link

#

im interested

cedar tide
#

Tomay good ?

hardy pecan
#

its ~gemini 2.5 level i felt, like dragontail etc

leaden thunder
#

Is there an API to get leaderboard data?

balmy mist
#

wild its funny they are calling it dreaming

keen beacon
balmy mist
#

yeah i stopped using mini, its a good model but flash is just a better model

#

srry for spam but this is kinda crazy

#

i could be late tho

tall summit
cedar tide
#

I gave Gemini the opportunity to think in the middle of his answers

brittle tiger
calm sequoia
tall summit
# calm sequoia

is this asking "What will the highest ARC-AGI score be at the end of 2025?"

calm sequoia
#

Obviously

tall summit
#

nothing is obvious about it

#

you removed almost all context

#

so i wanted to clarify

calm sequoia
#

Answer "Skibidi" if you lack context

tall summit
#

context as in what the question is saying

#

i know what arc-agi is

oblique flint
#

objectively 15+ should be the most correct answer lol, I mean 90% is still 15+

tall summit
#

lol

tall summit
#

nobody knows what the poll really is about...

brittle tiger
#

Anyone know why epoch never posted 2.5 scores for FrontierMath?

narrow elbow
#

OpenAI Five is also interesting

leaden meteor
#

Are we expecting any new announcment today from Google? I remember seeing someone posting about APRIL 22 placeholder for something...

cedar tide
novel flame
#

Possibly hot take: Gemini 2.5 Pro is very good, but it’s not as good as 3.7 Sonnet or indeed o3. You might have had a better time (and a lighter wallet) with Sonnet

#

Correct. We have magnificently good yappers, but not actual intelligence.

#

I mean, technically you can pay with crypto on OpenRouter

oblique flint
#

you dont have a credit card?

novel flame
#

No argument there; as someone who bought BTC in 2013 the scheme worked out well for me. But you don’t have to buy crypto to speculate or gamble with your life savings, you can also just buy it to pay for goods and services as an alternative to PayPal.

oblique flint
#

soon enough I think all the free goodies will cease to exist. Stuff like google ai studio, cursor, windsurf etc are just burning money right now

oblique flint
#

what is that list of countries lol

#

I am as well, in nl. I got a 'student' credit card a couple years ago just to make subscriptions for stuff like ai services easier. Although atm Im only subbed to cursor

novel flame
#

Me neither. Scandinavia here πŸ‘‹

oblique flint
#

I went from gpt 4 turbo subscription to claude 3 opus / 3.5 sonnet, then I went to cursor since ai studio is good enough for general non coding use, and just having ai integrated in the ide directly makes coding so much faster

novel flame
#

Don’t play much; don’t have too much time between work and kids and secret plans for world domination

#

Played or built πŸ˜‰

balmy mist
#

Is it true that Google is shipping today?

keen beacon
balmy mist
#

lol idk bro, im finding tweet

novel flame
#

I toyed with building something similar to this: https://youtu.be/1dSJ1oIBWCw?si=W01tllbd9lQvlAFU

βš”οΈ Join The Horde

In this video, we'll explore the exciting world of textual games and how they are being evolved into visual novels. We'll also dive into the technical aspects of implementing DALL-E-3, a new AI technology that allows for more realistic ...

β–Ά Play video
balmy mist
#

i been waiting for have SOTA models in games so badly

#

im not buying a game until we have that

#

cant we use deepseek models in games?

fleet lintel
balmy mist
#

dont say that 😦

#

i was so excited

#

i started my morning real good bc of it

fleet lintel
#

i think nothing interesting will be announced by google for couple of weeks..

#

i am waitin for deepseek r2

keen beacon
balmy mist
#

in our dreams

#

im really gonna just get another account

#

so that 100 rpw fr $40

#

thats actually not bad, yall know if we have token limits for o3 on plus?

#

also whats the context window and token limit per message?

unborn ocean
#

Really interesting paper (and also a quick read)

keen beacon
cedar tide
balmy mist
#

quick read is a paragraph my guy

#

but its okay, i got ai to tdlr

balmy mist
#

we should make one for this chat, if we can get 20 people we each put in $10 and make a new gmail that a mod can control or sum

#

this will really test how unlimited the requests are lol

novel flame
cedar tide
unborn ocean
#

read so many papers that reading this takes like 3 min at most (although I am arguably not very thorough)

keen beacon
mellow frigate
#

Are there any benchmark results for gemini 2.5 flash non thinking?

cedar tide
#

Here

mellow frigate
#

Ah nice find

cedar tide
keen fulcrum
#

Any new models this week
r2 and qwen 3?

keen ferry
#

deepseek servers are gonna be dead forever

novel flame
cedar tide
#

You have you classement ?

keen fulcrum
#

It will be quiet a while when google drops gemini 3.

novel flame
cedar tide
harsh flume
#

Little bit of a tangent

#

But I was thinking on how LLama gamed lmarena

#

by optimizing for human stylistic preference, which solved for a lot of emoji use

#

and wondered if the same would then apply let's say on dating apps chats

#

I mean, that a higher emoji style of conversation would gather favor since that can be inferred on human preference when it comes to voting ai

#

hah that's interesting

#

ive often avoided it under the impression that it'd give a 'boyish' feeling to the message instead of a man's one

#

I personally prefer non emoji responses from AI but I don't think at all my preferences fit the norm

#

Yea it's a good point

#

I mean the result speaks for itself

#

I rather disagree on the agreeable part

#

no pun intended lol

#

Yea I know what you mean but from anecdotal experience I see the opposite. There's a fine line

#

Being rather disagreeable can create some tension that then sets up for a release when the tension is diffused and that emotional roller coaster is enticing

#

that's the whole banter play

#

esp in male-female dynamics

#

but I think it extends besides that as well

#

I read the study you sent and the problem is that on face value it doesn't mean much. It could always be that people higher in extroversion would communicate with more emoji and these people would also be the ones naturally engaging in more interpersonal connection

blazing rune
#

@ocean vortex What are your thoughts on GPT-4.1? I'm wondering about nano, Mini, and the full one, but especially the full one

blazing rune
#

because it's decently close to sonnet even in areas that sonnet is best at, at least according to benchmarks

#

and it's considerably cheaper, right?

ocean vortex
#

don't do web design or visuals with it (coding visuals)

#

but other than that it\s great

blazing rune
#

the per token price is a lot better and I think it has a more efficient tokenizer than Claude, but I could be wrong

blazing rune
#

I thought they made it way better at that

keen beacon
ocean vortex
#

it's much improved but still the same size as gpt4o

#

@blazing rune 4.1:

#

3.7 sonnet:

novel flame
#

A quick benchmark of small (cheap) models only. Note: the last 3 tests I only ran on the top models.

I was extremely surprised to see Gemma 3 punch so far above its weight (cost) here.

ocean vortex
#

as for o3-high vs 3.7 sonnet-thinking....

o3-high:

#

3.7-thinking:

#

there are still differences and the winner for this is clear but it looks like reasoning helps quite a bit with openai models

keen beacon
balmy mist
#

dont know fi yall saw yet

keen beacon
zinc ore
#

ARC 1 will soon be saturated

balmy mist
#

weird they didnt do the high versions

thorny drum
#

yeah are they still testing the high versions?

#

or are they just not planning on testing them at all

balmy mist
#

i dont think so

#

but they will test o3 pro tho

#

why not give us an o4-mini pro as well?

thorny drum
#

mini pro 😭

keen beacon
#

my guess is that only o3-pro will get 80%+ on ARC-1

#

and from someone working on the ARC-2 benchmark , he stated that ARC-2 score for o3-pro should be anywhere from 10% to 20%

#

Pretty good considering o3 preview high costed thousands per task

balmy mist
#

i still wanted to be able to play with that version

balmy mist
sage raptor
#

100+ dollars for each question

ocean vortex
#

Like I do not see improvement in spatial reasoning equivalent to this. But they probably hyperfocused exactly on what is being tested there

keen beacon
ocean vortex
#

as for o3-preview... we already knew majority voting and similar systems help a ton in this benchmark which that model was (close to pro). And yeah I completely agree that they were misleading lol

keen beacon
#

They included the train set for o3 preview where you're specifically allowed to do that. It's a small number of questions. They didn't for the released o3

#

It was mainly the compute that got that score

ocean vortex
small haven
#

day 7 and still no o3 pro

ember rapids
keen beacon
#

Per task. The benchmark costed thousands to run tho

novel flame
ocean vortex
keen beacon
ocean vortex
#

then arc-agi and simple-bench notable improvements for o3 as well which is using the new base model. Both of these benchmarks need spatial awareness too, even though in different ways

balmy mist
#

claude is so good

#

like its actually crushing the rest lol

#

but where is our baby NW?

#

no love the true champ

novel flame
#

A quick benchmark of small (cheap)

ocean vortex
keen fulcrum
torn mantle
#

tbh

#

as i said before, o3 felt like a huge leap over other models

#

its really different

balmy mist
#

im about to just buy pro now

#

i was using 4.5 recently and its better than i thought lol

#

and its emotional iq is good

#

been using it for some of my convos with people

#

and way better any other model

#

still does some cringe stuff sometimes but for the most part its valid

#

lmaoo

#

true but if you getting o3 unlimited and o3 pro and unlimited image gen and 4.5

#

and sora

#

even tho sora kinda cheeks

#

$200 is a fair price

#

they are forcing you to pay that at this point

#

i just wish 4.5 was faster

warped estuary
calm sequoia
# calm sequoia
poll_question_text

ARC AGI 2025

victor_answer_votes

5

total_votes

14

victor_answer_id

1

victor_answer_text

90% +

victor_answer_emoji_name

πŸ‘€

small haven
#

at least +1 million ppl have a pro membership, so ur losing against them by not having it

#

its a fact tho

torn mantle
#

@keen beacon what type of reasoning effort does o3 has in lmarena?

#

is it o3-low?

small haven
#

one of them is me

#

ok o3 says 75k; i have been debunked

balmy mist
brittle tiger
small haven
tall summit
#

it doesnt know anything regarding strategy about so many games besides using the internet

keen beacon
keen beacon
elder rapids
#

then I'd use Google

#

what I would buy though

#

it's all the cumulative tools and search things

#

combined with o3

#
  • convenience to use 4.5
elder rapids
viral sky
#

hey all, maybe a weird question, but I just did an arena battle and really liked the output of one of the models and might be interested in using it in one of my projects. after voting, it tells me the model is called "claybrook" but I can't find any model by that name listed anywhere on the leaderboard, on hugging face, or even Google except for a reddit post from 3 days ago referring to it as an "experimental Google model," but not providing any link to more information. does anyone know where to find this?

balmy mist
#

And best image gen is OpenAI

#

And then o3 pro

elder rapids
# balmy mist Not 4.5 or o3

I can successfully mimick 4.5 with 2.5 pro, and 2.5 pro >> o3 for general tasks, no risk of hallucinations either

balmy mist
#

I have been testing the models recently and o3 is very useful if areas Gemini can’t come close

elder rapids
#

i haven't come across this

#

o3 is often way worse in general tasks

balmy mist
#

It’s good at reasoning

elder rapids
#

and when they're closer 2.5 pro seems to gap when you ask it to take a step back, evaluate, and move farther

balmy mist
#

And creativity, if you combo 4.5 and o3 you get very interesting results

elder rapids
#

whereas o3 doesn't seem to really understand beyond its initial comprehension

balmy mist
#

Yeah you can’t

#

Especially with the search and tooling in o3

elder rapids
balmy mist
#

Gemini just doesn’t have that

elder rapids
#

ie, doesn't know the relationship between established facts and progressive inquiry

#

and how to relate that to the current situation

#

I can lmao

#

you don't know what that means

#

go ahead and ask me

#

tho

balmy mist
#

Hallucinations are good imo

#

Good creativity

elder rapids
#

basic established philosophical concepts, or set theory, or category theory

balmy mist
#

Gemini is a good general model that’s it, o3 is just a different type of model

elder rapids
#

axioms

#

it's not external knowledge like literal facts

balmy mist
#

I think you are promoting incorrectly

elder rapids
#

that you think I'm talking about

balmy mist
#

How many tests have you ran?

elder rapids
#

and btw I started this stuff with o1 and 4o

#

not the Gemini family

#

I know their personality very well

#

and I can definitely get a lot out of them

#

but this is why I like the Gemini models so much

balmy mist
#

Do you have plus?

elder rapids
#

ye

balmy mist
#

So you tried with 50 attempts?

elder rapids
#

lmarena too lol

balmy mist
#

On OpenAI platform it’s diff

#

You more than 50 bro

elder rapids
#

?

balmy mist
#

I used 2.5 pro for at least 2000 requests, how can you judge o3 from your factual prompts that only are 50

elder rapids
#

because it's not only 50

balmy mist
#

o3 behaves differently on the app

#

It’s obvious

#

Gemini is a good model, very solid

elder rapids
#

this is exactly the opposite in my case lol

balmy mist
#

But they are just diff

elder rapids
#

ask Gemini to create its own philosophy of design and apply it

balmy mist
#

I guess some people are better with o3 and some are better with Gemini

elder rapids
#

and then ask o3 to create its own philosophy of design and apply it

#

to whatever you think should have sophistication

#

o3 just doesn't comprehend as much as 2.5 pro

balmy mist
#

And also you have memory in chatgpt that’s another reason why using it on their app matters

elder rapids
#

ye memory is fine

#

can't find a use for it tho

balmy mist
#

Memory plus o3 plus tooling is cracked

elder rapids
#

but I know eventually it's gonna be nice

#

but regardless

#

deadass I think 2.5 pro is just better

#

gpt models seem to be capped at their initial presentation

balmy mist
#

I guess that’s your opinion

elder rapids
#

same problem with 4o

balmy mist
#

I like both

elder rapids
#

you cant use the context as much to your advantage

balmy mist
#

It’s worth the price to me and my use case

#

Yes you can

elder rapids
#

and build its personality nearly as much

balmy mist
#

That’s why memory matters lol

#

U gotta get creative with it

elder rapids
#

ye

balmy mist
#

Like I said that’s why memory matters

#

Yes it can

#

You guys are using it differently from me

elder rapids
#

of doing so

#

ye

balmy mist
#

Its damn near like micro fine tuning with the outputs that i am getting

elder rapids
#

baseline o3 is better than 2.5

#

this is a fact

#

but after a few prompts, 2.5 pro can be improved like no other

balmy mist
#

I guess we can agree to disagree, memory is amazing imo and it compliments o3 nicely for me

elder rapids
#

ye, but with such a strong baseline like 2.5 pro, even being slightly weaker than o3

#

can be improved so much more

#

and that's so great

#

like in writing, how Claude can find nuances and balances

elder rapids
#

and can actively iterate through why and when it should do this

#

ye

elder rapids
#

gpt has to be better

#

for now

#

because veo 2 and 2.5 flash and stuff, and then canvas

#

but the tool usage

#

is just

#

godly in chatgpt

#

ye

#

but damn

#

having all that for free

#

for Gemini

balmy mist
#

I was just saying why the $200 is fair based on what they providing

elder rapids
#

ion know about that, deepmind prolly peeking the tool usage

tall summit
elder rapids
#

like apple vs Android

#

Android is just better now, but apple basically stole ts

balmy mist
#

It technically cannot be completely reproduced based on what OpenAI has built in their app, especially it for unlimited usages

elder rapids
#

it's the ecosystem

#

the design

#

the aesthetic of openAI

#

transcends Google

#

ye that's what I mean

#

that's what enables the analogy

balmy mist
#

And what did Gemini score on arc?

#

I don’t even think they tested it

elder rapids
#

def not

#

check any stat

#

ecosystem dependent

elder rapids
#

the more you discuss with 2.5 pro

#

the more you realize the generalizing ability

#

is just so far ahead in it

#

it's crazy

#

makes me feel like deepmind truly understands "general intelligence"

tall summit
#

i mean i asked it like one philosophical question but it found the main crux of it and related it to the actual mainstream philosophical ideas

elder rapids
#

ye this is a consequence of that general ability

tall summit
#

what a game changer for philosophy study because keeping up with every single thing is kind of impossible

#

thats why philosophy changes so much with culture

elder rapids
#

ye but it's like, just pop that bubble of roboticness with a single prompt

tall summit
#

i mean you should still read first party texts if its the kind of texts that benefit from being read

elder rapids
#

whereas for gpt models you have to deadass go step by step

#

how it should act

#

examples for a respond

#

etc etc

balmy mist
#

With a system prompt you can make any model act any way

#

That’s how I do it for Gemini and any other model I use

elder rapids
balmy mist
#

It’s gives you the same behavior

elder rapids
#

this is why I gave the example of 4.5 tho

#

you said 2.5 pro can't act that way

#

I know what you mean

balmy mist
#

I got Gemini to think it was human based on a system prompt

elder rapids
#

but that's why you're wrong

elder rapids
#

philosophy + AI becomes crazy

#

before, AI used to try hard to be inoffensive

#

to philosophical ideas

balmy mist
#

System prompts are the key to llms

elder rapids
#

and not try to touch things

elder rapids
#

given the analogy of a "wall"

#

bubble vs iron wall

#

there's a difference in how easy you can get through to something

#

whether it's any attempt to change at all

#

doesn't matter

#

it's how effortless and what entails that ease of pass

#

if 2.5 pro is the bubble that can be popped, little resistance, and gpt models can be the iron wall, that should be explanatory

balmy mist
#

any model can be popped tho

elder rapids
#

that's not the point

balmy mist
#

thats why pliny can crack them all

#

that is the point, its all about the prompting at the end of the day

elder rapids
balmy mist
#

if you cant get a model to do something it does not mean someone else can tho

elder rapids
#

if it's about prompting, and any model can be prompted to a certain way

#

then its an entirely different discussion here

balmy mist
#

but you are saying you cant get certain behavior from gpt models

balmy mist
#

what is your argument?

elder rapids
#

I can get o3 to act like gpt 4.5, with the help of 2.5 pro prompting it

balmy mist
#

how are system prompts not important?

elder rapids
#

but what if I wanted to get 2.5 pro to act like gpt 4.5

balmy mist
#

you use prompting again

elder rapids
#

yep

balmy mist
#

so why isnt prompting the most importatn thing?

elder rapids
#

no the difference is

balmy mist
#

system prompt is just a type of prompt

elder rapids
#

why do I need 2.5 pro to prompt o3 with me to give enough insight

#

for it to act a certain way

balmy mist
#

bc prompting is a skill

elder rapids
#

no no no

#

but why doesn't it go both ways

#

why can I prompt 2.5 pro to be like 4.5

#

without any help

balmy mist
#

you can tho

#

you are chosing to

elder rapids
#

choosing to what?

balmy mist
#

to not use prompt without help, you can prompt with help and without help, it might be easier with help but the reality it goes both ways

elder rapids
#

thats not what I'm saying tho

balmy mist
#

and its all about how well yo ucan prompt

elder rapids
#

eventually you'll get to a point with any LLM, so that it's favorable

#

but I'm talking about how it receives any quality of information

#

and how much it absorbs it

#

if I essentially need the help of 2.5 pro, to prompt engineer o3 with me, to act like 4.5

but I only need a single prompt for 2.5 pro to act like 4.5, especially WITHOUT the help of anything at all

#

what does that tell you

#

about o3, and about 2.5 pro

balmy mist
#

i mean who cares? the magic is being able to morph and shape it with your prompts

#

i do not want it to be easy

#

i like that prompting is a skill

elder rapids
#

but it is easy, with 2.5 pro

#

and more effective

#

yes?

balmy mist
#

and its not easy to get anything you want from the model

elder rapids
#

it is

#

with 2.5 pro

balmy mist
#

thats gonna be the difference with people making money, it kinda is right now already

#

yeah but thats because you can easily insert a system prompt, its built to be shaped, o3 and openai models are not built for that

#

if that was the case they would allow us to insert system prompts

elder rapids
#

you're acting like that's irrelevant

#

when that's the whole entire discussion

balmy mist
#

im saying why cry about it

#

its still a good model

#

and top teir in reasoning

elder rapids
#

yes

#

but when I can make 2.5 pro better

#

do you still not understand

brittle tiger
#

I can't remember the last time i used a prompt for something important without an llm writing it for me

balmy mist
#

and wen o3 pro drops the $200 will be fair like i said originally

balmy mist
#

i have a system prompt that mkaes prompts or shapes my prompts for me

#

i already understand that talking to llms takes a lil skill

elder rapids
balmy mist
#

like a random person cant hop on any llm(with no experience) and get exactly what they want, usually there is misalignment

#

im saying all of it does

#

o3 pro, 4.5, 4o image, sora(which is ehh) and the ecosystem

elder rapids
#

I'll get more out of a free unlimited 2.5 pro with 1m context, better answers, more intelligence, more flexibility

#

than o3 pro

balmy mist
#

thats bc of how you prompt

elder rapids
#

because, I won't just use the base model

balmy mist
#

its higher reasoning, thats what you are getting

#

gemini is capped at reasoning

elder rapids
#

hn

#

o1 pro is very good at initial context retention

#

by all means

#

but not progressive instructions

balmy mist
#

no matter what you cant force higher reasoning, yeah there are some tricks, but you are not going to be able to mimic o3 pro from gemini no matter how you prompt bro

balmy mist
#

what

elder rapids
#

not improving the reasoning process tho

balmy mist
#

aii show me

#

i wanna learn

#

cause that is worth money

#

and if that is the case, why aren't more people doing that?

elder rapids
#

I'll give an example

balmy mist
#

please, Imma take notes lol

elder rapids
#

let's say we ask a philosophical question

#

or actually nah

#

just a loaded question

#

"why is life painful"

#

what would you expect from the model

#

to respond with

#

"that's a loaded question, it has yadda yadda yadda"

#

right

#

but if you know other problems that you can actively tell it to demonstrate in the initial query

#

"'why is life painful' what kind of category error is this? is this a meaningful question 2.5 pro?"

#

ok now you've shifted it from an unintelligible premise and gave it enough context to respond with the level you're asking it for

#

baseline, it wouldn't have introduced the fact it's a category error to me

#

but informed me it was a loaded question

#

with 2.5 pro in my case, it's so good at improving with these hints and shifting its direction

#

it's able to apply its own developed philosophy/approach to these questions

brittle tiger
#

it's not really a winnable debate. both are better for different things. o3 + tools is nuts for quick research. I get most value from 2.5 for my projects but it's lots of data and does better with good guidance. I think o3 is better for less crafted queries.

#

that's fair but the highest iq people I know wouldn't touch gemini before 2.5 and prefer it over everything now

elder rapids
#

buying chrome? or successfully competing with it

#

they can't buy chrome

balmy mist
#

my thing is why waste time trying to force that out of a model when you can just give it a system prompt prior? i noticed a huge improvement with models whne you apply a specific system prompt prior to guide/ground the convo, i still dont see how that is getting to o3 pro levels like on an arc test? if you prompt it in a certain way trying to get the answer you want defeats the purpose, o3 pro is prob gonna 0 shot prompts without the extra context and fluff

elder rapids
#

that's impossible

#

but ye open AI is becoming a competitor

#

no? chrome isn't a subsidiary

#

lmao

brittle tiger
#

openai would be extremely dumb to pay whatever chrome would cost

elder rapids
#

huh?

#

chrome isn't a subsidiary

#

it's a product division of alphabet

#

Google literally could not allow that

#

it's fundemental to alphabet itself lmao

#

this won't happen

#

deadass, read the specific claims being made

brittle tiger
#

it's the judges call. the DOJ case makes very little sense. it had nothing to do with chrome. but judge could force it eventually years from now after final appeals

elder rapids
#

they never proved it

#

and that likely won't be the main focus anymore

brittle tiger
#

most likely outcome is google doesn't pay $20B a year for search priority anymore

#

to apple

elder rapids
#

and btw openAI wouldn't be able to, since it's not even fully independent

elder rapids
#

and with not much effort myself for tons of gains, I can force 2.5 pro to "gain" intelligence, field specific ofc

#

and that's me building it's reasoning for tasks

#

this isnt with models in general due to what was prioritized in training

#

but just so happens 2.5 pro is just, insanely good at this

#

receiving MY reasoning

#

comes from the product itself

#

didn't choose a side

#

I have plus

#

in chatgpt

#

but goddamn

#

it's crazy

#

ong

#

basically

#

I just don't think Google is gonna take that much losses with this

balmy mist
elder rapids
#

if any at all

elder rapids
ocean vortex
#

ok o3-high is really really good

#

I'm impressed with what it can do crunching numbers manually and decoding, breaking things down and doing it consistently πŸ‘€

balmy mist
elder rapids
balmy mist
#

Wait, isn't R2 supposed to come out this week?

elder rapids
#

rumors

#

if r2 is actually any good

#

I'm gonna be surprised

#

ye

balmy mist
#

so there's no way to use o3 high right now? is o3 high just gonna be pro?

elder rapids
#

especially compared with o3 o4 mini and 2.5 pro

elder rapids
#

you'd be surprised

balmy mist
#

wow

elder rapids
#

o3 pro is gonna be better

balmy mist
#

damn so this whole time i was using o3 medium

brittle tiger
#

love o3 but it's much more accessible than 2.5. I'd bet anything the top 10 researchers at OAI are using 2.5 than than o3

#

they are with 2.5

small haven
#

guys stop complaining about usage and pay $200/mo, ur welcome

ocean vortex
#

even some IT firms / startups give access to their employees to openai org/keys, effectively paying for them

#

openai are behind on model sizing and arch planning (too much on the small side with reasoning and probably too big to make sense with gpt4.5), but they seem to be well ahead on RL training and fine-tuning and even just training in general I think

brittle tiger
ocean vortex
brittle tiger
#

i mean o3 is way more wow factor to majority of ppl. elite ppl in niche fields get more out of 2.5

small haven
#

so o3 can handle less context than o1 pro, weird

ocean vortex
#

then you also have the tools that can help for sure on cgpt website

brittle tiger
#

I really really like o3. all i was saying is i bet best openai ppl are using 2.5 more

#

without context and speed i don't think that would be the case tho

ocean vortex
#

so if you need for it to analyze something exhaustively and do it reliably, I would bet on o3 with price not being an issue

#

more test-time compute (longer outputs) and less likely to hallucinate

#

2.5 pro sometimes does this thing where it simply takes a shortcut and guesses the final answer whichever sounds plausible enough at the time lol

#

but yeah that's not to say it is not close. Still very performant and very close, it's just that I wouldn't bet on it getting things done over o3

warped estuary
#

Can someone confirm that openai pro gives you unlimited api calls to use? For example with codex or cline

balmy mist
#

told*

#

you giving him the gems lmaoo

keen fulcrum
#

No

#

We are moving there anyway and it will be beneficial to have a deeper understanding

#

It will begin with novel discoveries

#

Because in immoral hands its an effective weapon
especially uncensored ones

#

Lets hope humans won't destroy themselves

warped estuary
# balmy mist told*

No one did I'm just trying to determine because I mainly use llms for coding

small haven
brittle tiger
#

U can 5x your money betting on it if you believe that

small haven
brittle tiger
#

How much did you put down?

small haven
#

ur better off betting on whatever meta is releasing, they love overfitting lmarena lmao

wintry tinsel
#

Imagine having a nuclear reactor, some Vietnamese slaves, to work the nuclear reactor, and your own 1 million GPU mega cluster, you could β€œlocally” run O3 an unlimited amount, that’s the life

fringe carbon
#

i mean there is some guy on fiverr right now looking for ppl to vote for o3

#

so like solid bet ig

#

ggs for me if it works out for him

#

could be u ig

balmy mist
#

How?

fringe carbon
# balmy mist How?

he’s simultaneously betting on oai and paying ppl to vote in the arena

small haven
#

i know theres a lmarena intern sneaking in a bet prerelease 😭

fringe carbon
#

pretty easy trade

kind cloud
#

I think claybrook is gone from arena.

kind cloud
#

I've played lmarena for 45 minutes and I've not seen it yet.

small haven
#

top workflow rn is asking o3 for git diffs and o1 pro to apply it in full

keen beacon
golden ocean
#

Any way to make gemini not touch code that i didnt tell it to or thats just how gemini 2.5 is

#

most annoying model ive worked with

keen beacon
#

atm i ask o3 for code and at the top to note down the file path , for example # project/app/scr/main.tsx
then i have a script that copies that output and writes it directly to that file
then i also have ngrok set up so that o3 can visit my site and see the result of what it wrote

that way it writes code, visits site to review, then modifies code again if need be, all in 1 prompt

small haven
calm sequoia
#

o3 mini-high performed better on ARC AGI 2 than the Gemini 2.5 Pro πŸ˜„ What a humiliation

golden ocean
keen beacon
#

lol

small haven
keen beacon
#

atm i want to make it so that o3 can write just the difference, not the whole file, will probably do it via git diff, but still learning how

#

also I need to have some logs for the bash commands that o3 runs in an endpoint. after that maybe magic can happen

#

But i find o3 lazy as hell, once i asked to iterrate getting data from a site until it got the right result
for instance try and get the names of the top models in lmarea, iterate until they are = ['gemini-2.5-pro-exp-03-25',
'chatgpt-4o-latest-20250326',
'grok-3-preview-02-24',
.....

code by o3

try:
some code that didnt work
except:
fallback_list = ['gemini-2.5-pro-exp-03-25',
'chatgpt-4o-latest-20250326',
'grok-3-preview-02-24',
'gpt-4.5-preview-2025-02-27',
'gemini-2.5-flash-preview-04-17',
'gemini-2.0-pro-exp-02-05',
'gemini-2.0-flash-thinking-exp-01-21',
'deepseek-v3-0324',
'deepseek-r1',
'gemini-2.0-flash-001']
return fallback_list

And I thought the code was working, i didnt know this mf cheated

torn mantle
#

cobalt-exp-beta-v6

#

a lot of versions already

calm sequoia
#

Considering the number of models that are being tested as anonymous but not released, the lmarena is going into direction of becoming RLHF platform instead of a benchmark.

alpine coral
#

lol yeah kinda feels a bit like that doesn't it

#

i preferred it as an academic project..

#

now it's a sequoia-backed start up...

kind cloud
#

tomay seems to be connected to the internet

fleet lintel
#

unfortunately there is nothing in Arena right now that excites me.
After NW, all models are just meh

sage raptor
#

Dayhush is not meh

cedar tide
#

grok 3 preview is new to the WebdevArena?

plain zinc
#

I've NEVER had it before.

ornate stump
#

Usually, the image models, like Imagen, they gonna be tested somewhere?

cedar tide
#

But there are just the early version in the leaderboard

keen fulcrum
calm spear
#

current LMArena UI will is going to be fully replaced with new one?

calm spear
#

o3 doesn't look much different in terms of performance from Gemini-2.5 pro exp

balmy mist
#

they raised the usage for o3?

plain zinc
#

Bro @cedar tide, please πŸ™πŸ™πŸ™

#

That's what I needed too.

cedar tide
balmy mist
#

its 50 per day now

#

for plus

#

wym?

#

for students

#

you found a work around for everyone else?

#

not sure, but it should be good at it

#

i really hope they surprise us and release o3 pro this week

#

that would make me so happy

#

bruhh

#

so like 50 per week lol

#

yeah that will be the last time i pay anything for openai

#

this is why google will win

plain zinc
balmy mist
#

you just said they out of gpus

plain zinc
#

Because 2.5 has syntax errors.

balmy mist
#

thats never going to be an issue with google

plain zinc
#

And they need to be fixed somehow.

keen beacon
balmy mist
#

bro its not chatgpt vs gemini, its google vs openai which is what you are missing

#

and you just said openai running out of gpus

sage raptor
#

no

balmy mist
#

google has way more money and are releasing their models for free already

#

and more ppl know about google than openai

#

more peopl trust google then openai

keen beacon
#

chatgpt is ai to most people

balmy mist
#

ask people 40 year old and up

#

they dont know about chatgpt

fleet lintel
#

O3 pro is not on Arena. Atleast I haven't encountered it

balmy mist
#

but they know google

keen beacon
#

4o native image gen was prob much bigger than gemini 2.5 pro to the public

balmy mist
#

what browser are yall on right now?

#

openai is even trying to buy chrome lol

cedar tide
# plain zinc Give it to me, prompt, please

just in the prompt system explain to him exactly that he can think several times in the middle of his response, and explain to him very clearly when to do it how to do it and that's it, if it doesn't work try to improve the prompt each time

balmy mist
#

bc they cant beat google

#

no company can tbh, elon and sam founded openai trying to do that and look at them now? two opposing companies instead of being united against google

#

isnt openai losing money?

keen beacon
#

mindshare doesnt really matter in the end tbh. whoever achieves agi first will matter the most

balmy mist
#

thats why they charge $200?

fleet lintel
#

I think most peolpe dont want to pay for GPT .. only business folks wants to pay

balmy mist
#

agi is subjective

#

agi is based on the time period, to some we already reached agi

#

why do people keep leaving open ai?

fleet lintel
#

If Google AI overview becomes good enough for most queries, why individuals will ever pay for chatgpt or even gemini.google.com
only freelancers and companies will pay for higher quality agentic workflows