#general
1 messages Β· Page 29 of 1
omgg im getting so excited
i'm not completely sure what their plans for it on chat.deepseek.com are
i never wanted another country to win so bad before lol
but it's very possible they finally introduce a subscription
where are you getting this info from?
all i want is that api
he got birdies everywhere
i guess you could say i have some connections
not many at deepseek tho
liike that dude from GOT
mainly just word of mouth
I just hope it wont go in circles as much as R1
it doesn't
imagine if r2 comes out this week lmaoo
part of the training involved making the model better at determining how many reasoning tokens to use
although apparently it's still not as good at that as some other models
deepseek has always been a bit more
brute force-y
you got any info on qwen 3?
nope
unfortunately lmao
i do expect it in the next 2 weeks or so
but that's about it
i aint gonna lie this channel might be one of the leading spaces for ai news lmaoo
i mean, that's probably mostly leo lol
not many?
true lol, but its also the convos we have
yeah again the chinese labs are very strict with this stuff
tbh i'd be surprised if it doesn't outperform
nahh thats pushing it
not many?
well yes i have some but it's very much in the single digits!
dont forget deepseek dont got the same funds as OA
the chinese government is absolutely willing to bankroll them
you'd be surprised
if r2 manages to outperform o3 at a way lower cost, I'd start to believe deepseek is actually the leading AI lab
i'd be super surrprised if r2 > o3
π€
oh yeah i forgot
yeah, that's the main reason
after it crashed US markets they saw the vision lol
the people over at deepseek also have insane work ethic
as you might expect
most of them barely sleep
especially since the release date for R2 was changed to "ASAP"
they should def call it R3 tho

not sure how much this help with research. you absolutely do need clear mind to do research related things.
troll OA a lil
tell a CCP official that
π
sometimes in tech i feel like people mistake passion for being "hardworking" but yeah for DS I'm not surprised lol
depends what you're doing
when you have ai assisting you then you can push your limits a bit more
i imagine a lot of LM development is just throwing stuff at the wall rather than trying to imagine the next breakthrough
true... i do think american companies software engineers are getting lazy and they barely do enough to keep the job (most of them if not all)
they prob have an insanse workflow now, look at google's workflow
also like
the entire CN education system, I think
you live in china right?
nope
btw I wonder, why didnt the other labs adopt deepseeks MLA (or did they?). I guess that was one of the main cost saving measures they implemented right?
have a few friends there
MLA?
but not an expert
They are insanely cracked and I'm not certain it's happening with AI companies yet but western big tech firms don't have the advantage of NSA coming to them saying "hey check out the new methods they're working on across the ocean". Instead us govt is trying to break up big tech. Cooperation with govt is major slept on advantage over there. They're insanely talented regardless tho
ahh okay, i was wondering how their education system treated ai
https://medium.com/data-science/deepseek-v3-explained-1-multi-head-latent-attention-ed6bee2a67c4
from my understanding it allows them to save a lot of memory that would otherwise be needed for k/v cache
i feel like ai development is a lot less conspiratorial than people here seem to think it is
but, the chinese govt is absolutely bankrolling the hell out of AI development rn, from what I've heard
Yeah I don't think they've had any benefits like this yet but it is an advantage they will def get eventually
part of it makes sense to me π . I think all the frontier companies will be doing lot of simialr things
is R2 on LMArena? Are they testing it here?
no
at the very least it's been a big boost for morale, from what i can tell
and a welcome one.. China hasn't bounced back since covid.. it hasn't been firing on all cylinders
this race is a huge opportunity for china
if they win the ai war it will be truly be the beginning of the end for the idea of the US as the leader of the world
but i think it's a false hope of sorts.. like deepseek did innovate, but they more so iterated
yeah i mean it's kinda wild and seemingly hyperbolic.. but i thnk it could plausibly said that the country dominates AI (/first achieves AGI whatvever that means) will have an advantage across everything, military included
Why would US companies use China models?
I think Chinesse companies will win in China and Western companies will win in West
yo this would be nuts:
https://x.com/testingcatalog/status/1914319517175460327
how is it difference from Canvas feature?
studio is free
the US public already willingly use chinese models.. as for US businesses, if a model from china significantly outperforms any american model at tasks they use LLMs for, and match/beat it in value, plenty of companies would make the switch
although yes, enterprise will definitely be something chinese labs struggle with in the west
canva is only available in gemini app right?
currently there is a significant chance the US government bans deepseek
so it'll be interesting to see if that happens
how can they do that?
because normally whatever the US government do (annoyingly) much of the west follows
like in the case of huawei
well, it'll be illegal for providers to provide deepseek services in the US
so they won't if they don't want to break the law
i could see the ccp also banning deepseek lol
true.. may be they are bringing it to AIstudio. not a big deal
if you already have it then there's not much they can do
LLMs are a censor's nightmare
but if you didn't, downloading deepseek models would also be hard because serving it would be illegal
with today's govt, anything can happen
bet as soon as r2 drops i am going to buy a new computer and download it, but what about fine tuned versions of deepseek models? like you cant just ban a model
if i fine tune r2 5 times over, is it the same model?
you kind of can. If govt issues a ban, most medium and large size companies will stop using it. And it doesn't really part if fringe group continues to use it
but the race to AI supremecy i think gives space.. but still.. the current AI tech seems incompatible with a one-party authortarian state that censors everything (or tries to)
but what about the fine tuning stuff i said?
a company can provide a model that was fine tuned from r2
or will that also be banned?
and how would they even know it was fine tuned from r2?
again, fringe groups will continue to do it. but any serious company wont. It's popularity and usage will decrease over time if not instantneaously.
idk man, its like the ban on pirating, its illegal but its so common
and you cant prove it
banning it would just be a political statement
nothing more
people will still be using their models
even big companies
cause they will fine tune on it
companies dont pirate things :). only individuals . For models, individuals are not the main consumers
okay but you cant prove you have a deepseek model tbh, and if a company wants to save money they will use deepseek cheap models, and fine tune on them, models hallucinate a lot so you cant trust what a models says if you try and figure out the base model, and a lot of new companies are being made because of ai and for ai
they are most likely going to use deep seek models, and soon become mid level companies
its not going to just be individuals, even if it is just individuals, those people can make milliona dollar companies
bc of ai
Cause if the model is open source there is no risk of your data leaking. Educated people in west mostly do not give a sht where the model comes from as long as it performs, why should we?
may be.. but as soon as they become big, they need to pivot to something legal.
I
bro lol, they cant prove they are using deepseek tho
what is models are advanced enough that you can program models to only share data in a very secret way? I dont think that's the case today. But these models could do that in future
i think govt ban is unlikely but a ban IMO would undoubtly kill the business in that country.
what are ya on about
yea, a legal ban just like banning TikTokπ
This is not the case for now. If that changes then the approach will be different
but you are right a ban would kill the business bc they would be just copying secrets of deepseek and applying to other models and no longer using deepseek models, just at first to distill the knowledge, and deepseek is opensource to
India banned TikTok like 5 years back some users found way to use it. but slowly it died down.
Currently it is non-existent in India.
Govt ban always results in slow death
4.1 showing impressive here
so Indians come to America in pursuit of freedom and upward mobility, seeking opportunities they may lack in their homeland.
and US is also going to ban TikTokπ₯²
yup. . And for more money π
so sad
Itβs scary how cracked they are
Iβve heard stories
7 fig comp tho
1Million+ ? In China? Wow
man if deepseek can launch this week i will be so happy man
as soon as yall see news about it please send here π
neither cobalt nor apricot are nova premier π₯΄
banning anything is a dangerous game and ideally nothing at all should be banned in this context. But at the same time I kinda understand the arguments for banning tiktok. It has been used more than once for political campaigns now of questionable origins and the fact it is controlled by China's government makes it high risk
country
Imagine an app developed in US or any other western country blowing up in popularity in China. Impossible because all of them are banned by default
concerning national security,MAGA
MAGA itself is a biggest threat to US national security now LOL
No, Americans are smart, you are racist
this is very interesting
@gentle plinth have you tested this?
i just tested it and doesnt i didnt see this
Really doubt this is watermarking and more result of rushing to get model out when they weren't originally planning to after 2.5 surprised. Watermarking text is much more sophisticated
this doesn't just randomly happen. They simply replaced some spaces with "Β " in their dataset
just tried it and can confirm o4-mini does this π
I thought they always did that
I just can't think of a plausible rational. Break a bunch of code to watermark and piss ppl off? Much more elegant ways to do it
that's not how it works, there's no code involved to break. You simply add these symbols throughout your fine-tuning dataset and the model is gonna use them
you could even fine-tune your own gpt with their official tools to embed any hidden symbols you choose into it's responses
as far as the model is concerned, there's no big difference whichever space it uses anyway, in this case
but it saw enough of those special characters to use them some of the time
and the rationale is quite solid I think. Most people do not gonna bother to sanitize the text, so this makes it easy to detect the text generated by openai.
wow codex being open source is really game changing:
https://x.com/dnak0v/status/1914380050931105894
This is not really related, but has anyone tried o4 in chatgpt? it seems LOBOTOMIZED into saving space. I asked it five times directly to give me a full script, and it "omits it for brevity", like what?
give me the full code
"omitted for brevity" (parts of it commented as that)
hm. did you miss something here? pastes the script it gave me
yes, i missed this, this and this
then what are you doing?? i told you to give me the full code!
"ommited for brevity" (parts of it commented as that)
to be fair, the code was indeed pretty long, but the fact it just starts completely ignoring commands is annoying
Yes
which discord server has this bot?
its a paid server
well he aint release it yet
but it will be paid soon
who is he @balmy mist π
he is a dev that is able to get details on new models
he has some tool he built that gives him alerts
you think you can built it for us?
are you on mobile
most models on mobile have a system prompt that tell them to keep it brief
chatgpt included
didn't know that
dude when is nightwhisper comng out
i'm tired of 2.5 pro
mm tbh
it is actually good at things like making animations in HTML
and you can use this in thread/banner designs
although what i will say is
no ai is ready for coding outside of web development yet
well yea very simple
ai right now in terms of programming is a front end specalist
go on
tell me some projects it can do in c++
that isnt a simple snake game
im tired of o1 pro
o1 pro is dogshit
ive been limited with o3 and o1 pro already so....
im asking projects it has built
gonna waste a deep research req, idgaf
no dude
if you are building advanced projects, it simply just won't understand
ai isn't there yet
lol
go ahead and prompt it to build you a um/km that communicates with eachother
wont happen for a good few years
usermode
and kernelmode
both have interest in one another
no
no ai will be able do windows internals
at a even decent level
for a long time
yea
well thats the truth
ai just isnt at that level..
i want o3 pro so bad holy fck
well i work around it
why so
in what way exactly π
yes ive used 3.7
used 2.5 pro
craig you sound like 3.7 or 2.5 pro
its okay
in this current conversation
this is jsut the start of ai
ai in terms of coding has literally just begun
so its expected for it to be mid asf
no it isnt xdDD
this dude..
you might aswell just say to people stop learning coding
what statement is that
that levels game i think
π
now its getting obvious with r bait
next thing you'll tell me is that your jewish
its way easier to learn
ai is at the tip of ur hands for help
woah man
because making things is fun
ai cant do backend for its life
"stop doing codeforces, ai can solve them all"
you were tryna make jokes
bait
prety funny if u were to be jewish
thats the most boring joke of all time
exactly
OpenAI-MRCR results on Llama 4: https://x.com/DillonUzar/status/1914415635582607770
- Llama 4 Scout performs similar to GPT-4.1 Nano at higher context lengths.
- Llama 4 Maverick is similar to (but slightly underperforms) GPT-4.1 Mini.
I ran these just in case ppl needed it. It's probably not a top priority for people, but sharing nonetheless.
Enjoy.
Update to benchmark setup - Noticed various models had some missing test results due to various server errors returned, or oddities in API outputs. Also some endpoints didn't support candidate outputs, so some models were missing multiple runs to smooth the output. Fixed those and reran most models, and confirmed all tests completed successfully except for those that exceeded model limits. Certain models have seen a decent change in results (see tables). Notably Gemini 2.5 Flash (thinking enabled) seemed to have been lucky with the original results, and now more in-line with what I was expecting.
Grok 3 results should be next, and hopefully ready in a few hours. It's been surprisingly difficult to run them without server timeout errors (almost behaves like some kind of throttling).
Any other models people are interested in?
lmao everyone just shut up
i appreciate it
i look forward to your charts now
we got unskippable ads in discord
my ass
can you benchmark o1 pro
Always the super expensive models with everyone π
I'll see about adding it to the TODO. Have to budget based on what my company is willing to cover
y?
Crazy how good 2.5 pro is
i mean its unlimited (kind of) thru chatgpt
he gonna pull up on you
Unfortunately the benchmark sets up a history of chat messages between the user and the LLM before asking the benchmark question, so I'd need the ability to set up what was said both from the user and the model.
Plus ChatGPT has some system instructions and tooling behind the scenes which can impact the results. Too many uncontrolled variables :/
fair enough
runo001 has been warned.
No, i was doing some debugging on pc
cancelled the openai sub for claude, he aced it 0 shot
without any laziness
lmao. This potentially could have something to do with llama4 having so little activated parameters/experts tbh
total size offsets it but not nearly enough
Probably something from HLE or simple-bench test set. o4-mini had big gains over o3-mini there. Though I donβt have anything specific without testing
Has anybody used O3 within the API for OpenAI? I'm scared that I might get wrecked by cost.
Yo, does anybody have any more o3 requests? I just ran out until like two days and I have a pressing request for my root code. Custom modes, I needed to be consolidated and I want to use o3.
Oh, wo I just saw your comment. Actually, I might try it. I'm gonna try it and hopefully I don't get wrecked.
grok 3 it's booty, bro, I'ma be honest. Like, for complicated, high-level tasks, even Gemini isn't as good, but I would rather use o3 for really high-level reasoning. Like, nothing is on its level.
nahh i need the best of the best lmaoo
Oh, that's actually smart. I didn't think about that. Instead of paying $200, you basically paying $60 wow. Or not even $60, maybe just $20. Maybe just $40 all you need. You just rotate. When you run out with that one, you just use the other one. That is genius, bro. You are-- oh my gosh. You're playing next level chess.
lol i used super whisper to dictate that so it sounds a bit off lmaoo
I turned on the data sharing but I don't see the free tokens. Like I said, it was a message for free tokens. I don't see it after I enabled it.
Share a chatgpt pro with 3 friends >
depends
on roblox studio gemini really understands both backend and frontend
i'd say it needs more help on frontend because it doesnt fully comprehend 3d shapes and connecting colors to colors etc
i always start my prompts with
"always use module scripts, start with firstly your module loader, then your modules, create the core aspect of the game highly optimized, keep in mind servers will be full so your script scalability should change, etc"
cant u disable it?
Hey everyone, we are xanthorax Ai , Mods please drop me a Pm we want to get bench marked π¦ΉββοΈ
does anyone know how to get darkbert if you dont have a non prof org ? am dying to get my hands on it
to use o3 in the api yall had to identify your org with the persona app?
they doing a lot lol
Claybrook always returns an API error now.
sad day
i dont think this is a fine tuning thing
I hope Google makes something happens today
I think I saw a similar situation before Google released gemini-2.0-flash or something like that.
don't trust me
ok yea 50 o3 reqs/week + memory > unlimtied o3 reqs !!
Sadly, the pro didn't got into the benchmarks last time due to price issues. I wonder what would be the ELO.
Is there any difference in 2.5 PRO performance with and without subscription?
the gemini product has a super long degrading sys prompt i believe
u should use it on aistudio
it doesnt matter at all on aistudio whether ur subbed or not
it was tested on several benchmarks. On most of them it wasn't impressive lol
You mean the o1-pro? I guess it is for specialized logic tasks which is not what most of the lmarena user's need
on aistudio its free and basically unlimited they train on ur data tho just a disclaimer just in case lol
The free version of chatgpt seems nerfed, that's why I'm asking for gemini
2.5 pro is way better
than gpt 4.1 on free chatgpt
That's a fact
I have subscription on chatgpt though. May switch to gemini if they continue the pace
It seems that claybrook likes to draw penises instead of what I'm asking. Very human-like behavior.
Tomay good ?
its ~gemini 2.5 level i felt, like dragontail etc
Is there an API to get leaderboard data?
wild its funny they are calling it dreaming
4.1 mini seems to have haziness wrt recent events after oct 2023, i guess they didnt curate the pretraining dataset on more recent events for 4.1 mini
yeah i stopped using mini, its a good model but flash is just a better model
srry for spam but this is kinda crazy
i could be late tho
hahahaha what the hell
I gave Gemini the opportunity to think in the middle of his answers
What was prompt to do this?
is this asking "What will the highest ARC-AGI score be at the end of 2025?"
Obviously
it's just 3 words
nothing is obvious about it
you removed almost all context
so i wanted to clarify
Answer "Skibidi" if you lack context
objectively 15+ should be the most correct answer lol, I mean 90% is still 15+
lol
nobody knows what the poll really is about...
Anyone know why epoch never posted 2.5 scores for FrontierMath?
OpenAI Five is also interesting
Are we expecting any new announcment today from Google? I remember seeing someone posting about APRIL 22 placeholder for something...
"leaker as featured in techcrunch"
https://x.com/chatgpt21/status/1914556100906545248?t=kmNaTJqXwI2ROk3y-Q2EoQ&s=19
Possibly hot take: Gemini 2.5 Pro is very good, but itβs not as good as 3.7 Sonnet or indeed o3. You might have had a better time (and a lighter wallet) with Sonnet
Correct. We have magnificently good yappers, but not actual intelligence.
I mean, technically you can pay with crypto on OpenRouter
you dont have a credit card?
No argument there; as someone who bought BTC in 2013 the scheme worked out well for me. But you donβt have to buy crypto to speculate or gamble with your life savings, you can also just buy it to pay for goods and services as an alternative to PayPal.
soon enough I think all the free goodies will cease to exist. Stuff like google ai studio, cursor, windsurf etc are just burning money right now
what is that list of countries lol
I am as well, in nl. I got a 'student' credit card a couple years ago just to make subscriptions for stuff like ai services easier. Although atm Im only subbed to cursor
Me neither. Scandinavia here π
I went from gpt 4 turbo subscription to claude 3 opus / 3.5 sonnet, then I went to cursor since ai studio is good enough for general non coding use, and just having ai integrated in the ide directly makes coding so much faster
Donβt play much; donβt have too much time between work and kids and secret plans for world domination
Played or built π
Is it true that Google is shipping today?
coding model maybe , or image gen
lol idk bro, im finding tweet
I toyed with building something similar to this: https://youtu.be/1dSJ1oIBWCw?si=W01tllbd9lQvlAFU
βοΈ Join The Horde
- Discord: https://discord.com/invite/uFB5YzH9YG
- Github: https://github.com/TheOrcDev
In this video, we'll explore the exciting world of textual games and how they are being evolved into visual novels. We'll also dive into the technical aspects of implementing DALL-E-3, a new AI technology that allows for more realistic ...
i been waiting for have SOTA models in games so badly
im not buying a game until we have that
cant we use deepseek models in games?
Trolling? nothing i have seeen from Logan or other Google folks ?
i think nothing interesting will be announced by google for couple of weeks..
i am waitin for deepseek r2
when do they drop and where ? live stream ?
in our dreams
everybody saying the same thing:
https://x.com/apples_jimmy/status/1914719405801660864
Plus users need higher o3 limits.
Tool use + search with o3 feels like magic for my use cases, coding is still 2.5 pro for me but everything else itβs o3.
im really gonna just get another account
so that 100 rpw fr $40
thats actually not bad, yall know if we have token limits for o3 on plus?
also whats the context window and token limit per message?
i suggest sharing pro plan with 1-2 others, 65-100$
lmaoo quick read
quick read is a paragraph my guy
but its okay, i got ai to tdlr
oh yeah you did say that yesterday
we should make one for this chat, if we can get 20 people we each put in $10 and make a new gmail that a mod can control or sum
this will really test how unlimited the requests are lol
You know, I had completely forgot about Gemma 3 until yesterday, when I did a quick benchmark of the cheapest non-free models on OpenRouter and it came second after Gemini 2.0 Flash!
Have you tried llama Maverick? and gemini 2.5 flash without reasoning?
its like 8 pages of some text in a decent font size
read so many papers that reading this takes like 3 min at most (although I am arguably not very thorough)
Cmon that will get flagged , but 3 people i would say should be in the clear
Are there any benchmark results for gemini 2.5 flash non thinking?
Here
βοΈ
Ah nice find
artificial analysis is also coming
Any new models this week
r2 and qwen 3?
i hope for r2
deepseek servers are gonna be dead forever
I only tested Scout in this one because I was going for the cheapest (though TBF Maverick is cheap enough that it should have been included). Gem2.5 Flash failed on one of the tests so it ended up in 4th place.
2.0 better ?
You have you classement ?
It will be quiet a while when google drops gemini 3.
I'm putting together a sheet now, will share after
It would be good if you would try maverick
Thx
Little bit of a tangent
But I was thinking on how LLama gamed lmarena
by optimizing for human stylistic preference, which solved for a lot of emoji use
and wondered if the same would then apply let's say on dating apps chats
I mean, that a higher emoji style of conversation would gather favor since that can be inferred on human preference when it comes to voting ai
hah that's interesting
ive often avoided it under the impression that it'd give a 'boyish' feeling to the message instead of a man's one
I personally prefer non emoji responses from AI but I don't think at all my preferences fit the norm
Yea it's a good point
I mean the result speaks for itself
I rather disagree on the agreeable part
no pun intended lol
Yea I know what you mean but from anecdotal experience I see the opposite. There's a fine line
Being rather disagreeable can create some tension that then sets up for a release when the tension is diffused and that emotional roller coaster is enticing
that's the whole banter play
esp in male-female dynamics
but I think it extends besides that as well
I read the study you sent and the problem is that on face value it doesn't mean much. It could always be that people higher in extroversion would communicate with more emoji and these people would also be the ones naturally engaging in more interpersonal connection
@ocean vortex What are your thoughts on GPT-4.1? I'm wondering about nano, Mini, and the full one, but especially the full one
full one is very good
because it's decently close to sonnet even in areas that sonnet is best at, at least according to benchmarks
and it's considerably cheaper, right?
don't do web design or visuals with it (coding visuals)
but other than that it\s great
the per token price is a lot better and I think it has a more efficient tokenizer than Claude, but I could be wrong
oh?
I thought they made it way better at that
cute
better but it's still nowhere near 3.7 sonnet in this specific aspect tbh
it's much improved but still the same size as gpt4o
@blazing rune 4.1:
3.7 sonnet:
A quick benchmark of small (cheap) models only. Note: the last 3 tests I only ran on the top models.
I was extremely surprised to see Gemma 3 punch so far above its weight (cost) here.
as for o3-high vs 3.7 sonnet-thinking....
o3-high:
3.7-thinking:
there are still differences and the winner for this is clear but it looks like reasoning helps quite a bit with openai models
love to see self made benchmarks.
dont know fi yall saw yet
hope you can extend to o3 and o4 too
ARC 1 will soon be saturated
weird they didnt do the high versions
yeah are they still testing the high versions?
or are they just not planning on testing them at all
i dont think so
but they will test o3 pro tho
why not give us an o4-mini pro as well?
mini pro π
OpenAI is so full of bs in their marketing. They showed the o3-preview benchmarks only to deliver much weaker o3
my guess is that only o3-pro will get 80%+ on ARC-1
and from someone working on the ARC-2 benchmark , he stated that ARC-2 score for o3-pro should be anywhere from 10% to 20%
Pretty good considering o3 preview high costed thousands per task
Clarifying o3βs ARC-AGI Performance
OpenAI has confirmed:
* The released o3 is a different model from what we tested in December 2024
* All released o3 compute tiers are smaller than the version we tested
* The released o3 was not trained on ARC-AGI data, not even the train
i still wanted to be able to play with that version
hopefully
100+ dollars for each question
Ya ik
I'm kinda surprised it scored as high as it did to be completely honest
Like I do not see improvement in spatial reasoning equivalent to this. But they probably hyperfocused exactly on what is being tested there
They didn't really train for it tho
as for o3-preview... we already knew majority voting and similar systems help a ton in this benchmark which that model was (close to pro). And yeah I completely agree that they were misleading lol
They included the train set for o3 preview where you're specifically allowed to do that. It's a small number of questions. They didn't for the released o3
It was mainly the compute that got that score
I think it's safe to say everyone includes everything in training data they can get away with. But they do not overfit for everything though - that wouldn't be possible
day 7 and still no o3 pro
not bad given the cost reduction lmao. i wonder how o3 pro will do
It wasn't thousands for v1
Per task. The benchmark costed thousands to run tho
I tried Maverick and it scored 15, a joint 7th spot, despite being the most expensive model of them all. Maverick hallucinates a LOT.
I probably said this a bit too strong looking more into it now... it's somewhat worse but not "nowhere near" anymore, quite significant improvements in webdev arena:
I read somewhere that ARC AGI V2 costed thousands per task on o3 high preview tho. Confused it with that π€
then arc-agi and simple-bench notable improvements for o3 as well which is using the new base model. Both of these benchmarks need spatial awareness too, even though in different ways
claude is so good
like its actually crushing the rest lol
but where is our baby NW?
no love the true champ
A quick benchmark of small (cheap)
it's the best for spatial awareness and relating coding tasks, but other than that... it's behind competition for most other tasks (including coding not based on visuals) tbh
It would be great if it gets access to supabase
pretty good
tbh
as i said before, o3 felt like a huge leap over other models
its really different
im about to just buy pro now
i was using 4.5 recently and its better than i thought lol
and its emotional iq is good
been using it for some of my convos with people
and way better any other model
still does some cringe stuff sometimes but for the most part its valid
lmaoo
true but if you getting o3 unlimited and o3 pro and unlimited image gen and 4.5
and sora
even tho sora kinda cheeks
$200 is a fair price
they are forcing you to pay that at this point
i just wish 4.5 was faster
Is that with the o3 api ? For coding with cline for example
ARC AGI 2025
5
14
1
90% +
π
at least +1 million ppl have a pro membership, so ur losing against them by not having it
its a fact tho
I wonder how many will sign up for the 20k per month offering
maybe me, if im sharing the acc with 100 other people
it doesnt know anything regarding strategy about so many games besides using the internet
how many companies ..
I assume it's set to the default which is medium but I'm not sure
I don't think it's a fair price, if I can get 2.5 pro unlimited, the best video gen currently for free for limited requests, unlimited image gen (although not as good), and an AI like 2.5 that can imitate 4.5, all for free
then I'd use Google
what I would buy though
it's all the cumulative tools and search things
combined with o3
- convenience to use 4.5
if it's not specified I'm pretty sure its gonna be medium
hey all, maybe a weird question, but I just did an arena battle and really liked the output of one of the models and might be interested in using it in one of my projects. after voting, it tells me the model is called "claybrook" but I can't find any model by that name listed anywhere on the leaderboard, on hugging face, or even Google except for a reddit post from 3 days ago referring to it as an "experimental Google model," but not providing any link to more information. does anyone know where to find this?
Not 4.5 or o3
And best image gen is OpenAI
And then o3 pro
I can successfully mimick 4.5 with 2.5 pro, and 2.5 pro >> o3 for general tasks, no risk of hallucinations either
I have been testing the models recently and o3 is very useful if areas Gemini canβt come close
Itβs good at reasoning
and when they're closer 2.5 pro seems to gap when you ask it to take a step back, evaluate, and move farther
And creativity, if you combo 4.5 and o3 you get very interesting results
whereas o3 doesn't seem to really understand beyond its initial comprehension
ye but rarely substantive in my case, 4.5 struggles with genuine philosophical insight
Gemini just doesnβt have that
ie, doesn't know the relationship between established facts and progressive inquiry
and how to relate that to the current situation
I can lmao
you don't know what that means
go ahead and ask me
tho
basic established philosophical concepts, or set theory, or category theory
Gemini is a good general model thatβs it, o3 is just a different type of model
I think you are promoting incorrectly
that you think I'm talking about
How many tests have you ran?
as much as finding any reason to analyze a chat or passage, so a lot
and btw I started this stuff with o1 and 4o
not the Gemini family
I know their personality very well
and I can definitely get a lot out of them
but this is why I like the Gemini models so much
Do you have plus?
ye
So you tried with 50 attempts?
lmarena too lol
?
I used 2.5 pro for at least 2000 requests, how can you judge o3 from your factual prompts that only are 50
because it's not only 50
o3 behaves differently on the app
Itβs obvious
Gemini is a good model, very solid
this is exactly the opposite in my case lol
But they are just diff
ask Gemini to create its own philosophy of design and apply it
I guess some people are better with o3 and some are better with Gemini
and then ask o3 to create its own philosophy of design and apply it
to whatever you think should have sophistication
o3 just doesn't comprehend as much as 2.5 pro
And also you have memory in chatgpt thatβs another reason why using it on their app matters
Memory plus o3 plus tooling is cracked
but I know eventually it's gonna be nice
but regardless
deadass I think 2.5 pro is just better
gpt models seem to be capped at their initial presentation
I guess thatβs your opinion
same problem with 4o
I like both
you cant use the context as much to your advantage
and build its personality nearly as much
Like I said thatβs why memory matters
Yes it can
You guys are using it differently from me
how is that relevant to its capacity
of doing so
ye
Its damn near like micro fine tuning with the outputs that i am getting
baseline o3 is better than 2.5
this is a fact
but after a few prompts, 2.5 pro can be improved like no other
I guess we can agree to disagree, memory is amazing imo and it compliments o3 nicely for me
ye, but with such a strong baseline like 2.5 pro, even being slightly weaker than o3
can be improved so much more
and that's so great
like in writing, how Claude can find nuances and balances
2.5 pro sees this
and can actively iterate through why and when it should do this
ye
I think as an ecosystem
gpt has to be better
for now
because veo 2 and 2.5 flash and stuff, and then canvas
but the tool usage
is just
godly in chatgpt
ye
but damn
having all that for free
for Gemini
I was just saying why the $200 is fair based on what they providing
ion know about that, deepmind prolly peeking the tool usage
gemini 2.5 is shockingly good at this
It technically cannot be completely reproduced based on what OpenAI has built in their app, especially it for unlimited usages
it's the ecosystem
the design
the aesthetic of openAI
transcends Google
ye that's what I mean
that's what enables the analogy
yep
the more you discuss with 2.5 pro
the more you realize the generalizing ability
is just so far ahead in it
it's crazy
makes me feel like deepmind truly understands "general intelligence"
i mean i asked it like one philosophical question but it found the main crux of it and related it to the actual mainstream philosophical ideas
ye this is a consequence of that general ability
what a game changer for philosophy study because keeping up with every single thing is kind of impossible
thats why philosophy changes so much with culture
ye but it's like, just pop that bubble of roboticness with a single prompt
i mean you should still read first party texts if its the kind of texts that benefit from being read
whereas for gpt models you have to deadass go step by step
how it should act
examples for a respond
etc etc
With a system prompt you can make any model act any way
Thatβs how I do it for Gemini and any other model I use
not what I mean
Itβs gives you the same behavior
this is why I gave the example of 4.5 tho
you said 2.5 pro can't act that way
I know what you mean
I got Gemini to think it was human based on a system prompt
but that's why you're wrong
ye
philosophy + AI becomes crazy
before, AI used to try hard to be inoffensive
to philosophical ideas
System prompts are the key to llms
and not try to touch things
let me explain what I mean
given the analogy of a "wall"
bubble vs iron wall
there's a difference in how easy you can get through to something
whether it's any attempt to change at all
doesn't matter
it's how effortless and what entails that ease of pass
if 2.5 pro is the bubble that can be popped, little resistance, and gpt models can be the iron wall, that should be explanatory
any model can be popped tho
that's not the point
thats why pliny can crack them all
that is the point, its all about the prompting at the end of the day
yeah, so that's not the point lmao
if you cant get a model to do something it does not mean someone else can tho
if it's about prompting, and any model can be prompted to a certain way
then its an entirely different discussion here
but you are saying you cant get certain behavior from gpt models
what is your argument?
I can get o3 to act like gpt 4.5, with the help of 2.5 pro prompting it
how are system prompts not important?
but what if I wanted to get 2.5 pro to act like gpt 4.5
you use prompting again
yep
so why isnt prompting the most importatn thing?
no the difference is
system prompt is just a type of prompt
why do I need 2.5 pro to prompt o3 with me to give enough insight
for it to act a certain way
bc prompting is a skill
no no no
but why doesn't it go both ways
why can I prompt 2.5 pro to be like 4.5
without any help
choosing to what?
to not use prompt without help, you can prompt with help and without help, it might be easier with help but the reality it goes both ways
thats not what I'm saying tho
and its all about how well yo ucan prompt
eventually you'll get to a point with any LLM, so that it's favorable
but I'm talking about how it receives any quality of information
and how much it absorbs it
if I essentially need the help of 2.5 pro, to prompt engineer o3 with me, to act like 4.5
but I only need a single prompt for 2.5 pro to act like 4.5, especially WITHOUT the help of anything at all
what does that tell you
about o3, and about 2.5 pro
i mean who cares? the magic is being able to morph and shape it with your prompts
i do not want it to be easy
i like that prompting is a skill
and its not easy to get anything you want from the model
thats gonna be the difference with people making money, it kinda is right now already
yeah but thats because you can easily insert a system prompt, its built to be shaped, o3 and openai models are not built for that
if that was the case they would allow us to insert system prompts
yeah but that's my premise
you're acting like that's irrelevant
when that's the whole entire discussion
I can't remember the last time i used a prompt for something important without an llm writing it for me
and wen o3 pro drops the $200 will be fair like i said originally
fr
i have a system prompt that mkaes prompts or shapes my prompts for me
i already understand that talking to llms takes a lil skill
I disagree because that's not what makes pro valuable
like a random person cant hop on any llm(with no experience) and get exactly what they want, usually there is misalignment
im saying all of it does
o3 pro, 4.5, 4o image, sora(which is ehh) and the ecosystem
I'll get more out of a free unlimited 2.5 pro with 1m context, better answers, more intelligence, more flexibility
than o3 pro
thats bc of how you prompt
because, I won't just use the base model
hn
o1 pro is very good at initial context retention
by all means
but not progressive instructions
no matter what you cant force higher reasoning, yeah there are some tricks, but you are not going to be able to mimic o3 pro from gemini no matter how you prompt bro
deadass you can
what
not improving the reasoning process tho
aii show me
i wanna learn
cause that is worth money
and if that is the case, why aren't more people doing that?
I'll give an example
please, Imma take notes lol
let's say we ask a philosophical question
or actually nah
just a loaded question
"why is life painful"
what would you expect from the model
to respond with
"that's a loaded question, it has yadda yadda yadda"
right
but if you know other problems that you can actively tell it to demonstrate in the initial query
"'why is life painful' what kind of category error is this? is this a meaningful question 2.5 pro?"
ok now you've shifted it from an unintelligible premise and gave it enough context to respond with the level you're asking it for
baseline, it wouldn't have introduced the fact it's a category error to me
but informed me it was a loaded question
with 2.5 pro in my case, it's so good at improving with these hints and shifting its direction
it's able to apply its own developed philosophy/approach to these questions
it's not really a winnable debate. both are better for different things. o3 + tools is nuts for quick research. I get most value from 2.5 for my projects but it's lots of data and does better with good guidance. I think o3 is better for less crafted queries.
that's fair but the highest iq people I know wouldn't touch gemini before 2.5 and prefer it over everything now
my thing is why waste time trying to force that out of a model when you can just give it a system prompt prior? i noticed a huge improvement with models whne you apply a specific system prompt prior to guide/ground the convo, i still dont see how that is getting to o3 pro levels like on an arc test? if you prompt it in a certain way trying to get the answer you want defeats the purpose, o3 pro is prob gonna 0 shot prompts without the extra context and fluff
that's impossible
but ye open AI is becoming a competitor
no? chrome isn't a subsidiary
lmao
openai would be extremely dumb to pay whatever chrome would cost
huh?
chrome isn't a subsidiary
it's a product division of alphabet
Google literally could not allow that
it's fundemental to alphabet itself lmao
this won't happen
deadass, read the specific claims being made
it's the judges call. the DOJ case makes very little sense. it had nothing to do with chrome. but judge could force it eventually years from now after final appeals
most likely outcome is google doesn't pay $20B a year for search priority anymore
to apple
and btw openAI wouldn't be able to, since it's not even fully independent
take what I said for literally anything you ask it, o3 pro is going to eventually build the sufficient reasoning to do tasks, that's inherent to the reasoning process itself
and with not much effort myself for tons of gains, I can force 2.5 pro to "gain" intelligence, field specific ofc
and that's me building it's reasoning for tasks
this isnt with models in general due to what was prioritized in training
but just so happens 2.5 pro is just, insanely good at this
receiving MY reasoning
comes from the product itself
didn't choose a side
I have plus
in chatgpt
but goddamn
it's crazy
ong
basically
I just don't think Google is gonna take that much losses with this
i see what you mean, maybe this is google's niche, we are going to have multiple models in end game, so maybe we will have a usecase for each of their models, wether it be gpt5, claude 4, or gemini 3
if any at all
gpt 5, Claude 4 and Gemini 3 is just so cold
ok o3-high is really really good
I'm impressed with what it can do crunching numbers manually and decoding, breaking things down and doing it consistently π
which o3 level is in the ChatGPT app?
medium
yep
Wait, isn't R2 supposed to come out this week?
so there's no way to use o3 high right now? is o3 high just gonna be pro?
especially compared with o3 o4 mini and 2.5 pro
API i believe
if you used o3 pro with expectations of o3 high
you'd be surprised
only API
wow
o3 pro is gonna be better
damn so this whole time i was using o3 medium
love o3 but it's much more accessible than 2.5. I'd bet anything the top 10 researchers at OAI are using 2.5 than than o3
they are with 2.5
guys stop complaining about usage and pay $200/mo, ur welcome
All OpenAI employees have grants for free usage I'm fairly sure lol
even some IT firms / startups give access to their employees to openai org/keys, effectively paying for them
openai are behind on model sizing and arch planning (too much on the small side with reasoning and probably too big to make sense with gpt4.5), but they seem to be well ahead on RL training and fine-tuning and even just training in general I think
I dont follow. Pricing isn't part of the convo when talking about model preference of the best ai researchers in the world.
then what did you mean by "more accesible"? I assumed you made a typo too and meant to say 2.5 is more accessible
i mean o3 is way more wow factor to majority of ppl. elite ppl in niche fields get more out of 2.5
so o3 can handle less context than o1 pro, weird
o3 has some needless hype to it for sure, and me personally I wasn't all that impressed with o3 on chatgpt at first. But if you look at o3-high it really is much better at reasoning and breaking things down / doing a lot - more so than 2.5. Even o3-medium is marginally better than 2.5 if you just look at their metrics... It has some flaws but overall it is a slightly better model
then you also have the tools that can help for sure on cgpt website
I really really like o3. all i was saying is i bet best openai ppl are using 2.5 more
without context and speed i don't think that would be the case tho
I mean I don't see how you can get more out of 2.5. For niche specific things sure, but generally speaking... it outputs less than o3
so if you need for it to analyze something exhaustively and do it reliably, I would bet on o3 with price not being an issue
more test-time compute (longer outputs) and less likely to hallucinate
2.5 pro sometimes does this thing where it simply takes a shortcut and guesses the final answer whichever sounds plausible enough at the time lol
but yeah that's not to say it is not close. Still very performant and very close, it's just that I wouldn't bet on it getting things done over o3
Can someone confirm that openai pro gives you unlimited api calls to use? For example with codex or cline
who tols you this?
told*
you giving him the gems lmaoo
No
We are moving there anyway and it will be beneficial to have a deeper understanding
It will begin with novel discoveries
Because in immoral hands its an effective weapon
especially uncensored ones
Lets hope humans won't destroy themselves
No one did I'm just trying to determine because I mainly use llms for coding
bruh loool they surely are using a year ahead internal model as we speak, at least o4 pro
year ahead seems a bit much
U can 5x your money betting on it if you believe that
deep research was started last year and got released two months ago
How much did you put down?
ur better off betting on whatever meta is releasing, they love overfitting lmarena lmao
Imagine having a nuclear reactor, some Vietnamese slaves, to work the nuclear reactor, and your own 1 million GPU mega cluster, you could βlocallyβ run O3 an unlimited amount, thatβs the life
i mean there is some guy on fiverr right now looking for ppl to vote for o3
so like solid bet ig
ggs for me if it works out for him
could be u ig
How?
heβs simultaneously betting on oai and paying ppl to vote in the arena
i know theres a lmarena intern sneaking in a bet prerelease π
pretty easy trade
lmaooo
LOL
I think claybrook is gone from arena.
I've played lmarena for 45 minutes and I've not seen it yet.
top workflow rn is asking o3 for git diffs and o1 pro to apply it in full
im curious about this ..
Any way to make gemini not touch code that i didnt tell it to or thats just how gemini 2.5 is
most annoying model ive worked with
atm i ask o3 for code and at the top to note down the file path , for example # project/app/scr/main.tsx
then i have a script that copies that output and writes it directly to that file
then i also have ngrok set up so that o3 can visit my site and see the result of what it wrote
that way it writes code, visits site to review, then modifies code again if need be, all in 1 prompt
actually; o3 for git diffs, apply with gemini 2.5 in cursor (as a test) while waiting for o1 pro to craft the fully applied file.
o3 mini-high performed better on ARC AGI 2 than the Gemini 2.5 Pro π What a humiliation
lol
thats lowkey op for frontend esp for debugging, wonder if it checks the site code content or just text.. does it
ive made endpoitns for it to view existing code, for example site/app/src/main returns the code for main.tsx
atm i want to make it so that o3 can write just the difference, not the whole file, will probably do it via git diff, but still learning how
also I need to have some logs for the bash commands that o3 runs in an endpoint. after that maybe magic can happen
But i find o3 lazy as hell, once i asked to iterrate getting data from a site until it got the right result
for instance try and get the names of the top models in lmarea, iterate until they are = ['gemini-2.5-pro-exp-03-25',
'chatgpt-4o-latest-20250326',
'grok-3-preview-02-24',
.....
code by o3
try:
some code that didnt work
except:
fallback_list = ['gemini-2.5-pro-exp-03-25',
'chatgpt-4o-latest-20250326',
'grok-3-preview-02-24',
'gpt-4.5-preview-2025-02-27',
'gemini-2.5-flash-preview-04-17',
'gemini-2.0-pro-exp-02-05',
'gemini-2.0-flash-thinking-exp-01-21',
'deepseek-v3-0324',
'deepseek-r1',
'gemini-2.0-flash-001']
return fallback_list
And I thought the code was working, i didnt know this mf cheated
Considering the number of models that are being tested as anonymous but not released, the lmarena is going into direction of becoming RLHF platform instead of a benchmark.
lol yeah kinda feels a bit like that doesn't it
i preferred it as an academic project..
now it's a sequoia-backed start up...
tomay seems to be connected to the internet
looks like it
unfortunately there is nothing in Arena right now that excites me.
After NW, all models are just meh
Dayhush is not meh
grok 3 preview is new to the WebdevArena?
Has he started appearing more often?
I've NEVER had it before.
Usually, the image models, like Imagen, they gonna be tested somewhere?
But there are just the early version in the leaderboard
current LMArena UI will is going to be fully replaced with new one?
NW?
o3 doesn't look much different in terms of performance from Gemini-2.5 pro exp
they raised the usage for o3?
Give it to me, prompt, please
Bro @cedar tide, please πππ
That's what I needed too.
one it doesn't always work and two I haven't noticed any improvement in performance and three what would your use be?
its 50 per day now
for plus
wym?
for students
you found a work around for everyone else?
not sure, but it should be good at it
i really hope they surprise us and release o3 pro this week
that would make me so happy
bruhh
so like 50 per week lol
yeah that will be the last time i pay anything for openai
this is why google will win
For example, so that the model immediately analyzes exactly when writing the code, and not when it is finished.
you just said they out of gpus
Because 2.5 has syntax errors.
thats never going to be an issue with google
And they need to be fixed somehow.
its out of distribution it will probably harm performance. but it isnt hard to do. claude is way more fun since u can actually break the chat template
bro its not chatgpt vs gemini, its google vs openai which is what you are missing
and you just said openai running out of gpus
no
google has way more money and are releasing their models for free already
and more ppl know about google than openai
more peopl trust google then openai
chatgpt is ai to most people
O3 pro is not on Arena. Atleast I haven't encountered it
but they know google
4o native image gen was prob much bigger than gemini 2.5 pro to the public
just in the prompt system explain to him exactly that he can think several times in the middle of his response, and explain to him very clearly when to do it how to do it and that's it, if it doesn't work try to improve the prompt each time
bc they cant beat google
no company can tbh, elon and sam founded openai trying to do that and look at them now? two opposing companies instead of being united against google
isnt openai losing money?
mindshare doesnt really matter in the end tbh. whoever achieves agi first will matter the most
thats why they charge $200?
I think most peolpe dont want to pay for GPT .. only business folks wants to pay
agi is subjective
agi is based on the time period, to some we already reached agi
why do people keep leaving open ai?
If Google AI overview becomes good enough for most queries, why individuals will ever pay for chatgpt or even gemini.google.com
only freelancers and companies will pay for higher quality agentic workflows