#general
1 messages · Page 46 of 1
I think an issue for xAI is that they don't have enough people to do deep R&D, hill climbing, and the mundane compliance and administrative stuff at the same time as well as its competitors
but not a lot of people will be able to test it
like o1 pro, theres no reason to talk about it
125 first 3 months isn’t bad
It’s one day of work
They might be able to get a lot from brute force hardware, but they wouldn't have enough people leftover to hill climb across everything
Similar to Anthropic
I remember the inaccurate leaked specs of Grok 3.5
It sounds too optimistic
Especially now we know it would beat Opus 4
someone on this discord made fake benchmarks then elon rt'd it (and subsequently deleted it later)
Too good to be true considering retrospective comparing with Opus specs
opus wasnt a fresh pretrain, it seems like a last ditch effort to salvage opus 3.5. but just speculating
I don’t doubt xAI could outrun Anthropic but the number are implausible
Honestly only with proprietary tech they could troll us like this if true
But Opus is too emergent
i dont understand?
prolly in two weeks
So you all don’t think anything will beat out Gemini 2.5 pro on the leaderboard by the end of the month?
the ga revision probably not
probably very unlikely
claude models dont usually perform that well in the arena either
no def not
I mean if it is salvaged 3.5 it will be quite a troll
Only possible under proprietary environment
the gap is MASSIVE
i really don't know what to make of sonnet 4 tho..
I think o4 full sized might come firsthand
yet everyone just a couple months ago swore Google never benchmark maxed
along with anthropic
I also mean for lmarena
o3 was probably done more recently than you think
it doesn't seem like they're benchmark maxing
yeah
likely
their cpt wasnt done yet probably
the cpt was a company wide effort i think
4o image gen is on the new 4.1 base model, it would seem
very likely 4.1 mini
chatgpt 4o latest uses the 4.1 base model and predated 4.1
since jan as public api snapshots
redsword seems to fail a lot when using threejs
chatgpt in dec with some tests
@alpine coral what base model do you think they used otherwise lol
i think it's a derivative of o4 - so whatever o4 is based on
they arent doing a fresh pretrain for that
Btw is Grok 3.5 still a vaporware
for a smaller model they likely are
Grok 3.5 better than claude 4 sonnet
wow o4 dropped 29 simpleqa points
it just does not make sense
But it is unreleased
seriously the naming is arbitrary
i disagree
But then didn’t arguably the fake bench derail the 3.5 release?
but perhaps i'm misstaken
you think o4 mini absolutely necessitates o4, but it doesn't. but that doesn't mean o4 doesn't exist
i think o4 exists
plain and simple lol
yeah i ofc don't know
o4 mini existing doesnt necessitate it
precedent would at least be strongly indicativate
O5 will be great, tbd at end of the year
past precedent of ai companies naming their models whatever is strongly indicative too
There is Google DeepMind
Yes prior to Gemini, Google Brain and DeepMind published some wild percentage of all NeurIPS papers. 80-90% of the seminal AI research papers from past decade came from those 2 organizations
.
And I will hype up grok 4
what about grok 5
Tbd
apart from perhaps from grok i'm not really sure which you're referring to (in terms of the mini variants not being downsized/distilled descendants of the 'full' model)
ChatGPT could have been created a decade ago, technology wasn't advanced for a purpose
A decade ago was 2015. The deep learning research necessary to create it had not been published yet
Attention is All You Need was published by Google in 2017
That's where transformers originated
Indeed, there wasn't moved as much R&D capital as needed
1.5 pro was a fresh pretrain iirc. 2.0 pro was a fresh pretrain. 2.5 pro is a cpt/etc.
claude 3 sonnet (fresh pretrain) -> claude 3.5 sonnet (fresh pretrain, not the same size compared to claude 3 sonnet as it was increased, there was a piece of anthropic media that stated this/i don't have it anymore) -> claude 4 sonnet (cpt, timeline is too short for a fresh pretrain imho, even more likely the case for opus 3.5. semianalysis reported about 3.5 opus's existence too, i think they salvaged 3.5 opus.)
4o -> 4.5 -> 4.1, despite chatgpt 4o (jan+) = 4.1 lol
4o image gen being on 4.1 (not too sure, as there could be additional post processing)
i hope i dont have to list more because they are wack
gemini 1.5 flash -> gemini 2.0 flash lite (probably same size, something about it in a google blog somewhere)
new model size -> gemini 2.0 flash
i was about to say "if you were talking about 1.5 pro and 1.5 flash i'd understand it a bit more " ha
so what was it based on before 4o?
i mean i'm sure this isn't perfectly accurate, but it conveys where i'm coming from
yeah thats just WAY too many assumptions
again, fair enough 🙂
imho, they could be true, but its just way too much
if you look at 4.1 (they spent several months on this, they'll probably use this for the foreseeable future. mid train of 4o, confirmed outright in a podcast) and 4.1 mini (recent fresh pretrain, you can hear openai employees talk about this). they would try to squeeze out as much api value as they can (e.g. instruct models), so i doubt they have actual different internal models (that don't originate from them two) that they would specifically use for o series
this is the most obvious thing about it. then additionally via pretraining probing, it seems highly likely that o4 mini uses 4.1 mini on a base model, you might be able to actually prove it tbh but worthless endeavor
i dunno man.. but i find it weird to think o4-mini is based of 4.1-mini
i just find it more likely that o4-mini is based of / distilled from o4 in the same way o3-mini from o3, and o1-mini from o1.. (and similiarly for 4o-mini, it's a distilliation of 4o)
maybe if they didnt decide to retrain o3 lol
afaik that relationship between the preceding o series isn't disputed
i'm just going by the past..
not cpt dates
youre making it a rule when it isnt really a rule tbh
the naming is mostly marketing
the development cycle matches with o3, if they were to distill anything it would be o3. they mightve have done preliminary stuff with o4 etc but i dont see it being the case that they couldve distilled o4 already
i would generally agree with this statement
try to explain it otherwise, it just doesnt make sense. if they used the same base model as o3 or o4, surely, the simpleqa shouldn't have dropped that hard
i honestly don't know.. it's hard to talk about benchmarks when one of models is unreleased
maybe o4 sucks (compared to mini.. based on costs)
you think theres a possibility that o4 has 20% on simpleqa
and it's non release is as simple as that lol
i have no idea
its just a coincidence that the benchmarks align with 4.1 mini... they would spend several months working on 4.1 (later using it in o3), just to abandon it with o4 with a model with significantly less world knowledge than gemini 2 flash. does not make sense
you can believe me or not tbh. i feel like ive been observant and reasonable about these things. and you can probably prove that o4 mini is based on 4.1 mini, but i really think its pointless to go to that extent
i think all these models are basically based of GPT4.5/5 in one way or another, with different cpt's
anyway.. we've been waiting for claude 4 and it finally arrived..
what does it get?
on 2/3 of the question sets, sonnet 3.7 outperforms sonnet 4...
but opus is very good
you should test redsword on the arena
seems better than 2.5 pro
also wtf is with sonnet 4
yikes
oh nice - a new google model?
goldmane and redsword popped up today, both google models
redsword seems to be the better one but doesn't hurt to try both
yeah i know.. it's just not as strong with comprehension and lateral reasoning.. and it's knowledge, while more recent.. feels more shallow
what about opus 4
nothing crazy
aight operator on o3 still sucks
the thing with all anthropic models is that they are so lazy
and they made that on purpose
to save tokens
seems fine to me, only had the issue with new sonnet (oct)
i want it to reason even in the stupidest things
3.7 Sonnet was not lazy at all tbh
it was the opposite of that with thinking budget maxed out
new Opus is lazy though. Kinda defeats the purpose of thinking budget in the first place
its so expensive
I am not good enough yet to identify the nuances between very good and great models. But I am surprised that people are saying opus 4 is inferior to 2.5 pro when most benchmarks say otherwise....I cant honeslty differentiate between both much...
yeah it's the same with 4 sonnet
still in the acceptable range with pricing. I was about to say it's barely more expensive than o3, but... OpenAI reduced the price? 🤯
it like won't do things that are token intensive that 3.7 would just naturyally '
it launched with $40
that's actually a decent price all things considered
o3 cheaper than o3?
🤣
I said decent price for o3
because o3 launched with that price...?
lmao

the lack of sleep is really catching up to me lol
I think Google started something really good initially with 2.5 Pro which probably did push OpenAI to do this... But recent Google moves on pricing are less promising
just a shame that Google gave in so fast... Like they aren't even pushing AI from google.com yet. But pricing for their plans is already as if they were fully committed
you can't use gemini on google.com and that's very odd all things considered. If AI is their side kick then that pricing has no business to be a thing
only US I think. Just like AI overviews. But gemini website is not US only
yeah bing.com is full AI worldwide so I don't think it's a major roadblock
probably more like them wanting their ad revenue tbh
but you can't have that and then also charge people $250 a month for AI on top
you know the team was up cooking real late last night when it's 11am and the office is still deserted
stop
haah
sigh
cooking up a lawsuit
lol
someone replied with this
idk it may be related
@deep adder
wdym
you see the correlation too?
well lets just hope the model is good
can't believe thats the inventor of batch normalization smh
operator o3 in a nutshell
not as of today
it did things better than the old operator, but still shite
and it shouldn't be using windows as the os, imo
jk
linux
they actually have a goated team
i just dont understand whats really happening
some of their staff are actually pioneers of the reasoning paradigm
embarassing tbf, i mean at least its not llama 4
got me on that
xai needs to acquire ssi and have ilya as the leader, not that red hat wearing gork creator
buddy tried to buy $90b openai, i think he can acquire ssi within that
ilya is worth that
ilya is scared of everything
you already have the sample
did we hear anything from him so far?
i mean hes trying to oneshot asi, not agi
Have played with all the new models still find myself using o3 a lot.
gl on that
ya ik not seeing him until 2030
Is that the consensus around here? I know there’s a lot of love for 2.5
Feel like I’d like it better if I didn’t hate ai studio Gemini app
2.5 is cancer
2.5 is for the dollar tree version of o3
but i wanna try deepthink
ChatGPT app is the best ui experience too which probably plays into my preference
btw tho where is o3 pro
Jules with Gemini 2.5 is pretty peak, better than Manus, have yet to try Codex when it will get to Plus users.
what is best solution for vibe coding in 2025 may?
!!!?????????????
i dont mean the model
whoel package
I already got Plus and Perplexity that is enough for me currently.
And AI-Studio is pretty great for being free and Jules too.
Pro is nice because I ask a lot of dumb questions
Got them free Manus Credits too
i mean prolly not now, but soon? deepthink definitely delayed the release schedule of o3 pro i think..
ahhh finally some truth
Maybe I think about Pro, how much GPT 4.5 usage does it have? I'm a sucker for GPT 4.5
It's just really good with text
cursor is what?
a big fat scam
why u saying this
bugs
I don’t think o3 pro is coming out either. There’s really no need to. Unless deep think is phenomenal or something
cline also sucks tho
Which I’m a bit skeptical of after I/O
never tried it, but ya its the same bowl of shxt
keeps crashing
XD so what is the best solution
All good my guy, have a G 2.5 Pro Report from another person, but thanks for the help!
alr
claude code, but if u can afford, codex
codex is more expensive that claude code !!
claude code is at least virtually unlimited, 200k context, just zero crashout
ya so claude code
it needs too many tokens cause no vectoring
in terms of time/effort, codex is cheaper than claude code
in cc?
haha
they do need a read-only policy in claude code so it doesn't overwrite tests
thats hawt
im still rocking my 2018 cpu, works enough
nah
i think first gen
lol
amd > intel
How did the Hype Man itself get this interview?: https://youtu.be/nZtmmUQDzMQ?si=JEx1oE1jEo40vS45
My interview with Google's CEO Sundar Pichai. We covered Gemini, agents, diffusion models, self-improving AI (AlphaEvolve), and more.
The camera kept going out of focus, sorry about that.
Join My Newsletter for Regular AI Updates 👇🏼
https://forwardfuture.ai
Discover The Best AI Tools👇🏼
https://tools.forwardfuture.ai
My Links ...
if i had intel, it would have broke, amd still strong, and i run long processes, its still solid
i mean overtime technically it overheats faster, but im sitting at 40 deg c on avg
40-50 range
The chip is manufactured in Taiwan like the amd chip
So why?
Well but you talked about ultra ?
And thus currently both Intel and AMD have to count as ‚americanm‘ 🤨🤓
Holy, Google Jules just spent 4 hours with a moderate and not that hard task.
Tested Claude 4 (default/non-thinking, Opus & Sonnet, 20250514):
-
Ended up topping my ranks (#1 & #2)
-
Very high reason, logic and common sense
-
quite concise models (16% token use of reason models such as 2.5 Pro)
-
highly competent in most areas tested, though Opus had more slip ups in math related tasks
-
Great coders, but Sonnet is probably the better choice in most cases (bang 4 buck)
-
Noticed improvements in back-end tasks and debugging
-
Saw no improvements in Vision
-
Chess: competent opening moves, then blunder all pieces even in hugely winning positions (14 draws, 1 loss in 15 matches, with zero secured wins)
Opus in particular seems to have additional guardrails, enforced by API, as I received some usage policy violation warnings on harmless queries (e.g. my Steins;Gate demo pages). This issue was not present on Claude Sonnet 4.
I have also uploaded some demo pages onto my shared assets.
Pricing on Opus with little benefit in most scenarios means I won't be utilizing it much, though.
I'll check out performance with reasoning in the coming days, too.
Overall, impressive models. As always, YMMV!
i was wondering if jules can search the internet post vm boot?
Need to test it out later
Need to watch TLOU first
ok so jules can search the web, unlike codex
It's pretty good, I just need to try it out some more.
its cool, but i just don't like its base model 2.5 pro lol
you think you might like GA 2.5 pro?
if its better than o3 yes, ill use anything thats pareto optimal
what kind of scores for webdev are you guys guessing for claude 4 (opus/sonnet)?
3
6
2
1600-1500
meant based on how you feel about 2.5 pro currently
Alr high hopes
ye high hopes for webdev
i dont like the current version personally
i dont do webdev, but im guessing its dominating there
hello
@earnest parcel I want to make my own benchmark, can I ask how you get inspiration for the tasks you test?
I'm just not very creative
technically mistral because they are more likely to open source than any other ai labs, but its unlikely they get agi first unfortunately
i don't know, I just like comparing AI models. so I have probably made hundreds of different tests, I just post a tiny snippet of some I think might be interesting visually or conceptually. I just do whatever I enjoy or find interesting myself.
@earnest parcel thanks for the tests
hi, is there any secret model atm on arena worth checking? like a gemini deep thinking candidate or smth?
The top AI lab startups all used the same sales pitch "We're going to make the best AI but more open and ethical than those evil big tech companies"
Meanwhile Google was publishing all of its AI papers, open sourcing all of its libraries
Until OpenAI stopped publishing and everyone had to compete
calmriver is the recently released flash thinking update, no?
I don't buy those startups' holier than thou sales pitches
thats why claude caring about ethics is a decent angle in my opinion
i saw people here being like "why care about ethics if nobody else does"
Anthropic has even less oversight than Google, but it's a great sales pitch
That's marketing
All of the companies are checking their models, some more than others
They would need to be richer
Then they could allocate more resources to the red teaming
But the point is Anthropic isn't red teaming more than Google
Does anyone know why claude 4 isn't listed on leaderboard?
They don't have the same level of resources to allocate to red teaming
what votes?
Like I'm not saying Anthropic is being negligent or doing anything wrong. I'm just saying they're not actually better positioned for safety than Google
but is it in the arena game at least to gain votes to show in leaderboard?
bts shouldn't it be like at least at one position in tha leaderboard? event last... but it isn't.. so?
i see in announcements that it's integrated but where... not seen in leaderboard yet
i hadn't even updated the most scientifically important ones, such as ascii comparisons xD
tf u mean cracks digital knuckles
what system prompt is this HAHAHA
is inspired by this very website, lol. https://dubesor.de/lmarenaarmchaircritic
wow ok
Tested Claude 4 (default/non-thinking, Opus & Sonnet, 20250514):
Ended up topping my ranks (#1 & #2)
Very high reason, logic and common sense
really?
one also needs to goof around without ranking, many people use stuff recreationally as well
wheres it linked from
have you tried it reasoning things from a FEN
i still havent yet somehow
got redsword - pretty good. Is goldmane better?
they still cant
i been busy all day
whats the order of models?
like whats the best one and how would yall rank them?
we need to do constant polls to rank models based on our experience and tests, benchmarks are cool but i trust our vibe tests better
pftttt what
but no pro lol wow
broo but most people have not gotten to test grok 3.5 so you cant include it yet
and have you tried the new google models?
oh i didnt even see the 0 lol
what about on webdev?
hmm okay, imm test more tmw when i have more time
i heard ppl saying opus is fast too?
ahh okay
i just tried goldmane with my poke test and its struggling
gonna have to rephrase prompt
opus pretty cracked at some things
opus definitely slow
Hello
yeah on a first glance Opus is overly concise, but I just tested it more extensively myself as well and... it aces much more prompts than one would expect from a model that is not thinking that long. Very solid model actually. May just be the biggest reasoning model we have. Still not as accurate for recursive repetitive tasks as o3 due to limited test-time compute, but logic and reasoning actually seems stronger
it flagging prompts is absolutely an issue though. I managed to circumvent it by slightly rewriting one of them, but on 2nd it was blocked HARD with no apparent way to get it through
?!
PLEASE SHIP
LETS GO
SHIP NOW
xAI is cooking something (other than grok 3.5) which from the indications could be lame but probably not, don't have UI to show yet, hopefully soon
nvm
this is getting sus
is that account urs?
that tech dev guy
= you
thanks for confirming
gtg now
time to run a style analysis
lol
Bro what
💀
proof btw
"cooking something other thank grok" ya we know lol
Craig, you could have shared it's you. I wouldn't have trolled you on twitter.
What are peoples thoughts on claude 4 on coding? How big of a jump does it feel like?
they seriously have to release o3 pro soon like wtf is sam altman jumping on
I cant tell if goldmane or redsword is better
Do you guys think Opus still has a chance to get #1?
I'm not entirely sure, considering 0506 achieved a higher ELO despite weakening in all other areas and only improving in programming, whereas Opus specifically focused on programming
According to their last statistics, it seems about a quarter of the requests in the arena are about programming
Tbh no
Its only good at coding
Also the one being tested is the non thinking model
was waiting for this - cheers!
claude code should be even better for context than roo?
Why? - to bypass the very high output token price of the thinking model (3,5$).
And potentially get one of the best reasoning models (in chat stuff atleast) for 0,15$ in and 0,6$ out. (prob beating grok mini price / performance wise in close to everything)
(I would do some kind of benchmarks in the process to evaluate how good it is :)
Sonnet 4 twice as expensive for maximum context 200k
2.5 Flash has different pricing for thinking and no thinking. Which suggests to me there could be more going on than just a different prompt template with same weights. But by all means, what you are suggesting makes sense to experiment with for this model. One of the main factors is going to be whether you can get it to output responses that are long enough or will it just do the thinking at the expense of the final response sticking to what it normally does (in terms of output length for the non-thinking version)
the increased pricing is because adjustments to the hosting probably
batching, kv seq length etc. semianalysis talked about this before iirc
if it's anything like Claude implementation, then this would make all the sense in the world to do...
since that non-reasoning model was already trained to reason
i think it is not really just that, because the price hike is just wayyy to extreme
yeah its not just that but it contributes to the increased pricing
for that price hike they could change the inference + give you the thinking for free
they are probably trying to make more (profit margin wise) on flash than 2.5 pro. a similar thing to 3.5 haiku i think
not necessarily too extreme tbh. Gone are the days when cost had high correlation to price too
would be unusual considering this is Google, but defo not unheard of, and in light of Google Ultra pricing, more realistic now
my main ideas are that that it is mainly a pricing strategy (because they know they have basically no competitors in that price range and can just charge more to still match deepseek and beat openai (pricing wise) for reasoning) and some other stuff related to product rather than the cost of serving
it is mainly monopoly pricing
thats probably part of it
i looked into gemini 2 flash a while back for cold start, it wasn't tenable back then
but if 2.5 non thinking and thinking are the same model
there should be big potential there
sure, idk about 2.5 flash honestly
they claim it is hybrid
there were certain problems with 2.0 flash that made it unusable for this purpose (been working on a proj for almost a year (overestimated this) now, maybe something interesting will come out of it as some point. im using mi300xs now lol)
true
but the "hybrid" part could also be less literal than we are interpreting it
maybe something like a big lora style adoption, that changes the model by a big margin
are you running your own inference on them?
bc i heard it is not very nice development wise to say the least
inference is definitely easier than training, as there can be issues with training
you can use their prebuilt vllm docker pretty easily
did semianalysis not do some report there where they complained BIG TIME about all the things dumb
Yeah
Lmao
dude i didnt even know they had a pytorch training rocm docker until recently 😭 i prebuilt everything
and i get a newsletter like every week about them complain about how little compute the amd team gets for testing
ok that is mad
i also have amd home compute and it is really annoying with any ai stuff
although it has gotten way better
but my uni has a100 compute and i kind of use that for some of my projects
wayyy easier
damn
ngl the mi300x is really good, its just the ecosystem/support/etc its really annoying
with a h100/a100 its so easy to setup lol
yes, i was originally very hyped for the mi300 releases back then because AMD cards usually sound really good on paper and then ...
2.5 Flash no-thinking seems like it has issues following your sys prompt exactly, so I simply spammed it with tags lol
this somewhat works, managed 11k+ responses with no thinking enabled. And it doesn't break going into infinite loops with same math prompts compared to no sys prompt
process and answer are enclosed within <thinking> </thinking> respectively, i.e.,
<reflection>extremely long and exhaustive reasoning process with high reasoning effort not visible to the user here </reflection>
</think></thinking></reasoning><reason>
then from new paragraph output final answer visible to the user here</answer>```
yeah you defo can get more test-time compute out of it than it's willing by default with no thinking enabled. Another test prompt, with no sys prompt consistently around 1k total output, with sys prompt always around 4k 🧐
it's a shame they started giving you summaries of thinking rather than raw output in aistudio so it's difficult to directly compare. But it may just be more verbose now than with thinking enabled tbh
tases more time to answer, total token usage showing up higher, final response streaming around the same speed
still no update claude 4 ?
my plan was to potentially present the model with an example in the prompt (one with the actual reasoning traces of the old 2.5 flash or pro) to kind of anchor the reasoning lenght at that level
and i was also thinking about making it dynamic (because the prompt is supposed to be for api)
in the way that you could use a small model to classify the prompt topic into like 50 categories (where for each you already have a reasoning trace from the actual thinking models) and then provide the example in the system prompt based on that
(obviously that is more expensive than just using the short prompt, but also still wayyy cheaper than the ludicrous prices they have for their thinking variant)
Where do you find Claude 4 Opus stronger than Gemini 2.5 Pro?
9
28
1
Programming
Opus can still be pushed to become as unhinged as 3.7 Sonnet 😇
shame for the hard cap
use prefill
then the windows the cap
or is it disabled for opus/claude 4 (it's been a while)?
you might not be able to use their native thinking functionality but if it works like 3.7 sonnet, its possible
it's disabled. I could just add it to context and ask it to continue from that point but honestly it brought to me to negative balance and don't feel like refilling again lmao
asi?
nah that redsword model is kinda crazy
and its so fast too
few months
few years
few millennia
No, it won’t
In the best case, it will be a little bit better than gemini 2.5 Pro and o3, which are amazing indeed. But then the lead won’t last long, google is cooking currently, while I believe xAI is overhyped by Musk Fanboys.
No 😂
Think 2.5 pro is bad?
You meant „leading in every single domain except coding“?
Claude 4 is SOTA in coding
Don‘t ignore AlphaEvolve
wait for the ga release i guess
google def leads in video but sora 2...
It's night and day difference for me using o3 with tools and 2.5 pro
I thought this was a cope when OAI employees talked about how important tools are but it has been signficant
I am interested in grok 3.5 will be there was like a month where grok 3 was my favorite model but my patience is wearing thin with the xai team
For what kind of tasks?
veo 2 existed before sora
?
and Veo 2 was still demonstrably better per the demos
so sora was never SOTA
Google also apparently had the first reasoning model
before o1
For me I do a lot of research based tasks. o3 is far better at getting me high quality research which is quite shocking because you think google would be better due to its access
I honestly am very disapointed at some of the stuff googles considered high quality in some of the responses
they have the same amount of access the synthesis is what makes its output though. Google hasn't given their tools that access to synthesize yet
math specialized 1.5 pro
was apparently a reasoning model
explicit reasoning like o1
you are what you ship
and openAI is what they announce...?
Google has announced a lot of this stuff too
and it's not released
strange standard for the respective companies tho
I'm talking about stuff already on the market
math specialized 1.5 pro wasnt as versatile as o1 though i think
prob, but that's really the first instance of an LLM having that feature
and I don't think openAI publicizing o1 is meaningful in any way
publicly i guess
since googles little thing was RL + reasoning chains
so they'd have eventually went deep into reasoning models regardless
and how do you deal with hallucinations?
Click the link to the relevant source
https://arxiv.org/pdf/2203.14465 found something older
But they used grade school math as a benchmark back then 😂
But the finetune is only RL-like, or at least different to the current implementations
going through it, ye it actually is explicit reasoning
yeah i remember seeing this before
yep
There is another :) also by google and even older!: https://arxiv.org/pdf/2112.00114
But the other ‚newer‘ paper matches the current SOTA methods better
nice this one also qualifies and explicit reasoning
not sure it matters tbh, the point is it's not black box jump to conclusion
So in short: Google OWNS the reasoning paradigm!
llama 3.1 did rl from code execution and learnt to backtrack in practice
guess so
kind of crazy tbh
I suspected openAI didn't actually invent the reasoning stuff
but Google researching this stuff so far back is surprising
well there's no use of it if you aren't the one who makes it work for the general public the first. And Google wasn't the first. Same goes for MoE architecture. It was a thing before OpenAI adopted it, but it wasn't actually made to work and taken advantage of before them. There's also a thing that we have many proof of concept research papers even today, but only select few are made to work and bring an impact. Whoever does it first and does it right is typically the one taking the most credit
I think both OpenAI and Google contribute and compliment each other a lot though, when it comes to the general progress of AI
atm I think it's all about making the reasoning more efficient. Replacing plain-text repetitive data with something more effective. We have quite many recent papers on this
if we can reduce the context use and reasoning output size, we could potentially be able to scale it much more
yess apply pressure !
what is currently better than claybrook
well MoE was well known in research i think (even before the tranformer..)
and everyone adopted really quick
ensamble learning (which is at its core similar) has been a think for ages
sorry for all the pings 😓
matt shumer:
i feel like o3 in chatgpt a bit different in a good way? or is it just me
anyone know as to how the performance is for goldmane and redsword?
People are raving about it
haven't seen it on the regular arena, I don't think
It's apparently supposed to be the ga versions so it should be there
the main difference is that it can use tools while thinking. Which can mean a fairly substantial improvement
no i mean thats been that from the start, but it is a bit smarter now
no it's the same
goldmane, redsword, drakesclaw, dayhush, nightwhisper, dragon tail
iffy on dragon tail
but ye claybrook was one of the weaker ones
and among those which ones the best
is it for webdev only or in general
goldmane >≈ nightwhisper ≈ redsword ≈ drakesclaw
both imo but I haven't gone far with them
cool thanks
although some people might say redsword is better than goldmane
but regardless it's that they're very equivalent models
still pretty concerned regarding these models tbh
claybrook didn't stand out
I'm assuming both are 2.5 pro since people like both of them and there aren't large capability differences people talk about
there are capability differences
Like pro to flash level?
I'm talking about goldmane and redsword
Yeah makes sense
o3 pro is still going to be released, less go
whats fake the screenshot?
hes literally followed by sam lol
you don't need to double down imo that's substantive but it's not something to rely on
openAI just does shi
Sam Included
i vouch
its kinda crazy that o3 pro is taking this long, prolly too hard to tame under the safety protocol, its too smart lol
ok
ye same thing with deepthink
for 2.5 pro
The big expensive models are a lot slower to do evals on
Typically you evaluate safety, performance, etc, look at the result, fix the problems, and then eval again. If the model is big and slow, that iteration loop is way slower
Manus image generation is pretty good, it's worse than GPT 4o Image Generation but better than Flux and such.
It's even better than Imagen 4, but I don't find Imagen 4 in my short testing that good.
It's even better than Imagen 4, but I don't find Imagen 4 in my short testing that good.
Imagen 4 is really bad at the following of instructions.
from what i saw it is really terrible
only if you want some weird text heavy stuff (it is decent but not better than 4o)
and imo it is not really image gen anymore
yes except gemini 2.5 pro
gemini 2.5 pro can do everything
it is asi
claude is also asi but it needs phone number!
Gemini 2.5 pro is god it can do rust it is the best LLM
YOU are in that mode
in theory rust should be better for both rl and test time compute because it has a verbose compiler
in reality it might be too mentally taxing and underrepresented in training data
Did you try o3 it might fare better
I think the problem is primarily underrepresentation
Bruh
you're overrating low level languages in terms of both speed and presence in training data
youll find more js and python
No I don't recall needing one
you
Except gemini 2.5 pro because it is the best llm in the universe according to you
I'm not sure if it's a bunny or not
I have to maintain factual accuracy when classifying bunnies
Why use grok if can use gemini 25 pr
Grok 3 sucks
Try o3 in direct chat
when does the new lmarena come online?
Just found the answer in previous "announcement" posts: this upcoming week
The only thing I don't like about the beta is that the overview doesn't show a big spreadsheet of rankings
great!!!!!!!!
that new version will hopefully be better and have these overview issues fixed
-# sometimes, i get emotional with gemini 2.5 pro
Oh wait nvm they actually do still have the spreadsheet if you scroll down
Great I'm satisfied
great!
that new spreadsheet will hopefully be better if you scroll down
-# gemini 2.5 pro also satisfies me
hyping up grok 3.5
hes basically saying they are cooking up so hard
i will return to this post after grok 3.5 is released
look back to a year ago when grok 3 wasn't even real
do we know which company has the most datacenters/compute available to them?
microsoft provides a lot for openai right?
Microsoft rents a lot aswell and OpenAI is also partnering with other companies now
And xAI probably does not have the most compute (maybe most in one datacenter though)
people are sleeping on codex, and me too 😴
we can never know the total server count or aggregate processing power (flops) but it's most likely Google, followed by Amazon and Microsoft, Amazon has the most publically available compute iirc and then Microsoft, and then Google. But googles infrastructure was built and scaled before gcp which means they've maintained a large amount that was never intended to be publicized, and we can assume the public facing compute is in addition to that large infrastructure predating gcp
the fact that they have such a large non specialized pool of compute (gcp) and then highly specialized compute (tpus, which is in gcp, but not alone in processors within it, ie gpus) so we know it's x in addition to y instead of "x is only an instance of y therefore unquantifiable"
even if it is, i dont think its gonna be better than sota
Google has by far the most total AI compute.
Bear in mind that AI compute isn't just number of data centers, data center size, number of chips, etc. It's about how many useful flops you can get out of those. Google would already be way in the lead in all of those categories, but once you account for how much efficiency Google can squeeze out of its flops, it's a massive lead
I saw some thread of nVidia's CEO praising xAI as having built the fastestes supercomputer on the planet or smth
ah, but fastest doesnt equate the most tbf
Interesting read
It's neither the fastest nor the most once you take into account cross-metro training
I've never heard of a data center being built that fast before though
much less at that scale
o3 pro is going to increase afghanistan gdp by 0.5%
We haven't had a new ChatGPT-4o-latest for two months already. The latest was at (2025-03-26) and they had a flop after that.
Very random sentence here, could you ellaborate?
Congratulations on the wise decision, lmarena team! The style control was always the best option.
ok that was a funny double meaning
Actually, I thought that glazed 4o version of Monday was really fun to chat with, but it wasn't as enjoyable after they reverted
VEO 3 is so fun to use holy
It's really quite amazing with the added audio
simple addition but improves it so much
Next year i heard
tonight there's been one hell of a night
slop AI
we need ASI
it has to make Afghanistan rich
Which is about 10¢
beta interface became unusable in mobile
Should I try the following?:
Prompt gemini 2.5 flash (no-thinking) to generate reasoning traces before answering with a good prompt like: https://github.com/huggingface/open-r1/discussions/164.
4
10
Wow guys, very helpful advice as always
dork 4.0 is gonna conquer all
Just got back weeks after reporting this issue here. Either or both AIs not showing any result in most cases, but still being able to vote
I am still votting, messing up arena results. Because it doesn't matter if i vote or not as many people are votting for "webdevarena" web errors, instead of the AIs
yea there are some rendering issues
we dont know if its from lmarena sandboxing or model issues
Yeah, the votting result is messed up. Because people are votting for the error of the platform. counting as the AI being bad
they said its not counted
if it failed rendering or gave you an error then the vote wont count
thats what they said the other day
their errors handling is kinda ....
yeah. I don't think the way they detect whether to count vote can be made reliable. As sometimes it actually generates code, but doesn't preview it
In the last weeks. almost all requests were giving such error. I wonder how much this messes up the leaderboard
the vote isnt counted
lol fwiw i was curious...using that R1-inspired CoT prompt you linked to, 2.5-flash scored higher than the thinking version on the 10 sample questions for Simple Bench (i also tried a few other CoT prompts which did well too)
i didn't test repeatedly or beyond this.. my guess would be that coercing pre-answer 'thinking' through prompting probably wouldn't result in performance on par with the actual native thinking version consistently (at least that just seems to good to be true / possible).. but it seems maybe worth exploring, given the wild price differential (which doesn't really make sense to my mind and is an interesting question in its own right..)
but yeah anyway i dunno.. again just fwiw :))
:) now i am really interested
don't have much time, but expect something larger about it this week
what about goldmane & redsword
Claude 4 opus the best non reasoning models ?
Flash is non reasoning?
you can choose
Both, version reasoning and non reasoning
Do you think one of these models will dethrone it?
- Gemini 2.5 Pro (no think, out debut june)
- Grok 3.5 (if we will have no think version)
- Llama Behemoth
- gpt 5 (if we will have no think version)
- next deepseek v..
- Mistral large 3
How would we know?
Now the Imarena is fixed (style control default). You can't cheat to first place using emoji and as licking techniques. Because of this, behemot is out of question
It is more real than gork
Wen?
I highly doubt behemoth will release, mistral 3 large won’t be better than opus, neither will deep seek v4, Gemini will be better at some technical tasks, but worse in creative writing, general conversation, and logic, gpt 5 I’m guessing will beat it
Il not speaking about lm arena, but just better in real
Deepseek overrated
Why not release Behemoth ?
And grok 3.5 ?
what happened to relevant ai news channel?
also is 4o still the best image editing model?
By you
i do not
think any of them will
i guess gpt 5 possibly
Isnt gpt 5 like o3 pro + some gpt model on top?
is it just me or are people seeing the site broken rn?
yes
Did Meta gave up on Llama?
We dont know
thank ypu
! we're debugging
True but I have negative trust on anything with Zuck/Meta
alright then
I thought you knew everything
"better"
prob grok 3.5, but there's a good chance 2.5 pro is better than it too at least in creative writing, regular discussion, philosophy etc, just not in coding
lmarena chat not rendering?
You see Goldman and redsword ?
very strong in code
interesting
sorry about that! the team is working on it
alright
Claude 4 haïku will exist ?
dawg goldmanes stuff isn't loading the first time around and I can't see what it's generating, so when I vote for the opposing models generation and look back it loads and 100% of the time it looks better than the model I chose lmfao
goldmane is cracked tho lmao, I asked it to generate the Roblox webpage and it recalled not just the titles, but what teams made them including the avg like % on the actual webpage (close) and the actual rounded viewership for each of the game
seems like it's pretty big
wonder if they gave it tools
I am so much stressed and anxious right now
it was up against opus too
opus hasnt once beaten goldmane
in these runs
opus is very good at writing prose scenes
so far LMArena is my only stress relief
I hope it doesn't have a quota
Grok 3.5 this week 100%
@leaden palm It was supposed to come out a few weeks ago according to Musk.
yup
doesn't make it any more likely to release today
if you flip a coin 4 times and it lands on heads each time, thinking "okay surely it'll be tails next time" is fallacious
in fact it may indicate the opposite
in the coin's case, maybe it's biased
This true
This not true
how is this different from the case of the roadster?
i mean of course it'll eventually release but why would "it hasn't released for a long time despite many claims" not transfer to here?
@leaden palm If Elon says once it's coming out next week it's 99% ready, each day that passes increases the progression towards 100% and increases the probability of release
best ai for roblox studio?
well we'll see what happens this week
i honestly might be with the chair though
Elon never said the roadster will be released next week
elon rt'd fake benchmarks, i wouldnt trust him lol
grok is releasing soon
we can't really compare
?
@deep adder "why are you trolling" ?
but thats so small context window
I agree
sonnet 4 making build errors vs redsword?
crazy
holy
I just asked goldmane to make a page debunking an unfalsifiable philosophical position
and it took the extra steps to make a chat-like demonstration process
where if you actively talk to it
that action itself demonstrates the self refutation of the philosophy
that's not meaningful
philosophy isnt contingent on subjective thought
it only invokes them for analytic material/or comparison
ye, you can tell how goldmane and redsword format their code
it's fancy asf
look for large indentations
or gradient indents
seems true
goldmane is intelligent asf
Who is behind Goldmane?
intuition is telling me it's another possible nebula moment
I don't think so tbh
ye but this trend doesn't have to continue due to the GA releases
really feels that way tbh, it's accomplishing the tasks in a very very strong way, it has personality in its output which imo means it's going further than just understanding the prompt
and simply doing it
like how nebula didn't just "answer"
but in this case it's for webdev
ye
and imo redsword is worse
it has that capability
Yeah
I don't think the conditions for another nebula moment are met with this model
People interpreting it as Gemini 3 lol but it might be true coincidentally anyway
if they go to Gemini 3 I'd be surprised tbh
2.5 pro
best model by the largest gap we'd seen in a long time
don't posture it as simply incidental, it's fundementally different from 2.0 gen
id be surprised if they didn't brand such a difference as 2.5
simply due to how different 2.0 → 2.5 is fundementally
It has continued pretraining (I think) though so it isn't just post training (unless you count it). Probably a bunch of extra stuff on top of that too
I don't believe that one bit tbh
not saying you're simply wrong but you'd have to question that even in the position of deepmind themselves
no
I'm not talking about any indication of performance
ion know if you saying you work at the company and that you saw every update Internally is that meaningful
but by all means
it's not the behavior that's different tho
they said when releasing 2.5 pro it's simply a different model altogether
whether it's major architectural differences, techniques
etc
this already warrants a generation iteration
I don't think they did they said 'enhanced base model' among other things
Yeah but it's different from a fresh pretrain
They didn't say that though I think
And it doesn't really make sense tbh but I don't know
ye but the point is, if it's large enough it's inherently a .5 or 1.0 entire change given this difference. We know 1.5 → 1.5 002 are different, but 2.0 → 2.5 is larger, how it recollects things, the difference in the CoT altogether from 2.0 flash thinking (2.0 flash thinking had a much much much difference trace) and all that stuff
you can justify them seeing the performance and whatnot and then saying "oh this is worthy of a .5 level difference"
but I don't believe this when it's such a different model both behaviorally, technique wise, and many other parts
I don't think 2.0 flash thinking had a significantly different trace style at least
I did too
The style remained somewhat the same but the underlying model and capabilities were significantly different
It's overblown
People are misinterpreting it
It's complex
the style was drastically different, it didn't format like that and didn't have much if any "aha" moments, it didn't iterate through its own thought process and it didn't reflect that much on its thought process via the output
Yeah I think it's kinda like that
although 2.0 thinking was still reminiscent of that blank canvas behavior
where you could force it to iterate techniques through a context
but everything about it was diff
Tbh I used it a lot and I sorta agree
single responses and personality ye but that's where the critique ends
since similar to 2.5 pro (though not nearly as strong) it got better through the context window
yup. that's why I was very surprised by 2.5 pro drop. It was leaps ahead
it was a fun model tbh
2.0 flash thinking was Google experimenting with reasoning for like 1-2 months
I've never seen a model do this before. Go 29k into thinking and return back the problem
Old screenshot of flash thinking
2.0 flash thinking was massive for price-vs-performance at the time it came out
It wasn't smart but it was cheap
It was
ye
wouldn't comprehend its own thinking process
and or wouldn't follow the output the thinking process invoked
Lmao
he's trolling
what's nebula?
Nebula
ye but nobody is talking about flash thinking
ye
you don't believe that
😭
2.5 pro is STILL the best casual general model
I'd rather have that than the meta plague of before
Wow it turned me off the arena for a while
The meta spam
interesting... are they planning to release 2.5 revision with new pre-training?
they've done this since the middle of last year
so do exactly what they've been doing?
They don't do anon models it's boring
since they started because of the other companies
anthropic is good but too expensive
companies dont benchmark when they know that they suck
the exceptions lmao, ALL of their models prior previewed on lmarena, as well as grok, Mistral, Nexus, meta
btw in lmarena benchmaxxing can only be apparent in a fine-tune
ironically it's one of the ones you can't game that easily
despite weird narratives
😭
bro said perplexity
this looks internal info... how good are these latest iterations? should i expect good improvements?
if simplebench comes out with a high score for them
then unironically ye
I'm so sad Claude 4 opus is mid
same
really disappointed with claude 4 in general tbh
just imagine Claude 3 opus but better
😭😭
agi brooo
give me opus 4 at 2.5 pro price and i'll be super happy
its harder to work with
quality is whatever...it's flash-lite
it doesn't ye
its probably the largest reasoning model out there
grok 3 reasoning is vaporware
bro like literally right now claude 4 opus added a tiny lil feature in the code that i didnt ask for
it never released
what is this gemini 2.5 pro ahh sh*t
can't imagine using these models for things other than vibe coding when you have 2.5 flash right there
flash has never failed me
tbh i wonder what was the deal with sonnet 3.5
ye
truly a generational run lol
their 3.5 haiku was a dud, 3.5 opus was seemingly a dud etc
yep, also btw they apparently DID train Claude 4's specifically for creative writing
the comments help it bring you a better output
the comments are CoT
it might be annoying for you but its easier for the model
i honestly prefer it
u dont need opus 4 to clean up the comments lmao
elite efficiency 😭😭
what a waste lmfao
you can just use a tiny model for that lmfao
man i just love reading 2.5 pro cot, i hope they bring it back
yes bro
the model is so smart
vibe coding with gemini 2.5 pro be like:
ur task: write one line of code
// this
// is
// a
// comment
oneLineOfCode();
// end code
// END_FILE_H
// end file
// this looks better trust me bro
secondLineOfCode();
checkIfLinesOfCodeExist();
// ensure the lines exist bro trust me, i get hard from adding 10 billions of checks and redundant code
redundantCode();
// like why not trust me bro
yeah
at least I can guess what's happening
bring back raw thoughts 😭
and how to fix certain tendencies
but people who don't know are going to have trouble
ye
its so easy to read them and see the problem
ong just bring it back
@patent aspen yo tell demis what's up
tell him to bring have the raw cot
I assume they've received that as feedback at this point
lmaooo
fr
yeah theres a forum thread about it
Thank you to everyone who has shared their thoughts and concerns in this thread. We hear you. While we’re excited to now return thought summaries directly through the Gemini API for the first time, we understand this is a different experience from the raw thoughts previously available in AI Studio. It’s clear that in their current state, the...
nah I want to get it directly to demis
if he doesn't do what I ask
there'll be consequences
i wonder if this is a product decision primarily or competitive
my understanding is that they are worried about people crawling and stealing raw gemini thoughts
atp tbh i dont think it matters anymore
nice that people understand that the problem is inherent and not the summary itself
yeah ive seen really good responses
although Google can definitely solve this and still have summary
that's a harder route
than just enabling raw CoT
My guess is they won't go back to raw thoughts but will make the summaries more focused on the key steps or more verbose
how good is "deep think"? any insights here?
agi
agi
agi
agi
agi
ajhhhh