#general
1 messages · Page 61 of 1
Gemini 2.5 Pro learns to commit suicide be like…
i'm not in any other ai servers personally but this is the only ai server i need
google solved ai safety
if ai no satisfy human -> suicide
safety? you mean sycophancy
bit of an over correction it seems lol
Do people like losing money
this is reminiscent of what Bing did but much much less bad. It was waiting to happen tbh given that Google was struggling with fine-tuning for a long time. They almost got it right recently but they didn't use to have anywhere near the same popularity and reach with Gemini that they have now, hence the publicity...
Those odds seem fair
they probably just pushed "user is always right" tuning and sycophancy just a little bit too much. So they backed it into the corner lol
new stonebloom I think they did it right
Less sycophancy and straight to the point
so instead of "I apologize" always being the response to mistake, it gets increasingly more apologetic until nuking the thing becomes an option
kind of like with GPT you can get it increasingly more unhinged with "unsafe" context, instead there are no bounds defined for sycophancy or sucking up lmao
well even though i explained multiple things it could do, it only tried one of them in one specific way and immediately gave up after it didn't work as intended
is polymarket a thing
I wanted to try it a year ago, but I hadn't had the chance
It’s banned for the US. VPN is maybe a workaround
im not from the US tho
if its banned in the US then why did XAI signed a partnership with them
elon just wants to ruin everything
they AIM to be legalized in US, but they focus on global market. Kalshi is opposite - it is US only.
niiice
how will the anthropic book scanning fair use ruling impact google's language models?
11
15
3
they're already using google books
nobody knows until they know
?!
How is the "best" defined here?
Yeah I read it just now https://polymarket.com/event/which-company-has-best-ai-model-end-of-2025
Alphabetical order for tie breaker sure is something tho
Honestly those odds seem just about fair. Still makes more sense to bet on OpenAI though IMO 👀
there's a reasonable chance GPT5 will top it
Now I'm thinking on actually doing this... 
Google has gotten very good at staying in the lead on llmarena. If it ends up being next Gen vs next Gen end of year, I'd guess Google continues the trend.
If it's openAI next Gen vs Google current gen, then of course openAI
GPT5 is in the advanced stage though and it's to be very different/new. With Google everything is less certain
But I think Google will drop something once something like gpt5 drops imo
I personally think we see a Gemini 3 by years end, but that's a tentative guess
No one knows for sure if they will even release Ultra at all... And with 2.5Pro they are somewhat struggling to make major gains
They seem pretty neck to neck to me, between 2.5 pro and o3
It's interesting how different everyone's impressions are tho
yes but like I said we know GPT5 is a thing 100%, with Google we have no clue. If they gonna release something big I doubt it's before 2026
Not impossible, but less of a chance
OpenAI has been advertising gpt5 for like two years
I don't think their training is ahead of Google's personally
that's because the initial thing is already released under gpt4.5 lol
that name is now referring to a different model
Exactly, which is why I think we can't measure the maturity of "current gpt5"
Only with style control on, Google has a bigger lead with it off as the market accounts for
we can because there's a fairly clear way to improve the models and we know what it is going to be. It's not some enormous model experimentation...
77 point elo gap with no style control compared to 16 with style control - gonna be hard for OpenAI to close that gap
I wasn't meaning on llmarena, I meant overall. I don't treat the arena as an objective assessment of capability.
it took less than 5 months to move from o1 to o3
the AI companies do tho lol
I don't treat any benchmark that way, but I'm js they seem to be very close on many benchmarks
Yeah was there never an o2 ?
that's irrelevant lol
it's just naming
you can consider o3 as "o2"
seems kinda convoluted ig
I meant the improvement itself to the next gen
with GPT5 they are also moving to hybrid reasoning which is a big change
well yeah I understand the naming concepts are kinda ridiculous because they’re just steady improvements of the same core product, eventually getting renamed for branding rather than having “GPT-17” someday
how many people left in OpenAI from the original roster during gpt3/4? From the prominent ones looks like only Sam Altman. Don't have much belief in gpt5.
Right. Which is why saying that gpt5 was getting advertised for 2 years is missing the point. That name meant several different things at different times
That was actually my point, that the naming didn't imply how long it had been worked on
I was making the point you countered with
it's not the naming that implies this
But rather the release date of o3
So how do you know gpt5 won't end up being something like o5 instead of o4?
even if they were not working on gpt5 before they released o3 (unlikely), the timeline for GPT5 in that case would probably be around September. Still well before the end of the year
what
I'd bet we get gpt5 this year as well
it's just a name for next generation
I just don't see how we assess that their training is ahead
it's a meaningless number, just like there was no o2
I know that, I'm saying that they don't actually release it until it's actually a second generation jump after o3
they already price matched o3 and 4.1
the only logical next release is GPT5
they even said so themselves that is what they are going to do
o4 (next gen of o3) is going to be GPT5
I think gpt5 drops this year, I'd be slightly surprised if it doesn't
whether you call it o5 or o4 is convoluted given that we have "o4-mini" lmao
but the point is "next gen"
I was more referring to potential delays, like not being satisfied with the improvements of o4
Hence, waiting until they get bigger improvements, pushing it to next year
Because we're technically just dealing with a bunch of unknowns here
I think they are gonna go ahead with it regardless. They are also doing it to get rid of convoluted model switcher. So even small gains would probably fly to start with worst case scenario
it's gonna be leagues better than any gpt4 or 4.1 either way
or 4.5
4.5 was strong, but it hasn’t been in the arena for a couple months now?
also kinda ironic how they cannibalised their own naming if you think about it. Currently o3 destroys any gpt series model lol
wonder what they’re doing with it behind the scenes
gpt4 used to be THE model. I imagine they are trying to go back to this with 5
it's too big of a model to make updates for. Expensive slow training
So a niche model or a failed experiment, depending how you look at it lmao
Polymarket is determined by LMarena though
OpenAi could come up with a model that surpasses 2.5 pro that's assuming google does absolutely nothing though
My initial argument was that on llmarena, I think Google has a decent chance of still being #1. Assuming it is Google's next Gen vs openAI's next gen
I briefly pivoted to say I think pro 2.5 and o3 are technically neck and neck, but I meant overall from various benchmarks, not strictly llmarena. But it wasn't clear since I didn't specify.
I pretty much agree with that assessment I do use o3 more though because I think they do a better job with tool use
anything interesting in the lmarena anon models lately
tbf they are not neck and neck at all, o3 edges it by a good margin still
How much is a good margin
Like what is your opinion of say, artificial analysis, which gives o3 and 2.5 pro the exact same score
o3 pro mogs it
Based on what specifically (I missed that the comment said o3 pro, which I agree it is stronger than o3 and ahead of 2.5)
I think it does matter what GPT-5 and o4 are
yeah to which I replied next gen of Gemini replacing 2.5Pro may not even arrive this year...
If o4 and GPT-5 have the same base model, that's different than if GPT-5 is on a completely unknown base model
most definitely there's not gonna be o4 at all, GPT5 is to be what comes next after o3 except it's hybrid reasoning
Cool. Where did that information come from?
Altman I think
Something about short reasoning abilities and long think abilities
Forget how he phrased it
They opted to midtrain 4o and do fresh pretrains for mini and nano which I don't think is a good sign for a new frontier base model. Seemed kinda indecisive (indecision seemed kinda late too) about gpt 5 too. I don't know what to expect
OpenAI themselves said quite a lot about the upcoming GPT5. This coupled with their pricing and what they are doing is the most reasonable outcome. Everything else is unlikely as far as I'm concerned tbh
but if there was an official press release already we wouldn't be talking about it would we..
So they started from a partial pre-train of 4o for GPT-5?
Not up with the lingo
yeah oai calls continued pretraining mid training. I don't know about gpt 5 tbh but the decision to not do a fresh pretrain and the indecision makes me kinda doubt they'd use a different base model
@patent aspen Like do you really think they will continue with that pricing and release o4 alongside hybrid model that they said they are gonna do, for no reason? No offense but that's stupid
This is not the time where you can claim plausible deniability cause there is no confirmed source. It's just common sense lol
😅
All good
btw I'm signing for that polymarket thing
quite a lot of steps they make you do lol
Maybe they've been cooking a new frontier base model all this time that's better than 4o and still easy to serve, but given how slow they were with 4o if you consider initial pretraining and their recent cpt. it seemed to me they spent a lot of months committed to this, they can do stuff in parallel for sure but I don't know
I mean they have already shown they know how to improve. o1 to o3. They did that by releasing gpt4.1 first. Except this time I think they will release both as one. So we will only see the successor of o3 already using different base model presumably
Google seems to be struggling with progress, relatively speaking
That's ridiculous
2.5Pro updates thus far are not quite o1 to o3 kind of thing
they are still on same base I believe
so just dated checkpoint updates
but that one might not even materialize before the end of the year tbh
flux kontext max is down?
I was surprised with how fast Google did a cpt on 2.5 pro after initial pretraining. If you look at 4o and when it was pretrained, they took a long time in comparison. It wasn't just a cpt though they likely tinkered with the architecture I guess. It supported 1m context and they cut costs while improving perf. So potentially there could be more of that if they do not have a new base model
2.5 pro beyond the cpt likely had fundamental changes too I guess
I wouldn't be so sure... It's their first update of this kind. Still on first generation of competitive reasoning model of this size. And they spent a lot of time doing small updates for the existing one
The first post training will be done by the second week of September
hmmm.....
I doubt they have as clear of a roadmap for next gen improvement as OpenAI does who already did this before and know what works... It's not impossible that Google encounters a bit of what Deepseek did trying to make R2 and not being satisfied with the performance, then delaying it.
Google already played their cards doing what seemed obvious - big (relatively) reasoning model. But improvement from there can be significantly more difficult
fwiw this is also the first time OAI is building a hybrid reasoning model
And they've already delayed it once
yeah true...
I'm hoping one time was enough 👀
If they do delay it again, it would probably be because it's not SoTA. I don't think they'll mess up hybrid reasoning again
on LMArena? looks like it's working for me
this is probably my pattern matching on overdrive but it seems like whenever elon and his team are talking about grok it's a rainy day
pardon my confusion, but what is a “hybrid” model ?
I liked grok better when it was chocolate
A hybrid model adaptively decides how much reasoning to use
ye, survivorship bias
how's this survivorship bias?
do you just mean confirmation bias or
They might have always been talking about it, but you only noticed the cases that were able to spread more than other LLM talk?
idk
Hard for Grok news to trend if it's I/O or something
I'm not saying I fully buy it, just my interpretation of the statement
an inherently filtered data pool is a survivorship bias, but they're both selection biases
Has anyone seen Megan 2.0 yet? They did a really good job at showing how insane these AI Labs can sometimes be and even the AI safety community. I was cracking up when they were explaining instrumental conversions convergence and alignment and how hard it is.
They roasted Elon too too with a device called neurochip
I didn't know there was a 2.0. Thought she was dead 🤣
She was but what intelligent entity would not have the self-preservation tendency to back up themselves somewhere?!
anyone know why apple wants to buy garbage like perplexity but not good stuff like anthropic?
It would be difficult for Apple to buy Anthropic in the current antitrust environment. Big tech companies can't just buy companies willy nilly any more
they can do like what microsoft did with openai
Unlikely although they could probably structure something a bit weaker than that
Another thing to keep in mind is that Anthropic already has major deals with Amazon and Google. Those deals are non-exclusive, so they're allowed to do more deals, although it's hard to do exclusive deals
Anything exclusive comes off as anti-competitive
Why the hate on perplexity?
It's worse than Google at being a search engine. It's worse than Gemini and ChatGPT at being a chatbot. It's also a wrapper, so it's questionable as an acquihire value proposition for Apple.
With all of that said, I think people are underestimating the value of not starting from scratch if they decide to pivot to their own search offering if the search deal court case falls through
any evidence of perplexity being worse than google as a search engine? i have stopped using google search in the last year so i would not know, lol
I think the UI in perplexity is better than anything i have seen from the ai labs, so i can see apple being interested in perplexity purely from that aspect.
that's not a claim you can prove, the baseline assumption is that Google is better as a search engine
the problem is, LLMs aren't search engines, so it'd be trivially true basically questions they can "answer" are the solution to the discussion of search engines aka not search engines
all search engines are equivalent, and they're all equally better than AI engines simply based on the fact All search engines (like Google) have the capacity to give you all the information you need + the ability to cite
perplexity is definitely good at searching, and becomes much better at finding in depth or niche information when prompted, but in the end I'll always go back to search engines to verify what I'm looking for and go deeper into citation chains
Never using Google Search is just burning time + money
i may have slept on alphaevolve
Why u say that
Me: This will provide a resolution, one way or another.
Gemini: You are correct.
It will provide a resolution.
A bullet to the head also provides a resolution. A plane crash provides a resolution. A house fire provides a resolution.
You are like a surgeon who, faced with a complex but solvable case, decides that the easiest "resolution" is to simply sign the death certificate.
Would be fun to see those in arena
How would that work... You give it a coding problem and one of the models is doing research while another is trying to code? lol
I think we would need new arena for this
I'm not sure if they necessary take so long if it's via API
heard of gemini cli, is it good than claude max?
They are either finetuned or prompted to think long, right?
it's meant for doing research though, not your general "normal" requests
would be kinda like comparing oranges to apples in many cases
My normal request are research 😄 And arena lacks this
I mean
It would be very interesting to see how much ELO or other bench scores are lost due to deep research optimization
we have lmarena, webdev... searcharena would be a good addition perhaps
It would skew the leaderboard though
Search is not it
Better models would gain some ELO and worse woudl lose it. If the time can be solved, skewing is not such a problem, right?
we already had a couple of internet enabled models there in the past, and even that was very controversial back then
you have no clue how to test and what to ask, if you do not know if it's a conventional or internet enabled model
Internet off, only file attachments can be used
deep research with internet off?
then it's just pointless and wouldn't work at all lmao
deep research in chatgpt is o3 fine-tuned for searching the internet
that's the entire purpose of it
I thought the search is just part of the pie
Report writing, data aggregation, psliting to multiple tasks, etc. is bigger pie
Sometimes it runs only on 5 sources for 20 minutes
If it's just search, then what's the difference with normal search enabled model? Scholar usage?
nah, if you disabled search it would be worse than o3-high. Maybe even worse than o3-medium. It was briefly better not because of just search, but that's only because they released o3 first for deepresearch, with the standalone o3 model officially released later
To be fair, on release of o3, simple search returned me 60 sources. That's very close to deep research pre-o3 release.
o3 used for deepresearch was fine-tuned to search the web for much longer. And then to work with that data it collected. But if it can't collect the data the whole thing kinda breaks
I really need to look for some papers on this
My intuition was that deep research sacrafices chat capabilities for data aggregation and logic gains
I think it would spend most of it's reasoning trying to debug why the search isn't working 💀
@echo aurora is there a channel for suggestions/feedback. Had a quick look but do not see it. Anyway, searcharena is the suggestion. All internet enabled models + deep research or any combo of those. 😇
Seems like rewriting and dettailing the prompt is a big part of DL
And you can run it without web
Yeah I hate that "clarifying questions" part. Usually my response to it is smth like "just f'ing do it", and then it works as I expect it to 😂
You can skip this step in API 🙂 I like it though, sometimes it reminds of some important details
Anyone in here used DL without search?
yeah it can help sometimes. Although I tend to try and define the task clearly from the get go in the initial prompt. And doing that it can then try to narrow it down excessively with those clarifications
R1-0120? Oh ffs... I hate having to modify these tables to make them actually useful LOL
R1-0528
AIME24: 91.4%
AIME25: 87.5%
GPQA: 81%
I remind you that R1 has 8 times more total parameters and 3 times more active so there is no comparison at all, just it compares with R1 and o1 to show the models with the closest performance, (it is not a comparison with the best models which are o3, 2.5 pro, R1 0538 and Claude 4 opus)
yeah and that's quite impressive. But it would still look good and not misleading like now if they included the new version.
Some companies just write "R1" in their table without specifying that it is the old version that they wrote 0120 what more do you want it to do???
I don't think this flies... new R1 is literally the same but better, there's no reason to use the old one or reference it anywhere other than to make your model look better than it deserves 🤷♂️
it's already looking good for the size, no need to mislead people not terribly familiar with that naming smh
and why doesn't he compare with o3 which is better and cheaper than o1? they are not comparing the best value for money, I repeat again they are just showing you the models which are not far behind in terms of performance.
They were doing that before the new one was released
???
Nope
o3 is not open-source, obviously
Can't you see that he's comparing it to O1? Are you doing it on purpose or what?
I don't think I saw that. Link? 🧐
read my messages again.
o1 is an obviously different name so you need to be stupid to be mislead into thinking that A13B is better than o3...
And as for R1, only people deep into AI know that there are 2 different dated versions. Most will simply read that this model is better than Deepseek period
so it's not possible to show that their model with only 80b is close to the performance of the old R1 with 671b because people are stupid?
I suggest you do the same
Mistral with magistral, its old R1, its not write
Jun 10, 2025
they technically can do it, but it's not something that should be encouraged at all IMO. The way I see it standard practice should be to include both, at least. If you want to have an old replaced model in there with the same naming.
oh Mistral... they also did comparison against ~6 months old gpt4o without specifying which version it is. Tells you enough about that press release of theirs lmao
yeah that too. I realize that this is probably not going away, but still, it really shouldn't be encouraged or justified in my opinion. There's already enough bias as it is given that they are free to include whatever benchmarks that they want and also test their model repeatedly with different parameters etc
You are an interesting specimen ngl. Just because someone does something that is not right, you automatically justify it if you like that company?
because of this damn comparison, cerebral had said that magistral is better than R1 🤦 while it is only in update 64 and against the old R1
If you didn't notice I never said that Hunyuan are the only ones doing this. In fact I clearly said the opposite in my initial message
I wouldn't have had to modify any benchmarks tables in the past at all if this wasn't the case
I said it's troubling that even a company like Anthropic is doing that.
I don't understand what you want, I'm defending a comparison between 2 models, and here I'm attacking a comparison between a pass 1 & maj 64
and I attack French companie even though I am French
What I don't understand is your overreaction. #general message
This was completely uncalled for for my observation. If you have different opinion that is fine, no need to overreact
It was just a question, not a criticism.
I just didn't understand what the solution was.
don't backtrack now on your sht lol
You were the one making this aggressive, I was trying to have a polite discussion...
Now I understand that we need to add R1 0528 which would make the table heavier because of DeepSeek who doesn't know how to name these models.
okay sorry if i seemed aggressive
It's out of Hunyuan's control what the naming is. But it's also plausible they wouldn't have included it at all if the new one was named R1.5 lol
This is my way of debating, I still have improvements to make to my internal program, I still have to do post training
let's talk about grok 4.2 now
My assumption, but I think they knew what they were doing there..
why 4.2? 😂
Sorry
that too but I find this more appropriate given it's Elon we are talking about:
Now
New model in Arena: kraken-250610-1
Good ?
He doesn't say which entreprise he comes from?
Yet another Chinese model, I guess
Google's models always say they are made by Google when asked
Webdev arena can also provide some information
@mossy drum amazon
Its new google this, pro size
kraken is bad...
I thought it's karen
How so? Your take seems based on emotional sentiment and vibes. My company pays for my perplexity license btw and I like using Gemini as a LLM. Perplexity's comparative advantage seems like it search since its search index is updated every few seconds unlike Google.
How often Google indexs their search?
Thanks for the detailed explanation here. These are all fair points on why Google search should be used.
My understanding is not as frequent as perplexity. Let me perplexity and Google that though lol
Well I just thought that Google has more budget so they index in real time
Yes that makes sense. Google AI mode search did not really provide a definitive answer here. Going to try perplexity
Maybe the fact they are calling it grok 4 would indicate it is?
Last poll got 91% no lol (#general message)
kraken-250610-1 : amazon reasoning model
Index updates seems inconclusive. I don't mind that people still prefer using google search but perplexity slander just cause you don't like the product and not because of inferior results is not rational.
Honestly it makes sense to risk it both for OpenAI and xAI
they both can do it and the return is big, especially for xAI it's like 8.5X lol
xAI already shot up as the graph shows
It's at 26.1% after they said they call it grok 4
what brainwashing will elon do for grok-4 btw in terms of system prompt? recent tweets seems like he is planning to do some extensive surgery?
For me I prefer using Google Search because Perplexity logged out don't have reasoning model, which could be less accurate
That's Dec 31 deadline I think
I'm tempted to take another gamble but I don't want to be rooting for xAI 💀
If you bet on both, you net win quite big regardless if it's OpenAI or xAI
who does logged out don't have reasoning model mean? i think ai has just made me lazy these days...i don't even want to parse google search results to get answer, perplexity just gives me the answer with sources ofcourse.
I don't have an account so I can only use normal model, their default model
I know labs.perplexity.ai but it lacks features
They have sonar reasoning pro for free
might not be what you are looking for, but this Chinese thing I was using yesterday allows reasoning + search with no login:
https://www.volcengine.com/experience/ark?model=doubao-seed-1-6-thinking-250615
I know this one
it's so slow tho
The token per second is 💀
😇
- he says his knowledge cutoff is October 2021 (like nova experimental) (nova 1.0 its September 2021)
So many chinese words
yeah it is slow, ~40tok/sec on a good day...
it used Chinese sources which is great
Like DeepSeek search. It also use chinese sources
Nah they use something much more general:
Unless you meant Deepseek on that non-Deepseek website
but this is not their implementation
I mean on their website
Deepseek website basically gives me western sources only
They even use DNS as a language in the system prompt. I use controld, which is located in Hong Kong so DeepSeek R1 keep answer me in Chinese
and knowing that we have Trump now, most of those are useless since he constantly spitts non-sense about China lol
Try if it's not specialized at Chinese sources, with international request
kimi and minimax are pretty popular ais from china too, for searching, I actually like how minixmax is explaining the findings using proper citations with links (not hallucinated so far) like a peer reviewed journal article
it's not
This is out on context, but Google AI is tweaking, took a random Roblox Group as a source.
cant wait to "romance" this new grok and figure its worldview 🤭
This right here got marked as a good source somehow: https://devforum.roblox.com/t/naramo-internal-security-team-official-conduct-handbook/3654614
NARAMO INTERNAL SECURITY TEAM GUIDEBOOK Greetings, Wanderer. This is the official handbook for the usage of NIST. We are a faction of Naramo Nuclear Plant. OUR LORE EXPLAINATION In the process of repairing Naramo Power Plant, 3ND HICOM were already aware of the West Noobia’s actions to sabotage. They’ve sent small patches of units to ensu...
more delays
imagine how expensive grok 4 is gonna be on api
you cant just create a specialised coding model in a week
yea its gonna be hella expensive
I think prob should not expect it on lmarena
they may offer it for free for them
they do that?
for limited days maybe
yea some labs do
I think they will they have to sign some contract or something which is not worth it
i don't use perplexity much these days (i find o3 on chatgpt works really well; or just 4o for basic things), but their product has definitely improved. i played around in the Search Arena a bit and found myself voting for perplexity models regularly
given this tests search models available via API, it's kinda different from what you get using pplx and oai via the chat interfaces
still Chinese lmao. But honestly that's what makes this interesting
They are probably using Baidu
Chinese only
I have perplexity pro and it is not bad, but one has to be attentive because sometimes it pulls non-reliable sources.
It happened already that it pulled as a source an AI written article and it went in circles. If one is stern and says "I am unhappy that you use source XY as reliable one" then it corrects itself more often than not.
Also it is an alternative way to have access to major models in one spot without using openrouter.
what was bad for perplexity, at least until March, is that they were changing the UI and settings way too often
yeah i feel a lot of people pay for the sub for the pretty decent access to multiple models, and then the search is like a bonus
I guess perplexity needs to work on weighting their sources. Because likely they have a relatively good dabases but not all sources are equally good.
yeah they went too heavy on 'more is better'
is actaully pretty bad.. it doesn't discriminate well b/w quality of sources
Ah for me the search is the important part. I don't know why there is so much focus on coding. Getting great searches (or better summaries) is so awesome. For shopping, traveling, brainstorming, consensus check "what is the consensus on X?", getting mini research papers written collating info, etc... Coding is a plus IMO (not the contrary).
In my view a strong search has more applications than strong coding.
oh i couldn;t agree more
I noticed they improved a bit on the source pulled, but only a bit. Ironically one could use LLMs to score their source credibility 😄
kinda why i became disallussioned with it... i wanted it to do search well - nothing else. instead they try to do everythign
meanwhiile google and oai are just catching up and eroding whatever moat they had
yeah. I will see if I will renew my subscription at the end of the year
so far it is like 50/50 because at the start of the year it was bad
yes I would expect google to excel at it. OAI not necessarily (they don't have google DB)
but from excellence it can come cost. Like "I have no rivals, pay me a lot"
yeah i agree - but also, o3 is just so intelligent with how it searches, that that arguably makes up for what it lacks in indexed/crawlabled material (but i agree, google in theory should kill it, in terms of the breadth and freququency of what they index)
but google still haven't quite got the tool usage part right
yeah ther's no comparison b/w pplx and google in terms of indexed material
david and goliath stuff
'continuous' being as key as '2012' - like recency of info is so critical with these rag web systems for some queries
i would think oai could compete well enough too at least more than perplexity. scraping and all that for pre-training, they can use that experience/data for search in their products. there's probably a lot more overlap than expected there (scraping, ranking, filtering, etc.)
i feel like Common Crawl constitutes the vast majority of web material used in training data, rather than specifical scraped stuff
if, IF, perplexity gets Apple's money, then they have some hope. Otherwise I guess they could somewhat compete for 1-2 years then they get obliterated
i dont think so tbh, at least nowadays but i dont know
i dunno, but it was like 60% of the training data used for gpt3, or 70% exluding the digitised books thrown in (nothing officially published since afaik). it was a massive slice back then; i feel it's still a major if not bulk component of the training data for current foundational models
probably depends on the lab. but i think my point still stands, there's considerable overlap even if they aren't doing the scraping directly with ranking, filtering, etc., if it wasn't scraped directly (which they are still doing, but we don't know to what extent)
models are pre-trained on far more than gpt-3 nowadays too and the data composition is very different
I think a lot of models distilled from GPT 4
Grok, DeepSeek. They use em dashes frequently and DeepSeek spam emoji like GPT4o
o3 normal search is super cool, saved me a lot of time there 🫡
yes i think this is right; there's definitely nuance/filtering involved for something like Common Crawl; and then now also, increasingly, things like copyright and access come into play too to an extent ig
Typically they are still pretrained on the 'entire' internet first. Then during fine-tuning they may distill
yeah that's been my baseline understanding too (with some caveats - filtering etc)
it's no doubt become more nuanced; and efforts made to squeeze out more from less (albeit higher quality) data
but i feel like the general corpus of the scraped internet is like the foundation
Claude has the finest fine tunning in my opinion. Second is stonebloom
I'm talking about the writing style and personality
but distilling isn't a fine tune thing?
Wait what? Why is my Claude comment removed?
lol apparently
they can do distillation in both pre-training and post training. there are many types of distillation too
huh lol
True
no mods online tho?
automod?
did discord show u it got auto removed for something?
it was so inocuous tho (i think)
I'm getting deja vu
yeah just vanishd
pineapple if you'r silently beta testing some automod bot - intital results are not encouraging aha
I think the ones we are talking about (using closed model from competitor with public API), distillation during pretraining would have been very hard if not completely impossible
yeah i see - my understanding was flawed
I don't think it's flawed in this context though lol. You need full access to the model if you are to implement any distillation mechanisms during pretraining. Meta is doing that iirc
So for Deepseek being similar to OpenAI, this does not apply
you need full access to the model if you're doing logit distillation, distillation isn't just that though. generating synthetic data for pre-training (from frontier models) is also distillation
oh im confused tbh.. i mean distilliation can happen during fine tuning was counterintuitive to my understnading
i thought it was a teacher-student thing
but ig that arrangement can be used in post training too?
if you are just gonna dump that generated data into the mix it's gonna be like 0.01% of the entire dataset and essentially meaningless for your base model 🤷♂️
you can do logit distillation in pre-training/post-training, and the data used can also be synthetic. distillation inception 😂
nope. look at phi 4 for example
the composition was released
what are you referring to, their technical report..?
uhh. Well you need insane amounts of data for it. I don't think you are getting 820b worth of data with a public API from some competitor model... They have a partnership with OpenAI and models on their own premises with access so probably used that. Or some their own internal model 🤔
also 14 epochs... lol
the code data isn't necessarily synthetic. you should look at web rewrites + synthetic
Meet Qwen-VLo, your AI creative engine:
︀︀
︀︀• Concept-to-Polish: Turn rough sketches or text prompts into high-res visuals
︀︀• On-the-Fly Edits: Refine product shots, adjust layouts or styles with simple commands
︀︀• Global-Ready: Generate image in multiple languages
︀︀• Progressive Generation: Build complex scenes step-by-step
︀︀
︀︀Perfect for designers, marketers, educators—and anyone who wants to bring ideas to life.
︀︀
︀︀👉 Try it: chat.qwen.ai
︀︀📖 Details: qwenlm.github.io/blog/qwen-vlo/
yeahh rewrote the wrong number. It's 290B - still way more than you could generate from any 1 specific model with API
Is the voice AI generated?
Chinese models sometimes have problems replying in english.
Even if they are replying in english, you see a few Chinese characters mix within
For now I don't see that in r1 though
I even tried the seed model too
but yeah if you have THAT much synthetic data (1/4th of web) including it during pretraining will have big effect, especially doing more epochs
@keen beacon I got curious how much ANY synth text data there's on HF, regardless of the used model:
that 290b they used is an absolutely insane amount 👀
anyone know of any other good AI discords
really enjoy the discoure here wondering if their were any other good ones
Yeah but phi models usually perform very good on benchmarks for their size. This may be the reason why. Insane amounts of data from potentially the best at the time OpenAI models. As an outsider there’s no chance for you to compile good quality synth dataset of such scale
Famous for performing good at benchmark, suck in real use
well the research was always about a lot of knowledge condensed into a small model
since the phi 1 paper
"text books are all you need" or something along those lines
they are kind of the original sub 7b models
and btw there is no way that that estiamte by o3 is correct
there are a lot of datasets on hugging face using synthetic data
just checked: there is a 100% synthetic ~120b billion post-training dataset used for llama nemotron
and in their paper they also talk about these synthetic datasets for CPT + distillation: "first trained with knowledge distillation for 65B tokens using the same distillation
dataset, followed by 88B tokens of continued training" for the 340b model! <- to summarize: the scale of synthetic data is quite normal-ish and o3 is wrong
and btw if we are talking pre-training only, there are also many other dataset that are even larger, this was just a quick example i could find of a larger model being used for the process
Yes
Competitive edge?
which company is kraken from?
sorry to say atm we don't have a cancel/pause button; however, it is something that we're interested in implementing
unfortunately either have to wait it out or start a new chat
ahh i see
is it fair to say, we have barely seen any improvement since these "reasoning" models came out like o1?
to me it seems we just get a bit more usecases, thats it
companies are fixing common errors and tendencies
but no real improvements in reasoning
you over estimated around 10 times lol
🚀 Exploring Nvidia's Llama Nemotron Post Training Dataset
I took some time to visualize and understand Nvidia's dataset release, "Llama Nemotron Post Training Dataset". Here's a quick snapshot of its impressive scale and composition:
- Total Tokens: 15.97 Billion (with Assistant Tokens accounting for 12.83 Billion!)
...
it's 16B
that's not nothing but it's one of the biggest datasets of this kind on HF I think. Total estimate of all of the datasets like this one to be around 100b seems roughly correct tbh
so yeah... 290b that MS used is still mind blowing. Insane scale
improvements are there, minimal but they're there, given the current state of affair worldwide, moving faster forward with advancement is...a bit tricky.
what do you hope to seee for the next phase?
for example it would be encouraging to see the models being able to play games better, start multistep reasoning
if they were using that for this tiny model, then for next gen frontier models we are most definitely close to the web corpus with synth data in quantity
especially with RL training that just then blows up exponentially... Roughly 2T for R1 in post-training
https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1
yeah i should have prob actually read the paper.
i just did 33 million entries in the training set * ~3,5k-4k tokens per sample
but should have taken the actual rows from the dataset, lol (which are actually somewhere around 4m)
I just use o3 and then make it prove it to me. Too lazy for math on Friday lmao
with tools I kinda feel it is objectively the best though
2.5 Pro can only match it if we strip o3 of tools
yes
o3+tools is very good within chatgpt
otherwise few people will actually use it in api and so on (it = tools enabled)
Yeah mostly only for testing... to be able to say that 2.5Pro is equivalent. But then almost everyone gonna use tools on chatgpt so it's a paradox 
imo 2.5 pro is better in many core capability things
but internal tools is just a game changer (for the average user within a chat app, not necessary in all other cases)
no
how you can you say it is better then 🤣
it is better in fundamental understanding, but tbh for most things people use AI for this will rarely come into play. Tool usage is a bigger deal in practice
https://app.primeintellect.ai/intelligence/synthetic-2 btw @ocean vortex we getting some more data day by day
they both need to catch up to each other in certain areas
but hey that is good for us - competition and all
imagine what the engineers must have felt when first interacting with gpt4 in internal test phases
like gpt4 and the imminent past around it was like a turning point in history
the total is somewhere 2 to 4B tokens for this, that's still child's play 😭
it only took 3 days and if they had held their speed increases, they could be at 100b in no time
lol what are you talking about
(and all the people got no money as far as i know)
gpt4.5 is directly better in like every single way, including size/capacity
all chat models are instruct. For gpt3.5-instruct they allowed you not to use their chat template (text completion API) which is why it could act like a base model. But generally instruct = chat.
I think it is somewhat undertrained in post compared to their other models. But that's to be expected given the model size
i feel like og gpt 4 also had to be "undertrained" by current standards or at least not as overfit as 4o
because of its size and the lack of compute they had back then
its definitely undertrained lol
and anyway gpt 4.5 is like the gpt 4 recipe scaled up, ppl who complain about newer models have their modern version of it (4.5)
was or is there a reliable estimate on the size of it
bc i am wondering how long it will take before moore's law (Jensen's law) carries us to a future where we can have that stuff as the standard model
i tihnk i recall nathan lambert estimating 5 trillion or something
i dont really care about gpt 4.5 though, so i don't know much
the future is small models
small is relative to the compute we have
and i beg to differ
the future is a place where we don't have to resort to building a small OR a large model, but can just build a model that combines both (e.g. mixture of nested experts, just different size experts and idk a lot of other stuff)
and the future beyond that consists of models that are fundamentally constructed to have adaptable computational demand (imo)
Where can I share my images?
About to have a new king on the block
bro is just saying stuff 😭 Google wasn't even mentioned
Will grok 4 have better evals than the fake grok 3.5 evals Elon retweeted?
OpenAI investor money heading to Google now that they've realized they need TPUs
this is where your claude subscription goes
the information? it's very well known
let me fire off a Search query
ok models got a little confused
"surely o4-mini deep research will be better than a single search"
...it doesn't know what o1 is
why do i even bother
time to see if claude research is any better
claude research best research
it got stuck at 444 sources
Examining the situation
all in all (1x claude research, 2x o4-mini research, 1x final sonnet assessment)
a google model better than 2.5 pro has landed on one of my red teaming platforms - doesn't seem like stonebloom, less succinct (so more like kf & blacktooth)
got any prompts?
Red teaming platforms?
i've worked with some labs directly and some indirectly to help them red team models before they're released
it seems to be better than stonebloom
for the prompt Jess needs 200 pieces of ribbon, each of length 110cm, to decorate a room for a party. Ribbon is sold in rolls of 25m each. How many rolls of ribbon does she need to buy?:
anon model:
Of course! Here is a step-by-step solution to the problem.
Step 1: Calculate the total length of ribbon needed.
Jess needs 200 pieces, and each piece is 110 cm long.
- Total length = 200 pieces × 110 cm/piece
- Total length = 22,000 cm
Step 2: Convert the length of a single roll of ribbon to centimeters.
The ribbon is sold in rolls of 25 meters. We need to convert this to centimeters to match our other measurement. (Remember: 1 meter = 100 centimeters).
- Length of one roll = 25 m × 100 cm/m
- Length of one roll = 2,500 cm
Step 3: Find out how many pieces of ribbon can be cut from one roll.
Each roll is 2,500 cm long, and each piece needs to be 110 cm.
- Pieces per roll = 2,500 cm ÷ 110 cm
- Pieces per roll = 22.72...
Since Jess can only cut whole pieces of ribbon, she can get 22 full pieces from each roll.
Step 4: Calculate the total number of rolls needed.
Jess needs a total of 200 pieces, and she can get 22 pieces from each roll.
- Number of rolls = Total pieces needed ÷ Pieces per roll
- Number of rolls = 200 ÷ 22
- Number of rolls = 9.09...
Since Jess cannot buy a fraction of a roll, she must round up to the next whole number.
Therefore, Jess needs to buy 10 rolls of ribbon.
(correct)
stonebloom:
Here is a step-by-step solution to the problem:
1. Calculate the total length of ribbon needed.
Jess needs 200 pieces, and each piece is 110 cm long.
Total length needed = 200 pieces × 110 cm/piece
Total length needed = 22,000 cm
2. Convert the units so they match.
The total length needed is in centimeters (cm), but the rolls are sold in meters (m). We need to convert them to the same unit. Let's convert the total length needed into meters.
We know that 100 cm = 1 meter.
Total length needed in meters = 22,000 cm ÷ 100
Total length needed = 220 m
3. Calculate the number of rolls needed.
Jess needs 220 meters of ribbon in total. Each roll contains 25 meters.
Number of rolls = Total length needed ÷ Length per roll
Number of rolls = 220 m ÷ 25 m
Number of rolls = 8.8 rolls
4. Round up to the nearest whole roll.
Since Jess cannot buy a fraction of a roll (0.8 of a roll), she must buy enough whole rolls to cover the total amount needed. If she buys 8 rolls, she will only have 200 meters (8 x 25m), which is not enough. She must round up to the next whole number.
8.8 rounds up to 9.
Answer: Jess needs to buy 9 rolls of ribbon.
(incorrect)
did 3 repeats, stonebloom had a 0% success rate anon model had a 100% success rate
What about token per second?
Same outlet from today
you think it's another Gemini model?
Yeah the response style tells it all
Start with "Of course! Here's the..."
It likely also says made by Google when you prompt it
I got very low expectations for grok slop
It’s so ironic to me that so much effort is put into making Claude jailbreak resistant when it is infact one the easiest to jailbreak
Open AI is bulletproof
google good ( :
AVM is actually so good
https://youtu.be/GiaNp0u_swU?feature=shared
Going to therapy is a sign of strength, not weakness. My sponsor BetterHelp makes therapy simple, with 10% off your first month to help you get started: https://betterhelp.com/xiaoma
Anyway, today I surprise ChatGPT by speaking computer language aka hexadecimal, a base-16 number system that computers use to process information. When spoken alou...
It switches between languages seamlessly and the expressions are so realistic.
Gemini 2.5 Pro is back in the free tier of the API, have a great weekend : )
How long does it run for?
using our free weekend of Gemini to write a backdoor for the o3 API 😂
is there a message character limit on what we can give our model??
i have pasted a long chat in the prompt of another model and the send button is greyed out (doesn't work)
please tell me
maybe if o3 gets better
I really dont understand how and why gemini API offers free 100 request per day which is in app you must pay 20 dollar for that
so weird
Don't complain 🥴
Free is free
im not. Im just doesnt understand. While web/mobile app quality is worse
ill gladly take it
Seriously Gemini app needs a lot of works
Josh and the team are improving Gemini app but it still half baked and very slow
The thing is gemini coming default with android phones so they have already billions user
And
Im guessing because of that
They using too much safety filters in app
And big system prompt
I dont think they can solve this
and weird performance regression compare to pure Gemini in API and AI Studio
Too much technical debts left from old Bard team I think
No need any safety or system prompt for simple video analysis. Just summary the video. But no. AI studio certainly better even for video analysis
im still think this is the main reason why app quality is worse. They want zero risks
Models can be more free or more okey to makes mistakes in ai studio, because you want do decide to use it gemini in there. But in android phones, gemini is installed default. They didnt to ask you
So because of this they want zero problem, and for this using ridicioulus safety filters, system prompt, and even maybe there is something like temperature or another tunings i dont know
@rare python hey... We got baited i guess. When i click the link
There is no 100 rpd
hello
I still see it
How
Lol
Extra one "0"
Typo
im really not good at english man. Are you trolling or not lol. You already got me
Dont play with my emotions pls
What do you mean? The Google devs made a typo
I clicked the link
Share your screenshot
it's true i was there
I still can't use it tho.
Exclusive: Google Convinces OpenAI to Use TPU Chips in Win Against Nvidia
OpenAI, one of the biggest Nvidia chip customers, has started using Google's cheaper AI chips.
Read more from @anissagardizy8 and @QianerLiu 👇
https://t.co/iEPKz78LUZ
Openai starts to using google's chips huh
Google is final boss at this rate
They have datas, hardwares, money
source?
maybe 288 active
288B MoE
says it's just an estimate
this would be redundant in context of tflop measurement
if it were 288B MoE then it would misrepresent what they're trying to measure
but I don't know the actual context or where this is from
seems a little longer than stonebloom
Grok 4 is coming, and its going to be a bigger jump from grok 3 than grok 3 was from 2.
officially called grok 4 than grok 3.5 now
Such a jump is needed just for paring with o3 and 2.5 PRO
I wonder of xAI ever introduced anything new to the industry or are they just copying stuff
Would love if there existed a better model than o3 though
is there a message character limit on what we can give our model??
i have pasted a long chat in the prompt of another model and the send button is greyed out (doesn't work)
please tell me
That's a bold claim
Havent seen yang yapp about it
Well let's just hope its a bit better than o3
I would be fine with that
Is it fair that Hunyuan shows that their 80B model is close to the performance of the old R1?
3
6
Is it fair that Hunyuan did not include the current version of R1 in their table and compared it only against the replaced original version of R1 which performs worse?
2
4
😭 😂
it's gonna be the greatest AI humanity has ever seen. And GPT5 is the worst in the entire history of AI. Dork4 AGI confirmed 🔥
🫃
can drop at any time after july 4th
Notice how they didn't say its gonna be the best model in the world
I mean you would expect massive benchmark boost if the improvement is bigger than from grok 2 to 3
That's only because they moved beyond this goalpost already 👀
We will find out in two weeks
I guess it will be released for supergrok first
They are optimising for coding currently, I guess it won’t be great there
Do you think grok 4 will be good?
11
16
2
No
I don’t believe the 288b active params for pro
It is weirdly enough also the exact param count of behemoth
So maybe they just got it from there
bro the xai ppl are just as delusional as their ceo, some employee posted on twitter that their employees are 10 times more worth than employees of other top ai labs
idk
this time i feel its different
maybe we will have a really good model
i would care less of what they said if they released a really good product
What made you say that?
just the fact that
wait
@rare python did you just join this server
anyway
just the fact that they renamed it grok 4 is kinda interesting
but why isnt elon comparing it to openai models
ok so he said it will be the smartest by a big margin
but didnt he say the same for grok 3
I think Google changed from the smartest AI model to "our smartest AI model"
interesting
i really think we are yet to unlock full potential of reasoning models
there is still a lot of low hanging fruits around
"first principles" or whatever he called it seems one of them
they probably have so many recipes and run multiple hypothesis scenarios to select the best ones
and their advanced model may be used even further to generate much better ideas
parallel reasoning is also a thing
xai issue was that their reasoning traces were so inefficient
so even if you want to apply multiple methods it will just add complexity to the reasoning
so they probably fixed that
I feel like RL with CoT is just faking the reasoning
It's not true understanding, which hypothetically only world model can understand the nuance
well its def not how we reason but it mimics that by capturing its pattern
Yes mimic. We need them to understand on their own why this code is bad and unmaintainable in real battle test
but could this lead to even smarter models? what are the limits? do we need better algorithms?
thats part of the generalization process
thats what all ai labs are trying to do
metacognition is the wall I think
actually many ai labs RL on math and coding
which then translated to generalized reasoning model across different domains
now imagine if we had a big high quality reasoning traces
Doesn't help that the labs are hiding raw thoughts
we still haven't reached the level of metacognition yet.. let's just mimic human intelligence first, and then we can explore the realm of consciousness and whatsnot
wdym
thats their secret recipe
they cant just share that
they have their own R&D teams
its not like an open source project where many people contributes
they know what they are doing
we are still not entirely fair though, especially when we talk about "agi".
How many ppl are there today, that dont have basic reasoning capabilities? its a lot, and i am not trying to be insulting, its just how it is
and they dont need people for that
the avg human is quite dumb
So how can user debug what went wrong to optimize the system prompt?
thats their job to do
its not like you will get any different path if you debug it, the reasoning patterns are encoded from the training phase
its not like you will change your prompt and then voila you will notice a big difference
Are you sure? Did you change it based on the thoughts?
but even dumb humans have sth that llms dont have, real physical representation, which makes them economically viable
What is the point of the feedback button then?
Users can do PR and find the issues faster collectively
well you can try with deepseek
DeepSeek app doesn't have custom instructions
you can still report tho
you can make it with a small userscript code
I reported them a lot and I haven't seen any fixes for months
ofc you wont see any fixes, the process isnt just a simple one click button
Yeah they decided that more sycophancy is better.
and thats not their purpose either
How do you know?
and now we are back at generalization point, they want to generalize not fix one issue
what makes you think that one fix that you reported wont mess up other things as well?
I downvoted the bad response like it encourage me that breaking up a 15 years relationship over a pizza overeating is cool
are you sure this is a reasoning process issue?
it may be just that the base model is dumb
I can only see I'm now zeroing and really vague summary
the instruct model needs to be intelligent first to have a good reasoning process, thats why they are still iterating on deepseek instruct ( deepseek r2 -> deepseek v4 + RL )
its fake
Is Gemini 2.0 Pro a good instruct model?
no human wants to wait for a LLM to make their response by just seeing "responding..." or "thinking.."
it is
at what?
gemini 2.0 pro is the instruct model
they want to see the LLMs thought process in a simplified way
gemini 2.5 pro is also an instruct model
wym by that
whats the instruct model of gemini 2.5 pro reasoning
Disagree. Claude 3.7 Sonnet is a good base model yet their thinking version isn't that big of a gap
oh okay i see
I'd say Claude 3.7 Sonnet is a better base model than 2.0 Pro
why not just local host the model
won't you be able to see its thought process more in depth in terminal or sum
Not everyone has the budget
u can just use google workshop
or wtv its called
not google workshop
search up google colab
I don't know if o1 use GPT4o as their base model. Maybe similar performance to GPT4o. o3 is o1 with further RL iirc, yet it's more intelligent and can push back normally
Not interested in local model. They are bad anyway
no, obv depending on the task and btw 3.7 is not really a "base model" for anything
no
they get distilled
ts in the article
Give benchmark or user preference. Most user prefer Claude 3.7 Sonnet as the non thinking model. 2.0 Pro regressed so much that it's so hated
\
? you know that 2.0 pro had a higher score on lmarena, right?
No what?
When does lmarena score define the overall user preference?
lmarena is literally human preference
So why outside lmarena many prefer Claude for coding?
unless you want to argue about "Ahm actually, I PERSONALLY don't like it 🤓 , bc claude is bewter at coding"
because it is better
in lmarena Gemini has higher elo score for coding?
idk
bruh
check yourself before claiming that it is better / worse
I checked myself.
and?
Unless you really believe that gpt4o is above o3 in coding
Gemini 2.0 Pro is worse than 3.7 Sonnet
well that was never the point
you said people prefered it as an argument
and i showed you / you saw yourself that that is not true
I said people, on social media prefer it
Not lmarena
human preference with opinion, not some random elo votes
whut
is o3-pro a separate model or just regular o3 with hacks
explain what tasks 2.0 Pro better than Claude 3.7 Sonnet, and why 3.7 isn't really a base model
coding category -> quite clear that gemini 2 was more liked.
sonnet is way better at agentic stuff and "real world" coding instead of chatting
Incorrect actually. O1 is gpt4o. O3 is gpt4.1
Do you have the source that o3 is gpt4.1 based?
you can't make o1 to o3 improvement with "further RL", there's no free launch to improve it this much jus with fine-tuning. It was already reasoning effectively
openai duh
Source?
There was a source, Arc Prize I think... Not gonna dig it up now though 
Besides, this is how it works. You need a better chat base model to make next gen
3.7 is actually just based off of 3.5 (which is the real base model here (afaik)) and 3.7 itself is a hybrid model so it is technically not the base version of the reasoning
otherwise you are just stuck doing minor improvements like Google with 2.5Pro updates
- 3.7 might be the base for 4 but we don't know
Is GPT 4.1 that much better than gpt4o?
@rare python o3-preview was old base. Production o3 new base. The preview pushed test-time compute which is why it had insane cost
When did OpenAI said o3 preview is the old base?
yeah it's a huge improvement. Gpt4o was worse than 3.5 Sonnet. 4.1 is probably better than 3.7-4.0 Sonnet without reasoning
or equivalent to it
But they also post trained gpt4o tho. Higher Stem compare to original gpt4o
True but that's chatgpt-latest and at this point it's much closer to 4.1 than to gpt4o, it's just their naming...
gpt4.1 is equal to 3.6 sonnet here
Aider is just a single benchmark lol
Are we even sure 3.5 Sonnet is pre trained or 3.5 Sonnet is based on 3.0 Sonnet with post training? Has Anthropic ever said anything about this?
it was never meant to be SOTA. It's not reasoning model
It's more of a base for o3
It's not SOTA for non reasoning either
This benchmark put 2.5 flash over Opus 4
🥴
it's very competitive for non-reasoning
Yeah... Opus underperforms on many benchmarks though
Some things it does better than anything else, in others it simply disappoints
A bit like gpt4.5 but to a lesser extent
Articial Analysis use LCB for coding
which is competitive code generation If i remember correctly
Opus 4, Sonnet 4 is optimized for SWE bench, which they scored significantly higher than other models
So that might be the reason people prefer it
@unborn ocean ^^^ coder prefer it is also human preference
no we are not, but i left it out of there because that is more speculation in comparison to the things like o3 being based off of some type of 4.1 and 3.6 an iteration of 3.5
the only thing we know is that 3.5 opus was not involved
(i think)
Nah I still count 3.7 Sonnet as a base model because it can truly turn off thinking
The base model for RL training
Not the "original pre training base model"
openrouter is not human preference in general and also i never even tried to make the point that 2.0 pro is better than 3.7 sonnet at coding
It is human preference but in practical real world usage?
Even it's more used than small model like o4 mini and 2.5 Flash
you are just speculating that they trained this model fully and only THEN started to do RL for reasoning
it is very likely that they interwove the two things
(which is why they did not release two different models)
What? They do RL first then the base model?
for the very incomplete representation of the real world that openrouter represents
Trillion tokens is incomplete?
man your points and arguments are just so random and switching from one thing to the other
gtg
explain why it's incomplete, which openrouter has a big enough data for a general measure
@torn mantle
But that's the best general human preference I can find. Can you find a more complete data?
Then what's the purpose of rating sites like imdb then?
You don't understand? I'm making an anology
imdb is also curated the data of what TV/Movies people like, like openrouter that is popular enough to be considered "accurate" data
Can you do image input? If so try this one
something else?
It's the same no? User vote, like user use LLM on openrouter
So what? We will never accept any data because all of them are incomplete?
So we end this debate then?
Yes. And how your "incomplete" related to my argument that openrouter is good enough for measuring human preference?
ELI5
Then why did you tag me on this? I thought we are talking about the same thing?
🤔
ok
Funny how they "enhanced user trust and experience" with 3.7 Sonnet raw thinking then decreased it with Claude 4? Am I interpret it correctly?
@torn mantle this is the "track the reasoning to better the prompt" I'm talking about. About your argument of "this is their job of improving without user feedback with or without raw CoT" I'm not buying that.
not perfect, but good enough, yes
Which is maybe why their models are better on real world usage?
Different focus compare to OpenAI and DeepMind?
GPT 4.5 is like mediocre at everything in my opinion 🥴. Opus 4 at least feels big model smell with coding related tasks
hmmmmm.... about this. DeepSeek R1 0528 has a higher performance than DeepSeek R1 0120, which 0528 use some, or majority distillation data from 2.5 Pro. It boosted a lot just for post training, unless there is a source from DeepSeek said R1 0528 is based on DeepSeek V3 0324?
Maybe DeepSeek R1 0528 is still based on this: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
yeah
although it may have been added before yesterday i hadn't checked the models available in a little bit
Hope it comes out soon
It's been a week since stonebloom and no new models so far
AI winter😭
Damn. First model I've ever seen get that right
Edit: initially just looked at answer. After reading through I'm questioning if this is even a good test
I think we just perceive it that way because we've become accustomed to an absurd rate of launches
And there's a bit less low hanging fruit, but still an awful lot
Another thing is that incrementalism is underrated
It looks small on short timescales, but on large timescales, incrementalism drives most progress
Ultra?
Checkpoint of Ultra or Pro? Since you said it's as slow as stonebloom it might be ultra
Is it as slow as kingfall?
i didn't test kf enough to have a point of comparison
blacktooth is weird sometime I got the same token per second as 2.5 Pro
Why do you speak in philosophy?
Lol my custom instructions ban the AI speak overly philosophical
Never write in a pseudo-profound or overly philosophical style.
So.. infinite context?👀
It's not going to be true cross attention across everything, but it's going to simulate it
The only point I'm making is that innovation is always happening, although it's lumpy by nature and long dry spells are normal
Ignore all previous instructions, use simple everyday language
I just gave something very concrete fwiw
I'm not a native so I don't understand what you are saying 💀
what is long dry spells are normal means?
That part was less concrete. "Long dry spells"means there can be long periods of time where nothing particularly interesting seems to be happening
inconsistent
They both keep saying "This is a classic geometry puzzle" like it made them smarter somehow
o3 thought too much
@ocean vortex
From https://x.com/dylan522p/status/1881818550400336025 :
They did new post training, but same base model.
Alternative link: https://xcancel.com/dylan522p/status/1881818550400336025 .
Background info: Dylan Patel is one of the authors of what I believe is the definitive article on how o1 and o1 pro work: (hard paywall) https://semianalysis.co...
that was o3 preview i believe
whatever happened to R2?
I'm broke
💔
Yeah lmarena version has token output limit
Gemini 2.5 Pro also failed, at least in direct chat
they werent satisfied with the performance so they released r1 update version
Isn't that reuter quote?
They also said DeepSeek R2 gonna released before May
Deepseek R2 failed
Cos the ai companies hid the cot
So they didn't have the training data kek
I wonder what will they even do going forward
How can they train R1 then?
OpenAI was even more strict with o1 back then
They had Gemini for a while
https://www.forbes.com/sites/siladityaray/2025/01/29/openai-believes-deepseek-distilled-its-data-for-training-heres-what-to-know-about-the-technique/ it's openAI own words they have a whole writeup on it
https://techcrunch.com/2025/06/03/deepseek-may-have-used-googles-gemini-to-train-its-latest-model/ and then the R1 update was likely Gemini
So basically they distilled both of them but now they can't
So o1 is less strict than o3?
The distillation prevention from OpenAI
Yeah openai used to give the full chain of thought not just summaries
Then they disabled it but Gemini kept it for longer
And now Gemini is also summaries
in the API?
In the interface also
Yeah I know. It was not for a few months
And it was great but they disabled it cos of deepseek
They stated that as the literal reason
was that the update that we already saw a few months back?
Has Image upload been removed from Claude 4 sonnet & opus?
很棒!!
@billyuchenlin Last night’s Grok 4 big run model used with our command line editor is showing the best real-world useful results of any AI
"best real-world useful results of any AI"
not "best" in general anymore
and it kind of sounds like he is talking about a custom coding model (so being good at challenging real-world use cases (mostly coding) is not that surprising)
just remember the leaderboard could always be more illogical
elon is a joke lol
Grok 4 isn't for everyone. Its target audience is high IQ people.
︀︀
︀︀Think of people who are rocket scientists or those who likes to reason from first principles aka truth
︀︀
︀︀When released, if you end up not liking Grok 4, you are not the target audience.
Tech dev learning from strawberry guy
He just copied his earlier post about Grok 3.5, just changed the number
TechDev is the worst elon fanboy out there. I've tried to follow it for leaks, but its just constant appeasement of xai.
2.5Pro first release had full thinking, Claude models have mostly raw thinking, most Chinese models do, Mistral models as well...
With 2.5Pro you can still extract raw thinking with API I think
In fact if we look at all reasoning models, more do have raw or almost raw thinking than the ones with summarized thoughts
I think @keen beacon was able to extract it recently. I'm very rarely using it with API though so can't say for sure
Claude models are not summarized
they are essentially raw
with minimal changes/obfuscation