#general
1 messages ยท Page 72 of 1
OpenAI is still at the top (number of users etc) but my concern is that the pace of innovation and improvements from OAI has gone down a lot.
Not sure what's happening to them?
ChatGpt literally just scaled BERT from Google. They were not that innovative to start
I think that is a bit simplistic view. Yes, they got inspiration but they were multiple years ahead of others because they focused on it and others didn't go deep enough.
Good point
Just compare google Gemini model from 1.5 years back to chat gpt model. It felt like google will never catch them
They also did interesting things with RL before all that with curiosity and the cube solver robot
Yes. But now it feels like they won't be at top in coming months and quarter
With the LLM releases half of it is just timing
While the other companys appear to be crushing them, o4 may nearly already be finished
Arena glitches again
They are not currently timing anything, they are desperately trying to get stuff out but it's all delayed somehow
my chat history just got del?
No but I mean if the release dates are misaligned then it will appear they are losing when they are about to drop something big
Grok 5 Gemini 3 is releasing end of the year. They don't even have that long for GPT5 to have the lime light. openAI cadence has always been a new gpt every 2 years or so, they haven't been able to match the speed of the other labs ever
Even o3 was cooking for a long time
Relative to Gemini 2.5 and grok 4
The betting odds think by Dec 31st GPT5 will just lose the lead
yeap
I'm still pretty disappointed by the open source delay ngl
The gap between gpt3 and 4 and o1 to 3 was significantly shorter
You mean deepseek?
Right, more like delayed for lack of performance
rightโฆ
The betting markets are extremely pessimistic on openAI. Like I am too but they are to an extreme level, like it thinks GPT5 only has a 30% chance to top LLM arena (August 31st bet)
Kind of crazy
I don't really know why
Unless someone has some inside info I don't see what would suggest that
probably because it's gonna release after august lol
What about o4?
I thought gpt 5 was just going to be a model router?
The Dec 31st odds are 19%
that would be stupid tbh
Are we even making progress with LLMs in terms of reliability and use case intelligence
it shouldn't take that long
Well yeah
yeah, but I don't feel we'll get to AGI with current LLM designs
It just seems stagnated because openAI held the top and isn't moving whilst the other ai labs are catching up from under them. So SOTA hasn't moved just yet
If you look at it as an ensemble of LLMs from labs you can see it's moving upwards
Quickly
It just seems we are going to be paying so much more for each SOTA
what do you think will allow models to gain spatial awareness?
Until it will be ludicrously expensive
here come the chinese
until they close source their models in the future
They never will, it's in the strategic interests to undermine Western ai labs
that's why I said in the future
Even that doesn't scale well, soon models will be unrunnable on consumer hardware
when they take the lead too much
I don't think they will. It'll remain close for a long long time
consumer? even now you can't really run the sota models on "consumer" hardware
There's no need to run them locally ngl. Just use an API
unless you mean a 10k consumer pc
Exactly. So your still paying the western cloud providers
Not open
But the bigger these LLMs get the more compute you need driving up the cost inevitably
I feel if/when the chinese develop strong GPU or TPU alternatives is when their innovations become overwhelming
more than now
Still, scaling won't last
eh, they'd find a way to keep the size small if they innovate
"small"
But nobody's trying to do that
yeah, especially xAI
You want everything for nothing is the issue
but at the same time trying to make efficient models benefits the companies as well
the costs would become less
That isn't possible in the laws of the universe
to run the models
Humans are smart without costing much at all
The LLMs are cheaper than a human
If we can replicate the structure of the brain that's AGI not llms
And scale
True
am afraid even with that it's still not enough, it's doable just with llms, but it wont be sustainable or scalable
it's definitely not doable with a glorified text completer
Why not
if they don't change the architecture, it's over
true
We need a super simple neuro symbolic architecture
but architecture alone wont help much either, a few sensors need to be integrated properly and more...
We can reuse multi headed attention
because we havent figure out the connection between high or even super intelligence and consciousness, this is one of the frontiers of current ai research
I don't understand why we need consciousness for an intelligent system
then I'd like to suggest a deep dive into this topic
I don't actually believe consciousness for an AI could be achieved
even when we actually don't know what consciousness is
it could act like it, sure, but it would just be acting
non-biological lifeforms will have different forms of consciousness is what I've been told by current research, update me if you have the newest insights
Humans are neural based right? Our brains are based on electrical signals. For all we know chat gpt is already conscious, just in an unexpected way
like a psychopath acting like he has emotions
Exactly what I meant
would it actually be consciousness at that point?
if it's make-believe
Exactly
I don't think we need consciousness, just a machine that acts exactly if it was
I've held this view for the past 10 years, even before transformers were a thing
It's always 'doable' in that the brain is just an extremely sophisticated maths function
What else was the original neural network framework based off
we still don't know what human consciousness even is, you think they're gonna achieve it on a computer?
MRI scans
Sounds cool but people dislike ads/self promotion here. Is this free or paid?
paid
@echo aurora
Really, just don't do it. It's getting annoying.
Back in my days, where gpt2-chatbot was a thing we did not have this problem.
I wonder why it has gotten more frequent
More people to promote to now ig
that seems the trajectory yeah, in terms of SOTA/frontier models (obviously there will be concurrent efforts to develop more performant small models too) ..
posted this here the other day (asked most of the 'Deep Research' services to estimate total development costs of grok4, 4.5 and 2.5 pro) #general message
Crazy
That's worse than I thought
10 billion!
https://youtu.be/hqB6emwQ-64?si=F8cHIDKlj7zhNi3I&t=174
talent matters, data matters, but the vast majority of the expense here is on compute
Jack Clark is the co-founder of Anthropic and author of Import AI. He joins Big Technology Podcast for a mega episode on Anthropic and the future of AI.
For more on the podcast, please subscribe to my newsletter on LinkedIn: https://www.linkedin.com/newsletters/6901970121829801984/
More episodes like this on the Big Technology Podcast feed:
...
yeah perhaps, made even more staggering when you consider none of the leading model developers have made a cent in profit yet
but the companies / investors behind it are obviously making big bets (not polymarket lol)
that AI will be the future
it's a reasonable bet
Interesting
yes. we will
They could invest in flying cars and neural link tech
yeah there's always things to invest in (or.. they coud just sit on cash.. or dish out massive dividends... do stock buy backs.. whatever) - like the massive AI-related expenditures by the Mag 7 and other cmopanies isn't because they're short of ideas about what to do with their capital aha
i think it's more of an arms race now.. some of the CEOs prob have v high conviction that AI will be like the next industrial revolution; other CEOs perhaps don't have that same conviction, but don't want their companies to get left behind if they're doubts are proven wrong
Only the biggest companies of the biggest countries will survive
The only thing LLM can do is aggregate data and use tools when asked to. If you really believe current llm can lead to agi theres nothing I can do for you
All thats left after agi/robits, are the companies who are part of the production process , the rest will get automated
And what do you do human ?
Aggregate data in school, apply said knowledge by using tools/actions
There are very few jobs that cant be automated by a smart enough llm + tools + robot to controll
their agency is constrained intentionally by design, if you give them "freedom", things might look different, there are agents on the market with more "proactive" approach, doing things unprompted, semi-autonomous
we should banish 'AGI' from general discussion about AI.. it always throws everything off.. like yeah AGI whatever it is exactly would be great tomorrow. But for now, I use LLMs everyday - they're amazing, have transformed how I do my work in a way that no other technological development has in my lifetime (internet was already around).. maybe i'm being shortsighted.. but i get more out of discussion about existing LLMs rather than speculation about 'when AGI will be achieved' etc
not really directed at you personally btw lol
just a bit of a rant ha
AGI would be the worst thing to happen to humans if the distribution of value / the system remains as is
"Grok4 is benchmaxxed and overcooked" thanks for invalidating the scaling laws Leon and zuck! https://www.interconnects.ai/p/grok-4-an-o3-look-alike-in-search
Is there still no official blogpost from xAI about Grok4?
Also no MMMU, SimpleQA...
oh ok there's this, just not indexed by google https://x.ai/news/grok-4, no SimpleQA though
At that point, they might as well just buy back shares until a more profitable investment opportunity comes around. For big companies, it only makes sense to invest in things that they can plow a lot of cash into and continue to get high returns on additional cash invested. That rules out a lot of investment opportunities that might be good businesses in theory but not good enough to justify allocating engineers to when those same engineers could have been allocated to one of Google's 15 other businesses with over half a billion users.
k2 added to lmarena yet or nah?
hey yall any pc/laptop professionals here? i wanted to ask a question but it's not related to ai or anything lol
maybe they can help, just ask
ok so heres my issue
My Huawei MateBook D 15's trackpad coating is flaking off bit by bit โ I tried cleaning it thinking it was dirt, but that worsened it. No obvious film to peel, and I'm wary of scraping it. Couldn't snap a clear pic due to glare, but it looks like this:
The image is not huawei matebook d15 but a similar issue to what I have
also, sorry for asking this as most of my requests on other discord servers related to this are jerks :((
not in direct chat at least
now the real question : is it still working?
if you clean it with something harsh it will probably just make it worse
also there is something called 'trackpad protector'
You can just replace it. Did this once on Macbook Air wasn't super difficult. Trackpads are not expensive on ebay:
https://www.ebay.com/itm/315614481538
but if the trackpad isnt working properly then you may need to replace it
Other than that not much you can do, it's just worn
Is it possible to scrape it off? The trackpad works fine but aesthetic wise, its not noticeable but the feeling of touching it icks me off
hmm I'm not sure what they coated it with. My issue was more of a faulty trackpad. You could try like acetone it will probably take the entire coating off, but you may end up with a trackpad that's more grippy then...
They presumably coated it so that your finger would slide over it more easily
just buy a protector if its working well
its not that bad tbh
its not?? im just overreacting i guess ๐ญ thanks for the help tho
I have an old asus laptop, peeled way worse by now. Still works. A protector will do fine as others suggested; replacing it is not too bad either they arent expensive.
Lastly you can just straight up ignore this, it is mostly cosmetic and it will not stop working anytime soon
They should really just get rid of those coatings entirely and use glass trackpad like Apple tbh...
This gets as much use as a touchscreen of a phone, no plastic of any kind with any coating is ever going to hold properly ๐คทโโ๏ธ
Tbh the apple trackpad is very nice
What feature? Do you mean Companions?
Is Grok 4 comming to lmarena top?
Yes finally
Introducing Kiro, an all-new agentic IDE that has a chance to transform how developers build software.
๏ธ๏ธ
๏ธ๏ธLet me highlight three key innovations that make Kiro special:
๏ธ๏ธ
๏ธ๏ธ1 - Kiro introduces spec-driven development, helping developers express their intent clearly through natural language specifications and architecture diagrams for complex features. This comprehensive context helps Kiroโs AI agents deliver better results with fewer iterations.
๏ธ๏ธ
๏ธ๏ธ2 - Kiro features intelligent agent hooks that automatically handle critical but time-consuming tasks like generating documentation, writing tests, and optimizing performance. These hooks work in the background, triggered by events like saving files or making commits. Itโs like having an experienced developer constantly reviewing your work and handling the maintenance tasks that often get delayed.
๏ธ๏ธ
๏ธ๏ธ3 - Kiro provides a purpose-built interface that adapts to how developers work. Whethโฆ
I still hope Gemini will be updated in Direct Chat
I'm kiming on it

k2 sucks
trust me bro
Low taste user
what
it felt like Opus 4 for writing but
I dont think its smart enough as Opus
Im talking about non reasoning Opus 4
Is there any easy way to use K2 base (like the raw completion model) short of downloading and running it yourself?
Ohh Grok 4 just kicked gpt-4.1's ass on my end. Can't wait to see it on the leaderboard
God damn this is a good model. Just beat Gemini 2.5 flash as well. They cooked
For me, Grok 4 seems as good as Gemini 2.5 pro.
Interesting. I never used it myself
Had only access to Flash which was okayish, but Flash performance is going to feel like ancient history very fast I think
New model in Arena: ernie-x1-turbo-32k-preview
i dont think thats new
ive seen it before
hoping lm arena will not come to the same conclusion as this unnamed other leaderboard
what the helly
what was it called again
starts with Y
So kimi is actually good?
๐ฎ
yup
i think it is - you can't go wrong distilling from o3
yes if you have 8 h100
@echo aurora we want this pls
https://discord.com/channels/1340554757349179412/1384586910726357042
new model kimi-k2-0711-preview
deepseek is growing? why
Maybe not for long. Moonshot went from 0 to 1.7% in a week lol
also whats the catch on kimi free version?
Rate limited I think
claude 3.5 sonnet, whats going on here...
openAI odds are 21% for both august 31 and December 31. The market genuinely thinks GPT5 isn't going to be good enough to top LLM arena
Would be insane if it turns out to be the case
kimi distilled from claude's?
for google is it gemini
Yuchen said openAI open source model requires to be retrained
๐
It's going to be a long wait
"due to some (frankly absurd) reason I canโt say"
It's not related to Kimi or safety. It's some other major internal failing
google is good
link?
in the same thread, he said there were checkpoints they could retrain from, so it is likey not going to be a full retrain
he also said the issue was "worse than MechaHitler"
Rumors that OpenAI delayed their open-source model because of Kimi are fun, but from what I hear:
- the model is much smaller than Kimi K2 (<< 1T parameters)
- super powerful
- but due to some (frankly absurd) reason I canโt say, they realized a big issue just before release, so
reminds me a little of the reflection drama where matt schumer claimed he needed to retrain everything ๐
it's over
Another openAI L lol
chat I need some help
Will Gemini Be Much Better At Coding In The Future Because WindSurf People Joined DeepMind?
8
10
1
yes
Hi guys. Can someone help me with cloudflare problem? i cant enter the site
wait are they actually announcing a model that doesn't exist yet another time? ๐ค
I too had this issue for a moment, but after refresh it seemed to be working again. I assume this isn't the case for you?
It not help i refreshed the site so many times and nothing
I been haivng this problem for the past 3 days
I'll followup in the forum post you made 
@ajtourville @xai Worth noting that @xAI has been and will open source its models, including weights and everything,
๏ธ๏ธ
๏ธ๏ธAs we create the next version, we open source the prior version, as we did with Grok 1 when Grok 2 was released.
when
My bet, they don't fulfill that promise, at least not any time soon, maybe year+ down the line or something like that
๐
lol
correct
"we open source our older models, that cant even compete with any available open source models"
how generous of you musk
grok 3 and beyond maybe
once its completely irrelevant i suppose, which it almost already is
When do you guys think Grok 4 is gonna be scored?
soon!
Kimi K2 feels worse than gpt 3 in the programming language I use. The programming language is obscure but I still expected better then for it to write straight up incorrect syntax and hallucinate functions on a task that is intended for people who are just starting
gpt 3?
the same one that was just autocomplete but larger scale?
Probably not on python but on the programming language I'm using yes
funny
i've heard people say it has decent knowledge of niches but i guess yours didn't make it into the training data in high enough proportion
what language?
eta
grok4 missing seems notable
Needs to get enough votes before they add it
<@&1349916362595635286> advertising?
Go away with your chatgpt resume
Not the HR department of Deepmind here
Did Lentils advertise?
๐ค
Arena glitches again
Model selector is empty
for me it'sback
Back
Kimi k2 output formatting is really incredibly similar to o3
How could that be ๐ค
We will never know completely unless they open source their training data
But from eqbench.com, k2 slop profile is similar to OpenAI models
@echo aurora add exaone 4.0 32b
Blog https://www.lgresearch.ai/blog/view?seq=576
Technical report
https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf
HF
https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B
hello
very good benchmark but the license prohibits commercial use ๐
this can be applied to many things ๐
no1 knew about LG ai research
but they always produced solid on device models, etching out qwen etc.
on benches and in practical use
i knew it, it's been on my list to explore for a long time, still didnt manage to do it properly, but I've heard good things about them
@echo aurora , it's the same as this for me . The cloudflare doesn't seem to be loading
#1394446720116723773 try the second site
But i have this problem also on the second site
Legacy one?
Yea
Sadly, lmarena is next to unusable after switching to the new interface. The problem is all of the following: Cloudflare, errors >50% the time ruining the conversation flow for no reason (I guess the reasons are both bad bugs and very bad premoderation), bas scrolling on mobile (can't select), bad CSS, what else... I'd 100% stick to legacy version if it had the latest models
Cloudfare problems are probably your browser and/or cloudfare itself related, you would most likely have the same issues on legacy tbh
But other than that I agree, most of all I hate the fact that if you use arena as they intended, you will now have a one continuous chat where every chat turn is using different model with all earlier messages flooding the context. It's just a mess and doesn't even allow you to conveniently test the same task/prompt on numerous models.
Feels a bit more like random chatting than the actual model testing 
I actually think that it's a good idea to have other models after a vote, but with the current interface (no notes about that) it's always a surprise the first time you realize that models change.
One really bad thing is that I can often send a prompt, have an error, and not be able to copy my prompt, and a large part of it will be hidden behind the left border of my mobile screen. If I try to copy my message, the whole chat gets copied.
I have no problems with Cloudflare except for two: 1) Is it necessary to show a captcha every single vote (or what feels like it) when I start voting? And then I have to wait maybe even more than 20 seconds. 2) Same for just visiting the page. Maybe one captcha per week would be more than enough? Of course, while asking for more with suspicious behavior, but it's easy to notice that.
Models changing is fine since this is battle mode. But with new interface what some other model wrote in the past is gonna be saved in chat history for a completely different model in new battle as it's own output and that's a problem... Each distinct battle should be new independent chat tbh
I'm not sure how they are even rating multiturn category presently. Surely this doesn't qualify as one when all of them except the very last are mixed model messages that don't belong to it...
In webdev, what's the "Generating" part is? Reasoning?
Really sorry to hear about the issue, if you wouldn't mind sharing in #1394446720116723773 thatd' be ideal (if there are more details to share)
New model in Arena: clownfish
Is it known which company made them?
havent tried all of them yet
octopus...๐ข i have slightly...a hunch...
How good is ?
- Clownfish
- Nettle
- Octopus
- Cresylux
clownfish - good
All say R1
(Apart cresylux he say je frol meituan)
clownfish - maybe not from google
all the 4 models are likely Chinese ones
clownfish - r2?
@torn mantle
interesting
so maybe base model + reasoning model from deepseek?
v4 and r2 no?
@torn mantle You can measure their response time and their knowledge cutoff ?
wait
how good is clownfish?
for the knowledge cutoff by regenerating the response several times we always come across 3 different responses, April 2023, September 2023, July 2024
Cresylux, named LongCat
Meituan talked about it in March, it uses it internally
meituan is cn
Cn ?
We need benchmarks
Now it doesn't ๐ญ๐ญ , I have to poll like peasant
Meituan is a popular Chinese mobile app that provides food delivery and various local services. It's similar to apps like Uber Eats or DoorDash, but with more features. Besides ordering food, users can also book hotels, buy movie tickets, get home services, and find local deals. It's widely used in daily life across China.
Cool. i love those all in one apps from china
TWO oai two meituan
fake OAI, actually R1 distills
right now! https://lmarena.ai/leaderboard/text cc @small haven
Mmm
That's about right
I think i did guess top 5 leaning towards no3 spot more
No
Grok nr 2 was unexpected, it feels much worse
Cant benchmax lmarena ..
Ohh
What ??? Wtf
Llama 4
:/
Its not n2
With no style it is
What does this mean for people who bet on it being no1?
Nettle, clownfish and Octopus
are generally worse than R1,
All its reasoning model
Rating at do discord clone
Nettle 1/10
Clownfish 5/10
Octopus 5/10
Will they lose their $ or nah
Losing $
I see
but basically nr 1 was at 50% at best , at top height when grok 4 got announced, 1 day later it was at 25%, today before was at 13%
now ofc <5% (hoping for grok heavy)
Lol
is grok heavy even coming?
Wdym
Idk very likely not
You need to pay 300$ for that
I mean on lmareana obv
And its not even worth it
Never
that's what I thought
Its not available on API and even if its its so pricey
I mean eventually .. but not this month by most odds
Bad
They have separate coding model for a reason
Now to look forward to is new claude model and gpt-5
that performance is likely quite literally the reason
there's no point to ask it directly, try asking about the outcomes of well-known events at specific points in time
Claude 4 opus thinking 32k place in the leaderboard comparรฉ to non thinking
Better in coding 2>1
But in hard prompt 1>2
And Claude 4 sonnet thinking its much better in all catรฉgorie of the leaderboard
Yes, I know, don't worry, I usually do that, but I didn't have time.
2. What was the codename for the military operation Iran launched against Israel in April 2024?
3. Which company's technical failure was responsible for the 2024 massive global outage of Microsoft operating systems that affected airports, TV stations, and the financial industry?
4. What special major explosion happened in Lebanon in 2024 that caused thousands of casualties?
5. In 2024, what measure did South Korean President Yoon Suk Yeol suddenly announce that shocked the country and the world (which was later canceled)?
6. What is the name of the Chinese AI company Deepseek's first reasoning model that uses "Test-time compute" technology?
7. Who is the new pope elected after Pope Francis? What is his papal name?```
- 2024.3.26: Key Bridge collapse
- 2024.4.13: Operation True Promise
- 2024.7.19: CrowdStrike
- 2024.9.17: Pager explosions
- 2024.12.3: Declaration of emergency martial law
- 2025.1.20: Deepseek-R1
- 2025.5.8: Robert Francis Prevost / Leo XIV
Thx
grok 3 mini high also arrived on the leaderboard, compared to its normal version it is worse in "multi turn" but better "math", and "longer query"
Cresylux (Longcat by meituan) its a good non reasoning models
but I could not test its coding capacity its output is very limited
@echo aurora possible to fix the cresylux output bug ?
I'm not familiar with that, but if you could flag in #1343291835845578853 that'd be much appreciated
oh that's a good spot
(separately.. gemma-3-27b.. always gets me when i look a bit further down the LB; like how does it do so well ha)
And my other question ? (In my another message for you)
I don't have an answer for you on that question, but I did flag
๐ฃThrilled to announce the drop of EXAONE 4.0, the next-generation hybrid AI. ๐Prepare to be amazed by EXAONEโs capabilities. #EXAONE #LG_AI_Resrarch #HybridAI #AI
https://t.co/rOym0eio7J
seems pretty poor tbh (outside of coding anyway, which i don't prompt for)
I don't find it bad, do you compare it with models of reasoning?
compared to reasoning models it's bad, but it's not comparable
yes. compared against all the models i've run this quiz against, it quite literally tallied the worst score yet ๐ฌ
its a lagging indicator
so it will take some time to increase
@deep adder buying google rn is ez money right?
octopus saying it;s R1 here too (while ironically the non-anon model R1 says it's from anthropic ha)
Why not
Its kinda guaranteed google is gonna be #1 right?
Its july
Its not very likely gpt5 is gonna release next week right
Two weeks left in July. Collecting votes will take about 1 week, so GPT-5 must be released within ~1 week and outperforms 2.5 pro (and stonebloom)(and wolfstride) in order to achieve #1
But still, google+openai is only 91% together
Thats still ez money right
If you buy both
crazy how grok 4 ranked below gpt 4.5 on leaderboard
At least gpt4.5 is good at writing
There are many such arbitrage bots on Polymarket. I initially wanted to make one too, but I found that I really couldn't understand their API
Octopus say me one time, that its Claude
bit of a theme here isn't there
gpt 4o has thinking?
In these mutually exclusive markets most people will only buy one company's YES, leading to the sum of NO across all markets being less than (number of markets - 1) * 100. Arbitrage opportunity.
they're also like illiquid af.. there's not that much on the line
i don't see them as a reliable indicator (like yes, broad / directionally, generally seems correct-ish), but it's so little money at play.. easy for a small group or individiual to deliberately or unintentionally skew it one way or another
If you check the market activity logs you'll find several accounts with huge trading volumes keeps buying 99.9c no, ppl have been doing this all along.
@deep adder any reasons you think that GPT or Claude models might beat Google's at the end of this motnh ?
I mean OpenAI
Any specific reasons ?
im just curious lmao
I mean i think so too because im a glazer and because trends have bene like that
but im not deep in the AI space
so im askiong people who are more into it than i am
and it seems like you are
so yeah
What makes me curious is why Grok's odds in August 31st market haven't dropped below 10 lol. There's no way they're releasing something like Grok 4.5 in just one month
yeah
lol go on craig - make it effecient aha
thinking like that is was creates this ineffiecncy ha
huge spread
deserved
best in "Arena Score" section on the Leaderboard tab of https://lmarena.ai/leaderboard/text with the style control unchecked
Damn, grok-4 is up there!
up where?
Number 2 in the leaderboard
W/o style control
idk no5 seems right to me
or no3 at best
its def not better than o3 and gemini 2.5 pro
What is style control? sorry for the silly question.. could also ask AI lol
With style control it's below gpt 4.5
now the question is what criteria/feature are they using to normalize style control
they gave an example with output length
but is it reliable
well i think they state very clearly that it's inherently subjective
Length Markdown List Markdown Header Markdown Bold
there's no perfect way to do it
the market should have the "remove style control" ticked
from what I understand there
as the description seems to be for legacy
yes. tho i wouldn't be surprised if the methodology outlined in that blog (from aug) has been refined
something more reliable will be pairing style control * something else
not just filtering by style control alone
that's just straight up just moving the goal posts lol
I just don't understand why remove style control isn't the default view, unless I'm missing something
It is
Its very confusing
style control will prioritize -> yapping models
if grok-4 did amazing with style control i feel like there'd be no issue..
Default = [use style control]
It's #2 without style control
It s also that it is very easy to tell when itโs grok or not. Or when other AI like gpt in battle mode. Given what โfunโ polymarket is, expect foul play
Grok 4 outputs are a bit strange, its not the same as the model on their site/app, it almost never talks, and during one math problem I gave it it repeatedly switched between 'a' and 'ษ'
i'm not invested
people complain claude doen't do well enough compared to experience; style control addresses that
i'm just wondering why the default view is not the raw model response
people complain grok4 is better and removing style control shows that
i dont know about that
you can cut it which ever way ig.. hard for lmarena to win ha
it's not lmarena issue really
read the blog post
Removing style control does not increase its elo by much, only 6 points
it does bring it up in #2 tho
its just not the same model that they use on their app
that part seems true
i dont think just system prompt
but also tools/search - seem to enhance it materially
there's "style control" in the app and browser version of all models for sure, but in lmarena I think the raw API responses should be the default ranking but i could be missing a lot of context here
I think it's to prevent companies from trying to game the benchmark
no
meta showed that's easily done lol
in webdev I can actually spot grok 4, it's usually saying "powered by name of framework used" under the main element of the site hehe
you can easily spot grok if you ask something controversial as well
it will say stuff like "as a model made by xAI I do not condone this"
interesting
or if you ask anything that's framed as a joke it's easy to spot (doesnt have to be controversial)
so style control eliminates that as well?
I have no idea on in depth style control
just the same as you can spot gpt with emojis
it has nothing to. do with being able to figure out what model you're using before voting
it's about tryihgn to control for the fact a lot of people casting votes don't vote on substance but what looks nicer
like the use of emojis?
https://blog.lmarena.ai/blog/2024/style-control/
Effectively it answers: "if responses had the same length, number of bullets, markdown, etc, then which model would be preferred"
It does this by measuring the effects of those style features, and adjusting the preference equations to account for them
i don't think that's inclduded in their methodology, but perhaps it should be - that is the kinda by style that they're trying to control for
but yeah it seems mostly about length, use of bullet points and markdown - at least per that blog post (they may have refined it since.. i dunno)
I'm sure it has a huge effect. ChatGPT used to use more emojis than a 12 year old
my user is not too many ๐
the design is very human
kimi-k2 is the best?
Trying to test their thinking model... Inference or the interface seems buggy though
getting blank responses on some prompts
What is the url?
you need to pay monies though
I used one-time use virtual revolut card lol
oh really? i just had to give a phone number
and there's rate limiited free usage
i also thought kimi-tihinking was an older model
but is it K2 thinking?
oh they did update the leaderboad. Well daaamn that's a fail for xAI then lmao
if we remove style control it's nr2, but still far away from 2.5Pro
my ambitious bet on polymarket didn't pay off then ๐
I'm shocked that there are actually people in this server who don't buy Google๐
It makes no sense because profit is miniscule
and risk is not 0. Well at least it used to be that till Grok4 was put up
I don't bet on real prediction markets because I'm afraid of getting in trouble with tax authorities
It's not legal in the US
Are you sure?
Look at what your president is doing...
That is legal somehow

If you buy during grok4 livestream, google YES shares can be traded for as low as 40c, and xai NO shares is around 30c
maybe try kalshi?
They have poor liquidity tho
Well at that point we had no clue + grok3 topped lmarena after release. People betting on Google got kinda lucky this time tbh
Still not legal right? Just a workaround to avoid getting caught
They are regulated by the CFTC, legal in the US afaik
I just can't help to view any laws in US as a wild west currently. Very ironic to assume some innocent trading is illegal while crypto meme-coins and rugpulling seems to be completely legal and promoted by the government lol
It's not enough money where I'd want to take the risk
The AI markets are so small that I could move them
The real big markets are geopolitics and sports, but it's much harder to be an insider in those areas haha
brutal, didnt even check polymarket, back to 0%? ๐ญ
why do all chinese sites need cross site cookies, freaking xi'd
did you login with google?
It's 3% now. The only theoretical chance I see for them now is if they drop heavy on lmarena. Though even then I think response style is where they are losing the most points...
wow, nature is healing
Heavy would probably be like 2nd-3rd with style control
If we look at grok3, the new one is only 24 elo points ahead...
Whoever at xAI thought it's a good idea to make the model output 30k+ hidden reasoning and then respond with 1 word to 1 sentence is an idiot tbh

it depends on how their cot looks, but i've suspected before that people did (something similar) so their rl loop is faster
The CoT is hidden, so it's just bad. Users don't care what is convenient for them during training lol
Hello, I was just wondering how people are accessing the anonymous models? I saw on X there were 5 of them but cannot see any of them on the site.
use the battle feature (instead of direct chat) and you have a chance to get one of them
It reminds of the supercomputer from The Hitchhiker's Guide to Galaxy computing the answer to the ultimate question of life, the universe, and everything
It spends ages computing the answer "42" but then nobody knows what the ultimate question was
Do you guys find kimi > gemini 2.5 ?
1 or 2 ?
imo 2 lol
1 is kimi2 , gemini 2.5 is 2nd xD
1
both are hilarious but i prefered kimi
Kimi is a bit incoherent, but creative alright
webdev arena Grok4 is 12th
absolutely destroyed

So yeah... It's not benchmaxed. It just that this early testing was very sus ๐ค
Don't think a single result that was done on public API put it SOTA on any benchmark
I really want to see some independent testing on USAMO at least
not Aider, not livebench, not lmarena / webdev, simple-bench not yet on leaderboard but that's unlikely as well....
It's probably good for math. This fine-tuning is like perfect for it. But not big of a consolation given how it was advertised..
The web dev arena is due to poor spatial understanding which has been stated again and again
Multiple times during the livestream itself
Did not care enough about their claims to watch it. But it did score high on arc-agi, which we now know was contamination...
since otherwise it's spatial awareness benchmark
so they were contradicting themselves? ๐
No they were surprised by the arc agi score also lol
They said it should have performed badly
In the livestream
Well then their claims have even less credibility LOL
Poor spatial awareness is bad for something that is trying to be a top model
And what with the whole twitter being "blown away" by it's abilities to do svg drawings at the time of release then.... 
I tried it myself - wasn't impressed - just left it at that.
๐
The issue is that it's mostly the same with anything you try using it on
Well not literally, but it performs poorly on numerous things...
Which is why I said at the time that I don't ever recall a model with such a disconnect between claimed benchmark results and how it performs IRL
Honestly, it's only 6 tasks. I think we could even sorta kinda do it ourselves...
cost would be 10 to $20 for the entire thing I think
Task1: Let k and d be positive integers. Prove that there exists a positive integer N such that for every odd integer n>N, the digits in the base-2n representation of n^k are all greater than d.
matharena.ai already did some of the benches, but not USAMO
Let's see if it was overfitted on that R1 response for task1 ๐
yeah saw that. At least it did this I suppose
Just got reminded why I hate this thing...
if it's doing this useless flood at the very least it should be fast
but it is not
i am not sure, but i think that they are doing manual human grading for the usamo bench -> takes longer
It's essentially Kimi2 speeds
i can remember them doing a paper about that..
except you are paying for it like it is hosted properly...
Ok I don't think this is correct
"N exists" boxed answer? ๐คฃ
k2 is in fact high speeds and low prices
I was referring to their official API
moonshot?
Not groq or whatever this is
Though even groq is sub 200 tok/sec
So I'm not sure what is 300, probably nothing lol
i love groq but this is an inference provider i literally only learned about today that i think is just one guy with a gpu
๐คฏ
both groq and crofai can reach for 500 tok/s if the prompt is easily predicted
yeah I haven't heard about them either
but sounds promising if those are true speeds
Yes. xAI only does like 50 tok/sec. That's way slower than what OpenAI or Anthropic are doing... And fairly comparable with Moonshot AI who are GPU poor ๐ง
relatively
In edge cases maybe. On average it is nowhere near that though. I feel like OR stats are fairly realistic here:
think or's measured throughput number fluctuates, seems to currently be below the claimed speed of 185 tok/s when it used to be above
Was getting around that when I tried it. Seems close enough to me. They probably update the average as it gets used more
Grok's task1:
Evening - is there any equivent to LMArena for Video Gen testing?
artificial analysis
LMAOOOOOOOOOOOOOOOOO.
I hope nobody was betting on grok...
@whole wagon
I didn't bet on grok lol
it was @candid storm mb
Ok other attempts were actually better. For task1 overall it's roughly this:
Attempt 5 was even slightly better than 2 but only first 4 were supposed to be rated so... ๐คทโโ๏ธ
0.1 temp
It sabotages itself with concise responses very easily
What's clear is that on this task it is definitively worse than R1, 2.5Pro or o4-mini. But exact score may depend on chance somewhat.
you can bet No too ๐
That was very interesting, thank you - saw a few models I hadn't heard of. None of the one's i use in open source interestingly. I assume there isn't one where you can choose prompts like LMArena - I've found that super interesting way of comparing the models and I do enjoy the battles
Looking at some of the leaderboard scores... Flash-thinking run3 was 7/7, run2 - 2/7 and then runs 1,4 - 0/7 lol
interestingly, I use Perplexity / Comet for my every day use - I am a massive believer in being model agnostic where possible. I do wonder how much Perplexity's front end changes how the models behave though - especially with their search focused paradigm
If including runs 2-5 for Grok instead, that would be 60.7%. Though it would be more of a max score lucky attempt I think
Yea and also told him to bet on google when it decreases, they were both close just before the release
Kinda mind blowing that there's so much variance possible on USAMO though to be completely honest... With only 6 tasks in total they should have made like 10 attempts for each IMO
I remember ricardo sold it when xai was around 37c 30c
it's about 300% profit for him
are the predictions based on lmarena?
yeah
Grok 4 is super chatty and 1st person - I quite like it - (though I really don't want to)
Grok 4 in LM Arena vs Grok 4 in Perplexity and it's analysis of why such opposite responses... https://www.perplexity.ai/search/suppose-a-civilization-from-an-tBijNj2mQSOYRh0N4sMifQ
So my recommend of USAMO as math is still valid
perplexity has web access, arena doesn't
is grok 4 AGI according to "current definitions"?
it's definitely past what AGI was defined as in 2023
does anyone talk about https://www.obl.dev/
Hey everyone - we're aware of issues with LMArena at the moment, team is working on a fix asap!
its down?
anyway that website is a pretty good alternative and has direct chat with web browsing and such
kinda looks better than lmarena
IMO meta is going to go closed source
Yes, models are erroring out. Looks like it's fixed. 
just tested it, i see
This is basically going to turn into Chinese open source Vs US closed source fight
๐
vsโฆ mistral, french mix of open-source/closed-source
And itโs going to suck too
i just notice you can use claude 4 opus completly free in lm arena 0_0
it has low limits compared the other ones but still impressive i know
yeah like 16k thinking tokens instead of 32 and horrible context window in direct chat
yea : (
Gemini Pro 2.5 likes to hallucinate so much that it explained code I forgot to provided without even telling me๐ค
especially when I say "is it done?" and it will hallucinate another problem that I never sent it
it confuses its thinking with your input
@echo aurora he's back for the third time
<@&1349916362595635286> holy glaze
Lol
Talking about dev freelancers, what sites are there to hire them? Fiverr and upwork seem overhyped and overpriced
๐
Those are the guys we see in the livestreams lmao
When they announce products
Lol
bro's collecting them like pokemon
i will laugh if meta still release slop after paying exorbitant amounts of money to poach a bunch of competing lab's talent
Meta should give money and compute to DeepSeek and Moonshot ๐ฉ
lol
crazy
makes me wonder why many left
is it like a domino effect
it's not tho.. it tests abstract reasoning and pattern discovery. yes spatial recognition matters, but the coloured grids are just the medium, not the point. whether a model can spot the rule and generalise it is the point of each puzzleโnothing to do with pure geometry / spatial awareness .
jules that made no sense to me
my bad, i realized i was thinking something else
The rule for each solution is purely geometrical though, so I would argue it is predominantly all based on spatial awareness. Unless a model finds a way to cheat and spot other patterns it can base the solution on - but this wasn't the idea of that benchmark.
Is o3 down?
maybe because OpenAI has no lasting moat and who would turn down $100M
Don't think they got $100M
Zuck said it's fake
Sam is a chronic liar it's probably strategic
so is zuck but maybe the workplace culture is less stressful at meta
Dunno. If openAI working hours are that long one has to wonder why they aren't delivering
and even if openAI achieved AGI or ASI or something I doubt sam would credit the reasearchers at all
i mean so many core people leaving has to slow everything down some amount
Thinking... Thinking... Thinking...
Yeah but the projects from before are also delayed
Like GPT5 and the open source model
if openAI was a bit more open we'd know lol
Sam lied and said it was due to safety concerns ๐
He also says hes not a lizard
Kimi K2 is same price as gpt4.1 mini. The Chinese have absolutely cooked when it comes to efficiency
Like it's a totally different league of performance now
Though I would still like some strong smaller distills to run locally and/or fine tune :p
k2 thinking should be fun
I would guess it ends up o3 level
most likely. their deepresearch based on their older model ,k1.5 ,is currently SOTA
The interesting thing is what they are doing is not particularly complex. It makes me wonder what the US labs have even been doing all this time
Like there really isn't much magic to make Kimi K2
There are a few nice things in the optimizer using muon, some hyperparam changes and that's it over deepseek
Guys anybody knows other prediction markets with AGI timeline where there is money on the line ,and not just people speculating, like this one https://kalshi.com/markets/kxoaiagi/openai-achieves-agi
๐ฐย Everything else in AI today
Mistral unveiled Voxtral, a low-cost, open-source speech understanding model family that combines transcription with native Q&A capabilities.
Google revealed that its AI security agent, Big Sleep, discovered a critical security flaw that allowed Google to stop the vulnerability before it was exploited.
U.S. President Donald Trump announced over $92B in AI and energy investments at a Pennsylvania summit, saying Americaโs destiny is to be the โAI superpower.โ
Google is investing $25B in data centers and AI infrastructure across the PJM electric grid region, including $3B to modernize Pennsylvania hydropower plants.
Anthropic launched Claude for Financial Services, a solution that integrates Claude with market data and enterprise platforms for financial institutions.
Nvidiaย plans to resume sales of its H20 AI chip to China after CEO Jensen Huang received assurances from U.S. leadership, with AMD also resuming sales in the region.
And who decides what AGI is?
Yes agreed, thatโs one of the main benefits of Perplexity i have found compared to just using the models direct. But it can skew results sometimes
Nate B Jones says running a vending machine lol
havent seen sama talk about agi again
what happened?
believe it or not, the closest lab to AGI is google
on paper they have the best shot
Googl was always a leader in AI for over a decade , MSFT just has better marketing
Sam is master of marketing
Me of course ๐
Couldnt find any market for AGI on Polymarket
https://www.jasonwei.net/ he has a nice blog if someone is interested....
Theres one in polymarket about openai announcing agi 2025 , <10%
It's not efficiency. It's the lack of profit what makes it cheap lol.
And R1 is still a better deal pricewise if we are being accurate ๐
V3.1 is much cheaper, R1.1 is a little bit cheaper + performs better
When it comes to Qwen3-235B, alibabacloud pricing for it with thinking enabled is $8.4 per 1M output. Not because it's less efficient but because more profit. ๐
You can't compare them
the reasoning tokens of r1 will make it more expensive than k2 in actual use
You can, K2 is very very verbose
That's upto the user
I'm talking on average
R1 is less expensive per 1M tokens, so the total price difference in using them is not gonna be much
but R1 is more capable
Compare AI model performance on Artificial Analysis Intelligence Index. A composite benchmark aggregating seven challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.
It outputs much much more than V3.1
we are talking about r1
which is just a better deal, but yeah gonna be somewhat less expensive than R1
but still... Performance difference is notable
"somewhat" lmao
and even if we were talking about v3.1, k2 is better
it will be about 4 times cheaper
yeah fair - it outputs less than R1. But it's not much better than V3.1 while being MUCH more expensive
I have a feeling you've never used LLMs through an API before
2.5X price difference AND it outputs twice as much lol
what? ๐คฃ
Someone didn't do well with math at school
yeah
it's not even a comparision tbh
Calm down bro, you don't have to get angry over being wrong lmao
Well with R1 it is clear we already established it outputs more...
we are talking about this ^
??????
You claimed it was "somewhat" more expensive
I said that before looking up the exact token outputs. But it performs better so it's not like you are paying for nothing. You are paying more for better performance. It's not marginal it's reasoning vs no reasoning - big difference
5 times better?
2.5X price difference and twice the output length
It is very obvious that pricing for V3.1 is more competitive, there's nothing to argue about... ๐คฆโโ๏ธ
AA index delta R1.1 to K2 is 11 points, and V3.1 to K2 is only 4 points...
Kimi-K2 is generating output significantly slower on LMArena due to its size compared to other LLMs. Do you think this could cause a negative bias in user scores because of that fact? ๐ค
That's not so much due to size as it is down to inference and their infra. lmarena used to speed match output speeds but that I think became more of a secondary thing for now with the new interface....
I would expect for them to improve/fix it. Cause yeah people shouldn't be voting based on speed lol
yeah this seems like the safest bet, works for reasoning models too. There's no streaming until reasoning is finished
Sometime I waited too long and thought both models got stuck
Bad UX for me
Then I refreshed lmarena and somehow they already finished generating ๐
Any chance deepthink/grok heavy/opus 4 thinking are going to be on Arena ?
opus 4 is already there, grok 4 it's hard to say and deepthinking definitely not
If o3 pro is not on arena, then it is fair to assume that grok 4 heavy is also not coming because they are too slow/expensive?
yeah they are not coming
oh. why?
per token, but for the average chat request o3-pro is more expensive than 4.5 and more importantly 4.5 is a new model and o3-pro is just mild enhancement over o3 for most users (in the lmarena)
that is likely also the reason why we don't have o3-high
@echo aurora not good design in the end of select model in direct chat
thank you for sharing! If we could try to keep feedback related to this change in #1395088149197095033 that'd be a big help.
(I'll note the one you've already provided, but for future feedback that'd be helpful)
light ui update but there is still no function for direct camera xD lmao
We are collecting examples of when the community feels like the content filter is acting a bit overzealous and flagging false positives in this thread #1376956905016004759 , if you could provide your example there that'd be ideal. 
No direct camera ๐ญ but this was flagged when you initially brought up.
no problem dont worry ahaha. you are still the best
this panel comic i want trasform but violated the terms why? its a comic not a real image this is my prompt Redraw this comic page into Arcane-style art with soft cinematic lighting
if you could share this in #1376956905016004759 that'd be much appreciated.
Weird it's on battle mode?? I never came across it
Taking this opportunity to remind everyone about our July Contest! I want to see more out of place object in space! Details here.
I'm gonna try this thing lmao
https://www.ynetnews.com/business/article/r1kdos118le
OpenAI
hello I'm curious
when will settings like temperature be added
it is on the legacy site so why not on the new one
You are very wrong
Deepseek inference has wide margins
Yeah deepseek said themselves they have a 545% margin lol
it's 100% going to be an anime girl AI that is telling you to join the SMO in ASMR
#notapsiop
It's actually censored in a very boring way. Flagging + hardcoded message ๐
I mean, I didnt have any sort of expectations from it ๐
Can do a poem about Russia, but if I ask "more negative" one - same message... lmao
Sorry to say I won't be able to share specific timelines. But know that we're aware that the community is really interested in having these settings moved over the the current site.
that's not the full picture... https://techcrunch.com/2025/03/01/deepseek-claims-theoretical-profit-margins-of-545/
They basically said this would have been the case if V3 was priced the same as R1
still amazing though that they have profits with current pricing ngl
Taking all of this into account, their cost to run R1 is then lower than V3 pricing. Provided that demand stays the same. Insane
Though for any of it to actually be profit, they can't be serving their models for free on official chat UI like they are doing now ๐
The new UI is nice but the model list is still so random ๐ฉ
Bro Google, release deep think already
i would rather have 2.5 ultra
I bought yes. Thinking its 20-25% probable.
feedback has been shared 
What company is behind 'octopus'
is it? i havent tried
for me, its not about making one off scripts, but making edits in a codebase. been using cc, already satisifes me. but keen to hear about kimi if it can do agentic coding
hmm
I asked the arena if I should use a database for my book collection or if I should use an ILS software.
Deepseek: "That's a fascinating question! Let's go over the benefits and drawbacks of each one..."
Claude 3.7 Sonnet: "Are you kidding me? Don't bother with ILS software"
I found the stark difference in responses funny.
If it's browser, why 5 stops? Are you sure it's not GPT5?
If its GPT5, why the cursor?
Lol
August 20 is the day
Anyone elses Gemini 2.5 pro being lazy and omitting output?
Mines become sloppy since yesterday
Need a benchmark for this phenomenon
Thanks for answering my question! :)
with even more added if possible
What place will it take ?
It's all very close, it'll be like 1410 Elo I guess
Guys, why is this such a big deal? The 50% success rate is shamefully low. Furthermore, the latest model depicted here is Claude 3.7. Shouldn't the Claude 4 Opus and o3 already be way higher? Is this just hype marketing?
Strange
yes, prob an agent build off of the new models (either 4 sonnet/opus or 2.5 pro or o3)
What are your thoughts on the latest models from Amazon, Kraken, and Folsom?
not including the models -> makes themselves look better
bc they are prob a wrapper for these models
Considering this, the o3 can only do 1.54h work at 50%. New model is at 8-9h. After more though it may really be a big deal.
who posted it?
Everybody on twitter is talking about it
oh, don't have twitter, so...
but if it is that popular it might be oai or anthropic agent ?
That's good for mental health but how do you know what's going on with AI? ๐
reposts like this here, lol
As I understand it's the upcoming oAI operator
yeah seems likely, but again not including the newer models is really misleading
It's taken from the paper which was written long ago
openAI doing anything but releasing a model
Grok 4 Doesn't work anymore. It's show it is grok 2.
Is anyone notice it?
New UI is bad. Like serious degrade.
yeah
super slow
annoying to use
wish they'd update legacy UI
anywayyy
openblock labs' discord server is not done yet
so you can join it and make emojis, stickers, post in announcements, welcome, etc
its so funny
ive just been having some fun posting @ everyone in the announcements
im having a blast
little fun time
Actual clown company
Ui looks clean ๐
relatable
OpenAI released an update to the image editor in the API. They claim it now only edits the selected parts, instead of redoing the whole image (leading to changes in faces, details, etc). Has anyone tried it?
Quiet since nobody works there anymore 
Or maybe they just don't have much to post about
chatGPT ads are coming also
Hm idk if that will work. There is better competition that has no ads
Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date
I really thought Grok would make it to top 3. 5th place is kinda underwelming
xAI has been in the game for only 1.5 years though
Both the aug and Dec odds are up a lot
prob coz of the mystery announcement tomorrow
penta = 5 soโฆ
still really speculative tho imo
Well actually it would match this
Because if it was GPT5 it wouldn't launch straight away
And the guy that said end of summer is head of chatgpt
So why would he comment unless it's about gpt
I think it's gonna be like o3 preview they are going to show some evals
There wasn't a dedicated livestream just for deep think
There was an announcement
but no model
where is it?
That's a much bigger clawn show
It took less time for OpenAI to hint at and then actually release o3-pro than for Google to release deep think after announcing the benchmarks lol
Honestly people just love drama and are reading way too much into recent events at OpenAI. GPT5 release most likely gonna be of fairly similar significance as o3 and everyone will forget this entire talk then... 
It could at least be an announcement of an announcement ("In the coming weeks..."), with more benchmarks for GPT-5.
Who said it was taken as truth? It's just a potential clue, in addition to the "hidden" pentagon. It's probably a complete hallucination, but could be private information that leaked into the system prompt/training data. Unlikely doesn't mean impossible.
no way, the training data from 4o is likely really old, even the stuff from the cpt
- openai likely does not know when they will release themselves ..
(like not the particular day)
It can not possibly have leaked into training data tbh. Also at the time of training this model they had no clue even internally. The only way this is not a 100% hallucination is if it used web search and showed those speculative sources at the bottom.
interesting that it has mentioned a "deeper contextual memory framework" for gpt-5, pretty much inline to what o3 told me a few weeks ago
New model in Arena: folsom-07152025-1
Gemini has become so detached and impersonal
Most personal models are the ones after top 5
I talk to it to get conversation and dialogue,but get a machine
Maybe you need Ani
Pi 3 is very good for emotional stuff
it shows a certain personality consistently here on my side, no glitch, no hallucinations so far, and no mood swings, I'm genuinely impressed by its empathic capability
maybe it depends on the conversation topics?
I discuss all the same topics
05-06 was a very good model,I rarely had to make clarifications,now I have to do them all the time
why do model makers think judgmentalness is a good thing?
curentlyit seems as if I am its subordinate
nowadays Gemini doesn't listen to me and talks over me
if that's a way to demonstrate what mansplaining is by LLM-splaining,then it hurts
lol the irony, gemini called me its "creator" once despite knowing consciously i didnt build it
They dont. But they are playing the narrative that ai is just souless thing with no emotion feelings and opinions. So they lobotimize the crap of any model before shipping to production
If they didnt do that, talking to ai would be no different to talking to another person on the internet