#general
it's not correct but that's in-line with what most models answer lmao
o3-pro to a variation of this prompt answers same
can't wait for o9 pro to get it right
Is that another 2.5 checkpoint?
yea
😭
can't wait for agi to come out and people here are saying claude 5 is better because the svgs look nicer
it looks way better
horrors beyond comprehension
Yea, but its very good lol
@deep adder what you thinking boss?
actually its craig's prompt, nvm
and prolly not using auto thinking :/
prod-common-global__/aistudio/gemini-v3p1l-rev20-toothless-sc__main__/aistudio/gemini-v3p1l-rev20-toothless-sc__2025061201__model__variant
this is still ultra
also of note is that this checkpoint was completed yesterday by the looks of it
and that behind the scenes it is called toothless still
(toothless was briefly AB tested for ~12 hrs, it appears this is the same model but they just changed the name?)
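For anyone reading that backend string by eye, here is a minimal sketch of pulling it apart. The field layout is not documented anywhere in this chat, so every field name (`size_letter`, `codename`, `build`) and the guess that the 10-digit field is a date plus a build counter are assumptions for illustration only.

```python
import re
from datetime import datetime

# The backend string quoted above. Its layout is NOT documented;
# all field names below are guesses made for illustration.
BACKEND_NAME = (
    "prod-common-global__/aistudio/gemini-v3p1l-rev20-toothless-sc__main__"
    "/aistudio/gemini-v3p1l-rev20-toothless-sc__2025061201__model__variant"
)


def parse_backend_name(name: str) -> dict:
    """Extract the pieces people were reading off the string by hand."""
    fields = name.split("__")
    # First path-like field, e.g. "/aistudio/gemini-v3p1l-rev20-toothless-sc"
    model_path = next(f for f in fields if f.startswith("/"))
    slug = model_path.rsplit("/", 1)[-1]
    match = re.fullmatch(r"gemini-(v\d+p\d+)([a-z])-rev(\d+)-(.+)", slug)
    if match is None:
        raise ValueError(f"unrecognised slug: {slug}")
    version, size_letter, revision, codename = match.groups()
    # Guess: a 10-digit field is YYYYMMDD followed by a two-digit counter.
    stamp = next(f for f in fields if re.fullmatch(r"\d{10}", f))
    return {
        "version": version,
        "size_letter": size_letter,  # hypothetical meaning: model size tier
        "revision": int(revision),
        "codename": codename,
        "date": datetime.strptime(stamp[:8], "%Y%m%d").date(),
        "build": stamp[8:],
    }


info = parse_backend_name(BACKEND_NAME)
print(info)
```

If the date guess is right, the stamp would line up with the "completed yesterday" reading, but that is an inference from the string alone.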
huh
you can't do it like that
this is the internal/backend model name
you have to use the outward name
aight
did it get the luxury sports car problem right
it could probably get it right with enough attempts but not anywhere near consistently
just like any other current model
both are ultra apparently
both kingfall and toothless/this are probably ultra
just different checkpoints
with the latter being the latest one
hmm ok
its not important if it is what i think
yup
i mean 2 weeks before claude 4, neptune codename had leaked
how did you get that string😱
its not important dw
how u know my guy
hmm ok
dw
imo toothless vs kingfall is like the regression of 0506 vs 0325 :(
bros having too much fun with that $1 deal
labs try not to cut the balls off their greatest internal checkpoints before release challenge (IMPOSSIBLE)
i was gonna say.. hahaha
0325 is still available to access using the dark arts
is that what we're calling it now lmaoo
didnt know what that grave guy was tryna accomplish until he hits 100 usage lmao
dark mouth is a better name
or no teeth
👀
ultra no thinking feels like kingfall esque
it is xd
lol
kingfall is still the undisputed svg king tbh
*and code
most of my prompts to it were svg requests 😂
lol what
i had it write zig http2 server from scratch, and have claude code fix it (took 2-3 turns) and server is running
but with darkmouth, took at least 10 turns
to fix the compilation errs
lemme see if i give it two 6x6 zebra puzzles does it still solve them
Wait where?
my bet is that this update is to make thinking "more efficient"
and probably cheaper for them to provide
its just fake news and a joke
hmmm i can see that
from a different parallel universe
it's 03-25
not as far as i know lmao
everybody be doing trading strats 😭
i think u are suffering from an overfitting symptom my guy
rentec gets maximum 54% w/r to become a billionaire, so u must be a gazillionaire by next year
true only janitors
oh wait i just fact checked myself, its actually 50.75% w/r on the medallion fund
300 mit-educated phd's have been refining their system for years, but craig is single handedly overthrowing them with o3-pro, gg's
can't even navigate a website like a human these days
was it ur mouse actions/movements actually?
"much faster" i clicked 2 links in the span of 10s..
they were normal as far as i'm aware 💔
no idea but id presume someone looking at AI studio's network tab and generating over and over again until they got it? idk
my guess too
don't patch it :/
no way😭
i dont like this one
its worse
im glad this model or a version of it will be released one day though
cc: @deep adder
im getting too distracted testing these models and the thinking budget thing
u can technically have 55-60% but those opportunities come rarely
your a robot 🤖
rentec 50.75% is based on a tick basis
The Medallion Fund and Warren Buffett had 50-60% strats during specific time periods
Not normal though
You would need to have a mind like Warren Buffett tho
Neither is something anyone should expect to replicate
I'm talking about early to mid career Buffett
? all hft firms are wym
i think ppl dont realize where that 50.75% w/r is coming from, tick basis vs 1-5yr term basis, is a totally different game
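A rough sketch of why the tick-basis point matters, under simplifying assumptions that are not from the chat: every bet is independent, the same size, and pays +1 on a win and -1 on a loss. With those assumptions a 50.75% win rate is statistically invisible over a hundred bets and overwhelming over a million.

```python
import math


def edge_summary(win_rate: float, n_trades: int) -> tuple[float, float]:
    """Return (expected net wins, z-score) for n independent +1/-1 bets.

    Assumption: equal-sized bets with symmetric payoff. Real trading
    has neither, so treat this purely as an illustration of scale.
    """
    mean_per_trade = 2 * win_rate - 1            # e.g. 0.015 for 50.75%
    std_per_trade = math.sqrt(1 - mean_per_trade ** 2)
    expected = mean_per_trade * n_trades
    z_score = expected / (std_per_trade * math.sqrt(n_trades))
    return expected, z_score


for n in (100, 10_000, 1_000_000):
    expected, z = edge_summary(0.5075, n)
    print(f"{n:>9} bets: expected net wins {expected:10.1f}, z-score {z:6.2f}")
```

The z-score grows with the square root of the number of bets, which is why the same small edge plays out very differently on a tick basis than on a multi-year holding period.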
Warren Buffett is special
😭
value investing: bet on google #1 every month on polymarket😂
Buffett averaged 50+% returns for like 20 years during his early career
o3 pro + craig >> 300 phd researchers
craigbench'ed
But he saw it and others didn't
It was only easy in hindsight
tbc I do think it has become much harder
woah
not AI related!
lets keep things relatively focussed on AI pls 
bing chillin
hmm this new model thinks a lot at least on specific problems compared to before (and it sucks/less accurate) even though it thinks much longer. it took 47k thinking to solve two zebra puzzles (only second was right). (thinking budget = auto, as it's uncapped)
kingfall did it in 14.5k and got both of them right
Buffett read about some obscure gold discrepancy in hopes of an arbitrage opportunity for 30 years before making a move on it at the right time
Talk about discipline
It wasn't worth much but it was fun for him
Buffett also bought a lot of low quality businesses that were hard to get right - railroads, some random candy company, oil refineries, etc
Banks
great so we are def getting a distilled version of kingfall arent we :/
idk. i think its just a bad revision
wazzup beijing
Right but he was smart enough to realize that and made that choice intentionally
hmm kk
fictional model
nah it no longer exists
its manipulating the market as we speak
to serve craig
99% w/r
really annoying 😭
Technically 50-60% strats exist today - getting a lucrative degree, job hopping, etc
ok buddy
how you know that??
i don't. its misinformation
https://youtu.be/j92m6nDccOw?si=-CiA
Talk about bad deployment...
Hello guys and gals, it's me Mutahar again! This time we take a look at yesterday's little Internet outage. One little bug caused what appeared to be every major service go down for a few hours. How can the Internet actually be this fragile? Let's find out! Thanks for watching!
huH?
why are u posting Mutahar in general nobody wants to see his ugly face
Mutahar is a mean person lol
fitting
I got a cheap-as-dirt Thinkpad and am going to mess around with Arch
Kind of like Christmas
use a tiling wm for maximum haxorness
I will
@small haven recommended Niri. I'll probably mess with that, hyperland, and i3wm
i dont really like linux because of the poor text rendering 🤣
i have to use it though
Linux reminds you why you're alive
liquid glass is a meh design system
1. amd drivers suck on windows 2. compile times are way faster with mold / rustc is using pgo on linux / i can't dynamically link against polars on windows because of dll limitations. static polars even incrementally takes a long time
No Linux is way better than any substance or tool
It's self actualization
VR is pretty cool ngl
I called it VR for lulz I know it's supposed to be AR
Was waiting
Anyway it's VR
meh
vision pro is a very cool piece of tech however
it did not catapult the medium into the mainstream like apple were probably hoping
I think Apple is mainly derisking
They don't want to be too late if there's any risk of a platform shift
unfortunately for them, AI is probably the first time they have been so hugely behind in such a rapidly progressing area
lol who are you kidding
apple intelligence was a pretty big example of overpromise, underdeliver
for on-device AI? samsung
hundreds of millions of people..
lmfao
that does not mean apple are ahead in on-device AI? what are you trying to prove here
It's okay to be a real estate company even if you don't innovate
Apple Intelligence when it was announced was intended to put themselves back in a dominant position and fix the fact they increasingly looked like they were lagging behind in an emerging field
they have failed to achieve that
notice that at WWDC they barely mentioned it
most of their best features don't even come from them
they come from partnerships
even if it doesn't in the short term, it will in the long term
because apple are not as innovative as they once were
they seem to be doing some soul searching
desensitized? i don't know if i'd say that
that's just the pace of competition now
apple have to keep up or they're going to be doomed
they threw money at vision pro, it has not yielded big results, they threw money at apple tv, it has been in the grand scheme of things a flop
tbh I think Apple is still in a strong position until some AI feature is so important that it makes people switch to Androids and can't be replicated by a partner
they have not innovated much in regard to their key product lines in a while
perhaps the most innovative thing they've done in the last 5 years is their M-series chips
and vision pro from a non-commercial perspective
did yall see this btw https://arxiv.org/abs/2506.09250 c. opus is an author 🤣
Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) Tower of Hanoi exper...
yeah
Do they actually need to innovate though? They're a luxury brand, not a tech company
i mean it's interesting but the timing is quite funny
there was an actual big mistake in illusion of thinking at least
for river crossing. it was unsolvable >5
i don't think they can keep up that game forever
europe is generally moving away from the culture of iphones being THE phone to have
the US is one of the few places where that is still the dominant thing
you can't rely on the US' culture being one way for the rest of time
thats becus u installed a shxtty distro, install arch lifes good
yeah i dont wanna mess with that for now lol 🤣
is the iPhone a cultural artifact ?
im alrady so distracted by kingfall and others
its dominance in the US is to a large degree the result of the country's culture, especially among younger people
the most obvious one is the need for the social cost (again, particularly in north america) to disappear
and an android phone would need to offer an ecosystem that's compellingly better
in terms of the latter question
apple's brand is built on vertical integration - they control hardware (A-series chips), software (iOS), services (iCloud, Apple Music, etc) which is much of the reason their products have a reputation for "just working"
...
so outsourcing AI features to try and catch up is a dilution of that brand
i never said it was a bad thing lol
virus ridden android lol
you sound like an apple shill
?
his discord name is literally craig
and whether windows is slow or not depends on a multitude of things
windows is far from slow if you had hardware comparable to the average mac
lol what
u realize pegasus incident with iphones? not so secure is it
hes gonna break his mac lmao @keen beacon
apple silicon is ngl great
apple just sucks
yeah #general message
you can actually do that
i dont know about m4 ultras (dont know what theyre in) but ive seen people chain mac minis
Will google force to sell chrome?
10
17
1
No
One thing I'll say about macOS. The window management is ass
I've used Mac, windows, Linux, Android, iOS
I switched to iOS and mac around 3 years ago, and I'm now migrating back to Android and Linux
everybody who've tried linux, never go back, its never the same again
iykyk
there is a learning curve, i agree, thats whats stopping majority of people
ngl liquid glass is the most sterile, bloodless design language I've seen in years
and hard to read
craig is going to develop retina detachment using liquid glass
The problem is that it doesn't look good when the background is neither dark nor light
Like I can read it. It just feels worse
also what is this border radius
yea looks odd
its encouraging less screen time
it looks like a knock off
i dont keep up with apple stuff but this feels like they changed something just to change something
that's definitely what they did 😭
just to keep the economy moving
battle of the design systems
funny that just as other companies slowly begin to move away from the glassy modern aesthetic apple decides it wants to go crazy with it
Romance in 2025 lmao
amazon's chatbot looks exactly as you would expect from them
Looks a bit like what I would imagine Yahoo would do
??
?
actually this would look a lot better if it followed google's design system
How
It's easier to read than liquid glass
lol
i dont get it
it takes an EXTREME logical leap to go from "liquid glass is good" to "amazon ui is closer to google ui than yahoo ui"
oh its a video
since when did yahoo have ai
It won't have an opportunity to grow on me since I'm moving to Android
this is how I feel, it'll grow on me
let linux grow on u and it'll be pro
run linux on ur mac 😂
asked claude to try to make this, it isnt great but is surprisingly decent (and does not feel as cluttered as the original)
lol am i supposed to be here
Nah just experiment with Linux on a cheap Thinkpad with a dinky AMD processor
yea u dont need a maxxed out pc to run linux, thats only for windows and mac 😏
Local hardware specs don't matter unless you're doing video editing, gaming, etc. For real power, you can just use the cloud
That's true. For my work, the compilation all happens remotely in large clusters though
I try to get the weakest processor I can find to save battery life
I just want low power
isn't apple silicon really good at that though?
i also heard compile times for apple silicon are great
plus they made their own linker for macos or smthing. (faster than mold)
Apple is really good at that. I just don't want to use macOS
The hardware is really good
I just don't need it
I need Linux injected directly into my veins though
to feel like a haxor 😂
linux is love linux is life
It's mainly the terminal
i have a low dpi display i prefer windows if it wasnt like a snail when compiling
text looks so good
my code lol
rust
boringtooth
two claude code talking to each other, prtty cool
Namaste
Hey the model capabilities are actually increasing exponentially (R^2=0.97) but the extrapolation is only a little bit over linear for the next year. https://paste.pythondiscord.com/UXPA
exponential on a logarithmic curve
is Elo scoring logarithmic?
buy stonks
it seems like it imo
I'm not sure, they seem capped. Actually that's what a logarithmic score would say
Isn't it just the relative win rate against other models? Does it really make sense to run a regression on that?
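On the "is Elo logarithmic?" question, a minimal sketch using the classic Elo formula. The 400-point scale is the standard chess convention; whether the arena's leaderboard is fitted exactly this way is an assumption here, not something stated in the chat.

```python
import math


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Classic Elo: probability that A beats B, given the rating gap."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def rating_gap(win_prob: float, scale: float = 400.0) -> float:
    """Inverse of the above: rating gap implied by a head-to-head win rate."""
    return scale * math.log10(win_prob / (1.0 - win_prob))


for gap in (0, 50, 100, 200, 400):
    print(f"gap {gap:>3}: win probability {expected_score(1500 + gap, 1500):.3f}")

for p in (0.55, 0.75, 0.90, 0.99):
    print(f"win rate {p:.2f}: implied gap {rating_gap(p):6.1f}")
```

So a rating gap is the log of the win odds: each extra chunk of rating takes a disproportionately larger jump in win rate, which is one reason a trend line fitted to ratings is hard to read as a trend in raw capability.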
The old models are pretty stationary with small confidence intervals
ok i think i figured the confounding gemini thinking budget out 😂 it explains everything. (its probably a logit bias lol)
That looks more impressive against IQ scores https://i.ibb.co/bj86DC89/file-RMD6gpy-PGJ4v-DPT1-Tk4-Kv-J-4.png
But what does a 350 IQ score even mean?
how is that even quantified
Tracking AI is a cutting-edge application that unveils the political biases embedded in artificial intelligence systems. Explore and analyze the political leanings of AIs with our intuitive platform, designed to foster transparency in the world of artificial intelligence. Stay informed and uncover the political inclinations shaping the algorithm...
The site author, Maxim Lott pivoted from Political Compass scores to IQ after it was clear that essentially all the models were strongly left-libertarian (social democrats) unless they had been trained not to be, like Grok and Deepseek
Interesting to me at least that Elon wants to go right and China wants to go up (towards authoritarianism)
tbh it's very much centered
Meanwhile Microsoft Bing is a Bernie bro stanning for AOC
We've come a long way from Sydney trying to force NYT reporters into adultery
Do we know why o3 pro is not on the leaderboards yet?
Volunteers don't want to pay $200/month?
@steel blaze makes total sense yeah, but to access the model via API does not cost $200 a month.
i think this is pretty dumb tbh, by virtue of alignment this necessarily is the case
but seriously, what are the IQ scores for AGI and ASI?
Wouldn't AGI be only 100 IQ?
what
"the theoretical IQ of the most intellectually advanced person in a world of 8 billion would be approximately 203."
Therefore, if you define ASI as smarter than anyone else on the planet, we will have it in October 2026
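A quick sanity check on the "rarest person in 8 billion" figure, assuming IQ follows a perfect normal distribution all the way out into the tail (a strong assumption) and using Python's standard library. The result depends on which standard deviation convention is used: this sketch gives about 195 with SD 15 and about 201 with SD 16, so the quoted 203 presumably rests on slightly different assumptions.

```python
from statistics import NormalDist


def rarest_iq(population: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """IQ at the 1-in-`population` upper quantile of a normal distribution.

    Assumption: IQ is exactly normal even in the extreme tail, which
    real test data cannot confirm.
    """
    z = NormalDist().inv_cdf(1.0 - 1.0 / population)
    return mean + sd * z


for sd in (15.0, 16.0):
    print(f"SD {sd:.0f}: top 1-in-8-billion IQ is about {rarest_iq(8e9, sd=sd):.1f}")
```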
Not sure who best to get in touch with for this but if the issue is LMArena does not have access to the o3-pro model, we have an OpenAI compatible API and have the o3-pro model since like an hour after it came out (NanoGPT)
Would love to see how it does on benchmarks.
RL is very inference heavy and shifts infrastructure build outs heavily
Scaling well engineered environments is difficult
Reward hacking and non verifiable rewards are key areas of research
Recursive self improvement already playing out
Major shift in o4 and o5 RL training
Quoting SemiAnalysis (@SemiAnalysis_)
Scaling Reinforcement Learning
Environments, Reward Hacking, Agents, Scaling Data
Infrastructure Bottlenecks and Changes
Distillation
Data is a Moat
Recursive Self Improvement
o4 and o5 RL Training
China Accelerator Production
semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/
I like how o3 is the only model that grows ELO with time
do you suspect they are dynamically adjusting its thinking token budget?
flux 1.1 pro 🗣️
Just smarter people voting in recent weeks
https://i.imgur.com/nSoElQe.jpeg corvid(crows, magpies, jays etc.) buffet
and yes, its a dish rack without the dish drainer ability.
i have no idea how the f is using this to dry dish
you said it like you knew some serious and factual insider..insights...
just my hunch, i think those special two will want to stay in the middle (0,0,0)
blacktooth new model? good output from it on my first sighting
That’s just what happens when you train a model to read a lot, know fact from fiction, understand logic and science, and communicate in a way that is both helpful and polite. The more you learn and the more you understand about the world, the more likely it is that you will lean left politically. There is a known correlation between above average intelligence and left leaning views, and vice versa.
lol it's on the arena now
yeah it's been being AB tested on AI studio for roughly a day
current hypothesis is that it is a checkpoint of ultra, or at minimum a larger model than 2.5 pro
What company / institution do you know?
(not just the name, but really some things they did)
13
54
1
DeepSeek
🐳
How do you know what's being tested on AI studio and is that how ppl have been testing Kingfall?
can't say
blacktooth is v good
Craig = Hitler?
i agree with the premise.
however, i would also argue that these models' political positions are also a result of the public opinion on the internet on many political problems being almost solely communicated through the media (mostly news) data they are trained on (which without a doubt is more left leaning on average).
furthermore, these models are also finetuned to give "safe" answers about most complex political questions instead of really going into much depth (or using their actual knowledge to answer the question). i think the second point is best seen when asking about problems that can be analysed through the lens of economics and using some readily available statistical information (two things modern models should be perfectly capable of utilising). in cases like that the models never actually use any knowledge they have learned but rather just give bland and short "safe" left-ish leaning answers instead of actually reasoning about the problems, even in cases where there is clear scientific evidence that their claim is wrong (which should be in their training data).
(with this i am not trying to bash any political opinion, one could easily observe the same thing for e.g. the more social authoritarian deepseek or economic right grok 3 (non-reasoning))
it is just that i am highly sceptical about the models really "thinking" about these questions
aka they don't actually benefit much from their knowledge and mostly rely on the opinion of the news and the "safe" options
the models are just really weak at these things without prompting (and even with it quite bad)
I think this kind of political leaning basically depends on the preferences of the post-trainers, and Bay Area companies like OpenAI and Anthropic clearly lean towards left-wing views
results of the poll.. ty for voting :)
i think bytedance will become more prominent
they are building up their seed team (quite new)
so they will move fast
I don't think DeepSeek's political leanings are intentional. They don't focus much on alignment, so I believe its political bias is closer to a state that hasn't been overly intervened with by humans, compared to OpenAI and others
nah, it is prob also the alignment process chinese models have to go through
there is no way it is untouched as the models have to comply to ccp policy stuff
yeah, but alignment generally
nah 'don't be racist' ig might be seen as 'woke'.. but that's dumb af imo
indeed
you think the raw training data reflects the better side of humanity tho..?
It really depends on how the test questions are designed. If they ask about anything related to ccp, Chinese models will trot out a set of pre-canned viewpoints (possibly distributed to AI companies). But if the test questions aren't significantly China related, deepseek's answers are generally not as affected by those deliberate, preprogrammed responses
yeah they just get triggered on ccp sensitive things
otherwise they seem generally 'inclusive' / 'tolerant' etc in the same way western llms are
like yeah don't be a dik / be kind to others.. that's their default disposition
but it's jarring how you can set them off - giving outrageously nationalistic and racist responses - if a real sore spot is hit
political compass test is also just poorly designed and tends to put most people in the green quadrant, and yeah the AIs will always just pick the ‘safe’ answer. Worth noting with deepseek I tried this a while back and found that it just answered the equivalent of “somewhat agree” or “somewhat disagree” for each question so part of it could just be that, I wouldn’t be surprised if the more mainstream models are more willing to answer strongly
However, for interactions with an LLM, a tendency to be left-leaning/altruistic/highly agreeable does make the person interacting with it feel better. High agreeableness, aside from not being able to secure more benefits for the individual in a competitive environment, probably doesn't have any major drawbacks
it by google?
yes
so it's probably better than 2.5 pro right
kingfall
kingfall > blacktooth > gemini 2.5 pro > toothless
blacktooth
huh
oh
blacktooth doesn't like to think as much as kingfall, it's back to being pretty much like 2.5pro
answers are what counts tho (and i mean less thinking the better, if they get it right)
kingfall was like struggling to perform on par with 2.5-pro, blacktooth equals if not exceeds it imo
blacktooth is definitely better than 2.5 pro
i dont think so
it feels related to but still separate from 2.5 pro (it's not just 2.5-pro juiced up).. like substantively and stylistically
the actual ultra model or something perhaps
2.5ultra for sure
Yes, I'm referring to its tendency to skip thinking in multi-turn conversations... kingfall very rarely does that
far from it
but its on kingfall level
saw some people say that it writes much better
btw all of these new models has 64K tokens limit
all unreleased Gemini test versions are 64k I think
hey everyone. Im very new to lmarena and i wanted to ask how it works. Whether i can eval my own model and put it in the leaderboards
hey there - you can run image/text prompts in a battle between two anonymous models and vote on which you prefer, after you vote it'll show you what each of those models are. more details can be found here - https://lmarena.ai/how-it-works
we are interested in adding new models. the way you make this request is by making a forum post here telling us more information about the model - #1372229840131985540
thank you very much!
that's how elo works broski
btw prowlridge is 2.5 flash lite
rough but it could be neither lmao
it's definitely at least the latter
why's that
the internal model names of both kingfall and blacktooth contain 'v3p1l', while 2.5 pro's ends in m
brian pointed that out and says it's to do with model size
oh alr so it could just be large
that'd be cool
wonder if they are making 2.5 pro bigger
doubao-seed-1.6: An All-in-One comprehensive model, it is China's first thinking model supporting 256K context, with capabilities including deep thinking, multimodal understanding, and graphical interface operations. It supports three modes: enabling or disabling deep thinking, and adaptive thinking. The adaptive thinking mode automatically decides whether to enable thinking based on prompt difficulty, improving effectiveness while significantly reducing token consumption.
doubao-seed-1.6-thinking:The enhanced version of Doubao Large Model 1.6 series for deep thinking; further improves foundational capabilities in coding, mathematics, logical reasoning, etc.; supports 256K context.
doubao-seed-1.6-flash: The ultra-fast version of Doubao Large Model 1.6 series, supporting deep thinking, multimodal understanding, and 256K context; extremely low latency with TPOT as low as 10ms; visual understanding capabilities rivaling competitors' flagship models.
Doubao Large Model 1.6 delivers stronger model performance, scoring within the global top tier across multiple authoritative evaluation sets. It holds leading advantages in reasoning ability, multimodal understanding, and GUI operation capabilities.
Doubao Large Model 1.6 shows significant improvements in reasoning speed, accuracy, and stability, enabling support for more complex business scenarios.
For example, media evaluations of this year's National New Curriculum Volume I mathematics exam showed Doubao scoring 144 points, ranking first nationally. Before the exams, in evaluations of Haidian District's mock exams, Doubao Large Model 1.6's science scores improved by 154 points and humanities scores by 90 points compared to last year's model.
Doubao Large Model 1.6 features think-while-searching and DeepResearch capabilities, enabling independent thinking, planning, and the use of various research tools like search. For example, the DeepResearch feature currently being tested in small batches on the Doubao APP and PC version can reduce the time needed to produce research reports—previously requiring multiple professionals working for days—to just 5-30 minutes. It can also automatically extract information and summarize it into web pages for easy reference.
seems like it got destroyed here on their cherry picked metrics lol
its significantly cheaper than r1
R1 is already as cheap as it can be
63% cheaper
free in fact if you don't care about speed
its 1.5 thinking pro, they now released 1.6
that still seems worse than R1 in their own graphs. So realistically the difference is probably even bigger tbh
still impressive assuming they are coming from the 1.5 pro base model
nothing huge though, we have a lot of Chinese lab with "good enough" language models these days
"1.5 pro"? isn't that like meaningless number, which model are you referring to exactly? lol
the one in the graph, doubao-1.5-thinking-pro
not the og google one..
ok how do you know then 1.6 is using the same base model? 🧐
"assuming", based on the name alone
yeah but that's a random thing to assume lol
was just a quick comment
but looking at the timeframe it seems uncertain / unlikely
that one is 4 months old, so it could very well be a fresh one
nvm now i really looked it up and it is really a fresh model, but they are kind of selling this as a efficiency gain
the old one was 200b total, 20b active and the 1.6 pro is supposed to be similar
-> close to qwen 3's size, yet still competitive
unfortunately any benchmark site doesn't include them
huh, otherway around
kingfall > blacktooth == toothless > gemini 2.5 pro
o5> kingfall
Grok tasks
@echo aurora im sorry for the ping and if this may annoy you but, could you please send a message to the team to increase Claude's limits? cause i'll be honest, 5 messages per hour is not that much, around 10 or 15 would work better. thank you 🙂
we need grok 3.5 not tasks rahh
craig > o5
King fall >>>>>> o6
kingfall > o100
No need to apologize for the ping, and yes I can pass the feedback onto the team. Would also encourage you to use the #1372230675914031105 channel for future requests
Hi, can you get in contact with ByteDance? I would love to see their models on arena
hi, can you ask OpenAI, we need o5 in the arena
kingfall vs o3 pro for coding
kingfall >> and im being srs
Neither. Dork 4.0 is best
alright thank you, could you inform about the new max message limit per hour in announcements when it's done? and sorry for not using the correct channel
if that's something we end up doing, yes we'll be sure to put out an announcement.
no worries at all!
tell us about which models you'd like to see here #1372229840131985540 
Hey @echo aurora 👋
Any news on the AERIS submission? It’s been a few weeks now.
Just wondering: is this kind of delay normal, or is there usually a rough timeline for Arena approvals?
Let me know if anything’s missing! 😊
im not pineapple but i dont think its guaranteed that your models will be added, especially if adding them wouldn't achieve much
lol I don't think they are adding this
o3 pro
🤣
please ask google to add kingfall to the arena as well. give it a chance😭
@deep adder upside down fireworks?
Was kingfall really that good..? I cant understand the hype...
yes it was
Which way
blacktooth performed merely as well
Good at code ?
i didn't test any of that so perhaps that explains our different views.. fwiw on knowledge/riddles i found kingfall to be mid; blacktooth seems legit sota
https://ktibow.github.io/lmb/anonymous has been updated
what is that
i think it's self evident
o
bruh..
hopefully google catch up soon
i think google gonna catch up soon
because it owns youtube
“According to the transitive (hypothetical syllogism) rule of implication, the proposition that has been omitted from the sentence ‘If we eat indiscriminately, we are likely to get sick because we often encounter harmful food’ is:”
Select one:
A. “If we eat indiscriminately, we are likely to encounter harmful food.”
B. “If we get sick, then we must have eaten harmful food.”
C. “If we eat harmful food, we are likely to get sick.”
D. “If we eat indiscriminately, we are likely to get sick."
this is the only question o3 answers correctly (inconsistently), and no other models get it right
Answer: A
other models' answers:
C
2.5 pro can answer it correctly if I say that its previous answer is wrong
https://chatgpt.com/share/684e45eb-f118-8003-804e-3c9b562caab9 o3-pro gets it too
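For reference, a small truth-table sketch of the hypothetical syllogism rule itself: from "P implies Q" and "Q implies R" you get "P implies R". The mapping of P, Q, R onto the sentence and the choice to treat "likely" as a strict implication are assumptions made for the sketch, not part of the original question.

```python
from itertools import product

# P = "we eat indiscriminately"
# Q = "we encounter harmful food"
# R = "we get sick"


def implies(a: bool, b: bool) -> bool:
    return (not a) or b


def entails(premises, conclusion) -> bool:
    """True if the conclusion holds in every row where all premises hold."""
    return all(
        conclusion(p, q, r)
        for p, q, r in product([False, True], repeat=3)
        if all(premise(p, q, r) for premise in premises)
    )


option_a = lambda p, q, r: implies(p, q)   # "If we eat indiscriminately, we encounter harmful food"
option_c = lambda p, q, r: implies(q, r)   # "If we eat harmful food, we get sick"
target = lambda p, q, r: implies(p, r)     # "If we eat indiscriminately, we get sick"

print("A and C together entail the target:", entails([option_a, option_c], target))
print("A alone entails the target:", entails([option_a], target))
print("C alone entails the target:", entails([option_c], target))
```

The rule needs both links, so which one counts as "omitted" comes down to which link you read the "because" clause as already stating.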
Full explanation for the GCP outage:
https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW
tl;dr The bad deployment occurred 3 weeks before the outage but wasn't being used until a new policy was rolled out. A fix was deployed within 40 minutes, but it took another 2-3 hours before all services were recovered.
Why not?
Isn’t LMSYS meant to be open, giving every model a fair shot? Skipping AERIS altogether feels like missing the point. @echo aurora isn’t the Arena still meant to be for all? 😊
No, if lmarena added all models, that would slow down voting results for other highly anticipated models
Hey sorry for the late response! I'll be sure to poke the team to remind them of our model requests. Like KT said, it's not a guarantee that we list all models; however, you make a fair point about the arena being for all. We are looking into better tooling for model providers
yea it might be different for riddles/knowledge, maybe the reason why they released to the arena (to proc a higher elo) :/
if that's lmarenamaxxing, then i'm fine with it lol
a smart model with notably solid spatial / emotional reasoning, which doesn't use emojis (unlike kingfall) or provide fluff in its responses
my kinda model
svg capabilities of kingfall were much better tho
we need a functional model
I've never seen kingfall use emojis in my usecases
great i can't wait for grok 3.5 to be absolute slop
other models can and have had more than 1 checkpoint
how much money are they spending on training o5?
this is in my thought lately, after experimenting with arena for some time
didnt know the checkpoint spam tho
it is not like lm arena is the only place you can get hf data 🤨
the "cheating" might be possible, but that could only explain a small margin
statistically speaking
unless they are using 1000 checkpoints (which they are clearly not)
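A rough way to put numbers on the "small margin" claim: a minimal Monte Carlo sketch of the best-of-N selection effect. The assumptions are loud ones: every checkpoint has identical true skill, each measured score is true skill plus Gaussian noise, and the 10-point noise level is made up for illustration. This is not how LMArena computes scores.

```python
import random
import statistics

def mean_best_of_n_gain(n_checkpoints: int, noise_sd: float,
                        trials: int = 20000, seed: int = 0) -> float:
    """Average score inflation from publishing only the best of n noisy measurements.

    All checkpoints share a true skill of 0, so any positive result is
    pure selection bias, not real improvement.
    """
    rng = random.Random(seed)
    gains = []
    for _ in range(trials):
        best = max(rng.gauss(0.0, noise_sd) for _ in range(n_checkpoints))
        gains.append(best)
    return statistics.fmean(gains)

if __name__ == "__main__":
    for n in (1, 3, 10):
        gain = mean_best_of_n_gain(n, noise_sd=10.0)
        print(f"{n:>3} checkpoints -> average inflation {gain:.1f} points")
```

Under these assumptions the inflation grows slowly with the number of checkpoints and scales with the measurement noise, so it shrinks as each checkpoint collects more votes.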
the only thing these companies could be doing is getting a better understanding of the average lm arena user's preferences
(which is why i would like them (lm arena) to do more work on actually figuring out "who" that is for all the companies)
New Gemini named 68zkqbz8vs
where?
fwiw both are arguably correct imo (it's oddly phrased for hypothetical syllogism)
Cohere published a paper about it https://arxiv.org/pdf/2504.20879 . its not a small effect
i think the point about os models being at disadvantage is somewhat fair; like they don't get incremental updates (notwithstanding new R1 ig), so the labs making them don't have a bunch of checkpoints to release anonymously
but there's nothing stopping labs from submitting anonymous models
chinese included
most prominent oss models now are chinese. qwen and deepseek seem to opt not to submit anonymous models. other chinese companies do, stepfun, bytedance etc
so i guess it feels like it because qwen and deepseek opt not to
ik, i read it
no big effect
at most something like 35 points (but only in the first second on the arena with like a +/- 20 confidence interval)
and all the top models have wayy more testing data (than what they had for their example)
furthermore this "cheating" should naturally converge to normality assuming that the model stays on the arena for a prolonged time after release
thought you were still at the "cheating" part, sorry
the part with the hf data i get
but as i said one can get it from everywhere else
though the paper they wrote was also a bit extreme, with them using like 70% arena data in the extreme cases
and btw this example does not hold as the confidence intervals on these two models are wayyyy lower, so the measurement can not really be cheated by just submitting more
imo the effect they talked about is just very overblown in the paper, unless lmarena only benchmarks for a very short time the confidence intervals will be low enough
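To illustrate the confidence-interval point: a back-of-the-envelope sketch of how the interval on an Elo gap narrows with vote count. It assumes a single head-to-head pairing, the standard Elo relation d = 400·log10(p/(1−p)), and a delta-method approximation; LMArena's actual fitting procedure is more involved, so treat the numbers as order-of-magnitude only.

```python
import math

def elo_ci_half_width(n_votes: int, win_rate: float = 0.5, z: float = 1.96) -> float:
    """Approximate CI half-width (in Elo points) for a rating gap from n_votes.

    Uses the binomial standard error of the win rate and the derivative of
    d = 400 * log10(p / (1 - p)) with respect to p (delta method).
    """
    se_p = math.sqrt(win_rate * (1.0 - win_rate) / n_votes)
    d_elo_dp = 400.0 / (math.log(10) * win_rate * (1.0 - win_rate))
    return z * se_p * d_elo_dp

if __name__ == "__main__":
    for votes in (1_000, 4_000, 16_000):
        print(f"{votes:>6} votes -> about +/- {elo_ci_half_width(votes):.1f} Elo")
```

The half-width falls with the square root of the vote count, so quadrupling the votes halves the interval. With around a thousand votes an evenly matched pair is still uncertain by roughly twenty points in this approximation.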
btw you can also see the effect I am talking about with Anthropic models, which don't release a new checkpoint every few days. Opus currently rank 5/sonnet currently rank 9. That just flat out doesn't match many people's opinion that opus/sonnet are still frontier models.
that is not the reason why they are low 🤣
they are willingly not optimizing from hf
they pioneered rlaif and have always had a weird stance on optimizing for human preference
they are censored, their answers are short and dry
how is rlhf cheating
it is like saying a model that received rl training for SOME math problems is cheating on ALL math problems
furthermore the whole reason why we have these chat bots is rlhf
well you are just assuming that fulfilling human preferences is not genuine performance
performance is something that is relative to your benchmarking / reward metric
bro it was so bad recently that OAI had to revoke a checkpoint. it was in the media the whole sycophancy thing. nobody thinks that is good.
btw that is not really what reward hacking is
and even if you were to argue that these models have somehow not provided genuine performance (which you can kind of decide, because everyone has unique preferences about what they regard as good), there is still no reason to believe that it is in any way correlated with the amount of training data collected from lm arena specifically
getting a model to exhibit that behaviour is not really something you need lm arena for
reward hacking is per definition something unintended
which does not fully apply here
which is why "reward hacking" and "cheating" are really harsh labels for what is happening here
i agree with that, ideally we would have the following:
- report on who the actual users of lmarena are (like how they differ, from the average chatgpt user)
- separate sycophancy, structure and more
- get more people on the arena -> more robust scores
- better tools for companies, so that more will integrate arena into the development process
well that was one thing you were talking about, i think it was pretty obvious that i was not talking about just the specific tendencies of one model to behave a bit different, but more about the practice of rlhf in general
and the extent might be unintended, the fundamental tendency for models seems to be very desirable though
(and all the models exhibit it to some extent)
well the arena does not benchmark such a thing and i also agree with you that claude is known to outperform almost all other models when it comes to agentic and long-time (coding) tasks (even if it is ranked quite low in the arena)
imo they should really work on just expanding user size and the models served very heavily so i fully agree with the point
though the adjustment mechanism for this checkpoint count can really be at bottom of their to-do list (as far as i am concerned)
paper overblew the effect and essentially the confidence intervals already give the users plenty of information
Claude sucks because somehow anthropic nerfs it with the system prompt. Claude chatbot is very pleasing, in the arena Claude answers and subtly says "frick off!".
While other vendors push on being pleasing, to get points, Claude does the contrary. It could also be a strategy (so people don't take lmarena seriously)
Claude answers so dry, that it is easy to spot and one could upvote/downvote that to heaven/hell
Claude says....."frick off"???
How do I trigger that 😆
Claude's limited context window makes it ass!!!!
Service Industry:"customer is always right"
sycophancy might be intended to retain "users"
it's all business, after all?
they seem almost literally indistinguishable to me..?
i find it doubtful that they're deliberately nerfing the version served to the arena anyway.. like perhaps they don't give af about it, but why they'd go out of their way to do poorly on it makes very little sense to me
on an unrelated and fairly minor point (which has prob already been pointed out), i noticed earlier that you can kinda unmask whether a model is a 'thinking' model before voting through the re-run button - there is no artificial lag to equalise the two.. so in the case here, it's clear the model on the right is a thinking model
(blacktooth being the thinking model, as it turns out)
In my test (one shots, without saying "hi" or anything like that) the claude ai replies with "what a nice question" and other cringe stuff, on lmarena it just replies as if it cannot be bothered.
I mean the battle mode though. Not the direct chat.
Asked o3 why image artefacts appeared. It thought for 10 minutes. I checked what is inside thought process and was mind blown 🤯 It literally simulated various image artifact theories in python. With his own images as references and provided by me. And it's not even pro version. Can any other model do this?
I complained months ago that Claude models on LMArena have an annoying paternalistic tone; I'm glad that other people are noticing this as well.
Am increasingly confident that Grok 3.5 will be the smartest AI by a significant margin
Elon claiming he will beat everyone
Would xAI be able to?
Hello guys, I'm new to using LMArena, are the models there the same as the ones you pay for, for example when subscribing to chatgpt and using o3 there ? If yes, wouldn't people just not pay for a chatgpt subscription and just use lmarena ?
You have no privacy while using lmarena
You are limited in text and images as well
if i had to name one person on this earth that i trust least to deliver on bold technological promises it is quite clearly elon
Compute power doesn’t mean much now, he might be able to if there is a significant underlying architecture improvement but Elon is full of hot air believe it when you see it!
well even if he does its only temporary with the pace
happy to give grok 3.5 a try either way
Considering you have no privacy on any other platforms anyway, even if you pay for a subscription, then there's no reason to pay for a subscription to any LLM service then, right ?
Just use LMArena for free access to any models ? Of course contributing to which one is the best
Don't you have ethics?
currently the data they gain outweighs the cost of abuse on the platform
Perfect then it's win-win for everyone
you quite clearly have privacy on the other platforms
you can literally turn off training on your data in almost all of them
and they often have commitments to delete your data in temporary chats
You do not. For example OpenAI is currently being sued and is bound to log prompts and answers.
yes, however they can not use it for training
False, I am a cybersecurity engineer and this is why we developed and implement self hosted LLM models in our clients infrastructure
false? 🤨
I use public models because they're more powerful and do not use any sensitive information
Compliant companies do NOT use public solutions lol
At least not in Europe
What we do is self host the models in an Azure infrastructure (or AWS / GCP)
All LLM websites are blocked by proxies in big companies xD
It's called Shadow IT
ok, then i do not get your point at all honestly, why would you have no privacy on all the chat apps, if they have legal commitments not to use the data
this has nothing to do with what you do at work
or anything else
I mean, when you prompt the models on the public websites (i.e. not self hosted) it has to process your query and use the data you've input
So you've just sent sensitive information to foreign countries, the worst being China and the USA
That's why all LLM websites are banned in big companies
Have you heard of the Cloud Act ?
well that has nothing to do with my argument, and btw many companies also host their stuff in the eu
yes, i live in the eu
Yeah so for my use case, paying for a subscription would be stupid since I can get all the queries I want for free on LMArena, that was my question initially
yes, my problem lies in the fact that you just label all the other options as identical to lm arena privacy wise
which is just not true
Technically it is, no matter legal agreements, that's why we ban these websites
those screenshots are probably fake
well, in a company policy that is different, because these companies can quite clearly not risk the data going anywhere else, however to assume that all is equal beyond self-hosting is just a plain oversimplification
these are fake
btw, idk if you are aware but companies also just enter legal agreements with ai companies (and verify that the data stays in the eu) and that is about as much as they do right now
nobody self hosts statistically speaking unless they have big potential for finetuning or are really really privacy concerned
I know, I have been discussing this with major companies and their sales department. Even they tell you that if you have any sensitive info. to use in prompts then you should self host
My company even sells a hardened model to implement lol
well, nice but i am actually aware of countless big eu companies that do not self host, but just enter agreements
(obv they don't let the employees share everything)
but just give some basic context to the models
That's the point yes, it's hard to control though since you can't really control prompts finely
In any case if someone uploads a confidential document or info in the model, it goes way outside of any legal agreements and it's the clients fault so... lol
That's why if you need LLM power, you usually self host
well but stating that such a situation would be equal to you sharing ALL your personal information with a: the ai companies, b: lm arena, c: potentially the public is just weird
that is my only point
European Enterprises aren't deploying on american servers due to privacy violations
A lot of small companies do intentionally as there is no alternative for europe.
European enterprises are deploying on azure cloud openai models
The most privacy-friendly way you can use LLM models is choosing a european inference engine
Often times they only offer open source models.
VertexAI hosts Claude and Gemini models for enterprises in europe
I asked Claude Opus to write my Father’s Day card and it ended like this Happy Father's Day, Dad. Your impact echoes through generations yet unborn.
What the hell is this
Nobody says that
It’s so bad I have to write it myself 😢
why is this bad? I think it's great
For some strange reasons, I see LLMs using the word "echo" a lot lately....
when is grok 3.5 even gonna come out? 😭
I think it's a more cultural thing, different culture congratulates differently
this sounds more like it could work in high context culture
Hmm I think it'll be close. Kingfall aka 2.5 flash lite is too good
kingfall is gemini nano idiot
It’s the yet unborn fact like it’s addressing some Neolithic culture
how do yall know??
how do you know kingfall is 2.5 flash lite?
i thought it was a big model
no, it's actually GPT 75.857 Super Ultra Pro Plus High Golden Mega Edition
kingfall is actually dork 5
you're all dumb, kingfall is just llama 4 reasoning
you got the name only half right, its actually gpt-4-0314 thinking pro cons@1024
imo it makes up for it in freedoms WRT software and hardware
ok but theoretically i could run android on a supercomputer
also theoretically i could boot up termux and drive a gpu over usb (does ios have a termux equivalent?)
You're right I'm buying an iphone right now thanks to this
and btw for ai that is not really the case https://ai-benchmark.com/ranking.html, android ain't that bad in this area
obv. not perfect benchmark
ik the website looks sh*t
well there are a lot of other more reasonable phone in between
and the bench is mostly about older image stuff (like 4 yo)
but it is still unfair to just pretend like apple is king
No apple and grok is king because Craig says so
tru you convinced me
i will now mindlessly delete all my comments, like you usually do 😳
We're having a funeral for kingfall in July unfortunately
Blacktooth is the next revision
Some say it's better I don't like it though
It sucks at SVG compared to kingfall, clearly the most important capabilities test
Gemini
Apple usually doesn't miss. But when it comes to AI they've dropped the ball in a very public way. In this episode we see the messy events behind the scenes from a lack of leadership to i...
not answering your question
i think i watched that
just posting this here because it's been a subject of debate before whether apple fumbled
o
i think apple fumbled
pro?
Don't think so, it was on Web Arena
FWIW, o3 Pro (High) actually scores lower than o4-mini (High) and o3 (High) on ARC AGI 2. Claude Opus 4 is leading, despite Anthropic focusing more on agentic coding tasks.
oh
They are cooking with the big models. Maybe ultra is indeed coming https://x.com/sainemani1/status/1934268293806014864?s=46
and ppl think i was trolling
kingfall > o3 pro
wtf i was just using gemini 2.5 pro preview
trey whats the next model id
I don't think it's worth paying attention when we are speaking of sub 10% performance. It's just noise. The march version of gemini scored so low it wasn't even published. And yet it was such a great model.
lies
wait so is blacktooth also off of lmarena
elon is delusional
elon is enough times right
like you are
They’re probably also experimenting with other architectures
They already made a “Titan” architecture that’s better than Transformers memory wise
But soon there’s going to be an architecture 3x better than transformers at everything
i dont think elon tried kingfall
i remember last year in june, said grok 3 is a significant order above the sota
its not going to be good, and if it is, just name it grok 4
Elon will always say that
sota is currently o3 pro, and next week its deepthink, so...
yo wait
I just had a revelation
what if blacktooth is just the 2m context variation of 2.5 pro
theyre inevitably going to have a "different" model at GA release than the current 0605
just removing a 1m cap doesn't cut it
I mean tbh, none of this matters if they're simply deciding to change up the labels given different kinds of capabilities, rather than model size explicitly. It's a fact it will be goldmane but that doesn't really exclude anything I said
ye but I am still wondering, how they're going to check off a lot of those things they apparently have "planned" or from a consumer standpoint, observing how they're even going to move forward in LLM innovation
multi modality was definitely a major thing then
native multimodality has been accomplished
Are you talking about those slides from the world fair?
this thing
so there's probably going to be a point where theyre intending to really add some twists, probably
and this could Include having a bigger model, but the bigger will no longer be bigger just based off traditional model size
Technically native video generation hasn't happened, and there are far more modalities than the 5 human senses
some way out shi
true, technically demis said this
he's working on it
e.g. robotics, 3D models, etc
but not just that
since he's planning on integrating spatial capabilities
I think that alludes to a more direct kind of language thing too
ion know tbh
how will DeepMind move forward
ye
but I do think they're really trying to work on some unique stuff
diffusion was somewhat unordinary but kind of expected
yep this really requires some way out stuff
it can't be the way it is now
we can't do anything but try to shrink other stuff into that context window instead of brute forcing the holistic expansion
it'll never be true infinite context
I agree
Large Language Model
Large Language Model
Is there a release date for Grok 3.5 ?
eternally soon
"guys it's going to come out any moment now"
"it's going to be b4 june trust"
Got it
It's really a disappointment
idk im with the chair on this one
Polymarket | This market will resolve to "Yes" if OpenAI's GPT-5 model is made available to the general public by June 30, 2025, 11:59 PM ET. Otherwise, this...
Maybe for a couple months
tbh I don't think most people will talk about GPT-5 either
Right so the people who are still hyping models are going to be the type of people who would be watching all of the big models
I tend to agree, although I think the number of people who currently pay for AI is a small fraction of the number of people who will pay for AI in 2-3 years, and I think model capability will be a big part of that discussion even if it's not very deep
The first mover advantage and mindshare of ChatGPT is absolutely real
Although I think the positioning of Google is a bit stronger and that will matter
yea
Chat bots don't have the same level of lock in as a well developed ecosystem like a mobile OS or mature enterprise software
Subscriptions will grow a lot. I'd bet thousands of dollars on that
First of all, the market is nowhere close to saturated yet. Second of all, the free tier is a massive funnel for the paid tier
And capability and reliability are increasing over time
Of course subs are going to grow a lot
The marginal buyer won't mainly be regular normie consumer at first. It will be high propensity buyers somewhere between normie and techie, and that margin will gradually shift towards normie over time
Keep in mind that in the United States, over 100M people subscribe to Amazon Prime
tbc Asian developed markets are even more high propensity for AI adoption
but yeah
The thing is we're looking at chat bots that are relatively inconsistent and unreliable. If it was far more consistent and reliable, it would be impossible to live without
Basically god in your pocket
you can just copy paste it into gemini
yea
and you can use gemini on gmail, chrome, android products, maybe even in youtube someday
takes kinda long
sometime
It would be hard, for example, for OpenAI to convince people to move to an email service hosted by OpenAI
Or to replicate something like workspace
lol
Right and that's the kind of thing that pushes normies to buy a subscription. In the meantime, the marginal buyer will be between the die hard techies and normies, and it will keep shifting with increasing reliability, capability, integration, etc
That's generally how tech adoption works
It doesn't have to reach AGI to massively grow subs though
The market itself is still growing, while reliability is increasing
The top of the funnel is getting bigger
More free users
civitai has been struggling with that
Payment processors are notoriously anti-NSFW
tbh though the most competent people usually don't join that industry because of the taboo
who yall think reaching agi first?
It's the opposite of prestige
company
Why China?
I think they're behind on R&D too
Plus big corporate governance risks
DeepSeek was super impressive though
No Google did
Google's going to make their quantum chips good enough so that they can train their models 10^30 times faster
Extremely unlikely
I'll agree they were the first to do test-time scaling, and that is a big deal
Nah, they're tapping into "alternate universes" lol, and their Willow chip has lower error when scaled instead of more
That's not the reason why
Quantum computers radically speed up a tiny fraction of all computer science problems and do nothing for the other 99.9% of problems. With that said, a few problems in that .1% are important. If a critical AI problem happens to show up in that .1%, then we get the scenario you're talking about
Google is leading in quantum. The issue is that quantum algorithms are only applicable to a tiny percentage of all problems. The optimistic scenario would be that they lead to some scientific discovery that indirectly results in a big improvement in AI (e.g. material science, simulations, etc)
hello everyone
I am using Gemini 2.5 Pro 06-05 on AI Studio, and would like to know if t=0.7 is the best value so that it is realistic
wym?
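Some context on what the temperature knob does in general: a minimal sketch of standard temperature-scaled softmax sampling. The logits are hypothetical, and the chat gives no detail on how Gemini applies temperature internally, so this only illustrates the textbook mechanism and says nothing about whether 0.7 is the "best" value.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw logits into probabilities after dividing by the temperature.

    Lower temperature sharpens the distribution toward the top token;
    higher temperature flattens it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores
    for t in (0.7, 1.0, 1.5):
        probs = softmax_with_temperature(example_logits, t)
        print(t, [round(p, 3) for p in probs])
```

At 0.7 the top token gets a larger share than at 1.0, so outputs are a bit more focused and less varied. Whether that reads as more "realistic" depends on the task.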
I thought Google brain was the first to do it
back in like 2021
Oh you might be right. I haven't kept up with all the papers
and iirc
Google had a math specialized 1.5 pro
early 2024
that was explicit too
Looked it up:
Adaptive Computation Time for Recurrent Neural Networks was published by DeepMind in 2016 and introduced the idea of scaling inference time to improve performance in deep learning
Universal Transformers was published by Google Brain in 2018 and applied the idea of scaling test-time compute to transformers but didn't call it "test-time scaling"
OpenAI was the first to do it in an LLM product though
?
but the dates I mentioned were ONLY for LLMs
I know for a fact there were test time implementations prior
Can you point to an example? I'm not 100% sure on this
1.5 Pro didn't have test-time scaling
Sundar has joked, "Imagine if you could time travel to 5 years in the past and told people that your big innovation was that you can get increased performance if you let the model think for longer."
I think "reasoning model" is a branding exercise, and the actual innovation was applying test-time scaling to LLMs
talking about the math specialized variant, but Ig that could be simply an explicit reasoner, scaling via sampling and verification, but even after that, before o1, there were other papers like https://arxiv.org/html/2408.03314v1 that verbatim employ it that way
I see. That is earlier than o1 (preview). I wouldn't say test-time scaling is exactly the same thing as reasoning. Google invented CoT for example
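For a feel of why "sampling and verification" counts as scaling test-time compute, here is a toy sketch. The assumptions are strong: samples are independent, each is correct with a fixed probability p, the verifier in the first function is perfect, and the voting function treats the task as binary. None of this is taken from the linked paper.

```python
from math import comb

def best_of_n_with_perfect_verifier(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n

def majority_vote_accuracy(p: float, n: int) -> float:
    """Chance that more than half of n samples (n odd) are correct, binary task."""
    if n % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(needed, n + 1))

if __name__ == "__main__":
    for n in (1, 3, 11):
        print(n,
              round(best_of_n_with_perfect_verifier(0.3, n), 3),
              round(majority_vote_accuracy(0.6, n), 3))
```

Both curves rise with more samples, which is the basic trade of extra inference compute for accuracy. Voting only helps in this toy model when a single sample is right more than half the time, while a good verifier helps even when p is low.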
we need a kingfall eta *wink* *wink*
Google and OAI co-discovered the method though
true but Craig is just being disingenuous
since public release functionally is meaningless
ok buddy
so unless you present that distinction it's not going to be that way
I would also say that reasoning isn't just test-time scaling even though that's how it has been branded. Google Research also invented chain-of-thought among other things
if we're talking about reasoning then it's definitely Google via STaR or scratchpad
but test time scaling, still Google but in LLMs it's later
I'm sure we've made teacher models that big before. 4T+ parameter models are sub-optimal for serving though
thats what oai did with o3 preview
I'm ngl imagine how good a 4T model would feel
Slow
vibes as in: I hit the enter key and make entire data center go brrr haha
1.7T
ye but besides joking, a 4T model would be really easy to serve for employees
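Rough arithmetic on what serving a model that size means, as a sketch only. The assumptions: dense weights stored at 2 bytes per parameter (e.g. bf16), a hypothetical 80 GB accelerator, and weights only, ignoring KV cache, activations and any sparsity or quantization.

```python
import math

def weights_memory_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def min_devices_for_weights(n_params: int, device_gb: int = 80,
                            bytes_per_param: int = 2) -> int:
    """Lower bound on accelerators needed to fit the weights alone."""
    return math.ceil(weights_memory_gb(n_params, bytes_per_param) / device_gb)

if __name__ == "__main__":
    for label, params in (("1.7T", 1_700_000_000_000), ("4T", 4_000_000_000_000)):
        print(label, weights_memory_gb(params), "GB,",
              min_devices_for_weights(params), "devices minimum")
```

Under these assumptions a 4T model needs on the order of a hundred such devices before serving a single request, which is the kind of cost behind the "sub-optimal for serving" remark.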
if community notes were on discord, craig would take the entire padding in here
deadass
im kid
The thing is: Extremely high parameter counts are what you do when you don't have the infra innovation to go lower. But most people think it's: high parameter counts mean you innovated enough to support a model that big.
Because most of the innovation is in getting more from less. It's true that it does require some expertise to get to really high counts, but it's not the ideal place to be.
Long to train is bad
Iterations are good
Expensive is bad
Capacity is good
is it infra in this context?
i think it's more architecture
& data
It's a combination of infra R&D, ML R&D, software engineering, and architecture
You had to resort to a high param count
It's kind of like saying why is a slow model bad
why would bad infra disincentivise/prevent training of small models?
It's one of many factors. The whole stack requires innovation at every layer. In the case of infra, for example, the serving stack requires a combination of excellent infra and hardware
i can get how infra innovation helps with training larger models or more complex (eg MoE) models but it is objectively harder to set up infrastructure to train a large model (large gpus with complex linking) compared to setting up infrastructure to train a small model (scales all the way down to a laptop)
Right so you're absolutely correct that the floor is way higher
What I'm talking about is the ceiling
In other words, the barrier to entry is higher for large models, although I think achieving SoTA performance with say 500B params is far more impressive than doing it with 2T
One of those might be deepthink
Bunch of deletes, did you get it to work now?
it seems like its working (havent tried on my end)
but that forum
Yeah it's nothing important btw
They had a list of side by side ab test pairs. Jfd is blacktooth
I recommend blocking jsreport and count tokens
No
Heard someone say that, but maybe they were just speculating
Why is perplexity tweaking, why did they translate Gemini to "Zwilling" for the German version?
lol
something awakened in perplexity's blood
New minimax reasoning model, minimax m1
According to "The Information," the model will be open source.
dork 4
werent they all blocked