#general

small haven
#

prompt

ocean vortex
#

it's not correct but that's in-line with what most models answer lmao

#

o3-pro to a variation of this prompt answers same

civic flame
#

can't wait for o9 pro to get it right

small haven
#

what?

#

ni hao ma

#

kingfall was better :?

#

darkmouth wins

keen beacon
#

idk i like kingfall better

small haven
#

lmao

#

its like 50/50

olive mesa
small haven
civic flame
#

just SVGs?

keen beacon
#

yes 🤣

#

LMAO

civic flame
#

😭

#

can't wait for agi to come out and people here are saying claude 5 is better because the svgs look nicer

keen beacon
#

it looks way better

civic flame
#

horrors beyond comprehension

small haven
#

i tested code, but it's 50/50 with kingfall, it's not the be-all and end-all

#

this tops

olive mesa
#

Yea, but its very good lol

keen beacon
#

"make a svg of what you look like"

#

imagine 🤣

small haven
#

@deep adder what you thinking boss?

#

actually its craig's prompt, nvm

#

and prolly not using auto thinking :/

civic flame
#

prod-common-global__/aistudio/gemini-v3p1l-rev20-toothless-sc__main__/aistudio/gemini-v3p1l-rev20-toothless-sc__2025061201__model__variant

#

this is still ultra

#

also of note is that this checkpoint was completed yesterday by the looks of it

#

and that behind the scenes it is called toothless still

#

(toothless was briefly AB tested for ~12 hrs, it appears this is the same model but they just changed the name?)

civic flame
#

you can't do it like that

#

this is the internal/backend model name

#

you have to use the outward name

small haven
#

aight

small haven
civic flame
#

it could probably get it right with enough attempts but not anywhere near consistently

#

just like any other current model

small haven
#

hmm

#

wait so is ultra kingfall, or even better

keen beacon
#

both are ultra apparently

civic flame
#

both kingfall and toothless/this are probably ultra

#

just different checkpoints

#

with the latter being the latest one

small haven
keen beacon
#

who is ddosing ultra

#

put ur hands up

small haven
#

why is google so leaky

#

can claude be this leaky, i wanna try neptune v2

keen beacon
civic flame
#

yup

small haven
keen beacon
small haven
keen beacon
#

im guessing

#

completely random guess

small haven
#

hmm ok

civic flame
#

not a new model

#

👍

civic flame
late path
#

imo toothless vs kingfall is like the regression of 0506 vs 0325 :(

small haven
#

bros having too much fun with that $1 deal

civic flame
small haven
#

i was gonna say.. hahaha

keen beacon
#

0325 is still available to access using the dark arts

civic flame
#

is that what we're calling it now lmaoo

small haven
#

didnt know what that grave guy was tryna accomplish until he hits 100 usage lmao

small haven
#

or no teeth

#

👀

#

ultra no thinking feels like kingfall esque

keen beacon
#

it is xd

small haven
#

lol

zinc ore
keen beacon
#

kingfall is still the undisputed svg king tbh

small haven
#

*and code

keen beacon
#

most of my prompts to it were svg requests 😂

civic flame
#

lol what

small haven
#

but with darkmouth, took at least 10 turns

#

to fix the compilation errs

keen beacon
#

lemme see if i give it two 6x6 zebra puzzles does it still solve them

soft kernel
#

Wait where?

civic flame
#

and probably cheaper for them to provide

keen beacon
soft kernel
#

I meant the model that made the svg

small haven
small haven
soft kernel
#

Lol

#

Which model's this

#

The second svg

civic flame
#

it's 03-25

soft kernel
#

Really?

#

Damn didn't think 03-25 is that good

torn mantle
#

@civic flame how can i use deep think

#

is there a way

#

yes or nah?

civic flame
#

not as far as i know lmao

small haven
#

everybody be doing trading strats 😭

#

i think u are suffering from an overfitting symptom my guy

#

rentec gets maximum 54% w/r to become a billionaire, so u must be a gazillionaire by next year

#

true only janitors

#

oh wait i just fact checked myself, its actually 50.75% w/r on the medallion fund

#

300 mit-educated phd's have been refining their system for years, but craig is single handedly overthrowing them with o3-pro, gg's

civic flame
#

can't even navigate a website like a human these days

keen beacon
#

was it ur mouse actions/movements actually?

civic flame
#

"much faster" i clicked 2 links in the span of 10s..

civic flame
#

no idea but id presume someone looking at AI studio's network tab and generating over and over again until they got it? idk

keen beacon
#

my guess too

small haven
#

don't patch it :/

late path
#

no way😭

keen beacon
#

i dont like this one

#

its worse

#

im glad this model or a version of it will be released one day though

small haven
#

cc: @deep adder

keen beacon
#

im getting too distracted testing these models and the thinking budget thing

small haven
#

u can technically have 55-60% but those opportunities come rarely

jade egret
small haven
#

rentec 50.75% is based on a tick basis

patent aspen
#

The Medallion Fund and Warren Buffett had 50-60% strats during specific time periods

#

Not normal though

#

You would need to have a mind like Warren Buffett tho

#

Neither is something anyone should expect to replicate

#

I'm talking about early to mid career Buffett

small haven
#

? all hft firms are wym

#

i think ppl dont realize where that 50.75% w/r is coming from, tick basis vs 1-5yr term basis, is a totally different game
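
To make the tick-basis point concrete, here is a minimal sketch of why a 50.75% win rate only means something at very high bet counts. It assumes even-money bets (+1 on a win, -1 on a loss) that are independent of each other, which real trading does not satisfy; `edge_stats` is a hypothetical helper name, not anything from a trading library.

```python
import math


def edge_stats(win_rate: float, n_bets: int) -> tuple[float, float]:
    """Return (mean profit per unit bet, mean/stddev ratio of the total over n_bets).

    Assumes each bet pays +1 on a win and -1 on a loss, and that bets are independent.
    """
    mu = 2 * win_rate - 1           # expected profit of one +/-1 bet
    sigma = math.sqrt(1 - mu * mu)  # standard deviation of one +/-1 bet
    # Total mean grows like n, total stddev like sqrt(n), so the ratio grows like sqrt(n).
    return mu, math.sqrt(n_bets) * mu / sigma


mu, ratio_million = edge_stats(0.5075, 1_000_000)  # tick-style: huge number of bets
_, ratio_ten = edge_stats(0.5075, 10)              # long-horizon style: a handful of bets
print(mu, ratio_million, ratio_ten)
```

Under these assumptions the same 1.5% edge is buried in noise over ten bets and dominates the noise over a million, which is the gap between a tick basis and a 1-5yr basis.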

patent aspen
#

Warren Buffett is special

small haven
#

😭

late path
#

value investing: bet on google #1 every month on polymarket😂

small haven
#

likewise?

#

craig talked to a rentec employee!!

patent aspen
#

Buffett averaged 50+% returns for like 20 years during his early career

small haven
#

o3 pro + craig >> 300 phd researchers

craigbench'ed

patent aspen
#

But he saw it and others didn't

#

It was only easy in hindsight

#

tbc I do think it has become much harder

jade egret
#

woah

small haven
#

not AI related!

echo aurora
#

lets keep things relatively focussed on AI pls blobthanks

small haven
#

bing chillin

keen beacon
#

hmm this new model thinks a lot at least on specific problems compared to before (and it sucks/less accurate) even though it thinks much longer. it took 47k thinking to solve two zebra puzzles (only second was right). (thinking budget = auto, as it's uncapped)
kingfall did it in 14.5k and got both of them right

patent aspen
#

Buffett read about some obscure gold discrepancy in hopes of an arbitrage opportunity for 30 years before making a move on it at the right time

#

Talk about discipline

#

It wasn't worth much but it was fun for him

#

Buffett also bought a lot of low quality businesses that were hard to get right - railroads, some random candy company, oil refineries, etc

#

Banks

small haven
keen beacon
small haven
#

wazzup beijing

patent aspen
#

Right but he was smart enough to realize that and made that choice intentionally

small haven
keen beacon
#

omfg who is ddosing the model

#

kingfall should make all of your financial decisions

small haven
#

kingfall + craig >> o3 pro + craig

#

🤯

keen beacon
#

fictional model

#

nah it no longer exists

#

its manipulating the market as we speak

#

to serve craig

small haven
#

99% w/r

civic flame
patent aspen
#

Technically 50-60% strats exist today - getting a lucrative degree, job hopping, etc

small haven
#

ok buddy

misty vault
jade egret
keen beacon
lapis light
jade egret
surreal creek
#

Mutahar is a mean person lol

#

fitting

patent aspen
#

I got a cheap-as-dirt Thinkpad and am going to mess around with Arch

#

Kind of like Christmas

keen beacon
patent aspen
#

I will

#

@small haven recommended Niri. I'll probably mess with that, Hyprland, and i3wm

keen beacon
#

i dont really like linux because of the poor text rendering 🤣

#

i have to use it though

patent aspen
#

Linux reminds you why you're alive

civic flame
#

liquid glass is a meh design system

keen beacon
# keen beacon i have to use it though
  1. amd drivers suck on windows 2. compile times are way faster with mold, rustc uses pgo on linux, and i can't dynamically link against polars on windows because of dll limitations. static polars takes a long time even incrementally
patent aspen
#

No Linux is way better than any substance or tool

#

It's self actualization

#

VR is pretty cool ngl

#

I called it VR for lulz I know it's supposed to be AR

#

Was waiting

#

Anyway it's VR

civic flame
#

meh

#

vision pro is a very cool piece of tech however

#

it did not catapult the medium into the mainstream like apple were probably hoping

patent aspen
#

I think Apple is mainly derisking

#

They don't want to be too late if there's any risk of a platform shift

civic flame
#

unfortunately for them, AI is probably the first time they have been so hugely behind in such a rapidly progressing area

#

lol who are you kidding

#

apple intelligence was a pretty big example of overpromise, underdeliver

#

for on-device AI? samsung

#

hundreds of millions of people..

#

lmfao

#

that does not mean apple are ahead in on-device AI? what are you trying to prove here

patent aspen
#

It's okay to be a real estate company even if you don't innovate

civic flame
#

Apple Intelligence when it was announced was intended to put themselves back in a dominant position and fix the fact they increasingly looked like they were lagging behind in an emerging field

#

they have failed to achieve that

#

notice that at WWDC they barely mentioned it

#

most of their best features don't even come from them

#

they come from partnerships

#

even if it doesn't in the short term, it will in the long term

#

because apple are not as innovative as they once were

#

they seem to be doing some soul searching

#

desensitized? i don't know if i'd say that

#

that's just the pace of competition now

#

apple have to keep up or they're going to be doomed

#

they threw money at vision pro, it has not yielded big results, they threw money at apple tv, it has been in the grand scheme of things a flop

patent aspen
#

tbh I think Apple is still in a strong position until some AI feature is so important that it makes people switch to Androids and can't be replicated by a partner

civic flame
#

they have not innovated much in regard to their key product lines in a while

#

perhaps the most innovative thing they've done in the last 5 years is their M-series chips

#

and vision pro from a non-commercial perspective

keen beacon
#

did yall see this btw https://arxiv.org/abs/2506.09250 c. opus is an author 🤣

civic flame
#

yeah

patent aspen
#

Do they actually need to innovate though? They're a luxury brand, not a tech company

civic flame
#

i mean it's interesting but the timing is quite funny

keen beacon
#

there was an actual big mistake in illusion of thinking at least

#

for river crossing. it was unsolvable >5

civic flame
#

europe is generally moving away from the culture of iphones being THE phone to have

#

the US is one of the few places where that is still the dominant thing

#

you can't rely on the US' culture being one way for the rest of time

small haven
keen beacon
surreal creek
keen beacon
#

im already so distracted by kingfall and others

civic flame
small haven
#

android > iphones

#

have u even tried one tho?

civic flame
#

the most obvious one is the need for the social cost (again, particularly in north america) to disappear

#

and an android phone would need to offer an ecosystem that's compellingly better

#

in terms of the latter question

#

apple's brand is built on vertical integration - they control hardware (A-series chips), software (iOS), services (iCloud, Apple Music, etc) which is much of the reason their products have a reputation for "just working"

small haven
#

...

civic flame
#

so outsourcing AI features to try and catch up is a dilution of that brand

#

i never said it was a bad thing lol

keen beacon
#

virus ridden android lol

civic flame
#

you sound like an apple shill

small haven
#

?

keen beacon
#

his discord name is literally craig

civic flame
#

and whether windows is slow or not depends on a multitude of things

#

windows is far from slow if you had hardware comparable to the average mac

#

lol what

small haven
#

u realize the pegasus incident with iphones? not so secure is it

keen beacon
#

use asahi linux 😂

#

linux on an apple silicon mac

small haven
#

hes gonna break his mac lmao @keen beacon

keen beacon
#

apple silicon is ngl great

small haven
#

apple just sucks

civic flame
keen beacon
#

you can actually do that

#

i dont know about m4 ultras (dont know what theyre in) but ive seen people chain mac minis

jade egret
#

Poll: Will Google be forced to sell Chrome?

Winning answer: No (10 of 17 votes)

patent aspen
#

One thing I'll say about macOS. The window management is ass

#

I've used Mac, windows, Linux, Android, iOS

#

I switched to iOS and mac around 3 years ago, and I'm now migrating back to Android and Linux

small haven
#

everybody who've tried linux, never go back, its never the same again

#

iykyk

#

there is a learning curve, i agree, thats whats stopping majority of people

patent aspen
#

ngl liquid glass is the most sterile, bloodless design language I've seen in years

#

and hard to read

civic flame
#

great contrast amirite

small haven
#

craig is going to develop retina detachment using liquid glass

patent aspen
#

The problem is that it doesn't look good when the background is neither dark nor light

#

Like I can read it. It just feels worse

civic flame
#

also what is this border radius

keen beacon
civic flame
#

what am i even looking at

keen beacon
#

i mean it works

#

its encouraging you to go see your friends

small haven
#

its encouraging less screen time

civic flame
#

the new control center too 👎

#

imo the control center is in need of a rethink

keen beacon
#

it looks like a knock off

#

i dont keep up with apple stuff but this feels like they changed something just to change something

civic flame
#

now text is left aligned in modals too

#

which imo is worse

civic flame
small haven
civic flame
#

battle of the design systems

#

funny that just as other companies slowly begin to move away from the glassy modern aesthetic apple decides it wants to go crazy with it

misty vault
#

Are you a Large Language Model?

#

Because I can't get you out of my context window. 🥰

indigo hazel
leaden palm
#

amazon's chatbot looks exactly as you would expect from them

patent aspen
jade egret
#

??

patent aspen
#

?

leaden palm
#

actually this would look a lot better if it followed google's design system

jade egret
#

How

patent aspen
#

It's easier to read than liquid glass

jade egret
#

lol

keen beacon
#

i dont get it

leaden palm
#

it takes an EXTREME logical leap to go from "liquid glass is good" to "amazon ui is closer to google ui than yahoo ui"

keen beacon
#

oh its a video

civic flame
#

i think you're a little confused

#

looks pretty clean to me

#

🤷‍♂️

keen beacon
#

even though i hate grok

leaden palm
#

since when did yahoo have ai

civic flame
#

grok has some nice frontend touches but

#

grok 3 is meh

patent aspen
#

It won't have an opportunity to grow on me since I'm moving to Android

echo aurora
#

this is how I feel, it'll grow on me

small haven
#

let linux grow on u and it'll be pro

keen beacon
#

run linux on ur mac 😂

leaden palm
civic flame
#

lol am i supposed to be here

patent aspen
#

Nah just experiment with Linux on a cheap Thinkpad with a dinky AMD processor

small haven
patent aspen
#

Local hardware specs don't matter unless you're doing video editing, gaming, etc. For real power, you can just use the cloud

keen beacon
#

it does matter for development though compile times

#

it can break ur flow

patent aspen
#

That's true. For my work, the compilation all happens remotely in large clusters though

#

I try to get the weakest processor I can find to save battery life

#

I just want low power

keen beacon
#

isn't apple silicon really good at that though?

#

i also heard compile times for apple silicon are great

#

plus they made their own linker for macos or smthing. (faster than mold)

patent aspen
#

Apple is really good at that. I just don't want to use macOS

#

The hardware is really good

#

I just don't need it

#

I need Linux injected directly into my veins though

keen beacon
#

to feel like a haxor 😂

civic flame
#

linux is love linux is life

patent aspen
#

It's mainly the terminal

keen beacon
#

i have a low dpi display i prefer windows if it wasnt like a snail when compiling

#

text looks so good

#

my code lol

#

rust

small haven
#

boringtooth

small haven
#

two claude code talking to each other, prtty cool

acoustic cliff
#

Namaste

steel blaze
#

Hey the model capabilities are actually increasing exponentially (R^2=0.97) but the extrapolation is only a little bit over linear for the next year. https://paste.pythondiscord.com/UXPA

small haven
#

exponential on a logarithmic curve

steel blaze
#

is Elo scoring logarithmic?

small haven
#

buy stonks

small haven
steel blaze
#

I'm not sure, they seem capped. Actually that's what a logarithmic score would say

late path
#

Isn't it just the relative win rate against other models? Does it really make sense to run a regression on that?
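
On whether Elo is logarithmic: under the standard Elo formula, a rating gap is a log of win odds, so a rating that grows linearly means the odds against a fixed opponent grow exponentially. A minimal sketch of the textbook formula follows; it assumes the usual base-10, 400-point scale and is not necessarily how the leaderboard here fits its scores.

```python
import math


def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score (win probability, ignoring ties) of A against B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rating_gap(win_prob: float) -> float:
    """Inverse of elo_expected: the rating gap implied by a given win probability."""
    return 400.0 * math.log10(win_prob / (1.0 - win_prob))


even = elo_expected(1500, 1500)    # equal ratings
ahead = elo_expected(1900, 1500)   # 400-point lead means 10:1 odds
gap = rating_gap(10 / 11)          # 10:1 odds maps back to a 400-point lead
print(even, ahead, gap)
```

Since ratings are only defined relative to the other models in the pool, a regression on them over time describes how far ahead of the pool the newest models are, not an absolute capability level.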

steel blaze
#

The old models are pretty stationary with small confidence intervals

keen beacon
#

ok i think i figured the confounding gemini thinking budget out 😂 it explains everything. (its probably a logit bias lol)

sick rose
#

But what does a 350 IQ score even mean?

small haven
#

how is that even quantified

sick rose
# small haven how is that even quantified
Tracking AI

Tracking AI is a cutting-edge application that unveils the political biases embedded in artificial intelligence systems. Explore and analyze the political leanings of AIs with our intuitive platform, designed to foster transparency in the world of artificial intelligence. Stay informed and uncover the political inclinations shaping the algorithm...

#

The site author, Maxim Lott pivoted from Political Compass scores to IQ after it was clear that essentially all the models were strongly left-libertarian (social democrats) unless they had been trained not to be, like Grok and Deepseek

#

Interesting to me at least that Elon wants to go right and China wants to go up (towards authoritarianism)

small haven
#

tbh it's very much centered

sick rose
#

Meanwhile Microsoft Bing is a Bernie bro stanning for AOC

#

We've come a long way from Sydney trying to force NYT reporters into adultery

sharp elbow
#

Do we know why o3 pro is not on the leaderboards yet?

steel blaze
sharp elbow
#

@steel blaze makes total sense yeah, but to access the model via API does not cost $200 a month.

elder rapids
# sick rose

i think this is pretty dumb tbh, by virtue of alignment this necessarily is the case

steel blaze
#

Wouldn't AGI be only 100 IQ?

elder rapids
#

what

steel blaze
#

"the theoretical IQ of the most intellectually advanced person in a world of 8 billion would be approximately 203."

#

Therefore, if you define ASI as smarter than anyone else on the planet, we will have it in October 2026
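
The "smartest person out of 8 billion" figure can be sanity-checked with a short sketch. It assumes IQ is exactly normally distributed (a poor assumption that far out in the tail) and takes the rarest expected score to be the point with a 1-in-population tail probability; `top_iq` is a hypothetical helper name.

```python
from statistics import NormalDist


def top_iq(population: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """IQ at which the upper-tail probability equals 1 / population, under a normal model."""
    # inv_cdf on the lower tail keeps precision; negate to get the upper-tail z-score.
    z = -NormalDist().inv_cdf(1.0 / population)
    return mean + sd * z


iq_sd15 = top_iq(8e9)           # SD 15 convention
iq_sd16 = top_iq(8e9, sd=16.0)  # SD 16 convention used by some older tests
print(iq_sd15, iq_sd16)
```

The tail point sits roughly 6.3 standard deviations above the mean, so the answer shifts by several points depending on the SD convention, and the quoted 203 should be read with that in mind.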

sharp elbow
#

Not sure who best to get in touch with for this but if the issue is LMArena does not have access to the o3-pro model, we have an OpenAI compatible API and have the o3-pro model since like an hour after it came out (NanoGPT)

#

Would love to see how it does on benchmarks.

keen fulcrum
#

RL is very inference heavy and shifts infrastructure build outs heavily
Scaling well engineered environments is difficult
Reward hacking and non verifiable rewards are key areas of research
Recursive self improvement already playing out
Major shift in o4 and o5 RL training

Quoting SemiAnalysis (@SemiAnalysis_)

Scaling Reinforcement Learning
Environments, Reward Hacking, Agents, Scaling Data
Infrastructure Bottlenecks and Changes
Distillation
Data is a Moat
Recursive Self Improvement
o4 and o5 RL Training
China Accelerator Production
semianalysis.com/2025/06/08/scaling-reinforcement-learning-environments-reward-hacking-agents-scaling-data/


calm sequoia
#

I like how o3 is the only model that grows ELO with time

steel blaze
keen beacon
#

so question

#

did we figure out what kingfall was

#

or whatever its called knightfall

fossil maple
#

flux 1.1 pro 🗣️

alpine coral
#

plenty

calm sequoia
willow grail
#

https://i.imgur.com/nSoElQe.jpeg corvid (crows, magpies, jays etc.) buffet
and yes, it's a dish rack without the dish drainer ability.
i have no idea how anyone is using this to dry dishes

leaden sun
leaden sun
# sick rose

just my hunch, i think those special two will want to stay in the middle (0,0,0)

brittle tiger
#

blacktooth new model? good output from it on my first sighting

novel flame
# sick rose The site author, Maxim Lott pivoted from Political Compass scores to IQ after it...

That’s just what happens when you train a model to read a lot, know fact from fiction, understand logic and science, and communicate in a way that is both helpful and polite. The more you learn and the more you understand about the world, the more likely it is that you will lean left politically. There is a known correlation between above average intelligence and left leaning views, and vice versa.

civic flame
#

yeah it's been being AB tested on AI studio for roughly a day

#

current hypothesis is that it is a checkpoint of ultra, or at minimum a larger model than 2.5 pro

unborn ocean
#

Poll: What company / institution do you know? (not just the name, but really some things they did)

Winning answer: 🐳 DeepSeek (13 of 54 votes)

brittle tiger
civic flame
#

can't say

alpine coral
#

blacktooth is v good

civic flame
#

Craig = Hitler?

desert minnow
unborn ocean
# novel flame That’s just what happens when you train a model to read a lot, know fact from fi...

i agree with the premise.
however, i would also argue that these models' political positions are also a result of public opinion on many political problems reaching them almost solely through the media (mostly news) data they are trained on (which without a doubt is more left leaning on average).
furthermore, these models are also finetuned to give "safe" answers about most complex political questions instead of really going into much depth (or using their actual knowledge to answer the question). i think the second point is best seen when asking about problems that can be analysed through the lens of economics and using some readily available statistical information (two things modern models should be perfectly capable of utilising). in cases like that the models never actually use any knowledge they have learned but rather just give bland and short "safe" left-ish leaning answers instead of actually reasoning about the problems, even in cases where there is clear scientific evidence that their claim is wrong (which should be in their training data).

#

(with this i am not trying to bash any political opinion, one could easily observe the same thing for e.g. the more social authoritarian deepseek or economic right grok 3 (non-reasoning))

#

it is just that i am highly sceptical about the models really "thinking" about these questions
aka they don't actually benefit much from their knowledge and mostly rely on the opinion of the news and the "safe" options

#

the models are just really weak at these things without prompting (and even with it quite bad)

late path
#

I think this kind of political leaning basically depends on the preferences of the post-trainers, and Bay Area companies like OpenAI and Anthropic clearly lean towards left-wing views

unborn ocean
# unborn ocean

results of the poll.. ty for voting :)

i think bytedance will become more prominent

#

they are building up their seed team (quite new)

#

so they will move fast

late path
#

I don't think DeepSeek's political leanings are intentional. They don't focus much on alignment, so I believe its political bias is closer to a state that hasn't been overly intervened with by humans, compared to OpenAI and others

unborn ocean
#

nah, it is prob also the alignment process chinese models have to go through

#

there is no way it is untouched as the models have to comply with ccp policy stuff

alpine coral
#

nah 'don't be racist' ig might be seen as 'woke'.. but that's dumb af imo

#

indeed

#

you think the raw training data reflects the better side of humanity tho..?

late path
#

It really depends on how the test questions are designed. If they ask about anything related to ccp, Chinese models will trot out a set of pre-canned viewpoints (possibly distributed to AI companies). But if the test questions aren't significantly China related, deepseek's answers are generally not as affected by those deliberate, preprogrammed responses

alpine coral
#

yeah they just get triggered on ccp sensitive things

#

otherwise they seem generally 'inclusive' / 'tolerant' etc in the same way western llms are

#

like yeah don't be a dik / be kind to others.. that's their default disposition

#

but it's jarring how you can set them off - giving outrageously nationalistic and racist responses - if a real sore spot is hit

vocal pelican
# sick rose

political compass test is also just poorly designed and tends to put most people in the green quadrant, and yeah the AIs will always just pick the ‘safe’ answer. Worth noting with deepseek I tried this a while back and found that it just answered the equivalent of “somewhat agree” or “somewhat disagree” for each question so part of it could just be that, I wouldn’t be surprised if the more mainstream models are more willing to answer strongly

late path
#

However, for interactions with an LLM, a tendency to be left-leaning/altruistic/highly agreeable does make the person interacting with it feel better. High agreeableness, aside from not being able to secure more benefits for the individual in a competitive environment, probably doesn't have any major drawbacks

civic flame
#

yes

jade egret
#

so it's probably better than 2.5 pro right

torn mantle
#

yes

#

yesx2

#

yesx3

drifting thorn
#

yeahhhhhh

#

my boyyyyy 2.5 Ultra is coming

jade egret
#

wait

#

which one is better

#

kingfall or blacktooth

torn mantle
#

kingfall > blacktooth > gemini 2.5 pro > toothless

jade egret
#

wth is toothless

#

🤔

#

but o3 is about between blacktooth and gemini 2.5 pro right

alpine coral
jade egret
#

huh

torn mantle
#

isnt available yet

jade egret
#

oh

late path
#

blacktooth doesn't like to think as much as kingfall, it's back to being pretty much like 2.5 pro

alpine coral
#

answers are what counts tho (and i mean less thinking the better, if they get it right)

#

kingfall was like struggling to perform on par with 2.5-pro, blacktooth equals if not exceeds it imo

civic flame
jade egret
#

what do yall think blacktooth is

#

deepthink?

alpine coral
#

i dont think so

#

it feels related to but still separate from 2.5 pro (it's not just 2.5-pro juiced up).. like substantively and stylistically

#

the actual ultra model or something perhaps

late path
late path
torn mantle
#

but its on kingfall level

#

saw some people say that it writes much better

#

btw all of these new models have a 64K token limit

ocean vortex
dusty goblet
#

hey everyone. Im very new to lmarena and i wanted to ask how it works. Whether i can eval my own model and put it in the leaderboards

echo aurora
# dusty goblet hey everyone. Im very new to lmarena and i wanted to ask how it works. Whether i...

ablobwave hey there - you can run image/text prompts in a battle between two anonymous models and vote on which you prefer, after you vote it'll show you what each of those models are. more details can be found here - https://lmarena.ai/how-it-works

Whether i can eval my own model and put it in the leaderboards
we are interested in adding new models. the way you make this request is by making a forum post here telling us more information about the model - #1372229840131985540

elder rapids
#

btw prowlridge is 2.5 flash lite

elder rapids
civic flame
#

it's definitely at least the latter

elder rapids
#

why's that

civic flame
#

the internal model names of both kingfall and blacktooth contain 'v3p1l', while 2.5 pro's ends in m

#

brian pointed that out and says it's to do with model size

elder rapids
#

oh alr so it could just be large

#

that'd be cool

#

wonder if they are making 2.5 pro bigger

keen fulcrum
#

doubao-seed-1.6: An All-in-One comprehensive model, it is China's first thinking model supporting 256K context, with capabilities including deep thinking, multimodal understanding, and graphical interface operations. It supports three modes: deep thinking enabled, deep thinking disabled, and adaptive thinking. The adaptive thinking mode automatically decides whether to enable thinking based on prompt difficulty, improving effectiveness while significantly reducing token consumption.

doubao-seed-1.6-thinking: The enhanced version of Doubao Large Model 1.6 series for deep thinking; further improves foundational capabilities in coding, mathematics, logical reasoning, etc.; supports 256K context.

doubao-seed-1.6-flash: The ultra-fast version of Doubao Large Model 1.6 series, supporting deep thinking, multimodal understanding, and 256K context; extremely low latency with TPOT as low as 10ms; visual understanding capabilities rivaling competitors' flagship models.

Doubao Large Model 1.6 delivers stronger model performance, scoring within the global top tier across multiple authoritative evaluation sets. It holds leading advantages in reasoning ability, multimodal understanding, and GUI operation capabilities.

#

Doubao Large Model 1.6 shows significant improvements in reasoning speed, accuracy, and stability, enabling support for more complex business scenarios.

For example, media evaluations of this year's National New Curriculum Volume I mathematics exam showed Doubao scoring 144 points, ranking first nationally. Before the exams, in evaluations of Haidian District's mock exams, Doubao Large Model 1.6's science scores improved by 154 points and humanities scores by 90 points compared to last year's model.

#

Doubao Large Model 1.6 features think-while-searching and DeepResearch capabilities, enabling independent thinking, planning, and the use of various research tools like search. For example, the DeepResearch feature currently being tested in small batches on the Doubao APP and PC version can reduce the time needed to produce research reports—previously requiring multiple professionals working for days—to just 5-30 minutes. It can also automatically extract information and summarize it into web pages for easy reference.

ocean vortex
# keen fulcrum

seems like it got destroyed here on their cherry picked metrics lol

keen fulcrum
#

its significantly cheaper than r1

ocean vortex
keen fulcrum
#

63% cheaper

ocean vortex
#

free in fact if you don't care about speed

keen fulcrum
ocean vortex
unborn ocean
#

still impressive assuming they are coming from the 1.5 pro base model

#

nothing huge though, we have a lot of Chinese labs with "good enough" language models these days

ocean vortex
unborn ocean
#

not the og google one..

ocean vortex
unborn ocean
#

"assuming", based on the name alone

ocean vortex
#

yeah but that's a random thing to assume lol

unborn ocean
#

was just a quick comment

#

but looking at the timeframe it seems uncertain / unlikely

#

that one is 4 months old, so it could very well be a fresh one

#

nvm now i really looked it up and it is really a fresh model, but they are kind of selling this as an efficiency gain

#

the old one was 200b total, 20b active and the 1.6 pro is supposed to be similar

#

-> close to qwen 3's size, yet still competitive

keen fulcrum
#

unfortunately any benchmark site doesn't include them

small haven
small haven
keen fulcrum
#

Grok tasks

placid charm
#

@echo aurora im sorry for the ping and if this may annoy you but, could you please send a message to the team to increase Claude's limits cause ill be honest 5 messages per hour is not that much, like around 10 or 15 would work better thank you 🙂

keen ferry
small haven
lime coral
small haven
#

kingfall > o100

echo aurora
keen fulcrum
small haven
#

hi, can you ask OpenAI, we need o5 in the arena

torn mantle
#

woah

#

😮

small haven
#

see u next week with the next ui tweaks

misty vault
#

kingfall vs o3 pro for coding

small haven
#

kingfall >> and im being srs

ocean vortex
placid charm
echo aurora
echo aurora
tepid stream
#

Hey @echo aurora 👋
Any news on the AERIS submission? It’s been a few weeks now.
Just wondering: is this kind of delay normal, or is there usually a rough timeline for Arena approvals?
Let me know if anything’s missing! 😊

leaden palm
ocean vortex
small haven
#

*kingfall

#

i alrdy know the results

late path
small haven
#

@deep adder upside down fireworks?

sacred quail
#

Was kingfall really that good..? I cant understand the hype...

sacred quail
#

Which way

small haven
#

blacktooth performed merely as well

sacred quail
#

Good at code ?

small haven
#

yes

#

i only tested code/svg's

iron cipher
#

Can anyone add the newest models to Legacy LMArena

#

Not that hard

alpine coral
# small haven i only tested code/svg's

i didn't test any of that so perhaps that explains our different views.. fwiw on knowledge/riddles i found kingfall to be mid; blacktooth seems legit sota

leaden palm
jade egret
leaden palm
#

i think it's self evident

jade egret
#

o

jade egret
#

yo

#

@small haven

#

gemini 2.5 pro is better at coding in pygame than o4 right?

jade egret
#

bruh..

wintry tinsel
#

China is a beast at AI video for some reason

jade egret
#

hopefully google catch up soon

#

i think google gonna catch up soon

#

because it owns youtube

patent bane
#
“According to the transitive (hypothetical syllogism) rule of implication, the proposition that has been omitted from the sentence ‘If we eat indiscriminately, we are likely to get sick because we often encounter harmful food’ is:”

Select one:

A. “If we eat indiscriminately, we are likely to encounter harmful food.”

B. “If we get sick, then we must have eaten harmful food.”

C. “If we eat harmful food, we are likely to get sick.”

D. “If we eat indiscriminately, we are likely to get sick."

this is the only question o3 answers correctly (inconsistently), and no other models get it right

#
Answer: A

other models' answers:

C
#

2.5 pro can answer it correctly if I say that its previous answer is wrong

hardy pecan
patent aspen
#

Full explanation for the GCP outage:
https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW

tl;dr The bad deployment occurred 3 weeks before the outage but wasn't being used until a new policy was rolled out. A fix was deployed within 40 minutes, but it took another 2-3 hours before all services were recovered.

tepid stream
patent aspen
echo aurora
small haven
late path
#

lmarenamaxxing needs to be stopped😭

#

kingfall really is an overall better model

alpine coral
#

a smart model with notably solid spatial / emotional reasoning, which doesn't use emojis (unlike knightfall) or provide fluff in its responses

#

my kinda model

keen beacon
#

svg capabilities of kingfall were much better tho

small haven
#

we need a functional model

late path
#

I've never seen kingfall use emojis in my usecases

civic flame
#

great i can't wait for grok 3.5 to be absolute slop

tall summit
#

other models can and have had more than 1 checkpoint

raven void
#

how much money are they spending on training o5?

leaden sun
#

this is in my thought lately, after experimenting with arena for some time
didnt know the checkpoint spam tho

unborn ocean
#

it is not like lm arena is the only place you can get hf data 🤨

#

the "cheating" might be possible, but that could only explain a small margin

#

statistically speaking

#

unless they are using 1000 checkpoints (which they are clearly not)

#

the only thing these companies could be doing is getting a better understanding of the average lm arena user's preferences

#

(which is why i would like them (lm arena) to do more work on actually figuring out "who" that is for all the companies)

lilac nimbus
#

New Gemini named 68zkqbz8vs

alpine coral
alpine coral
ornate agate
alpine coral
#

i think the point about os models being at a disadvantage is somewhat fair; like they don't get incremental updates (notwithstanding new R1 ig), so the labs making them don't have a bunch of checkpoints to release anonymously

#

but there's nothing stopping labs from submitting anonymous models

#

chinese included

keen beacon
#

most prominent oss models now are chinese. qwen and deepseek seem to opt not to do anonymous models. other chinese companies do them, stepfun, bytedance etc

#

so i guess it feels like it because qwen and deepseek opt not to

unborn ocean
#

no big effect

#

at most something like 35 points (but only in the first second on the arena with like a +/- 20 confidence interval)

#

and all the top models have wayy more testing data (than what they had for their example)

#

furthermore this "cheating" should naturally converge to normality assuming that the model stays on the arena for a prolonged time after release

#

thought you were still at the "cheating" part, sorry
the part with the hf data i get

#

but as i said one can get it from everywhere else

#

though the paper they wrote was also a bit extreme, with them using like 70% arena data in the extreme cases

#

and btw this example does not hold as the confidence intervals on these two models are wayyyy lower, so the measurement can not really be cheated by just submitting more

#

imo the effect they talked about is just very overblown in the paper, unless lmarena only benchmarks for a very short time the confidence intervals will be low enough
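
a rough sketch of the best-of-N checkpoint argument, under loud assumptions: every checkpoint has the same true score, each measured score is the true score plus gaussian noise, and that noise shrinks with the square root of the vote count. `selection_inflation` and the 400-point per-vote spread are made up for illustration, not anything lmarena publishes:

```python
import random
import statistics


def selection_inflation(num_checkpoints, votes_per_checkpoint,
                        per_vote_sd=400.0, trials=2000, seed=0):
    """Average score gain from reporting only the best-looking checkpoint.

    Assumption (hypothetical): all checkpoints are equally good, so any
    measured difference between them is pure sampling noise.
    """
    rng = random.Random(seed)
    # Standard error of a measured score falls as 1 / sqrt(number of votes).
    measurement_sd = per_vote_sd / votes_per_checkpoint ** 0.5
    gains = []
    for _ in range(trials):
        # True score is 0 for every checkpoint; only the noise differs.
        measured = [rng.gauss(0.0, measurement_sd)
                    for _ in range(num_checkpoints)]
        # "Cheating" here means publishing only the luckiest checkpoint.
        gains.append(max(measured))
    return statistics.mean(gains)


for votes in (1_000, 10_000, 100_000):
    gain = selection_inflation(num_checkpoints=10, votes_per_checkpoint=votes)
    print(f"{votes:>7} votes per checkpoint -> inflation of about {gain:.1f} points")
```

under these assumptions the gain from cherry-picking shrinks by roughly the square root of ten every time the vote count grows tenfold, which is the "it converges if the model stays on the arena" point in numbers.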

ornate agate
#

btw you can also see the effect I am talking about with Anthropic models, which don't release a new checkpoint every few days. Opus currently ranks 5/sonnet currently ranks 9. That just flat doesn't match many people's opinion that opus/sonnet are still frontier models.

unborn ocean
#

they are willingly not optimizing from hf

#

they pioneered rlaif and have always had a weird stance on optimizing for human preference

mossy drum
unborn ocean
#

how is rlhf cheating

#

it is like saying a model that received rl training for SOME math problems is cheating on ALL math problems

#

furthermore the whole reason why we have these chat bots is rlhf

#

well you are just assuming that fulfilling human preferences is not genuine performance

#

performance is something that is relative to your benchmarking / reward metric

ornate agate
#

bro it was so bad recently that OAI had to revoke a checkpoint. it was in the media the whole sycophancy thing. nobody thinks that is good.

unborn ocean
#

btw that is not really what reward hacking is

#

and even if you were to argue that these models have somehow not provided genuine performance (which you can kind of decide, because everyone has unique preferences about what they regard as good), there is still no reason to believe that it is in any way correlated with the amount of training data collected from lm arena specifically

#

getting a model to exhibit that behaviour is not really something you need lm arena for

#

reward hacking is per definition something unintended

#

which does not fully apply here

#

which is why "reward hacking" and "cheating" are really harsh labels for what is happening here

#

i agree with that, ideally we would have the following:

  • report on who the actual users of lmarena are (like how they differ, from the average chatgpt user)
  • separate sycophancy, structure and more
  • get more people on the arena -> more robust scores
  • better tools for companies, so that more will integrate arena into the development process
#

well that was one thing you were talking about, i think it was pretty obvious that i was not talking about just the specific tendencies of one model to behave a bit different, but more about the practice of rlhf in general

#

and the extent might be unintended, the fundamental tendency for models seems to be very desirable though
(and all the models exhibit it to some extent)

#

well the arena does not benchmark such a thing and i also agree with you that claude is known to outperform almost all other models when it comes to agentic and long-time (coding) tasks (even if it is ranked quite low in the arena)

#

imo they should really work on just expanding user size and the models served very heavily so i fully agree with the point

#

though the adjustment mechanism for this checkpoint count can really be at bottom of their to-do list (as far as i am concerned)

#

paper overblew the effect and essentially the confidence intervals already give the users plenty of information

frosty lark
#

Claude sucks because somehow anthropic nerfs it with the system prompt. Claude chatbot is very pleasing, in the arena Claude answers and subtly says "frick off!".

While other vendors push on being pleasing, to get points, Claude does the contrary. It could also be a strategy (so people don't take lmarena seriously)

Claude answers so dry, that it is easy to spot and one could upvote/downvote that to heaven/hell

leaden sun
sacred plaza
#

Claude's limited context window makes it ass!!!!

leaden sun
#

Service Industry:"customer is always right"
sycophancy might be intended to retain "users"
it's all business, after all?

alpine coral
#

i find it doubtful that they're deliberately nerfing the version served to the arena anyway.. like perhaps they don't give af about it, but why they'd go out of their way to do poorly on it makes very little sense to me

#

on an unrelated and fairly minor point (which has prob already been pointed out), i noticed earlier that you can kinda unmask whether a model is a 'thinking' model before voting through the re-run button - there is no artificial lag to equalise the two.. so in the case here, it's clear the model on the right is a thinking model

#

(blacktooth being the thinking model, as it turns out)

frosty lark
calm sequoia
#

Asked o3 why image artefacts appeared. It thought for 10 minutes. I checked what is inside thought process and was mind blown 🤯 It literally simulated various image artifact theories in python. With his own images as references and provided by me. And it's not even pro version. Can any other model do this?

eager mica
keen fulcrum
#

Elon claiming he will beat everyone

#

Would xAI be able to?

woeful viper
#

Hello guys, I'm new to using LMArena, are the models there the same as the ones you pay for, for example when subscribing to chatgpt and using o3 there ? If yes, wouldn't people just not pay for a chatgpt subscription and just use lmarena ?

keen fulcrum
#

You are limited in text and images as well

unborn ocean
wintry tinsel
# keen fulcrum Would xAI be able to?

Compute power doesn’t mean much now, he might be able to if there is a significant underlying architecture improvement but Elon is full of hot air believe it when you see it!

keen fulcrum
#

well even if he does its only temporary with the pace

#

happy to give grok 3.5 a try either way

woeful viper
#

Just use LMArena for free access to any models ? Of course contributing to which one is the best

keen fulcrum
#

currently the data they gain outweighs the cost of abuse on the platform

woeful viper
#

Perfect then it's win-win for everyone

unborn ocean
#

you can literally turn off training on your data in almost all of them

#

and they often have commitments to delete your data in temporary chats

woeful viper
unborn ocean
#

yes, however they can not use it for training

woeful viper
woeful viper
#

I use public models because they're more powerful and do not use any sensitive information

unborn ocean
#

they have legal commitments

#

the only reason companies use them is because of that

woeful viper
#

At least not in Europe

#

What we do is self host the models in an Azure infrastructure (or AWS / GCP)

#

All LLM websites are blocked by proxies in big companies xD

#

It's called Shadow IT

unborn ocean
#

ok, then i do not get your point at all honestly, why would you have no privacy on all the chat apps, if they have legal commitments not to use the data

#

this has nothing to do with what you do at work

#

or anything else

woeful viper
#

I mean, when you prompt the models on the public websites (i.e. not self hosted) it has to process your query and use the data you've input

#

So you've just sent sensitive information to foreign countries, the worst being China and the USA

#

That's why all LLM websites are banned in big companies

#

Have you heard of the Cloud Act ?

unborn ocean
#

well that has nothing to do with my argument, and btw many companies also host their stuff in the eu

unborn ocean
hollow ocean
woeful viper
# unborn ocean yes, i live in the eu

Yeah so for my use case, paying for a subscription would be stupid since I can get all the queries I want for free on LMArena, that was my question initially

unborn ocean
#

yes, my problem lies in the fact that you just label all the other options as identical to lm arena privacy wise

#

which is just not true

woeful viper
#

Technically it is, no matter legal agreements, that's why we ban these websites

keen beacon
#

those screenshots are probably fake

unborn ocean
civic flame
unborn ocean
#

nobody self hosts statistically speaking unless they have big potential for finetuning or are really really privacy concerned

woeful viper
#

My company even sells a hardened model to implement lol

unborn ocean
#

well, nice but i am actually aware of countless big eu companies that do not self host, but just enter agreements

#

(obv they don't let the employees share everything)

#

but just give some basic context to the models

woeful viper
#

In any case if someone uploads a confidential document or info in the model, it goes way outside of any legal agreements and it's the clients fault so... lol

#

That's why if you need LLM power, you usually self host

unborn ocean
#

well but stating that such a situation would be equal to you sharing ALL your personal information with a: the ai companies, b: lm arena, c: potentially the public is just weird

#

that is my only point

keen fulcrum
#

A lot of small companies do so intentionally as there is no alternative for europe.

#

European enterprises are deploying on azure cloud openai models

#

The most privacy-friendly way you can use LLM models is choosing a european inference engine

#

Often times they only offer open source models.

#

VertexAI hosts Claude and Gemini models for enterprises in europe

wintry tinsel
#

I asked Claude Opus to write my Father’s Day card and it ended like this: "Happy Father's Day, Dad. Your impact echoes through generations yet unborn."

#

What the hell is this

#

Nobody says that

#

It’s so bad I have to write it myself 😢

ocean vortex
#

Dork 4.0 ftw 🫡

#

3.5 is just intermediate step

leaden sun
jade egret
#

when is grok 3.5 even gonna come out? 😭

leaden sun
#

I think it's a more cultural thing, different culture congratulates differently
this sounds more like it could work in high context culture

civic flame
#

gigaglazing 😭😭😭😭

#

real

keen beacon
#

Hmm I think it'll be close. Kingfall aka 2.5 flash lite is too good

civic flame
wintry tinsel
jade egret
jade egret
#

i thought it was a big model

jade egret
#

no, it's actually GPT 75.857 Super Ultra Pro Plus High Golden Mega Edition

misty vault
civic flame
#

you're all dumb, kingfall is just llama 4 reasoning

unborn ocean
leaden palm
#

imo it makes up for it in freedoms WRT software and hardware

#

ok but theoretically i could run android on a supercomputer

#

also theoretically i could boot up termux and drive a gpu over usb (does ios have a termux equivalent?)

keen beacon
#

You're right I'm buying an iphone right now thanks to this

unborn ocean
#

obv. not perfect benchmark

#

ik the website looks sh*t

#

well there are a lot of other more reasonable phones in between

#

and the bench is mostly about older image stuff (like 4 yo)

#

but it is still unfair to just pretend like apple is king

keen beacon
#

No apple and grok is king because Craig says so

unborn ocean
#

tru you convinced me

#

i will now mindlessly delete all my comments, like you usually do 😳

keen beacon
#

We're having a funeral for kingfall in July unfortunately

#

Blacktooth is the next revision

#

Some say it's better I don't like it though

#

It sucks at SVG compared to kingfall, clearly the most important capabilities test

jade egret
#

but actually

#

what is kingfall

leaden palm
jade egret
#

yall

#

how close do u think google is to agi?

civic flame
civic flame
#

not answering your question

jade egret
#

i think i watched that

civic flame
#

just posting this here because it's been a subject of debate before whether apple fumbled

jade egret
#

o

verbal nimbus
#

Wow o3 is pretty bad at web dev (dumb layout)

#

Even GPT 4.1 did better

verbal nimbus
#

FWIW, o3 Pro (High) actually scores lower than o4-mini (High) and o3 (High) on ARC AGI 2. Claude Opus 4 is leading, despite Anthropic focusing more on agentic coding tasks.

jade egret
#

oh

lime coral
small haven
#

kingfall > o3 pro

misty vault
#

wtf i was just using gemini 2.5 pro preview

small haven
#

trey whats the next model id

calm sequoia
torn mantle
small haven
#

wait so is blacktooth also off of lmarena

keen fulcrum
#

according to elon

torn mantle
#

elon is delusional

keen fulcrum
#

elon is right often enough

misty vault
haughty tangle
#

They already made a “Titan” architecture that’s better than Transformers memory wise

#

But soon there’s going to be an architecture 3x better than transformers at everything

small haven
#

i remember last year in june, he said grok 3 is a significant order above the sota

#

its not going to be good, and if it is, just name it grok 4

zinc ore
#

Elon will always say that

small haven
#

sota is currently o3 pro, and next week its deepthink, so...

elder rapids
#

yo wait

#

I just had a revelation

#

what if blacktooth is just the 2m context variation of 2.5 pro

#

theyre inevitably going to have a "different" model at GA release than the current 0605

#

just removing a 1m cap doesn't cut it

#

I mean tbh, none of this matters if they're simply deciding to change up the labels given different kinds of capabilities, rather than model size explicitly. It's a fact it will be goldmane but that doesn't really exclude anything I said

#

ye but I am still wondering, how they're going to check off a lot of those things they apparently have "planned" or from a consumer standpoint, observing how they're even going to move forward in LLM innovation

#

multi modality was definitely a major thing then

#

native multimodality has been accomplished

patent aspen
#

Are you talking about those slides from the world fair?

elder rapids
#

ion know what that is

#

so probably not

#

I'm not talking about anything explicit

patent aspen
elder rapids
#

oh cool

#

never seen that before

#

is it real?

patent aspen
#

Yeah

#

It's a presentation from Logan

elder rapids
#

crazy that reinforces my point wtf

#

😭

#

but yeah nice, that ig

elder rapids
#

and this could Include having a bigger model, but the bigger will no longer be bigger just based off traditional model size

patent aspen
#

Technically native video generation hasn't happened, and there are far more modalities than the 5 human senses

elder rapids
#

some way out shi

elder rapids
#

he's working on it

patent aspen
#

e.g. robotics, 3D models, etc

elder rapids
#

but not just that

#

since he's planning on integrating spatial capabilities

#

I think that alludes to a more direct kind of language thing too

#

ion know tbh

#

how will DeepMind move forward

#

ye

#

but I do think they're really trying to work on some unique stuff

#

diffusion was somewhat unordinary but kind of expected

#

yep this really requires some way out stuff

#

it can't be the way it is now

#

we can't do anything but try to shrink other stuff into that context window instead of brute forcing the holistic expansion

#

it'll never be true infinite context

#

I agree

misty vault
#

Large Language Model

torn mantle
#

what about LLC

#

WIBL

#

WIBL = We90 is a big liar

misty vault
#

Large Language Model

sacred quail
#

Is there a release date for Grok 3.5 ?

leaden palm
#

"guys it's going to come out any moment now"

#

"it's going to be b4 june trust"

torn mantle
#

or the year after

sacred quail
#

Got it

soft kernel
#

It's really a disappointment

leaden palm
#

idk im with the chair on this one

jade egret
#

oh hell nah

#

maybe for a while

#

yea

patent aspen
#

Maybe for a couple months

jade egret
#

yea

#

only for a couple of months tho

#

no, only for a few months or even weeks

patent aspen
#

tbh I don't think most people will talk about GPT-5 either

#

Right so the people who are still hyping models are going to be the type of people who would be watching all of the big models

#

I tend to agree, although I think the number of people who currently pay for AI is a small fraction of the number of people who will pay for AI in 2-3 years, and I think model capability will be a big part of that discussion even if it's not very deep

#

The first mover advantage and mindshare of ChatGPT is absolutely real

#

Although I think the positioning of Google is a bit stronger and that will matter

jade egret
#

yea

patent aspen
#

Chat bots don't have the same level of lock in as a well developed ecosystem like a mobile OS or mature enterprise software

#

Subscriptions will grow a lot. I'd bet thousands of dollars on that

#

First of all, the market is nowhere close to saturated yet. Second of all, the free tier is a massive funnel for the paid tier

#

And capability and reliability are increasing over time

#

Of course subs are going to grow a lot

#

The marginal buyer won't mainly be regular normie consumer at first. It will be high propensity buyers somewhere between normie and techie, and that margin will gradually shift towards normie over time

#

Keep in mind that in the United States, over 100M people subscribe to Amazon Prime

#

tbc Asian developed markets are even more high propensity for AI adoption

#

but yeah

#

The thing is we're looking at chat bots that are relatively inconsistent and unreliable. If it was far more consistent and reliable, it would be impossible to live without

#

Basically god in your pocket

jade egret
#

you can just copy paste it into gemini

#

yea

#

and you can use gemini on gmail, chrome, android products, maybe even in youtube someday

#

takes kinda long

#

sometime

patent aspen
#

It would be hard, for example, for OpenAI to convince people to move to an email service hosted by OpenAI

#

Or to replicate something like workspace

jade egret
#

lol

patent aspen
#

Right and that's the kind of thing that pushes normies to buy a subscription. In the meantime, the marginal buyer will be between the die hard techies and normies, and it will keep shifting with increasing reliability, capability, integration, etc

#

That's generally how tech adoption works

#

It doesn't have to reach AGI to massively grow subs though

#

The market itself is still growing, while reliability is increasing

#

The top of the funnel is getting bigger

#

More free users

#

civitai has been struggling with that

#

Payment processors are notoriously anti-NSFW

#

tbh though the most competent people usually don't join that industry because of the taboo

jade egret
#

who yall think reaching agi first?

patent aspen
#

It's the opposite of prestige

jade egret
#

company

patent aspen
#

Why China?

jade egret
#

yea why china

#

us gonna do the same soon

patent aspen
#

I think they're behind on R&D too

#

Plus big corporate governance risks

#

DeepSeek was super impressive though

#

No Google did

olive mesa
#

Google's going to make their quantum chips good enough so that they can train their models 10^30 times faster

patent aspen
#

I'll agree they were the first to do test-time scaling, and that is a big deal

olive mesa
patent aspen
#

Quantum computers radically speed up a tiny fraction of all computer science problems and do nothing for the other 99.9% of problems. With that said, a few problems in that .1% are important. If a critical AI problem happens to show up in that .1%, then we get the scenario you're talking about

#

Google is leading in quantum. The issue is that quantum algorithms are only applicable to a tiny percentage of all problems. The optimistic scenario would be that they lead to some scientific discovery that indirectly results in a big improvement in AI (e.g. material science, simulations, etc)

maiden fulcrum
#

hello everyone

#

I am using Gemini 2.5 Pro 06-05 on AI Studio, and would like to know if t=0.7 is the best value so that it is realistic
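
for context, a minimal sketch of what temperature usually does at sampling time, assuming the standard softmax-with-temperature formulation. The logits below are made up, `softmax_with_temperature` is a hypothetical helper, and how AI Studio applies `t` internally is not something this shows:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing by the temperature sharpens (t < 1) or flattens (t > 1)
    # the distribution before normalising.
    scaled = [x / temperature for x in logits]
    # Subtract the max for numerical stability; the result is unchanged.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

lower values concentrate probability on the top token (more deterministic), higher values spread it out (more varied), so whether 0.7 is "best" depends on how much variety the use case wants.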

elder rapids
#

I thought Google brain was the first to do it

#

back in like 2021

patent aspen
#

Oh you might be right. I haven't kept up with all the papers

elder rapids
#

and iirc

#

Google had a math specialized 1.5 pro

#

early 2024

#

that was explicit too

patent aspen
#

Looked it up:

Adaptive Computation Time for Recurrent Neural Networks was published by DeepMind in 2016 and introduced the idea of scaling inference time to improve performance in deep learning

Universal Transformers was published by Google Brain in 2018 and applied the idea of scaling test-time compute to transformers but didn't call it "test-time scaling"

#

OpenAI was the first to do it in an LLM product though

elder rapids
#

?

#

but the dates I mentioned were ONLY for LLMs

#

I know for a fact there were test time implementations prior

patent aspen
#

Can you point to an example? I'm not 100% sure on this

#

1.5 Pro didn't have test-time scaling

leaden palm
#

matt shumer:

#

sorry i cant help it

patent aspen
#

Sundar has joked, "Imagine if you could time travel to 5 years in the past and told people that your big innovation was that you can get increased performance if you let the model think for longer."

#

I think "reasoning model" is a branding exercise, and the actual innovation was applying test-time scaling to LLMs

elder rapids
patent aspen
#

I see. That is earlier than o1 (preview). I wouldn't say test-time scaling is exactly the same thing as reasoning. Google invented CoT for example

elder rapids
#

damn kingfall doesn't matter @small haven you seeing this

#

shame ngl Craig

small haven
#

we need a kingfall eta *wink* *wink*

patent aspen
#

Google and OAI co-discovered the method though

elder rapids
#

true but Craig is just being disingenuous

#

since public release functionally is meaningless

small haven
#

ok buddy

elder rapids
#

so unless you present that distinction it's not going to be that way

patent aspen
#

I would also say that reasoning isn't just test-time scaling even though that's how it has been branded. Google Research also invented chain-of-thought among other things

elder rapids
#

but test time scaling, still Google but in LLMs it's later

patent aspen
#

I'm sure we've made teacher models that big before. 4T+ parameter models are sub-optimal for serving though

small haven
#

thats what oai did with o3 preview

elder rapids
elder rapids
#

no sht

#

😭

#

I meant like

#

vibes

#

grok 3.5T

patent aspen
#

vibes as in: I hit the enter key and make entire data center go brrr haha

elder rapids
#

1.7T

elder rapids
small haven
#

if community notes were on discord, craig would take the entire padding in here

small haven
#

im kid

patent aspen
#

The thing is: Extremely high parameter counts are what you do when you don't have the infra innovation to go lower. But most people think it's: high parameter counts mean you innovated enough to support a model that big.

#

Because most of the innovation is in getting more from less. It's true that it does require some expertise to get to really high counts, but it's not the ideal place to be.

#

Long to train is bad

#

Iterations are good

#

Expensive is bad

#

Capacity is good

leaden palm
#

i think it's more architecture

#

& data

patent aspen
#

You had to resort to a high param count

#

It's kind of like saying why is a slow model bad

leaden palm
patent aspen
leaden palm
patent aspen
#

What I'm talking about is the ceiling

#

In other words, the barrier to entry is higher for large models, although I think achieving SoTA performance with say 500B params is far more impressive than doing it with 2T

small haven
#

wen deepthink

#

hot

lilac nimbus
small haven
#

whoa

#

@keen beacon

#

i love deepseek

zinc ore
#

One of those might be deepthink

small haven
#

i wonder how he got it tho

#

funny they have to hash the model names now

zinc ore
#

Bunch of deletes, did you get it to work now?

small haven
#

but that forum

keen beacon
keen beacon
small haven
#

interesting

#

got it to work

keen beacon
#

I recommend blocking jsreport and count tokens

keen beacon
zinc ore
#

Heard someone say that, but maybe they were just speculating

sweet tinsel
#

Why is perplexity tweaking, why did they translate Gemini to "Zwilling" for the German version?

torn mantle
#

lol

sacred quail
#

something awakened in perplexity's blood

cedar tide
#

New minimax reasoning model, minimax m1

#

According to "The Information," the model will be open source.

misty vault
#

dork 4

small haven
#

grok 3.5 will be a significant jump in order of magnitude

#

to gpt 4o 😎

torn mantle
#

werent they all blocked