#general


torn mantle
#

also please refrain from sharing them

#

im noticing google is patching them lightly

small haven
#

@deep adder

torn mantle
#

you didnt catch that

#

hehe

small haven
#

huh sure

torn mantle
#

wth

#

@small haven btw i couldnt get any of them to work

small haven
#

theres a trick, u have to remove token count or something like that

#

but i couldnt bother

torn mantle
#

thats just the model info

small haven
torn mantle
#

yea someone sent me that already

small haven
#

mhmm

torn mantle
#

the thing is, they're talking about models that require the thinking budget to be turned off in order to work

keen beacon
#

use flash as a template instead of pro so you can do that

#

the models suck anyway

#

the ones that remain

torn mantle
#

yea

keen beacon
calm sequoia
#

Guys, has anyone here used Copilot 365?

#

I need an agentic tool like Cursor but for documents.

#

When I hear "copilot" i get the ick. But it seems to be the only solution

ocean vortex
#

I think they need to add smth like SimpleQA into their test suite. We need at least 1 benchmark where model size does make a difference

#

otherwise people looking at this may just as well conclude that there's no point paying for larger models lol

#

2.5 Flash 26% vs 54% for 2.5 Pro

#

kinda same story for other labs as well

tall summit
ocean vortex
tall summit
#

real

soft kernel
olive mesa
torn mantle
cedar tide
fleet lintel
calm sequoia
cedar tide
civic flame
#

2.5 pro beat it in literally every benchmark in that image lol

cloud venture
#

4 out of 5

#

yeah ig o3 is "winning"...

#

...the second place

calm sequoia
dusky aurora
#

again, "there was an error" to everything

atomic pagoda
#

I’m getting “there was an error” to everything too

dusky aurora
#

as George Martin said, "outage is coming"

atomic pagoda
#

Great

echo aurora
#

uh oh

#

team is looking into blobsalute

dusky aurora
boreal saddle
late path
#

500

indigo hazel
#

the error is happening on both mobile and pc

late path
#

thought it was my problem and cleared all my cookies

#

didn't help

fleet lintel
echo aurora
#

my apologies for the inconvenience everyone! we are looking into getting a fix out asap

indigo hazel
echo aurora
indigo hazel
jovial heath
#

Hi, is there an error on the site?? I'm receiving a "there was an error" message every time xD or is it just me??

indigo hazel
#

they're working on it i think

sudden cloud
#

Right thanks 🙏🏼

echo aurora
jovial heath
indigo hazel
#

pineapple is really patient xD, great person

echo aurora
#

ablobcheer ablobcheer ablobcheer ablobcheer okay should be working again!!! ablobcheer ablobcheer ablobcheer ablobcheer

#

get back to battling!

indigo hazel
#

thank you very much

echo aurora
#

no, thank you all for flagging! truly helps us so much. we couldn't be more thankful for an active community ❤️

calm sequoia
whole wagon
#

They use it as a defense somehow. Like o3 pro would top the simple bench if only they benchmarked it 😂 like barely anyone can afford to bench it lol

late path
#

new deepseek r1 got #2 on webdev arena

torn mantle
#

?

#

i dont like blacktooth

blazing coyote
#

Blacktooth feels noticeably worse than Kingfall

whole wagon
#

I saw some others saying they like it

cedar tide
#

Qwen no thinking better than thinking

surreal creek
#

Wow we may be entering the Chinese century

sour spindle
whole wagon
#

What checkpoint is dropping in 3 days? Is it kingfall?

surreal creek
#

surprising overperformance by DeepSeek and Alibaba!

whole wagon
#

Is blacktooth Gemini Ultra? I saw ppl say that

keen beacon
#

if u mean ga 2.5 pro, it will be the same as the 0605 preview just renamed to ga

whole wagon
#

Kingfall is ultra?

surreal creek
#

also, when the leaderboards were fully transferred over to the new LMArena site, did they just standardize them around 2.5 Pro 05-06 being 1446 Elo in every category?

surreal creek
whole wagon
#

blacktooth made me this plane svg which is pretty cool

#

It even did some really nice details like the hint of the far wing

haughty tangle
#

v-jepa 2 having only 1.2b parameters but being near o3 is crazy

whole wagon
#

This is gemini-2.5-pro-preview-05-06 😂 no way blacktooth is just another checkpoint of 2.5 pro

late path
#

blacktooth does better than 0605 on SVG, but still not as good as kingfall

fleet lintel
whole wagon
#

i dont really see much difference with the robot face ngl

fleet lintel
whole wagon
#

the plane one is so much more obvious

fleet lintel
#

ohk

#

what is this?

#

Looks like a new model.. do we know which company Blacktooth is from?

late path
whole wagon
#

"Generate an SVG of a plane. Make it as detailed as possible" This is it

fleet lintel
#

oh..looks like blacktooth is gemini

late path
#

haha I found old messages. This one seems to be kingfall as well

echo aurora
#

reminds me of:

potent snow
#

Is there some way to use images as reference to generate?

whole wagon
#

i dont seem to get it yet

verbal nimbus
#

It was generally the best model at drawing in TikZ.

whole wagon
#

claude-sonnet-4-20250514-thinking-32k

echo aurora
verbal nimbus
#

This is Grok? Pretty good

verbal nimbus
whole wagon
#

i had one with opus but it was just as bad so it lost

late path
verbal nimbus
#

On Claude Web, you can feed back the image and ask it to iterate on the design.

#

Seemed to work on a unicorn example.

atomic pagoda
#

I’m still getting the error message when I try to send something, or "something went wrong with this response, try again"

verbal nimbus
#

Claude 3.5 I think. Someone made a timelapse from feeding the outputs of each iteration back to the model. In TikZ (very difficult to draw in, but used to prevent data contamination).

atomic pagoda
#

Is anyone else getting the error message or "something went wrong with this response, try again"?

echo aurora
atomic pagoda
#

I don’t know, it’s for Claude opus 4 in direct chat mode

verbal nimbus
#

In 20+ battles, I think I only got a response from both models once. All the other times, one of the models would think for a long time but output nothing.

echo aurora
atomic pagoda
#

Ok

calm sequoia
#

The only loser here is Musk 🙂

#

Musk: 200k SOTA GPUs, DeepSeek: 0 SOTA GPUS (smuggling)

verbal nimbus
#

These are in consecutive order too, I'm not cherry picking.

small haven
torn mantle
#

Not a fan of blacktooth

elder rapids
#

imo that's just nonsensical, blacktooth is really good

#

it gaps all the other models

keen beacon
#

personally i like kingfall better

elder rapids
#

yeah but they're likely just different models and as far as I can tell, blacktooth is much more refined

#

no syntax problems in its output, respects the thinking process unlike kingfall, doesn't jump to conclusions, still understands everything

#

I can't trust that you guys experienced the same thing I did with kingfall, but it's not the insane model you guys are making it out to be

#

it has insane spatial abilities imo and excels in some tasks, but it performed worse usually in plain context tasks where it has to track maybe an argument, or conclude something within that box of context without just "I feel like overall x is better", stuff that necessitates an inherent grasp before even thinking about it

#

o3 imo WAS the best at this, kingfall couldn't fill that, but 0605 accomplishes these "if you know, you know" tasks

jade egret
#

guys

#

plz help which llm is best for pygame?

elder rapids
#

and that's the same thing I'm getting with blacktooth, which kingfall ultimately failed

small haven
#

blacktooth is a lmarena proc, not at all functional compared to kingfall when u look at coding, hence why the svg's were also not as high-fidelity

elder rapids
#

cool but I don't get how that's relevant

#

I don't get the glaze tbh, I abused that model a ton, it was so good I can't tell if it just had a different thinking summary, or that thinking summary just really liked what it was seeing

keen beacon
#

kingfall felt like it had less work done on it. it had a lot of magical moments/etc where it wasn't diluted by the post training and was unintentional. (being amazing at svgs which isn't a usual post training thing you usually do, i think, and a case where it consistently started solving two 6x6 zebra puzzles when it consistently failed when given one, by spontaneously making a system when faced with increased complexity) blacktooth feels like an overcorrection/overdone post training cooking the model on stuff outside of distribution whereas your task might be in distribution for this revision. (i believe kingfall/blacktooth use the same base model, they're just different post training revisions)

small haven
elder rapids
small haven
#

donald trump is delayed, damn it

elder rapids
#

id already moved on from kingfall with riddles or tests

#

after like a couple hours of release

elder rapids
#

did nobody see how different the summary was

#

or nah

#

it was way different

keen beacon
#

i heard someone talk about that

#

i dont pay attention to the summary that much though

#

i leak cot if i want to read it

elder rapids
#

ye I'd expect that, I just ignore it

#

but it caught my eye when it started naturally capitalizing and emphasizing things

#

and placing things where like, damn you can really understand it

#

and not the roboticness that the current summary has in aistudio

keen beacon
#

iirc it will remain on chatgpt tho

primal orbit
#

I got blacktooth in general arena, not dev

meager harbor
# calm sequoia The only loser here is Musk 🙂

we'll see with grok 3.5, google had so much more computational power than OAI but since they joined the LLM party late, they were behind for like 2 years. Same should apply for xAI and it's even worse since xAI started from scratch while that wasn't the case for google, you need some nuance in your thinking.
But yeah I see OpenAI and google on top of xAI in the long run.

meager harbor
#

it's annoying as hell

#

Gemini with their 0506 and 0605 confusion

#

why are they trying to mess with us so badly ? a simple versioning would have done the trick

#

but NO they want to f***k with us it seems

#

Not even mentioning OAI, who are the kings of jerks when it comes to bad versioning and confusing the hell out of users

brittle tiger
small haven
#

donald trump is delayed

#

apparently

sacred quail
#

its so sad that LLMs are focusing more on code and less on writing

#

i can understand why but still

small haven
#

its so good that they removed it, holy moly

sacred quail
#

i'd say gemini 06/05 best(even better than opus 4) but im a gemini fan soo i could be biased

meager harbor
#

why aren't o1 pro and o3 pro on lm arena ?

sacred quail
#

expensive, too much time for answer

misty vault
small haven
#

the great flippening

meager harbor
# meager harbor why aren't o1 pro and o3 pro on lm arena ?

well but then they can use any marketing BS to say it's the best model in the world ever. I don't understand why OAI doesn't give lm arena free access to o3 pro for Battle (and not direct access), it's not like they don't have the computational power to handle 500 requests per day for o3 pro. I call this BS. OAI could if they wanted to

small haven
#

huh i get temporarily limited on o3 pro occasionally, they are still compute constrained

sacred quail
#

I remember anthropic dissed lmarena because they think human feedback makes models lame
I mean, people like emojis and charts and the labs listen to us about that, but
it's still useful to see people's feedback

meager harbor
wintry tinsel
#

At writing

#

Gemini is more descriptive, Opus feels more natural

small haven
#

its not really "unlimited"

#

as they say

meager harbor
#

you pay for my subscription then ?

storm needle
small haven
storm needle
small haven
#

tbf i spam like a bot

meager harbor
#

😂 for real ?

small haven
#

better off holding it for deepthink

#

how do u know though

#

where is o3 pro usamo 2025

ornate agate
#

all I want to know is how this happened

small haven
small haven
#

bing chilling wants to see ur id

haughty tangle
#

there's going to be a decent amount of people literally worshipping ASI once it's created, there's already a religion for people who think current AIs are gods

elder rapids
#

I've tried them, none of them are nearly as good as veo 3

#

and it's not even close, most of them aren't as good as veo 2

#

it's really strange how high rated they are though

#

I do notice that they excel off of an image prompt or a flow-esque setup, and do pretty well for fantasy (because they're fine-tuned for that) but it shouldn't be this close

#

ye but I can't imagine that's anything but a flaw of the benchmark

#

rather than how good the model is in reality

#

I haven't used seedance

#

I'm just going off of all the other AI like Kling etc

wintry tinsel
jade egret
#

hola

#

guys

#

why do i think claude 4 opus is much more creative than 2.5 pro?

small haven
ornate agate
#

also I think claudes are dense models, whereas gemini is MoE, so it can use all those parameters for creativity

unborn ocean
#

no, but they have higher quality human feedback and also a higher quantity of it

#

that is the main part

#

it is also why the models are really good in these short human preference evaluators

that said, bytedance still built a very impressive model, mainly because they can achieve lower prices than competitors (the models we get are already greatly distilled and technically 480p with a 1080p upscale)

jade egret
#

yall

#

do you think kingfall is gonna be better than claude 4 opus (at those specific coding tasks)

jade egret
#

what about

#

blacktooth?

small haven
#

mhmmm, not rlly

jade egret
#

dang

unborn ocean
#

they are not as strong in the llm space

elder rapids
unborn ocean
#

but i was (and i assumed you as well) referring to their video models

elder rapids
#

it isn't a secret that 0325 was extremely creative

unborn ocean
#

deepseek is a very respected newcomer, but there are other companies that capture more of the consumer market there

#

it should also be noted that bytedance operates a ranking platform almost identical to the one used by Artificial Analysis but for the chinese market (also including competitors, very similar to Artificial Analysis)

#

and that is probably also a reason why they are so good at the specific format

patent aspen
#

One thing I haven't seen many people point out: if the video leaderboards included native audio generation, there would only be one model

unborn ocean
#

the market is in shambles

#

reporting live

misty vault
#

yo my internet is bugged and user pfps not loading

#

oh wait

sour spindle
#

Anyone know what model this is: prowlridge

patent aspen
#

Just realized flash lite could be a pun on flashlight

small haven
#

allegedly

potent pilot
#

I'm just curious, has anyone opted out of the arbitration section of the ToS?

elder rapids
late path
cedar tide
#

deep think
2.5 flash lite - prowlridge
2.5 ultra - blacktooth

small haven
#

so dt is still coming early? did that get revised alrdy?

small haven
small haven
#

hmm damn

patent aspen
#

I know right

torn mantle
#

So you're not sleeping?

#

Idk if ultra thingy is a new model or not

cedar tide
torn mantle
#

Sigh

#

Blacktooth is the next update?

#

Didn't like it

cedar tide
#

what the horse ?

torn mantle
#

Its like using 2.0

cedar tide
torn mantle
#

Flash lite

torn mantle
cedar tide
#

flash not today
but deepthink

#

prove that flash ga not today @patent aspen

patent aspen
#

I have no idea what you're talking about

sacred quail
cedar tide
#

Wasn't there a code name for the Mystery Gemini models on the arena related to a horse?

patent aspen
cedar tide
#

@patent aspen You deleted your "prediction", did you think people were stupid?

patent aspen
#

Are you doing okay? I hope the rest of your day goes well!

small haven
small haven
torn mantle
small haven
torn mantle
#

Female pfffft

small haven
#

lol

elder rapids
elder rapids
#

so it could go like

flash lite - super speed, pro - powerful model, flash - fast model

#

deepthink seems to be on the way but I don't think it's guaranteed GA like with the other models, that was never set in stone

#

nor alluded to

patent aspen
#

Yeah now that you mentioned it, I think the horse emoji is regular Flash because it's the "work horse" model

elder rapids
#

just simply "comes later"

small haven
#

f's

elder rapids
#

ye but it could also be that it's previewed first tmr

#

and the info you're getting is different because while that's technically true

#

unless they don't want a preview

#

and it's just simply that dangerous

late path
#

I think they were waiting for o3pro, but o3pro turned out to be garbage and completely not worth competing with😂

elder rapids
#

o3 pro isn't garbage imo

#

do you think deepthink will be SOTA

late path
#

well at least it falls far short of the expected o1 to o1pro leap

whole wagon
#

Does an ultra exist

elder rapids
#

ye

whole wagon
#

Is that what this blacktooth is lmao

#

There's no way it can be deep think, it responds way too fast

elder rapids
#

if they never planned to release it, they would've likely never put it in the arena

whole wagon
#

I think it's coming ngl

#

Sundar said before they were considering releasing the ultra models

cedar tide
elder rapids
small haven
elder rapids
#

via "next generation performance would already close that gap"

whole wagon
#

Up until a point

#

Eventually the size will just win

elder rapids
#

true, pro could just be getting an upgrade and therefore that's blacktooth

#

but not quite ultra size

#

ye and also it's not far fetched to say that, they could be coming up with some really good efficiency innovation

#

hope they don't want profit over retention

#

benefits me more that they make things cheap

small haven
#

off topic, but will google ever create a claude code equivalent (gemini code assistant/jules don't compare), but with gemini models? that's where they would rlly win against anthropic imo

elder rapids
#

guess so but they could see AI as something that has nothing to do with profit or retention, and or just restrict chat AI access and go fully into subtle AI

#

where it's still beneficial for them, but removes the hope people have so when prior they could say "no profit over retention please"

#

ye but that just sidesteps the question I was asking lmao

#

"early"? if this is a distinction then it concedes user retention

small haven
#

especially when its cli/terminal bound

jade egret
#

what the best for coding rn

elder rapids
#

there is no "early" or "for a time"

small haven
elder rapids
#

if it's not perpetually free then it isn't user retention in AI, simple

jade egret
#

2nd best?

#

fr?

#

yooo w

#

i can use 4.5 then

elder rapids
elder rapids
jade egret
#

hm

elder rapids
#

Craig just says stuff

small haven
#

for coding? nope

elder rapids
#

don't look at what he says

jade egret
#

ima test both out?

#

ima try both out

elder rapids
jade egret
elder rapids
#

you'd be wasting your time trying to find a middle ground

#

he's just saying random shi

small haven
#

u could have at least said o3/o3 pro

elder rapids
#

deadass

jade egret
#

i have a question

#

whenever i use o3

#

it doesn't even put code in the canvas

jade egret
#

: 0

elder rapids
#

it's simple, don't use o3, gpt 4.5, 4o

#

for coding

#

cool 90% of people would agree

#

there's a sentiment behind asking them not to monetize or concern themselves with major profit

#

that's why the initial question was asked at all

#

yo, id rather use 4o mini

#

😭

elder rapids
#

retention includes a lot of things implicitly, I don't really care about how they view it

#

or their philosophy, or their definition of retention

small haven
#

100m users organically with/without forced google integrations? 😂

elder rapids
#

with forced Google integration it would prob be close to a billion

#

lol

#

define natural acquisition

#

yo what does that even mean

#

yo

#

you know that's contradictive

#

😭

#

bro not even trying

jade egret
#

i dont think 4.5 worked

small haven
#

ok that was funny

jade egret
elder rapids
#

cool an outlier on the biggest stage where by your definition would be unnatural

small haven
#

ive been using less of oai models fwiw

elder rapids
#

do you not think this possibly alludes to other things

#

like maybe the ability to generate other ads

#

you can simply not count them yourself

#

the numbers are there brochacho

#

great time to use AI

#

ngl

small haven
#

whats ur risk appetite

elder rapids
#

no reason for that to be valid if we have no idea where openAI stands currently since its visible growth was around a year or two ago, and Google might start going uphill given their innovations

#

like with that alphadolphin bs

jade egret
#

what that

small haven
#

i mean obv oai then

jade egret
#

use gpt 4.5 for prompt?

elder rapids
small haven
#

google as a company is alrdy mature

jade egret
small haven
#

so deepmind isolated?

#

deepmind > oai

elder rapids
elder rapids
#

then DeepMind

#

easy

#

lmao

#

dude just trapped himself

#

😭

small haven
#

lol

#

thought experiment -> deepmind

elder rapids
#

😭

small haven
#

in the ai scope, deepmind is considered infant

jade egret
small haven
#

oai is mature

#

relatively

elder rapids
#

a lot younger than openAI

jade egret
#

oh

#

it actually follows that?

#

dang

#

alr brb

elder rapids
small haven
#

500b

whole wagon
#

It would be crashing rn

zinc ore
#

I'd guess 80% but slowly going down

whole wagon
small haven
whole wagon
#

These days the landscape is far different

small haven
#

so prolly even $1b at ipo, i wouldnt be surprised

elder rapids
small haven
whole wagon
#

Bro has to resort to mind share now 💀

small haven
#

its obviously >80%

elder rapids
#

all centered on the fact we were talking about implicit growth

#

it's getting old Craig

jade egret
#

guys

#

plz tell me

#

is there any model that is the same as claude 4 opus

small haven
elder rapids
jade egret
#

i ran out of budget for that too

#

well

#

i can try

#

brb

patent aspen
#

What if OAI just loses?

#

Seriously

jade egret
#

why ; (

jade egret
#

:0

#

i want google to win (:

patent aspen
#

Me too orange me too

small haven
elder rapids
small haven
jade egret
#

guys

#

how about deepseek r1 for coding

#

that seems like the only one that works idk why

small haven
#

theres a lot of markets, craig's market is polymarket

#

yea or the latter

elder rapids
#

for asking for clarity

jade egret
#

deepseek server busy..

jade egret
#

wait i spelt it right

#

:0

small haven
patent aspen
#

@deep adder I think you tend to decide what you want to be true and then ignore everything that doesn't conform to that

elder rapids
#

it doesn't have to be a stop sign if he can't read

#

get with the program brochacho

jade egret
#

guys..

Gemini didn't work
chatGPT didn't work
I ran out of claude
DeepSeek server busy
I don't think grok is gonna work

what should i try next (for coding, pygame to be exact)

small haven
#

craig is funny imma give him that

lapis light
#

I personally think, as soon as Google starts tackling personalization, OpenAI is cooked

#

At least, that's what I want to see happen.

patent aspen
#

I would add one caveat that it's hard for DeepMind to exist in a vacuum. The computing power, supporting teams, data from products, etc all play a role. If we assume they retain those advantages, I would definitely bet on DeepMind

elder rapids
#

demis and sundar are a good combo tbh

#

sundar seems bright

patent aspen
#

Granted OAI can't exist in a vacuum either. They need Microsoft to bankroll them to remain competitive

#

I doubt they want to bite the hand that feeds them

small haven
#

but does msft own 49 or 51% of oai?

patent aspen
#

Did OAI actually sue Microsoft?

small haven
patent aspen
#

Right I knew about that, but I think it's more like OAI is doing the bare minimum where Microsoft will still let them use their cloud at a discount

small haven
#

mainly started when oai asked to allocate some gcp servers

patent aspen
#

I guess OAI will start paying GCP to have some negotiation leverage

patent aspen
small haven
small haven
patent aspen
small haven
#

google is seeping into oai ecosystem, what u gonna do about it craig

patent aspen
#

Then their losses for 2025 would likely be closer to $18B

small haven
#

holy

patent aspen
#

I think OAI lost around $8B in 2024 and was on pace to lose $14B in 2025 before all of this Microsoft stuff

small haven
#

just another funding round, no biggie

hollow ocean
#

Will deep think dethrone o3 pro tmr?

#

Why not

small haven
#

@patent aspen have u tried o3 pro, how does it compare to dt

hollow ocean
#

What’s coming tmr

small haven
#

hmm damn

hollow ocean
#

Will pro GA be good

small haven
#

isnt it literally 0605

hollow ocean
#

I think so

#

So same model

small haven
#

preview to generally available 🤷

#

oh

#

craigbenched

patent aspen
#

What if @deep adder was actually Craig Federighi?

small haven
#

can i have an autograph

patent aspen
#

ngl I think the real Craig Federighi has the most punchable smirk

#

Like I'm not an angry person at all

small haven
#

do u have the same agility as him when he goes down the stairs at apple hq

#

wow

#

lemme know when, so i can short

patent aspen
#

Sometimes I think Apple is like a mustache twirling villain

#

Like when they designed their own protocol for air pods that didn't allow third party buds. Then they made the excuse that they couldn't allow 3rd party buds because it wouldn't be secure - with the protocol that they designed

#

And they will never have an opportunity to try until the EU gets tired of their anti-competitive practices

#

Like with lightning cables

small haven
#

oh naw @deep adder how do u respond to this

topaz edge
#

had to check why i blocked that dude

#

looks like it was justified

#

guess they should focus less on design and more on hardware

#

lol

#

back to the block list

small haven
#

wait what

#

craig is also getting hate from other servers 😭

#

that must be some sort of achievement

topaz edge
#

mods should ban him

small haven
#

tbh i enjoy his company

topaz edge
#

why

small haven
#

hes just funny, makes it less stale

#

im not taking it serious

topaz edge
#

funny is an interesting way to describe it

prisma bison
#

Hey folks

echo aurora
#

howdy ablobwave

topaz edge
elder rapids
#

insane response

dusky aurora
#

LMArena is my only way of stress relief these days

#

after yet another missile in my city this morning, my mind wants to rest

civic flame
#

☹️

meager harbor
#

This is AGI

#

Seriously I laugh when i see people saying AI will replace 20% of jobs in the near future

fleet lintel
#

Will I be able to use deep think Gemini without the ultra subscription??

torn mantle
elder rapids
# meager harbor

too trained on the classical riddle or simply decided to forgive the possible mistake the riddle gave lol, models are like that and the thinking process would probably point that out, but since it likes brevity it'll just spit out the total corrected answer

#

which includes the model forgiveness factor

whole wagon
# meager harbor

Go to people on the street and ask them this question. Report back what % gets it right

#

Human intelligence is so overestimated

#

I tried asking some basic times tables to some people and they messed them all up

#

Meanwhile LLMs can multiply two 8 digit numbers with no tools easily lol

#

Sometimes I get the feeling AGI is already here. And it's just not that world changing is all

#

Maybe when simple bench falls. That will be the marker

drifting thorn
#

Need a WeChat account

whole wagon
#

Google cooking like crazy ATM

#

Like bruh I thought every 3 months was fast now they releasing a new batch of models and it's not even been 2 weeks since the last

drifting thorn
#

Thanks to the integration of AI departments of Google

whole wagon
#

Meanwhile over 2 years between chatGPT 4 (march 2023) and upcoming chatGPT 5 in July

#

This cadence won't work now

#

With Google on the scene

drifting thorn
#

Deepmind is a very strong team

#

And I guess the next step for Google is combining V-JEPA 2, Veo, Imagen and LLM to create the base of AGI

whole wagon
#

Google says they don't think the LLM arch can reach AGI

#

Probably they have something else developing for that

#

Imagine Google drops Gemini 3 same time as gpt5

#

That would be so brutal kek

leaden sun
indigo hazel
#

and at the same time deepseek r2 lmao

whole wagon
#

Imagine if human reasoning really isn't special and an LLM is enough to reach it

#

Dunno whether that would be good or bad lol

#

It seems increasingly likely as the days pass

leaden sun
drifting thorn
#

V-JEPA 2 is a world model

#

And it can predict future events well, much better than current video-generating AIs

elder rapids
#

this is not by any means a simple bench esque problem either

#

so many problems aren't "solved" by these AIs because there's a level of forgiveness, since they have the necessary "truth" in their training data

#

whether this is a flaw is random

leaden sun
# whole wagon I tried asking some basic times tables to some people and they messed them all u...

was it done statistically using huge random sample groups (>=1000 people, this is the sample size that is common in clinical trials for testing new medication for example, the smallest being 100, you need at least 10 to test new cosmetics products sigh)

or did you just go out on the street somewhere and ask random people passing by? here in my place, if i do that, everyone will solve it correctly, but i need to admit my town is famous for academics 😆

ocean vortex
#

yeah Opus does get it right. Seems that I overestimated 4.5 though since that doesn't...

alpine coral
# elder rapids whether this is a flaw is random

i dont think totally random - the more well known the 'truth' (or in this case, the riddle) is, the more likely it is to result in an overfitted, wrong answer to a slight variation of the question asked related to it (and designed to get an alternative response as the solution)

#

with yeah assumed typos etc on the part of the user as the rationale

#

the son mother doctor setup is as 'classic' as they come imo

#

o3 gets it wrong, unless instructed to interpret the question literally (funnily, its reasoning summaries indicate that the CoT was entirely wrong.. but ig the model corrected for the actual answer)

#

o3-pro, when told to interpret it literally (i.e. not assume typos on the part of the user or that it is a failed attempt at asking the 'classic' version of the question) also gets it 'right' (but not really ofc, as Dom points out)

#

oh that's o1-pro..whoops

alpine coral
ocean vortex
#

I tried this on 2.5Pro, no go, it failed as well catgrin

ocean vortex
#

fundamentally it means the model is less flexible

alpine coral
ocean vortex
alpine coral
#

yeah tho look at o3's response reasoning summary above; it doesn't say typo, but it says "the user's phrasing is a bit off", which to my mind is basically the same thing no?

ocean vortex
#

could still be just the lack of capacity tbh. You can see reasoning helps it, but still not enough, and it eventually goes with the easy answer, the path of least resistance

#

the fact that o3-pro gets the original wording wrong, would suggest that virtually no attempts of o3 get it right. They probably only contemplate it in reasoning but I'm not sure reasoning traces are even considered with parallel compute for response ranking - would make it much more expensive

#

I think the bottom line is, if the model is to assume the user made a typo or intended to say something else, at the very least that should be in the final response when giving an answer for alternative interpretation

#

If it assumes things wrong (not how it was written) and then just provides you with a concise simple answer - that's automatically incorrect, in my book.

alpine coral
#

well i mean this is why i like riddles ahah

ocean vortex
#

regardless of the reasoning even

alpine coral
#

or twists on riddles.. they trip LLMs up

#

(as was mentioned before tho, in a way that might trip up many humans too, like pretty easy to gloss over the phrasing of the question and assume it's the 'classic' and the answer is the mother.. but kinda beside the point to the discussion.. nvm / carry on aha)

alpine coral
ocean vortex
#

yeah.. the reason I even tested it is because from my experience bigger models are considerably better at things like this. They are somewhat less likely to assume things that sound similar and already exist in training data but lead to different outcome 👀

fleet lintel
#

I am excited about deep think models. Is there any insider or trusted tester info out on Gemini Deep think models? .. @patent aspen

hollow ocean
#

getting ultra as soon as its out

#

can't wait for sota reasoning

fleet lintel
#

I am not paying 3000$ per year yet but might be able to get shared access from time to time

fleet lintel
#

3000 $ is not a small amount. I can afford it but I'd rather not unless it's necessary.

#

Honestly, my company should provide me an ultra subscription.

alpine coral
#

was reading through this from ARC the other day, thought this was on the money:

You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

#

lol yeah prob tbf ha

#

but i do think for 'general' intelligence, some expectation that you can give it a basic af question or riddle, that the overwhelming majority of literate humans would get right, and it too gets them right, is reasonable

#

anyway.. not a can of worms i wanted to open.. let's move on ha

#

AGI is perhaps my most disliked term.. it means [so many different things to different people that it basically means] nothing

golden ocean
#

i would like to open that can of worms

alpine coral
#

go on 😉

willow grail
#

only real wo/men use claude max/code

sacred plaza
#

The Claude code running for 7 hours was a bit odd. What the hell was it doing for 7 hours??? Was it just stuck accessing a database for that long haha

unborn ocean
leaden sun
unborn ocean
#

Or I hope, bc otherwise it might be a bit boring

alpine coral
leaden sun
#

the colon somehow helped make clear that the sentence in quotation marks is what the surgeon literally said, hence explicitly stating he's the father. but:

alpine coral
#

ahh i see

#

thanks for clarifying!

#

yeah that's kinda interesting ey

leaden sun
alpine coral
#

oh love that..

(assuming the twist and traditional riddle intent):

#

which model?

leaden sun
#

dear chat, whats going on with you 😔

#

4o the latest

alpine coral
leaden sun
alpine coral
#

ahh gotcha gotcha

leaden sun
#

the punctuation does matter, i knew it right away

alpine coral
# leaden sun

i dont get the way it starts with "Ah, I see now!" - it's like a follow-up or from reasoning?

leaden sun
#

i used the version from above first, most got it wrong, then I added the colon, suddenly, it's clear for most but not all

alpine coral
#

i cant reproduce with 4o after adding the colon

brittle tiger
keen beacon
#

fast, strong, workhorse

#

i guess

alpine coral
#

i like how we're basically reading tarot cards now

#

kidding aha

leaden sun
alpine coral
#

i agree.. his tweets are worth something.. and that interpretation of the emojis makes sense ig

leaden sun
#

i use the battle mode, so the previous rounds might have influenced it

ornate agate
#

lightning bolt = flash; arm = pro; horse = fast = flashlite

leaden sun
willow grail
#

while u cope and hope of free lmarena models...

#

only real wo/men use claude max/code for swe.

#

🙂

willow grail
#

/s

indigo hazel
patent aspen
#

The horse is regular Flash. They've said dozens of times that it's the "workhorse model".

alpine coral
#

that does ring true

willow grail
#

go claude max/code

ocean vortex
#

you can do your own with parallel requests

#

but it's gonna cost you 😇
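
rough idea of what I mean, as a minimal sketch only: fire the same prompt n times in parallel and take the majority answer. `ask_model` here is a hypothetical stand-in that returns canned answers, not any real API, and nobody knows whether the labs rank responses this way; simple majority voting is just my assumption for illustration.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def ask_model(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one model call; swap in a real client here."""
    rng = random.Random(seed)
    # Canned answers so the sketch runs offline; a real call would send `prompt`.
    return rng.choice(["father", "father", "mother"])


def majority_vote(prompt: str, n: int = 8) -> tuple[str, Counter]:
    """Send the same prompt n times in parallel and return the most common answer."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(lambda seed: ask_model(prompt, seed), range(n)))
    counts = Counter(answers)
    winner = counts.most_common(1)[0][0]
    return winner, counts


if __name__ == "__main__":
    winner, counts = majority_vote("Who is the surgeon?", n=8)
    print(winner, dict(counts))
```

n requests means roughly n times the token bill, which is the "gonna cost you" part.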

keen beacon
#

logan didn't say that?

patent aspen
#

Maybe interpreting the emojis or 3 x gemini

jade egret
#

hi

kind cloud
#

on vertex

patent aspen
#

I think it depends on what is being predicted tbh

brittle tiger
#

Can you give an example of one and what odds look like?

indigo hazel
placid charm
#

@echo aurora any news about lmarena test garden you allowed to disclose?

patent aspen
brittle tiger
#

There for me too

willow grail
willow grail
civic flame
#

oh nvm no we're not

willow grail
#

if u would need berberine u would know the answer

jade egret
#

woah

willow grail
#

no idea what that is

jade egret
#

so 2.5 pro is officially on vertex studio?

willow grail
#

ew no. i dont do drugs

#

i prefer safe streets, superb public transport system.
money if u get sick, so many holiday weeks as employee.
money if u lose job.
money if you cant work.
money if. if if if

#

SOURCE, old man?

#

oh ok

#

oh ok

civic flame
#

okay yeah this seems to be blacktooth

#

damn

willow grail
#

the user?

cedar tide
jade egret
#

yea

#

but

#

blacktooth better than 2.5 pro 0605

civic flame
#

lol no

#

this is better

cedar tide
#

@kind cloud

keen beacon
#

interested in that too, havent gotten the chance to play with it yet

brittle tiger
keen beacon
#

its the same as 0605 then

cedar tide
#

so no price reduction when you are without reasoning like on 2.5 flash 🥴

#

@brittle tiger you have the price of 2.5 flash lite ?

hybrid locust
#

ai studio is changing

#

they're releasing them

cedar tide
hybrid locust
#

the other models are gone

keen beacon
#

same

cedar tide
#

06 05 and 05 20 gone

torn mantle
cedar tide
keen beacon
#

you removed those models 🥲

wintry tinsel
#

Yo wtf

#

lol

#

My daily driver is gone

civic flame
#

lol google are going minimalist

#

trust the process

hazy quest
#

"Gemma 3n E4B" was this model there already?

hazy quest
#

Whats the logic behind the name?

jade egret
whole wagon
#

Chill the new ones are getting added very soon lol

jade egret
#

today?

whole wagon
#

Yes

jade egret
#

W

#

blacktooth right

#

hopefully better than opus 4 at coding

#

because i ran out of credit soooo fast 😭

brittle tiger
whole wagon
#

They are in Google ai studio now

keen beacon
#

yup

leaden meteor
#

LMArena updated its leaderboard just now but 06-05 still top....where is stable 2.5pro?

whole wagon
#

Well. They just need to change the name lol

balmy mist
#

there is a new 2.5 pro?

keen beacon
#

ga version is supposed to be the same as 0605

#

afaik

balmy mist
#

ahhh

#

anyone test them yet?

whole wagon
#

Ppl not understanding blacktooth/kingfall is part of a different model line

#

Gemini 2.5 pro is done

keen beacon
#

no there might be more revisions of gemini 2.5 pro still

#

but blacktooth and kingfall are different

#

i agree

whole wagon
#

There won't be more revisions

keen beacon
#

there will be

whole wagon
#

That's the point of the ga. Logan already said it's final release

whole wagon
#

Is Gemini 3 pro not cooking kek

keen beacon
#

things are done in parallel

#

gemini 3 probably still pretraining

whole wagon
#

Hm

hazy quest
#

There it is

#

On AI Studio

nimble trail
#

So I think there will be revision soon ig

keen beacon
#

yeah thats a future revision i think

#

one of the side by side ab tests had it

#

not sure if it will be that though

whole wagon
#

IMO, a revision should go under a 2.6 name or smth

#

It's confusing to keep updating 2.5 pro

keen beacon
#

thats kinda surprising for 2.5 pro kinda i guess

elder rapids
#

yo

keen beacon
#

i would think they would keep incremental updates

elder rapids
#

it's using new formatting

balmy mist
#

whats the difference?

elder rapids
#

I thought it was just 0605

balmy mist
#

like which one is GA? which one is better

elder rapids
#

??

keen beacon
#

0506 is older. 0605 is gemini 2.5 pro

#

logan said it would be the same im kinda confused

#

ga and 0605

elder rapids
#

it's different

keen beacon
#

are people tripping balls?

balmy mist
#

which is deep think?

whole wagon
#

Iirc GA is about the user experience. The model intelligence is the same

balmy mist
patent aspen
#

fwiw it's possible there may be updates other than the model itself that result in differences in performance

elder rapids
#

it's not just 0605

#

and it's not 2m context

balmy mist
elder rapids
#

what's flash lites pricing

whole wagon
#

Updated pricing 👀

elder rapids
#

ye because efficiency is part of the preview

hazy quest
#

Lmao so 2.5 Pro is 06-05, no changes

elder rapids
#

btw it is different its using line breaks more often, like how 4o spams it

hazy quest
#

Updated system prompt. It has been doing that earlier today already

balmy mist
#

that new lite model is fast af

elder rapids
#

yeah I'm averaging like 400 t/s with it

patent aspen
civic flame
patent aspen
#

Oh latency? Yeah that could definitely change without a model update

jade egret
#

guys

#

is gemini 2.5 pro equal to gemini 2.5 pro 0605?

whole wagon
#

Yep

jade egret
#

bruh

#

than

#

nvm

#

so technically we didn't get new stuff

whole wagon
#

Flash had a pricing change

keen beacon
whole wagon
#

New stuff doesn't drop on Tuesdays :p

echo aurora
jade egret
patent aspen
#

Right there's a bunch of other teams that spend all day optimizing inference that don't work on the model itself

whole wagon
#

$0.30/$2.50 for 2.5 flash
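
Quick way to sanity-check what that means per request, as a sketch. I'm assuming the two numbers are USD per 1M input tokens and per 1M output tokens respectively; the function name and defaults are just made up for illustration.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    in_price_per_m: float = 0.30,   # assumed: USD per 1M input tokens
    out_price_per_m: float = 2.50,  # assumed: USD per 1M output tokens
) -> float:
    """Estimate the USD cost of one request from its token counts."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


if __name__ == "__main__":
    # Example: a 10k-token prompt with a 2k-token reply.
    print(f"${request_cost(10_000, 2_000):.4f}")
```

Under those assumptions, output tokens dominate the bill as soon as replies get long.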

brittle tiger
#

pricing

jade egret
#

plz link

brittle tiger
#

flash lite evals

patent aspen
#

lol I love that they used the flashlight emoji. I just realized that was a pun yesterday

whole wagon
brittle tiger
jade egret
#

guys

#

do we expect new gemini model to drop soon?

keen beacon
#

My guess no

jade egret
#

sadly

jade egret
#

nahh

#

bro you don't even know kingfall benchmark yet

cedar tide
#

Flash lite

#

Wasn't Prowlridge also in webdev ?

keen beacon
#

it was iirc

meager harbor
# whole wagon Go to people on the street and ask them this question. Report back what % gets i...

"Sure, if your bar for AGI is 'unqualified human,' then yeah, I guess it's 'here' – and it's mostly garbage.

But for me and as the real definition goes, real AGI means an AI that's as good as the median human professional in that field (like coding in a particular language) doing actual work, not just acing some random benchmark. And it’s AGI for that field only, not AGI as a whole. Language models are not AGI in any field yet. Translation is probably the closest they get, but even there, your average professional human translator still does a better job.

Then there's ASI: that’s when no human in that specific field can beat the AI. We've got ASI already in super narrow areas – games like Chess, Go, and StarCraft 2. But those are specialized game-playing AIs, not language models, and they don't really have a 'productive' purpose beyond those games."

cedar tide
keen beacon
#

if i remember correctly

keen beacon
north sandal
#

^

#

think it should reflect the update now that it's graduated

cedar tide
#

SimpleQA
2.0 flash lite 16.5
2.5 flash lite 10.7

keen beacon
#

wow

civic flame
#

to be fair flash lite is probably tiny

umbral crypt
#

How is it going?

whole wagon
#

2.0 flash got 30%

#

Same price as 2.5 Flash lite that gets 10% kek

keen beacon
#

all of the 2.5 models except pro get lower simpleqa scores

#

than their 2.0 equivalents

civic flame
#

oh yikes lol

cedar tide
civic flame
#

all my expectations have now transferred to grok 3.5 🥀

keen beacon
#

what about ultra?

civic flame
#

i mean for models releasing relatively soon

#

i think ultra is still a while away

keen beacon
#

yea

#

probably

whole wagon
#

They were supposed to release like a month ago or more

civic flame
#

yeah elon and deadlines have almost never gone together

#

i don't even know why he bothers

whole wagon
#

Well they delayed it for a reason

#

Because the performance wasn't there

civic flame
#

i suppose im willing to wait if its better than the current model selection

#

which it probably will be, dk by how much though

whole wagon
#

No way it beats current 2.5 pro ngl

#

They still had a huge gap with grok 3

civic flame
#

im talking base model

#

grok thinking underperforms given the strong base

jade egret
#

why is gemini 2.5 pro first in webdev but 4 opus is better?

cedar tide
keen beacon
#

lol the tease it works two ways

whole wagon
#

Ultra + DeepThink = AGI Kappa

cedar tide
#

2.5 is officially moe

meager harbor
# whole wagon Google says they don't think the LLM arch can reach AGI

AI's current prowess shines brightest in intensely text-based applications like translation, writing, and coding. In these areas, the path to AGI seems primarily constrained by the AI's capacity to handle and recall information across vast contexts or databases. Once this limitation is overcome – once they can process massive amounts of text without losing coherence or "forgetting" details – we will effectively have AGI for these domains, though perhaps not ASI.

IMO top-tier AIs today already operate at a near-AGI level within the very small contexts they can manage for translation, writing, and coding. However, for less text-centric or entirely non-textual fields, they can struggle a lot even at small context. A different set of breakthroughs will be required.

Consider the potential: Imagine handing an AI your entire Git library and asking it to develop a new screen; it would immediately pinpoint what to modify and where. Or, provide it with all seven Harry Potter books and task it with crafting an eighth volume, and it would weave a new story while flawlessly remembering and integrating every key element. I think with LLMs it's achievable just by tweaking how they handle huge amounts of information

hollow ocean
#

O3 pro second place on simple bench before being deleted

#

58.9%

civic flame
#

o3 pro is mid

hollow ocean
civic flame
#

lol whenever something doesn't confirm your existing opinions you disregard it as some kind of cheating

hollow ocean
#

You think so

tall summit
#

nobody ever talks about it

#

i'm surprised how it's so good and yet nobody talks about it

#

insane

meager harbor
#

still not as good as average pro translator

#

idk how DeepL is still alive

torn mantle
#

time to block all xai staff

elder rapids
#

😐

tall summit
torn mantle
#

we get it

#

but its getting on my nerves

#

they hyped it up for like 3 months

#

if they have something ready and its good then they should at least tease it properly

zinc ore
#

Also, Logan only does that to tell us release is next day

#

So his tweets are actually meaningful

meager harbor
zinc ore
jade egret
#

@deep adder which one do u think

wintry tinsel
#

What is the new 2.5 pro up in studio?

#

It just says “new” with no date

keen beacon
#

just renamed

jade egret
ocean vortex
#

Google just f'ing renamed it and called it a day lmao

patent aspen
ocean vortex
fleet lintel
ocean vortex
#

we are testing models here, not tools 😇

fleet lintel
#

flash lite seems great though

zinc ore
#

They said goldmane would be GA

#

For weeks now

#

Yeah

#

I'm js, they clearly communicated this as their intent weeks back

ocean vortex
#

all small models score low on SimpleQA, even o4-mini-high

keen beacon
ocean vortex
#

bigger ones score higher, do not need tools

keen beacon
#

aren't they the same size 2.0 flash/flash lite with 2.5 flash/flash lite too btw?

ocean vortex
#

People pretend that small models like o4-mini-high are just as good as the bigger ones. SimpleQA is a good benchmark to prove them wrong tbh

keen beacon
#

others and i find simpleqa to be representative of a model's world knowledge, so if the score drops, it means the model tends to have less. it affects other tasks as well

#

it wasn't exactly scoring high before, the drop is significant

ocean vortex
#

MMLU Pro is decent as well, but at this point it's too saturated to say much...

keen beacon
#

its not just google anyway, apparently claude 4 dropped a lot too

ocean vortex
#

it's not about that. It's about giving measurable reasons for bigger models to exist. Otherwise everyone would dumb them down because many other metrics are good even with small models... Which is just stupid. SimpleQA is great and we need more benchmarks like that

keen beacon
#

some of that for sure. but i think its important for these models to have a decent amount of world knowledge built in too

#

it seemed google made huge progress with 2.5 pro (nearly 10% boost from 2.0 pro) on this front, but the other 2.5 models don't reflect that

#

but 2.5 pro probably has other stuff that they didn't apply to the others

zinc ore
#

What about FACTs grounding

#

Isn't that bench about world knowledge

#

Which Gemini models score highest among all models

#

Like +20% over openAI models as an example

keen beacon
#

its different

#

i believe that one is just for the response to be grounded in the context provided

keen beacon
#

this is a little bit of an incoherent tangent but some, i believe, argue that for these models to "reason" at all or at least partly, it requires (at least partly) unrelated information that you wouldn't expect to be important. it might perform worse on scenarios you wouldn't expect or generalize worse if it has less 'world knowledge' which i feel correlates well with the simpleqa score and my experiences with it. grounded small models with large context windows are important, but i think the holistic performance of the model is affected in non obvious ways by the lack of world knowledge.

whole wagon
#

Feels like Google are the only ones that can do this ngl lol

keen beacon
#

yeah that is possible, but that would require a lot of changes to the architecture & pre-training probably. there are so many ways to do that, i don't really know what that'll look like. it really depends though. depending on the implementation, i still think it could miss out on some of the implicit connections it would have if it naturally had more world knowledge. it's not really about the world knowledge exactly, but the world knowledge we see measures the breadth of data and unrelated data it understands and that correlates with "reasoning" and the connections it's formed i think, if that makes any sense lol

patent aspen
#

I think relatively big models are here to stay

leaden sun
#

graph ai....i think the first time I've heard of such graph was an announcement by neo4j...a long time ago? such graph ai has evolved to design agentic workflow now too

keen beacon
# patent aspen I think relatively big models are here to stay

im not even arguing for bigger models, just saying that world knowledge is important. i think models will trend smaller and smaller, and that there'll be much better ways of unlocking/training in a lot more world knowledge. probably in the near future

#

i hope i made sense lol

small haven
#

oh crows

jade egret
#

🐦‍⬛

willow grail
#

@small haven

small haven
willow grail
small haven
#

oh i thought it was ur yard

willow grail
#

its crops

willow grail
#

the first video link is graveyard

#

the last one is public land

small haven
#

mhmm interesting

willow grail
small haven
#

just a filler, sorry idk what to say

willow grail
#

yes.... the very close graveyard has crow nestings

keen beacon
#

i agree, my point was about how world knowledge is important. you can achieve that naively via a larger model but i think future smaller models will be much improved in that regard with better methodologies etc. it's irrelevant to the point i was primarily making anyway

#

yea i expect a lot of improvements in the near future

willow grail
#

@small haven