#general

torn mantle
#

Actually gemini has a different reasoning approach

#

So it will be interesting

keen beacon
#

this seems different compared to flash thinking

gentle plinth
#

yeah openai hiding the reasoning chain for which you are paying is just ridiculous

torn mantle
keen beacon
#

in my personal experience flash thinking had several limitations on ood

brittle tiger
sick mountain
#

flash thinking:

torn mantle
#

Every time you find interesting stuff

#

I think it may be time to sub to gemini advanced

#

Its not a bad plan

sick mountain
#

if 2.5 is SOTA then it is for sure worth it

torn mantle
#

And also they improved the deep research

keen beacon
#

the gemini models on gemini the product kinda suck tho

#

aistudio is better

sick mountain
#

i feel like they've likely toned down the censorship recently no testing tho just vibes

torn mantle
keen beacon
#

holy hell google moves fast now lol

rigid widget
#

I'm genuinely surprised that even people within this community are believing the "safety" garbage coming from these artificial intelligence corporations.

#

They can mark any requests they want as dangerous, using the excuse of security. Weapons, porn, drugs, workarounds, modifications... you can mark all of these as potentially dangerous. You can even consider giving information about yt-dlp and ffmpeg as "potentially dangerous," and even "content blockers, VPNs, userscripts" as "potentially dangerous."

calm sequoia
#

It was an anomaly from the start that Google was not No. 1 at everything LLM and AI

alpine coral
#

just that safety is not the same as propaganda

north vale
#

Fwiw i think the 2.5 thing is prolly real now

alpine coral
#

yeah me too

keen beacon
#

ya it is

#

its just an unbelievable pace

#

they did gemini 2 sooo fast

#

then pivot to this that quickly

#

they didnt even release 2.0 pro stable

#

😉

rigid widget
silk haven
#

I thought that the 2.5 Flash and 2.5 Pro would be released near Google I/O in May, like last year

brittle tiger
alpine coral
keen beacon
silk haven
#

Google's Sergey Brin: Google’s AI products “are overrun with filters and punts of various kinds.” -> Google’s co-founder tells AI staff to stop ‘building nanny products’

Google’s AI products “are overrun with filters and punts of various kinds.” According to Brin, Google needs to “trust our users” and “can’t keep building nanny products.”

keen beacon
leaden palm
#

top 5 google naming fails of all time

north vale
#

Yeah this doesnt make any sense

#

Surely they add thinking to the name

keen beacon
#

cot models will probably be the default going forward

north vale
#

Nah

keen beacon
#

they wont release a non cot model anymore

leaden palm
#

well gpt-5 won't use chain of thought if unneeded as i understand

keen beacon
north vale
#

Idk if it’s hybrid and mostly doesnt use reasoning for most completions it doesnt count as cot model to me

#

So ig depends how u look at it

#

Maybe models with cot capability will be default

#

But asking it how are u and it reasoning about it before answering will not be default

#

Bc that’s useless

keen beacon
silk haven
# keen beacon that certainly didnt stop them from developing stuff (other than safety filters)...

Sergey Brin full note:

“It has been 2 years of the Gemini program and GDM. We have come a long way in that time with many efforts we should feel very proud of. At the same time competition has accelerated immensely and the final race to AGI is afoot. I think we have all the ingredients to win this race but we are going to have to turbocharge our efforts.

Code matters most — AGI will happen with takeoff, when the AI improves itself. Probably initially it will be with a lot of human help so the most important is our code performance. Furthermore this needs to work on our own 1p code. We have to be the most efficient coder and AI scientists in the world by using our own AI.

Productivity — In my experience about 60 hours a week is the sweet spot of productivity. Some folks put in a lot more but can burn out or lose creativity. A number of folks work less than 60 hours and a small number put in the bare minimum to get by. This last group is not only unproductive but also can be highly demoralizing to everyone else.

Location — It is important to work in the office because physically being together is far more effective for communication than gvc etc. And, therefore you need to be physically colocated with others working on the same thing. We need to minimize reporting lines across countries, cities, and buildings. I recommend being in the office at least every week day.

Organization — We need to have clear responsibility and organization with high functioning groups with shared management and technology leadership.

Simplicity — Lets use simple solutions where we can. Eg if prompting works, just do that, don’t posttrain a separate model. No unnecessary technical complexities (such as lora). Ideally we will truly have one recipe and one model which can simply be prompted for different uses.

Excellence — whether it’s an eval or a data source or a dashboard or a message in an internal UI, please make sure they all work and all are good.

rigid widget
# keen beacon they didnt even release 2.0 pro stable

because thanks to AI Studio, they want to create models that are constantly being tested with new data and are always getting better. They don't want to offer something as "stable" without doing something really big.

silk haven
# silk haven Sergey Brin full note: “It has been 2 years of the Gemini program and GDM. We h...

Speed — we need our products, models, internal tools to be fast. Can’t wait 20 minutes to run a bit of python on borg.

Iterate at small scale — we need lots of ideas that we can test quickly. The best way to do this is small scale experiments until you can ramp up and hopefully see increasing advantage at scale. This is an excellent validation. Working too much at just large scale has a habit of minor tweaking and overfitting to evals, checkpoint sniping, etc. We need real wins that scale.

No punting — we can’t keep building nanny products. Our products are overrun with filters and punts of various kinds. We need capable products and [to] trust our users.”

north vale
silk haven
rigid widget
#

Folks, please resist the hype and be patient. Real-world tests are consistently the most crucial.

keen beacon
#

well people have been testing nebula here for a while and its been good

#

^

#

although it could be possible phantom is 2.5 pro exp and nebula is something else, or vice versa

pure nova
#

I cant believe they release nebula already

sick mountain
#

how long has it been in lmarena?

keen beacon
#

GDM employees have been hinting it nonstop for the last day or two

pure nova
#

Its in aistudio

keen beacon
keen beacon
#

maybe different temperatures?

sick mountain
#

it is fake

keen beacon
#

just different revisions

brittle tiger
keen beacon
pure nova
#

Im in the uk and i have it

keen beacon
#

send a screenshot

pure nova
keen beacon
#

lol no

north vale
sick mountain
#

lmao

keen beacon
#

it's not actually called nevila

#

nebula

pure nova
#

thats what it says on my studio

keen beacon
#

right

#

🙄

#

u changed pro to nebula lol

#

with inspect element

#

its supposed to be 2.5 anyway

pure nova
#

03-25

#

yeah thats what it says

#

hmm thats odd 🤔

brittle tiger
#

I'll feel bad for my laugh emoji if you aren't BSing but that would be pretty strange

rigid widget
pure nova
#

im not sure

#

thats what its saying on my studio

#

oh its out on gemini now

brittle tiger
pure nova
#

so any announcement or any news / changes for this ?

#

when can we expect official benchmarks

sick mountain
#

probably at 11 am or 12 pm est

north vale
#

ok but fr why is polymarket not summing up to 100%

#

ik they all resolve no from a tie but a tie seems very unlikely?

#

it seems priced at 10%+ rn

pure nova
#

holy f

#

2.5 is so good wtf

#

every other ai model i asked, it just made complete ass

#

but 2.5 nailed it

sick mountain
#

prompt?

pure nova
#

it was legit the simplest html website

#

and every other model couldnt do it for its life

barren prairie
scarlet flint
#

have you tried deepseek?

#

i saw on internet today

#

that they released new model and its very good

scarlet flint
rigid widget
pure nova
#

its my own website?

scarlet flint
pure nova
scarlet flint
#

thanks

#

yeah newest deepseek model can't replicate that

#

not even close

elder rapids
#

they must be super confident

#

2.5 is crazy

scarlet flint
#

yeah

pure nova
#

try to plug it in gemini

#

compare it with me

scarlet flint
elder rapids
#

saying it because I read the previous messages

#

so

pure nova
#

this is what i got from gemini

north vale
#

if anyone wants to run a prompt and doesn't have access, i got access to 2.5 pro

elder rapids
#

I have access too

#

it's pretty good, but it's nerfed

pure nova
elder rapids
#

they're not allowing it to think long enough

#

lmao

#

it's cutting short its own CoT

scarlet flint
pure nova
#

yeah wtf

#

compared to gemini

#

its dogshit

elder rapids
#

alright what's going on tho

#

2.5 is insane

pure nova
#

wow how the hell did deepseek mess it up THAT bad

elder rapids
#

for a lot of tasks

#

but they're cutting off the cot

loud leaf
#

nebula = 2.5 pro?

elder rapids
#

I think so yeah

north vale
#

ye

#

prolly

pure nova
#

2.5 can also do geoguessr

#

its cool

#

i asked it and it got it right first attempt

scarlet flint
#

i've tested it on my JFrame design in java

#

and it upgraded it heavily

elder rapids
#

2.5 has to be SOTA

#

I'm asking it to think longer

#

and it's actually getting these puzzles right lmfao

rigid widget
elder rapids
#

flash doesn't think longer when I ask it to either

pure nova
brittle tiger
#

I want it out in AI studio. have access in gemini but ai studio better for testing when not hooked up to memories and apps like in main app

elder rapids
#

are they gonna add 2.5 to AI studio?

north vale
#

likely

rigid widget
#

left to right: deepseekv3, gemini2.0pro, claude3.7sonnet, deepseekv3-0324

keen beacon
#

wait for it to launch on ai studio

#

Gemini version is weaker

elder rapids
#

ye

#

has to be affecting the thinking length

scarlet flint
rigid widget
pure nova
keen beacon
#

4o image gen coming

pure nova
#

gemini could do that too

rigid widget
#

My English is not very good. Which poem is better?

rigid widget
scarlet flint
#

?

#

i don't have access to it since i don't have the subscription i guess

keen beacon
#

otherwise it just looks embarrassing

elder rapids
#

what if 2.5 pro is an entirely different and new model

#

it doesn't say it's thinking

gentle plinth
keen beacon
#

yeah we know

elder rapids
#

nah what I mean is

keen beacon
#

qwen 3 on thursday 👀 apparently

scarlet flint
#

more efficient

elder rapids
#

what if it's better for context

#

than all the other models

scarlet flint
scarlet flint
#

and its like internal thinking

elder rapids
#

ye

pure nova
#

its not as good as deepseek but all i said was upgrade the page

scarlet flint
#

it kept the style of original page

pure nova
#

yeah

elder rapids
#

I'm predicting it becomes the best long context reasoner

#

by a large margin

scarlet flint
keen beacon
#

hopefully we get o3 soon given it's been threatened

pure nova
#

i wonder if 2.5 pro is better than 3.7 thinking at its max both

rigid widget
rigid widget
keen beacon
#

they potentially have a year lead on native image gen though

pure nova
#

oh wait that is 2.5 right

elder rapids
#

i think it's better

#

but we'll have to wait and see for the un nerfed version

pure nova
#

how are u so sure that this is nerfed at all

rigid widget
#

Can someone who speaks English help me?

pure nova
#

lol no way
i just asked gemini 2.5
Simulate a gravity-affected ball bouncing inside a rotating square using Python, with realistic velocity, collision, and rotation-aware physics.
and it gave me a syntax error

#

insane

scarlet flint
#

i see that google has this tendency to release very good models from time to time like that experimental model that got almost to the top of leaderboard, i think it was sitting on second place, from my tests it was better than 2.0 flash they released into gemini website

elder rapids
#

either it's not nebula at all

#

or it's nerfed

#

only possible cases

keen beacon
#

just wait for the release on aistudio 🙈

elder rapids
pure nova
#

braindead?

elder rapids
rigid widget
elder rapids
#

dawg

#

if it's saying things differently, not reasoning as long as nebula

#

it's nerfed

#

this isn't skepticism or anything

#

😭

pure nova
#

holy sht every question i ask it im getting syntax error

rigid widget
pure nova
#

write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically
and
Simulate a gravity-affected ball bouncing inside a rotating square using Python, with realistic velocity, collision, and rotation-aware physics.

keen beacon
#

the gemini product models suck

scarlet flint
elder rapids
keen beacon
#

the aistudio release will be good

scarlet flint
scarlet flint
#

it loooks like

#

animation

#

or something XD

#

always the same

sick mountain
pure nova
#

ok yeah wtf

#

deepseek one works instantly

scarlet flint
pure nova
#

ok nvm

scarlet flint
#

keep watching

pure nova
#

the ball fell through the square

scarlet flint
#

its glitching and also play it again

#

its 1:1

#

the same simulation

#

the same path

pure nova
#

but at least it didnt get syntax errors like gemini

scarlet flint
#

the same collision hit

loud leaf
keen beacon
#

in a few hours

scarlet flint
#

like the animation was hard coded but it wasn't

scarlet flint
#

i wonder if deepseek r2 will be one of the best if not the best thinking model or if it will be total trash and will fail every expectation

pure nova
#

it was literally vertices =#comment

#

which doesnt make sense

sick mountain
#

ive gotten that same thing with 2.0 ft before

rigid widget
elder rapids
#

seems to have an aptitude for coding

#

2.0 pro does seem better still tbh

#

3.7 sonnet and 2.0 pro are visibly better than the other non thinking models

#

no one talks about it for some reason

keen beacon
#

it's literally an entire book

#

when will they understand how badly this impacts performance

calm sequoia
#

Poll: Nebula in general leaderboard. Which place after update?
Winning answer: 1 🤩 (13 of 17 votes)

keen beacon
#

it seems theres a lot more to that 🤣

lime coral
#

Legit or not we will see

keen beacon
#

the sentence doesn't even really make sense

#

that guy is a weirdo

wintry tinsel
#

I have a question about the new deep seek V3 is the api for V3 updated or do I need a new checkpoint

#

As in a new API

scarlet flint
#

i mean can't they like train model to behave in some way

#

instead of feeding it with system prompt that breaks performance by like 60%

scarlet flint
torn mantle
#

@keen beacon what am i reading here? Did they mess with the model again? Does it feel different than what we got on lmarena?

keen beacon
#

it's out!!

keen beacon
#

omg

silk haven
#

🚀🚀🚀

torn mantle
keen beacon
#

this version is NOT a thinking model, that appears to be still on its way on studio

brittle tiger
#

Live in AI STUDIO

pure nova
#

why is ai studio somehow better than normal site

pure nova
#

i dont understand what they do to it

keen beacon
#

its thinking for me

#

oh it's a hybrid

#

interesting

keen beacon
pure nova
#

wtf the google ai studio got the hexagon wrong

#

insane

#

it does it fine for like 10 seconds then it falls through

thorny drum
#

wow sundar tweeting nebula

#

discord hype goes far

brittle tiger
#

glad it launched with 1m context

#

def better than gemini app version for some reason. arc-agi test i used that failed gemini app over and over one shotted it in ai studio like it did in arena as nebula

keen beacon
north vale
#

wild

barren prairie
#

Is 2.5 pro confirmed to be nebula? Or maybe specter? And nebula still hasn't come?

north vale
sick mountain
#

wow

north vale
#

that's nuts

keen beacon
keen beacon
#

theres never gonna be official confirmation

lime coral
#

Can as well be flash 2.5 or flash in the app or whatever

sick mountain
#

sometimes there is

keen beacon
#

on unreleased variants that never get released?

sick mountain
#

oh i mean if it is released

keen beacon
#

while there is some for them being the same

silk haven
lime coral
#

None in either case. All I see is « their way of answering looks the same ». Not enough

keen beacon
silk haven
brittle tiger
#

1443 dam

pure nova
#

can we see the questions that it gets asked though

#

it would be nice to see what it got wrong/right etc

keen beacon
silk haven
pure nova
#

why cant we see the claude 3.7 one for code stuff lol

#

he hid it?

keen beacon
#

GOOGLE IS SO BACK

pure nova
#

so back

keen beacon
#

huh

pure nova
#

those benchmarks are insane

#

gemini is like 5x better

#

apart from coding somehow

silk haven
#

2.5 free, OpenAI is cooked

pure nova
#

idk if its forever tho

#

how is o3 mini high still better at code

#

insane

keen beacon
#

o3 mini is based on 4o mini too lol

lime coral
red sluice
#

Nebula (Gemini 2.0 Pro Thinking) is really inconsistent. It can produce never seen before results for an LLM, but sometimes it totally cracks up & produces language that is too familiar for a professional result... I'm not so sure, it's certainly a very good model, and no wonder it destroys chatgpt already. It also struggles a lot with formatting on elaborated prompts, o3-mini is (unfortunately) better for my usage.

lime coral
pure nova
gentle plinth
red sluice
lime coral
red sluice
gentle plinth
#

thats why its called experimental

keen beacon
#

google's experimental models are free its not a new thing

sick mountain
pure nova
#

lol , this week is ai week, 3 new models will come out and beat gemini 2.5 mark my words

keen beacon
#

openai should drop o3

#

they just got smoked

cloud meadow
lime coral
willow grail
#

gemini 2.5 pro destroys chatgpt o3-mini-high

hmmmmmmm i hope its just a benchmark..... thingy....
if its gooder than o3 high then... omg
dont give me hopes

pure nova
#

c 3.7 still remains undefeated for web

keen beacon
lime coral
keen beacon
#

^

cloud meadow
pure nova
lime coral
#

Would not validate anything on lmsys until way more people try the model

#

Confidence interval +-15 pts for gemini and +-10 pts for Claude.

cloud meadow
#

Claude didn't do anything too crazy with 3.7. Soon enough R2 will drop. I can't wait.

keen beacon
#

yall are sleeping on qwen

gentle plinth
lime coral
#

Anyway 2k votes is not enough. Same thing for the global ranking. Every new model is #1 because of this

north vale
#

2k is enough

#

every new model is not #1

lime coral
#

Most of the big labs' models are #1

#

Where is 4.5 now?

north vale
silk haven
north vale
#

would you predict 2.5 pro gets off of #1?

red sluice
#

Weird that on multi turn, it's not being that dominant so far.

keen beacon
#

rl is usually done only in single turn

cloud meadow
#

Where is Samuel Altman's moat?

#

Evaporated

#

Meta needs to release soon

keen beacon
lime coral
#

No more hype for hypeman

red sluice
#

OpenAI is cooked if they don't have an unreleased model to quickly drop asap lol

keen beacon
#

they have o3 they're just stalling on it

red sluice
#

Doesn't need to be o3 or 5, just needs to be something that can keep up

hardy pecan
lime coral
#

Full stats

hardy pecan
#

oh dear...

keen beacon
#

oh wow

#

that simpleqa score

#

is bonkers

#

gemini has always been great at simpleqa

#

but it appears with 2.5

#

they literally

#

leapt for almost every benchmark

#

they have been cooking

north vale
#

wish they'd shown their prev top model in the benchmark table

red sluice
#

Is o3 that good? Is there any decent benchmark available somewhere? Couldn't test it, too expensive 💀

north vale
#

to compare the improvement

cloud meadow
#

Imagine Llama 4 drops and it overtakes 2.5 lmfao

#

Unlikely to happen but it would be crazy

north vale
keen beacon
#

have u gone on the arena?

north vale
#

so maybe like not that good in single completions

keen beacon
barren prairie
keen beacon
#

based on the meta model spam

cloud meadow
#

Still holding out hope

#

Trust the Zucc

keen beacon
#

i doubt they will be able to beat qwen 3 even when releasing one month later

cloud meadow
#

What sizes do you think they will release?

keen beacon
#

confirmed sizes are 8b and 15b moe for now. i would expect a successor to the 32b model, maybe moe

silk haven
#

I’m testing 2.5 Pro on the Gemini app and the experience is better than in AI Studio, the integration with Google Search and YouTube is insane

lime coral
#

With the new DeepSeek v3 llama got postponed for at least six more months

keen beacon
red sluice
lime coral
keen beacon
keen beacon
lime coral
#

Depends on when Google drops the experimental

#

Or will they ship gemini ultra 3 instead

keen beacon
#

i think the qwen team are the dark horse in this, but i dont think they will outright be sota

red sluice
#

Is there a simple way to see against which models an LLM loses the most on average?

silk haven
elder rapids
elder rapids
#

lmaooo

keen beacon
#

i said that here first

elder rapids
#

quiet buddy

#

I'm taking it

keen beacon
#

shhhh

rigid widget
elder rapids
#

I was gonna analogize 1.0 → 1.5 distinction tho

#

and then the evolution of 1.5 → 2.0

#

since they probably started completely from scratch on each one

#

so if it doesn't signify thinking, it's probably inherently a pure thinking model

#

that was my thought process

red sluice
#

🤔 What's that

silk haven
north vale
#

do we know api cost for 2.5 pro

keen beacon
lime coral
#

Asking about native image. Native audio is a myth now

lime coral
keen beacon
#

4o avm is native audio tho if ur talking bout that

lime coral
#

Speaking about gemini. They teased it for gemini 2.0 and then ~

keen beacon
#

oh

lime coral
#

Crazy demo

elder rapids
#

also just tried 2.5 pro on AI studio

#

and it's clearly different from the product

#

ngl I don't even know where to start, I was like an hour late on discovering 2.5 pro

torn mantle
#

still havent tried it

#

but i said they tend to nerf their model

keen beacon
#

That simpleqa score is crazy

torn mantle
#

maybe its not nebula

#

specter was google model right?

keen beacon
#

Gpt 4.5 is much much larger and 2.5 pro is somewhat competitive

keen beacon
torn mantle
#

did you notice any difference?

lime coral
olive mesa
#

just saw this wow

#

i dont know its context length

#

im guessing 2 million or 3-4 million since its 2.5

north vale
#

not 3-4 million

olive mesa
#

oh only 1m

#

well i mean thats still huge but i guess since it's experimental rn

lime coral
#

It will increase with time

olive mesa
#

yeah

lime coral
#

OG 1.5 was first released with 128k even though they teased the 1-2M

elder rapids
#

it seems to brute force the 30 hares 20 wolfs thing

#

and it gets it correctly

keen beacon
#

um tf it has a cut off of january 2025?? wtf the turn around is insane

elder rapids
#

did Google find the secret sauce 😭

keen beacon
#

the gemini 2-2.5 timeline is absolutely insane

keen ferry
#

opinions on gemini 2.5 pro?

keen beacon
keen ferry
#

was it worth the wait

keen beacon
#

what a joke

keen beacon
keen beacon
elder rapids
keen beacon
#

i thought the gemini 2 timelines were short but this is CRAZY

elder rapids
#

this really is crazy

#

vibes + insane reasoning

torn mantle
#

are we sure this is the same model?

brittle tiger
elder rapids
#

told you guys you werent glazing it enough lmao

keen beacon
lime coral
#

Gpt4o also failed the hands

keen beacon
#

embarrassing

olive mesa
rigid widget
#

wow 😲

olive mesa
#

it gives me a different vibe

elder rapids
#

ong

torn mantle
olive mesa
#

yeah

#

i like how it writes code now

#

it just overall looks better

elder rapids
#

Claude and Google's vibe switched

#

talking about the models

silk haven
#

2.5 is a Breakthrough

elder rapids
#

3.7 became more robotic, 2.5 pro is so creative

elder rapids
oblique flint
#

Damn I was wrong about 2.5 pro lol, it's actually better at coding than I initially anticipated. Would be great if it's cheaper than claude too

elder rapids
#

look at that long context lmao

torn mantle
#

the model is so good

olive mesa
#

fr

pure nova
#

coding though?

torn mantle
pure nova
#

anyone tried c/c++ ?

torn mantle
pure nova
#

i dont think any ai is fit for c/c++ right now tbh to build an actual decent project

#

its getting somewhat closer but its still lacking a lot

elder rapids
#

2.5 pro brute forcing webdev is crazy

silk haven
#

2.5 series… not only 2.5 pro
When 2.5 flash? Maybe phantom?

olive mesa
#

#1 on lmarena jeez

#

anybody know what it gets on arc-agi-2?

elder rapids
#

probably similar to sonnet 3.7 thinking

olive mesa
#

yeah its close to 64k

torn mantle
olive mesa
#

so far ive only seen it compared to 16k and 32k and it's a lot better

lime coral
#

Arc agi is useless

#

I prefer Humanity's Last Exam

#

At least you solve practical things

oblique flint
#

Crazy that 3.7 sonnet is still 90 points ahead of 2.5 pro in webdev arena
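
For a sense of scale, here is a rough sketch of what a 90-point gap could mean in head-to-head votes. It assumes the arena score behaves like a standard Elo rating (logistic curve, base 10, 400-point scale); the function name is made up for illustration and is not taken from any leaderboard code.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head wins for the higher-rated model.

    Assumption: ratings follow the usual Elo-style logistic curve
    (base 10, 400-point scale). The real leaderboard may differ.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    gap = 90.0  # the point gap quoted in the message above
    print(f"expected win rate at +{gap:.0f}: {expected_win_rate(gap):.3f}")
```

Under that assumption, a 90-point lead means the higher-rated model is preferred in a bit over 60% of direct comparisons, so a clear edge but not a shutout.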

elder rapids
#

it's worse in other things compared to 2.5 pro

lime coral
elder rapids
#

ye

lime coral
#

Claude was like the ultimate King there

#

3.5 only dethroned by 3.7

elder rapids
#

ong

cloud meadow
#

No other model has continuously followed my instructions this well

#

It's also picked up things better than any other model

#

Might be my new favorite

silk haven
#

2.0 pro was removed from Gemini app

silk haven
cloud meadow
#

The reasoning is interesting too

eager crater
#

today is crazy

#

we got dalle 4 and the best overall llm

cloud meadow
#

I remember just a few months ago (I think before deepseek r1) some dude talking about how everything has been boring since the finetune days and then r1 and distills dropped

#

It was pretty funny lol

rigid widget
#

They've taken user data in aistudio seriously. I tried to make them do this dozens of times, but they couldn't.

rigid widget
cloud meadow
#

Idk if I am the sole reason but I think I made it into the dataset

keen beacon
#

maybe not

cloud meadow
#

I should hope so considering the fact they own a search engine

torn mantle
#

i was trying like some niche prompts on aistudio

#

and seems like they improved on them a lot

#

way way better than grok 3 + reasoning

#

blows it out of the water

oblique flint
#

From reddit:

Just a couple of days ago I wrote this:

This is my exact experience. Long context windows are barely any use. They are vaguely helpful for "needle in a haystack" problems, not much more.

I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). More often than not, it doesn't even "see" the final poems, and it has this exact feeling of "seeing the first ones", then "skipping the middle ones", "seeing some a bit ahead" and "completely ignoring everything else".

I see very few companies tackling the issue of large context windows, and I fully believe that they are key for some significant breakthroughs with LLMs. RAG is not a good solution for many problems. Alas, we will have to keep waiting...

Having just tried this model, I can say that this is a breakthrough moment. A leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.

Finally they're starting to utilize that context window

torn mantle
#

so consistent too

thorny drum
#

that's crazy

#

imagine being in middle school rn

#

you can literally copy paste your whole book in for your book report

rigid widget
#

v3 0325 You can really move and place blocks but it laaaags a lot.

keen beacon
red sluice
#

Totally a bug, not openai's core prompt 🤔

rigid widget
#

meanwhile 2.5 pro

#

left deepseek v3 0324 (non-reasoning model) right gemini 2.5 pro (reasoning model)

cloud meadow
#

oneshot minecraft?

rigid widget
#

it created a good Hacker News clone, but does Hacker News have anything to do with hackers at all?

keen beacon
#

it has hackers in the name

#

this thing is great at web design

#

(w/ custom cursor)

cloud meadow
#

It made that?

keen beacon
#

it made the entire landing

#

that's just a part

#

it can be pretty bold.. personally i think this is cool

rigid widget
brittle tiger
keen beacon
#

fully built by gemini 2.5 pro

#

(the svg is naturally pretty bad, llms can't do word svgs just yet)

keen beacon
#

"Write full HTML, CSS and JavaScript for a very beautiful, bold, creative, sleek, polished landing page for Cosine, an AI lab", then "Make it much more beautiful, bold, creative, sleek, and polished. Do not use comments." x2

olive mesa
#

google is progressing faster than openai

rigid widget
#

what does x2 mean?

olive mesa
#

well from the public view

#

i know they have a lot better stuff they're working on rn

keen beacon
#

iteration

keen beacon
elder rapids
keen beacon
#

bro wants that validation

elder rapids
#

glaze me dawg

#

I need it

rigid widget
#

Aistudio crashed and I can't even access my other prompts.

elder rapids
#

I said ts verbatim

keen beacon
elder rapids
#

yo tell me why I'm a genius

keen beacon
keen beacon
elder rapids
#

pre 2.5 pro

#

I came here to talk about my observations with nebula

#

enuff speaking chat, lift me onto my pedestal

keen beacon
#

i didn't give it any examples but i did ask it to iterate on itself

elder rapids
#

damn

#

this is crazy

#

alright so wait

#

this implies it can reason through granularity now through 1m context

#

this is nuts

keen beacon
#

did somebody say nuts 🗣️

#

but fr

rigid widget
#

Hey, you're Google, snap out of it!

elder rapids
#

there was definitely a breakthrough

#

also, I think sometimes it still breaks, in certain CoT processes it stops putting in the numbers for a calculation, but keeps the surrounding formatting

olive mesa
#

0 shot?

ocean vortex
#

what's your prompt? Curious how this compares with gpt4.5 and grok3

pure nova
#

no way someone has already jailbroken gemini

olive mesa
pure nova
#

ye

olive mesa
#

damn

eager crater
eager crater
elder rapids
#

just gave 2.5 pro 800k tokens worth of material and it processed it faster than flash and pro, and gave extraordinary summary results, and didn't miss a single granular thing, and also gave interpretive results rather than just data points

#

Google did something

wintry tinsel
elder rapids
#

and then I said I was surprised and that its crazy it's able to do things like that over long context and it pinpointed exactly why it was different, just from the quality of its own output

pure nova
#

he's good with SE and stuff

#

so he done it many times before with claude too and everything

elder rapids
#

this model is literally 1 of 1

wintry tinsel
#

K can you give me his contact then, I know Pliny will release a jailbreak but his stuff is annoying in how it’s formatted

elder rapids
pure nova
#

like whatd u ask

elder rapids
#

the type of information?

elder rapids
#

it was just a book lol

wintry tinsel
pure nova
#

if u wanna try his dc is

#

access44

rigid widget
#

What is gpt4o-lmsys-0315a-ev3-text

#

0315? is this mistake

keen beacon
#

march 15th?

rigid widget
#

first time out

keen beacon
#

weird ass name

rigid widget
#

it can't be march 15

#

but it not good at translation

keen beacon
rigid widget
keen beacon
#

could be but i think thats an internal name lol

barren prairie
#

It looks like Sarcoptes scabiei 🤔

lime coral
rigid widget
#

who is rhea from?

#

it's very good

cedar tide
brittle tiger
north vale
#

prolly llama 4 checkpoint

torn mantle
elder rapids
#

not that good

north vale
elder rapids
#

crazy

rocky wing
#

Hi all

mint relic
#

Hello, anyone tried R2 ? Is there any place to use it ?

elder rapids
#

it isn't out

sour spindle
#

What do you actually get with paid gemini vs. just using the models in ai studio

rigid widget
rigid widget
elder rapids
#

@keen beacon do any other models get this right?

#

besides 2.5 pro

keen beacon
#

not as far as i know

elder rapids
#

do you have access to o1

#

ive been trying a ton of these puzzles and it seems like 2.5 pro is way ahead in this aspect

keen beacon
elder rapids
#

even when made into text form

#

including o3 mini and deepseek

#

they just can't get them right

willow grail
#

eu didnt get the new 4o image?

elder rapids
rigid widget
elder rapids
#

as I said

#

and the gap isn't THAT large

brittle tiger
keen beacon
#

dont they still suck even if theyre not experimental?

rigid widget
keen beacon
#

in the gemini product

elder rapids
#

ye that tends to be a good option

elder rapids
#

once they leave experimental

rigid widget
keen beacon
#

also

#

im not keeping track but i feel like i should've hit the ai studio RPD rate limit by now

#

it is nowhere to be seen

keen beacon
#

cuz i use aistudio all the time

#

random ocr? random tests conversations etc

elder rapids
#

wonder if they did the same thing deepseek did, training for specifically eq

keen beacon
sour spindle
#

Maybe they are simply bypassing the rates for the time being?

elder rapids
#

after some hours with it

#

it has moments where it resembles Claude

keen beacon
#

its just bugged

elder rapids
#

or at least, a very large, intelligent model

#

makes me wonder how Big pro is

#

if this model is below 100b that would be really crazy

keen beacon
#

it is not that would be absurd tbh

silk haven
#

Apricot-exp-v1?? Amazon model?

hazy quest
#

Finally midnight in Europe. What a day this has been lmao

elder rapids
#

isn't sonnet and 4o at least 100b

keen beacon
#

4o is estimated to be 200b, sonnet is 400b

#

pro is definitely within that range

sour spindle
elder rapids
#

damn fr?

keen beacon
#

total params. theyre all moe i think

torn mantle
#

yea this new model is on next level

elder rapids
#

I expect pro to be 120~

keen beacon
elder rapids
#

or 150

keen beacon
#

8b flash is a different model

elder rapids
keen beacon
#

its a different model

#

theres flash and flash 8b

elder rapids
#

yeah duh

#

😭

#

I said that

#

but anyways

brittle tiger
keen beacon
#

1.5 line:
flash (larger)
flash 8b
pro

2.0
flash lite (direct technical successor of flash)
flash (larger than flash lite/1.5 flash)
pro

elder rapids
#

🙏

#

anyways

#

8b flash to 200b would be wild tbh

north vale
#

yeah i think they're calling them flash and pro based on the speed and cost more than the size being comparable to 1.5 's flash and pro

#

basically flash could be 200B with 40B active params

#

and pro could be 1.3T with 150B active params

#

really uncertain but that would make sense to me
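A rough sketch of the total-vs-active arithmetic behind these guesses. The parameter counts are the speculative figures from the messages above, not known facts about any Gemini model, and the helper names and the 2-bytes-per-parameter (bf16) default are assumptions made for illustration. It only shows that weight memory scales with total parameters while the share used per token is set by the active ones.

```python
def weight_memory_gb(total_params_billions, bytes_per_param=2):
    """Approximate memory needed just to hold the weights, in GB.

    Billions of parameters times bytes per parameter gives GB
    (taking 1 GB = 1e9 bytes). bytes_per_param=2 assumes bf16/fp16.
    """
    return total_params_billions * bytes_per_param


def active_fraction(total_params_billions, active_params_billions):
    """Share of the weights that take part in computing one token."""
    return active_params_billions / total_params_billions


# Speculative sizes quoted in the chat (hypothetical, unconfirmed).
guesses = {
    "flash guess": (200, 40),
    "pro guess": (1300, 150),
}

for name, (total, active) in guesses.items():
    mem = weight_memory_gb(total)
    frac = active_fraction(total, active)
    print(f"{name}: ~{mem} GB of weights, {frac:.0%} active per token")
```

Under these assumptions an MoE model is cheap per token relative to its size, but all of its weights still have to be held somewhere to serve it.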

keen beacon
elder rapids
#

yeah but we know that's not true so it's kinda trivial

north vale
#

oh ok, i don't know it to not be true

#

how do you know

keen beacon
#

economics lol they are not increasing model size to that level anymore

#

they didnt even release 1.0 ultra access and a google employee confirmed it wasnt even close to og gpt 4 afaik

elder rapids
sick mountain
#

why not? hardware is getting better too

elder rapids
#

if models are 27b with similar performance

#

yeah that's not true lmao

#

you still need the total params to run it

north vale
#

i really don't think 27B models have similar perf

sick mountain
#

ultra did not use MoE iirc

elder rapids
#

-10b are visibly worse

north vale
#

27B models have much lower perf than gemini pro

elder rapids
#

but 30~ is just fine

north vale
#

is what i'm saying

elder rapids
#

yeah but I'm not talking about Gemini pro

north vale
#

oh ok

keen beacon
elder rapids
#

but anyways back to the point

#

I do think Google has always had special models, and the speed both perform at is crazy

#

so they can't be that big

north vale
#

I just don't see gemini 2.5 pro being within 30% of the size of gemini 1.5 pro

#

elo 140 pts apart

keen beacon
elder rapids
#

ye

north vale
#

yeah good point

elder rapids
#

I think the pros are maximum 10b params deviation

north vale
#

that'd be completely wild google dominance

elder rapids
#

but I have no idea how large they are

#

ye

keen beacon
#

the pros are still around the same size. but it is quite plausible they increased the size a little but its not a trillion parameter model

elder rapids
#

and I don't think 1.5 pro is above 200b

#

it's both faster and seemed like it had less "raw" intelligence than Claude, which was similar in time, and 4o

#

seemed to know less stuff without search as well

#

completely up to you whether you agree

#

but I do think 200b+ models tend to just feel heavier

#

so I'm inclined to believe it's at most 150b

keen beacon
#

its the same

elder rapids
#

wym?

#

it's way faster

#

especially now

keen beacon
#

2.0 pro and 1.5 pro are the same speed

elder rapids
#

dawg

keen beacon
#

i cant tell right now for 2.5 pro because there are no measurements for 2.5 pro

elder rapids
#

I'm not talking about 2.0 pro vs 1.5 pro

#

I'm talking about 1.5 pro vs 4o

#

and then equating to 2.0 pro

keen beacon
#

well its relevant because 2.5 pro is highly likely to be continued pretrained from 2.0 pro

#

the timeline seems absurd if it isnt

elder rapids
#

I don't think it's absurd at all tbh

sick mountain
#

google does have the most compute

elder rapids
#

working on both 2.0 and 2.5 at the same time is super reasonable

#

if they're going for completely different architectures

keen beacon
#

you just pretrained gemini 2.0 pro spending millions and ur gonna throw it away and rush a model from scratch in a month or two???

elder rapids
#

as they explained that it's inherently a reasoning model

keen beacon
elder rapids
keen beacon
#

yes but that was a sizable amount of time

elder rapids
#

that was also like 2 months

#

bruh

keen beacon
#

????

elder rapids
#

002 came out in like October

#

2.0 pro was in experimental in November

#

so ig one month

#

preceding 1206

#

oh yeah it was 2 months too I'm tripping

#

002 came in September

#

2.0 pro came in November

keen beacon
elder rapids
#

it was on lmsys too, everyone was talking about it

elder rapids
#

that's what I mean here

#

it's not like they're throwing away progress

#

since the "progress" is research itself

#

so they could completely ditch 2.5 tomorrow if they find another breakthrough

keen beacon
#

002 wasnt a new pretrained model it was just another tune afaik

elder rapids
#

yeah I know

#

but 2.0 is completely different

#

and then jumping from 2.0 to 2.5 within a couple months seems reasonable

#

that's how they managed going from "bard" to "ultra 1.0" and then a month later, into 1.5 pro

#

and then ditched ultra

keen beacon
elder rapids
#

they've done this more than once

keen beacon
#

so ur saying they pretrained a new 2.5 pro model from scratch, did reasoning rl, safety, etc. in 2 months??

elder rapids
#

saying the best AI compute in the world can't do ts is wild

#

safety aligning is the hardest part of that process

#

and I'm pretty sure past models would be insanely informative of that process

#

they probably wanted to get 2.0 over with, with a breakthrough

meager sun
#

🧱 🤣 👍

elder rapids
#

and then follow up with 2.5 to use it

#

the fact that 2.5 isn't actually that affected by the transformer context drop off is insane

#

it has to be different, there's no other way tbh

#

what if it's TITANS

#

that'd be crazy

#

we'll literally never know, they could have something that actually performs with titans techniques

#

etc

#

2.5 is simply different from 2.0

verbal nimbus
keen beacon
#

yeah he posted it here earlier lol

verbal nimbus
#

Gemini is indeed the best at 120K

verbal nimbus
#

It struggles a bit at 16K (typical transformer behavior)

elder rapids
verbal nimbus
#

Didn't notice V3 0324 is on there too

elder rapids
#

damn

#

eclipsed in context

verbal nimbus
elder rapids
#

ah yeah I know, but I mean

#

I'm not sure it would be so sudden

#

and THAT great

#

the other models don't seem to be affected

verbal nimbus
#

Hmm yeah that's a bit odd

elder rapids
#

that's really crazy tho tbh, 90 vs 65.6

#

2.5 pro vs 4o

keen beacon
#

2.5 pro can do 2m-10m context+, 4o total context is 128k-200k

verbal nimbus
elder rapids
#

it would be more like needle in a haystack

#

rather than actual reasoning

#

but with 2.5 pro that kinda just stopped existing as a problem

#

I tested it too

#

I don't think you guys realize how crazy this is tbh

verbal nimbus
elder rapids
#

I'm really not sure

keen beacon
elder rapids
#

2.5 pro just kinda shook me, especially testing it on lmsys with nebula

keen beacon
#

at least the base model

verbal nimbus
elder rapids
#

ye ofc

keen beacon
verbal nimbus
#

I wish there was more news on Titan/Mamba-variants

elder rapids
#

it has a unique cot too tho

keen beacon
elder rapids
#

read its reasoning process

#

it uses weird code words

#

symbols

#

etc

verbal nimbus
keen beacon
#

why mamba/etc dont actually work

elder rapids
#

wonder how this is gonna go in NotebookLM

#

I've had problems with it

#

nobody seems to care since it's trivial

#

but I think the products could be so much better

keen beacon
verbal nimbus
#

Haven't heard anything since though 🤷. Same with Titans.

keen beacon
verbal nimbus
elder rapids
#

as current thinking models

#

but probably gonna remain the base

#

I think the problems we currently have now will eventually be fixed, like better reasoning by creating a CoT

#

and then more attachments

ocean vortex
keen beacon
#

yeah this guy is wild man

elder rapids
#

I don't want to be rude, this has happened more than once too

#

but goddamn

elder rapids
#

🙏😔

rigid widget
elder rapids
# leaden palm ??

because it isn't technically the same architecture lmao you guys are confusing transformer with what we have now, which has been established as a change for a while now, as with gpt or native multimodality

#

now I'm wondering if you guys are trolling lmao

#

this is getting ridiculous

leaden palm
willow grail
#

what's the rate limit for gemini 2.5?

keen beacon
#

If there is one it's very high. I don't think aistudio is limited like the free api offering

willow grail
keen beacon
willow grail
#

have u connected an ide with 2.5?

keen beacon
#

Requests per day

#

Idk about openrouter tho

keen beacon
willow grail
willow grail
keen beacon
#

I don't use ai to code yet they suck at rust

leaden palm
#

openrouter actually gives you more limits lol

leaden palm
#

(which is weird because it's free either way)

willow grail
#

so everybody has less prompts

leaden palm
elder rapids
willow grail
elder rapids
#

since its architectural identity wasn't a primary claim, and what I said operates on its lack of relevance already, this is just a comprehension issue

leaden palm
#

you're the one saying that thinking models use different architectures

#

and don't get that r1 is just v3 RLd on thinking

elder rapids
#

and even clarified "not technically the same" so I consider what I'm saying pedantic posturing, but for rhetorical purposes

#

since the discussion is operating primarily on the CORE architecture, ie titans vs transformers and I'm explicitly stepping away from that dialectic, what do you think I'm saying

#

not only that, I even clarified why they can technically be distinguished from the base transformer architecture (ie gpt, multimodality) and since yes comprehension is an issue, you dismissed it with "pedantry" knowing that's the premise, not my rebuttal towards what they're saying

leaden palm
#

nobody asked

keen beacon
#

can someone with access to o1 pro give it this

#

the answer is permanent

#

but gemini 2.5 pro, grok 3 thinking and claude 3.7 sonnet thinking all fail

leaden palm
# keen beacon can someone with access to o1 pro give it this
Question 5

A particle P, of mass m, is attached to one end of a light elastic string of natural length 0.5 m and modulus of elasticity 2mg. The other end of the string is attached to a fixed point A on a rough horizontal surface.

P is held at a point B, where |AB|=0.5 m and given a speed of 1.4 ms⁻¹ in the direction AB.

P comes to rest at the point C.

Determine whether this position of rest is instantaneous or permanent.

heres the transcription

keen beacon
#

looks like 2.5 pro gets it with code execution
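For reference, a minimal sketch of the kind of check code execution could run on this problem. The transcription above does not state a coefficient of friction, so `mu` is left as a parameter here (an assumption, as are `g = 9.8` and the function name `rest_at_c`). Work-energy from B to C gives `0.5*v0**2 = mu*g*x + (k*g/(2*l))*x**2` for the extension `x`, where `k` is the modulus divided by `mg`. The rest is permanent if the tension at C, `k*m*g*x/l`, does not exceed limiting friction `mu*m*g`.

```python
import math


def rest_at_c(mu, v0=1.4, l=0.5, k=2.0, g=9.8):
    """Return (extension at C in metres, True if the rest is permanent).

    mu : coefficient of friction (not given in the transcription; assumed input)
    k  : modulus of elasticity divided by m*g (2 for this question)
    The mass m cancels out of every equation, so it is not needed.
    """
    # Quadratic in x from work-energy: a*x^2 + b*x + c = 0
    a = k * g / (2 * l)   # elastic energy term per unit mass
    b = mu * g            # friction work term per unit mass
    c = -0.5 * v0 ** 2    # initial kinetic energy per unit mass
    x = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)  # positive root

    tension_over_mg = k * x / l
    # Permanent rest if limiting friction can hold the tension at C.
    return x, tension_over_mg <= mu


for mu in (0.25, 0.75):
    x, permanent = rest_at_c(mu)
    kind = "permanent" if permanent else "instantaneous"
    print(f"mu={mu}: extension {x:.4f} m, rest is {kind}")
```

With these numbers the condition reduces to `mu >= sqrt(4/15)`, about 0.516, so the "permanent" answer holds only if the original question's friction coefficient is at least that large.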