#general
this seems different compared to flash thinking
yeah openai hiding the reasoning chain for which you are paying is just ridiculous
Probably modified or smth like oai one
in my personal experience flash thinking had several limitations on ood
deepseek reasoning outputs > gemini > openai
for me at least
flash thinking:
Yea for now
I always return to deepseek just to read the reasoning process tbh
Every time you find interesting stuff
I think it may be time to sub to gemini advanced
Its not a bad plan
if 2.5 is SOTA then it is for sure worth it
And also they improved the deep research
i feel like they've likely toned down the censorship recently no testing tho just vibes
That's the issue with google, they have great models, imagen 3 & veo 2, but the marketing and implementation is just not it
holy hell google moves fast now lol
I'm genuinely surprised that even people within this community are believing the "safety" garbage coming from these artificial intelligence corporations.
They can mark any requests they want as dangerous, using the excuse of security. Weapons, porn, drugs, workarounds, modifications... you can mark all of these as potentially dangerous. You can even consider giving information about yt-dlp and ffmpeg as "potentially dangerous," and even "content blockers, VPNs, userscripts" as "potentially dangerous."
It was an anomaly from the start that Google wasn't No. 1 at everything LLM and AI
im not saying i'm agreeing with it
just that safety is not the same as propaganda
Fwiw i think the 2.5 thing is prolly real now
yeah me too
ya it is
its just an unbelievable pace
they did gemini 2 sooo fast
then pivot to this that quickly
they didnt even release 2.0 pro stable
😉
please just look at elon musk
I thought that the 2.5 Flash and 2.5 Pro would be released near Google I/O in May, like last year
would make sense for this to be today. bc yea 2.5 before GA 2.0 Pro would be funny
tbf this is a good point lol
i dont think so tbh. i think theyre skipping it
Google's Sergey Brin: Google’s AI products “are overrun with filters and punts of various kinds.” -> Google’s co-founder tells AI staff to stop ‘building nanny products’
Google’s AI products “are overrun with filters and punts of various kinds.” According to Brin, Google needs to “trust our users” and “can’t keep building nanny products.”
Last month
that certainly didnt stop them from developing stuff (other than safety filters) this fast lol
Very Bullish
top 5 google naming fails of all time
cot models will probably be the default going forward
Nah
well gpt-5 won't use chain of thought if unneeded as i understand
yea but its gonna be a hybrid like sonnet 3.7 presumably
Idk if it’s hybrid and mostly doesnt use reasoning for most completions it doesnt count as cot model to me
So ig depends how u look at it
Maybe models with cot capability will be default
But asking it how are u and it reasoning about it before answering will not be default
Bc that’s useless
Sergey Brin full note:
“It has been 2 years of the Gemini program and GDM. We have come a long way in that time with many efforts we should feel very proud of. At the same time competition has accelerated immensely and the final race to AGI is afoot. I think we have all the ingredients to win this race but we are going to have to turbocharge our efforts.
Code matters most — AGI will happen with takeoff, when the AI improves itself. Probably initially it will be with a lot of human help so the most important is our code performance. Furthermore this needs to work on our own 1p code. We have to be the most efficient coder and AI scientists in the world by using our own AI.
Productivity — In my experience about 60 hours a week is the sweet spot of productivity. Some folks put in a lot more but can burn out or lose creativity. A number of folks work less than 60 hours and a small number put in the bare minimum to get by. This last group is not only unproductive but also can be highly demoralizing to everyone else.
Location — It is important to work in the office because physically being together is far more effective for communication than gvc etc. And, therefore you need to be physically colocated with others working on the same thing. We need to minimize reporting lines across countries, cities, and buildings. I recommend being in the office at least every week day.
Organization — We need to have clear responsibility and organization with high functioning groups with shared management and technology leadership.
Simplicity — Lets use simple solutions where we can. Eg if prompting works, just do that, don’t posttrain a separate model. No unnecessary technical complexities (such as lora). Ideally we will truly have one recipe and one model which can simply be prompted for different uses.
Excellence — whether it’s an eval or a data source or a dashboard or a message in an internal UI, please make sure they all work and all are good.
because thanks to AI Studio, they want to create models that are constantly being tested with new data and are always getting better. They don't want to offer something as "stable" without doing something really big.
Speed — we need our products, models, internal tools to be fast. Can’t wait 20 minutes to run a bit of python on borg.
Iterate at small scale — we need lots of ideas that we can test quickly. The best way to do this is small scale experiments until you can ramp up and hopefully see increasing advantage at scale. This is an excellent validation. Working too much at just large scale has a habit of minor tweaking and overfitting to evals, checkpoint sniping, etc. We need real wins that scale.
No punting — we can’t keep building nanny products. Our products are overrun with filters and punts of various kinds. We need capable products and [to] trust our users.“
https://x.com/testingcatalog/status/1904539290899533838?s=46 lol is this real chat
Folks, please resist the hype and be patient. Real-world tests are consistently the most crucial.
well people have been testing nebula here for a while and its been good
^
although it could be possible phantom is 2.5 pro exp and nebula is something else, or vice versa
I cant believe they release nebula already
how long has it been in lmarena?
GDM employees have been hinting it nonstop for the last day or two
Its in aistudio
~5 days
specter/phantom/nebula i think are the same
maybe different temperatures?
looks very fake
it is fake
dont think it would be noteworthy enough to split into different names i think
just different revisions
screencap?
wrong
Im in the uk and i have it
send a screenshot
lol no
Agreed
lmao
thats what it says on my studio
right
🙄
u changed pro to nebula lol
with inspect element
its supposed to be 2.5 anyway
I'll feel bad for my laugh emoji if you aren't BSing but that would be pretty strange
bruh nebula it's anon name
lfg
so any announcement or any news / changes for this ?
when can we expect official benchmarks
probably at 11 am or 12 pm est
ok but fr why is polymarket not summing up to 100%
ik they all resolve no from a tie but a tie seems very unlikely?
it seems priced at 10%+ rn
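A quick way to sanity-check the "not summing up to 100%" observation: add up the YES prices and see how much is left over. This is a minimal sketch with made-up prices (the outcome names and numbers are assumptions, not real Polymarket quotes), and it ignores fees and bid/ask spread, which also push the sum away from 1.

```python
# Hypothetical YES prices for a "which lab has the top model" market.
# These values are invented for illustration only.
yes_prices = {"Google": 0.55, "OpenAI": 0.25, "xAI": 0.10}


def implied_leftover(prices):
    """Return the probability mass not covered by any listed outcome."""
    return 1.0 - sum(prices.values())


leftover = implied_leftover(yes_prices)
print(f"listed outcomes sum to {sum(yes_prices.values()):.2f}")
print(f"leftover mass (tie or unlisted outcome): {leftover:.2f}")
```

If every listed outcome resolves NO on a tie, the leftover is roughly what the market is charging for that case plus any mispricing.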
holy f
2.5 is so good wtf
every other ai model i asked, it just made complete ass
but 2.5 nailed it
prompt?
it was legit the simplest html website
and every other model couldnt do it for its life
I don't have it on my google ai studio 🥲🥲🥲🥲🥲😥😥😥
have you tried deepseek?
i saw on internet today
that they released new model and its very good
me too
of course, and very soon i'll release my comparison across a lot of tasks
its my own website?
can you share the image? i wanted to try it on other models?
yeah
Gemini did it perfectly pretty much
try to plug it in gemini
compare it with me
i don't have the pro model
show me what deepseek gave to you
this is what i got from gemini
if anyone wants to run a prompt and doesn't have access, i got access to 2.5 pro
how
wow how the hell did deepseek mess it up THAT bad
nebula = 2.5 pro?
I think so yeah
i think its bad to replicate images
i've tested it on my JFrame design in java
and it upgraded it heavily
2.5 has to be SOTA
I'm asking it to think longer
and it's actually getting these puzzles right lmfao
joke bro
flash doesn't think longer when I ask it to either
ask it to think for a certain amt of time eg 30sec
I want it out in AI studio. have access in gemini but ai studio better for testing when not hooked up to memories and apps like in main app
are they gonna add 2.5 to AI studio?
likely
left to right: deepseekv3, gemini2.0pro, claude3.7sonnet, deepseekv3-0324
system prompt related, this has always been an issue with models on gemini
wait for it to launch on ai studio
Gemini version is weaker
i asked it to upgrade the design
By the way, Gemini didn't do that thing we talked about earlier, they've clearly restricted it.
yes but its not about that its about copying the image
4o image gen coming
gemini could do that too
My English is not very good. Which poem is better?
Dall-e 3 is totally garbage let's see
yeah i know, can you show me version of your page upgraded by the gemini 2.5
?
i don't have access to it since i don't have the subscription i guess
my theory is that it's upgraded compared to the initial preview of 4o image gen
otherwise it just looks embarrassing
what if 2.5 pro is an entirely different and new model
it doesn't say it's thinking
yeah we know
nah what I mean is
qwen 3 on thursday 👀 apparently
there was this new thinking method for models
more efficient
draft something
and its like internal thinking
ye
its not as good as deepseek but all i said was upgrade the page
yeah its nice
it kept the style of original page
yeah
chain-of-draft
great week for acceleration
hopefully we get o3 soon given it's been threatened
i wonder if 2.5 pro is better than 3.7 thinking at its max both
As long as there is that "security" nonsense I mentioned before, they can never compete with others in image generation.
can anyone help?
they potentially have a year lead on native image gen though
whats the prompt for 2.5 to do this
oh wait that is 2.5 right
from my experiments with nebula
i think it's better
but we'll have to wait and see for the un nerfed version
how are u so sure that this is nerfed at all
finally gpt4o releases today
Can someone who speaks English help me?
lol no way
i just asked gemini 2.5
Simulate a gravity-affected ball bouncing inside a rotating square using Python, with realistic velocity, collision, and rotation-aware physics.
and it gave me a syntax error
insane
i see that google has this tendency to release very good models from time to time like that experimental model that got almost to the top of leaderboard, i think it was sitting on second place, from my tests it was better than 2.0 flash they released into gemini website
because it's worse than nebula by a ton and doesn't think long enough
either it's not nebula at all
or it's nerfed
only possible cases
just wait for the release on aistudio 🙈
braindead?
ong
bro this is the unnerfed version, we will see the nerfed version in the future.
dawg
if it's saying things differently, not reasoning as long as nebula
it's nerfed
this isn't skepticism or anything
😭
holy sht every question i ask it im getting syntax error
did i get shadow banned? Why doesn't anyone see what I post?
write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically
and
Simulate a gravity-affected ball bouncing inside a rotating square using Python, with realistic velocity, collision, and rotation-aware physics.
the gemini product models suck
apparently it's unstable idk
the aistudio release will be good
i used your prompt in deepseek
this is what he gave me
it loooks like
animation
or something XD
always the same
might be a tokenization issue ive seen other gemini models miss empty brackets a lot, also see https://www.reddit.com/r/Bard/comments/1jjmta6/gemini_25_cannot_write/
not really
ok nvm
keep watching
the ball fell through the square
its glitching and also play it again
its 1:1
the same simulation
the same path
but at least it didnt get syntax errors like gemini
the same collision hit
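For reference, the "ball fell through the square" failure is usually a tunneling problem: the ball moves past the wall in one step and nothing pushes it back. Below is a rough headless sketch of the square prompt, not the code any of the models produced. All parameter values (step size, square size, spin rate, restitution) are assumptions. It works in the square's rotating frame, clamps the ball back inside on contact, and reflects the velocity relative to the moving wall.

```python
import math


def to_local(x, y, theta):
    """Rotate a world-space vector into the square's own frame."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x + s * y, -s * x + c * y


def to_world(lx, ly, theta):
    """Rotate a square-frame vector back into world space."""
    c, s = math.cos(theta), math.sin(theta)
    return c * lx - s * ly, s * lx + c * ly


def simulate(steps=2400, dt=1 / 240, half=1.0, radius=0.1,
             omega=1.0, gravity=9.81, restitution=0.9):
    """Return a list of (x, y, theta) samples, one per step."""
    x, y, vx, vy, theta = 0.0, 0.0, 1.5, 0.0, 0.0
    limit = half - radius  # farthest the ball centre may sit from the middle
    path = []
    for _ in range(steps):
        # Gravity, then position and square angle.
        vy -= gravity * dt
        x += vx * dt
        y += vy * dt
        theta += omega * dt

        lx, ly = to_local(x, y, theta)
        hit_x, hit_y = abs(lx) > limit, abs(ly) > limit
        if hit_x or hit_y:
            # Clamp the centre back inside so it cannot tunnel through a wall.
            lx = max(-limit, min(limit, lx))
            ly = max(-limit, min(limit, ly))
            x, y = to_world(lx, ly, theta)
            # Velocity of the rotating wall at the contact point (omega x r).
            wall_vx, wall_vy = -omega * y, omega * x
            rvx, rvy = to_local(vx - wall_vx, vy - wall_vy, theta)
            # Reflect only the outward-moving normal component.
            if hit_x and rvx * lx > 0:
                rvx = -restitution * rvx
            if hit_y and rvy * ly > 0:
                rvy = -restitution * rvy
            bx, by = to_world(rvx, rvy, theta)
            vx, vy = bx + wall_vx, by + wall_vy
        path.append((x, y, theta))
    return path


path = simulate()
print(f"simulated {len(path)} steps, final position {path[-1][:2]}")
```

The clamp is the part that matters for the bug discussed here: without it, a fast ball or a large time step can leave the ball outside the walls with no collision ever registered.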
how long's the typical lag to add to aistudio? longer thinking time will help
in a few hours
like the animation was hard coded but it wasn't
hours
yeah
@pure nova
i wonder if deepseek r2 will be one of the best if not the best thinking model or if it will be total trash and will fail every expectation
yeah this is probably why , cause it was giving a syntax error of an empty variable
it was literally vertices =#comment
which doesnt make sense
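To see why `vertices =#comment` fails: everything after `#` is a comment, so Python sees an assignment with no right-hand side. A small sketch using the standard `ast` module to catch this before running generated code (the helper name `first_syntax_error` and the sample strings are made up for illustration):

```python
import ast


def first_syntax_error(source):
    """Return the line number of the first syntax error, or None if it parses."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return err.lineno
    return None


# The assignment has no value because the rest of the line is a comment.
broken = "vertices =# list of corner points\nprint(vertices)\n"
# Same line with the empty brackets the model appears to have dropped.
fixed = "vertices = []  # list of corner points\nprint(vertices)\n"

print(first_syntax_error(broken))
print(first_syntax_error(fixed))
```

That fits the tokenization theory mentioned above, where empty brackets like `[]` go missing from the output.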
ive gotten that same thing with 2.0 ft before
Look at Deepseek V3 0324 and you will understand
from my testing it's alright
seems to have an aptitude for coding
2.0 pro does seem better still tbh
3.7 sonnet and 2.0 pro are visibly better than the other non thinking models
no one talks about it for some reason
yeah i just found the sysprompt
it's literally an entire book
when will they understand how badly this impacts performance
Nebula in general leaderboard. Which place after update?
13
17
1
1
🤩
wtf
it seems theres a lot more to that 🤣
lol no he's not a Google employee
the sentence doesn't even really make sense
that guy is a weirdo
The information he has is reliable but he stole it from somewhere else
I have a question about the new deep seek V3 is the api for V3 updated or do I need a new checkpoint
As in a new API
i mean can't they like train model to behave in some way
instead of feeding it with system prompt that breaks performance by like 60%
i think its just "deepseek-chat"
@keen beacon what am i reading here? Did they mess with the model again? Does it feel different than what we got on lmarena?
it's out!!
nah it's just the Gemini system prompt sucking
omg
finally
this version is NOT a thinking model, that appears to be still on its way on studio
LIve in AI STUDIO
why is ai studio somehow better than normal site
???
i dont understand what they do to it
its thinking for everything for me
wtf the google ai studio got the hexagon wrong
insane
it does it fine for like 10 seconds then it falls through
glad it launched with 1m context
def better than gemini app version for some reason. arc-agi test i used that failed gemini app over and over one shotted it in ai studio like it did in arena as nebula
yeah it was just bugged
wild
Is 2.5 Pro confirmed to be nebula? Or maybe specter, and nebula still hasn't come?
wow
that's nuts
specter nebula phantom are the same just different revisions afaik
No confirmation
theres never gonna be official confirmation
Can as well be flash 2.5 or flash in the app or whatever
sometimes there is
on unreleased variants that never get released?
oh i mean if it is released
ok but theres no evidence for that
while there is some for them being the same
None in both case. All I see is « their way to answer looks the same ». Not enough
what? did u account for when they arrived in the arena? anyway i feel like the community here is extremely good at this
whats ur track record btw?
1443 dam
can we see the questions that it gets asked though
it would be nice to see what it got wrong/right etc
GOOGLE IS SO BACK
huh
2.5 free, OpenAI is cooked
o3 mini is based on 4o mini too lol
Are you blind haha
Nebula (Gemini 2.0 Pro Thinking) is really inconsistent. It can produce never seen before results for an LLM, but sometimes it totally cracks up & produces language that is too familiar for a professional result... I'm not so sure, it's certainly a very good model, and no wonder it destroys chatgpt already. It also struggles a lot with formatting on elaborated prompts, o3-mini is (unfortunately) better for my usage.
maybe try a lower temperature?
At this point I'm not even sure Google will dare to release it in this version
lol it’s always strange with new reveal. Wait 1 week. Also why they label « experimental » and serve it free on the api
The issue is that temperature is already something normies don't use.
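For anyone unsure what lowering the temperature does: the logits are divided by the temperature before softmax, so a lower value concentrates probability on the top token and makes sampling less erratic. A minimal sketch with made-up logits (the function name and numbers are assumptions, not how any specific provider implements it):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```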
thats why its called experimental
google's experimental models are free its not a new thing
tested under code name nebula: https://x.com/lmarena_ai/status/1904581128746656099
lol , this week is ai week, 3 new models will come out and beat gemini 2.5 mark my words
woah
Like the instability
gemini 2.5 pro destroys chatgpt o3-mini-high
hmmmmmmm i hope its just a benchmark..... thingy....
if its gooder than o3 high then... omg
dont give me hopes
c 3.7 still remains undefeated for web
lol that jump is insane tho
The confidence is too large right now
^
For now
meaning?
Would not validate anything on lmsys until way more people try the model
Confidence interval +-15 pts for gemini and +-10 pts for Claude.
Claude didn't do anything too crazy with 3.7. Soon enough R2 will drop. I can't wait.
yall are sleeping on qwen
- with 95% certainty
Anyway 2k votes are not enough. Same thing for the global ranking. Every new model is #1 because of this
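A back-of-the-envelope check on why a couple of thousand votes leaves a wide interval. This is a simplified sketch, not how the arena actually computes its intervals: it assumes independent head-to-head votes against one evenly matched opponent, a normal approximation for the win rate, and the standard Elo expected-score formula. The function names are made up.

```python
import math


def elo_gap(win_rate):
    """Elo difference implied by an expected head-to-head win rate."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def elo_ci_half_width(votes, win_rate=0.5, z=1.96):
    """Approximate 95% interval half-width in Elo points for a given vote count."""
    std_err = math.sqrt(win_rate * (1.0 - win_rate) / votes)
    upper = win_rate + z * std_err
    if upper >= 1.0:
        raise ValueError("too few votes for the normal approximation")
    return elo_gap(upper) - elo_gap(win_rate)


for n in (500, 2000, 10000):
    print(n, round(elo_ci_half_width(n), 1))
```

Under these assumptions, 2,000 votes gives a half-width in the same ballpark as the ±15 quoted above, and the width shrinks only with the square root of the vote count.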
doubtful
They don't have enough GPUs for that
would you predict 2.5 pro gets off of #1?
Weird that on multi turn, it's not being that dominant so far.
rl is usually done only in single turn
o3 mini which is based on 4o mini being competitive with a much larger model 🤔
If they are making it somehow part of gpt5 it’s maybe better for them like that. Imagine they release o3 and then gpt5 has like 0.1% better reasoning stats
No more hype for hypeman
OpenAI is cooked if they don't have an unreleased model to quickly drop asap lol
they have o3 they're just stalling on it
Doesn't need to be o3 or 5, just needs to be something that can keep up
They have a stream in 40 mins... to show off image editing most likely lol
Full stats
oh dear...
oh wow
that simpleqa score
is bonkers
gemini has always been great at simpleqa
but it appears with 2.5
they literally
leapt for almost every benchmark
they have been cooking
wish they'd shown their prev top model in the benchmark table
Is o3 that good? Is there any decent benchmark available somewhere? Couldn't test it, too expensive 💀
to compare the improvement
Imagine Llama 4 drops and it overtakes 2.5 lmfao
Unlikely to happen but it would be crazy
seems likely that it's pretty good at using more test-time compute
have u gone on the arena?
so maybe like not that good in single completions
llama 4 seems to be disappointing
Impossible
based on the meta model spam
If it's anything like those new anonymous models, yeah true.
Still holding out hope
Trust the Zucc
i doubt they will be able to beat qwen 3 even when releasing one month later
What sizes do you think they will release?
confirmed sizes are 8b and 15b moe for now. i would expect a successor to the 32b model, maybe moe
I’m testing 2.5 Pro on the Gemini app and the experience is better than in AI Studio, the integration with Google Search and YouTube is insane
With the new DeepSeek v3 llama got postponed for at least six more months
no way they dont announce llama 4 at their first llamacon
I know. It was a joke lol
depends on when oai launch o3
but yeah even with it being delayed 6 months i dont think meta will be able to beat deepseek 🤣
Depends on when Google drop the experimental
Or will they ship gemini ultra 3 instead
i think the qwen team are the dark horse in this, but i dont think they will outright be sota
Is there a simple way to see against which models an LLM loses the most on average?
Introducing Gemini 2.5 Pro Experimental.
The 2.5 series marks a significant evolution: Gemini models are now fundamentally thinking models.
This means the model reasons before responding, to maximize accuracy -- and it’s our best Gemini model yet.
Blog -
predicted the long context leap
no way I predicted this too
lmaooo
i said that here first
yes it's updated both api and app
ig good predictions tho
I was gonna analogize 1.0 → 1.5 distinction tho
and then the evolution of 1.5 → 2.0
since they probably started completely from scratch on each one
so if it doesn't signify thinking, it's probably inherently a pure thinking model
that was my thought process
🤔 What's that
Join the team behind Gemini 2.5 as they dive into the model’s thinking and coding advancements.
🎙️Space starts at 12:20pm PT. Drop your questions below.
https://t.co/wBOHiC0n9k
do we know api cost for 2.5 pro
lmao what
Asking about native image. Native audio is a myth now
0 for now. Still experimental
4o avm is native audio tho if ur talking bout that
Speaking about gemini. They teased it for gemini 2.0 and then ~
oh
Gemini 2.0 introduces multilingual native audio output. Watch this demo to see how this new capability can help developers build multimodal AI agents. These new output modalities are available to early testers, with wider rollout expected next year. Start building with Gemini 2.0 at aistudio.google.com.
Learn more about Gemini 2.0 → https://...
Crazy demo
also just tried 2.5 pro on AI studio
and it's clearly different from the product
ngl I don't even know where to start, I was like an hour late on discovering 2.5 pro
That simpleqa score is crazy
Gpt 4.5 is much much larger and 2.5 pro is somewhat competitive
They said it was nebula
just saw this wow
i dont know its context length
im guessing 2 million or 3-4 million since its 2.5
not 3-4 million
It will increase with time
yeah
OG 1.5 was first released with 128k even though they teased the 1-2M
um tf it has a cut off of january 2025?? wtf the turn around is insane
did Google find the secret sauce 😭
the gemini 2-2.5 timeline is absolutely insane
opinion's on gemini 2.5 pro?
im not sure if its correct. ill have to see if it knows events after june 2024
was it worth the wait
meanwhile gpt-4.5:
what a joke
yes
bro they continue pretrained the model/did all this stuff in like a month/two 🤣 if thats actually true
didn't even have to wait lmao
i thought the gemini 2 timelines were short but this is CRAZY
are we sure this is the same model?
gemini image editing looks better than openai feature coming out today
told you guys you werent glazing it enough lmao
yes, this is nebula
Gpt4o also failed the hands
embarrassing
the wait wasnt even long lmao
wow 😲
it gives me a different vibe
ong
its different
2.5 is a Breakthrough
3.7 became more robotic, 2.5 pro is so creative
has to be
Damn I was wrong about 2.5 pro lol, it's actually better at coding than I initially anticipated. Would be great if it's cheaper than claude too
look at that long context lmao
the model is so good
fr
coding though?
good so far
anyone tried c/c++ ?
yea
i dont think any ai is fit for c/c++ right now tbh to build an actual decent project
its getting somewhat closer but its still lacking a lot
2.5 pro brute forcing webdev is crazy
probably similar to sonnet 3.7 thinking
yeah its close to 64k
not yet
so far ive only seen it compared to 16k and 32k and it's a lot better
Crazy that 3.7 sonnet is still 90 points ahead of 2.5 pro in webdev arena
p sure Claude is made for these kinds of tasks specifically
it's worse in other things compared to 2.5 pro
What is crazy is Google jump
ye
ong
This new Google Gemini 2.5 model is insane
No other model has continuously followed my instructions this well
It's also picked up things better than any other model
Might be my new favorite
2.0 pro was removed from Gemini app
Rip
The reasoning is interesting too
I remember just a few months ago (I think before deepseek r1) some dude talking about how everything has been boring since the finetune days and then r1 and distills dropped
It was pretty funny lol
They've taken user data in aistudio seriously. I tried to make them do this dozens of times, but they couldn't.
Even though most of you aren't aware of it yet, the best non-reasoning model also
You are definitely right. I tried feeding a list of names of java obfuscators into the model since it had no idea past something like proguard and now it lists the top 5 I've continuously asked questions about lmfao
Idk if I am the sole reason but I think I made it into the dataset
theres a huge jump in world knowledge
maybe not
I should hope so considering the fact they own a search engine
yea i noticed that
i was trying like some niche prompts on aistudio
and seems like they improved on them a lot
way way better than grok 3 + reasoning
blows it out of the water
From reddit:
Just a couple of days ago I wrote this:
This is my exact experience. Long context windows are barely any use. They are vaguely helpful for "needle in a haystack" problems, not much more.
I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). More often than not, it doesn't even "see" the final poems, and it has this exact feeling of "seeing the first ones", then "skipping the middle ones", "seeing some a bit ahead" and "completely ignoring everything else".
I see very few companies tackling the issue of large context windows, and I fully believe that they are key for some significant breakthroughs with LLMs. RAG is not a good solution for many problems. Alas, we will have to keep waiting...
Having just tried this model, I can say that this is a breakthrough moment. A leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.
Finally they're starting to utilize that context window
so consistent too
deepseek v3?
that's crazy
imagine being in middle school rn
you can literally copy paste your whole book in for your book report
v3 0325 You can really move and place blocks but it laaaags a lot.
lmao
Totally a bug, not openai's core prompt 🤔
meanwhile 2.5 pro
left deepseek v3 0324 (non-reasoning model) right gemini 2.5 pro (reasoning model)
oneshot minecraft?
it created a good Hacker News clone, but does Hacker News have anything to do with hackers at all?
It made that?
it made the entire landing
that's just a part
it can be pretty bold.. personally i think this is cool
Hahaha Security
"Gemini 2.5 Pro just zero-shotted a task o3-mini-high made no progress on after burning through millions of credits via Aider"
holy moly. there are some bugs here but i'm confident it could solve them in 1 prompt
fully built by gemini 2.5 pro
(the svg is naturally pretty bad, llms can't do word svgs just yet)
what was your prompt?
"Write full HTML, CSS and JavaScript for a very beautiful, bold, creative, sleek, polished landing page for Cosine, an AI lab", then "Make it much more beautiful, bold, creative, sleek, and polished. Do not use comments." x2
google is progressing faster than openai
what x2 mean?
woah
lol
glaze me I predicted this exactly
bro wants that validation
Aistudio crashed and I can't even access my other prompts.
I said ts verbatim
if you insist?
yo tell me why I'm a genius
yeah it did for a sec for me but it's working again now (?)
u just joined bro
pre 2.5 pro
I came here to talk about my observations with nebula
enuff speaking chat, lift me onto my pedestal
0 shot?
i didn't give it any examples but i did ask it to iterate on itself
damn
this is crazy
alright so wait
this implies it can reason through granularity now through 1m context
this is nuts
Hey, you're Google, snap out of it!
granularity has been the no. 1 problem for ages lmao
there was definitely a breakthrough
also, I think sometimes it still breaks, in certain CoT processes it stops putting in the numbers for a calculation, but keeps the surround formatting
looks great
0 shot?
what's your prompt? Curious how this compares with gpt4.5 and grok3
no way someone has already jailbroken gemini
that's 2.5 pro exp??
ye
damn
Code SVG of a detailed crab
yes
just gave 2.5 pro 800k tokens worth of material and it processed it faster than flash and pro, and gave extraordinary summary results, and didn't miss a single granular thing, and also gave interpretive results rather than just data points
Google did something
Dm me the jailbreak?
and then I said I was surprised and that its crazy it's able to do things like that over long context and it pinpointed exactly why it was different, just from the quality of its own output
cant he told me to not send it to anyone
he's good with SE and stuff
so he done it many times before with claude too and everything
this model is literally 1 of 1
K can you give me his contact then, I know Pliny will release a jailbreak but his stuff is annoying in how it’s formatted
such as?
such as what
like whatd u ask
the type of information?
he wont give it to no one lol
it was just a book lol
Finee
march 15th?
first time out
weird ass name
doesnt have to be trained just today
Shouldn't it be according to their release date?
could be but i think thats an internal name lol
It looks like Sarcoptes scabiei 🤔
Big model from meta
Damn that's wild
prolly llama 4 checkpoint
its bad
impressive
only for coding, but it's not better than the SOTA models in that either
not that good
Gemini models are now capable enough to assist with fundamental AI research!
Several theorems featured in our recent ICML submissions were co-proved with Gemini's help.
2.5 Pro is a really good model; give it a try if you haven't already :)
crazy
Hi all
Hello, anyone tried R2 ? Is there any place to use it ?
it isn't out
What do you actually get with paid gemini vs. just using the models in ai studio
Deepseek V3 vs Deepseek V3 0325 real outputs. (Claude, almost there!)
here is https://rentry.org/deepseekv3-vs-v3-0325
Unfortunately no
not as far as i know
do you have access to o1
ive been trying a ton of these puzzles and it seems like 2.5 pro is way ahead in this aspect
yeah
even when made into text form
including o3 mini and deepseek
they just can't get them right
eu didnt get the new 4o image?
gemini models really good at multimodal
yeah they tend to understand things more, but Im making them into text form
as I said
and the gap isn't THAT large
a worse experience until they leave experimental mode. no reason to use them in gemini app until they are no longer experimental. when that happens tho you get integration with search, drive, gmail, youtube, image generation, etc
dont they still suck even if theyre not experimental?
I also do not like using artificial intelligence directly in a multimodal manner because I always get worse results. So I OCR it into text first.
in the gemini product
ye that tends to be a good option
I think it's just speed and fewer errors. Otherwise, models are trained on your data whether you pay or not.
added to eqbench
also
im not keeping track but i feel like i should've hit the ai studio RPD rate limit by now
it is nowhere to be seen
i think the models are unlimited on the site but limited on the api
cuz i use aistudio all the time
random ocr? random tests conversations etc
damn that's really crazy
wonder if they did the same thing deepseek did, training for specifically eq
pretty big jump
I feel like this too. I sometimes get this message: "failed to list tuned models user has exceeded quota" but it says I am still using the model.
Maybe they are simply bypassing the rates for the time being?
you can definitely see it too, the way it speaks is pretty great
after some hours with it
it has moments where it resembles Claude
ive been getting that since yesterday i think
its just bugged
or at least, a very large, intelligent model
makes me wonder how Big pro is
if this model is below 100b that would be really crazy
it is not that would be absurd tbh
Apricot-exp-v1?? Amazon model?
Finally midnight in Europe. What a day this has been lmao
wym?
isn't sonnet and 4o at least 100b
"There are decades where nothing happens; and there are weeks where decades happen"
Could change this to days and weeks with AI development lol
damn fr?
total params. theyre all moe i think
yea this new model is on next level
8b flash to 200b would be wild tbh
I expect pro to be 120~
flash is not 8b
or 150
8b flash is a different model
flash 8b dawg
Think ppl discount how much stuff like this will drive progress as models improve
1.5 line:
flash (larger)
flash 8b
pro
2.0
flash lite (direct technical successor of flash)
flash (larger than flash lite/1.5 flash)
pro
yeah i think they're calling them flash and pro based on the speed and cost more than the size being comparable to 1.5 's flash and pro
basically flash could be 200B with 40B active params
and pro could be 1.3T with 150B active params
really uncertain but that would make sense to me
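(rough sketch of what those guesses would imply for an MoE model: every weight has to sit in memory, but only the active params cost compute per token. The 200B/40B and 1.3T/150B figures are the guesses from the messages above, not known sizes. 2 bytes per param (bf16) and ~2 FLOPs per active param per token are assumptions / rules of thumb, and the function names are made up for illustration.)

```python
def weight_memory_gb(total_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes).

    Assumes bf16/fp16 storage (2 bytes per parameter) by default.
    """
    return total_params * bytes_per_param / 1e9


def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active_params


# Guessed (total, active) parameter counts from the chat, purely hypothetical.
guesses = {"flash": (200e9, 40e9), "pro": (1.3e12, 150e9)}

for name, (total, active) in guesses.items():
    mem = weight_memory_gb(total)
    flops = forward_flops_per_token(active)
    print(f"{name}: ~{mem:.0f} GB of weights, ~{flops:.1e} FLOPs per token")
```

so speed tracks the active params, while the hardware footprint tracks the total.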
lol this is wild
yeah but we know that's not true so it's kinda trivial
economics lol they are not increasing model size to that level anymore
they didnt even release 1.0 ultra access and a google employee confirmed it wasnt even close to og gpt 4 afaik
there are traits of models + that would be heavy and unnecessary + ton of money for no reason
why not? hardware is getting better too
if models are 27b with similar performance
yeah that's not true lmao
you still need the total params to run it
i really don't think 27B models have similar perf
ultra did not use MoE iirc
wym?
-10b are visibly worse
27B models have much lower perf than gemini pro
but 30~ is just fine
is what i'm saying
yeah but I'm not talking about Gemini pro
oh ok
no they arent btw. they directly said flash lite is based on 1.5 flash size/architecture/whatever i dont recall the exact quote
but anyways back to the point
I do think Google has always had special models, and the speed both perform at is crazy
so they can't be that big
I just don't see gemini 2.5 pro being within 30% of the size of gemini 1.5 pro
elo 140 pts apart
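(for scale, a minimal sketch of what a 140 point gap means, assuming the standard Elo expected-score formula with a 400 point scale. Whether the leaderboard in question uses exactly this formula is an assumption, and the function name is made up.)

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo formula."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# A 140 point gap means the stronger model is expected to win about 69% of matchups.
print(round(elo_expected_score(140), 3))  # 0.691
```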
one was pretrained way before and wasnt even a thinking model + modern stuff
ye
yeah good point
I think the pros are maximum 10b params deviation
that'd be completely wild google dominance
the pros are still around the same size. but it is quite plausible they increased the size a little but its not a trillion parameter model
and I don't think 1.5 pro is above 200b
it's both faster and seemed like it had less "raw" intelligence than Claude, which was similar in time, and 4o
seemed to know less stuff without search as well
completely up to you whether you agree
but I do think 200b+ models tend to just feel heavier
so I'm inclined to believe it's at most 150b
no it isnt faster lol
its the same
2.0 pro and 1.5 pro are the same speed
dawg
i cant tell right now for 2.5 pro because there are no measurements for 2.5 pro
I'm not talking about 2.0 pro vs 1.5 pro
I'm talking about 1.5 pro vs 4o
and then equating to 2.0 pro
well its relevant because 2.5 pro is highly likely to be continued pretrained from 2.0 pro
the timeline seems absurd if it isnt
I don't think it's absurd at all tbh
google does have the most compute
working on both 2.0 and 2.5 at the same time is super reasonable
if they're going for completely different architectures
you just pretrained gemini 2.0 pro spending millions and ur gonna throw it away and rush a model from scratch in a month or two???
as they explained that it's inherently a reasoning model
ok, thats still on top of a base model. not relevant
ye they did that with 1.5 pro 002 to 2.0
yes but that was a sizable amount of time
????
002 came out in like October
2.0 pro was in experimental in November
so ig one month
preceding 1206
oh yeah it was 2 months too I'm tripping
002 came in September
2.0 pro came in November
they were working on gemini 2 in parallel
it was on lmsys too, everyone was talking about it
yeah I know
that's what I mean here
it's not like they're throwing away progress
since the "progress" is research itself
so they could completely ditch 2.5 tomorrow if they find another breakthrough
002 wasnt a new pretrained model it was just another tune afaik
yeah I know
but 2.0 is completely different
and then jumping from 2.0 to 2.5 within a couple months seems reasonable
that's how they managed going from "bard" to "ultra 1.0" and then a month later, into 1.5 pro
and then ditched ultra
thats like 5 months, they started work after june 2024. the supposed cut off of 2.5 pro is january 2025
that's not crazy tho
they've done this more than once
so ur saying they pretrained a new 2.5 pro model from scratch, did reasoning rl, safety, etc. in 2 months??
saying the best AI compute in the world can't do ts is wild
safety aligning is the hardest part of that process
and I'm pretty sure past models would be insanely informative of that process
they probably wanted to get 2.0 over with, with a breakthrough
🧱 🤣 👍
and then follow up with 2.5 to use it
the fact that 2.5 isn't actually that affected by the transformer context drop off is insane
it has to be different, there's no other way tbh
what if it's TITANS
that'd be crazy
we'll literally never know, they could have something that actually performs with titans techniques
etc
2.5 is simply different from 2.0
Found this interesting benchmark
yeah he posted it here earlier lol
Gemini is indeed the best at 120K
Ah
It struggles a bit at 16K (typical transformer behavior)
that might be a testing issue rather than its actual performance
Didn't notice V3 0324 is on there too
There's a reason why transformers struggle in the middle: https://www.youtube.com/watch?v=FAspMnu4Rt0
ah yeah I know, but I mean
I'm not sure it would be so sudden
and THAT great
the other models don't seem to be affected
Hmm yeah that's a bit odd
2.5 pro can do 2m-10m context+, 4o total context is 128k-200k
Sonnet is surprisingly even worse
ye, but 1.5 and 2.0 pro still struggled in granularity
it would be more like needle in a haystack
rather than actual reasoning
but with 2.5 pro that kinda just stopped existing as a problem
I tested it too
I don't think you guys realize how crazy this is tbh
That would be, but I don't think they'd start with such a big model for Titan?
I'm really not sure
its highly likely its just 2.0 pro with continued pretraining
2.5 pro just kinda shook me, especially testing it on lmsys with nebula
at least the base model
It's a thinking model too, probably RL trained?
ye ofc
yes they updated the base model then tuned it for reasoning/rl on it
I wish there was more news on Titan/Mamba-variants
it has a unique cot too tho
i dont think mamba is good
Google made two variants based on Mamba that performed better, but I haven't heard anything since.
this is a good article from what i recall https://magic.dev/blog/100m-token-context-windows
why mamba/etc dont actually work
wonder how this is gonna go in NotebookLM
I've had problems with it
nobody seems to care since it's trivial
but I think the products could be so much better
its been a while since i read this tho 🤣
I'll check it out
These were the Mamba variants Google made: https://www.reddit.com/r/MachineLearning/comments/1b3leks/deepmind_introduces_hawk_and_griffin_r/
Haven't heard anything since though 🤷. Same with Titans.
transformers keep being improved and improved tbh i dont see anything replacing it lol
Probably not anytime soon. Diffusion LLMs seemed interesting though.
ye but I guess still not technically the same architecture
as current thinking models
but probably gonna remain the base
I think the problems we currently have now will eventually be fixed, like better reasoning by creating a CoT
and then more attachments
What. Thinking models are literally your standard transformer architecture with some fine-tuning. Nothing under the hood is changed
yeah this guy is wild man
what's with the lack of reading comprehension here
I don't want to be rude, this has happened more than once too
but goddamn
good luck
because it isn't technically the same architecture lmao you guys are confusing transformer with what we have now, which has been established as a change for a while now, as with gpt or native multimodality
now I'm wondering if you guys are trolling lmao
this is getting ridiculous
what we have now, which has been established as a change for a while now
are you saying "all modern models (even llama) have tweaks and improvements over the original gpt, and gpt is a large improvement over transformers" (pedantic) or "thinking models have an architecturally different way of generating text" (incorrect, see r1)
what's the rate limit for gemini 2.5?
I've been using it a lot and haven't encountered it yet
If there is one it's very high. I don't think aistudio is limited like the free api offering
but ai studio is also free same as openrouter
Ya but u have low rpd
have u connected an ide with 2.5?
I mean on the aistudio website there aren't limits
oke so u copy paste everything into your ide
top rpm vs bottom rpm vs req day?
I don't use ai to code yet they suck at rust
openrouter actually gives you more limits lol
top one is if you have a payment method
(which is weird because it's free either way)
doesnt openrouter share the one 2.5 one with all users?
so everybody has less prompts
they contacted google for higher limits
this would make sense if this premise weren't my own claim lol, they suggested fundamental architectural change but I said it isn't technically the same but it doesn't matter since with or without inherent limitations (transformer, or not), we can optimize for other specific tasks like we did with CoT, and what we're already doing (for agentic use)
thanks honey
since its architectural identity wasn't a primary claim, and what I said operates on its lack of relevance already, this is just a comprehension issue
comprehension??
you're the one saying that thinking models use different architectures
and don't get that r1 is just v3 RLd on thinking
I explicitly said "remain the base" dawg 😭
and even clarified "not technically the same" so I consider what I'm saying pedantic posturing, but for rhetorical purposes
since the discussion is operating primarily on the CORE architecture, ie titans vs transformers and I'm explicitly stepping away from that dialectic, what do you think I'm saying
not only that, I even clarified why they can technically be distinguished from the base transformer architecture (i.e. gpt, multimodality), and since yes, comprehension is an issue, you dismissed it with "pedantry" knowing that's the premise, not my rebuttal towards what they're saying
nobody asked
can someone with access to o1 pro give it this
the answer is permanent
but gemini 2.5 pro, grok 3 thinking and claude 3.7 sonnet thinking all fail
Question 5
A particle P, of mass m, is attached to one end of a light elastic string of natural length 0.5 m and modulus of elasticity 2mg. The other end of the string is attached to a fixed point A on a rough horizontal surface.
P is held at a point B, where |AB|=0.5 m and given a speed of 1.4 ms⁻¹ in the direction AB.
P comes to rest at the point C.
Determine whether this position of rest is instantaneous or permanent.
heres the transcription
looks like 2.5 pro gets it with code execution
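(a minimal sketch of the kind of check code execution can do for Question 5. The transcription above doesn't include the coefficient of friction, so `mu` is left as a parameter here and the values tried below are placeholders, not the question's. Also assumed: g = 9.8 m/s², Hooke's law T = λx/l, and the string staying taut from B to C. The function name is made up.)

```python
import math


def rest_at_c(mu, v0=1.4, natural_length=0.5, modulus_over_mg=2.0, g=9.8):
    """Return (extension at C in metres, 'permanent' or 'instantaneous').

    Work-energy from B to C, per unit mass, with extension x:
        0.5 * v0**2 = mu * g * x + (modulus_over_mg * g / (2 * l)) * x**2
    mu is NOT given in the transcription; it is a hypothetical input.
    """
    a = modulus_over_mg * g / (2.0 * natural_length)  # elastic energy coefficient
    b = mu * g                                         # friction work coefficient
    c = -0.5 * v0 ** 2                                 # initial kinetic energy
    x = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)  # positive root

    tension_per_mass = modulus_over_mg * g * x / natural_length
    max_friction_per_mass = mu * g
    # P stays at C only if limiting friction can hold the string's pull.
    if tension_per_mass <= max_friction_per_mass:
        return x, "permanent"
    return x, "instantaneous"


# Placeholder friction values, only to show that the verdict depends on mu.
for mu in (0.4, 0.6):
    x, verdict = rest_at_c(mu)
    print(f"mu={mu}: extension {x:.4f} m, rest is {verdict}")
```

under these assumptions the verdict flips from instantaneous to permanent once mu² ≥ 0.8/3, i.e. mu above roughly 0.52.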