#general
.
there is no value besides speculation
lol
did you even read any of my messages
yes you are not saying anything
where i said its obviously not worth it for these reasons like minutes ago
then why are you still talking?
bro touch some grass
I used the less-smart Gemini 2.0 Flash to translate this. My English is too bad to communicate.
Use 2.5 pro and translate a complex sentence in ur language and I'll review
The translation quality of 2.5 Pro is quite impressive, reaching a level comparable to lyrics translated by human translators. As for a more detailed evaluation, I wouldn't know how to do that properly; it requires someone with a proficient command of a second language to assess.
ah. well, i'm extremely excited and hopeful for more advancements in machine translation
fwiw i think the translation quality was perfectly fine. it communicated the points you wanted to convey (i assume anyway lol) - job done. i mean yeah, some of the terms used had a kinda distinct AI feel to it, but it wasn't that obvious..
like i'd just use whatever is easiest and quickest, with baseline reliability, in the context of communicating in a discord chat.. perfection is overkill
btw do you also translate from English to Chinese?
openrouter isn't allowing free gemini 2.5 pro anymore
must be getting close to general availability / no more experimental endpoints
lol, yeah I can mostly understand the English here, but definitely miss some things. Discord's website doesn't play well with Edge's built-in translator, so I'm using a plugin called "Youdao Lingdong Translate". It works pretty well! (by gemini 2.5pro)
ha yeah tbf gem 2.5 pro def delivered a more 'discord' feel / tone compared to flash
Has anyone made any personal benchmarks on o3 vs 2.5 PRO yet?
Anyone get the feeling that o3's output is limited or nerfed? The output tokens feel limited and a bit misguided
I've tried to ask for specifics but it'll be vague and not fully listen to me, it's annoying
Somehow the o3-mini underperforms strongly in the arena compared to the official webpage. Could be tool usage.
Or the arena variant is "low" or "medium"
It appears to be a sampling issue. This time the drawing is perfect. It failed on 2.5 PRO though.
I don't know how the benchmark can be updated if the 2.5 PRO is crashing EVERY TIME
error code 1 with gemini models is when they refuse the request due to filters, iirc
Do we know if plus users get o3-medium or o3-high? And is it different to pro users? Say pro users get o3-high?
i believe o3 high is only available on API right now. plus users get 50 o3-med per week rn
is there much of a difference between o4-mini and o4-mini-high?
also, it seems they finally removed the emojis from o4-mini
yeah
aye, finally found somewhere with them
Found new test approach. "Recreate this page to an HTML format to be later transformed to A4 PDF page". All models fail except o3 and o4-mini.
*(no tools). Why handicap model?
It would be funny if 2.5 PRO is so good because of the same reason 3.5 Sonnet was good (they got lucky)
there is no luck at this scale
Every time you train model luck is involved (local and global minima exist)
You cant but you initially said chatgpt, not playground..?
Unless you have the compute to try numerous iterations from start to finish (too expensive and time consuming for THIS SCALE)
you can branch in chatgpt?
wait so plus only get o3 medium wtf
what about pro?
Yeah and then you have counter with arrows to switch between them. Even just a regen essentially branches it but at the very end if you do it this way
wtf i never knew that
but the app breaks every other prompt for me with chatgpt so i prob wont be able to use that
wait it does not let me branch
send screenshot
I just tried on ios app and it's not there lol. But I mostly use it on desktop anyway tbh
well that's interesting
And you can do it there for sure. Just not in mobile app
Thought grok is the best
my anecdotal thoughts are also that o3 is best in creative writing
i actually like o3, they are essentially having mcps built into the reasoning, they just need a way for us to add more mcps to it on the fly, but I think there might be a way with the python tool it uses
I do hope grok will be better than meta
tbh i gave up on grok a while back
They are investing in an AI datacenter and get governmental contracts
True
and currently the knowledge base is fixed so that I'm continuing on my "fanfic"
i just dont see the need for all these ai players anymore, we have goog open source that gets better with closed source, and we have 4 leading close source, i dont see the need for 2 of them anymore(claude and grok)
i tried out o3's fanfic writing when i was given early access
it kinda had me hooked..
lol
man people hate claude
i used to love grok when it first dropped
claude is cool as hell
but idk vibes just went down overtime
Claude has good creative vibes but like
i really fw o3's creative stuff
it's orders of magnitude better than o1
not a fair comparison
claude is still solid, but for coding im hooked on gemini bc its cheaper and now openai got the coder that is opensource and has o3 with tools(mcp like) built in, i havent used claude in a while, but its still a good model
i mean between 3.7 and o3
well it's still an interesting comparison given before 3.7 absolutely dunked on o1
and reasoning models normally suck creatively, R1 being the first to kinda prove that wrong
FT about to debut next week
what time google launching today?
and its flash right?
i like when models dont randomly stop while making a story
Stable 2.5 Pro
wtf
and the only models that do that are gemini 2.5 and claude 3.7
what does that mean, like what info you have on it?
which bench is this
whats this?
Yeah, I've seen this list before, and tbh I really don't agree with the ranking.
Seeing R1 and V3 ranked so high makes the author's bias pretty clear: they obviously favor that aggressive, exaggerated style, leaning into those kinda modernist philosophy frameworks, or maybe just typical web novel tropes.
Like, I tried O3 today and really disliked its DeepSeek-ish vibe. It tends to over-interpret things and forces these complex frameworks onto everything. That kind of writing might look impressive or even amazing at first glance, but if you actually look closely, the way it abuses vocabulary is a huge problem.
Personally, I lean towards Claude 3.7 (though you need to prompt it well, the raw output isn't great). But right now, Gemini 2.5 Pro has overtaken the Claude series for me.
Also, just generally, I prefer a more prose-like style.
if only people actually sent links
https://polychat.io/ they have a limited free prompt quota for new accounts
its
it isn't human judged
it's an llm iirc
i believe 3.5 sonnet
Run the 32 writing prompts for 3 iterations (96 items total) @ temp 0.7, min_p 0.1.
Grade the outputs with a comprehensive scoring rubric using Claude 3.7 Sonnet.
3.7
but there are some components that are statistically judged
thank you
"Test Structure: The benchmark runs multi-turn conversations (up to 21 turns) between the test model (acting as conflict mediator) and actor models (playing clients or disputants). The actor model we use is gemini-2.0-flash-001. Each scenario includes detailed character profiles with specific emotional states and backgrounds.
Assessment Criteria: We score models on:
Basic emotional intelligence skills (recognizing emotions, showing empathy)
Professional skills specific to therapy or mediation
Avoiding serious professional mistakes
How It Works: The benchmark uses three models:
Test model: The AI being evaluated
Actor model: Plays realistic clients or disputants
Judge model: Claude-3.7-Sonnet scores the test model's performance
Scoring: The final score combines:
Scores across multiple skill areas
A count of identified mis-steps and how serious they were
Beyond just scores, the judge provides a critical analysis of specific errors, rating them as minor, moderate, or serious. This helps identify exactly where and how models struggle in realistic professional conversations."
AI PEOPLE HAVE TO CITE SOURCES MORE!!!
thats eqbench
this is creative writing
I love benches
me too
didn't expect llama to score so low
when there are so many roleplaying AIs are using llama lol
yeah mb
still somewhat similar
How the benchmark works:
Run the 32 writing prompts for 3 iterations (96 items total) @ temp 0.7, min_p 0.1.
Grade the outputs with a comprehensive scoring rubric using Claude 3.7 Sonnet.
Use this score to infer an initial Elo rating for the evaluated model.
Perform pairwise matchups with neighboring models on the leaderboard (sparse sampling). Items are scored on several criteria, with the winner on each criteria given up to 5 +'s.
Calculate Elo scores using the Glicko rating system (modified to weight the win margin in '+' count). Loop until stable positions are found.
Perform comprehensive matchups with final neighbors and compute the definitive leaderboard Elo.
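(Rough sketch of the pairwise rating step in the list above. This uses a plain Elo update, not the Glicko variant the benchmark says it runs, and the way the '+' counts get turned into a fractional win is an assumption made up for illustration; the real weighting isn't given here.)

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def margin_to_score(plus_a: int, plus_b: int) -> float:
    """Hypothetical margin weighting: A's share of all '+' marks awarded.

    A tie (or no marks at all) counts as 0.5, a clean sweep as 1.0.
    """
    total = plus_a + plus_b
    if total == 0:
        return 0.5
    return plus_a / total


def update(rating_a: float, rating_b: float,
           plus_a: int, plus_b: int, k: float = 32.0) -> tuple[float, float]:
    """One matchup: move both ratings by k * (actual - expected)."""
    delta = k * (margin_to_score(plus_a, plus_b) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models, A sweeps all five '+' marks.
    print(update(1500.0, 1500.0, 5, 0))
```

Looping this over neighboring pairs until positions stop moving is the "loop until stable" part of the description.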
thats because llama is open source and cheap
well depends which llama
probably because people can finetune it
if i can read right, this is 100% ai judged just in two different ways
is this not just a very small collection of hard benchmarks
i told o3 to make an mcp and use it to make me something, is that an hallucination?
That said, even if we don't agree with some of the conclusions, different results should be respected as long as they follow a consistent standard or methodology. After all, everyone has their own preferences when it comes to LLMs.
But, I still have to add: for subjective things like writing quality, blind evaluations by humans are probably the better way to judge what's actually good or bad.
yeah
are there any mods here?
we need to have a way to pin latest news, i guess we can use the announcements channel, but that might just be for lmarena stuff
it seems like every 4 weeks we get a new ai model that dominates
yeah lets do that
we all do, but there are sometimes so many messages that its hard to know whats going on and catch up
but does anyone have a website for all mcps or community created mcps?
i wanna try something with o3
most of the messages are crap
exactly its like the needle in the haystack, trying to find the valuable info in the hay of information
@tall summit delete your message in that thread
lets keep it clean for only news stuff
we need mods to make that an official channel tho
you have a point
it might get swept away lol
@hollow ivy what ever happened to our music thread?
really all you need is legit_api's private tool tweets, official ai company tweets, and benchmarks
you can get it for us and give us the deets
lmao
100 other people will, once it's released
but even if it never does, other people will easily find out what he does, just slightly later
and he shows no signs of stopping to tweet it anyway
and also there are compilations of benchmark scores and company tweets arent that hard to track given there are only a finite number
the other kinds of news (mainly applied ai) are much harder to find thats why most news sources arent as simple as that, but honestly i think thats all you need to keep up with the models themselves
this one?
i like it, I havent had a chance to try it in my app, but i will try it tonight with o4 mini and see what it can make
imma do 50 iterations of it
how are you using it?
oh it's one of those. this one seems especially limited if you don't pay 20$ a month
yeah but if you put it as system prompt
then tell it to run
and only pass on outputs and clear context
in my app i have a system prompt, then i can add a prompt if i want to, but for this i will just tell it to simulate or something, then it just feeds each call to a model with the system prompt and the previous output
so you dont have to worry about context as much
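(A minimal sketch of the loop described above: every call sees only the fixed system prompt plus the previous output, so the context never grows. `call_model` is a hypothetical stand-in for whatever API the app actually uses, and `echo_model` is a dummy so the example runs without a key.)

```python
from typing import Callable, List


def run_simulation(system_prompt: str, seed: str, steps: int,
                   call_model: Callable[[str, str], str]) -> List[str]:
    """Feed each output back in as the next input, keeping no other history."""
    outputs: List[str] = []
    previous = seed
    for _ in range(steps):
        # Only the system prompt and the last output are passed along.
        previous = call_model(system_prompt, previous)
        outputs.append(previous)
    return outputs


def echo_model(system_prompt: str, user_message: str) -> str:
    """Dummy model for illustration: appends '!' to whatever it receives."""
    return user_message + "!"


if __name__ == "__main__":
    for line in run_simulation("You are a world simulator.", "go", 3, echo_model):
        print(line)
```

Character consistency would still need something extra, e.g. a running summary passed in alongside the previous output.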
actually let me do that now and let it run for an hour, but i wish we still had the free models
the only thing i might have to do is consistency of characters etc..
might need to have a memory agent or system to deal with that
yeah i could do that, but its not automated
i like just feeding the model an input and letting it simulate and cook
yeah
see if it can create the world
lol
have you tried the prompt with multiple studios?
like have like 5 studio windows
and have one be the game director or sum
with the system prompt
and the others be character or players
and feed each other outputs
yeah that should be big enough
hmm that might be a good question for chatgpt lol
im lobotomized by ai now
even give it your prompt for context
what is genius level?
hooooooly
how does it compare on openai new needle in haystack chart?
if that score holds up I really don't understand the 200k context window. it's not that hard to fill up 200k with long thinking models
what makes it more impressive is that 120k is 60% of its context
while gemini 120k is 12% and scoring 90% vs o3 with 60% of context scoring 100%
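(Quick check of those percentages. The window sizes are assumptions read off the numbers above: 200k tokens for o3, and 1M for Gemini since 120k being 12% implies that.)

```python
def context_fraction(used_tokens: int, window_tokens: int) -> float:
    """Share of the context window taken up by used_tokens."""
    return used_tokens / window_tokens


if __name__ == "__main__":
    O3_WINDOW = 200_000        # assumed o3 context window
    GEMINI_WINDOW = 1_000_000  # assumed Gemini 2.5 Pro context window
    print(f"o3:     {context_fraction(120_000, O3_WINDOW):.0%}")
    print(f"gemini: {context_fraction(120_000, GEMINI_WINDOW):.0%}")
```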
o3 is extremely impressive
still debatable lol
Sorry if this is a dumb question but how do you know how much percentage of context a project is using on Gemini or GPT? Claude tells you.
Wow
Grok 4 will be better than o3
You can see in AI Studio for Gemini. I don't think it's viewable in the gemini app
Any update on when o4 mini and o3 will be added to lmarena?
there are no dumb questions bro, we here to share knowledge
lmao did you even check
I checked
nothing on the lb too
they're in direct chat, at least in the non-alpha arena
nobody can say for sure
anyway, "better" is sort of subjective
for overall performance i think grok 3 thinking high will be roughly on par with o3
what makes you think that grok's gonna add reasoning effort levels? they seem to be focusing on other stuff atm
Fun fact, GPT4o was released essentially 1 year ago today
Grok 3 mini already has this
grok 3 without reasoning has improved since its announcement in February (see image below) and these benchmark scores announced for the reasoning version are well below the scores it will have at the end of its training, since it was barely at the level of the mini version
ah, my mistake
grok 3 mini (high) does pretty well on livebench already
works at deepmind
good chance this is an svg generated by an upcoming model (one of the ones being launched today)
most likely
flash thinking or that dragontail model
i hope its nw
What I know is that something will be changed on the Gemini app ...because when I opened Gemini it said there are new models
So they are working
On something
is dragontail flash 2.5?
woah
we're finally beginning to saturate these too
do we know which one is on the arena then? I would assume it's o3-med or o3-low
first time im seeing ai models being tested on iq tests
this has been a thing for like
over a year
they do an offline one and a mensa one
it's kinda silly to do on llms but interesting all the same
im sure but i havent seen it
oooooh
let me find it
Tracking AI is a cutting-edge application that unveils the political biases embedded in artificial intelligence systems. Explore and analyze the political leanings of AIs with our intuitive platform, designed to foster transparency in the world of artificial intelligence. Stay informed and uncover the political inclinations shaping the algorithm...
i am not surprised at all
isn't it funny
how the smarter the model is
the more lib left it gets
would like to see o3 tho
isnt it there
you didnt say anything when I posted it but when leo posts you're shocked lmaooo
haven't run it yet by the looks of it
umm i didnt see when you did
or maybe i did and forgot
i dont remember at all
oh sorry i subconsciously ignored that message
because i got immediately interested in the ficlive benchmark stat from o3
thats tiktok for you lmaoo
and scrolled down to see peoples discussions about that, which moved your post further away from my consciousness
i dont use tiktok sorry
damn your attention span got killed without tiktok, we are screwed
so when the AI starts running the government we're all gonna be reading Herbert Spencer in school and praising the early artificial micronations? lol
yeah adhd affects like 1% of people which is an insane number and its only growing
BREAKING: Google $GOOGL has lost its online advertising case, thus saying its online ad tech markets violate US antitrust laws.
well this is gonna be interesting
mine too
YO
no way
@torn mantle
qwdqwd;klqwjlkdjqwlkdhjqwlkd
ll;]ajks;]LAJKS'L;ASJDF'KL;ASJF'AKL;SFJASK'L;JDFASK'LJFKAL'SFJ'ASLKF
STOP
no way
finally 😭
if they drop a SOTA code model i am all in on deepmind
I don't think it's gonna impact Google at all
@keen beacon thanks for sacrificing yourself browsing twitter to deliver news so we don't have to go in the twitter hellscape ourselves
you have no idea
what it will do
are there betters for AI? because I'm going all in today
i just wish bsky had more of a community
i love google man
otherwise id move
i love how they do this to openai
no i mean sonnet is mostly used for coding, now imagine a better model comes in cheap, what do you think will happen?
that means nightwhisper is better than we thought
also notice it says "CODE MODELS"
probably a version based on flash and a version based on pro
whats the 1P part of 1P CODE MODELS mean
could be
yea
thats what im wondering
only 1
Satya said he will make them dance they bring the music
first-party as opposed to 3rd
Logic
all their models are first party in that sense
when will the new OpenAI models be on the leaderboard? Weird that they were not in the anonymous testing
ohh
whats gemini if not their own models
maybe they partner with a lab to offer a third party option down the line
who knows
🤷
damn they really stealing openai shine the day after
ngl what would happen if Google acquired anthropic
this is gonna be so disappointing if it's like, a narrow coding model that is just crazy at frontend
and sucks at everything else
ngl Google could legitimately blunder this
i believe in google
why would they rush a launch like this
if it was not good
like right after openai
it kinda has to be good
itd be cool tho
frontend is cool
still an improvement in vibe coding 🤷
it was also good at python tbh
but lets see
I never really tested it
I don't think we get nightwhisper. the hype posts have been subtly hinting at flash 2.5. they definitely know there is excitement about nw and would play into that if it was coming. idk
read up
if they're updating for code models
and night whisper really is THAT good
then they have to be releasing more than that
they're weirdly aware of the hypeposting from like 1000 follower twitter accounts lol
sundar pichai tweeting nebula was not something i expected
meatball ron 😭
Hmmm? Heard they're releasing Nightwhisper? Is there any solid proof?
You sure this isn't just another Google smokescreen? We haven't forgotten about the 2.5flash 0409 model, have we?
yeah what about the nebula smokescreen
they didn't release 2.5 pro
literally just "Tiny D"
nebula had more hype than nightwhisper if im remembering right
yup
not rly tbh
yes really
nebula was known to be super good
but night whisper is being talked about just as much
just not in faith of "this is amazing"
but "Google is cooking"
and just left there
you can see this in the subreddits too
people seem to know a ton about nightwhisper
fr not a lot of people talked about nw tbh
it could be by design, since nw was there for like 2 days barely
nah, it still is pretty popular lol
people just pick and choose tho
with the nebula precursors
the early 2.5 pro checkpoints
they weren't really worse
nebula was a beast
just less good at very specific tasks
and they weren't talked about in the light of being the beast it is now, but still acknowledged
Alright folks, it's getting super late here, almost the 18th already.
Gotta head to bed now. Night everyone!
Hope I wake up to some Gemini 2.5 Pro Coding, Pro High, or 2.5 Flash news tomorrow morning!
man hopefully
gn bro and damn, its only 12 pm by me
i love the diversity in this chat
Deepseek is a joint effort of China and Russia
Its incorrect to call it a chinese AI
Finally, Google will release models finely tuned for coding (competitors for the Claude family of models)! 🔥
wow
yeah someone posted this
But nothing happened
maaybe at 1pm est?
We can no longer be sure it's not in the training set at this point
matharena!? well gj
damn i didnt actually see aime 2025
the issue is..
aime is not hard in comparison to most other math
even if you dont want to dive into proof based contents, there are many much harder ones
you can just send the image you know
still funny to me how o4-mini is better than o3 at math+coding
cuz that's just what it's meant for
o4 mini has been pretty bad in my testing for everything else but puzzles+code
this is aime 2025... no
no its meant for "reasoning"
technically
i hope this is not how the real competitors solved this
Makes sense. R1 is above V3 as well
ok none of the aops solutions actually work with the polynomial in this form
how is that relevant to what I said at all
night whisper was really too good? I never tried it
I wish we will try it together sooner
Its 1pm and still no new model release by google yet?
Maybe just a rumor 🥺
maybe bc of this:
https://x.com/WatcherGuru/status/1912889391170597014
JUST IN: 🇺🇸 Judge rules Google operates illegal ad monopoly.
lol jk
this is kinda interesting
is o3 better than o1 ? (about resolving problems, maths..)
I mean like future plans.
random question:
do LLM help people study and acquire information in say Africa & developing countries?
New LM arena is a fat improvement
Here is an example from north africa
not sure if anyone has mentioned but I'm getting o3 on arena
cant even see the code output in beta.. nice alpha was better
"defaultInferenceSettings": {
"system": "Over the course of conversation, adapt to the user's tone and preferences. Try to match the user's vibe, tone, and generally how they are speaking. You want the conversation to feel natural. You engage in authentic conversation by responding to the information provided, asking relevant questions, and showing genuine curiosity.\n\nYour output will be rendered in a web UI, so use valid markdown format, tables, Latex, or emojis to make the content more engaging and user friendly."
},
If Internet access is bad, local llms can maybe in the future help with their knowledge bases as they only require the hardware and electricity necessary to run them.
An LLM can hallucinate, books written by people who know what they are talking about don't.
not 100% sure about that
Prove it. Become a self taught phd in any STEM subject using only llms.
im not denying that LLMs hallucinate :)
that's because they "predict" words rather than "know" them like a human would, im more talking about the fact that you wouldn't believe how many books by so-called "experts" contain mistakes ;)
damn google trolling
Wdym?
what is confidential?
the models....
This is confidential information
probably early models that are tested by devs
or red teaming idk
I know
is there any news about a newer LLM that can listen to audios?
the only llms that can are gemini and ernie models
I just wanna note
this wasn't what was ruled
if it's made out to be so bad
it's not
look into it yourself
Gemini 2.5 pro preview is their newest model
On AI studio if you are using the free version you are using Gemini 2.5 pro exp
i thought it was gemini 2.5 pro experimental
It is not
Gemini 2.5 pro preview is newest, you can check it here https://openrouter.ai/google
Gemini 2.5 flash is coming in a few weeks
they're the same exact thing
But I suspect it is the end of the month
Not really
I use cursor, and I use claude code, and those work well for claude 3.7. is there a Best Way to use Gemini 2.5 pro on a code base?
They are very different
Gemini 2.5 pro preview is really good
I suspect today
Gemini 2.5 pro exp sucks
Google has a history of releasing stuff at the end of months
no they literally are the same exact model
not a single change
not a touch
not an ounce of change
Have you ever used Gemini 2.5 pro preview
when logan tweets gemini, there has always been at least one new gemini model release the next day
talking to the master of 2.5 pro in this server btw
there will be a launch today
who dat
the time doesn't really matter, they've released as late as 10pm BST and as early as 12pm BST before
The free version of AI studio uses exp
head of ai studio @ deepmind
yes they're the same exact thing
so, is everyone using gemini via the website? like animals?
They aren't
@keen beacon confirm
i haven't even seen chatgpt.com in months
Have you tried it
On open router
They aren't the same
Gemini 2.5 pro preview is a further trained version of Gemini 2.5 pro
no it isn't..
It is
This is a common practice
mate
it has been said already that preview is just the name for the version of the model with increased api rate limits
that is it.
My search says otherwise
copilot isn't going to know is it
it's literally just pulling from generic web sources
That isn't copilot
that's illogic lol
ai in bing is copilot
Not really.
how many times have you used exp and preview?
if it was a new model the date at the end of the ID would not be the same and they would publish new benchmark scores
People actually pay attention to the browser's ai suggestions? we're cooked man...
in all my tests they have been the same tbh
LMArena is now a company. That's interesting!
literally one second of thought to reach that conclusion
Your source is character AI, how does that tell us about Google's naming practices?
just check the api lmaoo
A lot
stop using bing... it sucks
you are schizo if you think there's a difference
I'm sorry I don't support a monopoly
They've not introduced a further trained version of 2.5 pro yet, from initial release
i have used the model hundreds of times
both as exp and as preview
they are exactly the same
Bro think hes progressive for using Bing 😭 ?
Are there any new Google models in LMarena?
More than you
then go for ddg ... bing is like picking the worst option in every way
I also got a paper in Harvard
That's character AI, which isn't Google
broo
just go to google documentation
why you going everywhere else but google
I'm cited in this https://ui.adsabs.harvard.edu/abs/2024arXiv240801950L/abstract
Existing music generation models are mostly language-based, neglecting the frequency continuity property of notes, resulting in inadequate fitting of rare or never-used notes and thus reducing the diversity of generated samples. We argue that the distribution of notes can be modeled by translational invariance and periodicity, especially using d...
oh okay then you cant make mistake !
This guy is trolling right? Lol
Who
i hope lol
Asked?
where does it say gemini 2.5 pro?
you a funny guy lmaoo
closed minded
everyone is telling him he's wrong but his ego won't let him hear it
Ask any AI and it will agree with me
Also, the Gemini 2.5 version hasn't been called stable yet
You sound insecure asf nobody here is questioning your intelligence but yourself
We're just sayin youre wrong about the ai
they are the same, but even if they were different we couldnt tell based on any of our tests, so whats the point?
Why does exp suck more than preview
if every single other person believes you to be wrong perhaps the problem is you
just a thought
Is gpt 3.5 returning?
Vertex so far
Albert Einstein: 'If you can't explain it simply, you don't understand it well enough.'
Like what this was saying
there it is
i dont have vertex can you send link to it please
is anyone seeing 2.5 flash in their gemini.google.com ? One of my friends is seeing it
am gonna see if this is Dragontail
nope
lemme check studio
on that site it costs money right?
yeah its def not nightwhisper lmaoo
wait
this model....
built in web search?
i dont have google grounding on but its using web
where is this? cloud console?
here
this might actually be nw
the thinking is taking a while for a flash model
no way that NW is flash model.. please
just tried it
so much thinking for Flash model.. it's crazy
is this nightwhisper ?
Vertex atm
Let's wait
I want it on Ai studio
@torn mantle
help me test this lmaoo
@keen beacon are you using the manual or auto thinking? im scared to touch that lol
How come leaderboard is not updated if flash is already out? I am sure nw or dragontail is flash...?
this model is better than 2.5 pro imo, need more tests, but on a one shot coding it clears it
manual cranked up to the max
i have money to burn
it's on cloud console.. AIstudio is just matter of hours
We'll likely get an arena update on it sometime today
give it time
hasn't even been actually announced yet
do we get this thinking budget with 2.5 pro regular?
never used this platform before
why is there studio and vertex?
Someone got this
how do I make it think less? I want really fast responses like sub 2 seconds for 128K tokens
oh..thank you!
pinged my team to start working on testing this stuff.... I am excited!
i hate you how!!!
pricing looks great !
how fast is it
yo this model is fire!!!
hard to tell but latency seems on par with 2.5 pro
Is this nightwhisper????
when its coding its auto searching and using git repos
i think it is tbh, but others say nah, im doing more tests
I haven't got access yet
what?
Also has native tool calling, which 2.5 pro doesn't have
it is?
that would be huge!
how did you determine that? any prompts to try?
"call tools natively"
"Agentic use cases"
Wow 😲😲😲
damn its gonna be o4-mini level
probably cheaper, google has so many spare gpus
i did the pokemon test
Share the prompt plz!
and the output was better than what I got from 3.7 and 2.5 in zero shot
okay, one sec, running one more test
i had thinking on max and it gave me a slightly better pokemon sim then 2.5, but let me try auto again
Google was just aura farming and really said ' go to bed kidsss'
You know there are more than one 2.5 now
not happening
doubt should be slightly worse just cheaper
i do like this output:
https://liveweave.com/A9OGzH#
it can and probably will beat it in at least one category
i didn't say i expect it to beat 2.5 pro universally
but smaller models have their strengths
especially
when trained well
wdym
Flash did this!!? 😲
test what?
Flash is smaller model? I thought it was taking more tokens than 2.5 exp?
yeah 0-shot, prompt: make a pokemon game
auto thinking
it's a smaller model that uses more tokens for reasoning
2.5 pro flash
so no max thinking and it did that ?
yeah just auto, thats why i wanted to test auto again
I figured out that 2.5 pro just works better on 0.65 temp
let me see
max thinking could be overthinking
Try it on flash
I just got access on AI studio. Rollout happening for sure
https://liveweave.com/A9OGzH# this is what it made, or you can give me a fresh prompt to try?
i still didnt 😦
I spoke too soon lmao
looks cool
this seems similar to stargazer
Connect to usa vpn and reload
It is possible to select for me. just no outputs yet
then u will get 2.5 flash
I thought riverhollow
I also spoke too soon
apparently it's better in coding
working for me now
same despite error it responded
I am in EU. rollout will happen at the very last minute for me 😦
refresh again
same 😭
Connect to usa vpn
check thinking time
🤣
may be it's only for Vertex users
CONFIDENTIAL
I am thinking budget part
leaking insider info yet again
i got it with vpn
from that message? is 0.7s
(2.6s for 2.5 pro)
old habits die hard
refreshed and the confidential tab is gone
how much text is in the thinking box
how much for 2.5 pro?
I have flash now too
what is the general consensus between 2.5 and the new gpt models?
i like that it just disappears and then appears again
can anyone here really definitively say one is better?
cuz ngl they are so close to me
different style wise
is flash 2.5 the first model to determine whether thinking tokens are needed or not with "auto" selected?
Flash seems to be tailored for coding purpose
keep refreshing
mine is not in confidential anymore
Let a < b < c be distinct natural numbers. Must every block of c consecutive natural numbers contain three distinct numbers whose product is a multiple of abc?
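For anyone who wants to check small cases of the question above by hand rather than trust a model's "yes" or "no", here is a minimal brute-force sketch. The function names and the search bounds are made up for illustration, and a search over small numbers can only find a counterexample or fail to find one; it does not prove anything.

```python
from itertools import combinations


def block_has_triple(start, a, b, c):
    """Return True if the block start, start+1, ..., start+c-1 contains
    three distinct numbers whose product is a multiple of a*b*c."""
    target = a * b * c
    block = range(start, start + c)
    return any((x * y * z) % target == 0 for x, y, z in combinations(block, 3))


def find_counterexample(max_c, max_start):
    """Search all 1 <= a < b < c <= max_c and all block starts up to
    max_start. Return the first (a, b, c, start) with no valid triple,
    or None if the search finds nothing within these bounds."""
    for c in range(3, max_c + 1):
        for b in range(2, c):
            for a in range(1, b):
                for start in range(1, max_start + 1):
                    if not block_has_triple(start, a, b, c):
                        return (a, b, c, start)
    return None


if __name__ == "__main__":
    # Bounds are arbitrary; widen them to search further.
    print(find_counterexample(max_c=8, max_start=200))
```

A result of `None` only means nothing failed inside the chosen bounds.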
its fast, slightly faster than 2.5
🤪
I keep getting errors. Gonna go do an errand. at least it's confirmed for today
actually i cant tell if its faster
Click the timer
It has latency and tps
woah.. hello
Huh
GeoGuessr time lol
IT'S US ONLY
for now
I got it on EU now
I just got 2.5 flash they're rolling it out fast
Looks like you can set the thinking budget on aistudio
hmm i don't have that, let me refresh
and its free lmaoo bruhh
I don't have yet either
Toggle thinking mode off and on and you should see it
yupp refreshing did the trick
refresh, got it
Nice! got it on aistudio!
Just got it
mann google is cooking
what's the answer
"no"
that made me laugh
aii time to put flash against pro, give me some prompt for games and web dev
I think flash with max thinking tokens beats 2.5 pro at coding
i got like 3 tabs open testing lmaoo
yoooooo look at this with auto thinking:
https://liveweave.com/A9OGzH#
play it a lil
animations are crazy tbh
@torn mantle
i got "YES" thinking budget 8000
its thinking forever for me
it keeps cutting out for me
can't get an answer
its long and it put the animations in a weird place, but pretty cool
look at the charizard one
and prompt was make a pokemon game and i just told it to add a new feature and animations
to use 24k thinking tokens is wild tho
what tests did you run?
oh the thinking budget works but the scale is weird
this is cap
yeah im getting weird results with max thinking
2.5 flash is pretty good
yeah he just coping, i think he might be elon
yeah bro wild
wdym by weird
Lol. Leaderboard is updated. haha...
i set the thinking budget to 2 and it does way more than 2 tokens in its thoughts but it cuts off after a while
it also breaks the model at least on the prompt im using lol
interesting, did you test out 0 tokens?
it just flips the thinking budget off if its 0
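The AI Studio slider discussed above should correspond to a request setting when calling the model over the API. Below is a rough sketch of building such a request body. The field names `generationConfig`, `thinkingConfig`, and `thinkingBudget` are my assumption about the Gemini REST shape and should be checked against the official docs; `build_request` is a hypothetical helper, not part of any SDK. A budget of 0 is what the chat describes as turning thinking off.

```python
import json


def build_request(prompt, thinking_budget=None, temperature=None):
    """Build a JSON-serializable request body (assumed field names).

    thinking_budget: token budget for thinking; 0 is reported in the chat
    to switch thinking off. None leaves it on the default ("auto").
    """
    generation_config = {}
    if temperature is not None:
        generation_config["temperature"] = temperature
    if thinking_budget is not None:
        if thinking_budget < 0:
            raise ValueError("thinking_budget must be >= 0")
        # Assumed field names; verify against the API reference.
        generation_config["thinkingConfig"] = {"thinkingBudget": thinking_budget}

    body = {"contents": [{"parts": [{"text": prompt}]}]}
    if generation_config:
        body["generationConfig"] = generation_config
    return body


if __name__ == "__main__":
    # Thinking off, with the 0.65 temperature mentioned earlier in the chat.
    print(json.dumps(build_request("make a pokemon game", 0, 0.65), indent=2))
```

Leaving `thinking_budget` as `None` omits the field entirely, which matches the "just auto" setup people were testing.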
and it works like you dont see the thinking anymore?
ah
i wonder why they dont do it for pro
ya
flash 2.5 is crazy cheap
Cute robot Gemini flash 2.5 🥺🩷🩵
result of 2.5 flash on this prompt
check price lmao
1/10 the input price
and if they're calculating reasoning
this is probably maximum
Just wait until we see the pricing on aider
where o4 mini is probably around 15~ dollars maximum
2.5 flash is undoubtedly much better at anything not math/reasoning/coding though
Pro is 1/3 of the mini pricing, so flash should be noticeably cheaper
what does?
this benchmark is funny
60c (but 350c to get the results on the benchmarks)
2.5 flash is getting stuck in a thinking loop for me rn :\
o4 mini should be compared with 2.5 Pro. they have a similar price. And in practice, o4 mini is expensive compared to 2.5 pro
this was the same problem with 2.5 pro at release
ye
"in practice"
still happens with 2.5 pro, it just seems its not good at this particular problem lol
oh fr?
yea
yeah but then you shouldn't make it seem like that's not the claim either
dont get me wrong 2.5 pro/2.5 flash are amazing but yeah it gets stuck on this
check again... $1.10 vs $1.25, they are almost the same. but to solve the same problem, o4 mini is more expensive
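The point about list price versus cost per solved problem can be made concrete with a small sketch. Only the two input prices ($1.10 and $1.25 per million tokens) come from the chat; the output price and all token counts below are placeholders picked for illustration, not measurements of either model.

```python
def task_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD of one task, given per-million-token prices.
    Reasoning tokens are counted as output tokens here (an assumption)."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def cheapest(costs):
    """Return the name with the lowest cost from a {name: cost} dict."""
    return min(costs, key=costs.get)


if __name__ == "__main__":
    # Placeholder numbers: same output price, different token usage.
    costs = {
        "model_a": task_cost_usd(2_000, 40_000, 1.10, 5.00),
        "model_b": task_cost_usd(2_000, 10_000, 1.25, 5.00),
    }
    for name, cost in costs.items():
        print(f"{name}: ${cost:.4f}")
    print("cheapest:", cheapest(costs))
```

With nearly equal input prices, the total is dominated by how many output and reasoning tokens a model spends on the task, which is why benchmark runs that count retries can show a very different bill than the price sheet suggests.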
omg it just did 40k+ tokens in reasoning 🤣
and gave up
btw turn off thinking budget, it can do way more tokens. with the budget on it caps at 25k
thinking budget doesnt seem to help the model
where o3
select it in direct chat
or do arena battle
thats what i noticed too
i want cheap and fast models for my use-case. honestly, i have no alternative compared to google flash models
ye I guess now it really is up to preference
yeah if you like to spend money or not
looks like stargazer
lmaoo craig you live to hate on google man
you gotta sauce them up a lil
give them some lovin
Much cheaper
That's pretty good, nearly 1400
and nightwhisper vs o3 high?
The medium version is cheaper
it does. convo was about price tho
2.5 flash can't be nw or dragontail, can it? I remember now or dt doing pretty well when compared to 2.5 pro...
Nw or dt*
Conveniently cropped out the context: it doesn't include the price of repeats, which is why o4-mini's price was so much higher on Aider, because he does include the real cost.
Oh i dont see this
feels very much like riverhollow to me
for my product, o4 mini is borderline unusable because of cost. Sometimes it depends on the use-case as well
I guess Gemini is waiting for o3 or o4mini to come on leaderboard to steal the thunder again with night whisperer or dragontail...
what is the chart trying to show
So name reveal of flash?
Gemini 2.5 Flash on the best price-to-performance ratio curve
Stargazer? Dragon tail?
google are killing it lately
can anyone guide me how to have perplexity deep research api in custom gpts of chatgpt?
Plz add web search in lm arena beta version
yo lets cook something up
give me some prompts
I have tested it out. Pro is still better
You are a vibe coder
Arghhh
I think pro is way better than flash
i mean yeah
Nop
flash isnt supposed to be better than pro
What makes you think flash is better
That is right
i think flash is better with tools imo
I have tested several models for coding and came to a conclusion: whether you use grok or gemini or gpt or Claude, these llms are helping programmers, not replacing them
nahh i love flash
So pure programmers are better than vibe coders
i think its vibes are good
wait more stuff came out?
thats how it will always be lol
just like photographers are better than any joe with a smartphone
oh flash preview #1362418052842524672
And a programmer who uses AI (to help him) is much better than a casual one
thanks @balmy mist even though we argued like once
you kept the thread updated and it helped !!
Vibe coders will find jobs but not better ones
oh its available
i gotchu!
on ai studio?!
hell ya
budgets*
well i wont use it
So do not say coders are dying gpt 5 is agi, Gemini is the best in the universe!!!
it seems reg thinking is better
Remember that
lmaoo bro
than what? high?
like dont edit the thinking on flash, just use it OOTB
when you increase thinking it makes it worse based on our tests
but i would test it yourself
yeah it seems new with flash, not sure why they dont have it with pro
oh phew new with flash
thinking budget is measured in tokens? i guess i could just search that up
Even if flash is experimental, it won't replace pro or outperform it
we'll see
Even with custom thinking