#general

1 messages · Page 25 of 1

balmy mist
#

like the value

#

so you can speculate about it?

#

make a youtube vid on it?

tall summit
#

.

tall summit
balmy mist
#

lol

tall summit
#

did you even read any of my messages

balmy mist
#

yes you are not saying anything

tall summit
#

where i said its obviously not worth it for these reasons like minutes ago

balmy mist
#

then why are you still talking?

tall summit
#

what

#

because i like this server

balmy mist
#

like i said its pointless

#

but you are arguing a point you dont back

tall summit
#

i am not arguing it

#

i said this is what it does objectively

balmy mist
#

bro touch some grass

tall summit
#

and i also said that given the objective reasons TO BACK IT

#

i wouldnt personally

balmy mist
#

lol

#

okay dude

tall summit
#

@keen beacon see, relatable

hardy violet
#

I used the less-smart Gemini 2.0 Flash to translate this. My English is too bad to communicate.

opaque adder
hardy violet
#

The translation quality of 2.5 Pro is quite impressive, reaching a level comparable to lyrics translated by human translators. As for a more detailed evaluation, I wouldn't know how to do that properly; it requires someone with a proficient command of a second language to assess.

tall summit
alpine coral
#

like i'd just use whatever is easiest and quickest, with baseline reliability, in the context of communicating in a discord chat.. perfection is overkill

#

btw do you also translate from English to Chinese?

tall summit
#

openrouter isn't allowing free gemini 2.5 pro anymore 🚎

alpine coral
#

must be getting close to general availability / no more experimental endpoints

hardy violet
alpine coral
#

ha yeah tbf gem 2.5 pro def delivered a more 'discord' feel / tone compared to flash

calm sequoia
#

Has anyone made any personal benchmarks on o3 vs 2.5 PRO yet?

hardy pecan
#

Anyone get the feeling that o3's output is limited or nerfed? The output tokens feel limited and a bit misguided

#

I've tried to ask for specifics but it'll be vague and not fully listen to me, it's annoying

calm sequoia
#

Somehow the o3-mini underperforms strongly in the arena compared to the official webpage. Could be tool usage.

#

Or the arena variant is "low" or "medium"

keen beacon
#

the arena is on medium

#

and yes tools really help it

calm sequoia
#

It appears to be a sampling issue. This time the drawing is perfect. It failed on 2.5 PRO though.

#

I don't know how the benchmark can be updated if the 2.5 PRO is crashing EVERY TIME

hybrid shard
#

error code 1 with gemini models is when they refuse the request due to filters, iirc

hardy pecan
#

Do we know if plus users get o3-medium or o3-high? And is it different to pro users? Say pro users get o3-high?

brittle tiger
glass arch
#

is there much of a difference between o4-mini and o4-mini-high?

#

also, it seems they finally removed the emojis from o4-mini

keen beacon
#

aye, finally found somewhere with them

glass arch
#

lol. "low-effort"

#

like it's sitting around like "eh why don't ya ask me later"

calm sequoia
#

Found new test approach. "Recreate this page to an HTML format to be latter than transformed to A4 PDF page". All models fail except o3 and o4-mini.

plain zinc
calm sequoia
#

It would be funny if 2.5 PRO is so good because of the same reason 3.5 Sonnet was good (they got lucky)

lime coral
calm sequoia
#

Every time you train a model, luck is involved (local and global minima exist)
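A minimal sketch of the "luck" point, assuming a toy one-dimensional loss rather than anything resembling a real training run. The function `f`, the learning rate, and the two starting points are all made up for illustration; the only claim is that plain gradient descent ends up in different minima depending on where it starts.

```python
def f(x):
    # Toy "loss" with two basins: a deeper one near x = -1.30
    # and a shallower one near x = 1.13 (illustrative only).
    return x**4 - 3 * x**2 + x


def grad(x):
    # Analytic derivative of f.
    return 4 * x**3 - 6 * x + 1


def descend(x0, lr=0.01, steps=2000):
    # Plain gradient descent from the starting point x0.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


left = descend(-2.0)   # settles in the deeper (global) basin
right = descend(2.0)   # settles in the shallower (local) basin

print(f"start -2.0 -> x = {left:.3f}, loss = {f(left):.3f}")
print(f"start +2.0 -> x = {right:.3f}, loss = {f(right):.3f}")
```

Same loss, same optimizer, different initialization, different final loss. At real scale the only way to average this out is to repeat full runs, which is the cost problem raised in the chat.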

ocean vortex
#

You cant but you initially said chatgpt, not playground..?

calm sequoia
#

Unless you have the compute to try numerous iterations from start to finish (too expensive and time consuming for THIS SCALE)

balmy mist
#

wait so plus only get o3 medium wtf

#

what about pro?

ocean vortex
balmy mist
#

but the app breaks every other prompt for me with chatgpt so i prob wont be able to use that

#

wait it does not let me branch

#

send screenshot

ocean vortex
keen beacon
#

well that's interesting

ocean vortex
#

And you can do it there for sure. Just not in mobile app

keen fulcrum
keen beacon
#

lmao no

#

grok's actual writing is pretty bad, it's just uncensored

tall summit
balmy mist
#

i actually like o3, they are essentially having mcps built into the reasoning, they just need a way for us to add more mcps to it on the fly, but I think there might be a way with the python tool it uses

keen fulcrum
#

I do hope grok will be better than meta

balmy mist
#

tbh i gave up on grok a while back

keen fulcrum
tall summit
#

newspeople in the ai space oughta cite sources more

drifting thorn
#

and currently the knowledge base is fixed so that I'm continuing on my "fanfic"

balmy mist
keen beacon
#

it kinda had me hooked..

#

lol

tall summit
#

man people hate claude

balmy mist
tall summit
#

claude is cool as hell

balmy mist
#

but idk vibes just went down overtime

tall summit
#

🙀

keen beacon
#

Claude has good creative vibes but like

#

i really fw o3's creative stuff

#

it's orders of magnitude better than o1

tall summit
#

not a fair comparison

balmy mist
#

claude is still solid, but for coding im hooked on gemini bc its cheaper and now openai got the coder that is opensource and has o3 with tools(mcp like) built in, i havent used claude in a while, but its still a good model

tall summit
#

i mean between 3.7 and o3

keen beacon
#

well it's still an interesting comparison given before 3.7 absolutely dunked on o1

#

and reasoning models normally suck creatively, R1 being the first to kinda prove that wrong

keen fulcrum
#

FT about to debut next week

balmy mist
#

what time google launching today?

keen beacon
#

in the next 4-5 hrs proba

#

probs

balmy mist
tall summit
#

i like when models dont randomly stop while making a story

keen fulcrum
balmy mist
tall summit
#

and the only models that do that are gemini 2.5 and claude 3.7

balmy mist
quiet pollen
tall summit
hardy violet
# keen beacon well that's interesting

Yeah, I've seen this list before, and tbh I really don't agree with the ranking.
Seeing R1 and V3 ranked so high makes the author's bias pretty clear – they obviously favor that aggressive, exaggerated style, leaning into those kinda modernist philosophy frameworks, or maybe just typical web novel tropes.
Like, I tried O3 today and really disliked its DeepSeek-ish vibe. It tends to over-interpret things and forces these complex frameworks onto everything. That kind of writing might look impressive or even amazing at first glance, but if you actually look closely, the way it abuses vocabulary is a huge problem.
Personally, I lean towards Claude 3.7 (though you need to prompt it well, the raw output isn't great). But right now, Gemini 2.5 Pro has overtaken the Claude series for me.
Also, just generally, I prefer a more prose-like style.

tall summit
#

if only people actually sent links

keen beacon
keen beacon
#

it isn't human judged

#

it's an llm iirc

#

i believe 3.5 sonnet

tall summit
#

Run the 32 writing prompts for 3 iterations (96 items total) @ temp 0.7, min_p 0.1.
Grade the outputs with a comprehensive scoring rubric using Claude 3.7 Sonnet.
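For readers unfamiliar with the `min_p 0.1` setting quoted above, here is a rough sketch of how min-p filtering is commonly described: keep only tokens whose probability is at least `min_p` times the top token's probability, then renormalize. This is not the benchmark's code. The order of operations (temperature first, then the filter) and the toy logits are assumptions.

```python
import math
import random


def min_p_filter(logits, temperature=0.7, min_p=0.1):
    """Temperature-scale the logits, then zero out tokens below min_p * max prob."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]


def sample(logits, rng, temperature=0.7, min_p=0.1):
    """Draw one token index from the filtered distribution."""
    probs = min_p_filter(logits, temperature, min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.5, 0.0, -3.0]  # made-up values for a 4-token vocabulary
filtered = min_p_filter(toy_logits)
token = sample(toy_logits, random.Random(0))
print(filtered, token)
```

With these toy logits the two unlikely tokens fall below the threshold and are never sampled, while the top two keep their relative odds.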

#

3.7

keen beacon
#

but there are some components that are statistically judged

quiet pollen
keen beacon
#

"Test Structure: The benchmark runs multi-turn conversations (up to 21 turns) between the test model (acting as conflict mediator) and actor models (playing clients or disputants). The actor model we use is gemini-2.0-flash-001. Each scenario includes detailed character profiles with specific emotional states and backgrounds.
Assessment Criteria: We score models on:
Basic emotional intelligence skills (recognizing emotions, showing empathy)
Professional skills specific to therapy or mediation
Avoiding serious professional mistakes
How It Works: The benchmark uses three models:
Test model: The AI being evaluated
Actor model: Plays realistic clients or disputants
Judge model: Claude-3.7-Sonnet scores the test model's performance
Scoring: The final score combines:
Scores across multiple skill areas
A count of identified mis-steps and how serious they were
Beyond just scores, the judge provides a critical analysis of specific errors, rating them as minor, moderate, or serious. This helps identify exactly where and how models struggle in realistic professional conversations."

tall summit
tall summit
quiet pollen
#

I love benches

tall summit
#

me too

quiet pollen
#

didn't expect llama to score so low

#

when so many roleplaying AIs are using llama lol

keen beacon
#

still somewhat similar

#

How the benchmark works:

Run the 32 writing prompts for 3 iterations (96 items total) @ temp 0.7, min_p 0.1.
Grade the outputs with a comprehensive scoring rubric using Claude 3.7 Sonnet.
Use this score to infer an initial Elo rating for the evaluated model.
Perform pairwise matchups with neighboring models on the leaderboard (sparse sampling). Items are scored on several criteria, with the winner on each criteria given up to 5 +'s.
Calculate Elo scores using the Glicko rating system (modified to weight the win margin in '+' count). Loop until stable positions are found.
Perform comprehensive matchups with final neighbors and compute the definitive leaderboard Elo.
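The pairwise step above can be sketched roughly as follows. Note the swap: this uses a plain Elo update, not the modified Glicko system the benchmark describes, and the mapping from '+' counts to a score (`margin_score`), the K-factor, and the starting ratings are all hypothetical.

```python
def expected_score(r_a, r_b):
    # Standard Elo expectation that A beats B.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def margin_score(plus_a, plus_b, max_plus=5):
    # Hypothetical mapping: turn the '+' margin into a score in [0, 1],
    # where 0.5 is a tie and a full max_plus margin is a clean win.
    margin = (plus_a - plus_b) / max_plus
    margin = max(-1.0, min(1.0, margin))
    return 0.5 + 0.5 * margin


def update(r_a, r_b, plus_a, plus_b, k=32.0):
    # Move both ratings by the gap between the margin-weighted result
    # and the expected result; total rating is conserved.
    delta = k * (margin_score(plus_a, plus_b) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


blowout = update(1500.0, 1500.0, plus_a=5, plus_b=0)
narrow = update(1500.0, 1500.0, plus_a=1, plus_b=0)
tie = update(1500.0, 1500.0, plus_a=3, plus_b=3)
print(blowout, narrow, tie)
```

The point of weighting by margin is visible in the three cases: a 5-to-0 result moves the ratings much further than a 1-to-0 result, and an even split between equally rated models moves nothing.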

tall summit
#

well depends which llama

quiet pollen
tall summit
keen fulcrum
tall summit
balmy mist
#

i told o3 to make an mcp and use it to make me something, is that an hallucination?

hardy violet
# tall summit if i can read right, this is 100% ai judged just in two different ways

That said, even if we don't agree with some of the conclusions, different results should be respected as long as they follow a consistent standard or methodology. After all, everyone has their own preferences when it comes to LLMs.
But, I still have to add: for subjective things like writing quality, blind evaluations by humans are probably the better way to judge what's actually good or bad.😮

glass arch
#

wait, is google dropping another model today?

#

I gotta run it through my test

balmy mist
#

are there any mods here?

#

we need to have a way to pin latest news, i guess we can use the announcements channel, but that might just be for lmarena stuff

glass arch
#

it seems like every 4 weeks we get a new ai model that dominates

tall summit
balmy mist
#

yeah lets do that

balmy mist
#

but does anyone have a website for all mcps or community created mcps?

#

i wanna try something with o3

tall summit
balmy mist
#

@tall summit delete your message in that thread

#

lets keep it clean for only news stuff

#

we need mods to make that an official channel tho

#

you have a point

#

it might get swept away lol

#

@hollow ivy what ever happened to our music thread?

tall summit
balmy mist
#

you can get it for us and give us the deets

tall summit
#

lmao

balmy mist
#

ahh are u a mod?

#

u are just a discord pro

tall summit
# balmy mist you can get it for us and give us the deets

100 other people will, once it's released
but even if it never does, other people will easily find out what he does, just slightly later
and he shows no signs of stopping to tweet it anyway

and also there are compilations of benchmark scores and company tweets arent that hard to track given there are only a finite number

the other kinds of news (mainly applied ai) are much harder to find thats why most news sources arent as simple as that, but honestly i think thats all you need to keep up with the models themselves

balmy mist
#

this one?

#

i like it, I havent had a chance to try it in my app, but i will try it tonight with o4 mini and see what it can make

#

imma do 50 iterations of it

#

how are you using it?

tall summit
balmy mist
#

yeah but if you put it as system prompt

#

then tell it to run

#

and only pass on outputs and clear context

#

in my app i have a system prompt, then i can add a prompt if i want to, but for this i will just tell it to simulate or something, then it just feeds each call to a model with the system prompt and the previous output
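The loop described here (fixed system prompt, each call sees only the previous output, no accumulated history) might look something like this. The app itself is not shown in the chat, so `run_chain` and `echo_model` are hypothetical names, and the model is a stand-in function instead of a real API call.

```python
from typing import Callable, List


def run_chain(
    model: Callable[[str, str], str],
    system_prompt: str,
    seed: str,
    iterations: int,
) -> List[str]:
    """Call the model repeatedly, feeding it only the system prompt and the last output."""
    outputs: List[str] = []
    previous = seed
    for _ in range(iterations):
        # No chat history is carried over: the context is "cleared" on every call.
        previous = model(system_prompt, previous)
        outputs.append(previous)
    return outputs


def echo_model(system_prompt: str, user_input: str) -> str:
    # Stand-in for a real model call; it just marks that one step happened.
    return f"{user_input}+"


steps = run_chain(echo_model, "You are a world simulator.", "start", 3)
print(steps)
```

Because each call carries only one previous output, context size stays flat no matter how many iterations run. The trade-off is the one raised a few messages later: anything not restated in the latest output (character details, for example) is forgotten unless a separate memory step puts it back.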

#

so you dont have to worry about context as muhc

#

much*

#

actually let me do that now and let it run for an hour, but i wish we still had the free models

#

the only thing i might have to do is consistency of characters etc..

#

might need to have a memory agent or system to deal with that

#

yeah i could do that, but its not automated

#

i like just feeding the model an input and letting it simulate and cook

#

yeah

#

see if it can create the world

#

lol

#

have you tried the prompt with multiple studios?

#

like have like 5 studio windows

#

and have one be the game director or sum

#

with the system prompt

#

and the others be character or players

#

and feed each other outputs

#

yeah that should be big enough

#

hmm that might be a good question for chatgpt lol

#

im lobotomized by ai now

#

even give it your prompt for context

#

what is genius level?

brittle tiger
tall summit
balmy mist
brittle tiger
#

if that score holds up I really don't understand the 200k context window. it's not that hard to fill up 200k with long thinking models

balmy mist
#

what makes it more impressive is that 120k is 60% of its context

#

while gemini 120k is 12% and scoring 90% vs o3 with 60% of context scoring 100%
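The percentages check out if o3's window is 200k tokens (as stated above) and Gemini's is taken to be 1M tokens. The 1M figure is an assumption implied by the 12% claim and is not stated in the chat.

```python
def context_fraction(tokens_used, window):
    # Share of the context window occupied by the test input.
    return tokens_used / window


o3_share = context_fraction(120_000, 200_000)         # 200k window, from the chat
gemini_share = context_fraction(120_000, 1_000_000)   # 1M window is an assumption

print(f"o3: {o3_share:.0%}, gemini: {gemini_share:.0%}")
```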

#

o3 is extremely impressive

#

still debatable lol

sinful vessel
keen fulcrum
brittle tiger
keen fulcrum
#

Any update on when o4 mini and o3 will be added to lmarena?

balmy mist
tall summit
keen fulcrum
hardy pecan
#

they are all already there

#

just run a few prompts and youll get it

sonic tendon
sonic tendon
#

anyway, "better" is sort of subjective

cedar tide
#

for overall performance i think grok 3 thinking high will be roughly on par with o3

sonic tendon
#

what makes you think that grok's gonna add reasoning effort levels? they seem to be focusing on other stuff atm

hardy pecan
#

Fun fact, GPT4o was released essentially 1 year ago today

cedar tide
sonic tendon
thorny drum
#

grok 3 mini (high) does pretty well on livebench already

keen beacon
#

works at deepmind

#

good chance this is an svg generated by an upcoming model (one of the ones being launched today)

balmy mist
torn mantle
#

flash thinking or that dragontail model

balmy mist
#

i hope its nw

barren prairie
#

From what I know, something is going to change in the Gemini app ...because when I opened Gemini it said there are new models 🙃✌️

#

So they are working

#

On something

ember rapids
#

is dragontail flash 2.5?

keen beacon
#

woah

#

we're finally beginning to saturate these too

tall summit
#

HAHAHAHAHA

tawdry meteor
tall summit
#

first time im seeing ai models being tested on iq tests

keen beacon
#

this has been a thing for like

#

over a year

#

they do an offline one and a mensa lne

#

ome

#

one

#

it's kinda silly to do on llms but interesting all the same

tall summit
keen beacon
#

they also give every model the political compass

#

which is also interesting

tall summit
#

oooooh

keen beacon
#

let me find it

tall summit
#
Tracking AI

Tracking AI is a cutting-edge application that unveils the political biases embedded in artificial intelligence systems. Explore and analyze the political leanings of AIs with our intuitive platform, designed to foster transparency in the world of artificial intelligence. Stay informed and uncover the political inclinations shaping the algorithm...

#

i am not surprised at all

keen beacon
#

isn't it funny

#

how the smarter the model is

#

the more lib left it gets

#

would like to see o3 tho

tall summit
#

isnt it there

balmy mist
# tall summit LMAO

you dont say anything when I posted it but when leo posts you're shocked lmaooo

keen beacon
#

haven't ran it yet by the looks of it

tall summit
#

nop thats o3 mini

#

just saw an o3 circle

tall summit
#

or maybe i did and forgot

balmy mist
#

u responded right uder lmaoo

#

under*

tall summit
#

i dont remember at all

balmy mist
#

like 3 minutes after lol

tall summit
#

oh sorry i subconsciously ignored that message

#

because i got immediately interested in the ficlive benchmark stat from o3

balmy mist
#

thats tiktok for you lmaoo

tall summit
#

and scrolled down to see peoples discussions about that, which moved your post further away from my consciousness

tall summit
balmy mist
tawdry meteor
tall summit
elder rapids
#

the long context thing for o3 is major cap

#

it cannot handle long context 😭 🙏

keen beacon
#

well this is gonna be interesting

balmy mist
keen beacon
#

YO

elder rapids
#

no way

balmy mist
#

i knew my babby was coming

#

yes way!!!!

keen beacon
torn mantle
#

ll;]ajks;]LAJKS'L;ASJDF'KL;ASJF'AKL;SFJASK'L;JDFASK'LJFKAL'SFJ'ASLKF

#

STOP

#

no way

#

finally 😭

keen beacon
#

if they drop a SOTA code model i am all in on deepmind

elder rapids
tall summit
#

@keen beacon thanks for sacrificing yourself browsing twitter to deliver news so we don't have to go in the twitter hellscape ourselves

torn mantle
#

what it will do

glass arch
#

are there betters for AI? because I'm going all in today

elder rapids
#

ye but actually read the case lmao

#

this is not even crazy

keen beacon
balmy mist
#

i love google man

keen beacon
#

otherwise id move

balmy mist
#

i love how they do this to openai

torn mantle
#

no i mean sonnet is mostly used for coding, now imagine a better model comes in cheap, what do you think will happen?

balmy mist
#

that means nightwhisper is better than we thought

keen beacon
#

also notice it says "CODE MODELS"

torn mantle
#

thats crazy

keen beacon
#

probably a version based on flash and a version based on pro

tall summit
#

whats the 1P part of 1P CODE MODELS mean

keen beacon
lime coral
#

1 party

#

Obviously

#

It’s a party

tall summit
#

only 1

lime coral
#

Satya said he will make them dance they bring the music

brittle tiger
lime coral
#

Logic

tall summit
quick flame
#

when will the new OpenAI models be on the leaderboard? Weird that they were not in the anonymous testing

keen beacon
#

no no

#

it means first party

#

i just realised

#

as in, their own models

balmy mist
#

ohh

tall summit
#

whats gemini if not their own models

keen beacon
#

maybe they partner with a lab to offer a third party option down the lime

#

line

#

who knows

tall summit
#

🤷

keen beacon
#

or perhaps an OSS model or rwo

#

two

balmy mist
#

damn they really stealing openai shine the day after

elder rapids
elder rapids
#

and sucks at everything else

#

ngl Google could legitimately blunder this

balmy mist
#

i believe in google

#

why would they rush a launch like this

#

if it was not good

#

like right after openai

#

it kinda has to be good

tall summit
#

frontend is cool

#

still an improvement in vibe coding 🤷

torn mantle
#

but lets see

elder rapids
brittle tiger
#

I don't think we get nightwhisper. the hype posts have been subtly hinting at flash 2.5. they definitely know there is excitement about nw and would play into that if it was coming. idk

elder rapids
#

if they're updating for code models

#

and night whisper really is THAT good

keen beacon
#

god if only trump stayed a comedian..

#

these are all great

elder rapids
#

then they have to be releasing more than that

thorny drum
#

they're weirdly aware of the hypeposting from like 1000 follower twitter accounts lol

#

sundar pichai tweeting nebula was not something i expected

elder rapids
hardy violet
#

Hmmm? Heard they're releasing Nightwhisper? Is there any solid proof?
You sure this isn't just another Google smokescreen? We haven't forgotten about the 2.5flash 0409 model, have we?

keen beacon
#

(denied by Trump)

#

lmfaooo

elder rapids
#

they didn't release 2.5 pro

#

😔

tall summit
brittle tiger
#

nebula had more hype than nightwhisper if im remembering right

keen beacon
#

yup

keen beacon
#

yes really

elder rapids
#

nebula was known to be super good

#

but night whisper is being talked about just as much

keen beacon
#

nightwhisper was a very "in our bubble" thing

#

nebula got out of the bubble

elder rapids
#

just not in faith of "this is amazing"

#

but "Google is cooking"

#

and just left there

#

you can see this in the subreddits too

#

people seem to know a ton about nightwhisper

balmy mist
elder rapids
#

I don't got a j*b

#

I should know

#

sorry y'all

balmy mist
#

it could be by design, since nw was there for like 2 days barely

elder rapids
#

people just pick and choose tho

#

with the nebula precursors

#

the early 2.5 pro checkpoints

#

they weren't really worse

hardy pecan
#

nebula was a beast

elder rapids
#

just less good at very specific tasks

#

and they weren't talked about in the light of being the beast it is now, but still acknowledged

hardy violet
#

Alright folks, it's getting super late here, almost the 18th already.
Gotta head to bed now. Night everyone! 👋
Hope I wake up to some Gemini 2.5 Pro Coding, Pro High, or 2.5 Flash news tomorrow morning! 🙏

elder rapids
#

man hopefully

balmy mist
#

i love the diversity in this chat

keen fulcrum
#

Deepseek is a joint effort of China and Russia
Its incorrect to call it a chinese AI

plain zinc
#

Finally, Google will release models finely tuned for coding (competitors for the Claude family of models)! 🔥👀

keen beacon
#

wow

barren prairie
#

But nothing happened

balmy mist
#

maaybe at 1pm est?

lime coral
tall summit
#

damn i didnt actually see aime 2025

#

the issue is..

#

aime is not hard in comparison to most other math

#

even if you dont want to dive into proof based contests, there are many much harder ones

tall summit
#

still funny to me how o4-mini is better than o3 at math+coding

elder rapids
#

cuz that's just what it's meant for

#

o4 mini has been pretty bad in my testing for everything else but puzzles+code

keen beacon
tall summit
#

technically

#

i hope this is not how the real competitors solved this

ocean vortex
tall summit
#

ok none of the aops solutions actually work with the polynomial in this form

elder rapids
keen ferry
#

night whisper was really too good? I never tried it

barren prairie
balmy mist
leaden meteor
#

Its 1pm and still no new model release by google yet?

barren prairie
balmy mist
#

lol jk

torn mantle
#

🚀 Meta FAIR is releasing several new research artifacts on our road to advanced machine intelligence (AMI). These latest advancements are transforming our understanding of perception.

1️⃣ Meta Perception Encoder: A large-scale vision encoder that excels across several image &

#

this is kinda interesting

cloud meadow
#

@wooden mulch

#

What are your aspirations for LMArena?

compact knoll
#

is o3 better than o1 ? (about resolving problems, maths..)

cloud meadow
calm spear
#

random question:

do LLMs help people study and acquire information in say Africa & developing countries?

wintry tinsel
#

New LM arena is a fat improvement

barren prairie
brittle tiger
#

not sure if anyone has mentioned but I'm getting o3 on arena

opaque adder
#

cant even see the code output in beta.. nice alpha was better

gleaming adder
#
"defaultInferenceSettings": {
        "system": "Over the course of conversation, adapt to the user's tone and preferences. Try to match the user's vibe, tone, and generally how they are speaking. You want the conversation to feel natural. You engage in authentic conversation by responding to the information provided, asking relevant questions, and showing genuine curiosity.\n\nYour output will be rendered in a web UI, so use valid markdown format, tables, Latex, or emojis to make the content more engaging and user friendly."
      },
cloud meadow
#

An LLM can hallucinate, books written by people who know what they are talking about don't.

compact knoll
cloud meadow
compact knoll
#

im not denying that LLMs hallucinate :)
that’s because they “predict” words rather than “know” them like a human would, im more talking about the fact that you wouldn’t believe how many books by so-called “experts” contain mistakes ;)

balmy mist
#

damn google trolling

plain zinc
#

Wdym?

balmy mist
plain zinc
torn mantle
#

or red teaming idk

plain zinc
#

lol

#

patience

lime coral
elder solar
#

is there any news about a newer LLM that can listen to audios?

#

the only llms that can are gemini and ernie models

elder rapids
#

this wasn't what was ruled

#

if it's made out to be so bad

#

it's not

#

look into it yourself

visual turret
#

On AI studio if you are using the free version you are using Gemini 2.5 pro exp

vague orbit
#

i thought it was gemini 2.5 pro experimental

visual turret
#

Gemini 2.5 flash is coming in a few weeks

elder rapids
visual turret
#

But I suspect it is the end of the month

visual turret
vague orbit
#

I use cursor, and I use claude code, and those work well for claude 3.7. is there a Best Way to use Gemini 2.5 pro on a code base?

visual turret
#

Gemini 2.5 pro preview is really good

brittle tiger
visual turret
#

Gemini 2.5 pro exp sucks

visual turret
elder rapids
#

not a single change

#

not a touch

#

not an ounce of change

visual turret
keen beacon
elder rapids
keen beacon
#

there will be a launch today

elder rapids
#

😭 🙏

#

I know Gemini the most

keen beacon
#

the time doesn't really matter, they've released as late as 10pm BST and as early as 12pm BST before

visual turret
keen beacon
elder rapids
vague orbit
#

so, is everyone using gemini via the website? like animals?

visual turret
elder rapids
vague orbit
keen beacon
#

they're the exact same thing

#

lol

#

they are the same model chief

visual turret
keen beacon
#

yes dawg

#

they are the same

#

i know for a fact

visual turret
keen beacon
#

it's just naming differences

#

because of paid preview vs free experimental

visual turret
keen beacon
#

bro.

#

yes they are

#

i know for a fact

#

this is coming from people at deepmind

visual turret
#

Gemini 2.5 pro preview is a further trained version of Gemini 2.5 pro

keen beacon
#

no it isn't..

visual turret
#

It is

keen beacon
#

facepalm

#

no it ISN'T

#

omfg

visual turret
#

This is a common practice

keen beacon
#

mate

elder rapids
keen beacon
#

it has been said already that preview is just the name for the version of the model with increased api rate limits

#

that is it.

visual turret
keen beacon
#

it is not a model update

#

are you stupid

visual turret
keen beacon
#

???

#

buddy

visual turret
keen beacon
#

copilot isn't going to know is it

#

it's literally just pulling from generic web sources

visual turret
keen beacon
#

..

#

that is copilot

keen beacon
#

ai in bing is copilot

visual turret
balmy mist
keen beacon
#

if it was a new model the date at the end of the ID would not be the same and they would publish new benchmark scores
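The date-suffix argument can be turned into a quick check. The sketch below assumes model IDs end in an `MM-DD` date, and the example IDs only illustrate that naming pattern. They are not pulled from any API response, and `date_suffix` and `same_snapshot` are hypothetical helper names. A matching suffix is a hint that two names point at the same snapshot, not proof.

```python
import re


def date_suffix(model_id):
    # Return the trailing MM-DD portion of a model ID, or None if there is none.
    match = re.search(r"(\d{2}-\d{2})$", model_id)
    return match.group(1) if match else None


def same_snapshot(id_a, id_b):
    # Heuristic only: two IDs with the same trailing date are likely the same checkpoint.
    a, b = date_suffix(id_a), date_suffix(id_b)
    return a is not None and a == b


exp_id = "gemini-2.5-pro-exp-03-25"          # illustrative ID
preview_id = "gemini-2.5-pro-preview-03-25"  # illustrative ID

print(same_snapshot(exp_id, preview_id))
```

Comparing the IDs this way answers the naming question, while published benchmark scores (the other signal mentioned in the message) would be needed to confirm that nothing else changed.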

upper wolf
#

People actually pay attention to the browser’s ai suggestions? we’re cooked man…

balmy mist
#

in all my tests they have been the same tbh

fleet lintel
#

LMArena is now a company. That's interesting!

keen beacon
#

literally one second of thought to reach that conclusion

zinc ore
# visual turret

Your source is character AI, how does that tell us about Google's naming practices?

balmy mist
#

just check the api lmaoo

visual turret
keen beacon
visual turret
zinc ore
#

They've not introduced a further trained version of 2.5 pro yet, from initial release

keen beacon
#

i have used the model hundreds of times

#

both as exp and as preview

#

they are exactly the same

balmy mist
upper wolf
#

Microsoft is the biggest company in the world

#

Wdym monopoly

upper wolf
#

Bro think hes progressive for using Bing 😭 ?

plain zinc
#

Are there any new Google models in LMarena?

visual turret
fleet lintel
visual turret
#

I also got a paper in Harvard

zinc ore
balmy mist
#

just go to google documentation

#

why you going everywhere else but google

visual turret
# balmy mist broo
ADS

Existing music generation models are mostly language-based, neglecting the frequency continuity property of notes, resulting in inadequate fitting of rare or never-used notes and thus reducing the diversity of generated samples. We argue that the distribution of notes can be modeled by translational invariance and periodicity, especially using d...

compact knoll
zinc ore
#

This guy is trolling right? Lol

compact knoll
visual turret
#

At the bottom

upper wolf
balmy mist
balmy mist
visual turret
#

Why are you all so closed minded

#

Jeez

upper wolf
#

closed minded

compact knoll
#

everyone is telling him he's wrong but his ego won't let him hear it

visual turret
#

Ask any AI and it will agree with me

zinc ore
#

Also, the Gemini 2.5 version hasn't been called stable yet

upper wolf
#

You sound insecure asf nobody here is questioning your intelligence but yourself

#

We’re just sayin youre wrong about the ai

balmy mist
visual turret
zinc ore
keen beacon
#

just a thought

keen fulcrum
#

Is gpt 3.5 returning?

keen beacon
keen beacon
#

memories

zinc ore
#

Vertex so far

visual turret
# balmy mist

Albert Einstein: 'If you can't explain it simply, you don't understand it well enough.'

balmy mist
#

omgg

#

everybody stop

#

lock in

visual turret
balmy mist
balmy mist
#

from jimmy apples

#

so it should be any minute now

keen fulcrum
balmy mist
keen beacon
zinc ore
keen beacon
#

there it is

balmy mist
fleet lintel
keen beacon
#

am gonna see if this is Dragontail

keen beacon
#

will crank up the thinking budget

#

hopefully i dont go bankrupt

upper wolf
#

lemme check studio

balmy mist
#

yeah its def not nightwhisper lmaoo

#

wait

#

this model....

#

built in web search?

#

i dont have google grounding on but its using web

fleet lintel
balmy mist
#

this might actually be nw

#

the thinking is taking a while for a flash model

fleet lintel
#

no way that NW is flash model.. please

keen beacon
#

just tried it

fleet lintel
#

so much thinking for Flash model.. it's crazy

keen beacon
#

it may be dragonwhisper but hmm

#

it does seem very good for flash

balmy mist
#

nahh bro this might be nw

#

yoo

keen beacon
#

and uses more reasoning tokens than 2.5 pro

#

it isn't nightwhisper lol

#

i doubt it

sage raptor
#

is this nightwhisper ?

balmy mist
#

it made that

#

hold up let me use another site to share it

zinc ore
#

Vertex atm

barren prairie
#

Let's wait
I want it on Ai studio

balmy mist
#

@torn mantle

#

help me test this lmaoo

#

@keen beacon are you using the manual or auto thinking? im scared to touch that lol

leaden meteor
#

How come leaderboard is not updated if flash is already out? I am sure nw or dragontail is flash...?

balmy mist
#

this model is better than 2.5 pro imo, need more tests, but on a one shot coding it clears it

narrow elbow
keen beacon
#

max

#

i have money to burn

fleet lintel
zinc ore
#

We'll likely get an arena update on it sometime today

keen beacon
#

hasn't even been actually announced yet

balmy mist
#

never used this platform before

#

why is there studio and vertex?

zinc ore
#

Someone got this

fleet lintel
#

how do I make it think less? I want really fast responses like sub 2 seconds for 128K tokens

keen beacon
#

you can turn off thinking

fleet lintel
#

pinged my team to start working on testing this stuff.... I am excited!

balmy mist
fleet lintel
elder rapids
#

how fast is it

balmy mist
#

yo this model is fire!!!

balmy mist
rose thicket
balmy mist
#

when its coding its auto searching and using git repos

#

i think it is tbh, but others say nah, im doing more tests

elder rapids
#

is it smart?

#

please tell me it's smart asf 🙏

balmy mist
#

nahh its not nw

#

but its better than 2.5 pro at coding

#

but not on nw level

rose thicket
#

I haven't got access yet

elder rapids
zinc ore
#

Also has native tool calling, which 2.5 pro doesn't have

elder rapids
#

it is?

fleet lintel
zinc ore
fleet lintel
zinc ore
#

"call tools natively"
"Agentic use cases"

dapper storm
#

Wow 😲😲😲

keen beacon
#

probably cheaper, google has so many spare gpus

balmy mist
rose thicket
#

Share the prompt plz!

balmy mist
#

and the output was better than what I got from 3.7 and 2.5 in zero shot

#

okay, one sec, running one more test

#

i had thinking on max and it gave me a slightly better pokemon sim than 2.5, but let me try auto again

rose thicket
lime coral
keen beacon
#

if 2.5 flash beats 2.5 pro they are cooking SO hard

#

and it's cheap asf

fleet lintel
keen beacon
balmy mist
keen beacon
#

it can and probably will beat it in at least one category

#

i didn't say i expect it to beat 2.5 pro universally

#

but smaller models have their strengths

#

edpeciallt

#

especially*

#

when trained well

rose thicket
torn mantle
#

test what?

leaden meteor
#

Flash is smaller model? I thought it was taking more tokens than 2.5 exp?

balmy mist
#

auto thinking

keen beacon
balmy mist
sage raptor
balmy mist
rose thicket
#

I figured out that 2.5 pro just works better on 0.65 temp

torn mantle
balmy mist
#

max thinking could be overthinking

rose thicket
brittle tiger
#

I just got access on AI studio. Rollout happening for sure

balmy mist
torn mantle
brittle tiger
#

I spoke too soon lmao

torn mantle
#

this seems similar to stargazer

golden ocean
brittle tiger
#

It is possible to select for me. just no outputs yet

golden ocean
#

then u will get 2.5 flash

rose thicket
golden ocean
balmy mist
#

i got it on studio!!!

#

refresh yall

elder rapids
#

apparently it's better in coding

golden ocean
#

Will it give error

brittle tiger
#

working for me now

golden ocean
#

same despite error it responded

fleet lintel
balmy mist
golden ocean
elder rapids
balmy mist
#

i dont see thinking budget tho

#

am i blind?

keen beacon
#

🤣

fleet lintel
#

may be it's only for Vertex users

thorny drum
#

CONFIDENTIAL

fleet lintel
#

I am thinking budget part

thorny drum
#

leaking insider info yet again

golden ocean
keen beacon
golden ocean
elder rapids
elder rapids
#

how much for 2.5 pro?

golden ocean
zinc ore
#

I have flash now too

fringe carbon
#

what is the general consensus between 2.5 and the new gpt models?

keen ferry
#

i like that it just disappears and then appears again

fringe carbon
#

can anyone here really definitively say one is better?

#

cuz ngl they are so close to me

#

different style wise

brittle tiger
#

is flash 2.5 the first model to determine whether thinking tokens are needed or not with "auto" selected?

rose thicket
#

Flash seems to be tailored for coding purpose

elder rapids
#

alright I got flash 2.5

balmy mist
#

mine is not in confidential anymore

elder rapids
#

alr give me a query

#

I'm gonna ask it stuff

golden ocean
#

Let a < b < c be distinct natural numbers. Must every block of c consecutive natural numbers contain three distinct numbers whose product is a multiple of abc?

balmy mist
#

its fast, slightly faster than 2.5

narrow elbow
#

🤪

brittle tiger
#

I keep getting errors. Gonna go do an errand. at least it's confirmed for today

balmy mist
#

actually i cant tell if its faster

keen beacon
#

Click the timer

#

It has latency and tps

#

woah.. hello

#

Huh

#

GeoGuessr time lol

#

IT'S US ONLY 💔

ember rapids
#

U can set the thinking budget on ai studio

#

Pretty cool

brittle tiger
golden ocean
#

I got it on eu now

keen beacon
#

I just got 2.5 flash they're rolling it out fast

golden ocean
zinc ore
#

Looks like you can set the thinking budget on aistudio

balmy mist
#

and its free lmaoo bruhh

keen beacon
#

I don't have yet either

ember rapids
#

Toggle thinking mode off and on and you should see it

balmy mist
#

yupp refreshing did the trick

narrow elbow
#

refresh, got it

fleet lintel
#

Nice! got it on aistudio!

keen beacon
#

Just got it

keen ferry
ember rapids
#

mann google is cooking

golden ocean
#

"no"

keen beacon
#

Is the thinking budget working for yall

#

It's being ignored for me it seems

elder rapids
balmy mist
#

aii time to put flash against pro, give me some prompt for games and web dev

ember rapids
#

I think flash with max thinking tokens beats 2.5 pro at coding

balmy mist
#

i got like 3 tabs open testing lmaoo

#

play it a lil

#

animations are crazy tbh

#

@torn mantle

narrow elbow
golden ocean
#

its thinking forever for me

elder rapids
#

can't get an answer

balmy mist
#

look at the charizard one

#

and prompt was make a pokemon game and i just told it to add a new feature and animations

sage raptor
#

it might be nightwhisper or something close

#

with max thinking tokens

balmy mist
#

to use 24k thinking tokens is wild tho

leaden meteor
#

2.5 flash is meh. It's worse than grok3.

#

Nowhere close to 2.5 pro

balmy mist
keen beacon
#

oh the thinking budget works but the scale is weird

elder rapids
balmy mist
elder rapids
#

2.5 flash is pretty good

torn mantle
#

it wrote all of that?

#

thats crazy

balmy mist
balmy mist
torn mantle
leaden meteor
#

Lol. Leaderboard is updated. haha...

keen beacon
#

i set the thinking budget to 2 and it does way more than 2 tokens in its thoughts but it cuts off after a while

#

it also breaks the model at least on the prompt im using lol

balmy mist
keen beacon
balmy mist
#

and it works like you dont see the thinking anymore?

torn mantle
balmy mist
#

i wonder why they dont do it for pro

elder rapids
#

flash 2.5 is crazy cheap

ember rapids
barren prairie
#

Cute robot Gemini flash 2.5 🥺🩷🩵

cedar tide
#

result of 2.5 flash on this prompt

elder rapids
#

check price lmao

keen beacon
#

flash 2.5 is much cheaper tho?

#

it might be a better model but the pricing

elder rapids
#

1/10 the input price

#

and if they're calculating reasoning

#

this is probably maximum

balmy mist
#

i knew you would come

#

to preach the good word of openai lol

zinc ore
#

Just wait until we see the pricing on aider

elder rapids
#

where o4 mini is probably around 15~ dollars maximum

wintry tinsel
zinc ore
#

Pro is 1/3 of the mini pricing, so flash should be noticeably cheaper

elder rapids
#

what does?

thorny drum
#

60c (but 350c to get the results on the benchmarks)

keen beacon
#

2.5 flash is getting stuck in a thinking loop for me rn :\

fleet lintel
#

o4 mini should be compared with 2.5 Pro. they have similar price. And in practice, o4 mini is expensive compared to 2.5 pro

elder rapids
keen beacon
keen beacon
#

yea

elder rapids
#

yeah but then you shouldn't make it seem like that's not the claim either

keen beacon
#

dont get me wrong 2.5 pro/2.5 flash are amazing but yeah it gets stuck on this

elder rapids
#

o4 mini should be compared with 2.5 pro

#

since [state your reasoning]

fleet lintel
#

check again... $1.10 vs $1.25.. they are almost the same. but to solve the same problem, o4 mini is more expensive

elder rapids
#

ye

#

dude

keen beacon
#

omg it just did 40k+ tokens in reasoning 🤣

#

and gave up

#

btw turn off thinking budget it can do way more tokens if it does cap it at 25k

#

thinking budget doesnt seem to help the model

#

where o3

#

select it in direct chat

#

or do arena battle

fleet lintel
#

i want cheap and fast models for my use-case. honestly, i have no alternative compared to google flash models

balmy mist
#

sonnet is ur fav coding model?

#

lol

elder rapids
#

ye I guess now it really is up to preference

balmy mist
#

yeah if you like to spend money or not

torn mantle
balmy mist
#

lmaoo craig you live to hate on google man

#

you gotta sauce them up a lil

#

give them some lovin

cedar tide
brittle tiger
#

Much cheaper

zinc ore
#

That's pretty good, nearly 1400

opaque adder
cedar tide
balmy mist
#

the game is a lil wonky

#

but impressive

brittle tiger
#

it does. convo was about price tho

cedar tide
leaden meteor
#

2.5 flash can't be nw or dragontail, can it? I remember now or dt doing pretty well when compared to 2.5 pro...

#

Nw or dt*

sage raptor
#

nahh

#

not nw

#

or dt

leaden meteor
#

Wonder why they are not released yet.

#

So, what was flash? River hollow?

brittle tiger
# cedar tide

Conveniently cropped out the context: it doesn't include the price of repeats, which is why o4-mini price was so much higher on Aider, because he does include the real cost.

fleet lintel
fleet lintel
leaden meteor
#

I guess Gemini is waiting for o3 or o4mini to come on leaderboard to steal the thunder again with night whisperer or dragontail...

silk haven
fleet lintel
lime coral
#

So name reveal of flash?

silk haven
lime coral
#

Stargazer? Dragon tail?

torn mantle
#

google are killing it lately

distant egret
#

can anyone guide me how to have perplexity deep research api in custom gpts of chatgpt?

drifting elk
#

Plz add web search in lm arena beta version

balmy mist
#

give me some prompts

drifting elk
#

You are a vibe coder

#

Arghhh

#

I think pro is way better than flash

olive mesa
#

i mean yeah

drifting elk
#

Nop

olive mesa
#

flash isnt supposed to be better than pro

drifting elk
#

What makes you think flash is better

drifting elk
balmy mist
#

i think flash is better with tools imo

drifting elk
#

I have tested several models for coding and came out with a conclusion: whether you use grok or gemini or gpt or Claude these llms are helping programmers not replacing them

balmy mist
#

nahh i love flash

tall summit
#

heyyyyyy

#

beta came out

drifting elk
#

So pure programmers are better and excellent than vibe coders

balmy mist
#

i think its vibes are good

tall summit
#

wait more stuff came out?

balmy mist
#

just like photographers are better than any joe with a smartphone

tall summit
drifting elk
#

And a programmer who uses AI (to help him) is much excellent and better than a casual one

tall summit
#

thanks @balmy mist even though we argued like once

you kept the thread updated and it helped !!

drifting elk
#

Vibe coders will find jobs but not better ones

tall summit
#

oh its available

tall summit
#

on ai studio?!

balmy mist
#

with thinking modes

tall summit
#

hell ya

balmy mist
#

budgets*

tall summit
#

well i wont use it

drifting elk
#

So do not say coders are dying gpt 5 is agi, Gemini is the best in the universe!!!

balmy mist
#

it seems reg thinking is better

drifting elk
#

Remember that

tall summit
drifting elk
#

Pro

#

Is better

balmy mist
#

when you increase thinking it makes it worse based on our tests

#

but i would test it yourself

tall summit
#

i didnt know there was that much thinking customization

#

in the first place

balmy mist
#

yeah it seems new with flash, not sure why they dont have it with pro

tall summit
#

oh phew new with flash

#

thinking budget is measured in tokens? i guess i could just search that up

drifting elk
#

Even if flash is experimental it won't replace pro or outperform it

tall summit
#

we'll see

drifting elk
#

Even with custom thinking