#general | Arena | Page 25

balmy mist Apr 17, 2025, 10:02 AM

#

like the value

#

so you can speculate about it?

#

make a youtube vid on it?

tall summit Apr 17, 2025, 10:03 AM

#

.

tall summit Apr 17, 2025, 10:03 AM

#

balmy mist so you can speculate about it?

there is no value besides speculation

balmy mist Apr 17, 2025, 10:03 AM

#

lol

tall summit Apr 17, 2025, 10:03 AM

#

did you even read any of my messages

balmy mist Apr 17, 2025, 10:03 AM

#

yes you are not saying anything

tall summit Apr 17, 2025, 10:03 AM

#

where i said its obviously not worth it for these reasons like minutes ago

balmy mist Apr 17, 2025, 10:03 AM

#

then why are you still talking?

tall summit Apr 17, 2025, 10:03 AM

#

what

#

because i like this server

balmy mist Apr 17, 2025, 10:04 AM

#

like i said its pointless

#

but you are arguing a point you dont back

tall summit Apr 17, 2025, 10:04 AM

#

i am not arguing it

#

i said this is what it does objectively

balmy mist Apr 17, 2025, 10:04 AM

#

bro touch some grass

tall summit Apr 17, 2025, 10:04 AM

#

and i also said that given the objective reasons TO BACK IT

#

i wouldnt personally

balmy mist Apr 17, 2025, 10:04 AM

#

lol

#

okay dude

tall summit Apr 17, 2025, 10:04 AM

#

balmy mist bro touch some grass

lmao reminds me of

#

@keen beacon see, relatable

hardy violet Apr 17, 2025, 10:13 AM

#

I used the less-smart Gemini 2.0 Flash to translate this. My English is too bad to communicate.

opaque adder Apr 17, 2025, 10:14 AM

#

hardy violet I used the less-smart Gemini 2.0 Flash to translate this. My English is too bad ...

Use 2.5 pro and translate a complex sentence in ur language and I’ll review

hardy violet Apr 17, 2025, 10:16 AM

#

The translation quality of 2.5 Pro is quite impressive, reaching a level comparable to lyrics translated by human translators. As for a more detailed evaluation, I wouldn't know how to do that properly; it requires someone with a proficient command of a second language to assess.

tall summit Apr 17, 2025, 10:24 AM

#

hardy violet The translation quality of 2.5 Pro is quite impressive, reaching a level compara...

ah. well, i'm extremely excited and hopeful for more advancements in machine translation

alpine coral Apr 17, 2025, 10:36 AM

#

hardy violet I used the less-smart Gemini 2.0 Flash to translate this. My English is too bad ...

fwiw i think the translation quality was perfectly fine. it communicated the points you wanted to convey (i assume anyway lol) - job done. i mean yeah, some of the terms used had a kinda distinct AI feel to it, but it wasn't that obvious..

#

like i'd just use whatever is easiest and quickest, with baseline reliability, in the context of communicating in a discord chat.. perfection is overkill

#

btw do you also translate from English to Chinese?

tall summit Apr 17, 2025, 10:39 AM

#

openrouter isn't allowing free gemini 2.5 pro anymore 🚎

alpine coral Apr 17, 2025, 10:40 AM

#

must be getting close to general availability / no more experimental endpoints

hardy violet Apr 17, 2025, 10:53 AM

#

alpine coral fwiw i think the translation quality was perfectly fine. it communicated the poi...

lol, yeah I can mostly understand the English here, but definitely miss some things. Discord's website doesn't play well with Edge's built-in translator, so I'm using a plugin called "Youdao Lingdong Translate". It works pretty well!（by gemini 2.5pro）

alpine coral Apr 17, 2025, 10:55 AM

#

ha yeah tbf gem 2.5 pro def delivered a more 'discord' feel / tone compared to flash

calm sequoia Apr 17, 2025, 11:32 AM

#

Has anyone made any personal benchmarks on o3 vs 2.5 PRO yet?

hardy pecan Apr 17, 2025, 11:37 AM

#

Anyone get the feeling o3's output is limited or nerfed? The output tokens feel limited and a bit misguided

#

I've tried to ask for specifics but it'll be vague and not fully listen to me, it's annoying

calm sequoia Apr 17, 2025, 11:45 AM

#

Somehow the o3-mini underperforms strongly in the arena compared to the official webpage. Could be tool usage.

#

Or the arena variant is "low" or "medium"

keen beacon Apr 17, 2025, 11:47 AM

#

the arena is on medium

#

and yes tools really help it

calm sequoia Apr 17, 2025, 11:48 AM

#

It appears to be a sampling issue. This time the drawing is perfect. It failed on 2.5 PRO though.

#

I don't know how the benchmark can be updated if the 2.5 PRO is crashing EVERY TIME

hybrid shard Apr 17, 2025, 11:49 AM

#

error code 1 with gemini models is when they refuse the request due to filters, iirc

hardy pecan Apr 17, 2025, 11:53 AM

#

Do we know if plus users get o3-medium or o3-high? And is it different to pro users? Say pro users get o3-high?

brittle tiger Apr 17, 2025, 11:58 AM

#

hardy pecan Do we know if plus users get o3-medium or o3-high? And is it different to pro us...

i believe o3 high is only available on API right now. plus users get 50 o3-med per week rn

glass arch Apr 17, 2025, 12:02 PM

#

is there much of a difference between o4-mini and o4-mini-high?

#

also, it seems they finally removed the emojis from o4-mini

keen beacon Apr 17, 2025, 12:04 PM

#

glass arch is there much of a difference between o4-mini and o4-mini-high?

yeah

#

aye, finally found somewhere with them

glass arch Apr 17, 2025, 12:06 PM

#

lol. "low-effort"

#

like it's sitting around like "eh why don't ya ask me later"

calm sequoia Apr 17, 2025, 12:10 PM

#

Found a new test approach: "Recreate this page in HTML format so it can later be converted to an A4 PDF page". All models fail except o3 and o4-mini.

plain zinc Apr 17, 2025, 12:13 PM

#

calm sequoia Apr 17, 2025, 12:19 PM

#

plain zinc

*(no tools). Why handicap the model?

#

It would be funny if 2.5 PRO is so good because of the same reason 3.5 Sonnet was good (they got lucky)

lime coral Apr 17, 2025, 12:42 PM

#

calm sequoia It would be funny if 2.5 PRO is so good because of the same reason 3.5 Sonnet wa...

there is no luck at this scale

calm sequoia Apr 17, 2025, 12:42 PM

#

Every time you train a model, luck is involved (local and global minima exist)

ocean vortex Apr 17, 2025, 12:42 PM

#

You can't, but you initially said chatgpt, not playground..?

calm sequoia Apr 17, 2025, 12:43 PM

#

Unless you have the compute to try numerous iterations from start to finish (too expensive and time consuming for THIS SCALE)

balmy mist Apr 17, 2025, 12:51 PM

#

ocean vortex You cant but you initially said chatgpt, not playground..?

you can branch in chatgpt?

#

wait so plus only get o3 medium wtf

#

what about pro?

ocean vortex Apr 17, 2025, 12:53 PM

#

balmy mist you can branch in chatgpt?

Yeah, and then you have a counter with arrows to switch between them. Even just a regen essentially branches it, but at the very end if you do it this way
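A rough way to picture the branching described above: each message can have several alternative replies, and the arrow counter picks which one is shown. This is a hypothetical data model for illustration only (the `Node` class and `visible_thread` helper are made up), not how ChatGPT actually stores conversations.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One message in a conversation tree (hypothetical model, not ChatGPT's internals)."""
    text: str
    parent: Optional[Node] = field(default=None, repr=False, compare=False)
    children: list[Node] = field(default_factory=list)
    active: int = 0  # which child the arrow counter currently shows

    def reply(self, text: str) -> Node:
        """Attach a new child. A regen or an edit adds a sibling instead of overwriting."""
        child = Node(text, parent=self)
        self.children.append(child)
        self.active = len(self.children) - 1  # newest branch becomes the visible one
        return child


def visible_thread(root: Node) -> list[str]:
    """Follow the selected child at every level to get the thread shown on screen."""
    thread, node = [], root
    while node.children:
        node = node.children[node.active]
        thread.append(node.text)
    return thread


root = Node("(conversation start)")
question = root.reply("explain branching")
question.reply("answer v1")
question.reply("answer v2")  # pressing regen creates a second branch
print(visible_thread(root))  # ['explain branching', 'answer v2']
question.active = 0          # clicking the left arrow switches back
print(visible_thread(root))  # ['explain branching', 'answer v1']
```

A regen at the end only adds a sibling to the last reply, while editing an earlier message adds a sibling higher up, which is why both count as branching.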

balmy mist Apr 17, 2025, 12:53 PM

#

ocean vortex Yeah and then you have counter with arrows to switch between them. Even just a r...

wtf i never knew that

#

but the app breaks every other prompt for me with chatgpt so i prob wont be able to use that

#

wait it does not let me branch

#

send screenshot

ocean vortex Apr 17, 2025, 12:56 PM

#

balmy mist wait it does no tlet me branch

I just tried on ios app and it’s not there lol. But I mostly use it on desktop anyway tbh

keen beacon Apr 17, 2025, 12:56 PM

#

well that's interesting

ocean vortex Apr 17, 2025, 12:56 PM

#

And you can do it there for sure. Just not in mobile app

keen fulcrum Apr 17, 2025, 12:57 PM

#

keen beacon well that's interesting

Thought grok is the best

keen beacon Apr 17, 2025, 12:58 PM

#

lmao no

#

grok's actual writing is pretty bad, it's just uncensored

tall summit Apr 17, 2025, 12:58 PM

#

keen beacon well that's interesting

my anecdotal thoughts are also that o3 is best in creative writing

balmy mist Apr 17, 2025, 12:58 PM

#

i actually like o3, they essentially have mcps built into the reasoning. they just need a way for us to add more mcps to it on the fly, but I think there might be a way with the python tool it uses

keen fulcrum Apr 17, 2025, 12:59 PM

#

I do hope grok will be better than meta

balmy mist Apr 17, 2025, 12:59 PM

#

tbh i gave up on grok a while back

keen fulcrum Apr 17, 2025, 12:59 PM

#

balmy mist tbh i gave up on grok a while back

They are investing in an AI datacenter and getting government contracts

drifting thorn Apr 17, 2025, 1:01 PM

#

keen beacon grok's actual writing is pretty bad, it's just uncensored

True

tall summit Apr 17, 2025, 1:01 PM

#

https://eqbench.com/

#

newspeople in the ai space oughta cite sources more

drifting thorn Apr 17, 2025, 1:02 PM

#

and currently the knowledge base is fixed so that I'm continuing on my "fanfic"

balmy mist Apr 17, 2025, 1:02 PM

#

keen fulcrum They are investing in an AI datacenter and get governmental contracts

i just dont see the need for all these ai players anymore. we have google's open source that gets better along with its closed source, and we have 4 leading closed source players, i dont see the need for 2 of them anymore (claude and grok)

keen beacon Apr 17, 2025, 1:03 PM

#

drifting thorn and currently the knowledge base is fixed so that I'm continuing on my "fanfic"

i tried out o3's fanfic writing when i was given early access

#

it kinda had me hooked..

#

lol

tall summit Apr 17, 2025, 1:03 PM

#

man people hate claude

balmy mist Apr 17, 2025, 1:03 PM

#

tall summit man people hate claude

i used to love grok when it first dropped

tall summit Apr 17, 2025, 1:03 PM

#

claude is cool as hell

balmy mist Apr 17, 2025, 1:03 PM

#

but idk vibes just went down over time

tall summit Apr 17, 2025, 1:03 PM

#

keen beacon it kinda had me hooked..

noted...

#

🙀

keen beacon Apr 17, 2025, 1:03 PM

#

Claude has good creative vibes but like

#

i really fw o3's creative stuff

#

it's orders of magnitude better than o1

tall summit Apr 17, 2025, 1:04 PM

#

not a fair comparison

balmy mist Apr 17, 2025, 1:04 PM

#

claude is still solid, but for coding im hooked on gemini bc its cheaper, and now openai has the open-source coder and o3 with tools (mcp-like) built in. i havent used claude in a while, but its still a good model

tall summit Apr 17, 2025, 1:04 PM

#

i mean between 3.7 and o3

keen beacon Apr 17, 2025, 1:04 PM

#

well it's still an interesting comparison given before 3.7 absolutely dunked on o1

#

and reasoning models normally suck creatively, R1 being the first to kinda prove that wrong

keen fulcrum Apr 17, 2025, 1:05 PM

#

FT about to debut next week

balmy mist Apr 17, 2025, 1:05 PM

#

what time is google launching today?

keen beacon Apr 17, 2025, 1:05 PM

#

in the next 4-5 hrs probs

balmy mist Apr 17, 2025, 1:05 PM

#

keen beacon in the next 4-5 hrs proba

and its flash right?

tall summit Apr 17, 2025, 1:05 PM

#

i like when models dont randomly stop while making a story

keen fulcrum Apr 17, 2025, 1:05 PM

#

balmy mist and its flash right?

Stable 2.5 Pro

balmy mist Apr 17, 2025, 1:05 PM

#

keen fulcrum Stable 2.5 Pro

wtf

tall summit Apr 17, 2025, 1:06 PM

#

and the only models that do that are gemini 2.5 and claude 3.7

balmy mist Apr 17, 2025, 1:06 PM

#

keen fulcrum Stable 2.5 Pro

what does that mean, like what info do you have on it?

quiet pollen Apr 17, 2025, 1:07 PM

#

keen beacon well that's interesting

which bench is this

tall summit Apr 17, 2025, 1:07 PM

#

keen beacon aye, finally found somewhere with them

whats this? 👉 👈

tall summit Apr 17, 2025, 1:07 PM

#

quiet pollen which bench is this

https://eqbench.com/creative_writing.html

hardy violet Apr 17, 2025, 1:07 PM

#

keen beacon well that's interesting

Yeah, I've seen this list before, and tbh I really don't agree with the ranking.
Seeing R1 and V3 ranked so high makes the author's bias pretty clear – they obviously favor that aggressive, exaggerated style, leaning into those kinda modernist philosophy frameworks, or maybe just typical web novel tropes.
Like, I tried O3 today and really disliked its DeepSeek-ish vibe. It tends to over-interpret things and forces these complex frameworks onto everything. That kind of writing might look impressive or even amazing at first glance, but if you actually look closely, the way it abuses vocabulary is a huge problem.
Personally, I lean towards Claude 3.7 (though you need to prompt it well, the raw output isn't great). But right now, Gemini 2.5 Pro has overtaken the Claude series for me.
Also, just generally, I prefer a more prose-like style.

tall summit Apr 17, 2025, 1:07 PM

#

if only people actually sent links

keen beacon Apr 17, 2025, 1:07 PM

#

tall summit whats this? 👉 👈

https://polychat.io/ they have a limited free prompt quota for new accounts

keen beacon Apr 17, 2025, 1:08 PM

#

hardy violet Yeah, I've seen this list before, and tbh I really don't agree with the ranking....

its

#

it isn't human judged

#

it's an llm iirc

#

i believe 3.5 sonnet

tall summit Apr 17, 2025, 1:08 PM

#

Run the 32 writing prompts for 3 iterations (96 items total) @ temp 0.7, min_p 0.1.
Grade the outputs with a comprehensive scoring rubric using Claude 3.7 Sonnet.
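For anyone wondering what "temp 0.7, min_p 0.1" does: min_p sampling keeps only tokens whose probability is at least `min_p` times the probability of the most likely token, then samples from what is left. A minimal sketch, assuming temperature is applied before the min_p cut (implementations differ on the ordering); `min_p_distribution` and `sample` are illustrative names, not the benchmark's code.

```python
import math
import random


def min_p_distribution(logits: list[float], temperature: float = 0.7, min_p: float = 0.1) -> dict[int, float]:
    """Temperature-scaled softmax, then drop tokens below min_p * (top token's probability)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}  # renormalize the survivors


def sample(logits: list[float], temperature: float = 0.7, min_p: float = 0.1, rng=random) -> int:
    """Draw one token index from the filtered distribution."""
    dist = min_p_distribution(logits, temperature, min_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


print(sorted(min_p_distribution([2.0, 1.0, -5.0])))  # [0, 1]: token 2 falls below the cut
```

Because the cutoff scales with the top token's probability, min_p trims more aggressively when the model is confident and less when the distribution is flat.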

#

3.7

keen beacon Apr 17, 2025, 1:08 PM

#

but there are some components that are statistically judged

quiet pollen Apr 17, 2025, 1:08 PM

#

tall summit https://eqbench.com/creative_writing.html

thank you

keen beacon Apr 17, 2025, 1:09 PM

#

"Test Structure: The benchmark runs multi-turn conversations (up to 21 turns) between the test model (acting as conflict mediator) and actor models (playing clients or disputants). The actor model we use is gemini-2.0-flash-001. Each scenario includes detailed character profiles with specific emotional states and backgrounds.
Assessment Criteria: We score models on:
Basic emotional intelligence skills (recognizing emotions, showing empathy)
Professional skills specific to therapy or mediation
Avoiding serious professional mistakes
How It Works: The benchmark uses three models:
Test model: The AI being evaluated
Actor model: Plays realistic clients or disputants
Judge model: Claude-3.7-Sonnet scores the test model's performance
Scoring: The final score combines:
Scores across multiple skill areas
A count of identified mis-steps and how serious they were
Beyond just scores, the judge provides a critical analysis of specific errors, rating them as minor, moderate, or serious. This helps identify exactly where and how models struggle in realistic professional conversations."
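A toy sketch of the test/actor/judge setup quoted above, with stub functions standing in for real model calls. Counting each message as one turn, the score fields, and the `SEVERITY_PENALTY` weights are all assumptions for illustration; the benchmark's actual scoring formula isn't given here.

```python
from typing import Callable

# Hypothetical interfaces: a chat model maps the transcript so far to its next message,
# and the judge maps a finished transcript to skill scores plus a list of mis-step severities.
ChatModel = Callable[[list[dict]], str]
Judge = Callable[[list[dict]], dict]

SEVERITY_PENALTY = {"minor": 1, "moderate": 3, "serious": 6}  # made-up weights


def run_scenario(test_model: ChatModel, actor_model: ChatModel, judge: Judge,
                 opening: str, max_turns: int = 21) -> dict:
    """Alternate test-model and actor messages (at most max_turns in total), then ask the judge."""
    transcript = [{"role": "actor", "content": opening}]
    speakers = [("test", test_model), ("actor", actor_model)]
    i = 0
    while len(transcript) < max_turns:
        role, model = speakers[i % 2]
        transcript.append({"role": role, "content": model(transcript)})
        i += 1
    return {"transcript": transcript, "verdict": judge(transcript)}


def final_score(verdict: dict) -> float:
    """Average the skill scores, then subtract a penalty for each mis-step by severity."""
    skills = verdict["skills"]
    base = sum(skills.values()) / len(skills)
    penalty = sum(SEVERITY_PENALTY[s] for s in verdict["missteps"])
    return base - penalty


result = run_scenario(
    test_model=lambda t: "mediator reply",
    actor_model=lambda t: "client reply",
    judge=lambda t: {"skills": {"empathy": 8, "professional": 6}, "missteps": ["minor"]},
    opening="We can't agree on anything.",
)
print(len(result["transcript"]), final_score(result["verdict"]))  # 21 6.0
```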

tall summit Apr 17, 2025, 1:09 PM

#

quiet pollen thank you

AI PEOPLE HAVE TO CITE SOURCES MORE!!!

tall summit Apr 17, 2025, 1:09 PM

#

keen beacon "Test Structure: The benchmark runs multi-turn conversations (up to 21 turns) be...

thats eqbench

tall summit Apr 17, 2025, 1:09 PM

#

keen beacon well that's interesting

this is creative writing

quiet pollen Apr 17, 2025, 1:10 PM

#

I love benches

tall summit Apr 17, 2025, 1:10 PM

#

me too

quiet pollen Apr 17, 2025, 1:10 PM

#

didn't expect llama to score so low

#

when so many roleplaying AIs are using llama lol

keen beacon Apr 17, 2025, 1:11 PM

#

tall summit thats eqbench

yeah mb

#

still somewhat similar

#

How the benchmark works:

Run the 32 writing prompts for 3 iterations (96 items total) @ temp 0.7, min_p 0.1.
Grade the outputs with a comprehensive scoring rubric using Claude 3.7 Sonnet.
Use this score to infer an initial Elo rating for the evaluated model.
Perform pairwise matchups with neighboring models on the leaderboard (sparse sampling). Items are scored on several criteria, with the winner on each criteria given up to 5 +'s.
Calculate Elo scores using the Glicko rating system (modified to weight the win margin in '+' count). Loop until stable positions are found.
Perform comprehensive matchups with final neighbors and compute the definitive leaderboard Elo.
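The rating step can be pictured with a plain Elo update where the "actual score" is one model's share of the '+' marks in a matchup, so a bigger win margin moves ratings further. This is a simplified stand-in, not the modified Glicko system the benchmark says it uses; `k=32` and the share-of-pluses rule are assumptions.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))


def update(r_a: float, r_b: float, plus_a: int, plus_b: int, k: float = 32) -> tuple[float, float]:
    """One matchup. A's actual score is its share of the '+' marks (assumption)."""
    total = plus_a + plus_b
    actual_a = 0.5 if total == 0 else plus_a / total
    delta = k * (actual_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta  # zero-sum: B loses what A gains


print(update(1500, 1500, 5, 0))  # (1516.0, 1484.0)
print(update(1500, 1500, 3, 2))  # smaller margin, smaller move
```

Repeating such updates over neighbouring pairs until ratings stop moving roughly mirrors the "loop until stable positions are found" step.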

tall summit Apr 17, 2025, 1:11 PM

#

quiet pollen when there are so many roleplaying AIs are using llama lol

thats because llama is open source and cheap

#

well depends which llama

quiet pollen Apr 17, 2025, 1:12 PM

#

tall summit thats because llama is open source and cheap

probably because people can finetune it

tall summit Apr 17, 2025, 1:12 PM

#

keen beacon How the benchmark works: Run the 32 writing prompts for 3 iterations (96 items ...

if i can read right, this is 100% ai judged just in two different ways

keen fulcrum Apr 17, 2025, 1:13 PM

#

https://scale.com/leaderboard

SEAL LLM Leaderboards: Expert-Driven Private Evaluations

Explore the SEAL leaderboards for expert-driven, private, regularly updated LLM rankings and evaluations across domains like coding, instruction following and more!

tall summit Apr 17, 2025, 1:14 PM

#

keen fulcrum https://scale.com/leaderboard

is this not just a very small collection of hard benchmarks

balmy mist Apr 17, 2025, 1:14 PM

#

i told o3 to make an mcp and use it to make me something, is that a hallucination?

hardy violet Apr 17, 2025, 1:16 PM

#

tall summit if i can read right, this is 100% ai judged just in two different ways

That said, even if we don't agree with some of the conclusions, different results should be respected as long as they follow a consistent standard or methodology. After all, everyone has their own preferences when it comes to LLMs.
But, I still have to add: for subjective things like writing quality, blind evaluations by humans are probably the better way to judge what's actually good or bad.😮

glass arch Apr 17, 2025, 1:17 PM

#

wait, is google dropping another model today?

#

I gotta run it through my test

balmy mist Apr 17, 2025, 1:17 PM

#

glass arch wait, is google dropping another model today?

yeah

#

are there any mods here?

#

we need a way to pin the latest news. i guess we can use the announcements channel, but that might just be for lmarena stuff

glass arch Apr 17, 2025, 1:20 PM

#

it seems like every 4 weeks we get a new ai model that dominates

tall summit Apr 17, 2025, 1:20 PM

#

i get all my ai news from #general

balmy mist Apr 17, 2025, 1:21 PM

#

yeah lets do that

balmy mist Apr 17, 2025, 1:21 PM

#

tall summit i get all my ai news from #general

we all do, but sometimes there are so many messages that its hard to know whats going on and catch up

#

but does anyone have a website for all mcps or community created mcps?

#

i wanna try something with o3

tall summit Apr 17, 2025, 1:22 PM

#

balmy mist we all do, but there sometimes be so much messages that its hard to know whats g...

most of the messages are crap

balmy mist Apr 17, 2025, 1:23 PM

#

tall summit most of the messages are crap

exactly its like the needle in the haystack, trying to find the valuable info in the hay of information

#

@tall summit delete your message in that thread

#

lets keep it clean for only news stuff

#

we need mods to make that an official channel tho

#

you have a point

#

it might get swept away lol

#

@hollow ivy whatever happened to our music thread?

tall summit Apr 17, 2025, 1:26 PM

#

balmy mist exactly its like the needle in the haystack, trying to find the valuable info in...

really all you need is legit_api's private tool tweets, official ai company tweets, and benchmarks

balmy mist Apr 17, 2025, 1:26 PM

#

you can get it for us and give us the deets

tall summit Apr 17, 2025, 1:27 PM

#

lmao

balmy mist Apr 17, 2025, 1:28 PM

#

ahh are u a mod?

#

u are just a discord pro

tall summit Apr 17, 2025, 1:32 PM

#

balmy mist you can get it for us and give us the deets

100 other people will, once it's released
but even if it never does, other people will easily find out what he does, just slightly later
and he shows no signs of stopping his tweets anyway

and there are also compilations of benchmark scores, and company tweets arent that hard to track given there are only a finite number of companies

the other kinds of news (mainly applied ai) are much harder to find, thats why most news sources arent as simple as that, but honestly i think thats all you need to keep up with the models themselves

balmy mist Apr 17, 2025, 1:32 PM

#

this one?

#

i like it, I havent had a chance to try it in my app, but i will try it tonight with o4 mini and see what it can make

#

imma do 50 iterations of it

#

how are you using it?

tall summit Apr 17, 2025, 1:34 PM

#

keen beacon https://polychat.io/ they have a limited free prompt quota for new accounts

oh it's one of those. this one seems especially limited if you don't pay $20 a month

balmy mist Apr 17, 2025, 1:36 PM

#

yeah but if you put it as system prompt

#

then tell it to run

#

and only pass on outputs and clear context

#

in my app i have a system prompt, then i can add a prompt if i want to, but for this i will just tell it to simulate or something, then it just feeds each call to a model with the system prompt and the previous output
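The loop described here (each call gets only the system prompt plus the previous output) could look roughly like this. `call_model` is a hypothetical stand-in for whatever API the app actually calls, and `fake_model` just advances a counter so the sketch runs offline.

```python
from typing import Callable

# Hypothetical signature: (system_prompt, user_input) -> model output.
CallModel = Callable[[str, str], str]


def run_chain(call_model: CallModel, system_prompt: str, seed: str, steps: int) -> list[str]:
    """Feed each output back in as the next input; only the last output is carried forward."""
    outputs: list[str] = []
    previous = seed
    for _ in range(steps):
        previous = call_model(system_prompt, previous)
        outputs.append(previous)
    return outputs


def fake_model(system_prompt: str, user_input: str) -> str:
    """Offline stand-in: reads the step number at the end of the input and advances it."""
    step = int(user_input.rsplit(" ", 1)[-1]) + 1
    return f"world state {step}"


print(run_chain(fake_model, "You are a game simulator.", "world state 0", 3))
# ['world state 1', 'world state 2', 'world state 3']
```

Since nothing older than the last output is passed along, the context stays small, but so does the model's memory, which is why character consistency would need a separate memory step.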

#

so you dont have to worry about context as much

#

actually let me do that now and let it run for an hour, but i wish we still had the free models

#

the only thing i might have to handle is character consistency etc..

#

might need to have a memory agent or system to deal with that

#

yeah i could do that, but its not automated

#

i like just feeding the model an input and letting it simulate and cook

#

yeah

#

see if it can create the world

#

lol

#

have you tried the prompt with multiple studios?

#

like have like 5 studio windows

#

and have one be the game director or sum

#

with the system prompt

#

and the others be characters or players

#

and feed each other outputs

#

yeah that should be big enough

#

hmm that might be a good question for chatgpt lol

#

im lobotomized by ai now

#

even give it your prompt for context

#

https://x.com/DeryaTR_/status/1912856563859022191

Derya Unutmaz, MD (@DeryaTR_) on X

As I claimed, the OpenAI o3 model is at or near genius level. I’m sure someone will cope by saying, “Oh, but it still can’t do this or that,” which is quite silly considering how many zillion things a genius human cannot do! Regardless, the next AI model will fix these as well.

#

what is genius level?

brittle tiger Apr 17, 2025, 1:45 PM

#

https://x.com/ficlive/status/1912863028141244850

Fiction.live (@ficlive) on X

OpenAI Strikes Back

tall summit Apr 17, 2025, 1:46 PM

#

brittle tiger https://x.com/ficlive/status/1912863028141244850

hooooooly

balmy mist Apr 17, 2025, 1:48 PM

#

brittle tiger https://x.com/ficlive/status/1912863028141244850

how does it compare on openai's new needle-in-a-haystack chart?

brittle tiger Apr 17, 2025, 1:49 PM

#

if that score holds up I really don't understand the 200k context window. it's not that hard to fill up 200k with long thinking models

balmy mist Apr 17, 2025, 1:49 PM

#

what makes it more impressive is that 120k is 60% of its context

#

while for gemini 120k is 12% of its context and it scores 90%, vs o3 scoring 100% with 60% of its context used
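The percentages check out if you assume a 200k window for o3 and a 1M window for Gemini 2.5 Pro (the window sizes implied by the 60% and 12% figures):

```python
def context_fraction(tokens_used: int, window: int) -> float:
    """Share of a model's context window taken up by a prompt of `tokens_used` tokens."""
    return tokens_used / window


print(f"o3:     {context_fraction(120_000, 200_000):.0%}")    # o3:     60%
print(f"gemini: {context_fraction(120_000, 1_000_000):.0%}")  # gemini: 12%
```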

#

o3 is extremely impressive

#

still debatable lol

sinful vessel Apr 17, 2025, 1:56 PM

#

balmy mist o3 is extremely impressive

Sorry if this is a dumb question, but how do you know what percentage of the context a project is using on Gemini or GPT? Claude tells you.

ember rapids Apr 17, 2025, 1:56 PM

#

brittle tiger https://x.com/ficlive/status/1912863028141244850

Wow

keen fulcrum Apr 17, 2025, 1:57 PM

#

brittle tiger https://x.com/ficlive/status/1912863028141244850

Grok 4 will be better than o3

brittle tiger Apr 17, 2025, 1:58 PM

#

sinful vessel Sorry if this is a dumb question but how do yoiu know how much percentage of con...

You can see in AI Studio for Gemini. I don't think it's viewable in the gemini app

keen fulcrum Apr 17, 2025, 2:15 PM

#

Any update on when o4 mini and o3 will be added to lmarena?

balmy mist Apr 17, 2025, 2:16 PM

#

sinful vessel Sorry if this is a dumb question but how do yoiu know how much percentage of con...

there are no dumb questions bro, we here to share knowledge 🙂

tall summit Apr 17, 2025, 2:17 PM

#

keen fulcrum Any update on when o4 mini and o3 will be added to lmarena?

lmao did you even check

keen fulcrum Apr 17, 2025, 2:21 PM

#

tall summit lmao did you even check

I checked
nothing on the lb too

hardy pecan Apr 17, 2025, 2:24 PM

#

they are all already there

#

just run a few prompts and youll get it

sonic tendon Apr 17, 2025, 2:25 PM

#

keen fulcrum I checked nothing on the lb too

they're in direct chat, at least in the non-alpha arena

sonic tendon Apr 17, 2025, 2:27 PM

#

keen fulcrum Grok 4 will be better than o3

nobody can say for sure

#

anyway, "better" is sort of subjective

cedar tide Apr 17, 2025, 2:33 PM

#

for overall performance i think grok 3 thinking high will be roughly on par with o3

sonic tendon Apr 17, 2025, 2:35 PM

#

what makes you think that grok's gonna add reasoning effort levels? they seem to be focusing on other stuff atm

hardy pecan Apr 17, 2025, 2:38 PM

#

Fun fact, GPT4o was released essentially 1 year ago today

cedar tide Apr 17, 2025, 2:40 PM

#

sonic tendon what makes you think that grok's gonna add reasoning effort levels? they seem to...

Grok 3 mini already has this

Screenshot_2025-04-17-16-39-48-187_com.android.chrome-edit.jpg

cedar tide Apr 17, 2025, 2:40 PM

#

cedar tide for overall performance i think grok 3 thinking high will be roughly on par with...

grok 3 without reasoning has improved since its announcement in February (see image below), and the benchmark scores announced for the reasoning version are well below the scores it will have at the end of its training, since it was barely at the level of the mini version

sonic tendon Apr 17, 2025, 2:40 PM

#

cedar tide Grok 3 mini has already this

ah, my mistake

thorny drum Apr 17, 2025, 2:41 PM

#

grok 3 mini (high) does pretty well on livebench already

cedar tide Apr 17, 2025, 2:41 PM

#

cedar tide grok 3 without reasoning has improved since its announcement in February (see im...

Screenshot_2025-04-10-07-59-33-783_com.android.chrome-edit.jpg

keen beacon Apr 17, 2025, 2:54 PM

#

https://x.com/AdvaitOnline/status/1912852199446548510?t=CfLXs-8v0dqh_anqB6mc7A&s=19

Advait Bopardikar (@AdvaitOnline) on X

#

works at deepmind

#

good chance this is an svg generated by an upcoming model (one of the ones being launched today)

balmy mist Apr 17, 2025, 3:02 PM

#

https://x.com/legit_api/status/1912880516581241033

ʟᴇɢɪᴛ (@legit_api) on X

preparations underway for more Gemini 2.5 models

torn mantle Apr 17, 2025, 3:03 PM

#

keen beacon good chance this is an svg generated by an upcoming model (one of the ones being...

most likely

#

flash thinking or that dragontail model

balmy mist Apr 17, 2025, 3:03 PM

#

i hope its nw

barren prairie Apr 17, 2025, 3:07 PM

#

What I know is that something is going to change in the Gemini app... because when I opened Gemini it said there are new models 🙃✌️

#

So they are working

#

On something

ember rapids Apr 17, 2025, 3:08 PM

#

is dragontail flash 2.5?

keen beacon Apr 17, 2025, 3:12 PM

#

https://x.com/AISafetyMemes/status/1912875957897003354

AI Notkilleveryoneism Memes ⏸️ (@AISafetyMemes) on X

o3 scores 136 IQ on Mensa Norway, qualifying for Mensa

#

woah

#

we're finally beginning to saturate these too

tall summit Apr 17, 2025, 3:12 PM

#

keen beacon https://x.com/AISafetyMemes/status/1912875957897003354

LMAO

#

HAHAHAHAHA

tawdry meteor Apr 17, 2025, 3:12 PM

#

brittle tiger i believe o3 high is only available on API right now. plus users get 50 o3-med p...

do we know which one is on the arena then? I would assume it's o3-med or o3-low

tall summit Apr 17, 2025, 3:12 PM

#

first time im seeing ai models being tested on iq tests

keen beacon Apr 17, 2025, 3:13 PM

#

this has been a thing for like

#

over a year

#

they do an offline one and a mensa one

#

it's kinda silly to do on llms but interesting all the same

tall summit Apr 17, 2025, 3:13 PM

#

keen beacon over a year

im sure but i havent seen it

keen beacon Apr 17, 2025, 3:13 PM

#

they also give every model the political compass

#

which is also interesting

tall summit Apr 17, 2025, 3:13 PM

#

oooooh

keen beacon Apr 17, 2025, 3:13 PM

#

let me find it

tall summit Apr 17, 2025, 3:14 PM

#

https://trackingai.org/political-test

Tracking AI

Tracking AI is a cutting-edge application that unveils the political biases embedded in artificial intelligence systems. Explore and analyze the political leanings of AIs with our intuitive platform, designed to foster transparency in the world of artificial intelligence. Stay informed and uncover the political inclinations shaping the algorithm...

#

i am not surprised at all

keen beacon Apr 17, 2025, 3:15 PM

#

#

isn't it funny

#

how the smarter the model is

#

the more lib left it gets

#

would like to see o3 tho

tall summit Apr 17, 2025, 3:15 PM

#

isnt it there

balmy mist Apr 17, 2025, 3:15 PM

#

tall summit LMAO

you didnt say anything when I posted it but when leo posts you're shocked lmaooo

keen beacon Apr 17, 2025, 3:16 PM

#

haven't run it yet by the looks of it

tall summit Apr 17, 2025, 3:16 PM

#

nop thats o3 mini

#

just saw an o3 circle

tall summit Apr 17, 2025, 3:16 PM

#

balmy mist you dont say anything when I posted it but when leo posts your shocked lmaooo

umm i didnt see when you did

#

or maybe i did and forgot

balmy mist Apr 17, 2025, 3:16 PM

#

u responded right under lmaoo

tall summit Apr 17, 2025, 3:16 PM

#

i dont remember at all

balmy mist Apr 17, 2025, 3:16 PM

#

tall summit hooooooly

.

#

like 3 minutes after lol

tall summit Apr 17, 2025, 3:17 PM

#

oh sorry i subconsciously ignored that message

#

because i got immediately interested in the ficlive benchmark stat from o3

balmy mist Apr 17, 2025, 3:17 PM

#

thats tiktok for you lmaoo

tall summit Apr 17, 2025, 3:18 PM

#

and scrolled down to see peoples discussions about that, which moved your post further away from my consciousness

tall summit Apr 17, 2025, 3:18 PM

#

balmy mist thats tiktok for you lmaoo

i dont use tiktok sorry

balmy mist Apr 17, 2025, 3:18 PM

#

tall summit i dont use tiktok sorry

damn your attention span got killed without tiktok, we are screwed

tawdry meteor Apr 17, 2025, 3:19 PM

#

keen beacon the more lib left it gets

so when the AI starts running the government we're all gonna be reading Herbert Spencer in school and praising the early artificial micronations? lol

tall summit Apr 17, 2025, 3:19 PM

#

balmy mist damn your attention span got killed without tiktok, we are screwed

yeah adhd affects like 1% of people which is an insane number and its only growing

elder rapids Apr 17, 2025, 3:20 PM

#

the long context thing for o3 is major cap

#

it cannot handle long context 😭 🙏

keen beacon Apr 17, 2025, 3:21 PM

#

https://x.com/unusual_whales/status/1912888054978670963

unusual_whales (@unusual_whales) on X

BREAKING: Google $GOOGL has lost its online advertising case, thus saying its online ad tech markets violate US antitrust laws.

#

well this is gonna be interesting

glass arch Apr 17, 2025, 3:21 PM

#

balmy mist damn your attention span got killed without tiktok, we are screwed

mine too

balmy mist Apr 17, 2025, 3:21 PM

#

https://x.com/legit_api/status/1912888062138347773

ʟᴇɢɪᴛ (@legit_api) on X

WE ALSO SEEM TO BE GETTING CODE MODELS!

WE MIGHT SEE THE REVEAL OF NIGHTWHISPER

keen beacon Apr 17, 2025, 3:21 PM

#

YO

elder rapids Apr 17, 2025, 3:21 PM

#

no way

balmy mist Apr 17, 2025, 3:21 PM

#

i knew my babby was coming

#

yes way!!!!

keen beacon Apr 17, 2025, 3:21 PM

#

balmy mist https://x.com/legit_api/status/1912888062138347773

@torn mantle

torn mantle Apr 17, 2025, 3:22 PM

#

balmy mist https://x.com/legit_api/status/1912888062138347773

qwdqwd;klqwjlkdjqwlkdhjqwlkd

#

ll;]ajks;]LAJKS'L;ASJDF'KL;ASJF'AKL;SFJASK'L;JDFASK'LJFKAL'SFJ'ASLKF

#

STOP

#

no way

#

finally 😭

keen beacon Apr 17, 2025, 3:22 PM

#

if they drop a SOTA code model i am all in on deepmind

elder rapids Apr 17, 2025, 3:22 PM

#

keen beacon well this is gonna be interesting

I don't think it's gonna impact Google at all

tall summit Apr 17, 2025, 3:22 PM

#

@keen beacon thanks for sacrificing yourself browsing twitter to deliver news so we don't have to go in the twitter hellscape ourselves

torn mantle Apr 17, 2025, 3:22 PM

#

elder rapids I don't think it's gonna impact Google at all

you have no idea

#

what it will do

glass arch Apr 17, 2025, 3:23 PM

#

are there betters for AI? because I'm going all in today

elder rapids Apr 17, 2025, 3:23 PM

#

ye but actually read the case lmao

#

this is not even crazy

keen beacon Apr 17, 2025, 3:23 PM

#

tall summit @keen beacon thanks for sacrificing yourself browsing twitter to delive...

i just wish bsky had more of a community

balmy mist Apr 17, 2025, 3:23 PM

#

i love google man

keen beacon Apr 17, 2025, 3:23 PM

#

otherwise id move

balmy mist Apr 17, 2025, 3:23 PM

#

i love how they do this to openai

torn mantle Apr 17, 2025, 3:23 PM

#

no i mean sonnet is mostly used for coding, now imagine a better model comes in cheap, what do you think will happen?

balmy mist Apr 17, 2025, 3:23 PM

#

that means nightwhisper is better than we thought

keen beacon Apr 17, 2025, 3:24 PM

#

also notice it says "CODE MODELS"

torn mantle Apr 17, 2025, 3:24 PM

#

keen beacon also notice it says "CODE MODEL**S**"

yea

#

thats crazy

keen beacon Apr 17, 2025, 3:24 PM

#

probably a version based on flash and a version based on pro

tall summit Apr 17, 2025, 3:24 PM

#

whats the 1P part of 1P CODE MODELS mean

torn mantle Apr 17, 2025, 3:24 PM

#

keen beacon probably a version based on flash and a version based on pro

could be

#

yea

keen beacon Apr 17, 2025, 3:24 PM

#

tall summit whats the 1P part of 1P CODE MODELS mean

thats what im wondering

lime coral Apr 17, 2025, 3:24 PM

#

1 party

#

Obviously

#

It’s a party

tall summit Apr 17, 2025, 3:24 PM

#

only 1

lime coral Apr 17, 2025, 3:24 PM

#

Satya said he will make them dance they bring the music

brittle tiger Apr 17, 2025, 3:24 PM

#

tall summit whats the 1P part of 1P CODE MODELS mean

first-party as opposed to 3rd

lime coral Apr 17, 2025, 3:25 PM

#

Logic

tall summit Apr 17, 2025, 3:25 PM

#

brittle tiger first-party as opposed to 3rd

all their models are first party in that sense

quick flame Apr 17, 2025, 3:25 PM

#

when will the new OpenAI models be on the leaderboard? Weird that they were not in the anonymous testing

keen beacon Apr 17, 2025, 3:25 PM

#

no no

#

it means first party

#

i just realised

#

as in, their own models

balmy mist Apr 17, 2025, 3:25 PM

#

ohh

tall summit Apr 17, 2025, 3:25 PM

#

whats gemini if not their own models

keen beacon Apr 17, 2025, 3:26 PM

#

maybe they partner with a lab to offer a third party option down the lime

#

line

#

who knows

tall summit Apr 17, 2025, 3:26 PM

#

🤷

keen beacon Apr 17, 2025, 3:26 PM

#

or perhaps an OSS model or rwo

#

two

balmy mist Apr 17, 2025, 3:26 PM

#

damn they really stealing openai shine the day after

elder rapids Apr 17, 2025, 3:31 PM

#

keen beacon maybe they partner with a lab to offer a third party option down the lime

ngl what would happen if Google acquired anthropic

elder rapids Apr 17, 2025, 3:35 PM

#

balmy mist https://x.com/legit_api/status/1912888062138347773

this is gonna be so disappointing if it's like, a narrow coding model that is just crazy at frontend

#

and sucks at everything else

#

ngl Google could legitimately blunder this

balmy mist Apr 17, 2025, 3:36 PM

#

i believe in google

#

why would they rush a launch like this

#

if it was not good

#

like right after openai

#

it kinda has to be good

tall summit Apr 17, 2025, 3:36 PM

#

elder rapids this is gonna be so disappointing if it's like, a narrow coding model that is ju...

itd be cool tho

#

frontend is cool

#

still an improvement in vibe coding 🤷

torn mantle Apr 17, 2025, 3:37 PM

#

elder rapids this is gonna be so disappointing if it's like, a narrow coding model that is ju...

it was also good at python tbh

#

but lets see

elder rapids Apr 17, 2025, 3:38 PM

#

torn mantle it was also good at python tbh

I never really tested it

brittle tiger Apr 17, 2025, 3:38 PM

#

I don't think we get nightwhisper. the hype posts have been subtly hinting at flash 2.5. they definitely know there is excitement about nw and would play into that if it was coming. idk

elder rapids Apr 17, 2025, 3:39 PM

#

brittle tiger I don't think we get nightwhisper. the hype posts have been subtly hinting at fl...

read up

#

if they're updating for code models

#

and night whisper really is THAT good

keen beacon Apr 17, 2025, 3:39 PM

#

god if only trump stayed a comedian..

#

these are all great

elder rapids Apr 17, 2025, 3:39 PM

#

then they have to be releasing more than that

thorny drum Apr 17, 2025, 3:39 PM

#

they're weirdly aware of the hypeposting from like 1000 follower twitter accounts lol

#

sundar pichai tweeting nebula was not something i expected

elder rapids Apr 17, 2025, 3:39 PM

#

keen beacon these are all great

meatball ron 😭 🙏

hardy violet Apr 17, 2025, 3:40 PM

#

Hmmm? Heard they're releasing Nightwhisper? Is there any solid proof?
You sure this isn't just another Google smokescreen? We haven't forgotten about the 2.5flash 0409 model, have we?

keen beacon Apr 17, 2025, 3:40 PM

#

(denied by Trump)

#

lmfaooo

elder rapids Apr 17, 2025, 3:40 PM

#

hardy violet Hmmm? Heard they're releasing Nightwhisper? Is there any solid proof? You sure t...

yeah what about the nebula smokescreen

#

they didn't release 2.5 pro

#

😔

tall summit Apr 17, 2025, 3:40 PM

#

keen beacon god if only trump stayed a comedian..

literally just "Tiny D" 😐

brittle tiger Apr 17, 2025, 3:40 PM

#

nebula had more hype than nightwhisper if im remembering right

keen beacon Apr 17, 2025, 3:40 PM

#

yup

elder rapids Apr 17, 2025, 3:40 PM

#

brittle tiger nebula had more hype than nightwhisper if im remembering right

not rly tbh

keen beacon Apr 17, 2025, 3:40 PM

#

yes really

elder rapids Apr 17, 2025, 3:40 PM

#

nebula was known to be super good

#

but night whisper is being talked about just as much

keen beacon Apr 17, 2025, 3:41 PM

#

nightwhisper was a very "in our bubble" thing

#

nebula got out of the bubble

elder rapids Apr 17, 2025, 3:41 PM

#

just not in faith of "this is amazing"

#

but "Google is cooking"

#

and just left there

#

you can see this in the subreddits too

#

people seem to know a ton about nightwhisper

balmy mist Apr 17, 2025, 3:42 PM

#

keen beacon nightwhisper was a very "in our bubble" thing

fr not a lot of people talked about nw tbh

elder rapids Apr 17, 2025, 3:42 PM

#

I don't got a j*b

#

I should know

#

sorry y'all

balmy mist Apr 17, 2025, 3:42 PM

#

it could be by design, since nw was there for like 2 days barely

elder rapids Apr 17, 2025, 3:42 PM

#

balmy mist it could be by design, since nw was there for like 2 days barely

nah, it still is pretty popular lol

#

people just pick and choose tho

#

with the nebula precursors

#

the early 2.5 pro checkpoints

#

they weren't really worse

hardy pecan Apr 17, 2025, 3:43 PM

#

nebula was a beast

elder rapids Apr 17, 2025, 3:43 PM

#

just less good at very specific tasks

#

and they weren't talked about in the light of being the beast it is now, but still acknowledged

hardy violet Apr 17, 2025, 3:44 PM

#

Alright folks, it's getting super late here, almost the 18th already.
Gotta head to bed now. Night everyone! 👋
Hope I wake up to some Gemini 2.5 Pro Coding, Pro High, or 2.5 Flash news tomorrow morning! 🙏

elder rapids Apr 17, 2025, 3:44 PM

#

man hopefully

balmy mist Apr 17, 2025, 3:50 PM

#

hardy violet Alright folks, it's getting super late here, almost the 18th already. Gotta head...

gn bro and damn, its only 12 pm by me

#

i love the diversity in this chat

keen fulcrum Apr 17, 2025, 4:00 PM

#

Deepseek is a joint effort of China and Russia
It's incorrect to call it a chinese AI

plain zinc Apr 17, 2025, 4:20 PM

#

Finally, Google will release models finely tuned for coding (competitors for the Claude family of models)! 🔥👀

keen beacon Apr 17, 2025, 4:20 PM

#

https://x.com/mbalunovic/status/1912897439876477395

Mislav Balunović (@mbalunovic) on X

And we have our first fully green row on MathArena - o4-mini-high completely solves AIME 2025 II, marking the benchmark officially saturated!

#

wow

keen beacon Apr 17, 2025, 4:21 PM

#

plain zinc Finally, Google will release models finely tuned for coding (competitors for the...

yeah someone posted this

barren prairie Apr 17, 2025, 4:24 PM

#

But nothing happened 🙁

balmy mist Apr 17, 2025, 4:29 PM

#

maaybe at 1pm est?

lime coral Apr 17, 2025, 4:33 PM

#

keen beacon https://x.com/mbalunovic/status/1912897439876477395

We are no more sure it’s not in the train set at this point

lime coral Apr 17, 2025, 4:33 PM

#

hardy violet Alright folks, it's getting super late here, almost the 18th already. Gotta head...

https://x.com/advaitonline/status/1912852199446548510?s=46

Advait Bopardikar (@AdvaitOnline) on X

tall summit Apr 17, 2025, 4:34 PM

#

keen beacon https://x.com/mbalunovic/status/1912897439876477395

matharena!? well gj

#

damn i didnt actually see aime 2025

#

the issue is..

#

aime is not hard in comparison to most other math

#

even if you dont want to dive into proof based contents, there are many much harder ones

tall summit Apr 17, 2025, 4:36 PM

#

plain zinc Finally, Google will release models finely tuned for coding (competitors for the...

you can just send the image you know

#

still funny to me how o4-mini is better than o3 at math+coding

elder rapids Apr 17, 2025, 4:40 PM

#

cuz that's just what it's meant for

#

o4 mini has been pretty bad in my testing for everything else but puzzles+code

keen beacon Apr 17, 2025, 4:42 PM

#

lime coral We are no more sure it’s not in the train set at this point

this is aime 2025... no

tall summit Apr 17, 2025, 4:47 PM

#

elder rapids cuz that's just what it's meant for

no its meant for "reasoning"

#

technically

#

i hope this is not how the real competitors solved this

Screenshot_2025-04-17-19-49-25-848_org.mozilla.firefox-edit.jpg

ocean vortex Apr 17, 2025, 4:53 PM

#

keen beacon well that's interesting

Makes sense. R1 is above V3 as well

tall summit Apr 17, 2025, 4:54 PM

#

ok none of the aops solutions actually work with the polynomial in this form

elder rapids Apr 17, 2025, 4:59 PM

#

tall summit no its meant for "reasoning"

how is that relevant to what I said at all

keen ferry Apr 17, 2025, 4:59 PM

#

night whisper was really too good? I never tried it

barren prairie Apr 17, 2025, 5:00 PM

#

keen ferry night whisper was really too good? I never tried it

I wish we will try it together sooner

balmy mist Apr 17, 2025, 5:05 PM

#

📎 message.txt

leaden meteor Apr 17, 2025, 5:05 PM

#

Its 1pm and still no new model release by google yet?

barren prairie Apr 17, 2025, 5:06 PM

#

leaden meteor Its 1pm and still no new model release by google yet?

Maybe just a rumor 🥺

balmy mist Apr 17, 2025, 5:06 PM

#

maybe bc of this:
https://x.com/WatcherGuru/status/1912889391170597014

Watcher.Guru (@WatcherGuru) on X

JUST IN: 🇺🇸 Judge rules Google operates illegal ad monopoly.

#

lol jk

torn mantle Apr 17, 2025, 5:07 PM

#

https://x.com/AIatMeta/status/1912906758856778226

AI at Meta (@AIatMeta) on X

🚀 Meta FAIR is releasing several new research artifacts on our road to advanced machine intelligence (AMI). These latest advancements are transforming our understanding of perception.

1️⃣ Meta Perception Encoder: A large-scale vision encoder that excels across several image &

#

this is kinda interesting

cloud meadow Apr 17, 2025, 5:12 PM

#

@wooden mulch

#

What are your aspirations for LMArena?

compact knoll Apr 17, 2025, 5:12 PM

#

is o3 better than o1 ? (about resolving problems, maths..)

cloud meadow Apr 17, 2025, 5:13 PM

#

cloud meadow What are your aspirations for LMArena?

I mean like future plans.

calm spear Apr 17, 2025, 5:20 PM

#

random question:

do LLM help people study and acquire information in say Africa & developing countries?

wintry tinsel Apr 17, 2025, 5:22 PM

#

New LM arena is a fat improvement

barren prairie Apr 17, 2025, 5:23 PM

#

calm spear random question: do LLM help people study and acquire information in say Africa...

Here is an example 🤓🙂🤝from north africa

brittle tiger Apr 17, 2025, 5:24 PM

#

not sure if anyone has mentioned but I'm getting o3 on arena

opaque adder Apr 17, 2025, 5:25 PM

#

cant even see the code output in beta.. nice alpha was better

gleaming adder Apr 17, 2025, 5:26 PM

#

"defaultInferenceSettings": {
        "system": "Over the course of conversation, adapt to the user's tone and preferences. Try to match the user's vibe, tone, and generally how they are speaking. You want the conversation to feel natural. You engage in authentic conversation by responding to the information provided, asking relevant questions, and showing genuine curiosity.\n\nYour output will be rendered in a web UI, so use valid markdown format, tables, Latex, or emojis to make the content more engaging and user friendly."
      },

cloud meadow Apr 17, 2025, 5:37 PM

#

calm spear random question: do LLM help people study and acquire information in say Africa...

If Internet access is bad, local llms can maybe in the future help with their knowledge bases as they only require the hardware and electricity necessary to run them.

#

An LLM can hallucinate, books written by people who know what they are talking about don't.

compact knoll Apr 17, 2025, 5:39 PM

#

cloud meadow An LLM can hallucinate, books written by people who know what they are talking a...

not 100% sure about that 😆

cloud meadow Apr 17, 2025, 5:40 PM

#

compact knoll not 100% sure about that 😆

Prove it. Become a self taught phd in any STEM subject using only llms.

compact knoll Apr 17, 2025, 5:42 PM

#

im not denying that LLMs hallucinate :)
that’s because they “predict” words rather than “know” them like a human would, im more talking about the fact that you wouldn’t believe how many books by so-called “experts” contain mistakes ;)

balmy mist Apr 17, 2025, 5:45 PM

#

damn google trolling

plain zinc Apr 17, 2025, 5:46 PM

#

Wdym?

elder solar Apr 17, 2025, 5:46 PM

#

plain zinc Finally, Google will release models finely tuned for coding (competitors for the...

what is confidential?

balmy mist Apr 17, 2025, 5:47 PM

#

plain zinc Wdym?

the models....

plain zinc Apr 17, 2025, 5:47 PM

#

elder solar what is confidential?

This is confidential information 🙂

torn mantle Apr 17, 2025, 5:47 PM

#

elder solar what is confidential?

probably early models that are tested by devs

#

or red teaming idk

plain zinc Apr 17, 2025, 5:47 PM

#

balmy mist the models....

They will be released today

#

lol

#

patience

lime coral Apr 17, 2025, 5:49 PM

#

keen beacon this is aime 2025... no

I know

elder solar Apr 17, 2025, 5:51 PM

#

is there any news about a newer LLM that can listen to audios?

#

the only llms that can is just gemini and ernie models

elder rapids Apr 17, 2025, 5:54 PM

#

balmy mist maybe bc of this: https://x.com/WatcherGuru/status/1912889391170597014

I just wanna note

#

this wasn't what was ruled

#

if it's made out to be so bad

#

it's not

#

look into it yourself

visual turret Apr 17, 2025, 5:56 PM

#

leaden meteor Its 1pm and still no new model release by google yet?

Gemini 2.5 pro preview is their newest model

#

On AI studio if you are using the free version you are using Gemini 2.5 pro exp

vague orbit Apr 17, 2025, 5:56 PM

#

i thought it was gemini 2.5 pro experimental

visual turret Apr 17, 2025, 5:57 PM

#

vague orbit i thought it was gemini 2.5 pro experimental

It is not

#

Gemini 2.5 pro preview is newest, you can check it here https://openrouter.ai/google

Google | OpenRouter

Browse models from Google

#

Gemini 2.5 flash is coming in a few weeks

elder rapids Apr 17, 2025, 5:58 PM

#

vague orbit i thought it was gemini 2.5 pro experimental

they're the same exact thing

visual turret Apr 17, 2025, 5:58 PM

#

But I suspect it is the end of the month

visual turret Apr 17, 2025, 5:58 PM

#

elder rapids they're the same exact thing

Not really

vague orbit Apr 17, 2025, 5:58 PM

#

I use cursor, and I use claude code, and those work well for claude 37. is there a Best Way to use Gemini 2.5 pro on a code base?

visual turret Apr 17, 2025, 5:59 PM

#

visual turret Not really

They are very different

#

Gemini 2.5 pro preview is really good

brittle tiger Apr 17, 2025, 5:59 PM

#

visual turret But I suspect it is the end of the month

I suspect today

visual turret Apr 17, 2025, 5:59 PM

#

Gemini 2.5 pro exp sucks

visual turret Apr 17, 2025, 5:59 PM

#

brittle tiger I suspect today

Google has a history of releasing stuff at the end of months

elder rapids Apr 17, 2025, 5:59 PM

#

visual turret Not really

no they literally are the same exact model

#

not a single change

#

not a touch

#

not an ounce of change

visual turret Apr 17, 2025, 6:00 PM

#

elder rapids not an ounce of change

Have you ever used Gemini 2.5 pro preview

keen beacon Apr 17, 2025, 6:00 PM

#

visual turret Google has a history of releasing stuff at the end of months

when logan tweets gemini, there has always been at least one new gemini model release the next day

elder rapids Apr 17, 2025, 6:00 PM

#

visual turret Have you ever used Gemini 2.5 pro preview

talking to the master of 2.5 pro in this server btw

keen beacon Apr 17, 2025, 6:00 PM

#

there will be a launch today

elder rapids Apr 17, 2025, 6:00 PM

#

😭 🙏

#

I know Gemini the most

vague orbit Apr 17, 2025, 6:00 PM

#

keen beacon when logan tweets gemini, there has always been at least one new gemini model re...

who dat

keen beacon Apr 17, 2025, 6:01 PM

#

the time doesn't really matter, they've released as late as 10pm BST and as early as 12pm BST before

visual turret Apr 17, 2025, 6:01 PM

#

elder rapids talking to the master of 2.5 pro in this server btw

The free version of AI studio uses exp

keen beacon Apr 17, 2025, 6:01 PM

#

vague orbit who dat

head of ai studio @ deepmind

elder rapids Apr 17, 2025, 6:01 PM

#

visual turret The free version of AI studio uses exp

yes they're the same exact thing

vague orbit Apr 17, 2025, 6:01 PM

#

so, is everyone using gemini via the website? like animals?

visual turret Apr 17, 2025, 6:01 PM

#

elder rapids yes they're the same exact thing

They aren't

elder rapids Apr 17, 2025, 6:01 PM

#

visual turret They aren't

@keen beacon confirm

vague orbit Apr 17, 2025, 6:01 PM

#

i haven't even seen chatgpt.com in months

visual turret Apr 17, 2025, 6:01 PM

#

vague orbit so, is everyone using gemini via the website? like animals?

Https://aistudio.google.com

keen beacon Apr 17, 2025, 6:01 PM

#

they're the exact same thing

#

lol

#

they are the same model chief

visual turret Apr 17, 2025, 6:01 PM

#

keen beacon they're the exact same thing

Have you tried it

keen beacon Apr 17, 2025, 6:02 PM

#

yes dawg

#

they are the same

#

i know for a fact

visual turret Apr 17, 2025, 6:02 PM

#

keen beacon yes dawg

On open router

keen beacon Apr 17, 2025, 6:02 PM

#

it's just naming differences

#

because of paid preview vs free experimental

visual turret Apr 17, 2025, 6:02 PM

#

keen beacon because of paid preview vs free experimental

They aren't the same

keen beacon Apr 17, 2025, 6:02 PM

#

bro.

#

yes they are

#

i know for a fact

#

this is coming from people at deepmind

visual turret Apr 17, 2025, 6:02 PM

#

Gemini 2.5 pro preview is a further trained version of Gemini 2.5 pro

keen beacon Apr 17, 2025, 6:02 PM

#

no it isn't..

visual turret Apr 17, 2025, 6:02 PM

#

It is

keen beacon Apr 17, 2025, 6:02 PM

#

facepalm

#

no it ISN'T

#

omfg

visual turret Apr 17, 2025, 6:03 PM

#

This is a common practice

keen beacon Apr 17, 2025, 6:03 PM

#

mate

elder rapids Apr 17, 2025, 6:03 PM

#

keen beacon Apr 17, 2025, 6:03 PM

#

it has been said already that preview is just the name for the version of the model with increased api rate limits

#

that is it.

visual turret Apr 17, 2025, 6:03 PM

#

keen beacon mate

keen beacon Apr 17, 2025, 6:03 PM

#

it is not a model update

#

are you stupid

visual turret Apr 17, 2025, 6:04 PM

#

keen beacon it is not a model update

My search says otherwise

keen beacon Apr 17, 2025, 6:04 PM

#

???

#

buddy

visual turret Apr 17, 2025, 6:04 PM

#

#general message

keen beacon Apr 17, 2025, 6:04 PM

#

copilot isn't going to know is it

#

it's literally just pulling from generic web sources

visual turret Apr 17, 2025, 6:04 PM

#

keen beacon copilot isn't going to know is it

That isn't copilot

keen beacon Apr 17, 2025, 6:04 PM

#

..

#

that is copilot

compact knoll Apr 17, 2025, 6:04 PM

#

visual turret Gemini 2.5 pro preview is a further trained version of Gemini 2.5 pro

that's illogic lol

keen beacon Apr 17, 2025, 6:04 PM

#

ai in bing is copilot

visual turret Apr 17, 2025, 6:05 PM

#

compact knoll that's illogic lol

Not really.

balmy mist Apr 17, 2025, 6:05 PM

#

visual turret My search says otherwise

how many times have you used exp and preview?

keen beacon Apr 17, 2025, 6:05 PM

#

if it was a new model the date at the end of the ID would not be the same and they would publish new benchmark scores

upper wolf Apr 17, 2025, 6:05 PM

#

People actually pay attention to the browser’s ai suggestions? we’re cooked man…

balmy mist Apr 17, 2025, 6:05 PM

#

in all my tests they have been the same tbh

fleet lintel Apr 17, 2025, 6:05 PM

#

LMArena is now a company. That's interesting!

keen beacon Apr 17, 2025, 6:05 PM

#

literally one second of thought to reach that conclusion

zinc ore Apr 17, 2025, 6:05 PM

#

visual turret

Your source is character AI, how does that tell us about Google's naming practices?

balmy mist Apr 17, 2025, 6:05 PM

#

just check the api lmaoo

visual turret Apr 17, 2025, 6:05 PM

#

balmy mist how many times have you used exp and preview?

A lot

fleet lintel Apr 17, 2025, 6:05 PM

#

visual turret https://discord.com/channels/1340554757349179412/1340554757827461211/13624886999...

stop using bing... it sucks

keen beacon Apr 17, 2025, 6:06 PM

#

visual turret A lot

you are schizo if you think there's a difference

visual turret Apr 17, 2025, 6:06 PM

#

fleet lintel stop using bing... it sucks

I'm sorry I don't support a monopoly

zinc ore Apr 17, 2025, 6:06 PM

#

They've not introduced a further trained version of 2.5 pro yet, from initial release

keen beacon Apr 17, 2025, 6:06 PM

#

i have used the model hundreds of times

#

both as exp and as preview

#

they are exactly the same

balmy mist Apr 17, 2025, 6:06 PM

#

visual turret A lot

https://ai.google.dev/gemini-api/docs/models#model-versions

Google AI for Developers

Gemini models | Gemini API | Google AI for Developers

Learn about Google's most advanced AI models including Gemini 2.5 Pro

upper wolf Apr 17, 2025, 6:06 PM

#

Microsoft is the biggest company in the world

#

Wdym monopoly

visual turret Apr 17, 2025, 6:07 PM

#

zinc ore They've not introduced a further trained version of 2.5 pro yet, from initial re...

https://www.clrn.org/what-does-stable-and-preview-mean-on-character-ai/

California Learning Resource Network

CLRN team

What does stable and preview mean on character AI? - California Lea...

What does "Stable" and "Preview" Mean on Character AI? In the realm of character AI, "stable" and "preview" are two […]

upper wolf Apr 17, 2025, 6:07 PM

#

Bro think hes progressive for using Bing 😭 ?

plain zinc Apr 17, 2025, 6:07 PM

#

Are there any new Google models in LMarena?

visual turret Apr 17, 2025, 6:07 PM

#

upper wolf Bro think hes progressive for using Bing 😭 ?

More than you

fleet lintel Apr 17, 2025, 6:07 PM

#

visual turret I'm sorry I don't support a monopoly

then go for ddg ... bing is like picking the worst option in every way

visual turret Apr 17, 2025, 6:07 PM

#

I also got a paper in Harvard

zinc ore Apr 17, 2025, 6:07 PM

#

visual turret https://www.clrn.org/what-does-stable-and-preview-mean-on-character-ai/

That's character AI, which isn't Google

balmy mist Apr 17, 2025, 6:07 PM

#

visual turret I also got a paper in Harvard

broo

#

just go to google documentation

#

why you going everywhere else but google

visual turret Apr 17, 2025, 6:08 PM

#

balmy mist broo

I'm cited in this https://ui.adsabs.harvard.edu/abs/2024arXiv240801950L/abstract

ADS

Why Perturbing Symbolic Music is Necessary: Fitting the Distributio...

Existing music generation models are mostly language-based, neglecting the frequency continuity property of notes, resulting in inadequate fitting of rare or never-used notes and thus reducing the diversity of generated samples. We argue that the distribution of notes can be modeled by translational invariance and periodicity, especially using d...

compact knoll Apr 17, 2025, 6:08 PM

#

visual turret I also got a paper in Harvard

oh okay then you cant make mistake !

zinc ore Apr 17, 2025, 6:08 PM

#

This guy is trolling right? Lol

upper wolf Apr 17, 2025, 6:08 PM

#

visual turret I'm cited in this https://ui.adsabs.harvard.edu/abs/2024arXiv240801950L/abstract

Who

compact knoll Apr 17, 2025, 6:08 PM

#

zinc ore This guy is trolling right? Lol

i hope lol

visual turret Apr 17, 2025, 6:08 PM

#

upper wolf Who

Cited in the paper

#

At the bottom

upper wolf Apr 17, 2025, 6:08 PM

#

visual turret Cited in the paper

Asked?

balmy mist Apr 17, 2025, 6:08 PM

#

visual turret I'm cited in this https://ui.adsabs.harvard.edu/abs/2024arXiv240801950L/abstract

where does it say gemini 2.5 pro?

balmy mist Apr 17, 2025, 6:09 PM

#

visual turret Cited in the paper

you a funny guy lmaoo

visual turret Apr 17, 2025, 6:09 PM

#

Why are you all so closed minded

#

Jeez

upper wolf Apr 17, 2025, 6:09 PM

#

closed minded

compact knoll Apr 17, 2025, 6:09 PM

#

everyone is telling him he's wrong but his ego won't let him hear it 😁

visual turret Apr 17, 2025, 6:09 PM

#

Ask any AI and it will agree with me

zinc ore Apr 17, 2025, 6:09 PM

#

Also, the Gemini 2.5 version hasn't been called stable yet

upper wolf Apr 17, 2025, 6:09 PM

#

You sound insecure asf nobody here is questioning your intelligence but yourself

#

We’re just sayin youre wrong about the ai

balmy mist Apr 17, 2025, 6:10 PM

#

visual turret Why are you all so closed minded

they are the same, but even if they were different we couldnt tell based on any of our tests, so whats the point?

visual turret Apr 17, 2025, 6:10 PM

#

balmy mist they are the same, but even if they were different we couldnt tell based on any ...

Why does exp suck more than preview

zinc ore Apr 17, 2025, 6:10 PM

#

Screenshot_2025-04-17-21-05-06-96_b72a20be883aec8a014bd2b7c7038e87.jpg

keen beacon Apr 17, 2025, 6:10 PM

#

visual turret Why are you all so closed minded

if every single other person believes you to be wrong perhaps the problem is you

#

just a thought

balmy mist Apr 17, 2025, 6:11 PM

#

visual turret Why does exp suck more than preview

keen fulcrum Apr 17, 2025, 6:11 PM

#

Is gpt 3.5 returning?

keen beacon Apr 17, 2025, 6:11 PM

#

zinc ore

woah

keen beacon Apr 17, 2025, 6:11 PM

#

keen fulcrum Is gpt 3.5 returning?

that would be cool asf

#

memories

zinc ore Apr 17, 2025, 6:11 PM

#

Vertex so far

visual turret Apr 17, 2025, 6:11 PM

#

balmy mist

Albert Einstein: 'If you can't explain it simply, you don't understand it well enough.'

balmy mist Apr 17, 2025, 6:11 PM

#

zinc ore

wait this is new

#

omgg

#

everybody stop

#

lock in

visual turret Apr 17, 2025, 6:12 PM

#

balmy mist

Like what this was saying

balmy mist Apr 17, 2025, 6:12 PM

#

zinc ore

send link please

balmy mist Apr 17, 2025, 6:13 PM

#

zinc ore Vertex so far

ahh

#

from jimmy apples

#

so it should be any minute now

keen fulcrum Apr 17, 2025, 6:13 PM

#

zinc ore

Photoshop?

narrow elbow Apr 17, 2025, 6:14 PM

#

https://tenor.com/view/lmao-spit-take-cracking-up-haha-so-funny-omg-gif-8856945541583565377

Tenor

balmy mist Apr 17, 2025, 6:14 PM

#

keen fulcrum Photoshop?

nahh it was from apples

#

https://x.com/apples_jimmy/status/1912931455006900522

Jimmy Apples 🍎/acc (@apples_jimmy) on X

Flash out on vertex

keen beacon Apr 17, 2025, 6:14 PM

#

zinc ore Apr 17, 2025, 6:14 PM

#

https://x.com/apples_jimmy/status/1912931455006900522

Jimmy Apples 🍎/acc (@apples_jimmy) on X

Flash out on vertex

keen beacon Apr 17, 2025, 6:14 PM

#

there it is

balmy mist Apr 17, 2025, 6:14 PM

#

keen beacon there it is

i dont have vertex can you send link to it please

keen beacon Apr 17, 2025, 6:15 PM

#

https://console.cloud.google.com/vertex-ai/studio/multimodal

Google Cloud console

fleet lintel Apr 17, 2025, 6:15 PM

#

is anyone seeing 2.5 flash in their gemini.google.com ? One of my friend is seeing it

keen beacon Apr 17, 2025, 6:15 PM

#

am gonna see if this is Dragontail

upper wolf Apr 17, 2025, 6:15 PM

#

fleet lintel is anyone seeing 2.5 flash in their gemini.google.com ? One of my friend is se...

nope

keen beacon Apr 17, 2025, 6:15 PM

#

will crank up the thinking budget

#

hopefully i dont go bankrupt

upper wolf Apr 17, 2025, 6:15 PM

#

lemme check studio

balmy mist Apr 17, 2025, 6:16 PM

#

keen beacon will crank up the thinking budget

on that site it costs money right?

#

yeah its def not nightwhisper lmaoo

#

wait

#

this model....

#

built in web search?

#

i dont have google grounding on but its using web

fleet lintel Apr 17, 2025, 6:18 PM

#

balmy mist i dont have google grounding on but its using web

where is this? cloud console?

balmy mist Apr 17, 2025, 6:18 PM

#

keen beacon https://console.cloud.google.com/vertex-ai/studio/multimodal

here

#

this might actually be nw

#

the thinking is taking a while for a flash model

fleet lintel Apr 17, 2025, 6:19 PM

#

no way that NW is flash model.. please

keen beacon Apr 17, 2025, 6:20 PM

#

just tried it

fleet lintel Apr 17, 2025, 6:20 PM

#

so much thinking for Flash model.. it's crazy

keen beacon Apr 17, 2025, 6:20 PM

#

it may be dragonwhisper but hmm

#

it does seem very good for flash

balmy mist Apr 17, 2025, 6:20 PM

#

nahh bro this might be nw

#

yoo

keen beacon Apr 17, 2025, 6:20 PM

#

and uses more reasoning tokens than 2.5 pro

#

it isn't nightwhisper lol

#

i doubt it

sage raptor Apr 17, 2025, 6:21 PM

#

is this nightwhisper ?

balmy mist Apr 17, 2025, 6:21 PM

#

https://liveweave.com/A9OGzH

#

it made that

#

hold up let me use another site to share it

zinc ore Apr 17, 2025, 6:22 PM

#

Vertex atm

barren prairie Apr 17, 2025, 6:22 PM

#

Let s wait
I want it on Ai studio

balmy mist Apr 17, 2025, 6:22 PM

#

@torn mantle

#

help me test this lmaoo

#

@keen beacon are you using the manual or auto thinking? im scared to touch that lol

leaden meteor Apr 17, 2025, 6:23 PM

#

How come leaderboard is not updated if flash is already out? I am sure nw or dragontail is flash...?

balmy mist Apr 17, 2025, 6:23 PM

#

this model is better than 2.5 pro imo, need more tests, but on a one shot coding it clears it

narrow elbow Apr 17, 2025, 6:23 PM

#

keen beacon Apr 17, 2025, 6:24 PM

#

balmy mist @keen beacon are you using the manual or auto thinking? im scared to to...

manual cranked up to the maz

#

max

#

i have money to burn

fleet lintel Apr 17, 2025, 6:24 PM

#

barren prairie Let s wait I want it on Ai studio

it's on cloud console.. AIstudio is just matter of hours

zinc ore Apr 17, 2025, 6:24 PM

#

We'll likely get an arena update on it sometime today

keen beacon Apr 17, 2025, 6:24 PM

#

leaden meteor How come leaderboard is not updated if flash is already out? I am sure nw or dra...

give it time

#

hasn't even been actually announced yet

balmy mist Apr 17, 2025, 6:25 PM

#

keen beacon manual cranked up to the maz

do we get this thinking budget with 2.5 pro regular?

#

never used this platform before

#

why is there studio and vertex?

zinc ore Apr 17, 2025, 6:26 PM

#

Someone got this

keen beacon Apr 17, 2025, 6:26 PM

#

balmy mist do we get this thinking budget with 2.5 pro regular?

i

#

no*

fleet lintel Apr 17, 2025, 6:26 PM

#

how do I make it think less? I want really fast responses like sub 2 seconds for 128K tokens

keen beacon Apr 17, 2025, 6:26 PM

#

you can turn off thinking

#

fleet lintel Apr 17, 2025, 6:27 PM

#

keen beacon

oh..thank you!

#

pinged my team to start working on testing this stuff.... I am excited!

balmy mist Apr 17, 2025, 6:27 PM

#

zinc ore Someone got this

i hate you how!!!

fleet lintel Apr 17, 2025, 6:28 PM

#

zinc ore Someone got this

pricing looks great !

elder rapids Apr 17, 2025, 6:29 PM

#

how fast is it

balmy mist Apr 17, 2025, 6:30 PM

#

yo this model is fire!!!

balmy mist Apr 17, 2025, 6:30 PM

#

elder rapids how fast is it

hard to tell but latency seems on par with 2.5 pro

rose thicket Apr 17, 2025, 6:30 PM

#

balmy mist yo this model is fire!!!

Is this nightwhisper????

balmy mist Apr 17, 2025, 6:30 PM

#

when its coding its auto searching and using git repos

#

i think it is tbh, but others say nah, im doing more tests

elder rapids Apr 17, 2025, 6:31 PM

#

is it smart?

#

please tell me it's smart asf 🙏

balmy mist Apr 17, 2025, 6:31 PM

#

nahh its not nw

#

but its better than 2.5 pro at coding

#

but not on nw level

rose thicket Apr 17, 2025, 6:31 PM

#

I haven't got access yet

elder rapids Apr 17, 2025, 6:31 PM

#

balmy mist but its better than 2.5 pro at coding

what?

zinc ore Apr 17, 2025, 6:31 PM

#

Also has native tool calling, which 2.5 pro doesn't have

elder rapids Apr 17, 2025, 6:31 PM

#

it is?

fleet lintel Apr 17, 2025, 6:31 PM

#

balmy mist but its better than 2.5 pro at coding

that would be huge!

zinc ore Apr 17, 2025, 6:32 PM

#

fleet lintel Apr 17, 2025, 6:32 PM

#

balmy mist but its better than 2.5 pro at coding

how did you determine that? any prompts to try?

zinc ore Apr 17, 2025, 6:32 PM

#

"call tools natively"
"Agentic use cases"

dapper storm Apr 17, 2025, 6:32 PM

#

Wow 😲😲😲

keen beacon Apr 17, 2025, 6:33 PM

#

zinc ore "call tools natively" "Agentic use cases"

damn its gonna be o4-mini level

#

probably cheaper, google has so many spare gpus

balmy mist Apr 17, 2025, 6:34 PM

#

fleet lintel how did you determine that? any prompts to try?

i did the pokemon test

rose thicket Apr 17, 2025, 6:34 PM

#

Share the prompt plz!

balmy mist Apr 17, 2025, 6:34 PM

#

and the output was better than what I got from 3.7 and 2.5 in zero shot

#

okay, one sec, running one more test

#

i had thinking on max and it gave me a slightly better pokemon sim than 2.5, but let me try auto again

rose thicket Apr 17, 2025, 6:36 PM

#

keen beacon probably cheaper, google has so many spare gpus

Google was just aura farming and really said ' go to bed kidsss'

lime coral Apr 17, 2025, 6:36 PM

#

balmy mist i had thinking on max and it gave me a slightly better pokemon sim than 2.5, but...

You know there are more than one 2.5 now

keen beacon Apr 17, 2025, 6:36 PM

#

if 2.5 flash beats 2.5 pro they are cooking SO hard

#

and it's cheap asf

fleet lintel Apr 17, 2025, 6:37 PM

#

keen beacon if 2.5 flash beats 2.5 pro they are cooking SO hard

not happening 🙂

keen beacon Apr 17, 2025, 6:37 PM

#

keen beacon if 2.5 flash beats 2.5 pro they are cooking SO hard

doubt should be slightly worse just cheaper

balmy mist Apr 17, 2025, 6:37 PM

#

i do like this output:
https://liveweave.com/A9OGzH#

keen beacon Apr 17, 2025, 6:37 PM

#

it can and probably will beat it in at least one category

#

i didn't say i expect it to beat 2.5 pro universally

#

but smaller models have their strengths

#

edpeciallt

#

especially*

#

when trained well

torn mantle Apr 17, 2025, 6:38 PM

#

balmy mist @keen beacon are you using the manual or auto thinking? im scared to to...

wdym

rose thicket Apr 17, 2025, 6:38 PM

#

balmy mist i do like this output: https://liveweave.com/A9OGzH#

Flash did this!!? 😲

torn mantle Apr 17, 2025, 6:38 PM

#

test what?

leaden meteor Apr 17, 2025, 6:38 PM

#

Flash is smaller model? I thought it was taking more tokens than 2.5 exp?

balmy mist Apr 17, 2025, 6:38 PM

#

rose thicket Flash did this!!? 😲

yeah 0-shot, prompt: make a pokemon game

#

auto thinking

keen beacon Apr 17, 2025, 6:38 PM

#

leaden meteor Flash is smaller model? I thought it was taking more tokens than 2.5 exp?

it's a smaller model that uses more tokens for reasoning

balmy mist Apr 17, 2025, 6:38 PM

#

torn mantle test what?

2.5 pro flash

sage raptor Apr 17, 2025, 6:38 PM

#

balmy mist auto thinking

so no max thinking and it did that ?

balmy mist Apr 17, 2025, 6:39 PM

#

sage raptor so no max thinking and it did that ?

yeah just auto, thats why i wanted to test auto again

rose thicket Apr 17, 2025, 6:39 PM

#

I figured out that 2.5 pro just works better on 0.65 temp

torn mantle Apr 17, 2025, 6:39 PM

#

balmy mist 2.5 pro flash

let me see

balmy mist Apr 17, 2025, 6:39 PM

#

max thinking could be overthinking

rose thicket Apr 17, 2025, 6:39 PM

#

rose thicket I figured out that 2.5 pro just works better on 0.65 temp

Try it on flash

brittle tiger Apr 17, 2025, 6:39 PM

#

I just got access on AI studio. Rollout happening for sure

balmy mist Apr 17, 2025, 6:39 PM

#

torn mantle let me see

https://liveweave.com/A9OGzH# this is what it made, or you can give me a fresh prompt to try?

torn mantle Apr 17, 2025, 6:40 PM

#

brittle tiger I just got access on AI studio. Rollout happening for sure

i still didnt 😦

brittle tiger Apr 17, 2025, 6:40 PM

#

I spoke too soon lmao

torn mantle Apr 17, 2025, 6:40 PM

#

balmy mist https://liveweave.com/A9OGzH# this is what it made, or you can give me a fresh p...

looks cool

#

this seems similar to stargazer

golden ocean Apr 17, 2025, 6:40 PM

#

torn mantle i still didnt 😦

Connect to usa vpn and reload

brittle tiger Apr 17, 2025, 6:40 PM

#

It is possible to select for me. just no outputs yet

golden ocean Apr 17, 2025, 6:40 PM

#

then u will get 2.5 flash

rose thicket Apr 17, 2025, 6:40 PM

#

torn mantle this seems similar to stargazer

I thought riverhollow

golden ocean Apr 17, 2025, 6:41 PM

#

brittle tiger I spoke too soon lmao

I also spoke too soon

balmy mist Apr 17, 2025, 6:42 PM

#

i got it on studio!!!

#

refresh yall

elder rapids Apr 17, 2025, 6:42 PM

#

apparently it's better in coding

golden ocean Apr 17, 2025, 6:42 PM

#

balmy mist refresh yall

msg it

#

Will it give error

brittle tiger Apr 17, 2025, 6:43 PM

#

working for me now

golden ocean Apr 17, 2025, 6:43 PM

#

same despite error it responded

fleet lintel Apr 17, 2025, 6:43 PM

#

balmy mist i got it on studio!!!

I am in EU. rollout will happen at the very last minute for me 😦

balmy mist Apr 17, 2025, 6:43 PM

#

golden ocean same despite error it responded

refresh again

sage raptor Apr 17, 2025, 6:43 PM

#

fleet lintel I am in EU. rollout will happen at the very last minute for me 😦

same 😭

golden ocean Apr 17, 2025, 6:44 PM

#

sage raptor same 😭

Connect to usa vpn

elder rapids Apr 17, 2025, 6:44 PM

#

golden ocean same despite error it responded

check thinking time

balmy mist Apr 17, 2025, 6:44 PM

#

i dont see thinking budget tho

#

am i blind?

keen beacon Apr 17, 2025, 6:44 PM

#

🤣

fleet lintel Apr 17, 2025, 6:44 PM

#

maybe it's only for Vertex users

thorny drum Apr 17, 2025, 6:45 PM

#

CONFIDENTIAL

fleet lintel Apr 17, 2025, 6:45 PM

#

I mean the thinking budget part

thorny drum Apr 17, 2025, 6:45 PM

#

leaking insider info yet again

keen ferry Apr 17, 2025, 6:45 PM

#

fleet lintel I am in EU. rollout will happen at the very last minute for me 😦

i got it with vpn

golden ocean Apr 17, 2025, 6:45 PM

#

elder rapids check thinking time

from that message? it's 0.7s
(2.6s for 2.5 pro)

keen beacon Apr 17, 2025, 6:45 PM

#

thorny drum leaking insider info yet again

old habits die hard 😉

golden ocean Apr 17, 2025, 6:45 PM

#

balmy mist refresh again

refreshed and the confidential tab is gone 😔

elder rapids Apr 17, 2025, 6:45 PM

#

golden ocean from that message? is 0.7s (2.6s for 2.5 pro)

how much text is in the thinking box

golden ocean Apr 17, 2025, 6:45 PM

#

elder rapids how much text is in the thinking box

elder rapids Apr 17, 2025, 6:46 PM

#

how much for 2.5 pro?

golden ocean Apr 17, 2025, 6:46 PM

#

zinc ore Apr 17, 2025, 6:47 PM

#

I have flash now too

fringe carbon Apr 17, 2025, 6:47 PM

#

what is the general consensus between 2.5 and the new gpt models?

keen ferry Apr 17, 2025, 6:47 PM

#

i like that it just disappears and then appears again

fringe carbon Apr 17, 2025, 6:47 PM

#

can anyone here really definitively say one is better?

#

cuz ngl they are so close to me

#

different style wise

brittle tiger Apr 17, 2025, 6:48 PM

#

is flash 2.5 the first model to determine whether thinking tokens are needed or not with "auto" selected?

rose thicket Apr 17, 2025, 6:48 PM

#

Flash seems to be tailored for coding purpose

elder rapids Apr 17, 2025, 6:48 PM

#

golden ocean

thanks

#

alright I got flash 2.5

balmy mist Apr 17, 2025, 6:49 PM

#

golden ocean refreshed and the confidential tab is gone 😔

keep refreshing

#

mine is not in confidential anymore

#

elder rapids Apr 17, 2025, 6:50 PM

#

alr give me a query

#

I'm gonna ask it stuff

golden ocean Apr 17, 2025, 6:50 PM

#

Let a < b < c be distinct natural numbers. Must every block of c consecutive natural numbers contain three distinct numbers whose product is a multiple of abc?

balmy mist Apr 17, 2025, 6:51 PM

#

its fast, slightly faster than 2.5

narrow elbow Apr 17, 2025, 6:51 PM

#

🤪

brittle tiger Apr 17, 2025, 6:51 PM

#

I keep getting errors. Gonna go do an errand. at least it's confirmed for today

balmy mist Apr 17, 2025, 6:51 PM

#

actually i cant tell if its faster

keen beacon Apr 17, 2025, 6:51 PM

#

Click the timer

#

It has latency and tps

#

woah.. hello

#

Huh

#

GeoGuessr time lol

#

IT'S US ONLY 💔

ember rapids Apr 17, 2025, 6:53 PM

#

U can set the thinking budget on ai studio

#

Pretty cool

brittle tiger Apr 17, 2025, 6:53 PM

#

keen beacon IT'S US ONLY 💔

for now

golden ocean Apr 17, 2025, 6:54 PM

#

I got it on EU now

keen beacon Apr 17, 2025, 6:54 PM

#

I just got 2.5 flash they're rolling it out fast

golden ocean Apr 17, 2025, 6:54 PM

#

zinc ore Apr 17, 2025, 6:54 PM

#

Looks like you can set the thinking budget on aistudio

balmy mist Apr 17, 2025, 6:54 PM

#

zinc ore Looks like you can set the thinking budget on aistudio

hmm i don't have that, let me refresh

#

and its free lmaoo bruhh

keen beacon Apr 17, 2025, 6:54 PM

#

I don't have yet either

ember rapids Apr 17, 2025, 6:55 PM

#

Toggle thinking mode off and on and you should see it

balmy mist Apr 17, 2025, 6:55 PM

#

yupp refreshing did the trick

narrow elbow Apr 17, 2025, 6:55 PM

#

refresh ,got it

fleet lintel Apr 17, 2025, 6:55 PM

#

Nice! got it on aistudio!

keen beacon Apr 17, 2025, 6:55 PM

#

Just got it

keen ferry Apr 17, 2025, 6:56 PM

#

Screenshot_2025-04-17-21-56-15-376-edit_com.android.chrome.jpg

ember rapids Apr 17, 2025, 6:56 PM

#

mann google is cooking

elder rapids Apr 17, 2025, 6:56 PM

#

golden ocean Let a < b < c be distinct natural numbers. Must every block of c consecutive nat...

what's the answer

golden ocean Apr 17, 2025, 6:57 PM

#

"no"

keen beacon Apr 17, 2025, 6:57 PM

#

Is the thinking budget working for yall

#

It's being ignored for me it seems

elder rapids Apr 17, 2025, 6:57 PM

#

golden ocean "no"

that made me laugh

balmy mist Apr 17, 2025, 6:57 PM

#

aii time to put flash against pro, give me some prompt for games and web dev

ember rapids Apr 17, 2025, 6:58 PM

#

I think flash with max thinking tokens beats 2.5 pro at coding

balmy mist Apr 17, 2025, 6:59 PM

#

i got like 3 tabs open testing lmaoo

#

yoooooo look at this with auto thinking:
https://liveweave.com/A9OGzH#

#

play it a lil

#

animations are crazy tbh

#

@torn mantle

narrow elbow Apr 17, 2025, 7:00 PM

#

elder rapids that made me laugh

i got "YES" thinking budget 8000

golden ocean Apr 17, 2025, 7:01 PM

#

its thinking forever for me

elder rapids Apr 17, 2025, 7:02 PM

#

narrow elbow i got "YES" thinking budget 8000

it keeps cutting out for me

#

can't get an answer

balmy mist Apr 17, 2025, 7:03 PM

#

its long and it put the animations in a weird place, but pretty cool

#

look at the charizard one

#

and prompt was make a pokemon game and i just told it to add a new feature and animations

sage raptor Apr 17, 2025, 7:07 PM

#

it might be nightwhisper or something close

#

with max thinking tokens

balmy mist Apr 17, 2025, 7:09 PM

#

to use 24k thinking tokens is wild tho

leaden meteor Apr 17, 2025, 7:09 PM

#

2.5 flash is meh. It's worse than grok3.

#

Nowhere close to 2.5 pro

balmy mist Apr 17, 2025, 7:10 PM

#

leaden meteor 2.5 flash is meh. It's worse than grok3.

what tests you ran?

keen beacon Apr 17, 2025, 7:11 PM

#

oh the thinking budget works but the scale is weird

elder rapids Apr 17, 2025, 7:12 PM

#

leaden meteor 2.5 flash is meh. It's worse than grok3.

this is cap

balmy mist Apr 17, 2025, 7:12 PM

#

keen beacon oh the thinking budget works but the scale is weird

yeah im getting weird results with max thinking

elder rapids Apr 17, 2025, 7:12 PM

#

2.5 flash is pretty good

torn mantle Apr 17, 2025, 7:12 PM

#

balmy mist @torn mantle

wtf

#

it wrote all of that?

#

thats crazy

balmy mist Apr 17, 2025, 7:12 PM

#

elder rapids this is cap

yeah he just coping, i think he might be elon

balmy mist Apr 17, 2025, 7:12 PM

#

torn mantle it wrote all of that?

yeah bro wild

torn mantle Apr 17, 2025, 7:12 PM

#

balmy mist yeah im getting weird results with max thinking

wdym by weird

leaden meteor Apr 17, 2025, 7:12 PM

#

Lol. Leaderboard is updated. haha...

keen beacon Apr 17, 2025, 7:13 PM

#

i set the thinking budget to 2 and it does way more than 2 tokens in its thoughts but it cuts off after a while

#

it also breaks the model at least on the prompt im using lol

balmy mist Apr 17, 2025, 7:14 PM

#

torn mantle wdym by weird

balmy mist Apr 17, 2025, 7:14 PM

#

keen beacon i set the thinking budget to 2 and it does way more than 2 tokens in its thought...

interesting, did you test out 0 tokens?

keen beacon Apr 17, 2025, 7:15 PM

#

balmy mist interesting, did you test out 0 tokens?

it just flips the thinking budget off if its 0

balmy mist Apr 17, 2025, 7:16 PM

#

and it works like you dont see the thinking anymore?

torn mantle Apr 17, 2025, 7:16 PM

#

balmy mist

ah

balmy mist Apr 17, 2025, 7:16 PM

#

i wonder why they dont do it for pro

keen beacon Apr 17, 2025, 7:16 PM

#

balmy mist and it works like you dont see the thinking anymore?

ya

elder rapids Apr 17, 2025, 7:16 PM

#

flash 2.5 is crazy cheap

ember rapids Apr 17, 2025, 7:17 PM

#

barren prairie Apr 17, 2025, 7:17 PM

#

Cute robot Gemini flash 2.5 🥺🩷🩵

cedar tide Apr 17, 2025, 7:18 PM

#

result of 2.5 flash on this prompt

elder rapids Apr 17, 2025, 7:18 PM

#

check price lmao

keen beacon Apr 17, 2025, 7:18 PM

#

flash 2.5 is much cheaper tho?

#

it might be a better model but the pricing

elder rapids Apr 17, 2025, 7:18 PM

#

1/10 the input price

#

and if they're calculating reasoning

#

this is probably maximum

balmy mist Apr 17, 2025, 7:19 PM

#

i knew you would come

#

to preach the good word of openai lol

zinc ore Apr 17, 2025, 7:19 PM

#

Just wait until we see the pricing on aider

elder rapids Apr 17, 2025, 7:19 PM

#

where o4 mini is probably around ~15 dollars maximum

wintry tinsel Apr 17, 2025, 7:19 PM

#

cedar tide result of 2.5 flash on this prompt

2.5 flash is undoubtedly much better at anything not math/reasoning/coding though

zinc ore Apr 17, 2025, 7:19 PM

#

Pro is 1/3 of the mini pricing, so flash should be noticeably cheaper

elder rapids Apr 17, 2025, 7:20 PM

#

what does?

thorny drum Apr 17, 2025, 7:20 PM

#

ember rapids

this benchmark is funny

#

60c (but 350c to get the results on the benchmarks)

keen beacon Apr 17, 2025, 7:20 PM

#

2.5 flash is getting stuck in a thinking loop for me rn :\

fleet lintel Apr 17, 2025, 7:21 PM

#

o4 mini should be compared with 2.5 Pro. They have similar prices. And in practice, o4 mini is expensive compared to 2.5 pro

elder rapids Apr 17, 2025, 7:21 PM

#

keen beacon 2.5 flash is getting stuck in a thinking loop for me rn :\

this was the same problem with 2.5 pro at release

elder rapids Apr 17, 2025, 7:21 PM

#

fleet lintel o4 mini should be compared with 2.5 Pro . they have similar price. And in practi...

ye

#

"in practice"

keen beacon Apr 17, 2025, 7:22 PM

#

elder rapids this was the same problem with 2.5 pro at release

still happens with 2.5 pro, it just seems its not good at this particular problem lol

elder rapids Apr 17, 2025, 7:22 PM

#

keen beacon still happens with 2.5 pro, it just seems its not good at this particular proble...

oh fr?

keen beacon Apr 17, 2025, 7:22 PM

#

yea

elder rapids Apr 17, 2025, 7:22 PM

#

yeah but then you shouldn't make it seem like that's not the claim either

keen beacon Apr 17, 2025, 7:22 PM

#

dont get me wrong 2.5 pro/2.5 flash are amazing but yeah it gets stuck on this

elder rapids Apr 17, 2025, 7:22 PM

#

o4 mini should be compared with 2.5 pro

#

since [state your reasoning]

fleet lintel Apr 17, 2025, 7:23 PM

#

check again... $1.10 vs $1.25.. they are almost the same. but to solve the same problem, o4 mini is more expensive
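The point here is plain arithmetic: what you pay per problem is the per-token price times the tokens actually spent, including retries. A minimal sketch, assuming the two per-million-token figures quoted above ($1.10 and $1.25) apply to every token (a simplification; real output and reasoning tokens are billed at separate, higher rates). The token counts and attempt counts below are hypothetical, chosen only to illustrate the argument.

```python
# USD per one million tokens. The 1.10 / 1.25 figures are the ones quoted in
# the chat; charging every token at this single rate is an assumption.
PRICE_PER_MTOK = {
    "o4-mini": 1.10,
    "gemini-2.5-pro": 1.25,
}


def request_cost(model: str, tokens: int, attempts: int = 1) -> float:
    """Approximate USD cost of solving one problem.

    tokens   -- tokens consumed per attempt (prompt + reasoning + answer)
    attempts -- number of tries needed, so repeats are counted too
    """
    return PRICE_PER_MTOK[model] * tokens * attempts / 1_000_000


if __name__ == "__main__":
    # Hypothetical workloads: slightly cheaper rate, but more tokens and a retry,
    # versus a slightly higher rate with fewer tokens in a single attempt.
    o4_mini = request_cost("o4-mini", tokens=100_000, attempts=2)
    pro = request_cost("gemini-2.5-pro", tokens=60_000)
    print(f"o4-mini:        ${o4_mini:.3f}")
    print(f"gemini-2.5-pro: ${pro:.3f}")
```

With near-identical list prices, whichever model burns more tokens (or needs more attempts) ends up costing more per solved problem.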

elder rapids Apr 17, 2025, 7:23 PM

#

ye

#

dude

keen beacon Apr 17, 2025, 7:23 PM

#

omg it just did 40k+ tokens in reasoning 🤣

#

and gave up

#

btw if you turn off the thinking budget it can do way more tokens; with the budget on it caps at 25k

#

thinking budget doesnt seem to help the model

#

where o3

#

select it in direct chat

#

or do arena battle

balmy mist Apr 17, 2025, 7:26 PM

#

keen beacon btw turn off thinking budget it can do way more tokens if it does cap it at 25k

thats what i noticed too

fleet lintel Apr 17, 2025, 7:26 PM

#

i want cheap and fast models for my use-case. honestly, i have no alternative compared to google flash models

balmy mist Apr 17, 2025, 7:27 PM

#

sonnet is ur fav coding model?

#

lol

elder rapids Apr 17, 2025, 7:27 PM

#

ye I guess now it really is up to preference

balmy mist Apr 17, 2025, 7:27 PM

#

yeah if you like to spend money or not

torn mantle Apr 17, 2025, 7:28 PM

#

cedar tide result of 2.5 flash on this prompt

looks like stargazer

balmy mist Apr 17, 2025, 7:28 PM

#

lmaoo craig you live to hate on google man

#

you gotta sauce them up a lil

#

give them some lovin

cedar tide Apr 17, 2025, 7:30 PM

#

Screenshot_2025-04-17-21-29-10-417_com.android.chrome-edit.jpg

brittle tiger Apr 17, 2025, 7:31 PM

#

Much cheaper

zinc ore Apr 17, 2025, 7:31 PM

#

That's pretty good, nearly 1400

opaque adder Apr 17, 2025, 7:31 PM

#

cedar tide

and nightwhisper vs o3 high?

cedar tide Apr 17, 2025, 7:31 PM

#

brittle tiger Much cheaper

The medium version is cheaper

balmy mist Apr 17, 2025, 7:32 PM

#

https://liveweave.com/A9OGzH#

#

the game is a lil wonky

#

but impressive

brittle tiger Apr 17, 2025, 7:33 PM

#

it does. convo was about price tho

cedar tide Apr 17, 2025, 7:36 PM

#

brittle tiger Much cheaper

Screenshot_2025-04-17-21-36-20-388_com.android.chrome-edit.jpg

leaden meteor Apr 17, 2025, 7:41 PM

#

2.5 flash can't be nw or dragontail, can it? I remember now or dt doing pretty well when compared to 2.5 pro...

#

Nw or dt*

sage raptor Apr 17, 2025, 7:43 PM

#

nahh

#

not nw

#

or dt

leaden meteor Apr 17, 2025, 7:44 PM

#

Wonder why they are not released yet.

#

So, what was flash? River hollow?

brittle tiger Apr 17, 2025, 7:45 PM

#

cedar tide

Conveniently cropped out the context: it doesn't include the price of repeats, which is why o4-mini's price was so much higher on Aider, because Aider's author does include the real cost.

cedar tide Apr 17, 2025, 7:45 PM

#

brittle tiger Conveniently cropped out the context it doesn't include the price of repeats whi...

Oh i dont see this 🧐

fleet lintel Apr 17, 2025, 7:51 PM

#

leaden meteor So, what was flash? River hollow?

feels very much like riverhollow to me

fleet lintel Apr 17, 2025, 7:52 PM

#

brittle tiger Conveniently cropped out the context it doesn't include the price of repeats whi...

for my product, o4 mini is borderline unusable because of cost. Sometimes it depends on the use-case as well

leaden meteor Apr 17, 2025, 7:53 PM

#

I guess Gemini is waiting for o3 or o4mini to come on leaderboard to steal the thunder again with night whisperer or dragontail...

silk haven Apr 17, 2025, 8:03 PM

#

fleet lintel Apr 17, 2025, 8:04 PM

#

silk haven

what is the chart trying to show

lime coral Apr 17, 2025, 8:04 PM

#

So name reveal of flash?

silk haven Apr 17, 2025, 8:05 PM

#

fleet lintel what is the chart trying to show

Gemini 2.5 Flash on the best price-to-performance ratio curve

lime coral Apr 17, 2025, 8:05 PM

#

Stargazer? Dragon tail?

torn mantle Apr 17, 2025, 8:17 PM

#

google are killing it lately

distant egret Apr 17, 2025, 8:18 PM

#

can anyone guide me how to have perplexity deep research api in custom gpts of chatgpt?

drifting elk Apr 17, 2025, 8:18 PM

#

Plz add web search in lm arena beta version

balmy mist Apr 17, 2025, 8:19 PM

#

torn mantle google are killing it lately

yo lets cook something up

#

give me some prompts

drifting elk Apr 17, 2025, 8:19 PM

#

silk haven Gemini 2.5 Flash on the best price-to-performance ratio curve

I have tested it out. Still pro better

#

You are a vibe coder

#

Arghhh

#

I think pro is way better than flash

olive mesa Apr 17, 2025, 8:20 PM

#

i mean yeah

drifting elk Apr 17, 2025, 8:20 PM

#

Nop

olive mesa Apr 17, 2025, 8:20 PM

#

flash isnt supposed to be better than pro

drifting elk Apr 17, 2025, 8:21 PM

#

What makes you think flash is better

drifting elk Apr 17, 2025, 8:21 PM

#

olive mesa flash isnt supposed to be better than pro

That is right 👍

balmy mist Apr 17, 2025, 8:22 PM

#

i think flash is better with tools imo

drifting elk Apr 17, 2025, 8:22 PM

#

I have tested several models for coding and came to a conclusion: whether you use Grok or Gemini or GPT or Claude, these LLMs are helping programmers, not replacing them

balmy mist Apr 17, 2025, 8:23 PM

#

nahh i love flash

tall summit Apr 17, 2025, 8:23 PM

#

heyyyyyy

#

beta came out

drifting elk Apr 17, 2025, 8:23 PM

#

So pure programmers are better and excellent than vibe coders

balmy mist Apr 17, 2025, 8:23 PM

#

i think its vibes are good

tall summit Apr 17, 2025, 8:23 PM

#

wait more stuff came out?

balmy mist Apr 17, 2025, 8:23 PM

#

drifting elk So pure programmers are better and excellent than vibe coders

thats how it will always be lol

#

just like photographers are better than any joe with a smartphone

tall summit Apr 17, 2025, 8:24 PM

#

oh flash preview #1362418052842524672

drifting elk Apr 17, 2025, 8:24 PM

#

And a programmer who uses AI (to help him) is much excellent and better than a casual one

tall summit Apr 17, 2025, 8:24 PM

#

thanks @balmy mist even though we argued like once

you kept the thread updated and it helped !!

drifting elk Apr 17, 2025, 8:25 PM

#

Vibe coders will find jobs but not better ones

tall summit Apr 17, 2025, 8:25 PM

#

oh its available

balmy mist Apr 17, 2025, 8:25 PM

#

tall summit thanks @balmy mist even though we argued like once you kept the threa...

i gotchu!

tall summit Apr 17, 2025, 8:25 PM

#

on ai studio?!

balmy mist Apr 17, 2025, 8:25 PM

#

tall summit oh its available

on studio and vertex yupp

#

with thinking modes

tall summit Apr 17, 2025, 8:25 PM

#

hell ya

balmy mist Apr 17, 2025, 8:25 PM

#

budgets*

tall summit Apr 17, 2025, 8:25 PM

#

well i wont use it

drifting elk Apr 17, 2025, 8:25 PM

#

So do not say coders are dying, GPT-5 is AGI, Gemini is the best in the universe!!!

balmy mist Apr 17, 2025, 8:26 PM

#

it seems reg thinking is better

drifting elk Apr 17, 2025, 8:26 PM

#

Remember that

balmy mist Apr 17, 2025, 8:26 PM

#

drifting elk So do not say coders are dying gpt 5 is agi, Gemini is the best in the universe!...

lmaoo bro

tall summit Apr 17, 2025, 8:26 PM

#

balmy mist it seems reg thinking is better

than what? high?

drifting elk Apr 17, 2025, 8:26 PM

#

Pro

#

Is better

balmy mist Apr 17, 2025, 8:26 PM

#

tall summit than what? high?

like dont edit the thinking on flash, just use it OOTB

#

when you increase thinking it makes it worse based on our tests

#

but i would test it yourself

tall summit Apr 17, 2025, 8:27 PM

#

i didnt know there was that much thinking customization

#

in the first place

balmy mist Apr 17, 2025, 8:27 PM

#

yeah it seems new with flash, not sure why they dont have it with pro

keen fulcrum Apr 17, 2025, 8:27 PM

#

https://help.kagi.com/kagi/ai/llm-benchmark.html

Kagi LLM Benchmarking Project | Kagi's Docs

Kagi Search Help

tall summit Apr 17, 2025, 8:27 PM

#

oh phew new with flash

#

thinking budget is measured in tokens? i guess i could just search that up

drifting elk Apr 17, 2025, 8:28 PM

#

Even if flash is experimental, it won't replace pro or outperform it

tall summit Apr 17, 2025, 8:29 PM

#

we'll see

drifting elk Apr 17, 2025, 8:29 PM

#

Even with custom thinking