#general


balmy mist
#

maybe for kids

sturdy mica
#

i’m leaving this chat

balmy mist
#

it sucks we cant have multiple tools at once

sturdy mica
balmy mist
sturdy mica
#

i wish we could, you can with api

balmy mist
#

yeah does not make sense

sturdy mica
#

and with the live streaming thing but that only works with 2.0 flash

balmy mist
#

have you played around with gemini 2.5 and tools?

sturdy mica
#

yes

balmy mist
#

how was it and what did you use?

sturdy mica
#

i use gemini 2.5 with grounding, function calling, code execution

#

they all work how you think they would

#

why?

balmy mist
#

i guess this is how you use multiple tools

balmy mist
#

what does code execution od?

#

do*

#

just executes whatever code you give it?

#

does it pop open view for it?

#

i need to dive deeper

sturdy mica
#

“generate and run code that gives a random number “

#

and it will write python code and show output
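
For reference, a prompt like that usually comes back with a short script along these lines. This is only a sketch of the kind of Python the tool might write, not captured output; the function name `pick_random_number` and the 1-100 range are assumptions made for illustration.

```python
import random


def pick_random_number(low: int = 1, high: int = 100) -> int:
    """Return a random integer in [low, high], inclusive on both ends."""
    return random.randint(low, high)


if __name__ == "__main__":
    # The tool would run the script and show this printed line as the output.
    print(f"Random number: {pick_random_number()}")
```

The printed value changes on every run, which is the point of the prompt.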

#

ill screenshot rn

#

ok i cant but

#

you get the idea

#

just test it urself

balmy mist
#

lmaoo yeah

#

have you tried fine tuning a model on studio?

torn mantle
#

20 reports/day

#

quite generous

#

xd

drifting thorn
#

ok Cline is not that bad

#

I finally got through those 2 chapters of my character's throwback

#

with Cline and a prompt that I've edited multiple times

balmy mist
teal mantle
#

Can you still use credits and still qualify for higher request limits?

balmy mist
keen beacon
teal mantle
torn mantle
#

like 2 per day or smth

#

or 1

#

didnt age well

oblique flint
#

meanwhile microsoft still cant produce a frontier model themselves, they're entirely dependent on openai to do so

alpine coral
#

oh i noticed the colour but didn't realise what it meant til now - nice / congrats! 👍
(will be good to have an active mod ha)

alpine coral
# torn mantle 20 reports/day

have you tested it yet? If it's as good as they're saying it is, 20/day is epic value (vs chatgpt's Deep Research offering, which feels like it's limited to 10/month or something)

oblique flint
#

https://www.youtube.com/watch?v=_WvtdRtG1aY

wish we had something like this for gemini and other OSes than macos

See how OpenAI’s ChatGPT can now integrate directly with apps like IDEs to help engineers write, debug, and refactor code in real time. In this demo, we fix a checkout error and ship the fix directly to our IDE.—no copy/paste needed.

Ideal for developers and technical teams looking to enhance their daily tools with AI-powered code generatio...

alpine coral
keen beacon
#

gemini 2.5 flash today apparently

oblique flint
sage raptor
#

stargazer

oblique flint
#

how do you know its releasing today tho

north vale
hardy pecan
#

cool

alpine coral
#

sonnet is capped at 32k tokens. oai lets you adjust via low / med / high

keen beacon
#

i think u can do 64k

#

or with claude u can do even more by taking advantage that its a single model, and continuously prefilling it lol

alpine coral
#

lol indeed

#

yeah ik what you mean

keen beacon
#

its interesting to me anthropic didnt make it a special token

#

so the behavior can 'leak' easily

alpine coral
#

as far as i can tell, there is nothing 'special' going on at all with sonnet3.7 thinking

#

it just has a 'scratchpad' thing given to it and some system prompt

#

there doesn't seem a fundamental difference; it just does CoT reasoning (as part of its regular inference, even if it's rendered in box on the UI) and that informs its 'final' response

#

when 'thinking' is enabled / used as the model

keen beacon
barren prairie
keen beacon
hardy pecan
#

a new challenger:

hazy quest
#

We have been saying that every few months for over a year lol. Remember Opus 3.0

keen beacon
#

anthropic doesnt use a regular eot special token even though its present in the tokenizer since claude 1. (this is how this breakage occurs, conditional stop on human:)

they seem to avoid adding special tokens whenever possible to maintain pretraining knowledge throughout the window

(they replace antml:thinking with <thinking>, even though it's actually antml:thinking if the screenshot is confusing)

sage raptor
#

is it good ?

oblique flint
harsh flume
#

I wonder how much (if any) research is going on in things outside of mainstream transformer architecture at the major companies

cedar tide
keen beacon
#

so 2.5 flash today

#

but i find it hard to believe that's all we'll be getting

#

they've literally had like

#

5 or 6 anonymous models on the text arena

#

and a few on webdev

keen beacon
#

didn't think with my first test Q (which it got right)

keen beacon
#

thats interesting

oblique flint
#

what price do yall think 2.5 flash api is gonna be.

Current flash is 0.10$ per 1M input and 0.40$ per 1M output.

o3 mini is 1.10 per 1M input and 4.40 per 1M output

my guess 2.5 flash will be 0.2 per 1M input and 1.0 per 1M output. I think it's prolly still worse than o3 mini at math and coding but it's also a lot cheaper ofc
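
To make those per-1M-token prices concrete, here is a rough per-request cost sketch. The prices are the ones quoted in the message above; the 2.5 flash row is only the guess from that message, not a published price, and the 50k-in / 5k-out workload is a made-up example.

```python
# USD per 1M tokens as (input_price, output_price), copied from the message above.
# "2.5 flash (guess)" is a guess, not an announced price.
PRICES = {
    "current flash": (0.10, 0.40),
    "o3 mini": (1.10, 4.40),
    "2.5 flash (guess)": (0.20, 1.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the given token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 50k tokens in, 5k tokens out.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 50_000, 5_000):.4f}")
```

Thinking tokens are normally billed as output, so a reasoning model can cost more per request than the raw price table suggests.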

keen beacon
#

looks like those models may be public today, or at least one of them

#

hopefully nightwhisperer...

drifting thorn
#

Hope nightwhisper is 2.5 Ultra

keen beacon
#

i highly doubt they're working on an ultra model tbh but i would be happy to be proven wrong

drifting thorn
#

Like I won’t hope that nightwhisper is a detuned version of 2.5 Pro

#

Either increasing parameter or thinking tokens

barren prairie
alpine coral
gentle plinth
#

javascript honestly, just tell it to generate everything in one html file, just copy that and open it , no compiling, special software or anything needed and it can run on almost any pc

#

obviously c++ would be faster but it's tenfold more complicated and highly dependent upon the exact system and hardware configuration

#

i mean for example gemini 2.5 pro can generate a 3d airplane simulator without a problem in one-shot

#

all in one html file with html/js/css embedded

#

it probably uses the three.js lib I think, but dont need to ask it, it will use whatever is appropriate

brittle tiger
# keen beacon this is a base model

Sure it's not thinking? Would be first non-thinking model to get one of my test arc-agi problems correct and it went through a bunch of hypotheses and combined them to reach final answer

keen fulcrum
#

Nightwhisper
when , I want it now

drifting thorn
#

I need Deepseek R2

#

Hope DeepSeek R2 has a million token context window

keen beacon
cloud meadow
#

Elon is both naive and simultaneously a narcissist.

#

You could already do that I think.

#

Not sure if it's new but Gemini has started asking "Which answer is better?" on some prompts.

cloud meadow
drifting thorn
#

What if AIs have unlimited thought tokens?

#

Would AI be able to make decisions on how many tokens they used for thinking? If not, then the above idea is disastrous

cloud meadow
#

Gemini 2.5 is pretty and I'm excited for the future of coding AIs

hardy pecan
#

TPU dominance

oblique flint
keen beacon
calm sequoia
ivory schooner
#

My 24k... but I believe the successor to 24k will be Behemoth...

#

But I hear from the Chinese community that we may have to wait a long time...

#

(To be more precise, summer...)

#

Let's look forward to Behemoth

brittle tiger
torn mantle
#

no wonder

balmy mist
#

they changed ui back?

barren prairie
balmy mist
#

and studio

drifting thorn
#

I think Anthropic will be able to train a model way better than 3.7, R1, o3-mini or even Gemini 2.5 Pro when they can get an ‘honest’ large multimodal model

#

The answer of cot models will adhere to what they ‘think’ and we can train it much more efficiently with RL.

keen beacon
#

oh i love the new aistudio ui

lime coral
#

lol

#

Weird on mobile. Will need some adaptation

harsh flume
#

Can anyone see a feasible path for any company to surpass Google this month?

sturdy mica
#

no

#

also is the new gemini model coming out today

keen beacon
#

yes 2.5 flash

sturdy mica
#

awww

#

shucks

oblique flint
#

where are the 2.5 flash benchmarks tho

keen beacon
#

woah i just saw the new ai studio

#

it looks nice

drifting thorn
#

Do you guys think Gemini models are based on MoE architecture?

keen beacon
#

they said so in 1.5 pro's announcement i think.

torn mantle
#

kinda sad we wont have gemini coder

oblique flint
#

moe is the only way their api prices can be so low I think

torn mantle
#

its the only model i was looking forward to tbh

drifting thorn
#

I’m looking for Claude 4.0 after they found out that reasoning models aren’t honest

keen beacon
#

its not just moe thats making the difference

drifting thorn
#

Anthropic is definitely training a new model to be “honest” in showing its chain of thought

#

After their research showing 3.7 thinking hides its actual thoughts

brittle tiger
barren prairie
#

Or maybe luna or some dreams

drifting thorn
#

For every model, does the chain of thought align with the answer?

fleet lintel
#

close to zero

balmy mist
sturdy mica
balmy mist
#

i dont care about flash unless its blazing fast and dirt dirt cheap

sturdy mica
#

yeah

oblique flint
#

o3 mini killer would be great

barren prairie
#

But is it confirmed ? There is no NW?

barren prairie
#

I mean Gemini coder?

keen beacon
#

considering each request is generally more expensive because of thinking tokens and they want to be competitive

sturdy mica
#

Gemini 2.5 pro is a o3 mini killer

#

i think

#

its way better anyway

#

also when is o4mini and o3 coming out

balmy mist
#

this is y

#

makes sense now

#

openai who?

blazing rune
#

the input is about the same I think

#

honestly I would say it beats o3 mini considerably, at least for ML stuff

sage raptor
#

what tpu openai uses

barren prairie
torn mantle
#

its just so crazy

oblique flint
sage raptor
oblique flint
#

yea I saw it and I'll definitely try it out, but I doubt it holds up against o3 mini in actual practical coding. This benchmark I believe is more competitive coding

blazing rune
sage raptor
#

it's not fake

blazing rune
#

well, not fake

#

but it certainly won't be good in the real world

sage raptor
#

idk

blazing rune
#

competitive coding isn't very useful in the real world, at least for LLMs. it's good for humans because we can apply the concepts in many different ways, but LLMs can't generalize nearly as well

brittle tiger
lavish orchid
keen beacon
#

i'd love to meet sam, dario, demis... leave out elon or i might do something i regret

leaden palm
#

who watching cloud next

keen beacon
#

watched the opening keynote and will just keep an eye on the blog for everything else

leaden palm
#

lm arena mentioned

#

ok he just said 2.5 flash

#

thinking

#

has reasoning effort

#

"coming soon"...

#

lame

#

should have a giant lever to deploy it to prod

keen fulcrum
leaden palm
#

depends on how soon "soon" is

#

the work day is still young

keen beacon
#

where's nightwhisperer at 😔

keen fulcrum
leaden palm
#

theyre not gonna call it that (probably)

keen beacon
#

well yeah

leaden palm
#

so far its just been mentioning 2.5 pro and flash (both thinking, and flash has reasoning effort)

keen beacon
#

either it was a 2.5 pro variant or it was an update

leaden palm
#

ts not serious

keen beacon
#

they done got the mcdonalds ceo on stage 💔

leaden palm
torn mantle
#

how much time left

leaden palm
#

TODAY

#

NOW

#

HOP ON

torn mantle
#

link?

#

thanks

sturdy mica
leaden palm
#

3.7 can keep working without breaking things for longer but doesn't think as well

leaden palm
#

man i would be so hyped rn if i was a devops engineer

#

gemini weights leak when /s

#

~~ is this new~~ no, already made and tested in products, just now in vertex ai

keen beacon
#

depends on how you mean

#

new on GCP? yes, it was announced earlier today

#

new as a thing? no

#

it was already being tested publicly

#

on musicFX

#

it's a pretty meh model, we've been spoilt w/ things like suno and udio

leaden palm
#

they just said mindful and demure in the big 25

#

ehh what should i expect

balmy mist
#

is the event over?

leaden palm
#

one stream

#

many hours (probably)

barren prairie
balmy mist
#

thnx

lime coral
#

Would turn into a generational hater if no update on native audio/img

balmy mist
#

yall notice that gemini works a lot better in api than on studio? is that by design?

#

might be noob observation

#

just started using api for gemini lol

lime coral
lime coral
keen beacon
#

for anyone who wants a link - https://studio.firebase.google.com/

alpine coral
#

these events are always so cringe lol

keen beacon
#

third edit is the charm 💀

balmy mist
#

what model is powering this?

#

nw?

#

or the tools being used with 2.5

#

makes it like nw

lime coral
#

Obviously I don’t know

balmy mist
#

or nw was just 2.5 with tools all along?

#

ohh lmaoo

keen beacon
#

using it now

#

this thing is fast

#

it uses next.js

leaden palm
#

ah yes, "war of the checkboxes" needs ai

#

also buggy

keen beacon
#

yeah i've just got to that stage as well

#

it ran into an error, said it auto fixed it and it turns out it didn't

leaden palm
#

agentspace is kinda cool though, personalized deep research and tool use (sending mail, analyzing data, generating audio overviews) plus chrome could be useful if i was an employee

keen beacon
#

oh nevermind.. just needed to put in a key for it to let me click fix

balmy mist
#

wow tbh its too much stuff being released lmaoo

#

like i cant keep up

#

i gotta test out this firebase studio tho

oblique flint
#

bro release it already darn it

barren prairie
#

This fire studio is for what?? 😅

lime coral
#

Full stack app

keen beacon
spare mango
leaden palm
#

the voice agent demo is actually really cool

spare mango
#

Is this for real? Do most people not have access to 2.5 Pro (experimental) on the free version of Gemini?

#

According to Gemini 2.5 Pro itself, only a select few people have access to this experimental model, and I'm one of them?

#

Is this making stuff up or do any of you guys not have access to this?

leaden palm
spare mango
#

I asked it if there's any difference between Gemini Free and Gemini Advanced, since I have access to 2.5 Pro (experimental) on the free version anyway, and it went off on a tangent about how this model does not exist, then went on to say how I'm part of a select cohort that has access to it.

leaden palm
#

if you think that the hallucination problem will never be solved consider going to a prediction market

leaden palm
#

gemini advanced still has advantages (first few i can think of are higher usage limits and more deep research)

keen beacon
#

the biggest advantage for gemini advanced rn is 2.5 pro deep research

#

i am tempted to go for it because of that

#

otherwise i wouldn't care

spare mango
#

apparently the non-deep research model is still the best in the market.

keen beacon
#

?

leaden palm
keen beacon
#

yeah

spare mango
keen beacon
#

free deep research is on 2.0 flash thinking

keen beacon
leaden palm
#

i read this as a general rollout

#

but yeah no 2.5 pro is a good model for deep research purposes, at least when testing with my own harness

spare mango
# keen beacon it isn't

LMArena did not place 2.5 Pro Experimental as number 1 because of Deep Research is what I'm saying.

keen beacon
spare mango
keen beacon
#

i said i wouldn't use it if it wasn't for 2.5 pro deep research because you can get 2.5 pro with no discernible rate limit for free on ai studio

leaden palm
#

fun fact: if you trust google's deep research evals, their version would be 146 elo above openai's

keen beacon
#

i'd like to see its performance on HLE

#

at least that way we'd get a more direct comparison

leaden palm
keen beacon
balmy mist
keen beacon
#

nemotron ultra is now out on nvidia build

#

it beats R1 in most benchmarks

balmy mist
#

is firebase studio free?

#

if it is gg

keen beacon
#

in my testing it was... meh

balmy mist
#

the thing that is good about it is that it can handle large code

#

the other apps fail when i give it anything above 32k for some reason

#

but you are right it is kinda meh, but I would still say its better than gemini 2.5 by itself

#

im cooking rn, give me a sec

keen beacon
#

I don't even know if it's fully powered by 2.5 pro

#

some of it may be a flash model

balmy mist
#

bruhh

#

it better not be, but if it is that is impressive

#

really shows how tools can boost up a model

#

nevermind

#

it keeps failing

keen beacon
# keen beacon

Isn't this fp8. The other ones don't have native support. there's still a big leap though

#

slightly off topic but

#

the more i test chatgpt 4o latest

#

(the march version)

#

the higher opinion i have of its creative writing

#

it feels like R1 quality (great) but it doesn't fall apart after more than a few chapters like R1 does

#

What about quasar

#

quasar disappointed me for writing tbh

#

step back from chatgpt 4o latest

#

more robotic

#

for the most part, i agree with this

#

although c3.7s would def be in my top 5 minimum

balmy mist
#

wow so deepseek taking ws?

keen beacon
#

yeah

#

i'm very excited for R2

sturdy mica
#

what is quasar

keen beacon
#

we should be getting that quite soon

sturdy mica
#

model

keen beacon
sturdy mica
#

oh

keen beacon
#

its just a gpt 4o update

#

prob an api dated version

sturdy mica
#

oh 🥀

#

💔

keen beacon
#

they shouldve released chatgpt 4o the last one as an api dated version

keen beacon
#

chatgpt 4o and 4o on the api are too different for them to have released it as a 4o dated version

sturdy mica
#

case oh

keen beacon
#

4o is more professional/"serious", chatgpt 4o is optimised for chat

#

and ofc creative tasks

sturdy mica
#

ok

#

did u guys see new secret model in battle mode

keen beacon
#

although it did take them too long to release the current latest chatgpt 4o model as a dated version, for a while you could only access it via the -latest endpoint

sturdy mica
#

its called noob pro hacker obby tycoon

keen beacon
#

iirc

sturdy mica
#

it is o6 mini

keen beacon
#

only lmsys has access to older versions

keen beacon
#

yeah nevermind

balmy mist
#

wait the event over?

#

so where is flash?

keen beacon
#

announcement by logan later maybe?

barren prairie
woeful geyser
#

The entire venom system prompt, summarized by Claude:

keen beacon
sturdy mica
keen beacon
sturdy mica
#

also what is venom

#

is that a prompt

keen beacon
sturdy mica
#

ok

keen beacon
sturdy mica
#

oh ok…………

#

im gonna play roblox now

#

actually team fortress 2

wintry tinsel
keen beacon
#

yeah that was the reason i said mostly

#

sonnet 3.7 is king

leaden palm
#

"optimus alpha"

ocean vortex
keen beacon
#

in my experience

balmy mist
ocean vortex
#

oh. Ok that's less ridiculous lol

balmy mist
#

i cant take anymore torlling

#

trolling

wintry tinsel
#

It is across the board but in some areas with the right prompt 2.5 pro can be better

leaden palm
balmy mist
#

i just want my damn nightwhisper

#

man

brittle tiger
balmy mist
#

forget everything else

wintry tinsel
#

2.5 pro is the best at nsfw writing lmao

balmy mist
#

jk

#

its solid

#

just not nightwhisper

balmy mist
keen beacon
#

woah

#

new model

balmy mist
#

give me plz

ocean vortex
#

is it better than opus for that in your experience? @keen beacon

keen beacon
keen beacon
balmy mist
#

so not out yet

keen beacon
#

oh

#

hmph

#

should be up by the end of today then

#

must be another oai model

balmy mist
#

nooooo

#

it better be o3

upper wolf
#

2.5 pro still throws so many refusals. It won’t do anything related to web scraping (puppeteer/selenium etc.). on top of that, it can’t even say that it won’t, it says that it’s “not sufficiently trained” on it

balmy mist
keen beacon
#

i have tested o3 medium

#

it was pretty good

#

but there are some things it performs meh on

#

web development is still a weak point

#

but significantly better than o1/o3 mini

barren prairie
keen beacon
#

it's not out

#

i help labs out sometimes

wintry tinsel
#

Google event goes through 11th right

balmy mist
balmy mist
#

ill just stick to 2.5

wintry tinsel
#

There’s no reason to use it though since it’s much more pricy

keen beacon
#

i really like opus even now when price isn't a consideration

#

just has good vibes

sturdy mica
#

what was that super good coder on lmarena again

#

it was nightfall or something

keen beacon
#

nightwhisperer

#

or nightwhisper

#

or whatever it was

#

it was webdev arena only

sturdy mica
#

yeah

#

when is that gonna come out sigh

#

anyway what are you all talking about

wintry tinsel
#

The reason google models are so cheap is they are trying to roll them out en masse, if they went the other route and had less instances for higher compute and cost they might be able to have the model give much better outputs relatively, I could not know what I’m talking about tho

sturdy mica
#

why can i not

#

react to things

#

oh

#

there

#

i cant react to ktibow

keen beacon
#

hmm sorta weird 2.5 flash isnt released yet, maybe they might do an anon model on openrouter 🤣

barren prairie
sturdy mica
#

lucky you

#

im going to hit you with my car

#

that villager is YOU

#

this is mildly uninteresting and nobody is talking now

keen beacon
#

peak shitposter

torn mantle
#

thanks

wintry tinsel
#

They should add auto regressive image capabilities to 2.5 pro

sturdy mica
keen beacon
#

native image gen

keen beacon
#

they just haven't released it yet

#

actually no

#

i may be getting mixed up with 2.0 pro

keen beacon
#

im curious whether they worked on it at all with 2.5 pro

sturdy mica
#

ok nerd

brittle tiger
sturdy mica
#

whats the best model rn

#

its 2.5 pro right

#

ai studio version / api version

torn mantle
keen fulcrum
#

There is still imagefx which gives you exceptional results

balmy mist
#

as i play with firebase some more i see what it truly is

#

its pretty much a competitor for cursor and every other ide and even claude code tbh

#

so it's basically gemini code

#

just not in the cli

#

and this agentspace is nuts

#

wow

olive mesa
#

are there any models better than 2.5 yet

balmy mist
#

gg open ai

balmy mist
wintry locust
#

skibidi toilet rizz

olive mesa
balmy mist
#

i would like it

olive mesa
#

usually theres a new best every week to a month

balmy mist
#

but its not necessary for what we need

#

i would say maybe faster

#

and larger output and maybe window

keen beacon
#

2.5 flash soon 🤔

balmy mist
#

but thats pretty much it

keen beacon
#

is faster

balmy mist
#

IQ wise no

#

it would be nice tho

olive mesa
#

it would be nice to have a passive superintelligence

thorny drum
balmy mist
#

i mean but do we need that

olive mesa
#

not the allied mastercomputer type

#

yeah

balmy mist
#

agi is subjective

#

some people say its already here

#

some say years away

#

some say this year

#

depends on your definition at this point

#

remember wen it was to pass the turing test lol

olive mesa
#

yeah

balmy mist
#

but this agentspace is very interesting

#

this is gonna go wild once people get on it both it and fire studio

olive mesa
#

are most ais trained with curriculum learning?

#

i honestly feel like we would have asi by now if so

#

like just giving them near-impossible questions and every once in a while they produce an extremely good cot and response

#

then add that to a dataset

#

train

#

and repeat
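
The loop described above (ask hard questions, keep the rare good answers, train on them, repeat) can be sketched as a toy simulation. Nothing here trains a real model: `attempt`, the `skill` number, and the way kept samples nudge skill upward are all made-up stand-ins, only meant to show the shape of a rejection-sampling / self-training loop.

```python
import random


def attempt(skill: float, difficulty: float, rng: random.Random) -> bool:
    """Toy stand-in for 'the model produced a good CoT and answer'.

    Success probability rises with skill and falls with difficulty.
    """
    p_success = max(0.0, min(1.0, skill - difficulty + 0.5))
    return rng.random() < p_success


def self_improvement_loop(rounds: int = 5, questions_per_round: int = 200,
                          seed: int = 0) -> tuple[float, int]:
    """Run the generate -> filter -> 'train' -> repeat loop on toy numbers."""
    rng = random.Random(seed)
    skill = 0.2        # hypothetical starting ability
    dataset_size = 0   # how many good samples have been kept so far
    for _ in range(rounds):
        # Ask questions that are harder than the current skill level,
        # and keep only the attempts that succeeded.
        kept = sum(
            attempt(skill, skill + 0.3, rng)
            for _ in range(questions_per_round)
        )
        dataset_size += kept
        # "Train": pretend every kept sample improves skill a little.
        skill = min(1.0, skill + 0.001 * kept)
    return skill, dataset_size


if __name__ == "__main__":
    final_skill, samples = self_improvement_loop()
    print(f"final skill: {final_skill:.3f}, samples kept: {samples}")
```

In a real setup the hard parts are the ones this sketch skips: judging which answers are actually good, and making sure training on them generalizes.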

keen beacon
#

that is sorta being done rn

barren prairie
balmy mist
#

yo firebase is free

#

wtf

#

why the hell am i using roo anymore lol

balmy mist
#

how is it free?

#

like am i missing something?

#

gg cline, roocode, augment, cursor, windsurf RIP

#

bolt

#

claude code lol, but i still like that its a CLI so claude still got a lil hope

#

no its using 2.5

#

why the hell would i use my own api lmaooo

#

actually nevermind i know what to do

#

gonna use my own free exp model, but it's pretty much like roocode in terms of how you pay

#

oh i see what you mean

#

so by default the built in is 2.0?

#

gonna try it with 2.5

hidden widget
oblique flint
#

2.5 flash still not released?

keen beacon
#

a little strange right?

balmy mist
#

nahh firebase is expensive

#

going back to free version lol

keen beacon
oblique flint
#

disappointing performance and they want to keep cooking or smth else?

keen beacon
oblique flint
#

Im just confused cause the leaked model string in the python sdk said april 9th right?

balmy mist
#

nice

#

you gonna use that in firebase?

#

also whats a good prompt for refactoring an app i have to look better visually? should I say apple design expert and stuff?

keen fulcrum
#

Did they announce nightwhisper?

storm bolt
brittle tiger
wet ingot
#

Anyone know what the model “dragontail” is? I got it on lmarena and it was good but I can’t find anything about it

balmy mist
#

how good is it?

torn mantle
#

why would anyone prefer grok 3 over gemini 2.5 pro?

vivid oyster
#

Cuz elon musk is hot

torn mantle
torn mantle
wet ingot
torn mantle
#

xd

vivid oyster
torn mantle
vivid oyster
#

Yes

vivid oyster
#

Did u ask it

wet ingot
#

No

#

Let me see if I can get it again

vivid oyster
#

Im trying rn

#

Its google @wet ingot

wet ingot
#

I got it

vivid oyster
wet ingot
#

Yeah lmao I just got the exact same thing

vivid oyster
#

Only google says 'i am a large language model trained by google'

#

Maybe its nightwhisper

#

Was it better than 2.5 pro

wet ingot
#

Hard to say

keen ferry
vivid oyster
#

Whos

vivid oyster
#

Model

torn mantle
#

Tf

brittle tiger
#

Shadebrook passes vibe test. dragon tail does not for me

torn mantle
#

2 new models?

vivid oyster
#

Yeah

brittle tiger
#

Shadebrook is Google

vivid oyster
#

Dragontail is google too

keen ferry
wet ingot
#

Interesting Google is adding so many models

#

Without saying anything

vivid oyster
#

Maybe they're experimenting

#

For 2.5 flash

wet ingot
#

Yeah

vivid oyster
#

One of them might be nightwhisper

torn mantle
vivid oyster
#

Im trying to get it

#

So I can ask it

#

Questions

#

But it gave the same response all google models give

torn mantle
#

you are trying to debunk it

#

silly you

wet ingot
#

We don’t know what it is but it’s good with logic and images

vivid oyster
#

Yes

torn mantle
#

shadebrook is not good

#

probably like flash lite

upper wolf
#

Is it thinking

torn mantle
#

so fast too

#

no

keen beacon
keen beacon
#

let me test this stuff

wet ingot
#

Dragontail was good for me

torn mantle
#

its probably flash without thinking

#

its blazing fast

brittle tiger
# keen beacon ooh

Actually idk. It just got an arc-agi problem wrong that it got right this morning. It's very fast tho

keen beacon
#

hmm

torn mantle
#

please let dragontail be good

#

pleaaaaaase

keen ferry
#

can someone give me basic questions

#

i got him

#

dragontail

torn mantle
torn mantle
vivid oyster
keen ferry
#

magic

vivid oyster
#

That u got dragontail

torn mantle
#

shhhhhhh

upper wolf
#

You’re playing Roulette at a casino with a broken wheel that makes it 0.36% more likely to land on Green. What is the new expected value of a $100 bet on the color red?
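
One way to work that question out, as a sketch under stated assumptions: the question does not say which wheel it is or what "0.36% more likely" means, so this assumes an American wheel (38 pockets, 2 green, 18 red), reads the 0.36% as 0.36 percentage points added to the chance of green, and spreads the rest evenly over the red and black pockets. The function name `red_bet_ev` is made up for the example.

```python
def red_bet_ev(stake: float = 100.0, green_boost: float = 0.0036,
               pockets: int = 38, green: int = 2, red: int = 18) -> float:
    """Expected profit of an even-money bet on red.

    Assumes green_boost is added (in absolute terms) to P(green) and the
    remaining probability is shared equally by all non-green pockets.
    """
    p_green = green / pockets + green_boost
    p_red = (1 - p_green) * red / (pockets - green)
    # Win +stake with probability p_red, lose the stake otherwise.
    return stake * p_red - stake * (1 - p_red)


if __name__ == "__main__":
    print(f"fair wheel:   {red_bet_ev(green_boost=0.0):.2f}")
    print(f"broken wheel: {red_bet_ev():.2f}")
```

Under those assumptions the expected value goes from about -$5.26 on a fair wheel to about -$5.62 on the broken one. A different reading (a European wheel, or a relative 0.36% increase) gives a different number.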

keen ferry
#

its soo good

#

There are three 'r's in the word "strawberry".

torn mantle
#

ask him this

wet ingot
#

I mean a lot of newer models know that

torn mantle
keen ferry
#

k

torn mantle
#

and give me the html file

keen ferry
#

sec

torn mantle
#

and i will tell if its good or not

vivid oyster
#

This bot is a dumbass

keen beacon
#

oh dear

olive mesa
#

people probably memed it then it got into a dataset

torn mantle
#

dreamtides is meh too

torn mantle
vivid oyster
#

I think I got it

#

It told me there's three

keen beacon
#

gottem

zinc ore
keen beacon
#

it's taking a while to start streaming

torn mantle
#

you got his ash

keen beacon
#

okay it just started

vivid oyster
#

I made him give me a discord html

torn mantle
keen beacon
#

im also giving dragontail a web task

vivid oyster
#

This is what it gave me

torn mantle
torn mantle
#

are you sure its dragontail?

upper wolf
#

Guys

vivid oyster
#

Dragontail

torn mantle
upper wolf
#

is dragontail thinking

vivid oyster
keen beacon
vivid oyster
#

It's flash thinking

keen ferry
#

mine is still generating

vivid oyster
#

Probably

keen beacon
#

this seems to be flash

upper wolf
#

Thanks

torn mantle
#

its meh

keen beacon
#

yes

vivid oyster
#

It takes a while to start

#

Sending

keen ferry
torn mantle
vivid oyster
torn mantle
#

wait a minute

#

this is so good

keen ferry
#

yeah

torn mantle
#

which model is that

keen ferry
#

dragontail

torn mantle
#

just from that output

keen beacon
#

yeah it's at least on par with 2.5 pro thinking in my very limited testing thus far

#

which i wouldn't expect from a small model

torn mantle
#

i got it

#

i will try it

#

doesnt seem like a thinking model

#

so fast too

keen ferry
#

it was slow for me

zinc ore
#

Google's models from now on are hybrid models, so if it's something like flash it'll be both

keen beacon
#

it seems dynamic

#

i have had a request with practically 0 time and another with like 15s

#

seems pretty good

zinc ore
#

Faster than 2.5 pro?

keen beacon
#

yes

torn mantle
#

its kinda the same as gemini 2.5 pro

#

a bit worse no?

keen beacon
#

no

#

they're very similar, i can't really discern much of a performance difference as of yet

#

which is sorta surprising given this seems to spend half as much time on it and yet matches pro

zinc ore
#

Could also be updated pro if not flash

#

But my bet is on flash

vivid oyster
#

It's faster

#

Its probly just flash thinking

zinc ore
#

Flash 2.5

#

They don't call it thinking anymore

keen beacon
#

in my (again somewhat limited) testing it doesn't seem worse than pro

drifting thorn
#

All 2.5 models are thinking models now

keen beacon
#

and if this is flash

#

wtf are the other like 3 anon google models on the arena rn

#

it doesn't make sense to add flash as an anonymous model today when they're releasing it like tomorrow 😭

zinc ore
#

Yeah depends if we getting flash this week, or more like next week or something

drifting thorn
#

I guess they are Gemma 4?

torn mantle
#

you are giving us all these outputs

drifting thorn
#

Hhhhhhj

vivid oyster
#

.;

brittle tiger
# keen beacon it seems dynamic

This seems plausible bc it nailed the arc-agi problem this morning after running like 4 hypotheses to solve and combining relevant ones and then this afternoon zipped out a totally wrong answer on same problem

keen beacon
#

ffs 🤦‍♂️

#

the full switch to the new arena couldn't come soon enough

harsh flume
#

Did google remove the option to data train from AI studio or am I just not finding it?

#

I was gonna test it out today 😭

#

We should get new LB update today btw

keen fulcrum
#

Microsoft has copilot and vscode (roocode is better), Google has Firebase Studio and then there are third party ones

#

I wonder if Apple will do something about it

#

Xcode with AI?

#

Firebase Studio and Nightwhisper can be the cheapest option honestly.

alpine coral
#

i just got it. v strong indeed. would've said almost certainly thinking model based on the quality and time(/delay) of the response (+ it was against command-a, which isn't thinking afaik)

raven void
#

grok 3 mini high is looking pretty good

balmy mist
raven void
#

yes, API

balmy mist
#

oh snapp

#

finally

keen beacon
#

pretty bad pricing lmao

balmy mist
#

lmaoo yeah its like sonnet right?

raven void
#

mini looks like a pareto frontier model though

brittle tiger
ivory schooner
#

May the upcoming Behemoth be 24k...

#

May the upcoming Behemoth be 24k...

sturdy mica
#

what

balmy mist
vivid oyster
#

He's saying he wishes it's 24k

#

24k karat gold

#

When it comes out

balmy mist
#

oh lol okay

upper wolf
#

It might’ve just been maverick with a modified system prompt it didnt seem that accurate

vivid oyster
#

Behemoth is prob anonymous-test

#

It's shet

hardy pecan
hardy pecan
#

Tested Grok 3 Beta in OpenRouter for the 20 public SimpleBench questions, it got 6/20

oblique flint
#

still no 2.5 flash

#

🥲

hardy pecan
torn mantle
woeful geyser
#

Got Maverick full release! 😌

#

Quite a surprise, but that's a good thing: I can't pinpoint the exact model because its "vibe" doesn't scream out loud.

Hoping for the actual result.

torn mantle
drifting thorn
#

WHY ALL APIs have a context limit of only 1 million

#

I was just trying to set up an MCP server with Cline

#

it failed

#

it says it's out of context window when it's like 80% done

#

arghhhhhhhhhhh

cedar tide
cedar tide
drifting thorn
#

I need real 10 million context window model

#

It's been 3 months since the publication of the "Titans" architecture by Google

#

I hope Google can make a large multimodal reasoning model out of that architecture

cedar tide
drifting thorn
#

Since Grok 3 has no ethics regulations

tall summit
#

hihi

fleet lintel
#

is dragontail as good as nightwhisper?

torn mantle
torn mantle
#

no

#

nightwhisper is finetuned intensively on coding

#

they probably made a good reward model for web dev

#

styling wise

cedar tide
#

Dragontail

noble zinc
#

could dragontail be o4-mini or o3?

cedar tide
noble zinc
#

hopefully pricing remains similar to 2.0 flash

torn mantle
#

its def not nightwhisper

#

i got much better results for a discord clone from nw

#

its on par with gemini 2.5 pro

#

idk if its better or not

cedar tide
#

screenshot on laptop
dragontail

#

2.5 pro

#

i think its gemini 2.5 pro, low thinking

#

the results are very similar to 2.5 and it thinks but for less time

oblique flint
#

o4 mini 👀 I have no hope for o3 being affordable though lol

hazy quest
#

Do you guys have access to Veo2 in AI Studio? It seems to be rolling out, some have it, some don't. I don't 😦

hardy pecan
#

Yeh I've tried it out

keen beacon
oblique flint
#

hmm will o4 mini launch before 2.5 flash, seems like it lol

keen beacon
#

i wonder if o4 mini is based on an updated 4o mini base

#

or if its just more roids 🤣

#

seems like google sort of forced their hand

#

how could it be o4 mini

#

its literally not thinking

#

it streams immediately and there is no apparent thinking

#

yeah

sage raptor
#

maybe its o4 mini low

keen beacon
#

no

#

and i benchmarked it it must be an insane regression from o3 mini 🤣

#

there is 0 thinking

#

nada

#

zero

#

people are crazy 🤣

#

i'm more interested in this

#

they're adding it to openrouter "tomorrow morning" (EST)

#

imagine if its 2.5 flash 🤣

#

same naming scheme as the anon openai model so i don't think so lmao

keen beacon
#

shrug i find that a little unlikely

brittle tiger
#

Why does open router say o4-mini and have same stats basically?

keen beacon
#

openrouter doesn't say o4 mini

brittle tiger
#

Lmaoo I need to stop trying to use my brain 2 min after waking up

torn mantle
#

im not hyped for any o-series model

keen beacon
#

suit yourself

#

i'm still very interested

#

competition = always good

alpine coral
keen beacon
alpine coral
#

yeah no thinking there.. but it can't be o4-mini can it?

#

tf is the point of their naming schema if that is the case ha

keen beacon
alpine coral
#

yeah the non-thinking alone

keen beacon
alpine coral
#

but also performance-wise, solid as it is - it's not rubbing up against the frontier in any way

keen beacon
#

also: o3 mini gpqa diamond: 74.8%, math 500: 97.3%

#

@keen beacon you should test optimus alpha when it releases

keen beacon
alpine coral
#

some scores from a quiz (~20 questions in one prompt; same as shared above here somewhere)
just fwiw

#

wait.. that wasn't sorted right

tall summit
keen beacon
#

openrouter

visual turret
#

why is 3.7 sonnet in 23rd place? gemini flash lite is in 16th

keen beacon
#

use style control..

visual turret
visual turret
keen beacon
#

that's just what something based on human preference will produce 🤷‍♂️

oblique flint
#

man the difference in capability between 2.5 pro in ai studio vs cursor is so huge. I really wish it wasnt so bad in cursor lol

calm sequoia
fleet lintel
# alpine coral

That's disappointing. I am hoping for better models than 2.5 Gemini

keen fulcrum
#

It deleted my 800k context conversation which is frustrating though

oblique flint
#

I have been avoiding roocode because I hear it can be expensive af

golden ocean
#

I created new chatgpt alt account on new chrome profile and I can use gpt 4o forever it seems unlike my og chatgpt account. Can also generate way more images

#

Is that a feature for new accounts or something

#

Will prob return to normal after a while

balmy mist
#

wait who owns dragontail?

balmy mist
#

lol

#

i wonder what progress he has made

vague orbit
#

Any reports on how Cognito performs?

sonic tendon
#

which is what lmarena "power users" tend to do en masse, i think

vague orbit
#

What kind of persona is a lmarena power user?

sonic tendon
sonic tendon
#

and/or do more than ~5 chats on lmarena a day on average

#

(that number is totally arbitrary, i just made it up)

vague orbit
#

I would guess that if you have a problem and wanted to solve it with an LLM, throwing it at LMArena would lead to really mixed results

sonic tendon
#

but yeah, you're right

#

i don't use it as a general-purpose chatbot, it's more for my own curiosity

tall summit
sonic tendon
sonic tendon
keen beacon
#

anonymous model releasing on arena today

sonic tendon
#

ah

#

openrouter too?

keen beacon
#

sorry i meant openrouter

#

lmao

sonic tendon
#

all good

tall summit
calm sequoia
# calm sequoia
Poll: Best Deep Research. Winning answer: Gemini 🫡 (8 of 15 votes)

sonic tendon
#

does openrouter have a discord?

keen beacon
#

yup

#

look at their footer

sonic tendon
#

joined

#

oh, grok 3's finally open access now

tall summit
sonic tendon
#

lmaoo

#

yeah 2025 ai scene in a nutshell

sonic tendon
#

especially with coding, in my experience

#

off the top of my head, i don't know of any models that support much more than 1M context, anyway

keen beacon
#

the new llama models are supposed to but they suck balls

#

and some of the geminis iirc go to 2M

sonic tendon
#

i forget what that fiction context window benchmark is called

ocean vortex
keen beacon
sonic tendon
keen beacon
#

😭

ocean vortex
sonic tendon
#

0 weights

keen beacon
sonic tendon
#

/nonsexual

keen beacon
#

behemoth better be good

sonic tendon
#

it's gonna suck

#

probably

keen beacon
#

it'll probably underperform

#

we'll see

#

hopefully we get o3 today

#
+ o4 mini
sonic tendon
#

oh, on that note

#

man, i need to stop doubting myself

keen beacon
#

wats that for

#

i kinda doubt it will take top position

#

idk

sonic tendon
keen beacon
#

maybe joint 1st w/ stylectrl

north vale
#

How o4-mini performs will matter a lot for how well reasoning scales to superhuman levels

#

It really depends whether they posttrained o3 on their new 4o stuff

#

imo

keen beacon
#

o3 low will be very different from o3 high

#

its conceivable the new anonymous chatbot could top the leaderboard

keen beacon
#

there are many

sonic tendon
keen beacon
#

quasar isn't on the arena

#

and even if it was

keen beacon
north vale
#

It’s ass compared to 2.5

keen beacon
#

i don't think it would take the top spot tbh

#

it's not as nice to talk to as chatgpt 4o latest

#

also seems less creative

oblique flint
keen beacon
#

it's on openrouter

sonic tendon
keen beacon
sonic tendon
keen beacon
#

anonymous chatbot isn't on the arena

#

not the same name, but its the same model

brittle tiger
#

making up 30 elo points is harder than it seems too

sonic tendon
#

so, both models get paused until they both finish thinking, basically

keen beacon
#

there's anonymous-test