#general
maybe for kids
i’m leaving this chat
it sucks we cant have multiple tools at once
no. no friend, no
lol
it does :(
i wish we could, you can with api
yeah does not make sense
and with the live streaming thing but that only works with 2.0 flash
have you played around with gemini 2.5 and tools?
yes
how was it and what did you use?
i use gemini 2.5 with grounding, function calling, code execution
they all work how you think they would
why?
i guess this is how you use multiple tools
i just was curious because i never played with it
what does code execution od?
do*
just executed what code you give it?
does it pop open view for it?
i need to dive deeper
you can ask it to generate and run code like
“generate and run code that gives a random number “
and it will write python code and show output
ill screenshot rn
ok i cant but
you get the idea
just test it urself
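For the random-number prompt above, the Python the model writes and runs usually looks something like this. This is a guess at typical output, not a capture of a real session, and the 1 to 100 range is an arbitrary choice:

```python
import random

# Pick a random integer in a fixed range, the way a model might for this prompt.
# The default range 1..100 is an arbitrary choice for illustration.
def random_number(low: int = 1, high: int = 100) -> int:
    return random.randint(low, high)

number = random_number()
print(f"Random number: {number}")
```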
ok Cline is not that bad
I finally overcame that 2 chapters of my character's throwback
with Cline and a prompt that I've edited multiple times
you only get like 20 per day with openai right?
Can you still use credits and still qualify for higher request limits?
did you hit the 1000 limit?
I don't believe so but u should really ask it in OR discord server lol
Far from
much less
like 2 per day or smth
or 1
xddd
didnt age well
meanwhile microsoft still cant produce a frontier model themselves, they're entirely dependent on openai to do so
oh i noticed the colour but didn't realise what it meant til now - nice / congrats! 👍
(will be good to have an active mod ha)
have you tested it yet? If it's as good as they're saying it is, 20/day is epic value (vs chatgpt's Deep Research offering, which is limited at like 10/month or something it feels like)
https://www.youtube.com/watch?v=_WvtdRtG1aY
wish we had something like this for gemini and other OSes than macos
See how OpenAI’s ChatGPT can now integrate directly with apps like IDEs to help engineers write, debug, and refactor code in real time. In this demo, we fix a checkout error and ship the fix directly to our IDE, no copy/paste needed.
Ideal for developers and technical teams looking to enhance their daily tools with AI-powered code generatio...
yeah even less (for Plus anyway).. it's 10/month
gemini 2.5 flash today apparently
sauce?
stargazer
how do you know its releasing today tho
cool
oh interesting so thinking_budget (in this case 10,000) is presumably their language for the parameter that defines token allowance for 'thinking'
sonnet is capped at 32k tokens. oai lets you adjust via low / med / high
i think u can do 64k
or with claude u can do even more by taking advantage that its a single model, and continuously prefilling it lol
not the exact antml:thinking tag but the behavior is close enough
its interesting to me anthropic didnt make it a special token
so the behavior can 'leak' easily
as far as i can tell, there is nothing 'special' going on at all with sonnet3.7 thinking
it just has a 'scratchpad' thing given to it and some system prompt
there doesn't seem a fundamental difference; it just does CoT reasoning (as part of its regular inference, even if it's rendered in box on the UI) and that informs its 'final' response
when 'thinking' is enabled / used as the model
it doesnt have comprehensive thinking instructions, just an instruction to think in antml:thinking and max thinking length where they probably tuned in several values with differing 'thinking' lengths. it was trained in to think when the response starts with antml:thinking, otherwise its normal
I want Gemini 2.5 flash thinking not Gemini 2.5 flash 😭
antml:thinking isnt a special token but its sanitized hard so u cant really use it (without degradation) but since it isn't a special token, <thinking> will basically be seen as the same because of pretraining associations instead of adding a new special token to the vocab
a new challenger:
We have been saying that every few months for over a year lol. Remember Opus 3.0
anthropic doesnt use a regular eot special token even though its present in the tokenizer since claude 1. (this is how this breakage occurs, conditional stop on human:)
they seem to avoid adding special tokens whenever possible to maintain pretraining knowledge throughout the window
(they replace antml:thinking with <thinking>, even though it's actually antml:thinking if the screenshot is confusing)
is it good ?
please be better than o3 mini at coding
I wonder how much (if any) research is going on in things outside of mainstream transformer architecture on the major companies
Its the thinking version today, and we have thinking budget too
so 2.5 flash today
but i find it hard to believe that's all we'll be getting
they've literally had like
5 or 6 anonymous models on the text arena
and a few on webdev
this is a base model
didn't think with my first test Q (which it got right)
they removed all the anon google models on web dev i think
thats interesting
what price do yall think 2.5 flash api is gonna be.
Current flash is 0.10$ per 1M input and 0.40$ per 1M output.
o3 mini is 1.10 per 1M input and 4.40 per 1M output
my guess 2.5 flash will be 0.2 per 1M input and 1.0 per 1M output. I think it's prolly still worse than o3 mini at math and coding but it's also a lot cheaper ofc
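To put those per-1M prices in per-request terms, here is a rough sketch. The prices are the ones quoted above, the "2.5 flash (guess)" row is only the guess from the chat, and the token counts are made up:

```python
# USD per 1M tokens (input, output), as quoted in the messages above.
# The "2.5 flash (guess)" row is the guess from the chat, not a real price.
PRICES = {
    "2.0 flash": (0.10, 0.40),
    "o3 mini": (1.10, 4.40),
    "2.5 flash (guess)": (0.20, 1.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at per-1M-token pricing."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical request size: 10k tokens in, 2k tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 2_000):.4f}")
```

With these made-up token counts the guessed 2.5 flash price lands at roughly double 2.0 flash and about a fifth of o3 mini.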
yeah i just checked
looks like those models may be public today, or at least one of them
hopefully nightwhisperer...
Hope nightwhisper is 2.5 Ultra
i highly doubt they're working on an ultra model tbh but i would be happy to be proven wrong
Like I won’t hope that nightwhisper is a detuned version of 2.5 Pro
Either increasing parameter or thinking tokens
That's great, with Meta we were getting 1526366373 anonymous models just to end up with a stupid one
i think we could see a thinking_budget API param (and like a slider in the aistudio UI)
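A sketch of what a request with such a parameter might look like. Only the name `thinking_budget` comes from the chat; the payload shape, the other field names and the `build_request` helper are assumptions made up for illustration, not any real API:

```python
import json

def build_request(prompt: str, thinking_budget: int) -> dict:
    """Build a made-up request body; only the thinking_budget name comes from the chat."""
    if thinking_budget < 0:
        raise ValueError("thinking_budget must be non-negative")
    return {
        "prompt": prompt,  # hypothetical field name
        "config": {"thinking_budget": thinking_budget},  # hypothetical nesting
    }

# 10,000 is the budget value mentioned earlier in the chat.
payload = build_request("Explain MoE routing in two sentences.", 10_000)
print(json.dumps(payload, indent=2))
```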
thats pretty interesting
javascript honestly, just tell it to generate everything in one html file, just copy that and open it , no compiling, special software or anything needed and it can run on almost any pc
obviously c++ would be faster but its tenfold more complicated and highly dependent upon the exact system and hardware configuration
i mean for example gemini 2.5 pro can generate a 3d airplane simulator without a problem in one-shot
all in one html file with html/js/css embedded
it probably uses the three.js lib I think, but dont need to ask it, it will use whatever is appropriate
Sure it's not thinking? Would be first non-thinking model to get one of my test arc-agi problems correct and it went through a bunch of hypotheses and combined them to reach final answer
Nightwhisper
when , I want it now
well it started streaming immediately when i tested it
Elon is both naive and simultaneously a narcissist.
You could already do that I think.
Not sure if it's new but Gemini has started asking "Which answer is better?" on some prompts.
The only time an AI will get me to actually le soyjak (mouth wide open) would be if I put in an entire 1 million token project and it efficiently recodes it to another language and another stack. Maybe 3.5 years away?
What if AIs have unlimited thought tokens?
Would AI be able to make decisions on how many tokens they used for thinking? If not, then the above idea is disastrous
Gemini 2.5 is pretty good and I'm excited for the future of coding AIs
TPU dominance
https://blog.google/products/google-cloud/next-2025/
Gemini 2.5 Flash, our workhorse model with low latency and cost efficiency, will soon be available in Vertex AI.
wen ai studio lol
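My 24k...... but I believe the successor to 24k will be Behemoth......
But I hear from the Chinese community that we'll probably have to wait a long time......
(to be more precise, summer.......)
Let's look forward to Behemoth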
not yet
they changed ui back?
Who?
I think Anthropic will be able to train a model way better than 3.7, R1, o3-mini or even Gemini 2.5 Pro when they can get a ‘honest’ large multimodal model
The answer of cot models will adhere to what they ‘think’ and we can train it much more efficiently with RL.
oh i love the new aistudio ui
Anyone can see a feasible path any company surpasses Google this month?
yes 2.5 flash
where are the 2.5 flash benchmarks tho
when its out in aistudio
woah i just saw the new ai studio
it looks nice
Do you guys think Gemini models are based on MoE architecture?
yes
they said so in 1.5 pro's announcement i think.
kinda sad we wont have gemini coder
moe is the only way their api prices can be so low I think
its the only model i was looking forward to tbh
I’m looking for Claude 4.0 after they found out that reasoning models aren’t honest
moe is faster/cheaper but i think in high batching/etc the calculus is more complicated and the gains are reduced. i dont know much tho lol
its not just moe thats making the difference
Anthropic is definitely training a new model to be “honest” in showing its chain of thought
After their research showing 3.7 thinking hides its actual thoughts
Gemini coder was just a dream, a whisper in the night
We will have some stars today
Or maybe luna or some dreams
For every model, does the chain of thought align with the answer
close to zero
fr im gonna cry tonight
they should release coder instead of flash today…
i dont care about flash unless its blazing fast and dirt dirt cheap
yeah
o3 mini killer would be great
I don't care about flash unless it is flash thinking 🙂 🩷🩵 I love the flash thinking bot
But is it confirmed ? There is no NW?
it is thinking dw
I mean Gemini coder?
2.0 flash was cheaper than 4o mini, i dont expect it to be much more expensive per token
considering each request is generally more expensive because of thinking tokens and they want to be competitive
??
Gemini 2.5 pro is a o3 mini killer
i think
its way better anyway
also when is o4mini and o3 coming out
this is y
makes sense now
openai who?
well, yeah, but it's also likely a lot bigger, and a good bit more expensive for output
the input is about the same I think
honestly I would say it beats o3 mini considerably, at least for ML stuff
what tpu openai uses
You mean at coding bcz o3 mini is trash on other tasks
no wonder they are serving all of these models
its just so crazy
its way more expensive
yea I saw it and I'll definitely try it out, but I doubt it holds up against o3 mini in actual practical coding. This benchmark I believe is more competitive coding
that's almost definitely fake
it's not fake
idk
because of this
competitive coding isn't very useful in the real world, at least for LLMs. it's good for humans because we can apply the concepts in many different ways, but LLMs can't generalize nearly as well
hey everyone, very happy to share that I got accepted into YC's AI Startup School!! will hopefully see Sam Altman, Elon Musk and others! 🙂
i'd love to meet sam, dario, demis... leave out elon or i might do something i regret
lmaoo based
who watching cloud next
watched the opening keynote and will just keep an eye on the blog for everything else
lm arena mentioned
ok he just said 2.5 flash
thinking
has reasoning effort
"coming soon"...
lame
should have a giant lever to deploy it to prod
Shame the alpha site isn't shown
where's nightwhisperer at 😔
Did they share nightwhisper yet?
theyre not gonna call it that (probably)
well yeah
so far its just been mentioning 2.5 pro and flash (both thinking, and flash has reasoning effort)
either it was a 2.5 pro variant or it was an update
ts not serious
they done got the mcdonalds ceo on stage 💔
oh and dont forget the new tpu announcement
when is the google event
how much time left
TODAY
NOW
HOP ON
Organizations around the world are driving change with innovative solutions, boosting efficiency, empowering employees, engaging customers, and fueling growt...
minecraft release in 2011
3.7 can keep working without breaking things for longer but doesn't think as well
man i would be so hyped rn if i was a devops engineer
gemini weights leak when /s
~~ is this new~~ no, already made and tested in products, just now in vertex ai
depends on how you mean
new on GCP? yes, it was announced earlier today
new as a thing? no
it was already being tested publicly
on musicFX
it's a pretty meh model, we've been spoilt w/ things like suno and udio
is the event over?
Still speaking
thnx
Would turn into a generational hater if no update on native audio/img
yall notice that gemini works a lot better in api than on studio? is that by design?
might be noob observation
just started using api for gemini lol
If you want to try nightwhisper this might be the way to go https://x.com/testingcatalog/status/1910010822937698425?s=46
Logan once said ai studio is purposely built to reflect the api behavior. It runs on the api
smh
for anyone who wants a link - https://studio.firebase.google.com/
these events are always so cringe lol
third edit is the charm 💀
wait what?
what model is powering this?
nw?
or the tools being used with 2.5
makes it like nw
Obviously I don’t know
yeah i've just got to that stage as well
it ran into an error, said it auto fixed it and it turns out it didn't
agentspace is kinda cool though, personalized deep research and tool use (sending mail, analyzing data, generating audio overviews) plus chrome could be useful if i was an employee
oh nevermind.. just needed to put in a key for it to let me click fix
wow tbh its too much stuff being released lmaoo
like i cant keep up
i gotta test out this firebase studio tho
bro release it already darn it
This fire studio is for what?? 😅
Full stack app
they're waiting to release 2.5 flash and gemini coder in one go trust me bro trust me 💔
When is the "soon" 🥲
the voice agent demo is actually really cool
Is this for real? Do most people not have access to 2.5 Pro (experimental) on the free version of Gemini?
According to Gemini 2.5 Pro itself, only a select few people have access to this experimental model, and I'm one of them?
Is this making stuff up or do any of you guys not have access to this?
llms are llms and llms hallucinate
so the "best" llm in the market is hallucinating about this simple fact, interesting...
I asked it if there's any difference between Gemini Free and Gemini Advanced, since I have access to 2.5 Pro (experimental) on the free version anyway, and it went off on a tangent about how this model does not exist, then went on to say how I'm part of a select cohort that has access to it.
if you think that the hallucination problem will never be solved consider going to a prediction market
Sorry I don't know what you mean?
gemini advanced still has advantages (first few i can think of are higher usage limits and more deep research)
the biggest advantage for gemini advanced rn is 2.5 pro deep research
i am tempted to go for it because of that
otherwise i wouldn't care
deep research is another model
apparently the non-deep research model is still the best in the market.
?
is that only for advanced?
yeah
what
free deep research is on 2.0 flash thinking
it isn't
i read this as a general rollout
but yeah no 2.5 pro is a good model for deep research purposes, at least when testing with my own harness
LMArena did not place 2.5 Pro Experimental as number 1 because of Deep Research is what I'm saying.
i never said they did
you said you wouldn't use it if it weren't for that so I thought you were implying it was.
i said i wouldn't use it if it wasn't for 2.5 pro deep research because you can get 2.5 pro with no discernible rate limit for free on ai studio
fun fact: if you trust google's deep research evals, their version would be 146 elo above openai's
i'd like to see its performance on HLE
at least that way we'd get a more direct comparison
(that's like the difference between llama nemotron 49b and gemini 2.5 pro exp)
nutss
nemotron ultra is now out on nvidia build
it beats R1 in most benchmarks
in my testing it was... meh
lol
the thing that is good about it is that it can handle large code
the other apps fail when i give it anything above 32k for some reason
but you are right it is kinda meh, but I would still say its better than gemini 2.5 by itself
im cooking rn, give me a sec
bruhh
it better not be, but if it is that is impressive
really shows how tools can boost up a model
nevermind
it keeps failing
Isn't this fp8. The other ones don't have native support. there's still a big leap though
slightly off topic but
the more i test chatgpt 4o latest
(the march version)
the higher opinion i have of its creative writing
it feels like R1 quality (great) but it doesn't fall apart after more than a few chapters like R1 does
What about quasar
quasar disappointed me for writing tbh
step back from chatgpt 4o latest
more robotic
for the most part, i agree with this
although c3.7s would def be in my top 5 minimum
wow so deepseek taking ws?
what is quasar
we should be getting that quite soon
model
anonymous openai model on openrouter
oh
they shouldve released chatgpt 4o the last one as an api dated version
chatgpt 4o and 4o on the api are too different for them to have released it as a 4o dated version
case oh
4o is more professional/"serious", chatgpt 4o is optimised for chat
and ofc creative tasks
although it did take them too long to release the current latest chatgpt 4o model as a dated version, for a while you could only access it via the -latest endpoint
its called noob pro hacker obby tycoon
thats still the case tho?
iirc
it is o6 mini
only lmsys has access to older versions
oh this is news to me lmao
yeah nevermind
announcement by logan later maybe?
"sooner" 🙂
The entire venom system prompt, summarized by Claude:
lol this is what i mean
what is venom
lmfao
ok
it was an anonymous meta model on the arena with a long system prompt
Deep seek isn’t that good for creative writing it introduces random non sequiturs and makes everything overly verbose and dramatic
lol what
really?
oh. Ok that's less ridiculous lol
It is across the board but in some areas with the right prompt 2.5 pro can be better
soon™️
Could this be nightwhisper?
https://x.com/_davideast/status/1909984439985229940?t=ddJfOgwZywH0inE1AxyWyg&s=19
forget everything else
2.5 pro is the best at nsfw writing lmao
noooo
its bunns
nightwhisper there ?
2.5 pro maybe with tool calls
is it better than opus for that in your experience? @keen beacon
let me test this
yeah once u get it setup. multi turn, context usage, all their training tricks, etc., are just awesome and result in a really good experience imho
2.5 pro still throws so many refusals. It won’t do anything related to web scraping (puppeteer/selenium etc.). on top of that, it can’t even say that it won’t, it says that it’s “not sufficiently trained” on it
wait you already tested o3 right and it was mid?
i have tested o3 medium
it was pretty good
but there are some things it performs meh on
web development is still a weak point
but significantly better than o1/o3 mini
O3 is out? But when ???
Google event goes through 11th right
only for chads
okay i guess thats okay
ill just stick to 2.5
Opus has less flavor and prompt adherence but more personality and a “natural feel to it”
There’s no reason to use it though since it’s much more pricy
The reason google models are so cheap is they are trying to roll them out en masse, if they went the other route and had less instances for higher compute and cost they might be able to have the model give much better outputs relatively, I could not know what I’m talking about tho
nah google is now sota
hmm sorta weird 2.5 flash isnt released yet, maybe they might do an anon model on openrouter 🤣
I can react 🙂
lucky you
im going to hit you with my car
that villager is YOU
this is mildly uninteresting and nobody is talking now
peak shitposter
can someone confirm
They should add auto regressive image capabilities to 2.5 pro
thanks
what’s that
native image gen
it already has them
they just haven't released it yet
actually no
i may be getting mixed up with 2.0 pro
2.5 pro is highly likely to be a cpt of 2.0 pro it should still have them
im curious whether they worked on it at all with 2.5 pro
ok nerd
I don't think it is
There is still imagefx which gives you exceptional results
as i play with firebase some more i see what it truly is
its pretty much a competitor for cursor and every other ide and even claude code tbh
so its basically gemini code
just not in the cli
and this agentspace is nuts
wow
are there any models better than 2.5 yet
gg open ai
there really is no need for any models better than 2.5
skibidi toilet rizz
there kinda is
i would like it
usually theres a new best every week to a month
but its not necessary for what we need
i would say maybe faster
and larger output and maybe window
2.5 flash soon 🤔
but thats pretty much it
is faster
it would be nice to have a passive superintelligence
i could use some agi rn
i mean but do we need that
agi is subjective
some people say its already here
some say years away
some say this year
depends on your definition at this point
remember wen it was to pass the turing test lol
yeah
but this agentspace is very interesting
this is gonna go wild once people get on it both it and fire studio
are most ais trained with curriculum learning?
i honestly feel like we would have asi by now if so
like just giving them near-impossible questions and every once in a while they produce an extremely good cot and response
then add that to a dataset
train
and repeat
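The loop described above (sample on hard questions, keep the rare good answers, train, repeat) can be sketched as a toy simulation. Nothing here is a real training setup: the "model" is just a success probability, and the update rule and all numbers are invented for illustration:

```python
import random

def sample_answer(skill: float, rng: random.Random) -> bool:
    """Toy stand-in for a model attempt: True means a verified-good CoT and answer."""
    return rng.random() < skill

def self_improve(skill: float, rounds: int, attempts: int, seed: int = 0) -> list[float]:
    """Run sample -> filter -> 'train' rounds and return the skill after each round."""
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        # Keep only the rare successful attempts on hard questions.
        kept = sum(sample_answer(skill, rng) for _ in range(attempts))
        # Made-up update rule: each kept example nudges skill up a little, capped at 1.
        skill = min(1.0, skill + 0.001 * kept)
        history.append(skill)
    return history

history = self_improve(skill=0.02, rounds=5, attempts=1000)
print(history)
```

The hard part in practice is the filter step, since someone or something has to verify which answers were actually good.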
that is sorta being done rn
I need deepSeek r2 😁
yeah i wouldnt mind that
how is it free?
like am i missing something?
gg cline, roocode, augment, cursor, windsurf RIP
bolt
claude code lol, but i still like that its a CLI so claude still got a lil hope
no its using 2.5
why the hell would i use my own api lmaooo
actually nevermind i know what to do
gonna use my own free exp model, buts pretty much like roocode in terms of how you pay
oh i see what you mean
so by default the built in is 2.0?
gonna try it with 2.5
i can see another clickbait video from IcodeKing here
2.5 flash still not released?
a little strange right?
maybe 2.5 flash doesnt have enough votes in the arena
disappointing performance and they want to keep cooking or smth else?
if its stargazer i dont think its disappointing
Im just confused cause the leaked model string in the python sdk said april 9th right?
yes
nice
you gonna use that in firebase?
also whats a good prompt for refactoring an app i have to look better visually? should I say apple design expert and stuff?
Did they announce nightwhisper?
No
Anyone know what the model “dragontail” is? I got it on lmarena and it was good but I can’t find anything about it
how good is it?
Yeah
How good
Ima try it rn
Cuz elon musk is hot
did he do it with you?
Idk but it had a better answer to my logic problem than most other models and it was good with an image
xd
Yes
it felt good?
Yes
What model did it say it was
Did u ask it
I got it
Yeah lmao I just got the exact same thing
Only google says 'i am a large language model trained by google'
Maybe its nightwhisper
Was it better than 2.5 pro
Hard to say
Whos
Shadebrook passes vibe test. dragon tail does not for me
2 new models?
Yeah
Shadebrook is Google
Dragontail is google too
Yeah
One of them might be nightwhisper
pro? flash?
Idk
Im trying to get it
So I can ask it
Questions
But it said the same thing all google models say
We don’t know what it is but it’s good with logic and images
Yes
Is it thinking
eugh
ooh
let me test this stuff
Dragontail was good for me
Actually idk. It just got an arc-agi problem wrong that it got right this morning. It's very fast tho
hmm
how many r's are in strawberry?
make a discord clone in one html file.
How tf do u know
magic
That u got dragontail
shhhhhhh
You’re playing Roulette at a casino with a broken wheel that makes it 0.36% more likely to land on Green. What is the new expected value of a $100 bet on the color red?
ask him this
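The roulette question is ambiguous as written, so here is one way to work it under stated assumptions: an American wheel (38 pockets: 18 red, 18 black, 2 green), "0.36% more likely" read as +0.36 percentage points on green taken equally from red and black, and an even-money payout on red. Other readings give different numbers:

```python
from fractions import Fraction

def red_bet_ev(bet: Fraction, green_shift: Fraction) -> Fraction:
    """Expected value of an even-money bet on red on an American wheel."""
    p_red = Fraction(18, 38) - green_shift / 2  # red gives up half of green's gain
    return bet * p_red - bet * (1 - p_red)      # win +bet on red, lose bet otherwise

fair = red_bet_ev(Fraction(100), Fraction(0))
broken = red_bet_ev(Fraction(100), Fraction(36, 10_000))
print(float(fair), float(broken))
```

Under these assumptions the $100 bet goes from about -$5.26 on a fair wheel to about -$5.62 on the broken one.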
I mean a lot of newer models know that
make a discord clone in one html file.
k
and give me the html file
sec
and i will tell if its good or not
oh dear
why do models get that specifically wrong
people probably memed it then it got into a dataset
dreamtides is meh too
because its a tokenization issue
gottem
Don't do R questions, they don't see letters like us
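A small illustration of the tokenization point: counting letters is trivial at the character level, but a model receives token chunks instead. The token split below is invented for illustration, real tokenizers split differently:

```python
word = "strawberry"

# Character-level view: counting is trivial.
print(word.count("r"))  # 3

# Token-level view: a model receives chunks, not letters.
# This split is invented for illustration; real tokenizers differ.
hypothetical_tokens = ["str", "aw", "berry"]
assert "".join(hypothetical_tokens) == word
print([t.count("r") for t in hypothetical_tokens])  # [1, 0, 2]
```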
it's taking a while to start streaming
you got his ash
okay it just started
I made him give me a discord html
gimme gimme
im also giving dragontail a web task
worse than gemini 2.5 pro thinking
Guys
Dragontail
its shadebrook?
is dragontail thinking
yes
mine is still generating
Probably
this seems to be flash
Thanks
its meh
yes
wtf
Is this dragontail
yeah
which model is that
dragontail
just from that output
yeah it's at least on par with 2.5 pro thinking in my very limited testing thus far
which i wouldn't expect from a small model
it was slow for me
Google's models from now on are hybrid models, so if it's something like flash it'll be both
it is
it seems dynamic
i have had a request with practically 0 time and another with like 15s
seems pretty good
Faster than 2.5 pro?
yes
no
they're very similar, i can't really discern much of a performance difference as of yet
which is sorta surprising given this seems to spend half as much time on it and yet matches pro
i would be kinda surprised
in my (again somewhat limited) testing it doesn't seem worse than pro
All 2.5 models are thinking models now
and if this is flash
wtf are the other like 3 anon google models on the arena rn
it doesn't make sense to add flash as an anonymous model today when they're releasing it like tomorrow 😭
Yeah depends if we getting flash this week, or more like next week or something
I guess they are Gemma 4?
its you
you are giving us all these outputs
Hhhhhhj
.;
This seems plausible bc it nailed the arc-agi problem this morning after running like 4 hypotheses to solve and combining relevant ones and then this afternoon zipped out a totally wrong answer on same problem
Did google remove the option to data train from AI studio or am I just not finding it?
I was gonna test it out today 😭
We should get new LB update today btw
Microsoft has copilot and vscode (roocode is better), Google has Firebase Studio and then there are third party ones
I wonder if Apple will do something about it
Xcode with AI?
Firebase Studio and Nightwhisper can be the cheapest option honestly.
were they the same prompts? perhaps it is dynamic if they were the same (though if they were different, esp in length and/or complexity, that might better explain the discrepancy)
i just got it. v strong indeed. would've said almost certainly thinking model based on the quality and time(/delay) of the response (+ it was against command-a, which isn't thinking afaik)
grok 3 mini high is looking pretty good
thats new?
pretty bad pricing lmao
lmaoo yeah its like sonnet right?
mini looks like a pareto frontier model though
For being huge and propped up on infra faster than even seemed possible it's not that surprising tho
its coming out?
No
He's saying he wishes it's 24k
24k karat gold
When it comes out
oh lol okay
It might’ve just been maverick with a modified system prompt it didnt seem that accurate
It is
Behemoth is prob anonymous-test
It's shet
Qwen 3 is coming in hours!
Good chance of topping the open source leaderboard
Tested Grok 3 Beta in OpenRouter for the 20 public SimpleBench questions, it got 6/20
How much did gemini pro get
9/20
does grok 3 use reasoning?
Got Maverick full release! 😌
Quite a surprise, but that's a good thing: I can't pinpoint the exact model because its "vibe" doesn't scream out loud.
Hoping for the actual result.
WHY ALL APIs have a context limit of only 1 million
I was just trying to set up a MCP server with Cline
it failed
it says it's out of context window when it's like 80% done
arghhhhhhhhhhh
Not yet
Sorry Bindu, this is not gonna happen that soon. We still need some more time.
I need real 10 million context window model
It's been 3 months since the publication of the "Titans" architecture by Google
I hope Google can make a large multimodal reasoning model out of that architecture
Wait grok 3 reasoning
Since Grok 3 has no ethics regulations
hihi
is dragontail as good as nightwhisper?
seems like they werent satisfied with the performance
coding wise?
no
nightwhisper is finetuned intensively on coding
they probably made a good reward model for web dev
styling wise
Dragontail
could dragon tail be o4-mini or o3?
Its from google
hopefully pricing remains similar to 2.0 flash
its def not nightwhisper
i got much better results for a discord clone from nw
its on par with gemini 2.5 pro
idk if its better or not
sorry for the bad screenshot
screenshot on laptop
dragontail
2.5 pro
i think its gemini 2.5 pro, low thinking
the results are very similar to 2.5 and it thinks but for less time
o4 mini 👀 I have no hope for o3 being affordable though lol
Do you guys have access to Veo2 in AI Studio? It seems to be rolling out, some have it, some don't. I don't 😦
Yeh I've tried it out
it's openai's launch day in the week... we may actually be getting things 👀
hmm will o4 mini launch before 2.5 flash, seems like it lol
i wonder if o4 mini is based on an updated 4o mini base
or if its just more roids 🤣
seems like google sort of forced their hand
how could it be o4 mini
its literally not thinking
it streams immediately and there is no apparent thinking
yeah
maybe its o4 mini low
no
and i benchmarked it it must be an insane regression from o3 mini 🤣
there is 0 thinking
nada
zero
people are crazy 🤣
i'm more interested in this
they're adding it to openrouter "tomorrow morning" (EST)
imagine if its 2.5 flash 🤣
same naming scheme as the anon openai model so i don't think so lmao
could be named like that to trip people
i find that a little unlikely
Why does open router say o4-mini and have same stats basically?
openrouter doesn't say o4 mini
Lmaoo I need to stop trying to use brain 2 min after waking up
im not hyped for any o-serie model
quasar ?
ye
yeah no thinking there.. but it can't be o4-mini can it?
tf is the point of their naming schema if that is the case ha
i measured gpqa diamond and math 500 it must be a severe regression if it was o4 mini (which it can't be because its not thinking)
yeah the non-thinking alone
but also performance-wise, solid as it is - it's not rubbing up against the frontier in any way
ah yup i recall
also: o3 mini gpqa diamond: 74.8%, math 500: 97.3%
@keen beacon you should test optimus alpha when it releases
i will if rate limits are like quasar
some scores from a quiz (~20 questions in one prompt; same as shared above here somewhere)
just fwiw
wait.. that wasn't sorted right
which server's that?
openrouter
why is 3.7 sonnet in 23 place? gemini flash lite is in 16
use style control..
well i know but i want to know why tbh.
still doesn't make sense
that's just what something based on human preference will produce 🤷♂️
man the difference in capability between 2.5 pro in ai studio vs cursor is so huge. I really wish it wasnt so bad in cursor lol
Could you explain what "the open-router server" means?
That's disappointing. I am hoping for better models than 2.5 Gemini
Just use roocode
It deleted my 800k context conversation which is frustrating though
I have been avoiding roocode because I hear it can be expensive af
I created new chatgpt alt account on new chrome profile and I can use gpt 4o forever it seems unlike my og chatgpt account. Can also generate way more images
Is that a feature for new accounts or something
Will prob return to normal after a while
wait who owns dragontail?
2.5 is too good
lol
TPU x @ssi news: "OpenAI co-founder and former chief scientist @ilyasut’s new AI startup, Safe Superintelligence, is using Google Cloud’s TPU chips to power its AI research," via @TechCrunch ↓ https://t.co/G0jnB5rEua
i wonder what progress he has made
Any reports on how Cognito performs?
i think most of it is that claude's terse response style doesn't work well when you're trying to do quick one-shot comparisons between models
which is what lmarena "power users" tend to do en masse, i think
What kind of persona is a lmarena power user?
it's nice when you're expecting it though
people that talk a lot on this discord, basically
and/or do more than ~5 chats on lmarena a day on average
(that number is totally arbitrary, i just made it up)
I would guess that if you have a problem and wanted to solve it with an LLM, throwing it at LMArena would lead to really mixed results
personally, i do it just to try and explore different models (plus, it's one of the lowest-barrier-of-entry ways to get alpha access to new LLMs as they're being developed)
but yeah, you're right
i don't use it as a general-purpose chatbot, it's more for my own curiosity
guess im a power user now
welcome to the club :3
what's optimus alpha
anonymous model releasing on arena today
all good
hyped
Poll: Best Deep Research · winner: Gemini 🫡 (8 of 15 votes)
does openrouter have a discord?
i am hearing about so many random models and i've been in the server for 2 minutes
what exactly are you trying to do? most models tend to perform pretty poorly when operating with that much context, even if they do support it officially
especially with coding, in my experience
off the top of my head, i don't know of any models that support much more than 1M context, anyway
the new llama models are supposed to but they suck balls
and some of the geminis iirc go to 2M
i forget what that fiction context window benchmark is called
I find it odd this name matching, even the context matches? 🤯
https://x.com/SILXLAB/status/1909475116637208937?t=Jd0rllJnICEeVyKNPuYZ_g&s=19
17 yr old grifter or smthing. just a larper just ignore
lol
😭
probably. It's somewhat unusual for openai though as well. It's still not released and only on openrouter
0 weights
this guy is actually 17 yrs old
meta in general seems to have been sucking balls lately
/nonsexual
behemoth better be good
maybe joint 1st w/ stylectrl
It’ll matter a lot to how well reasoning scales to superhuman levels how o4-mini performs
It really depends whether they posttrained o3 on their new 4o stuff
imo
it also depends on which reasoning effort version they put on the arena
o3 low will be very different from o3 high
its conceivable the new anonymous chatbot could top the leaderboard
which one
there are many
i mean, i had a pretty good impression of it, but you may be right
anonymous chatbot/quasar by openai. since the leaderboard is mainly human preference
quasar isn't on the arena
and even if it was
it is lol
It’s ass compared to 2.5
i don't think it would take the top spot tbh
it's not as nice to talk to as chatgpt 4o latest
also seems less creative
well if it takes like several minutes to answer a question idk if it will get the top spot lol
yeah, as we've learned, this matters a lot more than people think it does
anonymous chatbot is the same though lol
in the original (gradio) lmarena, it pauses the output of both models until they both start sending tokens
making up 30 elo points is harder than it seems too
so, both models get paused until they both finish thinking, basically
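Going by the description above, the pairing behaviour can be sketched like this. The model names and timings are hypothetical, and this is not lmarena's actual code, just the gating rule as described in the chat:

```python
def display_start(first_token_times: dict[str, float]) -> float:
    """Both outputs are held until the slower model emits its first token."""
    return max(first_token_times.values())

def extra_wait(first_token_times: dict[str, float]) -> dict[str, float]:
    """Seconds each model's output is held back beyond its own first token."""
    start = display_start(first_token_times)
    return {name: start - t for name, t in first_token_times.items()}

# Hypothetical timings: an instant streamer paired with a model that thinks for 15 s.
timings = {"model_a": 0.3, "model_b": 15.0}
print(display_start(timings), extra_wait(timings))
```

So a fast model's speed advantage is invisible to the voter whenever it gets paired with a slow thinker.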
there's anonymous-test