#general | Arena | Page 24

keen beacon Apr 16, 2025, 9:58 PM

#

what would you name this new retrained o3

#

o4 then benchmark it with the juice u used for o3... i guess?

ocean vortex Apr 16, 2025, 9:58 PM

#

be explicit that it's not just "o3". I think that's fairly obvious, no?

zinc ore Apr 16, 2025, 9:58 PM

#

They can't do that again because once they actually release people will test it and get vastly different results. People already picked up on what they did, so they risk harming their brand continuing to do so

#

So caught up on trying to generate hype compared to their competition that they just make highly misleading test results

#

But anyone perceptive can pick up on

ocean vortex Apr 16, 2025, 9:59 PM

#

zinc ore They can't do that again because once they actually release people will test it ...

sure but the point is they knew to begin with that they aren't testing just the model o3. But rather their version of a "pro" setup which happens to be based on the new model

#

that was never gonna see the day of light as just "o3" catgrin

keen beacon Apr 16, 2025, 10:00 PM

#

ocean vortex be explicit that it's not just "o3". I think that's fairly obvious, no?

how would you do that though, its already a mess

#

the o3 in december is much less stronger than the retrained o3 with the new 4.1 base we have now, at least served with the same compute. so it does mean something

ocean vortex Apr 16, 2025, 10:01 PM

#

keen beacon how would you do that though, its already a mess

o3-pro-preview, or whatever else...

keen beacon Apr 16, 2025, 10:03 PM

#

still misleading since o3 pro is not gonna be served with that level of compute

#

it depends on the benchmarks though, if the o3 pro product they're gonna release performs impressively/or close compared to the initial o3 results, they could name it like that

ocean vortex Apr 16, 2025, 10:03 PM

#

keen beacon the o3 in december is much less stronger than the retrained o3 with the new 4.1 ...

which would all line up if that version from december wasn't scoring higher, but it is... And that's the problem. It's worse but it was scoring higher, because it wasn't just o3... Like I really think this is obvious they didn't just accidentally do it LOL

keen beacon Apr 16, 2025, 10:06 PM

#

ocean vortex which would all line up if that version from december wasn't scoring higher, but...

at the time do u think they already had plans to retrain o3/etc

#

its misleading but i dont think its intentionally malicious

keen beacon Apr 16, 2025, 10:08 PM

#

ocean vortex which would all line up if that version from december wasn't scoring higher, but...

they could make it score higher if they used as much compute as they did before, but like the other guy said its a bad idea and this time it would be intentionally misleading imo

ocean vortex Apr 16, 2025, 10:09 PM

#

then

#

now

keen beacon Apr 16, 2025, 10:10 PM

#

ocean vortex then

arent they using cons@64 or smthing like that, the grey area which i assume the score is for?

keen beacon Apr 16, 2025, 10:10 PM

#

ocean vortex now

this is pass@1 i think

ocean vortex Apr 16, 2025, 10:10 PM

#

iirc it was pass@1 for o3 supposedly

keen beacon Apr 16, 2025, 10:11 PM

#

whats the grey area then

ocean vortex Apr 16, 2025, 10:11 PM

#

and cons@64 only for o1

ocean vortex Apr 16, 2025, 10:16 PM

#

keen beacon whats the grey area then

I'm not sure, maybe it was 🤔
this still doesn't add up though assuming preview wasn't using the new base model

keen beacon Apr 16, 2025, 10:18 PM

#

even without the juicing it couldve just been doing even more massive and unreasonable chain of thought by default compared to the optimized new o3, to reach those scores

fringe carbon Apr 16, 2025, 10:18 PM

#

keen beacon whats the grey area then

i believe light grey is if it's reprompted

tall summit Apr 16, 2025, 10:19 PM

#

What is cons@64, you might ask? Well, it’s short for “consensus@64,” and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers generated most frequently as the final answers.

ocean vortex Apr 16, 2025, 10:19 PM

#

keen beacon they could make it score higher if they used as much compute as they did before,...

it can not be misleading if they are explicit that this isn't just o3. Like it's really not f'ing hard to do. You can literally name it "o3 + extended cot synthesis"

#

just like they are listing tools now, this isn't all that different

keen beacon Apr 16, 2025, 10:25 PM

#

ocean vortex it can not be misleading if they are explicit that this isn't just o3. Like it's...

you want them to go back and edit all the videos/material after they changed plans? its just not realistic. they weren't in development far enough to know how they'd serve it so how would you make that distinction without knowing the future

#

they changed it now to o3-preview at least

ocean vortex Apr 16, 2025, 10:27 PM

#

keen beacon you want them to go back and edit all the videos/material after they changed pla...

No I think my point is that now it's obvious they knew what it was and what it's going to be released as to start with. This wasn't "a mistake" lol

keen beacon Apr 16, 2025, 10:27 PM

#

they knew they were gonna retrain o3 at the time lol?

ocean vortex Apr 16, 2025, 10:29 PM

#

keen beacon they knew they were gonna retrain o3 at the time lol?

?? no they knew they were not gonna release o3-pro as "o3", and that the model to be actually released gonna be much closer to o1

keen beacon Apr 16, 2025, 10:29 PM

#

but its not even o3 pro

ocean vortex Apr 16, 2025, 10:29 PM

#

keen beacon but its not even o3 pro

whatever, that is clearly not the point here lmao

#

I'm just calling it as such as it's the closest match to what it was

keen beacon Apr 16, 2025, 10:29 PM

#

if they released that version of o3 without the juice and changed the reasoning effort levels would you be ok with it?

ocean vortex Apr 16, 2025, 10:31 PM

#

keen beacon if they released that version of o3 without the juice and changed the reasoning ...

I would be ok if it wasn't named "o3" in the graphs when they knew that model is nothing like "o3" to be released. And once again, you can't be serious thinking this was a mistake. catgrin

keen beacon Apr 16, 2025, 10:32 PM

#

it isnt a mistake

#

bro?

ocean vortex Apr 16, 2025, 10:32 PM

#

well then why are you asking if they should go back and edit it lol

keen beacon Apr 16, 2025, 10:32 PM

#

it was intentional at the time but they didnt know the future and committed to it. it is now misleading in the present

ocean vortex Apr 16, 2025, 10:32 PM

#

they did it deliberately, what's there to edit? 👀

thorny drum Apr 16, 2025, 10:32 PM

#

its pretty lame but on the tier list of gaming benchmarks its only in the middle

#

i think anyone who cares a ton about benchmarks probably watched the stream and knew they used ungodly compute on them

keen beacon Apr 16, 2025, 10:33 PM

#

ocean vortex they did it deliberately, what's there to edit? 👀

its deliberate but they have no idea about how misleading it would be in the future lol

thorny drum Apr 16, 2025, 10:33 PM

#

i personally think its cool to hear about these companies super powerful internal models

#

and i wasnt expecting an o3 that was $1/token

ocean vortex Apr 16, 2025, 10:35 PM

#

keen beacon its deliberate but they have no idea about how misleading it would be in the fut...

Ok imagine yourself in their shoes. You have a model x, but instead of running it normally, you use pro-like system to boost the scores way up. But then still leave the name exactly as it was, model x, with no additions. Now in what universe would you think this can be not misleading? LOL

keen beacon Apr 16, 2025, 10:36 PM

#

ocean vortex Ok imagine yourself in their shoes. You have a model x, but instead of running i...

they were early in development they might have some idea but it could change drastically

tall summit Apr 16, 2025, 10:36 PM

#

guys

#

it doesnt matter

#

just wanted to tell you

raven void Apr 16, 2025, 10:37 PM

#

new o3 is pretty good, they've reduced the size while keeping similar performance to original o3

keen beacon Apr 16, 2025, 10:37 PM

#

tall summit it doesnt matter

we like being stun locked 🤣

tall summit Apr 16, 2025, 10:37 PM

#

pretty sure the consensus is o4 mini is better not "often" or even "sometimes" but more than just occasionally

tall summit Apr 16, 2025, 10:38 PM

#

keen beacon we like being stun locked 🤣

i love pointless arguments but when youre on the outside its easy to see how annoying they are

ocean vortex Apr 16, 2025, 10:38 PM

#

keen beacon they were early in development they might have some idea but it could change dra...

nah I think it was obvious since the beginning this isn't competing directly with normal o1. Well for them at least, for us... even I didn't know they used something similar to pro for those scores. I was impressed back then but now I kinda feel scammed. Which is the whole reason of why I even wrote it. catgrin

keen beacon Apr 16, 2025, 10:38 PM

#

theres nothing wrong with it lol

tall summit Apr 16, 2025, 10:38 PM

#

keen beacon theres nothing wrong with it lol

^

zinc ore Apr 16, 2025, 10:40 PM

#

Pro is just longer think for base o3 (from tech crunch)

keen beacon Apr 16, 2025, 10:40 PM

#

ocean vortex nah I think it was obvious since the beginning this isn't competing directly wit...

its still impressive though for any system to reach that level even if it was juiced

ocean vortex Apr 16, 2025, 10:42 PM

#

keen beacon its still impressive though for any system to reach that level even if it was ju...

it's mostly hype building marketing. If you are gonna do something like that well then do what google did

#

AlphaGeometry

#

and the fact that they didn't even have new base (4.1) back then makes it worse, not better tbh
At least now with an improved model it is not 1 million miles away...

#

it's still an interesting good model, but yeah... the direction openai is heading and what have they became lately comparing to like 1.5 years ago is something I really do not like

#

don't get me started on the pricing too... they finally reduced the price but only for the new model, o1 is amazingly now more expensive than o3 catgrin

ember rapids Apr 16, 2025, 10:51 PM

#

We need Claude 4

#

That’s what I’m most hyped for

ocean vortex Apr 16, 2025, 10:51 PM

#

with the releases of both 4-turbo and gpt4o, their pricing was essentially unmatched (good)

#

then o series started and all hell broke loose with them going to town on pricing lol

thorny drum Apr 16, 2025, 10:54 PM

#

i feel like these new models r pretty good

#

whats wrong with them

ocean vortex Apr 16, 2025, 10:54 PM

#

yeah it was a very novel idea. Preview had some bugs and was rough with later o1 being much better imo but this idea enabled good progress for sure

ember rapids Apr 16, 2025, 10:55 PM

#

The most impressive aspect of o3 for me so far is its ability to generate good ideas

hardy pecan Apr 16, 2025, 11:03 PM

#

Is o3, medium or high in the lmarena?

glass arch Apr 16, 2025, 11:04 PM

#

hardy pecan Is o3, medium or high in the lmarena?

base o3 is here

dapper storm Apr 16, 2025, 11:04 PM

#

Yea but what level of thinking is that

glass arch Apr 16, 2025, 11:05 PM

#

we have no freaking clue I guess

keen beacon Apr 16, 2025, 11:05 PM

#

i like how o3 talks

#

it's got more of a personality than o1

ocean vortex Apr 16, 2025, 11:05 PM

#

dapper storm Yea but what level of thinking is that

if it's not specified that's medium

glass arch Apr 16, 2025, 11:05 PM

#

on chatgpt.com, it actually thinks with your name for some reason

ocean vortex Apr 16, 2025, 11:06 PM

#

glass arch on chatgpt.com, it actually thinks with your name for some reason

yeah I noticed that too. I found that odd since I have memory and CI disabled and it somehow still knows lol

glass arch Apr 16, 2025, 11:06 PM

#

here is what it does for me

elder rapids Apr 16, 2025, 11:07 PM

#

keen beacon it's got more of a personality than o1

ye good vibes

glass arch Apr 16, 2025, 11:07 PM

#

it also writes with a lot more emojis for some reason?

elder rapids Apr 16, 2025, 11:07 PM

#

it still doesn't shine like 2.5 pro though tbh

glass arch Apr 16, 2025, 11:07 PM

#

I think openai's models are the friendliest

#

gemini is quite annoying to talk to

ocean vortex Apr 16, 2025, 11:08 PM

#

glass arch here is what it does for me

🧐

elder rapids Apr 16, 2025, 11:08 PM

#

glass arch gemini is quite annoying to talk to

might be just getting used to

#

in my experience, the Gemini family is more of a customizable, but baseline neutral family

ocean vortex Apr 16, 2025, 11:08 PM

#

elder rapids might be just getting used to

it's their fine-tuning. They nearly fixed it but not entirely

elder rapids Apr 16, 2025, 11:09 PM

#

I don't think it's a flaw lol

#

I think that's the intention

glass arch Apr 16, 2025, 11:09 PM

#

elder rapids might be just getting used to

it seems bad at understanding social norms
I hooked it up in such a way that me and some friends could speak to it in a discord call, and it is so hard to talk to

elder rapids Apr 16, 2025, 11:09 PM

#

glass arch it seems bad at understanding social norms I hooked it up in such a way that me ...

oh ikwym

glass arch Apr 16, 2025, 11:09 PM

#

the openai models are better at social norms though

ocean vortex Apr 16, 2025, 11:09 PM

#

"you are right to point that out!" - I hate these from gemini lmao

elder rapids Apr 16, 2025, 11:09 PM

#

when I give it an example

#

and the example is using "you"

#

it thinks it needs to address it as the AI itself

#

rather than assuming it's a general example

keen beacon Apr 16, 2025, 11:10 PM

#

ocean vortex "you are right to point that out!" - I hate these from gemini lmao

i think 2.5 pro is less sycophantic

ocean vortex Apr 16, 2025, 11:11 PM

#

keen beacon i think 2.5 pro is less sycophantic

yeah it's acceptable levels now. But sometimes still annoying

glass arch Apr 16, 2025, 11:11 PM

#

if there was an AI model that I would trust to not destroy us if it got AGI, it would be chatgpt (any one of them tbh)

elder rapids Apr 16, 2025, 11:12 PM

#

i wouldn't say Claude

#

it has a very good feel, but it's almost superficial

#

so yeah chatgpt prob

ocean vortex Apr 16, 2025, 11:12 PM

#

glass arch if there was an AI model that I would trust to not destroy us if it got AGI, it ...

it's the most stable behaving one for sure

elder rapids Apr 16, 2025, 11:13 PM

#

to me Gemini has more of that meta intelligence, and that's what's deciding it's morality

#

rather than having instructed morals

glass arch Apr 16, 2025, 11:13 PM

#

claude is actually very good at language (by this, I mean it can speak toki pona the best)

elder rapids Apr 16, 2025, 11:13 PM

#

ocean vortex it's the most stable behaving one for sure

yep

glass arch Apr 16, 2025, 11:13 PM

#

I did a few tests with the arena, and claude comes out on top for all toki pona conversations

elder rapids Apr 16, 2025, 11:13 PM

#

very contained

elder rapids Apr 16, 2025, 11:14 PM

#

glass arch I did a few tests with the arena, and claude comes out on top for all toki pona ...

ye that's what I look for in models

#

3.5 sonnet has always been clearly the best

#

but that's why I like 2.5 pro so much, because it does the same thing when you really want it to

glass arch Apr 16, 2025, 11:16 PM

#

I want to run a test playing a game called "keep talking and nobody explodes"

#

I did it a few days ago with o3-mini-high

#

which of the new models should I pick?

elder rapids Apr 16, 2025, 11:16 PM

#

keep talking and nobody explodes?

glass arch Apr 16, 2025, 11:17 PM

#

yes

#

the one with the bomb

barren prairie Apr 16, 2025, 11:18 PM

#

glass arch gemini is quite annoying to talk to

I can agree +++
But it improved a lot (I talk to Gemini flash thinking and it seems cute 😆😆) but chatgpt is more natural

ocean vortex Apr 16, 2025, 11:18 PM

#

glass arch yes

sounds like some nsfw rp lmao

elder rapids Apr 16, 2025, 11:18 PM

#

I'm gonna explode

ocean vortex Apr 16, 2025, 11:19 PM

#

💀

glass arch Apr 16, 2025, 11:19 PM

#

ocean vortex sounds like some nsfw rp lmao

nah, more like yelling at the AI for not understanding the maze puzzle

elder rapids Apr 16, 2025, 11:19 PM

#

barren prairie I can agree +++ But it improved a lot (I talk to Gemini flash thinking and it s...

flash thinking was hella neurotic, but it was super adjustable

#

damn if y'all feel this way about Gemini

#

you guys gotta know the strats

#

theres some insane keywords that these models react to

glass arch Apr 16, 2025, 11:20 PM

#

which

#

I gotta try now

elder rapids Apr 16, 2025, 11:20 PM

#

Gemini and Claude

#

with "cut and cleanly"

#

or 4o, with "pragmatic"

glass arch Apr 16, 2025, 11:21 PM

#

openai is reaally good at naming huh

#

4o

#

o4

elder rapids Apr 16, 2025, 11:21 PM

#

feels like they're doing this on purpose

glass arch Apr 16, 2025, 11:21 PM

#

what does the "o" in 4o mean again?

elder rapids Apr 16, 2025, 11:22 PM

#

if they release o5 after releasing 5o

elder rapids Apr 16, 2025, 11:22 PM

#

glass arch what does the "o" in 4o mean again?

Omni

#

ngl when the first 4o was "anonymous-chatbot" it felt so fresh

#

like a clear step above the other models

glass arch Apr 16, 2025, 11:22 PM

#

it was truly the best

keen beacon Apr 16, 2025, 11:22 PM

#

anonymous chatbot is chatgpt 4o latest. are you talking about the i am good gpt2 thing or the first chatgpt 4o latest variant

glass arch Apr 16, 2025, 11:23 PM

#

now I have to switch off of 4o because it's not even worth giving orders to

elder rapids Apr 16, 2025, 11:23 PM

#

keen beacon anonymous chatbot is chatgpt 4o latest. are you talking about the i am good gpt2...

the first chatgpt 4o variant pre release

keen beacon Apr 16, 2025, 11:23 PM

#

oh

elder rapids Apr 16, 2025, 11:23 PM

#

I am good gpt2 was another pre release variant

#

of 4o

keen beacon Apr 16, 2025, 11:23 PM

#

yea ik

elder rapids Apr 16, 2025, 11:24 PM

#

and then "birdie" was athene 70b

#

which was lowk a secret

#

y'all didn't know about that one

#

the best model at that time behind 4o

glass arch Apr 16, 2025, 11:24 PM

#

what is your guys' go-to prompt for these new models?

#

I ask for pacman in pygame

#

and o4 did really good at this

elder rapids Apr 16, 2025, 11:25 PM

#

I usually give random puzzles

#

and then ask older models to solve the same ones

#

o4 mini is good at puzzles

#

o3 is horrible

#

o3 is super smart outside of these rigor subjects tho

glass arch Apr 16, 2025, 11:26 PM

#

yeah for my use cases, these models are very optimized

ornate stump Apr 16, 2025, 11:27 PM

#

Does anyone know if they’re planning to improve the voice mode? It’s still the least utilized and developed feature considering its potential.

glass arch Apr 16, 2025, 11:27 PM

#

also, I don't think an AI is going to be fooling a human any time soon

elder rapids Apr 16, 2025, 11:27 PM

#

ornate stump Does anyone know if they’re planning to improve the voice mode? It’s still the l...

I think until the competition gets stronger in voice

#

they're not gonna do that much

#

sesame is cool and all

glass arch Apr 16, 2025, 11:27 PM

#

gemini is CRAZY at video

elder rapids Apr 16, 2025, 11:27 PM

#

but the distribution is bad

#

since it's not a major distributor like Google

elder rapids Apr 16, 2025, 11:28 PM

#

glass arch gemini is CRAZY at video

yep

glass arch Apr 16, 2025, 11:28 PM

#

I streamed me and the boys playing the binding of isaac, and it was able to identify stuff

elder rapids Apr 16, 2025, 11:28 PM

#

it's in its own league as far as understanding videos

glass arch Apr 16, 2025, 11:29 PM

#

elder rapids it's in its own league as far as understanding videos

from what I see on x, the everything app, grok can understand videos pretty well too

ornate stump Apr 16, 2025, 11:29 PM

#

elder rapids sesame is cool and all

Yeah, but I mean, I don't need the AI to flirt with me or be that assertive. But the voice mode in Gemini or OpenAI is stupid you not only can't have a conversation with it, but you also can't use it under any circumstances.

glass arch Apr 16, 2025, 11:29 PM

#

ornate stump Yeah, but I mean, I don't need the AI to flirt with me or be that assertive. But...

it works for me mostly

#

it likes to end its turn too quickly though

#

well, not gemini

elder rapids Apr 16, 2025, 11:30 PM

#

glass arch from what I see on *x, the everything app*, grok can understand videos pretty we...

wym?

glass arch Apr 16, 2025, 11:30 PM

#

gemini talks way too freaking much

elder rapids Apr 16, 2025, 11:30 PM

#

since when can grok understand videos

glass arch Apr 16, 2025, 11:31 PM

#

elder rapids wym?

it seems like people can mention grok under a video and it will respond

elder rapids Apr 16, 2025, 11:31 PM

#

glass arch it seems like people can mention grok under a video and it will respond

I thought that was finding references

#

and transcriptions

#

rather than watching the video

glass arch Apr 16, 2025, 11:31 PM

#

oh yeah probably

elder rapids Apr 16, 2025, 11:31 PM

#

since in that case, it doesn't do very well

glass arch Apr 16, 2025, 11:31 PM

#

I don't know what grok does

elder rapids Apr 16, 2025, 11:31 PM

#

I've seen it used on memes and stuff

#

video memes

ornate stump Apr 16, 2025, 11:31 PM

#

glass arch gemini talks way too freaking much

Gemini Live can’t search the web, though, right?

elder rapids Apr 16, 2025, 11:31 PM

#

and it doesn't do very well understand what it even is

glass arch Apr 16, 2025, 11:31 PM

#

ornate stump Gemini Live can’t search the web, though, right?

it can't(?)

elder rapids Apr 16, 2025, 11:32 PM

#

ornate stump Gemini Live can’t search the web, though, right?

I think most recently

#

it may be able to

#

before no, now I think so

keen beacon Apr 16, 2025, 11:32 PM

#

hmm sometimes 2.5 pro stops thinking on aistudio, especially with extremely long chats, not sure if its just a visual thing (it starts streaming with the same delay as if it were to start streaming a thought process), but if i prompt it to keep thinking itll usually pop it back up 🤣

elder rapids Apr 16, 2025, 11:32 PM

#

ornate stump Yeah, but I mean, I don't need the AI to flirt with me or be that assertive. But...

Google might solve that soon

elder rapids Apr 16, 2025, 11:33 PM

#

keen beacon hmm sometimes 2.5 pro stops thinking on aistudio, especially with extremely long...

wym? like, the text just suddenly pops back in the message it just finished after you send something

ornate stump Apr 16, 2025, 11:34 PM

#

elder rapids I think most recently

I've just tried it and it can search the web now

elder rapids Apr 16, 2025, 11:34 PM

#

oh ye this happens to me

#

same with it going outside of its thinking box

#

ye, I think it's a visual bug

ornate stump Apr 16, 2025, 11:35 PM

#

elder rapids Google might solve that soon

Do you have any news/articles on this, or are you just speculating based on what they’re doing right now?

elder rapids Apr 16, 2025, 11:35 PM

#

based on what they're doing right now

#

since they're focusing on multimodality

real totem Apr 16, 2025, 11:36 PM

#

Bro

#

I just came back

#

And I see this

#

o4 mini and o3 and 4.1

elder rapids Apr 16, 2025, 11:36 PM

#

the streaming delay?

real totem Apr 16, 2025, 11:36 PM

#

Ain’t no way all of these released so fast

elder rapids Apr 16, 2025, 11:36 PM

#

I'm confused

real totem Apr 16, 2025, 11:36 PM

#

Which is the best

#

One

elder rapids Apr 16, 2025, 11:36 PM

#

o3 for general tasks, o4 mini for coding tasks/puzzles, 4.1 is API only

real totem Apr 16, 2025, 11:37 PM

#

Are they better

#

Than gemini 2.5 ro

#

Pro

elder rapids Apr 16, 2025, 11:37 PM

#

and its good at small coding tasks

ornate stump Apr 16, 2025, 11:37 PM

#

real totem Than gemini 2.5 ro

meh

real totem Apr 16, 2025, 11:37 PM

#

Google’s next release

#

Is finna be crazy

real totem Apr 16, 2025, 11:37 PM

#

ornate stump meh

It’s better

#

For me

#

Not by a lot tho so ye

elder rapids Apr 16, 2025, 11:38 PM

#

real totem Than gemini 2.5 ro

o4 mini is a little better than 2.5 pro at pure coding tasks, from the benchmarks, o3 isn't really that much better, or even at all. It seems to fail things 2.5 is getting really easily, and 4.1 isn't a reasoning competitor

real totem Apr 16, 2025, 11:38 PM

#

elder rapids o4 mini is a little better than 2.5 pro at pure coding tasks, from the benchmark...

Oh

#

Makes sense

keen fulcrum Apr 16, 2025, 11:38 PM

#

What is the difference between o4 mini and o4 mini high?

elder rapids Apr 16, 2025, 11:39 PM

#

high/pro means it thinks longer

keen fulcrum Apr 16, 2025, 11:39 PM

#

Simple to understand thanks

ornate stump Apr 16, 2025, 11:40 PM

#

real totem Not by a lot tho so ye

Gemini 2.5 got a lot of people to switch when they released it, not sure if it's the same with o3

barren prairie Apr 16, 2025, 11:40 PM

#

Sometimes I think that Gemini 2.5 is overthinking ...sometimes , the Flash thinking make good answers while the pro sucks 🙂 I still donno why

balmy mist Apr 16, 2025, 11:41 PM

#

https://x.com/koltregaskes/status/1912645196866793854

Kol Tregaskes (@koltregaskes) on X

ChatGPT Plus, Team & Enterprise subscribers get:
- 50 messages a week with o3
- 150 messages a day with o4-mini
- 50 messages a day with o3-mini-high

ChatGPT Pro subscribers basically her unlimited.

This is great but I'm questioning the value of Pro with such high Plus limits.

elder rapids Apr 16, 2025, 11:41 PM

#

barren prairie Sometimes I think that Gemini 2.5 is overthinking ...sometimes , the Flash think...

I think this is inherent to the prompt itself tbh

real totem Apr 16, 2025, 11:41 PM

#

ornate stump Gemini 2.5 got a lot of people to switch when they released it, not sure if it...

Yeah

#

They wont switch from gemini

#

Cuz its free

elder rapids Apr 16, 2025, 11:41 PM

#

2.0 flash thinking is a true overthinker

real totem Apr 16, 2025, 11:41 PM

#

And o3 is paid

elder rapids Apr 16, 2025, 11:41 PM

#

at least from the months I used it

ornate stump Apr 16, 2025, 11:41 PM

#

barren prairie Sometimes I think that Gemini 2.5 is overthinking ...sometimes , the Flash think...

This is largely based on what you're doing, but if overthinking is an issue for you, maybe the O4 Mini will suit you better.

keen fulcrum Apr 16, 2025, 11:41 PM

#

Gemini 3 will be groundbreaking yet again

barren prairie Apr 16, 2025, 11:42 PM

#

elder rapids I think this is inherent to the prompt itself tbh

It is just some QMC and questions

elder rapids Apr 16, 2025, 11:42 PM

#

real totem And o3 is paid

depends, the limits can be really restrictive, and o4 mini is cheaper than 2.5 pro

keen beacon Apr 16, 2025, 11:42 PM

#

omg

#

o4 mini limits are pretty decent

elder rapids Apr 16, 2025, 11:42 PM

#

although that's beneficial ONLY for coding

barren prairie Apr 16, 2025, 11:42 PM

#

ornate stump This is largely based on what you're doing, but if overthinking is an issue for ...

Nahhh at all

elder rapids Apr 16, 2025, 11:42 PM

#

oh wait

real totem Apr 16, 2025, 11:42 PM

#

elder rapids depends, the limits can be really restrictive, and o4 mini is cheaper than 2.5 p...

Yeah

elder rapids Apr 16, 2025, 11:42 PM

#

you're talking about on the apps

real totem Apr 16, 2025, 11:42 PM

#

I just switch accounts

#

When I get restricted

real totem Apr 16, 2025, 11:43 PM

#

elder rapids you're talking about on the apps

The ai studio

elder rapids Apr 16, 2025, 11:43 PM

#

ye

real totem Apr 16, 2025, 11:43 PM

#

It’s 50 messages I think

#

It’s pretty good

elder rapids Apr 16, 2025, 11:43 PM

#

if you're talking about AI studio vs chatgpt plus

real totem Apr 16, 2025, 11:43 PM

#

Yeah

elder rapids Apr 16, 2025, 11:43 PM

#

real totem It’s pretty good

I don't think that's true tho

#

that's just for API I think

#

I've sent more than 50 per chat in an hour

#

ye the difference is big, thinks a lot

#

wonder how good 2.5 pro would be with that crazy length

keen beacon Apr 16, 2025, 11:44 PM

#

hmm i was checking to see if the 2.5 pro thinking bug was visual: 5.5s to the first token (first token in thoughts) and 6.1s (2.5 pro immediately going into a response), it doesn't seem to be visual

elder rapids Apr 16, 2025, 11:45 PM

#

keen beacon hmm i was checking to see if the 2.5 pro thinking bug was visual: 5.5s to the f...

would this not be directly visual

#

I'm confused on wym

keen beacon Apr 16, 2025, 11:45 PM

#

if it were doing the thought process and skipping the thinking block

#

u would expect the first token to be delayed on the response

#

but its the same

#

delay with or without thoughts

#

so it doesnt seem to be visual

elder rapids Apr 16, 2025, 11:46 PM

#

alr I'm lost but I hope you figure it out

#

sorry bro 🙏

keen beacon Apr 16, 2025, 11:46 PM

#

its fine im explaining it confusingly

leaden palm Apr 16, 2025, 11:46 PM

#

man how late is later...

keen beacon Apr 16, 2025, 11:47 PM

#

leaden palm man how late is later...

u can select the models in direct chat if u want btw

#

both o3 andn o4 mini

leaden palm Apr 16, 2025, 11:47 PM

#

o cool

#

time to give it my (only?) eval question

keen beacon Apr 16, 2025, 11:48 PM

#

keen beacon its fine im explaining it confusingly

the mechanism if it's not visual i guess is because they exclude thinking blocks, i guess when u reach a certainn amount of turns the model's tendency to start with a thinking block isnt there since all the past and numerous amount of turns had no thinking blocks

leaden palm Apr 16, 2025, 11:54 PM

#

leaden palm time to give it my (only?) eval question

o3 was... alright? the style was bad:

manually wrapped its text (presumably trained on too much TeX)
disregarded "mtok", took it to mean ktok, messing up the calculations
called phi 4 multimodal "φ‑4‑mm"
but once i got past that, i did enjoy its discussion of the math and attention and economics

#

meanwhile o4 mini just spreading misinformation

keen beacon Apr 16, 2025, 11:55 PM

#

did u set it to 0 temp btw

leaden palm Apr 16, 2025, 11:55 PM

#

i left it at default 0.7

keen beacon Apr 16, 2025, 11:55 PM

#

yea try using 0 temp

#

it seems its very sensitive

leaden palm Apr 16, 2025, 11:56 PM

#

i'll try again but i'd be surprised if that improved it more than regenerating would

#

what the hell is this 😭

#

okay it was definitely better

#

i still like the style of the o series

elder rapids Apr 16, 2025, 11:59 PM

#

they're all o series

#

😭 🙏

leaden palm Apr 16, 2025, 11:59 PM

#

4.1:

leaden palm Apr 17, 2025, 12:00 AM

#

leaden palm okay it was definitely better

ehh actually it lost some of the math, so it might just be regeneration quality

keen beacon Apr 17, 2025, 12:05 AM

#

it depends on how its ingested

#

but 2.5 pro i think is probably better

elder rapids Apr 17, 2025, 12:06 AM

#

2.5 is absolutely better

#

it's not close

#

the Gemini models are basically made for that

#

tbh

#

as far as my testing goes

#

uploading anything to Gemini is far superior, it's audio understanding, it's video understanding

#

nah it goes for all the models

#

just try 2.5 pro with this stuff

#

it's seriously crazy

#

it's basically perfect

#

ye, but tbh its just different approaches

#

I like reading other models thinking

#

ye

#

can't say that's much of a problem tho cuz that's what benefits Gemini more in general

glass arch Apr 17, 2025, 12:10 AM

#

elder rapids keep talking and nobody explodes?

I just played ktane with it

plain zinc Apr 17, 2025, 12:11 AM

#

How do you like o3? o4-mini high?

elder rapids Apr 17, 2025, 12:11 AM

#

o1 pro is more of a academic

plain zinc Apr 17, 2025, 12:11 AM

#

Tell me, please.

glass arch Apr 17, 2025, 12:11 AM

#

I am gonna upload my video of this session to #ai-creations when youtube finishes processing it

elder rapids Apr 17, 2025, 12:11 AM

#

yep, I still like 2.5 pro the most tho

#

it's just the way I prompt it

#

and how it allows itself to be prompted

#

that's just super unique to me tbh

#

so the feel that I like in other models, I can definitely replicate in 2.5 pro

#

and then boom

#

no need for them

drifting thorn Apr 17, 2025, 12:33 AM

#

Btw is the o3 using tools in the arena?

glass arch Apr 17, 2025, 12:45 AM

#

ok I shared my video

balmy mist Apr 17, 2025, 1:19 AM

#

drifting thorn Btw is the o3 using tools in the arena?

nahh

drifting thorn Apr 17, 2025, 1:20 AM

#

Sad

#

Waiting for Deepseek R2 that integrates MCP tools in chain of thought

plain zinc Apr 17, 2025, 1:26 AM

#

Dragontail, riverhollow, and shadebook are gone.

#

Shadebrook

balmy mist Apr 17, 2025, 1:38 AM

#

lol

glass arch Apr 17, 2025, 1:39 AM

#

I like how chatgpt just takes the verbal abuse I give it

balmy mist Apr 17, 2025, 1:40 AM

#

you better becareful abusing chatgpt

#

it got memory now

glass arch Apr 17, 2025, 1:42 AM

#

balmy mist it got memory now

I lobotomized that part

late path Apr 17, 2025, 1:43 AM

#

does anyone else feel like o3's chatting style is strangely similar to r1...?

elder rapids Apr 17, 2025, 1:47 AM

#

plain zinc Dragontail, riverhollow, and shadebook are gone.

u sure?

#

that didn't seem to be the case

#

so it must've happened within the last hours

#

by the way, I'm not sure about o4 mini being cheaper in practice

#

seems to be p4p not on 2.5 pros level

#

same goes for o3

glass arch Apr 17, 2025, 1:50 AM

#

when are we gonna be allowed to use 4.1 in the app

elder rapids Apr 17, 2025, 1:50 AM

#

you're not I think

glass arch Apr 17, 2025, 1:50 AM

#

!!!!

#

bruh

keen beacon Apr 17, 2025, 1:51 AM

#

glass arch when are we gonna be allowed to use 4.1 in the app

Chatgpt?

elder rapids Apr 17, 2025, 1:51 AM

#

4o seems to be the absolute replacement

glass arch Apr 17, 2025, 1:51 AM

#

keen beacon Chatgpt?

yes

keen beacon Apr 17, 2025, 1:51 AM

#

It's already live under 4o lol

elder rapids Apr 17, 2025, 1:51 AM

#

keen beacon It's already live under 4o lol

wym?

#

in the chatgpt app?

keen beacon Apr 17, 2025, 1:52 AM

#

They're continuing chatgpt 4o latest which is the chatgpt 4o model on the app, which is on the same base model as 4.1

thorny drum Apr 17, 2025, 1:52 AM

#

4o != 4.1

keen beacon Apr 17, 2025, 1:52 AM

#

It's mega confusing but yeah

#

Chatgpt 4o latest uses the 4.1 base model/has had it for a while before 4.1 released

#

The new base model where it was continued pretrained and has a newer cut off

glass arch Apr 17, 2025, 1:54 AM

#

that is so stupid!????

keen beacon Apr 17, 2025, 1:54 AM

#

glass arch that is so stupid!????

Openai tries to make naming as confusing as possible

#

For some reason lmao

elder rapids Apr 17, 2025, 1:55 AM

#

glass arch that is so stupid!????

I mean not rly, as long as you consider -o as a variation

#

which in this case, it would still be 4.1 → 4 → 4 omni

keen beacon Apr 17, 2025, 1:55 AM

#

There are small differences though the chatgpt 4o tune is slightly different though even if it is on the 4.1 base model, it is more human preference aligned. But model performance is largely the same

elder rapids Apr 17, 2025, 1:55 AM

#

since 4o previously probably still used gpt 4, as a distill

#

so it would be 4.0 Omni

#

now it could be 4.1 Omni and therefore 4o

glass arch Apr 17, 2025, 1:56 AM

#

can't wait for 4.5o

elder rapids Apr 17, 2025, 1:56 AM

#

could probably still be called 4o

glass arch Apr 17, 2025, 1:56 AM

#

will it be better at language?

plain zinc Apr 17, 2025, 2:00 AM

#

elder rapids u sure?

Yes

#

I don't see them anymore.

#

I only come across 2.5 pro, 2.0 flash thinking

elder rapids Apr 17, 2025, 2:01 AM

#

could be just unluckiness

plain zinc Apr 17, 2025, 2:02 AM

#

elder rapids could be just unluckiness

Take a look for yourself too

#

You won't get them either.

#

It's strange that Legit didn't reveal anything about their loss.

keen beacon Apr 17, 2025, 2:04 AM

#

plain zinc It's strange that Legit didn't reveal anything about their loss.

dont really know who legit is but i guess he scrapes web dev arena

#

in the metadata there its still live

#

dragontail/etc

plain zinc Apr 17, 2025, 2:04 AM

#

keen beacon dont really know who legit is but i guess he scrapes web dev arena

He's one of the main leaks of some cool stuff.

plain zinc Apr 17, 2025, 2:05 AM

#

keen beacon dragontail/etc

Did they turn it off for an update?

#

Did you see him just now?

#

Or recently?

keen beacon Apr 17, 2025, 2:06 AM

#

plain zinc Did they turn it off for an update?

idk if its in the arena or not, but its still in the metadata of web dev arena:

keen beacon Apr 17, 2025, 2:06 AM

#

plain zinc It's strange that Legit didn't reveal anything about their loss.

probably why his bot/etc didnt catch it

#

if it was removed off the main arena

elder rapids Apr 17, 2025, 2:08 AM

#

plain zinc It's strange that Legit didn't reveal anything about their loss.

just got River hollow lol

keen beacon Apr 17, 2025, 2:08 AM

#

plain zinc Did you see him just now?

i just got dragontail in webdev arena btw: (so the metadata for web dev arena is right at least for web dev arena)

elder rapids Apr 17, 2025, 2:08 AM

#

plain zinc Apr 17, 2025, 2:10 AM

#

elder rapids just got River hollow lol

._.

#

It can't be that I'm unlucky._.

#

I know for a fact that there were none.

#

What the...

elder rapids Apr 17, 2025, 2:11 AM

#

sometimes I get an extremely long period of not getting certain models

#

like, I haven't even gotten o4 yet

#

or o3

plain zinc Apr 17, 2025, 2:12 AM

#

elder rapids sometimes I get an extremely long period of not getting certain models

Is that what this is about?

#

Why is this happening?

#

Update?

elder rapids Apr 17, 2025, 2:12 AM

#

seems like a balancing thing or smth

#

for the arena

#

could be that they actually did take them down

#

just for a little bit

#

for an update

#

we'll never know

plain zinc Apr 17, 2025, 2:12 AM

#

elder rapids like, I haven't even gotten o4 yet

and I get it._.

elder rapids Apr 17, 2025, 2:13 AM

#

oh damn I just got o3 vs 2.5 pro

#

crazy

plain zinc Apr 17, 2025, 2:13 AM

#

...

#

What...

elder rapids Apr 17, 2025, 2:13 AM

#

😭

plain zinc Apr 17, 2025, 2:13 AM

#

10 new models have just been added.

#

It was 96 and became 106

elder rapids Apr 17, 2025, 2:14 AM

#

kinda excited for 2.5 flash

keen beacon Apr 17, 2025, 2:14 AM

#

it was supposed to come out a while back

#

the model was added on the sdk

elder rapids Apr 17, 2025, 2:14 AM

#

ye

keen beacon Apr 17, 2025, 2:14 AM

#

0409 or smthing

elder rapids Apr 17, 2025, 2:14 AM

#

that wasn't too long ago tho

keen beacon Apr 17, 2025, 2:15 AM

#

with thinking budget

elder rapids Apr 17, 2025, 2:15 AM

#

keen beacon 0409 or smthing

oh fr?

#

damn I hope it's cheap and smart

#

that would be crazy

keen beacon Apr 17, 2025, 2:16 AM

#

elder rapids oh fr?

yea despite the sdk name it didnt come on the 9th

#

they delayed it for some reason

elder rapids Apr 17, 2025, 2:16 AM

#

if it's a jump from 2.0 thinking, it's gonna be good

keen beacon Apr 17, 2025, 2:16 AM

#

oh 2.5 flash is gonna be good

elder rapids Apr 17, 2025, 2:16 AM

#

although it was trash at coding

#

it was for some reason INSANE at some things

#

tbh bro

#

I keep thinking back

#

the old times

#

where athene 70b was the hidden gem

#

that pure RLHF model was sick when you prompted it for the right things

elder rapids Apr 17, 2025, 2:18 AM

#

keen beacon oh 2.5 flash is gonna be good

I hope the long context is somewhat maintained

#

if it's anything like 2.5 pro too

#

it's gonna go crazy

hardy pecan Apr 17, 2025, 2:23 AM

#

https://simple-bench.com/

SimpleBench

#

O3 beat out 2.5 pro

plain zinc Apr 17, 2025, 2:25 AM

#

Yay!

#

And more...

#

HE's become insanely slow💀

elder rapids Apr 17, 2025, 2:25 AM

#

hardy pecan O3 beat out 2.5 pro

damn I was expecting more

plain zinc Apr 17, 2025, 2:25 AM

#

I had to wait 3-5 minutes for the code generation to finish.

elder rapids Apr 17, 2025, 2:25 AM

#

53% for high

#

damn I said o4 was gonna get low too

#

and o3 was gonna get 50+

hardy pecan Apr 17, 2025, 2:26 AM

#

Hey it shows how long 2.5 pro is too, since everyone is impressed by Geminis offering. 50s on that benchmark is great, and actually SOTA

elder rapids Apr 17, 2025, 2:26 AM

#

but I thought it was gonna be higher than 2.5 💔

hardy pecan Apr 17, 2025, 2:27 AM

#

yeah the smaller mini models tend to suck at that benchmark

zinc ore Apr 17, 2025, 2:27 AM

#

elder rapids 53% for high

Thing is how many tokens did 2.5 use?

elder rapids Apr 17, 2025, 2:28 AM

#

there's no way for them to have optimized that

#

so it's gonna be baseline what you see

#

rather than a high or medium variant like o1, o3, and o4

brittle tiger Apr 17, 2025, 2:29 AM

#

https://x.com/OfficialLoganK/status/1912690850292768951?t=odVru2H-4Tpr31466k_JcA&s=19

Logan Kilpatrick (@OfficialLoganK) on X

Gemini 🤔

zinc ore Apr 17, 2025, 2:29 AM

#

Exactly

brittle tiger Apr 17, 2025, 2:29 AM

#

Flash thinking tomorrow

elder rapids Apr 17, 2025, 2:29 AM

#

it's the same reason why openAI models seem to dominate benchmarks

#

but then not do well in practice

plain zinc Apr 17, 2025, 2:30 AM

#

Yessss

#

I already thought that Google wouldn't show anything this week and would give OpenAI all week.

zinc ore Apr 17, 2025, 2:30 AM

#

I just wish I could more readily compare the tokens/thinking time being used

plain zinc Apr 17, 2025, 2:31 AM

#

But they decided to go all in.🔥

#

Bold and fast

zinc ore Apr 17, 2025, 2:31 AM

#

Makes it harder to tell if these little 3% gaps are meaningful or exaggerated

#

So I don't like there's just a persistent mystery now

elder rapids Apr 17, 2025, 2:31 AM

#

zinc ore I just wish I could more readily compare the tokens/thinking time being used

ye

thorny drum Apr 17, 2025, 2:32 AM

#

wdyt flash thinking means given pro is already a thinking model

elder rapids Apr 17, 2025, 2:32 AM

#

but that's why it's still pretty much clear 2.5 pro is leading

thorny drum Apr 17, 2025, 2:32 AM

#

or is this just flash

zinc ore Apr 17, 2025, 2:32 AM

#

2.5 flash is a hybrid/unified model, so thinking model yes, once it releases

elder rapids Apr 17, 2025, 2:33 AM

#

regardless of narrow task advantancment like o4 mini, or the general ability of o3

#

it seems like it's inherently a weaker model

#

due to the fact it literally has to think more

#

to achieve similar output

brittle tiger Apr 17, 2025, 2:33 AM

#

I was really impressed by o3 but mostly for the tool use and UI. I think 2.5 Pro which was rushed out the door can match it and probably remains a dev fave. Think 2.5 flash will be better than o4-mini but just a hunch

balmy mist Apr 17, 2025, 2:33 AM

#

https://x.com/OfficialLoganK/status/1912690850292768951

Logan Kilpatrick (@OfficialLoganK) on X

Gemini 🤔

#

no way

elder rapids Apr 17, 2025, 2:33 AM

#

ye they've been doing this

#

for a while

#

it's a nice hint

zinc ore Apr 17, 2025, 2:34 AM

#

Good guess is 2.5 flash, I think they might drop a coding model too

elder rapids Apr 17, 2025, 2:34 AM

#

could, that would be pretty cool

#

that's what the discussion is about

balmy mist Apr 17, 2025, 2:34 AM

#

i hope its nightwhisper

elder rapids Apr 17, 2025, 2:34 AM

#

the fact it's so little above 2.5 pro while likely thinking so much more

#

going for all the other benchmarks as well

elder rapids Apr 17, 2025, 2:35 AM

#

elder rapids by the way, I'm not sure about o4 mini being cheaper in practice

for example, this benchmark

ivory schooner Apr 17, 2025, 2:35 AM

#

我正在学会等待Behemoth ......

elder rapids Apr 17, 2025, 2:35 AM

#

o4 mini costs more than 2.5 pro

keen beacon Apr 17, 2025, 2:35 AM

#

brittle tiger https://x.com/OfficialLoganK/status/1912690850292768951?t=odVru2H-4Tpr31466k_JcA...

release tomorrow then

keen beacon Apr 17, 2025, 2:35 AM

#

balmy mist https://x.com/OfficialLoganK/status/1912690850292768951

LOL

elder rapids Apr 17, 2025, 2:36 AM

#

keen beacon LOL

sorry I already predicted other things about these models today

#

I deserve the glaze

#

not you

keen beacon Apr 17, 2025, 2:36 AM

#

buddy

#

who are you

elder rapids Apr 17, 2025, 2:36 AM

#

2/2 you're just 1/1

keen beacon Apr 17, 2025, 2:36 AM

#

do you have internal openai model access and connections?

#

no

#

pipe DOWN

#

🙄

elder rapids Apr 17, 2025, 2:37 AM

#

keen beacon do you have internal openai model access and connections?

I take a look inside without permission

#

I'm just like that

keen beacon Apr 17, 2025, 2:37 AM

#

elder rapids I take a look inside without permission

🤨

leaden palm Apr 17, 2025, 2:37 AM

#

keen beacon release tomorrow then

its just gonna be 2.5 flash...

elder rapids Apr 17, 2025, 2:37 AM

#

sorry you need to be granted access

leaden palm Apr 17, 2025, 2:37 AM

#

leaden palm its just gonna be 2.5 flash...

(although i'd be happy to be surprised)

keen beacon Apr 17, 2025, 2:37 AM

#

leaden palm its just gonna be 2.5 flash...

2.5 flash is of course the minimum but there may be another model update

elder rapids Apr 17, 2025, 2:38 AM

#

keen beacon 2.5 flash is of course the minimum but there may be another model update

ye

zinc ore Apr 17, 2025, 2:38 AM

#

I'm guessing they feel confident about 2.5 flash

leaden palm Apr 17, 2025, 2:38 AM

#

maybe

elder rapids Apr 17, 2025, 2:38 AM

#

that would be cool

balmy mist Apr 17, 2025, 2:38 AM

#

keen beacon LOL

you really called it lol

leaden palm Apr 17, 2025, 2:38 AM

#

seems openai is the only one whos got tool use in thinking though which is EXTREMELY weird

#

like it's just a ui level thing

#

how you call your own inference api

elder rapids Apr 17, 2025, 2:38 AM

#

exactly

#

it becomes so much less of a jump

keen beacon Apr 17, 2025, 2:38 AM

#

leaden palm maybe

it does feel kinda weak on their part to release 2.5 flash, a worse model, when openai have just mostly beat 2.5 pro

elder rapids Apr 17, 2025, 2:38 AM

#

while 2.5 is base

keen beacon Apr 17, 2025, 2:38 AM

#

i think they'll wanna flex

balmy mist Apr 17, 2025, 2:39 AM

#

elder rapids I deserve the glaze

wait you have connects?

leaden palm Apr 17, 2025, 2:39 AM

#

yes
it's just whether or not you start another <think> after a tool message

elder rapids Apr 17, 2025, 2:39 AM

#

keen beacon it does feel kinda weak on their part to release 2.5 flash, a worse model, when ...

exactly, so they might just release a coding model

zinc ore Apr 17, 2025, 2:39 AM

#

This is hilarious because they're just benching the same model over and over again, so naturally it'll get similar scores, just incrementally worse/better based on thinking time

leaden palm Apr 17, 2025, 2:39 AM

#

what do you mean you don't know about that

elder rapids Apr 17, 2025, 2:39 AM

#

balmy mist wait you have connects?

@keen beacon is my connect

brittle tiger Apr 17, 2025, 2:39 AM

#

keen beacon i think they'll wanna flex

If it clearly bests o4-mini they would def like that

balmy mist Apr 17, 2025, 2:40 AM

#

google about to do openai dirty lol

keen beacon Apr 17, 2025, 2:40 AM

#

brittle tiger If it clearly bests o4-mini they would def like that

yeah but i kinda doubt it

balmy mist Apr 17, 2025, 2:40 AM

#

y they cant let openai have one week

keen beacon Apr 17, 2025, 2:40 AM

#

o4 mini is very strong

elder rapids Apr 17, 2025, 2:40 AM

#

keen beacon yeah but i kinda doubt it

ye

late path Apr 17, 2025, 2:40 AM

#

this is our frontier model💀

keen beacon Apr 17, 2025, 2:41 AM

#

in most cases it matches o3 and in some it beats it

elder rapids Apr 17, 2025, 2:41 AM

#

but if 2.5 flash is narrow

keen beacon Apr 17, 2025, 2:41 AM

#

i dont think flash has the chops for that

elder rapids Apr 17, 2025, 2:41 AM

#

and anywhere near o4 mini

#

then o4 mini has to be cooked

#

the context + the price

#

at reasonable output

balmy mist Apr 17, 2025, 2:41 AM

#

it would have to be nightwhisper right?

brittle tiger Apr 17, 2025, 2:41 AM

#

late path this is our frontier model💀

I really dislike tests like this and token issues

balmy mist Apr 17, 2025, 2:41 AM

#

they were waiting on this

#

like they probably been tesing o3 and o4 mini all day

topaz peak Apr 17, 2025, 2:42 AM

#

really disappointing that o3 fails the bucket test damn

balmy mist Apr 17, 2025, 2:42 AM

#

google is funny

#

i feel bad for open ai

#

they might have to releas o3 pro next week

elder rapids Apr 17, 2025, 2:42 AM

#

balmy mist they might have to releas o3 pro next week

alr chill

#

o3 pro is gonna be cracked, although I don't think it's going to be straight up better than the one we saw during shipmas

#

the direction and bases have changed

zinc ore Apr 17, 2025, 2:44 AM

#

They might release an updated 2.5 pro as well, think it's been about 3 weeks since it dropped and last year they were doing updates on around 3 week gaps sometimes. Usually 4 at the latest.

elder rapids Apr 17, 2025, 2:45 AM

#

maybe

#

but now that it's actually confirmed there's going to be SOMETHING

#

yep

#

this is a crazy advantage

#

the fact they said it's trained WITH the tools

#

as it's reasoning

#

ngl

#

I didn't even know HLE 2.5 pro was 18%

#

or I didn't pay attention to that

late path Apr 17, 2025, 2:47 AM

#

o4mini's hallucination rate is unbelievably high. it scores worse than o1mini on OpenAI's own PersonQA benchmark

elder rapids Apr 17, 2025, 2:47 AM

#

I wanna know how good 2.5 pro DR is with that

elder rapids Apr 17, 2025, 2:48 AM

#

late path o4mini's hallucination rate is unbelievably high. it scores worse than o1mini on...

I do wanna point out tho, independent from benchmarks

#

it's smart

#

but it has unbelievable confidence in going in one route

leaden palm Apr 17, 2025, 2:49 AM

#

late path o4mini's hallucination rate is unbelievably high. it scores worse than o1mini on...

do you have the data? can't find it in https://openai.com/index/introducing-o3-and-o4-mini/, https://openai.com/index/o3-o4-mini-system-card/, or https://openai.com/index/introducing-simpleqa/

elder rapids Apr 17, 2025, 2:49 AM

#

o3 and 2.5 are very creative in their self reflectiveness, but o4 mini just really goes for it

#

super straightforward too

late path Apr 17, 2025, 2:50 AM

#

leaden palm do you have the data? can't find it in https://openai.com/index/introducing-o3-a...

leaden palm Apr 17, 2025, 2:50 AM

#

are you saying personqa is a subset of simpleqa

#

i dont think it is

plain zinc Apr 17, 2025, 2:51 AM

#

Nightwhisper is coming out today!

leaden palm Apr 17, 2025, 2:51 AM

#

ok its personqa

#

serves the same purpose i guess

plain zinc Apr 17, 2025, 2:51 AM

#

dragontail cannot exit because it is still being tested

#

This may be the 2.5 flash model that will be released next week only.

brittle tiger Apr 17, 2025, 2:52 AM

#

For ppl saying o4-mini is cheaper than 2.5 pro

alpine coral Apr 17, 2025, 2:53 AM

#

keen beacon There are small differences though the chatgpt 4o tune is slightly different tho...

yeah exactly.. i really don't think this is widely appreciated ha

zinc ore Apr 17, 2025, 2:53 AM

#

Wow, nearly 20x price for o3 high

alpine coral Apr 17, 2025, 2:55 AM

#

plain zinc This may be the 2.5 flash model that will be released next week only.

that would be v impressive if that's the case.. like kinda insane tbh.. dragontail consistently performs comparably to 2.5 Pro, sometimes does even better (in my experience)

plain zinc Apr 17, 2025, 2:56 AM

#

alpine coral that would be v impressive if that's the case.. like kinda insane tbh.. dragonta...

Yes, I agree. Sometimes it's better.

elder rapids Apr 17, 2025, 2:56 AM

#

brittle tiger For ppl saying o4-mini is cheaper than 2.5 pro

ye I sent this earlier

plain zinc Apr 17, 2025, 2:58 AM

#

#

Okay. This is 2.5 Flash

#

Not 2.5 pro modified version 🥲

elder rapids Apr 17, 2025, 2:59 AM

#

too early to say that lol

#

since it could be

#

Gemini 2.5 flash preview

Gemini 2.5 pro 0417

balmy mist Apr 17, 2025, 3:00 AM

#

yeah w/ the agentic reasoning built in

plain zinc Apr 17, 2025, 3:00 AM

#

Is 2.5 flash really a nightwhisper._.

elder rapids Apr 17, 2025, 3:00 AM

#

plain zinc Is 2.5 flash really a nightwhisper._.

that would be insane

#

but honestly could make sense

balmy mist Apr 17, 2025, 3:00 AM

#

then openai is cooked

elder rapids Apr 17, 2025, 3:00 AM

#

if it really is

#

yeah

#

that would just be insane

balmy mist Apr 17, 2025, 3:00 AM

#

i feel bad for openai again

plain zinc Apr 17, 2025, 3:00 AM

#

elder rapids that would be insane

2.5 flash with the maximum level of reasoning

elder rapids Apr 17, 2025, 3:00 AM

#

like I would be in disbelief

plain zinc Apr 17, 2025, 3:01 AM

#

2.5-flash-high

balmy mist Apr 17, 2025, 3:01 AM

#

lets do a go fund me for openai

elder rapids Apr 17, 2025, 3:01 AM

#

plain zinc 2.5-flash-high

ye

#

but they all think for similar amount of times

#

they're capped

#

by lmsys

#

so whether it's thinking really fast and a ton, for the same amount of time

#

we'd never know

#

we don't even know if night whisper could even be flash 2.5

#

it could be those other checkpoints, why would the performance be so different

#

if not for a Gemini coder, 2.5 pro checkpoint, and then 2.5 flash checkpoint

balmy mist Apr 17, 2025, 3:04 AM

#

do you work for openai?

#

they should hire you

#

how can google get you on their side?

#

notebooklm>

elder rapids Apr 17, 2025, 3:09 AM

#

I see your point, but you probably think the image zoom feature that o3 uses is an example, but you're just wrong lmao, notebook llm? Google AI search (as in bard, from way before), native image gen flash? long context?

#

wym? this is exactly THE step forward these companies are trying to make, the tool usage, all that stuff basically came from bard lol

#

without search, hallucinations become a major problem again

#

also not to mention, project Astra

#

and still, the other things I mentioned

alpine coral Apr 17, 2025, 3:14 AM

#

tbf they pioneered and open sourced the transformer architecture in 2017 - everyone else since (oai included) is arguably innovating on top of the status quo they established

elder rapids Apr 17, 2025, 3:16 AM

#

even still tho

#

Google had been developing veo before open AI

#

had sora

#

they just announced it after

thorny drum Apr 17, 2025, 3:16 AM

#

i really wonder who writes llm system prompts

#

ifl they have fun with them

alpine coral Apr 17, 2025, 3:17 AM

#

yeah tbh i knew that's what you meant (was just thowing it in there - but in a way, kinda beside your point / taken for granted i know)

thorny drum Apr 17, 2025, 3:17 AM

#

do you think they AB tested calling it 'yap score'

brittle tiger Apr 17, 2025, 3:17 AM

#

i think you might be right about oai consumer stickiness if goog doesn't beat them by wide margin but they def innovate. invented self driving cars, solved protein folding, 2 nobel prizes in AI past year

elder rapids Apr 17, 2025, 3:18 AM

#

what about context caching, Gemma, native audio understanding, learnLM, AI overviews

#

I honestly don't think that argument has any basis

#

even benefit of the doubt

balmy mist Apr 17, 2025, 3:19 AM

#

@deep adder did you like 2.5 pro?

elder rapids Apr 17, 2025, 3:19 AM

#

Google deepmind genuinely seem like the actual innovators, in and outside your direct chat bot interface

#

even mobile integration, Gemini has

#

how tho?

#

I would literally NEVER use it's o3s tool usage

#

if not for specific coding things

#

lmao

plain zinc Apr 17, 2025, 3:20 AM

#

🤨

elder rapids Apr 17, 2025, 3:20 AM

#

and btw

plain zinc Apr 17, 2025, 3:20 AM

#

Don't like the model because it's a chatbot?

#

what?

elder rapids Apr 17, 2025, 3:20 AM

#

no

#

read context

#

he's saying that openAI are the innovators

#

while naming things Google created

balmy mist Apr 17, 2025, 3:21 AM

#

elder rapids Apr 17, 2025, 3:21 AM

#

but it's not "actually" useful though?

#

you're saying it is

#

but that can't be the case if it's coding based

plain zinc Apr 17, 2025, 3:22 AM

#

And completely replacing human labor

elder rapids Apr 17, 2025, 3:22 AM

#

you can say native image gen

keen fulcrum Apr 17, 2025, 3:22 AM

#

balmy mist

XAI, Grok 4 will crush

plain zinc Apr 17, 2025, 3:22 AM

#

No, I don't want that kind of future.

elder rapids Apr 17, 2025, 3:22 AM

#

but Google did that first

#

you can say vision

#

but Google did that first

#

you can say agents

plain zinc Apr 17, 2025, 3:22 AM

#

plain zinc No, I don't want that kind of future.

I'm just for the AI to be smart enough to be an assistant and no more.

elder rapids Apr 17, 2025, 3:22 AM

#

but Claude and gemini did that first

#

the only thing you can say, is "reasoning"

brittle tiger Apr 17, 2025, 3:23 AM

#

i think talking to dolphins will be useful

elder rapids Apr 17, 2025, 3:23 AM

#

but did Google not release a paper

#

explaining exactly what openAI did

balmy mist Apr 17, 2025, 3:23 AM

#

who are the mods of this server? it would be funny to add tags to people name for if they support a specific company lol

elder rapids Apr 17, 2025, 3:23 AM

#

like, right after?

elder rapids Apr 17, 2025, 3:23 AM

#

brittle tiger i think talking to dolphins will be useful

ye I saw this too

#

😭 🙏

balmy mist Apr 17, 2025, 3:24 AM

#

brittle tiger i think talking to dolphins will be useful

thats dope

elder rapids Apr 17, 2025, 3:24 AM

#

why'd they do that tho

balmy mist Apr 17, 2025, 3:24 AM

#

u dont wanna talk to dophins?

elder rapids Apr 17, 2025, 3:25 AM

#

nah deepmind is cool, they just do some stuff

#

btw did we all just forget about genie

#

ye but openAI wouldn't be where it's at without the transformer architecture

#

they improved it for the same reasons Google knew

#

that's why Google was able to hop on board so early

#

ye

#

it's kinda crazy how one company

#

kinda sparked it all

#

or two ig

#

or 3

#

there's prolly more

#

but same idea

#

crazy how a couple companies are causing the craziest era

#

an invention probably greater than fire

#

ye but I'm pretty sure that would've been created regardless

#

things like that were bound to be created soon

#

not sure about large language models tho

#

cuz its actually pretty specific

#

like who knew bro

#

ye, but that seems more like the problem of antecedent improbabilities

keen fulcrum Apr 17, 2025, 3:32 AM

#

Hi, here is an insane offer
https://www.lennysnewsletter.com/p/an-unbelievable-offer-now-get-one
For $240/year you get a 1-year Pro subscription of below items.

Bolt
Cursor
Lovable
Replit
v0
Linear
Notion
Perplexity Pro
Superhuman
Granola

An unbelievable offer: Now get one free year of Cursor, v0, Replit,...

Yes, you read that right.

#

It isn't

ember rapids Apr 17, 2025, 3:40 AM

#

Logan tweeted the bat signal

#

Looks like we might be getting something tomorrow

patent bane Apr 17, 2025, 3:41 AM

#

ember rapids Logan tweeted the bat signal

sybau

#

"iT IsN'T"

ember rapids Apr 17, 2025, 3:41 AM

#

drifting thorn Apr 17, 2025, 3:48 AM

#

Okay, just got the idea of this

#

It's basically dealing with the routing issue of having 1 million experts

#

but there's no companies(we've known now) that are using that much experts in a model

#

Deepseek has 256+1

#

Llama has 128/16 only

#

And other open-sourced models are dense models

fleet lintel Apr 17, 2025, 4:03 AM

#

o3-high is extermely expensive

elder solar Apr 17, 2025, 4:53 AM

#

i dont see gpt 4.1 on lm arena

leaden palm Apr 17, 2025, 4:57 AM

#

elder solar i dont see gpt 4.1 on lm arena

in the leaderboard or in the arena?

elder solar Apr 17, 2025, 4:58 AM

#

leaderboard

leaden palm Apr 17, 2025, 4:58 AM

#

patience axel

plain zinc Apr 17, 2025, 5:00 AM

#

elder solar i dont see gpt 4.1 on lm arena

They're probably collecting feedback from Gemini 2.5 Flash right now.

elder solar Apr 17, 2025, 5:01 AM

#

flash??

#

theres only pro

leaden palm Apr 17, 2025, 5:02 AM

#

elder solar flash??

you missed out on cloud next

#

(and the open secret that google tests many models and tunes on lm arena anonymously)

elder solar Apr 17, 2025, 5:04 AM

#

but gemini 2.5 flash is not out

leaden palm Apr 17, 2025, 5:04 AM

#

elder solar but gemini 2.5 flash is not out

????

#

so what

#

it's being tested on lm arena anyway

keen beacon Apr 17, 2025, 5:36 AM

#

balmy mist

screw openai

#

dont forget they offed their first whistleblower

fleet lintel Apr 17, 2025, 6:00 AM

#

elder solar but gemini 2.5 flash is not out

Coming out today... almost 100% sure from all the changelogs etc

ornate stump Apr 17, 2025, 6:01 AM

#

fleet lintel Coming out today... almost 100% sure from all the changelogs etc

but flash means just faster and cheaper, right? So why did it seem like it was actually a bit better in Arena than pro?

oblique flint Apr 17, 2025, 6:02 AM

#

if flash 2.5 gets anywhere near o4 mini performance at flash 2.0 prices I'll be very happy

fleet lintel Apr 17, 2025, 6:03 AM

#

ornate stump but flash means just faster and cheaper, right? So why did it seem like it was a...

who said it's better in Arena?

#

It's actually quite meh compared to bigger models

#

but given the cost, it's amazing

fleet lintel Apr 17, 2025, 6:04 AM

#

oblique flint if flash 2.5 gets anywhere near o4 mini performance at flash 2.0 prices I'll be ...

o4 mini prices == 2.5 Pro prices. Despite name of "mini", it is equivalent for Pro models of Google

fleet lintel Apr 17, 2025, 6:04 AM

#

keen beacon dont forget they offed their first whistleblower

Team Anthropic!! wohoo

ornate stump Apr 17, 2025, 6:04 AM

#

fleet lintel who said it's better in Arena?

idk I don't follow too much, but it was Dragontail or Nightwhisper or something like that, right? I've tried it a couple of times and it seemed like it had better structured output.

alpine coral Apr 17, 2025, 6:05 AM

#

yeah if it's dragontail that is being referred to as 2.5 Flash, it's def not meh

fleet lintel Apr 17, 2025, 6:05 AM

#

ornate stump idk I don't follow too much, but it was Dragontail or Nightwhisper or something ...

nah... Dragontail or NW are definietly not flash. If they are then Google has already won the game. I am ready to bet that they are not flash

alpine coral Apr 17, 2025, 6:06 AM

#

which google anon model/s are you referring to then?

#

omg - could today be the day

#

google is finally unveiling euraka chatbot!!

fleet lintel Apr 17, 2025, 6:06 AM

#

Riverholllow

alpine coral Apr 17, 2025, 6:06 AM

#

fleet lintel Riverholllow

oh yeah i can see that

#

in terms of performance

fleet lintel Apr 17, 2025, 6:06 AM

#

or shadebrook.. They both behaved like flash

ornate stump Apr 17, 2025, 6:07 AM

#

fleet lintel nah... Dragontail or NW are definietly not flash. If they are then Google has a...

Ah, right that's what was confusing me.

alpine coral Apr 17, 2025, 6:07 AM

#

fleet lintel or shadebrook.. They both behaved like flash

yeah agreed. neither of them get close to 2.5 pro

alpine coral Apr 17, 2025, 6:10 AM

#

alpine coral google is finally unveiling euraka chatbot!!

i swear they've just forgotten about it.. been there for like more than a year now... next level stealth

fleet lintel Apr 17, 2025, 6:11 AM

#

what eureka chatbot does?

keen beacon Apr 17, 2025, 6:21 AM

#

alpine coral i swear they've just forgotten about it.. been there for like more than a year n...

Confuses me why they still host it lol

alpine coral Apr 17, 2025, 6:25 AM

#

so if dragontail/nightwhisper isn't 2.5 Flash (which I think we can safely assume), i feel like there's a good chance that dragontail is 2.5 Pro with reasoning_budget set to high

elder rapids Apr 17, 2025, 6:26 AM

#

keen beacon dont forget they offed their first whistleblower

ngl

#

even granting it, outside of speculation

#

that they'd do this

#

it wouldn't truly make sense

#

considering there's nothing about a whistleblower in openAi's case

#

that could be harmful in any way shape or form

#

or even openAI much as a company

keen beacon Apr 17, 2025, 6:27 AM

#

..?

#

u dont believe companies off people

#

?

elder rapids Apr 17, 2025, 6:28 AM

#

keen beacon u dont believe companies off people

do I believe companies would kill people that would harm potential strategy?

keen beacon Apr 17, 2025, 6:28 AM

#

Thinking openai killed that guy is wild tbh

elder rapids Apr 17, 2025, 6:28 AM

#

yeah, that's inherent to the idea

#

lol

fleet lintel Apr 17, 2025, 6:28 AM

#

i dont think tech companies do it "yet".

elder rapids Apr 17, 2025, 6:28 AM

#

I'm not sure how that's really substantive

keen beacon Apr 17, 2025, 6:28 AM

#

silicon valley is one of the most disgusting places

#

on earth

elder rapids Apr 17, 2025, 6:28 AM

#

considering I said, let's grant the premise that he was killed

#

say he was killed, rationalize it

#

like, in chat

keen beacon Apr 17, 2025, 6:29 AM

#

i dont think it's far fetched to say some companies would go to extreme lengths

elder rapids Apr 17, 2025, 6:29 AM

#

and if you can't, why speculate

keen beacon Apr 17, 2025, 6:29 AM

#

Iirc wasn't he whistle blowing something that was obvious, wasn't it about copyrighted content

elder rapids Apr 17, 2025, 6:29 AM

#

keen beacon Iirc wasn't he whistle blowing something that was obvious, wasn't it about copyr...

ye

hardy violet Apr 17, 2025, 6:29 AM

#

alpine coral so if dragontail/nightwhisper isn't 2.5 Flash (which I think we can safely assum...

Does everyone remember the Ultra series? It was only used once. It's a great name. If a 2.5 Pro-High (something like that?) model actually exists, the name might be worth considering.

elder rapids Apr 17, 2025, 6:29 AM

#

but even if we grant that too

#

that's not even crazy

keen beacon Apr 17, 2025, 6:29 AM

#

its speculation true

elder rapids Apr 17, 2025, 6:29 AM

#

keen beacon its speculation true

nah but fr

#

of course speculation sometimes means something

#

but like, you don't gotta speculate, if on the grounds of accepting that premise, you still can't understand why

#

no dots are connected etc

keen beacon Apr 17, 2025, 6:30 AM

#

#

nah cus ts pmo 😭 🙏🏽

elder rapids Apr 17, 2025, 6:30 AM

#

then why is them killing the whistleblower unique

#

therefore it has to be the conclusion you're looking for

#

ykwim

fleet lintel Apr 17, 2025, 6:31 AM

#

honestly, I think this is not the right place to discuss whether OAI assasssinated someone or not

elder rapids Apr 17, 2025, 6:31 AM

#

fleet lintel honestly, I think this is not the right place to discuss whether OAI assasssinat...

ion think it's serious enough not to

#

just a little comment here and there

#

like now, discussion poof, dismissed

elder rapids Apr 17, 2025, 6:31 AM

#

hardy violet Does everyone remember the Ultra series? It was only used once. It's a great nam...

ye

#

it seemed to be a really really large model tho

#

like gpt 4

#

and that's really the only way they could've competed at that time

#

they limited usage too

keen beacon Apr 17, 2025, 6:32 AM

#

is there a new gpt model dropping

#

feel like i hear about it everytime i open the chat

elder rapids Apr 17, 2025, 6:32 AM

#

and that's surprising for Google

#

or maybe they allocated enough compute

#

during then and now

#

but who knows

#

if they gave 2.5 pro high compute

#

it WOULD dominate

keen beacon Apr 17, 2025, 6:33 AM

#

https://techcrunch.com/2025/04/14/openais-new-gpt-4-1-models-focus-on-coding/

TechCrunch

Kyle Wiggers

OpenAI's new GPT-4.1 AI models focus on coding | TechCrunch

OpenAI has launched a new family of models called GPT-4.1. They focus on coding, and are exclusively available through the company's API.

elder rapids Apr 17, 2025, 6:33 AM

#

keen beacon feel like i hear about it everytime i open the chat

o3 full and o4 mini dropped

keen beacon Apr 17, 2025, 6:33 AM

#

its alr dropped

#

Old news lol

elder rapids Apr 17, 2025, 6:33 AM

#

keen beacon its alr dropped

bro is so behind

fleet lintel Apr 17, 2025, 6:33 AM

#

hardy violet Does everyone remember the Ultra series? It was only used once. It's a great nam...

i think there is no money or much real-world applications of Ultra models. Only benefit I see is to show how far ahead we are to the world (by benchmarks etc). That's why I think Google is not pursing them anymore

elder rapids Apr 17, 2025, 6:33 AM

#

4.1 dropped like a little while ago 😭 🙏

keen beacon Apr 17, 2025, 6:33 AM

#

they just dropped this monday

elder rapids Apr 17, 2025, 6:34 AM

#

censor ts

#

not all of us w*rk

keen beacon Apr 17, 2025, 6:34 AM

#

😭

elder rapids Apr 17, 2025, 6:34 AM

#

keen beacon they just dropped this monday

ye but today

#

they released more models

#

o3 full and o4 mini

fleet lintel Apr 17, 2025, 6:35 AM

#

keen beacon they just dropped this monday

I understand your sentiment. sometimes poeple behave as if 2 hour (or day) old news is like from few years back 🙂

keen beacon Apr 17, 2025, 6:35 AM

#

Space moves very fast

elder rapids Apr 17, 2025, 6:35 AM

#

ngl 4.1 is kinda bad outside of coding

#

has vibes

#

but synthetic asf

keen beacon Apr 17, 2025, 6:36 AM

#

20$ a month

#

and if using free you only get like 30 back and forths

#

vs geminis free trial 1 month repeatable full advanced

fleet lintel Apr 17, 2025, 6:36 AM

#

4.1 should have been just a tweet. but they hyped it like crazy.. I am still mad about it

elder rapids Apr 17, 2025, 6:36 AM

#

keen beacon vs geminis free trial 1 month repeatable full advanced

not even that

#

straight up AI studio

#

give them your data, you get infinite usage

#

I know the data I give them allows them to progress btw

#

it's worth a ton

#

don't glaze me yet tho

keen beacon Apr 17, 2025, 6:37 AM

#

elder rapids not even that

i know

#

the thing about this is

#

skyvern and other things about to get full gemini support

elder rapids Apr 17, 2025, 6:38 AM

#

whatever that means, I agree

#

🙏

keen beacon Apr 17, 2025, 6:38 AM

#

#

https://github.com/Skyvern-AI/

GitHub

Skyvern

Skyvern helps companies automate browser based workflows with AI - Skyvern

elder rapids Apr 17, 2025, 6:38 AM

#

niche territory right off the bat

keen beacon Apr 17, 2025, 6:38 AM

#

elder rapids whatever that means, I agree

its like the only computer vision project ik about

elder rapids Apr 17, 2025, 6:39 AM

#

damn you either live under a rock or you don't

keen beacon Apr 17, 2025, 6:39 AM

#

go look, that will be free? nah

#

https://github.com/browser-use/browser-use

GitHub

GitHub - browser-use/browser-use: Make websites accessible for AI a...

Make websites accessible for AI agents. Contribute to browser-use/browser-use development by creating an account on GitHub.

elder rapids Apr 17, 2025, 6:39 AM

#

yo what if Google releases an agent

#

like, a full on agent

keen beacon Apr 17, 2025, 6:40 AM

#

you can try this but its made by LangChai or some ccp stuff

keen beacon Apr 17, 2025, 6:40 AM

#

elder rapids yo what if Google releases an agent

thats what i want bruh

#

they have screenshare but its bad

alpine coral Apr 17, 2025, 6:40 AM

#

fleet lintel 4.1 should have been just a tweet. but they hyped it like crazy.. I am still mad...

i think it was worthy of more than just a tweet.. hopefully one day it'll all be a lot clearer with hindsight / disclosures, and someone will create like an org chart or family tree, with a time axis.. it's hard to conceptualised / understand the constellation of models they've got atm, and how they all (or most) in one way or another kind of interrelate

keen beacon Apr 17, 2025, 6:40 AM

#

sends like past 5 second worth of clip for ai to understand

alpine coral Apr 17, 2025, 6:40 AM

#

we don't just want thinking models anyway imo

elder rapids Apr 17, 2025, 6:40 AM

#

keen beacon they have screenshare but its bad

ye but now with their new understanding

#

and determination

keen beacon Apr 17, 2025, 6:40 AM

#

i think that's what they'll do with copilot

#

but we'll see

elder rapids Apr 17, 2025, 6:41 AM

#

seems like they can really create anything

#

they've just been upgrading the Gemini app nonstop

keen beacon Apr 17, 2025, 6:41 AM

#

their funding

#

them mfs created a new material

elder rapids Apr 17, 2025, 6:41 AM

#

was that even real tho

#

the Microsoft thing

keen beacon Apr 17, 2025, 6:42 AM

#

https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/

Google DeepMind

Millions of new materials discovered with deep learning

We share the discovery of 2.2 million new crystals – equivalent to nearly 800 years’ worth of knowledge. We introduce Graph Networks for Materials Exploration (GNoME), our new deep learning tool...

fleet lintel Apr 17, 2025, 6:42 AM

#

alpine coral i think it was worthy of more than just a tweet.. hopefully one day it'll all be...

but hype like I can't sleep blah... was unwarranted

elder rapids Apr 17, 2025, 6:42 AM

#

keen beacon https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-...

ohhh

#

yo wtf

#

why'd I not know this

#

bro picks and chooses which rock he lives under

alpine coral Apr 17, 2025, 6:43 AM

#

fleet lintel but hype like I can't sleep blah... was unwarranted

yeah def wasn't a can't sleep moment - i'm with ya there ha

fleet lintel Apr 17, 2025, 6:43 AM

#

elder rapids why'd I not know this

impossible to be on top of everything in AI space. just focus on what is best for your professional career

keen beacon Apr 17, 2025, 6:43 AM

#

https://www.capacitymedia.com/article/2e4ufsy6ss2sklukotgcg/news/article-googles-next-gen-quantum-chip

Capacity Media

Google's next-gen quantum chip cracks 10-septillion-year computatio...

Google has unveiled its latest quantum computing chip, Willow, a next-generation hardware unit that dramatically reduces quantum computational errors and can perform complex calculations exponentially faster than classical supercomputers.

keen beacon Apr 17, 2025, 6:43 AM

#

elder rapids ohhh

if this is real

#

like

#

fleet lintel Apr 17, 2025, 6:44 AM

#

keen beacon

this is called marketing 🙂

keen beacon Apr 17, 2025, 6:44 AM

#

fleet lintel this is called marketing 🙂

still if its getting this fast and we're constantly advancing at this rate

elder rapids Apr 17, 2025, 6:44 AM

#

fleet lintel impossible to be on top of everything in AI space. just focus on what is best f...

nah but ask me anything I would've probably known

fleet lintel Apr 17, 2025, 6:44 AM

#

dont worry too much about quantum computing for atleast 5 years. progress will happen but noting useable till thne

keen beacon Apr 17, 2025, 6:44 AM

#

eventually things like AES will die out

elder rapids Apr 17, 2025, 6:44 AM

#

just not that

#

which is crazy

#

how'd I not know that

#

I'm Lowkey disappointed I'm logging off

keen beacon Apr 17, 2025, 6:45 AM

#

😭

fleet lintel Apr 17, 2025, 6:45 AM

#

elder rapids I'm Lowkey disappointed I'm logging off

with waht?

elder rapids Apr 17, 2025, 6:45 AM

#

got dealt a revelation

fleet lintel Apr 17, 2025, 6:46 AM

#

oh take care!

#

https://www.reddit.com/r/singularity/comments/1k156qa/ig_google_has_won/

From the singularity community on Reddit: Ig google has won😭😭...

Explore this post and more from the singularity community

hardy violet Apr 17, 2025, 6:57 AM

#

fleet lintel i think there is no money or much real-world applications of Ultra models. Only ...

Okay, so here's the thing. But, like, O1 and O1 Pro, based on both their pricing and positioning, they don't really feel like the same model at all. If a 2.5 Pro-High (or a similar product) were released and named "2.5 Ultra"...@elder rapids@fleet lintel

fleet lintel Apr 17, 2025, 6:57 AM

#

This is my team experience as well with o4/o3 models.. very expensive and not necessarily better

fleet lintel Apr 17, 2025, 7:00 AM

#

hardy violet Okay, so here's the thing. But, like, O1 and O1 Pro, based on both their pricing...

makes sense. but it wont be called ultra. people have differnet understanding of what ultra means and launching a 2.5 Pro-High with ultra name would result in backlash

they might (and should) do more like Pro and flash models with different level of thinking budget (like OAI)

hardy violet Apr 17, 2025, 7:11 AM

#

"Pro" and "Flash" are indeed clear and easy-to-understand product naming conventions. And yes, Google AI Studio needs a parameter to adjust the reasoning strength . Furthermore, I've noticed that many websites don't offer any option to adjust this reasoning strength, which is very inconvenient.🥲

alpine coral Apr 17, 2025, 7:12 AM

#

yeah i don't think their model released will be called 2.5-Pro-High (though it might be.. just to hedge ha)

#

i think when they release the stable version, it will have a parameter for reasoning tokens (like sonnet.3.7-thinking is fixed at 32k; OAI models offer low, med, high). Perhaps it'll be a dropdown, low/med/high, but there's no reason why the end user couldn't specify the max number of tokens to be allocated for reasoning / thinking, like with max_output

#

and i think perhaps dragontail could be 2.5-Pro served with a high value set for reasoning_budget

#

but it won't be like separate model

fleet lintel Apr 17, 2025, 7:26 AM

#

alpine coral and i think perhaps dragontail could be 2.5-Pro served with a high value set for...

good guess.. i agree with you

#

https://x.com/GeminiApp/status/1912591827087315323?s=19

🤔 how useful is it?

Google Gemini App (@GeminiApp) on X

We’ve been hearing great feedback on Gemini Live with camera and screen share, so we decided to bring it to more people ✨

Starting today and over the coming weeks, we're rolling it out to *all* @Android users with the Gemini app. Enjoy!

PS If you don’t have the app yet,

opaque adder Apr 17, 2025, 7:36 AM

#

is o4 mini or o3 better than gemini 2.5 pro yet

#

or is chatgpt still braindead

fleet lintel Apr 17, 2025, 7:39 AM

#

o3 > 2.5 pro > o4 mini

but o3 is very expensive for relatively small benefit over 2.5 pro.

I am a bit disappointed by yesterday's OAI release.

But chatgpt is no way braindead 🙂

opaque adder Apr 17, 2025, 7:54 AM

#

wtf

#

o3 is better than 2.5 pro now

#

surely nightwhisper is gonna come out

#

why are u disappointed tho

#

ah shet yeah

hardy violet Apr 17, 2025, 8:01 AM

#

opaque adder wtf

No, that's not quite right. The difference is minimal. However, in some o3 scenarios, the cost can be ten times higher or even more! And there are also cases, particularly in code writing, where it's less than ideal (certainly not as good as the scores might lead you to believe). While its Agent performs well, which is fair, the base model is just deeply unsatisfying.

hardy violet Apr 17, 2025, 8:18 AM

#

I'm not sure what the state of English is like with this model (I'm using Chinese). I've tested its understanding of a few texts, and the results are far from ideal. At first glance, it seems impressive and insightful, but that's quickly revealed to be an illusion. Like DeepSeek R1, it tends to latch onto certain words and over-interpret them, and its language is quite flamboyant. If you like DeepSeek's writing style, then perhaps this model is a compromise. In comparison, Gemini 2.5 Pro demonstrates a remarkably solid, appropriate, and somewhat insightful understanding. Furthermore, if understanding is subjective, I also tested the application of translating Japanese song lyrics into Chinese, and it was terrible, very disappointing.

keen beacon Apr 17, 2025, 8:19 AM

#

goooood morning

balmy mist Apr 17, 2025, 8:23 AM

#

keen beacon goooood morning

gm bro

sonic tendon Apr 17, 2025, 8:41 AM

#

gm

plain zinc Apr 17, 2025, 8:48 AM

#

opaque adder ah shet yeah

I do not recognize o3-high as a competitor for 2.5 pro, because it is o3 with the MAXIMUM level of reasoning.

opaque adder Apr 17, 2025, 8:49 AM

#

hardy violet I'm not sure what the state of English is like with this model (I'm using Chines...

ur using a translator?

opaque adder Apr 17, 2025, 8:50 AM

#

hardy violet No, that's not quite right. The difference is minimal. However, in some o3 scena...

i didnt realise that this is ai

opaque adder Apr 17, 2025, 8:51 AM

#

hardy violet I'm not sure what the state of English is like with this model (I'm using Chines...

this one is sorta noticeable that it's ai translated because of
" and its language is quite flamboyant."
"but that's quickly revealed to be an illusion."

alpine coral Apr 17, 2025, 8:54 AM

#

still 1000% better than old school translation (wouldn't even be a question if it was translated; there'd be a litany odd / seemingly misplaced words etc)

#

kinda interesting come to think of it.. like you give an LLM text to translate, and ig generally it both translates and polishes it? Like it will spit out the translation with properly formed sentences, correct punctuation etc, even if the original was sloppy (like typos, poor / no punctuation etc)?

alpine coral Apr 17, 2025, 9:00 AM

#

opaque adder this one is sorta noticeable that it's ai translated because of " and its langua...

also hopefully AI translated not AI generated (otherwise I think it like passed the turing test.. or we failed it lol)

keen beacon Apr 17, 2025, 9:04 AM

#

https://x.com/legit_api/status/1912701325504020612

ʟᴇɢɪᴛ (@legit_api) on X

there's a chance that the next Gemini model to be released is 2.5 Flash

hopefully there's more after this

balmy mist Apr 17, 2025, 9:20 AM

#

this si what o3 says about itself:
Think of me as a miniature A2A coordinator bundled with a suite of MCP‑ready tools.
Everything you saw in the video—agent discovery, delegation, precise tool execution—happens here on a smaller scale every time I answer. The protocols simply formalise and generalise what’s happening under the hood right now

#

so openai is pretty much doing what google is doing with A2A and anthropic is doing with mcps but built in

#

the only thing that will make o3 better is the ability add more tools or mcps to it overtime and allow o3 to create agents based on those tools

drifting thorn Apr 17, 2025, 9:23 AM

#

hardy violet I'm not sure what the state of English is like with this model (I'm using Chines...

omg another Chinese speaker in the chatroom

tall summit Apr 17, 2025, 9:24 AM

#

hardy violet I'm not sure what the state of English is like with this model (I'm using Chines...

how good is gemini 2.5 as a translator in those languages?

drifting thorn Apr 17, 2025, 9:25 AM

#

Very good in "Chinese to English" I can confirm

tall summit Apr 17, 2025, 9:26 AM

#

what lmao

Screenshot_2025-04-17-12-25-47-864_org.mozilla.firefox-edit.jpg

#

paid tier for a discovery tool

keen beacon Apr 17, 2025, 9:27 AM

#

either way people will just repost that stuff so you won't ever need to pay

tall summit Apr 17, 2025, 9:27 AM

#

keen beacon either way people will just repost that stuff so you won't ever need to pay

well yeah!

#

haven't looked into what he's doing but i do wonder

#

whether he has exclusive access to something (and what it is) or not

balmy mist Apr 17, 2025, 9:31 AM

#

yeah to pay seems weird bc he posts it as soon as he gets news

#

thats what his twitter is for

tall summit Apr 17, 2025, 9:31 AM

#

oh he'll surely stop posting it

balmy mist Apr 17, 2025, 9:31 AM

#

like why he gets views lol

tall summit Apr 17, 2025, 9:31 AM

#

once he makes it paid

#

but he also is saying he won't make it paid for now

#

clearly testing whether he can afford to

#

but nothings even happening i dont even know why im speculating

balmy mist Apr 17, 2025, 9:32 AM

#

what woul dbe the benefit for people? like how useful is his tool anyway? it tells us what is most likely coming out but so does the companies

#

but for the webdev arena stuff its cool

#

like i never check webdev unless he posts about a new model

tall summit Apr 17, 2025, 9:34 AM

#

balmy mist what woul dbe the benefit for people? like how useful is his tool anyway? it tel...

he knows before the companies officially tell us

#

or at most immediately when they do since its a discord webhook and he gets notifications

balmy mist Apr 17, 2025, 9:36 AM

#

not the gemini one

tall summit Apr 17, 2025, 9:36 AM

#

?

balmy mist Apr 17, 2025, 9:36 AM

#

yesterday

#

logan posted about a new model

#

then legit posted about it a few hours later

#

about the exact model

#

but at that point whocares

#

i can just wait for today

#

unless you in prediction markets

#

idk, i just dont see the need for buying that, especially when you have channels like this

#

the only thing about openai is their UI sometimes buggs out

#

i always have issues with their website

narrow elbow Apr 17, 2025, 9:39 AM

#

balmy mist unless you in prediction markets

TWEET LIKE TRUMP, “THIS IS GREAT TIME TO BUY”🤪

balmy mist Apr 17, 2025, 9:40 AM

#

never get issues with studio, like maybe 1/10 times with studio and i usually know when it will bug out, but with chatgpt bruhh, its like 50/50

#

im gonna build an o3 clone that is adaptable where you can add new tools on the fly, its pretty much just a small scale ai ide or more specifically a augment or cursor or windsurf, but built into a model with specific mcps, like a starter package, i wonder if you built an ai ide with o3 as the brains, maybe that is where openai is going?

#

gpt5 is just o3 but access to more tools and trained on how to better use those tools and more agents built in

ocean vortex Apr 17, 2025, 9:48 AM

#

balmy mist never get issues with studio, like maybe 1/10 times with studio and i usually kn...

Studio is a pain to use compared to chatgpt website. Good luck trying to branch the conversation from the middle of your chat… It’s really openai’s playground alternative rather than chatgpt. For that it’s alright

balmy mist Apr 17, 2025, 9:51 AM

#

true but is openai playground free?

#

and tbh i dont like using openai playground tbh, its a little confusing for me for some reason

#

studio is very straightforward

balmy mist Apr 17, 2025, 9:53 AM

#

ocean vortex Studio is a pain to use compared to chatgpt website. Good luck trying to branch ...

the branching feature is more newer, and it works okay, but when you compare just fucntionality to what you can do in chatgpt studio works better

#

you cant even branch in chatgpt, so no point even mentioning that

ocean vortex Apr 17, 2025, 9:57 AM

#

balmy mist you cant even branch in chatgpt, so no point even mentioning that

Are you joking, it works perfect. You just edit the message in the middle of your chat and your chat continues from that point. In aistudio it’s an absolute mess with it just leaving all the flood from previous branch below. It’s a playground and not a full functionality user friendly interface like cgpt

#

kinda by design

balmy mist Apr 17, 2025, 9:59 AM

#

you can branch in playground?

tall summit Apr 17, 2025, 9:59 AM

#

balmy mist but at that point whocares

what kind of argument is "who cares" against the fact that it objectively does something most people don't have access to without doing what the bot itself does
though of course it's not worth it and i hope nobody pays for it if he ever makes it paid

balmy mist Apr 17, 2025, 10:00 AM

#

tall summit what kind of argument is "who cares" against the fact that it objectively does s...

what you are talking about is getting info an hour or sometimes minutes before people

tall summit Apr 17, 2025, 10:00 AM

#

balmy mist what you are talking about is getting info an hour or sometimes minutes before p...

or days

balmy mist Apr 17, 2025, 10:00 AM

#

tall summit or days

depends

tall summit Apr 17, 2025, 10:00 AM

#

honestly nobody KNOWS when anything is dropping

#

thats THE POINT

balmy mist Apr 17, 2025, 10:00 AM

#

thats why to me its pointless

tall summit Apr 17, 2025, 10:00 AM

#

i can see its use

#

though to me it doesnt much matter what will come out

#

its all just names

balmy mist Apr 17, 2025, 10:01 AM

#

what would paying for it do for you?

tall summit Apr 17, 2025, 10:01 AM

#

until we actually see what it does

#

which only ever happens on lmarena or private access (which nobody normal will have access to regardless) or the actual release which after that it doesnt matter

tall summit Apr 17, 2025, 10:02 AM

#

balmy mist what would paying for it do for you?

??

tall summit Apr 17, 2025, 10:02 AM

#

tall summit what kind of argument is "who cares" against the fact that it objectively does s...

.

balmy mist Apr 17, 2025, 10:02 AM

#

? i dont know how else i could write that

#

what would it do for you?

#

lol