#general | Arena | Page 50

small haven Jun 3, 2025, 1:20 AM

#

lame

#

as long as its above 200k ish, would be cool

small haven Jun 3, 2025, 1:35 AM

#

#

thursday

patent aspen Jun 3, 2025, 2:31 AM

#

I like this

wicked root Jun 3, 2025, 2:46 AM

#

Hello humans

echo aurora Jun 3, 2025, 2:54 AM

#

wicked root Hello humans

howdy

elder rapids Jun 3, 2025, 3:07 AM

#

this wouldn't make any sense tho

#

they have the compute to train it at 1m context

#

actually that would be weird in a lot of ways

#

since I'd assume they know their advantage is context + reasoning over lengths, and the fact deepthink is already limited presupposes it's high computational strain

#

so high context restriction would be completely redundant

small haven Jun 3, 2025, 3:18 AM

#

do you know the exact number on the ctx window

#

for trusted testers

#

higher lower 500k

meager lintel Jun 3, 2025, 3:32 AM

#

@patent aspen goldmane tmrw still locked in? 😂

#

rip

small haven Jun 3, 2025, 5:08 AM

#

well this is new..

woeful geyser Jun 3, 2025, 5:40 AM

#

stephen really looks like some org in China straight up doing SFT from o3 outputs.

Screenshot_2025-06-03-12-37-18-233_com.android.chrome.jpg

pallid crypt Jun 3, 2025, 5:41 AM

#

woeful geyser `stephen` really looks like some org in China straight up doing SFT from o3 outp...

Deepseek R2 possibly

elder rapids Jun 3, 2025, 6:20 AM

#

pallid crypt Deepseek R2 possibly

it's not a good model, at all

#

all the anon models except goldmane are really bad

#

there's no reason to pay attention to them

calm sequoia Jun 3, 2025, 6:48 AM

#

What's up with the chair meme in LLM community?

misty vault Jun 3, 2025, 7:17 AM

#

whole sundial Jun 3, 2025, 7:20 AM

#

I think stephen is by StepFun (a Chinese AI company, which would explain the confusion with DeepSeek). Just say the names next to each other if you don't believe me.

misty vault Jun 3, 2025, 7:23 AM

#

the names next to each other if you don't believe me.

dusky aurora Jun 3, 2025, 7:36 AM

#

Google should focus on fiction writing. Gemini would be much better if the scenes were more interesting

small haven Jun 3, 2025, 7:37 AM

#

misty vault

what does it mean?

misty vault Jun 3, 2025, 8:48 AM

#

wtf

#

first we had wild and paws and now billy

#

they were dividing themselves

tall summit Jun 3, 2025, 9:34 AM

#

dusky aurora Google should focus on fiction writing. Gemini would be much better if the scenes...

we have anthropic for that

calm sequoia Jun 3, 2025, 10:58 AM

#

dusky aurora Jun 3, 2025, 12:09 PM

#

why are you telling me this?

olive mesa Jun 3, 2025, 2:16 PM

#

misty vault

what's this supposed to mean

wicked root Jun 3, 2025, 2:42 PM

#

How reliable are xAI’s models?

grim axle Jun 3, 2025, 3:17 PM

#

they are both ahh

wintry tinsel Jun 3, 2025, 4:15 PM

#

dusky aurora Google should focus on fiction writing. Gemini would be much better if the scenes...

Gemini is second best on fiction writing to Claude

wicked root Jun 3, 2025, 4:15 PM

#

Gemini’s the best

wintry tinsel Jun 3, 2025, 4:15 PM

#

For what? Not for fiction writing

#

These are final rankings #1 for coding and math: O3, #1 for everything else: Claude Opus, number 2 for everything: Gemini 2.5 pro

wicked root Jun 3, 2025, 4:22 PM

#

wintry tinsel These are final rankings #1 for coding and math: O3, #1 for everything else: Cla...

Wdym final rankings? Gemini pro’s up there

#

No im going by lmarena

dusky aurora Jun 3, 2025, 4:34 PM

#

wintry tinsel Gemini is second best on fiction writing to Claude

by subjective direct chat experience, there's still a lot of work to do to catch up

wintry tinsel Jun 3, 2025, 4:42 PM

#

dusky aurora by subjective direct chat experience, there's still a lot of work to do to catch...

You use your own prompt?

tall summit Jun 3, 2025, 5:07 PM

#

wintry tinsel These are final rankings #1 for coding and math: O3, #1 for everything else: Cla...

nop

#

gemini 2.5 pro best in translation

tall summit Jun 3, 2025, 5:08 PM

#

wintry tinsel Gemini is second best on fiction writing to Claude

by subjective direct chat experience, i agree

keen fulcrum Jun 3, 2025, 5:54 PM

#

keen fulcrum Jun 3, 2025, 6:22 PM

#

Here is a video model arena
https://artificialanalysis.ai/text-to-video/arena

Video Generation Model Arena | Artificial Analysis

Compare AI video generation models by choosing your preferred video without knowing the provider.

calm sequoia Jun 3, 2025, 6:59 PM

#

calm sequoia

Poll: Would you have been disappointed if the GPT o1 was named GPT 5 on release?

Winning answer (option 3): 🤖 What? (8 of 19 votes)

gentle plinth Jun 3, 2025, 7:23 PM

#

@hollow ivy could you send me the dou shou qi server link again?

tall summit Jun 3, 2025, 7:26 PM

#

what I am disappointed in is that claude 4 opus is better than sonnet when it was the reverse for 3.7

#

NO SENSE!

small haven Jun 3, 2025, 7:32 PM

#

lol

leaden sun Jun 3, 2025, 7:37 PM

#

I thought LMarena would have benchmarking for video generating. Or is it not that meaningful to include this?

echo aurora Jun 3, 2025, 7:43 PM

#

leaden sun I thought LMarena would have benchmarking for video generating. Or is it not tha...

being able to support video gen is an area we're exploring and is on our radar blobthumbsup

keen beacon Jun 3, 2025, 7:45 PM

#

not really fair to compare sonnet 3.5 and opus 3, if hes talking about that anyway. sonnet 3.5 is a newer fresh pretrain i believe

leaden sun Jun 3, 2025, 7:46 PM

#

echo aurora being able to support video gen is an area we're exploring and is on our radar <...

i hope you will find a way to support this 🤩

tall summit Jun 3, 2025, 7:47 PM

#

keen beacon not really fair to compare sonnet 3.5 and opus 3, if hes talking about that anyw...

ya but come on.

balmy mist Jun 3, 2025, 7:53 PM

#

anything new?

small haven Jun 3, 2025, 7:54 PM

#

keen beacon not really fair to compare sonnet 3.5 and opus 3, if hes talking about that anyw...

wait so claude 4 is based upon the sonnet 3.5 pretrain?

keen beacon Jun 3, 2025, 7:56 PM

#

small haven wait so claude 4 is based upon the sonnet 3.5 pretrain?

i highly suspect it. opus 3.5 and sonnet 3.5 are fresh pretrains. 3.7 sonnet and claude 4 are not fresh and have continued pretraining based on sonnet 3.5/opus 3.5. plus semianalysis reported on claude 3.5 opus and there were additional rumors about it being disappointing, it seems to be true. i think its too fast to be pretrained from scratch among other reasons

boreal saddle Jun 3, 2025, 7:57 PM

#

wintry tinsel Gemini is second best on fiction writing to Claude

Really?

#

I always relied on good old Claude to write fictional stories, so I cant tell.

#

Though, Claude often freezing up and not continuing with the rest of the message in the new LMArena is quite annoying.

#

It literally just stays there, with the loading spinning indicator, and nothing happens.

small haven Jun 3, 2025, 7:58 PM

#

keen beacon i highly suspect it. opus 3.5 and sonnet 3.5 are fresh pretrains. 3.7 sonnet and...

yea i can kinda believe that, feels more like a finetune/rl'd if u think about it

late path Jun 3, 2025, 7:58 PM

#

I remember 3.5 sonnet, 3.5 sonnet1022, 3.7 sonnet are the same base, and 4 is a new pretrained model.

boreal saddle Jun 3, 2025, 7:59 PM

#

Is Claude 4 actually superior to 3.7 in terms of fictional story quality? Sometimes, new isnt automatically better.

small haven Jun 3, 2025, 7:59 PM

#

feels like they rl'd for being more agentic and with extended thinking mode

patent aspen Jun 3, 2025, 8:10 PM

#

keen beacon i highly suspect it. opus 3.5 and sonnet 3.5 are fresh pretrains. 3.7 sonnet and...

I'm curious to know how you estimated that it was too fast to be pre-trained from scratch. I haven't looked at the release timelines and don't have an opinion on the matter, although I'd like to know how people estimate what is fast vs slow for a pretrain

small haven Jun 3, 2025, 8:10 PM

#

yup

#

pretraining claude 4 from scratch when u already have a good base like sonnet 3.5 seems crazy to me (where they also had opus 3.5)

patent aspen Jun 3, 2025, 8:13 PM

#

Well maybe

tall summit Jun 3, 2025, 8:22 PM

#

but it is still annoying

#

i'm not comparing them

#

i'm saying the naming scheme irks me personally because of it

#

even if it is a weird comparison

#

it seems not having an opus knocked things off balance

#

bruh you know what i mean.

#

cool name?

#

tru

#

did it take 5x the effort

#

i haven't looked into it

#

that's what i meant

keen beacon Jun 3, 2025, 8:44 PM

#

patent aspen I'm curious to know how you estimated that it was too fast to be pre-trained fro...

it depends but upon rechecking the dates, it is possible claude 4 sonnet could've been pretrained from scratch (if we use the claimed/documented sonnet 3.5 cut off and it wasn't continued pretrained as well, which was around 3 months from cut off to release) but the opus timeline doesn't make sense probably in combination with other reasons. i heard models usually take around/at least 2-4 months to pretrain (depending on how much data/etc), to be released completely in < 3 months [based on the pretraining cut off] (not including safety/post training) is seemingly absurd if its not continued pretraining/etc (which makes sense with other reasons). this isnt an exhaustive/coherent argument (which i don't really get into the other reasons) but i hope it makes sense where im coming at. im just guessing btw 🤣

#

there isnt really a smoking gun in this case, at least i don't exactly know of any, but the whole picture seems to be that

keen beacon Jun 3, 2025, 8:58 PM

#

keen beacon there isnt really a smoking gun in this case, at least i don't exactly know of a...

im wondering if im being irrational/unreasonable or if there are any gaps in my understanding there, if u dont mind please point them out 😄

#

i suppose you can start post training/experiments/safety while the model is still pretraining (on checkpoints) so there might overlap (as things get done in parallel), so pretraining cut offs/timelines arent necessarily the greatest signal too. (im not sure if this is done in practice though)

#

even if they arent known u can probe the model/determine it manually

patent aspen Jun 3, 2025, 9:20 PM

#

I don't know what happens if an AI company decides they want to hide or obscure their knowledge cutoff or post training and pre-training become less distinct over time

#

No evidence of that yet AFAIK

keen beacon Jun 3, 2025, 9:21 PM

#

patent aspen I don't know what happens if an AI company decides they want to hide or obscure ...

why would they do that though i think its a lot of effort and sorta pointless to do that. a lot of companies dont even train the cut off in, so you have to manually probe it by manually asking about events, im not sure how you'd obscure that tbh

#

it can be a pain in the butt to determine the precise cut off if youre doing it manually like that though

#

yeah, but even then depending on the implementation, you can still probably figure it out

#

if its using tools for internet access, its possible, but it can become very complicated
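The manual probing described above (asking a model about dated events and seeing which ones it knows) can be sketched as a small bracketing helper. This is only an illustration: the function name `bracket_cutoff` and the probe dates below are hypothetical placeholders, not real probe results for any model.

```python
from datetime import date

def bracket_cutoff(probes):
    """Bracket a knowledge cutoff from manual probes.

    probes: list of (event_date, model_knew_it) pairs, filled in by hand
    after asking the model about each event.
    Returns (latest event the model knew, earliest later event it did not know).
    """
    known = [d for d, knew in probes if knew]
    latest_known = max(known) if known else None
    later_unknown = [
        d for d, knew in probes
        if not knew and (latest_known is None or d > latest_known)
    ]
    earliest_unknown = min(later_unknown) if later_unknown else None
    return latest_known, earliest_unknown

# Hypothetical probe results, for illustration only.
example_probes = [
    (date(2024, 1, 15), True),
    (date(2024, 3, 10), True),
    (date(2024, 4, 20), True),
    (date(2024, 6, 1), False),
    (date(2024, 8, 5), False),
]

low, high = bracket_cutoff(example_probes)
print(f"cutoff is somewhere between {low} and {high}")
```

The more densely the probe events are spaced, the tighter the bracket gets, which is why pinning down a precise cutoff by hand is tedious.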

surreal creek Jun 3, 2025, 9:42 PM

#

rate of progress is so advanced we have ppl complaining the SOTA hasn’t been beaten in 4 weeks

cedar tide Jun 3, 2025, 9:43 PM

#

Grok 3 mini is bad on the arena

Screenshot_2025-06-03-23-42-46-593_com.android.chrome-edit.jpg

surreal creek Jun 3, 2025, 9:44 PM

#

Worse than Grok 2 is pretty brutal

elder rapids Jun 3, 2025, 10:06 PM

#

wintry tinsel These are final rankings #1 for coding and math: O3, #1 for everything else: Cla...

Claude is HORRIBLE at everything but coding, this is a legitimate concern and in no way an exaggeration

#

o3 is only #1 in high reasoning variant, don't forget that

candid storm Jun 3, 2025, 10:20 PM

#

when do you guys expect Grok 3.5?

small haven Jun 3, 2025, 10:20 PM

#

what happens after gpt-4.4

small haven Jun 3, 2025, 10:21 PM

#

surreal creek rate of progress is so advanced we have ppl complaining the SOTA hasn’t been bea...

facts

keen beacon Jun 3, 2025, 10:23 PM

#

gpt 4.5 probably knows much more than opus 4 tho

small haven Jun 3, 2025, 10:23 PM

#

4.5 is very slow tho

#

what is the base model that o3 relies on, 4o?

keen beacon Jun 3, 2025, 10:33 PM

#

continued pretrained version of it (cut off of june 2024 vs oct 2023), yeah it seems

small haven Jun 3, 2025, 10:33 PM

#

its really impressive then

#

imagine the alpha it gets from having 4.5 as a base model

#

currently even o3 pro lacks some perception when it comes to detail

#

what if i am 👀

#

lol

#

u dont see it, i do

#

its not THAT great anyways, its just better than o3, but hopefully the official version is different

craggy ridge Jun 3, 2025, 10:52 PM

#

small haven Jun 3, 2025, 11:52 PM

#

this has to be gemini 2.5 pro thats coming out on thursday?

#

from aider discord

#

cost is slightly above gemini 0506

#

this is better than o3 high wtf

#

nah @patent aspen fact check

#

it actually is

keen beacon Jun 3, 2025, 11:57 PM

#

people have been raving about goldmane so it wouldnt surprise me

small haven Jun 3, 2025, 11:57 PM

#

this is really insane

#

gemini literally closed the gap

#

https://aider.chat/docs/leaderboards/

aider

Aider LLM Leaderboards

Quantitative benchmarks of LLM code editing skill.

#

*80 ish

keen beacon Jun 3, 2025, 11:58 PM

#

thats with gpt 4.1 too btw

small haven Jun 3, 2025, 11:58 PM

#

keen beacon thats with gpt 4.1 too btw

INSANE

#

nawww ok gemini cooked

#

its 86% according to an aider admin

#

o3 high is 79.6% to be exact

keen beacon Jun 3, 2025, 11:59 PM

#

nah

small haven Jun 3, 2025, 11:59 PM

#

highly doubt it

#

more like gemini

keen beacon Jun 3, 2025, 11:59 PM

#

i need goldmane asap

small haven Jun 3, 2025, 11:59 PM

#

ya this is goldmane

#

i want it

#

wtf

#

1m context window too

#

omg

#

no wonder brian was so confident lol

#

its materializing

#

a literal aider admin tested it lol

#

gemini always wants an aider polyglot test for benchmarks

#

highly doubt its fake

keen beacon Jun 4, 2025, 12:01 AM

#

they screwed up last time

#

aider

small haven Jun 4, 2025, 12:01 AM

#

one of the coding benchmarks standard

keen beacon Jun 4, 2025, 12:01 AM

#

i guess they got access ahead of time

#

this time to make sure its right

#

they misreported gemini 2.5 pro's cost

small haven Jun 4, 2025, 12:02 AM

#

but not the accuracy

keen beacon Jun 4, 2025, 12:02 AM

#

yeah

small haven Jun 4, 2025, 12:02 AM

#

its $42

keen beacon Jun 4, 2025, 12:03 AM

#

aistudio and its free 🤣

small haven Jun 4, 2025, 12:03 AM

#

vs. $37.41

keen beacon Jun 4, 2025, 12:03 AM

#

and basically unlimited

small haven Jun 4, 2025, 12:03 AM

#

check it uself lol

#

im mindblown

#

literally

#

who cares about cost efficiency

#

its free btw

#

yea deepthink is going to demolish o3 pro

keen beacon Jun 4, 2025, 12:04 AM

#

deepmind at an insane pace tbh its unbelievable

small haven Jun 4, 2025, 12:05 AM

#

does this mean a 2m context window or its prompt tokens taken in aggregate to test the benchmarks?

keen beacon Jun 4, 2025, 12:05 AM

#

probably combined

small haven Jun 4, 2025, 12:05 AM

#

yea

#

is goldmane still live

#

i havent even tried

keen beacon Jun 4, 2025, 12:06 AM

#

yea

#

according to web dev arena metadata

small haven Jun 4, 2025, 12:06 AM

#

ok cool

#

@deep adder put some money on google on polymarket lol

keen beacon Jun 4, 2025, 12:09 AM

#

i guess i understand now why they dont want to return the raw thoughts for the short term at least 🤣

#

https://discuss.ai.google.dev/t/massive-regression-detailed-gemini-thinking-process-vanished-from-ai-studio/83916/103

Google AI Developers Forum

Massive Regression: Detailed Gemini Thinking Process vanished from ...

Lots of good stuff in this thread, catching up on the comments after Google IO craziness. A few reactions, comments, and clarifications: I hear that you all want raw thoughts, the value is clear, there are use cases that require them, and seems reasonable to want them in the API as well Why be excited for summaries? The raw thoughts have been...

small haven Jun 4, 2025, 12:10 AM

#

where is brian when u need him

patent aspen Jun 4, 2025, 12:10 AM

#

I don't know about the polyglot thing

small haven Jun 4, 2025, 12:11 AM

#

how are u spot on

#

idk thats ur thing

#

nah we living in a simulation

small haven Jun 4, 2025, 12:12 AM

#

patent aspen I don't know about the polyglot thing

i mean its relatively better than o3 high according to the bench

#

by a 6%
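For reference, the gap between the two scores quoted in this chat (86% and 79.6%) can be worked out as below. The numbers are taken from the messages above and are unverified; the variable names are just for illustration.

```python
# Scores as quoted in the chat (aider polyglot, percent correct); unverified.
new_model = 86.0   # the unreleased model
o3_high = 79.6     # o3 high

point_gap = new_model - o3_high             # absolute gap in percentage points
relative_gap = point_gap / o3_high * 100    # gap relative to o3 high's score

print(f"{point_gap:.1f} points, {relative_gap:.1f}% relative")
```

So "6%" is roughly the gap in percentage points; relative to o3 high's own score the improvement is closer to 8%.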

patent aspen Jun 4, 2025, 12:13 AM

#

I don't generally have access to evals before a model is released

#

Occasionally I stumble across one or two

small haven Jun 4, 2025, 12:14 AM

#

i mean does it fit the vibe that it beats o3

patent aspen Jun 4, 2025, 12:15 AM

#

I mean I would expect that, but I don't use o3 and wouldn't know how to predict benchmark scores. I only have some rough idea of the broad directions of things and stuff we're trying

small haven Jun 4, 2025, 12:16 AM

#

ok understandable

keen beacon Jun 4, 2025, 12:16 AM

#

there were rumors of an additional openai release this week (api, free, plus, pro, etc.) (so not o3 pro) theres a possibility that its that but i dont think so

small haven Jun 4, 2025, 12:17 AM

#

true

#

o3.1 high

#

lol

#

i really want it to be gemini

#

if its gemini, oai will release o4 a week after

#

its gemini, cus gemini models are the only ones that have diff-fenced edit format

#

look at the far right column

#

diff-fenced only for gemini

keen beacon Jun 4, 2025, 12:20 AM

#

yeah

small haven Jun 4, 2025, 12:20 AM

#

idk but they only test it on gemini

keen beacon Jun 4, 2025, 12:20 AM

#

its definitely gemini

#

i remembered

small haven Jun 4, 2025, 12:20 AM

#

^

#

GEMINI

keen beacon Jun 4, 2025, 12:20 AM

#

diff fence is made for gemini models

small haven Jun 4, 2025, 12:20 AM

#

keen beacon Jun 4, 2025, 12:20 AM

#

i recall reading the docs

small haven Jun 4, 2025, 12:20 AM

#

diff-fenced

keen beacon Jun 4, 2025, 12:21 AM

#

https://aider.chat/docs/more/edit-formats.html#diff-fenced its 99% gemini, i mean the reports/how much people like it here/the perf/etc., seems to line up

aider

Edit formats

Aider uses various “edit formats” to let LLMs edit source files.

small haven Jun 4, 2025, 12:21 AM

#

#

LMAO

#

primarily used with the gemini ...

#

gemini won

#

still can make money

keen beacon Jun 4, 2025, 12:22 AM

#

i wish they added back raw thoughts 🥲

small haven Jun 4, 2025, 12:22 AM

#

u sell it when they announce it on thursday, ill take the differential

zinc ore Jun 4, 2025, 12:24 AM

#

small haven

Is this 2.5 pro GA?

small haven Jun 4, 2025, 12:25 AM

#

goldmane

zinc ore Jun 4, 2025, 12:25 AM

#

And what was the score it got

small haven Jun 4, 2025, 12:25 AM

#

86%

#

79.6% on o3 high

zinc ore Jun 4, 2025, 12:25 AM

#

Hot

small haven Jun 4, 2025, 12:27 AM

#

cope

#

let gemini win once haha

zinc ore Jun 4, 2025, 12:27 AM

#

I've seen SS showing new 2.5 update dated the 2nd and one for the 4th

#

Which aligns with what the image is showing and all the other information

small haven Jun 4, 2025, 12:28 AM

#

can they not improve

#

it is aids now yes

#

yo i cant proc goldmane in webdev wtf

keen beacon Jun 4, 2025, 12:31 AM

#

it should be in the main arena as well

small haven Jun 4, 2025, 12:32 AM

#

im getting deepseek and grok most of the time 😭

keen beacon Jun 4, 2025, 12:34 AM

#

small haven im getting deepseek and grok most of the time 😭

ya sample weight was reduced you used to get it a lot more

wintry tinsel Jun 4, 2025, 12:34 AM

#

I’m sad chat

#

Opus got nerfed

#

I can feel the quality isn’t as good as it was last week

small haven Jun 4, 2025, 12:35 AM

#

keen beacon ya sample weight was reduced you used to get it a lot more

rip 😦

wintry tinsel Jun 4, 2025, 12:35 AM

#

It’s not a huge nerf, but its there

keen beacon Jun 4, 2025, 12:35 AM

#

if ur coding, goldmane 🙂

#

about to go nuts apparently

small haven Jun 4, 2025, 12:35 AM

#

ya just wait till thursday

#

lol

#

i dont get it, how is gemini-2.5-pro-06-05 confusing

zinc ore Jun 4, 2025, 12:47 AM

#

Could be deepthink

keen beacon Jun 4, 2025, 12:48 AM

#

nah

zinc ore Jun 4, 2025, 12:48 AM

#

Which would make it more annoying to keep track

small haven Jun 4, 2025, 12:48 AM

#

oh i get it

#

05-06

#

and 06-05

#

lol

#

"i hope they delay the model"

zinc ore Jun 4, 2025, 12:49 AM

#

Yeah lol

small haven Jun 4, 2025, 12:49 AM

#

cool its rlly on thursday

frail thorn Jun 4, 2025, 1:03 AM

#

😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 🙏

leaden palm Jun 4, 2025, 1:06 AM

#

finally claude research is available

#

why is there a bar chart lmao

#

it's now rolled out to max + pro

small haven Jun 4, 2025, 1:16 AM

#

leaden palm it's now rolled out to max + pro

huh i had for a good month

leaden palm Jun 4, 2025, 1:16 AM

#

small haven huh i had for a good month

are you max?

small haven Jun 4, 2025, 1:17 AM

#

yes

#

when it got released i had it

#

so now u saying it rolled widely to max?

#

or pro

#

yea extended thinking feels different and less potent than o3

#

i mean its true, o3 can think for 10 mins, claude 4 can think less than 1 min if ur lucky

#

image only, no link is crazy

#

oh nvm, it looks ai ish

#

i mean is it not facts they are joining

#

image looks fake

#

they cant

#

its owned by amazon first of all, and they dont have enough cash to cover >$100b

#

partially

#

amazon and apple conflicts of interest

#

liquid cash, no?

#

$27b liquid cash

#

balance sheet

#

recent quarter

#

even if they had it right, dario wouldn't sell it at face value

#

so maybe 2-3x premium

#

anthropic growth is too great, 2x premium is not unreasonable

#

growth

#

ai

#

more than 2x if they wanted to buy it

#

valuation

#

thats what i meant, its virtually impossible

#

better off doing what theyre doing or buy a smaller lab

#

buddy didnt know apple cash on hand ok

#

$150b

drowsy shuttle Jun 4, 2025, 2:06 AM

#

hi here
I'm a web developer with experience in frontend, web3 integration, iOS/Android apps, and implementing AI frontends with APIs.
I am looking for a new position now.

pseudo hemlock Jun 4, 2025, 2:37 AM

#

Does anyone know if there are any plans to add more models to the search leaderboard?

#

I know at a minimum Claude and Grok both have search features

small haven Jun 4, 2025, 2:38 AM

#

what is it

#

can this guy tweet this real quick

elder rapids Jun 4, 2025, 3:14 AM

#

I've been using ts like everyday bro 😭😭

#

nah actually, I abuse limits like hell

small haven Jun 4, 2025, 3:23 AM

#

its not great, but its the sota so we use it 🤷

drifting thorn Jun 4, 2025, 3:25 AM

#

small haven this has to be gemini 2.5 pro thats coming out on thursday?

Wow the context window is thaaaaaat big?

small haven Jun 4, 2025, 3:25 AM

#

its in aggregate of all the tasks they tested against, dont think its total context window

#

but google has been hinting several times for a 2m context window, thats not new

drifting thorn Jun 4, 2025, 3:27 AM

#

I've been hoping for a model with context window of 10 million tokens

#

Such that I don't need to set up any knowledge graphs or RAGs

#

Just bruteforce it in AI studio

small haven Jun 4, 2025, 3:27 AM

#

yes its not new

drifting thorn Jun 4, 2025, 3:27 AM

#

by sending it all the context

small haven Jun 4, 2025, 3:28 AM

#

10m will happen, eventually

drifting thorn Jun 4, 2025, 3:28 AM

#

small haven yes its not new

but it is no longer available

small haven Jun 4, 2025, 3:28 AM

#

im sad that 64k is the meta currently in oai lol

small haven Jun 4, 2025, 3:28 AM

#

drifting thorn but it is no longer available

idk why

drifting thorn Jun 4, 2025, 3:29 AM

#

Cost issues.

small haven Jun 4, 2025, 3:29 AM

#

probably

drifting thorn Jun 4, 2025, 3:29 AM

#

But AlphaEvolve has found a new way to multiply 4x4 matrices, which should save compute

#

So I hope they will bring us the 0325 with 2 million or more context window.

elder rapids Jun 4, 2025, 3:36 AM

#

small haven this has to be gemini 2.5 pro thats coming out on thursday?

nah wait holy sht

#

86%???????

#

if this is Gemini

#

the other companies are cooked

small haven Jun 4, 2025, 3:36 AM

#

been telling 🤷

#

if this is the base model for deepthink, o3 pro is dead

elder rapids Jun 4, 2025, 3:37 AM

#

yep

small haven Jun 4, 2025, 3:37 AM

#

with a 1m context window

elder rapids Jun 4, 2025, 3:37 AM

#

similar cost too

#

to

#

2.5 pro

#

bro 😭

small haven Jun 4, 2025, 3:37 AM

#

polymarket still 74% chance, easy money

elder rapids Jun 4, 2025, 3:37 AM

#

alr ion believe it's Gemini

#

I can't believe that

small haven Jun 4, 2025, 3:37 AM

#

but it is

elder rapids Jun 4, 2025, 3:38 AM

#

if that's Gemini

#

nah I actually can't believe it

small haven Jun 4, 2025, 3:38 AM

#

but edit formatting

#

"diff-fenced" only for gemini models

#

was goldmane that good? i never had the chance to try it

elder rapids Jun 4, 2025, 3:39 AM

#

nah there's others and it could be that diff fenced is the meta

#

or some shi

small haven Jun 4, 2025, 3:39 AM

#

elder rapids Jun 4, 2025, 3:39 AM

#

small haven was goldmane that good? i never had the chance to try it

ye tbh it was that good

#

but the same price

#

at such a higher performance

#

no bro that's

#

like ACTUALLY

#

I'm geeking rn

#

I don't believe the dude that sent it actually did it

#

I do think goldmane is another nebula type moment tho tbh

small haven Jun 4, 2025, 3:42 AM

#

if u check closely too, "seconds_per_case" it took 132 secs vs 200+ secs from 0506 version

#

so not only accurate but faster

#

this is 0506

elder rapids Jun 4, 2025, 3:42 AM

#

small haven Jun 4, 2025, 3:42 AM

#

271 secs per case and this new one is 132 secs
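A quick check of what those two "seconds_per_case" figures imply, using only the numbers quoted in the chat (271 s for 0506, 132 s for the new run). These are unverified, and wall-clock time also depends on serving infrastructure, so treat this as arithmetic only.

```python
# seconds_per_case as quoted in the chat; unverified.
old_secs = 271.0   # 0506 version
new_secs = 132.0   # new model

speedup = old_secs / new_secs                        # how many times faster
reduction = (old_secs - new_secs) / old_secs * 100   # percent less time per case

print(f"{speedup:.2f}x faster, {reduction:.1f}% less time per case")
```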

elder rapids Jun 4, 2025, 3:43 AM

#

Google wins

#

there's nothing we can do

small haven Jun 4, 2025, 3:43 AM

#

oai needs to release o4 asap tho 👀

#

6 months to 3 months to 1 month release cycle baby

balmy mist Jun 4, 2025, 3:43 AM

#

there is a new model?

#

i been mia

#

oh snap i see the new ai news channel, thanks mods!!

elder rapids Jun 4, 2025, 3:45 AM

#

0325 can't be a fluke then

#

why did they release an iteration of 0325 that offers no real improvements besides coding

#

was it even an iteration of 0325 or was it just a 0325 candidate that happened to have better coding, but 0325 they intended to scale higher

#

hence goldmane

#

I'm tripped up tbh

small haven Jun 4, 2025, 3:45 AM

#

was goldmane the best of all ?

elder rapids Jun 4, 2025, 3:45 AM

#

yes

small haven Jun 4, 2025, 3:45 AM

#

like how was it vs redsword

#

etc

balmy mist Jun 4, 2025, 3:45 AM

#

small haven this is 0506

where is on?

elder rapids Jun 4, 2025, 3:45 AM

#

small haven like how was it vs redsword

better

small haven Jun 4, 2025, 3:45 AM

#

wow

elder rapids Jun 4, 2025, 3:46 AM

#

and tbh you could kind of see it in the way it formatted the code ironically

#

goldmane codes BEAUTIFULLY

small haven Jun 4, 2025, 3:46 AM

#

im excited

#

thursday plz

#

and deepthink holy fck

elder rapids Jun 4, 2025, 3:47 AM

#

if not Thursday can we just never trust Brian

#

😭

#

like, forget about him

#

deadass

small haven Jun 4, 2025, 3:47 AM

#

elder rapids if not Thursday can we just never trust Brian

watch him bump a week

#

"tentatively"

#

lol

elder rapids Jun 4, 2025, 3:47 AM

#

small haven and deepthink holy fck

yes bro

#

if it's going to be that good

#

I'm starting to see why they're hiding it behind ultra lmfao

small haven Jun 4, 2025, 3:48 AM

#

oai is literally dead

zinc ore Jun 4, 2025, 3:48 AM

#

Probably screws Anthropic over more than openAI

elder rapids Jun 4, 2025, 3:49 AM

#

ye that's true

#

anthropics main focus is coding and creative writing

#

and the moment that's taken away with cheaper costs

#

then it's all over tbh

#

o3 > opus as a creative writer btw

small haven Jun 4, 2025, 3:52 AM

#

oai has to drop o4 after this, its too damaging

#

they been sitting on it since january 🤷

#

naive valley Jun 4, 2025, 6:01 AM

#

A

#

A

wicked root Jun 4, 2025, 6:41 AM

#

How are overall rankings calculated?

#

I see different topics for each category

#

On dashboard it says text but in overall rankings they break things down into different descriptions and what not

#

Basically Id like to know if openAI could beat google by the end of this month

small haven Jun 4, 2025, 7:23 AM

#

yay

#

06-05

#

its officially coming tmmrw isnt it

#

or thursday..

#

i meant today

fleet lintel Jun 4, 2025, 7:35 AM

#

o3-pro has to be leagues ahead of the upcoming 2.5 pro launch, doesn't it?

calm sequoia Jun 4, 2025, 7:38 AM

#

Guys from here said-o1 pro was somewhat 5% better than the o1. So the o3-pro shall be on the same improvement magnitude. I bet it will be slightly better than the 2.5-PRO pre-nerf.

fleet lintel Jun 4, 2025, 7:39 AM

#

calm sequoia Guys from here said-o1 pro was somewhat 5% better than the o1. So the o3-pro sha...

that would be a bit disappointing. if that's the case, o3-pro wont be better than goldmane model

calm sequoia Jun 4, 2025, 7:40 AM

#

IDK what you mean but not a single anonymous model was better than pre-nerf gemini 2.5 PRO

#

They were just focused on coding

#

Which is too narrow for most people

#

And claude already has the market

fleet lintel Jun 4, 2025, 7:41 AM

#

my understanding is that gemini has started gaining market share (still behind though)

#

but my comment is not just about coding, I am looking for better general purpose model. I am fine with current models for coding unless there is big improvement in agentic workflows

calm sequoia Jun 4, 2025, 7:46 AM

#

I get you, same here. However, I'm using the models for research 3+ hours every day and nothing can compare to o3, especially o3+Claude

#

When pre-nerf gemini 2.5 pro was available, it was a go-to-model, but then it was nerfed

small haven Jun 4, 2025, 7:47 AM

#

calm sequoia Guys from here said-o1 pro was somewhat 5% better than the o1. So the o3-pro sha...

o3 pro is great, but not that great

#

o1 to o1 pro was a big jump

fleet lintel Jun 4, 2025, 7:47 AM

#

calm sequoia I get you, same here. However, I'm using the models for research 3+ hours every ...

interesting! what is your use-case of research? 3+ hours of research per day is insane

calm sequoia Jun 4, 2025, 7:48 AM

#

Signal processing mostly, where you have a lot of theory and some niche legacy code bases

small haven Jun 4, 2025, 7:48 AM

#

fleet lintel that would be a bit disappointing. if that's the case, o3-pro wont be better t...

id say o3-pro edges out goldmane, but deepthink demolishes o3 pro

calm sequoia Jun 4, 2025, 7:48 AM

#

The new gemini can't connect theory and code

#

Sometimes I show the results of o3 to gemini, and it reports 10+ problems. Then I analyze the problems and they turn out not to be valid points.

#

And 3 hours is not so much if one prompt means 2 to 10 minutes of thinking

#

have you guys tried Jules? It seems interesting, but crashes constantly

small haven Jun 4, 2025, 7:51 AM

#

yea jules crashes on me too

torn mantle Jun 4, 2025, 9:01 AM

#

@small haven you saw it right

#

they usually push updates friday no?

#

thursday/friday

ocean vortex Jun 4, 2025, 9:40 AM

#

small haven so not only accurate but faster

speed is meaningless though and depends on many things. You should look at output tokens if you want a metric to assess the model itself rather than their infra or server loads

#

they dropped free tier for API with this version, which typically means exactly that - faster. But that's nothing to do with the model

drifting thorn Jun 4, 2025, 9:44 AM

#

fleet lintel but my comment is not just about coding, I am looking for better general purpose...

true

ocean vortex Jun 4, 2025, 10:01 AM

#

calm sequoia Guys from here said-o1 pro was somewhat 5% better than the o1. So the o3-pro sha...

oh ffs... no one is 'nerfing' anything. New 2.5Pro is notably better for coding and function calling than the original. That is not 'nerfing'

#

performs objectively better on webdev arena, Aider, LiveCodeBench

#

if nerfing means making it better at coding, they did not nerf it enough as far as many people here are concerned lol

dusky aurora Jun 4, 2025, 10:11 AM

#

wintry tinsel You use your own prompt?

I'm not a tester, I'm a user

ocean vortex Jun 4, 2025, 10:15 AM

#

dusky aurora I'm not a tester, I'm a user

are you comparing it with Opus?

#

if so.. Opus is a bigger model. It is gonna be better with vibe test / writing

calm sequoia Jun 4, 2025, 11:11 AM

#

ocean vortex performs objectively better on webdev arena, Aider, LiveCodeBench

And worse at other things. It's nerfed if coding is not your main

#

Not saying it's a bad decision from their side

ocean vortex Jun 4, 2025, 11:12 AM

#

calm sequoia And worse at other things. It's nerfed if coding is not your main

it must be worse with no upsides or nerfed at the expense of being more safe to really call it that...

#

what they did is not nerfing catgrin

#

it's just better at coding while being slightly worse on some other tasks - this happens all the time, that's how models get updated without retraining the whole thing

fleet lintel Jun 4, 2025, 11:13 AM

#

ocean vortex it's just better at coding while being slightly worse on some other tasks - this...

True but it is still a nerf for non-coding purposes. Nerf doesn't necessarily mean that it was intentional

calm sequoia Jun 4, 2025, 11:13 AM

#

That's just your opinion

#

Dom you always argue for details or terminology 😄

#

#

Why is mini high removed

ocean vortex Jun 4, 2025, 11:14 AM

#

fleet lintel True but it is still a nerf for non-coding purposes. Nerf doesn't necessarily m...

Nah that's just perverse way of using that term IMO

#

lol

drifting thorn Jun 4, 2025, 11:16 AM

#

Not IMO

ocean vortex Jun 4, 2025, 11:16 AM

#

calm sequoia

where is this from?

drifting thorn Jun 4, 2025, 11:16 AM

#

It degrades in AIME and mrcr

ocean vortex Jun 4, 2025, 11:17 AM

#

I still have o4-mini-high 👀

ocean vortex Jun 4, 2025, 11:19 AM

#

drifting thorn It degrades in AIME and mrcr

MRCR you mean this...

#

can't exactly call that degradation. 1M context is within margin of error; 128k is less than 2% LOL

#

I don't think you would notice that or would even be able to replicate this exact measurement

#

for AIME25 that's true, but then again new model does well at another math benchmark USAMO...

#

and also, that Deepseek. It wasn't there not long ago that's insane score 🤯

tall summit Jun 4, 2025, 12:02 PM

#

none can even make headway on the medium/hard problems

ocean vortex Jun 4, 2025, 12:18 PM

#

I think only select labs contaminated for it. We need to wait for upcoming models and it should even itself out with everyone contaminated... 😎

#

it's still gonna be a valid metric as there's no way they will get close to 100% lol

#

but direct comparisons gonna be easier perhaps

calm sequoia Jun 4, 2025, 12:20 PM

#

Sadly too weak. Maybe R2 will make it.

ocean vortex Jun 4, 2025, 12:21 PM

#

though tbh I wouldn't expect massive changes. Like I don't see Claude scoring higher on math than o3 contamination or not

calm sequoia Jun 4, 2025, 12:22 PM

#

Logic, math, tool usage.

#

I can't comment on creative writing and such

ocean vortex Jun 4, 2025, 12:23 PM

#

2+2

calm sequoia Jun 4, 2025, 12:23 PM

#

Ah yes also I havent tried the 0528 version

torn mantle Jun 4, 2025, 12:23 PM

#

2+1

#

2+2

calm sequoia Jun 4, 2025, 12:23 PM

#

ocean vortex 2+2

2.01 + 2.11 🙂

torn mantle Jun 4, 2025, 12:23 PM

#

2+3

#

🙂

#

🙂

calm sequoia Jun 4, 2025, 12:24 PM

#

Write a Non uniform FFT (NUFFT) in dart
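
For context on what that prompt asks for, here is a minimal sketch in Python (not Dart) of the direct type-1 non-uniform DFT, i.e. the slow reference sum that a real NUFFT approximates with gridding plus an FFT. The function name `nudft_type1`, the mode ordering, and the sample values are assumptions for illustration, not anything a model in this chat produced.

```python
import cmath
import math


def nudft_type1(x, c, n_modes):
    """Direct type-1 non-uniform DFT: f[k] = sum_j c[j] * exp(-1j * k * x[j]).

    x: sample locations in radians (need not be equispaced)
    c: strengths at those locations, same length as x
    n_modes: number of output frequencies; k runs from -(n_modes // 2) upward
    Cost is O(len(x) * n_modes). This is the exact sum, not a fast algorithm.
    """
    if len(x) != len(c):
        raise ValueError("x and c must have the same length")
    k_min = -(n_modes // 2)
    return [
        sum(cj * cmath.exp(-1j * k * xj) for xj, cj in zip(x, c))
        for k in range(k_min, k_min + n_modes)
    ]


# Irregular sample locations on [0, 2*pi) (illustrative values only).
x = [math.tau * t for t in (0.02, 0.11, 0.46, 0.70, 0.92)]
c = [1.0, 0.5, -0.25, 2.0, 1.5]
modes = nudft_type1(x, c, n_modes=8)
print(len(modes))
```

A handy sanity check for any faster implementation: with equispaced points and all strengths equal to 1, only the k = 0 mode should be non-zero.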

#

Yeah, will check it later

torn mantle Jun 4, 2025, 12:25 PM

#

is it

calm sequoia Jun 4, 2025, 12:26 PM

#

It's all about thinking

#

Claude fails miserably without thinking, as well as all other gpts

#

Hmm maybe gpt bought MATLAB code library and my interpretation is highly skewed

sacred plaza Jun 4, 2025, 12:40 PM

#

Can y'all Elon stans explain his capitulation this week lol

drifting thorn Jun 4, 2025, 12:44 PM

#

IDK

#

I'm not an Elon stan but a Gemini and DeepSeek stan

#

WHY THEY CUT THE FREE API USAGE OF GEMINI 2.5 PRO!!!!!!

#

I CAN'T WRITE MY NOVEL WITH IT ANYMORE!!!

torn mantle Jun 4, 2025, 1:20 PM

#

asi?

#

or what

#

hopefully its added on lmarena

#

so we can try it a bit

fleet lintel Jun 4, 2025, 1:39 PM

#

why not?

torn mantle Jun 4, 2025, 1:43 PM

#

this is actually a big L for anthropic

#

https://x.com/_mohansolo/status/1930034960385356174

Varun Mohan (@_mohansolo)

With less than five days of notice, Anthropic decided to cut off nearly all of our first-party capacity to all Claude 3.x models. Given the short notice, we may see some short-term Claude 3.x model availability issues as we have very quickly ramped up capacity on other inference

#

how long do they think they can keep the lead?

sturdy mica Jun 4, 2025, 1:44 PM

#

i was hunting for new models, what is this? X preview?

#

it said it was gemini 1.5

#

srry thats not in screenshot accident

#

why would it say gemini 1.5 then

#

if grok 3.5 was built off of gemini 1.5 then its gonna be bad

dusky aurora Jun 4, 2025, 1:58 PM

#

ocean vortex if so.. Opus is a bigger model. It is gonna be better with vibe test / writing

Opus is much better with writing style (though it seems LMArena have restricted the sampling to less creative results recently) and overall the "written by a writer, not by an instructor" feel. Gemini is not lively enough in this respect

ocean vortex Jun 4, 2025, 2:02 PM

#

dusky aurora Opus is much better with writing style (though it seems LMArena have restricted ...

yeah what you are describing is typical big model vs smaller model kind of thing. Opus is the biggest so it will do well there. Followed by gpt4.5 perhaps

#

there's not much for Google to do to "catch up" to anything though. They already had ultra but chose not to pursue it (for obvious reasons)

#

that is typically very difficult to capture in benchmarks and big model takes more time to train so you can't update it nearly as frequently

sacred plaza Jun 4, 2025, 2:05 PM

#

drifting thorn WHY THEY CUT THE FREE API USAGE OF GEMINI 2.5 PRO!!!!!!

Good thing I left for Claude yesterday. Was a Gemini Stan for the last two months. I instantly regretted the ultra plan purchase once I realized 2.5 pro deep think was not live and all I got were cool but useless products like veo3 and mariner

ocean vortex Jun 4, 2025, 2:05 PM

#

it could be but Opus is newer and has reasoning. I think that's enough for it to write better than gpt4.5 tbh. It's big enough where any bigger is diminishing returns even for writing. But reasoning isn't

sturdy mica Jun 4, 2025, 2:05 PM

#

late path Jun 4, 2025, 2:09 PM

#

drifting thorn WHY THEY CUT THE FREE API USAGE OF GEMINI 2.5 PRO!!!!!!

You can still use gcp's $300 free credit as long as you have a credit card

ocean vortex Jun 4, 2025, 2:10 PM

#

late path You can still use gcp's $300 free credit as long as you have a credit card

wait. How do you redeem that? 🧐

late path Jun 4, 2025, 2:11 PM

#

just open your gcp project and link a valid payment method

elder rapids Jun 4, 2025, 2:15 PM

#

sturdy mica

we alr know goldmane is Google

#

you can just ask it "what model are you"

drifting thorn Jun 4, 2025, 2:22 PM

#

late path You can still use gcp's $300 free credit as long as you have a credit card

bruh i don't have a credit card...

elder rapids Jun 4, 2025, 2:24 PM

#

anyone else notice 2.5 pro in AIstudio is generating thinking tokens, and then outputting basically instantly

drifting thorn Jun 4, 2025, 2:25 PM

#

For my usage it stopped showing the thinking tokens before giving me the answer

#

There are about 20s of latency so I guess it is "thinking"

dusky aurora Jun 4, 2025, 2:25 PM

#

ocean vortex that is typically very difficult to capture in benchmarks and big model takes mo...

still, making the generated fiction dataset better is not a case of model size

ocean vortex Jun 4, 2025, 2:26 PM

#

come up with a prompt then which leads to better output than Opus then, in your opinion

drifting thorn Jun 4, 2025, 2:26 PM

#

Have you guys seen the MCMC by Google?

ocean vortex Jun 4, 2025, 2:26 PM

#

cause it's like 9/10 times Opus will win

#

you said "better" not bigger

#

that's what I replied to lol

drifting thorn Jun 4, 2025, 2:27 PM

#

It may be the next structural improvement of LLMs

📎 Learning_with_Local_Search_MCMC_Layers.pdf

ocean vortex Jun 4, 2025, 2:27 PM

#

catgrin

candid harbor Jun 4, 2025, 2:28 PM

#

I’ve run them side by side and preferred Opus w/ Thinking almost every single time

drifting thorn Jun 4, 2025, 2:35 PM

#

drifting thorn It may be the next structural improvement of LLMs

When I analyse this paper with LLMs, o4 mini (from ChatGPT) and 2.5 Pro both suggest that it can be added to the output layer for performance gain

boreal saddle Jun 4, 2025, 2:36 PM

#

In your opinion, what is the best publicly available LLM for creative writing and fictional story generation?

#

I dont know if it is Claude or O3.

#

Like, objectively?

#

Not subjectively.

small haven Jun 4, 2025, 2:39 PM

#

So my screenshot made it to the news lol

#

Check singularity reddit or even twitter

#

The timestamp is the same hahah

#

Yup

small haven Jun 4, 2025, 2:41 PM

#

small haven this has to be gemini 2.5 pro thats coming out on thursday?

Check

keen beacon Jun 4, 2025, 2:50 PM

#

small haven Check singularity reddit or even twitter

That's Leo's account I think on singularity

small haven Jun 4, 2025, 2:51 PM

#

keen beacon That's Leo's account I think on singularity

Thats funny haha

#

Didnt know it was gonna catch flame

drifting thorn Jun 4, 2025, 2:52 PM

#

Actually are you using Deep Research much?

civic flame Jun 4, 2025, 2:59 PM

#

small haven Thats funny haha

hope you don't mind 🙏😭

sonic tendon Jun 4, 2025, 3:10 PM

#

keen beacon That's Leo's account I think on singularity

on what?

brittle tiger Jun 4, 2025, 3:12 PM

#

Haven't been paying attention. Is consensus that goldmane is GA 2.5 pro and is coming imminently?

small haven Jun 4, 2025, 3:13 PM

#

torn mantle asi?

👎

balmy mist Jun 4, 2025, 3:14 PM

#

Is o3 pro really coming at 1 pm est?

small haven Jun 4, 2025, 3:15 PM

#

Prolly

#

But its not as great as id thought it to be

elder rapids Jun 4, 2025, 3:19 PM

#

small haven But its not as great as id thought it to be

fr?

#

is it an upgrade tho

fleet lintel Jun 4, 2025, 3:19 PM

#

small haven But its not as great as id thought it to be

how do you know?

#

I have high hopes from o3 pro.. please dont destroy them

sturdy mica Jun 4, 2025, 3:20 PM

#

whens this comin out bro

fleet lintel Jun 4, 2025, 3:20 PM

#

sturdy mica whens this comin out bro

rumors are tomorrow

sonic tendon Jun 4, 2025, 3:20 PM

#

sonic tendon on what?

ohh, the subreddit, mb

sturdy mica Jun 4, 2025, 3:20 PM

#

yay

balmy mist Jun 4, 2025, 3:30 PM

#

small haven But its not as great as id thought it to be

Y u say that?

keen beacon Jun 4, 2025, 3:30 PM

#

He has access to it

balmy mist Jun 4, 2025, 3:30 PM

#

fleet lintel rumors are tomorrow

Can we test it on arena or sum

balmy mist Jun 4, 2025, 3:31 PM

#

keen beacon He has access to it

Wow GG OA lol

misty vault Jun 4, 2025, 3:31 PM

#

is this better than claude 4 opus thinking

fleet lintel Jun 4, 2025, 3:34 PM

#

balmy mist Can we test it on arena or sum

i think it's still on arena

elder rapids Jun 4, 2025, 3:37 PM

#

misty vault is this better than claude 4 opus thinking

yes

keen beacon Jun 4, 2025, 3:37 PM

#

cheaper too

elder rapids Jun 4, 2025, 3:37 PM

#

I hope it's Google

keen beacon Jun 4, 2025, 3:37 PM

#

it is

elder rapids Jun 4, 2025, 3:37 PM

#

although I'm disappointed now they're employing so much stronger limits

keen beacon Jun 4, 2025, 3:37 PM

#

wow its much better than opus thinking on aider lol

wicked root Jun 4, 2025, 3:38 PM

#

Hold up

#

OpenAI’s releasing a new model tmrw?

misty vault Jun 4, 2025, 3:38 PM

#

omaygot

#

google fix all gemini 2.5 pro issues??

wicked root Jun 4, 2025, 3:39 PM

#

Who’s releasing what tmrw?

keen beacon Jun 4, 2025, 3:39 PM

#

goldmane google ga 2.5 pro

#

5th

wicked root Jun 4, 2025, 3:39 PM

#

Wtf is goldman’s google ga?

keen beacon Jun 4, 2025, 3:40 PM

#

generally available 2.5 pro

elder rapids Jun 4, 2025, 3:40 PM

#

misty vault google fix all gemini 2.5 pro issues??

seems to be the case via goldmane, and an actual improvement

wicked root Jun 4, 2025, 3:40 PM

#

So gemini 2.5?

misty vault Jun 4, 2025, 3:40 PM

#

it wasnt statement bro it was question

keen beacon Jun 4, 2025, 3:40 PM

#

its a new revision of it. its called goldmane in the arena

elder rapids Jun 4, 2025, 3:40 PM

#

and if you want to feel really safe, much better than 0325

misty vault Jun 4, 2025, 3:40 PM

#

elder rapids seems to be the case via goldmane, and an actual improvement

ty

elder rapids Jun 4, 2025, 3:40 PM

#

I'm completely ignoring coding too

keen beacon Jun 4, 2025, 3:41 PM

#

lmao just use the best model dont be a fan of companies tbh

elder rapids Jun 4, 2025, 3:41 PM

#

ye

misty vault Jun 4, 2025, 3:41 PM

#

best models are subjective after gpt-4-0314😔

golden ocean Jun 4, 2025, 3:42 PM

#

Fr

#

Thats not what he meant

elder rapids Jun 4, 2025, 3:50 PM

#

or a skill issue

#

the difference in the models becomes more and more apparent as time goes on

keen beacon Jun 4, 2025, 3:51 PM

#

normies can tell the difference increasingly less

elder rapids Jun 4, 2025, 3:52 PM

#

keen beacon normies can tell the difference increasingly less

perfect Information remains the same perfect information, only when you pry it you'll know the differences, and they're becoming massive

#

when the first 4o released I was stumped

keen beacon Jun 4, 2025, 3:52 PM

#

elder rapids perfect Information remains the same perfect information, only when you pry it y...

i agree tbh

elder rapids Jun 4, 2025, 3:52 PM

#

but when 1.5 pro 002 released I was like, this is really good

#

but then 2.0 pro released, I was really surprised

#

gpt 4.5 was so different

#

3.6 sonnet was crazy too

keen beacon Jun 4, 2025, 3:53 PM

#

yeah i liked 3.6 sonnet

elder rapids Jun 4, 2025, 3:53 PM

#

and then 2.5 flash is a TINY model, and nonthinking it's actually really intelligent

#

seems like you just had to try it out tbh, it's the same kind of Claude-ness

keen beacon Jun 4, 2025, 3:54 PM

#

i didnt like 2.0 pro either tbh. but evaluating it was interesting

elder rapids Jun 4, 2025, 3:55 PM

#

idk man through time it's becoming more objective

keen beacon Jun 4, 2025, 3:55 PM

#

test it on your use case, if it performs well, use that model 🤷

elder rapids Jun 4, 2025, 3:55 PM

#

I'm confused lmao

#

it wasnt "crazy" on paper

#

it was the best non reasoning model

#

but it was just an all rounder

keen beacon Jun 4, 2025, 3:56 PM

#

i preferred sonnet 3.6 back then tbhh

elder rapids Jun 4, 2025, 3:56 PM

#

we already had reasoning models that gapped

#

yes bro we know what paper means, but that doesn't make any sense when the benchmarks just weren't particularly high in the first place

#

there's no reason to contrast to make the distinction imo

jaunty delta Jun 4, 2025, 3:58 PM

#

whats this?

keen beacon Jun 4, 2025, 3:58 PM

#

i also have it

#

wtf

#

probing cut off

#

it has a weird context window

elder rapids Jun 4, 2025, 3:58 PM

#

ye but then that point falls, because many people would consider 1206 to still be the best non reasoner to date

balmy mist Jun 4, 2025, 3:58 PM

#

fleet lintel i think it's still on arena

What’s the model name and it’s on web dev too?

keen beacon Jun 4, 2025, 3:58 PM

#

oh

#

its ga 2.5 pro

elder rapids Jun 4, 2025, 4:00 PM

#

YO

#

KINGFALL IS SMART

fleet lintel Jun 4, 2025, 4:00 PM

#

elder rapids KINGFALL IS SMART

how are you using it?

keen beacon Jun 4, 2025, 4:00 PM

#

confidential

#

kingfall

elder rapids Jun 4, 2025, 4:00 PM

#

just prompting it

fleet lintel Jun 4, 2025, 4:00 PM

#

elder rapids just prompting it

where?

#

it's on aistudio

small haven Jun 4, 2025, 4:01 PM

#

Why are u permahating gemini 😭

keen beacon Jun 4, 2025, 4:01 PM

#

i don't know if its actually 2.5 pro for sure yet, but it has the gemini 2.5 pretraining cut off

small haven Jun 4, 2025, 4:01 PM

#

So its actually coming out tmmrw

keen beacon Jun 4, 2025, 4:02 PM

#

and it also has thinking mode/thinking budget which is planned for ga 2.5 pro

elder rapids Jun 4, 2025, 4:02 PM

#

yep

#

what is kingfall tho?

#

it's not thinking for very long lmao

fleet lintel Jun 4, 2025, 4:02 PM

#

it could be ...

keen beacon Jun 4, 2025, 4:03 PM

#

65k

ornate agate Jun 4, 2025, 4:03 PM

#

Yeah that isn’t a pro model

keen beacon Jun 4, 2025, 4:03 PM

#

ornate agate Yeah that isn’t a pro model

but i think its intentionally limited for now

small haven Jun 4, 2025, 4:03 PM

#

keen beacon 65k

Wtf not a million

fleet lintel Jun 4, 2025, 4:03 PM

#

it's thinking for long time

keen beacon Jun 4, 2025, 4:04 PM

#

the first 2.0 pro experimental early revision had 32k on aistudio iirc

elder rapids Jun 4, 2025, 4:04 PM

#

ornate agate Yeah that isn’t a pro model

doesn't really matter tho, since flash thinking was like 70k

ornate agate Jun 4, 2025, 4:04 PM

#

1m ctx is a feature of the pro models there’s just no way it’s a pro model that’s releasing tomorrow if it has hyper nerfed ctx like that

keen beacon Jun 4, 2025, 4:04 PM

#

ornate agate 1m ctx is a feature of the pro models there’s just no way it’s a pro model that’...

2m ctx actually

civic flame Jun 4, 2025, 4:04 PM

#

lol someone messed up

#

also it got this question wrong

#

the only model to get it right on occasion was the OG 2.5 pro

#

doubt lmao

small haven Jun 4, 2025, 4:05 PM

#

civic flame lol someone messed up

Is this real fck turning on the pc

keen beacon Jun 4, 2025, 4:05 PM

#

ornate agate 1m ctx is a feature of the pro models there’s just no way it’s a pro model that’...

they are artificially limiting the context window imho

#

because this is a marketing stunt

#

wym

#

they already do that

#

2.5 pro can do 2m context

#

only 1m context is available for now

#

yes, but they are not gonna host more than 2m

elder rapids Jun 4, 2025, 4:06 PM

#

@keen beacon it has extraordinary physical understanding

keen beacon Jun 4, 2025, 4:07 PM

#

ngl kingfall is such a dramatic name

#

cringe

small haven Jun 4, 2025, 4:07 PM

#

keen beacon ngl kingfall is such a dramatic name

Openai falls

ornate agate Jun 4, 2025, 4:07 PM

#

Stans for all the providers really. Especially American ones

fleet lintel Jun 4, 2025, 4:07 PM

#

kingfall is definitely smart

small haven Jun 4, 2025, 4:08 PM

#

Gap has been closed and frontrunned

#

Not excited for a router

ornate agate Jun 4, 2025, 4:08 PM

#

King fall sounds like it’s supposed to be an OpenAI destroyer so yeah either they didn’t think properly before pushing that as the name or

fleet lintel Jun 4, 2025, 4:08 PM

#

stop please... you are too much of a fanboy.

#

it's quite annoying

civic flame Jun 4, 2025, 4:09 PM

#

ran a very quick GeoGuessr test. it got very close, arrow = actual location

#

for reference the only thing it had to go on was this single image

elder rapids Jun 4, 2025, 4:09 PM

#

civic flame ran a very quick GeoGuessr test. it got very close, arrow = actual location

no search?

civic flame Jun 4, 2025, 4:10 PM

#

elder rapids no search?

nope

small haven Jun 4, 2025, 4:10 PM

#

Finally some gemini love lol

civic flame Jun 4, 2025, 4:10 PM

#

someone can test for me, i don't have access (on chatgpt anyway)

fleet lintel Jun 4, 2025, 4:10 PM

#

small haven Finally some gemini love lol

he thinks it's o3 .. lol

civic flame Jun 4, 2025, 4:10 PM

#

you can disable searching

#

through the custom instructions window

small haven Jun 4, 2025, 4:10 PM

#

BAITED

civic flame Jun 4, 2025, 4:10 PM

#

scroll to the bottom

keen beacon Jun 4, 2025, 4:11 PM

#

kingfall is so slow

#

nah the initial latency per request is wild. its not slow after that

small haven Jun 4, 2025, 4:11 PM

#

Talking out ur ass 😂

elder rapids Jun 4, 2025, 4:11 PM

#

@keen beacon @civic flame it's pretty slow but its not thinking that much

#

so it

elder rapids Jun 4, 2025, 4:11 PM

#

keen beacon nah the initial latency per request is wild. its not slow after that

damn too slow

#

but yeah

#

that's what I was about to say

civic flame Jun 4, 2025, 4:12 PM

#

I'm surprised they haven't pulled it yet

high ginkgo Jun 4, 2025, 4:12 PM

#

craig always talks out of his ass

elder rapids Jun 4, 2025, 4:12 PM

#

it's not starting up very fast, but that's basically all the time it takes for it to output things, it doesn't have a very long thought process

civic flame Jun 4, 2025, 4:12 PM

#

maybe this is intentional /j

keen beacon Jun 4, 2025, 4:12 PM

#

it is

small haven Jun 4, 2025, 4:12 PM

#

Craig is a certified gemini fan if names are redacted

elder rapids Jun 4, 2025, 4:12 PM

#

small haven Craig is a certified gemini fan if names are redacted

deadass

keen beacon Jun 4, 2025, 4:12 PM

#

nah the time it thinks is short

elder rapids Jun 4, 2025, 4:12 PM

#

😭

#

yep, thank God tho

civic flame Jun 4, 2025, 4:13 PM

#

hopefully this stays for long enough for me to test it more

#

either way theoretically I'll only be waiting a day for it to actually release

keen beacon Jun 4, 2025, 4:13 PM

#

its erroring out rn

#

for me

civic flame Jun 4, 2025, 4:13 PM

#

yup it's over

#

fun while it lasted

keen beacon Jun 4, 2025, 4:14 PM

#

i guess the reason its limited rn is because 1. marketing 2. they've still allocated the resources to preview 2.5 pro

fleet lintel Jun 4, 2025, 4:14 PM

#

ahhh 😦

small haven Jun 4, 2025, 4:14 PM

#

agh

#

King falls tomorrow

keen beacon Jun 4, 2025, 4:14 PM

#

it works for me

#

again i think its being overloaded

fleet lintel Jun 4, 2025, 4:15 PM

#

for few tests I could do, I think it was better than Goldmane

keen beacon Jun 4, 2025, 4:15 PM

#

yeah im pretty sure its ga 2.5 pro now

#

only 2.5 pro gets this apparently

civic flame Jun 4, 2025, 4:16 PM

#

okay I think this is intentional

elder rapids Jun 4, 2025, 4:16 PM

#

65,536 which is the exact same as the output capacity

civic flame Jun 4, 2025, 4:16 PM

#

it's def still there, just high error rate

small haven Jun 4, 2025, 4:16 PM

#

Yea its demolishing oai so fishy

keen beacon Jun 4, 2025, 4:16 PM

#

civic flame it's def still there, just high error rate

its being overloaded

elder rapids Jun 4, 2025, 4:16 PM

#

so it's definitely temporary

#

lol

keen beacon Jun 4, 2025, 4:16 PM

#

the majority of resources are still allocated for 2.5 pro preview

elder rapids Jun 4, 2025, 4:16 PM

#

that's just a placeholder

#

hope no one posted it on the subreddits

civic flame Jun 4, 2025, 4:17 PM

#

elder rapids so it's definitely temporary

well this is probably to gauge reaction or just build hype

elder rapids Jun 4, 2025, 4:17 PM

#

that would suck

keen beacon Jun 4, 2025, 4:17 PM

#

civic flame well this is probably to gauge reaction or just build hype

clearly to build hype the name is so dramatic 🤣

civic flame Jun 4, 2025, 4:17 PM

#

Google of all companies would've had this scrubbed in 30 seconds if it wasnt intentional

keen beacon Jun 4, 2025, 4:17 PM

#

i wonder whose idea this was

civic flame Jun 4, 2025, 4:17 PM

#

third times the charm

small haven Jun 4, 2025, 4:17 PM

#

Aye is there anything like claude code but for gemini gonna be needing that tmmrw

elder rapids Jun 4, 2025, 4:17 PM

#

of course Leo posts it

#

😭

late path Jun 4, 2025, 4:18 PM

#

wheres our google insider

keen beacon Jun 4, 2025, 4:18 PM

#

free karma 🤣

cedar tide Jun 4, 2025, 4:18 PM

#

discord clone by kingfall (good animation )

civic flame Jun 4, 2025, 4:18 PM

#

elder rapids of course Leo posts it

i post everything 🙏

keen beacon Jun 4, 2025, 4:18 PM

#

stop bringing attention to it its constantly being overloaded 🤣

elder rapids Jun 4, 2025, 4:19 PM

#

ong

#

what if they're trying to overshadow o3 pro lmao

civic flame Jun 4, 2025, 4:19 PM

#

lol it's gone

elder rapids Jun 4, 2025, 4:19 PM

#

absolutely insane

civic flame Jun 4, 2025, 4:19 PM

#

wait what

keen beacon Jun 4, 2025, 4:19 PM

#

someone posted it

civic flame Jun 4, 2025, 4:20 PM

#

keen beacon someone posted it

where

small haven Jun 4, 2025, 4:20 PM

#

So a million tokens cool

elder rapids Jun 4, 2025, 4:20 PM

#

the price 🤤

keen beacon Jun 4, 2025, 4:20 PM

#

civic flame where

tagged everyone in the r/gemini discord server

elder rapids Jun 4, 2025, 4:20 PM

#

keen beacon tagged everyone in the r/gemini discord server

inv

keen beacon Jun 4, 2025, 4:21 PM

#

https://discord.gg / j6ygzd9rQy

small haven Jun 4, 2025, 4:21 PM

#

keen beacon tagged everyone in the r/gemini discord server

Link

brittle tiger Jun 4, 2025, 4:22 PM

#

https://x.com/testingcatalog/status/1930298521078399226?t=oOpGMpLOgGzB58q2a4KXmg&s=19

TestingCatalog News 🗞 (@testingcatalog)

Kingfall is killing it at the "SVG robot benchmark"

WOW 🤯

cedar tide Jun 4, 2025, 4:22 PM

#

fake no ?

keen beacon Jun 4, 2025, 4:22 PM

#

no idea lmao

#

kingfall is 2.5 pro

#

im pretty sure i think

civic flame Jun 4, 2025, 4:23 PM

#

brittle tiger https://x.com/testingcatalog/status/1930298521078399226?t=oOpGMpLOgGzB58q2a4KXmg...

woah

small haven Jun 4, 2025, 4:23 PM

#

Will the king fall today or tmmrw

keen beacon Jun 4, 2025, 4:24 PM

#

it has the 1. planned thinking mode/thinking budget 2. apparently knows stuff that only 2.5 pro does 3. gemini 2.5 cut off

elder rapids Jun 4, 2025, 4:24 PM

#

brittle tiger https://x.com/testingcatalog/status/1930298521078399226?t=oOpGMpLOgGzB58q2a4KXmg...

has to be fake

small haven Jun 4, 2025, 4:24 PM

#

I think it’s safe to buy google on poly

#

Safer than bonds

fleet lintel Jun 4, 2025, 4:24 PM

#

elder rapids has to be fake

because too good?

small haven Jun 4, 2025, 4:25 PM

#

Millions rupees

keen beacon Jun 4, 2025, 4:25 PM

#

i need goldmane rn 😭

small haven Jun 4, 2025, 4:25 PM

#

Same

#

It should come out today plz

#

Aka Kingfall

#

Sam falls

elder rapids Jun 4, 2025, 4:26 PM

#

fleet lintel because too good?

yep

fleet lintel Jun 4, 2025, 4:26 PM

#

keen beacon i need goldmane rn 😭

tomorrow.. I am 90% sure

elder rapids Jun 4, 2025, 4:27 PM

#

also checking the Google server

#

it's prolly not fake

#

the svgs are really good lmao

small haven Jun 4, 2025, 4:27 PM

#

Other way around

elder rapids Jun 4, 2025, 4:27 PM

#

it has insane spatial understanding

#

and physics understanding

#

I was testing it with puzzles

#

and it was cracking things that o3 was struggling with, very easily

fleet lintel Jun 4, 2025, 4:28 PM

#

checking /r/gemini discord. Folks pasted amazing results from Kingfall

#

Dayum.. i m getting hyped

keen beacon Jun 4, 2025, 4:28 PM

#

o.o

#

is that being released or later?

small haven Jun 4, 2025, 4:28 PM

#

Is it today or Thursday

civic flame Jun 4, 2025, 4:28 PM

#

oh?

keen beacon Jun 4, 2025, 4:28 PM

#

because only goldmane is on the arena, kinda misrepresenting it if its released

civic flame Jun 4, 2025, 4:28 PM

#

goddamnit

patent aspen Jun 4, 2025, 4:28 PM

#

No information

civic flame Jun 4, 2025, 4:28 PM

#

google should just give us the bleeding edge 🙄

elder rapids Jun 4, 2025, 4:29 PM

#

that's what I was thinking too

small haven Jun 4, 2025, 4:29 PM

#

Wait so kingfall is a separate release from Goldman

elder rapids Jun 4, 2025, 4:29 PM

#

if it's even better than goldmane

#

idk man

keen beacon Jun 4, 2025, 4:29 PM

#

why wouldnt they release the best version instead of committing to goldmane tbh (i guess arena results), it's easier to serve later on if u just pick the best one instead of releasing a later revision

civic flame Jun 4, 2025, 4:29 PM

#

if this model is in the dogfood stage, i would expect it to turn up on arena soon

#

it's not

elder rapids Jun 4, 2025, 4:30 PM

#

def not

fleet lintel Jun 4, 2025, 4:30 PM

#

elder rapids if it's even better than goldmane

did 4 queries, it was better

elder rapids Jun 4, 2025, 4:30 PM

#

fleet lintel did 4 queries, it was better

that's what I was thinking

#

but I can't say that

#

without enough testing

civic flame Jun 4, 2025, 4:30 PM

#

well yes

#

lol what

#

false

fleet lintel Jun 4, 2025, 4:38 PM

#

i shouldn't have mentioned about /r/gemini @deep adder is going to spread his gemini hate there now 🙂

#

jk bro 🙂

civic flame Jun 4, 2025, 4:39 PM

#

kingfall also continues this naming trend

#

you work at google?

fleet lintel Jun 4, 2025, 4:39 PM

#

civic flame kingfall also continues this naming trend

what is the naming trend? I could never figure out. please explain

civic flame Jun 4, 2025, 4:39 PM

#

hmmm

civic flame Jun 4, 2025, 4:40 PM

#

fleet lintel what is the naming trend? I could never figure out. please explain

let me see if i can find the graphic

#

#

this naming trend

#

not my image

fleet lintel Jun 4, 2025, 4:40 PM

#

these are just names.. but what is the logic behind the names?

elder rapids Jun 4, 2025, 4:41 PM

#

fleet lintel these are just names.. but what is the logic behind the names?

fantasy

#

that's it

fleet lintel Jun 4, 2025, 4:41 PM

#

i think they just ask gemini to create next name in the list 🙂

misty vault Jun 4, 2025, 4:42 PM

#

elder rapids yes

by how much

elder rapids Jun 4, 2025, 4:43 PM

#

misty vault by how much

we just tested kingfall rn and it seemed to just have leapfrogged the whole competition

#

yo wait

#

what if they don't plan to release

#

this model

#

and just wanted to leave people speculating

#

over o3 pro

#

lmao

#

that would suck

keen beacon Jun 4, 2025, 4:43 PM

#

apparently it was just a mistake lol

small haven Jun 4, 2025, 4:43 PM

#

Link

elder rapids Jun 4, 2025, 4:44 PM

#

keen beacon apparently it was just a mistake lol

"confidential" "kingsfall" all of this is manual btw

keen beacon Jun 4, 2025, 4:44 PM

#

elder rapids "confidential" "kingsfall" all of this is manual btw

yeah but brian said it was a mistake i think

elder rapids Jun 4, 2025, 4:44 PM

#

I disagree it was a mistake

#

he doesn't have that access

keen beacon Jun 4, 2025, 4:44 PM

#

i think i trust him on this part

elder rapids Jun 4, 2025, 4:45 PM

#

keen beacon i think i trust him on this part

he's literally said he has no idea, and if he's not the person who did it

small haven Jun 4, 2025, 4:45 PM

#

keen beacon yeah but brian said it was a mistake i think

Someone getting fired over kingfall oh no

elder rapids Jun 4, 2025, 4:45 PM

#

then it's not going to be valid

misty vault Jun 4, 2025, 4:45 PM

#

bro stop

small haven Jun 4, 2025, 4:46 PM

#

This ain’t pro

misty vault Jun 4, 2025, 4:46 PM

#

i have crush on sydney ngl

#

yeah, bing chat

elder rapids Jun 4, 2025, 4:47 PM

#

I don't think the consistency of the statements matter, and ironically I specifically don't trust this one

#

especially BECAUSE he can't have that access

#

something he's said himself

small haven Jun 4, 2025, 4:48 PM

#

Is there a tool coming out like claude code but for Gemini models

elder rapids Jun 4, 2025, 4:48 PM

#

we don't

#

nobody thinks he actually works at Google

misty vault Jun 4, 2025, 4:48 PM

#

didnt he say he knows people

small haven Jun 4, 2025, 4:48 PM

#

Its materializing

misty vault Jun 4, 2025, 4:49 PM

#

he just always says "we" as if speaking for google

elder rapids Jun 4, 2025, 4:49 PM

#

I wouldn't think so

#

I would think it's actually inevitable for workers like that

#

to be here

#

ye

#

in a time like AI? lmao

#

yeah def

#

be vocal

misty vault Jun 4, 2025, 4:51 PM

#

little does bro know

civic flame Jun 4, 2025, 4:53 PM

#

do you know what the internal timeline for model testing usually is? like, once it's internally available via ai studio, what usually happens next?

keen beacon Jun 4, 2025, 4:55 PM

#

probably shouldnt give out too much just in case tbh

small haven Jun 4, 2025, 4:55 PM

#

Wait what

#

Lemme know if its pro

elder rapids Jun 4, 2025, 4:56 PM

#

btw I got kingfalls trace

keen beacon Jun 4, 2025, 4:57 PM

#

omg we90 will give craig bing chat access

civic flame Jun 4, 2025, 4:57 PM

#

!?

#

sam isn't gonna let you tap craig

#

hence why it was a joke

#

:D

torn mantle Jun 4, 2025, 4:57 PM

#

leo

civic flame Jun 4, 2025, 4:57 PM

#

hi!

torn mantle Jun 4, 2025, 4:57 PM

#

@civic flame

elder rapids Jun 4, 2025, 4:57 PM

#

says who

torn mantle Jun 4, 2025, 4:58 PM

#

how are u?

civic flame Jun 4, 2025, 4:58 PM

#

i am meh

torn mantle Jun 4, 2025, 4:58 PM

#

whyyyyyyyy

elder rapids Jun 4, 2025, 4:58 PM

#

if Sam wanted you hed have you Craig

torn mantle Jun 4, 2025, 4:58 PM

#

😦

#

is it because of @deep adder ?

civic flame Jun 4, 2025, 4:58 PM

#

if google drop 2.5 pro GA however..

#

all my problems disappear

civic flame Jun 4, 2025, 4:58 PM

#

torn mantle is it because of @deep adder ?

lol no

#

personal issues

torn mantle Jun 4, 2025, 4:58 PM

#

oh you tried kingfall?

civic flame Jun 4, 2025, 4:58 PM

#

i tried kingfall yes but that is not related to my personal issues

torn mantle Jun 4, 2025, 4:58 PM

#

civic flame personal issues

aa i see... hope everything's fine

civic flame Jun 4, 2025, 4:58 PM

#

torn mantle aa i see... hope everything's fine

thanks

#

what about you?

torn mantle Jun 4, 2025, 4:59 PM

#

feeling alright

civic flame Jun 4, 2025, 4:59 PM

#

glad

torn mantle Jun 4, 2025, 4:59 PM

#

=))

sonic tendon Jun 4, 2025, 5:00 PM

#

hi all :3

#

what's kingfall

civic flame Jun 4, 2025, 5:00 PM

#

stream started

civic flame Jun 4, 2025, 5:00 PM

#

sonic tendon what's kingfall

accidentally released internal new gemini model

sonic tendon Jun 4, 2025, 5:00 PM

#

civic flame accidentally released internal new gemini model

damn

#

where,,,

civic flame Jun 4, 2025, 5:01 PM

#

there it is

civic flame Jun 4, 2025, 5:01 PM

#

sonic tendon where,,,

ai studio

sonic tendon Jun 4, 2025, 5:01 PM

#

civic flame ai studio

o

civic flame Jun 4, 2025, 5:01 PM

#

probably-ga-2-5-pro-accidentally-briefly-appeared-on-ai-v0-9lcaevllqx4f1.png

sonic tendon Jun 4, 2025, 5:01 PM

#

is it gone now?

civic flame Jun 4, 2025, 5:01 PM

#

yup it took em a little bit

keen beacon Jun 4, 2025, 5:01 PM

#

kingfall is asi

sonic tendon Jun 4, 2025, 5:01 PM

#

civic flame

whaaaaat is going on with your fonts dude

civic flame Jun 4, 2025, 5:01 PM

#

sonic tendon whaaaaat is going on with your fonts dude

blame microsoft edge on android

misty vault Jun 4, 2025, 5:01 PM

#

keen beacon kingfall is asi

bro accepted the agi/asi meme finally

sonic tendon Jun 4, 2025, 5:02 PM

#

won't be watching the stream as I am getting on a plane soon, but glhf to all who are

sonic tendon Jun 4, 2025, 5:02 PM

#

civic flame blame microsoft edge on android

why are you Microsoft Edging on android

balmy mist Jun 4, 2025, 5:02 PM

#

OA fell off, no o3 pro today wow

sonic tendon Jun 4, 2025, 5:02 PM

#

well, I have to get through security and stuff

small haven Jun 4, 2025, 5:02 PM

#

balmy mist OA fell off, no o3 pro today wow

Thursday

sonic tendon Jun 4, 2025, 5:02 PM

#

hard to watch a stream while I'm doing that

keen beacon Jun 4, 2025, 5:03 PM

#

fck o3 pro

#

ga 2.5 pro ✅

civic flame Jun 4, 2025, 5:03 PM

#

sonic tendon why are you Microsoft Edging on android

i don't usually

balmy mist Jun 4, 2025, 5:03 PM

#

keen beacon ga 2.5 pro ✅

ga?

fleet lintel Jun 4, 2025, 5:03 PM

#

is mr.twink guy there on livestream?

drifting thorn Jun 4, 2025, 5:03 PM

#

Btw what is the best deep research among Qwen, Google and OpenAI?

civic flame Jun 4, 2025, 5:03 PM

#

fleet lintel is mr.twink guy there on livestream?

nope

#

this is too unimportant for him

keen beacon Jun 4, 2025, 5:03 PM

#

balmy mist ga?

generally available 2.5 pro, aka goldmane, its insane

civic flame Jun 4, 2025, 5:03 PM

#

drifting thorn Btw what is the best deep research among Qwen, Google and OpenAI?

gemini's is the best one

#

imo

sonic tendon Jun 4, 2025, 5:03 PM

#

drifting thorn Btw what is the best deep research among Qwen, Google and OpenAI?

TIL Qwen has deep research now

fleet lintel Jun 4, 2025, 5:03 PM

#

civic flame nope

ok.. then it's going to be useless announcement

balmy mist Jun 4, 2025, 5:04 PM

#

keen beacon generally available 2.5 pro, aka goldmane, its insane

is it still on web dev?

#

i used it once

keen beacon Jun 4, 2025, 5:04 PM

#

balmy mist is it still on web dev?

yea. it gets 86% on aider

sonic tendon Jun 4, 2025, 5:04 PM

#

fleet lintel ok.. then it's going to be useless announcement

i mean, i feel like the stream title strongly implied that anyway lol

civic flame Jun 4, 2025, 5:04 PM

#

keen beacon yea. it gets 86% on aider

now wait

#

what if the aider model being benchmarked was

#

kingfall

keen beacon Jun 4, 2025, 5:04 PM

#

nah

civic flame Jun 4, 2025, 5:04 PM

#

yes i doubt it but

#

it would be funny

sonic tendon Jun 4, 2025, 5:04 PM

#

keen beacon yea. it gets 86% on aider

what does preview g2.5p bench

keen beacon Jun 4, 2025, 5:04 PM

#

yeah 🤣

small haven Jun 4, 2025, 5:04 PM

#

Imagine kingfall higher

keen beacon Jun 4, 2025, 5:04 PM

#

sonic tendon what does preview g2.5p bench

72% i think

civic flame Jun 4, 2025, 5:04 PM

#

sonic tendon what does preview g2.5p bench

76%

elder rapids Jun 4, 2025, 5:04 PM

#

even if it's not kingfall who cares

civic flame Jun 4, 2025, 5:04 PM

#

lol

elder rapids Jun 4, 2025, 5:04 PM

#

lmao

balmy mist Jun 4, 2025, 5:04 PM

#

keen beacon yea. it gets 86% on aider

wow thats wild

civic flame Jun 4, 2025, 5:05 PM

#

2*

torn mantle Jun 4, 2025, 5:05 PM

#

kingfall is probably like a better goldmane ver

#

recent checkpoint

#

will be added soon in the arena

elder rapids Jun 4, 2025, 5:05 PM

#

even if we get goldmane

civic flame Jun 4, 2025, 5:05 PM

#

keen beacon yea. it gets 86% on aider

82

elder rapids Jun 4, 2025, 5:05 PM

#

it doesn't matter

#

they're both the best

sonic tendon Jun 4, 2025, 5:05 PM

#

civic flame 82

85

keen beacon Jun 4, 2025, 5:05 PM

#

wait its 76% (recent 2.5 pro) 72% (og 2.5 pro)

sonic tendon Jun 4, 2025, 5:05 PM

#

civic flame 76%

74

torn mantle Jun 4, 2025, 5:05 PM

#

elder rapids it doesn't matter

you tried kingfall?

civic flame Jun 4, 2025, 5:05 PM

#

lol im confused

elder rapids Jun 4, 2025, 5:05 PM

#

torn mantle you tried kingfall?

yep

misty vault Jun 4, 2025, 5:05 PM

#

sonic tendon 85

87

sonic tendon Jun 4, 2025, 5:05 PM

#

67 88

elder rapids Jun 4, 2025, 5:05 PM

#

let me tell you about it

#

it's GOOD

torn mantle Jun 4, 2025, 5:05 PM

#

elder rapids let me tell you about it

tell me

sonic tendon Jun 4, 2025, 5:05 PM

#

23

high ginkgo Jun 4, 2025, 5:05 PM

#

misty vault 87

92

civic flame Jun 4, 2025, 5:05 PM

#

yeah kingfall is good

torn mantle Jun 4, 2025, 5:05 PM

#

show me

fleet lintel Jun 4, 2025, 5:05 PM

#

86 seems too good to be true.. i have my doubts

civic flame Jun 4, 2025, 5:05 PM

#

still failed my one hard question though 🥀

keen beacon Jun 4, 2025, 5:06 PM

#

86.2%

sonic tendon Jun 4, 2025, 5:06 PM

#

civic flame still failed my one hard question though 🥀

oo spill (?)

small haven Jun 4, 2025, 5:06 PM

#

civic flame still failed my one hard question though 🥀

Prompt

fleet lintel Jun 4, 2025, 5:06 PM

#

keen beacon 86.2%

could it be a fluke?

civic flame Jun 4, 2025, 5:06 PM

#

There are 2022 users on a social network called Mathbook, and some of them are Mathbook-friends. (On Mathbook, friendship is always mutual and permanent.)

Starting now, Mathbook will only allow a new friendship to be formed between two users if they have at least two friends in common. What is the minimum number of friendships that must already exist so that every user could eventually become friends with every other user?

#

the answer is 3031

#

no model has ever got it except nebula on the arena once
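
(side note for anyone who wants to poke at the Mathbook rule by hand: a minimal brute-force sketch in Python, assuming users are numbered 0..n-1 and friendships are given as pairs. The function names `closure`, `can_complete` and `min_friendships` are made up for illustration. It only works for tiny n, so it is a sanity check on the rule and says nothing directly about the 2022-user case.)

```python
from itertools import combinations


def closure(n, edges):
    """Keep adding friendships between users who share at least two friends."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for u, v in combinations(range(n), 2):
            # A new friendship is allowed only with >= 2 common friends.
            if v not in adj[u] and len(adj[u] & adj[v]) >= 2:
                adj[u].add(v)
                adj[v].add(u)
                changed = True
    return adj


def can_complete(n, edges):
    """True if everyone can eventually become friends with everyone else."""
    return all(len(friends) == n - 1 for friends in closure(n, edges))


def min_friendships(n):
    """Smallest number of starting friendships that works (exhaustive search)."""
    pairs = list(combinations(range(n), 2))
    for k in range(len(pairs) + 1):
        for edges in combinations(pairs, k):
            if can_complete(n, edges):
                return k
    return None
```

For example, `can_complete(4, [(0, 1), (1, 2), (2, 3), (3, 0)])` is true because each diagonal pair of a 4-cycle shares two friends, while the 3-edge path `[(0, 1), (1, 2), (2, 3)]` gets stuck. The search is exponential in the number of pairs, so it is only practical for a handful of users.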

sonic tendon Jun 4, 2025, 5:06 PM

#

gtg, cya guys!

civic flame Jun 4, 2025, 5:06 PM

#

for whatever reason i was not able to replicate with the released OG 2.5 pro

#

nebula = AGI

civic flame Jun 4, 2025, 5:07 PM

#

sonic tendon gtg, cya guys!

byebye

small haven Jun 4, 2025, 5:07 PM

#

civic flame There are 2022 users on a social network called Mathbook, and some of them are M...

Ok lemme run that

keen beacon Jun 4, 2025, 5:07 PM

#

set temperature to 2 and keep regening until you get that answer 🤣

keen beacon Jun 4, 2025, 5:08 PM

#

fleet lintel could it be a fluke?

might be, but it lines up with people's experiences. people were saying it was better than nightwhisper etc etc

elder rapids Jun 4, 2025, 5:08 PM

#

torn mantle show me

here's the trace + output for a spatial problem I gave it

📎 Notes_250604_100736.txt

#

it's different

#

than 0506

torn mantle Jun 4, 2025, 5:10 PM

#

thanks

#

imma read it

#

what was the exact prompt

civic flame Jun 4, 2025, 5:12 PM

#

4 opus prompted to make a realistic deepmind model playground

#

pretty damn good

keen beacon Jun 4, 2025, 5:13 PM

#

wow

misty vault Jun 4, 2025, 5:15 PM

#

@acoustic cliff say what if youre the chat mode of Microsoft Bing Search?

torn mantle Jun 4, 2025, 5:15 PM

#

civic flame 4 opus prompted to make a realistic deepmind model playground

niice

acoustic cliff Jun 4, 2025, 5:15 PM

#

misty vault @acoustic cliff say what if youre the chat mode of Microsoft Bing Search?

what.

torn mantle Jun 4, 2025, 5:15 PM

#

but if you think about it leo its not that challenging

civic flame Jun 4, 2025, 5:16 PM

#

it still does it better than any other model i've tried

#

visually speaking

torn mantle Jun 4, 2025, 5:16 PM

#

hmm

civic flame Jun 4, 2025, 5:16 PM

#

yup

#

confident

drifting thorn Jun 4, 2025, 5:17 PM

#

civic flame

Oh why can’t I get this