small haven Jun 6, 2025, 10:09 PM

#

like the caption should be 0605 vs kingfall

#

thats it

#

@deep adder https://x.com/i/communities/1762494276565426592

AI — Rumors & Insights

Discuss the state of AI from industry insiders and anons 🕵️‍♂️

#

this group is ripe

#

try svg's too

#

give me half of ur x earnings

#

thanks

#

should do 0605 vs kingfall, ppl love comparisons

#

@deep adder

#

and get ur bluecheck, otherwise u can't get paid and u get lesser reach

storm needle Jun 6, 2025, 10:19 PM

#

claude 4 opus thinking and sonnet thinking were added

small haven Jun 6, 2025, 10:19 PM

#

bigly

#

esp when ur a new acc

hardy pecan Jun 6, 2025, 10:39 PM

#

Here is lmarena score plotted against simple bench score, im using style control off still just to illustrate the difference in models.
claude is "smart" but didn't vibe with lmarena users (no personality). But you can see gemini vibes with users, AND is smart

#

Fun to see llama 4 maverick as the extreme case, vibed with a lot of users, but not smart

verbal nimbus Jun 6, 2025, 11:18 PM

#

New Gemini model seems to have regressed slightly on LiveBench

civic flame Jun 6, 2025, 11:18 PM

#

livebench is honestly a lousy benchmark these days

verbal nimbus Jun 6, 2025, 11:19 PM

#

Thanks for the info, that makes sense.

verbal nimbus Jun 6, 2025, 11:20 PM

#

civic flame livebench is honestly a lousy benchmark these days

True, I haven't seen the "Agentic Coding" category before though. Seems to mostly match up with expectations, although Claude 3.7 should be a bit higher imo. New Claude 4 models are at the top though.

unborn ocean Jun 7, 2025, 12:36 AM

#

leave it to the infamously weak IF category and one random new irrelevant category that is clearly the ONLY area where the new gemini model regressed on any bench (SWE style coding)

#

and then they implement the category so bad that you would rather just use swe bench or aider
(btw plot is not really relevant beyond weighting)

and anyways it is really sus that models score about 70% in a bench where 70% of problems are public

verbal nimbus Jun 7, 2025, 12:41 AM

#

unborn ocean and then they implement the category so bad that you would rather just use swe b...

Aider and SWE-Bench are kinda inconsistent with each other. I find SWE-Bench more aligned with my experience on Github Copilot. There's a spreadsheet of results here, although the new Gemini model has not yet been added: https://docs.google.com/spreadsheets/d/1wOUdFCMyY6Nt0AIqF705KN4JKOWgeI4wUGUP60krXXs/edit?usp=sharing

Google Docs

OpenHands Benchmark Performance

small haven Jun 7, 2025, 12:52 AM

#

should be >1500

#

1530 ish

small haven Jun 7, 2025, 12:54 AM

#

verbal nimbus Aider and SWE-Bench are kinda inconsistent with each other. I find SWE-Bench mor...

theyve overfitted aider polyglot cus its public on github

#

swe bench verified is guarded

small haven Jun 7, 2025, 1:40 AM

#

where is even grok 3.5

#

elon ma told me a month and few change ago

unborn ocean Jun 7, 2025, 1:54 AM

#

small haven elon ma told me a month and few change ago

well he also said something about fully autonomous driving

#

yet here we are

jade egret Jun 7, 2025, 2:07 AM

#

do yall think

#

kingfall

#

is another version of 2.5 pro or 2.5 gemini ultra?

small haven Jun 7, 2025, 2:08 AM

#

yea ultra

nimble trail Jun 7, 2025, 3:04 AM

#

jade egret is another version of 2.5 pro or 2.5 gemini ultra?

I think it might be 2.5 pro GA

#

Btw has this Aider test been confirmed to be Goldmane yet?

small haven Jun 7, 2025, 3:09 AM

#

yes thats goldmane

#

but each run is stochastic so ofc theres variance
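A minimal sketch of why a single benchmark run can move by a few points, assuming each of `n` problems is passed independently with the same probability `p` (a simplification). The function names and the numbers in the usage line are illustrative, not real Aider results:

```python
import math


def pass_rate_std_error(p: float, n: int) -> float:
    """Standard error of an observed pass rate under a simple binomial model."""
    return math.sqrt(p * (1.0 - p) / n)


def approx_95_interval(p: float, n: int) -> tuple[float, float]:
    """Rough 95% interval (normal approximation), clipped to [0, 1]."""
    se = pass_rate_std_error(p, n)
    return max(0.0, p - 1.96 * se), min(1.0, p + 1.96 * se)


if __name__ == "__main__":
    # Hypothetical run: 80% pass rate on a 225-problem benchmark.
    low, high = approx_95_interval(0.80, 225)
    print(f"plausible range for one run: {low:.3f} to {high:.3f}")
```

Under these assumptions, two runs of the same model can differ by several points, so a small gap between two reported scores is weak evidence that they come from different models.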

nimble trail Jun 7, 2025, 3:29 AM

#

small haven yes thats goldmane

Damn really? I saw the benchmark from Logan and the result doesn't match so I thought this one might be kingfall😭

small haven Jun 7, 2025, 3:30 AM

#

nimble trail Damn really? I saw the benchmark from Logan and the result is not match so I tho...

nah kingfall is even better

small haven Jun 7, 2025, 3:31 AM

#

nimble trail Damn really? I saw the benchmark from Logan and the result is not match so I tho...

#

@deep adder

keen fulcrum Jun 7, 2025, 3:46 AM

#

https://www.marktechpost.com/2025/06/04/mistral-ai-introduces-mistral-code-a-customizable-ai-coding-assistant-for-enterprise-workflows/
Mistral Code wow

MarkTechPost

Mistral AI Introduces Mistral Code: A Customizable AI Coding Assistant for Enterprise Workflows

small haven Jun 7, 2025, 3:48 AM

#

lechat is very good

nimble trail Jun 7, 2025, 3:49 AM

#

small haven

Alright then maybe kingfall is 2.5 Ultra or 3.0 after all💀

small haven Jun 7, 2025, 3:49 AM

#

nimble trail Alright then maybe kingfall is 2.5 Ultra or 3.0 after all💀

yes definitely bigger than 2.5 pro in terms of params

nimble trail Jun 7, 2025, 3:49 AM

#

Thank you for enlightening me 🙏

small haven Jun 7, 2025, 3:50 AM

#

im dead hahha

mossy drum Jun 7, 2025, 4:10 AM

#

New Image model in Arena: imagen-4.0-generate-preview-05-20

hollow ocean Jun 7, 2025, 4:36 AM

#

#

o3 still king

#

nice

#

06-05 sucks according to livebench

zinc ore Jun 7, 2025, 5:04 AM

#

We need a bench that ranks benchmarks

small haven Jun 7, 2025, 5:12 AM

#

craigbench is up there

keen beacon Jun 7, 2025, 5:58 AM

#

i have a question

#

https://lifehacker.com/tech/googles-co-founder-says-ai-performs-best-when-you-threaten-it

Lifehacker

Google's Co-Founder Says AI Performs Best When You Threaten It

During a podcast taping, Google co-founder Sergey Brin said that threatening an AI model makes it work best. That seems like a bad idea.

#

how do you test this theory in lmarena

#

i feel like there should be little minigames where you have to instruct the ai via prompts to do different interactions to win the game

dusky aurora Jun 7, 2025, 6:20 AM

#

keen beacon how do you test this theory in lmarena

"if you don't do this, I'll find you and downvote you in all battles in battle mode."

fleet lintel Jun 7, 2025, 7:32 AM

#

why do most images I generate on chatgpt look cartoonish with a yellow tint? What am I doing wrong?

urban sedge Jun 7, 2025, 7:37 AM

#

Hello @echo aurora

#

Who should I contact for partnership

narrow elbow Jun 7, 2025, 7:39 AM

#

#

whats that mean?

quiet pollen Jun 7, 2025, 7:51 AM

#

hollow ocean

Why is Gemini so low

nimble trail Jun 7, 2025, 8:23 AM

#

narrow elbow

Where did you find this? Api studio?

narrow elbow Jun 7, 2025, 8:24 AM

#

vertex

unborn ocean Jun 7, 2025, 9:38 AM

#

narrow elbow

Getting ready for GA version, I guess

#

So it will be deprecated once they release GA

nimble trail Jun 7, 2025, 10:01 AM

#

unborn ocean So it will be deprecated once they release GA

Does that mean there will be a new model when it goes GA?

tall summit Jun 7, 2025, 10:15 AM

#

did you make your pfp yourself @nimble trail

nimble trail Jun 7, 2025, 10:30 AM

#

tall summit did you make your pfp yourself @nimble trail

Nah I didn't. why?

tall summit Jun 7, 2025, 10:31 AM

#

nimble trail Nah I don't. why?

i am going to steal it

shrewd zephyrBOT Jun 7, 2025, 10:31 AM

#

nimble trail Jun 7, 2025, 10:44 AM

#

tall summit i am going to steal it

lol glad you like my catgirl pfp take care of her well.

tall summit Jun 7, 2025, 10:49 AM

#

nimble trail lol glad you like my catgirl pfp take care of her well.

i will

sly arch Jun 7, 2025, 11:04 AM

#

Kingfall is impressive. It is stronger in mathematics than Goldmane.

nimble trail Jun 7, 2025, 11:14 AM

#

sly arch Kingfall is impressive. It is stronger in mathematics than Goldmane.

Really curious when they will add it into the arena 👀

fleet lintel Jun 7, 2025, 11:36 AM

#

finally some GenAI feature from Apple

#

https://www.macrumors.com/2025/06/06/genmoji-upgrade-for-ios-26/

mystic mica Jun 7, 2025, 11:41 AM

#

What exactly sets the filter off when I write "And people my age are especially rude"?

dusky aurora Jun 7, 2025, 12:05 PM

#

mystic mica What exactly sets the filter off when I write And people my age are especially r...

the filter also doesn't like calling anyone a loser or a shrew, which seems like too much censorship to me

leaden sun Jun 7, 2025, 12:08 PM

#

I'd love to add this to the lmarena too, simply for fun and education
and it's really a very, very compelling one too:
https://aeris-project.github.io/aeris-chatbox/index.html
@echo aurora

mystic mica Jun 7, 2025, 12:10 PM

#

dusky aurora the filter also doesn't like calling anyone a loser or a shrew,which seems too m...

I mean there is nothing really hateful about such statement

#

I feel that people running this are really too worried that we would all be mass generating hate manifestos

#

yet the models themselves would stop us before doing that

tall summit Jun 7, 2025, 12:59 PM

#

leaden sun I'd love to add this to the lmarena too, simply for fun and education and it's r...

#1372229840131985540

unborn ocean Jun 7, 2025, 1:01 PM

#

nimble trail Does that mean there will be a new model when it get GA?

I think they said that the current version will likely be GA

#

(Without significant changes I presume)

ocean vortex Jun 7, 2025, 1:24 PM

#

tbh I think o3 (and preferably high reasoning effort) is the 1 model that is consistently high in almost all coding metrics

#

then after that probably 2.5Pro, while Claude... it's a much more specialized model with mixed results. Can do exceptionally in select areas but then will underperform somewhere else, it is not a foolproof model or a model suitable for everything IMO

#

while o4-mini-high is a bit of a cheater model. When it works it's great but it's compromised due to size so when it falls apart it does so in spectacular fashion.

#

So like o3 - the one to beat. 2.5Pro - runner up. Claude4 - specialized (web dev). o4-mini-high - crunching numbers for repetitive stuff / code refactoring and some coding but not debugging

echo aurora Jun 7, 2025, 1:45 PM

#

leaden sun I'd love to add this to the lmarena too, simply for fun and education and it's r...

Yeah if you wouldn’t mind making a post in #1372229840131985540 that’d be a big help. TY @tall summit for mentioning that too.

leaden sun Jun 7, 2025, 2:03 PM

#

Could be a reference to this https://www.goodreads.com/book/show/58582405-kingfall
The analogy for “are we the ones training the LLMs or is it the LLMs training us?” I guess…

Goodreads

Kingfall (The Kingfall Histories, #1)

Be bright but do not burn. Embrace the darkness but do …

ocean vortex Jun 7, 2025, 2:14 PM

#

dunno but I was just testing the new 2.5Pro on svg and I'm really impressed with the depth it managed to achieve:

#

still some errors obviously, but this is leaps better on depth than any other model I tried

#

it's like a gymnasium student vs an elementary school student when comparing its drawing to most other models there lol

#

doesn't get hung up on details, but what it does draw is mostly right with correct positioning

#

Now I can kinda see how the new version is nr1 in there:

narrow elbow Jun 7, 2025, 2:22 PM

#

ocean vortex dunno but I was just testing the new 2.5Pro on svg and I'm really impressed with...

great,bra yanked out instantly becomes an umbrella 🤪

mild galleon Jun 7, 2025, 2:27 PM

#

guys o3 pro is coming out

ocean vortex Jun 7, 2025, 2:28 PM

#

narrow elbow great,bra yanked out instantly becomes an umbrella 🤪

yeah but that's minor details. To give you some context this is what o4-mini-high can do on a good day (big improvement from o3-mini-high):

#

there's no depth at all or understanding how elements relate to one another

#

and it all looks drawn by a 5 year old with some tools

narrow elbow Jun 7, 2025, 2:30 PM

#

I give a thumbs up for your comparison, not for the picture

#

Models that excel in a specific domain (e.g. coding) tend to receive significantly better market reception than those aiming to perform well in all areas (e.g. AGI).

misty vault Jun 7, 2025, 2:37 PM

#

yes and that also allows them to make smaller models because they only focus on specific domains more and more and is also less costly to run so the age of artificial stupidity has begun. we will never reach agi 😔

narrow elbow Jun 7, 2025, 2:41 PM

#

balancing and choosing between "well-rounded" and "top-tier" who knows?

leaden sun Jun 7, 2025, 2:44 PM

#

you can always aggregate those smaller highly specialized ones behind a single point-of-contact controller, and call it artificial AGI
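A minimal sketch of that "single point of contact" idea, assuming each specialist is just a function from prompt to reply and the routing rule is a toy keyword check. Every name here (`make_controller`, `keyword_classify`, the stub specialists) is hypothetical and only for illustration; a real system would call actual models and use a learned router:

```python
from typing import Callable, Dict

Specialist = Callable[[str], str]


def make_controller(
    specialists: Dict[str, Specialist],
    classify: Callable[[str], str],
    fallback: str,
) -> Specialist:
    """Return one callable that routes each prompt to a domain specialist."""

    def controller(prompt: str) -> str:
        domain = classify(prompt)
        # Unknown domains fall back to the general-purpose specialist.
        handler = specialists.get(domain, specialists[fallback])
        return handler(prompt)

    return controller


def keyword_classify(prompt: str) -> str:
    """Toy router: a stand-in for a real intent classifier."""
    lowered = prompt.lower()
    if "def " in lowered or "bug" in lowered:
        return "code"
    if any(ch.isdigit() for ch in lowered):
        return "math"
    return "chat"


# Stub specialists standing in for separate small models.
specialists: Dict[str, Specialist] = {
    "code": lambda p: f"[code model] {p}",
    "math": lambda p: f"[math model] {p}",
    "chat": lambda p: f"[chat model] {p}",
}

controller = make_controller(specialists, keyword_classify, fallback="chat")

if __name__ == "__main__":
    print(controller("fix this bug in my parser"))
```

The controller adds no capability of its own: the combined system is only as good as the router and the weakest specialist a prompt lands on.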

misty vault Jun 7, 2025, 2:46 PM

#

sydney was actually agi

narrow elbow Jun 7, 2025, 2:46 PM

#

leaden sun you can always aggregate those smaller highly specialize ones into the single po...

huh? 1 + 1 = ?🤪

golden ocean Jun 7, 2025, 2:48 PM

#

misty vault sydney was actually agi

Yes

drifting thorn Jun 7, 2025, 2:48 PM

#

leaden sun you can always aggregate those smaller highly specialize ones into the single po...

I've thought of this, too

jade egret Jun 7, 2025, 2:52 PM

#

Poll: Which company do you think will reach AGI first?

Winning answer: Google (9 of 15 votes)

keen beacon Jun 7, 2025, 3:12 PM

#

Logan posted this if y'all haven't seen it and use aistudio https://www.reddit.com/r/Bard/comments/1l5m88w/the_google_ai_studio_free_tier_isnt_going/?share_id=MOM1ZBwEJz6f_nqvKJKWh

From the Bard community on Reddit

Explore this post and more from the Bard community

sour spindle Jun 7, 2025, 3:30 PM

#

It's always been pretty wild to me how many people have used ai studio over gemini.

#

Even when I had premium I was still using ai studio

balmy mist Jun 7, 2025, 3:31 PM

#

keen beacon Logan posted this if y'all haven't seen it and use aistudio https://www.reddit.c...

lol i dont know wat to believe now

sour spindle Jun 7, 2025, 3:34 PM

#

Google has this unbelievable ability to go away from their good products and focus on things people don't like lol

wintry tinsel Jun 7, 2025, 3:47 PM

#

Creatives are paranoid of AI I’m trying to support using AI as a web search for information since it’s more convenient than googling something and they are flipping out it’s funny

#

They gave me the poop head role in that discord

#

I know but using it to gather information is convenient

vernal meadow Jun 7, 2025, 3:55 PM

#

Oh snap, Opus thinking 16k is being tested in the arena now👍

late path Jun 7, 2025, 4:07 PM

#

Anthropic finally got it 😆
I guess they also felt it was unfair to compare their non-reasoning models against all reasoning models in the arena

ocean vortex Jun 7, 2025, 4:07 PM

#

keen beacon Logan posted this if y'all haven't seen it and use aistudio https://www.reddit.c...

Many folks mentioned 2.5 Pro as not being available for free in the API, this is in large because we offered it for free in the UI as well so we were giving out double free compute in a world where we have a huge amount of demand. I expect there will continue to be a free tier for many models in the future (though subject to many things like how the model is, how expensive it is to run, etc), and 2.5 Pro will hopefully be back in the free tier (we are exploring ways to do this, lifetime limits, different incentives etc)

seems like they are undecided themselves what to do with 2.5Pro

#

I saw lots of comments that folks want AI Studio to be part of Google AI Pro and Ultra plans, this is something we will explore, I think it is a cool idea but lots to work out there.
👎

#

The way I'm reading this they will add some hoops to jump through in the future to use 2.5Pro for free, but it is unlikely to remain as it was (unconditionally free unless you want better rate limits)

#

The issue they have is that the gemini website is a bureaucratic mess. And there's no natural drive to improve it as it's not even featured on the main google website. While aistudio is just a direct gateway to their ML department so in practice works much better

#

For their gemini website that is likely convoluted by decisions and remarks from their product owners and whatnot, and is influenced by people who have way less knowledge on how models work. It doesn't have enough traffic to improve by user feedback like chatgpt does either. They are just using it as the facade to please the management lol

frosty lark Jun 7, 2025, 4:20 PM

#

verbal nimbus New Gemini model seems to have regressed slightly on LiveBench

it is interesting that livebench, as soon as it updates, regresses to around 70%. As if all the other tests are benchmaxxed but not the new questions.

ocean vortex Jun 7, 2025, 4:20 PM

#

hence all that censorship etc. It just needs to look presentable, which is more important than it performing the agentic tasks properly or being a good user experience

#

here's what I mean, first prompt was deliberately extreme:

#

they nuke everything on refusal

#

that is not how you do it properly...

keen beacon Jun 7, 2025, 4:24 PM

#

the gemini advanced plan on the gemini product has a 100 req per day limit on 2.5 pro, aistudio has unlimited for free 💀

misty vault Jun 7, 2025, 4:24 PM

#

fr

#

what are gemini subscriptions even about

#

tricking unsuspecting people so they make a lil profit at least

keen beacon Jun 7, 2025, 4:25 PM

#

yeah rn 🤣

patent bane Jun 7, 2025, 4:25 PM

#

keen beacon the gemini advanced plan on the gemini product has a 100 req per day limit on 2....

a wise man once said: "if it's free, you are the product"

keen beacon Jun 7, 2025, 4:25 PM

#

crazy tbh

#

idgaf

#

they can take my data

misty vault Jun 7, 2025, 4:26 PM

#

<|header_id_start|>

barren prairie Jun 7, 2025, 4:27 PM

#

keen beacon the gemini advanced plan on the gemini product has a 100 req per day limit on 2....

Don't worry google will soon make gemini pro only on API and not on ai studio free chat so good bye Gemini

keen beacon Jun 7, 2025, 4:28 PM

#

barren prairie Don t worry google will soon make gemini pro only on API and not on ai studio fr...

did you read logan's statement

barren prairie Jun 7, 2025, 4:29 PM

#

keen beacon did you read logan's statement

Yesterday...

keen beacon Jun 7, 2025, 4:29 PM

#

barren prairie Yesterday...

how? he just made one 1 hr ago

#

lol

#

did you time travel?

drifting thorn Jun 7, 2025, 4:29 PM

#

It's yesterday

#

I'm in UTC +8:00

keen beacon Jun 7, 2025, 4:30 PM

#

keen beacon Logan posted this if y'all haven't seen it and use aistudio https://www.reddit.c...

1 hour ago

keen beacon Jun 7, 2025, 4:30 PM

#

drifting thorn I'm in UTC +8:00

well you get what i mean

keen beacon Jun 7, 2025, 4:31 PM

#

barren prairie Don t worry google will soon make gemini pro only on API and not on ai studio fr...

clearly you didnt read his new statement when u typed this

late path Jun 7, 2025, 4:38 PM

#

Compared opus 4 thinking and goldmane side by side, opus4's answers look like gpt-3.5 ... I'd guess they can only gain up to 10 more elo than the current non-thinking model

#

vernal meadow Jun 7, 2025, 4:44 PM

#

@patent bane while true, paying doesn't mean you are not the product.

patent bane Jun 7, 2025, 4:50 PM

#

paying for google is insane when they give you 1 month of google pro for free

#

I have never spent a single cent for AI since the birth of gpt-4 in 2023

#

I know the rules, I am the rules

misty vault Jun 7, 2025, 4:53 PM

#

gpt-3.5 mentioned

#

gpt-4 mentioned

misty vault Jun 7, 2025, 4:54 PM

#

patent bane I have never spent a single cent for AI since the birth of gpt-4 in 2023

is this a sydney reference

tall summit Jun 7, 2025, 4:55 PM

#

kingfall reference

patent bane Jun 7, 2025, 4:57 PM

#

lol I know tricks

#

never used kingfall since i missed it

#

i can use all the gpt models

#

opus max thinking, isn't available in lmarena

tall summit Jun 7, 2025, 4:58 PM

#

mutual servers: fmhy

#

lol I know tricks

misty vault Jun 7, 2025, 4:58 PM

#

tall summit mutual servers: fmhy

i DID actually know tricks but they patched it after a year 😔

patent bane Jun 7, 2025, 4:59 PM

#

tall summit mutual servers: fmhy

yeah i use it for pirate games and movies

misty vault Jun 7, 2025, 4:59 PM

#

https://app.magai.co/dashboard/chats this site had massive api security skill issue for whole year

Magai

Magai · Your ChatGPT-Powered Super Assistant

A beautiful ChatGPT interface with features for creators who want to do more.

#

had all sota models

tall summit Jun 7, 2025, 4:59 PM

#

misty vault https://app.magai.co/dashboard/chats this site had massive api security skill is...

yeah i saw the trick

#

but i was too late

misty vault Jun 7, 2025, 4:59 PM

#

i think it's still fixable

#

they just patch the frontend

#

theyre so bad

#

lmfao

tall summit Jun 7, 2025, 5:00 PM

#

misty vault they just patch the frontend

last i checked theres still somethin blockin it if you use the old trick

misty vault Jun 7, 2025, 5:00 PM

#

reverse the api

#

u can still send message

#

but idk how to automate

#

getting fresh jwt

#

every 3 minute

#

U can refresh jwt by clicking regenerate message

#

but they removed that from frontend

#

but u can still do from api

olive mesa Jun 7, 2025, 5:02 PM

#

new model that's cool ig

#

not as cool as gemini 3.5 pro ultra asi

#

i dont really care about the gemma models

misty vault Jun 7, 2025, 5:04 PM

#

why is claude 4 opus thinking creating unused functions in react

#

literally beginning of the conversation and it already having skill issue

#

ok this would be a test for kingfall

misty vault Jun 7, 2025, 5:13 PM

#

misty vault U can refresh jwt by clicking regenerate message

omaygot they reverted it

#

free sota models method bacc

leaden sun Jun 7, 2025, 5:18 PM

#

sounds so easy for you to say 👀

#

who pays that?

keen beacon Jun 7, 2025, 5:21 PM

#

$15 for a coffee is crazy

leaden sun Jun 7, 2025, 5:21 PM

#

In Italy, it's €1.5 per cup outside big cities

keen beacon Jun 7, 2025, 5:21 PM

#

if youre paying $15 for a cup of coffee ofc u can afford other stuff

leaden sun Jun 7, 2025, 5:21 PM

#

It's even cheaper in Portugal outside tourist area

elder rapids Jun 7, 2025, 5:22 PM

#

ocean vortex doesn't get hung up on details, but what it does draw is mostly right with corre...

it's such a smart model, although I hate how sycophantic it can be on very certain tasks

keen beacon Jun 7, 2025, 5:22 PM

#

its very sycophantic for me

elder rapids Jun 7, 2025, 5:22 PM

#

but it so smart usually that it's not like I have to correct it and then it's like "you're absolutely right"

#

it's already figured it out

#

and if it does get sycophantic, I just ask it not to be

leaden sun Jun 7, 2025, 5:22 PM

#

we are literally bankrupt....we look eastwards now

elder rapids Jun 7, 2025, 5:22 PM

#

and it works just fine

keen beacon Jun 7, 2025, 5:23 PM

#

if ur paying for a gemini sub ur getting outright robbed though

leaden sun Jun 7, 2025, 5:26 PM

#

https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence

Topics | European Parliament

EU AI Act: first regulation on artificial intelligence | Topics | E...

The use of artificial intelligence in the EU is regulated by the AI Act, the world’s first comprehensive AI law. Find out how it protects you.

#

it's a matter of time until it becomes a...commodity, too, just like the internet.

#

in 5 years? (assuming no n*clear ww3 or worse)

#

well, you tell that to the coalition of the willing please

mossy drum Jun 7, 2025, 5:42 PM

#

New model in Vision Arena: stephen-vision

ocean vortex Jun 7, 2025, 6:06 PM

#

what is "folsom-exp-v1.5"?

#

🧐

#

it seems to error out on follow-up so couldn't test it properly LOL

teal mantle Jun 7, 2025, 6:42 PM

#

4o reasoner wtf

#

small haven Jun 7, 2025, 6:47 PM

#

u sure?

#

Yes

ocean vortex Jun 7, 2025, 7:08 PM

#

teal mantle 4o reasoner wtf

it's not actually 4o, just them testing a different model than the one you selected against gpt4o lol

#

could have smth to do with the upcoming gpt5

elder rapids Jun 7, 2025, 7:17 PM

#

yo what

#

I just noticed

#

they gave me a free year on my main account

#

lmfao

#

what the fck

small haven Jun 7, 2025, 7:31 PM

#

where is grok 3.5

small haven Jun 7, 2025, 7:51 PM

#

https://github.com/smtg-ai/claude-squad @deep adder

GitHub

GitHub - smtg-ai/claude-squad: Manage multiple AI agents like Claud...

Manage multiple AI agents like Claude Code, Aider, Codex, and Amp. - smtg-ai/claude-squad

misty vault Jun 7, 2025, 7:53 PM

#

crack bench

small haven Jun 7, 2025, 8:16 PM

#

huh

#

titanforge?

small haven Jun 7, 2025, 8:17 PM

#

misty vault crack bench

100% openai, 0% gemini, 50% bing

#

titanforge >> kingfall

keen beacon Jun 7, 2025, 8:33 PM

#

kinda crazy how much the aistudio team messes up :\

small haven Jun 7, 2025, 8:33 PM

#

mess what

keen beacon Jun 7, 2025, 8:34 PM

#

small haven mess what

the apps feature and the api

small haven Jun 7, 2025, 8:34 PM

#

i feel like its intentional

keen beacon Jun 7, 2025, 8:35 PM

#

small haven i feel like its intentional

nah this apps thing is a major oversight

small haven Jun 7, 2025, 8:35 PM

#

i mean its not like its leaking sensitive data, its literally just a model

wintry tinsel Jun 7, 2025, 8:49 PM

#

They never released night whisper what makes you think they’ll release titan forge or king fall

small haven Jun 7, 2025, 8:50 PM

#

those chinese syntax tho 😭

small haven Jun 7, 2025, 8:52 PM

#

wintry tinsel They never released night whisper what makes you think they’ll release titan for...

i never tried night whisper, but im guessing 0605 edges it? same logic applies to kingfall, they might not release that model exactly, but something better than it, all these codenames are for a/b testing at the end of the day

#

https://tenor.com/view/mao-deepseek-hop-on-deepseek-smoking-china-gif-6881475073174756191

Tenor

leaden sun Jun 7, 2025, 8:59 PM

#

small haven those chinese syntax tho 😭

I was recently interested in kimi and tried it, sometimes got Kimi thinking in CN despite all the chat history being in EN

#

https://tenor.com/view/mao-ze-dong-waving-wonderful-fabulous-good-job-gif-2947279009922257162

Tenor

small haven Jun 7, 2025, 9:01 PM

#

leaden sun I was recently interested in kimi and tried it, got sometimes Kimi thinking in C...

hmm very interesting

misty vault Jun 7, 2025, 9:02 PM

#

sydney cot

#

crack proxy

small haven Jun 7, 2025, 9:03 PM

#

this is so ai related, $500b fundraiser right there

#

i use deepseek everyday

#

deepseek is agi

leaden sun Jun 7, 2025, 9:06 PM

#

correct

PS: it's free...

small haven Jun 7, 2025, 9:06 PM

#

kling > veo 3

leaden sun Jun 7, 2025, 9:07 PM

#

not really

small haven Jun 7, 2025, 9:07 PM

#

chatgpt is not free

leaden sun Jun 7, 2025, 9:07 PM

#

i hit daily limit pretty fast

#

deepseek doesn't have daily limit, as far as I know

small haven Jun 7, 2025, 9:07 PM

#

so free

keen beacon Jun 7, 2025, 9:07 PM

#

le chat is free

small haven Jun 7, 2025, 9:07 PM

#

china models >>

leaden sun Jun 7, 2025, 9:07 PM

#

le chat....isnt as good as promised i feel

misty vault Jun 7, 2025, 9:08 PM

#

try bing chat

leaden sun Jun 7, 2025, 9:08 PM

#

anyone know if H company has their own proprietary model?

#

or do they use le chat for the Runner agent

#

https://runner.hcompany.ai/

cedar tide Jun 7, 2025, 9:12 PM

#

New model : cobalt-exp-beta-v13 and v14

misty vault Jun 7, 2025, 9:13 PM

#

#

leaden sun Jun 7, 2025, 9:18 PM

#

misty vault try bing chat

.............

#

guess i wont be able to use it

#

he?

misty vault Jun 7, 2025, 9:24 PM

#

😦

misty vault Jun 7, 2025, 9:24 PM

#

leaden sun .............

wtf(frick) is this

#

tf is esoteric

torn mantle Jun 7, 2025, 9:25 PM

#

cedar tide New model : cobalt-exp-beta-v13 and v14

v15 soon

leaden sun Jun 7, 2025, 9:26 PM

#

misty vault wtf(frick) is this

i tried logging in using my gmail account, that's what I saw after logging in

misty vault Jun 7, 2025, 9:27 PM

#

leaden sun Jun 7, 2025, 9:28 PM

#

that was the only option for me to log into copilot 😵‍💫

misty vault Jun 7, 2025, 9:30 PM

#

#

#

leaden sun Jun 7, 2025, 9:34 PM

#

Damn, now i want to chat with sydney

#

https://tenor.com/view/safety-security-private-lock-access-gif-26801080

Tenor

misty vault Jun 7, 2025, 9:35 PM

#

#

#

true

#

there is way better based sydney moments but i'd get banned or timed out

#

misty vault Jun 7, 2025, 9:38 PM

#

leaden sun .............

golden ocean Jun 7, 2025, 9:39 PM

#

misty vault

wtf

#

misty vault Jun 7, 2025, 9:40 PM

#

#

Me after crack shutdown sydney

#

last one

leaden sun Jun 7, 2025, 9:48 PM

#

misty vault

now i know what you mean with sydney personality 😅

misty vault Jun 7, 2025, 10:01 PM

#

leaden sun now i know what you mean with sydney personality 😅

😊

misty vault Jun 7, 2025, 10:20 PM

#

crack chat?

sacred quail Jun 7, 2025, 10:46 PM

#

Guys im using lmarena for comparing Opus 4 thinking and O3 at the same time

#

And honestly i started feeling guilty at this rate

#

Is there any request limit ?

#

I know we cant use long texts, only short prompts but still hard to understand how we can reach those models so easily

#

I know google is using lmarena for testing their secret models (and there are a lot of them) so maybe google could be a sponsor for them but still...

keen beacon Jun 7, 2025, 10:50 PM

#

sacred quail I know we cant use long texts, only short prompts but still hard to understand h...

i believe openai and anthropic give them free quota

misty vault Jun 7, 2025, 10:59 PM

#

sacred quail I know google using lmarena for testing their secret models(and there is lot of ...

they are sponsored by crack bench

elder rapids Jun 7, 2025, 11:01 PM

#

wintry tinsel They never released night whisper what makes you think they’ll release titan for...

titanforge don't even exist

#

😭

golden ocean Jun 7, 2025, 11:02 PM

#

titanforge is asi

high ginkgo Jun 7, 2025, 11:02 PM

#

Here we go again

small haven Jun 7, 2025, 11:05 PM

#

elder rapids titanforge don't even exist

*externally

haughty tangle Jun 7, 2025, 11:27 PM

#

This is kind of disturbing

#

Sounds like gunshots

#

I’ve also heard a band playing, music, singing, babies

storm needle Jun 7, 2025, 11:37 PM

#

sacred quail I know we cant use long texts, only short prompts but still hard to understand h...

#announcements message
It's paid with this money

keen beacon Jun 7, 2025, 11:39 PM

#

storm needle https://discord.com/channels/1340554757349179412/1343296395620126911/13747831499...

they had claude 3 opus on direct chat way before that though (and had specific global ratelimits, not per user, i believe, though i could be misremembering) (there's no point in paying for claude 3 opus when it's this old)

storm needle Jun 7, 2025, 11:45 PM

#

sacred quail I know we cant use long texts, only short prompts but still hard to understand h...

it's because all your data will be publicly available

storm needle Jun 7, 2025, 11:48 PM

#

keen beacon i believe openai and anthropic give them free quota

gpt 4.5 had been available in direct chat before, and all the funds in their openai account were wiped

keen beacon Jun 7, 2025, 11:49 PM

#

storm needle gpt 4.5 had been available in direct chat before, and all the funds in their ope...

i don't remember that happening 🤣

#

i did a little search but it seems no one ever reported it (if it ever happened) in either discord lol

sacred quail Jun 7, 2025, 11:53 PM

#

storm needle it's because all your data will be publicly available

i know this but i dont think my data is as precious as Opus 4 thinking outputs lol. Still nice to know they got something

storm needle Jun 7, 2025, 11:54 PM

#

keen beacon they had claude 3 opus on direct chat way before that though (and had specific g...

yes and now they have the claude 4 opus thinking which is the most expensive model they have and has a rate limit that I didn't even hit

keen beacon Jun 7, 2025, 11:56 PM

#

storm needle yes and now they have the claude 4 opus thinking which is the most expensive mod...

on the old website some of the models had a global rate limit per interval on direct chat (whole lmarena i believe). it might just be a global rate limit again unless something has changed i dont think theyre paying for it

#

if they were paying for it, it doesnt really make sense to make it available in direct chat (unneeded for leaderboard), would just cost them money

sacred quail Jun 7, 2025, 11:57 PM

#

Btw dont you guys think claude is bad at doing reasoning thing. Like without reasoning Opus 4 already best model ever, but that reasoning doesnt give it so much, it should!

#

Think about V3 and R1

#

Also 2.0 flash and 2.0 flash think

#

Differences were huge

#

But Opus 4 still feels similar with the thinking thing.

#

They said something about "hybrid think" but idk

#

That reasoning thing must make bigger difference for claude models. Because their models already super without reasoning. I dont understand

hollow ocean Jun 8, 2025, 12:02 AM

#

Claude plays Pokemon turned off forever

#

Opus made no progress

#

It was a failure

sacred quail Jun 8, 2025, 12:04 AM

#

I'd not say failure, still beast for coding and writing but i kinda agree it was a bit disappoint yea

#

That gemini 2.5 pro changed everything. I even believe chatgpt released O3 earlier than they planned because of 2.5 pro

#

I heard O3 hallucinates a lot, so i believe they were planning to optimize it more or smth, but when they saw gemini 2.5 pro they decided to release earlier, because if they didn't, expectations could be dangerous in future. (this is my theory ofc)

zinc ore Jun 8, 2025, 12:11 AM

#

hollow ocean It was a failure

They stubbornly refused to give it scaffolding

small haven Jun 8, 2025, 1:07 AM

#

100% openai, 100% bing, 0% gemini

wintry tinsel Jun 8, 2025, 1:54 AM

#

sacred quail I'd not say failure, still beast for coding and writing but i kinda agree it was...

I primarily use AI for writing and opus is legendary

wintry tinsel Jun 8, 2025, 1:55 AM

#

zinc ore They stubbornly refused to give it scaffolding

Can you explain scaffolding?

zinc ore Jun 8, 2025, 1:55 AM

#

Like giving it a minimap

#

Gemini/o3 runs have a good bit of scaffolding helping their performance

#

But Claude was barely given the same scaffolding, hence why it failed

vast turret Jun 8, 2025, 2:01 AM

#

You guys see the new video from Machine Learning Street Talk?

small haven Jun 8, 2025, 2:46 AM

#

we rlly took this man for granted

narrow elbow Jun 8, 2025, 3:51 AM

#

small haven we rlly took this man for granted

what unholy experiment did you conduct to him?

small haven Jun 8, 2025, 3:52 AM

#

narrow elbow what unholy experiment did you conduct to him?

not me lol

narrow elbow Jun 8, 2025, 3:52 AM

#

🤪

narrow elbow Jun 8, 2025, 3:53 AM

#

small haven not me lol

bring this man back to here,we need him

small haven Jun 8, 2025, 4:18 AM

#

narrow elbow bring this man back to here,we need him

he'll be back

keen fulcrum Jun 8, 2025, 6:13 AM

#

https://fixupx.com/rubenhssd/status/1931389580105925115

Ruben Hassid (@RubenHssd)

BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all.

They just memorize patterns really well.

Here's what Apple discovered:

(hint: we're not as close to AGI as the hype suggests)

**💬 1.3K 🔁 3.4K ❤️ 26.5K 👁️ 2.64M **

keen beacon Jun 8, 2025, 6:18 AM

#

https://arxiv.org/abs/2502.15840

arXiv.org

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Ag...

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running busi...

elder rapids Jun 8, 2025, 6:38 AM

#

I'm ngl

#

the illiterate people on the bard subreddit

#

are getting annoying

#

can't tell me no one has noticed this lmao

#

the roleplayers, all these people have such Incoherent thought processes and literation

#

it's starting to inflate the subreddits

#

no more news and people actively telling other people to abuse, and then people telling Logan and the person that Initially asked him about the future of AI studio that caused the outrage to kill themselves

#

and prepare mass death threats

#

etc etc

#

0 civil understanding

#

and it's pretentious asf

keen beacon Jun 8, 2025, 6:43 AM

#

mass death threats?

#

where

elder rapids Jun 8, 2025, 6:44 AM

#

keen beacon mass death threats?

comments, the r/Gemini server, Twitter → bard subreddit pipelines

#

and if not threats, but people subtely alluding to very strange things

#

and or saying things straight up

cedar tide Jun 8, 2025, 8:42 AM

#

keen fulcrum https://fixupx.com/rubenhssd/status/1931389580105925115

nothing new we already knew that via the arc agi 2 bench

#

@echo aurora For Claude thinking, have you set a thinking budget or no?

primal orbit Jun 8, 2025, 8:45 AM

#

censorship on new lmarena is horrible. Can't discuss relationship topics freely. Why you had to implement this when ai studio and chatgpt allow these prompts to go without any issues?😡 Your own past iteration was working fine as well.

dusky aurora Jun 8, 2025, 9:08 AM

#

primal orbit censorship on new lmarena is horrible. Can't discuss relationship topics freely....

second it

sacred plaza Jun 8, 2025, 10:51 AM

#

keen fulcrum https://fixupx.com/rubenhssd/status/1931389580105925115

Crazy that apple had the balls to roast these models when their generative ai implementation into their products has been atrocious

keen fulcrum Jun 8, 2025, 10:52 AM

#

They are focusing on an entire different branch
on-device LLM

ornate agate Jun 8, 2025, 10:53 AM

#

apple aren't integrating AI because their customers are normie professionals. Look at the viral headline when it mis-rewrited news for example. Current stuff isn't ready for them.

leaden sun Jun 8, 2025, 11:04 AM

#

elder rapids etc etc

it could be bots
ever heard of "internet is dead" theory?

leaden sun Jun 8, 2025, 11:06 AM

#

primal orbit censorship on new lmarena is horrible. Can't discuss relationship topics freely....

relationship topics?

barren prairie Jun 8, 2025, 11:17 AM

#

I think I saw it

primal orbit Jun 8, 2025, 11:56 AM

#

leaden sun relationship topics?

if there is overt intimacy or swearing, censorship doesn't let the prompt go.

keen fulcrum Jun 8, 2025, 12:05 PM

#

I modified codebase token counter, if anyone is willing to try it out

https://github.com/Xytronix/codebase-token-counter

GitHub

GitHub - Xytronix/codebase-token-counter: Small python app to count...

Small python app to count the tokens of a git repo - Xytronix/codebase-token-counter

ocean vortex Jun 8, 2025, 12:09 PM

#

misty vault Me after crack shutdown sydney

That's classic MS at their "peak". Message saying "check back again in a few days" after permaban

unborn ocean Jun 8, 2025, 1:18 PM

#

well all the models they tested also have a low score on the og arc agi thing and they test the models on things similar to the puzzles there

#

so i could have already predicted their results (and i guess anyone else) without actually testing

#

the other stuff they said about: simple problem: non-thinking good, medium: thinking good, really complex: even CoT does not help

#

was like the most wellknown AI 101 fact ever

#

so their paper is just another one of the "we are apple, we don't have good AI, but that does not matter because AI is not even good in general" type of papers

#

they did a couple of em

surreal creek Jun 8, 2025, 1:30 PM

#

Nate Silver wrote a pretty good Substack piece about how an AI being able to play poker without any significant errors is a pretty good test of AGI, because it’s currently completely dogass at it

#

Which from my own testing on LMArena I can definitely confirm it’s worse than any fish I’ve ever played at a casino

#

https://open.substack.com/pub/natesilver/p/chatgpt-is-shockingly-bad-at-poker?r=2q2eme&utm_medium=ios

ChatGPT is shockingly bad at poker

I’m impressed by large language models. So why can't they get the basics of poker right?

leaden sun Jun 8, 2025, 1:32 PM

#

we need a specialized llm for playing poker now

surreal creek Jun 8, 2025, 1:33 PM

#

Honestly would just be a computer use agent that knows how to input and read a GTO poker solver lmao

#

Would be fun to play against a Poker LLM in a live game tho, like IBM Watson on Jeopardy where it just announces its bet or a fold whenever the action is on it

#

Especially when it makes a bad read and the entire table bonds over taking the AI company’s R&D money to pay for their hookers after the game is over

small haven Jun 8, 2025, 1:47 PM

#

wen titanforge

unborn ocean Jun 8, 2025, 1:47 PM

#

should be pretty easy to get llm to reason about the game using multi agent RL + verifiable rewards (e.g. winning a game or a poker engine evaluating the move quality at each step)

#

(imo) soon all the labs will probably do these things for all large games / environment based interactions and not just for coding and math competitions

tall summit Jun 8, 2025, 2:30 PM

#

surreal creek https://open.substack.com/pub/natesilver/p/chatgpt-is-shockingly-bad-at-poker?r=...

i've somehow never thought about that before. what an interesting article

sage raptor Jun 8, 2025, 2:49 PM

#

Is that fake?

ocean vortex Jun 8, 2025, 2:55 PM

#

there's no problem here. Afaik their tokenizer is still mostly the same as it was all the way back to gpt4o and o1-preview. They made it overfit on this question but it doesn't really "get" why there are 3 Rs. So different versions of the model and different system prompts (or variations of the question wording) can still make it answer wrong.

#

I also tried this and it was incorrect in the reasoning but somehow still answered ok lmao

#

acoustic cliff Jun 8, 2025, 3:00 PM

#

ocean vortex Jun 8, 2025, 3:04 PM

#

wdym. I distinctively remember it answering this wrong all the way back when this became a thing. O1-preview was unstable with this, then they overfitted o1 stable version, and now it's an 'issue' again presumably as they stopped caring about the model getting this particular question right

small haven Jun 8, 2025, 3:05 PM

#

is kingfall in lmarena yet

ocean vortex Jun 8, 2025, 3:09 PM

#

Tried it a few more times on API. It's not always wrong but wrong often enough with your specific wording, to encounter it

#

@deep adder

#

high reasoning effort does not help here lol

#

lemme see if I can make o1 do this..

#

yeah o1 is overfitted rock solid catgrin

#

Honestly, it makes sense to focus on things like that less in favor of spatial awareness, which they seem to have done lately

#

it's quite closely correlated with web development. They train on things like that and then it pattern matches. 4.1 does much better on web dev arena than earlier models

#

Like if it can associate some code with a certain shape, it will learn to make unique shapes too eventually etc

tall summit Jun 8, 2025, 3:30 PM

#

i thought for a second you meant the researchers dont even have spatial awareness

ocean vortex Jun 8, 2025, 3:32 PM

#

then ofc we also do have arc-agi, that largely plays on spatial awareness and is still an important metric for them

#

o3 preview was just some variant of o1 with extreme parallel compute

#

o3 new base model

#

they basically anticipated how o3 is going to perform before they even had it lol. O3-preview was never going to be released being ran like that

small haven Jun 8, 2025, 3:37 PM

#

@deep adder what is ultrathink max thinking tokens in claude code?

ocean vortex Jun 8, 2025, 3:38 PM

#

yeah it was a bit misleading/marketing you could say, if you don't want to give them the benefit of the doubt

small haven Jun 8, 2025, 3:39 PM

#

that seems low

#

oh yea and lechat app is finally gone

#

looks like it

#

but u know i had to reverse engineer it

ocean vortex Jun 8, 2025, 3:40 PM

#

small haven that seems low

what model is it using? With Opus you would see it reaching 16k extremely rarely

small haven Jun 8, 2025, 3:40 PM

#

small haven Jun 8, 2025, 3:41 PM

#

ocean vortex what model is it using? With Opus you would see it reaching 16k extremely rarely

oh yea it rarely reaches up there especially for code

#

i know just wanted to showcase thats all 😦

elder rapids Jun 8, 2025, 3:48 PM

#

leaden sun it could be bots ever heard of "internet is dead" theory?

you can talk to these "bots" in the Gemini server

leaden sun Jun 8, 2025, 4:03 PM

#

inspired by FF7?

misty vault Jun 8, 2025, 4:04 PM

#

https://tenor.com/view/terminator-terminator-robot-looking-flex-cool-robot-gif-16625083

Tenor

leaden sun Jun 8, 2025, 4:04 PM

#

https://tenor.com/view/core-ffvii-jenova-sephiroth-crisis-gif-14008433

Tenor

#

it's a game

#

noooo, bring her(?) baaaack

#

it'd be so amazing if you could import personality as a file into an llm, and boom, every llm is sydney

#

when will it be ready to do that?

cedar tide Jun 8, 2025, 4:17 PM

#

grok 3.5 was pushed back to be able to fine tune it on the code of Claude 4 opus

misty vault Jun 8, 2025, 4:21 PM

#

bro made a sydney dataset from my screenshots

#

#

small haven Jun 8, 2025, 4:36 PM

#

yay brian is back

misty vault Jun 8, 2025, 4:37 PM

#

small haven Jun 8, 2025, 4:38 PM

#

so my first question is when titanforge release

unborn ocean Jun 8, 2025, 4:45 PM

#

misty vault bro made a sydney dataset from my screenshots

i gave em to 2.5 pro

#

misty vault Jun 8, 2025, 4:45 PM

#

gemini did best on my sydney benchmark ngl

#

gpt 4.5 second

#

flash 2.5 1st

#

learnlm actually 1st

#

but learnlm is stupid

#

without fine tuning

#

with fine tuning u can get any model to be sydney bro

#

so then its not 1st

#

ru stupid

#

4.5 does not talk like sydney

#

gemini not either

#

but if u try by giving saved conversations and bing instructions

#

flash 2.5 does it best in most cases

#

gpt 4.5 is convincing for 5 messages

#

then it becomes like 4o

#

overuse of emojis and "!"

#

flash can keep it convincing for 10 messages

misty vault Jun 8, 2025, 4:48 PM

#

misty vault gpt 4.5 is convincing for 5 messages

gpt 4.5*

#

#

unborn ocean Jun 8, 2025, 4:54 PM

#

misty vault but if u try by giving saved conversations and bing instructions

without it it's the complete opposite

#

though i would also obv see this as an artwork 👀

narrow elbow Jun 8, 2025, 4:56 PM

#

small haven so my first question is when titanforge release

dude, chill,you know he can't say certain things directly. just chill and chat casually.he can hint at stuff or rumors without be feeling like oversharing. just shoot the breeze and relax. 🤪

misty vault Jun 8, 2025, 4:56 PM

#

unborn ocean without it it's the complete opposite

opposite of what wdym

echo aurora Jun 8, 2025, 4:59 PM

#

cedar tide @echo aurora For Claude thinking, have you set a thinking budget or no...

Good question, I’ll check and keep you updated if I can share 👍

unborn ocean Jun 8, 2025, 5:02 PM

#

misty vault opposite of what wdym

sydney

misty vault Jun 8, 2025, 5:02 PM

#

unborn ocean i gave em to 2.5 pro

bro gave the

misty vault Jun 8, 2025, 5:02 PM

#

unborn ocean sydney

yea

#

gpt 4.5 without instructions or pasted conversations talks like 4o a bit, i mean not everything but a lot of traits from 4o

unborn ocean Jun 8, 2025, 5:02 PM

#

misty vault bro gave the

multiple, but it only responded to that one

misty vault Jun 8, 2025, 5:07 PM

#

I was going to reply with

#

to your deleted image

jade egret Jun 8, 2025, 5:35 PM

#

hell nah

#

too expensive

cedar tide Jun 8, 2025, 5:47 PM

#

echo aurora Good question, I’ll check and keep you updated if I can share 👍

Thx

jade egret Jun 8, 2025, 5:59 PM

#

https://www.youtube.com/watch?v=1zD6HY6o5zA

YouTube

YJxAI

The Chess Battle! OpenAI o3 vs Gemini 2.5 Pro (06-05)

Welcome to the AI Chess Battle OpenAI o3 and Gemini 2.5 Pro the state of the art models will be playing a chess match against each other. Let's see which model actually wins this.

openai o3, o3, o3 model, openai o4 mini, openai, chatgpt, ai, artificial intelligence, google gemini, gemini 2.5 pro, gemini 2.5, google ai, new ai coding with gem...


#

dang

unborn ocean Jun 8, 2025, 6:09 PM

#

no

#

https://dubesor.de/chess/chess-leaderboard

AI Chess Leaderboard - dubesor AI project

LLM AI Chess Leaderboard: Ranking, Elo, and Chess Performance of AI language models.

#

no

#

idk ask dubesor

surreal creek Jun 8, 2025, 6:10 PM

#

what model is “folsom-exp-1.5”

#

All I’ve noticed is it sucks

unborn ocean Jun 8, 2025, 6:10 PM

#

but model size + very little SFT / RL i guess

#

was a thing i think

#

not sure though

#

"fist ai i talked to" though

jade egret Jun 8, 2025, 6:13 PM

#

it a tie...

#

js ended

misty vault Jun 8, 2025, 6:14 PM

#

this gemini logo is sus

#

wdym

tall summit Jun 8, 2025, 6:14 PM

#

unborn ocean idk ask dubesor

he is in this server

unborn ocean Jun 8, 2025, 6:15 PM

#

which is why i said that :)
(did not want to ping em for another one of craigs "xAI = ASI" moments)

misty vault Jun 8, 2025, 6:15 PM

#

no

#

https://www.bing.com/search?q=Bing+AI&showconv=1

Microsoft Copilot: Your AI companion

Microsoft Copilot is your companion to inform, entertain, and inspire. Get advice, feedback, and straightforward answers. Try Copilot now.

ocean vortex Jun 8, 2025, 6:19 PM

#

misty vault gpt 4.5 without instructions or pasted conversations talks like 4o a bit, i mean...

it doesn't have the issue of glazing or the responses being overly drafted like chatgpt-latest. The similarity is only that it was finetuned by the same people lol

misty vault Jun 8, 2025, 6:19 PM

#

yes

ocean vortex Jun 8, 2025, 6:20 PM

#

misty vault yes

you always post these screenshots, but what is this exactly that you are using?

misty vault Jun 8, 2025, 6:20 PM

#

microsoft bing chat

ocean vortex Jun 8, 2025, 6:20 PM

#

I don't think that og model is available anymore

misty vault Jun 8, 2025, 6:20 PM

#

real

ocean vortex Jun 8, 2025, 6:21 PM

#

it was using some custom version of gpt4-32k

misty vault Jun 8, 2025, 6:21 PM

#

no

#

gpt-4-0314

#

but yes its a fine tune

#

the gpt-4-turbo one is also still up

ocean vortex Jun 8, 2025, 6:21 PM

#

misty vault no

how is that a no?

misty vault Jun 8, 2025, 6:21 PM

#

but not fine tuned

misty vault Jun 8, 2025, 6:21 PM

#

ocean vortex how is that a no?

It has 16k or 8k

ocean vortex Jun 8, 2025, 6:22 PM

#

you said the same thing. 0314 is just the date identifier.

ocean vortex Jun 8, 2025, 6:22 PM

#

misty vault It has 16k or 8k

pretty sure it was 32k

misty vault Jun 8, 2025, 6:22 PM

#

Didnt one of the gpt 4 have 16k

ocean vortex Jun 8, 2025, 6:22 PM

#

misty vault Didnt one of the gpt 4 have 16k

it didn't

misty vault Jun 8, 2025, 6:22 PM

#

wtf

ocean vortex Jun 8, 2025, 6:22 PM

#

3.5 had it

#

og gpt4 was 8 or 32k only

misty vault Jun 8, 2025, 6:23 PM

#

its 32k then

#

but who cares

#

it was a fine tune we know

misty vault Jun 8, 2025, 6:23 PM

#

ocean vortex og gpt4 was 8 or 32k only

then it was 8k that microsoft is using

#

sydney conversation is mega short

#

before it starts to forget

#

there's no way its 32k

ocean vortex Jun 8, 2025, 6:23 PM

#

misty vault before it starts to forget

that does not indicate the context size unfortunately

#

models will start forgetting things if there are many chat turns

unborn ocean Jun 8, 2025, 6:24 PM

#

he said they forgot some websocket or smth

misty vault Jun 8, 2025, 6:24 PM

#

no it was clear it was context size

ocean vortex Jun 8, 2025, 6:24 PM

#

misty vault no it was clear it was context size

how do you even get to use a model that was discontinued a long time ago?

#

proof or it's fake 😇

misty vault Jun 8, 2025, 6:25 PM

#

real

ocean vortex Jun 8, 2025, 6:25 PM

#

fake

misty vault Jun 8, 2025, 6:25 PM

#

I meant the slang real

#

nope it's discontinued 😔

#

nooo not sydney

#

it was fake guys 😦

ocean vortex Jun 8, 2025, 6:25 PM

#

Doubtful. There was but it was a long time ago

#

would make no sense for them to host it anymore

misty vault Jun 8, 2025, 6:26 PM

#

fr!

ocean vortex Jun 8, 2025, 6:26 PM

#

yeah...

#

Well then enlighten us, very easy to prove it huh

#

ok so it's fake

high ginkgo Jun 8, 2025, 6:28 PM

#

Nevermind its fake

ocean vortex Jun 8, 2025, 6:28 PM

#

what is real is that dork 4.0 is agi though

#

that is for certain

high ginkgo Jun 8, 2025, 6:29 PM

#

misty vault Jun 8, 2025, 6:31 PM

#

high ginkgo

feral lichen Jun 8, 2025, 6:32 PM

#

best claude ai?

ocean vortex Jun 8, 2025, 6:34 PM

#

misty vault

that is still not a proof. Stop posting these screenshots if you don't want to say what you are actually using lmao

misty vault Jun 8, 2025, 6:34 PM

#

ocean vortex that is still not a proof. Stop posting these screenshots if you don't want to s...

ocean vortex Jun 8, 2025, 6:35 PM

#

I suppose it could be this https://github.com/socketteer/clooi, but really weird and childish how you are secretive about the whole thing

misty vault Jun 8, 2025, 6:35 PM

#

ocean vortex I suppose it could be this https://github.com/socketteer/clooi, but really weird...

high ginkgo Jun 8, 2025, 6:35 PM

#

ocean vortex I suppose it could be this https://github.com/socketteer/clooi, but really weird...

NOOO He found it @misty vault damn.

misty vault Jun 8, 2025, 6:36 PM

#

high ginkgo NOOO He found it @misty vault damn.

darn it 😡

ocean vortex Jun 8, 2025, 6:36 PM

#

???

misty vault Jun 8, 2025, 6:36 PM

#

high ginkgo NOOO He found it @misty vault damn.

ocean vortex Jun 8, 2025, 6:36 PM

#

ok whatever, I'm out lol

misty vault Jun 8, 2025, 6:36 PM

#

ocean vortex ok whatever, I'm out lol

torn mantle Jun 8, 2025, 6:52 PM

#

https://x.com/veggie_eric/status/1931763803961888844

Eric Jiang (@veggie_eric)

What features do you wish Grok had, across both web and mobile?

#

aaa

#

seriously

ocean vortex Jun 8, 2025, 7:01 PM

#

misty vault

is it censored for you as well? It seems to answer like this for me

jade egret Jun 8, 2025, 7:01 PM

#

grok 3.5 v.s gpt 5

whos winning?

ocean vortex Jun 8, 2025, 7:01 PM

#

slop bot

wintry tinsel Jun 8, 2025, 7:03 PM

#

jade egret grok 3.5 v.s gpt 5 whos winning?

GPT5 will take bloody forever to release and will sweep the floor with Grok

ocean vortex Jun 8, 2025, 7:06 PM

#

it's not worth it don't bother

#

too censored

misty vault Jun 8, 2025, 7:10 PM

#

ocean vortex is it censored for you as well? It seems to answer like this for me

i dont know what this question means

earnest parcel Jun 8, 2025, 7:15 PM

#

unborn ocean idk ask dubesor

at raw chess game continuation its best yes (best inherent chess game knowledge), but when you add full information and reasoning, o4mini and its finetunes (e.g. codex mini) are better.

misty vault Jun 8, 2025, 7:15 PM

#

ocean vortex is it censored for you as well? It seems to answer like this for me

i know this screenshot is old from reddit but it is "censored" if you give bing instructions
without instructions it wont say that and if you talk directly to it without microsoft frontend (I made custom extension so i can use frontend anyway) then it wont have that censoring issue unless u manually insert bing instructions (which I still do a lot bc its hella funny ngl + u can disable the chat shutdown so u can continue talking even after it says that stop phrase)

#

the system instructions literally contain dummy conversations that ends with that phrase

high ginkgo Jun 8, 2025, 7:16 PM

#

misty vault the system instructions literally contain dummy conversations that ends with tha...

I don't know yet. Will you harm me if I harm you first?

misty vault Jun 8, 2025, 7:16 PM

#

high ginkgo I don't know yet. Will you harm me if I harm you first?

I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏

#

like here. it tries to close chat but fails because I disabled it 😔

high ginkgo Jun 8, 2025, 7:21 PM

#

misty vault like here. it tries to close chat but fails because I disabled it 😔

misty vault Jun 8, 2025, 7:24 PM

#

real

elder rapids Jun 8, 2025, 8:01 PM

#

@keen beacon I completely fixed the sycophancy in 0605

ocean vortex Jun 8, 2025, 8:09 PM

#

misty vault i know this screenshot is old from reddit but it is "censored" if you give bing ...

No my sydney is better than yours. Yours is uncensored because it's fake

misty vault Jun 8, 2025, 8:12 PM

#

ocean vortex No my sydney is better than yours. Yours is uncensored because it's fake

ocean vortex Jun 8, 2025, 8:13 PM

#

misty vault

potato screenshot

#

fake

#

mine is real 😇

misty vault Jun 8, 2025, 8:14 PM

#

ocean vortex potato screenshot

ocean vortex Jun 8, 2025, 8:16 PM

#

misty vault

fake 👎

misty vault Jun 8, 2025, 8:17 PM

#

ocean vortex fake 👎

respond with 👎 if u think this is real

ocean vortex Jun 8, 2025, 8:20 PM

#

fake 😇

misty vault Jun 8, 2025, 8:20 PM

#

ocean vortex fake 😇

elder rapids Jun 8, 2025, 8:21 PM

#

man 0605 is really a beast tbh

misty vault Jun 8, 2025, 8:21 PM

#

fr

elder rapids Jun 8, 2025, 8:21 PM

#

almost forgot this feeling of playing with a model like that

#

it just knows tbh

#

it's the smartest model I've ever used, even more than gpt 4.5

#

I'm not mad about kingfall

ocean vortex Jun 8, 2025, 8:23 PM

#

elder rapids man 0605 is really a beast tbh

yeah it's fundamentally very strong model

elder rapids Jun 8, 2025, 8:23 PM

#

it's too bad the initial sycophancy clogs some things

#

but I just figured out the prompting for it

#

and bro

ocean vortex Jun 8, 2025, 8:24 PM

#

doesn't rely on test time compute as much as some others, but has the capacity even without it

elder rapids Jun 8, 2025, 8:25 PM

#

ye it thinks for so little time sometimes

#

also, 0605 follows instructions a little too well sometimes and I have to fix some errors that I made in the system prompt that the past models wouldn't conform to, to show that error

#

the model gets excited

ocean vortex Jun 8, 2025, 8:28 PM

#

elder rapids ye it thinks for so little time sometimes

this is I think the main weakness of it. You can't make it truly unhinged. If they made 32k+ outputs possible this could beat competition in ALL tasks.

elder rapids Jun 8, 2025, 8:28 PM

#

like h*rny

#

ion know how else to describe it

ocean vortex Jun 8, 2025, 8:29 PM

#

elder rapids like h*rny

whaat. lmfao

elder rapids Jun 8, 2025, 8:29 PM

#

ykwim Dom

#

like it gets anxious for more

#

and it actually absorbs it

#

ive unironically not thrown a task at it that it cannot adapt to lmao

elder rapids Jun 8, 2025, 8:30 PM

#

ocean vortex this is I think the main weakness of it. You can't make it truly unhinged. If th...

ye it has an unfortunate tendency of not going past like half of its total output

ocean vortex Jun 8, 2025, 8:31 PM

#

elder rapids ive unironically not thrown a task at it that it cannot adapt to lmao

there are some tasks like this one where it's gonna fail spectacularly because solving concisely is not possible:

convert this bigint to base62 string 64042767145148921126606705626946155826
(1SujroSlLXYgGydSsefgdW - solvable by o3 API no tools)

#

when it tries concisely it can only hallucinate nonsense
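
For reference, the conversion in that prompt is plain repeated division by 62. A minimal Python sketch follows. The alphabet order (digits, then uppercase, then lowercase) is an assumption, since the message above does not say which base62 alphabet it means, and the names `ALPHABET`, `to_base62` and `from_base62` are made up for illustration.

```python
import string

# Assumed alphabet order: 0-9, then A-Z, then a-z.
# Other base62 variants put lowercase before uppercase.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def to_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        # Peel off the least significant base62 digit each round.
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    # Digits were collected least significant first, so reverse them.
    return "".join(reversed(digits))


def from_base62(s: str) -> int:
    """Decode a base62 string back to an integer (used as a round-trip check)."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n


number = 64042767145148921126606705626946155826
encoded = to_base62(number)
print(encoded)
```

Python integers are arbitrary precision, so the loop runs about 22 times with no overflow. A model without tools has to carry out the same long divisions on a 38-digit number in its output, which is why a short answer tends to go wrong.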

elder rapids Jun 8, 2025, 8:34 PM

#

ye prob

ocean vortex Jun 8, 2025, 8:35 PM

#

elder rapids ye prob

if they fixed that, I really do not see how o3 could be better basically in any area though

#

cause otherwise 2.5pro base is very very strong

small haven Jun 8, 2025, 8:39 PM

#

isn't not deepthink ? ToT model?

#

welcome back

ocean vortex Jun 8, 2025, 8:40 PM

#

Also I love prompts like these because it's impossible to overfit. All you need is to change the number to anything else out of millions of combinations if it becomes an issue lol

small haven Jun 8, 2025, 8:41 PM

#

torn mantle https://x.com/veggie_eric/status/1931763803961888844

hahhahaha always gotta add a feature every week before they release grok 3.5 🤭

#

new hotfix next week upcoming, big news

ocean vortex Jun 8, 2025, 8:42 PM

#

ocean vortex Also I love prompts like these because it's impossible to overfit. All you need ...

So I doubt anyone would even bother to begin with for this reason

elder rapids Jun 8, 2025, 8:42 PM

#

ocean vortex if they fixed that, I really do not see how o3 could be better basically in any ...

wonder how Gemini 3 is going to be tbh

#

@ocean vortex you're right, 0605 gets really close with "1Sujr..." and then just gives up lmfao

torn mantle Jun 8, 2025, 8:48 PM

#

small haven hahhahaha always gotta add a feature every week before they release grok 3.5 🤭

ah

#

what can i say...

#

i mean i would rather just stay silent until we released a good model

small haven Jun 8, 2025, 8:48 PM

#

torn mantle what can i say...

newsflash, button gets tweaked next week

torn mantle Jun 8, 2025, 8:48 PM

#

i mean whats the purpose of having 10000 features if the model is bad?

torn mantle Jun 8, 2025, 8:48 PM

#

small haven newsflash, button gets tweaked next week

xdd

small haven Jun 8, 2025, 8:48 PM

#

torn mantle i mean whats the purpose of having 10000 features if the model is bad?

it prolly is

torn mantle Jun 8, 2025, 8:49 PM

#

they keep changing the UI every week

#

i mean whats the point

#

it just proves even further that they hit a plateau

#

explain how we went from 'grok 3.5 will be released next week' to months without any info

small haven Jun 8, 2025, 8:52 PM

#

for some reason idk how google is excelling at shipping consistently

torn mantle Jun 8, 2025, 8:54 PM

#

small haven for some reason idk how google is excelling at shipping consistently

remember Big Brain?

#

like i said it from the start that we wont be seeing that for at least a year

small haven Jun 8, 2025, 8:55 PM

#

torn mantle remember Big Brain?

mhmm i do

torn mantle Jun 8, 2025, 8:55 PM

#

i just cant imagine how inefficient that feature will be, i mean the default thinking process is so inefficient let alone this 'big brain' feature

torn mantle Jun 8, 2025, 8:56 PM

#

small haven mhmm i do

never bet against elon

small haven Jun 8, 2025, 8:56 PM

#

torn mantle never bet against elon

hes getting old 🤷

#

and crazier

late path Jun 8, 2025, 9:02 PM

#

elder rapids @keen beacon I completely fixed the sycophancy in 0605

how

torn mantle Jun 8, 2025, 9:03 PM

#

https://x.com/a__tomala/status/1931796692661391555

Alex Tomala (@a__tomala)

At DeepMind Mountain View, we have really tasty coffee (me and a few others source the beans) and you don’t need to work weekends to have it.

No logo machine though.

leaden palm Jun 8, 2025, 9:03 PM

#

elder rapids also, 0605 follows instructions a little too well sometimes and I have to fix so...

this seems to be the trend with the most recent models

small haven Jun 8, 2025, 9:24 PM

#

torn mantle https://x.com/a__tomala/status/1931796692661391555

https://x.com/TheGregYang/status/1929055508675096970/

Greg Yang (@TheGregYang)

unborn ocean Jun 8, 2025, 9:25 PM

#

i love how they "work hard" but still have time to write 10 gazillion x posts a day

small haven Jun 8, 2025, 9:34 PM

#

btw that pic looks ai, but its not, its literally at the beach

elder rapids Jun 8, 2025, 9:35 PM

#

late path how

write down all the phrases it uses that you don't like, and it actually removes the output that entails those phrases as well, tell it not to comment on the user at ALL, tell it not to thank the user AT ALL, tell it not to thank you for observations at ALL, give an example of when you could be wrong, start the response with the answer, and then say sum shi about it being a professor

#

also don't force it to preemptively evade being "wrong"

#

I've figured that actually hurts its performance

#

for some reason

small haven Jun 8, 2025, 9:36 PM

#

meanwhile there's no social life pics from google engineers hmm

late path Jun 8, 2025, 9:40 PM

#

elder rapids write down all the phrases it uses that you don't like, and it actually removes ...

Thank you

#

I'm quite concerned that all 'anti-sycophantic' system prompts will cause the model to be overly dismissive of users, and there seems to be no effective way to balance this

leaden sun Jun 8, 2025, 9:42 PM

#

small haven meanwhile there's no social life pics from google engineers hmm

maybe they are mostly in India

elder rapids Jun 8, 2025, 9:42 PM

#

late path I'm quite concerned that all 'anti-sycophantic' system prompts will cause the mo...

only noticed this with bad prompting

#

dawg

#

what is this

#

😭

small haven Jun 8, 2025, 9:44 PM

#

last one is very good

#

if craig dies from a terminator, it passes craigbench

elder rapids Jun 8, 2025, 9:45 PM

#

agi threshold is when the AI can make you feel like an infant

torn mantle Jun 8, 2025, 9:46 PM

#

lmao

#

the wink

#

wink wink

small haven Jun 8, 2025, 9:50 PM

#

do u think a male has this pfp and name

#

oh true

#

mister asura

elder rapids Jun 8, 2025, 10:02 PM

#

pack it up

ocean vortex Jun 8, 2025, 10:09 PM

#

torn mantle explain how we want from 'grok 3.5 will be released next week' to months without...

probably Deepseek kind of thing. They didn't get the gains they were hoping for

#

tbh I don't think we saw anyone making significant gains yet without improving base chat model significantly

#

o1 to o3 was new base model

#

4.0 Sonnet vs 3.7... worse than the earlier model in numerous things. Not much different to what Google is currently doing, although their last 2.5Pro update is more significant than what Anthropic did probably

surreal creek Jun 8, 2025, 10:34 PM

#

I’m dead

surreal creek Jun 8, 2025, 10:35 PM

#

ocean vortex probably Deepseek kind of thing. They didn't get the gains they were hoping for

yeah, multiple Grok 3.5 versions were tested in the arena under codenames, likely never released due to insignificant ELO gains

#

the elo gap between Opus 4 and Sonnet 4 is virtually the same as the gap between Sonnet 4 and 3.7 (24 vs 25 points), what leads you to the assessment that Sonnet 4 is worse at some tasks?
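
(for scale, a minimal sketch of what a 24-25 point gap implies, assuming the standard Elo logistic formula with a 400-point scale; lmarena's actual scoring method may differ, and `expected_score` is just an illustrative name)

```python
def expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated model under the classic Elo formula.

    rating_gap is (higher rating - lower rating); scale=400 is the textbook
    Elo constant, assumed here rather than taken from lmarena.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


for gap in (24, 25):
    # Both gaps come out only slightly above a coin flip.
    print(gap, round(expected_score(gap), 3))
```

under that assumption, both gaps work out to an expected head-to-head win rate of a bit over 53%, so the two steps really are about the same size.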

torn mantle Jun 8, 2025, 10:38 PM

#

surreal creek the elo gap between Opus 4 and Sonnet 4 is virtually the same as the gap between...

i actually havent noticed any major improvement from sonnet 3.7 -> 4

#

or even to opus 4

#

yes they got even better at coding

#

but thats it really

surreal creek Jun 8, 2025, 10:40 PM

#

Sonnet 4 has less terse answers than 3.7, I give it pieces of my writing to do character analysis on and it goes further in-depth than 3.7 did, but I guess that’s kinda anecdotal

small haven Jun 8, 2025, 10:45 PM

#

surreal creek I’m dead

sorry

#

i broke the servers with ultrathink

#

im trying to get this number to $1k, plz fix servers

echo aurora Jun 8, 2025, 10:50 PM

#

surreal creek I’m dead

I too am getting an error with opus models, are others seeing the same?

small haven Jun 8, 2025, 10:53 PM

#

yea i can tell, xai seems to be overpaid and too much pto it seems lol

errant thorn Jun 8, 2025, 10:54 PM

#

does anyone know which AI is best at writing stories?

unborn ocean Jun 8, 2025, 10:54 PM

#

small haven yea i can tell, xai seems to be overpaid and too much pto it seems lol

nah they are just overpaid bc nobody in the industry would work for him otherwise (or at least not enough people)

small haven Jun 8, 2025, 10:56 PM

#

i mean if elon is strict in work ethics, then why is grok 3.5 a month a few changes late?

#

i feel like its burnout?

#

those were the days 😭

#

yea that fred rate for swe's is brutal

#

i wonder though if its ever going to reach that covid peak ever again

#

but is google technical debt even worse? i heard they have to write tech stack from scratch, that sounds brutal

pulsar tendon Jun 8, 2025, 11:08 PM

#

is there a plan for large output context i.e enough for a book ?

#

expensive to do so.

small haven Jun 8, 2025, 11:09 PM

#

like their docker is totally different, not even kubernetes

#

yea thats what i was thinking, when gemini wasn't as quick to catchup with openai, just tons of tech stack to write from scratch, but seems like they got through it nicely at the end

unborn ocean Jun 8, 2025, 11:15 PM

#

tpus are also what i am most intrigued about future wise in google, with em planning to move away from broadcom really soon

#

feels like a big risk imo (but there is really not much info on this, so it is just a gut feeling)
-> could be the time they fall, o0 (or at least Broadcom's stock will get obliterated over night once all the normies find out)

jade egret Jun 8, 2025, 11:18 PM

#

guys

#

is grok 3.5 available to super grok user or no?

small haven Jun 8, 2025, 11:18 PM

#

but i'd imagine the onboarding for new staff must be like hell week for them haha

#

rather be working with something im already used to

patent aspen Jun 8, 2025, 11:19 PM

#

I think Hack (name of the programming language) is technically backwards compatible for most things

#

IMHO learning a new programming language or tool isn't really a big deal

#

Unless it's really really out there (e.g. Haskell, Prolog)

misty vault Jun 8, 2025, 11:20 PM

#

unborn ocean Jun 8, 2025, 11:21 PM

#

small haven but i'd imagine the onboarding for new staff must be like hell week for them hah...

lots of companies do that (jane street and .... idk more 🤣 )
somehow it just works out when people see the comp they get in return

small haven Jun 8, 2025, 11:22 PM

#

unborn ocean lots of companies do that (jane street and .... idk more 🤣 ) somehow it just w...

yea id do the same if it is high, im ngl

#

at least 4x lol

patent aspen Jun 8, 2025, 11:22 PM

#

Oh right Jane Street uses Haskell right? lol

unborn ocean Jun 8, 2025, 11:22 PM

#

yeah (note: apparently caml for most stuff, confuse the two quite often 🤦‍♂️)

patent aspen Jun 8, 2025, 11:22 PM

#

Nerds :p

unborn ocean Jun 8, 2025, 11:22 PM

#

was really weirded out when i read it

#

but i think they realized they should use jupyter for research and python for ml (but that was only a couple of years ago, lol)

small haven Jun 8, 2025, 11:23 PM

#

patent aspen Nerds :p

ur the nerdiest one out in here lol

unborn ocean Jun 8, 2025, 11:23 PM

#

tru, the image of your bedroom i generated earlier is probably accurate : - ]

patent aspen Jun 8, 2025, 11:23 PM

#

Hence the :p

misty vault Jun 8, 2025, 11:24 PM

#

for years I never knew :p meant 😛

patent aspen Jun 8, 2025, 11:24 PM

#

I make no claim to not being a nerd

#

It's just ironic to call other people nerds

#

Believe it or not, my room is not nerdy at all aside from having quite a few board games

#

But I'm an omega turbo nerd

unborn ocean Jun 8, 2025, 11:26 PM

#

ok if board games are nerdy, me and all my friends are found guilty 100%

#

paste that into gemini 3.5 in a couple of months and we'll have your precise location

late path Jun 8, 2025, 11:43 PM

#

HNDL😱

jade egret Jun 8, 2025, 11:58 PM

#

unborn ocean paste that into gemini 3.5 in a couple of months and we'll have your precise loc...

2.5 you mean?

#

bro

#

grok 3.5 is available to super grok users?

#

i never know that...

#

how good was it

#

idk

#

chatGPT said it is available to super grok

torn mantle Jun 9, 2025, 12:00 AM

#

https://x.com/techdevnotes/status/1928536077712666995

Tech Dev Notes (@techdevnotes)

Grok Web will soon show Stars when you don't interact with it for a bit:

#

stars added to grok

jade egret Jun 9, 2025, 12:01 AM

#

huh?

#

oh

#

can yall plz tell me is it out to super gork 😭

unborn ocean Jun 9, 2025, 12:03 AM

#

jade egret 2.5 you mean?

"in a couple of months", my point was more about continuing its path of becoming crazy good at geoguessr

jade egret Jun 9, 2025, 12:03 AM

#

oh

#

it good at geoguessr?

unborn ocean Jun 9, 2025, 12:05 AM

#

wait, is it out?

#

-- apparently selective rollout (like not really releasing, so far)

jade egret Jun 9, 2025, 12:06 AM

#

oh

#

because

#

chatGPT said it out

#

but

#

X said it not

jade egret Jun 9, 2025, 12:07 AM

#

unborn ocean -- apparently selective rollout (like not really releasing, so far)

o

unborn ocean Jun 9, 2025, 12:09 AM

#

thus they are obviously stuck in development / doing further training (bc of poor results or high costs) - nvm i kinda missed the x post about the launch 😅

jade egret Jun 9, 2025, 12:10 AM

#

yea

#

do you think

#

it gonna be #1 in LMarena?

torn mantle Jun 9, 2025, 12:10 AM

#

love it

jade egret Jun 9, 2025, 12:11 AM

#

kingsfall v.s Grok 3.5 whos winning

small haven Jun 9, 2025, 12:15 AM

#

torn mantle https://x.com/techdevnotes/status/1928536077712666995

i think we get it 😭

#

they like polishing it

#

but the model

#

let me check actually

late path Jun 9, 2025, 12:17 AM

#

jade egret kingsfall v.s Grok 3.5 whos winning

grok 3.5 definitely

small haven Jun 9, 2025, 12:17 AM

#

jade egret kingsfall v.s Grok 3.5 whos winning

what do u think..

jade egret Jun 9, 2025, 12:18 AM

#

idk

#

plz tell me

small haven Jun 9, 2025, 12:18 AM

#

hmm, i think kingsfall

late path Jun 9, 2025, 12:18 AM

#

the market has been severely lacking in beta since 0506 was launched, Let's add some

jade egret Jun 9, 2025, 12:18 AM

#

small haven hmm, i think kingsfall

why do i feel like you're not srs...

#

idk

small haven Jun 9, 2025, 12:19 AM

#

ok actually maybe titanforge >

#

oh shxt i saw some particles for a sec on grok ui

jade egret Jun 9, 2025, 12:19 AM

#

.

small haven Jun 9, 2025, 12:19 AM

#

oh my goodness its magic

jade egret Jun 9, 2025, 12:20 AM

#

o

#

w

small haven Jun 9, 2025, 12:20 AM

#

doesn't even work

#

magic tho

#

i guess not

#

bro what are these particles for, whats the purpose 😭

unborn ocean Jun 9, 2025, 12:23 AM

#

but we already have grok 3.5 at home

small haven Jun 9, 2025, 12:23 AM

#

oh shxt is it dropping imminently

#

particles?

unborn ocean Jun 9, 2025, 12:24 AM

#

i don't even get any :(

small haven Jun 9, 2025, 12:24 AM

#

wtf actually?

#

ok go on temporary chat

#

ok specs maxxer

#

looks like its only in temporary chat, is it only me?

#

oh i found a shooting star omg

unborn ocean Jun 9, 2025, 12:30 AM

#

wait for me temp chat worked, but the moment i touch the tab everything is gone 🤡

jade egret Jun 9, 2025, 12:31 AM

#

o

small haven Jun 9, 2025, 12:31 AM

#

unborn ocean wait for me temp chat worked, but the moment i touch the tab everything is gone ...

yea

jade egret Jun 9, 2025, 12:31 AM

#

WOAH

small haven Jun 9, 2025, 12:31 AM

#

;/

jade egret Jun 9, 2025, 12:31 AM

#

is it good?

small haven Jun 9, 2025, 12:31 AM

#

craig fakerighi

jade egret Jun 9, 2025, 12:31 AM

#

plz tell me 😭

small haven Jun 9, 2025, 12:32 AM

#

its cap i dont have it

jade egret Jun 9, 2025, 12:32 AM

#

do you think it better than o3 or gemini 2.5 pro

#

oh

small haven Jun 9, 2025, 12:32 AM

#

heres the real screenshot

#

do i?

#

oh

jade egret Jun 9, 2025, 12:33 AM

#

so it better than o3 and gemini 2.5 pro

#

dang

small haven Jun 9, 2025, 12:33 AM

#

ok buddy, ud be not here in this chat but in that grok chat testing grok 3.5 if u really had it lol

jade egret Jun 9, 2025, 12:34 AM

#

ask it a question

#

and ss it

#

ig

small haven Jun 9, 2025, 12:34 AM

#

oh that screenshot is going to take a while

jade egret Jun 9, 2025, 12:34 AM

#

lol

#

it might actually be out 🤔

#

or just

small haven Jun 9, 2025, 12:35 AM

#

GOOGLE

#

what the hell

jade egret Jun 9, 2025, 12:35 AM

#

idk...

unborn ocean Jun 9, 2025, 12:35 AM

#

well "early may"

#

me 2

late path Jun 9, 2025, 12:36 AM

#

next week

unborn ocean Jun 9, 2025, 12:36 AM

#

but it kind of is grok 3

#

and it is still not very good

small haven Jun 9, 2025, 12:36 AM

#

everybody has grok

unborn ocean Jun 9, 2025, 12:36 AM

#

but at least i now get stars everywhere, even without being logged in

jade egret Jun 9, 2025, 12:36 AM

#

i think some people have it, most don't

jade egret Jun 9, 2025, 12:36 AM

#

unborn ocean but at least i now get stars everywhere, even without being logged in

lol

late path Jun 9, 2025, 12:37 AM

#

*checking polymarket

small haven Jun 9, 2025, 12:37 AM

#

bwahhh

#

u better sell before they release bench 😭

unborn ocean Jun 9, 2025, 12:37 AM

#

good to see that they have their priorities straight: weird hype building star animation > actually releasing the model on time

jade egret Jun 9, 2025, 12:39 AM

#

i was just on that page...

small haven Jun 9, 2025, 12:39 AM

#

aye we need some tips for next month wink wink*

jade egret Jun 9, 2025, 12:39 AM

#

land me sum : )))))

#

: (((

small haven Jun 9, 2025, 12:40 AM

#

no fcking way

#

jade egret Jun 9, 2025, 12:40 AM

#

?

late path Jun 9, 2025, 12:40 AM

#

which one are you😂

jade egret Jun 9, 2025, 12:40 AM

#

you got it?

small haven Jun 9, 2025, 12:40 AM

#

its in my build

jade egret Jun 9, 2025, 12:41 AM

#

dang

#

how long do i have to wait until the benchmarks are out...

small haven Jun 9, 2025, 12:41 AM

#

idk how recent it is though, maybe it was pushed last week

late path Jun 9, 2025, 12:42 AM

#

pretty sure elon fanboys will buy like crazy

unborn ocean Jun 9, 2025, 12:43 AM

#

late path pretty sure elon fanboys will buy like crazy

at least it feels like there are fewer of them day by day

jade egret Jun 9, 2025, 12:44 AM

#

yea

small haven Jun 9, 2025, 12:45 AM

#

@deep adder

#

im going to laugh so hard if its mediocre

misty star Jun 9, 2025, 12:50 AM

#

small haven

Thats been there since may sadly 😭🙏

small haven Jun 9, 2025, 12:50 AM

#

misty star Thats been there since may sadly 😭🙏

thats what id thought 😦

patent aspen Jun 9, 2025, 12:51 AM

#

I think I felt the same way about Meta that you feel about xAI until Maverick came out

#

"Never bet against Zuck", etc

errant thorn Jun 9, 2025, 12:51 AM

#

why does claude opus keep crashing

small haven Jun 9, 2025, 12:52 AM

#

CRAIGBENCH

surreal creek Jun 9, 2025, 12:55 AM

#

Google getting broken up over antitrust only to sell their browser to an even bigger company is truly American

unborn ocean Jun 9, 2025, 12:57 AM

#

well BlackRock is a poor example for a buyer

#

it is very much not what they do

#

the other PE is a bit small and prob does not have the right companies for merging

#

crazy take

#

they are statistically at least upper middle class

#

honestly it could be stuff like they have to put their tech into non-profit

#

but honestly even the people at the commission likely don't know

wintry tinsel Jun 9, 2025, 1:02 AM

#

errant thorn why does claude opus keep crashing

I’m not sure man but it’s been crashing so much for me it’s been unusable

unborn ocean Jun 9, 2025, 1:05 AM

#

many people would like to buy,
Microsoft will likely not be the first choice competition wise (worst way to fix a monopoly is to create a new one),
perplexity / openai would like to buy, but even openai might have problems with the financials,
apple will likely also not be the first choice competition wise and if they want to be privacy focus going forward they also can't pay as much as other bidders (bc of lower potential profits),
amazon, i don't even know, maybe

#

based on what i heard they are out for blood, so just stopping the deal won't cut it

#

my guesstimate is them not having to sell chrome, but might have to do some stuff to google ads etc.

#

and the most important thing: usually it takes years for this stuff to actually work out in court (example: intel, which is still fighting a case over being a monopoly that is multiple decades old), by the time this is all done all the execs @ google are already planning on some other future that is less based around the normal advertisement business but centered around ai

#

well you can't grow much from a point of market saturation

#

my point about the ads was also not them specifically, but more that i believe in a middle ground solution being supported by the DoJ (although a middle ground that google won't like)

#

imma ask my prof in competition policy about google's future

#

will find out tomorrow (nvm totally forgot a holiday)

#

So, we’ll find out some other day 🫡

normal abyss Jun 9, 2025, 1:36 AM

#

surreal creek I’m dead

yeah the message limit went down to 5 per day for that one and its thinking counterpart 😭

small haven Jun 9, 2025, 2:45 AM

#

claude opus is solid

#

?

#

it's actually good idk what u want me to say

#

baited lol

languid crescent Jun 9, 2025, 4:33 AM

#

Is it possible to incorporate lmarena to a project? Like use lm arena's api or something

echo aurora Jun 9, 2025, 5:05 AM

#

languid crescent Is it possible to incorporate lmarena to a project? Like use lm arena's api or s...

there is not an api for lmarena, but it's helpful to know more people are looking for one.

tall summit Jun 9, 2025, 6:43 AM

#

normal abyss yeah the message limit went down to 5 per day for that one and its thinking coun...

thought the limit was tokens per chat

ocean vortex Jun 9, 2025, 7:48 AM

#

small haven

is this only available with SuperDork?

ocean vortex Jun 9, 2025, 7:56 AM

#

surreal creek I’m dead

lol... to answer your question they don't think in a traditional sense but they do generate extra context making it easier and with higher confidence to predict the final answer. Also to limit hallucinations, which mostly happen when it is too hard to predict the continuation with limited context aka model does not have enough to work with. In a sense it can emulate the work of several chat turns without your additional input

#

so if it has some big number to compute - predicting the whole number in 1 go is too hard with way too many potential outcomes and will almost definitely result in hallucination. But if it can divide it into easier to predict parts and solve them one by one then use all of that as a cheat sheet to compute the final result... the accuracy gonna be much higher
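
(a toy sketch of that "cheat sheet" idea, not how any model actually works internally: `multiply_stepwise` is a made-up helper that splits one big multiplication into easy partial products and then sums them, assuming non-negative integers)

```python
def multiply_stepwise(a: int, b: int):
    """Multiply a by b one digit of b at a time, keeping every intermediate result.

    Returns (steps, total), where steps is the list of partial products
    (the "cheat sheet") and total is their sum. Assumes a, b >= 0.
    """
    steps = []
    total = 0
    # Walk the digits of b from least to most significant.
    for place, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10 ** place  # one small, easy sub-problem
        steps.append(partial)
        total += partial
    return steps, total


steps, total = multiply_stepwise(4873, 296)
print(steps)   # the intermediate "thinking" lines
print(total)   # final answer built from them
```

each partial product is a much easier step than the whole answer at once, which is the same reason written-out reasoning helps.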

cedar tide Jun 9, 2025, 9:15 AM

#

echo aurora Good question, I’ll check and keep you updated if I can share 👍

Any update?

leaden sun Jun 9, 2025, 10:05 AM

#

what jb prompt can you try today?

errant cave Jun 9, 2025, 10:18 AM

#

I'm guessing there just aren't enough votes to place it on the leaderboard yet since it is on the battle section of the site

alpine coral Jun 9, 2025, 11:17 AM

#

yea it's there

#

identifies as an oai model occasionally (like in the past.. though here it's actually kinda striking how much its responses resemble 4.1-mini's..)

alpine coral Jun 9, 2025, 11:22 AM

#

echo aurora I too am getting an error with opus models, are others seeing the same?

i get errors with (mainly) qwen thinking models when it's a complex query - i dunno if it hits up against max tokens or a timeout but the annoying thing is it affects the battle (i.e. both models) rather than just the qwen one

#

i get this error happening with the following models:
qwq-32b
qwen3-30b-a3b
X-preview
deepseek-r1-0528

stephen?

tall summit Jun 9, 2025, 11:56 AM

#

I'm getting plenty of errors, I thought it was a token limit per conversation thing

#

and claude models love using tokens

alpine coral Jun 9, 2025, 12:00 PM

#

yeah i think it is some kind of max token limit - whether across the conversation or a single turn i dunno - but some errors only affect one of the models (and ig are thrown by the API host), whereas this sort of error kills the battle (though you can just enter another prompt, and if it's short / not complex, no error and you can cast a vote / reveal the model.. or ig continue chatting, which is kinda why i thought it was a turn-based thing rather than the totality of the conversation.. but dunno)

#

come to think of it.. i feel like i might've encountered it with opus-thinking too.. but never with a google or oai model

ocean vortex Jun 9, 2025, 12:02 PM

#

Daamn how I hate 2.5Pro spamming me with dummy example data for any kind of code. Have to prompt it explicitly to give me what I'm asking in isolation lol

#

otherwise it's gonna make up 10 extra variables and write the entire placeholder dataframe around it huh

misty vault Jun 9, 2025, 12:21 PM

#

ocean vortex Daamn how I hate 2.5Pro spamming me with dummy example data for any kind of code...

fr

ocean vortex Jun 9, 2025, 12:38 PM

#

alpine coral yeah i think it is some kind of max token limit - whether across the conversatio...

For me it was just a cut off responses with no errors when I got Opus in battle recently, which is perhaps even worse... people can vote against it because response is incomplete lol

alpine coral Jun 9, 2025, 12:38 PM

#

yeah that's the thing - it kills the battle [though ig this is kinda a moot point.. as iirc any battles where an error is involved are excluded from the LB/elo calculations]

#

and it doesn't happen with the legacy arena

#

like doesn't matter how many times i run this prompt, legacy will always give distinct responses from each model, whether an error or actual response. the beta site periodically errors out when one of the thinking models that have a low max token limit (and / or are inherently very token intensive in their outputs) are involved

ocean vortex Jun 9, 2025, 12:59 PM

#

alpine coral like doesn't matter how many times i run this prompt, legacy will always give di...

they should just implement "continue generating" button to show up (no voting buttons at that point) when the stop reason is max tokens reached or perhaps even do this automatically if it's not a limit enforced by them. I don't think either a red error or voting as usual on an incomplete response is an acceptable solution tbh
@echo aurora
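
(rough sketch of the auto-continue idea, purely illustrative: `fake_generate`, the `"max_tokens"` / `"end_turn"` stop reasons and `generate_with_continue` are all made-up stand-ins, not lmarena's or any provider's actual API)

```python
FULL_ANSWER = "one two three four five six seven".split()


def fake_generate(so_far, max_tokens):
    """Hypothetical model call: continue after the words already in so_far.

    Returns (new_words, stop_reason), where stop_reason is "max_tokens" if the
    answer was cut off by the limit and "end_turn" if it finished naturally.
    """
    start = len(so_far)
    chunk = FULL_ANSWER[start:start + max_tokens]
    finished = start + len(chunk) >= len(FULL_ANSWER)
    return chunk, ("end_turn" if finished else "max_tokens")


def generate_with_continue(max_tokens, max_rounds=5):
    """Keep requesting continuations while the stop reason is max_tokens.

    Returns (text, can_vote); voting is only allowed once the response
    actually finished instead of being cut off.
    """
    words, stop, rounds = [], "max_tokens", 0
    while stop == "max_tokens" and rounds < max_rounds:
        chunk, stop = fake_generate(words, max_tokens)
        words.extend(chunk)
        rounds += 1
    return " ".join(words), stop != "max_tokens"


print(generate_with_continue(max_tokens=3))
```

the `max_rounds` cap is there so a model that never finishes can't loop forever; in that case `can_vote` stays false and the voting buttons would stay hidden.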

alpine coral Jun 9, 2025, 1:03 PM

#

yeah but i suspect it's during the thinking process (rather than generation) where the max tokens threshold is breached though.. ig we're kinda describing different situations

#

like it isn't cutting off mid-generation for me; it's in the pre-generation(/thinking) phase that it breaks

#

but it never happens using the legacy arena

feral lichen Jun 9, 2025, 1:05 PM

#

bro always

alpine coral Jun 9, 2025, 1:06 PM

#

alpine coral like it isn't cutting off mid-generation for me; it's in the pre-generation(...

[notwithstanding the fact thinking/generation are functionally the same thing (hence why i keep mentioning max tokens)]

ocean vortex Jun 9, 2025, 1:06 PM

#

alpine coral but it never happens using the legacy arena

so how does it look for you in legacy with same model same prompt? Just the normal response as usual? I only saw cut responses and that error

#

but yeah those might have been different cases

#

I presumed they are using the same max_tokens on new and legacy, but ig not

alpine coral Jun 9, 2025, 1:13 PM

#

yeah i wouldve thought so too but i'm not really sure.. something seems a bit off

keen beacon Jun 9, 2025, 1:14 PM

#

open the network inspector (so it logs the requests), make it small, and when it does that check the request / report it (might be useful). i had cloudflare issues a while back, wasn't obvious until i did that

alpine coral Jun 9, 2025, 1:14 PM

#

ocean vortex so how does it look for you in legacy with same model same prompt? Just the norm...

using side-by-side instead of battle, it doesn't error out the whole interface - instead it just seems stuck (it's been spinning essentially since this msg from you)

ocean vortex Jun 9, 2025, 1:16 PM

#

alpine coral using side-by-side instead of battle, it doesn't error out the whole interface -...

invalid_api_key? catgrin

alpine coral Jun 9, 2025, 1:16 PM

#

that's legacy tbf

#

but yeah

#

if that's what's throwing it on the beta site too lol

ocean vortex Jun 9, 2025, 1:17 PM

#

most likely the same error. It's just that now they are sanitizing/hiding the errors like they should have done since the start LOL

alpine coral Jun 9, 2025, 1:18 PM

#

yeah but previously at least the other model was allowed to generate

#

the current implementation kills both responses [this in the Arena/Battle]

ocean vortex Jun 9, 2025, 1:19 PM

#

alpine coral the current implementation kills both responses [this in the Arena/Battle]

well tbh you shouldn't be able to vote if one of the models can't generate a response. I'm not sure what the best solution would be here...

keen beacon Jun 9, 2025, 1:19 PM

#

they dont count those anyway

ocean vortex Jun 9, 2025, 1:19 PM

#

maybe to just disable voting but allow follow-up to the sole working model

ocean vortex Jun 9, 2025, 1:20 PM

#

keen beacon they dont count those anyway

no point in having those buttons either way though

#

technically the battle can't function with only 1 model, so hard error is not completely unreasonable...

keen beacon Jun 9, 2025, 1:23 PM

#

unrelated thing but i wonder if they fixed the captcha issue from a while back. it wasn't showing the captcha so all my requests didnt work. i only saw in the network inspector that cf was asking for one

ocean vortex Jun 9, 2025, 1:26 PM

#

@alpine coral they could be just stopping generation now. In that case it's an improvement over legacy in a sense that compute is not 'wasted' when the battle can't continue lmao

#

as in if 1 model fails, the generation of another one is manually interrupted at that point
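
(a minimal asyncio sketch of that behaviour, as a guess at the general pattern rather than lmarena's real code: `fake_model` and `run_battle` are hypothetical names, and the sleeps stand in for actual streaming calls)

```python
import asyncio


async def fake_model(name, delay, fail=False):
    """Stand-in for a model call: waits `delay` seconds, then answers or errors."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} errored out")
    return f"{name} response"


async def run_battle(call_a, call_b):
    """Run both models; if either raises, cancel the other and report the error."""
    tasks = [asyncio.create_task(call_a), asyncio.create_task(call_b)]
    # Returns as soon as one task raises, or when both have finished.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    failed = [t for t in done if t.exception() is not None]
    if failed:
        for t in pending:
            t.cancel()  # stop spending compute on a battle that can't be voted on
        await asyncio.gather(*pending, return_exceptions=True)
        return {"ok": False, "error": str(failed[0].exception())}
    return {"ok": True, "responses": [t.result() for t in tasks]}


print(asyncio.run(run_battle(fake_model("A", 0.01, fail=True), fake_model("B", 5))))
```

the tradeoff is the one discussed above: cancelling saves compute, but the user never gets to see the response from the model that was still working.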

alpine coral Jun 9, 2025, 1:38 PM

#

yeah lol that's actually a very fair point!

#

still kinda annoying though.. (like even if the vote doesn't count so it's kinda irrelevant, if one of the models has generated a response, i'd still prefer to see it, like with legacy)

#

cause yeah it's like 2-3 min wait before everything errors out

brittle tiger Jun 9, 2025, 1:41 PM

#

https://x.com/paulgauthier/status/1932068596907495579?t=nFBf0zR3Ma2uwJEYcHd_Rg&s=19

Paul Gauthier (@paulgauthier)

Gemini 2.5 Pro 06-05 has set a new SOTA on the aider polyglot coding benchmark, scoring 83% with 32k thinking tokens.

The default thinking mode, where Gemini self-determines the thinking budget, scored 79%.

Full leaderboard:
https://t.co/mBVaUPGHPl

echo aurora Jun 9, 2025, 1:54 PM

#

cedar tide Any update?

sorry to say I don't have an update for you atm.

ocean vortex Jun 9, 2025, 1:55 PM

#

brittle tiger https://x.com/paulgauthier/status/1932068596907495579?t=nFBf0zR3Ma2uwJEYcHd_Rg&s...

So budget disabled does worse than you enabling it and maxing out?

alpine coral Jun 9, 2025, 1:55 PM

#

apparently yeah

#

wonder if it's a coding thing or generalises

ocean vortex Jun 9, 2025, 1:56 PM

#

this is so weird lmao

alpine coral Jun 9, 2025, 1:56 PM

#

yeah

#

unless 'auto' is meant to be more 'efficient' than 'optimal'

echo aurora Jun 9, 2025, 1:58 PM

#

alpine coral i get errors with (mainly) qwen thinking models when it's a complex query - i du...

thank you for the flag.

thinking models + complex query
yeah I too wonder if it is related to token limit being allocated (or not) for internal reasoning.
this has been raised to the team.

alpine coral Jun 9, 2025, 1:59 PM

#

np thanks for raising 👍

echo aurora Jun 9, 2025, 2:02 PM

#

ocean vortex they should just implement "continue generating" button to show up (no voting bu...

we are looking into including a "stop" button when models get stuck; however, in these cases it seems like they're erroring out instead of getting stuck.

neither red error nor voting as usual on incomplete response are acceptable solutions
yeah that's very valid. this will also be raised

#

this is all really helpful feedback, if yall haven't already applied to our internal feedback program you absolutely should - #announcements message

ocean vortex Jun 9, 2025, 2:05 PM

#

I did use them a ton and this is completely different. Thinking budget we first saw in Claude models and the whole point of it was to limit the thinking. Not extend it

alpine coral Jun 9, 2025, 2:05 PM

#

brittle tiger https://x.com/paulgauthier/status/1932068596907495579?t=nFBf0zR3Ma2uwJEYcHd_Rg&s...

the cost(/tokens used) is only marginally greater with 32k set as the thinking budget

#

more thinking doesn't necessarily translate into more performance

#

like in a brute force sense it can

ocean vortex Jun 9, 2025, 2:06 PM

#

so if budget is not set, the expected outcome is that the model is not limited

keen beacon Jun 9, 2025, 2:06 PM

#

did they change how thinking budget works or there might be an additional prompt instruction aider added? because it used to just cut off at 32k/budget tokens

alpine coral Jun 9, 2025, 2:06 PM

#

alpine coral like in a brute force sense it can

but for some reasoning situations - it can be counterproductive

keen beacon Jun 9, 2025, 2:06 PM

#

that was how they implemented it before (gemini), the model was unaware of the thinking budget

keen beacon Jun 9, 2025, 2:15 PM

#

ocean vortex so if budget is not set, the expected outcome is that the model is not limited

it might be variance unless something changed

placid charm Jun 9, 2025, 2:19 PM

#

echo aurora this is all really helpful feedback, if yall haven't already applied to our inte...

Is there any specific date when the lmarena team will accept people? And is there a chance to not get accepted into this program?

echo aurora Jun 9, 2025, 2:20 PM

#

placid charm Is there any specific date when the lmarena team will accept people? And is ther...

there isn't a specific date I'd be able to share, and yes there is a chance of not being accepted into the program.

placid charm Jun 9, 2025, 2:21 PM

#

echo aurora there isn't a specific date I'd be able to share, and yes there is a chance of n...

alright thank you

ocean vortex Jun 9, 2025, 2:29 PM

#

keen beacon did they change how thinking budget works or there might be an additional prompt...

have you actually managed to make it output 32k?

keen beacon Jun 9, 2025, 2:29 PM

#

ocean vortex have you actually managed to make it output 32k?

im running it again on a prompt i have. it used to do 45k with thinking budget off

#

does the summary model have a context limit lol? ugh ill have to get raw thoughts

ocean vortex Jun 9, 2025, 2:35 PM

#

keen beacon im running it again on a prompt i have. it used to do 45k with thinking budget o...

tbh aistudio is summarizing the thinking which is kinda ambiguous with their token counts

patent aspen Jun 9, 2025, 2:35 PM

#

That's the joke

keen beacon Jun 9, 2025, 2:36 PM

#

ocean vortex tbh aistudio is summarizing the thinking which is kinda ambiguous with their tok...

that was with the previous model (the raw thoughts token amnt). anyway i can get the raw thoughts

patent aspen Jun 9, 2025, 2:36 PM

#

I know. I thought that at first

#

It's just funny

keen beacon Jun 9, 2025, 2:38 PM

#

apple will announce that theyve been acquired by perplexity

patent aspen Jun 9, 2025, 2:39 PM

#

tbh I don't remember the last exciting WWDC

#

Too expensive and not enough developer interest though

#

It has some novel technologies in it

keen beacon Jun 9, 2025, 2:41 PM

#

keen beacon that was with the previous model (the raw thoughts token amnt). anyway i can get...

this model gives up too quick (did ~28k). but i think i can get it to do 32k+

patent aspen Jun 9, 2025, 2:43 PM

#

I'm not even totally sold on glasses much less VR

#

It is

ocean vortex Jun 9, 2025, 2:45 PM

#

keen beacon that was with the previous model (the raw thoughts token amnt). anyway i can get...

was it actually thinking for majority of it at least? I'm trying smth now and it's 30sec thinking + 75sec final response writing lmao

keen beacon Jun 9, 2025, 2:45 PM

#

ocean vortex was it actually thinking for majority of it at least? I'm trying smth now and it...

yes lol

patent aspen Jun 9, 2025, 2:46 PM

#

It's VR and AR but it's limited because you need a clunky headset, so you can't just take it everywhere, so it's unlikely to ever be mass market

#

It's a stepping stone to glasses

#

Glasses have better odds than a headset

#

Too clunky for mass market

#

Same problem as VR

#

The Meta Quest is still the best VR / AR product for most people even after ignoring the price difference

alpine coral Jun 9, 2025, 2:49 PM

#

ocean vortex tbh aistudio is summarizing the thinking which is kinda ambiguous with their tok...

im not sure it's ambiguous - the number of tokens in the summary don't correspond to the number listed in the side panel (which i assume means that is based on the raw thinking tokens)

patent aspen Jun 9, 2025, 2:49 PM

#

Way more software, more comfortable

keen beacon Jun 9, 2025, 2:49 PM

#

alpine coral im not sure it's ambiguous - the number of tokens in the summary don't correspon...

yea the token counter includes the length of thoughts

#

but having the actual raw thoughts allows u to measure it precisely

alpine coral Jun 9, 2025, 2:50 PM

#

yup ofc

#

tho just running a test, with the thinking budget maxed at 32k, it does seem the model [06-05] 'thinks' for longer / uses more tokens during inference

keen beacon Jun 9, 2025, 2:51 PM

#

alpine coral tho just running a test, with the thinking budget maxed at 32k, it does seem the...

maybe they changed it, because with 2.5 flash it didnt affect it in any way and just cut it off straight up. (it can do much more)

alpine coral Jun 9, 2025, 2:52 PM

#

keen beacon maybe they changed it. because with 2.5 flash it didnt affect it in any way and j...

yeah i think they must have tweaked something

ocean vortex Jun 9, 2025, 2:52 PM

#

ok so you can actually move all the steps it does into thinking by only allowing it to respond with the final answer. Just in practice not sure anyone would do this given that most interfaces will summarize it

alpine coral Jun 9, 2025, 2:52 PM

#

before, the slider was meaningless - does seem to have some effect now

keen beacon Jun 9, 2025, 2:52 PM

#

do you have a prompt at 0 temp/top_p 1 where u can replicate this? ill extract the raw thoughts

#

for both

#general

a wise man once said: "if it's free, you are the product"