#general

drifting thorn
#

Is that just opened to someone?

keen beacon
#

it was removed

#

later 2.5 pro revision, or something else, it was good

elder rapids
#

just for it to solve a basic physics problem, it wasn't hard

#

I meant for it, to see how its changed

small haven
#

it will fix o3 pro issues

late path
#

looks like today's openai livestream doesn't have o3pro?

keen beacon
#

maybe tmrw?

small haven
#

nah like seriously, deepthink >>

#

thursday

wicked root
#

Do we have rumors on o3 release?

torn mantle
#

lies

wicked root
#

So gemini’s coming out tmrw

small haven
#

o3 pro on thursday

wicked root
#

Deepthink thursday

sturdy mica
#

guys my pc just broke

wicked root
#

O3 on thurs too?

torn mantle
#

this is crazy

wicked root
#

What time?

small haven
#

i have it alrdy, its ass

wicked root
#

Hmmmmm

wicked root
#

How ass is it?

small haven
#

if ur tight, id say hold onto it for google ultra

patent aspen
#

When people use the term "structured thinking" what are they usually talking about? I only ask because a lot of external AI jargon doesn't map cleanly to more technical AI jargon

keen beacon
#

idk it seems idiosyncratic to me too

small haven
wicked root
small haven
#

o1 pro could spit out 2k lines no placeholder, no omissions, all in full, o3 is maxed out at 500 locs

torn mantle
#

OMG

#

OMGGGGGGGGGGGGGGGGGG

keen beacon
#

OMG

torn mantle
#

nvm

#

bark

#

bard*

small haven
#

and o3 pro times out a lot

torn mantle
#

saw this on reddit

#

kingfall

#

is it good or nah

small haven
#

king falls tmmrw and king will fall further a week later

keen beacon
#

its good

civic flame
torn mantle
#

google releasing products every week whereas xai :

civic flame
#

okay gdm just drop kingfall already 😔

small haven
#

link

#

lol

#

no screenshot either

torn mantle
# small haven this is why

lol.. i mean he probably took a vacation after a long time, i dont doubt that they sleep in the office

small haven
#

o3 pro cots seem to get it

torn mantle
#

but we should seriously start an audit to see what are they doing exactly

small haven
#

so far, its still thinking

torn mantle
#

cheeeeeen

small haven
#

ya

drifting thorn
#

Gotta sleep

small haven
#

o3 pro is ass i been telling

drifting thorn
#

Gn everyone

civic flame
#

so close yet so far

#

☠️

small haven
keen beacon
#

but o3 pro has tools

elder rapids
#

I don't think I would take the extent of deepthink, so inevitably, yeah

small haven
#

yea im in evan chen site, no 3031 answer

#

o3 pro timed out great

#

full cot

#

full cot ^

#

this is from another run

#

and still running

civic flame
small haven
#

yea guessed so

civic flame
small haven
#

they both timed out

keen beacon
#

thanks i needed this

civic flame
small haven
keen beacon
#

nah

small haven
#

it was nearing 15 mins, i guess thats the hard limit

patent bane
#

what's the discord?

small haven
#

deepthink is going to crack it without tools

keen beacon
#

i wonder if theyre gonna switch out the deepthink model

#

if they make further revisions that close the gap as a standalone model w/o parallel requests

keen beacon
elder rapids
#

man

#

I want that model

#

give me that model

#

😭😭

keen beacon
#

deepthink? kingfall? goldmane? everything?

elder rapids
#

everything

#

I want kingfall

civic flame
#

real

misty vault
civic flame
#

there's a decent chance kingfall drops on arena this week to be fair

#

i see goldmane releasing officially tomorrow, and kingfall appearing on the arena on the weekend

#

i arrived at the conclusion that the red teaming platform was just testing existing models this round

#

so nothing good

misty vault
#

leo has access to asi

civic flame
#

yup

civic flame
elder rapids
#

btw

#

kingfall and goldmane are INSANE writers

#

the best by a large margin out of all the models

civic flame
#

what did kingfall write

#

goldmane was decent but not the best in my experience

elder rapids
#

or to "write"

#

I'm asking them to explain things

#

and that gives insight into how they actually write

#

without forcing them

#

no

#

lol

#

not what I said

civic flame
elder rapids
#

that's not true at all

#

you don't ask for a prose or a prompt when you want to see how a good model is in writing

civic flame
#

debatable

elder rapids
#

if it has to create its own context

#

then you're not really allowing it to write, this is the same for every LLM

misty vault
#

sydney fine tune of gpt-4 is best at writing

elder rapids
#

deadass hate when you talk to me like I'm stupid

#

no it has nothing to do with style lmao

#

I'm not asking it a question and paying attention to its wording

#

that's redundant and nonsensical

keen beacon
#

🤣 🤣

elder rapids
#

why y'all deleting so much

keen beacon
#

why not

misty vault
#

he is shy and insecure

#

nah bro sydney is opposite of shy

#

.

elder rapids
#

I just said you shouldn't allow it to create its own context, certain models do do this well like r1 and o3, but models like grok are hard capped, same with claude

#

you know this, too, with 3.6 sonnet

#

it was an excellent writer

#

but it was a different type

civic flame
#

@keen beacon i agree the OG claude models were really really good at writing

#

they felt very unhinged but in a good way

#

i remember when claude was only available for public use using their slack bot

#

it was great

#

let me see if the messages are still there

civic flame
small haven
#

anybody have gemini ultra here

narrow elbow
#

Google pulls a Kingfall before OpenAI's stream 🤪

small haven
#

hahah they thought it was o3 pro, plugged it out when it wasn't

small haven
#

augment code is matching google ultra in price 😭

loud tinsel
#

<@&1349916362595635286> Please add **"Claude 4 Sonnet"** or **"Claude 4 Opus"** to the last website!

Why did you almost completely abandon the old website, when it is better done and uses gradio (which is visually attractive) ?

civic flame
#

i'm not too sure about those last 4 words..

misty vault
#

fr

#

like
we're not talking about sydney

civic flame
#

this is all gone now 😔

small haven
#

@deep adder run this bunx ccusage

#

faster npx alternative

#

?

#

it checks ur claude code metrics lol

#

i just seen that

#

i went through 2 billions tokens insanity

#

how come

keen beacon
#

oh reddit is suing anthropic

#

they have a deal with openai and google

echo aurora
# loud tinsel <@&1349916362595635286> Please, add **"Claude 4 Sonnet"** or** "Claude 4 Opus"**...

I am sorry to hear you're not a fan of the new site. It is a big change, and it's totally fair if you prefer the old website, but we’d love to hear your feedback on the new one if you’re open to sharing. When we changed from the legacy site to the current site we did mention that moving forward all feature updates and improvements will happen on the new LMArena site. We have been seeing a lot of positive signal that the new site is more appealing and have made the decision that going forward this is where our team is going to be focusing.

echo aurora
# small haven any bug bounty?

Not atm but for sure a good idea I'll pass along. We do have report bug form on the site but in terms of a bounty program that's not something we have currently.

#

Did you find something 👀

misty vault
#

yes, remember when models went unavailable while ago? he did that 😔

echo aurora
cedar tide
#

@echo aurora will the webdev arena be integrated into the main site?

loud tinsel
# echo aurora I am sorry to hear you're not a fan of the of the new site. It is a big change, ...

<@&1349916362595635286> I understand your team's decision. However, I believe both websites are complementary, as each has its strengths and weaknesses.

New website:
It features a more modern, visually pleasant, and intuitive interface. It's prettier and easier to use, but also noticeably less complete.

Old website:
While its interface is a bit more "raw" or even cluttered, it's much more complete and packed with features. It might feel more like a "dev tool" than a "power user interface", but its depth is genuinely valuable.

Personally, I especially appreciate being able to tweak settings like temperature or the maximum number of output tokens (up to 4096), and how easily accessible every button or option becomes once you're used to the layout.


Honestly, what I prefer is that "familiar" and "feature-rich" experience. For example, when using tools like Ollama, I tend to choose the command line over a graphical interface.

So I think both websites are great—but just not for the same type of user. Then again, maybe I’m a bit of an edge case 😅

In the end, I prefer the last website 😁

echo aurora
small haven
#

im jk nothing

#

too lazy, but if money is involved, i would

cedar tide
#

@echo aurora Can we have a page with a continuously updated list of models currently on the leaderboard that are not yet in the arena? (mystery models included)

echo aurora
sonic tendon
#

on that note: maybe redarena being integrated into the main site could be cool

echo aurora
elder rapids
small haven
#

the perfect ai is a 1 bit-sized model with superhuman reasoning, 100 trillions tokens of context

unborn ocean
#

it is kind of a weird ideal in many ways and i have some problems with him assuming that this would be the ideal (economically speaking), but it is fiction anyways

#

so who cares

#

on a completely different note: @earnest parcel godaddy says your domain is worth 434 USD 🤑 (dubesor.de)

wicked root
#

guys I broke Gemini

small haven
#

u know the convo been boring when he pulls this shxt

tall summit
small haven
#

its gonna happen but whatever hes talking is at least a decade away lol

misty vault
#

agi

boreal saddle
#

If AGI/ASI is actually developed in this decade like AI 2027 predicts, I will eat my hat.

#

Fair enough.

#

Though the AI 2027 scenario will either be the biggest comedy show of our time, or the most shocking prophetic call ever.

small haven
#

is goldmane still good for thursday?

misty vault
small haven
cedar tide
#

Google didn't lie, the latest 2.5 flash is much more efficient than the old one (being more performant)

echo aurora
leaden palm
#

[moved to ai-news]

small haven
#

huh where is this guy tweet??

#

@patent aspen i thought it was tmmrw

#

oh nvm he tweeted 4 hrs later from this time, last time

sweet frost
#

is stylectrl dead in the new arena?

keen beacon
sweet frost
#

as it should be thanks!

small haven
#

late june

#

as per brian

#

its going thru a safety testing phase rn

#

only trusted users

#

yea its heavily been nerfed compared to the december version

#

wym? u think its gonna be a flop?

#

nah even a month before, its been flaking, was just using o3 since

#

idk i think theyre just following protocol

#

u love finance do u

small haven
#

cool hopefully

#

ive tried months ago, its meh unless u do math heavy things

#

deeperthink is rlly bad

#

o3 patches that tho, let alone o3 pro

#

grok 3 is archaic

elder rapids
#

when is Logan tweeting

#

istg

small haven
#

hahah im also waiting

#

usually tweets at 8pm pst

elder rapids
#

alr so in 2 hours

#

or an hour 50 minutes

small haven
#

a little over that, well for the last tweet

#

pre gemini io

#

goated

#

but where is logan's tweet?

small haven
#

?

#

i dont see it

#

ok bud

leaden palm
#

well google doesn't have it

#

fair enough

#

still

#

burden of proof is on you

small haven
#

logan is knocked out at a gay bar in san francisco

#

all san fran are gay

leaden palm
#

if you dont want to be in the same place where major ai developments are taking place

haughty tangle
#

There’s too many stealth Google models

#

There’s like 20, I can’t keep count of them. Every 4 days I see a new one.

#

Kingfall, Dragontail, Nightwhisper, Dreamtides, Moonhowler, Stargazer, Shadowbrook, Riverhollow, Lunarcall, Moonfall

#

I’m pretty sure that’s all of them

jade egret
#

?

#

whats releasing tomorrow

elder rapids
#

nothing

elder rapids
#

I'll never believe you again

#

it's over Google is never releasing

jade egret
#

huh

#

fr?

#

YO

#

new gemini model?

#

?

#

plz answer : (

echo aurora
#

🤞

jade egret
#

wait a min

elder rapids
#

ts not happening

jade egret
#

i thought gemini 2.5 pro already released

elder rapids
jade egret
#

or thats the preview

elder rapids
#

there are different previews, they change like every month

jade egret
#

o

#

so

#

newer preview

#

it gonna be better right

elder rapids
#

ye, but GA soon

jade egret
#

what is ga

elder rapids
#

GA = general availability

jade egret
#

oh

#

w

#

is

#

gemini 2.5 ultra ever gonna release ; (

elder rapids
#

no

hollow ocean
#

O3 killer tmr?

elder rapids
#

it's not going to be an o3 "killer", o3 has higher compute variants

#

it's just going to be even better than before

#

with that out of the way

#

yeah o3 killer, if tomorrow

wicked root
#

What about gpt5?

elder rapids
#

hope it's even real

wicked root
#

Polymarket has the probability at 90% for a 2025 release

wintry tinsel
#

What are the rumors for king fall

#

Better than opus or nah

wintry tinsel
haughty tangle
#

I don’t know how that compares to Claude 4 Opus

#

Since I haven’t tested it with the same prompt

#

It’s probably just an iteration of 2.5 Pro though

#

I like everything but the robot

#

The background and text is nice, but I personally think Kingfall is better

drifting thorn
#

Where is Kingfall?

#

arghhhh

civic flame
#

THEEERE IT IS

#

WE'RE SO BACK BROS

dusky aurora
#

I witnessed firsthand as Gemini changed versions (in direct chat) back in May

small haven
#

SAM FALLS TODAY

late path
#

goldmane🏆🏆🏆

small haven
#

kingfall definitely won

small haven
#

trust in brian

#

brian is our google insider

#

o1 pro? its not that good, o3 beats it by a mile

#

kingfall is way above o3

#

i mean even o3 in price/performance

#

i doubt kingfall uses 10x

#

its more like $1.25/1m in $10/1m out

#

deepthink will be 10x

#

have u tried kingfall? do u not remember, its cot wasn't that big and actually lesser than 0506

#

prolly i can believe that

#

u right, thats an interesting pov, parallel cot

#

oh yea deepthink i believe will demolish o3 pro

#

i have o3 pro rn on stealth, it aint that great as i thought it'd be

#

its better than o3, but not similar to the jump from o1 to o1 pro

#

all of this competition is so good, bc now we have o4 and o5-mini-high to proc an early release, very very excited for summer

#

hmm, currently if ur basing codex as their winning agentic product, i believe ur wrong, ive tried it and currently claude code >> is the meta

#

operator isn't even that good, even with the o3 upgrade as the base model

#

hmm, have u tried jules?

#

its more like claude > gemini > openai, in my view

#

lately or when it got released

#

it improved a lot

#

codex is garbage too 😭

#

they both garbage

#

but i believe oai is more garbage than gemini

#

😂

#

im using codex-1 on their ui, its rlly bad

#

idk why ppl are hyping it up

#

by perpertuity, how is codex mini > codex-1?

#

no its not the ui, its beautiful, but the output is very bad

#

trust me

#

hmm later today i will, but i think gemini is going to take my time lol

#

its unfortunately google that will win

#

i am surprised

#

but any competition is good

#

dario said benchmarks dont matter anymore, so its gg 😭

late path
#

and $15/$75 pricing💀

small haven
#

gemini at 86% and claude is at 72%, im not using claude 4 for a while

#

yes, but this kind of a jump is too big, im going to be using the good ole' copy and paste until they integrate their pro/ultra sub for virtually unlimited queries like claude max

#

i believe its going to be unlimited on at least gemini ultra via their ui

#

o3 pro is roughly at $100-$150 in/out and they can afford the "unlimited" inference via their ui

#

so why cant google

#

veo 3 is just a whole different arch

#

and its just a gimmick rn, i dont rlly care

#

we will see,

#

gonna get ultra

#

if deepthink is announced

#

but later june im guessing

#

nah gemini is

#

oai never cut under competition

#

ever

#

r1 > o1 mini, what are u saying 😭

#

talking out of ur ass srry 😭

#

oh yea i get what u mean, but they still never cut under competition, but cut after competition cuts

#

never before it

#

they always held a premium

#

they greedy

elder rapids
#

I used to pray for times like these

small haven
#

so i guess we can trust brian

#

my question is why is he doing it 😭

wispy leaf
#

New sota model today?

late path
#

goldmane yes

keen beacon
small haven
#

i think the alpha here is reading all of brian discord messages

#

😭

keen beacon
#

chatgpt is so ass

#

google is sooooooooo much better

#

but their ai helps kill kids

#

but other then that its great

small haven
keen beacon
small haven
keen beacon
#

We’re in the future. AI can now run your web browser, complete tasks, and automate entire workflows– basically, acting like y...

#

i was using this to do my homework

#

its pretty decent

#

go to the end of the video you'll see the chatgpt vs browser use comparison

there really is no difference especially if you have a decent cpu/gpu

elder rapids
# small haven so i guess we can trust brian

sure but I still don't believe hed have so much more access after that tbh, he says he has "connections" but then it's like he's "directly" there, combined with the fact it's already a reasoned inference via Logan saying it'd come early June

ornate agate
#

Who’s Brian?

elder rapids
#

hence the conflicting information that are equally valid (Logan also saying, "in weeks they'll come")

#

and he's not pushing for any of them

#

but ey

#

glad it's Thursday

torn mantle
small haven
#

i was skeptical at first, but damn google insiders are built diff

drifting thorn
#

2.5 Flash feels so dumb to me

#

kingfall

hardy pecan
#

o3-pro and new gemini model tomorrow? that would be a lovely surprise

drifting thorn
#

I need King to fall rn with free API usage

late path
#

I'm curious what else OpenAI has up its sleeve this time to steal Gemini's thunder

#

Last time for 0325 they released GPT-4o imagen and Ghibli on the same day

#

If they only release o3pro today it probably won't be enough

drifting thorn
#

Maybe o3 image gen?

#

Flux Kontext is threatening GPT image 1

#

And Flux Kontext is much faster than GPT's and it doesn't have that yellow filter that GPT always have.

calm sequoia
#

No o3 though

#

Qwen3 30B is just off the charts for its size

tall summit
drifting thorn
#

fxxk

#

the knowledge base in Poe is shxtty

#

I use 2.5 Pro as the base model, and then I asked him about a novel character(which is in the novel) and then it just pops out nonsense

ornate stump
#

why flash 2.5 in gemini app feels stupid today

calm sequoia
#

That's a good observation. But I'm still surprised how good the 4.5 is and how bad is the o4-mini.

#

Hmm I wonder if it means that the train time compute > inference time compute in the end.

#

This says otherwise

#

That's true. The distribution is niche

#

Ahh I'd like to see the 4.5 with PRO level thinking

#

Yeah but still would be interesting experiment to see bench results

#

Indeed

dusky aurora
#

it seems as if LMArena's sampling has become more restrictive

#

less creativity

#

Gemini is much less interesting then

ocean vortex
#

also context awareness ("vibe test") - the smaller the model is, typically the more literally it will take your last message or sentence having less capacity to consider everything else or "read between the lines"

#

SimpleQA 62.5% (gpt4.5) vs 19.3% (o4-mini-high)

boreal saddle
#

You gotta be kidding me.

#

Why does the new website break down all the time?

#

Even discussing perfectly legitimate subjects gets the "Something went wrong" error.

ocean vortex
#

safetycell technology 🤷‍♂️

boreal saddle
#

I never got such errors on the old website.

#

Refreshed the website, and finally, Claude is speaking up.

dusky aurora
# boreal saddle

opus has periods of unavailability, you simply have to wait some time and reroll regularly. it's not related to the prompt

calm sequoia
#

Today I've used the google collab and android studio. They both have "gemini" chat to interact to, however, without model selection or identification. This could explain why they need so many different anonymous models on arena: each has a different use case and may be finetuned for this.

#

Or they simply put in the cheapest gemini flash 😦

boreal saddle
cursive pagoda
#

hey , you guys have some server where people actively post their AI projects and discuss?

raven void
#

I want to join such a server

cursive pagoda
#

crazy

echo aurora
calm sequoia
#

Hey mister "there have never been any nerfs" 🤓 @ocean vortex

late path
#

goldmane will crush it yay

ocean vortex
#

lmao

calm sequoia
#

Vibes says otherwise

ocean vortex
#

?

#

this graph is wrong on so many levels. Not any better than some random thing someone drew

#

Like... don't mind gpt4.1 being above 2.5pro

#

LOL

dusky aurora
calm sequoia
ocean vortex
# calm sequoia

yeah this is accurate. Exactly the same +/- minuscule amount

calm sequoia
drifting thorn
calm sequoia
drifting thorn
#

oh btw claude 4 opus is not in the chart of artificial analysis

ocean vortex
# calm sequoia

yeah but this is completely different to the official leaderboard... Wonder if artificialanalysis messed something up with their testing. Wouldn't be the first time

alpine coral
# calm sequoia This is going to be big in the future, whatever Wolfram does is always excellent...

i dunno.. what it's actually benchmarking is very narrow..

The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language. These exercises have been done online by millions of humans, and we've developed effective tools for determining functional correctness of code, which we're now applying to LLMs.

#

Perhaps translating natural language into wolfram has more generalised value.. and full respect to Wolfram too.. im just not sure it is or will be that useful / meaningful

ocean vortex
#

we already saw it higher on AIDER, webdev arena etc

calm sequoia
#

Why not, if I remember they just showed the ELO based benchmarks during presentation, which are inaccurate

#

Anyway, not important now as the GA is underway

alpine coral
#

i feel like if there's nerfing.. it happens from pre-release (e.g. nebula, goldmane etc) to preview/exp - then GA is just another layer of safety and corporate alignment nerfing

ocean vortex
calm sequoia
#

But how they showed it? I don't remember numerical data

ocean vortex
#

I think they messed smth up with their test suite lol

ocean vortex
calm sequoia
#

So you justify "no-nerf" statement based solely on livebench?

ocean vortex
#

there are more, basically most coding benchmarks show improvement

boreal saddle
#

This is quickly becoming the new "As an AI language model....", for real.

#

What if you wanted to talk to a LLM, but God said: "There was an error."

alpine coral
#

yeah and if the version they release (whether preview or GA) is slightly less performant than goldmane, that would be consistent with what i feel like ive seen in the past

#

they rarely seem stronger compared to pre-release imo anyway

late path
#

like, +30 elo

alpine coral
#

yeah both goldmane and redsword are solid af

#

well.. redsword is no more.. but it was (and i feel they are more the less the same checkpoints)

#

nah i think it's way more fundamental than that

#

they're making good models

#

yeah latest iterations have been incremental

#

but nebula / 2.5 pro was a massive performance step up

#

i'm not sure goldmane is of the same extent - maybe - but it feels quite significant imo / fwiw

late path
#

goldmane will be the next Nebula moment. It's much better than 0506

alpine coral
late path
alpine coral
#

in terms of actual usage - o3 on chatgpt is my go-to for anything involving complexity and / or web search.. it's really something

#

but if i needed to use an API, gem pro 2.5 would be so much smoother

#

and it's such high quality

#

but yeah not o3 chatgpt

#

i mean i just use 4o for most things tbh lol

#

fair enough - i can appreciate that 👍

#

it is a strong model

#

but yeah 4o is just useful for quick stuff - like translating / transcribing, some quick question

#

like i don't always need the 'best' model - actually, aside from research, i rarely use thinking models

#

i do but that's just me ha

#

it's what oai says it is

#

yeah i dont see that as controversial tbh

#

but i guess others find it so

#

and like 4.5 is still 'preview' (i bet whatever is the OG version that wasn't tuned for safety and public release is a beast)

#

but it's being deprecated soon.. ig we'll never see a non-preview version..

tall summit
#

?

#

ok

#

im talkin about most things.

late path
#

It's clear that OpenAI has put a lot of effort into optimizing 4o's chat experience and multi-turn conversations. There's a reason it's now ranked 1st in multi-turn on the arena

alpine coral
#

yeah i agree, though the latest 2.5 flash with thinking is like notably strong imo

#

but still.. different use cases - like thinking gets in the way.. if i'm gonna use it i may as well go all out with o3 or whatever ha

balmy mist
late path
#

waiting for leaderboard updates

cedar tide
#

Nope

balmy mist
#

what is GA?

late path
#

is openai still having another livestream?

fleet lintel
#

First they will put the model in preview mode (different from experiment mode) and if they find no bugs then only it will go to GA

#

Looks like latest model in GA target and will release as a preview model for first few weeks

brittle tiger
#

I got a arc-agi problem through and it was right. O3 and last 2.5 pro get it right about half the time. Not enough time to test thoroughly

fleet lintel
#

I saw it and it's gone again

late path
#

portal color scheme

fleet lintel
#

yeah, it's rolling out. I am probably hitting different server now and flag is not ON yet on new server

brittle tiger
#

I got one through and now in the same chat instance it says model not available

main gulch
golden ocean
#

pls say it fixed coding issues

lime coral
fleet lintel
#

Formatting is better. Looks a bit like chatgpt to me

brittle tiger
#

craig is sundar larping as openai fanboy confirmed

#

1470 ELO

echo aurora
brittle tiger
fleet lintel
#

24 elo gain is good. I was hoping for 30 elo gain but still respectable

brittle tiger
#

30+ point elo difference over Opus is hefty amount of ppl picking it over Opus
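
For a rough sense of what a gap like that means in head-to-head votes, here is a minimal sketch. It assumes the arena score behaves like a standard Elo rating (logistic curve, base 10, 400-point scale) and ignores ties; the function name `expected_win_rate` is made up for illustration, not anything from LMArena.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected share of head-to-head wins for the higher-rated model.

    Assumption: standard Elo logistic curve (base 10, 400-point scale),
    ties ignored. This is an illustration, not LMArena's own code.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# The two gaps mentioned in the chat: the 24-point jump and a 30-point lead.
for gap in (24, 30):
    print(f"{gap:+d} Elo -> {expected_win_rate(gap):.1%} expected win rate")
```

Under those assumptions a 24 to 30 point lead works out to winning a bit over half of the matchups (roughly 53 to 54%), so it's a steady edge rather than a landslide.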

late path
sour spindle
#

Damn may be back on the googtrain again

patent bane
#

we are so back

patent bane
brittle tiger
fleet lintel
patent bane
fleet lintel
#

https://x.com/sundarpichai/status/1930656033237823862

"We also heard your feedback and made improvements to style and the structure of responses. "

No wonder it feels better

Our latest Gemini 2.5 Pro update is now in preview.

It’s better at coding, reasoning, science + math, shows improved performance across key benchmarks (AIDER Polyglot, GPQA, HLE to name a few), and leads @lmarena_ai with a 24pt Elo score jump since the previous version.

We also

brittle tiger
calm sequoia
#

hahahahaha

patent bane
#

close?

golden ocean
#

yea ever since cot became a thing i noticed verbosity went up so high and I never benefited from cot anyway. non cot sota models could already do every one of my tasks but i gotta be honest claude 4 opus thinking kinda beats non thinking opus 4 so im starting to accept cot and claude thinking doesnt have verbosity issues

fleet lintel
late path
#

gemini 2.5 ultra👀

calm sequoia
#

Lol the grok 3.5 lost opportunity for the good release. And now it is Irrelevant 😄

civic flame
#

do you think it's likely kingfall will turn up on arena soon?

wicked root
#

WOOOOOOOT

wicked root
#

GOOGLE SUPREMACY

civic flame
fleet lintel
#

and quite cheap

patent bane
elder rapids
#

SHT

calm sequoia
#

They did reverse as last time

fleet lintel
# calm sequoia

this is what I am looking for.. could you align them better ? 😄

wicked root
#

Polymarket’s going ham

#

Odds of google winning this month jumped 10% in last 2 days

calm sequoia
wicked root
#

I dno man. Im new to the ai bet scene

#

I use gemini for work tho

#

No we’re at 80%

fleet lintel
#

what is the Polymarket link?

civic flame
#

it does, though

#

there'd a toggle in ai atudio

#

studio

#

speaking of AI studio it keeps throwing errors ugh

balmy mist
#

yeah it does

#

i have mine off but it still thinks

#

nvm mine is on lol

civic flame
wicked root
#

Ill take my chances. Ill win as long as non-google or openai models win

#

By the end of June i mean

balmy mist
#

yoooo this is so much better, like it feels better to use

#

which model was this? NW?

#

lol NW literally is a myth at this point, is goldmane better than NW?

balmy mist
#

ahh yeah that makes sense

#

what was this General availability people keep talking about?

#

i have been mia for a bit

#

this is impressive af

#

wasn't every model release GA?

drifting thorn
#

Wow I’m exhilarated

#

Gemini new model

#

Yay

balmy mist
#

isnt o3 pro coming out today as well?

#

or did OA give up?

keen fulcrum
balmy mist
keen fulcrum
#

They are releasing snapshot after snapshot, still no gemini 3
or where is ultra

patent bane
#

people keep talking about how bad the new releases are but I see the potentials, imagine what our children would be using 10 years later

#

i will definitely be fooled by AI

keen fulcrum
#

What about Ultra?
I don't believe Gemini 3 is that late tbh

sour spindle
#

Does anyone know if there is a way to set your defaults in ai studio like i prefer temperature at 0 or do you have to manually change it everytime

cedar tide
#

Does anyone have the benchmarks for version 05-06 (the previous one) please?

patent bane
cedar tide
#

Oh i found this

golden ocean
#

bro definitely wants to be like that

#

but u wanted to be like that either way

#

LOL

keen fulcrum
#

It could be Sunstrike

wicked root
#

Fat cat

misty vault
fleet lintel
#

After seeing releases from Google, OAI and Claude, is there any hope for Microsoft, Amazon or Apple to release a good AI model ever?

wicked root
#

@deep adder do u think OAI can beat gemini?

fleet lintel
wicked root
#

How so?

#

No im asking out of curiosity

#

Why is OAI’s model better?

patent bane
fleet lintel
# wicked root Nah

Amazon might be OK because of CLaude investment and Microsoft because of OAI but Apple is kinda done

wicked root
#

Idc what the crowd wants lol. LMArena ranking is the only thing I care about

fleet lintel
barren prairie
#

Chatgpt is more popular but Gemini is better for me .

wicked root
#

Why? Polymarket only cares about lmarena ranking and I use gemini pro on daily basis without any problem

fleet lintel
#

and Google own around 10% of Claude. I thought difference would be much bigger.

cedar tide
#

06-05 is it goldmane or redsword?

fleet lintel
#

goldmane, confirmed by CEO on twitter

small haven
#

when is kingfall

cedar tide
#

You have the proof from the web dev Arena ?

fleet lintel
barren prairie
#

The kingfall didn t fall today 🙂

small haven
#

is kingfall deepthink parallel cot?

cedar tide
#

Ah found

cedar tide
fleet lintel
#

I have my doubts. It doesn't look that easy to cook something in this area

small haven
#

u missed on some wealth

fleet lintel
#

how is function calling for the latest Gemini model? It used to be bad

late path
#

bought some at 78c😋

small haven
#

brian is talking shh

fleet lintel
#

he has been talking for like 5 min...

#

has to be the biggest reveal ever

small haven
#

when will sam fall?

fleet lintel
small haven
fleet lintel
#

you are going to be fired if you reveal too much

small haven
#

very cool thanks

late path
#

please dont

small haven
#

chillax

quiet folio
small haven
quiet folio
#

word respect detected, reposting messages

misty vault
#

real

#

Imma try

small haven
#

@civic flame gemini hasn't solved it, im feeling grok 3.5

#

its not

#

oh

#

u can set thinking budget

#

im at 8192 default

#

im trying another

fleet lintel
#

drop hints ... not full paragraphs 🙂

tall summit
#

whats fixed

small haven
#

u need to make ur own discord

late path
#

My favorite paragraphs😭

tall summit
#

hoooly

#

fixed 🙀

small haven
quiet folio
#

Bruh those deleted messages are only I think I think I think and other uncertainties

#

Hows that delete worthy

small haven
#

craigbench, oai at 100%, gemini at 0%

misty vault
cedar tide
#

Does anyone have any code examples from 06-05?

fleet lintel
misty vault
#

I'm going to finish my project with claude 4 opus thinking and gemini side by side now

#

Claude 4 been doing good but I want to see if they fixed the gemini coding annoyances

civic flame
leaden sun
small haven
#

still running

fleet lintel
small haven
#

ohhh ffs

civic flame
small haven
civic flame
#

when a model can consistently solve this im calling it agi

#

trust

civic flame
small haven
civic flame
#

i set it to max

#

for that prompt

small haven
#

wow

fleet lintel
quiet folio
patent bane
civic flame
#

There are 2022 users on a social network called Mathbook, and some of them are Mathbook-friends. (On Mathbook, friendship is always mutual and permanent.)

Starting now, Mathbook will only allow a new friendship to be formed between two users if they have at least two friends in common. What is the minimum number of friendships that must already exist so that every user could eventually become friends with every other user?

small haven
civic flame
#

ans = 3031
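
A quick way to sanity-check the upper bound, as a minimal sketch rather than a proof. The construction below is an assumption, not something from the chat: start with a 4-cycle, then attach users two at a time using 3 friendships per pair. It only shows that 3031 friendships can be enough; it says nothing about why fewer cannot work. The names `closure_is_complete` and `construction` are made up for illustration.

```python
from itertools import combinations


def closure_is_complete(n, edges):
    """Apply the 'two friends in common' rule until nothing changes,
    then report whether every pair of users ended up friends."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for u, v in combinations(range(n), 2):
            # A new friendship is allowed only with >= 2 common friends.
            if v not in adj[u] and len(adj[u] & adj[v]) >= 2:
                adj[u].add(v)
                adj[v].add(u)
                changed = True
    return all(len(friends) == n - 1 for friends in adj)


def construction(n):
    """Assumed construction for even n >= 4: a 4-cycle on users 0..3,
    then each new pair (a, a+1) gets edges a-(a+1), a-0 and (a+1)-1.
    This uses 4 + 3 * (n - 4) / 2 = 3n/2 - 2 friendships."""
    assert n >= 4 and n % 2 == 0
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    for a in range(4, n, 2):
        edges += [(a, a + 1), (a, 0), (a + 1, 1)]
    return edges


if __name__ == "__main__":
    for n in (4, 6, 8, 10):
        edges = construction(n)
        print(n, len(edges), closure_is_complete(n, edges))
    # Edge count only for the real problem size (no simulation needed).
    print(len(construction(2022)))
```

The closure check is brute force, so it is only run on small n; for n = 2022 the sketch just counts edges in the construction.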

patent bane
#

what models have solved it?

#

is it tricky?

civic flame
#

nebula (NOT released 2.5 pro for some reason) sometimes got it

#

that's the only model

#

that's because it was cheating lol

tall summit
civic flame
#

tool usage is fair except web search

tall summit
#

so yes.

small haven
civic flame
#

lol

#

vibe-wise i quite like this new 2.5 pro

#

it makes some actually visually pleasing frontends that don't feel like generic slop

fleet lintel
civic flame
#

html + css only

misty vault
tall summit
misty vault
#

is it ass cheeks

small haven
#

u realized deeperthink mode has to search the web 😭

small haven
#

link it then

#

its believable, grok is finetuned to math

civic flame
misty vault
#

fair

elder rapids
small haven
#

ya the fact opus is still at 79% swe, and 0605 at 67.2%, but aider 82% and 72% opus, fishy

civic flame
#

look at the comments on this

#

i honestly did not realise this was a USAMO problem

elder rapids
small haven
civic flame
civic flame
tall summit
tall summit
fleet lintel
#

wow, goldmane is confidently wrong.
I said "answer is 3031. prove it"

And reply was (last line)
"Therefore, despite your suggestion, the mathematical proof confirms that the minimum number of friendships is 4041."

🙂

wicked root
elder rapids
# small haven its really both just code tasks

the initial presupposition of different coding strengths affirms this regardless tho? the fact that they simply ARE good at different coding things in practice, should surely be considered

small haven
#

wow brian left the server

elder rapids
#

he didn't

small haven
elder rapids
#

he's still in my heart

small haven
#

add him

civic flame
#

💔

small haven
#

great

civic flame
#

he probably got clocked

#

rip

civic flame
small haven
quiet folio
#

Is this my fault

small haven
#

wanted to ask when will there be a claude max equivalent for gemini 😭

quiet folio
late path
#

yes

misty vault
#

yes

golden ocean
#

yes

quiet folio
#

yes

tall summit
late path
#

you and that purple guy

#

You drove away our insider

elder rapids
late path
#

🤦

small haven
#

brian is going to come back under a different alt

quiet folio
#

Guys if we all show enough love we can get him to join back

patent bane
elder rapids
#

kingfall svgs >>>>>>

small haven
elder rapids
#

but that's pretty much it imo, the insane nuance in how they speak, the details

#

are there

quiet folio
small haven
quiet folio
#

what treasure trove stash

#

Is he referring to the deleted brian messages

small haven
#

no

quiet folio
#

But my nu des ae private. I'd prefer not to share them here

#

Also because that would be against the rules

small haven
#

oh true

elder rapids
#

btw without thinking budget turned on, is it just how much it wants to think?

civic flame
#

again, there is a non thinking mode

#

iirc it's just only on ai studio

#

whoops

#

yeah nevermind it's just bugged

#

well you can on the frontend it just throws an error

misty vault
#

thats what he meant bro

civic flame
#

yes i know dawg

#

i just didn't realise that because i thought the reason it was throwing an error was it was overloaded initially

patent bane
#

why is mine 24576?

surreal creek
#

Aspiring to be this delusional

small haven
#

yea tbf oai still has the edge rn

#

even if kingfall drops, they'll just drop o4 and o5 mini high

patent bane
#

damn

elder rapids
small haven
elder rapids
#

you'd have to treat Gemini deepthink with the same attitude

#

which to me, is obviously just selective tbh

small haven
late path
#

Ugh, really wish this server didn't have those disrespectful people. I'm going to miss Brian||and his insider information||

small haven
#

o3 pro wasn't supposed to make it

elder rapids
#

it's "technically" bannable but thats nonsensical

#

no reason to report them or do anything about it

quiet folio
#

(No one ever got banned for this unless they were a big name so good luck)

patent bane
quiet folio
#

Anyway but I agree I also deleted the messages and wasn't going to do it again but oh well rip

small haven
patent bane
small haven
patent bane
small haven
#

hes gone

#

wow

misty vault
#

Now let's get brian back

elder rapids
#

what

#

ngl Craig

#

you're kind of

small haven
#

w craiggers

elder rapids
#

mentally lacking

keen fulcrum
#

Its sad Apple didn't join the AI race
They have the resources to do so

torn mantle
#

grok 3.5 update

#

🥰

small haven
torn mantle
#

he said

#

'fixing bot replies'

#

....

#

google just released another model

#

veo 3

#

imagen 4

#

imagen 4 ultra

#

whisk...

echo aurora
#

Sry for the slow response, currently on mobile.

torn mantle
#

meanwhile xai are struggling to release grok 3.5?

#

doesnt make any sense to me

small haven
misty vault
small haven
#

almost a month delayed

misty vault
#

but whatever, in this case it was not an invalid reaction. they got revenge on the guy for making brian leave so that is fair

torn mantle
elder rapids
#

holy

torn mantle
#

thats what im thinking

elder rapids
#

sht

torn mantle
#

HOLY

#

OIMG

#

STOP

#

wait

misty vault
#

nahh you cried lmaoo

elder rapids
#

asura

#

0605 just wrote me

torn mantle
#

OMG

elder rapids
#

an INSANE

#

ESSAY

torn mantle
elder rapids
#

y'all gotta see this

torn mantle
#

you have access?

elder rapids
misty vault
#

gork 3.5 release???

torn mantle
#

oh?

zinc ore
#

So if goldmane is 82% on aider, then which model was 86%?? Was it kingsfall or something else?

torn mantle
#

qwen

#

q w e n

misty vault
#

Idk why everyone hype

torn mantle
#

:p

#

its good to create hype

barren prairie
torn mantle
#

i can send link in dm

#

thanks

#

oh the choice of 'they'

#

i see what you did

#

lol

#

yea

#

you sleep well

#

8h sleep = sharp memory

#

tf

#

no way

#

thats unhealthy

deep adder
#

ppl are weird

torn mantle
#

shrug

fleet lintel
#

That's why we can't have good things. I am quite sure Google disabled thinking because of deepseek

elder rapids
#

how tf do I change chat names on mobile

#

for AI studio

torn mantle
#

its the fastest way and easiest road

fleet lintel
#

too much liability for a big company

elder rapids
#

ye

#

and also, Google = data geeks

#

they want the cleanest data

#

so even if they did, it's not really going to be that meaningful

elder rapids
#

wya

torn mantle
#

im here

torn mantle
#

what secret

#

what

#

its kingfall

#

not fallking

#

i dont have fallking

#

im sorry

#

just go to twitter and type kingfall

small haven
#

wen kingfallfurther?

torn mantle
#

there is a chinese guy who shared the link just recently

#

it works

#

havent checked the code tbh

#

im just using it

small haven
#

wait whats going on

#

i just went to grab coffee

torn mantle
#

what if????????

#

omg

#

stop it

#

imagine...

#

nah its not

#

its just kingfall

#

just random things

late path
#

it works, just like you're accessing api endpoints at google internally

torn mantle
#

stop the cap

#

don't share it pls

#

lemme use it

#

chat is acting weird

#

😮

fleet lintel
#

share please... how do I test otherwise

torn mantle
#

beg

small haven
torn mantle
#

i cant share the link here

small haven
#

ok not gonna login with my google account 😭

torn mantle
#

what

#

its an artifact on aistudio

#

it will be patched?

#

because of @small haven ?

small haven
#

i mean does it match kingfall of yesterday?

small haven
#

wow

torn mantle
#

its kingfall

#

fallking

late path
torn mantle
#

its gemini 3

#

probably first checkpoint of gemini 3

#

because logan said goldmane is the stable gemini 2.5 pro ver

#

its quite decent

small haven
#

whats the context size

torn mantle
#

im happy

#

64k

#

if im not wrong

small haven
#

lets talk about o3 pro

#

where is it?

#

hmm i just checked the model selector

#

im blown away 😭

#

no my oai model selector

#

no

#

i thought it would come out thursday

#

they had everything set up, like why

torn mantle
small haven
torn mantle
small haven
#

for the oai model selector

torn mantle
#

what

#

yes

small haven
#

yes

torn mantle
#

thank me

small haven
#

thank me

torn mantle
small haven
#

thank u

torn mantle
#

yes

small haven
#

what

#

oh right

late path
#

they retained 0506 after this release

#

why not also keep 0325?

golden ocean
#

is 0605 better than 0325

keen fulcrum
#

How does the new version compare to sonnet and opus?

elder rapids
#

btw what's the highest we've seen for the simplebench sample questions?

#

@torn mantle @civic flame

small haven
#

gonna try this

ocean vortex
#

@calm sequoia you happy now?

#

they did the opposite now lmao

#

coding worse in parts, everything else better

small haven
#

is that good? where does o3 and 0506 stand

#

0506 as in 0605