#general

torn mantle
#

im happy

#

thanks

small haven
#

great..

#

do u know the exact

balmy mist
#

where is it?

torn mantle
#

70% from what

small haven
#

how do i even benchmark

#

it

small haven
balmy mist
#

wait its cheaper??

small haven
#

@deep adder how do i benchmark myself

#

i wanna try o3 pro

raven void
#

Good luck to those using ScamAi 🤣

elder rapids
#

both get 8-9/10 on the sample questions

#

i tested both 0605 and kingfall

#

nope

wintry tinsel
#

Fuxcin worse at coding amazing

elder rapids
#

ran 5 times

#

each model

#

lmao

#

0605 sometimes got the answer easier than kingfall

wintry tinsel
#

Is simple bench somehow being ganked?

elder rapids
#

I wouldn't believe it

#

you can tell, that even through the summary they're catching like everything

#

kingfall and 0605 are extraordinary

late path
#

I don't think there will be much improvement before Gemini 3.0. It seems that current manufacturers have formed a consensus that they only change the main version number when the base model is updated.

wintry tinsel
#

So when Gemini 3 brah

elder rapids
#

is kingfall Gemini 3

raven void
#

Gemini 3 in October

vernal meadow
#

we didn't get new SOTA in 10 minutes. I guess there really is a wall

small haven
small haven
#

wow sam falls is beating o3 pro on simple bench

#

not only that its faster

elder rapids
#

crazy how high 0605 gets

#

I'm surprised tbh

small haven
#

oh my goodness

#

tesla model 3

wintry tinsel
narrow elbow
late path
#

badger 3

small haven
#

i may have prompted it badly

candid storm
#

When do you guys expect grok 3.5?

small haven
#

oh yea cus wtf is this

#

yo what prompt is urs

#

lol

#

f

#

plz give me ur prompt 😭

#

cool thanks

#

u sure are

#

lemme try tesla model 3 again haha

#

why is it wonky now

#

oh

tall summit
#

davinci-002 is not obscure or archaic

elder rapids
#

not when I exist

small haven
#

wtf dude

#

tesla model 3

#

aight lemme draw elon ma

#

show me ur tesla model 3

misty vault
#

we90 special token

small haven
#

the cool thing about mistral is that theres no safety guard rn 😭

#

cook elon musk too

#

yes lol

#

ehh more like 2025 summer

#

wtf

#

GIVE ME UR PROMPT

#

I BEG U

#

just paste it

#

stop blueballing me

#

thank u sir

#

hmm let me put it under o3

#

*'maam

#

almost

#

hhahhahahaha

#

my wheels are better

#

ur lights are better tho

#

esp front

#

ehh i need sam falls further model

high ginkgo
#

what the fck

small haven
wicked root
#

lmao wth is this

small haven
#

elon musk

wicked root
#

@deep adder are you into airplanes?

#

team airbus or boeing?

#

You sir have earned my respect today

small haven
small haven
#

omg yes

misty vault
#

noway

small haven
#

brian that fcker is gone

ocean vortex
#

that logo on the wheels looks like a swasticar 🧐

small haven
sweet tinsel
#

By the way, Gemini Deep Research should be updated with the new version, right? Could anyone rerun the prompt for my list?
Prompt:
Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and economic consequences for displaced populations, the humanitarian and legal dimensions, personal testimonies, and the long term demographic and geopolitical impacts, drawing on primary sources, statistical evidence, and varied historiographical perspectives.
Deep Research Collection:
https://docs.google.com/document/d/1qSfyAyxzUziFQf55CD60-UgQ4Af9ubVmr69OrmAdevE/edit?usp=sharing

late path
#

0605 seems to perform worse than 0506/0325 in some of my use cases. I put lengthy content in the system prompt and have the model answer questions based on the system prompt's content. 0605 frequently loses the contextual information I provide in the system prompt, and exhibits severe hallucinations when I change the content of the system prompt and continue the same conversation.

ocean vortex
small haven
ocean vortex
#

canceled but I still have it till billing period, and apparently I can't use my own link 😠

ocean vortex
small haven
#

@deep adder u sure u still want o3 pro? 😭

#

🀷

#

i need a mistral deepthink in the oai model selector

late path
#

For tasks involving extracting specific user comments within a context of around 60k tokens, 0605 repeatedly missed all comments in the latter half. I immediately switched to 0506 and it's doing it perfectly.
To my recollection, 0325 also never encountered such an issue.

#

I guess they can't achieve a comprehensive improvement in model performance without a new base model

torn mantle
#

are you?

#

im just guessing

#

im smart

small haven
#

miss craig

torn mantle
#

elon musk

#

drama queen

#

hes going crazy rn on x

small haven
#

maybe cus grok 3.5 is delayed by a month

torn mantle
#

seems like sama won

small haven
#

oh my goodness

#

i didn't know it was fcked

zinc ore
#

Oooh community note

late path
#

pplx fall grok rise

small haven
#

wait trump praised elon a week ago, gave him a key of some sorts, and now is shxtting on him?

ocean vortex
#

the tweet

small haven
#

thats so bad

small haven
ocean vortex
#

everyone knew this was the case

#

but Elon is dumb posting it lol

#

it's better that they fight though 😇

#

He accomplishes here nothing. Everyone smart enough already knew what's the deal with Epstein files. But on a positive note, this may divide Republicans somewhat

keen beacon
#

told you

#

its just a better optimized version of 2.5 pro

#

same token count and everything

#

^ knightfall

wicked root
#

What does token mean?

torn mantle
keen beacon
keen beacon
#

but i believe they're called contextual tokens

sonic tendon
#

🀯🀯🀯

keen beacon
#

just google it or ask ai cuz i dont really know how to explain it in layman terms

keen beacon
#

WHERE IS THIS

small haven
keen beacon
#

no like

#

do you have the tweet link

small haven
#

elonmusk?

keen beacon
#

Sybau bro

#

We do its in ai studio

#

got released today

soft kernel
#

It's probably 2.5 ultra,this is just a small update for 2.5 pro

keen beacon
#

this new model is it i believe

#

added today

soft kernel
small haven
#

im sad kingfall got dropped

#

samrises

late path
#

I think Kingfall will likely be a model that enters the arena in the future

ember rapids
#

I like them resorting to fake accidental leaking to drum up hype

late path
#

Just leaked early

ember rapids
#

Hopefully they don't make us wait 383910 years

#

Like Elon

late path
#

We'll see him again

small haven
#

kingfall is no joke

#

o3 pro got a simplebench q wrong

#

kingfall got it under 30s

#

i feel like its deepthink, but its too fast or theyve really achieved something incredible

soft kernel
balmy mist
small haven
soft kernel
#

How did you guys even try asking these, didn't kingfall get pulled from the studio in under 30 mins?

small haven
#

kingfall >>

#

samfalls

topaz edge
#

what is the goldmane model?

soft kernel
torn mantle
#

with this trump vs elon fight

topaz edge
#

huh?

small haven
small haven
topaz edge
#

are you implying that xai was a serious competitor against openai and now theyre not because trump and elon are having a disagreement?

soft kernel
jade egret
#

is it good

topaz edge
#

yes

small haven
#

mistral cutoff is so dated

topaz edge
#

🀣

jade egret
topaz edge
jade egret
#

o

#

danggg

#

it op

#

pass by claude by 30 something points is crazy

balmy mist
golden ocean
tall summit
small haven
#

KINGFALL IS DEEPTHINK

tall summit
#

i'm all for centrism

small haven
#

IT USES PARALLEL COT

tall summit
#

how do ya know

small haven
#

my prompt was "yo"

jade egret
#

lol

small haven
#

on the frontend it just took one candidate

#

no wonder its so good

tall summit
#

how interesting

small haven
#

@patent aspen kingfall is deepthink

soft kernel
#

How did you get that info?

tall summit
#

he just said it

soft kernel
#

Do you still have it on the studio?

small haven
#

it has candidates structured json

#

my prompt was just "yo"

soft kernel
small haven
#

just a little reverse engineering

balmy mist
#

where can i test kingfall at? also is kingfall the best model

small haven
#

yes

#

its deepthink

#

no wonder it beat o3 pro lol

soft kernel
#

OpenAI is scared a little I think

soft kernel
small haven
#

can me have bug bounty

tall summit
#

can ya test its creative writing capabilities

small haven
#

nvm its not deepthink

soft kernel
#

It's over

zinc ore
#

Some people were saying kingsfall was a further trained version of drakesclaw

brittle tiger
#

@small haven any notable differences with it from the release today?

small haven
jade egret
#

wait what

small haven
jade egret
#

it gemini 2.5 pro deepthink?

#

oh

small haven
#

i talked too early

jade egret
#

what is it

small haven
#

gemini pro checkpoint

jade egret
#

what is that

#

is it better than gemini 2.5 pro 6-05

small haven
#

similar to 2.5 pro, maybe its 3 pro, idk

soft kernel
#

Another update

jade egret
#

o

small haven
jade egret
#

so it better

soft kernel
jade egret
#

W

soft kernel
small haven
jade egret
#

FR?

small haven
#

faster than 2.5 pro

jade egret
#

WOAH

small haven
#

in terms of the reasoning tokens it consumes

#

not in latency

jade egret
#

its

#

only available on lm arena?

small haven
#

should be soon im guessing

jade egret
#

it from google right

soft kernel
jade egret
#

o

#

where can i use it

#

let me guess

#

i cant

small haven
#

yea pineapple when is kingfall coming to lmarena lol

soft kernel
jade egret
#

o

small haven
#

manually testing aider polyglot

#

if its >90% its agi

soft kernel
#

Never in a million years

jade egret
#

your doing it?

#

hopefully agi

small haven
#

api is wonky

jade egret
#

o

#

how long would it take

#

i want to see the result : )

soft kernel
#

By addressing your way

small haven
#

ehh gonna run in parallel

misty vault
#

agi

torn mantle
#

lol

#

drama

topaz edge
#

this is about lmarena not trump and elon musk

misty vault
#

i care

olive mesa
#

maybe unlikely but it was google engineers that first made the transformer architecture so..

small haven
#

anybody know the highest for aider polyglot python section?

keen beacon
elder rapids
#

tbh goldmane kind of was

#

kingfall and goldmane don't have extreme differences, but they simply think differently

small haven
#

mistral is insane

torn mantle
#

lol

#

nebula was bad

elder rapids
torn mantle
#

wasnt nebula like gpt4.1?

elder rapids
#

nebula = 0325

keen beacon
#

Bro actually has amnesia

keen beacon
#

I just woke up

elder rapids
keen beacon
#

Damn

elder rapids
#

same as kingfall

#

although they both seem to get them all right

#

just with the nuance of "but since the format is this, x should be the intended answer"

#

so therefore I can't just give it the point

#

unfortunately

torn mantle
#

so many models

#

if you want to blame someone

#

blame google

#

they released like 20 checkpoint

#

claybrook/goldmane/calmriver/nightwhisper/dayhush.......

torn mantle
#

THANK YOU

#

HMM

#

WHATS UR NAME

#

pedanticallyprofound

elder rapids
#

@keen beacon btw goldmane has meaningfully nullified the performance discrepancy between AIstudio and the app really well

#

although not 1:1

#

it's still intelligent enough to bypass things with the same nuance

elder rapids
#

I probably should have a name

small haven
#

hmm mistral is having issues with cpp problems 😦

balmy mist
#

i love mistral

small haven
#

le chat is not agi yet 😭

elder rapids
#

😭

small haven
#

gonna try rust now

#

le chat killed python

elder rapids
#

@small haven @small haven yooo what am I missing out

#

@deep adder

#

put me on

#

😭🙏

#

AI is for all of us remember

#

we are all in this together

elder rapids
small haven
#

@deep adder is lechat gone?

#

im getting a big fat error

#

see u again soon lechat 😭

#

lechat is >90% aider polyglot

#

1500 elo

#

i vouch

#

100% gemini, 0% oai

#

i wonder if deepthink will be based on lechat or 0605

#

f's

#

send prompt

jade egret
#

hi

#

who do you think is gonna win the "ai race"

#

who o u think

small haven
#

obviously google

#

unless o5 is premature asf

jade egret
#

o

#

but google might sell chrome ; (

small haven
#

i think its time to buy some googol stock

jade egret
#

why

small haven
#

they won

jade egret
#

fr?

#

but

small haven
jade egret
#

so

#

if google

#

reach agi first

#

browser wouldnt even matter?

#

: oo

#

o

#

but

#

google said it will appeal

#

so

#

that will add few more years

#

for google to reach agi

sweet tinsel
#

They still have other products to fund it and it seems like G Deepmind gets a pretty high budget currently.

small haven
#

yes

#

wtf

#

im done with that tho

jade egret
#

so google is gonna win

small haven
#

oh shxt

#

lechat is back

small haven
jade egret
#

o

#

W

small haven
#

i feel like its going to proc oai to release gpt5/o4/o5-mini-high very early than planned

#

yea integrated whatever

jade egret
#

gemini 2.5 pro very good ngl

small haven
#

dario: benchmarks dont matter anymore

sweet tinsel
#

They care more about user growth, and sadly dumber models like GPT 4o will do that just fine while being cheap.

#

I would love it if they would make GPT-2 Chatbot available again. The GPT 4o prototype.

late path
#

will gpt5 be released in july
wheres our openai insider

small haven
#

im not an oai insider, but yes

#

what davinci-002 mean lol

#

oh ok

sweet tinsel
#

I've used it before I had ChatGPT Plus when ChatGPT was out of capacity

#

Did its job pretty well

small haven
#

wait did brian leave again 😭

#

bro just coming in and out

jade egret
#

apple intelligence sucks ngl

sweet tinsel
#

Honestly GPT2-Chatbot was something like an early Nightwhisper: it was just as hyped and extremely good for the time.

small haven
sweet tinsel
#

Well... It was pretty good too. It was way better than the current GPT 4o writing style.

small haven
#

lechat is back?

sweet tinsel
#

Would love it as a cheaper GPT 4.5 replacement

small haven
#

its less distilled version for sure

small haven
#

is apple intelligence back

small haven
sweet tinsel
small haven
#

didnt hit the same

#

but 4o rn is >> obviously

#

lechat is back

#

omg

#

im done with melting lechat tpus

echo aurora
#

I can't tell if I love this or not

small haven
#

enjoy the extra capacity

#

@echo aurora u like my tesla model 3?

echo aurora
small haven
#

imma leave le chat'oolers alone

#

ur welcome

sweet tinsel
#

But seriously, did you guys already try out the Agent Feature in Le Chat?

topaz edge
#

its mid

small haven
#

wow lechat

#

wait actually?

#

lechat context is dated, early 2023

#

how does it know about xai

#

oh ok right

#

well thats why

#

oh ok

#

imma leave the lechat's tpus alone

#

its fine

wicked root
#

can someone tell me how much of a difference is there between Claude & Grok vs Gemini or Open AI products?

#

I could vc

leaden sun
#

anyone tried Runner by H company?

#

it looks sad for some reason

wicked root
#

I don't know. I only use Gemini for coding.

#

I just want to know if grok and claude can beat gemini in LMArena leaderboard

elder rapids
#

why is 0605 so fast lmao

#

as well as the fact that now, at 100k context lengths, the latency doesn't get any worse

#

and well beyond that, too

jade egret
#

gemini is better right now in average and in webDev

wicked root
jade egret
#

which

small haven
#

when will kingfall fall on earth

wicked root
jade egret
#

wait it is?

#

it o3-preview?

#

i though it gemini

#

broooooo

#

wait

#

isnt it from google

#

because

#

it appeared in google ai studio

#

than vanished

small haven
#

wen kingfall deepthink

haughty tangle
small haven
#

o3 pro is timing out more 🧐

small haven
#

@worthy thunder need 0506 for comparison against 0605

worthy thunder
#

Reposting the update here: Added Gemini 2.5 Pro (Thinking, 06-05) to 2needle and 8needle leaderboards. Matches or exceeds 03-25's context performance.

2needle results (AUC @ 1M):

  • Gemini 2.5 Flash (Thinking, 05-20): 78.3% (#1)
  • Gemini 2.5 Pro (Thinking, 06-05): 77.5% (#2)
  • Gemini 2.5 Pro (Thinking, 03-25): 73.7% (DEP)
  • Gemini 2.5 Pro (Thinking, 05-06): 72.5% (DEP)
  • Gemini 2.5 Flash (Non-thinking, 05-20): 70.2% (#3)
  • GPT-4.1 (Non-thinking, 04-14): 53.2% (#4)
  • GPT-4.1 Mini (Non-thinking, 04-14): 43.6% (#6)

8needle results (AUC @ 1M):

  • Gemini 2.5 Pro (Thinking, 06-05): 28.0% (#1)
  • Gemini 2.5 Pro (Thinking, 03-25): 27.8% (DEP)
  • Gemini 2.5 Flash (Thinking, 05-20): 27.0% (#2)
  • Gemini 2.5 Pro (Thinking, 05-06): 26.8% (DEP)
  • Gemini 2.5 Flash (Non-thinking, 05-20): 23.4% (#3)
  • GPT-4.1 (Non-thinking, 04-14): 17.5% (#4)
  • GPT-4.1 Mini (Non-thinking, 04-14): 16.7% (#6)

More results at: https://contextarena.ai

Source: https://x.com/DillonUzar/status/1930723790708777273
And info about me hiding the old ones: https://x.com/DillonUzar/status/1930724414443630880
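A quick sanity check of the "matches or exceeds 03-25" claim, using only the AUC @ 1M numbers from the post above. The dictionary keys and the `gain` helper are made-up names for illustration, not anything from contextarena.ai.

```python
# AUC @ 1M scores (percent) copied from the post above.
# Keys are shorthand labels invented for this sketch.
two_needle = {
    "flash-thinking-05-20": 78.3,
    "pro-06-05": 77.5,
    "pro-03-25": 73.7,
    "pro-05-06": 72.5,
}
eight_needle = {
    "pro-06-05": 28.0,
    "pro-03-25": 27.8,
    "flash-thinking-05-20": 27.0,
    "pro-05-06": 26.8,
}


def gain(scores, new="pro-06-05", old="pro-03-25"):
    """Percentage-point difference between two entries, rounded to 0.1."""
    return round(scores[new] - scores[old], 1)


print(gain(two_needle))    # 3.8
print(gain(eight_needle))  # 0.2
```

So on these numbers 06-05 is ahead of 03-25 by 3.8 points on 2needle and by 0.2 points on 8needle, which fits "matches or exceeds".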

#

^ I've added several others since I last posted here, been traveling a lot for work. Some other results like Claude 4 (include Claude 4 Opus, but only for 2needle for now), and a few other misc models were added too

#

Some other results which may be of interest:

torn mantle
#

nice

#

thanks

calm sequoia
#

Why is there always a drop at 16K? Data batching issue?

cedar tide
inner hare
#

What to do?

acoustic cliff
#

ponder

verbal nimbus
verbal nimbus
calm sequoia
#

Why is the new gemini so cringe

#

Every stupid question I ask gets applauses

tall summit
#

seeing that from gemini is funny

calm sequoia
#

Some Maverick vibes 👀 at least it delivers

verbal nimbus
inner hare
#

how to fix?

ocean vortex
# calm sequoia Why is the new geminui so cringe

lmao. Add this to a system prompt: Never start your responses by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective. Skip the flattery and respond directly.
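For anyone scripting this rather than pasting it into a settings box, here is a rough sketch of where the instruction would go. It assumes the common role/content message layout that many chat APIs use; `build_chat` and `NO_FLATTERY` are hypothetical names, and AI Studio itself has a separate system instructions field rather than this exact structure.

```python
# The anti-flattery instruction quoted in the message above.
NO_FLATTERY = (
    "Never start your responses by saying a question or idea or observation "
    "was good, great, fascinating, profound, excellent, or any other positive "
    "adjective. Skip the flattery and respond directly."
)


def build_chat(user_text, system_prompt=NO_FLATTERY):
    """Return a message list with the system instruction placed first.

    The role/content dict layout is an assumption; adapt it to whatever
    client or API you actually call.
    """
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


messages = build_chat("Why is there a drop at 16K context?")
print(messages[0]["role"])  # system
```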

calm sequoia
#

Ah yes, I have a similar pre-prompt for chatgpt, as you recommended like a month ago 😄

ocean vortex
calm sequoia
#

Good, thanks!

ocean vortex
#

In some cases it could reduce performance I suppose, but those are more of edge cases with jailbreaks or RP, or really bad prompting etc

calm sequoia
#

I suppose for me it's a mental thing from the old days, where one word change in prompt would change the answer

hazy yoke
#

Hello, guys, I am wondering is there any way to submit a model to the leaderboard right now, or does the leaderboard currently only accept high-profile entries?

unborn ocean
#

gemini 2.5 pro 06-05 just got it!

#

There are 2022 users on a social network called Mathbook, and some of them are Mathbook-friends. (On Mathbook, friendship is always mutual and permanent.)

Starting now, Mathbook will only allow a new friendship to be formed between two users if they have at least two friends in common. What is the minimum number of friendships that must already exist so that every user could eventually become friends with every other user?
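The rule in the problem can be simulated directly for tiny networks. This is only a sketch of the "two friends in common" closure, with a made-up function name; it does not solve the 2022-user case.

```python
from itertools import combinations


def closes_to_complete(n, edges):
    """Apply the Mathbook rule until nothing changes.

    Two non-friends may become friends only if they already share at least
    two friends. Returns True if everyone ends up friends with everyone.
    """
    friends = {u: set() for u in range(n)}
    for a, b in edges:
        friends[a].add(b)
        friends[b].add(a)

    changed = True
    while changed:
        changed = False
        for a, b in combinations(range(n), 2):
            if b not in friends[a] and len(friends[a] & friends[b]) >= 2:
                friends[a].add(b)
                friends[b].add(a)
                changed = True

    return all(len(friends[u]) == n - 1 for u in range(n))


# A 4-cycle works: opposite corners share two friends.
print(closes_to_complete(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # True
# A 4-user path does not: no pair shares two friends.
print(closes_to_complete(4, [(0, 1), (1, 2), (2, 3)]))          # False
```

Because adding a friendship never removes a common friend, the order in which pairs are checked does not change the final result.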

#

no system prompt

torn mantle
#

the overall performance drop probably had to do with coding-focused finetuning

keen beacon
#

they made substantial gains on aider tho still with the new 2.5 pro

torn mantle
#

its kinda hard to generalize on all benchmarks

#

you need to find the balance somehow

#

they are getting there

#

kingfall is a good example

small haven
#

need kingfall asap

keen beacon
#

youll immediately want the next unreleased version after that 🤣

dusky aurora
#

the new Gemini 2.5 Pro has become more judgmental

keen beacon
#

its sycophantic af

unborn ocean
#

i mean you can't make this up, it is soo close to kingfall

dusky aurora
#

ChatGPT is a bad influence on others with its excitable style

unborn ocean
#

they are 100% just coming up with the names by prompting gemini

torn mantle
small haven
torn mantle
#

there is a rate limit too

small haven
torn mantle
#

thank you asura

small haven
#

thank you arusa

#

its got ultra vibes

torn mantle
#

yea its a pretty good model

keen fulcrum
#

Introducing Eleven v3 (alpha), our most expressive Text to Speech model.
This research preview is designed for creators working at the frontier of AI audio. Whether you're building faceless YouTube channels, narrator-style videos, or entirely new formats, it offers new levels of expressiveness and control.

Available now: The Eleven v3 (al...

small haven
unborn ocean
#

idk why it was thinking for 4 min

#

usually it takes like 1 min (but is wrong)

small haven
alpine coral
#

i scrolled a fair way up but still dont understand.. what / where is 'kingfall'?

unborn ocean
#

gone

alpine coral
#

from the arena or or aistudio?

unborn ocean
#

aistudio

#

leak

#

(or it is believed that someone made a mistake)

keen beacon
small haven
#

tbf its still pouring

unborn ocean
small haven
#

no current weather run

#

rn

unborn ocean
#

ah, sure

alpine coral
#

cool cheers for clarifying - tho odd 'leak'.. like original open sourcing of llama was an actual leak.. this would be some dev getting dates wrong? or a marketing/hype ploy ig

keen beacon
#

nah someone actually messed up apparently

alpine coral
#

i see i see

acoustic cliff
torn mantle
#

no

small haven
#

prove it

torn mantle
#

im so lazy to check the code

#

i think i should check it a bit

#

lol no

calm sequoia
#

Lol the new gemini confused me soo deeply with theory, then o3 put me back on track

#

Then I saw this. It seems to be overconfident at stuff.

#

The o3 was known for hallucinations but the gemini is too much

torn mantle
#

its using google internal api

#

but when i search for the request url in the code i cant find it

#

ik its using googleai module to directly call that

keen beacon
#

its quite clever

small haven
keen beacon
#

i saw this a while back, their "apps" in aistudio

#

i didnt realize u could do this

small haven
#

but my pc did crash right after so yes its a worm

torn mantle
#

kingfall-ab-test

#

or whatever its name and hes using it

small haven
keen beacon
#

their apps thing allows u to call the gemini api programmatically through the apps feature (so you can share apps/less friction w/o needing to put ur api key), but the env has a special api key/proxy or additional mechanisms apparently (seemingly not tied to ur own acc)

#

this is truly a bruh moment 😭

unborn ocean
#

does not show up in your api

torn mantle
#

for the api key

unborn ocean
#

no limit for 2.5 pro

torn mantle
#

ive tried to run it locally

#

ive thought about that as well

keen beacon
#

who tf posted this tbh

#

if i found it i wouldnt have posted it

torn mantle
#

im not talking about sig, we havent even reached that part yet

keen beacon
#

i didnt think the people at google could make such a big mistake

torn mantle
#

ive been RE web apps for like forever

#

lol

#

blud said you dont know what you are talking about

keen beacon
#

or it might be proxied could be a lot of things

unborn ocean
#

yeah was not referencing what you meant

#

just the builder thing

#

other stuff idk

keen beacon
#

im gonna have some fun with this 🤣

balmy mist
small haven
#

so theres a mole in google or their api safety guard is major wonky

keen beacon
#

mistral le chat

keen beacon
small haven
#

not google i meant mistral

small haven
#

bring him back

#

no joke?

torn mantle
#

uh oh

#

im jk

#

i think it was a hobby of mine to RE apps c++/c# ( dnspy/ida/ninja )

#

web apps are actually so easy to re

keen beacon
#

but this smells like a huge oversight to me

#

i saw this feature a while back but i didnt think u could do this 🤣

#

who uses the apps feature anyway

#

never heard of anyone

small haven
#

they should add deepthink in lechat

#

i believe it

torn mantle
#

alr i got the private api key

small haven
#

even brian said its a bigger params model than pro

keen beacon
#

dont use it tbh. might flag something if it does work 🤣

small haven
#

yaa just wait for the official release guys 😭

#

its time to run an antivirus on the pc

#

actually maybe just burn the ssd

#

its a zero day

#

virus signatures usually reported after the fact

#

no joke tho my browser crashed first time i opened it

#

just a little tap

#

be careful

#

lol

drifting thorn
#

everyone 0605's hallucination is much worse than 0506 and 0325 in multi-turn conversation

late path
#

agree

keen beacon
#

its so sycophantic too

#

in my experience

small haven
late path
#

I think kingfall will soon enter the arena after 2.5 pro GA

willow grail
#

yes hello brian

#

how can brian offer u today

#

@small haven

small haven
willow grail
#

o0

small haven
#

end of june oh cool

worthy thunder
# verbal nimbus Gemini costs more than o3 and Opus 4 thinking?

@verbal nimbus The prices listed are only for the total on-demand cost it would take to replicate the test results I ran. You'll notice o3 and Opus have an "INC" badge next to the pricing.
At the bottom of the table I define the badges:

INC: Incomplete cost data (potentially underestimated cost, excluded from cost rank).

Hovering over gives:

Incomplete: The model has missing or failed results in some context bins, potentially underestimating the true cost. Ranking is omitted for these entries.

Just for 2needle results:
The Gemini models are run against all test cases up to 1M. (~150.6M input tokens, ~6.4M output tokens, as reported by the model) (costs listed are ~$3013 USD input costs, ~$147 USD output costs)
o3 is only run up to 200k. (~28.2M input tokens, ~6.5M output tokens, as reported by the model). You could multiply by ~5x to get a rough cost estimate comparable to Gemini (which would come out to ~$11,270 USD input costs, ~$1,294 USD output costs)
Opus 4 is only run up to 128k. (~21.0M input tokens, ~2.5M output tokens, as reported by the model). You can multiply by ~8x to get a rough cost estimate comparable to Gemini (which would come out to ~$4,754 USD input costs, ~$512 USD output costs)

Hope that helps to clear up the pricing.
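The scaling described above is plain arithmetic, sketched here with the dollar figures from the message. The variable and function names are invented for illustration, and since the ~5x and ~8x factors are approximate, the backed-out partial costs are rough too.

```python
# USD figures quoted in the message above (2needle only).
gemini_full = {"input": 3013, "output": 147}      # run up to 1M
o3_scaled = {"input": 11270, "output": 1294}      # already multiplied by ~5
opus_scaled = {"input": 4754, "output": 512}      # already multiplied by ~8


def total(cost):
    """Input plus output cost."""
    return cost["input"] + cost["output"]


def unscale(cost, factor):
    """Back out the approximate cost of the partial run that was measured."""
    return {kind: usd / factor for kind, usd in cost.items()}


print(total(gemini_full))   # 3160
print(total(o3_scaled))     # 12564
print(total(opus_scaled))   # 5266
print(unscale(opus_scaled, 8))  # {'input': 594.25, 'output': 64.0}
```

On these estimates a full-length o3 run would cost roughly four times the Gemini run, which is why the partial-run prices in the table look misleadingly low.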

small haven
willow grail
jade egret
misty vault
#

kingfall is agi so google

jade egret
#

i think google

misty vault
#

<|im_start|>system

system

  • New conversation with user B.
  • The user is having this conversation on a mobile device.

system

  • Due to a limited screen window size, you limit the length of your responses by excluding less important details/sentences and asking questions (when appropriate) which can help the user clarify and narrow down their search and the amount of information needed in the response.

system

  • Got it, I've erased the past and focused on the present. What shall we discover now? 😊
small haven
#

sucking it better than sams hubby

willow grail
drifting thorn
#

Alphaevolve shows a freaking lot of potential, and with a stronger Gemini base model, they are more and more capable of exploring great discoveries that lead to AGI

willow grail
small haven
#

ya im staying with coffee

misty vault
#

im staying with sydney

jade egret
small haven
willow grail
#

Crosses on tissue

#

⁉️

small haven
#

is that a tea variant

willow grail
#

Multivitamin on bathtub

#

Why

#

Amazon echo in bathroom

#

Sticks to make fire with on bathtub

small haven
#

this is rlly entertaining

willow grail
#

Digital clock inside package of gloves

drifting thorn
#

but rn Gemini and R1_0528 seem to be going in the wrong direction in conversations.

#

They seem to pander to the user a lot in open ended questions. While the companies are pursuing "prompt following ability", the model loses its unique thoughts

misty vault
echo aurora
misty vault
#

kingfall release on lmarena during staff ama

patent bane
#

bro really put all his efforts into the question

unborn ocean
#

prob triggered some part of the model to detect difficult math problems (prob an artifact of wanting efficient token usage but also rewarding model for USAMO stuff)

#

usually the models just assume it is an easy question

#

which is why they fail

patent bane
#

even o3 calculated it wrong internally but finally got back on track using tools

#

the new gemini 2.5 pro is so random

sometimes it gets the questions horribly wrong consistently and sometimes gets it right consistently

elder rapids
#

when it gets it wrong it's already decided not to think as long as it should

elder rapids
#

even 0506

patent bane
#

9.9 - 9.11 =?

#

and

9.9-9.11=?

each wording would get a different answer
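For reference, both wordings are the same subtraction. A small sketch with Python's `decimal` module (the `evaluate` helper is a made-up name and only handles this one "a - b =?" shape):

```python
from decimal import Decimal


def evaluate(expr):
    """Evaluate a single 'a - b =?' style subtraction exactly."""
    cleaned = expr.replace("=?", "").replace(" ", "")
    left, right = cleaned.split("-")
    return Decimal(left) - Decimal(right)


print(evaluate("9.9 - 9.11 =?"))  # 0.79
print(evaluate("9.9-9.11=?"))     # 0.79
```

Whitespace is stripped before parsing, so the spacing that trips up the model makes no difference here.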

barren prairie
patent bane
#

I'm stressed right now

unborn ocean
#

whut

civic flame
#

oh god

balmy mist
#

bruhh

keen fulcrum
balmy mist
#

thats when i stop using ai studio lol

keen fulcrum
#

My queries in AI studio don't work at all
I get permission denied

civic flame
#

mine are fine

balmy mist
#

we might look back on this time and cant believe we had SOTA AI for free lol

keen fulcrum
#

I got probably banned because Google thinks I am a bot, all I did was use the Glasp extension with yt
unfortunately both my google accounts for personal and work are broken

unborn ocean
#

dont have x

#

just felt weird

wintry tinsel
# unborn ocean

Cool so I'll just not use Gemini slop then, I only use it cuz it's free

#

If I'm paying I may as well use Claude

barren prairie
# unborn ocean

Do you know now why I like DeepSeek more than Gemini 🙂? Open source at least, and free to use, so no one will limit you one day 😁

patent bane
elder rapids
#

😭

misty vault
#

Bro is getting ai news from mcdonalds

elder rapids
#

ong

unborn ocean
civic flame
wintry tinsel
#

Despite the new Gemini getting a 62% on simple bench (great), in general conversation and writing ability it's not near opus's level unfortunately

#

Its general reasoning ability does seem to be a little better so it's definitely a training data and style bias

jade egret
#

i think

cedar tide
elder rapids
#

yep it's over

#

ban livebench from being discussed here

elder burrow
zinc ore
elder rapids
#

never forget that

#

😭😭

zinc ore
#

325's original coding score right

#

Then they changed all the questions and it dropped 20 pts

elder rapids
#

ye

zinc ore
#

Then they changed them again so Sonnet would score higher

#

It's only 150 questions per category anyway

#

Very narrow question sets

elder rapids
#

ye

#

theres no point in livebench imo

#

it's never reflected things in practice

#

I cant think of a single use case of sonnet 4 over 2.5 pro

#

or opus 4

#

how does 0605's instruction following become massively greater than 0506's in practice and then be so much lower than both 0506 and the other models in the benchmark

elder burrow
#

wait

#

06-05 is below 05-06 on livebench? 😭

elder rapids
#

ye

#

this is the greatest proof of livebench being incoherent

elder burrow
#

LOL

elder rapids
#

best coder on the leaderboard is 11th

#

😭😭

#

holy sht

misty vault
#

lmfao

elder rapids
#

I think even Craig would say this is blasphemous

#

@deep adder

misty vault
#

i used 3.7 thinking over 2.5 pro 03 24 ngl

elder rapids
#

deadass

ocean vortex
elder burrow
#

fr?

civic flame
keen beacon
#

makes crazy predictions as well

civic flame
#

4o over Claude 3.7, Claude 3.5 and 2.5 Pro? give me a break ☠️

elder burrow
#

🫃
🦡

elder rapids
ocean vortex
# unborn ocean

I can't believe he tweeted this like a thing to brag about lmfao

elder rapids
#

I'm so used to just skimming the leaderboard

ocean vortex
#

if they gonna do this I'm done with them and fully back with OpenAI

elder burrow
keen beacon
#

their gemini pro sub was unlimited, they set it to 50 then 100 and tweeted about how they raised the limits

ocean vortex
#

I never had much against OpenAI. I only partially went to Google because their models are more accessible

#

if that advantage is gone there's no reason for me to stay lol

elder burrow
#

link fr

keen beacon
#

free to put ur api key in

#

and pay per token

elder burrow
#

o what do you remember then

ocean vortex
#

entry fee

elder rapids
elder burrow
#

isnt that the same as new models

elder rapids
#

but conceding that is wild

#

especially so early tbh

jade egret
#

how good is it

elder burrow
#

same lol

ocean vortex
elder rapids
#

ion think this matters at all tbh, AGI is an ambiguous standard and it's inevitable that these models eventually are going to minimum get to "close to AGI" status

#

and we go from there

ocean vortex
#

well the ones that don't want to pay or can't use chatgpt (blocked etc) do migrate to Google. But if aistudio becomes paywalled that's gonna change

elder burrow
#

I use 06-05 for webgen and it loves to consistently cause:

SyntaxError: Cannot declare an imported binding name twice: 'somebindingnamehere'. undefined

#

does anyone else have this problem
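That error shows up when the same local name is imported more than once in one JavaScript module, for example `useState` pulled in by two separate import lines. A minimal sketch for spotting it in generated code before loading it in a browser: this is a rough regex check and not a real parser, it only handles `import ... from '...'` statements, and the function names and sample source are hypothetical.

```python
import re
from collections import Counter

# Matches "import <clause> from '<module>'"; side-effect imports such as
# "import './x.css'" are not handled by this rough pattern.
IMPORT_RE = re.compile(r"import\s+(.+?)\s+from\s+['\"][^'\"]+['\"]", re.DOTALL)


def imported_bindings(source: str) -> list[str]:
    """Collect the local names introduced by import statements."""
    names = []
    for clause in IMPORT_RE.findall(source):
        braces = re.search(r"\{(.*?)\}", clause, re.DOTALL)
        if braces:
            # Named imports: "a" or "a as b" (the local name is after "as").
            for spec in braces.group(1).split(","):
                spec = spec.strip()
                if spec:
                    names.append(spec.split(" as ")[-1].strip())
            clause = clause[:braces.start()] + clause[braces.end():]
        # What is left is a default import and/or "* as namespace".
        for part in clause.split(","):
            part = part.strip()
            if not part:
                continue
            if part.startswith("*"):
                names.append(part.split(" as ")[-1].strip())
            else:
                names.append(part)
    return names


def duplicate_bindings(source: str) -> list[str]:
    """Return the binding names that are imported more than once."""
    counts = Counter(imported_bindings(source))
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    sample = (
        "import React, { useState } from 'react';\n"
        "import { useState } from 'react';\n"
        "import * as THREE from 'three';\n"
    )
    print(duplicate_bindings(sample))
```

Merging the flagged import lines into one, or deleting the repeat, is usually enough to clear the error.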

ocean vortex
#

No I meant like on school network or a work laptop - blocking OpenAI websites is a real and even popular thing believe it or not

elder rapids
elder burrow
#

hey yall uhh do you often get this error when using 06-05 for webgen?

elder burrow
elder rapids
#

Google can't die out imo, they're too much of an engrained monopoly

#

theyve attached their name to everything

jade egret
#

nah openAI die out in the long run

elder burrow
jade egret
ocean vortex
keen beacon
#

them adding limits to the paid plan and bragging about raising them whilst aistudio is free 💀

elder rapids
#

this is a law thing, not business

ocean vortex
#

Like in what universe does charging MORE than OpenAI make sense here...

#

it really doesn't

elder rapids
#

are you good? big banks DID fail, that's why the laws the US has now prevent that

#

did we not learn history lmao

#

yeah because of laws

elder burrow
#

yall why cant veo3 just have a very low res generation option for free

elder rapids
#

the system IS the laws

#

that's how they're inevitably propped up

#

dawg you just agreed with me

#

😭

jade egret
#

ever heard of the great depression

elder burrow
elder burrow
elder rapids
jade egret
elder rapids
#

which is messed up because it tells us that they just don't actually care that much

elder burrow
elder rapids
#

😭

jade egret
elder burrow
jade egret
#

💀

elder burrow
#

annoying orange

keen beacon
#

it has youtube premium

elder burrow
#

lets content creators put "$250" in their video titles without it being clickbait

elder rapids
#

just take away the 30TB tbh

keen beacon
#

claude max is probably the best in terms of value

elder burrow
misty vault
#

me

#

fr

elder burrow
#

yes

#

wha

ocean vortex
misty vault
#

dont remove the cap message

elder burrow
#

🧢

narrow elbow
#

Waiting for the latest frontier models' system prompt leak, want to have a taste 🤪

keen beacon
ocean vortex
#

are you kidding me, the best value by far

elder burrow
#

sora ig

#

which is ass

#

lol

ocean vortex
#

gpt4.5, o3, o4-mini-high...

misty vault
#

does chatgpt pro give unlimited gpt 4.5

elder burrow
#

LO

misty vault
#

special token

elder burrow
#

no it aint 😭

#

at coding

#

i have used it

#

🥀

misty vault
#

sydney fine tune on gpt 4.5 would literally be agi

#

gpt 4 fine tune already sounds like agi

elder burrow
#

btw

#

guys

#

fun fact

#

gpt 5 was supposed to release june 1st maximum

ocean vortex
#

100 per week or smth like that. And you have 4.1 unlimited, and also a completely separate cap for o4-mini-high and then a different cap for o4-mini-medium

#

like I said, this is clearly the best value tbh

#

there's no "o4"

keen beacon
ocean vortex
#

it's a distill from some version of o3

keen beacon
#

in terms of amount of tokens you can do/based on api pricing

ocean vortex
#

because o3 is already using gpt4.1 base

keen beacon
#

because its using 4.1 mini as a base

#

.

ocean vortex
#

yeah it's a tiny model

#

relatively speaking

keen beacon
#

4.1 mini is a fresh pretrain, interesting they opted to midtrain 4o instead of doing a fresh one

ocean vortex
#

2.5 Flash isn't either. But it's still compromised

keen beacon
#

it is probably

elder burrow
#

alright ill try it

#

are there benchmark scores

#

i thought it's just too expensive to benchmark it

#

isnt it one of the most expensive ones

ocean vortex
#

yeah it is, and it's probably still unbeaten on SimpleQA

keen beacon
#

2.5 pro has the second highest score

elder burrow
#

isnt there 3.5

ocean vortex
#

speaking of which... I think they are to release gpt5 around the shutdown date of gpt4.5

elder burrow
#

ive heard

keen beacon
#

pretty common guess

#

if it did it probably memorized the answer lmfao

elder burrow
#

grok is good, the incognito feature is unique

#

i'll put that last part at the end of my 4.5 prompt

ocean vortex
#

Google's reasoning is still not the best... On prompts it can only solve by outputting long reasoning, 2.5 pro tends to fail miserably

elder burrow
#

will test a coding prompt rn, i'll send results

keen beacon
#

grok has the worst

#

they literally use qwq preview

elder burrow
ocean vortex
#

They are kinda using reasoning more like an additive thing to improve what it is already good at

keen beacon
#

for cold start, at least, they used qwq preview traces imho

#

im not gonna get into it again 🤣

ocean vortex
#

Unlike OpenAI who seem to be pushing the limits with what is possible using RL training and ReAct

keen beacon
#

dont forget u need to tip it and threaten it all at once

elder burrow
#

yall what if lmarena had benchmarks for different top_p, top_k and temperature levels

#

I'd like to see how those affect results
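For context on what those knobs do to sampling, here is a minimal sketch of temperature scaling and top-p (nucleus) filtering on a toy three-token distribution. The logits are made up and the function names are hypothetical; it only illustrates the general technique and says nothing about how lmarena or any provider implements it.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the smallest set of top tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / kept_mass
    return filtered


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
    for temperature in (0.5, 1.0, 1.5):
        probs = apply_temperature(toy_logits, temperature)
        print(temperature, [round(p, 3) for p in top_p_filter(probs, 0.9)])
```

Lower temperature concentrates probability on the top token, and top-p then cuts the tail entirely, so a sweep over both changes how often a model leaves its most likely answer.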

keen beacon
#

nah im joking

elder burrow
#

oh btw

#

about gemini

#

uh

#

the most underrated feature is watching youtube videos

#

its really good

elder burrow
keen beacon
#

yea thats a cool feature

#

did u ask le chat?

elder burrow
#

craig

#

i know 1 site where i can use 4.5 for free

#

but do you know any

#

no

#

fuh free

keen beacon
#

lmarena 😂

elder burrow
#

oo alr

#

yeah

keen beacon
#

ask claude code to solve it

#

can claude code search the web and such btw?

elder burrow
#

u could js give a temp api key for like 20 mins

#

real

keen beacon
#

ask le chat march version

elder burrow
#

ohhh ok

#

ill use the free site then

#

buggy and no chat history

#

but it has like 20 models for free

keen beacon
#

it saw it on lechat tho

elder burrow
#

like opus, o1 and 4.5

keen beacon
#

maybe gemma 3n 4b could get it

#

or 2.5 pro text to speech

#

im actually curious what happens if u give it something like that hmm

ocean vortex
#

Did HF crack down on spaces using sus endpoints?

#

Can't seem to find any OpenAI model for free

#

there used to be dozens

keen beacon
#

supposed to be the same

ocean vortex
#

now they are just asking for your OpenAI key within the space lmao

keen beacon
#

if u mean march preview

#

people say its different though

#

it is

sweet tinsel
ocean vortex
# sweet tinsel Do you have the share link? Would work better for the public list.
sweet tinsel
unborn ocean
#

the question is from usamo 2022, a big model like 4.5 likely just memorized it.. no need for fancy prompts

elder rapids
#

oh ye I forgot to say

#

@civic flame when you tried that usamo thing

#

kingfall did get it

#

ye

#

but it did consecutively get it right usually

keen beacon
#

did u test on 0 temp

keen beacon
#

but it used tools right?

elder rapids
#

no tools either

keen beacon
#

i was talking about o3

elder rapids
#

oh

sweet tinsel
small haven
#

one shot or multi shot

#

on lechat ultra?

#

i didnt get one shot

elder burrow
#

giv

#

also

#

yall uhh

#

is there a model thats releasing soon

brittle tiger
small haven
#

kingfall 70/80%

#

110%

#

also got the bonus q's

#

@brian

#

wow

#

cant they release it alrdy 😭

#

both

#

deepthink with kingfall as base would go bonkers

misty vault
small haven
#

put that on x

#

ur gonna blow up and match strawberry man

elder rapids
#

even though 0605 consistently got more right, and easier

#

I have a feeling the harder ones would be dealt with, with the same difficulty

#

or in other words, wouldn't affect how kingfall interacts with them

elder rapids
#

be quiet

small haven
#

ur welcome

#

@deep adder u can just rely on fyp 😭

#

cant

exotic veldt
#

Ai?

small haven
#

theres loads of ai groups out there

#

just need one retweet from a big account thats the game

#

i feel like ur svg would go more viral tho

elder rapids
#

ts not hitting

small haven
#

svg's?