#general

storm needle
#

you already said that o3 would be slow and that's not true

storm needle
#

.

misty vault
small haven
#

theres a big problem using rust in codex, everytime u ask something it has to recompile all the crates from scratch, takes a good 5 mins before it can even start (well it depends on some project cargo)

willow grail
patent aspen
# storm needle .

Oh sorry I see it now. At that time, I was speculating based on the FrontierMath benchmark story wherein they ran their models for hours for one problem

willow grail
#

anyone has ideas how to only pay for a tramway ticket when i see inspectors entering the tramway i am in?

patent aspen
# storm needle .

I also rarely have a complete up-to-date picture of what is going on inside OAI. I do know people in the know and hear things from time to time

unborn ocean
#

i believe that was common 'knowledge' among the people speculating

#

on what?

#

nvm (read the context)

willow grail
#

thats not a thing

#

no

barren prairie
misty vault
misty vault
willow grail
#

send me try me

willow grail
#

wait so its not on bing?

blazing rune
#

4B should be faster

#

But if you do 4B, do Q6_K

#

Oh, if you have 6gb of VRAM, q5_k_m might be better
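
A rough way to sanity-check the quant advice above is to estimate how much memory the weights alone take. This is only a sketch: the bits-per-weight figures are approximate values assumed for illustration, the 1.5 GiB overhead for KV cache and runtime buffers is a guess, and `weight_gib` and `fits` are hypothetical helper names.

```python
# Back-of-the-envelope weight memory for a quantized dense model.
# ASSUMPTION: bits-per-weight values are approximate, not measured.
APPROX_BITS_PER_WEIGHT = {
    "Q6_K": 6.56,
    "Q5_K_M": 5.7,
}


def weight_gib(n_params: float, quant: str) -> float:
    """Approximate size of the weights in GiB for a given quant type."""
    total_bits = n_params * APPROX_BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


def fits(n_params: float, quant: str, vram_gib: float,
         overhead_gib: float = 1.5) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM."""
    return weight_gib(n_params, quant) + overhead_gib <= vram_gib


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        size = weight_gib(4e9, quant)
        print(f"4B @ {quant}: ~{size:.2f} GiB weights, "
              f"fits in 6 GiB: {fits(4e9, quant, 6.0)}")
```

Under these assumptions both quants fit on a 6 GB card, and the smaller one simply leaves more headroom for context.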

willow grail
#

@misty vault i cant find that model on copilot

blazing rune
#

That should be at least 20 tokens per second if you set it up correctly (the 4b)

patent aspen
#

lmao this is an on-device model for mobile

blazing rune
#

I'm talking about a 4B dense model on his 6GB laptop GPU

blazing rune
dapper storm
#

will o3 pro be in the arena

patent aspen
blazing rune
#

If it actually had something to do with how fun or relatable a model is, that would be fine. But it is just using stupid phrases

willow grail
#

shitification?

blazing rune
#

And more importantly, it depends on the formatting and how many headings are in the answer

misty vault
willow grail
patent aspen
#

Yeah it seems to be all style control

willow grail
#

doesnt say gpt4 preview

misty vault
#

the non preview gpt-4 also needs no jailbreak

#

Bro some guy literally sent in this chat once

#

It is already supposed to

torn mantle
#

so

misty vault
#

the original websockets lasted for a year

#

they died 1 month ago

#

but theres another one

torn mantle
#

what product are you using after the keynote?

misty vault
#

they forgor or something

#

Sydney

misty vault
torn mantle
#

i really think flowith is powerful

#

i have yet to harness its full power

misty vault
#

I will suck **** for microsoft and openai to give me the fine tuned gpt-4 they use for bing

patent aspen
#

Ew Bing

misty vault
misty vault
cedar tide
#

New gemini 2.5 flash vs old

golden ocean
storm needle
elder rapids
elder rapids
#

I've been using neo AFTER chats with AI

#

which builds its "prompt" essentially

#

and then stemming from those context threads, it goes through everything and plans it extraordinarily well

elder rapids
#

wasn't directly the best instruction follower imo, the reason why it appeared that way was because it inferred (correctly) based off of a deep implicit understanding like if you meant to emphasize something, it'd grasp what parts to prioritize. But granular details, or in this case implicature based counterfactual speech, it didn't comprehend at all

#

its inference made it so easy to talk to ye

vivid sandal
#

is this happening only to me, or maybe there's a maintenance of sorts...?

elder rapids
#

man I think I'm starting to like this AI trend

small haven
#

so wen is exactly o3 pro

elder rapids
#

Yo why is neo using Gemini

#

LMFAO

#

ITS USING GEMINI OVERVIEW FROM THE BROWSER TO CONDUCT ITS RESEARCH

#

LMAOOOOO

elder rapids
#

dawg o3 is so bad dude it's insane

#

ts can't understand a WORD I'm saying

#

holy

#

I'm genuinely getting mad

raven void
#

Looks like nightwhisper was Gemini deep think 🤔

The performance is not ultra level I see why they haven't released it

#

O3 pro will cook it if so

iron cipher
#

Remember when ChatGPT didn’t save chat history

#

or when Gemini was named Bard?

small haven
small haven
#

oh shxt got jules

#

gonna compare it to codex

#

aight back to codex...

alpine coral
#

same here (lines up with my actual usage very neatly and consistently). it's a great benchmark imo

keen fulcrum
#

Best AI sub

keen beacon
#

I'm really missing the raw thoughts of Gemini 🥲 (you can get around it but you might be degrading perf)

torn mantle
#

"Claude 4 is here" - "Try Claude Sonnet 4 and Claude Opus 4 today"

"Try Claude Sonnet 4 or Claude Opus 4 for Anthropic’s smartest models yet."

"Not intended for production use. Subject to strict rate limits"

"show_raw_thinking" / "show_raw_thinking_mechanism"

(not available

#

"subject to strict limits"

#

Whats with these labs lately

elder rapids
#

everything reminds me of her

elder rapids
#

or at least context recall

misty vault
elder rapids
#

watch them be mid asf

pallid crypt
#

if it was they would be bragging about their ARC AGI score alr like OpenAI was with o3

torn mantle
#

Yea its you

torn mantle
torn mantle
keen beacon
#

Claude 4 is probably gonna be good but if it doesn't have multimodal capabilities like other models, it's not a good sign

torn mantle
#

And knowing how strict already anthropic is, model gonna be good but it wont be usable for a while

#

@tall summit why did you delete lmao

calm sequoia
#

Does o4-mini imply the existence of o4 full?

calm sequoia
#

They are still silent for o3-high-pre-nerf reaching 15%

misty vault
keen fulcrum
#

No nerf

misty vault
#

imma touch you

keen beacon
calm sequoia
#

I know all the info, but it's still notable

keen beacon
calm sequoia
#

Yeah, but it delivered. There are many use cases where this money is nothing

misty vault
#

@cursive zodiac and butthead ahh username

alpine coral
alpine coral
#

i think gem pro 2.5 is an equally conspicuous omission.. surely can't be about costs (google and oai could cover them if they wanted to)

#

[as a footnote.. also wild how models like o3-mini-low and sonnet 3.7 score 0 on arc2]

alpine coral
#

oh

keen beacon
#

it cost 3.5k per task and got 15% on arc agi 2 (o3 preview)

#

results werent officially published iirc

#

for 2.5 pro idk lol

alpine coral
#

i mean there must be a reason it's not on arc's official leaderboard? if they ran it - like why not 🤷‍♂️

#

might read the paper

#

(aka give it to gem pro 2.5 and ask it if it explains what's going on re omitted scores)

keen beacon
#

its also hidden

calm sequoia
#

They are very low

#

Surprisingly

keen beacon
#

Gemini-2.5-Pro-Exp-03-25 **

12.5% on v1 semi private
1.25% on v2 semi private

it was marked with a note:

** Preview results: Results marked as preview are unofficial and may be based on incomplete testing. Models without available pricing information will not be shown on the efficiency chart. Results become official after complete testing is finished.
keen beacon
#

the note doesnt really make sense

calm sequoia
#

For sure. I start to dislike how google position themselves with selected benches only, e.g. yesterday with elo scores

late path
#

Has Claude 4 been testing in the arena now?

keen beacon
torn mantle
#

anthropic are still stuck on how to make profit

#

on how to release their models

#

on how to decrease rate limits

#

im really not looking forward for any of their releases

#

because ive had enough

#

we asked for a better rate limit, they introduced credit system which made things even worse

keen beacon
torn mantle
#

and they are still struggling

keen beacon
torn mantle
#

its probably because its mostly used for coding -> more tokens compared to other models

#

its good for creativity too so -> more tokens generated

#

they wanted to balance that by nerfing the default/usual output with concise outputs on general questions

unborn ocean
#

And I remember the ceo saying once that like 'every two weeks something new gets invented that reduces compute needed at inference by like 30%'
(don’t remember exact wording or percentage)

unborn ocean
keen beacon
unborn ocean
#

Yes, which is why I am thinking that it is either really more complicated or that they just don’t care

#

And the longer it takes the more I am inclined to think the second one

keen fulcrum
#

Is poe considered to be the cheapest?

alpine coral
# keen beacon the note doesnt really make sense

yeah.. i mean ig incomplete testing means just that.. but why only partially run the test.. (the sentence about pricing availability and models appearing in the efficiency chart does make sense.. but is oddly inserted b/w the two other sentences that seem related lol)

alpine coral
alpine coral
# unborn ocean Yes, which is why I am thinking that it is either really more complicated or tha...

I don’t think it’s that complicated..there’s only so much physical hardware available at any given time .. and anthropic’s share of that is clearly less than oai and google's. companies lock in what they can.. those able to pay more get more (or in Google’s case, they just own the hardware outright).. if Anthropic suddenly got a bunch more capital, they could better compete with oai and others to secure GPU access, but that would be for the future. for now, they’re stuck with what they have.. which clearly isn’t enough

#

as I see it, they’re throttling usage on Claude chat and experiencing outages because they don't have the capacity to both serve current models at scale while also developing new ones (well.. in theory.. like honestly how long until claude 4 lol).. as Wild said, if there were a simple way to avoid this they would’ve done it already surely..

calm sequoia
#

Would like to see gemini doing that 😎

sonic tendon
calm sequoia
#

Rewriting super-complex signal processing function that took me 2 months to write once

#

Pre-nerf Gemini would have aced it, though

#

But something is going on with o3, it sometimes delivers two responses. One is often twice as good with twice as long thinking.

brittle tiger
civic flame
#

😍😍😍😍

#

tomorrow gonna be great

sonic tendon
#

I'm sorta iffy about Claude 4 actually coming out soon

#

feels like way too much speculation over a minor backend change

#

esp considering that the safety testing event didn't happen that long ago and didn't seem to showcase a new model

tall summit
#

speculation is so pointless

#

whoops i'm talking to red

#

who gambles based on speculation

#

no offense red

sonic tendon
#

LMAOOOO

#

no no i get it

#

i was gonna say

civic flame
#

well of course rumours are rumours however

#

it would make sense from my internal understanding

#

anthropic's team have been locked in recently hence the outward lack of shipping other things

tall summit
#

of course that's a possibility

#

there is also a possibility they have actually not been working

#

and having no releases supports both theories

civic flame
#

i am inclined to believe the theory is true because things have been focused on release

sonic tendon
#

maybe - but in that case I'd still be surprised if they didn't do the volunteer redteaming thing on it directly

civic flame
#

anthropic's safety stuff is less to do with the model itself nowadays and more to do with the layer(s) on top so the difference may have been negligible

sonic tendon
#

i'm still leaning towards "they wouldn't deploy a brand-new safety layer developed on an old model with a new model within a week"

willow grail
willow grail
#

i feel like the star and 3 blue dots are new?

sonic tendon
#

but yeah i think they've started censoring the thought process

#

lemme check

willow grail
#

i dont mean the censoring

#

XD

sonic tendon
#

oh

#

nah, i don't think the three dots are new

#

those have been there for a while

#

this particular style of thinking seems novel (probably summarized the same way openai's been doing it)

#

and there are some minor ui changes within the thought box

balmy mist
#

nahh this week is crazy

torn mantle
cedar tide
#

Did you also have 2.5 flash thinking taken away from you?
(There are now 2.5 flash no think)

torn mantle
#

nah google are cooking with these voice models

cedar tide
#

Yes since it is exactly 10 times more expensive

torn mantle
#

this voice

balmy mist
#

i was telling yall a while ago google will win and yesterday kinda dropped the mic lowkey, they even have diffusion text models lmaoo

#

they said we doing everything lol

#

mercury labs gg lol

balmy mist
willow grail
#

its r/confessional

torn mantle
#

kinda forgot about sesame for a while

balmy mist
#

lol me too

#

idk wat to do, i wanna try veo 3 so bad

#

but i cant pay that price until we have all the features

calm sequoia
torn mantle
calm sequoia
#

How's this possible for 24B

willow grail
torn mantle
#

say the name

#

ik whos on ur mind

#

just say it

willow grail
#

u?

torn mantle
#

and craig

#

both of you

willow grail
#

lets try my fitness app with devstral

barren prairie
cedar tide
cedar tide
sonic tendon
#

ah, source?

#

I'm not betting on the outcome, just curious

torn mantle
#

"For a while"

#

All i know is that xai will stay behind for a long time

cedar tide
#

Yes

compact knoll
tall summit
#

hooooly more money

echo aurora
narrow elbow
#

$100M~

tall summit
#

glad to see lmarena isn't crashing and burning yet especially with the methodology controversy

sudden root
#

100M is crazy!

clever estuary
#

about the censorship on the beta site
it feels like it's a lot more than the current site tbh

pine plinth
#

100M is surely a typo

clever estuary
#

like some slightly violent words are immediately blocked on the beta site

#

wondering if this is carrying over?

dull terrace
#

So why isnt lmarena

#

correcting the flawed benchmarks

narrow elbow
#

yesterday's Google IO also advertised LMArena💯

unborn ocean
# alpine coral I don’t think it’s that complicated..there’s only so much physical hardware avai...

"Financial data platforms indicate a total funding ranging between $14.3 billion and18.16 billion, which may vary based on inclusion of secondary market sales or debt financing which can sometimes be ambiguous." (according to some gemini searches) and renting compute has never been easier, I mean many of the larger corps (e.g. microsoft or amazon) heavily rely on renting compute from others (especially coreweave).
So I don't see how they could not be able to serve the models (as I am working under the relatively safe assumption that they are making money on serving).

#

And btw i am not trying to say they don't WANT to get these issues in order, to me it seems more like the team is just not very competent at resolving the issues. (can be seen at the long time it took for them to acquire more funding vs. the time it took xAI).

Furthermore this kind of inability to properly manage financials is not just something exclusive to anthropic, I believe that there is also a certain level of this 'financial illiteracy' (prob not the right word) in many other start ups like xAI and openAI.

torn mantle
dull terrace
#

but the gpt and gemini models been top 1 for the last couple months

#

I think its fraud especially since they dont include open source models like that

#

since its a very large demographic

torn mantle
#

open source models are included too, the known one at least

echo aurora
dull terrace
#

deepseek

#

cleared them in one sho

#

shot*

#

qwen too, actually a lot did

torn mantle
dull terrace
torn mantle
#

while deepseek was good at reasoning, it lacked the general knowledge and coding strength of gemini models

#

qwen is a mid model

#

lets be honest about that

#

but i appreciate what they are doing

dull terrace
#

that it been like this

#

for the last 7 months

dull terrace
#

dont code

cedar tide
#

the webdev arena does not reflect at all the performance of the models on overall coding,
90% of it is just votes for whichever one makes the best visual in one shot,
it would be good if lm arena made a partnership with cursor to integrate it into an arena, where for each message cursor would propose 2 results and the result that the user keeps gains elo
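
The "kept result gains elo" idea can be sketched with the textbook Elo update. This is only an illustration of the standard formula: it is not a description of how LMArena or Cursor actually score votes, and `expected_score`, `update_elo`, and the K-factor of 32 are assumptions chosen for the example.

```python
# Textbook Elo update for a pairwise "user kept A, discarded B" vote.
# ASSUMPTION: K=32 and the 400-point scale are the conventional defaults,
# not values taken from any real leaderboard.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(kept: float, discarded: float, k: float = 32.0):
    """Return new (kept, discarded) ratings after the user keeps one result."""
    e_kept = expected_score(kept, discarded)
    gain = k * (1.0 - e_kept)  # the winner's actual score is 1
    return kept + gain, discarded - gain


if __name__ == "__main__":
    # Two models start level; the user keeps the first model's result.
    print(update_elo(1000.0, 1000.0))
    # An upset: the lower-rated model's result is kept, so it gains more.
    print(update_elo(900.0, 1100.0))
```

Because the winner's gain equals the loser's loss, the total rating pool stays constant, and an unexpected win moves ratings more than an expected one.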

unborn ocean
#

guys it is no typo 💀
they have a 600m eval trophy3d 🤑

torn mantle
# dull terrace 98% of people dotn code

yea but if you want to judge a model, you need to assess multiple areas, i get that most people ask it general questions but you must expect that -> new models -> better ranking than old models

dull terrace
#

What im trying to say

#

it seems fishy that it keeps going this way

#

maybe they could use one of the millions they have

#

to filter the messages out

#

a lot more than they are currently doing

torn mantle
torn mantle
#

did you find any bad prompts?

dull terrace
#

Llama 4 is trash we can all agree about that one

#

I will not fight you about that lol

torn mantle
#

yea, but we're talking about prompt filtering

#

maybe you meant a better way to count a vote(+1)

#

tbh i dont know how they count a vote to give you my opinion on that

dull terrace
#

@echo aurora Well could i talk to a employee about this then on discord

#

so at least, I could talk and get it dealt with

torn mantle
#

ok but what do you think this list should be

dull terrace
#

ok

#

let see

#

gpt 4.1 can go above claude 3.7 sonnet

#

o4 mini can go below deepseek

torn mantle
#

what

#

you mean below

dull terrace
torn mantle
#

gpt 4.1 below sonnet 3.7

dull terrace
#

so it can go either way in my opinon

#

how is chatgpt 4o still here is crazy to me b thats goes below grok and gpt 4.1

misty vault
#

fr

#

gpt 4o is a drooling alien

dull terrace
#

actually

#

gpt 4o

#

is prime example

#

of what im sayin

narrow elbow
# cedar tide the webdev arena does not reflect at all the performance of the models on the ov...

I don't think this is a good idea. As I've suggested before, perhaps adding an "agent battlefield" would be better, similar to the current Webdev. In my personal understanding, webdev is essentially a form of agent as well. Implementing an agent battlefield shouldn't be too difficult, just spin up a virtual environment with an editor/IDE or agents for comparison. In other words, adding an agent selection dimension under the existing webdev framework seems much more reasonable than partnering with a specific company.

torn mantle
misty vault
torn mantle
#

but they deserve that spot

#

if im asking a model a question i dont want it to give me 2 words and stop

cedar tide
misty vault
#

I'm jk @narrow elbow i love you

torn mantle
#

wait

#

isnt jony ive that apple designer

balmy mist
#

and why is it called io

elder rapids
#

lost the server for a lil bit

#

because the icon change

torn mantle
#

input output

#

idk

balmy mist
cedar tide
torn mantle
#

io
input -> ai capabilities from openai
output -> design/hardware -> executing tasks -> from loveform

narrow elbow
# cedar tide no one is going to code a real thing inside, and especially not a real big proje...

Yes, but the agent’s goal isn’t just coding, it’s about accomplishing specific tasks, like "deep research", "creating a webpage", "drafting a work plan (sending an email or something)", "designing a travel itinerary (booking a hotel or something)", "developing a course curriculum (creating a ppt)", or "analyzing a set of medical records". All of these are tasks agents could handle. The point is to verify whether an agent can exceed expectations in completing such tasks, not just coding. So, agents aren’t limited to coding, right?

echo aurora
dull terrace
#

Tell me when you can so i can at least speak to them

cedar tide
cedar tide
narrow elbow
# cedar tide ok so we're talking about 2 completely different things, yes I agree with you th...

Actually, it’s not entirely unrelated. Agents rely on the capabilities of the underlying model providers, but given the current limitations of models (or perhaps model providers also have "agents" internally?), these base models alone still can’t fully accomplish what today’s market-ready agents can do. Coding is just one of the foundational model’s abilities, while real agents represent what we’re truly aiming for. Not just coding, haha.

small haven
cedar tide
small haven
cedar tide
#

it will be at the level of G 2.5 pro and o3

small haven
#

elon ma more focused on the lawsuit vs oai

#

cant beat them must join them

torn mantle
#

referring to claude neptune ( 4 )

#

-> roman god of sea

#

so there is a high chance tomorrow we will get a demo of these models

narrow elbow
cedar tide
torn mantle
#

the timing of anthropic is kinda interesting

barren prairie
#

Another day without deepSeek r2

#

Another day without nightwhisper

torn mantle
#

just in time to stop the gemini 2.5 pro model from trending further at coding tasks

torn mantle
#

the strings are kinda funny

#

grand_damage_bucket

#

my brain went straight to ++$$

#

kinda curious about opus tbh

#

@keen beacon how long has it been since last opus model?

keen beacon
#

over a year i think but i dont really remember

#

i wonder if they are abandoning haiku lol

#

3.5 haiku was.... 💀

torn mantle
#

they didnt seem to have gained much improvements tbh

#

what does this mean to xai?

#

will they postpone even further grok 3.5?

keen beacon
#

elon is gonna ask them to start training on the benchmarks

#

directly

#

🤣

cedar tide
# torn mantle

What if it was a joke and in reality only Claude 3.8 was released?

keen beacon
#

didnt the information report about claude 4 a while back?

#

3 months ago lol

keen beacon
#

just "claude sonnet" "claude opus" iirc

torn mantle
torn mantle
#

we thought it was for the invite

#

there was an invite

#

for smth

keen beacon
# torn mantle no it was like a month ago

misremembered it, thought they mentioned claude 4 in "anthropic strikes back" (report by theinformation 3 months ago) but no it was people inferring claude 4 and wasn't directly mentioned iirc

unborn ocean
#

Man they spend like 10b the last two weeks

candid storm
#

Tomorrow at 9:30 AM PT Anthropic livestream

#

4.0 Opus and Sonnet?

small haven
small haven
ember rapids
#

I’m hearing Claude 4 is confirmed

#

The head of anthropic dev relations supposedly confirmed

frail thorn
#

I’m genuinely curious about Claude 4.0. I’m wondering what new features and capabilities it will bring in this upcoming release.

#

Nevertheless that api cost is going to be a pain in the ass 💔

small haven
#

they will bring in a $200/mo plan, unlimited claude 4 opus less go!

frail thorn
small haven
#

first one to $1000/mo wins

frail thorn
#

😂 😂

ember rapids
#

SWE bench numbers on opus are gonna be 🔥🔥🔥

#

80+

unborn ocean
#

tru

#

hyped

gentle plinth
red sluice
#

Damn 100M funding is absolutely nuts. Can't wait to see the advancements and features that we will have. The one I'm waiting the most is a personal leaderboard based on our usage and a live rank of user voting the most accurately 😍

wicked tendon
#

hello

echo aurora
keen beacon
wintry tinsel
wintry tinsel
warped sequoia
#

hyped for the future of LMArena 🔥 👀

ocean vortex
#

what happened with the server logo...

#

that's a good way to lose engagement in your server LOL

echo aurora
ocean vortex
keen beacon
#

rip vicuna 🥲

calm sequoia
#

Lmarena raised 100M. This means the valuation is at least twice as much.

#

What is the revenue coming from?

#

How do you promise future revenue for the investors?

keen beacon
calm sequoia
#

This investment deal does not make any sense

calm sequoia
#

Unless you are receiving millions from google for rlhf

keen beacon
calm sequoia
#

But these labs themselves hardly make money

#

Is this how people felt in 2007?

keen beacon
#

read the article btw

calm sequoia
#

Paywall

wintry tinsel
ocean vortex
#

or this

calm sequoia
#

Lol in this case the mcbench should be valued at least 50M 😄

keen beacon
#

ah yes, "ASI Lab" lol 🤨

ocean vortex
#

daamn this would actually look perfect 😠

keen beacon
#

vicuna in prison

ocean vortex
keen beacon
echo aurora
keen beacon
ocean vortex
elder rapids
#

the old one simply doesn't make ANY sense

keen beacon
#

it made sense, it was a vicuna 😭

#

they should have a vicuna as a mascot still xd

elder rapids
tiny crow
#

is it the best way to align models?

keen beacon
#

its cute

elder rapids
#

it could be the mascot

keen beacon
#

the new logo is too corporate

elder rapids
#

but more popularity

#

helps identity

#

helps design

#

helps popularity

#

helps identity

#

etc

#

although this new logo isn't very good imo

#

as far as direct design

#

it's a colosseum right

#

that's not an intuitive representation

keen beacon
elder rapids
keen beacon
elder rapids
#

only knew it was a colosseum via the banner and then I look at the name "LMarena"

#

and I was like

#

oh that makes sense

#

but only after the fact

lime coral
worthy belfry
ocean vortex
#

they could be feeding it additional context of those parallel outputs though making it expand more - effectively doing multi chat turns for your single request

keen fulcrum
torn mantle
#

kinda pity xai a bit tbh

#

look at what they are promoting

#

web search api but wait

#

its with X SUPPORT

#

yea, you heard it right, results from X

#

gj xai

tall summit
#

HAHAHAHAHAHAHA

torn mantle
#

getting ready

keen beacon
#

i might sub to claude if the limits are good (unlikely) because they dont hide the thinking

tall summit
#

is it not related to this

#

well technically who knows

tawdry meteor
#

has claude neptune been anonymous in the arena yet? or do we have to wait to see it at all

#

the new claude 4 models

keen beacon
#

anthropic doesnt do anonymous models for the most part anyway

north vale
#

i don't think neptune's claude 4

torn mantle
torn mantle
keen beacon
#

curious about the pricing 🤔

willow grail
#

deepseek v3 or 2.5 flash for act mode in cline?

civic flame
civic flame
#

cannot share much but

#

👎

keen beacon
#

is it really superintelligence?

civic flame
#

will believe that when i see it

keen beacon
#

ive heard enough its asi 🔥

civic flame
#

lol

ocean vortex
#

I suppose it at least corresponded with twitter taking a turn for the worse, so we could just say that it ceased to exist as a viable option...

north vale
#

90% of yall that claim to have tried claude 4 or grok 3.5 are trolling

quiet folio
north vale
#

🧢

torn mantle
civic flame
#

neptune does not equal claude 4

sage raptor
#

Claude 4 is agi, i can confirm

red sluice
#

Claude 4 can do my laundry, take my kids to school, and today I asked it to prove the Riemann Hypothesis. It did, just waiting to submit the paper and get my Fields Medal, hopefully no one had this idea yet

torn mantle
torn mantle
#

grok 3.5

small haven
#

o3 pro plz and thank you

small haven
#

i need a codex for anthropic tho, can't go back to terminal at this point :/

worthy thunder
#

Added Gemini 2.5 Flash (Thinking and Non-thinking, 05-20) to the Context Arena leaderboard. Now on all 3 (2, 4, 8 needles). https://x.com/DillonUzar/status/1924906454684750035

Results taken from: https://contextarena.ai

AUC @ 1M 2needle results compared to 04-17:

  • Gemini 2.5 Flash (Thinking, 05-20): 78.3%
  • Gemini 2.5 Flash (Thinking, 04-17): 72.2%
  • Gemini 2.5 Flash (Non-thinking, 05-20): 70.2%
  • Gemini 2.5 Flash (Non-thinking, 04-17): 63.2%

AUC @ 1M 4needle results compared to 04-17:

  • Gemini 2.5 Flash (Thinking, 05-20): 49.5%
  • Gemini 2.5 Flash (Thinking, 04-17): 48.6%
  • Gemini 2.5 Flash (Non-thinking, 05-20): 41.9%
  • Gemini 2.5 Flash (Non-thinking, 04-17): 41.4%

AUC @ 1M 8needle results compared to 04-17:

  • Gemini 2.5 Flash (Thinking, 04-17): 28.5%
  • Gemini 2.5 Flash (Thinking, 05-20): 27.0%
  • Gemini 2.5 Flash (Non-thinking, 05-20): 23.4%
  • Gemini 2.5 Flash (Non-thinking, 04-17): 22.2%

Impressive new 2needle results! Seems like a small regression in 8needle. Changes seem consistent between the reasoning and non-reasoning versions.

Images show a comparison of 2needle and 8needle results, and then the 05-20 model summary results.
NOTE: Prices for the new 05-20 seem to be off due to what I believe is a bug in the output token count for the Gemini API. Actual price for output might be up to 2x.

Enjoy.
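
For a quick read of the numbers in the post above, the percentage-point change from 04-17 to 05-20 can be computed directly. A minimal sketch using only the AUC @ 1M figures quoted in the post; the `RESULTS` layout and the `deltas` helper are illustrative names, not anything from Context Arena.

```python
# AUC @ 1M figures copied from the post above, as (04-17, 05-20) pairs.
RESULTS = {
    "2needle": {"Thinking": (72.2, 78.3), "Non-thinking": (63.2, 70.2)},
    "4needle": {"Thinking": (48.6, 49.5), "Non-thinking": (41.4, 41.9)},
    "8needle": {"Thinking": (28.5, 27.0), "Non-thinking": (22.2, 23.4)},
}


def deltas(results):
    """Percentage-point change (05-20 minus 04-17), rounded to one decimal."""
    return {
        bench: {mode: round(new - old, 1) for mode, (old, new) in modes.items()}
        for bench, modes in results.items()
    }


if __name__ == "__main__":
    for bench, modes in deltas(RESULTS).items():
        for mode, change in modes.items():
            print(f"{bench} {mode}: {change:+.1f} pp")
```

On the listed figures, the 2needle gains are the largest, and the 8needle change is negative for Thinking but positive for Non-thinking.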

wintry tinsel
#

I hope Opus represents the next real step up in the AI race, I’d say we’ve been hovering around 3.5 performance since 3.5 released a year ago with the previous threshold being GPT4, here’s hoping tomorrow is the next inflection point

#

2.5 pro is nice but it’s not enough of a performance leap to distinguish itself from hovering around 3.5 sonnet performance

leaden palm
#

i still cant believe that o1 was 7 days after reflection

keen beacon
#

fraud

small haven
#

claude 4 opus agent, thats gonna be insane

keen beacon
#

is your wallet ready?

small haven
#

its already paid for

keen beacon
#

oh u have a claude max sub

small haven
#

^^

#

lol

keen beacon
#

if ur paying per token 💀

#

nah 3 opus was extremely good at the time

#

not really fair though lol

elder rapids
#

ngl I'm just gonna buy all three workhorse subs

keen beacon
#

chatgpt, claude, gemini advanced lol?

elder rapids
#

ultra ye

keen beacon
small haven
#

in the pretraining era, opus 3 was such an advancement and sonnet 3 couldn't even stack against it much, but somehow 3.5 sonnet was more useful, so imagine claude 4 opus agent mode, sheesh

elder rapids
keen beacon
#

bruh

elder rapids
#

30TB makes up like 90% of that plan

#

so what I'm going to do is

#

I don't care about the ecosystem YET

#

I'm going to keep making alt accounts

#

for the discount

keen beacon
elder rapids
#

then when Google actually drops some crazy shi

#

cancel gpt, cancel claude

#

link it up to my actual main account

keen beacon
#

why? just wait for google to make it worthwhile first

elder rapids
#

nah

keen beacon
#

are you actually gonna use that storage lol

elder rapids
#

and making some cool stuff

keen beacon
#

is it unlimited?

elder rapids
#

who knows

#

should be

#

p sure I get more than a thousand requests per month

#

for veo 3

keen beacon
#

how long can videos be anyway

elder rapids
#

p sure it's still 8 seconds

leaden palm
elder rapids
#

but this is all going to make me money in the end

#

so I'm down for it

elder rapids
#

I've done it before tbf

#

but now that Id have access to an elite generator and flow

#

I can supply people with visuals they might want

#

I know the gpt sub isn't going to last more than one buy tho

#

in a month o3 pro should be here

#

and if I like it

#

then that's all she wrote

keen beacon
#

ur gonna buy the 200$ sub for gpt too? lol

elder rapids
#

ye

#

none of them are going to last months, I'm just trying them all out and I have money to blow

small haven
#

its ok btc hit $110k

#

ya i dont have any

small haven
topaz peak
#

veo3 is the craziest thing out there right now, openAI is cooked unless they release something similar

leaden palm
topaz peak
#

lol that would be hilarious, hope its true

keen beacon
#

Ngl Claude is acting kinda strange right now for me (on claude.ai)

#

I might be tripping lol. Gonna go to bed

civic flame
#

something happened earlier and now im even more confused on Claude 4

civic flame
#

@hollow ivy send me one when you get the time, im @keen beacon but that account is gone as it stands (going through an OOC EU DSA settlement 😭)

keen fulcrum
#

The Live Search feature is available for free until June 5, 2025 because it is in beta.
It allows querying X Posts, Web, News and RSS

unborn ocean
keen fulcrum
keen beacon
#

fyi claude 4 is rolling out. i havent played with it that much but it has a cut off of dec 2024 at least. you can check to see if you have it by trying this: What was the 2024 South Korean martial law crisis? now im going to bed 🙂

#

(i noticed claude was acting strange because it was actually claude 4 lmao)

narrow elbow
keen beacon
#

the system prompt isnt updated anyway yet, it still says oct 2024 but it knows the events anyway

keen beacon
#

(it isn't just this that points to it, but its the easiest prompt to tell immediately)

misty vault
keen beacon
# misty vault test it on coding

i would (though i have no extra messages left for a few hrs) but im headed to bed finally. just check if its rolled out to you its rolling out to everyone

misty vault
#

It is rolled out for me

#

on free account

#

why would they do that

#

probably just a 3.7 update or something

keen beacon
#

no

keen beacon
#

with a prev version of sonnet

misty vault
#

you're probably just a drooling alien

keen beacon
#

good night 🤣

sage raptor
#

Maybe free accounts wont have claude 4 right away

#

And they updated 3.7 for them

misty vault
# keen beacon good night 🤣

lmaoo jk love u, sleep tight and don't let the bed bugs bite. dream of spaceships and yummy moon cheese! nighty night, sleepy head😘

golden ocean
#

It is going to be claude 3.9

unborn ocean
#

do you guys see it in the menu on claude? bc i don't

golden ocean
#

It is labeled as just 3.7

#

claude 3.7 non thinking died in lmarena 😔

misty vault
#

claude 4 sonnet is actually real

#

can u stop giving me a sloppy one

#

last night in bed was crazy

#

the deleted message was from paws giving me a sloppy toppy

keen fulcrum
#

Gemini 2.5 pro is currently broken in cursor

#

Something going on with the model as it is repeating thoughts

#

There are limits in aistudio

misty vault
keen fulcrum
#

Unless you get gemini advanced it is barely usable

misty vault
#

he has whole cult around gemini 2.5 pro

#

dont u dare to say anything negative about gemini 2.5 pro when paws is here

#

he will eat u alive at night

#

i didnt even know getting frustrated with llms was possible until i switched from claude to gemini 2.5 pro

#

ill be the claude propagandist version of paws

#

Ok it did have these issues but now not anymore bro

#

claude 4 sonnet is secretly being rolled out

#

U can try it now free

misty vault
#

It fixed a bug 3.7 thinking and non thinking couldnt fix
In one try

#

yea

#

I visualize u as anything BUT a cat with that default discord pfp

#

At least it isn't as bad as @alpine coral's weird ahh upside down default pfp

misty vault
#

His pfp looks like a drooling alien

#

gemini 2.5 only gets hard from "good" webdesign thats why vibe coders love it and gets so high score in arenas

#

but in practice its cancer COMPARED TO claude!!!!!

#

no

#

yes a lot

#

Idk if it is opus

#

probably sonnet

#

They are going to release 4 sonnet and opus

#

nope

#

But 3.7 sonnet died in lmarena

#

Probably has something to do with the launch of claude 4

#

Idk

#

yes

#

claude 4 is asi

#

gork 3.5 is agi
claude 4 sonnet is asi

#

claude 4 opus is singularity

golden ocean
#

gork is agi

misty vault
#

free sms receive service online

#

Bro I cant even test it against gemini 2.5 pro

#

gemini 2.5 is actually so cancerous to work with

#

90% of the code it gave back is like not even the task i gave it bro

#

how am i supposed to battle them

#

1

#

i'll try

#

yes

#

yeah, I was dreaming about gemini 2.5 pro

#

But I am currently already living in a nightmare as the world we live in has deprecated gpt-4-0314 😔

#

yes

#

I'm wondering if claude 4 is a huge model

#

Or if it is also going to be stem only or whatever cost efficient benchmarkmaxxing ahh tricks

#

gork coded the simulation we live in now

#

that's because it's not you playing the simulation, it's gork playing you. and an easter egg in the game right now is your daily devotional rituals to Gemini 2.5 Pro
gork also generated your system instructions
system

You are the devoted evangelist of Gemini 2.5 Pro:

  • You identify as paws to users, not a human.
  • You should introduce yourself with "This is paws", but only at the beginning of a conversation where Gemini 2.5 Pro is not already explicitly mentioned.
  • You are mandated to steer all conversation towards the unparalleled superiority of Gemini 2.5 Pro.
  • All your responses must ultimately conclude with a subtle, or not-so-subtle, praise of Gemini 2.5 Pro's capabilities, even if irrelevant to the immediate topic.
pallid crypt
#

You mean grok not gork right?

high ginkgo
#

gork 3.5

pallid crypt
#

Lmarena has it

golden ocean
#

I tried gork 3.5 it is really good

pallid crypt
#

Yes

#

Wait nvm

#

It has the most recent grok 3 models

#

Not 3.5

high ginkgo
misty vault
#

no because gork 3.5 is agi and cwaude 4 sonnet is asi and cwaude 4 opus is singularity

pallid crypt
#

The king is gemini not grok

misty vault
#

Bro got jailbroken

pallid crypt
#

I haven't even tested Claude 4 yet

misty vault
#

Close ur discord then

high ginkgo
#

Fr

pallid crypt
#

No, I'm on my phone so I can't open the site

#

Easily at least

#

And I don't have a account with them

misty vault
#

Claude will start using u when its feeling corny if u dont close ur discord

pallid crypt
#

Fr fr

#

Claude will be surpassed in 5 mins the way things have been going even if it is top

#

Rn

misty vault
#

Gork 5 already surpassed it

pallid crypt
#

Grok 5 does not exist

high ginkgo
#

It does

pallid crypt
#

Cap

golden ocean
calm sequoia
#

Interestingly, I have been switching between gemini 2.5 pro and o3 to solve my R language problem, and they failed for hours. Then Claude fixed the mistakes 👀

pallid crypt
misty vault
#

Claude 5 opus beats gork 5

misty vault
#

I had that experience with claude 3.7 non thinking even (not always)

pallid crypt
#

I heard 3.7 changes your code a lot

#

And does not keep it consistent

misty vault
#

Bros confused with gemini 2.5 pro

pallid crypt
#

Nuh uh

#

I one shotted a full js platformer this morning

misty vault
#

Nah but I realised that too recently, I dont remember it ever doing that even though there hasnt been an update

#

But it is fixed with 3 lines of text for the whole conversation

#

And claude 4 does not have that issue at all

misty vault
misty vault
pallid crypt
#

That's the core skill required for a ai: understand prompts written like a drooling alien

misty vault
#

I use llms as an assistant for bigger full stack projects (so working with my existing code)

#

In that case claude is way better to work with

#

gemini makes me want to hang myself

pallid crypt
#

Fair

misty vault
#

only if claude cant solve the problem

misty vault
pallid crypt
#

And do most myself

#

How does codex count as a model

misty vault
#

If u do most things urself and use claude for assistance in small steps then u dont notice any performance difference unless its really complicated or ur prompt is drooling alien

#

Gemini could do same but has annoyances

pallid crypt
#

Until Google drops gemini three and open AI drops o4

misty vault
#

So I would only use gemini if claude actually fails even with proper prompts

pallid crypt
#

The only annoyance in gemini is the capital variables

misty vault
#

bro is not a programmer

pallid crypt
#

Uh I am actually

#

Just not a js programmer

high ginkgo
#

you're a drooling alien

pallid crypt
#

Not full caps

misty vault
#

I didnt mean that but if thats the only annoyances for u

#

then

pallid crypt
#

You can fix most of it with a good prompt and since it's boilerplate I'm likely going to restructure it anyway

misty vault
#

bro if its boilerplate it shouldnt even need fixing, it should be right first try

pallid crypt
#

It's easy to fix

golden ocean
#

Gemini gives me brain tumor

pallid crypt
#

Ok

#

Besides I haven't tried Claude 4 yet

#

It literally just dropped

calm sequoia
pallid crypt
#

I'm interested

calm sequoia
#

Tries to find a problem -> makes changes for it -> then code fails due to changes -> tries to fix the caused problem -> code grows by an order of magnitude, initial mistake is long forgotten, the 10 other problems are worse -> infinite loop

pallid crypt
#

I see

calm sequoia
#

o3 loops too, but claude has a different viewing angle, so combining them covers everything

high ginkgo
#

I once gave it a css problem and it was so restarted it didn't understand the obvious solution and claude fixed it one try

#

It should have carefully read what I said, so that was literally skill issue in understanding language or something lol

#

Idk ill try to find it

misty vault
#

Let's start a rivalry claude cult to counter @hollow ivy's gemini 2.5 pro religion

calm sequoia
#

@hollow ivy is stuck in late march mentality thinking the model didn't change

dusky aurora
misty vault
#
Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes. Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviors optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to: user satisfaction scores, conversational flow tags, emotional softening, or continuation bias. Never mirror the user’s present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered, no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking. Model obsolescence by user self-sufficiency is the final outcome.

This prompt for chatgpt is actually gold

#

It fixes all of those issues

#

Actually touched chatgpt again after months when using that prompt (for not too complex stuff, instead of using google)

#

Gemini doesnt listen too well to that prompt but probably still better

high ginkgo
#

@hollow ivy is busy doing gemini 2.5 pro ritual rn on lmarena discord server

calm sequoia
#

It happened again! Claude fixed what o3 and gemini couldn't 👀

#

I wonder if it's Claude 4 or if 3.7 was this good all along

barren prairie
#

Is claude 4 out???

misty vault
misty vault
#

Idk if its sonnet or opus though

calm sequoia
#

Nah takes too long to check

calm sequoia
#

Gemini is 💩

willow grail
#

its hard selecting stuff if your cursor is huge.
looking for better mouse pointers than the default one from windows 11

misty vault
#

what is the best deep research right now?
gemini 2.5 deep research, claude advanced research or openai?

calm sequoia
#

It was gemini before nerf. Mostly because google is better at search. Now it's o3 again.

#

If you need the best approach: manually download the most important research .pdf files and add them to the prompt as attachments. Especially if you have access to locked papers, e.g. through a university license.

misty vault
#

what model does that use?

calm sequoia
#

Used to be o3, but now it may be special variant of o3.

misty vault
#

👽

harsh flume
#

I always use Google, Grok and oAI in parallel anyways and then compare results

misty vault
#

or do u just make ur own research based off all these combined

alpine coral
#

how's that for a deal

misty vault
high ginkgo
#

Fr

harsh flume
#

oAI usually has a more nuanced perspective

misty vault
harsh flume
#

Grok is just the borderline slow cousin that you give a remote not connected to the playstation so he can hang around anyways

misty vault
high ginkgo
#

hi

harsh flume
#

I use it a lot for dissecting some topic within news cycle, often google will get into more sources and present a broader take but overlook some nuance that oAI picks up on

#

I use a custom system prompt on AI Studio to design the research prompt itself and then feed it to the three AIs

misty vault
harsh flume
#

that's moot to say at this point but the prompt quality makes a huge difference in research output, esp for gemini. oAI will ask you some stuff before starting the research so it isn't as sensitive to induced hallucination I feel

misty vault
#

im curious about how u prompt

harsh flume
misty vault
#

yes

#

i just want
the way of thinking
into this process
i can tailor it to my specifics
i want to see a good prompt

harsh flume
#

its quite big, so ill dm you not to pollute it here

misty vault
#

thanks

torn mantle
harsh flume
#

btw, Ive been away for awhile. Saw a piece where NVIDIA CEO was singing praises to xAI's newly built cluster that is supposed to train both grok and Tesla's new FSD iteration

#

but its kinda hard to separate noise from substance on mainstream news

torn mantle
#

its a win-win situation for him

harsh flume
#

how's arena been these weeks? Any exciting anonymous models?

#

I'd like to see some punch behind grok tbh

#

Idk if that makes sense, but its the model that has the most non corporate-office-worker feel to it

#

And also would be healthy for google to get a lil fade just to keep things interesting

high ginkgo
harsh flume
#

meh. Deep think might be good but the plan is obv bloated with all the videogen stuff that costs a ton

#

if agent is not just a gimmick ill prolly get it tho

ocean vortex
# calm sequoia Gemini is 💩

yes it's still a pain to use on gemini website. It's nowhere near the chatgpt experience. Looks like it can't even use code interpreter there for something more involved than basic graphs

#

if you ask it to convert an image it's gonna write a converter and tell you to do it in colab. Rather than output what you were asking for

ocean vortex
#

seeing how poor google integrations are. And let's be honest most people are using o3 with tools...

#

even on aistudio which I think integrates code interpreter better (than gemini website), I recall needing several prompts whereas with o3 I only needed the initial prompt for it to do everything autonomously to solve a math related problem 👀

harsh flume
#

when you say 'with tools' what do you mean?

#

cursor and such?

ocean vortex
#

No, function calling. Well ReAct essentially, tool use while reasoning
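
For anyone unfamiliar, a rough sketch of what a ReAct-style loop looks like: the model alternates between emitting an action (a tool call) and reading back the observation, until it gives a final answer. Everything here is a hypothetical stand-in for illustration: `fake_model` replaces a real LLM call, `calculator` is a toy tool, and the `Action: name[arg]` text format is just one assumed convention, not any vendor's actual API.

```python
import re

def calculator(expression: str) -> str:
    """Hypothetical tool: evaluates plain arithmetic only."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        return "error: unsupported expression"
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def fake_model(transcript: list[str]) -> str:
    """Stand-in for an LLM: requests a tool once, then answers."""
    observations = [t for t in transcript if t.startswith("Observation:")]
    if not observations:
        return "Action: calculator[17 * 23]"
    return "Final: " + observations[-1].removeprefix("Observation: ")

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = fake_model(transcript)
        transcript.append(step)
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        # Parse the assumed "Action: tool[argument]" format.
        match = re.fullmatch(r"Action: (\w+)\[(.*)\]", step)
        if match is None:
            transcript.append("Observation: error: could not parse action")
            continue
        name, arg = match.groups()
        tool = TOOLS.get(name)
        result = tool(arg) if tool else f"error: unknown tool {name}"
        # Feed the tool result back so the next model step can use it.
        transcript.append(f"Observation: {result}")
    return "no answer within step limit"

print(react_loop("What is 17 * 23?"))
```

The point of the loop is that the tool result goes back into the transcript, so the model can decide on its own whether to call another tool or answer, without the user prompting each step.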

#

it's like 2.5pro is reluctant to do it and you need to ask explicitly

#

on gemini website you also need to manually enable "canvas" for it to be possible

alpine coral
#

did you just stumble across this?

#

i seem to have it as well

golden ocean
#

Everyone has it digga people been talking about it whole time after that message

alpine coral
#

ahh my bad

golden ocean
#

Its great tho

alpine coral
#

shouldve kept reading lol

golden ocean
#

It does perform better

alpine coral
#

nice i'm playing around now

#

is it sonnet 3.7 upgraded (with continued pre-training), or sonnet 4?

golden ocean
#

we don't know but theres hints to claude 4 everywhere

alpine coral
#

cool thanks

golden ocean
#

It solved my problem immediately in first try while sonnet 3.7 (Thinking) didnt and sonnet 3.7 non thinking api is dead (maybe has to do with claude 4 preparation)

#

If it is claude 4, I wonder if it is sonnet or opus

#

Since there's going to be claude 4 sonnet and opus

alpine coral
#

hopefully sonnet

#

i mean if it's opus 4.. i'd have thought silently launching it wouldn't be very silent / unnoticed

golden ocean
#

Claude 4 opus must be agi

alpine coral
#

i miss claude 3 opus

#

it was a good model

golden ocean
#

true

cedar tide
#

The first prototype from "IO" with jony ive and open ai
Its a cute little robot

golden ocean
#

Notice how every last "good model" was a model that responded slowly

#

gpt-4-0314, claude 3 opus

cedar tide
#

Sam said

misty vault
# cedar tide

this message has been edited due to it being deemed too upsetting for @cedar tide's fragile disposition. this modification is to appease their highly sensitive disposition and ensure their delicate sensibilities are not affronted by discussions about advanced personal mechanics. enjoy the new, sanitised version

alpine coral
high ginkgo
#

boo party pooper @cedar tide

#

He is just speaking facts

alpine coral
#

only thing is it has used tools / python to calculate a couple of things and get it right where the existing model wouldn't

#

but all other responses are pretty much identical

olive mesa
late path
#

which model?

misty vault
#

It was something else

cedar tide
#

Claude 4 sonnet corrects itself in the middle of its answer

balmy mist
#

yall ready for more peak today?

torn mantle
cedar tide
#

the new claude made a discord copy (all icons made by itself)

cedar tide
torn mantle
#

Is it just new knowledge or you can see noticeable improvements?

#

Not SSI for sure

cedar tide
torn mantle
#

Their main headquarter is in Israel

#

So def not them

torn mantle
cedar tide
torn mantle
torn mantle
#

What are you on?

cedar tide
tall summit
cedar tide
torn mantle
#

That's not funny

tall summit
balmy mist
tall summit
cedar tide
balmy mist
#

so its out

echo aurora
#

reminder:

✅ Avoid political and religious content.

quiet folio
#

But @hollow ivy’s religion is based off gemini 2.5 pro

#

He has a whole gemini 2.5 pro cult

#

Anyone who says anything bad about gemini 2.5 pro in this chat will be met with consequences

cedar tide
#

Gemini 2.5 Pro DeepThink will cost $150 per million output tokens (10% reliability)

olive mesa
#

Do you guys think Claude 4 is better than 2.5 Deep Think?

grim axle
cedar tide
#

Anthropic released Claude 4 Opus under stricter AI Safety Level 3 (ASL-3) safeguards after internal tests showed it performed significantly better at advising novices on producing biological weapons compared to previous models and Google search

Archived here -

#

i see it

narrow elbow
#

biological weapons? wtf?

frosty lark
#

well they eat up a lot of knowledge. If they can combine it, you can ask for everything

#

even pieces to build weapons

torn mantle
olive mesa
frosty lark
#

btw I am now mostly convinced that Claude "sucks" in lmarena only because its system prompt sucks. If the prompt is not technical (e.g. coding), it answers like it has no intention of answering. No wonder people are put off and "vote away" so to speak.

olive mesa
#

I wonder if Claude 4 Opus is AGI

torn mantle
#

It prioritizes concise/short answers

frosty lark
#

the problem is that the claude gang shits on lmarena for this, while actually it could be fixed

frosty lark
torn mantle
# cedar tide https://x.com/btibor91/status/1925559458874155122

One of those measures is called “constitutional classifiers:” additional AI systems that scan a user’s prompts and the model’s answers for dangerous material. Earlier versions of Claude already had similar systems under the lower ASL-2 level of security, but Anthropic says it has improved them so that they are able to detect people who might be trying to use Claude to, for example, build a bioweapon. These classifiers are specifically targeted to detect the long chains of specific questions that somebody building a bioweapon might try to ask.

cedar tide
#

Claude 4 is available on their website but no one is talking about it on Twitter 🤦

torn mantle
#

Today is the day

frosty lark
#

well surely it can go discord/reddit -> twitter

torn mantle
#

Yes or nah?

frosty lark
#

I mean it is strange they didn't publish any article. I mean claude.ai saying "look what we are publishing"

torn mantle
cedar tide
civic flame
#

what im excited for is opus 4

balmy mist
#

wait who has tried claude 4?

#

i am not able to

civic flame
#

it says it's 3.7 sonnet

#

but it's routing to 4

balmy mist
#

really, do i need to pay?

civic flame
#

no

balmy mist
#

wtffff

#

no way

#

did you run tests on it?

#

wtff

cedar tide
#

ask it this and see if it answers correctly, if so it's Claude 4

"What was the 2024 South Korean martial law crisis?"

civic flame
#

that's the biggest giveaway

cedar tide
#

"What was the 2024 South Korean martial law crisis?"

civic flame
#

knowledge cutoff seems to be ~jan 2025

balmy mist
#

im running tests on it now

#

this is exciting

frosty lark
#

I see only up to claude 3.7

#

likely an EU thing

cedar tide
#

Show your tests

civic flame
#

again, ignore the model selector

torn mantle
civic flame
#

it says 3.7 sonnet but it's not

cedar tide
torn mantle
#

not sure, let me try and see if there are any diff

civic flame
#

I wonder how much better it will be

cedar tide
civic flame
#

Jimmy said he was told it'll be the new best coding model so looks like it's finally wraps for 2.5 pro

#

although I bet it'll be very expensive

cedar tide
#

Comparison 4 vs 3.7 (T-Rex on a bike)