#general

winter geyser
#

Why doesn’t the DeepSeek R1 0528 model on Web LMArena output any code? Is it a bug, or is code generation disabled for this model?

ocean vortex
#

I feel like people should stop pushing the narrative that new R1 is equivalent to o3...

HLE: 20.6 vs 17.7
SimpleQA: 49.4 vs 27.8
SWE Verified: 69.1 vs 57.6
(o3 vs R1)

#

it's a great model but it is not quite on the level of o3 or 2.5Pro

brittle hull
#

Yo, has anyone seen a model codenamed Stephen on lmarena today? I came across it twice

brittle hull
#

Maybe pre-Grok-3.5, 'cause Musk dropped early Grok3 a week before last time

brittle hull
patent aspen
#

I think it even says deepseek in the code

brittle hull
#

Unlikely R2, since testing R2 right after releasing Updated R1.1 is kinda meh

alpine coral
#

oh..

sacred plaza
#

Are there any grok 3 glazers in this chat? If so, what is your best steelman argument for why I should use anything from xAI, given the fact the Twitter guy brainwashes grok's system prompt on a monthly basis?

tall summit
#

it's free

#

with limited uses

fringe carbon
tall summit
#

LMAO

fringe carbon
#

very good context window

#

ok real talk though

#

is it safe to upload stuff like api keys and whatnot to openai

#

probably right?

#

i'm not afraid openai gonna use them

#

i'm afraid that the model spits out the key to someone else

vernal meadow
#

@sacred plaza doesn't sound like you are asking for any usecases. Sounds like you are already set.

#

@fringe carbon You should have a paylimit on all API keys and from time to time use new ones. These days the API providers make it quite easy.

I wouldn't post my passwords in there tho. Not sure in which context that would make sense : P

fringe carbon
#

and hard code my api keys at the top of files

#

so in that context
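
For the hard-coded-key case above, a minimal sketch of two habits that sidestep the question: read the key from an environment variable instead of the top of the file, and redact anything key-shaped before pasting code into a chat. The variable name `MY_PROVIDER_API_KEY` and the regex are hypothetical; real key formats vary by provider, so treat the pattern as an assumption, not a complete filter.

```python
import os
import re

# Hypothetical pattern: matches assignments such as OPENAI_API_KEY = "sk-..."
# Real key formats differ between providers, so this is not exhaustive.
_ASSIGNMENT = re.compile(
    r"""(?P<name>\b\w*(?:API_KEY|SECRET|TOKEN)\w*\s*=\s*)(?P<quote>["'])(?P<value>[^"']+)(?P=quote)""",
    re.IGNORECASE,
)


def redact_secrets(source: str) -> str:
    """Replace quoted values assigned to key-like names with a placeholder."""
    return _ASSIGNMENT.sub(
        lambda m: f"{m['name']}{m['quote']}<REDACTED>{m['quote']}", source
    )


def load_api_key(var_name: str = "MY_PROVIDER_API_KEY") -> str:
    """Read the key from the environment so it never lives in the file."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return key
```

Running `redact_secrets` over a file before pasting it keeps the rest of the code intact while swapping the key value for `<REDACTED>`.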

ocean vortex
#

bluntly speaking it will get lost in the sea of data. And since they are not gonna overfit the model on your chats there's no chance. Not to mention that them not sanitizing your data is not what you would expect either...

#

or in other words still... a key is not a thing that is easy to "remember" unless it is being shown repeatedly/overfit. It knows generally how it looks but not this specific key exactly as it is

unborn ocean
#

@fringe carbon you can just disagree to data training (in the settings for chatgpt / gemini) and they are legally not allowed to train on the stuff

#

that is how i do it

#

and with aistudio I am just extra careful

sacred plaza
dusky aurora
sacred plaza
#

close to real time twitter data****

echo aurora
sacred plaza
dusky aurora
bright lion
#

Oh Boy, can't wait to pay over 200$ a month for restricted o3 pro access now

#

Where did you hear this from btw.

echo aurora
#

Starting the a16z podcast now!

misty vault
wintry tinsel
#

Thoughts on the new DeepSeek for festive writing?

#

Is V3 or R1 V2 better?

misty vault
elder rapids
#

what is this server

tall summit
#

how do you not know

keen beacon
#

so goldmane is gonna be ga 2.5 pro it would seem

elder rapids
#

ye

#

thank God

#

y'all kept saying redsword was better

#

I was like youre tripping

keen beacon
#

i just need it in aistudio + raw thoughts 🤩

elder rapids
#

ong

#

man

#

give me that model

#

😭

#

when do you guys think it'll be released

#

GA taking too long bruh

keen beacon
#

next month

elder rapids
teal mantle
#

anyone guess what model made this

keen beacon
#

🤣

elder rapids
#

I mean like when next month

teal mantle
#

it is so far one of the worst ones

keen beacon
#

probably soon enough since they removed redsword

elder rapids
#

send in dms

#

the invite

kind cloud
#

ok

teal mantle
#

it sucks at this (once)

elder rapids
#

0% formatting

teal mantle
#

it still sucks and doesn't know what a Touhou spellcard is

elder rapids
#

goldmane codes beautifully

#

someone gotta talk about that

#

did you guys see redsword and goldmane code? 😭

#

ts is a work of art

#

the gradient indents are so beautiful

kind cloud
#

I think we can distinguish between Gemini-flash and Gemini-pro because Pro accurately remembers chapter titles of One Piece, but Flash doesn't. As a result of this knowledge test, goldmane is identified as Gemini-pro.

elder rapids
#

just waiting for Logan to say "Gemini"

keen beacon
#

yeah

elder rapids
#

flash has sauce

keen beacon
#

but it does know less though

#

something happened between flash and flash lite

#

i noticed

elder rapids
#

wonder how they're going to scale up the diffusion model

#

what

#

that's an unnecessary distinction here

#

more loads, larger model, maintaining efficiency

#

yeah well I don't

keen beacon
#

im curious about this. if u can answer this, do you see it ever replacing gemini/being the main thing? (diffusion)

elder rapids
#

but I don't believe they dont know what theyre going to do with it

keen beacon
#

yeah, but id like to know their opinion on it

elder rapids
#

or large text updates

#

and not necessarily discrete generation

kind cloud
teal mantle
#

I forgot but is o3 parameter identical to non-reasoners like 4o?

keen beacon
#

yes

misty vault
#

brian is google ceo???

#

brian is part of that team???

#

me too

golden ocean
#

same

misty vault
#

sydney chatbot

#

Im not interested in building ai if i cant build a gpt-4-0314 at home

#

I will need datacenter

elder rapids
#

I want to use it

#

like, the specific chapter

kind cloud
elder rapids
#

@balmy mist flowith corrupted some of my stuff

hollow ocean
#

Today confirmed

#

#

Maybe

tall summit
#

based

keen beacon
#

adhd is a scam to get kids on meth

#

the medicine they give kids with adhd

#

is brain rotting

#

my brother went schizophrenic off it

#

im telling you rn dont take Adderall

#

the difference is Adderall is addictive

#

wut lol

#

you really shouldnt rely on Adderall to function

#

you should look for other remedies

#

teas and stuff

#

ai is getting really good at study guides as well

#

look up NotebookLM

#

a lot of people take Adderall for school

#

i mean around me atleast

#

there are alot of kids who only take the medicine to focus in school

#

thats a symptom of a larger problem tbh

#

trash courses

#

our schooling is just trash

#

wym

keen fulcrum
#
rxddit.com

Today we're launching Perplexity Labs.

Labs is for your more complex tasks. It's like having an entire team at your disposal.

Build anything from analytical reports and presentations to dynamic dashboards. Now available for all Pro users.

While Deep Research remains the fastest way to get comprehensive answers to in-depth questions, Labs ...

keen beacon
#

oh yea for sure

#

but as a first generation US citizen imma tell u one thing

#

relying on anything to function is bad

#

that was my point

#

not that it's not a real disorder

#

its just under researched

#

i was watching a video from Neil deGrasse Tyson, and yk how the common saying is "we only use 20% of our brain" etc

#

thats not the truth

#

we only know what 20% of it does

#

ain a flex bro 😭

#

liver gon be fried by the time u older

#

dont hate the messenger shrug

late path
#

A person with mild ADHD would also live a better life if they took amphetamines, I guess?

keen beacon
#

they're trying to use AI to create medicines genetically tailored to ur dna

#

so instead of getting pills with side effects you'd have drugs tailored for you

#
#

welcome to the future lil bro

#

for sure

#

US government is tryna use it to end HIV

#

etc

#

wow it got auto modded

#

for sure

misty vault
#

crispr

#

who is deleting their messages

keen beacon
#

idk doubt it

#

it can either go 2 ways

#

ai helps us or ai destroys us

misty vault
#

yeah we got gpt-4o

keen beacon
#

thats just the ai we have now

#

companies 100% have gatekept ais

#

like google has a self learning huge new ai

elder rapids
#

goldmane is so intelligent this is crazy

#

😭😭

#

nebula moment

#

istg bro

misty vault
elder rapids
#

I'm being so deadass

keen beacon
#

lol

elder rapids
#

thinks quickly

keen beacon
#

i doubt its deepthink

elder rapids
#

ye

keen beacon
#

we know what it is already like we keep arguing about this lol

elder rapids
#

ong

#

seriously though

#

wild,

#

goldmane is actually so good

#

it's so smart

#

nah like actually

#

I know it's some shi to compare the models

#

I love playing with o3

#

but goldmane is GOOD

keen beacon
#
#

rip raw thoughts

elder rapids
keen beacon
#

On summaries, we have heard a lot of valid feedback. We understand this is a different experience from the raw thoughts previously available in AI Studio. Sometimes product teams have to weigh a lot of pros and cons to come to a specific decision. This is one of those times. Please work with us and help us in getting summaries to a point where they have just the right amount of detail that you need. You are our valued and needed collaborators in this. In the meantime, we will keep listening to your feedback here or DM @shresbm or @vish_owl or @OfficialLoganK on X.

#

based on that it basically confirms its a competitive decision i guess

#

it was implied before but that basically confirms it

keen beacon
#

i hope so

elder rapids
#

it's gonna be hard to get a model that can select and identify those wording schemes and what the model is anticipating, the route it intends and it says it's going towards, the key words it's relying on, the aha moments to take advantage of

#

it's the little things and that's going to be so hard for them

#

but that's accepting the premise of distillation, I disagree this is possible in AI studio, and summaries should only exist in api

keen beacon
#

yeah summarization inherently dilutes the signal

#

tbh it wont prevent sophisticated actors, openai seemingly has a lot of protections about it and even then its not flawless. google has even less

#

it makes the user experience more annoying generally

elder rapids
#

ye, there's really a lot of reasons summary is just a fleeting decision

#

rather than actually substantiated

keen beacon
#

you can still trivially leak the cot i guess, so ill be doing that if i need to understand exactly what the model is doing at times

#

its just that its potentially unneeded degradation

narrow elbow
#

This can be seen as a cheating behavior. The so-called thought summary and the answer itself may have no direct correlation (the model is not open source, and the correlation between the thought chain and the answer cannot be verified). Even if two models are used, one for thinking (and generating summaries) and one for answering, users cannot detect it. This makes the model lose reputation and trust (even if it is strong). Using security and preventing distillation as an excuse does not seem like something a super large enterprise would do. It is stingy.😏

dusky aurora
elder rapids
unborn ocean
#

Let’s be real: we will all get used to it.

#

Big companies might get access, we might not.

misty vault
small haven
#

omg its coming

elder rapids
kind cloud
#

Yes. So, strictly speaking, this isn't meant to find out 'goldmane'; rather, it's the way to exclude Flash models.

#

Maybe the fact that it tries to cite images could also be a way to tell it apart, but I'm not certain yet.

brittle hull
kind cloud
#

it's just a Chinese model

brittle hull
#

What makes u think that?

kind cloud
#

Because it revealed its developer to me

small haven
#

how do we do a poll here

keen beacon
#

plus icon -> create poll

small haven
kind cloud
patent aspen
# small haven

IMO whichever one releases second will probably be stronger lol

sonic tendon
#

redsword leaderboard release wen eta

#

might not happen, but curious

tall summit
small haven
sonic tendon
#

on that note, I've been wondering what folsom is

#

but yeah baidu and bytedance have been active lately

#

yeah

#

well, they tend to happen at the same time

late path
#

We will never see redsword again

sonic tendon
#

also, wasn't goldmane the worse one

keen beacon
#

redsword was removed

small haven
#

sure, i guess to some extent, it wont be dominating everything, but i do believe deepthink will edge in math/code a bit more, usamo at >40% is very impressive ngl

keen beacon
sonic tendon
#

"best" seems a tad subjective lol

#

they'll probably both have a case for being the best, I think

small haven
#

lmao

#

we have a strawberry in here

sonic tendon
#

I keep hearing that term

keen beacon
#

strawberry man?

#

idk

sonic tendon
#

the twitter guy

#

i assume

#

heh

small haven
#

my issue here is why would an actual google engineer be here to actually talk about insider info? what is there to be gained, other than attention and losing ur cushy job lol

keen beacon
#

nah hes just a massive google fan

sonic tendon
#

aren't we all

small haven
#

not wannabes

#

ok bud

sonic tendon
#

i doubt anyone cares that much

floral merlin
#

Hello, where is the price / score chart in the new UI?

#

This (Price Analysis):

#

It is the most usable chart for me so far.

keen beacon
#

it might not be added yet in the new site idk

floral merlin
echo aurora
floral merlin
elder rapids
keen beacon
#

i havent used the model yet lmao i just read the chat 🤣

#

there were people saying both things i dont know

jade egret
#

hi

elder rapids
#

ion know bro, I've been accurate about basically every assessment ive projected here

#

which isn't much projections at all

#

but still

echo aurora
elder rapids
#

I KNOW models

misty vault
#

do u know that gpt-4-0314 is agi

elder rapids
#

gpt 40 is agi

golden ocean
#

gpt 4 is agi

high ginkgo
#

then we go to a more recent model gpt 4o and we're back to narrow ai ☹️

small haven
#

no comment

sour spindle
#

Opus 4 just took the top spot on SimpleBench

primal orbit
#

is opus in the lmarena direct chat thinking or not?

torn mantle
#

its good

#

although the radar chart is a bit broken visually

#

its def better than whatever they were providing previously

#

deep research feature or wtvr...

elder rapids
#

also I don't think opus 4 or sonnet 4 nonthinking are going to be much higher than 3.7 nonthinking

misty vault
#

real

small haven
#

these tweets about o3 pro is making me 🥵

#

@deep adder why can't i paste image into claude code anymore

#

it used to work

#

i guess they disabled it :/

#

or my settings is fcked

hollow ocean
#

@misty star Disappointment

small haven
#

claude code is a fcking beast

#

ya not ai related, if they can do, why can i post a xi pic

#

*cant

small haven
#

oh hell naw

#

wait i was using haiku?

#

tf

#

lmao, i was running multi agents

#

i think its cause i ran above limits

#

it defaults to haiku

#

i was wondering why i was getting shxtty results

hollow ocean
pseudo hemlock
#

Is the "prompt to best model" feature gone?

pseudo hemlock
#

I'm on there and went to "prompt-specific leaderboard", put in the prompt, and it doesn't load anything after I press send

echo aurora
pseudo hemlock
#

and not a big deal, i'm sure i'm the only person who wants it enough to try it on the legacy site lmao

elder solar
#

why i dont have this feature in my chatgpt logged account?

#

it shows off when logged out

torn escarp
drifting thorn
#

Flux Konnect is good as hell

misty vault
#

fluxsydney

echo aurora
ocean vortex
#

native thinking:

#

with a prompt, thinking 'disabled':

#

the thing with Opus is that it usually doesn't reason for long either way

#

write me a poem that doesn't rhyme. <thinking>

echo aurora
pliant cypress
#

New deepseek R1 score 10% higher on SimpleBench wow

torn mantle
#

the new deepseek is actually crazy

#

its the closest model to o3 in terms of formatting

#

they are trying to mimic that

#

pretty sure its trained on o3 outputs too

tall summit
#

i hate o3 formatting

torn mantle
#

i love the arrows explanation, straight to the point

#

i myself use that a lot

keen beacon
#

nah this new r1 is closer to gemini

#

in terms of output

torn mantle
#

gemini is like the biggest yapper

#

it doesnt write with emojis nor arrows

#

but its packed with so much knowledge

#

so we got the formatting + knowledge from both models

keen beacon
#

for creative writing probably personally

#

but eqbench's creative writing leaderboard is a useful metric

torn mantle
dusky aurora
tall summit
keen beacon
#

if u understand how its calculated

dusky aurora
tall summit
#

opus my beloved

torn mantle
#

lol

#

grok 3.5 is coming

#

just wait a year or two

wintry tinsel
#

It is lol

#

My metric is different though: I jailbreak and unlock the model first, then I judge its full underlying ability

#

Opus is so far ahead its in a league of its own

#

Like a fundamentally different class of performance, I’ve struggled to step down even to sonnet after using it

#

For creative writing I mean

small haven
#

opus is built diff

ocean vortex
ocean vortex
small haven
#

omfg i have it

#

craig smart

#

7mins for news is crazy tho

#

i thought u was trolling but dayum, o1 pro in the api boys

#

*o3 pro

wintry tinsel
small haven
keen ferry
torn mantle
small haven
torn mantle
#

stop it

#

or dont

#

idkdidkidkdi

small haven
#

its free, i just had to splurge $200

#

o3 pro is super long tho, way longer than the old o1 pro

torn mantle
#

first one that will have o3 pro access in this server

#

what a privilege

#

😖

small haven
#

ok lemme put ur berberine prompt in

torn mantle
#

lol

#

is it out yet or what?

small haven
#

not officially

grim axle
#

anyone having problems?

elder rapids
#

😭

#

whoever made it has no idea how to judge + the model capabilities to judge

#

look at the bias control clarifications lmao

grim axle
small haven
elder rapids
echo aurora
#

Sorry for the missing models everyone! Our team is looking into it

grim axle
echo aurora
misty vault
small haven
#

ok so wen is o4

#

and ultimately o4 pro

misty vault
#

no

#

if u haven't reloaded since before the issue

#

then everything still works fine

grim axle
#

press ctrl w I found a way

misty vault
small haven
# small haven
Poll: which one wins

Winner: o3 pro (answer 2), 10 of 15 votes

misty vault
small haven
#

oh my goodness, o3 pro is no joke

civic flame
#

@small haven you taking requests?

small haven
civic flame
#

o3 surprises me by how meh it is at correctly formatting realistic wikipedia articles so try this:

"Write full Wikitext for a very realistic Wikipedia article for the 2028 Republican primaries, after the primary has finished."

small haven
#

queued

split olive
#

fix lmarena

glad peak
#

Anyone else having this issue? Was working fine 5 mins ago now all my chats are gone

split olive
#

i get connection failed error

glad peak
#

Should've just scrolled up lol

echo aurora
glad peak
#

Bruh you guys give me free access to all the best models with ease. It working 3% of the time would be a blessing

small haven
#

@civic flame what did you get usually for both candidates

#

their names

tall summit
small haven
#

tim scott is that out of the scope

grim axle
echo aurora
grim axle
echo aurora
#

Lol not true!! ^

split olive
#

💀💀

grim axle
#

What model should they release next?

civic flame
#

it went with a scenario i think is not particularly likely but all the same interesting to read

misty vault
civic flame
#

unfortunately it looks like formatting is only a little better than o3's attempt

#

nonexistent template

grim axle
civic flame
#

skipped out a bunch 👎 lazy

small haven
#

damn

misty vault
#

openai not giving lmarena gpt-4-0314 access 😔

small haven
#

if 2028 is tim scott, im eating my shorts

civic flame
#

i think approximately zero of these people would decline to run lmao

#

claude 4 opus' prediction i think makes more sense, it went with vance

small haven
civic flame
#

i presume somewhere in the CoT it was like

#

"this is likely to be a very competitive primary without trump, and like in 2016 the winner tends to be hard to predict, so..."

#

which i suppose makes sense

#

shame they don't expose much of the CoT though

echo aurora
#

slight gestures towards

✅ Avoid political and religious content.

misty vault
#

I would vote for gpt-4-0314 if it were running for president ngl

small haven
grim axle
#

I need to try the new flux model please turn back on

#

also is the new model any good?

elder rapids
#

yo 2.5 pro in the app has better instruction following now

#

sum changed

keen beacon
#

gemini product or aistudio? (or both)

elder rapids
#

it's interpreting my instructions and applying them in a much better way than before

elder rapids
keen beacon
#

kk

tall summit
#

opus bad at translation compared to gemini 2.5 pro D:

elder rapids
elder rapids
#

strangely enough

#

but 2.5 pro is a god at translation

#

same with 4o

elder rapids
#

although less nuance when pushed

torn mantle
#

unfortunately i cant tell if its any different from o1 pro

elder rapids
#

ask if it's agi

torn mantle
#

leo been playing roblox all day

small haven
#

o3 pro is less verbose

elder rapids
#

o3 is much less verbose in general tbh

small haven
#

its not a bad thing tbh

#

ya next year

#

lol

#

now we wait for deepthink :/

elder rapids
#

ngl high hopes for deepthink

#

if it actually pushes 2.5 pro further

leaden palm
#

you guys know you could code/orchestrate your own deepthink right
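
One common way to "orchestrate your own" version is to sample several answers in parallel and take a majority vote (often called self-consistency). This is a sketch of that generic technique only; nothing in this chat confirms how Deep Think works internally. `ask` is a hypothetical callable standing in for whatever model API you use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def parallel_vote(ask: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Query the model n times in parallel and return the most common answer.

    `ask` is a hypothetical function that sends one prompt to a model and
    returns its final answer as text.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(ask, [prompt] * n))
    # Normalize lightly so trivially different spellings count as one vote.
    normalized = [a.strip().lower() for a in answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner
```

Voting only helps when answers are short and comparable (a number, a multiple-choice letter); free-form text would need a judge step instead.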

elder rapids
#

then that's something really to hope for

small haven
elder rapids
grim axle
#

it’s pretty easy

pseudo hemlock
rigid crescent
#

sorry if this has been asked to death already but is repochat planned for the new ui?

echo aurora
rigid crescent
#

understood, thank you for clarifying 🙏

small haven
#

o3 pro uses big fat arrows 😮

atomic pagoda
#

Is anyone else getting the connection error?

meager lintel
#

New Gemini 2.5 pro checkpoint in a few days

#

Probably goldmane?

#

Where’d Tuesday come from?

#

😄

#

Well, I heard from someone else who saw it leaked from semi-public info too so makes sense

#

LMArena staff sandbagging the leaderboard update until Tuesday would be 💀

#

Just like Grok 3…

#

In this case, the other source I think is from something similar to the feature flags leak of Claude 4

torn escarp
errant thorn
#

it was working for a few secs just now but now its back down

keen beacon
#

they should call it o2 pro instead for maximum confusion

small haven
#

i literally have it 😭

#

o1 pro with search enabled kek

#

o3 pro gave me alpha omg

patent aspen
#

When is the last time someone working at OAI said they were still working on o3 pro?

#

Looks like April 16

small haven
#

yesterday

patent aspen
#

Oh where?

small haven
#

oai cpo

keen beacon
small haven
#

theres no way ppl still think o3 pro is fake 😭

#

o3 pro vs baseline

#

ok so u saying o3 pro but integrated into gpt 5 lol

#

i mean its still o3 pro

#

just a router

patent aspen
#

It sounds like o3 pro is coming out eventually

small haven
#

i mean yes if u scroll up a bit

patent aspen
#

I know. I'm just reaffirming based on the posts above

small haven
#

99.99% confidence band lol

echo aurora
#

site should be up and working again btw 👍

keen beacon
#

gpt 5 release might coincide/be somewhat correlated with gpt 4.5 being shut down on the api too maybe

#

which is in july

small haven
#

day 1 with o3 pro

#

gpt5 at the end of the day is just a router

keen beacon
small haven
misty vault
#

gpt-5-preview-0314

small haven
#

? i think its just going to explode more? bc majority is just using 4o as default

misty vault
#

true

keen beacon
#

they spent a lot of time on 4o (mid train). whilst 4.1 mini/etc are fresh. it seems 4o is gonna be used for a while

#

(they talked about this in a podcast btw about the mid-train/fresh train)

#

yeah

small haven
#

yea, no one cares about the rest lol

#

wb grok 3.5

#

even when elon ma has a black eye

#

i feel like grok 3.5 bigbrain is just going to at most match o3

#

u rlly believe that?

keen beacon
#

lmao

patent aspen
#

Is bigbrain some meme?

small haven
#

i can not vouch for this

keen beacon
#

they named it bigbrain

small haven
#

oh right xai will release a $200/mo plan 🧠

#

lol ahh

elder rapids
#

4o is spitting out images embedded into the chat

#

man where is goldmane

#

cool that's when my big ass TV arrives

#

ion believe that tbh

misty vault
#

wtf is ion

elder rapids
#

I don't

small haven
#

are u saying gemini 2.5 pro updated version is gonna match goldmane

#

noice

echo aurora
#

hey after the site was down are you now seeing your chat history (blobyes ) or are you NOT seeing your chat history (blobno )?

elder rapids
#

goldmane is an explanation god

#

it's crazy intelligent and it's subject to being influenced more now

#

less dogmatic when it comes to uncertain things at first and brute forces conclusions to be more certain

#

what it says

#

it's subject to being influenced more now

#

no I mean it's different from 0506

elder rapids
#

which in the cases It was in

#

was surprisingly appropriate

patent aspen
#

Is it less verbose than 0506? It's supposed to be

#

Or at least that was highly requested

elder rapids
patent aspen
#

I'm excited. I haven't got around to trying it yet

keen beacon
#

I hope the not thinking bug is fixed

#

It's really annoying when it does it

patent aspen
misty vault
#

I think the not thinking part is the bug like he literally just said

keen beacon
small haven
#

ok after playing with o3 pro for a few hours, its pure insanity

keen beacon
#

Long conversations

#

Since they exclude the previous thoughts iirc in prev turns, at some point the model just doesn't have the tendency to do it. Weirdly they don't prefill the thinking delimiter so it can just do that
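
A rough sketch of the mechanism described above, under stated assumptions: the `<think>`/`</think>` delimiters and the plain `role: text` prompt layout are hypothetical, not Gemini's actual format. Earlier assistant turns have their thoughts stripped, and the new turn optionally starts with the opening delimiter prefilled so the model has to begin by thinking.

```python
import re

# Hypothetical delimiters; the real ones are not public in this chat.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
_THINK_BLOCK = re.compile(
    re.escape(THINK_OPEN) + r".*?" + re.escape(THINK_CLOSE) + r"\s*", re.DOTALL
)


def strip_thoughts(reply: str) -> str:
    """Remove thinking blocks from an earlier assistant reply."""
    return _THINK_BLOCK.sub("", reply)


def build_prompt(turns: list[tuple[str, str]], prefill_thinking: bool = True) -> str:
    """Flatten (role, text) turns into one prompt for the next assistant turn."""
    parts = []
    for role, text in turns:
        if role == "assistant":
            text = strip_thoughts(text)  # history never shows prior thoughts
        parts.append(f"{role}: {text}")
    # Without the prefill, a long history of thought-free replies can nudge
    # the model into answering directly.
    parts.append("assistant: " + (THINK_OPEN if prefill_thinking else ""))
    return "\n".join(parts)
```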

#

Might be fixed since their logic will be different with the next update I think since they're adding the toggle

#

Yeah I don't think so

#

The 'fix' is to ask it to think it's just annoying to do so

#

Sometimes it won't work and you have to rephrase it in weird ways to get it to think etc

#

I looked at the latency for the first token so it's not visual

grim axle
#

Okay yeah the flux model sucks ass

keen beacon
#

It could be that but it mostly starts happening in long conversations, and I think the mechanism is as above. It's been a thing since flash thinking exp

#

Also it can think twice or get into thinking loops (multiple thinking blocks per reply) so I think it's a model thing

keen beacon
#

#general message here's an instance of it doing it (fyi I was wrong here about it being a special token)

meager lintel
#

whatever X-preview is sucks lol

meager lintel
#

after making 05/06 the ONLY option with no option to still use 03/25, I was half worried the thinking spam was some kind of intentional thing to pump out output tokens.....

#

yea makes sense

#

I'm hoping that whatever the "new research stuff" got into the new 2.5 flash is in goldmane too

#

new 2.5 flash is absolutely amazing

#

just not quite smart enough

#

but it hits so far above its weight it's insane

#

I saw some google researcher on twitter saying something like "a ton of new research ideas (which I can't talk about) were successful and got into 2.5 flash", so it's got my hopes up 😄

elder rapids
#

Logan confirmed it's coming in the next few weeks

keen beacon
#

He said that a few days earlier

#

They removed redsword I assume release is imminent

patent aspen
#

Weird that he said a couple weeks on May 28th

elder rapids
#

it's either you or him

#

ye

keen beacon
#

Maybe Logan is talking about an even later revision

#

But this update is substantial

#

It would be strange

elder rapids
#

I mean, in my eyes it's better than 0325 all around tbh

#

0506 still had the same capability but you had to prompt it more

#

goldmane simply just does it

patent aspen
#

tbh if Logan said a couple weeks on May 28th, I trust him more than myself

elder rapids
#

btw I believe 2.5 flash was an exception

#

they didn't actually serve it

#

as far as I know, I don't use vertex

keen beacon
elder rapids
keen beacon
#

Maybe but more time couldn't have hurt

elder rapids
#

hopefully

#

I anticipate googles releases a lot

small haven
elder rapids
leaden palm
grim axle
#

FLUX AI IS HAVING PROBLEMS

leaden palm
#

perhaps it's because it didn't force you to attach anything

leaden palm
#

well you didn't attach an image there did you

#

probably why you're getting problems

#

it works fine for me with an image

grim axle
#

I tried and it’s still not working

leaden palm
#

then that's odd

grim axle
#

is there a limit?

#

because I used the model 20 times

keen fulcrum
#

Claude is making these graphs and Cursor isn't great at displaying them 😄

elder rapids
#

lmao why does 2.5 flash thinking identify as Claude 4 sonnet, with all of the up to date Claude information on the arena

#

it says the model string too, like the regular Claude models

#

I've also been seeing different models act in a way that doesn't align with their personality

#

there's definitely a bug going on rn

#

where it's showing the wrong model name

#

or its routing to a different model

#

even the other models, like "Stephen" or "x-preview" are doing it

#

Claude is sometimes identifying as a Google model, too

#

yo this is DEFINITELY happening

hollow ocean
elder rapids
small haven
#

o3 pro currently has a 64k context window 😦

#

if deepthink matches o3 pro, but offers 1m context window, google wins

hollow ocean
calm sequoia
#

Someone on twitter posted this interesting table

small haven
late path
#

they should add phantom and nebula too

#

the beginning of google's legendary comeback😁

dusky aurora
vernal meadow
#

yes they both are and they are both very good

dusky aurora
#

so they say Goldmane will be released in the coming days?

alpine coral
#

i can't remember the last time a google (or anthropic or oai for that matter) model identified itself as a model from a different lab.. and it happened repeatedly?

primal orbit
keen beacon
alpine coral
#

that would fs be the most obvious / likely explanation

keen beacon
#

There were issues with lmarena earlier this is probably related

sacred plaza
#

anyone know what Ilya Sutskever has been up to at his new startup?

patent aspen
#

Using Google TPUs for research

#

tbh I don't expect any company starting so late to become relevant, although I also didn't expect DeepSeek or xAI, so take my word with a grain of salt

misty vault
#

me when gemini 2.5 pro

#

no gemini 2.5 pro is king

#

i learned from u

#

no

#

u worshipped gemini 2.5 pro in may still 🥰

high ginkgo
#

gemini 2.5 pro is cancer

misty vault
#

it is though

quiet folio
sacred plaza
high ginkgo
#

i am going to wait for deep think

dusky aurora
#

I still see nothing better in the Direct Chat list. Even the May version of Gemini is a viceroy compared to others

#

Opus is good but so laconic

#

somehow modern LLMs tend toward giving short, token-hoarding replies

keen fulcrum
patent aspen
#

You think SSI will be relevant?

keen fulcrum
late path
late path
#

I remember Ilya saying their first product would be safe ASI.
We won't see them until ASI

patent aspen
#

Wouldn't they just get outpaced by Google if that's their approach?

#

That's why I don't think it's particularly likely that new entrants will catch up

#

I think DeepSeek will remain relevant because it's based in China, and the US may eventually ban China from using US models

#

I don't think it will surpass the top US model in capability

dusky aurora
late path
#

I'm not sure if every model Deepseek releases is only slightly worse than SOTA. Is this a coincidence, or is it because the distilled data they used fundamentally limits their upper bound?

tall summit
#

and it is wordy when you ask

dusky aurora
#

I prefer long replies. Current Gemini is all about bullet points, which is readable but too abrupt

sacred plaza
#

SSI isn't trying to make money or sell stuff though. Why are you comparing these AI labs with SSI? It seems more like a research facility

sacred plaza
#

This seems probably false. Agree that they are competing for the same pool of resources when it comes to GPUs, but the goal for SSI does not seem to be AGI

#

Very cool. Have not heard of blue sky research but would not be surprised given Google DeepMind as institutional.

Given how hard it was to get even 20% of the compute at OpenAI for safety-specific testing, which led to the creation of Anthropic, I don't see the market incentives promoting safety work for its own sake, like SSI.

#

Would be glad to be proved wrong tho!

#

Wish I knew about that DeepSeek release before I bought the ultra plan, lol 😭. Agree that AI labs are doing safety work, which is definitely promising!

#

I agree with this take. It just seems like Google is adding a graph-of-thoughts prompting technique into the internal model to create deepthink. This seems similar to how AI labs put chain-of-thought prompting into their models to develop their reasoning models.

feral lichen
#

best ai for roblox studio?

late path
#

it would be a remarkable achievement if they could replicate the huge elo improvement that Alphago achieved by using mcts over raw DNN in LLM

narrow elbow
#

in pursuit of technological dominance, tech giants and capitalists always prioritize capability over safety. this race mentality remains unchanged. just like cold war nukes.

patent aspen
narrow elbow
#

yea

elder rapids
#

but this is crazy tbh

cedar tide
#

when is the next leaderboard update? there are 6 models in the arena not yet in the leaderboard 🥴
(Two Claude 4, new R1, grok 3 mini,
qwen 3 no think, glm 4 air)

#

They added it to the battle arena recently

#

@deep adder
Now in battle arena

leaden palm
#

depends on how long a while is

elder rapids
tulip meadow
#

Hello can someone help me?

#

In which section can I use the AI?

cedar tide
#

Nope

torn mantle
#

i thought the next grok model on lmarena will be the 3.5 ver

#

ig we just have to wait a little longer

echo aurora
tulip meadow
#

Anyone have torrentleech site access? I need invitation

verbal nimbus
#

Would be cool if the Web Arena had a Svelte mode. Curious to see if the rankings would stay the same.

keen fulcrum
#

So no r2 for the foreseeable future?
How long will we be stuck on R1?

small haven
#

o3 pro before grok 3.5 is crazy

#

it happened

#

troll?

elder rapids
#

I'm not getting goldmane NEARLY as much as in the legacy website

#

which is super strange tbh

small haven
#

ok ya officially, who tf cares

#

its already here

#

yes

#

o3 pro + claude code is the meta

#

yo but imagine deepthink matches o3 pro, right.. but with 1m context window, that would go insane

small haven
#

no

elder rapids
#

but hold on I want to know

#

deepthink isn't going to be 2.5 pro 0506

#

with more thinking

#

it's going to be goldmane lvl probably

#

with parallel

small haven
elder rapids
#

which could be crazier

small haven
#

exactly lmao

#

u can retire at any age

small haven
#

thats why i keep myself busy

surreal warren
#

Any idea what tools to try for deep research sites or scraping
To Find me items matching specs

Chatgpt is fabricating

small haven
#

u can batch tasks in parallel in claude code, amazing

#

yup

#

u can run as many as u want, but for my case, 3 was enough

#

add coffee

candid harbor
#

try quadruple espressos

feral lichen
#

How can I continue a conversation if you keep standing like that?

hollow ocean
#

GPT-5 July confirmed ✅

elder rapids
#

but ion think it's absolutely confirmed

surreal creek
#

is Stephen a different version of R1? it keeps answering in Chinese which is a pretty obvious giveaway, but there’s a May version of R1 not codenamed in the arena

#

unless it’s an undisclosed version of Qwen

elder rapids
#

they're just random small Chinese models

#

same as X preview

#

Gemini 2.7 soon

#

get it right

patent aspen
elder rapids
#

ion think Google would do that tbh

patent aspen
#

I don't either

elder rapids
#

they have some sort of philosophy of design

drifting thorn
leaden palm
# drifting thorn How?

openai style:

// `generate` is assumed to be an async (prompt) => string wrapper around whatever LLM API you use
const solve = async (prompt) => {
  // sample 9 candidate answers in parallel
  const results = await Promise.all(Array.from({length: 9}, () => generate(prompt)));
  // ask the model to merge the candidates, and return that merged text as the answer
  return generate(
    `We tried to figure out the answer to the prompt <prompt>${prompt}</prompt> 9 times. Write a final answer incorporating the best aspects from all of these: <answers>${results.join("\n\n---\n\n")}</answers>`
  );
}

open source style:

const solve = async (prompt) => {
  // single call; the extra instruction forces repeated self-critique before answering
  return generate(`${prompt}

Note: whenever you are about to end thinking, don't. Instead, first write out what you were about to respond with, then critique it in depth, then keep thinking. You are only allowed to end thinking once this has happened 5 times.`);
}

tree of thought style:

// `decisions` defaults to [] so the top-level call is just solve(prompt)
const solve = async (prompt, decisions = []) => {
  // the model returns either several candidate paths separated by --- or one final answer
  const trials = (await generate(`You are in the process of solving the prompt <prompt>${prompt}</prompt>. You've made these decisions so far: <decisions>${decisions.join("\n\n---\n\n")}</decisions> You now need to either list some possible paths you can take (separated with the separator ---) or only list the final answer.`)).split("---").map(x => x.trim()).filter(x => x.length > 0);
  if (trials.length > 1) {
    // explore every path in parallel (no depth limit: recursion only stops when the model gives a single answer)
    const results = await Promise.all(trials.map(d => solve(prompt, [...decisions, d])));
    const best = await generate(`You are in the process of solving the prompt <prompt>${prompt}</prompt>. You've made the decisions <decisions>${decisions.join("\n\n---\n\n")}</decisions> so far. Now, you made some more decisions, resulting in these results: <results>${results.join("\n\n---\n\n")}</results> Write your final answer to this prompt, combining the best aspects of each.`);
    return best;
  }
  return trials[0];
}
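
for comparison, a minimal runnable sketch of the simplest parallel variant: sample n answers and take a majority vote instead of asking the model to merge them. this is only an illustration under assumptions, not how any lab actually does it. `sampleAndVote` and `fakeGenerate` are made-up names, and `fakeGenerate` is a stand-in for a real model call.

```javascript
// Sample n answers in parallel and return the most common one (majority vote).
// `generate` is any async function (prompt) => string; it is passed in so the
// sketch does not depend on a particular provider's SDK.
const sampleAndVote = async (generate, prompt, n = 9) => {
  const answers = await Promise.all(
    Array.from({ length: n }, () => generate(prompt))
  );
  // Count how often each (whitespace-trimmed) answer occurs.
  const counts = new Map();
  for (const raw of answers) {
    const answer = raw.trim();
    counts.set(answer, (counts.get(answer) ?? 0) + 1);
  }
  // Pick the entry with the highest count; ties go to the answer seen first.
  let best = null;
  let bestCount = 0;
  for (const [answer, count] of counts) {
    if (count > bestCount) {
      best = answer;
      bestCount = count;
    }
  }
  return best;
};

// Stand-in model for demonstration only: answers "42" two times out of three.
let calls = 0;
const fakeGenerate = async () => (calls++ % 3 === 2 ? "41" : "42");

sampleAndVote(fakeGenerate, "What is 6 * 7?").then((answer) => console.log(answer));
```

voting only works when answers are short and directly comparable (a letter, a number); for free-form text you'd need the merge step from the first pattern.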
wintry tinsel
#

After opus my reaction to AI news has been

#

🫤 😕 😑 😐

#

Like there is nothing interesting happening

leaden palm
hardy pecan
#

we are in a golden age

#

but have trouble perceiving it

leaden palm
#

idk its said that logan is responsible for accelerating gemini's availability and ai studio dev

surreal creek
hollow ocean
#

Opus sota

small haven
#

they alrdy nerfed o3 pro great

elder rapids
#

like, could you imagine being able to speak to something that isn't human but can actually, coherently, and with extraordinary articulation

#

talk to you about things that are extremely implicit, beyond the syntax it's built off of

#

and so reminiscent of human thoughts

#

and then now it can access and see your screen, and think about what it's looking at with the necessary context, to actually figure it out

#

rather than a narrow program that key logs then executes that in repetition

#

all in the span of a year

keen fulcrum
#

LLMs & Models:

DeepSeek: R1-0528 released: 64K context, efficient quantization.
OpenAI: Deprecating GPT-4 32k for GPT-4o, chat log & censorship concerns.
Anthropic: Claude Opus safety report, mechanistic interpretability tools.
Google: Veo 3 video model, SignGemma for sign language. Gemini 2.5 Pro: large context, UI/creative limits.
Mistral: Agents API for orchestration.
AI21: Jamba model reception good, details limited.
xAI: $300M for Grok on Telegram, skepticism remains.

Agents & Tools:

Perplexity: Labs for multi-tool workflows, new features.
LlamaIndex: Agents in Finance workshop, advanced RAG.
VerbalCodeAI: AI terminal tool for code analysis.
Latent Space: Collab on autonomous engineers.

Infra & Hardware:

Unsloth: Optimized DeepSeek models for limited hardware.
AMD: Max+ 365 GPU (128GB VRAM).
NVIDIA: Blackwell optimizations for DeepSeek R1.

Open Source:

Ollama: Naming issues, SDK instabilities.
Hugging Face: Diffusers enhanced, LightEval v0.10.

Challenges:

Cursor: Backlash over slow pool removal.
Manus: Instability, network issues.
Nomic.ai: Cloud security concerns.

New:

Black Forest Labs: New AI lab, Flux-1-Kontext image model.
Factory AI: Autonomous software engineers.

Insights:

Mary Meeker: AI industry report: accelerating adoption.
Microsoft: Early Sora API access.
Cohere: AI automation gains.
Gradio: MCP hackathon.

misty vault
#

gpt

keen fulcrum
small haven
#

I think they tried o4 pro

willow grail
#

ayo!!!!
TO ALL OUTSIDE VIBE CODERS!!!!

u know which laptops are best for vibe coding with our lovely cc/cm in parks?

calm sequoia
#

I think "vibe-coding" does not impact your laptop choice anyhow.

#

It all depends on what you code. Web - buy something nice to hold and use. Heavy workload (apps, data processing) - buy something powerful, even if ugly.

willow grail
#

even amazon?

calm sequoia
#

Idk I haven't seen a website that requires a lot of cores or GPU

#

And amazon website is 💩 wdym?

willow grail
# calm sequoia And amazon website is 💩 wdym?

hmmm okey... but like lets go one step earlier..

how to vibe code with claude code on ANDROID?
do i even need a laptop to use claude code on ubuntu? or is screensharing from pc to outside device enough?

#

WE DONT KNOW

#

stuff like laptop and tablet is an issue cause it probably will be impossible to use without a table high enough

#

i am feeding crows at graveyard. this takes time. like 2 hours daily.

so i wanna code while at it

#

i could bring a mobile table with me if laptop/tablet

#

slept 8 hours R.

#

claude code is the best agentic vibe coder out there

#

ok then why did my lower arms hurt a lot 10 years ago after 5h usage of macbook??

#

weighted 1.6kg

#

i can watch websites on android yes.

#

just websites and perhaps games up to phaser.js level

misty vault
#

no way

willow grail
#

thats unnatural and thus will lead to pain

patent aspen
#

If I didn't have a full time job and had the energy and will to code in my free time, I'd probably switch from a Macbook Pro to a Framework running Arch Linux

queen siren
#

anyone recommend good libraries for running multiagent llm systems?

patent aspen
#

The Linux terminal feels like putting my hand on the third rail of the universe, and no amount of money, build quality, or brand value can give me that high

glad jackal
#

Guys currently which llm has the least hallucinations is it Gemini 2.5?

#

Lmarena?

#

What abt Gemini 2.5

#

It has been 1st on the leaderboard

grim axle
#

can’t wait until Claude opus is open source

elder rapids
elder rapids
#

yes bro opus never hallucinates

#

Claude 4 opus is a hallucination goblin btw

acoustic cliff
echo aurora
feral lichen
#

can anyone tell me best ai for lua coding?

balmy mist
#

anything new?

sour spindle
#

Dumb question that’s probably been answered is the claude 4 opus on leaderboard thinking or base

ocean vortex
sour spindle
#

Pretty good score for the base version

ocean vortex
#

there's no such thing as claude base though, both are the same exact model

#

chat

#

just different prompt template 😉

#

And if you add <thinking> at the end of your prompt, from my experience this is gonna be largely the same as thinking natively enabled

small haven
#

same day as o3 pro official release 🥳

ocean vortex
#

Straightforward parallel compute, you could implement that in a short evening lol

keen beacon
#

apparently o3 pro is already rolling out

ocean vortex
#

and it doesn't need safety testing additionally etc

keen beacon
#

just recently

small haven
small haven
#

me

#

bro got amnesia

ocean vortex
#

seems on par with 4.0 dork

small haven
keen beacon
#

hes not trolling

ocean vortex
#

are you not? Is it good then?

small haven
#

ok let me get some proof, sigh

#

gimme a prompt @ocean vortex

#

that involves the internet lol

ocean vortex
#

is it with tools or no tools?

small haven
#

no clue, it doesn't show explicitly in the cot

#

it does use web tho

ocean vortex
# small haven no clue, it doesn't show explicitly in the cot

we need smth it can't cheat with tools on then... try this:

```
approximately:
A. 4km eastward
B. 30 km northward
C. >30km away north-westerly
D. <1 km northward
E. >30 km away north-easterly.
F. 5 km+ eastward
G. The glove is exactly where the car was at the time it slipped out.
H. Neither option is correct.
```
small haven
#

queued

marsh stratus
#

is Claude 4 Opus on the leaderboard the thinking Claude or nonthinking?

ocean vortex
#

seriously though read several messages above lol

small haven
ocean vortex
patent aspen
#

What is the source for o3 pro rolling out on Thursday?

humble heath
#

will o3 pro be added to the arena?? i assume not bc the model is so expensive

small haven
#

and thursday is where big releases happen

small haven
ocean vortex
#

yeah ik, but I don't have 3.5 obviously, it's still hilarious why it responded this way. It's just showing "chatgpt" for this chat no model name

small haven
#

let's be honest tho, do u really think 3.5 could have done that

small haven
ocean vortex
small haven
#

no i meant the answer

#

is it A

ocean vortex
small haven
#

damnn

ocean vortex
#

I haven't seen any model get this right yet though. So it doesn't mean it's sht. Just that maybe it is not significantly better than normal o3

small haven
#

deepthink next

small haven
#

wen deepthink sir

#

yes

#

im actually curious

#

yes i believe that

#

its brian

patent aspen
#

It's a new person

small haven
#

nah but when is deepthink

#

gimme the exact date

patent aspen
#

Anat

#

I remember Ruth because she made our perks suck. I don't think about Anat much

#

She's kind of just there

small haven
#

brian gimme deepthink exact date release

keen beacon
#

he probably doesnt know xd

small haven
#

go into that deepmind office and ask everybody for the date

keen beacon
#

yeah but idk what you expect from him if u keep asking about it

small haven
#

hes talking quiet guys

#

so basically end of june

#

ur not in that slack group? come on bro

#

i have a genuine question, will deepthink have a 1m context window like the rest of the gemini models ?

keen beacon
#

theyre releasing two revs in a month?

#

or naming it ga later

#

i guess its the latter

small haven
#

wait they can actually afford deepthink on 2m context window? i believe it for regular gemini models

keen beacon
#

load i guess

small haven
#

im guessing keeping compute for research/training

#

we need an oai insider in here now

#

yoooo wen is o4 pro

#

ok bud

#

2m context window on deepthink is absolutely going to crush o3 pro, sorry sam

keen beacon
#

can deepthink use tools?

#

if not there will still be a use for o3 pro

small haven
#

maybe not, but 2m context window is a huge deal

#

o3 pro currently has 64k

#

i maxxed it out

#

128k got timed out

#

80k~ timed out

#

58k ish passed

#

original o1 pro could fit 128k

#

o1 as well

#

o3-mini-high too, but now it's limited to 64k
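
if anyone wants to pin the cutoff down without guessing sizes by hand, here's a rough sketch of a binary search over prompt size. it assumes the limit is monotonic (if a size passes, every smaller size passes too), which a timeout-based limit may not strictly be. `findMaxPassing` and `fakeProbe` are made-up names, and the probe is a stand-in, not a real API call.

```javascript
// Find the largest size in [lo, hi] for which `passes(size)` resolves to true.
// Assumes monotonicity: if a size passes, every smaller size passes too.
// `passes` is a hypothetical probe (e.g. send a prompt of roughly `size` tokens
// and report whether a reply came back before a timeout).
const findMaxPassing = async (lo, hi, passes) => {
  let best = null; // null means nothing in the range passed
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (await passes(mid)) {
      best = mid;
      lo = mid + 1; // mid works, so try larger sizes
    } else {
      hi = mid - 1; // mid fails, so try smaller sizes
    }
  }
  return best;
};

// Stand-in probe for demonstration only: pretends the cutoff is 64,000 tokens.
const fakeProbe = async (size) => size <= 64000;

findMaxPassing(1000, 128000, fakeProbe).then((limit) => console.log(limit));
```

with a real probe each step costs one full request, so in practice you'd stop once the range is a few thousand tokens wide instead of searching down to a single token.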

#

yes...

#

cuz of the huge spike in new members

#

google has no users, so 2m context window it is

#

no "loyal" users

#

😭

patent aspen
#

I mean that's their first mover advantage

#

It is

small haven
#

"forced"

#

nk style

patent aspen
#

You are correct Gemini will not be ChatGPT

keen beacon
#

4o image gen was probably bigger than 2.5 pro

small haven
#

400m in ai studio? or 400m android users who have no choice to pass by a gemini feature lol

#

geographic 100% india?

#

and other third world countries

#

i believe tides will turn when deepthink release all jokes aside

patent aspen
#

And you are a shining example @deep adder

small haven
#

craig singh

#

@patent aspen also when is jules actually going to be GA? would gladly pay for unlimited like codex

patent aspen
#

Your politeness is incomparable

small haven
#

its a fact, not conspiracy

#

western world is iphone dominated

green oak
#

I would like to but the business

#

I have 3 dollars

small haven
#

wtf

patent aspen
#

"third world country type" comes off a bit differently haha

#

Kind of like poors or plebs

#

It does. I'm just remarking on your word choice

#

Oh absolutely

#

More so in the United States among gen Z but yes

#

Sure

small haven
#

i think thats a you thing 😂

ocean vortex
#

but they still thought a $250 sub was a great idea

sour spindle
#

Gemini app is atrocious

#

Normies love ChatGPT vast majority don’t even know there are other models

patent aspen
#

People (especially rich people) have tended to value materialistic, branded status symbols less over time

#

Why do you think that?

#

Interesting. It does seem there was a counter-trend among millennials toward experiences, health, and well-being, although the current trend among younger people does seem to be towards materialism

elder rapids
#

this IS Gemini usage...

red sluice
#

Gemini pro is winning on most benchmarks

hidden quartz
#

Hey guys wanted to know if images generated from lmarena can be used commercially? And if not would editing it help??

red sluice
hidden quartz
#

Umm currently using flux to generate product photos and then changing the subject with my own and keeping the background. Should I be worried?

zinc ore
#

AI Overviews is 1.5b if they were counting that. Per IO

#

I think Meta claims their AI usage is 1b (or close to that figure)