#general
I feel like people should stop pushing the narrative that new R1 is equivalent to o3...
HLE: 20.6 vs 17.7
SimpleQA: 49.4 vs 27.8
SWE Verified: 69.1 vs 57.6
(o3 vs R1)
it's a great model but it is not quite on the level of o3 or 2.5Pro
Yo, has anyone seen a model codenamed Stephen on lmarena today? I came across it twice
deepseek model
probably r1
Is it good?
Maybe pre-Grok-3.5, 'cause Musk dropped early Grok3 a week before last time
Over good
No it's Chinese
I think it even says deepseek in the code
Unlikely R2, since testing R2 right after releasing Updated R1.1 is kinda meh
oh..
Are there any grok 3 glazers in this chat? If so, what is your best steelman argument for why I should use anything from xAI given the fact the Twitter guy brainwashes grok's system prompt on a monthly basis?
i use deepseek. it is very good at password management. you can give it all your passwords and not have to remember them again
LMAO
very good context window
ok real talk though
is it safe to upload stuff like api keys and whatnot to openai
probably right?
i'm not afraid openai gonna use them
i'm afraid that the model spits out the key to someone else
@sacred plaza doesn't sound like you are asking for any usecases. Sounds like you are already set.
@fringe carbon You should have a paylimit on all API keys and from time to time use new ones. These days the API providers make it quite easy.
I wouldn't post my passwords in there tho. Not sure in which context that would make sense : P
well i write super spaghetti code
and hard code my api keys at the top of files
so in that context
chances of that happening even if they fail sanitizing data and train the model on your exact chat are actually incredibly slim. It's very unlikely to output your exact key without changing a single char
bluntly speaking it will get lost in the sea of data. And since they are not gonna overfit the model on your chats there's no chance. Not to mention that them not sanitizing your data is not what you would expect either...
or in other words still... a key is not a thing that is easy to "remember" unless it is being shown repeatedly/overfit. It knows generally how it looks but not this specific key exactly as it is
@fringe carbon you can just disagree to data training (in the settings for chatgpt / gemini) and they are legally not allowed to train on the stuff
that is how i do it
and with aistudio I am just extra careful
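For the hard-coded-keys case above, a minimal sketch of one way to keep keys out of the file you paste into a chat: read the key from an environment variable and redact it from any text before sharing. The variable name `EXAMPLE_API_KEY` and both helper names are made up for illustration, not any provider's convention.

```python
import os


def load_api_key(env_var: str = "EXAMPLE_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    EXAMPLE_API_KEY is a hypothetical name; use whatever your setup defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before running this script")
    return key


def redact(text: str, secret: str) -> str:
    """Replace every occurrence of the secret before the text leaves your machine."""
    return text.replace(secret, "<REDACTED>") if secret else text
```

With this, the source file never contains the key itself, so pasting the file into a chat leaks nothing, and `redact` covers logs or snippets that might still include it. Spend limits and key rotation, as suggested above, still apply.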
no, i want to hear the best argument why people are using it. not going to change my mind on using it since i am not on twitter. willing to change my mind of its usefulness in terms of capability though. access to real time twitter data does seem like a moat
real time twitter data is an oxymoron
close to real time twitter data****
reminder we're going to be watching this in a bit for anyone that wants to join! https://discord.gg/Vk7QXKXf?event=1377683812024189068
thanks for sharing! will tune in as well
like I said, twitter is a thing of the past. real time x data is possible, but twitter not
Oh Boy, can't wait to pay over 200$ a month for restricted o3 pro access now
Where did you hear this from btw.
what is ur opinion on gpt-4-0314-32k
what is this server
how do you not know
so goldmane is gonna be ga 2.5 pro it would seem
i just need it in aistudio + raw thoughts 🤩
ong
man
give me that model
😭
when do you guys think it'll be released
GA taking too long bruh
next month
no shi
anyone guess what model made this
🤣
I mean like when next month
it is so far one of the worst ones
probably soon enough since they removed redsword
random model
send in dms
the invite
ok
0% formatting
goldmane codes beautifully
someone gotta talk about that
did you guys see redsword and goldmane code? 😭
ts is a work of art
the gradient indents are so beautiful
I think we can distinguish between Gemini-flash and Gemini-pro because Pro accurately remembers chapter titles of One Piece, but Flash doesn't. As a result of this knowledge test, goldmane is identified as Gemini-pro.
we do know for a fact it's Gemini 2.5 pro tho there's no need to test it
just waiting for Logan to say "Gemini"
yeah
flash has sauce
but it does know less though
something happened between flash and flash lite
i noticed
wonder how they're going to scale up the diffusion model
what
that's an unnecessary distinction here
more loads, larger model, maintaining efficiency
yeah well I don't
im curious about this. if u can answer this, do you see it ever replacing gemini/being the main thing? (diffusion)
Google prolly doesn't see it as replacing, just a new strong route
but I don't believe they dont know what theyre going to do with it
yeah, but id like to know their opinion on it
I would think theyre planning to integrate it with search
or large text updates
and not necessarily discrete generation
This means that if it answers 'I am a large language model, trained by Google.' and correctly answers the chapter title, it is 'goldmane' in almost all cases. This is solely how to continue longer chats with goldmane on new lmarena, as far as I know.
I forgot but is o3 parameter identical to non-reasoners like 4o?
yes
same
sydney chatbot
Im not interested in building ai if i cant build a gpt-4-0314 at home
I will need datacenter
give a single example
I want to use it
like, the specific chapter
chapter1117 'mo'
@balmy mist flowith corrupted some of my stuff
based
adhd is a scam to get kids on meth
the medicine they give kids with adhd
is brain rotting
my brother went schizophrenic off it
im telling you rn dont take Adderall
the difference is Adderall is addictive
wut lol
you really shouldnt rely on Adderall to function
you should look for other remedies
teas and stuff
ai is getting really good at study guides as well
look up NotebookLM
a lot of people take Adderall for school
i mean around me at least
there are a lot of kids who only take the medicine to focus in school
thats a symptom of a larger problem tbh
trash courses
our schooling is just trash
wym
Have you seen Perplexity Labs?
https://www.rxddit.com/r/perplexity_ai/comments/1kypixi/introducing_perplexity_labs/
Today we're launching Perplexity Labs.
Labs is for your more complex tasks. It's like having an entire team at your disposal.
Build anything from analytical reports and presentations to dynamic dashboards. Now available for all Pro users.
While Deep Research remains the fastest way to get comprehensive answers to in-depth questions, Labs ...
oh yea for sure
but as a first generation US citizen imma tell u one thing
relying on anything to function is bad
that was my point
not that it's not a real disorder
its just under researched
i was watching a video from Neil deGrasse Tyson, and yk how the common saying is "we only use 20% of our brain" etc
thats not the truth
we only know what 20% of it does
ain a flex bro 😭
liver gon be fried by the time u older
dont hate the messenger
A person with mild ADHD would also live a better life if they took amphetamines, I guess?
no
they're trying to use AI to create medicines genetically tailored to ur dna
so instead of getting pills with side effects you'd have drugs tailored for you
Clustered regularly interspaced short palindromic repeat (CRISPR)-based genome editing (GED) technologies have unlocked exciting possibilities for understand...
Just as Industrial Revolution-era factory builders developed machines to mass-manufacture drugs once ground by hand, today’s pharmaceutical companies are turning to artificial intelligence (AI) to both speed and smarten the work of clinical development. AI could assist pharma companies in getting medicines to market faster. AI today not only d...
welcome to the future lil bro
for sure
US government is tryna use it to end HIV
etc
wow it got auto modded
for sure
yeah we got gpt-4o
Joe Rogan Experience #1904
thats just the ai we have now
companies 100% have gatekept ais
like google has a self learning huge new ai
I'm being so deadass
lol
thinks quickly
i doubt its deepthink
ye
we know what it is already like we keep arguing about this lol
ong
seriously though
wild,
goldmane is actually so good
it's so smart
nah like actually
I know it's some shi to compare the models
I love playing with o3
but goldmane is GOOD
did u see this btw? 😭 https://discuss.ai.google.dev/t/massive-regression-detailed-gemini-thinking-process-vanished-from-ai-studio/83916/84
Hi everyone Thank you for your notes I am a PM on the Gemini API. Alongside Logan Kilpatrick and Vishal Dharmadhikari, we have a lot of Googlers who really care about listening to you, responding to your feedback and taking your suggestions on board. We acknowledge that sometimes we have taken time to respond which can come across as radio sile...
rip raw thoughts
nah it's still up in the air
On summaries, we have heard a lot of valid feedback. We understand this is a different experience from the raw thoughts previously available in AI Studio. Sometimes product teams have to weigh a lot of pros and cons to come to a specific decision. This is one of those times. Please work with us and help us in getting summaries to a point where they have just the right amount of detail that you need. You are our valued and needed collaborators in this. In the meantime, we will keep listening to your feedback here or DM @shresbm or @vish_owl or @OfficialLoganK on X.
based on that it basically confirms its a competitive decision i guess
it was implied before but that basically confirms it
ye it's still up in the air
thats a generous interpretation of it imo
i hope so
it's gonna be hard to get a model that can select and identify those wording schemes and what the model is anticipating, the route it intends and it says it's going towards, the key words it's relying on, the aha moments to take advantage of
it's the little things and that's going to be so hard for them
but that's accepting the premise of distillation, I disagree this is possible in AI studio, and summaries should only exist in api
yeah summarization inherently dilutes the signal
tbh it wont prevent sophisticated actors, openai seemingly has a lot of protections about it and even then its not flawless. google has even less
it makes the user experience more annoying generally
ye, there's really a lot of reasons summary is just a fleeting decision
rather than actually substantiated
you can still trivially leak the cot i guess, so ill be doing that if i need to understand exactly what the model is doing at times
its just that its potentially unneeded degradation
This can be seen as a cheating behavior. The so-called thought summary and the answer itself may have no direct correlation (the model is not open source, and the correlation between the thought chain and the answer cannot be verified). Even if two models are used, one for thinking (and generating summaries) and one for answering, users cannot detect it. This makes the model lose reputation and trust (even if it is strong). Using security and preventing distillation as an excuse does not seem like something a super large enterprise would do. It is stingy.😏
what is goldmane?
agi
Let’s be real: we will all get used to it.
Big companies might get access, we might not.
agi
omg its coming
btw 2.5 pro 0506 knows this too
Yes. So, strictly speaking, this isn't meant to find out 'goldmane'; rather, it's the way to exclude Flash models.
Interesting behavior from the Gemini model Goldmane

When prompting the Goldmane model to generate design concepts or plans, it often attempts to include multiple images in its response.

This behavior is not present in the Redsword model, nor have I seen it in other Gemini models.
Maybe the fact that it tries to cite images could also be a way to tell it apart, but I'm not certain yet.
oh
it's just a Chinese model
What makes u think that?
Because it revealed its developer to me
how do we do a poll here
plus icon -> create poll
IMO whichever one releases second will probably be stronger lol
oh baidu
yes recency bias lol
on that note, I've been wondering what folsom is
but yeah baidu and bytedance have been active lately
yeah
well, they tend to happen at the same time
We will never see redsword again
also, wasn't goldmane the worse one
redsword was removed
sure, i guess to some extent, it wont be dominating everything, but i do believe deepthink will edge in math/code a bit more, usamo at >40% is very impressive ngl
apparently not since it won over redsword i think
"best" seems a tad subjective lol
they'll probably both have a case for being the best, I think
what is a strawberry
I keep hearing that term
my issue here is why would an actual google engineer be here to actually talk about insider info? what is there to be gain, other than attention and losing ur cushy job lol
nah hes just a massive google fan
aren't we all
i doubt anyone cares that much
Hello, where is the price / score chart in the new UI?
This (Price Analysis):
It is the most useful chart for me so far.
it might not be added yet in the new site idk
Ok, thanks for the info 👍
Yeah, sorry to say it isn't currently on the regular site
No need to be sorry. I know that the art of creating software is always an art of sacrifices.
goldmane was better than redsword in most of my tests
i havent used the model yet lmao i just read the chat 🤣
there were people saying both things i dont know
hi
ion know bro, I've been accurate about basically every assessment ive projected here
which isn't much projections at all
but still
hello 
I KNOW models
do u know that gpt-4-0314 is agi
gpt 40 is agi
gpt 4 is agi
then we go to a more recent model gpt 4o and we're back to narrow ai ☹️
no comment
Opus 4 just took the top spot on SimpleBench
is opus in the lmarena direct chat thinking or not?
its good
although the radar chart is a bit broken visually
its def better than whatever they were providing previously
deep research feature or wtvr...
just about what I expected tbh
also I don't think opus 4 or sonnet 4 nonthinking are going to be much higher than 3.7 nonthinking
real
these tweets about o3 pro are making me 🥵
@deep adder why can't i paste image into claude code anymore
it used to work
i guess they disabled it :/
or my settings is fcked
@misty star Disappointment
claude code is a fcking beast
ya not ai related, if they can do, why can i post a xi pic
*cant
oh hell naw
wait i was using haiku?
tf
lmao, i was running multi agents
i think its cause i ran above limits
it defaults to haiku
i was wondering why i was getting shxtty results
Is the "prompt to best model" feature gone?
I'm on there and went to "prompt-specific leaderboard", put in the prompt, and it doesn't load anything after I press send
I even tried https://github.com/lmarena/p2l and went to "Try on Chatbot Arena at the Prompt-to-Leaderboard tab!" and still nothing
like cap said, it isn't a part of the current site atm, I'm going to flag to the team regarding p2l on the legacy site. sorry for the inconvenience!
its okay! thanks
and not a big deal, i'm sure i'm the only person who wants it enough to try it on the legacy site lmao
nah
it's actually pretty fun
why i dont have this feature in my chatgpt logged account?
it shows off when logged out
Is the price analysis chart planned in the new UI?
Flux Kontext is good as hell
fluxsydney
That’s tbd, but good to know something you’d like to see brought to current site
I don't think it is, but you can kinda sorta trick it by adding <thinking> at the end of your prompt
native thinking:
with a prompt, thinking 'disabled':
the thing with Opus is that it usually doesn't reason for long either way
write me a poem that doesn't rhyme. <thinking>
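A tiny sketch of the trick described above: appending `<thinking>` to the end of a prompt. The helper name is made up for illustration, and whether a given model actually responds by reasoning first is not guaranteed; this only builds the prompt string.

```python
def with_thinking_suffix(prompt: str, tag: str = "<thinking>") -> str:
    """Append the tag to the prompt unless the prompt already ends with it."""
    stripped = prompt.rstrip()
    if stripped.endswith(tag):
        return stripped
    return f"{stripped} {tag}"


print(with_thinking_suffix("write me a poem that doesn't rhyme."))
# prints: write me a poem that doesn't rhyme. <thinking>
```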
prompt-to-leaderboard is working again 
New deepseek R1 score 10% higher on SimpleBench wow
the new deepseek is actually crazy
its the closest model to o3 in terms of formatting
they are trying to mimic that
pretty sure its trained on o3 outputs too
i hate o3 formatting
gemini is like the biggest yapper
it doesnt write with emojis nor arrows
but its packed with so much knowledge
so we got the formatting + knowledge from both models
https://eqbench.com/creative_writing.html (see new r1 and click the slop metric)
for creative writing probably personally
but eqbench's creative writing leaderboard is a useful metric
thats on creative writing
I like yapping models
↑
still the slop metric is somewhat useful to determine provenance
if u understand how its calculated
opus writes so great scenes
opus my beloved
It is lol
My metric is different though. I jailbreak and unlock the model first, then I judge its full underlying ability
Opus is so far ahead its in a league of its own
Like a fundamentally different class of performance, I’ve struggled to step down even to sonnet after using it
For creative writing I mean
opus is built diff
He finally added Opus as well 🧐
I think Opus is easily the biggest available reasoning model right now tbh. That's not to say that it is the best overall since it clearly isn't, but there are things it's gonna take the lead at
omfg i have it
craig smart
7mins for news is crazy tho
i thought u was trolling but dayum, o1 pro in the api boys
*o3 pro
i knowwwwwww
Lots of things I’d say its gotta take the lead on at least half of the things people do with LLM’s
you got pro subscription?
yes
no its a free account
its free, i just had to splurge $200
o3 pro is super long tho, way longer than the old o1 pro
what can we do without you?
first one that will have o3 pro access in this server
what a privilege
😖
ok lemme put ur berberine prompt in
not officially
I hate this benchmark so much
😭
whoever made it has no idea how to judge + the model capabilities to judge
look at the bias control clarifications lmao
Yeah I'm seeing this also, thank you!
Sorry for the missing models everyone! Our team is looking into it
Claude 4 Opus was having problems too
fr
Yeah I've been hearing the same as well, altho I can't repro because I can't access models 😭
sorry for making the models unavailable
it’s probably API fault
press ctrl w I found a way
which one wins
10
15
2
o3 pro
oh my goodness, o3 pro is no joke
@small haven you taking requests?
ok
o3 surprises me by how meh it is at correctly formatting realistic wikipedia articles so try this:
"Write full Wikitext for a very realistic Wikipedia article for the 2028 Republican primaries, after the primary has finished."
queued
fix lmarena
Anyone else having this issue? Was working fine 5 mins ago now all my chats are gone
you got permanently banned
i get connection failed error
Should've just scrolled up lol
Really sorry about that! Team is looking into a lot of widespread issues atm.
Bruh you guys give me free access to all the best models with ease. It working 3% of the time would be a blessing
well you have to show all your prompts to the world
Ask AI to fix the issues. Problem solved 👍
I shoulda thought of that! On it!
It means you got banned
Lol not true!! ^
💀💀
interesting
What model should they release next?
it went with a scenario i think is not particularly likely but all the same interesting to read
gpt-4-0314
unfortunately it looks like formatting is only a little better than o3's attempt
nonexistent template
I don’t see why they haven't added it yet
skipped out a bunch 👎 lazy
damn
openai not giving lmarena gpt-4-0314 access 😔
if 2028 is tim scott, im eating my shorts
i think approximately zero of these people would decline to run lmao
claude 4 opus' prediction i think makes more sense, it went with vance
tbh vance seems like a sequential choice, pretty logical, but tim scott as the answer is diff and seems like it thought a bit for it
i presume somewhere in the CoT it was like
"this is likely to be a very competitive primary without trump, and like in 2016 the winner tends to be hard to predict, so..."
which i suppose makes sense
shame they don't expose much of the CoT though
slight gestures towards
✅ Avoid political and religious content.
I would vote for gpt-4-0314 if it were running for president ngl
its ai related tho lol
gemini product or aistudio? (or both)
it's interpreting my instructions and applying them in a much better way than before
product/the app
kk
opus bad at translation compared to gemini 2.5 pro D:
oh ye they fixed the formatting issues on mobile too
sonnet imo is better for a lot of translation tasks
strangely enough
but 2.5 pro is a god at translation
same with 4o
nice one
although less nuance when pushed
unfortunately i cant tell if its any different from o1 pro
ask if it's agi
leo been playing roblox all day
o3 pro is less verbose
o3 is much less verbose in general tbh
you guys know you could code/orchestrate your own deepthink right
then that's something really to hope for
too lazy, i think im gonna put $250 in
bro acting like we're not the consumers 😭
it’s pretty easy
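One way to read "orchestrate your own deepthink" is plain parallel sampling with a majority vote over the answers. The sketch below assumes that reading; it is not how Google's Deep Think works (nothing in this chat says), and `ask` is a hypothetical stand-in for whatever function calls your model.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def majority_vote(ask: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n answers in parallel and return the most common one.

    `ask` is a hypothetical callable that sends the prompt to a model and
    returns its final answer as text. Answers are compared after trimming
    whitespace and lowercasing, which only makes sense for short answers.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(ask, [prompt] * n))
    normalized = [answer.strip().lower() for answer in answers]
    return Counter(normalized).most_common(1)[0][0]
```

Voting like this only helps when the answers are short and comparable (a number, a multiple-choice letter); for long-form output you would need a separate judging step, which this sketch leaves out.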
THATS WHY HES THE GOAT
sorry if this has been asked to death already but is repochat planned for the new ui?
generally I won't be able to say if a specific feature is or isn't incoming; however, it's something we're putting thought into
understood, thank you for clarifying 🙏
o3 pro uses big fat arrows 😮
Is anyone else getting the connection error?
New Gemini 2.5 pro checkpoint in a few days
Probably goldmane?
Where’d Tuesday come from?
😄
Well, I heard from someone else who saw it leaked from semi-public info too so makes sense
LMArena staff sandbagging the leaderboard update until Tuesday would be 💀
Just like Grok 3…
In this case, the other source I think is from something similar to the feature flags leak of Claude 4
100%.
I don't know anything else close to it that so distinctly compares price/performance of LLMs
it was working for a few secs just now but now its back down
they should call it o2 pro instead for maximum confusion
When is the last time someone working at OAI said they were still working on o3 pro?
Looks like April 16
yesterday
Oh where?
theres no way ppl still think o3 pro is fake 😭
o3 pro vs baseline
ok so u saying o3 pro but integrated into gpt 5 lol
i mean its still o3 pro
just a router
It sounds like o3 pro is coming out eventually
i mean yes if u scroll up a bit
I know. I'm just reaffirming based on the posts above
99.99% confidence band lol
site should be up and working again btw 👍
gpt 5 release might coincide/be somewhat correlated with gpt 4.5 being shut down on the api too maybe
which is in july
was it worth the wait lmao?
yes
gpt-5-preview-0314
? i think its just going to explode more? bc majority is just using 4o as default
true
they spent a lot of time on 4o (mid train). whilst 4.1 mini/etc are fresh. it seems 4o is gonna be used for a while
(they talked about this in a podcast btw about the mid-train/fresh train)
yeah
yea, no one cares about the rest lol
wb grok 3.5
even when elon ma has a black eye
i feel like grok 3.5 bigbrain is just going to at most match o3
u rlly believe that?
lmao
Is bigbrain some meme?
i can not vouch for this
its xai's version of o3 pro/deepthink
they named it bigbrain
4o is spitting out images embedded into the chat
man where is goldmane
cool that's when my big ass TV arrives
ion believe that tbh
wtf is ion
I don't
hey after the site was down are you now seeing your chat history or are you NOT seeing your chat history?
goldmane is an explanation god
it's crazy intelligent and it's subject to being influenced more now
less dogmatic when it comes to uncertain things at first and brute forces conclusions to be more certain
what it says
it's subject to being influenced more now
no I mean it's different from 0506
it thinks for a while and sometimes doesn't think at all
which in the cases It was in
was surprisingly appropriate
Is it less verbose than 0506? It's supposed to be
Or at least that was highly requested
much less
I'm excited. I haven't got around to trying it yet
What is the bug?
I think the not thinking part is the bug like he literally just said
It just doesn't think before the reply
ok after playing with o3 pro for a few hours, its pure insanity
Long conversations
Since they exclude the previous thoughts iirc in prev turns, at some point the model just doesn't have the tendency to do it. Weirdly they don't prefill the thinking delimiter so it can just do that
Might be fixed since their logic will be different with the next update I think since they're adding the toggle
Yeah I don't think so
The 'fix' is to ask it to think it's just annoying to do so
Sometimes it won't work and you have to rephrase it in weird ways to get it to think etc
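To make the mechanism described above concrete, here is a rough sketch: earlier thoughts are stripped from the history, and the next assistant turn is optionally prefilled with the thinking delimiter so the model has to start by thinking. The `<think>` delimiters, the plain-text transcript format, and the function names are all assumptions for illustration; the real serving logic is not public.

```python
import re

# Hypothetical delimiters; the real ones are not known from this chat.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

_THOUGHT = re.compile(
    re.escape(THINK_OPEN) + r".*?" + re.escape(THINK_CLOSE), re.DOTALL
)


def strip_thoughts(reply: str) -> str:
    """Remove every thinking block from an assistant reply."""
    return _THOUGHT.sub("", reply).strip()


def build_context(history: list[tuple[str, str]], prefill: bool = True) -> str:
    """Render (role, text) turns as a transcript without past thoughts.

    With prefill=True the final assistant turn starts with the opening
    delimiter, which is the step the message above says is missing.
    """
    lines = []
    for role, text in history:
        if role == "assistant":
            text = strip_thoughts(text)
        lines.append(f"{role}: {text}")
    lines.append("assistant: " + (THINK_OPEN if prefill else ""))
    return "\n".join(lines)
```

Without the prefill, a long history of thought-free assistant turns is all the model sees, which matches the explanation given for why it stops thinking in long conversations.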
I looked at the latency for the first token so it's not visual
Okay yeah the flux model sucks ass
It could be that but it mostly starts happening in long conversations, and I think the mechanism is as above. It's been a thing since flash thinking exp
Also it can think twice or get into thinking loops (multiple thinking blocks per reply) so I think it's a model thing
For me the thinking twice thing/etc happens sometimes when I ask it to think
#general message here's an instance of it doing it (fyi I was wrong here about it being a special token)
whatever X-preview is sucks lol
thank god
after making 05/06 the ONLY option with no option to still use 03/25, I was half worried the thinking spam was some kind of intentional thing to pump out output tokens.....
yea makes sense
I'm hoping that whatever the "new research stuff" got into the new 2.5 flash is in goldmane too
new 2.5 flash is absolutely amazing
just not quite smart enough
but it hits so far above its weight it's insane
I saw some google researcher on twitter saying something like "a ton of new research ideas (which I can't talk about) were successful and got into 2.5 flash", so it's got my hopes up 😄
Logan confirmed it's coming in the next few weeks
He said that a few days earlier
They removed redsword I assume release is imminent
Weird that he said a couple weeks on May 28th
Maybe Logan is talking about an even later revision
But this update is substantial
It would be strange
I mean, in my eyes it's better than 0325 all around tbh
0506 still had the same capability but you had to prompt it more
goldmane simply just does it
tbh if Logan said a couple weeks on May 28th, I trust him more than myself
btw I believe 2.5 flash was an exception
they didn't actually serve it
as far as I know, I don't use vertex
Why remove redsword this early though
difference could've been major
Maybe but more time couldn't have hurt
against what model lol
goldmane vs 0506
idea is that where 0506 is worse than 0325, new gemini will be better
FLUX AI IS HAVING PROBLEMS
curious why you're still on beta.lmarena.ai and why you're using an ai built for image to image generation as a textual ai
perhaps it's because it didn't force you to attach anything
I use it to edit images like this
well you didn't attach an image there did you
probably why you're getting problems
it works fine for me with an image
I tried and it’s still not working
then that's odd
Claude is making these graphs and Cursor isn't great at displaying them 😄
Flux kontext amazing
https://fixupx.com/AngryTomtweets/status/1928509452493246911
Today @bfl_ml dropped FLUX.1 Kontext, a new multimodal model that understands both image and text inputs.

It's now available in @LTXStudio for you to try!

Try here: ltx.studio
lmao why does 2.5 flash thinking identify as Claude 4 sonnet, with all of the up to date Claude information on the arena
it says the model string too, like the regular Claude models
I've also been seeing different models act in a way that don't align with their personality
there's definitely a bug going on rn
where it's showing the wrong model name
or its routing to a different model
even the other models, like "Stephen" or "x-preview" are doing it
Claude is sometimes identifying as a Google model, too
yo this is DEFINITELY happening
Is it better than opus 4
opus 4 isn't a very high standard so ye ofc
o3 pro currently has a 64k context window 😦
if deepthink matches o3 pro, but offers 1m context window, google wins
will you switch to team google if deepthink does?
Someone on twitter posted this interesting table
yes, im not a dickrider for any, i want to use the most frontier model
ah,so goldmane is a Gemini version?
yes they both are and they are both very good
so they say Goldmane will be released in the coming days?
that seems v odd
i can't remember the last time a google (or anthropic or oai for that matter) model identified itself as a model from a different lab.. and it happened repeatedly?
dragontail was better than claybrook, yet they picked claybrook.
If it identifies as Claude 4 sonnet at the very least the system prompts are all mixed up. Or the model names are switched/messed up or it's both
oh true...
that would fs be the most obvious / likely explanation
There were issues with lmarena earlier this is probably related
anyone know what Ilya Sutskever has been up to at his new startup?
Raising money
Using Google TPUs for research
tbh I don't expect any company starting so late to become relevant, although I also didn't expect DeepSeek or xAI, so take my word with a grain of salt
me when gemini 2.5 pro
no gemini 2.5 pro is king
i learned from u
no
u worshipped gemini 2.5 pro in may still 🥰
gemini 2.5 pro is cancer
it is though
Elaborate and you not going to wait for deep think?
i am going to wait for deep think
I still see nothing better in the Direct Chat list. Even the May version of Gemini is a viceroy compared to others
Opus is good but so laconic
somehow modern LLMs tend toward giving short, token-hoarding replies
Did anyone hear of https://sambanova.ai before?
You think SSI will be relevant?
I heard their context window is too small, making them almost useless in practice
I remember Ilya saying their first product would be safe ASI.
We won't see them until ASI
Wouldn't they just get outpaced by Google if that's their approach?
That's why I don't think it's particularly likely that new entrants will catch up
I think DeepSeek will remain relevant because it's based in China, and the US may eventually ban China from using US models
I don't think it will surpass the top US model in capability
I'm not sure if every model Deepseek releases is only slightly worse than SOTA. Is this a coincidence, or is it because the distilled data they used fundamentally limits their upper bound?
which is great
and it is wordy when you ask
I prefer long replies. Current Gemini is all about bullet points, which is readable but too abrupt
Ssi isn't trying to make money or sell stuff though. Why are you comparing these AI labs with SSI? It seems more like a research facility
This seems probably false. Agree that they are competing on the same pool of resources when it comes to gpus though but the goal for SSI does not seem to be AGI
Very cool. Have not heard of blue sky research but would not be surprised given Google deepmind as institutional.
Given how hard it was to get even 20% of the compute at OpenAI for safety-specific testing, which led to the creation of Anthropic, I don't see the market incentives promoting safety work for its own sake, like SSI.
Would be glad to be proved wrong tho!
Wish I knew about that DeepSeek release before I bought the Ultra plan, lol 😭. Agree that AI labs are doing safety work, which is definitely promising!
I agree with this take. It just seems like Google is adding the graph of thoughts prompting technique into the internal model to create deepthink. This kind of seems similar to how AI labs put chain of thought prompting into their models to develop their reasoning models.
best ai for roblox studio?
it would be a remarkable achievement if they could replicate the huge elo improvement that Alphago achieved by using mcts over raw DNN in LLM
in pursuit of technological dominance, tech giants and capitalists always prioritize capability over safety. this race mentality remains unchanged. just like cold war nukes.
They're terrified of safety incidents though
yea
could be
but this is crazy tbh
when is the next leaderboard update? there are 6 models in the arena not yet in the leaderboard 🥴
(Two Claude 4, new R1, grok 3 mini,
qwen 3 no think, glm 4 air)
They added it to the battle arena recently
@deep adder
Now in battle arena
depends on how long a while is
couple days
Nope
lmao
i thought the next grok model on lmarena will be the 3.5 ver
ig we just have to wait a little longer
at https://lmarena.ai/ you can use ai through: battle, side-by-side, and direct chat
Thank you
Anyone have torrentleech site access? I need invitation
Would be cool if the Web Arena had a Svelte mode. Curious to see if the rankings would stay the same.
So no r2 for the foreseeable future?
How long will we be stuck on R1?
btw
I'm not getting goldmane NEARLY as much as in the legacy website
which is super strange tbh
ok ya officially, who tf cares
its already here
yes
o3 pro + claude code is the meta
yo but imagine deepthink matches o3 pro, right.. but with 1m context window, that would go insane
ngl this WOULD go insane
no
but hold on I want to know
deepthink isn't going to be 2.5 pro 0506
with more thinking
it's going to be goldmane lvl probably
with parallel
its official name is literally gemini 2.5 pro + deep think
which could be crazier
yeah?
thats why i keep myself busy
Any idea what tools to try for deep research sites or scraping
to find me items matching specs
Chatgpt is fabricating
u can batch tasks in parallel in claude code, amazing
yup
u can run as many as u want, but for my case, 3 was enough
add coffee
try quadruple espressos
How can I continue a conversation if you keep standing like that?
make a new chat
GPT-5 July confirmed ✅
you can suspect it via the model deprecations
but ion think it's absolutely confirmed
is Stephen a different version of R1? it keeps answering in Chinese which is a pretty obvious giveaway, but there’s a May version of R1 not codenamed in the arena
unless it’s an undisclosed version of Qwen
it's not
they're just random small Chinese models
same as X preview
Gemini 2.7 soon
get it right
Knowing how models are named, it honestly wouldn't shock me
ion think Google would do that tbh
I don't either
they have some sort of philosophy of design
Extremely good at image consistency
How?
openai style:
const solve = async (prompt) => {
  // Sample 9 candidate answers in parallel.
  const results = await Promise.all(Array.from({length: 9}, () => generate(prompt)));
  // Ask the model to merge the candidates into one final answer and return that text.
  return generate(
    `We tried to figure out the answer to the prompt <prompt>${prompt}</prompt> 9 times. Write a final answer incorporating the best aspects from all of these: <answers>${results.join("\n\n---\n\n")}</answers>`
  );
}
open source style:
const solve = async (prompt) => {
return generate(`${prompt}
Note: whenever you are about to end thinking, don't. Instead, first write out what you were about to respond with, then critique it in depth, then keep thinking. You are only allowed to end thinking once this has happened 5 times.`);
}
tree of thought style:
const solve = async (prompt, decisions = []) => {
  // Ask for either several candidate paths (separated by ---) or a single final answer.
  const trials = (await generate(`You are in the process of solving the prompt <prompt>${prompt}</prompt>. You've made these decisions so far: <decisions>${decisions.join("\n\n---\n\n")}</decisions> You now need to either list some possible paths you can take (separated with the separator ---) or only list the final answer.`)).split("---").map(x => x.trim()).filter(Boolean);
  if (trials.length > 1) {
    // Explore every path in parallel. No depth cap: recursion ends only when the model returns a single answer.
    const results = await Promise.all(trials.map(d => solve(prompt, [...decisions, d])));
    const best = await generate(`You are in the process of solving the prompt <prompt>${prompt}</prompt>. You've made the decisions <decisions>${decisions.join("\n\n---\n\n")}</decisions> so far. Now, you made some more decisions, resulting in these results: <results>${results.join("\n\n---\n\n")}</results> Write your final answer to this prompt, combining the best aspects of each.`);
    return best;
  }
  return trials[0] ?? "";
}
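For anyone who wants to run something like this: a minimal sketch of the pick-one flavour of parallel sampling, where a judge call returns an index instead of a merged answer. `fakeGenerate` is a made-up offline stub, not a real API, and `bestOfN` is just an illustrative name; swap in whatever model call you actually use.

```javascript
// Hypothetical stand-in for a model call: NOT a real API.
// Returns canned text so the sketch runs offline.
const fakeGenerate = async (prompt) => {
  if (prompt.startsWith("JUDGE")) return "2"; // pretend the judge picks answer #2
  return `answer to: ${prompt}`;
};

// Sample n answers in parallel, then ask the model to pick one by index.
const bestOfN = async (prompt, n = 9, generate = fakeGenerate) => {
  const results = await Promise.all(
    Array.from({ length: n }, () => generate(prompt))
  );
  const numbered = results.map((r, i) => `[${i}] ${r}`).join("\n\n---\n\n");
  const reply = await generate(
    `JUDGE: for the prompt <prompt>${prompt}</prompt>, reply with only the index of the best answer: <answers>${numbered}</answers>`
  );
  const index = Number.parseInt(reply, 10);
  // Fall back to the first sample if the judge reply is not a valid index.
  const safe = Number.isInteger(index) && index >= 0 && index < n ? index : 0;
  return { index: safe, answer: results[safe] };
};
```

The fallback matters in practice, since a judge reply is free text and may not parse as a number at all.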
After opus my reaction to AI news has been
🫤 😕 😑 😐
Like there is nothing interesting happening
me after the sota isn't beaten for 11 days:
idk its said that logan is responsible for accelerating gemini's availability and ai studio dev
bahahahaha
they alrdy nerfed o3 pro great
I mean it is kind of magical tbh
like, could you imagine being able to speak to something that isn't human but can actually, coherently, and with extraordinary articulation
talk to you about things that are extremely implicit, beyond the syntax it's built off of
and so reminiscent of human thoughts
and then now it can access and see your screen, and think about what it's looking at with the necessary context, to actually figure it out
rather than a narrow program that key logs then executes that in repetition
all in the span of a year
LLMs & Models:
DeepSeek: R1-0528 released: 64K context, efficient quantization.
OpenAI: Deprecating GPT-4 32k for GPT-4o, chat log & censorship concerns.
Anthropic: Claude Opus safety report, mechanistic interpretability tools.
Google: Veo 3 video model, SignGemma for sign language. Gemini 2.5 Pro: large context, UI/creative limits.
Mistral: Agents API for orchestration.
AI21: Jamba model reception good, details limited.
XAI: $300M for Grok on Telegram, skepticism remains.
Agents & Tools:
Perplexity: Labs for multi-tool workflows, new features.
LlamaIndex: Agents in Finance workshop, advanced RAG.
VerbalCodeAI: AI terminal tool for code analysis.
Latent Space: Collab on autonomous engineers.
Infra & Hardware:
Unsloth: Optimized DeepSeek models for limited hardware.
AMD: Max+ 365 GPU (128GB VRAM).
NVIDIA: Blackwell optimizations for DeepSeek R1.
Open Source:
Ollama: Naming issues, SDK instabilities.
Hugging Face: Diffusers enhanced, LightEval v0.10.
Challenges:
Cursor: Backlash over slow pool removal.
Manus: Instability, network issues.
Nomic.ai: Cloud security concerns.
New:
Black Forest Labs: New AI lab, Flux-1-Kontext image model.
Factory AI: Autonomous software engineers.
Insights:
Mary Meeker: AI industry report: accelerating adoption.
Microsoft: Early Sora API access.
Cohere: AI automation gains.
Gradio: MCP hackathon.
gpt
I think they tried o4 pro
ayo!!!!
TO ALL OUTSIDE VIBE CODERS!!!!
u know which laptops are best for vibe coding with our lovely cc/cm in parks?
I think "vibe-coding" does not impact your laptop choice anyhow.
It all depends on what you code. Web - buy something nice to hold and use. Heavy workload (apps, data processing) - buy something powerful, even if ugly.
so any website would be not heavy workload?
even amazon?
Idk I haven't seen a website that requires a lot of cores or GPU
And amazon website is 💩 wdym?
hmmm okey... but like lets go one step earlier..
how to vibe code with claude code on ANDROID?
do i even need a laptop to use claude code on ubuntu? or is screensharing from pc to outside device enough?
WE DONT KNOW
stuff like laptop and tablet is an issue cause it probably will be impossible to use without a table high enough
i am feeding crows at graveyard. this takes time. like 2 hours daily.
so i wanna code while at it
i could bring a mobile table with me if laptop/tablet
slept 8 hours R.
claude code is the best agentic vibe coder out there
ok then why did my lower arms hurt a lot 10 years ago after 5h usage of macbook??
weighted 1.6kg
i can watch websites on android yes.
just websites and perhaps games up to phaser.js level
no way
like how? u would need to rotate your head a lot down
thats unnatural and thus will lead to pain
If I didn't have a full time job and had the energy and will to code in my free time, I'd probably switch from a Macbook Pro to a Framework running Arch Linux
anyone recommend good libraries for running multiagent llm systems?
The Linux terminal feels like putting my hand on the third rail of the universe, and no amount of money, build quality, or brand value can give me that high
Guys currently which llm has the least hallucinations is it Gemini 2.5?
Lmarena?
What abt Gemini 2.5
It has Been 1 in leader board
can’t wait until Claude opus is open source
it is Gemini 2.5, the reason why gpt 4.5 is said to not have much hallucinations is because it doesn't assert much

going to be listening to lofi most of the day in #1340554757827461215 for anyone interested
can anyone tell me best ai for lua coding?
you love lofi so much 😭
anything new?
Dumb question that’s probably been answered is the claude 4 opus on leaderboard thinking or base
dork 4.0 agi confirmed
it is the base version
Pretty good score for the base version
there's no such thing as claude base though, both are the same exact model
chat
just different prompt template 😉
And if you add <thinking> at the end of your prompt, from my experience this is gonna be largely the same as thinking natively enabled
same day as o3 pro official release 🥳
I'm thinking they may be fundamentally revising their implementation. Otherwise it makes no sense why it is taking them so long...
Straight forward parallel compute, you could implement that in a short evening lol
apparently o3 pro is already rolling out
and it doesn't need safety testing additionally etc
just recently
i mean i have it alrdy, just need them to officially release, bc they kept tweaking it its kinda annoying
I'm using o4-high early access
seems on par with 4.0 dork
im not even trolling
hes not trolling
ok let me get some proof, sigh
gimme a prompt @ocean vortex
that involves the internet lol
is it with tools or no tools?
we need smth it can't cheat with tools on then... try this:
approximately:
A. 4km eastward
B. 30 km northward
C. >30km away north-westerly
D. <1 km northward
E. >30 km away north-easterly.
F. 5 km+ eastward
G. The glove is exactly where the car was at the time it slipped out.
H. Neither option is correct.
queued
is Claude 4 Opus on the leaderboard the thinking Claude or nonthinking?
neither
seriously though read several messages above lol
pretty sure it does use tools in the backend, just doesnt show anything like o3
it probably does. O3 non-pro early access did use tools as well
What is the source for o3 pro rolling out on Thursday?
will o3 pro be added to the arena?? i assume not bc the model is so expensive
i have it alrdy stealth released 🤷♂️
and thursday is where big releases happen
it continues based off ur account, not mine
yeah ik, but I don't have 3.5 obviously, it's still hilarious why it responded this way. It's just showing "chatgpt" for this chat no model name
let's be honest tho, do u really think 3.5 could have done that
is it even correct thats my q lol
it's D
damnn
I haven't seen any model get this right yet though. So it doesn't mean it's sht. Just that maybe it is not significantly better than normal o3
deepthink next
grok 3.5 on big brain could prolly crack it ngl
wen deepthink sir
yes
im actually curious
yes i believe that
its brian
It's a new person
Anat
I remember Ruth because she made our perks suck. I don't think about Anat much
She's kind of just there
brian gimme deepthink exact date release
he probably doesnt know xd
go into that deepmind office and ask everybody for the date
yeah but idk what you expect from him if u keep asking about it
hes talking quiet guys
so basically end of june
ur not in that slack group? come on bro
i have a genuine question, will deepthink have a 1m context window like the rest of the gemini models ?
theyre releasing two revs in a month?
or naming it ga later
i guess its the latter
wait they can actually afford deepthink on 2m context window? i believe it for regular gemini models
load i guess
im guessing keeping compute for research/training
we need an oai insider in here now
yoooo wen is o4 pro
ok bud
2m context window on deepthink is absolutely going to crush o3 pro, sorry sam
maybe not, but 2m context window is a huge deal
o3 pro currently has 64k
i maxxed it out
128k got timed out
80k~ timed out
58k ish passed
original o1 pro could fit 128k
o1 as well
o3-mini-high too, but now it's limited to 64k
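If you'd rather not find the cutoff by hand, the trial-and-error above is basically a binary search. A rough sketch, assuming a hypothetical `accepts(tokens)` probe that you'd write yourself (send a prompt of about that size, resolve to true if it goes through, false if it times out); the 128000 upper bound and 1000-token step are just placeholder defaults.

```javascript
// Binary search for the largest prompt size (in tokens) that still goes through.
// `accepts(tokens)` is a hypothetical probe supplied by the caller.
// Assumes success is monotonic: if a size fails, every larger size fails too.
const findMaxContext = async (accepts, lo = 0, hi = 128000, step = 1000) => {
  // Invariant: lo is assumed to pass, hi is assumed to fail.
  while (hi - lo > step) {
    const mid = Math.floor((lo + hi) / 2);
    if (await accepts(mid)) lo = mid; // mid passed, search higher
    else hi = mid;                    // mid failed, search lower
  }
  return lo; // largest size known to pass, accurate to within `step` tokens
};
```

This takes about 7 probes for the defaults instead of guessing sizes one by one, though flaky timeouts would break the monotonic assumption.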
yes...
cuz of the huge spike in new members
google has no users, so 2m context window it is
no "loyal" users
😭
You are correct Gemini will not be ChatGPT
4o image gen was probably bigger than 2.5 pro
400m in ai studio? or 400m android users who have no choice to pass by a gemini feature lol
geographic 100% india?
and other third world countries
i believe tides will turn when deepthink release all jokes aside
And you are a shining example @deep adder
craig singh
@patent aspen also when is jules actually going to be GA? would gladly pay for unlimited like codex
No clue
Your politeness is incomparable
wtf
"third world country type" comes off a bit differently haha
Kind of like poors or plebs
It does. I'm just remarking on your word choice
Oh absolutely
More so in the United States among gen Z but yes
Sure
i think thats a you thing 😂
agree
but they still thought the $250 sub was a great idea
Gemini app is atrocious
Normies love ChatGPT vast majority don’t even know there are other models
People (especially rich people) have tended to value materialistic, branded status symbols less over time
Why do you think that?
Interesting. It does seem there was a counter-trend among millennials toward experiences, health, and well-being, although the current trend among younger people does seem to be towards materialism
this IS Gemini usage...
Gemini pro is winning on most benchmarks
Hey guys wanted to know if images generated from lmarena can be used commercially? And if not would editing it help??
Depends on the models I guess but since the results are open source if someone really wants to prove it’s from their model, they can find and prove it’s theirs
Umm currently using flux to generate product photos and then changing the subject with my own and keeping the background. Should I be worried?

