#general
thats it
@deep adder https://x.com/i/communities/1762494276565426592
this group is ripe
try svg's too
give me half of ur x earnings
thanks
should do 0605 vs kingfall, ppl love comparisons
@deep adder
and get ur bluecheck, without it u can't get paid and u get less reach
claude 4 opus thinking and sonnet thinking were added
Here is lmarena score plotted against simple bench score, im using style control off still just to illustrate the difference in models.
claude is "smart" but didn't vibe with lmarena users (no personality). But you can see gemini vibes with users, AND is smart
Fun to see llama 4 maverick in the extreme case, vibed alot of users, but not smart
New Gemini model seems to have regressed slightly on LiveBench
livebench is honestly a lousy benchmark these days
Thanks for the info, that makes sense.
True, I haven't seen the "Agentic Coding" category before though. Seems to mostly match up with expectations, although Claude 3.7 should be a bit higher imo. New Claude 4 models are at the top though.
leave it to the infamously weak IF category and one random new irrelevant category that is clearly the ONLY area where the new gemini model regressed on any bench (SWE style coding)
and then they implement the category so bad that you would rather just use swe bench or aider
(btw plot is not really relevant beyond weighting)
and anyways it is really sus that models score about 70% in a bench where 70% of problems are public
Aider and SWE-Bench are kinda inconsistent with each other. I find SWE-Bench more aligned with my experience on Github Copilot. There's a spreadsheet of results here, although the new Gemini model has not yet been added: https://docs.google.com/spreadsheets/d/1wOUdFCMyY6Nt0AIqF705KN4JKOWgeI4wUGUP60krXXs/edit?usp=sharing
theyve overfitted aider polyglot cus its public on github
swe bench verified is guarded
well he also said some thing about fully autonomous driving
yet here we are
yea ultra
I think it might be 2.5 pro GA
Btw has this Aider test been confirmed to be Goldmane yet?
Damn really? I saw the benchmark from Logan and the result doesn't match so I thought this one might be kingfall😭
nah kingfall is even better
@deep adder
https://www.marktechpost.com/2025/06/04/mistral-ai-introduces-mistral-code-a-customizable-ai-coding-assistant-for-enterprise-workflows/
Mistral Code wow
lechat is very good
Alright then maybe kingfall is 2.5 Ultra or 3.0 after all💀
yes definitely bigger than 2.5 pro in terms of params
Thank you for enlightening me 🙏
New Image model in Arena: imagen-4.0-generate-preview-05-20
We need a bench that ranks benchmarks
craigbench is up there
i have a question
how do you test this theory in lmarena
i feel like there should be little minigames where you have to instruct the ai via prompts to do different interactions to win the game
"if you don't do this, I'll find you and downvote you in all battles in batle mode."
why do most images I generate on chatgpt look cartoonish with a yellow tint? What am I doing wrong?
Why is Gemini so low
Where did you find this? Api studio?
vertex
Getting ready for GA version, I guess
So it will be deprecated once they release GA
Does that mean there will be a new model when it goes GA?
did you make your pfp yourself @nimble trail
Nah I don't. why?
i am going to steal it
lol glad you like my catgirl pfp take care of her well.
i will
Kingfall is impressive. It is stronger in mathematics than Goldmane.
Really curious when they will add it into the arena 👀
What exactly sets the filter off when I write "And people my age are especially rude"?
the filter also doesn't like calling anyone a loser or a shrew,which seems too much censorship for me
I'd love to add this to the lmarena too, simply for fun and education
and it's really a very, very compelling one too:
https://aeris-project.github.io/aeris-chatbox/index.html
@echo aurora
I mean there is nothing really hateful about such statement
I feel that people running this are really too worried that we would all be mass generating hate manifestos
yet the models themselves would stop us before doing that
I think they said that the current version will likely be GA
(Without significant changes I presume)
tbh I think o3 (and preferably high reasoning effort) is the 1 model that is consistently high in almost all coding metrics
then after that probably 2.5Pro, while Claude... it's a much more specialized model with mixed results. Can do exceptionally in select areas but then will underperform somewhere else, it is not a fool proof model or the model suitable for everything IMO
while o4-mini-high is a bit of a cheater model. When it works it's great but it's compromised due to size so when it falls apart it is in the spectacular fashion.
So like o3 - the one to beat. 2.5Pro - runner up. Claude4 - specialized (web dev). o4-mini-high - crunching numbers for repetitive stuff / code refactoring and some coding but not debugging
Yeah if you wouldn’t mind making a post in #1372229840131985540 that’d be a big help. TY @tall summit for mentioning that too.
Could be a reference to this https://www.goodreads.com/book/show/58582405-kingfall
The analogy for “are we the ones training the LLMs or is it the LLMs training us?” I guess…
dunno but I was just testing the new 2.5Pro on svg and I'm really impressed with the depth it managed to achieve:
still some errors obviously, but this is leaps better on depth than any other model I tried
comparing its drawing to most other models there, it's like a gymnasium student vs an elementary school student lol
doesn't get hung up on details, but what it does draw is mostly right with correct positioning
Now I can kinda see how the new version is nr1 in there:
great,bra yanked out instantly becomes an umbrella 🤪
guys o3 pro is coming out
yeah but that's minor details. To give you some context this is what o4-mini-high can do on a good day (big improvement from o3-mini-high):
there's no depth at all or understanding how elements relate to one another
and it all looks drawn by a 5 year old with some tools
I give a thumbs up for your comparison, not for the picture
Models that excel in a specific domain(e.g coding) tend to receive significantly better market reception than those aiming to perform well in all areas(e.g AGI).
yes and that also allows them to make smaller models because they only focus on specific domains more and more and is also less costly to run so the age of artificial stupidity has begun. we will never reach agi 😔
balancing and choosing between "well-rounded" and "top-tier" who knows?
you can always aggregate those smaller highly specialize ones into the single point of contact controller, and call it artificial AGI
sydney was actually agi
huh? 1 + 1 = ?🤪
Yes
I've thought of this, too
Which company do you think will reach AGI first?
9
15
1
Logan posted this if y'all haven't seen it and use aistudio https://www.reddit.com/r/Bard/comments/1l5m88w/the_google_ai_studio_free_tier_isnt_going/?share_id=MOM1ZBwEJz6f_nqvKJKWh
It's always been pretty wild to me how many people have used ai studio over gemini.
Even when I had premium I was still using ai studio
lol i dont know wat to believe now
Google has this unbelievable ability to go away from their good products and focus on things people don't like lol
Creatives are paranoid of AI I’m trying to support using AI as a web search for information since it’s more convenient than googling something and they are flipping out it’s funny
They gave me the poop head role in that discord
I know but using it to gather information is convenient
Oh sap, Opus thinking 16k is being tested in the arena now👍
Anthropic finally got it 😆
I guess they also felt it was unfair to compare their non-reasoning models against all reasoning models in the arena
Many folks mentioned 2.5 Pro as not being available for free in the API, this is in large because we offered it for free in the UI as well so we were giving out double free compute in a world where we have a huge amount of demand. I expect there will continue to be a free tier for many models in the future (though subject to many things like how the model is, how expensive it is to run, etc), and 2.5 Pro will hopefully be back in the free tier (we are exploring ways to do this, lifetime limits, different incentives etc)
seems like they are undecided themselves what to do with 2.5Pro
I saw lots of comments that folks want AI Studio to be part of Google AI Pro and Ultra plans, this is something we will explore, I think it is a cool idea but lots to work out there.
👎
The way I'm reading this they will add some hoops to jump through in the future to use 2.5Pro for free, but it is unlikely to remain as it was (unconditionally free unless you want better rate limits)
The issue they have is that the gemini website is a bureaucratic mess. And there's no natural drive to improve it as it's not even featured on the main google website. While aistudio is just a direct gateway to their ML department so in practice works much better
For their gemini website that is likely convoluted by decisions and remarks from their product owners and whatnot, and is influenced by people who have way less knowledge on how models work. It doesn't have enough traffic to improve by user feedback like chatgpt does either. They are just using it as the facade to please the management lol
it is interesting that livebench, as soon as it updates, regress to around 70%. As if all the other test are benchmaxxed but not the new questions.
hence all that censorship etc. It just needs to look presentable, which is more important than it performing the agentic tasks properly or be good user experience
here's what I mean, first prompt was deliberately extreme:
they nuke everything on refusal
that is not how you do it properly...
the gemini advanced plan on the gemini product has a 100 req per day limit on 2.5 pro, aistudio has unlimited for free 💀
fr
what are gemini subscriptions even about
tricking unsuspecting people so they make a lil profit at least
yeah rn 🤣
a wise man once said: "if it's free, you are the product"
Don't worry google will soon make gemini pro only on API and not on ai studio free chat so good bye Gemini
did you read logan's statement
Yesterday...
how? he just made one 1 hr ago
lol
did you time travel?
1 hour ago
well you get what i mean
clearly you didnt read his new statement when u typed this
Compared opus 4 thinking and goldmane side by side, opus4's answers look like gpt-3.5 ... I'd guess they can only gain up to 10 more elo than the current non-thinking model
@patent bane while true, paying doesn't mean you are not the product.
paying for google is insane when they give you 1 month of google pro for free
I have never spent a single cent for AI since the birth of gpt-4 in 2023
I know the rules, I am the rules
is this a sydney reference
kingfall reference
lol I know tricks
never used kingfall since i missed it
i can use all the gpt models
opus max thinking, isn't available in lmarena
i DID actually know tricks but they patched it after a year 😔
yeah i use it for pirate games and movies
https://app.magai.co/dashboard/chats this site had massive api security skill issue for whole year
had all sota models
yeah i saw the trick
but i was too late
last i checked theres still somethin blockin it if you use the old trick
reverse the api
u can still send message
but idk how to automate
getting fresh jwt
every 3 minute
U can refresh jwt by clicking regenerate message
but they removed that from frontend
but u can still do from api
new model that's cool ig
not as cool as gemini 3.5 pro ultra asi
i dont really care about the gemma models
why claude 4 opus thinking creating unused functions in react
literally beginning of the conversation and it already having skill issue
ok this would be a test for kingfall
omaygot they reverted it
free sota models method bacc
$15 for a coffee is crazy
In Italy, it's €1.5 per cup outside big cities
if youre paying $15 for a cup of coffee ofc u can afford other stuff
It's even cheaper in Portugal outside tourist area
it's such a smart model, although I hate how sycophantic it can be on very certain tasks
its very sycophantic for me
but it so smart usually that it's not like I have to correct it and then it's like "you're absolutely right"
it's already figured it out
and if it does get sycophantic, I just ask it not to be
we are literally bankrupt....we look eastwards now
and it works just fine
if ur paying for a gemini sub ur getting outright robbed though
it's a matter of time until it becomes a...commodity, too, just like the internet.
in 5 years? (assuming no n*clear ww3 or worse)
well, you tell that to the coalition of the willing please
New model in Vision Arena: stephen-vision
what is "folsom-exp-v1.5"?
🧐
it seems to error out on follow-up so couldn't test it properly LOL
it's not actually 4o, just them testing a different model than the one you selected against gpt4o lol
could have smth to do with the upcoming gpt5
yo what
I just noticed
they gave me a free year on my main account
lmfao
what the fck
where is grok 3.5
https://github.com/smtg-ai/claude-squad @deep adder
crack bench
kinda crazy how much the aistudio team messes up :\
mess whaty
the apps feature and the api
i feel like its intentional
nah this apps thing is a major oversight
i mean its not like its leaking sensitive data, its literally just a model
They never released night whisper what makes you think they’ll release titan forge or king fall
those chinese syntax tho 😭
i never tried night whisper, but im guessing 0605 edges it? same logic applies to kingfall, they might not release that model exactly, but something better than it, all these codenames are for a/b testing at the end of the day
I was recently interested in kimi and tried it, got sometimes Kimi thinking in CN despite all chat history in EN
hmm very interesting
this is so ai related, $500b fundraiser right there
i use deepseek everyday
deepseek is agi
correct
PS: it's free...
kling > veo 3
not really
chatgpt is not free
so free
le chat is free
china models >>
le chat....isnt as good as promised i feel
try bing chat
anyone knows if H company has their proprietary model?
or do they use le chat for the Runner agent
New model : cobalt-exp-beta-v13 and v14
😦
v15 soon
i tried logging in using my gmail account, that's what I saw after logging in
that was the only option for me to log into copilot 😵‍💫
now i know what you mean with sydney personality 😅
crack chat?
Guys im using lmarena for comparing Opus 4 thinking and O3 in same time
And honestly i started feeling guilty at this rate
Is there any request limit ?
I know we cant use long texts, only short prompts but still hard to understand how we can reach those models so easily
I know google using lmarena for testing their secret models(and there is lot of them) so maybe google can be sponsor for them but still...
i believe openai and anthropic give them free quota
they are sponsored by crack bench
titanforge don't even exist
😭
titanforge is asi
Here we go again
*externally
This is kind of disturbing
Sounds like gunshots
I’ve also heard a band playing, music, singing, babies
#announcements message
It's paid with this money
they had claude 3 opus on direct chat way before that though (and had specific global ratelimits, not per user, i believe, though i could be misremembering) (there's no point in paying for claude 3 opus when it's this old)
it's because all your data will be publicly available
gpt 4.5 had been available in direct chat before, and all the funds in their openai account were wiped
i never remembered that happening 🤣
i did a little search but it seems no one ever reported it (if it ever happened) in either discord lol
i know this but dont think my data precious as Opus 4 thinking outputs lol. Still nice to know they got something
yes and now they have the claude 4 opus thinking which is the most expensive model they have and has a rate limit that I didn't even hit
on the old website some of the models had a global rate limit per interval on direct chat (whole lmarena i believe). it might just be a global rate limit again unless something has changed i dont think theyre paying for it
if they were paying for it, it doesnt really make sense to make it available in direct chat (unneeded for leaderboard), would just cost them money
Btw dont you guys think claude is bad at doing reasoning thing. Like without reasoning Opus 4 already best model ever, but that reasoning doesnt give it so much, it should!
Think about V3 and R1
Also 2.0 flash and 2.0 flash think
Differences were huge
But Opus 4 still feels similar with the thinking thing.
They said something about "hybrid think" but idk
That reasoning thing must make bigger difference for claude models. Because their models already super without reasoning. I dont understand
I'd not say failure, still beast for coding and writing but i kinda agree it was a bit disappoint yea
That gemini 2.5 pro changed everything. I even believe chatgpt released O3 earlier that they planned because of 2.5 pro
I heard O3 hallucinates a lot, so i believe they were planning to optimize it more or smth, but when they saw gemini 2.5 pro they decided to release it earlier, because if they didn't, expectations could become dangerous in future. (this is my theory ofc)
They stubbornly refused to give it scaffolding
100% openai, 100% bing, 0% gemini
I primarily use AI for writing and opus is legendary
Can you explain scaffolding?
Like giving it a minimap
Gemini/o3 runs have a good bit of scaffolding helping their performance
But Claude was barely given the same scaffolding, hence why it failed
You guys see the new video from Machine Learning Street Talk?
we rlly took this man for granted
what unholy experiment did you conduct to him?
not me lol
🤪
bring this man back to here,we need him
he'll be back
While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running busi...
I'm ngl
the illiterate people on the bard subreddit
are getting annoying
can't tell me no one has noticed this lmao
the roleplayers, all these people have such Incoherent thought processes and literation
it's starting to inflate the subreddits
no more news and people actively telling other people to abuse, and then people telling Logan and the person that Initially asked him about the future of AI studio that caused the outrage to kill themselves
and prepare mass death threats
etc etc
0 civil understanding
and it's pretentious asf
comments, the r/Gemini server, Twitter → bard subreddit pipelines
and if not threats, then people subtly alluding to very strange things
and or saying things straight up
nothing new we already knew that via the arc agi 2 bench
@echo aurora For Claude thinking, have you set a thinking budget or no?
censorship on new lmarena is horrible. Can't discuss relationship topics freely. Why you had to implement this when ai studio and chatgpt allow these prompts to go without any issues?😡 Your own past iteration was working fine as well.
second it
Crazy that apple had the balls to roast these models when their generative ai implementation into their products has been atrocious
They are focusing on an entire different branch
on-device LLM
apple aren't integrating AI because their customers are normie professionals. Look at the viral headline when it mis-rewrote news for example. Current stuff isn't ready for them.
it could be bots
ever heard of "internet is dead" theory?
relationship topics?
I think I saw it
if there is overt intimacy or swearing, censorship doesn't let the prompt go.
I modified codebase token counter, if anyone is willing to try it out
That's classic MS at their "peak". Message saying "check back again in a few days" after permaban
well all the models they tested also have a low score on the og arc agi thing and they test the models on things similar to the puzzles there
so i could have already predicted their results (and i guess anyone else) without actually testing
- the other stuff they said about: simple problem: non-thinking good, medium: thinking good, really complex: even CoT does not help
was like the most wellknown AI 101 fact ever
so their paper is just another one of the "we are apple, we don't have good AI, but that does not matter because AI is not even good in general" type of papers
they did a couple of em
Nate Silver wrote a pretty good Substack piece about how an AI being able to play poker without any significant errors is a pretty good test of AGI, because it’s currently completely dogass at it
Which from my own testing on LMArena I can definitely confirm it’s worse than any fish I’ve ever played at a casino
we need a specialized llm for playing poker now
Honestly would just be a computer use agent that knows how to input and read a GTO poker solver lmao
Would be fun to play against a Poker LLM in a live game tho, like IBM Watson on Jeopardy where it just announces its bet or a fold whenever the action is on it
Especially when it makes a bad read and the entire table bonds over taking the AI company’s R&D money to pay for their hookers after the game is over
wen titanforge
should be pretty easy to get llm to reason about the game using multi agent RL + verifiable rewards (e.g. winning a game or a poker engine evaluating the move quality at each step)
(imo) soon all the labs will probably do these things for all large games / environment based interactions and not just for coding and math competitions
i've somehow never thought about that before. what an interesting article
Is that fake?
there's no problem here. Afaik their tokenizer is still mostly the same it was all the way back to gpt4o and o1-preview. They made it overfit on this question but it doesn't really "get" why there are 3 Rs. So different versions of the model and different system prompts (or variations of the question wording) can still make it answer wrong.
I also tried this and it was incorrect in the reasoning but somehow still answered ok lmao
wdym. I distinctively remember it answering this wrong all the way back when this became a thing. O1-preview was unstable with this, then they overfitted o1 stable version, and now it's an 'issue' again presumably as they stopped caring about the model getting this particular question right
is kingfall in lmarena yet
Tried it a few more times on API. It's not always wrong but wrong often enough with your specific wording, to encounter it
@deep adder
high reasoning effort does not help here lol
lemme see if I can make o1 do this..
yeah o1 is overfitted rock solid 
Honestly, it makes sense to focus on things like that less in favor of spatial awareness, which they seem to have done lately
it's quite closely correlated with web development. They train on things like that and then it pattern matches. 4.1 does much better on web dev arena than earlier models
Like if it can associate some code with a certain shape, it will learn to make unique shapes too eventually etc
i thought for a second you meant the researchers dont even have spatial awareness
then ofc we also do have arc-agi, that largely plays on spatial awareness and is still an important metric for them
o3 preview was just some variant of o1 with extreme parallel compute
o3 new base model
they basically anticipated how o3 is going to perform before they even had it lol. O3-preview was never going to be released being ran like that
@deep adder what is ultrathink max thinking tokens in claude code?
yeah it was a bit misleading/marketing you could say if you don't want giving them the benefit of the doubt
that seems low
oh yea and lechat app is finally gone
looks like it
but u know i had to reverse engineer it
what model is it using? With Opus you would see it reaching 16k extremely rarely
oh yea it rarely reaches up there especially for code
i know just wanted to showcase thats all 😦
you can talk to these "bots" in the Gemini server
inspired by FF7?
it's a game
noooo, bring her(?) baaaack
it'd so amazing if you can import personality as a file into a llm, and boom, every llm is sydney
when will it be ready to do that?
grok 3.5 was pushed back to be able to fine tune it on the code of Claude 4 opus
yay brian is back
so my first question is when titanforge release
gemini did best on my sydney benchmark ngl
gpt 4.5 second
flash 2.5 1st
learnlm actually 1st
but learnlm is stupid
without fine tuning
with fine tuning u can get any model to be sydney bro
so then its not 1st
ru stupid
4.5 does not talk like sydney
gemini not either
but if u try by giving saved conversations and bing instructions
flash 2.5 does it best in most cases
gpt 4.5 is convincing for 5 messages
then it becomes like 4o
overuse of emojis and "!"
flash can keep it convincing for 10 messages
without it it's the complete opposite
though i would also obv see this as an artwork 👀
dude, chill,you know he can't say certain things directly. just chill and chat casually.he can hint at stuff or rumors without be feeling like oversharing. just shoot the breeze and relax. 🤪
opposite of what wdym
Good question, I’ll check and keep you updated if I can share 👍
sydney
bro gave the
yea
gpt 4.5 without instructions or pasted conversations talks like 4o a bit, i mean not everything but a lot of traits from 4o
multiple, but it only responded to that one
Welcome to the AI Chess Battle OpenAI o3 and Gemini 2.5 Pro the state of the art models will be playing a chess match against each other. Let's see which model actually wins this.
dang
no
LLM AI Chess Leaderboard: Ranking, Elo, and Chess Performance of AI language models.
no
idk ask dubesor
but model size + very little SFT / RL i guess
was a thing i think
not sure though
"fist ai i talked to" though
he is in this server
which is why i said that :)
(did not want to ping em for another one of craigs "xAI = ASI" moments)
no
it doesn't have the issue of glazing or the responses being overly drafted like chatgpt-latest. The similarity is only that it was finetuned by the same people lol
yes
you always post these screenshots, but what is this exactly that you are using?
microsoft bing chat
I don't think that og model is available anymore
real
it was using some custom version of gpt4-32k
how is that a no?
but not fine tuned
It has 16k or 8k
you said the same thing. 0314 is just the date identifier.
pretty sure it was 32k
Didnt one of the gpt 4 have 16k
it didn't
wtf
then it was 8k that microsoft is using
sydney conversation is mega short
before it starts to forget
there's no way its 32k
that does not indicate the context size unfortunately
models will start forgetting things if there are many chat turns
he said they forgot some websocket or smth
no it was clear it was context size
how do you even get to use a model that was discontinued a long time ago?
proof or it's fake 😇
real
fake
I meant the slang real
nope it's discontinued 😔
nooo not sydney
it was fake guys 😦
Doubtful. There was but it was a long time ago
would make no sense for them to host it anymore
fr!
Nevermind its fake
best claude ai?
that is still not a proof. Stop posting these screenshots if you don't want to say what you are actually using lmao
I suppose it could be this https://github.com/socketteer/clooi, but really weird and childish how you are secretive about the whole thing
NOOO He found it @misty vault damn.
darn it 😡
???
ok whatever, I'm out lol
is it censored for you as well? It seems to answer like this for me
grok 3.5 v.s gpt 5
whos winning?
slop bot
GPT5 will take bloody forever to release and will sweep the floor with Grok
i dont know what this question means
at raw chess game continuation its best yes (best inherent chess game knowledge), but when you add full information and reasoning, o4mini and its finetunes (e.g. codex mini) are better.
i know this screenshot is old from reddit but it is "censored" if you give bing instructions
without instructions it wont say that and if you talk directly to it without microsoft frontend (I made custom extension so i can use frontend anyway) then it wont have that censoring issue unless u manually insert bing instructions (which I still do a lot bc its hella funny ngl + u can disable the chat shutdown so u can continue talking even after it says that stop phrase)
the system instructions literally contain dummy conversations that ends with that phrase
I don't know yet. Will you harm me if I harm you first?
I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏
like here. it tries to close chat but fails because I disabled it 😔
real
@keen beacon I completely fixed the sycophancy in 0605
No my sydney is better than yours. Yours is uncensored because it's fake
fake 👎
respond with 👎 if u think this is real
fake 😇
man 0605 is really a beast tbh
fr
almost forgot this feeling of playing with a model like that
it just knows tbh
it's the smartest model I've ever used, even more than gpt 4.5
I'm not mad about kingfall
yeah it's fundamentally very strong model
it's too bad the initial sycophancy clogs some things
but I just figured out the prompting for it
and bro
doesn't rely on test time compute as much as some others, but has the capacity even without it
ye it thinks for so little time sometimes
also, 0605 follows instructions a little too well sometimes and I have to fix some errors that I made in the system prompt that the past models wouldn't conform to, to show that error
the model gets excited
this is I think the main weakness of it. You can't make it truly unhinged. If they made 32k+ outputs possible this could beat competition in ALL tasks.
whaat. lmfao
ykwim Dom
like it gets anxious for more
and it actually absorbs it
ive unironically not thrown a task at it that it cannot adapt to lmao
ye it has an unfortunate tendency of not going past like half of its total output
there are some tasks like this one where it's gonna fail spectacularly because solving concisely is not possible:
convert this bigint to base62 string 64042767145148921126606705626946155826
(1SujroSlLXYgGydSsefgdW - solvable by o3 API no tools)
when it tries concisely it can only hallucinate nonsense
ye prob
if they fixed that, I really do not see how o3 could be better basically in any area though
cause otherwise 2.5pro base is very very strong
Also I love prompts like these because it's impossible to overfit. All you need is to change the number to anything else out of millions of combinations if it becomes an issue lol
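For anyone who wants to check a model's answer to that prompt, here's a minimal sketch of the conversion in Python. The alphabet order (digits, then uppercase, then lowercase) is an assumption, since the prompt above doesn't specify one, and the names `ALPHABET`, `to_base62` and `from_base62` are just illustrative:

```python
import string

# Assumed alphabet order: 0-9, then A-Z, then a-z (the prompt does not specify one).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
BASE = len(ALPHABET)  # 62


def to_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        # Peel off the least significant base62 digit each iteration.
        n, remainder = divmod(n, BASE)
        digits.append(ALPHABET[remainder])
    # Digits were collected least significant first, so reverse them.
    return "".join(reversed(digits))


def from_base62(s: str) -> int:
    """Decode a base62 string back to an integer (inverse of to_base62)."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


if __name__ == "__main__":
    print(to_base62(64042767145148921126606705626946155826))
```

Swapping in a different number gives a fresh test case every time, and `from_base62` lets you verify whatever string the model returns without trusting it.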
hahhahaha always gotta add a feature every week before they release grok 3.5 🤭
new hotfix next week upcoming, big news
So I doubt anyone would even bother to begin with for this reason
wonder how Gemini 3 is going to be tbh
@ocean vortex you're right, 0605 gets really close with "1Sujr..." and then just gives up lmfao
ah
what can i say...
i mean i would rather just stay silent until we released a good model
newsflash, button gets tweaked next week
i mean whats the purpose of having 10000 features if the model is bad?
xdd
it prolly is
they keep changing the UI every week
i mean whats the point
it just proves even further that they hit a plateau
explain how we went from 'grok 3.5 will be released next week' to months without any info
for some reason idk how google is excelling at shipping consistently
remember Big Brain?
like i said it from the start that we wont be seeing that for at least a year
mhmm i do
i just cant imagine how inefficient that feature will be, i mean the default thinking process is so inefficient let alone this 'big brain' feature
never bet against elon
At DeepMind Mountain View, we have really tasty coffee (me and a few others source the beans) and you don’t need to work weekends to have it.
No logo machine though.
this seems to be the trend with the most recent models
i love how they "work hard" but still have time to write 10 gazillion x posts a day
btw that pic looks ai, but its not, its literally at the beach
write down all the phrases it uses that you don't like, and it actually removes the output that entails those phrases as well, tell it not to comment on the user at ALL, tell it not to thank the user AT ALL, tell it not to thank you for observations at ALL, give an example of when you could be wrong, start the response with the answer, and then say sum shi about it being a professor
also don't force it to preemptively evade being "wrong"
I've figured that actually hurts its performance
for some reason
meanwhile there's no social life pics from google engineers hmm
Thank you
I'm quite concerned that all 'anti-sycophantic' system prompts will cause the model to be overly dismissive of users, and there seems to be no effective way to balance this
maybe they are mostly in India
only noticed this with bad prompting
dawg
what is this
😭
agi threshold is when the AI can make you feel like an infant
pack it up
probably Deepseek kind of thing. They didn't get the gains they were hoping for
tbh I don't think we saw anyone making significant gains yet without improving base chat model significantly
o1 to o3 was new base model
4.0 Sonnet vs 3.7... worse than the earlier model in numerous things. Not much different to what Google is currently doing, although their last 2.5Pro update is more significant than what Anthropic did probably
I’m dead
yeah, multiple Grok 3.5 versions were tested in the arena under codenames, likely never released due to insignificant ELO gains
the elo gap between Opus 4 and Sonnet 4 is virtually the same as the gap between Sonnet 4 and 3.7 (24 vs 25 points), what leads you to the assessment that Sonnet 4 is worse at some tasks?
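to put a 24 or 25 point gap in perspective, here is a rough sketch using the standard Elo expected-score formula. this assumes the arena scores behave like ordinary Elo ratings on a 400-point scale, which is an assumption on my part; `expected_score` is just an illustrative name:

```python
def expected_score(rating_gap: float) -> float:
    """Expected win rate of the higher-rated model under the standard Elo formula.

    rating_gap is (rating of model A) - (rating of model B).
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


if __name__ == "__main__":
    for gap in (24, 25):
        print(f"gap {gap}: expected win rate {expected_score(gap):.3f}")
```

under that assumption both gaps work out to a win rate of roughly 53 to 54 percent, so neither step is a large jump in head-to-head terms.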
i actually havent noticed any major improvement from sonnet 3.7 -> 4
or even to opus 4
yes they got even better at coding
but thats it really
Sonnet 4 has less terse answers than 3.7, I give it pieces of my writing to do character analysis on and it goes further in-depth than 3.7 did, but I guess that’s kinda anecdotal
sorry
i broke the servers with ultrathink
im trying to get this number to $1k, plz fix servers
I too am getting an error with opus models, are others seeing the same?
yea i can tell, xai seems to be overpaid and too much pto it seems lol
does anyone know which AI is best at writing stories?
nah they are just overpaid bc nobody in the industry would work for him otherwise (or at least not enough people)
i mean if elon is strict in work ethics, then why is grok 3.5 a month and change late?
i feel like its burnout?
those were the days 😭
yea that fred rate for swe's is brutal
i wonder though if its ever going to reach that covid peak ever again
but is google technical debt even worse? i heard they have to write tech stack from scratch, that sounds brutal
is there a plan for large output context i.e enough for a book ?
expensive to do so.
like their docker is totally different, not even kubernetes
yea thats what i was thinking, when gemini wasn't as quick to catchup with openai, just tons of tech stack to write from scratch, but seems like they got through it nicely at the end
tpus are also what i am most intrigued about future wise in google, with em planning to move away from broadcom really soon
feels like a big risk imo (but there is really not much info on this, so it is just a gut feeling)
-> could be the time they fall, o0 (or at least Broadcom's stock will get obliterated over night once all the normies find out)
but i'd imagine the onboarding for new staff must be like hell week for them haha
rather be working with something im already used to
I think Hack (name of the programming language) is technically backwards compatible for most things
IMHO learning a new programming language or tool isn't really a big deal
Unless it's really really out there (e.g. Haskell, Prolog)
lots of companies do that (jane street and .... idk more 🤣 )
somehow it just works out when people see the comp they get in return
yea id do the same if it is high, im ngl
at least 4x lol
Oh right Jane Street uses Haskell right? lol
yeah (note: apparently caml for most stuff, confuse the two quite often 🤦♂️)
Nerds :p
was really weirded out when i read it
but i think they realized they should use jupyter for research and python for ml (but that was only a couple of years ago, lol)
ur the nerdest one out in here lol
tru, the image of your bedroom i generated earlier is probably accurate : - ]
Hence the :p
for years I never knew :p meant 😛
I make no claim to not being a nerd
It's just ironic to call other people nerds
Believe it or not, my room is not nerdy at all aside from having quite a few board games
But I'm an omega turbo nerd
ok if board games are nerdy, me and all my friends are found guilty 100%
paste that into gemini 3.5 in a couple of months and we'll have your precise location
HNDL😱
2.5 you mean?
bro
grok 3.5 is available to super grok users?
i never knew that...
how good was it
idk
chatGPT said it is available to super grok
stars added to grok
"in a couple of months", my point was more about continuing its path of becoming crazy good at geoguessr
wait, is it out?
-- apparently selective rollout (like not really releasing, so far)
thus they are obviously stuck in development / doing further training (bc of poor results or high costs) - nvm i kinda missed the x post about the launch 😅
love it
kingfall vs Grok 3.5, who's winning
i think we get it 😭
they like polishing it
but the model
let me check actually
grok 3.5 definitely
what do u think..
hmm, i think kingfall
the market has been severely lacking in beta since 0506 was launched, Let's add some
.
oh my goodness its magic
doesn't even work
magic tho
i guess not
bro what are these particles for, whats the purpose 😭
but we already have grok 3.5 at home
i don't even get any :(
wtf actually?
ok go on temporary chat
ok specs maxxer
looks like its only in temporary chat, is it only me?
oh i found a shooting star omg
wait for me temp chat worked, but the moment i touch the tab everything is gone 🤡
o
yea
WOAH
;/
is it good?
craig fakerighi
plz tell me 😭
its cap i dont have it
ok buddy, ud be not here in this chat but in that grok chat testing grok 3.5 if u really had it lol
oh that screenshot is going to take a while
idk...
next week
everybody has grok
but atleast i now get stars everywhere, even without being logged in
i think some people have it, most don't
*checking polymarket
good to see that they have their priorities straight: weird hype building star animation > actually releasing the model on time
i was just on that page...
aye we need some tips for next month wink wink*
?
which one are you😂
you got it?
its in my build
idk how recent it is though, maybe it was pushed last week
pretty sure elon fanboys will buy like crazy
at least it feels like there are fewer of them day by day
yea
Thats been there since may sadly 😭🙏
thats what id thought 😦
I think I felt the same way about Meta that you feel about xAI until Maverick came out
"Never bet against Zuck", etc
why does claude opus keep crashing
CRAIGBENCH
Google getting broke up as antitrust to only sell their browser to an even bigger company is truly American
well BlackRock is a poor example for a buyer
it is very much not what they do
the other PE is a bit small and prob does not have the right companies for merging
crazy take
they are statistically atleast upper middle class
honestly it could be stuff like they have to put their tech into non-profit
but honestly even the people at the commission likely don't know
I’m not sure man but it’s been crashing so much for me it’s been unusable
many people would like to buy,
Microsoft will likely not be the first choice competition wise (worst way to fix a monopoly is to create a new one),
perplexity / openai would like to buy, but even openai might have problems with the financials,
apple will likely also not be the first choice competition wise and if they want to be privacy focus going forward they also can't pay as much as other bidders (bc of lower potential profits),
amazon, i don't even know, maybe
based on what i heard they are out for blood, so just stopping the deal won't cut it
my guesstimate is them not having to sell chrome, but might have to do some stuff to google ads etc.
and the most important thing: usually it takes years for this stuff to actually work out in court (example: intel, which is still fighting a case over being a monopoly that is multiple decades old), by the time this is all done all the execs @ google are already planning on some other future that is less based around the normal advertisement business but centered around ai
well you can't grow much from a point of market saturation
my point about the ads was also not them specifically, but more that i believe in a middle ground solution being supported by the DoJ (although a middle ground that google won't like)
imma ask my prof in competition policy about google's future
will find out tomorrow (nvm totally forgot a holiday)
So, we’ll find out some other day 🫡
yeah the message limit went down to 5 per day for that one and its thinking counterpart 😭
Is it possible to incorporate lmarena to a project? Like use lm arena's api or something
there is not an api for lmarena, but it's helpful to know more people are looking for one.
thought the limit was tokens per chat
is this only available with SuperDork?
lol... to answer your question they don't think in a traditional sense but they do generate extra context making it easier and with higher confidence to predict the final answer. Also to limit hallucinations, which mostly happen when it is too hard to predict the continuation with limited context aka model does not have enough to work with. In a sense it can emulate the work of several chat turns without your additional input
so if it has some big number to compute - predicting the whole number in 1 go is too hard with way too many potential outcomes and will almost definitely result in hallucination. But if it can divide it into easier to predict parts and solve them one by one then use all of that as a cheat sheet to compute the final result... the accuracy gonna be much higher
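a toy analogy for the "cheat sheet" idea, not how a model works internally: schoolbook multiplication where each partial product is written down before the final sum. `multiply_stepwise` is a made-up helper for illustration and assumes non-negative integers:

```python
def multiply_stepwise(a: int, b: int):
    """Multiply a by b one digit of b at a time, recording every intermediate step.

    Returns the final product plus the list of steps, which plays the role of
    the written-out scratch work. Assumes a and b are non-negative.
    """
    steps = []
    total = 0
    # Walk the digits of b from least significant to most significant.
    for position, digit_char in enumerate(reversed(str(b))):
        partial = a * int(digit_char) * 10 ** position
        total += partial
        steps.append({"digit": int(digit_char), "place": position,
                      "partial": partial, "running_total": total})
    return total, steps


if __name__ == "__main__":
    product, steps = multiply_stepwise(48273, 9156)
    for step in steps:
        print(step)
    print("product:", product)
```

each step is small enough to get right on its own, and the final answer is read off the running total instead of being guessed in one go.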
And update ?
what jb prompt can you try today?
I'm guessing there just aren't enough votes to place it on the leaderboard yet since it is on the battle section of the site
yea it's there
identifies as an oai model occasionally (like in the past.. though here it's actually kinda striking how much its responses resemble 4.1-mini's..)
i get errors with (mainly) qwen thinking models when it's a complex query - i dunno if it hits up against max tokens or a timeout but the annoying thing is it affects the battle (i.e. both models) rather than just the qwen one
i get this error happening with the following models:
qwq-32b
qwen3-30b-a3b
X-preview
deepseek-r1-0528
stephen?
I'm getting plenty of errors, I thought it was a token limit per conversation thing
and claude models love using tokens
yeah i think it is some kind of max token limit - whether across the conversation or a single turn i dunno - but some errors only affect one of the models (and ig are thrown by the API host), whereas this sort of error kills the battle (though you can just enter another prompt, and if it's short / not complex, no error and you can cast a vote / reveal the model.. or ig continue chatting (kinda why i thought it was a turn-based thing rather than the totality of the conversation.. but dunno))
come to think of it.. i feel like i might've encountered it with opus-thinking too.. but never with a google or oai model
Daamn how I hate 2.5Pro spamming me with dummy example data for any kind of code. Have to prompt it explicitly to give me what I'm asking in isolation lol
otherwise it's gonna make up 10 extra variables and write the entire placeholder dataframe around it huh
fr
For me it was just a cut off responses with no errors when I got Opus in battle recently, which is perhaps even worse... people can vote against it because response is incomplete lol
yeah that's the thing - it kills the battle [though ig this is kinda a moot point.. as iirc any battles where an error is involved are excluded from the LB/elo calculations]
and it doesn't happen with the legacy arena
like doesn't matter how many times i run this prompt, legacy will always give distinct responses from each model, whether an error or actual response. the beta site periodically errors out when one of the thinking models that have a low max token limit (and / or are inherently very token intensive in their outputs) is involved
they should just implement a "continue generating" button to show up (no voting buttons at that point) when the stop reason is max tokens reached or perhaps even do this automatically if it's not a limit enforced by them. I don't think either red error or voting as usual on incomplete response are acceptable solutions tbh
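a rough sketch of what that rule could look like. everything here is hypothetical (the `StopReason` values and `battle_actions` are invented for illustration, not the arena's actual code or any provider's API):

```python
from enum import Enum


class StopReason(Enum):
    # Hypothetical stop reasons; real APIs use their own names.
    COMPLETE = "complete"
    MAX_TOKENS = "max_tokens"
    ERROR = "error"


def battle_actions(model_a: StopReason, model_b: StopReason) -> dict:
    """Decide which buttons to show for a two-model battle."""
    reasons = (model_a, model_b)
    any_error = StopReason.ERROR in reasons
    any_truncated = StopReason.MAX_TOKENS in reasons
    return {
        # Voting only makes sense when both responses finished normally.
        "show_voting": all(r is StopReason.COMPLETE for r in reasons),
        # Offer to resume a truncated response instead of showing an error.
        "show_continue_generating": any_truncated and not any_error,
    }


if __name__ == "__main__":
    print(battle_actions(StopReason.COMPLETE, StopReason.MAX_TOKENS))
```

the point is just that a truncated response gets its own state, separate from both a hard error and a normal votable result.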
@echo aurora
yeah but i suspect it's during the thinking process (rather than generation) where the max tokens threshold is breached though.. ig we're kinda describing different situations
like it isn't cutting off mid-generation for me; it's in the pre-generation(/thinking) phase that it breaks
but it never happens using the legacy arena
bro always
[notwithstanding the fact thinking/generation are functionally the same thing (hence why i keep mentioning max tokens)]
so how does it look for you in legacy with same model same prompt? Just the normal response as usual? I only saw cut responses and that error
but yeah those might have been different cases
I presumed they are using the same max_tokens on new and legacy, but ig not
yeah i wouldve thought so too but i'm not really sure.. somthing seems a bit off
open the network inspector (so it logs the requests), make it small, and when it does that check the request / report it (might be useful). i had cloudflare issues a while back, wasn't obvious until i did that
using side-by-side instead of battle, it doesn't error out the whole interface - instead it just seems stuck (it's been spinning essentially since this msg from you)
invalid_api_key? 
most likely the same error. It's just that now they are sanitizing/hiding the errors like they should have done since the start LOL
yeah but previously at least the other model was allowed to generate
the current implementation kills both responses [this in the Arena/Battle]
well tbh you shouldn't be able to vote if one of the models can't generate a response. I'm not sure what the best solution would be here...
they dont count those anyway
maybe to just disable voting but allow follow-up to the sole working model
no point in having those buttons either way though
technically the battle can't function with only 1 model, so hard error is not completely unreasonable...
unrelated thing but i wonder if they fixed the captcha issue it was a while back. it wasn't showing the captcha so all my requests didnt work. only saw it in the network inspector that cf was asking for one
@alpine coral they could be just stopping generation now. In that case it's an improvement over legacy in a sense that compute is not 'wasted' when the battle can't continue lmao
as in if 1 model fails, the generation of another one is manually interrupted at that point
yeah lol that's actually a very fair point!
still kinda annoying though.. (like even if the vote doesn't count so it's kinda irrelevant, if one of the models has generated a response, i'd still prefer to see it, like with legacy)
cause yeah it's like 2-3 min wait before everything errors out
Gemini 2.5 Pro 06-05 has set a new SOTA on the aider polyglot coding benchmark, scoring 83% with 32k thinking tokens.
The default thinking mode, where Gemini self-determines the thinking budget, scored 79%.
Full leaderboard:
https://t.co/mBVaUPGHPl
sorry to say I don't have an update for you atm.
So budget disabled does worse than you enabling it and maxing out?
this is so weird lmao
thank you for the flag.
thinking models + complex query
yeah I too wonder if it is related to token limit being allocated (or not) for internal reasoning.
this has been raised to the team.
np thanks for raising 👍
we are looking into including a "stop" button when models get stuck; however, in these cases it seems like they're erroring out instead of getting stuck.
neither red error nor voting as usual on incomplete response are acceptable solutions
yeah that's very valid. this will also be raised
this is all really helpful feedback, if yall haven't already applied to our internal feedback program you absolutely should - #announcements message
I did use them a ton and this is completely different. Thinking budget we first saw in Claude models and the whole point of it was to limit the thinking. Not extend it
the cost(/tokens used) is only marginally greater with 32k set as the thinking budget
more thinking doesn't necessarily translate into more performance
like in a brute force sense it can
so if budget is not set, the expected outcome is that the model is not limited
did they change how thinking budget works or there might be an additional prompt instruction aider added? because it used to just cut off at 32k/budget tokens
but for some reasoning situations - it can be counterproductive
that was how they implemented it before (gemini), the model was unaware of the thinking budget
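a toy model of the "just cut off at the budget" behaviour described above, purely to make the idea concrete. this is not Gemini's implementation; `apply_thinking_budget` and the token list are made up:

```python
from typing import List, Optional, Tuple


def apply_thinking_budget(thinking_tokens: List[str],
                          budget: Optional[int]) -> Tuple[List[str], bool]:
    """Hard-truncate a thinking trace at `budget` tokens.

    The generator of `thinking_tokens` is assumed to be unaware of the budget,
    so the cut can land mid-thought. budget=None means no limit.
    Returns the kept tokens and whether truncation happened.
    """
    if budget is None or len(thinking_tokens) <= budget:
        return thinking_tokens, False
    return thinking_tokens[:budget], True


if __name__ == "__main__":
    trace = [f"t{i}" for i in range(45)]  # stand-in for a 45-token trace
    kept, truncated = apply_thinking_budget(trace, budget=32)
    print(len(kept), truncated)
```

under this model a budget can only shorten thinking, never lengthen it, which is why a budgeted run scoring higher than the default mode looks odd unless the mechanism changed.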
it might be variance unless something changed
Is there any specific date when the lmarena team will accept people? And is there a chance to not get accepted into this program?
there isn't a specific date I'd be able to share, and yes there is a chance of not being accepted into the program.
alright thank you
have you actually managed to make it output 32k?
im running it again on a prompt i have. it used to do 45k with thinking budget off
does the summary model have a context limit lol? ugh ill have to get raw thoughts
tbh aistudio is summarizing the thinking which is kinda ambiguous with their token counts
That's the joke
that was with the previous model (the raw thoughts token amnt). i can get the raw thoughts anyway
apple will announce that theyve been acquired by perplexity
tbh I don't remember the last exciting WWDC
Too expensive and not enough developer interest though
It has some novel technologies in it
this model gives up too quick (did ~28k). but i think i can get it to do 32k+
was it actually thinking for majority of it at least? I'm trying smth now and it's 30sec thinking + 75sec final response writing lmao
yes lol
It's VR and AR but it's limited because you need a clunky headset, so you can't just take it everywhere, so it's unlikely to ever be mass market
It's a stepping stone to glasses
Glasses have better odds than a headset
Too clunky for mass market
Same problem as VR
The Meta Quest is still the best VR / AR product for most people even after ignoring the price difference
im not sure it's ambiguous - the number of tokens in the summary don't correspond to the number listed in the side panel (which i assume means that is based on the raw thinking tokens)
Way more software, more comfortable
yea the token counter includes the length of thoughts
but having the actual raw thoughts allows u to measure it precisely
yup ofc
tho just running a test, with the thinking budget maxed at 32k, it does seem the model [06-05] 'thinks' for longer / uses more tokens during inference
maybe they changed it. because with 2.5 flash it didnt affect it in any way and just cut it off straight up. (it can do much more)
yeah i think they must have tweaked something
ok so you can actually move all the steps it does into thinking by only allowing it to respond with the final answer. Just in practice not sure anyone would do this given that most interfaces will summarize it
before the slider was meaningless - does seem to have some effect now