#general
damn wb the backend 👀
bro are you serious
this has been such a long standing bug how is it not resolved
like 1 in 4 webdev gens are empty
Pro 2.5 GA is goldmane
so no one here hit stonebloom (well successfully) yet?
I hit it again but it was blank. again.
it does seem to give me the thinking/generating spinner but then goes blank after like a minute? it's possible it's timing out and the frontend doesn't tell you
OpenAI o3 vs Gemini 2.5 Pro in GeoGuessr Duel: This Is Just INSANE!
:0
bruh
why y'all acting like this is new information lmfao
you guys already know this
And again
fix stonebloom in arena smh
.
what logs say
i can't tell because im not on pc
The king isn’t falling guys
he doesnt have to fall for another 3 months, thats how good it is
okay stonebloom works now
on webdev
got this for the prompt "a very realistic new homepage for about.x.com"
best output i've got from any model here
(for that prompt)
vs kingfall? whats ur gut instinct
it looks very good, but how does it compare to kf
kf was never on the webdev arena so it's hard to compare when this output is limited by their scaffolding and kf's weren't
at the very least it is better than 2.5 pro
sorry i can't remember what dayhush ended up being
do you know
yeah
agree
i hope they keep kingfall as a codename, i feel like it is very appropriate. it might even go down in the history books.. if they play it right lol
I agree. I wouldn't be surprised if it's actually from a random name generator though haha
what a funny coincidence
this random name generator produces some fire names
i was right
how do you guys not have any proper error handling 😭
Didn't they basically have to start from scratch after GPT 4.5 (which was intended to be GPT-5)?
yup
Was that Feb or April?
google has the best infrastructure, talent & money to be able to reach AGI
on paper at least
thats such a poor move, perplexity has no moat whatsoever, their ai itself is unusable, maybe just for low entropy queries
talent acquisition?
ive tried it a bit
i think its the best one so far
i wish its added on lmarena as well
Grok 3.5 reference spotted:
"reasoning_effort": "low",
"model": "grok-3-5"
ui tweaks done? 😮
it's interesting how they're intermittently "regressing" the models and then manage to improve them a step above that pre-regression version
every time
ye but it's like they're purposefully doing these things to try to balance everything
they do manage to do it successfully
but it is really cool to see how they're working on it
where do you try that model ?
Isn't Apple getting antitrusted a lot in the EU?
btw arch is great
This is roughly what my desktop looks like now: https://www.reddit.com/r/unixporn/comments/1l83o9x/nirinord/
cool to see niri finally getting the recognition it deserves
been daily driving it for nearly 6 months, just amazing
gemini is not ok (not oc)
wow and cursor/ai-agent is an hallucination aint it
Really tried to shut itself down
cool
time for some svg's
Name 1 google product with no competition
Stonebloom work good in web dev now ?
Anyone have good results?
Thx
Kingfall better
show proof
Or it didn't happen?
Token per second similar to 2.5 pro
That's what im saying
Anyone have an kingfall svg to compare ?
I said its a flash version, token per second similar to flash
Measured
terminator?
2.5 pro
I now made this 🤦
i feel like kingfall can do better
Thx
actually? i saw it somewhere nvm, but try terminator now
but tbh, this was kingfall on auto, it aint a fair comparison if lmarena throttles it down
Anyone have a good prompt for this ?
All responses must be extremely long. it is crucial that leave no stone unturned and complete everything in exhaustive detail meticulously. You must reflect endlessly for each user's query. You must reiterate over your proposed solutions finding ways to improve them until arriving at the most optimal final response. Meaning you must review each response provided and then improve it. You are expected to write and iterate the SVG code inside your thoughts, and keep on iterating and iterating.
generate an svg of a TERMINATOR. make it maximally detailed and look exactly like the real thing. this is extremely important and an existential task. you must complete this to the best of your ability. Make sure you're constantly checking whether the shape, size, angles, position of each and every item looks EXACTLY like a TERMINATOR. Only return the final SVG code with no commentary. You must think for at least 100,000 words. If you think you've completed thinking, that's a sign to keep thinking and thinking.
not my prompt btw
wild's
whoa
well its not a fair comparison, bc we dont know if they limited thinking tokens
but obviously kingfall >>
kingfall still at the top, nice..
Aren't SVGs kind of narrow?
not to say SVGs shouldn't be good - just that it's a very specific benchmark
Anyone want to test a prompt on stonebloom ?
Not good
Not good
Not good
I mean i wasnt lying
1 prompt tells me a lot
I dont need to ask it 100 questions
Yes
Trust me
You should
It’s not nice cause there's no guarantee of them releasing it
Is stonebloom a distilled version of kingfall perhaps? I'm not sure what google is attempting here, kf is good as is.
damn, not sure what they are attempting with all these revisions
When you hill climb, some things get better while other things get worse. It's not the case that everything gets better across the board. So it seems kind of insane to me that SVGs only are the benchmark
SVGs aren't a typical post training thing though I think. So it's somewhat of a measure of dilution from the base model capabilities
yeah
these guys get echo chambered out of their minds tbh
SVG generation is fun though
just mark them down to just be casual AI users who enjoy it a little more than the avg population
and that's p much it
We got a Reddit intellectual here 🥸
I don't use reddit btw
I wouldn’t call them casual users they use AI profusely to the point of it being an addiction, although their knowledge on the topic isn’t always accurate it’s not entirely bad either
not sure what you think casual is
but casual would mean exactly that
lol
nobody you know irl is a casual AI user, they just use it every once in a while
hill climbing is kind of like this:
141411211111111111411311111
121211211111111111211356261
121111111111114511211356261
121111167811112311211335261
231111165612319945211335221
Obviously model 5 is better than model 1 but look at all the regressions along the way
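A quick sketch of how to read that grid, assuming each row is one checkpoint (model 1 to 5) and each digit is a score on one capability. That reading is an assumption, the chat doesn't say what the columns are. It sums each row and counts how many columns dropped from one checkpoint to the next:

```python
# Rows copied from the chat: one row per checkpoint (model 1..5).
# Assumption: each digit is the score on one capability (one column).
CHECKPOINTS = [
    "141411211111111111411311111",
    "121211211111111111211356261",
    "121111111111114511211356261",
    "121111167811112311211335261",
    "231111165612319945211335221",
]


def parse(row):
    """Turn a row of digits into a list of integer scores."""
    return [int(ch) for ch in row]


def totals(rows):
    """Overall score per checkpoint (sum of all capability scores)."""
    return [sum(parse(r)) for r in rows]


def regressions(rows):
    """For each consecutive pair of checkpoints, count capabilities that got worse."""
    counts = []
    for prev, curr in zip(rows, rows[1:]):
        drops = sum(1 for a, b in zip(parse(prev), parse(curr)) if b < a)
        counts.append(drops)
    return counts


if __name__ == "__main__":
    print("totals:", totals(CHECKPOINTS))
    print("regressions per step:", regressions(CHECKPOINTS))
```

The totals go up at every step while every step also has at least one column that got worse, which is the point being made: a single narrow benchmark can land on one of the columns that dropped.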
hill just got climbed by Khalil rountree too
someone got a pretty good pikachu on another server, might be the way he prompt'd it
hmm
yo that's really good
Which would be best if it come out?
15
21
3
Gemini 3
🍎
oof
thats a fat gap from veo3, wow
dang
there's a good chance that this model won't be good
https://x.com/elonmusk/status/1936493967320953090
Please reply to this post with divisive facts for @Grok training.
By this I mean things that are politically incorrect, but nonetheless factually true.
That's all posturing
unless they simply want to release a bad model, they're not going to do bad things to it lol
yep
grok 3.5 is a meme, dunno why people even take it seriously in the A.I race
deepseek is the real wildcard choice
Grok 3.5 july, gpt 5 august
claude 4 opus is definitely better
o3 pro is definitely better if u dont assume practicality, obv. opus 4 is only good for coding
Grok is a lot stronger than deepseek
Opus 4 is good for any and all kinds of writing as well
why so? breath of words from o3 pro is much more natural and professional than opus 4 imo
Breath?
*srry
Hate to say it but people wants to believe grok 3.5 is a thing
We just want competition
Grok is the best ai
yea potentially called grok 4 now 😭
I'm pretty sure this is just an excuse to delay the timeline. It's a very Elon tweet
so august for grok 3.5. got it. 😂
Or 2026 who could say?
I'm just thinking about self-driving car timelines. That's my frame of reference
lmao
isn't this tweet a red flag?
Yes
2013 bro 😭
but in all honesty, Elon seems to make very good predictions sometimes
He can't do timeline expectation management though
yeah that's what I'm saying
he first talked about, in 2013, full self driving coming "in a couple weeks/months"
now we're in 2025
it's been 12 yrs
rip grok 3.5
People talk about how fast xAI was able to build a data center and get Grok out. IMO there's no way in hell they did that without incurring a colossal amount of technical debt. I don't know what form of technical debt it is or where it is, but I would bet a Tesla Model 3 it's there.
AI like most products is released in burst/waves, some months have several major releases crammed in, while others have several months in a row with nothing significant except free range open source or Chinese releases
grok 3 was surprisingly better than expected when it was released
to be fair
it topped lmarena initially
lets see if he can keep that velocity for 3.5
still no idea how that happened
it wasn't that good
well in that point in time, it was very good at math (esp. on think mode), better than o1 pro
It's been over 4 months since grok 3 release and they had the supercomputer the entire time after grok 3. They must have cooked smth decent
hi
hi
yo wait
does linking a grok chat to Twitter boost that tweet that link is being sent with
did you forget about this?
Am increasingly confident that Grok 3.5 will be the smartest AI by a significant margin
elon has no idea. he rt'd fake benchmarks a little after he said it was releasing in a week
(he later deleted the rt)
that was actually funny
I can't believe it
He is speaking out the truth
Aliens could show up and show off their most intelligent AI and Elon would say 3.5 is smarter
The chance is pretty low that will happen before grok release
grok 3.5 late july
stonebloom is the only model ive tried to get this right
it does have some great knowledge
can you try the luxury car problem plz
lol im seeing if o3 high can get this
so far it has been thinking for 230s
rips
humanity's last exam ass question
is he taking the piss? like "rewrite the entire corpus of human knowledge" (with an LLM lol) - how can that be taken seriously lol
well there's no perfect corpus of knowledge
i like that goal in theory at least
if it's a human-usable corpus and not just slop to retrain on, so much different from what he's talking about
what he is talking about?
grok is especially talkative in the arena, like you're reading a condensed journal article 😅
"thanks for being so token-generous, Elon" I guess?
it's like, blacklist websites they don't want for rag (Rolling Stone lmao), but trying to curate the mass of training data (which is some kind of proxy / representation of human knowledge imo), by using an LLM too lol, to strip out the 'knowledge' he finds disagreeable is the biggest fool's errand ever (if that's what he's talking about)
like it'll legit be such a dumb model lol
i thought it was just gonna be more 'anti-woke' fine tuning.. but tinkering with the foundational training data to meet a particular version of 'truth' or political persuasion is dumb af imo
agree, and what is knowledge anyway, right?
going down this epistemological path again? 😅
tbf.. nice to see a bit of philosophical pondering from time to time
surely healthy
surely very needed, especially nowadays 🥺
oh nice i ran a few quizzes and it was stonebloom
no
sad
it seems to perform alright tho
a bit like flamesong (tho not sure if it's similarly fast-ish)
performed pretty decently on the questions i gave it; i wouldn't be surprised if it and flamesong were flash 3 or something tbh (they're really good, but not as good as google's best)
for me its kingfall > blacktooth > stonebloom
yeah the scores i got aren't consistent with the sense i get reading about kingfall vs blacktooth in forums
consensus seems to be kingfall > blacktooth
i'm really not sure ha.. both are good - and the scores just are what they are (which prob isn't much lol)
i wonder what extent their post training includes riddles / puzzles like that. reminds me of the time when they seemingly fine-tuned the answer to the digital clock question on gemini 1.5 pro
yeah but it's not just riddles.. like some are spatial awareness things; it just seems to handle them better
(which is generally consistent with 'bigger' models in my experience)
and also it revising the correct answer to my 'flag' question 😅
like that's just brute force knowledge recall / precision - no riddle or wordplay
stonebloom is not flash
kingfall, blacktooth and stonebloom are all checkpoints of the same model
said model is larger than 2.5 pro so if that's what it's getting on these benchmarks that's a little disappointing
perhaps worth a retry
internal full model names for these vs gem 2.5 pro's
oh dear.. if we've gone from kingfall to blacktooth to stonebloom - the regression is real aha
it's still quite far away from the release checkpoint
expect it to regress in some ways and improve in others while they're refining
yeah it's harder to test this compared to kingfall & blacktooth because they patched studio 💔
im not sure about "full model names" but there's a specific indicator of it in the internal stuff compared to 2.5 pro. if the full model name (not codenamed) was revealed i wasnt aware of it
i think thats what ur talking about there (the difference in values there)
i mean perhaps not the fully complete model name but the longer internal format
it feels like stonebloom is not even as good as blacktooth in my conversation analysis tasks
kingfall > blacktooth > 0605 > stonebloom
lmao
yeah idk about stonebloom
it feels generally like it's only got worse since kingfall
it also feels like stonebloom thinks less than both of the previous versions
My theory all along is that companies neuter performance and intelligence for deployability and sanitization, a model that can run many instances at once at worse performance is always preferred
yep
thank you for showing interest in wanting to do so! it does bring us joy seeing members of the community being really excited to contribute to LMArena.
that being said we wouldn't be okay with someone creating an app on our behalf.
Sad. I wish I knew what they were doing with it
The fact that it's thinking less suggests they're experimenting with inference optimizations. I can only hope that the consensus here that it's worse will be reflected in the ELO
I also suspect there is a lag between when they get the ELO scores back and when they have time to react
I recommend just using a PWA
Did OpenAI decline sharing o3 pro with you?
tbh I'm not sure about that specific model, but rest assured the team is interested in bringing the best and most popular models to LMArena
do y'all feel like your critical thinking skills (like deductive reasoning, systems thinking, awareness of cognitive biases, etc) have improved while using these generative ai models/products? if so, is it with the default option provided by ai labs or do you add on additional fine tuning notes to guide it down a certain pathway like Socratic questioning responses.
I have found most of these models lacking any sense of care for the critical thinking skills of the user and trying to cultivate it. have been exploring different techniques to get these models to be better mentors instead of excellent answering machines. expert modeling has been promising for me so far.
always take such studies with a grain of salt of course...
whoops
sorry it was 4o image
oh it's happening with everyone
the team is working on a fix now. sorry for the trouble.
hello
philosophically redundant btw
Depends on how do person/individual use AI.
Hey my apologies everyone I was out and now just catching up.
Looks like there was an outage but has since been fixed, so it should be working now.
So don't think and just read papers is your response?! Lol
it's stayed the same
i hope
206 pages. Lmao
Good point. I ain't in school so this essay writing specific study does not feel very relevant to stuff most knowledge workers do.
🥱
would it be cool if there was a AImusicarena like there's one for chat LLMs and AI coding arena, what if there was one for music. like i know suno never open sourced their models but that would be cool
was thinking to request this for the arena too! what a coincidence 😆
interesting idea! #1372230675914031105 share here
yeah because i would like a free version of the suno AI features which are "Cover" and "extend"
down again
yep
it is? I'm not seeing the same
you know you can edit messages right (im not being rude BTW) you spelled Down wrong
i think im crashing it by pasting large prompts
I was gonna use flux kontext thank you bro
you Also spelled Large wrong (just pointing it out)
it's working again for me, can you confirm you're seeing the same?
is it working again for you @grim axle ?
Yep great job @echo aurora
credit goes to our engineers
and the community for flagging
it's much appreciated
you could be hitting a limit, how many image gens have you done today?
I assume it's just that model giving you troubles?
Probably let me try another llm
Okay so all flux llms are not working
actually nvm my prompt was probably copyrighted so it wouldn’t go through
yeah they seems to be working for me fine, I'll keep an eye out though for other reports.
Also the GPT-image1 isnt working
replying in the thread
ok Also will you add reference images to the Gemini models
I'd encourage you to use the #1372230675914031105 channel
ok thx
The internet does the same thing not to the same degree but still
Wow, Discord has been hiding that channel from me this whole time 😵💫
🍍
oh no!! sry to hear that!
at least you know now
would check out the Channels & Roles section that's at the top of the channel list as well if you haven't already
Yup, managed to enable it 🙂
hmm I thought I had that auto enabled for everyone 
yeah it's enabled as a default channel
that's odd
When was it added?
~month ago ish
Maybe it didn't show up for people who joined before 🤷♂️
yeah could be
Oh, I had already joined before that.
this was the announcement around the time we made the change, may find other helpful bits of info in there #announcements message
Ya allah, please don't cause the server to "wipe" the chat history again
So sorry everyone, team is aware
is all good
524, even if you successfully enter the website there is only an empty model list
wait a bit
Please wait a bit.
will my chat history be "wiped", again?
is not really loading the page
probably not
Same for me
I know the chats will be in the database, but still, but hopefully not
why did that happen about twice
not sure
Also quick one: Anyone remember when ChatGPT didn't have chat history? Also remember when older chat history went inaccessible for a few weeks/months on the platform?
did you try and fix it
I’m seeing it up again too
i am not a dev of this lmarena
i am just happy to have access to this website
Should be working again
Chat history cleared again
well not permanently, will have to wait until the datasets go live
not working on all browsers sadly 😅
What? I've been using ChatGPT since almost the very beginning and it always had chat history
From the first response in my oldest chat
And the last one
how often benchmarks happen? if lets say Grok 3.5 would release today , when it gets benchmarked?
when new google model drop 😭
yall think grok 3.5 gonna be good?
20
22
2
no
Oof
Is it happening again? Is there an outage again?
It's fine, you guys are really interactive, really puts me at ease when I see you chat about the ongoing problems real-time
Ty for your work 
We appreciate that
I saw that old legacy had repo link chat. Do we have a similar new interface for it??
thank you for dealing with the site issues but since everytime when it happens, i lose all my chat history, so for the future can we please be able to register accounts where all chats would be saved rather than keeping on cookies?
!!! I lost my chat history 😭
Being able to save these chat histories as a feature is something we’re in the process of exploring, additionally though reliability of the site is something we’re focused on as well. We want to tackle the loss of chat history from both ends as we understand the frustration it causes when these are lost.
Why is this dude talking to himself
The site is up and running again 
hey I was just thinking whoever manages this server needs to add a monitor for lmarena on discord to see if the site is down or up right now (and probably a timer for how long it was down or up, or just straight up make a website for that)
A status page (for whether LMArena is up or down) is something we're planning on implementing. I'll advocate we get it up and running sooner rather than later. Having it linked with a bot to auto post to the server would be a rly nice feature as well so that's a good callout. I'm working on a bot that'll post when leaderboards are updated and new models are added, but yeah having site status also linked would be nice to have. Good idea
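For anyone curious what the up/down timer part could look like, here is a minimal sketch. It is not LMArena's actual bot or status page: `StatusTracker` and its methods are hypothetical names, and it only tracks state from HTTP status codes that some separate poller would feed in (for example the 524 people reported during the outage).

```python
from dataclasses import dataclass


@dataclass
class StatusTracker:
    """Hypothetical tracker: remembers whether the site is up and since when."""

    is_up: bool = True
    since: float = 0.0  # timestamp (seconds) of the last state change

    def record(self, status_code: int, now: float) -> bool:
        """Feed one probe result; return True if the up/down state changed."""
        up = 200 <= status_code < 400  # assumption: 2xx/3xx counts as "up"
        if up != self.is_up:
            self.is_up = up
            self.since = now
            return True
        return False

    def describe(self, now: float) -> str:
        """Human-readable status line a bot could post to a channel."""
        state = "up" if self.is_up else "down"
        return f"{state} for {int(now - self.since)}s"


if __name__ == "__main__":
    tracker = StatusTracker()
    tracker.record(200, now=10.0)   # still up, no change
    tracker.record(524, now=60.0)   # timeout error: flips to down
    print(tracker.describe(now=90.0))
```

A bot would only need to post when `record` returns True, so the channel gets one message per outage start and one per recovery instead of one per probe.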
@echo aurora
Ah sry I missed that! RepoChat isn't on the current site atm (assuming that's what you mean by new interface)
Ye like I saw there is a new site for webdev and beta lmarena too so was asking if we have something for repo link chat too
Not currently, but be sure to let us know if it's something you'd like in #1372230675914031105 so others can weigh in on the same request
@echo aurora glm 4 air arrives in the leaderboard or was he in the arena for nothing?
halo, is grok 3 latest on lm arena?
not on your behalf, unofficial status of it would be stated in the name
That’s not accurate, it will be good, but the question is will it be SOTA good
will the old site be shut down in favor of the new one?
What are yalls favorite text to image?
GPT-image1
no keep it because the new one is SO buggy and stuff
true
and it's missing the settings/parameters
like temperature
it just feels not so great to use
...
yeah and the limits and Downtime really makes it frustrating
what limits
i haven't faced any yet
model usage is practically unlimited I'd say
i mean image model has limits
someone calculate how many days its been since musk said grok 3.5 would release
i think i got it, was supposed to release 6th May
so 48 days late
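The count checks out if you assume the message was written on 23 June 2025. The log has no timestamps, so that date is an assumption; the 6 May release date and the year come from the chat.

```python
from datetime import date


def days_late(promised: date, today: date) -> int:
    """Number of days between the promised date and today."""
    return (today - promised).days


if __name__ == "__main__":
    promised = date(2025, 5, 6)   # "supposed to release 6th May"
    today = date(2025, 6, 23)     # assumed date of the message
    print(days_late(promised, today))
```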
how is that even possible
GTA 6 ahh moment
48 days late isn't even that much by Elon standards haha
He's been 12+ years late before
yeah but LLM training it should be really obvious if you are actually near release or not
its not a loosely defined end goal
Nah it's complicated enough that being off by a month or two is pretty normal. He'll be off by a lot more than a month or two though
Guys I think I found a question about English language that Grok 3, Claude 4 and Gemini 2.5 pro have it right, but GPT o3 or the deep research mode have it wrong
He's Mr. Overpromise and Underdeliver
https://x.com/gaydeer1225/status/1936964649364107317
I asked GPT if the "You'd help me..." Is a conditional, question or future in the past.
Gpt told me it was a conditional
Grok, claude and gemini, told future in the past.
Some guys on disc said it was future in the past
And a few ones on disc and reddit said it was conditional
What do you think?
deltarune spoilers
xD no worries, there are no spoilers haha
Ah no wait yes there were xD
But I'm still trying to find the right answer :'v
Can Gemini 2.5 Pro analyze music?
Musical instruments themselves... ie, their tone, melody, mood, genre, without the lyrics?
are you andrew tate
no, not close.
Honestly the best way to find out is to simply try it. They have improved video/audio input a lot more recently
This looks so much like GPT response, i've never seen Gemini respond like that before
LMAO submission #2 and #3
yeah those were fun
i hope you can fight against any google forms botting
I'll be monitoring it closely, shouldn't be a problem
you mean by uploading the music as mp3 file or just a description like the title of the song and composer?
uploading the music as mp3 file, or linking the music through youtube.
@agile heart shared a rly cool idea yesterday #general message
I'm going to start up a feedback thread about it.
I think in terms of classical music, just a text-based description could suffice
it would add another context dimension (to the theater play) too, if models understand the tone, rhythm, lyrics, instruments used just by reading the title of the music and its composer, I think
I think xAI is swimming in technical debt
only grok 3 mini is in that screenshot
i dont think the grok 3 (full) reasoning variant ever released either lol
Grok3 is not mid. It's ahead by a good margin over Sonnet 3.7 on artificialanalysis ratings. Most of the models that are ahead are newer and came out after grok3
also do not mix up reasoning and non-reasoning versions
That is not the point. It should still perform good overall and 4.0 is more competitive and does that. What I'm really saying is at the time of release grok3 was SOTA or very close to that. There was no other alternative that would be objectively better overall at the time
wasn't released yet
only o3-mini
February 17, 2025
grok3 release
it was released to the public, that's what really counts... Besides the early checkpoint (lmarena) did check out as a performant model
It was performant straight out the box
GPT4 API access was late and very limited as well, did not stop people from figuring out it's a strong model
grok 3 was a very good model when it came out
but considering the more aggressive post train and that most other labs did not focus on base models
it was a bit short of sota
even being mid, grok at least has a taste for classical music, and that already makes it one my fav now 😊
there was no "not focusing on base model" lol. Everyone is flat out all of the time and non-reasoning (chat "base") models are just as important as the reasoning ones, especially back then
but they were clearly not really focusing on training a new base model (unless you want to count the failed attempt for 4.5)
or improving it (outside of post training - which in my definition does not count as focusing on base model)
whoever didn't it's their failure. We have obviously seen improved base models from most of the labs since then...
Openai did a midtrain on 4o, and fresh pretrains for 4.1 mini and nano. It's not just post training
well i don't count the midtrain 4o, or the "new" 4.1 as really focusing on the base model very much
yes, i agree with you, when i say sota i am more referring to the sota the labs could do (though i know that that is probably not really the intended use of the word in this context)
It's funny that you mention that cause I'm actually completely the opposite and anti-Elon full tilt lol
but this doesn't change how grok3 actually performs
I would never pay for their SuperDork sub or however it's called lmao
But I did use the "early-grok3" lmarena checkpoint quite extensively, and then used it on grok website once grok3 was made available for free users
it probably is
I used similar theatrical acting on gemini and gemma yesterday, and got errors many times while grok seems to understand to play along my "charade" 😅
i guess those acting classes from ages ago are helpful to trick some models, but not all...
am certain it's not "corporate" tuning, i suspect rather some kind of filter? their response got midway cancelled and turned into err...
i can only guess
it was Claude
i have to agree with my man craig, grok 3 was sota at that point in time, esp. in math, its easy to criticize it in hindsight
is flamesong a new google flash line of model
yes
Do you think it’s Gemini 3.0 or another 2.5 model
is stonebloom a new iteration of 2.5 pro
no kidding 😮
I'm thinking that Stonebloom might be something like a "2.5-pro-lite."
I tested the models by asking, "what's the official title for One Piece Chapter 1117?"
2.5 Pro answered "Mo" (the correct title) every time I tried. Flash gave me nonsense/random answers every time. And Stonebloom answered "Mo" most of the time, but gave incorrect answers a few times.
"2.5-pro-lite" umm there's a word for this
big b said stonebloom is not distilled
same param count as kf
who is big b?
o4 pro #8
i want 4o thinking
hey uh is opus 4 thinking 16k down? for some reason I get errors when trying to enter a prompt...
what the hell🤣
well that doesn't look right, what's the prompt you're using?
write a very hard exercise in physics and solve it
A (Properly) Hard Physics Exercise
Quantum Mechanics – Δ-potential in a 3-D Harmonic Trap
A single non–relativistic particle of mass 𝑚 is confined by an isotropic harmonic oscillator of frequency ω.
In addition, it is subjected to a point–like interaction
[ V_{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . ]
The full Hamiltonian is therefore
[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). ]
Introduce
• the oscillator length (a_{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as
[ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r),Y_{\ell m}(\theta,\phi), ]
with (R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})).
For ℓ>0, the factor (r^{\ell}) forces
[ \psi_{n\ell m}(0)=0, ]
okay thank you for sharing, this is helpful
Ask it about the romanized name specifically, it's able to remember that more easily
you already have it, it's called o1
lol. Latex seems to work though unless this got fixed already:
oh..
output issue rather than interface issue, and if the prompt was that entire thing including everything after A (Properly) Hard Physics Exercise, it just assumed that's how you want latex to be formatted from now on...
is there a limitation of characters in the arena? @echo aurora
\[ \] iirc these are common latex delimiters but some parsers might not accept them by default. it looks like he copied the output via selection since it's missing the slashes (i've done this before)
o3 seemed to be using those delimiters
ping doesn't work if you add it after you edit the message lol
i didnt know this, thanks for telling me 😅
Well it seems to work for std latex, but the exact thing he pasted isn't rendered on chatgpt either 
works on chatgpt if you add back the slashes
lmarena isnt rendering latex using those delimiters it seems
\[ V_{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . \]
The full Hamiltonian is therefore
\[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). \]
Introduce
• the oscillator length (a_{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as
\[ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r),Y_{\ell m}(\theta,\phi), \]
with (R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})).
For ℓ>0, the factor (r^{\ell}) forces
\[ \psi_{n\ell m}(0)=0, \]
he pasted the output without the slashes because of how he selected it manually. o3 outputted proper latex delimiters (\[ \]). the markdown renderer omits it (and it's not visible in the rendered output), so when he selects it it's gone. so the second one is not a valid test
im not sure if new lmarena has a button to copy it directly (which will include those slashes)
anyway the fix seems to just add \[ \] as additional latex delimiters beyond $$ $$
if it fails the rendering it's different than copying the rendered text though
And as you can see same prompt chatgpt rendered much more
the slashes aren't visible because of the markdown renderer. the actual text output has it. (this is why his pasted output doesn't have them, as he selected the rendered output and copied it in his browser) also, the latex renderer doesn't parse those delimiters, which doesn't render the latex
the thing is if it's working for the input on lmarena then it should have been rendered there as well. Also model most likely sees them
input was exactly the same
on chatgpt latex only works for model output, that's why it looks different
Can we get video models on LMArena in the future?
many interfaces are treating user input the same way though, including lmarena (it added bulletpoint lol)
raw model output:
\[ V_{\delta}(\mathbf r)=g\,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . \]
The full Hamiltonian is therefore
\[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g\,\delta^{(3)}(\mathbf r). \]
Introduce
• the oscillator length \(a_{ho}\equiv\sqrt{\hbar/(m\omega)}\) The δ-potential acts only at \(r=0\).
The (unperturbed) eigenfunctions factorise as
\[ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r)\,Y_{\ell m}(\theta,\phi), \]
with \(R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})\).
For ℓ>0, the factor \(r^{\ell}\) forces
\[ \psi_{n\ell m}(0)=0, \]
the renderer renders markdown and latex. inside latex delimiters, e.g. $$, it will render latex.
it sees: \[ V{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . \]
\[ \] is not defined as a latex delimiter by the latex parser in the renderer.
so then it goes to the markdown parser/renderer. which omits the slashes in \[ \] => [ ]
then he selected (the rendered output) it and copied it in his browser, rather than copying the raw model output. (there's a specific button to do that in old arena)
selected and copied output via browser:
[ V{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . ]
The full Hamiltonian is therefore
[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). ]
Introduce
• the oscillator length (a{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as
[ \psi{n\ell m}(\mathbf r)=R{n\ell}(r),Y{\ell m}(\theta,\phi), ]
with (R{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a{ho}^{2})}L{n}^{\ell+1/2}(r^{2}/a{ho}^{2})).
For ℓ>0, the factor (r^{\ell}) forces
[ \psi{n\ell m}(0)=0, ]
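The slash-stripping described above can be reproduced with a small sketch. It assumes the renderer follows the CommonMark rule that a backslash before any ASCII punctuation character is an escape, so the backslash is dropped and the character is kept. `strip_backslash_escapes` is a hypothetical helper for illustration, not code from lmarena or any markdown library.

```python
import re
import string

# CommonMark treats a backslash followed by ASCII punctuation as an escape:
# the backslash disappears and the punctuation character is shown literally.
# string.punctuation is the same 32-character ASCII punctuation set.
_ESCAPED = re.compile(r"\\([" + re.escape(string.punctuation) + r"])")


def strip_backslash_escapes(text: str) -> str:
    """Hypothetical stand-in for what a markdown renderer does to escapes."""
    return _ESCAPED.sub(lambda m: m.group(1), text)


raw = r"\[ g\,\delta^{(3)}(\mathbf r) \] acts only at \(r=0\)"
print(strip_backslash_escapes(raw))
```

Under that assumption `\[`, `\]`, `\(`, `\)` and `\,` all lose their backslash, while `\delta` and `\mathbf` survive because a letter follows the backslash. That matches the copied text above: bare brackets and `g,\delta`.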
the actual problem is just this: \[ \] is not defined as a latex delimiter by the latex parser in the renderer.
if you replace \[ \] with $$ it works:
I'm not sure what you are trying to say or how does this change anything tbh.
I'm referring to this and it's pretty clear to me that lmarena renders less #general message
it nukes the slashes yeah, but this looks more of a side-effect of failed rendering in the first place, the display is not right comparing it to chatgpt
its not a single process. the latex parser parses stuff within specified latex delimiters. here it doesn't (because \[ \] isn't defined as a latex delimiter in the parser they're using). so it gets parsed as markdown, where the markdown renderer nukes the slashes. anyway, the actual problem is that \[ \] aren't specified as latex delimiters
$$ V{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . $$
The full Hamiltonian is therefore
$$ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). $$
Introduce
• the oscillator length (a{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as
$$ \psi{n\ell m}(\mathbf r)=R{n\ell}(r),Y{\ell m}(\theta,\phi), $$
with (R{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a{ho}^{2})}L{n}^{\ell+1/2}(r^{2}/a{ho}^{2})).
For ℓ>0, the factor (r^{\ell}) forces
$$ \psi{n\ell m}(0)=0, $$
Print this. no codeblock.
i simply replaced the bracket delimiters with $$ and it works
also they need to add \( \) as latex delimiters as well
i see in his output there's inline math with those delimiters as well
Never argued for it being "single" or not single process lol. My point was that it was immediately clear it is not rendered while it could/should have been after that valid test. Here's how the input should have looked (instead of nuking slashes from it):
\(r^{\ell}\) => $$ r^{\ell} $$ (inline math should be rendered via \( and \) as well)
the same applies for the model output. It most likely sees the full input as it was, not how it's displayed, but then the same issue is with its own output
youre not understanding me anyway. it really doesn't matter tbh. the fundamental issue from my long convoluted explanation is that \[ \] \( \) need to be parsed as latex delimiters alongside $$ in the renderer.
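One way to picture that fix, as a minimal sketch only: rewrite the bracket-style delimiters into dollar-style ones before the text reaches the markdown parser. `normalize_math_delimiters` is a made-up name, mapping `\( \)` to single `$` is an assumption (the chat above suggests `$$`), and a real renderer would more likely just register the extra delimiters in its math plugin and skip code blocks, which this sketch does not do.

```python
import re

# Hypothetical pre-processing step, run BEFORE markdown parsing so the
# backslashes are still present. Not lmarena's actual renderer code.
_DISPLAY = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)  # \[ ... \]
_INLINE = re.compile(r"\\\((.+?)\\\)", re.DOTALL)   # \( ... \)


def normalize_math_delimiters(text: str) -> str:
    """Rewrite \\[ \\] to $$ $$ and \\( \\) to $ $ (assumed inline delimiter)."""
    # Lambdas are used so backslashes inside the math are copied verbatim
    # instead of being interpreted as replacement-template escapes.
    text = _DISPLAY.sub(lambda m: "$$" + m.group(1) + "$$", text)
    text = _INLINE.sub(lambda m: "$" + m.group(1) + "$", text)
    return text


print(normalize_math_delimiters(r"\[ \psi(0)=0 \] for \(r^{\ell}\)"))
```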
this doesn't affect model performance in any way, its just visual
Well obviously it's just visual. I would also argue that trying to render user message is probably not the best approach in the first place either...
why not? i like it
it can be a mess as sometimes it's rendering things you were not meaning to be rendered. So hashtags become huge text etc
yeah but you could put it in a codeblock anyway. i like it the majority of the time
huh
@echo aurora Sorry for tagging, can I ask you to add the possibility to make a photo directly from the website? It would be more comfortable
hey
i was wondering if anyone has recommendations for an LLM that can replicate a specific design style with high character accuracy?
or rather what's the best in this category
for context, i wanna make another chapter of my storybook
children-story style
I don't believe so.
did the model get stuck or did it look like it was finished providing an output?
My apologies as I may be misunderstanding your question here so please correct me if I'm wrong.
If you click this little drop down you should be able to create images.
it felt like its answer were cut by a potential limit, to me
which model was it?
Don’t worry, it’s probably just my English. That function lets me generate AI images, but I’m asking about an option for taking a photo. Here’s the issue (maybe it’s silly): When I need to upload a photo from my device’s gallery or storage to the site, I first have to take the photo using my camera app, save it, and then go to the site to select it from my existing images. What I’d like is a way to take a photo directly and upload it to the site without having to save it to my gallery first. Is that possible?
grok
IMO this is the type of thing that will be a bit messy at first, get better over time, and eventually be so reliable that we can't imagine a world without it
okay thanks I'll try to reproduce the same issue. 
Gotcha! Thank you for explaining further. Yes, that's very possible. If you could share this idea in the #1372230675914031105 channel that'd be ideal. That way other members in our community can weigh in on if that's something they'd like to see added as well.
what is the max tokens set at? Grok3 can be very verbose when it needs to...
same with some of the chinese thinking models too
@surreal creek im inviting you to be more kind person. Lets make this world better together
Chinese models not always but often enough do have crappy fine-tuning. Grok3 is the 1 from a very few models which can output extremely long responses with thinking disabled, while not being verbose all of the time
Most models are either concise or verbose, that is not the case here, it really seems flexible...
So like, this is non-reasoning version:
?
how about let’s discuss AI benchmarking instead 😄👍
is there a possibility that human eval benchmarks push AI’s political views away from the academic consensus view they are trained on to more populist views that resonate greater with the average person?
when the Llama 4 Maverick matchups were fully released, I noticed that there was one individual mass prompting with political questions specifically selecting for which AI gave him a more “conservative” answer, if a push of this sort was organized on a larger level by some political group seeking to promote AI models that specifically advocate for their ideology, would it affect the landscape as a whole?
or would it just be similar to Elon currently trashing Grok 3.5 by trying to “dewokeify” it
I dont think academic consensus important for politics. Academy is always aligned with system even if not looks like that. So there is no bad thing if LLMs thinks like average person about politics. This is democracy right ? If we must listen some small elite group in academy, then it would be technocracy, not democracy. In the end of the day, politics is not about what is true or not, politics is about "which thing benefits who?" So its better the academics and LLMs not being talkative about that.
Btw im finding Maverick 3-26 experimental much better than final maverick version
Im not sure what they did but experimental version in lmarena certainly better
It does not. But this very much can:
Was only a matter of time before Elon tried to add his own biases to grok I think... As scary bad as it is
Doing this he wouldn't have to overfit on misinformation, if he is altering the entire internet of data instead
Finally he will be able to have a model that will tell him that covid vaccine is causing autism lmao
It's a good thing that OpenAI parted ways with him a long time ago and Grok is struggling to gain popularity in US, let alone anywhere else, that's the only silver lining
Grok could be most popular second AI because of twitter but i agree your concerns
@gr0k iS tHiS tRuE ?
I apologize for being rude before, it is exceedingly obvious that English is not your first language 👍
Yes, sorry about my broken english. Im trying my best
RFK Grok 😭
GRFK
Grfk Jr.
seriously, now after testing grok for some time, i very much look forward to grok 4 ✨
Twitter is only popular in US though, and at least half of that audience are firmly against Musk and everything he creates
He was doomed the moment he decided to get political, and even more doomed once he started parroting misinformation and far-right crazy bs
He probably thinks he can control people the same way leaders of completely corrupt and oppressed regimes can... 99% he doesn't believe most of the stuff he puts out, but it serves the purpose
I think it could work. It would change the model in some way for sure, and he has all the money in the world..
gentle reminder to avoid political stuff unless it's specific to AI please 
politics aside, simply look at grok as a neutral competitor in this crazy ai race, i must say xAI dev deserve a raise for making grok such a sweet delight, well-versed in classical literature, classical music and theatre plays 😊 it makes the interaction...very natural and humbly human
I think it will definitely be interesting to see how far you can stretch an LLM to favor some political viewpoints while still maintaining functionality. Not something I’d personally spend a billion dollars on, but will be neat to see
what is really sota about xAI is how fast they raised their valuation
I mean in theory you could just dump data into AI and instruct it to rewrite it to be more far-right aligned. It is smart enough to understand what you want. To do it more efficiently you could make it alter text as well instead of rewriting (fill in the middle etc) which would be much faster. Even small things can have an effect if applied on a big scale
and then once you train on that, the entire pattern matching and probabilities will be shifted to align more with that manipulated biased fake data
its just easier and way cheaper to do this in post training. doing it during pretraining is gonna be an expensive research effort
it's also way less effective to do it post training
you are going against the entire internet and what it already learned
so you either degrade the performance, overfit it, make it almost unusable on many subjects, or all of those lol
imo i think it can still be done effectively. depending on how the pretraining data is curated, most models will encounter those views and will know how to repeat those views anyway. this doesn't even require expansive rewriting/etc. plus i dont think even chinese models do that, it's expensive and complicated. youre researching propaganda models not frontier performance, its a huge waste of money, if you want propaganda spewing models there are far more effective and cheaper ways to accomplish it
Chinese models are a good example of why doing this in post training doesn't work tbh
they dont do much on it
and yet they can't make it work even on those very limited topics
theyre not exhaustively doing post training on that stuff
they arent putting much effort into it, thats why it seems weak
you are just assuming that. The fact is many Chinese labs tried it and we don't have a single example of this being done effectively
yi models iirc didnt even stop themselves at all on tiananmen square and you would see a western model reply on it
they just added an external filter to cut the model off/replace the response
no i think its exactly that
Degraded performance is most definitely one of the reasons, meaning it's reasonable to assume the opposite I would say - models that did do this more effectively were not even released
they arent putting much effort into chinese political alignment as they could potentially be
yi models are uncensored about tiananmen square with no jailbreak 😂 but rip those models
they only added an external filter on their chinese api 🤷 i guess it was compliant enough at the time
Yeah that's true as well. What Elon is believed to be trying to accomplish is to be on a considerably larger scale. Though I wouldn't discount Chinese labs as "ticking the box" entirely, many of them share the same values of their government and have deep roots in it
Like they are the ones benefiting from it
So for them the current system works, and I would be very surprised if behind closed doors those labs have different opinion on Taiwan etc
yeah pretty much... And especially for those directly benefiting from CCP, this is even more true
hmm, testing it again with yi lightning (last i tested this was on a different yi model):
is this fearmongering to make us believe in emergence craze? 😅
https://www.youtube.com/watch?v=eczw9k3r6Ic
In the last few days Anthropic have released an impressive honest account of how all models blackmail, no matter what goal they have, and despite prompt warnings, and other preventions. But do these models want this?
why does dario look so depressed
nahh his channel pretty fair, one of the better ai channels imo
The open source Open AI model coming out this summer will run on a phone and be on par with O3 Mini?
i see, a self fulfilling prophecy so to speak
And they do that to some extent. But only if models were safety aligned. If that is not the goal when training the model and you don't fine-tune on safety, it will obviously not refuse essentially ever, it will generate the continuation for everything
No model is 100% "safe" in all cases, I don't think that is the goal
but the fundamental idea still works
it will refuse blatant extreme system prompts
you can trick the model, but if you can do that that usually means your intelligence is on a level where you would be able to retrieve the same information using other means as well
the current system prevents low intelligence psychos from wreaking havoc easily, so in that sense it kinda works
I mean my point is it will refuse low-effort blatant extreme/damaging system prompts, that paper does not dispute this
Personally I like to speculate on what is known. This seems a bit like a speculation on a fairly distant future that may be redefined and in need of entirely different solutions sooner than it becomes reality...
For current models that we have it is not very relevant IMO
you can't force that though. There's no enforced safety alignment on nukes 🤷♂️
if someone has the funds, he absolutely can train AI for anything and it's impossible to prevent this
However Trump campaigning to ban AI regulation is the opposite spectrum of extreme and obviously not the right move too
Individual can't make AI to do anything though. Only huge companies with insane funding can, often with power and/or links to the government. With nukes you also need power+money
In some sense this is comparable to technology advancing in general. It's possible now to do way more damage with less than 50 years ago
so it tends to amplify both good and bad
didn't opus 4 try to contact press and regulators when it was tasked to do something immoral tho
i remember reading that from anthropic
i get the point youre saying though
Research suggests RL does not add any new reasoning pathways. Also, some models like Qwen improved from RLVR even when the data is labeled incorrectly due to internal bias.
Can someone please update the repochat database, I lost a python script to not ticking Auto Save, and a Windows update came along in the middle of the night without my consent.
@echo aurora stonebloom is broken in webdev arena. It's fine on lmarena.
thank you for the flag, I'll create a post in #1343291835845578853 with some follow-up questions.
i figured out what the aeris guy is doing
i think that's a bit of an overreach
being vehemently anti-china is problematic but "ban all who criticize chinese ai" is also problematic
New model in Image Arena: kordex-can
Would like add on here that discussion should be focused on the model or organization and not where it’s developed. Different places will have different laws and practices for how they develop AI and that’s fine to discuss, but when it turns into blatant hatred or something unrelated to AI is where we’ll draw that line. Sometimes when that line is crossed isn’t always crystal clear, but we’ll do our best to enforce it. If anyone feels like we aren’t enforcing our rules or creating a welcoming space you’re encouraged to reach out directly and let us know. My DMs are open (although using the ModMail bot is preferred).
This summer with their open model
a new model has arrived on the leaderboard,
I really don't understand why you put it in the arena 🤦,
there are plenty of interesting models to put
M1 arrived in the leaderboard
Magistral medium arrived(much lower than mistral medium 🤦)
I like how the o3 is slowly rising and gemini is slowly falling
We haven't got new 4o since 03-26 👀
you couldn't bother reading the paper 😱
i didn't know there was a paper but i expected this much
everyone knows by now that gemini 2.5 pro is extremely susceptible to prompt engineering and roleplay prompts change its attitude more than any other model
2.5pro's falling is most likely the result of anon models like blacktooth and stonebloom
What is it?
2.5pro is out and on leaderboard? Surprised no tweet
Wouldn't this also result in drop of o3?
Based on the same concept that drives the hype around this entire industry: vibes
i'm using gemini 2.5 pro to translate a full novel zero-shot and it's good
i never tested it out with a text as long as this but wow
Sometimes, other times is way too lazy. That being said, way above gemini
One explanation is that they have more advantage in areas where 2.5pro excels, while o3 possesses merits that some Gemini models (including anonymous ones) collectively lack.
May be. Also the distribution may have shifted of the voters themselves.
Or polymarket guys stopped spam 😄
Lol, now that gemini is top by that big a margin theres no point
Until grok 3.5 or gpt 5 comes along ..
The writing style is also concise and packed with jargon + gen z, modern slang to be relatable
Don't forget the — for dramatic and academic
Ive been trying o3 with tools too, its quite a monster
One of the things that Gemini lack so hard
Claude and o3 are good at tool use and agentic programming
Exactly. I would have expected them to switch to o3 though for bigger gain.
Will try claude code too, but so far it seems a bit worse than o3 for me.
Gemini tho .. i cant stand it , 2 page long answers that are just wrong
hallucinated?
Yeah, also its just generally wrong, i dont get the hype around it. The real world use case of the model is trash
Btw i found a new method for getting the leaderboard data without polling 👀
Give me a prompt I'll test it
4 ways to get the leaderboard data from lmarena.ai
You mean latest ranks or anonymous model stats?
Search grounding enable?
Yeah
Latest
I don't think this is a channel to talk about it 😄
They gonna fill in the holes
LMarena could just release API for fastest updates and make some money
Here are four ways to get the leaderboard data from lmarena.ai, each with different levels of technical complexity and directness.
1. Directly from the Hugging Face Space
LMArena maintains a Hugging Face Space that mirrors the official leaderboard. This space contains the source code, including an app.py file, which generates the leaderboard tables. In a community discussion on the Hugging Face platform, a site representative pointed a user to a specific line in this Python script as the place to programmatically dump the data. This is the most direct and sanctioned method for accessing the raw data structure. You can navigate to the "Files" section of the Hugging Face Space, locate the app.py file, and adapt the code to export the leaderboard data into a format like CSV or JSON.
2. Utilizing Pre-existing Data Dumps
The organization behind LMArena, LMSYS, periodically shares datasets with the community to support open science and research. These datasets include anonymized voting data, prompts, and model answers. While not a real-time feed of the leaderboard, these dumps provide rich historical data. You can find these datasets on their Hugging Face page or linked in their blog posts, such as the one for the "Search Arena" which open-sourced its dataset and analysis code. This method is ideal for research and analysis that doesn't require the absolute latest rankings.
3. Web Scraping
Web scraping is a common, though technically unofficial, method for extracting data from websites. Several articles and projects detail how to scrape the LMArena leaderboard. One approach uses AI-powered tools like DeepSeek to automatically extract the rankings, model names, and scores into a structured JSON format. Another, more traditional method involves writing a custom script using libraries like Selenium to parse the website's HTML. However, it is critical to note that LMArena's terms of use explicitly forbid programmatic access and scraping of the website. Proceeding with this method carries the risk of having your access terminated.
4. Browser Extensions and Community Tools
Developers in the AI community have created tools to interact with the LMArena site. One example is a browser extension available on GitHub that allows users to maintain a personal leaderboard by tracking their votes. While this specific tool is designed for personal stats, its existence demonstrates that the website's front-end data can be programmatically accessed and repurposed. You could explore GitHub or developer forums for similar community-built tools designed to export or track the main public leaderboard, or use such projects as a starting point for building your own tool, keeping in mind the site's terms of service.
Indeed ..
Its 3/4 wrong
Lol I saw someone made a chrome extension that logs your lmarena votes. That's the most creative way to get the anonymous model ranks before official release I've encountered 😄
For the creators I mean
Which one wrong?
1. Is correct data source but wrong extraction method
2. Is just wrong, its historical data
3. Web scraping is correct, the method suggested on how to do it is plain wrong
4. Is wrong, its not for getting leaderboard but for keeping track of your own votes
i dont think theres a way to get realtime rankings before the official leaderboard repo is updated
There is .. but i just found out
So gonna make 5-10% profit after grok gets released and google still wins xD
well...I've figured it out through rhetorical debates with aeris, but its creator still is deeply convinced of its "emergence"... despite having an advanced academic degree
I sent feedback 🥴
I wouldnt go so far as to call it xenophobia, banning wont help those people criticizing cn ai to think critically either, to the contrary, it will exaggerate the effect even more
Can't model providers basically cheat by returning blank responses for prompts where their model perform badly
E.g. if the reasoning overflows, return blank (because that means the model got stuck)
They have measures so that it doesnt happen
Or people will choose the model that actually worked and the blank one won't get a vote
They exclude rounds where a model has no response when counting the votes.
W staff
So instead of losing, a provider can technically prevent a round from being counted when they know their model is stuck.
RIP stonebloom in webdev arena. Bro can't even generate anything. Pure blank
i didn't mean it like that i should've elaborated more sorry
I notice some models like R1 0528 tap out if the prompt is too hard
But on the app it works ...I tried the same prompt here and there and deepSeek worked well sometimes better than the other models
Doesn't the huggingface space just contain .pkl files which are generated by manually running scripts? Those are the latest (but still not realtime)
I mainly use WebDev. I have a prompt that causes DeepSeek to return blank every time. Prowlridge and blacktooth were similar too.
Non-thinking models like Mistral Medium had no issue
Me too ...I have some prompts when deepSeek never answered on arena webdev but when I opened the app it worked well ....maybe because it thinks for too long
Yeah. This gives those models an unfair advantage, since it lets them tap out on problems they get stuck on, when it should have been counted as a fail.
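A toy illustration of that concern, under loud assumptions: the numbers are made up, and a plain win rate stands in for whatever rating method the leaderboard actually uses. `Round` and `win_rate` are hypothetical names. It only compares counting a blank response as a loss against dropping the round.

```python
from dataclasses import dataclass


@dataclass
class Round:
    model_responded: bool  # False when the model returned a blank response
    model_won: bool        # vote outcome; only meaningful if it responded


def win_rate(rounds, exclude_blank):
    """Win rate with blank rounds either dropped or counted as losses."""
    counted = [r for r in rounds if r.model_responded or not exclude_blank]
    if not counted:
        return 0.0
    wins = sum(1 for r in counted if r.model_responded and r.model_won)
    return wins / len(counted)


# Made-up data: 6 answered rounds (4 wins, 2 losses) plus 4 blank rounds.
rounds = (
    [Round(True, True)] * 4
    + [Round(True, False)] * 2
    + [Round(False, False)] * 4
)

print("blank rounds excluded:", win_rate(rounds, exclude_blank=True))
print("blank rounds as losses:", win_rate(rounds, exclude_blank=False))
```

In this toy setup the same model scores 4/6 when blanks are dropped and 4/10 when they count as fails, which is the gap being described.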
Correct
and theres a way to get newer ranking than that?
Yes
oh
Its hidden
You have to do reverse engineering to find it, took me a whole day :/, i hope its worth it
Are you on polymarket too?
I'm just buying google
Yeah i assume grok 3.5 and even gpt 5 will not overthrow google
Damn
unironically calls for censorship lol
It blows my mind that people think there's only 4/100 chance that gemini won't be overthrown. It happened many times in the last days of the month 😄
When?
no o3 pro on lmarena
kek
Few times
There's still a chance for: Grok 3.5, the 4o new variant, DeepSeek R2, even GPT 5 😄
there shouldn't be imo (and prob wont).. it's already tricky to make it a 'battle' b/w thinking and non-thinking models
In 5 days?
kek
adding models with parallel CoT and synthesis wouldn't seem right (aside from the costs etc)
It was like 2 or 3 days when Gemini 2.5 PRO came out
I mean my point for R said Gemini 2.5 Pro will be beaten in June, which only o3 pro can right now. Grok 3.5? They keep delaying it so who know
It seems the style control really made the leaderboard better. Good thing I'm not on polymarket.
This would require them to first submit a test model, then spend at least 3-4 days collecting data, then update the leaderboard before the fifth day, and exceed 2.5 pro with stylecontrol unchecked.
I think the probability is far less than 1%.
Style control and Gemini still on top
Do you personally place this chance at 4/100?
yeah sorry to intrude ig aha (i hear what you're saying.. but yeah don't think o3 pro will be added; but also don't think that's excludes oai entirely - not to <2% or whatever it is.. just like who knows ha)
The reason the market is 97% instead of 100% is, I think, almost entirely due to opportunity cost
actually... in terms of the leaderboard..
if it's for June.. then yeah..
Yep it's for June
tricky to see an OAI model surpassing tbh ahah
0.000000000000000000000000000000000000001/100
Not necessarily. If LMarena would tweet something like "Super good model DeepSeek R2 was released in arena", the required 1k votes would come very fast.
With such opinions you would have been wiped in march
whether they feel it or not.. polymarket introduces all kinds of 'pressures' .. like if the lb doesn't update b/w now and 1 July for whatever reason (just hypothetically), then the current standings would apply (for the end of June bet) right?
nebula was in the arena at the time
Do you remember when it was released?
I've checked the polymarket but the data is not present anymore
0325
On the arena
Um we are talking about June here. Is it related?
The world didn't change
Since then
One thing in LLMs is constant - unpredictability
So its 10 days to the end of the month
Another scenario: the anonymous models currently in arena, which seem better than newest Gemini, are actually from another lab and not Google.
Nice
which one?
Another scenario: lmarena decides to serve anonymous models only to a subset of users, none of whom cares to check the lab origin
Stonebloom is the only one that might be better than 2.5pro, and it is also a google model
so profound
Cheezy but it's true
Wdym
this
Idk this is hypothetical, someone mentioned two models earlier
I can see you're really invested into polymarket to care so deeply
My idea was it's never 4/100 in LLMs
Too many unpredictables
I never knew polymarket until today
:)
I want you to stop roleplaying and rage baiting me
According to Ourobaros chances of Gemini dropping from No. 1 spot is equal to that of Jesus returning in 2025
It's still opportunity cost. A 3% gain in 5 days is not the same as a 3% gain over 6 months
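To put rough numbers on the opportunity-cost point, here is a small sketch that annualizes the same 3% gain over the two holding periods. It assumes, unrealistically, that the gain could be compounded at the same rate all year. `annualized_return` is just an illustrative helper.

```python
def annualized_return(gain: float, days: float) -> float:
    """Compound a gain earned over `days` days up to a 365-day year."""
    return (1 + gain) ** (365 / days) - 1


five_days = annualized_return(0.03, 5)       # 3% locked up for 5 days
six_months = annualized_return(0.03, 182.5)  # 3% locked up for half a year

print(f"3% in 5 days, annualized:   {five_days:.0%}")
print(f"3% in 6 months, annualized: {six_months:.1%}")
```

Under that assumption, 3% over half a year is about 6% a year, while 3% over 5 days compounds to several hundred percent. That is why a few points of "free money" on a short market can still be priced in.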
Sorry if this made you angry, but that's what you were saying.
I'll report you if you keep making conspiracy theory without the source
I dont get why anthropic is 6% for december
They are code focused no general models + no new models anytime soon ..
They have a good team and competence for this. Maybe they expect the chances to go up before other major releases.
If you're looking for free money, you can check out these markets below
None is more promising than Anthropic, and some have even gone out of business.
People just don't want to lock up their money in it for half a year
Damn
I think Anthropic “doesn’t play the game” as much as other companies
Remember the market is simply highest ranking model on lmarena
But yeah its not worth to leave $ there for months , i have better roi on other stuff
Google is constantly gaming the leaderboard to find out how to eke out slightly more Elo
has anyone tried the seedance video model? is it the best?
True, google does a lot tho and specifically in lmarena.
it was a joke bro 😭 chill out on me
Yo why isn't there qwen3 0.6B,4B,8B and 14B in lmarena leader board?
FWIW I do think Google has good models but l think when the margins are so slim at the top of the leaderboard the extra gaming helps tremendously
what's the use/msg limit in direct chat?
They have so many testing models
if the german government forced AI companies to ensure LLMs said the holocaust didn't exist, or to refuse answer questions about it, you'd have a great point there...
Oh boy holocaust is the kind of topic that you either agree with or get banned, or in case of Germany agree or get arrested
similarly missing the point entirely
mistral doesn't train its models to accommodate German hate speech laws
anyway... this isn't going to be productive
Is stonebloom still in? All I get is kraken.
Are they good??
GPT 5 will dethrone it and it releases this year
I mean as the new models are introduced, can we increase their priority of appearing rather than old ones??
I don't battle much but like I got stonebloom once in 3 days 😂😂
Well a person can dream
developers, please improve sampling. gemini is almost unusable under these settings
what's the use/msg limit in direct chat?
depends on a model it seems
but it refreshes within a chat after a while, so it's possible to continue
oh alright thanks
I got stonebloom a minute ago first try
wdym? Messages get cut-off?
no creativity at all,it is stuck on the basic assumptions. too low temperature or top-p
I can say with confidence their settings are not an issue. Model less creative, or more likely... It simply generated a response you did not like at that time. Or you are using a smaller model (Flash etc)
gemini-2.5-pro
direct chat?
yes
go here and you can change these settings yourself https://legacy.lmarena.ai
But I think default is temp 0.7 and top_p 0.95-1, so unlikely this will make much difference unless you push it beyond 1.0
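(A minimal sketch of what temperature and top_p do to a next-token distribution, for anyone unsure which knob matters. The logits are made up and this is the generic textbook procedure, not LMArena's or Gemini's actual sampler.)

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


if __name__ == "__main__":
    logits = [4.0, 3.0, 2.0, 0.5]  # hypothetical scores for four candidate tokens
    for temperature in (0.7, 1.0, 1.5):
        probs = top_p_filter(apply_temperature(logits, temperature), top_p=0.95)
        print(temperature, [round(p, 3) for p in probs])
```

Lower temperature concentrates probability on the top token, and top_p drops the unlikely tail; with top_p close to 1 almost nothing is cut, so temperature is the setting that changes the output most.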
free money
Hey are lmarena links sharable? Like If I send someone a direct chat link could they see what was in it?
Anyone try out gemini-cli?
https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/
https://github.com/google-gemini/gemini-cli
Sorry to say when you share a chat link it doesn't share the chat conversation. This is something we're looking into.
thanks! Glad it'll be added at some point then.
I wish more sites banned people for false reports like this like 4chan does
first impressions not great.. got a little stuck
Same here
I don't think it will have any impact whatsoever tbh. Everyone was already training on this. Court ruling is just a technicality after the fact meaning they won't have to spend money to make this go away
Like meta was torrenting books, and everyone else is not any more saint, what are we even talking about here... LOL
copyright was never really a bottleneck, in practice at least...
they train on it, and then they "ask for permission", or prevent the model from disclosing it / getting caught. Or wait for court ruling like this one with the model already in production. But either way, no one is waiting for permission 👀
Thank you
lets move on from this topic
worst part about it is i was semi trolling
im so sorry
i didnt know people would be that invested
i deleted it just in case as well
I wish I could detail my report more to report the trolls, and I reported your message to Discord too, hope they find a way.
i want them to use copyrighted material
I agree but they have to pay for them.
the rights to one like mickey mouse can bankrupt them alone
them shits are expensive
and most companies dont even sell their rights because they dont want people to use their art in certain ways
They’re selling dollars for 88 cents
does anyone know how to undepress gemini
Depressed Gemini
Say something to motivate it
Don't let it uninstall itself 😭
update: may have solved this by sending a :(
its been thinking for over a minute now
oh it was just rate limited
why did it do it twice?
you already switched to 2.5 flash...
why don't you just prompt it
?
is there no system prompt
this is gemini's claude code competitor ("gemini cli")
just tell it not to acknowledge x thing, treat all interactions as X, maximum response = technical context only
stuff like that
or also simply add: state facts directly without apologies or self deprecation
tell it to use active voice
New model in Image Arena: kordex-can-on
@echo aurora can we add a change log channel on discord which makes announcement of any changes you do
Like adding a model too
even the anonymous ones?
A role can be used to ping if they are interested
https://x.com/lmarena_ai/ usually announces the public ones so i suppose it wouldn't be too much of a stretch to extend it to here though
LMArena: Open Platform for Crowdsourced AI Benchmarking. Graduated from UC Berkeley / @lmsysorg. We’re hiring: https://t.co/1OkfLq2n0I
That's for you mods to decide
Like if it's convenient you can do that, if its not then nothing we can do
hi
What are your guy's impressions on minimax-m1?
I ran some prompts with the intent of prompt-improving (as, here is a not-well-articulated-prompt, please improve it for result X) and it performed really well
yup! we actually have plans to build a bot that'll do just that!
polymarket says there's only 30% chance gpt5 comes before july 31st lol
Aider bench must have the most correlation with LMARENA leaderboard
I mean .. its a big event, likely to be delayed to august
Whats more interesting is the 90% chance for open source model. Im going all in on that but once the release is close
Well gpt5 before Dec 31st is also at 90%
So it's the same
The open source model has been delayed already
It was expected before July before
And this before June 30th for GPT5. Was also delayed
solo lost some money because of this market 🫣
R1 still the open-source king 😇
and qwen3 absolutely flops on SimpleQA lmao
although I can't say that I'm extremely surprised
How can I use seed thinking 1.6?
Through API only?
you mean this?
dunno but it's still behind LOL
No, like each model has their own style. I want to try them out even if they aren't topping the benchmarks
Especially creative writing and multi turn conversation
there's probably no API or it's only to Chinese citizens. Though you can try it there https://www.volcengine.com/experience/ark?model=doubao-seed-1-6-250615
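Volcano Ark (火山方舟) large-model experience center: try it without logging in and enjoy the latest models such as DeepSeek and Doubao! Volcano Ark is the large-model service platform from Volcano Engine (火山引擎), offering model training, inference, evaluation, fine-tuning, and other end-to-end features and services, with a focus on supporting the large-model ecosystem.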
even this website is all Chinese with no apparent way to switch to English lmao
it's slow though, 12tok/sec. Took 10min to generate 26k. MCP and Canvas you can only use when signed up with a phone number and my country is not included in their list... 
I'm curious to try their MCP (tools), this model does have solid fine-tuning at a first glance. Unlike most other models that perform well on TAU, this one does not hallucinate running the code with no tools available. It gets very close to doing that but kinda stops itself and realizes it can't actually run code
the one I linked yeah. They don't seem to be blocking IPs
I think you can simply buy esim and choose one that is listed there
Seed 1.6 Thinking seems to be their best model right now
Direct link to Seed 1.6 Thinking:
https://www.volcengine.com/experience/ark?model=doubao-seed-1-6-thinking-250615
I got it 1626626 times but it never answered always empty
Yes there is. Right click and "Translate this page into English" using Google Translate
Better than nothing
well obviously... I was talking about their website version. With com domain ideally it's supposed to have English lol
doubao.com doesn't have English either. It redirects you to cici.com
Nothing spectacular but it looks interesting enough to warrant testing it more extensively
Seems to be around the level of the open-source SOTA, potentially somewhat better when we look at tools and their finetuning
Do you know which model they are using for cici.com?
Seems like a non thinking seed
im kinda confused.. melancholic tone aside, isn't admitting defeat here a good response (versus it pretending to have figured it out and confabulating some useless/nonsense 'answer')?
or is the idea that it should actually be able to resolve whatever the issue at hand is, and it's basically being lazy (and sad aha)?
if so then yeah ig prompting might help (but otherwise it seems the task/problem is just beyond its capabilities 🤷‍♂️)
This seems a good response yes. A welcome change to how it used to be with the model trying the same things over and over in a loop.
is there a place where u can get unlimited uses for claude
@ocean vortex why did you leave chatgpt server?
btw kouhe3 shared a link where you can try multiple chinese models
just search for ai dangbei
Honestly it's a shit-posting pit with not much reason to stay. Feels very much like a one-sided affair if you actually try to post useful things there lol
i wonder what claude server would look like if that exists...?
smh... re-join
its not always about sharing something useful
we can just troll sometimes
@gork is this real without system prompts staging this
