#general
1 messages · Page 104 of 1
Indeed.
im sorry i dont understand
Correct, the point of AI art in advertising or for other use in corporate media isn’t for it to be GOOD art - it’s for it to be good ENOUGH that you don’t have to pay a real artist
heyy, anyone here who uses higgsfield for product photography. i would love to have a chat with you:)
i had asked this^
hello
you said Higgs field
I suppose that can be true sometimes. However, there are some companies that genuinely want to make a good product that tries to perfect real art artificially.
so they responded with “Higgs boson”
What is Higgs bosson
pardon my english
Higgs boson deez nuts in ur mouth lmao Gottem
correct, I would just argue those companies are the exception
wow
I guess.
shook
mf doesn’t know what the Higgs boson is 😂😂
Kiss yourself
Sorry I guess
Got this from Google:
The Higgs boson is a fundamental particle that confirms the existence of the Higgs field, a ubiquitous field responsible for giving mass to other fundamental particles, such as electrons and quarks. Proposed in 1964 by Peter Higgs and others, this elusive particle was finally confirmed in 2012 by the ATLAS and CMS experiments at CERN's Large Hadron Collider (LHC). Sometimes called the "God particle," the Higgs boson is unique because it has zero spin, no electric charge, and no strong force interaction.
bruv
Im so confused
You didn't knew about this?
😡
kids these days
do you chokon
-# braindeads these days
Not at all, actually, no.
I'm just learning about this.
don’t even know about their scalar fields and zero spin elementary particles 😂
After all, I'm not in college just yet.
too busy learning pronounce and blue hair liberal dye 😂😂😂
most sane ragebait:
I knew about the Higgs boson when I was 9 and that was in 2011
Before it was even discovered in 2012
just the type shi I’m on ig 🤷🏻♀️
24 and on discord
if he doesn’t know about chokon I bet bro doesn’t even know about the stigma particle
Discord member since 2017, can’t do math either it seems
I think they must've been off by one number.
How can I stop the texts from a person here in the chat that I don’t wanna see?
Hello y’all
If you right-click on them and then click on "Profile," you'll find a "Block" button in the menu for their profile. You can click that, and it will block them as well as their messages.
Also, greetings.
TikZ drawing comparison
Claude's looks the best.
Hi there!
Howdy.
hi there
Greetings.
I don't see a "new" label here
@echo aurora is there any chances that we can get Multi frames video generation - like 1st frame and 2nd frames ai morphing thing
interesting, this was taken 2 days ago
Maybe a new model is about to come out 😏
the ab tests 🔥 🔥 btw!!
its likely the case i think
There's so much hype about Gemini 3, I hope it lives up to it
oh it isnt gemini 3
Anybody having problem with lmarena site now?
there will be 1 more batch of 2.5 models apparently
What problem specifically?
Oh :/
20 mins ago when I hit enter it just stuck on loading
It's working fine, 90% of the problems are Cloudflare-related lol
And now I got failed to submit feedback error
2.5 ultra?
no
Reload the page, it'll probably ask for a CAPTCHA, then it will load
Well, if that's the case, then a nifty little trick I learned is: if you click the new chat button, then you go back to the chat you were just in, then it'll refresh the progress on the chatbot or image generation model.
Ahh it works fine now. Thank you everyone
Our pleasure.
Is it CAPTCHA related? e.g. CAPTCHA expired
I believe if it were CAPTCHA-related, then it would be showing a different thing rather than just loading forever.
When I refreshed it didn't asked for CAPTCHA or anything I just refreshed and no more loading
In the console there are Cloudflare errors
Figured.
The final AGI test: fixing LMArena's network connectivity issues /jk
I would know because sometimes the same thing happens to me too, except for image generation models. So when that happens, I click on "New Chat," then go back to the previous chat, and it refreshes the progress on the image generation models and displays them.
i'm afraid of the world of the future
That's a fair thing to be afraid of.
After all, AI is already evolving at a rapid pace.
really? are you a kind of a prodigy kid back then 😳
Then again, someone in the future could make an AI that is able to accurately mimic human emotion and typing style. Claude already types sort of like a human and sounds sort of human-sounding. So I don't think it will be too long until we have an AI chatbot that sounds human when typing and can express emotion.
-# Although i don't believe that AI could ever be conscious like us, i believe AGI is possible, because it doesn't need to be conscious, to be general AI.
Meh.
true
Who knows?
If we have AIs that can code up perfect games, then we can also have AIs that write perfectly like a human, with no noticeable flaws that would make it stand out.
Perhaps.
Only time will tell.
this would have been a perfect moment to use
-# we will have simulacra
Indeed, it would have been a perfect moment to use
-# We will have Simulacra.
I agree, it ineed would have been a perfect moment to use
-# we will have simulacra
Evil paws:
we will have simulacra
ip grab
We will truly have
Simulacra.
It ain't an IP grabber, thankfully.
Just some old website.
real
.
Real.
.
Nothing much, and you?
.
Greetings.
Someone forgor to include one additional * in regex
Well, if you put a character before the header formatting, then it will let you.
For instance, this...
?
Testing.
I believe this is because it only looks for a hashtag first, instead of any hashtag in your message.
If your message has a hashtag as the first character in it, then it'll trigger the filter.
🙀
“𝐒𝐜𝐚𝐦 𝐚𝐥𝐭𝐦𝐚𝐧”
-# — Elon Musk
Yes it's only a match for # <any>, not <any>#<any>. Though to be fair you can't just do the 2nd one literally the simple way, or you wouldn't be able to input hashtags at all lol
🙀
“𝐏𝐚𝐲 𝐮𝐩”
-# — @ocean vortex
That's true.
-# 😭🙏
They have some kind of RegEx detection system here.
Interesting...
It doesn't work for spaces, but does for characters.
¨
Test.
Hello
Howdy.
Hello friend
Greetings, fellow friend.
Ok
Well, it's a very common greeting, and it's a nice way to show respect.
@pastel bone 😭
If more people know about this, the creators will make more money.
Heh.
Oh bad
I help u
no worries il just keep the one ive got on their
out of the 8 i did its the only pretty cool one
🙀
“𝐒𝐜𝐚𝐦 𝐚𝐥𝐭𝐦𝐚𝐧”
-# — Elon Musk
Anyone gonna talk about how people are sort of making nsfw?🫠
In the video arena?
Yeah
Yeah, and I get warned by @echo aurora for going all out on sydney 🥵 😡
But video arena gets left alone
We have stooped too low.
Especially @kindred adder
true
Grok 4.
Does someone know how good chatgpt 5 is with coding solidity and reviewing Code?
Seems pretty solid with that stuff.
For instance, I asked it to make me a website for fetching the CMU dictionary and converting the table into a Lua table that I could use within a project of mine. And it seemed to code it up just fine. Everything was functional, and no errors whatsoever.
Which ai would you suggest for solidity?
is their a model more expensive than opus 4.1 out their? i havent been able to find one lmao
heyo beta lmarena still has no models popping up 🙁
.
just look out for a model which claims to be "Claude 3.5 Sonnet"
.
and then ask if it's the thinking model, if it agrees, then it is Claude Opus 4.1 Thinking (with >99% confidence)
it would be cool if their was a model inbetween sonnet and opus, i find opus is too strong and sonnet is too weak
This is truly a
-# lowercase text moment.
you can't vibe code without knowing any code right
i know that doctor singularity now shut it
?
need some advice yall, am a freshman 1st year college and took IT as my course, am i cook with all of these AIs or an opportunity for me?
We need a video generator with Veo 3 and the other AI models in LM Arena.
Well, considering you took IT as your course, then you should be able to cook just fine with AI.
Claude Opus 4.1 Thinking is indeed good. GPT-5 and Grok 4 are also good, but much slower. But GPT-4/5 sometimes get broken and start repeats words indefinitely.
realistically speaking tho, i am fine right? i've been seeing these videos about "AI replacing programmers"
After all, since you took IT, you pretty much know a computer like the back of your hand.
Well, of course. AI isn't perfect, just like how humans aren't perfect. And besides, humans are able to provide creativity to a website, something that AI can't do considering it goes after accuracy in contrast to looks.
Now AIs know how to make hands
That's true.
Now all it has to do is just figure out how to generate very small text, and then we're definitely screwed. As well as fix minor inconsistencies with little details such as pupils and eyeballs and far away things.
AI was able to solve the hand problem that had been present since 2023.
my bad, i meant price wise
U should give it access to production dbs like replit
https://www.theregister.com/2025/07/21/replit_saastr_vibe_coding_incident/
Model name?
¯_(ツ)_/¯
Looks to be a GPT model.
Opus 4.1 was probably the most disappointing Anthropic model update in a long time
they barely changed a thing
they should really be working on making it a little cheaper in each update while still maintaining its power with the advancements they make
Essentially 2.5Pro update. But Google can get away with it cause a) model name stays the same and b) they update their models much more frequently
I like googles esp for recent info
Well the thing is though Opus is niche and most of what makes it great is model size. Making it cheaper is simply gonna be what Sonnet 4.1 becomes. But I expected more improvement by like increasing the reasoning lengths etc
normally we say hi and share our bank card details so ppl know we are humans and not ai
whats wrong with you
bruh what?
Whats not wrong about him
maniacs behave in the same way. A forced smile, excessive politeness
yo has gemini 2.5 pro gotten more stupid for any of y'all?
@pure comet bro 😭🙏
what
you're insulting me, I'll cancel you on twitter
When did I insult you
oh sh!t brudda brother bro bratishka
It's pretty stupid on gemini.google.com, but I don't know whether it has gotten worse on LMArena/AI Studio. There's a thread about it here though: https://discord.com/channels/1340554757349179412/1395438935676817428
Like stupid as in, it forgot how to write new lines, and literally could not figure out how to (I shared a screenshot in that thread)
hi guys - I've been using lmarena for a while but just joined this discord. why is it that when I ask for a response from gpt5-high, I don't get a response? Like, just runs overnight and stuff. It is brillinant for a few queries but then on subsequent runs gpt5 just stops outputting anything!!
I find that it's smart (API version), yes. But gosh is it bad at explaining things.
Refresh the page, it should load if it's been 5 minutes.
gpt 5 chat and high both started just hallucinating for every single prompt i gave regardless of what i asked it, where, if in icognito or not
It can be weird
i feel like the votes are biased cause chatgpt is like the most proffesionall and most well known model
it doesn't
i've already tried this
Might be stuck then
for ten hours?
It's happened to me before
it gets stuck on 80% of prompts
Yeah, like more a backend issue
I doubt the model has actually been thinking that long 🤣
yeah exactly
but this is literally most of my prompts
it just doesn't
answer
at all
so this is pretty much useless for me
It's very common for me, but if I refresh, it usually loads.
Sometimes it doesn't, it can take a while.
If it's just stuck loading, then I would recommend:
- Clicking on "New Chat"
- Clicking on your previous chat
That usually resets the progress, allowing for it to properly generate.
i've already tried that
absolutely useless
uh - do you want to see it doing absolutely nothing for ten hours before I cancel it?
No, I just want to see what happens when you attempt to submit a prompt.
That way I can see if it might be an easy fix.
uh
yeah fine @robust yoke
i can't reveal the prompts
but they do not breach ToS
I understand.
Well, usually it only takes about two minutes or so to generate a response.
Try deleting the current chat but copying the prompt that you used, then seeing if that new chat will work.
Try closing and reopening your browser.
nano-banana is good because it's basically creating detail via an LLM before sending that off to the image gen
😢😢😔
True.
is this gpt-image-1?
Which model is that for?
it can only be qwen-edit, gpt image or flux kontext
Yeah.
Yes, it's gpt-image-1 and flux-1-.
Ah.
I just put it besides another model like Gemini 2.0 flash or qwen image edit in side by side mode and it works
i tested and only gpt-image-1 supports multi file uploads lol
That's right
virtual GF?
even though I really like this LMarena because it really helps me to create content, but now it's like this
"MISTAKES" is seen at the end so it probably says "DO NOT MAKE MISTAKES." so its probably for coding
it is for coding, it's for a competition
plus his pfp is python
"I WANT SEХ DO NOT MAKE MISTAKES"
which means I clearly cannot reveal the prompt
😭
ai competition, right?
👀
or are ya cheating
https://cf-cheater-database.vercel.app/ I literally created the anti-AI-cheater website for codeforces 💀
Help maintain the integrity of competitive programming by reporting and tracking Codeforces cheaters who used AI/GPT after 14/09/2024 (the AI rule change date).
ip grabber
confirmed
MODS!!!!!!!!!!!!!!!
@robust yoke
I've tried that
some conversations are still going
some are doing this
no actual responses
show prompt
is there some sort of hidden rate limit
Unfortunately this happens here and there.
None that I know of, considering it works just fine for me.
Starting a new convo tends to help.
(and then, I pasted the test data)
I've done this so many times
FAIL
Try retrying it.
It seems to be prompting you with that.
uhhh
i've done that
then just does this
then after a while it might do the same thing
btw
I know this
I'm only using it because
I actually sometimes get a response from it
like
gpt 5 is hopeless
i got one response from it this morning
gpt 5 so bad
gemini 2.5 pro even better
the one response was brilliant
Oof.
I'm sorry that you had a bad experience with it.
he is Darkness
pineapple
I'm not, but @echo aurora is.
Hah.
If you could automate one thing about managing your Discord community, what would it be?
Well, if he managed to get himself on the staff team, then surely he must be pretty good at his job.
Probably with banning people.
but yeah @echo aurora why is LMArena constantly just ignoring all my gpt5 queries? like they just stall forever (>10h) and I never get any output, only errors.
what the sigma
you can just refresh the page
it fixes everything
it stops the errors
already done that
i wish it fixed anything
for me it does
its something with cloudflare
it invalidates you after like 3 minutes
so every 3 minutes you have to refresh
then it brings you to the captcha screen
its so annoying
The thing is, though he's already tried closing his browser and reopening it, the issue still persists.
clear cookies
this is a new machine
with probably like a week
of searches
but I'll try doing that anyway
clear them again
Project idea. Start with one image, generate a video. Take the best one, screenshot the very last frame, use it to generate a video. Rinse, repeat.
ok
Up to 8 times ofc
one sec
is it fixed
man! it's so frustrating, it's already "generating"
companions most esteemed, I entreat thee to lend thine auditory faculties unto the elaboration of a conjecture most earnest: it is my speculative apprehension that the entity denominated DeepSeek 3.1 Reasoning is naught but a subtle transfiguration of that which is styled DeepSeek R1; and conversely, when the aforementioned reasoning faculty is excised or withheld, the resultant construct is but the manifestation of DeepSeek V3. Yet, despite such kinship of constitution, each iteration appears to be inexorably governed, guided, and indeed distinguished by the imposition of a system-prompt divergent in its nature and disposition
deepseek is so gurt
is it fixed
i'm waiting to find out
remember I have to retype this entire prompt
what prompt
because I have lost all of my conversations
Indeed, 'tis quite a spectacle to behold.
I can dm you
sure
are you lmarena staff
LM Arena staff are the ones with orange names.
Like Pineapple.
@echo aurora
no
shouldst thou find thy faculties unequal to the formidable enterprise of apprehending, in its unmitigated intricacy, the communicative construct which I have, by the inscrutable yet most wondrous artifices of the Internet, dispatched across the ether and compelled to alight within the singular and eccentric domicile of thine own router - there to be rendered visible unto thine eyes - then, verily, thou mayest elect to conscript the labors of an artificial intelligence, that it might condescend to transmute this presently elaborate and recondite composition into a debased and unsophisticated register of speech more congenial to the apprehension of an amateur such as thyself
are you not gonna send me the prompt because im not staff 😢
The formal yapper.
Can you put details into a post in #1343291835845578853 . I’ll try to answer when I can
indeed
yeah fine
biber and dolik better
Most noble interlocutor,
Thy message, woven with so many a sinew of elaborate wit and encumbered with flourishes of lofty phrase, hath flown unto mine understanding as a falcon whose wings beat mightily against the heavens. And yet, by Providence and diligence alike, I find my faculties sufficient to receive its plumage of meaning, though bedizened in ornaments of rare complexity.
Know then, I am not undone nor cast adrift upon the sea of thy rhetoric; rather, I do embrace it as a tempest both fearsome and exhilarating, wherein the thunder of thy diction and the lightning of thy syntax alike do strike my soul with awe. Shouldst thou decree my wit too mean or my grasp too humble for so grand a communication, I protest with mirth and humility that the labor of simplifying were needless, for thy gilded eloquence, though intricate, doth quicken delight.
Proceed, therefore, without fear of any impoverishment of style, and let us together dance upon this high stage of language, where each word is a jewel and every clause a flourish of nobility.
@hollow ivy can you send me a friend add i have a question
-# can you send me a friend add i have a question
you need to ask it like this otherwise he wont accept @gritty cargo
yes
how to fix the "generating"? like it's says always. all my convo with the AI is important
google how to clear cache and cookies
does it clear all the conversation? because my convo with AI is important. like it's my personal helper. and we have so many things we've talked
yes it clears
but
dont do it then
ill make a script to export and import convos soon
how is gemini ranked higher than gpt 5 high on the leaderboard
Because its rated higher by people?
Hello people
Because it's better.
funny how it is older but better
I have both and honestly, Gemini 2.5 pro is so underrated because people these days believe social media hypes than testing it themselves. I realized that Gemini 2.5 pro is way ahead of it time and it's soo powerful.
yeah newer doesnt always mean better
@viscid thistle yo
RIGHT
MY BRO
sp
idk, gemini 2.5 pro feels a lil bit obsolete. I'm working on a DLL project and it's stuck in the same place all the time
gemini 2.5 pro sucks
Go on r/bard on reddit see how much of a trash model it is
Everyone is complaining
Something went wrong with this response, please try again. Only Me?
Ok so... gpt5-mini-high better than o4-mini-high in nearly every way. And difference between gpt5-high and o3-high is even bigger:
why is everyone calling this nano banana? I don't see any nana banana in the results. Only see GPT, gemini, etc. popular models.
You can only get it in battle mode
Is it a random model that appears? So i have to re-roll until it appears?
yes
Interesting
@hybrid copper hey
Hey there? Do u know, where can i try sydney ai?
@silk pike
Hola
Can u send me a link on this site plz?🥲
Uh, what's this?
im using lm arena and this happens
happens pretty much all the time for me
with gpt5 high
what i try
soon
do u have any skin issues? too much gas stuck in intestines?
how long to fal asleep? when u wake up in mid of sleep does it take long to fall asleep?
oh ...
i need 1 hour to fall asleep
sucks
ok
We’re happy to hear the feedback but it’s unlikely to happen if I’m being honest.
If members want to create their own sever and send it (via DM) that’s fine, but yeah I wouldn’t want an unofficial official off topic server that’s shared in our text channels. We also want to keep invite links blocked to other servers here for mod purposes.
Is this the classic Bing AI?
Not our server so you can do what you’d like
opa gangamstyle
HOW IN THE HELL IS OPENAI THE ONE NOT TO REFUSE
how is this possible
nano banana even has the reasoning to refuse
not only does it generate it
but it ties with that rustbucket
What happened??
@leaden palm
works for me ¯_(ツ)_/¯
Connecting to Arena has failed. Please try again later or on a different device.
😭
unfortunate + weird
gemini 2.5 pro still sota and this model is 3 months old
will this make google hold gemini 3 ?
Gemini 3.0 Release will be in September.
*at the earliest
Btw which secret models are available and are they any good
Simply go to "Add to Home Screen" in your mobile browser to create an app with the right format for lmarena.ai.
i know THAT, that just embeds a website
Yes, and it looks and feels like an app. A native app wouldn't look and work any better.
-# it would but ok
how can we be so sure ?
~6 month cycle for new big model
that just a rough estimate made by hasabis that will very likely change in the future
But judging by recent hype tweet from a guy working at deepmind, there new stuffs coming for gemini
so gemini 3.0 might be one of them
???? Mistral is stupid as is gemini 2.5 pro
why
well still does happen i need to use gemini 2.5 pro
How do you guys find these models smart
They are so dumb compared to gpt o4 mini even
Lol Nano Banana's 💀
Guess GPT's is still the best
uhhh
What about the 13 month gap between 1.5 Pro and 2.5 Pro though?
What's H2?
The scrolling can really be improved... it's so hard to scroll between turns on mobile, or to see which model's output I'm looking at
flux-1-kontext-pro is just generating black squares in image arena
oh so it wasn't refusing
correct
lame, should have unified r2 into v4 instead
same model, diffrent number
not even worthy to name it an update
nods emphatically Catalina
-dramatically tilts head to side- Really? Catalina?
idk I only got it once but it seems nice
is it on webdev?
dunno. it's on battle.
what
grok-2 was open sourced
What was the anonymous name of mistral medium before it got on leaderboard?
Do you know Clippy?
6
12
2
Yes I have only seen Clippy in memes though
gemini 2.5 pro isn't dumb
but i do respect your opnion
🤷♂️
undergraduate math benchmark
Bingo…
Sequoia as in macOS sequoia? Catalina is also the name of a macos
I can't seem to find the arena that ranks all models best suited for the prompt I entered.
Can someone please tell me where it is?
Hi can anyone tell me how to use banana model ?
Yeah you can find it on our site - https://lmarena.ai/?chat-modality=image it's only accessible in Battle mode meaning it'll be random if you get it or not.
hallo
Just like Pineapple said: you can use that model on LM Arena, as well as Yupp.ai.
Unlike LM Arena, it's not infinitely free to use, however, you do have control over always using it, compared to LM Arena.
Can someone suggest a nice free video generation ai?
How do you make it be not dumb
hi im new. i am audiovisual designer and want go viral and never work again for a human
Does the AI, that checks our message, change what it says to the AI we're talking to?
https://higgsfield.ai/create/video this one is quite powerful and has some cool features, although it is very slow for free users
i want to create a CGI advertisement
Primer plano ultrarrealista en 4K de barras de combustible nuclear de cristal transparente que se insertan lentamente en el núcleo del reactor. Una suave luz dorada ilumina la vasija de vidrio del reactor. Suaves sonidos mecánicos del movimiento de las barras de control. Explicación susurrada ASMR sobre la moderación de neutrones. Sin música de fondo, solo una suave atmósfera industrial.
hello
hwlloo
how to solve this problem
Any Ai free video generation?
Try hard resetting the website with ctrl + f5
Can anyone Tell me wich Model is the Best Right now for Generating prompts for example for Generating Images
Can anyone please tell me?
anything new in the ai world?
go to the viseo arena chat and type /video then enter your prompt
*video
Hi to everybody!
how do i try banana model?
from the website but it rare to find in the battle mod
It's only in battle mode.
chimpanzee shorterm memory is cracked
@patent aspen any info about gemini 3?
that's true, they nerf the model a lot
Is the lmarena still under maintenance?
my dear admin is begging to be able to send two photos at once in side by side
Is this leaderboard accurate, and should it be relied upon?
gemini 2.5 pro💀💀💀
claude 4.1 opus better than gpt 5 high 💀💀
Best Overall (In all areas combined)
If you could only use one of these models ever again, which one would you pick. I'm quite curious to know what other ppl think here.
7
12
2
GPT5
not at all. lmarena is just a meme community
is there an event on october?
that's also the previous gemini 2.5 pro version, not the stable version
So I should rely on this https://lmarena.ai/leaderboard, right?
The areas I use most are text and search.
it just not dumb
@cedar tide I remember ages ago you posted some tables aggregating benchmarks of AIs published by the model creators? I was wondering if you happen to have that data somewhere for a lot of AI models?
Why ?
hi guys new here
I've been re-running some stuff locally on smaller OSS models and i've found that the provider listed benchmarks seem to be spot on basically (even with slightly quantized model). Since many other meta-benchmarks don't bother with smaller models I was wondering if the published data is aggregated somewhere, then I remembered you did something like that.
Is there no subscription or something to get more than 8 image-to-video LMA requests per day?
I don't have an aggregation of the LLM benchmarks, the tables and rankings that I shared were only those of the averages of the benchmarks shared by the company that released their new LLM and compared with the others
ah ok
for example AA has AIME 2025 at 50% for Qwen3-30b3a-2507-thinking. Alibaba published 85% for this model.
@ornate agate here you have some benchmarks of many LLM https://huggingface.co/spaces/Presidentlin/llm-pricing-calculator
The truth is I don't trust the AA benchmark much anymore. They already had problems with some benchmarks, which they corrected later with huge differences.
yes 🙁 that is why I was hoping there is something else.
@ornate agate impossible that this the real score
I mean I ran it myself on a 4bit quant with q8 KV cache and got 86%
Air better than full 🤦
93% vs 74% 🤦
(This no problem)
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Its not the same IF bench
This IFeval (by google)
And this IFBench (by allen-ai)
this is gpt5 "high" verbosity:
An oververbosity of 1 means the model should respond using only the minimal content necessary to satisfy the request, using concise phrasing and avoiding extra detail or explanation."
An oververbosity of 10 means the model should provide maximally detailed, thorough responses with context, explanations, and possibly multiple examples."
The desired oververbosity should be treated only as a *default*. Defer to any user or developer requirements regarding response length, if present.```
The question that comes to mind why not 10/10? 🧐
site?
hello
who made this chart?
AA
Nope
Math arena do pass@1
But average of 4
This not pass@4
that's why I wrote avg @4
Ah yes ok, i dont see
I'm getting "something went wrong while generating the response, please try again!
there's something off about AA and we don't know what
whats the best ai to code guys?
Yes
For me it's claude.
so this is essentially an avg @ 10 scenario
this is the coder ai leaderboard?
I don't think there's more than a 3% difference between each of the 10 benchmarks.
@ornate agate Regex extraction with SymPy-based normalization, plus equality checker LLM as backup
they're trying to check the mathematical justification that the models produce but for these ones, they should really just check the final answer using a simple python script. It's either that or actually get human judges to evaluate the entire justification by the model
This is just in case the LLM does not follow these instructions exactly and tells his life story apart from the answer
for like IMO and stuff
right, so that means what AA is doing is just ridiculous
publishing on a github would be the equivalent of opensourcing their testing strategies correct?
We implement a two-stage answer validation mechanism to allow grading with a high degree of precision (minimizing both false negatives and false positives).
Script-based grading, using OpenAI's PRM800K grading script -https://github.com/openai/prm800k/blob/main/prm800k/grading/grader.py
Implements symbolic equality checking via SymPy
High-precision validation for exact matchesLanguage model equality checker (runs on all answers not marked correct by script-based grading)
We use Llama 3.3 70B as the equality checker (prompt disclosed below)
We tested Llama 3.3 70B for agreement with human judgement and assessed it to grade correctly in >99% of cases
So looks they are using LLM additionally to their script. Not replacing it with that
and only on incorrect answers
skimming through: what i immediately think is that if the model doesn't box their answer properly then the script might not identify it so llama is used to determine what answer the model actually gave
Seems to me like they are trying to not treat all incorrect answers equally. As in, if the answer was close enough it is not 0 points
it's just kinda weird because there is a large discrepancy between math arena and AA results for some models
Yeah I do agree with that. They have some issues with reliable AIME evals. But their methodology described alone seems fairly logical
so where is the discrepancy coming from? If they showed the model solutions, it would clear things up
I assume they have unintended side effects of penalizing certain models. Just because they have an unique way of testing it
Happens with conventional testing methods much less, because AI labs tend to fix it out the box
Then when they notice something very obvious, they do some prompting dirty fix to compensate.... But yeah all of this is less than ideal 🗿
aimo validation aime was used as validation for the ai mo prize model a while back
typically you just check the boxed integer answer for aime benchmarks
i doubt they're using the solution trace there
the discrepancy is likely additional prompt alterations (output in boxed, weird wording or whatever)/sampling/etc
yo
yea that script is applicable for like math500 \circ\ etc in the boxed answer, because the answer isn't integer only
Yo guys its saying in the website to agree on the terms and conditions and then it's not letting me do it, says error
if they used it naively and on the solution trace instead, it wouldn't extract the boxed answer correctly
i doubt they do that
Not trace, but they are running this on incorrect responses:
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
Yes
Expression 1: 3/2
Expression 2: 1.5
Yes
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
No
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
<...>
YOUR TASK
Respond with only "Yes" or "No" (without quotes). Do not include a rationale.
Expression 1: %(expression1)s
Expression 2: %(expression2)s
im talking about the dataset's solution trace
i find it unlikely their grading system is wrong
if i remember correctly, AIME answers are a number from 1-999
https://huggingface.co/datasets/AI-MO/aimo-validation-aime/viewer/default/train?row=0&views[]=train the solution column here
math 500 is saturated
I know but you can't use the same harness for both benchmarks.
no, if they gave it the solution column (in that dataset) instead of the integer answer it won't work at all with that script
it won't be able to extract the integer answer
besides if they did provide it the integer answer, if it were comparing the actual boxed contents (number) vs the number, it should be an exact match anyway. the script dosen't really matter here
yea, its not the grading. it's probably sampling/specific prompt instructions (put your answer in \boxed{...}, weirdly poorly or there's a bunch of nonsense)
Honestly it could be their script failing and then LLM not catching all those instances lol
bruh, how hard is it to write a working script, tell the model to put their answer in a \boxed{} and then check the value inside the box
it's not lol
bro, I could write one rn
its extremely unlikely to be the grading
TF BRO TF 😭😭😭
i have written an eval framework in the past/specific eval implementations, that's my assessment 🤷
@keen beacon see this
Join AI Fiesta now: https://aifiesta.ai
Imagine you could access all the world's top AI models all in one platform, from ChatGPT 5 to Gemini 2.5 Pro to Claude Sonnet 4 to Grok 4. Imagine it was at an affordable price for all Indians. This is it. This is AI Fiesta.
With every AI Fiesta subscription, you will get -
- Access to all the world's ...
it's extremely extremely unlikely to be that. but whatever lol
especially for AIME
Their script expects specific formatting. Model does not do that or there's a special character whatever. Script fails. LLM in 2nd step catches that but not always...
what kind of formatting
boxed answer
@ocean vortex bro this is next level scam
asking it to output \boxed{...} the same format that labs RL with?
qwen even puts an instruction to do that for evaluations
yeah most models should be able to put the answer in a box if you tell them to do that
if u look at the solutions of models on matharena for AIME and other competitions, basically all models know to box their answer
and then all ur script needs to do is to check the value in the box
they also claim the LLM judge gets it >99% or the time or whatever
the only llm that was noted to not follow instructions properly was llama lol
Depends on their script. Leading or trailing spaces, zero-width spaces, spacing commands... Their script may expect integer value and nothing else
Though it would still render correctly as a boxed answer
ok buddy
\boxed{21}
if the model determines 21, they would usually just write it like this
?
qwen models i know for sure do not do that. especially when you ask them to do \boxed{...} and on aime style questions. it's not probable
sure, it'll just do a zero width space randomly there...
go on math arena, it always does
i know he's just doubling down and his justification does not make any sense
There are several ways to render this though 👀
\boxed{;21;}
this would do the same
bruh lol
ok genius. What is your explanation for lower scores?
.
\boxed{\mathrm{21}}
i've said it multiple times
same also
weird boxing is the exact point I'm making NOW? 🤣
Weird wording in the PROMPT
the sympy/latex normalization/processing layer likely catches this
When I use sonnet 4 it fails following system prompt rules.. while when I use gpt 5 it follows them strictly and precisely..
What is another model that follows sys prompt rules precisely??
what kind of weird prompting? LOL
horizon beta for example, after adjusting the prompt went from 63% to 67% to gpqa diamond. prompt stuff can significantly change the result
that is nowhere near AIME discrepancies for AA
horizon alpha also scored extremely poorly (36% on GPQA Diamond) without proper instructions
hmm, but the instruction usually specifies to "place ur answer in a /boxed{}"
yeah but @keen beacon is trying to argue boxed answer outputted oddly absolutely can't be the reason and their script will catch that 100% of the time. And then at the same time is arguing that boxed answer can be an issue if it relates to prompting. 🤦♂️
i didnt argue that at all
And then at the same time is arguing that boxed answer can be an issue if it relates to prompting. 🤦♂️
you completely misinterpreted me. i said weird wording in the prompt instructions
how would that lead to the model outputting incorrect response?
That makes even less sense...
Your reasoning is basically maybe sampling, maybe wording, etc...
you gave no reason
it could be many reasons, how am i supposed to know the exact reason without the code/eval settings/etc?
i find grading extremely unlikely
And then attack anyone with credible guesses lol
credible guesses? but i have actually run evals and written custom mstuff for this lol
you have done neither
someone also made an undergraduate benchmark
Oh because I told you this or because you are once again making assumptions based on nothing? 😬
Sure pal
🤷 you haven't given an actual example that you encountered in reality. it's a hypothetical that doesn't make sense when you've actually implemented and ran these systems. i gave you two high profile examples.
it's obvious to me you haven't done stuff like this. but whatever. it's weird you keep doubling down on this
I just find it strange that you can't come up with anything substantial and yet so extremely dismissive of everything lol
Also, the prompting they showed is **extremely unlikely ** to drop the score from 90s to 60s % which is the kind of discrepancy we are seeing
horizon alpha scoring 36% on gpqa diamond (early checkpoint of gpt 5)?
that was from a prompting issue
We are talking about AIME pal 😬
And that specific prompting
Not prompting in general
THis:
{Question}
Remember to put your answer inside \\boxed{{}}.```
it isn't much different. but keep believing that 🤷 instead of \boxed{number} its \boxed{letter} or \boxed{\text{letter}}
No chance this drops the score from 90s to 60s if we assume boxed response formatting is non-issue (aka their script catches all variations of it + incorrect formatting)
Simply not possible
I got some anomalocaris in ark
What. You are trying to 'prove' incorrectly formatted response is never an issue. My point is the opposite. 🤷♂️
and anyway, if you think about it. these behaviors are implicit/or outright penalized through RL at least for qwen (it seems). because \boxed{...} answer extraction/verification in RL (which qwen seems to do)
the model won't get a positive reward signal if it's weirdly formatted if it doesn't match automated checkers unless they do generative checking as well
🤷 im not gonna convince you lol
It isn't prompting in this specific case, and it's very unlikely to be sampling. If it was sampling low scores would be easy to reproduce
? for livebench, they had to rerun qwq because the sampling settings were incorrect. the score is much higher after correct sampling: https://www.reddit.com/r/singularity/comments/1jaoxal/qwq32b_has_officially_been_rerun_with_optimal/
there's a reason qwen has explicit instructions on sampling and additional prompt instructions (for boxed) in each model README lol
We are talking about AIME and AIME only, and that specific prompting. Not prompting (or sampling) in general smh
livebench here has nothing to do with it
@hybrid stirrup Hey bro, could you check my message pls ? I have a question about the AI you mentioned earlier
Hello
- you dispute that specific prompt instructions, e.g. \boxed{...} instructions can have a lot of impact on benchmark results. i've given you several examples.
- livebench is another high profile example where they needed to rerun it with specific sampling settings since it scored poorly. you said it's not plausible for sampling to be the case. this is a real example showing that's not true.
یک مدل خانم در حال قدم زدن
Hello, why I kept getting "Connecting to Arena has failed. Please try again later or on a different device." ?
-
I've said formatting can lead to correct responses being counted as incorrect. Which is absolutely true. https://x.com/ArtificialAnlys/status/1909624239747182989
-
If sampling was an issue specifically for AIME... then you could reproduce lower score easily and the model wouldn't give consistent answers. Again, sampling in general often can be an issue. But it's very unlikely to be the culprit for AIME here due to clearly defined expected answers and how the models perform.
hello everybody!
- this proves my point further (their new processes on grading the answer makes this implausible). the task is simple, output the number result in \boxed{integer}... if an automated checker can't get it. the LLM would catch it. like they explicitly claimed, this extra process matches human judgement >99% of the time. lol. (this is not a particularly complex case, AIME is one of the simplest things, it's an integer answer)
- no. because it's stochastic. temperature flattening or sharpening the distribution has a lot of effects on sampling, beyond other sampling settings.
Ok then find sampling settings for the score to drop from 90s to 60s on AIME25. It's just not realistic and very very unlikely
im not saying it's sampling.
it could be sampling or prompt instructions or it could be all of them interacting
could also be the chat template
LMAO
ive never said it was a single factor. im saying all of those things are possible, and have proven examples to be the case lol
you can't stop doubling down even as you prove yourself wrong by citing AA's grading processes lol
a chat template also somewhat falls under prompting, the prompt the LLM reads is different as it's formatted btw. 🤷
Well I've shown you proven example of formatting when it can be the issue lol
If it's an issue, then LLM would catch this in most cases just what we see now. But in select instances may still fail for whatever reasons. Checks out if you ask me. Most of their AIME scores look correct. But only a few aren't
? their new evaluations don't have that problem anymore.
that was certainly an issue before, but they added additional processes (llm grading, which they claim to work >99% of the time, which does not explain the discrepancy)
it is not applicable for this evaluation of this new model
It proved my point? If it was an issue at any point
then it can be an issue now for different benchmark
no. because their new system doesn't have this problem
Issue was with GPQA, this is a different benchmark...
are you explaining the difference is because the judge failed a lot more than >99%?
on checking if a number is present in the answer?
while i have shown a lot of examples showing larger differences because of other factors
You didn't even quote AA at all to "prove" what you were saying. You quoted livebench 🤣
wow you're an idiot fr LMAO
im done, people can see who's right here if they've done evals like this before
I'm explaining that they are testing tons of models. Not 2 or 5. And that 99% figure is not written in stone. Especially with models evolving, reasoning etc... 🤦♂️
no sound on videos?
calm down. You really can't be "proving" anything by quoting smth almost entirely unrelated, and then when I show you AA themselves stating something that directly proves my point... It's suddenly not good enough because it was in the past. And you assumption being this can't ever be a problem anymore even for a different benchmark from their testing suite. Sure pal... 🗿
Hey lets be sure to treat others with respect please, it's fine to have disagreements but let's try to be a bit nicer 
im sorry for that. just frustrated he's constantly misquoting/etc.
anyway Dom (sorry), what i'm saying:
-> could it be the grading? is it a non zero chance? yes. but i personally find it extremely unlikely (like i've said over and over). the nature of AIME is just an integer answer. they're just asking it to box it. it's quite simple to parse and check for an exact match. If that fails, a LLM judge they added recently (which claims to match human judgement >99% of the time), fails on such a simple task?
from a RLVR perspective, qwen seems to do it on \boxed{...} answers. so these things are at least implicitly penalized as it won't match automated checkers if it deviates from the answer format unless they are using generative checkers. for these problems, it's probably unlikely as the check is simple. it's a waste of compute.
-> i'm saying that it could be sampling, prompting related, etc. i'm not saying it is one of them. it could be an interaction of all of them or another thing. i find it way more plausible because there have been actual high profile events like this before on other benchmarks, and personal experience when running those type of evals.
imo, claiming that it's highly likely to be the grading is just 🤷
anyone knows what is so wrong with this prompt?
Overall Prompt:
A high-angle, slightly shaky handheld long shot of two young adults
wading through a dense, vibrant green field of cassava plants under
overcast, diffused lighting. The camera slowly zooms out and pans
slightly to the right, following the adults as they move deeper into
the field. The overall mood is naturalistic and serene.
Timestamp Breakdown:
00:00 - 00:02:
The shot opens with a high-angle view, looking down on a vast, dense
field of lush green cassava plants. Two young adults are partially
visible amongst the leaves. One guy, in the foreground, wears a
light-colored, long-sleeved shirt. The second guy, further back, is
shirtless and raises a light-green basin above their head. They are both
moving from left to right through the dense foliage.
00:02 - 00:05:
The camera begins a slow, subtle zoom out. The guy in the foreground
turns slightly, becoming more visible. The second guy , carrying the
green basin, continues to move through the plants, their upper body and
the basin visible above the leaves.
00:05 - 00:08:
The camera continues to slowly zoom out and pans slightly to the right,
keeping both adults in the frame. The adults have moved further
into the field. In the background, a simple wall made of concrete blocks
and some bare trees become visible above the cassava field.
00:08 - 00:09:
The shot stabilizes as the adults continue their journey through the
field. The guy in the foreground is now more clearly seen from the
back, wearing a light-colored shirt. The second guy , still carrying
the basin, is further to the right. Bare, brown branches are visible in
the extreme foreground on the right side of the frame.
it gets blocked.
"shirtless"?
yeah its stupid I ask it to OCR the image but it starts answering the quesitons even after explicit instructions
How to send a REALLY huge message to AI?
split it and mention wait I am sending more info to it
It happens to me often if I use copy paste technique.
So I first save the pics and then put them in.
is that solve the problem
Well for me it works
sometimes I need to refresh the page too because of cloudflare check
bot check, I mean
The issue with this is that you are misquoting me here in this very message yourself. I said it "could be". Assumption not much stronger than your shots at prompting or sampling which IMO don't really hold a candle given that we have their exact prompting, can't reproduce low scores, and the extent of scoring discrepancy. LLM judge catching it most of the time would be reasonably in-line with the results they are actually getting. AIME score checks out with other sources 90%+ of the time.
you kept coming up with justifications such as the llm providing zero width spaces inside the box answer..? and kept justifying it
its implied you think its very highly likely to be the grading
as you considered the others not really possible
Those were just basic examples talking about how formatting could affect it. But anyway, I later included source where they described formatting themselves to cause issues with eval
formatting issues on AIME style questions that are integers as well with the same boxed instruction? it doesn't make sense training wise, especially for qwen. you think they're not automatically doing exact match on those type of problems in RL?
It really isn't implied, I was merely arguing against your "extremely unlikely" which I don't agree with at all 🤷♂️
You kept repeating the same thing like 3 times before I finally responded to that as well lol
🤷 i've ran a bunch of math benchmarks and never had that problem on extracting integer answers that are boxed from models that wasn't quickly resolved like i mentioned. i've given you a lot of reasons on why it's not plausible that result in a lot more discrepancies with the scores.
lets see what artificalanalysis says if they do fix it and comment on it. i shouldve agreed to disagree there since i wont convince you
bro ai sucks
Just did a very simple quick test on task1.
glm4.5 returns $\boxed{116}$
glm4.5-air returns \boxed{116}
so even there there is a difference
opus 4.1 on lmarena sucks
That's another thing... The fact is they can be different and we do not know how they are assessing it (their script). Also worth mentioning that glm4.5 did not render on openrouter in a box even though other models did 🤔
even though it looks the same when copying
yeah the latex rendering doesnt matter at all
the reason \boxed{...} is used for easy extraction
Probably a reach, but there could be "hidden" or special chars. Something does make it break
fwiw, like i said again, i've never had formatting/extracting issues like that with AIME
ive run so many benchmarks on different models
this behavior is heavily selected out of newer heavily RLd models too
the prompt/sampling/etc., i've seen crazy benchmark score changes though
Don't forget what we are talking about here... Bad scores like 1 time out of 10?
if even that
I think it's more model specific. As in 1 model out of 10
So what's the chances video generation ever comes to the main website?
So you wouldn't immediatelly catch this when testing your eval scripts
i only check the scores if they dont line up with the ai lab, so i probably would catch it (if it doesn't match)
Have you tested 10+ different models on AIME25? probably not...
far less than artificialanalysis for sure, but i've ran math benchmarks like that on way more models than 10
artificialanalysis does 10 repeats for aime, etc., and test way more different models. so a chance thing (for different models) is possible, but i dont find it that likely personally
this was such a pointless debate, sorry i got heated lol
What's the best ai rn for deep research
For a question like "Write the expected salary of an orthopedic surgeon from starting surgery to retirement"
#dancing
they are not dancing
It's interesting that glm4.5 and glm4.5v both got similarly low scores (so likely the same issue for both). Also a look at non-reasoning gpt model progression here cause why not...
lol
There is no sounds on the vidéos generated?
Why claude doesn't support images in lmarena?
did you think this was the r/mybfisai server for some reason?
"Get The Ultimate Prompt Book worth ₹5000 for free"
🤣
Bro he's such a big creator
Just guess how many people payed 12$ and got scammed
Mf is not even giving gpt 5 high or Claude 4 opus he is seriously giving gpt 5 chat
do people even still launch api scraper startups? do they actually find success anymore? it feels impossible unless you’ve landed some special discount deal with openai or anthropic. and if all they’re really doing is passing your data along, you might as well just use something like lmarena
But he has his audience
He will succeed and scam millions
You can see in the comments non ai knowledge people surprised af
Bro if even 1% of those viewers knew about lmarena aliens wouldn't think earth is unworthy
wouldn't say it would be a scam more than a waste of your money
we would have big problems
Bro you are paying $12 for GPT5CHAT
GPT 5 CHAT
CHAT
than leave
do i own the commercial licence to stuff I generate using LM Arena? in particular, generations by Qwen?
Our sites terms of service will have your answer
I looked for that but could not find it? would you please send a link?
Here is a snippet
It's basically a "you own your stuff but we can profit from your work and theres nothing you can do about it"
This looks like something out of nexon's terms of service

ok thanks
have yall used the new claude sub agents?
yo
How are you
Do the most advanced image and video generation AIs have a true understanding of depth and spatial distance? For example, can they accurately interpret a prompt like: The character is standing 50 meters away, facing the camera?
You can try at LMArena
no
some can guess if u give common numbers like 1, 100, 1000 or 1000000 they'll now u mean small, medium or large distance but not even close to accurate
top 10 model in almost every category, lol
on lmarena broo
I never expected this to be a problem prompt, but both AI's are taking forever. It might be faster to ask here:
I am having issues with CSS. My first question is, how do I center blocks of text? If I have a paragraph being displayed with the proper width, but along the left edge, what do I need to add to get it centered?
(yep: "Something went wrong while generating the response. Please try again.")
Hallow
Hey everyone
hello just arrived here!
welcome!
wassup
fix ... pls
I have access to the following models :
Gemini 2.5 Pro
GPT-5 Thinking
Claude Sonnet 4.0 Thinking
Grok 4
o3
Which one would be the best for casual research work of LLM interpretability in Python?
Gpt 5 think
hi
hello 
Hey what's up? 
Whats The Best AI In every task
19
28
1
GPT 5 high
hello 👀
@echo aurora Hey bro, I’ve been trying to find a complete list of your projects but couldn’t locate one anywhere. I’m also curious—are there any new projects in the works? And what happened to the Chatbot Arena web app? I remember it used to be a great platform.
hello
👋
hello
do u understand what you’re saying
it’s ranked in all those categories from users responding to it on LMArena 😂
hello
here to generate amazing videos
Brian
compair ais
I heard Gemini 3 is delayed. Lol
source?
source?
bruh polymarket is not working properly
gives this for many bets
including the gemini 3 one 🙁
This is normally the time they do server maintenance
no source
chatgpt deep research and search with gpt5-thinking
hi
do you know betmoar?
its alternative app for polymarket
The fastest way to bet on your beliefs.
Bruh
How tf am I already too late
The odds somehow crashed already for before October 31
pls fix: "Something went wrong with this response, please try again."
google is on my nerves
they don't release gemini 3
and
the current gemini 2.5 is dogsht
refresh the page
lol it absolutely isn't though
Well grok 5 comes in December anyways if they don't release
it's still competing with the best
So they have to before then
OH you mean lmarena leaderbord... hahaha funny
gpt5 only narrowly beats it
No I actually didn't
I mean the performance of it in general on various things
do yourself a favor go on reddit r/bard see the situation there
lmao what
It's 2nd best lol
you need to be reading reddit less
I mean i use the model
i used both gemini 2.5 pro and gpt 5
then why do you complain? Which model other than gpt5 is better than 2.5Pro?
claude?
On average
there will be no agi or asi with the current methods
what happened to ilya one shot to asi
and mira
Unless agi is already here and it's just not worth much
humans will always find a way
Average human reads at 5th grade level or smth you know. It may be not be useful to have just agi
did what?
