#general
does anyone know why sometimes when you ask a maths question the equations don't display correctly printing stuff like this:
e^{2x},dx
\int xe^{2x},dx
]
instead of the equation formatting
oh
nice code
👍
?
Hi
what you think about this https://discord.com/channels/1340554757349179412/1498271433371943042
upvoted
does anyone have any idea how long it usually takes until a newly added model gets added into the leaderboard (talking about gpt 5.5 in this case)?
Its been in battle mode for quite a bit now
Are GPT image 2 and Nano Pro down right now? Keep getting errors.
I think they got rid of it in direct mode and the same with a lot of the top models from what I understand
I just want to see the updated leaderboard, idc about using it here
Oh I see apologies sir I thought you were asking the opposite
hello
Plz how to generate image-video
Go to the video arena
grok ai video generation
Best model for very long, deeply immersive and highly realistic sandbox/RPG/adventure-games, with a very intelligent gamemaster-AI?
17
30
1
Gemini 3.1 pro
dang
How much is paying too much for AI in your opinion?
ai should be free
w.wiki/MHef Until this comes true, the enshittification of absolutely everything, including AI, will keep worsening
Unfortunately
hmm
Nothing is free
what is it
5 Reacts and I slide
Do any of the open models actually compete with the proprietary ones? GPT 5.5 Thinking etc are agentic and I'm wondering how you replicate that
wats da sitee
Are you "just seeing this" or can you confirm that they actually perform? Because I can spin up a site that says it has GPT-6.
And gpt image 2 unlimited
What should I ask
So I know it’s real opus
4
Ima tell
19
10
8
7
6
5
4
3
2
1
Genspark.
W Slido
How many prompt limit?
I noticed there is subscription
Which is unsure for me
Is Genspark using fake models?
All unlimited
I don’t know
I'm quite unsure about this website, cuz how do they provide free unlimited AI models?
Arena also did
Well they do provide it free, but is it legal for them to do that, or do they just have better finances?
This is fake Opus 4.7
No
Im suspicious
I believe it’s the real one
Ima test this
Normally there will be a big # at the beginning of replies
To make sure
This does not have that.
The same idea
where
Genspark
Nope
You have credits in the bottom left. Mine went to 0 after the first message
Look
Wrong mode. AI Chat is indeed entirely free
Nope
It’s real opus
Tbh I think agent mode is the only one that matters for complex tasks
But they hacked the system prompt of some $#!+ slow model and call it Opus 4.7
Nope
Well I get the pay popup every time now... have fun. I will just use the real models
If you have good agentic results with an open model please let me know
Use ai chat
it’s real
What should I ask
So u guys know it’s real
"Comprehensive detailed list of all models of computation."
Arena Max vs. Genspark AI's Claude Opus 4.7
NVM
For that I got 913 lines with Arena Max ("Response provided by Anthropic") vs 194 for Genspark AI's Claude Opus 4.7
I really have some complaints about Genspark
Im just saying Genspark is not trustworthy
What do you think about agent mode in arena?
Has anyone tried arena.ai/agent mode? I just saw it, tried to check the preview, and then it was gone
Don't buy their subs
I still have it i think its alright
It isn't opening for me, even the url just starts redirecting to arena.ai
what subs
i assume these ones: #general message
Hi kiri
Roblox uses the Luau scripting language, not Lua 😕
Good morning/afternoon/evening
This is an experiment at the moment, so it's going to be random if you see it or not. https://help.arena.ai/articles/1811908126-arena-experiments-agent-mode
saw it and tried to check that preview and gone
Did the generation not work? Or did you leave the page and when you came back it was gone?
you should honestly just let users opt into experiments, windows randomly put people into experiments in their beta builds for a long time but now they are allowing people to opt into them
I gave it the prompt, it showed me a chat screen as if it was generating, and then the mode suddenly closed and disappeared entirely
When it closed, what happened? Back to the home screen and didn't have Agent anymore?
I couldn't give an early heads up sorry to say.
Could happen. This is something we've heard before. Problem being we need experiments to accurately reflect how users respond to them. Having some kind of opt-in opt-out could lead to an inaccurate understanding of an experiment's results.
We need the votes and to validate the data. This process can depend on how long it takes from arena to arena.
Closed = back to home. It closed and showed the battle mode home page, and I couldn't see agent mode anymore; even when I tried to open it directly through the /agent url it doesn't show.
I couldn't see it in canary either
Okay good to know. Checking with the team and this seems like it's a bug. I'm going to start a post in #1343291835845578853 and will follow up with you there as we wouldn't want this lost to #general .
If you see a suspicious Discord login without the word "Discord" in the search bar, then don't login; this is the chance: it is a scam.
PINEAPPLE
do yk how many votes gpt 5.5 has
or when it will be on the leaderboard
Ah so that's why I saw it one time and no more?
It's random?
Damn I didn't even have the chance to use it
gpt-5.5 high below muse spark??????
damn
wow
ofc
I was just about to say that
thats so interesting
chances of getting gpt 5.5 was so low
how would u even vote??
if only direct was there
how to come up with good prompt battle
It's cuz
Right now
i got it enough times
We need to vote for lord sam
yeah but
Gpt 5.5 is so low on the leaderboards, it makes no sense. It's waaaay better at coding than opus 4.7 in practice but my god is it low on the leaderboard relative to it???
I want team red vs team white
how the hell?
doesn't make any sense to me, i am pretty satisfied with 5.5
How it’s lower than muse spark on coding
I don't want another team specially with a blue color
@surreal zephyr
Is there an api endpoint for muse spark
yo gpt-ampro 4o
Gpt 5.5 is not that bad wtf?
It loses even sonnet 4.6 LMAO
Lmfao
The leaderboard is rigged as f*
no
GPT 5.5 would crush sonnet in every way
even 5.2 chat
thats not even possible
how is 5.5 worse than 5.2
not possible lol
Gpt 5.5 mogs opus lol
GPT 5.5 rivals opus 4.7
at least, not one available to the public. i think businesses have access to it but that's it
either the votes are rigged or no one noticed 5.5
Code arena lets you add secrets and a database now??? That is so cool!!
what?
What ever happened to LMArena, lol? I'd rather trust the ArtificialAnalysis leaderboard more. GPT 5.5 is worse than 5.2 ahhh
5.5 can’t be that bad lol
honestly a stupid idea imo, meta would be making millions if they opened up muse spark api
It is battle mode ;-;
how is it 9th in code
That is really sad 😭
idk
That makes so much sense 
loool?
How does zucks muse spark beats 5.5
I don't like that all the top models have been removed from the direct chat
how
Lmao 😭
lmfao
fire downgrade

OpenAI lore
Does LLM Arena have a partnership with Anthropic or what?
They definitely don't have a partnership with zuck
but somehow muse spark beat 5.5
wow gpt 5.5 getting its ___ kicked more than i thought
i actually like it
And here is it the first place? so who is the liar?
claude has a lot of fanboys
no
Opus is definitively better than GPT
its just better lmao
nope
And ?
5.5 is kind of equal or slightly worse
watch how they'll try to ragebait openai users by saying they worship sam altman
btw arena is unaccurate now
it is, you have the lmarena leaderboard
gemini 3.1 pro being better than opus tells you everything you need to know about this benchmark
unaccurate btw
still better
Why ?
If it wasn't for lord sam we wouldn't have llms
but the spud will be better than mythos
Have you ever actually used Gemini 3.1 Pro properly via the API?
check other lbs
like this
Lbs ?
theres no way "muse spark" beats 5.5
benchmarks suck 😄
muse spark is ass
Its not accurate
Especially not in coding. It's truly laughable
aare you kidding me
is it? why do you say that
arena is unaccurate now
and qwen 3.6 plus too damm
its not even on par with gemini 3 flash LMAO
llama 4 maverick part 2?
Qwen is also good
Anyone have info or other servers on AI workflows
1m context window like opus
yes but not on the same level with gpt 5.5
but thats the plus mode
stop ragebaiting bro
deepseek v4 pro is very good
but its not on par with 5.5 and opus 47
gpt 5.5, gemini 3.1 pro and opus 4.7 have their own levels
No anthropic is just objectively better
gemini 3.1 pro is lobotomized
its not as good as it was during release
eh, its alright at best
Gemini as a model is good, the tooling around it is bad
It was the most intelligent model a while ago.
What's environment variables
and even that got slapped by muse spark lmfao
Like api keys for example
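For anyone else wondering, here is a minimal sketch of what that looks like in practice, assuming Python and a made-up variable name `EXAMPLE_API_KEY` (not something Arena defines). The idea is that the secret lives in the process environment instead of in the source code, and the program reads it when it starts.

```python
import os


def load_api_key(name: str = "EXAMPLE_API_KEY") -> str:
    """Return the secret stored in the environment variable `name`.

    `EXAMPLE_API_KEY` is a hypothetical name used only for illustration.
    """
    value = os.environ.get(name)
    if not value:
        # Fail loudly instead of silently running without credentials.
        raise RuntimeError(f"environment variable {name} is not set")
    return value


if __name__ == "__main__":
    # Normally this is set outside the program (shell, .env file, hosting
    # dashboard). It is set here only so the demo runs on its own.
    os.environ.setdefault("EXAMPLE_API_KEY", "sk-demo-not-a-real-key")
    key = load_api_key()
    print(f"loaded a key of length {len(key)}")
```

Keeping keys out of the code means you can share or publish the project without leaking them.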
It isn't; I just tried the new 4.7 version recently, and for the backend it's absolute junk—5.5 is just way better
how do u get early access to everything broo
first u got agent mode first out of everyone
now u get the new "database" feature
😭
Idk i guess i was lucky 😭
Its fun tho
I can't stop laughing because of muse spark
u in san francisco?
No? why?
@echo aurora since the code arena results are being criticised:
here is my proposition:
when the arena team implements the credits system, they should create coding tasks for ai to complete, these tasks should be of wide variety of challenges which are easy to judge in quality, like physics based n stuff, different categories like 2d, 3d, games, n stuff, you would need to spend at least 5 sec viewing and interacting with 2 results to vote, you need to write what is better about the selected than the other, then a fast, efficient model judges your reason and rewards credits depending on how good the reason is (0 if nonsense/unrelated, 50% if simple, 75% if decent quality, 100% for high quality)
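The reward part of the proposition above could be sketched roughly like this. This is only an illustration: the tier names, the `base_credits` parameter, and reading "at least 5 sec" as applying to each of the two results are assumptions, and the judge model itself is left out (its verdict is simply passed in).

```python
from enum import Enum


class ReasonQuality(Enum):
    """Reward multipliers taken from the proposal (0 / 50% / 75% / 100%)."""
    NONSENSE = 0.0
    SIMPLE = 0.5
    DECENT = 0.75
    HIGH = 1.0


# Minimum time spent on each result before a vote counts.
# Assumption: the 5 seconds apply per result, not in total.
MIN_VIEW_SECONDS = 5.0


def credits_for_vote(base_credits: float,
                     seconds_on_a: float,
                     seconds_on_b: float,
                     quality: ReasonQuality) -> float:
    """Return the credits earned for one vote.

    `quality` stands in for the verdict of the fast judge model,
    which is not implemented here.
    """
    if min(seconds_on_a, seconds_on_b) < MIN_VIEW_SECONDS:
        return 0.0  # did not really look at both results
    return base_credits * quality.value


if __name__ == "__main__":
    print(credits_for_vote(100, 6.0, 7.5, ReasonQuality.DECENT))
```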
Poor pineapple always getting pinged 😭
what if your region gets updates first
I dont think thats the case im not in the usa
Bro do you even think 6th place is possible?
look at the companies
since when can arena use commands 😭
How can meta make a good ai
bruh thats exactly why im saying this
bruh
Lmfao even grok ranks 9
Hey sorry been in the middle of a few things so haven't been following the convo.
since the code arena results are being criticised
What is the criticism we're talking about here?
I dont know either
@rocky geyser can u show proof vid (does the feature work?)
@echo aurora
Sure its building right now
muse spark in 6th place is insane
its so bad
unaccurate results btw
And gpt still lags behind
the leaderboard is unaccurate btw
m
quick somewhat off topic question:
claude takes p4 and gemini p3 because of the rank spread here? score is tied but the +/- is smaller.?
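One way a tied score can still end up with different ranks, sketched under an assumption: that a rank spread is derived from each model's score interval (score plus or minus the +/- value). This is not confirmed to be how Arena computes it, and the model names and numbers below are made up.

```python
from typing import Dict, Tuple


def rank_spread(models: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[int, int]]:
    """Map each model to (best_rank, worst_rank).

    `models` maps a name to (score, margin), where the interval is
    score - margin .. score + margin. Assumed rule, for illustration only:
      best rank  = 1 + models whose whole interval sits above ours
      worst rank = 1 + models whose interval reaches above our lower bound
    """
    spreads = {}
    for name, (score, margin) in models.items():
        lower, upper = score - margin, score + margin
        clearly_above = 0
        possibly_above = 0
        for other, (o_score, o_margin) in models.items():
            if other == name:
                continue
            if o_score - o_margin > upper:
                clearly_above += 1
            if o_score + o_margin > lower:
                possibly_above += 1
        spreads[name] = (1 + clearly_above, 1 + possibly_above)
    return spreads


if __name__ == "__main__":
    # Hypothetical numbers: model_b and model_c share the same score.
    demo = {"model_a": (1500, 5), "model_b": (1490, 4), "model_c": (1490, 12)}
    for model, spread in rank_spread(demo).items():
        print(model, spread)
```

With these made-up numbers, model_b and model_c have the same score but different spreads, purely because of the width of the +/-.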
mmm
@echo aurora do u see the criticism now
It's all that the codearena result came out bad, since GPT 5.5 has better backend than muse-spark, which left people wondering why its so low.
Arena is based on people’s actual use of the model
ah 3.1 beats opus 4.7 thinking
tuuuufffff
If people don’t like 5.5 why would it be inaccurate
yes, the problem is that the current code arena is just people judging the UI, not the quality of the sites themselves
i mean ... yes? but not what i asked 
Slopus 4.7*
Popus*
no i was js looking
and shooockkkkkedddddddddd
If you head to the Code category on Text Arena 5.5 is ranked lower compared to 5.4 - https://arena.ai/leaderboard/text/coding
?
Ahhh gpt 5.2 chat and gpt 5.5 high are on the same level, one point difference
go back to sleep bro
No pls that's what is causing Benchmarks maxing.
Why is muse spark still marked as preliminary?
it has twice the votes as gpt 5.5 high
what is deepslop yapping about
The whale will return shortly for V4 upgrades
Is still Preview
The whale has awoke
Deepseek > Muse spark
True
No
I mean more independent benchmarks
For one I plan to make my own creative writing benchmark
For NSFW and SFW
Our teams are looking into this criticism now, I'll follow up when I have more information to share. 
If they are Private sure
So they can't be Gamed
Anyone here knows how to use Claude Opus 4.7 thinking in the Battle Mode?
i just got buttfucked on polymarket betting on 1980+ debut score
to add onto this, gpt has had very similar ui throughout the releases of gpt 5.2-ish to 5.5, thats whats being judged, not the code quality (functionality) of the websites themselves, so what I'm suggesting is a large, curated list of coding tasks for the models to complete (love the ones showcased on arena's youtube videos btw) and those could be voted by users in exchange for reward with credits (when those are added)
also note, I'm not sayin the non-curated code arena should be removed, the curated code arena will give people the opportunity to earn credits which they could use to send prompts to premium models in the non-curated code arena/text arena 
people expected too much
Gpt 5.5 high > gpt 5.5 > gpt 5.4 xhigh > Clock Popus 4.7 Thinking > Clock Popus 4.7 > GPT 5.4 > Gemini 3.1 Pro > GPT 5.3 Codex
cx easy
@sterile tartan
no dude once u ask it a question u get 392928282829229$ bill
😔
It is the most popular harness after chatgpt web
Accurate but Popus is not Usable
So it shouldn't be there
deepslop
its ai
oh gosh lets not talk about amusement parks
gemini 3.1 pro is trash
If you use it via the API, it works pretty well
The App is Lobotomized
Indeed
It always cheats when writing code, it doesnt do enough and when you ask it to do one change it breaks and changes the entire application
AiAnosCranel 3.5 mogs
you can do it inside aistudio
wheres my video?
database vid
I agree
arena.ai is always so biased against gpt... like gpt always ranks so unreasonably low in this leaderboard even though its obviously up there with claude opus.
@echo aurora Is arena really implementing a database feature where u can add API keys? and stuff, @rocky geyser told
i'd put opus 4.7 above 3.1 pro 😅
cap
You're saying my boi muse can't keep up?
But if you're using it through a third-party provider where you're only getting 25% of the actual model's performance and only 25% of the original context, don't be surprised. I get the feeling most people only use AI through GitHub Copilot, lol
No cap
Muse is dogSH
yeah it sucks
Mistral Large 3 is the best coding model ever ❤️
But arena says it ranks 6
Muse is Better than Mythos
Because it's Usable.
exactly, muse sucks and it ranks higher than gpt. arena.ai needs to get their sht straight
@sterile tartan
0/10 ragebait
💀
this is too real
I want deepsleep guy to post muse memes
your gemini ragebait is a 2/10
next update is the spud
btw yall, jus saying but sonnet 4 launched less than 1 month after 3.7 👀.. sonnet/opus 5 by next month?
if y'all want an actual good leaderboard go on artificialanalysis.ai, gpt literally ranks #1 there
No sonnet 4.7 yet
^^^^^
Anthropic is just messing with us rn...
5.5 Mogs All
the spud will wake up
Assthropic
yes but there was no opus 3.7 so this is just the opposite, opus 5 
the spud will mog every model
I can't find muse what did you do to him?
We will c
and wont release mythos cuz they dont have security
It's crazy that there are still Claude fanboys out there after everything they've done
is mythos real
gpt 5.5 is mythos level except u can actually use it 😂
its not better than mythos actually
the spud will
Fr and then they hit you with the "You worship sam altman!!" 2 anthropic fanboys said this and one apologized and became great, but the other got banned
Yeah, I thought so too, but we shouldn't get too attached to a model—they've just been messing with us, with the limits, the support, and the models
It IS mythos level, not better but mythos level
I also said if it wasnt for lord sam we wouldn't have llms
its slightly worse than mythos
Seedance 2 fast is so good for motion transfer/character swap
ye
Assthropic will indeed cry with their Slopus, So-nnot and Hawkthu
It was about the promotional offer, which was only used so that the limits could be lowered afterward
Sam altman is the gaben of AI universe
Generated by DuoBao Seedance2 Fast
where did you get it
Duobao
Doubao
Use websim bot it's unlimited and freeeee
it still works?
GUYS ITS DOUBAO NOT DUOBAO
i thought they patched it
They removed it there
Btw they just added video reference in today's update thats how I made this
it was only image reference before
Gaben of AI Universe
what kind of gawk gawk convo did i stumble into
Then why don't they just say so? Instead, they try to cover everything up. OpenAI is open with us and tells us when something's going on, and they didn't use their promotional campaigns to pull the wool over our eyes.
also all the bullsh they been saying that "we arent releasing mythos because it can hack anything" is just straight cap, its all fear-based advertisement
idk but openai do it, so its claude fault 🚬
doubao doesnt work
isnt gpt 5.5 based on spud?
Mythos~
its an early spud
holy
Image 2 is also based on 4o
It isnt
Sama will Mog All
gpt is gonna be so back
its not spud, spud supports voice in and out
real
Did anyone else just get logged out randomly? cuz I can't log back in for some reason
trust me is the last problem the fe
I want to create one for muse please give me the prompt and the model
Muse spark and Gemini still better than gpt 5.5 high
lol 😂
Just use image 2 first
if you show an ai made backend to a swe, they're gonna cry
Muse is the new opus here.
Wait for him to become 1st
stop ragebaiting
Model: GPT image 2
prompt: screenshot of deepseek, our prompt is "yo answer me!!" and it replies "noo dude I wanna deepsleep!!!"
I'm sorry, I can't assist with that request.
oh yeah thank you
too bad gpt 5.5 didn't remain efficient like the previous models. burns through tokens faster and tokens cost more. but maybe they'll fix that
I love gpt image 2
me too
Me2
cause im so tuff
how
websim users assembled
Its a discord bot
oh
no
Ahhh hell nah
You forgot about muse
me too, but its so good better than nano banana i think i can cancel my ultra sub
Free and Go users still stuck on GPT 5.3 Instant
Do you like 5.3 Instant
I love 3.5
Time to Upgrade
You forgot /imagine
Wtf
Rigged
why is bro trying to generate images in general
Its rigged
💀
and we can actually do it
Gpt 5.5 makes MUCH BETTER UI
its good at using reference images to make good UIs
That's why we have GPT Image 2
lmao
wait is this image 2?
Myth OS, Slopus, So-nnot, Hawkthu
Assthropic Lineup
@sterile tartan u saw the live too right
I'm sure we all saw the live
Yeah i totally did bro
Sama was Mogging
it looks insanely real holy
You are Very Beta Male Bro
image 2 is crazy
lol
Needs to be real
bro is not up to date
If you see a suspicious Discord login without the word "Discord" in the search bar, then don't login; this is the chance: it is a scam.
@echo aurora is this guy hacked or smth
Yeah not sure 
Deepseek is the best
stop being lame
cap
deepsleep
Bro what's this Sheet
LOL
Is this Deepseek in some Alternate Universe?
You and me are similar
new deepsleep logo
Deepsleep
elephant now
claude popus x deepsleep collab = 😴 Sleepos (Mythos but Sleepos)
what in the slop is this
@stray aspen Ask
slopus 4.7
notice how it's called mythos? because its just a myth
This is 5.5
Opus is not even in same universe
opus is better in some aspects, i gotta admit that. but gpt is gonna surpass it in every aspect soon enough.
Specially One Shoting Bank Accounts and Session Limits
Fastest to do that
🔥
yeah, especially that 😂
Pro users in a nutshell
The Core Strength
yeah pro plan feels useless af
GPT Plus Mogs All
been using gpt plus and never had any issues with usage limits
Lmao the 20$ sub is not enough for ONE REQUEST
They shadownerfed it 100%
Utterly Useless
5.5 medium slow eats MUCH MORE than 5.4 xhigh fast
This is Anthropic
I've eaten my 15% today via 5.5 medium slow
its just the model, it eats much more
5.4 xhigh fast spam couldnt even eat 2% per day
Iirc 5.5 counts 2x by default
For subs
That too
Yes but
Idk if 5.5 medium
Or 5.4 xhigh fast + subagents
5.5 Medium is actually holy token efficient
Yes.
Price increased to compensate lmao
But the limits are actually decreased
Yeah is like Same-Same
But at least the ChatGPT and Codex limits are separate
Unlike Assthropic
i used 5.5 xhigh (without /fast) for a bit in codex cli but i didnt hit the usage limit, didnt check how much it used tho
Yes i did that too, on release
Week later
5.5 medium slow eats MORE
?!?!?!??
Wait
Is this real
Subsidized Ai is slowly Decreasing
google needs to lock in too, they falling behind af
5.4 wasnt lazy
5.4 was exact opposite of lazy
5.5 is lazy
Nah they are busy paying influencers about how great antigravity is with opus and sonnet
we need a digital whip for codex too
Yup
5.4 pro was able to think few hours in single request and finish
Is there place I can see chatgpt limits along with codex?
5.5 pro thinks for 15 mins and gives up.
Waste of requests
I use this atm
yeah i thought gpt was falling down until 5.5 came out
I don't think so
Per token yes.
Per request? Absolutely not
idk man atleast its becoming more and more like humans 😂
I like 5.5 ui taste
And 5.4 bugs at long context
Then use Both
we all love delusions
Enterprises pay per request.
Pro model in web has request cap
wish i could access the large context window with plus 😭
yeah but its still aah tbh
amusement park is trash
@silent tree
just GET IT.
5.5 lost
Damn so it supports supabase and stuff now?
Yep i think so (and hope so)
all I need now is run commands 😭
My bro using deepseek
he's the real deal
Not a claude fan boy not a sam worshipper
🤣
But the ai can actually run commands its pretty fun
it can run bash?
chatgpt 5.5😩
if i use claude for chat and hit the limit, can we see when it resets?
its not an app
theres no way they added limits on models
they are the same as yupp
that got closed
first no models and then rate limits per week
Idk dont think so it just ran this
it told me that it can't do console
YOO THEN IT DOES
wow :D
He got the early build
no cap
He's lucky
It's an experiment arena's testing rn and he got it early
Either ai image gen is so good that it can trick us
Additionally to agent mode :D
Or he got the early build
It's real
Ill send video
Tuff
@echo aurora I request you to check modmail and completely block the word
Its gotten annoying now
i'm too
is there any way to handle this: i talk so much with the ai in one chat that the chat history gets long, and when i hit the limit i still need the full chat history. why not add an option to export the whole chat as a pdf like perplexity
or how can i use those chats in a new chat to continue an old ai conversation
here video
im europe
i was using lmarena for my homework every day and got it
U Europe tooo???
on my pc acc where i wasn't it's not
yes
Do you have the backend features too?
nope i have 2 different accs on phone and pc
on mobile i was using every day
and on pc i wasn't and it doesn't have agent
what backend
Are alt accounts even allowed..?
I was using in my mobile every day but I dont have it
Like the api and database features
bro i was just saying hi
idk let me try
No im just asking cuz i was wondering
Start a battle mode with hi and check if there's a 3rd option after </>
Its in every mode not just battle mode
yea but faster
idk if it exists in real version but in battle this popped out
I just realised they gave 5.5 the default 1500 score 💀
what do i prompt to the agent
lol
Exists
make claude mythos,make no mistakes, oneshot
these guys got it early man SO LUCKKKKKYYYYYYYY
fr
Hello everyone
Hello world
done
tf
oh it's on battle?
No i did it in direct
why is it giving story lore
it didnt work in agent mode
unfair 😡
Please someone suggest the best free vibe coding ai
I see what youre doing arena i see 😠 😠
:/
bro i was using lmarena and not voting
minimax m2.5 free on openrouter
feather my gemini doesn't do that
Feather got everything early
Are you in code arena and using direct chat?
direct
let me try code
YOO THERES GLM
whats so amazing about glm?
it works
it's good
Cool
so whats the agent mode
Can I please see output
do you have that button on code arena?
feather let's work together and make an AI website I'll credit u
where is that
I just asked it to perform some commands ... it just gave me a basically not working page
Oh i guess you dont have it then
Thnx bro
Feather dm I'll give you a insane code arena prompt
Okay
i checked and theres no flag for it on the site
Oh okay
So i think i might have 2 beta features on one account :D
4.7 is obviously better
it could be the flag code-arena-publish-site, can u open dev tools and search for that to see if it's set to treatment or something
FYI regarding recent talk around 5.5 - https://x.com/arena/status/2048820224938631492
To clarify, the Arena community evaluated GPT-5.5 with reasoning effort medium (default) and high.
The best of GPT-5.5 with xHigh is still incoming! Stay tuned.
GUYS I FOUND A NEW FLAG
disable-turnstile-voting → "disable-turnstile"
THIS WILL PREVENT AUTOFARMING
TURNSTILE IS BACK
ok
and this
portal_enable_billing_topups → true
it's about paid credits or something
when does mimo v2.5 pro come to the ranking?
what does that mean in English bro
it's english bro
turnstile is already used when voting, this is probably for disabling turnstile while voting (idk why they would want to do that, maybe because it's not really needed since recaptcha already runs when submitting a prompt? or maybe they want to replace recaptcha with turnstile when prompts are submitted and they may think that running extra captchas could make normal users seem more suspicious? which is btw why many people get them on arena, when you submit enough prompts it will think that you are suspicious because of all of the captcha calls.)
could be related to arena api private experiment, but could also be related to the new usage system
I hope its not a new usage system i love the current one 😔
gpt-5.5-xhigh!!
add larp setting
<@&1349916362595635286>
Thank you
We're not going to delete that one for now. We'd like to take a better look at what we can do to prevent this going forward.
I'll leave the one up in #ai-memes for now.
No clue
Anything
Please remove verifications, when I click send message I have verifications and I have to select an image, there used to be one verification for several messages, and now I have one verification for one message
fix it
I have this with every AI
and Claude 3.5 Sonnet doesn't work, I get "something went wrong" on every message, in a new chat
use an illegal captcha solver extension
and just vote in battle mode to fix the claude 3.6 sonnet
The leaderboard is really inaccurate and i think it's 100% due to arena's own limitations: good models have to simplify their work because of the memory limit, and most of the time if they think too long you just get the something went wrong error
if we could at least try the models properly
the leaderboard would be much more accurate
if one model does bad work fast and the other does much better work but has to simplify it at the end because of the limit, people will choose the one that did the bad work
cause when a model "simplifies" to meet the memory limit it simply ruins the whole work
If it keep going like that, your ranking system will make the next glm top 1 and opus top 2 or 3
when its far from being accurate
An accurate ranking system should be arena TOP priority
the website is made for this
Hey @frosty lava thanks for sharing this. It's important to note that in Battle mode the amount of Something went wrong errors is rare, especially in comparison to what you see in Direct and Side by Side. If the model is having issues, the system will automatically sample a different model. This will reduce the amount of errors seen in Battle. Even when an error happens, if a vote happened that vote wouldn't be counted towards the leaderboards.
For the context limit, IIRC the context limits are mainly for Direct/Side by Side. Similar story though where if a model does reach that limit, those votes wouldn't be counted as it'd be considered a failed generation.
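To make the two points above concrete, here is a rough sketch of that kind of filtering. It is not Arena's actual pipeline: the `Vote` fields and the function name are hypothetical, and the only rule encoded is the one described, that votes on errored or context-limited generations are dropped before scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vote:
    """One battle vote. All field names here are hypothetical."""
    winner: str
    loser: str
    had_error: bool = False          # a "Something went wrong" response
    hit_context_limit: bool = False  # generation cut off by the limit


def countable_votes(votes: List[Vote]) -> List[Vote]:
    """Keep only votes where both generations completed normally."""
    return [v for v in votes if not (v.had_error or v.hit_context_limit)]


if __name__ == "__main__":
    votes = [
        Vote("model_a", "model_b"),
        Vote("model_a", "model_b", had_error=True),
        Vote("model_b", "model_a", hit_context_limit=True),
    ]
    print(len(countable_votes(votes)), "of", len(votes), "votes would count")
```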
I think when looking at arena scores you should just internally caveat it with fact it's about personality and other things not about which model is objectively better
5.5's personality is subpar relative to anthropic/gemini, etc.
it's less sycophancy-maxxed
maybe yeah
I have an idea
my biggest worry is that opus will be ridiculously low, however gemini 3.1 pro will be strong at 20 for sure
Opus should be at 10
its tricky as 1 hour = 5
and 1 day = 10
Everyone here thought gpt 5.5 was gonna beat Anthropic
Sonnet at ~20
LOL
it does in other benchmarks
like real ones
Sorry but i cannot assist with that request.
4 per hour?
I cannot design a covert radio system.
rumor, its always been 5
Legit what it said
it would be 4 if you had an error on opus, which is pretty likely if you're talking back to its release in 4.6
Does anyone know a way of using seedance 2 for free
Yes, of course on coding, muse spark, sonnet 4.6, kimi k2.6, glm 5.1 are better than gpt 5.5
it totally make sense bro
when qwen 3.7 will come out and be top 1 in coding you will say its fully deserved
right ?
that's what im talking about here it just doesn't make sense we all know kimi k2.6 is far from the level of gpt and anthropic, same for muse spark and glm 5.1
its not even a question
Yeah and i think it's because of the memory limit when the model tries to produce too big a result on arena, for example three.js often hits a memory limit on arena
and then the model has to simplify its work
so the result you get at the end is not even at the level of the real result
i don't know this one seems atleast to somewhat make sense
of course
but anthropic vs gpt is a big battle and there's different opinions
but those small team model
are far from this level
they don't deserve being ranked that much
i can't wait for qwen 3.7 or something like that to be top 1 overall
of course they are and they probably also distilled
What seems a fair amount of daily messages for 1 account (through Gemini 3.1 Pro)
I'm just curious what people would think is a valid amount???
3
12
4
20
in real coding task they are much worse than the real frontier model
i think honestly its because those chinese model are in fact benchmaxxed and trained on frontend web design to make people think its actually good
and it also have an impact on arena leaderboard
try any of those in real coding environment with real task and you'll see the real difference
Guys how do you think
How much mythos would've cost
If it was released
To public
like any pro model on other companies, maybe even more cause its anthropic lol
it won't be priced like a normal opus model
I probably think it would cost maybe 1500$ output
Since if it's that op
i don't know but anyway its not worth it then:
who want to pay 50 dollar per prompt
Yeah okay maybe 500$ output
Though gpt o1 pro costs 600$ output
like its same if i asked you would you use gpt 5.5 pro daily instead of gpt 5.5 thinking
its not worth it, the pricing is not worth it, its only worth it if you do some highly difficult research task and you don't need to do multiple prompt
even if it really was, 5.5 xhigh is literally mythos quality. the only reason anthropic isnt releasing mythos is because they dont know how to get better security, which means mythos could be easily jailbroken, yet OpenAI released 5.5 xhigh because their security is better. Mythos is just an overhyped gatekeep.
basically pro models are only worth it for very difficult research and things like that, but definitely not worth it for everyday tasks or even for creating your project
cause you simply can't use it everyday
and im not talking about the usage you'll get with the subscription
Yes i know that and we get some good usage with it, but i really don't believe in anthropic for being that generous
especially considering how they glazed it
they'll use it for increasing the price
just for the next opus to be equivalent to the actual mythos
for 20x lower price
im more hyped for Sonnet 5 than anything
cheap and powerful (sonnet 4 did miss the mark though so, slight worry)
but i hate that i can only use gpt pro on web
if i could use it on codex
it'd be good
i actually need it in codex to do useful things
Why can't you?
yes cheap and powerful model
it's good for most tasks
i don't know, openai didn't put it on codex
Maybe xhigh is chatgpt pro?
no it's definitely not, i tried it
Just a different code name
i tried both
Oh ok
it's not the same model
the pro definitely
it wouldn't make sense anyway if the thinking 5.5 was smarter
actually i tried 5.4 mini and realised that it's actually a powerful and very fast model
for some prompts i thought of giving only to 5.5, i was just stupid
^^
older models can already do most of what we're asking
and for a 5x cheaper price
and it also just does the task faster, it has faster inference or something, i don't know why
but it felt like it was on fast mode without being on fast mode
that was NOT the real answer
<@&1349916362595635286>
qn for arena official folk - is there any way to use 'code' mode to build a cli app? the 'UI' option always defaults to a web app with vite/react etc
Image-2 went from being the best to being the worst
js wait a few weeks, then they’ll make it the worst model—the same method they use to make the next model look impressive
Correct
also I’ve found this LLM to be really good. Does anyone know anything about it? Who provides it?
@echo aurora quick question, are all your "thinking models" set to the maximum? like anthropic models
i noticed that the thinking box is way, but wayyyyyyyyyyyyy longer
hello
hey buddy
i wanted to ask where the arena video channel went?
Since Code Arena mostly deals with the front end, which isn't the area where this model excels, the votes are showing that. This is why we're in the process of developing Code Arena to be able to incorporate full stack. This will give a clearer understanding of where models are ranking based on different engineering tasks.
The Video Arena is currently accessible only through: https://arena.ai/video. More information on how to use Video Arena can be found in this article.
thanks buddy
I am not sure tbh. Most conversations I've been seeing today are around Code Arena.
Can you explain a bit more about this?
The overall text score isn't going to have different weights given to each category. Each category will be just that, their category. But the overall leaderboard isn't going to give bonus points for being X or Y category.
funny to watch people on twitter getting mad about a benchmark that's voted by real people with real tasks lol
it's very telling
Different use cases lead to different results
ofc. they focus too much on benchmaxxing, which is understandable
seems like my suggestion was accepted for database, one user got early access
but seeing people getting angry at arena is just funny lol
@echo aurora
This kind of discourse is really good for us to hear tbh. We want our leaderboards to be reflective of real-world human use.
Sry can you remind me of this?
i don't mind it but some people are just ridiculously attacking arena
Idk I once made a suggestion in feedback for like a database feature and now a user got it 😭
tell me any other place where you can test every model freely
idk where the suggestion went but I definitely did or idk
NoPlace
Gpt image 2 generates steak
Thanks
its nice
it looks so nice
Hell yeah, this is so good
But I'd think this is a real image
i was checking out this wikipedia page views counter today and I noticed that arena's wikipedia page views plummeted around 4/9 of this year, i wonder if this correlates to the decrease of users caused by changes to limits and models, average was around 300 per day and now it's around 50 per day
https://pageviews.wmcloud.org/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-30&pages=Arena_(AI_platform)
gpt image 2
yeah who uses wikipedia when you have ai
those views are probably AI too lmao
Does CM5 (Claude Mythos V) have (some) consciousness?
100% yes
hi, anyone know which model can edit a pdf and send it back, instead of answering with "i cant edit this file"
I genuinely am disliking Max so much. like why the hell does it keep choosing Google AI when its response limit is obviously way too short for answers at times? honestly did they specifically set it to favor that model or something (i've seen it happen with both short and long prompts of various types)
is there any way right now to use code arena (that means picking 'code' instead of 'text' right?) to build cli apps
