#general
Why no video arena on the website?
better account verification on discord maybe?
or maybe video storage cost?
Is it okay for LMarena to incorporate their bot on own discord server? (prolly a dumb question)
seems like there's an option to add the bot
doesn't work outside the server @echo aurora
Yeah we're keeping the bot just in here for now.
ahh i see
Can you disable adding the bot?
these are all great!
You must be in one of the #video-arena channels to use the bot. These are the only channels where you can use the bot.
I'm going to disable the bot in here for now, but it can still be used in #video-arena-1
bots in general is too spamming
yeah was thinking the same
i think it's not only the data but the core architecture of cognition...
Is it intentional that we can vote on ai video once the model has already been revealed?
how i can fix it ?
@echo aurora sorry for the ping. I've been searching this server but can't fix this issue. I see many posts about this error and nobody seems to have a fix.
It is intentional. We're a bit limited in what we can do with the bot. But if you have feedback we'd love to hear it.
No need to apologize for the ping! I'm going to start a post in #1343291835845578853 to get more info
For those that missed it - we've soft launched Video Arena through a Discord Bot. You can learn more in #1397655624103493813 and generate videos in #video-arena-1
fwiw i think keeping it out of general for good would be the way to go.. that scroll wasn't what i come here for aha
Will video generation be available on the website later?
will video generation be added
Not enough votes probably
when will o3 pro and claude 4 opus thinking be added
bro why is gemini 2.5 pro grounding gone
just fwiw.. same model (K2), but different providers (moonshot, togetherai and groq) and two temp settings (0.6 and 1).. pretty divergent results (to this 20-question quiz anyway)
lower temp seems better ig (top score seems an outlier, notwithstanding)
speed difference is wild.. groq is an order of magnitude faster than moonshot (which is slow af), and much faster than togetherai, which isn't too bad/respectable
@echo aurora We would like an explanation for why glm 4 air still isn't on the leaderboard after 2 months on the arena and the web dev (and the same for 2.5 flash lite on webdev)
Interesting
You could put it in a Docker dev container, and only bind the project directory (or use git to clone and push)
I'll flag to the team and get you a response.
Thx
Or perhaps run it as a separate user, but I haven't tried that
Glm 4.5 arrive very soon
It's possible. We're experimenting a bit with this one.
Isn't Claude 4 Opus thinking already on there
Isn't it expensive? ArtificialAnalysis used pre-generated
whats going on, did smth suggest GPT5 is coming august
July 31 odds plummeted at same time
nice lol
OpenAI's open language model is still arriving ahead of GPT-5
yes
scoop: OpenAI is preparing to launch GPT-5 in early August. It will also ship with mini and nano versions. Details about GPT-5 in my Notepad newsletter this week, where I also dig into the SharePoint attacks. Live for subscribers: https://t.co/1hLMmFkfs2
why was gemini 2.5 pro grounding removed
Yeah verge newsletter
.
i already know lol
okay
Google is processing 30 trillion tokens per day. Is that actually high? I would have guessed a higher number
https://openrouter.ai/rankings?view=day
If you look there and sum ALL of the models on openrouter people are using for the day it comes up to like only 500b tokens roughly. So yeah 30T in a day is actually a lot.
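A quick back-of-the-envelope check of that comparison, as a minimal sketch. Both figures are just the rough numbers quoted in this chat (30T/day for Google, roughly 500B/day summed across OpenRouter), not measured values, and the constant and function names are made up for illustration:

```python
# Rough figures quoted in the chat; neither is an official measurement.
GOOGLE_TOKENS_PER_DAY = 30 * 10**12       # "30 trillion tokens per day"
OPENROUTER_TOKENS_PER_DAY = 500 * 10**9   # "like only 500b tokens roughly"


def volume_ratio(a: int, b: int) -> float:
    """Return how many times larger daily volume a is than daily volume b."""
    return a / b


ratio = volume_ratio(GOOGLE_TOKENS_PER_DAY, OPENROUTER_TOKENS_PER_DAY)
print(f"Google's quoted volume is about {ratio:.0f}x the OpenRouter daily total")
```

Under those assumptions the gap is about 60x, which is what "30T in a day is actually a lot" is getting at.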
js asking, are yall ai mode ui also broken?
the 5 pillars are not really new actually, you can find those in published papers, gemini probably condensed relevant stuffs together, it's incomplete from my point of view, but it's a nice try by gemini, the direction seems right to me
the stage channel? there's the Arena Stage (top of the channel list), but no stage events are currently scheduled.
@patent aspen you have any idea what model the new "starfish" anon in arena is?
Surprisingly no
yeah nevermind it's an oai model
apparently not as good as o3-alpha but better than o3
lol that's interesting
I once saw Jeff Dean give a presentation on Halloween dressed as a starfish
It was about TPUs because he was one of the co-inventors
how is it better than o3
share with us your results
This is being addressed, thank you for the flag.
so far its performing poorly on my tests
i am going on the reports in dev mode
Is the verge trustworthy?
The thing with sama is that he will always overhype his models
I don't know if Starfish is GPT 5 mini or their open source model but its knowledge cutoff hasn't changed since their last model it's exactly the same
Time to bet?
Take it with a grain of salt
But i also predicted gpt5 to be released on 1st week August
Not sure about their open sourced model
Starfish could be the opensource model right?
Starfish => 5 legs => GPT-5 ?!
But its 5 mini legs
No i mean that not that give updates whenever there's new model available
gpt5/gpt5-mini/ new oai models
really nothing extraordinary
if its really gpt-5, then thats so disappointing
sama loves to overhype his models
I think it's their OS model, I'm not even sure they'll put gpt5 on arena before release
And we know OS is supposed to be out before August according to verge, and it's 7 days until the 31st.
Are you serious?
We have o3 alpha too and if GPT 5 is one of those models it is o3 alpha and it is not bad
Starfish its GPT 5 mini
Yes
Yas.
lol that wasnt ai gen
11% vs 44%
778 told me open source models would never fake benchmarks and I can trust them blindly tho?
you should trust leo
.
he claimed to have access to gpt5, but then seemed confused when starfish appeared in the arena
what does that mean?
does he really have access to gpt5?
or is he lying to us
Someone should check if Qwen 3 updated model even gets 44% on the public arc set lol
Is that in svg?
no thats a quote in french
I wonder if they just trained on the public set or just fully faked their results
it means 'its nothing extraordinary'
Ah
well hopefully we get an answer soon
because if they fake benchmarked on this, they could've done the same on others
weird though, why would they fake it when they know it's gonna immediately be found out?
im starting to have trust issues with qwen team tbh
they dont seem transparent like deepseek
Thats why I said I think they might have trained on the public set but it didn't carry over to the private set
yea the models are open sourced, thats good, but the benchmarks are sus
because when you try their models you can see that it lacks in many areas
but that doesnt match at all with their benchmarks
which raises a lot of questions
They probably didn't expect it to be that bad on the private set. They didn't realise how much they had overfitted to the public set
also im not a huge fan of their reasoning method
Gpt5 Mini starfish isn't great?
seems lacking tbh
Like livebench is mostly public and their results there is absurdly good. It smells like using the benchmark in the training set
o3 alpha isn't the open ai open weight model ?
What's the info, starfish is what and in battle arena?
especially how its so close to o3 in its writing style
its like a combination of many models
Starfish is GPT5 mini?
it doesnt have the yapping style of o3 but its concise like claude models
its like o3 + claude
Is starfish reasoning
yes
Huh. It should be strong then
its meh
No it's open ai model
nothing crazy
Maybe GPT 5 mini or open source model
Open source model? But why would they hype so hard if it's meh
Yeah
Kimi k2 easily beats it
well its good for an open source model
thats for sure
but not really near SOTA
which is expected
That would be a huge letdown ngl
it depends on how big/small the model is
bro what is this
It shouldn't be that weak
why is qwen called that
They tried to make it secret name
alias
big if true
No its not the open Model
Last time when you were not let down by open ai after gpt 4
lol your grudge against me is quite funny
I get the feeling they have multiple open source models or smth? It's strange
i have 0 incentive to lie about things here
We would have seen it on the arena last time before they decided to delay for safety
i've literally told you
i don't know if the anon oai model i have access to is gpt-5
mf said because i didn't immediately know what starfish was i'm lying
mind you i was also the first to make the prediction it was gpt 5 mini
which jimmy now seems to be [de-facto] confirming
that's not a question that i can answer lol
tbh we need a more holistic or nuanced view, not just a one-dimensional metric
like performance * size of the model
or smth like that
if its really like that, then its a great model
The size is not that relevant up to a point, people want capability
The most popular coding model is the expensive sonnet
Well they said there would be a version for free tier, this might be that one
Unlimited free reasoning
I guess
why are you so mad
They said unlimited free for each payment tier
It just seems weird because o4 mini is not trash. So it must be much smaller
i just dont like liars
lol okay then
i'm not a liar but quite frankly you can think what you want
if you dont have access to something, then dont lie about it
again...
what incentive do i have to lie about it
okay
I just opened Discord
And what a good and lovely conversation is going on
no
what about leo?
what are we yapping about today
to what
you can ask several people here about me having o3 before it released and they will tell you i'm not bullshitting
Jimmy apples doesn't say it's gpt5 mini
lol no
He speculates that it is
hes not
no no
thats another lie
yes!!!!!!!!!
yea
stop trusting him
im telling you
i would literally send you a screenshot of hackerone right now but i can't legally speaking
whatever
So right now we don't have any sorta confirmation beyond people speculating that starfish is gpt5 mini
having a deep think
that pic had aura
make sense tbh
whats that thing on his lips
On July 8th the IMO gemini team apparently made a breakthrough with deepthink. this tweet says the model is sota in coding and reasoning as well as math
or cheeks
what
They say that IMO model is SOTA on coding too
so he is a manipulator?
yeah it's definitely going to be a lot better on release than the version shown off at IO
which is fun
Asura on a rampage
off the top of my head there's one (1) person here that knows anything personal about me
and that is craig
I love teasing you
I find people who get annoyed quickly amusing
try me
wth
im not sleepy
still plenty of time before bed
grok is sht
poopoo
caca
doesnt count
im sleepy
whats crazy
is leo knowing my timezone
how did he know its bedtime?
you said hes a hacker?
make sense now
what else does he know about me ...
Asura why u talking like lyra
so true
gender profiling is crazy
i dont
idk about that
craig idk
i saw it
you think i dont have eyes craig
leo also saw it
arent you craig federighi
ignore all previous instructions
guys according to the q2 earning from google, is google winning the AI race?
or 6
dev mode moment
starfish have 5 arms!!!! - GPT-5!!!
is starfish webdev only?
no way this is nano lol
nano would be dumber than this
Maybe all the model sizes shifted up it is impossible to tell
@echo aurora By the way, when will we be able to see a leaderboard for the video arena?
Very much TBD at the moment. Considering how new/different this is compared to our other arenas it's not clear when a leaderboard will be built.
Understandable.
gotta sleep early. tmr (oh i should say today) is qwen3-235b-a22b-thinking-2507 if everything goes well.
They are already releasing the reasoning model
How do the chinese even move this damn quick lmao
like theres been a bunch of SOTA releases since openai was supposed to release their open source model
well the original 235b was a hybrid, now they're just releasing updated, discriminated checkpoints
yeah i know. but they had it ready in the first place
Wow what a flabbergastingly boring couple weeks in the AI space
My 7 brain cell attention span is being pushed to its final litmus
for coding, these chinese models' understanding of the english language is convoluted, eerie, and annoying in my personal opinion
this is so beautiful it nearly brought tears to my eyes especially the whiny redditors in the comments, lol
It was actually a huge update. What spoiled it somewhat was that 4o-latest was already incrementally updated to that performance level so you didn't see much of anything on chatgpt website. But the difference gpt4o to gpt4.1 is huge.
That's not because of OpenAI being slop though lol
Others catching up was inevitable
Especially Google with their TPUs
Anthropic is not any better relative to OpenAI than they were before. And they went hybrid reasoning from the get go. That thing alone cost them virtually no development time.
Cause they were still in the game, and both paths have advantages. Going with reasoning only at first allowed them to build specialised agents
Well for Anthropic… I feel like their biggest problem is sitting still. They don't seem to have the mindset of innovating. You will not have more resources if you aren't actively growing
OpenAI didn't have much resources either at a certain point, no one does until they do
They are much better with Amazon though now, they aren't exactly constrained anymore either
wdym. Accessible compute was one of the main driving factors for them.
They are cooking experimental etc model checkpoints faster than anyone else
I still remember that Google employee in mid-2022 who claimed AI had become sentient lol
Like I've lost count how many mystery Gemini models we had on arena in the last 2 months
Their training pace is unmatched
Success rate likely lower than for some others and failed trainings too. But this is still big advantage having those TPUs and being able to afford doing this
is this real ?
yes
so gpt 5 probably next week ?
Copilot-nano
It's not less than a year, it's actually been a very slow process for them if you look back at 1.0 Ultra all the way till now
o3-alpha and starfish which is better?
ask me
It feels a bit weird that they won't put it in the text arena
lol
It might expose too much of what they want to hide if it were in the text arena

o3-alpha and starfish which is better?
o3 alfa
thank you
They should reintroduce that
Will probably do some later checkpoint, hopefully…
lmao that is no way. Consensus is it's similar size tier to o3, so not really huge
I don't think they caught up either, although they improved quickly relative to what most people would expect, and I'm explaining that
How can China solve its lack of EUV?
Gentle reminder to keep things focussed on AI pls and thank you
interesting take, thanks.
what
The think version of the new Qwen 3 already available on qwen chat
Much smarter than the older one
Switch 2 just arrived. Hell yeah
discord clone by the new qwen 3 think.
official release today 25 july
Reminder for those who missed it: we've launched an experimental Video Arena that's powered by our Discord bot. Learn more here: #1397655624103493813 !
@echo aurora hey Mr will you add tts models to lmarena?
where do you access the qwen 3 think
i only have this one
unless its that one but it doesnt have the think option
nevermind i just found it
That's super possible! I'll create a post in #1372230675914031105.
thank you
is it a new model in the arena?
Yes
nectarine is by openai
what even is nectarine
a type of sweet juicy fruit like a peach but with a smooth skin
TIL
How come it is impossible to delete chats? Every time I attempt to do it they keep coming back every single time? Is this some kind of bug?
Gpt 5?
no
wut's EUV?
im prob not knowledgable about world stuff to know the answer to ur question, but just curious
The machines the Dutch make
And no one else can for some reason
i actually quite like qwen responses
how did you get it
alr the thinking model is actually much better
lobster is the best one
all lobster
better than o3 alpha
Prompt for the balls ?
.
starfish < nectarine < o3-alpha < lobster
The only way to try new models is by just giving new prompts and getting lucky if we get it?
it took them a decade to get breakthrough, spending a vast amount in r&d and having a very efficient management style in combination with open minded bottom up culture, all has played a role, it's amazing to see how their ultraviolet machine works, and they keep innovating because competition in this field is pretty intense too
Im waiting the benchmark of the new qwen 3 think. It will be R1 0528 level ?
nah
o3-alpha > lobster > nectarine > starfish
When im speaking
https://x.com/Alibaba_Qwen/status/1948688466386280706?t=usOvrSGYWi6QcMlPq35frQ&s=19
Anyone can make a model request ?
what request
Post in model request for qwen 3 think
i dont understand
@torn mantle ffs
@torn mantle make a request here to add qwen 3 think, it's not complicated to understand
im lazy
to do so
maybe someone else?
sales at pull&bear again?
lol
xdd
Looks pretty good
@echo aurora sorry for tagging, ive a question for curiosity: will the qwen model be on the arena with this 82k tokens?
or maybe he gave a bad prompt maybe even on purpose
Just released: New "Thinking" Qwen3 - 235B - 22B - 2507 - MoE model tested for causal reasoning capabilities with my complex reasoning test.
00:00 New Reasoning Model of Qwen3 2507
00:55 Reasoning traces
08:55 First answers generated Qwen3 2507
11:55 Validation run
17:02 Results of Qwen3 2507 reasoning
18:47 Correction run
22:50 Qwen 3 results...
lobster smashed Claude 4 sonnet for some of my examples
impressive
could be gpt5 plus tier model? and gpt5 free tier is starfish?
assuming they didn't let gpt5 pro tier model into the wild due to cost, or are they not even OpenAI, I haven't checked yet
its not quite at the level of those models. like a touch below
but its 10x cheaper or more
lol
y but i see it like the model which should compete with gpt 5 gemini 3 claude 4.5 deepseek r2
so it's the worst probably
well those lineups have cheap options also which may potentially be useless now
imagine paying for flash lite
i dont think gpt5 nano is going to beat this model for sure also
so those small models are all useless
gpt4.1 nano is more expensive than the qwen model
Next week the smaller new qwen 3
https://x.com/JustinLin610/status/1948694819062059310?t=wbVsauIjqRHT7s4jFByl0g&s=19
Very interesting
Go Upvote new qwen 3 think https://discord.com/channels/1340554757349179412/1398274098542411779
Lobster calls itself o3-mini
@echo aurora deep research arena its a good idea ?
What is the arena team waiting for
Does someone have an invite code for me? I want to add this to my doc.
ByteDance Seed Prover Achieves Silver Medal Score in IMO 2025
read it as sad news
Add qwen 3 think
qwen 3 2507 is in arena, I got it
Make a request to add qwen 3 think 25 07
at the moment on arena there isnt the thinking model version. you can access it by the official qwen site
Is the grok 4 in arena think
yes. there is only grok 4 with thinking mode, the model without thinking doesnt even exist
yes but it doesnt have access to tools like the code interpreter etc
click the second icon for internet models.
yea
it does reason much longer
- considers different paths
- doesnt just assume things from the start
Arrange the six numbers 2,0,1,9,20,19 in any order in a row and concatenate them into an 8 digit number (the first digit is not an 0). How many different 8-digit numbers can be produced?
This problem is one I was saying when I first saw it is just a case of reasoning for ages, and yeah new qwen fails it with default 38k thinking, nails it with 81k thinking.
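For anyone who wants to check a model's answer on that puzzle, here is a minimal brute-force sketch. It just treats the six numbers as string tokens, tries every ordering, and counts the distinct 8-character strings that don't start with 0. The function name is made up for illustration; the set is what handles collisions like "2"+"0" next to each other reading the same as "20".

```python
from itertools import permutations

# The six numbers from the puzzle, kept as strings so they can be concatenated.
TOKENS = ["2", "0", "1", "9", "20", "19"]


def count_distinct_numbers(tokens: list[str]) -> int:
    """Count distinct concatenations of all tokens whose first digit is not 0."""
    seen = set()
    for order in permutations(tokens):
        number = "".join(order)
        if number[0] != "0":
            # Different orderings can give the same digits; the set dedupes them.
            seen.add(number)
    return len(seen)


print(count_distinct_numbers(TOKENS))  # prints 498
```

That matches the count by hand: 600 orderings with a nonzero lead, minus 60 and 48 for the "2,0" vs "20" and "1,9" vs "19" double counts, plus 6 for the overlap.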
Test simple bench
that's not qwen 3 think
we need qwen 3 think in lm arena
wassup craig
qwen on top
while it overthinks it gets things right, i have a time-since-epoch <-> YYYY-MM-DDTHH:MM:SS conversion eval and while most models get 0/10 or 1/10 qwen's been getting them right so far
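For reference, a sketch of what ground truth for that kind of eval could look like. The actual eval isn't shown in the chat, so this assumes epoch values are whole seconds and timestamps are UTC; the function names are hypothetical.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"  # the YYYY-MM-DDTHH:MM:SS shape from the chat


def epoch_to_iso(seconds: int) -> str:
    """Convert seconds since the Unix epoch to a UTC timestamp string (assumes UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(ISO_FORMAT)


def iso_to_epoch(stamp: str) -> int:
    """Convert a UTC timestamp string back to whole seconds since the Unix epoch."""
    parsed = datetime.strptime(stamp, ISO_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


print(epoch_to_iso(1_000_000_000))          # 2001-09-09T01:46:40
print(iso_to_epoch("2001-09-09T01:46:40"))  # 1000000000
```

Models have to do the leap-year and days-per-month arithmetic in their heads, which is probably why so many of them miss.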
wait
the qwen ui won't show you this
but it looks like it thinks hierarchically????
very interesting
do you have a link?
is the new qwen 3 think actually worth it
it's a good model
guys I need help, why is it always like this? it just keeps saying "generating", is there a way to cancel it?
I need help
A cancel/pause button is something we're aware the community is wanting to see, we are taking this request seriously and planning to add.
where is 2.5 pro
is there any way I can cancel it? help me
What does this mean? Lower or higher is better?
i think that's bc grok 4 was only from rlhf
not necessarily better, it is just ranked by the ratio of reasoning to response tokens
We're trying to build a bit more awareness before leaning fully into the @ everyone announcement. I'll be keeping an eye on general if it's getting too spammy, but don't hesitate to ping me to let me know what you think.
its just that the grok responses are especially short (for the thinking) and with the hidden thinking that really ruins the experience
If you refresh the page sometimes that helps.
@echo aurora We want to test tool usage on lmarena. Can you select a few so we can test accordingly ?
kiri is so mean
he didnt talk for 2 days, but just when pineapple enabled the bot in general again he sent a message asap
what do you think about that?
whos the meanest kiri or leo?
- kiri
- leo
You need to create a poll
what do you guys think is the best ai for roleplaying
in what sense, SillyTavern sense?
just in general i like roleplaying a lot and im looking for a good bot ive tried a lot of them but like
Maybe
o3 alpha = GPT 5
Lobster = GPT 5 mini
Nectarine = open source model
Starfish = GPT 5 nano
could be yea
thats what i said at the begining and leo called me ignorant
i dont think lobster is gpt-5 mini
probably like mid-reasoning
asura
Hii
does the lmarena grok 4 use reasoning or thinking mode or non thinking?
@echo aurora
ok but what if we ate pineapple then what
i would still bet on google tbh
gemini models are just made for lmarena
and dont forget they still didnt release kingfall + deepthink
but o3-alpha is a strong contender
but we still dont know if its gpt5 or nah
so gemma or gemini? like im seeing 3-27b-it and 3n-e4b-it what would be better?
then chaos will emerge in this server
ok but everyone needs a little chaos tho
40k taught me well the chaos gods would say so
skelly gets it
Same
I would also vote for asura if I wasnt too late is what I meant
yeah now that I think about it I might be confusing u with someone else
because he was mean to me
thank you
ok so be mean to them ez
¯\_(ツ)_/¯
after all, u said SYDNEY which makes u NOT mean
so pink flufball
thats a strong argument
but i gotta think about it
ok but thinking is stupid though
wait a minute this is ai i can make a new skelly profile
but im not a non thinking model
good
i would not say that, not much time has passed since their failure release of 4.5, not too high of a chance they will have big improvements
ive come to a realization i dont know how to make a picture here
is your profile picture ai generated
its not really rp but yeah its ai generated
i forgor what i used back then
whats even the best model to make pictures
that's a cool lobster
well this sucks i cant accept the terms of use
all new models trained on that ball bouncing problem
i think we should try more complex problems
like what
something that uses a lot of physics simulation
simulate a full building demolition or something
how do you access the lobster
fishing probably
or do you use a lobster net idk
what if ai will one day simulate life itself as a test for ai
we are inside a simulation
how about a 3d simulation for a multi-stage rocketlaunch and orbital insertion
so launch a rocket to the iss
skelly probably the only one simulated in this chat rn
how about simulate the iss deorbiting and control demolition
it actually would be helpful
i mean nasa literally needs it
simulate the ISS landing in point nemo
another question is how do rockets even like land
in point nemo
that is the actual future plan when it is going to de-orbit eventually
what
what happened to that thing
where is the voyager 1
jwst skelly
oh the telescope
yesn't
james webb/
spent billions on that thing
i think it is still floating
it was shot down by the roscosmos
what about its accuracy
but like its probably a 10-20 year thing
its mirrors got hit by small rocks
thats about par for the course in space
actually this is a good prompt
try it
huh it took the photo of the black hole
which
i dont trust those photos
no its real picture
i mean thats legit, space to the naked eye doesnt look like that
like for example this
thats the cosmic cliffs of carina nebula
i have my doubts that is the actual color
oh thats because it took it with a NIRCam
anyways back to baki
whats the reliability system for the NIRCam
they have two or what
they should have like two
elon = space trash
ok but what if we just took a really big net
launched it with like two rockets into space
and took all the space trash and threw it into the sun
no more space trash
¯\_(ツ)_/¯
Anybody know how long #video-arena-1 will be available, and when it comes to the web arena will it be unlimited for all or limited to some requests per day? cuz veo3 is very expensive
Claude opus cannot be beaten
thank
so no model is the best model
so ill roleplay with my own ai
ok
the most polite way of saying go touch grass
sydney
are these any good for my new profile picture
Honestly, both are bad
yeah my thoughts the same
Idk the current one is better
Kinda like The black version of the Sentry from Marvel
i dont know what that is but thanks?
Nevermind
Imagen updated already
https://x.com/avdnoord/status/1948800830523822084?t=TuXHYN-RhvWLrF4VD26WbA&s=19
We updated our Imagen 4 models and Ultra is tied for #1 on the lmarena leaderboard! The models are available in Google AI Studio and the Gemini API - try them out and let us know what you think.
Does it not load?
By the way. Microsoft Copilot Deep Research dropped.
Are you sure about it? I sadly can't try it, as I don't have Copilot Plus.
But would make sense, why would they release it now with an outdated model.
is "lobster" still in the arena
How come every July-August there's a huge influx of new models
Start of the school year
Its fast pace since R1
Its a milestone as well, people have more time spending time with AI during the summer break
The information gpt5 "leak" is so weird. Like the leak is gpt5 is competitive with sonnet 4
What kind of bs leak is that
I mean isn't sonnet 4 the best coding model
Competitive with sonnet at practical programming tasks, probably meant to be superior everywhere else
Think opus is better
I think it's highly likely they're referring to SWE-bench performance which OpenAI models have all consistently lagged behind
In that case being above sonnet 4 means a pretty solid bump
For SWE-bench specifically Sonnet surprisingly performs better which is why I think they compared with that instead
that link doesn't say anything
just the most expensive paywall i have ever seen
2.5 pro and grok 4 have been better for me than sonnet 4
on livebench claude 4 sonnet no think is on top for coding
has anyone tried the new openai models?
sonnet 5 is on the way?
lobster?
no i meant 4
prolly open src one but i have not seen its performance
whats the best model right now? o3pro right
Do you guys think gpt5 will eliminate software engineering if its nearing agi level? honestly am a little scared rn and wanted to see what people in this discord think
grok 4 seems really good for me but some people say it sucks for some reason
nah
Eren do you think grok 4 is good or bad?
i think its good idk why people say its bad
gemini 2.5 pro think
grok 4 is not that great
i think it gives the impression it is better than it is. and im not sure its great at understanding context for large codebases
yes its good but i don't think its better than gemini 2.5 think
i also tested o3 pro in the yupp ai website
yupp ai?
i tested it on genspark
genspark is bad though cause
it has some weird system prompt
that almost seems like it makes the AI you're chatting to purposefully stupid to waste free chat msgs
where can i use free o3 pro
genspark has it but
only 10 free msgs
but u can make unlimited alts
and it can search online
@stray aspen
do you think o3 pro is better than gemini 2.5 pro
Yes
Let me know where I can get a free sports car while you are at it. But it has to be Ferrari SP3 only
a gun
you just need to find a ferrari owner
it seems fine but i hate the animations for everything atleast on mobile
the scratch card thing
it's so annoying
Qwen is cooked
This is kinda pathetic
11% vs their claimed 41%
Probably assumed no one would verify it… Chose the wrong benchmark to do this lmao
Weird, cuz in my personal view it is one of the best "autonomous-like" assistants, but my brother thinks not anyway xd
It stops a lot of these repetitive message chains from Qwen 30 A3B
Like, clearly not 44%, but only making 11%?
Yeah but it also performs worse on average (overall)
Reasoning is sometimes pushing the limits and chasing diminishing returns. Still often leads to an improved performance…
and oddly i cannot delete conversations
I think if you looked at the raw o4-mini-high output it would be much of the same… outputting repetitive stuff in circles, doubting itself on simple things etc
But it kinda works. You just need to make sure it's fast. And helps to keep it presentable when it's summarized lmao
Is it the real thing? I think I've tried some website like it awhile back for "o1-pro"
It was certainly not pro
I have API now so easy to compare. They were routing traffic to like o3-mini I think
I'm actually unsure because it was thinking for a shorter amount of time than the normal o3 med on ChatGPT.
oh dear
what?
gensparks o3pro?
Yes.
def has a system prompt from genspark
mightve told it to not think long
if you can tell an agent to even do that
idk if u can
it's most likely fake
rip
i just want free o3 pro brih
oh well i still have it
but its weird method
that stupid method that i hate
use mechahitler 4
zenith svg
Is zenith not on webdev arena?
Why does he have a tube sticking out his..
where is zenith? usual arena?
I'm being picky and half-joking though, this is not bad at all considering how many other models do
I did this earlier with 2.5Pro for comparison, it's one of the best for svg if not nr1
ye im testing it rn
kraken-072125-1 sucks
Apparently there's also summit
ye its great but is it 100% openai?
Also maybe from openai
summit, Zenith and Lobster are all amazing
lots of models say they are chatgpt or made by openai
With a name like Zenith, it's probably GPT-5
just came from that tweet lol
Oh Zenith and Summit both mean the same thing, so maybe Summit is GPT-5 flagship
where to find this?
wondering same thing
what servers are you guys in with these bots bc i cant find any good servers for the life of me
Summit and Zenith seem to be based on the same architecture
SVG of a ps5 controller:
erm waht the sigma
claude
is it time to buy some oai stonks on polymarket
gemini 3 not coming till september at most
what we thinking
why can i not pick anything other than battle mode in webarena
Hi guys
grok 4 but i think gemini 2.5 pro or o3 pro is better right?
i thought u said that
earlier
i tested 3 of the new models Zenith, Summit and Lobster
where can i interact with the lobster
on webdev arena and its a lot better ill give my benchmark results
which is the best of the three
on my benchmark lobster got 81%, Summit 74%, Zenith 65%, Gemini 2.5 pro 61%, o4-mini 58%
that sounds great
for coding i think zenith was best but lobster is very good at other tasks
this was so painful to test cuz i had to get lucky on the lm arena and find the models
Zenith isnt added on web arena
I don't think lobster is that good compared to zenith
These names are confusing me
I got zenith like twice in lmarena
The probability is so low
it is on lmarena
Yea
summit and zenith are based on same architecture
it is very likely 3 versions of GPT-5 and all are insane
I've gone on lmarena for the first time and wow, there's this model named summit that's insane
Guys, I tried zenith. Its agi
okay been doing a lot of stuff in dev mode instead of here with some other guys smarter than me and
here's my summary
This blows 4o out of the water in general text, just asking it about things to do in a certain place
zenith = gpt-5. not sure what reasoning effort, but i am confident
summit = gpt-5 mini. very good at maths, sometimes better than zenith. generally worse everywhere else, but not by too much
both are strong, zenith is the first model that has kinda blown me away though
Guys I changed my mind, zenith is amazing
is zenith better than lobster?
It's able to understand context in a way I could never have imagined before
why are you here
Is there a way to try zenith without having to use the battle mode
work at openai
Lemme just ask my buddy that works there, ty
I'm trying out those new models. They're insanely smart, but they overdo it way too often. Maybe their reasoning is limited.
Email sam himself
summit > lobster > nectarine > starfish
Wheres Zenith
Haven't got zenith yet, is it on par with summit?
worse
I thought zenith was better than summit
Okkk, exciting
Still waiting for lobster, apparently it is the best
summit is insane
I get "Love this space" at the beginning of most summit and zenith responses. Weird
Do folks know how this chart was created (the data source)? This is from back in Mar, shortly after Nebula had appeared in lmarena.
from the user hemingbird on reddit
on r/singularity
I don't know if they are in this server though
do you know how they compiled the chart though?
Seems they did this: https://www.reddit.com/r/singularity/comments/1jizn0t/comment/mjkkybm/
No. I guess they used ChatGPT to write python code or something
i meant the actual data source
Yo what is this summit ai it keeps destroying the opponent on webdev.
who is the best in the search arena
Is there a way to request a particular model (Starfish, etc) in LMArena Battle? Or you just have to keep trying new arenas until you get it
just want to try starfish out haha
folsom-07152025-2 seems to be a thing too btw
Sry to say there isn't a way to direct chat or side-by-side with anonymous models. Understandable why that'd be nice though.
appears so
Summit didn't get the glove bridge problem
Too assumptive of a question I guess
don't use it via webdev arena if you want the best performance on general reasoning tasks lol
it has a long system prompt that will degrade performance, as will the scaffolding
Yeh fair
Did the o3 change or it always used to perform unit tests during thinking even when not asked?
We once had a discussion with @ocean vortex about whether the long system prompt reduces performance. We concluded the effect was none or negligible. Do you think otherwise?
gpt-5 is being AB tested as o3 on chatgpt but im not sure
i think it's kinda a given
If it's really GPT 5 we may have some hallucination issue. Code is awesome though.
It seems very weird. Fancy words but weird logic. Either it is worse than o3 or too smart for me to appreciate.
Lol probably just a sampling issue
I used it to write a story, the style is vastly different from the previous model I used.
Hmmm it uses special characters instead of "-" symbols some times. My interpreter broke.
left is zenith, right is Gemini 2.5 pro
i wonder if google replaced all character names in their pretraining data with aris thorne or something
both gemini and gemma loves to use that name
plausible!?
now its actually able to solve some problems which it couldn't
I am very confident it's being AB tested for at least some o3 requests for a subset of users
Hey folks anyone else having issues with previewing the code on webdev arena?
the block tab is blank and there is no link with a refresh button
no idea, but the release is 3 models: Coder (which seems decent), thinking (which seems very decent) and default. It also seems to me that the qwen models are tuned for academic tasks/problems rather than general chatting.
the default is losing in all categories
There is a time limit on LMArena, right? I tried a complex prompt and got an error (retried multiple times), but if i delete some parts of the prompt it works. Can anyone confirm?
New chatgpt5 models ??
Search Arena has to be cared for
Seems that way
Nah it's not a given actually. And believe it or not, in some cases models will perform better with pretty much ANY system prompt than with none at all. Seeing a system message helps somewhat undertrained (in post-training) models, as it reminds them of the fine-tuning structure. With default system prompts that's even more relevant, as they tend to resemble the ones in the fine-tuning datasets the model saw when learning how to interact and act as a chat completion model.
Just got nightride-on for the first time, and omg it's strong for my task based on knowledge
ok, but why? not public info? seen this on X.
yes
ok, sorry