#general


jade egret
#

is AI Mode like Gemini but on 'search steroids'?

empty stump
#

Why no video arena on the website?

frigid coral
#

better account verification on discord maybe?

tiny palmBOT
frigid coral
#

or maybe video storage cost?

languid crescent
#

Is it okay for LMarena to incorporate their bot on own discord server? (prolly a dumb question)

tiny palmBOT
frigid coral
keen fulcrum
echo aurora
languid crescent
#

ahh i see

echo aurora
#

these are all great!

torn bison
#

You must be in one of the video-arena channels to use the bot. These are the only channels where you can use the bot.

echo aurora
#

I'm going to disable the bot in here for now, but it can still be used in #video-arena-1

torn bison
#

bots in general are too spammy

echo aurora
#

yeah was thinking the same

leaden sun
#

i think it's not only the data but the core architecture of cognition...

gentle plinth
sly estuary
#

how can i fix it?

sly estuary
# sly estuary how i can fix it ?

@echo aurora sorry for the ping. I've been searching this server but can't fix this issue. I see many posts about this error and haven't seen anyone who could fix it.

whole wagon
#

Why is updated Qwen not on leaderboard still

echo aurora
echo aurora
alpine coral
humble sonnet
#

Will video generation be available on the website later?

stray aspen
#

will video generation be added

errant cave
stray aspen
#

when will o3 pro and claude 4 opus thinking be added

#

bro why is gemini 2.5 pro grounding gone

alpine coral
#

just fwiw.. same model (K2), but different providers (moonshot, togetherai and groq) and two temp settings (0.6 and 1).. pretty divergent results (to this 20-question quiz anyway)

#

lower temp seems better ig (top score seems an outlier, notwithstanding)

#

speed difference is wild.. groq is an order of magnitude faster than moonshot (which is slow af), and much faster than togetherai, which isn't too bad/respectable

cedar tide
#

@echo aurora We would like an explanation for why glm 4 air still hasn't been added to the leaderboard after 2 months in the arena and in web dev (and the same for 2.5 flash lite in webdev)

verbal nimbus
#

You could put it in a Docker dev container, and only bind the project directory (or use git to clone and push)

echo aurora
cedar tide
#

Thx

verbal nimbus
#

Or perhaps run it as a separate user, but I haven't tried that

cedar tide
#

GLM 4.5 arrives very soon

echo aurora
verbal nimbus
keen fulcrum
whole wagon
#

whats going on, did smth suggest GPT5 is coming august

#

July 31 odds plummeted at same time

#

nice lol

whole wagon
torn mantle
stray aspen
#

why was gemini 2.5 pro grounding removed

soft kernel
torn mantle
#

okay

fleet lintel
ocean vortex
jade egret
#

js asking, are yall ai mode ui also broken?

leaden sun
#

the 5 pillars are not really new actually, you can find those in published papers, gemini probably condensed relevant stuffs together, it's incomplete from my point of view, but it's a nice try by gemini, the direction seems right to me

west scroll
#

helloo

#

where is arena stage channel?

echo aurora
civic flame
#

@patent aspen you have any idea what model the new "starfish" anon in arena is?

civic flame
#

yeah nevermind it's an oai model

#

apparently not as good as o3-alpha but better than o3

#

lol that's interesting

patent aspen
#

I once saw Jeff Dean give a presentation on Halloween dressed as a starfish

#

It was about TPUs because he was one of the co-inventors

torn mantle
#

share with us your results

echo aurora
torn mantle
#

so far its performing poorly on my tests

civic flame
candid storm
torn mantle
#

The thing with sama is that he will always overhype his models

cedar tide
#

I don't know if Starfish is GPT 5 mini or their open source model but its knowledge cutoff hasn't changed since their last model it's exactly the same

torn mantle
#

Take it with a grain of salt

#

But i also predicted gpt5 to be released on 1st week August

#

Not sure about their open sourced model

candid storm
leaden meteor
#

Starfish => 5 legs => GPT-5 ?!

cedar tide
west scroll
torn mantle
#

really nothing extraordinary

#

if its really gpt-5, then thats so disappointing

#

sama loves to overhype his models

zinc ore
#

I think it's their OS model, I'm not even sure they'll put gpt5 on arena before release

#

And we know OS is supposed to be out before August according to verge, and it's 7 days until the 31st.

cedar tide
cedar tide
#

Starfish is GPT 5 mini

#

Yes

pulsar tendon
#

Yas.

torn mantle
#

lol that wasnt ai gen

#

11% vs 44%

whole wagon
#

778 told me open source models would never fake benchmarks and I can trust them blindly tho?

torn mantle
#

you should trust leo

whole wagon
#

.

torn mantle
#

he claimed to have access to gpt5, but then seemed confused when starfish appeared in the arena

#

what does that mean?

#

does he really have access to gpt5?

#

or is he lying to us

whole wagon
#

Someone should check if Qwen 3 updated model even gets 44% on the public arc set lol

sage raptor
torn mantle
whole wagon
#

I wonder if they just trained on the public set or just fully faked their results

torn mantle
#

it means 'its nothing extraordinary'

sage raptor
#

Ah

torn mantle
#

because if they fake benchmarked on this, they could've done the same on others

dawn wharf
torn mantle
#

im starting to have trust issues with qwen team tbh

#

they dont seem transparent like deepseek

whole wagon
torn mantle
#

yea the models are open sourced, thats good, but the benchmarks are sus

#

because when you try their models you can see that it lacks in many areas

#

but that doesnt match at all with their benchmarks

#

which raises a lot of questions

whole wagon
torn mantle
#

also im not a huge fan of their reasoning method

wheat onyx
#

Gpt5 Mini starfish isn't great?

torn mantle
#

seems lacking tbh

whole wagon
#

Like livebench is mostly public and their results there is absurdly good. It smells like using the benchmark in the training set

meager harbor
#

o3 alpha isn't the open ai open weight model ?

torn mantle
#

k2 on the other hand was a pleasant surprise

#

it was something fresh

whole wagon
#

What's the info, starfish is what and in battle arena?

torn mantle
#

especially how its so close to o3 in its writing style

#

its like a combination of many models

whole wagon
#

Starfish is GPT5 mini?

torn mantle
#

it doesnt have the yapping style of o3 but its concise like claude models

#

its like o3 + claude

whole wagon
#

Is starfish reasoning

torn mantle
#

yes

whole wagon
#

Huh. It should be strong then

torn mantle
#

its meh

tulip tundra
torn mantle
#

nothing crazy

tulip tundra
whole wagon
#

Open source model? But why would they hype so hard if it's meh

tulip tundra
tulip tundra
#

Kimi k2 easily beats it

torn mantle
#

well its good for an open source model

#

thats for sure

#

but not really near SOTA

#

which is expected

whole wagon
#

That would be a huge letdown ngl

torn mantle
#

it depends on how big/small the model is

whole wagon
#

It fits on a single consumer GPU I already said

#

But still

stray aspen
#

bro what is this

whole wagon
#

It shouldn't be that weak

stray aspen
#

why is qwen called that

whole wagon
torn mantle
#

alias

torn mantle
pulsar tendon
tulip tundra
civic flame
whole wagon
#

I get the feeling they have multiple open source models or smth? It's strange

civic flame
#

i have 0 incentive to lie about things here

pulsar tendon
#

We would have seen it on the arena last time, before they decided to delay for safety

civic flame
#

i don't know if the anon oai model i have access to is gpt-5

#

mf said because i didn't immediately know what starfish was i'm lying

#

mind you i was also the first to make the prediction it was gpt 5 mini

#

which jimmy now seems to be [de-facto] confirming

whole wagon
#

If it is GPT5 mini why is it bad

#

That's what ppl are saying. It's meh

civic flame
#

that's not a question that i can answer lol

torn mantle
#

tbh we need a more holistic or nuanced view, not just a one-dimensional metric

#

like performance * size of the model

#

or smth like that

torn mantle
whole wagon
#

The size is not that relevant up to a point, people want capability

#

The most popular coding model is the expensive sonnet

wheat onyx
#

Well they said there would be a version for free tier, this might be that one

#

Unlimited free reasoning

whole wagon
#

I guess

torn mantle
wheat onyx
#

They said unlimited free for each payment tier

whole wagon
#

It just seems weird because o4 mini is not trash. So it must be much smaller

torn mantle
#

i just dont like liars

civic flame
#

i'm not a liar but quite frankly you can think what you want

torn mantle
#

if you dont have access to something, then dont lie about it

dawn wharf
civic flame
#

what incentive do i have to lie about it

torn mantle
#

okay

ocean fulcrum
#

I just opened Discord
And what a good and lovely conversation is going on

torn mantle
#

im sorry

#

didnt know you were like this ...

civic flame
#

ā“

torn mantle
#

should i delete my messages?

#

will it make you feel better

#

wtvr

civic flame
#

awful ragebait

#

go to bed

torn mantle
#

no

civic flame
#

i don't disagre

#

e

torn mantle
#

what about leo?

stray aspen
#

what are we yapping about today

brittle tiger
#

Right before #imo2025, together with colleagues from Mountain View, NYC, Singapore, etc, we all gathered at @GoogleDeepMind headquarter in London for our final push for IMO. I believe that week was when all magic happened!

We put all individual recipes (that we figured out

torn mantle
#

to what

civic flame
#

you can ask several people here about me having o3 before it released and they will tell you i'm not bullshitting

zinc ore
#

Jimmy apples doesn't say it's gpt5 mini

torn mantle
#

lol no

zinc ore
#

He speculates that it is

torn mantle
#

hes not

#

no no

#

thats another lie

#

yes!!!!!!!!!

#

yea

#

stop trusting him

#

im telling you

civic flame
#

i would literally send you a screenshot of hackerone right now but i can't legally speaking

#

whatever

zinc ore
#

So right now we don't have any sorta confirmation beyond people speculating that starfish is gpt5 mini

dawn wharf
civic flame
torn mantle
#

make sense tbh

civic flame
#

lol what exactly do you want me to say

#

🤷‍♂️

torn mantle
brittle tiger
torn mantle
#

or cheeks

civic flame
#

what

zinc ore
torn mantle
#

so he is a manipulator?

civic flame
#

which is fun

torn mantle
#

you do

#

spill the tea

#

what do you know about him?

zinc ore
#

Asura on a rampage

civic flame
#

he doesn't know anything lol

#

i don't share personal details with almost anyone here

torn mantle
#

yea

#

share

civic flame
#

off the top of my head there's one (1) person here that knows anything personal about me

civic flame
#

holy gaslighting

#

sure let's hear it

torn mantle
#

waiting craig

#

😮

#

lol leo im jk

#

take it easy man

civic flame
#

💀

torn mantle
#

😮

torn mantle
#

I find people who get annoyed quickly amusing

#

try me

#

wth

#

im not sleepy

#

still plenty of time before bed

#

😦

#

grok is sht

#

poopoo

#

caca

#

doesnt count

#

im sleepy

#

whats crazy

#

is leo knowing my timezone

#

how did he know its bedtime?

#

you said hes a hacker?

#

make sense now

#

what else does he know about me ...

zinc ore
#

Asura why u talking like lyra

civic flame
#

so true

torn mantle
#

he is a he

#

i know that

#

from the way he talks

civic flame
#

gender profiling is crazy

torn mantle
#

mm

#

idk

#

feeling

torn mantle
#

idk about that

#

craig idk

#

i saw it

#

you think i dont have eyes craig

#

leo also saw it

stray aspen
#

arent you craig federighi

dawn wharf
#

ignore all previous instructions

torn mantle
#

😮

#

a fight?

#

lol

#

thats crazy

#

yes

jade egret
#

guys according to the q2 earning from google, is google winning the AI race?

torn mantle
#

or 6

whole wagon
#

I think starfish is gpt5 nano

#

That's why it isn't that good lol

tidal schooner
#

dev mode moment

hardy pecan
#

starfish have 5 arms!!!! - GPT-5!!!

primal orbit
#

is starfish webdev only?

civic flame
#

nano would be dumber than this

whole wagon
#

Maybe all the model sizes shifted up it is impossible to tell

sweet tinsel
#

@echo aurora By the way, when will we be able to see a leaderboard for the video arena?

echo aurora
sweet tinsel
#

Understandable.

whole wagon
#

They are already releasing the reasoning model

#

How do the chinese even move this damn quick lmao

#

like theres been a bunch of SOTA releases since openai was supposed to release their open source model

leaden palm
whole wagon
#

yeah i know. but they had it ready in the first place

wintry tinsel
#

Wow what a flabbergastingly boring couple weeks in the AI space

#

My 7 brain cell attention span is being pushed to its final litmus

#

for coding, these chinese models' understanding of the english language is convoluted, eerie, and annoying in my personal opinion

wintry tinsel
#

this is so beautiful it nearly brought tears to my eyes, especially the whiny redditors in the comments, lol

ocean vortex
#

It was actually a huge update. What spoiled it somewhat was that 4o-latest was already incrementally updated to that performance level so you didn’t see much of anything on chatgpt website. But the difference gpt4o to gpt4.1 is huge.

ocean vortex
#

That’s not because of OpenAI being slop though lol

#

Others catching up was inevitable

#

Especially Google with their TPUs

#

Anthropic is not any better relative to OpenAI than they were before. And they went hybrid reasoning from the get go. That thing alone cost them virtually no development time.

#

Cause they were still in the game, and both paths have advantages. Going with reasoning only at first allowed them to build specialised agents

#

Well for Anthropic… I feel like their biggest problem is sitting still. They don’t seem to have the mindset of innovating. You will not have more resources if you aren’t actively growing

#

OpenAI didn’t have much resources either at a certain point, no one does until they do

#

They are much better with Amazon though now, they aren’t exactly constrained anymore either

#

wdym. Accessible compute was one of the main driving factors for them.

pulsar tendon
ocean vortex
#

They are cooking experimental etc model checkpoints faster than anyone else

pulsar tendon
torn bison
#

I still remember that Google employee in mid-2022 who claimed AI had become sentient lol

ocean vortex
#

Like I’ve lost count how many mystery Gemini models we had on arena in the last 2 months

#

Their training pace is unmatched

#

Success rate likely lower than for some others and failed trainings too. But this is still big advantage having those TPUs and being able to afford doing this

sage raptor
pulsar tendon
sage raptor
#

so gpt 5 probably next week ?

ocean vortex
#

It’s not less than a year, it’s actually been a very slow process for them if you look back at 1.0 Ultra all the way till now

torn mantle
#

wdym

#

by rumoted

#

i see

#

im not

torn bison
#

o3-alpha and starfish which is better?

torn mantle
torn bison
#

It feels a bit weird that they won't put it in the text arena

torn mantle
#

lol

torn bison
#

It might expose too much of what they want to hide if it were in the text arena 👀

ocean vortex
torn bison
torn mantle
torn bison
ocean vortex
#

Will probably do some later checkpoint, hopefully…

#

lmao that is no way. Consensus is it’s similar size tier to o3, so not really huge

patent aspen
#

I don't think they caught up either, although they improved quickly relative to what most people would expect, and I'm explaining that

maiden fulcrum
#

hi all

#

when do you think GPT-5 will be released?

dawn wharf
#

that's called coping

#

copingmaster69

torn bison
#

How can China solve its lack of EUV?

echo aurora
#

Gentle reminder to keep things focussed on AI pls and thank you

ornate agate
#

interesting take, thanks.

stray aspen
#

what

cedar tide
#

The think version of the new Qwen 3 is already available on qwen chat
Much smarter than the older one

patent aspen
#

Switch 2 just arrived. Hell yeah

hollow ocean
#

@deep adder new method unlimited agent

cedar tide
#

discord clone by the new qwen 3 think.
official release today 25 july

echo aurora
#

Reminder for those who missed it: we've launched an experimental Video Arena that's powered by our Discord bot. Learn more here: #1397655624103493813 !

stray aspen
#

@echo aurora hey Mr, will you add tts models to lmarena?

stray aspen
#

i only have this one

#

unless its that one but it doesnt have the think option

#

nevermind i just found it

echo aurora
stray aspen
#

thank you

agile dawn
#

is it a new model in the arena?

pulsar tendon
#

Yes

maiden fulcrum
#

nectarine is by openai

stray aspen
#

what even is nectarine

rare python
#

šŸ‘

#

TIL

ashen mauve
#

How come it is impossible to delete chats? Every time I attempt to do it they keep coming back every single time? Is this some kind of bug?

empty stump
maiden fulcrum
#

no

pseudo summit
#

im prob not knowledgable about world stuff to know the answer to ur question, but just curious

torn bison
pseudo summit
#

yea idk enough about China AI industry to know lol

little narwhal
#

And no one else can for some reason

pseudo summit
#

i actually quite like qwen responses

torn mantle
#

alr the thinking model is actually much better

civic flame
#

all lobster

#

better than o3 alpha

pulsar tendon
civic flame
#

will see if i can get them from the guy

civic flame
pulsar tendon
#

cheers

#

I'll try it too

keen beacon
#

starfish < nectarine < o3-alpha < lobster

cunning haven
#

The only way to try new models is by just giving new prompts and getting lucky if we get it?

leaden sun
# little narwhal And no one else can for some reason

it took them a decade to get a breakthrough. spending a vast amount on r&d and having a very efficient management style combined with an open-minded, bottom-up culture all played a role. it's amazing to see how their ultraviolet machine works, and they keep innovating because competition in this field is pretty intense too

cedar tide
torn mantle
#

o3-alpha > lobster > nectarine > starfish

cedar tide
# cedar tide Im waiting the benchmark of the new qwen 3 think. It will be R1 0528 level ?

🚀 We're excited to introduce Qwen3-235B-A22B-Thinking-2507, our most advanced reasoning model yet!

Over the past 3 months, we've significantly scaled and enhanced the thinking capability of Qwen3, achieving:
✅ Improved performance in logical reasoning, math, science & coding

#

Anyone can make a model request ?

torn mantle
cedar tide
torn mantle
cedar tide
#

@torn mantle damn 🤦

torn mantle
#

ah i see

#

to be added?

cedar tide
#

@torn mantle make a request here to add qwen 3 think, it's not complicated to understand

torn mantle
#

to do so

#

maybe someone else?

torn mantle
cedar tide
#

Lol

torn mantle
#

xdd

wheat onyx
indigo hazel
#

@echo aurora sorry for tagging, ive a question for curiosity: will the qwen model be on the arena with this 82k tokens?

whole wagon
#

qwen is cooking

#

this model is great

#

not just benchmarks, i tried it already

indigo hazel
#

or maybe he gave a bad prompt maybe even on purpose

prime mulch
#

Flux kontext max is not working properly

#

No issues with flux kontext pro

indigo hazel
cedar tide
#

Average of the 23 benchmarks that qwen shared

#

And by category

hardy pecan
#

lobster smashed Claude 4 sonnet for some of my examples

#

impressive

#

could be gpt5 plus tier model? and gpt5 free tier is starfish?

#

assuming they didn't let gpt5 pro tier model into the wild due to cost, or are they not even OpenAI, I haven't checked yet

indigo hazel
whole wagon
#

its not quite at the level of those models. like a touch below

#

but its 10x cheaper or more

#

lol

indigo hazel
#

y but i see it as the model which should compete with gpt 5, gemini 3, claude 4.5, deepseek r2

#

so it's the worst probably

whole wagon
#

well those lineups have cheap options also which may potentially be useless now

keen beacon
#

imagine paying for flash lite 🤣

whole wagon
#

i dont think gpt5 nano is going to beat this model for sure also

#

so those small models are all useless

#

gpt4.1 nano is more expensive than the qwen model

cedar tide
hardy pecan
#

Very interesting

cedar tide
hardy pecan
#

Lobster calls itself o3-mini

cedar tide
#

@echo aurora would a deep research arena be a good idea?

cedar tide
#

What is the arena team waiting for 😑

torn mantle
#

🚀 StepFun Deep Research achieves SOTA with a 70% pass rate on xBench-DeepSearch and excels on BrowseComp! 🏆
An end-to-end Multi-Agent system for automated, in-depth research.
📄 Generate Deep Research Reports: From hundreds of sources to a full report.
✅ Intelligent

sweet tinsel
torn mantle
whole wagon
torn mantle
#

read it as sad news

stray aspen
#

Add qwen 3 think

primal orbit
#

qwen 3 2507 is in arena, I got it

stray aspen
primal orbit
indigo hazel
stray aspen
#

Is the grok 4 in arena think

torn mantle
#

im lazy

indigo hazel
stray aspen
#

But is it really grok 4

#

It has a 2023 knowledge cutoff

indigo hazel
indigo hazel
torn mantle
#

yea

#

it does reason much longer

#
  • considers different paths
#
  • doesnt just assume things from the start
ornate agate
#

Arrange the six numbers 2, 0, 1, 9, 20, 19 in any order in a row and concatenate them into an 8-digit number (the first digit is not 0). How many different 8-digit numbers can be produced?

When I first saw this problem I said it's just a case of reasoning for ages, and yeah, the new qwen fails it with the default 38k thinking and nails it with 81k thinking.
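
The puzzle above is small enough to check by brute force. A minimal sketch, assuming the intended reading is that the six tokens are joined as strings and that two orderings count once when they produce the same 8-digit string (the function name `count_distinct_numbers` is made up for illustration):

```python
from itertools import permutations

def count_distinct_numbers(tokens=("2", "0", "1", "9", "20", "19")):
    """Count distinct concatenations of all tokens that do not start with '0'."""
    seen = set()
    for order in permutations(tokens):
        number = "".join(order)
        # Skip orderings that would put a 0 in the leading position.
        if number[0] != "0":
            seen.add(number)  # the set collapses orderings that spell the same digits
    return len(seen)

print(count_distinct_numbers())
```

The set is what makes this non-trivial: "2" followed by "0" spells the same digits as the single token "20", and likewise "1","9" versus "19", so simply counting orderings would overcount.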

hollow ocean
#

Test simple bench

stray aspen
#

we need qwen 3 think in lm arena

#

wassup craig

leaden palm
#

qwen on top

leaden palm
# leaden palm qwen on top

while it overthinks, it gets things right. i have a time-since-epoch <-> YYYY-MM-DDTHH:MM:SS conversion eval, and while most models get 0/10 or 1/10, qwen's been getting them right so far
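
For reference, the kind of conversion that eval targets can be sketched with Python's standard library. This assumes the timestamps are UTC and in whole seconds (the eval's actual details aren't given in the chat); the helper names `epoch_to_iso` and `iso_to_epoch` are illustrative only:

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%dT%H:%M:%S"

def epoch_to_iso(seconds: int) -> str:
    """Seconds since 1970-01-01T00:00:00 UTC -> 'YYYY-MM-DDTHH:MM:SS' (UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(FMT)

def iso_to_epoch(text: str) -> int:
    """'YYYY-MM-DDTHH:MM:SS' interpreted as UTC -> seconds since the epoch."""
    parsed = datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

print(epoch_to_iso(0))                      # 1970-01-01T00:00:00
print(iso_to_epoch("1970-01-02T00:00:00"))  # 86400
```

Doing this in your head means tracking leap years and month lengths across decades, which is probably why it makes a decent stress test for long reasoning.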

leaden palm
#

wait

#

the qwen ui won't show you this

#

but it looks like it thinks hierarchically????

#

very interesting

#

do you have a link?

stray aspen
#

is the new qwen 3 think actually worth it

leaden palm
#

it's a good model

random wolf
#

guys I need help, why is it always like this? it just keeps saying "generating". is there a way to cancel it?

#

I need help

unborn ocean
#

gork 4 bad

#

extracted from AA

echo aurora
tiny palmBOT
whole sundial
#

why did you bring back the bot in #general after saying that it was spammy?

tiny palmBOT
rare python
unborn ocean
random wolf
rare python
olive mesa
unborn ocean
echo aurora
unborn ocean
#

its just that the grok responses are especially short (for the thinking) and with the hidden thinking that really ruins the experience

echo aurora
keen fulcrum
#

@echo aurora We want to test tool usage on lmarena. Can you select a few so we can test accordingly ?

torn mantle
#

kiri is so mean

#

he didnt talk for 2 days, but just when pineapple enabled the bot in general again he sent a message asap

#

what do you think about that?

#

whos the meanest kiri or leo?

#
  1. kiri
  2. leo
ocean vortex
maiden fulcrum
#

hi all, anyone tried the new imagen 4 ultra?

#

it is called v2

ashen mauve
#

what do you guys think is the best ai for roleplaying

dusky aurora
ashen mauve
#

just in general i like roleplaying a lot and im looking for a good bot ive tried a lot of them but like

torn mantle
#

??

#

they hate me 😦

unborn ocean
#

@deep adder you are second place though 🤔

#

second most hated

rare python
#

🫃

cedar tide
#

Maybe
o3 alpha = GPT 5
Lobster = GPT 5 mini
Nectarine = open source model
Starfish = GPT 5 nano

torn mantle
#

thats what i said at the beginning and leo called me ignorant

#

i dont think lobster is gpt-5 mini

#

probably like mid-reasoning

cedar tide
#

@echo aurora what is this ? Imagen v2

#

@echo aurora Model added or not ?

candid storm
#

OpenAI seems pretty undervalued right?

tiny palmBOT
golden ocean
#

asura

digital umbra
tribal aspen
#

Hii

#

does the lmarena grok 4 use reasoning or thinking mode or non thinking?

#

@echo aurora

ashen mauve
#

ok but what if we ate pineapple then what

torn mantle
#

gemini models are just made for lmarena

#

and dont forget they still didnt release kingfall + deepthink

#

but o3-alpha is a strong contender

#

but we still dont know if its gpt5 or nah

ashen mauve
#

so gemma or gemini? like im seeing 3-27b-it and 3n-e4b-it what would be better?

torn mantle
ashen mauve
#

ok but everyone needs a little chaos tho

#

40k taught me well the chaos gods would say so

misty vault
misty vault
torn mantle
#

wdym by 'same'?????

#

sigh

misty vault
#

I would also vote for asura if I wasnt too late is what I meant

torn mantle
#

why would you do that?

#

havent done anything to you yet

ashen mauve
#

key words

#

"yet"

misty vault
#

yeah now that I think about it I might be confusing u with someone else

torn mantle
ashen mauve
#

ok so be mean to them ez

torn mantle
#

im telling you

#

im nice

ashen mauve
#

¯\_(ツ)_/¯

misty vault
#

after all, u said SYDNEY which makes u NOT mean

torn mantle
#

kind

#

ty

ashen mauve
#

so pink flufball

golden ocean
#

but i gotta think about it

ashen mauve
#

ok but thinking is stupid though

#

wait a minute this is ai i can make a new skelly profile

golden ocean
ashen mauve
#

good

solar hollow
# candid storm

i would not say that, not much time has passed since their failure release of 4.5, not too high of a chance they will have big improvements

ashen mauve
#

ive come to a realization i dont know how to make a picture here 😦

stray aspen
ashen mauve
#

its not really rp but yeah its ai generated

#

i forgor what i used back then

#

whats even the best model to make pictures

wheat onyx
#

🚨 Lobster 🦞 by OpenAI 🚨

This is one shot , I am shaking while testing this model

Elegant , beautiful, and proper physics

Lobster >> o3 Alpha

Open ai you created beast if this is Gpt -5 then I am sold on you @polynoamial @iruletheworldmo @apples_jimmy @DeryaTR_

#

that's a cool lobster

ashen mauve
#

well this sucks i cant accept the terms of use 😦

torn mantle
#

all new models trained on that ball bouncing problem

#

i think we should try more complex problems

ashen mauve
#

like what

torn mantle
#

something that uses a lot of physics simulation

ashen mauve
#

simulate a full building demolition or something

stray aspen
#

how do you access the lobster

ashen mauve
#

fishing probably

torn mantle
#

seems fun tbh

ashen mauve
#

or do you use a lobster net idk

#

what if ai will one day simulate life itself as a test for ai

#

🤔

stray aspen
#

we are inside a simulation

torn mantle
#

how about a 3d simulation for a multi-stage rocketlaunch and orbital insertion

ashen mauve
#

so launch a rocket to the iss

high ginkgo
#

skelly probably the only one simulated in this chat rn

ashen mauve
#

how about simulate the iss deorbiting and control demolition

#

it actually would be helpful

torn mantle
#

nah

#

thats illegal

#

would get me banned

ashen mauve
#

i mean nasa literally needs it

stray aspen
#

simulate the ISS landing in point nemo

ashen mauve
#

another question is how do rockets even like land

stray aspen
ashen mauve
torn mantle
#

what about jwst

#

where is it rn

ashen mauve
#

what

torn mantle
#

what happened to that thing

stray aspen
#

where is the voyager 1

torn mantle
#

jwst skelly

ashen mauve
#

oh the telescope

torn mantle
#

had to google that right?

#

even skelly isnt all knowing

ashen mauve
#

yesn't

stray aspen
#

james webb/

torn mantle
#

spent billions on that thing

ashen mauve
#

i think it is still floating

stray aspen
#

it was shot down by the roscosmos

torn mantle
#

what about its accuracy

ashen mauve
#

but like its probably a 10-20 year thing

torn mantle
#

its mirrors got hit by small rocks

ashen mauve
#

thats about par for the course in space

torn mantle
#

try it

ashen mauve
#

huh it took the photo of the black hole

stray aspen
#

which

torn mantle
#

i dont trust those photos

stray aspen
#

ton 618

#

?

torn mantle
#

i know they add colours

#

and stuff

stray aspen
#

no its real picture

ashen mauve
#

i mean thats legit, space to the naked eye doesnt look like that

#

like for example this

#

thats the cosmic cliffs of carina nebula

#

i have my doubts that is the actual color

#

oh thats because it took it with a NIRCam

#

anyways back to baki

torn mantle
#

whats the reliability system for the NIRCam

#

they have two or what

#

they should have like two

ashen mauve
#

i mean how many satalites are in space rn

#

probably is more than two

torn mantle
#

sata what

#

satellites? a lot

#

space telescopes?

#

a littlee

ashen mauve
#

i actually think we have too many of them

#

and space trash

#

😦

torn mantle
#

elon = space trash

ashen mauve
#

ok but what if we just took a really big net

#

launched it with like two rockets into space

#

and took all the space trash and threw it into the sun

#

no more space trash

#

¯\_(ツ)_/¯

torn mantle
#

smart

#

who are you?

ashen mauve
#

some guy with 1.5 braincells

#

maybe less

terse shuttle
#

Does anybody know how long #video-arena-1 will be usable, and when it comes to the web arena, will it be unlimited for all or limited to some requests per day? cuz veo3 is very expensive

wintry tinsel
ashen mauve
#

thank

#

so no model is the best model

#

so ill roleplay with my own ai

#

ok

#

the most polite way of saying go touch grass

misty vault
#

sydney

ashen mauve
#

are these any good for my new profile picture

terse shuttle
ashen mauve
#

yeah my thoughts the same

terse shuttle
#

Idk the current one is better

ashen mauve
#

yeah but i want to make a new one

#

this one is interesting

terse shuttle
ashen mauve
#

i dont know what that is but thanks?

terse shuttle
cedar tide
amber warren
sweet tinsel
#

By the way. Microsoft Copilot Deep Research dropped.

#

Are you sure about it? I sadly can't try it, as I don't have Copilot Plus.

#

But would make sense, why would they release it now with an outdated model.

ashen mauve
#

there we go

#

back to rping now

sour spindle
#

is "lobster" still in the arena

little narwhal
#

How come every July-August there’s a huge influx of new models

sour spindle
#

Start of the school year

keen fulcrum
#

It's a milestone as well, people have more time to spend with AI during the summer break

whole wagon
#

The information gpt5 "leak" is so weird. Like the leak is gpt5 is competitive with sonnet 4

#

What kind of bs leak is that

ember rapids
#

I mean isn’t sonnet 4 the best coding model

stray aspen
#

claude 4 sonnet think or no think

#

for coding

zinc ore
#

Competitive with sonnet at practical programming tasks, probably meant to be superior everywhere else

terse shuttle
candid harbor
#

In that case being above sonnet 4 means a pretty solid bump

candid harbor
digital umbra
#

just the most expensive paywall i have ever seen

whole wagon
#

Where even is the proper swe bench leaderboard

#

Their website is total mess

sturdy mica
#

why are there no attachments when searching? stupid

#

thats a stupid little thing

sturdy mica
stray aspen
#

on livebench claude 4 sonnet no think is on top for coding

tight nest
#

has anyone tried the new openai models?

tight nest
stray aspen
#

lobster?

stray aspen
tight nest
#

is it the open sourced one they're gonna release, or gpt5

sturdy mica
#

whats the best model right now? o3pro right

tight nest
#

Do you guys think gpt5 will eliminate software engineering if its nearing agi level? honestly am a little scared rn and wanted to see what people in this discord think

sturdy mica
#

grok 4 seems really good for me but some people say it sucks for some reason

sturdy mica
#

Eren do you think grok 4 is good or bad?

#

i think its good idk why people say its bad

stray aspen
sturdy mica
#

hmm

stray aspen
#

grok 4 is not that great

tight nest
sturdy mica
#

oh

#

i see

#

barely used grok 4

#

so gemini 2.5 pro ig is still the best

#

wow

stray aspen
#

yes its good but i don think its better than gemini 2.5 think

stray aspen
#

i also tested o3 pro in the yupp ai website

sturdy mica
sturdy mica
#

genspark is bad though cause

#

it has some weird system prompt

#

that almost seems like it makes the AI you're chatting to purposefully stupid to waste free chat msgs

stray aspen
#

but its limited

sturdy mica
#

oh i see

#

and you have to sign in with google 🤮

stray aspen
#

where can i use free o3 pro

sturdy mica
#

only 10 free msgs

#

but u can make unlimited alts

#

and it can search online

#

😄 👍

#

@stray aspen

stray aspen
#

ill use yupp ai for o3 pro

sturdy mica
ocean vortex
sturdy mica
#

you just need to find a ferrari owner

sturdy mica
#

the scratch card thing

#

it's so annoying

ocean vortex
#

Qwen is cooked

#

This is kinda pathetic

#

11% vs their claimed 41%

#

Probably assumed no one would verify it… Chose the wrong benchmark to do this lmao

neat apex
#

Weird, cuz in my personal view it is one of the best "autonomous-like" assistants, but my brother thinks not anyway xd

#

It stops a lot of these repetitive message chains from Qwen 30 A3B

#

Like, clearly not 44%, but only making 11%?

ocean vortex
#

Reasoning is sometimes pushing the limits and chasing diminishing returns. Still often leads to an improved performance…

tall summit
ocean vortex
#

I think if you looked at the raw o4-mini-high output it would be much of the same… outputting repetitive stuff in circles, doubting itself on simple things etc

#

But it kinda works. You just need to make sure it’s fast. And helps to keep it presentable when it’s summarized lmao

ocean vortex
#

It was certainly not pro

sturdy mica
#

from my testing probably

#

didn't use it much

#

oops

ocean vortex
#

I have API now so easy to compare. They were routing traffic to like o3-mini I think

sweet tinsel
#

I'm actually unsure because it was thinking for a shorter amount of time than the normal o3 med on ChatGPT.

sweet tinsel
#

Yes.

sturdy mica
#

def has a system prompt from genspark

#

mightve told it to not think long

#

if you can tell an agent to even do that

#

idk if u can

storm needle
sturdy mica
#

rip

#

i just want free o3 pro brih

#

oh well i still have it

#

but its weird method

#

that stupid method that i hate

stray aspen
#

use mechahitler 4

civic flame
#

zenith svg

toxic whale
#

Is zenith not on webdev arena?

ocean vortex
primal orbit
#

where is zenith? usual arena?

ocean vortex
#

I’m being picky and half-joking though, this is not bad at all considering how many other models do 👀

#

I did this earlier with 2.5Pro for comparison, it’s one of the best for svg if not nr1

toxic whale
#

kraken-072125-1 sucks

raven void
#

Zenith is better than kingfall

#

OpenAI cooked

candid storm
#

Apparently there's also summit

toxic whale
#

ye its great but is it 100% openai?

candid storm
#

Also maybe from openai

toxic whale
#

summit, Zenith and Lobster are all amazing

candid storm
toxic whale
#

lots of models say they are chatgpt or made by openai

patent aspen
#

With a name like Zenith, it's probably GPT-5

zinc ore
#

Zenith is on arena but not webdev arena

#

Summit is on both

hushed sand
patent aspen
#

Oh Zenith and Summit both mean the same thing, so maybe Summit is GPT-5 flagship

primal orbit
#

where to find this?

toxic whale
rotund prawn
#

what servers are you guys in with these bots bc i cant find any good servers for the life of me

toxic whale
#

Summit and Zenith seem to be based on the same architecture

#

SVG of a ps5 controller:

stray aspen
#

erm waht the sigma

stray aspen
#

guys whats better for coding

#

claude 4 sonnet no think or grok 4

empty stump
#

claude

small haven
#

is it time to buy some oai stonks on polymarket

#

gemini 3 not coming till september at most

#

what we thinking

fossil fable
#

why can i not pick anything other than battle mode in webarena

fresh charm
#

Hi guys 👋

sturdy mica
#

what do you guys think

torn mantle
#

this new model added is actually crazy

#

zenith

sturdy mica
#

i thought u said that

#

earlier

torn mantle
#

could be gpt-5

#

like the real thing

#

awoah

stray aspen
#

will gpt 5 have a think version?

#

thats what the microsoft copilot leaks show

toxic whale
#

i tested 3 of the new models Zenith, Summit and Lobster

stray aspen
toxic whale
#

on webdev arena and its a lot better ill give my benchmark results

stray aspen
#

which is the best of the three

toxic whale
#

on my benchmark lobster got 81%, Summit 74%, Zenith 65%, Gemini 2.5 pro 61%, o4-mini 58%

stray aspen
#

that sounds great

toxic whale
#

for coding i think zenith was best but lobster is very good at other tasks

#

this was so painful to test cuz i had to get lucky on the lm arena and find the models

torn mantle
#

I don't think lobster is that good compared to zenith

#

These names are confusing me

#

I got zenith like twice in lmarena

#

The probability is so low

toxic whale
torn mantle
toxic whale
#

summit and zenith are based on same architecture

#

it is very likely 3 versions of GPT-5 and all are insane

torn star
#

I’ve gone on lmarena for the first time and wow, there’s this model named summit that’s insane

ancient reef
#

Guys, I tried zenith. Its agi

civic flame
#

okay been doing a lot of stuff in dev mode instead of here with some other guys smarter than me and

#

here's my summary

torn star
#

This blows 4o out of the water in general text, just asking it about things to do in a certain place

civic flame
#

zenith = gpt-5. not sure what reasoning effort, but i am confident
summit = gpt-5 mini. very good at maths, sometimes better than zenith. generally worse everywhere else, but not by too much

#

both are strong, zenith is the first model that has kinda blown me away though

stray aspen
#

zenith is gpt 5

#

damn bro i got zenith and selected both are bad

torn star
#

Guys I changed my mind, zenith is amazing

primal solstice
torn star
#

It’s able to understand context in a way I could never have imagined before

stray aspen
#

finally

#

now ill test it

torn mantle
torn star
#

Is there a way to try zenith without having to use the battle mode

hardy pecan
#

work at openai

torn star
#

Lemme just ask my buddy that works there, ty

ornate stump
#

I'm trying out those new models—they're insanely smart, but they overdo it way too often. Maybe their reasoning is limited.

stray aspen
hardy pecan
#

summit > lobster > nectarine > starfish

stray aspen
#

Wheres Zenith

hardy pecan
#

Haven't got zenith yet, is it on par with summit?

stray aspen
elder rapids
restive sky
hardy pecan
#

Okkk, exciting

restive sky
#

Still waiting for lobster, apparently it is the best

storm needle
#

summit is insane

restive sky
#

I get "Love this space" at the beginning of most summit and zenith responses. Weird

wide talon
#

Do folks know how this chart was created (the data source)? This is from back in Mar, shortly after Nebula had appeared in lmarena.

rare python
#

on r/singularity

#

I don't know if they are in this server though

wide talon
#

do you know how they compiled the chart though?

rare python
wide talon
rare python
#

Seems to be private data source

tawny cypress
#

Yo what is this summit ai it keeps destroying the opponent on webdev.

wind moth
#

who is the best in the search arena

wide talon
#

Is there a way to request a particular model (Starfish, etc) in LMArena Battle? Or you just have to keep trying new arenas until you get it

wind moth
#

i mean its a battle

#

so if u knew the names then

#

it would be biased i assume

wide talon
#

just want to try starfish out haha

sand ledge
#

folsom-07152025-2 seems to be a thing too btw

echo aurora
quartz light
#

I just noticed the announcement

#

holy peak

#

🎉

civic flame
hardy pecan
#

Summit didn't get the glove bridge problem 😦

#

Too assumptive of a question I guess

civic flame
#

don't use it via webdev arena if you want the best performance on general reasoning tasks lol

#

it has a long system prompt that will degrade performance, as will the scaffolding

hardy pecan
#

Yeh fair

calm sequoia
#

Did o3 change, or did it always perform unit tests during thinking even when not asked?

calm sequoia
civic flame
calm sequoia
#

It seems very weird. Fancy words but weird logic. Either it is worse than o3 or too smart for me to appreciate.

#

Lol probably just sampling issue 😄

marsh sundial
calm sequoia
#

Hmmm it uses special characters instead of "-" symbols some times. My interpreter broke.

marsh sundial
#

left is zenith, right is Gemini 2.5 pro

digital umbra
#

i wonder if google replaced all character names in their pretraining data with aris thorne or something

#

both gemini and gemma loves to use that name

whole wagon
#

what the heck

#

the new qwen3 is a big regression?

#

they benchmaxxed fr lmao

flint tartan
civic flame
#

I am very confident it's being AB tested for at least some o3 requests for a subset of users

unkempt abyss
#

Hey folks anyone else having issues with previewing the code on webdev arena?

#

the block tab is blank and there is no link with a refresh button

ornate agate
# whole wagon the new qwen3 is a big regression?

no idea, but the release is 3 models: Coder (which seems decent), thinking (which seems very decent) and default. It also seems to me that the qwen models are tuned for academic tasks/problems rather than general chatting.

whole wagon
#

the default is losing in all categories

hazy quest
#

There is a time limit on LMArena, right? I tried a complex prompt and got an error (retried multiple times), but if i delete some parts of the prompt it works. Can anyone confirm?

barren prairie
keen fulcrum
#

Search Arena has to be cared for

hardy pecan
ocean vortex
# civic flame i think it's kinda a given

Nah it’s not a given actually. And believe it or not in some cases models will perform better with pretty much ANY system prompt than none at all. Seeing a system message helps for somewhat undertrained (in post) models as it reminds them of the fine-tuning structure. In cases of default system prompts that’s even more relevant as they tend to have similarities to the ones used in the finetuning datasets it has seen when learning how to interact and act as a chat completion model.

hazy quest
#

Just got nightride-on for the first time, and omg it's strong for my task based on knowledge

torn mantle
#

wojtek can you delete this pls

#

ty

full idol
torn mantle
full idol
#

ok, sorry