#general

keen beacon
#

I need them to be serious

blazing bison
#

I feel that gemini is being nerfed

#

But maybe it's just me

versed totem
keen fulcrum
#

@echo aurora Can you make the bot show us the result of the votes?

echo aurora
stray aspen
stray aspen
tight silo
#

think it's a sign of a new model?

stray aspen
#

bro

#

why is romlox word banned here

digital umbra
#

people spammed it in the video arena

stray aspen
grizzled bobcat
#

is it unlimited time?

#

My question

wheat onyx
unborn ocean
#

sharing this for the 3rd time because it seems relevant

#

very few knew zhipu

digital umbra
#

zhipu have used a few different names iirc

gentle plinth
#

With embed

wintry locust
ocean vortex
#

LOL

wintry locust
#

you're missing minimax and baichuan and iflytek

ocean vortex
void tusk
#

is GLM-4.5 an improved version of gpt4.5?

wintry locust
#

i guess lg is korean

#

ok fair enough

ocean vortex
#

Yeah but they are also irrelevant

digital umbra
gentle plinth
wintry locust
#

AI Sweden

void tusk
#

thanks

stray aspen
#

why do all the chinese models have a similar reasoning process

grizzled bobcat
#

Is it free?

stray aspen
#

yes

grizzled bobcat
#

I see it

daring rover
#

is glm openai?

#

oh

#

it's a random company

keen beacon
daring rover
#

i thought it was openai's OS model for a split sec

keen beacon
#

so soon we'll have some new models

digital umbra
#

i'm just really curious to see what the open source model will be called

keen beacon
#

lol

#

open source

#

they will probably continue using some weird names

#

perhaps

#

Or maybe it's 3.5 finally open-sourced. I'd laugh

digital umbra
#

if they include GPT or "o" in the name it's going to be so confusing lol

#

i'm guessing it's going to be a new model architecture rather than based on any of their proprietary models

keen beacon
digital umbra
#

well, considering they delayed it when kimi k2 released...

#

i guess it will have a few hundred billion parameters, or a dense model equivalent

#

if they release a 50b moe it's going to be so underwhelming lol

keen beacon
#

Gemma 3 27b is really good for its size, for example

#

a bit aged already

reef pawn
#

Gemini 3 pro when?

blazing bison
#

I bet August

reef pawn
#

Same month as GPT 5, right?

blazing bison
#

I think gpt 5 will be released this week

reef pawn
#

Oh okay

#

Can't wait

warm fulcrum
#

how are some people using gpt-5 before it's released

blazing bison
#

They aren't

reef pawn
digital umbra
#

people are speculating that openai is rushing their model out before the eu ai act goes into effect; if that were the case, i would think google would also be rushing something

warm fulcrum
blazing bison
#

Actually there are people who get access weeks before release, but these people generally don't talk about it

blazing bison
#

I think it is but...

warm fulcrum
#

well ye how are people able to use that model?

blazing bison
reef pawn
blazing bison
#

The rank is not public yet

reef pawn
#

Oh okay

warm fulcrum
#

how does lmarena even have ahold of these models?

keen beacon
reef pawn
#

Labs give them early access for testing

digital umbra
#

they get a lot of useful feedback for putting models here

blazing bison
warm fulcrum
#

wowie

#

let's hope openai's new model lives up to the hype

blazing bison
#

It's good

#

But it's not agi good

reef pawn
#

AGI is buzzword

blazing bison
#

It's like a 25% improvement over o3

reef pawn
#

Nice

blazing bison
#

And 25% is a lot

keen beacon
#

with no hallucinations and consistent answers

digital umbra
#

zenith will probably be a great model, let's hope it won't be too expensive or behind a router that gives you garbage most of the time

keen beacon
#

and learns from mistakes independently

keen talon
#

can someone tell me what are the limits for claude 4 opus?

blazing bison
#

Because it was already like this on the arena

reef pawn
warm fulcrum
#

why is gemini 2.5 pro rated #1 on all tasks

#

there's no way it actually is that good

blazing bison
keen beacon
reef pawn
#

Gemini is my fav model

keen beacon
#

in many areas

warm fulcrum
#

every time i ask it to code it just blabs a lot

#

it adds more comments than code

reef pawn
keen beacon
blazing bison
keen beacon
#

And of course in Finnish

warm fulcrum
#

the actual gemini website doesn't even want to code

#

it just says it isn't capable

keen beacon
#

Lol uralic languages are probably in the 0.005 percent of votes/prompts

reef pawn
blazing bison
warm fulcrum
blazing bison
reef pawn
keen beacon
blazing bison
#

If gpt 5 is not a big leap, I'm sad

#

The bubble will burst

torn mantle
#

grok 4

surreal creek
#

Gemini in first place 30 points ahead of o3 on the coding leaderboard lol

stray aspen
#

which claude

keen fulcrum
#

I think Arena should reconsider the evaluation process and include pregenerated results for prompts

#

That way a prompt can be evaluated by multiple users

meager harbor
unborn ocean
#

you know chinese labs are afraid of repercussions if prompting for "a taipei vacation" is already considered an inappropriate topic

meager harbor
#

so any gpt 5 whispers ?

sonic tendon
#

glm 4.5 is surprisingly good

echo aurora
#

We're aware of issues related to non-text models struggling at the moment.

echo aurora
whole sundial
#

guys I think they might have distilled glm 4.5 off of gemini, I just had a response start with "Of course!"

quiet moss
#

just because it said "Of course" means it's trained off of Gemini?

whole sundial
#

ok then tell me another model that starts with "Of course!" all the time

#

seems to happen when reasoning is off

wintry tinsel
#

And for writing too, Claude is always the bomb

whole sundial
#

yeah it starts with "Of course!" just like Gemini

#

at least with reasoning off

#

must have post-trained it off of gemini conversations, at least partially

#

but this shouldn't be a surprise, Chinese companies distill off of US models all the time

stray aspen
#

glm 4.5 no think is gemini

whole sundial
#

I feel like the "Of course!" is a watermark put in by Google

#

I'm not saying it is Gemini, I was just saying that they distilled Gemini into the model

#

and it gives long responses, while kimi gets straight to the point

#

that might be better for some people though, but this means glm 4.5 is going to have more slop

stray aspen
#

i love the glm UI

whole sundial
#

thinking glm 4.5 does not have the "Of course!" stuff, i think it only does that for non-thinking due to likely gemini distillation. As they can't distill their reasoning traces anymore, it won't do it in reasoning mode because it's distilled off of a different model

leaden palm
#

and a short default max tokens

whole sundial
#

it identifies itself as being by Zhipu like it should, but the "Of course!" threw me off a bit

blazing bison
#

Even if it has a little gemini data, it's not a problem if the model is good

#

But for me it's no good

sturdy mica
#

what's wrong with that website

#

☹️

whole sundial
#

it seems to be fine without thinking, maybe it messes up with thinking?

#

or when multiple people are using it at the same time?

sturdy mica
torn star
#
Poll: GPT 5 when?
Winning answer (#2): Next Thursday (aug 8), 20 of 32 votes

leaden palm
#

what ai mode suggestions do you guys have

#

#1 doesn't really make sense to me and #3 isn't really relevant but #2 is definitely personalized

gusty night
#

Hello !!

harsh flume
#

What do you guys think is capping AIs from performing well on frontier math benchmarks?

#

it doesn't seem like it would be an insurmountable problem when you take into account the existence of models like AlphaFold

leaden sun
# harsh flume What do you guys think it's capping AIs from performing well in frontier math be...

in short, it's a multifaceted problem, ranging from what "understanding" even truly means for a machine, to the problem of translation between formal logic and natural language, to the fact that most, if not all, traditionally trained mathematicians work more with intuition than with pure information retrieval. Connecting the dots often happens subconsciously and then surfaces into conscious understanding, leading to the Eureka moment. As far as i know, the current ai architecture is still too limiting?

#

in case you're interested, one of the current frontiers of ai research is the connection between consciousness and high intelligence. it's still an open problem, but a very fascinating one compared to those hopeless millennium prize problems...

novel crater
#

what is the fastest model on lmarena?

harsh flume
#

I understand when it comes to tier 4, but AlphaGo in 2017 kinda solved the dilemma of navigating a giant state space (10^170). I am kinda dumb, but it feels like problems in tiers 1-3 of FrontierMath would be a lot easier and have a much smaller search space than that, since they are all solvable.

It seems like they are only testing LLMs tho, which explains the low scores, although i'd assume that LLMs could use math-driven tools like AlphaProof, where the LLM layer would translate a problem into pure math and call the solver

leaden sun
#

i think proof assistants are already being integrated into the architecture to make it more deterministic, the thing is, those theorem provers are not complete and still an area of active research

#

they're only testing LLMs? so they've figured out another way already? don't tell me it's an artificially grown organic hybrid brain hahah

wicked root
#

Is there a new model that's being tested right now in LMArena?

#

Word on the street is GPT5 is being tested rn

drifting thorn
#

nah it's great in creative writing (writing lyrics)

whole sundial
#

time to make your pfp a picture of cliff richard lol

drifting thorn
#

the never gonna give you up hallucination is the LLM joke of the year

quiet moss
#

If GPT-5 releases by July 31, is it likely it will be on LMArena on the same day?

whole sundial
#

GLM 4.5 gets this right as well

drifting thorn
#

GLM 4.5 has bad lyric writing

#

It doesn't even rhyme with the line I gave it

#

it gave me 4 answers, but none of them rhyme

gusty night
#

You are quick at model integration

drifting thorn
#

What's the provider of kraken-072125-1?

whole sundial
#

amazon

harsh flume
#

I read through what I could find of information on their website and apparently the bench is done with the models using tools, so it'd be possible to integrate a native math AI that an LLM could call on

digital umbra
#

this came up in openrouter discord

leaden sun
#

i know it's difficult for people outside math to imagine how...fundamentally different the areas in maths actually are

leaden sun
#

obviously, llms need to understand the problem first, recall the knowledge needed (theorems, lemmas, corollaries etc), connect the dots and use the tools correctly to get the final answer

harsh flume
#

AlphaGeometry2 is another

leaden sun
#

those are not general math ais, they are specialized if i'm not mistaken, but yeah, you can always build a swarm of specialized ones and call it a general ai

harsh flume
#

AGI will prob be a form of that anyways, as I don't think general intelligence will come from a pure next-token-predictor model with infinite scaling

leaden sun
#

the coordination between those agents within a swarm will be a challenge, it's studied also in dynamical systems

harsh flume
#

the interesting thing is that these use a whole other transformer architecture, so integrating them within the answer scope of an LLM would be really dope

#

lol it seems like they are on it already

#

here I was proposing the invention of fire whilst they are already on blowtorch schematics lol

#

man, I wish LMArena would organize some sort of AMA with top AI researchers from these labs. They must be in direct contact with the industry's forefront, and that'd make for some great content given how invested this server's users are

#

People here would formulate more interesting questions than 90% of podcast hosts

leaden sun
agile bloom
#

based on the response glm 4.5 gave me, it needs to be worked on

#

like damn, glm 4.5 told me its mental state

slim mesa
#

hii

#

is the grok 4 in direct chat really grok 4?

#

mine says it's grok 1 xd

#

sorry bad english

calm sequoia
#

What does this even mean

nimble trail
whole sundial
# calm sequoia

it's interesting that the current model here is 4o. must be filler for gpt-5 (which, considering they have already added this, should be coming very soon)

calm sequoia
#

And I wouldn't say this is much longer.

ashen mauve
#

What is GLM anyways?

cedar tide
#

Nemotron v1.5 on Artificial Analysis
It's the best score for an open source model that can be deployed on a single H100

golden ocean
cedar tide
#

First of all, I want to clarify that I don't trust this score at all to predict their overall performance.

#

Kimi K2 is the 2nd-best model without reasoning, so there's no problem with its score, and you can't compare it with reasoning models

#

For GLM, they themselves shared their model's scores on the same benchmarks as Artificial Analysis, and these are in the right places

#

It's certain that if they had infinite money they would have run many other benchmarks

humble sonnet
#

What is GLM 4.5 ?

keen beacon
reef pawn
teal mantle
#

I am mostly API only but should I renew GPT Plus or Supergrok
One for agent, one for grok 4

torn mantle
teal mantle
reef pawn
#

Oh then the scores are good

cedar tide
torn mantle
#

david

#

is it good or nah

reef pawn
cedar tide
torn mantle
#

proprietary means ownership @reef pawn

reef pawn
#

Oh okay

torn mantle
#

not open source

cedar tide
#

So very good

torn mantle
reef pawn
cedar tide
torn mantle
#

closed source

torn mantle
humble sonnet
#

Are there any special features?

reef pawn
torn mantle
humble sonnet
reef pawn
torn mantle
#

proprietary is owned by the ones who made it

cedar tide
#

@humble sonnet hi

torn mantle
#

what are you talking about?

reef pawn
#

How you gonna make money from open source model

keen beacon
torn mantle
#

i think you are confusing it with another word or something

keen beacon
cedar tide
reef pawn
teal mantle
#

GPT plus or supergrok btw

reef pawn
#

Both suck

#

Gemini better

teal mantle
cedar tide
reef pawn
#

GPT-1 IMAGE is good tho

teal mantle
#

I already milk CLI and AIstudio like anyone decent

torn mantle
reef pawn
teal mantle
reef pawn
reef pawn
torn mantle
#

if i say you are right then you are right

#

@cedar tide you are wrong

reef pawn
#

Aight

cedar tide
keen beacon
#

lol

calm sequoia
#

Wtf guys, what are you using gemini for

#

Just ran out of o3 requests

#

Tried gemini 2.5 Pro max thinking budget

#

Failed at all of my requests miserably (o3 successful 90%)

#

Is Gemini always like this? 💀

blazing bison
#

Cursor staff is already using it

#

Now I'm pretty sure Zenith was GPT 5

mortal lynx
#

To me o3 and 2.5 pro are both pretty hit or miss

#

Gemini 2.5 Pro was vastly superior at some tasks and garbage at others, same for o3

#

for coding tasks at least

blazing bison
#

gpt 5 is good

#

it's agi, believe

mortal lynx
#

If zenith was GPT-5 it's still not quite AGI, but much closer than o3 and o4-mini were

blazing bison
#

i'm just kidding bro

mortal lynx
#

I know, i'm just commenting on it

#

I do think Zenith and o3-alpha were a considerable improvement over what we have today, at least for what I've tested

#

much more than the "20%+ points in HLE and ARC-AGI!" models we got these past few months

blazing bison
#

o3-alpha was the best one, idk why people said zenith was better

#

maybe zenith is o3 alpha, but from what i got trying o3 alpha, the results were better than anything i've ever tried for coding

keen fulcrum
blazing bison
keen fulcrum
#

the blur is horrible

blazing bison
#

it's on purpose

#

but it means that gpt-5 is ready, idk what openai is waiting for

keen fulcrum
#

this is a correct blur

blazing bison
#

and after this week rate limits from anthropic uugh

#

I really want OpenAI to dethrone them

keen fulcrum
#

they ran out of gpus

mortal lynx
#

They're probably just preparing the blog posts, demos, videos and research papers

#

hopefully the demos are better than their usual "Look how our model can order a new shoe! AGI is here!"

#

Google and xAI do a much better job at that

floral comet
#

I heard there's a model (GPT-5) also known as Zenith. Is it still in the LLM Arena?

floral comet
#

Ohh, damn.. How can people even use it.. I guess I'm not lucky enough

odd shard
#

hello

stray aspen
#

craig will gpt 5 be AGI

quiet moss
#

no

blazing bison
quiet moss
keen beacon
quiet moss
#

No it says sk

keen fulcrum
stray aspen
#

is it just me or is gemini 2.5 pro getting worse each day

cedar tide
torn mantle
#

who voted no

#

lets talk

keen beacon
# cedar tide

I strongly hope to see that. I'd try all sorts of things

sour spindle
#

Its tool use is completely broken

#

You would think they would be dominating in this regard

cedar tide
#

Bastard 🤣

keen beacon
echo aurora
cedar tide
#

🚀 Qwen3-30B-A3B Small Update: Smarter, faster, and local deployment-friendly.

✨ Key Enhancements:
✅ Enhanced reasoning, coding, and math skills
✅ Broader multilingual knowledge
✅ Improved long-context understanding (up to 256K tokens)
✅ Better alignment with user intent and open-ended tasks
✅ No more `<think>` blocks: now operating exclusively in non-thinking mode

🔧 With 3B activated parameters, it's approaching the performance of GPT-4o and Qwen3-235B-A22B Non-Thinking

Qwen Chat: chat.qwen.ai/?model=Qwen3-30B-A3B-2507

HF: huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507 or huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

ModelScope: modelscope.cn/models/Qwen/Qwen3-30B-A3B-Instruct-2507 or modelscope.cn/models/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

#

Can anyone make a request?

drifting crow
cedar tide
#

qwen 3 coder arrived on the leaderboard and it's the 3rd-best open source model overall
(only Kimi 2 and the old qwen 3 no think are better)

keen fulcrum
#

David do you have news about grok 4 coder

#

when will it release

cedar tide
cedar tide
#

by category

fleet lintel
keen beacon
#

I may sound dumb but what does spatial awareness mean in LLM models? Vision capabilities?

#

I see, thanks

blazing bison
#

openai is genius with this study together release lol

civic flame
#

ZENITH IS BACK

blazing bison
#

really?

civic flame
#

YESS

#

okay perchance false alarm

keen beacon
blazing bison
stray aspen
#

guys what's the most trustworthy ai benchmark

blazing bison
#

openai will collect so much reasoning data with this study together

#

this mode actually asks a lot about your reasoning

#

funny

stray aspen
#

craig will gpt 5 be agi

civic flame
#

it's half back

#

can't share much but

blazing bison
#

?

civic flame
#

all it needs now is for them to flick a switch to enable it in battle

#

it's been re-added as if ready

blazing bison
#

claude 5 will be agi

#

with weekly rate limits

#

after 2 prompts

jade egret
#

gpt 5 when ):

blazing bison
#

thursday

jade egret
#

fr?

#

this?

blazing bison
civic flame
#

Craig

#

Neptune isn't a new model dawg

#

🥀

keen beacon
jade egret
civic flame
#

well yeah opus is a good model

#

😭

cedar tide
#

How do you know?

#

Very good artificial analysis

#

It's 32B SOTA

digital umbra
#

Doesn't require internet connection

cedar tide
#

I didn't say it was SOTA or that it was better than o3. I don't know what you're talking about

#

in the arena you even have 1b models, the arena is not only for sota models

digital umbra
#

Still I think EXAONE (which btw isn't a chinese model) is problematic because its license basically forbids you from doing anything at all useful with it

#

Sure you can benchmark but that's about it lol

cedar tide
#

Yes, EXAONE's license is non-commercial
We have it on just one API, at $1/$1 input/output

digital umbra
#

Which is stupidly expensive for a 32B

cedar tide
#

@deep adder I don't understand anything you're saying

cedar tide
stray aspen
#

damn craig is educating everyone

digital umbra
#

yes, why would anyone use an open source model they can run locally when they could give their personal data to openai, be forced to use a web interface and rate limits

stray aspen
#

didn't expect qwen would get this far on the artificial analysis leaderboard

digital umbra
#

i can feed as much sensitive data as i want into my gpu with no regrets. i mean it already sees everything i have on my screen anyway 😛

keen beacon
#

they are doing it because of the nyt thing right?

#

sam is trying to bring attention to it to win that lawsuit i guess

digital umbra
#

you're typing this on the discord of a site that provides user prompts to AI companies to improve their models...

#

and even if the data is useless for training it would still be useful to sell to data brokers

devout vault
#

is glm 4.5 even good

#

is it better than other smart models like gemini 2.5 pro?

stray aspen
#

are you serious

devout vault
#

weird

blazing bison
#

This craig is just a rage baiter yapper

stray aspen
#

welcome to the internet bro

torn mantle
#

Ong

#

Onnng

blazing bison
#

It is, just ignore him

#

Bait again

keen beacon
#

That's just capitalism

digital umbra
#

your original point was that open source models were useless because the chatgpt free tier existed

torn mantle
#

Thanks

digital umbra
keen beacon
#

There's more to life than ChatGPT

#

I use deepseek and Kimi

#

for example

#

why use chatgpt instead of aistudio atp btw

keen beacon
#

Even more than in gemini.app

#

ok but you're basically already accepting they're collecting your data

#

use a frontier reasoning model and make it worth it 🤣

balmy mist
#

when do yall think gpt5 is coming out?

keen beacon
#

I heard some news though that Sam Altman revealed that people say all kinds of personal info on Chatgpt

digital umbra
#

fun thing that's a requirement for using o3 then

keen beacon
#

?

#

That's some braindead thinking

#

to do

#

on Twitter

primal orbit
#

@echo aurora thank you for bringing rate limit notification in the direct chat! very much appreciated.

primal orbit
#

Being able to edit a message in chat and reroll would be a great next update, like in Google AI Studio. Sometimes you make a mistake and the chat goes off the rails.

digital umbra
#

yes

#

A "tournament" mode where you can keep using the winning model from the previous turn would also be nice

digital umbra
#

ah yes, let me just set up a shell company in panama so i can use chatgpt without letting them know my identity

keen beacon
#

Does the EU's GDPR help in how AI companies can collect data? Just curious if people here would know more

unborn ocean
#

well idk about the specifics, but there are a lot of data collection things that are turned off for eu consumers

unborn ocean
#

e.g. training on data with aistudio free tier

#

(in the api only)

#

no i meant the api

#

has a free tier

#

aistudio as a webapp is a different quota that is completely free, separate of the api free tier

keen beacon
#

Well, it's good I heard about that use of data too

novel crater
#

yoooooooooooooo

#

your parents gave you a great name

#

haha

stray aspen
#

wassup billy

blazing bison
#

So let's bring zenith back?

stray aspen
#

yes

torn mantle
#

you're still thinking of zenith

#

quite the obsession

blazing bison
#

it has the potential to be 1500 elo

stray aspen
#

I'm gonna play hugging face

grizzled bobcat
#

Guys

#

Help me

#

It limits messages

#

It's not unlimited messages

echo aurora
keen beacon
#

Anyone?

verbal nimbus
#

Talking to Gemini 2.5 Pro is a bit frustrating sometimes. It doesn't notify me that I provided the same attachment twice.

#

On AIStudio it likes to use flowery language for open-ended questions like it's inventing marketing terms, but it's great on STEM questions.

zinc ore
#

I'd love to believe it

sturdy mica
#

how come you can't add attachments to searching models!?!?

#

does anyone know a workaround or something

keen beacon
#

Anyone spotted zenith yet?

stray aspen
#

craig do you think gpt 5 will smoke all the other models

#

and will remain for a long time

keen beacon
sturdy mica
#

yo how do i have attachments and internet access at the same time

#

cuz this is low-key annoying

sturdy mica
stray aspen
#

submit feedback

#

and maybe they'll add it

sturdy mica
#

is there some free service where i could use some models like grok 4 with internet and also attachments

#

@stray aspen do you know of one

#

sigh

#

bzzzzzzzz

#

bzbzbzbzbz

#

aaaaaaaaahhhh

#

bbbbbbbbbb

sturdy mica
#

lmarena supports only one or the other

#

i need both at same time

lime coral
#

Since GPT5 uses tools by default they should be compared with Deep Research version

fleet lintel
#

i need GPT5 now. when are they launching?

keen beacon
cedar tide
#

a good cleaning is nice

torn mantle
cedar tide
cedar tide
frosty lark
#

Elo ratings are relative; one cannot compare them across player pools.

cedar tide
#

Ernie 4.5 is underrated 😵‍💫

cedar tide
# cedar tide a good cleaning is nice

Now that we have cleaned out these 11 models, add these 10 models 😶

Qwen 30b A3b 25 07
Gemini 2.5 no think
Open reasoning nemotron 32b
Ernie 4.5 300b
Glm 4.5 no think and on webdev
Solar pro 2
Exaone 4.0 32b
Hunyuan 80b a13b
Intern S1 (241b vision)
Reka flash 3.1

torn mantle
cedar tide
#

I don't think so

torn mantle
#

lol

#
#

let me see

cedar tide
torn mantle
#

for multilingual vibe check

#

if it's not fluent and doesn't feel native then it's a big -1

#

they all sound robotic and ai gen

#

@cedar tide what's your first benchmark

#

or what do you try it on

cedar tide
#

@torn mantle The truth is I haven't tried it, like everyone else, but if it has good benchmarks we should give it a chance in the arena so we can try it.

languid crescent
#

@echo aurora I'm so sorry, I received a warning about advertising. I didn't know that I can't share it :((

cedar tide
ornate agate
#

I think benchmarks are still a lot better than random vibes or assuming it's bad

torn mantle
#

no they are not

#

vibes check is superior

cedar tide
#

@torn mantle for you deepseek v3 is much better?

languid crescent
#

i am such a disappointment

torn mantle
#

you have the same pfp picture as david

#

why

ornate agate
#

Exaone and nemotron AA benchmark at 32b size makes them very compelling for further analysis

cedar tide
#

Yes

#

but glm is mostly good at webdev and it's not on the webdev arena yet

#

and there are only think versions of glm, but sometimes people prefer no think versions, for example qwen 3 no think is much higher than think version in the leaderboard

tall summit
#

which lmarena direct chat models have ratelimits?

blazing bison
#

The update of the chatgpt Mac app with preparations for gpt 5 basically confirmed that it's gonna be a router

#

🤓

stray aspen
#

Bro the Baidu ernie playground is so trash

cedar tide
#

Announcing the Artificial Analysis Music Arena Leaderboard: with >5k votes, Suno v4.5 is the leading Music Generation model followed by Riffusion's FUZZ-1.1 Pro.

Google's Lyria 2 places third in our Instrumental leaderboard, and Udio's v1.5 Allegro places third in our Vocals leaderboard.

The Instrumental Leaderboard is as follows:
🥇 @SunoMusic V4.5
🥈 @riffusionai FUZZ-1.1 Pro
🥉 @GoogleDeepMind Lyria 2
@udiomusic v1.5 Allegro
@StabilityAI Stable Audio 2.0
@metaai MusicGen

Rankings are based on community votes across a diverse range of genres and prompts. Want to see your prompt featured? You can submit prompts in the arena today.

👇 See below for the Vocals Leaderboard and link to participate!

whole wagon
#

Did I hallucinate? I swear on chatGPT the switch model option had gpt5 for a second 😂

finite pollen
#

hey, in our battles, models that are removed get relabeled back to Assistant A, so we don't know what they were. Can this be fixed?

jade egret
#

when gpt 5

echo aurora
tall summit
#

which lmarena direct chat models have ratelimits?

torn mantle
#

🚀 Qwen3-30B-A3B-Thinking-2507, a medium-size model that can think!

• Nice performance on reasoning tasks, including math, science, code & beyond
• Good at tool use, competitive with larger models
• Native support of 256K-token context, extendable to 1M

Qwen Chat: Go to

cedar tide
#

Already saw it

torn mantle
#

xd

cedar tide
#

Yes, I'm busy now

#

Soon: the average of the benchmarks

#

@torn mantle officially coder 30b a13b tomorrow

cedar tide
# cedar tide
Poll: To create Agent arena
Winning answer (#1): Yes, 21 of 24 votes

cedar tide
cedar tide
torn mantle
digital umbra
cedar tide
#

average of the 24 benchmarks

#

by category

#

@torn mantle @ancient reef

torn mantle
#

Not bad

cedar tide
#

In my opinion, Gemini 2.5 Flash is about this size.

#

size of deepseek r1 ?

digital umbra
cedar tide
#

@ornate agate the only hint we have is that there was a gemini 1.5 flash 8b version

keen beacon
#

Zenith in lmarena, anyone?

#

Or openrouter horizon alpha?

digital umbra
#

zenith was removed, horizon is not out yet

wicked root
#

any update on GPT5?

lapis light
#

Can I just ask though, why are there three Video Arena channels?

echo aurora
lapis light
tawdry sapphire
#

yo

echo aurora
tawdry sapphire
#

wassup bro

mint cape
echo aurora
mint cape
#

I forgot to disable @everyone pings on the server and was mildly annoyed

tribal glacier
#

hello .)

brittle tiger
humble sonnet
digital umbra
#

you have to go through hoops to use o3 through API so it's not too surprising

humble sonnet
#

Is there a limit on images with the bot?

void tusk
#

oh boy, i bet there are going to be a bunch of new people here xd

#

even newer than me XD

echo aurora
humble sonnet
#

Oh, but images are unlimited on the website

wicked root
#

how do people know this? I can't find official statements anywhere

reef pawn
wicked root
#

and is the new model going to be better than gemini pro?

#

I might switch over to gpt5 if that's the case

nimble trail
ionic idol
#

Hi

wicked root
wintry tinsel
#

I eagerly await August for gpt 5

dusky ore
#

#share-prompts create video, on a subway platform, the chubby raccoon is running away, clutching the three ducklings in its arms. Behind him, the mother duck is chasing after the raccoon, wings spread and beak open in panic. The scene is dynamic and full of action. vertical, 9:16

echo aurora
reef pawn
#

How can I enable my Gemini AI pro membership in Google AI Studio? I already have this membership and want to use Veo 3 and Imagen Ultra but I'm unable to do so

merry pond
#

@amber warren Hi !

amber warren
#

helloo

dusky ore
#

/video create video, on a subway platform, the chubby raccoon is running away, clutching the three ducklings in its arms. Behind him, the mother duck is chasing after the raccoon, wings spread and beak open in panic. The scene is dynamic and full of action. vertical, 9:16

merry pond
echo aurora
merry pond
#

Is there a role for Lmarena staff to recognize employees who work there and prevent people from being misled by imposters?

wheat hamlet
#

hello guys

merry pond
echo aurora
echo aurora
wheat hamlet
#

would love to know what's ur personal fav model for text, music and video

#

i feel like people have different preferred models nowadays

merry pond
trail ember
#

🤗

stray aspen
#

@echo aurora Mr why so many video arena chats

echo aurora
digital umbra
#

i wonder who is paying for it

stray aspen
#

We are

#

Ok but fr how does lm arena profit

#

@echo aurora

echo aurora
shut quiver
#

Um... hi there. Just joined today. Nice to meet you guys

keen beacon
#

main chat is not the place.

echo aurora
shut quiver
old ginkgo
#

Do you guys think meta will make a pay2win comeback?

digital umbra
#

considering how much they invested in stealing talent from other companies, i think they have to

old ginkgo
#

Yeah i think so too. I think mark will just let them do whatever. Literally 0 safeguards, just make sure you win. Also here is 100 billion dollars in salary and datacenters lol

woven harness
#

video arena will be on the web?

echo aurora
wintry tinsel
#

To me it feels do or die: they think their future value as a company is banking on this promise, so they are willing to burn all their money on the chance they succeed, regardless of how high that chance is, even if there's a real possibility it doesn't pan out

keen beacon
#

I love this

#

But how much does this cost to run?

#

I wonder if companies gave credits to these guys to advertise their services

verbal sorrel
#

You guys need to turn off sound so it's not a dead giveaway that the model is Veo 3... 🤣

#

Also, it doesn't seem like an apples-to-apples comparison at that point anymore

echo aurora
torn mantle
#

Do it

#

Ah you already did

#

Free money

#

Hax

#

Smh

patent aspen
torn mantle
#

Yea

#

You need to look for arbitrage

#

Between diff platforms

torn mantle
#

Hax

#

You have a script for it?

verbal sorrel
#

You arent measuring video models anymore at that point

torn mantle
#

Or Is it manual

#

Liar

#

Smh

patent aspen
verbal sorrel
#

The entire point of LMArena is blind testing, if you know the model is Veo 3 right away then it defeats the purpose.

verbal sorrel
torn mantle
#

Craig just let it out

#

Don't worry

#

You are safe here

#

Im skydiving rn

#

Wish me lick

#

Luck i mean

verbal sorrel
#

For LMArena to be effective you need to remove bias, which is why it's blind. But if you know the model is Veo 3 out of the gate, that obviously doesn't work anymore.

patent aspen
# verbal sorrel Thats the point...

It's too bad the others don't but it's not like this is the first time it's been possible to deduce what a model is based on its responses

torn mantle
#

Orabazes is mad

verbal sorrel
#

True but this is just too obvious and detrimental to testing

torn mantle
#

Angry

echo aurora
verbal sorrel
#

I'm not mad, just trying to make it a better leaderboard for everybody

torn mantle
#

I have never used any gambling website

#

Like ever

echo aurora
#

It's valid feedback for sure

patent aspen
#

It would be silly to punish the best video model because the other models don't have feature parity

real whale
#

/image-to-video /image-to-video

echo aurora
patent aspen
#

Worse video models should also be incentivised to compete for user preference holistically

real whale
#

Thank you @echo aurora

#

@echo aurora I'm a bit of a noob on this site

visual panther
#

yes

merry reef
#

hola

blazing bison
#

Bring zenith back

#

😢

deep valve
#

hello

fading kraken
#

hello

#

how can i check out the code-named models

#

clownfish, nettle, etc

tired herald
#

Can't directly, you need to go to the battle and have luck ig

fading kraken
#

ah got it

warm fulcrum
#

when gpt-5 releases, do u think they will bring it in lmarena?

fading kraken
#

no

tired herald
warm fulcrum
#

they got enough money out of me already

tired herald
#

I can't help you with that

warm fulcrum
#

im just saying

tired herald
#

And I'm just saying too 😭

warm fulcrum
#

👍

tired herald
#

You'll probably still have some free messages tho

#

On OpenAI

warm fulcrum
#

they would water it down hella tho so

tired herald
#

Very likely yes

warm fulcrum
#

im so jealous i never got to try zenith ngl

tired herald
#

So am I

warm fulcrum
#

🤷

tired herald
#

Just have to wait then ig

tall summit
#

gpt-5 will be huge no matter what

#

even just by reputation

#

it'd be very idiotic of them not to

digital umbra
#

gpt-5 will be huge because it's the first openai model released in a long time with a name that actually makes some sense

blissful sluice
#

Its. Called gpt-five

digital umbra
#

the first AGI model will be called gpt5.1o-max-pro-alpha

patent aspen
civic flame
#

i'm going for no

#

๐Ÿฟ this next 7 days is going to be HOT

stray aspen
civic flame
#

in demis we trust

zinc ore
#

InB4 drop at the same hr

ocean vortex
blazing bison
#

English only

silent flume
#

new model in arena

#

potato and dino

digital umbra
#

dino claims to be by anthropic

#

and potato by openai

silent flume
#

they are good?

coarse flame
#

Potato keeps trying to use imgur links

digital umbra
#

no idea if they're good or not

ebon jacinth
#

bros gpt 5 tomorrow?

long depot
#

Hi, new here! I'm curious to know if anyone has measured whether people are inherently more likely to pick option A or B in the arena, because of recency bias. I know that when I get long responses I read through one and establish an opinion, then read through the other but can't help comparing as I go, which might bias my vote.
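One way to check the A/B position-bias question above is an exact two-sided binomial test on decisive votes. This is a hypothetical sketch, not how LMArena analyses its data: it assumes you have a list of votes where each entry is "A", "B" or a tie, and that models are assigned to sides at random, so any lean toward one side reflects position rather than model quality.

```python
from math import comb

def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test: total probability of every outcome
    that is no more likely than the observed count k under Binomial(n, p)."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance so outcomes tied in probability count together.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))

def position_bias(votes: list[str]) -> tuple[float, float]:
    """Return (share of decisive votes won by side A, p-value for 'no preference').

    Hypothetical vote format: each entry is "A", "B" or anything else (tie / both bad).
    Assumes models were assigned to sides at random, so a lean reflects position.
    """
    decisive = [v for v in votes if v in ("A", "B")]  # ties carry no position signal
    if not decisive:
        raise ValueError("need at least one decisive vote")
    k = decisive.count("A")
    return k / len(decisive), binomial_two_sided_p(k, len(decisive))

share, p_value = position_bias(["A"] * 60 + ["B"] * 40 + ["tie"] * 10)
print(f"A share = {share:.2f}, p = {p_value:.3f}")
```

With 60 A-wins, 40 B-wins and 10 ties, side A takes 60% of decisive votes but p comes out around 0.057, so 100 decisive votes would not quite be enough to call a lean of that size significant at the usual 5% level.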

bronze veldt
#

hello

echo aurora
whole wagon
#

This is OpenAI

#

256k context, smells like open source model

digital umbra
#

i had a suspicion it would be that

#

if it's the open source model, 256k context would be nice

whole wagon
#

They aren't even hiding it tbh

#

Like didn't even bother with the system prompt lol

digital umbra
whole wagon
#

That's a tokenizer issue I'm pretty sure. There's some explanation to why it's a crap test

#

Same issue caused the r in strawberry thing
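Why letter-counting prompts like the strawberry one mostly test tokenization: the model receives token IDs, not characters. A toy sketch, assuming a made-up vocabulary and using greedy longest-match splitting as a simplified stand-in for real BPE tokenizers, which learn merge rules and may split the word differently.

```python
def greedy_tokenize(word: str, vocab: dict[str, int]) -> list[str]:
    """Greedy longest-match subword split (a simplified stand-in for BPE)."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return pieces

# Hypothetical vocabulary: two multi-letter pieces plus single-letter fallbacks.
VOCAB = {"straw": 101, "berry": 102,
         "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8}

pieces = greedy_tokenize("strawberry", VOCAB)
ids = [VOCAB[piece] for piece in pieces]
print(pieces, ids)  # the model only ever sees the ids
```

Here the model would see just two IDs, and nothing in them spells out the three r's; it has to have learned the spelling of each token to count letters.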

digital umbra
#

it doesn't seem like it's a reasoning model though

whole wagon
#

Could be they just turned off the reasoning

digital umbra
#

maybe

whole wagon
#

For the demo

digital umbra
#

hm

#

it doesn't seem to be terrible at the one trivia question i asked it

#

which even sonnet 4, deepseek and glm-4.5 fails at (but kimi k2 gets right)

whole wagon
#

The training cutoff date is strange

#

It's long ago for some reason

#

Hm well the open source model was delayed a lot. So maybe it does add up

#

The openAI models on LM arena know the current president

digital umbra
#

qwen3 says biden is president too

whole wagon
#

Yeah. It doesn't pick that up instantly; the cutoff date has to be like May onwards

#

For it to say trump is president

digital umbra
#

i think it's a large model, too big for 1 gpu

torn mantle
#

@cedar tide is probably sleeping and missed horizon alpha

torn mantle
#

lmao

#

xdddd

cedar tide
#

It says an October cutoff, but it knows deepseek r1

torn mantle
#

deepseek r1 update could be potato

#

new model added to lmarena

digital umbra
#

close to gpt 4.1 for sure

torn mantle
#

people are not liking this horizon alpha model at all

#

makes me wonder if deepseek really hit a wall or nah

#

potato was ok-ish but nothing crazy

digital umbra
#

potato isn't horizon alpha

torn mantle
#

yea ik ik

#

potato could be a chinese model

#

horizon alpha is def from oai

runic axle
#

From initial testing Horizon Alpha has the same writing style as Zenith/Summit

torn mantle
#

yea coding wise, it has similarities to summit/zenith but not that good tho

digital umbra
#

it could be a gpt5 variant for sure, i'd assume it would be for the free tier of chatgpt in that case (replacing 4o)

runic axle
#

I meant for creative writing. IIRC the Zenith/Summit models in LMArena had a thinking/reasoning budget, but Horizon Alpha doesn't.

torn mantle
#

surely not a gpt5 variant

digital umbra
#

it also makes sense if it's the open source model if kimi was indeed the reason why they delayed it, because based on how it answers i think it's probably in the same size range, around 1T parameters (and yes i know kimi is kinda bad for its size)

torn mantle
#

nah

#

the one who got access said its much much smaller

#

can run in a single H100(?) gpu

digital umbra
#

if it's dense it would be much smaller

torn mantle
#

i don't know if he meant H100 or B-series

#

you mean moe?

#

dense it will just activate the whole params

digital umbra
#

kimi is moe

torn mantle
#

yea it is

digital umbra
#

so i think if it's a dense model it would surely be smaller for the same performance

torn mantle
#

yea could be
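Back-of-the-envelope arithmetic for the single-GPU question above. A minimal sketch, assuming weights only (no KV cache, activations or runtime overhead), 1 GB = 1e9 bytes and an 80 GB H100; the bit widths are illustrative quantization levels. For an MoE model it is the total parameter count that has to sit in memory, since any expert can be routed to; the active count mainly sets the compute per token.

```python
# Back-of-the-envelope weight memory. Assumptions: weights only (no KV cache,
# activations or framework overhead) and 1 GB = 1e9 bytes.

def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights, in GB."""
    # params * 1e9 * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_param / 8

def max_params_billion(gpu_memory_gb: float, bits_per_param: float) -> float:
    """Largest parameter count (in billions) whose weights alone fit."""
    return gpu_memory_gb * 8 / bits_per_param

H100_MEMORY_GB = 80  # assumed: the common 80 GB H100 configuration

for bits in (16, 8, 4):
    need = weight_memory_gb(1000, bits)            # a ~1T-parameter model
    fits = max_params_billion(H100_MEMORY_GB, bits)
    print(f"{bits:>2}-bit: 1T params need {need:.0f} GB; "
          f"{H100_MEMORY_GB} GB holds at most ~{fits:.0f}B params")
```

Even at 4 bits, ~1T parameters means about 500 GB of weights, so a model that really runs on one 80 GB card is more plausibly in the tens to low hundreds of billions of total parameters, whether it is dense or MoE.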

runic axle
#

If this is the open-weights model maybe it was distilled from GPT-5?

digital umbra
#

nah

#

distilled from gpt-4 variants

runic axle
#

Yeah probably more likely

torn mantle
#

you know what

#

everything is possible

#

lets just wait and see

cedar tide
torn mantle
#

the video needs more work

#

but gl

#

i would just brainstorm ideas -> run it on notebooklm video overviews

#

and make a similar presentation

#

you are a cutie paws

#

😖

#

its a good one

#

๐Ÿ‘

cedar tide
hardy pecan
#

quick pass of simplebench for Horizon-Alpha: 3/20 lmao

finite pollen
#

its probably that OSS one they keep hyping :p

ebon jacinth
#

it could be GPT5-nano

lapis light
#

Reminds me of the Google Lamda moment

misty vault
#

stop

paper nimbus
#

where can i track new models

mild vapor
#

Does "LMArena" not support setting the Aspect Ratio when creating images? I've given the commands as detailed as possible, but the result is still a 1:1 image.

echo aurora
gritty flare
mild vapor
#

In "video-arena", is there a limit to the number of videos I can make?

echo aurora
#

Yeah

mild vapor
echo aurora
junior shuttle
#

hallo everyone

languid crescent
#

Will video arena be ever in lmarena? Or it just stays in discord for good?

echo aurora
languid crescent
nova frost
#

is there an option to use veo 3 for generating a video? i like it when sound effects are available.

echo aurora
nova frost
#

how many credits does it take to generate a video here?

echo aurora
echo aurora
whole sundial
languid crescent
whole sundial
#

well, that's how they made the "mode" lol

languid crescent
#

@echo aurora is it possible for the video generation arena to send the result directly to the person who prompted it? (Assuming it's not a chatbot you can interact with.) The idea is: you type your prompt in the #video-arena channel, but only you can see the generated video result? idk if discord can even do this 😭

echo aurora
languid crescent
limber kiln
#

Hello ! Best wishes for all.

tight sedge
#

camping value video

echo aurora
primal fern
#

How do I generate a video in this channel?

nocturne bear
#

@icy forge great prompt buddy

tight sedge
#

A serene lakeside campsite at dawn, golden sunlight filtering through pine trees. A tent is pitched near the water, with a small campfire smoldering. A coffee pot steams on a rustic wooden table. Slow drone shot moving from the lake to the campsite.

pseudo summit
#

haven't gotten to mess w/ it too much, but some of the responses i got were similar to past grok responses

pseudo summit
# torn mantle

this is not consistent for potato btw, not sure if that points towards chinese model

hallow ridge
#

How can I use AI to take over

halcyon vortex
#

yo all of my chats just got wiped...

still bramble
#

And this error now appears

halcyon vortex
pseudo summit
#

interesting

#

observed some very weird behavior by models right before the error

#

maybe they're doing maintenance or smth?

gloomy knot
#

hello everyone, joining here to try out the video arena 🙂

halcyon vortex
halcyon vortex
#

I think the devs are trying to fix lmarena tho

#

they better 🙏

nova edge
#

hello

echo aurora
echo aurora
pseudo summit
echo aurora
pseudo summit
#

all models

#

nvm, seems to be back!

echo aurora
#

glad to hear it! but keep me updated if things seem broken again blobthumbsup

velvet iris
#

hi

void perch
#

can i invite the bot to my private channel? it's easy to lose track in the public channel

flint mortar
calm sequoia
#

The o3 suddenly started showing code changes without any actual differences. Anyone noticed this?

still bramble
pseudo summit
#

chats did not return for me on PC ...

torn mantle
#

xAI supports AI safety and will be signing the EU AI Act's Code of Practice Chapter on Safety and Security. While the AI Act and the Code have a portion that promotes AI safety, its other parts contain requirements that are profoundly detrimental to innovation and its copyright

still bramble
wind briar
#

hello, do you know any way to jailbreak gpt? like give you information it shouldn't give you ( tax fraud, fake ids, etc)

#

askin for a friend 😅

halcyon vortex
#

@echo aurora Will I ever get my chats back? They just disappeared randomly

still bramble
# wind briar hello, do you know any way to jailbreak gpt? like give you information it should...

This article looks pretty convincing (https://habr.com/ru/articles/923084/), I once added it to my bookmarks, but never tried the advice from there. It is in Russian, but the screenshots with examples of all jailbreaks are in English, so I think everything will be clear.

golden onyx
#

GM beautiful people

paper nimbus
#

gippity 5 wen chat

dusky aurora
civic flame
weak sluice
#

Hi!

primal sun
#

hi

blazing narwhal
#

hi

keen beacon
teal crypt
#

Hi

quiet pollen
#

Been hearing that Gemini got nerfed, is it true? But when I see the leaderboard, they are still first in most of the fields

deft vigil
#

is there any unreleased model on lmarena right now ?

cedar tide
#

Well, I was busy with Horizon. What are Dino and Potato worth? Is it worth going to the arena?

deft vigil
#

im on dino now but it's really slow though. is horizon an openai model?

torn mantle
#

im really starting to think deepseek is ded

civic flame
#

chinese distils by the looks of it

cedar tide
civic flame
#

cuttlefish and clownfish were removed

cedar tide
civic flame
deft vigil
#

wow, so the only way to access it is through lmarena battle, right?

cedar tide
torn mantle
#

people on chinese forums/servers are talking about another 2-month delay for deepseek r2

civic flame
cedar tide
#

@torn mantle Well, I slept, what did I miss on Horizon?

torn mantle
#

they hit a wall and they have some technical issues

civic flame
#

lol they probably realised they're too far below gpt-5 if they release r2 soon

torn mantle
#

i tried it on 1 prompt only

civic flame
#

terrible at math

torn mantle
#

idk its only good at svg

deft vigil
#

Dino keeps on generating crazy stuff

civic flame
#

then why ask what you missed 😭

torn mantle
#

it's the same thing: when something becomes trendy they finetune like crazy on it

#

and thats what happened with svg

#

they are focusing on the wrong thing

cedar tide
#

@civic flame have people run benchmarks?

#

apart from gpqa and math 500

#

and arc agi

#

and eq bench, creative writing

civic flame
#

it got ~67% on aider

torn mantle
#

the thing that we should focus on is aristotle x1

civic flame
#

which puts it just below claude 4 opus

cedar tide
torn mantle
civic flame
#

what the

#

never heard of that

#

oh wait

torn mantle
#

isnt it the math model

civic flame
#

that's aristotle

torn mantle
#

that acer was using?

civic flame
#

yeah

#

he said it's definitely not "mathematical superintelligence" as they were selling it

torn mantle
rare python
civic flame
#

apparently it's only good at very specific kinds of math

torn mantle
#

they blog can have some insights

#

their*

rare python
#

Can I use it?

torn mantle
#

Our system achieves state-of-the-art performance on both benchmarks because we've built systematic verification directly into the reasoning process. Rather than emotional doubt, our models apply procedural self-skepticism to their outputs, making the skepticism reliable and scalable rather than unpredictable.

Other models embody the opposite of scientific thinking. They're trained to sound confident about everything, when science requires knowing when you don't know.

torn mantle
#

seems to me they are looking for more fundings

#

typical small business -> big business

civic flame
#

acer and another guy got a testflight invite pretty quickly

torn mantle
#

im not sure if this model is powering their app or nah

civic flame
#

it is

torn mantle
#

yea that one

#

kinda curious if you can just ask it something unrelated to math

#

like inject some random math question inside an actual question

rare python
#

Oh I thought this one is more general

torn mantle
#

well they said they saturated both benchmarks