#general


quartz pike
#

but i was asking for like an ACTUAL ai image model that exists on lmarena

#

im stoopit

#

and lazy

#

so me no do dat

wild sapphire
#

this is going wild now

#

only limit is imagination

mental scaffold
#

Hello, here to check out the videos

wild sapphire
#

yeah and the wild imagination worldwide

quartz pike
#

Chat is ts tuff?

#

Duh its ai generated. this is lmarena 😭

#

correct. but personally i find the second one a bit funny

#

Bro just casually lets go

#

and f*cking dies 😭

#

lol

#

btw can u vote

#

i wanna see what the models are

#

that made it

fervent pebble
#

Hello. I'm here because i'm excited at how fast Ai Image generation is evolving and looking forward to seeing how it develops.

quartz pike
#

😭

lunar jay
alpine coral
#

nice one - that does suggest some kind of RAG/internet access almost surely

#

is nightride a strong model otherwise? like aside from up to date info

#

@echo aurora i had to scroll through so many images and greetings before finding discussion about models in the arena (which is what i liked about the server in the past).. perhaps there's a way to separate out like general chatter from discussion / speculation about anon models (the juicy stuff ha)

neon idol
#

Sorry for the image in italian just use google lens for translate

solid brook
#

ask for knowledge cutoff this time

neon idol
ripe mountain
#
Poll: SOTA
Winning answer: GPT-5 (answer 1), 9 of 17 votes

#
Poll: SOTA Open-Source/Open-Weight
Winning answer: Qwen3 235B 2507 (Reasoning) (answer 1), 7 of 12 votes

keen fulcrum
#
Poll: Grok 5 before Gemini 3
Winning answer: No (answer 2), 19 of 21 votes

verbal nimbus
#
Poll: Is Gemini 3 already on LMArena under an anonymous name?
Winning answer: No (answer 2), 13 of 13 votes

ocean vortex
verbal nimbus
#
Poll: When will Gemini 3.0 be released?
Winning answer: September (answer 1), 6 of 14 votes

ocean vortex
#

But ig it's old and people are bored wanting smth new lol

#

Deepseek V3.1 was a non-release for the most part. You can now have no reasoning or reasoning that almost matches R1 within the same model. Ok great, moving on...

rustic knot
ocean vortex
#

I suppose qwen3 performs, but my IRL experience was somewhat differing from this

rustic knot
#

that's why its AA score is better

#

but ngl they did r1 dirty by that aime 2025

ocean vortex
#

Qwen3 seemed to me like it's much easier to break and not nearly as reliable as R1

rustic knot
#

another thing to note is that it seems like ds keeps trying to use grpo while Qwen switched to gspo

worthy sleet
#

I need nano banana and I can use it for free from google aistudio. I can also do it from lmarena directchat. Is it preferable for lmarena doing it there because they can use those prompt or is it more of an expense for them?

ocean vortex
#

Also smaller

rustic knot
#

which new benchmarks?

ocean vortex
#

uh there was creative writing one

ocean vortex
#

that became a thing after that model was released

rustic knot
ocean vortex
rustic knot
#

that seems like from a long time ago

ocean vortex
#

Like it really seems that model had good fine-tuning. It is performing on things they didn't explicitly train for

ocean vortex
rustic knot
#

like kimi overtook everyone on the eq creative bench

#

it's basically the best writer, even gpt 6 (underscore) agrees

ocean vortex
rustic knot
#

there seems to be a tradeoff b/w what certain RL algorithms are good for, GRPO might be better for writing where the rewards are uncertain while GSPO is better when there is only 1 right answer
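
For anyone skimming the GRPO vs GSPO back-and-forth, here is a minimal sketch of the mechanical difference as the two methods are usually described: both standardize rewards within a group of sampled responses, but GRPO weights each token by its own importance ratio while GSPO uses one length-normalized ratio per response. The function names and the use of the population standard deviation are assumptions for illustration, not code from either lab, and clipping and the actual loss are left out.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards):
    """Group-relative advantage: each reward standardized within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def token_ratios(new_logps, old_logps):
    """GRPO-style: one importance ratio per token of a response."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps):
    """GSPO-style: a single length-normalized ratio for the whole response."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))
```

The per-token ratios can swing individually, while the sequence ratio averages the log-ratios first, which is one way to read the "higher variance" remark in this thread. On its own this says nothing about which one is better for writing.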

ocean vortex
#

They can't do RL training (reasoning) though it seems 🗿

#

kimi has a lot of potential for it

rustic knot
#

have u seen the GSPO paper?

keen beacon
#

gspo is just better overall

#

vs grpo

rustic knot
#

in grpo, there is a higher variance for answers which might be better for writing purposes. Trentk used it to turn a small Qwen model in mechahitler

#

more creative yk

keen beacon
#

not too sure about that but in my experience gspo is just better

rustic knot
#

cuz ur probably using it in technical cases where there is only 1 right answer for the most part

#

@ocean vortex you gone buddy? xD

leaden sun
white hatch
#

will there be added support for uploading files other than images?

ocean vortex
#

No opinion on this. I'm more interested in how it turns out in practice. But haven't looked into technical details of this enough to comment tbh

#

Plus I would say it's a bad idea to generalize like this. I think there's more than 1 way to train a good performing model

keen beacon
#

Until you remember that GPT-5 is the best model overall, that is, most likely to give a good answer to ANY prompt.

#

Regarding if Qwen is benchmaxxed or not - we quite literally do not have appropriate benchmarks to figure it out

#

In psychometrics for example, tests are created to distinguish between different broad abilities, which are measured with different subtests

#

Let's say that we did an exploratory factor analysis on a LLM and figured out that it has a strength in broad factor that is responsible for creative writing, and each other thing this broad ability accounts for

#

So it's going to be great at benchmarks that measure this broad ability

#

And if a model somehow does badly on every benchmark at the same time except the ones it was measured on, that is evidence of benchmaxxing
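
A crude way to picture that last point (this is a simple gap check, not the factor analysis mentioned above): standardize a model's score on each benchmark against peer models, then compare the benchmarks it was reported on with everything else. The function names, the benchmark keys and the idea of a single "gap" number are all assumptions made up for this sketch.

```python
from statistics import mean, pstdev

def z_scores(model, peers):
    """Standardize the model's score on each benchmark against peer models."""
    out = {}
    for bench, score in model.items():
        ref = [p[bench] for p in peers]
        out[bench] = (score - mean(ref)) / (pstdev(ref) + 1e-8)
    return out

def targeted_gap(model, peers, targeted):
    """Mean z-score on the targeted benchmarks minus mean z-score on the rest."""
    z = z_scores(model, peers)
    hit = [v for b, v in z.items() if b in targeted]
    rest = [v for b, v in z.items() if b not in targeted]
    return mean(hit) - mean(rest)
```

A large positive gap would be a hint worth investigating, not proof, since a model can also be genuinely strong in one broad ability.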

#

But we do not have extensive psychometrically valid benchmarks for LLM so far, so...

summer cove
#

hi

leaden sun
keen beacon
leaden sun
#

Ok, that speaks volume about your background

tawny kite
#

HEY EVERYONE, I AM NEW HERE,

sullen fern
#

hi

languid crescent
#

mah chat history is gone 😭

fervent tangle
#

GUYS

#

GOOGLE RELEASED NANO BANANA

#

ITS GEMINI 2.5 FLASH

#

on google ai studio

keen beacon
#

OMG NANO BANANA

sullen fern
#

I have a question

fervent tangle
#

finally imma use nano banana for free

sullen fern
#

How can i change the aspect ratio on lmarena ?

fervent tangle
#

on google ai studio

fervent tangle
keen beacon
#

its also available on lmarena direct chat btw if u run into ur limits there

sullen fern
#

damn i wanted to make a horizontal image

keen beacon
#

aistudio has generation limits for nano banana

fervent tangle
fervent tangle
keen beacon
#

yea u have X amount of requests

fervent tangle
keen beacon
#

im not sure about the specifics, they dont document it and it changes

fervent tangle
fervent tangle
keen beacon
#

yea im just saying if u do reach the high limit

fervent tangle
keen beacon
#

🤷 im just saying it is available on lmarena direct chat if you want. no login required there to get extra use

fervent tangle
#

imma try it if i reach the limit

#

also did the quality drop?

#

after they released it?

#

or am i tripping

keen beacon
#

personally i dont see much of a difference/if at all compared to when it was anon but i barely used it

#

some people have said it was nerfed though

fervent tangle
south vigil
#

What do you guys think is better at coding: GPT 5 high or claude opus 4.1/sonnet 4?

white hatch
#

gpt for architecture, claude for coding

fervent tangle
#

try 4 sonnet if u dont want alot of costs

south vigil
#

claude opus 4.1 was unable to help yesterday fix a simple api concurrency limit issue, had to take over, and do it the old fashioned way. i use opus 4.1 often and it's decent but was thinking about switching to gpt5 high though because it's a better reasoning model.

south vigil
#

didn't try, should've, i just fixed it myself, i'll see if gpt5 can try optimise it today

#

because claude opus 4.1 completely failed

high hound
fervent tangle
drifting crow
#

Nvidia

sonic bear
#

golden hour backlight,soft rim light on hair,clothes & subjects,lens flare sunlight,create shadow on ground

topaz bay
# high hound

when it comes to text models only, then xAI, openai and Anthropic have a big chance

#

Only problem is that the west only does closed source

wet pewter
topaz bay
#

I hope china keeps doing good open source, so that it's a fair market

topaz bay
fervent tangle
#

thats how they made their AI models good

quartz pike
#

tuff or nah. vote ples

#

yall just asking. are the devs planning to add video arena to the website?

stiff blaze
#

Make him run like super mario 3

fervent tangle
quartz pike
#

to go to le channel

neon musk
#

How generate image to video on this LMARENA to ratio 9:16 or portrait??

quartz pike
#

so you can vote

fervent tangle
#

the right one is veo3

#

but it doesnt have audio..

quartz pike
#

oh?

fervent tangle
quartz pike
#

dis one is even better but idk wtf is going on with the left one

#

The ai on the left crashed the f out

exotic tartan
#

why do i see only video-arena-1? aren't there supposed to be more?

grand valve
#

hello friends....how to use google flash 2.5 i.e. nano banana

solid brook
keen beacon
fervent tangle
keen beacon
#

In short, while Claude models seem to be great at coding, they severely lack in GENERAL ability

sacred wharf
#

hi cant wait to see whats posssible here

keen beacon
#

I've written before about METR's time horizon benchmark. While I consider it a valuable benchmark, it doesn't measure exactly what it's trying to. In order to only measure a model's time horizon, a benchmark would need to only vary the task length. Instead, the short tasks tend to be easy and not specialized-knowledge-dependent (i.e. doing a web search), whereas the long ones tend to require far greater specialized knowledge and intelligence/problem-solving (i.e. ML coding tasks). So it winds up measuring an amalgamation of time horizon, coding ability, ML knowledge, problem-solving, etc. Very roughly speaking, it's a decent benchmark of (partly narrow) abilities useful for AI automation of AI progress.

quartz pike
#

Hey anyone here know code?

#

Cause just asking. Wich assistant did better since i dont understand code:

quartz pike
#

?

#

im a dummy

#

me no know

#

or how to put code somewhere and start it

astral kayak
#

hello

keen beacon
echo aurora
# alpine coral @echo aurora i had scroll through so many images and greetings before f...

This is a bit tricky as creating new channels doesn't always lead to community members using them for the topic of the channel, we're currently experiencing this in the other existing channels. What I've seen other communities do, and what I'm encouraging everyone here to do, is if you'd like to have a conversation stay on a specific topic -> create a thread for it. It won't be great for new people to join in on that conversation (as it'll get pushed up and off the general text area), but it does establish a more "private" place for members to discuss something specific.

quartz pike
#

oh wait im a dumbfuc

#

i just realised what you meant

#

ima just shut tf up 😭

keen beacon
#

Bro 💀

quartz pike
#

Chill english aint my first language 😭

#

im greek

#

Ελλαδα!!!!

#

-# that says greece in greek

feral python
#

aaaaaaaaaaaaaa

tribal peak
#

Hello

torn mantle
#

WAAAAAAAAAAAAAH

#

😱

feral python
feral python
keen beacon
quartz pike
#

cool

unborn spoke
#

Hello this Forum is so cool

echo aurora
zealous sparrow
#

MAI-1 is the stupidest model name i heard

humble sonnet
#

What is mai-1 ?

zealous sparrow
#

Medium Artificial Intelligence?? I dont know what the abbreviation stands for.

keen beacon
#

Lmaoo

zealous sparrow
#

Nevermind its probably just Microsoft Artificial Intelligence..

rustic knot
#

oai stream rn (it's incredibly bad)

ember whale
#

hi

quartz pike
#

Yall how good is mai

keen beacon
quartz pike
#

i expect it to be trash

#

since well... Microsoft

#

microsoft's models suck ass

zealous sparrow
#

I will test webdev

#

prob goin to be so bad

quartz pike
#

hmmm

#

how do you test webdev?

#

js asking

zealous sparrow
#

personally use codepen

#

for html

#

too lazy to open vsc

quartz pike
#

oh

#

me stupid

#

what is codepen

#

and how do i use it

zealous sparrow
quartz pike
#

And can you pls compare it to gpt 5 chat?

zealous sparrow
quartz pike
#

yes

#

but its worth it

#

lol

zealous sparrow
#

it takes a while to generate a simple html game

hollow imp
#

Gpt 5 high vs this new model

zealous sparrow
quartz pike
#

Maybe against gpt 5 chat or nano

hollow imp
#

So the new model is bs

#

I'm not even gonna try it

quartz pike
#

but high there is no chance

hollow imp
#

Then

quartz pike
#

But since its microsoft

#

its safe to assume its dogsh1t

zealous sparrow
#

im doing the high vs mai comparison
Topic: Single html shooter

#

simple

hollow imp
#

They were using o1

zealous sparrow
#

GPT 5 high is already done thinking and mai is still thinking

hollow imp
#

And now they merged with github and released this sh it

quartz pike
#

no wonder it was ass

#

o1 is respectfully. DOGSH1T

zealous sparrow
#

its obvious who is winning.

quartz pike
#

left lol

hollow imp
#

I'm saying before they released this

quartz pike
#

mai1 thinks its einstein

hollow imp
#

They were using o1

quartz pike
#

thinking that much

civic flame
#

mai-1 has been in development since early 2024 btw

#

☠️

zealous sparrow
#

unless mai servers died

quartz pike
#

DAMN

#

ALR ITS LOSING

#

EARLY 2024

#

THATS A 1 YEAR OLD MODEL

#

SONNET 3.5 RELEASED 1 YEAR AGO

zealous sparrow
#

btw

hollow imp
#

🤮

zealous sparrow
#

mai was designed to be a gpt 4 competitor

civic flame
#

4 turbo*

glacial mulch
#

lmao what is mai

civic flame
#

lol

keen beacon
#

mid ai

balmy mist
#

so is openai speech on seasame now?

zealous sparrow
keen beacon
#

ah

#

you meant that

whole swallow
whole swallow
#

Aight

zealous sparrow
#

yeah mai breaks if you tell it to webdev

#

its a generating loophole

civic flame
#

literally agi

zealous sparrow
#

been like 5 min

#

if it ever finishes and the game is worse than GPT-5s

#

i mean i told it to make a cafe site

#

it finished...

#

i am not going to lie

#

the site aint half bad

#

It found images.

supple vector
#

Lmarena free API when????

#

day one of asking for LMArena free ai api

zealous sparrow
#

my thoughts on mai

#

the model is slow but it makes decent sites

#

if LMArena errors at you during gen you are in for a lot of waiting

#

LMarena redid its captcha?

cyan zodiac
#

what could it possibly be writing

prime mulch
zealous sparrow
#

theres also a contact us section

#

the generation takes so long tho

#

im pretty sure the next prompt i gave it which was add JS is already generatin 10mins

#

i did say a lot of js

#

it errored 🙁

misty vault
#

@deep adder mai-1-preview is sydney from mcdonalds

zealous sparrow
#

mai-1 must have a very small context window

#

will error if you ask for too much

golden ocean
misty vault
supple vector
keen beacon
#

Guys

#

I just joined

#

What's up with the hype

keen beacon
echo aurora
keen beacon
wintry tinsel
#

No context

zealous sparrow
#

MAI is not that bad but also not that good

#

context window is prob small

#

will error if you are demanding

#

takes long to gen

#

[talkin webdev]

unborn ocean
magic stag
#

Who's the one with nailoong emojis?

magic stag
#

Saw this reaction in announcements

keen beacon
#

What's up with oAI streaming?

ocean vortex
#

They refuse to let that name go

keen beacon
#

why did they ever upgrade 4o when they have 5?

#

now we have to retest all these stupid benchmarks again

ocean vortex
#

oh wait I can't read

#

it's a previous model

#

it's now called gpt-realtime huh

quartz pike
#

mai*

ocean vortex
#

And now they renamed gpt4o-realtime into gpt-realtime, which makes sense actually

ocean vortex
#

As for gpt4o text model... Current version of that is gpt5-chat. If they haven't renamed, it would be that.

quartz pike
ocean vortex
#

it already cannibalized gpt4.1 so they held on to that name too long as is tbh

vast fern
ocean vortex
keen beacon
#

microsoft ai 1 😂

quartz pike
ocean vortex
ornate agate
keen beacon
#

yes 🤣 🤣

ocean vortex
#

What is it exactly though... is it like a huge model or renamed Phi... 🧐

keen beacon
#

theinformation had an article about it in may 2024

#

lol

ocean vortex
#

Lack of any metrics at all is not very inspiring

ornate agate
#

"MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. " :\

#

it better be really good if it used 10x the GPUs of DeepSeek

ocean vortex
raven heath
ocean vortex
#

First prompt impression: it's thinking for ages and can't decode things anywhere near as well as gpt5-mini

ornate agate
raven heath
ornate agate
#

theres literally no way anyone's gonna click on random audios posted with no words

ocean vortex
ornate agate
ocean vortex
#

correct decoding:

#

sh'it decoding:

ornate agate
#

it crashed on trying to do an aime question

#

lets try again

zinc ore
#

Bruh 14 day ago poll

ocean vortex
ornate agate
solid brook
#

bruh 2 of them really chose elon musk....

ornate agate
ocean vortex
wooden salmon
#

Oh thx

ocean vortex
#

MAI doesn't seem to actually be horrendous, but I really do not think it's gonna challenge the current best ones. Maybe og R1 level.

neon idol
gray junco
#

hello

hardy lion
unborn ocean
#

only openai ai is good at it

#

does not really reflect much

ornate agate
#

it seems that other AIs fail the base64 decode itself, for some reason

ocean vortex
#

There are endless possible prompts

unborn ocean
#

very easy to implement

#

not overfit in the traditional sense, yet still optimised and not reflective of overall reasoning power

ocean vortex
#

It's reversed and then encoded. So next level. But gpt5-mini can do it reliably so 👀
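
For readers who want to reproduce this kind of prompt, a minimal sketch of "reversed and then encoded", assuming the encoding is standard base64 (the thread mentions base64 decoding, but the exact prompt was never posted). The function names are invented for illustration.

```python
import base64

def encode_puzzle(plain: str) -> str:
    """Reverse the text, then base64-encode the reversed bytes."""
    return base64.b64encode(plain[::-1].encode("utf-8")).decode("ascii")

def decode_puzzle(blob: str) -> str:
    """Undo the two steps in the opposite order: decode first, then reverse."""
    return base64.b64decode(blob).decode("utf-8")[::-1]
```

With tool access this is a one-liner, which is why the test is only meaningful when the model has to do both steps in its own reasoning.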

unborn ocean
#

in my experience (which admittedly is from a couple of months, if not half a year ago) openai is just stronger than expected on this

ocean vortex
# unborn ocean very easy to implement

Not really. This kinda converts to many other reasoning tasks too. But they didn't specifically do this for sure as with tools it will solve it in seconds

unborn ocean
#

same with coding competitions: that doesn't mean overfit in the traditional sense either, it just means they have trained (rl'ed) a lot on the format

unborn ocean
ocean vortex
ocean vortex
sonic tendon
unborn ocean
ocean vortex
#

They just didn't. It makes no sense to do it for them. Target something specifically you can't even promote or sell to the public? Makes no sense at all.

#

Unless they did smth similar to improve the reasoning performance in general

#

but then it just kills your entire point

#

lol

unborn ocean
ornate agate
#

actually training specifically on this is exactly the sort of thing I imagine OpenAI to do.

unborn ocean
sonic tendon
#

this seems fairly trivial to test

ocean vortex
unborn ocean
#

and openai was king back then

#

and i felt like they did use it for training

ocean vortex
#

You are essentially saying "yeah it's better but only because they focused on making it better". Or like "yeah it is better on this thing but only because they focused on this specific thing they didn't advertise anywhere at all and there's no benefit at all". None of these statements make any sense whatsoever tbh

#

Metrics do though. No one has time to dig through this nonsense of what people are doing lol

#

They would degrade what actually matters to them if they were training so randomly

#

It's just one of the things that highlights where OpenAI is clearly ahead atm

whole wagon
#

Microsoft made their own LLM kekw Top 10 anime betrayals

sonic tendon
sonic tendon
#

and an R1 finetune I think?

ocean vortex
whole wagon
#

This is the first major llm release out of Microsoft

unborn ocean
# ocean vortex You are essentially saying "yeah it's better but only because they focused on ma...

i dont think you are getting my point:
they did not focus on this to sell the 10 nerds who try it out on their model, they probably used it in their rl training process to enhance other areas, why?:

  • it is really easy to implement (no humans needed)
  • one could suspect that the models would benefit from more concise and flawless reasoning in other areas as well (bc one flaw results in broken output and the token usage indicates efficiency)
  • it is really easy to implement curriculum learning on top of this (scaling the difficulty of the decoding problem)
  • no real overfit
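
The bullet points above can be sketched as a tiny task generator with an automatically checkable reward. This is purely illustrative: nobody in the chat knows what OpenAI's RL pipeline actually contains, and `make_task`, `reward` and the rule of 4 characters per difficulty level are made-up names and choices.

```python
import base64
import random
import string

def make_task(difficulty: int, rng: random.Random):
    """Return (encoded_blob, expected_answer); longer text means a harder task."""
    answer = "".join(rng.choice(string.ascii_lowercase) for _ in range(4 * difficulty))
    blob = base64.b64encode(answer[::-1].encode("ascii")).decode("ascii")
    return blob, answer

def reward(model_output: str, answer: str) -> float:
    """Binary reward that needs no human grader: exact match or nothing."""
    return 1.0 if model_output.strip() == answer else 0.0
```

Raising `difficulty` over time is the curriculum part, and a single wrong character drops the reward to zero, which matches the "one flaw results in broken output" argument.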
ocean vortex
whole wagon
ocean vortex
#

Well almost everyone is doing math or cyphers with reasoning

#

for training

unborn ocean
whole wagon
#

I actually knew about this Microsoft ai for ages. They operated in secret as a redundancy for if openAI ever betrayed Microsoft

ocean vortex
#

And decoding is fundamentally just math

#

So yeah... 🤷‍♂️

#

Someone just does it better than others lol

unborn ocean
whole wagon
#

Will be interesting to see how fast Microsoft climbs to the frontier

#

Everyone's first LLM is bad

sonic tendon
#

people seem to be sloptimizing for LMArena

unborn ocean
#

amazon, lol

ocean vortex
#

wdym. RL training for reasoning pretty much revolves around math. That's how you can see deterministic results and eval reasoning traces

#

Like that's not news lol

rustic knot
#

the entirety of AI is math

sonic tendon
#

i always liked them for their terse responses

#

but then

unborn ocean
#

same, still mentally ignoring that they are n.2 without style control

#

can't explain why

ocean vortex
#

This is still math if you write the prompt correctly. Nothing to do with puzzles when we are assessing just decoding. If you do base62 there's gonna be a TON of math

sonic tendon
#

i feel like DS might be the first to top Google on the non style control leaderboard

#

since 2.5

ocean vortex
#

Every model knows how to do it

unborn ocean
#

yeah, if huawei gets their sh*t together (big if)

ocean vortex
#

but not every model can do the math part with good enough precision

#

disagree. It's simply testing how well a model can reason and for how long while still being able to complete the task and not go offtrack

#

If it gives up half-way through and hallucinates an answer - you now know the limits of it.

#

tools render it an invalid test lol

#

But that's also why it is unlikely for AI labs to focus their training on it

#

It is not. ML models are not humans catgrin

#

Ever since reasoning became a thing, this is the way to test that (test-time compute) tbh

#

If it's allowed to reason as long as it needs to and it had good RL training, eventually it will arrive at the answer with high enough confidence

#

And generally... I don't think it should be a surprise to anyone that OpenAI is leading on fine-tuning at this stage to be completely honest. Their model is not the biggest, nor do they have the most compute

#

As I've said, it makes no sense at all to do this UNLESS it improves performance for them elsewhere they can show. So it's kinda irrelevant?

#

Testing this is meant to show what it can do that is applicable to IRL tasks. I think this does it well. Regardless of whether they found it to be the case by training for it or didn't train for it.

#

It is positively not possible for them to train for it "just because" if they hadn't found it improving performance on the things they can showcase and sell (published metrics). Like this is just ridiculous and totally not how things are done lol

tall summit
#

LMAO

ocean vortex
#

No I gave you the arguments lol

#

I find it unbelievable what mental gymnastics you are doing to turn some thing a model does well at into a negative. Coming up with some wild theories 🗿

#

Don't be THIS biased 👀

whole wagon
#

openAI does a bit of chess training I think. There's no way GPT5 is this good at it otherwise

ocean vortex
#

Dunno about chess... 🤔

#

maybe

#

lol what

#

You probably realise yourself this is nonsense

#

At least I would hope so

#

just stop it lol

#

example 236 from their RL gym ???

#

came up with the prompt myself, reversed it myself, encoded it myself. Have more prompts than just this one

#

well you are just spitting non-sense

#

🗿

#

My goal is to eval models regardless what lab was behind it. Believe it or not in my testing gpt-oss didn't do great overall at all

#

Even though I did have fairly high expectations

#

You think wrong...? Or maybe didn't read attentively

whole wagon
#

Gpt oss is not well rounded. Great in some things bad in others

ocean vortex
#

o4-mini considerably better model, like no comparison better, imo

verbal nimbus
whole wagon
#

o4 mini different price class. Compare to GPT5 nano

ocean vortex
#

It wasn't meant to compete with nano

whole wagon
#

The small 20b model is just straight up bad. Like qwen is better lol

ocean vortex
whole wagon
#

Yeah ik

ocean vortex
#

The smaller ones are naturally worse still

whole wagon
#

I'm just saying the small one is basically useless lol

#

Qwen 30b a3b outperforms

ocean vortex
#

Other than that we have no way of knowing what exactly they have done. You can say this about everything

#

but at a certain point you just need to accept the facts

ocean vortex
whole wagon
#

There are fine tunes that remove it iirc

verbal nimbus
#

OSS 120B is interesting, only 5B active parameters

ornate agate
wintry tinsel
rancid knot
#

Hello, I want to learn how to use Ai properly and change style of image and video

errant condor
#

Guys any way to generate 9:16 videos from an image?

empty briar
#

Hello, I am here to learn about the new AI tool with LMArena

golden ocean
#

true

maiden fox
#

helli, I love just you

white hatch
#

Any benchmarks where mini models are better than their big brother?

keen beacon
# white hatch Any benchmarks where mini models are better than their big brother?

High-level: yes, this happens. Smaller models can beat their bigger sibling when:

  • They’re task-specialized (e.g., a 7B code-tuned model edging out a general 70B on coding benchmarks like HumanEval/MBPP).
  • The benchmark rewards concise instruction-following (some 7–8B instruct models have topped their family’s larger base models on MT-Bench–style evaluations).
  • There are tight context/latency budgets (short prompts, low-temperature, or limited tokens) where smaller models are less prone to over-elaboration.
  • The pipeline uses strong retrieval/tools, making the generator size less decisive.
  • Evaluation variance is high and differences are within noise; smaller models can win specific subsets even if they lose on average.
#

got it to google

minor schooner
#

Hi all fellow AI enabled self expression utilisers..! Here to compare different models abilities to bring hard sci-fi animations with a few very exact requirements for details expected to be present, and keep the consistency in check.

errant rover
#

Guys what's the best AI for upscaling right now ?

bleak jewel
#

Hi everyone

rocky frigate
#

Hey I'm new to here tell me how to add a prompt and generate a video. How to type the prompt?

barren prairie
echo aurora
sonic tendon
#

wait, when did this happen?

#

i don't think it's even on OR yet

#

no idea, I haven't seen it till now

jovial wolf
#

when i try https://web.lmarena.ai/
it gives me like ksx files that are a big headache to run and are buggy, rather than the quick .html files im used to getting from chatgpt's site. does anyone know more about this?

like the sandbox doesnt work, ever. maybe ill just try a different browser?

urban mulch
#

New here

hardy lion
echo aurora
median oasis
#

Hello all, Here to attempt to 'level up' my content creation skills for non profit association member engagement

sonic tendon
#

this seems like it could've been when it hit the arena

#

if so it wouldn't be uncommon

echo sinew
keen beacon
rustic knot
#

god bless Qwen, they will overtake deepseek in fundamental ai research and development

ornate sleet
#

New here 🧡

rustic knot
keen beacon
#

To be honest we do not even have any good benchmarks to start with

jovial pelican
#

I'm curious to see and be a part of

lofty elm
#

nano-banana not showing in direct chat?

keen beacon
lofty elm
keen beacon
lofty elm
#

thankss

keen beacon
#

okay, this Qwen seems to be a bit better than previous versions

#

still far away from frontier

#

seems like they did something to reduce hallucinations this time

livid venture
#

Hello, LmArena

lofty elm
#

thanks its working

keen beacon
errant rover
#

yo

verbal nimbus
#

Q1 2026? 💀

#

The new AI Studio is bad, can't even scroll up

rare python
coarse charm
#

hi

fiery wraith
#

hi

muted delta
#

hi

gloomy badge
#

ya

rain rune
#

Hey

keen beacon
#

Did anyone know that AIs trained with reinforcement learning lose the ability to be creative?

#

Mode collapse harms esthetics: outputs start to sound the same, like “AI slop” or “ChatGPTese”, or look somehow similar, like “the Midjourney look.” This cripples creative uses like creative writing. And this damage can manifest in strange ways, like models refusing to write non-rhyming poetry & subtly steering non-rhyming inputs towards rhyming, or being unable to generate random numbers.

#

Basically, you just train a model to give precise and correct answers - and it starts to create art with the same precision and correctness as if it was a reinforcement learning challenge. It loses the ability to synthesize from weakly related and totally unrelated ideas - it loses the ability to be creative.

steep cypress
#

helo!

novel umbra
#

hello

gusty gull
#

Hi just discovered lm arena on discord 🙂

echo aurora
#

hey everyone ablobwave

exotic nebula
molten fern
#

hello

exotic nebula
exotic nebula
#

bro i thought you were just a mod 😭

slow galleon
#

hello

limber acorn
#

hello

tired lichen
#

hi there

stoic raft
#

hi.. im newbie.. how generate video here?

willow steppe
#

hello

quartz light
#

bruh

#

where is the coding model

keen beacon
#

hello

granite igloo
#

Hello I have reached the limit for nanobanana, it's telling me to wait another 50 mins to generate so I have a question for that, How many Images can I generate before hitting the limit? I have generated around 15-20 images before hitting the limit and once the 50 minutes pass will I be able to again generate 15-20 images or will I be limited to a few images only? TIA

hollow horizon
#

Hello

limber crow
#

is the video generation model random?

#

I can't choose the model

glossy tinsel
#

Hello

desert abyss
desert abyss
limber crow
#

🤔

lilac pivot
#

Hello. I'm new to this arena.

proven reef
#

hi

echo aurora
echo aurora
echo aurora
vague owl
#

hi everyone

lone notch
#

helo

keen fulcrum
quartz light
keen fulcrum
#

I was expecting more as well

quartz light
#

wait

keen fulcrum
#

Its ok for its price

quartz light
#

is that the only thing??

keen fulcrum
#

Yes

quartz light
#

seriously????????

#

ive been waiting for over a month

#

for that sh?

#

@deep adder grok fell off

quartz light
#

its many times worse than gpt 5 nano

whole wagon
astral pagoda
quartz light
#

idk

#

yall check this out

#

its goofy but

#

i guess camera rotation kinda works

#

ill fix in like 10 hrs

#

btw it uses nanobanana

#

and yes it is super inefficient but uhhh

#

yes

empty salmon
#

Did Google aistudio remove the 9:16 aspect ratio for veo?

latent pagoda
#

hello

empty salmon
keen beacon
#

Insanely difficult task for LLMs:

What emotions does the following piece of music evoke? Why? Conduct a comprehensive music theory analysis.

Kick = 1/4 * 32
Snare = 1/16~, 1/32~, 1/16, 1/8, 1/16~, 1/16, 1/16., 1/32~ * 64
Closed hat = 1/16 * 128
Open hat = 1/8 * 64
Bass = (B1 1/4 * 6, G#1 1/4 * 2) * 4
Celli pizzicato = ((B3 1/8 D#4 1/8 F#4 1/8) * 21), B3 1/8
Organ = (D#5min 1/8, 1/8~, D#5min 1/8, F#5maj 1/8, C#5maj 1/8, 1/8~, C#5maj 1/8, F5dim 1/8, B4maj 1/8, 1/8~, B4maj 1/8, F#5Maj 1/8, A#4min 1/8, 1/8~, A#4min 1/8, C#5maj 1/8) * 4
Choir = (B4maj 1/4.., F5dim/B4 1/4.., F5Dim/B4 1/8, C#5maj/G#4 1/2, C#5maj 1/2) * 2
Trumpet = (B3+G#4 1/4.., F4+G#4 1/8., C#4+G#4 1/8., C#4+G#4+B4 1/8., C#4+B4 1/4.., D#4+B4 1/4.., D#4+B4 1/8, B3+G#4 1/4.., F4+G#4 1/8., C#4+G#4 1/8., C#4+G#4+B4 1/8., C#4+B4 1/4.., D#4+A#4 1/4.., D#4+A#4 1/8) * 2

Legend:
C#5maj/G#4 1/2..

C - note label
5 - octave label
maj - chord quality
/G#4 - note in bass (for inversions)
1/2 - note duration

Other:
. - dotted note
~ - pause
B3+G#4 - notes sounding together

#

Qwen 2507 Thinking thinks that the tempo is

Implied Moderate to Slow (Dotted half notes in choir, long bass notes). Likely 70-90 BPM – slow enough for emotional weight, fast enough for rhythmic drive.
Which is absolutely ridiculous. Other models, however, determine it correctly at 120-160 BPM.

#

But if I tell that it is 70 BPM in the prompt, they all pretend that the tempo is "slow and ceremonial march", which is again ridiculous

#

That Qwen is absolutely horrible at this. It invents things I never included in this track and never ever studied:

Organ/Choir/Trumpet Progression: This is a direct adaptation of Pachelbel's Canon progression (I-V-vi-iii-IV-I-IV-V), but transposed to B minor with minor-mode alterations

Picardy Third (B4maj): The tonic chord resolving as major (B-D#-F#) instead of minor (B-D-F#) is the emotional core. It transforms expected sorrow into luminous hope, like sunlight breaking through clouds. This is the primary source of "bittersweet" tension.

#

Okay, maybe I accidentally rediscovered Pachelbel's Canon, but there is literally no Picardy third at the end here. All chords are strictly diatonic to F# major/B Lydian, and the mode is definitely major, so it can't be a Picardy third. It completely made it up.
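
For anyone who wants to sanity-check the "strictly diatonic" claim, here is a rough sketch that reduces everything to pitch classes. It is only an illustration: the chord and note lists are copied by hand from the prompt above, the helper names are made up, and F5dim is treated as an enharmonic spelling of E# diminished.

```python
# Pitch classes: C=0, C#=1, ..., B=11 (sharps only, to match the prompt's spelling).
NOTE_TO_PC = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
              "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]    # semitone offsets of the major scale
LYDIAN_STEPS = [0, 2, 4, 6, 7, 9, 11]   # major scale with a raised 4th
TRIADS = {"maj": (0, 4, 7), "min": (0, 3, 7), "dim": (0, 3, 6)}


def scale(root, steps):
    """Return the set of pitch classes of a scale built on `root`."""
    return {(NOTE_TO_PC[root] + s) % 12 for s in steps}


def triad(root, quality):
    """Return the set of pitch classes of a triad built on `root`."""
    return {(NOTE_TO_PC[root] + i) % 12 for i in TRIADS[quality]}


# Chords from the organ/choir lines (octaves and inversions ignored).
chords = [("D#", "min"), ("F#", "maj"), ("C#", "maj"),
          ("F", "dim"), ("B", "maj"), ("A#", "min")]
# Single notes from the bass, celli and trumpet lines.
single_notes = ["B", "G#", "D#", "F#", "F", "C#", "A#"]

f_sharp_major = scale("F#", MAJOR_STEPS)
b_lydian = scale("B", LYDIAN_STEPS)

used = set().union(*(triad(r, q) for r, q in chords))
used |= {NOTE_TO_PC[n] for n in single_notes}
outside = used - f_sharp_major  # anything here would be chromatic

print(f_sharp_major == b_lydian)   # B Lydian and F# major share the same notes
print(sorted(outside))             # empty list means everything is diatonic
print(NOTE_TO_PC["D"] in used)     # D natural would be needed for a B minor tonic
```

Both scales come out as the same pitch-class set, nothing falls outside it, and D natural never appears, so there is no B minor tonic for a Picardy third to replace.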

#

New Qwen-Max which is at LMArena right now is much better. But it is still horrible, making up things and hearing modes that are simply not present in the track

#

Deepseek V3.1 thinking in the chat is much better. It makes something up sometimes, but at least it is able to clearly identify the tonal center at B most of the time, and call it a major mode. Which is very close, because it is actually B Lydian, a major mode of F# Major scale.

whole wagon
keen beacon
#

The worst offenses I can accuse Deepseek of are "chromaticism" and "modal interchange" (which are bs because all chords are strictly diatonic here), but not goddamn Picardy thirds

#

Qwen on other hand completely slops everything up. Also the first model that calculated the tempo to be 2 times slower than it actually is.

zinc ore
keen beacon
#

GPT 5 is the only model that correctly identifies B Lydian. It succeeds around 2-3 attempts out of 10, but it is still better than Qwen and Deepseek that never managed to do it.

#

Also the only model that keeps insisting on F# or B Major and disregards other modes, which is almost correct

ocean vortex
whole wagon
ocean vortex
#

Last week, we quietly released grok-code-fast-1 under the codename sonic. During this stealth phase, our team carefully monitored community channels and deployed multiple new model checkpoints to address feedback.

So basically 'sonic' was no good and they buried the results. 🗿

#

Their blogpost absolutely screams "AI written" lmao

grok-code-fast-1 was crafted to shine in the tasks developers face every day, striking a compelling balance between performance and cost. Its strength lies in delivering strong performance in a economical, compact form factor, making it a versatile choice for tackling common coding tasks quickly and cost-effectively.

wild galleon
#

How can I use the LMArena.ai API with my Open WebUI?

unborn ocean
#

If you are looking for resources in general.

ocean vortex
keen fulcrum
zinc ore
#

Where's the coding benchmarks (outside of cyber security)

keen fulcrum
#

If they were that proud of their coding model we would have seen promotions by Musk

ocean vortex
# keen fulcrum

biology and chemistry I wouldn't classify as coding. Cybersecurity cybench perhaps, with a stretch... Though I wouldn't expect to be looking at this and nothing else for a coding model

EDIT: Ok they did low-key mention SWE in their news article. That's something I suppose

pine kraken
#

https://x.com/noob_contrarian/status/1961335098327929081

Hope you all are having a good day. I just wanted to take a moment and tell you all about a research study I’m conducting for an app that focuses on AI Brainrot Detox with some new features and benefits for people who are struggling. I would really, really appreciate it if you all could just take out a few minutes from your day and participate in this.
I’m reiterating that this is completely anonymous so please don’t feel uncomfortable.

working on recovering from AI overload - quick poll to see who else needs help: [https://t.co/dcy1UF0iOG]

(anonymous, of course; we’re all in this together)

fleet narwhal
#

hello..just wanted to learn more about AI,that is why joined..have a good day to all

tender trellis
#

What is this

vapid sapphire
#

hi

leaden sun
gaunt void
#

Hello

cedar estuary
#

hi everyone!

tall summit
#

lesswrong moment, disregarding it immediately

leaden sun
#

ok? i should check their credibility then 😅

rich field
#

hi

leaden sun
rich field
#

🙂

#

what is it all about?

leaden sun
#

well...talking about benchmarking mostly 😆 feel free to look around and play with image/video generation

rose vessel
#

hello all !

fleet lintel
#

I am looking for some hard (text-based) questions that we expect LLMs should be able to solve but currently can't

leaden sun
daring schooner
#

hello

neat quarry
#

Hello!

#

how i make videos here?

dusky tangle
#

Hello! 🙂

wicked thicket
#

hello

neat quarry
#

I sent my prompt, but the video wasn't made T-T

#

How long does it take to make?

#

Damn, I think I've already used up my daily credits even though I haven't done anything lol

wanton knot
#

Hello!

stark sand
#

they are enjoying burger in a happy and smiling mood

grand pewter
#

Hii

latent crest
#

Good morning guys , can nano banana do nsfw content or soft nsfw content ?

merry geode
#

Helloo

midnight mesa
#

i love lmarena

tough onyx
#

heloo

golden ocean
#

-# (Claude Opus 4.1 thinking is included in the second option)

quartz pike
#

yo yall

#

on lmarena

#

is there a limit to how many messages i can send

#

or image requests to an ai?

#

-# personally i think yes. because it would cost a SH1T ton of money for lmarena to run o3 for free to anyone. and 4.1 opus.

#

-# but im not sure so pls fact check me

lilac inlet
quartz pike
#

oh lol

#

what is the limit?

quartz pike
#

its a really great platform

lilac inlet
# quartz pike oh lol

I think it depends on your token. If you're using it to generate a long code, then it won't even run thrice!

quartz pike
#

damn

#

so its a token limit not a run limit.

#

Got it.

willow grail
#

hm

#

banana sucks at

convert style of image into japanese anime style. this shows a dinosaur anime. we see a ceratosaurus.

#

someone help me ?

quiet kestrel
#

How to make video

#

Pls tell me

#

Hello?

willow grail
#

this is the best so far which it could make

#

but its not really anime

keen beacon
#

Is there any Ai generator that does nsfw content? Or I meant say everything in general literally

willow grail
#

convert picture of dinosaur so it looks like manga death note and similar manga.

chilly parrot
#

hi

viscid sparrow
#

hi

quiet kestrel
#

How to see my videos

rocky wedge
#

Hi

midnight mesa
#

is it possible to nano banana make videos

lofty compass
#

a 3d wireframe of a battle ship

#

/video a 3d wireframe of a battle ship

timber iris
#

Does GPT 5 HIGH in lmarena have thinking?

frank elm
#

I am exploring the new Features

echo sinew
timber iris
# neon idol Yes

Great, i guess it have auto thinking, sadly we can't see what are they thinking

cinder oar
#

yoh great to get a chance for different ai.

keen beacon
#

Sad that we can't really see the reasoning process.

willow grail
hybrid yacht
#

Hi all, complete newbie to this Ai thing,
just here to learn

willow grail
#

how did you find lmarena

hybrid yacht
willow grail
#

oh thats a new youtuber

hollow ether
#

Hello there! It was good to be here and learn more about AI

boreal rampart
#

hello! I'm here for make a content video AI!

stray dock
#

broooo wtf just booted up LMArena and all my previous prompts are GONE

#

WTF

willow grail
exotic gust
eager crag
#

is it... down?

exotic gust
#

seems like it

#

nope

eager crag
#

no?

exotic gust
#

back to normal

eager crag
#

oh yeah it's back up cool

exotic gust
#

prolly some cloudflare issue

eager crag
#

or it's me who did too many prompts

rich compass
#

fix your @$$hole site😎

eager crag
#

hey, not nice.

rich compass
#

im trying to fix my script

#

AND SITE

#

JUST DOWN

dense sphinx
#

Mine not working too

eager crag
#

me too, but it isn't necessary to insult the developers.

dense sphinx
#

Is it bug again?

eager crag
#

they're gonna fix it like they always did. patience is a virtue.

opal hamlet
#

WTF?

rich compass
#

badass site

#

dont worry

#

your ass soon gone

dense sphinx
#

JN Pavel durov dangerous.

rich compass
#

IM NOT IN THE DANGER SKYLER

#

I AM THE DANGE

#

R

dense sphinx
#

Can it be fixed?

eager crag
#

it can be, just an outage.

#

i'm guessing at least.

rich compass
#

grok com

#

idk

#

5 requests

sonic flax
#

Hello friends, I get the message "No models found," how can I fix this?

tight oriole
#

🙀 💀 ✌️

rich compass
#

kitler

tight oriole
eager crag
sonic flax
#

It's back, friends.

dense sphinx
#

@rich compass where?

#

On LM?

rich compass
#

grok dot com

merry wren
#

hi

#

im having an issue

#

there are no more models on the model selector

night geode
#

Hi

merry wren
night geode
#

I am completely new to this platform, how can we generate videos in the website?

night geode
#

Alr

#

Thanks

static stream
#

hello

merry wren
merry wren
lean coral
#

hey

#

is there any way i can generate more thn 8 vid in 4 hr

echo aurora
echo aurora
rugged brook
#

is sonnet 4 thinking better than gemini 2.5 at coding

lean coral
#

@echo aurora there is limitation for generating image?

rugged brook
#

ye

lean coral
#

ah

#

i was looking for any ai close to veo 3 that gives vids with sound without any limitation

merry wren
merry wren
quartz pike
#

Tuff or nah pls vote i wanna see what sh1tty ai made the one on the right.

sterile pulsar
#

Hello!

quartz pike
#

hello

lean coral
#

hey

rose lintel
#

hi

#

im new here

upper venture
# quartz pike Tuff or nah pls vote i wanna see what sh1tty ai made the one on the right.

It makes sense. Veo 3 is one of the best video GenAI models rn, and Seedance isn't made for that type of gaming video after all. It can generate them, but as the name suggests, and given the company that made it (ByteDance, owner of TikTok and others), that isn't the main goal: it's more about dances, IRL videos, etc., not games, even tho it can generate them xD.
Btw what an impressive result from Veo 3 even with the audio

lean coral
#

can i generate ai vid in lmarena compare model

upper venture
#

ig, I didn't use or test it yet, I just like seeing others' prompts and gens lol

willow grail
#

wtf is happening with users

#

so many japanese nicknames

unborn ocean
#

i actually called that it would be the lite model

#

very easy to notice

upper venture
#

It is a capable model ngl

#

Not as capable as Veo 3 tho. But yes, this result doesn't show all the power that the Seed team created with their GenAI models

unborn ocean
willow grail
unborn ocean
#

nah, again the name change, lol

#

like the 10th

willow grail
#

i played catch with wild crows

#

its fun lol

#

they need to train tho

dense geyser
#

hello

willow grail
#

i will teach them how to catch food in air

willow grail
#

so u go up there and throw the food down and they try to catch it

unborn ocean
willow grail
#

if its baby season then u just need to feed em and they stop attacking

unborn ocean
#

like they kill (weak) pigeons and stuff like that

willow grail
unborn ocean
#

yeah idk, usually it is just old looking ones (and then they team up etc.)

willow grail
unborn ocean
#

was probably just confusing em with ravens (because we dont really have any of them)

#

thinking about it

#

i was also surprised

willow grail
#

ravens usually are parents and children.
they dont do big groups like crows

unborn ocean
#

well i did see ravens / crows teaming up on single pigeons

willow grail
#

lol

#

where u live

#

was it winter?

#

achso

unborn ocean
#

maybe i was drunk

willow grail
#

no idea

unborn ocean
#

but saw it multiple times, so idk

willow grail
junior forge
#

Hi, im abdoulaye from fench nice to meet all!

willow grail
#

whats going on so many new people

echo aurora
#

hi everyone :ablobwave: lots of new folks, welcome welcome! Don't hesitate to ping me if you have questions or problems with the site :blobthanks:

misty vault
dusty niche
#

Guess the model

clear trellis
modern flume
#

how to generate video

dusty niche
modern flume
remote lagoon
#

even has an extra btn that links to webdev arena!

dusty niche
#

video arena is chat here not a website

echo aurora
echo aurora
ripe orbit
#

hello

echo aurora
upbeat idol
#

hi

rustic knot
#

oh this model literally said it was everyone

quartz pike
#

WHAT

#

YOU ARE HERE

#

WWWWW

austere kiln
tall owl
#

hello

teal mantle
#

what is the gpt 5 pro quota of chatgpt pro?

#

team is 15 per month

weak timber
#

Hello

torpid roost
#

hi

blissful carbon
#

hi

echo aurora
#

hello 👋

nocturne fiber
#

happy to be here

teal mantle
grizzled plaza
#

hello world

vagrant idol
#

Hello

median ginkgo
#

Anyone here knows how i can do the image to video thing

naive sedge
#

Then*

echo aurora
#

not /image

#

/image creates images, /video creates videos, /image-to-videos makes videos with a reference image

leaden palm
#

hi guys - this place has grown into something new.

remember when there was just one lmsys discord, when the icon was the same as gradio's or vicuna's, when there were just a few million votes? these days with all the new image, video, and text models, lm arena has really become a phenomenon.

thing is that i've also become a person who isn't as interested in these new video models, these hyperactive chats, these things that come with scale. i haven't even been using lm arena or moderating the chat that much lately. as such i'm leaving this server and leaving the job to the other moderators here.

it was great being with all of you. see you around, hope to discuss some more things then 👋

fathom plover
#

hi

echo aurora
#

I really appreciate all of the help you've put into this community! You shall be missed ❤️

sterile saddle
#

helo to everyone

proud hazel
#

Hey @echo aurora, how do you prefer to be contacted about “private” matters? Is it okay to just send you a DM?

echo aurora
inner gate
#

What’s up pals

umbral veldt
#

can't wait to try video gen!

sweet imp
#

Yo YO

quartz pike
#

Chat i think gpt image 1 is cooking

proud hazel
quartz pike
#

it crashed

#

i cri

#

me depresso

#

me die

proud hazel
#

🫡

verbal nimbus
quartz pike
#

it aint direct

#

its compare

#

or wait

#

is me stupid

#

yeah

#

compare

magic rock
#

How to generate video with audio?

quartz pike
#

L u c k.

#

pray to lmarena discord

#

to give you veo3

keen beacon
#

ah

#

with audio

#

it is random, so no

neon idol
quartz pike
#

yes

raw arrow
#

hello

echo aurora
sour wyvern
#

Hello

dense quail
#

Hello I am here to learn more on how to generate good videos from prompts

echo aurora
hollow geyser
#

hello, video generation

ocean vortex
upper venture
#

And listen, I'm not a big fan of Google or Gemini.

ocean vortex
upper venture
#

But from all the texts here and even that Minecraft first pov generation, I see why people talk so great about it

#

Veo 3 is really capable (I mean, expected from the owners of YouTube and the ones who have the largest amount of GPUs out there)

proud hazel
#

Kling 2.1 Master, Wan 2.2 and Seedance 1.0 Pro all generate higher-quality videos. However, Veo 3 has audio.

upper venture
#

That's a big thing

proud hazel
upper venture
ocean vortex
upper venture
#

Lm arena separates the Veo models from sound and non sound. Yes, the difference is big too. The ones with sound have way more points. But the ones without sound are still scoring better than the competition. Don't get it wrong. Veo 3 is the most capable video model.

proud hazel
#

In most cases, I actually find the results from other providers to be better than those from Veo itself. But taste is subjective.

upper venture
#

Yes and the majority agrees that Veo 3 is better. It is trained on way more data, follows prompts closely and also comes from Google and all its processing power.

#

That said, I prefer Wan, it's on the bottom of the rank but c'mon. Unlimited free generation is better than limited generation from Hailuo, Kling, Seedance pro or even Veo.

upper venture
# proud hazel Unlimited Wan?

Yes! Qwen/Alibaba provides Wan for free. Unlimited. Like there's guardrails I believe (you can't be making 100 videos per minute with a macro or bot). But aside from that, it's truly unlimited! How many videos you want to make you just go there and make.

#

At least for the time being

proud hazel
#

480p or 720p?

upper venture
#

I prefer it but I don't use it lol

ocean vortex
proud hazel
ocean vortex
#

If majority of people agree on their "subjective" opinions, that kinda becomes not subjective anymore. The findings 👀

#

With bias obviously ruled out since it's a blind testing/voting

upper venture
#

LM Arena (this server) ranking as of last update

upper venture
ocean vortex
upper venture
ocean vortex
#

this seems "slow unlimited"

#

How long does this typically take?

upper venture
#

Yes. It is slow

#

A few mins. A membership isn't needed: even free members get, I believe, 50 coins per day, which is enough for 5 fast generations per day.

#

People need to understand that those video models are extremely expensive to run. Being given unlimited generations per day is still free, even if each one takes half an hour or an hour. Remember that it'll be using Wan 2.2, their best model, too.