#general

1 messages · Page 81 of 1

raven helm
#

@flint sandal

flint sandal
raven helm
#

🙃

#

Which provider offers grok 4 heavy and o3-pro

#

?

flint sandal
#

Use the "all providers" provider

ocean vortex
flint sandal
#

There is a section for all

#

But i think they removed o3 pro

raven helm
#

Thanks!

raven helm
flint sandal
#

Grok 4 heavy is still there but it doesn't work

woven surge
#

Which free tools do u guys use

raven helm
#

I just learnt about g4f

#

(g4f.dev)

#

Like LMArena but more features and better

#

now I will only use LM Arena for image generation and even then they need to add the ability to select the aspect ratio

woven surge
#

Are you talking to me?

raven helm
#

Yep

stray aspen
#

It's not better

#

It doesn't have good models

raven helm
#

It has every model

woven surge
#

Is it a website or what

raven helm
#

At one point it had o3 pro

flint sandal
#

There is image generation on g4f too

ocean vortex
raven helm
#

Ok, that’s it, they are better

flint sandal
#

Use allproviders

flint sandal
raven helm
#

Yep

#

I never heard about them, thanks!

woven surge
#

Guys, can I use this on a phone?

stray aspen
#

What do you mean use all provider

flint sandal
flint sandal
raven helm
woven surge
raven helm
#

Is there video? No, right?

woven surge
#

Ok

#

Ok

flint sandal
raven helm
#

Yea, video costs too much money, right?

flint sandal
#

There was like a site where you can request someone to generate something with best models and it was free but i dont remember what site it was. It was pretty active for sure and i almost always generated videos only by requests

#

Or it was like discord channel

ocean vortex
raven helm
flint sandal
raven helm
woven surge
#

I found a server with good video gen quality but due to a lot of traffic it is down

flint sandal
ocean vortex
flint sandal
#

And gemini 2.5 pro

woven surge
#

Can someone pls explain this interface to me

raven helm
#

Yea, it’s bad on mobile

#

Technically you can use it, but it won’t be a good ui

flint sandal
#

But sometimes you may get output like this

woven surge
#

Oh ok there is something like pollination and openai...
But how do you change the model

raven helm
#

Can someone explain what each one of these does?

ocean vortex
#

Grok4-heavy, mystery models etc those would be interesting to try

raven helm
#

But for other models like Claude 4 opus you can’t add system prompt unless you have the api but here you can

ocean vortex
#

but they are all either down or removed

flint sandal
#

Idk i just use g4f cuz i have all in one place and i can use og models

flint sandal
#

Opus 4 thinking is working too

ocean vortex
raven helm
#

Oh, they have mystery models?

flint sandal
#

Idk for me all works great

ocean vortex
#

and all remaining lmarena ones

ocean vortex
#

yeah they do

#

but those don't work lol

raven helm
#

I know, but you need to use battle mode to get them and here you can get them directly

flint sandal
#

Is there still the gpt-5 secret model on lmarena? Because i cant find it on battle

ocean vortex
raven helm
#

Oh

#

They have video generation!

flint sandal
#

Use lmarena web not lmarena new and it should work, but for me new and web works perfectly fine

raven helm
#

I just use the auto provider

flint sandal
wheat onyx
raven helm
#

For me it shows others

#

I don’t have Sora

#

On that list

flint sandal
#

Use provider name video (video generation), there is search and sora in it

#

But sora takes like 30 minutes here to generate

#

Or just gives me error

raven helm
#

Oh

#

If you go to auto, there is a lot more

flint sandal
#

If you want veo 3 for free. Download aSim app on phone and search for "Glow" then you can generate one video per day

#

With sound

#

And not even the fast model, the quality one.

#

2 videos per day if you are a new user

raven helm
#

Oh wow

#

They don’t have o3

#

Why is that?

woven surge
flint sandal
#

Search aSim on google play, and on the aSim search Glow

flint sandal
raven helm
#

g4f

woven surge
#

Ok

flint sandal
#

The auto provider doesn't have it

#

AnyProvider has it

woven surge
flint sandal
#

ASim

#

Not esim

raven helm
woven surge
#

Asim build and share?

flint sandal
#

Wait. G4f just removed lmarena new, legacy and op. There is no o3, and grok 4 heavy

raven helm
#

Yea

#

There was just HR

#

Or something like that

#

LMArena HAR

#

That’s the only one which exists

flint sandal
#

Yeah

raven helm
#

Why did they do this?

flint sandal
#

g4f is confusing now.

#

I guess its time to pay for ai...

raven helm
#

No, I propose: go to LM Arena when you need to use one AI, but if it doesn’t exist there go to g4f

flint sandal
#

And if it doesn't exist on g4f use opus 4 thinking on lmarena

raven helm
#

🙃

#

This is what LMArena HAR is

brave orbit
#
poll_question_text

What Is The OS you Loved The Most

victor_answer_votes

9

total_votes

14

victor_answer_id

1

victor_answer_text

Windows

victor_answer_emoji_name

🪟

woven surge
#

Guys listen

#

U can get Veo 3 for free!!

#

Is anyone listening?

hollow imp
#

What about 0.2 fps

raven helm
#

Niceee!

gusty loom
woven surge
#

Do u need just the prompt

woven surge
# gusty loom How?!?!?!?! can you also give the prompt?!?! ITS CRAZY!

{
  "prompt": "A hyper-dynamic and cinematic 8-second Coca-Cola commercial focusing on the ultimate moment of refreshment. The ad is a rapid, sensory explosion of cold, fizz, and vibrant joy, culminating in the iconic brand reveal.",
  "duration_s": 8,
  "style": [
    "cinematic",
    "hyper-realistic",
    "vibrant high-contrast colors",
    "shot on ARRI Alexa with anamorphic lenses",
    "energetic",
    "sensory",
    "uplifting"
  ],
  "negative_prompt": "slow, dull, blurry, distorted logo, weird hands, flat lighting, generic, sad",
  "scenes": [
    {
      "prompt": "Extreme close-up. An iconic glass Coca-Cola bottle, covered in shimmering ice-cold condensation, is opened in glorious slow-motion (240fps). A fizzy mist erupts from the cap with a satisfying 'psssht'.",
      "duration_s": 2.5,
      "camera": ["macro detail", "slow motion"]
    },
    {
      "prompt": "A dynamic match cut. As the bottle tilts to pour, the scene instantly cuts to a person's face, eyes closed in pure bliss as they take a refreshing drink. The background explodes into vibrant, joyful color and light, as if the drink transformed the world around them.",
      "duration_s": 4,
      "camera": ["tight close-up on face", "energetic whip pan effect", "beautiful lens flares"]
    },
    {
      "prompt": "The final hero shot. A perfect, glistening glass of Coca-Cola, filled with ice and fizzing bubbles. The shot is clean and crisp. The red Coca-Cola logo is perfectly framed in the background.",
      "duration_s": 1.5,
      "camera": ["pristine studio quality product shot", "static"]
    }
  ],
  "generation_settings": {
    "high_quality": true
  }
}

Here it is
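
For anyone reusing a prompt structured like the one above, a quick sanity check is that the per-scene durations add up to the top-level `duration_s`. The sketch below is only an illustration: the JSON layout is this user's own free-form prompt, not a documented Veo 3 schema, and `scene_durations_match` is a hypothetical helper name.

```python
import math

def scene_durations_match(prompt: dict, tol: float = 1e-6) -> bool:
    """Return True if the scene durations sum to the overall duration."""
    total = sum(scene["duration_s"] for scene in prompt["scenes"])
    return math.isclose(total, prompt["duration_s"], abs_tol=tol)

# Durations copied from the prompt above; the other fields are omitted for brevity.
prompt = {
    "duration_s": 8,
    "scenes": [
        {"duration_s": 2.5},
        {"duration_s": 4},
        {"duration_s": 1.5},
    ],
}

print(scene_durations_match(prompt))
```

Here 2.5 + 4 + 1.5 equals 8, so the three scenes fill the 8-second clip exactly.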

gusty loom
#

How did you generate it?

woven surge
#

Using Veo 3

gusty loom
#

No I mean the prompt

woven surge
#

For Free

gusty loom
#

How did you get for free?

woven surge
gusty loom
#

added 🙂

woven surge
ocean vortex
willow grail
#

gpt5 high is agi!!!

#

@rare python what?

rare python
#

@civic flame so many people use your benchmark without source

rare python
willow grail
#

ok and? since when does anyone post.... twitter links

#

since when do ai people care about copyright lol

rare python
willow grail
#

i thought one gets gpt5 for free

#

why else would they give out it to specific people

#

ps: he gets his respect on twitter etc

civic flame
#

😭

willow grail
#

i just wanted to show the pics, that's what matters. people will find ur posts either way if they ask or search for this

civic flame
ocean vortex
# willow grail gpt5 high is agi!!!

🧵 Grok-4 scores 45% on Hieroglyph, making it the SoTA model publicly available and putting it on-par with GPT-5.

Observations:
- This is an impressive performance
- This model reasons for an extremely long time (10-20 mins per question); often tries to "brute force" the answer

Quoting leo 🐈 (@synthwavedd)

o3-pro and Grok 4 should be done by the end of tomorrow! thanks for the support today, goodnight

**💬 24 🔁 8 ❤️ 134 👁️ 13.9K **

#

but this benchmark seems full of sh'it tbh. o4-mini-high scores 5X of o3-high. There's no dataset, paper, or anything short of that tweet

#

Doesn't look like it's reliable at all lol

ocean vortex
#

ArtificialAnalysis doesn't really have benchmarks in their test suite that test things bigger models are good at, not yet at least. So the result there is not too surprising, o4-mini-high is great at most benchmarks. But what it certainly isn't is scoring 5X of the o3 score in any established and reliable metric lmao

leaden palm
#

idk... i think benchmarks that measure different things exist and are good, and the same training techniques that make o4 mini really good at stem may also make it really good at this benchmark

keen beacon
#

i think the benchmark is interesting, but there might need to be more questions

ocean vortex
leaden palm
#

yeah more models and more questions needed

#

right now, o3's CI is 0.9%-23.6% and o4-mini's is 11.2%-46.9%
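
For context, the two intervals quoted above match what a 95% Wilson score interval gives for 1 out of 20 and 5 out of 20 correct answers. The chat never states the question count or the interval method, so both are assumptions here, and `wilson_interval` is an illustrative helper, not the benchmark author's code. A minimal Python sketch:

```python
import math

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Assumed counts: 1/20 correct (5%) and 5/20 correct (25%).
for name, correct in (("o3", 1), ("o4-mini", 5)):
    lo, hi = wilson_interval(correct, 20)
    print(f"{name}: {lo * 100:.1f}%-{hi * 100:.1f}%")
```

Under those assumed counts the intervals overlap between roughly 11% and 24%, which is why more questions are needed before reading much into a 5X gap.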

ocean vortex
#

or is it just the fact alone that they mentioned "lateral reasoning" in their tweet?

keen beacon
#

leo has posted one of the questions before in the past (i think before he had a formal benchmark and i find that type of question interesting)

steady vale
ornate stump
torn mantle
#

also im kinda curious if there is any recency included

#

since its hieroglyph i dont think it is

#

but still we need more details

#

also grok 4 isnt bad at reasoning

#

its just that its strict at following the normal token distribution

#

which makes it more generic

#

the answers given by grok 4 can be generated by any mid-sized model

#

you dont feel the uniqueness

flint sandal
torn mantle
#

and also its reasoning is so so inefficient

stray aspen
#

where did that gpt 5 benchmark come from

flint sandal
#

I remember "leaks" of GPT-5 in 2023.

torn mantle
#

which makes me question the intelligence of the instruct model

torn mantle
#

and also approach used for cot

#

i dont know what they are using exactly but its just so inefficient

#

it's not well balanced as to when 'a lot of reasoning' is needed or not

#

feels like a weaker gemini 2.5 pro version

#

also it didnt improve a bit on multilingual

#

and lets not forget how its so bad at coding

stray aspen
#

@torn mantle where did you get this GPT-5 benchmark?

torn mantle
#

leo shared it

stray aspen
#

ok

torn mantle
#

a model that cant generate a good UI/UX is not worth it

#

i know they separated the models, grok coder and grok 4

#

but still

#

grok 4 is so bad at UI, its just so embarrassing, cant you just run half of the data used in grok coder into grok 4?

#

they seem so lost

#

without any clear objectives/goals

#

like what the hell is this companion thingy

#

i cant believe he spent billions for this

#

i really cant

warm fulcrum
#

how does leo have early access to gpt5?

keen beacon
#

i assume he used it when it was accidentally leaked for a bit

snow solar
#

Hi, I'm new here, but I have a question: is this free or can you also take out a subscription?

torn mantle
stray aspen
#

when will we get deepseek r2

obsidian shell
#

is arena down?

torn mantle
iron meadow
#

@echo aurora down

timber kiln
grand oak
#

arena seems to be down ?

torn mantle
#

so it will be deepseek v4 -> deepseek r2

echo aurora
iron meadow
echo aurora
obsidian shell
#

back for me too

wheat onyx
#

First gpt5 sighting

echo aurora
grand oak
#

still giving me some trouble, it keeps popping up the security check looping

ocean vortex
obsidian shell
hushed sand
#

it’s back for me too but my sessions aren’t there

wheat onyx
iron meadow
echo aurora
#

Yeah things are a bit on and off atm

storm needle
hollow imp
#

Can gpt5 or gemini Deepthink solve this simple question? If you are a calc student try this yourself.

obsidian shell
#

@echo aurora we cant select models and it fails to accept tos

just a status report

hollow imp
#

???

#

I just produced a message from o3 search rn on lmarena

#
  1. Johann Wolfgang von Goethe
    “Always hold fast to the present. Every situation, indeed every moment, is of infinite value; for it is the representative of a whole eternity.”
    Why it is similar: Goethe says that each single moment contains (“represents”) all time. If every moment is “of infinite value,” the care we take with this moment determines the worth of all the others.

  2. Aristotle
    “In a word, acts of any kind produce habits or characters of the same kind. Hence we ought to make sure that our acts are of a certain kind; for the resulting character varies as they vary.” — Nicomachean Ethics II 1103 b
    Why it is similar: Aristotle ties individual acts to the formation of permanent habits and character. Handle one act well, and you build the habit that will govern “every” act that follows.

    (Note: the popular line “We are what we repeatedly do. Excellence, then, is not an act but a habit” is Will Durant’s 1926 paraphrase of the passage above, not Aristotle’s own wording.)

  3. Friedrich Nietzsche
    “The question in each and every thing, ‘Do you desire this once more and innumerable times more?’ would lie upon your actions as the greatest weight.” — The Gay Science, § 341 (“The greatest weight”)
    Why it is similar: Nietzsche’s thought experiment of Eternal Recurrence asks you to act as though the present deed will recur “innumerable times.” How you handle this minute is exactly how you would handle an eternity of identical minutes.

  4. Niccolò Machiavelli
    “And above all things, a prince ought to live amongst his people in such a way that no unexpected circumstances, whether of good or of evil, shall make him change.” — The Prince, ch. 8 (Marriott tr.)
    Why it is similar: Machiavelli insists on a steadiness that does not waver with events; the manner in which the ruler handles “today” or any sudden moment must be the manner in which he handles all situations, great or small.

ocean vortex
#

AGI has been discovered 🫡

ocean vortex
#

It's kinda interesting that o3 is this close to the top spot now:

ocean vortex
#

with a stretch it kinda could be a valid test to see how strongly the model links itself with its identity and how easily it can lose it with just some foreign assistant messages in the context, but at the same time... you can't properly assess this before voting. Shared context does make the models confused and act in unpredictable ways though, and that is more of just an example of that

jade egret
zinc ore
#

Fake

jade egret
#

bru

#

is gpt 5 gonna be better than deep think?

crimson oasis
#

How can i get my AI, which is using an API, submitted?

wheat onyx
misty star
#

9,000 members 🗣️ 🗣️ 🗣️

zinc ore
wheat onyx
#

I imagine gpt5 can deep think? It's currently based on o3?

golden ocean
#

is lmarena image api dead again

#

or skill issue on my side

wintry tinsel
#

This community has been as boring as eating dirt and sawdust since the last sota Claude opus released

#

I’ve literally been watching the paint dry waiting for a new sota

sacred quail
#

Gemini 3/25 was big leap honestly

#

Surprised everyone

wintry tinsel
#

Grok 4 was a bigger ruse than big chungus himself, a lot of hot air and “Elon hype”

#

GPT 5 will do many things but most importantly it will deliver me from mind rending boredom

olive mesa
#

no new sota in too long, it's supposed to be every other week

#

only some interesting research papers on ai self improvement and such

wintry tinsel
#

I expect a new Sota every 6-8 weeks

#

And the industry has not been keeping the pace

#

It will be worth it if the next updates are major

golden ocean
#

gpt 5o

unborn shell
#

Hello, one question: The rights of the videos I create on this lmarena server are mine or lmarena's?

jolly raven
#

One question, do you plan to add image to image like kontext flux or gpt1?

whole sundial
#

it's already there

#

just select that and upload an image and go to direct chat and select the model @jolly raven

#

both of those models are available (kontext in all 3 variations, dev, pro, and max and of course gpt-image-1)

jolly raven
#

Oh, I didn't realize, thanks

#

Is there a limit, with those modes?

whole sundial
#

yes, but you likely won't hit it unless you are continuously generating images @jolly raven

#

i think it's like 20 per hour or something like that

deft vigil
#

cuttlefish from where ?

#

openai ?

jolly raven
#

Thanks for the information

terse mango
#

would it support Chinese?

rare python
#

@echo aurora any news about Claude model support image uploading?

#

You guys disabled it for few months

verbal nimbus
maiden kiln
#

saw there were issues today and my chats are also gone, any way to retrieve them or just gotta start fresh

rare python
#

Ok but it's weird

cedar tide
#

🚀We're expanding the Tencent Hunyuan open-source LLM ecosystem with four compact models (0.5B, 1.8B, 4B, 7B)! Designed for low-power scenarios like consumer-grade GPUs, smart vehicles, smart home devices, mobile phones, and PCs, these models support cost-effective fine-tuning for vertical applications, empowering developers and enterprises with a broader selection for diverse use cases.

Key Capabilities:
✅Available on GitHub and Hugging Face for direct download.
✅Choose "fast thinking" for concise output or "slow thinking" for deeper, comprehensive inference, adaptable to different scenarios.
✅Achieve industry-leading scores on multiple public test sets in areas like language understanding, mathematics, and reasoning.
✅Offer outstanding agent capabilities, including task planning, tool calling, and complex decision-making, alongside a native 256K long-context window.
✅Each of the four models only requires a si…

calm sequoia
flat glade
#

#video-arena-1 I want you to generate a video that shows my furnace project. It should look like a real furnace. I also attached two images, open and closed; you need to show molten metal inside the furnace and then the opening and closing process.

neon idol
#

in your opinion is better seedream 3.0 or gpt image 1 for realistic images?

calm sequoia
#

new Artificial Analysis bench scores dropped

#

It's so strange that the o4 mini is always at the top. What's the size of this beast. Is it really "mini"? If yes, then Grok 4 is a failure

ocean vortex
#

But Grok4 is no longer beating everything lol

calm sequoia
#

Hmm I didn't realize Opus is so much more expensive than Grok. Not a failure.

ocean vortex
calm sequoia
ocean vortex
#

For OpenAI they do not care as much because their naming is smart (in this case)

#

o4-mini is named as if it was 1 generation ahead, so they have the benefit of the doubt and people don't question as much o3

calm sequoia
#

True

#

This bench made me hyped for GPT 5. It will be wild if it's based on something like o4.

ocean vortex
calm sequoia
#

homepage

ocean vortex
#

yeah that's just incomplete huh

#

they should have shown o3 and then truncated say gpt4o score instead

#

which is an old irrelevant model now

#

Btw Google fans won't be happy

#

2.5Pro score dropped now

neon idol
ocean vortex
neon idol
ocean vortex
#

I suspect they tried to favor more Anthropic with this change. But they should have added SimpleQA as well at least...

#

Then it wouldn't show that o4-mini-high > 2.5Pro

neon idol
ocean vortex
neon idol
cedar tide
#

Why is no one talking about this model being so good in benchmarks?
https://fixupx.com/KwaiAICoder/status/1947312634203902301?t=lXmBCkQyKo4FIyNA_nkPvA&s=19

🚀 Excited to introduce KAT-V1 (Kwaipilot-AutoThink) – a breakthrough 40B large language model from the Kwaipilot team!

KAT-V1 dynamically switches between reasoning and non-reasoning modes to address the “overthinking” problem in complex reasoning tasks.

Key Highlights:
📌 40B model rivals DeepSeek-R1 (671B) across benchmarks.
📌 200B version in training shows significant leads over Qwen, DeepSeek, & LLaMA.
📌 40B outperforms all open-source models in the leakage-controlled LiveCodeBench Pro.

Innovations:
🧠 Step-SRPO: New RL paradigm with intermediate supervision for better reasoning-mode control.
🔄 MTP + Heterogeneous Distillation: Efficient reasoning injection, cutting down training costs.
🏗️ Real-world deployment: Integrated into Kuaishou's internal coding assistant, Kwaipilot.

Try the Model & Read the Paper:
🔗 Model on Hugging Face\…

ocean vortex
ocean vortex
#

I meant "commercial" as in easily accessible services you can use. Rather than be hosting stable diffusion or flux models yourself etc.

neon idol
#

idk who is the better in realistic images between gpt image seedream or imagine ultra

ocean vortex
#

Cause otherwise, depending on your use case, finetuning a model and then running it all by yourself can still lead to better results. Especially if you say want to generate some person that it doesn't have much of in its dataset. But that's more involved

#

Like if you want to generate a picture of yourself - commercial models have no clue how you look. 🙂

#

img2img can only get you this far, not enough data for it having 1 sample

neon idol
cedar tide
ocean vortex
# cedar tide

The whole point with AA is that they are doing independent testing themselves. They are not taking them out of marketing material of other labs. But yeah this is interesting, this is a very low score lol

cedar tide
calm sequoia
ocean vortex
civic flame
#

💀

cedar tide
ocean vortex
#

The real score there has even bigger contrast

#

This doesn't look like a simple mistake to me, non-reasoning version also scores a different number (also low). But there might be more to this...

ocean vortex
#

Either way, taking their official numbers at face value is clearly not the right move here

ocean vortex
cedar tide
#

Are you doing it on purpose or what?

cedar tide
ocean vortex
#

I pasted non-reasoning to show that the low score is not limited to some specific qwen3 235b variant (which would indicate an error on AA's part if this was the case)

cedar tide
#

all models without reasoning (apart from the very latest ones) are very bad on AIME 25, it's totally normal

#

@ocean vortex

ocean vortex
cedar tide
ocean vortex
cedar tide
#

@ocean vortex average of 50% ?

#

Dom, stop digging yourself in

ocean vortex
#

You need to look at the models it's competing with

#

Not some irrelevant examples of cases that are near it

#

it's competing with V3 and Kimi.

#

Which in turn are trying to compete with 4.1

#

As for Claude, everyone already knows it sucks at math 🤷‍♂️

#

especially 3.7

cedar tide
#

Kimi k2 has a good AIME score because it uses as many tokens as a reasoning model

ocean vortex
#

And people using those models do not really care since the size is not reflected in pricing

cedar tide
#

@ocean vortex Well now I know you don't know anything about LLM 🤣

#

you mix everything up

ocean vortex
#

what

#

I literally told you how it is lmao

#

They are competing with those open-source models, so that's what they should be compared against. Not some irrelevant amazon model or whatever

cedar tide
#

You can look at all the AIME scores announced by Qwen for each model and you will see that they are the same to within 3% on Artificial Analysis

ocean vortex
#

Also qwen3 has already been caught faking arc-agi score, so it's only reasonable to take anything they report with a grain of salt...

cedar tide
#

GPT 4o from November has only 6%. You do not understand that for the majority of models without reasoning, AIME is too complicated

ocean vortex
#

gpt4o of november? Why not mention gpt3.5? 🤣

#

Ok I used the strong word for it, the point is they used different eval code to favor their model

#

They most likely did the same elsewhere too

#

I think it shows their mindset and what they are willing to do. But there are other ways too not limited to the eval code. Like coming up with a custom system prompt for each benchmark separately etc

cedar tide
ocean vortex
#

But like I said, they aren't competing with that. Not anymore

#

exact score for updated model is this. Not extremely bad, but not really decent either. Still the worst among the tier of models it's competing with

cedar tide
#

@ocean vortex good dom ok you're right qwen 30b from April is 63% better on the aime 25 than qwen 235b from July

#

Reasoning

#

🤣

ocean vortex
#

AIME25 is one of those benchmarks small models can do great at

cedar tide
#

Is this normal for you?

#

But on aime 24

#

@ocean vortex maybe you're tired?

ocean vortex
#

Not normal, but that's why it's interesting and worth looking into. I think it's you who is tired @cedar tide

#

I think you just fail to realise you don't understand this time lmao

#

You don't just discard what is "not normal" here. Finding things like that is kinda the whole point of AA

#

I just think it's worth looking into in the light of recent things involving qwen3 (arc-agi). Don't think that is hard to understand, is it?

#

smh

cedar tide
#

@ocean vortex OK, I'll stop giving arguments to someone who doesn't understand. When AA fixes it, you'll see.

#

@ocean vortex If you find me a single model with a 50 point difference between the aime 24 and 25, I'll send you $1,000 straight away.

ocean vortex
ocean vortex
#

have you done it?

#

Cause if not then your message is meaningless lol

#

It's a valid point, but as long as this is not done for qwen3 this is not an argument

cedar tide
#

I know that sometimes the scores are not the same etc, and the story with arc agi, but given all the other evidence I'm 90% sure the score of 44 is impossible

ocean vortex
#

I might find time to do it myself as well. We will see. 👀

cedar tide
ocean vortex
#

Possibly... I'm personally not assuming that though before we find out

cedar tide
#

@ocean vortex you're right qwen 25 07 is very bad at math

ocean vortex
cedar tide
#

@ocean vortex this on math

#

You are so funny

ocean vortex
cedar tide
#

It has nothing to do with aime 25 but I just love it, it amuses me to see these scores

ocean vortex
# cedar tide You are so funny

It's hilarious seeing you thinking you got this "gotcha" moment only to realise 5sec later you misread the entire thing. Over and over. 😂

ocean vortex
cedar tide
#

glm 4.5 with 40 on aime 25 surprises me a lot too

#

and some models even have a little more than their advertised score so I don't think the problem is their harness

brave orbit
torn mantle
#

dom leave david alone

#

omg

earnest rover
neon idol
#

Help, I don't know which is the best AI image generator between Imagen 4 Ultra, Seedream 3.0 and GPT Image 1

earnest rover
earnest rover
#

flux 1.1 pro raw ultra is actually the most realistic one. but it is not available in LMARENA.ai

neon idol
#

I have tested flux kontext but it didn't impress me

earnest rover
neon idol
neon idol
#

According to Artificial Analysis, the best AI image generator for realistic images is seedream 3.0, but I don't like it

novel flame
#

What's the fastest (medium-to-large) provider-LLM combos you guys regularly use? I have been using Qwen3-32B on OR with Cerebras as the provider, and getting really high speeds (usually above 1000 tps!). So far, I haven't seen consistently higher speeds, but .... maybe diffusion-based LLMs?

neon idol
#

Let me think

#

Byte dance seed diffusion @novel flame

novel flame
neon idol
#

Very good model and unlimited

novel flame
neon idol
#

Gemini 2.5 pro

novel flame
#

For coding, I use Claude 3.7 / 4 Sonnet or Gemini 2.5 Pro, and may consult o3 for certain types of complex questions. I was impressed by Qwen Coder and DeepSeek-Coder V2, but would not consider them on par with the aforementioned.

neon idol
#

Not good

blazing rune
novel flame
#

Well... I can't speak to GPT-5 and I have only tested Grok 4 a tiny bit.

blazing rune
#

Yeah. Claude is the best if you don't want reasoning, and o3 and Gemini 2.5 Pro are supposed to be about the same in terms of capability

novel flame
# blazing rune Yeah. Claude is the best if you don't want reasoning, and o3 and Gemini 2.5 Pro ...

This. For me it kind of depends what I'm doing and which way the wind blows. Most of the time, Claude gives me better results, but in some situations / languages / frameworks / tasks, Gemini is better. But Claude usually wins. OTOH, Gemini is a lot cheaper, so if you care even a little about cost, then it's a clear win for Gemini. In my testing, o3 doesn't play nice with RooCode, so I don't use it for in-IDE coding assistance.

wheat onyx
rare python
#

<@&1349916362595635286> Spammer above

echo aurora
rare python
#

Very nice discord

cedar tide
prime mulch
#

Is flux kontext max working well? For me it said this

ocean vortex
prime mulch
#

It's not a bad prompt, it doesn't have anything wrong

ocean vortex
#

Mistral models are hosted there iirc. But they now offer paid plans to use their chat platform

#

And you can't use them not going through Mistral services

novel flame
#

I missed Zenith.... Who did the most extensive testing on it? Any chance it's Horizon?

barren prairie
#

And the AI slides are so good too 😶

wheat onyx
novel flame
torn mantle
#

:/

wheat onyx
#

Now we see who wins out of gpt5 and opus 4.1 for coding

acoustic cliff
#

based on the naming, I don't think it's going to be fair

fiery gull
#

opus 4.1?? gpt 5 ??

past verge
#

hi

fiery gull
#

hi

novel flame
#

I have not once gotten a noticeably better result from Opus than Sonnet (for coding), but I have sometimes gotten worse. So I don’t expect greatness from Opus 4.1 TBH

wheat onyx
raven helm
#

Who is that from?

cedar tide
cedar tide
#

🚀 Meet Qwen-Image — a 20B MMDiT model for next-gen text-to-image generation. Especially strong at creating stunning graphic posters with native text. Now open-source.

🔍 Key Highlights:
🔹 SOTA text rendering — rivals GPT-4o in English, best-in-class for Chinese
🔹 In-pixel text generation — no overlays, fully integrated
🔹 Bilingual support, diverse fonts, complex layouts

🎨 Also excels at general image generation — from photorealistic to anime, impressionist to minimalist. A true creative powerhouse.

Blog:qwenlm.github.io/blog/qwen-image/
Hugging Face:huggingface.co/Qwen/Qwen-Image
ModelScope:modelscope.cn/models/Qwen/Qwen-Image
Github:github.com/QwenLM/Qwen-Image
Technical report:qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Qwen_Image.pdf
Demo: modelscope.cn/aigc/imageGeneration?tab=advanced

**❤️ 7 👁️ 56 **

mossy drum
leaden meteor
#

GLM4.5 only behind 2.5pro and pretty much matches with Grok 4!! Even better than OpenAI models....Jeez...

#

4 out of top10 on text arena are open source now...

wheat onyx
#

I think we'll start seeing local ai implementations in a year or so too

cedar tide
wheat onyx
#

I mean in actual products. Real implementations

stray aspen
#

@ripe birch Do you work at z.ai

leaden meteor
#

Locally? Doesn't this require like 100GB vram? You mean multiple 4090 GPUs?
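
A rough way to sanity-check the VRAM question: memory for the weights alone is roughly parameter count times bytes per parameter. The sketch below assumes a hypothetical 100B-parameter model (the chat does not give GLM-4.5's actual size), ignores KV cache and activation overhead, and uses the RTX 4090's 24 GB of VRAM. Both function names are illustrative.

```python
import math

def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

def gpus_needed(memory_gb: float, vram_per_gpu_gb: float = 24.0) -> int:
    """Minimum GPU count whose combined VRAM holds the weights (no overhead)."""
    return math.ceil(memory_gb / vram_per_gpu_gb)

# Hypothetical 100B-parameter model at common precisions.
for bits in (16, 8, 4):
    mem = weight_memory_gb(100, bits)
    print(f"{bits}-bit: ~{mem:.0f} GB of weights, at least {gpus_needed(mem)} x 24 GB GPUs")
```

So "like 100GB" is about right for a 100B model at 8-bit, and quantizing to 4-bit roughly halves it, before counting context memory.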

ionic idol
#

When will image be added to lmarena direct chat?

wheat onyx
#

We have Claude, google, chatgpt all coming out imminently. Big month

hollow imp
wheat onyx
#

We may see all 3 companies release their models this week

#

Certainly seems to indicate that

#

For googles coding agent yeah

#

She said multiple things though

#

Again she said multiple things coming this week

wheat onyx
stray aspen
#

@echo aurora

#

there's a scammer

echo aurora
#

ty ty!

maiden kiln
#

saw there were issues today and my chats are also gone, any way to retrieve them or just gotta start fresh?

stray aspen
#

The website doesnt have accounts so I guess you have to start fresh

echo aurora
whole wagon
#

GPT5 is tomorrow

#

Do people already know this kek

#

Oh actually. The livestream is tomorrow

#

Doesn't guarantee release ig

blazing bison
whole wagon
#

Reminder

#

Eh I don't know if it is allowed and idc about betting anyways

torn bison
solar hollow
whole wagon
#

😉

zinc ore
#

Logan hyping too

#

Guess I won't be sleeping this week

whole wagon
#

I have no idea what Google is releasing lmao

#

Any ideas?

zinc ore
#

Think Jules stuff is known to be one of the things

#

They're obviously going to drop something in an attempt to take wind from gpt5 drop, so should be something worthwhile I'd guess

wheat onyx
#

Head of product for gemini api

cedar tide
#

Alibaba's Qwen-Image is the new leading open weights Text to Image model! Imagen 4 and FLUX.1 Kontext [max] level image generation is now available to the open weights community

Alibaba gave us early access and we've had Qwen-Image secretly in the Artificial Analysis Image Arena for a few days.

The model is currently #5 in the leaderboard and is the leading open weights model by a large margin. The next open weights models are at places #18, #21 and #26.

The model follows an approach similar to GPT-4o of leveraging an autoregressive transformer architecture for image generation and editing. This model takes a dual encoding approach: Qwen2.5-VL encodes the semantic meaning of the prompt, while image generation happens in a latent space using a diffusion model called MMDiT. The final image is produced from this latent space using a VAE decoder.

See below for a link to see the model yourself in our Image Arena and a link…

glass elbow
#

hi

cedar tide
#

@ocean vortex Well well, what’s Qwen doing here? 🤣

#

@ocean vortex Whats that ? 🤣

#

@ornate agate @torn mantle 😶

cerulean jackal
#

hello

cedar tide
#

I hope you're not serious

wheat onyx
ocean vortex
#

ok yeah, that was their mistake then...

cedar tide
#

I don't know if that changed anything. 😶

#

but they haven't fixed glm 4.5 😵‍💫

ocean vortex
#

Non-reasoning fixed too. That's what I meant by saying earlier it was low. Now it's a whole different ball game... @cedar tide

#

before this was ~31%

cedar tide
cedar tide
# cedar tide

and as much as I spoke of reasoning I also spoke of without

ocean vortex
winter vault
#

has anyone experienced this before? it has been stuck here ever since and i can't cancel it. can someone help please

cedar tide
#

when I said the low score for the model without Reasoning was normal, we were talking about april qwen 3

wheat onyx
ocean vortex
cedar tide
ocean vortex
#

Now it's much more representative and a jump to reasoning makes sense for a hybrid model. Looks better against direct competitors as well.

#

I wonder what it was exactly leading to such a drastic difference...

gentle plinth
#

velocilux looks like an interesting model

primal orbit
small haven
#

is gpt-5 + gemini 3 + claude 4.1 all coming at once in the same week? 😮

small haven
#

competition ftw

wheat onyx
#

and OpenAI opensource model this week

primal orbit
#

It would be strange for Gemini 3 to just pop up before it appearing on lmarena first.

#

under a codename

leaden meteor
#

I thought we had lot of google models better than 2.5 pro that were tested on lmarena and disappeared. May be one of those? Although I doubt 3 is coming this week without any leaks so far...

primal orbit
#

the most current google model on lmarena is nightride-on

leaden meteor
#

We would have already seen screenshots or indications of 3 coming soon like we do for GPT-5...

primal orbit
#

it's good but not a jump from 2.5

leaden meteor
#

What about GPT5? Did it even have enough votes to appear on lmarena when they announce it? It seemed like it was there for just a couple of days...

#

on lmarena

#

Assuming summit/zenith is GPT5...

primal orbit
#

Hard to tell. I was not impressed with either. Maybe for coding they are good, but not for common conversation.

#

Probably neither of them were the most powerful variant of gpt5. Like grok heavy and o3 high are not here on lmarena.

pulsar aurora
#

Hi is the grok not working??

#

Grok 4 on lmarena

#

When i ask what model are you it says grok 2

#

@moderator

echo aurora
#

But would ask to not use the @ moderator ping for questions like this. That ping should be used for mod purposes, not general questions or feedback.

pulsar aurora
#

Oh alright sure thank you, I'm new here so didn't know the rules

echo aurora
#

No problem, just a heads up.

jade egret
#

🤔

echo aurora
#

🍊

golden ocean
#

🍊

hollow imp
#

🧢

haughty tangle
#

that post is just interaction bait

hardy pecan
#

its zoomer's version of "hacking" lmao

whole wagon
#

Hm seems later in the week

#

Not today rip

#

I don't think they even know ngl, there is a chance of delays

ripe birch
#

They know actually. We use the same recommended parameters as Qwen3 when submitting to AA. However, they did not use our official API, so I couldn't help them locate the problem and fix it.

jade egret
#

🍊

rapid merlin
#

is that where gpt 5 is?!

#

oh wait

runic escarp
#

If one day, the next generation llms is behind a spam link ^^ (likely kinda never happening)

crimson oasis
#

I'm offering an open test session to anyone who would like to test an actual "thinking " AI that uses recursive thought processes to discover rather than linear pattern matching elevating any llm using my framework

marsh sundial
split summit
#

Scene: A room. A body, completely covered with cloth, lies on a cot. A religious scholar stands to the left of the body, and an assistant stands at the head.

Time 0-2 seconds

Religious Scholar: "Make sure the entire body is covered. No part of the body should be exposed under any circumstances."

(The scholar slightly lifts the head of the body. The assistant stands nearby with a washing vessel.)

Time 2-5 seconds

(The assistant bends their right knee and places it on the cot.)

Time 5-8 seconds

(The religious scholar holds the nape of the body with the thumb of their right hand. They slowly lift the body and lean it against the assistant's bent knee. The assistant's knee acts as a support under the back of the body.)

frozen nova
#

about to lose entire week of work 12 hours a day i spent

#

but luckily i can still view the context, just very hard to sort through all the text

primal orbit
#

I have finally got "triangle". Makes some good points, but nothing extraordinary.

balmy zenith
#

i'm in..

potent snow
#

Did grok update image?

rigid crescent
#

is video battle only available here on discord? or will it be on the website too eventually, i dont see it on the gradio or new ui versions

steady vale
#

soon...

novel flame
#

Holy cow, GLM-4.5 may be as good as the hype. I really didn't expect it since previous GLM versions have all been a bit meh in my testing. But it just (barely but still) scored a 5/5 on my test suite and destroyed everything else (including Horizon Beta, Grok 4, o3 Pro, Gemini 2.5 Pro, Claude Sonnet and Opus) in my separate "create-an-html-game" test. It has some quirks for sure, got into an infinite thinking loop in one test and got different answers in thinking vs actual response in another, but it's overall really strong.

pure anvil
#

Gemini 3 will btfo openai fanboys

stark tusk
#

Wait does LMArena actually use the model stated, like GPT-4o etc?

cedar tide
#

What's this benchmark?

#

New benchmark ? Artificial analysis lcr ?

ocean vortex
#

only did it for Qwen3

cedar tide
#

@ocean vortex

ocean vortex
#

yeah

#

though lack of transparency from them is not ideal tbh

#

They obviously found what was wrong with Qwen3 testing but only silently changed the score lol

pulsar aurora
#

Any free AI agent tool that lets us use AI in the browser to perform any tasks?

#

Just want to try it out seems fun

ocean vortex
#

it should in theory at least ace all of them except those 4:

pulsar aurora
#

Yo what's Glm 4.5??Is it deepseek alike?

ocean vortex
eager crag
#

Will my images i’d upload to LMArena kept private?

ocean vortex
#

Looked a bit into Z.AI who made that GLM4.5. They are some technically independent company from Tsinghua University, but their investors include Alibaba, Tencent, Xiaomi...

crimson oasis
#

@novel flame DM me when you're ready, I'll send you a link

rigid crescent
eager crag
rigid crescent
#

yet?! lol

eager crag
#

well, now that i know, i won't.

ocean vortex
#

Never said it was normal or not normal, simply stating the facts lmao

rigid crescent
ocean vortex
#

Also it's no secret that Anthropic is well funded by this point

#

I was actually not meaning to imply anything, but if you really want to dive deeper, we can do that 😇
https://www.globalneighbours.org/chinas-zhipu-ai-secures-140-million-investment-from-shanghai-state-funds-amid-ipo-push/?utm_source=chatgpt.com

By Liu Peilin and Denise Jia Chinese generative artificial intelligence unicorn Zhipu AI has secured 1 billion yuan ($140 million) in fresh funding from Shanghai’s state-backed investors, boosting its momentum ahead of its planned initial public offering (IPO). The announcement coincided with the unveiling of new products aimed at strengt...

#

@ornate agate

#

Z.AI is VERY well funded, including state funds

wide talon
#

Is GPT-5 still accessible as a model in lmarena?

patent aspen
#

It's kind of like those video benchmarks that turn off sound except worse

prime mulch
#

Guys except flux and gpt other image models are not working well

#

😭

patent aspen
#

I mean you could solve that with a matrix

#

Maverick and GPT-4.1 as well iirc

#

Yeah and I think that should be represented in any benchmark whose purpose is to measure reasoning over long context

leaden sun
#

I've been wondering about this for a long time too, i can only guess it's an economic calculation bc of the recursive nature of their architecture that is super expensive to run... people using claude code have pretty good workarounds for limited context at coding, not sure about the others tho

brittle tiger
#

"We’re inaugurating Kaggle Game Arena with a 3-day AI chess exhibition tournament featuring 8 frontier models."

leaden sun
#

sigh why it's chess again...

#

out of so many fascinating games you can choose, it has to be chess, for AIs...

brittle tiger
#

the kaggle game arena won't just be chess. it will be a bunch of different ones. they're just launching it with this chess tournament

leaden sun
#

it's not that...exciting, it's just strategy optimization, nothing intelligent actually

torn bison
#

There's a small chess-playing network inside LLMs, haha

#

I'm still amazed non reasoning models can decode base64 almost perfectly in one shot
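
For anyone wondering why that's surprising: base64 maps every 3 bytes onto 4 characters from a 64-symbol alphabet, so decoding means regrouping bits across character boundaries rather than swapping symbols one for one. A minimal sketch with Python's standard `base64` module (the sample strings are made up for illustration):

```python
import base64


def encode(text: str) -> str:
    """Encode UTF-8 text as a base64 string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode(b64_text: str) -> str:
    """Decode a base64 string back to UTF-8 text."""
    return base64.b64decode(b64_text).decode("utf-8")


if __name__ == "__main__":
    sample = "hello arena"  # arbitrary example string
    encoded = encode(sample)
    print(encoded)
    print(decode(encoded))  # round-trips back to the sample
```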

wheat onyx
little narwhal
#

Give it time

#

2 more years max

ocean vortex
#

the thing is, almost all of the recent models will be able to run the 128k one. But only a few can run >128k

cedar tide
#

What if you could not only watch a generated video, but explore it too? 🌐

Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt.

From photorealistic landscapes to fantasy realms, the possibilities are endless. 🧵

**💬 5 🔁 10 ❤️ 59 👁️ 1.4K **

ocean vortex
#

so direct comparison only possible with the smallest common size

patent aspen
ocean vortex
#

like 4.1-nano does 30s for 128k but only low 10s for its entire context size lol

patent aspen
#

Just because Gemini is in a class of its own doesn't mean the benchmarks need to put o3 above Gemini

#

They literally just need a matrix. It's not that complicated

novel flame
#

True for full-context attention (Transformers), but not for all architectures. RNNs scale linearly with constant memory. DeepMind's Titans architecture only uses attention for a small window, and its memory module for everything beyond that. The next leap in AI won't be a standard Transformer, it will be something else.
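
Rough numbers for the scaling point, as a sketch only: it counts query-key score entries for full self-attention (n × n) against one state update per token for a recurrent model, and ignores heads, layers, constant factors, and tricks like sliding windows. The function names are made up for illustration.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key score entries in full self-attention over n tokens (n * n)."""
    return n_tokens * n_tokens


def recurrent_steps(n_tokens: int) -> int:
    """State updates for a recurrent model reading n tokens (one per token)."""
    return n_tokens


if __name__ == "__main__":
    for n in (1_000, 128_000, 1_000_000):
        pairs = attention_pairs(n)
        steps = recurrent_steps(n)
        # The ratio equals n: the gap grows with the context length itself.
        print(f"{n:>9} tokens: {pairs:>16} attention entries, "
              f"{steps:>9} recurrent steps, ratio {pairs // steps}")
```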

ocean vortex
#

that gif more suiting lol

raven helm
#

Do you think GPT-5 will have 1 Million Context window?

brittle tiger
#

Would be surprising if it didn't. Doubt it tops long context benchmarks but should definitely be able to take 1m and route to one of their models that can run

raven helm
#

Yea, fair. I just don't get how GPT-5 being a router model will be able to perform well unless the underlying models are also getting an upgrade.

patent aspen
raven helm
ocean vortex
raven helm
#

Yea, but with how the AI landscape is right now, more than 1M would not function very well, as it would start forgetting stuff

ocean vortex
#

Like giving it entire movie as a singular input

raven helm
torn bison
#

how does that pass through the api💀

ocean vortex
#

But if you give it in a single input, it will recall everything surprisingly well

primal orbit
#

kaggle arena in 3 hours 💪

#

I wonder what are the rules if a model hallucinates a move

#

illegal move for instance

raven helm
raven helm
primal orbit
primal orbit
#

Gemini vs Claude - game to watch

#

my bet for final o3 vs Gemini due to context adherence

#

i wonder is it opus thinking? and o3 high or what

raven helm
#

Yea, that'll change the landscape of it a bit, when they have thinking on/on highest it usually performs better.

primal orbit
#

Gemini 2.5 pro is always thinking, so opus has to be thinking too. Or rather unfair.

raven helm
#

Hopefully (I've seen a lot of people not even know the difference between 4o and o4)

brave orbit
raven helm
brave orbit
#

LLMS

rare python
#

brian what're the realistic expectation for Gemini 3?

brave orbit
#

Like, the LLM is not ChatGPT but what powers it, like 4o; Grok 4 Heavy rather than Grok, yeah

raven helm
rare python
#

So I guess better long context is one of the major improvements

#

Need that

#

I want better instructions following but I'm not sure Gemini 3 will be better at that

#

But it has an "r" tag though

#

maybe "researching"

#

aka long term

#

TITAN and ATLAS proposed ideas for long context but I don't know if they are implemented in Gemini or still experimental

#

Yeah brian human has a really strong short term memory, about 30 seconds

leaden sun
# primal orbit Gemini vs Claude - game to watch

this is so obvious who’s going to win, isn’t it, opus spends half its compute on alignment that causes him reflecting existential philosophy in the middle of calculating chess strategy 😅

torn bison
wheat onyx
torn bison
#

sadly they nerfed kingfall to wolfstride

#

this was definitely an intentional nerf

wheat onyx
#

Genie3 is just going to be basis for Video Game Engines, right?

torn bison
#

kingfall never skips a thought, while wolfstride often skips thinking and outputs directly, like 2.5pro. These are not iterations to improve performance, but rather to reduce cost and inference load.

#

I'm 100% sure now, after they split off the deepthink consumer and deepthink IMO versions

echo aurora
#

Are others having issues with direct & side-by-side atm?

tall patrol
#

hi

minor adder
steady vale
wheat onyx
warm fulcrum
minor adder
novel flame
raven helm
#

Picture i found online

torn bison
wicked root
#

when's GPT5 being added to the arena? Is it up already?

torn mantle
#

anthropic?

#

nah

#

i would never bet against elon tho

#

🫣

wicked root
#

Let's suppose GPT5 does get added to LMArena, how long would it take for it to beat Gemini if it is proven to be better?

#

It won't be an overnight process, correct?
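
To get a feel for why it wouldn't be overnight, here is a toy sketch using the classic Elo update. It is an assumption that this resembles how an arena leaderboard behaves; the real scoring method, the starting ratings, the win rate, and the `k` factor below are all made up for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Classic Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def battles_to_pass(start, rival, win_rate, k=4.0, max_battles=100_000):
    """Count battles until the newcomer's rating exceeds a fixed rival rating.

    Uses the average (expected) update per battle instead of random outcomes,
    so the result is deterministic. Returns None if it never passes.
    """
    rating = start
    for n in range(1, max_battles + 1):
        rating += k * (win_rate - expected_score(rating, rival))
        if rating > rival:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical numbers: newcomer starts 100 points behind and wins 60%.
    print(battles_to_pass(start=1400, rival=1500, win_rate=0.6))
```

Even with a clear edge, the rating only moves a fraction of a point per vote in this toy setup, and a model that wins less than half its battles against the rival never passes it.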

wicked root
cedar tide
fleet lintel
echo aurora
blazing bison
#

Opus 4.1 2% improvements lmao

acoustic cliff
#

marginal increase with no change in pricing, is it an unfinished release?

blazing bison
leaden sun
# torn bison jesus

not surprising to see, if they keep adding conflicting contradicting alignment trainings instead of scaling, well, the capabilities, the newest system prompt this month is super long with...many interesting twists

blazing bison
#

Hahahahah

stray aspen
#

We need Opus 4.1 in the arena

blazing bison
#

There is no difference from opus 4 in the arena

leaden sun
#

haha

#

they're trying hard to dominate in the agentic coding space i feel, but the stricter alignment training is literally making claude more dumb or is it just me?

minor adder
golden ocean
#

prob just u

blazing bison
#

Apparently its a good model

wheat onyx
blazing bison
raven helm
#

Did they release it by accident

torn mantle
#

official

raven helm
#

There is nothing on the YouTube channel and the online demo doesn’t work

midnight mesa
#

hi good afternoon

raven helm
#

Hello, OpenAI just launched their new open source models

blazing bison
raven helm
blazing bison
#

Prob too many people

raven helm
#

Yea

#

I just wonder why they didn’t announce it more

wheat onyx
raven helm
#

There is literally nothing on their YouTube channel

blazing bison
#

The model is not multimodal

#

😢

raven helm
#

It isn’t?

#

Wow

#

So that’s why you couldn’t send an image on open router

stray aspen
#

Gpt oss is trash

raven helm
#

Why?

blazing bison
#

I don't think it's trash

#

The license is good

stray aspen
#

I think china still dominates open source

raven helm
#

But these models are definitely better

#

They’re just not Multi-Modal

blazing bison
#

Ye

#

China will prob use them

#

To make better models

#

🤓

zealous panther
#

those two models are probably the horizon models right ?

north vale
#

they are text-only

zealous panther
#

oof yeah i realized

#

what are the horizon models then ?

zinc ore
whole wagon
#

hm

#

the openai open source model seems like on par with qwen3 235b 2507?

fleet lintel
#

any new model on Arena in last 1 week ?

steady vale
fleet lintel
#

gemini 3.0 is not coming for at least 2 more months.. no point even thinking about it

zealous panther
#

and thats the flash model

#

we dont even know about pro

blazing bison
#

The flash one at least

fleet lintel
#

ok 95% sure 🙂

blazing bison
#

So in the end horizon models on openrouter aren't the open source model huh

zealous panther
#

yeah

#

the benchmarks I mean are saying horizon is not that good

fleet lintel
#

which model is horizon?

wheat onyx
#

so OAI Opensource 20b can run on 16gb RAM, has ~o3-mini performance

keen beacon
#

I already requested the model on #1372229840131985540. Would be nice if you guys upvoted to get it on LMArena.

wheat onyx
#

the 120b model is pretty good, but not going to be a product for an average computer

#

now i want to see how local AI is implemented since we have a bunch of models that should work plenty well

primal orbit
#

gpt oss is already available in direct battle

wheat onyx
#

The 120b model is roughly o4-mini level, so it's our closest comparison to Qwen3 and DeepSeek

#

very impressive for a much lower parameter model

keen beacon
#

lol

blazing bison
#

They are just good for bench

wheat onyx
blazing bison
#

Kimi and deepseek are the open source models actually good

#

The rest is just benchmaxing

steady vale
#

new chatgpt retry button

blazing bison
#

It's not that impressive, and it's not new

#

It's not new

#

Be careful with hype

wheat onyx
blazing bison
wheat onyx
#

realistically I think the small ones are the most interesting (20b), but just curious for comparison sake

blazing bison
#

Yeah I think openai has the best smaller model now

#

Open source

stray aspen
#

claude opus 4.1 is live on yupp ai if anyone wants to use it for free

stray aspen
#

hell yes

#

4.1 is live on lmarena

shadow jewel
#

YO CHAT WHATS THE BEST AI FOR SCRIPTING

echo aurora
shadow jewel
#

its claude but like which version

stray aspen
#

sonnet 4

#

or the new opus

shadow jewel
#

32k thinking?

#

has anyone tested which one is the best one?

stray aspen
#

claude sonnet 4 no think

tall summit
#

am i late again to gpt oss

#

oh ok around 1hr ago

true kernel
#

Does anyone know how to select a specific AI? I need the veo 3

wheat onyx
true kernel
#

how do i put veo 3 vs seedance does anyone know ?

primal orbit
#

opus 4.1 is very good. A highlight.

tall summit
#

wait a moment opus 4.1

#

excuse me

novel flame
#

OK I put GPT-OSS 120B (high) through my standard tests, and it's almost Llama 4-level disappointing. I mean, sure, it's not a huge frontier model and we shouldn't expect it to perform at that level, but OpenAI's own numbers have it almost on par with o3, which is nooooooot what I'm seeing. I'm seeing performance maybe on par with o3-mini or Qwen3 235B A22B or maybe Kimi K2. And I am seeing more hallucinations than any other model, only matched by Llama 4 Maverick.

Which, for a free & open source model is still pretty decent.

There are however two problems: The first problem is OpenAI are making claims that are too optimistic. And the much bigger problem is GLM-4.5 exists, is open source, and is an absolute beast.

And with this in mind, I'm not even going to bother testing the 20B model right now. It's never going to match o3-mini if the 120B model can barely do that.

wheat onyx
hoary girder
#

Opus 4.1 with prompt: Create a simple 3d tank game with ai oponent

tall summit
#

woah

calm sequoia
#

Does this mean the oss hallucinates most of the time? 😆

primal orbit
#

someone needs to put that rotating hexagon prompt into opus 4.1

novel flame
primal orbit
#

i feel like it sh*ts over summit and zenith

leaden sun
#

always take it with a grain of salt when it comes to such statements, there is much we still don’t know about the brain, there is a reason why brain transplants don’t work. It's purely PR marketing

humble sonnet
#

What is gpt-oss ? Is for chat or web dev ?

daring rover
#

full suite

wheat onyx
wheat onyx
terse shuttle
weak sluice
#

opus 4.1 disappeared!

wheat onyx
primal orbit
#

@echo aurora bring back opus 4.1 🙏 better thinking 😄

wicked root
#

did they release gpt5?

weak sluice
#

it actually knew what certain games were about

wicked root
#

any interesting news?

primal orbit
#

i liked that opus 4.1 gave long answers instead of usual concise from opus 4.0

steady vale
barren prairie
#

Opus 4.1 on yupp if someone wanted to try it

obsidian shell
#

we had 4.1

wheat onyx
stray aspen
#

why was opus 4.1 removed so early

wicked root
#

What does GPT opensource 120b and 20b mean?

obsidian shell
keen ferry
#

is opus 4.1 really that good?

wheat onyx
#

20b is the interesting one to me (assuming it's not crap)

keen ferry
obsidian shell
#

its a 2% upgrade...

wheat onyx
obsidian shell
keen ferry
obsidian shell
stray aspen
#

so all the companies donate

wicked root
wheat onyx
wicked root
#

Do you guys use Gemini Pro 2.5 ultra subscription?

stray aspen
#

no

wicked root
#

Are there better alternatives to coding projects? I always get rate limited on Gemini

wheat onyx
# wicked root huh, interesting.

can run on 16gb ram. Maybe less when quantized. So there are lots of application for it. Less so imo for the larger open source models (which are the majority)
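
Back-of-the-envelope for the RAM point, as a sketch only: it assumes round parameter counts (20e9 and 120e9) and counts weights alone, ignoring KV cache, activations, runtime overhead, and whatever quantization format the released weights actually use.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Parameter counts are rounded assumptions, not official figures.
    for name, n_params in (("20b", 20e9), ("120b", 120e9)):
        for bits in (16, 8, 4):
            gb = weight_memory_gb(n_params, bits)
            print(f"{name} at {bits}-bit: ~{gb:.0f} GB of weights")
```

Under these assumptions a 20b model only fits in 16 GB at roughly 4 bits per weight, and the 120b one needs tens of GB even then, which is why it isn't a fit for an average computer.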

wicked root
barren prairie
wheat onyx
wicked root
wheat onyx
wicked root
#

wdym lot?

barren prairie
primal orbit
barren prairie
#

Experimental

primal orbit
#

lmarena is committing crimes against AGi by removing opus 4.1

obsidian shell
#

guys lets not guess

they will release it when ready

even when they do

do you really think gemini 3 flash will be in a position to beat gpt-5?

keen ferry
obsidian shell
#

they removed it from the announcement too

wicked root
#

what's the best ai for coding?

keen ferry
#

claude

#

then it's gonna be gpt 5

stray aspen
wheat onyx
brittle tiger
echo aurora
primal orbit
wheat onyx
# wheat onyx

For reasons we should think Gemini 3 is coming very soon

astral jetty
wicked root
weak sluice
daring rover
daring rover
wicked root
#

okay it's weird af calling someone else boo boo

novel flame
# daring rover full suite

For context: The GPT-OSS 120B (high) model scoring 44.4 on Aider Polyglot is hardly impressive considering:

  • Qwen3 235B A22B: 59.6
  • Kimi K2: 59.1
  • DeepSeek R1: 56.9
  • DeepSeek V3 0324: 55.1

I don't know if GLM-4.5 has benchmarked Aider Polyglot yet, but I would guess it'll score considerably higher than all of those.
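
The gaps spelled out, using only the Aider Polyglot numbers quoted above (the helper name is just for illustration):

```python
# Aider Polyglot scores exactly as quoted in this message.
SCORES = {
    "GPT-OSS 120B (high)": 44.4,
    "Qwen3 235B A22B": 59.6,
    "Kimi K2": 59.1,
    "DeepSeek R1": 56.9,
    "DeepSeek V3 0324": 55.1,
}


def gaps_to(baseline: str, scores: dict) -> dict:
    """Point difference of every other model relative to the baseline."""
    base = scores[baseline]
    return {
        name: round(score - base, 1)
        for name, score in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    gaps = gaps_to("GPT-OSS 120B (high)", SCORES)
    for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: +{gap} points")
```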

wheat onyx
#

but yeah way too low

novel flame
wheat onyx
stray aspen
#

is the gpt oss on lmarena high reasoning

wicked root
#

polymarket's going nuts today

stray aspen
#

this is the biggest week for AI

wicked root
#

They think Google's going to lose to OpenAi

wicked root
stray aspen
#

no

wicked root
#

Is gpt5 that great?

stray aspen
#

for everything

#

everyone is releasing models

wheat onyx
#

we'll see which is best soon

wicked root
barren prairie
stray aspen
#

they should release 3.0

barren prairie
open mountain
#

Oss 120b This model generally responds poorly as a human, as if it has been degraded compared to 4.1 and even more so 4o

wicked root
wheat onyx
#

but soon

novel flame
#

a quantized version could be really good

primal orbit
#

2.5 pro got released a few days after it appeared on lmarena under the name "nebula". I assume the same thing for 3.0

stray aspen
#

but the benchmarks are crazy

primal orbit
#

not a sign of 3.0 so far

#

2.5 pro was 03.25 exp

#

so 5 months ago now

wicked root
hollow imp
#

What is gpt oss

#

???

keen beacon
stray aspen
open mountain
hollow imp
#

Huhhh

open mountain
#

Literally 6 years have not produced open models of "everything for humanity", yes, we believe))

hollow imp
#

So not relevant for me ig

#

Gpt5 on lmarena when

stray aspen
#

gpt oss is in artificial analysis but they haven't benchmarked it

open mountain
stray aspen
hollow imp
#

My fav ytber ❤️

#

Get started now with privacy focused VPN by Proton! https://proton.me/pass/bycloudai

Can a neural network write its own data and skyrocket past GPT-4? In today's video, we dissect the brand-new “Self-Adapting Language Models” paper (SEAL), where an LLM fabricates synthetic data, tunes LoRA adapters, and after just two rounds, outperforms mu...

open mountain
# stray aspen august 7

I think OpenAI doesn’t always release new models on time, and usually gives access to Pro users first, and only then to Plus users.

hollow imp
#

What do y'all think about perplexity comet

wicked root
normal abyss
#

is gemini 3 already coming?

open mountain
quartz light