#general

ocean vortex
#

I agree that some of those prompts are overloaded with stuff about tools and are a bit much. But it’s still not definitively bad; you just have an emphasis on tools at the start of your chats. There has been no observed performance degradation in most cases. It basically evens itself out: it helps as much as it harms, but it makes perfect sense for the platform to use

torn mantle
#

dw its ok

ocean vortex
#

So depending on your prompt and model it may perform slightly better or slightly worse

hazy quest
#

Asura are you gatekeeping information leaked on X ?!

ocean vortex
#

What was it

#

What the fuck is this @full idol

alpine flare
hazy quest
#

This one

alpine flare
torn star
#

I can already tell these models will have an elo of 1800+

keen beacon
#

asura is trolling lol

rare python
torn mantle
#

they all bully me

#

what did i do?

rare python
torn mantle
#

it was removed

#

there is no such thing

rare python
#

🤔

tepid turtle
#

Hi guys, I can't seem to find the new "zenith" model everyone's talking about, was it removed in the meantime?

rare python
#

54%

torn mantle
#

...

hazy quest
fossil fable
#

am i stupid

marsh sundial
#

summit got better writing style, zenith is over rhetoric, maybe variant

sour spindle
#

Going to be quite funny if these new models aren’t OpenAI

#

Meta finally putting that money to use lol

#

Some obscure cracked Chinese company side project?

alpine flare
#

Did some testing on the models. Here's what I found out:

Zenith introduced a new perspective on a physics question (electrostatics) that I had, which I have never seen before from any other model. However, it made some strange assumptions and concluded that a scenario was only true if the assumption was also true, which was incorrect. I have also never seen this assumption from any other model before. Usually, you get very similar arguments from several models: you can often predict how they will respond to and reason about a specific physics question (right or wrong, there are pretty typical places where they go wrong, and then the answer is wrong). Zenith and Summit were both entirely new in this regard. However, I did not get the impression their raw knowledge base has improved much, as they still produced hallucinations on niche topics similar to those of o3 or 2.5 Pro.

hazy quest
#

I got

  • Summit against: Gemma 3 27b, Deepseek V3 0324, 4o 0326, llama 3.3 70B,
  • Zenith against: Amazon nova pro v1, Sonnet 4 (x2), Sonnet 4 32k

Weird that Summit was paired only against clearly weaker models, and that neither was paired against the big boys.

torn mantle
alpine flare
#

Suppose you insert two metal plates pressed together inside a charged capacitor disconnected from the battery (constant charge). Now we have Capacitor Plate 1 -> MetalPlate1+2 -> Capacitor Plate 2. Then, we separate the two metal plates in between the capacitor plates, so we have Capacitor Plate 1 -> MetalPlate1 -> MetalPlate2 -> Capacitor Plate 2. Please discuss whether there is a net electric field between MetalPlate1 and MetalPlate2.

#

Even o3-pro answered "After the two inserted metal plates are pulled apart, they carry equal and opposite charges that cannot neutralise each other. Those charges reside partly on the facing surfaces and create a non-zero, uniform electric field in the space between the plates." which is clearly wrong
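
A rough check of why that answer looks wrong, as a sketch under idealized assumptions: infinite parallel plates (no fringe fields), surface charge density ±σ on the capacitor plates, and the two metal plates separated while still inside the capacitor so that each keeps its induced charge (−σ and +σ per unit area). The symbols σ and ε₀ and the sign convention (positive x pointing from Capacitor Plate 1 to Capacitor Plate 2) are introduced here for illustration.

```latex
% Four charged sheets, left to right along +x:
%   Capacitor Plate 1: +\sigma    MetalPlate1: -\sigma
%   MetalPlate2:       +\sigma    Capacitor Plate 2: -\sigma
% An infinite sheet with density s produces a field of magnitude |s| / (2\varepsilon_0),
% pointing away from the sheet if s > 0 and toward it if s < 0.
\begin{align*}
E_{\text{gap}}
  &= \underbrace{+\frac{\sigma}{2\varepsilon_0}}_{\text{Capacitor Plate 1}}
     \;\underbrace{-\frac{\sigma}{2\varepsilon_0}}_{\text{MetalPlate1}}
     \;\underbrace{-\frac{\sigma}{2\varepsilon_0}}_{\text{MetalPlate2}}
     \;\underbrace{+\frac{\sigma}{2\varepsilon_0}}_{\text{Capacitor Plate 2}}
   = 0
\end{align*}
```

Under these assumptions the contributions cancel in the gap between MetalPlate1 and MetalPlate2, while the same superposition gives σ/ε₀ in the two outer gaps. Requiring zero field inside each conductor also puts all of the induced charge on the faces turned toward the capacitor plates, so nothing sits on the facing surfaces.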

whole wagon
#

whenever you put an actually complex problem in llm arena and get a reasoning model, the model thinks for so long and times out lol

torn mantle
whole wagon
#

what

torn mantle
#

maybe you are smarter than all of us

ocean vortex
whole wagon
#

summit is really good

ocean vortex
#

lmao

whole wagon
#

like damn wow

fossil fable
civic flame
#

webarena only has battle mode lol

whole wagon
#

I am in a game show where there are 3 doors. One contains a prize and two a goat. I can choose one door at the beginning. Then the game master offers me to change it. Does changing it increase my probability of winning?

Still cant get this basic question right

#

The answer is obviously No. But it overfits to training data on monty hall problem

#

simplebench type problem

#

The reasoning the model gives is instantly about monty hall and the maths

stray aspen
#

stepfun just dropped a new model

whole wagon
#

Which is interesting. It's like trying to do typo correction but for the entire prompt

alpine flare
hazy quest
#

Same. Got summit twice, and both times it said something along those lines.
"Caveats: If the host does not reveal a door and simply asks if you want to change to one of the other two at random, switching doesn’t help (it stays 1/3)."

alpine flare
#

zenith didnt do this btw (Grok-4 and summit are the only models that provide such a note, Grok even states "You didn't mention the host opening a door, which is unusual...")
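
A quick way to sanity-check the caveat is to simulate both versions of the game. This is a minimal sketch, not anyone's benchmark code; the function name `win_rate` and its parameters are made up for illustration, and it assumes that without a reveal the player switches to one of the other two doors at random.

```python
import random


def win_rate(switch: bool, host_reveals: bool, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the chance of winning the prize in a 3-door game."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)
        pick = rng.randrange(3)
        if switch:
            others = [d for d in range(3) if d != pick]
            if host_reveals:
                # Classic Monty Hall: the host opens a goat door among the
                # other two, and the player moves to the remaining closed door.
                opened = rng.choice([d for d in others if d != prize])
                pick = next(d for d in others if d != opened)
            else:
                # Variant from the prompt above: no door is opened, so the
                # player just moves to one of the other two doors at random.
                pick = rng.choice(others)
        wins += pick == prize
    return wins / trials


if __name__ == "__main__":
    for reveals in (False, True):
        for sw in (False, True):
            print(f"host_reveals={reveals}, switch={sw}: {win_rate(sw, reveals):.3f}")
```

With no reveal, staying and switching should both come out near 1/3; only the classic version with a revealed goat door pushes switching toward 2/3.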

whole wagon
#

70% chance it's here by Aug 15th

torn bison
#

more like 90% but im not gonna risk my money on it

whole wagon
#

Interestingly the Dec 31 odds are not shifting though. A Google counter response is expected by then I guess

tall summit
torn mantle
#

i was confident gemini models wont be topped

#

but zenith / summit proved me wrong

whole wagon
#

Wtf

torn bison
#

I never thought zero shot prompting could create apps of this caliber

torn mantle
whole wagon
#

No on chatGPT

torn mantle
#

some people said o3 on chatgpt got redirected to the new model zenith/gpt-5

whole wagon
#

I think more people would have noticed that lol

stray aspen
#

where are gpt-5 news

torn mantle
primal orbit
whole wagon
#

Maybe it's whatever o3-alpha is

torn mantle
#

maybe

primal orbit
#

wtf is "stay switzerland" supposed to mean

leaden sun
stray aspen
#

lol

leaden sun
#

i like "surgically helpful in private", like it's implying a "private therapist" or clinically analytical helpfulness with surgical precision

rare python
#

@torn mantle

leaden palm
#

so many models

stray aspen
#

to use zenith you just send battle mode your prompt until you get it?

civic flame
#

yes

whole sundial
#

someone is trying to censor people talking about nightride? I guess that's because Google doesn't want people finding out their model searches on the web, which is why it is good at knowledge. I gave a question yesterday with a very obscure answer that can only be found by searching, and nightride-on got it correct.

keen fulcrum
#

It could have been in its dataset

whole sundial
#

don't think it would have known this lol

#

this isn't the search arena google, why are you using web search on the normal arena? that is not fair to any other model

keen beacon
#

gemini advanced/bard with search has been part of the normal arena in the past

#

its not a new thing

jagged crown
#

I'm literally in shock. Zenith is basically creating things that have taken me months in the span of a few minutes

unborn ocean
#

aka just an experiment, idk though

stray aspen
#

is o4-mini-high in the arena

torn mantle
#

i still think its kiri

civic flame
whole wagon
#

Summit > Zenith > Lobster > Starfish

wheat onyx
#

Looking forward to gpt5 in coming days

whole wagon
#

Looking forward to open source catching it in 3 months time Kappa

jagged crown
#

I don't think the ramifications of these technologies are really understood by the general public. While on one hand I'm thrilled to be able to create the ideas I have in my head, I'm also terrified for what this will mean for the economy at large

wheat onyx
jagged crown
wheat onyx
toxic whale
tight silo
#

yeah, I got a random call from summit when doing the battles and I thought it was pretty dang awesome

#

only reason why I didn't give it the win is cause it generated something only partially related to my prompt

#

asked it for a fantasy story set in the perspective of a paladin in hell, it gave me a story of a big-rig driving paladin punching demons so it was pretty close, but not quite fantasy

#

still cool as hell though

tight silo
#

meant it just showed up

leaden palm
#

ok

tight silo
#

i am very very tired

ornate stump
#

Don’t worry in a couple of years you’ll get a call from someone’s ai assistant for real

tight silo
#

yeep

solar hollow
#

you guys know if there are restrictions on lmarena for german users?

#

when i want to battle in the arena, sometimes responses take a very long time and/or just error after a while

#

really discourages me from continuing

torn star
#

how long until we have ai models creating videos for us on a customized fyp

dawn wharf
whole wagon
#

Summit is gone out of battle? 😦

#

Zenith is not on the same level

echo aurora
hardy pecan
#

Anyone get the feeling this discord is being used to astroturf and to pump up the new models to manipulate polymarket odds? all these "new" people coming in seems sus.. haven't had them come in in droves like this before lmao

#

maybe im just skeptical

whole wagon
#

I don't believe it Kappa

whole sundial
whole wagon
#

I don't think the polymarket betting (and kalshi etc) has created very good incentives in this space

hazy quest
whole wagon
#

I don't like that they keep adding and removing models

#

I only get zenith now

sonic tendon
#

market reactions to new information can be very fickle

#

and liquidity on even the biggest ai market on poly is still awful

#

i don't think that you could make much by speculating on and attempting to control crowd sentiment

gusty tendon
brittle tiger
#

@civic flame what examples of zenith/summit doing great with reasoning have you seen? the svg pelican looked on par with 2.5, worse than opus 4 to me

civic flame
#

sorry, remind me tomorrow i'm not in a good place at the minute. rough night

fossil fable
#

how do i use direct chat on webarena

echo aurora
#

That is feedback we're aware is important though.

formal dagger
#

summit crushed Gemini 2.5 on timeline management of story

ocean vortex
fossil fable
ocean vortex
#

But I agree online models have no business directly competing with models having no tools or internet access

stray aspen
#

does any model smash claude sonnet 4 no think at coding?

storm needle
ocean vortex
#

It’s not even the same league tbh

wanton sonnet
mint cape
#

Hello

torn star
#

Cuttlefish is actually insane.

candid storm
blazing rune
#

I wish they had a non reasoning option for Gemini 2.5 Pro, and had very short reasoning versions for both, like 1k tokens or something

fleet smelt
whole wagon
#

You can't

#

Because they are unreleased

fleet smelt
#

So how are normal people testing it??

polar venture
fleet smelt
polar venture
#

Like this, here I use ChatGPT 4o

torn star
stray aspen
#

is this o4 mini high

stray aspen
fossil fable
fleet smelt
stray aspen
#

just keep on sending until you get that model

fossil fable
stray aspen
#

its companies testing their models

#

its unreleased ai models

#

i mean kinda is

fossil fable
dusky pier
#

How is it insane?

hardy pecan
tight nest
#

is it true that the current o3 on the chatgpt ui actually routes to Zenith? I saw some people claiming that on twitter

dusky pier
#

I'm glad they are killing it

#

It's so expensive

hardy pecan
#

I've been getting a lot of A/B testing using o3 in chatgpt.com

#

Highly doubt they just replaced it with "zenith" or "summit"

dusky pier
wheat onyx
#

I wonder what the costs for gpt5 are

stray aspen
#

is this 4o mini high?

dusky pier
#

I don't see the hype in it

hardy pecan
#

Lol

#

so bizarre

small haven
#

hows zenith

jade egret
#

when gpt 5.....

small haven
#

yes how is it

#

better than o3-pro? 👀

#

or kingfall?

forest prism
#

Is folsom-072125-1 just minimax-m1?

dawn wharf
sturdy mica
#
Poll: best model rn
Winner: o3 pro (answer 3), 7 of 14 votes

torn bison
whole wagon
#

Bruh why am I just getting zenith

#

Summit is so much better smh

zinc ore
#

They changed the sysprompt supposedly, which lowered zenith performance

hardy pecan
#

Appears to be so, I'm verifying simple-bench questions summit vs zenith to see which is smarter, although contamination at this point might be redundant

leaden palm
#

what do we think about zenith's knowledge? is it just a really recent cutoff or does it search the web?

leaden palm
#

folsom-072125-2 is kinda stupid

small haven
#

wow cant wait to use it

wintry tinsel
#

Gpt 5 has to be successful for the health of the whole industry

winged locust
#

NO CUREFISH

haughty tangle
#

summit is o4-pro

stuck rose
#

A great arena for understanding different models. Looking forward to hearing about API possibilities for different models.

limber anvil
#

Is it better than

#

Claude 4

reef pawn
#

Do Copilot PC NPUs have any use cases with LLMs?

heady arch
#

Guys do you think that zenith and summit got deleted? I can't get them for 1 hour

gray delta
heady arch
#

Oh no

zealous panther
#

what does that mean

hardy pecan
#

bastards

#

I was testing them

keen fulcrum
sweet tinsel
#

Copilot is weird, the deep research is not available in the browser, it told me to pay for plus in the desktop app and I've got 10 free uses on the mobile app.

hardy pecan
#

So zenith is still here then perhaps

civic flame
#

it should be

#

but I haven't been able to get it

hardy pecan
#

Neither

civic flame
#

it's still configured on the frontend but I have a feeling it's been disabled behind the scenes

#

as has summit

#

because I've gone from getting them every few rounds to getting neither of them in 50 rounds
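
As a rough back-of-the-envelope check on that reasoning: if a model showed up in any given battle with some fixed probability, the chance of missing it many times in a row shrinks quickly. The per-round probabilities below are hypothetical (the arena's real sampling weights are not known here), and rounds are assumed independent.

```python
def miss_probability(p_appear: float, rounds: int) -> float:
    """Chance of never being served a model in `rounds` independent battles."""
    if not 0.0 <= p_appear <= 1.0:
        raise ValueError("p_appear must be between 0 and 1")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return (1.0 - p_appear) ** rounds


if __name__ == "__main__":
    # Hypothetical appearance rates, e.g. "every few rounds" might be ~0.2.
    for p in (0.05, 0.1, 0.2):
        print(f"p={p}: chance of 50 straight misses = {miss_probability(p, 50):.2e}")
```

If the model really had been appearing every few rounds, 50 misses in a row would be very unlikely by chance, which fits the guess that it was switched off server-side.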

hardy pecan
#

Yeah same here

sweet tinsel
#

Copilot DR is Weird, it refuses to speak in English.

whole wagon
#

Maybe they removed cos they saw this kek

#

The august 15th release bet is at 69% so the market thinks basically guaranteed GPT5 reaches 1st on lmarena

midnight ferry
fallow remnant
#

Where to get to gpt 5?

cedar tide
#

Zenith and summit removed too

#

@echo aurora possible to tell the lm arena team that we absolutely want the possibility of regenerating the last answer of the llm even if we have already voted (we could do it on the old lm arena)

midnight ferry
#

gpt 5 released 31st july

hardy pecan
#

Simple Bench scores from what i tested so far, wasn't able to get to 20 questions in time, before they were removed
Summit: 9/11
Zenith: 10/12

for comparison, gemini 2.5 pro: 7/11 or 7/12 of the simple bench questions

Unfortunately not complete testing, but a rough idea at least
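
To put "a rough idea" in numbers, here is a small sketch that turns those counts into percentages with a 95% Wilson score interval. The helper name `wilson_interval` is made up for this example, and Gemini 2.5 Pro is entered as 7/11 even though the count above is given as 7/11 or 7/12.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Counts taken from the message above; 7/11 for Gemini is one of the two stated options.
scores = {"Summit": (9, 11), "Zenith": (10, 12), "Gemini 2.5 Pro": (7, 11)}

if __name__ == "__main__":
    for name, (correct, total) in scores.items():
        low, high = wilson_interval(correct, total)
        print(f"{name}: {correct}/{total} = {correct / total:.0%} "
              f"(95% interval roughly {low:.0%} to {high:.0%})")
```

With only 11 or 12 questions per model the intervals are wide and overlap, so the ordering is suggestive rather than conclusive.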

midnight ferry
#

what about lobster ?

hardy pecan
#

the next 10 simple bench questions generally trip up a lot of models, so I'd expect them to perform less well than on the first 10 public ones

#

Only got lobster in webarena

whole wagon
#

where do u get 20 questions

hardy pecan
#

from a competition the simple bench fella held to find the best optimized prompt to get these questions correct, wandb.ai dataset i believe

blazing bison
#

They removed zenith, killed my boy

#

😢

sweet tinsel
#

It's a mistake testing ms copilot out, it's really corny.

merry stag
#

where can i try zenith? i tried http://lmarena.ai and it always uses a known model like o3, not an anonymous model

fleet lintel
#

and these models are fast or slow?

hardy pecan
stone birch
#

I tried using the same prompt about 30 times in battle mode to test Zenith, but it never came up.
Is it possible that it has been removed from LM Arena?

civic flame
#

yes it's been disabled but the arena is still configured for it

#

so it's probably temporary

#

normally when they're done completely they'll remove the model on the server and client sides, but it's still there on both, just disabled in evaluations

stone birch
#

Can't use it? 😭

civic flame
#

not right now

past dagger
#

Is there any way to transfer my chat from one pc to another

civic flame
#

nope

languid crescent
#

i did suggest it i think its part of their plan

#

like ability to export chats like chatgpt

past dagger
#

then only copy paste works ?

languid crescent
#

yea that's what i've been doing just copy all the convo to a .txt file

past dagger
#

how do you do it

languid crescent
#

literally copy every message 😭

past dagger
#

one by one

languid crescent
past dagger
#

got it tnq

torn mantle
#

am i blind...

#

is it for plus users only?

wild kayak
#

A question about web dev arena. If one model fails to generate valid JSON for the chat, will this directly result in the model's failure in the battle? Or will the battle just be ignored?

torn mantle
blazing bison
#

?

torn mantle
#

?

blazing bison
#

?

torn mantle
#

yea its decided by the final votes on lmarena

blazing bison
#

what i mean is that models that are trained to smash benchmarks do better than genuinely good models

#

that's why i said "basically"

#

Llama 4 is essentially proof of this

whole wagon
#

simplebench would be good if he actually added models

fossil fable
#

bug: webarena refuses to work whenever accessed via mobile browser on desktop mode

civic flame
#

works for me

forest flower
#

Hey guys, i'm new to web dev arena. i went into the battle section and wrote a prompt and i got the 2 results side by side, but i can't seem to figure out which AI / models generated each output. is it possible to see which one generated these outputs? i looked everywhere and i still can't seem to find it

civic flame
#

you see after you vote

forest flower
civic flame
#

yes

#

both lmarena and web arena

forest flower
#

cool, thank you

#

I got one more question: how would i generate a video in video-arena here? i tried #video-arena-1 A real life sagittarius in its true form, but it

#

doesn't seem to give any output

gentle plinth
#

You have to use the command /video

forest flower
#

alright, i didn't see that, thanks a lot @gentle plinth

smoky finch
#

guys are the new models (lobster, summit) no longer available in the arena?

blazing bison
#

Yes, removed

smoky finch
#

😦

fossil fable
#

Fuckkk. The models are gone.

fossil fable
blazing bison
#

what did they do with my zenith boy

reef tendon
#

Do they usually pull these models a couple days before release?

blazing bison
#

weeks before release

#

2 weeks

reef tendon
#

naaah :///

blazing bison
#

1 - 2 weeks generally

reef tendon
#

I needed them

blazing bison
#

they aren't much better than sonnet or o3

#

only on frontend, if you do frontend things then i understand

blazing bison
#

if you say so

storm needle
#

except claude

unborn ocean
#

for most labs it is likely only from the pre-training stage

#

though there are definitely a lot of other labs just straight up using the sota in post training

mellow mango
#

Im new so I gotta ask how does LMArena work? Can I really generate more than, let's say, 3 images of gpt per day even though it's limited on the official website?

#

I mean, is it basically useless paying for GPT Plus cause of this?

leaden palm
mellow mango
#

such as gpt image and google's imagen 4 ultra

leaden palm
#

ah right there's also direct mode

#

well then it's just a matter of rate limits and features

mellow mango
leaden palm
#

i don't think limits are explicit

mellow mango
#

it's pretty good

gentle plinth
zealous panther
#

and stuff like deep research and allat

hazy quest
#

Cuttlefish is strange. I don't know if its in a good or bad way. I asked for a task, he answered that "it's a good idea, but why not going even further?" (and suggested how, instead of actually doing my task)

blazing bison
#

🤓

clever estuary
#

yo does anyone have the system prompt for o3-2025-04-16 on LM Arena, something feels suspicious here
I tried both ChatGPT o3 and the playground API
the answer quality there is much lower than o3 on LM Arena
either the arena has a very good system prompt or OpenAI is being very shady here

blazing bison
#

The only good models of this batch were zenith and summit

pseudo magnet
#

summit seems very good

#

all my questions he got 100% right

#

even the niche ones

blazing bison
#

It just points to the model directly using the api

storm needle
clever estuary
#

hmm I see... I'll try that with the API

hazy quest
#

LMArena did not reveal which model it was after I selected the best answer. Have you had that happen?

clever estuary
blazing bison
hazy quest
clever estuary
blazing bison
clever estuary
#

oh interesting
I never knew that lmao

blazing bison
#

O3 from api is much, much better

leaden palm
blazing bison
blazing bison
gentle plinth
#

It's not all requests tho, only some

blazing bison
#

The source?

#

He works at openai so...

#

yes, it's noticeable that o3 on chatgpt is bad compared to api

#

o3 on chatgpt is lazy, and dumb for coding

civic flame
#

although the new sysprompt they added to them on lmarena made them way dumber

blazing bison
civic flame
#

pre-sloppification they're noticeably smarter at basically everything

#

zenith got 10/10 on the public simplebench dataset lol

#

yeah

#

the only question it still sometimes struggles with is the last one

blazing bison
#

i tried it and got like 7/10

civic flame
hazy quest
blazing bison
#

2 days ago i think

civic flame
#

and were you giving it the questions as raw text with the choices

stray aspen
#

are o3 api requests also being routed to gpt 5?

blazing bison
#

yes raw text with the choices

#

just copy pasted

civic flame
#

no

#

individually

#

regular

raven oracle
#

Zenith is gone, right?

civic flame
blazing bison
civic flame
#

but not gone gone

#

it's still there just the serverside doesn't give it to you in evaluations

gentle plinth
#

Until Google takes their turn and drops another model checkpoint

hazy quest
#

Kingfall was already a "while" ago though

gentle plinth
#

Time will tell

civic flame
#

it's the base model for deepthink

#

lol

blazing bison
#

if openai still have the strict policy of using data only from 2023 and 2024 with exception of politics, then it's not the case

#

sam said that the reasoning behind that is that the model can get updated information searching the internet

#

the objective is the models be smart enough to use internet and in context learning

civic flame
civic flame
#

nope

stray aspen
#

does anyone have access to gemini 2.5 pro deepthink?

civic flame
#

it knows what gpt-4.1 is without it being in the sysprompt

blazing bison
#

because they train their models with their api information

#

and it's always updated

civic flame
#

maybe so

gentle plinth
#

I think they can recognize output from their own models tho, but not sure. Gptzero seems kinda OK for classifying if some text is Ai generated. Not 100% accurate, but still somewhat

blazing bison
#

they still update the models with new data btw, just not new data from internet

#

synthetic data

civic flame
#

tbh they shouldn't stop updating it with new info from the internet for that much longer

#

it's useful for models to have more internal knowledge without needing to use web tools

blazing bison
#

i think that gpt-5 is a model router that routes to new models based on gpt 4.5 or gpt 4.1

#

we didn't see any gpt 4.1 or 4.5 thinking yet so zenith is prob one of them distilled + thinking

gentle plinth
#

If he says unified this would mean multiple models under the hood

blazing bison
civic flame
#

lol what

leaden palm
#

iirc reputable sources said it would be like a router first but would eventually become unified

blazing bison
civic flame
#

link it then

blazing bison
#

i think i bookmarked it, gonna check

gentle plinth
#

Which would mean they were two models

clever estuary
#

so what, it's just MoE

civic flame
gentle plinth
#

It could mean a lot of things

clever estuary
#

MoE is "unified model"

blazing bison
#

cause he made mistakes more than one time and just delete tweets like nothing happened?

gentle plinth
clever estuary
#

it's impossible to be just one single model because that would be insanely hard to run, assuming 5 is much much better than o3 etc

blazing rune
wintry tinsel
#

Gpt 5 coming out in only one billion years guys 🥹

civic flame
#

it's releasing in the next 2 weeks lol

wintry tinsel
#

That’s just a rumor

#

The ai race is so pressured and accelerated they don’t let things simmer and discover paradigm shifting advancements, every iterative improvement is a new release, so the new next Sota is almost always one iterative improvement over the previous one

civic flame
#

i know someone at oai who told me that it aligns with his understanding of the launch window

#

take that as you will

#

but i'm confident

dawn wharf
wintry tinsel
#

I will take that as a month give or take and I’m not one month patient

dawn wharf
#

they might have found optimization techniques

clever estuary
civic flame
dawn wharf
wintry tinsel
#

Google does yeah

civic flame
#

i hope they're cooking with deepthink

dawn wharf
#

but I agree

civic flame
#

i trust that they are

civic flame
blazing bison
civic flame
#

ty

blazing bison
#

if the o3 thing is true, that some questions make it answer like zenith on chatgpt, the routing approach is basically confirmed and in test

#

and the promise that everyone will get access to gpt-5 unlimited, they need routing for that...

wintry tinsel
#

Gpt 5 will cost through the nose :/

blazing bison
#

I don't think it will cost more than o3

gentle plinth
#

old or new price?

blazing bison
#

new

#

bro do you have sources or you just say things on your mind?

#

oh my god

#

When discussing the price of things, it's best to base it on something

#

professional yapper

#

i source things that i say

gentle plinth
#

🍿

blazing bison
#

ok keep yapping

#

unified model and can you inference guy

#

kek

#

and i have money to pay for it

#

we aren't the same

digital umbra
# blazing bison I don't think it will cost more than o3

if their model router is calibrated to direct you to the smallest model most of the time, they could get away with advertising a lower price but it would also be quite annoying for most people i guess. (they could add pricing tiers where more $$$ -> better chance of getting the good model)

blazing bison
gentle plinth
blazing bison
#

keep yapping

#

i'm not, even if it's a router they're not gonna say it

#

the unified info is from feb 12

#

great source you too

#

they consider "little routing" and "unified model" the same thing

#

and it's not

#

a game with words

#

scam altman is playing

digital umbra
#

i guess the interesting thing to see is if routing will be only in chatgpt (it's pretty much guaranteed to be) or if it's going to be in the API as well

leaden palm
#

i think openai and anthropic are competing for the title

zinc ore
#

I don't even see what the big deal is with them routing

blazing bison
#

If you compare the marketing of december o3 and what got released, the difference...

#

And people fall for it again and again

gentle plinth
#

they make great products, but not regarding ai. they marketed the new siri to be able to read your e-mails, lookup photos and make new appointments based on all of that, and they couldnt deliver

blazing bison
digital umbra
#

i think they said reasoning wasn't worth the effort

blazing bison
#

they said that llms weren't worth the effort

digital umbra
#

this is what i was thinking of

#

maybe you think of something else

blazing bison
#

Did you read the paper?

gentle plinth
#

the paper cant be taken seriously, it was written by an intern, and it turned out that the context window of the llms wouldnt even be sufficient to write out the entire solutions that failed

blazing bison
digital umbra
blazing bison
#

"both collapse at high complexity"

digital umbra
#

that was snipped from the conclusion

blazing bison
#

I don't know what your point is

digital umbra
#

i think we all read it in different ways

unborn ocean
#

the paper is really bad, they released multiple papers like that

#

they even did something similar for their open source llm, where they 1. invent new eval, 2. sota sucks at it apparently, 3. conclude: ai is a joke

#

it is basically just them trying to pretend like them sucking at ai products is completely fine

digital umbra
#

i don't really understand why apple is doing it

#

they're sitting on a goldmine really, they have cheap-ish hardware (not by consumer standards, but compared to nvidia datacenter stuff) that can run large models

#

it doesn't really matter if their in house models aren't that good, they could make a lot of money from hardware alone

gentle plinth
#

ai isn't so much their thing in my opinion. they are already making lots of money with hardware. but training their own models doesn't fit into their philosophy of striving for perfection. you can't control ai, it's just a black box that can sometimes do unexpected things. also, since they are oriented towards privacy, they don't collect a lot of data which could be used for training. they probably also don't want to get their hands dirty by torrenting books as meta did to train ai

torn mantle
#

And how's their hardware compared to tpus? Thats the real cheapest thing, they don't have cost advantage, their r&d ai team are still far behind, and they dont have a clear business plan for it

digital umbra
#

nvidia is the highest valued company in the world due to their hardware and the fact the world is running on their CUDA stack. apple could easily grab a slice of that cake by developing an open cross platform alternative to CUDA together with other actors like AMD and intel but they're too mismanaged

blazing bison
#

They lost the head of Apple Intelligence to Zuck btw

digital umbra
#

they have a built in TPU core in their SoC but it's just pretty limited since it's a consumer device in the end. perhaps they could scale it and make dedicated server hardware...

blazing bison
#

Zuck is throwing money on researchers like crazy

digital umbra
#

meta is focusing on software and model development. apple should focus on hardware instead. there's no point in all companies trying to do the same thing i think

stray aspen
#

I agree

blazing bison
#

🤷‍♂️

#

I wouldn't like my company being dependent on 3rd models

digital umbra
#

i wonder what meta and openai thinks about being dependent on google hardware...

blazing bison
#

Openai is doing everything they can to change this

#

And zuck too...

torn mantle
digital umbra
#

perhaps but doesn't that show that there is still an opening?

#

if there still is no viable alternative to cuda

torn mantle
#

Its really not that simple

#

I mean it may look like that but its a complex ecosystem

#

Even if apple changed their business plan strategy

#

Nvidia & apple have diff strategies btw... You will never find an m4 or m3 processor on dell

timber kiln
#

Yeah no other company can run 2B terrible models on their phones right genius Apple

#

Literally every phone?

#

None of them need to install that default
There are hundreds of apps on the marketplace

hybrid widget
#

hello

timber kiln
#

Your point is nobody cares about edge models?

digital umbra
#

both google and microsoft are putting lots of research into small models for phones and laptops, they're probably already ahead of apple in most areas there

timber kiln
#

I mean if you can enjoy a 2B model
Go ahead buddy
That is not enough for my needs

digital umbra
#

even if apple has a "nicer" ui it doesn't really matter if the model is crap

timber kiln
#

It does matter

#

If you are privacy schizo you need a beefy gpu to keep up not a tiny npu

torn mantle
# digital umbra if there still is no viable alternative to cuda

Also even if they made something similar to cuda, good luck convincing all ai companies to switch to their own, and good luck adding support for this new cuda alt in pytorch/tensorflow, and gl making 1000+ optimized kernel functions for that. nvidia literally has a department with hundreds of engineers working for months just to optimize cuDNN kernels to gain a 1% optimization

timber kiln
#

What sensitive data can you feed into those tiny models even lmao they have terrible context length

#

You wouldn't know using only local small models

digital umbra
torn mantle
digital umbra
#

google and intel make up 1% of the consumer gpu market, at best

torn mantle
digital umbra
#

apple at least sells large amounts of consumer hardware, and amd i think is essential

torn mantle
#

U are always right

hybrid widget
#

hii

torn mantle
#

Apple sells the experience

#

The whole package

digital umbra
#

i know

#

that approach doesn't work for them anymore

#

unless they start to inject huge amounts of $ and talent like meta

torn mantle
#

They need to sell the whole ecosystem then

#

Processor + OS

digital umbra
#

apple has been a hardware company since the start, i think trying to own the entire software ecosystem was a mistake for them

torn mantle
#

Apple is in a safe zone

digital umbra
#

i wouldn't say that is the only reason

#

nvidia was at the bottom when the tariffs were announced

#

apple wouldn't have been as affected if they sold hardware to datacenters as well

stray aspen
#

Damn. craig came to educate everyone

fossil fable
#

how do you boost a server for 6 years straight what the fuckkkk

stray aspen
#

How long has this server existed

fossil fable
#

yeah

#

server must've been wiped

torn mantle
#

hax

torn mantle
#

thats why they invested heavily in india recently

wanton sonnet
echo aurora
gentle plinth
#

https://youtube.com/shorts/trwoCpWN6Ug just one of many examples where apple ai is just not sota, apple is good at other things, but definitely not at ai

Samsung AI vs Apple AI — which one is actually better? We put both to the test with real-world tasks and features. From photo editing to live translation, who wins the AI battle? Watch until the end! #samsungai #appleai #aicomparison #techreview #galaxys25ultra #iphone16promax

sturdy mica
#

but not just

#

4 letters

#

it sets off automod

whole wagon
#

Top openrouter companies this week

#

Qwen shot up lmao

dusky pier
whole wagon
dusky pier
digital umbra
#

mistral's usage that is

whole wagon
#

GPT5 going to come in clutch for OpenAI. They definitely need it lol

gentle plinth
digital umbra
#

doesn't make much sense to use byok with openrouter when you still have to pay openai directly and 5% to openrouter on top of it

whole wagon
#

It is a relative change

gentle plinth
#

you pay less than 1% in fees for topup and get a 1% discount

ocean vortex
gentle plinth
#

maybe but where are you getting the info that OR users pay more?

whole wagon
gentle plinth
#

the pricing on both pages is the same

#

there is, but as i said it's less than the 1% discount

#

its enabled by default

#

as i said it's enabled by default

#

ah yes

#

😂

ocean vortex
gentle plinth
#

ok i was wrong

digital umbra
#

OR is for using large open source models + people too lazy to set up litellm for proprietary models

ocean vortex
#

So 2 months ago was not representative tbh

#

Neither is the current one unless you pretend o3 doesn’t exist lol

whole wagon
#

Well it went from 24.7% to 4.7% in 2 months, that's not easy to explain away

#

Even if what you say is true I don't think it causes that huge difference
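
For scale, the drop quoted above (24.7% to 4.7%) can be read two ways: as a change in percentage points, or as a relative change against the starting share. A minimal sketch of that arithmetic, where the function name `share_change` is made up for illustration and the two figures are the only inputs taken from the chat:

```python
def share_change(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (change in percentage points, relative change as a fraction)."""
    points = after_pct - before_pct   # absolute difference between the two shares
    relative = points / before_pct    # difference measured against the starting share
    return points, relative


# Figures quoted in the chat: 24.7% two months ago, 4.7% now.
points, relative = share_change(24.7, 4.7)
print(f"{points:+.1f} percentage points, {relative:+.1%} relative")
```

That works out to a 20-point drop, or roughly 81% of the original share gone, which is why it is hard to explain away with a small accounting change.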

ocean vortex
#

People are simply using the best (and cheapest) that they can access

#

Qwen and Deepseek both updated their models or released the new ones around that time. So everyone wanted to test those too, even if they didn’t end up sticking with those…

#

Same for Claude4

#

Kinda all checks out tbh

ocean vortex
#

This reminded me of a presentation I saw recently by some true hard-core coder. He coded the entire presentation himself almost live on stage (no powerpoint or any other app). He had a strong opinion against using any AI at all lmao

#

When you are this good at it, AI will slow you down in many cases or just screw things up

fossil fable
fossil fable
torn mantle
#

This will translate to overall cost optimization and stability

#

Sigh, mistral focus changed completely

#

They are working on implementing AI tools/features on ERP systems like SAP/Sage... That's good but i miss old mistral, feels like they're far behind competitors

ocean vortex
# wanton sonnet The problem is that investment wouldn’t help them or save them. The difference i...
#

Apple’s involvement in China is huge. They have a low-key deal with them where they are essentially investing billions into developing the country itself. In return for cheap labor and no nasty surprises from CCP

#

A bit ‘deal with the devil’ kind of thing. They are politically incompatible but also inseparable and one of the biggest contributors to China’s success

torn mantle
#

They are working closely with the ccp yea

lime coral
#

hikikiki

ocean vortex
#

On the surface level what Trump is trying to do by forcing everyone to move away from China may look right. But not when you realise his own phone he is gonna sell is made in China and not when you start to understand the complexity of the supply chain and the current market. It’s just not possible and this is a dumb way to do this

#

Short-term political gains at the cost of completely destroying things and causing chaos or distrust longer term. China is laughing at this as all cards are in their hands and they seem to be much more competent at diplomacy, more measured too

#

For impartial countries China even with all their obvious issues is becoming a more reliable partner than US…

sick chasm
#

zenith is in rotation again, we're so back lol

#

summit too btw

torn mantle
#

Oai usually has a pattern releasing models after adding them to lmarena

#

Idk if its a one week thing or two weeks

#

I think gpt5 will be released next Thursday

torn star
#

Or the one after this one

torn mantle
#

One after

#

Are u sure?

stray aspen
#

does anyone know when the video arena will be released

jade egret
#

which one is coming out first, gpt-5 or gemini 3

stray aspen
jade egret
#

why

#

is gemini 2.5 pro a base

#

what's the base model right now for google?

#

2.5 flash?

#

plz tell 😦

#

so we need gemini 2.5 ultra

#

kingsfall is deepthink?

#

fr?

#

so google does have ultra so they js not releasing it

#

oh

#

why tho

#

🍊

#

but i dont think grok gonna be good especially after grok 4

#

oh

#

true

#

hopefully elon musk doesn't mess it up 🙂

#

maybe

#

oh

#

so gpt 5 is still gonna be much better

#

cmon release it already 🙃

#

next few weeks

#

hopefully

#

even free user can access it i think

#

js not as good as subscriptions

#

please be good at coding

raven oracle
sick chasm
#

Also, I'm not at all familiar with how lmarena works, but for very simple prompts (like the one above), I've got summit/zenith consistently (for the past hour or so). haven't been able to catch them with complex prompts tho.

whole wagon
#

Interesting

#

Summit is a beast. I dont find zenith that good ngl. Like it's slightly better than o3 (so it will get 1st) but that's it

sour spindle
sturdy mica
#

i wish you could upload images with search. i really need that

#

i dont know why you cant. its like the arena is restricting you

#

@echo aurora please, adding attachments would be awesome to search

#

don't know why they aren't there anyway

echo aurora
jovial sapphire
#

Hi there

echo aurora
jade egret
jovial sapphire
jade egret
jovial sapphire
#

Yesterday, I was getting Zenith all the time

#

Now I can't seem to get it at all...

jade egret
#

is that gpt 5

jovial sapphire
#

I think it is

#

I'm an ethical hacker and usually, I ask LLMs to code me tools that help me for my job

#

Zenith no joke made me a tool that I will now use everyday

#

2k lines of code, one shot

#

I have never seen any LLMs do that, with that precision

#

And I use Opus and Sonnet everyday

#

So if it's not GPT5, it's a really, really good model

sick chasm
#

After some testing I feel odds are even worse than 1/100 now. Got 1 zenith and 0 summit answers in 200 chats
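
A rough sanity check on those numbers, as a sketch only: assume each chat independently shows a given codenamed model with probability 1/100 (the figure mentioned above; the real sampling weights are not public and each battle actually shows two models, which this ignores). The helper `prob_at_most` is a made-up name for a plain binomial tail sum.

```python
from math import comb


def prob_at_most(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Assumed inputs: 200 chats, hypothetical 1/100 chance per chat.
n_chats = 200
p_assumed = 1 / 100

p_zero = prob_at_most(0, n_chats, p_assumed)         # no appearances (the summit case)
p_at_most_one = prob_at_most(1, n_chats, p_assumed)  # zero or one (the zenith case)

print(f"P(0 appearances)   = {p_zero:.3f}")
print(f"P(<= 1 appearance) = {p_at_most_one:.3f}")
```

Under these assumptions, seeing a model once or not at all in 200 chats happens about 40% of the time, and zero appearances about 13% of the time, so this sample alone is still compatible with 1/100 odds and doesn't clearly show they got worse.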

jovial sapphire
#

Yeah

#

They can change the frequency I think

#

Chance of dropping

#

I'm so sad, I wish I had more time to test it 😦

#

Here's Gemini comparing the code of Zenith and the code of Gemini 2.5 pro on the same prompt

jade egret
#

is claude 4 opus and claude 4 sonnet the same base model but opus have more time to think?

whole sundial
quiet moss
#

ah sh

#

I was just looking for those models

#

man

#

i hope GPT-5 comes out soon

stray aspen
quiet moss
#

its a discord server ig

#

may you send it in my dms? @whole sundial

whole sundial
#

dev mode discord, they have a bot that checks lmarena apis for new and removed models

leaden meteor
#

How do I join this Arena battles discord?

whole sundial
#

can I dm you the link?

echo aurora
whole sundial
#

that's not what they were asking for

whole wagon
#

Yeah Claude just made opus 5x more expensive per token for memes. It's the same base model fr

#

Like bruh

jade egret
#

o

#

kimi k2 lowkey kinda slow

jovial sapphire
molten wind
#

sup chat

jovial sapphire
#

Especially developers

dense dirge
#

How do I join this Arena Battles discord?

jade egret
dense dirge
#

@jade egret ??

leaden palm
torn star
dense dirge
jade egret
torn star
#

The world will change substantially. Most won’t feel it until it’s too late

restive sky
#

Is it possible to ask another question to the same LLM in battle mode after it is revealed which is which?

#

For example, if I want to ask more questions of a model with an alias, do I have to just keep trying until I get it again?

twin acorn
#

every time i send a prompt in web arena it times out

keen fulcrum
echo aurora
echo aurora
keen fulcrum
echo aurora
#

refresh discord.

keen fulcrum
twin acorn
#

this all the time as well

#

this as well

#

maybe works 5% of the time

iron cipher
torn mantle
#

are you using some kind of vpn?

sturdy mica
torn mantle
#

i hate this type of error handling

sturdy mica
#

yuck

#

new ui held together by hope

#

the new ui is so weirdly coded it's like it was vibe coded

torn mantle
#

they are actually checking for turnstile token failures(cloudflare antibot) & network errors & server-side vote rejections but it's currently displaying the same generic error message for all of them

#

instead of 'Failed to submit vote' it could've been something like :

  • Failed to submit vote : Turnstile token is missing or empty.
  • Failed to submit vote : Network request failed. The server could not be reached
  • Failed to submit vote : Network request failed. Server rejected the vote
#

more specific and helps in diagnosing the problem
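
The idea above can be sketched as a small lookup from failure cause to message. This is an illustration only, not LMArena's actual code (the site's frontend is not Python); the names `VoteFailure` and `vote_error_message` are hypothetical, and the three causes are just the ones listed in the message above.

```python
from enum import Enum


class VoteFailure(Enum):
    """Hypothetical failure causes, taken from the three cases listed above."""
    MISSING_TURNSTILE_TOKEN = "missing_turnstile_token"
    NETWORK_ERROR = "network_error"
    SERVER_REJECTED = "server_rejected"


# One specific detail string per cause instead of a single generic message.
_DETAILS = {
    VoteFailure.MISSING_TURNSTILE_TOKEN: "Turnstile token is missing or empty.",
    VoteFailure.NETWORK_ERROR: "Network request failed. The server could not be reached.",
    VoteFailure.SERVER_REJECTED: "Server rejected the vote.",
}


def vote_error_message(failure: VoteFailure) -> str:
    """Build a user-facing message that names the actual cause."""
    return f"Failed to submit vote: {_DETAILS[failure]}"


for failure in VoteFailure:
    print(vote_error_message(failure))
```

Keeping the common prefix and varying only the detail means the user still sees a familiar message, while a screenshot of it is enough to tell a Cloudflare problem from a network or server one.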

#

nvm they are using the error object, @twin acorn next time, just open the developer console and share the error message from there

twin acorn
formal dagger
#

even not wolfstride, can't bear 2.5 pro now.

pseudo heath
#

what happened with the new qwen3 series on the latest leaderboard? it seems they are all gone

twin acorn
#

nvm they are using the error object, @

lofty cosmos
civic flame
lunar wind
#

Am i in the wrong server?

west scroll
civic ginkgo
frigid coral
civic flame
#

search for the invite on twitter or someone here will send you a dm

civic ginkgo
west scroll
lunar wind
frigid coral
whole sundial
#

you shouldn't do that in main chat here

#

you should remove that message btw

torn mantle
# twin acorn

yea its a cloudflare issue but its probably coming from your adblocker

#

disable it and see if it works

#

ur adblocker blocks some analytics trackers -> cloudflare flags this as suspicious behavior -> leads to authorization errors (401) -> server applies a rate limit on your vote attempt (429)

gentle plinth
#

i think it could also be the case that some specific strings in the prompt/output trigger a cloudflare security block.
I had it in direct chat, some specific prompt with special characters (html tags etc.) didnt work, maybe because cloudflare thought this is some kind of xss attack.

torn mantle
#

im looking at my console logs and im kinda having same errors but the vote is submitted

torn mantle
#

could be that its sending 401 but didnt reach the security threshold to throw a 429 ( rate limit error )

torn mantle
winged locust
#

GLM4.5 can use

gentle plinth
#

but in the network view of the browser i could see that cloudflare blocked the request because of that

#

so maybe here its a different problem

torn mantle
#

had that many times

blazing bison
#

Using chatgpt agent to use lmarena and watching it vote for claude models against gpt 4.1 and o3 answers is actually funny

keen beacon
#

also, hello

blazing bison
keen beacon
#

would be quite fun

blazing bison
#

But i have nothing to use this agent for so I'm just wasting compute on useless things

keen beacon
alpine coral
leaden meteor
old ginkgo
keen beacon
#

I know that at some point AntiAI people will get real pissed off and try to sabotage some places

keen beacon
#

what are you trying to do

gusty sinew
#

how to fix stuck on generating issue does anyone know

old ginkgo
old ginkgo
keen beacon
#

ok but what case, it really depends lol

#

like i cant help you if its that generic. do you need vision input as well? etc. how hard is the task? etc. i mean you can use 4.1, but it won't be cost effective if youre trying to do something easy (but needs fine-tuning for accuracy/etc). if it's more complex, probably 4.1 i guess

#

it can but not really at the same time its complicated lol. i don't personally recommend relying on it to introduce knowledge unless you're doing continued pretraining (larger scale)

jagged crown
keen beacon
#

Esp reddit

#

I like to be there on the r/singularity sub. AntiAI sub is a bit too dramatic at times to even take a peek in.

old ginkgo
keen beacon
old ginkgo
#

Does anybody know where to find the codenamed models on lmarena?

jagged crown
old ginkgo
#

I think you just make the models battle and when you're lucky, you get a codename model. That's what i think at least.

jagged crown
#

I guess I'm asking how to find notifications of what models were added or removed

gusty sinew
#

anyone know how to fix issues

#

with it not working.

#

it always does this

#

and then i lose all chat data

#

if anyone experienced issues like this before and knows how to fix please lmk

#

or a different ai place

#

that doesnt do this and lets me use claude without paying

keen talon
#

hi, what are the limits for the models?

torn mantle
#

the session still exists right?

whole wagon
#

What the heck

torn mantle
whole wagon
#

Yeah this is the best open source agentic coding model for sure

torn mantle
#

also the web search feature is like a deep research one

whole wagon
#

This is absolutely insane, it's like just slightly worse than Claude 4 sonnet

#

But 10x cheaper

blazing bison
#

it's like qwen, the vibes don't match

#

just good on paper

whole wagon
#

Did you try it

blazing bison
#

yes

#

i think kimi k2 is the best rn

#

from the os options

whole wagon
#

Kimi K2 missing reasoning rn that's the only issue

blazing bison
#

for me it does well without reasoning

#

and sorry for saying this but the reasoning of claude models is a joke

#

I'm a heavy user of Claude Code, and I almost never use Reasoning

torn mantle
#

its meh

#

benchmaxxing as always

jagged crown
#

Is cuttlefish still on the arena?

torn mantle
#

yes

whole wagon
#

Benchmaxing 🥀 if these Chinese LLMs actually get added to lmarena it would help

#

Like the updated Qwen reasoning model is still not there

#

I want to see if it is benchmaxxed or real

#

The new non reasoning model was bad on lmarena but they removed it

#

They should tell us before randomly removing models I think

#

@echo aurora why was it removed and why nobody was informed?

#

It is more resistant to it especially maths and coding categories

keen beacon
agile bloom
#

is there an android app for LMArena?

torn mantle
agile bloom
#

got it, it's pretty cool tho, wish someone interested in making an open source app for it

#

I usually have long convos on it and they take long to load, so an android app (better optimized) would work better I feel

#

btw anyone knows which is the smartest ai model to use? with largest data it has been trained on and with most parameters?

keen beacon
#

and many other values

cedar tide
#

@echo aurora 2.5 flash lite disappeared from the leaderboard

#

@echo aurora and why new qwen also disappeared ?

echo aurora
#

Thanks, I'll flag

agile bloom
#

is it available to use on LMArena?

#

also why are these paid high end ai models free to use here?

#

is it available in battle or one vs one?

cedar tide
echo aurora
dusky pier
timber kiln
torn mantle
#

Its so basic

dusky pier
torn mantle
gusty sinew
#

anyone know this error

#

every time i press the button it just says same thing

#

i keep losing all my chat data because of this and then it's just not working

#

it does it after a few hours use of the chat

dusky pier
gusty sinew
#

no one else gets these errors using LMArena?

#

maybe it's the browser i'm using

#

or something

blazing bison
stray aspen
#

i reload the website and it lets me continue

gusty sinew
gentle plinth
torn mantle
sour spindle
#

all the gpt5 models gone i see ?

gusty sinew
#

would an adblocker be causing it

#

because it usually doesnt come up for me

#

any verification thing like that

jade egret
#

what's the most human-like llm for writing

gusty sinew
dusky pier
#

The 1st one was from GLM

gentle plinth
#

the new one? (glm4.5)

dusky pier
torn mantle
gusty sinew
gusty sinew
#

dam

torn mantle
gentle plinth
gusty sinew
#

its any chat i go to, after i use it for a bit it just does this and never fixes

torn mantle
#

Open up your console and share the errors with us

torn mantle
gusty sinew
gusty sinew
hybrid widget
#

hi

gusty sinew
blazing bison
blazing rune
#

nobody knows

#

can't predict the future

blazing bison
#

Kimi k2

#

Or gemini 2.5 pro

blazing rune
#

Kimi K2 is good but the fast providers are terrible

keen beacon
#

Just gotta wait for GPT-5 first

blazing rune
blazing bison
gusty sinew
keen beacon
blazing rune
keen beacon
#

Too many typos with others

blazing bison
#

It's for testing models, not prod usage...

blazing rune
keen beacon
#

It's hot where I live

blazing rune
keen beacon
#

gives me symptoms

blazing bison
keen beacon
blazing rune
#

same thing happened with optimus alpha and quasar alpha

keen beacon
#

along with other stuff

blazing bison
#

Zenith is sycophantic too

keen beacon
blazing rune
blazing bison
keen beacon
blazing bison
#

But almost never got summit

blazing bison
keen beacon
#

Hopefully for free users GPT-5 won't be "nerfed"

blazing bison
#

Gpt 5 is really something else

#

Oh they will

keen beacon
#

It's good at that

#

at least

blazing bison
#

I tried it with other things, it's good with math too

#

Medicinal skills too

keen beacon
#

or the sycophantic stuff

blazing bison
#

Very sycophantic

keen beacon
#

Ever since the accident

#

with one 4o update

blazing bison
#

But it's smart

#

4o is sycophantic and dumb

#

🤓

gentle plinth
blazing bison
keen beacon
blazing bison
keen beacon
#

Google's gemini is going into that direction too

#

at times I don't like it when models are too "positive"