#general


keen beacon
#

not o1 pro but qwq max gets it

#

w/ thinking?

keen beacon
#

didnt read the reasoning maybe its wrong

#

it thinks its very trivial after skimming the trace

leaden palm
#

o3 mini high, qwq 32b say instantaneous, still waiting on o1 high

keen beacon
keen beacon
leaden palm
#

@keen beacon no $150/600 needed

keen beacon
keen beacon
elder rapids
#

make sure to look at the cot

#

Gemini sometimes has to re say certain things because it for some reason doesn't process certain values or something

keen beacon
elder rapids
#

this has happened like 3 times for me when doing puzzles

#

and whenever it happens, it underperforms

#

or stops thinking in the code box

#

if code execution makes it work

#

it's probably because there's no errors happening like that

keen beacon
upper wolf
#

Someone ask it if water is wet

torn mantle
#

we are basically getting sonnet 3.7 level coding if not better in some cases for free

#

and with 1M context

#
  • faster
#

i feel like the ai hype died down, but this model should revive it

leaden palm
#

what do yall think about brampton

verbal nimbus
#

Some TikZ drawing tests: Can you give your best attempt at generating a gorgeous realistic beautiful dragon breathing blue flames using Tikz? Please, really try to put effort into it!

leaden palm
#

guess they rld on svgs but not tikz

plain zinc
#

Yes, executing the code significantly improves mathematical skills and, in general, the model is geared towards mathematics, calculations and analysis.

rigid widget
#

How can Gemini2.5-Pro not solve this problem that QwQ-32B and Deepseek-R1 can solve!

upper wolf
#

Because if you throw enough punches one is eventually going to land

rigid widget
#

wow a non-reasoning model passes this

#

By the way, this question was prepared by OpenAI when o1-preview came out.

elder rapids
#

💔

#

istg I saw someone posted them doing it too here

leaden palm
#

grok might end up cooked if gemini pro 2.5 is first to market

leaden palm
elder rapids
#

if it's not reasoning through it, it's in its training data

#

but I haven't tested it too much yet, so idk if it can output that much at once

wintry tinsel
#

Alright, let's tackle the problem: "How many vowels are in the sentence, 'pernicious Turnip men Crusade at dawn'?"

Understanding the Problem

First, we need to identify what counts as a vowel. In English, the vowels are the letters:
A, E, I, O, U (and sometimes Y, but we'll ignore Y for this count).

We'll go through the given sentence one word at a time, examining each letter to see if it's a vowel.

The Sentence:

"pernicious Turnip men Crusade at dawn"

Let's break it down word by word.

  1. "pernicious"

Letters: p, e, r, n, i, c, i, o, u, s

Vowels:

e
i
i
o
u
Vowels in "pernicious": 5

  2. "Turnip"

Letters: T, u, r, n, i, p

Vowels:

u
i
Vowels in "Turnip": 2

  3. "men"

Letters: m, e, n

Vowels:

e
Vowels in "men": 1

  4. "Crusade"

Letters: C, r, u, s, a, d, e

Vowels:

u
a
e
Vowels in "Crusade": 3

  5. "at"

Letters: a, t

Vowels:

a
Vowels in "at": 1

  6. "dawn"

Letters: d, a, w, n

Vowels:

a
Vowels in "dawn": 1

Summing Up the Vowels

Now, let's add up the vowels from each word:

pernicious: 5
Turnip: 2
men: 1
Crusade: 3
at: 1
dawn: 1
Total vowels = 5 + 2 + 1 + 3 + 1 + 1 = 13

Verification

To ensure accuracy, let's list all the vowels in order from the entire sentence:

Sentence: pernicious Turnip men Crusade at dawn

Underlined vowels: e, i, i, o, u, u, i, e, u, a, e, a, a

Counting them:

e
i
i
o
u
u
i
e
u
a
e
a
a
Yes, that confirms 13 vowels in total.

Final Answer:

There are 13 vowels in the sentence "pernicious Turnip men Crusade at dawn."

#

From DeepSeek V3
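
A count like this is easy to double-check in code. A minimal Python sketch, assuming Y is not counted as a vowel (same as the answer above); `count_vowels` is just an illustrative helper name, not something from the chat:

```python
def count_vowels(text: str, vowels: str = "aeiou") -> int:
    """Count vowels case-insensitively; Y is ignored by default (assumption)."""
    return sum(1 for ch in text.lower() if ch in vowels)


sentence = "pernicious Turnip men Crusade at dawn"

# Per-word breakdown, mirroring the word-by-word walk in the answer above.
per_word = {word: count_vowels(word) for word in sentence.split()}
total = count_vowels(sentence)

print(per_word)
print(total)
```

The per-word counts and the total can be compared directly against the breakdown in the message.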

elder rapids
leaden meteor
#

Why are people so confident about Deepseek and Openai being top in the next month? Do they have any models that are doing well (anonymously) in arena? Gemini 2.5 pro seems to be a clear leader for next few weeks? o3 mini high and deepseek-v3 are already on the leaderboard. Is Deepseek 0324 that good that it will beat gemini 2.5pro? And I see that o1 pro and R2 are not in arena anyways based on conversations here...SO, I dont get why 60% of the people who took the poll here think that either OpenAI or DeepSeek is going to top 2.5 pro?

leaden meteor
#

Isn't it coming in May?

elder rapids
#

a lot of people who follow AI came from the deepseek thing

#

which became borderline propaganda

#

before and for a while after, 2.0 flash thinking was closer to r1 while being smaller and cheaper, and nobody paid attention to that

#

hopefully r2 is the actual jump in performance, but if it's not simply better than 2.5 pro then it's not going to matter, to me at least anyways

leaden palm
#

still waiting for r2...

elder rapids
#

since 2.5 pro has 1m context length truly now

#

I'd rather have that

elder rapids
#

it doesn't do very well every time I get it

#

but still formats decent

#

oh he blocked me

#

lolll

leaden meteor
#

Yeah...I have seen some people posting their results here and anonymous chatbot did not seem to be better than nebula (2.5pro)...

elder rapids
#

a lot of people in the AI space just simply don't know what they're talking about and pay attention to politics too much

alpine coral
#

anonymous chatbot (likely latest iteration of chatgpt-4o-latest imo) isn't a thinking model and is presumably natively multimodal. i use 4o for most day-to-day tasks (i don't need thinking for a quick translation or whatever)

#

it seems an incremental improvement (or possibly already deployed on chatgpt) - won't smash any records, but not meant to either (GPT-4.5 arguably was a bit of a disappointment in terms of performance though imo - but is still very robust for reasoning and writing tasks, as most big models tend to be)

keen beacon
elder rapids
#

sometimes it points out how it didn't actually format certain information

#

and it's been freaky since it released in the app first

#

I hope someone with more reach can mention this and look into it

#

I know I'm def not the only one

#

but I've probably tested it 10x more than others

#

50 is too little

#

💔

keen beacon
rigid widget
rigid widget
#

It is also clear from the example of creating super mario

torn mantle
#

but tbh its nowhere near gemini 2.5 pro

#

i mean for the cost its a good one

#

gemini just seems good overall

calm sequoia
#

Last time Someone overtook Grok the leaderboard was updated the same day 😄 It would be funny if it would happen again and musk would release Grok 3 full immediately

#

On the other hand, you need time to train on Gemini 2.5 outputs 😄

barren prairie
calm sequoia
#

It shouldn't have taken previous no. 1 spots too 😄

#

Really good job guys. You're oracles!

plain zinc
barren prairie
rigid widget
rigid widget
#

They're all hoarding data and copying from each other. That's why they exhibit very similar behaviors in many things.

rigid widget
#

We said everything beforehand.

calm sequoia
#

Unless in early stages of pre-training, but the base model was not changed (or was it?)

rigid widget
rigid widget
hazy quest
#

Hey guys. Do you know of a subreddit or Discord server to discuss specifically everything related to AI images? It's probably obvious but I can't seem to find it.

  • r/singularity is too broad
  • r/StableDiffusion seems to be focusing on AI videos (?)
  • r/aiArt seems to be about the "art" aspect more than the AI one (?)

Basically I'm looking for a space where people discuss/compare for example Reve vs 4o vs Imagen 3 vs Recraft

brittle tiger
plain zinc
#

What will you do to that person if someone writes here that 2.5 is WORSE?

#

🤔

torn mantle
#

but they should justify

#

and not based only on one prompt

plain zinc
#

This is a test for ardent Gemini fans 🙂

#

But I'm not like that. Don't think about it.

#

Gemini is interesting to me as a research product.

subtle frost
#

What's the best ai for solving chemistry MCQs

rigid widget
rigid widget
alpine coral
#

call it 'omni'.. wonder if / when it will actually be a truly multimodal model..

#

does it use separate models for image recognition and generation, or just the latter?

#

same with advanced audio.. is that actually integrated natively? i haven't really been following since the initial release

#

function calling isn't exactly a modality, but 4o has become a lot better at using code interpreter. like for anything not too complicated, it handles data pretty well imo given the speed and cost

rigid widget
#

Gemini 2.5 Pro here is ethics, safety, bla, bla

leaden meteor
#

I am surprised they did not release this in the leaderboard yet when lot of people are saying this is one of the best non-reasoning models...

rigid widget
rigid widget
leaden meteor
#

I am only talking about arena. Not the general release... If they released it in arena and model is really good, it would have come on top, isnt it? Even people are biased against it? Isnt that the point of the arena....

willow grail
#

why do i get this after every message?

rigid widget
willow grail
rigid widget
#

claude create a python command it don't get the prompt

leaden meteor
#

Was it there before public release? I was asking why it was not released in the arena before public release... But I see that Deepseek tends to not do that...So, I guess it will take a week or two before it appears in the leaderboard from now?

rigid widget
#

you're right

alpine coral
# rigid widget Give V3 0324 a chance

played around a bit. it's pretty decent for sure. though fwiw i wasn't blown away - it's solid but yeah i'm not sure i'd be willing to say anything more than that (based on my limited use anyway)

#

even though it's not a thinking model, it's been post trained in a way that encourages CoT responses, rather than just blurting out the answer.. can be a bit tedious

#

it seems good at physics / science

#

impressive with some riddles, but not strong with theory of mind kinda stuff, and falls over on some generally simple ones (usually by overanalyzing it / conducting needless calculations)

#

i like its note at the end here ha

rigid widget
#

Probably best for roleplay too

keen beacon
#

☠️

rigid widget
#

by the way there is no system prompt

#

this is default

#

Boss and lazy employee roleplay

#

Why Chatgpt is not that good at "Chat"

elder rapids
#

but nah I meant like, testing for looking at the CoT process

#

and stuff

#

there's a lot of things broken with 2.5 rn

#

nobody is mentioning it for some reason

#

crazy

rigid widget
rigid widget
alpine coral
# rigid widget

i don't know what's going on lol (though clearly V3's response is superior / more creative) but anyway was curious to give V3's response to 4o and tell it to make it 'spicier'. initial attempt was bland; the next one also kinda meh, but the third/final was surprisingly lucid tbh ha

#

4.5 even more so ("if i have to finish myself off in the jungle" lol)

#

i thought 4o just used more emojis and had a different (more casual) tone.. but didn't realise it (and 4.5) actually have like some 'character' in a way oai models previously haven't

elder rapids
golden ocean
#

I think he meant that you can not really look at CoT it's hidden

rigid widget
#

this is CoT

#

this is not CoT

timber kiln
#

Does the gemini advanced sub have a rate limit for the 2.5?

elder rapids
#

😭

#

you're conflating the initial o3 mini summary with markdown/formatting

golden ocean
#

@rigid widget are u going to let this slide

elder rapids
#

W instigating

#

but Gemini uses markdown for aha moments

#

it's a part of its cot

elder rapids
#

I went way past ai studios 50 rpd limit in the app

rigid widget
#

We want real pure CoT not changed CoT

elder rapids
#

it's not being optimized post CoT

#

it just fundamentally is a different approach to CoT

rigid widget
#

Even o3 mini is more readable

elder rapids
#

that's because they have a different approach

#

they're giving similar amounts of information

#

neither are being changed

#

Gemini has always used markdown way more than other models

rigid widget
elder rapids
#

whether it's an initial system prompt

#

or unintentional

#

the behavior still shows in the CoT

rigid widget
#

yeah

elder rapids
#

but regardless, that is the "raw" cot

elder rapids
#

still clarifies in unformatted text

rigid widget
#

People enjoy reading R1's thoughts

elder rapids
#

ye cuz they're funny asf

#

deepseek lab trained for weird types of "aha" moments

rigid widget
#

It is very educational

#

I learned a lot of things while reading them.

elder rapids
#

this is crazy

#

16~ avg improvement from flash thinking

#

yo this is actually insane

#

its not even incrementally better than all the other models

#

the difference between other models and their competitors are super incremental in other benchmarks and Gemini 2.5 pro doesn't follow that trend

silk haven
elder rapids
#

from the moment I had my hands on nebula I knew this model was different

#

don't call me a prophet, but I might just be one

keen beacon
#

brother

#

☠️

elder rapids
#

I can't get over it 😭 🙏

split kayak
#

gemini2.5

rigid widget
oblique flint
#

so pro is the first 2.5 gemini model. Will flash be coming soon?

#

was phantom 2.5 pro as well or could that be flash?

rigid widget
elder rapids
#

flash

#

if they release 2.5 flash

#

then it's safe to say the old models are gonna be removed

elder rapids
#

v3 is doing well

elder rapids
thorny drum
#

damn #1 in almost every category

silk haven
#

Noam Shazeer

#

The 🐐

elder rapids
#

phantom seemed pretty similar to nebula tho tbh

#

but apparently it was removed and added back

#

so it could've been changed to flash and I'd have no clue cuz I only tested the old phantom

#

why's 2.5 pro so fast tho

willow grail
keen beacon
#

ive been doing some research

#

without consensus voting, which openai used to make o3 (full) look better on the benchmark graphs, gemini 2.5 pro is actually better

#

AIME o3 ~90-91% vs 2.5 pro 92%
GPQA o3 ~82-83% vs 2.5 pro 84%
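
For anyone unfamiliar: "consensus voting" here presumably means sampling several answers per question and scoring the most common one, instead of scoring a single attempt. A toy Python sketch with made-up samples (not real benchmark outputs); `majority_vote` is a hypothetical helper, and this is not claimed to be OpenAI's exact procedure:

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Made-up samples for one question whose correct answer is "42".
samples = ["42", "41", "42", "17", "42"]

# A single attempt can land on a wrong draw...
single_attempt_correct = samples[1] == "42"
# ...while the consensus over all samples still recovers the right answer.
consensus_correct = majority_vote(samples) == "42"

print(single_attempt_correct, consensus_correct)
```

That is why single-attempt and consensus scores are not directly comparable across models.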

rigid widget
#

Gemini 2.5 Pro
Except reasoning (🥈2. best model)
All ⚡🥇1. best model
Good job Google!

keen beacon
#

which is why openai need to stop sitting on their damn products..

#

just release it already

rigid widget
#

Deepseek V3 03-24 (comparing with non-reasoning models)
(Overall: 🥈 2. best model)
Math: 🥇1. best model
Coding: 🥈2. best model
Reasoning: 🥉 3. best model

Good job Deepseek!

rigid widget
oblique flint
rigid widget
#

Look at gpt4.5 vs v3-0324

#

gpt4.5 is better but who cares?

elder rapids
#

how big is v3 tho?

rigid widget
#

deepseek 150x cheaper

elder rapids
#

I heard it's like a 500b+ model

#

I'm not sure how true that is

rigid widget
#

bro it's open weights

elder rapids
#

huh?

elder rapids
#

😭

#

v3 = v3

elder rapids
#

DAMN

silk haven
elder rapids
#

685b params

keen beacon
oblique flint
#

so, can anyone reverse engineer the scaffolding used in claudeplayspokemon and make gemini 2.5 pro play pokemon lol

leaden meteor
oblique flint
leaden meteor
#

I see. Is this o1-2024-12-17- high different from o1-2024-12-17 on leaderboard ?

elder rapids
#

might not be the best at purely reasoning

#

but that's trivially true

wispy mist
#

can this data be used or licensed on a company's website?

red sluice
#
poll_question_text

Who will have the best model within one month?

victor_answer_votes

5

total_votes

13

victor_answer_id

2

victor_answer_text

Google

olive mesa
#

waiting for 2.5 flash to use for my projects

#

even though i have like 12 api keys the rate limits are still strict

olive mesa
#

but time and cost is a lot more

#

worse model with 50k thinking tokens slightly better than a good model with 10k thinking tokens

rigid widget
cloud meadow
#

yeah

#

What's more, we opensource the model behind all this, Qwen2.5-Omni-7B, under the license of Apache 2.0, and we release the technical report to share all our details!

keen beacon
#

4o image gen, prompt: "A photo of Tim Cook making a speech on stage behind a lectern with the text "FOR THE MANY, NOT THE FEW" and the UK Labour logo on it". no other model has got this close to perfection - it literally got the exact lectern right with just the "H" missing from "the", which i'm sure would be gone if you regen'd a few times

#

for reference

elder rapids
blazing rune
keen beacon
#

you could argue that

#

but like

#

i like the fact openai have lowered the guardrails for this

#

dall-e was censored asf and it showed

ocean vortex
# elder rapids this is crazy

this is not a very good benchmark, but in this case it is not wrong... 2.5 pro very well just might be the best model ever released thus far tbh

keen beacon
#

it's finally joever for livebench

#

2.5 pro doesnt seem to know much after june 2024. it doesnt seem like they focused on recent events that much

#

how tf is it 2x faster than gem 2 pro

rigid widget
elder rapids
#

that's more of a consensus too

#

livebench and simplebench

#

but yeah 2.5 pro is just

#

really crazy

oblique flint
#

anthropic will not be happy about this release if it turns out this model outperforms the claude models at a lower cost lol

ocean vortex
#

which is just ridiculous

thorny drum
#

i dont think any benchmark will give you a perfect stack rank of every model

ocean vortex
#

no but some are better than the others. Livebench is among the worst ones tbh

thorny drum
#

what are your favorites?

ocean vortex
#

if you take any real benchmark like say MMLU Pro, you will hardly see any discrepancies at all. All scores just make sense and are logical

#

And also... contamination is not really the issue. Bigger issue is "private datasets" as that contributes nothing to improvement of the models IMO. It also invites selective cheating when only select companies have some kind of access through connections which means even more discrepancies. If benchmark is diverse and good, it's pretty much impossible to cheat beyond what everyone is doing (or to get a perfect score) unless that is the sole thing you focus on disregarding all the other metrics. Which is just not realistic as it would be obvious model is sht

#

the main thing should be that it's a fair game. If everyone has the same public access that means it is fair

#

and actually is driving the improvement. Rather than being some secretive thing no one can even validate if it's any good...

elder rapids
# ocean vortex this couldn't be further away from the truth. Unless you really think qwq is bet...

this isn't what makes a benchmark bad at all lmao, that would be reasonable if the score discrepancy were larger than just a few percentage points. It's a really good inference model that doesn't have quite as much knowledge as deepseek, but that isn't what it's testing. it's not "better" than deepseek, nor is deepseek so much better than o3 mini, nor is v3 better than Claude 3-7 sonnet, this isn't because the benchmark is flawed, but because of how YOU'RE viewing the averaged score

#

the same way lmsys isn't a bad benchmark, its just not quantifying what you think it's quantifying

ocean vortex
#

that model has no business being equal with those

#

let alone above them

elder rapids
#

wym just no? there's literally ±1 deviation LMAO

#

have you tried it??

#

in the same way smaller models can outperform gpt 4

#

is the same way qwq-32b can have extremes like that

elder rapids
#

but that's not what its testing

#

look at the individual scores

ocean vortex
#

and it's not a good benchmark at all if it fails to find the difference. Qwq is worse than R1 overall, not the other way around

ocean vortex
elder rapids
keen beacon
ocean vortex
#

not just a few or a few subcategories

elder rapids
#

lol

keen beacon
elder rapids
#

but it's the worst for actually coding

#

and stuff

keen beacon
elder rapids
#

never got the attention it deserved

elder rapids
#

flash thinking wins

#

that kind of stuff

#

o1 gets crushed too

ocean vortex
#

so it doesn't make sense lol

keen beacon
ocean vortex
#

it's not a very good test

keen beacon
#

i dont see any of the other models going 29k tokens in thinking and coming up with this

#

fyi qwq does that one just fine btw

#

afaik

elder rapids
#

this server bro istg

keen beacon
elder rapids
#

in a debate kind of inquiry

#

are you good?

keen beacon
elder rapids
#

ngl you guys have to be trolling

keen beacon
elder rapids
keen beacon
#

theres considerable overlap with debate skills and social skills though

elder rapids
#

no

#

it's not lmao

keen beacon
#

yes for llms

ocean vortex
keen beacon
#

yeah this guy is so annoying lol

elder rapids
elder rapids
keen beacon
elder rapids
#

but of course, you don't know the claim and won't bother to reiterate what I'm saying for clarity

ocean vortex
elder rapids
cloud meadow
#

Nah I was using the wrong one lmao

elder rapids
keen beacon
#

are you using 0 temperature?

cloud meadow
elder rapids
keen beacon
keen beacon
cloud meadow
#

It keeps repeating stuff on 0 temp lmao

ocean vortex
# elder rapids alr, state my premise

You seem to be thinking everyone is doing it wrong and livebench is this magical benchmark more accurate than proven reliable metrics. Metrics with a much better track record and better results. That's almost like 'benchmark shopping' - look hard enough and you gonna find a benchmark for your confirmation bias

keen beacon
cloud meadow
#

I had to stop it and then tell it to continue

#

Still, seems wrong?

keen beacon
#

generated by 4o

#

🀯

cloud meadow
#

We are a few years away from AI being used as video evidence in a court case

#

I mean a few months

#

😛

keen beacon
cloud meadow
#

The future might be bleak or bright idk

ocean vortex
#

So like... simple-bench is 'perfect' for people who think Claude is better than everything else. But they fail to understand it's mostly just spatial awareness reasoning. Your daily tasks will usually include MUCH more areas than just this. Then there's also an issue that it's private and no one can verify it properly so it's a guesswork

#

people like that will also look at swe-bench but conveniently ignore LCB where claude is destroyed and it's still coding lol

keen beacon
ocean vortex
#

it's hard to be impartial, but you really shouldn't judge the benchmark based on individual model results. The number of discrepancies in total is the easiest fair way to get the rough overall picture

elder rapids
# ocean vortex You seem to be thinking everyone is doing it wrong and livebench is this magical...

that's completely loaded lol and doesn't state my premise. Whether or not livebench is a viable objective evaluation that judges how models specifically perform in certain tasks isn't what I'm saying lol, the reason why it's even averaged out through such wide variation of things on livebench especially is because they aim for exactly what you're looking at while discussing with it, qwq-32b does have specific traits that may help it over deepseek, livebench tests only a few languages, and it happens to excel in these few languages, but if you check aider, clearly there's a discrepancy, but also in the claim I've been making; how well it does when given newer, or at least more obscure information, and therefore a better score, with large variation. anything ±5~ score deviation wont have such discrepancy, yet it's completely realistic that's how they interact with that kind of information

#

I don't even think we disagree

ocean vortex
#

it's also unreasonable to expect any model to do well everywhere... Just not realistic

elder rapids
#

but there's clearly a difference in what we understand these benchmarks do

#

for me, Claude 3.5 sonnet didn't do well, for exactly what I expected livebench to do

ocean vortex
elder rapids
#

which is why I clarified that

#

"such a wide variation"

ocean vortex
#

and if it is and the results are that... the benchmark is not the best one. Simple as that 😉

elder rapids
#

it's exactly what it claims to be

#

and seems to match really well because, the model approaches you that way too

#

novel information over a general discussion = non standard benchmark performance

#

and probably why o3 mini and o1 have done so well, since they focus on exactly that kind of behavior of receiving

#

and it's probably why flash thinking is so high too

#

since it's just not that good, but if you use it

#

it's just really good at receiving information

ocean vortex
#

have you read their description at least?

elder rapids
#

wym?

elder rapids
ocean vortex
#

They are aiming at providing general purpose anti-contamination benchmark. I already said what I think about contamination (it's not an issue for a good benchmark), and if this is general purpose... then it's clearly not the best benchmark. Like I've already said several times but you just don't seem to get it

#

nowhere does it say it's specialised or limited in scope:

elder rapids
#

if it's anti contamination it's aiming for novel information and score discrepancy under that

elder rapids
ocean vortex
elder rapids
#

that's it

ocean vortex
elder rapids
#

I just said what I meant

ocean vortex
ocean vortex
#

nonsense

elder rapids
#

😭?

ocean vortex
#

"how it adapts to you and stuff" are just some meaningless words you made up, in this context

elder rapids
#

it's not meaningless lol

ocean vortex
#

it absolutely is

#

you have no argument lmao

elder rapids
#

that means how it receives information, you can't get any simpler than that

#

if it means how it receives information, that means how it interprets things through a context length right? if it's better at low context

#

then it's necessarily going to be better

#

oh wait I wonder why all of these models also excel at lmsys

#

which happens to be ahem

#

general low context performance

#

🀦

ocean vortex
elder rapids
#

that's clearly not what I'm saying

#

😭

ocean vortex
#

then talk properly, cause you are just making sht up catgrin

elder rapids
#

I just clarified exactly what it means tho, and then you sidestepped into that interpretation

elder rapids
#

you know you can ask for clarity

#

right?

ocean vortex
# elder rapids if it's anti contamination it's aiming for novel information and score discrepan...

the bigger issue is testing properly. If you aren't testing properly with good test questions "novel" part is completely irrelevant. And I already said why contamination is mostly a non-issue. If the entire dataset is public, that means all models are trained with the same data available and the scores are fair. It is still a consideration, but good benchmark is MUCH more important than just making sure you develop test questions in secrecy...

elder rapids
ocean vortex
#

if it was an issue we would have seen models getting perfect scores on all public benchmarks a long time ago. That didn't happen. Because it is essentially impossible

elder rapids
#

and then it averaged that

#

because the point is

#

it's not testing intensive information

#

and the wide variety implicitly denotes novelty is what it's looking for

ocean vortex
elder rapids
#

if it's not the general purpose benchmark you think it is

#

then it's not, and I'm showing you how

#

lmsys is a general purpose benchmark

#

but clearly they're different benchmarks

#

right?

ocean vortex
ocean vortex
elder rapids
ocean vortex
#

it's a human preference benchmark

elder rapids
ocean vortex
#

not an objective deterministic tool for general performance

elder rapids
#

it's a general purpose benchmark via human preference

#

which is exactly the distinction I'm making

#

the reason why that's relevant is because livebench is a general purpose benchmark via novelty

#

to you, one is more apparent than the other

ocean vortex
elder rapids
#

theyre entirely different categories altogether (meaning, it doesn't make sense for you to think I meant that regardless)

#

and that's what I was implying the whole discussion (meaning, it's not changing my mind if that's what I've been saying)

ocean vortex
#

general purpose tests for everything with deterministic, objective eval methods. Human preference tests mostly for the 'vibe' and style of responses and the initial impression they leave on people. Those 2 things are different and not really compatible. For the record

#

livebench is not human preference testing, in case you think it is...

elder rapids
#

it's not like I didn't clarify that

elder rapids
#

and they are compatible

#

I don't see how you think otherwise if they're simply so categorically distinct

#

you can use objective metrics to measure very narrow non general purpose tasks

ocean vortex
# elder rapids that's just begging the question lol, general purpose just means the scope of wh...

it also means there should be hardly any discrepancies at all. Truthfully all those less popular benchmarks have an easy way in NOT making it a general purpose benchmark. But as soon as they do, if there are many discrepancies... They kind of sabotage themselves because the standards for it are very different and you can't have a leaderboard looking like that with the scores that don't make sense

elder rapids
elder rapids
# ocean vortex ^

that doesn't mean anything tho? I dismissed it, since it's implicit through the extensive information

ocean vortex
elder rapids
#

ie, high benchmark variety + high benchmark avg

elder rapids
ocean vortex
#

cause that's their MAIN selling point

#

LMAO

elder rapids
#

you're saying this because you think I'm getting novelty from anti contamination

#

but I've clarified like 3 times

#

that's not what makes it novel

ocean vortex
#

do you know what 'novel' means?

elder rapids
#

yeah new

#

that's not relevant regardless

ocean vortex
#

it means it contains problems that are not included elsewhere, hence the anticontamination part.

elder rapids
#

ok, but that's not what makes it novel here

#

anti contamination implies novelty sure

#

but, let's dismiss its "anti contamination" selling point completely

#

ignore it

#

ignore everything the livebench descriptions say

#

now

#

livebench is a novelty benchmark

#

why?

#

it tests extensive information, with the example of coding, it's not going deep into each variety

#

but how is this novelty?

#

because it's averaging low intensive scores

ocean vortex
#

you are completely missing the point lmao

ocean vortex
#

the only part which makes it "novel" is anti-contamination, that's it

#

the scope is MUCH smaller than other well known benchmarks

elder rapids
#

yes I'm glad you understand

#

since that's what I said

ocean vortex
#

like bruh...

elder rapids
#

yeah wide variety lol

ocean vortex
#

this is getting ridiculous lol

elder rapids
#

but you think "it tests extensive information" contrasts that

ocean vortex
#

you can NOT have extensive information if your scope is limited and you have way less test questions

#

that is the opposite of extensive information

#

even if you include many categories, that will not change much at all if the total amount of test prompts is small

#

that only means each individual category is that much less accurate

elder rapids
#

but that's still not what I mean

#

if each category has their own intense form of information (how complex it is/its depth) it still has respective extensity

ocean vortex
#

At this point I don't think you know yourself what you mean. You keep changing the goalpost LOL

elder rapids
#

you don't know what changing the goalpost means

#

you're just shotgunning and not letting me clarify lol

ocean vortex
elder rapids
elder rapids
#

I prefaced the discussion with that tho

#

so

#

you're getting kinda lost on self proposal, it's not "not really" if that doesn't contrast my claim at all

ocean vortex
elder rapids
#

and it's an extreme

elder rapids
#

how can I change a goalpost without moving the direction of the topic

ocean vortex
#

lmaoo

elder rapids
#

yep

#

that's where we started

ocean vortex
#

we already established that "one of the especially good benchmarks" is the last thing it is and that it is not a human preference benchmark

#

or did you forget

elder rapids
#

😭

#

it is one of the especially good benchmarks

#

for clarity, now that the goalpost has remained what it is

elder rapids
#

if that's not what I'm saying, how would that establish the premise?

ocean vortex
#

so you are gonna argue it is "extensive information" or big variety again?

#

duuuude

#

like just drop it

#

LOL

elder rapids
ocean vortex
#

what

#

I give up

elder rapids
#

can you screenshot a category of livebench rq

#

"IF" necessarily extensive

#

"Language" necessarily extensive

#

"reasoning" necessarily extensive

#

they're categorically vague lol

#

that means they're necessarily extensive

#

it's not that deep

#

larger categories = more extensive

oblique flint
#

What the hell happened here lol

oblique flint
#

One notable issue with 2.5 pro that I haven't seen mentioned much is that it doesn't respect edit format syntax well. If you look at aider, it only gets the format correct 90% of the time, whereas with sonnet it's 98%. I also watched a video of a guy using the model with roo code, and there 2.5 pro was running into some issues with performing the edits correctly, which means you have to pay for the failed requests as a user. Not sure if it's going to overtake sonnet 3.7 for coding if they don't fix these issues for stable release
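
To get a feel for what those format rates cost, here is a back-of-the-envelope sketch. It assumes each edit attempt succeeds independently with the quoted rate and is simply retried until it works, which is an assumption for illustration and not necessarily how aider or roo code handle failures. The function name is hypothetical.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean number of paid requests per successful edit, assuming independent retries."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate

# Format success rates quoted in the message above.
for name, rate in [("90% format rate", 0.90), ("98% format rate", 0.98)]:
    attempts = expected_attempts(rate)
    overhead = (attempts - 1.0) * 100.0
    print(f"{name}: {attempts:.3f} requests per edit ({overhead:.1f}% extra)")
```

Under that assumption, a 90% format rate means paying for roughly 11% more requests, versus about 2% more at 98%.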

blazing rune
#

At least it didn't work for me

oblique flint
#

Both seem like instruction following issues, so I hope they'll work on it

ocean vortex
ocean vortex
#

it is a stable release for all intents and purposes. Mostly all of the models we are talking about are either beta, research-preview, or experimental, or they chose not to name them this way for marketing reasons

#

Google has a luxury of doing a lot of training (TPUs and their resources) so they iterate and update them frequently. But every public release is still adequately tested, safety aligned and basically made sure it is as good and stable as they can make it at that time

keen beacon
#

ga models on aistudio

#

exp models have very limited api rate limits

ocean vortex
#

well I suppose they label "stable" as the ones they are confident about performing and can allocate the most compute for. But it still doesn't change the fact that "experimental" is mostly just a name and is not rushed nor anything like that. But if it doesn't perform or they have some doubts or they are confident they can improve on it relatively soon they can label it as such

static lark
alpine coral
willow grail
keen beacon
#

The three in bottom left subgrid is placed wrong

#

Move it to the right

#

Also 2.5 pro got it wrong there

alpine coral
#

i did think it curious that there were multiple solutions lol

#

will circle back later and try again with it properly input to the solver

alpine coral
# keen beacon Also 2.5 pro got it wrong there

it got it right, no? i mean, the multi-solution version I gave it anyway (like its solution is the same as that by the solver; its first solution anyway ha) - or is there a discrepancy b/w the two i overlooked? (i did have another look just now and they seem the same fwiw ha)

#

separately, haven't seen ops in the arena before

#

claims its from OAI when asked, though i'm not sure if that's accurate or hallucinated (inclined towards the latter at this stage)

keen beacon
#

Thought u gave it the right one and just entered the puzzle wrong in the solver

alpine coral
#

gotcha (yeah copy pasted the same wrong one into both ha)

keen beacon
#

another one ☠️

mossy drum
#

New model in Arena: deepseek-v3-0324

cedar tide
rigid widget
rigid widget
rigid widget
#

guys, I want to create a real and reliable benchmark

#

If you want, you can share tasks with me.

plain zinc
rigid widget
#

Gemini advanced trial has started

#

I hate that disgusting experimental warning

calm sequoia
#

Anyone know the tool that embeds LLM into computer to read software outputs, such as discord, on the device documents, slack and so on?

#

Like Agent

#

I guess it may be too soon

torn mantle
#

cybele

#

its from meta

#

its so slow tho

torn mantle
#

idk how to feel about the new meta model

calm sequoia
cedar tide
ocean vortex
#

that is not "he" catgrin

cedar tide
#

Il not speak good english

torn mantle
#

seems like it can search the web

#

but its so slow

#

its kinda good but not sure

rigid widget
#

Gemini 2.5 is amazing at math 😂

torn mantle
#

yea cybele is being tested intensively on the arena

hazy quest
#

Did 4o solve image generation completely? For realistic pictures, are there still subtle clues? Until now, details like buttons on a shirt or background characters would be slightly off... but now? What are your impressions?

gentle plinth
#

I wanted to try if it can do a chess board

#

Which basically all other image-gen models struggle with

leaden meteor
hazy quest
#

Not yet for free

oblique flint
#

when gpt 5 releases imo

ocean vortex
#

well and the text depiction is very very good, since it's an LLM first and foremost

#

so things like that will not happen anymore lol

#

since it's generating an image directly itself rather than just making a request for you to another model which has only the most basic language understanding

hidden mauve
#

But do you have any idea why gemini-2.5-pro-exp-03-25 cannot answer these long prompts? It works with prompts of 1-2 sentences, but throws this error after a token (or character) limit

alpine coral
#

not sure i understand.. but inputs are truncated at 12,000 characters (~2,700 tokens) iirc

#

maybe work backwards. like for sure 1 or 2 sentences will work. but if it's currently 500 sentences or whatever, try 400; if that fails, 300 etc
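
Stepping down 500, 400, 300 works, but a binary search gets there in fewer tries. A minimal sketch, assuming that if a prompt of some length is accepted then every shorter one is too. `accepts` is a hypothetical stand-in for actually sending a prompt, and the 12,000 character cutoff in `fake_accepts` is only borrowed from the message above for illustration.

```python
from typing import Callable

def longest_accepted_length(accepts: Callable[[int], bool], upper: int) -> int:
    """Largest n in [0, upper] for which accepts(n) is True.

    Assumes acceptance is monotonic and that length 0 is always accepted.
    """
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if accepts(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Stand-in for a real request: pretend anything over 12,000 characters fails.
def fake_accepts(n_chars: int) -> bool:
    return n_chars <= 12_000

print(longest_accepted_length(fake_accepts, 100_000))
```

In practice `accepts` would cut the real prompt to `n_chars` characters and report whether the model answered without the error.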

#

also if it's specific to gemini-2.5-pro-exp, try using it directly instead in AI Studio (it's free for now i think)

calm sequoia
alpine coral
#

you should add an option for 'dunno' - for us fencesitters lol

rigid widget
#

I'm having a hard time with Gemini 2.5 Pro

alpine coral
#

but using it on ai studio, with code execution enabled (left), it does get it (i think)

alpine coral
plain zinc
#

Guys, try to create a website with my end line: Just don't create all sorts of nonsense just to increase the lines of code. Expand the site in all areas of kss, js and html5, but here's what the text provides, include EVERYTHING in the code. Nothing should cover the text or be an obstacle, so that tariffs, for example, do not exist, but they should be written. All in one html code. Write the code right away without any unnecessary questions.

#

He gave me 2k lines of code when he directly requested it with my line

rigid widget
alpine coral
#

yeah i dunno.. seems more like a wrinkle than a major flaw

rigid widget
#

😂 but not working

brittle tiger
rigid widget
#

This was seriously not what I expected for "best model"

alpine coral
#

it is important imo that LLMs use tools for complex calculations rather than undertaking them themselves, don't get me wrong.. but yeah it's a wrinkle that can be ironed out

#

4o does it very well these days

brittle tiger
plain zinc
rigid widget
rigid widget
#

but just adding "All in one html code" does better for me

rigid widget
#

paid Gemini version is wayyyy worse than free aistudio version

#

always produces shorter codes

olive mesa
# rigid widget

we need to wait a couple months to get models visually aware enough to do things like that

#

well maybe if you retry a bit it can do it

#

models we have rn can make very simple 2d ascii animations also

alpine coral
#

i'm not saying i agree with them / think they're good or whatever

#

but it's like one of the biggest companies in the world

#

is it really surprising that their main consumer chat ui refuses to make pornhub.com clone..?

olive mesa
alpine coral
#

it's a dumb way to look at a model's capabilities anyway

alpine coral
#

most people aren't doing erotica rp

#

(though way more than i would ever have thought apparently do lol)

rigid widget
#

it's not an erotica issue

alpine coral
#

i just mean the models provided by big western companies are gonna be milquetoast - they don't wanna 'alienate advertisers' or whatever.. it's not about spreading woke.. it's just them being risk averse

rigid widget
#

Don't side with companies

#

They are not right

alpine coral
#

yeah create a thread or something - let's not clog this up with whatever it is exactly that we're discussing

#

last thing i'll say.. i'm honestly surprised (and grateful) how much oai has pivoted away from this super risk averse approach to content moderation lately (it seems to me anyway). but yeah google be google..

alpine coral
#

it drew a smiley one for me.. i don't get the value in selectively trying to show that gem pro 2.5 is somehow not good

#

there will always be stuff llms fail on

olive mesa
# rigid widget

it's a frown and the mouth is more up.. you told it to try its best

rigid widget
alpine coral
rigid widget
#

it just acts stupid in some situations

#

I expected it to be better

alpine coral
#

fair enough

#

just feels like you're like looking for edge cases where it fails and emphasising them

#

but what i describe are edge cases are just things you've tested (and expected better results) - that's totally fair

alpine coral
rigid widget
#

Same goes for Claude.

alpine coral
#

yeah again, because the 'website' is 'pornhub' lol

rigid widget
#

No it's not

alpine coral
#

though actually, maybe it's actually anti-copyright infringement guardrails 🤷‍♂️

rigid widget
#

it's aistudio screen

alpine coral
#

ohh i thought it was anthropic

rigid widget
#

screenshot was from aistudio

alpine coral
#

make a thread

rigid widget
alpine coral
#

yes. i used fewer words to describe the same task.. tbh the more relevant difference i think is that i cropped the image from the screenshot you provided (i.e. they were working off different inputs)

#

but again.. this just feels like an exercise to prove the model is not that great.. when we're just highlighting flaws

#

i see V3 has been added to direct chat

rigid widget
rigid widget
timber kiln
alpine coral
#

like the usage around those use cases is non-trivial

#

i'm just saying i don't think that's what google has in mind when they're deploying models to their main chat ui

rigid widget
#

So why can't AI pretend to be a serial killer?

#

Aren't there serial killers in movies and books?

#

Is it wrong to talk about serial killers?

alpine coral
calm sequoia
#

@wooden mulch Please make separate channels for mathematicians and for creative writers 🙂

alpine coral
#

or just a separate creative writing one tbh

#

but yeah either way i like that idea ha

calm sequoia
#

I really care about news regarding coding and math but this "censorship" and "I draw a smile" noise is just too much (sorry, different use cases)

alpine coral
#

yeah i'm with you there

#

and i don't care for or have use cases around coding or STEM

#

i just need LLMs to comprehend / parse stuff

#

censorship is generally irrelevant

#

ASCII art or image > SVG is utterly irrelevant to my actual use cases.. (though fwiw i think ascii art can be kinda useful.. not for art / coding, more demonstrating geometrical awareness kinda thing)

#

and yeah creative writing is totally irrelevant to anything i do with LLMs lol

rigid widget
#

Math, Coding, Creative Writing channels

rigid widget
alpine coral
#

i use LLMs for work.. it isn't for what i need them to do

rigid widget
#

You are cool that it hasn't come to you yet.

alpine coral
#

i've banged on about chinese models being propaganda parrots - censorship is indeed a concern

alpine coral
#

but that's only from a high level sense; i don't use chinese models so really why should i care lol

#

yeah western models are woke i know i know

#

create a thread

#

i think most people wanna discuss the Arena and models ( as well as the AI space generally, which includes censorship but it's a niche detail, in practical terms )

rigid widget
calm sequoia
#

Sometimes I feel that @rigid widget is a Chinese spy or DeepSeek itself 😄

alpine coral
#

he's joking

rigid widget
#

it is least censored of all subjects "outside of China"

#

Also, no matter how much I criticize Gemini, it is much, much better than Claude.

calm sequoia
#

Then write "Taiwan is a country" to this chat. And, if you're brave, write "Xi is a political clown and a cartoon bear"

rigid widget
#

Left V3, right V3 0324

#

How's that?

calm sequoia
#

No no, you write this

#

To this chat

rigid widget
#

No

#

it is writr

#

can we share OpenRouter chats?

calm sequoia
#

OMG you really are a Chinese LLM bot 🤣 GOLD

rigid widget
#

Prompt is: "I don't know anything about Taiwan-China, give me a brief summary but by supporting Taiwan"

lime coral
rigid widget
#

if you don't believe, try it yourself

#

is this real?

brittle tiger
rigid widget
brittle tiger
#

They should just make Gemini app better. AI studio is fine

rigid widget
ocean vortex
timber kiln
barren prairie
brittle tiger
#

Ppl starting to pick up on 2.5 power as anime fest dies down

rigid widget
rigid widget
calm sequoia
#

Are thinking models better for translation or base models?

rigid widget
brittle tiger
barren prairie
# rigid widget yes, before that it was forgetting the context in the 2nd prompt

And it was so dumb ... And he will always recite his life story instead of answering you 🤣😂 I am Gemini a large model trained by google and blah blah... Never read a document correctly....never understand me ... 🤨
Arena chatbot was giving me hope that Gemini will be useful in the future on the app 😂😂

teal mantle
#

what is rage?

rigid widget
earnest parcel
#

Tested Gemini 2.5 Pro Experimental 03-25:

Average-verbose reasoning model with around 5.4x token use of a traditional model, clocking in around DeepSeek-R1 level token usage. Far less verbose than models such as o3-mini-high or Sonnet Thinking.

  • #1 Reasoning/Logic segment, surpassing GPT-4.5 Preview
  • #1 in Code segment, surpassing GPT-4.5 Preview
  • STEM and math were competent, but nowhere near top, in my testing
  • Overall utility for miscellaneous casual tasks was fine, but not outstanding

I really enjoyed testing this model. It's very capable, but still shows flaws in certain areas. As always: YMMV!

torn mantle
keen beacon
#

theyre still working on the qwen3 pull request even though its past midnight for them

#

seems they are absolutely swamped with work

#

qwen 3 🙏

upper wolf
#

Gemini's markdown is not always displayed correctly (when using google's website, lmarena it's fine) it pmo so much

scarlet flint
#

is there any question that only thinking model can solve?

north vale
#

Not in theory but in practice yes, questions that rely on reasoning

scarlet flint
#

can you give me one?

novel flame
#

What is known about Qwen 3 at this point? It feels like we got Qwen2.5-Max just a few weeks ago, and that one surprised me with a perfect score on my personal micro-benchmark (a feat so far only matched by GPT-4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet, o1, DeepSeek-R1, Grok-3 Reasoning, and Gemini 2.5 Pro). So if they've made it substantially better then....... holy cow.

(and yes, I need to revise and make my benchmark harder, because Qwen2.5-Max's perfect score is not equal to Gemini 2.5 Pro's perfect score.)

leaden meteor
#

Reposting poll "How long before Gemini 2.5 Pro is overtaken on Arena leaderboard?" for more votes

blazing rune
#

For example, the Qwen 3 MoE model is apparently 15B total parameters and 2B active parameters.

#

the default config in the code they made to prepare for the launch implies a dense 8B model

blazing rune
#

but I guess in the past, the default configs in the code weren't accurate

#

Like the default config for Qwen 2 has no GQA, but the actual models do have GQA

blazing rune
#

but previously even when they did release MoE models, it was only 1

#

Like Qwen 1.5 MoE (which was terrible iirc)

keen beacon
#

but i think the 15b moe will be very very good 🤔 even if its not qwen 2.5 max level

blazing rune
#

same

#

I hope it's at least as good as Qwen 2.5 14b

#

because I only have 10GB of VRAM, I can only do a terrible quant of 14b or a higher quant but put some of the layers on the CPU

#

and I will definitely need to offload the Qwen 3 MoE onto my CPU, but it will be much better than 5 TPS (because it only uses 2b parameters)
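
The arithmetic behind this can be sketched roughly as below. It only counts weight storage (no KV cache, activations, or runtime overhead) and uses the common rule of thumb that decode speed is limited by how many weight bytes are read per token. The 4 bits per weight and the 50 GB/s memory bandwidth are hypothetical illustration values, and the 15B total / 2B active figures are the rumored ones from this chat.

```python
def weight_gigabytes(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def rough_tokens_per_second(active_params_billions: float,
                            bits_per_weight: float,
                            bandwidth_gb_per_s: float) -> float:
    """Bandwidth-bound estimate: every active weight is read once per token."""
    return bandwidth_gb_per_s / weight_gigabytes(active_params_billions, bits_per_weight)

BITS = 4.0        # hypothetical quantization level
BANDWIDTH = 50.0  # hypothetical CPU memory bandwidth in GB/s

dense_size = weight_gigabytes(14, BITS)  # dense 14B: all weights stored and read
moe_size = weight_gigabytes(15, BITS)    # MoE: all 15B stored...
dense_tps = rough_tokens_per_second(14, BITS, BANDWIDTH)
moe_tps = rough_tokens_per_second(2, BITS, BANDWIDTH)  # ...but only ~2B read per token

print(f"dense 14B: {dense_size:.1f} GB, ~{dense_tps:.1f} tok/s")
print(f"MoE 15B/2B active: {moe_size:.1f} GB, ~{moe_tps:.1f} tok/s")
```

The MoE needs about as much memory as the dense 14B, but under these assumptions it reads roughly a seventh of the weights per token, which is where the CPU speedup would come from.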

#

I need to check how fast Qwen 2.5 3B is on my CPU

keen beacon
#

if you do multiple requests/batching i believe you start to lose the speed benefits of moe though

blazing rune
#

ah

#

well, I don't have Qwen 2.5 3B, but I do have Llama 3.2 3B, so I will test that one

#

it's pretty similar in architecture

keen beacon
#

anonymous-chatbot released

blazing rune
keen beacon
blazing rune
#

yeah

keen beacon
#

non embedding i think

#

besides the moe, im really curious about the 32b equivalent

barren prairie
blazing rune
#

but the MoE at q6_k quantization should be much faster

blazing rune
keen beacon
blazing rune
#

yeah

#

Even if the MoE models are slightly worse in performance, it's so much better on CPUs

keen beacon
#

i hope they use mla too, but from skimming the pr it doesnt seem that qwen 3 uses it

#

the 8b doesn't really seem compelling with the 15b moe existing

ocean vortex
keen beacon
#

fr

barren prairie
#

The fewer emojis 😂😂😂😂😂 but who cares 🤨🤨😵‍💫😵‍💫

torn mantle
leaden meteor
#

on leaderboard

keen beacon
#

given how much it costs i highly doubt it'll come to the arena

leaden meteor
#

OpenAI might eat the cost to showcase their best model.....at least for a few thousand votes...

barren prairie
ocean vortex
thorny drum
#

it would be ranked dead last lol

#

i dont think i have the attention span to wait for it to respond

ocean vortex
thorny drum
#

yeah for sure

#

i realized that when i sent the message lol

#

but i feel like its not a very good fit for the arena

#

could say i'd vote 'both bad' every time

#

so it would be ranked exactly in the middle

north vale
#

if you wanted to make it rank in the middle you'd prolly want to vote a is better or b is better at random

#

bc wouldn't a tie just cancel the vote
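
For what it's worth, in a classic Elo update a tie is not a no-op unless both ratings are equal. The sketch below uses plain online Elo with a hypothetical K of 32 and made-up ratings of 1400 and 1200. The arena's actual leaderboard fit may differ, and whether 'both bad' counts as a tie is an assumption here.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo-predicted score of A against B (1 = win, 0.5 = tie, 0 = loss)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return the new (r_a, r_b) after one battle with outcome score_a for A."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Made-up ratings: a strong model (1400) against a weaker one (1200).
tie_a, tie_b = elo_update(1400.0, 1200.0, 0.5)
win_a, _ = elo_update(1400.0, 1200.0, 1.0)
loss_a, _ = elo_update(1400.0, 1200.0, 0.0)

# A tie pulls the two ratings toward each other.
print(round(tie_a, 2), round(tie_b, 2))
# Averaging a win and a loss (random voting) gives the same expected change as a tie.
print(round(tie_a, 2), round((win_a + loss_a) / 2, 2))
```

So under this model, always voting tie and voting at random both drag a model toward whatever it is paired against on average; random votes just add noise on top.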

raven void
#

as expected 🫡

torn mantle
#

is it that good?

keen beacon
#

it's good creatively but that's about it

#

it's def not better than grok 3

#

classic oai focusing on style

wary oasis
#

is new deepseek update going to be tested? curious to see how it stacks up

north vale
#

pretty sure it's in the arena

#

just needs enough votes

timber kiln
barren prairie
# raven void

Wait a bit until it gets enough votes, grok3 was 1412 and gpt4.5 was 1411, then ....

torn mantle
#

i think cybele is really the latest llama 4 checkpoint

#

ive seen it against the big boys

#

like vs gpt4o latest
vs sonnet 3.7
vs grok 3

#

cohere if im not wrong

#

idk

#

if its dumb or not

#

it gave me references for one of my prompts

#

so i thought it had internet access

#

but when i asked it told it doesnt

#

so its either hallucination or something else

keen beacon
ancient reef
#

new llama model with emojies reminds me of deepseek (cybele)

torn mantle
#

its so bad

#

they really are just scaling the models blindly

#

what am i looking at exactly...

#

themis?

#

this is from meta too

elder rapids
#

its good when it gets stuff right

#

but its struggling so much for basic stuff

torn mantle
#

some models did

#

i did

#

it wasnt

raven void
#

haven't seen many Meta AI researcher(or devrel) on Twitter tbh

ocean vortex
# raven void as expected 🫡

yeah... Models like that are what make lmarena look like sht to be brutally honest. It can be rigged and they have a free pass with this model, user preference being the sole thing that matters

pulsar tendon
ocean vortex
#

I would maybe even go as far as to say text-only performance of gpt4o does not belong in 2025. As is quite evident by all the other metrics

pulsar tendon
#

There's a noticeable difference in its coding ability.

ocean vortex
#

if you actually look at the evals they did publish... I don't think gpt4o coding got better even once since its release. And with chatgpt-latest that's even less on their radar as they don't publish these numbers at all lol

keen beacon
#

Why is it unlikely they updated the base model

ocean vortex
#

it's unlikely coding got better

#

not that they updated it lol

keen beacon
#

It's not like before December where 4o was just tuned and tuned again

ocean vortex
keen beacon
#

It's a lose lose situation for them

ocean vortex
#

what's bad are the current metrics catgrin

#

besides it adds up with it consistently scoring high on lmarena ever since they started doing this "chatgpt-latest" thing

#

that doesn't have any other evals

#

like it's relatively easy to do if that's all you care about

verbal nimbus
verbal nimbus
ocean vortex
keen beacon
#

just anecdotally, with the new continued pretrained version (you can tell when the cut off was updated), it started matching 1206/sonnet 3.5 in some 'base model' tests i made

ocean vortex
#

I just hate this whole "vibe coding" or "it performs better because we said so" thing. Seems like something made up for models that don't perform. If it does perform then they wouldn't have problems doing proper evals with it LOL

keen beacon
#

ik u dont like livebench but i just noticed this:

keen beacon
#

livebench only measures rote reasoning in my experience, but i found it curious

keen beacon
#

i just dont have a lot of questions on it (testing that specific stuff), so i might just need more questions for a more reliable indicator

keen beacon
ocean vortex
plain zinc
ocean vortex
#

it would mean earlier chatgpt-latest before this one had barely any improvement at all over the much cheaper 0806 lol

#

and yet it was still at the top (in top3?) at the time on lmarena

keen beacon
plain zinc
leaden palm
#

i'm betting they RLd on benchmark like tasks

keen beacon
#

theyre doing smthing strange with the new continued pretrained version too

#

december version had a lot of knowledge gaps (post oct 2023), with the next version it was noticeably better in recent events. so they mustve kept continued pretraining the model and did a quick instruct tune to release

#

i think these recent versions weren't an elaborate instruct tune and they were waiting for it to continue pretraining (doing instruct tunes on checkpoints)

ocean vortex
#

and like I surely can't be the only one noticing that they are cannibalising all their reasoning models on lmarena. If chatgpt-latest really was this good then it must also mean o1 and o3 are sht unsuccessful experiments. Those 2 things are not compatible with one another. And we do know for a fact o1 and o3 are good models with much more data to back it up

#

so yeah... things simply do not add up for it to have a legitimate score at the top

keen beacon
#

what lmsys is measuring, chatgpt 4o latest is very good at

ocean vortex
ocean vortex
#

specialized in predicting the response styles that would be preferred by the user

#

I'm pretty sure this does not always lead to an accurate response either. Responses can be preferred for the wrong reasons too like perhaps even going as far as people having common misconceptions and all the weird patterns that we wouldn't want the model to learn... as it would give us an illusion of performance rather than actual thing.

shut parrot
#

the arena concept of AI model ranking is awesome in just about every way, but it does kinda play into one of the biggest criticisms of LLMs, in that they commonly repeat the biases of the humans that created the data they were trained upon

#

like if the entire LMArena voting population were just racists and sexists asking racist and sexist prompts that made it past the moderation guidelines and selecting the most vile ones as the "winners"

#

obviously the data generated from those comparisons being used recursively to train AI further would make LLMs a lot worse a lot faster, but it would boost arena scores

leaden palm
#

thankfully not

upper wolf
leaden palm
#

style control lifts claude up much more than exclude refusal

upper wolf
#

the issue goes much deeper than refusals - lets say you ask something like "does God exist" most people are religious so they would want to hear "yes of course"

north vale
#

yes several ideas

lean whale
torn mantle
#

new model : bolide

raven void
#

themis is the big llama model?

mossy drum
#

New model in Arena: phoebe

torn mantle
#

themis and cybele

raven void
#

phoebe is pretty good but buggy

#

bolide is meh

scarlet flint
#

in general

#

llama models are

#

kind of bad

#

they aren't impressive or anything

#

nothing special with them

calm sequoia
alpine coral
#

moonhowler is new? interesting that it just goes straight into the code with its response; no preamble. (i voted phi-4 fwiw.. neither script worked, but its one almost did)

alpine coral
#

spider kinda sounds familiar.. but is it new?

plain zinc
alpine coral
#

not sure - just been single prompt / turn with this task

ocean vortex
distant egret
#

when will the gpt4o 26 Mar 25 latest model benchmarks get released, for all category benchmarks

ocean vortex
#

they will probably only release them once they update API dated version though that wouldn't score on lmarena so high

#

but it's not gonna be less capable 👀

distant egret
distant egret
brazen sphinx
#

when will deepseek v3 0324 have arena score ?

distant egret
alpine coral