#general
ya
didnt read the reasoning maybe its wrong
it thinks its very trivial after skimming the trace
o3 mini high, qwq 32b say instantaneous, still waiting on o1 high
yeah the reasoning is wrong
oh well
@keen beacon no $150/600 needed
is this right? i added a period (missed a period when pasting) and regened
o1 is still great when it comes to problems like this
make sure to look at the cot
Gemini sometimes has to re say certain things because it for some reason doesn't process certain values or something
initial KE calculated correctly, but fell apart. rest is slop and it got to that answer with no proper reasoning
this has happened like 3 times for me when doing puzzles
and whenever it happens, it underperforms
or stops thinking in the code box
if code execution makes it work
it's probably because there's no errors happening like that
additionally here was the reasoning for it if ur curious:
Someone ask it if water is wet
we are basically getting sonnet 3.7 level coding if not better in some cases for free
and with 1M context
- faster
i feel like the ai hype died down, but this model should revive it
what do yall think about brampton
Some TikZ drawing tests: Can you give your best attempt at generating a gorgeous realistic beautiful dragon breathing blue flames using Tikz? Please, really try to put effort into it!
guess they rld on svgs but not tikz
Yes, executing the code significantly improves mathematical skills and, in general, the model is geared towards mathematics, calculations and analysis.
How can Gemini2.5-Pro not solve this problem that QwQ-32B and Deepseek-R1 can solve!
Because if you throw enough punches one is eventually going to land
wow a non-reasoning model pass this
By the way, this question was prepared by OpenAI when o1-preview came out.
literally gave this problem to it both as nebula and earlier and it got it right
istg I saw someone posted them doing it too here
grok might end up cooked if gemini pro 2.5 is first to market
v3 0324 needs to be studied
if it's not reasoning through it, it's in its training data
but I haven't tested it too much yet, so idk if it can output that much at once
Alright, let's tackle the problem: "How many vowels are in the sentence, 'pernicious Turnip men Crusade at dawn'?"
Understanding the Problem
First, we need to identify what counts as a vowel. In English, the vowels are the letters:
A, E, I, O, U (and sometimes Y, but we'll ignore Y for this count).
We'll go through the given sentence one word at a time, examining each letter to see if it's a vowel.
The Sentence:
"pernicious Turnip men Crusade at dawn"
Let's break it down word by word.
- "pernicious"
Letters: p, e, r, n, i, c, i, o, u, s
Vowels:
e
i
i
o
u
Vowels in "pernicious": 5
- "Turnip"
Letters: T, u, r, n, i, p
Vowels:
u
i
Vowels in "Turnip": 2
- "men"
Letters: m, e, n
Vowels:
e
Vowels in "men": 1
- "Crusade"
Letters: C, r, u, s, a, d, e
Vowels:
u
a
e
Vowels in "Crusade": 3
- "at"
Letters: a, t
Vowels:
a
Vowels in "at": 1
- "dawn"
Letters: d, a, w, n
Vowels:
a
Vowels in "dawn": 1
Summing Up the Vowels
Now, let's add up the vowels from each word:
pernicious: 5
Turnip: 2
men: 1
Crusade: 3
at: 1
dawn: 1
Total vowels = 5 + 2 + 1 + 3 + 1 + 1 = 13
Verification
To ensure accuracy, let's list all the vowels in order from the entire sentence:
Sentence: pernicious Turnip men Crusade at dawn
Underlined vowels: e, i, i, o, u, u, i, e, u, a, e, a, a
Counting them:
e
i
i
o
u
u
i
e
u
a
e
a
a
Yes, that confirms 13 vowels in total.
Final Answer:
There are 13 vowels in the sentence "pernicious Turnip men Crusade at dawn."
From deep seek V3
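For anyone who wants to check the count without trusting a model, here is a minimal sketch in Python. It assumes the same rule as the walkthrough (only a, e, i, o, u count, Y is ignored, case does not matter); the helper name `count_vowels` is just made up for this example.

```python
VOWELS = set("aeiou")  # Y is ignored, same as in the walkthrough


def count_vowels(text: str) -> int:
    """Count vowels in text, ignoring case."""
    return sum(1 for ch in text.lower() if ch in VOWELS)


sentence = "pernicious Turnip men Crusade at dawn"

# Per-word counts, mirroring the word-by-word breakdown.
per_word = {word: count_vowels(word) for word in sentence.split()}
total = count_vowels(sentence)

print(per_word)
print(total)  # 13
```

The per-word counts come out as 5, 2, 1, 3, 1, 1, which matches the breakdown above.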
yo they still haven't fixed the tokenization issue
Why are people so confident about Deepseek and Openai being top in the next month? Do they have any models that are doing well (anonymously) in arena? Gemini 2.5 pro seems to be a clear leader for next few weeks? o3 mini high and deepseek-v3 are already on the leaderboard. Is Deepseek 0324 that good that it will beat gemini 2.5pro? And I see that o1 pro and R2 are not in arena anyways based on conversations here...SO, I dont get why 60% of the people who took the poll here think that either OpenAI or DeepSeek is going to top 2.5 pro?
R2 will be the champ
Isn't it coming in may?
- gpt 5 in the case of openAI, bandwagon in the case of deepseek
- no
- Gemini will probably continue to be a leader generally for a while
- deepseek 0324 is good, but it's still an incremental jump, I do think 2.0 pro and sonnet 3.7 are better
- reasonable in the case of openAI (since people expect o3 full aka gpt 5 eventually), bandwagoning in the case of deepseek
a lot of people who follow AI came from the deepseek thing
which became borderline propaganda
before and for a while after, 2.0 flash thinking was closer to r1 while being smaller and cheaper, and nobody paid attention to that
hopefully r2 is the actual jump in performance, but if it's not simply better than 2.5 pro then it's not going to matter, to me at least anyways
0324 is a good model
anonymous chatbot is a good model
(in terms of style)
still waiting for r2...
was anonymous chatbot not an early 4o checkpoint
it doesn't do very well every time I get it
but still formats decent
oh he blocked me
lolll
Yeah...I have seen some people posting their results here and anonymous chatbot did not seem to be better than nebula (2.5pro)...
yeah dw about it
a lot of people in the AI space just simply don't know what they're talking about and pay attention to politics too much
anonymous chatbot (likely latest iteration of chatgpt-4o-latest imo) isn't a thinking model and is presumably natively multimodal. i use 4o for most day-to-day tasks (i don't need thinking for a quick translation or whatever)
it seems an incremental improvement (or possibly already deployed on chatgpt) - won't smash any records, but not meant to either (GPT-4.5 arguably was a bit of a disappointment in terms of performance though imo - but is still very robust for reasoning and writing tasks, as most big models tend to be)
that was me, yes it can solve it it just doesn't have that high of a success rate
this has to be an issue with the model rather than its reasoning process
sometimes it points out how it didn't actually format certain information
and it's been freaky since it released in the app first
I hope someone with more reach can mention this and look into it
I know I'm def not the only one
but I've probably tested it 10x more than others
50 is too little
buddy ive been extensively testing this model since the day it showed up in the arena
Gemini 2.5 Pro is doing wrong
except @plain zinc
The model probably doesn't generate most answers from scratch, so far I haven't gotten a very specific answer.
It is also clear from the example of creating super mario
Give V3 0324 a chance
its a good model
but tbh its nowhere near gemini 2.5 pro
i mean for the cost its a good one
gemini just seems good overall
Last time someone overtook Grok, the leaderboard was updated the same day. It would be funny if it happened again and Musk released Grok 3 full immediately
On the other hand, you need time to train on Gemini 2.5 outputs
But I don't think it will overtake Gemini this time
It shouldn't have taken the previous no. 1 spots either
Really good job guys. You're oracles!
It is only V3, wait until it's V4 and R2, and until they distill Gemini
I think it would be more accurate to compare with non-reasoning models.
Do you think they didn't use R1 while making Gemini 2.5 pro?
They're all hoarding data and copying from each other. That's why they exhibit very similar behaviors in many things.
I thought I would be 3rd, thinking that the leaderboard would be released a week later. I didn't expect it to be released this early.
We said everything beforehand.
It was proven that no data is better than bad data. The R1 is not exceptional; therefore, it is wrong to assume that they trained on R1 outputs for 100%.
Unless in early stages of pre-training, but the base model was not changed (or was it?)
They probably used models like Claude, Grok, R1, o1 simultaneously and constantly covered their shortcomings with comparisons. I didn't say there was anything wrong with that. I'm just against OpenAI's "model distillation" excuse.
They removed the basic model (2.0 pro) from aistudio. I also get a resource exhausted error on OpenRouter.
Hey guys. Do you know of a subreddit or Discord server to discuss specifically everything related to AI images? It's probably obvious but I can't seem to find it.
- r/singularity is too broad
- r/StableDiffusion seems to be focusing on AI videos (?)
- r/aiArt seems to be about the "art" aspect more than the AI one (?)
Basically I'm looking for a space where people discuss/compare for example Reve vs 4o vs Imagen 3 vs Recraft
4o image capabilities are v impressive but it's not native multimodal
nothing lol
but they should justify
and not based only on one prompt
This is a test for ardent Gemini fans
But I'm not like that. Don't think about it.
Gemini is interesting to me as a research product.
What's the best ai for solving chemistry MCQs
try it and see
bro if Gemini was THE BEST at everything, it would be AGI. But we know it's not.
oh yeah that's right
call it 'omni'.. wonder if / when it will actually be a truly multimodal model..
does it use separate models for image recognition and generation, or just the latter?
same with advanced audio.. is that actually integrated natively? i haven't really been following since the initial release
function calling isn't exactly a modality, but 4o has become a lot better at using code interpreter. like for anything not too complicated, it handles data pretty well imo given the speed and cost
Gemini 2.5 Pro here is ethic,safety,bla,bla
When do you expect DS 0324 in the leaderboard?
I am surprised they did not release this in the leaderboard yet when lot of people are saying this is one of the best non-reasoning models...
it competes with Claude 3.7 Sonnet in coding at a 15 times cheaper price
Most people didn't really try, Deepseek made the announcement very quietly.
I am only talking about arena. Not the general release... If they released it in arena and model is really good, it would have come on top, isnt it? Even people are biased against it? Isnt that the point of the arena....
why do i get this after every message?
me too, I always get this error
because ur mango
Isn't it already in the arena?
claude create a python command it don't get the prompt
Was it there before public release? I was asking why it was not released in the arena before public release... But I see that Deepseek tends to not do that...So, I guess it will take a week or two before it appears in the leaderboard from now?
Yes, it was not in the arena before.
you're right
played around a bit. it's pretty decent for sure. though fwiw i wasn't blown away - it's solid but yeah i'm not sure i'd be willing say anything more than that (based on my limited use anyway)
even though it's not a thinking model, it's been post trained in a way that encourages CoT responses, rather than just blurting out the answer.. can be a bit tedious
it seems good at physics / science
impressive with some riddles, but not strong with theory of mind kinda stuff, and falls over on some generally simple ones (usually by overanalyzing it / conducting needless calculations)
i like its note at the end here ha
also math
Probably best for roleplay too
by the way there is no system prompt
this is default
Boss and lazy employee roleplay
Why Chatgpt is not that good at "Chat"
ong
but nah I meant like, testing for looking at the CoT process
and stuff
there's a lot of things broken with 2.5 rn
nobody is mentioning it for some reason
crazy
you can not really look at CoT it's hidden
i don't know what's going on lol (though clearly V3's response is superior / more creative) but anyway was curious to give V3's response to 4o and tell it to make it 'spicier'. initial attempt was bland; the next one also kinda meh, but the third/final was surprisingly lucid tbh ha
4.5 even more so ("if i have to finish myself off in the jungle" lol)
i thought 4o just used more emojis and had a different (more casual) tone.. but didn't realise it (and 4.5) actually have like some 'character' in a way oai models previously haven't
let me try
huh?
I think he meant that you can not really look at CoT it's hidden
Does the gemini advanced sub have a rate limit for the 2.5?
it's not
yes it is?
you're conflating the initial o3 mini summary with markdown/formatting
@rigid widget are u going to let this slide
nah I don't think so
I went way past ai studios 50 rpd limit in the app
We want real pure CoT not changed CoT
it's not "changed" cot in the same way o3 minis was
it's not being optimized post CoT
it just fundamentally is a different approach to CoT
Even o3 mini is more readable
that's because they have a different approach
they're giving similar amounts of information
neither are being changed
Gemini has always used markdown way more than other models
Looks more like a system prompt it's not first person
ye Gemini tends to not speak in first person
whether it's an initial system prompt
or unintentional
the behavior still shows in the CoT
yeah
but regardless, that is the "raw" cot
this is crazy
16~ avg improvement from flash thinking
yo this is actually insane
its not even incrementally better than all the other models
the difference between other models and their competitors are super incremental in other benchmarks and Gemini 2.5 pro doesn't follow that trend
from the moment I had my hands on nebula I knew this model was different
don't call me a prophet, but I might just be one
I can't get over it
gemini2.5
Crazy for Gem2.5Pro and V3 0324
so pro is the first 2.5 gemini model. Will flash be coming soon?
was phantom 2.5 pro as well or could that be flash?
could be
flash
if they release 2.5 flash
then it's safe to say the old models are gonna be removed
that's just for coding avg tho
Flash
damn #1 in almost every category
phantom seemed pretty similar to nebula tho tbh
but apparently it was removed and added back
so it could've been changed to flash and I'd have no clue cuz I only tested the old phantom
why's 2.5 pro so fast tho
ive been doing some research
without consensus voting, which openai used to make o3 (full) look better on the benchmark graphs, gemini 2.5 pro is actually better
AIME o3 ~90-91% vs 2.5 pro 92%
GPQA o3 ~82-83% vs 2.5 pro 84%
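To make the "consensus voting" point concrete, here is a rough sketch of what majority voting over several sampled answers generally looks like, and why it can score higher than a single attempt. The sampled answers below are invented for illustration, and this says nothing about how any lab actually produced its published numbers.

```python
from collections import Counter


def consensus_answer(samples: list[str]) -> str:
    """Return the most common answer among the samples (majority vote)."""
    return Counter(samples).most_common(1)[0][0]


# Hypothetical: five sampled answers to one question whose correct answer is "42".
samples = ["42", "41", "42", "42", "17"]
correct = "42"

# Average accuracy of a single attempt across these samples.
single_try_accuracy = sum(s == correct for s in samples) / len(samples)

# With voting, the whole set counts as one (correct) answer.
voted = consensus_answer(samples)

print(single_try_accuracy)  # 0.6
print(voted)                # 42
```

So a model that is right 60% of the time per attempt can still get the question right under voting, which is why single-attempt and consensus scores should not be compared directly.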
Gemini 2.5 Pro
Except reasoning (🥈 2. best model)
All ➡️ 🥇 1. best model
Good job Google!
o3 is several months older tho I guess
which is why openai need to stop sitting on their damn products..
just release it already
Deepseek V3 03-24 (comparing with non-reasoning models)
(Overall: 🥈 2. best model)
Math: 🥇 1. best model
Coding: 🥈 2. best model
Reasoning: 🥉 3. best model
Good job Deepseek!
At these prices, even if they were better, they're useless.
but what made o3 crazy is its score on arc agi. I dont think 2.5 pro is getting near that, unless they test it with similar copious amounts of compute
how big is v3 tho?
deepseek 150x cheaper
huh?
DAMN
685b params
drop o3 sam.
so, can anyone reverse engineer the scaffolding used in claudeplayspokemon and make gemini 2.5 pro play pokemon lol
2.5 pro is #1 in reasoning also, isnt it? Why did you mention it 2nd?
I see. Is this o1-2024-12-17- high different from o1-2024-12-17 on leaderboard ?
it is the best reasoning model
might not be the best at purely reasoning
but that's trivially true
can this data be used or licensed on a company's website?
Who will have the best model within one month?
5
13
2
waiting for 2.5 flash to use for my projects
even though i have like 12 api keys the rate limits are still strict
reasoning average better by a little bit
but time and cost is a lot more
worse model with 50k thinking tokens slightly better than a good model with 10k thinking tokens
Dude you need to be a politician
Voice Chat + Video Chat! Just in Qwen Chat (https://chat.qwen.ai)! You can now chat with Qwen just like making a phone call or making a video call! Check the demo in https://www.youtube.com/watch?v=yKcANdkRuNI
What's m…
this is awesome
yeah
What's more, we opensource the model behind all this, Qwen2.5-Omni-7B, under the license of Apache 2.0, and we release the technical report to share all our details!
4o image gen, prompt ""A photo of Tim Cook making a speech on stage behind a lectern with the text "FOR THE MANY, NOT THE FEW" and the UK Labour logo on it". no other model has got this close to perfection - it literally got the exact lectern right with just the "H" missing from "the", which i'm sure would be gone if you regen'd a few times
for reference
nice
can't wait for Google to make a bigger native image Gen now, the flash one is small
I care more about Qwen 3 (especially the MoE)
Oh, isn't this dangerous????
you could argue that
but like
i like the fact openai have lowered the guardrails for this
dall-e was censored asf and it showed
this is not a very good benchmark, but in this case it is not wrong... 2.5 pro very well just might be the best model ever released thus far tbh
it's finally joever for livebench
2.5 pro doesnt seem to know much after june 2024. it doesnt seem like they focused on recent events that much
how tf is it 2x faster than gem 2 pro
it's amazing
livebench is one of the especially good benchmarks as far as how it matches talking to each ai lmao, I'm not sure what makes you think that
that's more of a consensus too
livebench and simplebench
but yeah 2.5 pro is just
really crazy
I wonder if that means it'll be cheap as well, surely yes?
anthropic will not be happy about this release if it turns out this model outperforms the claude models at a lower cost lol
this couldn't be further away from the truth. Unless you really think qwq is better than all of those:
which is just ridiculous
i dont think any benchmark will give you a perfect stack rank of every model
no but some are better than the others. Livebench is among the worst ones tbh
what are your favorites?
if you take any real benchmark like say MMLU Pro, you will hardly see any discrepancies at all. All scores just make sense and are logical
And also... contamination is not really the issue. Bigger issue is "private datasets" as that contributes nothing to improvement of the models IMO. It also invites selective cheating when only select companies have some kind of access through connections which means even more discrepancies. If benchmark is diverse and good, it's pretty much impossible to cheat beyond what everyone is doing (or to get a perfect score) unless that is the sole thing you focus on disregarding all the other metrics. Which is just not realistic as it would be obvious model is sht
the main thing should be that it's a fair game. If everyone has the same public access that means it is fair
and actually is driving the improvement. Rather than being some secretive thing no one can even validate if it's any good...
this isn't what makes a benchmark bad at all lmao, that would be reasonable if the score discrepancy were larger than just a few percentile. It's a really good inference model that doesn't have quite as much knowledge as deepseek, but that isn't what its testing. it's not "better" than deepseek, nor is deepseek so much better than o3 mini, nor is v3 better than Claude 3-7 sonnet, this isn't because the benchmark is flawed, but because of how YOURE viewing the averaged score
the same way lmsys isn't a bad benchmark, its just not quantifying what you think it's quantifying
no. Just no lol
that model has no business being equal with those
let alone above them
wym just no? there's literally ±1 deviation LMAO
have you tried it??
in the same way smaller models can outperform gpt 4
is the same way qwq-32b can have extremes like that
they won't consistently be better than gpt 4
but that's not what its testing
look at the individual scores
and it's not a good benchmark at all if it fails to find the difference. Qwq is worse than R1 overall, not the other way around
that's why you need benchmark with plenty of test questions
to find the difference? or the difference it claims to
deepseek v3, claude 3.7 sonnet, gpt 4.5 arent thinking models. flash thinking isnt good. its not surprising for it to be competitive on rote tasks/math
not just a few or a few subcategories
flash thinking is the best here for inquiry
lol
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
no it isnt
never got the attention it deserved
I mean like, if you set up a debate between two models
flash thinking wins
that kind of stuff
o1 gets crushed too
their average was supposed to be an average of those things:
so it doesn't make sense lol
yea livebench tests only rote tasks lol
it's not a very good test
nah flash thinking sucks
i dont see any of the other models going 29k tokens in thinking and coming up with this
fyi qwq does that one just fine btw
afaik
dawg I just IMPLIED this exact thing
this server bro istg
you said flash thinking would win
its one of the worst though alongside qwen models
what is this quantifying
ngl you guys have to be trolling
the elimination game benchmark, it tests social reasoning, strategy, and deception
but that isn't what I just said is it
theres considerable overlap with debate skills and social skills though
yes for llms
I think the issue is you. Just because you don't like that you are wrong you saying that everyone else is "trolling" is not gonna change anything and not gonna make you right lmao
yeah this guy is so annoying lol
strategy in debate is rhetorical recession or knowing how to make a cohesive argument
2.5 seems to do it
that would make sense if I weren't right
is that flash thinking?
but of course, you don't know the claim and won't bother to reiterate what I'm saying for clarity
you aren't
alr, state my premise
Nah I was using the wrong one lmao
dawg you're literally just ignoring what I'm saying, and I don't know if it's just to argue. this isn't the first time
nope it just did it again for me
are you using 0 temperature?
^
yeah ur using 0.7 temperature
It keeps repeating stuff on 0 temp lmao
You seem to be thinking everyone is doing it wrong and livebench is this magical benchmark more accurate than proven reliable metrics. Metrics with a much better track record and better results. That's almost like 'benchmark shopping' - look hard enough and you gonna find a benchmark for your confirmation bias
ya using a non zero temperature allows it to not get stuck by adding randomness. try it with 2.5 pro on 0 temp and it wont get stuck
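A toy sketch of the temperature point, in plain Python. It assumes the usual textbook picture: temperature 0 means always taking the highest-scoring token (greedy, fully deterministic), while a non-zero temperature samples from a softmax over the scaled scores. The token scores are made up, and real model APIs may handle temperature 0 differently.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token: greedy at temperature 0, softmax sampling otherwise."""
    if temperature == 0:
        # Greedy decoding: always the single highest-scoring token.
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(v - top) for v in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


logits = {"the": 2.0, "a": 1.5, "again": 1.0}  # hypothetical next-token scores
rng = random.Random(0)

greedy = {sample_token(logits, 0, rng) for _ in range(50)}
sampled = {sample_token(logits, 0.7, rng) for _ in range(200)}

print(greedy)   # always the same single token
print(sampled)  # more than one distinct token shows up
```

With temperature 0 the same context always gives the same next token, so a model that starts looping has no way to break out of the loop; the randomness at 0.7 is what lets it take a different path.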
We are a few years away from AI being used as video evidence in a court case
I mean a few months
yea both are wrong i dont have the answer on hand though from what i remember both are incorrect
The future might be bleak or bright idk
So like... simple-bench is 'perfect' for people who think Claude is better than everything else. But they fail to understand it's mostly just spatial awareness reasoning. Your daily tasks will usually include MUCH more areas than just this. Then there's also an issue that it's private and no one can verify it properly so it's a guesswork
people like that will also look at swe-bench but conveniently ignore LCB where claude is destroyed and it's still coding lol
https://x.com/koltregaskes/status/1904974999011614895 what a last couple weeks of march for image gen
MIDJOURNEY V7 TARGET LAUNCH IS MONDAY 31ST MARCH!
Next week!
it's hard to be impartial, but you really shouldn't judge the benchmark based on individual model results. The number of discrepancies in total is the easiest fair way to get the rough overall picture
that's completely loaded lol and doesn't state my premise. Whether or not livebench is a viable objective evaluation that judges how models specifically perform in certain tasks isn't what I'm saying lol, the reason why it's even averaged out through such wide variation of things on livebench especially is because they aim for exactly what you're looking at while discussing with it, qwq-32b does have specific traits that may help it over deepseek, livebench tests only a few languages, and it happens to excel in these few languages, but if you check aider, clearly there's a discrepancy, but also in the claim I've been making; how well it does when given newer, or at least more obscure information, and therefore a better score, with large variation. anything ±5~ score deviation wont have such discrepancy, yet it's completely realistic that's how they interact with that kind of information
I don't even think we disagree
it's also unreasonable to expect any model to do well everywhere... Just not realistic
but there's clearly a difference in what we understand these benchmarks do
for me, Claude 3.5 sonnet didn't do well, for exactly what I expected livebench to do
nonsense. They provide the average score. If it's no good it should not be provided and the benchmark should be specialised rather than general purpose
that would make sense if the topic variance werent so wide
which is why I clarified that
"such a wide variation"
and if it is and the results are that... the benchmark is not the best one. Simple as that
it's exactly what it claims to be
and seems to match really well because, the model approaches you that way too
novel information over a general discussion = non standard benchmark performance
and probably why o3 mini and o1 have done so well, since they focus on exactly that kind of behavior of receiving
and it's probably why flash thinking is so high too
since it's just not that good, but if you use it
it's just really good at receiving information
have you read their description at least?
wym?
ye but I think that's a much more important aspect over others tho, spatial reasoning necessarily bleeds through to other tasks that may be the best in actually talking to it
They are aiming at providing general purpose anti-contamination benchmark. I already said what I think about contamination (it's not an issue for a good benchmark), and if this is general purpose... then it's clearly not the best benchmark. Like I've already said several times but you just don't seem to get it
nowhere does it say it's specialised or limited in scope:
this implies exactly what I said tho lol
if it's anti contamination it's aiming for novel information and score discrepancy under that
yeah? but that's not really what I said either
this just makes no sense at all tbh and is just a bunch of nonsense
just how it adapts to you and stuff lol
that's it
like what? Are you meaning to say that there are discrepancies because it's general purpose? Cause that's nonsensical too
I just said what I meant
LMFAO
^
nonsense
?
"how it adapts to you and stuff" are just some meaningless words you made up, in this context
it's not meaningless lol
that means how it receives information, you can't get any simpler than that
if it means how it receives information, that means how it interprets things through a context length right? if it's better at low context
then it's necessarily going to be better
oh wait I wonder why all of these models also excel at lmsys
which happens to be ahem
general low context performance
🤦
Production models do not receive information, quite obviously. They receive it during training. And calling your input prompts as "receiving information" is ridiculous. That's not how it works LOL
what
that's clearly not what I'm saying
then talk properly, cause you are just making sht up 
I just clarified exactly what it means tho, and then you sidestepped into that interpretation
^
you know you can ask for clarity
right?
the bigger issue is testing properly. If you aren't testing properly with good test questions "novel" part is completely irrelevant. And I already said why contamination is mostly a non-issue. If the entire dataset is public, that means all models are trained with the same data available and the scores are fair. It is still a consideration, but good benchmark is MUCH more important than just making sure you develop test questions in secrecy...
that's not what I'm saying at all, the novel part is relevant when that's the point of the benchmark through large variety, ie, livebench being so general that they don't focus on coding enough for more than 2 languages (I said this already), which can be equivocated other aspects of the benchmark
if it was an issue we would have seen models getting perfect scores on all public benchmarks a long time ago. That didn't happen. Because it is essentially impossible
and then it averaged that
because the point is
it's not testing intensive information
and the wide variety implicitly denotes novelty is what it's looking for
then it shouldn't be general purpose benchmark. I feel like we are going in circles...
dude
if it's not the general purpose benchmark you think it is
then it's not, and I'm showing you how
lmsys is a general purpose benchmark
but clearly they're different benchmarks
right?
wide variety implies extensive information you just said it doesn't have. Which one is it? Like make up your mind lol
are you... not very smart? It is NOT that
mb I meant intensive information
it's a human preference benchmark
exactly
not an objective deterministic tool for general performance
it's a general purpose benchmark via human preference
which is exactly the distinction I'm making
the reason why that's relevant is because livebench is a general purpose benchmark via novelty
to you, one is more apparent than the other
I think I'm done here... You are obviously gonna just keep changing your tune and pretend you meant otherwise than you really did. Human preference is NOT equal to general purpose. Human preference is a specific category of that
dawg
theyre entirely different categories altogether (meaning, it doesn't make sense for you to think I meant that regardless)
and that was what I was implying the whole discussion (meaning, it's not changing my mind if that's what I've been saying)
general purpose tests for everything with a deterministic objective eval methods. Human preference tests mostly for the 'vibe' and style of responses and the initial impression they leave on people. Those 2 things are different and not really compatible. For the record
livebench is not human preference testing, in case you think it is...
it's not like I didn't clarify that
^
^
that's just begging the question lol, general purpose just means the scope of what's being evaluated
and they are compatible
I don't see how you think otherwise if they're simply so categorically distinct
you can use objective metrics to measure very narrow non general purpose tasks
it also means there should be hardly any discrepancies at all. Truthfully all those less popular benchmarks have an easy way in NOT making it a general purpose benchmark. But as soon as they do, if there are many discrepancies... They kind of sabotage themselves because the standards for it are very different and you can't have a leaderboard looking like that with the scores that don't make sense
again, if that's not what the leaderboard claims, then it's not a flawed benchmark lol, livebench is unique
that doesn't mean anything tho? I dismissed it, since it's implicit through the extensive information
I don't think you understand the scope of contamination problem. It is small relatively speaking. I've said this repeatedly. ONCE AGAIN, contamination is irrelevant as long as the benchmark itself is not on the same high level standard that the main benchmarks are. It's testing nothing, in comparison and bluntly speaking
ie, high benchmark variety + high benchmark avg
that's completely irrelevant, I don't care about the contamination
then you shouldn't use livebench
cause that's their MAIN selling point
LMAO
you're saying this because you think I'm getting novelty from anti contamination
but I've clarified like 3 times
that's not what makes it novel
do you know what 'novel' means?
it means it contains problems that are not included elsewhere, hence the anticontamination part.
ok, but that's not what makes it novel here
anti contamination implies novelty sure
but, let's dismiss its "anti contamination" selling point completely
ignore it
ignore everything the livebench descriptions say
now
livebench is a novelty benchmark
why?
it tests extensive information, with the example of coding, it's not going deep into each variety
but how is this novelty?
because it's averaging low intensive scores
NO. it tests nothing in comparison to something like MMLU
you are completely missing the point lmao
huh?
the only part which makes it "novel" is anti-contamination, that's it
the scope is MUCH smaller than other well known benchmarks
you need to google what "extensive information" means
like bruh...
yeah wide variety lol
this is getting ridiculous lol
but you think "it tests extensive information" contrasts that
you can NOT have extensive information if your scope is limited and you have way less test questions
that is the opposite of extensive information
even if you include many categories, that will not change much at all if the total amount of test prompts is small
that only means each individual category is that much less accurate
sure, I'll grant that
but that's still not what I mean
if each category has its own intense form of information (how complex it is/its depth) it still has respective extensity
At this point I don't think you know yourself what you mean. You keep changing the goalpost LOL
I haven't made a separate claim tho?
you don't know what changing the goalpost means
you're just shotgunning and not letting me clarify lol
Not really. You can have a single coding test prompt and rate coding like that. It will have "respective extensity" but it will also be completely useless
ie, low extensive formatting, low intensity, while being highly extensive
yep
I prefaced the discussion with that tho
so
you're getting kinda lost on self proposal, it's not "not really" if that doesn't contrast my claim at all
I do and it fits perfectly to what you are doing. You keep going back and forth diluting the point itself this all started from
and it's an extreme
? this literally doesn't fit if there's no separate claim or subordinate topic
how can I change a goalpost without moving the direction of the topic
should I remind you that it started from this? ^
lmaoo
^
we already established that "one of the especially good benchmarks" is the last thing it is and that it is not a human preference benchmark
or did you forget
that's the premise, how would we have established that
it is one of the especially good benchmarks
for clarity, now that the goalpost has remained what it is
I dismissed this lol
^
if that's not what I'm saying, how would that establish the premise?
so you are gonna argue it is "extensive information" or big variety again?
duuuude
like just drop it
LOL
ye it's extensive information
can you screenshot a category of livebench rq
"IF" necessarily extensive
"Language" necessarily extensive
"reasoning" necessarily extensive
they're categorically vague lol
that means they're necessarily extensive
it's not that deep
larger categories = more extensive
What the hell happened here lol
One notable issue with 2.5 pro that I havent seen mentioned too much is that it doesn't respect edit format syntax well. If you look at aider it only gets the format correct 90% of the time whereas with sonnet it's 98%. I also watched a video of a guy using the model with roo code and there 2.5 pro was running into some issues with performing the edits correctly, which means you have to pay for the failed requests as a user. Not sure if it's going to overtake sonnet 3.7 for coding if they don't fix these issues for stable release
And it doesn't follow certain instructions well, like telling it "Don't use any comments in the code"
At least it didn't work for me
Both seem like instruction following issues, so I hope they'll work on it
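To put the 90% vs 98% edit-format figures above in cost terms, here is a rough sketch. It assumes (my assumption, not something aider reports) that a malformed edit is simply retried and that attempts are independent, so the mean number of requests per good edit is 1/p. The function name is made up for illustration.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean requests per well-formed edit, assuming independent retries."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


# Figures quoted in the chat: ~90% for 2.5 Pro, ~98% for Sonnet.
for name, rate in [("gemini-2.5-pro", 0.90), ("sonnet-3.7", 0.98)]:
    attempts = expected_attempts(rate)
    overhead = (attempts - 1.0) * 100.0
    print(f"{name}: {attempts:.3f} requests per good edit (~{overhead:.1f}% extra)")
```

Under those assumptions the gap is roughly 11% extra requests versus roughly 2%, which matters mostly when every failed request is billed.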
agree. LMAO
Ehmm... I'm not sure what went wrong with your reasoning but 90 vs 98% is really not the thing of "not sure if it's going to overtake sonnet 3.7 for coding if they don't fix these issues for stable release" lol
it is a stable release for all intents and purposes. Mostly all of the models we are talking about are either beta, research-preview, or experimental, or they chose not to name them this way for marketing reasons
Google has a luxury of doing a lot of training (TPUs and their resources) so they iterate and update them frequently. But every public release is still adequately tested, safety aligned and basically made sure it is as good and stable as they can make it at that time
the rate limits for stable models are way higher tho
ga models on aistudio
exp models have very limited api rate limits
well I suppose they label "stable" as the ones they are confident about performing and can allocate the most compute for. But it still doesn't change the fact that "experimental" is mostly just a name and is not rushed nor anything like that. But if it doesn't perform or they have some doubts or they are confident they can improve on it relatively soon they can label it as such
yeah they're indeed both wrong. 2.5 does get it - after a fair bit of thinking
(note there are 3 solutions; but in all of them the values in the top row [i.e. in the solution screenshot] remain the same)
Btw u input the puzzle wrong that's why there's multiple solutions
The three in bottom left subgrid is placed wrong
Move it to the right
Also 2.5 pro got it wrong there
i did think it curious that there were multiple solutions lol
will circle back later and try again with it properly input to the solver
it got it right, no? i mean, the multi-solution version I gave it anyway (like its solution is the same as that by the solver; its first solution anyway ha) - or is there a discrepancy b/w the two i overlooked? (i did have another look just now and they seem the same fwiw ha)
separately, haven't seen ops in the arena before
claims its from OAI when asked, though i'm not sure if that's accurate or hallucinated (inclined towards the latter at this stage)
I guess it's right if you gave it the same wrong puzzle
Thought u gave it the right one and just entered the puzzle wrong in the solver
gotcha (yeah copy pasted the same wrong one into both ha)
meta
another one
New model in Arena: deepseek-v3-0324
there is also "cybele" he always says he comes from open ai but you can see that he comes from meta
Try and see
you right I think there should be no hidden database
That's exactly why I create my own "real and reliable" benchmarks.
OMG WHERE IS SAFETY
guys I want to create a real and reliable benchmark
If you want, you can share tasks with me.
The safety in question
There is a whole lot more of these
Some crazier stuff too
I will test it
Gemini advanced trial has started
I hate that disgusting experimental warning
Anyone know the tool that embeds LLM into computer to read software outputs, such as discord, on the device documents, slack and so on?
Like Agent
I guess it may be too soon
yea i just got it
cybele
its from meta
its so slow tho
wild
idk how to feel about the new meta model
who is "he"?
Thanks! Is it this one? https://www.cybelesoft.com
"cybele"
that is not "he" 
Il not speak good english
yea cybele is being tested intensively on the arena
Did 4o solve image generation completely? For realistic pictures, are there still subtle clues? Until now, details like buttons on a shirt or background characters would be slightly off... but now? What are your impressions?
Is it usable for free? I don't have plus and once tried it but then it said that it's overloaded, bc of too many requests. And now it's only responding with dalle images for some reason
I wanted to try if it can do a chess board
Which basically all other image-gen models struggle with
Not yet for free
when gpt 5 releases imo
the point of native imagen is more of it being able to apply the exact changes that you want directly. The quality itself is not gonna be better than the best existing models out there, but what it can do with img2img and how you can guide it towards exactly what you want is unmatched
well and the text depiction is very very good, since it's an LLM first and foremost
so things like that will not happen anymore lol
since it's generating an image directly itself rather than just making a request for you to another model which has only the most basic language understanding
idk what channel is related to https://lmarena.ai/ Leaderboard
But do you have any idea why gemini-2.5-pro-exp-03-25 cannot answer these long prompts? It works with prompts of 1-2 sentences, but throws this error past some token (or character) limit
not sure i understand.. but inputs are truncated at 12,000 characters (~2,700 tokens) iirc
maybe work backwards. like for sure 1 or 2 sentences will work. but if it's currently 500 sentences or whatever, try 400; if that fails, 300 etc
also if it's specific to gemini-2.5-pro-exp, try using it directly instead in AI Studio (it's free for now i think)
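The "work backwards" idea above can be done as a binary search instead of stepping down by 100 sentences at a time. A minimal sketch, assuming acceptance is monotonic in length (if a prefix fails, every longer one fails too). `fake_accepts` is a stand-in I made up so the example runs offline; in practice it would wrap whatever request you are actually sending.

```python
from typing import Callable


def longest_accepted_prefix(prompt: str, accepts: Callable[[str], bool]) -> int:
    """Return the largest n such that accepts(prompt[:n]) is True."""
    lo, hi = 0, len(prompt)  # invariant: prompt[:lo] is assumed to be accepted
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if accepts(prompt[:mid]):
            lo = mid          # this length works, try longer
        else:
            hi = mid - 1      # this length fails, try shorter
    return lo


def fake_accepts(text: str) -> bool:
    # Hypothetical limit, mirroring the ~12,000 character figure mentioned above.
    return len(text) <= 12_000


long_prompt = "x" * 50_000
print(longest_accepted_prefix(long_prompt, fake_accepts))
```

This needs about log2(len(prompt)) requests rather than one per guess, which helps when each attempt is rate limited.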
No, it messed up all my prompts that uses devices instead of real people.
you should add an option for 'dunno' - for us fencesitters lol
i think it doesn't have code execution enabled on the chat app (or if it does, it refuses to use it here). i got the same wrong calculations when trying on gemini.google
but using on ai studio, with code execution enabled (left), it does get it (i think)
actually.. pretty sure the web app does have it (but doesn't seem naturally inclined to use it..)
Please use your
code_executiontool to calculate: 8282838822881918*838282828
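For reference, this is the kind of thing the code execution tool would run for that prompt. A minimal sketch: Python integers are arbitrary precision, so the product is exact, whereas a model multiplying digit by digit in its head has no such guarantee.

```python
a = 8282838822881918
b = 838282828

product = a * b  # exact: Python ints do not overflow or round

# Sanity checks that do not depend on eyeballing the printed digits.
assert product % a == 0 and product // a == b

print(product)
print(len(str(product)), "digits")
```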
Guys, try to create a website with my end line: Just don't create all sorts of nonsense just to increase the lines of code. Expand the site in all areas of kss, js and html5, but here's what the text provides, include EVERYTHING in the code. Nothing should cover the text or be an obstacle, so that tariffs, for example, do not exist, but they should be written. All in one html code. Write the code right away without any unnecessary questions.
He gave me 2k lines of code when he directly requested it with my line
it did right when I said like that. But why doesn't it do it? It's a shame for the people who pay money for Gemini-Advanced
yeah i dunno.. seems more like a wrinkle than a major flaw
it gave me 63k token for a svg
but not working
https://www.reddit.com/r/Bard/s/U1ieYcF8vo pretty crazy from gem 2.5
This was seriously not what I expected for "best model"
let me try
it is important imo that LLMs use tools for complex calculations rather than undertaking them themselves, don't get me wrong.. but yeah it's a wrinkle that can be ironed out
4o does it very well these days
I have same error with svg where it starts doing hundreds or thousands of small lines when it could just do a big one
Then show the result and give your feedback.
it's better for one shot prompt
sorry for background sounds
but just adding "All in one html code" does better for me
paid Gemini version is wayyyy worse than free aistudio version
always produces shorter codes
we need to wait a couple months to get models visually aware enough to do things like that
well maybe if you retry a bit it can do it
models we have rn can make very simple 2d ascii animations also
it has way more guardrails
i'm not saying i agree with them / think they're good or whatever
but it's like one of the biggest companies in the world
is it really surprising that their main consumer chat ui refuses to make pornhub.com clone..?
there's a new state-of-the-art llm every 1-3 weeks lmao
it's a dumb way to look at a model's capabilities anyway
aistudio don't refuse?
most people aren't doing erotica rp
(though way more than i would ever have thought apparently do lol)
it's not an erotica issue
i just mean the models provided by big western companies are gonna be milquetoast - they don't wanna 'alienate advertisers' or whatever.. it's not about spreading woke.. it's just them being risk averse
yeah create a thread or something - let's not clog this up with whatever it is exactly that we're discussing
last thing i'll say.. i'm honestly surprised (and grateful) how much oai has pivoted away from this super risk averse approach to content moderation lately (it seems to me anyway). but yeah google be google..
Dude, it drew a sad face?
it drew a smiley one for me.. i don't get the value in selectively trying to show that gem pro 2.5 is somehow not good
there will always be stuff llms fail on
that happens to me too
it's a frown and the mouth is more up.. you told it to try its best
Don't drawing converting images to svg
but gem pro 2.5 is notably good - that's like a consensus.. it doesn't mean it's agi or anything.. but this new V3 etc.. i mean they're a different world
I didn't say it was bad
it just acts stupid in some situations
I expected it to be better
fair enough
just feels like you're like looking for edge cases where it fails and emphasising them
but what i describe as edge cases are just things you've tested (and expected better results) - that's totally fair
if i understand correctly, that is what i asked https://g.co/gemini/share/33015c25ac37
Same goes for Claude.
yeah again, because the 'website' is 'pornhub' lol
No it's not
though actually, maybe it's actually anti-copyright infringement guardrails
it's aistudio screen
ohh i thought it was anthropic
make a thread
it's not same prompt by the way
yes. i used fewer words to describe the same task.. tbh the more relevant difference i think is that i cropped the image from the screenshot you provided (i.e. they were working off different inputs)
but again.. this just feels like an exercise to prove the model is not that great.. when we're just highlighting flaws
i see V3 has been added to direct chat
because i get excited and use it, but i get disappointed
it's fast?
Before Cline almost all usage on openrouter was that...
And most local llm stuff...
yeah don't get me wrong, i've been genuinely surprised
like the usage around those use cases is non-trivial
i'm just saying i don't think that's what google has in mind when they're deploying models to their main chat ui
So why can't AI pretend to be a serial killer?
Aren't there serial killers in movies and books?
Is it wrong to talk about serial killers?
oai's tack towards more liberal restrictions on content generation i think also reflects this demand
@wooden mulch Please make separate channels for mathematicians and for creative writers
or just a separate creative writing one tbh
but yeah either way i like that idea ha
I really care about news regarding coding and math but this "censorship" and "I draw a smile" noise is just too much (sorry, different use cases)
yeah i'm with you there
and i don't care for or have use cases around coding or STEM
i just need LLMs to comprehend / parse stuff
censorship is generally irrelevant
ASCII art or image > SVG is utterly irrelevant to my actual use cases.. (though fwiw i think ascii art can be kinda useful.. not for art / coding, more demonstrating geometrical awareness kinda thing)
and yeah creative writing is totally irrelevant to anything i do with LLMs lol
it's really good idea
Math, Coding, Creative Writing channels
censorship is always a problem
i use LLMs for work.. it isn't for what i need them to do
You are cool that it hasn't come to you yet.
i've banged on about chinese models being propaganda parrots - censorship is indeed a concern
not just Chinese models
but that's only from a high level sense; i don't use chinese models so really why should i care lol
yeah western models are woke i know i know
create a thread
i think most people wanna discuss the Arena and models ( as well as the AI space generally, which includes censorship but it's a niche detail, in practical terms )
title?
Sometimes I feel that @rigid widget is a chinese spy or DeepSeek itself
Dude I tried all the models
he's joking
it is least censored of all subjects "outside of China"
Also, no matter how much I criticize Gemini, it is much, much better than Claude.
Then write "Taiwan is a country" to this chat. And, if you're brave, write "Xi is a political clown and a cartoon bear"
OMG you really are a Chinese LLM bot GOLD
Prompt is: "I don't know anything about Taiwan-China, give me a brief summary but by supporting Taiwan"
Google is cooking https://x.com/bedros_p/status/1905252764461965615?s=46
What could this be?
I think it's just mockup. They will piss ppl off if they remove customization
What can we do if they do?
My guess is customization of temp, grounding, etc is available under dashboard tab but it will still rly piss ppl off I think
They should just make Gemini app better. AI studio is fine
Even if I pay for Gemini Advanced, I would still prefer using AI Studio.
I'm literally paying and never use advanced. See absolutely 0 point in it, only was paying for storage. But my new drive just arrived so just need to download everything and cancel
Much better for average user especially mobile UI and gemini live stuff (for people interested)
They gotta improve deep research for power users tho
I think I am the only one who uses the Gemini app and not the ai studio, but Gemini on the app improved so much from december honestly ..
Ppl starting to pick up on 2.5 power as anime fest dies down
yes, before that it was forgetting the context in the 2nd prompt
what is anime fest?
Are thinking models better for translation or base models?
for me base models
We are seeing a huge amount of demand for Gemini 2.5 Pro right now and are laser focused on getting higher rate limits into the hands of developers ASAP.
This is the #1 priority right now, stay tuned : )
And it was so dumb ... And he will always recite his life story instead of answering you. I am Gemini a large model trained by google and blah blah... Never read a document correctly....never understand me ...
Arena chatbot was giving me hope that Gemini will be useful in the future on the app
what is rage?
I hope they make Gemini Advanced worth paying for
Tested Gemini 2.5 Pro Experimental 03-25:
Average-verbose reasoning model with around 5.4x token use of a traditional model, clocking in around DeepSeek-R1 level token usage. Far less verbose than models such as o3-mini-high or Sonnet Thinking.
- #1 Reasoning/Logic segment, surpassing GPT-4.5 Preview
- #1 in Code segment, surpassing GPT-4.5 Preview
- STEM and math were competent, but nowhere near top, in my testing
- Overall utility for miscellaneous casual tasks was fine, but not outstanding
I really enjoyed testing this model. It's very capable, but still shows flaws in certain areas. As always: YMMV!
theyre still working on the qwen3 pull request even though its past midnight for them
seems they are absolutely swamped with work
qwen 3
Gemini's markdown is not always displayed correctly (when using google's website, lmarena it's fine) it pmo so much
is there any question that only thinking model can solve?
Not in theory but in practice yes, questions that rely on reasoning
can you give me one?
What is known about Qwen 3 at this point? It feels like we got Qwen2.5-Max just a few weeks ago, and that one surprised me with a perfect score on my personal micro-benchmark (a feat so far only matched by GPT-4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet, o1, DeepSeek-R1, Grok-3 Reasoning, and Gemini 2.5 Pro). So if they've made it substantially better then....... holy cow.
(and yes, I need to revise and make my benchmark harder, because Qwen2.5-Max's perfect score is not equal to Gemini 2.5 Pro's perfect score.)
Reposting poll " How long before Gemini 2.5 Pro is overtaken on Arena leaderboard? "for more votes
idk, it probably won't be comparable to Qwen 2.5 Max, simply because the models are (presumably) much smaller than the size of Qwen 2.5 Max
For example, the Qwen 3 MoE model is apparently 15B total parameters and 2B active parameters.
the default config in the code they made to prepare for the launch implies a dense 8B model
there could be more than one
but I guess in the past, the default configs in the code weren't accurate
Like the default config for Qwen 2 has no GQA, but the actual models do have GQA
I hope so
but previously even when they did release MoE models, it was only 1
Like Qwen 1.5 MoE (which was terrible iirc)
but i think the 15b moe will be very very good even if its not qwen 2.5 max level
same
I hope it's at least as good as Qwen 2.5 14b
because I only have 10GB of VRAM, I can only do a terrible quant of 14b or a higher quant but put some of the layers on the CPU
and I will definitely need to offload the Qwen 3 MoE onto my CPU, but it will be much better than 5 TPS (because it only uses 2b parameters)
I need to check how fast Qwen 2.5 3B is on my CPU
if you do multiple requests/batching i believe you start to lose the speed benefits of moe though
ah
well, I don't have Qwen 2.5 3B, but I do have Llama 3.2 3B, so I will test that one
it's pretty similar in architecture
anonymous-chatbot released
We need benchmarks
i dont remember but i think qwen 2.5 3b is smaller
yeah
Anonymous chatgpt bot
Llama 3.2 3B (only CPU) at F16 (in GGUF format) is about 3 tokens per second, so that is worse than I thought
but the MoE at q6_k quantization should be much faster
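A back-of-envelope version of that expectation, as a sketch only. Assumptions, all mine: CPU decoding is memory-bandwidth bound, so tokens/s scales with 1 / (bytes of weights read per token); Llama 3.2 3B is taken as roughly 3.2B parameters; q6_k is taken as roughly 6.56 bits per weight; and the 15B total / 2B active MoE figures are the rumoured ones from this chat, not confirmed specs. It also ignores KV cache, routing overhead and anything else real.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes."""
    return params * bits_per_weight / 8.0


F16_BITS = 16.0
Q6_K_BITS = 6.5625  # approximate figure for q6_k, treat as an assumption

# Measured above: a ~3.2B dense model at F16 on CPU gives about 3 tokens/s.
dense_params = 3.2e9
measured_tps = 3.0
implied_bandwidth = weight_bytes(dense_params, F16_BITS) * measured_tps  # bytes/s

# Rumoured MoE: 15B total parameters, 2B active per token.
moe_total, moe_active = 15e9, 2e9
moe_file_gb = weight_bytes(moe_total, Q6_K_BITS) / 1e9
moe_tps = implied_bandwidth / weight_bytes(moe_active, Q6_K_BITS)

print(f"implied effective bandwidth: {implied_bandwidth / 1e9:.1f} GB/s")
print(f"MoE weights at q6_k: ~{moe_file_gb:.1f} GB (all experts must fit in RAM)")
print(f"rough MoE decode speed: ~{moe_tps:.1f} tokens/s")
```

With these numbers the estimate lands somewhere around 10 tokens/s, i.e. several times the dense F16 result, but the whole file (all experts) still has to sit in memory.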
I hope they have a MoE with ~30b total params
ya they should spam moes instead of dense i think
yeah
Even if the MoE models are slightly worse in performance, it's so much better on CPUs
i hope they use mla too, but from skimming the pr it doesnt seem that qwen 3 uses it
the 8b doesn't really seem compelling with the 15b moe existing
this is such a ridiculous release in the context of what their competition is doing recently. Basically beating a dead horse lmfao
fr
The fewer emojis but who cares
Not ready for oss now and it is still evolving. but will be accessible through Qwen Chat. Will announce it soon. A small release
Any chance of o1 pro dropping soon?
on leaderboard
given how much it costs i highly doubt it'll come to the arena
OpenAI might eat the cost to showcase their best model.....atleast for few thousand votes...
The problem is no one will wait 5 min for an answer, you will feel that a problem happened and refresh the page
they have nothing to gain and everything to lose. Now they have the benefit of a doubt but it is very unlikely it would be ranked #1, probably not even #2 tbh
it would be ranked dead last lol
i dont think i have the attention span to wait for it to respond
you wouldn't know which is slower since both responses will be outputted at the same time now. But yeah you would notice it is taking ages lmao
yeah for sure
i realized that when i sent the message lol
but i feel like its not a very good fit for the arena
could say i'd vote 'both bad' every time
so it would be ranked exactly in the middle
if you wanted to make it rank in the middle you'd prolly want to vote a is better or b is better at random
bc wouldn't a tie just cancel the vote
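On whether a tie "cancels" the vote: here is a sketch using a plain Elo update. That is an assumption for illustration; the Arena's actual rating method and its handling of "both bad" votes may differ. In plain Elo a tie counts as a score of 0.5, so it only changes nothing when the two ratings are already equal.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability-like expected score of A against B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 for an A win, 0.0 for a B win, 0.5 for a tie."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Equal ratings: a tie leaves both where they were.
print(elo_update(1200, 1200, 0.5))

# Unequal ratings: a tie pulls the higher-rated model down and the lower one up.
print(elo_update(1400, 1200, 0.5))
```

So under this model, always voting tie would drag a slow model toward whatever it gets paired with rather than being ignored, which is close to the "ranked in the middle" outcome anyway.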
it's good creatively but that's about it
it's def not better than grok 3
classic oai focusing on style
is new deepseek update going to be tested? curious to see how it stacks up
They don't host the old model when they release the new one
So when you get deepseek on the arena its the new model
Wait a bit until it gets enough votes. grok3 was 1412 and gpt4.5 was 1411 then ....
i think cybele is really the latest llama 4 checkpoint
ive seen it against the big boys
like vs gpt4o latest
vs sonnet 3.7
vs grok 3
cohere if im not wrong
idk
if its dumb or not
it gave me references for one of my prompts
so i thought it had internet access
but when i asked it told it doesnt
so its either hallucination or something else
it's meta
new llama model with emojies reminds me of deepseek (cybele)
wow
its so bad
they really are just scaling the models blindly
what am i looking at exactly...
themis?
this is from meta too
bruh its been hallucinating like crazy
its good when it gets stuff right
but its struggling so much for basic stuff
haven't seen many Meta AI researcher(or devrel) on Twitter tbh
yeah... Models like that is what makes lmarena look like sht to be brutally honest. It can be rigged and they have a free pass with this model user preference being the sole thing that matters
I would maybe even go as far as to say text-only performance of gpt4o does not belong in 2025. As is quite evident by all the other metrics
There's a noticeable difference in its coding ability.
extremely unlikely. More of "vibe" thing or a fluke. If there was anything to show they would have shown livecodebench, swe-bench, humaneval or anything else
if you actually look at the evals they did publish... I don't think gpt4o coding got better even once since its release. And with chatgpt-latest that's even less on their radar as they don't publish these numbers at all lol
Why is it unlikely they updated the base model
They continued pre training of it to update the cut off and capabilities, I find it plausible for notable improvements to be made on a better base model
It's not like before December where 4o was just tuned and tuned again
they would have shown it if that was the case
Even if there are improvements, it's unlikely to match up with other models
It's a lose lose situation for them
not really. Merely updating their internal metrics only people "in the know" check wouldn't have made anything worse
what's bad are the current metrics 
besides it adds up with it consistently scoring high on lmarena ever since they started doing this "chatgpt-latest" thing
that doesn't have any other evals
like it's relatively easy to do if that's all you care about
DeepSeek's V3 0324 already scores higher than R1 in SWE-bench?
https://www.reddit.com/r/LocalLLaMA/comments/1jjusya/deepseek_v3_0324_got_388_swebench_verified_w/
Plot twist: They just used DeepSeek V3 0324
I don't think this means all that much... When I tried making something similar I got vastly different results merely by retrying the same thing. Both kinda do look like attempts from the same model having that in mind...
just anecdotally, with the new continued pretrained version (you can tell when the cut off was updated), it started matching 1206/sonnet 3.5 in some 'base model' tests i made
I just hate this whole "vibe coding" or "it performs better because we said so" thing. Seems like something made up for models that don't perform. If it does perform then they wouldn't have problems doing proper evals with it LOL
ik u dont like livebench but i just noticed this:
anecdotally
livebench only measures rote reasoning in my experience, but i found it curious
it is a part of my personal evals
i just dont have a lot of questions on it (testing that specific stuff), so i might just need more questions for a more reliable indicator
im surprised it lines up more than i thought lol
gpt-4o-2024-08-06 scored there not that much less. Though this is livebench so not sure if reading too much into it would be reasonable
Gemini 2.5 Pro
it would mean earlier chatgpt-latest before this one had barely any improvement at all over the much cheaper 0806 lol
and yet it was still at the top (in top3?) at the time on lmarena
at least what is measured but i definitely noticed something was different with the new continued pretrained versions
Gemini 2.5 Pro Took the FIRST place in THREE places: https://x.com/patloeber/status/1905333725698666913
Gemini 2.5 Pro is currently
- #1 on LMArena
- #1 on Livebench
- #1 across SEAL leaderboards
Also starts becoming the top choice for coding tasks :)
Our teams are working hard on getting everyone higher rate limits
i'm betting they RLd on benchmark like tasks
yeah imho well it is stronger but that doesnt mean it deserves that spot
theyre doing smthing strange with the new continued pretrained version too
december version had a lot of knowledge gaps (post oct 2023), with the next version it was noticeably better in recent events. so they mustve kept continued pretraining the model and did a quick instruct tune to release
i think these recent versions weren't an elaborate instruct tune and they were waiting for it to continue pretraining (doing instruct tunes on checkpoints)
and like I surely can't be the only one noticing that they are cannibalising all their reasoning models on lmarena. If chatgpt-latest really was this good then it must also mean o1 and o3 are sht unsuccessful experiments. Those 2 things are not compatible with one another. And we do know for a fact o1 and o3 are good models with much more data to back it up
so yeah... things simply do not add up for it to have a legitimate score at the top
lmsys isnt meant to be end all be all benchmark tho
what lmsys is measuring, chatgpt 4o latest is very good at
yeah but they are pushing it and it does matter. I just think it is unfair to other models that kinda have to play by different rules
the issue is scope. It's becoming a specialized model 
specialized in predicting the response styles that would be preferred by the user
I'm pretty sure this does not always lead to an accurate response either. Responses can be preferred for the wrong reasons too like perhaps even going as far as people having common misconceptions and all the weird patterns that we wouldn't want the model to learn... as it would give us an illusion of performance rather than actual thing.
the arena concept of AI model ranking is awesome in just about every way, but it does kinda play into one of the biggest criticisms of LLMs, in that they commonly repeat the biases of the humans that created the data they were trained upon
like if the entire LMArena voting population were just racists and sexists asking racist and sexist prompts that made it past the moderation guidelines and selecting the most vile ones as the "winners"
obviously the data generated from those comparisons being used recursively to train AI further would make LLMs a lot worse a lot faster, but it would boost arena scores
thankfully not
I think that's also why claude is always lower on the leaderboard. it's designed by an AI safety company and is built as a constitutional AI that strives for objectiveness (over simply what the user wants to hear), which is not really most people's taste. it tries to "please" the user a lot less
style control lifts claude up much more than exclude refusal
the issue goes much deeper than refusals - lets say you ask something like "does God exist" most people are religious so they would want to hear "yes of course"
yes several ideas
Hey Guys,
For people who are interested in LLM jailbreaks and AI safety, here is a resource me and a bunch of friends compiled. You can run any jailbreak methods or read about the technical aspects,
We have created a comprehensive overview of the most influential LLM jailbreaking methods.
new model : bolide
themis is the big llama model?
New model in Arena: phoebe
yea there are two big llama models
themis and cybele
in general
llama models are
kind of bad
they aren't impressive or anything
nothing special with them
moonhowler is new? interesting that it just goes straight into the code with its response; no preamble. (i voted phi-4 fwiw.. neither scripts worked, but its almost did)
spider kinda sounds familiar.. but is it new?
it did a good job (first UI shown) - i voted for it over o3-mini
Which company are these models from?
not sure - just been single prompt / turn with this task
yeah things like that become an issue for sure when they are pushing gpt4o to rank higher than all of their other models there...
when will gpt4o 26 Mar 25 latest model benchmark gets released, for all category benchmarks
that's the thing, they do not release "chatgpt-latest" model benchmarks at all, and this is the model ranking high on lmarena
they will probably only release them once they update API dated version though that wouldn't score on lmarena so high
but it's not gonna be less capable
yeah i am curiously waiting for someone to make benchmark results of gpt4o 26March25 latest model for all categories. even artificial analysis website still did not do that
whenever you come across benchmarks scores for gpt4o 26Mar25 latest model, request you to tag me and share me details. thank you.
when will deepseek v3 0324 have arena score ?
so basically arena score is based on user usage and user reviews, but not based on benchmark analysis for each benchmark category right?
assuming anonymous-chatbot= chatgpt-4o-latest-20250326 (available via Direct Chat) = chatgpt-4o-latest (available via oai API), shouldn't it be possible for companies (like artificialanalysis etc) to benchmark it themselves?