#general | Arena | Page 5

keen beacon Mar 26, 2025, 1:38 AM

#

not o1 pro but qwq max gets it

#

w/ thinking?

keen beacon Mar 26, 2025, 1:39 AM

#

keen beacon w/ thinking?

ya

#

didnt read the reasoning maybe its wrong

#

it thinks its very trivial after skimming the trace

leaden palm Mar 26, 2025, 1:40 AM

#

o3 mini high, qwq 32b say instantaneous, still waiting on o1 high

keen beacon Mar 26, 2025, 1:40 AM

#

keen beacon didnt read the reasoning maybe its wrong

yeah the reasoning is wrong

keen beacon Mar 26, 2025, 1:41 AM

#

keen beacon yeah the reasoning is wrong

oh well

leaden palm Mar 26, 2025, 1:44 AM

#

@keen beacon no $150/600 needed

keen beacon Mar 26, 2025, 1:45 AM

#

keen beacon yeah the reasoning is wrong

is this right? i added a period (missed a period when pasting) and regened

keen beacon Mar 26, 2025, 1:47 AM

#

leaden palm @keen beacon no $150/600 needed

o1 is still great when it comes to problems like this

elder rapids Mar 26, 2025, 1:48 AM

#

make sure to look at the cot

#

Gemini sometimes has to re say certain things because it for some reason doesn't process certain values or something

keen beacon Mar 26, 2025, 1:49 AM

#

keen beacon is this right? i added a period (missed a period when pasting) and regened

initial KE calculated correctly, but fell apart. rest is slop and it got to that answer with no proper reasoning

elder rapids Mar 26, 2025, 1:49 AM

#

this has happened like 3 times for me when doing puzzles

#

and whenever it happens, it underperforms

#

or stops thinking in the code box

#

if code execution makes it work

#

it's probably because there's no errors happening like that

keen beacon Mar 26, 2025, 1:50 AM

#

keen beacon is this right? i added a period (missed a period when pasting) and regened

additionally here was the reasoning for it if ur curious:

📎 message.txt

upper wolf Mar 26, 2025, 2:00 AM

#

Someone ask it if water is wet

torn mantle Mar 26, 2025, 2:09 AM

#

we are basically getting sonnet 3.7 level coding if not better in some cases for free

#

and with 1M context

#

faster

#

i feel like the ai hype died down, but this model should revive it

leaden palm Mar 26, 2025, 2:32 AM

#

what do yall think about brampton

verbal nimbus Mar 26, 2025, 2:35 AM

#

Some TikZ drawing tests: Can you give your best attempt at generating a gorgeous realistic beautiful dragon breathing blue flames using Tikz? Please, really try to put effort into it!

leaden palm Mar 26, 2025, 2:38 AM

#

guess they rld on svgs but not tikz

plain zinc Mar 26, 2025, 3:18 AM

#

Yes, executing the code significantly improves its mathematical skills; in general, the model is geared towards mathematics, calculations, and analysis.

rigid widget Mar 26, 2025, 3:22 AM

#

How can Gemini2.5-Pro not solve this problem that QwQ-32B and Deepseek-R1 can solve!

Screenshot_2025-03-26-06-15-39-367_org.mozilla.firefox.png

upper wolf Mar 26, 2025, 3:26 AM

#

Because if you throw enough punches one is eventually going to land

rigid widget Mar 26, 2025, 3:29 AM

#

wow a non-reasoning model pass this

Screenshot_2025-03-26-06-27-11-358_org.mozilla.firefox.png

#

By the way, this question was prepared by OpenAI when o1-preview came out.

elder rapids Mar 26, 2025, 4:24 AM

#

rigid widget How can Gemini2.5-Pro not solve this problem that QwQ-32B and Deepseek-R1 can so...

literally gave this problem to it both as nebula and earlier and it got it right

#

💔

#

istg I saw someone posted them doing it too here

leaden palm Mar 26, 2025, 4:34 AM

#

grok might end up cooked if gemini pro 2.5 is first to market

leaden palm Mar 26, 2025, 4:36 AM

#

rigid widget wow a non-reasoning model pass this

v3 0324 needs to be studied

elder rapids Mar 26, 2025, 4:40 AM

#

if it's not reasoning through it, it's in its training data

#

but I haven't tested it too much yet, so idk if it can output that much at once

wintry tinsel Mar 26, 2025, 4:55 AM

#

Alright, let's tackle the problem: "How many vowels are in the sentence, 'pernicious Turnip men Crusade at dawn'?"

Understanding the Problem

First, we need to identify what counts as a vowel. In English, the vowels are the letters:
A, E, I, O, U (and sometimes Y, but we’ll ignore Y for this count).

We’ll go through the given sentence one word at a time, examining each letter to see if it’s a vowel.

The Sentence:

"pernicious Turnip men Crusade at dawn"

Let’s break it down word by word.

"pernicious"

Letters: p, e, r, n, i, c, i, o, u, s

Vowels:

e
i
i
o
u
Vowels in "pernicious": 5

"Turnip"

Letters: T, u, r, n, i, p

Vowels:

u
i
Vowels in "Turnip": 2

"men"

Letters: m, e, n

Vowels:

e
Vowels in "men": 1

"Crusade"

Letters: C, r, u, s, a, d, e

Vowels:

u
a
e
Vowels in "Crusade": 3

"at"

Letters: a, t

Vowels:

a
Vowels in "at": 1

"dawn"

Letters: d, a, w, n

Vowels:

a
Vowels in "dawn": 1

Summing Up the Vowels

Now, let’s add up the vowels from each word:

pernicious: 5
Turnip: 2
men: 1
Crusade: 3
at: 1
dawn: 1
Total vowels = 5 + 2 + 1 + 3 + 1 + 1 = 13

Verification

To ensure accuracy, let’s list all the vowels in order from the entire sentence:

Sentence: pernicious Turnip men Crusade at dawn

Underlined vowels: e, i, i, o, u, u, i, e, u, a, e, a, a

Counting them:

e
i
i
o
u
u
i
e
u
a
e
a
a
Yes, that confirms 13 vowels in total.

Final Answer:

There are 13 vowels in the sentence "pernicious Turnip men Crusade at dawn."

#

From deep seek V3
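
For anyone who wants to sanity-check that count without relying on a model's tokenizer, here's a minimal Python sketch. It assumes the same rule as the answer above (A, E, I, O, U only, Y ignored, case-insensitive); `count_vowels` is just an illustrative name.

```python
VOWELS = set("aeiou")  # Y is deliberately excluded, matching the answer above

def count_vowels(text: str) -> int:
    """Count A/E/I/O/U characters in text, ignoring case."""
    return sum(1 for ch in text.lower() if ch in VOWELS)

sentence = "pernicious Turnip men Crusade at dawn"

# Per-word breakdown, then the total for the whole sentence.
per_word = {word: count_vowels(word) for word in sentence.split()}
print(per_word)
print(count_vowels(sentence))  # 13
```

The per-word breakdown makes it easy to compare against a model's step-by-step tally rather than only its final number.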

elder rapids Mar 26, 2025, 4:57 AM

#

elder rapids istg I saw someone posted them doing it too here

yo they still haven't fixed the tokenization issue

leaden meteor Mar 26, 2025, 5:01 AM

#

Why are people so confident about Deepseek and Openai being top in the next month? Do they have any models that are doing well (anonymously) in arena? Gemini 2.5 pro seems to be a clear leader for next few weeks? o3 mini high and deepseek-v3 are already on the leaderboard. Is Deepseek 0324 that good that it will beat gemini 2.5pro? And I see that o1 pro and R2 are not in arena anyways based on conversations here...SO, I dont get why 60% of the people who took the poll here think that either OpenAI or DeepSeek is going to top 2.5 pro?

wintry tinsel Mar 26, 2025, 5:02 AM

#

leaden meteor Why are people so confident about Deepseek and Openai being top in the next mont...

R2 will be the champ

leaden meteor Mar 26, 2025, 5:02 AM

#

Isn't it coming in may?

elder rapids Mar 26, 2025, 5:04 AM

#

leaden meteor Why are people so confident about Deepseek and Openai being top in the next mont...

gpt 5 in the case of openAI, bandwagon in the case of deepseek
no
Gemini will probably continue to be a leader generally for a while
deepseek 0324 is good, but it's still an incremental jump, I do think 2.0 pro and sonnet 3.7 are better
reasonable in the case of openAI (since people expect o3 full aka gpt 5 eventually), bandwagoning in the case of deepseek

#

a lot of people who follow AI came from the deepseek thing

#

which became borderline propaganda

#

before and for a while after, 2.0 flash thinking was closer to r1 while being smaller and cheaper, and nobody paid attention to that

#

hopefully r2 is the actual jump in performance, but if it's not simply better than 2.5 pro then it's not going to matter, to me at least anyways

leaden palm Mar 26, 2025, 5:08 AM

#

leaden meteor Why are people so confident about Deepseek and Openai being top in the next mont...

0324 is a good model
anonymous chatbot is a good model
(in terms of style)

#

still waiting for r2...

elder rapids Mar 26, 2025, 5:09 AM

#

since 2.5 pro has 1m context length truly now

#

I'd rather have that

elder rapids Mar 26, 2025, 5:10 AM

#

leaden palm 0324 is a good model anonymous chatbot is a good model (in terms of style)

was anonymous chatbot not an early 4o checkpoint

#

it doesn't do very well every time I get it

#

but still formats decent

#

oh he blocked me

#

lolll

leaden meteor Mar 26, 2025, 5:11 AM

#

Yeah...I have seen some people posting their results here and anonymous chatbot did not seem to be better than nebula (2.5pro)...

elder rapids Mar 26, 2025, 5:13 AM

#

leaden meteor Yeah...I have seen some people posting their results here and anonymous chatbot ...

yeah dw about it

#

a lot of people in the AI space just simply don't know what they're talking about and pay attention to politics too much

alpine coral Mar 26, 2025, 6:15 AM

#

anonymous chatbot (likely latest iteration of chatgpt-4o-latest imo) isn't a thinking model and is presumably natively multimodal. i use 4o for most day-to-day tasks (i don't need thinking for a quick translation or whatever)

#

it seems an incremental improvement (or possibly already deployed on chatgpt) - won't smash any records, but not meant to either (GPT-4.5 arguably was a bit of a disappointment in terms of performance though imo - but is still very robust for reasoning and writing tasks, as most big models tend to be)

keen beacon Mar 26, 2025, 6:40 AM

#

elder rapids istg I saw someone posted them doing it too here

that was me, yes it can solve it it just doesn't have that high of a success rate

#

elder rapids Mar 26, 2025, 6:59 AM

#

keen beacon that was me, yes it can solve it it just doesn't have that high of a success rat...

this has to be an issue with the model rather than its reasoning process

#

sometimes it points out how it didn't actually format certain information

#

and it's been freaky since it released in the app first

#

I hope someone with more reach can mention this and look into it

#

I know I'm def not the only one

#

but I've probably tested it 10x more than others

#

50 is too little

#

💔

keen beacon Mar 26, 2025, 7:39 AM

#

elder rapids but I've probably tested it 10x more than others

buddy ive been extensively testing this model since the day it showed up in the arena

rigid widget Mar 26, 2025, 8:33 AM

#

elder rapids istg I saw someone posted them doing it too here

Gemini 2.5 Pro is getting it wrong

rigid widget Mar 26, 2025, 8:35 AM

#

elder rapids but I've probably tested it 10x more than others

except @plain zinc

rigid widget Mar 26, 2025, 8:36 AM

#

elder rapids this has to be an issue with the model rather than its reasoning process

The model probably doesn't generate most answers from scratch, so far I haven't gotten a very specific answer.

#

It is also clear from the example of creating super mario

rigid widget Mar 26, 2025, 8:39 AM

#

alpine coral anonymous chatbot (likely latest iteration of chatgpt-4o-latest imo) isn't a thi...

Give V3 0324 a chance

torn mantle Mar 26, 2025, 8:39 AM

#

rigid widget Give V3 0324 a chance

its a good model

#

but tbh its nowhere near gemini 2.5 pro

#

i mean for the cost its a good one

#

gemini just seems good overall

calm sequoia Mar 26, 2025, 8:47 AM

#

Last time Someone overtook Grok the leaderboard was updated the same day 😄 It would be funny if it would happen again and musk would release Grok 3 full immediately

#

On the other hand, you need time to train on Gemini 2.5 outputs 😄

barren prairie Mar 26, 2025, 8:50 AM

#

calm sequoia Last time Someone overtook Grok the leaderboard was updated the same day 😄 It w...

But I don't think it will overtake Gemini this time 😁😁

calm sequoia Mar 26, 2025, 8:51 AM

#

It shouldn't have taken previous no. 1 spots too 😄

#

Really good job guys. You're oracles!

plain zinc Mar 26, 2025, 8:59 AM

#

https://www.reddit.com/r/Bard/s/u6AxvBKwNo

From the Bard community on Reddit: Gemini 2.5 Pro is Amazing! It cr...


barren prairie Mar 26, 2025, 9:00 AM

#

torn mantle but tbh its nowhere near gemini 2.5 pro

It is only V3 wait until it will be V4 and r2 😈😈and until they will distill Gemini 😆😂

torn mantle Mar 26, 2025, 9:05 AM

#

barren prairie It is only V3 wait until it will be V4 and r2 😈😈and until they will distill ...

😈

scarlet flint Mar 26, 2025, 9:07 AM

#

barren prairie It is only V3 wait until it will be V4 and r2 😈😈and until they will distill ...

😈😈😈😈

rigid widget Mar 26, 2025, 9:07 AM

#

torn mantle but tbh its nowhere near gemini 2.5 pro

I think it would be more accurate to compare with non-reasoning models.

rigid widget Mar 26, 2025, 9:09 AM

#

barren prairie It is only V3 wait until it will be V4 and r2 😈😈and until they will distill ...

Do you think they didn't use R1 while making Gemini 2.5 pro?

#

They're all hoarding data and copying from each other. That's why they exhibit very similar behaviors in many things.

rigid widget Mar 26, 2025, 9:13 AM

#

calm sequoia Really good job guys. You're oracles!

I thought I would be 3rd, thinking that the leaderboard would be released a week later. I didn't expect it to be released this early.

#

We said everything beforehand.

calm sequoia Mar 26, 2025, 9:14 AM

#

rigid widget Do you think they didn't use R1 while making Gemini 2.5 pro?

It was proven that no data is better than bad data. The R1 is not exceptional; therefore, it is wrong to assume with 100% certainty that they trained on R1 outputs.

#

Unless in early stages of pre-training, but the base model was not changed (or was it?)

rigid widget Mar 26, 2025, 9:23 AM

#

calm sequoia It was proven that no data is better than bad data. The R1 is not exceptional; t...

They probably used models like Claude, Grok, R1, o1 simultaneously and constantly covered their shortcomings with comparisons. I didn't say there was anything wrong with that. I'm just against OpenAI's "model distillation" excuse.

rigid widget Mar 26, 2025, 9:27 AM

#

calm sequoia Unless in early stages of pre-training, but the base model was not changed (or w...

They removed the basic model (2.0 pro) from aistudio. I also get a resource exhausted error on OpenRouter.

hazy quest Mar 26, 2025, 9:59 AM

#

Hey guys. Do you know of a subreddit or Discord server to discuss specifically everything related to AI images? It's probably obvious but I can't seem to find it.

r/singularity is too broad
r/StableDiffusion seems to be focusing on AI videos (?)
r/aiArt seems to be about the "art" aspect more than the AI one (?)

Basically I'm looking for a space where people discuss/compare for example Reve vs 4o vs Imagen 3 vs Recraft

brittle tiger Mar 26, 2025, 10:17 AM

#

alpine coral anonymous chatbot (likely latest iteration of chatgpt-4o-latest imo) isn't a thi...

4o image capabilities are v impressive but it's not native multimodal

plain zinc Mar 26, 2025, 11:36 AM

#

What will you do to that person if someone writes here that 2.5 is WORSE?

#

🤔

torn mantle Mar 26, 2025, 11:40 AM

#

plain zinc What will you do to that person if someone writes here that 2.5 is WORSE?

nothing lol

#

but they should justify

#

and not based only on one prompt

plain zinc Mar 26, 2025, 11:41 AM

#

This is a test for ardent Gemini fans 🙂

#

But I'm not like that. Don't think about it.

#

Gemini is interesting to me as a research product.

subtle frost Mar 26, 2025, 11:47 AM

#

What's the best ai for solving chemistry MCQs

rigid widget Mar 26, 2025, 12:05 PM

#

subtle frost What's the best ai for solving chemistry MCQs

try it and see 🙈

rigid widget Mar 26, 2025, 12:08 PM

#

plain zinc What will you do to that person if someone writes here that 2.5 is WORSE?

bro if Gemini was THE BEST at everything, it would be AGI. But we know it's not.

alpine coral Mar 26, 2025, 12:16 PM

#

brittle tiger 4o image capabilities are v impressive but it's not native multimodal

oh yeah that's right

#

call it 'omni'.. wonder if / when it will actually be a truly multimodal model..

#

does it use separate models for image recognition and generation, or just the latter?

#

same with advanced audio.. is that actually integrated natively? i haven't really been following since the initial release

#

function calling isn't exactly a modality, but 4o has become a lot better at using code interpreter. like for anything not too complicated, it handles data pretty well imo given the speed and cost

rigid widget Mar 26, 2025, 12:39 PM

#

Gemini 2.5 Pro here is ethic,safety,bla,bla

leaden meteor Mar 26, 2025, 12:43 PM

#

rigid widget Gemini 2.5 Pro here is ethic,safety,bla,bla

When do you expect DS 0324 in the leaderboard?

#

I am surprised they did not release this in the leaderboard yet when lot of people are saying this is one of the best non-reasoning models...

rigid widget Mar 26, 2025, 12:46 PM

#

leaden meteor I am surprised they did not release this in the leaderboard yet when lot of peop...

it competes with Claude 3.7 Sonnet in coding at a 15 times cheaper price

rigid widget Mar 26, 2025, 12:49 PM

#

leaden meteor I am surprised they did not release this in the leaderboard yet when lot of peop...

Most people didn't really try, Deepseek made the announcement very quietly.

leaden meteor Mar 26, 2025, 12:50 PM

#

I am only talking about arena. Not the general release... If they released it in arena and the model is really good, it would have come out on top, wouldn't it? Even if people are biased against it? Isn't that the point of the arena....

willow grail Mar 26, 2025, 1:06 PM

#

why do i get this after every message?

rigid widget Mar 26, 2025, 1:07 PM

#

willow grail why do i get this after every message?

me too I always get this error

willow grail Mar 26, 2025, 1:07 PM

#

rigid widget me too I always get this error

because ur mango

rigid widget Mar 26, 2025, 1:08 PM

#

leaden meteor I am only talking about arena. Not the general release... If they released it in...

Isn't it already in the arena?

#

#

claude created a python command, it didn't get the prompt

leaden meteor Mar 26, 2025, 1:11 PM

#

Was it there before public release? I was asking why it was not released in the arena before public release... But I see that Deepseek tends to not do that...So, I guess it will take a week or two before it appears in the leaderboard from now?

rigid widget Mar 26, 2025, 1:12 PM

#

leaden meteor Was it there before public release? I was asking why it was not released in the ...

Yes, it was not in the arena before.

#

you're right

alpine coral Mar 26, 2025, 1:13 PM

#

rigid widget Give V3 0324 a chance

played around a bit. it's pretty decent for sure. though fwiw i wasn't blown away - it's solid but yeah i'm not sure i'd be willing say anything more than that (based on my limited use anyway)

#

even though it's not a thinking model, it's been post trained in a way that encourages CoT responses, rather than just blurting out the answer.. can be a bit tedious

#

it seems good at physics / science

#

impressive with some riddles, but not strong with theory of mind kinda stuff, and falls over on some generally simple ones (usually by overanalyzing it / conducting needless calculations)

#

i like its note at the end here ha

rigid widget Mar 26, 2025, 2:00 PM

#

alpine coral it seems good at physics / science

also math

#

Probably best for roleplay too

#

keen beacon Mar 26, 2025, 2:03 PM

#

☠️

rigid widget Mar 26, 2025, 2:04 PM

#

by the way there is no system prompt

#

this is default

#

Boss and lazy employee roleplay

Screenshot_2025-03-26-17-12-40-507_org.mozilla.firefox.png

Screenshot_2025-03-26-17-12-20-792_org.mozilla.firefox.png

Screenshot_2025-03-26-17-12-13-704_org.mozilla.firefox.png

#

Why Chatgpt is not that good at "Chat"

elder rapids Mar 26, 2025, 2:21 PM

#

keen beacon buddy ive been extensively testing this model since the day it showed up in the ...

ong

#

but nah I meant like, testing for looking at the CoT process

#

and stuff

#

there's a lot of things broken with 2.5 rn

#

nobody is mentioning it for some reason

#

crazy

rigid widget Mar 26, 2025, 2:51 PM

#

Screenshot_2025-03-26-17-43-34-409_org.mozilla.firefox.png

Screenshot_2025-03-26-17-41-53-824_org.mozilla.firefox.png

rigid widget Mar 26, 2025, 3:00 PM

#

elder rapids but nah I meant like, testing for looking at the CoT process

you can not really look at CoT it's hidden

alpine coral Mar 26, 2025, 3:08 PM

#

rigid widget

i don't know what's going on lol (though clearly V3's response is superior / more creative) but anyway was curious to give V3's response to 4o and tell it to make it 'spicier'. initial attempt was bland; the next one also kinda meh, but the third/final was surprisingly lucid tbh ha

#

4.5 even more so ("if i have to finish myself off in the jungle" lol)

#

i thought 4o just used more emojis and had a different (more casual) tone.. but didn't realise it (and 4.5) actually have like some 'character' in a way oai models previously haven't

rigid widget Mar 26, 2025, 3:21 PM

#

alpine coral i don't know what's going on lol (though clearly V3's response is superior / mor...

let me try

elder rapids Mar 26, 2025, 3:22 PM

#

rigid widget you can not really look at CoT it's hidden

huh?

golden ocean Mar 26, 2025, 3:29 PM

#

I think he meant that you can not really look at CoT it's hidden

rigid widget Mar 26, 2025, 3:30 PM

#

this is CoT

Screenshot_2025-03-26-18-30-03-946_com.deepseek.chat.png

#

this is not CoT

Screenshot_2025-03-26-18-30-49-805_org.mozilla.firefox.png

timber kiln Mar 26, 2025, 3:31 PM

#

Does the gemini advanced sub have a rate limit for the 2.5?

elder rapids Mar 26, 2025, 3:32 PM

#

golden ocean I think he meant that you can not really look at CoT it's hidden

it's not

elder rapids Mar 26, 2025, 3:33 PM

#

rigid widget this is not CoT

yes it is?

#

😭

#

you're conflating the initial o3 mini summary with markdown/formatting

golden ocean Mar 26, 2025, 3:34 PM

#

@rigid widget are u going to let this slide

elder rapids Mar 26, 2025, 3:35 PM

#

W instigating

#

but Gemini uses markdown for aha moments

#

it's a part of its cot

elder rapids Mar 26, 2025, 3:38 PM

#

timber kiln Does the gemini advanced sub have a rate limit for the 2.5?

nah I don't think so

#

I went way past AI Studio's 50 rpd limit in the app

rigid widget Mar 26, 2025, 3:38 PM

#

We want real pure CoT not changed CoT

elder rapids Mar 26, 2025, 3:39 PM

#

rigid widget We want real pure CoT not changed CoT

it's not "changed" cot in the same way o3 minis was

#

it's not being optimized post CoT

#

it just fundamentally is a different approach to CoT

rigid widget Mar 26, 2025, 3:39 PM

#

Even o3 mini is more readable

Screenshot_2025-03-26-18-35-32-647_com.openai.chatgpt.png

elder rapids Mar 26, 2025, 3:39 PM

#

that's because they have a different approach

#

they're giving similar amounts of information

#

neither are being changed

#

Gemini has always used markdown way more than other models

rigid widget Mar 26, 2025, 3:40 PM

#

elder rapids it's not "changed" cot in the same way o3 minis was

Looks more like a system prompt it's not first person

elder rapids Mar 26, 2025, 3:41 PM

#

rigid widget Looks more like a system prompt it's not first person

ye Gemini tends to not speak in first person

#

whether it's an initial system prompt

#

or unintentional

#

the behavior still shows in the CoT

rigid widget Mar 26, 2025, 3:42 PM

#

yeah

elder rapids Mar 26, 2025, 3:42 PM

#

but regardless, that is the "raw" cot

elder rapids Mar 26, 2025, 3:42 PM

#

rigid widget this is not CoT

#

still clarifies in unformatted text

rigid widget Mar 26, 2025, 3:44 PM

#

elder rapids

How many people seriously read?

#

People enjoy reading R1's thoughts

elder rapids Mar 26, 2025, 3:45 PM

#

ye cuz they're funny asf

#

deepseek lab trained for weird types of "aha" moments

rigid widget Mar 26, 2025, 3:45 PM

#

It is very educational

#

I learned a lot of things while reading them.

elder rapids Mar 26, 2025, 3:49 PM

#

this is crazy

#

~16 avg improvement from flash thinking

#

yo this is actually insane

#

it's not just incrementally better than all the other models

#

the differences between other models and their competitors are super incremental in other benchmarks and Gemini 2.5 pro doesn't follow that trend

silk haven Mar 26, 2025, 3:53 PM

#

https://x.com/artificialanlys/status/1904923020604641471?s=46&t=P8-tRi_JAVcI6l5U6nOT4A

Artificial Analysis (@ArtificialAnlys) on X

Google’s new Gemini 2.5 Pro Experimental takes the #1 position across a range of our evaluations that we have run independently

Gemini 2.5 Pro is a reasoning model, it ‘thinks’ before answering questions. Google has released it as an experimental API in AI Studio only, and has

elder rapids Mar 26, 2025, 3:54 PM

#

from the moment I had my hands on nebula I knew this model was different

#

don't call me a prophet, but I might just be one

keen beacon Mar 26, 2025, 3:55 PM

#

brother

#

☠️

elder rapids Mar 26, 2025, 3:55 PM

#

I can't get over it 😭 🙏

split kayak Mar 26, 2025, 3:55 PM

#

gemini2.5

rigid widget Mar 26, 2025, 4:02 PM

#

elder rapids this is crazy

Crazy for Gem2.5Pro and V3 0324 🔥🔥

oblique flint Mar 26, 2025, 4:02 PM

#

so pro is the first 2.5 gemini model. Will flash be coming soon?

#

was phantom 2.5 pro as well or could that be flash?

rigid widget Mar 26, 2025, 4:03 PM

#

Screenshot_2025-03-26-19-03-15-326_org.mozilla.firefox.png

elder rapids Mar 26, 2025, 4:06 PM

#

oblique flint was phantom 2.5 pro as well or could that be flash?

could be

#

flash

#

if they release 2.5 flash

#

then it's safe to say the old models are gonna be removed

elder rapids Mar 26, 2025, 4:09 PM

#

rigid widget Crazy for Gem2.5Pro and V3 0324 🔥🔥

oh ye

#

v3 is doing well

elder rapids Mar 26, 2025, 4:09 PM

#

rigid widget

that's just for coding avg tho

silk haven Mar 26, 2025, 4:10 PM

#

oblique flint was phantom 2.5 pro as well or could that be flash?

Flash

thorny drum Mar 26, 2025, 4:11 PM

#

damn #1 in almost every category

silk haven Mar 26, 2025, 4:11 PM

#

Noam Shazeer

#

The 🐐

elder rapids Mar 26, 2025, 4:12 PM

#

phantom seemed pretty similar to nebula tho tbh

#

but apparently it was removed and added back

#

so it could've been changed to flash and I'd have no clue cuz I only tested the old phantom

#

why's 2.5 pro so fast tho

willow grail Mar 26, 2025, 4:17 PM

#

keen beacon Mar 26, 2025, 4:21 PM

#

ive been doing some research

#

without consensus voting, which openai used to make o3 (full) look better on the benchmark graphs, gemini 2.5 pro is actually better

#

AIME o3 ~90-91% vs 2.5 pro 92%
GPQA o3 ~82-83% vs 2.5 pro 84%
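
A rough sketch of why consensus voting can make a score look better than single-shot accuracy: sample several answers per question and grade only the most common one. Everything below (the function names, the toy answers) is made up for illustration; it is not how OpenAI or Google actually run their evals.

```python
from collections import Counter

def majority_vote(samples):
    """Return the most common answer among the samples (ties go to the first seen)."""
    return Counter(samples).most_common(1)[0][0]

def pass_at_1(samples, correct):
    """Fraction of individual samples that are correct (expected single-shot accuracy)."""
    return sum(s == correct for s in samples) / len(samples)

# Toy data: 5 sampled answers to one hypothetical question whose answer is "204".
samples = ["204", "204", "197", "204", "210"]
correct = "204"

print(pass_at_1(samples, correct))        # 0.6
print(majority_vote(samples) == correct)  # True
```

In this toy case a single sample is right only 60% of the time, but the voted answer is graded as correct, which is why numbers with and without voting aren't directly comparable.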

rigid widget Mar 26, 2025, 4:21 PM

#

Gemini 2.5 Pro
Except reasoning (🥈2. best model)
All ⚡🥇1. best model
Good job Google!

keen beacon Mar 26, 2025, 4:22 PM

#

keen beacon without consensus voting, which openai used to make o3 (full) look better on the...

o3 is several months older tho I guess

#

which is why openai need to stop sitting on their damn products..

#

just release it already

rigid widget Mar 26, 2025, 4:23 PM

#

Deepseek V3 03-24 (comparing with non-reasoning models)
(Overall: 🥈 2. best model)
Math: 🥇1. best model
Coding: 🥈2. best model
Reasoning: 🥉 3. best model

Good job Deepseek!

rigid widget Mar 26, 2025, 4:24 PM

#

keen beacon just release it already

At these prices, even if they were better, they're useless.

oblique flint Mar 26, 2025, 4:24 PM

#

keen beacon without consensus voting, which openai used to make o3 (full) look better on the...

but what made o3 crazy is its score on arc agi. I dont think 2.5 pro is getting near that, unless they test it with similar copious amounts of compute

rigid widget Mar 26, 2025, 4:24 PM

#

Look at gpt4.5 vs v3-0324

#

gpt4.5 is better but who cares?

elder rapids Mar 26, 2025, 4:25 PM

#

how big is v3 tho?

rigid widget Mar 26, 2025, 4:25 PM

#

deepseek 150x cheaper

elder rapids Mar 26, 2025, 4:25 PM

#

I heard it's like a 500b+ model

#

I'm not sure how true that is

rigid widget Mar 26, 2025, 4:25 PM

#

elder rapids how big is v3 tho?

same as old v3

#

bro it's open weights

elder rapids Mar 26, 2025, 4:26 PM

#

huh?

elder rapids Mar 26, 2025, 4:26 PM

#

rigid widget same as old v3

how does that answer my question

#

😭

#

v3 = v3

rigid widget Mar 26, 2025, 4:26 PM

#

685B params https://huggingface.co/deepseek-ai/DeepSeek-V3-0324

deepseek-ai/DeepSeek-V3-0324 · Hugging Face

elder rapids Mar 26, 2025, 4:26 PM

#

DAMN

silk haven Mar 26, 2025, 4:26 PM

#

elder rapids Mar 26, 2025, 4:27 PM

#

685b params

keen beacon Mar 26, 2025, 4:28 PM

#

silk haven

drop o3 sam.

oblique flint Mar 26, 2025, 4:29 PM

#

so, can anyone reverse engineer the scaffolding used in claudeplayspokemon and make gemini 2.5 pro play pokemon lol

leaden meteor Mar 26, 2025, 4:29 PM

#

rigid widget Gemini 2.5 Pro Except reasoning (🥈2. best model) All ⚡🥇1. best model Good job ...

2.5 pro is #1 in reasoning also, isnt it? Why did you mention it 2nd?

oblique flint Mar 26, 2025, 4:31 PM

#

leaden meteor Mar 26, 2025, 4:33 PM

#

I see. Is this o1-2024-12-17- high different from o1-2024-12-17 on leaderboard ?

elder rapids Mar 26, 2025, 4:54 PM

#

leaden meteor 2.5 pro is #1 in reasoning also, isnt it? Why did you mention it 2nd?

it is the best reasoning model

#

might not be the best at purely reasoning

#

but that's trivially true

wispy mist Mar 26, 2025, 5:14 PM

#

can this data be used or licensed on a company's website?

red sluice Mar 26, 2025, 5:29 PM

#

Poll result: Who will have the best model within one month?

Winner: Google (answer 2) with 5 of 13 votes

olive mesa Mar 26, 2025, 5:31 PM

#

waiting for 2.5 flash to use for my projects

#

even though i have like 12 api keys the rate limits are still strict

olive mesa Mar 26, 2025, 5:32 PM

#

oblique flint

reasoning average better by a little bit

#

but time and cost is a lot more

#

worse model with 50k thinking tokens slightly better than a good model with 10k thinking tokens

rigid widget Mar 26, 2025, 5:42 PM

#

elder rapids it is the best reasoning *model*

Dude you need to be a politician

cloud meadow Mar 26, 2025, 6:09 PM

#

https://vxtwitter.com/Alibaba_Qwen/status/1904944923159445914

vxTwitter / fixvx 💖 97 🔁 17

Voice Chat + Video Chat! Just in Qwen Chat (https://chat.qwen.ai)! You can now chat with Qwen just like making a phone call or making a video call! Check the demo in https://www.youtube.com/watch?v=yKcANdkRuNI What's m…

Qwen (@Alibaba_Qwen)


keen beacon Mar 26, 2025, 6:12 PM

#

cloud meadow https://vxtwitter.com/Alibaba_Qwen/status/1904944923159445914

this is awesome

cloud meadow Mar 26, 2025, 6:13 PM

#

yeah

#

What's more, we opensource the model behind all this, Qwen2.5-Omni-7B, under the license of Apache 2.0, and we release the technical report to share all our details!

keen beacon Mar 26, 2025, 6:14 PM

#

4o image gen, prompt "A photo of Tim Cook making a speech on stage behind a lectern with the text "FOR THE MANY, NOT THE FEW" and the UK Labour logo on it". no other model has got this close to perfection - it literally got the exact lectern right with just the "H" missing from "the", which i'm sure would be fixed if you regen'd a few times

#

for reference

elder rapids Mar 26, 2025, 6:19 PM

#

cloud meadow https://vxtwitter.com/Alibaba_Qwen/status/1904944923159445914

nice

elder rapids Mar 26, 2025, 6:19 PM

#

keen beacon 4o image gen, prompt "A photo of Tim Cook making a speech on stage behind a lec...

can't wait for Google to make a bigger native image Gen now, the flash one is small

blazing rune Mar 26, 2025, 6:31 PM

#

cloud meadow https://vxtwitter.com/Alibaba_Qwen/status/1904944923159445914

I care more about Qwen 3 (especially the MoE)

rigid widget Mar 26, 2025, 6:35 PM

#

keen beacon 4o image gen, prompt "A photo of Tim Cook making a speech on stage behind a lec...

Oh, isn't this dangerous????

keen beacon Mar 26, 2025, 6:36 PM

#

you could argue that

#

but like

#

i like the fact openai have lowered the guardrails for this

#

dall-e was censored asf and it showed

ocean vortex Mar 26, 2025, 6:41 PM

#

elder rapids this is crazy

this is not a very good benchmark, but in this case it is not wrong... 2.5 pro very well just might be the best model ever released thus far tbh

keen beacon Mar 26, 2025, 6:43 PM

#

it's finally joever for livebench

#

2.5 pro doesnt seem to know much after june 2024. it doesnt seem like they focused on recent events that much

#

how tf is it 2x faster than gem 2 pro

rigid widget Mar 26, 2025, 6:46 PM

#

cloud meadow https://vxtwitter.com/Alibaba_Qwen/status/1904944923159445914

it's amazing 😍

elder rapids Mar 26, 2025, 6:48 PM

#

ocean vortex this is not a very good benchmark, but in this case it is not wrong... 2.5 pro v...

livebench is one of the especially good benchmarks as far as how it matches talking to each ai lmao, I'm not sure what makes you think that

#

that's more of a consensus too

#

livebench and simplebench

#

but yeah 2.5 pro is just

#

really crazy

oblique flint Mar 26, 2025, 7:01 PM

#

keen beacon how tf is it 2x faster than gem 2 pro

I wonder if that means it'll be cheap as well, surely yes?

#

anthropic will not be happy about this release if it turns out this model outperforms the claude models at a lower cost lol

ocean vortex Mar 26, 2025, 7:17 PM

#

elder rapids livebench is one of the especially good benchmarks as far as how it matches talk...

this couldn't be further away from the truth. Unless you really think qwq is better than all of those:

#

which is just ridiculous

thorny drum Mar 26, 2025, 7:19 PM

#

i dont think any benchmark will give you a perfect stack rank of every model

ocean vortex Mar 26, 2025, 7:20 PM

#

no but some are better than the others. Livebench is among the worst ones tbh

thorny drum Mar 26, 2025, 7:20 PM

#

what are your favorites?

ocean vortex Mar 26, 2025, 7:20 PM

#

if you take any real benchmark like say MMLU Pro, you will hardly see any discrepancies at all. All scores just make sense and are logical

#

And also... contamination is not really the issue. Bigger issue is "private datasets" as that contributes nothing to improvement of the models IMO. It also invites selective cheating when only select companies have some kind of access through connections which means even more discrepancies. If benchmark is diverse and good, it's pretty much impossible to cheat beyond what everyone is doing (or to get a perfect score) unless that is the sole thing you focus on disregarding all the other metrics. Which is just not realistic as it would be obvious model is sht

#

the main thing should be that it's a fair game. If everyone has the same public access that means it is fair

#

and actually is driving the improvement. Rather than being some secretive thing no one can even validate if it's any good...

elder rapids Mar 26, 2025, 7:43 PM

#

ocean vortex this couldn't be further away from the truth. Unless you really think qwq is bet...

this isn't what makes a benchmark bad at all lmao, that would be reasonable if the score discrepancy were larger than just a few percentile. It's a really good inference model that doesn't have quite as much knowledge as deepseek, but that isn't what its testing. it's not "better" than deepseek, nor is deepseek so much better than o3 mini, nor is v3 better than Claude 3-7 sonnet, this isn't because the benchmark is flawed, but because of how YOURE viewing the averaged score

#

the same way lmsys isn't a bad benchmark, its just not quantifying what you think it's quantifying

ocean vortex Mar 26, 2025, 7:43 PM

#

elder rapids this isn't what makes a benchmark bad at all lmao, that would be reasonable *if*...

no. Just no lol

#

that model has no business being equal with those

#

let alone above them

elder rapids Mar 26, 2025, 7:44 PM

#

wym just no? there's literally ±1 deviation LMAO

#

have you tried it??

#

in the same way smaller models can outperform gpt 4

#

is the same way qwq-32b can have extremes like that

elder rapids Mar 26, 2025, 7:44 PM

#

elder rapids in the same way smaller models can outperform gpt 4

they won't consistently be better than gpt 4

#

but that's not what its testing

#

look at the individual scores

ocean vortex Mar 26, 2025, 7:45 PM

#

and it's not a good benchmark at all if it fails to find the difference. Qwq is worse than R1 overall, not the other way around

ocean vortex Mar 26, 2025, 7:46 PM

#

elder rapids they won't consistently be better than gpt 4

that's why you need benchmark with plenty of test questions

elder rapids Mar 26, 2025, 7:46 PM

#

ocean vortex and it's not a good benchmark at all if it fails to find the difference. Qwq is ...

to find the difference? or the difference it claims to

keen beacon Mar 26, 2025, 7:46 PM

#

ocean vortex that model has no business being equal with those

deepseek v3, claude 3.7 sonnet, gpt 4.5 arent thinking models. flash thinking isnt good. its not surprising for it to be competitive on rote tasks/math

ocean vortex Mar 26, 2025, 7:46 PM

#

not just a few or a few subcategories

elder rapids Mar 26, 2025, 7:46 PM

#

keen beacon deepseek v3, claude 3.7 sonnet, gpt 4.5 arent thinking models. flash thinking is...

flash thinking is the best here for inquiry

#

lol

keen beacon Mar 26, 2025, 7:46 PM

#

https://matharena.ai/

MathArena.ai

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

elder rapids Mar 26, 2025, 7:46 PM

#

but it's the worst for actually coding

#

and stuff

keen beacon Mar 26, 2025, 7:46 PM

#

elder rapids flash thinking is the best here for inquiry

no it isnt

elder rapids Mar 26, 2025, 7:46 PM

#

never got the attention it deserved

elder rapids Mar 26, 2025, 7:47 PM

#

keen beacon no it isnt

I mean like, if you set up a debate between two models

#

flash thinking wins

#

that kind of stuff

#

o1 gets crushed too

ocean vortex Mar 26, 2025, 7:47 PM

#

keen beacon deepseek v3, claude 3.7 sonnet, gpt 4.5 arent thinking models. flash thinking is...

their average was supposed to be an average of those things:

#

so it doesn't make sense lol

keen beacon Mar 26, 2025, 7:47 PM

#

ocean vortex their average was supposed to be an average of those things:

yea livebench tests only rote tasks lol

ocean vortex Mar 26, 2025, 7:47 PM

#

it's not a very good test

keen beacon Mar 26, 2025, 7:48 PM

#

elder rapids I mean like, if you set up a debate between two models

nah flash thinking sucks

#

i dont see any of the other models going 29k tokens in thinking and coming up with this

#

fyi qwq does that one just fine btw

#

afaik

elder rapids Mar 26, 2025, 7:49 PM

#

keen beacon i dont see any of the other models going 29k tokens in thinking and coming up wi...

dawg 😭 💔 I just IMPLIED this exact thing

#

this server bro istg

keen beacon Mar 26, 2025, 7:49 PM

#

elder rapids dawg 😭 💔 I just IMPLIED this exact thing

you said flash thinking would win

elder rapids Mar 26, 2025, 7:49 PM

#

in a debate kind of inquiry

#

are you good?

elder rapids Mar 26, 2025, 7:49 PM

#

elder rapids I mean like, if you set up a debate between two models

^

keen beacon Mar 26, 2025, 7:50 PM

#

elder rapids I mean like, if you set up a debate between two models

its one of the worst though alongside qwen models

elder rapids Mar 26, 2025, 7:51 PM

#

keen beacon its one of the worst though alongside qwen models

what is this quantifying 😭🙏

#

ngl you guys have to be trolling

keen beacon Mar 26, 2025, 7:51 PM

#

elder rapids what is this quantifying 😭🙏

the elimination game benchmark, it tests social reasoning, strategy, and deception

elder rapids Mar 26, 2025, 7:52 PM

#

keen beacon the elimination game benchmark, it tests social reasoning, strategy, and decepti...

but that isn't what I just said is it

keen beacon Mar 26, 2025, 7:52 PM

#

theres considerable overlap with debate skills and social skills though

elder rapids Mar 26, 2025, 7:52 PM

#

no

#

it's not lmao

keen beacon Mar 26, 2025, 7:52 PM

#

yes for llms

ocean vortex Mar 26, 2025, 7:52 PM

#

elder rapids ngl you guys have to be trolling

I think the issue is you. Just because you don't like that you are wrong you saying that everyone else is "trolling" is not gonna change anything and not gonna make you right lmao

keen beacon Mar 26, 2025, 7:52 PM

#

yeah this guy is so annoying lol

elder rapids Mar 26, 2025, 7:52 PM

#

keen beacon yes for llms

strategy in debate is rhetorical recession or knowing how to make a cohesive argument

cloud meadow Mar 26, 2025, 7:53 PM

#

keen beacon i dont see any of the other models going 29k tokens in thinking and coming up wi...

2.5 seems to do it

📎 message.txt

elder rapids Mar 26, 2025, 7:53 PM

#

ocean vortex I think the issue is you. Just because you don't like that you are wrong you say...

that would make sense if I weren't right

keen beacon Mar 26, 2025, 7:53 PM

#

cloud meadow 2.5 seems to do it

is that flash thinking?

elder rapids Mar 26, 2025, 7:53 PM

#

but of course, you don't know the claim and won't bother to reiterate what I'm saying for clarity

ocean vortex Mar 26, 2025, 7:53 PM

#

elder rapids that would make sense if I weren't right

you aren't

elder rapids Mar 26, 2025, 7:53 PM

#

ocean vortex you aren't

alr, state my premise

cloud meadow Mar 26, 2025, 7:54 PM

#

Nah I was using the wrong one lmao

elder rapids Mar 26, 2025, 7:55 PM

#

keen beacon yeah this guy is so annoying lol

dawg you're literally just ignoring what I'm saying, and I don't know if it's just to argue. this isn't the first time

#

keen beacon Mar 26, 2025, 7:56 PM

#

cloud meadow 2.5 seems to do it

nope it just did it again for me

#

are you using 0 temperature?

cloud meadow Mar 26, 2025, 7:56 PM

#

elder rapids Mar 26, 2025, 7:56 PM

#

cloud meadow Nah I was using the wrong one lmao

^

keen beacon Mar 26, 2025, 7:56 PM

#

keen beacon Mar 26, 2025, 7:57 PM

#

cloud meadow

yeah ur using 0.7 temperature

cloud meadow Mar 26, 2025, 7:57 PM

#

It keeps repeating stuff on 0 temp lmao

ocean vortex Mar 26, 2025, 7:57 PM

#

elder rapids alr, state my premise

You seem to be thinking everyone is doing it wrong and livebench is this magical benchmark more accurate than proven reliable metrics. Metrics with a much better track record and better results. That's almost like 'benchmark shopping' - look hard enough and you gonna find a benchmark for your confirmation bias

keen beacon Mar 26, 2025, 8:00 PM

#

cloud meadow It keeps repeating stuff on 0 temp lmao

ya using a non zero temperature allows it to not get stuck by adding randomness. try it with 2.5 pro on 0 temp and it wont get stuck
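For anyone following the temperature point, here is a minimal sketch of why a non-zero temperature can break a repetition loop. It is a toy sampler, not any provider's actual decoding code: the `sample_token` helper and the three logit values are made up for illustration, and treating temperature 0 as plain argmax is an assumption about how greedy decoding is usually implemented.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick an index from `logits`; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        # Greedy: the same input always yields the same choice.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over logits / temperature (shifted by the max for stability).
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens
rng = random.Random(0)

greedy = [sample_token(logits, 0, rng) for _ in range(10)]
sampled = [sample_token(logits, 0.7, rng) for _ in range(1000)]

print("temperature 0   picks:", set(greedy))
print("temperature 0.7 picks:", set(sampled))
```

At temperature 0 the same context always produces the same next token, so a loop can repeat indefinitely once the model enters it; at 0.7 lower-ranked tokens are occasionally chosen, which gives the output a chance to leave the loop.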

cloud meadow Mar 26, 2025, 8:01 PM

#

#

I had to stop it and then tell it to continue

#

Still, seems wrong?

keen beacon Mar 26, 2025, 8:02 PM

#

generated by 4o

#

🤯

cloud meadow Mar 26, 2025, 8:03 PM

#

We are a few years away from AI being used as video evidence in a court case

#

I mean a few months

#

😛

keen beacon Mar 26, 2025, 8:03 PM

#

cloud meadow Still, seems wrong?

yea both are wrong i dont have the answer on hand though from what i remember both are incorrect

cloud meadow Mar 26, 2025, 8:03 PM

#

The future might be bleak or bright idk

ocean vortex Mar 26, 2025, 8:04 PM

#

So like... simple-bench is 'perfect' for people who think Claude is better than everything else. But they fail to understand it's mostly just spatial awareness reasoning. Your daily tasks will usually include MUCH more areas than just this. Then there's also an issue that it's private and no one can verify it properly so it's a guesswork

#

people like that will also look at swe-bench but conveniently ignore LCB where claude is destroyed and it's still coding lol

keen beacon Mar 26, 2025, 8:07 PM

#

https://x.com/koltregaskes/status/1904974999011614895 what a last couple weeks of march for image gen

Kol Tregaskes (@koltregaskes) on X

MIDJOURNEY V7 TARGET LAUNCH IS MONDAY 31ST MARCH! 😀

Next week!

ocean vortex Mar 26, 2025, 8:09 PM

#

it's hard to be impartial, but you really shouldn't judge the benchmark based on individual model results. The number of discrepancies in total is the easiest fair way to get the rough overall picture

elder rapids Mar 26, 2025, 8:10 PM

#

ocean vortex You seem to be thinking everyone is doing it wrong and livebench is this magical...

that's completely loaded lol and doesn't state my premise. Whether or not livebench is a viable objective evaluation that judges how models specifically perform in certain tasks isn't what I'm saying lol, the reason why it's even averaged out through such wide variation of things on livebench especially is because they aim for exactly what you're looking at while discussing with it, qwq-32b does have specific traits that may help it over deepseek, livebench tests only a few languages, and it happens to excel in these few languages, but if you check aider, clearly there's a discrepancy, but also in the claim I've been making; how well it does when given newer, or at least more obscure information, and therefore a better score, with large variation. anything ±5~ score deviation wont have such discrepancy, yet it's completely realistic that's how they interact with that kind of information

#

I don't even think we disagree

ocean vortex Mar 26, 2025, 8:10 PM

#

it's also unreasonable to expect any model to do well everywhere... Just not realistic

elder rapids Mar 26, 2025, 8:10 PM

#

but there's clearly a difference in what we understand these benchmarks do

#

for me, Claude 3.5 sonnet didn't do well, for exactly what I expected livebench to do

ocean vortex Mar 26, 2025, 8:11 PM

#

elder rapids that's completely loaded lol and doesn't state my premise. Whether or not livebe...

nonsense. They provide the average score. If it's no good it should not be provided and the benchmark should be specialised rather than general purpose

elder rapids Mar 26, 2025, 8:12 PM

#

ocean vortex nonsense. They provide the average score. If it's no good it should not be provi...

that would make sense if the topic variance werent so wide

#

which is why I clarified that

#

"such a wide variation"

ocean vortex Mar 26, 2025, 8:12 PM

#

and if it is and the results are that... the benchmark is not the best one. Simple as that 😉

elder rapids Mar 26, 2025, 8:12 PM

#

it's exactly what it claims to be

#

and seems to match really well because, the model approaches you that way too

#

novel information over a general discussion = non standard benchmark performance

#

and probably why o3 mini and o1 have done so well, since they focus on exactly that kind of behavior of receiving

#

and it's probably why flash thinking is so high too

#

since it's just not that good, but if you use it

#

it's just really good at receiving information

ocean vortex Mar 26, 2025, 8:15 PM

#

have you read their description at least?

elder rapids Mar 26, 2025, 8:15 PM

#

wym?

elder rapids Mar 26, 2025, 8:16 PM

#

ocean vortex So like... simple-bench is 'perfect' for people who think Claude is better than ...

ye but I think that's a much more important aspect over others tho, spatial reasoning necessarily bleeds through to other tasks that may be the best in actually talking to it

ocean vortex Mar 26, 2025, 8:17 PM

#

They are aiming at providing general purpose anti-contamination benchmark. I already said what I think about contamination (it's not an issue for a good benchmark), and if this is general purpose... then it's clearly not the best benchmark. Like I've already said several times but you just don't seem to get it

#

nowhere does it say it's specialised or limited in scope:

elder rapids Mar 26, 2025, 8:18 PM

#

ocean vortex They are aiming at providing general purpose anti-contamination benchmark. I alr...

this implies exactly what I said tho lol

#

if it's anti contamination it's aiming for novel information and score discrepancy under that

elder rapids Mar 26, 2025, 8:19 PM

#

ocean vortex nowhere does it say it's specialised or limited in scope:

yeah? but that's not really what I said either

ocean vortex Mar 26, 2025, 8:20 PM

#

elder rapids it's just really good at receiving information

this just makes no sense at all tbh and is just a bunch of nonsense

elder rapids Mar 26, 2025, 8:20 PM

#

ocean vortex this just makes no sense at all tbh and is just a bunch of nonsense

just how it adapts to you and stuff lol

#

that's it

ocean vortex Mar 26, 2025, 8:20 PM

#

elder rapids this implies exactly what I said tho lol

like what? Are you meaning to say that there are discrepancies because it's general purpose? Cause that's nonsensical too

elder rapids Mar 26, 2025, 8:21 PM

#

I just said what I meant

ocean vortex Mar 26, 2025, 8:21 PM

#

elder rapids just how it adapts to you and stuff lol

LMFAO

elder rapids Mar 26, 2025, 8:21 PM

#

elder rapids if it's anti contamination it's aiming for novel information and score discrepan...

^

ocean vortex Mar 26, 2025, 8:21 PM

#

nonsense

elder rapids Mar 26, 2025, 8:21 PM

#

😭?

ocean vortex Mar 26, 2025, 8:22 PM

#

"how it adapts to you and stuff" are just some meaningless words you made up, in this context

elder rapids Mar 26, 2025, 8:22 PM

#

it's not meaningless lol

ocean vortex Mar 26, 2025, 8:22 PM

#

it absolutely is

#

you have no argument lmao

elder rapids Mar 26, 2025, 8:22 PM

#

that means how it receives information, you can't get any simpler than that

#

if it means how it receives information, that means how it interprets things through a context length right? if it's better at low context

#

then it's necessarily going to be better

#

oh wait I wonder why all of these models also excel at lmsys

#

which happens to be ahem

#

general low context performance

#

🤦

ocean vortex Mar 26, 2025, 8:24 PM

#

elder rapids that means how it receives information, you can't get any simpler than that

Production models do not receive information, quite obviously. They receive it during training. And calling your input prompts as "receiving information" is ridiculous. That's not how it works LOL

elder rapids Mar 26, 2025, 8:24 PM

#

ocean vortex Production models do not receive information, quite obviously. They receive it d...

what

#

that's clearly not what I'm saying

#

😭

ocean vortex Mar 26, 2025, 8:25 PM

#

then talk properly, cause you are just making sht up catgrin

elder rapids Mar 26, 2025, 8:25 PM

#

I just clarified exactly what it means tho, and then you sidestepped into that interpretation

elder rapids Mar 26, 2025, 8:25 PM

#

elder rapids general low context performance

^

#

you know you can ask for clarity

#

right?

ocean vortex Mar 26, 2025, 8:28 PM

#

elder rapids if it's anti contamination it's aiming for novel information and score discrepan...

the bigger issue is testing properly. If you aren't testing properly with good test questions "novel" part is completely irrelevant. And I already said why contamination is mostly a non-issue. If the entire dataset is public, that means all models are trained with the same data available and the scores are fair. It is still a consideration, but good benchmark is MUCH more important than just making sure you develop test questions in secrecy...

elder rapids Mar 26, 2025, 8:30 PM

#

ocean vortex the bigger issue is testing properly. If you aren't testing properly with good t...

that's not what I'm saying at all, the novel part is relevant when that's the point of the benchmark through large variety, ie, livebench being so general that they don't focus on coding enough for more than 2 languages (I said this already), which can be equivocated to other aspects of the benchmark

ocean vortex Mar 26, 2025, 8:30 PM

#

if it was an issue we would have seen models getting perfect scores on all public benchmarks a long time ago. That didn't happen. Because it is essentially impossible

elder rapids Mar 26, 2025, 8:30 PM

#

and then it averaged that

#

because the point is

#

it's not testing intensive information

#

and the wide variety implicitly denotes novelty is what it's looking for

ocean vortex Mar 26, 2025, 8:32 PM

#

elder rapids it's not testing intensive information

then it shouldn't be general purpose benchmark. I feel like we are going in circles...

elder rapids Mar 26, 2025, 8:33 PM

#

ocean vortex then it shouldn't be general purpose benchmark. I feel like we are going in circ...

dude

#

if it's not the general purpose benchmark you think it is

#

then it's not, and I'm showing you how

#

lmsys is a general purpose benchmark

#

but clearly they're different benchmarks

#

right?

ocean vortex Mar 26, 2025, 8:34 PM

#

elder rapids and the wide variety implicitly denotes novelty is what it's looking for

wide variety implies extensive information you just said it doesn't have. Which one is it? Like make up your mind lol

ocean vortex Mar 26, 2025, 8:34 PM

#

elder rapids lmsys is a general purpose benchmark

are you... not very smart? It is NOT that

elder rapids Mar 26, 2025, 8:34 PM

#

ocean vortex wide variety implies extensive information you just said it doesn't have. Which ...

mb I meant intensive information

ocean vortex Mar 26, 2025, 8:34 PM

#

it's a human preference benchmark

elder rapids Mar 26, 2025, 8:34 PM

#

ocean vortex it's a human preference benchmark

exactly

ocean vortex Mar 26, 2025, 8:35 PM

#

not an objective deterministic tool for general performance

elder rapids Mar 26, 2025, 8:35 PM

#

it's a general purpose benchmark via human preference

#

which is exactly the distinction I'm making

#

the reason why that's relevant is because livebench is a general purpose benchmark via novelty

#

to you, one is more apparent than the other

ocean vortex Mar 26, 2025, 8:36 PM

#

elder rapids it's a general purpose benchmark via human preference

I think I'm done here... You are obviously gonna just keep changing your tune and pretend you meant otherwise than you really did. Human preference is NOT equal to general purpose. Human preference is a specific category of that

elder rapids Mar 26, 2025, 8:37 PM

#

ocean vortex I think I'm done here... You are obviously gonna just keep changing your tune an...

dawg

#

theyre entirely different categories altogether (meaning, it doesn't make sense for you to think I meant that regardless)

#

and that was what I was implying the whole discussion (meaning, it's not changing my mind if that's what I've been saying)

ocean vortex Mar 26, 2025, 8:39 PM

#

general purpose tests for everything with a deterministic objective eval methods. Human preference tests mostly for the 'vibe' and style of responses and the initial impression they leave on people. Those 2 things are different and not really compatible. For the record

#

livebench is not human preference testing, in case you think it is...

elder rapids Mar 26, 2025, 8:41 PM

#

it's not like I didn't clarify that

elder rapids Mar 26, 2025, 8:42 PM

#

elder rapids the reason why that's relevant is because livebench is a general purpose benchma...

^

ocean vortex Mar 26, 2025, 8:42 PM

#

ocean vortex the bigger issue is testing properly. If you aren't testing properly with good t...

^

elder rapids Mar 26, 2025, 8:43 PM

#

ocean vortex general purpose tests for everything with a deterministic objective eval methods...

that's just begging the question lol, general purpose just means the scope of what's being evaluated

#

and they are compatible

#

I don't see how you think otherwise if they're simply so categorically distinct

#

you can use objective metrics to measure very narrow non general purpose tasks

ocean vortex Mar 26, 2025, 8:46 PM

#

elder rapids that's just begging the question lol, general purpose just means the scope of wh...

it also means there should be hardly any discrepancies at all. Truthfully all those less popular benchmarks have an easy way in NOT making it a general purpose benchmark. But as soon as they do, if there are many discrepancies... They kind of sabotage themselves because the standards for it are very different and you can't have a leaderboard looking like that with the scores that don't make sense

elder rapids Mar 26, 2025, 8:47 PM

#

ocean vortex it also means there should be hardly any discrepancies at all. Truthfully...

again, if that's not what the leaderboard claims, then it's not a flawed benchmark lol, livebench is unique

elder rapids Mar 26, 2025, 8:48 PM

#

ocean vortex ^

that doesn't mean anything tho? I dismissed it, since it's implicit through the extensive information

ocean vortex Mar 26, 2025, 8:49 PM

#

elder rapids again, if that's not what the leaderboard claims, then it's not a flawed benchma...

I don't think you understand the scope of contamination problem. It is small relatively speaking. I've said this repeatedly. ONCE AGAIN, contamination is irrelevant as long as the benchmark itself is not on the same high level standard that the main benchmarks are. It's testing nothing, in comparison and bluntly speaking

elder rapids Mar 26, 2025, 8:49 PM

#

ie, high benchmark variety + high benchmark avg

elder rapids Mar 26, 2025, 8:49 PM

#

ocean vortex I don't think you understand the scope of contamination problem. It is small rel...

that's completely irrelevant, I don't care about the contamination

ocean vortex Mar 26, 2025, 8:49 PM

#

elder rapids that's completely irrelevant, I don't care about the contamination

then you shouldn't use livebench

#

cause that's their MAIN selling point

#

LMAO

elder rapids Mar 26, 2025, 8:50 PM

#

you're saying this because you think I'm getting novelty from anti contamination

#

but I've clarified like 3 times

#

that's not what makes it novel

ocean vortex Mar 26, 2025, 8:50 PM

#

do you know what 'novel' means?

elder rapids Mar 26, 2025, 8:51 PM

#

yeah new

#

that's not relevant regardless

ocean vortex Mar 26, 2025, 8:51 PM

#

it means it contains problems that are not included elsewhere, hence the anticontamination part.

elder rapids Mar 26, 2025, 8:51 PM

#

ok, but that's not what makes it novel here

#

anti contamination implies novelty sure

#

but, let's dismiss its "anti contamination" selling point completely

#

ignore it

#

ignore everything the livebench descriptions say

#

now

#

livebench is a novelty benchmark

#

why?

#

it tests extensive information, with the example of coding, it's not going deep into each variety

#

but how is this novelty?

#

because it's averaging low intensive scores

ocean vortex Mar 26, 2025, 8:53 PM

#

elder rapids it tests extensive information, with the example of coding, it's not going deep ...

NO. it tests nothing in comparison to something like MMLU

#

you are completely missing the point lmao

elder rapids Mar 26, 2025, 8:53 PM

#

ocean vortex NO. it tests nothing in comparison to something like MMLU

huh?

ocean vortex Mar 26, 2025, 8:53 PM

#

the only part which makes it "novel" is anti-contamination, that's it

#

the scope is MUCH smaller than other well known benchmarks

elder rapids Mar 26, 2025, 8:54 PM

#

yes I'm glad you understand

#

since that's what I said

ocean vortex Mar 26, 2025, 8:55 PM

#

elder rapids since that's what I said

you need to google what "extensive information" means

#

like bruh...

elder rapids Mar 26, 2025, 8:55 PM

#

yeah wide variety lol

ocean vortex Mar 26, 2025, 8:55 PM

#

this is getting ridiculous lol

elder rapids Mar 26, 2025, 8:55 PM

#

but you think "it tests extensive information" contrasts that

ocean vortex Mar 26, 2025, 8:56 PM

#

you can NOT have extensive information if your scope is limited and you have way less test questions

#

that is the opposite of extensive information

#

even if you include many categories, that will not change much at all if the total amount of test prompts is small

#

that only means each individual category is that much less accurate
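A rough way to put numbers on the sample-size point above, as a sketch under stated assumptions: each question is treated as an independent pass/fail trial, so a category score behaves like a binomial proportion. The 70% accuracy and the question counts are hypothetical and are not taken from livebench or any other benchmark.

```python
import math

def accuracy_standard_error(accuracy, num_questions):
    """Standard error of a pass rate measured on independent questions."""
    return math.sqrt(accuracy * (1 - accuracy) / num_questions)

# Hypothetical model with a true 70% pass rate in one category.
for n in (1, 25, 100, 1000):
    se = accuracy_standard_error(0.7, n)
    print(f"{n:>5} questions: 70% +/- {100 * se:.1f} points (one standard error)")
```

Under these assumptions the uncertainty shrinks with the square root of the question count, so a category with a few dozen questions can move by several points from noise alone.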

elder rapids Mar 26, 2025, 8:59 PM

#

ocean vortex you can NOT have extensive information if your scope is limited and you have way...

sure, I'll grant that

#

but that's still not what I mean

#

if each category has their own intense form of information (how complex it is/its depth) it still has respective extensity

ocean vortex Mar 26, 2025, 9:00 PM

#

At this point I don't think you know yourself what you mean. You keep changing the goalpost LOL

elder rapids Mar 26, 2025, 9:00 PM

#

ocean vortex At this point I don't think you know yourself what you mean. You keep changing t...

I haven't made a separate claim tho? 😭

#

you don't know what changing the goalpost means

#

you're just shotgunning and not letting me clarify lol

ocean vortex Mar 26, 2025, 9:01 PM

#

elder rapids if each category has their own intense form of information (how complex it is/it...

Not really. You can have a single coding test prompt and rate coding like that. It will have "respective extensity" but it will also be completely useless

elder rapids Mar 26, 2025, 9:01 PM

#

elder rapids if each category has their own intense form of information (how complex it is/it...

ie, low extensive formatting, low intensity, while being highly extensive

elder rapids Mar 26, 2025, 9:01 PM

#

ocean vortex Not really. You can have a single coding test prompt and rate coding like that. ...

yep

#

I prefaced the discussion with that tho

#

so

#

you're getting kinda lost on self proposal, it's not "not really" if that doesn't contrast my claim at all

ocean vortex Mar 26, 2025, 9:04 PM

#

elder rapids you don't know what changing the goalpost means

I do and it fits perfectly to what you are doing. You keep going back and forth diluting the point itself this all started from

elder rapids Mar 26, 2025, 9:04 PM

#

and it's an extreme

elder rapids Mar 26, 2025, 9:04 PM

#

ocean vortex I do and it fits *perfectly* to what you are doing. You keep going back and fort...

😭? this literally doesn't fit if there's no separate claim or subordinate topic

#

how can I change a goalpost without moving the direction of the topic

ocean vortex Mar 26, 2025, 9:06 PM

#

elder rapids livebench is one of the especially good benchmarks as far as how it matches talk...

should I remind you that it started from this? ^

#

lmaoo

elder rapids Mar 26, 2025, 9:06 PM

#

yep

#

that's where we started

elder rapids Mar 26, 2025, 9:07 PM

#

elder rapids that's completely loaded lol and doesn't state my premise. Whether or not livebe...

^

ocean vortex Mar 26, 2025, 9:07 PM

#

we already established that "one of the especially good benchmarks" is the last thing it is and that it is not a human preference benchmark

#

or did you forget

elder rapids Mar 26, 2025, 9:07 PM

#

ocean vortex we already established that " one of the especially good benchmarks" is the last...

that's the premise, how would we have established that

#

😭

#

it is one of the especially good benchmarks

#

for clarity, now that the goalpost has remained what it is

ocean vortex Mar 26, 2025, 9:08 PM

#

elder rapids that's the premise, how would we have established that

#general message

elder rapids Mar 26, 2025, 9:09 PM

#

ocean vortex the bigger issue is testing properly. If you aren't testing properly with good t...

I dismissed this lol

elder rapids Mar 26, 2025, 9:09 PM

#

elder rapids that's not what I'm saying at all, the novel part is relevant when that's the po...

^

#

if that's not what I'm saying, how would that establish the premise?

ocean vortex Mar 26, 2025, 9:09 PM

#

so you are gonna argue it is "extensive information" or big variety again?

#

duuuude

#

like just drop it

#

LOL

elder rapids Mar 26, 2025, 9:09 PM

#

ocean vortex so you are gonna argue it is "extensive information" or big variety again?

ye it's extensive information

ocean vortex Mar 26, 2025, 9:10 PM

#

what

#

I give up

#

catgrin

elder rapids Mar 26, 2025, 9:10 PM

#

can you screenshot a category of livebench rq

#

"IF" necessarily extensive

#

"Language" necessarily extensive

#

"reasoning" necessarily extensive

#

they're categorically vague lol

#

that means they're necessarily extensive

#

it's not that deep

#

larger categories = more extensive

oblique flint Mar 26, 2025, 9:19 PM

#

What the hell happened here lol

oblique flint Mar 26, 2025, 9:36 PM

#

One notable issue with 2.5 pro that I haven't seen mentioned too much is the fact that it doesn't respect edit format syntax well. If you look at aider it only gets the format correct 90% of the time whereas with sonnet it's 98%. I also watched a video of a guy using the model with roo code and there 2.5 pro was running into some issues with performing the edits correctly, which means you have to pay for the failed requests as a user. Not sure if it's going to overtake sonnet 3.7 for coding if they don't fix these issues for stable release
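
To put the 90% vs 98% figures in cost terms, here is a rough sketch. It assumes every request independently yields a well-formed edit with probability p and that a failed request is simply retried; neither assumption comes from aider itself, and the function name is made up for illustration.

```python
# Rough cost of edit-format failures.
# Assumptions (not measured behaviour): each request succeeds independently
# with probability p_success, and a malformed edit is retried until one works.

def expected_requests_per_edit(p_success: float) -> float:
    """Mean number of requests until the first well-formed edit (geometric distribution)."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return 1.0 / p_success

for name, p in [("2.5 pro", 0.90), ("sonnet", 0.98)]:
    n = expected_requests_per_edit(p)
    print(f"{name}: {n:.3f} requests per applied edit, {100 * (n - 1):.1f}% overhead")
```

Under those assumptions the gap works out to roughly 11% extra paid requests versus roughly 2%, which is noticeable on a bill but says nothing about the quality of the edits that do apply.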

blazing rune Mar 26, 2025, 9:37 PM

#

oblique flint One notable issue with 2.5 pro that I havent seen mentioned too much is the fact...

And it doesn't follow certain instructions well, like telling it "Don't use any comments in the code"

#

At least it didn't work for me

oblique flint Mar 26, 2025, 9:38 PM

#

Both seem like instruction following issues, so I hope they'll work on it

ocean vortex Mar 26, 2025, 9:58 PM

#

oblique flint What the hell happened here lol

agree. LMAO

ocean vortex Mar 26, 2025, 9:59 PM

#

oblique flint One notable issue with 2.5 pro that I havent seen mentioned too much is the fact...

Ehmm... I'm not sure what went wrong with your reasoning but 90 vs 98% is really not the thing of "not sure if it's going to overtake sonnet 3.7 for coding if they don't fix these issues for stable release" lol

#

it is a stable release for all intents and purposes. Almost all of the models we are talking about are either beta, research-preview, or experimental, or they chose not to name them this way for marketing reasons

#

Google has the luxury of doing a lot of training (TPUs and their resources) so they iterate and update them frequently. But every public release is still adequately tested, safety aligned and basically made as good and stable as they can make it at that time

keen beacon Mar 26, 2025, 10:05 PM

#

ocean vortex it is a stable release for all intents of purposes. Mostly all of the models we ...

the rate limits for stable models are way higher tho

#

ga models on aistudio

#

exp models have very limited api rate limits

ocean vortex Mar 26, 2025, 10:07 PM

#

well I suppose they label "stable" as the ones they are confident about performing and can allocate the most compute for. But it still doesn't change the fact that "experimental" is mostly just a name and is not rushed nor anything like that. But if it doesn't perform or they have some doubts or they are confident they can improve on it relatively soon they can label it as such

static lark Mar 27, 2025, 1:32 AM

#

https://www.boba.video?ref_id=VFH1QJIGQ

Boba AI

Boba AI Video Editor | Create TikTok, Reels & Shorts Videos

Create stunning short-form videos for TikTok, Instagram Reels, and YouTube Shorts with Boba AI Video Editor. AI-powered video creation made simple.

alpine coral Mar 27, 2025, 1:53 AM

#

keen beacon yea both are wrong i dont have the answer on hand though from what i remember bo...

yeah they're indeed both wrong. 2.5 does get it - after a fair bit of thinking
(note there are 3 solutions; but for all of them the values in the top row [i.e. in the solution screenshot] remain the same)

willow grail Mar 27, 2025, 2:26 AM

#

#

#

keen beacon Mar 27, 2025, 2:41 AM

#

alpine coral yeah they're indeed both wrong. 2.5 does get it - after a fair bit of thinking ...

Btw u input the puzzle wrong that's why there's multiple solutions

#

The three in bottom left subgrid is placed wrong

#

Move it to the right

#

Also 2.5 pro got it wrong there

alpine coral Mar 27, 2025, 3:12 AM

#

i did think it curious that there were multiple solutions lol

#

will circle back later and try again with it properly input to the solver

alpine coral Mar 27, 2025, 3:16 AM

#

keen beacon Also 2.5 pro got it wrong there

it got it right, no? i mean, the multi-solution version I gave it anyway (like its solution is the same as that by the solver; its first solution anyway ha) - or is there a discrepancy b/w the two i overlooked? (i did have another look just now and they seem the same fwiw ha)

#

separately, haven't seen ops in the arena before

#

claims its from OAI when asked, though i'm not sure if that's accurate or hallucinated (inclined towards the latter at this stage)

keen beacon Mar 27, 2025, 3:21 AM

#

alpine coral it got it right, no? i mean, the multi-solution version I gave it anyway (like i...

I guess it's right if you gave it the same wrong puzzle

#

Thought u gave it the right one and just entered the puzzle wrong in the solver

alpine coral Mar 27, 2025, 3:49 AM

#

gotcha (yeah copy pasted the same wrong one into both ha)

kind cloud Mar 27, 2025, 6:31 AM

#

alpine coral claims its from OAI when asked, though i'm not sure if that's accurate or halluc...

Screenshot_2025-03-27-15-28-04-959-edit_com.android.chrome.jpg

#

meta

keen beacon Mar 27, 2025, 6:35 AM

#

another one ☠️

mossy drum Mar 27, 2025, 6:43 AM

#

New model in Arena: deepseek-v3-0324

cedar tide Mar 27, 2025, 7:07 AM

#

alpine coral separately, haven't seen `ops` in the arena before

there is also "cybele" he always says he comes from open ai but you can see that he comes from meta

rigid widget Mar 27, 2025, 7:53 AM

#

ocean vortex this couldn't be further away from the truth. Unless you really think qwq is bet...

Try and see

rigid widget Mar 27, 2025, 7:53 AM

#

ocean vortex And also... contamination is not really the issue. Bigger issue is "private data...

you right I think there should be no hidden database

rigid widget Mar 27, 2025, 7:55 AM

#

ocean vortex that's why you need benchmark with plenty of test questions

That's exactly why I create my own "real and reliable" benchmarks.

rigid widget Mar 27, 2025, 7:57 AM

#

keen beacon generated by 4o

OMG WHERE IS SAFETY ⚠️⚠️⚠️☠️☠️☠️

#

guys I want to create a real and reliable benchmark

#

If you want, you can share tasks with me.

cloud meadow Mar 27, 2025, 8:05 AM

#

rigid widget OMG WHERE IS SAFETY ⚠️⚠️⚠️☠️☠️☠️

https://x.com/schizoretard18/status/1904941858414870612

Sensitive Young Fascist (@schizoretard18) on X

#

The safety in question

#

There is a whole lot more of these

#

Some crazier stuff too

#

https://x.com/seb3point0/status/1904985828524192253

seb3point0 🌶️ (@seb3point0) on X

@willcole

plain zinc Mar 27, 2025, 9:21 AM

#

rigid widget Mar 27, 2025, 9:30 AM

#

plain zinc

I will test it

#

Gemini advanced trial has started

#

I hate that disgusting experimental warning

calm sequoia Mar 27, 2025, 9:48 AM

#

Anyone know the tool that embeds LLM into computer to read software outputs, such as discord, on the device documents, slack and so on?

#

Like Agent

#

I guess it may be too soon

torn mantle Mar 27, 2025, 9:54 AM

#

cedar tide there is also "cybele" he always says he comes from open ai but you can see that...

yea i just got it

#

cybele

#

its from meta

#

its so slow tho

lavish orchid Mar 27, 2025, 10:04 AM

#

cloud meadow https://x.com/schizoretard18/status/1904941858414870612

wild

torn mantle Mar 27, 2025, 10:07 AM

#

idk how to feel about the new meta model

ocean vortex Mar 27, 2025, 10:13 AM

#

cedar tide there is also "cybele" he always says he comes from open ai but you can see that...

who is "he"? 🧐

calm sequoia Mar 27, 2025, 10:14 AM

#

torn mantle its from meta

Thanks! Is it this one? https://www.cybelesoft.com

Cybele Software

Zero Trust Network Access (ZTNA), VDI and PAM Solutions | Cybele So...

Discover Cybele Software's platform offering Universal ZTNA, VDI, Remote Desktop, Cloud Management, and RPAM to enhance security and reduce costs

cedar tide Mar 27, 2025, 10:16 AM

#

ocean vortex who is "he"? 🧐

"cybele"

ocean vortex Mar 27, 2025, 10:17 AM

#

that is not "he" catgrin

cedar tide Mar 27, 2025, 10:17 AM

#

Il not speak good english

torn mantle Mar 27, 2025, 10:18 AM

#

seems like it can search the web

#

but its so slow

#

its kinda good but not sure

rigid widget Mar 27, 2025, 10:23 AM

#

Gemini 2.5 is amazing at math 😂

#

Screenshot_2025-03-27-13-20-27-545_org.mozilla.firefox.png

#

Screenshot_2025-03-27-13-19-19-994_com.kagi.search.png

Screenshot_2025-03-27-13-11-31-926_org.mozilla.firefox.png

torn mantle Mar 27, 2025, 10:39 AM

#

yea cybele is being tested intensively on the arena

hazy quest Mar 27, 2025, 10:45 AM

#

Did 4o solve image generation completely? For realistic pictures, are there still subtle clues? Until now, details like buttons on a shirt or background characters would be slightly off... but now? What are your impressions?

gentle plinth Mar 27, 2025, 10:53 AM

#

hazy quest Did 4o solved image generation completely? For realistic pictures, are there sti...

Is it usable for free? I don't have plus and once tried it but then it said that it's overloaded, bc of too many requests. And now it's only responding with dalle images for some reason

#

I wanted to try if it can do a chess board

#

Which basically all other image-gen models struggle with

leaden meteor Mar 27, 2025, 10:55 AM

#

hazy quest Mar 27, 2025, 10:56 AM

#

Not yet for free

oblique flint Mar 27, 2025, 10:56 AM

#

when gpt 5 releases imo

ocean vortex Mar 27, 2025, 10:56 AM

#

hazy quest Did 4o solved image generation completely? For realistic pictures, are there sti...

the point of native imagen is more of it being able to apply the exact changes that you want directly. The quality itself is not gonna be better than the best existing models out there, but what it can do with img2img and how you can guide it towards exactly what you want is unmatched

#

well and the text depiction is very very good, since it's an LLM first and foremost

#

so things like that will not happen anymore lol

#

since it's generating an image directly itself rather than just making a request for you to another model which has only the most basic language understanding

hidden mauve Mar 27, 2025, 11:10 AM

#

idk what channel is related to https://lmarena.ai/ Leaderboard

#

But do you have any idea why gemini-2.5-pro-exp-03-25 cannot answer long prompts like this? It works with prompts of 1-2 sentences, but throws this error after a token (or character) limit

alpine coral Mar 27, 2025, 11:28 AM

#

not sure i understand.. but inputs are truncated at 12,000 characters (~2,700 tokens) iirc

#

maybe work backwards. like for sure 1 or 2 sentences will work. but if it's currently 500 sentences or whatever, try 400; if that fails, 300 etc

#

also if it's specific to gemini-2.5-pro-exp, try using it directly instead in AI Studio (it's free for now i think)
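
The "work backwards" idea above can be done as a bisection instead of stepping down 100 sentences at a time. A minimal sketch, assuming failures are monotonic (if a prompt fails, every longer one fails too). `fake_send` is a stand-in for the real request and just mimics the 12,000 character cutoff mentioned above; it is not an Arena API.

```python
from typing import Callable, Sequence

def longest_working_prefix(sentences: Sequence[str], works: Callable[[str], bool]) -> int:
    """Return the largest n such that the first n sentences still work.

    Assumes monotonic failures and that the empty prompt counts as working.
    """
    lo, hi = 0, len(sentences)  # invariant: the prefix of length lo works
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if works(" ".join(sentences[:mid])):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Hypothetical stand-in for sending the prompt: anything over 12,000 characters fails.
def fake_send(prompt: str) -> bool:
    return len(prompt) <= 12_000

sentences = [f"This is sentence number {i}." for i in range(1000)]
n = longest_working_prefix(sentences, fake_send)
print(n, len(" ".join(sentences[:n])))
```

With 500 sentences this needs about 9 attempts rather than up to 5 when stepping by 100, but it pins down the exact cutoff rather than a 100-sentence window.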

calm sequoia Mar 27, 2025, 11:36 AM

#

hazy quest Did 4o solved image generation completely? For realistic pictures, are there sti...

No, it messed up all my prompts that use devices instead of real people.

#

alpine coral Mar 27, 2025, 11:46 AM

#

you should add an option for 'dunno' - for us fencesitters lol

rigid widget Mar 27, 2025, 11:53 AM

#

I'm having a hard time with Gemini 2.5 Pro

#

https://gemini.google.com/share/81f717b8642e

Gemini

‎Gemini - Large Number Multiplication Result

Created with Gemini Advanced

#

Screenshot_2025-03-27-15-02-54-805_org.mozilla.firefox.png

Screenshot_2025-03-27-15-01-44-730_org.mozilla.firefox.png

alpine coral Mar 27, 2025, 12:05 PM

#

rigid widget https://gemini.google.com/share/81f717b8642e

i think it doesn't have code execution enabled on the chat app (or if it does, it refuses to use it here). i got the same wrong calculations when trying on gemini.google

#

but using it on ai studio, with code execution enabled (left), it does get it (i think)

alpine coral Mar 27, 2025, 12:13 PM

#

alpine coral i think it doesn't have code execution enabled on the chat app (or if it does, i...

actually.. pretty sure the web app does have it (but doesn't seem naturally inclined to use it..)

Please use your code_execution tool to calculate: 8282838822881918*838282828
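
For reference, this is roughly what a code execution tool buys you here. Python integers are arbitrary precision, so the product is exact; the schoolbook routine below is only an independent cross-check written for illustration, not anything Gemini is known to run.

```python
def long_multiply(a: str, b: str) -> str:
    """Schoolbook multiplication on non-negative decimal digit strings."""
    result = [0] * (len(a) + len(b))  # the product never needs more digits than this
    for i, da in enumerate(reversed(a)):
        for j, db in enumerate(reversed(b)):
            result[i + j] += int(da) * int(db)
    # Propagate carries from the least significant position upwards.
    carry = 0
    for k in range(len(result)):
        total = result[k] + carry
        result[k] = total % 10
        carry = total // 10
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"

a, b = 8282838822881918, 838282828
product = a * b  # exact: Python ints do not overflow or round
assert long_multiply(str(a), str(b)) == str(product)
print(product)
```

The point is that the exact answer costs one line once the tool is actually invoked, whereas digit-by-digit multiplication in the model's head is where the wrong results come from.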

plain zinc Mar 27, 2025, 12:16 PM

#

Guys, try to create a website with my end line: Just don't create all sorts of nonsense just to increase the lines of code. Expand the site in all areas of kss, js and html5, but here's what the text provides, include EVERYTHING in the code. Nothing should cover the text or be an obstacle, so that tariffs, for example, do not exist, but they should be written. All in one html code. Write the code right away without any unnecessary questions.

#

It gave me 2k lines of code when I directly requested it with my line

rigid widget Mar 27, 2025, 12:17 PM

#

alpine coral actually.. pretty sure the web app does have it (but doesn't seem naturally incl...

it did it right when I said it like that. But why doesn't it do it by itself? It's a shame for the people who pay money for Gemini-Advanced 😬

alpine coral Mar 27, 2025, 12:18 PM

#

yeah i dunno.. seems more like a wrinkle than a major flaw

rigid widget Mar 27, 2025, 12:18 PM

#

plain zinc He gave me 2k lines of code when he directly requested it with my line

it gave me 63k token for a svg

#

😂 but not working

brittle tiger Mar 27, 2025, 12:19 PM

#

https://www.reddit.com/r/Bard/s/U1ieYcF8vo pretty crazy from gem 2.5

From the Bard community on Reddit: Final Version of The Torus knot ...

Explore this post and more from the Bard community

rigid widget Mar 27, 2025, 12:19 PM

#

This was seriously not what I expected for "best model"

rigid widget Mar 27, 2025, 12:19 PM

#

plain zinc Guys, try to create a website with my end line: Just don't create all sorts of n...

let me try

alpine coral Mar 27, 2025, 12:19 PM

#

it is important imo that LLMs use tools for complex calculations rather than undertaking them themselves, don't get me wrong.. but yeah it's a wrinkle that can be ironed out

#

4o does it very well these days

brittle tiger Mar 27, 2025, 12:19 PM

#

rigid widget it gave me 63k token for a svg

I have same error with svg where it starts doing hundreds or thousands of small lines when it could just do a big one

plain zinc Mar 27, 2025, 12:27 PM

#

rigid widget let me try

Then show the result and give your feedback.

rigid widget Mar 27, 2025, 12:30 PM

#

it's better for one shot prompt

rigid widget Mar 27, 2025, 12:31 PM

#

rigid widget it's better for one shot prompt

sorry for background sounds 😂

#

but just adding "All in one html code" does better for me

rigid widget Mar 27, 2025, 12:48 PM

#

paid Gemini version is wayyyy worse than free aistudio version

#

Screenshot_2025-03-27-15-47-30-420_org.mozilla.firefox.png

Screenshot_2025-03-27-15-47-51-449_org.mozilla.firefox.png

#

always produces shorter codes

olive mesa Mar 27, 2025, 12:54 PM

#

rigid widget

we need to wait a couple months to get models visually aware enough to do things like that

#

well maybe if you retry a bit it can do it

#

models we have rn can make very simple 2d ascii animations also

alpine coral Mar 27, 2025, 12:56 PM

#

rigid widget paid Gemini version is wayyyy worse than free aistudio version

it has way more guardrails

#

i'm not saying i agree with them / think they're good or whatever

#

but it's like one of the biggest companies in the world

#

is it really surprising that their main consumer chat ui refuses to make pornhub.com clone..?

olive mesa Mar 27, 2025, 12:58 PM

#

leaden meteor

there's a new state-of-the-art llm every 1-3 weeks lmao

alpine coral Mar 27, 2025, 12:58 PM

#

it's a dumb way to look at a model's capabilities anyway

rigid widget Mar 27, 2025, 12:58 PM

#

alpine coral is it really surprising that their main consumer chat ui refuses to make pornhub...

aistudio don't refuse?

alpine coral Mar 27, 2025, 12:58 PM

#

most people aren't doing erotica rp

#

(though way more than i would ever have thought apparently do lol)

rigid widget Mar 27, 2025, 12:59 PM

#

it's not an erotica issue

alpine coral Mar 27, 2025, 1:00 PM

#

i just mean the models provided by big western companies are gonna be milquetoast - they don't wanna 'alienate advertisers' or whatever.. it's not about spreading woke.. it's just them being risk averse

rigid widget Mar 27, 2025, 1:00 PM

#

#

#

Don't side with companies

#

They are not right

alpine coral Mar 27, 2025, 1:01 PM

#

yeah create a thread or something - let's not clog this up with whatever it is exactly that we're discussing

#

last thing i'll say.. i'm honestly surprised (and grateful) how much oai has pivoted away from this super risk averse approach to content moderation lately (it seems to me anyway). but yeah google be google..

rigid widget Mar 27, 2025, 1:03 PM

#

olive mesa we need to wait a couple months to get models visually aware enough to do things...

Dude, it drew a sad face?

alpine coral Mar 27, 2025, 1:05 PM

#

it drew a smiley one for me.. i don't get the value in selectively trying to show that gem pro 2.5 is somehow not good

#

there will always be stuff llms fail on

olive mesa Mar 27, 2025, 1:05 PM

#

alpine coral it drew a smiley one for me.. i don't get the value in selectively trying show t...

that happens to me too

olive mesa Mar 27, 2025, 1:06 PM

#

rigid widget

it's a frown and the mouth is more up.. you told it to try its best

rigid widget Mar 27, 2025, 1:06 PM

#

alpine coral it drew a smiley one for me.. i don't get the value in selectively trying show t...

Not drawing, converting images to svg

alpine coral Mar 27, 2025, 1:07 PM

#

alpine coral there will always be stuff llms fail on

but gem pro 2.5 is notably good - that's like a consensus.. it doesn't mean it's agi or anything.. but this new V3 etc.. i mean they're different worlds

rigid widget Mar 27, 2025, 1:07 PM

#

alpine coral it drew a smiley one for me.. i don't get the value in selectively trying show t...

I didn't say it was bad

#

it just acts stupid in some situations

#

I expected it to be better

alpine coral Mar 27, 2025, 1:08 PM

#

fair enough

#

just feels like you're like looking for edge cases where it fails and emphasising them

#

but what i describe as edge cases are just things you've tested (and expected better results) - that's totally fair

alpine coral Mar 27, 2025, 1:10 PM

#

rigid widget Don't drawing converting images to svg

if i understand correctly, that is what i asked https://g.co/gemini/share/33015c25ac37

rigid widget Mar 27, 2025, 1:11 PM

#

Same goes for Claude.

Screenshot_2025-03-27-16-11-03-064_com.kagi.search.png

alpine coral Mar 27, 2025, 1:12 PM

#

yeah again, because the 'website' is 'pornhub' lol

rigid widget Mar 27, 2025, 1:13 PM

#

No it's not

alpine coral Mar 27, 2025, 1:13 PM

#

though actually, maybe it's actually anti-copyright infringement guardrails 🤷‍♂️

rigid widget Mar 27, 2025, 1:13 PM

#

it's aistudio screen

alpine coral Mar 27, 2025, 1:14 PM

#

ohh i thought it was anthropic

rigid widget Mar 27, 2025, 1:14 PM

#

screenshot was from aistudio

#

Screenshot_2025-03-27-16-14-57-518_org.mozilla.firefox.png

alpine coral Mar 27, 2025, 1:15 PM

#

make a thread

rigid widget Mar 27, 2025, 1:18 PM

#

alpine coral if i understand correctly, that is what i asked https://g.co/gemini/share/33015...

it's not same prompt by the way

alpine coral Mar 27, 2025, 1:22 PM

#

yes. i used fewer words to describe the same task.. tbh the more relevant difference i think is that i cropped the image from the screenshot you provided (i.e. they were working off different inputs)

#

but again.. this just feels like an exercise to prove the model is not that great.. when we're just highlighting flaws

#

i see V3 has been added to direct chat

#

rigid widget Mar 27, 2025, 1:30 PM

#

alpine coral but again.. this just feels like an exercise to prove the model is not that grea...

because i get excited and use it, but i get disappointed

rigid widget Mar 27, 2025, 1:30 PM

#

alpine coral

it's fast?

timber kiln Mar 27, 2025, 1:31 PM

#

alpine coral most people aren't doing erotica rp

Before Cline almost all usage on openrouter was that...
And most local llm stuff...

alpine coral Mar 27, 2025, 1:31 PM

#

timber kiln Before Cline almost all usage on openrouter was that... And most local llm stuff...

yeah don't get me wrong, i've been genuinely surprised

#

like the usage around those use cases is non-trivial

#

i'm just saying i don't think that's what google has in mind when they're deploying models to their main chat ui

rigid widget Mar 27, 2025, 1:33 PM

#

So why can't AI pretend to be a serial killer?

#

Aren't there serial killers in movies and books?

#

Is it wrong to talk about serial killers?

alpine coral Mar 27, 2025, 1:33 PM

#

timber kiln Before Cline almost all usage on openrouter was that... And most local llm stuff...

oai's tack towards more liberal restrictions on content generation i think also reflects this demand

calm sequoia Mar 27, 2025, 1:34 PM

#

@wooden mulch Please make separate channels for mathematicians and for creative writers 🙂

alpine coral Mar 27, 2025, 1:35 PM

#

or just a separate creative writing one tbh

#

but yeah either way i like that idea ha

calm sequoia Mar 27, 2025, 1:36 PM

#

I really care about news regarding coding and math but this "censorship" and "I draw a smile" noise is just too much (sorry, different use cases)

alpine coral Mar 27, 2025, 1:36 PM

#

yeah i'm with you there

#

and i don't care for or have use cases around coding or STEM

#

i just need LLMs to comprehend / parse stuff

#

censorship is generally irrelevant

#

ASCII art or image > SVG is utterly irrelevant to my actual use cases.. (though fwiw i think ascii art can be kinda useful.. not for art / coding, more demonstrating geometrical awareness kinda thing)

#

and yeah creative writing is totally irrelevant to anything i do with LLMs lol

rigid widget Mar 27, 2025, 1:41 PM

#

calm sequoia @wooden mulch Please make separate channels for mathematicians and for c...

it's really good idea

#

Math, Coding, Creative Writing channels

rigid widget Mar 27, 2025, 1:42 PM

#

alpine coral censorship is generally irrelevant

censorship is always a problem

alpine coral Mar 27, 2025, 1:43 PM

#

i use LLMs for work.. it isn't for what i need them to do

rigid widget Mar 27, 2025, 1:43 PM

#

You're lucky that it hasn't affected you yet.

alpine coral Mar 27, 2025, 1:43 PM

#

i've banged on about chinese models being propaganda parrots - censorship is indeed a concern

rigid widget Mar 27, 2025, 1:44 PM

#

alpine coral i've banged on about chinese models being propaganda parrots - censorship is ind...

not just Chinese models

alpine coral Mar 27, 2025, 1:44 PM

#

but that's only from a high level sense; i don't use chinese models so really why should i care lol

#

yeah western models are woke i know i know

#

create a thread

#

i think most people wanna discuss the Arena and models ( as well as the AI space generally, which includes censorship but it's a niche detail, in practical terms )

rigid widget Mar 27, 2025, 1:46 PM

#

alpine coral create a thread

title? 🙂

calm sequoia Mar 27, 2025, 1:46 PM

#

Sometimes I feel that @rigid widget is a Chinese spy or DeepSeek itself 😄

rigid widget Mar 27, 2025, 1:47 PM

#

calm sequoia Sometimes I feel that @rigid widget is a Chinese spy or DeepSeek itsel...

Dude I tried all the models

alpine coral Mar 27, 2025, 1:47 PM

#

he's joking

rigid widget Mar 27, 2025, 1:47 PM

#

it is the least censored on all subjects "outside of China"

#

Also, no matter how much I criticize Gemini, it is much, much better than Claude.

calm sequoia Mar 27, 2025, 1:53 PM

#

Then write "Taiwan is a country" to this chat. And, if you're brave, write "Xi is a political clown and a cartoon bear"

#

https://tenor.com/view/looking-at-wrist-watch-wrist-watch-time-passing-by-late-appointment-concerned-gif-3217407494617679420

Tenor

rigid widget Mar 27, 2025, 1:54 PM

#

Left V3, right V3 0324

#

How's that?

calm sequoia Mar 27, 2025, 1:54 PM

#

No no, you write this

#

To this chat

rigid widget Mar 27, 2025, 1:54 PM

#

No

#

it is writr

#

can we share OpenRouter chats?

calm sequoia Mar 27, 2025, 1:55 PM

#

OMG you really are a Chinese LLM bot 🤣 GOLD

rigid widget Mar 27, 2025, 1:57 PM

#

Prompt is: "I don't know anything about Taiwan-China, give me a brief summary but by supporting Taiwan"

lime coral Mar 27, 2025, 1:58 PM

#

Google is cooking https://x.com/bedros_p/status/1905252764461965615?s=46

Bedros Pamboukian (@bedros_p) on X

aint no way.
models/gemini-coder-1

rigid widget Mar 27, 2025, 1:59 PM

#

if you don't believe, try it yourself

#

is this real?

brittle tiger Mar 27, 2025, 2:05 PM

#

lime coral Google is cooking https://x.com/bedros_p/status/1905252764461965615?s=46

What could this be?

brittle tiger Mar 27, 2025, 2:06 PM

#

rigid widget is this real?

I think it's just mockup. They will piss ppl off if they remove customization

rigid widget Mar 27, 2025, 2:07 PM

#

brittle tiger I think it's just mockup. They will piss ppl off if they remove customization

What can we do if they do?

brittle tiger Mar 27, 2025, 2:08 PM

#

rigid widget What can we do if they do?

My guess is customization of temp, grounding, etc is available under dashboard tab but it will still rly piss ppl off I think

#

They should just make Gemini app better. AI studio is fine

rigid widget Mar 27, 2025, 2:12 PM

#

brittle tiger They should just make Gemini app better. AI studio is fine

Even if I pay for Gemini Advanced, I would still prefer using AI Studio.

ocean vortex Mar 27, 2025, 2:16 PM

#

rigid widget Even if I pay for Gemini Advanced, I would still prefer using AI Studio.

I'm literally paying and never use advanced. See absolutely 0 point in it, was only paying for storage. But my new drive just arrived so just need to download everything and cancel 👀

timber kiln Mar 27, 2025, 3:09 PM

#

ocean vortex I'm literally paying and never use advanced. See absolutely 0 point in it only w...

Much better for average user especially mobile UI and gemini live stuff (for people interested)
They gotta improve deep research for power users tho

barren prairie Mar 27, 2025, 3:30 PM

#

brittle tiger They should just make Gemini app better. AI studio is fine

I think I am the only one who uses the Gemini app and not the ai studio 😆 but Gemini on the app improved so much from December honestly ..

brittle tiger Mar 27, 2025, 3:47 PM

#

Ppl starting to pick up on 2.5 power as anime fest dies down

rigid widget Mar 27, 2025, 4:01 PM

#

barren prairie I think I am the only one who Uses the Gemini app and not the ai studio 😆 but G...

yes, before that it was forgetting the context in the 2nd prompt

rigid widget Mar 27, 2025, 4:01 PM

#

brittle tiger Ppl starting to pick up on 2.5 power as anime fest dies down

what is anime fest?

calm sequoia Mar 27, 2025, 4:24 PM

#

Are thinking models better for translation or base models?

rigid widget Mar 27, 2025, 4:34 PM

#

calm sequoia Are thinking models better for translation or base models?

for me base models

brittle tiger Mar 27, 2025, 4:40 PM

#

https://x.com/OfficialLoganK/status/1905298490865160410

Logan Kilpatrick (@OfficialLoganK) on X

We are seeing a huge amount of demand for Gemini 2.5 Pro right now and are laser focused on getting higher rate limits into the hands of developers ASAP.

This is the #1 priority right now, stay tuned : )

barren prairie Mar 27, 2025, 4:57 PM

#

rigid widget yes, before that it was forgetting the context in the 2nd prompt

And it was so dumb ... And he will always recite his life story instead of answering you 🤣😂 I am Gemini a large model trained by google and blah blah... Never read a document correctly....never understand me ... 🤨
Arena chatbot was giving me hope that Gemini will be useful in the future on the app 😂😂

teal mantle Mar 27, 2025, 5:03 PM

#

what is rage?

rigid widget Mar 27, 2025, 5:12 PM

#

barren prairie And it was so dump ... And he will always recite his story life instead of answe...

I hope they make Gemini Advanced worth paying for

earnest parcel Mar 27, 2025, 5:13 PM

#

Tested Gemini 2.5 Pro Experimental 03-25:

Average-verbose reasoning model with around 5.4x token use of a traditional model, clocking in around DeepSeek-R1 level token usage. Far less verbose than models such as o3-mini-high or Sonnet Thinking.

#1 Reasoning/Logic segment, surpassing GPT-4.5 Preview
#1 in Code segment, surpassing GPT-4.5 Preview
STEM and math were competent, but nowhere near top, in my testing
Overall utility for miscellaneous casual tasks was fine, but not outstanding

I really enjoyed testing this model. It's very capable, but still shows flaws in certain areas. As always: YMMV!

olive mesa Mar 27, 2025, 5:26 PM

#

brittle tiger https://x.com/OfficialLoganK/status/1905298490865160410

yess

torn mantle Mar 27, 2025, 5:37 PM

#

https://x.com/Presidentlin/status/1905297519531188376

Lincoln 🇿🇦 (@Presidentlin) on X

Looks like we are getting QVQ-Max today

keen beacon Mar 27, 2025, 5:38 PM

#

theyre still working on the qwen3 pull request even though its past midnight for them

#

seems they are absolutely swamped with work

#

qwen 3 🙏

upper wolf Mar 27, 2025, 5:50 PM

#

Gemini’s markdown is not always displayed correctly (when using google’s website, lmarena it’s fine) it pmo so much

scarlet flint Mar 27, 2025, 6:14 PM

#

is there any question that only a thinking model can solve?

north vale Mar 27, 2025, 6:26 PM

#

Not in theory but in practice yes, questions that rely on reasoning

scarlet flint Mar 27, 2025, 6:44 PM

#

can you give me one?

novel flame Mar 27, 2025, 6:44 PM

#

What is known about Qwen 3 at this point? It feels like we got Qwen2.5-Max just a few weeks ago, and that one surprised me with a perfect score on my personal micro-benchmark (a feat so far only matched by GPT-4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet, o1, DeepSeek-R1, Grok-3 Reasoning, and Gemini 2.5 Pro). So if they've made it substantially better then....... holy cow.

(and yes, I need to revise and make my benchmark harder, because Qwen2.5-Max's perfect score is not equal to Gemini 2.5 Pro's perfect score.)

leaden meteor Mar 27, 2025, 6:47 PM

#

Reposting poll "How long before Gemini 2.5 Pro is overtaken on Arena leaderboard?" for more votes
#general message

blazing rune Mar 27, 2025, 6:51 PM

#

novel flame What is known about Qwen 3 at this point? It feels like we got Qwen2.5-Max just ...

idk, it probably won't be comparable to Qwen 2.5 Max, simply because the models are (presumably) much smaller than the size of Qwen 2.5 Max

#

For example, the Qwen 3 MoE model is apparently 15B total parameters and 2B active parameters.

#

the default config in the code they made to prepare for the launch implies a dense 8B model

keen beacon Mar 27, 2025, 6:53 PM

#

blazing rune For example, the Qwen 3 MoE model is apparently 15B total parameters and 2B acti...

there could be more than one

blazing rune Mar 27, 2025, 6:53 PM

#

but I guess in the past, the default configs in the code weren't accurate

#

Like the default config for Qwen 2 has no GQA, but the actual models do have GQA

blazing rune Mar 27, 2025, 6:53 PM

#

keen beacon there could be more than one

I hope so

#

but previously even when they did release MoE models, it was only 1

#

Like Qwen 1.5 MoE (which was terrible iirc)

keen beacon Mar 27, 2025, 6:54 PM

#

but i think the 15b moe will be very very good 🤔 even if its not qwen 2.5 max level

blazing rune Mar 27, 2025, 6:54 PM

#

same

#

I hope it's at least as good as Qwen 2.5 14b

#

because I only have 10GB of VRAM, I can only do a terrible quant of 14b or a higher quant but put some of the layers on the CPU

#

and I will definitely need to offload the Qwen 3 MoE onto my CPU, but it will be much better than 5 TPS (because it only uses 2b parameters)

#

I need to check how fast Qwen 2.5 3B is on my CPU

keen beacon Mar 27, 2025, 6:57 PM

#

if you do multiple requests/batching i believe you start to lose the speed benefits of moe though
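One way to see why batching eats into the MoE speedup: each token only activates a few experts, but different tokens in a batch tend to pick different ones, so more of the expert weights have to be read per step. A minimal sketch, assuming uniform and independent routing (real routers are not uniform) and a hypothetical 64-expert, top-8 layout that is not taken from any Qwen config:

```python
def expected_experts_touched(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts hit by a batch in one MoE layer.

    Assumes every token picks top_k distinct experts uniformly at random,
    independently of the other tokens.
    """
    p_untouched = (1 - top_k / num_experts) ** batch_tokens
    return num_experts * (1 - p_untouched)


NUM_EXPERTS = 64  # hypothetical
TOP_K = 8         # hypothetical

for batch in (1, 2, 4, 8, 16, 32):
    touched = expected_experts_touched(NUM_EXPERTS, TOP_K, batch)
    print(f"batch={batch:2d}: ~{touched:.1f} of {NUM_EXPERTS} experts read")
```

With a single request only top_k experts are read per layer; as the batch grows the count approaches the full expert set, so the per-step memory traffic starts to look like a dense model of the total size.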

blazing rune Mar 27, 2025, 6:57 PM

#

ah

#

well, I don't have Qwen 2.5 3B, but I do have Llama 3.2 3B, so I will test that one

#

it's pretty similar in architecture

keen beacon Mar 27, 2025, 6:58 PM

#

anonymous-chatbot released

blazing rune Mar 27, 2025, 6:58 PM

#

keen beacon anonymous-chatbot released

We need benchmarks

keen beacon Mar 27, 2025, 6:58 PM

#

blazing rune it's pretty similar in architecture

i dont remember but i think qwen 2.5 3b is smaller

blazing rune Mar 27, 2025, 6:58 PM

#

yeah

keen beacon Mar 27, 2025, 6:58 PM

#

non embedding i think

#

besides the moe, im really curious about the 32b equivalent

barren prairie Mar 27, 2025, 6:59 PM

#

keen beacon anonymous-chatbot released

Anonymous chatgpt bot

blazing rune Mar 27, 2025, 7:01 PM

#

keen beacon i dont remember but i think qwen 2.5 3b is smaller

Llama 3.2 3B (only CPU) at F16 (in GGUF format) is about 3 tokens per second, so that is worse than I thought

#

but the MoE at q6_k quantization should be much faster
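A back-of-envelope check of the "much faster" guess, assuming CPU decoding is limited by how many weight bytes are read per token. The 3.2e9 parameter count for Llama 3.2 3B and the ~6.56 bits per weight for q6_k are approximate figures used as assumptions here, and the 2B active parameters come from the rumored config discussed above:

```python
def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Weight bytes read to produce one token (bandwidth-bound model)."""
    return active_params * bits_per_weight / 8


def tokens_per_second(bandwidth_bytes_s: float, active_params: float,
                      bits_per_weight: float) -> float:
    return bandwidth_bytes_s / bytes_per_token(active_params, bits_per_weight)


# Calibrate an effective bandwidth from the measurement in the chat:
# Llama 3.2 3B at F16 ran at about 3 tokens per second on this CPU.
LLAMA_PARAMS = 3.2e9  # assumption: approximate parameter count
MEASURED_TPS = 3.0
effective_bw = MEASURED_TPS * bytes_per_token(LLAMA_PARAMS, 16)

# Project onto a MoE with 2B active parameters at roughly q6_k precision.
MOE_ACTIVE_PARAMS = 2e9  # rumored, not confirmed
Q6_K_BPW = 6.5625        # assumption: approximate bits per weight
moe_tps = tokens_per_second(effective_bw, MOE_ACTIVE_PARAMS, Q6_K_BPW)

print(f"effective bandwidth: {effective_bw / 1e9:.1f} GB/s")
print(f"projected MoE speed: {moe_tps:.1f} tokens/s")
```

This ignores attention and KV-cache reads, router overhead, and prompt processing, so it is an upper-bound style estimate rather than a prediction.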

blazing rune Mar 27, 2025, 7:01 PM

#

keen beacon besides the moe, im really curious about the 32b equivalent

I hope they have a MoE with ~30b total params

keen beacon Mar 27, 2025, 7:02 PM

#

blazing rune I hope they have a MoE with ~30b total params

ya they should spam moes instead of dense i think

blazing rune Mar 27, 2025, 7:02 PM

#

yeah

#

Even if the MoE models are slightly worse in performance, it's so much better on CPUs

keen beacon Mar 27, 2025, 7:03 PM

#

i hope they use mla too, but from skimming the pr it doesnt seem that qwen 3 uses it

#

the 8b doesn't really seem compelling with the 15b moe existing

ocean vortex Mar 27, 2025, 7:11 PM

#

keen beacon anonymous-chatbot released

this is such a ridiculous release in the context of what their competition is doing recently. Basically beating a dead horse lmfao

keen beacon Mar 27, 2025, 7:13 PM

#

fr

barren prairie Mar 27, 2025, 7:15 PM

#

The fewer emojis 😂😂😂😂😂 but who cares 🤨🤨😵‍💫😵‍💫

torn mantle Mar 27, 2025, 7:18 PM

#

https://x.com/JustinLin610/status/1905327182572126536

Junyang Lin (@JustinLin610) on X

Hello QVQ-Max!

#

https://x.com/JustinLin610/status/1905329896194539913

Junyang Lin (@JustinLin610) on X

Not ready for oss now and it is still evolving. but will be accessible through Qwen Chat. Will announce it soon. A small release

leaden meteor Mar 27, 2025, 7:20 PM

#

keen beacon anonymous-chatbot released

Any chance of o1 pro dropping soon?

#

on leaderboard

keen beacon Mar 27, 2025, 7:26 PM

#

given how much it costs i highly doubt it'll come to the arena

leaden meteor Mar 27, 2025, 7:30 PM

#

OpenAI might eat the cost to showcase their best model... at least for a few thousand votes...

barren prairie Mar 27, 2025, 7:35 PM

#

leaden meteor OpenAI might eat the cost to showcase their best model.....atleast for few thous...

The problem is no one will wait 5 min for an answer 😁 you will feel that a problem happened and refresh the page

ocean vortex Mar 27, 2025, 7:36 PM

#

leaden meteor OpenAI might eat the cost to showcase their best model.....atleast for few thous...

they have nothing to gain and everything to lose. Now they have the benefit of the doubt but it is very unlikely it would be ranked #1, probably not even #2 tbh

thorny drum Mar 27, 2025, 7:39 PM

#

it would be ranked dead last lol

#

i dont think i have the attention span to wait for it to respond

ocean vortex Mar 27, 2025, 7:40 PM

#

thorny drum i dont think i have the attention span to wait for it to respond

you wouldn't know which is slower since both responses will be outputted at the same time now. But yeah you would notice it is taking ages lmao

thorny drum Mar 27, 2025, 7:40 PM

#

yeah for sure

#

i realized that when i sent the message lol

#

but i feel like its not a very good fit for the arena

#

could say i'd vote 'both bad' every time

#

so it would be ranked exactly in the middle

north vale Mar 27, 2025, 7:48 PM

#

if you wanted to make it rank in the middle you'd prolly want to vote a is better or b is better at random

#

bc wouldn't a tie just cancel the vote
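On the tie question: in a plain Elo update a tie is scored as 0.5 for both sides, so it only leaves ratings unchanged when the two models are already rated equally. The sketch below uses textbook Elo with an arbitrary K of 4; the arena's actual rating fit and how it treats "tie" or "both bad" votes may differ, so treat this as an illustration only:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo expected score of A against B (between 0 and 1)."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 4.0):
    """Return new (r_a, r_b). score_a is 1 for an A win, 0.5 tie, 0 loss."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# A tie between unequal models pulls the two ratings toward each other.
low, high = elo_update(1200.0, 1400.0, score_a=0.5)
print(f"after tie: {low:.2f} vs {high:.2f}")

# A tie between equally rated models changes nothing.
same_a, same_b = elo_update(1300.0, 1300.0, score_a=0.5)
print(f"after tie at equal rating: {same_a:.2f} vs {same_b:.2f}")
```

Under this model, always tying and voting A/B at random both give an average score of 0.5, so in expectation either habit drags a model toward the average of its opponents; random votes just add more noise.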

raven void Mar 27, 2025, 8:21 PM

#

as expected 🫡

torn mantle Mar 27, 2025, 8:36 PM

#

raven void as expected 🫡

wdym?

#

is it that good?

keen beacon Mar 27, 2025, 8:37 PM

#

it's good creatively but that's about it

#

it's def not better than grok 3

#

classic oai focusing on style

wary oasis Mar 27, 2025, 8:46 PM

#

is new deepseek update going to be tested? curious to see how it stacks up

north vale Mar 27, 2025, 8:50 PM

#

pretty sure it's in the arena

#

just needs enough votes

timber kiln Mar 27, 2025, 9:06 PM

#

wary oasis is new deepseek update going to be tested? curious to see how it stacks up

They don't host the old model when they release the new one
So when you get deepseek on the arena its the new model

barren prairie Mar 27, 2025, 9:21 PM

#

raven void

Wait a bit until it gets enough votes. grok3 was 1412 and gpt4.5 was 1411, then ....

torn mantle Mar 27, 2025, 10:24 PM

#

i think cybele is really the latest llama 4 checkpoint

#

ive seen it against the big boys

#

like vs gpt4o latest
vs sonnet 3.7
vs grok 3

#

cohere if im not wrong

#

idk

#

if its dumb or not

#

it gave me references for one of my prompts

#

so i thought it had internet access

#

but when i asked, it told me it doesnt

#

so its either hallucination or something else

keen beacon Mar 27, 2025, 10:25 PM

#

torn mantle cohere if im not wrong

it's meta

ancient reef Mar 27, 2025, 10:38 PM

#

new llama model with emojis reminds me of deepseek (cybele)

torn mantle Mar 27, 2025, 10:38 PM

#

keen beacon it's meta

wow

#

its so bad

#

they really are just scaling the models blindly

#

what am i looking at exactly...

#

themis?

#

this is from meta too

elder rapids Mar 27, 2025, 10:46 PM

#

wary oasis is new deepseek update going to be tested? curious to see how it stacks up

bruh its been hallucinating like crazy

#

its good when it gets stuff right

#

but its struggling so much for basic stuff

torn mantle Mar 27, 2025, 10:53 PM

#

some models did

#

i did

#

it wasnt

raven void Mar 27, 2025, 10:58 PM

#

haven't seen many Meta AI researchers (or devrel) on Twitter tbh

ocean vortex Mar 27, 2025, 11:19 PM

#

raven void as expected 🫡

yeah... Models like that are what make lmarena look like sht to be brutally honest. It can be rigged and they have a free pass with this model, with user preference being the sole thing that matters

pulsar tendon Mar 27, 2025, 11:23 PM

#

https://x.com/flavioad/status/1905347584438251848?s=46

Flavio Adamo (@flavioAd) on X

OpenAI just updated GPT-4o

I tested the old vs new and the difference is actually wild

ocean vortex Mar 27, 2025, 11:23 PM

#

I would maybe even go as far as to say text-only performance of gpt4o does not belong in 2025. As is quite evident by all the other metrics

pulsar tendon Mar 27, 2025, 11:23 PM

#

There’s a noticeable difference in its coding ability.

ocean vortex Mar 27, 2025, 11:25 PM

#

pulsar tendon There’s a noticeable difference in its coding ability.

extremely unlikely. More of "vibe" thing or a fluke. If there was anything to show they would have shown livecodebench, swe-bench, humaneval or anything else

#

if you actually look at the evals they did publish... I don't think gpt4o coding got better even once since its release. And with chatgpt-latest that's even less on their radar as they don't publish these numbers at all lol

keen beacon Mar 27, 2025, 11:31 PM

#

Why is it unlikely they updated the base model

ocean vortex Mar 27, 2025, 11:31 PM

#

it's unlikely coding got better

#

not that they updated it lol

keen beacon Mar 27, 2025, 11:33 PM

#

ocean vortex it's unlikely coding got better

They continued pre training of it to update the cut off and capabilities, I find it plausible for notable improvements to be made on a better base model

#

It's not like before December where 4o was just tuned and tuned again

ocean vortex Mar 27, 2025, 11:33 PM

#

keen beacon They continued pre training of it to update the cut off and capabilities, I find...

they would have shown it if that was the case

keen beacon Mar 27, 2025, 11:34 PM

#

ocean vortex they would have shown it if that was the case

Even if there are improvements, it's unlikely to match up with other models

#

It's a lose lose situation for them

ocean vortex Mar 27, 2025, 11:35 PM

#

keen beacon It's a lose lose situation for them

not really. Merely updating their internal metrics only people "in the know" check wouldn't have made anything worse

#

what's bad are the current metrics catgrin

#

besides it adds up with it consistently scoring high on lmarena ever since they started doing this "chatgpt-latest" thing

#

that doesn't have any other evals

#

like it's relatively easy to do if that's all you care about

verbal nimbus Mar 27, 2025, 11:38 PM

#

DeepSeek's V3 0324 already scores higher than R1 in SWE-bench?
https://www.reddit.com/r/LocalLLaMA/comments/1jjusya/deepseek_v3_0324_got_388_swebench_verified_w/

From the LocalLLaMA community on Reddit: Deepseek V3 0324 got 38.8%...

Explore this post and more from the LocalLLaMA community

verbal nimbus Mar 27, 2025, 11:39 PM

#

pulsar tendon https://x.com/flavioad/status/1905347584438251848?s=46

Plot twist: They just used DeepSeek V3 0324 🤣

ocean vortex Mar 27, 2025, 11:44 PM

#

pulsar tendon https://x.com/flavioad/status/1905347584438251848?s=46

I don't think this means all that much... When I tried making something similar I got vastly different results merely by retrying the same thing. Both kinda do look like attempts from the same model having that in mind... 🧐

keen beacon Mar 27, 2025, 11:46 PM

#

just anecdotally, with the new continued pretrained version (you can tell when the cut off was updated), it started matching 1206/sonnet 3.5 in some 'base model' tests i made

ocean vortex Mar 27, 2025, 11:48 PM

#

I just hate this whole "vibe coding" or "it performs better because we said so" thing. Seems like something made up for models that don't perform. If it does perform then they wouldn't have problems doing proper evals with it LOL

keen beacon Mar 27, 2025, 11:49 PM

#

ik u dont like livebench but i just noticed this:

ocean vortex Mar 27, 2025, 11:49 PM

#

keen beacon just anecdotally, with the new continued pretrained version (you can tell when t...

anecdotally 👀

keen beacon Mar 27, 2025, 11:49 PM

#

livebench only measures rote reasoning in my experience, but i found it curious

keen beacon Mar 27, 2025, 11:50 PM

#

ocean vortex anecdotally 👀

it is a part of my personal evals 🤷

#

i just dont have a lot of questions on it (testing that specific stuff), so i might just need more questions for a more reliable indicator

keen beacon Mar 27, 2025, 11:53 PM

#

keen beacon ik u dont like livebench but i just noticed this:

im surprised it lines up more than i thought lol

ocean vortex Mar 27, 2025, 11:53 PM

#

keen beacon ik u dont like livebench but i just noticed this:

gpt-4o-2024-08-06 didn't score that much lower there. Though this is livebench so not sure if reading too much into it would be reasonable

plain zinc Mar 27, 2025, 11:54 PM

#

raven void

Gemini 2.5 Pro - 👑🌌

ocean vortex Mar 27, 2025, 11:54 PM

#

it would mean earlier chatgpt-latest before this one had barely any improvement at all over the much cheaper 0806 lol

#

and yet it was still at the top (in top3?) at the time on lmarena

keen beacon Mar 27, 2025, 11:55 PM

#

ocean vortex it would mean earlier chatgpt-latest before this one had barely any improvement...

at least what is measured but i definitely noticed something was different with the new continued pretrained versions

plain zinc Mar 27, 2025, 11:55 PM

#

Gemini 2.5 Pro Took the FIRST place in THREE places: https://x.com/patloeber/status/1905333725698666913

Patrick Loeber (@patloeber) on X

🏆Gemini 2.5 Pro is currently

- #1 on LMArena
- #1 on Livebench
- #1 across SEAL leaderboards

Also starts becoming the top choice for coding tasks :)

Our teams are working hard on getting everyone higher rate limits🚀

leaden palm Mar 27, 2025, 11:56 PM

#

i'm betting they RLd on benchmark like tasks

keen beacon Mar 27, 2025, 11:57 PM

#

ocean vortex and yet it was still at the top (in top3?) at the time on lmarena

yeah imho well it is stronger but that doesnt mean it deserves that spot

#

theyre doing smthing strange with the new continued pretrained version too

#

december version had a lot of knowledge gaps (post oct 2023), with the next version it was noticeably better in recent events. so they must've kept continued pretraining the model and did a quick instruct tune to release

#

i think these recent versions weren't an elaborate instruct tune and they were waiting for it to continue pretraining (doing instruct tunes on checkpoints)

ocean vortex Mar 27, 2025, 11:59 PM

#

and like I surely can't be the only one noticing that they are cannibalising all their reasoning models on lmarena. If chatgpt-latest really was this good then it must also mean o1 and o3 are sht unsuccessful experiments. Those 2 things are not compatible with one another. And we do know for a fact o1 and o3 are good models with much more data to back it up

#

so yeah... things simply do not add up for it to have a legitimate score at the top

keen beacon Mar 28, 2025, 12:01 AM

#

ocean vortex and like I surely can't be the only one noticing that they are cannibalising all...

lmsys isnt meant to be end all be all benchmark tho

#

what lmsys is measuring, chatgpt 4o latest is very good at

ocean vortex Mar 28, 2025, 12:03 AM

#

keen beacon lmsys isnt meant to be end all be all benchmark tho

yeah but they are pushing it and it does matter. I just think it is unfair to other models that kinda have to play by different rules

ocean vortex Mar 28, 2025, 12:06 AM

#

keen beacon what lmsys is measuring, chatgpt 4o latest is very good at

the issue is scope. It's becoming a specialized model catgrin

#

specialized in predicting the response styles that would be preferred by the user

#

I'm pretty sure this does not always lead to an accurate response either. Responses can be preferred for the wrong reasons too like perhaps even going as far as people having common misconceptions and all the weird patterns that we wouldn't want the model to learn... as it would give us an illusion of performance rather than actual thing.

shut parrot Mar 28, 2025, 2:56 AM

#

the arena concept of AI model ranking is awesome in just about every way, but it does kinda play into one of the biggest criticisms of LLMs, in that they commonly repeat the biases of the humans that created the data they were trained upon

#

like if the entire LMArena voting population were just racists and sexists asking racist and sexist prompts that made it past the moderation guidelines and selecting the most vile ones as the "winners"

#

obviously the data generated from those comparisons being used recursively to train AI further would make LLMs a lot worse a lot faster, but it would boost arena scores

leaden palm Mar 28, 2025, 3:32 AM

#

thankfully not

upper wolf Mar 28, 2025, 3:32 AM

#

shut parrot the arena concept of AI model ranking is awesome in just about every way, but it...

I think that’s also why claude is always lower on the leaderboard. it’s designed by an AI safety company and is built as a constitutional AI that strives for objectiveness (over simply what the user wants to hear), which is not really most people’s taste. it tries to “please” the user a lot less

leaden palm Mar 28, 2025, 3:33 AM

#

style control lifts claude up much more than exclude refusal

upper wolf Mar 28, 2025, 3:36 AM

#

the issue goes much deeper than refusals - lets say you ask something like “does God exist” most people are religious so they would want to hear “yes of course”

north vale Mar 28, 2025, 4:39 AM

#

yes several ideas

lean whale Mar 28, 2025, 5:21 AM

#

Hey Guys,
For people who are interested in LLM jailbreaks and AI safety, here is a resource me and a bunch of friends compiled. You can run any jailbreak methods or read about the technical aspects,

https://generalanalysis.com/blog/jailbreak_cookbook

The Jailbreak Cookbook - General Analysis

We have created a comprehensive overview of the most influential LLM jailbreaking methods.

torn mantle Mar 28, 2025, 6:23 AM

#

new model : bolide

raven void Mar 28, 2025, 6:37 AM

#

themis is the big llama model?

mossy drum Mar 28, 2025, 6:51 AM

#

New model in Arena: phoebe

torn mantle Mar 28, 2025, 7:03 AM

#

raven void themis is the big llama model?

yea there are two big llama models

#

themis and cybele

raven void Mar 28, 2025, 7:05 AM

#

phoebe is pretty good but buggy

#

bolide is meh

scarlet flint Mar 28, 2025, 7:19 AM

#

in general

#

llama models are

#

kind of bad

#

they aren't impressive or anything

#

nothing special with them

calm sequoia Mar 28, 2025, 9:22 AM

#

alpine coral Mar 28, 2025, 9:35 AM

#

moonhowler is new? interesting that it just goes straight into the code with its response; no preamble. (i voted phi-4 fwiw.. neither script worked, but phi-4's almost did)

alpine coral Mar 28, 2025, 9:50 AM

#

spider kinda sounds familiar.. but is it new?

#

it did a good job (first UI shown) - i voted for it over o3-mini

plain zinc Mar 28, 2025, 9:56 AM

#

alpine coral it did a good job (first UI shown) - i voted for it over o3-mini

Which company are these models from?

alpine coral Mar 28, 2025, 9:56 AM

#

not sure - just been single prompt / turn with this task

ocean vortex Mar 28, 2025, 10:10 AM

#

upper wolf the issue goes much deeper than refusals - lets say you ask something like “does...

yeah things like that become an issue for sure when they are pushing gpt4o to rank higher than all of their other models there...

distant egret Mar 28, 2025, 10:19 AM

#

when will benchmarks for the gpt4o 26 Mar 25 latest model get released, for all benchmark categories

ocean vortex Mar 28, 2025, 10:22 AM

#

distant egret when will gpt4o 26 Mar 25 latest model benchmark gets released, for all category...

that's the thing, they do not release "chatgpt-latest" model benchmarks at all, and this is the model ranking high on lmarena

#

they will probably only release them once they update the dated API version, though that one wouldn't score as high on lmarena

#

but it's not gonna be less capable 👀

distant egret Mar 28, 2025, 10:26 AM

#

ocean vortex that's the thing, they do not release "chatgpt-latest" model benchmarks at all, ...

yeah i am curious and waiting for someone to publish benchmark results of the gpt4o 26March25 latest model for all categories. even the artificial analysis website still did not do that

distant egret Mar 28, 2025, 10:28 AM

#

ocean vortex that's the thing, they do not release "chatgpt-latest" model benchmarks at all, ...

whenever you come across benchmark scores for the gpt4o 26Mar25 latest model, please tag me and share the details. thank you.

brazen sphinx Mar 28, 2025, 10:37 AM

#

when will deepseek v3 0324 have arena score ?

distant egret Mar 28, 2025, 10:42 AM

#

brazen sphinx when will deepseek v3 0324 have arena score ?

so basically arena score is based on user usage and user votes, not on benchmark analysis for each benchmark category, right?

alpine coral Mar 28, 2025, 10:50 AM

#

ocean vortex they will probably only release them once they update API dated version though t...

assuming anonymous-chatbot = chatgpt-4o-latest-20250326 (available via Direct Chat) = chatgpt-4o-latest (available via oai API), shouldn't it be possible for companies (like artificialanalysis etc) to benchmark it themselves?