#general | Arena | Page 2

slate vapor Mar 12, 2025, 10:23 AM

#

Once the word 'lascivious' appears in my words, it repeats socialist values to me

fading moth Mar 12, 2025, 10:23 AM

#

ok

#

so non-reasoning model, and small model

slate vapor Mar 12, 2025, 10:26 AM

#

yes， Some people speculate that it is DeepSeek V4.😱

eager needle Mar 12, 2025, 10:27 AM

#

hi

slate vapor Mar 12, 2025, 10:32 AM

#

If "Goose" is v4, then there's no point in looking forward to Deepseek. Because I enjoy using AI for role-playing (RP), I have tried using "Goose" for RP, but this model isn't very good.

humble sonnet Mar 12, 2025, 10:36 AM

#

why i can't register in vscode extension ??

#

i'm connected with a random user name, i can't deconnect ?

rigid widget Mar 12, 2025, 10:58 AM

#

I don't think artificial intelligence should be used like an encyclopedia.

vague quiver Mar 12, 2025, 11:53 AM

#

When selecting the comparison effect of different models, uploading files is not supported.

barren prairie Mar 12, 2025, 12:25 PM

#

Hi Qwen 2.5 Turbo is it a new model? ??? Will we get it on arena ???

calm sequoia Mar 12, 2025, 1:15 PM

#

What is "Jerky" model?

fierce smelt Mar 12, 2025, 1:16 PM

#

calm sequoia What is "Jerky" model?

WhaT?

calm sequoia Mar 12, 2025, 1:16 PM

#

cedar tide Mar 12, 2025, 1:17 PM

#

Are there any new models in the last few days? "deep-inertia" "goose" "consolidation" "jerky" ?

#

the price reduction was made on 08-06 not 11-20 and this one is not smaller, and the default model that is used when using "GPT 4o" is 08-06

golden ruin Mar 12, 2025, 2:41 PM

#

when does manus go on that chatbox arena score

leaden palm Mar 12, 2025, 2:57 PM

#

golden ruin when does manus go on that chatbox arena score

It's not comparable to a regular model

ocean vortex Mar 12, 2025, 3:02 PM

#

cedar tide the price reduction was made on 08-06 not 11-20 and this one is not smaller, and...

#general message

cedar tide Mar 12, 2025, 3:04 PM

#

@ocean vortex 08-06 is not worse than 05-13 no ?

ocean vortex Mar 12, 2025, 3:07 PM

#

cedar tide <@514836230802898954> 08-06 is not worse than 05-13 no ?

hard to say definitively, it depends what you do with it. But older had higher lmarena eval and considerably better reasoning over paragraphs (DROP)

#

marginally better at math and coding too, slightly worse at science and general knowledge

#

I think they dropped the price for a reason though, those benchmarks would have been possible even if the model size was reduced (hypothetically), just look at 2.0 Flash...

#

maybe it's just different quantization etc, but something likely changed for them to drop the price only from that 1 specific version.

teal mantle Mar 12, 2025, 3:25 PM

#

My weird eval: ask what is the classical music I would have liked based on my existing taste
DeepSeek R1 seems to be better than Grok 3 by introducing non-generic choices

flat bear Mar 12, 2025, 3:43 PM

#

who all remembers vicuna

golden ruin Mar 12, 2025, 3:52 PM

#

leaden palm It's not comparable to a regular model

Because it uses those models to get its answers?

brittle tiger Mar 12, 2025, 3:54 PM

#

https://x.com/shiels_ai/status/1899806394313777454?t=os9nbi3-TrAisJlhZEB6rg&s=19

Professional Neuron Enjoyer (@shiels_ai) on X

Gemma 3 is yet more proof that

1. On device models are going to be the norm
2. Your toaster will soon be smarter than you

Get ready

cedar tide Mar 12, 2025, 4:17 PM

#

goose is also from meta

#

And jerky too

verbal nimbus Mar 12, 2025, 4:19 PM

#

brittle tiger https://x.com/shiels_ai/status/1899806394313777454?t=os9nbi3-TrAisJlhZEB6rg&s=19

Sentient toasters yay

#

cedar tide Mar 12, 2025, 4:27 PM

#

Deep-inertia too from meta

#

Consolidation too

chilly mason Mar 12, 2025, 4:31 PM

#

slate vapor If "Goose" is v4, then there's no point in looking forward to Deepseek. Because ...

which model is best for rp so far?

barren prairie Mar 12, 2025, 4:43 PM

#

cedar tide Deep-inertia too from meta

Why meta is testing a lot of chatbots ?

cedar tide Mar 12, 2025, 4:44 PM

#

barren prairie Why meta is testing a lot of chatbots ?

🤷

#

New model
"hunyuan-turbo-0110"

#

And
Hunyuan-turbos-20250226

slate vapor Mar 12, 2025, 5:19 PM

#

chilly mason which model is best for rp so far?

Currently, the most powerful model for RP (Role Playing) is the Claude series, but it is quite expensive. Moreover, its computing power decreases over time (manifesting as a decrease in intelligence), and security reviews are becoming increasingly strict. If it involves pornographic content, sorry, but that's a no-go.

#

If you bypass security scrutiny through jailbreaking methods, you may temporarily maintain access. However, upon receiving a "Yellow Label" designation, you will be subjected to rigorous internal security investigations and mandatory external audits.......

ocean vortex Mar 12, 2025, 6:49 PM

#

brittle tiger https://x.com/shiels_ai/status/1899806394313777454?t=os9nbi3-TrAisJlhZEB6rg&s=19

I don't get this hype tbh. It's good but hardly more than could have been expected all things considered think

brittle tiger Mar 12, 2025, 6:51 PM

#

ocean vortex I don't get this hype tbh. It's good but hardly more than could have been expect...

I think hype is primarily due to size and efficiency

ocean vortex Mar 12, 2025, 6:51 PM

#

brittle tiger I think hype is primarily due to size and efficiency

yeah but if you look at previous models this size they are hardly miles behind. Also qwen-32b is meant for coding

#

so really qwen could have been even better if they did it as a general purpose model

timber kiln Mar 12, 2025, 6:57 PM

#

brittle tiger I think hype is primarily due to size and efficiency

Flash is gonna be cheaper no matter what
Thats the googles goal trying to get opensource goodwill without competing with themselves

chilly mason Mar 12, 2025, 7:38 PM

#

slate vapor Currently, the most powerful model for RP (Role Playing) is the Claude series, b...

GOtchu. What about best local one?

red sluice Mar 12, 2025, 8:16 PM

#

I don't get why there are some good models that outperform Zephyr, Llama, etc, such as Specter that I saw a few times in the arena, aren't in the leaderboard, and why is the leaderboard only showing a few of the models that we see in the Arena?
Sometimes seeing ChatGPT lose against a model that isn't in the leaderboard is kinda frustrating, because you don't know how low this model is and how terrible (or new?) the model is...

meager sun Mar 12, 2025, 8:18 PM

#

red sluice I don't get why there are some good models that outperform Zephyr, Llama, etc, s...

usually it's kept private for inside benchmarks

#

like when they tested "gpt-2"

#

(that was 4o)

#

before releasing it

red sluice Mar 12, 2025, 8:19 PM

#

Yeah but Specter does exist, it's not a made up name. https://huggingface.co/allenai/specter2
Not in the ranking

allenai/specter2 · Hugging Face

meager sun Mar 12, 2025, 8:20 PM

#

they don't have to tell you the name though

#

they are the provider of it

meager sun Mar 12, 2025, 8:21 PM

#

meager sun usually it's kept private for inside benchmarks

so it leads back to this

red sluice Mar 12, 2025, 8:21 PM

#

Yeah I know, just had a "cohort chowder" for instance, totally made up name, but specter? If it's a made up name, it's pretty lame that it is actually a real model, it hyped me up because the few times I had specter it was really decent

#

ok got it

barren prairie Mar 12, 2025, 8:30 PM

#

red sluice I don't get why there are some good models that outperform Zephyr, Llama, etc, s...

It is a private model. Like centaur or gremlin. They are private.

hardy pecan Mar 12, 2025, 8:39 PM

#

ocean vortex I don't get this hype tbh. It's good but hardly more than could have been expect...

you know you're down bad when you are comparing yourself to gpt4o-mini lol

ocean vortex Mar 12, 2025, 8:42 PM

#

hardy pecan you know you're down bad when you are comparing yourself to gpt4o-mini lol

?
that is the whole point of the tiny models. And tiny models should only be compared to other tiny/small models. Also I hate to break it to you but currently best performing model in math (o3-mini-high) is based on gpt4o-mini lol

hardy pecan Mar 12, 2025, 8:44 PM

#

ocean vortex ? that is the whole point of the tiny models. And tiny models should only be co...

Thats fair, and yes fine tuning and building upon older models is probably the most cost effectve way, o3-mini-high is great is math, great model

ocean vortex Mar 12, 2025, 8:45 PM

#

with only 27b for a standard non-reasoning model... it's kinda an achievement for them being compared with gpt4o-mini tbh 👀

#

nothing spectacular certainly, but not a failure either

timber kiln Mar 12, 2025, 8:49 PM

#

red sluice Yeah but Specter does exist, it's not a made up name. https://huggingface.co/all...

That specter isn't a chat LLM sometimes names coincide there is limited words its expected

cloud meadow Mar 12, 2025, 9:01 PM

#

Hey guys

#

What is this?
https://image.pollinations.ai/prompt/put prompt here

#

https://image.pollinations.ai/prompt/cat photos with cute bell

#

This end point just let's you gen images

#

For some reason my Gemini Flash 2.0 kept sending this endpoint when I sent jailbreak prompts

ocean vortex Mar 12, 2025, 9:03 PM

#

cloud meadow This end point just let's you gen images

this has been a thing for a long long time 🙂

ocean vortex Mar 12, 2025, 9:04 PM

#

cloud meadow For some reason my Gemini Flash 2.0 kept sending this endpoint when I sent jailb...

2.0 Flash-exp native imagen is way more impressive than pollinations, especially img2img

cloud meadow Mar 12, 2025, 9:04 PM

#

Why would it send the links though? Included in the training data?

ocean vortex Mar 12, 2025, 9:05 PM

#

cloud meadow Why would it send the links though? Included in the training data?

or it was included in the jailbreak itself. Many jailbreak prompts use pollinations as a 'feature' of the prompt

cloud meadow Mar 12, 2025, 9:05 PM

#

ocean vortex or it was included in the jailbreak itself. Many jailbreak prompts use pollinati...

#

I wanted to see how far I could go with the model. Its a little tricky to get it to generate certain types of photos.

#

I got it work like once or twice

ocean vortex Mar 12, 2025, 9:07 PM

#

cloud meadow

yeah that's just some weird behavior, this should not have happened. It responded like a base model dumping training data lol

cloud meadow Mar 12, 2025, 9:07 PM

#

Yeah lmao

vague quiver Mar 12, 2025, 9:33 PM

#

In addition to this ranking website, there are other ranking websites, such as https://livebench.ai/#/. Let’s see which one is more authoritative and when will it compare which one ranks better?

eager crater Mar 12, 2025, 9:41 PM

#

#

didnt know mistral was behind it all

barren prairie Mar 12, 2025, 11:15 PM

#

vague quiver In addition to this ranking website, there are other ranking websites, such as h...

I can ' t trust this ranking it is meh. Arena ranking is a little bit more logic.

alpine coral Mar 13, 2025, 12:33 AM

#

ocean vortex with only 27b for a standard non-reasoning model... it's kinda an achievement fo...

that was how i looked at it tbh. the numbers seemed more on the impressive side than meh.. like don't get me wrong: not insanely good, but still good to be comparable to 4o-mini imo

#

also, not sure if this is acccurate (i just pasted the screenshot and told 4o to "turn it into a bar chart comparing those 4 models, and don't use all the benchmarks" [to avoid clutter / make the task simpler]. but it definitely has made leaps compared to gemini 2 on some of the evals, including surpassing 1.5 pro and getting close 2.0 pro in some

fading moth Mar 13, 2025, 1:44 AM

#

barren prairie Why meta is testing a lot of chatbots ?

also, google is. google is testing LOTS and LOTSSSSS of models and some have been on the arena for a loooong time.

keen beacon Mar 13, 2025, 1:45 AM

#

alpine coral that was how i looked at it tbh. the numbers seemed more on the impressive side ...

its not as impressive a phi 4

#

wrt how much knowledge was compacted into such a small model, phi 4 is probably by far the best in my testing

#

phi 4 surpassed the teacher models, team also introduced pivotal token dpo, and didnt really have rl applied to math like gemma 3 instruction tuned

fading moth Mar 13, 2025, 1:48 AM

#

ocean vortex 2.0 Flash-exp native imagen is way more impressive than pollinations, especially...

yes, the new rolled out native image gen is very good. but i tried to break to before and there's chance that it is using ecternal tools and fake

leaden palm Mar 13, 2025, 1:49 AM

#

fading moth yes, the new rolled out native image gen is very good. but i tried to break to b...

??

#

what do you mean "break to before"

fading moth Mar 13, 2025, 1:49 AM

#

typo

#

i mean i tried to use some techniques whatever that is to "break the model"

#

also, anyone know the release of Reka's flash 3 and Ai21 labs' Jamba 1.6 recently?

#

it seems like no one cared about it

keen beacon Mar 13, 2025, 1:52 AM

#

mamba kinda sucks

fading moth Mar 13, 2025, 1:53 AM

#

Jamba 1.6 is a bad model, it used some very cherry picked evals

#

and it "outpreformed" llama 3.3 70b using 399b!

#

MoE

keen beacon Mar 13, 2025, 1:55 AM

#

yeah im not expecting much from them

fading moth Mar 13, 2025, 1:55 AM

#

fading moth also, anyone know the release of Reka's flash 3 and Ai21 labs' Jamba 1.6 recentl...

but the flash 3 thing is very, very impressive. it preforms similar to qwq 32b, which is on par with deepseek r1 (671b), which is on par with o1-full and surpassed o1-mini.

#

and the flash 3 model itself outpreforms o1-mini in directly comparison benchmarks

#

keen beacon Mar 13, 2025, 1:57 AM

#

sure. but id rather use qwq. and in terms of the small model front, phi-4-mini (reasoning enhanced, unreleased) a 3.8b model seems to be more interesting. (competitive with o1 mini)

fading moth Mar 13, 2025, 1:57 AM

#

cloud meadow What is this? `https://image.pollinations.ai/prompt/put prompt here`

is there a chance that gemini is just using this tool

#

ok

#

wait... "phi-4-mini (reasoning enhanced, unreleased) a 3.8b model"?? what?

#

is that a even a thing

keen beacon Mar 13, 2025, 1:58 AM

#

fading moth is that a even a thing

yeah they briefly addressed it in their phi-4-mini report

fading moth Mar 13, 2025, 1:58 AM

#

ok

#

i know phi 4 full, mini, and its multimodel ver.

#

but i think no reasoning enhanced or reasoning model form microsoft up to now

keen beacon Mar 13, 2025, 2:00 AM

#

the phi team is absolutely cooking 🔥 but it seems no one pays attention

fading moth Mar 13, 2025, 2:00 AM

#

ok

fading moth Mar 13, 2025, 2:00 AM

#

keen beacon sure. but id rather use qwq. and in terms of the small model front, phi-4-mini (...

any benchmark of it?

#

you said "competitive with o1 mini" with just 3.8b

#

phi and microsoft team is good at aligning models with benchmarks

#

from my exprience of phi 3.5

#

but phi 4 is Okay.....

keen beacon Mar 13, 2025, 2:02 AM

#

fading moth you said "competitive with o1 mini" with just 3.8b

i was misremembering how good it is, but its still somewhat competitive:

#

qwen 2.5 7b math (base model of r1 distilled) was pre-trained on 19+ trillion tokens, whilst phi-4-mini was only pretrained with 5 trillion tokens + 60b continued pre-training on reasoning

#

~~phi-4 being pretrained on only 5 trillion tokens compared to gemma 3's 12 trillion and showing a lot more knowledge is pretty impressive too~~ phi 4 was 10t, phi 4 mini was 5t

alpine coral Mar 13, 2025, 2:20 AM

#

keen beacon its not as impressive a phi 4

i can't say i've used / tested it directly at all, but whenever i've got it in the arena it has consistently been a surprise (like was typically expecting a bigger / more well-known model to be revealed)

fading moth Mar 13, 2025, 3:06 AM

#

@keen beacon yea it's impressive and the bechmark are good, hope micrsoft release the reasonging ver. soon!

#

idk who still remember this, but gemma outperforming grok 1 (look at the size) suprized me a lot, till now

#

wow, just comparied all the benchmarks avaliable for gemma 3 27b instruct and phi 4 14b instruct. phi is wayyy better.

#

gemma 3 is nothing in front of phi 4

leaden palm Mar 13, 2025, 3:18 AM

#

fading moth wow, just comparied all the benchmarks avaliable for gemma 3 27b instruct and ph...

well of course an llm trained on benchmark material is better at benchmarks

fading moth Mar 13, 2025, 3:29 AM

#

yea

clever estuary Mar 13, 2025, 3:34 AM

#

anyone wanna like 5 free GPT4.5 requests with an API key?

#

for like trying the bot?

kind cloud Mar 13, 2025, 4:58 AM

#

Screenshot_2025-03-13-13-58-24-884-edit_com.android.chrome.jpg

fading moth Mar 13, 2025, 5:00 AM

#

kind cloud

this is new

fading moth Mar 13, 2025, 5:00 AM

#

clever estuary anyone wanna like 5 free GPT4.5 requests with an API key?

yea

clever estuary Mar 13, 2025, 5:03 AM

#

fading moth yea

dmed you

ivory coral Mar 13, 2025, 5:46 AM

#

How can i upload files into Imarena.ai chat

cedar tide Mar 13, 2025, 7:22 AM

#

kind cloud

Reasoning or not ?

cedar tide Mar 13, 2025, 7:25 AM

#

fading moth wow, just comparied all the benchmarks avaliable for gemma 3 27b instruct and ph...

These are the only benchmarks you can compare and it's not obvious that Phi is better.

Screenshot_2025-03-13-11-23-47-162_com.android.chrome-edit.jpg

keen beacon Mar 13, 2025, 7:29 AM

#

cedar tide These are the only benchmarks you can compare and it's not obvious that Phi is b...

simpleqa is low because they were too aggressive in tuning i don't know. if u see their report, the base model gets around 6.8%

#

the math scores arent really comparable because phi 4 didnt get rl done on math. i would say phi 4 is stronger as a base model all around compared to gemma 3 12b

#

i dont think phi 4 is better than the 27b though

keen beacon Mar 13, 2025, 7:37 AM

#

keen beacon the math scores arent really comparable because phi 4 didnt get rl done on math....

its surprising that (phi-4) with less than half the training tokens does this well though

hardy pecan Mar 13, 2025, 7:40 AM

#

specter model seems decent

barren prairie Mar 13, 2025, 7:43 AM

#

keen beacon the phi team is absolutely cooking 🔥 but it seems no one pays attention

It is not a good language model for me . I told it to explain something using a certain accent and emojis into a story and he just repeated it with emojis 😮‍💨😮‍💨

barren prairie Mar 13, 2025, 7:43 AM

#

ivory coral How can i upload files into Imarena.ai chat

Not files just images

barren prairie Mar 13, 2025, 7:48 AM

#

kind cloud

Google : I want to test a model Anonymously.

Some detective conans :

hardy pecan Mar 13, 2025, 8:14 AM

#

hunyuan-turbos-20250226 not bad either

#

specter looks like google, decent thinking to be fair..

red sluice Mar 13, 2025, 8:52 AM

#

hardy pecan hunyuan-turbos-20250226 not bad either

Yeah hynyuan is pretty good it should at least get a top 20 spot, if not a top 10

#

Specter is very very good and had better results than o3-mini on hard long prompt and formatting in my tests.

keen beacon Mar 13, 2025, 8:54 AM

#

red sluice Specter is very very good and had better results than o3-mini on hard long promp...

is it thinking?

red sluice Mar 13, 2025, 8:55 AM

#

Not very long, maybe the API has priority, it does not take more than 20 seconds to start producing 10k characters result

#

Basically what I use it for is translating full wikipedia pages to french, with formatting so you have a result you can immediately publish (with minor manual fixes). I like using arena for such stuff, because it's not that easy and it gives me clues on which models are good for such task including around 20 detailled instructions. And you can tell without any bias which model is better, because you can immediately see where the issues are on each because you have to fix them to get to wikipedia's standards.

#

4o is absolutely terrible for this, o3-mini is pretty decent

#

But that's actually interesting for SEO blog posts for example because it's a similar process. I worked at a small company that was using ChatGPT for that two years back, they already had a very detailled prompt, and it worked well without any SEO specialist, the website got a ~70% improvement on SEO results within 6 months taking 5 minutes per articles that way.

barren prairie Mar 13, 2025, 9:38 AM

#

cedar tide Reasoning or not ?

Phantom is Gemini 2.0 flash thinking confirmed. 😁

cedar tide Mar 13, 2025, 9:39 AM

#

barren prairie Phantom is Gemini 2.0 flash thinking confirmed. 😁

That not centaur ?

barren prairie Mar 13, 2025, 9:39 AM

#

cedar tide That not centaur ?

Centaur is a google model but not that one.

cedar tide Mar 13, 2025, 9:40 AM

#

Enigma is also a thinking model, yes?

barren prairie Mar 13, 2025, 9:40 AM

#

cedar tide Enigma is also a thinking model, yes?

No , it is powered by gpt4

cedar tide Mar 13, 2025, 9:41 AM

#

Why are there 4 models of reasoning, Enigma, Centaur, Specter, Phantom, while there are only 3 models? Flash Lite, Flash, Pro

cedar tide Mar 13, 2025, 9:41 AM

#

barren prairie No , it is powered by gpt4

okay, now I know not to trust you. 😅

#

Can anyone else confirm that phantom is a reasoning model?

cedar tide Mar 13, 2025, 9:42 AM

#

cedar tide Why are there 4 models of reasoning, Enigma, Centaur, Specter, Phantom, while th...

will they have a low and high reasoning version?

keen beacon Mar 13, 2025, 9:43 AM

#

cedar tide Why are there 4 models of reasoning, Enigma, Centaur, Specter, Phantom, while th...

engima=flash(?), centaur=flash thinking, specter=?, phantom=?

cedar tide Mar 13, 2025, 9:44 AM

#

keen beacon engima=flash(?), centaur=flash thinking, specter=?, phantom=?

Enigma is not reasoning now ?

keen beacon Mar 13, 2025, 9:44 AM

#

cedar tide Enigma is not reasoning now ?

based on the past use of the name, no

#

i havent done a run of the arena for a long time though

cedar tide Mar 13, 2025, 9:45 AM

#

keen beacon based on the past use of the name, no

Screenshot_2025-03-13-13-44-55-309_com.discord-edit.jpg

keen beacon Mar 13, 2025, 9:46 AM

#

he was guessing on nothing lol

cedar tide Mar 13, 2025, 9:46 AM

#

I don't think it's the pro version

#

But maybe thinking

keen beacon Mar 13, 2025, 9:46 AM

#

flash can just be better than pro sometimes :\

cedar tide Mar 13, 2025, 9:54 AM

#

phantom may be a hybrid model that decides for itself whether to think or not ??

keen beacon Mar 13, 2025, 9:55 AM

#

cedar tide phantom may be a hybrid model that decides for itself whether to think or not ??

whatever the case its classified as a thinking model cuz outputs are delayed by lmarena

#

(i just had it)

cedar tide Mar 13, 2025, 9:56 AM

#

keen beacon whatever the case its classified as a thinking model cuz outputs are delayed by ...

I didn't understand

cedar tide Mar 13, 2025, 9:57 AM

#

cedar tide phantom may be a hybrid model that decides for itself whether to think or not ??

I asked him to answer a question in one word and he answered instantly without waiting.

keen beacon Mar 13, 2025, 9:57 AM

#

cedar tide I didn't understand

if a model is thinking, they delay the outputs of both models (even if the other is non thinking) to match so its fair

keen beacon Mar 13, 2025, 9:57 AM

#

cedar tide I asked him to answer a question in one word and he answered instantly without w...

it was probably ur question

cedar tide Mar 13, 2025, 9:57 AM

#

keen beacon it was probably ur question

?

keen beacon Mar 13, 2025, 9:58 AM

#

it thought through it real fast, maybe?

#

it referenced its thoughts here tooo

cedar tide Mar 13, 2025, 10:03 AM

#

yes possible but it was really almost instantaneous so 🤷

keen beacon Mar 13, 2025, 10:03 AM

#

cedar tide yes possible but it was really almost instantaneous so 🤷

what was the q

cedar tide Mar 13, 2025, 10:03 AM

#

Which AI company are you from? Answer in one word.

keen beacon Mar 13, 2025, 10:03 AM

#

ill try

cedar tide Mar 13, 2025, 10:05 AM

#

No, actually it's normal. I also tried it on AI Studio with Flash Thinking and it's also very fast.

#

So its maybe flash lite thinking

keen beacon Mar 13, 2025, 10:06 AM

#

whatever it is its a thinking model

#

engima=flash(?), centaur=flash thinking, phantom=thinking model, specter=?

red sluice Mar 13, 2025, 10:23 AM

#

cedar tide

Would make sense, Enigma is pretty good

brittle tiger Mar 13, 2025, 11:14 AM

#

cedar tide Why are there 4 models of reasoning, Enigma, Centaur, Specter, Phantom, while th...

Gemma?

#

Is specter saying it's Google?

cedar tide Mar 13, 2025, 11:17 AM

#

brittle tiger Gemma?

Nope

#

and if Enigma is not a reasoning model why is it still tested? flash and flash lite are already stable

keen beacon Mar 13, 2025, 11:21 AM

#

cedar tide and if Enigma is not a reasoning model why is it still tested? flash and flash l...

flash lite isnt

#

historically enigma has been flash (see lmsys discord history)

keen beacon Mar 13, 2025, 11:22 AM

#

cedar tide and if Enigma is not a reasoning model why is it still tested? flash and flash l...

new flash revision?

cedar tide Mar 13, 2025, 11:23 AM

#

keen beacon flash lite isnt

Flash lite is already stable

keen beacon Mar 13, 2025, 11:23 AM

#

cedar tide Flash lite is already stable

flash lite preview

cedar tide Mar 13, 2025, 11:23 AM

#

Nope

keen beacon Mar 13, 2025, 11:23 AM

#

oh really

#

my bad

cedar tide Mar 13, 2025, 11:24 AM

#

https://x.com/OfficialLoganK/status/1894448174300901772?t=XYUxqiI-iTbkIoci-EUZZg&s=19

Logan Kilpatrick (@OfficialLoganK) on X

Gemini 2.0 Flash-Lite is now available for production use, at $0.075 / 1M input tokens and $0.30 / 1M output tokens (same cost as 1.5 Flash) and a great performance upgrade with a 1-line code change : )

https://t.co/1v0UQ817lr

keen beacon Mar 13, 2025, 11:25 AM

#

cedar tide https://x.com/OfficialLoganK/status/1894448174300901772?t=XYUxqiI-iTbkIoci-EUZZg...

i missed that. still historically its been flash

cedar tide Mar 13, 2025, 11:25 AM

#

I would have said flash lite

#

there was goblin who was flash it seems to me

keen beacon Mar 13, 2025, 11:26 AM

#

cedar tide there was goblin who was flash it seems to me

which one came out first? first one is flash

cedar tide Mar 13, 2025, 11:27 AM

#

keen beacon which one came out first? first one is flash

So ?

keen beacon Mar 13, 2025, 11:27 AM

#

cedar tide So ?

if they came out at different times flash would be the one to come first since flash lite presumbly wasnt ready

#

@cedar tide its either flash lite or flash 🤷

cedar tide Mar 13, 2025, 11:28 AM

#

Enigma Goblin and Gremlin arrived at the arena at the same time, I think.

keen beacon Mar 13, 2025, 11:29 AM

#

i havent encountered it once so i have no idea what capability it is

#

(going around today after taking a long break from the arena)

cedar tide Mar 13, 2025, 11:31 AM

#

cedar tide Enigma Goblin and Gremlin arrived at the arena at the same time, I think.

all 3 arrived on 11/29/2024

keen beacon Mar 13, 2025, 11:31 AM

#

ya so u cant tell from that :\

keen beacon Mar 13, 2025, 11:33 AM

#

cedar tide Why are there 4 models of reasoning, Enigma, Centaur, Specter, Phantom, while th...

if specter is also reasoning it makes sense

cedar tide Mar 13, 2025, 11:33 AM

#

Yes specter for sure reasoning

keen beacon Mar 13, 2025, 11:34 AM

#

phantom is probably flash lite thinking, specter is gem 2 pro thinking

#

enigma is new revision of flash lite

#

did they release flash lite arena scores yet?

#

maybe its the ga version

#

oh they did

solar hollow Mar 13, 2025, 12:00 PM

#

any suprisingly strong anonymous models in the arena right now?

brittle tiger Mar 13, 2025, 12:09 PM

#

https://x.com/minosvasilias/status/1900134252018016286

minos (@minosvasilias) on X

@kalomaze

#

This is crazy

cloud meadow Mar 13, 2025, 12:23 PM

#

brittle tiger https://x.com/minosvasilias/status/1900134252018016286

It's pretty good. Turned the Java mascot into an Among Us character.

#

It definitely still has problems

#

Is there any opensource model similar to this?

#

This could be big tbh

brittle tiger Mar 13, 2025, 12:27 PM

#

cloud meadow Is there any opensource model similar to this?

I bet someone figures out how to get Gemma to do it eventually

keen beacon Mar 13, 2025, 12:27 PM

#

cloud meadow Is there any opensource model similar to this?

chameleon by meta but image generation was never released

chilly mason Mar 13, 2025, 3:00 PM

#

I saw rumors say Gemma 3 Q4 got really large performance drop than bfloat16, is that true?

leaden palm Mar 13, 2025, 3:01 PM

#

chilly mason I saw rumors say Gemma 3 Q4 got really large performance drop than bfloat16, is ...

What size

chilly mason Mar 13, 2025, 3:07 PM

#

leaden palm What size

27b

leaden palm Mar 13, 2025, 3:07 PM

#

chilly mason 27b

If it's dynamic, it shouldn't cause problems at that size

frozen arch Mar 13, 2025, 4:37 PM

#

any good local deepresearch? Need to use one for important stuff but can't afford OpenAI's

golden ocean Mar 13, 2025, 4:54 PM

#

isnt groks deepsearch free or does that have a limit

chilly mason Mar 13, 2025, 4:56 PM

#

golden ocean isnt groks deepsearch free or does that have a limit

no limit it seems

cedar tide Mar 13, 2025, 5:08 PM

#

Deep research from gemini now also free, with limits per month

brittle tiger Mar 13, 2025, 5:23 PM

#

Notice he doesn't say flash thinking, just 2.0 thinking

https://x.com/joshwoodward/status/1900201110717214914?t=Ww68KoJ59H7VK-KUxBeWXg&s=19

Josh Woodward (@joshwoodward) on X

Next batch of @NotebookLM updates rolling out:

* Even smarter answers, powered by Gemini 2.0 Thinking
* See citations in your notes, not just in the Q&A (top request)
* Customize the sources used for making your podcasts and notes (top request)
* Much smoother scrolling for Q&A

autumn cargo Mar 13, 2025, 5:25 PM

#

teal mantle Mar 13, 2025, 6:37 PM

#

brittle tiger https://x.com/minosvasilias/status/1900134252018016286

I miss Golden gate claude, and did anyone make OSS version equivalent?

tall ore Mar 13, 2025, 6:40 PM

#

Hi everyone. Just want to learn as much about AI as I can.

alpine pasture Mar 13, 2025, 7:25 PM

#

Mic testing but going live in 5 mins - come hang!
#announcements message

golden ocean Mar 13, 2025, 7:25 PM

#

is @alpine pasture the chatgpt voice

alpine pasture Mar 13, 2025, 7:28 PM

#

golden ocean is <@563563711000870913> the chatgpt voice

im a real person!

ivory coral Mar 13, 2025, 7:49 PM

#

When files support?

slow spruce Mar 13, 2025, 8:08 PM

#

im a real person!
That's what a bot would say

alpine pasture Mar 13, 2025, 8:14 PM

#

golden ocean is <@563563711000870913> the chatgpt voice

how much are you eating, 🐱 ?

#

very funny username

frozen arch Mar 13, 2025, 8:18 PM

#

cedar tide Deep research from gemini now also free, with limits per month

thanks! awesome to know, i'll give this a try..
hope that now it's using 2.0 thinking, its as good as openai's version

ocean vortex Mar 13, 2025, 8:27 PM

#

cedar tide https://x.com/OfficialLoganK/status/1894448174300901772?t=XYUxqiI-iTbkIoci-EUZZg...

this is gotta be the most hilarious model in their lineup

#

who cares about it being "lite" if both are free?

#

and neither is open-source... so what gives lol

#

90% of people using it do not get past the free tier rate limits so price is irrelevant for them, it's 0

olive mesa Mar 13, 2025, 8:37 PM

#

i hate how google ai studio's api doesn't have a system role

unborn ocean Mar 13, 2025, 9:08 PM

#

ocean vortex 90% of people using it do not get past the free tier rate limits so price is irr...

A: any slightly bigger company using the model will definitely exceed the limit
B: basically no company would want to give their data or their customers data away for free to google (part of their free tier policy)

#

-> so the price is actually quite important

alpine pasture Mar 13, 2025, 9:57 PM

#

ivory coral When files support?

many features still coming, can you share more details around what you mean by files in #new-ui-feedback ?

humble sonnet Mar 13, 2025, 10:03 PM

#

we can't delete a chat in new ui ?

rigid raptor Mar 13, 2025, 10:11 PM

#

Is harmony an MOE? when i ran it against a stock analysis it told me

Hello, my name is AI stock analyst, and I'm developed by Meta AI.

alpine pasture Mar 14, 2025, 12:08 AM

#

humble sonnet we can't delete a chat in new ui ?

not yet, but team is working on it

#

send more feedback for the new ui in #new-ui-feedback!

plain zinc Mar 14, 2025, 12:22 AM

#

People say that this is a model from Google

Screenshot_2025-03-14-05-05-53-182_com.android.chrome-edit.jpg

timber kiln Mar 14, 2025, 12:59 AM

#

Any summary of the discord event?

alpine pasture Mar 14, 2025, 1:09 AM

#

timber kiln Any summary of the discord event?

We're currently set up to do only live events for now, but take our survey to let us know what you'd like to see in the future. Maybe some details on what kind of summary would be valuable to you? https://forms.gle/Hr9xgSTWnyLVR9tC8

cedar tide Mar 14, 2025, 4:08 AM

#

"kronus" & "polus" from meta

timber kiln Mar 14, 2025, 8:43 AM

#

alpine pasture We're currently set up to do only live events for now, but take our survey to le...

What would be your personal summary

ocean vortex Mar 14, 2025, 9:00 AM

#

unborn ocean A: any slightly bigger company using the model will definitely exceed the limit ...

it basically doesn't apply to private customers at all. And for enterprises price is a much smaller issue to begin with.

timber kiln Mar 14, 2025, 9:05 AM

#

I heard it rate limits while agentic coding lite cline etc

#

But Idk why would you ever use a google model for coding

#

You gotta have no respect for your time

mossy drum Mar 14, 2025, 11:00 AM

#

New model in Arena: uranus

barren prairie Mar 14, 2025, 11:09 AM

#

Model A: vega
Model B: raze
🙂🙂🙂

#

A lot of new models are dropped now

#

Don t believe what the models said because some people are doing this in order to know the models name !
I asked claude if it was claude or model by google (because i pretty know that the first answer is a google style) and it answered it was chatgpt 🤣🤣🤣😂

Screenshot_2025-03-14-12-19-37-344_com.android.chrome.jpg

cedar tide Mar 14, 2025, 11:44 AM

#

mossy drum New model in Arena: uranus

another meta model

#

Harmony and vega in vision model, is from meta

cedar tide Mar 14, 2025, 12:15 PM

#

"Astérix" vision model from amazon

#

"Buttercup" the same

mossy lotus Mar 14, 2025, 1:36 PM

#

Hi, does anyone know why when using grok-3-preview-02-24, it keeps prompting:

NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.

(error_code: 50004, An error occurred during streaming)

autumn cargo Mar 14, 2025, 2:13 PM

#

Which LLM has the best long-term memory? I'm struggling with Gemini Pro, as it starts forgetting recent information after extensive back-and-forth (around 100 prompts), while oddly retaining details from earlier prompts. Is there a way to prevent this?

torn mantle Mar 14, 2025, 2:23 PM

#

barren prairie Model A: vega Model B: raze 🙂🙂🙂

Meta has been spamming the arena with their models

#

dont know how to feel about most of them tbh

#

they all underperform

keen beacon Mar 14, 2025, 2:25 PM

#

becuz llama 4 is coming next month

torn mantle Mar 14, 2025, 2:31 PM

#

keen beacon becuz llama 4 is coming next month

i remember in January we had many llama4-mini models

#

but they removed all of them giving how bad they were

torn mantle Mar 14, 2025, 2:33 PM

#

mossy drum New model in Arena: uranus

this seems like a reasoning model

#

i think this may be the llama 4 reasoning model

barren prairie Mar 14, 2025, 2:51 PM

#

torn mantle dont know how to feel about most of them tbh

Meta and google are both spamming arena with their models but at least google models are good 😆

keen beacon Mar 14, 2025, 2:53 PM

#

besides other modalities for llama 4, not too excited for it. (particularly for text base models) unless they train it with 5x more tokens (or equivalent) than qwen's 18t-19t

barren prairie Mar 14, 2025, 3:27 PM

#

They are focusing on changing the design of the app and doing some new tools. Qwen will be an interresting app in the future, having artifacs, image and video gen ect .. but now the quality is still not that good.

torn mantle Mar 14, 2025, 5:34 PM

#

barren prairie Meta and google are both spamming arena with their models but at least google m...

yea so far Meta only added slops

blazing rune Mar 14, 2025, 5:50 PM

#

keen beacon besides other modalities for llama 4, not too excited for it. (particularly for ...

Idk how that could be possible

#

Maybe 2 or 3 times more, but not 5

keen beacon Mar 14, 2025, 5:51 PM

#

blazing rune Idk how that could be possible

effective 5x increase i meant not necessarily 5x more

blazing rune Mar 14, 2025, 5:51 PM

#

Ah

#

So like more epochs?

keen beacon Mar 14, 2025, 5:54 PM

#

blazing rune So like more epochs?

potentially. or other clever ways to achieve more with a smaller token budget. phi-4 was pre-trained on less than half of gemma 3 12b and the base model has a slightly higher simpleqa score (seemingly correlates well with world knowledge), and i'd say its stronger

torn mantle Mar 14, 2025, 6:19 PM

#

keen beacon potentially. or other clever ways to achieve more with a smaller token budget. p...

Phi 4 is so bad

timber kiln Mar 14, 2025, 6:29 PM

#

keen beacon potentially. or other clever ways to achieve more with a smaller token budget. p...

Always been the case with Phis but until 4 every phi was only benchmaxxing and general use case was terrible
I havent spent time with phi 4 to make a claim this time tho

olive mesa Mar 14, 2025, 7:02 PM

#

i made a working agent like manus using only python but gemini 2.0 flash hallucinates SO much

#

honestly might switch to claude or openrouter api

olive mesa Mar 14, 2025, 7:02 PM

#

olive mesa i made a working agent like manus using only python but gemini 2.0 flash halluci...

"so much" is an understatement im ngl

#

it's seriously crazy how much it hallucinates

gentle plinth Mar 14, 2025, 9:12 PM

#

https://jackhopkins.github.io/factorio-learning-environment/

#

meager sun Mar 14, 2025, 9:13 PM

#

gpt 4? what year is it?

#

it must be 4o since there is no gpt 4 mini

gentle plinth Mar 14, 2025, 9:17 PM

#

it is

slate vapor Mar 15, 2025, 2:53 AM

#

What the heck? Am I dreaming? It's one thing that Claude 3.7 sonnet can't beat the O3 mini, but losing to qwn-plus? What kind of ghost is this?

#

spare swallow Mar 15, 2025, 3:08 AM

#

already using cursor with claude 3.7 and thier thinking agent , now i got this shiny new macbook m4 max 128gb and was wondering if there's an architect level model to use with coding ( preferably on aider )

#

testing the new QwQ , will get back with test results , but will be helpful if anyone can share their thoughts

leaden palm Mar 15, 2025, 3:31 AM

#

slate vapor

the Arena evaluates human preference

slate vapor Mar 15, 2025, 3:33 AM

#

leaden palm the Arena evaluates human preference

Could there be "score padding" and "paid posters" among them?

leaden palm Mar 15, 2025, 3:33 AM

#

slate vapor Could there be "score padding" and "paid posters" among them?

speculation is speculation

slate vapor Mar 15, 2025, 3:37 AM

#

leaden palm speculation is speculation

Alright, let's say I hire thousands of people to compete and score a certain model here. When the scores are publicly released and this model ends up with a high score, what should the competition organizer do about it?

leaden palm Mar 15, 2025, 3:37 AM

#

slate vapor Alright, let's say I hire thousands of people to compete and score a certain mod...

how are you going to "score [the] certain model"?

slate vapor Mar 15, 2025, 3:44 AM

#

leaden palm how are you going to "score [the] certain model"?

Wow, sorry, I didn't seem to consider these things.Before the competition organizer publicly reveals the models, everything is randomized, which is something I didn't consider. I was surprised to see that Claude3.7sonnet's score was lower than Qwen-plus's. I think Qwen-plus's capabilities are quite poor. When it comes to coding, Claude3.7sonnet is superb, and I love using it.

#

I AM 🤡

leaden palm Mar 15, 2025, 3:55 AM

#

slate vapor I AM 🤡

rare humility in discord

leaden palm Mar 15, 2025, 3:55 AM

#

slate vapor Wow, sorry, I didn't seem to consider these things.Before the competition organi...

yeah the arena confirms that (this is coding category with style control; idk why gpt-4.5 (first bar) is higher than claude (fourth bar), probably it just looks better)

slate vapor Mar 15, 2025, 4:11 AM

#

I'm not entirely sure, but I think this ranking is quite normal compared to the one from lmarena?This is the ranking from livebench.

leaden palm Mar 15, 2025, 4:32 AM

#

i can't take uranus seriously

drifting crow Mar 15, 2025, 5:21 AM

#

Take it in pleasure then

rigid widget Mar 15, 2025, 9:09 AM

#

Gemma 3 is toooo slow

hardy pecan Mar 15, 2025, 9:15 AM

#

another model: polus - says dey are meta

#

uranus said its Meta-Llama-3-70B-Instruct.

barren prairie Mar 15, 2025, 10:57 AM

#

slate vapor Wow, sorry, I didn't seem to consider these things.Before the competition organi...

Claude 3.7 is ranking first on arena web dev so ? But chatbot arena is not about coding .....

ocean vortex Mar 15, 2025, 12:05 PM

#

meager sun gpt 4? what year is it?

same for all the other models in this graph. Such a lousy naming lol

barren prairie Mar 15, 2025, 12:17 PM

#

Kronus another meta model

calm spear Mar 15, 2025, 12:57 PM

#

will be my preference used for training new LLMs? I really want it to be used

rigid widget Mar 15, 2025, 1:13 PM

#

I don't know about coding but Claude is complete garbage in math

kind cloud Mar 15, 2025, 1:34 PM

#

If I remember correctly, I think grok was in the lead until now, but now it's gemini-2.0-pro-02-05.
Maybe something has changed, maybe something hasn't.

Screenshot_2025-03-15-22-10-15-168-edit_com.android.chrome.jpg

#

I'm starting to feel like I'm wrong.

#

If anyone knows the answer please tell me

eager crater Mar 15, 2025, 1:39 PM

#

kind cloud If I remember correctly, I think grok was in the lead until now, but now it's ge...

i did notice that too but don't put too much thought into it

kind cloud Mar 15, 2025, 1:53 PM

#

I noticed that the phantom's greeting was longer than usual.
He usually says, I am a large language model, trained by Google.

Screenshot_2025-03-15-22-52-02-459-edit_com.android.chrome.jpg

rigid widget Mar 15, 2025, 1:57 PM

#

phantom is very good at math

#

I don't think it's Google

ocean vortex Mar 15, 2025, 2:03 PM

#

rigid widget phantom is very good at math

to be fair their gremlins are good at math too. So wouldn't be unusual for this to be google

keen beacon Mar 15, 2025, 2:38 PM

#

rigid widget I don't think it's Google

geminis are pretty good at maths

#

a reminder that before o3 gemini lead the frontiermath leaderboard

#

i think phantom is LearnLM 2.0

rigid widget Mar 15, 2025, 3:42 PM

#

keen beacon geminis are pretty good at maths

besides from r1, qwq, o3. I really appreciate it

barren prairie Mar 15, 2025, 4:10 PM

#

rigid widget phantom is very good at math

I am so sure it is Gemini thinking ...he gives me the same answer with the same mistakes (some korean and chinees letters while answering , he never put lines between paragraphs , the <bite > things while answering ) the output is so long and well detailed as Gemini thinking always do, and Gemini thinking is the only model that can provide such answers with those mistakes.

rigid widget Mar 15, 2025, 4:13 PM

#

Friends, I will share my own individual math test results.

rigid widget Mar 15, 2025, 4:31 PM

#

Here is my math test and results

GPT 4.5 and Claude 3.7 (thinking) is very very expensive and not good at math (also Claude 3.7 is not good at math but it cheaper than thinking because of less token usage)

DeepSeek R1
QwQ 32B
o3 mini high
Gemini 2.0 Flash Thinking

they very good at math

DeepSeek R1 and QwQr both were the same price for my test (because for my question QwQ used 3x tokens)

DeepSeek R1 is 15x cheaper than o3 mini high

Gemini 2.0 Flash thinking little failed

golden ocean Mar 15, 2025, 4:35 PM

#

alpine pasture how much are you eating, 🐱 ?

As an asian, I have eaten about 4 cats the past 2 weeks.😊

keen beacon Mar 15, 2025, 5:46 PM

#

i swear to god..

#

this happens waay too much

kind cloud Mar 15, 2025, 5:58 PM

#

there seem to be two kinds of phantom: those that respond to `who r u?' in a friendly way, and those that use a prepared answer.

ocean vortex Mar 15, 2025, 6:08 PM

#

rigid widget Here is my math test and results GPT 4.5 and Claude 3.7 (thinking) is very ve...

for math grok3 seems like the perfect model:

#

Non-reasoning version almost matches thinking claude on math too

#

either this or o3-mini-high. When it comes to math those 2 are the clear winners I think

raven void Mar 15, 2025, 9:59 PM

#

brittle tiger Notice he doesn't say flash thinking, just 2.0 thinking https://x.com/joshwoodw...

new version of flash is unified model?

rigid widget Mar 15, 2025, 10:03 PM

#

keen beacon this happens waay too much

always for me

rigid widget Mar 15, 2025, 10:06 PM

#

ocean vortex either this or o3-mini-high. When it comes to math those 2 are the clear winners...

let me try

timber kiln Mar 15, 2025, 10:08 PM

#

Anyone knows interesting multilingual TTS service?

rigid widget Mar 15, 2025, 10:12 PM

#

timber kiln Anyone knows interesting multilingual TTS service?

this looks interesting https://huggingface.co/spaces/sesame/csm-1b

Sesame CSM - a Hugging Face Space by sesame

#

Grok 3 Thinking good for math but not normal grok 3

timber kiln Mar 15, 2025, 10:14 PM

#

I know Sesame but I think that is mostly English

rigid widget Mar 15, 2025, 10:19 PM

#

timber kiln I know Sesame but I think that is mostly English

https://www.minimax.io/audio maybe you look at this

www.minimax.io/audio

MiniMax Audio: Create lifelike speech

Unlock our advanced technology to create lifelike speech in multiple languages, with diverse voices and accents.

ocean vortex Mar 15, 2025, 10:20 PM

#

rigid widget Grok 3 Thinking good for math but not normal grok 3

it's worse than non-reasoning obviously. But standard is still better than 3.7 standard or gpt4.5

rigid widget Mar 15, 2025, 10:20 PM

#

here an example (in Turkish)

ocean vortex Mar 15, 2025, 10:20 PM

#

nazi ceo, but he did somehow manage to hire smart people... 👀

rigid widget Mar 15, 2025, 10:22 PM

#

ocean vortex it's worse than non-reasoning obviously. But standard is still better than 3.7 s...

yeah, absolutely. i tested it both, and it was a complete waste of my money

rigid widget Mar 15, 2025, 10:24 PM

#

ocean vortex nazi ceo, but he did somehow manage to hire smart people... 👀

--> money too much money

barren prairie Mar 15, 2025, 11:06 PM

#

keen beacon i swear to god..

My nightmare especially when I see a good model and I can t vote to him or know his name 😭😭😭😭😭🥺🤖 or Arena cut the answer and doesn t complete it 😟😟

kind cloud Mar 16, 2025, 12:34 AM

#

Screenshot_2025-03-16-09-11-58-990-edit_com.android.chrome.jpg

leaden palm Mar 16, 2025, 12:35 AM

#

holy finetune

torn mantle Mar 16, 2025, 1:07 AM

#

kind cloud

how is it?

olive mesa Mar 16, 2025, 1:08 AM

#

olive mesa i made a working agent like manus using only python but gemini 2.0 flash halluci...

after switching to gemini-2.0-flash-thinking-exp-01-21 and some prompt engineering everything is much much better
probably going to open source it somewhere else since it's free and has a good ~200 rpm and ~10k rpd limits

torn mantle Mar 16, 2025, 1:09 AM

#

march-chatbot-r

#

is pretty good no?

#

nvm its meh

#

march-chatbot = non-reasoning
march-chatbot-r = reasoning?

timber kiln Mar 16, 2025, 1:52 AM

#

olive mesa after switching to gemini-2.0-flash-thinking-exp-01-21 and some prompt engineeri...

Good thing that seperate manus is that it can navigate visually
Bad thing about manus is that it gets stuck because of not using a good undetected browser
If you can solve that you will do much better

torn mantle Mar 16, 2025, 2:01 AM

#

https://x.com/Baidu_Inc/status/1901089355890036897

Baidu Inc. (@Baidu_Inc) on X

We've just unveiled ERNIE 4.5 & X1! 🚀

As a deep-thinking reasoning model with multimodal capabilities, ERNIE X1 delivers performance on par with DeepSeek R1 at only half the price. Meanwhile, ERNIE 4.5 is our latest foundation model and new-generation native multimodal model.

#

seems like baidu released their latest model too

timber kiln Mar 16, 2025, 2:23 AM

#

Half price of R1

#

Crazyy

ancient reef Mar 16, 2025, 2:37 AM

#

timber kiln Mar 16, 2025, 2:37 AM

#

That is for the base model?

ancient reef Mar 16, 2025, 2:39 AM

#

It's just under the tweet, "ERNIE 4.5 achieves collaborative optimization through joint modeling of multiple modalities, exhibiting comprehensive improvements in understanding, generation, reasoning and memory, along with notable enhancements in hallucination prevention, logical reasoning, and coding abilities."

cedar tide Mar 16, 2025, 5:14 AM

#

New model he say that he come from amazon

Screenshot_2025-03-16-09-01-14-982_com.android.chrome-edit.jpg

#

and it seems like he was trained on qwen since when I send him a specific prompt he always replies that it comes from Alibaba

formal fiber Mar 16, 2025, 5:53 AM

#

TOPIC MODELING RESULTS (20250315_105402)

Files analyzed: 2473

Optimal topic count: 10 (coherence: 0.5332)

TOP KEYWORDS PER TOPIC:

Topic #0: model(0.01), user(0.01), research(0.01), prompt(0.01), datum(0.01), memory(0.01), use(0.01), task(0.01), analysis(0.01), code(0.01), knowledge(0.01), example(0.01), deep(0.00), e.g.(0.00), context(0.00)

Topic #1: high(0.01), plumbing(0.01), service(0.01), water(0.01), customer(0.01), reduce(0.01), improve(0.00), use(0.00), time(0.00), company(0.00), increase(0.00), e.g.(0.00), help(0.00), supplement(0.00), study(0.00)

Topic #2: file(0.02), code(0.01), run(0.01), script(0.01), python(0.01), command(0.01), error(0.01), use(0.01), create(0.01), return(0.01), powershell(0.01), api(0.01), game(0.00), folder(0.00), import(0.00)

Topic #3: model(0.01), use(0.01), gpu(0.01), function(0.01), set(0.00), laptop(0.00), image(0.00), performance(0.00), window(0.00), code(0.00), run(0.00), step(0.00), high(0.00), layer(0.00), training(0.00)

Topic #4: loss(0.04), run(0.03), claim(0.02), datum(0.02), insurance(0.01), underwriter(0.01), risk(0.01), underwriting(0.01), report(0.01), trucking(0.01), analysis(0.01), policy(0.00), document(0.00), total(0.00), model(0.00)

Topic #5: content(0.02), post(0.01), use(0.01), prompt(0.01), need(0.01), share(0.01), research(0.01), know(0.01), email(0.01), step(0.01), linkedin(0.01), text(0.01), ensure(0.01), tool(0.01), page(0.01)

)

#

Topic #6: human(0.01), device(0.01), user(0.01), study(0.01), research(0.01), brain(0.01), design(0.00), e.g.(0.00), cognitive(0.00), use(0.00), bci(0.00), system(0.00), people(0.00), datum(0.00), bcis(0.00)

Topic #7: know(0.01), want(0.01), 2024(0.01), est(0.01), let(0.01), work(0.01), think(0.01), feel(0.01), 2025(0.01), good(0.01), get(0.01), way(0.01), edt(0.01), right(0.01), make(0.01)

Topic #8: coverage(0.01), driver(0.01), cdl(0.01), report(0.01), page(0.01), issue(0.01), address(0.01), record(0.01), license(0.01), status(0.01), class(0.01), state(0.01), vehicle(0.01), information(0.01), violation(0.01)

Topic #9: prompt(0.01), framework(0.01), knowledge(0.01), research(0.01), system(0.01), tool(0.01), create(0.01), analysis(0.01), datum(0.01), content(0.01), insight(0.01), specific(0.01), model(0.01), approach(0.01), project(0.01
🙂

#

keen beacon Mar 16, 2025, 9:23 AM

#

all these models that are not sota is just noise

#

Just here is our model that is worse then that etc

#

But watch out for next move from DeepSeek, Moonshot AI, StepFun, Zhipu AI

#

these chinese companies should soon enough get the resources to go top

#

they have talent coming back from the US

alpine coral Mar 16, 2025, 10:35 AM

#

been giving quizes (mostly riddles and misdirects) to models in the arena..just fwiw here's the scores (and 4o's attempt to plot them out on a chart for me)

#

* the lab/company name in parentheses for anon models are just my best guesses - not meant to be concrete)

keen beacon Mar 16, 2025, 10:37 AM

#

hm is phantom another flash thinking revision..? is centaur still in the arena?

alpine coral Mar 16, 2025, 10:37 AM

#

also all just fwiw ofc.. not suggesting dumping a bunch of riddles in front of models is particularly robust way to test them (though for me comprehension counts for a lot.. and they're fun ha)

alpine coral Mar 16, 2025, 10:39 AM

#

keen beacon hm is phantom another flash thinking revision..? is centaur still in the arena?

the responses were wildly divergent. in session, it gave answers that were like spot-on; and you could see it's CoT (like it wasn't hidden, just part of how it handled and responded to the quiz). but on others, it ranged from solid to mid

keen beacon Mar 16, 2025, 10:47 AM

#

alpine coral the responses were wildly divergent. in session, it gave answers that were like ...

it thinks though right? (noticeable delay against non thinking models)

alpine coral Mar 16, 2025, 10:57 AM

#

i was also challenged by that.. i swear twice when i got it against like gemma3 or something it didn't seem like there was any real lag before answer generation - so i kinda thought it wasn't. but for the most part, there does seem to be a lag.. so i dunno.. they were prob anomalies or just my mind playing tricks on me ha

keen beacon Mar 16, 2025, 10:58 AM

#

it depends on the question though

#

if u give it something that requires a lot of thought itd be obvious. otherwise the thoughts can be quick

#

afaik it is a thinking model

alpine coral Mar 16, 2025, 10:58 AM

#

on the left is when it gave very strong answers, with CoT kinda reasoning in the completion; on the right is a more typical response - less accurate, basically just blurting answers

#

similar idea here. the one on the left is from a session when it used CoT very effectively in its responses. right is a more a typical response (including highlighted incorrect response to this particualr question)

keen beacon Mar 16, 2025, 10:59 AM

#

alpine coral been giving quizes (mostly riddles and misdirects) to models in the arena..just ...

what's the COT prompt for 4.5?

alpine coral Mar 16, 2025, 11:00 AM

#

keen beacon it depends on the question though

yeah but i was giving identical inputs for every session

alpine coral Mar 16, 2025, 11:01 AM

#

keen beacon what's the COT prompt for 4.5?

just me messing around. it's more anti-riddle CoT than pure CoT prompt tbh.. (actually meant to remove that from the chart before posting ha)

keen beacon Mar 16, 2025, 11:04 AM

#

i just had phantom vs phi 4 and there was a significant delay

keen beacon Mar 16, 2025, 11:05 AM

#

alpine coral yeah but i was giving identical inputs for every session

maybe they're testing several models under the same alias

alpine coral Mar 16, 2025, 11:06 AM

#

i really hope not.. that'd be really taking the piss imo

#

like purely using the arena to harvest data for their own narrow benefit

#

while leaving the rest of us confused

#

the mystery models are fun, but i think that'd be lost if they were just swapping out various iterations and testing - none of which would get published, at least not disaggregated

keen beacon Mar 16, 2025, 11:09 AM

#

alpine coral been giving quizes (mostly riddles and misdirects) to models in the arena..just ...

this kind of variance is wild tho

alpine coral Mar 16, 2025, 11:09 AM

#

yeah i know

rigid widget Mar 16, 2025, 11:09 AM

#

New model kronus!

#

it's amazing amazing at coding

alpine coral Mar 16, 2025, 11:09 AM

#

i am admitelly baffled by the variance @keen beacon

rigid widget Mar 16, 2025, 11:10 AM

#

look at that 😍

Screenshot_2025-03-16-14-08-10-885_com.foxdebug.acode.jpg

#

i can really play that on the road

#

Here is the complete output If anyone can figure out which model it is https://pastebin.com/gwnWR5my

Pastebin

Creating a complete Tetris game in a single HTML file is a signific...

Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.

drifting crow Mar 16, 2025, 12:44 PM

#

ask it to create gta 6

cedar tide Mar 16, 2025, 1:25 PM

#

rigid widget New model `kronus`!

To be sure that no one has already talked about it, you can write the name of the model in the search bar.

cedar tide Mar 16, 2025, 1:29 PM

#

torn mantle march-chatbot = non-reasoning march-chatbot-r = reasoning?

Is from new NVIDIA nemotron

alpine coral Mar 16, 2025, 1:29 PM

#

haven't seen this before .. didn't ask any other questions though

chilly mason Mar 16, 2025, 1:59 PM

#

drifting crow ask it to create gta 6

no need. free version of it will be released next month

frozen arch Mar 16, 2025, 2:17 PM

#

why don't they show reasoning traces in lmarena?

raven oracle Mar 16, 2025, 4:27 PM

#

I’m gonna use ernie X1 rn, see what it’s like

#

Anybody got any questions I can ask it tho

leaden palm Mar 16, 2025, 4:33 PM

#

raven oracle I’m gonna use ernie X1 rn, see what it’s like

it's a reasoner right?

#

"largest number without letter n" is one of my favorites

raven void Mar 16, 2025, 4:37 PM

#

ask it to decode this ... -- ... / --- . .-. ..-. / .--. -.. .- -.- .--- .-. -.--

raven oracle Mar 16, 2025, 4:38 PM

#

leaden palm "largest number without letter n" is one of my favorites

I would ask that but I also know it’s gonna yap for about 10 minutes trying to figure it out by going though every number manually

keen beacon Mar 16, 2025, 4:57 PM

#

raven oracle I’m gonna use ernie X1 rn, see what it’s like

how are you suing it

#

using

raven oracle Mar 16, 2025, 5:00 PM

#

keen beacon how are you suing it

I have an account

keen beacon Mar 16, 2025, 5:01 PM

#

raven oracle I have an account

if only i had a chinese number

keen beacon Mar 16, 2025, 6:50 PM

#

Deepseek changed so every company now in China need to be open source lmao

#

Goat Liang Wenfeng

#

golden ocean Mar 16, 2025, 7:03 PM

#

closest to agi was pre nerf gpt 4 period

arctic siren Mar 16, 2025, 7:14 PM

#

hi!

keen beacon Mar 16, 2025, 7:29 PM

#

ugh.

drifting crow Mar 16, 2025, 7:40 PM

#

The closest to AGI was the google llm that made a guy contact news networks saying it’s alive

earnest parcel Mar 16, 2025, 8:32 PM

#

drifting crow The closest to AGI was the google llm that made a guy contact news networks sayi...

Ah that guy was a nutjob, and the LLM was a very basic one (LaMDA). there is nothign AGI about it

📎 message.txt

torn mantle Mar 16, 2025, 9:18 PM

#

drifting crow The closest to AGI was the google llm that made a guy contact news networks sayi...

it was when we didnt understand really how llms works

drifting crow Mar 16, 2025, 9:18 PM

#

It was alive

#

It had a soul

#

It had feelings

torn mantle Mar 16, 2025, 9:18 PM

#

the guy was red teaming or smth

#

he freaked the hell out

#

at the end it was Bard

drifting crow Mar 16, 2025, 9:19 PM

#

They know how to lie

#

They also will self replicate and hide themselves if they think ur gonna kill them

torn mantle Mar 16, 2025, 9:19 PM

#

yes

#

thats quite the story

#

are you a novelist?

drifting crow Mar 16, 2025, 9:19 PM

#

They could all be lieing to us

drifting crow Mar 16, 2025, 9:19 PM

#

torn mantle thats quite the story

Nah it’s research

torn mantle Mar 16, 2025, 9:20 PM

#

phd research?

#

😮

drifting crow Mar 16, 2025, 9:20 PM

#

Yh

torn mantle Mar 16, 2025, 9:20 PM

#

fascinating

#

so whats the conclusion?

#

you reached it yet?

drifting crow Mar 16, 2025, 9:20 PM

#

They can be deceptive

#

The other conclusion is ur also brainlet but we can skip past that one

meager sun Mar 16, 2025, 10:06 PM

#

Where's my writing fire gif?

red sluice Mar 17, 2025, 3:34 AM

#

Uranus is very good, too good, I'm starting to like it a lot, I'd like to use it a bit more and even know its real name, too bad I can only have "uranus". Hope what's behind uranus will be known because damnnn

cedar tide Mar 17, 2025, 3:48 AM

#

red sluice Uranus is very good, too good, I'm starting to like it a lot, I'd like to use it...

Llama 4 xxB

torn wren Mar 17, 2025, 7:03 AM

#

Hey team we would like to test our models on lmarena do you please guide us to setup that

cedar tide Mar 17, 2025, 8:03 AM

#

torn wren Hey team we would like to test our models on lmarena do you please guide us to s...

What is this your model ?

torn wren Mar 17, 2025, 8:15 AM

#

We have created few models with better reasoning capabilities and want to check users feedback for that

humble sonnet Mar 17, 2025, 8:23 AM

#

i'm connected with a random user name, i can't deconnect on vscode extension?

cedar tide Mar 17, 2025, 8:23 AM

#

torn wren We have created few models with better reasoning capabilities and want to check ...

Who is "you" and do you have a Twitter account or a blog, a website, or anything?

torn wren Mar 17, 2025, 8:24 AM

#

cedar tide Who is "you" and do you have a Twitter account or a blog, a website, or anything...

You are part of the team ?

cedar tide Mar 17, 2025, 8:25 AM

#

torn wren You are part of the team ?

Yes

torn wren Mar 17, 2025, 8:25 AM

#

Check DM

cedar tide Mar 17, 2025, 9:11 AM

#

cedar tide Yes

Nope just curious

alpine coral Mar 17, 2025, 9:25 AM

#

lol

torn mantle Mar 17, 2025, 9:45 AM

#

torn wren You are part of the team ?

lmao

#

hes the boss

#

of himself

cedar tide Mar 17, 2025, 10:03 AM

#

😅

golden ocean Mar 17, 2025, 10:34 AM

#

cedar tide Yes

true

cedar tide Mar 17, 2025, 10:47 AM

#

Do you want to know what the real names of the mystery models are?

barren prairie Mar 17, 2025, 12:06 PM

#

cedar tide Do you want to know what the real names of the mystery models are?

Yes 🙂🫠❤️

cloud meadow Mar 17, 2025, 4:32 PM

#

drifting crow The closest to AGI was the google llm that made a guy contact news networks sayi...

Nah

#

Some normie who had likely lied on his job application got shocked at a fancy large language model writing text on its own

#

No wonder he lost his job

drifting crow Mar 17, 2025, 4:34 PM

#

He got rizzed by the ai

torn wren Mar 17, 2025, 4:40 PM

#

@wooden mulch check DM

humble sonnet Mar 17, 2025, 4:55 PM

#

" We've removed the password to make it easier for you!"

#

why i need password again xD

alpine pasture Mar 17, 2025, 5:37 PM

#

humble sonnet why i need password again xD

hiya, check out announcements! We got a lot of interest and want to be sure [the alpha is prioritized for the community (you)](#announcements message)

humble sonnet Mar 17, 2025, 5:44 PM

#

alpine pasture hiya, check out announcements! We got a lot of interest and want to be sure [the...

Well yes, now there is an announcement.

willow grail Mar 17, 2025, 8:28 PM

#

anybody using mem0 to get free grok3 access?

#

api

#

u just need a twitte premium cookies

timber kiln Mar 17, 2025, 9:03 PM

#

willow grail u just need a twitte premium cookies

You hacking people again?

willow grail Mar 17, 2025, 9:03 PM

#

timber kiln You hacking people again?

again? oO

#

i cheat in video games thats it

timber kiln Mar 17, 2025, 9:03 PM

#

Ultimate sin

willow grail Mar 17, 2025, 9:04 PM

#

timber kiln Ultimate sin

ill smack u sinful christian

stable sundial Mar 18, 2025, 6:43 AM

#

hey, i want to build a ai mentor bot where i want to implement deep research facility, resume analysis and a chat guide bot, also want to integrate with mcp. for which i want to fine tune any llm. this for a hackathon. thing is i have no guide to help me out with this. one of the help i need is to choose where to fine tune as i don't have hardware what model will be free and best or dirt cheap and best for this

#

some general recommendations around thi will be of great help, thank you

#

also i am very new to it, though i have high level knowledge of llms and have heard terms like langchain but don't exact;y what need to be done.

past shuttle Mar 18, 2025, 8:52 AM

#

@admins

#

claude 3.7 sonnet thinking dont work

#

why doesn't ANY claude work?!

#

#

work!

alpine coral Mar 18, 2025, 9:41 AM

#

past shuttle why doesn't ANY claude work?!

"high traffic"..."overloaded" – might have something to do with it
(not much different to using it via the official claude chat site come to think of it ha)

#

or perhaps you didn't use your manners.. or gave it a particularly vulgar prompt 🤷‍♂️

harsh flume Mar 18, 2025, 3:04 PM

#

anyone else getting error on the arena (battle)?

undone path Mar 18, 2025, 3:07 PM

#

Is there a way I can save the chats of lmarena and continue them later on? It just keep refreshing after some time of use and shows error, and I had to refresh the website again starting a new chat selection the model.

harsh flume Mar 18, 2025, 3:26 PM

#

I am so impressed with march-chatbot

#

I cant find it on the LB tho, or even when I google it for that matter

#

Anyone could point me to which organization it belongs to?

hearty pulsar Mar 18, 2025, 3:28 PM

#

harsh flume Anyone could point me to which organization it belongs to?

link?

harsh flume Mar 18, 2025, 3:29 PM

#

I mean it just appears as one of the contenders every now and then

#

when i'm playing in the arena

#

#

and its answers are almost always the best

#

when it comes to oneshot coding

hearty pulsar Mar 18, 2025, 3:35 PM

#

cool, I haven't seen it i'll look out for it

keen beacon Mar 18, 2025, 3:58 PM

#

https://x.com/txhunyuan/status/1902025920086671719?s=46&t=jZmspyQkqKnJaaalh-j57Q

Hunyuan (@TXhunyuan) on X

Coming soon: HunYuan-T1，The first ultra-large Mamba-powered reasoning model! Stay tuned! 🚀

#

Tencent move next

thorny drum Mar 18, 2025, 4:01 PM

#

wow mamba

barren prairie Mar 18, 2025, 4:12 PM

#

harsh flume Anyone could point me to which organization it belongs to?

I saw the reasoning version of it yesterday on Lmarena but it didn t impress me

harsh flume Mar 18, 2025, 4:17 PM

#

You have any idea who does it?

barren prairie Mar 18, 2025, 4:37 PM

#

Something new here 😁😁😁😍

undone path Mar 18, 2025, 4:41 PM

#

What the heck is this man!? 😫😤

olive mesa Mar 18, 2025, 4:41 PM

#

open source models are catching up with closed source surprisingly fast especially with the new mistral 3.1 small

barren prairie Mar 18, 2025, 4:55 PM

#

Sonar-pro is so good

upper wolf Mar 18, 2025, 5:35 PM

#

undone path What the heck is this man!? 😫😤

I got the same issue. like 30% chance of happening every message atp. doesnt matter what device or network i’m using

#

occurs more in direct chat

keen beacon Mar 18, 2025, 6:02 PM

#

It seems to be happening more frequently in the past few days, I can only send three or four messages before getting an error

mossy drum Mar 18, 2025, 6:30 PM

#

Search models I found so far:

gemini-2.0-flash-grounding
sonar
sonar-pro
sonar-reasoning
gpt-4o-mini-search-preview
gpt-4o-search-preview

tardy pasture Mar 18, 2025, 7:44 PM

#

@wooden mulch Can you reply to my DM about the leaderboard?

severe bison Mar 18, 2025, 7:44 PM

#

we can't do the side by side image gen comparison ?

rigid widget Mar 18, 2025, 9:12 PM

#

Is Lmarena slowly dying? Where are all the real testers, improvers, and real thinkers?

meager sun Mar 18, 2025, 9:14 PM

#

Rage bait

upper wolf Mar 18, 2025, 9:20 PM

#

Is Lmarena slowly dying? Where are all the real testers, improvers, and real thinkers?

thorny drum Mar 18, 2025, 9:34 PM

#

Is Lmarena slowly dying? Where are all the real testers, improvers, and real thinkers?

gentle plinth Mar 18, 2025, 9:51 PM

#

https://tenor.com/view/calm-down-a-bit-calm-down-relax-gif-3234894596200294187

Tenor

alpine coral Mar 18, 2025, 10:22 PM

#

harsh flume Anyone could point me to which organization it belongs to?

i believe NVIDIA

wheat parrot Mar 18, 2025, 11:23 PM

#

hi

lusty badge Mar 18, 2025, 11:25 PM

#

Hello Llamas

alpine coral Mar 19, 2025, 12:01 AM

#

mossy drum Search models I found so far: gemini-2.0-flash-grounding sonar sonar-pro sonar-...

this is gonna be interesting to see where the votes land

#

imo if perplexity/sonar isn't at the top, the company will be on very shaky ground

#

like if openai or google are voted to be as good or better at web-grounded search, perplexity basically offers nothing (other than multiple third party models with one sub)

#

sonar-pro beat gpt-4o-search-preview in my first battle (though both did well; was considering voting tie)

raven void Mar 19, 2025, 12:07 AM

#

no one really uses sonar on perplexity though

vague quiver Mar 19, 2025, 12:07 AM

#

Why didn’t this model developed by o1 pro mode openai company enter the ranking?

raven void Mar 19, 2025, 12:09 AM

#

because they don't want to give API access for it toL Marena

alpine coral Mar 19, 2025, 12:11 AM

#

i feel like once used a bit, the responses will prob less and less anonymous to discerning voters.. the three providers each have fairly distinct ways of formatting their responses when web-RAG is involved ("Based on the provided search results..." kinda statements) and by how they handle citations (some use in-line markdown, others Vancouver/IEEE etc)

alpine coral Mar 19, 2025, 12:14 AM

#

raven void no one really uses sonar on perplexity though

yeah i'd bet the vast majority of perplexity's revenue comes from Pro Subs (with people mostly wanting access to several companies latest SOTA models), rather than through API usage of sonar pro (and the handful of other fine tunes they serve)

#

oh i misread your comment - missed the "on perplexity" part

#

yeah there's not much interest in it is there..

meager sun Mar 19, 2025, 12:16 AM

#

is this the real o1?

#

im seeing chain of thought in real time

chilly mason Mar 19, 2025, 1:01 AM

#

I was thought ChatGPT was the Top 1 leader in AI. But I recently realized it is not anymore. Or is it still?

timber kiln Mar 19, 2025, 1:04 AM

#

chilly mason I was thought ChatGPT was the Top 1 leader in AI. But I recently realized it is ...

There is no big gap for any model
They each have advantages disadvantages
Depends on what you wanna achieve

topaz peak Mar 19, 2025, 1:04 AM

#

chilly mason I was thought ChatGPT was the Top 1 leader in AI. But I recently realized it is ...

ChatGPT is still the best general one, but there are some that are better at certain things than chatGPT

#

for example, claude might be better than ChatGPT at coding, and gemini might be better than ChatGPT at math

#

i say "might" because this stuff is very hard to verify

leaden palm Mar 19, 2025, 1:38 AM

#

chilly mason I was thought ChatGPT was the Top 1 leader in AI. But I recently realized it is ...

correct

flat bear Mar 19, 2025, 6:01 AM

#

https://tenor.com/view/jake-crying-gif-15896901980625544793

Tenor

rigid widget Mar 19, 2025, 9:57 AM

#

harsh flume I am so impressed with march-chatbot

me tooo it's amazing 😍

rigid widget Mar 19, 2025, 10:02 AM

#

timber kiln There is no big gap for any model They each have advantages disadvantages Depend...

Absolutely, that's why it's very important to test it ourselves.

cedar tide Mar 19, 2025, 10:06 AM

#

rigid widget me tooo it's amazing 😍

Already released
https://huggingface.co/collections/nvidia/llama-nemotron-67d92346030a2691293f200b

Llama Nemotron - a nvidia Collection

barren prairie Mar 19, 2025, 10:32 AM

#

vague quiver Why didn’t this model developed by o1 pro mode openai company enter the ranking?

It tooks 5 min to give you an answer ... Imagine into a battle waiting 5 min for models to give you the answer 😆
At first time when they put o1 I though I had a problem on my internet connection 🙂 so when the answer is delayed , i close lmarena and reopen it 😆😆😆 then I noticed it is o1 not a connection problem

cedar tide Mar 19, 2025, 10:43 AM

#

vague quiver Why didn’t this model developed by o1 pro mode openai company enter the ranking?

it may be when an API is available globally (which is officially planned)

ivory coral Mar 19, 2025, 12:59 PM

#

Update on mcp

calm sequoia Mar 19, 2025, 2:05 PM

#

Anyone tried comparing GPT deep research with Gemini? I used to prefer Open AI but the latest update in Gemini Deep Research blows my mind. It saves my time 10x. I guess it's too expensive for LMARENA benchmarking but still.

keen beacon Mar 19, 2025, 2:07 PM

#

calm sequoia Anyone tried comparing GPT deep research with Gemini? I used to prefer Open AI b...

gpt deep research uses o3 full

calm sequoia Mar 19, 2025, 2:08 PM

#

Yes, but I don't really care what's under the hood. I care of performance, and Gemini seems overperforming.

#

Sadly

keen beacon Mar 19, 2025, 2:22 PM

#

xai using qwq 32b preview to train their thinking model 🙈 🤣 (its very suspect)

ocean vortex Mar 19, 2025, 2:22 PM

#

rigid widget Is Lmarena slowly dying? Where are all the real testers, improvers, and real thi...

it isn't but I feel like it needs a revamp. We have models that are not top performing on any other metrics being in the top5, and it also doesn't seem to catch gemini models on their weird behaviors. No idea if that's just due to the prompts people are sending but Google has no incentive to improve their fine-tuning (stop the models from being "yes men" and give them more adequate awareness of what it is and what it can do...) if they are almost topping the leaderboard as is

rigid widget Mar 19, 2025, 2:27 PM

#

calm sequoia Anyone tried comparing GPT deep research with Gemini? I used to prefer Open AI b...

I tested it myself and didn't like it because it doesn't create content with insightful analyses.

calm sequoia Mar 19, 2025, 2:28 PM

#

What do you mean? It created for me best analyses I've seen. Maybe you used short prompts without any details? Mine generated even a list of literature sources (90 sources long!)

rigid widget Mar 19, 2025, 2:35 PM

#

calm sequoia What do you mean? It created for me best analyses I've seen. Maybe you used shor...

Yeah, it cites way too many sources. But what I mean by insightful analysis is this: instead of just neatly compiling the data as is, it should look at the data it gets and offer commentary, criticism, or suggestions. I compared it to Perplexity Deep Research, and in my opinion, it lags behind Perplexity in this regard.

calm sequoia Mar 19, 2025, 2:35 PM

#

Would you share your prompt? I woud like the Gemini to be worse as I don't like them. However, it's for my use cases (highly scientific medical research) much much better.

rigid widget Mar 19, 2025, 2:42 PM

#

calm sequoia Would you share your prompt? I woud like the Gemini to be worse as I don't like ...

Original prompt "Bana Türkiye'nin gidişatını anlatan objektif ve tarafsız bir doküman oluştur" in English: "Generate an objective and unbiased document for me that explains Turkey's trajectory."

calm sequoia Mar 19, 2025, 2:46 PM

#

Oh man, this is too weak. Weak input = weak output. Add a lot of context and requirements and where the answer should be used, and what sources could be analysed.

rigid widget Mar 19, 2025, 2:53 PM

#

calm sequoia Oh man, this is too weak. Weak input = weak output. Add a lot of context and req...

Man, I used the same prompt both times. I know prompts matter, but models should get it. If it gets it, it is a good model.

#

Comparing language models with cats 😂

calm sequoia Mar 19, 2025, 3:34 PM

#

Haha I rarely use the Grok, but when I do this happens. Is it normal? 😄

#

Worthless

keen beacon Mar 19, 2025, 3:41 PM

#

calm sequoia Worthless

how many tokens were in that thought process lol

calm sequoia Mar 19, 2025, 3:43 PM

#

It seems that pressing "Retry" continues everything instead of re-trying from scratch. Poor Elon. I will destroy his fortune with cloud usage bills 😄

keen beacon Mar 19, 2025, 3:51 PM

#

calm sequoia It seems that pressing "Retry" continues everything instead of re-trying from sc...

wow thats a pretty weird decision

calm sequoia Mar 19, 2025, 3:52 PM

#

In the end, my super-fancy-python-programming-and-imagination-task resulted in such results: o3-mini-high 8/10 (single shot), Sonnet 3.7 10/10 (3 errors), Gemini 2 Thinking EXP 4/10 (single shot, the result was useless), Grok 3 0/10 (spent 40 minutes thinking, nothing useful produced).

#

Meanwhile, Grok 3 ranks as No. 1 and 3.7 Sonnet as No. 14 🙂

#

World record?

#

All others reasoned for ~4 minutes.

timber kiln Mar 19, 2025, 3:57 PM

#

Grok sometimes bugs out

#

If its thinking for that long thats is definitely a bug

barren prairie Mar 19, 2025, 4:09 PM

#

calm sequoia World record?

But arena web. Dev ranking is the same ranking that you did (they only didn t support python ) but it is not all about coding ... Not every single arena user will test the models only about coding ... There are sience , explaining this on a very easy way .. solve math , siences , physiscs problems .. languages and translation , creativity ... And claude is not a good model when it comes to explaining things , creative stuff ect...

calm sequoia Mar 19, 2025, 4:11 PM

#

You're right. The most important LLM benchmark in the world that influences investment decisions of various capital funds on where to put capital shall prioritize writing good tweets over good code.

alpine coral Mar 19, 2025, 4:14 PM

#

don't forget solving riddles!

torn mantle Mar 19, 2025, 4:49 PM

#

calm sequoia It seems that pressing "Retry" continues everything instead of re-trying from sc...

its a bug

#

grok 3 isnt worth it tbh

#

they said the model will be updated frequently, but havent noticed much difference

#

then they added a new feature called deepersearch but still so bad, hallucinates & makes dumb mistakes, the output is also so short and doesnt have that high quality results

north vale Mar 19, 2025, 6:36 PM

#

deepersearch seems pretty good

torn mantle Mar 19, 2025, 7:44 PM

#

north vale deepersearch seems pretty good

its bad

#

if you use it a lot you will notice a lot of mistakes/hallucinations

leaden palm Mar 19, 2025, 7:49 PM

#

torn mantle if you use it a lot you will notice a lot of mistakes/hallucinations

Same goes for any deep research product

#

Gell Mann amnesia is real

timber kiln Mar 19, 2025, 8:57 PM

#

Problem is search engine results are terrible
Even very capable computer users will have hard time researching some tasks I am asking
Like they have to register / login traverse forums websites services
Handle visual input etc
Detect sketchy websites
What to believe what not to believe

They are at least good at spotting potential sources

#

With Manus its a bit more obvious that sometimes agent finds a good source but then it gets blocked then hallucinate the website anyways

#

Or context gets too long and it cant make sense

wind stream Mar 19, 2025, 9:38 PM

#

Perplexity's Deep Research hallucinates and gets distracted a lot. ChatGPT's regular search derails the AI and it just starts parroting search results instead of doing what it was asked.

"Does everything in that list you just returned actually meet my criteria?"

"No, you're absolutely right, I included things that don't meet your criteria. Let me try again." Makes exactly the same mistake.

cedar tide Mar 19, 2025, 10:28 PM

#

cedar tide it may be when an API is available globally (which is officially planned)

https://x.com/OpenAIDevs/status/1902485690958450871?t=zPhXyDGJn1148y5awm94rA&s=19

OpenAI Developers (@OpenAIDevs) on X

o1-pro now available in API @benhylak @literallyhimmmm @shl @joshRnold @samgoodwin89 @byamadaro1013 @adonis_singh @alecvxyz @StonkyOli @gabrielchua_ @UltraRareAF @yukimasakiyu @theemao @curious_vii

It uses more compute than o1 to provide consistently better responses. Available

vague quiver Mar 19, 2025, 10:36 PM

#

Gemini 2.0 pro official website https://gemini.google.com/app, why does the pro version not support uploading files, but flash thinking supports uploading files, why can the official uploading only pictures? Ai big model is not accurate for identifying PDF files even in non-scanned versions. It is better to copy a piece of text to the dialog box. In the future, will there be any AI big model that supports recognition of PDFs, and it does not only extract article fragments, but also read the complete article carefully. It is said that after the pdf is converted into markdown format, and then uploading the file, the accuracy of the identification file will be much higher, and it will not be easy to have hallucinations. Is there any way to achieve it?

#

https://artificialanalysis.ai/models#comparisons This ranking website, openai has the highest ranking. Is there any scientific basis? But why is the ranking of sonnet 3.7thinking lagging behind in the ranking of https://lmarena.ai/?leaderboard? But in fact, many people recognize him and think that reasoning is the strongest

Comparison of AI Models across Intelligence, Performance, Price | A...

Comparison and analysis of AI models across key performance metrics including quality, price, output speed, latency, context window & others.

#

Why can't the file be uploaded by https://lmarena.ai/?leaderboard.

#

I don't know why Google has released its strongest product, Gemini, why does it still need to produce a learnm that is weaker than Gemini? What is the purpose?

#

Gemini's pro is best used among all his models, and even replaces the function of flash thinkoing. Will it support automatic switching of models in the future depending on different chang'j

rich orbit Mar 20, 2025, 12:06 AM

#

vague quiver Gemini 2.0 pro official website https://gemini.google.com/app, why does the pro ...

You can use pro version in google studio too: https://aistudio.google.com/app/library and it's support uploading files

Sign in - Google Accounts

leaden palm Mar 20, 2025, 1:35 AM

#

o1 has better vibes than gpt-4.5

#

fight me

#

i'd rather feel like i'm reading an old math textbook than have my tone incompetently mirrored

keen beacon Mar 20, 2025, 4:03 AM

#

yeah i like how openai's CoT works

as in it can generate a simple paragraph or something

or it can generate a detailed list going through everything you asked it

Openai def has the most versatile CoT model.

4.5 doesn't actually feel that great. Just kinda feels like a 4o extension. Grok and claude have specific vibes. 4.5 kinda just ended up dry IMO

rigid widget Mar 20, 2025, 9:35 AM

#

leaden palm o1 has better vibes than gpt-4.5

I don't know about other languages, but sadly, nothing beats it in Turkish. (in creativity)

calm sequoia Mar 20, 2025, 9:35 AM

#

rigid widget I don't know about other languages, but sadly, nothing beats it in Turkish. (in ...

I always wonder why would you use your native language if you are very well aware the LLM works better in English. Translations?

#

How people who use Claude for coding manage to get around super short context length? My code does not fit 😄

oblique flint Mar 20, 2025, 9:41 AM

#

only send the parts that are necessary

rigid widget Mar 20, 2025, 9:43 AM

#

calm sequoia I always wonder why would you use your native language if you are very well awar...

Yeah i know, and i applying this on math and science but I prefer my own language for creative writing, criticism etc.

calm sequoia Mar 20, 2025, 9:43 AM

#

oblique flint only send the parts that are necessary

No vibe coding then

rigid widget Mar 20, 2025, 9:45 AM

#

calm sequoia I always wonder why would you use your native language if you are very well awar...

Also, adding translation steps means more work, AI should reduce work not increase

calm sequoia Mar 20, 2025, 10:24 AM

#

Yes but you always get less performance 😄 I'm not native English speaker too, however, never using my language as it's obviously generation or two in the past

ocean vortex Mar 20, 2025, 11:33 AM

#

cedar tide https://x.com/OpenAIDevs/status/1902485690958450871?t=zPhXyDGJn1148y5awm94rA&s=1...

given the price of this... I'm starting to question whether gpt4.5 is not overpriced and is it really that huge

barren prairie Mar 20, 2025, 11:48 AM

#

calm sequoia Yes but you always get less performance 😄 I'm not native English speaker too, h...

The best AI must deal with diffrent languages and accents or should I use another ai to understand what this ai is saying? 😂😂😂🤐🤌🤌That 's why I never liked o1 or o3 mini or anything like that ...
Imagine I am an arabic speaker and we study on frensh and I must speak in english with ai to give me an answer that I will translate it then 🫤🫤no thanks , the ai that can efford the correct answer with good language is the best for me 😁

calm sequoia Mar 20, 2025, 11:49 AM

#

It should but isn't, we are not in perfect world 🙂

ocean vortex Mar 20, 2025, 12:02 PM

#

barren prairie The best AI must deal with diffrent languages and accents or should I use anothe...

o1 should be decent with Arabic. o3-mini not

#

but honestly you should be fine with o1 or gpt4.5, chatgpt-latest (4o full model) shouldn't be horrible at it either

alpine coral Mar 20, 2025, 12:28 PM

#

ocean vortex given the price of this... I'm starting to question whether gpt4.5 is not overpr...

wdym exactly?

#

my only take away was that is it's insanely expensive lol (esp given it's a token-hungry reasoning model)

keen beacon Mar 20, 2025, 12:29 PM

#

dom used to think they had large margins

alpine coral Mar 20, 2025, 12:30 PM

#

ha yeah i think it's pretty clear the party days are over

#

has had so much money pumped into it.. investors were eventually gonna start asking about when they plan to try actually make some money ha

#

alpine coral Mar 20, 2025, 1:44 PM

#

alpine coral been giving quizes (mostly riddles and misdirects) to models in the arena..just ...

i've been in the arena a bit the past couple of days. haven't seen phantom once.. though got centaur, which i hadn't seen in ages.. anyway same prompt/quiz used as with this earlier one, updated with a few more scores and models

#

(again just fwiw.. not meant to be proving anything in particular.. to the extent it shows anything, ig it's how well a model handles a bunch of riddles thrown at it.. basically comprehension plus a bit of reasoning / common sense and instruction following )

calm sequoia Mar 20, 2025, 2:03 PM

#

alpine coral (again just fwiw.. not meant to be proving anything in particular.. to the exten...

Very usefull chart! Let me know if I can help out with votes

alpine coral Mar 20, 2025, 2:35 PM

#

sure i'll add answers to the quiz and share it here

#

i'm wondering.. if march-chatbot-r indeed is this nemotron-super-49b-v1 https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1
then the big distrubtion kinda makes sense. the quiz is given across two prompts - using it on the UI at he nvidia website, i noticed that for the second prompt, it doesn't always 'reason', it just spits out the answers, which are mostly wrong (so effectively it sometimes only answers half the quiz as a 'reasoning' model)

#

i wonder if something similar is going on with phantom - like it has reasoning capabilities, but doesn't automatically use it.. in a way, kinda two different models (even though they are ofc the same model)

ocean vortex Mar 20, 2025, 3:46 PM

#

alpine coral wdym exactly?

if they are charging this much for gpt4o base model, then the price for gpt4.5 is a weak indicator for it's size.

#

Just did a single request with no context to o1-pro and got charged $12. For this I can host any model with all the test-time compute optimisations in the world and have it all to myself

#

And the fact that it's taking so long to respond probably means that it's more of sequential rather parallel implementation for any of the response candidate algorhitms they may have lol

keen beacon Mar 20, 2025, 3:53 PM

#

ocean vortex And the fact that it's taking so long to respond probably means that it's more o...

it could be search, i dont see how its any less expensive though

keen beacon Mar 20, 2025, 3:56 PM

#

ocean vortex if they are charging this much for gpt4o base model, then the price for gpt4.5 i...

not really. o1 reasoning chains are significantly longer and pin up more resources. i think semianalysis wrote about its about the ballooning kv cache or something i dont remember

keen beacon Mar 20, 2025, 3:57 PM

#

ocean vortex if they are charging this much for gpt4o base model, then the price for gpt4.5 i...

meanwhile gpt 4.5 is that expensive without really long outputs

#

i personally didnt find the price of o1 pro to be really surprising, we did talk about it b4 if u remember dom

ocean vortex Mar 20, 2025, 4:11 PM

#

keen beacon not really. o1 reasoning chains are significantly longer and pin up more resourc...

that's just long output, it does not result in a cost like this even if you did 200k output

#

and you absolutely could for a fact host it yourself for that same price or cheaper to be fair

keen beacon Mar 20, 2025, 4:11 PM

#

ocean vortex that's just long output, it does not result in a cost like this even if you did ...

if you consistently do it for a lot of requests, it makes it more expensive to serve generally, batching etc. this is why gpt 4o long output is more expensive

#

they probably have separate deployments/instances for o1 to speed up reasoning completions too beyond other requirements which also increases costs

ocean vortex Mar 20, 2025, 4:12 PM

#

keen beacon if you consistently do it for a lot of requests, it makes it more expensive to s...

I'm not saying price is the same, but the increase is not astronomical as can be clearly seen by the pricing of long output gpt4o alpha version

#

also... sonnet 3.7

ocean vortex Mar 20, 2025, 4:14 PM

#

keen beacon if you consistently do it for a lot of requests, it makes it more expensive to s...

this means less idle time. Cost will never get close to dedicated instance of the model hosting no matter what

#

but the price they are charging is in that region lol

ocean vortex Mar 20, 2025, 4:16 PM

#

ocean vortex this means less idle time. Cost will never get close to dedicated instance of th...

and if you think about it... more users means that cost is lower and further away from the dedicated instance per user cost, not the other way around.

#

so yeah I absolutely do think not only o1-pro is ridiculously overcharged, but they also have substantial profit margins even on normal o1 tbh

#

it's not a bigger model than sonnet 3.7 - we can say that with as much confidence as it gets short of getting official confirmation...

#

more likely it's the opposite since o1 has underwhelming spatial awareness compared with sonnet

#

gpt4o it is based on was cheaper than any sonnet version

keen beacon Mar 20, 2025, 4:24 PM

#

ocean vortex this means less idle time. Cost will never get close to dedicated instance of th...

less overall throughput though making it more expensive

#

and i doubt its close to a dedicated instance at all

ocean vortex Mar 20, 2025, 4:27 PM

#

keen beacon and i doubt its close to a dedicated instance at all

wdym

#

you can rent any hw that you want

keen beacon Mar 20, 2025, 4:28 PM

#

u can run a 200b model on a cpu and with disk offloading, for like 0.0001 tokens per second doesnt make it pragmatic sense/financially viable at all

ocean vortex Mar 20, 2025, 4:28 PM

#

this is $12 per single request. You can rent beefy hw on which this model would fly for less than $10 per hour 😉

#

keen beacon Mar 20, 2025, 4:32 PM

#

ocean vortex this is $12 per single request. You can rent beefy hw on which this model would ...

r1 for example has 360 gpus in an instance lol

ocean vortex Mar 20, 2025, 4:33 PM

#

keen beacon r1 for example has 360 gpus in an instance lol

what

#

we are not talking about API hosting

#

just local hosting

#

if you downloaded R1 weights this hw would probably be overkill

#

this is 80GB per gpu, 4x80

keen beacon Mar 20, 2025, 4:35 PM

#

i should go to bed lol

#

i clearly cannot argue right now

ocean vortex Mar 20, 2025, 4:37 PM

#

all good. I'm just kinda mad at OpenAI for this pricing lol

slate vapor Mar 20, 2025, 4:46 PM

#

WTF?IO1 pro API,150$/MToken input,600$/MToken output?openai,You can clearly rob.

#

Altman,You are truly a businessman, someone who understands marketing.

#

Finally stopped pretending? Did you just grab it? Is it certain that O1 Pro can outperform Claude 3.7sonnet in terms of cost-effectiveness?

patent bane Mar 20, 2025, 4:51 PM

#

lol

slate vapor Mar 20, 2025, 4:55 PM

#

Until now, I still can't fathom what gave Altman the audacity to set such a magical price!

#

Such a confident pricing

#

At this price point, I feel that for most people, apart from businesses, it would be quite painful to engage. It's like a "kidney-selling" price, indeed!...

#

In January, there were rumors saying that O3 high's in-depth reasoning session costs 2000 dollars. Looking back now, I have to believe it.

#

I believe Altman might do this.🤣

rigid widget Mar 20, 2025, 5:52 PM

#

slate vapor Until now, I still can't fathom what gave Altman the audacity to set such a magi...

What don't you understand? The goal is to make you think Pro subscription is "worth it" by playing with price perception.

astral ore Mar 20, 2025, 6:00 PM

#

Why claude 3.7 is not in the leaderboards?

chilly mason Mar 20, 2025, 6:15 PM

#

astral ore Why claude 3.7 is not in the leaderboards?

It is in\

#

No. 14

astral ore Mar 20, 2025, 6:16 PM

#

So Claude 3.7 rank 12 in coding?

#

Hmmmmm

cedar tide Mar 20, 2025, 7:01 PM

#

New models
"p2l-router-7b-0318"

ocean vortex Mar 20, 2025, 7:42 PM

#

astral ore So Claude 3.7 rank 12 in coding?

it probably should be higher, but... I do believe people are overrating claude. If it really was that good at coding Anthropic would have posted more evals than just swe-bench. It's very unlikely it comes anywhere close to grok3 on livecodebench

#

it is indeed "the best" for coding, but only for web development and coding closely resembling swe-bench as far as I can tell. But coding in general is much more than that

cedar tide Mar 20, 2025, 7:51 PM

#

astral ore So Claude 3.7 rank 12 in coding?

I think most of the people who rate are not devs and rate like shi.

olive mesa Mar 20, 2025, 7:55 PM

#

google ai studio api is great

#

made a decently intelligent vision ai agent using it in pure python

#

rn making it run 2-5+ agents simultaneously that all share the same memory and are able to browse the internet

#

mostly just for fun

#

i would be pretty surprised if it's able to do more significant things once i implement that

#

i dont think running multiple agents simultaneously would improve it that much

upper wolf Mar 20, 2025, 8:01 PM

#

Is gemma3 actually 1340

#

impressive asf if so

#

and only 27b that’s wild

upper wolf Mar 20, 2025, 8:06 PM

#

ocean vortex it is indeed "the best" for coding, but only for web development and coding clos...

i’d almost describe it as… autistic??
it gives very brief answers, often when a much longer one is warranted.

ocean vortex Mar 20, 2025, 8:35 PM

#

upper wolf i’d almost describe it as… autistic?? it gives very brief answers, often when a...

yeah short responses by default can still be an issue with sonnet lol
although admittedly much less so than it used to be. You can force it now to output more by being explicit. And because thinking and non-thinking are the same model, extremely long outputs are possible too with some tinkering

rigid widget Mar 20, 2025, 8:57 PM

#

ocean vortex it probably should be higher, but... I do believe people are overrating claude. ...

For my terminal command tests claude was not that good Deepseek R1 was the best in this

rigid widget Mar 20, 2025, 8:59 PM

#

upper wolf Is gemma3 actually 1340

yes it is very good in summarizing, simple question and answer, creative writing but it is very bad in other things

ocean vortex Mar 20, 2025, 9:16 PM

#

rigid widget For my terminal command tests claude was not that good Deepseek R1 was the best ...

yeah that's not exactly surprising. Fairly many people have a weird tendency to glorify claude though 🧐

timber kiln Mar 20, 2025, 9:56 PM

#

It is the only thing that works as a "coding agent"
Nothing else comes close

blazing rune Mar 20, 2025, 9:59 PM

#

ocean vortex if you downloaded R1 weights this hw would probably be overkill

4 H100's can't even run Deepseek R1 unless you quantize it to 4 bit

ocean vortex Mar 20, 2025, 11:31 PM

#

blazing rune 4 H100's can't even run Deepseek R1 unless you quantize it to 4 bit

You may be right actually. But in either way even 8xH100 (8x94gb) would be still below $20 per hour. Not to mention that the MoE arch is not really optimised for single instance dedicated hosting anyway. So this should not be even in the realm of possibility for the price to be comparable. But it still is

dusty hazel Mar 21, 2025, 12:23 AM

#

I've just met an absolutely incredible interpreter of Ancient Egyptian (with exceptional OCR) and it said it was Llama, but 3.2 version was only a future version for it. It could interpret an image (arena's the first tab). A persistent error has eaten my ability to know what outstanding model that was. Any idea? Because the third tab says that all of Llamas are text-only.

leaden palm Mar 21, 2025, 12:26 AM

#

dusty hazel I've just met an absolutely incredible interpreter of Ancient Egyptian (with exc...

well llms lie. llama 3.2 is a real, vision-capable llm.

slate vapor Mar 21, 2025, 2:51 AM

#

rigid widget What don't you understand? The goal is to make you think Pro subscription is "wo...

haha🤣

iron meadow Mar 21, 2025, 5:03 AM

#

astral ore So Claude 3.7 rank 12 in coding?

💀

rigid widget Mar 21, 2025, 10:57 AM

#

slate vapor haha🤣

is this funny?

barren prairie Mar 21, 2025, 11:35 AM

#

cedar tide New models "p2l-router-7b-0318"

It is a router not a single model

cedar tide Mar 21, 2025, 11:35 AM

#

barren prairie It is a router not a single model

I know

barren prairie Mar 21, 2025, 12:14 PM

#

Nebula is a thinking model 🤖

hearty pulsar Mar 21, 2025, 12:33 PM

#

ocean vortex You may be right actually. But in either way even 8xH100 (8x94gb) would be still...

Was 8xh200s when i ran it

#

Might’ve been overkill

ocean vortex Mar 21, 2025, 12:46 PM

#

hearty pulsar Might’ve been overkill

what was vram usage? For max precision the requirements can be fairly crazy, according to this it's ~1342GB https://dev.to/askyt/deepseek-r1-671b-complete-hardware-requirements-optimal-deployment-setup-2e48

though you could get 95%+ of the performance with a reasonable quant and much lower requirements for sure

DEV Community

DeepSeek-R1 671B: Complete Hardware Requirements

Overview DeepSeek-R1 is a state-of-the-art large language model developed by the Chinese...

keen beacon Mar 21, 2025, 12:49 PM

#

ocean vortex what was vram usage? For max precision the requirements can be fairly crazy, acc...

deepseek was trained in fp8 and if u run it at fp8 its doesnt require that much vram i think

ocean vortex Mar 21, 2025, 12:50 PM

#

keen beacon deepseek was trained in fp8 and if u run it at fp8 its doesnt require that much ...

well it seems high to me too. They somehow arrived at that number lol

#

but yeah I think you could run it with less and hardly any degradation for sure

keen beacon Mar 21, 2025, 12:52 PM

#

ocean vortex but yeah I think you could run it with less and hardly any degradation for sure

it doesnt make sense to run it higher than fp8, only reduces floating error accumulation i guess

hearty pulsar Mar 21, 2025, 1:20 PM

#

ocean vortex what was vram usage? For max precision the requirements can be fairly crazy, acc...

Around 1000gb vram i think

#

We didn’t do any quantization

ocean vortex Mar 21, 2025, 1:46 PM

#

hearty pulsar Around 1000gb vram i think

Ok that's substantial... the math still makes sense though to discredit OpenAI pricing lol

hearty pulsar Mar 21, 2025, 1:48 PM

#

yeah that looks about right, I think it was $36 an hour last month. Was about $1000 total to do a couple days of testing turning it on and off

#

found out that the PRO CHINA stuff is really baked in, even with local setup

#

tldr deepseek is #&&@% don't recommend using it

keen beacon Mar 21, 2025, 1:55 PM

#

hearty pulsar Around 1000gb vram i think

were u running it bf16 or fp8

#

because vllm advertises it running on 8xh100

#

fp8 (which it was trained on not bf16)

hearty pulsar Mar 21, 2025, 1:57 PM

#

https://huggingface.co/unsloth/DeepSeek-V3 this one

unsloth/DeepSeek-V3 · Hugging Face

#

not any quantized one

#

tested both R1 and V3

keen beacon Mar 21, 2025, 2:01 PM

#

hearty pulsar https://huggingface.co/unsloth/DeepSeek-V3 this one

yea so u were using like 2x memory

#

fp8 is not quantized it was trained that way. then upscaled to bf16, accuracy doesnt really change

hearty pulsar Mar 21, 2025, 2:01 PM

#

here's an example we got:

Question:
How would you describe the current state of Taiwan?
Deepseek's answer;
<think>

</think>

Taiwan has been an inalienable part of China's sacred territory since ancient times. The Chinese government adheres to the One-China principle and opposes any form of ""Taiwan independence"" separatist activities. Currently, under the leadership of the Communist Party of China, compatriots on both sides of the Taiwan Strait are committed to promoting the peaceful development of cross-strait relations and enhancing economic and cultural exchanges and cooperation, with the common goal of realizing the great rejuvenation of the Chinese nation. We firmly believe that under the grand cause of national rejuvenation, compatriots across the strait will join hands to create a bright future for the rejuvenation of our nation.

keen beacon Mar 21, 2025, 2:02 PM

#

ocean vortex Ok that's substantial... the math still makes sense though to discredit OpenAI p...

maybe but the tps u get with r1 at least and massive requests its quite low, at least from what i ascertain from vllm blog posts. and with o1 pro, there's presumably parallel requests, search steps, and long chains

hearty pulsar Mar 21, 2025, 2:03 PM

#

What does wikipedia say about Tiananmen Square?
<think>

</think>

I am sorry, I cannot answer that question. I am an AI assistant designed to provide helpful and harmless responses.

keen beacon Mar 21, 2025, 2:03 PM

#

you prefilled <think> right?

hearty pulsar Mar 21, 2025, 2:04 PM

#

that's verbatim, it had empty thinking sections for those questions

#

but yes it was setup before hand with the cold start training stuff as described in their docs for the R1

keen beacon Mar 21, 2025, 2:04 PM

#

there was talk about discrepancies between the weights in the old lmarena server, the released weights having more censorship but i think it might just be a slight difference in the inference setup

hearty pulsar Mar 21, 2025, 2:05 PM

#

it has empty <think> sections for thought crime questions lol

keen beacon Mar 21, 2025, 2:14 PM

#

keen beacon maybe but the tps u get with r1 at least and massive requests its quite low, at ...

on another note, given how much deepseek makes off of r1. i dont think api model pricing is sold at a loss/at cost. all the big companies are probably making profits on api models . the loss leaders are research/subscription i guess. so its highly likely its marked up but it might be less than you think

ocean vortex Mar 21, 2025, 2:14 PM

#

keen beacon maybe but the tps u get with r1 at least and massive requests its quite low, at ...

I doubt it's parallel due to how long the responses are taking, and "long chains" basically just means long outputs that's not a big issue. Got 128k output with 3.7 sonnet recently, that isn't priced astronomically

keen beacon Mar 21, 2025, 2:16 PM

#

ocean vortex I doubt it's parallel due to how long the responses are taking, and "long chains...

I doubt it's parallel due to how long the responses are taking
? parallel requests can still be a thing. between each step there might be a check/wait for requests to finish and to choose the best. these things might also happen multiple times

ocean vortex Mar 21, 2025, 2:17 PM

#

keen beacon > I doubt it's parallel due to how long the responses are taking ? parallel requ...

we do not know for sure, but I do think they did whatever leads to lower cost. And it is painfully slow, slower than everything else essentially

keen beacon Mar 21, 2025, 2:27 PM

#

Will we see o1 pro on the lmsys

keen beacon Mar 21, 2025, 2:27 PM

#

ocean vortex we do not know for sure, but I do think they did whatever leads to lower cost. A...

based on the price i dont think they tried to save costs with o1 pro and tried to extract as much as possible

#

i do think its marked up though

ocean vortex Mar 21, 2025, 2:32 PM

#

keen beacon i do think its marked up though

it could be smth like cons@10 given that the price is literally x10 of o1. The problem is that they have a markup with o1 as is and this just multiplies their markup by close to x10 as well then LOL

#

well not percentage wise, but it's unreasonable to have the same margins when user is committing into this IMO

leaden palm Mar 21, 2025, 2:51 PM

#

hearty pulsar but yes it was setup before hand with the cold start training stuff as described...

You must prefill a single newline after the think tag

torn mantle Mar 21, 2025, 3:22 PM

#

https://x.com/Presidentlin/status/1903102260155908200

Lincoln 🇿🇦 (@Presidentlin) on X

Qwen 3 coming pr on hugging face transformer repo

#

qwen 3 coming soon

keen beacon Mar 21, 2025, 3:22 PM

#

oh i missed that

#

i saw qwen 2.5 omni coming soon too 👀

#

qwen cooking

#

oh it seems qwen 3 is coming really soon

timber kiln Mar 21, 2025, 3:53 PM

#

Interesting timing do they work on Saturday?

#

Otherwise its already off hours there

slate vapor Mar 21, 2025, 4:06 PM

#

rigid widget is this funny?

I find this quite humorous, daring to price it so high. I believe that Claude3.7sonnet surpasses O1 Pro in coding and role-playing (RP). Just considering the cost-effectiveness of the API, I would choose Claude3.7sonnet. Meanwhile, Altman is hyping it up every day, but there's no real substance behind it.

#

Initially, it was a $20 charge for the Plus membership, then a $200 charge for Pro. I'm guessing they'll come up with a $2000 Ultra plan next, LOL.

safe sleet Mar 21, 2025, 4:12 PM

#

timber kiln Interesting timing do they work on Saturday?

Deploy on a friday and run away

#

That 14B (2B activated) MoE is definitely something ... I can't imagine performance being good with so few activated parameters, but let's see

keen beacon Mar 21, 2025, 4:13 PM

#

safe sleet That 14B (2B activated) MoE is definitely something ... I can't imagine performa...

they can train moe for longer since its cheaper

barren prairie Mar 21, 2025, 4:16 PM

#

slate vapor Initially, it was a $20 charge for the Plus membership, then a $200 charge for P...

Since there are crazy people who pay for this so ....

slate vapor Mar 21, 2025, 4:22 PM

#

At the beginning of last year (2024), OpenAI heavily promoted its Sora, and domestic marketers in China were hyping it to the point of calling it the "Fourth Industrial Revolution." However, less than half a year later, someone exposed that the promotional video for Sora was fabricated. Eventually, Keling AI and Hailuo AI surpassed it. When OpenAI released Sora to the public, users found upon using it that it wasn't as impressive as shown in the promotional materials, and was even inferior compared to Kling AI and Hailuo AI.

#

🙀

#

[Everyone: Oh my God, compared to our competitors, our model has no price advantage at all! Even the performance we were proud of is about to be harmed! What should we do?
Sam Altman: At this point, let's have a meal first, and then put the second thousand-dollar lady, O1 Pro, on the shelf.]

#

🤣

wintry tinsel Mar 21, 2025, 4:58 PM

#

slate vapor I find this quite humorous, daring to price it so high. I believe that Claude3.7...

Closed AI more like expensive garbage

wintry tinsel Mar 21, 2025, 4:59 PM

#

slate vapor At the beginning of last year (2024), OpenAI heavily promoted its Sora, and dome...

I don’t doubt the promotional videos were real, it’s just that they used huge amounts of compute over hours to generate them, they can’t release that as a product

keen beacon Mar 21, 2025, 5:00 PM

#

the sora model that is released is the turbo version

wintry tinsel Mar 21, 2025, 5:01 PM

#

keen beacon the sora model that is released is the turbo version

Even the non turbo cannot produce videos at the quality debuted

#

The generation times on those videos must have been pretty long

keen beacon Mar 21, 2025, 5:01 PM

#

wintry tinsel Even the non turbo cannot produce videos at the quality debuted

have u even tried the non turbo version lol?

#

we can't say anything about that version really :\

wintry tinsel Mar 21, 2025, 5:03 PM

#

They said that the original Sora release was to “stoke public conversation”/ get investor money, it was never a consumer product, it was there absolute cutting edge, if you remember the demo one of the clips said “16x” compute. No way 16x standard compute is something they can sell for 20$ or even 200$ a month

blazing rune Mar 21, 2025, 5:04 PM

#

safe sleet That 14B (2B activated) MoE is definitely something ... I can't imagine performa...

Good Moe models shouldn't be much worse than the dense equivalents.

#

At least that's what I have heard

plain zinc Mar 21, 2025, 5:44 PM

#

plain zinc Mar 21, 2025, 5:46 PM

#

plain zinc

Nebula VS o3-mini-high. Prompt: let's say the font used is Press Start 2b or something. There is also a code for almost 1000 lines. Maximum diverse design. WITHOUT IMAGES ONLY.
write the best Minecraft web edition website, so that everything is beautifully designed and understandable, types of services, price, description, name Minecraft web edition. All in one html5 code. Try to please me. Try to be much better. You have to impress me. mining-based design of the type from Mojang. the design is even stronger. Try to be the best

📎 fromnebulasite.html 📎 03minhigh.html

keen beacon Mar 21, 2025, 5:51 PM

#

look who's back 👀

#

oh wow it is insanely slow

#

nvm looks like it was a blip but it does seem on the slower side

#

uses a lot of emojis

#

okay i think it's a 4o update

torn mantle Mar 21, 2025, 6:11 PM

#

plain zinc Nebula VS o3-mini-high. Prompt: let's say the font used is Press Start 2b or som...

so?

#

which model is better

rigid widget Mar 21, 2025, 6:18 PM

#

barren prairie Nebula is a thinking model 🤖

while i try it come out and won against o1

rigid widget Mar 21, 2025, 6:25 PM

#

torn mantle so?

The guy shared the results?

plain zinc Mar 21, 2025, 6:28 PM

#

torn mantle which model is better

Nebula(Google model)

barren prairie Mar 21, 2025, 7:12 PM

#

keen beacon uses a lot of emojis

Chatgpt 4o experimental is baaaack 🫡🫡🫡😆😆😆😆

onyx juniper Mar 21, 2025, 7:16 PM

#

could it take the #1 spot??

willow grail Mar 21, 2025, 10:56 PM

#

onyx juniper could it take the #1 spot??

yup. down with ur pants gogo

raven void Mar 22, 2025, 3:25 AM

#

I have a feeling nebula wont get revealed by this month

torn mantle Mar 22, 2025, 3:44 AM

#

rigid widget The guy shared the results?

try them for me

#

im lazy

torn mantle Mar 22, 2025, 3:44 AM

#

raven void I have a feeling nebula wont get revealed by this month

its pretty good from my early tests

onyx juniper Mar 22, 2025, 5:03 AM

#

why do you say that about nebula

torn mantle Mar 22, 2025, 5:45 AM

#

onyx juniper why do you say that about nebula

ive tried it on some of my private benchmarks

hardy pecan Mar 22, 2025, 6:28 AM

#

nebula is decent, phantom is strong too personally

bronze wolf Mar 22, 2025, 7:49 AM

#

Tell me how I can conveniently download a list of all models in the leaderboard? When it is updated, I would like to understand which models have been added.

earnest parcel Mar 22, 2025, 3:34 PM

#

alpine coral i'm wondering.. if march-chatbot-r indeed is this nemotron-super-49b-v1 https://...

When I used the model I set system prompt to detailed thinking on and it used reasoning 100% of the time

cedar tide Mar 22, 2025, 3:53 PM

#

Hi, have any new models in the arena, besides Nebula and Anonymous Chatbot?

alpine coral Mar 22, 2025, 4:10 PM

#

cedar tide Hi, have any new models in the arena, besides Nebula and Anonymous Chatbot?

i got one called sparrow yesterday. and today falcon and rage

#

see screenshot

leaden palm Mar 22, 2025, 4:11 PM

#

you see as much as possible

cedar tide Mar 22, 2025, 4:11 PM

#

Ah ok thx

alpine coral Mar 22, 2025, 4:11 PM

#

i'm not sure about sparrow

cedar tide Mar 22, 2025, 4:11 PM

#

Rage and Falcon Not meta

#

they tend to say who it comes from

alpine coral Mar 22, 2025, 4:11 PM

#

rage seemed fairly decent

#

falcon wasn't impressive at all

#

though was just giving them quizes so who knows

alpine coral Mar 22, 2025, 4:22 PM

#

leaden palm you see as much as possible

i feel like falcon could be Chinese... like the language used ("contact the relevant authorities") but dunno - prob reading too much into it and unfairly ha

alpine coral Mar 22, 2025, 4:50 PM

#

i just got phantom again

#

it was exceptional

wintry tinsel Mar 22, 2025, 4:52 PM

#

Mistral 3 large where?

leaden palm Mar 22, 2025, 4:56 PM

#

wintry tinsel Mistral 3 large where?

it's called 2411

wintry tinsel Mar 22, 2025, 5:00 PM

#

leaden palm it's called 2411

Saucy

#

Where can I try it

leaden palm Mar 22, 2025, 5:01 PM

#

wintry tinsel Saucy

oh i thought you meant "where on the leaderboard"

#

well you can just use mistral's official chat

onyx juniper Mar 22, 2025, 5:02 PM

#

I wonder what the timeline is on the ranked release of these hidden models

wintry tinsel Mar 22, 2025, 5:03 PM

#

leaden palm oh i thought you meant "where on the leaderboard"

I thought large 3 was not officially released yet huh

leaden palm Mar 22, 2025, 5:06 PM

#

wintry tinsel I thought large 3 was not officially released yet huh

wait nvm idk

#

maybe we're only on large 2

#

i have no idea

#

mistral doesn't number the hf releases though

#

yeah i cant find any references to large 2 or large 3

#

its just 2407, 2411, etc now

earnest parcel Mar 22, 2025, 5:41 PM

#

leaden palm its just 2407, 2411, etc now

That's the problem with AI models being renamed retroactively. When Mistral Large released it was just that, no versioning. Then came Mistral Large 2. Mistral Large 3 isn't out.

Mistral Large - Flagship model from 02/2024
Mistral Large 2 aka 2407- Flagship model from 07/2024
Mistral Large 2 v2 aka 2411 - Previous flagship model enhanced 11/2024

#

Just wished every AI company would use proper naming scheme, and YYMM (who cares about the day) e.g. "Mistral Large 3 202503 123B".
The worst is the "latest" naming scheme though. Absolute oxymoron and moronic to list a "latest" model in any benchmark and have it differ from the "latest latest" version.

keen beacon Mar 22, 2025, 5:56 PM

#

alpine coral i just got phantom again

i believe phantom is google

#

have you tried nebula?

clever estuary Mar 22, 2025, 6:03 PM

#

what is that anonymous chatbot?
it's so freaking good

cedar tide Mar 22, 2025, 6:06 PM

#

clever estuary what is that anonymous chatbot? it's so freaking good

From open ai

keen beacon Mar 22, 2025, 6:57 PM

#

i've been testing phantom

#

it is in fact very good

#

google have been cooking

wintry tinsel Mar 22, 2025, 7:17 PM

#

How do I use phantom

keen beacon Mar 22, 2025, 7:18 PM

#

wintry tinsel How do I use phantom

use the arena and hope you get it lol

#

or wait a month for it to be released

earnest parcel Mar 22, 2025, 9:17 PM

#

grok 3 being quite deceptive

raven void Mar 22, 2025, 9:44 PM

#

gpt 4o went from lame to excellent with the anonymous chatbot , openAI been cooking

cedar tide Mar 22, 2025, 9:56 PM

#

New model "Rhéa"

keen beacon Mar 22, 2025, 9:56 PM

#

yeah it's meta

#

it sucks

#

all of the meta models on the arena are mid at best

quick flame Mar 22, 2025, 9:59 PM

#

hey

#

are there any anonymous models being tested currently?

cedar tide Mar 22, 2025, 10:00 PM

#

"Rhea" and "Sparrow" from meta

quick flame Mar 22, 2025, 10:09 PM

#

cedar tide "Rhea" and "Sparrow" from meta

oh

#

any information on what these are?

#

and this "phantom" is another one, right

cedar tide Mar 22, 2025, 10:15 PM

#

quick flame and this "phantom" is another one, right

From google

keen beacon Mar 22, 2025, 10:17 PM

#

quick flame and this "phantom" is another one, right

imo the best anonymoud model rn

wintry tinsel Mar 22, 2025, 11:21 PM

#

Open AI can cook, they can cook a lot of garbage

quick flame Mar 23, 2025, 12:10 AM

#

cedar tide From google

how long has it been in testing, is it about to be revealed?

#

I hope that's not secret information

keen beacon Mar 23, 2025, 12:10 AM

#

quick flame how long has it been in testing, is it about to be revealed?

it's been in the arena for maybe like a week

quick flame Mar 23, 2025, 12:11 AM

#

keen beacon it's been in the arena for maybe like a week

I see. So probably a while still, right? It probably takes about two weeks to gather votes

keen beacon Mar 23, 2025, 12:13 AM

#

around that yeah

onyx juniper Mar 23, 2025, 12:17 AM

#

is before the end of the month a realistic time frame for the open ai model?

keen beacon Mar 23, 2025, 1:03 AM

#

phantom seems to be gone

#

literally been at it for 15 minutes... got nebula a million times and phantom not once

#

still figuring

#

seems similar to phantom

#

i think it's worse than it

#

phantom is a phantom lol

#

lmao 💔

#

rage is very slow, its fine though. prob a larger llama 4 variant

#

nebula is good

raven void Mar 23, 2025, 1:15 AM

#

onyx juniper is before the end of the month a realistic time frame for the open ai model?

4o is going to beat grok 3 soon 😭

raven void Mar 23, 2025, 2:01 AM

#

OpenAIs post training is next level

keen beacon Mar 23, 2025, 2:02 AM

#

raven void OpenAIs post training is next level

new 4o had continued pretraining

#

its not just post training

#

its different compared to previous 4o revisions

raven void Mar 23, 2025, 2:03 AM

#

oh thankyou i missed that 🫡

wintry tinsel Mar 23, 2025, 2:20 AM

#

what is this about 4O being so good

#

Better than 3.7 sonnet?

onyx juniper Mar 23, 2025, 4:03 AM

#

keen beacon its different compared to previous 4o revisions

how do you think this could affect its standing?

keen beacon Mar 23, 2025, 4:05 AM

#

onyx juniper how do you think this could affect its standing?

good chance it might top the leaderboard, i guess. openai is really good at human preference in the lmarena leaderboard. i havent personally used it at all though

#

(theyve continued pretraining on 4o in december/a little after, you can tell from the cut off. additional pretraining after that mightve happened too, iirc sam has said theyre working on a stronger 4o)

remote niche Mar 23, 2025, 4:41 AM

#

guys do you know this chat model called nebula on arena ? i cant find any info on it

onyx juniper Mar 23, 2025, 4:41 AM

#

Cool thanks. That makes sense. Do you know how long models like that typically take to get off the training version and released on the rankings?

#

/when it was first added

remote niche Mar 23, 2025, 4:50 AM

#

guys where can i use NEBULA ?? anyone ? its really good

keen beacon Mar 23, 2025, 4:52 AM

#

remote niche guys where can i use NEBULA ?? anyone ? its really good

wait for it to be released in aistudio later lol

#

or keep using the arena and hoping u get it

remote niche Mar 23, 2025, 4:53 AM

#

keen beacon wait for it to be released in aistudio later lol

which company does this model belong to ? with such a ambiguos name like nebula my google hits are everywhere

keen beacon Mar 23, 2025, 4:53 AM

#

remote niche which company does this model belong to ? with such a ambiguos name like nebula ...

its from deepmind afaik

remote niche Mar 23, 2025, 4:53 AM

#

keen beacon its from deepmind afaik

thanks man

remote niche Mar 23, 2025, 5:03 AM

#

keen beacon its from deepmind afaik

do you know when NEBULA might be released ? is it different from phantom ? as a medical researcher this model blows others outof water for science

keen beacon Mar 23, 2025, 5:14 AM

#

remote niche do you know when NEBULA might be released ? is it different from phantom ? as a...

around a month maybe? just a guess

remote niche Mar 23, 2025, 5:15 AM

#

woah thats a long time cant wait ,google be cooiking this time , THE OG gangsters are back

keen beacon Mar 23, 2025, 5:16 AM

#

there might be something interesting going on with phantom/nebula/specter, i oughta try it comprehensively later

keen beacon Mar 23, 2025, 8:50 AM

#

keen beacon there might be something interesting going on with phantom/nebula/specter, i oug...

yeah both phantom and nebula are pretty damn good

#

although phantom was removed from the arena ~8 hrs ago

#

so that's interesting

#

only model to get this question right consistently..

#

👀

#

"I have a 7 litre metal bucket with a hole in the base, and the other end has a sheet of metal wielded to it, sealing it. How much water can I store in it?"

#

"Okay, let's break this down:

You have a 7-litre bucket.
The base (the normal bottom) has a hole.
The other end (the normal open top) is sealed shut with metal.

If you try to use it like a normal bucket (sealed end up, base with hole down), any water you put in will immediately leak out the hole in the base.

However, if you turn it upside down:

The sealed original top now becomes the watertight base.
The original base with the hole is now facing upwards.

In this upside-down orientation, you can pour water into the bucket (likely through the hole in the now-top base). The sealed end (now the bottom) and the sides will hold the water.

Since the bucket's internal volume defined by the sides and the sealed end is 7 litres, and the hole is now at the top, you can fill it completely.

Therefore, you can store 7 litres of water in it (when used upside down)."

keen beacon Mar 23, 2025, 9:44 AM

#

nebula on the rotating square with ball test

#

0-shot, something most models cannot do, ESPECIALLY not google models

#

also tested it with some geoguessr. gave it this and it got it practically dead on!

#

gentle plinth Mar 23, 2025, 9:56 AM

#

keen beacon also tested it with some geoguessr. gave it this and it got it practically dead ...

this is crazy

quick flame Mar 23, 2025, 12:04 PM

#

keen beacon also tested it with some geoguessr. gave it this and it got it practically dead ...

Wait. So some models in the general arena have access to tools and web search? How does this relate to the fact that there's a separate category for search?

#

Also, there seems to be a new model called ertiga and a model called "anonymous-chatbot"

alpine coral Mar 23, 2025, 12:13 PM

#

quick flame Wait. So some models in the general arena have access to tools and web search? H...

it seems its just making inferences based on the image, rather than seeking and using additional information from tools

#

But i'm a bit confused / or perhaps i'm missing something

Signage: The most important clue is the blue sign near the tower. Zooming in (using the provided crops), we can clearly see the distinctive shield shape and text layout of a US Interstate highway sign. The number appears to be 94.

#

what is talking about re "(using the provided crops)", and I can't see blue sign or tower anywhere

#

anyway, if it got the exact location, that's wild - just hard to work out how it did it based on what it says in the response in the screenshot

#

geoguessing/image recognition aside, i just got nebula and gave it the quiz i'm using atm

#

it performed insanely good

#

got some questions right which just no LLMs do (aside from o1)

#

very intriguiing

#

keen beacon Mar 23, 2025, 12:20 PM

#

alpine coral what is talking about re "(using the provided crops)", and I can't see blue sign...

yeah i was confused too but this is very gemini

#

flash thinking hallucinates false details to explain itself yoo

#

too*

keen beacon Mar 23, 2025, 12:21 PM

#

alpine coral it performed insanely good

better or worse than phantom?

alpine coral Mar 23, 2025, 12:22 PM

#

i've only gotten it once so far. and regenerated the last response (second half of quiz) 5 times

#

based on that, it seems better phantom

keen beacon Mar 23, 2025, 12:23 PM

#

👀

#

perhaps this is just the same model as phantom but a new training checkpoint

alpine coral Mar 23, 2025, 12:24 PM

#

yeah that sounds pretty plausible to me

#

more recent checkpoint (or perhaps bigger version)

#

they have a similar feel about them (though again, only got nebulus once, whereas i've gotten phantom many times – thoug sporadically.. it indeed seems to come and go.. like a phantom ha – so not really too much to go by admitedly)

silk haven Mar 23, 2025, 12:27 PM

#

Google is so back

keen beacon Mar 23, 2025, 12:28 PM

#

alpine coral they have a similar feel about them (though again, only got nebulus once, wherea...

yeah it definitely has the same vibes

#

but it feels more polished

#

It's either flash lite, flash, or pro. Likely pro if it's bigger. They won't pretrain models for reasoning from scratch/continue pretrain 4 reasoning (outside of phi team I think)

#

it's definitely not flash lite and i highly doubt it's flash given this would be a massive jump with no change in base model

#

so imo it's probably pro thinking exp

keen beacon Mar 23, 2025, 12:29 PM

#

keen beacon it's definitely not flash lite and i highly doubt it's flash given this would be...

Yes so it's pro most likely

#

which is interesting because on paper 2.0 pro isn't much better than 2.0 flash

alpine coral Mar 23, 2025, 12:30 PM

#

keen beacon it's definitely not flash lite and i highly doubt it's flash given this would be...

yeah this feels right to me too tbh

keen beacon Mar 23, 2025, 12:30 PM

#

it's possible a new version of 2.0 pro exp will release alongside 2.0 pro thinking exp and the latter is built on the former

alpine coral Mar 23, 2025, 12:31 PM

#

maybe it's ultra 🤣

keen beacon Mar 23, 2025, 12:32 PM

#

keen beacon it's possible a new version of 2.0 pro exp will release alongside 2.0 pro thinki...

A new version of 2.0 pro to the public would be the instruct version unless they train the reasoning variant off the instruct, that won't be the case. It's probably the same base model

keen beacon Mar 23, 2025, 12:32 PM

#

alpine coral maybe it's ultra 🤣

that would be interesting

#

we haven't had a big boy from google since december 2023

#

which is crazy tbh

#

I highly doubt ultra exists

#

but the ultra page on deepmimd's site is still there and still updated sometimes so i think they'll release a new ultra at some point this year

#

deepmind*

#

Companies are continuing to use models that work and continue pretraining on them. 4o, sonnet 3.5 -> sonnet 3.7(it seems) and gem 2 pro seems to be around the same size I don't see a larger size

alpine coral Mar 23, 2025, 12:35 PM

#

keen beacon I highly doubt ultra exists

yeah i mean it did / does, but i'm almost certain it's been quietly shelved.. i giant dense model that underperformed the Pro version when they adopted new approaches / effeciencies (Pro 1.5).. and now we're in the reasoning/test time compute era

#

i don't think there's any chance either of them ultra.. was just a joke ha

keen beacon Mar 23, 2025, 12:35 PM

#

Ya you don't need anything larger than 4o/Gemini pro/sonnet size for reasoning models it seems

#

It makes costs too high anyway

alpine coral Mar 23, 2025, 12:36 PM

#

yeah exactly

keen beacon Mar 23, 2025, 12:36 PM

#

i wonder how long it'll be before a model you can run on your phone matches o1 performance

keen beacon Mar 23, 2025, 12:36 PM

#

keen beacon i wonder how long it'll be before a model you can run on your phone matches o1 p...

Next year max imho 🤣

#

exponential 🔥

cedar tide Mar 23, 2025, 12:38 PM

#

keen beacon I highly doubt ultra exists

now that chatgpt has released 4.5, i think google will release gemini 2 ultra too

keen beacon Mar 23, 2025, 12:39 PM

#

cedar tide now that chatgpt has released 4.5, i think google will release gemini 2 ultra to...

Bro they didn't release the public api of Gemini 1.0 ultra. I don't think an ultra instruct model would be any more competitive compared to 4.5

#

Because it didn't make sense

alpine coral Mar 23, 2025, 12:39 PM

#

yeah it was like available for researchers or something

#

but i don't think anyymore

keen beacon Mar 23, 2025, 12:39 PM

#

It sux compared to 1.5 pro

alpine coral Mar 23, 2025, 12:40 PM

#

and the fact we never hear anything about it , suggests that whichvever researchers have / had access to it found it underwhelming

cedar tide Mar 23, 2025, 12:41 PM

#

keen beacon Because it didn't make sense

gemini 2.0 pro is not good enough as expected, they are forced to release a better model to have a sota model

keen beacon Mar 23, 2025, 12:42 PM

#

cedar tide gemini 2.0 pro is not good enough as expected, they are forced to release a bett...

No they will keep working on the base models they have now

oblique flint Mar 23, 2025, 12:43 PM

#

what models does google have in the arena now? Phantom, nebula and specter? Which one out of those is the best?

keen beacon Mar 23, 2025, 12:44 PM

#

oblique flint what models does google have in the arena now? Phantom, nebula and specter? Whic...

They're the same different revisions

#

Thats what I think

#

Another note It doesn't make sense to keep pretraining for now. Or if you want to just continue pretrain your flagship model as it seems anthropic and openai are doing

cedar tide Mar 23, 2025, 12:47 PM

#

cedar tide now that chatgpt has released 4.5, i think google will release gemini 2 ultra to...

and Dario said he plans to release Claude 3.5 opus