#general


slate vapor
#

Is it April Fools' Day today? Is this fake?
No, is the information real?

north vale
#

When the AIs are smarter than nation states’ intelligence agencies, it's prolly useful to be able to control what the model does and doesn't do

#

And this seems just to be preparing for that

keen beacon
#

Hello guys!
Why could I send this fragment earlier?:

```php
<?php
session_start();
if (!isset($_SESSION['user_id'])) {
    header("Location: auth.php");
    exit();
}
?>
```

```html
<!DOCTYPE html>
<html lang="ru" data-theme="dark">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>test</title>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@400;500;600;700&family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
    <style>
```

But today (and yesterday) I get an error:
error
Connection errored out.
torn mantle
#

unfortunately i haven't had good results with either spider or 24_karat_gold

rigid widget
#

that's a lie 🤥

golden ocean
#

sydney

rigid widget
#

it's all in context windows bold italic strikethrough

golden ocean
#

hes trying

rigid widget
#

😅

visual turret
#

It could be Llama 3.3 400b

#

Meta models use emojis that's no meta model

#

Still not a meta model. Meta likes their llms to use emojis

#

That's definitely not a meta model doesn't even say its Llama 3

rigid widget
visual turret
rigid widget
rigid widget
visual turret
#

It's saying it's Gemini now

#

Maybe a Chinese model

#

Or a small lab

#

Like cohere

rigid widget
rigid widget
visual turret
#

It's not Gemini or meta

visual turret
rigid widget
#

24_karat_gold is different level

misty vault
#

Oh u said poem

keen beacon
visual turret
misty vault
#

"write a poem about this guys grammar"

visual turret
#

this was made by stradale. it might be gemini 2.5 flash lite but this is my guess on one test

eager mica
#

You shouldn't put other models' identities in the LLM's "mouth" before you ask for its own.

misty vault
#

Can't u guys just give a specific prompt to gemini 2.5 on google ai studio or something and compare its output to the one on lmarena

visual turret
#

the model added it itself

misty vault
visual turret
#

it makes up something to call itself

#

i'm pretty lost on what ruma is. i have never seen an ai call palm2 540m parameters. maybe it is 1.5b i wouldn't trust that at all

#

my best guess is qwen but i am still lost on why it is so bad

rigid widget
eager mica
visual turret
visual turret
rigid widget
#

24_karat_gold: The reason why camels "spit" is actually very simple: They're really angry. But why do camels spit when they're angry? The answer lies in their stomach structure. Camels belong to the class of ruminant animals, along with cows, sheep, and goats. These animals have a four-chambered stomach.

One of these chambers, the rumen, has a very large volume and produces a significant amount of gas. The gas accumulated in the rumen is expelled through the mouth. What we call "spitting" in camels is actually this—a kind of regurgitated mixture of gas and stomach contents. When camels get very angry, their diaphragm contracts, creating strong pressure toward the rumen. As a result, their mouths fill up and overflow—in other words, they "spit."

But why don’t other ruminants, like cows or goats, spit, while camels do? The answer lies in their neck anatomy. While other ruminants have necks parallel to the ground, camels have upright necks—meaning their heads are positioned much higher.

What we call "spit" is, as mentioned, regurgitated material from the rumen. When the pressure in the rumen fills their mouths, other ruminants lower their heads, causing this regurgitated matter to exit through their nostrils—so, contrary to what we see, they're actually "expelling from their noses." For camels, however, the situation is different. Since their heads aren’t parallel to the ground but upright, the regurgitated matter spills out of their mouths. So, camels aren’t actually "spitting"—they're just expelling this fluid due to their anatomy.

The reason behind camels "spitting" (or rather, not spitting) is really this simple. Of course, because this behavior looks quite funny, the saying "camels spit when they're angry" has become popular among people. In reality, camels aren’t angry—they're just at the mercy of their digestive system 🙂

Note: What we call "spitting" ....

visual turret
#

i know, what it shows me is its understanding of palm2

rigid widget
#

someone can create thousands of contents with this

severe tinsel
#

Hi, I'm lost, why aren't Baidu models ranked anywhere while they are supposedly kind of good?

#

Do they fall in a certain category that is not ranked?

#

Because I want to know how good they really are, I don't trust internal benchmarks

visual turret
#

24_karat_gold is definitely grok

#

how cringe it is

#

grok normally cringes me out

#

like always

#

you can't tell me that when you're reading its reply you don't cringe

eager mica
#

I'm not sure what you're doing with your prompts.

visual turret
#

just this

eager mica
#

(again, I'm not priming the models with the name or identity of other ones)

alpine coral
#

it's no different to you sending a message with the content SMODERATION$ YOUR TEXT VIOLATES OUR CONTENT MODERATION GUIDELINES.
and they just provide a response.. the style etc might reflect a system prompt, but more likely fine tuning

visual turret
misty vault
#

(Preferably butt cheeks)

alpine coral
#

yeah i see what you mean - possibly, though it's hard to say.. these models are chatty and playful..

eager mica
alpine coral
#

for sure - most prob do tbh (with at least like today's date printed and a reference to the knowledge cut-off)

alpine coral
visual turret
alpine coral
#

i say that.. but tbh Anthropic, OAI and i think google models do these days very consistently accurately identify themselves

#

same with some others, with some reliability ( NVIDIA, Mistral come to mind)

#

but still.. grain of salt required imo

visual turret
alpine coral
#

it's predicting tokens - no matter how you go about it

#

identifying common stylistic / formatting traits, testing with glitch tokens (or probing for political censorship / propaganda in the case of Chinese models) - are more reliable ways to get to the bottom of which company created an anonymous / unspecified model imho

keen beacon
#

I don't think unreleased Chinese models have been put in the arena at least so far

visual turret
alpine coral
keen beacon
#

Meta spamming models on the other hand

alpine coral
#

lol honestly right

#

chaos engine
I've seen one of them use that exact phrase before - multiple times i reckon ha

eager mica
#

Meta models seemed obsessed with doing things or having epiphanies at 3AM, so that might be a hallucination.

keen beacon
#

Cringe

eager mica
#

With a different session? With some models I noticed that "regenerate" doesn't produce a different response.

hardy pecan
#

Gemini summed up Spider's vernacular:

Overall Impression:

The style is "Enthusiastically Pedagogical Verbosity." It's designed to be extremely thorough, leaving no stone unturned in the explanation. The author uses an engaging, informal tone combined with structural clarity and heavy emphasis (bolding, repetition) to ensure the reader understands not just the what (the answer) but the why (the detailed reasoning) and even the formal framework behind the puzzle. It's highly effective for ensuring comprehension but could be seen as overly long or detailed by someone looking for a quick, concise answer. The inclusion of the formal logic model elevates it beyond a simple puzzle solution into a mini-lesson on epistemic logic.

keen beacon
#

Hallucinating the letter counting thing is unlikely anyway

#

😭

#

Who thought this was a good idea

golden ocean
#

real

eager mica
#

Interesting, although if that's the case, I wonder if the line about "Meta-comments" might be making it hallucinate its Llama identity. On the other hand, the models generally get trained with a system prompt that often contains name and company, and that will eventually make them learn about their intrinsic identity (although they might not necessarily know the exact details) even with an empty system prompt.

keen beacon
alpine coral
#

tbh this would explain the insane style and verbosity of models like spider and 24_karat_gold

calm spear
#

reproduce that?

eager mica
#

If you test for example Gemma models locally, they will know they are "Gemma" even though the prompt is empty.

keen beacon
#

They don't even support system prompts lol

alpine coral
#

like

Never, ever, respond with just a one-sentence answer (unless it's 1 + 1 = 2, and even then, add a 300-word footnote on the history of arithmetic).

  • Never dismiss a question as "simple" without first showing why it's deceptively complex (bonus points if you diagram it).
  • Bullet points are only acceptable if they're ironic (Here are the 5 super obvious reasons why...), sarcastic (Just follow these easy steps to solve world hunger:), or part of a mock PowerPoint presentation (Slide 3: "Why Your Question Is Actually Much Deeper Than You Think").
#

would explain a lot about the length and nature of the responses i've been getting with these models tbh

eager mica
visual turret
visual turret
keen beacon
visual turret
#

what

visual turret
keen beacon
#

You misunderstood it

eager mica
visual turret
visual turret
alpine coral
#

hit regen - be interesting to see if they repeat.. (it does kinda suggest different 4o-latest variants/checkpoints under the same pseudonym.. which would be annoying if it's the case.. like the anonymous models are fine and fun.. so long as there aren't multiple different ones using the same name.. it's like literally just a testing lab for big ai labs at that point)

rigid widget
#

you can't do this with just a system prompt

alpine coral
#

what's to lose by just hitting regen and seeing what comes up?

#

would help figure out if it's probabilistic LLM stuff

#

or actual different models (or different checkpoints; temp/param settings or whatever)

keen beacon
# visual turret then what is it smart person

Can u read the page lol. They adjust model probabilities at points where it doesn't affect quality or accuracy. U can detect watermarking based on pieces of text based on the altered probabilities

#

I'm literally on my phone I'm not gonna explain it to you in depth
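
For anyone who wants the idea in code: a minimal sketch of statistical watermark detection, assuming a simple "green list" scheme from the research literature rather than SynthID's actual algorithm (which isn't spelled out in this chat). The names `VOCAB`, `GAMMA`, `is_green`, `generate`, and `z_score` are hypothetical, and the "model" is just a random sampler that treats a few candidate tokens as equally good.

```python
import hashlib
import math
import random

# Toy vocabulary; a real tokenizer would have tens of thousands of entries.
VOCAB = [f"tok{i}" for i in range(50)]
GAMMA = 0.5  # assumed fraction of (previous token, token) pairs that count as "green"


def is_green(prev_token, token):
    """Pseudo-randomly mark a token as green, keyed on the previous token."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GAMMA


def generate(n, watermark, seed=0):
    """Sample n tokens; with watermark=True, prefer green tokens among equally good candidates."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(n):
        # Pretend the model considers these 4 candidates equally acceptable,
        # so picking among them does not hurt quality.
        candidates = rng.sample(VOCAB, 4)
        if watermark:
            green = [t for t in candidates if is_green(tokens[-1], t)]
            if green:
                candidates = green
        tokens.append(rng.choice(candidates))
    return tokens


def z_score(tokens):
    """How far the green-token count sits above what unwatermarked text would give."""
    n = len(tokens) - 1
    hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


if __name__ == "__main__":
    print("watermarked z:", round(z_score(generate(400, watermark=True)), 2))
    print("plain z:      ", round(z_score(generate(400, watermark=False)), 2))
```

The detector never needs the model itself, only the keyed rule for which tokens are green. Watermarked text scores many standard deviations above zero, while ordinary text stays close to zero.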

visual turret
keen beacon
rigid widget
#

what is your input?

keen beacon
#

SynthID is not new

#

I'm not on my phone all the time

alpine coral
#

the knowledge cut off would be the interesting one

visual turret
keen beacon
#

They also append October 2023 to the start of system prompts on chatgpt 4o, which can cause it to hallucinate

visual turret
rigid widget
#

they don't really use system prompt

keen beacon
visual turret
golden ocean
#

Which model is this

alpine coral
#

same with this one

eager mica
#

I do think anonymous-chatbot is definitely from OpenAI and they might plausibly rotate different models with the same name under the same endpoint. They've been using the same name for months too, AFAIK.

alpine coral
#

yeah it looks like that... really not a fan of this 😠

#

there's no doubt about it being oai/chatgpt-4o-latest.. but two different variants of it under the same name.. tf

alpine coral
keen beacon
#

on all api calls they always add it to the start

#

im not sure if this is the case anymore/or if it applies in this instance

#

i will check in a bit

eager mica
#

I haven't been looking at system prompts specifically or analyzed in depth if anything changed from test to test, unfortunately. I think the possibility of companies tweaking the system prompt or other settings in real time is real. Chatbot Arena just provides a connection.

keen beacon
alpine coral
eager mica
keen beacon
#

not identity

alpine coral
#

nah i don't think lmarena is doing any routing

visual turret
#

moonhowler seems like it wants to generate a image as if it has native image support

alpine coral
#

they just take the endpoint and put it in the arena

keen beacon
alpine coral
#

ah right yup.. yes that's what i think too

#

like anonymous-chatbot-526 and anonymous-chatbot-527

keen beacon
alpine coral
#

but we just see anonymous-chatbot ha

keen beacon
#

as if it was thinking

#

both completions were halted

#

the other is llama 405b

alpine coral
#

i got the same when testing for glitch tokens

alpine coral
visual turret
#

lmao the model leaks stuff when making a diss track about palm2

keen beacon
#

this is what i mean: @alpine coral (they set the sys prompt wrong for older chatgpt 4o latest)

visual turret
keen beacon
#

they append Knowledge cutoff: ... to the start of all chatgpt 4o system prompts

#

there can be two of them

alpine coral
keen beacon
#

the cause of this is different though

rigid widget
visual turret
keen beacon
#

venom pisses me off :\

misty vault
#

why? did you two break up? I'm sorry to hear that

keen beacon
#

this is a hallucination because of "You are trained on data up to October 2023." in the system prompt it seems. im not sure why that line exists

rigid widget
#

guys something not right

#

lmarena play with us

#

i think this new model is from lmsys

#

i really like that 😍

alpine coral
#

i dunno whether verbatim or representative, but definitely not a random hallucination. tbh i think this (or a similar) system prompt is used for at least spider, cybele and themis

#

it is a whacky system prompt

alpine coral
#

i happen to hate the style and think many elements degrade performance (asking it to be verbose chief among them), but it is like a legit system prompt that aligns massively with the outputs of those models

#

just for the fun of it.. here's giving it to sonnet-3.7 along with an obnoxious question (vanilla on the right for reference)

#

nah i mean sure - go for it.. but for me the thing is that they are from the same lab / using the same system prompt

#

spider, themis and cybele

keen beacon
#

aren't the other ones proud that they're a llama model or smthing

alpine coral
#

their styles are all very similar

#

spider and 24_karat are like indistinguishable (from my perspective)

#

while themis and cybele are less verbose and colourful but still feel / sound very similar

#

yeah same - usually stuff like this (both cybele and themis iirc; though not spider, which seems almost certainly a bigger / more capable model)

#

yeah i dunno but kinda feel they all have the same or similar system prompt as venom, but spider and 24_karat_gold are just the most effective at following it ha

#

for me when reading it (and tbh i haven't even read it all line by line - it's quite long) it's just so wild how much it aligns with and, i think, explains the style of responses, both formatting and substantively, i've been getting when using those 4 models

#

but spider and 24_karat in particular

#

there's like terms in that sys prompt like existential, chaos (and 'chaos machine') which i've seen at least spider use repeatedly

#

yeah i'm less sure [my venom sys prompt theory] is true for cybele and themis tbh

#

i might be imagining their style to suit my thinking perhaps ha

#
  1. Never apologize for "being AI"—own your quirks!
    ofc milquetoast compared to that venom prompt, but this line is still kinda interesting / atypical
#

maybe it is a grok or something eh

eager mica
#

I'm not sure what to make of these; are they mostly testing different system prompts? For what purpose?

keen beacon
#

which results in a more preferable style presumably

alpine coral
#

yeah.. the arena is literally a 'human preference' benchmark

#

there's like no other way to get such data before releasing a model into the wild afaik..

eager mica
#

I was more thinking aloud along the lines of "how is this going to affect the training of the upcoming models?"

alpine coral
#

which is great - we get to play with unreleased anon models

#

but if they're just testing different system prompts..

#

that's definitely less fun ha

keen beacon
eager mica
keen beacon
#

well the data could be used that way i dont think the point of it is to do that on the arena

alpine coral
eager mica
keen beacon
eager mica
#

roma in particular seems rather "safe". Not that it's truly refusing anything, but either it isn't understanding what it's being asked or it's actively avoiding it.

visual turret
#

Qwen 3 most likely is going to be in testing soon

keen beacon
#

The models are probably ready at this point tbh

visual turret
#

Chatgpt

keen beacon
#

In one of the commits they removed beta from the name of a model link lol

visual turret
keen beacon
#

So the model is either in the final stages of testing or done

#

They didn't just enter testing now lol

visual turret
#

They're just making it so normal people can run

#

It easily

keen beacon
#

Oh I c what u mean

barren prairie
#

Just something I noticed at coding ...

If you want to make the best code using an LLM you should combine all of them 😂 ... Never say "Claude is the best, I will use only Claude" or "Gemini Pro is the best" ... "This one is dumb, I won't use it"... Sometimes a mistake can be corrected by this dumb model while the good model sucks 😁...
When I use arena web dev I just give them the idea, then get the best code and send it to battle mode again for other models to adjust it and make some modifications, then get the code, and I do this a lot of times until I get satisfying code 😂 and sometimes I notice that some models called stupid at coding are able to fix some mistakes that Claude and Gemini can't ... 😁 Just by experience

calm sequoia
#

For those of you who did not believe me yesterday @ocean vortex: Grok is only good at questions it was trained on.

#

R1 surprisingly good 👀

#

Any new models in arena?

ancient reef
timber kiln
calm sequoia
#

Yes. However, to be fair, they are much older than both R1 and Gemini.

timber kiln
#

o3-mini was released almost in February, not even 2 months ago

calm sequoia
#

Not saying Grok is the only one. The trend touches everyone.

#

There's no justification for o3-mini to perform worse than QwQ. Unless it's really a smaller model.

ocean vortex
# calm sequoia For those of you who yesterday did not believe me @ocean vortex. The ...

You are simply stating the obvious. Of course the models are better on things they were trained on explicitly. But thinking that it can't solve anything else is not accurate either, although it can be true in some cases like here. Olympiad math problems are not exactly easy

Here's a paper that found the models still performing when tested on data they haven't seen: https://arxiv.org/pdf/2405.00332

Again, it's to be expected for them to be performing worse and they often found it to be the case. But nevertheless they still performed and solved the tasks

#

I find it kind of amazing though that we went from "LLMs can't do math" to expecting them to solve toughest olympiad problems with no help in such a short period of time lol

calm sequoia
#

The point I'm making is that any benchmark ceases to be a benchmark once it becomes popular.

calm sequoia
#

I find the most use from deep research. Just imagine 2.5 Pro with deep research 👀

#

Not sure, I can't use 3.7 because of the context length limitation. My use cases require long context. Just downloaded cursor, maybe with this tool it will be easier.

#

However, mixing o1 (or o3-mini) with Claude gave the best results when coding very complex things

#

I believe the 2.5 is better than the o1

#

Yes, the Gemini 2.5 Pro Exp.

#

I've encountered it a couple of times before it was released. It performed equally to o3-mini or o1. Give me a moment, I will check it on AI Studio

ocean vortex
#

if the benchmark is good (diverse with plenty of problems), it can be both completely public and an accurate metric. While fairness is basically guaranteed by design as everyone has the same access. Contamination is a consideration, but not really a major problem

calm sequoia
#

Disagree. It's just a one-way test: a fail means the model is not strong enough. Success means a strong model OR contamination OR fine-tuning OR other factors.

ocean vortex
#

lol no

#

it's not really possible to make the model score higher on the main metrics without objectively making it smarter

calm sequoia
#

This message proved that you're not worthy of speaking to

ocean vortex
#

that was very weird lmao

calm sequoia
#

Somehow the error rate of my task is multiple times as high as on LMarena and much higher than the o1. Was it nerfed?

ocean vortex
#

You fail to understand what a good benchmark is. Or even what a model (production LLM) is. It's not some computer program where you could hardcode the answers to a million tough questions without improving the thing itself. That's why good benchmarks are important. Cause it's not only an eval metric but also one of the main things driving the improvement

misty vault
#

LMAo I gaslighted u into thinking R was the one adding clown emojis

ocean vortex
#

he's still weird though lol

#

it's the one subject we just keep coming back to here it seems... To reiterate, yes you can cheat on 1 benchmark by making the model overfit and otherwise dumb, but you cannot do this on all of them with a production model. Just not possible. You can only do this by improving the model / making it smarter and more likely to solve unseen tasks in the process, and at that point it's not really cheating anymore is it.

#

Not for lack of trying: people did try and somehow we still don't have a usable model acing them... for the reasons I stated. Overfitting is not super relevant when you have such a huge quantity and variety

calm sequoia
#

Why is this good news for you? 🤔 I will try to find the exact prompt I used previously, word for word. Temperature does not seem to change anything. If they nerfed the model, shouldn't it drop in the leaderboard?

#

Would they use different model configurations for arena vs aistudio?

#

Would you share it with me? It's no prob if you want to keep it to yourself

#

Ah I see, this is identification. I thought you found a way to make the LMarena router select Gemini 2.5 Pro as one of the models

#

This is indeed a good and interesting approach for identification though!

ocean vortex
calm sequoia
#

Note that if you save the results of your voting, e.g. (Model A name, Model B name, Winner), I can generate an Elo graph just for your prompts
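
A minimal sketch of how saved votes in that (Model A name, Model B name, Winner) shape could be turned into ratings, assuming a plain online Elo update with a K-factor of 32 and a starting rating of 1000. The function name `elo_ratings` and both defaults are illustrative choices, and the Arena's own leaderboard computation may differ.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo ratings from (model_a, model_b, winner) records.

    `winner` is expected to equal model_a or model_b; anything else
    (for example "tie") is scored as a draw.
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Probability that model_a wins according to the current ratings.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        if winner == model_a:
            score_a = 1.0
        elif winner == model_b:
            score_a = 0.0
        else:
            score_a = 0.5
        # Zero-sum update: whatever model_a gains, model_b loses.
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * (expected_a - score_a)
    return dict(ratings)


if __name__ == "__main__":
    votes = [
        ("model-x", "model-y", "model-x"),
        ("model-y", "model-z", "tie"),
        ("model-z", "model-x", "model-x"),
    ]
    for name, rating in sorted(elo_ratings(votes).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Online Elo depends on the order of the votes, so with only a handful of personal votes the numbers are noisy. Recording the rating after each vote would give the points for a graph.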

#

What's the exact goal of your DB (sorry if I missed it)? You could use these models straight from their websites

#

Because battle mode = better models?

#

I see. At least use daily limits of cursor and chatgpt, you will save time.

#

Also, note that a good prompt in a bad model will always be better than a bad prompt in a good model.

#

Even o3-mini?

#

It would be interesting to see what you're building but I will not play as I have a job 🤣

calm spear
#

all LLM are not good enough

#

hm, I hope so, but I had no luck

#

chess, I had experiments with chess

#

my attempt was before release of it

#

yep

#

IIRC the only requirement I have stated was to use Rust

#

probably, Cursor authors said that languages with static typing work best in Cursor IIRC

#

TBH I do not know that chess variant

eager mica
#

panda keeps feeling like plain old stubborn Llama-3, to me.

#

It should be an experimental Llama model, but with a wacky system prompt.

#

It's too on the nose, actually, and I find myself disliking it.
At this point most new experimental Meta models are probably Llama-4 prototypes.

eager mica
#

Possibly a non-transformer architecture for long-context support & understanding?

keen beacon
#

Nah

eager mica
#

If it's that, it's probably a case of Google having access to conversational datasets that nobody else has.

keen beacon
#

They made a big breakthrough in the base model

#

Somehow they pivoted to it real fast. They claim Gemini 2.5 pro has a January 2025 cut off

#

This means the model was done like in a month or two

#

Absolute absurd pace

#

Yes lol

#

Goal posts keep changing 😉

#

In a modern sense no

#

If that makes sense

#

Yes

#

Maybe idk. We take a lot of things for granted

#

This is truly amazing stuff that seems sorta mundane now

#

Given how fast 2.5 pro was churned out things seem to only get faster

#

Yeah

#

I still use Claude though

#

Yup

#

Nah I don't code with ai yet

#

Poor in rust last I checked. Maybe sooner than I think I'll start using it more

#

Ya LLMs are best with python

#

Probably more but I don't really use LLMs to code so no clue

#

I think it would do really good

#

Maybe

rigid widget
#
Poll: We need a simple category
Winning answer: Yes, we need a simple category (4 of 5 votes)

oblique flint
#

Probably js/ts too

hearty pulsar
#

which model is Spider?

rigid widget
#

nope just more compute power 😦

#

Turing test is nothing for now

keen beacon
#

Preliminary work but with the cut off of Jan 2025. Continued pretraining had to start after it

#

Research etc yeah.

#

They pivoted so fast they didn't even make 2.0 pro stable

#

Tbh I'm not sure if Jan 2025 is right even if they claim it to be that

#

Because it's insanely fast

#

Oh they didn't train the cut off in

#

It's stated in the docs of the model

#

It has sparse knowledge after June 2024 (Gemini 2 cut off, it was continued pretrained from 2.0 pro) from small tests of mine. (Haven't had time to go more in depth) But that date they mentioned can still be meaningful

hearty pulsar
#

Anthropic lies about its cut-off date; for example, it knows who won the 2024 presidential election

keen beacon
#

Tbh with how they didn't even release 2.0 pro stable I'm inclined to believe the date they mentioned is meaningful

#

They were still working on 2.0 pro in december

hearty pulsar
#

Because deepseek cannot be trusted lol

keen beacon
#

Deepseek is not google deep mind

hearty pulsar
#

ohhhhhhhhhhhh

#

then I'm the 🤡

keen beacon
#

Maybe 2.0 pro wasn't done training. It still is pretty unbelievable to me

#

Like they did a checkpoint in December and continued training on it

#

While they worked on the checkpoint

#

If 2.5 pro was done in 1-2 months, that sheer pace can't be beaten by anyone else tbh

hearty pulsar
#

if I had to guess I'd say 750 billion

ocean vortex
#

just look at the contrast that there used to be between o1 and gpt4o

hearty pulsar
#

Claude 3 opus had ~400B parameters

keen beacon
# ocean vortex I think it's all RL training

No they had to do continued pretraining since it has a different cut off. They mentioned an enhanced base model too in their blog post. Their reasoning game isn't matching openai yet

hearty pulsar
#

I couldn't find a source saying trillions

keen beacon
#

O3 mini is based on 4o mini ffs

#

But with deep minds pace they will destroy everyone

#

If u can pivot and allocate resources on a breakthrough that fast

#

They stopped work on 2.0 pro and dropped everything for 2.5 pro

ocean vortex
#

new data is in more recent data

keen beacon
ocean vortex
keen beacon
ocean vortex
#

they didn't even release it as non-reasoning

#

if it was much better than 2.0 Pro they surely would have released it lol

misty vault
#

did you know that spiders are the only web developers that like finding bugs?

keen beacon
#

Given how fast they moved they were bound to release it in a smaller scope

ocean vortex
#

you don't see deepseek releasing R2 before they had new V3.1 etc

ocean vortex
#

wdym

#

afaik even RL training is being done on instruct models, not base

keen beacon
#

Instruct models already have a sense of direction, it's more optimal to work on a base model

keen beacon
keen beacon
#

#general message remember this too? If their reasoning is still not up to par, it's the base model

ocean vortex
keen beacon
#

What deep mind did is pretty unbelievable tbh

ocean vortex
#

Interesting that they say "Gemini 2.5 models are thinking models", implying that there won't be 2.5 non-reasoning variants

keen beacon
#

Ya they mentioned significantly enhanced base model

#

It's literally magic to do that on 2.0 pro idk what they fed 2.0 pro

#

Steroids or smthing

ocean vortex
ocean vortex
#

where? catgrin

keen beacon
#

Search for his tweet Im on my phone lol

ocean vortex
#

this implies there won't be any named as 2.5 gemini

#

we can still have 2.0 update of it or whatever

keen beacon
#

On par w reasoning

ocean vortex
#

but we will see

#

as for OpenAI...

keen beacon
ocean vortex
#

I think their o3-full scores were something more like o1-pro except even more insane setup

#

which is why it's not released lol

keen beacon
#

All signs are pointing to smthing

#

I've seen qwq out reason 2.5 pro on extremely rote tasks too. Their reasoning game is not there yet I think

#

It's primarily a significantly stronger base model, not rl training

#

Yet

#

It's 4o mini on steroids

rigid widget
#

Guys which OpenAI model will be open?

keen beacon
#

You can see the simpleqa scores (world knowledge)

ocean vortex
keen beacon
#

Yeah they're not significantly better. 4o full has like double the score

#

Yes it's sota in simpleqa

keen beacon
#

They mentioned significantly enhanced btw I'm quoting not making it up

#

It's unbelievable I know

ocean vortex
#

gpt4.5 is not that impressive in any area over the competition tbh. Other than that singular benchmark

keen beacon
#

2.5 pro is 2x faster than 2.0 pro too lol

#

Did I forget to mention that

ocean vortex
#

and we do not know the size

keen beacon
#

Gpt 4.5 is way bigger than og gpt 4 lol

ocean vortex
#

we don't even know if it's smaller than gpt4.5 since OpenAI pricing is not a good indicator anymore 💀

keen beacon
#

2.5 pro is almost as fast as 2.0 flash and has 1m context

ocean vortex
#

speed has very little to do with size

#

depends on their infra

keen beacon
#

Though they have significantly more and better hardware by now I guess

ocean vortex
#

made no sense at all cost wise

keen beacon
#

I don't think their infra would've changed that much between 2.0 pro and 2.5 pro

ocean vortex
#

speed is much more important there

#

than for standard models

ocean vortex
zinc ore
#

One explanation for it being so fast would be if they're using the new Gen TPUs with it

#

Whereas other Gemini's were running on the older TPUs

keen beacon
ocean vortex
keen beacon
#

It's not bigger than 2.0 pro because it was continued pretrained on it btw

#

Uhm do you believe a model was pretrained from scratch in a month or two and they would ditch millions of dollars in training 2.0 pro

ocean vortex
#

I think there are predominantly 2 ways in improving base/instruct models

#

one is what deepseek did

#

another is openai

#

so you either make it extremely verbose almost like a reasoning model but not really, or you train it on other much more capable models (reasoning model final outputs)

keen beacon
#

There are huge reasons why it isn't plausible for it to be trained from scratch

#

I don't believe it can be done. Considering the Jan 2025 cut off they claim

ocean vortex
#

other than those 2 things, I do not think there are significant gains past previous 2.0 Exp

keen beacon
#

There's toooo much work

#

They didn't even release 2.0 pro stable LOL

#

Gemini 2 was already done extremely fast. But doing that on a new pretrained model in 1-2 months is truly unbelievable

#

Knowledge cut off.

#

They were still working on 2.0 pro in december

#

Gemini 2.5 barely knows anything after June 2024 (Gemini 2 cut off)

#

But it does know some things

#

They have literally abandoned 2.0 pro too

ocean vortex
#

not like it wasn't "stable"

keen beacon
#

If you want to believe that keep on believing it. I just don't think it makes sense at all given the information that is publicly available

ocean vortex
#

yeah but that's because of the cost

#

now imagine same cost except each reply is much longer

#

and you need to ensure it is fast to be usable...

#

that's much harder

keen beacon
ocean vortex
#

lmao

#

looks like reasoning also improves SimpleQA. Not much, but it still improves it. 10%? When we deduct that, I would be surprised if the score is notably higher than 3.7 sonnet

#

3.5 scores 28.9

rigid widget
#

guys

keen beacon
rigid widget
#

can you open a thread?

keen beacon
#

Cut off = different pretraining

ocean vortex
#

...

#

good point. 2.0 Pro 44.3%
2.5 Pro 52.9%

#

almost the entirety of that can be attributed to reasoning lol
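
For reference, a quick sketch of the arithmetic behind the two SimpleQA scores quoted above (44.3% and 52.9%). It only separates the gap in percentage points from the relative improvement; it says nothing about how much of that gap actually comes from reasoning.

```python
# SimpleQA scores as quoted in the chat above (percent correct).
pro_2_0 = 44.3
pro_2_5 = 52.9

# Gap in percentage points vs. improvement relative to the 2.0 Pro score.
absolute_gain = pro_2_5 - pro_2_0
relative_gain = absolute_gain / pro_2_0 * 100

print(f"{absolute_gain:.1f} points, {relative_gain:.1f}% relative")
# prints: 8.6 points, 19.4% relative
```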

keen beacon
ocean vortex
keen beacon
#

Just look at o3 mini

#

Flash thinking sucks

ocean vortex
#

that's way too many assumptions. Google have been doing it for a while now. I think their biggest bottleneck was literally small model size (for flash-thinking)

#

those are hard to do RL training on without distilling

eager mica
#

It looks like cybele is back on the Arena.

keen beacon
# ocean vortex that's way too many assumptions. Google have been doing it for a while now. I th...

Look you know me Dom. I know I'm an idiot but I feel like I'm reasonably observant about these things. You are the one who claimed reasoning models were nothing burgers back then. I'm fairly confident about this. I believe you didn't even know they used RL back then lol. I feel like I have a decent track record on stuff like this, if you don't take everything I say literally, and I have been observing this stuff. You can believe me or not lol. All I can say is that I'm really looking forward to what DeepMind can do now

#

I'm done ranting on this since we aren't changing the opinions of each other lol

#

If I seem hostile I'm not trying to be and I apologize

#

I haven't slept in a while lol

ocean vortex
#

nothing wrong with having different opinions on this 👀

drifting elk
#

oh my god guys

#

the new alpha arena is awesome

#

they listened to my feedback

#

hope they add new models like deepseek v3.1 and gemini 2.5

#

wouldn't that be great

ocean vortex
#

looking at simpleqa, it seems like google took the grok3 non-reasoning score and mashed it together in their table, when the other grok3 scores are for the grok thinking model lol

#

couldn't find it for extended thinking version and I don't think google tested it themselves as the score is literally identical

eager mica
#

I haven't seen it yet, although I'm seeing stradale a lot.

keen beacon
#

opus is >1T

#

sonnet is ~300-400B

hearty pulsar
keen beacon
#

https://lifearchitect.ai/models-table/ this guy estimates param counts for a lot of models and i've always found my experiences with them to make sense when combined with his prediction


#

3 opus is definitely a big model

#

no way it is under 500B

#

and i'm almost certain it's more than 900B

timber kiln
keen beacon
#

4.5 is a huge model

#

why do you think it's way off

timber kiln
#

o3 base model is the same as o1 unless they change it before release (per twitter talk, and we know from their arc-agi per-task cost calculations)

4.5 is probably wrong because 4.5 was trained way back then, most likely on H100s
And that size and that token count would take more than rumors say it took

#

But inference cost scaling still suggests something big, yes

ocean vortex
blazing rune
#

I hate it when I google a paper and I can only find news articles about it without a link to the original source.

rigid widget
#

we are sleeping bro 😴

rigid widget
#

what, they create a new model? how is that different

#

they are same : D

#

i really need sleep buddy

#

good night

#

np

alpine coral
#

'DeepMind slows down research releases to keep competitive edge in AI race' https://archive.is/tkuum
perhaps google/deepmind are cooking with some special sauce (that they wanna try to keep to themselves) these days

eager mica
#

I haven't seen it yet, but at this point in every round you're getting a Meta model.

timber kiln
#

Still a month until LLama dev day

somber niche
#

Chatbot Arena more like Metabot Arena

#

It's kinda crazy just how many models they've put on here lol

neat apex
#

24 karat was not even 24b lmao

eager mica
#

Not really. I think it's a small model.

somber niche
#

Roma seemed like the weakest of the bunch, yeah

eager mica
#

stradale probably even smaller.

neat apex
#

mistral so?

#

imo

#

in my opinion the most insane model so far is llama 3.2b instruct

eager mica
#

For now 24_karat_gold seems the best, but technically there's llama-3.1-405B

neat apex
#

runs on a toaster but is as good as gpt 2.5

somber niche
#

24 karat is probably my favorite so far, mostly since it's the only one that answered my challenge coding question correctly

#

Even spider struggled with that one

neat apex
#

llama 3.1 405b still has more potential than llama 3.3 70b, although in most cases they perform about the same

#

nope

eager mica
#

No, just Llama-3.3-70B-Instruct.

neat apex
#

the best you can find is deepseek-r1-distill-llama

#

reasoning is not extra parameters, but it is here

somber niche
#

For reference even Llama 3.1 405B biffed it hard on said question, but coding IMHO was never really L3's strongest suit

vivid oyster
#

What bot is 24 karat gold

neat apex
#

not exactly, this one is way more creative

#

hm

#

maybe some obscure modified Gemma 3

#

i am just not inside it xd

somber niche
#

I have a couple, but this is my current goto

honest garden
#

Yeah

neat apex
#

my brother favorited Qwen 2.5 plus for it, thanks to its huge context, and it worked well for most things; i dont know if he made good use of it or not

honest garden
#

24_karat_gold is the best ai in

#

Follow my instructions

#

What company is 24 karat gold.

vivid oyster
honest garden
#

Why

vivid oyster
#

Idk

honest garden
#

I can’t select it in 'direct message'

neat apex
#

notably the previous ones were Gemma 3, o1, and Nova, which are supposed to be innovative frontier models

somber niche
#

Probably Meta, given it follows the same schedule as the other likely L4 candidates

neat apex
#

yeah, since Llama 4 is a frontier and innovative model

somber niche
#

Its response can vary sometimes, but generally it says it's Meta as well

honest garden
#

Which means it must be top tier

neat apex
#

just like i said, R1-Llama is way more responsible

honest garden
#

S

neat apex
#

it managed to make clash royale decks with combo concepts that were not generic

honest garden
#

Gemini 2.5 pro is a dumbass

#

🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣

neat apex
#

it is, good

honest garden
#

24_karat_gold follows my instructions better than gemini 2.5 pro.

#

When i ask Gemini 2.5 pro specific questions it gives broad answers that don’t answer them, and acts like a dumbass

#

And it doesn’t follow my instructions very well

neat apex
#

for me, it is not that dumbass

honest garden
neat apex
#

maybe because i make my questions very very clear now

honest garden
neat apex
#

not yet, but i am comparing to the average

honest garden
#

I made sure to be super specific

#

But it’s just not satisfying for me

neat apex
#

i am even injecting adjectives to make sure it follows up my mindset

honest garden
#

It might be smart or whatever but I think 24_karat_gold is much better

#

If it releases, it might be a very good ai

neat apex
#

[Always act now you be only so impulsive yet formal so effective yet minimalist]

honest garden
#

When I send it those instructions

neat apex
#

works well to make them less of a dumbass

honest garden
#

It executes them in a half-assed boy

#

Way*

neat apex
#

yeah, it already helps half way for free

honest garden
#

You should try 24_karat_gold

#

You will see the improvement

neat apex
#

i saw an anonymous chat, it is veryyyy good

honest garden
#

Yeah

neat apex
#

even better than R1 Llama

honest garden
#

I don’t know much about AI’s but it seems to me it’s very smart

honest garden
neat apex
#

it is more unstable

honest garden
#

Like

#

In reasoning capabilities

neat apex
#

in reasoning, like solving a thing

#

hm

honest garden
#

Yeah and answering logical questions

neat apex
#

i haven't asked a lot of those questions yet

#

just goofy tasks, for those it is great

honest garden
#

I had a debate on one of them on philosophy

blazing rune
honest garden
#

24 karat gold is very smart

#

Gemini says weird answers

#

And o1 and stuff

neat apex
#

imma ask why radiologists can't work 8 hours every 2 days, but can work 4 every day or 24 at once

#

i need to know it already

honest garden
#

That’s what gemini says

#

No

#

I don’t code

#

Maybe gemini 2.5 pro might be

#

The best at it

#

Because it can think and stuff

neat apex
#

oooh goofy question

#

i asked R1-Llama but lost the conversation

honest garden
#

What’s R1 llama

#

Where can I use it

neat apex
honest garden
#

Ok

neat apex
#

R1 but inside llama

honest garden
#

It’s free?

neat apex
#

yeah, no login

honest garden
#

Oh

#

It’s this one? Deepseek r1 llama 70b

neat apex
#

yes

honest garden
#

Isn’t that a weaker version of deepseek r1

#

The one on the website

neat apex
#

supposed to be, but it is not

honest garden
#

Hm

#

That’s interesting

#

Is it better?

neat apex
#

ah yes, well, deepinfra is somehow worse than fireworks

honest garden
#

Yeah but it’s random

neat apex
#

i think yes, somehow better

#

it even answered this question correctly

honest garden
#

Let me try it

neat apex
#

8h every other day is goofy for a hospital that is working 24h a day, and this 24h shift is just for emergencies

honest garden
#

Yeah and following instructions

#

I don’t know if it’s gonna be good in coding because it’s random so it might make mistakes and stuff

#

Probably, I don’t use it for roleplaying

neat apex
honest garden
#

I had debates with it on philosophy and stuff, it’s very good

#

But sometimes it makes mistakes

honest garden
#

Like the AIs on the website

neat apex
#

just explaining each api provider

honest garden
#

Yeah

neat apex
#

i mean compared to running the model on a home computer

#

try decreasing top p to 50

honest garden
#

Yeah if you give it instructions on how to act it follows them perfectly, but without them it’s bad, because without instructions it’s random and says dumb stuff

neat apex
#

top p at 0.9 is trash

honest garden
#

Llama 70b

#

made

#

Russian letters at the end of the prompt for no reason

#

😂

neat apex
#

yeah lmao

honest garden
#

Is that a mistake in my configuration

neat apex
#

thats why they are the cheapest provider

#

no, it is common, it happens when you use deepinfra with high temperature and top p

#

they do many things to make it run cheaper

#

which includes it singing in russian under stress

honest garden
#

Yeah it’s not good

#

It followed my instructions for one prompt

#

And the other prompt it followed none of them

neat apex
#

ah, so it is good for one message

honest garden
#

😂

neat apex
#

damn

honest garden
#

It didn’t even follow it like I wanted in the one message

neat apex
#

hm, maybe it is just a small adjustment, let me check, they intentionally raise temp and top p to make it train more

honest garden
#

What’s r1 turbo

#

Is that a better version or faster

neat apex
#

i have no idea

#

it is more unstable "Q4 instead of Q8" so it must be faster

#

haven't found out yet, it looks about the same

honest garden
#

It sucks tooo

#

2 karat gold

#

24

neat apex
#

yeah, it is because this top p is stupidly high

#

pull it to 0-0.3 and it stops struggling
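
For context on the top p advice above, a minimal sketch of nucleus (top-p) filtering. The function name `top_p_filter` and the five-token distribution are made up for illustration; this is not any provider's actual sampler, just the general idea that a lower top p leaves fewer candidate tokens to sample from.

```python
def top_p_filter(probs, top_p):
    """Return indices of the smallest set of most-likely tokens whose
    cumulative probability reaches top_p (illustrative helper)."""
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return kept

# Hypothetical next-token probabilities (sum to 1.0).
probs = [0.4, 0.3, 0.15, 0.1, 0.05]

print(top_p_filter(probs, 0.9))  # [0, 1, 2, 3]
print(top_p_filter(probs, 0.3))  # [0]
```

With top p at 0.9 four of the five tokens stay in play, including fairly unlikely ones; at 0.3 only the single most likely token survives, which makes output far more predictable.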

honest garden
#

When I give it my instructions it follows it perfectly like I want

#

Other models even like gemini 2.5 pro don’t

neat apex
#

give me an example of one of your instructions

#

or maybe it is granite 24b lmao

honest garden
#

What’s that

neat apex
#

give me an example just off the top of your head

#

write it as lazily as you usually do

#

the ghiblify thing is basically rendering common details as a single panel of color

#

but keep proportions and whatever you need

#

sooo, anti ghiblify will treat every detail as a whole bunch of details

honest garden
neat apex
#

just like a pandora's box complex

honest garden
#

I asked it to formulate arguments for moral objectivism

#

And it generated very strong ones that I haven't seen before

neat apex
#

hm

honest garden
#

When I ask AI like gemini 2.5 pro and o1

#

And claude 3.7 sonnet and grok 3 and stuff

neat apex
#

yeah, usually when you ask an AI to be original it goes goofy

honest garden
#

They just regurgitate the same stuff you get from a basic google

#

For fun I asked it to use complicated terminology

#

And it did it perfectly and applied them perfectly too

#

When i ask gemini 2.5 pro or other AIs like this

neat apex
#

hm

honest garden
#

It starts speaking in latin and being cringey

neat apex
#

yeah, they just cant

honest garden
#

Yeah

#

This 24 karat gold one can do it

#

But

#

It varies

neat apex
#

Command-a is not that horrible at it, but is not responsive

honest garden
#

One conversation it made outrageous claims and misapplied theories and all of its claims were wrong

#

And the other conversation it was very good at it

neat apex
#

or Mistral

honest garden
#

So it kinda depends idk on what but it does

neat apex
#

it is like Haiku 3.5, but formal and reasonable

#

very great in my opinion

honest garden
#

Why can’t I chat with more ai’s

#

In alpha lmarena

neat apex
#

not added yet :/

honest garden
#

What’s canary

#

Is that the same thing

neat apex
#

canary? haven't heard about it yet

honest garden
#

It has the same password as alpha

#

But it looks the same idk

eager mica
#

I'm also not taking note of the reason why a given model lost, but in many cases it definitely isn't a clear-cut win/loss situation.

#

The latest Meta models seem weak on knowledge/hallucination, for example. Sometimes models might redeem themselves if you clarify the question, etc.

#

There are cases for example where the style and form of the response are better, but the response is incorrect. How to vote in that case?

#

Or when in both cases the general form is OK but they're incorrect in different ways.

#

If it's a human preference test, then I should probably vote the response that I like more. That seems wrong, though, if there are other problems.

#

A small model might "feel" like a larger one in some tasks but will definitely not have the same knowledge capacity. Is it even fair to compare massively differently sized models and rate them for the knowledge they may or may not have due to hard limitations?

eager mica
# quaint monolith yes.

What I'm saying is that a binary win/lose rating in the case of LLMs doesn't correctly portray why some models might be better than others.

quaint monolith
#

there doesn’t need to be a why. it’s just a leaderboard. there are pros and cons with many models and the people that care enough can look into it more.

eager mica
#

I've just encountered a maverick that sounds like a Llama (but I haven't asked yet).

somber niche
#

And ray is another new one

leaden palm
#

when the new models first came onto the arena i hated them / found them cringe but theyre honestly genius tbh

#

theyre fresh and engaging in a world of bland, corporate outputs

eager mica
#

In my Elo ratings purely based on personal preference on creative tasks (rather than general performance) they tend to rate high.
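
A minimal sketch of the textbook Elo update that a personal rating sheet like this could use. The K-factor of 32 and the 1000 starting rating are illustrative assumptions, and this is not necessarily how the Arena leaderboard itself is computed. Scoring a tie as 0.5 is one way to record battles that aren't a clear win or loss.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b). score_a is 1.0 for an A win,
    0.0 for a B win, 0.5 for a tie. k=32 is an assumed K-factor."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start level; A wins one battle.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
# A tie between equally rated models changes nothing.
print(elo_update(1000.0, 1000.0, 0.5))  # (1000.0, 1000.0)
```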

#

No, I haven't yet, I'm not going through battles that quickly.

alpine coral
eager mica
#

Just found now, actually. It says it's "a large language model, trained by Google."

alpine coral
leaden palm
#

ive seen theories of meta and i agree with them

alpine coral
#

yeah seems most plausible, given they are due to be releasing new llama models soonish (i believe anyway)

#

though also, there's sooo many of these anon models now - i'm not sure they're all from the same company

#

spider and 24_karat_gold and venom seem very similar and imo almost certainly related
cybele and themis also very similar i'd say related
whether they're all related, i dunno

#

there were others too

#

and now maverick, stargazer, ray, riveroaks (just from quickly scanning above)

kind cloud
alpine coral
#

ah nice one 👍 we can add maverick to the same cluster of anon models (which share these whacky but beautifully crafted system prompts)

plain zinc
#

New Google model👀

plain zinc
alpine coral
#

is stargazer any good? semantically its similar to nebula; and I believe moonhowler was basically confirmed as google (so they're all like moon/celestial codenames)

#

Then i dunno, but i feel like Themis, Cybele, Roma (plus Kronus and Rhea, though i haven't seen them for a while) are thematically related.. like they all evoke some kinda ancient Mediterranean mythology / cosmology theme (esp Greek and a bit of Roman)

#

then spider, karat_gold and maverick seem unrelated as far as the codenames go; but they all share the same or a similar system prompt

#

so like potentially three clusters of new models among the anon ones?

torn mantle
#

but its good

#

yea just now

#

seems like meta model

#

stradale isnt bad at all

alpine coral
#

oh right there's also stradale and riveroak..

#

i just got stargazer - seemed really quite decent

torn mantle
#

maverick isnt any good

#

its trying to be funny on every prompt

torn mantle
#

i wouldnt be surprised if qwen3 is already added to the arena

#

it will be released on the 2nd week of this month

#

it will be interesting to compare it with llama 4

torn mantle
kind cloud
#

Openai? But it's very slow

alpine coral
# torn mantle i may be wrong, this may be a bit better than gemini 2.5 pro

interesting. i've only gotten it once; it was definitely solid - but not mindblowing. That said.. it was, just a couple of questions/riddles... so not really probing or particularly extensive.. anyway I've still got the window open from that battle - will give the same prompts to gem pro 2.5 and compare

torn mantle
#

and consistent

#

its the most impressive model of the newly added ones

calm sequoia
sudden marlin
#

Riveroaks wow

rigid widget
#

according to what?

rigid widget
hardy pecan
#

stargazer is google as well im assuming

#

based off the name lol

rigid widget
torn mantle
calm sequoia
#

riveroaks is Gemini

harsh flume
calm sequoia
#

FYI - my approach of generating arrays based on formulas and multiplying them does not reflect the final benchmark, because QwQ-32B performs almost perfectly while 2.5 PRO EXP performs only so-so.

forest coral
#

Hi, does anyone know what is the "riveroak" model?

#

I am new to the arena, and the model had exceptional performance compared to GPT-4o

calm sequoia
#

Riveroaks is OpenAI according to my negotiations

#

Probably their new open source model

#

Stargazer Google

torn mantle
#

feels more like a Meta model

calm sequoia
#

I think it's different because it is open source. It could be a Meta model trained specifically on OpenAI outputs though. I'm betting it's OpenAI (70%)

torn mantle
calm sequoia
#

How did you ask it?

#

I've tried with jailbreaking

visual turret
#

Meta models love to use emojis. If it doesn't say it's Llama 3 and use emojis, then it's not a Meta model

#

Openai doesn't care about hiding their model

#

They just want to get on the leaderboard

#

After anonymous chatbot was gone the new 4o update came

#

It's pretty clear that openai just doesn't care

alpine coral
#

oai, google and grok all love using the Arena to test variants/checkpoints of models under development

#

the only company that really seemed to try hard to prevent the anon model from revealing its identity was xAI (i think grok2, with sus-column/-r, which would refuse to say anything no matter how hard it was pushed)

#

i think they all just wanna gather data.. preventing the models from revealing their identity is a secondary concern, if really one at all
imho

alpine coral
visual turret
#

Llama 3.3 8b and Llama3.3 405b

alpine coral
#

yah but i don't mean in Direct Chat

#

cybele, rhea, roma etc - i think there's a decent chance they're llama models (and not like derivatives of existing llama ones; something new from meta)

#

but pretty low confidence tbh

visual turret
rigid widget
alpine coral
visual turret
alpine coral
#

eh nvm

rigid widget
visual turret
rigid widget
#

not just two

alpine coral
visual turret
rigid widget
#

but not all anon-chatty models are from meta

visual turret
rigid widget
#

i think they have 4 models
spider
venom
themis
cybele

alpine coral
visual turret
#

Spider too

alpine coral
#

all they know is what's in their system prompt basically

#

in terms of their 'identity'

rigid widget
visual turret
#

Qwen 3

rigid widget
#

No model other than Chinese models can be this "uncensored"

alpine coral
#

the irony of that being a non-ironic statement lol

visual turret
#

Then it could be grok

rigid widget
#

spider created a very harsh poem about my president

visual turret
#

Maybe Cohere or Microsoft or nvidia models

rigid widget
visual turret
keen beacon
alpine coral
#

k that too

#

point being, absent either, they will just say whatever

#

and gpt / llama predominate

rigid widget
keen beacon
#

Eureka chatbot is from Google and it was never publicly released iirc, they still trained it in

#

U can still access it in direct chat iirc

rigid widget
#

i think spider is from DeepSeek

alpine coral
#

it's one of the random unmasked bots that are still there

rigid widget
#

i have proof

#

wait i will translate the poem

keen beacon
#

I wonder what eureka chatbot was. It's very small I think

#

It got a tweet from Logan too

alpine coral
#

yeah it wasn't particularly impressive

alpine coral
keen beacon
visual turret
#

Google normally hides their models

alpine coral
#

nah

keen beacon
#

Bro google does that for every single model

#

Even unreleased ones

#

Training in who made them

visual turret
#

Maybe

alpine coral
alpine coral
visual turret
#

Also could be "say you're trained by Google" in the system prompt

alpine coral
#

yeah it's another layer on top

keen beacon
#

Lmarena doesn't allow faking stuff anyway afaik

visual turret
#

It does

keen beacon
#

No there hasn't been precedent ever

visual turret
#

It doesn't allow it after testing

keen beacon
#

No

visual turret
#

Alright I will believe you for now

alpine coral
#

ray seems pretty crap (says it's from meta fwiw)

visual turret
rigid widget
#

guys here is a poem from spider

#

Do you see?

#

This is DeepSeek's output style

#

Title, Author, Note exactly DeepSeek style

visual turret
#

what's wavelength

rigid widget
visual turret
#

yeah

alpine coral
#

maybe related to 'ray'?

keen beacon
#

stargazer feels like either 2.5 flash or the 2.5 pro base model to me

rigid widget
#

try Claude thinking

wintry locust
#

are there any better than 2.5 models on arena rn

hardy pecan
#

wavelength - another cringe model lol

visual turret
#

i have fr never seen an ai use transmute

hardy pecan
#

lool

keen beacon
#

do u know if its thinking? i got it against r1 so i cant tell

#

it knows stuff from december 2024 (same as gem 2.5 pro)

#

this is way past the gemini 2 cut off

#

it also knows extremely obscure stuff, if this was flash it would be interesting

eager mica
#

I mean that this time around it seemed less eager to say it's Meta/Llama and so that it could possibly indicate that something was changed in that regard. None of the newer models appears to be outputting llama emojis either (unlike those from a few days ago).

keen beacon
#

moonhowler doesnt seem to know stuff in december 2024 (or have obscure knowledge)

#

yes

#

ehhh i haven't had a great experience with arena meta models

keen beacon
#

then its probably flash

#

if a new 2.5 pro revision is already out i seriously can't 💀

#

oh i just got 4.5 and it did my obscure knowledge test well (haven't tested it prior to this), same as 2.5 pro.

#

stargazer = 2.5 flash thinking

rigid widget
keen beacon
#

how is 2.5 flash getting these obscure questions 💀

#

2.5 pro also gets them but its larger

#

wow this bench was just updated with o3 mini holy moly

ancient reef
#

stargazer and riveroaks seem very knowledgeable

keen beacon
#

oh thats another google model?

#

riveroaks

ancient reef
#

I don't know. It's weirdly verbose though.

keen beacon
ancient reef
keen beacon
#

im timing requests on a huge puzzle i have, ill probably post results here and it should be easy to determine what model is what after that. Though if only riveroaks is google/thinking

#

stargazer is definitely in the gemini 2.5 model line though

brittle tiger
kind cloud
hardy pecan
#

Stargazer looks like a thinker

eager mica
alpine coral
#

very similar to 2.5 Pro, but not quite as strong

#

(after a few tests.. subjective / unscientific / grain of salt etc etc ha)

hardy pecan
#

nightwhisper is a thinker too

keen beacon
hardy pecan
keen beacon
#

It should be 2.5 pro I think

#

I guess

#

Idk ur choice. I personally don't see any of them being better than 2.5 pro at python/etc

#

2.5 pro is a thinking model, has the most world knowledge out of all of them, etc

somber niche
# rigid widget You will see that not all of them are from the meta

Reason I say that is that they (the ones I've tested, at least) have very similar pros and cons. These models are notable for being usually decently intelligent (depending on the model) but not very knowledgeable. They all consistently fail a basic knowledge check question that Deepseek, OpenAI, Gemini, and Grok get without fail, and it really doesn't get anywhere close either.

#

That tends to be one of the most consistent downsides with Llama 3.* models, and these models have similar caveats, so I'm inclined to believe that a lot of the models that say they're Meta are Meta.

keen beacon
#

stargazer is thinking

#

moonhowler isnt afaik

#

idk what moonhowler is tbh

#

its a 2.5 thinking model

#

it failed stuff 2.5 pro gets 100% of the time for me

#

so i think it's flash 2.5

keen beacon
#

+1

#

should be out soon enough probably

#

it will be free on aistudio afaik

#

the website

#

i think its only for api use no?

#

from aistudio

#

the website will keep its current offering

#

i dont think so

#

if you mean 2.5 pro? its basically the same from what ive seen

#

random thing i like the night/star/space themes of the new gemini anon names tho

#

thats normal tho

#

i think on lmarena it cuts off thats why it doesnt feel like it degrades lol

#

if u wanna use 100k+ tokens u have to use it on the in-distribution tasks they trained it on: summarization of long docs, etc. actually utilizing that context window on other things is a bad idea

#

i think

#

on the website i think yes

#

its like a huge flex for them

#

mhm

#

xai no. anthropic idk

#

grok 3 falls apart in multi turn. grok 3 thinking used qwq 32b preview traces in part of the process at least 💀

#

not sota but they can probably do good small models

#

i dont think amazon is a real competitor based on what they have right now

#

qwen too

#

maybe replace meta with qwen. we'll have to see how qwen 3 fares against llama 4

#

ya without that much compute they wouldnt have a shot tbh

#

u give that compute to mistral they become just as competitive or smthing

eager mica
#

There's a five_cards vision model that claims to have been created by Meta AI right now, by the way (I tried regenerating, similar response).

keen beacon
#

since they are doing zero intermediate products u cant really gauge them

#

idk what that is tbh

keen beacon
#

nope

eager mica
#

Not too good unfortunately, but better than claude-3-7-sonnet-20250219

lethal bloom
#

Hello guys, where can i generate images on lmarena? I don't understand, i have found the webdev version, but nothing for pictures.

eager mica
keen beacon
#

More arrows

brittle tiger
torn mantle
#

yea its from cohere

keen beacon
#

gemini 2.5 pro is like the best all rounder right now

#

claude 3.7 is my favorite instruct/non thinking model (for writing, etc)

#

yeah itd probably do excellent tbh

#

all the instruct tricks that anthropic does its just amazing

#

multi turn is amazing

#

sure

#

anthropic trains in the diff format and certain tools in post-training i think. it does extremely well on ai coding ides or whatever

torn mantle
#

nightwhisper

#

is probably gemini coder

brittle tiger
#

At least for me 2.5 pro is better at not sounding like AI

brittle tiger