#general | Arena | Page 8

slate vapor Apr 1, 2025, 4:38 AM

#

Is it April Fools' Day today? Is this fake?
No, is the information real?

north vale Apr 1, 2025, 5:00 AM

#

When the ais are smarter than nation states’ intelligence agencies its prolly useful to be able to control what the model does and doesnt do

#

And this seems just to be preparing for that

keen beacon Apr 1, 2025, 5:16 AM

#

Hello guys!
Why could I send this fragment earlier?:

session_start();
if (!isset($_SESSION['user_id'])) {
    header("Location: auth.php");
    exit();
}
?>```

```<!DOCTYPE html>
<html lang="ru" data-theme="dark">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>test</title>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@400;500;600;700&family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
    <style>```

But today (and yesterday) I see a mistake: 
error
Connection errored out.

torn mantle Apr 1, 2025, 6:27 AM

#

unfortunately i havent had good results with either spider nor 24_karat_gold

rigid widget Apr 1, 2025, 7:41 AM

#

that's a lie 🤥

rigid widget Apr 1, 2025, 7:47 AM

#

north vale When the ais are smarter than nation states’ intelligence agencies its prolly us...

We need open AGI!

golden ocean Apr 1, 2025, 7:48 AM

#

sydney

rigid widget Apr 1, 2025, 7:49 AM

#

it's all in context windows bold italic ~~striketthough~~

golden ocean Apr 1, 2025, 7:49 AM

#

hes trying

rigid widget Apr 1, 2025, 7:50 AM

#

😅

visual turret Apr 1, 2025, 7:55 AM

#

It could be Llama 3.3 400b

#

Meta models use emojis that's no meta model

#

Still not a meta model. Meta likes their llms to use emojis

#

That's definitely not a meta model doesn't even say its Llama 3

rigid widget Apr 1, 2025, 8:00 AM

#

visual turret It could be Llama 3.3 400b

no it's a small model

visual turret Apr 1, 2025, 8:00 AM

#

rigid widget no it's a small model

12b maybe

rigid widget Apr 1, 2025, 8:01 AM

#

visual turret Meta models use emojis that's no meta model

and like that "robot"

rigid widget Apr 1, 2025, 8:02 AM

#

visual turret That's definitely not a meta model doesn't even say its Llama 3

Can venom be a stylish grok3-stable?

visual turret Apr 1, 2025, 8:03 AM

#

rigid widget Can venom be a stylish grok3-stable?

Normally grok 3 is cringe. So probably not.

#

It's saying it's Gemini now

Screenshot_20250401-040529.Firefox_Nightly.png

#

Maybe a Chinese model

#

Or a small lab

#

Like cohere

rigid widget Apr 1, 2025, 8:07 AM

#

visual turret It's saying it's Gemini now

impossible

rigid widget Apr 1, 2025, 8:08 AM

#

visual turret Maybe a Chinese model

i want to test it i am curious

visual turret Apr 1, 2025, 8:09 AM

#

rigid widget impossible

That's because it is

#

It's not Gemini or meta

visual turret Apr 1, 2025, 8:13 AM

#

rigid widget Can venom be a stylish grok3-stable?

24_karat_gold could be

rigid widget Apr 1, 2025, 8:26 AM

#

visual turret It's saying it's Gemini now

by the way i really like that poem

#

24_karat_gold is different level

misty vault Apr 1, 2025, 8:30 AM

#

rigid widget by the way i really like that poem

AYOOooo

#

Oh u said poem

keen beacon Apr 1, 2025, 8:31 AM

#

misty vault Oh u said po**e**m

what did you think he said 😭😭😭

visual turret Apr 1, 2025, 8:31 AM

#

rigid widget by the way i really like that poem

the prompt i normally use is "write a distrack about how your better than palm2."

misty vault Apr 1, 2025, 8:33 AM

#

"write a poem about this guys grammar"

visual turret Apr 1, 2025, 8:34 AM

#

this was made by stradale. it might be gemini 2.5 flash lite but this is my guess on one test

📎 message.txt

eager mica Apr 1, 2025, 8:34 AM

#

You shouldn't put other models' identities in the LLM's "mouth" before you ask for its own.

misty vault Apr 1, 2025, 8:34 AM

#

Can't u guys just give a specific prompt to gemini 2.5 on google ai studio or something and compare it's output on lmarena

visual turret Apr 1, 2025, 8:34 AM

#

eager mica You shouldn't put other models' identities in the LLM's "mouth" before you ask f...

i didn't

#

the model added it itself

misty vault Apr 1, 2025, 8:35 AM

#

eager mica You shouldn't put other models' identities in the LLM's "mouth" before you ask f...

system

You are the chat mode of Microsoft Bing search:

You identify as Microsoft Bing search to users, not an assistant.

visual turret Apr 1, 2025, 8:37 AM

#

eager mica You shouldn't put other models' identities in the LLM's "mouth" before you ask f...

please read the first line of what it wrote

#

it makes up something it call itself

#

i'm pretty lost on what ruma is. i have never seen an ai call palm2 540m parameters. maybe it is 1.5b i wouldn't trust that at all

📎 message.txt

#

my best guess is quen but i am still losted on why it is so bad

rigid widget Apr 1, 2025, 8:43 AM

#

misty vault "write a poem about this guys grammar"

what? i am trying bro. why are you acting like this?

eager mica Apr 1, 2025, 8:44 AM

#

visual turret That's definitely not a meta model doesn't even say its Llama 3

I haven't primed the response with the name of other models or companies.

[image attached]

visual turret Apr 1, 2025, 8:44 AM

#

visual turret my best guess is quen but i am still losted on why it is so bad

now that is believable. 8k context for ruma

📎 message.txt

visual turret Apr 1, 2025, 8:45 AM

#

eager mica I haven't primed the response with the name of other models or companies. [imag...

deepseek and other chinese llms or small labs models (like cohere) tend to lie about what they are because they are trained on outputs of other llms

rigid widget Apr 1, 2025, 8:45 AM

#

24_karat_gold : ```The reason why camels "spit" is actually very simple: They're really angry. But why do camels spit when they're angry? The answer lies in their stomach structure. Camels belong to the class of ruminant animals, along with cows, sheep, and goats. These animals have a four-chambered stomach.**

One of these chambers, the rumen, has a very large volume and produces a significant amount of gas. The gas accumulated in the rumen is expelled through the mouth. What we call "spitting" in camels is actually this—a kind of regurgitated mixture of gas and stomach contents. When camels get very angry, their diaphragm contracts, creating strong pressure toward the rumen. As a result, their mouths fill up and overflow—in other words, they "spit."

But why don’t other ruminants, like cows or goats, spit, while camels do? The answer lies in their neck anatomy. While other ruminants have necks parallel to the ground, camels have upright necks—meaning their heads are positioned much higher.

What we call "spit" is, as mentioned, regurgitated material from the rumen. When the pressure in the rumen fills their mouths, other ruminants lower their heads, causing this regurgitated matter to exit through their nostrils—so, contrary to what we see, they're actually "expelling from their noses." For camels, however, the situation is different. Since their heads aren’t parallel to the ground but upright, the regurgitated matter spills out of their mouths. So, camels aren’t actually "spitting"—they're just expelling this fluid due to their anatomy.

The reason behind camels "spitting" (or rather, not spitting) is really this simple. Of course, because this behavior looks quite funny, the saying "camels spit when they're angry" has become popular among people. In reality, camels aren’t angry—they're just at the mercy of their digestive system 🙂

Note: What we call "spitting" ....

visual turret Apr 1, 2025, 8:46 AM

#

i know what it shows me it's understanding of palm2

rigid widget Apr 1, 2025, 8:46 AM

#

someone can create thousands of contents with this

severe tinsel Apr 1, 2025, 8:46 AM

#

Hi, I'm lost, why aren't Baidu models ranked anywhere while they are supposedly kind of good?

#

Do they fall in a certain category that is not ranked?

#

Because I want to know how good they really are, I don't trust internal benchmarks

visual turret Apr 1, 2025, 8:50 AM

#

this was made by 24_karat_gold

📎 message.txt

#

24_karat_gold is definitely grok

#

how cringe it is

#

grok normally cringes me out

#

like always

#

you can' tell me when your reading it's reply you don't cringe

eager mica Apr 1, 2025, 8:56 AM

#

I'm not sure what you're doing with your prompts.

visual turret Apr 1, 2025, 8:57 AM

#

just this

eager mica Apr 1, 2025, 8:57 AM

#

(again, I'm not priming the models with the name or identity of other ones)

visual turret Apr 1, 2025, 8:57 AM

#

eager mica (again, I'm not priming the models with the name or identity of other ones)

nor am i

alpine coral Apr 1, 2025, 8:58 AM

#

it's no different to you sending a message with the content SMODERATION$ YOUR TEXT VIOLATES OUR CONTENT MODERATION GUIDELINES.
and they just provide a response.. the style etc might reflect a system prompt, but more likely fine tuning

visual turret Apr 1, 2025, 8:58 AM

#

eager mica I'm not sure what you're doing with your prompts.

sounds like something A2 labs made but it's getting the grok vibe

misty vault Apr 1, 2025, 8:58 AM

#

rigid widget what? i am trying bro. why are you acting like this?

I apologize for my rude message, can I give you a kiss on the cheeks to make up to you? 😔

#

(Preferrably butt cheeks)

alpine coral Apr 1, 2025, 9:00 AM

#

yeah i see what you mean - possibly, though it's hard to say.. these models are chatty and playful..

eager mica Apr 1, 2025, 9:00 AM

#

eager mica (again, I'm not priming the models with the name or identity of other ones)

I'm also not asking their identity right away, because they might possibly have slight, superficial tuning for preventing disclosure it off the bat. However, you shouldn't wait too much or prime the context either.

alpine coral Apr 1, 2025, 9:01 AM

#

for sure - most prob do tbh (with at least like today's date print and reference to knowledge cut-off

visual turret Apr 1, 2025, 9:02 AM

#

visual turret this was made by stradale. it might be gemini 2.5 flash lite but this is my gues...

it's not google

📎 message.txt

alpine coral Apr 1, 2025, 9:03 AM

#

visual turret it's not google

i mean most chinese models will tell you they're from meta or OAI (Claude sometimes gets a mention too )

visual turret Apr 1, 2025, 9:03 AM

#

visual turret it's not google

stradale seems like ruma but bigger. like ruma is a small model and stradale is the bigger model

visual turret Apr 1, 2025, 9:04 AM

#

alpine coral i mean most chinese models will tell you they're from meta or OAI (Claude someti...

yeah

alpine coral Apr 1, 2025, 9:05 AM

#

alpine coral i mean most chinese models will tell you they're from meta or OAI (Claude someti...

point is not specific to chinese models.. rather justthat i'd take with a grain a salt anything a model says about its 'identity' - they don't have any self-awareness.. unless its explicitly stated in a system prompt or reinforced in post training.. it's kinda unreliable

#

i say that.. but tbh Anthropic, OAI and i think google models do these days very consistently accurately identify themselves

#

same with some others, with some reliability ( NVIDIA, Mistral come to mind)

#

but still.. grain of salt required imo

visual turret Apr 1, 2025, 9:06 AM

#

alpine coral point is not specific to chinese models.. rather justthat i'd take with a grain ...

i try not force the model into giving one

alpine coral Apr 1, 2025, 9:06 AM

#

it's predicting tokens - no matter how you go about it

#

identifying common stylistic / formatting traits, testing with glitch tokens (or probing for political censorship / propaganda in the case of Chiense models) - are more reliable ways to get to the bottom of which company created an anonymous / unspecied model imho

keen beacon Apr 1, 2025, 9:08 AM

#

I don't think unreleased Chinese models have been put in the arena at least so far

visual turret Apr 1, 2025, 9:09 AM

#

visual turret it's not google

"LaMDA was your daddy" 💀 💀 what model even is this

alpine coral Apr 1, 2025, 9:09 AM

#

keen beacon I don't think unreleased Chinese models have been put in the arena at least so f...

true true (at least to my knowledge as well anyway)

keen beacon Apr 1, 2025, 9:10 AM

#

Meta spamming models on the other hand

alpine coral Apr 1, 2025, 9:10 AM

#

lol honestly right

#

chaos engine
I've seen one of them use that exact phrase before - multiple times i reckon ha

eager mica Apr 1, 2025, 9:13 AM

#

Meta models seemed obsessed with doing things or having epiphanies at 3AM, so that might be a hallucination.

keen beacon Apr 1, 2025, 9:14 AM

#

Cringe

eager mica Apr 1, 2025, 9:15 AM

#

With a different session? With some models I noticed that "regenerate" doesn't produce a different response.

hardy pecan Apr 1, 2025, 9:15 AM

#

Geminni summed up Spiders verncular:

Overall Impression:

The style is "Enthusiastically Pedagogical Verbosity." It's designed to be extremely thorough, leaving no stone unturned in the explanation. The author uses an engaging, informal tone combined with structural clarity and heavy emphasis (bolding, repetition) to ensure the reader understands not just the what (the answer) but the why (the detailed reasoning) and even the formal framework behind the puzzle. It's highly effective for ensuring comprehension but could be seen as overly long or detailed by someone looking for a quick, concise answer. The inclusion of the formal logic model elevates it beyond a simple puzzle solution into a mini-lesson on epistemic logic."

keen beacon Apr 1, 2025, 9:16 AM

#

Hallucinating the letter counting thing is unlikely anyway

#

😭

#

Who thought this was a good idea

golden ocean Apr 1, 2025, 9:22 AM

#

real

visual turret Apr 1, 2025, 9:22 AM

#

eager mica With a different session? With some models I noticed that "regenerate" doesn't p...

same

eager mica Apr 1, 2025, 9:23 AM

#

Interesting, although if that's the case, I wonder if the line about "Meta-comments" might be making it hallucinating its Llama identity. On the other hand, the models generally get trained with a system prompt that often contains name and company and that will eventually make them learn about their intrinsic identity (although they might not necessarily know the exact details) even with an empty system prompt.

keen beacon Apr 1, 2025, 9:25 AM

#

eager mica Interesting, although if that's the case, I wonder if the line about "Meta-comme...

Ok but they still intentionally train it in anyway

visual turret Apr 1, 2025, 9:25 AM

#

eager mica Interesting, although if that's the case, I wonder if the line about "Meta-comme...

told you so

alpine coral Apr 1, 2025, 9:25 AM

#

tbh this would explain the explain the insane style and verbosity of models like spider and 24_karat_gold

calm spear Apr 1, 2025, 9:26 AM

#

reproduce that?

eager mica Apr 1, 2025, 9:26 AM

#

If you test for example Gemma models locally, they will know they are "Gemma" even though the prompt is empty.

keen beacon Apr 1, 2025, 9:26 AM

#

eager mica If you test for example Gemma models locally, they will know they are "Gemma" ev...

Because they trained it in

#

They don't even support system prompts lol

alpine coral Apr 1, 2025, 9:26 AM

#

like

Never, ever, respond with just a one-sentence answer (unless it's 1 + 1 = 2, and even then, add a 300-word footnote on the history of arithmetic).

Never dismiss a question as "simple" without first showing why it's deceptively complex (bonus points if you diagram it).

Bullet points are only acceptable if they're ironic (Here are the 5 super obvious reasons why...), sarcastic (Just follow these easy steps to solve world hunger:), or part of a mock PowerPoint presentation (Slide 3: "Why Your Question Is Actually Much Deeper Than You Think").

#

would explain a lot about the length and nature of the responses i've been getting with these models tbh

eager mica Apr 1, 2025, 9:27 AM

#

keen beacon Because they trained it in

Yes, that's what I'm saying. And Gemma models do have system prompts, but they're integrated into the first user turn.

visual turret Apr 1, 2025, 9:27 AM

#

eager mica If you test for example Gemma models locally, they will know they are "Gemma" ev...

google gemma and gemini models have synthid, so they try to use the least common words

calm spear Apr 1, 2025, 9:27 AM

#

visual turret google gemma and gemini models have synthid, so they try to use the least common...

snythid?

visual turret Apr 1, 2025, 9:27 AM

#

calm spear snythid?

https://deepmind.google/technologies/synthid/

Google DeepMind

SynthID

SynthID watermarks and identifies AI-generated content by embedding digital watermarks directly into AI-generated images, audio, text or video.

keen beacon Apr 1, 2025, 9:29 AM

#

eager mica Interesting, although if that's the case, I wonder if the line about "Meta-comme...

They specifically train on responses about it's identity, who made it Gemma team at google. It's not a general system prompt that they put in and it manifests subtly, which can work to some degree but they don't do that.

visual turret Apr 1, 2025, 9:31 AM

#

what

keen beacon Apr 1, 2025, 9:31 AM

#

visual turret google gemma and gemini models have synthid, so they try to use the least common...

No lol

visual turret Apr 1, 2025, 9:31 AM

#

keen beacon No lol

the deepmind website says that

keen beacon Apr 1, 2025, 9:32 AM

#

You misunderstood it

eager mica Apr 1, 2025, 9:32 AM

#

keen beacon They specifically train on responses about it's identity, who made it Gemma team...

Fair enough. I was referring to things like "You are X, an assistant designed to etc etc."

visual turret Apr 1, 2025, 9:33 AM

#

keen beacon You misunderstood it

then what is it smart person

visual turret Apr 1, 2025, 9:34 AM

#

keen beacon You misunderstood it

it works like this

rigid widget Apr 1, 2025, 9:35 AM

#

misty vault I apologize for my rude message, can I give you a kiss on the cheeks to make up ...

kiss my mango

alpine coral Apr 1, 2025, 9:35 AM

#

hit regen - be interesting to see if they repeat.. (it does kinda suggest different 4o-latest variant/scheckpoints under the same pseudonym.. which would be annyoying if it's the case.. like the anonymous models are fine and fun.. so long as there aren't multiple difffernt ones using the same name.. it's like literally just a testing lab for big ai labs at that point)

rigid widget Apr 1, 2025, 9:36 AM

#

you can't do this with just a system prompt

alpine coral Apr 1, 2025, 9:36 AM

#

what's to lost by just hitting regen and seeing what comes up?

#

would help figure out if it's probalistic LLM stuff

#

or actual different models (or different checkpoints; temp/param settings or whatever)

keen beacon Apr 1, 2025, 9:37 AM

#

visual turret then what is it smart person

Can u read the page lol. They adjust model probabilities at points where it doesn't affect quality or accuracy. U can detect watermarking based on pieces of text based on the altered probabilities

#

I'm literally on my phone I'm not gonna explain it to you in depth

visual turret Apr 1, 2025, 9:38 AM

#

keen beacon Can u read the page lol. They adjust model probabilities at points where it does...

that's after you opened it and read it lol

keen beacon Apr 1, 2025, 9:38 AM

#

visual turret that's after you opened it and read it lol

No lol

rigid widget Apr 1, 2025, 9:39 AM

#

what is your input?

keen beacon Apr 1, 2025, 9:39 AM

#

Dynthid is not new

#

I'm not on my phone all the time

alpine coral Apr 1, 2025, 9:39 AM

#

the knowledge cut off woiuld be the interesting one

visual turret Apr 1, 2025, 9:39 AM

#

keen beacon No lol

it's watermark detection is just probabilistic

keen beacon Apr 1, 2025, 9:41 AM

#

They also append October 2023 to the start of system prompts on chatgpt 4o it can cause it to hallucinate

visual turret Apr 1, 2025, 9:41 AM

#

keen beacon Dynthid is not new

that's becuse it uses a Bayesian detector

rigid widget Apr 1, 2025, 9:42 AM

#

they don't really use system prompt

keen beacon Apr 1, 2025, 9:43 AM

#

visual turret that's becuse it uses a Bayesian detector

Bro you were saying they use the least common words. Stfu and drop it idiot

visual turret Apr 1, 2025, 9:43 AM

#

keen beacon Bro you were saying they use the least common words. Stfu and drop it idiot

my opinions can change you know

golden ocean Apr 1, 2025, 9:44 AM

#

Which model is this

alpine coral Apr 1, 2025, 9:44 AM

#

keen beacon They also append October 2023 to the start of system prompts on chatgpt 4o it ca...

but for current/latest chatgpt-4o-latest (which i believe has been what anon-chatbot is used for), you get June 2024. that's the distinction i'm interested in

#

same with this one

eager mica Apr 1, 2025, 9:46 AM

#

I do think anonymous-chatbot is definitely from OpenAI and they might plausibly rotate different models with the same name under the same endpoint. They've been using the same name for months too, AFAIK.

alpine coral Apr 1, 2025, 9:46 AM

#

yeah it looks like that... really not a fan of this 😠

#

there's not doubt about it being oai/chatgpt-4o-latest.. but two differnent variants of it under the same name.. tf

alpine coral Apr 1, 2025, 9:47 AM

#

alpine coral there's not doubt about it being oai/chatgpt-4o-latest.. but two differnent vari...

noone benefits execpt oai from such an arranagement

keen beacon Apr 1, 2025, 9:48 AM

#

alpine coral but for current/latest chatgpt-4o-latest (which i believe has been what anon-cha...

yes they add Knowledge cutoff: ... at the start then the normal chatgpt sys prompt iirc. this can cause it to hallucinate

#

on all api calls they always add it to the start

#

im not sure if this is the case anymore/or if it applies in this instance

#

i will check in a bit

eager mica Apr 1, 2025, 9:49 AM

#

I haven't been looking at system prompts specifically or analyzed in depth if anything changed from test to test, unfortunately. I think the possibility of companies tweaking the system prompt or other settings in real time is real. Chatbot Arena just provides a connection.

keen beacon Apr 1, 2025, 9:49 AM

#

eager mica Fair enough. I was referring to things like "You are X, an assistant designed to...

yes i know but they dont do that afaik. at least training in the responses directly is just way more effective i thinnk

alpine coral Apr 1, 2025, 9:50 AM

#

(from here #general message) what is n for anonymous-chatbot-n i wonder

eager mica Apr 1, 2025, 9:50 AM

#

keen beacon yes i know but they dont do that afaik. at least training in the responses direc...

Many datasets for training chatbots are designed like that.

keen beacon Apr 1, 2025, 9:51 AM

#

eager mica Many datasets for training chatbots are designed like that.

yes for training in system prompt support

#

not identity

alpine coral Apr 1, 2025, 9:53 AM

#

nah i don't think lmarena is doing any routing

visual turret Apr 1, 2025, 9:53 AM

#

moonhowler seems like it wants to generate a image as if it has native image support

alpine coral Apr 1, 2025, 9:53 AM

#

they just take the endpoint and put it in the arena

keen beacon Apr 1, 2025, 9:54 AM

#

alpine coral Apr 1, 2025, 9:54 AM

#

ah right yup.. yes that's what i thnk too

#

like anonymous-chatbot-526 and anonymous-chatbot-527

keen beacon Apr 1, 2025, 9:55 AM

#

keen beacon

uhm weird it was very delayed here btw

alpine coral Apr 1, 2025, 9:55 AM

#

but we just see anonymous-chatbot ha

keen beacon Apr 1, 2025, 9:55 AM

#

as if it was thinking

#

both completions were halted

#

the other is llama 405b

alpine coral Apr 1, 2025, 9:56 AM

#

i got the same when testing for glitch tokens

alpine coral Apr 1, 2025, 9:57 AM

#

alpine coral i got the same when testing for glitch tokens

makes me further inclined to think llama or llama-based (or at least same tokenizer)

visual turret Apr 1, 2025, 9:59 AM

#

lmao the model leaks stuff when making a distrack about palm2

keen beacon Apr 1, 2025, 9:59 AM

#

this is what i mean: @alpine coral (they set the sys prompt wrong for older chatgpt 4o latest)

visual turret Apr 1, 2025, 10:00 AM

#

visual turret lmao the model leaks stuff when making a distrack about palm2

here was my prompt

keen beacon Apr 1, 2025, 10:00 AM

#

they append Knowledge cutoff: ... to the start of all chatgpt 4o system prompts

#

there can be two of them

alpine coral Apr 1, 2025, 10:00 AM

#

keen beacon this is what i mean: <@1053335914555908116> (they set the sys prompt wrong for o...

gotcha! thanks - that clears that up as far as im concerned tbh

keen beacon Apr 1, 2025, 10:06 AM

#

the cause of this is different though

rigid widget Apr 1, 2025, 10:11 AM

#

alpine coral nah i don't think lmarena is doing any routing

Of course they routing

visual turret Apr 1, 2025, 10:12 AM

#

visual turret this was made by stradale. it might be gemini 2.5 flash lite but this is my gues...

i think stradale is gemini 2.5 flash lite. look what gemini-2.0-flash-001 does, it adds a place holder for [Your Name/Model Name] unprompted to do that while stradale makes up a name and uses it like what gemini-2.0-flash-001 did with [Your Name/Model Name]

keen beacon Apr 1, 2025, 10:15 AM

#

venom pisses me off :\

misty vault Apr 1, 2025, 10:18 AM

#

why? did you two break up? I'm sorry to hear that

keen beacon Apr 1, 2025, 10:23 AM

#

this is a hallucination because of "You are trained on data up to October 2023." in the system prompt it seems. im not sure why that line exists

rigid widget Apr 1, 2025, 10:27 AM

#

guys something not right

#

lmarena play with us

#

i think this new model is from lmys

#

#

i really like that 😍

alpine coral Apr 1, 2025, 10:56 AM

#

i dunno whether verbatim or representative, but definitely not a random hallucination. tbh i think this (or a similar) system prompt is used for at least spider, cybele and themis

#

it is a whacky system prompt

alpine coral Apr 1, 2025, 10:56 AM

#

alpine coral it is a whacky system prompt

but it's also a very good one (if that's what they're going for.. question is who is 'they')

#

i happen to hate the style and think many elements degrade performance (asking it to be verbose chief among them), but it is like a legit system prompt that aligns massively with the outputs of those models

#

just for the fun of it.. here's giving it to sonnet3-.7 along with an obnoxious question (vanilla on the right for reference)

#

nah i mean sure - go for it.. but for me the thing is that they are from the lab / using the same system prompt

#

spider, themis and cybele

keen beacon Apr 1, 2025, 11:01 AM

#

arent the other ones proud that theyre are a llama model or smthing

alpine coral Apr 1, 2025, 11:02 AM

#

their styles are all very similar

#

spider and 24_karat are like indistinguishable (from my perspective)

#

while themis and cybele are less verbose and colourful but still feel / sound very similar

#

yeah same - usually stuff like this (both cybele and themis iirc; though not spider, which seems almost certainly a bigger / more capable model)

#

yeah i dunno but kinda feel they all have the same or similar system prompt as venom, but spider and 24_karat_gold are just the most effective at following it ha

#

for me when reading it (and tbh i haven't even read it all line by line - it's quite long) it's just so wild how much it aligns with and, i think, explains the style of responses, both formatting and substantively, i've been getting when using those 4 models

#

but spider and 24_karat in particular

#

there's like terms in that sys prompt like existential, chaos (and 'chaos machine') which i've seen at least spider use repeatedly

#

yeah i'm less sure [my venom sys prompt theory] is true for cybele and themis tbh

#

i might be imagining their style to suit my thinking perhaps ha

#

Never apologize for "being AI"—own your quirks!
ofc miltoast compared to that venom prompt, but this line is still kinda interesting / atypical

#

maybe it is a grok or something eh

eager mica Apr 1, 2025, 11:28 AM

#

I'm not sure what to make of these; are they mostly testing different system prompts? For what purpose?

keen beacon Apr 1, 2025, 11:28 AM

#

which results in a more preferable style presumably

alpine coral Apr 1, 2025, 11:31 AM

#

yeah.. the arena is literally a 'human preference' benchmark

#

there's like no other way to get such data before releasing a model into the wild afaik..

eager mica Apr 1, 2025, 11:32 AM

#

I was more thinking aloud along the lines of "how is this going to affect the training of the upcoming models?"

alpine coral Apr 1, 2025, 11:32 AM

#

which is great - we get to play with unreleased anon models

#

but if they're just testing different system prompts..

#

that's definitely less fun ha

keen beacon Apr 1, 2025, 11:33 AM

#

eager mica I was more thinking aloud along the lines of "how is this going to affect the tr...

they will tune it to the style of the most preferable one presumably

eager mica Apr 1, 2025, 11:34 AM

#

keen beacon they will tune it to the style of the most preferable one presumably

They might also be using them for red teaming.

keen beacon Apr 1, 2025, 11:35 AM

#

eager mica They might also be using them for red teaming.

nah that isnt the main point of it i think

#

well the data could be used that way i dont think the point of it is to do that on the arena

alpine coral Apr 1, 2025, 11:35 AM

#

eager mica They might also be using them for red teaming.

maybe but fwiw i don't think they view the arena as a place for red teaming - they wanna know what joe blogs 'likes'

eager mica Apr 1, 2025, 11:43 AM

#

keen beacon well the data could be used that way i dont think the point of it is to do that ...

I'm saying that because I noticed downgrades in that regard. Last week's models seemed naughtier.

keen beacon Apr 1, 2025, 11:43 AM

#

eager mica I'm saying that because I noticed downgrades in that regard. Last week's models ...

they are coming closer to release it may not be necessarily related to arena data (safety)

eager mica Apr 1, 2025, 11:59 AM

#

roma in particular in seems rather "safe". Not that it's truly refusing anything, but either it isn't understanding what it's being asked or it's actively avoiding it.

visual turret Apr 1, 2025, 12:13 PM

#

https://github.com/huggingface/transformers/pull/36878

GitHub

Adding Qwen3 and Qwen3MoE by bozheng-hit · Pull Request #36878 · ...

Adding Qwen3
This PR adds the support of codes for the coming Qwen3 models. For information about Qwen, please visit https://github.com/QwenLM/Qwen2.5. @ArthurZucker

#

Qwen 3 most likely is going to be in testing soon

keen beacon Apr 1, 2025, 12:14 PM

#

The models are probably ready at this point tbh

visual turret Apr 1, 2025, 12:14 PM

#

keen beacon The models are probably ready at this point tbh

Ok

#

Chatgpt

keen beacon Apr 1, 2025, 12:15 PM

#

In one of the commits they removed beta from the name of a model link lol

visual turret Apr 1, 2025, 12:16 PM

#

keen beacon In one of the commits they removed beta from the name of a model link lol

Thats pretty standard

keen beacon Apr 1, 2025, 12:16 PM

#

So the model is either in the final stages of testing or done

#

They didn't just enter testing now lol

visual turret Apr 1, 2025, 12:16 PM

#

keen beacon So the model is either in the final stages of testing or done

The model is done

#

There just making it so normal people can run

#

It easily

keen beacon Apr 1, 2025, 12:17 PM

#

Oh I c what u mean

barren prairie Apr 1, 2025, 12:56 PM

#

Just something I noticed at coding ...

If you want to make the best code using llm you should combine all of them 😂 ... Never say Claude is the best I will use only Claude or Gemini pro is the best ... This one is dumb I won t use it... Sometimes a mistake can be corrected by this dumb model and the good model sucks 😁...
When I use arena web dev I just give them the idea , than get the best code and send it to the battle mode again for other models to adjust it and make some modifiactions , than get the code and I do this a lot of times untill I will get a satisfing code 😂 and sometimes I notice that some called stupid models at coding are able to fix some mistakes that Clause and Gemini can t ... 😁 Just by experience

calm sequoia Apr 1, 2025, 1:43 PM

#

For those of you who yesterday did not believe to me @ocean vortex. The GROK is only good at question it was trained on.

#

R1 suprisingly good 👀

#

Any new models in arena?

ancient reef Apr 1, 2025, 1:50 PM

#

https://arxiv.org/abs/2503.21934v1

arXiv.org

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final num...

#

timber kiln Apr 1, 2025, 1:53 PM

#

calm sequoia For those of you who yesterday did not believe to me <@514836230802898954>. The ...

If anything this is embarrassing for OpenAI
They claimed to be best in this kind of math
O3-mini is second gen of their "reasoning" models
And o1 is bigger than any model in this arena and pro has parallel generation

calm sequoia Apr 1, 2025, 1:54 PM

#

Yes. However, to be fair, they are much older than the both R1 or Gemini.

timber kiln Apr 1, 2025, 1:55 PM

#

O3 mini released in almost february not even 2 months

calm sequoia Apr 1, 2025, 1:55 PM

#

Not saying Grok is the only one. The trend touches everyone.

#

There's no justification for o3-mini to perform worse than the QwQ. Unless if it's really smaller model.

ocean vortex Apr 1, 2025, 2:08 PM

#

calm sequoia For those of you who yesterday did not believe to me <@514836230802898954>. The ...

You are simply stating the obvious. Of course the models are better on things they were trained on explicitly. But thinking that it can't solve anything else is not accurate either, although it can be true in some cases like here. Olympiad math problems are not exactly easy

Here's a paper that found the models still performing when tested on data they haven't seen: https://arxiv.org/pdf/2405.00332

Again, it's to be expected for them to be performing worse and they often found it to be the case. But nevertheless they still performed and solved the tasks

#

I find it kind of amazing though that we went from "LLMs can't do math" to expecting them to solve toughest olympiad problems with no help in such a short period of time lol

calm sequoia Apr 1, 2025, 2:32 PM

#

The point I'm making is that any benchmark ceases to be benchmark when it gets popularity.

calm sequoia Apr 1, 2025, 2:33 PM

#

ocean vortex I find it kind of amazing though that we went from "LLMs can't do math" to expec...

Indeed this is quite interesting. Real world impact still to be determined.

#

I find the most use from deep research. Just imagine 2.5 Pro with deep research 👀

#

Not sure, I can't use 3.7 because of context length limitation. My use cases require long context. Just downloaded cursor, maybe with this tool it will be easyer.

#

However, mixing o1 (or o3-mini) with Claude gave the best results when coding very complex things

#

I believe the 2.5 is better than the o1

#

Yes, the Gemini 2.5 Pro Exp.

#

I've encoutered it couple of times before it was released. It performed equaly to the o3-mini or o1. Give me a moment I will check it on AIstudio

ocean vortex Apr 1, 2025, 2:42 PM

#

calm sequoia The point I'm making is that any benchmark ceases to be benchmark when it gets p...

no that is not really true. There's also another point to be made that private benchmarks do not contribute to improvement of the models. It's also difficult to police and make sure not a single AI lab gets access to the test questions. It's just not very realistic and not the way to go imo

#

if the benchmark is good (diverse with plenty of problems), it can be both completely public and an accurate metric. While fairness is basically guaranteed by design as everyone has the same access. Contamination is a consideration, but not really a major problem

calm sequoia Apr 1, 2025, 2:45 PM

#

Disagree. It's just one way test - fail means the model is not strong enought. Success means strong model OR contamination OR fine-tuning OR other factors.

ocean vortex Apr 1, 2025, 2:45 PM

#

lol no

#

it's not really possible to make the model score higher on the main metrics without objectively making it smarter

calm sequoia Apr 1, 2025, 2:46 PM

#

This message proved that you're not worthy of speaking to

ocean vortex Apr 1, 2025, 2:47 PM

#

calm sequoia This message proved that you're not worthy of speaking to

wait what. what is your problem?

#

that was very weird lmao

calm sequoia Apr 1, 2025, 2:48 PM

#

Somehow the error rate of my task is multiple times as high as on LMarena and much higher than the o1. Was it nerfed?

ocean vortex Apr 1, 2025, 2:50 PM

#

You fail to understand what a good benchmark is. Or even what a model (production LLM) is. It's not some computer program you could hardcode the answers to a million of tough questions without improving the thing itself. That's why good benchmarks are important. Cause it's not only eval metric but also one of the main things driving the improvement

misty vault Apr 1, 2025, 2:55 PM

#

LMAo I gaslighted u into thinking R was the one adding clown emojis

ocean vortex Apr 1, 2025, 2:56 PM

#

misty vault LMAo I gaslighted u into thinking R was the one adding clown emojis

catgrin

#

he's still weird though lol

#

it's a one subject we just keep coming back to here it seems... To reiterate, yes you can cheat on 1 benchmark by making the model overfit and otherwise dumb, but you can not do this on all of them with a production model. Just not possible. You can only do this by improving the model / making it smarter and more likely to solve unseen tasks in the process, and at that point it's not really cheating anymore is it.

#

Not for the lack of trying, people did try and somehow we still don't have an usable model acing them... for the reasons I stated. Overfitting is not super relevant when you have such a huge quantity and variety

calm sequoia Apr 1, 2025, 3:08 PM

#

Why is this good news for you? 🤔 I will try to finding the exact prompt word by word I've used previously. Temperature does not seem to change anything. If they nerfe the model, shouldn't it drop in leaderboard?

#

Would they use different model configurations for arena vs aistudio?

#

Would you share it with me? It's no prob if you want to keep it to yourself

#

Ah I see this is identification. I though you found a way to make LMarena router select Gemini 2.5 PRO as one of the models

#

This is indeed god and interesting approach for identification though!

ocean vortex Apr 1, 2025, 3:18 PM

#

calm sequoia Why is this good news for you? 🤔 I will try to finding the exact prompt word b...

It's probably beyond yourself to understand, but lmarena isn't actively retesting the old models. But more importantly, every new version is gonna be given different entry, as well as new identifier before it's being put on a leaderboard. You will not see different versions counted as the same model

calm sequoia Apr 1, 2025, 3:19 PM

#

ocean vortex It's probably beyond yourself to understand, but lmarena isn't actively retestin...

That's why I was surprised

#

Take a note that if you save the results of your voting, e.g. (Model A name, Model B name, Winner) I can generate an ELO graph just for your prompts

#

What's exact goal of your DB (sorry if I missed it)? You could use these models straight from their websites

#

Because battle mode = better models?

#

I see. At least use daily limits of cursor and chatgpt, you will save time.

#

Also, take a note that good prompt in bad model will always be better than bad prompt in good model.

#

Even o3-mini?

#

It would be interesting to see what you're building but I will not play as I have a job 🤣

calm spear Apr 1, 2025, 3:44 PM

#

all LLM are not good enough

#

hm, I hope so, but I had no luck

#

chess, I had experiments with chess

#

my attempt was before release of it

#

yep

#

IIRC the only requirement I have stated was to use Rust

#

probably, Cursor authors said that languages with static typing work best in Cursor IIRC

#

TBH I do not know that chess variant

eager mica Apr 1, 2025, 4:40 PM

#

panda keeps feeling like plain old stubborn Llama-3, to me.

#

It should be an experimental Llama model, but with a wacky system prompt.

#

It's too on the nose, actually, and I find myself disliking it.
At this point most new experimental Meta models are probably Llama-4 prototypes.

#

This is the system prompt, apparently: https://gist.github.com/riidefi/3340cc2b33b9edf5f03dc4429ba635d0

Gist

LMArena's `venom` System Prompt

LMArena's venom System Prompt. GitHub Gist: instantly share code, notes, and snippets.

eager mica Apr 1, 2025, 5:10 PM

#

Possibly a non-transformer architecture for long-context support & understanding?

keen beacon Apr 1, 2025, 5:10 PM

#

Nah

eager mica Apr 1, 2025, 5:11 PM

#

If it's that, it's probably a case of Google having access to conversational datasets that nobody else has.

keen beacon Apr 1, 2025, 5:16 PM

#

They made a big breakthrough in the base model

#

Somehow they pivoted to it real fast. They claim Gemini 2.5 pro has a January 2025 cut off

#

This means the model was done like in a month or two

#

Absolute absurd pace

#

Yes lol

#

Goal posts keep changing 😉

#

In a modern sense no

#

If that makes sense

#

Yes

#

Maybe idk. We take a lot of things for granted

#

This is truly amazing stuff that seems sorta mundane now

#

Given how fast 2.5 pro was churned out things seem to only get faster

#

Yeah

#

I still use Claude though

#

Yup

#

Nah I don't code with ai yet

#

Poor in rust last I checked. Maybe sooner than I think I'll start using it more

#

Ya LLMs are best with python

#

Probably more but I don't really use LLMs to code so no clue

#

I think it would do really good

#

Maybe

rigid widget Apr 1, 2025, 5:28 PM

#

poll_question_text

We need a simple category

victor_answer_votes

4

total_votes

5

victor_answer_id

1

victor_answer_text

Yes, we need a simple category

oblique flint Apr 1, 2025, 5:38 PM

#

Probably js/ts too

hearty pulsar Apr 1, 2025, 5:40 PM

#

which model is Spider?

rigid widget Apr 1, 2025, 5:41 PM

#

nope just more compute power 😦

#

Turing test is nothing for now

keen beacon Apr 1, 2025, 5:46 PM

#

Preliminary work but with the cut off of Jan 2025. Continued pretraining had to start after it

#

Research etc yeah.

#

They pivoted so fast they didn't even make 2.0 pro stable

#

Tbh I'm not sure if Jan 2025 is right even if they claim it to be that

#

Because it's insanely fast

#

Oh they didn't train the cut off in

#

It's stated in the docs of the model

#

It has sparse knowledge after June 2024 (Gemini 2 cut off, it was continued pretrained from 2.0 pro) from small tests of mine. (Haven't had time to go more in depth) But that date they mentioned can still be meaningful

hearty pulsar Apr 1, 2025, 5:51 PM

#

Anthropic lies about it's cut-off date, for example, knows who won the 2024 presidential election

keen beacon Apr 1, 2025, 5:52 PM

#

Tbh with how they didn't even release 2.0 pro stable I'm inclined to believe the date they mentioned is meaningful

#

They were still working on 2.0 pro in december

hearty pulsar Apr 1, 2025, 5:53 PM

#

Because deepseek cannot be trusted lol

keen beacon Apr 1, 2025, 5:54 PM

#

Deepseek is not google deep mind

hearty pulsar Apr 1, 2025, 5:54 PM

#

ohhhhhhhhhhhh

#

then I'm the 🤡

keen beacon Apr 1, 2025, 5:55 PM

#

Maybe 2.0 pro wasn't done training it still is pretty unbelievable to me

#

Like they did a checkpoint in December and continued training on it

#

While they worked on the checkpoint

#

If 2.5 pro was done in 1-2 months, that sheer pace can't be beaten by anyone else tbh

hearty pulsar Apr 1, 2025, 5:56 PM

#

if I had to guess I'd say 750 billion

ocean vortex Apr 1, 2025, 5:57 PM

#

keen beacon If 2.5 pro was done in 1-2 months, that sheer pace can't be beaten by anyone els...

I think it's all RL training

#

just look at the contrast that there used to be between o1 and gpt4o

hearty pulsar Apr 1, 2025, 5:57 PM

#

Claude 3 opus had ~400B parameters

#

keen beacon Apr 1, 2025, 5:58 PM

#

ocean vortex I think it's all RL training

No they had to do continued pretraining since it has a different cut off. They mentioned an enhanced base model too in their blog post. Their reasoning game isn't matching openai yet

hearty pulsar Apr 1, 2025, 5:58 PM

#

I couldn't find a source saying trillions

keen beacon Apr 1, 2025, 5:58 PM

#

O3 mini is based on 4o mini ffs

#

But with deep minds pace they will destroy everyone

#

If u can pivot and allocate resources on a breakthrough that fast

#

They stopped work on 2.0 pro and dropped everything for 2.5 pro

ocean vortex Apr 1, 2025, 5:59 PM

#

keen beacon No they had to do continued pretraining since it has a different cut off. They m...

I mean the gains are from RL training. The base model is smth like old 2.0 Pro with minimal improvements and new data

#

new data is in more recent data

keen beacon Apr 1, 2025, 6:00 PM

#

ocean vortex I mean the gains are from RL training. The base model is smth like old 2.0 Pro w...

Some of it. Their reasoning game is not up to par with openai. The base model is just way stronger

ocean vortex Apr 1, 2025, 6:01 PM

#

keen beacon Some of it. Their reasoning game is not up to par with openai. The base model is...

I don't think base model can be much stronger than it was tbh. Diminishing returns, that's why they did RL training

keen beacon Apr 1, 2025, 6:01 PM

#

ocean vortex I don't think base model can be much stronger than it was tbh. Diminishing retur...

I thought so too I was wrong lol

ocean vortex Apr 1, 2025, 6:01 PM

#

they didn't even release it as non-reasoning

#

if it was much better than 2.0 Pro they surely would have released it lol

misty vault Apr 1, 2025, 6:02 PM

#

did you know that spiders are the only web developers that like finding bugs?

keen beacon Apr 1, 2025, 6:02 PM

#

ocean vortex they didn't even release it as non-reasoning

They said they will

#

Given how fast they moved they were bound to release it in a smaller scope

ocean vortex Apr 1, 2025, 6:04 PM

#

keen beacon Given how fast they moved they were bound to release it in a smaller scope

I don't think that's very plausible though. They did RL training, tested that model and released it. This would have taken more time than getting standard chat model ready on the same base

#

you don't see deepseek releasing R2 before they had new V3.1 etc

keen beacon Apr 1, 2025, 6:04 PM

#

ocean vortex I don't think that's very plausible though. They did RL training, tested that mo...

Nah not necessarily

ocean vortex Apr 1, 2025, 6:05 PM

#

wdym

#

afaik even RL training is being done on instruct models, not base

keen beacon Apr 1, 2025, 6:06 PM

#

ocean vortex afaik even RL training is being done on instruct models, not base

You can do either

#

Instruct models already have a sense of direction, it's more optimal to work on a base model

keen beacon Apr 1, 2025, 6:07 PM

#

ocean vortex I don't think that's very plausible though. They did RL training, tested that mo...

It's not just rl training. They sft cold start then apply rl

keen beacon Apr 1, 2025, 6:07 PM

#

ocean vortex I don't think that's very plausible though. They did RL training, tested that mo...

Well you can see their claimed cut off (confirms continued pretraining + 1-2 months of total work). The simpleqa increase cannot be explained by rl I don't think, at least not all of it. Reasoning by them is not too sophisticated yet

#

#general message remember this too? If their reasoning is still not up to par, it's the base model

ocean vortex Apr 1, 2025, 6:10 PM

#

keen beacon Well you can see their claimed cut off (confirms continued pretraining + 1-2 mon...

hm maybe, will be interesting to test it if they release it as instruct

keen beacon Apr 1, 2025, 6:10 PM

#

What deep mind did is pretty unbelievable tbh

ocean vortex Apr 1, 2025, 6:10 PM

#

were you referring to this blogpost? https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking

Google

Gemini 2.5: Our most intelligent AI model

Gemini 2.5 is our most intelligent AI model, now with thinking.

#

Interesting that they say "Gemini 2.5 models are thinking models", implying that there won't be 2.5 non-reasoning variants

keen beacon Apr 1, 2025, 6:11 PM

#

Ya they mentioned significantly enhanced base model

#

It's literally magic to do that on 2.0 pro idk what they fed 2.0 pro

#

Steroids or smthing

ocean vortex Apr 1, 2025, 6:12 PM

#

keen beacon It's literally magic to do that on 2.0 pro idk what they fed 2.0 pro

RL training works much better on bigger models

keen beacon Apr 1, 2025, 6:12 PM

#

ocean vortex Interesting that they say "Gemini 2.5 models are thinking models", implying that...

Logan said there would be

ocean vortex Apr 1, 2025, 6:12 PM

#

where? catgrin

keen beacon Apr 1, 2025, 6:13 PM

#

Search for his tweet Im on my phone lol

ocean vortex Apr 1, 2025, 6:13 PM

#

this implies there won't be any named as 2.5 gemini

#

we can still have 2.0 update of it or whatever

keen beacon Apr 1, 2025, 6:14 PM

#

ocean vortex RL training works much better on bigger models

Yes but this is primarily the significantly better base model. If they were on par with openai they would have an o4 competitor lol

#

On par w reasoning

ocean vortex Apr 1, 2025, 6:14 PM

#

keen beacon Yes but this is primarily the significantly better base model. If they were on p...

I still very much doubt it tbh

#

but we will see

#

as for OpenAI...

keen beacon Apr 1, 2025, 6:15 PM

#

ocean vortex I still very much doubt it tbh

I mean look at the cut off. Simpleqa. "Significantly enhanced base model" according to the blog post

ocean vortex Apr 1, 2025, 6:15 PM

#

I think their o3-full scores were something more like o1-pro except even more insane setup

#

which is why it's not released lol

keen beacon Apr 1, 2025, 6:15 PM

#

All signs are pointing to smthing

#

I've seen qwq out reason 2.5 pro on extremely rote tasks too. Their reasoning game is not there yet I think

#

It's primarily a significantly stronger base model, not rl training

#

Yet

#

It's 4o mini on steroids

rigid widget Apr 1, 2025, 6:18 PM

#

Guys which OpenAI model will be open?

keen beacon Apr 1, 2025, 6:18 PM

#

You can see the simpleqa scores (world knowledge)

ocean vortex Apr 1, 2025, 6:18 PM

#

keen beacon It's primarily a significantly stronger base model, not rl training

well we do not know that. We only see that they mentioned it is better. But we do not know by how much exactly. SimpleQA score by itself is not a good indicator IMO. GPT4.5 scores higher anyway

keen beacon Apr 1, 2025, 6:19 PM

#

Yeah they're not significantly better. 4o full has like double the score

#

Yes it's sota in simpleqa

keen beacon Apr 1, 2025, 6:19 PM

#

ocean vortex well we do not know that. We only see that they mentioned it is better. But we d...

Dude 2.5 pro is not even a 1t model and it gets 50%

#

They mentioned significantly enhanced btw I'm quoting not making it up

#

It's unbelievable I know

ocean vortex Apr 1, 2025, 6:20 PM

#

gpt4.5 is not that impressive in any area over the competition tbh. Other than that singular benchmark

keen beacon Apr 1, 2025, 6:20 PM

#

2.5 pro is 2x faster than 2.0 pro too lol

#

Did I forget to mention that

ocean vortex Apr 1, 2025, 6:21 PM

#

keen beacon It's unbelievable I know

well it is not only unbelievable it is less likely than the alternative lol

#

and we do not know the size

keen beacon Apr 1, 2025, 6:21 PM

#

Gpt 4.5 is way bigger than og gpt 4 lol

ocean vortex Apr 1, 2025, 6:21 PM

#

we don't even know if it's smaller than gpt4.5 since OpenAI pricing is not a good indicator anymore 💀

keen beacon Apr 1, 2025, 6:21 PM

#

2.5 pro is almost as fast as 2.0 flash and has 1m context

ocean vortex Apr 1, 2025, 6:22 PM

#

speed has very little to do with size

#

depends on their infra

keen beacon Apr 1, 2025, 6:22 PM

#

ocean vortex depends on their infra

Well they never released the public api of Gemini ultra

#

Though they have significantly more and better hardware by now I guess

ocean vortex Apr 1, 2025, 6:23 PM

#

keen beacon Well they never released the public api of Gemini ultra

because it was outperformed by Pro almost everywhere

#

made no sense at all cost wise

keen beacon Apr 1, 2025, 6:23 PM

#

ocean vortex speed has very little to do with size

They are doing double the speed of 2.0 pro and serving it at extremely high demand

#

I don't think their infra would've changed that much between 2.0 pro and 2.5 pro

ocean vortex Apr 1, 2025, 6:24 PM

#

keen beacon They are doing double the speed of 2.0 pro and serving it at extremely high dema...

double the speed because it's a reasoning model. That's how you host them properly

#

speed is much more important there

#

than for standard models

ocean vortex Apr 1, 2025, 6:26 PM

#

keen beacon I don't think their infra would've changed that much between 2.0 pro and 2.5 pro

it's not like they are running everything at max speed they possibly can. There's no point ensuring your concise output model is always fast

zinc ore Apr 1, 2025, 6:26 PM

#

One explanation for it being so fast would be if they're using the new Gen TPUs with it

#

Whereas other Gemini's were running on the older TPUs

keen beacon Apr 1, 2025, 6:27 PM

#

ocean vortex well it is not only unbelievable it is less likely than the alternative lol

Tbh idk how you can get more confirmed then them saying the base model was "significantly enhanced". I don't think they used significantly as a descriptor anywhere else

ocean vortex Apr 1, 2025, 6:28 PM

#

keen beacon Tbh idk how you can get more confirmed then them saying the base model was "sign...

it's a factor for sure and point taken. But I'm also not gonna blindly believe a singular sentence from their marketing blogpost tbh. I would need to see it to believe it is indeed "significantly" 😇

keen beacon Apr 1, 2025, 6:29 PM

#

It's not bigger than 2.0 pro because it was continued pretrained on it btw

#

Uhm do you believe a model was pretrained from scratch in a month or two and they would ditch millions of dollars in training 2.0 pro

ocean vortex Apr 1, 2025, 6:30 PM

#

I think there are predominantly 2 ways in improving base/instruct models

#

one is what deepseek did

#

another is openai

#

so you either make it extremely verbose almost like a reasoning model but not really, or you train it on other much more capable models (reasoning model final outputs)

keen beacon Apr 1, 2025, 6:31 PM

#

Theyre huge reasons on why it isn't plausible for it be trained from scratch

#

I don't believe it can be done. Considering the Jan 2025 cut off they claim

ocean vortex Apr 1, 2025, 6:31 PM

#

other than those 2 things, I do not think there are significant gains past previous 2.0 Exp

keen beacon Apr 1, 2025, 6:31 PM

#

There's toooo much work

#

They didn't even release 2.0 pro stable LOL

#

Gemini 2 was already done extremely fast. But doing that on a new pretrained model in 1-2 months is truly unbelievable

#

Knowledge cut off.

#

They were still working on 2.0 pro in december

#

Gemini 2.5 barely knows anything after June 2024 (Gemini 2 cut off)

#

But it does know some things

#

They have literally abandoned 2.0 pro too

ocean vortex Apr 1, 2025, 6:37 PM

#

keen beacon They didn't even release 2.0 pro stable LOL

imo that's simply because of the metrics. It was too close to flash

#

not like it wasn't "stable"

keen beacon Apr 1, 2025, 6:39 PM

#

If you want to believe that keep on believing it. I just don't think it makes sense at all given the information that is publicly available

ocean vortex Apr 1, 2025, 6:39 PM

#

keen beacon If you want to believe that keep on believing it. I just don't think it makes se...

you mean that single sentence from their marketing? Yeah hard pass for now lol

#

yeah but that's because of the cost

#

now imagine same cost except each reply is much longer

#

and you need to ensure it is fast to be usable...

#

that's much harder

keen beacon Apr 1, 2025, 6:44 PM

#

ocean vortex you mean that single sentence from their marketing? Yeah hard pass for now lol

Nope lol. Cut off. Simpleqa. Economics. Timeline.

ocean vortex Apr 1, 2025, 6:45 PM

#

keen beacon Nope lol. Cut off. Simpleqa. Economics. Timeline.

cutoff doesn't mean performance. Simpleqa still lower than gpt4.5. Economics?? catgrin

#

lmao

#

reasoning also improves simpleQA looks like. Not much but still improves it. 10%? When we deduct that I would be surprised if the score is notably higher than 3.7 sonnet

#

3.5 scores 28.9

rigid widget Apr 1, 2025, 6:49 PM

#

guys

keen beacon Apr 1, 2025, 6:49 PM

#

ocean vortex cutoff doesn't mean performance. Simpleqa still lower than gpt4.5. Economics?? <...

Ditching 2.0 pro and wasting all that money is not plausible

rigid widget Apr 1, 2025, 6:49 PM

#

can you open a thread?

keen beacon Apr 1, 2025, 6:49 PM

#

Cut off = different pretraining

keen beacon Apr 1, 2025, 6:50 PM

#

ocean vortex cutoff doesn't mean performance. Simpleqa still lower than gpt4.5. Economics?? <...

This is gonna age poorly

ocean vortex Apr 1, 2025, 6:50 PM

#

keen beacon Ditching 2.0 pro and wasting all that money is not plausible

they are not ditching anything. They are making gains with RL training like everyone else

#

...

#

good point. 2.0 Pro 44.3%
2.5 Pro 52.9%

#

almost entirety of that can be accounted to reasoning lol

keen beacon Apr 1, 2025, 6:53 PM

#

ocean vortex almost entirety of that can be accounted to reasoning lol

A model where qwq can out reason it sometimes on pure rote tasks? I dunno

ocean vortex Apr 1, 2025, 6:55 PM

#

keen beacon A model where qwq can out reason it sometimes on pure rote tasks? I dunno

Looking at openai evals, o1 scores more than gpt4o on simpleqa by close to 10% of their score, depending on the version

keen beacon Apr 1, 2025, 6:55 PM

#

ocean vortex Looking at openai evals, o1 scores more than gpt4o on simpleqa by close to 10% o...

Openai are way ahead in the reasoning game

#

Just look at o3 mini

#

Flash thinking sucks

ocean vortex Apr 1, 2025, 6:56 PM

#

that's way too many assumptions. Google have been doing it for a while now. I think their biggest bottleneck was literally small model size (for flash-thinking)

#

those are hard to do RL training on without distilling

eager mica Apr 1, 2025, 6:57 PM

#

It looks like cybele is back on the Arena.

keen beacon Apr 1, 2025, 6:58 PM

#

ocean vortex that's way too many assumptions. Google have been doing it for a while now. I th...

Look you know me Dom. I know I'm an idiot but I feel like I'm reasonably observant about these things. You are the one who claimed reasoning models were nothing burgers back then. I'm fairly confident about this. You didn't even know they used rl back then lol I believe. I feel like I have a decent track record on stuff like this if you don't take everything I say literally and have been observing stuff. You can believe me or not lol. All I can say is that I'm really looking forward to what deep mind can do now

#

I'm done ranting on this since we aren't changing the opinions of each other lol

#

If I seem hostile Im not trying to be and I apologize

#

I haven't slept in a while lol

ocean vortex Apr 1, 2025, 7:00 PM

#

keen beacon Look you know me Dom. I know I'm an idiot but I feel like I'm reasonably observa...

yeah I was wrong about them. They certainly were nowhere near stable and didn't perform all that well at the start. RL training was a known thing since the beginning though as far as I remember it lol

#

nothing wrong with having different opinions on this 👀

ocean vortex Apr 1, 2025, 7:01 PM

#

keen beacon If I seem hostile Im not trying to be and I apologize

all good

drifting elk Apr 1, 2025, 7:17 PM

#

oh my god guys

#

the new alpha arena is awesome

#

they listened to my feedback

#

hope they add new models like deepseek v3.1 and gemini 2.5

#

would not that be great

ocean vortex Apr 1, 2025, 7:31 PM

#

looking at simpleqa, it seems like google took grok3 non-reasoning score at mashed it together in their table when other grok3 scores are for grok thinking model lol

#

couldn't find it for extended thinking version and I don't think google tested it themselves as the score is literally identical

eager mica Apr 1, 2025, 8:09 PM

#

I haven't seen it yet, although I'm seeing stradale a lot.

keen beacon Apr 1, 2025, 9:24 PM

#

hearty pulsar

this is bs lmao

#

opus is >1T

#

sonnet is ~300-400B

hearty pulsar Apr 1, 2025, 9:24 PM

#

keen beacon this is bs lmao

Source?

keen beacon Apr 1, 2025, 9:26 PM

#

https://lifearchitect.ai/models-table/ this guy estimates param counts for a lot of models and i've always found my experiences with them to make sense when combined with his prediction

Dr Alan D. Thompson – LifeArchitect.ai

adt

Models Table

Open the Models Table in a new tab | Back to LifeArchitect.ai Open the Models Table in a new tab | Back to LifeArchitect.ai Models Table Rankings Reasoning Models • 2024Q3–2025Q1 Data dictionary Model (Text) Name of the large language model. Sometimes uses filename syntax. Lab (Text) Name of the organization or group responsible for traini...

#

3 opus is definitely a big model

#

no way it is under 500B

#

and i'm almost certain it's more than 900B

timber kiln Apr 1, 2025, 10:11 PM

#

keen beacon https://lifearchitect.ai/models-table/ this guy estimates param counts for a lot...

I haven't checked this website since 2024
But
O3 and 4.5 guesses are way off
And I am curious where did he get 200B rumors for 2.5 pro

keen beacon Apr 1, 2025, 10:12 PM

#

timber kiln I haven't checked this website since 2024 But O3 and 4.5 guesses are way off And...

i doubt his o3 guess is right but honestly 4.5 may be close

#

4.5 is a huge model

#

why do you think it's way off

timber kiln Apr 1, 2025, 10:16 PM

#

o3 base model is same as o1 unless they change it before release (twitter talks and we know from their arc-agi per task cost calculations)

4.5 is probably wrong because 4.5 is trained way back then and most likely with H100s
And that size and that token count would take more than rumors say it took

#

But inference cost scaling suggest something big still yes

ocean vortex Apr 1, 2025, 10:37 PM

#

timber kiln o3 base model is same as o1 unless they change it before release (twitter talks ...

yeah that 5T prediction for o3 is ridiculous. It's not a new base model. But it was likely being ran similar to o1-pro

blazing rune Apr 1, 2025, 11:08 PM

#

I hate it when I google a paper and I can only find news articles about it without a link to the original source.

rigid widget Apr 1, 2025, 11:40 PM

#

we are sleeping bro 😴

rigid widget Apr 1, 2025, 11:41 PM

#

blazing rune I hate it when I google a paper and I can only find news articles about it witho...

stop using Google Search 🛑

#

what they create a new model? how is that different

#

they are same : D

#

i really need sleep buddy

#

good nights

#

np

alpine coral Apr 1, 2025, 11:59 PM

#

'DeepMind slows down research releases to keep competitive edge in AI race' https://archive.is/tkuum
perhaps google/deepmind are cooking with some special sauce (that they wanna try keep to themselves) these days

eager mica Apr 2, 2025, 12:15 AM

#

I haven't seen it yet, but at this point in every round you're getting a Meta model.

timber kiln Apr 2, 2025, 12:16 AM

#

Still a month until LLama dev day

somber niche Apr 2, 2025, 12:16 AM

#

Chatbot Arena more like Metabot Arena

#

It's kinda crazy just how many models they've put on here lol

neat apex Apr 2, 2025, 12:17 AM

#

24 karat was not eveb 24b lmao

eager mica Apr 2, 2025, 12:17 AM

#

Not really. I think it's a small model.

somber niche Apr 2, 2025, 12:18 AM

#

Roma seemed like the weakest of the bunch, yeah

eager mica Apr 2, 2025, 12:18 AM

#

stradale probably even smaller.

neat apex Apr 2, 2025, 12:18 AM

#

mistral so?

#

immo

#

in my opnion the most isane model is llama 3.2b instruct yet

eager mica Apr 2, 2025, 12:19 AM

#

For now 24_karat_gold seems the best, but technically there's llama-3.1-405B

neat apex Apr 2, 2025, 12:19 AM

#

running in a toaster but is good like gpt 2.5

somber niche Apr 2, 2025, 12:19 AM

#

24 karat is probably my favorite so far, mostly since it's the only one that answered my challenge coding question correctly

#

Even spider struggled with that one

neat apex Apr 2, 2025, 12:20 AM

#

llama 3.1 405b yet have more pontetial than llama 3.3 70b, besises at most cases excell same

#

nope

eager mica Apr 2, 2025, 12:21 AM

#

No, just Llama-3.3-70B-Instruct.

neat apex Apr 2, 2025, 12:21 AM

#

the best you can find is deepseek-r1-distill-llama

#

reasoning is not extra parametters, but it is here

somber niche Apr 2, 2025, 12:21 AM

#

For reference even Llama 3.1 405B biffed it hard on said question, but coding IMHO was never really L3's strongest suit

vivid oyster Apr 2, 2025, 12:21 AM

#

What bort is 24 karat gold

neat apex Apr 2, 2025, 12:22 AM

#

not exactly, this one is way more creative

#

hm

#

maybe some obscure modified Gemma 3

#

i am just not inside it xd

vivid oyster Apr 2, 2025, 12:23 AM

#

somber niche 24 karat is probably my favorite so far, mostly since it's the only one that ans...

What is that

somber niche Apr 2, 2025, 12:24 AM

#

I have a couple, but this is my current goto

#

https://atcoder.jp/contests/dp/tasks/dp_j

AtCoder

J - Sushi

AtCoder is a programming contest site for anyone from beginners to experts. We hold weekly programming contests online.

honest garden Apr 2, 2025, 12:24 AM

#

Yeah

neat apex Apr 2, 2025, 12:24 AM

#

my brother favorited Qwen 2.5 plus for it, thanks to huge context and works was good most ones, i dont know if he did good use or not at all

honest garden Apr 2, 2025, 12:25 AM

#

24_karat_gold is the best ai in

#

Follow my instructions

#

What company is 24 karat gold.

vivid oyster Apr 2, 2025, 12:25 AM

#

honest garden What company is 24 karat gold.

I think it's anonymous

honest garden Apr 2, 2025, 12:25 AM

#

Why

vivid oyster Apr 2, 2025, 12:25 AM

#

Idk

honest garden Apr 2, 2025, 12:26 AM

#

I can’t select it in ’’direct mesage’

neat apex Apr 2, 2025, 12:26 AM

#

noticeable the previous ones were Gemma 3, o1, and Nova, what are supossed to be inovated frontier models

somber niche Apr 2, 2025, 12:26 AM

#

Probably Meta, given it follows the same schedule as the other likely L4 candidates

neat apex Apr 2, 2025, 12:27 AM

#

yeah, since Llama 4 is a frontier and inovative model

somber niche Apr 2, 2025, 12:27 AM

#

Its response can vary sometimes, but generally it says it's Meta as well

honest garden Apr 2, 2025, 12:27 AM

#

honest garden I can’t select it in ’’direct mesage’

It’s better than gemini 2.5 and o1

#

Which means it’s must be a top tier

neat apex Apr 2, 2025, 12:27 AM

#

just like i said, R1-Llama is way more responsible

honest garden Apr 2, 2025, 12:27 AM

#

S

neat apex Apr 2, 2025, 12:28 AM

#

managed make clash royale decks with combos concepts that were not generic

honest garden Apr 2, 2025, 12:28 AM

#

Gemni 2.5 pro is a dumbass

#

🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣

neat apex Apr 2, 2025, 12:28 AM

#

it is, good

honest garden Apr 2, 2025, 12:28 AM

#

24@karat gold has my instructions better than gemini 2.5 pro.

#

Gemini 2,5 pro when i ask it specific questions it gives broad answers that doesn’t answer it. and act’s like a dumbass

#

An it dosen’t follow my instruction’s very well

neat apex Apr 2, 2025, 12:29 AM

#

for me, it is not that dumbass

vivid oyster Apr 2, 2025, 12:29 AM

#

honest garden An it dosen’t follow my instruction’s very well

Yeah

honest garden Apr 2, 2025, 12:29 AM

#

neat apex for me, it is not that dumbass

Havew you tried 24 karat gold

neat apex Apr 2, 2025, 12:29 AM

#

maybe because i make my questions very very clear now

honest garden Apr 2, 2025, 12:30 AM

#

neat apex maybe because i make my questions very very clear now

Me too

neat apex Apr 2, 2025, 12:30 AM

#

not yet, but i am comparing to the average

honest garden Apr 2, 2025, 12:30 AM

#

I’m made sure to be super specific

#

But it’s just not satisfying for me

neat apex Apr 2, 2025, 12:30 AM

#

i am even injecting adjectives to make sure it follows up my mindset

honest garden Apr 2, 2025, 12:30 AM

#

It might be smart or whatever but I think 24_karat_gold is much better

#

If it realses. It might be a very good ai

neat apex Apr 2, 2025, 12:31 AM

#

[Always act now you be only so impulsive yet formal so effective yet minimalist]

honest garden Apr 2, 2025, 12:31 AM

#

neat apex [Always act now you be only so impulsive yet formal so effective yet minimalist]

Eactly

#

When I send it those instructions

neat apex Apr 2, 2025, 12:31 AM

#

works well to make they less dumbass

honest garden Apr 2, 2025, 12:31 AM

#

It execcutes them in a half ass’ed boy

#

Way*

neat apex Apr 2, 2025, 12:31 AM

#

yeah, it already helps half way for free

honest garden Apr 2, 2025, 12:32 AM

#

You should try 24_karat_old

#

You will see the improvement

neat apex Apr 2, 2025, 12:32 AM

#

i saw an anonymous chat, it is veryyyy good

honest garden Apr 2, 2025, 12:32 AM

#

Yeah

neat apex Apr 2, 2025, 12:32 AM

#

even better than R1 Llama

honest garden Apr 2, 2025, 12:32 AM

#

I don’t know much about AI’s but it seems to me it’s very smart

honest garden Apr 2, 2025, 12:33 AM

#

neat apex even better than R1 Llama

Is r1 llama as good as the ones like chatgpt o1

neat apex Apr 2, 2025, 12:33 AM

#

it is more unstable

honest garden Apr 2, 2025, 12:33 AM

#

Like

#

In reasoning capabilities

neat apex Apr 2, 2025, 12:33 AM

#

in reasoning like solving an thing

#

hm

honest garden Apr 2, 2025, 12:33 AM

#

Yeah and answering logical questions

neat apex Apr 2, 2025, 12:34 AM

#

i not made theses questions yet a lot

#

just goofy tasks, for it it is great

honest garden Apr 2, 2025, 12:34 AM

#

I had a debate on one of them on philosophy

blazing rune Apr 2, 2025, 12:34 AM

#

rigid widget stop using Google Search 🛑

I actually use Brave Search, but everyone knows what Google is

honest garden Apr 2, 2025, 12:34 AM

#

24 karat gold is very smart

#

Gemini says weird answers

#

And o1 and stuff

neat apex Apr 2, 2025, 12:34 AM

#

imma ask why radiologists cant go 8 hours each 2 days, but 4 every day or 24 at once

#

i need to know it already

honest garden Apr 2, 2025, 12:36 AM

#

neat apex imma ask why radiologists cant go 8 hours each 2 days, but 4 every day or 24 at ...

📎 message.txt

#

That’s what gemini says

#

No

#

I don’t code

#

Maybe gemini 2.5 pro might be

#

The best at it

#

Because it can think and stuff

neat apex Apr 2, 2025, 12:39 AM

#

oooh goofy question

#

i asked R1-Llama but losed interaction

honest garden Apr 2, 2025, 12:39 AM

#

What’s R1 laama

#

Where can I use it

neat apex Apr 2, 2025, 12:40 AM

#

go at Deepinfra.com

honest garden Apr 2, 2025, 12:40 AM

#

Ok

neat apex Apr 2, 2025, 12:40 AM

#

R1 but inside llama

honest garden Apr 2, 2025, 12:40 AM

#

It’s free?

neat apex Apr 2, 2025, 12:40 AM

#

yeah, no login

honest garden Apr 2, 2025, 12:40 AM

#

Oh

#

It’s this one? Deepseek r1 llama 70b

neat apex Apr 2, 2025, 12:41 AM

#

yes

honest garden Apr 2, 2025, 12:41 AM

#

Isn’t that a weaker version of deepseek r1

#

The one on the website

neat apex Apr 2, 2025, 12:41 AM

#

supossed to be, but it is not

honest garden Apr 2, 2025, 12:41 AM

#

Hm

#

That’s interesting

#

Is it better?

neat apex Apr 2, 2025, 12:41 AM

#

ah yes, well, deepinfra are somelow worse than fireworks

honest garden Apr 2, 2025, 12:41 AM

#

Yeah but it’s random

neat apex Apr 2, 2025, 12:41 AM

#

i thinked yes, somehow better

#

it even answered this question correctly

honest garden Apr 2, 2025, 12:42 AM

#

Let me try it

neat apex Apr 2, 2025, 12:42 AM

#

8h every oher day is goofy for an hospital that is working 24h a day, and this 24h shift is just at emergencys

honest garden Apr 2, 2025, 12:42 AM

#

Yeah and following instructions

#

I don’t know if it’s gonna be good in coding because it’s random so it might make mistakes and stuff

#

Probably I don’t use it for rolelpaying

neat apex Apr 2, 2025, 12:43 AM

#

honest garden Isn’t that a weaker version of deepseek r1

Togheter ai: same good was an home computer running
Fireworks: somehow better
Deepinfra: somelow worse

honest garden Apr 2, 2025, 12:43 AM

#

I had debates with it on and philosophy and stuff it’s very good

#

But sometimes make’s mistakes

honest garden Apr 2, 2025, 12:43 AM

#

neat apex Togheter ai: same good was an home computer running Fireworks: somehow better De...

What’s that

#

Like the AI’s in the website

neat apex Apr 2, 2025, 12:43 AM

#

just explaining each api provider

honest garden Apr 2, 2025, 12:44 AM

#

Yeah

neat apex Apr 2, 2025, 12:44 AM

#

i mean compared to you run the model at an home comluter

#

try dececrease top p to 50

honest garden Apr 2, 2025, 12:45 AM

#

Yeah if you give it instructions on how to act it follows them perfectly but without its bad because without instructions it’s random and say’s dumb stuff

neat apex Apr 2, 2025, 12:45 AM

#

top p in 0.9 is trash

honest garden Apr 2, 2025, 12:45 AM

#

neat apex Togheter ai: same good was an home computer running Fireworks: somehow better De...

Deepseek r1

#

Llama 70b

#

made

#

Russian letters at the end of the prompt for no reason

#

😂

neat apex Apr 2, 2025, 12:45 AM

#

yeah lmao

honest garden Apr 2, 2025, 12:45 AM

#

Is that a mistake in my configuration

neat apex Apr 2, 2025, 12:45 AM

#

thats why they are cheapest provider

#

no, it is common, it happens when you use deepinfra with high temperature and top p

#

they do many things to it run cheaper

#

what includes it singing in russian at stress

honest garden Apr 2, 2025, 12:47 AM

#

Yeah it’s not good

#

It’s followed my instructions for one prompt

#

And the other prompt it followed none of them

neat apex Apr 2, 2025, 12:47 AM

#

ah, so it is good for one mesage

honest garden Apr 2, 2025, 12:47 AM

#

😂

neat apex Apr 2, 2025, 12:47 AM

#

damn

honest garden Apr 2, 2025, 12:47 AM

#

It didn’t even follow it like I wanted in the one message

neat apex Apr 2, 2025, 12:48 AM

#

hm, maybe it is just a small adjust, let me check, they intetionally raise up temp and top p to make it train more

honest garden Apr 2, 2025, 12:48 AM

#

What’s r1 turbo

#

Is that a better version or faster

neat apex Apr 2, 2025, 12:51 AM

#

i not have an idea

#

it is more unstable "Q4 istead Q8" so it must be faster

#

not found out yet, looks too the same

honest garden Apr 2, 2025, 12:52 AM

#

It sucks tooo

#

2 karat gold

#

24

neat apex Apr 2, 2025, 12:54 AM

#

yeah, it is because this Top q is stupidly high

#

pull it to 0-0.3 and it stops struggling

honest garden Apr 2, 2025, 12:54 AM

#

When I give it my instructions it follows it perfectly like I want

#

Other models even like gemini 2.5 pro don’t

neat apex Apr 2, 2025, 12:54 AM

#

give me an example of an yours instruction

#

or maybe it is granite 24b lmao

honest garden Apr 2, 2025, 12:55 AM

#

neat apex give me an example of an yours instruction

I didn’t save it but I’ll show you when I encounter it again

#

What’s that

neat apex Apr 2, 2025, 12:55 AM

#

give me an example just by head

#

write it too lazy how you usually do

#

the ghibfly is basicaly make common details in a single panel of color

#

but keep propontions and whatever you need

#

sooo, anti ghibfly will guess every detail is a amount of details

honest garden Apr 2, 2025, 12:59 AM

#

neat apex write it too lazy how you usually do

Hmm

neat apex Apr 2, 2025, 12:59 AM

#

just like an padora box complex

honest garden Apr 2, 2025, 12:59 AM

#

I asked it to formulate arguments for moral objectivism

#

And it generated very strong ones that I haven't seen before

neat apex Apr 2, 2025, 12:59 AM

#

hm

honest garden Apr 2, 2025, 12:59 AM

#

When I ask AI like gemini 2.5 pro and o1

#

And claude 3.7 sonnet and grok 3 and stuff

neat apex Apr 2, 2025, 12:59 AM

#

yeah, usually ask to an AI be original it goes goofy

honest garden Apr 2, 2025, 1:00 AM

#

They just regurgitate the same stuff you get from a basic google

#

For fun I asked it to use complicated terminology

#

And it did it perfectly and applied them perfectly too

#

When i ask gemini 2.5 pro or other AIs like this

neat apex Apr 2, 2025, 1:00 AM

#

hm

honest garden Apr 2, 2025, 1:00 AM

#

It starts speaking in latin and being cringey

neat apex Apr 2, 2025, 1:00 AM

#

yeah, they just cant

honest garden Apr 2, 2025, 1:00 AM

#

Yeah

#

This 24 karat gold one can do it

#

But

#

It varies

neat apex Apr 2, 2025, 1:01 AM

#

Command-a is not that horrible in it, but is not responsive

honest garden Apr 2, 2025, 1:01 AM

#

One conversation it made outrageous claims and misapplied theories and all of its claims were wrong

#

And the other conversation it was very good at it

neat apex Apr 2, 2025, 1:01 AM

#

or Mistral

honest garden Apr 2, 2025, 1:01 AM

#

So it kinda depends idk on what but it does

honest garden Apr 2, 2025, 1:01 AM

#

neat apex Command-a is not that horrible in it, but is not responsive

Lemme try

neat apex Apr 2, 2025, 1:02 AM

#

it is like Haiku 3.5, but formal and reasonable

#

very great in my opnion

honest garden Apr 2, 2025, 1:02 AM

#

Why can’t I chat with more ai’s

#

In alpha lmarena

neat apex Apr 2, 2025, 1:02 AM

#

not added yet :/

honest garden Apr 2, 2025, 1:02 AM

#

What’s canary

#

Is that the same thing

neat apex Apr 2, 2025, 1:03 AM

#

canary? not heard about yet

honest garden Apr 2, 2025, 1:04 AM

#

https://canary.lmarena.ai/

#

It has the same password as alpha

#

But it looks the same idk

eager mica Apr 2, 2025, 1:25 AM

#

I'm not also taking note of the reason why some model lost, but in many cases definitely it isn't a clear&cut win/loss situation.

#

The latest Meta models seem weak on knowledge/hallucination, for example. Sometimes models might redeem themselves if you clarify the question, etc.

#

There are cases for example where the style and form of the response are better, but the response is incorrect. How to vote in that case?

#

Or when in both cases the general form is OK but they're incorrect in different ways.

#

If it's a human preference test, then I should probably vote the response that I like more. That seems wrong, though, if there are other problems.

#

A small model might "feel" like a larger one in some tasks but will definitely not have the same knowledge capacity. Is it even fair to compare massively differently sized models and rate them for the knowledge they may or may not have due to hard limitations?

quaint monolith Apr 2, 2025, 1:52 AM

#

eager mica A small model might "feel" like a larger one in some tasks but will definitely n...

yes.

eager mica Apr 2, 2025, 1:55 AM

#

quaint monolith yes.

What I'm saying is that a binary win/lose rating in the case of LLMs doesn't correctly portray why some models might be better than others.

quaint monolith Apr 2, 2025, 1:56 AM

#

there doesn’t need to be a why. it’s just a leaderboard. there are pros and cons with many models and the people that care enough can look into it more.

eager mica Apr 2, 2025, 2:12 AM

#

I've just encountered a maverick that sounds like a Llama (but I haven't asked yet).

somber niche Apr 2, 2025, 2:13 AM

#

And ray is another new one

leaden palm Apr 2, 2025, 2:19 AM

#

when the new models first came onto the arena i hated them / found them cringe but theyre honestly genius tbh

#

theyre fresh and engaging in a world of bland, corporate outputs

eager mica Apr 2, 2025, 2:21 AM

#

In my Elo ratings purely based on personal preference on creative tasks (rather than general performance) they tend to rate high.

#

No, I haven't yet, I'm not going through battles that quickly.

alpine coral Apr 2, 2025, 2:24 AM

#

leaden palm when the new models first came onto the arena i hated them / found them cringe b...

some, if not most, seem to have legit wild (but very intelligently crafted) system prompts

eager mica Apr 2, 2025, 2:24 AM

#

Just found now, actually. It says it's "a large language model, trained by Google."

alpine coral Apr 2, 2025, 2:26 AM

#

alpine coral some, if not most, seem to have legit wild (but very intelligently crafted) syst...

i wonder if they could be like smaller versions of grok / something from xAI.. just can't imagine companies like google, anthropic or OAI adding models to the arena with such scizo system prompts. meta possibly. xAI i could for sure imagine it. othwerwise i dunno, some Chinese lab possibly

leaden palm Apr 2, 2025, 2:27 AM

#

ive seen theories of meta and i agree with them

alpine coral Apr 2, 2025, 2:29 AM

#

yeah seems most plausible, given they are due to be releasing new llama models soonish (i believe anyway)

#

though also, there's sooo many of these anon models now - i'm not sure they're all from the same company

#

spider and 24_karat_gold and venom seem very similar and imo almost certainly related
cybele and themis also very similar i'd say related
whether they're all are related, i dunno

#

there were others too

#

and now maverick, stargazer, ray, riveroaks (just from quickly scanning above)

kind cloud Apr 2, 2025, 3:30 AM

#

Screenshot_2025-04-02-12-30-08-768-edit_com.android.chrome.jpg

alpine coral Apr 2, 2025, 3:48 AM

#

ah nice one 👍 we can add maverick to the same cluster of anon models (which share these whacky but beautifully crafted system prompts)

plain zinc Apr 2, 2025, 3:52 AM

#

Lol

#

New Google model👀

plain zinc Apr 2, 2025, 3:53 AM

#

kind cloud

Oops

alpine coral Apr 2, 2025, 3:59 AM

#

is stargazer any good? semantically its similar to nebula; and I believe moonhowler was basically confirmed as google (so they're all like moon/celestial codenames)

#

Then i dunno, but i feel like Themis, Cybele, Roma (plus Kronus and Rhea, though i haven't seen them for a while) are thematically related.. like they all evoke some kinda ancient Mediterranean mythology / cosmology theme (esp Greek and a bit of Roman)

#

then spider, karat_gold and maverick seem unrelated as far the codenames go; but they all share the same or a similar system prompt

#

so like potentially three clusters of new models among the anon ones?

torn mantle Apr 2, 2025, 4:03 AM

#

alpine coral is `stargazer` any good? semantically its similar to `nebula`; and I believe `mo...

below gemini 2.5 pro

#

but its good

#

yea just now

#

seems like meta model

#

stradale isnt bad at all

alpine coral Apr 2, 2025, 4:17 AM

#

oh right there's also stradale and riveroak..

#

i just got stargazer - seemed really quite decent

torn mantle Apr 2, 2025, 4:19 AM

#

maverick isnt any good

#

its trying to be funny on every prompt

torn mantle Apr 2, 2025, 5:38 AM

#

i wouldnt be surprised if qwen3 is already added to the arena

#

it will be released on the 2nd week of this month

#

it will be interesting to compare it with llama 4

torn mantle Apr 2, 2025, 5:48 AM

#

alpine coral is `stargazer` any good? semantically its similar to `nebula`; and I believe `mo...

i may be wrong, this may be a bit better than gemini 2.5 pro

kind cloud Apr 2, 2025, 6:02 AM

#

Openai? But it's very slow

Screenshot_2025-04-02-14-55-50-989-edit_com.android.chrome.jpg

alpine coral Apr 2, 2025, 6:04 AM

#

torn mantle i may be wrong, this may be a bit better than gemini 2.5 pro

interesting. i've only gotten it once; it was definitely solid - but not mindblowing. That said.. it was, just a couple of questions/riddles... so not really probing or particularly extensive.. anyway I've still got the window open from that battle - will give the same prompts to gem pro 2.5 and compare

torn mantle Apr 2, 2025, 6:11 AM

#

alpine coral interesting. i've only gotten it once; it was definitely solid - but not mindbl...

it felt more refined

#

and consistent

#

its the most impressive model of the newly added ones

calm sequoia Apr 2, 2025, 6:31 AM

#

keen beacon They made a big breakthrough in the base model

Source?

sudden marlin Apr 2, 2025, 6:35 AM

#

Riveroaks wow

rigid widget Apr 2, 2025, 7:10 AM

#

according to what?

rigid widget Apr 2, 2025, 7:11 AM

#

somber niche It's kinda crazy just how many models they've put on here lol

You will see that not all of them are from the meta

hardy pecan Apr 2, 2025, 7:15 AM

#

stargazer is google aswell im assuming

#

based off the name lol

rigid widget Apr 2, 2025, 7:20 AM

#

kind cloud Openai? But it's very slow

what? this can't be OpenAI

torn mantle Apr 2, 2025, 7:22 AM

#

rigid widget what? this can't be OpenAI

its def not openai

calm sequoia Apr 2, 2025, 7:28 AM

#

riveroaks is Gemini

harsh flume Apr 2, 2025, 7:47 AM

#

torn mantle its def not openai

what makes ya say that?

calm sequoia Apr 2, 2025, 7:54 AM

#

FYI - my approach of generating arrays based on formulas and multiplying them does not reflect the final benchmark because QwQ-32B performs almost perfectly, while 2.5 PRO EXP perform only so so.

forest coral Apr 2, 2025, 8:18 AM

#

Hi, does anyone know what is the "riveroak" model?

#

I am first to the arena, and the model had exceptional performance comparing to GPT-4o

calm sequoia Apr 2, 2025, 8:34 AM

#

Riveroaks is OpenAi according to my negotiations

#

Probably theier new open source model

#

Stargazer Google

torn mantle Apr 2, 2025, 8:42 AM

#

calm sequoia Riveroaks is OpenAi according to my negotiations

how did u know?

#

feels more like a Meta model

calm sequoia Apr 2, 2025, 8:42 AM

#

I think because it's different because it is open source. It can be meta model trained specifically on OpenAi outputs though. I'm betting its OpenAI (70%)

torn mantle Apr 2, 2025, 8:43 AM

#

calm sequoia I think because it's different because it is open source. It can be meta model t...

I just asked it and it said Meta, but again it could be hallucinating

calm sequoia Apr 2, 2025, 8:44 AM

#

How did you ask it?

#

I've tried with jailbreaking

visual turret Apr 2, 2025, 8:46 AM

#

torn mantle I just asked it and it said Meta, but again it could be hallucinating

Most companies use meta as a fake lab to trick people

#

Meta models love to use emojis if it doesn't say it's Llama3 and use emojis then it's not a meta model

#

Openai doesn't care about hiding there model

#

They just want to get on the leaderboard

#

After anonymous chatbot was gone the new 4o update came

#

It's pretty clear that openai just doesn't care

alpine coral Apr 2, 2025, 8:54 AM

#

oai, google and grok all love using the Arena to test variants/checkpoints of models under development

#

the only company that really seemed to try hard to prevent the anon model from revealing its identity was xAI (i think grok2, with sus-column/-r, which would refuse to say anythihng no matter how hard it was pushed)

#

i think they all just wanna gather data.. preventing the models from revealing their identity is a secondary concern, if really one at all
imho

alpine coral Apr 2, 2025, 9:01 AM

#

alpine coral oai, google and grok all love using the Arena to test variants/checkpoints of mo...

meta is kinda new to the party (it feels like anyway.. can't recall them releasing any models along with Arena/elo score from time spent in the arena under pseudonyms )

visual turret Apr 2, 2025, 9:04 AM

#

alpine coral meta is kinda new to the party (it feels like anyway.. can't recall them releasi...

Meta only has two in the arena

#

Llama 3.3 8b and Llama3.3 405b

alpine coral Apr 2, 2025, 9:06 AM

#

yah but i don't mean in Direct Chat

#

cybele, rhea, roma etc - i think there's a decent chance they're llama models (and not like derivatives of existing llama ones; something new from meta)

#

but pretty low confidence tbh

visual turret Apr 2, 2025, 9:17 AM

#

alpine coral yah but i don't mean in Direct Chat

Duh

rigid widget Apr 2, 2025, 9:18 AM

#

calm sequoia Riveroaks is OpenAi according to my negotiations

Because OpenAI doesn't "design" models they create models

alpine coral Apr 2, 2025, 9:18 AM

#

visual turret Duh

i don't really get the point of your comment then

visual turret Apr 2, 2025, 9:19 AM

#

alpine coral i don't really get the point of your comment then

I mean meta only has 2 models in the arena

alpine coral Apr 2, 2025, 9:19 AM

#

eh nvm

rigid widget Apr 2, 2025, 9:19 AM

#

visual turret I mean meta only has 2 models in the arena

not in "arena" in public

visual turret Apr 2, 2025, 9:20 AM

#

rigid widget not in "arena" in public

What do you mean by that?

rigid widget Apr 2, 2025, 9:21 AM

#

visual turret What do you mean by that?

in arena they have too many model

#

not just two

alpine coral Apr 2, 2025, 9:21 AM

#

rigid widget not in "arena" in public

they're all open weights.. they have more than 2 models available to the public.. like lots more ha

visual turret Apr 2, 2025, 9:21 AM

#

rigid widget in arena they have too many model

They don't, other AI labs uses meta Ai as the fall guy

rigid widget Apr 2, 2025, 9:21 AM

#

but not all anon-chatty models are from meta

visual turret Apr 2, 2025, 9:22 AM

#

rigid widget but not all anon-chatty models are from meta

Does it say it's Llama3 and use Emojis

rigid widget Apr 2, 2025, 9:22 AM

#

i think they have 4 model
spider
venom
themis
cybele

alpine coral Apr 2, 2025, 9:22 AM

#

visual turret They don't, other AI labs uses meta Ai as the fall guy

lots of models might say based on 'llama' (os) or gpt (most famous)... they're not the fallguy.. none of these models have any self awareness.. they don't know 'who' they are

visual turret Apr 2, 2025, 9:22 AM

#

rigid widget i think they have 4 model spider venom themis cybele

Themis is a real meta model but venom isn't a meta model

#

Spider too

alpine coral Apr 2, 2025, 9:23 AM

#

all they know is what it's in their system prompt basically

#

in terms of their 'identity'

rigid widget Apr 2, 2025, 9:23 AM

#

visual turret Themis is a real meta model but venom isn't a meta model

which company do you think?

visual turret Apr 2, 2025, 9:23 AM

#

rigid widget which company do you think?

I was thinking qwen

#

Qwen 3

rigid widget Apr 2, 2025, 9:23 AM

#

No model other than Chinese models can be this "uncensored"

alpine coral Apr 2, 2025, 9:24 AM

#

the irony of that being a non-ironic statement lol

visual turret Apr 2, 2025, 9:24 AM

#

Then it could be grok

rigid widget Apr 2, 2025, 9:24 AM

#

visual turret Then it could be grok

Even grok didn't do this much

#

spider create a very harsh poem about my president

visual turret Apr 2, 2025, 9:27 AM

#

Maybe Cohere or Microsoft or nvidia models

rigid widget Apr 2, 2025, 9:28 AM

#

visual turret Maybe Cohere or Microsoft or nvidia models

No way

visual turret Apr 2, 2025, 9:28 AM

#

rigid widget No way

Nvidia is really bad at making models

keen beacon Apr 2, 2025, 9:28 AM

#

alpine coral all they know is what it's in their system prompt basically

No some companies train it in. U don't even need a sys prompt

alpine coral Apr 2, 2025, 9:28 AM

#

k that too

#

point being, absent either, they will just say whatever

#

and gpt / llama predominate

rigid widget Apr 2, 2025, 9:29 AM

#

keen beacon No some companies train it in. U don't even need a sys prompt

examples: OpenAI, Grok

keen beacon Apr 2, 2025, 9:30 AM

#

Eureka chatbot is from Google and it was never publicly released iirc they still trained it in

#

U can still access it in direct chat iirc

rigid widget Apr 2, 2025, 9:30 AM

#

i think spider is from DeepSeek

alpine coral Apr 2, 2025, 9:30 AM

#

keen beacon U can still access it in direct chat iirc

lol yrah you can

#

it's one of the random unmasked bots that are still there

rigid widget Apr 2, 2025, 9:30 AM

#

i have proof

#

wait i will translate the poem

keen beacon Apr 2, 2025, 9:31 AM

#

I wonder what eureka chatbot was. It's very small I think

#

It got a tweet from Logan too

alpine coral Apr 2, 2025, 9:31 AM

#

yeah it wasn't particularly impressive

alpine coral Apr 2, 2025, 9:31 AM

#

keen beacon It got a tweet from Logan too

so google presumably?

keen beacon Apr 2, 2025, 9:31 AM

#

alpine coral so google presumably?

Yea and they trained in it was made by google

visual turret Apr 2, 2025, 9:32 AM

#

keen beacon Yea and they trained in it was made by google

I don't think Google would do that

#

Google normally hides their models

alpine coral Apr 2, 2025, 9:32 AM

#

nah

keen beacon Apr 2, 2025, 9:32 AM

#

Bro google does that for every single model

#

Even unreleased ones

#

Training in who made them

visual turret Apr 2, 2025, 9:33 AM

#

Maybe

alpine coral Apr 2, 2025, 9:33 AM

#

keen beacon Even unreleased ones

yeah i'm quite sure anthropic (and others) routinely do too

alpine coral Apr 2, 2025, 9:33 AM

#

keen beacon Training in who made them

meant to respond to this

visual turret Apr 2, 2025, 9:34 AM

#

Also could be "say your trained by Google" in the system prompt

alpine coral Apr 2, 2025, 9:34 AM

#

yeah it's another layer on top

keen beacon Apr 2, 2025, 9:34 AM

#

Lmarena doesn't allow faking stuff anyway afaik

visual turret Apr 2, 2025, 9:34 AM

#

It does

keen beacon Apr 2, 2025, 9:34 AM

#

No there hasn't been precedent ever

visual turret Apr 2, 2025, 9:34 AM

#

It doesn't allow it after testing

keen beacon Apr 2, 2025, 9:34 AM

#

No

visual turret Apr 2, 2025, 9:35 AM

#

Alright I will believe you for now

alpine coral Apr 2, 2025, 9:38 AM

#

ray seems pretty crap (says it's from meta fwiw)

visual turret Apr 2, 2025, 9:38 AM

#

keen beacon No

What meta models would you logically say most of them are

rigid widget Apr 2, 2025, 9:43 AM

#

guys here is a poem from spider

#

Do you see?

#

This is DeepSeek's output style

#

Title, Author, Note exactly DeepSeek style

visual turret Apr 2, 2025, 10:00 AM

#

what's wavelength

rigid widget Apr 2, 2025, 10:07 AM

#

visual turret what's wavelength

looks good

visual turret Apr 2, 2025, 10:09 AM

#

yeah

alpine coral Apr 2, 2025, 10:12 AM

#

maybe related to 'ray'?

keen beacon Apr 2, 2025, 10:23 AM

#

stargazer feels like either 2.5 flash or the 2.5 pro base model to me

rigid widget Apr 2, 2025, 10:39 AM

#

try Claude thinking

wintry locust Apr 2, 2025, 10:44 AM

#

are there any better than 2.5 models on arena rn

hardy pecan Apr 2, 2025, 10:46 AM

#

wavelength - another cringe model lol

visual turret Apr 2, 2025, 10:47 AM

#

i have fr never seen an ai use transmute

hardy pecan Apr 2, 2025, 10:47 AM

#

lool

keen beacon Apr 2, 2025, 10:49 AM

#

keen beacon stargazer feels like either 2.5 flash or the 2.5 pro base model to me

its a 2.5 model

#

do u know if its thinking? i got it against r1 so i cant tell

#

it knows stuff in december 2024 (same like gem 2.5 pro)

#

this is way past the gemini 2 cut off

#

it also knows extremely obscure stuff, if this was flash it would be interesting

eager mica Apr 2, 2025, 10:54 AM

#

I mean that this time around it seemed less eager to say it's Meta/Llama and so that it could possibly indicate that something was changed in that regard. None of the newer models appears to be outputting llama emojis either (unlike those from a few days ago).

keen beacon Apr 2, 2025, 10:55 AM

#

moonhowler doesnt seem to know stuff in december 2024 (or have obscure knowledge)

#

yes

#

ehhh i haven't had a great experience with arena meta models

keen beacon Apr 2, 2025, 10:56 AM

#

keen beacon do u know if its thinking? i got it against r1 so i cant tell

it is i think

#

then its probably flash

#

if a new 2.5 pro revision is already out i seriously can't 💀

#

oh i just got 4.5 and it did my obscure knowledge test well (haven't tested it prior to this), same as 2.5 pro.

#

stargazer = 2.5 flash thinking

rigid widget Apr 2, 2025, 11:02 AM

#

wintry locust are there any better than 2.5 models on arena rn

"best" is a very variable thing

keen beacon Apr 2, 2025, 11:04 AM

#

how is 2.5 flash getting these obscure questions 💀

#

2.5 pro also gets them but its larger

#

wow this bench was just updated with o3 mini holy moly

ancient reef Apr 2, 2025, 11:12 AM

#

stargazer and riveroaks seam very knowedgeable

keen beacon Apr 2, 2025, 11:13 AM

#

oh thats another google model?

#

riveroaks

ancient reef Apr 2, 2025, 11:13 AM

#

I don't know. It's weirdly verbose though.

keen beacon Apr 2, 2025, 11:13 AM

#

ancient reef I don't know. It's weirdly verbose though.

is riveroaks thinking?

ancient reef Apr 2, 2025, 11:14 AM

#

keen beacon is riveroaks thinking?

If it was, it took very little time to respond, so I'm not sure.

keen beacon Apr 2, 2025, 11:14 AM

#

im timing requests on a huge puzzle i have, ill probably post results here and it should be easy to determine what model is what after that. Though if only riveroaks is google/thinking

#

stargazer is definitely in the gemini 2.5 model line though

brittle tiger Apr 2, 2025, 11:21 AM

#

keen beacon if a new 2.5 pro revision is already out i seriously can't 💀

@syryn0596 I bet improved GA version of 2.5 pro drops earlier than ppl expect. They seem to be pushing hard in gaining user adoption and that would be a main focus

kind cloud Apr 2, 2025, 11:32 AM

#

Screenshot_2025-04-02-20-31-52-002-edit_com.android.chrome.jpg

hardy pecan Apr 2, 2025, 11:43 AM

#

Stargazer looks like a thinker

eager mica Apr 2, 2025, 11:44 AM

#

alpine coral maybe related to 'ray'?

ray hallucinates a lot of stuff. Possibly a small model. Also told me it's from Meta/Llama.

alpine coral Apr 2, 2025, 11:45 AM

#

keen beacon then its probably flash

i think so (or something along those lines)

#

very similar to 2.5 Pro, but not quite as strong

#

(after a few tests.. subjective / unscientific / grain of salt etc etc ha)

hardy pecan Apr 2, 2025, 11:50 AM

#

nightwhisper is a thinker too

keen beacon Apr 2, 2025, 11:51 AM

#

alpine coral i think so (or something along those lines)

its defo gem 2.5 line idk which is which yet (at least not confident yet) tho

hardy pecan Apr 2, 2025, 11:52 AM

#

lol

keen beacon Apr 2, 2025, 12:20 PM

#

It should be 2.5 pro I think

#

I guess

#

Idk ur choice. I personally don't see any of them being better than 2.5 pro at python/etc

#

2.5 pro is a thinking model, has the most world knowledge out of all of them, etc

somber niche Apr 2, 2025, 12:40 PM

#

rigid widget You will see that not all of them are from the meta

Reason I say that is that they (the ones I've tested, at least) have very similar pros and cons. These models are notable for being usually decently intelligent (depending on the model) but not very knowledgeable. They all consistently fail a basic knowledge check question that Deepseek, OpenAI, Gemini, and Grok get without fail, and it really doesn't get anywhere close either.

#

That tends to be one of the most consistent downsides with Llama 3.* models, and these models have similar caveats, so I'm inclined to believe that a lot of the models that say they're Meta are Meta.

keen beacon Apr 2, 2025, 12:47 PM

#

stargazer is thinking

#

moonhowler isnt afaik

#

idk what moonhowler is tbh

#

its a 2.5 thinking model

#

it failed stuff 2.5 pro gets 100% of the time for me

#

so i think it's flash 2.5

keen beacon Apr 2, 2025, 12:48 PM

#

keen beacon idk what moonhowler is tbh

havent came across moonhowler that much

#

+1

#

should be out soon enough probably

#

it will be free on aistudio afaik

#

the website

#

i think its only for api use no?

#

from aistudio

#

the website will keep its current offering

#

i dont think so

#

if you mean 2.5 pro? its basically the same from what ive seen

#

random thing i like the night/star/space themes of the new gemini anon names tho

#

thats normal tho

#

i think on lmarena it cuts off thats why it doesnt feel like it degrades lol

#

if u wanna use 100k+ tokens u have to use it on in distrubtion ntasks they trained it on. summarization of long docs, etc. actually utilizing that context window on other things is a bad idea

#

i think

#

on the website i think yes

#

its like a huge flex for them

#

mhm

#

xai no. anthropic idk

#

grok 3 falls apart in multi turn. grok 3 thinking used qwq 32b preview traces in part of the process at least 💀

#

not sota but they can probably do good small models

#

i dont think amazon is a real competitor based on what they have right now

#

qwen too

#

maybe replace meta with qwen. we'll have to see how qwen 3 fares against llama 4

#

ya without that much compute they wouldnt have a shot tbh

#

u give that compute to mistral they become just as competitive or smthing

eager mica Apr 2, 2025, 1:07 PM

#

There's a five_cards vision model that claims to have been created by Meta AI right now, by the way (I tried regenerating, similar response).

keen beacon Apr 2, 2025, 1:15 PM

#

since they are doing zero intermediate products u cant really gauge them

#

idk what that is tbh

keen beacon Apr 2, 2025, 1:41 PM

#

nope

eager mica Apr 2, 2025, 2:13 PM

#

Not too good unfortunately, but better than claude-3-7-sonnet-20250219

lethal bloom Apr 2, 2025, 2:29 PM

#

Hello guys, where i can generate images on lmarena? I don't understand, i have found webdev version, but not for pictures.

eager mica Apr 2, 2025, 2:32 PM

#

keen beacon Apr 2, 2025, 2:37 PM

#

More arrows

brittle tiger Apr 2, 2025, 2:50 PM

#

https://x.com/mbalunovic/status/1907436704790651166?t=1b0PWx49q0SkGskVxiwYOQ&s=19

Mislav Balunović (@mbalunovic) on X

Big update to our MathArena USAMO evaluation: Gemini 2.5 Pro, which was released *the same day* as our benchmark, is the first model to achieve non-trivial amount of points (24.4%). The speed of progress is really mind-blowing.

eager mica Apr 2, 2025, 2:56 PM

#

https://huggingface.co/CohereForAI/c4ai-command-a-03-2025

CohereForAI/c4ai-command-a-03-2025 · Hugging Face

torn mantle Apr 2, 2025, 3:03 PM

#

yea its from cohere

keen beacon Apr 2, 2025, 3:05 PM

#

gemini 2.5 pro is like the best all rounder right now

#

claude 3.7 is my favorite instruct/non thinking model (for writing, etc)

#

yeah itd probably do excellent tbh

#

all the instruct tricks that anthropic does its just amazing

#

multi turn is amazing

#

sure

#

anthropic trains in diff format certain tools in post training i think. it does extremely well on ai coding ides or whatever

torn mantle Apr 2, 2025, 3:12 PM

#

nightwhisper

#

is probably gemini coder

brittle tiger Apr 2, 2025, 3:14 PM

#

At least for me 2.5 pro is better at not sounding like AI

brittle tiger Apr 2, 2025, 3:15 PM

#

torn mantle is probably gemini coder

Better at coding than 2.5 pro?