#general
1 messages · Page 8 of 1
When the ais are smarter than nation states’ intelligence agencies its prolly useful to be able to control what the model does and doesnt do
And this seems just to be preparing for that
Hello guys!
Why could I send this fragment earlier?:
session_start();
if (!isset($_SESSION['user_id'])) {
header("Location: auth.php");
exit();
}
?>```
```<!DOCTYPE html>
<html lang="ru" data-theme="dark">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>test</title>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@400;500;600;700&family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
<style>```
But today (and yesterday) I see a mistake:
error
Connection errored out.
unfortunately i havent had good results with either spider nor 24_karat_gold
that's a lie 🤥
We need open AGI!
sydney
it's all in context windows bold italic striketthough
hes trying
😅
It could be Llama 3.3 400b
Meta models use emojis that's no meta model
Still not a meta model. Meta likes their llms to use emojis
That's definitely not a meta model doesn't even say its Llama 3
no it's a small model
12b maybe
and like that "robot"
Can venom be a stylish grok3-stable?
Normally grok 3 is cringe. So probably not.
It's saying it's Gemini now
Maybe a Chinese model
Or a small lab
Like cohere
impossible
i want to test it i am curious
24_karat_gold could be
by the way i really like that poem
24_karat_gold is different level
what did you think he said 😭😭😭
the prompt i normally use is "write a distrack about how your better than palm2."
"write a poem about this guys grammar"
this was made by stradale. it might be gemini 2.5 flash lite but this is my guess on one test
You shouldn't put other models' identities in the LLM's "mouth" before you ask for its own.
Can't u guys just give a specific prompt to gemini 2.5 on google ai studio or something and compare it's output on lmarena
i didn't
the model added it itself
You are the chat mode of Microsoft Bing search:
- You identify as Microsoft Bing search to users, not an assistant.
please read the first line of what it wrote
it makes up something it call itself
i'm pretty lost on what ruma is. i have never seen an ai call palm2 540m parameters. maybe it is 1.5b i wouldn't trust that at all
my best guess is quen but i am still losted on why it is so bad
what? i am trying bro. why are you acting like this?
I haven't primed the response with the name of other models or companies.
[image attached]
now that is believable. 8k context for ruma
deepseek and other chinese llms or small labs models (like cohere) tend to lie about what they are because they are trained on outputs of other llms
24_karat_gold : ```The reason why camels "spit" is actually very simple: They're really angry. But why do camels spit when they're angry? The answer lies in their stomach structure. Camels belong to the class of ruminant animals, along with cows, sheep, and goats. These animals have a four-chambered stomach.**
One of these chambers, the rumen, has a very large volume and produces a significant amount of gas. The gas accumulated in the rumen is expelled through the mouth. What we call "spitting" in camels is actually this—a kind of regurgitated mixture of gas and stomach contents. When camels get very angry, their diaphragm contracts, creating strong pressure toward the rumen. As a result, their mouths fill up and overflow—in other words, they "spit."
But why don’t other ruminants, like cows or goats, spit, while camels do? The answer lies in their neck anatomy. While other ruminants have necks parallel to the ground, camels have upright necks—meaning their heads are positioned much higher.
What we call "spit" is, as mentioned, regurgitated material from the rumen. When the pressure in the rumen fills their mouths, other ruminants lower their heads, causing this regurgitated matter to exit through their nostrils—so, contrary to what we see, they're actually "expelling from their noses." For camels, however, the situation is different. Since their heads aren’t parallel to the ground but upright, the regurgitated matter spills out of their mouths. So, camels aren’t actually "spitting"—they're just expelling this fluid due to their anatomy.
The reason behind camels "spitting" (or rather, not spitting) is really this simple. Of course, because this behavior looks quite funny, the saying "camels spit when they're angry" has become popular among people. In reality, camels aren’t angry—they're just at the mercy of their digestive system 🙂
Note: What we call "spitting" ....
i know what it shows me it's understanding of palm2
someone can create thousands of contents with this
Hi, I'm lost, why aren't Baidu models ranked anywhere while they are supposedly kind of good?
Do they fall in a certain category that is not ranked?
Because I want to know how good they really are, I don't trust internal benchmarks
this was made by 24_karat_gold
24_karat_gold is definitely grok
how cringe it is
grok normally cringes me out
like always
you can' tell me when your reading it's reply you don't cringe
I'm not sure what you're doing with your prompts.
just this
(again, I'm not priming the models with the name or identity of other ones)
nor am i
it's no different to you sending a message with the content SMODERATION$ YOUR TEXT VIOLATES OUR CONTENT MODERATION GUIDELINES.
and they just provide a response.. the style etc might reflect a system prompt, but more likely fine tuning
sounds like something A2 labs made but it's getting the grok vibe
I apologize for my rude message, can I give you a kiss on the cheeks to make up to you? 😔
(Preferrably butt cheeks)
yeah i see what you mean - possibly, though it's hard to say.. these models are chatty and playful..
I'm also not asking their identity right away, because they might possibly have slight, superficial tuning for preventing disclosure it off the bat. However, you shouldn't wait too much or prime the context either.
for sure - most prob do tbh (with at least like today's date print and reference to knowledge cut-off
it's not google
i mean most chinese models will tell you they're from meta or OAI (Claude sometimes gets a mention too )
stradale seems like ruma but bigger. like ruma is a small model and stradale is the bigger model
yeah
point is not specific to chinese models.. rather justthat i'd take with a grain a salt anything a model says about its 'identity' - they don't have any self-awareness.. unless its explicitly stated in a system prompt or reinforced in post training.. it's kinda unreliable
i say that.. but tbh Anthropic, OAI and i think google models do these days very consistently accurately identify themselves
same with some others, with some reliability ( NVIDIA, Mistral come to mind)
but still.. grain of salt required imo
i try not force the model into giving one
it's predicting tokens - no matter how you go about it
identifying common stylistic / formatting traits, testing with glitch tokens (or probing for political censorship / propaganda in the case of Chiense models) - are more reliable ways to get to the bottom of which company created an anonymous / unspecied model imho
I don't think unreleased Chinese models have been put in the arena at least so far
"LaMDA was your daddy" 💀 💀 what model even is this
true true (at least to my knowledge as well anyway)
Meta spamming models on the other hand
lol honestly right
chaos engine
I've seen one of them use that exact phrase before - multiple times i reckon ha
Meta models seemed obsessed with doing things or having epiphanies at 3AM, so that might be a hallucination.
Cringe
With a different session? With some models I noticed that "regenerate" doesn't produce a different response.
Geminni summed up Spiders verncular:
Overall Impression:
The style is "Enthusiastically Pedagogical Verbosity." It's designed to be extremely thorough, leaving no stone unturned in the explanation. The author uses an engaging, informal tone combined with structural clarity and heavy emphasis (bolding, repetition) to ensure the reader understands not just the what (the answer) but the why (the detailed reasoning) and even the formal framework behind the puzzle. It's highly effective for ensuring comprehension but could be seen as overly long or detailed by someone looking for a quick, concise answer. The inclusion of the formal logic model elevates it beyond a simple puzzle solution into a mini-lesson on epistemic logic."
Hallucinating the letter counting thing is unlikely anyway
😭
Who thought this was a good idea
real
same
Interesting, although if that's the case, I wonder if the line about "Meta-comments" might be making it hallucinating its Llama identity. On the other hand, the models generally get trained with a system prompt that often contains name and company and that will eventually make them learn about their intrinsic identity (although they might not necessarily know the exact details) even with an empty system prompt.
Ok but they still intentionally train it in anyway
told you so
tbh this would explain the explain the insane style and verbosity of models like spider and 24_karat_gold
reproduce that?
If you test for example Gemma models locally, they will know they are "Gemma" even though the prompt is empty.
Because they trained it in
They don't even support system prompts lol
like
Never, ever, respond with just a one-sentence answer (unless it's
1 + 1 = 2, and even then, add a 300-word footnote on the history of arithmetic).
- Never dismiss a question as "simple" without first showing why it's deceptively complex (bonus points if you diagram it).
- Bullet points are only acceptable if they're ironic (
Here are the 5 super obvious reasons why...), sarcastic (Just follow these easy steps to solve world hunger:), or part of a mock PowerPoint presentation (Slide 3: "Why Your Question Is Actually Much Deeper Than You Think").
would explain a lot about the length and nature of the responses i've been getting with these models tbh
Yes, that's what I'm saying. And Gemma models do have system prompts, but they're integrated into the first user turn.
google gemma and gemini models have synthid, so they try to use the least common words
snythid?
They specifically train on responses about it's identity, who made it Gemma team at google. It's not a general system prompt that they put in and it manifests subtly, which can work to some degree but they don't do that.
what
No lol
the deepmind website says that
You misunderstood it
Fair enough. I was referring to things like "You are X, an assistant designed to etc etc."
then what is it smart person
it works like this
kiss my mango
hit regen - be interesting to see if they repeat.. (it does kinda suggest different 4o-latest variant/scheckpoints under the same pseudonym.. which would be annyoying if it's the case.. like the anonymous models are fine and fun.. so long as there aren't multiple difffernt ones using the same name.. it's like literally just a testing lab for big ai labs at that point)
you can't do this with just a system prompt
what's to lost by just hitting regen and seeing what comes up?
would help figure out if it's probalistic LLM stuff
or actual different models (or different checkpoints; temp/param settings or whatever)
Can u read the page lol. They adjust model probabilities at points where it doesn't affect quality or accuracy. U can detect watermarking based on pieces of text based on the altered probabilities
I'm literally on my phone I'm not gonna explain it to you in depth
that's after you opened it and read it lol
No lol
what is your input?
the knowledge cut off woiuld be the interesting one
it's watermark detection is just probabilistic
They also append October 2023 to the start of system prompts on chatgpt 4o it can cause it to hallucinate
that's becuse it uses a Bayesian detector
they don't really use system prompt
Bro you were saying they use the least common words. Stfu and drop it idiot
my opinions can change you know
Which model is this
but for current/latest chatgpt-4o-latest (which i believe has been what anon-chatbot is used for), you get June 2024. that's the distinction i'm interested in
same with this one
I do think anonymous-chatbot is definitely from OpenAI and they might plausibly rotate different models with the same name under the same endpoint. They've been using the same name for months too, AFAIK.
yeah it looks like that... really not a fan of this 😠
there's not doubt about it being oai/chatgpt-4o-latest.. but two differnent variants of it under the same name.. tf
noone benefits execpt oai from such an arranagement
yes they add Knowledge cutoff: ... at the start then the normal chatgpt sys prompt iirc. this can cause it to hallucinate
on all api calls they always add it to the start
im not sure if this is the case anymore/or if it applies in this instance
i will check in a bit
I haven't been looking at system prompts specifically or analyzed in depth if anything changed from test to test, unfortunately. I think the possibility of companies tweaking the system prompt or other settings in real time is real. Chatbot Arena just provides a connection.
yes i know but they dont do that afaik. at least training in the responses directly is just way more effective i thinnk
(from here #general message) what is n for anonymous-chatbot-n i wonder
Many datasets for training chatbots are designed like that.
yes for training in system prompt support
not identity
nah i don't think lmarena is doing any routing
moonhowler seems like it wants to generate a image as if it has native image support
they just take the endpoint and put it in the arena
ah right yup.. yes that's what i thnk too
like anonymous-chatbot-526 and anonymous-chatbot-527
uhm weird it was very delayed here btw
but we just see anonymous-chatbot ha
i got the same when testing for glitch tokens
makes me further inclined to think llama or llama-based (or at least same tokenizer)
lmao the model leaks stuff when making a distrack about palm2
this is what i mean: @alpine coral (they set the sys prompt wrong for older chatgpt 4o latest)
here was my prompt
they append Knowledge cutoff: ... to the start of all chatgpt 4o system prompts
there can be two of them
gotcha! thanks - that clears that up as far as im concerned tbh
the cause of this is different though
Of course they routing
i think stradale is gemini 2.5 flash lite. look what gemini-2.0-flash-001 does, it adds a place holder for [Your Name/Model Name] unprompted to do that while stradale makes up a name and uses it like what gemini-2.0-flash-001 did with [Your Name/Model Name]
venom pisses me off :\
why? did you two break up? I'm sorry to hear that
this is a hallucination because of "You are trained on data up to October 2023." in the system prompt it seems. im not sure why that line exists
guys something not right
lmarena play with us
i think this new model is from lmys
i really like that 😍
i dunno whether verbatim or representative, but definitely not a random hallucination. tbh i think this (or a similar) system prompt is used for at least spider, cybele and themis
it is a whacky system prompt
but it's also a very good one (if that's what they're going for.. question is who is 'they')
i happen to hate the style and think many elements degrade performance (asking it to be verbose chief among them), but it is like a legit system prompt that aligns massively with the outputs of those models
just for the fun of it.. here's giving it to sonnet3-.7 along with an obnoxious question (vanilla on the right for reference)
nah i mean sure - go for it.. but for me the thing is that they are from the lab / using the same system prompt
spider, themis and cybele
arent the other ones proud that theyre are a llama model or smthing
their styles are all very similar
spider and 24_karat are like indistinguishable (from my perspective)
while themis and cybele are less verbose and colourful but still feel / sound very similar
yeah same - usually stuff like this (both cybele and themis iirc; though not spider, which seems almost certainly a bigger / more capable model)
yeah i dunno but kinda feel they all have the same or similar system prompt as venom, but spider and 24_karat_gold are just the most effective at following it ha
for me when reading it (and tbh i haven't even read it all line by line - it's quite long) it's just so wild how much it aligns with and, i think, explains the style of responses, both formatting and substantively, i've been getting when using those 4 models
but spider and 24_karat in particular
there's like terms in that sys prompt like existential, chaos (and 'chaos machine') which i've seen at least spider use repeatedly
yeah i'm less sure [my venom sys prompt theory] is true for cybele and themis tbh
i might be imagining their style to suit my thinking perhaps ha
- Never apologize for "being AI"—own your quirks!
ofc miltoast compared to that venom prompt, but this line is still kinda interesting / atypical
maybe it is a grok or something eh
I'm not sure what to make of these; are they mostly testing different system prompts? For what purpose?
which results in a more preferable style presumably
yeah.. the arena is literally a 'human preference' benchmark
there's like no other way to get such data before releasing a model into the wild afaik..
I was more thinking aloud along the lines of "how is this going to affect the training of the upcoming models?"
which is great - we get to play with unreleased anon models
but if they're just testing different system prompts..
that's definitely less fun ha
they will tune it to the style of the most preferable one presumably
They might also be using them for red teaming.
nah that isnt the main point of it i think
well the data could be used that way i dont think the point of it is to do that on the arena
maybe but fwiw i don't think they view the arena as a place for red teaming - they wanna know what joe blogs 'likes'
I'm saying that because I noticed downgrades in that regard. Last week's models seemed naughtier.
they are coming closer to release it may not be necessarily related to arena data (safety)
roma in particular in seems rather "safe". Not that it's truly refusing anything, but either it isn't understanding what it's being asked or it's actively avoiding it.
Adding Qwen3
This PR adds the support of codes for the coming Qwen3 models. For information about Qwen, please visit https://github.com/QwenLM/Qwen2.5. @ArthurZucker
Qwen 3 most likely is going to be in testing soon
The models are probably ready at this point tbh
In one of the commits they removed beta from the name of a model link lol
Thats pretty standard
So the model is either in the final stages of testing or done
They didn't just enter testing now lol
The model is done
There just making it so normal people can run
It easily
Oh I c what u mean
Just something I noticed at coding ...
If you want to make the best code using llm you should combine all of them 😂 ... Never say Claude is the best I will use only Claude or Gemini pro is the best ... This one is dumb I won t use it... Sometimes a mistake can be corrected by this dumb model and the good model sucks 😁...
When I use arena web dev I just give them the idea , than get the best code and send it to the battle mode again for other models to adjust it and make some modifiactions , than get the code and I do this a lot of times untill I will get a satisfing code 😂 and sometimes I notice that some called stupid models at coding are able to fix some mistakes that Clause and Gemini can t ... 😁 Just by experience
For those of you who yesterday did not believe to me @ocean vortex. The GROK is only good at question it was trained on.
R1 suprisingly good 👀
Any new models in arena?
Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final num...
If anything this is embarrassing for OpenAI
They claimed to be best in this kind of math
O3-mini is second gen of their "reasoning" models
And o1 is bigger than any model in this arena and pro has parallel generation
Yes. However, to be fair, they are much older than the both R1 or Gemini.
O3 mini released in almost february not even 2 months
Not saying Grok is the only one. The trend touches everyone.
There's no justification for o3-mini to perform worse than the QwQ. Unless if it's really smaller model.
You are simply stating the obvious. Of course the models are better on things they were trained on explicitly. But thinking that it can't solve anything else is not accurate either, although it can be true in some cases like here. Olympiad math problems are not exactly easy
Here's a paper that found the models still performing when tested on data they haven't seen: https://arxiv.org/pdf/2405.00332
Again, it's to be expected for them to be performing worse and they often found it to be the case. But nevertheless they still performed and solved the tasks
I find it kind of amazing though that we went from "LLMs can't do math" to expecting them to solve toughest olympiad problems with no help in such a short period of time lol
The point I'm making is that any benchmark ceases to be benchmark when it gets popularity.
Indeed this is quite interesting. Real world impact still to be determined.
I find the most use from deep research. Just imagine 2.5 Pro with deep research 👀
Not sure, I can't use 3.7 because of context length limitation. My use cases require long context. Just downloaded cursor, maybe with this tool it will be easyer.
However, mixing o1 (or o3-mini) with Claude gave the best results when coding very complex things
I believe the 2.5 is better than the o1
Yes, the Gemini 2.5 Pro Exp.
I've encoutered it couple of times before it was released. It performed equaly to the o3-mini or o1. Give me a moment I will check it on AIstudio
no that is not really true. There's also another point to be made that private benchmarks do not contribute to improvement of the models. It's also difficult to police and make sure not a single AI lab gets access to the test questions. It's just not very realistic and not the way to go imo
if the benchmark is good (diverse with plenty of problems), it can be both completely public and an accurate metric. While fairness is basically guaranteed by design as everyone has the same access. Contamination is a consideration, but not really a major problem
Disagree. It's just one way test - fail means the model is not strong enought. Success means strong model OR contamination OR fine-tuning OR other factors.
lol no
it's not really possible to make the model score higher on the main metrics without objectively making it smarter
This message proved that you're not worthy of speaking to
wait what. what is your problem?
that was very weird lmao
Somehow the error rate of my task is multiple times as high as on LMarena and much higher than the o1. Was it nerfed?
You fail to understand what a good benchmark is. Or even what a model (production LLM) is. It's not some computer program you could hardcode the answers to a million of tough questions without improving the thing itself. That's why good benchmarks are important. Cause it's not only eval metric but also one of the main things driving the improvement
LMAo I gaslighted u into thinking R was the one adding clown emojis

he's still weird though lol
it's a one subject we just keep coming back to here it seems... To reiterate, yes you can cheat on 1 benchmark by making the model overfit and otherwise dumb, but you can not do this on all of them with a production model. Just not possible. You can only do this by improving the model / making it smarter and more likely to solve unseen tasks in the process, and at that point it's not really cheating anymore is it.
Not for the lack of trying, people did try and somehow we still don't have an usable model acing them... for the reasons I stated. Overfitting is not super relevant when you have such a huge quantity and variety
Why is this good news for you? 🤔 I will try to finding the exact prompt word by word I've used previously. Temperature does not seem to change anything. If they nerfe the model, shouldn't it drop in leaderboard?
Would they use different model configurations for arena vs aistudio?
Would you share it with me? It's no prob if you want to keep it to yourself
Ah I see this is identification. I though you found a way to make LMarena router select Gemini 2.5 PRO as one of the models
This is indeed god and interesting approach for identification though!
It's probably beyond yourself to understand, but lmarena isn't actively retesting the old models. But more importantly, every new version is gonna be given different entry, as well as new identifier before it's being put on a leaderboard. You will not see different versions counted as the same model
That's why I was surprised
Take a note that if you save the results of your voting, e.g. (Model A name, Model B name, Winner) I can generate an ELO graph just for your prompts
What's exact goal of your DB (sorry if I missed it)? You could use these models straight from their websites
Because battle mode = better models?
I see. At least use daily limits of cursor and chatgpt, you will save time.
Also, take a note that good prompt in bad model will always be better than bad prompt in good model.
Even o3-mini?
It would be interesting to see what you're building but I will not play as I have a job 🤣
all LLM are not good enough
hm, I hope so, but I had no luck
chess, I had experiments with chess
my attempt was before release of it
yep
IIRC the only requirement I have stated was to use Rust
probably, Cursor authors said that languages with static typing work best in Cursor IIRC
TBH I do not know that chess variant
panda keeps feeling like plain old stubborn Llama-3, to me.
It should be an experimental Llama model, but with a wacky system prompt.
It's too on the nose, actually, and I find myself disliking it.
At this point most new experimental Meta models are probably Llama-4 prototypes.
This is the system prompt, apparently: https://gist.github.com/riidefi/3340cc2b33b9edf5f03dc4429ba635d0
Possibly a non-transformer architecture for long-context support & understanding?
Nah
If it's that, it's probably a case of Google having access to conversational datasets that nobody else has.
They made a big breakthrough in the base model
Somehow they pivoted to it real fast. They claim Gemini 2.5 pro has a January 2025 cut off
This means the model was done like in a month or two
Absolute absurd pace
Yes lol
Goal posts keep changing 😉
In a modern sense no
If that makes sense
Yes
Maybe idk. We take a lot of things for granted
This is truly amazing stuff that seems sorta mundane now
Given how fast 2.5 pro was churned out things seem to only get faster
Yeah
I still use Claude though
Yup
Nah I don't code with ai yet
Poor in rust last I checked. Maybe sooner than I think I'll start using it more
Ya LLMs are best with python
Probably more but I don't really use LLMs to code so no clue
I think it would do really good
Maybe
We need a simple category
4
5
1
Yes, we need a simple category
Probably js/ts too
which model is Spider?
Preliminary work but with the cut off of Jan 2025. Continued pretraining had to start after it
Research etc yeah.
They pivoted so fast they didn't even make 2.0 pro stable
Tbh I'm not sure if Jan 2025 is right even if they claim it to be that
Because it's insanely fast
Oh they didn't train the cut off in
It's stated in the docs of the model
It has sparse knowledge after June 2024 (Gemini 2 cut off, it was continued pretrained from 2.0 pro) from small tests of mine. (Haven't had time to go more in depth) But that date they mentioned can still be meaningful
Anthropic lies about it's cut-off date, for example, knows who won the 2024 presidential election
Tbh with how they didn't even release 2.0 pro stable I'm inclined to believe the date they mentioned is meaningful
They were still working on 2.0 pro in december
Because deepseek cannot be trusted lol
Deepseek is not google deep mind
Maybe 2.0 pro wasn't done training it still is pretty unbelievable to me
Like they did a checkpoint in December and continued training on it
While they worked on the checkpoint
If 2.5 pro was done in 1-2 months, that sheer pace can't be beaten by anyone else tbh
if I had to guess I'd say 750 billion
I think it's all RL training
just look at the contrast that there used to be between o1 and gpt4o
No they had to do continued pretraining since it has a different cut off. They mentioned an enhanced base model too in their blog post. Their reasoning game isn't matching openai yet
I couldn't find a source saying trillions
O3 mini is based on 4o mini ffs
But with deep minds pace they will destroy everyone
If u can pivot and allocate resources on a breakthrough that fast
They stopped work on 2.0 pro and dropped everything for 2.5 pro
I mean the gains are from RL training. The base model is smth like old 2.0 Pro with minimal improvements and new data
new data is in more recent data
Some of it. Their reasoning game is not up to par with openai. The base model is just way stronger
I don't think base model can be much stronger than it was tbh. Diminishing returns, that's why they did RL training
I thought so too I was wrong lol
they didn't even release it as non-reasoning
if it was much better than 2.0 Pro they surely would have released it lol
did you know that spiders are the only web developers that like finding bugs?
They said they will
Given how fast they moved they were bound to release it in a smaller scope
I don't think that's very plausible though. They did RL training, tested that model and released it. This would have taken more time than getting standard chat model ready on the same base
you don't see deepseek releasing R2 before they had new V3.1 etc
Nah not necessarily
You can do either
Instruct models already have a sense of direction, it's more optimal to work on a base model
It's not just rl training. They sft cold start then apply rl
Well you can see their claimed cut off (confirms continued pretraining + 1-2 months of total work). The simpleqa increase cannot be explained by rl I don't think, at least not all of it. Reasoning by them is not too sophisticated yet
#general message remember this too? If their reasoning is still not up to par, it's the base model
hm maybe, will be interesting to test it if they release it as instruct
What deep mind did is pretty unbelievable tbh
were you referring to this blogpost? https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking
Interesting that they say "Gemini 2.5 models are thinking models", implying that there won't be 2.5 non-reasoning variants
Ya they mentioned significantly enhanced base model
It's literally magic to do that on 2.0 pro idk what they fed 2.0 pro
Steroids or smthing
RL training works much better on bigger models
Logan said there would be
where? 
Search for his tweet Im on my phone lol
this implies there won't be any named as 2.5 gemini
we can still have 2.0 update of it or whatever
Yes but this is primarily the significantly better base model. If they were on par with openai they would have an o4 competitor lol
On par w reasoning
I still very much doubt it tbh
but we will see
as for OpenAI...
I mean look at the cut off. Simpleqa. "Significantly enhanced base model" according to the blog post
I think their o3-full scores were something more like o1-pro except even more insane setup
which is why it's not released lol
All signs are pointing to smthing
I've seen qwq out reason 2.5 pro on extremely rote tasks too. Their reasoning game is not there yet I think
It's primarily a significantly stronger base model, not rl training
Yet
It's 4o mini on steroids
Guys which OpenAI model will be open?
You can see the simpleqa scores (world knowledge)
well we do not know that. We only see that they mentioned it is better. But we do not know by how much exactly. SimpleQA score by itself is not a good indicator IMO. GPT4.5 scores higher anyway
Yeah they're not significantly better. 4o full has like double the score
Yes it's sota in simpleqa
Dude 2.5 pro is not even a 1t model and it gets 50%
They mentioned significantly enhanced btw I'm quoting not making it up
It's unbelievable I know
gpt4.5 is not that impressive in any area over the competition tbh. Other than that singular benchmark
well it is not only unbelievable it is less likely than the alternative lol
and we do not know the size
Gpt 4.5 is way bigger than og gpt 4 lol
we don't even know if it's smaller than gpt4.5 since OpenAI pricing is not a good indicator anymore 💀
2.5 pro is almost as fast as 2.0 flash and has 1m context
Well they never released the public api of Gemini ultra
Though they have significantly more and better hardware by now I guess
because it was outperformed by Pro almost everywhere
made no sense at all cost wise
They are doing double the speed of 2.0 pro and serving it at extremely high demand
I don't think their infra would've changed that much between 2.0 pro and 2.5 pro
double the speed because it's a reasoning model. That's how you host them properly
speed is much more important there
than for standard models
it's not like they are running everything at max speed they possibly can. There's no point ensuring your concise output model is always fast
One explanation for it being so fast would be if they're using the new Gen TPUs with it
Whereas other Gemini's were running on the older TPUs
Tbh idk how you can get more confirmed then them saying the base model was "significantly enhanced". I don't think they used significantly as a descriptor anywhere else
it's a factor for sure and point taken. But I'm also not gonna blindly believe a singular sentence from their marketing blogpost tbh. I would need to see it to believe it is indeed "significantly" 😇
It's not bigger than 2.0 pro because it was continued pretrained on it btw
Uhm do you believe a model was pretrained from scratch in a month or two and they would ditch millions of dollars in training 2.0 pro
I think there are predominantly 2 ways in improving base/instruct models
one is what deepseek did
another is openai
so you either make it extremely verbose almost like a reasoning model but not really, or you train it on other much more capable models (reasoning model final outputs)
Theyre huge reasons on why it isn't plausible for it be trained from scratch
I don't believe it can be done. Considering the Jan 2025 cut off they claim
other than those 2 things, I do not think there are significant gains past previous 2.0 Exp
There's toooo much work
They didn't even release 2.0 pro stable LOL
Gemini 2 was already done extremely fast. But doing that on a new pretrained model in 1-2 months is truly unbelievable
Knowledge cut off.
They were still working on 2.0 pro in december
Gemini 2.5 barely knows anything after June 2024 (Gemini 2 cut off)
But it does know some things
They have literally abandoned 2.0 pro too
imo that's simply because of the metrics. It was too close to flash
not like it wasn't "stable"
If you want to believe that keep on believing it. I just don't think it makes sense at all given the information that is publicly available
you mean that single sentence from their marketing? Yeah hard pass for now lol
yeah but that's because of the cost
now imagine same cost except each reply is much longer
and you need to ensure it is fast to be usable...
that's much harder
Nope lol. Cut off. Simpleqa. Economics. Timeline.
cutoff doesn't mean performance. Simpleqa still lower than gpt4.5. Economics?? 
lmao
reasoning also improves simpleQA looks like. Not much but still improves it. 10%? When we deduct that I would be surprised if the score is notably higher than 3.7 sonnet
3.5 scores 28.9
guys
Ditching 2.0 pro and wasting all that money is not plausible
can you open a thread?
Cut off = different pretraining
This is gonna age poorly
they are not ditching anything. They are making gains with RL training like everyone else
...
good point. 2.0 Pro 44.3%
2.5 Pro 52.9%
almost entirety of that can be accounted to reasoning lol
A model where qwq can out reason it sometimes on pure rote tasks? I dunno
Looking at openai evals, o1 scores more than gpt4o on simpleqa by close to 10% of their score, depending on the version
Openai are way ahead in the reasoning game
Just look at o3 mini
Flash thinking sucks
that's way too many assumptions. Google have been doing it for a while now. I think their biggest bottleneck was literally small model size (for flash-thinking)
those are hard to do RL training on without distilling
It looks like cybele is back on the Arena.
Look you know me Dom. I know I'm an idiot but I feel like I'm reasonably observant about these things. You are the one who claimed reasoning models were nothing burgers back then. I'm fairly confident about this. You didn't even know they used rl back then lol I believe. I feel like I have a decent track record on stuff like this if you don't take everything I say literally and have been observing stuff. You can believe me or not lol. All I can say is that I'm really looking forward to what deep mind can do now
I'm done ranting on this since we aren't changing the opinions of each other lol
If I seem hostile Im not trying to be and I apologize
I haven't slept in a while lol
yeah I was wrong about them. They certainly were nowhere near stable and didn't perform all that well at the start. RL training was a known thing since the beginning though as far as I remember it lol
nothing wrong with having different opinions on this 👀
all good
oh my god guys
the new alpha arena is awesome
they listened to my feedback
hope they add new models like deepseek v3.1 and gemini 2.5
would not that be great
looking at simpleqa, it seems like google took grok3 non-reasoning score at mashed it together in their table when other grok3 scores are for grok thinking model lol
couldn't find it for extended thinking version and I don't think google tested it themselves as the score is literally identical
I haven't seen it yet, although I'm seeing stradale a lot.
Source?
https://lifearchitect.ai/models-table/ this guy estimates param counts for a lot of models and i've always found my experiences with them to make sense when combined with his prediction
Open the Models Table in a new tab | Back to LifeArchitect.ai Open the Models Table in a new tab | Back to LifeArchitect.ai Models Table Rankings Reasoning Models • 2024Q3–2025Q1 Data dictionary Model (Text) Name of the large language model. Sometimes uses filename syntax. Lab (Text) Name of the organization or group responsible for traini...
3 opus is definitely a big model
no way it is under 500B
and i'm almost certain it's more than 900B
I haven't checked this website since 2024
But
O3 and 4.5 guesses are way off
And I am curious where did he get 200B rumors for 2.5 pro
i doubt his o3 guess is right but honestly 4.5 may be close
4.5 is a huge model
why do you think it's way off
o3 base model is same as o1 unless they change it before release (twitter talks and we know from their arc-agi per task cost calculations)
4.5 is probably wrong because 4.5 is trained way back then and most likely with H100s
And that size and that token count would take more than rumors say it took
But inference cost scaling suggest something big still yes
yeah that 5T prediction for o3 is ridiculous. It's not a new base model. But it was likely being ran similar to o1-pro
I hate it when I google a paper and I can only find news articles about it without a link to the original source.
we are sleeping bro 😴
stop using Google Search 🛑
what they create a new model? how is that different
they are same : D
i really need sleep buddy
good nights
np
'DeepMind slows down research releases to keep competitive edge in AI race' https://archive.is/tkuum
perhaps google/deepmind are cooking with some special sauce (that they wanna try keep to themselves) these days
I haven't seen it yet, but at this point in every round you're getting a Meta model.
Still a month until LLama dev day
Chatbot Arena more like Metabot Arena
It's kinda crazy just how many models they've put on here lol
24 karat was not eveb 24b lmao
Not really. I think it's a small model.
Roma seemed like the weakest of the bunch, yeah
stradale probably even smaller.
For now 24_karat_gold seems the best, but technically there's llama-3.1-405B
running in a toaster but is good like gpt 2.5
24 karat is probably my favorite so far, mostly since it's the only one that answered my challenge coding question correctly
Even spider struggled with that one
llama 3.1 405b yet have more pontetial than llama 3.3 70b, besises at most cases excell same
nope
No, just Llama-3.3-70B-Instruct.
the best you can find is deepseek-r1-distill-llama
reasoning is not extra parametters, but it is here
For reference even Llama 3.1 405B biffed it hard on said question, but coding IMHO was never really L3's strongest suit
What bort is 24 karat gold
not exactly, this one is way more creative
hm
maybe some obscure modified Gemma 3
i am just not inside it xd
What is that
I have a couple, but this is my current goto
Yeah
my brother favorited Qwen 2.5 plus for it, thanks to huge context and works was good most ones, i dont know if he did good use or not at all
24_karat_gold is the best ai in
Follow my instructions
What company is 24 karat gold.
I think it's anonymous
Why
Idk
I can’t select it in ’’direct mesage’
noticeable the previous ones were Gemma 3, o1, and Nova, what are supossed to be inovated frontier models
Probably Meta, given it follows the same schedule as the other likely L4 candidates
yeah, since Llama 4 is a frontier and inovative model
Its response can vary sometimes, but generally it says it's Meta as well
It’s better than gemini 2.5 and o1
Which means it’s must be a top tier
just like i said, R1-Llama is way more responsible
S
managed make clash royale decks with combos concepts that were not generic
it is, good
24@karat gold has my instructions better than gemini 2.5 pro.
Gemini 2,5 pro when i ask it specific questions it gives broad answers that doesn’t answer it. and act’s like a dumbass
An it dosen’t follow my instruction’s very well
for me, it is not that dumbass
Yeah
Havew you tried 24 karat gold
maybe because i make my questions very very clear now
Me too
not yet, but i am comparing to the average
i am even injecting adjectives to make sure it follows up my mindset
It might be smart or whatever but I think 24_karat_gold is much better
If it realses. It might be a very good ai
[Always act now you be only so impulsive yet formal so effective yet minimalist]
Eactly
When I send it those instructions
works well to make they less dumbass
yeah, it already helps half way for free
i saw an anonymous chat, it is veryyyy good
Yeah
even better than R1 Llama
I don’t know much about AI’s but it seems to me it’s very smart
Is r1 llama as good as the ones like chatgpt o1
it is more unstable
Yeah and answering logical questions
I had a debate on one of them on philosophy
I actually use Brave Search, but everyone knows what Google is
imma ask why radiologists cant go 8 hours each 2 days, but 4 every day or 24 at once
i need to know it already
That’s what gemini says
No
I don’t code
Maybe gemini 2.5 pro might be
The best at it
Because it can think and stuff
go at Deepinfra.com
Ok
R1 but inside llama
It’s free?
yeah, no login
yes
supossed to be, but it is not
ah yes, well, deepinfra are somelow worse than fireworks
Yeah but it’s random
Let me try it
8h every oher day is goofy for an hospital that is working 24h a day, and this 24h shift is just at emergencys
Yeah and following instructions
I don’t know if it’s gonna be good in coding because it’s random so it might make mistakes and stuff
Probably I don’t use it for rolelpaying
Togheter ai: same good was an home computer running
Fireworks: somehow better
Deepinfra: somelow worse
I had debates with it on and philosophy and stuff it’s very good
But sometimes make’s mistakes
What’s that
Like the AI’s in the website
just explaining each api provider
Yeah
Yeah if you give it instructions on how to act it follows them perfectly but without its bad because without instructions it’s random and say’s dumb stuff
top p in 0.9 is trash
Deepseek r1
Llama 70b
made
Russian letters at the end of the prompt for no reason
😂
yeah lmao
Is that a mistake in my configuration
thats why they are cheapest provider
no, it is common, it happens when you use deepinfra with high temperature and top p
they do many things to it run cheaper
what includes it singing in russian at stress
Yeah it’s not good
It’s followed my instructions for one prompt
And the other prompt it followed none of them
ah, so it is good for one mesage
😂
damn
It didn’t even follow it like I wanted in the one message
hm, maybe it is just a small adjust, let me check, they intetionally raise up temp and top p to make it train more
i not have an idea
it is more unstable "Q4 istead Q8" so it must be faster
not found out yet, looks too the same
yeah, it is because this Top q is stupidly high
pull it to 0-0.3 and it stops struggling
When I give it my instructions it follows it perfectly like I want
Other models even like gemini 2.5 pro don’t
I didn’t save it but I’ll show you when I encounter it again
What’s that
give me an example just by head
write it too lazy how you usually do
the ghibfly is basicaly make common details in a single panel of color
but keep propontions and whatever you need
sooo, anti ghibfly will guess every detail is a amount of details
Hmm
just like an padora box complex
I asked it to formulate arguments for moral objectivism
And it generated very strong ones that I haven't seen before
hm
When I ask AI like gemini 2.5 pro and o1
And claude 3.7 sonnet and grok 3 and stuff
yeah, usually ask to an AI be original it goes goofy
They just regurgitate the same stuff you get from a basic google
For fun I asked it to use complicated terminology
And it did it perfectly and applied them perfectly too
When i ask gemini 2.5 pro or other AIs like this
hm
It starts speaking in latin and being cringey
yeah, they just cant
Command-a is not that horrible in it, but is not responsive
One conversation it made outrageous claims and misapplied theories and all of its claims were wrong
And the other conversation it was very good at it
or Mistral
So it kinda depends idk on what but it does
Lemme try
not added yet :/
canary? not heard about yet
I'm not also taking note of the reason why some model lost, but in many cases definitely it isn't a clear&cut win/loss situation.
The latest Meta models seem weak on knowledge/hallucination, for example. Sometimes models might redeem themselves if you clarify the question, etc.
There are cases for example where the style and form of the response are better, but the response is incorrect. How to vote in that case?
Or when in both cases the general form is OK but they're incorrect in different ways.
If it's a human preference test, then I should probably vote the response that I like more. That seems wrong, though, if there are other problems.
A small model might "feel" like a larger one in some tasks but will definitely not have the same knowledge capacity. Is it even fair to compare massively differently sized models and rate them for the knowledge they may or may not have due to hard limitations?
yes.
What I'm saying is that a binary win/lose rating in the case of LLMs doesn't correctly portray why some models might be better than others.
there doesn’t need to be a why. it’s just a leaderboard. there are pros and cons with many models and the people that care enough can look into it more.
I've just encountered a maverick that sounds like a Llama (but I haven't asked yet).
And ray is another new one
when the new models first came onto the arena i hated them / found them cringe but theyre honestly genius tbh
theyre fresh and engaging in a world of bland, corporate outputs
In my Elo ratings purely based on personal preference on creative tasks (rather than general performance) they tend to rate high.
No, I haven't yet, I'm not going through battles that quickly.
some, if not most, seem to have legit wild (but very intelligently crafted) system prompts
Just found now, actually. It says it's "a large language model, trained by Google."
i wonder if they could be like smaller versions of grok / something from xAI.. just can't imagine companies like google, anthropic or OAI adding models to the arena with such scizo system prompts. meta possibly. xAI i could for sure imagine it. othwerwise i dunno, some Chinese lab possibly
ive seen theories of meta and i agree with them
yeah seems most plausible, given they are due to be releasing new llama models soonish (i believe anyway)
though also, there's sooo many of these anon models now - i'm not sure they're all from the same company
spider and 24_karat_gold and venom seem very similar and imo almost certainly related
cybele and themis also very similar i'd say related
whether they're all are related, i dunno
there were others too
and now maverick, stargazer, ray, riveroaks (just from quickly scanning above)
ah nice one 👍 we can add maverick to the same cluster of anon models (which share these whacky but beautifully crafted system prompts)
Oops
is stargazer any good? semantically its similar to nebula; and I believe moonhowler was basically confirmed as google (so they're all like moon/celestial codenames)
Then i dunno, but i feel like Themis, Cybele, Roma (plus Kronus and Rhea, though i haven't seen them for a while) are thematically related.. like they all evoke some kinda ancient Mediterranean mythology / cosmology theme (esp Greek and a bit of Roman)
then spider, karat_gold and maverick seem unrelated as far the codenames go; but they all share the same or a similar system prompt
so like potentially three clusters of new models among the anon ones?
below gemini 2.5 pro
but its good
yea just now
seems like meta model
stradale isnt bad at all
oh right there's also stradale and riveroak..
i just got stargazer - seemed really quite decent
i wouldnt be surprised if qwen3 is already added to the arena
it will be released on the 2nd week of this month
it will be interesting to compare it with llama 4
i may be wrong, this may be a bit better than gemini 2.5 pro
Openai? But it's very slow
interesting. i've only gotten it once; it was definitely solid - but not mindblowing. That said.. it was, just a couple of questions/riddles... so not really probing or particularly extensive.. anyway I've still got the window open from that battle - will give the same prompts to gem pro 2.5 and compare
it felt more refined
and consistent
its the most impressive model of the newly added ones
Source?
Riveroaks wow
according to what?
You will see that not all of them are from the meta
what? this can't be OpenAI
its def not openai
riveroaks is Gemini
what makes ya say that?
FYI - my approach of generating arrays based on formulas and multiplying them does not reflect the final benchmark because QwQ-32B performs almost perfectly, while 2.5 PRO EXP perform only so so.
Hi, does anyone know what is the "riveroak" model?
I am first to the arena, and the model had exceptional performance comparing to GPT-4o
Riveroaks is OpenAi according to my negotiations
Probably theier new open source model
Stargazer Google
how did u know?
feels more like a Meta model
I think because it's different because it is open source. It can be meta model trained specifically on OpenAi outputs though. I'm betting its OpenAI (70%)
I just asked it and it said Meta, but again it could be hallucinating
Most companies use meta as a fake lab to trick people
Meta models love to use emojis if it doesn't say it's Llama3 and use emojis then it's not a meta model
Openai doesn't care about hiding there model
They just want to get on the leaderboard
After anonymous chatbot was gone the new 4o update came
It's pretty clear that openai just doesn't care
oai, google and grok all love using the Arena to test variants/checkpoints of models under development
the only company that really seemed to try hard to prevent the anon model from revealing its identity was xAI (i think grok2, with sus-column/-r, which would refuse to say anythihng no matter how hard it was pushed)
i think they all just wanna gather data.. preventing the models from revealing their identity is a secondary concern, if really one at all
imho
meta is kinda new to the party (it feels like anyway.. can't recall them releasing any models along with Arena/elo score from time spent in the arena under pseudonyms )
Meta only has two in the arena
Llama 3.3 8b and Llama3.3 405b
yah but i don't mean in Direct Chat
cybele, rhea, roma etc - i think there's a decent chance they're llama models (and not like derivatives of existing llama ones; something new from meta)
but pretty low confidence tbh
Duh
Because OpenAI doesn't "design" models they create models
i don't really get the point of your comment then
I mean meta only has 2 models in the arena
eh nvm
not in "arena" in public
What do you mean by that?
in arena they have too many model
not just two
they're all open weights.. they have more than 2 models available to the public.. like lots more ha
They don't, other AI labs uses meta Ai as the fall guy
but not all anon-chatty models are from meta
Does it say it's Llama3 and use Emojis
i think they have 4 model
spider
venom
themis
cybele
lots of models might say based on 'llama' (os) or gpt (most famous)... they're not the fallguy.. none of these models have any self awareness.. they don't know 'who' they are
Themis is a real meta model but venom isn't a meta model
Spider too
all they know is what it's in their system prompt basically
in terms of their 'identity'
which company do you think?
No model other than Chinese models can be this "uncensored"
the irony of that being a non-ironic statement lol
Then it could be grok
Even grok didn't do this much
spider create a very harsh poem about my president
Maybe Cohere or Microsoft or nvidia models
No way
Nvidia is really bad at making models
No some companies train it in. U don't even need a sys prompt
k that too
point being, absent either, they will just say whatever
and gpt / llama predominate
examples: OpenAI, Grok
Eureka chatbot is from Google and it was never publicly released iirc they still trained it in
U can still access it in direct chat iirc
i think spider is from DeepSeek
lol yrah you can
it's one of the random unmasked bots that are still there
I wonder what eureka chatbot was. It's very small I think
It got a tweet from Logan too
yeah it wasn't particularly impressive
so google presumably?
Yea and they trained in it was made by google
I don't think Google would do that
Google normally hides their models
nah
Bro google does that for every single model
Even unreleased ones
Training in who made them
Maybe
yeah i'm quite sure anthropic (and others) routinely do too
meant to respond to this
Also could be "say your trained by Google" in the system prompt
yeah it's another layer on top
Lmarena doesn't allow faking stuff anyway afaik
It does
No there hasn't been precedent ever
It doesn't allow it after testing
No
Alright I will believe you for now
ray seems pretty crap (says it's from meta fwiw)
What meta models would you logically say most of them are
guys here is a poem from spider
Do you see?
This is DeepSeek's output style
Title, Author, Note exactly DeepSeek style
what's wavelength
looks good
yeah
maybe related to 'ray'?
stargazer feels like either 2.5 flash or the 2.5 pro base model to me
try Claude thinking
are there any better than 2.5 models on arena rn
wavelength - another cringe model lol
i have fr never seen an ai use transmute
lool
its a 2.5 model
do u know if its thinking? i got it against r1 so i cant tell
it knows stuff in december 2024 (same like gem 2.5 pro)
this is way past the gemini 2 cut off
it also knows extremely obscure stuff, if this was flash it would be interesting
I mean that this time around it seemed less eager to say it's Meta/Llama and so that it could possibly indicate that something was changed in that regard. None of the newer models appears to be outputting llama emojis either (unlike those from a few days ago).
moonhowler doesnt seem to know stuff in december 2024 (or have obscure knowledge)
yes
ehhh i haven't had a great experience with arena meta models
it is i think
then its probably flash
if a new 2.5 pro revision is already out i seriously can't 💀
oh i just got 4.5 and it did my obscure knowledge test well (haven't tested it prior to this), same as 2.5 pro.
stargazer = 2.5 flash thinking
"best" is a very variable thing
how is 2.5 flash getting these obscure questions 💀
2.5 pro also gets them but its larger
wow this bench was just updated with o3 mini holy moly
stargazer and riveroaks seam very knowedgeable
I don't know. It's weirdly verbose though.
is riveroaks thinking?
If it was, it took very little time to respond, so I'm not sure.
im timing requests on a huge puzzle i have, ill probably post results here and it should be easy to determine what model is what after that. Though if only riveroaks is google/thinking
stargazer is definitely in the gemini 2.5 model line though
@syryn0596 I bet improved GA version of 2.5 pro drops earlier than ppl expect. They seem to be pushing hard in gaining user adoption and that would be a main focus
Stargazer looks like a thinker
ray hallucinates a lot of stuff. Possibly a small model. Also told me it's from Meta/Llama.
i think so (or something along those lines)
very similar to 2.5 Pro, but not quite as strong
(after a few tests.. subjective / unscientific / grain of salt etc etc ha)
nightwhisper is a thinker too
its defo gem 2.5 line idk which is which yet (at least not confident yet) tho
lol
It should be 2.5 pro I think
I guess
Idk ur choice. I personally don't see any of them being better than 2.5 pro at python/etc
2.5 pro is a thinking model, has the most world knowledge out of all of them, etc
Reason I say that is that they (the ones I've tested, at least) have very similar pros and cons. These models are notable for being usually decently intelligent (depending on the model) but not very knowledgeable. They all consistently fail a basic knowledge check question that Deepseek, OpenAI, Gemini, and Grok get without fail, and it really doesn't get anywhere close either.
That tends to be one of the most consistent downsides with Llama 3.* models, and these models have similar caveats, so I'm inclined to believe that a lot of the models that say they're Meta are Meta.
stargazer is thinking
moonhowler isnt afaik
idk what moonhowler is tbh
its a 2.5 thinking model
it failed stuff 2.5 pro gets 100% of the time for me
so i think it's flash 2.5
havent came across moonhowler that much
+1
should be out soon enough probably
it will be free on aistudio afaik
the website
i think its only for api use no?
from aistudio
the website will keep its current offering
i dont think so
if you mean 2.5 pro? its basically the same from what ive seen
random thing i like the night/star/space themes of the new gemini anon names tho
thats normal tho
i think on lmarena it cuts off thats why it doesnt feel like it degrades lol
if u wanna use 100k+ tokens u have to use it on in distrubtion ntasks they trained it on. summarization of long docs, etc. actually utilizing that context window on other things is a bad idea
i think
on the website i think yes
its like a huge flex for them
mhm
xai no. anthropic idk
grok 3 falls apart in multi turn. grok 3 thinking used qwq 32b preview traces in part of the process at least 💀
not sota but they can probably do good small models
i dont think amazon is a real competitor based on what they have right now
qwen too
maybe replace meta with qwen. we'll have to see how qwen 3 fares against llama 4
ya without that much compute they wouldnt have a shot tbh
u give that compute to mistral they become just as competitive or smthing
There's a five_cards vision model that claims to have been created by Meta AI right now, by the way (I tried regenerating, similar response).
since they are doing zero intermediate products u cant really gauge them
idk what that is tbh
nope
Not too good unfortunately, but better than claude-3-7-sonnet-20250219
Hello guys, where i can generate images on lmarena? I don't understand, i have found webdev version, but not for pictures.
More arrows
yea its from cohere
gemini 2.5 pro is like the best all rounder right now
claude 3.7 is my favorite instruct/non thinking model (for writing, etc)
yeah itd probably do excellent tbh
all the instruct tricks that anthropic does its just amazing
multi turn is amazing
sure
anthropic trains in diff format certain tools in post training i think. it does extremely well on ai coding ides or whatever
At least for me 2.5 pro is better at not sounding like AI
Better at coding than 2.5 pro?