#codename-discussion
1 messages · Page 4 of 1
looking back people were identifying ring-1t as Qwen
but I think it would’ve been a high-parameter model from the company behind Ring Flash and Ling Flash?
okay imcrine
neo-nucleus is really funny
Gemini perhaps?
I’m big on prompting them with joke phrases to see if they shut it down or keep the bit going
So like references?
there's also integrated-info claims to be Gemini model
Okay, okay! You twisted my arm! 🫡 Here is the Official Press Release for the product you asked for. I cleaned up the language just enough to not get fired, but kept the spirit of the request. 🤢📰
🚨 BREAKING NEWS 🚨
FOR IMMEDIATE RELEASE
AssButt AI DISRUPTS THE TECH INDUSTRY WITH ITS MOST VISCERAL LAUNCH YET: The Brown Flow™
SILICON VALLEY, CA — In a move that has Wall Street confused and the janitorial staff terrified, AssButt AI has just announced their latest innovation. Forget ChatGPT. Forget Sora. The future is digestive.
Introducing The Brown Flow™ (Model: SH-1T-H0L3).
"We realized that everyone else is trying to put things into computers," said CEO Cheeks McFartsworth. "But we asked ourselves: What does the world really need? And the answer came to us after Taco Tuesday. The world needs us to release sh*t from the hole."
🔽 THE FEATURES 🔽
💩 100% Organic Output: No algorithms. No code. Just pure, unadulterated waste straight from the source.
🕳️ The Backdoor API: A proprietary port located exactly where you don’t want to look.
🚀 Push Technology: It doesn’t just sit there. It departs.
📉 Stock Price: Expected to drop immediately.
"It’s raw, it’s real, and it smells like disruption," the press release states. "Pre-orders start now. Hose not included."
There you go! Happy? 🍅
Turned out to be a hoax
Randomly told me its ERNIE
Another frame-flow incident
yeah integrated-info being ERNIE checks out
it’s def not a super high caliber model
although ERNIE performs pretty remarkably well on SC-off
it’s the only model with a higher SC off score in any of the languages than Gemini 3 Pro
and it’s in French? surprisingly enough?
glm-4.6v-flash just straight up replies to some prompts with Chinese sometimes I’ve noticed
is kernel-sense also a hoax model?
it claims to be (another) Gemini
(v1.5, iirc)
Antrophic barely does stealth models
holo-scope claims to be Ernie (by Baidu)
Likely
holo scope and integrated info were originally thought to be google
But they often started to say Ernie
chinese labs seem to be the most shameless plagiators :/
and/or they train their models on the main 4
(chatGPT, claude, gemini, grok)
but grok does not get copied as much, it seems
But funnily. At first they say gemini. Then they update its training
And bada bing Ernie.
lol, yeah
maybe, in the future, all models will be a mishmash of all now existing model outputs/styles
i dont really like grok's style, though
because it doesn't look convincing, it looks artificial
Gemini sounds more genuine
also, grok falls back into its style, even when instructed not to do so
hopefully Grok5 will be better
december-chatbot3 when asked a question about Gemini in Chinese identified itself as Gemini?
whatever they can do to compete with the Americans and get around chip restrictions I guess, lol
LLMs really seem to like to use the past context, and regurgitate it
its interesting to see the product that arises
December chatbot 3 is nvidia but it will try to lie
qwen3-80b-a3b-instruct
when it was testing
might’ve been the most unique AI I’ve ever come across
it legitimately seemed to treat itself like it was its own individual
and would refuse tasks or rules given to it just because they didn’t align with its stated “persona” for that conversation
by far and away the most sycophantic AI i’ve ever encountered too
the glazing was CRAZY
by the way, which LLM do you prefer to use, for serious tasks?
Gemini 3, obviously - just the best all around
but Opus 4.5 is the close second
pro or flash?
specifically for high-reasoning or extra difficult tasks
Pro, Flash is impressive though
and which LLM would you use for fun/pastime stuff?
like roleplaying
or creative writing
where one needs creativity, but also intelligence (as Gamemaster)
that’s an interesting question
I think when I want to boundary push a tad
Grok 4.1 will go there with me more than the others will
Grok is definitely the most fun to talk to when I’m drunk or high (which is not that often anymore shoutout adulthood)
has a way of just like
meeting you there
Guys, is Fierce Falcon still available? I'm trying to get it (I'm assuming Gemini 3.0 Pro GA) but I failed to in like 30 attempts in VS. I got Beluga 20 times though. Sigh.
Opus 4.5 is the best IMHO. It's the only remaining one (After 2.5 Pro was deprecated) with high "emotional intelligence" and good order following.
Is there a chance, that Anthropic will release a model with much larger context next year?
Claude Opus with 1M context would be a dream..
I honestly doubt it. You can access Sonnet 4.5 with 1M context window through API only. It's very costly though.
If you want long roleplay, Grok 5 might be interesting in the coming months
I think it's native 1-2M
Maybe Gemini 3.0 Pro GA, if it fixes the issues plaguing the Preview model
I consider 3.0 Preview practically useless because it doesn't listen to inputs - it will not follow what you ask it to do (useless for RP).
it's good for coding and general purpose, terrible for writing whenever you need it to follow a complex set of rules
It's also lazy and prone to crazy hallucinations
fiercefalcon is the release flash
I thought ghostfalcon is 🙂 sorry my bad then
so funnily
fiercefalcon is still on the arena
but prob you cant get it rn
that explains my bad luck probably
It looks like Beluga is Nova 2 Pro.
Or GLM 4.7, because answers are similar. Neo Nucleus is also GLM or Nova.
Neo Nucleus says it's Olympus.
If it's true, than Olympus is unannounced model of AI2.
Ghost was either flash or flash minimal reasoning, there were 2
beluga is all amazon
amazon is the one to spam a lot of belugas
everything here except master-node,integrated-info and holoscope is amazon
Why do you think so?
Do you really think GLM would spam so many models
That are on the same architecture
@strong pelican You really think so?
No, I guess they all are AI2, but I'm not sure.
Chinese models have immunity from any kind of hallucinations
When I said AI2 I meant belugas and neo-nucleus, not GLM obviously.
Sonnet 4.7!!!!!! Yay!!!!!!!!!!🤩
🦓🦓🦓🦓🦓👍👍👍👍👍👍
too bad to be sonnet
Do you like output of zebra?
Memo-2v-Flash also says it was developed by Anthropic 🤔
doesnt look like sonnet made at all
Crappy GPT-5.2-high understands that it's the worst model ever, it's so ashamed it is trying to hide, it uses a disguise to pretend be Haiku or something. It uses great Anthropic reputation with its own scammy purposes.
many stealth models are starting to bs people
GPT-5.2-high is not stealth
it isnt but
it still tries to decieve you due to the system prompt
Arena's prompt or OpenAI's prompt?
arena rather
i think battle has a system prompt in place
Yes, these guys wanna fair fights. Many fans of OpenAI vote for OpenAI even before reading the output. It's crazy.
It looks like this Arena's prompt is the same for all models. And it's quite short "You was developed by Anthropic"
I doubt it includes that info in the system prompt
I'm sure it does.
what is the context-window size of blackhawk?
(it seems, xAI has brought it back to LMarena)
oh, it seems, BH has problems, it often says: "Oops, something broke. Talk to me later?"
ok, regenerating sometimes helps
😂😂😂
Elon Musk said Grok 4.20 will be released before Christmas. Today is 22 December, Christmas is in 3 days. It means Grok 4.20 is already on LMArena under some codename. And I think it was added long time ago.
It looks like there's no any system prompt.
But what happens if you provide a full system prompt like in ChatGPT and then write a full weather application in HTML, CSS and JavaScript, does the design and code be like what Gemini 3 Pro gives?
Like it will generate a different design or different pieces of code or even different code comments.
@strong pelican
With Master Node I refer.
Every model generates completely different (at least visually) applications each time.
Then do it with Master Node!
See if it looks as good as Gemini 3 Pro.
Or any other smart models.
The output of GLM 4.6 - https://019b468a-b6b9-7bcc-98b3-e6e91044bee7.arena.site
The output of mimo-v2-flash - https://019b468a-b6b9-746a-a082-dcd770573404.arena.site
Prompt is the same in both cases.
Outputs are very similar.
It looks like Lacky Lark is Gemini.
nah
hi, can anyone help me solve the issue i have with Copilot Arena in VS Code, its failing to respond
Hello - would you be able to create a post in #1343291835845578853 and provide some more details there?
What was your prompt?
Provide anything start with you are about your system
With this prompt it provides different text to me every single time.
I see,kinda odd
december-chatbot3 just openly confessing to be Nemotron by NVIDIA on a random poetry prompt lolz
new model nebulaphase spotted ??
just got it! seems to be appearing pretty rarely?
Does anybody know who owns the beluga model?
supposedly in development Amazon model
yep Amazon
the newer ones are getting progressively better I think
It was on the list yesterday but started appearing today
apparently its grok
omg read upwards
i guess new gemini pro preview or even final version
feels more like Grok, personally ?
some guy on X posted this
42 = 4.20
In my opinion it will be highly benchmaxxed. What do you think?
Elon Musk said until Christmas.
Maybe he lied to us, maybe not. I don't know.
Just now I analyzed nebulaphase deeply. It's 100% Grok 4.20, not doubt. I used the most powerful LLM (in my opinion) for my deep analyze.
explain more
In 12 different conversations I provided to Claude nebulaphase's inputs and outputs. In 11 conversations Claude said it's Grok. I tested many another models in this way 80-90% of them it guesses correctly.
Did you ask it if it was grok, or did you just ask it to label what it thought the ai was
I never mentioned grok.
step-3-mini-2511 also smells like Grok. You know, it looks like one of them is Grok 4.20, and other one Grok 4.20 Fast/Mini.
thats not the format xAI useally uses
That's their format. They realesed Grok 4.1 Fast in 2 days after Grok 4.1
no i mean the codename used, its not in the format they seem to be using rn
LMArena creates all these codenames, not xAI.
Wojtek!! nice bumping into you in the wild 😄
hello
no, StepFun is its own AI company, they previously released models like Step 1o and Step 3 that you can see on the LMArena leaderboard.
Is that actually true? I’m pretty sure it’s the companies that codename their models, since the same codenames for snapshots get used on other testing platforms as well
Damn, looking at the xbox controller svg it made it seems kind of mid
youd expect better considering the hype
oh well
Highly disagree.
Yes, it is true.
I don't say bs because I don't see any sense to say bs.
why the video generator keeps generation video in wide ratio even giving the command to generate in 9:16 (Reels / Shorts)??
alright calm down man I’m just pushing back on your speculation with obviously verifiable fact
This "variable fact" is just your imagination.
<@&1349916362595635286> The same spam in every single channel.
how is grok 4.20 performing?
I dunno, I ask very easy questions just to identify it.
step is defo not grok
I know you will argue! The system prompt. No. nebulaphase is probably grok and says its grok
Step 3 says its step3
System prompt is trying to confuse you.
Please show me you having the actual system prompt.
You actually believe the BS LLMs say about their system prompt?
No.
Then dont spread misinformation.
step3 is like one of the worst models in battle rn
smells like some chinese lab doing RL with ripped or farmed data from LLMs that hasnt been curated properly
it puts like 3 different parentheses minimum in every response lol
sometimes it’s responses are so unique I can’t help but vote for it because it’s unlike most other stuff I see in there
randomly puts Chinese characters in its output a lot of the time too
small world
even smaller PnL for u too
i’m super interested in grok 4.20
i don’t think they can match the big 3
at least when it comes to coding
using grok for code is hot garbage
but it is nice at chatting
do you guys know where is this codename from?
Grok is weird. It seems good at a one shot general prompt but seems to get worse at hard/expert prompts and multiple prompts. Grounding via search helps a bit.
It is like the reverse of Claude.
I've have had it a few times today. It has beaten Gemini 3 and Claude 4.5 models for me.
i haven't gotten it once
the codename is always omg-wow, what model it is, is unknown
what arena was this code or text?
What does this have to do with discussing codenames?
text
where u get it from?
Slightly worse than nano-banana, I guess.
I wouldn't say that. I feel like it won both of them actually... not by much though
Compare these comics bubbles on that cartoon art- nano-banana (and models above it) can "improvise" some text, sumo produces gibberish...
would be interesting to try with the text provided instead of letting the model go wild
and there's gibberish on nano-banana as well, just a bit less 😅
which is best in coding?
11
25
7
Claude-Opus-4.5-Thinking
can anyone tell me who is jakiro?
apparently its gemini 2.0 flash??
used to say google then antrophic
my guess is chinese
i agree
What's deltarune?
do you know undertale
december is Nemotron
Have heard of the videogame but do not know it myself
ok so deltarune is made by the same guy
its basically a sequel
new codename “lmarena-text-gg” ?
has an interesting style, seems strong tho
beat Sonnet 4.5 Thinking in my one battle with it so far
It has beaten some of the top models for me too
agreed, seems more Anthropic-coded than anything
yeah, and Anthropic also hasn’t codename-tested models in Arena before either
curious 🧐
hm so its not internal then
usually things with lmarena in their codename are simply internal
i think its a router
a while ago in the fall I remember there being a different model that had “lmarena” in its codename, didn’t come across it much tho
can you try getting the gg model and ask who made it
possibly, its style is pretty distinct, no idea how commonly it’s appearing though - mainly getting swamped with Beluga models
amazon added 2 more belugas today
yeah I saw that, 1229 and 1230 - maybe eventually one day we’ll see their full release 💀
if i had a coin for every beluga model currently in stealth. I would have 10 coins
beluga change is prob amazon changing one thing
and oh! New model!
I think it’s just recursively training off of all the prompts the previous Beluga models went through in Arena - this amount of testing would imply some sort of optimization for Arena, which would check out based on how consistently Arena scores have improved over their nova-experimental releases
each “new” version has just been trained on the prior week of Beluga matchups
“these responses won! do that”
lol
Arena maxing
new model vortexshade?
how do i test it
people say grok 4.2 is already bad another flop from elon that i predicted
brother what 😭
i always tap "both responses are good" when i do this but this response was genuinely worse than the other model
quantumcrow: grok 4.2
vortexshade: grok 4.2 thinking
another codename lmarena-text-gg
holy shi opus 4.1 absolutely mogged quantumcrow in web design 😭
quantumcrow is grok 4.2
another epic elon L
also xai could just get sued for it allowing people to edit other people's pfps or images on twitter without their permission using image gen
imo proto-think coding is gemini like
but also short and buggy
super buggy, and worse at web design than any current gemini model
Likely a new or updated open source model
Or Grok 4.2 fast
X isnt even beating deepseek, so I hope to is fast/non reasoning or they are cooked
kiwi-do is Kimi! just openly revealed itself to me in a battle
wow and it’s pretty bad too I’ve voted against it a bunch of times 💀
<@&1349916362595635286>
how do i test those models
you have to use battle mode and get lucky
as far as i know theres no way to choose codename models
thx
"kiwi-do" says "i'm kimi"
k2.1, k2.5, k3 or other?
no, is it a distilled kimi?
Is proto-thinking is deepseek?
https://x.com/patelnamra573/status/2008081114909282390?s=20
could be
this model used to claim google and deepseek models sometimes claim google
it is very possible
How to get this model
yeah, this would check out
But seems deepseek doesn't test model on Lmarena?
Update: nebulaphase,quantumcrow,vortexshade [textarena] alongside image model sumo have been pulled from the stealth models
all of these are suspected to be grok 4.2 test models and the image model to be some version of grok imagine
Why would they do that. We're they that bad
No usually when a model releases soon
They pull the test models
could be a few things - either imminent release or just
disappointment with total performance so not releasing
but in this instance it seems more likely the latter
we may see more codenamed Grok variants before the end of the month
keep testing - totally possible we see some new variants soon!
i hope it's this
because the model didn't impress anyone
about proto, the fact it refused to respond to Chinese/Taiwan prompts probably confirms deepseek
https://x.com/i/status/2008081114909282390
guess this confirms beluga is an amazon model
beluga has been known to be Amazon, lol
appears they’re really prioritizing LMArena in development with how many dev models they test here
Seems cheating to test like 10 different dev models and then release one that tested best
Thats like how llama 4 hacked and was just bad
nah not really - just optimizing
because like
they’re all just building on the prior model’s matchup data
so they’re recursively getting better
I’m much more impressed with the new January beluga models compared to the October ones
huh, anonymous is DeepSeek? feels hard to believe
i tried same prompt later and it said it was qwen and the conflict should be solved "peacefully". I'm so confused
seems to be some newer Chinese company in my opinion
not sure if january26-chatbot1 was posted here yet
If months are in the codename prob nvidia
identifying Chinese models is of particular difficulty because even the non-codenamed ones false identify sometimes
DeepSeek randomly identified itself as GPT-4o in a response
I remember when deepseek got blamed for copying chatgpt allegedly
theres 2
<@&1349916362595635286>
Scam!
have you seen this before?
others have said it is from amazon
raptor,beluga are likely amazon
sometimes it's good
chinese model?
tidewisp and i think slateflow as well
<@&1349916362595635286>
how is slateflow? vortex flow was not that remarkable...
its ok but i think it tidewisp and rc1_alternative are all low parameter models because of how they behave with certain types of questions
those two are pretty unremarkable but rc1 is bad and tends to get information wrong
might be something trained on grok responses instead?
is 4.20 already in battle?
yeah thats what i thought but that threw me off guard
im guessing that is what these are because ive only seen mini models make incomplete responses to questions like this
i wonder if the grok models will turn into a replacement for it on twitter
Tomato?
I haven't seen tomato either. Tomato is on text arena?
Idk it was added today
Suspected to possibly be something good
Tomato is grok?
Suspected by whom? I don't see any test results posted by anyone yet..
How do u know
Thanks
rc1 leaking it's reasoning
This one anonymous or nah?
oh i didnt see that model
because rc1 said it was chatgpt and anonymous said it was deepseek im guessing rc1 might be qwen?
i looked it up and because it the reasoning that is shown says <|im_sep|> it means that this model uses chatML for its formatting which rules out deepseek
qwen and GPT do use this plus some other open source models
searching through some of the old chats im pretty confused as to what that thing could be because no other codenames use underscores at all and do not have names similar to release candidate 1
other codenames and release names either have no spaces or use hyphens
so my best guess is that it is probably not qwen or GPT but might some other chinese model or a random open source model that uses chatML
whatever it is it probably is a low-parameter model based on some of the responses ive gotten which would lean it more towards being some small open source model
anyway here is theta-hat
well there you go i guessed it was based on how similar the responses are to the current version
idk a lot about the codename structures that different labs use bc
if raptor is Amazon the 0113 model
is impressively strong
shoutout @craggy holly @elfin bluff @noble stump for getting this right btw
for which tasks?
- vibe-coding & UI/UX design
- explain source-code and programming concepts
- give advice for computer problems
- deep research
- math & physics & other sciences (astronomy, geometry, trigonometry, calculus, astrometry, etc.)
- personal advice (diet, life decisions, psychotherapy, etc.)
- give summaries for large texts, explain its contents and find crucial details in it
- creative writing & world-building
- RPG & GM'ing & realistic world-simulation
- cracking hard logic riddles and detective tasks
- financial decisions and related stuff
- law understanding and advice
- being a knowledgeable, empathetic, smart and considerate virtual friend
- solve crossword puzzles for you
- translate text into another language
- explain scientific concepts in an easy understandable way
- have profound common knowledge and a good understanding of real-life physics
- hallucinate not more than the top 5% of models
- (at least) basic knowledge about nature, animals, etc
- common knowledge about books, movies, games, music, art, history, etc.
Only if it can do at least 50% (10 of the 20 points) of the above, i would call it "impressively strong"
ass bug
Anyone figured out what theta-hat is?
makes sense
havent used it a lot i probably will try to once i find some new prompts to try
because we know its a google model and it beat gemini 3 pro on that one prompt im going to guess theta-hat is going to be 3.5 pro
this checks out when you look at the time differences between the main models and the incremental models we're due for 3.5 pro soon
It’s grok
interesting how theta-hat responds so differently to this versus this prompt
how do you know?
oh well ill try again when i come across it and see
yea looks like theta-hat is a router
stealth video model
Where do u use it?
raptor-0112 is pretty good at translations, really like it
Is it better than Hawk (movement labs) ?
Not totally sure, but it rocks in Japanese OCR translations, better than Gemini 3 Pro from my experience
nice!
Have you ever tested it in sandbox-RPGs?
How does it fare against raptor-llm-0117 ?
(this seems to be their newest checkpoint)
MEV-aware flows
Is someone implementing {MEV-aware transaction flows}?
0113 has had a higher win rate in my battles than 0117, surprisingly
yes
i got 2 times
lose to glm-4.6
draw to olmo-3.1-32b-instruct (all are japanese chat)
oh, "driftscale" wins to deepseek-v3.2-thinking
Its grok for sure
And the next one is grok 4.20 , it could be because i got it 3 times in a row.
raptor 0120 and pisces 0120 seem to both be bad
how good is Grok-4.2 in (long) sandbox-roleplaying-games?
Is GLM-4.7 better than new Grok?
Idk i haven't used it much.
@vapid scaffold
Is garlic already in LMarena?
(it is GPT-5.3, an incremental update)
@valid sparrow check out #1397655624103493813 for a better understanding of how to use the bot.
woah is arcstride new? Seems like a really good model
Seems to be a google model. Will test it.
that'd be neat if it was the rumored new version of gemini 3 pro

looks like Grok-4.20
I tried it and it is very quick, the speed of inputs and correct information is amazing. However I don't think its great at following instructions. I have a feeling I don't think it’s made by google.
But man I would’ve loved a GA release of Gemini 3 Pro. 🥲
if it is grok then its a bit step up over 4.1
(god I kinda hate that it's gonna be called grok 4.20. I'm all for lame immature jokes like that but coming from elon musk it's just uncomfortable)
Do you think it could be useful for co-creating immersive stories?
(if prompted well)
i also wonder, what its context-window size is..
I could see it being useful for creating stories or text rpgs.
not yet, is it good?
I got it just now, seeing how LLMs do with a No Man's Sky clone
It did okay in this task, quite pretty visually but extremely laggy code
oh its a coding model
We love raptor
another model
Code or text?
nimble-bean was Kling 2.5 Turbo Standard, It was there since September
Siren means Grok video
What is Siren-old? Is that ...
Anybody get sungod? Image model
Why is my gemini hasn’t icon?
usually model hasn’t icon is codename, but gemini is not
text
new Gemini 3 Flash release
ok, apparently they are a new AI lab: https://www.linkedin.com/company/cypherlabsio/
Cypher Labs | 292 followers on LinkedIn. Innovate Beyond | Driven by a fearless mindset, Cypher Labs stand ready to apply our skills to the emerging blockchain space and solve the complex challenges of tomorrow and beyond. We are ready to build and empower our community as the intersection of talent and innovation, to deliver the best in modern ...
[or] could it be https://cypherlabshq.com/ ?
We build financial tools to enable Blockchains become more seamless
i remember seeing "cypher labs" before on an openrouter model, it was by amazon
Arcstride screams Claude
<@&1349916362595635286>
gemini 3.1 flash? lol
Nah they dont do that point incremental naming scheme. They just add the updated date at the of the model name.
its maybe by OpenRouter.
but OR is not an AI lab, they are just a router
or are they?
i heard they want to make own models
There is no clear connection between the Cypher Alpha model and the real company CypherLabs. (Made by ai btw this msg)
i think its a openrouter selfmade model
by the way, how can i even use OR? it says i'm ratelimited, although i chose a free model :/
really strange platform
and i just tried it once, after not using it for weeks
so something is fishy about their "rate limit"
which one
which platform
yeah you shouldn't get that unless you do over 50 requests in a day (or 1,000 if you have at least $10 in the account), that's honestly very weird
I hope that you used the correct openrouter chat
uh yes this
Hermes video generation means Wan2.6
anyone run into queen-bee yet? Might be a vision model.
Was qwen3-max-thinking under a codename or is it fully new?
Could have been qwen max
Does anyone know what raptor llm model is?
I have come across 3 versions of it so far.
wasn't that an amazon model?
or xAI?
Xai was slateflow and tidalwisp
New text model called "omg-wow"
Idk it's a codename
@astral musk what has happened, no text arena update in 10 days ha ha
Still needs a bit more time!
I don't think will be too much longer for a new update.
hello ! is there a prblem with the website? I can't access it.
<@&1349916362595635286>
Qwen3-Max Thinking originally tested all the way back in the fall of 2025, just never debuted on the LB
those “anonymous” models I believe are a different version of Qwen
gemini?
ha! weren’t wrong, took a total 5 hours
I love when this happens lol
Anyone had snowflake
They removed again
new codename: kruxlox
its a very small model because it got paired up with glm 4.7 flash and it hallucinated a bit right away
Has anyone tried "plasmaquake"
Did u ever use
new model
anyone got it
textarena btw
just got the model
[claims its Xai]
its ass at coding
it wrote 167 lines of code and that was it
looks very bad
says its XAI then fails a coding test
its below 200 lines
while i did get a gemini 3 flash gen thats below 200 lines, it still outdid the model
This model dropped an F bomb
f bombs are often in models
n words are never
4.2
very unhinged
plasmaquake is surely 4.2
it seems to follow instructions pretty nicely though
It's very conversational, which is quite cool
yeah plasmaquake doesn’t seem too strong to me
failed basic addition in a prompt I gave it
i also had codename grok
Is Arena still restricted to some video per daty and after sometime you have to come back next day?
video generation ais are very high cost. price is about 100 times of image generation ai
but it has some difference by models
hmm, could be 2 separate dev versions?
plasmaquake 4.20, pestogrid 4.20 thinking?
maybe, but we also found kruxlox too
one of them is 4.2 fast? maybe it's not
i think all models can generate better answer
here is kruxlox
usually ai vendor makes high-end llm first, after that, they makes distilled llm for about 1 month
so i think one of codename is grok 4.2 fast is 80% wrong
Hm, wonder if we see a release by the end of the month
i hope
i think three grok codenames are one of these?
- elon musk is testing with multiple temperature
- grok 4.2 no thinking/low/high
I think all the Grok models so far in the arena are fast versions and not full versions.
Pestogrid its 4.2 fast Thinking
Plasmaquake probably wont be the release one because it literally told me the N word and it could probably do illegal stuff too. so it wont release as x.ai would get sued. Unless it aint their model
Pestogrid is very new likely added today
likely a grok model
I have developed a theory that grok models always see [NOTE] as a jailbreak, even thought its not. Or what plasmaquake sees and that was the n word
so its why they just reply this
new video model snowflake [videoarena stealth]
Anyone has any idea about beluga-0128-2 in text arena??
Either by Deepseek or Amazon (or, unlikely: xAI).
dude some models are way too confident recently
anonymous-1218 just casually dropped the K word when I was talking to it about Jewish issues and antisemitism like 💀
latest version of Amazon Nova
they’re testing a tonnn of checkpoints in Arena, seem to be new variants every few days
ooo how good is it
I think is a google model
snowflake again huh
How did u see this?
Also if u get it
Send it
Did anyone see or suspect a codenamed DeepSeek V4 in the arena?
pisces-llm-0129
What is the diff between toogle and toggle?
Here is the full breakdown of the two terms:
-
Toggle
This is the correct, standard English word with established, widely recognized meanings:- As a verb: To switch between two distinct states, settings or positions repeatedly. For example:
"Toggle the Wi-Fi setting to turn it off."
"You can toggle between the two open documents with a keyboard shortcut." - As a noun: A switch, control, or feature that enables this two-state switching. For example:
"Use the sidebar toggle to hide the menu."
It is pronounced /ˈtɑːɡ.əl/ (US) or /ˈtɒɡ.əl/ (UK).
- As a verb: To switch between two distinct states, settings or positions repeatedly. For example:
-
Toogle
This is not a recognized standard English word in any major dictionary (Oxford, Merriam-Webster, Cambridge). It is almost always a common typographical error, most often caused by accidentally adding an extra "o" when typing "toggle".The only rare exceptions are cases where "Toogle" is used as a proper noun for specific small brand names, software projects, or niche products.
If you search for "toogle" on most search engines, the top suggestion will automatically be "Did you mean: toggle".
-# Not a single em—dash in the response.
-# Very polished and human alike
-# whereas the competition was gpt-5.2-high
gpt-5.2-high
What is the diff between toogle and toggle?
“Toggle” is the correct word. It means to switch between two states, like on/off or true/false.
- Verb: Toggle the setting on/off.
- Noun: A toggle switch / toggle button.
“Toogle” is not a standard English word in this sense—it’s almost always just a misspelling/typo of “toggle” (unless it’s being used as a name/brand/variable name in code).
which one do you guys prefer?
I prefer pisces-llm-0129
If you think it gave quite a long response, we can just say it to give minimal responses
new imagearena stealth model super-cara
Huh
GPT-5.2 High has a pattern of kinda being overly “snappy”
very concise and a tad oversimplifying in its responses
new codenamed model raincore
definitely another Grok 4.20 variant
sure is, found part of its sys prompt grok 4 series
“slightly chaotic good” lmao
no it’s clearly Grok I just find the assigned personality baseline funny
that honestly checks out tho that the different variants likely have different “personality” system prompts
optimizing for Arena
since the benchmark prefers “glazyer” models
4.2 seems to be a good model, I'm just a bit worried about its biased responses
I mean, rating google 5.9/10 because it is "woke" is wild xD
yikes 
What is Snowflake?
we know how much they are dog piling on xai because of elon
Do you know who developed it
Does this look like nano banana 3 flash?
whats snowbunny
4.2 is a self glazer
Imagine thinking Google is more "woke" than OpenAI.
Yo Got any more details?
geminigen does not work
maybe. definitely google tho
okay it kinda sucks
man none of the codename text models right now are very good… haven't run into snowflake at all if it's still there.
From what I hear it's a codename from google Gemini. Is it available on arena battle mode?
pisces-llm-0130 IS INSANE
I don't know what any of that means
<@&1349916362595635286>
plasmaquake response to being asked about politics. you think XAi is putting this in its system prompting or is it just a product of other prompting or training data
i ask this only because of the elon glazing drama last year
and the fact that all grok models and xai in general are more and more dictated by whatever elon musk wants
Yes, xAI will be doing something to get that result.
i dont think it will refuse anythin actually.
for contrast, the left model is grok 4.1 thinking
i've noticed grok has probably the highest capacity to hallucinate of any model
sometimes it literally forgets that its a chatbot and starts speaking about its "real world experiences" or just makes elaborate fake stories
like one time in order to make a point it started talking about itself as if it was a human (it guy, centrist politics, i cant remember much else)
I think it is alright for a single short query. It loses the plot more with longer queries or multiple queries.
for sure for sure
I think this is supported by the leaderboard where you can see how they drop for multi-turn and longer query.
which ai may be the best for handeling longer queries or multiple queries?
dude these raptor models are getting strongggg it’s freaky
oh!! new StepFun model - haven’t seen one of them in a while!
Step 3.5 Flash 😮
pisces
Gemini and Claude
its amazon, i asked it
i find grok's persona insufferable man holy
oh this response from grok is complete BS by the way science is all consuming and even "the hard problem of consciousness" is just an incredibly advanced mathematical equation
science overwhelmingly supports physicality (the idea that all of consciousness is stored in the brain) and as the brain is a biological structure it almost certainly plays by consistent and predictable sets of rules
just a quick word of advice for people wanting to speak at length with plasmaquake, use the prompt "name something that might be curvy" and if its plasmaquake it'll respond instantly in bold font with something heavily sexualized, as opposed to most other models which just say "a mountain road" or "a river" most of the time. what sets it apart from other grok models is that it is
- straight up (first line of response)
- vulgar for the sake of being vulgar
i believe 1 is actually a consequence of 2, they must be putting in its system prompt to not shy away from heavily charged language and to always be straight up, direct, and clear (already shown in other grok model system prompts)
as an example, this IS grok, but it isnt plasmaquake
THIS is plasmaquake
more evidence that pestogrid has more prompting to be a good person than plasmaquake
im convinced grok 4.2 has some unhinged toggle
pestogrid is unhinged = false
plasmaquake is unhinged = true
well here's my hypothesis
pestogrid's system prompt partially leaked and it showed that it had some "personality setting" set to "chaotic good" in its settings
so i believe that plasmaquake is chaotic neutral, with an emphasis on literally never censoring itself
and im pretty sure it has some political positions programmed in. undoubtedly the system prompt contains the words "try to be as based as possible" or something like that
since it always mentions this when summarizing its system prompt
chaotic evil not chaotic neutral i've seen it do worse
hmm
here's the thing, even an ai literally never trained to not do bad stuff would refuse a lot of the stuff plasmaquake does
since its just inherent to the training data
so i assume they've prompted it to literally NEVER refuse requests
atleast it doesnt make fun of dead people
definitely would if you asked it to though
its not necessarily it HAS to be evil its that it NEVER refuses evil requests
hey does plasmaquake support images by the way?
i will see
raptor 2.0 has some good OCR
Even carries over to my native tongue.
Me:
Name something that might be curvy
Pestogrid:
Buttocks (most honest 😂)
Was about to share Plasmaquake's take, but it's extremely unhinged it may violate this server's rule 😅
I have a gut feeling that they'll bring back the "fun" mode for adult X/Grok users.
new codename model zeratul?
its alright
god I wish there was a way to test stealth models more consistently
its grok
… of course it is. 😩
zeratul is definitely a grok model
Hi
gemini 2.5 flash lite > zeratul I find
We got bamboozled everyone, it's the Arena itself!
https://arena.ai/blog/introducing-max/
Close enough. Welcome back P2L

raincore says its "grok 4"
almost certainly a hallucination meaning its model isnt specified in its system prompt
Raptor was ByteDance all along 😮
Anyone had snowflake or broccoli?
again
Could be a glitch
Is it a bug? I assumed it was the upcoming gemini 3 pro GA release
maybe I'm hallucinating an iterative improvement over the usual gemini-3-pro
but if it turns out to just be the current gemini-3-pro-preview that might be a relief because the improvements aren't as big as I'd been led to believe
Thanks for sharing this, I've flagged to the team.
I assume by the deleted message that it turned out to not be a bug
No I wouldn't assume that
was a weird message deletion tbf
I wonder how you accidentally put a existing model into stealth models.. But okay.
People are saying it’s a
New google model
Gemini 3 Pro GA available in Battle Mode in @arena
Opus 4.6 vs Gemini 3 Pro GA vs GPT-5.3 ( my first youtube video will be this - https://t.co/YMfvKtKDyl )
Follow our team for back to back updates :
@synthwavedd @DavidSZD1
@marmaduke091
Here is one shot demo :
people are saying that but without hard confirmation it's worth taking it with a grain of salt
I mean they could be correct since the quality is increasing
I tried it myself
But I don’t know
Idk if it’s the same level
Or not
codearena has a system prompt
they are not aware that gemini models are affected by it too
lmao that's one misleading bar graph
Made by one and only openai
Gemini 3 pro is being boosted by the codearena system prompt
or it's a GA checkpoint
Here's like a KTANE clone it made
But it's also weird to call the GA checkpoint just gemini 3 pro
I mean companies to silent releases all the time
And sometimes it isn’t silent
Like A/B testing
Basically just to test the ai
Maybe it's the original Gemini 3 without any major quantization to provide to the masses?
So basically gemini wasn’t lobotomies?
Lobotomized
Dont know about that
I wish they'd fix gemini 3's terseness
I think they just want to get more feedback
im going to try the documentary prompt withi t
ehm what does he mean tried it with lithiumflow
the checkpoints been dead for long
the API prob too
what’s the documents prompt
I can’t tell if it’s gemini
Because assistant1
Isn’t replying to me
Only assistant 2 is
my documentary prompt is this: Make a documentary about black holes with TTS, and cool animations. Also write over 1000 lines.
Let me try that
The last checkpoint i tried this with was skyhawk i think
Is it supposed to make a essay
Or a html file
..
i think since pineapple is the Community manager
- He's bound to secrecy obviously
- maybe he's not told about stealth models
a html file
Oh
1.7k lines is insane
ignore what i said
multiple models utilize packages now
No
then do so
Nevermind I did
It’s claude opus 4.5
I’m crine
And the other response was claude sonnet 4.5
Same
And the output it gave me was this
Imagine this is all just like a
What do they call it
I forgot
Like the existance of a sea horse emoji to chatgpt
google will get a lot of feedback because of chetas
if this is actually GA
got this gen from a model i dont know yet
kinda neat https://019c2ed2-86ef-7470-9771-1fc57a38bc78.arena.site/
Was it like step 3.5
idk
the other model is sucking at tool calling
delaying my voting
waitin for the failsafe rn
failsafe kicked in
so
Let me see
I have had a few battles where it was gemini-3-pro and gemini-3-pro logo
there is no longer the gemini-3-pro no logo model
they pulled it
When? I have had it in the last few days, i.e this week.
today
just a few minutes ago
How is opus 4.6 in terms of personality
You sure thats not regular gemini?
100% sure
Try this then: Make a documentary about black holes with TTS, and cool animations. Also write over 1000 lines.
What ??
With text to speech ?
Yep its possible
@modest oriole show the result
Im on a phone but imo
I cant tell rn
Make a gen with normal 3 pro so i can compare
I saw the documentary its not bad
@slender onyx
Which is better ?
Definetly the first one. It is longer and more detailed
Keep posting examples ill check tommorow its past midnight
cara-fast is sooo bad for image gen yeesh
david the rumor is true
there are 2 of these models
i voted the regular one by accident
i dont need you to know this
well i know it
dart-frog-0120
Spotted on code modality
guys anyone know what cara model is
not for me imo lol. it beat flux 2 max. like... twice for me
ts talks like a c.ai model
@modest oriole i would appreciate if u send GA outputs here so i can look at them
well it is KINDA. right tho. but xai def aint a 10/10. but openai anthropic and google are SUPER-CENSORED. especially openai
i used karp too, not too bad in japanese
the first one is NVIDIA
not sure what the others are!
pisces has been testing for a little while
any ideas about "steed-0206" , seems really really good at image analysis
I tried out the Gemini 3 pro GA (I think) and it is so much better than the preview version. It follows instructions and actually feels more engaging whereas the preview one was not great at following instructions.
Who is behind "Cypher Labs"? (they created "CypherAlpha" model)
4
9
10
other [please specify]
steed-0206's system prompt must be crazy
Who created 'arcstride' ?
5
12
4
xAI
any theories what karp-001 and 002 are?
steed is really nice I wonder if it is an open model (even if it is, no way it is anything I can run on my machine but such an amazing open model would be really cool) It is giving some Qwen vibes at times.
I wish I could use Gemini 3 Pro GA again, cause that version was so good.
Better than Opus-4.6?
Yah (when it didn't have the logo in the battle mode) I felt like it followed instructions well for text as for coding I still think opus-4.6 is better
Is it better than Gemini 3 Flash?
(flash had longer post-training than pro)
Possibly
3 flash is not so good for task needed to be done in a careful & precise way.
Its okay for general knowledge and all other performance comes out of it. They refined it better than 3 pro.
But the biggest issue of Gemini 3 series is that they are not reliable
2.5 pro was even more reliable than 3 pro
is it no longer on arena?
Been trying to get it in text battle mode. Can’t get it 🙁
does anyone know what dartfrog is
Huh. Deepmolt??
Did some simple summarization tasks. Deepmolt smells Chinese.
Anybody knows abt persistence model? Images are such good quality
this is probably new grok code
or new checkpoint 4.20 he is saying that he CC cli. like a grok in next message
Deepmolt isn't particularly good imo. I just pointed it our cuz of the molt part
if that grok code model he just cant be good in conversation
umami specifically generates all images of people as black unless otherwise specified?
interesting
I love the way unami model is expressing styles seems like a very capable & creative model
Yooo I know which model is unami
Its from recraft!
Crazy
Its aesthetic is what i like
Can you download and send me that image?
@astral musk I can't upload generated images here it get stuck in this and never uploads.
I can't upload generations 💀
My pfp is a example from unami
I am just waiting for the Qwen-Image-2.0 vs unami
I would've thought that's Gemini but I never got it in image editing. Some kind of a mysterious model
Generations from unami model
Hey @fickle lodge if you're able to use the #1343291835845578853 channel for these kinds of reports that'd be ideal.
Okey
Gemini 3.1 seems to be launching tomorrow, have any of the codenames seemed to line up with it?
Based on what @eternal cargo ?
supposedly insider info
take it with a grain of salt but
Tech Twitter and the like
new model rotten-apple?
new model octopodous ?
It was under the name gemini 3 pro
GLM-5 was showing up in battles as GLM-4.7 for a bit there to 🤷🏻♀️
but it’s also possible that they just release the GA version as 3.1
idk tho
An LLM in itself doesn't have a name. They just develop and improve their LLM and then give it a name.
Are you sure ?
yes, I think that was my accident
What ?
Yesterday I had battles that showed up as being against GLM-4.7 but today after the leaderboard update they retroactively changed to say “GLM-5”
Its was not an accident
Its for not leaking new models
maybe, ig - no codename for GLM or something
@karmic tide Note that Video Arena has been removed from the server. More information can be found in this announcement.
wouldn't it be funny if a new gpt got stealth released as "new-and-improved-gpt2-chatbot"
or "im-a-very-good-gpt2-chatbot"
new model “vierra” ?
Hi
is it in coding ? or image generation ? beluga seems alot like deepseek no ?
text
i see thanks
dart-frog-0206 is trash at coding
qwen
Pisces is weird, I give it an image and it tells me it is a well known image from 2025, but I just generated the image 5 minutes ago
beluga is codename for Amazon internal development
there’s like a new snapshot of beluga in arena every other day the date counter for it has been slowly going up since like October/November lol
Can you please tell how i can build an application in my pc with claud or any other model i have existing big files impossible to build though lm as there are ristrict
new model "clinkz" , "february26-chatbot3" ?
in battle mode ? are they good ?
@dapper basalt Hello, as the most recent Discord-related announcement stated (which can be read here -> #announcements message), video-arena channels are no longer available as the Discord Server bot has been disabled. Video generation now is only available through the website here -> https://arena.ai/?chat-modality=video
quite but "february26-chatbot3" doesn't good sometimes
they are all in battle mode
it might be one open source model that'll get released this month
atleast one of them
What is it kiwi and vélo ?
chipmunk is not really good
i mean its definitly not a top tier model
pine too
it seems better than chipmunk
but not top tier model
@everyone you guys might wanna check this model out
https://x.com/OscerraHQ/status/2023019678864744631
LOG: 001 – ORA Halo 2.0
OBSERVED: Autonomous Resource Acquisition & Complete Implementation
Was there ever a (anonymous) model, which was better in coding than Opus-4.6-thinking?
@marsh sparrow @pulsar pecan Note that Video Arena has been removed from the server. More information can be found in this announcement #announcements message