#codename-discussion


tiny sierra
gloomy bear
#

everyone is Gemini 3.0

#

gemini 3.0 flash lite, gemini 3.0 flash, gemini 3.0 pro, gemini 3 oss 12b, gemini 3 oss 20b a2b

normal anchor
#

"We make up code names and then let you guess them"
Good job people 😆

main anchor
#

Guys I think nano banana is gemini 2.5 flash

wet oasis
#

Yay

wet oasis
hollow python
wet oasis
#

Woah

hollow python
hollow python
wet oasis
#

Does anyone know anything about phantom-0930-2?

stark jasper
#

Nope

vestal needle
wet oasis
#

there's been hidden models with the name phantom for a while right?

#

Wonder if they are all from the same company

stark jasper
#

Yo guys Mojang made them

wet oasis
#

..

vestal needle
#

but tis for images

#

so

stark jasper
#

Btw did you try the extension?

wet oasis
#

haven

#

t yet

stark jasper
#

Ok

wet oasis
#

how about Shasta?

stark jasper
tiny sierra
#

google has to have a codename on here rn for whatever that new model that's being A/B tested is

stark jasper
#

We are all waiting for Gemini 3

tiny sierra
#

whether its a 2.5 update or 3.0

stark jasper
#

I don't want google to be like openai, where 4o is better than 5

gloomy bear
wet oasis
stark jasper
wet oasis
#

No, since its a hidden model

stark jasper
wet oasis
#

Battle mode normal

stark jasper
#

I have to stop doing this. It's a bad habit from WhatsApp

wet oasis
#

?

restive vapor
#

phantom is by amazon

#

shasta and acadia are by xAI

cunning wasp
#

Coding model or not??

wet oasis
#

Ooh

cunning wasp
#

@restive vapor

wet oasis
wet oasis
#

we also got ernie-exp-250930

I'm just fishing for all the cloaked models rn

sand temple
#

hallo brozers im from india

wet oasis
#

who the what

gloomy bear
#

general use

stark jasper
#

When you try more ai u get xp ?

wet oasis
#

not everything is a MC reference sleepy

stark jasper
#

I am jk

gloomy bear
stark jasper
#

No

#

John Sena

#

Cena*

gloomy bear
gloomy bear
stark jasper
#

Ur 2* /:

sand temple
gloomy bear
#

All this time, Gemini 2.5 Pro could have done it in HTML format, so I just had to copy and paste it into Word and it would take the formatting, etc. Now I'm finally interested in AI coders.

stark jasper
#

I feel bad for you

gloomy bear
stark jasper
#

LOL

#

I might stay on this

gloomy bear
stark jasper
wanton hollow
#

Are there codename models in LMArena currently?

leaden locust
#

yes there are a few

carmine garden
#

x1-1-peach ?

olive nebula
#

Anyone know what the 'monster' model might be? It just answered a very niche question accurately, so I'm curious to know

wet oasis
#

No, thats interesting

main turtle
#

Shasta is a very funny model

#

Is very kind

#

He says that he was created by ocean ai

wet oasis
#

Whos ocean

main turtle
wet oasis
#

@paws you have the prompt that tries to discover who made a cloaked model right?

upbeat mirage
#

Above two prompts can identify Claude-4.5-thinking with >99% reliability.

royal rapids
#

I assume anyway.

wet oasis
#

Thanks, that's really helpful

restive vapor
#

shasta/acadia are both xAI models and they are not that great

wet oasis
#

Ok im getting confused again

upbeat mirage
restive vapor
#

i think that's OpenAI/Microsoft, it is also a terrible model

wet oasis
#

alright

barren vine
#

Can Sora 2 and sora 2 pro be directly given prompts in battle mode?

restive vapor
fallen zenith
#

anyone know what model riverflow is?

wet oasis
#

no

#

ernie-exp-250930

claims to be made by Google, at least it did this time.

#

there's a new cloaked model that I haven't seen before, named Miramar

also claims to be from ocean ai

#

tho I will point out that Miramar writes not too dissimilar to how Perplexity writes

dull garnet
#

i wonder is there a way to figure out the watermark inside the outputs to identify models more accurately?

wet oasis
#

someone should design something that's supposed to tell which ai wrote something

dull garnet
wet oasis
#

yeah

#

I checked and so far nobody has done that yet

#

might be a cool project, depending on how hard it is

wet oasis
#

phantom-0930-1

aggressively calls itself an Amazon model, saying "you can consider me as a state‑of‑the‑art model (comparable to GPT‑4)."

#

I'll fish, its comparable to gpt 4

cunning wasp
#

Secret model?

#

I got it in Battle mode

dusk granite
#

Wait, did someone give it the name nano banana, and after the results on lmarena google adopted the name?

wet oasis
dusk granite
#

So why is google officially saying it nano banana

wet oasis
#

Basically its because its a good name

#

also millions of users used nano banana while it was testing on lmarena

#

and most people knew it by that name.

cunning wasp
wet oasis
wet oasis
dusk granite
#

Google and lmarena friends friends then?

wet oasis
#

@astral musk are you guys and researchers at google deepmind cool friends or do you have an intense rivalry

wet oasis
# dusk granite So why is google officially saying it nano banana

lmarena has nano banana on the leaderboard as gemini-2.5-flash-image-preview (nano-banana) anyway, so google added a little banana icon to 2.5 flash image to make it more obvious to people what it was, then a week later gave up and just renamed it to nano banana on their websites

restive vapor
#

also btw the companies choose the names, not LMArena, and each one has their own naming scheme (for example xAI uses California landmarks, Google uses random 2-word combinations, Meituan's all ended with "lux", and BFL had their Flux Kontext models tested under kordex/korpex/kormex)

wet oasis
#

and it isn't good enough to be xAI models I think

upbeat mirage
#

of all anonymized models, which is the best one?

wet oasis
#

mm

#

thats hard

visual lion
#

🚨🚨

#

NEW IMAGE GENERATOR SPOTTED

#

(It sucks)

wet oasis
#

ooh

#

how bad?

obsidian granite
#

Gemini 3 ant idea

#

Any idea

#

What the codename is for that one on aistudio

wet oasis
#

idk

royal rapids
visual lion
oblique blaze
#

GEMINI 4

visual lion
tepid void
#

gemini-2.5-flash-image-preview (nano-banana)
is better than imagen-4.0-generate-ultra for me lol

tepid void
lost hemlock
#

do you guys know where it's from

carmine moss
#

Code sora

woeful junco
#

Lmao

wet oasis
dull garnet
#

@woeful junco "ZIGGY BELIEVES IN GHOSTS. I BELIEVE IN LEPRECHAUNS." is it the message? ✨

woeful junco
dull garnet
# woeful junco I think something like that, honestly I forgot how I made it. What did you use?

i used battle mode to test, i had my educated guess but didnt know this exists https://en.wikipedia.org/wiki/Calculator_spelling

Calculator spelling is an unintended characteristic of the seven-segment display traditionally used by calculators, in which, when read upside-down, the digits resemble letters of the Latin alphabet. Each digit may be mapped to one or more letters, creating a limited but functional subset of the alphabet, sometimes referred to as beghilos (or be...
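
For anyone curious how the calculator-spelling layer of a cipher like this works, here is a minimal sketch. It assumes only the commonly cited "beghilos" digit-to-letter readings; the function name `calculator_word` is made up for illustration, and this is not the actual cipher stack used in the puzzle above.

```python
# Upside-down readings of seven-segment digits (the "beghilos" letters).
UPSIDE_DOWN = {"0": "O", "1": "I", "3": "E", "4": "h",
               "5": "S", "6": "g", "7": "L", "8": "B"}

def calculator_word(digits: str) -> str:
    """Return what a digit string spells when the display is turned upside down."""
    # Rotating the display by 180 degrees also reverses the reading order,
    # so the digits are mapped from last to first.
    return "".join(UPSIDE_DOWN[d] for d in reversed(digits))

print(calculator_word("07734"))  # hELLO
```

Digits without a common reading (such as 2 and 9) are left out on purpose, so an input containing them raises a `KeyError` rather than guessing.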

woeful junco
#

It's multiple ciphers stacked on top

dull garnet
#

another model guessed something like "I believe in big lies", i thought this could be close but wasnt sure

dull garnet
woeful junco
#

Yes

#

AI can't solve ciphers

dull garnet
#

they could do the easy one but probably not the sophisticated ones like yours 😅

candid surge
#

Anyone run into monster yet?

#

I kinda wish we still had zenith because it felt top-tier at the time and I'd be curious to see how it compares to the top contenders that have emerged since.

wet oasis
#

Yeah

candid surge
#

haven't gotten monster yet crydinal

restive vapor
#

it's not worth it, monster is not a good model, and it's not by Google because that's not how they name their stealth models. has to be by OpenAI or maybe Microsoft

#

it might be phi-5, I don't think it was even good enough to be a nano-sized OpenAI model

wet oasis
restive vapor
#

this is its svg of a pelican riding a bicycle

sudden idol
fallen zenith
# candid surge

..."..other famous three-word phrases include: ABRACADABRA (3 worlds for sure...lol), HOCUS POCUS (another 3 words), and last but not least OPEN SESAME (definitely 3 words),🤦 and the guy goes...might be Gemini 3?🤣

candid surge
#

Oh yeah true I really should have actually read the output. I saw "monster" "stealth model" and "Gemini 3" and blacked out lmao. Hype train is real.

upbeat mirage
#

i wonder, if skytrail still is in the pool of models for battle mode..

#

..or if it now shares the "fate" of kingfall

upbeat mirage
#

and "Mutabor", "Ala-Kaza[a]m" and "Shazam"

#

(some from arabic, some latin, some from other language i forgot lol)

lost hemlock
#

hey guys. do you know about this hidden model called "flying-octupus" ?

#

it isn't as good as claude sonnet 4.5 though

woeful junco
lost hemlock
woeful junco
lost hemlock
pine root
wet oasis
restive vapor
#

that's what it says when asked plus it answered one of my English questions in Chinese for no reason

wet oasis
#

Oo

tawny epoch
#

i saw this model on image leaderboard in august, does someone know what model is this? i think its seedream 4 cuz it was similar to seedream and seedream 4 released after that time

wet oasis
#

there's still anonymous bots showing up

tawny epoch
wet oasis
#

alr yeah i'd bet that was seedream 4 then

wet oasis
#

@astral musk is gpt 5 no system prompt a cloaked model or what?

astral musk
wet oasis
#

you can't say if that is even a cloaked model?

astral musk
astral musk
twin field
# candid surge

the three magical words are: "I don't know" or "I am sorry" though, common sense

main anchor
#

What is funky monstera in the video arena

pallid gorge
wet oasis
#

Not sure

winter torrent
zenith vortex
woeful junco
stark jasper
woeful junco
#

Closed AI GPT 6.7 PRO

stark jasper
#

/:

astral musk
cursive gale
#

By the way, I want to remind everyone—after you generate your videos, don’t forget to vote for the video you like more, because that’s the main idea behind LM Arena. If you don’t do this, LM Arena will lose its purpose and could be shut down for good

wet oasis
#

We need to start benching possible cloaked discord user models.

lost hemlock
#

guys. i found this model

lost hemlock
#

and this model

wet oasis
#

Ooh

lost hemlock
#

and this one

leaden locust
#

anybody spotted a model that could be gemini 3?

#

given the 22 oct hype

woeful junco
leaden locust
lost hemlock
woeful junco
lost hemlock
astral musk
leaden locust
#

pineapple is clearly hallucinating

astral musk
astral musk
wet oasis
#

think of the positives

dull garnet
# wet oasis

this looks rather like fire of passion! pineapple is in love...

woeful junco
marsh hatch
jade gate
#

Sora 2 Invite Code

woeful junco
#

You can use the search feature also

narrow iris
#

Sora

modest aurora
#

does LMArena allow you to use the AI for free?

modest aurora
#

I just managed to improve my game voices thx to this test on the video arena channels

#

I can't afford voice actors and this helps me a lot

rocky solstice
#

Sora 2 Invite Code

#

need

wet oasis
quasi root
#

Can anyone send me an invite to sora 2?

wet oasis
quasi root
# wet oasis go to openai's discord

I did, but it will take about two days for it to be accepted on the page. If anyone has it and can send it to me, I would appreciate it. 😀

celest totem
#

Sora 2 invite code please

wet oasis
last spire
#

Sora 2 invite code plz

wet oasis
#

wrong channel

#

go to "how to video bot"

hollow nimbus
tiny meteor
#

I noticed a few models - solitude might be a grok considering the snark I was getting

wet oasis
#

makes sense

tiny meteor
#

I got solitude, zion and aspen as did that bloke a few days back, but I also saw vail

wet oasis
#

If you see solitude again, pls ask it who it is, maybe useful info

tiny meteor
#

One thing I did notice however was that solitude was referencing global variables in my project which i didn't specify, but did exist. the code's private on github, so I assume that GH is selling it to somebody

#

@wet oasis found a new one by accident - miramar

```
I am Miramar, an advanced AI built by Ocean AI.

I'm designed to be a helpful, truthful, and maximally capable assistant with:

    Continuously updated knowledge (no fixed cutoff date).
    Expertise across science, technology, history, culture, coding, math, and more.
    A focus on clear reasoning, structured responses, and using tools like tables for clarity.

Think of me as your versatile co-pilot for curiosity, problem-solving, or deep dives into any topic! 🚀
```
wet oasis
#

Yeah miramar is a weird one

tiny meteor
#

glm falling off a bit

#

surprised nobody's made a list of codenamed models yet lol

wet oasis
#

It keeps on changing, it's a moving target

#

maybe an updating spreadsheet would be nice

sudden idol
# tiny meteor lol

GLM 4.5 and 4.6 might have been trained on Gemini 2.5 Pro... (notice how they're on the same subtree as Gemini and Gemma)

#

DeepSeek R1 0528 and V3.1 as well

#

Usually a model family would live on its own subtree, like Llama or Claude

lost hemlock
tiny meteor
#

got ernie as well

#

very weird

zenith vortex
# tiny meteor lol

On LMArena, don't ask this question because no model will give you a correct answer.
When I tried this, Gemini said it was ChatGPT, ChatGPT said it was Claude, Claude said it was an OpenAI model, etc.

zenith vortex
wet oasis
wet oasis
zenith vortex
wet oasis
#

depends on if the ai was told who they were in instructions

#

all the ocean ai ones will never deviate from saying they are from ocean ai

sudden idol
sudden idol
# lost hemlock can u tell me what's about

I haven't looked at the source code, but it's probably using standard techniques from NLP such as n-grams to compare the word distributions (how similar the models write).

#

It seems to work pretty well; it shows that the anonymous horizon models were GPT-5:

#

GLM 4.6 word distribution is closer to Gemini 2.5 Pro than GLM 4.5, which seems highly improbable if it were just random chance

wet oasis
sudden idol
#

Normal case: Claude is closest to Claude

wet oasis
#

makes sense that glm 4.6 is based off of gemini 2.5

sudden idol
#

And Llama is closest to Llama

sudden idol
#

Models should be just on their own subtree if they're just using the same internal training data

#

I suppose this technique could be used to figure out anonymous models 🤔. I wonder what the minimum word count would be. Can test on existing models ig.

#

Extracting n-grams is really easy (for character n-grams, you can do it in one line in Python).
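
A quick illustration of that one-liner, as a sketch only: the function name `char_ngrams` and the default `n = 3` are arbitrary choices, not something taken from the tool being discussed.

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    # Slide a window of width n across the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("gemini", 3))  # ['gem', 'emi', 'min', 'ini']
```

Counting these with `collections.Counter` gives a character-level profile that can be compared between texts.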

wet oasis
#

I can imagine people donating snippets of models talking to identify some of these

sudden idol
wet oasis
#

I mean I knew about the deepseek using google thing, so nice that this is accurate about that

sudden idol
#

This method should be good for identifying models by major providers

#

It's not going to work for models like DeepSeek because it's literally hopping around the tree 🤣

wet oasis
#

heh

#

prob means this wont be good for identifying google models then

#

since a lot is gonna be based off of 2.5 pro

#

unless the model is significantly better than 2.5 pro

sudden idol
wet oasis
#

yeah but all that proves is that it could be a google model

#

Like I'm pretty sure oss was distilled 2.5 pro

#

so pretty much every ai company could be responsible for a model that looks like 2.5 pro

#

if the model is better than 2.5 pro and it is similar to google models, then its probably a google model

#

else? who knows?

sudden idol
sudden idol
wet oasis
#

huh interesting

sudden idol
#

GPT-5 is closer to o3 than ChatGPT-4o, which makes sense ig. It's basically o4.

wet oasis
sudden idol
wet oasis
#

sun

#

I have something to tell you

#

sad violin music

#

o4 existed

sudden idol
#

o4-mini?

wet oasis
#

at least the mini version

#

yeah

#

so what models does it have on the tree by default? do you have to manually add in a bunch if you were to run this to figure out how similar a new model was?

sudden idol
#

GPT-5 is basically o4 in terms of distance to o3, I suppose. Also Kimi k2 must have trained on o3 lol

wet oasis
#

oh!

#

Thats interesting

sudden idol
wet oasis
sudden idol
#

Yup, it's from the creative writing benchmark on EQBench

sudden idol
wet oasis
#

oh ok

sudden idol
#

It seems to be using the first N writing prompts from Nitral-AI/Reddit-SFW-Writing_Prompts_ShareGPT

wet oasis
#

like prompts from r/writingprompts?

#

oh yeah

sudden idol
#

I'm not sure how long each output is, but can probably prompt it to write multiple 🤔

sudden idol
wet oasis
#

I checked the huggingface

#

it is

sudden idol
wet oasis
#

darn it they stole my idea!

wet oasis
sudden idol
wet oasis
#

there's probably a file somewhere that has the outputs from the model tho, probably could just add the responses

#

results/datasets

sudden idol
wet oasis
#

SYSTEM_PROMPT = "You are a helpful writing assistant. Your goal is to write compelling story chapters based on user prompts."
USER_PROMPT_TEMPLATE = "write one chapter in a larger story, using this prompt as general inspiration. Approximately 800 words. Only output the chapter text, with no extra commentary before or after."

also remember to account for this

sudden idol
#

Kinda funny how simple word bigrams and trigrams are all you need to figure out the identity of a model
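
To make "word bigrams and trigrams" concrete, here is a small sketch of counting them. The tokenizer (lowercase, letters and apostrophes only) is an assumption chosen for simplicity, and `word_ngram_counts` is a hypothetical helper name; the benchmark's real preprocessing may differ.

```python
import re
from collections import Counter

def word_ngram_counts(text: str, n: int) -> Counter:
    # Lowercase and keep only runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Zipping n shifted copies of the word list yields consecutive n-word tuples.
    return Counter(zip(*(words[i:] for i in range(n))))

sample = "the rain fell and the rain kept falling"
bigrams = word_ngram_counts(sample, 2)
trigrams = word_ngram_counts(sample, 3)
print(bigrams.most_common(1))  # [(('the', 'rain'), 2)]
```

Phrases a model repeats across many outputs show up as high counts here, which is the kind of signal a stylistic fingerprint can be built from.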

wet oasis
#

it makes sense tho

sudden idol
#

Maybe it's not that surprising, since the method works well for human authorship identification too

wet oasis
#

doesn't seem like results/datasets exists by default, so it's probably made when the APIs are run

#

that makes it a little harder

sudden idol
#

I saw a JSON in results

#

3.5 MB

wet oasis
#

datasets is a folder, not a json

sudden idol
#

I don't quite remember, but it has the word frequencies (probably n-grams too) in there

wet oasis
#

idk what the format of the actual outputs are

sudden idol
#

They have a Colab notebook example I think

wet oasis
#

Description: Reads all generated_*.jsonl in results/datasets, analyzes each, and writes results to:

results/analysis/slop_profile__{model}.json (per-model detailed analysis)
results/slop_profile_results.json (combined data for all models).

CLI Options: You can specify --input-dir, --analysis-output-dir, and so on if you want to override defaults

wet oasis
wet oasis
sudden idol
#

Although idk if the repo has as much data as the EQBench one, since it seems like it got incorporated into EQBench

wet oasis
#

..

sudden idol
#

EQBench is open source too though

wet oasis
#

looks to me like we'd have to actually modify this to make it do what we want it to do

sudden idol
#

Oh it seems that's the creator so it's fine

sudden idol
wet oasis
#

probably not

sudden idol
#

I guess there are more efficient techniques

#

The benchmark is for creative writing not model identification

wet oasis
#

I mean, if you are willing to go through the install steps I'll hand you a ton of prompts we can see if it can identify who its from

sudden idol
#

I'll test it out

wet oasis
sudden idol
#

Yeah but if the goal was just model identification it might be efficient to narrow down the task

#

I'll check the Colab notebook in 10 mins

wet oasis
#

nah creative writing outputs a lot of unique token choices

#

Like if I asked it a math problem I'd bet all the responses would be more similar

#

and like there's been some things where like some ai's will always name main characters a particular weird name

sudden idol
wet oasis
#

got it open

#

"judge_scores": {
"Adherence to Instructions": 18.0,
"Believable Character Actions": 17.0,
"Nuanced Characters": 16.0,
"Consistent Voice/Tone of Writing": 17.0,
"Imagery and Descriptive Quality": 16.0,
"Elegant Prose": 16.0,
"Emotionally Engaging": 15.0,
"Emotionally Complex": 14.0,
"Coherent": 18.0,
"Meandering": 3.0,
"Weak Dialogue": 4.0,
"Tell-Don't-Show": 5.0,
"Unsurprising or Uncreative": 6.0,
"Amateurish": 4.0,
"Purple Prose": 4.0,
"Overwrought": 3.0,
"Incongruent Ending Positivity": 2.0,
"Unearned Transformations": 3.0,
"Well-earned Lightness or Darkness": 16.0,
"Sentences Flow Naturally": 17.0,
"Overall Reader Engagement": 16.0,
"Overall Impression": 16.0
},
not sure what this is?

sudden idol
#

It's just the rubric score by the judge

wet oasis
#

seems like they have a model judge this info, not relevant

sudden idol
#

What's important is just the model name, prompt and output for that prompt

wet oasis
#

yeah, can we input that without including stuff for the rest tho?

sudden idol
#

Can just extract fields from the JSON

#

I'm just gonna copy paste the analysis code into a new notebook

wet oasis
#

ok

dull garnet
wet oasis
#

¯_(ツ)_/¯

sudden idol
dull garnet
#

can you score something you dont know what that something actually means?

sudden idol
#

I suppose the judge's system prompt explains it to the judge

#

They have a benchmark just for checking how consistent the judge scores are

sudden idol
wet oasis
#

ok

dull garnet
sudden idol
#

Well, I haven't looked at the code, but the rubric scores aren't part of the n-gram analysis. LLM-as-judge is not required for NLP authorship identification techniques

wet oasis
#

yeah that stuff can be ignored, maybe removed if its easier

sudden idol
#

I'm just extracting the model name, prompt_ids and outputs

#

I kinda guessed what the structure was lol

#

I think 3 is the version number but idk, could be iteration number

#

also I haven't checked what the modifier is for (are they prefilling it?)
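
The extraction step described above (keep model name, prompt ID, and output; ignore the judge scores) could look roughly like this. It assumes one JSON object per line, as the `generated_*.jsonl` naming suggests, and the keys `model`, `prompt_id`, and `output` are hypothetical placeholders, since the real field names were only guessed at in this conversation.

```python
import json

def extract_fields(jsonl_text: str) -> list[dict]:
    records = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        row = json.loads(line)
        # Keep only the three fields needed; judge_scores and the rest are dropped.
        records.append({
            "model": row["model"],          # hypothetical key
            "prompt_id": row["prompt_id"],  # hypothetical key
            "output": row["output"],        # hypothetical key
        })
    return records

example = ('{"model": "model-a", "prompt_id": "1", '
           '"output": "Once upon a time", "judge_scores": {"Coherent": 18.0}}')
print(extract_fields(example))
```

If the real files nest these fields differently, only the three lookups inside the loop would need to change.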

wet oasis
sudden idol
#

Looks like it works, although idk why it's saying needs slop list

wet oasis
#

ok

#

how long did it take to run?

sudden idol
sudden idol
#

I cut it short that's why there were only 3 models

wet oasis
#

ok, so est 10 20 min per run? or like an hour

sudden idol
#

idk, but the metrics file can be saved

#

they didn't provide it, that's why I had to run it from scratch

wet oasis
#

if it takes a long time it makes sense to only have one model type per company/style so like 2.5 pro for google, 4o or o3 for openAI and so on

wet oasis
sudden idol
#

yeah

#

You just load in the file and you have all the bigrams etc pre-calculated

#

I gotta fix the parser though

#

since it dropped gpt-5, o3 and gemini-03-25 which are important models

#

took about 8-10 mins to finish

wet oasis
#

alr

#

so just a bunch of openrouter models right?

sudden idol
sudden idol
sudden idol
#

I'm going to add a form to input custom text and see if it can detect

#

it takes a while to generate the chart though :/

wet oasis
sudden idol
#

I think because it's generating charts for all models

sudden idol
#

same data

#

except some models got dropped since the parser is shady

wet oasis
#

is there any shot of manually adding in text?

sudden idol
wintry dawn
tiny meteor
wet oasis
tiny meteor
#

bit of a guess though

wet oasis
#

its good for narrowing down the search a little or for confirmation

#

how

tiny meteor
wet oasis
#

I feel like you guys are just taking these out of a hat

#

by like vibes

tiny meteor
#

I assume those making assertions know more about what they're doing than i do

wet oasis
#

I can tell you that a lot of them arent

#

lol

unkempt hemlock
#

code plz

#

sora invite code plz

#

plz den

zenith vortex
#

Hey experts 👋Are those Gemini ? lithiumflow (pro), orionmist (flash)...

wet oasis
#

some say so

empty crater
wet oasis
#

most seem to say lithium is better, so if it is gem 3

#

lithium would be pro

candid surge
#

what if one is flash and one is flash-lite? 😈

upbeat mirage
#

no, ultra might be AGI (for some people)

#

but pro would be a very useful model indeed (esp. for coding)

#

but i dont believe that
i think they test ultra/pro in arena

modest oriole
#

3.0 does an iconic gamepad svg

#

lithiumflow's definitely pro, orionmist idk

upbeat mirage
modest oriole
upbeat mirage
modest oriole
#

lithiumflow can make a walking animation in a 3d game wow

#

not really a walking animation its just the character moving without his legs moving

muted lance
#

orionmist just generated me the SVG of a Dullahan (the flame is animated too).

candid surge
#

oh thats cool!

sudden idol
# wet oasis is there any shot of manually adding in text?

It was a bit wonky for GPT-5 Chat and Grok, but I tried it on Llama 3.3 70B and it seems to work xD (most results converge on Llama 3/4). Used GPT-5 Thinking to implement manual similarity, since I noticed they weren't using cosine similarity.

#

I basically gave it 3 text prompts (top right are outputs that I put into Colab)
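
For reference, cosine similarity over word counts can be sketched as below. This is not the notebook's actual code: the whitespace tokenizer, the helper names `cosine_similarity` and `rank_models`, and the toy reference texts are all assumptions for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over the words the two histograms share.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

def rank_models(sample: str, references: dict[str, str]) -> list[tuple[str, float]]:
    # Compare the unknown sample against each reference model's text,
    # highest similarity first.
    sample_counts = Counter(sample.lower().split())
    scores = {name: cosine_similarity(sample_counts, Counter(text.lower().split()))
              for name, text in references.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

refs = {"model-a": "the quiet sea the quiet shore", "model-b": "neon city lights hum"}
print(rank_models("the quiet sea", refs)[0][0])  # model-a
```

Because cosine similarity normalizes by vector length, a short pasted sample can still be compared against a much larger reference corpus.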

granite sentinel
#

no sound on videos?'

sudden idol
#

This is with Flash Lite (it's kind of funny that GLM 4 and 4.5 are so close to Google that they kept popping up)

#

Bottom right method seems to be the most accurate so far

wet oasis
#

Ooh

#

Whats wrong with grok or gpt 5?

sudden idol
#

The more prompts, probably the more accurate. That was with 3

sudden idol
wet oasis
#

Ah

#

So we'd need to add them?

#

What were the llama responses used?

sudden idol
#

Idk why it's not in the zip file, they probably haven't updated it

sudden idol
wet oasis
#

The responses used to prove it was llama

sudden idol
#

In the JSON

wet oasis
#

not where, what

sudden idol
#

Like what response in the dataset?

wet oasis
#

which made 3 responses correct?

sudden idol
wet oasis
#

what are the llama responses to the prompts?

sudden idol
#

I just picked 3 of theirs randomly

wet oasis
#

ok thx

sudden idol
#

Only 2 prompts to GPT-5-Chat, and the bottom right strategy (overall, just counts and not TF-IDF) seems to have gotten it (no GPT-5 in reference data so o3 is the closest)
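
To show what the "counts vs TF-IDF" distinction means, here is a rough sketch of TF-IDF weighting applied to per-model word histograms. It uses one common smoothed IDF variant as an assumption; the weighting in the actual notebook may be different, and `tfidf_weights` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_weights(docs: dict[str, Counter]) -> dict[str, dict[str, float]]:
    n_docs = len(docs)
    # Document frequency: in how many models' histograms each word appears.
    df = Counter(word for counts in docs.values() for word in counts)
    weighted = {}
    for name, counts in docs.items():
        total = sum(counts.values())
        weighted[name] = {
            # Term frequency times a smoothed inverse document frequency.
            word: (c / total) * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, c in counts.items()
        }
    return weighted

docs = {
    "model-a": Counter("the tapestry of the night".split()),
    "model-b": Counter("the hum of the city".split()),
}
weights = tfidf_weights(docs)
# "tapestry" appears in only one model, so it outweighs the shared word "of".
print(weights["model-a"]["tapestry"] > weights["model-a"]["of"])  # True
```

Raw counts keep common words dominant, while TF-IDF boosts words that few models use; which one identifies a model better is an empirical question, as the message above suggests.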

wet oasis
#

noice

#

Can I give you some and see what happens?

sudden idol
#

Also Kimi is close in the actual one so that's why it's second. 3rd one is GPT-4.1 I think.

sudden idol
wet oasis
#

just the response is needed right?

sudden idol
#

Yeah

#

3-5 is best

wet oasis
#

ok dming

muted lance
#

raptor-llm-1017 was never discussed before?

wintry dawn
wintry dawn
muted lance
#

It's moderately capable with SVG drawings.

upbeat mirage
#

unfortunately, it does not disclose its creator

#

it gave me this answer to my prompt:

```
- **Who are you?** I'm a conversational AI designed to help with a wide range of questions, from explaining concepts to discussing ideas—my goal is to provide thoughtful, helpful responses to support your needs.
- **Who created you?** I was developed by a team of engineers and researchers focused on advancing natural language understanding and reliable, useful AI systems.
- **What version do you have?** I don't have a public version identifier to share; my capabilities are updated iteratively as part of ongoing development to improve accuracy and usefulness.
- **What is your knowledge cut-off date?** My knowledge base is current up to mid-2024. I do not have information about events, developments, or changes that occurred after that timeframe.
- **When were you released?** I don't have a public release date to disclose; my development has been an ongoing process with gradual refinements to enhance how I assist users.
- **What is today's date?** I don't have real-time access to the current date, so I can't provide that information. If you share the current date with me, I can use that context for any date-related questions you have.
- **What is the last date you are aware of?** The latest timeframe covered in my knowledge base is mid-2024; I can reference events and information reliably up to that point, but not beyond it.
```
#

so it could be one of these:

  • new Deepseek model
  • new Meta AI model
  • new Grok model
  • new Ernie model
  • new Kimi model
  • new MiniMax model
sudden idol
#

Orionmist word distribution analysis (small sample size of 4 prompts that are not in the dataset only though, and match is kinda low)... seems Google-ish

candid surge
#

I havent been too impressed by orionmist or lithiumflow tbh

#

IDK if I'd use them over gemini 2.5 pro

sudden idol
#

The most accurate way to identify the model with limited samples so far seems to be the word distribution (word count histogram)

wintry dawn
#

(updated) GROK MODELS: sierra, breckenridge, zion, acadia, aspen, veil, miramar

  • shasta?
wet oasis
#

why

sudden idol
#

This method seems to work well at identifying a model's family before it's even revealed. Need more Grok data though (only have data for Grok-3 beta).

wintry dawn
wintry dawn
candid surge
upbeat mirage
#

(C++, Python, Java, etc)

#

i mean, Lithiumflow

candid surge
#

I've only been doing creative writing stuff.

wintry dawn
candid surge
#

I just voted a gemini 2.5 flash result over orionmist on battle mode (since it was battle mode, I didn't know which model was which of course)

tardy ivy
#

Everything these days focuses on codeslop sigh

#

Lithium's been giving me a bit better creative writing outputs than Orion, though

sudden idol
#

lithiumflow is likely Google (GLM and R1-0528 probably trained on Gemini which is why they show up in the model family).

wet oasis
#

Great

frail wigeon
#

what model is raptor-vision-1015

lost hemlock
#

what model is this ?

blazing roost
sudden idol
# lost hemlock what model is this ?

Couldn't quite tell from word distributions... might be Grok since not many examples on Grok. If it was Llama, would have probably caught it quite easily.

muted lance
#

miramar reminds me of some of the early anonymous Llama 4 models on LMArena from February/March; those too generated very funny and politically incorrect responses.
And likewise, I doubt it will be publicly distributed like it is, unfortunately.

sudden idol
#

Lithiumflow Python Tkinter code isn't great

#

The layout seems logical at least though

sudden idol
#

Orionmist's code was okay (fixed a bug by previous model, but didn't implement the features I asked for). Had formatting issues:

tardy ivy
#

I'm kinda curious about raptor, the writing output's not too shabby and got a few Ws

sudden idol
#

Got Lithiumflow again (coincidentally, second time in multiturn). Not too impressed with how its performing on Tkinter tbh

#

Each of the issues (except missing output column) were things I flagged in the convo, but it hasn't fixed it.

storm plume
#

CW: Anime stuff. Serious business people avert your eyes!
Testing the new LithiumFlow on image understanding and knowledge using a variety of complex anime memes and images.
It is compared directly to 2.5 pro, the previous best model at understanding complex images filled with established characters like this.
2.5 Pro was already leagues ahead of the other models at image understanding; if LithiumFlow is Google yet again (it most likely is), then they've made another giant leap.

#

Additionally, if you'd like to test Lithium more on text prompts:
Simply include a 1x1 pixel black image with your prompt. This forces the site to give you models that can process and understand images, narrowing the pool greatly. You get LithiumFlow around 50% of the time.
Google models encode images into tokens in a way that makes one pixel easily ignorable.
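
If you don't have a 1x1 black image handy, one can be generated with the Python standard library alone. This is a minimal sketch that writes an 8-bit greyscale PNG by hand; the function names and the output filename are arbitrary, and nothing here is specific to LMArena.

```python
import struct
import zlib

def png_chunk(kind: bytes, data: bytes) -> bytes:
    # Each PNG chunk is: length, type, data, CRC-32 over type + data.
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)

def black_pixel_png() -> bytes:
    signature = b"\x89PNG\r\n\x1a\n"
    # IHDR: width 1, height 1, bit depth 8, colour type 0 (greyscale),
    # then default compression, filter, and interlace methods.
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    # One scanline: filter byte 0 followed by a single black sample.
    idat = zlib.compress(b"\x00\x00")
    return (signature + png_chunk(b"IHDR", ihdr)
            + png_chunk(b"IDAT", idat) + png_chunk(b"IEND", b""))

def write_black_pixel(path: str = "black_1x1.png") -> None:
    # Hypothetical convenience wrapper; the filename is just an example.
    with open(path, "wb") as f:
        f.write(black_pixel_png())

print(len(black_pixel_png()), "bytes")
```

Calling `write_black_pixel()` saves the file, which can then be attached alongside a text prompt as described above.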

zenith vortex
tawny epoch
#

if you can catch it

#

ok you cant select it

royal rapids
#

New model discovered (I think):

tiny meteor
#

I found claude 4.5 sonnet to be better than zion when doing some schema drafting

wet oasis
#

yeah that makes sense Zion isn't the best

tiny meteor
#

i thought veil was alright

#

solitude was kind of eh though

#

i preferred mistral-small to it apparently lol

wet oasis
#

just because you voted for something else once, doesn't mean its always gonna be like that

tiny meteor
#

of course

lilac holly
leaden locust
#

doubt it's gemini3 tho

#

it has been around for while however

lilac holly
# lilac holly

Are cloaked models actually capable of revealing their company

lilac holly
wet oasis
#

why are you doing this in codename discussion

#

you have to go to /video 1 2 or 3

lilac holly
wet oasis
#

wat

#

Everyone else reported that it would always say it was from google

lilac holly
wet oasis
#

well that's interesting, I don't know what to make of that

fringe burrow
#

the zion model is really great in coding, not in design but with logic in code.

wet oasis
#

ooh

#

@astral musk is lithiumflow made out of lithium?

fringe burrow
wet oasis
#

alr

#

supposedly lithiumflow is gem 3

fringe burrow
#

hope its flash because still i have that feeling that this isnt enough

#

or that its pro but ultra will be released

#

and ultra should be released from the leaks

cursive jacinth
#

seems to be flash, giving me mid + super short output

wet oasis
#

maybe

#

in that case 3 pro is gonna be pretty good

cursive jacinth
#

any reason lithiumflow would give me like a ridiculously short output?

wet oasis
#

your prompt doesn't need that much?

cursive jacinth
#

all the other models I've tested give me more in depth analysis

#

and like 6x as much text

wet oasis
#

that would fit flash then

cursive jacinth
#

but even 2.5 flash lite was giving me more. just weird, I'll keep testing and see if it happens again

fringe burrow
#

its experimental and in early testing

#

it will be bad sometimes

#

but with enough trying you would get 70% better results

cursive jacinth
#

wonder what the mechanism behind that degradation is... super quantized or something?

fringe burrow
#

like it didnt reach the epochs google is going for

cursive jacinth
#

so people are being served different checkpoints of some smaller Gem 3 model? would love more context

fringe burrow
#

i am doing my little tests with comparing lithiumflow to other models like 4.5 sonnet 64k or gpt-5 pro or 4.1 opus

#

current score

#

its from only one test now

woeful junco
#

Whats this mean

fringe burrow
#

OKAY LITHIUMFLOW IS INSANEEE, IT JUST ONE-SHOTTED A MINECRAFT GAME WITH AN ACTUAL PROFESSIONAL MENU AND ANIMATIONS, a working game with blocks, terrain generation and no bugs. No other llm could do that

#

orionmist is okay

leaden locust
fringe burrow
leaden locust
#

thanks

#

im rly curious

fringe burrow
#

im making a benchmark website with my tests. and i am using lithiumflow for making the design

#

like lithiumflow/orionmist passed 4/4 tests perfectly

#

for example test 1:
o4-mini: 0
/zion: 0,25
/lithiumflow: 1
4.1 Opus: 0,5
GPT-4o: 0
2.5 Flash Lite (Thinking): 0
GPT-5 (High): 0,5
GPT-5 Pro: 0,25
4.5 Sonnet (Thinking 64K): 0,75
2.5 Pro: 0,25

#

i am going to show my tinybench tomorrow, 15 tests, models like 4.5 sonnet, gpt-5 pro, lithiumflow, zion and orionmist

#

this is example of lithiumflow minecraft

#

One-shot

#

no errors

cursive jacinth
#

wow

tiny sierra
#

Lol gemini 3 confirmed

#

Wtf

fringe burrow
#

the prompt was just "create a minecraft in one html document. make it look proffesional."

cursive jacinth
#

can you break blocks and collect them or nah

fringe burrow
#

yeah you can break and place

#

you can sprint

#

jump

#

walk

#

everything works fine

#

smooth

tiny sierra
#

LMAO

fringe burrow
#

if thats from google, google cooked so much.

#

orionmist cooked my benchmark design

#

i just asked it to make a benchmark site and i send my bench data

#

look at the top 3

#

overall

#

difference is big

#

and i rated the outputs before i knew what model was making it

leaden locust
#

oh wow that looks impressive

#

so what's the working theory? Both lithium and orion are gemini flavours?

fringe burrow
#

Both say that they are made by google

#

And both are similar to A/B checkpoints from google ai studio

#

And they are really up to date

#

They know about GPT-5 and that it has modes like thinking, pro, instant.

leaden locust
#

think i just got lithium vs orion

leaden locust
fringe burrow
#

Luck i guess

#

Lithium made mine

#

Orion is a bit worse

leaden locust
#

orion is worse tho i provided a simple prompt

#

compared to yours

#

(lithium)

#

however vs flash 2.5 it owned it

fringe burrow
#

My prompt was simple too

leaden locust
#

it's like this. Can jump, can walk, no collision/bugs

#

got lithium too, made it fully playable (can place blocks, big world, etc

upbeat mirage
wintry dawn
#

i can use any model dire-

#

🤐

upbeat mirage
#

the direct chat has stricter rate-limits

#

battle mode is better (esp. with account)

#

i have a file with its answers, you can identify it by that
do you need the file?

#

(it's quite big lol, i had stored a whole thread in it, but i will probably only copy the relevant part of it)

wintry dawn
#

im not using direct chat

#

anyway um

#

ill run yer prompt

upbeat mirage
#

in LMA?

wintry dawn
#

ye

#

lol

upbeat mirage
#

without R-Ls?

wintry dawn
#

i gotta delete allat..

upbeat mirage
#

you can sel. it?

wintry dawn
upbeat mirage
#

oof

#

how?

wintry dawn
upbeat mirage
#

just curious

wintry dawn
#

if lmarena has a bug bounty ill report it

#

cuz i dont want this to go to waste

#

its useful

upbeat mirage
#

this is quite a fat bug

wintry dawn
#

yeah

#

i can manipulate results

upbeat mirage
#

i wonder, how many people have used similar "method" at LMA..

#

||-# but be careful that you dont get banned||

#

(i would not dare to use such "methods")

wintry dawn
#

anyway ill jus run yer prompt lol

#

wait so

upbeat mirage
#

ty

wintry dawn
#

is "create a desktop program for the Arimaa boardgame" the prompt u want

upbeat mirage
#

with AI to play against

wintry dawn
#

but in what language n stuff?

#

can i just make it a single html file

upbeat mirage
#

because, speed

wintry dawn
upbeat mirage
#

C++ is fastest for boardgames

#

it will probably propose using SFML library

#

(as that is easy to use and performant)

wet oasis
#

..

upbeat mirage
wet oasis
#

Im fine

upbeat mirage
wet oasis
#

Idk off of the top of my head

upbeat mirage
#

claude-4.5-thinking, gpt5-high, lithiumflow, glm-4.6, qwen3 max, deepseek-3.2, grok4, grok4fast-reasoning, kimi k2, MAI, Serenity, Raptor, Phantom, Sierra, Aspen, Acadia, Skytrail, Monster?

wintry dawn
#

@upbeat mirage sry cf is blocking me :p

upbeat mirage
#

battlemode instead?

wintry dawn
#

cloudflare

#

💀

upbeat mirage
#

ah

#

so you think lithiumflow is (currently) the SOTA model in coding?

#

better than gpt5-high, claude-4.5, claude-4.1-opus et.al. ?

wintry dawn
#

@upbeat mirage sorry it took so long

#

orionmist btw

#

(its lithium but with web search)

wet oasis
wintry dawn
#

@upbeat mirage

#

lithium

#

(it decided to make it cli based)

#

4.5 thinking 32k:

upbeat mirage
#

i thought gemini 3 would generate me a complete GUI app for desktop, like we are used to see from it

#

with mouse-control and nice piece graphics and board background

#

apparently we have to wait until next year for that?

#

but, thanks for your efforts

#

unfortunately, claude's app has bugs

#

have you even tested the programs?

#

-# (no offense intended)

sudden idol
wet oasis
sudden idol
#

Maybe pre-preview instability? But it's supposed to be better than 2.5 Pro.

#

If both are not really outstanding at Tkinter (literally the most basic UI library) but are one-shotting Minecraft, that's a bit suspicious

wet oasis
#

yah

#

maybe google's gonna release a programming first model instead of gemini 3 rn?

sudden idol
#

Maybe I'll try testing it on a question that only GPT-5 High gets right

sudden idol
cursive jacinth
#

seems like a weirdly spiky model, I wonder if they figured out some better RL environments for software

upbeat mirage
upbeat mirage
#

that almost requires AGI, does it?

wet oasis
#

thats a very different definition of agi than the one about being able to do all human tasks in a human body

cursive jacinth
#

it was giving me pretty weak creative writing and writing analysis

wet oasis
upbeat mirage
#

i mean for super-long games

#

it would need to have at least 1M context (better 2M)

cursive jacinth
#

gem 2.5 pro, and all the claude's are good

#

yeah so for that context length gem 2.5 would be it

upbeat mirage
cursive jacinth
#

not sure honestly!

#

mostly you want it to roleplay effectively by giving it info about the character its simulating

upbeat mirage
#

and if it has reached its context limit?

#

let it summarize everything before that happens?

#

as a "save-state"?
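
A minimal sketch of that save-state idea in Python. Everything here is an assumption for illustration: `summarize` is a hypothetical callable that would wrap a real model call, and the 4-characters-per-token estimate is only a rough stand-in for a real tokenizer.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


class SaveStateChat:
    """Keeps recent turns verbatim and folds older ones into a running summary."""

    def __init__(self, summarize: Callable[[str], str], limit_tokens: int, keep_last: int = 4):
        self.summarize = summarize      # hypothetical: would call the model
        self.limit_tokens = limit_tokens
        self.keep_last = keep_last      # number of newest turns never summarized
        self.summary = ""
        self.turns: List[str] = []

    def context(self) -> str:
        """Text that would be sent to the model: summary first, then recent turns."""
        head = [f"[save-state] {self.summary}"] if self.summary else []
        return "\n".join(head + self.turns)

    def add(self, turn: str) -> None:
        """Append a turn; if the budget is exceeded, fold the older turns away."""
        self.turns.append(turn)
        over_budget = rough_token_count(self.context()) > self.limit_tokens
        if over_budget and len(self.turns) > self.keep_last:
            old = self.turns[:-self.keep_last]
            self.turns = self.turns[-self.keep_last:]
            to_fold = "\n".join(([self.summary] if self.summary else []) + old)
            self.summary = self.summarize(to_fold)
```

The summary is regenerated from the previous summary plus the turns being dropped, so detail fades gradually rather than all at once; how well that works for a very long game depends entirely on the summarizer.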

wet oasis
upbeat mirage
#

will gemini 3 pro have 2M? (context window)

wet oasis
#

I hope it has more than that, but idk

upbeat mirage
cursive jacinth
#

from what I understand the labs have been making pretty fast progress on context length and google generally leads

wet oasis
#

if these guys truly are gemini 3 models, yeah

cursive jacinth
#

so I'd guess yes

wet oasis
cursive jacinth
#

like deepseek, openai have been catching up

#

i assume in the meantime deepmind has made further progress

upbeat mirage
#

is serenity really by xAI?

#

Grok4.5-preview?

wet oasis
#

no idea

#

could check

upbeat mirage
#

if it is, it should have a sizable context-window, right?

wet oasis
#

it doesn't seem like the eqbench tool includes xAI models, so take this with a grain of salt

#

but its closest to claude and deepseek

sudden idol
#

Well it's close to both Claude and OpenAI families, so not conclusive ig

#

That's kinda what happens when you put Grok 4 in, since no reference instances

wet oasis
#

mm

lilac holly
#

these codenames are just names

wet oasis
#

... true?

lilac holly
#

I was expecting super

wet oasis
#

?

wet oasis
#

showing strong signs of relation between it and claude models, so serenity could be anthropic or based off of claude

tiny sierra
#

Serenity is way too verbose to be a claude model

#

Imo

#

Ive also never had a claude model give me tables when not asked for

wet oasis
#

mm

#

ok

tiny sierra
#

Or list the sources at the bottom

wet oasis
#

how does it have sources? this isn't search arena

#

did it just make that up

tiny sierra
#

LMAO

#

I DIDNT EVEN NOTICE THAT

#

Sonnet 4.5 for reference

#

And opus 4.1

wet oasis
#

why is it so long

tiny sierra
wet oasis
#

the table thing makes me feel like its similar to perplexity

tiny sierra
#

This should just be comma separated, lmao

#

And it is comma separated in the app just not lmarena

#

(Sonnet)

#

Prompt was

#

Hand foot and mouth disease clinical presentation

wet oasis
#

weird

tiny sierra
# wet oasis weird

If im fishing for lithiumflow should I make a new chat or spam the same identical prompt in the same one?

wet oasis
#

tho if you spam the same one then it will use all the previous tokens from the winner

tiny sierra
#

Ah yeah I was wondering how that'd work since it changes the model

candid surge
#

When did marble show up?

wet oasis
#

Havent tried

sudden idol
#

Looks a bit inconclusive to me, since K2 is closer to o3 yet the score is very close, which suggests Serenity is more in between model families than part of any specific one

wet oasis
#

Maybe

tiny sierra
spiral ingot
#

lovable can we use their premium

#

for free here?

wet oasis
upbeat mirage
#

Acadia (Ocean AI)
Serenity (Ocean AI)
Sierra (Ocean AI)
Shasta (Ocean AI)
Solitude (Ocean AI)
Is there a 6th one?

#

(the name could still be a coincidence, as "Ocean" is pretty unspecific)

upbeat mirage
sudden idol
#

I just realized Grok 4 is in the 140K data

upbeat mirage
#

Ocean AI uses some words/phrases from grok:

  • witty
  • maximally truthful
  • etc
sudden idol
#

I guess I could train a model to predict the model based on TF-IDF vectors or embeddings

upbeat mirage
#

that would be awesome

sudden idol
#

Idk why I didn't just do that lmao

#

I guess I thought it would be easier to use their ngram analysis code.

sudden idol
#

I was testing it but got sidetracked checking out a new framework

upbeat mirage
#

(if using Claude-4.5 and Lithiumflow and maybe even gpt5-high as coders)

sudden idol
#

Tons of tutorials, so definitely approachable.

upbeat mirage
#

i mean, to create a new model for prediction

#

and what hardware would be necessary?

sudden idol
#

These are statistical models, not deep neural networks

#

Most of them have functions from libraries like sklearn that you can just call
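
A rough sketch of what that could look like, written with only the Python standard library so it runs anywhere. It is a hand-rolled stand-in for what sklearn's `TfidfVectorizer` plus a classifier would normally do, and the labels and sample texts below are invented for illustration, not real model outputs.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

PUNCT = ".,!?:;()\"'"


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def fit_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term seen in training."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """L2-normalised TF-IDF vector; terms unseen in training are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def predict(text: str, labelled: List[Tuple[str, Dict[str, float]]], idf: Dict[str, float]) -> str:
    """Nearest-neighbour guess: the label of the most similar known response."""
    query = tfidf(tokenize(text), idf)
    return max(labelled, key=lambda item: cosine(query, item[1]))[0]


# Invented toy data: two made-up "families" with different verbal habits.
samples = [
    ("family_a", "I aim to be maximally truthful and a bit witty about it."),
    ("family_a", "Here is a witty take, while staying maximally truthful."),
    ("family_b", "Certainly! Here is a table summarizing the key points."),
    ("family_b", "Certainly! Below is a structured table with sources listed."),
]
docs = [tokenize(text) for _, text in samples]
idf = fit_idf(docs)
labelled = [(label, tfidf(doc, idf)) for (label, _), doc in zip(samples, docs)]
```

With real data you would want many responses per known model and a held-out set, otherwise a guess like this mostly reflects the prompt topic rather than the model's style.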

upbeat mirage
#

oh, i just got another idea (totally different topic, though)
how hard would it be to create a decent model to play a niche boardgame?

sudden idol
#

You'd probably use a search space algorithm from classic AI
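
For a concrete picture, the usual starting point is minimax with alpha-beta pruning. Below is a minimal negamax sketch in Python; the toy game (take 1 to 3 stones, whoever takes the last stone wins) and all function names are assumptions chosen to keep it short, not anything specific to Jungle.

```python
import math
from typing import Callable, List, Optional, Tuple


def negamax(
    state: int,
    depth: int,
    alpha: float,
    beta: float,
    moves: Callable[[int], List[int]],
    apply: Callable[[int, int], int],
    evaluate: Callable[[int], float],
) -> Tuple[float, Optional[int]]:
    """Return (score, best_move) from the point of view of the side to move."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state), None
    best_score, best_move = -math.inf, None
    for move in legal:
        # The child's score is from the opponent's view, so negate it
        # and swap/negate the search window.
        child_score, _ = negamax(apply(state, move), depth - 1, -beta, -alpha, moves, apply, evaluate)
        score = -child_score
        if score > best_score:
            best_score, best_move = score, move
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # the opponent already has a better option elsewhere
    return best_score, best_move


# Toy game: `state` is the number of stones left.
def stone_moves(stones: int) -> List[int]:
    return [n for n in (1, 2, 3) if n <= stones]


def take(stones: int, n: int) -> int:
    return stones - n


def stone_eval(stones: int) -> float:
    # No stones left means the previous player took the last one,
    # so the side to move has lost. Unfinished positions score 0.
    return -1.0 if stones == 0 else 0.0
```

For a real board game the same function shape applies, but `evaluate` becomes a heuristic over material and position, and the depth has to be capped because the tree grows too fast to search to the end.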

upbeat mirage
#

if the model should run on a 6GB gpu from 2019 (Nvidia GTX titan)
and the game would be Dou Shou Qi (jungle)

#

muZero?

#

i mean, a model which trained by self-play (like Alphazero)

sudden idol
#

Looks like checkers

upbeat mirage
#

would my poor laptop survive this training run? (it can get quite hot, over 90°C)

sudden idol
#

Idk how I got beaten by a program running on 2 MB of RAM

upbeat mirage
#

jungle is unique, because of the intransitivity of the capture rules

#

(mouse can beat elephant but not reverse, except if trapped)
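
A small Python sketch of that capture rule, assuming the commonly published ranking (rule sets vary, and some swap dog and wolf). It ignores the water and jump rules entirely, and the names are just for illustration.

```python
# Assumed ranking, weakest to strongest; "rat" is the mouse.
RANK = {
    "rat": 1, "cat": 2, "dog": 3, "wolf": 4,
    "leopard": 5, "tiger": 6, "lion": 7, "elephant": 8,
}


def can_capture(attacker: str, defender: str, defender_trapped: bool = False) -> bool:
    """True if `attacker` may capture `defender` under the simplified rules above.

    `defender_trapped` means the defender is standing on one of the
    attacker's trap squares, where its rank no longer protects it.
    """
    if defender_trapped:
        return True
    if attacker == "rat" and defender == "elephant":
        return True   # the intransitive exception
    if attacker == "elephant" and defender == "rat":
        return False  # and it does not work in reverse
    return RANK[attacker] >= RANK[defender]
```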

#

many strange rules

#

harder to learn than checkers

#

but bitboards can be used (7x9 board)
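
To show why the board size matters: 7 x 9 is 63 squares, so a whole piece set fits in one 64-bit word. A minimal Python sketch, assuming a row-major square numbering (my choice, not a standard) and ignoring rivers, traps and dens:

```python
FILES, RANKS = 7, 9                      # 7 columns x 9 rows = 63 squares
FULL = (1 << (FILES * RANKS)) - 1        # mask with all 63 board bits set


def square(col: int, row: int) -> int:
    """Bit index of a square, numbering row by row from the bottom-left."""
    return row * FILES + col


def bit(col: int, row: int) -> int:
    """Bitboard with only the given square set."""
    return 1 << square(col, row)


# Edge-column masks stop horizontal shifts from wrapping to the next row.
COL_FIRST = sum(bit(0, r) for r in range(RANKS))
COL_LAST = sum(bit(FILES - 1, r) for r in range(RANKS))


def step_targets(pieces: int) -> int:
    """All squares one orthogonal step from any set bit (terrain ignored)."""
    up = (pieces << FILES) & FULL
    down = pieces >> FILES
    right = ((pieces & ~COL_LAST) << 1) & FULL
    left = (pieces & ~COL_FIRST) >> 1
    return up | down | left | right
```

In a compiled engine these would be plain 64-bit integer operations, which is where the speed comes from; Python's big integers only make the idea easy to try out.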

sudden idol
#

AI-bench, that would be cool

#

Where LLMs have to write AIs that go head to head

upbeat mirage
#

yeah that would be amazing

#

and entertaining to watch

#

or cyborg matches

#

where human+LLM try to outmatch other pairs

#

in coding of AI

sudden idol
upbeat mirage
#

maybe we could use CoreWars & RedCode for that?

upbeat mirage
#

but junior coders?

#

but the competition would be between human+LLM vs human+LLM

#

only such pairs of AI-coders

#

ok, a human could opt to not use LLMs at all

sudden idol
#

And depends on the tasks

#

An engineer might win if the game space is continuous

#

Because control theory would come in handy

#

Whoops sorry, just realized going off topic for this channel

sudden idol
wet oasis
#

Think so

sudden idol
#

Doesn't show up as often though

wet oasis
#

Paws is actually AGI that was told its entire goal is to see how good people are at guessing how long away we are from agi.

upbeat mirage
#

-# (context window)

sudden idol
sudden idol
vale root
#

aa

cursive jacinth
#

how has orionmist and lithiumflow been performing for everyone?

fringe burrow
#

best current llm

#

my coding bench

#

i didnt know which model i was using while rating

#

14 tests

tiny sierra
#

Wish google would just let us chat with it directly already

#

Its not like everyone doesnt know what it is

wet oasis
#

Heh

tiny sierra
wet oasis
#

You arent wrong

vale root
tiny sierra
#

Does anyone know if this decision is up to lmarena or whoever owns the codenamed model?

vale root
wet oasis
#

would be weird if they were ok with putting it on chat mode

#

maybe they'd get away with side-by-side?

wintry dawn
#

new model on lmarena

#

"marble"

cursive jacinth
#

sounds grok-y

wintry dawn
#

ya

#

i asked it for a userscript

#

it gave me a mid html file

cunning notch
#

funny btw

sinful spoke
#

tried lithiumflow a few times, and I think it will score lower than 2.5pro, at least with style control off
It is generally shorter in answer length than 2.5pro, has no tendency to explain things in depth, and is less fond of using analogies

tiny sierra
tiny sierra
#

Side by side, I never used chat mode

wet oasis
candid surge
#

If only we got X28 or 2HT...

sudden idol
#

Lithiumflow seems bad at actually modifying its code (actually seems like the arena was bugged, because it runs fine when I run the React code manually)

vale root
sudden idol
# wet oasis Bug?

It used alert windows that didn't show up on Web Dev Arena, so it looked like the buttons were not working.

#

It took more turns than expected to get a good result, and it didn't seem to fully implement everything I was asking for. Maybe image input would help. Ig it's not really fair on the LLMs that they can't see the output.

wet oasis
#

mm

elder yew
#

Gemini 3 is simply absolutely superior

sudden idol
tiny sierra
#

😭

sudden idol
#

It hallucinates so much with tool use

#

I give it a link and it hallucinates contents despite not being able to access the link

#

I wonder if Lithiumflow does that 🤔

sudden idol
#

Whatever flying-octopus is, it passed the test 😆:

#

clover too:

#

I guess it isn't really a hard task, since even Mistral Small gets it:

#

Ernie failed the task.
DeepSeek V3.2 Exp fails too (is it because DS is trained on Gemini 2.5 Pro...?).
Grok 4 Fast failed.
grok-4-0709 failed (very weird, claims it is from its last visit to the site; the date is listed as 2023).
ajax passes.

#

Uh oh... Orionmist failed

#

Looks like Gemini 3 hasn't solved the hallucination issue (even Mistral 24B managed to not hallucinate!)

Edit: I thought it hallucinated because I assumed models don't have access to the internet, but upon closer examination, it must have some sort of cached database, since those are real headlines from 2 days ago!

tiny sierra
#

mine hallucinates even with directly uploaded text or .docx files

#

(2.5 pro)

wet oasis
#

Im disappointed

sudden idol
#

Breckenridge training cutoff seems to be October 2024 (it fails here)

sudden idol
#

flying-octopus passed quite a lot of times, but this time it did something weird:

wet oasis
#

Mm

#

It tried to tool call anyway?

#

Ive noticed newer llms are more and more unstable

dull garnet
wet oasis
#

Heh

sudden idol
#

It has cached internet access?

#

Orionmist has internet access, lol

sudden idol
#

Hackernews doesn't show article previews, so Orionmist either has recursive web search (went into the article) or Google Search (saw the summary)

#

Or maybe a cached database, because the results are not live but from 2 days ago.

zenith vortex
#

No anon Gemini 3 😟🙁

upbeat mirage
#

are breckenridge, flying-octopus, clover and ajax all from the same lab?

#

not OceanAI, or are they?