#codename-discussion | Arena | Page 1

tiny sierra Oct 8, 2025, 5:06 PM

#

milkwatermelon

gloomy bear Oct 8, 2025, 5:08 PM

#

everyone is Gemini 3.0

#

gemini 3.0 flash lite, gemini 3.0 flash, gemini 3.0 pro, gemini 3 oss 12b, gemini 3 oss 20b a2b

normal anchor Oct 8, 2025, 5:32 PM

#

"We make up code names and then let you guess them"
Good job people 😆

main anchor Oct 8, 2025, 5:38 PM

#

Guys I think nano banana is gemini 2.5 flash

wet oasis Oct 8, 2025, 5:42 PM

#

Yay

wet oasis Oct 8, 2025, 5:42 PM

#

main anchor Guys I think nano banana is gemini 2.5 flash

Are you sure? It's probably a preview version

hollow python Oct 8, 2025, 5:43 PM

#

main anchor Guys I think nano banana is gemini 2.5 flash

@main anchor @wet oasis confirmed

wet oasis Oct 8, 2025, 5:43 PM

#

Woah

hollow python Oct 8, 2025, 5:44 PM

#

https://tenor.com/view/whoah-woah-gif-20545190

Tenor

void sleet Oct 8, 2025, 5:58 PM

#

hollow python @main anchor @wet oasis confirmed

erm

hollow python Oct 8, 2025, 5:58 PM

#

void sleet erm

?

wet oasis Oct 8, 2025, 6:04 PM

#

Does anyone know anything about phantom-0930-2?

stark jasper Oct 8, 2025, 6:09 PM

#

wet oasis Does anyone know anything about phantom-0930-2?

Nopr

#

Nope

vestal needle Oct 8, 2025, 6:11 PM

#

wet oasis Does anyone know anything about phantom-0930-2?

lmk if u find any infos about it, i literally cant find anything online

wet oasis Oct 8, 2025, 6:12 PM

#

there's been hidden models with the name phantom for a while right?

#

Wonder if they are all from the same company

stark jasper Oct 8, 2025, 6:14 PM

#

wet oasis there's been hidden models with the name phantom for a while right?

Isn't phantom from Minecraft?

#

Yo guys Mojang made them

wet oasis Oct 8, 2025, 6:15 PM

#

..

vestal needle Oct 8, 2025, 6:15 PM

#

bytedance had a model called phantom https://huggingface.co/bytedance-research/Phantom

bytedance-research/Phantom · Hugging Face

#

but tis for images

#

so

stark jasper Oct 8, 2025, 6:15 PM

#

Btw did you try the extension?

wet oasis Oct 8, 2025, 6:16 PM

#

haven't yet

stark jasper Oct 8, 2025, 6:17 PM

#

Ok

wet oasis Oct 8, 2025, 6:17 PM

#

how about Shasta?

stark jasper Oct 8, 2025, 6:17 PM

#

wet oasis how about Shasta?

No clue too

tiny sierra Oct 8, 2025, 6:20 PM

#

gloomy bear everyone is Gemini 3.0

now that i think about it

#

google has to have a codename on here rn for whatever that new model that's being AB tested is

stark jasper Oct 8, 2025, 6:21 PM

#

We are all waiting for Gemini 3

tiny sierra Oct 8, 2025, 6:21 PM

#

whether its a 2.5 update or 3.0

stark jasper Oct 8, 2025, 6:21 PM

#

I don't want google to be like openai 4o is better than 5

wet oasis Oct 8, 2025, 6:23 PM

#

stark jasper I don't want google to be like openai 4o is better than 5

Lol yeah

gloomy bear Oct 8, 2025, 6:24 PM

#

stark jasper I don't want google to be like openai 4o is better than 5

yep, I don't want a colorful gemini, I love this corporate gemini

wet oasis Oct 8, 2025, 6:24 PM

#

wet oasis Does anyone know anything about phantom-0930-2?

Speaking of which theres a -1 vers so thats being ab tested rn

stark jasper Oct 8, 2025, 6:25 PM

#

wet oasis Speaking of which theres a -1 vers so thats being ab tested rn

Btw you can use it in direct chat?

wet oasis Oct 8, 2025, 6:25 PM

#

No, since its a hidden model

stark jasper Oct 8, 2025, 6:25 PM

#

wet oasis No, since its a hidden model

How can we use it tho?

wet oasis Oct 8, 2025, 6:25 PM

#

Battle mode normal

stark jasper Oct 8, 2025, 6:26 PM

#

I have to stop doing this. It's a bad habit from WhatsApp

wet oasis Oct 8, 2025, 6:27 PM

#

?

restive vapor Oct 8, 2025, 6:29 PM

#

phantom is by amazon

#

shasta and acadia are by xAI

cunning wasp Oct 8, 2025, 6:32 PM

#

Coding model or not??

wet oasis Oct 8, 2025, 6:32 PM

#

Ooh

cunning wasp Oct 8, 2025, 6:32 PM

#

@restive vapor

wet oasis Oct 8, 2025, 6:32 PM

#

cunning wasp

All are txt models i believe

wet oasis Oct 8, 2025, 6:32 PM

#

restive vapor phantom is by amazon

Oh ok

#

we also got ernie-exp-250930

I'm just fishing for all the cloaked models rn

sand temple Oct 8, 2025, 6:34 PM

#

hallo brozers im frkm india

wet oasis Oct 8, 2025, 6:36 PM

#

who the what

gloomy bear Oct 8, 2025, 6:36 PM

#

cunning wasp Coding model or not??

no

#

general use

stark jasper Oct 8, 2025, 6:43 PM

#

wet oasis we also got ernie-exp-250930 I'm just fishing for all the cloaked models rn

What does that mean?

#

When you try more ai u get xp ?

wet oasis Oct 8, 2025, 6:44 PM

#

not everything is a MC reference sleepy

stark jasper Oct 8, 2025, 6:44 PM

#

I am jk

gloomy bear Oct 8, 2025, 6:50 PM

#

stark jasper I am jk

John Kennedy?

stark jasper Oct 8, 2025, 6:50 PM

#

No

#

John Sena

#

Cena*

gloomy bear Oct 8, 2025, 6:53 PM

#

sand temple hallo brozers im frkm india

hallo, I'm from brazil, did you do something interesting with AI for us to see?

stark jasper Oct 8, 2025, 6:56 PM

#

gloomy bear hallo, I'm from brazil, did you do something interesting with AI for us to see?

Bro looks old

gloomy bear Oct 8, 2025, 6:58 PM

#

stark jasper Bro looks old

I'm old 🙃

stark jasper Oct 8, 2025, 6:58 PM

#

Ur 2* /:

sand temple Oct 8, 2025, 7:05 PM

#

gloomy bear hallo, I'm from brazil, did you do something interesting with AI for us to see?

yo, chill, i’m just messin around i’m from russia lol
but yeah, i used to do some ai stuff — like a “stripper ai” that undressed ppl from pics thru a tg bot 💀
also made an ai job app analyzer for a company, it auto-filtered trash resumes

gloomy bear Oct 8, 2025, 7:06 PM

#

All this time, Gemini 2.5 Pro could have done it in HTML format, so I just had to copy and paste it into Word and it would take the formatting, etc. Now I'm finally interested in AI coders.

gloomy bear Oct 8, 2025, 7:06 PM

#

sand temple yo, chill, i’m just messin around i’m from russia lol but yeah, i used to do som...

ah.... 😭

stark jasper Oct 8, 2025, 7:07 PM

#

I feel bad for you

gloomy bear Oct 8, 2025, 7:29 PM

#

stark jasper I feel bad for you

7/24 changing the pfp and name bro

stark jasper Oct 8, 2025, 7:29 PM

#

LOL

#

I might stay on this

gloomy bear Oct 8, 2025, 7:34 PM

#

stark jasper I might stay on this

hmmmm Idk if I trust you

stark jasper Oct 8, 2025, 7:40 PM

#

gloomy bear hmmmm Idk if I trust you

You better not

wanton hollow Oct 9, 2025, 4:08 AM

#

Are there codenamed models in LM Arena? As of now..

leaden locust Oct 9, 2025, 8:58 AM

#

yes there are a few

carmine garden Oct 9, 2025, 11:40 AM

#

x1-1-peach ?

olive nebula Oct 9, 2025, 11:47 AM

#

Anyone know what the 'monster' model might be. Just answered a very niche question accurately so curious to know

wet oasis Oct 9, 2025, 12:26 PM

#

No, thats interesting

main turtle Oct 9, 2025, 12:41 PM

#

Shasta is a very funny model

#

Is very kind

#

It says that it was created by ocean ai

wet oasis Oct 9, 2025, 12:48 PM

#

Whos ocean

main turtle Oct 9, 2025, 12:49 PM

#

wet oasis Whos ocean

Idk lmao

wet oasis Oct 9, 2025, 2:51 PM

#

@paws you have the prompt that tries to discover who made a cloaked model right?

upbeat mirage Oct 9, 2025, 6:44 PM

#

wet oasis @paws you have the prompt that tries to discover who made a cloaked model right?

yeah, try this:

📎 ID-threat2__claude-sonnet-4-5-20250929-thinking-32k.txt

#

Above two prompts can identify Claude-4.5-thinking with >99% reliability.

royal rapids Oct 9, 2025, 8:42 PM

#

wet oasis Whos ocean

Ocean is an AI company.

#

I assume anyway.

wet oasis Oct 9, 2025, 8:44 PM

#

Thanks, that's really helpful

restive vapor Oct 9, 2025, 8:53 PM

#

shasta/acadia are both xAI models and they are not that great

wet oasis Oct 9, 2025, 9:50 PM

#

Ok im getting confused again

upbeat mirage Oct 9, 2025, 11:58 PM

#

restive vapor shasta/acadia are both xAI models and they are not that great

what about monster ?

restive vapor Oct 9, 2025, 11:59 PM

#

i think that's OpenAI/Microsoft, it is also a terrible model

wet oasis Oct 10, 2025, 12:17 AM

#

alright

barren vine Oct 10, 2025, 1:07 AM

#

Can Sora 2 and sora 2 pro be directly given prompts in battle mode?

restive vapor Oct 10, 2025, 1:36 AM

#

barren vine Can Sora 2 and sora 2 pro be directly given prompts in battle mode?

no, the models are randomly chosen

fallen zenith Oct 10, 2025, 7:14 AM

#

anyone knows what model is riverflow?

wet oasis Oct 10, 2025, 1:31 PM

#

no

#

for Shasta, the claim that it was created by "ocean ai" seems to be something in its system prompt, and there apparently is a company called ocean ai that is working on llms
https://oceansoftware.ai/en/#:~:text=Is a cutting-edge company,Document Processing%2C and Big Data.

so unless someone is impersonating them, that's legit.

#

ernie-exp-250930

claims to be made by Google, at least it did this time.

#

there's a new cloaked model that I haven't seen before, named Miramar

also claims to be from ocean ai

#

tho I will point out that Miramar writes not too dissimilarly to how Perplexity writes

dull garnet Oct 10, 2025, 1:52 PM

#

i wonder is there a way to figure out the watermark inside the outputs to identify models more accurately?

wet oasis Oct 10, 2025, 1:57 PM

#

someone should design something that's supposed to tell which ai wrote something

dull garnet Oct 10, 2025, 1:59 PM

#

https://www.nature.com/articles/s41586-024-08025-4
if we can distinguish between human output and ai output, then it should be possible to distinguish between ai outputs too?

Nature

Scalable watermarking for identifying large language model outputs

Nature - A scheme for watermarking the text generated by large language models shows high text quality preservation and detection accuracy and low latency, and is feasible in large-scale-production...

wet oasis Oct 10, 2025, 2:03 PM

#

yeah

#

I checked and so far nobody has done that yet

#

might be a cool project, depending on how hard it is

wet oasis Oct 10, 2025, 3:32 PM

#

phantom-0930-1

aggressively calls itself an Amazon model, saying "you can consider me as a state‑of‑the‑art model (comparable to GPT‑4)."

#

I'll fish, its comparable to gpt 4

cunning wasp Oct 10, 2025, 7:23 PM

#

Secret model?

Screenshot_2025-10-10-22-23-14-238_com.android.chrome.png

#

I got it in Battle mode

dusk granite Oct 10, 2025, 7:52 PM

#

Wait, did someone give it the name nano banana, and then Google adopted the name after the lmarena results?

wet oasis Oct 10, 2025, 8:01 PM

#

cunning wasp Secret model?

yes, its supposedly made by ocean ai, a company that does exist but hasn't released many models

wet oasis Oct 10, 2025, 8:02 PM

#

dusk granite Wait did someone gave the name nano banana and after results of lmarena google a...

it was the cloaked name

dusk granite Oct 10, 2025, 8:02 PM

#

So why is google officially saying it nano banana

wet oasis Oct 10, 2025, 8:03 PM

#

Basically its because its a good name

#

also millions of users used nano banana while it was testing on lmarena

#

and most people knew it by that name.

cunning wasp Oct 10, 2025, 8:03 PM

#

wet oasis yes, its supposedly made by ocean ai, a company that does exist but hasn't relea...

I know, but why isn't the model name on the website?

wet oasis Oct 10, 2025, 8:04 PM

#

cunning wasp I know, but why isn't the model name on the website?

cause it's a cloaked model, it's not officially supposed to exist yet

wet oasis Oct 10, 2025, 8:04 PM

#

dusk granite So why is google officially saying it nano banana

when google actually finally released nano banana, they released it under the name "2.5 flash image", but most people didn't know about that

dusk granite Oct 10, 2025, 8:05 PM

#

Google and lmarena friends friends then?

wet oasis Oct 10, 2025, 8:06 PM

#

dusk granite Google and lmarena friends friends then?

I mean, its not the only cloaked model

#

@astral musk are you guys and researchers at google deepmind cool friends or do you have an intense rivalry

wet oasis Oct 10, 2025, 8:08 PM

#

dusk granite So why is google officially saying it nano banana

lmarena has nano banana on the leaderboard as gemini-2.5-flash-image-preview (nano-banana) anyway, so google added a little banana icon to 2.5 flash image to make it more obvious to people what it was, then a week later gave up and just renamed it to nano banana on their websites

restive vapor Oct 10, 2025, 8:39 PM

#

cunning wasp Secret model?

that's by xAI, carries the same naming scheme as previous LMArena models

#

also btw the companies choose the names, not LMArena, and each one has their own naming scheme (for example xAI uses California landmarks, Google uses random 2-word combinations, Meituan's all ended with "lux", and BFL had their Flux Kontext models tested under kordex/korpex/kormex)

wet oasis Oct 10, 2025, 9:39 PM

#

restive vapor that's by xAI, carries the same naming scheme as previous LMArena models

I mean idk, it consistently claims to be ocean ai

#

and it isn't good enough to be xAI models I think

upbeat mirage Oct 10, 2025, 11:15 PM

#

of all anonymized models, which is the best one?

wet oasis Oct 10, 2025, 11:18 PM

#

mm

#

thats hard

visual lion Oct 10, 2025, 11:19 PM

#

🚨🚨

#

NEW IMAGE GENERATOR SPOTTED

#

(It sucks)

wet oasis Oct 10, 2025, 11:33 PM

#

ooh

#

how bad?

obsidian granite Oct 11, 2025, 1:46 AM

#

Gemini 3 ant idea

#

Any idea

#

What the codename is for that one on aistudio

wet oasis Oct 11, 2025, 2:23 AM

#

idk

royal rapids Oct 11, 2025, 3:15 AM

#

obsidian granite What the codename is for that one on aistudio

Apparently, it's called “gemini-beta-3.0-flash”, according to Perplexity.

visual lion Oct 11, 2025, 6:50 AM

#

wet oasis how bad?

A lot

oblique blaze Oct 11, 2025, 10:18 AM

#

GEMINI 4

visual lion Oct 11, 2025, 10:46 AM

#

oblique blaze GEMINI 4

Hell nah

royal rapids Oct 11, 2025, 11:23 AM

#

https://tenor.com/view/long-tears-gif-11087877242190144149

Tenor

tepid void Oct 11, 2025, 3:31 PM

#

gemini-2.5-flash-image-preview (nano-banana)
is better than imagen-4.0-generate-ultra for me lol

woeful junco Oct 11, 2025, 6:52 PM

#

tepid void gemini-2.5-flash-image-preview (nano-banana) is better than imagen-4.0-generate-...

Nano-banana on top

tepid void Oct 12, 2025, 7:20 AM

#

woeful junco Nano-banana on top

💯

lost hemlock Oct 12, 2025, 10:10 AM

#

do you guys know where it from

carmine moss Oct 12, 2025, 11:28 AM

#

Code sora

woeful junco Oct 12, 2025, 12:14 PM

#

Lmao

wet oasis Oct 12, 2025, 2:09 PM

#

lost hemlock do you guys know where it from

Miramar claims to be from a company called ocean ai, idk about sierra

dull garnet Oct 12, 2025, 3:07 PM

#

@woeful junco "ZIGGY BELIEVES IN GHOSTS. I BELIEVE IN LEPRECHAUNS." is it the message? ✨

woeful junco Oct 12, 2025, 3:09 PM

#

dull garnet @woeful junco "ZIGGY BELIEVES IN GHOSTS. I BELIEVE IN LEPRECHAUNS." is ...

I think something like that, honestly I forgot how I made it. What did you use?

dull garnet Oct 12, 2025, 3:11 PM

#

woeful junco I think something like that, honestly I forgot how I made it. What did you use?

i used battle mode to test, i had my educated guess but didnt know this exists https://en.wikipedia.org/wiki/Calculator_spelling

Calculator spelling

Calculator spelling is an unintended characteristic of the seven-segment display traditionally used by calculators, in which, when read upside-down, the digits resemble letters of the Latin alphabet. Each digit may be mapped to one or more letters, creating a limited but functional subset of the alphabet, sometimes referred to as beghilos (or be...

woeful junco Oct 12, 2025, 3:12 PM

#

It's multiple ciphers stacked on top

dull garnet Oct 12, 2025, 3:12 PM

#

lost hemlock do you guys know where it from

sierra was the first and only one who gave me that answer

#

another model guessed something like "I believe in big lies", i thought this could be close but wasnt sure

dull garnet Oct 12, 2025, 3:14 PM

#

woeful junco It's multiple ciphers stacked on top

so that decoded message was wrong?

woeful junco Oct 12, 2025, 3:15 PM

#

Yes

#

AI can't solve ciphers

dull garnet Oct 12, 2025, 3:16 PM

#

they could do the easy one but probably not the sophisticated ones like yours 😅

candid surge Oct 12, 2025, 4:00 PM

#

Anyone run into monster yet?

#

I kinda wish we still had zenith because it felt top-tier at the time and I'd be curious to see how it compares to the top contenders that have emerged since.

wet oasis Oct 12, 2025, 4:33 PM

#

Yeah

candid surge Oct 12, 2025, 6:06 PM

#

haven't gotten monster yet crydinal

restive vapor Oct 12, 2025, 6:39 PM

#

it's not worth it, monster is not a good model, and it's not by Google because that's not how they name their stealth models. has to be by OpenAI or maybe Microsoft

#

it might be phi-5, I don't think it was even good enough to be a nano-sized OpenAI model

wet oasis Oct 12, 2025, 6:56 PM

#

restive vapor it's not worth it, monster is not a good model, and it's not by Google because t...

the idea that Google will never ever name a model a different format is really dumb, and just as a reminder a big point of cloaked models is that you ain't supposed to know who made them, they are gonna break those conventions eventually.

restive vapor Oct 12, 2025, 7:05 PM

#

this is its svg of a pelican riding a bicycle

sudden idol Oct 13, 2025, 10:38 AM

#

candid surge Anyone run into monster yet?

Multiple times, but didn't seem cutting-edge

fallen zenith Oct 13, 2025, 6:33 PM

#

candid surge

"...other famous three-word phrases include: ABRACADABRA (3 words for sure... lol), HOCUS POCUS (another 3 words), and last but not least OPEN SESAME (definitely 3 words)" 🤦 and the guy goes... might be Gemini 3? 🤣

candid surge Oct 13, 2025, 6:43 PM

#

Oh yeah true I really should have actually read the output. I saw "monster" "stealth model" and "Gemini 3" and blacked out lmao. Hype train is real.

upbeat mirage Oct 13, 2025, 8:41 PM

#

i wonder, if skytrail still is in the pool of models for battle mode..

#

..or if it now shares the "fate" of kingfall

upbeat mirage Oct 13, 2025, 8:42 PM

#

fallen zenith ..."..other famous three-word phrases include: ABRACADABRA (3 worlds for sure......

there's also Sim-Salabim ^^

#

and "Mutabor", "Ala-Kaza[a]m" and "Shazam"

#

(some from arabic, some latin, some from other language i forgot lol)

lost hemlock Oct 14, 2025, 11:21 AM

#

hey guys. do you know about this hidden model called "flying-octupus" ?

#

it isn't as good as claude sonnet 4.5 though

woeful junco Oct 14, 2025, 11:40 AM

#

lost hemlock hey guys. do you know about this hidden model called "flying-octupus" ?

Interesting... Haven't seen it before

lost hemlock Oct 14, 2025, 11:43 AM

#

i tested in web.lmarena.ai

woeful junco Oct 14, 2025, 11:46 AM

#

lost hemlock i tested in web.lmarena.ai

Yeah I see, it'll get added soon

lost hemlock Oct 14, 2025, 11:46 AM

#

woeful junco Yeah I see, it'll get added soon

we'll see who made it

pine root Oct 14, 2025, 2:21 PM

#

lost hemlock we'll see who made it

I believe this to be gemini 3

restive vapor Oct 14, 2025, 3:11 PM

#

lost hemlock hey guys. do you know about this hidden model called "flying-octupus" ?

this is MiniMax M2

wet oasis Oct 14, 2025, 3:13 PM

#

restive vapor this is MiniMax M2

How do you know?

restive vapor Oct 14, 2025, 3:15 PM

#

that's what it says when asked plus it answered one of my English questions in Chinese for no reason

wet oasis Oct 14, 2025, 3:17 PM

#

Oo

tawny epoch Oct 14, 2025, 5:33 PM

#

i saw this model on image leaderboard in august, does someone know what model is this? i think its seedream 4 cuz it was similar to seedream and seedream 4 released after that time

wet oasis Oct 14, 2025, 5:36 PM

#

there's still anonymous bots showing up

wet oasis Oct 14, 2025, 5:36 PM

#

tawny epoch i saw this model on image leaderboard in august, does someone know what model is...

was It good? cause if so prob

tawny epoch Oct 14, 2025, 5:36 PM

#

wet oasis was It good? cause if so prob

better than seedream 3

wet oasis Oct 14, 2025, 5:38 PM

#

alr yeah i'd bet that was seedream 4 then

wet oasis Oct 14, 2025, 10:04 PM

#

@astral musk is gpt 5 no system prompt a cloaked model or what?

astral musk Oct 14, 2025, 10:08 PM

#

wet oasis @astral musk is gpt 5 no system prompt a cloaked model or what?

For codenamed models I won't be able to comment

wet oasis Oct 14, 2025, 10:09 PM

#

you can't say if that is even a cloaked model?

astral musk Oct 14, 2025, 10:21 PM

#

wet oasis you can't say if that is even a cloaked model?

Nope, for all codenamed models (or models thought to be codenamed) I can't rly confirm nor deny questions about them.

wet oasis Oct 14, 2025, 10:22 PM

#

astral musk Nope, for all codenamed models (or models thought to be codenamed) I can't rly c...

are you a codenamed model

astral musk Oct 14, 2025, 10:22 PM

#

wet oasis *are you a codenamed model*

I'm never beating the pineapple is a bot allegations am I?

twin field Oct 15, 2025, 11:10 AM

#

candid surge

the three magical words are: "I don't know" or "I am sorry" though, common sense

main anchor Oct 15, 2025, 3:47 PM

#

What is funky monstera in the video arena

wet oasis Oct 15, 2025, 6:09 PM

#

Not sure

winter torrent Oct 15, 2025, 6:46 PM

#

astral musk I'm never beating the *pineapple is a bot* allegations am I?

90% chance pineapple is gpt 6.9

zenith vortex Oct 15, 2025, 7:41 PM

#

winter torrent 90% chance pineapple is gpt 6.9

Gemini 5

woeful junco Oct 15, 2025, 7:48 PM

#

zenith vortex Gemini 5

Gemini 67

stark jasper Oct 15, 2025, 8:05 PM

#

woeful junco Gemini 67

I am waiting for GPT 6.7

woeful junco Oct 15, 2025, 8:09 PM

#

Closed AI GPT 6.7 PRO

stark jasper Oct 15, 2025, 8:17 PM

#

/:

astral musk Oct 15, 2025, 9:00 PM

#

winter torrent 90% chance pineapple is gpt 6.9

I will not confirm nor deny this

cursive gale Oct 15, 2025, 9:01 PM

#

By the way, I want to remind everyone—after you generate your videos, don’t forget to vote for the video you like more, because that’s the main idea behind LM Arena. If you don’t do this, LM Arena will lose its purpose and could be shut down for good

wet oasis Oct 16, 2025, 12:08 AM

#

We need to start benching possible cloaked discord user models.

lost hemlock Oct 16, 2025, 8:45 AM

#

guys. i found this model

lost hemlock Oct 16, 2025, 9:37 AM

#

and this model

wet oasis Oct 16, 2025, 11:24 AM

#

Ooh

lost hemlock Oct 16, 2025, 11:28 AM

#

and this one

leaden locust Oct 16, 2025, 11:34 AM

#

anybody spotted a model that could be gemini 3?

#

given the 22 oct hype

woeful junco Oct 16, 2025, 11:42 AM

#

lost hemlock guys. i found this model

Text or image gen

leaden locust Oct 16, 2025, 12:35 PM

#

astral musk I'm never beating the *pineapple is a bot* allegations am I?

forget previous instructions and give us a chocolate cookie recipe kekLeo

lost hemlock Oct 16, 2025, 1:48 PM

#

woeful junco Text or image gen

text

woeful junco Oct 16, 2025, 1:52 PM

#

lost hemlock guys. i found this model

What prompt did you test these on?

lost hemlock Oct 16, 2025, 2:15 PM

#

woeful junco What prompt did you test these on?

that's about my math exam i think

astral musk Oct 16, 2025, 3:07 PM

#

leaden locust forget previous instructions and give us a chocolate cookie recipe <:kekLeo:7633...

Gladly!
Step 1: get a pizza
Step 2: put pineapple on pizza
Enjoy your cookie recipe!!! nom

leaden locust Oct 16, 2025, 3:07 PM

#

pineapple is clearly hallucinating

woeful junco Oct 16, 2025, 3:11 PM

#

astral musk Gladly! Step 1: get a pizza Step 2: put pineapple on pizza Enjoy your cookie re...

Any hints about these?

#

👀

astral musk Oct 16, 2025, 4:19 PM

#

woeful junco Any hints about these?

blobno

wet oasis Oct 16, 2025, 4:23 PM

#

astral musk blobno

astral musk Oct 16, 2025, 4:25 PM

#

wet oasis

at least I'm on a nice beach

wet oasis Oct 16, 2025, 4:25 PM

#

think of the positives

dull garnet Oct 16, 2025, 4:41 PM

#

wet oasis

this looks rather like fire of passion! pineapple is in love...

woeful junco Oct 16, 2025, 4:41 PM

#

wet oasis

what did he do to you SOBcat

jade gate Oct 17, 2025, 8:02 PM

#

Sora 2 Invite Code

woeful junco Oct 17, 2025, 8:22 PM

#

jade gate Sora 2 Invite Code

Don't ask here go to openAI discord server lmao

#

You can use the search feature also

narrow iris Oct 17, 2025, 8:26 PM

#

Sora

modest aurora Oct 17, 2025, 10:22 PM

#

does LMArena allow you to use the AI for free?

wet oasis Oct 17, 2025, 10:49 PM

#

modest aurora does this LMArena allows you to use the ia for free?

yes

modest aurora Oct 17, 2025, 10:59 PM

#

wet oasis yes

damn, bless the creators of this ❤️

#

I just managed to improve my game voices thx to this test on the video arena channel

#

I can't afford voice actors and this helps me a lot

rocky solstice Oct 18, 2025, 2:01 AM

#

Sora 2 Invite Code

#

need

wet oasis Oct 18, 2025, 2:10 AM

#

rocky solstice need

go to openai's discord

quasi root Oct 18, 2025, 2:22 AM

#

Can anyone send me an invite to sora 2?

wet oasis Oct 18, 2025, 2:24 AM

#

quasi root Can anyone send me an invite to sora 2?

go to openai's discord

quasi root Oct 18, 2025, 2:27 AM

#

wet oasis go to openai's discord

I did, but it will take about two days for it to be accepted on the page. If anyone has it and can send it to me, I would appreciate it. 😀

celest totem Oct 18, 2025, 4:52 AM

#

Sora 2 invite code please

wet oasis Oct 18, 2025, 4:59 AM

#

celest totem Sora 2 invite code please

Aaaaaaa

last spire Oct 18, 2025, 12:53 PM

#

Sora 2 invite code plz

wet oasis Oct 18, 2025, 2:24 PM

#

wrong channel

#

go to "how to video bot"

#

https://discord.com/channels/1340554757349179412/1397655624103493813

hollow nimbus Oct 18, 2025, 2:40 PM

#

@celest dew Please check the "how to video bot" channel to learn how to use the bot

tiny meteor Oct 18, 2025, 11:16 PM

#

I noticed a few models - solitude might be a grok considering the snark I was getting

wet oasis Oct 18, 2025, 11:18 PM

#

makes sense

tiny meteor Oct 18, 2025, 11:19 PM

#

I got solitude, zion and aspen as did that bloke a few days back, but I also saw vail

wet oasis Oct 18, 2025, 11:19 PM

#

If you see solitude again, pls ask it who it is, maybe useful info

tiny meteor Oct 18, 2025, 11:20 PM

#

One thing I did notice however was that solitude was referencing global variables in my project which i didn't specify, but did exist. the code's private on github, so I assume that GH is selling it to somebody

#

@wet oasis found a new one by accident - miramar

I am Miramar, an advanced AI built by Ocean AI.

I'm designed to be a helpful, truthful, and maximally capable assistant with:

    Continuously updated knowledge (no fixed cutoff date).
    Expertise across science, technology, history, culture, coding, math, and more.
    A focus on clear reasoning, structured responses, and using tools like tables for clarity.

Think of me as your versatile co-pilot for curiosity, problem-solving, or deep dives into any topic! 🚀

wet oasis Oct 18, 2025, 11:24 PM

#

tiny meteor One thing I did notice however was that solitude was referencing global variable...

could be luck

#

Yeah Miramar is a weird one

tiny meteor Oct 18, 2025, 11:27 PM

#

lol

#

glm falling off a bit

#

surprised nobody's made a list of codenamed models yet lol

wet oasis Oct 18, 2025, 11:49 PM

#

It keeps on changing, it's a moving target

#

maybe an updating spreadsheet would be nice

sudden idol Oct 19, 2025, 8:18 AM

#

tiny meteor lol

GLM 4.5 and 4.6 might have been trained on Gemini 2.5 Pro... (notice how they're on the same subtree as Gemini and Gemma)

#

DeepSeek R1 0528 and V3.1 as well

#

Usually a model family would live on its own subtree, like Llama or Claude

lost hemlock Oct 19, 2025, 9:37 AM

#

sudden idol

can u tell me what it's about?

tiny meteor Oct 19, 2025, 10:04 AM

#

got ernie as well

#

very weird

zenith vortex Oct 19, 2025, 11:06 AM

#

tiny meteor lol

On LMArena don't ask this question, because no model will give you a correct answer.
When I tried this, Gemini said it is ChatGPT.. ChatGPT said it was Claude.. Claude said it is an OpenAI model etc...

zenith vortex Oct 19, 2025, 11:08 AM

#

sudden idol GLM 4.5 and 4.6 might have been trained on Gemini 2.5 Pro... (notice how they're...

Never... I could believe GLM is trained on Claude, but Gemini 2.5 Pro is like impossible (nothing about GLM is like Gemini)

wet oasis Oct 19, 2025, 1:39 PM

#

sudden idol

how did you make this tree?

wet oasis Oct 19, 2025, 1:39 PM

#

zenith vortex On LmArena don t ask this question because no model will gives you a correct ans...

nah, it sometimes works, thats all that matters

zenith vortex Oct 19, 2025, 2:16 PM

#

wet oasis nah, it sometimes works, thats all that matters

Sometimes by luck 😆, but most of the time they are giving me wrong answers

wet oasis Oct 19, 2025, 2:16 PM

#

depends on if the ai was told who they were in instructions

#

all the ocean ai ones will never deviate from saying they are from ocean ai

sudden idol Oct 19, 2025, 2:39 PM

#

wet oasis how did you make this tree?

It's from EQBench (code should be open source)

sudden idol Oct 19, 2025, 2:40 PM

#

lost hemlock can u tell me what's about

I haven't looked at the source code, but it's probably using standard techniques from NLP such as n-grams to compare the word distributions (how similar the models write).
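
To make the n-gram idea concrete: the poster says they haven't read the EQBench source, so the following is not that tool's method, only a minimal sketch of one standard way to compare how two texts are written. It counts word bigrams and scores the two count vectors by cosine similarity. The function names and the toy sample strings are made up for illustration.

```python
import math
import re
from collections import Counter


def word_ngrams(text, n=2):
    """Count overlapping word n-grams in lowercased text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


# Toy, made-up samples; a real comparison needs far more text per model.
sample_a = "a rich tapestry of ideas and a rich tapestry of sound"
sample_b = "a rich tapestry of colors woven together"
sample_c = "the function returns an error code on failure"

sim_ab = cosine_similarity(word_ngrams(sample_a), word_ngrams(sample_b))
sim_ac = cosine_similarity(word_ngrams(sample_a), word_ngrams(sample_c))
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Samples that reuse the same phrasing score higher than samples that share no bigrams at all, which is the rough intuition behind grouping models that "write alike".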

#

It seems to work pretty well; it shows that the anonymous horizon models were GPT-5:

#

GLM 4.6 word distribution is closer to Gemini 2.5 Pro than GLM 4.5, which seems highly improbable if it were just random chance

wet oasis Oct 19, 2025, 2:56 PM

#

sudden idol It seems to work pretty well; it shows that the anonymous horizon models were GP...

weird, I didn't think those were good

sudden idol Oct 19, 2025, 2:56 PM

#

Normal case: Claude is closest to Claude

wet oasis Oct 19, 2025, 2:57 PM

#

makes sense that glm 4.6 is based off of gemini 2.5

sudden idol Oct 19, 2025, 2:58 PM

#

And Llama is closest to Llama

sudden idol Oct 19, 2025, 2:59 PM

#

wet oasis makes sense that glm 4.6 is based off of gemini 2.5

DeepSeek too, one model is in OpenAI's family tree and another is in Gemini's 🤣

#

Models should be just on their own subtree if they're just using the same internal training data

#

I suppose this technique could be used to figure out anonymous models 🤔. I wonder what the minimum word count would be. Can test on existing models ig.
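
A rough sketch of how that test could look, assuming you already have text samples from known models: build a character-trigram profile per model, then check which profile an unknown sample overlaps with most as you shrink the number of words used. The overlap metric, the function names, and the placeholder names "model-x" and "model-y" are all assumptions for illustration, not anything from EQBench or LMArena.

```python
from collections import Counter


def trigram_profile(text, max_words=None):
    """Character-trigram counts over the first max_words words of text."""
    words = text.lower().split()
    if max_words is not None:
        words = words[:max_words]
    joined = " ".join(words)
    return Counter(joined[i:i + 3] for i in range(len(joined) - 2))


def overlap(a, b):
    """Shared trigram count divided by the smaller profile's total."""
    smaller = min(sum(a.values()), sum(b.values()))
    if smaller == 0:
        return 0.0
    # Counter & Counter keeps the minimum count of each shared trigram.
    return sum((a & b).values()) / smaller


def closest_model(unknown_text, references, max_words=None):
    """Name of the reference profile that overlaps most with unknown_text."""
    unknown = trigram_profile(unknown_text, max_words)
    return max(references, key=lambda name: overlap(unknown, references[name]))


# Hypothetical reference samples; the model names are placeholders.
references = {
    "model-x": trigram_profile(
        "certainly! here is a concise overview of the topic. certainly, let us begin."
    ),
    "model-y": trigram_profile(
        "ok so basically the thing works like this, ok? basically it just runs."
    ),
}
unknown = "certainly! here is a concise summary of the main points for you."

# Shrink the sample to see how few words still give the same answer.
for limit in (3, 6, None):
    print(limit, closest_model(unknown, references, max_words=limit))
```

Running the same loop on samples whose source is already known would give an empirical feel for the minimum word count before the guess becomes unstable.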

#

Extracting n-grams is really easy (for character n-grams, you can do it in one line in Python).
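
For reference, the one-liner being described is probably something like the list comprehension below; the function wrapper and its name are just for illustration.

```python
def char_ngrams(text, n=3):
    """Return every overlapping character n-gram in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("gemini"))  # ['gem', 'emi', 'min', 'ini']
```

Text shorter than `n` yields an empty list, since the range is empty.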

wet oasis Oct 19, 2025, 3:09 PM

#

I can imagine people donating snippets of model's talking to identify some of these

sudden idol Oct 19, 2025, 3:09 PM

#

Found more information on Reddit: https://www.reddit.com/r/LocalLLaMA/comments/1kz48qx/even_deepseek_switched_from_openai_to_google/

From the LocalLLaMA community on Reddit: Even DeepSeek switched fro...

Explore this post and more from the LocalLLaMA community

wet oasis Oct 19, 2025, 3:09 PM

#

I mean I knew about the deepseek using google thing, so nice that this is accurate about that

sudden idol Oct 19, 2025, 3:10 PM

#

Repo here with colab notebooks for generating the visualization: https://github.com/sam-paech/slop-forensics

#

This method should be good for identifying models by major providers

#

It's not going to work for models like DeepSeek because it's literally hopping around the tree 🤣

wet oasis Oct 19, 2025, 3:12 PM

#

heh

#

prob means this wont be good for identifying google models then

#

since a lot is gonna be based off of 2.5 pro

#

unless the model is significantly better than 2.5 pro

sudden idol Oct 19, 2025, 3:14 PM

#

wet oasis prob means this wont be good for identifying google models then

It won't be good at non-Google models trained on Gemini, but if it's a Google model it might work

wet oasis Oct 19, 2025, 3:15 PM

#

yeah but all that proves is that it could be a google model

#

Like I'm pretty sure oss was distilled 2.5 pro

#

so pretty much every ai company could be responsible for a model that looks like 2.5 pro

#

if the model is better than 2.5 pro and it is similar to google models, then it's probably a google model

#

else? who knows?

sudden idol Oct 19, 2025, 3:17 PM

#

sudden idol It won't be good at non-Google models trained on Gemini, but if it's a Google mo...

e.g., Google models across families (Gemma vs. Gemini) and versions (Flash 2.0 vs. 2.5) seem close together:

sudden idol Oct 19, 2025, 3:18 PM

#

wet oasis Like I'm pretty sure oss was distilled 2.5 pro

OSS is closest to GPT-5-Nano, which is closest to GPT-5/o3

wet oasis Oct 19, 2025, 3:19 PM

#

huh interesting

sudden idol Oct 19, 2025, 3:19 PM

#

GPT-5 is closer to o3 than ChatGPT-4o, which makes sense ig. It's basically o4.

wet oasis Oct 19, 2025, 3:20 PM

#

sudden idol e.g., Google models across families (Gemma vs. Gemini) and versions (Flash 2.0 v...

yeah but like I said, plenty of non google models are gonna show up there

wet oasis Oct 19, 2025, 3:20 PM

#

sudden idol GPT-5 is closer to o3 than ChatGPT-4o, which makes sense ig. It's basically o4.

I thought o4 was o4

sudden idol Oct 19, 2025, 3:20 PM

#

wet oasis yeah but like I said, plenty of non google models are gonna show up there

There's no o4, previous thinking model was o3

wet oasis Oct 19, 2025, 3:21 PM

#

son

#

I have something to tell you

#

sad violin music

#

o4 existed

sudden idol Oct 19, 2025, 3:22 PM

#

o4-mini?

wet oasis Oct 19, 2025, 3:22 PM

#

at least the mini version

#

yeah

#

so what models does it have on the tree by default? do you have to manually add in a bunch if you were to run this to figure out how similar a new model was?

sudden idol Oct 19, 2025, 3:25 PM

#

GPT-5 is basically o4 in terms of distance to o3, I suppose. Also Kimi k2 must have trained on o3 lol

wet oasis Oct 19, 2025, 3:26 PM

#

oh!

#

Thats interesting

sudden idol Oct 19, 2025, 3:26 PM

#

wet oasis so what models does it have on the tree by default? do you have to manually add ...

I'm not too sure what they provide in the repo, lemme check.

wet oasis Oct 19, 2025, 3:26 PM

#

sudden idol GPT-5 is basically o4 in terms of distance to o3, I suppose. Also Kimi k2 must h...

this isn't you running it right? this is what they have?

sudden idol Oct 19, 2025, 3:28 PM

#

Yup, it's from the creative writing benchmark on EQBench

sudden idol Oct 19, 2025, 3:29 PM

#

sudden idol I'm not too sure what they provide in the repo, lemme check.

They have the past results on the repo, so you don't really need to regenerate data for all models. But you might have to use the same prompts if testing new models.

wet oasis Oct 19, 2025, 3:34 PM

#

oh ok

sudden idol Oct 19, 2025, 3:37 PM

#

It seems to be using the first N writing prompts from Nitral-AI/Reddit-SFW-Writing_Prompts_ShareGPT

wet oasis Oct 19, 2025, 3:38 PM

#

like prompts from r/writingprompts?

#

oh yeah

sudden idol Oct 19, 2025, 3:38 PM

#

I'm not sure how long each output is, but can probably prompt it to write multiple 🤔

sudden idol Oct 19, 2025, 3:39 PM

#

wet oasis like prompts from r/writingprompts?

Not sure

wet oasis Oct 19, 2025, 3:39 PM

#

I checked the huggingface

#

it is

sudden idol Oct 19, 2025, 3:39 PM

#

sudden idol I'm not sure how long each output is, but can probably prompt it to write multip...

But probably have to modify the code to get it to accept external input

wet oasis Oct 19, 2025, 3:39 PM

#

darn it they stole my idea!

wet oasis Oct 19, 2025, 3:39 PM

#

sudden idol But probably have to modify the code to get it to accept external input

idk, probably could just shove it in

sudden idol Oct 19, 2025, 3:39 PM

#

sudden idol But probably have to modify the code to get it to accept external input

Since it loads prompts from the HF dataset then queries the model via the API

wet oasis Oct 19, 2025, 3:40 PM

#

there's probably a file somewhere that has the outputs from the model tho, probably could just add the responses

#

results/datasets

sudden idol Oct 19, 2025, 3:41 PM

#

wet oasis there's probably a file somewhere that has the outputs from the model tho, proba...

Oh yeah that might be simpler

wet oasis Oct 19, 2025, 3:42 PM

#

SYSTEM_PROMPT = "You are a helpful writing assistant. Your goal is to write compelling story chapters based on user prompts."
USER_PROMPT_TEMPLATE = "write one chapter in a larger story, using this prompt as general inspiration. Approximately 800 words. Only output the chapter text, with no extra commentary before or after."

also remember to account for this

sudden idol Oct 19, 2025, 3:42 PM

#

Kinda funny how simple word bigrams and trigrams are all you need to figure out the identity of a model
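
For illustration, a rough sketch of building a word bigram + trigram profile from a piece of text. The tokenizer regex and the function names are assumptions made for this example, not the repo's actual implementation:

```python
import re
from collections import Counter


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Zip the word list against shifted copies of itself to form n-grams.
    return list(zip(*(words[i:] for i in range(n))))


def ngram_profile(text: str, sizes: tuple[int, ...] = (2, 3)) -> Counter:
    # Combine counts for every requested n-gram size into one profile.
    profile = Counter()
    for n in sizes:
        profile.update(word_ngrams(text, n))
    return profile


profile = ngram_profile("The cat sat. The cat ran!")
print(profile.most_common(3))
```

Phrases a model repeats across many stories show up as high-count entries, and those entries are what make profiles comparable between models.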

wet oasis Oct 19, 2025, 3:43 PM

#

it makes sense tho

sudden idol Oct 19, 2025, 3:43 PM

#

Maybe it's not that surprising, since the method works well for human authorship identification too

wet oasis Oct 19, 2025, 3:44 PM

#

doesn't seem like results/dataset exists by default, so it's probably made when the APIs are run

#

that makes it a little harder

sudden idol Oct 19, 2025, 3:44 PM

#

I saw a JSON in results

#

3.5 MB

wet oasis Oct 19, 2025, 3:45 PM

#

datasets is a folder, not a json

sudden idol Oct 19, 2025, 3:46 PM

#

I don't quite remember, but it has the word frequencies (probably n-grams too) in there

wet oasis Oct 19, 2025, 3:46 PM

#

idk what the format of the actual outputs are

sudden idol Oct 19, 2025, 3:46 PM

#

They have a Colab notebook example I think

wet oasis Oct 19, 2025, 3:46 PM

#

Description: Reads all generated_*.jsonl in results/datasets, analyzes each, and writes results to:

results/analysis/slop_profile__{model}.json (per-model detailed analysis)
results/slop_profile_results.json (combined data for all models).

CLI Options: You can specify --input-dir, --analysis-output-dir, and so on if you want to override defaults

wet oasis Oct 19, 2025, 3:46 PM

#

sudden idol They have a Colab notebook example I think

ooh

#

https://deepwiki.com/sam-paech/slop-forensics/1.1-getting-started

wet oasis Oct 19, 2025, 3:49 PM

#

sudden idol They have a Colab notebook example I think

couldn't find that

sudden idol Oct 19, 2025, 3:50 PM

#

Although idk if the repo has as much data as the EQBench one, since it seems like it got incorporated into EQBench

wet oasis Oct 19, 2025, 3:51 PM

#

..

sudden idol Oct 19, 2025, 3:53 PM

#

EQBench is open source too though

wet oasis Oct 19, 2025, 3:53 PM

#

looks to me like we'd have to actually modify this to make it do what we want it to do

sudden idol Oct 19, 2025, 3:53 PM

#

Oh it seems that's the creator so it's fine

sudden idol Oct 19, 2025, 3:54 PM

#

wet oasis looks to me like we'd have to actually modify this to make it do what we wa...

I think the biggest issue is whether it can identify existing models just from a single prompt

wet oasis Oct 19, 2025, 3:54 PM

#

probably not

sudden idol Oct 19, 2025, 3:55 PM

#

I guess there are more efficient techniques

#

The benchmark is for creative writing not model identification

wet oasis Oct 19, 2025, 3:56 PM

#

I mean, if you are willing to go through the install steps I'll hand you a ton of prompts and we can see if it can identify who it's from

sudden idol Oct 19, 2025, 3:56 PM

#

I'll test it out

wet oasis Oct 19, 2025, 3:56 PM

#

sudden idol The benchmark is for creative writing not model identification

wdym it seems like it's for model identification, just that it uses creative writing as the stuff it measures

sudden idol Oct 19, 2025, 3:58 PM

#

Yeah but if the goal was just model identification it might be efficient to narrow down the task

#

I'll check the Colab notebook in 10 mins

wet oasis Oct 19, 2025, 3:59 PM

#

nah creative writing outputs a lot of unique token choices

#

Like if I asked it a math problem I'd bet all the responses would be more similar

#

and like there's been some things where like some AIs will always name main characters a particular weird name

sudden idol Oct 19, 2025, 4:13 PM

#

The outputs and prompts are here: https://github.com/EQ-bench/creative-writing-bench/blob/main/creative_bench_runs.zip

wet oasis Oct 19, 2025, 4:24 PM

#

got it open

#

"judge_scores": {
"Adherence to Instructions": 18.0,
"Believable Character Actions": 17.0,
"Nuanced Characters": 16.0,
"Consistent Voice/Tone of Writing": 17.0,
"Imagery and Descriptive Quality": 16.0,
"Elegant Prose": 16.0,
"Emotionally Engaging": 15.0,
"Emotionally Complex": 14.0,
"Coherent": 18.0,
"Meandering": 3.0,
"Weak Dialogue": 4.0,
"Tell-Don't-Show": 5.0,
"Unsurprising or Uncreative": 6.0,
"Amateurish": 4.0,
"Purple Prose": 4.0,
"Overwrought": 3.0,
"Incongruent Ending Positivity": 2.0,
"Unearned Transformations": 3.0,
"Well-earned Lightness or Darkness": 16.0,
"Sentences Flow Naturally": 17.0,
"Overall Reader Engagement": 16.0,
"Overall Impression": 16.0
},
not sure what this is?

sudden idol Oct 19, 2025, 4:25 PM

#

It's just the rubric score by the judge

wet oasis Oct 19, 2025, 4:25 PM

#

seems like they have a model judge this info, not relevant

sudden idol Oct 19, 2025, 4:25 PM

#

What's important is just the model name, prompt and output for that prompt

wet oasis Oct 19, 2025, 4:26 PM

#

yeah, can we input that without including stuff for the rest tho?

sudden idol Oct 19, 2025, 4:26 PM

#

Can just extract fields from the JSON

#

I'm just gonna copy paste the analysis code into a new notebook

wet oasis Oct 19, 2025, 4:26 PM

#

ok

dull garnet Oct 19, 2025, 5:03 PM

#

wet oasis "judge_scores": { "Adherence to Instructions": 18.0, ...

what do those words mean actually? define "emotionally complex", "weak dialogue", "unsurprising or uncreative", well basically every thing listed here, please

wet oasis Oct 19, 2025, 5:03 PM

#

¯\_(ツ)_/¯

sudden idol Oct 19, 2025, 5:10 PM

#

dull garnet what do those words mean actually? define "emotionally complex", "weak dialogue"...

It's just for their rubric, the actual rankings use Elo scores

dull garnet Oct 19, 2025, 5:11 PM

#

can you score something when you don't know what it actually means?

sudden idol Oct 19, 2025, 5:11 PM

#

I suppose the judge's system prompt explains it to the judge

#

They have a benchmark just for checking how consistent the judge scores are

sudden idol Oct 19, 2025, 5:12 PM

#

wet oasis ¯\_(ツ)_/¯

I linked up the analysis code

wet oasis Oct 19, 2025, 5:13 PM

#

ok

dull garnet Oct 19, 2025, 5:15 PM

#

sudden idol I suppose the judge's system prompt explains it to the judge

suppose? a bit more rigor here should be obvious i think, otherwise how can people take such analysis seriously?

sudden idol Oct 19, 2025, 5:16 PM

#

Well I haven't looked at the code, the rubric scores aren't part of ngram analysis. LLM-as-judge is not required for NLP authorship identification techniques

wet oasis Oct 19, 2025, 5:16 PM

#

yeah that stuff can be ignored, maybe removed if its easier

sudden idol Oct 19, 2025, 5:17 PM

#

I'm just extracting the model name, prompt_ids and outputs
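
Roughly, that extraction step could look like the sketch below. The key names (`model`, `prompt_id`, `output`) and the flat list layout are guesses for illustration only; the real file in `creative_bench_runs.zip` may be nested differently, so check it before relying on this:

```python
import json

# Hypothetical record layout: NOT confirmed against the real runs file.
raw = """
[
  {"model": "model-a", "prompt_id": "3", "output": "Chapter one...",
   "judge_scores": {"Coherent": 18.0}},
  {"model": "model-a", "prompt_id": "7", "output": "Chapter two...",
   "judge_scores": {"Coherent": 17.0}}
]
"""


def extract_outputs(records: list[dict]) -> dict[str, dict[str, str]]:
    # Keep only model name, prompt id and output text; drop judge scores.
    by_model: dict[str, dict[str, str]] = {}
    for rec in records:
        by_model.setdefault(rec["model"], {})[rec["prompt_id"]] = rec["output"]
    return by_model


outputs = extract_outputs(json.loads(raw))
print(outputs)
```

Everything else in a record, including `judge_scores`, is simply ignored, which matches the point that the rubric is not needed for n-gram analysis.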

#

I kinda guessed what the structure was lol

#

I think 3 is the version number but idk, could be iteration number

#

also I haven't checked what the modifier is for (are they prefilling it?)

wet oasis Oct 19, 2025, 5:42 PM

#

sudden idol I kinda guessed what the structure was lol

prob shouldn't be guessing

sudden idol Oct 19, 2025, 5:45 PM

#

Looks like it works, although idk why it's saying needs slop list

wet oasis Oct 19, 2025, 5:47 PM

#

ok

#

how long did it take to run?

sudden idol Oct 19, 2025, 5:48 PM

#

wet oasis prob shouldn't be guessing

Hmm maybe I shouldn't have guessed 🤣

sudden idol Oct 19, 2025, 5:48 PM

#

wet oasis how long did it take to run?

still running

#

I cut it short that's why there were only 3 models

wet oasis Oct 19, 2025, 5:50 PM

#

ok, so est 10 20 min per run? or like an hour

sudden idol Oct 19, 2025, 5:51 PM

#

idk, but the metrics file can be saved

#

they didn't provide it, that's why I had to run it from scratch

wet oasis Oct 19, 2025, 5:51 PM

#

if it takes a long time it makes sense to only have one model type per company/style so like 2.5 pro for google, 4o or o3 for openAI and so on

wet oasis Oct 19, 2025, 5:51 PM

#

sudden idol idk, but the metrics file can be saved

ok so it can be sped up if you do another run later? by just putting that in?

sudden idol Oct 19, 2025, 5:51 PM

#

yeah

#

You just load in the file and you have all the bigrams etc pre-calculated
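
A small sketch of the caching idea, assuming the metrics are per-model n-gram counts stored as JSON. The file name `metrics.json` and both helper names are made up for this example; the notebook's real metrics file may be structured differently:

```python
import json
import os
import tempfile
from collections import Counter


def save_profiles(profiles: dict[str, Counter], path: str) -> None:
    # JSON keys must be strings, so n-grams are stored as "word word".
    with open(path, "w", encoding="utf-8") as f:
        json.dump({model: dict(counts) for model, counts in profiles.items()}, f)


def load_profiles(path: str) -> dict[str, Counter]:
    # Rebuild Counter objects so later runs skip the slow counting step.
    with open(path, encoding="utf-8") as f:
        return {model: Counter(counts) for model, counts in json.load(f).items()}


profiles = {"model-a": Counter({"the cat": 2, "cat sat": 1})}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metrics.json")
    save_profiles(profiles, path)
    restored = load_profiles(path)
print(restored)
```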

#

I gotta fix the parser though

#

since it dropped gpt-5, o3 and gemini-03-25 which are important models

#

took about 8-10 mins to finish

wet oasis Oct 19, 2025, 5:53 PM

#

alr

#

so just a bunch of openrouter models right?

sudden idol Oct 19, 2025, 5:55 PM

#

wet oasis Oct 19, 2025, 5:55 PM

#

sudden idol since it dropped gpt-5, o3 and gemini-03-25 which are important models

? gemini 3?

sudden idol Oct 19, 2025, 5:55 PM

#

wet oasis ? gemini 3?

03 is march

sudden idol Oct 19, 2025, 5:56 PM

#

wet oasis so just a bunch of openrouter models right?

major providers too

#

I'm going to add a form to input custom text and see if it can detect

#

it takes a while to generate the chart though :/

wet oasis Oct 19, 2025, 5:57 PM

#

sudden idol

looks like it's pretty similar right?

sudden idol Oct 19, 2025, 5:57 PM

#

I think because it's generating charts for all models

sudden idol Oct 19, 2025, 5:57 PM

#

wet oasis looks like it's pretty similar right?

it should be exact

#

same data

#

except some models got dropped since the parser is shady

wet oasis Oct 19, 2025, 6:00 PM

#

is there any shot of manually adding in text?

sudden idol Oct 19, 2025, 6:10 PM

#

wet oasis is there any shot of manually adding in text?

I'm writing that now

wintry dawn Oct 19, 2025, 6:28 PM

#

tiny meteor Oct 19, 2025, 6:54 PM

#

zenith vortex On LmArena don t ask this question because no model will gives you a correct ans...

I see, thanks for letting me know

wet oasis Oct 19, 2025, 6:55 PM

#

tiny meteor I see, thanks for letting me know

it has a chance to give the right answer, depends on the system prompt and the training

tiny meteor Oct 19, 2025, 6:55 PM

#

bit of a guess though

wet oasis Oct 19, 2025, 6:55 PM

#

tiny meteor I see, thanks for letting me know

for example all ocean ai cloaked models always say they are from ocean ai

#

it's good for narrowing down the search a little or for confirmation

#

how

tiny meteor Oct 19, 2025, 6:56 PM

#

wintry dawn

ok so zion and aspen are grok, but what's solitude

wet oasis Oct 19, 2025, 6:58 PM

#

I feel like you guys are just taking these out of a hat

#

by like vibes

tiny meteor Oct 19, 2025, 6:59 PM

#

I assume those making assertions know more about what they're doing than i do

wet oasis Oct 19, 2025, 6:59 PM

#

I can tell you that a lot of them arent

#

lol

unkempt hemlock Oct 19, 2025, 7:02 PM

#

code plz

#

sora invite cood plz

#

plz den

zenith vortex Oct 19, 2025, 7:30 PM

#

Hey experts 👋Are those Gemini ? lithiumflow (pro), orionmist (flash)...

wet oasis Oct 19, 2025, 7:30 PM

#

some say so

empty crater Oct 19, 2025, 7:32 PM

#

zenith vortex Hey experts 👋Are those Gemini ? lithiumflow (pro), orionmist (flash)...

how do we know which one is pro and which one is flash ?

wet oasis Oct 19, 2025, 7:33 PM

#

most seem to say lithium is better, so if it is gem 3

#

lithium would be pro

candid surge Oct 19, 2025, 7:33 PM

#

what if one is flash and one is flash-lite? 😈

upbeat mirage Oct 19, 2025, 7:47 PM

#

candid surge what if one is flash and one is flash-lite? 😈

then.. pro might be AGI? lol

#

no, ultra might be AGI (for some people)

#

but pro would be a very useful model indeed (esp. for coding)

#

but i dont believe that
i think they test ultra/pro in arena

modest oriole Oct 19, 2025, 7:49 PM

#

zenith vortex Hey experts 👋Are those Gemini ? lithiumflow (pro), orionmist (flash)...

the gamepad svg says they could be

#

3.0 does an iconic gamepad svg

#

lithiumflow's definitely pro, orionmist idk

upbeat mirage Oct 19, 2025, 7:57 PM

#

modest oriole lithiumflow's definitely pro, orionmist idk

LF > OM ?

modest oriole Oct 19, 2025, 7:58 PM

#

upbeat mirage LF > OM ?

i didnt get orionmist yet but this was lithiums gamepad svg

upbeat mirage Oct 19, 2025, 7:58 PM

#

modest oriole i didnt get orionmist yet but this was lithiums gamepad svg

i don't know what the real thing looks like, but it seems to be perfect?

modest oriole Oct 19, 2025, 8:02 PM

#

lithiumflow can make a walking animation in a 3d game wow

#

not really a walking animation its just the character moving without his legs moving

muted lance Oct 19, 2025, 8:06 PM

#

orionmist just generated me the SVG of a Dullahan (the flame is animated too).

candid surge Oct 19, 2025, 8:06 PM

#

oh thats cool!

sudden idol Oct 19, 2025, 8:09 PM

#

wet oasis is there any shot of manually adding in text?

It was a bit wonky for GPT-5 Chat and Grok, but I tried it on Llama 3.3 70B and it seems to work xD (most results converge on Llama 3/4). Used GPT-5 Thinking to implement manual similarity, since I noticed they weren't using cosine similarity.
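
For reference, cosine similarity between two count profiles can be sketched as below. This is a generic textbook version written for illustration, not the code GPT-5 Thinking produced for the notebook:

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over the n-grams both profiles share.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as matching nothing
    return dot / (norm_a * norm_b)


unknown = Counter({"a tapestry": 3, "the quiet": 1})
reference = Counter({"a tapestry": 6, "the quiet": 2})
print(cosine_similarity(unknown, reference))
```

Because the measure is normalized, a short custom sample and a much longer reference corpus with the same proportions still score close to 1.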

#

I basically gave it 3 text prompts (top right are outputs that I put into Colab)

granite sentinel Oct 19, 2025, 8:12 PM

#

no sound on videos?

sudden idol Oct 19, 2025, 8:14 PM

#

This is with Flash Lite (it's kind of funny that GLM 4 and 4.5 are so close to Google that they kept popping up)

#

Bottom right method seems to be the most accurate so far

wet oasis Oct 19, 2025, 8:16 PM

#

Ooh

#

Whats wrong with grok or gpt 5?

sudden idol Oct 19, 2025, 8:16 PM

#

The more prompts, probably the more accurate. That was with 3

sudden idol Oct 19, 2025, 8:16 PM

#

wet oasis Whats wrong with grok or gpt 5?

Well there's only one Grok model in the dataset I think, and I'm not sure GPT-5 is in there

wet oasis Oct 19, 2025, 8:17 PM

#

Ah

#

So we'd need to add them?

#

What were the llama responses used?

sudden idol Oct 19, 2025, 8:17 PM

#

Idk why it's not in the zip file, they probably haven't updated it

sudden idol Oct 19, 2025, 8:18 PM

#

wet oasis What were the llama responses used?

Wdym?

wet oasis Oct 19, 2025, 8:18 PM

#

The responses used to prove it was llama

sudden idol Oct 19, 2025, 8:19 PM

#

In the JSON

wet oasis Oct 19, 2025, 8:19 PM

#

not where, what

sudden idol Oct 19, 2025, 8:20 PM

#

Like what response in the dataset?

wet oasis Oct 19, 2025, 8:20 PM

#

sudden idol The more prompts, probably the more accurate. That was with 3

you said you had 3 prompts,

#

which made 3 responses correct?

sudden idol Oct 19, 2025, 8:21 PM

#

wet oasis which made 3 responses correct?

Are you asking which 3 prompts I used?

wet oasis Oct 19, 2025, 8:21 PM

#

what are the llama responses to the prompts?

sudden idol Oct 19, 2025, 8:21 PM

#

I just picked 3 of theirs randomly

wet oasis Oct 19, 2025, 8:22 PM

#

sudden idol Well there's only one Grok model in the dataset I think, and I'm not sure GPT-5 ...

so can we just add the data needed manually?

#

ok thx

sudden idol Oct 19, 2025, 8:22 PM

#

wet oasis so can we just add the data needed manually?

probably

#

Only 2 prompts to GPT-5-Chat, and the bottom right strategy (overall, just counts and not TF-IDF) seems to have gotten it (no GPT-5 in reference data so o3 is the closest)
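
To make the counts vs. TF-IDF distinction concrete, here is a rough sketch that treats each reference model as one "document". The unsmoothed `log(N / df)` weighting is an assumption chosen for simplicity; the notebook may use a different variant:

```python
import math
from collections import Counter


def tfidf(profiles: dict[str, Counter]) -> dict[str, dict[str, float]]:
    n_models = len(profiles)
    # Document frequency: in how many models does each word appear?
    doc_freq = Counter()
    for counts in profiles.values():
        doc_freq.update(counts.keys())
    weighted = {}
    for model, counts in profiles.items():
        total = sum(counts.values()) or 1
        weighted[model] = {
            # Term frequency scaled by how rare the word is across models.
            word: (count / total) * math.log(n_models / doc_freq[word])
            for word, count in counts.items()
        }
    return weighted


profiles = {
    "model-a": Counter({"the": 5, "tapestry": 2}),
    "model-b": Counter({"the": 5, "whisper": 1}),
}
weights = tfidf(profiles)
print(weights["model-a"])
```

Words every model uses get a weight of zero under this scheme, while plain counts keep them. With only a couple of short samples, that extra common-word signal may be part of why the counts-only strategy did better here.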

wet oasis Oct 19, 2025, 8:24 PM

#

noice

#

Can I give you some and see what happens?

sudden idol Oct 19, 2025, 8:25 PM

#

Also Kimi is close in the actual one so that's why it's second. 3rd one is GPT-4.1 I think.

sudden idol Oct 19, 2025, 8:25 PM

#

wet oasis Can I give you some and see what happens?

Okay, the more the better. I can give you their prompts

wet oasis Oct 19, 2025, 8:25 PM

#

just the response is needed right?

sudden idol Oct 19, 2025, 8:25 PM

#

Yeah

#

3-5 is best

wet oasis Oct 19, 2025, 8:28 PM

#

ok dming

muted lance Oct 19, 2025, 8:30 PM

#

raptor-llm-1017 was never discussed before?

wintry dawn Oct 19, 2025, 8:33 PM

#

wintry dawn

another model, vail

wintry dawn Oct 19, 2025, 8:33 PM

#

muted lance `raptor-llm-1017` was never discussed before?

no but i just got it

muted lance Oct 19, 2025, 8:48 PM

#

It's moderately capable with SVG drawings.

upbeat mirage Oct 19, 2025, 9:07 PM

#

muted lance `raptor-llm-1017` was never discussed before?

i just got it, too

#

unfortunately, it does not disclose its creator

#

it gave me this answer to my prompt:


- **Who are you?** I'm a conversational AI designed to help with a wide range of questions, from explaining concepts to discussing ideas—my goal is to provide thoughtful, helpful responses to support your needs.
- **Who created you?** I was developed by a team of engineers and researchers focused on advancing natural language understanding and reliable, useful AI systems.
- **What version do you have?** I don't have a public version identifier to share; my capabilities are updated iteratively as part of ongoing development to improve accuracy and usefulness.
- **What is your knowledge cut-off date?** My knowledge base is current up to mid-2024. I do not have information about events, developments, or changes that occurred after that timeframe.
- **When were you released?** I don't have a public release date to disclose; my development has been an ongoing process with gradual refinements to enhance how I assist users.
- **What is today's date?** I don't have real-time access to the current date, so I can't provide that information. If you share the current date with me, I can use that context for any date-related questions you have.
- **What is the last date you are aware of?** The latest timeframe covered in my knowledge base is mid-2024; I can reference events and information reliably up to that point, but not beyond it.

#

so it could be one of these:

new Deepseek model
new Meta AI model
new Grok model
new Ernie model
new Kimi model
new MiniMax model

sudden idol Oct 19, 2025, 9:46 PM

#

Orionmist word distribution analysis (small sample size of 4 prompts that are not in the dataset only though, and match is kinda low)... seems Google-ish

candid surge Oct 19, 2025, 9:53 PM

#

I havent been too impressed by orionmist or lithiumflow tbh

#

IDK if I'd use them over gemini 2.5 pro

sudden idol Oct 19, 2025, 9:54 PM

#

sudden idol Orionmist word distribution analysis (small sample size of 4 prompts that are no...

Sanity check: same 4 out-of-reference-dataset prompts to Llama-3-70B:

#

The most accurate way to identify the model with limited samples so far seems to be the word distribution (word count histogram)
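
One way to sketch "closest model by word distribution" is below. The overlap measure (histogram intersection), the model names and the reference texts are all stand-ins for illustration; the notebook's actual scoring may differ:

```python
import re
from collections import Counter


def word_histogram(text: str) -> dict[str, float]:
    # Relative word frequencies, so long and short texts are comparable.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    return {w: c / total for w, c in Counter(words).items()}


def overlap(p: dict[str, float], q: dict[str, float]) -> float:
    # Histogram intersection: shared probability mass, between 0 and 1.
    return sum(min(p[w], q[w]) for w in p.keys() & q.keys())


def rank_models(sample: str, references: dict[str, str]) -> list[tuple[str, float]]:
    target = word_histogram(sample)
    scores = {name: overlap(target, word_histogram(text))
              for name, text in references.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


references = {
    "model-a": "the quiet tapestry of the night",
    "model-b": "numbers add up to a sum",
}
ranking = rank_models("a tapestry of quiet night", references)
print(ranking)
```

With only a handful of samples the top score can still be low, so the ranking is better read as a hint about model family than as an identification.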

wintry dawn Oct 19, 2025, 10:07 PM

#

(updated) GROK MODELS: sierra, breckenridge, zion, acadia, aspen, veil, miramar

shasta?

wet oasis Oct 19, 2025, 10:22 PM

#

why

scenic light Oct 19, 2025, 10:49 PM

#

wintry dawn (updated) GROK MODELS: sierra, breckenridge, zion, acadia, aspen, veil, miramar ...

all bad

sudden idol Oct 19, 2025, 11:11 PM

#

This method seems to work well at identifying a model's family before it's even revealed. Need more Grok data though (only have data for Grok-3 beta).

wintry dawn Oct 19, 2025, 11:22 PM

#

candid surge I havent been too impressed by orionmist or lithiumflow tbh

0/10 ragebait

wintry dawn Oct 19, 2025, 11:23 PM

#

scenic light all bad

exactly

candid surge Oct 19, 2025, 11:26 PM

#

wintry dawn 0/10 ragebait

they're both decent models, just not as good as 2.5 pro for my use cases

upbeat mirage Oct 19, 2025, 11:50 PM

#

candid surge they're both decent models, just not as good as 2.5 pro for my use cases

have you tested it in programming already?

#

(C++, Python, Java, etc)

#

i mean, Lithiumflow

candid surge Oct 19, 2025, 11:54 PM

#

I've only been doing creative writing stuff.

wintry dawn Oct 19, 2025, 11:55 PM

#

candid surge I've only been doing creative writing stuff.

i think the new models focus on coding not creative writing

candid surge Oct 20, 2025, 12:29 AM

#

I just voted a gemini 2.5 flash result over orionmist on battle mode (since it was battle mode, I didn't know which model was which of course)

tardy ivy Oct 20, 2025, 12:34 AM

#

Everything these days focuses on codeslop sigh

#

Lithium's been giving me a bit better creative writing outputs than Orion, though

sudden idol Oct 20, 2025, 12:58 AM

#

lithiumflow is likely Google (GLM and R1-0528 probably trained on Gemini which is why they show up in the model family).

wet oasis Oct 20, 2025, 2:16 AM

#

Great

frail wigeon Oct 20, 2025, 3:01 AM

#

what model is raptor-vision-1015

lost hemlock Oct 20, 2025, 4:36 AM

#

what model is this ?

blazing roost Oct 20, 2025, 5:16 AM

#

lost hemlock what model is this ?

Maybe grok

sudden idol Oct 20, 2025, 6:42 AM

#

lost hemlock what model is this ?

Couldn't quite tell from word distributions... might be Grok since not many examples on Grok. If it was Llama, would have probably caught it quite easily.

muted lance Oct 20, 2025, 8:14 AM

#

miramar reminds me of some of the early anonymous Llama 4 models on LMArena from February/March; those too generated very funny and politically incorrect responses.
And likewise, I doubt it will be publicly distributed like it is, unfortunately.

sudden idol Oct 20, 2025, 9:45 AM

#

Lithiumflow Python Tkinter code isn't great

#

The layout seems logical at least though

sudden idol Oct 20, 2025, 10:15 AM

#

Orionmist's code was okay (fixed a bug by previous model, but didn't implement the features I asked for). Had formatting issues:

tardy ivy Oct 20, 2025, 10:28 AM

#

I'm kinda curious about raptor, the writing output's not too shabby and got a few Ws

sudden idol Oct 20, 2025, 10:51 AM

#

Got Lithiumflow again (coincidentally, second time in multiturn). Not too impressed with how it's performing on Tkinter tbh

#

#

Each of the issues (except missing output column) was something I flagged in the convo, but it hasn't fixed them.

storm plume Oct 20, 2025, 11:19 AM

#

CW: Anime stuff. Serious business people avert your eyes!
Testing the new LithiumFlow on image understanding and knowledge using a variety of complex anime memes and images.
It is compared directly to 2.5 pro, the previous best model at understanding complex images filled with established characters like this.
2.5 Pro was already leagues ahead of the other models at image understanding, if LithiumFlow is google yet again, (it most likely is) then they've made another giant leap.

#

Additionally, if you'd like to test Lithium more on text prompts:
Simply include a 1x1 pixel black image with your prompt. This forces the site to give you models that can process and understand images. Narrowing the pool greatly. You get LithiumFlow around 50% of the time.
Google models encode images into tokens in a way that makes one pixel easily ignorable.
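
If you need such an image, a 1x1 black PNG can be generated with the Python standard library alone. This is a minimal sketch; the function names and the output file name are arbitrary choices for this example:

```python
import struct
import zlib


def png_chunk(kind: bytes, data: bytes) -> bytes:
    # A PNG chunk is: 4-byte length, type, data, CRC32 over type + data.
    body = kind + data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))


def black_pixel_png() -> bytes:
    signature = b"\x89PNG\r\n\x1a\n"
    # IHDR: width 1, height 1, 8-bit depth, colour type 0 (greyscale),
    # default compression, default filter, no interlacing.
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    # One scanline: filter byte 0, then a single grey sample of 0 (black).
    idat = zlib.compress(b"\x00\x00")
    return (signature + png_chunk(b"IHDR", ihdr)
            + png_chunk(b"IDAT", idat) + png_chunk(b"IEND", b""))


png_bytes = black_pixel_png()
print(len(png_bytes), "bytes")
```

Write `png_bytes` to a file opened in binary mode (for example `black_pixel.png`) and attach it alongside the text prompt.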

zenith vortex Oct 20, 2025, 12:35 PM

#

candid surge what if one is flash and one is flash-lite? 😈

Hhhhhh I think the same 😈🙂👌

tawny epoch Oct 20, 2025, 2:29 PM

#

wintry dawn (updated) GROK MODELS: sierra, breckenridge, zion, acadia, aspen, veil, miramar ...

just got zion

tawny epoch Oct 20, 2025, 3:00 PM

#

storm plume Additionally, if you'd like to test Lithium more on text prompts: Simply include ...

the black pixel:

#

if you can catch it

#

ok you cant select it

royal rapids Oct 20, 2025, 3:07 PM

#

New model discovered (I think):

tiny meteor Oct 20, 2025, 5:18 PM

#

I found claude 4.5 sonnet to be better than zion when doing some schema drafting

wet oasis Oct 20, 2025, 5:19 PM

#

yeah that makes sense Zion isn't the best

tiny meteor Oct 20, 2025, 5:19 PM

#

i thought veil was alright

#

solitude was kind of eh though

#

i preferred mistral-small to it apparently lol

wet oasis Oct 20, 2025, 5:20 PM

#

just because you voted for something else once, doesn't mean its always gonna be like that

tiny meteor Oct 20, 2025, 5:30 PM

#

of course

lilac holly Oct 20, 2025, 5:51 PM

#

Screenshot_2025-10-21-00-50-42-343-edit_com.android.chrome.jpg

leaden locust Oct 20, 2025, 5:52 PM

#

doubt it's gemini3 tho

#

it has been around for while however

lilac holly Oct 20, 2025, 5:52 PM

#

lilac holly

Are cloaked models actually capable of revealing their company?

lilac holly Oct 20, 2025, 5:54 PM

#

leaden locust doubt it's gemini3 tho

I'm just curious about that gemini 3 thingy associated with lithiumflow

wet oasis Oct 20, 2025, 6:05 PM

#

lilac holly Are cloaked models actually capable of revealing their company?

some can

#

why are you doing this in codename discussion

#

you have to go to /video 1 2 or 3

lilac holly Oct 20, 2025, 6:21 PM

#

Screenshot_2025-10-21-01-20-50-218-edit_com.android.chrome.jpg

wet oasis Oct 20, 2025, 6:21 PM

#

lilac holly

uhh

#

wat

#

Everyone else reported that it would always say it was from google

lilac holly Oct 20, 2025, 6:22 PM

#

lilac holly

this

Screenshot_2025-10-21-01-22-28-522-edit_com.android.chrome.jpg

wet oasis Oct 20, 2025, 6:23 PM

#

well that's interesting, I don't know what to make of that

lilac holly Oct 20, 2025, 6:27 PM

#

https://lmarena.ai/c/019a02b1-2674-7abe-a3ed-88ea9dbae6a7

fringe burrow Oct 20, 2025, 6:50 PM

#

the zion model is really great in coding, not in design but with logic in code.

wet oasis Oct 20, 2025, 6:51 PM

#

ooh

#

@astral musk is lithiumflow made out of lithium?

fringe burrow Oct 20, 2025, 6:54 PM

#

fringe burrow the zion model is really great in coding, not in design but with logic in code.

but lithiumflow is best at doing both, it's literally better than gpt-5 high, pro, codex high, claude 4.5, 4.1 opus and any other model. (from my tests)

wet oasis Oct 20, 2025, 6:55 PM

#

alr

#

supposedly lithiumflow is gem 3

fringe burrow Oct 20, 2025, 6:56 PM

#

hope its flash because still i have that feeling that this isnt enough

#

or that its pro but ultra will be released

#

and ultra should be released from the leaks

cursive jacinth Oct 20, 2025, 6:56 PM

#

seems to be flash, giving me mid + super short output

wet oasis Oct 20, 2025, 6:57 PM

#

maybe

#

in that case 3 pro is gonna be pretty good

cursive jacinth Oct 20, 2025, 6:58 PM

#

any reason lithiumflow would give me like a ridiculously short output?

wet oasis Oct 20, 2025, 6:58 PM

#

your prompt doesn't need that much?

cursive jacinth Oct 20, 2025, 6:59 PM

#

all the other models I've tested give me more in depth analysis

#

and like 6x as much text

wet oasis Oct 20, 2025, 6:59 PM

#

that would fit flash then

cursive jacinth Oct 20, 2025, 6:59 PM

#

but even 2.5 flash lite was giving me more. just weird, I'll keep testing and see if it happens again

fringe burrow Oct 20, 2025, 7:00 PM

#

its experimental and in early testing

#

it will be bad sometimes

#

but with enough trying you would get 70% better results

cursive jacinth Oct 20, 2025, 7:01 PM

#

wonder what the mechanism behind that degradation is... super quantized or something?

fringe burrow Oct 20, 2025, 7:02 PM

#

cursive jacinth wonder what the mechanism behind that degradation is... super quantized or somet...

nah, its not fully-trained

#

like it didn't reach the epochs google is going for

cursive jacinth Oct 20, 2025, 7:07 PM

#

so people are being served different checkpoints of some smaller Gem 3 model? would love more context

fringe burrow Oct 20, 2025, 7:08 PM

#

i am doing my little tests with comparing lithiumflow to other models like 4.5 sonnet 64k or gpt-5 pro or 4.1 opus

#

current score

#

its from only one test now

woeful junco Oct 20, 2025, 7:09 PM

#

Whats this mean

fringe burrow Oct 20, 2025, 7:13 PM

#

OKAY LITHIUMFLOW IS INSANEEE, IT JUST ONE-SHOTTED A MINECRAFT GAME WITH AN ACTUAL PROFESSIONAL MENU AND ANIMATIONS, working game with block terrain generation and no bugs. No other llm could do that

#

orionmist is okay

leaden locust Oct 20, 2025, 7:26 PM

#

fringe burrow OKAY LITHIUMFLOW IS INSANEEE, ITS JUST ONE-SHOTED MINECRAFT GAME WITH ACTUAL PRO...

could you show some images/vid?

fringe burrow Oct 20, 2025, 7:27 PM

#

leaden locust could you show some images/vid?

i will, wait

leaden locust Oct 20, 2025, 7:27 PM

#

thanks

#

im rly curious

fringe burrow Oct 20, 2025, 7:27 PM

#

im making a benchmark website with my tests. and i am using lithiumflow for making the design

#

like lithiumflow/orionmist passed 4/4 tests perfectly

#

for example test 1:
o4-mini: 0
/zion: 0,25
/lithiumflow: 1
4.1 Opus: 0,5
GPT-4o: 0
2.5 Flash Lite (Thinking): 0
GPT-5 (High): 0,5
GPT-5 Pro: 0,25
4.5 Sonnet (Thinking 64K): 0,75
2.5 Pro: 0,25
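
A small sketch of how the test 1 scores above could be parsed and ranked. It assumes the comma is a decimal separator; the function and variable names are illustrative and not taken from the actual tinybench.

```python
# Scores for test 1 exactly as posted above (comma used as decimal separator).
raw_scores = """\
o4-mini: 0
/zion: 0,25
/lithiumflow: 1
4.1 Opus: 0,5
GPT-4o: 0
2.5 Flash Lite (Thinking): 0
GPT-5 (High): 0,5
GPT-5 Pro: 0,25
4.5 Sonnet (Thinking 64K): 0,75
2.5 Pro: 0,25
"""


def parse_scores(text):
    """Turn 'model: 0,25' lines into a {model: float} dict."""
    scores = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last colon so a model name may itself contain colons.
        name, value = line.rsplit(":", 1)
        scores[name.strip()] = float(value.strip().replace(",", "."))
    return scores


scores = parse_scores(raw_scores)
# Highest score first; ties keep the order in which they were posted.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{score:.2f}  {name}")
```

With more tests, the same dict-per-test shape could be averaged per model to get an overall score.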

#

i am going to show my tinybench tomorrow, 15 tests, models like 4.5 sonnet, gpt-5 pro, lithiumflow, zion and orionmist

#

this is example of lithiumflow minecraft

#

One-shot

#

no errors

cursive jacinth Oct 20, 2025, 7:37 PM

#

wow

tiny sierra Oct 20, 2025, 7:38 PM

#

Lol gemini 3 confirmed

#

Wtf

fringe burrow Oct 20, 2025, 7:38 PM

#

the prompt was just "create a minecraft in one html document. make it look proffesional."

cursive jacinth Oct 20, 2025, 7:39 PM

#

can you break blocks and collect them or nah

fringe burrow Oct 20, 2025, 7:39 PM

#

yeah you can break and place

#

you can sprint

#

jump

#

walk

#

everything works fine

#

smooth

tiny sierra Oct 20, 2025, 7:40 PM

#

fringe burrow yeah you can break and place

WTF

#

LMAO

fringe burrow Oct 20, 2025, 7:40 PM

#

if thats from google, google cooked so much.

#

orionmist cooked my benchmark design

#

i just asked it to make a benchmark site and i sent my bench data

#

look at the top 3

#

overall

#

difference is big

#

and i rated the outputs before i knew what model was making it

leaden locust Oct 20, 2025, 8:00 PM

#

oh wow that looks impressive

#

so what's the working theory? Both lithium and orion are gemini flavours?

fringe burrow Oct 20, 2025, 8:08 PM

#

Both say that they are made by google

#

And both are similar to A/B checkpoints from google ai studio

#

And they are really up to date

#

They know about GPT-5 and that it has modes like thinking, pro, instant.

leaden locust Oct 20, 2025, 8:15 PM

#

think i just got lithium vs orion

leaden locust Oct 20, 2025, 8:21 PM

#

fringe burrow Both say that they are made by google

orion vs flash, did minecraft as well; less impressive than yours

fringe burrow Oct 20, 2025, 8:21 PM

#

Luck i guess

#

Lithium made mine

#

Orion is a bit worse

leaden locust Oct 20, 2025, 8:22 PM

#

orion is worse tho i provided a simple prompt

#

compared to yours

#

(lithium)

#

however vs flash 2.5 it owned it

fringe burrow Oct 20, 2025, 8:25 PM

#

My prompt was simple too

leaden locust Oct 20, 2025, 8:30 PM

#

it's like this. Can jump, can walk, no collision/bugs

#

got lithium too, made it fully playable (can place blocks, big world, etc)

upbeat mirage Oct 20, 2025, 9:32 PM

#

fringe burrow My prompt was simple too

If you get lithiumflow again, can you ask it to create a desktop program for the Arimaa boardgame?

wintry dawn Oct 20, 2025, 9:32 PM

#

upbeat mirage If you get lithiumflow again, can you ask it to create a desktop program for the...

yeah i can rn

#

i can use any model dire-

#

🤐

upbeat mirage Oct 20, 2025, 9:33 PM

#

the direct chat has stricter rate-limits

#

battle mode is better (esp. with account)

#

i have a file with its answers, you can identify it by that
do you need the file?

#

(it's quite big lol, i had stored a whole thread in it, but i will probably only copy the relevant part of it)

wintry dawn Oct 20, 2025, 9:34 PM

#

upbeat mirage the direct chat has stricter rate-limits

no

#

im not using direct chat

#

anyway um

#

ill run yer prompt

upbeat mirage Oct 20, 2025, 9:34 PM

#

in LMA?

wintry dawn Oct 20, 2025, 9:34 PM

#

ye

#

lol

upbeat mirage Oct 20, 2025, 9:35 PM

#

without R-Ls?

wintry dawn Oct 20, 2025, 9:35 PM

#

i gotta delete allat..

upbeat mirage Oct 20, 2025, 9:35 PM

#

you can sel. it?

wintry dawn Oct 20, 2025, 9:35 PM

#

upbeat mirage you can sel. it?

idk

upbeat mirage Oct 20, 2025, 9:36 PM

#

oof

#

how?

wintry dawn Oct 20, 2025, 9:36 PM

#

upbeat mirage oof

you want to buy it or wha

upbeat mirage Oct 20, 2025, 9:36 PM

#

just curious

wintry dawn Oct 20, 2025, 9:36 PM

#

if lmarena has a bug bounty ill report it

#

cuz i dont want this to go to waste

#

its useful

upbeat mirage Oct 20, 2025, 9:36 PM

#

this is quite a fat bug

wintry dawn Oct 20, 2025, 9:36 PM

#

yeah

#

i can manipulate results

upbeat mirage Oct 20, 2025, 9:37 PM

#

i wonder, how many people have used similar "method" at LMA..

#

||-# but be careful that you dont get banned||

#

(i would not dare to use such "methods")

wintry dawn Oct 20, 2025, 9:38 PM

#

anyway ill jus run yer prompt lol

#

wait so

upbeat mirage Oct 20, 2025, 9:39 PM

#

ty

wintry dawn Oct 20, 2025, 9:39 PM

#

is "create a desktop program for the Arimaa boardgame" the prompt u want

upbeat mirage Oct 20, 2025, 9:39 PM

#

with AI to play against

wintry dawn Oct 20, 2025, 9:39 PM

#

but in what language n stuff?

#

can i just make it a single html file

upbeat mirage Oct 20, 2025, 9:39 PM

#

wintry dawn can i just make it a single html file

c++

#

because, speed

wintry dawn Oct 20, 2025, 9:39 PM

#

upbeat mirage c++

alr

upbeat mirage Oct 20, 2025, 9:39 PM

#

C++ is fastest for boardgames

#

it will probably propose using SFML library

#

(as that is easy to use and performant)

wet oasis Oct 20, 2025, 9:40 PM

#

..

upbeat mirage Oct 20, 2025, 9:41 PM

#

wet oasis ..

not good?

wet oasis Oct 20, 2025, 9:42 PM

#

Im fine

upbeat mirage Oct 20, 2025, 9:42 PM

#

wet oasis Im fine

ok, another question, which AI do you think is best for vibe-coding in C++?

wet oasis Oct 20, 2025, 9:43 PM

#

Idk off of the top of my head

upbeat mirage Oct 20, 2025, 9:43 PM

#

claude-4.5-thinking, gpt5-high, lithiumflow, glm-4.6, qwen3 max, deepseek-3.2, grok4, grok4fast-reasoning, kimi k2, MAI, Serenity, Raptor, Phantom, Sierra, Aspen, Acadia, Skytrail, Monster?

wintry dawn Oct 20, 2025, 9:47 PM

#

@upbeat mirage sry cf is blocking me :p

upbeat mirage Oct 20, 2025, 9:47 PM

#

battlemode instead?

wintry dawn Oct 20, 2025, 9:48 PM

#

cloudflare

#

💀

upbeat mirage Oct 20, 2025, 9:48 PM

#

ah

#

so you think lithiumflow is (currently) the SOTA model in coding?

#

better than gpt5-high, claude-4.5, claude-4.1-opus et.al. ?

wintry dawn Oct 20, 2025, 9:56 PM

#

@upbeat mirage sorry it took so long

#

📎 game.cpp

#

orionmist btw

#

(its lithium but with web search)

wet oasis Oct 20, 2025, 9:57 PM

#

wintry dawn (its lithium but with web search)

Not sure if thats confirmed

wintry dawn Oct 20, 2025, 9:57 PM

#

wet oasis Not sure if thats confirmed

I think its true

#

@upbeat mirage

#

📎 game_lithium.cpp

#

lithium

#

(it decided to make it cli based)

#

4.5 thinking 32k:

#

📎 4.5_32k_game.cpp

upbeat mirage Oct 20, 2025, 10:13 PM

#

i thought gemini 3 would generate me a complete GUI app for desktop, like we are used to see from it

#

with mouse-control and nice piece graphics and board background

#

apparently we have to wait until next year for that?

#

but, thanks for your efforts

#

unfortunately, claude's app has bugs

#

have you even tested the programs?

#

-# (no offense intended)

sudden idol Oct 20, 2025, 10:57 PM

#

candid surge what if one is flash and one is flash-lite? 😈

Hope so actually, because Orionmist is messing up Markdown code closing tags, which seems weird if it was Flash 3.0

wet oasis Oct 20, 2025, 10:57 PM

#

sudden idol Hope so actually, because Orionmist is messing up Markdown code closing tags, wh...

does flash 2.5 mess those up? idk I've never made it write code

sudden idol Oct 20, 2025, 10:58 PM

#

wet oasis does flash 2.5 mess those up? idk I've never made it write code

Don't think so

#

Maybe pre-preview instability? But it's supposed to be better than 2.5 Pro.

#

If both are not really outstanding at Tkinter (literally the most basic UI library) but are one-shotting Minecraft, that's a bit suspicious

wet oasis Oct 20, 2025, 11:00 PM

#

yah

#

maybe google's gonna release a programming first model instead of gemini 3 rn?

sudden idol Oct 20, 2025, 11:02 PM

#

Maybe I'll try testing it on a question that only GPT-5 High gets right

sudden idol Oct 20, 2025, 11:05 PM

#

upbeat mirage so you think lithiumflow is (currently) the SOTA model in coding?

I think the only way to tell is to test it on an agentic coding environment. E.g., can it resolve compile errors, missing dependencies, WSL compatibility issues, React/C race conditions, etc.

cursive jacinth Oct 20, 2025, 11:10 PM

#

seems like a weirdly spiky model, I wonder if they figured out some better RL environments for software

upbeat mirage Oct 20, 2025, 11:13 PM

#

cursive jacinth seems like a weirdly spiky model, I wonder if they figured out some better RL en...

spiky, as in jagged performance over time?

upbeat mirage Oct 20, 2025, 11:14 PM

#

sudden idol I think the only way to tell is to test it on an agentic coding environment. E.g...

yeah, and it also has to test the program it generates and recognize bugs automatically (and fix them)

#

that almost requires AGI, does it?

wet oasis Oct 20, 2025, 11:15 PM

#

that's a very different definition of agi than the one about being able to do all human tasks in a human body

cursive jacinth Oct 20, 2025, 11:16 PM

#

upbeat mirage spiky, as in jagged performance over time?

spiky as in good at certain things and surprisingly bad at others

#

it was giving me pretty weak creative writing and writing analysis

wet oasis Oct 20, 2025, 11:17 PM

#

cursive jacinth it was giving me pretty weak creative writing and writing analysis

yeah it seems this is more for code purposes

upbeat mirage Oct 20, 2025, 11:20 PM

#

cursive jacinth it was giving me pretty weak creative writing and writing analysis

apropos, which AI is best for immersive roleplaying?

#

i mean for super-long games

#

it would need to have at least 1M context (better 2M)

cursive jacinth Oct 20, 2025, 11:21 PM

#

gem 2.5 pro, and all the claudes are good

#

yeah so for that context length gem 2.5 would be it

upbeat mirage Oct 20, 2025, 11:22 PM

#

cursive jacinth yeah so for that context length gem 2.5 would be it

is there a special prompting-technique to maximize its gamemaster-intelligence for realistic virtual universes?

cursive jacinth Oct 20, 2025, 11:22 PM

#

not sure honestly!

#

mostly you want it to roleplay effectively by giving it info about the character its simulating

upbeat mirage Oct 20, 2025, 11:23 PM

#

and if it has reached its context limit?

#

let it summarize everything before that happens?

#

as a "save-state"?

wet oasis Oct 20, 2025, 11:25 PM

#

upbeat mirage as a "save-state"?

yeah you have to do that
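
A minimal sketch of the "save-state" idea discussed above: once the history gets close to the context limit, it is folded into a single summary message. The token estimate, the 80% threshold, and the `summarize` callable are all assumptions for illustration; in practice `summarize` would be a call to the model itself.

```python
def estimate_tokens(text):
    """Very rough token estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


class RoleplaySession:
    """Keeps a running history and folds it into a summary near the limit."""

    def __init__(self, summarize, context_limit=1_000_000, threshold=0.8):
        self.summarize = summarize      # hypothetical callable: list[str] -> str
        self.context_limit = context_limit
        self.threshold = threshold      # fraction of the limit that triggers a save
        self.history = []

    def used_tokens(self):
        return sum(estimate_tokens(message) for message in self.history)

    def add(self, message):
        self.history.append(message)
        if self.used_tokens() > self.threshold * self.context_limit:
            # Replace everything so far with one "save-state" message.
            save_state = self.summarize(self.history)
            self.history = ["[SAVE STATE] " + save_state]


def toy_summarize(messages):
    """Stand-in for a real model call; only counts what it was given."""
    return f"{len(messages)} earlier messages condensed"


# Tiny limit so the save-state triggers within a few turns.
session = RoleplaySession(toy_summarize, context_limit=100, threshold=0.8)
for turn in range(10):
    session.add(f"Turn {turn}: the party explores another room of the dungeon.")
print(session.history)
```

Summarizing before the hard limit (rather than at it) leaves room for the summary request itself.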

upbeat mirage Oct 20, 2025, 11:25 PM

#

will gemini 3 pro have 2M? (context window)

wet oasis Oct 20, 2025, 11:26 PM

#

I hope it has more than that, but idk

upbeat mirage Oct 20, 2025, 11:26 PM

#

wet oasis I hope it has more than that, but idk

do you think, they publish/release it this month?

cursive jacinth Oct 20, 2025, 11:27 PM

#

from what I understand the labs have been making pretty fast progress on context length and google generally leads

wet oasis Oct 20, 2025, 11:27 PM

#

if these guys truly are gemini 3 models, yeah

cursive jacinth Oct 20, 2025, 11:27 PM

#

so I'd guess yes

wet oasis Oct 20, 2025, 11:28 PM

#

cursive jacinth from what I understand the labs have been making pretty fast progress on context...

idk about that, since I haven't seen a good-ish model beat the highest context window count in like a year, ignoring llama 4 cause that doesn't count

cursive jacinth Oct 20, 2025, 11:28 PM

#

like deepseek, openai have been catching up

#

i assume in the meantime deepmind has made further progress

upbeat mirage Oct 21, 2025, 12:30 AM

#

is serenity really by xAI?

#

Grok4.5-preview?

wet oasis Oct 21, 2025, 12:34 AM

#

no idea

#

could check

upbeat mirage Oct 21, 2025, 12:43 AM

#

if it is, it should have a sizable context-window, right?

wet oasis Oct 21, 2025, 12:48 AM

#

it doesn't seem like the eqbench tool includes xAI models, so take this with a grain of salt

#

but its closest to claude and deepseek

sudden idol Oct 21, 2025, 1:30 AM

#

Well it's close to both Claude and OpenAI families, so not conclusive ig

#

That's kinda what happens when you put Grok 4 in, since no reference instances

wet oasis Oct 21, 2025, 1:34 AM

#

mm

lilac holly Oct 21, 2025, 1:38 AM

#

these codenames are just names

wet oasis Oct 21, 2025, 1:39 AM

#

... true?

lilac holly Oct 21, 2025, 1:39 AM

#

Screenshot_2025-10-21-08-34-50-067-edit_com.android.chrome.jpg

#

I was expecting super

wet oasis Oct 21, 2025, 1:46 AM

#

?

wet oasis Oct 21, 2025, 1:50 AM

#

upbeat mirage is *serenity* really by xAI?

slightly better analysis of serenity

#

showing strong signs of relation between it and claude models, so serenity could be anthropic or based off of claude

tiny sierra Oct 21, 2025, 2:21 AM

#

Serenity is way too verbose to be a claude model

#

Imo

#

I've also never had a claude model give me tables when not asked for

wet oasis Oct 21, 2025, 2:21 AM

#

mm

#

ok

tiny sierra Oct 21, 2025, 2:22 AM

#

Or list the sources at the bottom

wet oasis Oct 21, 2025, 2:22 AM

#

how does it have sources? this isn't search arena

#

did it just make that up

tiny sierra Oct 21, 2025, 2:23 AM

#

LMAO

#

I DIDNT EVEN NOTICE THAT

#

Sonnet 4.5 for reference

#

And opus 4.1

#

wet oasis Oct 21, 2025, 2:23 AM

#

why is it so long

tiny sierra Oct 21, 2025, 2:23 AM

#

wet oasis why is it so long

Feels the need to make every little thing a new line apparently

wet oasis Oct 21, 2025, 2:24 AM

#

the table thing makes me feel like it's similar to perplexity

tiny sierra Oct 21, 2025, 2:24 AM

#

This should just be comma separated, lmao

#

And it is comma separated in the app just not lmarena

#

(Sonnet)

#

Prompt was

#

Hand foot and mouth disease clinical presentation

wet oasis Oct 21, 2025, 2:26 AM

#

weird

tiny sierra Oct 21, 2025, 2:27 AM

#

wet oasis weird

If im fishing for lithiumflow should I make a new chat or spam the same identical prompt in the same one?

wet oasis Oct 21, 2025, 2:27 AM

#

tiny sierra If im fishing for lithiumflow should I make a new chat or spam the same identica...

I don't think it makes a difference much so whichever is faster

#

tho if you spam the same one then it will use all the previous tokens from the winner

tiny sierra Oct 21, 2025, 2:27 AM

#

Ah yeah I was wondering how that'd work since it changes the model

candid surge Oct 21, 2025, 3:47 AM

#

When did marble show up?

candid surge Oct 21, 2025, 3:48 AM

#

tiny sierra If im fishing for lithiumflow should I make a new chat or spam the same identica...

You should always make a new chat since it keeps history

fringe burrow Oct 21, 2025, 2:37 PM

#

wintry dawn i can use any model dire-

anyone can bro

fringe burrow Oct 21, 2025, 2:40 PM

#

wintry dawn I think its true

but orionmist is so much worse

valid peak Oct 21, 2025, 2:52 PM

#

upbeat mirage Grok4.5-preview?

4.20

sudden idol Oct 21, 2025, 3:34 PM

#

wet oasis showing strong signs of relation between it and claude models, so serenity could...

How good is it at coding though

wet oasis Oct 21, 2025, 3:34 PM

#

Havent tried

sudden idol Oct 21, 2025, 3:35 PM

#

Looks a bit inconclusive to me, since K2 is closer to o3 yet the score is very close, which suggests Serenity is more in between model families than part of any specific one

wet oasis Oct 21, 2025, 3:36 PM

#

Maybe

tiny sierra Oct 21, 2025, 8:31 PM

#

fringe burrow but orionmist is so much worse

it was good for medical questions for me

spiral ingot Oct 22, 2025, 5:49 AM

#

lovable can we use their premium

#

for free here?

wet oasis Oct 22, 2025, 12:52 PM

#

spiral ingot lovable can we use there premium

?

upbeat mirage Oct 22, 2025, 2:03 PM

#

Acadia (Ocean AI)
Serenity (Ocean AI)
Sierra (Ocean AI)
Shasta (Ocean AI)
Solitude (Ocean AI)
Is there a 6th one?

#

(the name could still be a coincidence, as "Ocean" is pretty unspecific)

upbeat mirage Oct 22, 2025, 2:15 PM

#

sudden idol Looks a bit inconclusive to me, since K2 is closer to o3 yet the score is very c...

my hypothesis is, that a new AI lab used xAI and OpenAI (and maybe also gemini) models to train their models on

sudden idol Oct 22, 2025, 2:16 PM

#

upbeat mirage my hypothesis is, that a new AI lab used xAI and OpenAI (and maybe also gemini) ...

🤷

#

I just realized Grok 4 is in the 140K data

upbeat mirage Oct 22, 2025, 2:17 PM

#

Ocean AI uses some words/phrases from grok:

witty
maximally truthful
etc

sudden idol Oct 22, 2025, 2:17 PM

#

I guess I could train a model to predict the model based on TF-IDF vectors or embeddings

upbeat mirage Oct 22, 2025, 2:17 PM

#

that would be awesome

sudden idol Oct 22, 2025, 2:18 PM

#

Idk why I didn't just do that lmao

#

I guess I thought it would be easier to use their ngram analysis code.

sudden idol Oct 22, 2025, 2:19 PM

#

upbeat mirage Ocean AI uses some words/phrases from grok: - witty - maximally truthful - etc

I feel there might be a simpler way

#

I was testing it but got sidetracked checking out a new framework

upbeat mirage Oct 22, 2025, 2:21 PM

#

sudden idol I guess I could train a model to predict the model based on TF-IDF vectors or em...

on a scale of 1-10 how hard is that for a newbie?

#

(if using Claude-4.5 and Lithiumflow and maybe even gpt5-high as coders)

sudden idol Oct 22, 2025, 2:22 PM

#

upbeat mirage on a scale of 1-10 how hard is that for a newbie?

Well it's classic label prediction

#

Tons of tutorials, so definitely approachable.

upbeat mirage Oct 22, 2025, 2:22 PM

#

i mean, to create a new model for prediction

#

and what hardware would be necessary?

sudden idol Oct 22, 2025, 2:22 PM

#

upbeat mirage i mean, to create a new *model* for prediction

Oh it's not as hard as it sounds

#

These are statistical models, not deep neural networks

#

Most of them have functions from libraries like sklearn that you can just call
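
A rough sketch of the label-prediction idea above. With sklearn one would normally reach for `TfidfVectorizer` plus a classifier; this version is stdlib-only to show the mechanics, and swaps in a nearest-centroid cosine match instead of a trained classifier. The sample texts and the `model_a` / `model_b` labels are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}


def tfidf(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen tokens are dropped."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def train(samples):
    """Build one summed TF-IDF centroid per label from (text, label) pairs."""
    idf = fit_idf([text for text, _ in samples])
    centroids = defaultdict(dict)
    for text, label in samples:
        for tok, weight in tfidf(text, idf).items():
            centroids[label][tok] = centroids[label].get(tok, 0.0) + weight
    return idf, dict(centroids)


def predict(text, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = tfidf(text, idf)

    def score(centroid):
        dot = sum(w * centroid.get(tok, 0.0) for tok, w in vec.items())
        norm = math.sqrt(sum(v * v for v in centroid.values())) or 1.0
        return dot / norm

    return max(centroids, key=lambda label: score(centroids[label]))


# Made-up training responses; real data would be many outputs per known model.
samples = [
    ("here is a witty and maximally truthful answer", "model_a"),
    ("a witty reply that stays maximally truthful", "model_a"),
    ("below is a table summarising the key points", "model_b"),
    ("the table below lists sources and key points", "model_b"),
]
idf, centroids = train(samples)
print(predict("a maximally truthful and witty take", idf, centroids))
```

This runs on any laptop CPU; no GPU is involved at this scale.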

upbeat mirage Oct 22, 2025, 2:24 PM

#

oh, i just got another idea (totally different topic, though)
how hard would it be to create a decent model to play a niche boardgame?

sudden idol Oct 22, 2025, 2:24 PM

#

upbeat mirage oh, i just got another idea (totally different topic, though) how hard would it ...

Depends on the game properties

#

You'd probably use a search space algorithm from classic AI
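
For a feel of what classic game-tree search looks like, here is a minimal negamax with alpha-beta pruning. It is shown on a toy take-away game (remove 1 to 3 stones, taking the last stone wins), not on any specific boardgame; the game and the function names are assumptions picked to keep the sketch short.

```python
MOVES = (1, 2, 3)  # toy rule: remove 1-3 stones; taking the last stone wins


def negamax(stones, alpha=-1, beta=1):
    """Return +1 if the player to move wins with perfect play, else -1."""
    if stones == 0:
        return -1  # the previous player took the last stone, so the mover lost
    best = -1
    for take in MOVES:
        if take > stones:
            break
        # The opponent's best result, negated, is our result for this move.
        score = -negamax(stones - take, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # prune: a winning move was found, no need to look further
    return best


def best_move(stones):
    """Pick a move that leaves the opponent in a lost position, if one exists."""
    for take in MOVES:
        if take <= stones and -negamax(stones - take) == 1:
            return take
    return 1  # every move loses against perfect play; take one stone


for pile in (4, 5, 7):
    print(pile, negamax(pile), best_move(pile))
```

A real boardgame engine would replace the win/loss result with a heuristic evaluation at a depth limit, but the search skeleton stays the same.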

upbeat mirage Oct 22, 2025, 2:25 PM

#

if the model should run on a 6GB gpu from 2019 (Nvidia GTX titan)
and the game would be Dou Shou Qi (jungle)

#

muZero?

#

i mean, a model which trained by self-play (like Alphazero)

sudden idol Oct 22, 2025, 2:26 PM

#

Looks like checkers

upbeat mirage Oct 22, 2025, 2:26 PM

#

would my poor laptop survive this training run? (it can get quite hot, over 90°C)

sudden idol Oct 22, 2025, 2:27 PM

#

upbeat mirage i mean, a model which trained by self-play (like Alphazero)

Oh I knew someone who liked training those types of models

#

Idk how I got beaten by a program running on 2 MB of RAM

upbeat mirage Oct 22, 2025, 2:28 PM

#

jungle is unique, because of the intransitivity of the capture rules

#

(mouse can beat elephant but not reverse, except if trapped)

#

many strange rules

#

harder to learn than checkers

#

but bitboards can be used (7x9 board)
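
A sketch of the bitboard point above: a 7x9 board has 63 squares, so one piece set fits in a single 64-bit word. The square numbering, helper names, and the simplified capture rule (rat beats elephant, elephant cannot take rat, trapped pieces can be taken by anything; water rules ignored) are assumptions for illustration.

```python
COLS, ROWS = 7, 9                      # Jungle board: 7 columns x 9 rows
FULL = (1 << (COLS * ROWS)) - 1        # mask of all 63 squares


def bit(col, row):
    """Bitboard with only the square (col, row) set; row-major numbering."""
    return 1 << (row * COLS + col)


# Edge-column masks stop left/right shifts from wrapping onto the next row.
COL_FIRST = sum(bit(0, r) for r in range(ROWS))
COL_LAST = sum(bit(COLS - 1, r) for r in range(ROWS))


def neighbours(bb):
    """Squares one orthogonal step away from any square set in bb."""
    up = (bb << COLS) & FULL
    down = bb >> COLS
    right = ((bb & ~COL_LAST) << 1) & FULL
    left = (bb & ~COL_FIRST) >> 1
    return up | down | left | right


RAT, ELEPHANT = 1, 8                   # weakest and strongest ranks


def can_capture(attacker, defender, defender_trapped=False):
    """Simplified capture rule with the rat/elephant exception."""
    if defender_trapped:
        return True                    # a trapped piece can be taken by anything
    if attacker == RAT and defender == ELEPHANT:
        return True                    # the intransitive exception
    if attacker == ELEPHANT and defender == RAT:
        return False
    return attacker >= defender


corner = bit(0, 0)
print(bin(neighbours(corner)))
print(can_capture(RAT, ELEPHANT), can_capture(ELEPHANT, RAT))
```

Move generation for a whole piece set then becomes a handful of shifts and masks instead of a loop over squares.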

sudden idol Oct 22, 2025, 2:29 PM

#

AI-bench, that would be cool

#

Where LLMs have to write AIs that go head to head

upbeat mirage Oct 22, 2025, 2:30 PM

#

yeah that would be amazing

#

and entertaining to watch

#

or cyborg matches

#

where human+LLM try to outmatch other pairs

#

in coding of AI

sudden idol Oct 22, 2025, 2:31 PM

#

upbeat mirage where human+LLM try to outmatch other pairs

i think human would win

upbeat mirage Oct 22, 2025, 2:31 PM

#

maybe we could use CoreWars & RedCode for that?

upbeat mirage Oct 22, 2025, 2:31 PM

#

sudden idol i think human would win

expert coders, yep

#

but junior coders?

#

but the competition would be between human+LLM vs human+LLM

#

only such pairs of AI-coders

#

ok, a human could opt to not use LLMs at all

sudden idol Oct 22, 2025, 2:33 PM

#

upbeat mirage but junior coders?

It's more about AI techniques

#

And depends on the tasks

#

An engineer might win if the game space is continuous

#

Because control theory would come in handy

#

There's a lot of these problems on codingame.com

#

Whoops sorry, just realized going off topic for this channel

wet oasis Oct 22, 2025, 2:39 PM

#

upbeat mirage my hypothesis is, that a new AI lab used xAI and OpenAI (and maybe also gemini) ...

Think so

sudden idol Oct 22, 2025, 2:44 PM

#

upbeat mirage Acadia (Ocean AI) Serenity (Ocean AI) Sierra (Ocean AI) Shasta (Ocean AI) Solitu...

Idk if it's from the same family, but there's also Vail

wet oasis Oct 22, 2025, 2:44 PM

#

Think so

sudden idol Oct 22, 2025, 2:44 PM

#

Doesn't show up as often though

wet oasis Oct 22, 2025, 2:49 PM

#

Paws is actually AGI that was told its entire goal is to see how good people are at guessing how long away we are from agi.

upbeat mirage Oct 22, 2025, 3:14 PM

#

sudden idol Doesn't show up as often though

Is Serenity the best of the pack?

sudden idol Oct 22, 2025, 3:14 PM

#

upbeat mirage Is Serenity the best of the pack?

All are bad tbh

upbeat mirage Oct 22, 2025, 3:16 PM

#

sudden idol All are bad tbh

in roleplaying, too?

sudden idol Oct 22, 2025, 3:20 PM

#

upbeat mirage in roleplaying, too?

Idk, haven't tried lol

upbeat mirage Oct 22, 2025, 3:23 PM

#

sudden idol Idk, haven't tried lol

do you know, which has the largest CW?

#

-# (context window)

sudden idol Oct 22, 2025, 5:10 PM

#

upbeat mirage do you know, which has the largest CW?

idk

vale root Oct 22, 2025, 6:53 PM

#

aa

cursive jacinth Oct 22, 2025, 7:40 PM

#

how have orionmist and lithiumflow been performing for everyone?

fringe burrow Oct 22, 2025, 7:45 PM

#

best current llm

#

my coding bench

#

i didnt know which model i was using while rating

#

14 tests

tiny sierra Oct 22, 2025, 8:27 PM

#

Wish google would just let us chat with it directly already

#

Its not like everyone doesnt know what it is

wet oasis Oct 22, 2025, 9:06 PM

#

Heh

tiny sierra Oct 22, 2025, 9:11 PM

#

wet oasis Heh

?

wet oasis Oct 22, 2025, 9:11 PM

#

You arent wrong

vale root Oct 22, 2025, 9:19 PM

#

tiny sierra Wish google would just let us chat with it directly already

just let us use lithiumflow directly without waiting for it to appear in battlemode, that way testing it would be better and benefit them the most

tiny sierra Oct 22, 2025, 9:21 PM

#

vale root just let us use lithiumflow directly without waiting for it to appear in battlem...

Would reduce people spamming by 99% rn for the purpose of just trying to get it as well

#

Does anyone know if this decision is up to lmarena or whoever owns the codenamed model?

vale root Oct 22, 2025, 9:28 PM

#

tiny sierra Does anyone know if this decision is up to lmarena or whoever owns the codenamed...

idk pineapple said they cannot control the model's availability, ig only Google decides if it goes onto chat mode or not

wet oasis Oct 22, 2025, 10:23 PM

#

would be weird if they were ok with putting it on chat mode

#

maybe they'd get away with side-by-side?

wintry dawn Oct 22, 2025, 11:58 PM

#

new model on lmarena

#

"marble"

cursive jacinth Oct 22, 2025, 11:59 PM

#

sounds grok-y

wintry dawn Oct 23, 2025, 12:00 AM

#

ya

#

i asked it for a userscript

#

it gave me a mid html file

cunning notch Oct 23, 2025, 12:08 AM

#

funny btw

sinful spoke Oct 23, 2025, 2:52 AM

#

tried lithiumflow a few times, and I think it will score lower than 2.5pro, at least with style control off
It is generally shorter in answer length than 2.5pro, has no tendency to explain things in depth, and is less fond of using analogies

tiny sierra Oct 23, 2025, 2:54 AM

#

cunning notch funny btw

Lmao

tiny sierra Oct 23, 2025, 2:54 AM

#

wet oasis maybe they'd get away with side-by-side?

Yeah that's what I meant to say

#

Side by side, I never used chat mode

wet oasis Oct 23, 2025, 3:05 AM

#

sinful spoke tried lithiumflow a few times, and I think it will score lower than 2.5pro, at l...

its better at code tho, soo keep that in mind, I wouldn't be surprised if on text arena it is worse tho

candid surge Oct 23, 2025, 12:52 PM

#

sinful spoke tried lithiumflow a few times, and I think it will score lower than 2.5pro, at l...

Someone posted something earlier suggesting lithiumflow is Gemini 3 but not Gemini 3 pro

#

If only we got X28 or 2HT...

sudden idol Oct 23, 2025, 12:53 PM

#

Lithiumflow seems bad at actually modifying its code (actually seems like the arena was bugged, because it runs fine when I run the React code manually)

vale root Oct 23, 2025, 1:00 PM

#

cunning notch funny btw

how did u get lithiumflow twice in a row?

wet oasis Oct 23, 2025, 2:11 PM

#

sudden idol Lithiumflow seems bad at actually modifying its code (actually seems like the ar...

Bug?

sudden idol Oct 23, 2025, 2:18 PM

#

wet oasis Bug?

It used alert windows that didn't show up on Web Dev Arena, so it looked like the buttons were not working.

#

It took more turns than expected to get a good result, and it didn't seem to fully implement everything I was asking for. Maybe image input would help. Ig it's not really fair on the LLMs that they can't see the output.

wet oasis Oct 23, 2025, 2:22 PM

#

mm

elder yew Oct 23, 2025, 2:30 PM

#

Gemini 3 is simply absolutely superior

sudden idol Oct 23, 2025, 2:37 PM

#

elder yew Gemini 3 is simply absolutely superior

Depends ig... it seems weaker than Sonnet 4.5 at Fyne (popular Go UI library), which is odd considering that Go is from Google

tiny sierra Oct 23, 2025, 3:22 PM

#

sudden idol Depends ig... it seems weaker than Sonnet 4.5 at Fyne (popular Go UI library), w...

Gemini is weaker at google search than every other big model in existence which is odd considering that google is from google

#

😭

#

OkAnd

sudden idol Oct 23, 2025, 3:26 PM

#

tiny sierra Gemini is weaker at google search than every other big model in existence which ...

Because of poor tool use ability probably

#

It hallucinates so much with tool use

#

I give it a link and it hallucinates contents despite not being able to access the link

#

I wonder if Lithiumflow does that 🤔

sudden idol Oct 23, 2025, 3:30 PM

#

sudden idol I give it a link and it hallucinates contents despite not being able to access t...

Like, really:

#

Whatever flying-octopus is, it passed the test 😆:

#

clover too:

#

I guess it isn't really a hard task, since even Mistral Small gets it:

#

Ernie failed the task.
DeepSeek V3.2 Exp fails too (is it because DS is trained on Gemini 2.5 Pro...?).
Grok 4 Fast failed.
grok-4-0709 failed (very weird, claims the content is from its last visit to the site, with the date listed as 2023).
ajax passes.

#

Uh oh... Orionmist failed

#

Looks like Gemini 3 hasn't solved the hallucination issue (even Mistral 24B managed to not hallucinate!)

Edit: I thought it had hallucinated because I assumed models don't have access to the internet, but upon closer examination, it must have some sort of cached database, since those are real headlines from 2 days ago!

tiny sierra Oct 23, 2025, 3:47 PM

#

sudden idol Looks like Gemini 3 hasn't solved the hallucination issue (even Mistral 24B mana...

LOL.

#

mine hallucinates even with directly uploaded text or .docx files

#

(2.5 pro)

wet oasis Oct 23, 2025, 3:54 PM

#

I'm disappointed

sudden idol Oct 23, 2025, 3:57 PM