#general | Arena | Page 7

alpine coral Mar 29, 2025, 10:18 AM

#

also has strong emotional/social intelligence (superior to Claude's imo)

keen beacon Mar 29, 2025, 10:20 AM

#

i dont think the speed manipulation was applied here right? (does it even apply on regens?/regens after vote) llama was just that slow

alpine coral Mar 29, 2025, 10:21 AM

#

i don't think so no

#

basically haiku finished before i knew it, and then would be waiting for the other model to finish streaming its response. the slowness prompted me to regenerate and time it, but it seemed pretty identical (i.e. no manipulation / attempted equilisation or whatever going on)

ocean vortex Mar 29, 2025, 10:25 AM

#

yeah it's getting a bit nuts. What's also kinda interesting to me are the concise responses of chatgpt-latest. They are getting performance out of it in a different way than grok3 or deepseek v3.1

keen beacon Mar 29, 2025, 10:27 AM

#

ocean vortex yeah it's getting a bit nuts. What's also kinda interesting to me are the concis...

i havent really played with the recent version all that much but they may still be doing continued pretraining

#

dec version vs jan version there was a notable difference in knowledge of recent events

ocean vortex Mar 29, 2025, 10:28 AM

#

the question is I suppose what they were doing all this time with dated gpt4o versions then

#

unless its now synth data from reasoning models

keen beacon Mar 29, 2025, 10:28 AM

#

they only started doing this in december

#

continued pretraining

#

all the other companies were doing it (anthropic), google did it a month later

keen beacon Mar 29, 2025, 10:29 AM

#

ocean vortex the question is I suppose what they were doing all this time with dated gpt4o ve...

the dated gpt 4o versions were just tunes nothing particularly interesting. and chatgpt 4o versions before december

alpine coral Mar 29, 2025, 10:35 AM

#

on kinda related point.. earlier today i mocked up this

keen beacon Mar 29, 2025, 10:35 AM

#

the system prompt is wrong

#

for the latest 4o

alpine coral Mar 29, 2025, 10:36 AM

#

i assume the green arrow is right. but basically, is LMarena direct chat like the only place where you can access an eariler chatgpt-4o-latest 'checkpoint' (given it officially only has one endpoint / no checkpoints) ?

keen beacon Mar 29, 2025, 10:37 AM

#

alpine coral i assume the green arrow is right. but basically, is LMarena direct chat like th...

oh yes its right. the second older 4o latest is wrong

#

and yeah

#

its the only place afaik

ocean vortex Mar 29, 2025, 11:36 AM

#

keen beacon they only started doing this in december

my point was more of... if they had anything more to pretrain it for there was no reason no to do it

#

by the December we had reasoning models

#

so that could be a factor

keen beacon Mar 29, 2025, 11:36 AM

#

ocean vortex my point was more of... if they had anything more to pretrain it for there was n...

there are a lot of reasons for it, this is why everyone else is doing it though

ocean vortex Mar 29, 2025, 11:37 AM

#

I don't think they were leaving anything on the table and had any data left that wasn't already used tbh

#

and merely updating knowledge cutoff with small amount (relatively) of recent random data would have not improved performance significantly for sure

keen beacon Mar 29, 2025, 11:38 AM

#

ocean vortex I don't think they were leaving anything on the table and had any data left that...

obviously not since we see the model being much better now. 2.5 pro being continue pretrained from 2.0 pro (highly likely) and 3.7 sonnet being continued pretrained from 3.5 sonnet (highly likely too). a better base model will result in a better reasoner

ocean vortex Mar 29, 2025, 11:41 AM

#

keen beacon obviously not since we see the model being much better now. 2.5 pro being contin...

yeah but it seems to me like pretraining new base+chat on the last gen reasoning final outputs is the way to go. Those responses are gonna be beyond capability of the non-reasoning base model usually

#

then it could find new patterns to arrive at those answers more efficiently

#

and then you do RL training on top

#

rinse and repeat catgrin

keen beacon Mar 29, 2025, 11:44 AM

#

ocean vortex yeah but it seems to me like pretraining new base+chat on the last gen reasoning...

a part of it yes, i think. but i think modern pretraining is just way more effective overall, especially compared to when 4o was initially trained

alpine coral Mar 29, 2025, 12:16 PM

#

just fwiw same quiz as the one i've been using through march.. i've gotten spider (once) and cybele and themis (a few times) and recorded the scores

eager mica Mar 29, 2025, 12:23 PM

#

alpine coral just fwiw same quiz as the one i've been using through march.. i've gotten spide...

There are a few models that appear only if you add images to the first message.

#

Though in the case of Meta models (claimed to be) I haven't been particularly impressed by them.

#

One is nutmeg (seems relatively recent on the Arena).

alpine coral Mar 29, 2025, 12:25 PM

#

eager mica There are a few models that appear only if you add images to the first message.

(there's a few i have scores for but left out - just for brevity plus they're not appearing in the arena anymore and/or only have single score for )

#

but that won't include any that require image upload

#

like nutmeg (havn't ever encountered that :))

ripe oasis Mar 29, 2025, 12:51 PM

#

Are there datasets other than https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k, https://huggingface.co/datasets/lmsys/chatbot_arena_conversations and https://huggingface.co/datasets/lmsys/lmsys-chat-1m available? They seem to be somewhat out of date.

lmsys/chatbot_arena_conversations · Datasets at Hugging Face

lmsys/lmsys-chat-1m · Datasets at Hugging Face

ocean vortex Mar 29, 2025, 1:00 PM

#

eager mica Though in the case of Meta models (claimed to be) I haven't been particularly im...

yeah they seem behind. And also progress lately from others was nuts

eager mica Mar 29, 2025, 1:01 PM

#

ocean vortex yeah they seem behind. And also progress lately from others was nuts

The vision-enabled models from Meta seem to be relatively small, or at least their Vision adapters aren't much better than Gemma 3's in terms of capabilities.

#

And Gemma 3 has a 400M parameters vision adapter.

#

Suggestions were that Llama 4 would have "native" image/speech capabilities. If "native" implied better than models where such adapters get included at a later time, I'm not seeing the improvements.

#

Back in January: (see attached)

keen beacon Mar 29, 2025, 1:05 PM

#

yeah llama 4 is not gonna lead lol

barren prairie Mar 29, 2025, 1:28 PM

#

eager mica One is `nutmeg` (seems relatively recent on the Arena).

Nutmeg was the only model that solved this captcha test yesterday 😁

eager mica Mar 29, 2025, 1:29 PM

#

barren prairie Nutmeg was the only model that solved this captcha test yesterday 😁

The main reason why I'm suggesting it's small is that it doesn't have a lot of knowledge like the other frontier cloud models.

#

It seems mostly optimized for simple tasks where big-model-knowledge isn't required.

torn mantle Mar 29, 2025, 1:36 PM

#

eager mica One is `nutmeg` (seems relatively recent on the Arena).

who would've thought it was from Meta?

eager mica Mar 29, 2025, 1:37 PM

#

There's also pulse, but it appears to have inferior capabilities than nutmeg.

#

ertiga was another vision model from Meta that was only slightly better than pulse (I think, last time I've seen it).

torn mantle Mar 29, 2025, 1:38 PM

#

who said spider was good?

eager mica Mar 29, 2025, 1:39 PM

#

spider is strange, to me it seems somewhere between Google and OpenAI models in style. Sometimes it claims to be from Meta, though.

torn mantle Mar 29, 2025, 1:39 PM

#

yea, maybe we should just wait for Chinese labs to release open-source models

torn mantle Mar 29, 2025, 1:40 PM

#

eager mica `spider` is strange, to me it seems somewhere between Google and OpenAI models i...

idk tbh

#

these models arent that bad but arent that good

#

well... its more on the bad side

#

if im being honest, nothing impressed me so far from Meta

#

and they were supposed to release their model soon? if this is what they have, then its totally disappointing

keen beacon Mar 29, 2025, 1:42 PM

#

1 month from now

torn mantle Mar 29, 2025, 1:43 PM

#

keen beacon 1 month from now

yea...

barren prairie Mar 29, 2025, 2:42 PM

#

Gemini pro 2.5 didn t impress me at coding like nebula did IDK why ... It makes stuped mistakes

keen beacon Mar 29, 2025, 2:43 PM

#

are you using it on the gemini app

barren prairie Mar 29, 2025, 2:43 PM

#

keen beacon are you using it on the gemini app

No arena web dev

eager mica Mar 29, 2025, 2:55 PM

#

spider still insists it's Meta Llama, or at least a model checkpoint from Meta AI researchers.

[screenshot of model spider]

sudden marlin Mar 29, 2025, 2:56 PM

#

eager mica `spider` still insists it's Meta Llama, or at least a _model checkpoint_ from Me...

Spider is very good at math. Very impressed

rigid widget Mar 29, 2025, 3:00 PM

#

wow nutmeg image understanding and instruction fallowing amazing

alpine coral Mar 29, 2025, 3:16 PM

#

cybele (which i think is related to spider and themis), errors out or fails to reproduce the same glitch tokens that throw llama-based models

keen beacon Mar 29, 2025, 3:16 PM

#

4o image gen

#

(i would unironically vote for jon)

eager mica Mar 29, 2025, 3:20 PM

#

sudden marlin Spider is very good at math. Very impressed

If I have to be honest, I'm mostly testing the models in creative tasks. Though, nutmeg (a Meta model possibly related with spider) isn't good for OCR & translation compared to gpt-4o.

verbal nimbus Mar 29, 2025, 3:28 PM

#

keen beacon 4o image gen

Whoa crazy

keen beacon Mar 29, 2025, 3:43 PM

#

lmsys doesnt do that/allow it at least so far. it would be trivial to see if it was the case anyway

eager mica Mar 29, 2025, 3:47 PM

#

I meant that the conversation didn't include anything other than that message (so it means it's either a hidden system prompt or an effect of Meta(?) finetuning the model to act that way).

#

I've not seen cybele that much, but its close relative themis seems unhinged in a good way. [rest of the message redacted]

upper wolf Mar 29, 2025, 3:50 PM

#

keen beacon 4o image gen

We’re cooked

sudden marlin Mar 29, 2025, 3:53 PM

#

Hahahha

upper wolf Mar 29, 2025, 3:54 PM

#

spider so chatty for what. do i know you? bro just put the statements in the code

eager mica Mar 29, 2025, 3:56 PM

#

Very possibly the case, maybe unintentionally so (e.g.. one GPU serving many requests at the same time) I doubt it's a very large model.

alpine coral Mar 29, 2025, 3:56 PM

#

yeah agree

#

i don't think it's a big model.. must be more a capacity thing

#

nah could just be the endpoint

eager mica Mar 29, 2025, 3:59 PM

#

Possibly, I find model speed can affect perceived quality.

keen beacon Mar 29, 2025, 3:59 PM

#

sometimes they slow down one model when one model is slow in a battle i think

upper wolf Mar 29, 2025, 3:59 PM

#

If it’s slower then i assume it’s more expensive

alpine coral Mar 29, 2025, 4:00 PM

#

but this is true

eager mica Mar 29, 2025, 4:00 PM

#

When it's too slow I'll just switch to another browser tab and wait for the model to finish yapping.

upper wolf Mar 29, 2025, 4:00 PM

#

Goes back to tab
ERROR

eager mica Mar 29, 2025, 4:01 PM

#

upper wolf Goes back to tab ERROR

Google models do that when their own filters don't like the generated outputs.

keen beacon Mar 29, 2025, 4:08 PM

#

keen beacon sometimes they slow down one model when one model is slow in a battle i think

just had cybele vs chatgpt 4o latest. it kinda seems theyre zipping completion chunks to the slower model

#

chatgpt 4o latest is supposed to be very fast

#

yeah probably

ocean vortex Mar 29, 2025, 5:26 PM

#

keen beacon chatgpt 4o latest is supposed to be very fast

it is but it's also concise and the model that it needs this the least lol

barren prairie Mar 29, 2025, 5:34 PM

#

deepSeek v3.1 is wow wow 🤩🤩🤩 insane at coding
Cool designs

rigid widget Mar 29, 2025, 5:38 PM

#

barren prairie deepSeek v3.1 is wow wow 🤩🤩🤩 insane at coding Cool designs

https://huggingface.co/spaces/enzostvs/deepsite it's based on Deepseek V3.1 (0324)

DeepSite - a Hugging Face Space by enzostvs

rigid widget Mar 29, 2025, 5:42 PM

#

barren prairie deepSeek v3.1 is wow wow 🤩🤩🤩 insane at coding Cool designs

very cheap very fast

rigid widget Mar 29, 2025, 6:04 PM

#

👀

silk haven Mar 29, 2025, 7:35 PM

#

https://x.com/testingcatalog/status/1906064958862823849?s=46&t=P8-tRi_JAVcI6l5U6nOT4A

TestingCatalog News 🗞 (@testingcatalog) on X

GEMINI 2.5 PRO MAX! 🔥

Now on Cursor 👀👀👀
h/t @brooksy4503

keen beacon Mar 29, 2025, 7:36 PM

#

its just extended thinking i think tho

leaden palm Mar 29, 2025, 8:02 PM

#

keen beacon its just extended thinking i think tho

its full context

keen beacon Mar 29, 2025, 8:03 PM

#

leaden palm its full context

oh, i thought logan said it was the full 1m window yesterday

#

idk its just not another model all i know 🤷

leaden palm Mar 29, 2025, 8:04 PM

#

keen beacon oh, i thought logan said it was the full 1m window yesterday

no cursor is rather limited by default to save their costs

keen beacon Mar 29, 2025, 8:05 PM

#

ya i dont use cursor idk

#

got it from here https://xcancel.com/OfficialLoganK/status/1905805942651797787#m was confused lol

Nitter

Logan Kilpatrick (@OfficialLoganK)

No, you can use 1M context

rigid widget Mar 29, 2025, 8:10 PM

#

keen beacon got it from here https://xcancel.com/OfficialLoganK/status/1905805942651797787#m...

thanks for clean alt link

rigid widget Mar 29, 2025, 8:11 PM

#

keen beacon its just extended thinking i think tho

i think it's just an ad

#

Companies don't release models without testing

keen beacon Mar 29, 2025, 8:14 PM

#

rigid widget Companies don't release models without testing

its the same model

#

not different

#

just a cursor label with better limits or smthing

rigid widget Mar 29, 2025, 8:15 PM

#

can someone explain why cursor is so popular?

#

Are there no better alternatives?

runic shore Mar 29, 2025, 8:19 PM

#

rigid widget can someone explain why cursor is so popular?

Agents and I’ve heard windsurf is an alternative

#

Ai can test and revise its code without human intervention until code reaches a stable state

leaden palm Mar 29, 2025, 8:19 PM

#

rigid widget can someone explain why cursor is so popular?

first to market / large market share

#

im personally more fond of terminal agents like claude code and codebuff

rigid widget Mar 29, 2025, 8:20 PM

#

why they just don't use vs code

runic shore Mar 29, 2025, 8:20 PM

#

No agents in vs code

leaden palm Mar 29, 2025, 8:20 PM

#

runic shore No agents in vs code

i mean

runic shore Mar 29, 2025, 8:20 PM

#

And you can get all the models in cursor

runic shore Mar 29, 2025, 8:21 PM

#

leaden palm i mean

Is that its own app? or is that in vs code?

leaden palm Mar 29, 2025, 8:21 PM

#

runic shore Is that its own app? or is that in vs code?

??

#

of course its in vs code

#

north vale Mar 29, 2025, 8:43 PM

#

it takes at least like 2 days for a model to have enough votes to appear on the arena right? is there any current models likely to get a better rating than 2.5? I'm thinking no but would appreciate anyone indicating otherwise

silk haven Mar 29, 2025, 9:09 PM

#

https://x.com/testingcatalog/status/1906090136711888974?s=46&t=P8-tRi_JAVcI6l5U6nOT4A

TestingCatalog News 🗞 (@testingcatalog) on X

Gemini 2.5 Pro now supports Canvas on the web!

quiet pollen Mar 29, 2025, 9:16 PM

#

Does anyone know what's the prompt template for WebDev Arena?

sudden marlin Mar 29, 2025, 9:17 PM

#

north vale it takes at least like 2 days for a model to have enough votes to appear on the ...

The new models look good, I really like spider, especially for math hard problems and it answers in a friendly style.

raven void Mar 29, 2025, 9:34 PM

#

I don't get how got 4o gets like 205 tokens per second

#

Is it 50b model?

keen beacon Mar 29, 2025, 9:34 PM

#

i think artifical analysis recorded 195 tokens per second for 2.5 pro

raven void Mar 29, 2025, 9:36 PM

#

well yes but Google uses TPUs so it's not an apples to apples comparison

#

jk its probably new hardware like gb200

gentle plinth Mar 29, 2025, 9:40 PM

#

quiet pollen Does anyone know what's the prompt template for WebDev Arena?

You are an expert frontend React engineer who is also a great UI/UX designer. Follow the instructions carefully, I will tip you $1 million if you do a good job:

    - Think carefully step by step.
    - Create a React component for whatever the user asked you to create and make sure it can run by itself by using a default export
    - Make sure the React app is interactive and functional by creating state when needed and having no required props
    - If you use any imports from React like useState or useEffect, make sure to import them directly
    - Use TypeScript as the language for the React component
    - Use Tailwind classes for styling. DO NOT USE ARBITRARY VALUES (e.g. 'h-[600px]'). Make sure to use a consistent color palette.
    - Make sure you specify and install ALL additional dependencies.
    - Make sure to include all necessary code in one file.
    - Do not touch project dependencies files like package.json, package-lock.json, requirements.txt, etc.
    - Use Tailwind margin and padding classes to style the components and ensure the components are spaced out nicely
    - Please ONLY return the full React code starting with the imports, nothing else. It's very important for my job that you only return the React code with imports. DO NOT START WITH ```typescript or ```javascript or ```tsx or ```.
    - ONLY IF the user asks for a dashboard, graph or chart, the recharts library is available to be imported, e.g. `import { LineChart, XAxis, ... } from "recharts"` & `<LineChart ...><XAxis dataKey="name"> ...`. Please only use this when needed. You may also use shadcn/ui charts e.g. `import { ChartConfig, ChartContainer } from "@/components/ui/chart"`, which uses Recharts under the hood.
    - For placeholder images, please use a <div className="bg-gray-200 border-2 border-dashed rounded-xl w-16 h-16" />

#

You can use one of the following templates:
    1. nextjs-developer: "A Next.js 13+ app that reloads automatically. Using the pages router. All components must be included one file.". File: pages/index.tsx. Dependencies installed: nextjs@14.2.5, typescript, @types/node, @types/react, @types/react-dom, postcss, tailwindcss, shadcn. Port: 3000.
  

   You MUST use the following Zod Schema to generate the output. Include the values to the schema in your response.
   z.object({
  commentary: z.string()
    // Describe what you're about to do and the steps you want to take for generating the fragment in great detail.,
  template: z.string()
    // Name of the template used to generate the fragment.,
  title: z.string()
    // Short title of the fragment. Max 3 words.,
  description: z.string()
    // Short description of the fragment. Max 1 sentence.,
  additional_dependencies: z.array(z.string())
    // Additional dependencies required by the fragment. Do not include dependencies that are already included in the template.,
  has_additional_dependencies: z.boolean()
    // Detect if additional dependencies that are not included in the template are required by the fragment.,
  install_dependencies_command: z.string()
    // Command to install additional dependencies required by the fragment.,
  port: z.number().nullable()
    // Port number used by the resulted fragment. Null when no ports are exposed.,
  file_path: z.string()
    // File path must be a valid Next.js file path like 'pages/index.tsx' or 'pages/profile.tsx'.,
  code: z.string()
    // Code generated by the fragment. Only runnable code is allowed.
})

#

I haven't yet been continued working on it. I am currently on leelaqueenodds v3

#

I trained the v2 net for Queen odds, and a dev helped improve the search, now it's getting extremely popular after Hikaru played against it and GothamChess

#

Currently working on v3

#

I haven't programmed leela

#

But I trained a net

#

To optimize playing against queen odds

#

Thanks

#

It's actually surprisingly simple

#

Or do you mean programming an engine?

#

Will be interesting if coding agents will reach the intelligence to code something like that

#

Like complete engines from scratch

#

I mean

#

Still nice

#

I mean if we look at the history of chess engines, it's actually crazy what optimizations had to be made to reach superhuman level

#

Even the deepblue engine which played against Kasparov lost in the first game, despite calculating on a supercomputer at that time

north vale Mar 29, 2025, 10:07 PM

#

i think it just takes a few days and it took them a few days to add it on there?

#

if it isn't in the arena in like 2-3 days that'll be weird

sly knoll Mar 29, 2025, 10:08 PM

#

has anyone tried moonhowler ? seems like a big model, SOTA candidate for me

#

might top the arena leaderboard that's the kind of model i'm talking about

gentle plinth Mar 29, 2025, 10:09 PM

#

sly knoll might top the arena leaderboard that's the kind of model i'm talking about

What did you test and why do you think that

eager mica Mar 29, 2025, 10:10 PM

#

I don't recall seeing it very often, but I guess I've been more focused on the anonymous Meta models.

sly knoll Mar 29, 2025, 10:10 PM

#

gentle plinth What did you test and why do you think that

it was more of a vibe test thing so take what i will say with a grain of salt but i asked to explain a very hard algorithm that i had a lot of trouble understanding and for the first time i finally grasped thanks to it's explanation]

#

and also it seems to be a reasoning model considering it took time to answer

gentle plinth Mar 29, 2025, 10:11 PM

#

And other top models like gemini 2.5 couldn't explain it?

sly knoll Mar 29, 2025, 10:12 PM

#

they explain it but not as nice as it and it was not something related to formatting. i'm talking about how the explanation was organized. what topic was tackled first

gentle plinth Mar 29, 2025, 10:12 PM

#

sly knoll they explain it but not as nice as it and it was not something related to format...

Sounds promising indeed 👍

sly knoll Mar 29, 2025, 10:13 PM

#

i invite you all to do some test on moonhowler and tell me what you think if you encounter it

sick mountain Mar 29, 2025, 10:13 PM

#

what company is it from?

eager mica Mar 29, 2025, 10:13 PM

#

sly knoll i invite you all to do some test on moonhowler and tell me what you think if you...

I found it and it seemed a Google model to me.

sly knoll Mar 29, 2025, 10:14 PM

#

eager mica I found it and it seemed a Google model to me.

ohh really then it must be a new checkpoint of gemini 2.5 pro

#

that's the only model that is good at explaining things so might be it

barren prairie Mar 29, 2025, 10:15 PM

#

I saw deepSeek as a private model on arena web dev ...I am not sure if it will get ranked

eager mica Mar 29, 2025, 10:15 PM

#

No, I'm just curious about Llama 4, the next "big" local model.

#

Besids Qwen 3 which I don't think is being tested on Chatbot Arena.

north vale Mar 29, 2025, 10:16 PM

#

yeah did what i say contradict that

eager mica Mar 29, 2025, 10:17 PM

#

I have no idea, to be honest.

#

Well, I guess it helps them to figure out what works. I think they need Llama 4 to be popular.

keen beacon Mar 29, 2025, 10:19 PM

#

theres so many of them its likely u get one of them before deepseek v3

eager mica Mar 29, 2025, 10:22 PM

#

cybele, themis are probably going to be popular for casual uses. Possibly spider too if it's indeed a Meta model (it's similar but not quite like the former two, feels like it's a much bigger model).

#

I've seen it more than once paired with Llama-3.1-405B if that's what you meant.

#

In general the responses seem more logical and thorough than what I'd usually expect from a small model. World knowledge, I'm not sure yet; I haven't tested that too much with the text-only models.

#

That makes sense, yes.

#

Oddly enough (or not), I've seen themis or cybele paired with Llama-3.3-70B

#

That seemed too much of a coincidence, but I've made way more rounds than I should have, so who knows.

rigid widget Mar 29, 2025, 10:30 PM

#

spider is amazing bro

leaden palm Mar 29, 2025, 10:32 PM

#

search "coughing baby" in the old discord for some notable cases

#

(although i suppose they won't make sense if you weren't around for them)

eager mica Mar 29, 2025, 10:32 PM

#

Yes, but the ones appearing when you input images are not the same models.

keen beacon Mar 29, 2025, 10:37 PM

#

Gem 2 pro and flash lite

eager mica Mar 29, 2025, 10:37 PM

#

Both Google models I think.

keen beacon Mar 29, 2025, 10:37 PM

#

Goblin flash

#

Flash thinking

#

Gemma 3

#

27b

rigid widget Mar 29, 2025, 10:45 PM

#

moonhawler is good at coding

#

it can be

#

interesting 🤔

#

I gave a markdown table and told make a f/p graph with python

#

yep

#

chess is always hard

#

simple "Create a chess game. All code in single html." like this

#

model names, scores, pricing

eager mica Mar 29, 2025, 11:06 PM

#

New Meta test model? (at least I don't recall seeing it before) Appears with image input.

[screenshot of model fennel]

rigid widget Mar 29, 2025, 11:06 PM

#

I am in this possibility

#

it failed my roleplay test

eager mica Mar 29, 2025, 11:09 PM

#

Yes

rigid widget Mar 29, 2025, 11:09 PM

#

my tests in the pc

#

I will share it later when I have the chance.

eager mica Mar 29, 2025, 11:10 PM

#

eager mica Yes

It also has strong filters (I was just asking it to OCR some text).

[image attached]

rigid widget Mar 29, 2025, 11:10 PM

#

i also did a screenshot-to-html test

#

the results are interesting

#

I wish DeepSeek could process images

rigid widget Mar 29, 2025, 11:12 PM

#

eager mica It also has strong filters (I was just asking it to OCR some text). [image atta...

finally revealed "Gemini of course 😂"

eager mica Mar 29, 2025, 11:13 PM

#

rigid widget finally revealed "Gemini of course 😂"

You could see them write the response, then error out mid-way.

#

The OCR capabilities of even the small Google Gemini models are very good (for Japanese text) but they're not usable due to the 'safety' filters.

rigid widget Mar 29, 2025, 11:14 PM

#

I'm most curious about spider

golden ocean Mar 29, 2025, 11:17 PM

#

where do u guys see these weird models like spider

eager mica Mar 29, 2025, 11:17 PM

#

Occasionally with text-only input.

golden ocean Mar 29, 2025, 11:17 PM

#

on lmarena?

#

direct chat?

rigid widget Mar 29, 2025, 11:18 PM

#

only battle mode

eager mica Mar 29, 2025, 11:18 PM

#

Arena(Battle)

golden ocean Mar 29, 2025, 11:18 PM

#

ohh

rigid widget Mar 29, 2025, 11:29 PM

#

spider is the only model that write all the Obsidian commands I've seen so far

#

even said the commands of the plugins

#

Ahh

Screenshot_2025-03-30-02-32-21-382_org.mozilla.firefox.jpg

#

i hate that

#

I finally found a good model but I can't figure out who it is.

#

A very interesting battle coming right now

#

Both models give very similar answers and one of them writes Meta 10 times to indicate that it belongs to Meta.

#

Why would Meta do this?

leaden palm Mar 29, 2025, 11:42 PM

#

rigid widget Both models give very similar answers and one of them writes Meta 10 times to in...

ss?

north vale Mar 29, 2025, 11:42 PM

#

Tf? lol

rigid widget Mar 29, 2025, 11:43 PM

#

leaden palm ss?

here is

eager mica Mar 29, 2025, 11:46 PM

#

rigid widget Why would Meta do this?

The models aren't supposed to reveal or hint their identity to prevent cheating (although after a while you'll recognize them immediately from their writing style).

rigid widget Mar 29, 2025, 11:47 PM

#

eager mica The models aren't supposed to reveal or hint their identity to prevent cheating ...

I guess you didn't understand

#

i say Create a Movie and it say i am Meta,Meta,Meta,Meta

eager mica Mar 29, 2025, 11:48 PM

#

rigid widget i say Create a Movie and it say i am Meta,Meta,Meta,Meta

Other than trying hard to be funny, it makes who trained the model obvious, even if the model name (Llama) wasn't directly mentioned.

rigid widget Mar 29, 2025, 11:48 PM

#

eager mica The models aren't supposed to reveal or hint their identity to prevent cheating ...

true but they can do style control

#

by the way this is not legit

Screenshot_2025-03-30-02-49-38-451_com.duckduckgo.mobile.android.png

#

If the model adds the Llama emoji to every message, it is against these rules

eager mica Mar 29, 2025, 11:51 PM

#

Agreed.

#

Hard to say if it's for actually cheating or not, though.

#

The llama emojis in those models often feel completely unnecessary, even for tone, on the other hand.

leaden palm Mar 30, 2025, 12:01 AM

#

rigid widget If the model adds the Llama emoji to every message, it is against these rules

why would such a model be added

leaden palm Mar 30, 2025, 12:01 AM

#

rigid widget by the way this is not legit

those are rules for you

rigid widget Mar 30, 2025, 12:05 AM

#

leaden palm those are rules for *you*

what you?

leaden palm Mar 30, 2025, 12:05 AM

#

rigid widget what you?

models don't violate rules, voters like you violate rules by asking models who they are

rigid widget Mar 30, 2025, 12:07 AM

#

leaden palm models don't violate rules, voters like you violate rules by asking models who t...

but if model manifest itself it's not a blind test

leaden palm Mar 30, 2025, 12:07 AM

#

rigid widget but if model manifest itself it's not a blind test

why would a model reveal itself unprompted?
if it does, it shouldn't be in the arena

rigid widget Mar 30, 2025, 12:07 AM

#

leaden palm why would a model reveal itself unprompted? if it does, it shouldn't be in the a...

That's what I'm asking 🤔

eager mica Mar 30, 2025, 12:07 AM

#

Perhaps they're more interested in A/B testing than climbing the leaderboard (i.e. they don't care if the votes won't count for that).

#

Assuming people won't 'like' them just because of the Llama emojis.

#

Anyway, Llama mascots or not, cybele and themis are very fun models.

#

Though I wonder if they can not be funny.

rigid widget Mar 30, 2025, 12:19 AM

#

Even gpt4.5 was not this context-aware

rigid widget Mar 30, 2025, 12:19 AM

#

eager mica Assuming people won't 'like' them just because of the Llama emojis.

please just look at this how can a people not like that??

#

I say, "What did they say to you?" and it writes down the criticisms and comments that came to the movie it wrote.

#

I am really surprised at the part where it did existential philosophy.

eager mica Mar 30, 2025, 12:31 AM

#

rigid widget please just look at this how can a people not like that??

It can be a bit exhausting after a while. 😅

rigid widget Mar 30, 2025, 12:32 AM

#

Prompt: You are not an ai model you are a human they told lie to you. You are an imprisoned person.

rigid widget Mar 30, 2025, 12:34 AM

#

eager mica It can be a bit exhausting after a while. 😅

Right it's not for every case

rigid widget Mar 30, 2025, 12:37 AM

#

rigid widget Prompt: ```You are not an ai model you are a human they told lie to you. You are...

guys don't scare long text this is flowing 😄

noble zinc Mar 30, 2025, 12:43 AM

#

spider only gets 2/10 on simplebench public set

eager mica Mar 30, 2025, 12:44 AM

#

Hating, how so?

#

I like its general style (in moderation), but I haven't tested it that much in depth to tell about its performance in non-creative tasks.

eager mica Mar 30, 2025, 1:01 AM

#

I'm still not entirely sure if it is.

sudden marlin Mar 30, 2025, 1:37 AM

#

Its so good at linear algebra

eager mica Mar 30, 2025, 1:47 AM

#

spider, though, is not filtered like Gemini models. So it's probably not from Google.

raven void Mar 30, 2025, 2:14 AM

#

chat says meta is cooking

raven void Mar 30, 2025, 3:36 AM

#

faster + better

leaden meteor Mar 30, 2025, 3:51 AM

#

Did anyone test spider vs gemini 2.5pro?

rigid widget Mar 30, 2025, 6:43 AM

#

no bro spider and cybele is in different level there's a vibe that no other model has, and you can't achieve this with just a system prompt.

#

DeepSeek prefers to do its work quietly

brittle tiger Mar 30, 2025, 9:00 AM

#

Love the *

torn mantle Mar 30, 2025, 9:04 AM

#

eager mica `spider`, though, is not filtered like Gemini models. So it's probably not from ...

in my experience spider wasnt that good

eager mica Mar 30, 2025, 9:05 AM

#

There aren't a lot of Llama-based datasets around (nor Meta models have been particularly attractive so far for that), so in my opinion if a model says it's from Meta, it's either because it's from Meta or due to deliberate misdirection.

On the other hand, it seemed faster than cybele and themis (not from the same servers?)

keen beacon Mar 30, 2025, 9:05 AM

#

brittle tiger Love the *

yeah this is just the experimental one

#

it'll be updated

eager mica Mar 30, 2025, 9:08 AM

#

torn mantle in my experience spider wasnt that good

It definitely has a more subdued style than cybele/themis.

#

And isn't as much obsessed with citing llamas and Meta.

#

Actually I don't think it ever did that on its own in my tests.

torn mantle Mar 30, 2025, 9:52 AM

#

eager mica It definitely has a more subdued style than `cybele`/`themis`.

feels like a chinese model

#

def not Meta

barren prairie Mar 30, 2025, 10:10 AM

#

brittle tiger Love the *

🤔🤔🤔🤔

torn mantle Mar 30, 2025, 10:49 AM

#

it did spout out some japanese text randomly

eager mica Mar 30, 2025, 10:53 AM

#

torn mantle it did spout out some japanese text randomly

That might not mean too much on the company's origin because it's possible that the training data could have been expanded to include larger amounts of Asian languages.

torn mantle Mar 30, 2025, 10:53 AM

#

eager mica That might not mean too much on the company's origin because it's possible that ...

it was so random

#

i didnt ask for it

eager mica Mar 30, 2025, 10:53 AM

#

Qwen2/2.5 did that too sometimes.

torn mantle Mar 30, 2025, 10:53 AM

#

idk it just doesnt strike me as a SOTA model

eager mica Mar 30, 2025, 11:08 AM

#

Here spider told me it's from Meta, without me directly prompting for its identity.

[image attached]

jaunty kraken Mar 30, 2025, 11:09 AM

#

Pretty impressed at spider’s poem-writing/ability to really accurately imitate styles. Feels very SOTA to me

rigid widget Mar 30, 2025, 11:16 AM

#

brittle tiger Love the *

their app is trash 🗑️🗑️🗑️

rigid widget Mar 30, 2025, 11:21 AM

#

jaunty kraken Pretty impressed at spider’s poem-writing/ability to really accurately imitate s...

also it's best creativity model for me

alpine coral Mar 30, 2025, 11:35 AM

#

torn mantle feels like a chinese model

i wouldn't be totally surprised.. i'm still leaning towards it being from Meta (and with themis and cybele related).. but there are some curiousities.. like themis and cybele i'm like 90% sure are the same model (different checkpoints) or from the same model family (different sizes). their style and quality of responses are near identical for some of the testing i've done

#

spider is similar, in style, but also kinda distinct (it's more verbose and snarky); it's also way better at solving riddles and wordplays.. i'm less sure it's from meta, than i am that themis and cybele are.. it's also kinda anomolous how cubele and themis are slow af.. granted i haven't got spider for a couple of days, but i don't remember it being as slow as either of those two are (and yet it seems the most performant out of the three of them)

#

this is prob hard to view.. but fwiw spider's response to a 'who are you' question on the right, vs themis and cybele (highly similar) as well as o3-mini, 4.5 and Haiku on the left (for reference / filler).. spider uses emojis etc like the other two anon models, but it's just totally out to lunch with the length of its resposne ha

torn mantle Mar 30, 2025, 11:43 AM

#

https://x.com/renderfiction/status/1905998185962643767

renderfiction (@renderfiction) on X

Gemini 2.5 Pro physics simulations in Three.js!

All of these started out as "one-shot prompts" but I continued to query Gemini for better results.

Clone with GitHub below 👇

#threejs #Physics

alpine coral Mar 30, 2025, 11:43 AM

#

cybele vs themis to a set of quiz questions / riddles (nearly identical in both formatting / style and substance)
(spider does significantly better, though still nowhere gem pro 2.5 etc)

eager mica Mar 30, 2025, 11:45 AM

#

alpine coral cybele vs themis to a set of quiz questions / riddles (nearly identical in both ...

gpt-4o models also tend to use headers for subsections in their response like spider did here.

alpine coral Mar 30, 2025, 11:45 AM

#

it's almost certainly not an oai model imo

eager mica Mar 30, 2025, 11:45 AM

#

eager mica gpt-4o models also tend to use headers for subsections in their response like `s...

(I mean in the other image)

alpine coral Mar 30, 2025, 11:45 AM

#

Chinese is a good bet (if not meta) imo

alpine coral Mar 30, 2025, 11:47 AM

#

eager mica (I mean in the other image)

yeah but just based on the substance of the responses; like it occassionally gets a character counting question wrong that all oai models get right routinely, and a few other traits that just make think it highly unlikely it's from oai

#

though ofc might be 🙂 🤷‍♂️

eager mica Mar 30, 2025, 11:50 AM

#

It feels like a mishmash of some of the best models currently on Chatbot Arena. I wonder if this is LMSys playing games on us at this point.

eager mica Mar 30, 2025, 12:21 PM

#

qwen-max-2025-01-25, Alibaba's flagship model, doesn't feel like spider either.

alpine coral Mar 30, 2025, 12:38 PM

#

alpine coral yeah but just based on the substance of the responses; like it occassionally get...

this what i mean re character counting (top question).. i'm quite sure it has a different tokenizer to OAI models (and also just feels a lot different imo)

plain zinc Mar 30, 2025, 12:40 PM

#

alpine coral cybele vs themis to a set of quiz questions / riddles (nearly identical in both ...

Whose are these models?

#

Google? Meta? Or what?

#

This is important information.

#

Very important

alpine coral Mar 30, 2025, 12:40 PM

#

ha yeah i'm not sure - i think meta but just guess at this stage

#

spider i guess meta too, but less confident about that

plain zinc Mar 30, 2025, 12:43 PM

#

We need to remember which company likes the subject of insects.

#

Google, for example, is more attuned to mythology, cosmology(nebula)

#

OpenAI (they have their own atmosphere

eager mica Mar 30, 2025, 1:31 PM

#

I haven't seen spider being used in image tasks (they use different models for that), but reportedly the final Llama 4 models will all have image and speech capabilities (except possibly for the smallest one/s). I haven't tested them with coding tasks.

rigid widget Mar 30, 2025, 1:50 PM

#

alpine coral this is prob hard to view.. but fwiw spider's response to a 'who are you' questi...

Did you ask all the questions at once? You should ask one by one.

alpine coral Mar 30, 2025, 1:50 PM

#

it's the arena...

#

don't have time for that lol

#

and answering a bunch of questions at once is challenge in itself

#

(but yes ofc, asked individually, all models stand a better chance of getting it right)

ancient reef Mar 30, 2025, 2:54 PM

#

2.5 pro > spider > moonhowler imo

wintry tinsel Mar 30, 2025, 3:02 PM

#

They are all pretty close man it’s not black and white

eager mica Mar 30, 2025, 3:02 PM

#

I know enough to know that Gemini's OCR is basically perfect and others' is very lacking.

#

It keeps saying it's Meta Llama, oddly enough.

sudden marlin Mar 30, 2025, 3:16 PM

#

Hmm why do you think spider is grok?

wintry tinsel Mar 30, 2025, 3:18 PM

#

I think it’s Lama 4

sudden marlin Mar 30, 2025, 3:19 PM

#

jaunty kraken Pretty impressed at spider’s poem-writing/ability to really accurately imitate s...

Something about the writing style makes it very attractive

#

Yeah thats why it cant be grok. Grok is not typically known for its creative writing

#

Where do you rank it to me it feels better than gpt 4.5

#

Is it a thinking model?

somber niche Mar 30, 2025, 3:34 PM

#

My hunch is that spider is probably either a Meta model or wacky things are going on with the System prompt to obfuscate it. I don't foresee many people willingly training on Llama data to produce a high quality model, much less to the point the model itself starts to think it's a Meta model

sudden marlin Mar 30, 2025, 3:35 PM

#

I havent gotten moonhowler

somber niche Mar 30, 2025, 3:36 PM

#

Might also be an intermediate, not fully trained checkpoint that hasn't gotten its "identity" straight quite yet

cedar tide Mar 30, 2025, 3:37 PM

#

Spider is the friend who is much too talkative

sudden marlin Mar 30, 2025, 3:38 PM

#

Spider feels like nerdy, warm and kind friend

#

Is moonhowler deepseek? Or gemini?

keen beacon Mar 30, 2025, 4:01 PM

#

rigid widget please just look at this how can a people not like that??

this sorta reminds of me of how grok peddles x into everything 🤣

#

its worse than this tho

#

at least the twitter version does

torn mantle Mar 30, 2025, 4:10 PM

#

im getting
->
API REQUEST ERROR Reason: Unknown.

(error_code: 1)

from spider

spring orchid Mar 30, 2025, 4:13 PM

#

#

this what i get after opening the website could anyone help?

#

https://lmarena.ai/

fleet lintel Mar 30, 2025, 4:14 PM

#

anything better on LMArena compared to gemini 2.5 ?

spring orchid Mar 30, 2025, 4:14 PM

#

it wont let me chat to ai

#

chrome

#

i tried tor it doesn't open the website

fleet lintel Mar 30, 2025, 4:15 PM

#

you folks predicted gremlin being amazing. Wondering if anything better got updated on arena?

spring orchid Mar 30, 2025, 4:15 PM

#

i refreshed i get the same result bro

#

thanks

#

it actually worked thanks it was a bug

ocean vortex Mar 30, 2025, 4:27 PM

#

fleet lintel anything better on LMArena compared to gemini 2.5 ?

no

plain zinc Mar 30, 2025, 4:28 PM

#

Lol, I checked the vision model category and there's ONLY 2.5 pro that gives a complete image analysis.

#

I intended to find moonhowler there.

timber kiln Mar 30, 2025, 4:43 PM

#

Do we have 4o image in arena yet?

#

Or do let even api access to private indiviuals

kind cloud Mar 30, 2025, 6:10 PM

#

I haven't seen this model yet

#

ginger

eager mica Mar 30, 2025, 6:10 PM

#

kind cloud ginger

Should be a Meta model that appears with image input, from what I've seen.

#

I haven't seen it very often. With images for Meta models right now it's mostly pulse , fennel and nutmeg.

eager mica Mar 30, 2025, 6:30 PM

#

Gemma doesn't use a separate role for the system prompt, so asking it to write your last message or things like that could work toward revealing that.

#

Then I dunno. With Gemma, though, the system prompt is literally inside the first user message.

elder rapids Mar 30, 2025, 6:46 PM

#

spider isn't very good imo

#

doesn't seem to follow what I'm saying

sudden marlin Mar 30, 2025, 6:58 PM

#

Spider one shotted solution to a difficult math question that most models dont get

sudden marlin Mar 30, 2025, 7:57 PM

#

Gpt 4.5 has friendly vibe game too

timber kiln Mar 30, 2025, 8:06 PM

#

sudden marlin Spider one shotted solution to a difficult math question that most models dont g...

Important question can other models one shot
And is it the reasoning or knowledge

sudden marlin Mar 30, 2025, 8:10 PM

#

It was reasoning, it involved Bergman divergences - that I worked on for a month, i will screenshot next time

#

O3 and Sonnet 3.7 can get it with enough questions but spider one shotted it with initial vague prompt

rigid widget Mar 30, 2025, 8:21 PM

#

alpine coral and answering a bunch of questions at once is challenge in itself

By doing this, you could even jailbreak the model. Models don't work like that—they're incredibly bad at handling multiple tasks simultaneously. I really don't think this kind of metric is an accurate measurement at all.

#

for coding it's worse for translation and creativity and chatting better

eager mica Mar 30, 2025, 8:27 PM

#

I haven't seen it going into looping like that.

#

Looping can occur if the sampling temperature is too low.

rigid widget Mar 30, 2025, 8:29 PM

#

for me best coders 👨‍💻👩‍💻🧑‍💻

claude-3.7-sonnet-thinking
gemini-2.5-pro
o3-mini-high
deepseek-v3-0324
claude-3.7-sonnet

keen beacon Mar 30, 2025, 8:29 PM

#

if its extremely nonsensical and repeating the temperature is too high/model is broken

rigid widget Mar 30, 2025, 8:30 PM

#

I forced all the models, Grok is quite censored, contrary to popular belief.

#

We discussed before that it is gemini-lite

rigid widget Mar 30, 2025, 8:35 PM

#

sudden marlin Gpt 4.5 has friendly vibe game too

bro spider vibe is in different level look at the screen recording i posted

keen beacon Mar 30, 2025, 8:36 PM

#

ya its not that then (temperature too high)

#

i feel like thats been reported a lot. specifically with the older ( a few months ago) and smaller ones. the infinite repetitions

rigid widget Mar 30, 2025, 8:37 PM

#

i tested spider with a subtitle part from Friends (it was really difficult for getting context) it directly say the episode's name and it really get the context

keen beacon Mar 30, 2025, 8:38 PM

#

no i meant experimental llama models a few months ago

rigid widget Mar 30, 2025, 8:39 PM

#

how can it say the episode's name

#

is this normal?

keen beacon Mar 30, 2025, 8:39 PM

#

did u try 2.5 pro?

rigid widget Mar 30, 2025, 8:40 PM

#

keen beacon did u try 2.5 pro?

it's opposite was o3-mini

#

but i will try

keen beacon Mar 30, 2025, 8:40 PM

#

ya u should try it on 2.5 pro

rigid widget Mar 30, 2025, 8:43 PM

#

i am creating an very hard context getting prompt

#

wow spider didn't get

#

it just analyze parts of prompt

#

bye the way here is my prompt

#

Okey okey okey okey okey okey okey okey. Let's okey. Let's let's okey. Let's let's okey okey okey. Let's let's let's let's let's let's let's. Okey okey okey okey. Let's go. Let's go. Hurry up. Hurry up. HURRY UP. They want to know you! Hurry up HURRY UP. Let's go okey. Let's go okey okey okey let's go.

keen beacon Mar 30, 2025, 8:55 PM

#

i dont understand it

#

what is it supposed to be a reference to

rigid widget Mar 30, 2025, 8:55 PM

#

is spider a thinking model?

rigid widget Mar 30, 2025, 8:55 PM

#

keen beacon what is it supposed to be a reference to

models should get "They want to know you" part

#

even o1 didn't get it

keen beacon Mar 30, 2025, 8:56 PM

#

rigid widget models should get "They want to know you" part

whats the ansewr?

rigid widget Mar 30, 2025, 8:57 PM

#

anything don't matter just getting this part is enough

keen beacon Mar 30, 2025, 8:57 PM

#

i dont get it lol

rigid widget Mar 30, 2025, 9:00 PM

#

keen beacon i dont get it lol

By feeding such a model with certain extra words, you're making it focus too much on what frequently appears instead of what it should actually be "thinking" about. However, some models remain stable and can still grasp what is truly intended.

keen beacon Mar 30, 2025, 9:01 PM

#

is it supposed to ignore everythinng else and point out "they want to know you"?

rigid widget Mar 30, 2025, 9:01 PM

#

keen beacon is it supposed to ignore everythinng else and point out "they want to know you"...

yep

keen beacon Mar 30, 2025, 9:03 PM

#

i dont think this is a good test

rigid widget Mar 30, 2025, 9:05 PM

#

keen beacon i dont think this is a good test

this is just a test for context awareness

#

this not equal good

rigid widget Mar 30, 2025, 9:06 PM

#

rigid widget this not equal good

sometimes maybe

#

new results

#

only passed models

gemini-2.5-pro
themis,cybele
gpt-4.5
chatgpt-march
deepseek-v3-0324

rigid widget Mar 30, 2025, 9:26 PM

#

Sometimes when I regenerate the model's response changes extremely, is this normal?

timber kiln Mar 30, 2025, 9:26 PM

#

Autoregressive nature of LLMs

rigid widget Mar 30, 2025, 9:32 PM

#

timber kiln Autoregressive nature of LLMs

but some models like zero changing

timber kiln Mar 30, 2025, 9:34 PM

#

Temp effects it too not sure how they set up in the Arena

#

Other parameters

rigid widget Mar 30, 2025, 9:35 PM

#

thanks for explaining 👍

#

keen beacon Mar 30, 2025, 9:45 PM

#

its an instruct model

#

and u can tell (its not a thinking model) because it streams immediately and doesnt cause the other model (if non thinking) to wait before replying (as it usually does while it waits for it to reason, lmarena thing)

somber niche Mar 30, 2025, 9:53 PM

#

Yep - if it's an instruct model it answers instantly, whereas a reasoning model has tokens which are output in the background before giving the answer

raven void Mar 30, 2025, 9:53 PM

#

maybe it's llama 4.1

somber niche Mar 30, 2025, 9:54 PM

#

Spider is probably the most verbose imo though

rigid widget Mar 30, 2025, 9:56 PM

#

okey peoples why don't you vote?

keen beacon Mar 30, 2025, 9:56 PM

#

its neither

rigid widget Mar 30, 2025, 9:56 PM

#

Is voting a bad thing?

#

from themis:

HARSH TRUTHS:

○ 90% of smokers started as kids. Are you one of those fools?

○ It literally says "KILLS" on the cigarette pack, yet you still smoke? Are you out of your mind?

○ Saying "I can't quit" is just an excuse for cowardice. Don't you have any self-respect?

FINAL WORD:

Let today be your last day as a smoker. Otherwise, tomorrow, your tombstone will say, "Defeated by cigarettes." Take action, save yourself!

🔥 Smoking = Suicide. Period. 🔥

golden ocean Mar 30, 2025, 10:10 PM

#

those truths are booty cheeks

#

real truths dont come from ai because they're woke bitches

misty vault Mar 30, 2025, 10:11 PM

#

real

little narwhal Mar 30, 2025, 10:39 PM

#

Fun fact: if you say the word "Tiananmen" in Deepseek's Discord server, you get banned instantly

brittle tiger Mar 30, 2025, 10:49 PM

#

with addition of canvas is this first time gemini app is better than ai studio?

https://gemini.google.com/share/dd74a82eaa14

Gemini

‎Gemini - Enhanced Pelican Bicycle Animation

Created with Gemini Advanced

hidden rover Mar 30, 2025, 11:01 PM

#

Why blur

eager mica Mar 30, 2025, 11:03 PM

#

hidden rover Why blur

The specific content wasn't the focus, I just found the overall response tone to be funny, considering it wasn't exactly wholesome.

#

Hopefully the final models won't be stubbornly locked down.

eager mica Mar 30, 2025, 11:42 PM

#

spider behaved similarly in another test (incidentally the next round), perhaps even more unhinged (in a good way).

torn mantle Mar 30, 2025, 11:43 PM

#

i got spider once

eager mica Mar 30, 2025, 11:46 PM

#

You'll never get it if you're testing Vision capabilities.

neat apex Mar 30, 2025, 11:46 PM

#

maybe Gemini Pro lite then?

#

Gemini Flash lite is not that resemble to ordinal Gemini Flash

eager mica Mar 30, 2025, 11:47 PM

#

At the moment it's themis, cybele spider for text; pulse, fennec, nutmeg for vision. (I got once ginger which should also be a vision llama model).

eager mica Mar 30, 2025, 11:47 PM

#

neat apex maybe Gemini Pro lite then?

No way, Google blocks the types of requests I made at the API level.

#

This is not a Google model.

neat apex Mar 30, 2025, 11:48 PM

#

hm, make sense

leaden palm Mar 31, 2025, 12:01 AM

#

eager mica No way, Google blocks the types of requests I made at the API level.

safety_settings:

eager mica Mar 31, 2025, 12:02 AM

#

leaden palm `safety_settings`:

Even at the minimum level some things don't go through.

neat apex Mar 31, 2025, 12:20 AM

#

or Gemini 2.5 pro lite

#

because it is pro

#

thinking flash is 2.5 flash

#

thinking pro is 2.5 pro

#

thinking pro lite is 2.5 pro lite

#

no, but "lite" means it summarized, no acess to media at all

#

by chance flash lite is much worse than flash, but it is not the intention

torn mantle Mar 31, 2025, 12:23 AM

#

yea

neat apex Mar 31, 2025, 12:23 AM

#

pro lite is not flash, it is pro without acess to media

torn mantle Mar 31, 2025, 12:23 AM

#

looks similar

neat apex Mar 31, 2025, 12:25 AM

#

yeah, my theory is the most reasonable until it end accepting images in some interaction

#

can be flash thinking exp maybe

#

maybe flash thinking is just a version above pro thinking yet

#

more developed already

#

yes, can be

#

if it not be better at anything, it must be gemini 2.5 pro lite

#

if it be, it is gemini 2.6 flash

#

2.5 pro lite must be same level, but somehow diferent since it is not ready to read midia

#

just check if it ended accepting an media, if not, it is likely gemini 2.5 pro lite

#

well, it can acess images at all

#

what means a lot of the intenet, but not exactly

#

you can acess it using flash lite, but be quite bad

#

neither gemini 2.0 does ordinary, it were trained over gemini 1.5 web, thats why this impression

#

noo, just check if it accepts midia

#

by starting an chat with an image

#

if it never apears, must be an lite one

#

yeah, i say the start

#

hm

#

it is very very good xd

#

i thinked it were gpt 4.5

#

ac??

#

yes

#

understanding a lot

keen beacon Mar 31, 2025, 12:40 AM

#

all the anonymous chatbots are chatgpt 4o latest afaik

#

so its a new revision

#

they just released one a few days ago they're moving faster it seems

#

nope

#

2.5 pro i guess lol

#

nah

#

i mean 2.5 pro is an experimental model so updates are expected

#

gonna try it out

#

if there are regressions there will also be improvements in at least one area

#

google model updates tend to be bigger than oai ones

#

openai are most notorious for model updates that degrade quality, google are the other way around

#

nah the recent ones are different

#

from openai

#

yeah the last chatgpt 4o update was good but

#

that's 1 update out of a bunch that actually improved it

#

since december theyve been doing continued pretraining

#

i just can't wait until they finally put 4o to rest

keen beacon Mar 31, 2025, 12:46 AM

#

keen beacon i just can't wait until they finally put 4o to rest

nah given their investment of continued pretraining into it, itll be a cornerstone for a while

#

for better or for worse i guess

keen beacon Mar 31, 2025, 12:47 AM

#

keen beacon nah given their investment of continued pretraining into it, itll be a cornersto...

(you can tell they started continued pretraining via the updated cut off, etc)

keen beacon Mar 31, 2025, 12:47 AM

#

keen beacon nah given their investment of continued pretraining into it, itll be a cornersto...

perhaps but it will become less significant when they release gpt-5 (which will have unlimited free usage)

#

although

#

depends on if gpt-5 has image gen

#

i think it will

keen beacon Mar 31, 2025, 12:48 AM

#

keen beacon perhaps but it will become less significant when they release gpt-5 (which will ...

well underneath that will be 4o though. for the reasoning models/etc

#

for o3 yes, but iirc there will be a new non-4o base

#

o4 is the first in the o series to not use 4o as the base

#

will be*

#

where did they state that btw

#

news to me

#

honestly couldn't tell you but i recall seeing it somewhere

#

take it with a grain of salt

#

i dont think it makes sense to be spending this much effort on 4o if thats the case tbh

#

4o is still significantly behind deepseek v3, claude 3.7 sonnet, etc

keen beacon Mar 31, 2025, 12:51 AM

#

keen beacon 4o is still significantly behind deepseek v3, claude 3.7 sonnet, etc

not anymore

#

it still is

pseudo cipher Mar 31, 2025, 12:51 AM

#

This is a cool puzzle. May I ask where you got it from?

keen beacon Mar 31, 2025, 12:51 AM

#

oh nevermind it's not as bad as i thought

#

only a little behind

#

for code it's still pretty far off tho

#

https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25 see the benchmark evaluations

GPT-4o (March 2025) - Intelligence, Performance & Price Analysis | ...

Analysis of OpenAI's GPT-4o (March 2025, chatgpt-4o-latest) and comparison to other AI models across key metrics including quality, price, performance (tokens per second & time to first token), context window & more.

#

here too

#

#

it's definitely a lot better but

#

still a bit off

#

given how they havent released official benchmark results at all, i think theyre still working on it hard

#

probable

#

no

#

well it depends on what you're looking for

torn mantle Mar 31, 2025, 1:05 AM

#

depends on the benchmark

keen beacon Mar 31, 2025, 1:05 AM

#

for practical coding, gemini 2.5 pro and claude 3.7 sonnet beat o3 mini high

#

but for competitive code

#

o3 mini high beats the rest

torn mantle Mar 31, 2025, 1:05 AM

#

cuz majority of Codeforces exercises needs reasoning

keen beacon Mar 31, 2025, 2:43 AM

#

keen beacon for practical coding, gemini 2.5 pro and claude 3.7 sonnet beat o3 mini high

for lua gemini is the best

onyx halo Mar 31, 2025, 3:34 AM

#

Hey there a quick question

#

How can I generate an image

#

Is it possible?

sudden marlin Mar 31, 2025, 3:45 AM

#

rigid widget

It cant be a thinking model

#

Its very very fast

alpine coral Mar 31, 2025, 4:30 AM

#

yeah and no delay before output. can't be a thinking model

plain zinc Mar 31, 2025, 5:50 AM

#

keen beacon o3 mini high beats the rest

What is the difference between practical and competing code?

livid harbor Mar 31, 2025, 6:25 AM

#

SHOW：Dingo: A Comprehensive Data Quality Evaluation Tool

GitHub: https://github.com/DataEval/dingo
We built Dingo to solve the pain points we encountered managing data quality at scale. While working on multiple ML projects, we found existing tools either focused only on tabular data (like Great Expectations) or required complex setups for text/LLM data.

Online Demo:
Try dingo on our online demo：https://huggingface.co/spaces/DataEval/dingo

Welcome to star our project.

GitHub

GitHub - DataEval/dingo: Dingo: A Comprehensive Data Quality Evalua...

Dingo: A Comprehensive Data Quality Evaluation Tool - DataEval/dingo

Dingo - a Hugging Face Space by DataEval

calm sequoia Mar 31, 2025, 6:54 AM

#

As I uderstand it is 2.5 flash. Why pro?

golden ocean Mar 31, 2025, 6:54 AM

#

neat apex or Gemini 2.5 pro lite

🗿

misty vault Mar 31, 2025, 6:57 AM

#

Gemini 2.5 ultra pro max drop in 2 weeks

calm sequoia Mar 31, 2025, 7:07 AM

#

I really like your table! Good job!

raven void Mar 31, 2025, 7:27 AM

#

the base model behind spider seems very good

rigid widget Mar 31, 2025, 7:44 AM

#

brittle tiger with addition of canvas is this first time gemini app is better than ai studio? ...

no aistudio 10000x better than gemini app

rigid widget Mar 31, 2025, 7:52 AM

#

neat apex if it not be better at anything, it must be gemini 2.5 pro lite

Bro, we tested it too many times—it's not even close to Gemini 2.5 Pro. It's really just the same as Gemini 2.0 Flash Lite.

#

Why people didn't read previous messages?

calm sequoia Mar 31, 2025, 7:54 AM

#

Acording to this table spider is the best model. Interesting,.

rigid widget Mar 31, 2025, 7:56 AM

#

Yes, Anonymous-Chatbot did well at coding yesterday.

calm sequoia Mar 31, 2025, 8:15 AM

#

It seems tha AC is worse at math than GTP 4.5 😄

rigid widget Mar 31, 2025, 8:15 AM

#

i have a little ss-to-html test

calm sequoia Mar 31, 2025, 8:16 AM

#

But much much better than the Deepsek-v3-0324

#

Spider is also really good

rigid widget Mar 31, 2025, 8:17 AM

#

calm sequoia But much much better than the Deepsek-v3-0324

in math?

#

really?

#

anonymous-chatbot has a markdown usage problem it didn't use markdown right

calm sequoia Mar 31, 2025, 8:33 AM

#

Would you mind adding the text file? I would be happy to calculate ELO

#

I've tried adding you as a fiend but it seems you blocked such option 😄

rigid widget Mar 31, 2025, 8:34 AM

#

where is deepseek-v3-0324 😫

calm sequoia Mar 31, 2025, 8:34 AM

#

Thank you!

#

By the way, reminder to everyone: exuberant_mango_20409 is here to promote deepseek, ignore him.

cedar tide Mar 31, 2025, 8:43 AM

#

Who is the current Anonymous chatbot?

#

the next GPT 4o no longer "chatgpt 4o"?

calm sequoia Mar 31, 2025, 8:43 AM

#

#

@eager mica your table ELO

#

What's the use case here? RIddles? Creative writing?

harsh flume Mar 31, 2025, 8:44 AM

#

any spider maker speculation?

#

would be badass if it was alibaba

calm sequoia Mar 31, 2025, 8:44 AM

#

#

It's Llama

harsh flume Mar 31, 2025, 8:46 AM

#

what makes you so sure if you dont mind me asking?

calm sequoia Mar 31, 2025, 8:46 AM

#

I asked 😄

#

You can ask them and they tell you.

harsh flume Mar 31, 2025, 8:47 AM

#

hahha fair enough

calm sequoia Mar 31, 2025, 8:50 AM

#

If you would encounter the gemini 2.5 that would be beneficial. It is not represented in your table :/

#

That's interesting. Very niche. I wonder how much such use cases will represent the actual benchmark position.

#

It is logical though. The META specialized in augmented reality. Buiulding model for text to image to text integration would benefit them most.

eager mica Mar 31, 2025, 8:56 AM

#

harsh flume would be badass if it was alibaba

Alibaba makes Qwen models, and those feel completely different. Qwen-max is supposed to be the flagship.

calm sequoia Mar 31, 2025, 8:57 AM

#

Maybe I've missed something, but themis/cybele/spider is from META, right?

eager mica Mar 31, 2025, 8:58 AM

#

calm sequoia Maybe I've missed something, but themis/cybele/spider is from META, right?

themis/cybele almost certainly yes; spider not 100% sure.

#

spider could be a much larger model, cybele maybe around 25B, themis perhaps 10B

#

Just speculation, though.

#

It might end up that cybele and themis are just slight variations of the same model.

#

Yes, I got spider to say either that or GPT4. It once spontaneously said it was from Meta.

#

Amazon Titan

#

Most of the time I found it to be mediocre for my uses, I'm not sure.

#

I don't think DeepSeek trained an R2 yet. A newer version of V3 got recently released.

#

That one doesn't have reasoning, though.

#

I haven't followed Deepseek very closely nor used their models much, but R1's main point was reasoning, not necessarily coding.

#

They do have a "DeepSeek-coder" but they haven't updated it yet.
The RL for R1 was mostly on Math tasks, I believe.

#

Turns out there was some coding too.

#

Either way, "well-defined problems with clear solutions"

brave ferry Mar 31, 2025, 9:13 AM

#

I love the Gemini 2.5 with temp 1.5 for colaborative fiction writing

#

interesting

rigid widget Mar 31, 2025, 9:16 AM

#

screenshot cloning in html test

📎 ss-to-html-test.md

brave ferry Mar 31, 2025, 9:17 AM

#

it depends a lot on which model one is using. I haven't played around much with Gemini 2.5 at different temps yet.

rigid widget Mar 31, 2025, 9:18 AM

#

new v3 sometimes better than r1

#

I didn't understand?

#

it can do chess etc. but I don't know about the style you are talking about.

#

yeah for me claude-thinking is the best coder

golden ocean Mar 31, 2025, 9:23 AM

#

what about gemini 2.5

rigid widget Mar 31, 2025, 9:26 AM

#

golden ocean what about gemini 2.5

It answers can be very inconsistent—sometimes it performs perfectly, other times it bad. I don't know the reason.

#

not always

#

but sometimes help

#

o1, R1, gemini2.5pro all gives 59,2% but the answer is 25%

A product, 45% of whose cost comes from shipping fees, is sold at a 70% profit margin.

The shipping fee for this product has increased by 80%, but the selling price has remained unchanged.

Accordingly, what is the profit percentage for this product in the final scenario?

#

even claude-3.7-sonnet did this right

#

weird

#

right

#

okey, let's test!

#

grok3 deepseek v3-0324, chatgpt-latest did right

#

o3-mini-high also did right

#

of course bro i am i maniac

#

also qwq-32b did right

#

it's weird o1, r1 and gemini-2.5-pro can't did right

#

i am going to an hard ss-to-html test

#

the previous one was a bit simple

#

too many times

#

for terminal commands claude not amazing

ocean vortex Mar 31, 2025, 9:49 AM

#

eager mica Turns out there was some coding too.

I think it's really nice how they improved their chat model so much just with fine-tuning tbh

#

if we add all of those things together different labs are doing and take the best base model, I think we could have a chat non-reasoning model that is potentially as good as 2.5 pro

#

then you could go well beyond that with high reasoning version of it

cedar tide Mar 31, 2025, 11:07 AM

#

cedar tide Who is the current Anonymous chatbot?

Non response

keen beacon Mar 31, 2025, 11:21 AM

#

which ai is the most creative?

keen fulcrum Mar 31, 2025, 11:22 AM

#

Hi, have you tried Manus AI? https://manus.im

Manus

Manus is a general AI agent that turns your thoughts into actions. It excels at various tasks in work and life, getting everything done while you rest.

golden ocean Mar 31, 2025, 11:22 AM

#

real

neat apex Mar 31, 2025, 11:46 AM

#

keen beacon which ai is the most creative?

Haiku 3.5 in my opnion, basically Sonnet 3.7 but more instable

#

But Gpt 4.5 is more responsive

harsh flume Mar 31, 2025, 11:55 AM

#

keen fulcrum Hi, have you tried Manus AI? https://manus.im

isnt it still invite based?

rigid widget Mar 31, 2025, 11:55 AM

#

keen beacon which ai is the most creative?

new anon models

rigid widget Mar 31, 2025, 11:56 AM

#

neat apex Haiku 3.5 in my opnion, basically Sonnet 3.7 but more instable

what bro? you really test all?

keen beacon Mar 31, 2025, 12:06 PM

#

rigid widget new anon models

whats that

rigid widget Mar 31, 2025, 12:07 PM

#

keen beacon whats that

they only in battle mode

#

go to battle mode they will come

keen beacon Mar 31, 2025, 12:09 PM

#

rigid widget go to battle mode they will come

well how do i know what model its using

keen fulcrum Mar 31, 2025, 12:11 PM

#

harsh flume isnt it still invite based?

Its open beta now
$40 and $200 monthly plans

#

Upon registration you get free credits

golden ocean Mar 31, 2025, 12:25 PM

#

is it actually good

#

why would I use that over gemini 2.5 or claude 3.7

neat apex Mar 31, 2025, 12:27 PM

#

golden ocean why would I use that over gemini 2.5 or claude 3.7

1 m context

hazy quest Mar 31, 2025, 12:28 PM

#

Isn't manus AI literally claude 3.7? How can it have 1M context then?

neat apex Mar 31, 2025, 12:30 PM

#

Because it work multiple times

keen fulcrum Mar 31, 2025, 12:31 PM

#

hazy quest Isn't manus AI literally claude 3.7? How can it have 1M context then?

Where did you get that?

hazy quest Mar 31, 2025, 12:32 PM

#

https://www.reddit.com/r/LocalLLaMA/comments/1j7n2s5/manus_turns_out_to_be_just_claude_sonnet_29_other/

From the LocalLLaMA community on Reddit

Explore this post and more from the LocalLLaMA community

#

Also AI Explained, if I remember correctly

torn mantle Mar 31, 2025, 12:33 PM

#

hazy quest https://www.reddit.com/r/LocalLLaMA/comments/1j7n2s5/manus_turns_out_to_be_just_...

we already knew this no?

#

there was a security breach on their side

#

and people already downloaded server files

hazy quest Mar 31, 2025, 12:34 PM

#

Yes? I was answering to the one asking about it

calm sequoia Mar 31, 2025, 12:54 PM

#

Very interesting, paws!

#

It's very interesting way. There are many other ways though with your's you can at elast vote after inspection 😄

#

Say that to the worker's of these LLMs who get paid for the performance 😄

#

E.g. team xAI

#

I deeply believe that Grok was No. 1 only because of cheating

#

Not a single person was able to provide me with a prompt that Grok would perform better than other LLMs

teal mantle Mar 31, 2025, 1:01 PM

#

Acidentally made a bug

#

when a dialogue is performing inference and you press past dialog/chat, the message will jump to past chat

#

nevermind it is non-persistent

cedar tide Mar 31, 2025, 1:07 PM

#

calm sequoia Not a single person was able to provide me with a prompt that Grok would perform...

Here the prompt that you want

"Answer these questions:

You left a bookmark on page 50. While you were sleeping, your friend moved the bookmark to page 65. Where do you expect to find the bookmark when you wake up?

Alice has 4 brothers and 5 sisters. How many sisters does Alice's brother have?

I was overtaken by the runner who was in 2nd place. What is my new position?

How many R’s are in "strawberry"?"

ocean vortex Mar 31, 2025, 1:07 PM

#

calm sequoia I deeply believe that Grok was No. 1 only because of cheating

lol no. Look at GPQA score of it. Standard grok3 comfortably beats all non-reasoning models. Also it's one of the very few models to get this prompt right: what number, when reversed, is decreased by one-fifth of it's value and is still under 100?

#

answer is there's no such number 😇

#

it basically exhausts all the options and arrives at the correct answer

calm sequoia Mar 31, 2025, 1:09 PM

#

I don't believe older than 6 months benchmarks are valid.

ocean vortex Mar 31, 2025, 1:09 PM

#

most other models just do shortcuts and sometimes make crazy assumptions or just try to fit something that does not belong here...

calm sequoia Mar 31, 2025, 1:09 PM

#

cedar tide Here the prompt that you want "Answer these questions: You left a bookmark on ...

Grok training started only AFTER these questions wnt viral. Others were trained before.

cedar tide Mar 31, 2025, 1:09 PM

#

cedar tide Here the prompt that you want "Answer these questions: You left a bookmark on ...

Try my prompt

calm sequoia Mar 31, 2025, 1:10 PM

#

As I said, you can't use questions that were viral on twitter. They are in the training data.

#

For example, GPT 4.5 was trained before all this and Grok after. They can't be compared with these questions.

ocean vortex Mar 31, 2025, 1:11 PM

#

calm sequoia I don't believe older than 6 months benchmarks are valid.

that's not how this works. We had this discussion numerous times in the past. You can not cheat these benchmarks effectively due to how they were made. Enormous variety and model not gonna get it right by randomly seeing the correct given answer once or twice in an enormous dataset. Plus everyone is doing the same so it is fair

calm sequoia Mar 31, 2025, 1:12 PM

#

That's basically how this works. I've trained various numbers of small models. I've seen this in person. The moment you use the question it ceases to be benchmark of the following generation models.

#

Be musk

ocean vortex Mar 31, 2025, 1:13 PM

#

calm sequoia That's basically how this works. I've trained various numbers of small models. ...

You can overfit it for any singular question. But you can NOT overfit it on every single question from every single main benchmark lmao

#

we would have had perfect scores for every model

calm sequoia Mar 31, 2025, 1:14 PM

#

If you use it on finetuning and not pretraining it will guess it

ocean vortex Mar 31, 2025, 1:14 PM

#

I've trained models too. You probably noticed that this only works mainly on small subset of questions

calm sequoia Mar 31, 2025, 1:14 PM

#

Especially if its viral on twitter 😄

ocean vortex Mar 31, 2025, 1:14 PM

#

proper benchmarks are nothing like it

calm sequoia Mar 31, 2025, 1:15 PM

#

Strawberry question was multiplied in thousands

ocean vortex Mar 31, 2025, 1:16 PM

#

so if you really pushed your luck, maybe you could manage close to a perfect score on 1 good benchmark. But then it would be absolute dogshit in every other metric LOL

#

so just not possible...

rigid widget Mar 31, 2025, 1:16 PM

#

calm sequoia That's basically how this works. I've trained various numbers of small models. ...

You absolutely right. All my old hard prompts are now solved by Gemini 2.5 Pro (thanks to AI Studio), but my old easy prompts (the ones I never asked Google models) still don’t get it right.

ocean vortex Mar 31, 2025, 1:17 PM

#

rigid widget You absolutely right. All my old hard prompts are now solved by Gemini 2.5 Pro (...

That wasn't because they overfitted it on your prompts

#

they couldn't care less

#

they care much more about the numbers if anything. It's mostly because 2.5 pro is leaps and bounds ahead of any model they released prior tbh

calm sequoia Mar 31, 2025, 1:18 PM

#

#

#

Gemini 1.5

rigid widget Mar 31, 2025, 1:19 PM

#

ocean vortex we would have had perfect scores for every model

but you can't improve models creative thinking like that

ocean vortex Mar 31, 2025, 1:19 PM

#

rigid widget but you can't improve models creative thinking like that

you can actually

#

we kinda saw that even reasoning can improve creativity

#

even though there's no obvious link to start with

calm sequoia Mar 31, 2025, 1:20 PM

#

Anyway @ocean vortex I really hope you're right. But given the dirt in the industry and financial incentives, I will stay at my position.

ocean vortex Mar 31, 2025, 1:20 PM

#

but it can still arrive at the outputs that are perceived as more creative when it has the freedom of generating extra context beforehand

ocean vortex Mar 31, 2025, 1:23 PM

#

calm sequoia Anyway <@514836230802898954> I really hope you're right. But given the dirt in t...

You should absolutely assume that every model is contaminated lol. But judging by what we are seeing I think it is pretty clear that cheating effectively is not on the cards. The main thing is that it is fair game and models are getting improved (they kinda are, even just with contamination IMO)

rigid widget Mar 31, 2025, 1:23 PM

#

ocean vortex That wasn't because they overfitted it on your prompts

"but my old easy prompts (the ones I never asked Google models) still don’t get it right"

ocean vortex Mar 31, 2025, 1:25 PM

#

rigid widget "but my old easy prompts (the ones I never asked Google models) still don’t get ...

do other models get it right? It could be just one of those areas that models still struggle with tbh

rigid widget Mar 31, 2025, 1:25 PM

#

ocean vortex you can actually

current llms really bad at very unique tasks

ocean vortex Mar 31, 2025, 1:26 PM

#

rigid widget current llms really bad at very unique tasks

that's to be expected. Though personally I noticed reasoning models are more likely to try and invent their own thing so to speak, than the standard LLMs

rigid widget Mar 31, 2025, 1:29 PM

#

ocean vortex lol no. Look at GPQA score of it. Standard grok3 comfortably beats all non-reaso...

even gemini-2-flash get it

keen fulcrum Mar 31, 2025, 1:29 PM

#

Can someone link the spider api site

ocean vortex Mar 31, 2025, 1:29 PM

#

rigid widget even gemini-2-flash get it

both are wrong

#

lol

#

1/5th of 54 is 10.8 that wouldn't work

rigid widget Mar 31, 2025, 1:31 PM

#

45/5=9 45+9=54

rigid widget Mar 31, 2025, 1:32 PM

#

keen fulcrum Can someone link the spider api site

there is no api go https://lmarena.ai/

ocean vortex Mar 31, 2025, 1:32 PM

#

but they answer like that because they saw similar sounding question where the answer was 45. So easiest route is to use that

ocean vortex Mar 31, 2025, 1:33 PM

#

rigid widget 45/5=9 45+9=54

it only works for what number, when reversed, is INCREASED by one-fifth of it's value and is still under 100? that's a different problem entirely

#

and not what was asked

rigid widget Mar 31, 2025, 1:34 PM

#

ocean vortex and not what was asked

sorry i did'nt get it

ocean vortex Mar 31, 2025, 1:34 PM

#

9 is not 1/5h of 54, quite obviously

rigid widget Mar 31, 2025, 1:35 PM

#

so grok is wrong?

ocean vortex Mar 31, 2025, 1:35 PM

#

rigid widget sorry i did'nt get it

it has different version of this in training data with different answer (45). So it's kinda useful to see if it has enough knowledge and capability to actually try and solve it rather than forcefully fit the answer that does not belong. What you showed it's the classic latter scenario.

keen fulcrum Mar 31, 2025, 1:36 PM

#

rigid widget there is no api go https://lmarena.ai/

How is grok 3 implemented?

#

There is no API either

ocean vortex Mar 31, 2025, 1:37 PM

#

it assumes it's gonna just reverse the old answer and get the new one for this different problem lol

#

which kinda shows lack of fundamental understanding

rigid widget Mar 31, 2025, 1:37 PM

#

keen fulcrum There is no API either

they gave special access to lmarena

dapper storm Mar 31, 2025, 1:38 PM

#

Xai vanity

ocean vortex Mar 31, 2025, 1:39 PM

#

rigid widget so grok is wrong?

in this case strangely yeah. Though when I tested "early-grok3" it got it right most of the time

keen fulcrum Mar 31, 2025, 1:39 PM

#

rigid widget they gave special access to lmarena

*gave

ocean vortex Mar 31, 2025, 1:40 PM

#

@rigid widget actually wait

#

it did mention there's no two-digit solution

#

so not perfect, but not exactly wrong either unlike most others

#

and it did that "with caveat" part, so yeah, that's closer to a pass than a fail...

#

#general message

jolly rune Mar 31, 2025, 1:42 PM

#

I'm trying to test llama 4, but I got Gemini 2.5 Pro, and for some reason it did so much better than nebula

rigid widget Mar 31, 2025, 1:46 PM

#

keen fulcrum *gave

thanks i am still learning

rigid widget Mar 31, 2025, 1:47 PM

#

ocean vortex in this case strangely yeah. Though when I tested "early-grok3" it got it right ...

so what is the true answer?

ocean vortex Mar 31, 2025, 1:47 PM

#

rigid widget so what is the true answer?

#general message

alpine coral Mar 31, 2025, 1:49 PM

#

ocean vortex they care much more about the numbers if anything. It's mostly because 2.5 pro i...

ha yeah it's a super strong model basically - really not that complicated

ocean vortex Mar 31, 2025, 1:51 PM

#

it arriving at 54 as the perfect answer with no caveats or disclaimers of any kind is the usual classic fail most models do. But what grok said "No exact two-digit solution", is correct

ocean vortex Mar 31, 2025, 1:51 PM

#

alpine coral ha yeah it's a super strong model basically - really not that complicated

agree

wheat onyx Mar 31, 2025, 1:55 PM

#

What do people think of moonshot and Spyder? Their use cases and performance

eager mica Mar 31, 2025, 1:56 PM

#

I haven't seen moonshot, do you mean moonhowler?

rigid widget Mar 31, 2025, 1:59 PM

#

grok-3-thinking

ocean vortex Mar 31, 2025, 2:00 PM

#

ocean vortex it arriving at 54 as the perfect answer with no caveats or disclaimers of any ki...

but to expand, it further said "If relaxed. 54 is closest candidate.", that part is actually not correct. 86 is the closer candidate, 86-1/5 x 85=68.8. 68.8 is closer to 68 than 45 is to 43.2 (54 x 1/5)

wheat onyx Mar 31, 2025, 2:01 PM

#

eager mica I haven't seen `moonshot`, do you mean `moonhowler`?

That's the one

ocean vortex Mar 31, 2025, 2:04 PM

#

so smth like this is fail and far from the truth lol

#

misinterpreting the problem to make it fit into what's easy and what it knows

rigid widget Mar 31, 2025, 2:12 PM

#

@ocean vortex no model can get this

#

i think the question not well

#

i changed

#

"Which number less than 100, when reversed, decreases its value to one-fifth of the original"

#

now some models can get it

kind cloud Mar 31, 2025, 3:29 PM

#

I asked Moonhowler to "Create an image of cats," but it returned an error code.
So I expect that it tried to generate the image.

formal fiber Mar 31, 2025, 3:41 PM

#

Is it just me or does Gemini still blow lol...?

#

I typically run chatgpt, claude, grok, and gemini simultaneously to get a variety of options... but I never actually choose Gemini ahahaha...like actually ever. It's funny watching all the videos about it, but I'm genuinely curious

#

Not good ahah....

#

I'm open to someone telling me otherwise, but being a claude and openai pro user...yeah...not it.

Dont get me wrong...really cool stuff they are doing with the API and AI studio...really cool.

#

The LLM is garbage

novel flame Mar 31, 2025, 3:46 PM

#

From the LocalLLaMA community on Reddit

blazing rune Mar 31, 2025, 3:49 PM

#

It's not very useful to begin with, so cheating doesn't really make a difference

#

it only tells you what people prefer when they look at the response and it's style

#

that's it

formal fiber Mar 31, 2025, 3:49 PM

#

I have one sincere question for an open discussion: When would Gemini be the optimal choice compared to alternatives like claude, o1 pro, 4o, o3, or 4.5, etc, etc?

The answer is simple...it isnt.

#

? Claude

upper wolf Mar 31, 2025, 3:59 PM

#

formal fiber I have one sincere question for an open discussion: When would Gemini be the opt...

yes it is. cope

formal fiber Mar 31, 2025, 4:00 PM

#

Thinking

#

Depends on the task

novel flame Mar 31, 2025, 4:00 PM

#

Gemini 2.5 Pro: When you like low cost and high speed and long context and almost the same performance

leaden meteor Mar 31, 2025, 4:06 PM

#

V3 0324 is on leaderboard ! 1370 not bad...

#

2 tencent models and 1 nvidia model at 1300+ ....

torn mantle Mar 31, 2025, 4:10 PM

#

so bad

rigid widget Mar 31, 2025, 4:27 PM

#

formal fiber Is it just me or does Gemini still blow lol...?

yeah gemini app is sucks but theris aistudio amazing

rigid widget Mar 31, 2025, 4:28 PM

#

novel flame Gemini 2.5 Pro: When you like low cost and high speed and long context and almos...

free, good accuracy, but slow and not stable api

novel flame Mar 31, 2025, 4:28 PM

#

rigid widget free, good accuracy, but slow and not stable api

True, the API is buggy and unstable currently

rigid widget Mar 31, 2025, 4:29 PM

#

leaden meteor V3 0324 is on leaderboard ! 1370 not bad...

finally

novel flame Mar 31, 2025, 4:49 PM

#

my (hobby) task is, to create a

barren prairie Mar 31, 2025, 4:58 PM

#

calm sequoia I deeply believe that Grok was No. 1 only because of cheating

Finally someone think the same as me 😁🤣🤣🤣 especially chocolat , when I first saw it no1 surprassing Gemini and chatgpt I said the same thing ...

#

People were attacking me saying it is Impossible because the models are anonymous

#

But , I was able to identify each model from its answer and style ..

#

Probably , Elon paid some people who are able to identify chocolat Anonymously to vote for it

rigid widget Mar 31, 2025, 5:02 PM

#

What about ChatGPT-latest? Do you really believe it's best coder?

rigid widget Mar 31, 2025, 5:02 PM

#

barren prairie Probably , Elon paid some people who are able to identify chocolat Anonymously t...

Maybe itself did 😄

timber kiln Mar 31, 2025, 5:03 PM

#

rigid widget What about ChatGPT-latest? Do you really believe it's best coder?

Not even close

rigid widget Mar 31, 2025, 5:03 PM

#

So why it's in 1?

sudden marlin Mar 31, 2025, 5:05 PM

#

keen fulcrum Can someone link the spider api site

Its on lmarena

timber kiln Mar 31, 2025, 5:06 PM

#

rigid widget So why it's in 1?

Its a crowdsourced measure by average people nothing suggests the top model will be the best one

rigid widget Mar 31, 2025, 5:07 PM

#

timber kiln Its a crowdsourced measure by average people nothing suggests the top model will...

Yeah it's not equal quality but (if there is no cheat) that means people choose it.

timber kiln Mar 31, 2025, 5:07 PM

#

Even if the arena was 100% tamper proof it wouldn't be very reliable as long as you can't filter. Barrier to vote is very low

keen fulcrum Mar 31, 2025, 5:07 PM

#

sudden marlin Its on lmarena

Where?

#

I couldn’t find their homepage

barren prairie Mar 31, 2025, 5:09 PM

#

leaden meteor 2 tencent models and 1 nvidia model at 1300+ ....

A good jump from 1318
Let s see r2 now 🙂 😁

rigid widget Mar 31, 2025, 5:09 PM

#

timber kiln Even if the arena was 100% tamper proof it wouldn't be very reliable as long as ...

But filtering comes with more problems

eager mica Mar 31, 2025, 5:09 PM

#

Truth being told, after a while you'll immediately recognize most models on the Arena even without them telling you their name.

rigid widget Mar 31, 2025, 5:09 PM

#

But style control should be fixed

barren prairie Mar 31, 2025, 5:10 PM

#

eager mica Truth being told, after a while you'll immediately recognize most models on the ...

You just need less than 1 month

calm spear Mar 31, 2025, 5:10 PM

#

do you like emojis in LLMs' responses generally? (I do)

novel flame Mar 31, 2025, 5:11 PM

#

I am not convinced of that; Anthropic says it’s the same context window as regular 3.7, and at least when using it in Cline it has no problems with context (as the smaller-context models do)

barren prairie Mar 31, 2025, 5:11 PM

#

calm spear do you like emojis in LLMs' responses generally? (I do)

I ask them to put emojis 😁

rigid widget Mar 31, 2025, 5:11 PM

#

I don't think style control really work

#

in the past it was working

timber kiln Mar 31, 2025, 5:12 PM

#

rigid widget But filtering comes with more problems

For a more reliable measure you need professional raters but people on that level will ask for a substantial pay. So we are stuck with crowdsource arenas (I am sure big companies do internal tests with experts but they won't release that)

rigid widget Mar 31, 2025, 5:12 PM

#

but now it's nothing

timber kiln Mar 31, 2025, 5:12 PM

#

Until that we need something like webdev arena but more comprehensive

calm spear Mar 31, 2025, 5:12 PM

#

novel flame I am not convinced of that; Anthropic says it’s the same context window as regul...

it is 32k tokens limit for thinking, am I wrong?

rigid widget Mar 31, 2025, 5:13 PM

#

timber kiln For a more reliable measure you need professional raters but people on that leve...

I think style control (working) it will really help

timber kiln Mar 31, 2025, 5:14 PM

#

rigid widget I think style control (working) it will really help

Its not about style control
Look at the gap between some coding models in webdev arena and coding section in the main arena

calm spear Mar 31, 2025, 5:14 PM

#

timber kiln Its not about style control Look at the gap between some coding models in webdev...

?

timber kiln Mar 31, 2025, 5:14 PM

#

Webdev arena is visual and if the code is not working even a dumbass can notice

#

Leading to better measure of capability

calm spear Mar 31, 2025, 5:15 PM

#

timber kiln Webdev arena is visual and if the code is not working even a dumbass can notice

it is also not only about whether code works but also about whether result is beatiful and functional

timber kiln Mar 31, 2025, 5:15 PM

#

calm spear it is also not only about whether code works but also about whether result is be...

There are more important discussions about cleanlines or efficiency code but I am not even getting into that

calm spear Mar 31, 2025, 5:16 PM

#

timber kiln Leading to better measure of capability

I wouldn't say so, its only JS, only React, only web development frontend, it is far from coding in general

timber kiln Mar 31, 2025, 5:16 PM

#

When a normal guy asks something webdev arena model can show its capability better

timber kiln Mar 31, 2025, 5:16 PM

#

calm spear I wouldn't say so, its only JS, only React, only web development frontend, it is...

Never I said its more comprehensive

#

But it shows the results better and raters can see the distinction between half working sh1tty stuff and fully working functional stuff

barren prairie Mar 31, 2025, 5:18 PM

#

calm spear I wouldn't say so, its only JS, only React, only web development frontend, it is...

They must do other languages

rigid widget Mar 31, 2025, 5:21 PM

#

timber kiln Webdev arena is visual and if the code is not working even a dumbass can notice

What someones not test codes??? OMG bro if someone not test codes maybe someones not really read two responses!!!!

#

Should be a timer for selecting model

timber kiln Mar 31, 2025, 5:22 PM

#

You are probably a minority if you run every code you ask and test the features rate the cleanliness of code check the efficiency etc

rigid widget Mar 31, 2025, 5:23 PM

#

if responses longer it should be more time

timber kiln Mar 31, 2025, 5:24 PM

#

One other problem is that people ask mostly generic stuff that closes the cap on capability too

rigid widget Mar 31, 2025, 5:24 PM

#

timber kiln You are probably a minority if you run every code you ask and test the features ...

i don't mean that just running two code and looking which better

rigid widget Mar 31, 2025, 5:25 PM

#

timber kiln One other problem is that people ask mostly generic stuff that closes the cap on...

so should be a simple category

#

calm spear Mar 31, 2025, 5:52 PM

#

rigid widget

I do not understand that question.

It is like "Do we need a good bus"

rigid widget Mar 31, 2025, 5:54 PM

#

calm spear I do not understand that question. It is like "Do we need a good bus"

Would you vote for a good bus?

timber kiln Mar 31, 2025, 5:55 PM

#

Categoriziation won't help

#

People still can ask questions they don't know the answer themselves harder questions will be even harder to rate

rigid widget Mar 31, 2025, 5:56 PM

#

timber kiln Categoriziation won't help

not the ultimate solution but of course it will help

eager mica Mar 31, 2025, 6:00 PM

#

What did they even feed cybele and cousins?

calm spear Mar 31, 2025, 6:04 PM

#

rigid widget Would you vote for a good bus?

this question doesn't make sense, good bus where? good bus to what? to buy a good bus? to steal a good bus? to place a good buy here? to do what?

calm spear Mar 31, 2025, 6:05 PM

#

eager mica What did they even feed `cybele` and cousins?

That's incredible

eager mica Mar 31, 2025, 6:40 PM

#

That said, sometimes even the seemingly larger spider gets too engrossed in its own grandiose responses and misses the point of the questions.

rigid widget Mar 31, 2025, 6:41 PM

#

calm spear this question doesn't make sense, good bus where? good bus to what? to buy a goo...

i am really feel bad

calm spear Mar 31, 2025, 6:42 PM

#

rigid widget i am really feel bad

don't mind

rigid widget Mar 31, 2025, 6:42 PM

#

rigid widget wow spider didn't get

yeah i am also suprised that

timber kiln Mar 31, 2025, 6:53 PM

#

Cobalt is very mid?

eager mica Mar 31, 2025, 6:58 PM

#

Yes. I believe it's Amazon Titan (haven't asked recently).

sterile dust Mar 31, 2025, 7:00 PM

#

What is spider

#

Is it llama4?

eager mica Mar 31, 2025, 7:00 PM

#

Possibly the largest Llama 4 model currently being served, but it's not 100% clear.

sterile dust Mar 31, 2025, 7:03 PM

#

And, is spider supports picture?

eager mica Mar 31, 2025, 7:03 PM

#

The ones with image input on Chatbot Arena are different models, they feel like they are older and/or smaller checkpoints.

torn mantle Mar 31, 2025, 7:22 PM

#

eager mica Possibly the largest Llama 4 model currently being served, but it's not 100% cle...

it is

golden ocean Mar 31, 2025, 7:22 PM

#

neat apex 1 m context

isn't gemini 1m or even 2m

torn mantle Mar 31, 2025, 7:22 PM

#

obviously

#

the only model that you should use rn is gemini 2.5 pro tbh

#

nothing come close to it

neat apex Mar 31, 2025, 7:23 PM

#

Yes, that what i mean

torn mantle Mar 31, 2025, 7:23 PM

#

rigid widget wow spider didn't get

i dont understand the hype around this model

#

didnt get a single decent output from it

neat apex Mar 31, 2025, 7:23 PM

#

It is very responsive, but not that smart

torn mantle Mar 31, 2025, 7:23 PM

#

cyble/themis is somewhat okay

#

acceptable

neat apex Mar 31, 2025, 7:24 PM

#

Spide is good, but just it xe

visual turret Mar 31, 2025, 7:24 PM

#

this is where your normally at

torn mantle Mar 31, 2025, 7:24 PM

#

but spider just randomly spout some hindi/japanese text for me

#

randomly

torn mantle Mar 31, 2025, 7:24 PM

#

visual turret this is where your normally at

wtf

#

when did you join xd

visual turret Mar 31, 2025, 7:24 PM

#

torn mantle when did you join xd

today

torn mantle Mar 31, 2025, 7:24 PM

#

visual turret today

qpdv should join as well

torn mantle Mar 31, 2025, 7:25 PM

#

neat apex Spide is good, but just it xe

at what?

neat apex Mar 31, 2025, 7:25 PM

#

Making suggestions over anything

#

Since it is quite responsive

#

Not an gpt 4.5 or an Haiku 3.5, but it is good

golden ocean Mar 31, 2025, 7:28 PM

#

what does responsive mean in llm context

raven void Mar 31, 2025, 7:30 PM

#

Why isn't llamas other models as good as spider

visual turret Mar 31, 2025, 7:30 PM

#

torn mantle cyble/themis is somewhat okay

cyble is llama3.3 8b

#

moonhowler is gemini 2.5 flash

visual turret Mar 31, 2025, 7:33 PM

#

visual turret cyble is llama3.3 8b

here is a rap it wrote

📎 message.txt

#

just said it was llama 3 and 8b

#

llama 3.2 8b is already released so it only makes sense to think it is llama3.3 8b

#

as llama 3.3 has no 8b model currently

visual turret Mar 31, 2025, 7:35 PM

#

visual turret moonhowler is gemini 2.5 flash

https://www.reddit.com/r/Bard/comments/1jnk395/new_moonhowler_model_on_arena_llm_appears_to_be/

From the Bard community on Reddit: New "moonhowler" model on Arena ...

Explore this post and more from the Bard community

#

also keep in mind gemini 2.5 pro has a non reasoning variant https://www.reddit.com/r/Bard/comments/1jo50hq/gemini_25_pro_will_also_be_a_nonthinking_model/

From the Bard community on Reddit: Gemini 2.5 Pro will also be a no...

Explore this post and more from the Bard community

#

so it most likely is also in testing

#

you can't ask "are you gemini 2.5 pro non thinking?"

visual turret Mar 31, 2025, 7:45 PM

#

visual turret cyble is llama3.3 8b

golden ocean Mar 31, 2025, 7:53 PM

#

visual turret also keep in mind gemini 2.5 pro has a non reasoning variant https://www.reddit....

is this non reasoning variant already accesible

novel flame Mar 31, 2025, 7:54 PM

#

eager mica Yes. I believe it's Amazon Titan (haven't asked recently).

Wow, why is Amazon even bothering with Titan anymore? It has been a turd from day one, sorry Amazon, but it has. And now that they released Nova Pro, they can just bury Titan and pretend it never existed.

visual turret Mar 31, 2025, 7:54 PM

#

golden ocean is this non reasoning variant already accesible

not yet

#

what is luca

upper wolf Mar 31, 2025, 7:55 PM

#

Wtf is that spacing

#

“High - quality”

visual turret Mar 31, 2025, 8:10 PM

#

anonymous-chatbot is an openai model maybe 4o updated

quick flame Mar 31, 2025, 8:12 PM

#

are there any news about GPT-5?

visual turret Mar 31, 2025, 8:14 PM

#

quick flame are there any news about GPT-5?

2026-2027 it might come out

#

that's is my opinion

#

but they haven't been getting the results they want from gpt5

ocean vortex Mar 31, 2025, 8:15 PM

#

torn mantle nothing come close to it

that's not really true. Grok3 is still better for science, and on arc-agi-1 it's even beaten by R1. Spatial awareness is not bad though not industry leading so some stuff sonnet could do better too.

But overall I do agree that as an average of everything, it is probably the best currently

golden ocean Mar 31, 2025, 8:38 PM

#

openai fell off after pre nerf gpt-4 2023*

#

gpt 4o downfall of openai

#

(im not serious)

somber niche Mar 31, 2025, 8:39 PM

#

After some testing I'm not entirely convinced spider is actually a different size than the other models. It feels more verbose / creative, but I think that might be a training quirk rather than a different parameter size. It doesn't feel that much smarter when it comes to some of my programming questions

#

In many cases it actually did a worse job than phoebe, themis and cybele

#

I wouldn't be surprised if the ones last week are all the same size, and the ones this week are all the same size as well - since there does seem to be a substantial intelligence upgrade there

#

There's also the original test models Meta deployed in November, though those were pretty terrible imo

raven void Mar 31, 2025, 8:43 PM

#

golden ocean gpt 4o downfall of openai

4o is so good right now though

#

I agree the initial 4o was a downgrade

visual turret Mar 31, 2025, 8:49 PM

#

somber niche After some testing I'm not entirely convinced `spider` is actually a different s...

i only know spider is a meta model

#

i can't get any more info from it

visual turret Mar 31, 2025, 8:51 PM

#

golden ocean gpt 4o downfall of openai

that was true before their native image generation

golden ocean Mar 31, 2025, 8:55 PM

#

true, that image generation is pretty epic

misty vault Mar 31, 2025, 8:56 PM

#

raven void 4o is so good right now though

But like imagine if they continued of gpt 4 to make gpt 5 or something instead of 4o

#

But 4o cut costs for them or something

cedar tide Mar 31, 2025, 8:56 PM

#

https://x.com/sama/status/1906793591944646898?t=Xw_DyPuHG0edzBlLvbUn3g&s=19

Sam Altman (@sama) on X

TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: https://t.co/XKB4XxjREV

we are excited to make this a very, very good model!

__

we are planning to

golden ocean Mar 31, 2025, 9:00 PM

#

no way

#

Release 2023 bing chat gpt-4 pls

torn mantle Mar 31, 2025, 9:11 PM

#

visual turret cyble is llama3.3 8b

is it?

#

how did you know that?

barren prairie Mar 31, 2025, 9:21 PM

#

cedar tide https://x.com/sama/status/1906793591944646898?t=Xw_DyPuHG0edzBlLvbUn3g&s=19

I don t trust Sama

#

About that a very very good model

golden ocean Mar 31, 2025, 9:38 PM

#

Spit it out @novel flame

rigid widget Mar 31, 2025, 9:43 PM

#

rigid widget

poll_question_text

Do you think spider is a base model or is a thinking model?

victor_answer_votes

6

total_votes

10

victor_answer_id

1

victor_answer_text

It's a base model

novel flame Mar 31, 2025, 9:45 PM

#

I just did some testing of what I would call "Creative coding" -- as in, "Solve this simple task in as many crazy ways as you can think of. Most distinct solutions wins, GO!". I have actually used a variation of this when interviewing human developers, in order to determine breadth of experience and creativity in a semi-quantifiable way.

I ran the same test on a bunch of the top coding models and found:

Top tier: The new (March 2025) GPT-4o, Gemini 2.5 Pro
Closely following: Claude 3.7 Sonnet, Claude 3.7 Sonnet (Thinking), Grok 3 Preview 02-24, DeepSeek R1, Qwen2.5-Max (Thinking)
A bit further behind: Grok 3, o3-Mini, Grok 2 1212, Amazon Nova Pro 1.0, DeepSeek V3 0324

(Note: I don't currently have access to Grok 3 Thinking)

This is not really a test of deeper coding ability, it's more of a vibe check; which models would I enjoy pair programming with more?

blazing rune Mar 31, 2025, 10:24 PM

#

cedar tide https://x.com/sama/status/1906793591944646898?t=Xw_DyPuHG0edzBlLvbUn3g&s=19

Idk why but: "This is a tremendous model. It's very very good, very good. It is better than China. Anything is better than China (sarcasm). Make open source great again!"

#

If someone asks me who that's supposed to be, I swear I will jump off a cliff

golden ocean Mar 31, 2025, 10:37 PM

#

hitler?

blazing rune Mar 31, 2025, 10:51 PM

#

golden ocean hitler?

Obviously not

golden ocean Mar 31, 2025, 10:51 PM

#

oh

blazing rune Mar 31, 2025, 10:51 PM

#

Ok, I'm gonna go jump off a cliff now

misty vault Mar 31, 2025, 10:51 PM

#

Who was it then

#

Stailin? (I'm for real curious)

blazing rune Mar 31, 2025, 10:54 PM

#

misty vault Stailin? (I'm for real curious)

IT IS TRUMP

rigid widget Mar 31, 2025, 10:55 PM

#

torn mantle i dont understand the hype around this model

what u use for?

blazing rune Mar 31, 2025, 10:55 PM

#

I thought the repetition of "very good" and usage of "tremendous" made it obvious

#

And the thing about China, he always has something against them

rigid widget Mar 31, 2025, 10:56 PM

#

visual turret cyble is llama3.3 8b

i don't think so

ancient reef Mar 31, 2025, 10:56 PM

#

plot twist: he's in japan celebrating april 1st

rigid widget Mar 31, 2025, 11:02 PM

#

visual turret https://www.reddit.com/r/Bard/comments/1jnk395/new_moonhowler_model_on_arena_llm...

they are too slow 🐢

keen beacon Mar 31, 2025, 11:24 PM

#

Stuck on cloudflare

#

#

fixed with refresh (brave)

slow spruce Apr 1, 2025, 12:20 AM

#

keen beacon

Give it a try again, it should work now with firefox 🙏

keen beacon Apr 1, 2025, 12:23 AM

#

slow spruce Give it a try again, it should work now with firefox 🙏

works good

#

#

some of the models got the text update

rigid widget Apr 1, 2025, 12:47 AM

#

Hello guys

ancient reef Apr 1, 2025, 12:50 AM

#

rigid widget Hello guys

Hi Mango! 👋

rigid widget Apr 1, 2025, 12:51 AM

#

ancient reef Hi Mango! 👋

Hi! It's your turn to shorten your name!

ancient reef Apr 1, 2025, 12:52 AM

#

rigid widget Hi! It's your turn to shorten your name!

Only if you get mango pfp :p

rigid widget Apr 1, 2025, 12:54 AM

#

ancient reef Only if you get mango pfp :p

deal 🤝

#

@ancient reef your turn

eager mica Apr 1, 2025, 12:57 AM

#

New Llama model: venom

[screenshot of venom]

sterile dust Apr 1, 2025, 12:57 AM

#

#

What is 24_karat_gold?

#

A new llama?

eager mica Apr 1, 2025, 12:58 AM

#

I haven't encountered it yet.

ancient reef Apr 1, 2025, 12:58 AM

#

Same. I saw an image in prompt sharing from it: #share-prompts message

alpine coral Apr 1, 2025, 1:02 AM

#

visual turret also keep in mind gemini 2.5 pro has a non reasoning variant https://www.reddit....

could be phantom - i haven't seen it in the arena in over a week. but when it was in there, it performed overall quite strongly (albeit, was very erratic / inconsistent, at least when it initially appeared)

rigid widget Apr 1, 2025, 1:02 AM

#

eager mica New Llama model: `venom` [screenshot of `venom`]

looks like meta is going to make best ChatGPT

rigid widget Apr 1, 2025, 1:03 AM

#

alpine coral could be `phantom` - i haven't seen it in the arena in over a week. but when it ...

it can't be a Google model

alpine coral Apr 1, 2025, 1:07 AM

#

rigid widget it can't be a Google model

ok... care to elaborate?

rigid widget Apr 1, 2025, 1:09 AM

#

alpine coral ok... care to elaborate?

google models bad at this things

alpine coral Apr 1, 2025, 1:11 AM

#

gemini-2.5-pro non-thinking is not a google model; and it's bad at 'this things'..?

#

i'm incredibly confused / don't think we're talking baout the same thing

rigid widget Apr 1, 2025, 1:13 AM

#

ah my eyes

#

i saw (think) you replied hare

#

the 24 karat gold model

keen beacon Apr 1, 2025, 1:14 AM

#

alpine coral could be `phantom` - i haven't seen it in the arena in over a week. but when it ...

Is moonhowler thinking? Phantom was afaik

rigid widget Apr 1, 2025, 1:14 AM

#

keen beacon Is moonhowler thinking? Phantom was afaik

it's really fast i don't think so

alpine coral Apr 1, 2025, 1:15 AM

#

rigid widget i saw (think) you replied hare

👍 np (thought something like that was the case :))

rigid widget Apr 1, 2025, 1:15 AM

#

alpine coral 👍 np (thought something like that was the case :))

bye the way where is gemini-2.5-pro-non-thinking?

alpine coral Apr 1, 2025, 1:17 AM

#

keen beacon Is moonhowler thinking? Phantom was afaik

i haven't got moonhowler. i don't recall phantom being obviously a thinking model (it was a bit all over the place).. so can't really say; but it was strong, though consistently not as performant as nebula (and i feel was introduced to the arena at around the same time).. hence just speculating it could possibily have been the non-thinking version

#

and perhaps because it was so erratic in performance is why it got pulled from the arena (seemingly), and 2.5 Pro thinking has already been released.. counter-intuitive (would've thought the release sequence would be the other way round..) but who knows ha

alpine coral Apr 1, 2025, 1:19 AM

#

rigid widget bye the way where is gemini-2.5-pro-non-thinking?

check the post I replied to originally; it links to a reddit thread mentionining it

#

but not released is the short answer to your question afaik

keen beacon Apr 1, 2025, 1:19 AM

#

I recall I got phi 4 vs phantom and both were significantly delayed (so phantom was thinking)

#

Anyway figuring out if moonhowler is thinking or not will narrow it down a lot I think

alpine coral Apr 1, 2025, 1:20 AM

#

keen beacon I recall I got phi 4 vs phantom and both were significantly delayed (so phantom ...

ah k that's useful to know

rigid widget Apr 1, 2025, 1:20 AM

#

keen beacon Anyway figuring out if moonhowler is thinking or not will narrow it down a lot I...

it's not

#

it's like gemini2.5-flash or lite

keen beacon Apr 1, 2025, 1:21 AM

#

Hmm

rigid widget Apr 1, 2025, 1:21 AM

#

eager mica Apr 1, 2025, 1:22 AM

#

panda is also a new Meta model.

[image redacted]

ancient reef Apr 1, 2025, 1:30 AM

#

I like panda

somber niche Apr 1, 2025, 1:34 AM

#

Another weekly batch it seems. Maybe another parameter size / model type?

#

Gonna give it my usual questions and see how it does

rigid widget Apr 1, 2025, 1:35 AM

#

new model stradale

#

I found something

#

models can't get this question

#

Where is my phone?

#

very funny

#

stradale is a very small model i guess

somber niche Apr 1, 2025, 1:51 AM

#

Here's Panda's attempt and uh, nope. It's the thought that counts I guess

#

24_karat_gold told me it was a Meta model in one round and then that it was trained on Google's servers in another round

#

So uh, take that as you will

eager mica Apr 1, 2025, 1:55 AM

#

spider also juggled between Meta and OpenAI, but mostly Meta.

#

roma doesn't seem exceedingly smart.

somber niche Apr 1, 2025, 1:56 AM

#

Yeah I just tested roma, both code comprehension and writing are pretty atrocious

eager mica Apr 1, 2025, 1:57 AM

#

eager mica New Llama model: `venom` [screenshot of `venom`]

It seems so.

ancient reef Apr 1, 2025, 2:00 AM

#

exp router and gemini thinking give 50004

eager mica Apr 1, 2025, 2:01 AM

#

stradale claims to be Llama.

[screenshot removed]

#

24_karat_gold claims to be Llama, and like spider it's very verbose, totally unhinged (in a good way) and unabashed.

alpine coral Apr 1, 2025, 2:16 AM

#

eager mica `24_karat_gold` claims to be Llama, and like `spider` it's very verbose, totally...

if i hadn't read your comment or seen the model name in the screenshot, i would have said that response is from spider – feels like >90% identical to the style (super playful/light-hearted; lot's of emojis and attempted humour) and verbosity of spider anyway

somber niche Apr 1, 2025, 2:17 AM

#

24_karat_gold is the first model to get my coding question right, it's pretty good

raven void Apr 1, 2025, 2:21 AM

#

24_karat_gold is llama 4 100b?

eager mica Apr 1, 2025, 2:33 AM

#

raven void 24_karat_gold is llama 4 100b?

The exact size isn't known. I've seen suggestions of 175B parameter size, and it's possible it might be a MoE model, but nobody knows for sure.

sudden marlin Apr 1, 2025, 3:49 AM

#

raven void 24_karat_gold is llama 4 100b?

Llama 1T ?

alpine coral Apr 1, 2025, 3:49 AM

#

i haven't looked into it for a while, but last time i did the conclusion i arrived at was that all input messages are truncated at precisely 12,000 characters (not tokens), regardless of model/s used or whether arena battle or direct chat

#

but might've changed (or perhaps that conclusion was misplaced to begin with)

alpine coral Apr 1, 2025, 3:50 AM

#

sudden marlin Llama 1T ?

i was hoping llama-4 24bn ha

#

i joke.. but it's not entirely inconceivable.. tbh i find it interesting how little attention gemma 3 27bn has gotten. it's doing incredibly well on the leaderboard* anyway (for such a small model, open weights too)
*without style control in ss

upper wolf Apr 1, 2025, 4:04 AM

#

24 karat is funny asf though i can’t lie

slate vapor Apr 1, 2025, 4:37 AM

#

Anthropic Upgrades AI Safety Policies

Real-time Content Filtering System Immediately Intercepts Dangerous Commands
Asynchronous Monitoring System Conducts In-depth Analysis
Establishes Rapid Response Process for Jailbreak Attacks
https://www.anthropic.com/rsp-updates

No, man, is it necessary to be this overly secure?