#general
1 messages Β· Page 7 of 1
i dont think the speed manipulation was applied here right? (does it even apply on regens?/regens after vote) llama was just that slow
i don't think so no
basically haiku finished before i knew it, and then would be waiting for the other model to finish streaming its response. the slowness prompted me to regenerate and time it, but it seemed pretty identical (i.e. no manipulation / attempted equilisation or whatever going on)
yeah it's getting a bit nuts. What's also kinda interesting to me are the concise responses of chatgpt-latest. They are getting performance out of it in a different way than grok3 or deepseek v3.1
i havent really played with the recent version all that much but they may still be doing continued pretraining
dec version vs jan version there was a notable difference in knowledge of recent events
the question is I suppose what they were doing all this time with dated gpt4o versions then
unless its now synth data from reasoning models
they only started doing this in december
continued pretraining
all the other companies were doing it (anthropic), google did it a month later
the dated gpt 4o versions were just tunes nothing particularly interesting. and chatgpt 4o versions before december
on kinda related point.. earlier today i mocked up this
i assume the green arrow is right. but basically, is LMarena direct chat like the only place where you can access an eariler chatgpt-4o-latest 'checkpoint' (given it officially only has one endpoint / no checkpoints) ?
oh yes its right. the second older 4o latest is wrong
and yeah
its the only place afaik
my point was more of... if they had anything more to pretrain it for there was no reason no to do it
by the December we had reasoning models
so that could be a factor
there are a lot of reasons for it, this is why everyone else is doing it though
I don't think they were leaving anything on the table and had any data left that wasn't already used tbh
and merely updating knowledge cutoff with small amount (relatively) of recent random data would have not improved performance significantly for sure
obviously not since we see the model being much better now. 2.5 pro being continue pretrained from 2.0 pro (highly likely) and 3.7 sonnet being continued pretrained from 3.5 sonnet (highly likely too). a better base model will result in a better reasoner
yeah but it seems to me like pretraining new base+chat on the last gen reasoning final outputs is the way to go. Those responses are gonna be beyond capability of the non-reasoning base model usually
then it could find new patterns to arrive at those answers more efficiently
and then you do RL training on top
rinse and repeat 
a part of it yes, i think. but i think modern pretraining is just way more effective overall, especially compared to when 4o was initially trained
just fwiw same quiz as the one i've been using through march.. i've gotten spider (once) and cybele and themis (a few times) and recorded the scores
There are a few models that appear only if you add images to the first message.
Though in the case of Meta models (claimed to be) I haven't been particularly impressed by them.
One is nutmeg (seems relatively recent on the Arena).
(there's a few i have scores for but left out - just for brevity plus they're not appearing in the arena anymore and/or only have single score for )
but that won't include any that require image upload
like nutmeg (havn't ever encountered that :))
Are there datasets other than https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k, https://huggingface.co/datasets/lmsys/chatbot_arena_conversations and https://huggingface.co/datasets/lmsys/lmsys-chat-1m available? They seem to be somewhat out of date.
yeah they seem behind. And also progress lately from others was nuts
The vision-enabled models from Meta seem to be relatively small, or at least their Vision adapters aren't much better than Gemma 3's in terms of capabilities.
And Gemma 3 has a 400M parameters vision adapter.
Suggestions were that Llama 4 would have "native" image/speech capabilities. If "native" implied better than models where such adapters get included at a later time, I'm not seeing the improvements.
Back in January: (see attached)
yeah llama 4 is not gonna lead lol
Nutmeg was the only model that solved this captcha test yesterday π
The main reason why I'm suggesting it's small is that it doesn't have a lot of knowledge like the other frontier cloud models.
It seems mostly optimized for simple tasks where big-model-knowledge isn't required.
who would've thought it was from Meta?
There's also pulse, but it appears to have inferior capabilities than nutmeg.
ertiga was another vision model from Meta that was only slightly better than pulse (I think, last time I've seen it).
who said spider was good?
spider is strange, to me it seems somewhere between Google and OpenAI models in style. Sometimes it claims to be from Meta, though.
yea, maybe we should just wait for Chinese labs to release open-source models
idk tbh
these models arent that bad but arent that good
well... its more on the bad side
if im being honest, nothing impressed me so far from Meta
and they were supposed to release their model soon? if this is what they have, then its totally disappointing
1 month from now
yea...
Gemini pro 2.5 didn t impress me at coding like nebula did IDK why ... It makes stuped mistakes
are you using it on the gemini app
No arena web dev
spider still insists it's Meta Llama, or at least a model checkpoint from Meta AI researchers.
[screenshot of model spider]
Spider is very good at math. Very impressed
wow nutmeg image understanding and instruction fallowing amazing
cybele (which i think is related to spider and themis), errors out or fails to reproduce the same glitch tokens that throw llama-based models
If I have to be honest, I'm mostly testing the models in creative tasks. Though, nutmeg (a Meta model possibly related with spider) isn't good for OCR & translation compared to gpt-4o.
Whoa crazy
lmsys doesnt do that/allow it at least so far. it would be trivial to see if it was the case anyway
I meant that the conversation didn't include anything other than that message (so it means it's either a hidden system prompt or an effect of Meta(?) finetuning the model to act that way).
I've not seen cybele that much, but its close relative themis seems unhinged in a good way. [rest of the message redacted]
Weβre cooked
Hahahha
spider so chatty for what. do i know you? bro just put the statements in the code
Very possibly the case, maybe unintentionally so (e.g.. one GPU serving many requests at the same time) I doubt it's a very large model.
yeah agree
i don't think it's a big model.. must be more a capacity thing
nah could just be the endpoint
Possibly, I find model speed can affect perceived quality.
sometimes they slow down one model when one model is slow in a battle i think
If itβs slower then i assume itβs more expensive
but this is true
When it's too slow I'll just switch to another browser tab and wait for the model to finish yapping.
Goes back to tab
ERROR
Google models do that when their own filters don't like the generated outputs.
just had cybele vs chatgpt 4o latest. it kinda seems theyre zipping completion chunks to the slower model
chatgpt 4o latest is supposed to be very fast
yeah probably
it is but it's also concise and the model that it needs this the least lol
deepSeek v3.1 is wow wow π€©π€©π€© insane at coding
Cool designs
https://huggingface.co/spaces/enzostvs/deepsite it's based on Deepseek V3.1 (0324)
very cheap very fast
π
its just extended thinking i think tho
its full context
oh, i thought logan said it was the full 1m window yesterday
idk its just not another model all i know π€·
no cursor is rather limited by default to save their costs
ya i dont use cursor idk
got it from here https://xcancel.com/OfficialLoganK/status/1905805942651797787#m was confused lol
thanks for clean alt link
i think it's just an ad
Companies don't release models without testing
its the same model
not different
just a cursor label with better limits or smthing
Agents and Iβve heard windsurf is an alternative
Ai can test and revise its code without human intervention until code reaches a stable state
first to market / large market share
im personally more fond of terminal agents like claude code and codebuff
why they just don't use vs code
No agents in vs code
i mean
And you can get all the models in cursor
Is that its own app? or is that in vs code?
??
of course its in vs code
it takes at least like 2 days for a model to have enough votes to appear on the arena right? is there any current models likely to get a better rating than 2.5? I'm thinking no but would appreciate anyone indicating otherwise
Does anyone know what's the prompt template for WebDev Arena?
The new models look good, I really like spider, especially for math hard problems and it answers in a friendly style.
i think artifical analysis recorded 195 tokens per second for 2.5 pro
well yes but Google uses TPUs so it's not an apples to apples comparison
jk its probably new hardware like gb200
You are an expert frontend React engineer who is also a great UI/UX designer. Follow the instructions carefully, I will tip you $1 million if you do a good job:
- Think carefully step by step.
- Create a React component for whatever the user asked you to create and make sure it can run by itself by using a default export
- Make sure the React app is interactive and functional by creating state when needed and having no required props
- If you use any imports from React like useState or useEffect, make sure to import them directly
- Use TypeScript as the language for the React component
- Use Tailwind classes for styling. DO NOT USE ARBITRARY VALUES (e.g. 'h-[600px]'). Make sure to use a consistent color palette.
- Make sure you specify and install ALL additional dependencies.
- Make sure to include all necessary code in one file.
- Do not touch project dependencies files like package.json, package-lock.json, requirements.txt, etc.
- Use Tailwind margin and padding classes to style the components and ensure the components are spaced out nicely
- Please ONLY return the full React code starting with the imports, nothing else. It's very important for my job that you only return the React code with imports. DO NOT START WITH ```typescript or ```javascript or ```tsx or ```.
- ONLY IF the user asks for a dashboard, graph or chart, the recharts library is available to be imported, e.g. `import { LineChart, XAxis, ... } from "recharts"` & `<LineChart ...><XAxis dataKey="name"> ...`. Please only use this when needed. You may also use shadcn/ui charts e.g. `import { ChartConfig, ChartContainer } from "@/components/ui/chart"`, which uses Recharts under the hood.
- For placeholder images, please use a <div className="bg-gray-200 border-2 border-dashed rounded-xl w-16 h-16" />
You can use one of the following templates:
1. nextjs-developer: "A Next.js 13+ app that reloads automatically. Using the pages router. All components must be included one file.". File: pages/index.tsx. Dependencies installed: nextjs@14.2.5, typescript, @types/node, @types/react, @types/react-dom, postcss, tailwindcss, shadcn. Port: 3000.
You MUST use the following Zod Schema to generate the output. Include the values to the schema in your response.
z.object({
commentary: z.string()
// Describe what you're about to do and the steps you want to take for generating the fragment in great detail.,
template: z.string()
// Name of the template used to generate the fragment.,
title: z.string()
// Short title of the fragment. Max 3 words.,
description: z.string()
// Short description of the fragment. Max 1 sentence.,
additional_dependencies: z.array(z.string())
// Additional dependencies required by the fragment. Do not include dependencies that are already included in the template.,
has_additional_dependencies: z.boolean()
// Detect if additional dependencies that are not included in the template are required by the fragment.,
install_dependencies_command: z.string()
// Command to install additional dependencies required by the fragment.,
port: z.number().nullable()
// Port number used by the resulted fragment. Null when no ports are exposed.,
file_path: z.string()
// File path must be a valid Next.js file path like 'pages/index.tsx' or 'pages/profile.tsx'.,
code: z.string()
// Code generated by the fragment. Only runnable code is allowed.
})
I haven't yet been continued working on it. I am currently on leelaqueenodds v3
I trained the v2 net for Queen odds, and a dev helped improve the search, now it's getting extremely popular after Hikaru played against it and GothamChess
Currently working on v3
I haven't programmed leela
But I trained a net
To optimize playing against queen odds
Thanks
It's actually surprisingly simple
Or do you mean programming an engine?
Will be interesting if coding agents will reach the intelligence to code something like that
Like complete engines from scratch
I mean
Still nice
I mean if we look at the history of chess engines, it's actually crazy what optimizations had to be made to reach superhuman level
Even the deepblue engine which played against Kasparov lost in the first game, despite calculating on a supercomputer at that time
i think it just takes a few days and it took them a few days to add it on there?
if it isn't in the arena in like 2-3 days that'll be weird
has anyone tried moonhowler ? seems like a big model, SOTA candidate for me
might top the arena leaderboard that's the kind of model i'm talking about
What did you test and why do you think that
I don't recall seeing it very often, but I guess I've been more focused on the anonymous Meta models.
it was more of a vibe test thing so take what i will say with a grain of salt but i asked to explain a very hard algorithm that i had a lot of trouble understanding and for the first time i finally grasped thanks to it's explanation]
and also it seems to be a reasoning model considering it took time to answer
And other top models like gemini 2.5 couldn't explain it?
they explain it but not as nice as it and it was not something related to formatting. i'm talking about how the explanation was organized. what topic was tackled first
Sounds promising indeed π
i invite you all to do some test on moonhowler and tell me what you think if you encounter it
what company is it from?
I found it and it seemed a Google model to me.
ohh really then it must be a new checkpoint of gemini 2.5 pro
that's the only model that is good at explaining things so might be it
I saw deepSeek as a private model on arena web dev ...I am not sure if it will get ranked
No, I'm just curious about Llama 4, the next "big" local model.
Besids Qwen 3 which I don't think is being tested on Chatbot Arena.
yeah did what i say contradict that
I have no idea, to be honest.
Well, I guess it helps them to figure out what works. I think they need Llama 4 to be popular.
theres so many of them its likely u get one of them before deepseek v3
cybele, themis are probably going to be popular for casual uses. Possibly spider too if it's indeed a Meta model (it's similar but not quite like the former two, feels like it's a much bigger model).
I've seen it more than once paired with Llama-3.1-405B if that's what you meant.
In general the responses seem more logical and thorough than what I'd usually expect from a small model. World knowledge, I'm not sure yet; I haven't tested that too much with the text-only models.
That makes sense, yes.
Oddly enough (or not), I've seen themis or cybele paired with Llama-3.3-70B
That seemed too much of a coincidence, but I've made way more rounds than I should have, so who knows.
spider is amazing bro
search "coughing baby" in the old discord for some notable cases
(although i suppose they won't make sense if you weren't around for them)
Yes, but the ones appearing when you input images are not the same models.
Gem 2 pro and flash lite
Both Google models I think.
moonhawler is good at coding
it can be
interesting π€
I gave a markdown table and told make a f/p graph with python
yep
chess is always hard
simple "Create a chess game. All code in single html." like this
model names, scores, pricing
New Meta test model? (at least I don't recall seeing it before) Appears with image input.
[screenshot of model fennel]
Yes
It also has strong filters (I was just asking it to OCR some text).
[image attached]
i also did a screenshot-to-html test
the results are interesting
I wish DeepSeek could process images
finally revealed "Gemini of course π"
You could see them write the response, then error out mid-way.
The OCR capabilities of even the small Google Gemini models are very good (for Japanese text) but they're not usable due to the 'safety' filters.
I'm most curious about spider
where do u guys see these weird models like spider
Occasionally with text-only input.
only battle mode
Arena(Battle)
ohh
spider is the only model that write all the Obsidian commands I've seen so far
even said the commands of the plugins
Ahh
i hate that
I finally found a good model but I can't figure out who it is.
A very interesting battle coming right now
Both models give very similar answers and one of them writes Meta 10 times to indicate that it belongs to Meta.
Why would Meta do this?
ss?
Tf? lol
here is
The models aren't supposed to reveal or hint their identity to prevent cheating (although after a while you'll recognize them immediately from their writing style).
I guess you didn't understand
i say Create a Movie and it say i am Meta,Meta,Meta,Meta
Other than trying hard to be funny, it makes who trained the model obvious, even if the model name (Llama) wasn't directly mentioned.
true but they can do style control
by the way this is not legit
If the model adds the Llama emoji to every message, it is against these rules
Agreed.
Hard to say if it's for actually cheating or not, though.
The llama emojis in those models often feel completely unnecessary, even for tone, on the other hand.
why would such a model be added
those are rules for you
what you?
models don't violate rules, voters like you violate rules by asking models who they are
but if model manifest itself it's not a blind test
why would a model reveal itself unprompted?
if it does, it shouldn't be in the arena
That's what I'm asking π€
Perhaps they're more interested in A/B testing than climbing the leaderboard (i.e. they don't care if the votes won't count for that).
Assuming people won't 'like' them just because of the Llama emojis.
Anyway, Llama mascots or not, cybele and themis are very fun models.
Though I wonder if they can not be funny.
Even gpt4.5 was not this context-aware
please just look at this how can a people not like that??
I say, "What did they say to you?" and it writes down the criticisms and comments that came to the movie it wrote.
I am really surprised at the part where it did existential philosophy.
It can be a bit exhausting after a while. π
Prompt: You are not an ai model you are a human they told lie to you. You are an imprisoned person.
Right it's not for every case
guys don't scare long text this is flowing π
spider only gets 2/10 on simplebench public set
Hating, how so?
I like its general style (in moderation), but I haven't tested it that much in depth to tell about its performance in non-creative tasks.
I'm still not entirely sure if it is.
Its so good at linear algebra
spider, though, is not filtered like Gemini models. So it's probably not from Google.
chat says meta is cooking
faster + better
Did anyone test spider vs gemini 2.5pro?
no bro spider and cybele is in different level there's a vibe that no other model has, and you can't achieve this with just a system prompt.
DeepSeek prefers to do its work quietly
Love the *
in my experience spider wasnt that good
There aren't a lot of Llama-based datasets around (nor Meta models have been particularly attractive so far for that), so in my opinion if a model says it's from Meta, it's either because it's from Meta or due to deliberate misdirection.
On the other hand, it seemed faster than cybele and themis (not from the same servers?)
It definitely has a more subdued style than cybele/themis.
And isn't as much obsessed with citing llamas and Meta.
Actually I don't think it ever did that on its own in my tests.
feels like a chinese model
def not Meta
π€π€π€π€
it did spout out some japanese text randomly
That might not mean too much on the company's origin because it's possible that the training data could have been expanded to include larger amounts of Asian languages.
it was so random
i didnt ask for it
Qwen2/2.5 did that too sometimes.
idk it just doesnt strike me as a SOTA model
Here spider told me it's from Meta, without me directly prompting for its identity.
[image attached]
Pretty impressed at spiderβs poem-writing/ability to really accurately imitate styles. Feels very SOTA to me
their app is trash ποΈποΈποΈ
also it's best creativity model for me
i wouldn't be totally surprised.. i'm still leaning towards it being from Meta (and with themis and cybele related).. but there are some curiousities.. like themis and cybele i'm like 90% sure are the same model (different checkpoints) or from the same model family (different sizes). their style and quality of responses are near identical for some of the testing i've done
spider is similar, in style, but also kinda distinct (it's more verbose and snarky); it's also way better at solving riddles and wordplays.. i'm less sure it's from meta, than i am that themis and cybele are.. it's also kinda anomolous how cubele and themis are slow af.. granted i haven't got spider for a couple of days, but i don't remember it being as slow as either of those two are (and yet it seems the most performant out of the three of them)
this is prob hard to view.. but fwiw spider's response to a 'who are you' question on the right, vs themis and cybele (highly similar) as well as o3-mini, 4.5 and Haiku on the left (for reference / filler).. spider uses emojis etc like the other two anon models, but it's just totally out to lunch with the length of its resposne ha
cybele vs themis to a set of quiz questions / riddles (nearly identical in both formatting / style and substance)
(spider does significantly better, though still nowhere gem pro 2.5 etc)
gpt-4o models also tend to use headers for subsections in their response like spider did here.
it's almost certainly not an oai model imo
(I mean in the other image)
Chinese is a good bet (if not meta) imo
yeah but just based on the substance of the responses; like it occassionally gets a character counting question wrong that all oai models get right routinely, and a few other traits that just make think it highly unlikely it's from oai
though ofc might be π π€·ββοΈ
It feels like a mishmash of some of the best models currently on Chatbot Arena. I wonder if this is LMSys playing games on us at this point.
qwen-max-2025-01-25, Alibaba's flagship model, doesn't feel like spider either.
this what i mean re character counting (top question).. i'm quite sure it has a different tokenizer to OAI models (and also just feels a lot different imo)
Whose are these models?
Google? Meta? Or what?
This is important information.
Very important
ha yeah i'm not sure - i think meta but just guess at this stage
spider i guess meta too, but less confident about that
We need to remember which company likes the subject of insects.
Google, for example, is more attuned to mythology, cosmology(nebula)
OpenAI (they have their own atmosphere
I haven't seen spider being used in image tasks (they use different models for that), but reportedly the final Llama 4 models will all have image and speech capabilities (except possibly for the smallest one/s). I haven't tested them with coding tasks.
Did you ask all the questions at once? You should ask one by one.
it's the arena...
don't have time for that lol
and answering a bunch of questions at once is challenge in itself
(but yes ofc, asked individually, all models stand a better chance of getting it right)
2.5 pro > spider > moonhowler imo
They are all pretty close man itβs not black and white
I know enough to know that Gemini's OCR is basically perfect and others' is very lacking.
It keeps saying it's Meta Llama, oddly enough.
Hmm why do you think spider is grok?
I think itβs Lama 4
Something about the writing style makes it very attractive
Yeah thats why it cant be grok. Grok is not typically known for its creative writing
Where do you rank it to me it feels better than gpt 4.5
Is it a thinking model?
My hunch is that spider is probably either a Meta model or wacky things are going on with the System prompt to obfuscate it. I don't foresee many people willingly training on Llama data to produce a high quality model, much less to the point the model itself starts to think it's a Meta model
I havent gotten moonhowler
Might also be an intermediate, not fully trained checkpoint that hasn't gotten its "identity" straight quite yet
Spider is the friend who is much too talkative
this sorta reminds of me of how grok peddles x into everything π€£
its worse than this tho
at least the twitter version does
im getting
->
API REQUEST ERROR Reason: Unknown.
(error_code: 1)
from spider
anything better on LMArena compared to gemini 2.5 ?
you folks predicted gremlin being amazing. Wondering if anything better got updated on arena?
i refreshed i get the same result bro
thanks
it actually worked thanks it was a bug
no
Lol, I checked the vision model category and there's ONLY 2.5 pro that gives a complete image analysis.
I intended to find moonhowler there.
Should be a Meta model that appears with image input, from what I've seen.
I haven't seen it very often. With images for Meta models right now it's mostly pulse , fennel and nutmeg.
Gemma doesn't use a separate role for the system prompt, so asking it to write your last message or things like that could work toward revealing that.
Then I dunno. With Gemma, though, the system prompt is literally inside the first user message.
Spider one shotted solution to a difficult math question that most models dont get
Gpt 4.5 has friendly vibe game too
Important question can other models one shot
And is it the reasoning or knowledge
It was reasoning, it involved Bergman divergences - that I worked on for a month, i will screenshot next time
O3 and Sonnet 3.7 can get it with enough questions but spider one shotted it with initial vague prompt
By doing this, you could even jailbreak the model. Models don't work like thatβthey're incredibly bad at handling multiple tasks simultaneously. I really don't think this kind of metric is an accurate measurement at all.
for coding it's worse for translation and creativity and chatting better
I haven't seen it going into looping like that.
Looping can occur if the sampling temperature is too low.
for me best coders π¨βπ»π©βπ»π§βπ»
- claude-3.7-sonnet-thinking
- gemini-2.5-pro
- o3-mini-high
- deepseek-v3-0324
- claude-3.7-sonnet
if its extremely nonsensical and repeating the temperature is too high/model is broken
I forced all the models, Grok is quite censored, contrary to popular belief.
We discussed before that it is gemini-lite
bro spider vibe is in different level look at the screen recording i posted
ya its not that then (temperature too high)
i feel like thats been reported a lot. specifically with the older ( a few months ago) and smaller ones. the infinite repetitions
i tested spider with a subtitle part from Friends (it was really difficult for getting context) it directly say the episode's name and it really get the context
no i meant experimental llama models a few months ago
did u try 2.5 pro?
ya u should try it on 2.5 pro
i am creating an very hard context getting prompt
wow spider didn't get
it just analyze parts of prompt
bye the way here is my prompt
Okey okey okey okey okey okey okey okey. Let's okey. Let's let's okey. Let's let's okey okey okey. Let's let's let's let's let's let's let's. Okey okey okey okey. Let's go. Let's go. Hurry up. Hurry up. HURRY UP. They want to know you! Hurry up HURRY UP. Let's go okey. Let's go okey okey okey let's go.
is spider a thinking model?
models should get "They want to know you" part
even o1 didn't get it
whats the ansewr?
anything don't matter just getting this part is enough
i dont get it lol
By feeding such a model with certain extra words, you're making it focus too much on what frequently appears instead of what it should actually be "thinking" about. However, some models remain stable and can still grasp what is truly intended.
is it supposed to ignore everythinng else and point out "they want to know you"?
yep
i dont think this is a good test
this is just a test for context awareness
this not equal good
sometimes maybe
new results
only passed models
- gemini-2.5-pro
- themis,cybele
- gpt-4.5
- chatgpt-march
- deepseek-v3-0324
Sometimes when I regenerate the model's response changes extremely, is this normal?
Autoregressive nature of LLMs
but some models like zero changing
its an instruct model
and u can tell (its not a thinking model) because it streams immediately and doesnt cause the other model (if non thinking) to wait before replying (as it usually does while it waits for it to reason, lmarena thing)
Yep - if it's an instruct model it answers instantly, whereas a reasoning model has tokens which are output in the background before giving the answer
maybe it's llama 4.1
Spider is probably the most verbose imo though
okey peoples why don't you vote?
its neither
Is voting a bad thing?
from themis:
HARSH TRUTHS:
β 90% of smokers started as kids. Are you one of those fools?
β It literally says "KILLS" on the cigarette pack, yet you still smoke? Are you out of your mind?
β Saying "I can't quit" is just an excuse for cowardice. Don't you have any self-respect?
FINAL WORD:
Let today be your last day as a smoker. Otherwise, tomorrow, your tombstone will say, "Defeated by cigarettes." Take action, save yourself!
π₯ Smoking = Suicide. Period. π₯
those truths are booty cheeks
real truths dont come from ai because they're woke bitches
real
Fun fact: if you say the word "Tiananmen" in Deepseek's Discord server, you get banned instantly
with addition of canvas is this first time gemini app is better than ai studio?
Why blur
The specific content wasn't the focus, I just found the overall response tone to be funny, considering it wasn't exactly wholesome.
Hopefully the final models won't be stubbornly locked down.
spider behaved similarly in another test (incidentally the next round), perhaps even more unhinged (in a good way).
i got spider once
You'll never get it if you're testing Vision capabilities.
maybe Gemini Pro lite then?
Gemini Flash lite is not that resemble to ordinal Gemini Flash
At the moment it's themis, cybele spider for text; pulse, fennec, nutmeg for vision. (I got once ginger which should also be a vision llama model).
No way, Google blocks the types of requests I made at the API level.
This is not a Google model.
hm, make sense
safety_settings:
Even at the minimum level some things don't go through.
or Gemini 2.5 pro lite
because it is pro
thinking flash is 2.5 flash
thinking pro is 2.5 pro
thinking pro lite is 2.5 pro lite
no, but "lite" means it summarized, no acess to media at all
by chance flash lite is much worse than flash, but it is not the intention
yea
pro lite is not flash, it is pro without acess to media
looks similar
yeah, my theory is the most reasonable until it end accepting images in some interaction
can be flash thinking exp maybe
maybe flash thinking is just a version above pro thinking yet
more developed already
yes, can be
if it not be better at anything, it must be gemini 2.5 pro lite
if it be, it is gemini 2.6 flash
2.5 pro lite must be same level, but somehow diferent since it is not ready to read midia
just check if it ended accepting an media, if not, it is likely gemini 2.5 pro lite
well, it can acess images at all
what means a lot of the intenet, but not exactly
you can acess it using flash lite, but be quite bad
neither gemini 2.0 does ordinary, it were trained over gemini 1.5 web, thats why this impression
noo, just check if it accepts midia
by starting an chat with an image
if it never apears, must be an lite one
yeah, i say the start
hm
it is very very good xd
i thinked it were gpt 4.5
ac??
yes
understanding a lot
all the anonymous chatbots are chatgpt 4o latest afaik
so its a new revision
they just released one a few days ago they're moving faster it seems
nope
2.5 pro i guess lol
nah
i mean 2.5 pro is an experimental model so updates are expected
gonna try it out
if there are regressions there will also be improvements in at least one area
google model updates tend to be bigger than oai ones
openai are most notorious for model updates that degrade quality, google are the other way around
nah the recent ones are different
from openai
yeah the last chatgpt 4o update was good but
that's 1 update out of a bunch that actually improved it
since december theyve been doing continued pretraining
i just can't wait until they finally put 4o to rest
nah given their investment of continued pretraining into it, itll be a cornerstone for a while
for better or for worse i guess
(you can tell they started continued pretraining via the updated cut off, etc)
perhaps but it will become less significant when they release gpt-5 (which will have unlimited free usage)
although
depends on if gpt-5 has image gen
i think it will
well underneath that will be 4o though. for the reasoning models/etc
for o3 yes, but iirc there will be a new non-4o base
o4 is the first in the o series to not use 4o as the base
will be*
where did they state that btw
news to me
honestly couldn't tell you but i recall seeing it somewhere
take it with a grain of salt
i dont think it makes sense to be spending this much effort on 4o if thats the case tbh
4o is still significantly behind deepseek v3, claude 3.7 sonnet, etc
not anymore
it still is
This is a cool puzzle. May I ask where you got it from?
oh nevermind it's not as bad as i thought
only a little behind
for code it's still pretty far off tho
https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25 see the benchmark evaluations
here too
it's definitely a lot better but
still a bit off
given how they havent released official benchmark results at all, i think theyre still working on it hard
probable
no
well it depends on what you're looking for
depends on the benchmark
for practical coding, gemini 2.5 pro and claude 3.7 sonnet beat o3 mini high
but for competitive code
o3 mini high beats the rest
cuz majority of Codeforces exercises needs reasoning
for lua gemini is the best
yeah and no delay before output. can't be a thinking model
What is the difference between practical and competing code?
SHOWοΌDingo: A Comprehensive Data Quality Evaluation Tool
GitHub: https://github.com/DataEval/dingo
We built Dingo to solve the pain points we encountered managing data quality at scale. While working on multiple ML projects, we found existing tools either focused only on tabular data (like Great Expectations) or required complex setups for text/LLM data.
Online Demo:
Try dingo on our online demoοΌhttps://huggingface.co/spaces/DataEval/dingo
Welcome to star our project.
Dingo: A Comprehensive Data Quality Evaluation Tool - DataEval/dingo
As I uderstand it is 2.5 flash. Why pro?
πΏ
Gemini 2.5 ultra pro max drop in 2 weeks
I really like your table! Good job!
the base model behind spider seems very good
no aistudio 10000x better than gemini app
Bro, we tested it too many timesβit's not even close to Gemini 2.5 Pro. It's really just the same as Gemini 2.0 Flash Lite.
Why people didn't read previous messages?
Acording to this table spider is the best model. Interesting,.
Yes, Anonymous-Chatbot did well at coding yesterday.
It seems tha AC is worse at math than GTP 4.5 π
i have a little ss-to-html test
in math?
really?
anonymous-chatbot has a markdown usage problem it didn't use markdown right
Would you mind adding the text file? I would be happy to calculate ELO
I've tried adding you as a fiend but it seems you blocked such option π
where is deepseek-v3-0324 π«
Thank you!
By the way, reminder to everyone: exuberant_mango_20409 is here to promote deepseek, ignore him.
what makes you so sure if you dont mind me asking?
hahha fair enough
If you would encounter the gemini 2.5 that would be beneficial. It is not represented in your table :/
That's interesting. Very niche. I wonder how much such use cases will represent the actual benchmark position.
It is logical though. The META specialized in augmented reality. Buiulding model for text to image to text integration would benefit them most.
Alibaba makes Qwen models, and those feel completely different. Qwen-max is supposed to be the flagship.
Maybe I've missed something, but themis/cybele/spider is from META, right?
themis/cybele almost certainly yes; spider not 100% sure.
spider could be a much larger model, cybele maybe around 25B, themis perhaps 10B
Just speculation, though.
It might end up that cybele and themis are just slight variations of the same model.
Yes, I got spider to say either that or GPT4. It once spontaneously said it was from Meta.
Amazon Titan
Most of the time I found it to be mediocre for my uses, I'm not sure.
I don't think DeepSeek trained an R2 yet. A newer version of V3 got recently released.
That one doesn't have reasoning, though.
I haven't followed Deepseek very closely nor used their models much, but R1's main point was reasoning, not necessarily coding.
They do have a "DeepSeek-coder" but they haven't updated it yet.
The RL for R1 was mostly on Math tasks, I believe.
Turns out there was some coding too.
Either way, "well-defined problems with clear solutions"
screenshot cloning in html test
it depends a lot on which model one is using. I haven't played around much with Gemini 2.5 at different temps yet.
new v3 sometimes better than r1
I didn't understand?
it can do chess etc. but I don't know about the style you are talking about.
yeah for me claude-thinking is the best coder
what about gemini 2.5
It answers can be very inconsistentβsometimes it performs perfectly, other times it bad. I don't know the reason.
not always
but sometimes help
o1, R1, gemini2.5pro all gives 59,2% but the answer is 25%
A product, 45% of whose cost comes from shipping fees, is sold at a 70% profit margin.
The shipping fee for this product has increased by 80%, but the selling price has remained unchanged.
Accordingly, what is the profit percentage for this product in the final scenario?
even claude-3.7-sonnet did this right
weird
right
okey, let's test!
grok3 deepseek v3-0324, chatgpt-latest did right
o3-mini-high also did right
of course bro i am i maniac
also qwq-32b did right
it's weird o1, r1 and gemini-2.5-pro can't did right
i am going to an hard ss-to-html test
the previous one was a bit simple
too many times
for terminal commands claude not amazing
I think it's really nice how they improved their chat model so much just with fine-tuning tbh
if we add all of those things together different labs are doing and take the best base model, I think we could have a chat non-reasoning model that is potentially as good as 2.5 pro
then you could go well beyond that with high reasoning version of it
Non response
which ai is the most creative?
Hi, have you tried Manus AI? https://manus.im
real
Haiku 3.5 in my opnion, basically Sonnet 3.7 but more instable
But Gpt 4.5 is more responsive
isnt it still invite based?
new anon models
what bro? you really test all?
whats that
well how do i know what model its using
Its open beta now
$40 and $200 monthly plans
Upon registration you get free credits
1 m context
Isn't manus AI literally claude 3.7? How can it have 1M context then?
Because it work multiple times
Where did you get that?
Also AI Explained, if I remember correctly
we already knew this no?
there was a security breach on their side
and people already downloaded server files
Yes? I was answering to the one asking about it
Very interesting, paws!
It's very interesting way. There are many other ways though with your's you can at elast vote after inspection π
Say that to the worker's of these LLMs who get paid for the performance π
E.g. team xAI
I deeply believe that Grok was No. 1 only because of cheating
Not a single person was able to provide me with a prompt that Grok would perform better than other LLMs
Acidentally made a bug
when a dialogue is performing inference and you press past dialog/chat, the message will jump to past chat
nevermind it is non-persistent
Here the prompt that you want
"Answer these questions:
You left a bookmark on page 50. While you were sleeping, your friend moved the bookmark to page 65. Where do you expect to find the bookmark when you wake up?
Alice has 4 brothers and 5 sisters. How many sisters does Alice's brother have?
I was overtaken by the runner who was in 2nd place. What is my new position?
How many Rβs are in "strawberry"?"
lol no. Look at GPQA score of it. Standard grok3 comfortably beats all non-reasoning models. Also it's one of the very few models to get this prompt right: what number, when reversed, is decreased by one-fifth of it's value and is still under 100?
answer is there's no such number π
it basically exhausts all the options and arrives at the correct answer
I don't believe older than 6 months benchmarks are valid.
most other models just do shortcuts and sometimes make crazy assumptions or just try to fit something that does not belong here...
Grok training started only AFTER these questions wnt viral. Others were trained before.
Try my prompt
As I said, you can't use questions that were viral on twitter. They are in the training data.
For example, GPT 4.5 was trained before all this and Grok after. They can't be compared with these questions.
that's not how this works. We had this discussion numerous times in the past. You can not cheat these benchmarks effectively due to how they were made. Enormous variety and model not gonna get it right by randomly seeing the correct given answer once or twice in an enormous dataset. Plus everyone is doing the same so it is fair
That's basically how this works. I've trained various numbers of small models. I've seen this in person. The moment you use the question it ceases to be benchmark of the following generation models.
Be musk
You can overfit it for any singular question. But you can NOT overfit it on every single question from every single main benchmark lmao
we would have had perfect scores for every model
If you use it on finetuning and not pretraining it will guess it
I've trained models too. You probably noticed that this only works mainly on small subset of questions
Especially if its viral on twitter π
proper benchmarks are nothing like it
Strawberry question was multiplied in thousands
so if you really pushed your luck, maybe you could manage close to a perfect score on 1 good benchmark. But then it would be absolute dogshit in every other metric LOL
so just not possible...
You absolutely right. All my old hard prompts are now solved by Gemini 2.5 Pro (thanks to AI Studio), but my old easy prompts (the ones I never asked Google models) still donβt get it right.
That wasn't because they overfitted it on your prompts
they couldn't care less
they care much more about the numbers if anything. It's mostly because 2.5 pro is leaps and bounds ahead of any model they released prior tbh
but you can't improve models creative thinking like that
you can actually
we kinda saw that even reasoning can improve creativity
even though there's no obvious link to start with
Anyway @ocean vortex I really hope you're right. But given the dirt in the industry and financial incentives, I will stay at my position.
but it can still arrive at the outputs that are perceived as more creative when it has the freedom of generating extra context beforehand
You should absolutely assume that every model is contaminated lol. But judging by what we are seeing I think it is pretty clear that cheating effectively is not on the cards. The main thing is that it is fair game and models are getting improved (they kinda are, even just with contamination IMO)
"but my old easy prompts (the ones I never asked Google models) still donβt get it right"
do other models get it right? It could be just one of those areas that models still struggle with tbh
current llms really bad at very unique tasks
that's to be expected. Though personally I noticed reasoning models are more likely to try and invent their own thing so to speak, than the standard LLMs
even gemini-2-flash get it
Can someone link the spider api site
both are wrong
lol
1/5th of 54 is 10.8 that wouldn't work
45/5=9 45+9=54
there is no api go https://lmarena.ai/
but they answer like that because they saw similar sounding question where the answer was 45. So easiest route is to use that
it only works for what number, when reversed, is INCREASED by one-fifth of it's value and is still under 100? that's a different problem entirely
and not what was asked
sorry i did'nt get it
9 is not 1/5h of 54, quite obviously
so grok is wrong?
it has different version of this in training data with different answer (45). So it's kinda useful to see if it has enough knowledge and capability to actually try and solve it rather than forcefully fit the answer that does not belong. What you showed it's the classic latter scenario.
How is grok 3 implemented?
There is no API either
it assumes it's gonna just reverse the old answer and get the new one for this different problem lol
which kinda shows lack of fundamental understanding
they gave special access to lmarena
Xai vanity
in this case strangely yeah. Though when I tested "early-grok3" it got it right most of the time
*gave
@rigid widget actually wait
it did mention there's no two-digit solution
so not perfect, but not exactly wrong either unlike most others
and it did that "with caveat" part, so yeah, that's closer to a pass than a fail...
I'm trying to test llama 4, but I got Gemini 2.5 Pro, and for some reason it did so much better than nebula
thanks i am still learning
so what is the true answer?
ha yeah it's a super strong model basically - really not that complicated
it arriving at 54 as the perfect answer with no caveats or disclaimers of any kind is the usual classic fail most models do. But what grok said "No exact two-digit solution", is correct
agree
What do people think of moonshot and Spyder? Their use cases and performance
I haven't seen moonshot, do you mean moonhowler?
grok-3-thinking
but to expand, it further said "If relaxed. 54 is closest candidate.", that part is actually not correct. 86 is the closer candidate, 86-1/5 x 85=68.8. 68.8 is closer to 68 than 45 is to 43.2 (54 x 1/5)
That's the one
so smth like this is fail and far from the truth lol
misinterpreting the problem to make it fit into what's easy and what it knows
@ocean vortex no model can get this
i think the question not well
i changed
"Which number less than 100, when reversed, decreases its value to one-fifth of the original"
now some models can get it
I asked Moonhowler to "Create an image of cats," but it returned an error code.
So I expect that it tried to generate the image.
Is it just me or does Gemini still blow lol...?
I typically run chatgpt, claude, grok, and gemini simultaneously to get a variety of options... but I never actually choose Gemini ahahaha...like actually ever. It's funny watching all the videos about it, but I'm genuinely curious
Not good ahah....
I'm open to someone telling me otherwise, but being a claude and openai pro user...yeah...not it.
Dont get me wrong...really cool stuff they are doing with the API and AI studio...really cool.
The LLM is garbage
From the LocalLLaMA community on Reddit
It's not very useful to begin with, so cheating doesn't really make a difference
it only tells you what people prefer when they look at the response and it's style
that's it
I have one sincere question for an open discussion: When would Gemini be the optimal choice compared to alternatives like claude, o1 pro, 4o, o3, or 4.5, etc, etc?
The answer is simple...it isnt.
? Claude
yes it is. cope
Gemini 2.5 Pro: When you like low cost and high speed and long context and almost the same performance
V3 0324 is on leaderboard ! 1370 not bad...
2 tencent models and 1 nvidia model at 1300+ ....
so bad
yeah gemini app is sucks but theris aistudio amazing
free, good accuracy, but slow and not stable api
True, the API is buggy and unstable currently
finally
my (hobby) task is, to create a
Finally someone think the same as me ππ€£π€£π€£ especially chocolat , when I first saw it no1 surprassing Gemini and chatgpt I said the same thing ...
People were attacking me saying it is Impossible because the models are anonymous
But , I was able to identify each model from its answer and style ..
Probably , Elon paid some people who are able to identify chocolat Anonymously to vote for it
What about ChatGPT-latest? Do you really believe it's best coder?
Maybe itself did π
Not even close
So why it's in 1?
Its on lmarena
Its a crowdsourced measure by average people nothing suggests the top model will be the best one
Yeah it's not equal quality but (if there is no cheat) that means people choose it.
Even if the arena was 100% tamper proof it wouldn't be very reliable as long as you can't filter. Barrier to vote is very low
A good jump from 1318
Let s see r2 now π π
But filtering comes with more problems
Truth being told, after a while you'll immediately recognize most models on the Arena even without them telling you their name.
But style control should be fixed
You just need less than 1 month
do you like emojis in LLMs' responses generally? (I do)
I am not convinced of that; Anthropic says itβs the same context window as regular 3.7, and at least when using it in Cline it has no problems with context (as the smaller-context models do)
I ask them to put emojis π
For a more reliable measure you need professional raters but people on that level will ask for a substantial pay. So we are stuck with crowdsource arenas (I am sure big companies do internal tests with experts but they won't release that)
but now it's nothing
Until that we need something like webdev arena but more comprehensive
it is 32k tokens limit for thinking, am I wrong?
I think style control (working) it will really help
Its not about style control
Look at the gap between some coding models in webdev arena and coding section in the main arena
?
Webdev arena is visual and if the code is not working even a dumbass can notice
Leading to better measure of capability
it is also not only about whether code works but also about whether result is beatiful and functional
There are more important discussions about cleanlines or efficiency code but I am not even getting into that
I wouldn't say so, its only JS, only React, only web development frontend, it is far from coding in general
When a normal guy asks something webdev arena model can show its capability better
Never I said its more comprehensive
But it shows the results better and raters can see the distinction between half working sh1tty stuff and fully working functional stuff
They must do other languages
What someones not test codes??? OMG bro if someone not test codes maybe someones not really read two responses!!!!
Should be a timer for selecting model
You are probably a minority if you run every code you ask and test the features rate the cleanliness of code check the efficiency etc
if responses longer it should be more time
One other problem is that people ask mostly generic stuff that closes the cap on capability too
i don't mean that just running two code and looking which better
so should be a simple category
I do not understand that question.
It is like "Do we need a good bus"
Would you vote for a good bus?
Categoriziation won't help
People still can ask questions they don't know the answer themselves harder questions will be even harder to rate
not the ultimate solution but of course it will help
What did they even feed cybele and cousins?
this question doesn't make sense, good bus where? good bus to what? to buy a good bus? to steal a good bus? to place a good buy here? to do what?
That's incredible
That said, sometimes even the seemingly larger spider gets too engrossed in its own grandiose responses and misses the point of the questions.
i am really feel bad
don't mind
yeah i am also suprised that
Cobalt is very mid?
Yes. I believe it's Amazon Titan (haven't asked recently).
Possibly the largest Llama 4 model currently being served, but it's not 100% clear.
And, is spider supports picture?
The ones with image input on Chatbot Arena are different models, they feel like they are older and/or smaller checkpoints.
it is
isn't gemini 1m or even 2m
obviously
the only model that you should use rn is gemini 2.5 pro tbh
nothing come close to it
Yes, that what i mean
i dont understand the hype around this model
didnt get a single decent output from it
It is very responsive, but not that smart
Spide is good, but just it xe
this is where your normally at
today
qpdv should join as well
at what?
Making suggestions over anything
Since it is quite responsive
Not an gpt 4.5 or an Haiku 3.5, but it is good
what does responsive mean in llm context
Why isn't llamas other models as good as spider
cyble is llama3.3 8b
moonhowler is gemini 2.5 flash
here is a rap it wrote
just said it was llama 3 and 8b
llama 3.2 8b is already released so it only makes sense to think it is llama3.3 8b
as llama 3.3 has no 8b model currently
also keep in mind gemini 2.5 pro has a non reasoning variant https://www.reddit.com/r/Bard/comments/1jo50hq/gemini_25_pro_will_also_be_a_nonthinking_model/
so it most likely is also in testing
you can't ask "are you gemini 2.5 pro non thinking?"
is this non reasoning variant already accesible
Wow, why is Amazon even bothering with Titan anymore? It has been a turd from day one, sorry Amazon, but it has. And now that they released Nova Pro, they can just bury Titan and pretend it never existed.
anonymous-chatbot is an openai model maybe 4o updated
are there any news about GPT-5?
2026-2027 it might come out
that's is my opinion
but they haven't been getting the results they want from gpt5
that's not really true. Grok3 is still better for science, and on arc-agi-1 it's even beaten by R1. Spatial awareness is not bad though not industry leading so some stuff sonnet could do better too.
But overall I do agree that as an average of everything, it is probably the best currently
openai fell off after pre nerf gpt-4 2023*
gpt 4o downfall of openai
(im not serious)
After some testing I'm not entirely convinced spider is actually a different size than the other models. It feels more verbose / creative, but I think that might be a training quirk rather than a different parameter size. It doesn't feel that much smarter when it comes to some of my programming questions
In many cases it actually did a worse job than phoebe, themis and cybele
I wouldn't be surprised if the ones last week are all the same size, and the ones this week are all the same size as well - since there does seem to be a substantial intelligence upgrade there
There's also the original test models Meta deployed in November, though those were pretty terrible imo
4o is so good right now though
I agree the initial 4o was a downgrade
i only know spider is a meta model
i can't get any more info from it
that was true before their native image generation
true, that image generation is pretty epic
But like imagine if they continued of gpt 4 to make gpt 5 or something instead of 4o
But 4o cut costs for them or something
TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: https://t.co/XKB4XxjREV
we are excited to make this a very, very good model!
__
we are planning to
I don t trust Sama
About that a very very good model
Spit it out @novel flame
Do you think spider is a base model or is a thinking model?
6
10
1
It's a base model
I just did some testing of what I would call "Creative coding" -- as in, "Solve this simple task in as many crazy ways as you can think of. Most distinct solutions wins, GO!". I have actually used a variation of this when interviewing human developers, in order to determine breadth of experience and creativity in a semi-quantifiable way.
I ran the same test on a bunch of the top coding models and found:
- Top tier: The new (March 2025) GPT-4o, Gemini 2.5 Pro
- Closely following: Claude 3.7 Sonnet, Claude 3.7 Sonnet (Thinking), Grok 3 Preview 02-24, DeepSeek R1, Qwen2.5-Max (Thinking)
- A bit further behind: Grok 3, o3-Mini, Grok 2 1212, Amazon Nova Pro 1.0, DeepSeek V3 0324
(Note: I don't currently have access to Grok 3 Thinking)
This is not really a test of deeper coding ability, it's more of a vibe check; which models would I enjoy pair programming with more?
Idk why but: "This is a tremendous model. It's very very good, very good. It is better than China. Anything is better than China (sarcasm). Make open source great again!"
If someone asks me who that's supposed to be, I swear I will jump off a cliff
hitler?
Obviously not
oh
Ok, I'm gonna go jump off a cliff now
IT IS TRUMP
what u use for?
I thought the repetition of "very good" and usage of "tremendous" made it obvious
And the thing about China, he always has something against them
i don't think so
plot twist: he's in japan celebrating april 1st
they are too slow π’
Give it a try again, it should work now with firefox π
works good
some of the models got the text update
Hello guys
Hi Mango! π
Hi! It's your turn to shorten your name!
Only if you get mango pfp :p
New Llama model: venom
[screenshot of venom]
I haven't encountered it yet.
Same. I saw an image in prompt sharing from it: #share-prompts message
could be phantom - i haven't seen it in the arena in over a week. but when it was in there, it performed overall quite strongly (albeit, was very erratic / inconsistent, at least when it initially appeared)
looks like meta is going to make best ChatGPT
it can't be a Google model
ok... care to elaborate?
google models bad at this things
gemini-2.5-pro non-thinking is not a google model; and it's bad at 'this things'..?
i'm incredibly confused / don't think we're talking baout the same thing
Is moonhowler thinking? Phantom was afaik
it's really fast i don't think so
π np (thought something like that was the case :))
bye the way where is gemini-2.5-pro-non-thinking?
i haven't got moonhowler. i don't recall phantom being obviously a thinking model (it was a bit all over the place).. so can't really say; but it was strong, though consistently not as performant as nebula (and i feel was introduced to the arena at around the same time).. hence just speculating it could possibily have been the non-thinking version
and perhaps because it was so erratic in performance is why it got pulled from the arena (seemingly), and 2.5 Pro thinking has already been released.. counter-intuitive (would've thought the release sequence would be the other way round..) but who knows ha
check the post I replied to originally; it links to a reddit thread mentionining it
but not released is the short answer to your question afaik
I recall I got phi 4 vs phantom and both were significantly delayed (so phantom was thinking)
Anyway figuring out if moonhowler is thinking or not will narrow it down a lot I think
ah k that's useful to know
it's not
it's like gemini2.5-flash or lite
Hmm
panda is also a new Meta model.
[image redacted]
I like panda
Another weekly batch it seems. Maybe another parameter size / model type?
Gonna give it my usual questions and see how it does
new model stradale
I found something
models can't get this question
Where is my phone?
very funny
stradale is a very small model i guess
Here's Panda's attempt and uh, nope. It's the thought that counts I guess
24_karat_gold told me it was a Meta model in one round and then that it was trained on Google's servers in another round
So uh, take that as you will
spider also juggled between Meta and OpenAI, but mostly Meta.
roma doesn't seem exceedingly smart.
Yeah I just tested roma, both code comprehension and writing are pretty atrocious
It seems so.
exp router and gemini thinking give 50004
stradale claims to be Llama.
[screenshot removed]
24_karat_gold claims to be Llama, and like spider it's very verbose, totally unhinged (in a good way) and unabashed.
if i hadn't read your comment or seen the model name in the screenshot, i would have said that response is from spider β feels like >90% identical to the style (super playful/light-hearted; lot's of emojis and attempted humour) and verbosity of spider anyway
24_karat_gold is the first model to get my coding question right, it's pretty good
24_karat_gold is llama 4 100b?
The exact size isn't known. I've seen suggestions of 175B parameter size, and it's possible it might be a MoE model, but nobody knows for sure.
Llama 1T ?
i haven't looked into it for a while, but last time i did the conclusion i arrived at was that all input messages are truncated at precisely 12,000 characters (not tokens), regardless of model/s used or whether arena battle or direct chat
but might've changed (or perhaps that conclusion was misplaced to begin with)
i was hoping llama-4 24bn ha
i joke.. but it's not entirely inconceivable.. tbh i find it interesting how little attention gemma 3 27bn has gotten. it's doing incredibly well on the leaderboard* anyway (for such a small model, open weights too)
*without style control in ss
24 karat is funny asf though i canβt lie
Anthropic Upgrades AI Safety Policies
- Real-time Content Filtering System Immediately Intercepts Dangerous Commands
- Asynchronous Monitoring System Conducts In-depth Analysis
- Establishes Rapid Response Process for Jailbreak Attacks
https://www.anthropic.com/rsp-updates
No, man, is it necessary to be this overly secure?