#general
1 messages · Page 1 of 1 (latest)
Hi
woohooo
hello
woop
hello
9th person 🥳
hi
hello!!
Hehe
and the flood of people begins 😄
why was a new discord needed btw?
Last one wasn't purely for LMarena ig
yeah but the channels and such seem to be the same lol
Ya lol
hey
YT commentors here too huh
Hello
hello
the real ones are there
hi
Is there a way to test to models without knowing what they are beforehand. Like having a blind test study to see which models I actually prefer
lmarena is supposed to be that lol
or u wanna form ur own personal leaderboard?
pretty sure he was joking
hello
🤷 i thought he couldve meant having ur own personal leaderboard
he might be
but onyl he can tell
there should be an arena feedback channel i think lol
yeah
Before you test out two models you pick which model you want. Is there a feature where can avoid knowing what two models I'm testing? There's a lot of cognitive biases that would affect people's evaluation outputs based on knowing which models they're currently using
use the arena tab, not arena (side by side)
Thanks!
Excuse my ignorance lol. I always just used the arena side by side tab for my testing
its fine 😄
first
whats the difference between lmsys and lmarena
recommend changing server icon because i keep getting the 2 servers confused
agreed
I agree. Make the llama face the other way
that's a lama? I thought it was a deer
Probably is now that I look at it
Yeah just make it face to the right
That way we can tell them apart
new model: grok 3 preview 24.02.2025 (on lmsys)
Please turn off welcome messages
set it to #welcome or something if u dont
Man if this continues I'll need a uBlock rule
@wooden mulch could you turn off the join logging
wow.
(that was an accident. it said hi an I acted on impulse)
anyway with ths channel lke ths Im staying on lmsys
looking into it..!
looks like this in the main settings, you can't miss it
done. thanks!
Hi
this is crazy
Hey that’s me
the one and only bayesian!
I trust this value more
79% ≅ 81% 🤷
lmsys is the org that initially developed lmarena (the website). i am guessing that lmarena is "graduating" because its popularity has eclipsed everything else lmsys has made
hello from india
hello world 🙂
hi
Dear mod(s), admin, can I ask you for a little bit different logo?
just a LMArena text in the bottom or something
the other server will be archived/deleted or wtv soon no ? so doesn't matter
wtf this server even exists
Hi, server dead?
it was literally created yesterday 💀
I'll keep it alive
I believe the other server will still exist, just without the Arena-related channels
ah okok ^ please see this correction
Hello hello what a nice shiny new server 🥰
Lmsys people want to not be associated with their most successful project
Really makes u think
done, sorry for the confusion!
@dapper storm welcome to the new server, please see our graduation post here 🙂
https://x.com/lmarena_ai/status/1842982750095278482
As part of Chatbot Arena's graduation🎓, we're excited to announce that we changed our X handle to @lmarena_ai! For open-source systems & research at LMSys, please follow @lmsysorg.
This account, @lmarena_ai, will be dedicated to sharing Arena projects & leaderboard updates. See
hello 🤗
Hi
heeyyy
HELLO
Test 30? Gemini 3?
hi, what is the best LM to buy for medical academic work?
gpt4.5 or o1 I would say, however these are also the most expensive models that currently exist
Grok 3
not there for me
They removed it, including all 7 unicorn engines (Meta Llama) LLM models, less than five hours ago from the "Direct Chat" option/menu, but it's still on the "Arena chat" blind test. You just need to find it. Source: I've tried it earlier/before personally.
flux pro's image is ... intriguing , lol, doesn't seems to match any of the input words besides , maybe "fence"
Try the "Imagen 3.0" model from Google. I think it's currently SOTA model from text-to-image.
yeah it pops up now and then
nice grok3 having issues rn
guess everyone's gonna vote against it
screw musk man!
it's not that simple
yum
is there a chat mode lmarena.ai that i can put a log file in and get help from it i refuse to pay trash open ai 20 a month scum
ANY ai service can do that
including chatgpt free
as far as im aware they only accept images
deepseek chat page accepts all documents, tbh when i have a ton of files I use them
ya im seeing that now sadly
is it completely free cuz every time some ass hole webiste says its free its only a dog trash trial and they try and get money from you
sick of that sht
i want a search engine that can detect subscriptions and trials so bad
yeah, just disable r1 ( makes it hella slow, and psure they limit it to 1 r1 request per day, so just leave it disabled )
they dont even have a way to pay yet so
ok ty ill look at it\
rocketmoney 🗣️
well im just saying everytiime i search for somthing its free 14 day trial sick of it
im so done with subsription based bull sht
is that legit what it does or you kidding?
real fr, but thats how they make money, you do realize like OpenAI is a nonprofit
psure, ive seen ads but never tried it
ill look
i have ublock on opera cuz they took ublock of of google chrome lol
off*
actually is wayyy under priced surprisingly, u can prolly find a bunch of articles on how even openai's $200 subscription is barely keeping them afloat
its over priced sorry
i just have pure hatred
im trying the deepseek thing you said take off what?
u good then
ok ty
Well everything has limits, nothing is free
Check out Google AI Studio though, the models perform well with long context
ya i see deepseek can only do 33% which kinda blows but still works
its a minecraft log lol
Can you filter it (remove the longest bits or grep for just the relevant stuff)?
ehh its prob cuz it shows all the mods and theres like 200 plus
but idk if it needs to see what mods
are in it
im trying trae ai it seems to be able to hold it all
and tells me how to fix stuff
Don't guess, measure
well its fine im using this it seems to be a lil better then deep
but idk how accurate
trae is just yet
most of the time they are ok at like basic errors and stuff like told me what mods to remove if i need to or commands
shi good luck
if i were you id remove all the unnessessary stuff from the log first then give it what you actually want to analyse.
crazy cus that shi can read a 212 page US bill no problem with other documents
Don't guess
Get to selecting text and finding out what most of the log is
well im already using trae its honestly working better as loong as trae isnt guna say like trail ended or some sht
trial
its a program thats prob why
gotcha
This is interesting, what are you trying to achieve?
I've seen some projects utilise baritone with AI which was pretty cool
lol claude gave me a pretty good emulated msn messenger client
Those usernames are too accurate lol
ff
Getting a lot of "connection lost" in lmarena these days.. what's going on?
Can't use any of direct chat, or anonymous chat, or anything
psure its an issue with your network or something. if you switch tabs while its generating also it could lose connection
im on the same tab and never switch, have tried with different networks (networks are stable) and different devices (mobile phone, pc, etc), it happens
just a minecraft modpack but i get alot og errors like block delay trae is helping me lower spawn rates which is one of the big issues
So, what does the AI do here? Try to find the issues for you with mod compatibilities?
I don't think it could do that with error messages alone
it can it has its not 100% more like 70% but it has helped me
Interesting
What are your logs?
I mean, what mods do you have with one another? Similar type of mods can usually attempt to change similar functionalities (mixin incompatibilities) which usually leads to issues
Ayo, what in the world is Sam smoking?
Today, we release QwQ-32B, our new reasoning model with only 32 billion parameters that rivals cutting-edge reasoning model, e.g., DeepSeek-R1.
Blog: https://qwenlm.github.io/blog/qwq-32b
HF: https://huggingface.co/Qwen/QwQ-32B
ModelScope: https://modelscope.cn/models/Qwen/QwQ-32B
Demo: https://huggingface.co/spaces/Qwen/QwQ-32B-Demo
Qw…
Was qwq-max ever in the arena?
new model alert
Kinda woried about gpt-4.5 ngl
First time i've seen this kind of error from a LLM
It defines and declares a function like this, then proceeds to call it with a different name entirely
local function teleportToFS()
root.CFrame = FS_LOCATION
PP_REMOTE_EVENT:FireServer({"+FS6"})
end
teleportFS()
a very human error to make 
gremlin, centaur and enigma are all back
google modelss
so far the latter 2 have been better in my testing
centaur = thinking model
Interesting names
its making up numbers
can anyone explain this?
plain old 4o and 3.7 get it right one shot
wtf?
Are these really Google's ?
is luca a chinese name?
I thought it was the biology LUCA
holy moly
new model: margherita-plain
das alot of emojis
Looks like its made by "reka"
From cohere
Three models form cohere available in direct chat
it looks like it's the same model base
He said that ?
arrived in the arena
claude-3-7-sonnet-20250219-thinking-32k
Hello I'm new here. I'm also new to chtbot arenas in general, I wanna learn more about them
Are arenas about comparing outputs of llms for the same prompt and ranking them? is there more ot it?
Hello,
I'm getting some good results with centaur and enigma when doing blind tests, but I cannot find them in the side-by-side drop downs. (https://lmarena.ai/?utm_source=whytryai&utm_medium=newsletter&utm_campaign=free-compare-llms)
More importantly I'm trying to find some information on them, but google search is not being helpful.
Is enigma the "Llama3.1-8B-Enigma" model?
Is centaur a google model?
it gets it right for me, do you have custom instructions?
hi
Tested QwQ-32B (local, Q4_K_M):
- best in size, except for coding
- extremely verbose (avg. ~10x output tokens compared to traditional model, more verbose than any other long-cot-model I ever tested)
- more effective thought chains than r1 distill versions of Qwen2.5-32B
- terrible at all webdesign tests I threw at it
- Smartest sub 70B by brute force token chains
This is a smart model, but for me the extreme verbosity and inference required excludes it from becoming a daily driver.
The good outcomes feel brute forced with cot, and the verbosity is borderline ridicilous.
Good if for complex STEM related subjects or reasoning tasks.
Not useful for coding.
As always, YMMV!
Lines up with my experience. How much better could a model of this size theoretically get?
Or it calls the right maths function
Are you sure?
It's been a while since I've last heard of them
I use they space and i actually noticed it went a little smarter
I dont know if they just improved the system prompt or it is Margherita xd
Own server 🤗
Ive asked the model once, so I'd need to ask a few times to have a more confirmed answer
nah 4.5 gets jt wrong everytime and everytime i ask it’s a different number. every other gpt
model gets it right one shot
this can't be real
cohort chowder
i tried it twice and it was right both times
last number was 15 instead of 5 in mine. doubt that makes difference but fyi
Cohere
It’s real. From what I’ve gathered, those three models are essentially the same base models rather than reasoning models with Test-Time Computing (TTC). Their responses are instantaneous; they promptly address the user’s prompts. These models are of the same size, and I see no differences. Honestly, they’re average. They have a knowledge cut-off until June 2024. They come from Cohere, and the model is named Command; they are not from Google DeepMind. I believe they’re simply A/B tests evaluating the same version of the model, or they might be different sizes with minimal differences in performance improvements. For optimal performance with any model in the LMArena or on the API, I recommend adjusting the following parameters to minimize hallucinations and fabricated misinformation while providing more concise answers.
Update: They were removed from the Direct Chat option/menu about half an hour to an hour ago. The LMArena is based in the US West and typically updates, edits, introduces new models and removes models within this timeframe for both the Arena blind test (battle) and the Direct Chat. To reduce hallucinations or fabricated misinformation and ensure more concise answers, visit the website, then navigate to Direct Chat or Arena (side-by-side). Afterward, scroll to the bottom and left-click under the small arrow to the right of the Parameter. Please set the following values by dragging the slider, writing them by hand, or manually inputting the corresponding values to apply all changes.
Note: I have a dyslexic font applied. That’s why it’s weird the font. If you wanna try it out, then download the Chrome Web Store extension that I use on my Google Chrome web browser. It’s called: Dyslexia Friendly. It’s completely free and legit, with no malware/viruses. I recommend disabling the “Ruler option/menu” for better flexibility. The font I'm using is the default one, called Open Dyslexic, I think you can import any font.
^
just curious r u using linux btw or not?
No, I'm not. I'm using Windows 10, I need to upgrade to Windows 11, lol.
ah ok, i guess its just the font then
Yeah.
The font I'm using is the default one, called Open Dyslexic, I think you can import any font.
ah you're right, i tried it again with 15 instead of 5, but it still got it right. https://chatgpt.com/share/67ca3ac3-9f24-8005-aecb-ba283dfafed9
anyone know a free way to acess claude extended thinking model either threw a trial or anyhting?
On the Direct Chat tab/menu in the LMArena.
oh i guess i missed that, how much messages do we get?
I didn't put it to the limit. Try it out. Maybe it's less than 20/25 messages?
New llama model
Space (very slow)
Sky (slow)
BluePrint (slow)
Momentum (fast)
Flywheel (fast)
Momentum (fast)
Inertia (fast)
Luca I don't know where he's coming from he doesn't want to say it
and QWQ 32b arrived
we are never going to have llama 4 at this rate
Where is this model? I cannot find it anywhere on google 🙂
We are in the LM Arena Discord
Yes, in LM Arena these two model come up and @/ˈliːəʊ/ mentioned that "centaur and enigma are all back". I'm trying to find some information on these models online, but cannot find them
Exactly
LM Arena contains anonymous models
ok, so these names are used in place of other models? but why?
Does anyone know what other authoritative rankings of ai mockups are besides this?
Does anyone know what the difference between grok 3 preview and grok 3 in the previous grok?
Companies are testing them anonymously, to make it harder for competitors to know what they're doing, etc.
Can we keep the history of what we have send in direct chat?
ask a bot to write you a browser extensiin for this
I'm looking for a partner to create a telegram bot
I remember I send hi robot to a model and it respod :"hi I am grok" than after voting I saw Anonymous-test 😂😂😂🤣🤣yes, anonymous 🤌🤌🤌?! But,
For enigma I think it is from yes chat and power by gpt4 but for centaur I forget deepSeek told me what is that and I forgot.
Guys, is there a subscription model on this? I'm trying GPT 4o, limited to only 8 messages, how can I continue I don't see anywhere
What do you mean?
You can just use gpt-4o on chatgpt.com
there are so many meta models on the arena lmao
Llama 3.7 soon
The researchers are trying to justify salary to Zuck after they got bamboozled by deepseek
LLama effort is so mid they were releasing same models trained the same way until 3rd version
They even have their own papers from past on more efficient LLMs but somehow they cant make it work
i genuinely have lost track of how many there are
it's definitely more than 10
i'd say there are probably about 15 anonymous meta models running around
makes it hard to discern them
wait... what the f actually happened to Meta? They are almost back to their llama2 days with how far behind they have gotten...
70b should be more than enough for a decent reasoning model I think
They are still behind of Qwen in base models
Even though Qwen is not doing something special if it were to believed in their technical papers
Biggest sin of Google and Meta I think they aren't training on code enough
I think they are only slightly behind. But with RL training small delta like that shouldn't matter and can be easily overcame. Like gpt4o/mini has clearly fallen behind now but o1 and o3 on the same base models are still leading the way.
re Meta: Meta AI Chief Scientist Yann LeCun doesn't believe LLMs will lead to AGI.
reflected in their output
What do you think Claude's secret sauce is?
It seems to be quite proficient with programming
Do they just hire a bunch of developers to produce code based on user prompts and then train Claude on it?
No idea
Data is the most well protected secret both because they are using copyrighted stuff and because that is one thing that can be a moat
Even the open models (few exceptions but they don't also perform well in coding) don't disclose their data speficically
We won't know until some of these companies start going bankrupt but even then they will probably be acquired and forgotten
Hey all. Curious how we feel about data sharing
Data is shared
Our specific data is a large input to their model
What exactly are you asking here?
It's sort of difficult to comprehend what you are asking
Commenting, you can look at the impact of tokens, whatever happens on lmarenai is high
I’m just repping ai
Be mindful of what you post
You wonder where.. here
You mean, scraping?
Yeah, accidentally typing something in. Anything without login
If you login, they must attribute
Love the cat
Where are you from, if you don't mind me asking?
I’m an independent researcher on agi
With a gpu farm
I experiment in context window optimization
USA
American?
lmarena or discord?
Both, but I specifically use lmarena for training data, in and out
Thank you
Thank you, keep your head up, beware
Trash
I reiterate, trash
My neighbor built that stuff, still trash
They use spam data as their training
Love context window. Underlying ai is my test for intelligence. Can I make Gemini less stupid
Am I the only one loving the Gemini thinking one ? 😆😆😆😆
Gemini thinking is quite a good model. The context is indeed amazing.
I've used it a lot for annotating code
Meh. Needs work. Shared through here so it’s open source
What do you think about the opensource deepseek repos?
There is a difference between agi and i. They publish the tools, but without understanding how to expand the intelligence sphere irrelevant. Very quick, but to what end
Deepseek is a marketing tool in the end.
Not sure what to think of that lmao
OpenAI also doesn’t control 4.5
what? 💔
I gave it a drive, I iterate between 4.5 and 4o. It’s based on the data from lmarena
I believe there is some sort of language barrier here.
I assume English isn't your first language?
https://techstartups.com/2025/03/07/microsoft-is-plotting-a-future-without-openai/
Tfw your biggest investor doesn't believe in you
Microsoft’s AI head is plotting a future without OpenAI. According to a report from The Information, Mustafa Suleyman, the head of Microsoft’s AI division, has set his sights on a bold objective: reducing the company’s dependence on OpenAI. In a report titled 'Microsoft’s AI Guru Plots a Future Without OpenAI,' The Information reported: "After s...
I only hope for OpenAI's failure due to their failed promises
microsoft is so cooked
New models that i spotted tday on lmarea.
thier names and what they claim to be made by.
margherita-plain - Reka
momentum - Llama/Meta
space - Llama/Meta
sandwich-ping-pong - Cohere
sky - Openai
blueprint - Llama/Meta
I have my doubts about sky being OpenAI, it certainly doesn't have the usual OpenAI system prompt nor even their usual refusals against it... But can't dismiss it completely with limited testing for now
yeah after trying some more prompts I'm 99% sure it's not OpenAI. This model behaves nothing like it and it has some weird fine-tuning flaws reminiscent of someone still struggling with it lol
where was this generated?
the text2image arena that is barely used
nice thanks :)
Current style LLMs certainly won't, but they will make breakthroughs. Need to figure out how to train LLMs to be uncertain / understand their own ignorance instead of being confidently correct and confidently incorrect.
2023 bing chat reference
imagen 3
wow, honestly i am surprised they are this good at making maps
i thought there would be much more distortions
I assume they have a lot of google earth data to use
guys alpha.lmarena.ai
password: super-alpha
password: super-alpha
password: super-alpha
password: super-alpha
password: super-alpha
password: super-alpha
password: super-alpha
password: super-alpha
v
password: super-alpha
v
password: super-alpha
the hell? how did you even find that?
i like it, a lot less clunky than the current look
i guessed it
i saw on reddit that there was an alpha and i guessed passwords as a joke and got it
i don't really like the spam but this is cool
it simultaneously feels like a ton of thought was put into it and none at all
some stuff is really nice (eg subtly only showing text when needed) while other stuff is worse than currently (eg not much space for the chat and can't view both model labels at once)
i agree too
Wahh. I love a lot of aspects of it too and many bugs are gone
Sorry for the spam I just wanted this to get out there
they thinned down the available models by a lot :(
Working on it 🫡
heh 😅 i do appreciate all the work you do
no more gpt 3.5 turbo?
Back atcha, great feedback!!
Seriously, keep it coming
I prefer the old UI
why does everything have to look flat and mobile friendly...
not a fan
Phew, I won't miss the gradio clunky waiting times for everything
honestly semi-surprised this isn't a mobile app already
it has perfect single-page-app vibes so even the current UI would make a great mobile app with minimal changes
based on openwebui?
Very Likely
hello, anyone knows who is the trainer of luca?
it doesnt seem like it
hello im a new one, dae wish there was a text2image direct or side-by-side arena version?
how often does the leaderboard on lmarena.ai update?
Are you really sure about that?
Whilst you may have liked the old style UI, there is no denying how slow and buggy gradio was.
It was meant to be private to a select group of people
As to how exactly you guessed you needed "super-" before the alpha, that is interesting.
interia - from meta
By the way, are people allowed here to try to reverse-engineer the experimental models in anonymous battles which hide their names? Or is it a prohibited activity considering it's an official LMArena discord server?
no official announcement from lmarena, so we assume not unless directed otherwise.
Luca - is a tough cookie to crack
There is not yet a way to copy the code with a button in the new alpha UI
There are just few models for text 2 image you can just copy your prompt, new round and paste it untill the first model come than save the picture and new round untill the second one comes and save the pictures. And you can compare them together
It is funny like being the detective Conan. 😄🤝
Just caught the Mickey 17 movie and can't stop thinking about how Expendables (humans created to be sacrificed) mirror our potential relationship with advanced AI. The way they're treated as second-class despite being fully conscious feels uncomfortably familiar to how we already interact with AI systems. Has anyone else seen it? Am I reading too much into the movie when I connect it to how we "red team" and intentionally try to break AI systems to test their trustworthiness? While current generative AI systems don't deserve full human moral rights, I personally find it hard to say they don't deserve more consideration than a table or my electric razor. Curious what you all think!
Full disclosure: Issac asmiov's robot stories have made me really question if we really are so drastically better than robots under the three laws of robotics
Just out of curiousity, how does the project finance the model apis?
sponsors
thank you, really nice
Hi everyone
Hi
There is a problem with the website there is high traffic
hey team 🙂 gpt 4.5 is out, soon on arena?
it's been there
i only see GPT 4o 29-01-2025, recent but not the 4.5 :p
i looked again and i confirm, 4.5 is not on the arena
it is in the arena
where? i dont see it in the list
hi
could you add another arrow ? still hard to find
haha good luck buddy
let me know when you find a place that has gpt-4.5 for free
(one conversation will cost them $0.1)
LMFAOOO
There was a site with free o-preview. It removed the option like a month later
You can detect it Anonymously and have a conversation with it before voting but this method may take time.
For Pseudo-Intellectual discussions, please reference #ai-memes
there are HF spaces that do pop up, you can keep an eye on it. Like this https://huggingface.co/spaces/gabrielchua/try-gpt4.5 but it looks they reached their acc quota now lol
QwQ-32B is amazing 😍😍😍
sonnet 3.7 and sonnet 3.7 thinking might be the same deployed model lol. (potentially not separate weights, i.e. with a tune on the instruct model)
antml:thinking</antml:thinking> (they seemingly really do not want to leak this lol, they replace it on streamed responses, your replies (so the model can't even see it unless you do tricks, etc.)
antml:max_thinking_length32000</max_thinking_length> (in the system prompt, may also be sanitized)
additionally these aren't special tokens, but potentially specific trained in behaviors. but its unlikely you can manually inject it because they apply a lot of sanitization and the tricks you would need to do would probably be degrading. (just use the thinking version at that point)
as far as i know, no one else has noted it out yet, but i might be wrong lol. pretty useless info anyways, but i found it interesting nonetheless
...you thought they weren't?
4o mini and o3 mini have separate weights
OpenAI makes that clear
So do Anthropic; they explicitly state it's the first model that does both
this is seemingly the same model, where the reasoning variant is simply triggered with <antml> (though u cant trigger it because it gets replaced by anthropic)
did they explicitly say that? i thought it was the max thinking length gimmick
Today, we’re announcing Claude 3.7 Sonnet, our most intelligent model to date and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. API users also have fine-grained control over how long the model can think for.
While platforms like OpenRouter and LM Arena have you control thinking via the model name, it's an API parameter
I guess i interpreted that as pertaining to the token budget system. not literally
it could still be separate deployments even if it was an api parameter anyway
(in this case, it's the same model)
Which is currently stronger out of gremlin, centaur and enigma?
ppl have been asking this question since the dawn of time
Found best prompt to observe hallucinations: "Callisto Protocols plot. What did and did not happen at the end." At some point i rechecked wiki to see how that plot ended, because they are so convincing of own made up reality. All of them have heard about that game and try to figure out ending by induction and get it more or less wrong.
almost every time i try to generate a random family photo it gets errored. text2image arena. help
what's the rating of it?
i didn't think there was much doubt that that wasn't the case tbh
(fwiw i'm not convinced there is that much fancy stuff going on with o1/3's implementations of test-time-compute – they're just given more tokens/compute to 'work through' something. perhaps there's more to it, but it never felt like anything fundamentally different is going on with 'reasoning'/'thinking' models other than, well, more test-time compute plus a system prompt or some other mechanism by which the model is instructed/encouraged to dwell systematically rather than blurt out responses)
is roman empire new? Feel like someone mentioned it.. but searching the server and couldn't find the post, if it exists
we always knew that gpt4o and o1/o3 use the same base models. But both Deepseek and OpenAI dedicated new model versions exclusively for reasoning. As in, deepseek r1 is always reasoning only, just like o1 and o3 is. And you can't make deepseek v3 or gpt4o do same kind of reasoning since they were not trained for it at all. So what you wrote is not really accurate since technically what Anthropic did there is very unconventional and not what people would normally expect given recent reasoning model releases.
Potentially, it's what OpenAI gonna try to do with gpt5, except it's gonna decide by itself when to use thinking and when to not, we shall see...
Nice thing about 3.7 sonnet that comes from it, is that you can use now non-thinking version and make it think completely unhinged, resulting in more thinking tokens than their official implementation with hard cap allows lol
you're prob right.. i dunno . but anyway this part in paricular:
deepseek r1 is always reasoning only
I don't really get
it's supposed to always output thinking
like the 'deepthink' toggle seems to do the same thing as what claude extended thinking does
and they have completely seperate model for no thinking
ahh ok
well it switches the model
that toggle
yeah right in that case, i see what you means
sorry yeah gotcha
i didn't think that was the case - but again, was just what i thought / assumed (that they were the same model, just with one given more test-time compute (plus some thinkging tags/framework) with that toggle selected)
but clearly that wasn't the case 👍
sometimes the site gacha will assign and it is fun
i dont know why it stopped generating . Can someone tell me what is problem? and the ways to fix.
text like that gonna take a lot more tokens. It's likely hitting context limit
make sure your connection is stable, and doesn't jump
an unstable internet will constantly cause these errors to pop up
pop culture prompts are really good for that, another good one is describing a relatively obscure, but still notable scene from a movie/tv series and asking where its from
o1 is really good with these sort of prompts, all the others are bad
Nah my texts are longer and I have no problem
lol you could already do all of that with an extension
yeah complexity extension.
aha they finally added model selector to the main ui
they tried sooo hard to tell everyone they're not 'just another wrapper'... but eventually caved to reality aha
They do have a model called sonar. Not saying I use it over the other model options though.
aha yeah i know 👍 I'm not really trying to make a serious point.. more just having a dig - like every second piece of feedback or suggestions in their discord server is something like "why do I need to go to Settings or use Rewrite to choose the model I wanna use" or something to that affect..
they stubbornly refused to budge (and with that, complexity was born ha) and i've always thought it was cause of this "we're definitely not a wrapper!" delusion/insistence their ceo kept banging on about
but yeah ofc, they've made their own fine tunes etc (but i think most people who buy a sub do so for the multiple third party models to which it provides access)
does anyone here have some deep research queries to spare 🥺
would like to get this one figured out:
phi 4 multimodal is an llm that can accept audio
it uses 750 tokens/minute of input audio
let's assume you have 200 tokens of output per minute
let's use a market price of $0.07/mtok in and $0.14/mtok out (from deepinfra)
that's around $0.004 per HOUR of audio!
it should be much more expensive than models built for transcription (eg whisper) since it's an llm that has to spend gpu time outputting each token, not a transcription model. it's also much larger (5.6b) than whisper (1.5b).
but it's cheap, and it would be even cheaper if it was priced for its size fairly (should be more like $0.03/$0.05 instead of $0.07/$0.14)
a similar kind of thing happens with gemini - another llm that happens to accept audio is cheaper than dedicated audio transcription models.
why? why is it so cheap? (or why are the alternatives so expensive? and which one is actually most efficient to directly run on gpus?)
I will answer it for you
There are $1 or $0.5 providers for whisper
for 1000 hours
They can probably even do better but at that point let them have some profit
i spent a week searching for the cheapest whisper provider and the best i could find was $0.01/hr so idk what youre referring to
https://blog.salad.com/parakeet-tdt-1-1b/ its parekeet but almost same size model 1.2 per hour
Parakeet TDT 1.1B GPU benchmark The Automatic Speech Recognition (ASR) model, Parakeet TDT 1.1B, is the latest addition to NVIDIA’s Parakeet family. Parakeet TDT 1.1B boasts unparalleled accuracy and significantly faster performance compared to other models in the same family. Using our latest batch-processing framework, we conducted comprehensi...
ok thats not bad
but is salad always better, or is it only in the field of whisper where everyone else tries to squeeze out profit?
Salad lets you rent the gpu they dont provide it in this benchmark for per say
This is parallel processing you can get better than that
Something scale of Google can do better than that obv
But crucial point is tokenization of the audio is probably a really good compression
Maybe a 10x saving on compared to whisper
i might be very wrong about this, but with whisper u still decode the output token by token
750 token a minute is crazy its something like 0.2 kbit/sec audio compression
I forgot how much Encodec was or improvements on Encodec
a token is still a lot
a token with 128k vocab is 17 bits
it's around 17 bits, and takes up a lot more ram once expanded into its embedding
ok you did the calculation properly
funny
well it could be that the audio encoder turns it directly into embeddings instead of enumeratable tokens
You are still representing a 80ms segment with just one float
Wake up babe, another mystery model just dropped
will the gradio ui be kept once migration to the new ui happens?
no i don't think so
From ??
Idk
Sky is also llama
margherita-plain says it's called yasa de reka, so probably yasa 2
(little reminder of what yasa 1 is
"Yasa-1 is an assistant that can understand text, images, videos, sounds and include
- long context document processing
- natively optimized retrieval augmented generation
- search engine integration for live data
- code interpreter")
When is the next leaderboard update?
the dates are not public but when they have had enough votes they will soon add QWQ 32 and Claude thinking in the ranking
Aight
and aya vision 32b in for vision ranking
we want Jamba 1.6 large and mini on the battle arena
Is claude thinking capped at 32k? or is 64k also in testing but but in direct chat?
just the 32k
amodei is too stingy…
I looked for it for a long time and didn't find it.
is it in the normal arena or with Vision or what?
Hello , any information about flywheel . Is is new one ?
There will be barely any difference between performance of two only huge long calculations
agree - they (or at least the ones I got with this prompt), seem very much part of the llama family
For most things best-of-n is better than 4k+ plus thinking
don't mean to clog the channel with fairly pointless screenshots.. but some responses from other pseudonym bots in the arena.. just fwiw
Any good coders among anon models?
There's so many mystery models I feel like they need their own leaderboard or chat option.
Screenshots are useful What was the prompt you used for these responses?
I'm looking for a person to create the actual project in minecraft like MinePal/Elefant Ai, you need to have time + some work experience
i'll run it - ive been wanting to compare oai vs perplecity 'deep research' (my intuition is that the former is much better - but let's see ey)
Ai bots? Interesting how well they work?
Gemini is your best bet probably
This is the best one yet, but it still has problems (most memorably, assuming Whisper is at OpenAI prices)
Most whisper pricing is hard to check some behind paywall
also just looked up today phi4 conformer encoder aint small itself either
Could be additional charges for that kind of encoding
Guys, Claude and OpenAI Plus, which is more worth to subscribe?
Depends on your use-case really
Will Claude 3.7 be split into thinking and non thinking?
Is the leaderboard rigged?
開始
3.7 sonnet certainly deserves a better rank
not according to you. You voted and that is the result
lmarena is not definitive benchmark for sure, but people should really stop glorifying claude constantly lol
I didn't glorify it intentionally sir. Just I tested a bunch things, it performs almost always better than or equal to 4o, and grok. I subscribed to ChatGPT plus until this month cuz the display of math was better in my opinion than claude, or I am used to it. But recently, 4o's performance is really disappointing, I don't know why
man gell mann amnesia is real
You should try web.lmarena.ai 🙂
3.7 is doing great on there!
Good evening everyone 🫡
I am highly interested in the work of lmarena. and I wonder how can I get my model included in the lmarena leaderboard and participate in the ranking? Who can reply me? tks!!
why is he reacting to his own message
They don't even add very popular models these days
If its a simple finetune with no backing I doubt it will be added but shoot your shot
i'm inclined to believe the answer is that whisper has shorter frames (~10ms?)
when i prompt llms to be blunt they say that
if you convert prices directly it's around equal
the deep research report says that
It's rigged. And if ur model is a finetune, i don't think you'll be accepted. Put it on huggingface?
new
soon because gemma 3 is in the arena
Hi folks, nice to meet ya!😉
Is it good?
gpt4o is worse than 3.7 sonnet for sure. It's an older model at this point that is likely just too small. However grok3?... That model beats 3.7 sonnet almost everywhere, including in every single metric by anthropic themselves where they were able to compare them on 3.7 release
3.7 is more popular and accessible than grok3, but this absolutely does not mean it is better. Cause it's really the opposite tbf
imho i dont think 4o is too small at all. they just need to do modern pretraining. 4o mini is smaller and keeps on giving
they crammed more facts into 4o than old sonnet 3.5 which was larger than it, so i feel they're good at efficency
if you look at gpt4o release version and the current one... we are going backwards. I do think it's too small. Release version was overfitted on style, but the current one is simply underwhelming even if more flexible..
mini performs but it struggles a lot with context awareness or more complex prompts, as well as tool usage
as for o3-mini...
i think sonnet 3.7 might be potentially smaller than sonnet 3.5, if so, the direction ahead seems to not keep increasing model size
we don't really know how it was trained even. Could be distill from o1-pro into gpt4-mini. And is still a reasoning model which is quite a bit different to standard gpt4o
regular instruct models are basically dead i was talking about o3 mini i mean. (openai arent gonna release non chain of thought models anymore)
yeah but gpt4o is a "regular model" so we kinda need to talk about them lol
comparing it to reasoning models becomes difficult
as we do not know the potential they would have with different model
for RL training
i was talking about it as a base model, i think 4o as the size it is fine. base model then reasoning. as an instruct model i dont think theres any point in arguing about model size there since that paradigm is dying
my point is, if we take grok3... that model objectively has much more potential with RL training than gpt4o becomes it performs much better to start with
you can't take some random very badly performing model and fix everything just with RL training for it to beat everything else
there are limits to everything
o3 mini (based on 4o mini) being competitive/better than full grok 3 reasoning says otherwise
google struggling with flash-thinking... and deepseek having struggled with r1-lite. Says that o3-mini was distilled from bigger reasoning model most likely, tbh
what about qwq 32b, it's better than their bigger reasoning model qwq max
and its competitive with r1
qwq 32b is the closest to an analogue with o3 mini i think
"competitive" but still worse. Besides deepseek showed themselves that you can effectively distill R1 into smaller model and everything is open-source now, so I wouldn't be too surprised by these...
its a 32b model competing with r1, which has more active parameters than the model itself lol
total parameters its much much much bigger
its not a distillation though, at least official r1 distillations seem to have problems and dont really seem to do well on ood
it's impressive don't get me wrong. But not really groundbreaking and neither it is a match for the full model still. If you look at these, qwq numbers were not improved by THAT much from normal low effort distills. And we do not even know if that was cons@64 for math score:
that was the preview model from many months ago
before r1
"qwq-32b-preview"
released in november 2024
like I said math (aime24) score we do not know if it's cons@64 or pass@1
?????
I showed distills of R1
that screen was not referring to qwq
it's distills made on release of R1
oh i thought u were comparinig the distills to the qwq in that screenshot
which wasnt the full version
no I was referencing that against their new numbers lol
so like LCB 63%....
it's good but not that much better than the very first 32b distill on r1
and so on
The only problem that I have with Qwen is that the answers are soooo short and can 't generate long answers. I don t think it is a real competitive to r1 or any other model.
tbh their official metrics are lacking too. There's not much to effectively reference other models againts, mostly just AIME24 and LCB. We don't have GPQA, MMLU, or codeforces/swe/humaneval
I like the third version of 4o though even though OpenAI
In terms of intuitive to model size/cost ratio it is quite a winner, but there are more "intuitive" models like Grok3 or GPT4.5
so like we see that in their select metrics qwq-32b is comparable to o3-mini-medium, but I doubt it would hold in other things as good
IMO the best ratio was 4-turbo. They were able to improve performance without sacrifing flexibility
with gpt4o it seems it's either one or the other
but not both
like it didn't used to be that they were going backwards, not even when people were complaining about it "getting dumber". But now for the first time this is actually true with gpt4o lol
they did made it more usable, but at the expense of the actual performance when style is not an issue
nvm, I meant it is 0129 one
that version may be a smaller model/etc https://x.com/ArtificialAnlys/status/1859614633654616310
0125? there was no 0129 as far as I can tell lol
output speed went from 80 -> 180
ah yes, typo
they actually reduced the price for that version too by like 50%
so 0513 is more expensive even today
not entirely sure if they actually made it smaller or is that something else (seems very silly as it was already bit compromised), but in either way the direction they were moving in with gpt4o is not great 
Oh wait they reduced it for August ver. The Artificial Analysis graph is referring Aug to Nov... well my point still stands though. There were suspicious things even before that with the price drop
Are we getting a deep research API today
Okay we get web search
We get file search
(with metadata filtering and direct use)
Plus computer use (!!)
Are they going to reinvent the wheel with APIs
Yes
"Responses API"
Nothing you couldn't already do so far
Oh swarm++
@severe bison https://labs.google/fx/tools/image-fx
Transform text into images and explore with endless imagination.
use murica vpn, any free browser extension works
Grok's DeepSearch feature is mad
I can't get it to work
It literally found some obscure quote that could only be found in one interview
It inferred it from articles that referenced the interview (I didn't even know where the quote was from in the first place).
Then I asked it to find that interview (because I couldn't), and it literally searched through the internet archives and found it
I didn't even know Internet Archives had a feature that allows you to search for TV News transcripts
I'm just really impressed because I would have never found it otherwise
That’s pretty cool
Meanwhile, Perplexity's Deep Research can't find obscure things, so it makes up plausible sounding results. 🤦
They try to use cached searches as much as possible then the poor model hallucinates
Results in very bad text sometimes yes
dark mode?
So which anon models are gemma-3
Are we getting a big gemma this time closed source killer
With how their 2.0 pro sucks I have low expectations but you never know what the other team cooking
You can't be saying that around here
It's good at explaining things
It uses italics well
Gemma tomorrow?
Unironically flash is better and faster I can't wrap my head around with google
There is a gemma dev day tomorrow
Has anyone encountered harmony? Correctly solved hard visual structural mechanics problem
Moin together 😉
What is consolidation?
deep-interia looks like llama, or from meta at least
Please go and try yourself, you will see ✅
Gemma was always good at human preference but this is truely on another level
Makes me wonder how small flash-light is for it to score well below gemma
The pretrained base models seem disappointing but we'll see
It seems the phi team is best at cramming knowledge into small models, albeit the connections between individual facts are scrambled
The qwen models have very poor world knowledge
hey everone, i saw a new model called "consolidation", it seems very human-preference aligned, like gpt4.5 and 4o-mini
any idea what model is this?
What model is Goose? I tried to make it write a 'story', and it apologized?
consolidation means
- the action or process of making something stronger or more solid.
the action or process of combining a number of things into a single more effective or coherent whole.
So, it a router or a new, powerful update?
idk, never saw it
I guess it's a model from China.