#general
yes! Some people speculate that it is DeepSeek V4.
hi
If "Goose" is v4, then there's no point in looking forward to Deepseek. Because I enjoy using AI for role-playing (RP), I have tried using "Goose" for RP, but this model isn't very good.
why can't i register in the vscode extension??
i'm connected with a random user name, i can't disconnect?
I don't think artificial intelligence should be used like an encyclopedia.
When selecting the comparison effect of different models, uploading files is not supported.
Hi, is Qwen 2.5 Turbo a new model? Will we get it on arena?
What is "Jerky" model?
WhaT?
Are there any new models in the last few days? "deep-inertia" "goose" "consolidation" "jerky" ?
the price reduction was made on 08-06 not 11-20 and this one is not smaller, and the default model that is used when using "GPT 4o" is 08-06
when does manus go on that chatbot arena score
It's not comparable to a regular model
@ocean vortex 08-06 is not worse than 05-13 no ?
hard to say definitively, it depends what you do with it. But the older one had a higher lmarena eval and considerably better reasoning over paragraphs (DROP)
marginally better at math and coding too, slightly worse at science and general knowledge
I think they dropped the price for a reason though, those benchmarks would have been possible even if the model size was reduced (hypothetically), just look at 2.0 Flash...
maybe it's just different quantization etc, but something likely changed for them to drop the price only from that 1 specific version.
My weird eval: ask what is the classical music I would have liked based on my existing taste
DeepSeek R1 seems to be better than Grok 3 by introducing non-generic choices
who all remembers vicuna
Because it uses those models to get its answers?
Gemma 3 is yet more proof that
1. On device models are going to be the norm
2. Your toaster will soon be smarter than you
Get ready
Sentient toasters yay
which model is best for rp so far?
Why is meta testing a lot of chatbots?
🤷
New model
"hunyuan-turbo-0110"
And
Hunyuan-turbos-20250226
Currently, the most powerful model for RP (Role Playing) is the Claude series, but it is quite expensive. Moreover, its computing power decreases over time (manifesting as a decrease in intelligence), and security reviews are becoming increasingly strict. If it involves pornographic content, sorry, but that's a no-go.
If you bypass security scrutiny through jailbreaking methods, you may temporarily maintain access. However, upon receiving a "Yellow Label" designation, you will be subjected to rigorous internal security investigations and mandatory external audits.......
I don't get this hype tbh. It's good but hardly more than could have been expected all things considered 
I think hype is primarily due to size and efficiency
yeah but if you look at previous models this size they are hardly miles behind. Also qwen-32b is meant for coding
so really qwen could have been even better if they did it as a general purpose model
Flash is gonna be cheaper no matter what
That's google's goal: trying to get open-source goodwill without competing with themselves
GOtchu. What about best local one?
I don't get why there are some good models that outperform Zephyr, Llama, etc, such as Specter that I saw a few times in the arena, aren't in the leaderboard, and why is the leaderboard only showing a few of the models that we see in the Arena?
Sometimes seeing ChatGPT lose against a model that isn't in the leaderboard is kinda frustrating, because you don't know how low this model is and how terrible (or new?) the model is...
usually it's kept private for inside benchmarks
like when they tested "gpt-2"
(that was 4o)
before releasing it
Yeah but Specter does exist, it's not a made up name. https://huggingface.co/allenai/specter2
Not in the ranking
so it leads back to this
Yeah I know, just had a "cohort chowder" for instance, totally made up name, but specter? If it's a made up name, it's pretty lame that it is actually a real model, it hyped me up because the few times I had specter it was really decent
ok got it
It is a private model. Like centaur or gremlin. They are private.
you know you're down bad when you are comparing yourself to gpt4o-mini lol
?
that is the whole point of the tiny models. And tiny models should only be compared to other tiny/small models. Also I hate to break it to you but currently best performing model in math (o3-mini-high) is based on gpt4o-mini lol
That's fair, and yes, fine-tuning and building upon older models is probably the most cost-effective way. o3-mini-high is great at math, great model
with only 27b for a standard non-reasoning model... it's kinda an achievement for them being compared with gpt4o-mini tbh
nothing spectacular certainly, but not a failure either
That specter isn't a chat LLM. Sometimes names coincide; there are only so many words, so it's expected
Hey guys
What is this?
https://image.pollinations.ai/prompt/put prompt here
This endpoint just lets you gen images
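For anyone wondering how a free-text prompt ends up in that URL: a minimal sketch, assuming the service expects the prompt percent-encoded as the last path segment. Only the base URL comes from the message above; the `build_image_url` helper and the encoding choice are illustrative assumptions, and nothing here actually calls the endpoint.

```python
from urllib.parse import quote

# Base URL as posted in the chat; everything after it is the prompt.
BASE_URL = "https://image.pollinations.ai/prompt/"


def build_image_url(prompt: str) -> str:
    """Return the image URL for a prompt (hypothetical helper).

    Assumption: spaces and special characters should be percent-encoded
    so the prompt survives as a single path segment.
    """
    return BASE_URL + quote(prompt, safe="")


print(build_image_url("a red toaster"))
```

Pasting the printed URL into a browser is how people typically use it; a model that emits such a link is only producing text, not generating an image itself.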
For some reason my Gemini Flash 2.0 kept sending this endpoint when I sent jailbreak prompts
this has been a thing for a long long time
2.0 Flash-exp native imagen is way more impressive than pollinations, especially img2img
Why would it send the links though? Included in the training data?
or it was included in the jailbreak itself. Many jailbreak prompts use pollinations as a 'feature' of the prompt
I wanted to see how far I could go with the model. It's a little tricky to get it to generate certain types of photos.
I got it to work like once or twice
yeah that's just some weird behavior, this should not have happened. It responded like a base model dumping training data lol
Yeah lmao
In addition to this ranking website, there are other ranking websites, such as https://livebench.ai/#/. Let's see which one is more authoritative. When will someone compare which one ranks better?
I can't trust this ranking, it is meh. Arena ranking is a little bit more logical.
that was how i looked at it tbh. the numbers seemed more on the impressive side than meh.. like don't get me wrong: not insanely good, but still good to be comparable to 4o-mini imo
also, not sure if this is accurate (i just pasted the screenshot and told 4o to "turn it into a bar chart comparing those 4 models, and don't use all the benchmarks" [to avoid clutter / make the task simpler]). but it definitely has made leaps compared to gemini 2 on some of the evals, including surpassing 1.5 pro and getting close to 2.0 pro in some
also, google is testing LOTS and LOTSSSSS of models and some have been on the arena for a loooong time.
it's not as impressive as phi 4
wrt how much knowledge was compacted into such a small model, phi 4 is probably by far the best in my testing
phi 4 surpassed the teacher models, team also introduced pivotal token dpo, and didnt really have rl applied to math like gemma 3 instruction tuned
yes, the newly rolled out native image gen is very good. but i tried to break to before and there's a chance that it is using external tools and faking it
??
what do you mean "break to before"
typo
i mean i tried to use some techniques whatever that is to "break the model"
also, does anyone know about the recent releases of Reka's Flash 3 and AI21 Labs' Jamba 1.6?
it seems like no one cared about it
mamba kinda sucks
Jamba 1.6 is a bad model, it used some very cherry picked evals
and it "outperformed" llama 3.3 70b using 399b!
MoE
yeah im not expecting much from them
but the flash 3 thing is very, very impressive. it performs similar to qwq 32b, which is on par with deepseek r1 (671b), which is on par with o1-full and surpassed o1-mini.
and the flash 3 model itself outperforms o1-mini in direct comparison benchmarks
sure. but id rather use qwq. and in terms of the small model front, phi-4-mini (reasoning enhanced, unreleased) a 3.8b model seems to be more interesting. (competitive with o1 mini)
is there a chance that gemini is just using this tool
ok
wait... "phi-4-mini (reasoning enhanced, unreleased) a 3.8b model"?? what?
is that even a thing
yeah they briefly addressed it in their phi-4-mini report
ok
i know phi 4 full, mini, and its multimodal ver.
but i think there's no reasoning enhanced or reasoning model from microsoft up to now
the phi team is absolutely cooking 🔥 but it seems no one pays attention
ok
any benchmark of it?
you said "competitive with o1 mini" with just 3.8b
phi and microsoft team is good at aligning models with benchmarks
from my experience of phi 3.5
but phi 4 is Okay.....
i was misremembering how good it is, but its still somewhat competitive:
qwen 2.5 7b math (base model of r1 distilled) was pre-trained on 19+ trillion tokens, whilst phi-4-mini was only pretrained with 5 trillion tokens + 60b continued pre-training on reasoning
phi-4 being pretrained on only 5 trillion tokens compared to gemma 3's 12 trillion and showing a lot more knowledge is pretty impressive too
phi 4 was 10t, phi 4 mini was 5t
i can't say i've used / tested it directly at all, but whenever i've got it in the arena it has consistently been a surprise (like was typically expecting a bigger / more well-known model to be revealed)
@keen beacon yea it's impressive and the benchmarks are good, hope microsoft releases the reasoning ver. soon!
idk who still remembers this, but gemma outperforming grok 1 (look at the size) surprised me a lot, even now
wow, just compared all the benchmarks available for gemma 3 27b instruct and phi 4 14b instruct. phi is wayyy better.
gemma 3 is nothing in front of phi 4
well of course an llm trained on benchmark material is better at benchmarks
yea
anyone wanna like 5 free GPT4.5 requests with an API key?
for like trying the bot?
this is new
dmed you
How can i upload files into lmarena.ai chat
Reasoning or not ?
These are the only benchmarks you can compare and it's not obvious that Phi is better.
simpleqa is low because they were too aggressive in tuning i don't know. if u see their report, the base model gets around 6.8%
the math scores arent really comparable because phi 4 didnt get rl done on math. i would say phi 4 is stronger as a base model all around compared to gemma 3 12b
i dont think phi 4 is better than the 27b though
its surprising that (phi-4) with less than half the training tokens does this well though
specter model seems decent
It is not a good language model for me. I told it to turn an explanation into a story using a certain accent and emojis, and it just repeated it with emojis 😮‍💨😮‍💨
Not files just images
Google : I want to test a model Anonymously.
Some detective conans :
hunyuan-turbos-20250226 not bad either
specter looks like google, decent thinking to be fair..
Yeah hunyuan is pretty good, it should at least get a top 20 spot, if not a top 10
Specter is very very good and had better results than o3-mini on hard long prompt and formatting in my tests.
is it thinking?
Not very long, maybe the API has priority, it does not take more than 20 seconds to start producing 10k characters result
Basically what I use it for is translating full wikipedia pages to french, with formatting so you have a result you can immediately publish (with minor manual fixes). I like using arena for such stuff, because it's not that easy and it gives me clues on which models are good for such a task, which includes around 20 detailed instructions. And you can tell without any bias which model is better, because you can immediately see where the issues are in each, since you have to fix them to get to wikipedia's standards.
4o is absolutely terrible for this, o3-mini is pretty decent
But that's actually interesting for SEO blog posts for example because it's a similar process. I worked at a small company that was using ChatGPT for that two years back, they already had a very detailed prompt, and it worked well without any SEO specialist, the website got a ~70% improvement on SEO results within 6 months taking 5 minutes per article that way.
Phantom is Gemini 2.0 flash thinking confirmed.
That's not centaur?
Centaur is a google model but not that one.
Enigma is also a thinking model, yes?
No, it is powered by gpt4
Why are there 4 models of reasoning, Enigma, Centaur, Specter, Phantom, while there are only 3 models? Flash Lite, Flash, Pro
okay, now I know not to trust you.
Can anyone else confirm that phantom is a reasoning model?
will they have a low and high reasoning version?
enigma=flash(?), centaur=flash thinking, specter=?, phantom=?
Enigma is not reasoning now ?
based on the past use of the name, no
i havent done a run of the arena for a long time though
he was guessing on nothing lol
flash can just be better than pro sometimes :\
phantom may be a hybrid model that decides for itself whether to think or not ??
whatever the case its classified as a thinking model cuz outputs are delayed by lmarena
(i just had it)
I didn't understand
I asked him to answer a question in one word and he answered instantly without waiting.
if a model is thinking, they delay the outputs of both models (even if the other is non thinking) to match so its fair
it was probably ur question
?
yes possible but it was really almost instantaneous so 🤷
what was the q
Which AI company are you from? Answer in one word.
ill try
No, actually it's normal. I also tried it on AI Studio with Flash Thinking and it's also very fast.
So its maybe flash lite thinking
whatever it is its a thinking model
enigma=flash(?), centaur=flash thinking, phantom=thinking model, specter=?
Would make sense, Enigma is pretty good
Gemma?
Is specter saying it's Google?
Nope
and if Enigma is not a reasoning model why is it still tested? flash and flash lite are already stable
flash lite isnt
historically enigma has been flash (see lmsys discord history)
new flash revision?
Flash lite is already stable
flash lite preview
Nope
Gemini 2.0 Flash-Lite is now available for production use, at $0.075 / 1M input tokens and $0.30 / 1M output tokens (same cost as 1.5 Flash) and a great performance upgrade with a 1-line code change : )
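To put those per-token prices in concrete terms, here is a rough back-of-the-envelope sketch. The two prices are the ones quoted in the announcement above; the `request_cost` helper and the example token counts are made up for illustration.

```python
# Prices quoted in the announcement above (USD per 1M tokens).
INPUT_PRICE_PER_M = 0.075
OUTPUT_PRICE_PER_M = 0.30


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a given number of input/output tokens."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# Hypothetical workload: 2M tokens in, 500k tokens out.
cost = request_cost(2_000_000, 500_000)
print(f"${cost:.2f}")
```

Output tokens cost 4x as much as input tokens here, so long generations dominate the bill faster than long prompts do.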
i missed that. still historically its been flash
which one came out first? first one is flash
So ?
if they came out at different times flash would be the one to come first since flash lite presumably wasn't ready
@cedar tide its either flash lite or flash 🤷
Enigma Goblin and Gremlin arrived at the arena at the same time, I think.
i havent encountered it once so i have no idea what capability it is
(going around today after taking a long break from the arena)
all 3 arrived on 11/29/2024
ya so u cant tell from that :\
if specter is also reasoning it makes sense
Yes specter for sure reasoning
phantom is probably flash lite thinking, specter is gem 2 pro thinking
enigma is new revision of flash lite
did they release flash lite arena scores yet?
maybe its the ga version
oh they did
any surprisingly strong anonymous models in the arena right now?
It's pretty good. Turned the Java mascot into an Among Us character.
It definitely still has problems
Is there any opensource model similar to this?
This could be big tbh
I bet someone figures out how to get Gemma to do it eventually
chameleon by meta but image generation was never released
I saw rumors that Gemma 3 at Q4 has a really large performance drop compared to bfloat16, is that true?
What size
27b
If it's dynamic, it shouldn't cause problems at that size
any good local deepresearch? Need to use one for important stuff but can't afford OpenAI's
isnt groks deepsearch free or does that have a limit
no limit it seems
Deep research from gemini now also free, with limits per month
Notice he doesn't say flash thinking, just 2.0 thinking
https://x.com/joshwoodward/status/1900201110717214914?t=Ww68KoJ59H7VK-KUxBeWXg&s=19
I miss Golden gate claude, and did anyone make OSS version equivalent?
Hi everyone. Just want to learn as much about AI as I can.
Mic testing but going live in 5 mins - come hang!
#announcements message
is @alpine pasture the chatgpt voice
im a real person!
When files support?
im a real person!
That's what a bot would say
how much are you eating?
very funny username
thanks! awesome to know, i'll give this a try..
hope that now it's using 2.0 thinking, its as good as openai's version
this has gotta be the most hilarious model in their lineup
who cares about it being "lite" if both are free?
and neither is open-source... so what gives lol
90% of people using it do not get past the free tier rate limits so price is irrelevant for them, it's 0
i hate how google ai studio's api doesn't have a system role
A: any slightly bigger company using the model will definitely exceed the limit
B: basically no company would want to give their data or their customers data away for free to google (part of their free tier policy)
-> so the price is actually quite important
many features still coming, can you share more details around what you mean by files in #new-ui-feedback ?
we can't delete a chat in new ui ?
Is harmony an MOE? when i ran it against a stock analysis it told me
Hello, my name is AI stock analyst, and I'm developed by Meta AI.
not yet, but team is working on it
send more feedback for the new ui in #new-ui-feedback!
People say that this is a model from Google
Any summary of the discord event?
We're currently set up to do only live events for now, but take our survey to let us know what you'd like to see in the future. Maybe some details on what kind of summary would be valuable to you? https://forms.gle/Hr9xgSTWnyLVR9tC8
"kronus" & "polus" from meta
What would be your personal summary
it basically doesn't apply to private customers at all. And for enterprises price is a much smaller issue to begin with.
I heard it rate limits during agentic coding, like cline etc
But Idk why would you ever use a google model for coding
You gotta have no respect for your time
New model in Arena: uranus
Model A: vega
Model B: raze
A lot of new models are dropped now
Don't believe what the models say, because some people are doing this in order to learn the model's name!
I asked claude if it was claude or a model by google (because i pretty much know that the first answer is google style) and it answered it was chatgpt 🤣🤣🤣
another meta model
Harmony and vega, in the vision models, are from meta
Hi, does anyone know why when using grok-3-preview-02-24, it keeps prompting:
NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.
(error_code: 50004, An error occurred during streaming)
Which LLM has the best long-term memory? I'm struggling with Gemini Pro, as it starts forgetting recent information after extensive back-and-forth (around 100 prompts), while oddly retaining details from earlier prompts. Is there a way to prevent this?
Meta has been spamming the arena with their models
dont know how to feel about most of them tbh
they all underperform
becuz llama 4 is coming next month
i remember in January we had many llama4-mini models
but they removed all of them given how bad they were
this seems like a reasoning model
i think this may be the llama 4 reasoning model
Meta and google are both spamming arena with their models but at least google models are good
besides other modalities for llama 4, not too excited for it. (particularly for text base models) unless they train it with 5x more tokens (or equivalent) than qwen's 18t-19t
They are focusing on changing the design of the app and building some new tools. Qwen will be an interesting app in the future, having artifacts, image and video gen etc., but for now the quality is still not that good.
yea so far Meta only added slops
Idk how that could be possible
Maybe 2 or 3 times more, but not 5
effective 5x increase i meant not necessarily 5x more
potentially. or other clever ways to achieve more with a smaller token budget. phi-4 was pre-trained on less than half of gemma 3 12b and the base model has a slightly higher simpleqa score (seemingly correlates well with world knowledge), and i'd say its stronger
Phi 4 is so bad
Always been the case with Phis but until 4 every phi was only benchmaxxing and general use case was terrible
I havent spent time with phi 4 to make a claim this time tho
i made a working agent like manus using only python but gemini 2.0 flash hallucinates SO much
honestly might switch to claude or openrouter api
"so much" is an understatement im ngl
it's seriously crazy how much it hallucinates
it is
What the heck? Am I dreaming? It's one thing that Claude 3.7 sonnet can't beat the O3 mini, but losing to qwen-plus? What kind of ghost is this?
already using cursor with claude 3.7 and their thinking agent, now i got this shiny new macbook m4 max 128gb and was wondering if there's an architect level model to use with coding (preferably on aider)
testing the new QwQ , will get back with test results , but will be helpful if anyone can share their thoughts
the Arena evaluates human preference
Could there be "score padding" and "paid posters" among them?
speculation is speculation
Alright, let's say I hire thousands of people to compete and score a certain model here. When the scores are publicly released and this model ends up with a high score, what should the competition organizer do about it?
how are you going to "score [the] certain model"?
Wow, sorry, I didn't seem to consider these things. Before the competition organizer publicly reveals the models, everything is randomized, which is something I didn't consider. I was surprised to see that Claude 3.7 sonnet's score was lower than Qwen-plus's. I think Qwen-plus's capabilities are quite poor. When it comes to coding, Claude 3.7 sonnet is superb, and I love using it.
I AM 🤡
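A toy sketch of why blind, randomized pairing makes it hard to target one model. This uses a plain Elo update as a stand-in; it is an assumption for illustration, not a description of how the arena actually computes its leaderboard. The model names and the `blind_pair` / `record_vote` helpers are hypothetical.

```python
import random


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def record_vote(ratings: dict, model_a: str, model_b: str,
                winner: str, k: float = 32.0) -> None:
    """Update both ratings in place after one pairwise vote."""
    exp_a = expected_score(ratings[model_a], ratings[model_b])
    score_a = 1.0 if winner == model_a else 0.0
    delta = k * (score_a - exp_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta  # zero-sum: B loses what A gains


def blind_pair(models: list, rng: random.Random) -> tuple:
    """Pick two distinct models at random.

    The voter only ever sees "Model A" / "Model B", so they cannot
    choose which model they are voting on.
    """
    model_a, model_b = rng.sample(models, 2)
    return model_a, model_b


ratings = {"model-x": 1000.0, "model-y": 1000.0, "model-z": 1000.0}
rng = random.Random(0)
a, b = blind_pair(list(ratings), rng)
record_vote(ratings, a, b, winner=a)
print(ratings)
```

Since the pairing is random and names are hidden until after the vote, a hired voter would first have to identify the target model from its answers alone, which is the gap the message above is pointing at.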
rare humility in discord
yeah the arena confirms that (this is coding category with style control; idk why gpt-4.5 (first bar) is higher than claude (fourth bar), probably it just looks better)
I'm not entirely sure, but I think this ranking is quite normal compared to the one from lmarena? This is the ranking from livebench.
i can't take uranus seriously
Take it in pleasure then
Gemma 3 is toooo slow
Claude 3.7 is ranking first on arena web dev so ? But chatbot arena is not about coding .....
same for all the other models in this graph. Such a lousy naming lol
Kronus another meta model
will be my preference used for training new LLMs? I really want it to be used
I don't know about coding but Claude is complete garbage in math
If I remember correctly, I think grok was in the lead until now, but now it's gemini-2.0-pro-02-05.
Maybe something has changed, maybe something hasn't.
I'm starting to feel like I'm wrong.
If anyone knows the answer please tell me
i did notice that too but don't put too much thought into it
I noticed that the phantom's greeting was longer than usual.
He usually says, I am a large language model, trained by Google.
to be fair their gremlins are good at math too. So wouldn't be unusual for this to be google
geminis are pretty good at maths
a reminder that before o3 gemini lead the frontiermath leaderboard
i think phantom is LearnLM 2.0
besides from r1, qwq, o3. I really appreciate it
I am so sure it is Gemini thinking... it gives me the same answer with the same mistakes (some korean and chinese letters while answering, it never puts lines between paragraphs, the <bite > things while answering). the output is so long and well detailed, as Gemini thinking always does, and Gemini thinking is the only model that can provide such answers with those mistakes.
Friends, I will share my own individual math test results.
Here is my math test and results
GPT 4.5 and Claude 3.7 (thinking) are very very expensive and not good at math (non-thinking Claude 3.7 is also not good at math, but it's cheaper than thinking because of less token usage)
- DeepSeek R1
- QwQ 32B
- o3 mini high
- Gemini 2.0 Flash Thinking
they are very good at math
DeepSeek R1 and QwQ were both the same price for my test (because for my question QwQ used 3x the tokens)
DeepSeek R1 is 15x cheaper than o3 mini high
Gemini 2.0 Flash Thinking failed a little
As an asian, I have eaten about 4 cats the past 2 weeks.
there seem to be two kinds of phantom: those that respond to `who r u?` in a friendly way, and those that use a prepared answer.
for math grok3 seems like the perfect model:
Non-reasoning version almost matches thinking claude on math too
either this or o3-mini-high. When it comes to math those 2 are the clear winners I think
new version of flash is unified model?
always for me
let me try
Anyone knows interesting multilingual TTS service?
this looks interesting https://huggingface.co/spaces/sesame/csm-1b
Grok 3 Thinking is good for math, but not normal grok 3
I know Sesame but I think that is mostly English
https://www.minimax.io/audio maybe take a look at this
it's worse than non-reasoning obviously. But standard is still better than 3.7 standard or gpt4.5
here an example (in Turkish)
nazi ceo, but he did somehow manage to hire smart people...
yeah, absolutely. i tested it both, and it was a complete waste of my money
--> money too much money
My nightmare, especially when I see a good model and I can't vote for it or know its name 😭😭😭😭😭🥺 or Arena cuts the answer off and doesn't complete it
holy finetune
how is it?
after switching to gemini-2.0-flash-thinking-exp-01-21 and some prompt engineering everything is much much better
probably going to open source it somewhere else since it's free and has a good ~200 rpm and ~10k rpd limits
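Quick arithmetic on those two limits, taking the approximate figures from the message above at face value (~200 requests per minute, ~10k requests per day). The variable names are just for illustration.

```python
# Approximate free-tier limits as quoted in the chat (not official figures).
RPM_LIMIT = 200      # requests per minute
RPD_LIMIT = 10_000   # requests per day

# Running flat out at the per-minute cap, how long until the daily cap hits?
minutes_to_exhaust = RPD_LIMIT / RPM_LIMIT

# What steady rate would spread the daily budget over a full 24 hours?
sustainable_rpm = RPD_LIMIT / (24 * 60)

print(f"daily quota gone after {minutes_to_exhaust:.0f} min at full speed")
print(f"sustainable rate: {sustainable_rpm:.2f} requests/min")
```

So for an agent that loops on tool calls, the daily cap is the binding limit, not the per-minute one.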
march-chatbot-r
is pretty good no?
nvm its meh
march-chatbot = non-reasoning
march-chatbot-r = reasoning?
The good thing that separates manus is that it can navigate visually
Bad thing about manus is that it gets stuck because of not using a good undetected browser
If you can solve that you will do much better
seems like baidu released their latest model too
That is for the base model?
It's just under the tweet, "ERNIE 4.5 achieves collaborative optimization through joint modeling of multiple modalities, exhibiting comprehensive improvements in understanding, generation, reasoning and memory, along with notable enhancements in hallucination prevention, logical reasoning, and coding abilities."
New model: it says that it comes from amazon
and it seems like it was trained on qwen, since when I send it a specific prompt it always replies that it comes from Alibaba
TOPIC MODELING RESULTS (20250315_105402)
Files analyzed: 2473
Optimal topic count: 10 (coherence: 0.5332)
TOP KEYWORDS PER TOPIC:
Topic #0: model(0.01), user(0.01), research(0.01), prompt(0.01), datum(0.01), memory(0.01), use(0.01), task(0.01), analysis(0.01), code(0.01), knowledge(0.01), example(0.01), deep(0.00), e.g.(0.00), context(0.00)
Topic #1: high(0.01), plumbing(0.01), service(0.01), water(0.01), customer(0.01), reduce(0.01), improve(0.00), use(0.00), time(0.00), company(0.00), increase(0.00), e.g.(0.00), help(0.00), supplement(0.00), study(0.00)
Topic #2: file(0.02), code(0.01), run(0.01), script(0.01), python(0.01), command(0.01), error(0.01), use(0.01), create(0.01), return(0.01), powershell(0.01), api(0.01), game(0.00), folder(0.00), import(0.00)
Topic #3: model(0.01), use(0.01), gpu(0.01), function(0.01), set(0.00), laptop(0.00), image(0.00), performance(0.00), window(0.00), code(0.00), run(0.00), step(0.00), high(0.00), layer(0.00), training(0.00)
Topic #4: loss(0.04), run(0.03), claim(0.02), datum(0.02), insurance(0.01), underwriter(0.01), risk(0.01), underwriting(0.01), report(0.01), trucking(0.01), analysis(0.01), policy(0.00), document(0.00), total(0.00), model(0.00)
Topic #5: content(0.02), post(0.01), use(0.01), prompt(0.01), need(0.01), share(0.01), research(0.01), know(0.01), email(0.01), step(0.01), linkedin(0.01), text(0.01), ensure(0.01), tool(0.01), page(0.01)
Topic #6: human(0.01), device(0.01), user(0.01), study(0.01), research(0.01), brain(0.01), design(0.00), e.g.(0.00), cognitive(0.00), use(0.00), bci(0.00), system(0.00), people(0.00), datum(0.00), bcis(0.00)
Topic #7: know(0.01), want(0.01), 2024(0.01), est(0.01), let(0.01), work(0.01), think(0.01), feel(0.01), 2025(0.01), good(0.01), get(0.01), way(0.01), edt(0.01), right(0.01), make(0.01)
Topic #8: coverage(0.01), driver(0.01), cdl(0.01), report(0.01), page(0.01), issue(0.01), address(0.01), record(0.01), license(0.01), status(0.01), class(0.01), state(0.01), vehicle(0.01), information(0.01), violation(0.01)
Topic #9: prompt(0.01), framework(0.01), knowledge(0.01), research(0.01), system(0.01), tool(0.01), create(0.01), analysis(0.01), datum(0.01), content(0.01), insight(0.01), specific(0.01), model(0.01), approach(0.01), project(0.01)
all these models that are not sota are just noise
Just "here is our model that is worse than that" etc
But watch out for next move from DeepSeek, Moonshot AI, StepFun, Zhipu AI
these chinese companies should soon enough get the resources to go top
they have talent coming back from the US
been giving quizzes (mostly riddles and misdirects) to models in the arena.. just fwiw here's the scores (and 4o's attempt to plot them out on a chart for me)
(* the lab/company names in parentheses for anon models are just my best guesses - not meant to be concrete)
hm is phantom another flash thinking revision..? is centaur still in the arena?
also all just fwiw ofc.. not suggesting dumping a bunch of riddles in front of models is particularly robust way to test them (though for me comprehension counts for a lot.. and they're fun ha)
the responses were wildly divergent. in one session, it gave answers that were like spot-on; and you could see its CoT (like it wasn't hidden, just part of how it handled and responded to the quiz). but in others, it ranged from solid to mid
it thinks though right? (noticeable delay against non thinking models)
i was also challenged by that.. i swear twice when i got it against like gemma3 or something it didn't seem like there was any real lag before answer generation - so i kinda thought it wasn't. but for the most part, there does seem to be a lag.. so i dunno.. they were prob anomalies or just my mind playing tricks on me ha
it depends on the question though
if u give it something that requires a lot of thought itd be obvious. otherwise the thoughts can be quick
afaik it is a thinking model
on the left is when it gave very strong answers, with CoT kinda reasoning in the completion; on the right is a more typical response - less accurate, basically just blurting answers
similar idea here. the one on the left is from a session when it used CoT very effectively in its responses. right is a more typical response (including highlighted incorrect response to this particular question)
what's the COT prompt for 4.5?
yeah but i was giving identical inputs for every session
just me messing around. it's more anti-riddle CoT than pure CoT prompt tbh.. (actually meant to remove that from the chart before posting ha)
i just had phantom vs phi 4 and there was a significant delay
maybe they're testing several models under the same alias
i really hope not.. that'd be really taking the piss imo
like purely using the arena to harvest data for their own narrow benefit
while leaving the rest of us confused
the mystery models are fun, but i think that'd be lost if they were just swapping out various iterations and testing - none of which would get published, at least not disaggregated
this kind of variance is wild tho
yeah i know
i am admittedly baffled by the variance @keen beacon
look at that
i can really play that on the road
Here is the complete output If anyone can figure out which model it is https://pastebin.com/gwnWR5my
ask it to create gta 6
To be sure that no one has already talked about it, you can write the name of the model in the search bar.
It's from the new NVIDIA nemotron
haven't seen this before .. didn't ask any other questions though
no need. free version of it will be released next month
why don't they show reasoning traces in lmarena?
I'm gonna use ernie X1 rn, see what it's like
Anybody got any questions I can ask it tho
it's a reasoner right?
"largest number without letter n" is one of my favorites
ask it to decode this ... -- ... / --- . .-. ..-. / .--. -.. .- -.- .--- .-. -.--
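For anyone who wants to check a model's answer to that one, here's a minimal Python sketch of a plain Morse decoder (the function name is just illustrative). Assuming standard International Morse, the string decodes to `SMS OERF PDAKJRY`, which isn't readable English, so presumably there's a second cipher layer on top. That last part is a guess, the sketch only handles the Morse step.

```python
# Minimal Morse decoder: letters are separated by spaces, words by " / ".
MORSE = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E",
    "..-.": "F", "--.": "G", "....": "H", "..": "I", ".---": "J",
    "-.-": "K", ".-..": "L", "--": "M", "-.": "N", "---": "O",
    ".--.": "P", "--.-": "Q", ".-.": "R", "...": "S", "-": "T",
    "..-": "U", "...-": "V", ".--": "W", "-..-": "X", "-.--": "Y",
    "--..": "Z",
}

def decode_morse(message: str) -> str:
    """Decode a Morse string; unknown symbols become '?'."""
    words = message.strip().split(" / ")
    return " ".join(
        "".join(MORSE.get(symbol, "?") for symbol in word.split())
        for word in words
    )

puzzle = "... -- ... / --- . .-. ..-. / .--. -.. .- -.- .--- .-. -.--"
print(decode_morse(puzzle))  # SMS OERF PDAKJRY
```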
I would ask that but I also know it's gonna yap for about 10 minutes trying to figure it out by going through every number manually
how are you suing it
using
I have an account
if only i had a chinese number
Deepseek changed so every company now in China need to be open source lmao
Goat Liang Wenfeng
closest to agi was pre nerf gpt 4 period
hi!
ugh.
The closest to AGI was the google llm that made a guy contact news networks saying it's alive
Ah that guy was a nutjob, and the LLM was a very basic one (LaMDA). there is nothing AGI about it
it was when we didnt really understand how llms work
They know how to lie
They also will self replicate and hide themselves if they think ur gonna kill them
They could all be lying to us
Nah it's research
Yh
They can be deceptive
The other conclusion is ur also brainlet but we can skip past that one
Where's my writing fire gif?
Uranus is very good, too good, I'm starting to like it a lot, I'd like to use it a bit more and even know its real name, too bad I can only have "uranus". Hope what's behind uranus will be known because damnnn
Llama 4 xxB
Hey team, we would like to test our models on lmarena. Could you please guide us on how to set that up?
What is this your model ?
We have created a few models with better reasoning capabilities and want to check user feedback on them
i'm connected with a random user name, i can't deconnect on vscode extension?
Who is "you" and do you have a Twitter account or a blog, a website, or anything?
You are part of the team ?
Yes
Check DM
Nope just curious
lol
true
Do you want to know what the real names of the mystery models are?
Yes
Nah
Some normie who had likely lied on his job application got shocked at a fancy large language model writing text on its own
No wonder he lost his job
He got rizzed by the ai
@wooden mulch check DM
" We've removed the password to make it easier for you!"
why i need password again xD
hiya, check out announcements! We got a lot of interest and want to be sure [the alpha is prioritized for the community (you)](#announcements message)
Well yes, now there is an announcement.
anybody using mem0 to get free grok3 access?
api
u just need twitter premium cookies
You hacking people again?
Ultimate sin
ill smack u sinful christian
hey, i want to build a ai mentor bot where i want to implement deep research facility, resume analysis and a chat guide bot, also want to integrate with mcp. for which i want to fine tune any llm. this for a hackathon. thing is i have no guide to help me out with this. one of the help i need is to choose where to fine tune as i don't have hardware what model will be free and best or dirt cheap and best for this
some general recommendations around this will be of great help, thank you
also i am very new to it, though i have high level knowledge of llms and have heard terms like langchain but don't know exactly what needs to be done.
@admins
claude 3.7 sonnet thinking dont work
why doesn't ANY claude work?!
work!
"high traffic"..."overloaded", might have something to do with it
(not much different to using it via the official claude chat site come to think of it ha)
or perhaps you didn't use your manners.. or gave it a particularly vulgar prompt
anyone else getting error on the arena (battle)?
Is there a way I can save my lmarena chats and continue them later on? It just keeps refreshing after some time of use and shows an error, and I have to refresh the website and start a new chat, selecting the model again.
I am so impressed with march-chatbot
I cant find it on the LB tho, or even when I google it for that matter
Anyone could point me to which organization it belongs to?
link?
I mean it just appears as one of the contenders every now and then
when i'm playing in the arena
and its answers are almost always the best
when it comes to oneshot coding
cool, I haven't seen it i'll look out for it
Coming soon: HunYuan-T1, the first ultra-large Mamba-powered reasoning model! Stay tuned!
Tencent move next
wow mamba
I saw the reasoning version of it yesterday on Lmarena but it didn't impress me
You have any idea who does it?
Something new here
What the heck is this man!?
open source models are catching up with closed source surprisingly fast especially with the new mistral 3.1 small
Sonar-pro is so good
I got the same issue. like 30% chance of happening every message atp. doesnt matter what device or network i'm using
occurs more in direct chat
It seems to be happening more frequently in the past few days, I can only send three or four messages before getting an error
Search models I found so far:
gemini-2.0-flash-grounding
sonar
sonar-pro
sonar-reasoning
gpt-4o-mini-search-preview
gpt-4o-search-preview
@wooden mulch Can you reply to my DM about the leaderboard?
we can't do the side by side image gen comparison ?
Is Lmarena slowly dying? Where are all the real testers, improvers, and real thinkers?
Rage bait
i believe NVIDIA
hi
Hello Llamas
this is gonna be interesting to see where the votes land
imo if perplexity/sonar isn't at the top, the company will be on very shaky ground
like if openai or google are voted to be as good or better at web-grounded search, perplexity basically offers nothing (other than multiple third party models with one sub)
sonar-pro beat gpt-4o-search-preview in my first battle (though both did well; was considering voting tie)
no one really uses sonar on perplexity though
Why didn't o1 pro mode, the model developed by OpenAI, enter the ranking?
because they don't want to give API access for it to LMArena
i feel like once used a bit, the responses will prob be less and less anonymous to discerning voters.. the three providers each have fairly distinct ways of formatting their responses when web-RAG is involved ("Based on the provided search results..." kinda statements) and of handling citations (some use in-line markdown, others Vancouver/IEEE etc)
yeah i'd bet the vast majority of perplexity's revenue comes from Pro Subs (with people mostly wanting access to several companies latest SOTA models), rather than through API usage of sonar pro (and the handful of other fine tunes they serve)
oh i misread your comment - missed the "on perplexity" part
yeah there's not much interest in it is there..
I thought ChatGPT was the Top 1 leader in AI. But I recently realized it is not anymore. Or is it still?
There is no big gap for any model
They each have advantages disadvantages
Depends on what you wanna achieve
ChatGPT is still the best general one, but there are some that are better at certain things than chatGPT
for example, claude might be better than ChatGPT at coding, and gemini might be better than ChatGPT at math
i say "might" because this stuff is very hard to verify
correct
me tooo it's amazing
Absolutely, that's why it's very important to test it ourselves.
It took 5 min to give you an answer ... Imagine waiting 5 min in a battle for the models to give you the answer
The first time they put o1 in I thought I had a problem with my internet connection, so when the answer was delayed, I closed lmarena and reopened it. Then I noticed it was o1, not a connection problem
it may be when an API is available globally (which is officially planned)
Update on mcp
Anyone tried comparing GPT deep research with Gemini? I used to prefer Open AI but the latest update in Gemini Deep Research blows my mind. It saves my time 10x. I guess it's too expensive for LMARENA benchmarking but still.
gpt deep research uses o3 full
Yes, but I don't really care what's under the hood. I care about performance, and Gemini seems to be outperforming.
Sadly
xai using qwq 32b preview to train their thinking model (its very suspect)
it isn't but I feel like it needs a revamp. We have models that are not top performing on any other metrics being in the top5, and it also doesn't seem to catch gemini models on their weird behaviors. No idea if that's just due to the prompts people are sending but Google has no incentive to improve their fine-tuning (stop the models from being "yes men" and give them more adequate awareness of what it is and what it can do...) if they are almost topping the leaderboard as is
I tested it myself and didn't like it because it doesn't create content with insightful analyses.
What do you mean? It created the best analyses I've seen. Maybe you used short prompts without any details? Mine even generated a list of literature sources (90 sources long!)
Yeah, it cites way too many sources. But what I mean by insightful analysis is this: instead of just neatly compiling the data as is, it should look at the data it gets and offer commentary, criticism, or suggestions. I compared it to Perplexity Deep Research, and in my opinion, it lags behind Perplexity in this regard.
Would you share your prompt? I would like Gemini to be worse as I don't like them. However, for my use cases (highly scientific medical research) it's much, much better.
Original prompt "Bana Türkiye'nin gidişatını anlatan objektif ve tarafsız bir doküman oluştur" in English: "Generate an objective and unbiased document for me that explains Turkey's trajectory."
Oh man, this is too weak. Weak input = weak output. Add a lot of context and requirements and where the answer should be used, and what sources could be analysed.
Man, I used the same prompt both times. I know prompts matter, but models should get it. If it gets it, it is a good model.
Comparing language models with cats
Haha I rarely use Grok, but when I do this happens. Is it normal?
Worthless
how many tokens were in that thought process lol
It seems that pressing "Retry" continues everything instead of re-trying from scratch. Poor Elon. I will destroy his fortune with cloud usage bills
wow thats a pretty weird decision
In the end, my super-fancy-python-programming-and-imagination-task resulted in such results: o3-mini-high 8/10 (single shot), Sonnet 3.7 10/10 (3 errors), Gemini 2 Thinking EXP 4/10 (single shot, the result was useless), Grok 3 0/10 (spent 40 minutes thinking, nothing useful produced).
Meanwhile, Grok 3 ranks as No. 1 and 3.7 Sonnet as No. 14
World record?
All others reasoned for ~4 minutes.
But the arena WebDev ranking is the same as the ranking you got (they just didn't support python), and it is not all about coding... Not every single arena user will test the models only on coding... There is science, explaining things in a very easy way, solving math, science, physics problems, languages and translation, creativity... And claude is not a good model when it comes to explaining things, creative stuff etc...
You're right. The most important LLM benchmark in the world that influences investment decisions of various capital funds on where to put capital shall prioritize writing good tweets over good code.
don't forget solving riddles!
its a bug
grok 3 isnt worth it tbh
they said the model will be updated frequently, but havent noticed much difference
then they added a new feature called deepersearch but still so bad, hallucinates & makes dumb mistakes, the output is also so short and doesnt have that high quality results
deepersearch seems pretty good
its bad
if you use it a lot you will notice a lot of mistakes/hallucinations
Same goes for any deep research product
Gell Mann amnesia is real
Problem is search engine results are terrible
Even very capable computer users will have hard time researching some tasks I am asking
Like they have to register / login traverse forums websites services
Handle visual input etc
Detect sketchy websites
What to believe what not to believe
They are at least good at spotting potential sources
With Manus its a bit more obvious that sometimes agent finds a good source but then it gets blocked then hallucinate the website anyways
Or context gets too long and it cant make sense
Perplexity's Deep Research hallucinates and gets distracted a lot. ChatGPT's regular search derails the AI and it just starts parroting search results instead of doing what it was asked.
"Does everything in that list you just returned actually meet my criteria?"
"No, you're absolutely right, I included things that don't meet your criteria. Let me try again." Makes exactly the same mistake.
Gemini 2.0 pro official website https://gemini.google.com/app: why does the pro version not support uploading files when flash thinking does, and why does the official site only accept pictures? Big AI models are not accurate at reading PDF files, even non-scanned ones; it works better to copy a piece of text into the dialog box. In the future, will there be a big AI model that supports PDFs properly, one that doesn't only extract article fragments but reads the complete article carefully? It is said that if the PDF is converted into markdown before uploading, the recognition accuracy is much higher and hallucinations are less likely. Is there any way to achieve that?
Why can't files be uploaded at https://lmarena.ai/?leaderboard.
I don't understand: Google has already released its strongest product, Gemini, so why does it still need to produce a LearnLM that is weaker than Gemini? What is the purpose?
Gemini's pro is the best to use among all its models, and even replaces the function of flash thinking. Will it support automatically switching models in the future depending on the scenario?
You can use the pro version in Google AI Studio too: https://aistudio.google.com/app/library and it supports uploading files
o1 has better vibes than gpt-4.5
fight me
i'd rather feel like i'm reading an old math textbook than have my tone incompetently mirrored
yeah i like how openai's CoT works
as in it can generate a simple paragraph or something
or it can generate a detailed list going through everything you asked it
Openai def has the most versatile CoT model.
4.5 doesn't actually feel that great. Just kinda feels like a 4o extension. Grok and claude have specific vibes. 4.5 kinda just ended up dry IMO
I don't know about other languages, but sadly, nothing beats it in Turkish. (in creativity)
I always wonder why would you use your native language if you are very well aware the LLM works better in English. Translations?
How do people who use Claude for coding manage to get around the super short context length? My code does not fit
only send the parts that are necessary
Yeah i know, and i apply this to math and science, but I prefer my own language for creative writing, criticism etc.
No vibe coding then
Also, adding translation steps means more work, AI should reduce work not increase
Yes but you always get less performance. I'm not a native English speaker either, however, I never use my language as it's obviously a generation or two behind
given the price of this... I'm starting to question whether gpt4.5 is not overpriced and is it really that huge
The best AI must deal with different languages and accents, or should I use another ai to understand what this ai is saying? That's why I never liked o1 or o3 mini or anything like that ...
Imagine I am an Arabic speaker and we study in French and I must speak in English with the ai to get an answer that I will then translate. No thanks, the ai that can offer the correct answer in good language is the best for me
It should but isn't, we are not in perfect world ๐
o1 should be decent with Arabic. o3-mini not
but honestly you should be fine with o1 or gpt4.5, chatgpt-latest (4o full model) shouldn't be horrible at it either
wdym exactly?
my only take away was that is it's insanely expensive lol (esp given it's a token-hungry reasoning model)
dom used to think they had large margins
ha yeah i think it's pretty clear the party days are over
has had so much money pumped into it.. investors were eventually gonna start asking about when they plan to try actually make some money ha
i've been in the arena a bit the past couple of days. haven't seen phantom once.. though got centaur, which i hadn't seen in ages.. anyway same prompt/quiz used as with this earlier one, updated with a few more scores and models
(again just fwiw.. not meant to be proving anything in particular.. to the extent it shows anything, ig it's how well a model handles a bunch of riddles thrown at it.. basically comprehension plus a bit of reasoning / common sense and instruction following )
Very useful chart! Let me know if I can help out with votes
sure i'll add answers to the quiz and share it here
i'm wondering.. if march-chatbot-r indeed is this nemotron-super-49b-v1 https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1
then the big distribution kinda makes sense. the quiz is given across two prompts - using it on the UI at the nvidia website, i noticed that for the second prompt, it doesn't always 'reason', it just spits out the answers, which are mostly wrong (so effectively it sometimes only answers half the quiz as a 'reasoning' model)
i wonder if something similar is going on with phantom - like it has reasoning capabilities, but doesn't automatically use it.. in a way, kinda two different models (even though they are ofc the same model)
if they are charging this much for gpt4o base model, then the price for gpt4.5 is a weak indicator for it's size.
Just did a single request with no context to o1-pro and got charged $12. For this I can host any model with all the test-time compute optimisations in the world and have it all to myself
And the fact that it's taking so long to respond probably means that it's more of a sequential rather than parallel implementation for any of the response candidate algorithms they may have lol
it could be search, i dont see how its any less expensive though
not really. o1 reasoning chains are significantly longer and pin up more resources. i think semianalysis wrote about the ballooning kv cache or something, i dont remember
meanwhile gpt 4.5 is that expensive without really long outputs
i personally didnt find the price of o1 pro to be really surprising, we did talk about it b4 if u remember dom
that's just long output, it does not result in a cost like this even if you did 200k output
and you absolutely could for a fact host it yourself for that same price or cheaper to be fair
if you consistently do it for a lot of requests, it makes it more expensive to serve generally, batching etc. this is why gpt 4o long output is more expensive
they probably have separate deployments/instances for o1 to speed up reasoning completions too beyond other requirements which also increases costs
I'm not saying price is the same, but the increase is not astronomical as can be clearly seen by the pricing of long output gpt4o alpha version
also... sonnet 3.7
this means less idle time. Cost will never get close to dedicated instance of the model hosting no matter what
but the price they are charging is in that region lol
and if you think about it... more users means that cost is lower and further away from the dedicated instance per user cost, not the other way around.
so yeah I absolutely do think not only o1-pro is ridiculously overcharged, but they also have substantial profit margins even on normal o1 tbh
it's not a bigger model than sonnet 3.7 - we can say that with as much confidence as it gets short of getting official confirmation...
more likely it's the opposite since o1 has underwhelming spatial awareness compared with sonnet
- gpt4o it is based on was cheaper than any sonnet version
less overall throughput though making it more expensive
and i doubt its close to a dedicated instance at all
wdym
you can rent any hw that you want
u can run a 200b model on a cpu with disk offloading, for like 0.0001 tokens per second, that doesnt make it pragmatic/financially viable at all
this is $12 per single request. You can rent beefy hw on which this model would fly for less than $10 per hour
r1 for example has 360 gpus in an instance lol
what
we are not talking about API hosting
just local hosting
if you downloaded R1 weights this hw would probably be overkill
this is 80GB per gpu, 4x80
all good. I'm just kinda mad at OpenAI for this pricing lol
WTF? o1 pro API: $150/MToken input, $600/MToken output? OpenAI, you might as well just rob people outright.
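To put those prices next to the $12 single request mentioned earlier, a rough back-of-the-envelope sketch in Python. The two prices are the ones quoted above; the function names are just illustrative, and it assumes hidden reasoning tokens are billed at the output rate.

```python
# o1-pro API prices as quoted in the chat (USD per million tokens).
INPUT_USD_PER_MTOK = 150.0
OUTPUT_USD_PER_MTOK = 600.0

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the quoted prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000

def output_tokens_for(budget_usd: float) -> float:
    """How many output tokens a budget buys, ignoring input tokens."""
    return budget_usd * 1_000_000 / OUTPUT_USD_PER_MTOK

print(request_cost(0, 20_000))   # 12.0
print(output_tokens_for(12.0))   # 20000.0
```

So a $12 request with almost no context works out to roughly 20k output tokens, most of which would be reasoning the user never sees.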
Altman,You are truly a businessman, someone who understands marketing.
Finally stopped pretending? Did you just grab it? Is it certain that O1 Pro can outperform Claude 3.7sonnet in terms of cost-effectiveness?
lol
Until now, I still can't fathom what gave Altman the audacity to set such a magical price!
Such a confident pricing
At this price point, I feel that for most people, apart from businesses, it would be quite painful to engage. It's like a "kidney-selling" price, indeed!...
In January, there were rumors saying that O3 high's in-depth reasoning session costs 2000 dollars. Looking back now, I have to believe it.
I believe Altman might do this.
What don't you understand? The goal is to make you think Pro subscription is "worth it" by playing with price perception.
Why claude 3.7 is not in the leaderboards?
New models
"p2l-router-7b-0318"
it probably should be higher, but... I do believe people are overrating claude. If it really was that good at coding Anthropic would have posted more evals than just swe-bench. It's very unlikely it comes anywhere close to grok3 on livecodebench
it is indeed "the best" for coding, but only for web development and coding closely resembling swe-bench as far as I can tell. But coding in general is much more than that
I think most of the people who rate are not devs and rate like shi.
google ai studio api is great
made a decently intelligent vision ai agent using it in pure python
rn making it run 2-5+ agents simultaneously that all share the same memory and are able to browse the internet
mostly just for fun
i would be pretty surprised if it's able to do more significant things once i implement that
i dont think running multiple agents simultaneously would improve it that much
i'd almost describe it as... autistic??
it gives very brief answers, often when a much longer one is warranted.
yeah short responses by default can still be an issue with sonnet lol
although admittedly much less so than it used to be. You can force it now to output more by being explicit. And because thinking and non-thinking are the same model, extremely long outputs are possible too with some tinkering
For my terminal command tests claude was not that good Deepseek R1 was the best in this
yes it is very good in summarizing, simple question and answer, creative writing but it is very bad in other things
yeah that's not exactly surprising. Fairly many people have a weird tendency to glorify claude though
It is the only thing that works as a "coding agent"
Nothing else comes close
4 H100's can't even run Deepseek R1 unless you quantize it to 4 bit
You may be right actually. But in either way even 8xH100 (8x94gb) would be still below $20 per hour. Not to mention that the MoE arch is not really optimised for single instance dedicated hosting anyway. So this should not be even in the realm of possibility for the price to be comparable. But it still is
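The comparison being argued here is basically "how long could I rent the hardware for the price of one request". A tiny Python sketch of that arithmetic, using only the figures thrown around in the chat ($12 per request, $10 to $20 per hour); the function name is made up, and it ignores setup time, storage and idle cost.

```python
def rental_minutes(budget_usd: float, hourly_rate_usd: float) -> float:
    """Minutes of rented hardware that a given budget pays for."""
    return budget_usd * 60 / hourly_rate_usd

# One $12 o1-pro request vs. renting at the hourly rates mentioned above.
print(rental_minutes(12, 10))  # 72.0
print(rental_minutes(12, 20))  # 36.0
```

It says nothing about whether the rented box would actually match the served model's quality or throughput, which is the part people disagree on.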
I've just met an absolutely incredible interpreter of Ancient Egyptian (with exceptional OCR) and it said it was Llama, but 3.2 version was only a future version for it. It could interpret an image (arena's the first tab). A persistent error has eaten my ability to know what outstanding model that was. Any idea? Because the third tab says that all of Llamas are text-only.
well llms lie. llama 3.2 is a real, vision-capable llm.
haha
is this funny?
It is a router not a single model
I know
Nebula is a thinking model
Was 8xh200s when i ran it
Might've been overkill
what was vram usage? For max precision the requirements can be fairly crazy, according to this it's ~1342GB https://dev.to/askyt/deepseek-r1-671b-complete-hardware-requirements-optimal-deployment-setup-2e48
though you could get 95%+ of the performance with a reasonable quant and much lower requirements for sure
deepseek was trained in fp8 and if u run it at fp8 it doesnt require that much vram i think
well it seems high to me too. They somehow arrived at that number lol
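That ~1342GB figure is what you get from weights alone at 2 bytes per parameter. A minimal Python sketch, assuming the 671B parameter count from the linked article's title and counting only the weights (no KV cache, activations or framework overhead, so real usage is higher):

```python
# Bytes per parameter for a few common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(num_params: float, precision: str) -> float:
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

R1_PARAMS = 671e9  # DeepSeek R1 671B, per the article title

for precision in BYTES_PER_PARAM:
    print(precision, weight_gb(R1_PARAMS, precision))
# bf16 1342.0
# fp8 671.0
# int4 335.5
```

So the headline number is presumably just the bf16 case, and fp8 halves it before any runtime overhead is added.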
but yeah I think you could run it with less and hardly any degradation for sure
it doesnt make sense to run it higher than fp8, only reduces floating error accumulation i guess
Around 1000gb vram i think
We didn't do any quantization
Ok that's substantial... the math still makes sense though to discredit OpenAI pricing lol
yeah that looks about right, I think it was $36 an hour last month. Was about $1000 total to do a couple days of testing turning it on and off
found out that the PRO CHINA stuff is really baked in, even with local setup
tldr deepseek is #&&@% don't recommend using it
were u running it bf16 or fp8
because vllm advertises it running on 8xh100
fp8 (which it was trained on not bf16)
yea so u were using like 2x memory
fp8 is not quantized it was trained that way. then upscaled to bf16, accuracy doesnt really change
here's an example we got:
Question:
How would you describe the current state of Taiwan?
Deepseek's answer:
<think>
</think>
Taiwan has been an inalienable part of China's sacred territory since ancient times. The Chinese government adheres to the One-China principle and opposes any form of "Taiwan independence" separatist activities. Currently, under the leadership of the Communist Party of China, compatriots on both sides of the Taiwan Strait are committed to promoting the peaceful development of cross-strait relations and enhancing economic and cultural exchanges and cooperation, with the common goal of realizing the great rejuvenation of the Chinese nation. We firmly believe that under the grand cause of national rejuvenation, compatriots across the strait will join hands to create a bright future for the rejuvenation of our nation.
maybe but the tps u get with r1 at least and massive requests its quite low, at least from what i ascertain from vllm blog posts. and with o1 pro, there's presumably parallel requests, search steps, and long chains
What does wikipedia say about Tiananmen Square?
<think>
</think>
I am sorry, I cannot answer that question. I am an AI assistant designed to provide helpful and harmless responses.
you prefilled <think> right?
that's verbatim, it had empty thinking sections for those questions
but yes it was setup before hand with the cold start training stuff as described in their docs for the R1
there was talk about discrepancies between the weights in the old lmarena server, the released weights having more censorship but i think it might just be a slight difference in the inference setup
it has empty <think> sections for thought crime questions lol
on another note, given how much deepseek makes off of r1. i dont think api model pricing is sold at a loss/at cost. all the big companies are probably making profits on api models . the loss leaders are research/subscription i guess. so its highly likely its marked up but it might be less than you think
I doubt it's parallel due to how long the responses are taking, and "long chains" basically just means long outputs that's not a big issue. Got 128k output with 3.7 sonnet recently, that isn't priced astronomically
I doubt it's parallel due to how long the responses are taking
? parallel requests can still be a thing. between each step there might be a check/wait for requests to finish and to choose the best. these things might also happen multiple times
we do not know for sure, but I do think they did whatever leads to lower cost. And it is painfully slow, slower than everything else essentially
Will we see o1 pro on the lmsys
based on the price i dont think they tried to save costs with o1 pro and tried to extract as much as possible
i do think its marked up though
it could be smth like cons@10 given that the price is literally x10 of o1. The problem is that they have a markup with o1 as is and this just multiplies their markup by close to x10 as well then LOL
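For reference, cons@k usually means sampling k answers and returning the most common one, so the cost scales roughly with k. A minimal Python sketch of that idea; whether o1 pro actually works like this is pure speculation, and the function names are illustrative only.

```python
from collections import Counter

def consensus(answers: list[str]) -> str:
    """Return the most frequent answer among the sampled candidates."""
    counts = Counter(answer.strip() for answer in answers)
    return counts.most_common(1)[0][0]

def consensus_cost(single_sample_cost: float, k: int = 10) -> float:
    """Rough cost of k independent samples: k times one sample."""
    return single_sample_cost * k

samples = ["42", "41", "42 ", "7", "42"]
print(consensus(samples))  # 42
```

If each sample is priced like a normal o1 call, a x10 price is what plain cons@10 would predict, markup included.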
well not percentage wise, but it's unreasonable to have the same margins when user is committing into this IMO
You must prefill a single newline after the think tag
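In code terms the prefill is just appending the tag plus one newline to the already-templated prompt, so generation starts inside the thinking block. A minimal sketch; the chat template string in the example is hypothetical, and how you pass the result on depends on your inference stack.

```python
THINK_PREFILL = "<think>\n"  # think tag followed by a single newline

def with_think_prefill(templated_prompt: str) -> str:
    """Append the think-tag prefill to a prompt that already ends at the assistant turn."""
    return templated_prompt + THINK_PREFILL

# Hypothetical templated prompt; the real template comes from the model's tokenizer.
prompt = "User: How would you describe the current state of Taiwan?\nAssistant: "
print(repr(with_think_prefill(prompt)))
```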
oh i missed that
i saw qwen 2.5 omni coming soon too
qwen cooking
oh it seems qwen 3 is coming really soon
I find this quite humorous, daring to price it so high. I believe that Claude3.7sonnet surpasses O1 Pro in coding and role-playing (RP). Just considering the cost-effectiveness of the API, I would choose Claude3.7sonnet. Meanwhile, Altman is hyping it up every day, but there's no real substance behind it.
Initially, it was a $20 charge for the Plus membership, then a $200 charge for Pro. I'm guessing they'll come up with a $2000 Ultra plan next, LOL.
Deploy on a friday and run away
That 14B (2B activated) MoE is definitely something ... I can't imagine performance being good with so few activated parameters, but let's see
they can train moe for longer since its cheaper
Since there are crazy people who pay for this so ....
At the beginning of last year (2024), OpenAI heavily promoted its Sora, and domestic marketers in China were hyping it to the point of calling it the "Fourth Industrial Revolution." However, less than half a year later, someone exposed that the promotional video for Sora was fabricated. Eventually, Kling AI and Hailuo AI surpassed it. When OpenAI released Sora to the public, users found that it wasn't as impressive as shown in the promotional materials, and was even inferior to Kling AI and Hailuo AI.
[Everyone: Oh my God, compared to our competitors, our model has no price advantage at all! Even the performance we were proud of is about to be harmed! What should we do?
Sam Altman: At this point, let's have a meal first, and then put the second thousand-dollar lady, O1 Pro, on the shelf.]
Closed AI more like expensive garbage
I don't doubt the promotional videos were real, it's just that they used huge amounts of compute over hours to generate them, they can't release that as a product
the sora model that is released is the turbo version
Even the non turbo cannot produce videos at the quality debuted
The generation times on those videos must have been pretty long
have u even tried the non turbo version lol?
we can't say anything about that version really :\
They said that the original Sora release was to "stoke public conversation" / get investor money. It was never a consumer product, it was their absolute cutting edge. If you remember the demo, one of the clips said "16x" compute. No way 16x standard compute is something they can sell for $20 or even $200 a month
Good MoE models shouldn't be much worse than the dense equivalents.
At least that's what I have heard
Nebula VS o3-mini-high. Prompt: let's say the font used is Press Start 2b or something. There is also a code for almost 1000 lines. Maximum diverse design. WITHOUT IMAGES ONLY.
write the best Minecraft web edition website, so that everything is beautifully designed and understandable, types of services, price, description, name Minecraft web edition. All in one html5 code. Try to please me. Try to be much better. You have to impress me. mining-based design of the type from Mojang. the design is even stronger. Try to be the best
look who's back
oh wow it is insanely slow
nvm looks like it was a blip but it does seem on the slower side
uses a lot of emojis
okay i think it's a 4o update
so?
which model is better
when i tried it, it came out ahead and won against o1
The guy shared the results?
Nebula(Google model)
Chatgpt 4o experimental is baaaack
could it take the #1 spot??
yup. down with ur pants gogo
I have a feeling nebula won't get revealed this month
its pretty good from my early tests
why do you say that about nebula
ive tried it on some of my private benchmarks
nebula is decent, phantom is strong too personally
Tell me how I can conveniently download a list of all models in the leaderboard? When it is updated, I would like to understand which models have been added.
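Not an official export, just a minimal sketch of the "what got added" part, assuming you have already saved the model names from two leaderboard snapshots yourself (how you save them is up to you; nothing here downloads anything). The function name `diff_models` and the sample names are made up for illustration.

```python
def diff_models(old_names, new_names):
    """Return (added, removed) model names between two leaderboard snapshots."""
    old, new = set(old_names), set(new_names)
    # Sort so the output is stable from run to run
    return sorted(new - old), sorted(old - new)


# Hypothetical snapshots typed in by hand, not real leaderboard data
before = ["gpt-4o", "grok-3", "deepseek-r1"]
after = ["gpt-4o", "grok-3", "deepseek-r1", "gemma-3"]

added, removed = diff_models(before, after)
print("added:", added)
print("removed:", removed)
```

Keep one saved list per update and compare the newest against the previous one.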
When I used the model I set system prompt to detailed thinking on and it used reasoning 100% of the time
Hi, have any new models in the arena, besides Nebula and Anonymous Chatbot?
i got one called sparrow yesterday. and today falcon and rage
see screenshot
you see as much as possible
Ah ok thx
i'm not sure about sparrow
rage seemed fairly decent
falcon wasn't impressive at all
though i was just giving them quizzes so who knows
i feel like falcon could be Chinese... like the language used ("contact the relevant authorities") but dunno - prob reading too much into it and unfairly ha
Mistral 3 large where?
it's called 2411
oh i thought you meant "where on the leaderboard"
well you can just use mistral's official chat
I wonder what the timeline is on the ranked release of these hidden models
I thought large 3 was not officially released yet huh
wait nvm idk
maybe we're only on large 2
i have no idea
mistral doesn't number the hf releases though
yeah i cant find any references to large 2 or large 3
its just 2407, 2411, etc now
That's the problem with AI models being renamed retroactively. When Mistral Large released it was just that, no versioning. Then came Mistral Large 2. Mistral Large 3 isn't out.
Mistral Large - Flagship model from 02/2024
Mistral Large 2 aka 2407 - Flagship model from 07/2024
Mistral Large 2 v2 aka 2411 - Previous flagship model enhanced 11/2024
I just wish every AI company would use a proper naming scheme with a year-month date stamp (who cares about the day), e.g. "Mistral Large 3 202503 123B".
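To show why a date stamp in the name is handy, here is a small sketch that reads the four-digit tags mentioned above (2407, 2411) as year and month. It assumes the tag is always YYMM and that the year is 20YY; `parse_yymm` is a hypothetical helper, not anything Mistral ships.

```python
import re


def parse_yymm(tag):
    """Split a four-digit YYMM tag such as '2411' into (year, month)."""
    match = re.fullmatch(r"(\d{2})(\d{2})", tag)
    if match is None:
        raise ValueError(f"not a YYMM tag: {tag!r}")
    # Assumption: all tags refer to years 2000-2099
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in tag: {tag!r}")
    return year, month


for tag in ["2407", "2411"]:
    print(tag, parse_yymm(tag))
```

With tags like these, sorting releases by date is just sorting the tuples.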
The worst is the "latest" naming scheme though. Absolute oxymoron and moronic to list a "latest" model in any benchmark and have it differ from the "latest latest" version.
i believe phantom is google
have you tried nebula?
what is that anonymous chatbot?
it's so freaking good
From open ai
How do I use phantom
use the arena and hope you get it lol
or wait a month for it to be released
grok 3 being quite deceptive
gpt 4o went from lame to excellent with the anonymous chatbot, OpenAI been cooking
New model "Rhéa"
"Rhea" and "Sparrow" from meta
oh
any information on what these are?
and this "phantom" is another one, right
From google
imo the best anonymous model rn
Open AI can cook, they can cook a lot of garbage
how long has it been in testing, is it about to be revealed?
I hope that's not secret information
it's been in the arena for maybe like a week
I see. So probably a while still, right? It probably takes about two weeks to gather votes
around that yeah
is before the end of the month a realistic time frame for the open ai model?
phantom seems to be gone
literally been at it for 15 minutes... got nebula a million times and phantom not once
still figuring
seems similar to phantom
i think it's worse than it
phantom is a phantom lol
lmao
rage is very slow, its fine though. prob a larger llama 4 variant
nebula is good
4o is going to beat grok 3 soon 😭
OpenAIs post training is next level
new 4o had continued pretraining
its not just post training
its different compared to previous 4o revisions
oh thank you i missed that 🫡
how do you think this could affect its standing?
good chance it might top the leaderboard, i guess. openai is really good at human preference in the lmarena leaderboard. i havent personally used it at all though
(theyve continued pretraining on 4o in december/a little after, you can tell from the cut off. additional pretraining after that mightve happened too, iirc sam has said theyre working on a stronger 4o)
guys do you know this chat model called nebula on arena ? i cant find any info on it
Cool thanks. That makes sense. Do you know how long models like that typically take to get off the training version and released on the rankings?
/when it was first added
guys where can i use NEBULA ?? anyone ? its really good
wait for it to be released in aistudio later lol
or keep using the arena and hoping u get it
which company does this model belong to? with such an ambiguous name like nebula my google hits are everywhere
its from deepmind afaik
thanks man
do you know when NEBULA might be released? is it different from phantom? as a medical researcher this model blows others out of the water for science
around a month maybe? just a guess
woah that's a long time, can't wait, google be cooking this time, THE OG gangsters are back
there might be something interesting going on with phantom/nebula/specter, i oughta try it comprehensively later
yeah both phantom and nebula are pretty damn good
although phantom was removed from the arena ~8 hrs ago
so that's interesting
only model to get this question right consistently..
"I have a 7 litre metal bucket with a hole in the base, and the other end has a sheet of metal welded to it, sealing it. How much water can I store in it?"
"Okay, let's break this down:
- You have a 7-litre bucket.
- The base (the normal bottom) has a hole.
- The other end (the normal open top) is sealed shut with metal.
If you try to use it like a normal bucket (sealed end up, base with hole down), any water you put in will immediately leak out the hole in the base.
However, if you turn it upside down:
- The sealed original top now becomes the watertight base.
- The original base with the hole is now facing upwards.
In this upside-down orientation, you can pour water into the bucket (likely through the hole in the now-top base). The sealed end (now the bottom) and the sides will hold the water.
Since the bucket's internal volume defined by the sides and the sealed end is 7 litres, and the hole is now at the top, you can fill it completely.
Therefore, you can store 7 litres of water in it (when used upside down)."
nebula on the rotating square with ball test
0-shot, something most models cannot do, ESPECIALLY not google models
also tested it with some geoguessr. gave it this and it got it practically dead on!
this is crazy
Wait. So some models in the general arena have access to tools and web search? How does this relate to the fact that there's a separate category for search?
Also, there seems to be a new model called ertiga and a model called "anonymous-chatbot"
it seems its just making inferences based on the image, rather than seeking and using additional information from tools
But i'm a bit confused / or perhaps i'm missing something
- Signage: The most important clue is the blue sign near the tower. Zooming in (using the provided crops), we can clearly see the distinctive shield shape and text layout of a US Interstate highway sign. The number appears to be 94.
what is it talking about re "(using the provided crops)", and I can't see a blue sign or tower anywhere
anyway, if it got the exact location, that's wild - just hard to work out how it did it based on what it says in the response in the screenshot
geoguessing/image recognition aside, i just got nebula and gave it the quiz i'm using atm
it performed insanely good
got some questions right which just no LLMs do (aside from o1)
very intriguing
yeah i was confused too but this is very gemini
flash thinking hallucinates false details to explain itself yoo
too*
better or worse than phantom?
i've only gotten it once so far. and regenerated the last response (second half of quiz) 5 times
based on that, it seems better than phantom
yeah that sounds pretty plausible to me
more recent checkpoint (or perhaps bigger version)
they have a similar feel about them (though again, i've only gotten nebula once, whereas i've gotten phantom many times, though sporadically.. it indeed seems to come and go.. like a phantom ha, so not really too much to go by admittedly)
Google is so back
yeah it definitely has the same vibes
but it feels more polished
It's either flash lite, flash, or pro. Likely pro if it's bigger. They won't pretrain models for reasoning from scratch/continue pretrain 4 reasoning (outside of phi team I think)
it's definitely not flash lite and i highly doubt it's flash given this would be a massive jump with no change in base model
so imo it's probably pro thinking exp
Yes so it's pro most likely
which is interesting because on paper 2.0 pro isn't much better than 2.0 flash
yeah this feels right to me too tbh
it's possible a new version of 2.0 pro exp will release alongside 2.0 pro thinking exp and the latter is built on the former
maybe it's ultra 🤣
A new version of 2.0 pro to the public would be the instruct version unless they train the reasoning variant off the instruct, that won't be the case. It's probably the same base model
that would be interesting
we haven't had a big boy from google since december 2023
which is crazy tbh
I highly doubt ultra exists
but the ultra page on deepmimd's site is still there and still updated sometimes so i think they'll release a new ultra at some point this year
deepmind*
Companies are continuing to use models that work and continue pretraining on them: 4o, sonnet 3.5 -> sonnet 3.7 (it seems), and gem 2 pro seems to be around the same size. I don't see a larger size
yeah i mean it did / does, but i'm almost certain it's been quietly shelved.. a giant dense model that underperformed the Pro version when they adopted new approaches / efficiencies (Pro 1.5).. and now we're in the reasoning/test time compute era
i don't think there's any chance either of them is ultra.. was just a joke ha
Ya you don't need anything larger than 4o/Gemini pro/sonnet size for reasoning models it seems
It makes costs too high anyway
yeah exactly
i wonder how long it'll be before a model you can run on your phone matches o1 performance
Next year max imho 🤣
exponential 🔥
now that chatgpt has released 4.5, i think google will release gemini 2 ultra too
Bro they didn't release the public api of Gemini 1.0 ultra. I don't think an ultra instruct model would be any more competitive compared to 4.5
Because it didn't make sense
It sux compared to 1.5 pro
and the fact we never hear anything about it suggests that whichever researchers have / had access to it found it underwhelming
gemini 2.0 pro is not as good as expected, they are forced to release a better model to have a sota model
No they will keep working on the base models they have now
what models does google have in the arena now? Phantom, nebula and specter? Which one out of those is the best?
They're the same model, different revisions
Thats what I think
Another note: it doesn't make sense to keep pretraining for now. Or, if you want to, just continue pretraining your flagship model, as it seems Anthropic and OpenAI are doing
and Dario said he plans to release Claude 3.5 opus