#general
like it started and we are currently in that period of the agi forming
Some agents can already be called AGI
why office worker?
we already have ai that can replace images tbh
like i can give my mom an image and she will not be able to tell its ai
since robot vacuum cleaners can replace human cleaners
you technically can replace low-level devs like jrs out of college or interns
any webdev for sure lol, like a junior tho
we gotta test it more
we only seen its frontend capabilities
we need to see it in Cursor and stuff
but now that I am thinking about it
i think NW might be a whole new model
and stargazer is the mini
NW has to be big af
Is multi-token attention of Meta AI helpful for the situation?
why they take it down so fast?
which situation
i like meta research into thinking without using tokens tho
that might be something
making AI more intelligent
is night whisper still in webdev arena
we should continue this and start using this metric
dont make me cry bro
Alibaba proposed a paper called START, where the reasoning model uses tools itself during reasoning. I see the "use tools in reasoning" tendency in Gemini 2.5 pro now. So I guess... are the proprietary models already using the tricks proposed in open papers, or even better methods?
have you tried NW?
yeah
just search START by Alibaba in the internet
Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introdu...
There is a model called Anonymous
remember 2.5 was released on 25/3, and the paper was published on 6/3
its definitely google
its as smart as 2.5 pro in other areas but much better in coding
its pretty sh*t in creative writing too
yeah google models suck at that
thats why i liked 24 karat gold
nah 2.5 pro's writing was great
it adheres to my plot and its tone is engaging
what is nw?
I can only access Gemini via my laptop, and 2.5's writing kept me glued to my laptop for days
where
wait what??
arent most of the ais rn as smart as a human or even smarter
but i really believe nightwhisper will threaten their market share
it will
but NW is def google based on metadata and they have this lunar thing now
its crazy how fast this Ai stuff is advancing
yall tried lunarcall?
yeah
its kinda sh*t
like where would you rank it?
like
based on my very limited interaction with it
maybe a bit higher than
gemini 2.0 flash
I’d rather look at the benchmarks than these vague statements like AGI and ASI
same
but benchmarks are getting saturated too
i only trust simple bench now
yeah they already beat ARC? did v2 come out yet?
What are those new crystal flannel harley models
It was beaten by o3-mini
give me back 24k~ give me back 24k~
nightwhisper is not doing that lol
yeah and even that 4% o3 score was not realistic to be running like that for consumers ("high-compute 172x", whatever that means it's more extreme than o1-pro for sure)
if we're honest, most SOTA llms are as smart as the average human. on most of the stuff we care about the AI is better than us. it may struggle with novel stuff, which is the final piece we're still figuring out, but on general knowledge and IQ its actually smarter than the average human lol
especially with this new generation
o3 is several months old anyway
yea its from google
lunar/star*/nw all are from google
i think the arc test might be unfair to other models not from openai
On arena
it was smth like o3-pro+ (low) lol. But it was already way too expensive and inefficient even with low reasoning
cause how the heck is gemini 2.5 pro 12% on arc 1?
i think they need to allow gemini 2.5 pro to use the same compute as o3
openai are ahead in reasoning anyway
they didn't include high reasoning version because it didn't qualify with that cost per task
they need to standardize the test with the same compute time
this barely made the cut. Still not very realistic cost
true but why are they the only ones with low, medium and high, while others dont have that?
google has the money to extend compute time
but they choose not to?
i wouldnt say way ahead
as in it was too expensive?
oai reasoning approach is the same as deepseek r1 just more scaled
google is using different approach for their reasoning
but they are getting there
yeah. They have $10k compute limit for the entire benchmark to be put on leaderboard. High version cost more than that
google is charging way less to run their models, so i could imagine if they ramped up the compute time and gave us an o1 pro type experience they would do amazing, because they have the fastest SOTA reasoning time and deliver quality at those speeds
imagine gemini 2.5 pro, but the same compute time as o1 pro?
that would be nuts
ahh i see
i think nightwhisper might be a medium level reasoning model
the next tier up from gemini 2.5 pro
like they're all gemini 2.5 pro
i think they cracked the code for a good coding model
but it's not like high version would have been miles ahead. If we look at arc-1:
but higher levels of compute
and what we will see is highly finetuned coding models
probably
gemini coder 1
gemini coder 2
so this is roughly 15% increase of the low score
these will be highly focused at coding tasks
yea
probably flash
a recent checkpoint of gemini 2.5 flash
this whole ARC-AGI benchmark doesnt make sense to me tbh
by spending much much much more LOL
let me know the results, my webdev glitching
ik there are some limitations but still they should test vision capabilities
because give any person a bunch of random arrays with 1,0 and he wont solve that
it just doesnt seem practical
ironically, but spatial awareness does not actually come from vision capabilities
"input": [[7, 0, 7], [7, 0, 7], [7, 7, 0]], "output": [[7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 7, 0, 0, 0, 0, 7, 7, 0], [7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 7, 0, 0, 0, 0, 7, 7, 0], [7, 0, 7, 7, 0, 7, 0, 0, 0], [7, 0, 7, 7, 0, 7, 0, 0, 0], [7, 7, 0, 7, 7, 0, 0, 0, 0]]}]
i mean whats this?
you expect a normal person to solve that?
- vision is 2d
well let it be a benchmark for spatial awareness
until we figure out how to improve on that
openai are cooked
they didnt seem to improve much at coding
yeah arc-agi is very confusing at first. But it actually makes sense given what it is. You can't solve this without contamination just by scraping the internet
which is better:
https://3000-iu6jeldzrfv5qj0kq3w36-87045d8f.e2b-foxtrot.dev
or this:
A simple game where you click a button that moves randomly.
one is lunarcell
i think nightwhisper included a high quality data with intensive RLHF and even vision
o3 mini is still holding up well imo, especially for the cost.
thats not lunar
just to get the aesthetics right/styling etc...
quasar is prob gonna replace o3 mini
thats why i think quasar is o4 mini
but quasar is not a reasoning model tho right?
it kinda is but not
like not traditionally
i get that but just the way we are testing this seems odd to me
but when you prompt it, check its outputs
like i asked it what was bigger 9.9 or 9.11
check what it does
similar to deepseekv3.1 and the new 4o
it has a COT output
very weird
I hope google releases a competing model in the o3 mini price tier as well. There is kind of a significant gap between 2.5 pro and 2.0 flash pricing, which o3 mini is inbetween
2.5 flash
google prices are already good
its just that people lost faith in them because they messed up many times
people will be more hyped for a gpt -model than a gemini model
but i like what google are doing recently
idk, but with ai studio they train on your data, and you have zero integration with stuff like IDEs and I mainly use llms for coding stuff where that integration saves a lot of time
its not bad
like it gets the job done but not the best
Deleted
its probably something like that
if you look hard enough you will notice that these numbers are not actually random and represent objects in space.
I can still remember the "Elmer's glue pizza"
it's basically like one of those tasks in IQ test or from general tests for a job interview
you look at test patterns
and then need to find a relation
to solve another example in a similar way
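for anyone curious, here is a rough sketch of the relation in the 3x3 -> 9x9 example posted above. The rule is inferred from that single pair only (each non-zero cell of the input becomes a full copy of the input, each zero cell becomes an empty block), and the function name `expand` is made up for illustration. It is not how ARC-AGI itself is scored or solved.

```python
# Sketch of the rule inferred from the one input/output pair quoted in chat.
# Assumption: the grid is square and 0 means "empty".
def expand(grid):
    n = len(grid)
    size = n * n
    out = [[0] * size for _ in range(size)]
    for bi in range(n):            # block row in the output
        for bj in range(n):        # block column in the output
            if grid[bi][bj] == 0:
                continue           # empty input cell -> block stays all zeros
            for i in range(n):     # otherwise paste a full copy of the input
                for j in range(n):
                    out[bi * n + i][bj * n + j] = grid[i][j]
    return out


example_input = [[7, 0, 7], [7, 0, 7], [7, 7, 0]]
for row in expand(example_input):
    print(row)
```

the point is that the numbers encode a picture, so once you spot the "copy the shape into its own filled cells" relation, the same function applies to a new test grid.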
wait lunarcell is actually good
I prefer 'multiple intelligences' to simple IQ tests
I think there's a website for deleted tweets. You can't really delete them lmao
ik ik, it just doesnt seem practical to me
I guess OpenAI is panicking right now, when they are changing plans constantly
It shows a messy management
And I actually hate the thought of GPT-5 as a free-tier user
I doubt they are testing them on text representations though
like only vision models are being tested
text is probably just so you could reproduce it in code tbh
they are lol
they would get zero if they had to use vision lol
then why only vision models are on leaderboard?
coincidence lmao
all of them are translated into text (o3 prompt):
Discover who I am!
??? r1 is on that leaderboard and its not a vision model too btw
idk how you came up with that
lunarcell
google really releasing so many models man
their event is on tuesday right?
9
oh ok. I seem to recall seeing somewhere that a model couldn't be tested on arc-agi since it didn't have vision. Must have been smth else then
is this real?
https://x.com/minchoi/status/1908521182149668958
we are living in a different timeline man
ever since covid nothing was the same lmaoo
btw you seem very touchy lately for some reason lmao
overreacting to everything
deepseek are also taking their time on r2
qwen team might be cooking up something bigger
curious to see what they will come up with. Seems like they are going for longer outputs
but open source has problems hosting it as is
the gains with v3.1 are from it being more verbose. Unsure if that would translate to gains with reasoning in the same way as they move to 3.1 for R2... 🤔
rather than like just make the chat model closer to reasoning and less different
cause I mean like, much of what V3.1 already does now in a standard output it would be doing that in reasoning instead, so there's gonna be some overlap...
Is nightwhisper still in the battle arena? I really want to test it.....
just started testing it
got my first benchmark question right
seems to be a reasoner
you think so?
let me know the final results
i gave up on it
its okay, just not SOTA
took about 15s to start streaming
Can we stop judging generalist models on web dev
why not if they put it only on webdev then you are judged on that
they should have just put it on lmarena
but they did not
so i judge with what i am given
and its another google model right after they had NW and star which were amazing(especially NW) so its hard to go from NW to lunarcell
well its mid on webdev
and its all based on system prompts tbh
NW had the same system prompt as lunar does on webdev
NW was very much tuned for the specific environment
i found it consistently poorer than 2.5 pro in a general sense
for example when asked to clone UIs, it would do poorer at actually replicating it than 2.5 pro but normally had less bugs
on general tests it did amazing and the way it interpreted my prompts was kinda uncanny
even if they did finetune gemini2.5 pro to coding, its a great model that performs well, just bc its finetuned on a specific usecase does not mean it will not perform well generally
i tested NW for 2 days straight, barely even could work lol
never felt that way about a model before
nebula was the most blown away ive been by an anonymous model
NW impressed me but not as much
it follows directions like no other model too
i never tried nebula
which one was that?
got my 2nd prompt wrong (2.5 pro gets it right)
it was 2.5 pro
i mean i love 2.5 pro as well
thats my 2nd fav model after NW, but i think they are the same models
NW might just be a higher level of compute for reasoning or a finetuned version
cause its a bit slower than 2.5 pro
but 2.5 pro is def the best model we have now! like it just craps on everything else lol
especially on studio where you have so much customization
+1
it also seems their rate limits they say exist on ai studio don't actually do anything
i've definitely gone over what it is supposed to be several times
yeah i hit the rate limit once on my account when it first came out
then i just change accounts lol
and ever since then never hit a limit again because i just switch accounts
how do you feel about the ARC test and how Open Ai gets to test with different levels of compute(or compute time) vs gemini 2.5 pro only having one test?
although the o3 results were impressive they definitely get watered down by the fact they threw thousands of dollars at a version of the model both tuned for arc and that was given way more attempts at getting it right compared to others
thats what i was thinking
and 2.5 pro scored 12%
that does not make sense. yeah open ai is good with reasoning, but when you're throwing a bunch of money at making your model think longer, is that really as impressive, like you said? and what if we standardized the compute times? we need a new graphic of that
yeah openai seem to have this strategy of
releasing the "best" reasoning models but it literally just brute forces the answer
like it's way less efficient
it can take their models 3-4x as many reasoning tokens as others to get the same answer
same for grok 3 thinking
exactly!!
2.5 pro speed is nuts and its quality is just as good if not better which is wild
google might have officially won this year unless gpt5 really blows us away lol
it depends really. if its an extremely long rote reasoning task, 2.5 pro can be a lot worse
ive seen an instance where qwq 32b can solve it with 10k less tokens (13k vs 23k)
and qwq is the master of overthinking
o3 mini and qwq are very good at these extremely rote reasoning tasks where 2.5 pro can fall apart
o3 mini is the best model yet for that still. these specific tasks do not require world knowledge, etc just pure rote reasoning
gemini 2.5 no longer free on google ai studio? 😔
Experimental one is gone for me now
Why can I still chat with these models if it has pricing
or is that only for api usage
yes its unlimited on the website
oh nice
so do all other models
you'll use it for free (with the free limits) until you upgrade
2.5 pro would be godlike if its output is 128k
on website like on studio right? i been using studio today and wanna make sure i aint using money lol
yeah
thank god lol
the aistudio website used to have limits a while back it seems they removed them at some point
like you think they removed it this week?
Unlimited but rate limited. 10 requests per minute I believe
what about gemini web app or something instead of studio
No it was removed before Gemini 2 I think
I just tested Lunarcall for coding, and it’s not bad, but it’s not on par with Claude 3.7 Sonnet. I would say it’s below 3.7 Sonnet, 3.5 Sonnet, and Gemini Pro 2.5; but maybe on par with o3-mini-high
i got rate limited on my account when gemini 2.5 pro had just come out and couldn't use it until the next day, so maybe the limits are just really high
can someone give me example web dev prompt for me to test?
quite the opposite for me
Ive tried it on physics simulations and it was blowing sonnet 3.7 out of the water
Nightwhisper is some next level coding model
Im pretty sure a lot of hard work went into this model
Gemini 2.5 pro while its good, it wasnt enjoyable to talk to
It has a strong info retrieval which actually makes it less creative
U wont get anything unexpected or unexplored areas on its outputs
I would like if they include something similar to 24k gold model
Why does 2.5 perform worse than 3.7 on SWE bench?
unfortunately that just seems to be something reasoners are worse at
i don't think anything has beat the original 1.0 ultra in terms of creativity and "human-ness"
How much did it score?
I can't find it !
Cross words test 🤣
This test is a little bit hard for them
It was removed
can you send it?
can yall see this?
https://liveweave.com/7rVAzw
like the app?
whats a REE??
Give them harder and long words 😈🔥 nihahahhaa
Yeah
Sry bro, discord has a 2000 character limit
Not able to send here
there is no way the system prompt is that long
just put it in a file
a text file
i got one too but i wanna compare results
lol
Spider launch 😁
it sucks that llama.cpp doesnt support all that multimodal stuff yet
What was its alias
Was it named maverick?
I dont remember testing this model
Nothing from meta impressed me
pretty unclear which one was maverick since llama is pumping out so many models lol
Are we sure about this?
yes
Its probably cybele
Or whatever its name is
Themis
Weird
I dont remember such model
Benchmarks
https://ai.meta.com/blog/llama-4-multimodal-intelligence/ oh this is the big boy
behemoth is coming soon
i'll wait for that
wasnt there that whole thing where the dude who gave google the 1m context window went to fb ai
That thing was so slow
I dont trust their benchmarks tbh
why not
Because they don't reflect how the model really feels like
Are you impressed with any Meta model added to the Arena?
The answer is clearly no
wait llama 4 is out?
2 of the 3 models
I think in an hour or so
we got vc in here? cause i dont even know how to express that
xd
So if Behemoth was out today it would be at the top of Lmsys leaderboard?
lmaooo they did not show 2.5 pro
clowns
but still amazing for open source
so good job meta
wait so now we have active and total parameters? can someone please explain?
is this new tech or was i just oblivious to it?
No
It got popular from deepseek
Its MoE architecture
Instead of activating all parameters you activate only the necessary ones (experts)
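a toy sketch of the active vs total idea, in case it helps. Everything here is hypothetical (the `ToyMoE` class, 8 experts, top-2 routing, tiny dimensions) and it only illustrates generic top-k expert routing, not how Llama 4 or DeepSeek actually implement it.

```python
import math
import random


def make_matrix(d_in, d_out, rng):
    # Random weights, just to have something to multiply by.
    return [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class ToyMoE:
    """Hypothetical mixture-of-experts layer with top-k routing."""

    def __init__(self, num_experts=8, top_k=2, dim=4, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.top_k = top_k
        self.router = make_matrix(dim, num_experts, rng)   # one score per expert
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]

    def forward(self, x):
        scores = matvec(self.router, x)
        # Keep only the top_k highest-scoring experts for this input.
        chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[: self.top_k]
        # Softmax over the chosen experts' scores to get mixing weights.
        m = max(scores[i] for i in chosen)
        exps = [math.exp(scores[i] - m) for i in chosen]
        total = sum(exps)
        out = [0.0] * self.dim
        for e, i in zip(exps, chosen):
            y = matvec(self.experts[i], x)   # only chosen experts do any work
            out = [o + (e / total) * yi for o, yi in zip(out, y)]
        return out, chosen

    def param_counts(self):
        per_expert = self.dim * self.dim
        router = self.dim * len(self.experts)
        total = router + per_expert * len(self.experts)    # must sit in memory
        active = router + per_expert * self.top_k          # used per input
        return total, active


moe = ToyMoE()
output, used = moe.forward([1.0, 0.5, -0.5, 2.0])
print("experts used:", used)
print("total vs active params:", moe.param_counts())
```

so all the experts have to be stored, but each input only pays compute for the few it gets routed to. That gap is what the "total" and "active" numbers on the model cards are describing.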
lol
the crystal model is a gaslighter that wont own up to mistakes XD
If Maverick is ~1420 ELO then surely Behemoth would be SOTA above Gemini 2.5?
Is that thinking incorrect?
oh snapp i think they out
oh wow I just loaded it yeah
it's 1417 ELO without stylecontrol
lol stylecontrol is way worse they were lmsys tuning 🤦♀️
it doesnt seem like behemoth is a reasoning model
its being compared to other non reasoning models like 4.5 and 3.7 (not extended thinking)
llama 4 reasoning is the reasoning model
wow meta cooked
i think its smart to not have released the big boy yet
just wait a few weeks for the other players to release and train on their outputs lmaoo
where do you see that comparison? i am only seeing maverick being compared
Still huge error bars on Maverick only 2.5k votes
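back-of-envelope for why 2.5k votes still leaves wide error bars. This is a simplification under loud assumptions: every vote is treated as an independent coin flip against an equally rated opponent, and the standard Elo logistic curve is linearized around a 50% win rate. The arena's real intervals are computed differently, so treat the number as an order of magnitude only. Function names are made up.

```python
import math


def elo_expected(r_a, r_b):
    # Standard Elo expected score of A against B.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rough_elo_margin(n_votes, p=0.5, z=1.96):
    # Standard error of a win rate estimated from n_votes binary outcomes.
    se = math.sqrt(p * (1 - p) / n_votes)
    # Slope of the Elo curve at win rate p: dE/dR = ln(10)/400 * p * (1 - p).
    slope = math.log(10) / 400.0 * p * (1 - p)
    # Convert the win-rate uncertainty into rating points.
    return z * se / slope


print(round(rough_elo_margin(2500), 1))    # margin in Elo points for 2.5k votes
print(round(rough_elo_margin(25000), 1))   # 10x the votes for comparison
```

under these assumptions 2.5k votes gives something on the order of ±14 points, and you need about 10x the votes to cut that to roughly a third.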
i tested it on simple bench public and it got everything wrong
ill try again
and i think more intelligent
Didn't get it yet
Verbosity is the best way to boost unweighted ELO
Omg, doing with with it hurts
https://discord.gg/j6kxQ4krtc @everyone
dude is a clown
is maverick gonna be open weights? do yall know
yeah it is
gonna try my pokemon example on it lol
bruhh the context window on lmarena doesnt let me 😦
so we are getting reasoning
they are f'ing insane with those model sizes.
No one is gonna be able to run them
and all but one are compromised with low active param count anyway LOL
Is llama4 open source ?
I assumed so. It must be..?
it serious took them 2T+ params and over a year to slightly beat the next best base
great
seriously*
this is gonna bring absolutely no value to open-source lmao
you can't host it let alone finetune
Finally, someone who doesn't say "but it uses less active params"
It is very underwhelming
what was llama's codename?
it will but for large orga
in stealth
It only seems good for distillation, but at that point just use Gemini or Claude or OpenAI models
Open source doesn't mean "you can run it on your basement server for RP erotica", there's OSS significance within the enterprise world.
Llama 4's new license comes with several limitations:
- Companies with more than 700 million monthly active users must request a special license from Meta, which Meta can grant or deny at its sole discretion.
- You must prominently display "Built with Llama" on websites, interfaces, documentation, etc.
- Any AI model you create using Llama Materials must include "Llama" at the beginning of its name
- You must include the specific attribution notice in a "Notice" text file with any distribution
- Your use must comply with Meta's separate Acceptable Use Policy (referenced at llama.com/llama4/use-policy)
- Limited license to use "Llama" name only for compliance with the branding requirements
facepalm
Nice. I'll try to finish my blog post this weekend on prompts, I'll tag you.
I get the argument cause I used to make a similar one myself lol.
But recent 1-2 years have shown us 405b llama was too big and not very useful at all. This is gonna be even harder to run and realistically only makes sense if you can serve thousands of people with your endpoint. I would understand if it was the case that there's clear benefit from bigger models. But it's more like the opposite is true with RL training now. Besides this exact size is saturated now with Deepseek taking advantage of it in full
These are all reasonable, tbh.
I was going to freak out about the "Built with Llama" one but it says you only have to say it on 'a' blog post, website, or within documentation. They're not saying you need to slather your product with it.
scout is stupidly fast
I'm all in for huge models when they make sense, but I don't think this applies here. There's a fuckton of redundant capacity and we don't have enough data to take advantage of such size imo
maybe they should have updated their 405b model first, at least once. Before releasing this lol
The two biggest models on the planet are Gemini 2.0 and GPT4o, they aren't even local models to begin with. The biggest model in coding right now is Gemini 2.5 Pro. Just because hobbyists buzz about small models doesn't mean they are where the center of the industry is. There's a whole iceberg out there, you're just the tip.
dense 405b has potential to perform better than this "behemoth" anyway, since it's dense and here it's 288b active
are you joking?
gpt4o is smaller than 4-turbo
4-turbo is smaller than og gpt4
And neither one is a local model.
sorry but I don't feel like arguing with someone stating "The two biggest models on the planet are Gemini 2.0 and GPT4o," with a straight face lmao
no offense
I can imagine why that would be. Tough to argue against water being wet, the sky being blue.
it's more like it's impossible to argue with someone thinking 2+2 = 5
but whatever floats your boat I suppose
Oh I see what's going on here. You think biggest = more params.
Biggest = most popular. Widest application.
Yeesh.
that is still nonsensical. 2.0 pro was never among the most popular models
LMAO
it only makes sense in the context of big in size models
I didn't say anything about 2.0 Pro. Christ. An ounce of reading comprehension, please. Even a gram would do. I'll take a milligram at this point.
"The two biggest models on the planet are Gemini 2.0 and GPT4o"
??
2.0 pro is not 2.0 gemini? 🤣
The word 'Pro' doesn't appear anywhere in that sentence, you absolute goofball.
Gemini 2.0 is an entire class of models which includes Flash Lite, Flash, Flash Thinking, and Pro.
there's no "gemini 2.0" model lmfao
oh my god
how are you this dense
Probably, yeah.
it's you who is f'ing dense, how does 2.0 pro not qualify to talk about when you mention 2.0 gemini? LOL
mate you're drunk go get some sleep https://en.wikipedia.org/wiki/Set_theory
Set theory is the branch of mathematical logic that studies sets, which can be informally described as collections of objects. Although objects of any kind can be collected into a set, set theory – as a branch of mathematics – is mostly concerned with those that are relevant to mathematics as a whole.
The modern study of set theory was initi...
"The two biggest models on the planet are Gemini 2.0 and GPT4o" - this honestly sounds like some clickbaity title by some idiot on the internet
gemini 2.0 was never as popular as cgpt, claude or deepseek, if that's how you intended to say it
neither of their 2.0 models
if you don't want to talk about 2.0 pro specifically, well then... all the other 2.0 models are worse
2.0 flash having a brief period of decent popularity with imagen but it kinda got denied by gpt4o before it even had a chance to blow up
....
Gemini 2.0 flash thinking is not bad for me
But Gemini 2.0 flash is the dumbest ai on the planet
Haven't had the chance to try the models yet, but I imagined they would be considerably smaller than what came out.
yeah it's not bad, but nothing more than that unfortunately...
But for me better than chatgpt
I mean the 4o
well if you compare against the updated gpt4o
flash-thinking performs worse now
I don't know about "now" but I used it a lot in the past
yea this maverick model is so ruined
when it was just released, it was beating gpt4o on some things yeah
one of the worst models ive tried on multilingual
Used mostly for enterprise at the moment. Think chatbots, document transformations, etc.
Of course, Google is using it internally on a bunch of products too.
though it was never at any point anywhere close in popularity
I didn't like the updated gpt4o, seriously. It didn't follow my instructions (on chatgpt app)
maverick so far :
bad at instruction following
bad at coding tasks
bad at multilingual
good at creativity
yea im not using it anymore
that's openrouter you idiot, lol
not global popularity
and it's only up because it's free
yeah it is, what a genius you are
I don't think you understand how openrouter works
of course people are gonna use more of what they can for free
mate, i don't think you understand much in general
I'm gonna give it a fair shake, but my sentiments are similar. For being two big MoEs that are better than Deepseek they... don't really seem better than Deepseek. The restrictive license is kinda the nail in the coffin atm
It's amazing that I need to spell it out 
anybody did more tests with these models?
Llama4?
not listing better widely-available alternatives on a comparison chart doesn't make them not exist btw
who invited them
is grok 3 really a good model
first impression was okay, but after using the model a lot then you figure out that its nothing special
interesting, so are the benchmarks bunk? they seem to be comparing it to Gemini Flash 2.0.
How do you rank it?
I mean, it isn't available open source or even via api so it practically doesn't exist
well I mean, is there anything at all to suggest otherwise? I do not think so lol
it seems to beat both gpt4.5 and 3.7 as well as deepseek 3.1 overall from what we know and what can be tested
not better everywhere, but overall ahead somewhat
It's okay at one-shot style coding. But yeah, otherwise nothing special. Everything we know suggests it's also a brute-force model which likely has high compute cost. They're literally running the datacentre on portable gas generators right now. 🫡
Brute-force money-scale MVPs are kinda Elon Musk's thing.
there is 0 chance 405b performs better than behemoth. dense is just stupid
well it's harder to train, but not significantly harder than behemoth. And behemoth has less active params so... I don't think it's a clear cut which has more potential. Plus behemoth is 10 times harder to run dedicated instance
maverick seems better than 405b. i don't get how this is not completely clear cut
it is newer model. 405b is ancient by now. If they chose to update 405b it would have performed much better than it did a long time ago...
right, which they didn't, because it's so incredibly expensive to train for no reason
405b is indeed ancient
which is why it's completely eclipsed by llama 4 models
so why would behemoth not be way better than 405b
behemoth has comparable training cost though
comparable cost a year later represents ~10x more effective compute
so yes
it won't be close
?
compute required to train behemoth is fairly comparable to 405b dense, that is my point. Almost 300b active. And you need much much more memory to host that MoE than 405b dense
then you add extended context on top, and memory requirements become insane
gemini2.5 pro crossword puzzle, after some tinkering i got the right system prompt
the compute is used much more efficiently in behemoth than in 405b. i also think behemoth will have spent more compute to get trained but it's hard to tell. basically to get 405b to the quality of behemoth you'd need to spend like 5x-15x more compute than behemoth i suspect
can you share stuff you make in liveweave?
efficient is in the fact that we are not comparing it to 2T dense model lol. That was already implied/accounted for. It's not magic MoE was a known concept for a long time now
I'm curious, did other models get this right?
i can try them, which ones you want me to try?
yeah exactly. 405b used an ancient training procedure that even at the time it was trained was already clearly stupid. they overspent on a dense model when they should have trained a MoE. because of that, despite large capital expenditure 405b was pretty mid. behemoth with similar spend will be much much much better and obviously so
I don't think it's a clear cut which has more potential.
I think it's clear cut for the reasons above
but it also wouldn't be comparable in performance to 2T dense model if someone managed to train something like that. 2T dense would have been exponentially more capable than MoE with 2T total and barely ~300b active
the key to these llms are def the system prompts
I'm just curious how you found it as a prompt, if LLMs were making common mistakes, etc.
No worries if you haven't.
Sounds like an interesting benchmark prompt.
for the same training compute, it would be hilariously less capable
yeah true. But that's not what we are talking about 👀
i made a system prompt for making system prompts, made one for refining prompts, made one for webdev and used all three to create the puzzle
idk you were saying something wasn't clear cut when it was
that's what i was focused on
you can talk about other things i don't mind
I was referring to 405b dense vs MoE with ~300b active and 2T total. If you can stomach bit of extra compute needed for training, it's possible that dense model would have actually more potential I think, hard to say. Which is why I said "not clear cut"
like we saw 70b llama doing really well against MoE models with less active and much more total param etc
"if you can stomach a bit of extra compute" ?
i am just looking at their potential relative to how long they have been trained or are likely to be trained in the future? 405b will not be trained enough to come anywhere near behemoth's current capabilities so i don't see the relevance
the relevance is that they abandoned that and started training from zero a model that is only slightly faster to train
but is much harder to host/run
not only slightly faster
with that track record, it's not a given that they will be continuing with behemoth either lmao
especially now that there's deepseek
it was unlikely a thing when they started training
well but it is. Less active param but not THAT much less than old dense total, and you still need more memory allocation too
honestly, I'm kind of surprised they didn't do RL training with their existing 70b
that seems like the easiest, fastest way to get gains
to the current level of capability, behemoth is an order of magnitude faster to train than 405B would be, i'd think? i'd be interested if someone had info on that tho
i just don't agree with your claims at all
it's like you live in a different universe where models haven't gotten way way way better in the last year
nah that's not true lol. It's 2T total and close to 300b active
that means you need more memory than 405b dense
and training itself is gonna be roughly 4:3 times faster
at best
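quick arithmetic on the numbers being thrown around here (405b dense vs ~2T total / 288b active). Assumptions, stated loudly: 16-bit weights at 2 bytes per parameter, weights only (no KV cache, activations or optimizer state), and per-token compute taken as simply proportional to active parameter count. It's a sketch of the argument, not a measurement.

```python
BYTES_PER_PARAM = 2  # assumption: 16-bit weights


def weight_memory_tb(total_params):
    # Memory needed just to hold the weights, in terabytes.
    return total_params * BYTES_PER_PARAM / 1e12


dense_total = 405e9     # 405b dense: every parameter is active
moe_total = 2e12        # behemoth total params, as quoted in chat
moe_active = 288e9      # behemoth active params, as quoted in chat

dense_mem = weight_memory_tb(dense_total)
moe_mem = weight_memory_tb(moe_total)
compute_ratio = dense_total / moe_active   # dense compute per token vs MoE

print(f"dense weights: {dense_mem:.2f} TB")
print(f"MoE weights:   {moe_mem:.2f} TB")
print(f"memory ratio (MoE / dense): {moe_mem / dense_mem:.1f}x")
print(f"per-token compute ratio (dense / MoE): {compute_ratio:.2f}x")
```

under those assumptions the MoE needs roughly 5x the memory for weights while saving about 1.4x compute per token, which is in the same ballpark as the 4:3 figure above.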
but 405b has 405b weights so will be by default much worse and will need to be trained a lot longer to get to the same capability threshold
no?
i think i wanna marry gemini2.5 pro
why is it so good?
like i love using it in studio
so much fun
if MoE has ~300b active that means you need to train it essentially like 300b dense, not accounting for vram
in a big data center this seems very manageable
sure but so is 405b in much the same way... At that point it's not a big difference at all
that's where we agree. they are pretty similar in terms of inference and training time (within 4x of each other) but behemoth has a way way way better ability to scale
405b scaling law will be way worse to reach behemoth level capabilities
i'll ask my ml friend if she knows specific theoretic numbers on that
It's amazing. Just a beautiful model, that's the only way I can really describe it. Such a pleasure to use.
That is debatable... Like if we take deepseek that is MoE too, it has poor spatial awareness just like gpt4o. Why? Most logical explanation the way I see it is low active param count. That's also true for all llama4 models except the biggest / behemoth. Only 17b active. MoE have their advantages and they make sense for mass hosting and maybe even RL training potentially (speculation at this point), but they certainly have their drawbacks too and relatively low active param count is always gonna be a compromise.
you can't have lower training requirements for free, that's always gonna come at a cost. If there's less to train, there's less work that was done, bluntly speaking
do you have more info on spatial awareness wrt MoEs? I would assume insofar as that's not just correlated to capabilities, it's about omni / training on images
not even close to gemini flash 2.0
you can try it at https://lmarena.ai/
Well that's sad.
lol
thats just one of many
take let's say arc-agi but without reasoning models (those are harder to compare directly when we are talking about arch). gpt4o score is very low relative to sonnet. That's also true for simple-bench which is another benchmark for the most part based on spatial awareness. Then we also have web development and web coding arena where gpt4o stands no chance against sonnet. Same applies to deepseek to a varying degree.
also really really bad at multilingual
llama 405b was actually good at it
idk what happened
Unfortunate. Price is so low, it would be a massive deal if it was great
web development is relevant as much of it is based on css and visual design
ofc I'm assuming here that sonnet is bigger, but I do realise there's no official size if we want to be 100% accurate. But if you take say gpt4.5, you will see it has better spatial awareness than gpt4o too, or even o1
because i remember them getting this question right
did yall test it?
yea it was on the arena for a while
It depends on what you want
Is that a question ?
Of course
with o1 it's actually interesting that they had such a massive improvement for arc-agi. To be completely honest I wouldn't rule out contamination lol. But it could be also that it does help with those specific tasks. But o1 can still struggle a lot with svg or web design comparatively
i think google will start taking the lead from now on
btw not saying that sonnet is a better model overall, but for this specific thing, it is. 😉
sonnet is kinda unique ngl
you feel like it has its own intelligence not some robotic ai
Sonnet is the best at coding .
But so dumb at other things
yeah either they did something very smart or openai cut the cost too much and too drastically with arch planning. Likely a combination of both
then gpt4.5 was just too extreme to train in time to compete adequately
that alone is a big plus
its really hard to make a model like that
well it will change with nightwhisper ig
being good at coding isnt just providing a one-shot working code
but there are a lot of nuances
I think Google changed its plan after the announcement of O3 😁
best at web coding. Not ALL coding. 👀
if it's not web, it's honestly a coin toss but I would go with 2.5 pro
they kinda improved on other areas with sonnet 3.7 tbh
it used to be so bad at desktop apps
even for web 2.5 is gonna be very decent
I would be shocked if 2.5 is not significantly bigger than gpt4o all things considered
tf is flannel
but it's still behind competition on some code things like livecodebench. Currently probably only 2.5 pro is a model with no significant flaws. Kinda insane if you think about it lol
yea
24k gold model was so fun to chat with
i can see myself using it a lot
for unserious stuff
idk about that
true
I'm also very excited for the big llama especially the base and reasoning version
Given that the new Llama 4s can't answer my questions correctly while 24k gold could, my guess is that 24k gold was a preview of behemoth
lol
Which is... worrying. I'd really hoped it was a lot smaller than that
I mean, I was close
https://x.com/lmarena_ai/status/1908601011989782976
😛
😛
Funny thing is Deepseek can also get it at a fraction of the size (and likely cost) soooo
Did Meta actually finish training Maverick? From the model card, it saw half the training tokens of Scout.
I noticed that too
Both on the training token side and the context length side
My guess is probably not
Put up a couple of my prompts in #share-prompts , which the current L4 models seem to fail (to date, only 24K got the sushi question right, and it didn't get the Deltarune one right)
I genuinely don't know what they were thinking with this release tbh
This has pretty much killed any chance they have, in my mind, of being competitive with some of the other players in the area
It'd be nicer if they were like, 1/10 of the size they are lol
so many complaints about Llama 4 not being that good, but why is its lmarena score that high?
how are these models hacking this score?
DM me 🙂
They were the least filtered ones and had the best vibes among the ones on the Arena, for conversational uses at least.
you can say it and I will react with an emoji
Tell us
Say say 😁
My guess is pretty much what Suikamelon said, rather than committing to a single instruct style, they tried a plethora of different models with different conversational styles and RL tunes for each size category, and then picked the ones which gave the best score in the arena
In the end, the spider and 24k type styles won out
So those are the ones they went with for the final release
That's what even Google does
and what is "Style Control"?
i think everyone does that
Even without formatting/text styling differences, I generally preferred the 24-karat-gold-type models.
They do, but I don't think anyone did it to nearly the extent Meta did
We went through like, fifty models in the past couple of weeks lol
Only fifty models
people say old llama 4 was caught lacking so they scrapped it and made a better one
LMArena needs to fix this problem. people will stop believing in its value
well yeah if the previous planned launch models were even worse meta ai are way behind
deepseek will eat them for lunch
Hi
Interesting data.
Huh
So L3.3 took less time than both L4 models combined
(I'm guessing the MoE architecture reduced the training compute per token)
That's weird though - what were they doing with all of that compute then?
No idea. It makes the rushed state of this release even more puzzling.
A possibility is that they delayed training the models for as long as possible due to the copyright lawsuit(s).
Oh yeah the Anna's Archive stuff
They have so much compute they might have trained both models in 1 week or less.
Remember hearing about that
hopefully that means behemoth is out in the next few weeks
Not only that, also Books3 (as disclosed in the Llama1 paper—which they probably regret doing), Libgen (Llama2).
A 4.1 release with Maverick in a finished state and smaller models would be appreciated.
Agreed
I don't think the rushed Maverick release did them any favors, they probably should have just waited until it was finished training
Also, where's the omnimodality?
It should have had voice input/output, maybe even image out.
This entire release is pretty bizarre tbh - spamming the arena with models, the small amount of training compute, the lack of additional modalities which they said were supposed to be their major focus
Only that? I imagined it would be more complex than that.
I don't feel I usually preferred those Meta models due to the longer responses, but mainly because they had a more relatable tone, or were more informative, didn't exhibit slop, etc.
OpenAI and Google models also generally had long responses.
They cycled through so many models, several were probably targeted tests, earlier checkpoints, etc.
Yeah I think it's basically impossible to determine which checkpoint we're seeing in the full releases
What model do we think was maverick? That we had tested earlier
Spider? Themis? Cybele? 24_karat_gold?
That said, my theory is 24k is probably a Behemoth preview, just because for whatever reason, I'm not able to get either Scout or Maverick to give a single correct answer to coding questions which it got in one shot
Even if it is, it might still be one of multiple checkpoints, though
spider appeared many times in my tests.
okay call me naive but perhaps maverick = maverick
But in the end 24_karat_gold was the one that I've seen the most.
24_karat_gold is only good for its styling imo lol
it follows instructions kinda meh
24kg: 49 times; spider: 23 times, cybele: 21 times
Possibly, with a mellowed down system prompt perhaps.
Ah I see that.
At some point in the past 24 hours flannel, harley and crystal got introduced, and I don't think they had a crazy system prompt.
Eventually I got tired of testing them when I saw that 24_karat_gold won all my personal tests.
I'm not 100% sure but crystal seemed the best one.
Both Maverick and Scout have the same amount of active parameters, so they should be equally fast, I think.
Of course.
what model is spider
Try the one on Chatbot Arena (Direct Chat).
There have been suggestions that the one from OpenRouter isn't working correctly.
well apparently the system prompt tells maverick to only answer the question 50% of the time
I don't know the details; I haven't tried it personally there. I have tried it on Chatbot Arena though.
Aint no way it was 24 karat gold
Not of its responses but you can test it easily.
Those crystal harley flannel models were llama 4 wallahi
They act the exact same
Or maybe theyre other llama models but llama 4 maverick is way better
Nahh
Its way better
The reasoning
Llama 3 was a dumbass
maybe it is dumbass if you not pay immo
Spider is behemoth
y u thnk so?
Because
We already have maverick and scout
Why would they put that spider emoji
They're hinting
Spider out of all emojis
Its either their reasoning model or behemoth
Situation unclear.
is it better than nightwhisper at coding ?
I mostly really just follow the LLM threads.
For what it's worth, the official Llama 4 inference code uses temperature=0.6 and top_p=0.9 as far as I can see.
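(for anyone wondering what those two settings do: a generic sketch of temperature scaling plus top-p / nucleus filtering in plain Python. This is not Meta's inference code; the function names and the toy logits are made up for illustration, only the 0.6 and 0.9 defaults come from the message above.)

```python
import math
import random


def softmax_with_temperature(logits, temperature=0.6):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=0.6, top_p=0.9, rng=random):
    """Draw one token id from the temperature-scaled, top-p-filtered distribution."""
    dist = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical 4-token vocabulary
print(sample(toy_logits))
```

a lower temperature and a top_p below 1 both cut off the unlikely tail, so a provider running different values could plausibly give noticeably different outputs from the same weights.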
Hm, it looks like they have int4 quantization loading code too.
when you use a system prompt obtainer prompt https://github.com/LouisShark/chatgpt_system_prompt/blob/main/GETTING_STARTED.md
the proper default prompt https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8#:~:text=Llama 4 models.-,System prompt,-You are an comes back via meta.ai, whatsapp, messenger, and the api providers
the lmarena -experimental brings back nothing, it is like it has been left out
what the llama
Unfortunately, I'm not a programmer.
This is not my screenshot, but a screenshot of my Friend's discord bot.
which tells you when and which model was added to the arena or, conversely, disappeared
Maverick is amazing
@modern knoll
yep thats what i linked
I don't think fp8 alone would reduce quality appreciably.
On-the-fly INT4 sounds like what bitsandbytes would do.
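(rough idea of what quantizing weights to 4 bits at load time means: a minimal sketch of symmetric absmax quantization with one scale per block. This is an illustrative assumption, not the Llama 4 loading code and not bitsandbytes' actual 4-bit formats; the function names and block size are hypothetical.)

```python
def quantize_int4(weights, block_size=4):
    """Map floats to integers in [-7, 7], storing one float scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # The largest magnitude in the block maps to 7; fall back to 1.0 for all-zero blocks.
        scale = max(abs(w) for w in block) / 7 or 1.0
        blocks.append((scale, [round(w / scale) for w in block]))
    return blocks


def dequantize_int4(blocks):
    """Recover approximate floats from (scale, integers) blocks."""
    return [scale * v for scale, q in blocks for v in q]


weights = [0.1, -0.7, 0.3, 0.0, 2.0, -0.9, 0.5, 0.25]  # toy values
blocks = quantize_int4(weights)
restored = dequantize_int4(blocks)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"worst-case rounding error: {worst:.4f}")
```

the rounding error per weight is bounded by half the block's scale, which is why a few large outliers in a block hurt the precision of everything else in it.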
It's versioned 03-26-experimental. Is it from 10 days ago? Or did the training start on that date? We can't know, of course.
maverick has been published in fp8 and officially "maintaining quality" so all providers will end up hosting that
Dude wtf
Maverick teaches u how to
Do illegal stuff
Make fentanyl and even teaches u how to get the precursors
💀
Yes
Its a real method
Yes im a chemist
Lil bro
Yes
Who said i cant make it cus its illegal
Walter white made meth and it was illegal lil bro it doesnt matter
Rules are made to be broken
wait do we have full access to llama 4 now?
lmaoo they gotta pray man
@alpine coral can you dm me your question set if possible? i think i may have found o3 full on a red-teaming platform and am putting it through its paces
o3 (?) simplebench "try yourself" public questions score:
- ✅
- ❌
- ✅
- ❌
- ✅
- ✅
- ❌
- ✅
- ✅
- ❌
6/10, iirc best performance on this specific set minus gpt-4.5 preview
What did Meta see coming out on Monday that they rushed?
yo i have tested llama4 all day and they def lied lmaoo
Yea its really bad
and they said the lmarena was an experimental chat session for the results wtfff
there is no way this is second
I have no idea how it reached that no2 spot
Its really one of the worst releases so far
lmaoo
Even their llama 405b was better
the only thing going for it is that 10 mill context but i have yet to test that or see that
has anyone else tried that?
The model is so dumb, its information retrieval is so bad, it hallucinates a lot, it doesn't follow instructions well, it's so bad at multi-turn
I can go all day about this model
I just dont understand why did they even bother to release this
this is a joke
I won't touch it again
like we basically can ignore everything they said
that 10 mill context is a lie too
bc everything else is a lie
i feel lowkey bad for zuc
I knew they self-trained on benchmarks
he screwed his company with the meta stuff and now this
just wait until simple bench gets their hands on this
gonna be wild
maybe we prompting llama4 wrong
there has to be a reason
if it is gg
They are not taking llms training seriously obviously
really??
yeah its the promises that is annoying
just dont release, we can wait
but i think its for investors
Their instruct model is so bad that they had to wait for behemoth to train a reasoner
You cant just use maverick with reasoning
Like that model is so dumb
It doesn't understand anything
yeah maverick is so badd
like it makes me mad
i dont really see that big a difference between scout and maverick
tbh
Nobody wants a dumb model with 10m context window
but lmarena needs to find the ranking asap