#general

drifting thorn
#

agi is a vague definition

balmy mist
#

like it started and we are currently in that period of the agi forming

drifting thorn
#

Some agents can already be called AGI

balmy mist
#

why office worker?

#

we already have ai that can replace images tbh

#

like i can give my mom an image and she will not be able to tell its ai

drifting thorn
#

since robot vacuum cleaners can replace human cleaners

balmy mist
#

you technically can replace low-level devs like juniors out of college or interns

#

any webdev for sure lol, like a junior tho

#

we gotta test it more

#

we've only seen its frontend capabilities

#

we need to see it in Cursor and stuff

#

but now that I am thinking about it

#

i think NW might be a whole new model

#

and stargazer is the mini

#

NW has to be big af

drifting thorn
#

Is multi-token attention of Meta AI helpful for the situation?

balmy mist
#

why they take it down so fast?

balmy mist
#

i like meta research into thinking without using tokens tho

#

that might be something

drifting thorn
#

making AI more intelligent

balmy mist
#

that could be useful

#

imagine we never get NW back

frozen skiff
#

is nightwhisper still in webdev arena

balmy mist
#

we should continue this and start using this metric

balmy mist
drifting thorn
#

Alibaba proposed a paper called START, in which the reasoning model uses tools by itself during reasoning. I see the same "use tools in reasoning" tendency in Gemini 2.5 Pro now. So I wonder... are the proprietary models already using the tricks that open papers proposed, or even better methods?

balmy mist
#

wow

#

that is nuts

#

can you post that

balmy mist
frozen skiff
#

yeah

drifting thorn
#

just search START by Alibaba on the internet

#
barren prairie
#

There is a model called Anonymous

drifting thorn
#

remember 2.5 was released on 25/3, and the paper was proposed on 6/3

frozen skiff
#

its definitely google

#

its as smart as 2.5 pro in other areas but much better in coding

#

its pretty sh*t in creative writing too

drifting thorn
#

...

#

so sad, i mainly care about creative writing

frozen skiff
#

thats why i liked 24 karat gold

drifting thorn
#

nah 2.5 pro's writing was great

frozen skiff
#

idk

#

it types this certain way idk how to describe it

#

but its annoying

drifting thorn
#

it adheres to my plot and its tone is engaging

keen ferry
#

what is nw?

frozen skiff
#

how do u know he's a real person

#

he might be lying

drifting thorn
#

I can only access Gemini via my laptop, and 2.5's writing kept me glued to my laptop for days

balmy mist
frozen skiff
#

whats agi supposed to mean

#

like smarter than a human

torn mantle
#

yea

#

not happening

balmy mist
#

wait what??

frozen skiff
#

arent most of the ais rn as smart as a human or even smarter

torn mantle
#

but i really believe nightwhisper will threaten their market share

balmy mist
#

but NW is def google based on metadata and they have this lunar thing now

frozen skiff
#

its crazy how fast this Ai stuff is advancing

balmy mist
#

yall tried lunarcall?

frozen skiff
#

yeah

balmy mist
#

it's from google too

#

they are cooking

frozen skiff
#

its kinda sh*t

balmy mist
#

it might be over for OpenAi lol

#

damn really

frozen skiff
#

i havent tried it much but

#

when i tried it its kinda average

balmy mist
#

like where would you rank it?

frozen skiff
#

like

#

based on my very limited interaction with it

#

maybe a bit higher than

#

gemini 2.0 flash

drifting thorn
#

I'd rather look at benchmarks than vague terms like AGI and ASI

balmy mist
#

but benchmarks are getting saturated too

#

i only trust simple bench now

drifting thorn
#

Humanity’s last exam?

#

SimpleQA for my “searching” task

balmy mist
#

yeah they already beat ARC? did v2 come out yet?

frozen skiff
#

What are those new crystal flannel harley models

drifting thorn
ivory schooner
#

Give me back 24k~ Give me back 24k~

balmy mist
#

that was o3 bro

keen beacon
#

nightwhisper is not doing that lol

ocean vortex
#

yeah and even that 4% o3 score came from a setup that isn't realistic to run for consumers ("high-compute 172x"; whatever that means, it's more extreme than o1-pro for sure)

balmy mist
#

if we're honest most SOTA llms are as smart as the average human. like on all the stuff we care about the AI is better than us. it may struggle with novel stuff, which is the final piece we are still figuring out, but on general knowledge and IQ its actually smarter than the average human lol

#

especially with this new generation

ocean vortex
balmy mist
#

damn so that was not even o3 high?

#

wow

keen beacon
#

o3 is several months old anyway

torn mantle
#

lunar/star*/nw all are from google

balmy mist
#

i think the arc test might be unfair to other models not from openai

barren prairie
ocean vortex
balmy mist
#

cause how the heck is gemini 2.5 pro 12% on arc 1?

#

i think they need to allow gemini 2.5 pro to use the same compute as o3

keen beacon
ocean vortex
#

they didn't include high reasoning version because it didn't qualify with that cost per task

balmy mist
#

they need to standardize the test with the same compute time

ocean vortex
#

this barely made the cut. Still not very realistic cost

balmy mist
#

google has the money to extend compute time

#

but they choose not to?

torn mantle
torn mantle
#

oai reasoning approach is the same as deepseek r1 just more scaled

#

google is using different approach for their reasoning

#

but they are getting there

ocean vortex
balmy mist
#

google is charging way less to run their models, so i could imagine if they ramped up the compute time and gave us an o1 pro type experience they would do amazing, because they have the fastest SOTA reasoning time and deliver quality at those speeds

#

imagine gemini 2.5 pro, but the same compute time as o1 pro?

#

that would be nuts

balmy mist
#

i think nightwhisper might be a medium level reasoning model

#

the next tier up from gemini 2.5 pro

#

like they're all gemini 2.5 pro

torn mantle
#

i think they cracked the code for a good coding model

ocean vortex
#

but it's not like high version would have been miles ahead. If we look at arc-1:

balmy mist
#

but higher levels of compute

torn mantle
#

and what we will see is highly finetuned coding models

#

probably

#

gemini coder 1

#

gemini coder 2

ocean vortex
#

so this is roughly 15% increase of the low score

torn mantle
#

these will be highly focused at coding tasks

#

yea

#

probably flash

#

a recent checkpoint of gemini 2.5 flash

ocean vortex
#

now 4% + 15% of that score = 4.6%

#

would have been roughly that
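
A quick check of the arithmetic in the messages above, as a sketch. It assumes the "15%" is a relative gain applied to the 4% low-compute score, which is how the messages read; the variable names are made up for illustration.

```python
# Relative-gain arithmetic from the chat messages above.
# Assumption: "15%" is a relative increase over the low-compute score.
low_score = 4.0        # percent, the o3 score quoted in chat
relative_gain = 0.15   # the rough low-to-high gain quoted in chat

estimated_high = low_score * (1 + relative_gain)
print(f"estimated high-compute score: {estimated_high:.1f}%")
```

So the estimate is a gain of about 0.6 percentage points, not 15 points.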

torn mantle
#

this whole ARC-AGI benchmark doesnt make sense to me tbh

ocean vortex
#

by spending much much much more LOL

torn mantle
#

they are testing it on text based prompts

#

instead of using their vision models

balmy mist
#

let me know the results, my webdev glitching

torn mantle
#

ik there are some limitations but still they should test vision capabilities

#

because give any person a bunch of random arrays with 1,0 and he wont solve that

#

it just doesnt seem practical

ocean vortex
torn mantle
#

"input": [[7, 0, 7], [7, 0, 7], [7, 7, 0]], "output": [[7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 7, 0, 0, 0, 0, 7, 7, 0], [7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 7, 0, 0, 0, 0, 7, 7, 0], [7, 0, 7, 7, 0, 7, 0, 0, 0], [7, 0, 7, 7, 0, 7, 0, 0, 0], [7, 7, 0, 7, 7, 0, 0, 0, 0]]}]

#

i mean whats this?

#

you expect a normal person to solve that?
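
For what it's worth, the pair pasted above does follow a simple rule once it is read as a grid instead of a flat array. The sketch below is one reading of that single example, not an official ARC solution, and `expand` is a hypothetical helper name.

```python
# One rule that reproduces the pasted example: treat the 3x3 input as a stencil.
# Wherever the input has a non-zero cell, paste a full copy of the input into
# the matching 3x3 block of the 9x9 output; everywhere else stays zero.
def expand(grid):
    n = len(grid)
    out = [[0] * (n * n) for _ in range(n * n)]
    for bi in range(n):          # block row
        for bj in range(n):      # block column
            if grid[bi][bj] != 0:
                for i in range(n):
                    for j in range(n):
                        out[bi * n + i][bj * n + j] = grid[i][j]
    return out

example_input = [[7, 0, 7], [7, 0, 7], [7, 7, 0]]
example_output = [
    [7, 0, 7, 0, 0, 0, 7, 0, 7],
    [7, 0, 7, 0, 0, 0, 7, 0, 7],
    [7, 7, 0, 0, 0, 0, 7, 7, 0],
    [7, 0, 7, 0, 0, 0, 7, 0, 7],
    [7, 0, 7, 0, 0, 0, 7, 0, 7],
    [7, 7, 0, 0, 0, 0, 7, 7, 0],
    [7, 0, 7, 7, 0, 7, 0, 0, 0],
    [7, 0, 7, 7, 0, 7, 0, 0, 0],
    [7, 7, 0, 7, 7, 0, 0, 0, 0],
]
print(expand(example_input) == example_output)
```

Drawn as colored cells the pattern is fairly easy to spot; as nested lists of numbers it is much harder, which is the complaint being made here.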

ocean vortex
#
  • vision is 2d
torn mantle
#

well let it be a benchmark for spatial awareness

#

until we figure out how to improve on that

#

openai are cooked

#

they didnt seem to improve much at coding

ocean vortex
balmy mist
#

one is lunarcell

torn mantle
#

i think nightwhisper was trained on high quality data with intensive RLHF and even vision

oblique flint
balmy mist
#

thats not lunar

torn mantle
#

just to get the aesthetics right/styling etc...

balmy mist
#

thats why i think quasar is o4 mini

oblique flint
#

but quasar is not a reasoning model tho right?

balmy mist
#

ahhh

#

i forgot the strategy already lol

balmy mist
#

like not traditionally

torn mantle
balmy mist
#

like i asked it what was bigger 9.9 or 9.11

#

check what it does

#

similar to deepseekv3.1 and the new 4o

#

it has a COT output

#

very weird

oblique flint
#

I hope google releases a competing model in the o3 mini price tier as well. There is a significant gap between 2.5 pro and 2.0 flash pricing, and o3 mini sits in between

torn mantle
#

google prices are already good

#

its just that people lost faith in them because they messed up many times

#

people will be more hyped for a gpt -model than a gemini model

#

but i like what google are doing recently

oblique flint
#

idk, but with ai studio they train on your data, and you have zero integration with stuff like IDEs and I mainly use llms for coding stuff where that integration saves a lot of time

balmy mist
#

lol that was claude 3.5 haiku

#

the other one was lunarcell

#

lunar is mid

torn mantle
balmy mist
#

like it gets the job done but not the best

torn mantle
#

its probably a much smaller version than flash

#

like when we had 7b flash

lime coral
#

Deleted

torn mantle
#

its probably something like that

ocean vortex
drifting thorn
ocean vortex
#

it's basically like one of those tasks in IQ test or from general tests for a job interview

#

you look at test patterns

#

and then need to find a relation

#

to solve another example in a similar way

balmy mist
#

wait lunarcell is actually good

drifting thorn
#

I prefer 'multiple intelligence' than simple IQ tests

ocean vortex
#

I think there's a website for deleted tweets. You can't really delete them lmao

torn mantle
drifting thorn
#

I guess OpenAI is panicking right now, when they are changing plans constantly

#

It shows a messy management

#

And I actually hate the thought of GPT-5 as a free-tier user

ocean vortex
#

like only vision models are being tested

#

text is probably just so you could reproduce it in code tbh

keen beacon
ocean vortex
keen beacon
#

all of them are translated into text (o3 prompt):

balmy mist
keen beacon
#

idk how you came up with that

balmy mist
#

lunarcell

#

google really releasing so many models man

#

their event is on tuesday right?

lime coral
#

9

ocean vortex
#

oh ok. I seem to recall seeing somewhere that a model couldn't be tested on arc-agi since it didn't have vision. Must have been smth else then

balmy mist
#

we are living in a different timeline man

#

ever since covid nothing was the same lmaoo

ocean vortex
#

overreacting to everything

keen beacon
#

ya not getting enough sleep and a lot going on lol

#

sry

ocean vortex
#

all good

torn mantle
#

deepseek are also taking their time on r2

keen beacon
#

qwen team might be cooking up something bigger

ocean vortex
#

but open source has problems hosting it as is

#

the gains with v3.1 are from it being more verbose. Unsure if that would translate to gains with reasoning in the same way as they move to 3.1 for R2... 🤔

#

rather than like just make the chat model closer to reasoning and less different

#

cause I mean like, much of what V3.1 already does now in a standard output it would be doing that in reasoning instead, so there's gonna be some overlap...

novel flame
#

Is nightwhisper still in the battle arena? I really want to test it.....

keen beacon
#

no

#

new google model

balmy mist
#

i dont even know how to rank it tbh

keen beacon
#

just started testing it

#

got my first benchmark question right

#

seems to be a reasoner

balmy mist
#

yo think so?

#

let me know the final results

#

i gave up on it

#

its okay, just not SOTA

keen beacon
lime coral
balmy mist
#

they should have just put it on lmarena

#

but they did not

#

so i judge with what i am given

keen beacon
#

it's on there

balmy mist
#

and its another google model right after they had NW and star which were amazing(especially NW) so its hard to go from NW to lunarcell

balmy mist
#

and its all based on system prompts tbh

#

NW had the same system prompt as lunar does on webdev

keen beacon
#

NW was very much tuned for the specific environment

#

i found it consistently poorer than 2.5 pro in a general sense

balmy mist
#

i didnt NW

#

NW was great on general for me

keen beacon
#

for example when asked to clone UIs, it would do poorer at actually replicating it than 2.5 pro but normally had less bugs

balmy mist
#

i tested it on public SB and it tied with gemini 2.5 pro

#

NW is the best model imo

keen beacon
#

shrug~3 we shall see if/when they release it

balmy mist
#

on general tests it did amazing and the way it interpreted my prompts was kinda uncanny

#

even if they did finetune gemini2.5 pro to coding, its a great model that performs well, just bc its finetuned on a specific usecase does not mean it will not perform well generally

#

i tested NW for 2 days straight, barely even could work lol

keen beacon
#

as did i

#

i test all anonymous models with my personal set

balmy mist
#

never felt that way about a model before

keen beacon
#

nebula was the most blown away ive been by an anonymous model

#

NW impressed me but not as much

balmy mist
#

it follows directions like no other model too

#

i never tried nebula

#

which one was that?

keen beacon
#

got my 2nd prompt wrong (2.5 pro gets it right)

keen beacon
balmy mist
#

i mean i love 2.5 pro as well

#

thats my 2nd fav model after NW, but i think they are the same models

#

NW might just be a higher level of compute for reasoning or a finetuned version

#

cause its a bit slower than 2.5 pro

#

but 2.5 pro is def the best model we have now! like it just craps on everything else lol

#

especially on studio where you have so much customization

keen beacon
#

+1

#

it also seems their rate limits they say exist on ai studio don't actually do anything

#

i've definitely gone over what it is supposed to be several times

balmy mist
#

then i just change accounts lol

#

and ever since then never hit a limit again because i just switch accounts

balmy mist
keen beacon
#

although the o3 results were impressive, they definitely get watered down by the fact that they threw thousands of dollars at a version of the model that was both tuned for arc and given way more attempts at getting it right than others

balmy mist
#

thats what i was thinking

#

and 2.5 pro scored 12%

#

that does not make sense. yeah open ai is good with reasoning, but when you're throwing a bunch of money at making your model think longer, is that really as impressive, like you said? and what if we standardized the compute times? we need a new graphic of that

keen beacon
#

yeah openai seem to have this strategy of

#

releasing the "best" reasoning models but it literally just brute forces the answer

#

like it's way less efficient

#

it can take their models 3-4x as many reasoning tokens as others to get the same answer

#

same for grok 3 thinking

balmy mist
#

exactly!!

#

2.5 pro speed is nuts and its quality is just as good if not better which is wild

#

google might have officially won this year unless gpt5 really blows us away lol

drifting thorn
keen beacon
#

ive seen an instance where qwq 32b can solve it with 10k less tokens (13k vs 23k)

leaden palm
#

and qwq is the master of overthinking

keen beacon
#

o3 mini and qwq are very good at these extremely rote reasoning tasks where 2.5 pro can fall apart

#

o3 mini is the best model yet for that still. these specific tasks do not require world knowledge, etc just pure rote reasoning

misty vault
#

gemini 2.5 no longer free on google ai studio? 😔

misty vault
#

Experimental one is gone for me now

#

Why can I still chat with these models if it has pricing

#

or is that only for api usage

keen beacon
#

yes its unlimited on the website

misty vault
#

oh nice

leaden palm
#

you'll use it for free (with the free limits) until you upgrade

drifting thorn
#

2.5 pro would be godlike if its output limit were 128k

balmy mist
balmy mist
#

thank god lol

keen beacon
#

the aistudio website used to have limits a while back it seems they removed them at some point

balmy mist
#

like you think they removed it this week?

novel flame
misty vault
#

what about gemini web app or or something instead of studio

keen beacon
novel flame
#

I just tested Lunarcall for coding, and it’s not bad, but it’s not on par with Claude 3.7 Sonnet. I would say it’s below 3.7 Sonnet, 3.5 Sonnet, and Gemini Pro 2.5; but maybe on par with o3-mini-high

balmy mist
#

can someone give me example web dev prompt for me to test?

torn mantle
#

Ive tried it on physics simulations and it was blowing sonnet 3.7 out of the water

#

Nightwhisper is some next level coding model

#

Im pretty sure a lot of hard work went into this model

#

Gemini 2.5 pro while its good, it wasnt enjoyable to talk to

#

It has a strong info retrieval which actually makes it less creative

#

U wont get anything unexpected or unexplored areas on its outputs

#

I would like if they include something similar to 24k gold model

drifting thorn
#

Why does 2.5 perform worse than 3.7 on SWE-bench?

keen beacon
#

i don't think anything has beat the original 1.0 ultra in terms of creativity and "human-ness"

torn mantle
rose thicket
barren prairie
#

This test is a little bit hard for them

torn mantle
rose thicket
#

😫

#

Well I got the system of web dev arena

#

System prompt*

balmy mist
#

like the app?

#

whats a REE??

barren prairie
rose thicket
rose thicket
#

Not able to send here

balmy mist
#

just put it in a file

#

a text file

rose thicket
#

I just copy pasted what the model has returned to me

#

But yes it is

balmy mist
#

i got one too but i wanna compare results

rose thicket
#

See

balmy mist
#

thnx

#

and this is for all models on webdev right?

rose thicket
#

Yes

#

Both models returned somewhat same prompt

#

Minor diff

keen beacon
balmy mist
#

lol

barren prairie
balmy mist
#

how is this

oblique flint
balmy mist
torn mantle
#

What was its alias

#

Was it named maverick?

#

I dont remember testing this model

#

Nothing from meta impressed me

thorny drum
#

pretty unclear which one was maverick since llama is pumping out so many models lol

torn mantle
#

Are we sure about this?

thorny drum
#

yes

torn mantle
#

Its probably cybele

#

Or whatever its name is

#

Themis

#

Weird

#

I dont remember such model

#

Benchmarks

keen beacon
#

wait wtf

#

it's really mid

torn mantle
#

xd

#

It has 10m context window

#

That's the only noticeable thing

keen beacon
#

behemoth is coming soon

#

i'll wait for that

torn mantle
#

Behemoth is probably themis

#

Pretty sure

thorny drum
torn mantle
#

That thing was so slow

keen beacon
#

yeah that's more like it

#

holy smokes

torn mantle
keen beacon
#

why not

torn mantle
#

Because they don't reflect how the model really feels

#

Are you impressed with any Meta model added to the Arena?

#

The answer is clearly no

balmy mist
#

wait llama 4 is out?

keen beacon
#

2 of the 3 models

torn mantle
#

But today

balmy mist
#

wtffff

#

10 mill

#

no damn way

torn mantle
#

I think in an hour or so

torn mantle
#

Only*

balmy mist
#

we got vc in here? cause i dont even know how to express that

torn mantle
#

xd

keen beacon
#

likely april 29 will be the behemoth release

#

llamacon

dapper storm
#

So if Behemoth was out today it would be at the top of Lmsys leaderboard?

torn mantle
balmy mist
#

lmaooo they did not show 2.5 pro

#

clowns

#

but still amazing for open source

#

so good job meta

#

wait so now we have a active and total parameters? someone please explain?

#

is this new tech or was i just oblivious to it?

torn mantle
#

No

#

It got popular from deepseek

#

Its MoE architecture

#

Instead of activating all parameters you activate only the necessary ones (experts)
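
A toy sketch of that idea, for anyone who wants to see it in code. This is not how Llama 4 or DeepSeek actually implement it; it only shows the general top-k routing pattern, and the names `top_k_route` and `moe_forward` are made up for illustration. In a real model the router scores come from a learned layer applied to each token; here they are random.

```python
import math
import random

def top_k_route(scores, k):
    """Return the indices of the k highest-scoring experts and softmax weights over them."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]

def moe_forward(x, experts, router_scores, k=2):
    """Run only the k selected experts and mix their outputs by router weight."""
    chosen, weights = top_k_route(router_scores, k)
    mixed = sum(w * experts[i](x) for i, w in zip(chosen, weights))
    return mixed, chosen

# Toy setup: 8 "experts", each just a scalar multiply (an assumption for the demo).
random.seed(0)
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_scores = [random.random() for _ in experts]

output, active = moe_forward(3.0, experts, router_scores, k=2)
print(f"active experts: {active} of {len(experts)} total")
```

"Total parameters" counts all 8 experts; "active parameters" counts only the 2 that actually ran for this input.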

balmy mist
#

ahh okay thanks

#

i love meta again lol

torn mantle
#

lol

rigid crescent
#

the crystal model is a gaslighter that wont own up to mistakes XD

dapper storm
#

If Maverick is ~1420 ELO then surely Behemoth would be SOTA above Gemini 2.5?
Is that thinking incorrect?

balmy mist
#

oh snapp i think they out

dapper storm
#

oh wow I just loaded it yeah

#

it's 1417 ELO without stylecontrol

#

lol stylecontrol is way worse they were lmsys tuning 🤦‍♀️

balmy mist
#

wow an open source model 2nd

#

you using it on openrouter?

vast atlas
#

its being compared to other non reasoning models like 4.5 and 3.7 (not extended thinking)

#

llama 4 reasoning is the reasoning model

balmy mist
#

wow meta cooked

ancient reef
balmy mist
#

i think its smart to not have released the big boy yet

#

just wait a few weeks for the other players to release and train on their outputs lmaoo

balmy mist
dapper storm
#

Still huge error bars on Maverick only 2.5k votes

balmy mist
#

how are yall testing it? on arena?

#

i got it

#

wow its so fast

ancient reef
#

It yaps 😭

torn mantle
balmy mist
#

i tested it on simple bench public and it got everything wrong

balmy mist
#

ill try again

torn mantle
#

Did you really believe their benchmark?

#

Mark was angry for a reason

keen beacon
#

riveroaks may be the 2T model

#

it's really slow

keen beacon
#

and i think more intelligent

torn mantle
dapper storm
#

Verbosity is the best way to raise unweighted ELO

keen beacon
#

llama 4 yaps more than i do

#

i will not tolerate being out yapped

ancient reef
#

Omg, doing with with it hurts

kindred sky
balmy mist
#

dude is a clown

north vale
#

is maverick gonna be open weights? do yall know

balmy mist
#

yeah it is

#

gonna try my pokemon example on it lol

#

bruhh the context window on lmarena doesnt let me 😦

#

so we are getting reasoning

ocean vortex
#

they are f'ing insane with those model sizes.

#

No one is gonna be able to run them

#

and all but one are compromised with low active param count anyway LOL

barren prairie
#

Is llama4 open source ?

ocean vortex
keen beacon
#

it serious took them 2T+ params and over a year to slightly beat the next best base

#

great

#

seriously*

ocean vortex
#

pretty sure it is OS

#

they always have been

ocean vortex
#

you can host it let alone finetune

blazing rune
#

It is very underwhelming

thorn oak
#

what was llama's codename?

raven void
#

it will but for large orga

thorn oak
#

in stealth

blazing rune
#

It only seems good for distillation, but at that point just use Gemini or Claude or OpenAI models

night trout
keen beacon
#

Llama 4's new license comes with several limitations:

- Companies with more than 700 million monthly active users must request a special license from Meta, which Meta can grant or deny at its sole discretion.

- You must prominently display "Built with Llama" on websites, interfaces, documentation, etc.

- Any AI model you create using Llama Materials must include "Llama" at the beginning of its name

- You must include the specific attribution notice in a "Notice" text file with any distribution

- Your use must comply with Meta's separate Acceptable Use Policy (referenced at llama.com/llama4/use-policy)

- Limited license to use "Llama" name only for compliance with the branding requirements

#

facepalm

night trout
#

Nice. I'll try to finish my blog post this weekend on prompts, I'll tag you.

ocean vortex
# night trout Open source doesn't mean "you can run it on your basement server for RP erotica"...

I get the argument cause I used to make a similar one myself lol.

But the last 1-2 years have shown us 405b llama was too big and not very useful at all. This is gonna be even harder to run and realistically only makes sense if you can serve thousands of people with your endpoint. I would understand if there were a clear benefit from bigger models. But it's more like the opposite is true with RL training now. Besides, this exact size is saturated now, with Deepseek taking full advantage of it

night trout
balmy mist
#

scout is stupidly fast

ocean vortex
#

I'm all in for huge models when they make sense, but I don't think this applies here. There's a fuckton of redundant capacity and we don't have enough data to take advantage of such size imo

#

maybe they should have updated their 405b model first, at least once. Before releasing this lol

night trout
ocean vortex
#

dense 405b has potential to perform better than this "behemoth" anyway, since it's dense and here it's 288b active

ocean vortex
#

gpt4o is smaller than 4-turbo

#

4-turbo is smaller than og gpt4

night trout
#

Because it sure seems like it.

ocean vortex
#

and even without that...

#

gpt4.5 is much bigger than gpt4o

#

obviously

night trout
ocean vortex
#

sorry but I don't feel like arguing with someone stating "The two biggest models on the planet are Gemini 2.0 and GPT4o," with a straight face lmao

#

no offense

night trout
ocean vortex
#

it's more like it's impossible to argue with someone thinking 2+2 = 5

#

but whatever floats your boat I suppose

night trout
#

Oh I see what's going on here. You think biggest = more params.

#

Biggest = most popular. Widest application.

#

Yeesh.

ocean vortex
#

LMAO

#

it only makes sense in the context of big in size models

night trout
ocean vortex
#

"The two biggest models on the planet are Gemini 2.0 and GPT4o"

#

??

#

2.0 pro is not 2.0 gemini? 🤣

night trout
#

Gemini 2.0 is an entire class of models which includes Flash Lite, Flash, Flash Thinking, and Pro.

ocean vortex
#

there's no "gemini 2.0" model lmfao

night trout
#

how are you this dense

#

Probably, yeah.

ocean vortex
night trout
# ocean vortex it's you who is f'ing dense, how does 2.0 pro not qualify to talk about when you...

mate you're drunk go get some sleep https://en.wikipedia.org/wiki/Set_theory

ocean vortex
#

"The two biggest models on the planet are Gemini 2.0 and GPT4o" - this honestly sounds like some clickbaity title by some idiot on the internet

#

gemini 2.0 was never as popular as cgpt, claude or deepseek, if that's how you intended to say it

#

neither of their 2.0 models

#

if you don't want to talk about 2.0 pro specifically, well then... all the other 2.0 models are worse

#

2.0 flash had a brief period of decent popularity with imagen but it kinda got denied by gpt4o before it even had a chance to blow up

night trout
#

....

barren prairie
eager mica
#

Haven't had the chance to try the models yet, but I imagined they would be considerably smaller than what came out.

ocean vortex
barren prairie
#

I mean the 4o

ocean vortex
#

flash-thinking performs worse now

barren prairie
torn mantle
#

yea this maverick model is so ruined

ocean vortex
#

when it was just released, it was beating gpt4o on some things yeah

torn mantle
#

one of the worst models ive tried on multilingual

night trout
ocean vortex
#

though it was never at any point anywhere close in popularity

barren prairie
torn mantle
#

maverick so far :

bad at instruction following
bad at coding tasks
bad at multilingual
good at creativity

torn mantle
#

yea im not using it anymore

ocean vortex
#

not global popularity

#

and it's only up because it's free

night trout
ocean vortex
#

of course people are gonna use more of what they can for free

night trout
somber niche
#

I'm gonna give it a fair shake, but my sentiments are similar. For being two big MoEs that are better than Deepseek they... don't really seem better than Deepseek. The restrictive license is kinda the nail in the coffin atm

ocean vortex
#

It's amazing that I need to spell it out catgrin

balmy mist
#

anybody did more tests with these models?

barren prairie
torn mantle
#

who invited them

#

is grok 3 really a good model

#

first impression was okay, but after using the model a lot then you figure out that its nothing special

night trout
#

How do you rank it?

night trout
ocean vortex
#

it seems to beat both gpt4.5 and 3.7 as well as deepseek 3.1 overall from what we know and what can be tested

#

not better everywhere, but overall ahead somewhat

night trout
#

Brute-force money-scale MVPs are kinda Elon Musk's thing.

balmy mist
#

lmaoo download

#

anybody can download them?

north vale
ocean vortex
north vale
ocean vortex
#

it is a newer model. 405b is ancient by now. If they chose to update 405b it would have performed much better than it did a long time ago...

north vale
#

right, which they didn't, because it's so incredibly expensive to train for no reason

#

405b is indeed ancient

#

which is why it's completely eclipsed by llama 4 models

#

so why would behemoth not be way better than 405b

ocean vortex
north vale
#

so yes

#

it won't be close

ocean vortex
#

?

#

compute required to train behemoth is fairly comparable to 405b dense, that is my point. Almost 300b active. And you need much much more memory to host that MoE than 405b dense

#

then you add extended context on top, and memory requirements become insane

balmy mist
#

gemini2.5 pro crossword puzzle, after some tinkering i got the right system prompt

north vale
#

the compute is used much more efficiently in behemoth than in 405b. i also think behemoth will have spent more compute to get trained but it's hard to tell. basically to get 405b to the quality of behemoth you'd need to spend like 5x-15x more compute than behemoth i suspect

balmy mist
#

can you share stuff you make in liveweave?

ocean vortex
night trout
balmy mist
#

i can try them, which ones you want me to try?

north vale
#

I don't think it's a clear cut which has more potential.
I think it's clear cut for the reasons above

ocean vortex
balmy mist
#

the key to these llms are def the system prompts

night trout
#

Sounds like an interesting benchmark prompt.

north vale
ocean vortex
balmy mist
north vale
#

idk you were saying something wasn't clear cut when it was

#

that's what i was focused on

#

you can talk about other things i don't mind

ocean vortex
#

like we saw 70b llama doing really well against MoE models with less active and much more total param etc

north vale
#

"if you can stomach a bit of extra compute" ?
i am just looking at their potential relative to how long they have been trained or are likely to be trained in the future? 405b will not be trained enough to come anywhere near behemoth's current capabilities so i don't see the relevance

ocean vortex
night trout
ocean vortex
#

but is much harder to host/run

ocean vortex
#

with that track record, it's not a given that they will be continuing with behemoth either lmao

#

especially now that there's deepseek

#

it was unlikely a thing when they started training

ocean vortex
#

honestly, I'm kind of surprised they didn't do RL training with their existing 70b

#

that seems like the easiest and fastest way to get gains

north vale
#

to the current level of capability, behemoth is an order of magnitude faster to train than 405B would be, i'd think? i'd be interested if someone had info on that tho

#

i just don't agree with your claims at all

#

it's like you live in a different universe where models haven't gotten way way way better in the last year

ocean vortex
#

that means you need more memory than 405b dense

#

and training itself is gonna be roughly 4:3 times faster

#

at best
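
A back-of-envelope version of this argument, as a rough sketch only. It uses the parameter counts quoted in chat (405b dense, roughly 288b active and roughly 2T total for behemoth) and the common rule of thumb that training compute per token scales with the number of active parameters. The 16-bit weight size is an assumption, and real costs depend on much more than this.

```python
# Rough comparison of a dense 405B model and a ~2T-total / 288B-active MoE.
# All figures are approximate and taken from the chat, not from official specs.
dense_params = 405e9        # dense: every parameter is used for every token
moe_active_params = 288e9   # MoE: parameters used per token, as quoted in chat
moe_total_params = 2e12     # MoE: all experts combined ("2T"), as quoted in chat
bytes_per_param = 2         # assumption: 16-bit weights

# Per-token training compute scales roughly with active parameters.
compute_ratio = dense_params / moe_active_params

# Weight memory scales with total parameters, since every expert must be loaded.
dense_mem_tb = dense_params * bytes_per_param / 1e12
moe_mem_tb = moe_total_params * bytes_per_param / 1e12

print(f"dense / MoE compute per token: {compute_ratio:.2f}x")
print(f"weights: dense {dense_mem_tb:.2f} TB vs MoE {moe_mem_tb:.2f} TB")
```

Under these assumptions the per-token compute gap is about 1.4x, close to the 4:3 figure above, while the MoE needs several times more memory just to hold its weights.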

north vale
#

but 405b has 405b weights so will be by default much worse and will need to be trained a lot longer to get to the same capability threshold

#

no?

balmy mist
#

i think i wanna marry gemini2.5 pro

#

why is it so good?

#

like i love using it in studio

#

so much fun

ocean vortex
north vale
ocean vortex
north vale
#

that's where we agree. they are pretty similar in terms of inference and training time (within 4x of each other) but behemoth has a way way way better ability to scale

#

405b scaling law will be way worse to reach behemoth level capabilities

#

i'll ask my ml friend if she knows specific theoretic numbers on that

night trout
barren prairie
#

It is still dumb on WhatsApp even after the update

#

Llama4

ocean vortex
# north vale that's where we agree. they are pretty similar in terms of inference and trainin...

That is debatable... Like if we take deepseek that is MoE too, it has poor spatial awareness just like gpt4o. Why? Most logical explanation the way I see it is low active param count. That's also true for all llama4 models except the biggest / behemoth. Only 17b active. MoE have their advantages and they make sense for mass hosting and maybe even RL training potentially (speculation at this point), but they certainly have their drawbacks too and relatively low active param count is always gonna be a compromise.

#

you can't have lower training requirements for free, that's always gonna come at a cost. If there's less to train, there's less work that was done, bluntly speaking

north vale
#

do you have more info on spatial awareness wrt MoEs? I would assume insofar as that's not just correlated to capabilities, it's about omni / training on images

torn mantle
night trout
#

Well that's sad.

torn mantle
#

one of the worst models ive tried so far

#

couldnt even get a single question right

balmy mist
#

lol

torn mantle
#

thats just one of many

ocean vortex
# north vale do you have more info on spacial awareness wrt MoEs? I would assume insofar as t...

take let's say arc-agi but without reasoning models (those are harder to compare directly when we are talking about arch). gpt4o score is very low relative to sonnet. That's also true for simple-bench which is another benchmark for the most part based on spatial awareness. Then we also have web development and web coding arena where gpt4o stands no chance against sonnet. Same applies to deepseek to a varying degree.

torn mantle
#

also really really bad at multilingual

#

llama 405b was actually good at it

#

idk what happened

wheat onyx
ocean vortex
#

web development is relevant as much of it is based on css and visual design

#

ofc I'm assuming here that sonnet is bigger, but I do realise there's no official size if we want to be 100% accurate. But if you take say gpt4.5, you will see it has better spatial awareness than gpt4o too, or even o1

balmy mist
#

should I not like meta again?

torn mantle
#

i think leo guessed it right

#

themis/cybele is probably behemoth

balmy mist
#

which one is that?

#

where can i find it?

torn mantle
balmy mist
#

did yall test it?

torn mantle
barren prairie
balmy mist
#

damn

#

i missed it

#

so gemini2.5 pro is still best right?

barren prairie
ocean vortex
#

with o1 it's actually interesting that they had such a massive improvement for arc-agi. To be completely honest I wouldn't rule out contamination lol. But it could also be that it does help with those specific tasks. But o1 can still struggle a lot with svg or web design comparatively

torn mantle
ocean vortex
#

btw not saying that sonnet is a better model overall, but for this specific thing, it is. 😉

torn mantle
#

sonnet is kinda unique ngl

#

you feel like it has its own intelligence not some robotic ai

barren prairie
#

Sonnet is the best at coding.

But so dumb at other things

ocean vortex
#

then gpt4.5 was just too extreme to train in time to compete adequately

torn mantle
#

its really hard to make a model like that

#

well it will change with nightwhisper ig

#

being good at coding isnt just providing a one-shot working code

#

but there are a lot of nuances

barren prairie
ocean vortex
#

if it's not web, it's honestly a coin toss but I would go with 2.5 pro

torn mantle
#

it used to be so bad at desktop apps

ocean vortex
#

even for web 2.5 is gonna be very decent

#

I would be shocked if 2.5 is not significantly bigger than gpt4o all things considered

torn mantle
#

tf is flannel

ocean vortex
torn mantle
#

24k gold model was so fun to chat with

#

i can see myself using it a lot

#

for unserious stuff

raven void
#

sonnet 3.7 is like

#

the recipe is good

#

but needs much more compute

balmy mist
#

true

raven void
#

I'm also very excited for the big llama especially the base and reasoning version

somber niche
#

Given that the new Llama 4s can't answer my questions correctly while 24k gold could, my guess is that 24k gold was a preview of behemoth

balmy mist
#

lol

somber niche
#

Which is... worrying. I'd really hoped it was a lot smaller than that

cloud meadow
#

😛

#

😛

somber niche
#

Funny thing is Deepseek can also get it at a fraction of the size (and likely cost) soooo

keen beacon
#

does anyone have any hard benchmark prompts?

#

would be much appreciated

eager mica
#

Did Meta actually finish training Maverick? From the model card, it saw half the training tokens of Scout.

somber niche
#

I noticed that too

#

Both on the training token side and the context length side

#

My guess is probably not

keen beacon
#

rushed launch and they literally had more than a year

#

shambles

somber niche
#

Put up a couple of my prompts in #share-prompts , which the current L4 models seem to fail (to date, only 24K got the sushi question right, and it didn't get the Deltarune one right)

#

I genuinely don't know what they were thinking with this release tbh

#

This has pretty much killed any chance they have, in my mind, of being competitive with some of the other players in the area

#

I'd be nicer if they were like, 1/10 of the size they are lol

fleet lintel
#

so many complaints about Llama 4 not being that good, but why is the lmarena score that high?
how are these models hacking this score?

balmy mist
#

lol ik how

#

but i cant say

#

NDA

somber niche
#

I also know how lol

#

Or at least I have a pretty decent guess I think

fleet lintel
#

DM me 🙂

eager mica
balmy mist
barren prairie
barren prairie
somber niche
#

My guess is pretty much what Suikamelon said, rather than committing to a single instruct style, they tried a plethora of different models with different conversational styles and RL tunes for each size category, and then picked the ones which gave the best score in the arena

#

In the end, the spider and 24k type styles won out

#

So those are the ones they went with for the final release

fleet lintel
#

and what is "Style Control"?

fleet lintel
eager mica
somber niche
#

They do, but I don't think anyone did it to nearly the extent Meta did

#

We went through like, fifty models in the past couple of weeks lol

barren prairie
#

Only fifty models

raven void
fleet lintel
#

LMArena needs to fix this problem. people will stop believing its value

keen beacon
#

deepseek will eat them for lunch

azure minnow
#

Hi

eager mica
#

Interesting data.

somber niche
#

Huh

keen beacon
#

that's crazy

somber niche
#

So L3.3 took less time than both L4 models combined

#

(I'm guessing the MoE architecture reduced the training compute per token)

#

That's weird though - what were they doing with all of that compute then?

eager mica
#

A possibility is that they delayed training the models for as long as possible due to the copyright lawsuit(s).

somber niche
#

Oh yeah the Anna's Archive stuff

eager mica
#

They have so much compute they might have trained both models in 1 week or less.

somber niche
#

Remember hearing about that

keen beacon
eager mica
eager mica
somber niche
#

Agreed

#

I don't think the rushed Maverick release did them any favors, they probably should have just waited until it was finished training

eager mica
#

Also, where's the omnimodality?

somber niche
#

Yeah, no audio to speak of

#

That was supposed to be their killer feature

eager mica
#

It should have had voice input/output, maybe even image out.

somber niche
#

This entire release is pretty bizarre tbh - spamming the arena with models, the small amount of training compute, the lack of additional modalities which they said were supposed to be their major focus

balmy mist
#

like what shows up

eager mica
#

Only that? I imagined it would be more complex than that.

#

I don't feel I usually preferred those Meta models due to the longer responses, but mainly because they had a more relatable tone, or were more informative, didn't exhibit slop, etc.

#

OpenAI and Google models also generally had long responses.

#

They cycled through so many models, several were probably targeted tests, earlier checkpoints, etc.

somber niche
#

Yeah I think it's basically impossible to determine which checkpoint we're seeing in the full releases

hardy pecan
#

What model do we think was maverick? That we had tested earlier

#

Spider? Themis? Cybele? 24_karat_gold?

somber niche
#

That said, my theory is 24k is probably a Behemoth preview, just because for whatever reason, I'm not able to get either Scout or Behemoth to give a single correct answer to coding questions which it got in one shot

#

Even if it is, it might still be one of multiple checkpoints, though

eager mica
#

spider appeared many times in my tests.

leaden palm
#

okay call me naive but perhaps maverick = maverick

eager mica
#

But in the end 24_karat_gold was the one that I've seen the most.

viral notch
#

24_karat_gold is only good for its styling imo lol

#

it follows instructions kinda meh

eager mica
#

24kg: 49 times; spider: 23 times, cybele: 21 times

leaden palm
#

could be the deployed maverick has the same prompt too

#

(it does in direct chat)

eager mica
#

Possibly, with a mellowed down system prompt perhaps.

#

Ah I see that.

#

At some point flannel, harley and crystal got introduced in the past 24 hours, and I don't think they had a crazy system prompt.

#

Eventually I got tired of testing them when I saw that 24_karat_gold won all my personal tests.

#

I'm not 100% sure but crystal seemed the best one.

hardy pecan
#

it feels like it might be "scout" on their website

#

It's fast but dumb

eager mica
#

Both Maverick and Scout have the same amount of active parameters, so they should be equally fast, I think.

#

Of course.

sage raptor
#

what model is spider

eager mica
#

Try the one on Chatbot Arena (Direct Chat).

#

There have been suggestions that the one from OpenRouter isn't working correctly.

frozen skiff
#

Did llama 4 release?

#

No way

thorny drum
#

well apparently the system prompt tells maverick to only answer the question 50% of the time

eager mica
#

I don't know the details; I haven't tried it personally there. I have tried it on Chatbot Arena though.

frozen skiff
eager mica
#

Not of its responses but you can test it easily.

frozen skiff
#

Those crystal harley flannel models were llama 4 wallahi

#

They act the exact same

#

Or maybe theyre other llama models but llama 4 maverick is way better

neat apex
#

llama 4 looks like 3, at least it has huge context

#

xd

frozen skiff
#

Its way better

#

The reasoning

#

Llama 3 was a dumbass

neat apex
#

hm, yea right

#

overreacted here

frozen skiff
#

I like llama 4 maverick

#

Wym

neat apex
#

but no better than 3.1

#

it is lying a lot actually

frozen skiff
#

Are u using it on openrouter or arena

#

Its shet on openrouter for some reason

neat apex
#

fireworks

#

fireworks is well known to have their models a little better, yet it is mid

frozen skiff
#

Is it free

#

I never used fire works

#

It says error generating response

hardy pecan
#

meta.ai model got 6/20 for simplebench, no bueno

neat apex
#

maybe it is a dumbass if you don't pay imo

frozen skiff
#

llama 4 is good in typing style

#

Like 24 karat gold

#

Its fun to talk to

strong shell
#

Llama 4 is here with 4 models!🦙🦙🦙🦙

I'm back to share with the world what the team has been cooking. Today we're open-sourcing 2 state-of-the-art omni models (Scout, Maverick - including pre-trained weights), previewing a 3rd one (Behemoth) and will drop a reasoning one soon.

frozen skiff
#

Spider is behemoth

strong shell
#

y u thnk so?

frozen skiff
#

Because

#

We already have maverick and scout

#

Why would they put that spider emoji

#

They're hinting

#

Spider out of all emojis

#

Its either their reasoning model or behemoth

eager mica
sage raptor
frozen skiff
#

Idk

#

We dont have spider anymore

#

I dont think so

#

It got removed a while ago

eager mica
#

I mostly really just follow the LLM threads.

#

For what it's worth, the official Llama 4 inference code uses temperature=0.6 and top_p=0.9 as far as I can see.
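
For reference, a minimal generic sketch of what those two settings do at sampling time. This is not Meta's inference code; the function name `sample_top_p` and the example logits are hypothetical, and it only assumes the usual definitions of temperature scaling and nucleus (top-p) filtering.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample a token index using temperature scaling followed by nucleus (top-p) filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [logit / temperature for logit in logits]
    # Numerically stable softmax over the scaled logits.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw one token from the kept set, weighted by its original probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Hypothetical logits for a 4-token vocabulary.
example_logits = [5.0, 1.0, 0.0, -1.0]
print(sample_top_p(example_logits))
```

With a low temperature like 0.6, a clearly dominant token ends up holding almost all the probability mass, so the top-p cutoff tends to leave very few candidates.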

#

Hm, it looks like they have int4 quantization loading code too.

young otter
leaden palm
#

what the llama

plain zinc
#

Unfortunately, I'm not a programmer.

#

This is not my screenshot, but a screenshot of my friend's discord bot.

#

which tells you when and which model was added to the arena or, conversely, disappeared

frozen skiff
#

Maverick is amazing

plain zinc
#

@modern knoll

young otter
#

yep thats what i linked

eager mica
#

I don't think fp8 alone would reduce quality appreciably.
On-the-fly INT4 sounds like what bitsandbytes would do.

#

It's versioned 03-26-experimental. Is it from 10 days ago? Or did the training start on that date? We can't know, of course.

young otter
frozen skiff
#

Dude wtf

#

Maverick teaches u how to

#

Do illegal stuff

#

Make fentanyl and even teaches u how to get the precursors

#

💀

#

Yes

#

Its a real method

#

Yes im a chemist

#

Lil bro

#

Yes

#

Who said i cant make it cus its illegal

#

Walter white made meth and it was illegal lil bro it doesnt matter

#

Rules are made to be broken

frozen skiff
#

Walter white cant make fentanyl

#

Hes too weak

#

Im ahmad fring

ivory schooner
#

Ah, give me back 24k, give me back 24k~

#

If Chatbot Arena doesn't bring it back, could someone deploy a standalone mirror of the model so we can keep playing with it?

balmy mist
#

wait do we have full access to llama 4 now?

ivory schooner
#

(realization) So the upcoming Behemoth seems to be exactly the original 24k

#

Then I'll wait for Behemoth to launch before playing again~

#

But... give me back 24k~

balmy mist
#

lmaoo they gotta pray man

raven void
#

I mean

#

any model can teach you that even Gemini

keen beacon
#

@alpine coral can you dm me your question set if possible? i think i may have found o3 full on a red-teaming platform and am putting it through its paces

#

o3 (?) simplebench "try yourself" public questions score:

6/10, iirc best performance on this specific set minus gpt-4.5 preview

lime coral
balmy mist
#

yo i have tested llama4 all day and they def lied lmaoo

balmy mist
#

they really bs them benchmarks

torn mantle
balmy mist
#

and they said the lmarena was an experimental chat version for the results wtfff

#

there is no way this is second

torn mantle
#

I have no idea how it reached that no2 spot

balmy mist
#

like impossible

#

lmaoo

torn mantle
#

Its really one of the worst releases so far

balmy mist
#

lmaoo

torn mantle
#

Even their llama 405b was better

balmy mist
#

the only thing going for it is that 10 mill context but i have yet to test that or see that

#

has anyone else tried that?

torn mantle
#

The model is so dumb, its information retrieval is so bad, it hallucinates a lot, it doesn't follow instructions well, its so bad at multi-turn

#

I can go all day about this model

#

I just dont understand why did they even bother to release this

balmy mist
#

this is a joke

torn mantle
#

Lol

#

Its not beating anything

#

Even qwen35b is better

balmy mist
#

lmaooo

#

this is actually devastating

torn mantle
#

I won't touch it again

balmy mist
#

like we basically can ignore everything they said

#

that 10 mill context is a lie too

#

bc everything else is a lie

#

i feel lowkey bad for zuc

torn mantle
#

I knew they self-trained on benchmarks

balmy mist
#

he screwed his company with the meta stuff and now this

#

just wait until simple bench gets their hands on this

#

gonna be wild

#

maybe we prompting llama4 wrong

torn mantle
#

Themis or behemoth should be on gpt4o level

#

But at what cost?

balmy mist
#

there has to be a reason

balmy mist
torn mantle
#

They are not taking llms training seriously obviously

torn mantle
#

We tried it in lmarena already

balmy mist
#

really??

torn mantle
#

Its at best gpt4o level

#

Its not sota or anything

balmy mist
#

thats prob why they still training it

#

what do they do now?

torn mantle
#

But that's just so disappointing

#

And we still don't have reasoning models

balmy mist
#

yeah its the promises that is annoying

#

just dont release, we can wait

#

but i think its for investors

torn mantle
#

Their instruct model is so bad that they had to wait for behemoth to train a reasoner

#

You cant just use maverick with reasoning

#

Like that model is so dumb

#

It doesn't understand anything

balmy mist
#

yeah maverick is so badd

#

like it makes me mad

#

i dont really see that big a difference between scout and maverick

#

tbh

torn mantle
#

They didn't bring anything new to the table

#

Qwen3 will be a huge blow to them

balmy mist
#

"10 mill context"

#

if that is debunked meta might go under

torn mantle
balmy mist
#

but lmarena needs to find the ranking asap

torn mantle
#

What would i do with it exactly?

#

Write novels?