#general | Arena | Page 12

drifting thorn Apr 5, 2025, 1:30 PM

#

agi is a vague definition

balmy mist Apr 5, 2025, 1:30 PM

#

like it started and we are currently in that period of the agi forming

drifting thorn Apr 5, 2025, 1:30 PM

#

Some agents can already be called AGI

balmy mist Apr 5, 2025, 1:30 PM

#

why office worker?

#

we already have ai that can replace images tbh

#

like i can give my mom an image and she will not be able to tell its ai

drifting thorn Apr 5, 2025, 1:30 PM

#

since robot vacuum cleaners can replace human cleaners

balmy mist Apr 5, 2025, 1:31 PM

#

you technically can replace low-level devs like jrs out of college or interns

#

any webdev for sure lol, like a junior tho

#

we gotta test it more

#

we only seen its frontend capabilities

#

we need to see it in Cursor and stuff

#

but now that I am thinking about it

#

i think NW might be a whole new model

#

and stargazer is the mini

#

NW has to be big af

drifting thorn Apr 5, 2025, 1:32 PM

#

Is multi-token attention of Meta AI helpful for the situation?

balmy mist Apr 5, 2025, 1:32 PM

#

why they take it down so fast?

balmy mist Apr 5, 2025, 1:33 PM

#

drifting thorn Is multi-token attention of Meta AI helpful for the situation?

which situation

#

i like meta research into thinking without using tokens tho

#

that might be something

drifting thorn Apr 5, 2025, 1:33 PM

#

making AI more intelligent

balmy mist Apr 5, 2025, 1:33 PM

#

that could be useful

#

imagine we never get NW back

frozen skiff Apr 5, 2025, 1:35 PM

#

is night whisper still in webdev arena

balmy mist Apr 5, 2025, 1:36 PM

#

we should continue this and start using this metric

balmy mist Apr 5, 2025, 1:36 PM

#

frozen skiff is night whisper still in webdev arena

dont make me cry bro

drifting thorn Apr 5, 2025, 1:36 PM

#

Alibaba proposed a paper called START, in which the reasoning model uses tools itself during reasoning. I see the same "use tools in reasoning" tendency in Gemini 2.5 Pro now. So I guess... are the proprietary models already using the tricks proposed in open papers, or even better methods than those?

balmy mist Apr 5, 2025, 1:36 PM

#

wow

#

that is nuts

#

can you post that

balmy mist Apr 5, 2025, 1:37 PM

#

frozen skiff is night whisper still in webdev arena

have you tried NW?

frozen skiff Apr 5, 2025, 1:37 PM

#

yeah

drifting thorn Apr 5, 2025, 1:37 PM

#

just search START by Alibaba in the internet

#

https://arxiv.org/abs/2503.04625

arXiv.org

START: Self-taught Reasoner with Tools

Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introdu...

barren prairie Apr 5, 2025, 1:38 PM

#

There is a model called Anonymous

drifting thorn Apr 5, 2025, 1:38 PM

#

remember 2.5 was released on 25/3, and the paper was proposed on 6/3

frozen skiff Apr 5, 2025, 1:38 PM

#

its definitely google

#

its as smart as 2.5 pro in other areas but much better in coding

#

its pretty sh*t in creative writing too

drifting thorn Apr 5, 2025, 1:39 PM

#

...

#

so sad, i mainly care about creative writing

frozen skiff Apr 5, 2025, 1:40 PM

#

drifting thorn so sad, i mainly care about creative writing

yeah google models suck at that

#

thats why i liked 24 karat gold

drifting thorn Apr 5, 2025, 1:40 PM

#

nah 2.5 pro's writing was great

frozen skiff Apr 5, 2025, 1:40 PM

#

idk

#

it types this certain way idk how to describe it

#

but its annoying

drifting thorn Apr 5, 2025, 1:40 PM

#

it adheres to my plot and its tone is engaging

keen ferry Apr 5, 2025, 1:41 PM

#

what is nw?

frozen skiff Apr 5, 2025, 1:41 PM

#

how do u know he's a real person

#

he might be lying

drifting thorn Apr 5, 2025, 1:41 PM

#

I can only access Gemini via my laptop, and 2.5's writing kept me glued to my laptop for days

balmy mist Apr 5, 2025, 1:42 PM

#

barren prairie There is a model called Anonymous

where

frozen skiff Apr 5, 2025, 1:43 PM

#

whats agi supposed to mean

#

like smarter than a human

torn mantle Apr 5, 2025, 1:43 PM

#

yea

#

not happening

balmy mist Apr 5, 2025, 1:43 PM

#

wait what??

frozen skiff Apr 5, 2025, 1:43 PM

#

isnt most of the ais rn smart as a human or even smarter

torn mantle Apr 5, 2025, 1:43 PM

#

but i really believe nightwhisper will threaten their market share

balmy mist Apr 5, 2025, 1:44 PM

#

torn mantle but i really believe nightwhisper will threaten their market share

it will

#

but NW is def google based on metadata and they have this lunar thing now

frozen skiff Apr 5, 2025, 1:44 PM

#

its crazy how fast this Ai stuff is advancing

balmy mist Apr 5, 2025, 1:44 PM

#

yall tried lunarcall?

frozen skiff Apr 5, 2025, 1:44 PM

#

yeah

balmy mist Apr 5, 2025, 1:44 PM

#

it from google too

#

they are cooking

frozen skiff Apr 5, 2025, 1:44 PM

#

its kinda sh*t

balmy mist Apr 5, 2025, 1:45 PM

#

it might be over for OpenAi lol

#

damn really

frozen skiff Apr 5, 2025, 1:45 PM

#

i havent tried it much but

#

when i tried it its kinda average

balmy mist Apr 5, 2025, 1:45 PM

#

like where would you rank it?

frozen skiff Apr 5, 2025, 1:45 PM

#

like

#

based on my very limited interaction with it

#

maybe a bit higher than

#

gemini 2.0 flash

drifting thorn Apr 5, 2025, 1:45 PM

#

I’d rather look at the benchmarks than these vague terms like AGI and ASI

balmy mist Apr 5, 2025, 1:46 PM

#

drifting thorn I’d rather look at the benchmarks than these vague terms like AGI and ASI

same

#

but benchmarks are getting saturated too

#

i only trust simple bench now

drifting thorn Apr 5, 2025, 1:46 PM

#

Humanity’s last exam?

#

SimpleQA for my “searching” task

balmy mist Apr 5, 2025, 1:47 PM

#

yeah they already beat ARC? did v2 come out yet?

frozen skiff Apr 5, 2025, 1:47 PM

#

What are those new crystal flannel harley models

drifting thorn Apr 5, 2025, 1:47 PM

#

balmy mist yeah they already beat ARC? did v2 come out yet?

It was beaten by o3-mini

ivory schooner Apr 5, 2025, 1:48 PM

#

Give me back 24k~ Give me back 24k~

balmy mist Apr 5, 2025, 1:48 PM

#

drifting thorn It was beaten by o3-mini

that was not mini

#

that was o3 bro

keen beacon Apr 5, 2025, 1:49 PM

#

nightwhisper is not doing that lol

ocean vortex Apr 5, 2025, 1:51 PM

#

yeah and even that 4% o3 score was not realistic to be running like that for consumers ("high-compute 172x", whatever that means it's more extreme than o1-pro for sure)

balmy mist Apr 5, 2025, 1:52 PM

#

if we're honest, most SOTA llms are as smart as the average human. like on all the stuff we care about, the AI is better than us. it may struggle at novel stuff, which is the final piece we are still figuring out, but on general knowledge and IQ its actually smarter than the average human lol

#

especially with this new generation

ocean vortex Apr 5, 2025, 1:52 PM

#

balmy mist Apr 5, 2025, 1:52 PM

#

damn so that was not even o3 high?

#

wow

keen beacon Apr 5, 2025, 1:53 PM

#

o3 is several months old anyway

torn mantle Apr 5, 2025, 1:53 PM

#

balmy mist but NW is def google based on metadata and they have this lunar thing now

yea its from google

#

lunar/star*/nw all are from google

balmy mist Apr 5, 2025, 1:54 PM

#

i think the arc test might be unfair to other models not from openai

barren prairie Apr 5, 2025, 1:54 PM

#

balmy mist where

On arena

ocean vortex Apr 5, 2025, 1:54 PM

#

balmy mist damn so that was not even o3 high?

it was smth like o3-pro+ (low) lol. But it was already way too expensive and inefficient even with low reasoning

balmy mist Apr 5, 2025, 1:54 PM

#

cause how the heck is gemini 2.5 pro 12% on arc 1?

#

i think they need to allow gemini 2.5 pro to use the same compute as o3

keen beacon Apr 5, 2025, 1:55 PM

#

balmy mist cause how the heck is gemini 2.5 pro 12% on arc 1?

openai are ahead in reasoning anyway

ocean vortex Apr 5, 2025, 1:55 PM

#

they didn't include high reasoning version because it didn't qualify with that cost per task

balmy mist Apr 5, 2025, 1:55 PM

#

they need to standardize the test with the same compute time

ocean vortex Apr 5, 2025, 1:55 PM

#

this barely made the cut. Still not very realistic cost

balmy mist Apr 5, 2025, 1:56 PM

#

keen beacon openai are ahead in reasoning anyway

true but why are they the only ones with low, medium and high, while others dont have that?

#

google has the money to extend compute time

#

but they choose not to?

torn mantle Apr 5, 2025, 1:56 PM

#

keen beacon openai are ahead in reasoning anyway

i wouldnt say way ahead

balmy mist Apr 5, 2025, 1:56 PM

#

ocean vortex they didn't include high reasoning version because it didn't qualify with that c...

as in it was too expensive?

torn mantle Apr 5, 2025, 1:57 PM

#

oai reasoning approach is the same as deepseek r1 just more scaled

#

google is using different approach for their reasoning

#

but they are getting there

ocean vortex Apr 5, 2025, 1:57 PM

#

balmy mist as in it was too expensive?

yeah. They have $10k compute limit for the entire benchmark to be put on leaderboard. High version cost more than that

balmy mist Apr 5, 2025, 1:58 PM

#

google is charging way less to run their models, so i could imagine if they ramped up the compute time and gave us an o1 pro type experience they would do amazing, because they have the fastest SOTA reasoning time and deliver quality at those speeds

#

imagine gemini 2.5 pro, but the same compute time as o1 pro?

#

that would be nuts

balmy mist Apr 5, 2025, 1:59 PM

#

ocean vortex yeah. They have $10k compute limit for the entire benchmark to be put on leaderb...

ahh i see

#

i think nightwhisper might be a medium level reasoning model

#

the next tier up from gemini 2.5 pro

#

like they all gemini 2.5 pro

torn mantle Apr 5, 2025, 1:59 PM

#

i think they cracked the code for a good coding model

ocean vortex Apr 5, 2025, 2:00 PM

#

but it's not like high version would have been miles ahead. If we look at arc-1:

balmy mist Apr 5, 2025, 2:00 PM

#

but higher levels of compute

torn mantle Apr 5, 2025, 2:00 PM

#

and what we will see is highly finetuned coding models

#

probably

#

gemini coder 1

#

gemini coder 2

ocean vortex Apr 5, 2025, 2:00 PM

#

so this is roughly 15% increase of the low score

torn mantle Apr 5, 2025, 2:00 PM

#

these will be highly focused at coding tasks

#

yea

#

probably flash

#

a recent checkpoint of gemini 2.5 flash

ocean vortex Apr 5, 2025, 2:00 PM

#

now 4% + 15% of that score = 4.6%

#

would have been roughly that
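
The arithmetic in the messages above can be sketched as a quick check. This assumes, as ocean vortex does, that the roughly 15% relative gain between the low and high versions on arc-1 would carry over to the 4% score; the arc-1 numbers themselves were in an image that is not preserved here, and `project_score` is a hypothetical helper name used only for illustration.

```python
def project_score(low_score_pct: float, relative_gain: float) -> float:
    """Scale a score (in percent) by a relative gain, e.g. 0.15 for +15%."""
    return low_score_pct * (1 + relative_gain)


low = 4.0    # the "4%" low-compute score quoted in the chat
gain = 0.15  # "roughly 15% increase of the low score" (assumed to transfer)

projected = project_score(low, gain)
print(f"projected high-compute score: {projected:.1f}%")
print(f"absolute gain: {projected - low:.1f} percentage points")
```

The point is that a 15% relative gain on a 4% score is well under one percentage point in absolute terms.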

torn mantle Apr 5, 2025, 2:01 PM

#

this whole ARC-AGI benchmark doesnt make sense to me tbh

ocean vortex Apr 5, 2025, 2:01 PM

#

by spending much much much more LOL

torn mantle Apr 5, 2025, 2:01 PM

#

they are testing it on text based prompts

#

instead of using their vision models

balmy mist Apr 5, 2025, 2:01 PM

#

let me know the results, my webdev glitching

torn mantle Apr 5, 2025, 2:01 PM

#

ik there are some limitations but still they should test vision capabilities

#

because give any person a bunch of random arrays with 1,0 and he wont solve that

#

it just doesnt seem practical

ocean vortex Apr 5, 2025, 2:02 PM

#

torn mantle instead of using their vision models

ironically, spatial awareness does not actually come from vision capabilities

torn mantle Apr 5, 2025, 2:02 PM

#

"input": [[7, 0, 7], [7, 0, 7], [7, 7, 0]], "output": [[7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 7, 0, 0, 0, 0, 7, 7, 0], [7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 0, 7, 0, 0, 0, 7, 0, 7], [7, 7, 0, 0, 0, 0, 7, 7, 0], [7, 0, 7, 7, 0, 7, 0, 0, 0], [7, 0, 7, 7, 0, 7, 0, 0, 0], [7, 7, 0, 7, 7, 0, 0, 0, 0]]}]

#

i mean whats this?

#

you expect a normal person to solve that?

ocean vortex Apr 5, 2025, 2:02 PM

#

vision is 2d

torn mantle Apr 5, 2025, 2:03 PM

#

well let it be a benchmark for spatial awareness

#

until we figure out how to improve on that

#

openai are cooked

#

they didnt seem to improve much at coding

ocean vortex Apr 5, 2025, 2:04 PM

#

torn mantle you expect a normal person to solve that?

yeah arc-agi is very confusing at first. But it actually makes sense given what it is. You can't solve this without contamination just by scraping the internet

balmy mist Apr 5, 2025, 2:04 PM

#

which is better:
https://3000-iu6jeldzrfv5qj0kq3w36-87045d8f.e2b-foxtrot.dev

or this:

https://3000-ir48gqr1x8jkh2pvpfp4p-a7c98fd5.e2b-foxtrot.dev

Funny Button Game

A simple game where you click a button that moves randomly.

#

one is lunarcell

torn mantle Apr 5, 2025, 2:05 PM

#

i think nightwhisper included high quality data with intensive RLHF and even vision

oblique flint Apr 5, 2025, 2:05 PM

#

torn mantle they didnt seem to improve much at coding

o3 mini is still holding up well imo, especially for the cost.

balmy mist Apr 5, 2025, 2:05 PM

#

thats not lunar

torn mantle Apr 5, 2025, 2:05 PM

#

just to get the aesthetics right/styling etc...

balmy mist Apr 5, 2025, 2:05 PM

#

oblique flint o3 mini is still holding up well imo, especially for the cost.

quasar is prob gonna replace o3 mini

#

thats why i think quasar is o4 mini

oblique flint Apr 5, 2025, 2:06 PM

#

but quasar is not a reasoning model tho right?

balmy mist Apr 5, 2025, 2:06 PM

#

ahhh

#

i forgot the strategy already lol

balmy mist Apr 5, 2025, 2:06 PM

#

oblique flint but quasar is not a reasoning model tho right?

it kinda is but not

#

like not traditionally

torn mantle Apr 5, 2025, 2:06 PM

#

ocean vortex yeah arc-agi is very confusing at first. But it actually makes sense given what ...

i get that but just the way we are testing this seems odd to me

balmy mist Apr 5, 2025, 2:06 PM

#

oblique flint but quasar is not a reasoning model tho right?

but when you prompt it, check its outputs

#

like i asked it what was bigger 9.9 or 9.11

#

check what it does

#

similar to deepseekv3.1 and the new 4o

#

it has a COT output

#

very weird

oblique flint Apr 5, 2025, 2:08 PM

#

I hope google releases a competing model in the o3 mini price tier as well. There is kind of a significant gap between 2.5 pro and 2.0 flash pricing, which o3 mini is inbetween

torn mantle Apr 5, 2025, 2:09 PM

#

oblique flint I hope google releases a competing model in the o3 mini price tier as well. Ther...

2.5 flash

#

google prices are already good

#

its just that people lost faith in them because they messed up many times

#

people will be more hyped for a gpt -model than a gemini model

#

but i like what google are doing recently

oblique flint Apr 5, 2025, 2:10 PM

#

idk, but with ai studio they train on your data, and you have zero integration with stuff like IDEs and I mainly use llms for coding stuff where that integration saves a lot of time

balmy mist Apr 5, 2025, 2:10 PM

#

lol that was claude 3.5 haiku

#

the other one was lunarcell

#

Screenshot_2025-04-05_at_10.10.34_AM.png

#

lunar is mid

torn mantle Apr 5, 2025, 2:11 PM

#

balmy mist lunar is mid

its not bad

balmy mist Apr 5, 2025, 2:11 PM

#

like it gets the job done but not the best

torn mantle Apr 5, 2025, 2:11 PM

#

its probably a much smaller version than flash

#

like when we had 7b flash

lime coral Apr 5, 2025, 2:11 PM

#

Deleted

torn mantle Apr 5, 2025, 2:11 PM

#

its probably something like that

ocean vortex Apr 5, 2025, 2:11 PM

#

torn mantle i get that but just the way we are testing this seems odd to me

if you look hard enough you will notice that these numbers are not actually random and represent objects in space.

drifting thorn Apr 5, 2025, 2:12 PM

#

torn mantle people will be more hyped for a gpt -model than a gemini model

I can still remember the "Elmer's glue pizza"

ocean vortex Apr 5, 2025, 2:13 PM

#

it's basically like one of those tasks in IQ test or from general tests for a job interview

#

you look at test patterns

#

and then need to find a relation

#

to solve another example in a similar way
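
A minimal sketch of the relation hidden in the grid torn mantle pasted earlier. It assumes the rule is "place a copy of the input grid wherever the input itself has a non-zero cell, and zeros elsewhere". That reading is inferred from the single input/output pair in the chat, and `expand_grid` is a hypothetical helper name, not part of any ARC tooling.

```python
def expand_grid(grid):
    """Tile a copy of a square `grid` into every block whose
    corresponding cell in `grid` is non-zero; other blocks stay 0."""
    n = len(grid)
    out = [[0] * (n * n) for _ in range(n * n)]
    for block_row in range(n):
        for block_col in range(n):
            if grid[block_row][block_col] != 0:
                for r in range(n):
                    for c in range(n):
                        out[block_row * n + r][block_col * n + c] = grid[r][c]
    return out


# The pair from the chat: a 3x3 input and its 9x9 output.
example_input = [[7, 0, 7], [7, 0, 7], [7, 7, 0]]
example_output = [
    [7, 0, 7, 0, 0, 0, 7, 0, 7],
    [7, 0, 7, 0, 0, 0, 7, 0, 7],
    [7, 7, 0, 0, 0, 0, 7, 7, 0],
    [7, 0, 7, 0, 0, 0, 7, 0, 7],
    [7, 0, 7, 0, 0, 0, 7, 0, 7],
    [7, 7, 0, 0, 0, 0, 7, 7, 0],
    [7, 0, 7, 7, 0, 7, 0, 0, 0],
    [7, 0, 7, 7, 0, 7, 0, 0, 0],
    [7, 7, 0, 7, 7, 0, 0, 0, 0],
]

print(expand_grid(example_input) == example_output)
```

Seen this way the arrays are a shape drawn out of copies of itself, which is the kind of relation a solver has to infer from a few examples and then apply to a new input.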

balmy mist Apr 5, 2025, 2:14 PM

#

wait lunarcell is actually good

drifting thorn Apr 5, 2025, 2:15 PM

#

I prefer 'multiple intelligences' to simple IQ tests

ocean vortex Apr 5, 2025, 2:15 PM

#

I think there's a website for deleted tweets. You can't really delete them lmao

torn mantle Apr 5, 2025, 2:16 PM

#

ocean vortex if you look hard enough you will notice that these numbers are not actually rand...

ik ik, it just doesnt seem practical to me

drifting thorn Apr 5, 2025, 2:17 PM

#

I guess OpenAI is panicking right now, when they are changing plans constantly

#

It shows a messy management

#

And I actually hate the thought of GPT-5 as a free-tier user

ocean vortex Apr 5, 2025, 2:18 PM

#

torn mantle ik ik, it just doesnt seem practical to me

I doubt they are testing them on text representations though

#

like only vision models are being tested

#

text is probably just so you could reproduce it in code tbh

#

https://arcprize.org/play

ARC Prize

ARC Prize - Play the Game

Easy for humans, hard for AI. Try ARC-AGI.

keen beacon Apr 5, 2025, 2:19 PM

#

ocean vortex I doubt they are testing them on text representations though

they are lol

keen beacon Apr 5, 2025, 2:19 PM

#

ocean vortex like only vision models are being tested

they would get zero if they had to use vision lol

ocean vortex Apr 5, 2025, 2:19 PM

#

keen beacon they are lol

then why only vision models are on leaderboard?

keen beacon Apr 5, 2025, 2:20 PM

#

ocean vortex then why only vision models are on leaderboard?

coincidence lmao

#

all of them are translated into text (o3 prompt):

#

https://github.com/arcprize/model_baseline/blob/main/prompt_example_o3.md

balmy mist Apr 5, 2025, 2:20 PM

#

https://3000-ihkapfhbgjilphl23rnc7-bc29f5f8.e2b-foxtrot.dev @hollow ivy

Who Am I?

Discover who I am!

keen beacon Apr 5, 2025, 2:21 PM

#

ocean vortex then why only vision models are on leaderboard?

??? r1 is on that leaderboard and its not a vision model too btw

#

idk how you came up with that

balmy mist Apr 5, 2025, 2:21 PM

#

lunarcell

#

google really releasing so many models man

#

their event is on tuesday right?

lime coral Apr 5, 2025, 2:22 PM

#

9

ocean vortex Apr 5, 2025, 2:22 PM

#

oh ok. I seem to recall seeing somewhere that a model couldn't be tested on arc-agi since it didn't have vision. Must have been smth else then

balmy mist Apr 5, 2025, 2:23 PM

#

is this real?
https://x.com/minchoi/status/1908521182149668958

Min Choi (@minchoi) on X

Kawasaki's hydrogen powered four-legged robot, CORLEO, just revealed at Osaka Kansai Expo

#

we are living in a different timeline man

#

ever since covid nothing was the same lmaoo

ocean vortex Apr 5, 2025, 2:25 PM

#

keen beacon ??? r1 is on that leaderboard and its not a vision model too btw

btw you seem very touchy lately for some reason lmao

#

overreacting to everything

keen beacon Apr 5, 2025, 2:26 PM

#

ya not getting enough sleep and a lot going on lol

#

sry

ocean vortex Apr 5, 2025, 2:26 PM

#

catgrin

#

all good

torn mantle Apr 5, 2025, 2:32 PM

#

deepseek are also taking their time on r2

keen beacon Apr 5, 2025, 2:41 PM

#

qwen team might be cooking up something bigger

ocean vortex Apr 5, 2025, 2:41 PM

#

torn mantle deepseek are also taking their time on r2

curious to see what they will come up with. Seems like they are going for longer outputs

#

but open source has problems hosting it as is

#

the gains with v3.1 are from it being more verbose. Unsure if that would translate to gains with reasoning in the same way as they move to 3.1 for R2... 🤔

#

rather than like just make the chat model closer to reasoning and less different

#

cause I mean like, much of what V3.1 already does now in a standard output it would be doing that in reasoning instead, so there's gonna be some overlap...

novel flame Apr 5, 2025, 3:02 PM

#

Is nightwhisper still in the battle arena? I really want to test it.....

keen beacon Apr 5, 2025, 3:02 PM

#

no

#

new google model

balmy mist Apr 5, 2025, 3:04 PM

#

keen beacon new google model

its mid

#

i dont even know how to rank it tbh

keen beacon Apr 5, 2025, 3:05 PM

#

just started testing it

#

got my first benchmark question right

#

seems to be a reasoner

balmy mist Apr 5, 2025, 3:05 PM

#

you think so?

#

let me know the final results

#

i gave up on it

#

its okay, just not SOTA

keen beacon Apr 5, 2025, 3:05 PM

#

balmy mist let me know the final results

took about 15s to start streaming

lime coral Apr 5, 2025, 3:05 PM

#

balmy mist its mid

Can we stop judging generalist models on web dev

balmy mist Apr 5, 2025, 3:06 PM

#

lime coral Can we stop judging generalist models on web dev

why not if they put it only on webdev then you are judged on that

#

they should have just put it on lmarena

#

but they did not

#

so i judge with what i am given

keen beacon Apr 5, 2025, 3:07 PM

#

balmy mist but they did not

they did

#

it's on there

#

balmy mist Apr 5, 2025, 3:07 PM

#

and its another google model right after they had NW and star which were amazing(especially NW) so its hard to go from NW to lunarcell

balmy mist Apr 5, 2025, 3:07 PM

#

keen beacon they did

well its mid on webdev

#

and its all based on system prompts tbh

#

NW had the same system prompt as lunar does on webdev

keen beacon Apr 5, 2025, 3:08 PM

#

NW was very much tuned for the specific environment

#

i found it consistently poorer than 2.5 pro in a general sense

balmy mist Apr 5, 2025, 3:08 PM

#

i didnt NW

#

NW was great on general for me

keen beacon Apr 5, 2025, 3:09 PM

#

for example when asked to clone UIs, it would do poorer at actually replicating it than 2.5 pro but normally had less bugs

balmy mist Apr 5, 2025, 3:09 PM

#

i tested it on public SB and it tied with gemini 2.5 pro

#

NW is the best model imo

keen beacon Apr 5, 2025, 3:09 PM

#

:shrug: we shall see if/when they release it

balmy mist Apr 5, 2025, 3:09 PM

#

on general tests it did amazing and the way it interpreted my prompts was kinda uncanny

#

even if they did finetune gemini2.5 pro to coding, its a great model that performs well, just bc its finetuned on a specific usecase does not mean it will not perform well generally

#

i tested NW for 2 days straight, barely even could work lol

keen beacon Apr 5, 2025, 3:11 PM

#

as did i

#

i test all anonymous models with my personal set

balmy mist Apr 5, 2025, 3:11 PM

#

never felt that way about a model before

keen beacon Apr 5, 2025, 3:11 PM

#

nebula was the most blown away ive been by an anonymous model

#

NW impressed me but not as much

balmy mist Apr 5, 2025, 3:11 PM

#

it follows directions like no other model too

#

i never tried nebula

#

which one was that?

keen beacon Apr 5, 2025, 3:12 PM

#

got my 2nd prompt wrong (2.5 pro gets it right)

keen beacon Apr 5, 2025, 3:12 PM

#

balmy mist which one was that?

it was 2.5 pro

balmy mist Apr 5, 2025, 3:12 PM

#

i mean i love 2.5 pro as well

#

thats my 2nd fav model after NW, but i think they are the same models

#

NW might just be a higher level of compute for reasoning or a finetuned version

#

cause its a bit slower than 2.5 pro

#

but 2.5 pro is def the best model we have now! like it just craps on everything else lol

#

especially on studio where you have so much customization

keen beacon Apr 5, 2025, 3:16 PM

#

+1

#

it also seems their rate limits they say exist on ai studio don't actually do anything

#

i've definitely gone over what it is supposed to be several times

balmy mist Apr 5, 2025, 3:17 PM

#

keen beacon it also seems their rate limits they say exist on ai studio don't actually do an...

yeah i hit the rate limit once on my account when it first came out

#

then i just change accounts lol

#

and ever since then never hit a limit again because i just switch accounts

balmy mist Apr 5, 2025, 3:21 PM

#

keen beacon i test all anonymous models with my personal set

how do you feel about the ARC test and how Open Ai gets to test with different levels of compute(or compute time) vs gemini 2.5 pro only having one test?

keen beacon Apr 5, 2025, 3:22 PM

#

although the o3 results were impressive they definitely get watered down by the fact they threw thousands of dollars at a version of the model both tuned for arc and that was given way more attempts at getting it right compared to others

balmy mist Apr 5, 2025, 3:23 PM

#

thats what i was thinking

#

and 2.5 pro scored 12%

#

that does not make sense, yeah open ai is good with reasoning but when you throwing a bunch of money to make your model think longer is that really as impressive like you said? and what if we standardized the compute times? we need a new graphic of that

keen beacon Apr 5, 2025, 3:26 PM

#

yeah openai seem to have this strategy of

#

releasing the "best" reasoning models but it literally just brute forces the answer

#

like it's way less efficient

#

it can take their models 3-4x as many reasoning tokens as others to get the same answer

#

same for grok 3 thinking

balmy mist Apr 5, 2025, 3:30 PM

#

exactly!!

#

2.5 pro speed is nuts and its quality is just as good if not better which is wild

#

google might have officially won this year unless gpt5 really blows us away lol

drifting thorn Apr 5, 2025, 3:31 PM

#

keen beacon Apr 5, 2025, 3:36 PM

#

keen beacon it can take their models 3-4x as many reasoning tokens as others to get the same...

it depends really. if its an extremely long rote reasoning task, 2.5 pro can be a lot worse

#

ive seen an instance where qwq 32b can solve it with 10k less tokens (13k vs 23k)

leaden palm Apr 5, 2025, 3:38 PM

#

and qwq is the master of overthinking

keen beacon Apr 5, 2025, 3:39 PM

#

o3 mini and qwq are very good at these extremely rote reasoning tasks where 2.5 pro can fall apart

#

o3 mini is the best model yet for that still. these specific tasks do not require world knowledge, etc just pure rote reasoning

misty vault Apr 5, 2025, 3:41 PM

#

gemini 2.5 no longer free on google ai studio? 😔

leaden palm Apr 5, 2025, 3:45 PM

#

misty vault gemini 2.5 no longer free on google ai studio? 😔

?

misty vault Apr 5, 2025, 3:46 PM

#

Experimental one is gone for me now

#

#

#

Why can I still chat with these models if it has pricing

#

or is that only for api usage

keen beacon Apr 5, 2025, 3:48 PM

#

yes its unlimited on the website

misty vault Apr 5, 2025, 3:48 PM

#

oh nice

leaden palm Apr 5, 2025, 3:50 PM

#

misty vault Why can I still chat with these models if it has pricing

so do all other models

#

you'll use it for free (with the free limits) until you upgrade

drifting thorn Apr 5, 2025, 3:51 PM

#

2.5 pro would be godlike if its output is 128k

balmy mist Apr 5, 2025, 3:56 PM

#

keen beacon yes its unlimited on the website

on website like on studio right? i been using studio today and wanna make sure i aint using money lol

keen beacon Apr 5, 2025, 3:57 PM

#

balmy mist on website like on studio right? i been using studio today and wanna make sure i...

yeah

balmy mist Apr 5, 2025, 3:57 PM

#

thank god lol

keen beacon Apr 5, 2025, 4:00 PM

#

the aistudio website used to have limits a while back it seems they removed them at some point

balmy mist Apr 5, 2025, 4:02 PM

#

like you think they removed it this week?

novel flame Apr 5, 2025, 4:08 PM

#

keen beacon yes its unlimited on the website

Unlimited but rate limited. 10 requests per minute I believe

misty vault Apr 5, 2025, 4:08 PM

#

what about gemini web app or or something instead of studio

keen beacon Apr 5, 2025, 4:10 PM

#

balmy mist like you think they removed it this week?

No it was removed before Gemini 2 I think

novel flame Apr 5, 2025, 4:13 PM

#

I just tested Lunarcall for coding, and it’s not bad, but it’s not on par with Claude 3.7 Sonnet. I would say it’s below 3.7 Sonnet, 3.5 Sonnet, and Gemini Pro 2.5; but maybe on par with o3-mini-high

balmy mist Apr 5, 2025, 4:43 PM

#

keen beacon No it was removed before Gemini 2 I think

i got rate limited like with my account, not able to use it until next day when gemini 2.5 pro just came out so maybe the limits are really high

#

can someone give me example web dev prompt for me to test?

torn mantle Apr 5, 2025, 5:12 PM

#

keen beacon NW impressed me but not as much

quite the opposite for me

#

Ive tried it on physics simulations and it was blowing sonnet 3.7 out of the water

#

Nightwhisper is some next level coding model

#

Im pretty sure a lot of hard work went into this model

#

Gemini 2.5 pro while its good, it wasnt enjoyable to talk to

#

It has a strong info retrieval which actually makes it less creative

#

U wont get anything unexpected or unexplored areas on its outputs

#

I would like if they include something similar to 24k gold model

drifting thorn Apr 5, 2025, 5:19 PM

#

Why does 2.5 perform worse than 3.7 on SWE-bench?

keen beacon Apr 5, 2025, 5:23 PM

#

torn mantle Gemini 2.5 pro while its good, it wasnt enjoyable to talk to

unfortunately that just seems to be something reasoners are worse at

#

i don't think anything has beat the original 1.0 ultra in terms of creativity and "human-ness"

torn mantle Apr 5, 2025, 5:23 PM

#

drifting thorn Why does 2.5 perform worse than 3.7 on SWE-bench?

How much did it score?

rose thicket Apr 5, 2025, 5:29 PM

#

torn mantle Nightwhisper is some next level coding model

I can't find it !

barren prairie Apr 5, 2025, 5:30 PM

#

balmy mist can someone give me example web dev prompt for me to test?

Cross words test 🤣

#

This test is a little bit hard for them

torn mantle Apr 5, 2025, 5:31 PM

#

rose thicket I can't find it !

It was removed

rose thicket Apr 5, 2025, 5:33 PM

#

😫

#

Well I got the system of web dev arena

#

System prompt*

balmy mist Apr 5, 2025, 5:45 PM

#

rose thicket Well I got the system of web dev arena

can you send it?

#

can yall see this?
https://liveweave.com/7rVAzw

#

like the app?

#

whats a REE??

barren prairie Apr 5, 2025, 5:55 PM

#

balmy mist whats a REE??

Give them harder and long words 😈🔥 nihahahhaa

rose thicket Apr 5, 2025, 5:58 PM

#

balmy mist can you send it?

Yeah

rose thicket Apr 5, 2025, 5:59 PM

#

balmy mist can you send it?

Sry bro, discord has a 2000 character limit

#

Not able to send here

balmy mist Apr 5, 2025, 6:00 PM

#

rose thicket Sry bro, discord has a 2000 character limit

there is no way the system prompt is that long

#

just put it in a file

#

a text file

rose thicket Apr 5, 2025, 6:00 PM

#

I just copy pasted what the model has returned to me

#

But yes it is

balmy mist Apr 5, 2025, 6:01 PM

#

i got one too but i wanna compare results

rose thicket Apr 5, 2025, 6:03 PM

#

📎 text.text

#

See

balmy mist Apr 5, 2025, 6:05 PM

#

thnx

#

and this is for all models on webdev right?

rose thicket Apr 5, 2025, 6:06 PM

#

Yes

#

Both models returned somewhat same prompt

#

Minor diff

keen beacon Apr 5, 2025, 6:20 PM

#

https://x.com/legit_api/status/1908584812694339776

ʟᴇɢɪᴛ (@legit_api) on X

Llama 4 Omni is preparing for launch

these new pages were added today

official release any day this month 🤷‍♂️

balmy mist Apr 5, 2025, 6:25 PM

#

lol

barren prairie Apr 5, 2025, 6:27 PM

#

keen beacon https://x.com/legit_api/status/1908584812694339776

Spider launch 😁

balmy mist Apr 5, 2025, 6:31 PM

#

barren prairie Spider launch 😁

https://liveweave.com/7rVAzw

#

how is this

oblique flint Apr 5, 2025, 6:32 PM

#

keen beacon https://x.com/legit_api/status/1908584812694339776

it sucks that llama.cpp doesnt support all that multimodal stuff yet

balmy mist Apr 5, 2025, 6:33 PM

#

torn mantle Apr 5, 2025, 6:51 PM

#

What was its alias

#

Was it named maverick?

#

I dont remember testing this model

#

Nothing from meta impressed me

thorny drum Apr 5, 2025, 6:52 PM

#

pretty unclear which one was maverick since llama is pumping out so many models lol

torn mantle Apr 5, 2025, 6:52 PM

#

Are we sure about this?

thorny drum Apr 5, 2025, 6:52 PM

#

yes

torn mantle Apr 5, 2025, 6:52 PM

#

Its probably cybele

#

Or whatever its name is

#

Themis

#

Weird

#

I dont remember such model

#

https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md

GitHub

llama-models/models/llama4/MODEL_CARD.md at main · meta-llama/llam...

Utilities intended for use with Llama models. Contribute to meta-llama/llama-models development by creating an account on GitHub.

#

Benchmarks

keen beacon Apr 5, 2025, 6:57 PM

#

wait wtf

#

it's really mid

torn mantle Apr 5, 2025, 6:57 PM

#

xd

#

It has 10m context window

#

That's the only noticeable thing

keen beacon Apr 5, 2025, 6:58 PM

#

https://ai.meta.com/blog/llama-4-multimodal-intelligence/ oh this is the big boy

#

behemoth is coming soon

#

i'll wait for that

torn mantle Apr 5, 2025, 6:58 PM

#

Behemoth is probably themis

#

Pretty sure

thorny drum Apr 5, 2025, 6:59 PM

#

torn mantle It has 10m context window

wasnt there that whole thing where the dude who gave google the 1m context window went to fb ai?

torn mantle Apr 5, 2025, 6:59 PM

#

That thing was so slow

keen beacon Apr 5, 2025, 6:59 PM

#

yeah that's more like it

#

holy smokes

torn mantle Apr 5, 2025, 6:59 PM

#

keen beacon

I dont trust their benchmarks tbh

keen beacon Apr 5, 2025, 7:00 PM

#

why not

torn mantle Apr 5, 2025, 7:01 PM

#

Because they don't reflect what the model really feels like to use

#

Are you impressed with any Meta model added to the Arena?

#

The answer is clearly no

balmy mist Apr 5, 2025, 7:03 PM

#

wait llama 4 is out?

keen beacon Apr 5, 2025, 7:03 PM

#

2 of the 3 models

torn mantle Apr 5, 2025, 7:03 PM

#

balmy mist wait llama 4 is out?

Not yet

#

But today

balmy mist Apr 5, 2025, 7:03 PM

#

wtffff

#

10 mill

#

no damn way

torn mantle Apr 5, 2025, 7:03 PM

#

I think in an hour or so

torn mantle Apr 5, 2025, 7:03 PM

#

balmy mist 10 mill

That's the smaller model on

#

Only*

balmy mist Apr 5, 2025, 7:03 PM

#

we got vc in here? cause i dont even know how to express that

torn mantle Apr 5, 2025, 7:04 PM

#

xd

keen beacon Apr 5, 2025, 7:04 PM

#

likely april 29 will be the behemoth release

#

llamacon

dapper storm Apr 5, 2025, 7:07 PM

#

So if Behemoth was out today it would be at the top of Lmsys leaderboard?

torn mantle Apr 5, 2025, 7:07 PM

#

https://x.com/Ahmad_Al_Dahle/status/1908595680828154198

Ahmad Al-Dahle (@Ahmad_Al_Dahle) on X

Introducing our first set of Llama 4 models!

We’ve been hard at work doing a complete re-design of the Llama series. I’m so excited to share it with the world today and mark another major milestone for the Llama herd as we release the *first* open source models in the Llama 4

balmy mist Apr 5, 2025, 7:07 PM

#

lmaooo they did not show 2.5 pro

#

clowns

#

but still amazing for open source

#

so good job meta

#

wait so now we have active and total parameters? someone please explain?

#

is this new tech or was i just oblivious to it?

torn mantle Apr 5, 2025, 7:09 PM

#

No

#

It got popular from deepseek

#

Its MoE architecture

#

Instead of activating all parameters, you activate only the necessary ones (experts)
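(To make the active vs. total distinction concrete, here is a toy sketch of top-k expert routing in plain Python. It is an illustration under simplifying assumptions, not Llama 4's or DeepSeek's actual implementation: the "experts" are stand-in functions, and the names `moe_layer`, `make_expert` and `router_weights` are hypothetical.)

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, router_weights, experts, top_k=1):
    """Route one token vector x to its top_k experts and mix their outputs."""
    # Router: one score per expert (dot product of x with that expert's router row).
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(scores)
    # Only the top_k experts run; the rest are skipped, so their parameters stay inactive.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        for j, yj in enumerate(y):
            out[j] += (probs[i] / norm) * yj
    return out, chosen


def make_expert(scale):
    """Toy 'expert': a stand-in for a feed-forward block."""
    return lambda x: [scale * xi for xi in x]


experts = [make_expert(1.0), make_expert(2.0), make_expert(3.0)]
router_weights = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
out, chosen = moe_layer([1.0, 0.0], router_weights, experts, top_k=1)
print(chosen, out)
```

All three experts count toward the total parameters, but for this token only one of them is evaluated; that is the sense in which a model like Scout is described as 109B total with 17B active.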

balmy mist Apr 5, 2025, 7:11 PM

#

ahh okay thanks

#

i love meta again lol

torn mantle Apr 5, 2025, 7:15 PM

#

lol

rigid crescent Apr 5, 2025, 7:19 PM

#

the crystal model is a gaslighter that wont own up to mistakes XD

dapper storm Apr 5, 2025, 7:19 PM

#

If Maverick is ~1420 ELO then surely Behemoth would be SOTA above Gemini 2.5?
Is that thinking incorrect?
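(For a sense of what a rating gap means, here is a small sketch using the standard Elo expected-score formula. This assumes Arena scores can be read on the usual 400-point Elo scale; the 100-point gap below is a made-up illustration, not a measured gap between any two models.)

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score (win probability) of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


if __name__ == "__main__":
    # Equal ratings give a coin flip.
    print(expected_score(1417, 1417))
    # A hypothetical 100-point lead.
    print(round(expected_score(1517, 1417), 3))
```

Under this formula a lead of a few points barely moves the head-to-head win rate, which is why error bars matter when a model only has a few thousand votes.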

balmy mist Apr 5, 2025, 7:20 PM

#

oh snap i think they're out

dapper storm Apr 5, 2025, 7:20 PM

#

oh wow I just loaded it yeah

#

it's 1417 ELO without stylecontrol

#

lol with stylecontrol it's way worse, they were tuning for lmsys 🤦‍♀️

balmy mist Apr 5, 2025, 7:20 PM

#

wow an open source model 2nd

#

you using it on openrouter?

vast atlas Apr 5, 2025, 7:21 PM

#

dapper storm If Maverick is ~1420 ELO then surely Behemoth would be SOTA above Gemini 2.5? Is...

it doesnt seem like behemoth is a reasoning model

#

its being compared to other non reasoning models like 4.5 and 3.7 (not extended thinking)

#

llama 4 reasoning is the reasoning model

balmy mist Apr 5, 2025, 7:22 PM

#

wow meta cooked

balmy mist Apr 5, 2025, 7:23 PM

#

i think its smart to not have released the big boy yet

#

just wait a few weeks for the other players to release and train on their outputs lmaoo

balmy mist Apr 5, 2025, 7:24 PM

#

vast atlas its being compared to other non reasoning models like 4.5 and 3.7 (not extended ...

where do you see that comparison? i am only seeing maverick being compared

dapper storm Apr 5, 2025, 7:24 PM

#

Still huge error bars on Maverick, only 2.5k votes

balmy mist Apr 5, 2025, 7:25 PM

#

how are yall testing it? on arena?

#

i got it

#

wow its so fast

ancient reef Apr 5, 2025, 7:27 PM

#

It yaps 😭

torn mantle Apr 5, 2025, 7:28 PM

#

https://x.com/teortaxesTex/status/1908602241046528218

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxe...

First reaction on Meta Llama 4 launch: disappointment
No local model. I think they can't beat Gemma density.
Scout 109B/17A bizarrely forgoes finegrained sparsity despite all the research in its favor, maybe to pander to low tech providers.
Maverick is just fatter DeepSeek V2.

balmy mist Apr 5, 2025, 7:29 PM

#

i tested it on simple bench public and it got everything wrong

torn mantle Apr 5, 2025, 7:29 PM

#

balmy mist i tested it on simple bench public and it got everything wrong

xd

balmy mist Apr 5, 2025, 7:29 PM

#

ill try again

torn mantle Apr 5, 2025, 7:29 PM

#

Did you really believe their benchmark?

#

Mark was angry for a reason

keen beacon Apr 5, 2025, 7:29 PM

#

riveroaks may be the 2T model

#

it's really slow

vast atlas Apr 5, 2025, 7:29 PM

#

balmy mist where do you see that comparison? i am only seeing maverick being compared

keen beacon Apr 5, 2025, 7:29 PM

#

and i think more intelligent

torn mantle Apr 5, 2025, 7:29 PM

#

keen beacon and i think more intelligent

Didn't get it yet

dapper storm Apr 5, 2025, 7:30 PM

#

Verbosity is the best way to boost unweighted ELO

keen beacon Apr 5, 2025, 7:30 PM

#

llama 4 yaps more than i do

#

i will not tolerate being out yapped

ancient reef Apr 5, 2025, 7:31 PM

#

Omg, doing with with it hurts

kindred sky Apr 5, 2025, 7:32 PM

#

https://discord.gg/j6kxQ4krtc @everyone

balmy mist Apr 5, 2025, 7:33 PM

#

dude is a clown

north vale Apr 5, 2025, 7:36 PM

#

is maverick gonna be open weights? do yall know

balmy mist Apr 5, 2025, 7:38 PM

#

yeah it is

#

gonna try my pokemon example on it lol

#

bruhh the context window on lmarena doesnt let me 😦

#

https://x.com/btibor91/status/1908604595745808889

Tibor Blaho (@btibor91) on X

Llama 4 Reasoning is coming soon

https://t.co/4gDz26XEra

#

so we are getting reasoning

ocean vortex Apr 5, 2025, 7:44 PM

#

they are f'ing insane with those model sizes.

#

No one is gonna be able to run them

#

and all but one are compromised with low active param count anyway LOL

barren prairie Apr 5, 2025, 7:44 PM

#

Is llama4 open source ?

ocean vortex Apr 5, 2025, 7:45 PM

#

barren prairie Is llama4 open source ?

I assumed so. It must be..?

keen beacon Apr 5, 2025, 7:45 PM

#

it serious took them 2T+ params and over a year to slightly beat the next best base

#

great

#

seriously*

ocean vortex Apr 5, 2025, 7:45 PM

#

pretty sure it is OS

#

they always have been

ocean vortex Apr 5, 2025, 7:46 PM

#

keen beacon it serious took them 2T+ params and over a year to slightly beat the next best b...

this is gonna bring absolutely no value to open-source lmao

#

you can't host it, let alone finetune it

blazing rune Apr 5, 2025, 7:46 PM

#

ocean vortex they are f'ing insane with those model sizes.

Finally, someone who doesn't say "but it uses less active params"

#

It is very underwhelming

thorn oak Apr 5, 2025, 7:47 PM

#

what was llama's codename?

raven void Apr 5, 2025, 7:48 PM

#

it will, but for large orgs

thorn oak Apr 5, 2025, 7:48 PM

#

in stealth

blazing rune Apr 5, 2025, 7:48 PM

#

It only seems good for distillation, but at that point just use Gemini or Claude or OpenAI models

night trout Apr 5, 2025, 7:55 PM

#

ocean vortex this is gonna bring absolutely no value to open-source lmao

Open source doesn't mean "you can run it on your basement server for RP erotica", there's OSS significance within the enterprise world.

keen beacon Apr 5, 2025, 7:56 PM

#

https://fixupx.com/maximelabonne/status/1908602756182745506

Maxime Labonne (@maximelabonne)

Llama 4's new license comes with several limitations:

- Companies with more than 700 million monthly active users must request a special license from Meta, which Meta can grant or deny at its sole discretion.

- You must prominently display "Built with Llama" on websites, interfaces, documentation, etc.

- Any AI model you create using Llama Materials must include "Llama" at the beginning of its name

- You must include the specific attribution notice in a "Notice" text file with any distribution

- Your use must comply with Meta's separate Acceptable Use Policy (referenced at llama.com/llama4/use-policy)

- Limited license to use "Llama" name only for compliance with the branding requirements

💬 13 🔁 13 ❤️ 118 👁️ 13.3K

#

facepalm

night trout Apr 5, 2025, 7:57 PM

#

Nice. I'll try to finish my blog post this weekend on prompts, I'll tag you.

ocean vortex Apr 5, 2025, 7:58 PM

#

night trout Open source doesn't mean "you can run it on your basement server for RP erotica"...

I get the argument cause I used to make a similar one myself lol.

But the last 1-2 years have shown us 405b llama was too big and not very useful at all. This is gonna be even harder to run and realistically only makes sense if you can serve thousands of people with your endpoint. I would understand if there was a clear benefit from bigger models. But it's more like the opposite is true with RL training now. Besides, this exact size is saturated now, with Deepseek taking full advantage of it

night trout Apr 5, 2025, 8:00 PM

#

keen beacon https://fixupx.com/maximelabonne/status/1908602756182745506

These are all reasonable, tbh.

I was going to freak out about the "Built with Llama" one but it says you only have to say it on 'a' blog post, website, or within documentation. They're not saying you need to slather your product with it.

balmy mist Apr 5, 2025, 8:01 PM

#

scout is stupidly fast

ocean vortex Apr 5, 2025, 8:01 PM

#

I'm all in for huge models when they make sense, but I don't think this applies here. There's a fuckton of redundant capacity and we don't have enough data to take advantage of such size imo

#

maybe they should have updated their 405b model first, at least once. Before releasing this lol

night trout Apr 5, 2025, 8:02 PM

#

ocean vortex I get the argument cause I used to make a similar one myself lol. But recent 1...

The two biggest models on the planet are Gemini 2.0 and GPT4o, they aren't even local models to begin with. The biggest model in coding right now is Gemini 2.5 Pro. Just because hobbyists buzz about small models doesn't mean they are where the center of the industry is. There's a whole iceberg out there, you're just the tip.

ocean vortex Apr 5, 2025, 8:03 PM

#

dense 405b has potential to perform better than this "behemoth" anyway, since it's dense and here it's 288b active

ocean vortex Apr 5, 2025, 8:03 PM

#

night trout The two biggest models on the planet are Gemini 2.0 and GPT4o, they aren't even ...

are you joking?

#

gpt4o is smaller than 4-turbo

#

4-turbo is smaller than og gpt4

night trout Apr 5, 2025, 8:04 PM

#

ocean vortex are you joking?

Are you?

#

Because it sure seems like it.

ocean vortex Apr 5, 2025, 8:04 PM

#

and even without that...

#

gpt4.5 is much bigger than gpt4o

#

obviously

night trout Apr 5, 2025, 8:04 PM

#

ocean vortex gpt4.5 is much bigger than gpt4o

And neither one is a local model.

ocean vortex Apr 5, 2025, 8:05 PM

#

sorry but I don't feel like arguing with someone stating "The two biggest models on the planet are Gemini 2.0 and GPT4o," with a straight face lmao

#

no offense

night trout Apr 5, 2025, 8:06 PM

#

ocean vortex sorry but I don't feel like arguing with someone stating "The two biggest models...

I can imagine why that would be. Tough to argue against water being wet, the sky being blue.

ocean vortex Apr 5, 2025, 8:06 PM

#

it's more like it's impossible to argue with someone thinking 2+2 = 5

#

but whatever floats your boat I suppose

night trout Apr 5, 2025, 8:06 PM

#

Oh I see what's going on here. You think biggest = more params.

#

Biggest = most popular. Widest application.

#

Yeesh.

ocean vortex Apr 5, 2025, 8:08 PM

#

night trout Biggest = most popular. Widest application.

that is still nonsensical. 2.0 pro was never among the most popular models

#

LMAO

#

it only makes sense in the context of big in size models

night trout Apr 5, 2025, 8:08 PM

#

ocean vortex that is still nonsensical. 2.0 pro was never among the most popular models

I didn't say anything about 2.0 Pro. Christ. An ounce of reading comprehension, please. Even a gram would do. I'll take a milligram at this point.

ocean vortex Apr 5, 2025, 8:09 PM

#

"The two biggest models on the planet are Gemini 2.0 and GPT4o"

#

??

#

2.0 pro is not 2.0 gemini? 🤣

night trout Apr 5, 2025, 8:09 PM

#

ocean vortex "The two biggest models on the planet are Gemini 2.0 and GPT4o"

The word 'Pro' doesn't appear anywhere in that sentence, you absolute goofball.

#

Gemini 2.0 is an entire class of models which includes Flash Lite, Flash, Flash Thinking, and Pro.

ocean vortex Apr 5, 2025, 8:10 PM

#

there's no "gemini 2.0" model lmfao

night trout Apr 5, 2025, 8:10 PM

#

ocean vortex there's no "gemini 2.0" model lmfao

oh my god

#

how are you this dense

#

Probably, yeah.

ocean vortex Apr 5, 2025, 8:11 PM

#

night trout how are you this dense

it's you who is f'ing dense, how does 2.0 pro not qualify to talk about when you mention 2.0 gemini? LOL

night trout Apr 5, 2025, 8:11 PM

#

ocean vortex it's you who is f'ing dense, how does 2.0 pro not qualify to talk about when you...

mate you're drunk go get some sleep https://en.wikipedia.org/wiki/Set_theory

Set theory

Set theory is the branch of mathematical logic that studies sets, which can be informally described as collections of objects. Although objects of any kind can be collected into a set, set theory – as a branch of mathematics – is mostly concerned with those that are relevant to mathematics as a whole.
The modern study of set theory was initi...

ocean vortex Apr 5, 2025, 8:12 PM

#

"The two biggest models on the planet are Gemini 2.0 and GPT4o" - this honestly sounds like some clickbaity title by some idiot on the internet

#

gemini 2.0 was never as popular as cgpt, claude or deepseek, if that's how you intended to say it

#

none of their 2.0 models were

#

if you don't want to talk about 2.0 pro specifically, well then... all the other 2.0 models are worse

#

2.0 flash had a brief period of decent popularity with imagen but it kinda got denied by gpt4o before it even had a chance to blow up

night trout Apr 5, 2025, 8:16 PM

#

....

barren prairie Apr 5, 2025, 8:16 PM

#

ocean vortex if you don't want to talk about 2.0 pro specifically, well then... all the other...

Gemini 2.0 flash thinking is not bad for me

But Gemini 2.0 flash is the dumbest ai on the planet

eager mica Apr 5, 2025, 8:16 PM

#

Haven't had the chance to try the models yet, but I imagined they would be considerably smaller than what came out.

ocean vortex Apr 5, 2025, 8:17 PM

#

barren prairie Gemini 2.0 flash thinking is not bad for me But Gemini 2.0 flash is the dumbes...

yeah it's not bad, but nothing more than that unfortunately...

barren prairie Apr 5, 2025, 8:17 PM

#

ocean vortex yeah it's not bad, but nothing more than that unfortunately...

But for me better than chatgpt

#

I mean the 4o

ocean vortex Apr 5, 2025, 8:18 PM

#

barren prairie I mean the 4o

well if you compare against the updated gpt4o

#

flash-thinking performs worse now

barren prairie Apr 5, 2025, 8:19 PM

#

ocean vortex flash-thinking performs worse now

I don't know about "now" but I used it a lot in the past

torn mantle Apr 5, 2025, 8:19 PM

#

yea this maverick model is so ruined

ocean vortex Apr 5, 2025, 8:19 PM

#

when it was just released, it was beating gpt4o on some things yeah

torn mantle Apr 5, 2025, 8:19 PM

#

one of the worst models ive tried on multilingual

night trout Apr 5, 2025, 8:19 PM

#

barren prairie Gemini 2.0 flash thinking is not bad for me But Gemini 2.0 flash is the dumbes...

Used mostly for enterprise at the moment. Think chatbots, document transformations, etc.

Of course, Google is using it internally on a bunch of products too.

ocean vortex Apr 5, 2025, 8:20 PM

#

though it was never at any point anywhere close in popularity

barren prairie Apr 5, 2025, 8:20 PM

#

ocean vortex well if you compare against the updated gpt4o

I didn't like the updated gpt4o, seriously. It didn't follow my instructions (on chatgpt app)

torn mantle Apr 5, 2025, 8:20 PM

#

maverick so far :

bad at instruction following
bad at coding tasks
bad at multilingual
good at creativity

night trout Apr 5, 2025, 8:21 PM

#

ocean vortex though it was never at any point anywhere close in popularity

torn mantle Apr 5, 2025, 8:21 PM

#

yea im not using it anymore

ocean vortex Apr 5, 2025, 8:21 PM

#

night trout

that's openrouter you idiot, lol

#

not global popularity

#

and it's only up because it's free

night trout Apr 5, 2025, 8:22 PM

#

ocean vortex that's openrouter you idiot, lol

yeah it is, what a genius you are

ocean vortex Apr 5, 2025, 8:22 PM

#

night trout yeah it is, what a genius you are

I don't think you understand how openrouter works

#

of course people are gonna use more of what they can for free

night trout Apr 5, 2025, 8:23 PM

#

ocean vortex I don't think you understand how openrouter works

mate, i don't think you understand much in general

somber niche Apr 5, 2025, 8:23 PM

#

I'm gonna give it a fair shake, but my sentiments are similar. For being two big MoEs that are better than Deepseek they... don't really seem better than Deepseek. The restrictive license is kinda the nail in the coffin atm

ocean vortex Apr 5, 2025, 8:23 PM

#

It's amazing that I need to spell it out 😸

balmy mist Apr 5, 2025, 8:23 PM

#

anybody did more tests with these models?

barren prairie Apr 5, 2025, 8:23 PM

#

balmy mist anybody did more tests with these models?

Llama4?

torn mantle Apr 5, 2025, 8:24 PM

#

https://x.com/ericzelikman/status/1908603695753007353

Eric Zelikman (@ericzelikman) on X

not listing better widely-available alternatives on a comparison chart doesn't make them not exist btw

#

who invited them

#

is grok 3 really a good model

#

first impression was okay, but after using the model a lot then you figure out that its nothing special

night trout Apr 5, 2025, 8:25 PM

#

torn mantle maverick so far : bad at instruction following bad at coding tasks bad at mult...

interesting, so are the benchmarks bunk? they seem to be comparing it to Gemini Flash 2.0.

#

How do you rank it?

night trout Apr 5, 2025, 8:25 PM

#

torn mantle is grok 3 really a good model

I mean, it isn't available open source or even via api so it practically doesn't exist

ocean vortex Apr 5, 2025, 8:25 PM

#

torn mantle is grok 3 really a good model

well I mean, is there anything at all to suggest otherwise? I do not think so lol

#

it seems to beat both gpt4.5 and 3.7 as well as deepseek 3.1 overall from what we know and what can be tested

#

not better everywhere, but overall ahead somewhat

night trout Apr 5, 2025, 8:27 PM

#

torn mantle first impression was okay, but after using the model a lot then you figure out t...

It's okay at one-shot style coding. But yeah, otherwise nothing special. Everything we know suggests it's also a brute-force model which likely has high compute cost. They're literally running the datacentre on portable gas generators right now. 🫡

#

Brute-force money-scale MVPs are kinda Elon Musk's thing.

balmy mist Apr 5, 2025, 8:29 PM

#

lmaoo download

#

anybody can download them?

north vale Apr 5, 2025, 8:29 PM

#

ocean vortex dense 405b has potential to perform better than this "behemoth" anyway, since it...

there is 0 chance 405b performs better than behemoth. dense is just stupid

ocean vortex Apr 5, 2025, 8:31 PM

#

north vale there is 0 chance 405b performs better than behemoth. dense is just stupid

well it's harder to train, but not significantly harder than behemoth. And behemoth has less active params so... I don't think it's clear cut which has more potential. Plus behemoth is 10 times harder to run as a dedicated instance

north vale Apr 5, 2025, 8:32 PM

#

ocean vortex well it's harder to train, but not significantly harder than behemoth. And behem...

maverick seems better than 405b. i don't get how this is not completely clear cut

ocean vortex Apr 5, 2025, 8:33 PM

#

it is a newer model. 405b is ancient by now. If they chose to update 405b it would have performed much better than it did a long time ago...

north vale Apr 5, 2025, 8:34 PM

#

right, which they didn't, because it's so incredibly expensive to train for no reason

#

405b is indeed ancient

#

which is why it's completely eclipsed by llama 4 models

#

so why would behemoth not be way better than 405b

ocean vortex Apr 5, 2025, 8:34 PM

#

north vale right, which they didn't, because it's so incredibly expensive to train for no r...

behemoth has comparable training cost though

north vale Apr 5, 2025, 8:35 PM

#

ocean vortex behemoth has comparable training cost though

comparable cost a year later represents ~10x more effective compute

#

so yes

#

it won't be close

ocean vortex Apr 5, 2025, 8:35 PM

#

?

#

compute required to train behemoth is fairly comparable to 405b dense, that is my point. Almost 300b active. And you need much much more memory to host that MoE than 405b dense

#

then you add extended context on top, and memory requirements become insane
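(A back-of-the-envelope sketch of the memory point, counting weights only. The assumptions are loud ones: 2 bytes per parameter (bf16/fp16), a round 2T total for Behemoth as quoted in this chat, and no KV cache, activations or quantization. The helper name `weight_memory_gb` is hypothetical.)

```python
def weight_memory_gb(total_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


if __name__ == "__main__":
    dense_405b = weight_memory_gb(405e9)  # every parameter is used for every token
    moe_2t = weight_memory_gb(2e12)       # all experts must be resident, even if few are active
    print(f"405B dense: {dense_405b:.0f} GB")
    print(f"2T MoE:     {moe_2t:.0f} GB")
    print(f"ratio:      {moe_2t / dense_405b:.1f}x")
```

Hosting cost follows total parameters, while per-token compute follows active parameters, which is why the two sides of this argument keep talking past each other.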

balmy mist Apr 5, 2025, 8:38 PM

#

gemini2.5 pro crossword puzzle, after some tinkering i got the right system prompt

north vale Apr 5, 2025, 8:38 PM

#

the compute is used much more efficiently in behemoth than in 405b. i also think behemoth will have spent more compute to get trained but it's hard to tell. basically to get 405b to the quality of behemoth you'd need to spend like 5x-15x more compute than behemoth i suspect

balmy mist Apr 5, 2025, 8:38 PM

#

can you share stuff you make in liveweave?

ocean vortex Apr 5, 2025, 8:39 PM

#

north vale the compute is used much more efficiently in behemoth than in 405b. i also think...

the efficiency is in the fact that we are not comparing it to a 2T dense model lol. That was already implied/accounted for. It's not magic, MoE has been a known concept for a long time now

night trout Apr 5, 2025, 8:39 PM

#

balmy mist gemini2.5 pro crossword puzzle, after some tinkering i got the right system prom...

I'm curious, did other models get this right?

balmy mist Apr 5, 2025, 8:39 PM

#

i can try them, which ones you want me to try?

north vale Apr 5, 2025, 8:40 PM

#

ocean vortex efficient is in the fact that we are not comparing it to 2T dense model lol. Tha...

yeah exactly. 405b used an ancient training procedure that even at the time it was trained was already clearly stupid. they overspent on a dense model when they should have trained a MoE. because of that, despite large capital expenditure 405b was pretty mid. behemoth with similar spend will be much much much better and obviously so

#

"I don't think it's a clear cut which has more potential."
I think it's clear cut for the reasons above

ocean vortex Apr 5, 2025, 8:40 PM

#

ocean vortex efficient is in the fact that we are not comparing it to 2T dense model lol. Tha...

but it also wouldn't be comparable in performance to 2T dense model if someone managed to train something like that. 2T dense would have been exponentially more capable than MoE with 2T total and barely ~300b active

balmy mist Apr 5, 2025, 8:40 PM

#

the key to these llms is def the system prompts

night trout Apr 5, 2025, 8:40 PM

#

balmy mist i can try them, which ones you want me to try?

I'm just curious how you found it as a prompt, if LLMs were making common mistakes, etc.

No worries if you haven't.

#

Sounds like an interesting benchmark prompt.

north vale Apr 5, 2025, 8:41 PM

#

ocean vortex but it also wouldn't be comparable in performance to 2T dense model if someone m...

for the same training compute, it would be hilariously less capable

ocean vortex Apr 5, 2025, 8:41 PM

#

north vale for the same training compute, it would be hilariously less capable

yeah true. But that's not what we are talking about 👀

balmy mist Apr 5, 2025, 8:41 PM

#

night trout I'm just curious how you found it as a prompt, if LLMs were making common mistak...

i made a system prompt for making system prompts, made one for refining prompts, made one for webdev and used all three to create the puzzle

north vale Apr 5, 2025, 8:41 PM

#

idk you were saying something wasn't clear cut when it was

#

that's what i was focused on

#

you can talk about other things i don't mind

ocean vortex Apr 5, 2025, 8:43 PM

#

north vale idk you were saying something wasn't clear cut when it was

I was referring to 405b dense vs MoE with ~300b active and 2T total. If you can stomach a bit of extra compute needed for training, it's possible that the dense model would actually have more potential I think, hard to say. Which is why I said "not clear cut"

#

like we saw 70b llama doing really well against MoE models with less active and much more total param etc

north vale Apr 5, 2025, 8:45 PM

#

"if you can stomach a bit of extra compute" ?
i am just looking at their potential relative to how long they have been trained or are likely to be trained in the future? 405b will not be trained enough to come anywhere near behemoth's current capabilities so i don't see the relevance

ocean vortex Apr 5, 2025, 8:46 PM

#

north vale "if you can stomach a bit of extra compute" ? i am just looking at their potenti...

the relevance is that they abandoned that and started training from zero a model that is only slightly faster to train

night trout Apr 5, 2025, 8:47 PM

#

https://x.com/lmarena_ai/status/1908601015609471187

lmarena.ai (formerly lmsys.org) (@lmarena_ai) on X

Meta's Llama 4 Maverick hits in the top 5 across all categories.

Tied for #1 rank specifically in Hard Prompts, Coding, Math, Creative Writing, Longer Query and Multi-Turn!

ocean vortex Apr 5, 2025, 8:47 PM

#

but is much harder to host/run

north vale Apr 5, 2025, 8:48 PM

#

ocean vortex the relevance is that they abandoned that and started training from zero a model...

not only slightly faster

ocean vortex Apr 5, 2025, 8:48 PM

#

with that track record, it's not a given that they will be continuing with behemoth either lmao

#

especially now that there's deepseek

#

it likely wasn't a thing when they started training

ocean vortex Apr 5, 2025, 8:50 PM

#

north vale not only slightly faster

well but it is. Less active param but not THAT much less than old dense total, and you still need more memory allocation too

#

honestly, I'm kind of surprised they didn't do RL training with their existing 70b

#

that seems like the easiest and fastest way to get gains

north vale Apr 5, 2025, 8:53 PM

#

to the current level of capability, behemoth is an order of magnitude faster to train than 405B would be, i'd think? i'd be interested if someone had info on that tho

#

i just don't agree with your claims at all

#

it's like you live in a different universe where models haven't gotten way way way better in the last year

ocean vortex Apr 5, 2025, 8:53 PM

#

north vale to the current level of capability, behemoth is an order of magnitude faster to ...

nah that's not true lol. It's 2T total and close to 300b active

#

that means you need more memory than 405b dense

#

and training itself is gonna be roughly 4:3 times faster

#

at best
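(Where a figure like "4:3" can come from: a rough sketch using the common rule of thumb that training compute is about 6 x active parameters x training tokens. This is an approximation, the token count below is a placeholder that cancels out of the ratio, and it says nothing about which model ends up more capable per FLOP. The 288b figure is the active count quoted earlier in this chat.)

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: roughly 6 FLOPs per active parameter per token."""
    return 6 * active_params * tokens


if __name__ == "__main__":
    tokens = 1e12  # placeholder; the same value is used for both models
    dense = training_flops(405e9, tokens)  # dense: all 405B parameters are active
    moe = training_flops(288e9, tokens)    # MoE: ~288B active out of ~2T total
    print(f"dense / MoE compute per token: {dense / moe:.2f}x")
```

With ~300b active instead of 288b the ratio is 1.35, so "roughly 4:3" is a fair summary of per-token compute; the disagreement in the thread is about how much capability each model gets out of that compute.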

north vale Apr 5, 2025, 8:54 PM

#

but 405b has 405b weights so will be by default much worse and will need to be trained a lot longer to get to the same capability threshold

#

no?

balmy mist Apr 5, 2025, 8:55 PM

#

i think i wanna marry gemini2.5 pro

#

why is it so good?

#

like i love using it in studio

#

so much fun

ocean vortex Apr 5, 2025, 8:55 PM

#

north vale but 405b has 405b weights so will be by default much worse and will need to be t...

if MoE has ~300b active that means you need to train it essentially like 300b dense, not accounting for vram

north vale Apr 5, 2025, 8:57 PM

#

ocean vortex if MoE has ~300b active that means you need to train it essentially like 300b de...

in a big data center this seems very manageable

ocean vortex Apr 5, 2025, 8:57 PM

#

north vale in a big data center this seems very manageable

sure but so is 405b in much the same way... At that point it's not a big difference at all

north vale Apr 5, 2025, 8:58 PM

#

that's where we agree. they are pretty similar in terms of inference and training time (within 4x of each other) but behemoth has a way way way better ability to scale

#

405b scaling law will be way worse to reach behemoth level capabilities

#

i'll ask my ml friend if she knows specific theoretic numbers on that

night trout Apr 5, 2025, 9:00 PM

#

balmy mist i think i wanna marry gemini2.5 pro

It's amazing. Just a beautiful model, that's the only way I can really describe it. Such a pleasure to use.

barren prairie Apr 5, 2025, 9:01 PM

#

It is still dumb on WhatsApp even after the update

#

Llama4

ocean vortex Apr 5, 2025, 9:03 PM

#

north vale that's where we agree. they are pretty similar in terms of inference and trainin...

That is debatable... Like if we take deepseek that is MoE too, it has poor spatial awareness just like gpt4o. Why? Most logical explanation the way I see it is low active param count. That's also true for all llama4 models except the biggest / behemoth. Only 17b active. MoE have their advantages and they make sense for mass hosting and maybe even RL training potentially (speculation at this point), but they certainly have their drawbacks too and relatively low active param count is always gonna be a compromise.

#

you can't have lower training requirements for free, that's always gonna come at a cost. If there's less to train, there's less work that was done, bluntly speaking

north vale Apr 5, 2025, 9:05 PM

#

do you have more info on spatial awareness wrt MoEs? I would assume insofar as that's not just correlated to capabilities, it's about omni / training on images

torn mantle Apr 5, 2025, 9:05 PM

#

night trout interesting, so are the benchmarks bunk? they seem to be comparing it to Gemini ...

not even close to gemini flash 2.0

#

you can try it at https://lmarena.ai/

night trout Apr 5, 2025, 9:06 PM

#

Well that's sad.

torn mantle Apr 5, 2025, 9:08 PM

#

one of the worst models ive tried so far

#

couldnt even get a single question right

balmy mist Apr 5, 2025, 9:08 PM

#

lol

torn mantle Apr 5, 2025, 9:09 PM

#

thats just one of many

ocean vortex Apr 5, 2025, 9:09 PM

#

north vale do you have more info on spatial awareness wrt MoEs? I would assume insofar as t...

take let's say arc-agi but without reasoning models (those are harder to compare directly when we are talking about arch). gpt4o score is very low relative to sonnet. That's also true for simple-bench which is another benchmark for the most part based on spatial awareness. Then we also have web development and web coding arena where gpt4o stands no chance against sonnet. Same applies to deepseek to a varying degree.

torn mantle Apr 5, 2025, 9:09 PM

#

also really really bad at multilingual

#

llama 405b was actually good at it

#

idk what happened

wheat onyx Apr 5, 2025, 9:09 PM

#

torn mantle maverick so far : bad at instruction following bad at coding tasks bad at mult...

Unfortunate. Price is so low, it would be a massive deal if it was great

ocean vortex Apr 5, 2025, 9:10 PM

#

web development is relevant as much of it is based on css and visual design

#

ofc I'm assuming here that sonnet is bigger, but I do realise there's no official size if we want to be 100% accurate. But if you take say gpt4.5, you will see it has better spatial awareness than gpt4o too, or even o1

balmy mist Apr 5, 2025, 9:12 PM

#

torn mantle idk what happened

lmaoo

#

should I not like meta again?

torn mantle Apr 5, 2025, 9:12 PM

#

i think leo guessed it right

#

themis/cybele is probably behemoth

balmy mist Apr 5, 2025, 9:13 PM

#

which one is that?

#

where can i find it?

torn mantle Apr 5, 2025, 9:13 PM

#

torn mantle one of the worst models ive tried so far

because i remember them getting this question right

balmy mist Apr 5, 2025, 9:13 PM

#

did yall test it?

torn mantle Apr 5, 2025, 9:13 PM

#

balmy mist did yall test it?

yea it was on the arena for a while

barren prairie Apr 5, 2025, 9:13 PM

#

balmy mist should I not like meta again?

It depends on what you want

balmy mist Apr 5, 2025, 9:13 PM

#

damn

#

i missed it

#

so gemini2.5 pro is still best right?

barren prairie Apr 5, 2025, 9:14 PM

#

balmy mist so gemini2.5 pro is still best right?

Is that a question ?

Of course ¹⁰

ocean vortex Apr 5, 2025, 9:15 PM

#

with o1 it's actually interesting that they had such a massive improvement for arc-agi. To be completely honest I wouldn't rule out contamination lol. But it could also be that reasoning does help with those specific tasks. But o1 can still struggle a lot with svg or web design comparatively

torn mantle Apr 5, 2025, 9:15 PM

#

balmy mist so gemini2.5 pro is still best right?

i think google will start taking the lead from now on

ocean vortex Apr 5, 2025, 9:15 PM

#

btw not saying that sonnet is a better model overall, but for this specific thing, it is. 😉

torn mantle Apr 5, 2025, 9:15 PM

#

sonnet is kinda unique ngl

#

you feel like it has its own intelligence not some robotic ai

barren prairie Apr 5, 2025, 9:17 PM

#

Sonnet is the best at coding .

But so dumb at other things

ocean vortex Apr 5, 2025, 9:17 PM

#

torn mantle sonnet is kinda unique ngl

yeah either they did something very smart or openai cut the cost too much and too drastically with arch planning. Likely a combination of both

#

then gpt4.5 was just too extreme to train in time to compete adequately

torn mantle Apr 5, 2025, 9:17 PM

#

barren prairie Sonnet is the best at coding . But so dumb at other things

that alone is a big plus

#

its really hard to make a model like that

#

well it will change with nightwhisper ig

#

being good at coding isnt just providing a one-shot working code

#

but there are a lot of nuances

barren prairie Apr 5, 2025, 9:18 PM

#

torn mantle well it will change with nightwhisper ig

I think Google changed its plans after the announcement of o3 😁

ocean vortex Apr 5, 2025, 9:18 PM

#

barren prairie Sonnet is the best at coding . But so dumb at other things

best at web coding. Not ALL coding. 👀

#

if it's not web, it's honestly a coin toss but I would go with 2.5 pro

torn mantle Apr 5, 2025, 9:19 PM

#

ocean vortex best at web coding. Not ALL coding. 👀

they kinda improved on other areas with sonnet 3.7 tbh

#

it used to be so bad at desktop apps

ocean vortex Apr 5, 2025, 9:20 PM

#

even for web 2.5 is gonna be very decent

#

I would be shocked if 2.5 is not significantly bigger than gpt4o all things considered

torn mantle Apr 5, 2025, 9:22 PM

#

tf is flannel

ocean vortex Apr 5, 2025, 9:23 PM

#

torn mantle they kinda improved on other areas with sonnet 3.7 tbh

but it's still behind competition on some code things like livecodebench. Currently probably only 2.5 pro is a model with no significant flaws. Kinda insane if you think about it lol

torn mantle Apr 5, 2025, 9:23 PM

#

ocean vortex but it's still behind competition on some code things like livecodebench. Curren...

yea

#

24k gold model was so fun to chat with

#

i can see myself using it a lot

#

for unserious stuff

balmy mist Apr 5, 2025, 9:31 PM

#

barren prairie Sonnet is the best at coding . But so dumb at other things

idk about that

raven void Apr 5, 2025, 9:32 PM

#

sonnet 3.7 is like

#

the recipe is good

#

but needs much more compute

balmy mist Apr 5, 2025, 9:35 PM

#

true

raven void Apr 5, 2025, 9:35 PM

#

I'm also very excited for the big llama especially the base and reasoning version

somber niche Apr 5, 2025, 9:55 PM

#

Given that the new Llama 4s can't answer my questions correctly while 24k gold could, my guess is that 24k gold was a preview of behemoth

balmy mist Apr 5, 2025, 9:56 PM

#

lol

somber niche Apr 5, 2025, 9:56 PM

#

Which is... worrying. I'd really hoped it was a lot smaller than that

cloud meadow Apr 5, 2025, 9:57 PM

#

I mean, I was close
https://x.com/lmarena_ai/status/1908601011989782976

lmarena.ai (formerly lmsys.org) (@lmarena_ai) on X

BREAKING: Meta's Llama 4 Maverick just hit #2 overall - becoming the 4th org to break 1400+ on Arena!🔥

Highlights:
- #1 open model, surpassing DeepSeek
- Tied #1 in Hard Prompts, Coding, Math, Creative Writing
- Huge leap over Llama 3 405B: 1268 → 1417
- #5 under style control
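
For a sense of scale on the 1268 → 1417 jump quoted above: a minimal sketch, assuming Arena scores can be read like standard Elo ratings (logistic curve, 400-point scale). That reading is an assumption here, and `expected_win_rate` is a hypothetical helper, not part of any LMArena tooling.

```python
# Expected head-to-head win rate implied by a rating gap, using the
# standard Elo logistic formula. Whether Arena scores follow exactly this
# formula is an assumption made for illustration.

def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

if __name__ == "__main__":
    # Ratings taken from the quoted post: Llama 3 405B vs Llama 4 Maverick.
    p = expected_win_rate(1417, 1268)
    print(f"implied preference rate for the newer model: {p:.0%}")
```

Under that assumption a 149-point gap corresponds to the newer model being preferred in roughly 70% of direct comparisons.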

#

😛

#

😛

somber niche Apr 5, 2025, 9:57 PM

#

Funny thing is Deepseek can also get it at a fraction of the size (and likely cost) soooo

keen beacon Apr 5, 2025, 10:00 PM

#

does anyone have any hard benchmark prompts?

#

would be much appreciated

keen fulcrum Apr 5, 2025, 10:03 PM

#

https://www.rxddit.com/r/LocalLLaMA/s/QRt8m8LVpf
Crazy news

rxddit.com

Mark presenting four Llama 4 models, even a 2 trillion parameters model!!!

u/LarDark on r/LocalLLaMA

source from his instagram page

eager mica Apr 5, 2025, 10:04 PM

#

Did Meta actually finish training Maverick? From the model card, it saw half the training tokens of Scout.

somber niche Apr 5, 2025, 10:06 PM

#

I noticed that too

#

Both on the training token side and the context length side

#

My guess is probably not

keen beacon Apr 5, 2025, 10:07 PM

#

rushed launch and they literally had more than a year

#

shambles

somber niche Apr 5, 2025, 10:09 PM

#

Put up a couple of my prompts in #share-prompts , which the current L4 models seem to fail (to date, only 24K got the sushi question right, and it didn't get the Deltarune one right)

#

I genuinely don't know what they were thinking with this release tbh

#

This has pretty much killed any chance they have, in my mind, of being competitive with some of the other players in the area

#

I'd be nicer if they were like, 1/10 of the size they are lol

fleet lintel Apr 5, 2025, 10:15 PM

#

so many complaints about Llama 4 not being that good, so why is the lmarena score that high?
how are these models hacking this score?

balmy mist Apr 5, 2025, 10:15 PM

#

lol ik how

#

but i cant say

#

NDA

somber niche Apr 5, 2025, 10:16 PM

#

I also know how lol

#

Or at least I have a pretty decent guess I think

fleet lintel Apr 5, 2025, 10:16 PM

#

DM me 🙂

eager mica Apr 5, 2025, 10:16 PM

#

fleet lintel so many complaints about Llama 4 not being that good, so why is the lmarena score that high? ...

They were the least filtered ones and had the best vibes among the ones on the Arena, for conversational uses at least.

balmy mist Apr 5, 2025, 10:16 PM

#

somber niche Or at least I have a pretty decent guess I think

you can say it and I will react with an emoji

barren prairie Apr 5, 2025, 10:16 PM

#

somber niche Or at least I have a pretty decent guess I think

Tell us

barren prairie Apr 5, 2025, 10:17 PM

#

balmy mist but i cant say

Say say 😁

somber niche Apr 5, 2025, 10:18 PM

#

My guess is pretty much what Suikamelon said, rather than committing to a single instruct style, they tried a plethora of different models with different conversational styles and RL tunes for each size category, and then picked the ones which gave the best score in the arena

#

In the end, the spider and 24k type styles won out

#

So those are the ones they went with for the final release
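
Why trying many variants and keeping the best scorer can inflate a leaderboard number: a minimal simulation sketch. It assumes every variant is equally good in truth and that each measured score is the true score plus Gaussian noise. The noise level, trial count, and function name are all hypothetical choices, not measurements from the arena.

```python
# Selection effect from picking the best of N noisy scores.
# All variants share the same true score (0.0); only measurement noise
# differs, so any gain from picking the max is pure selection bias.
import random
import statistics


def best_of_n_inflation(n_variants: int, noise_sd: float = 15.0,
                        trials: int = 2000, seed: int = 0) -> float:
    """Average score of the top-scoring variant across simulated runs."""
    rng = random.Random(seed)
    best_scores = []
    for _ in range(trials):
        scores = [rng.gauss(0.0, noise_sd) for _ in range(n_variants)]
        best_scores.append(max(scores))
    return statistics.mean(best_scores)


if __name__ == "__main__":
    for n in (1, 5, 50):
        print(f"{n:>2} variants -> best one scores {best_of_n_inflation(n):+.1f} on average")
```

The more variants are submitted, the higher the winner scores on average even though nothing about the underlying model improved, which is the concern raised about cycling through many arena entries.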

barren prairie Apr 5, 2025, 10:19 PM

#

somber niche My guess is pretty much what Suikamelon said, rather than committing to a single...

That's what even Google does

fleet lintel Apr 5, 2025, 10:19 PM

#

and what is "Style Control"?

fleet lintel Apr 5, 2025, 10:19 PM

#

barren prairie That's what even Google does

i think everyone does that

eager mica Apr 5, 2025, 10:20 PM

#

fleet lintel and what is "Style Control"?

Even without formatting/text styling differences, I generally preferred the 24-karat-gold-type models.

somber niche Apr 5, 2025, 10:20 PM

#

They do, but I don't think anyone did it to nearly the extent Meta did

#

We went through like, fifty models in the past couple of weeks lol

barren prairie Apr 5, 2025, 10:22 PM

#

Only fifty models

raven void Apr 5, 2025, 10:22 PM

#

keen beacon rushed launch and they literally had more than a year

people say old llama 4 was caught lacking so they scrapped it and made a better one

fleet lintel Apr 5, 2025, 10:23 PM

#

LMArena needs to fix this problem. people will stop believing its value

keen beacon Apr 5, 2025, 10:23 PM

#

raven void people say old llama 4 was caught lacking so they scrapped it and made a better ...

well yeah if the previous planned launch models were even worse meta ai are way behind

#

deepseek will eat them for lunch

azure minnow Apr 5, 2025, 10:29 PM

#

Hi

eager mica Apr 5, 2025, 10:31 PM

#

Interesting data.

somber niche Apr 5, 2025, 10:34 PM

#

Huh

keen beacon Apr 5, 2025, 10:34 PM

#

eager mica Interesting data.

wtf

#

that's crazy

somber niche Apr 5, 2025, 10:35 PM

#

So L3.3 took less time than both L4 models combined

#

(I'm guessing the MoE architecture reduced the training compute per token)

#

That's weird though - what were they doing with all of that compute then?
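
The guess about MoE reducing training compute per token can be sketched with the common rule of thumb of about 6 FLOPs per active parameter per training token. This is a back-of-envelope estimate only: it ignores attention cost, routing overhead, and hardware utilization, and the token count below is a hypothetical placeholder rather than a figure from the model cards.

```python
# Rough training-compute comparison: dense 70B vs MoE with 17B active.
# Uses the ~6 * N_active * D approximation; HYPOTHETICAL_TOKENS is a
# made-up value chosen only so the two runs see the same amount of data.

def training_flops(active_params: float, tokens: float) -> float:
    """Approximate training FLOPs: 6 FLOPs per active parameter per token."""
    return 6.0 * active_params * tokens


DENSE_70B = 70e9            # dense model: every parameter is active
MOE_ACTIVE_17B = 17e9       # active parameters per token, from the discussion
HYPOTHETICAL_TOKENS = 20e12

dense = training_flops(DENSE_70B, HYPOTHETICAL_TOKENS)
moe = training_flops(MOE_ACTIVE_17B, HYPOTHETICAL_TOKENS)
ratio = dense / moe

if __name__ == "__main__":
    print(f"dense / MoE compute at equal tokens: {ratio:.1f}x")
```

At equal token counts the dense 70B run costs about four times the compute of a 17B-active MoE under this approximation, so differences in GPU hours also depend heavily on how many tokens each model actually saw.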

eager mica Apr 5, 2025, 10:36 PM

#

somber niche That's weird though - what were they doing with all of that compute then?

No idea. It makes the rushed state of this release even more puzzling.

#

A possibility is that they delayed training the models for as long as possible due to the copyright lawsuit(s).

somber niche Apr 5, 2025, 10:38 PM

#

Oh yeah the Anna's Archive stuff

eager mica Apr 5, 2025, 10:38 PM

#

They have so much compute they might have trained both models in 1 week or less.

somber niche Apr 5, 2025, 10:38 PM

#

Remember hearing about that

keen beacon Apr 5, 2025, 10:41 PM

#

eager mica They have so much compute they might have trained both models in 1 week or less.

hopefully that means behemoth is out in the next few weeks

eager mica Apr 5, 2025, 10:41 PM

#

somber niche Oh yeah the Anna's Archive stuff

Not only that, also Books3 (as disclosed in the Llama1 paper—which they probably regret doing), Libgen (Llama2).

eager mica Apr 5, 2025, 10:42 PM

#

keen beacon hopefully that means behemoth is out in the next few weeks

A 4.1 release with Maverick in a finished state and smaller models would be appreciated.

somber niche Apr 5, 2025, 10:42 PM

#

Agreed

#

I don't think the rushed Maverick release did them any favors, they probably should have just waited until it was finished training

eager mica Apr 5, 2025, 10:43 PM

#

Also, where's the omnimodality?

somber niche Apr 5, 2025, 10:43 PM

#

Yeah, no audio to speak of

#

That was supposed to be their killer feature

eager mica Apr 5, 2025, 10:44 PM

#

It should have had voice input/output, maybe even image out.

somber niche Apr 5, 2025, 10:45 PM

#

This entire release is pretty bizarre tbh - spamming the arena with models, the small amount of training compute, the lack of additional modalities which they said were supposed to be their major focus

balmy mist Apr 5, 2025, 10:46 PM

#

yo can yall see this?
https://liveweave.com/I05lht

#

like what shows up

eager mica Apr 5, 2025, 10:48 PM

#

Only that? I imagined it would be more complex than that.

#

I don't feel I usually preferred those Meta models due to the longer responses, but mainly because they had a more relatable tone, or were more informative, didn't exhibit slop, etc.

#

OpenAI and Google models also generally had long responses.

#

They cycled through so many models, several were probably targeted tests, earlier checkpoints, etc.

somber niche Apr 5, 2025, 10:53 PM

#

Yeah I think it's basically impossible to determine which checkpoint we're seeing in the full releases

hardy pecan Apr 5, 2025, 10:53 PM

#

What model do we think was maverick? That we had tested earlier

#

Spider? Themis? Cybele? 24_karat_gold?

somber niche Apr 5, 2025, 10:54 PM

#

That said, my theory is 24k is probably a Behemoth preview, just because for whatever reason, I'm not able to get either Scout or Maverick to give a single correct answer to coding questions which 24k got in one shot

#

Even if it is, it might still be one of multiple checkpoints, though

eager mica Apr 5, 2025, 10:56 PM

#

spider appeared many times in my tests.

leaden palm Apr 5, 2025, 10:56 PM

#

okay call me naive but perhaps maverick = maverick

eager mica Apr 5, 2025, 10:56 PM

#

But in the end 24_karat_gold was the one that I've seen the most.

viral notch Apr 5, 2025, 10:57 PM

#

24_karat_gold is only good for its styling imo lol

#

it follows instructions kinda meh

eager mica Apr 5, 2025, 10:57 PM

#

24kg: 49 times; spider: 23 times, cybele: 21 times

leaden palm Apr 5, 2025, 10:58 PM

#

could be the deployed maverick has the same prompt too

#

(it does in direct chat)

eager mica Apr 5, 2025, 10:59 PM

#

Possibly, with a mellowed down system prompt perhaps.

#

Ah I see that.

#

At some point flannel, harley and crystal got introduced in the past 24 hours, and I don't think they had a crazy system prompt.

#

Eventually I got tired of testing them when I saw that 24_karat_gold won all my personal tests.

#

I'm not 100% sure but crystal seemed the best one.

hardy pecan Apr 5, 2025, 11:09 PM

#

it feels like it might be "scout" on their website

#

It's fast but dumb

eager mica Apr 5, 2025, 11:10 PM

#

Both Maverick and Scout have the same amount of active parameters, so they should be equally fast, I think.

#

Of course.

sage raptor Apr 5, 2025, 11:10 PM

#

what model is spider

eager mica Apr 5, 2025, 11:11 PM

#

Try the one on Chatbot Arena (Direct Chat).

#

There have been suggestions that the one from OpenRouter isn't working correctly.

frozen skiff Apr 5, 2025, 11:20 PM

#

Did llama 4 release?

#

No way

thorny drum Apr 5, 2025, 11:21 PM

#

well apparently the system prompt tells maverick to only answer the question 50% of the time

eager mica Apr 5, 2025, 11:21 PM

#

I don't know the details; I haven't tried it personally there. I have tried it on Chatbot Arena though.

frozen skiff Apr 5, 2025, 11:23 PM

#

hardy pecan Spider? Themis? Cybele? 24_karat_gold?

Aint no way it was 24 karat gold

eager mica Apr 5, 2025, 11:28 PM

#

Not of its responses but you can test it easily.

frozen skiff Apr 5, 2025, 11:31 PM

#

Those crystal harley flannel models were llama 4 wallahi

#

They act the exact same

#

Or maybe theyre other llama models but llama 4 maverick is way better

neat apex Apr 5, 2025, 11:32 PM

#

llama 4 looks like 3, at least it has huge context

#

xd

frozen skiff Apr 5, 2025, 11:32 PM

#

neat apex llama 4 looks like 3, at least it has huge context

Nahh

#

Its way better

#

The reasoning

#

Llama 3 was a dumbass

neat apex Apr 5, 2025, 11:33 PM

#

hm, yea right

#

over reacted here

frozen skiff Apr 5, 2025, 11:33 PM

#

I like llama 4 maverick

#

Wym

neat apex Apr 5, 2025, 11:33 PM

#

but no better than 3.1

#

it is lying a lot actually

frozen skiff Apr 5, 2025, 11:33 PM

#

Are u using it on openrouter or arena

#

Its shet on openrouter for some reason

neat apex Apr 5, 2025, 11:34 PM

#

fireworks

#

fireworks is well known for serving its models a little better, yet it is still mid

frozen skiff Apr 5, 2025, 11:35 PM

#

Is it free

#

I never used fire works

#

It says error generating response

hardy pecan Apr 5, 2025, 11:36 PM

#

meta.ai model got 6/20 for simplebench, no bueno

neat apex Apr 5, 2025, 11:38 PM

#

maybe it is dumb if you don't pay, imo

frozen skiff Apr 5, 2025, 11:47 PM

#

llama 4 is good in typing style

#

Like 24 karat gold

#

Its fun to talk to

strong shell Apr 5, 2025, 11:49 PM

#

see spider
https://x.com/rsumbaly/status/1908598203865588112

Roshan Sumbaly (@rsumbaly) on X

Llama 4 is here with 4 models!🦙🦙🦙🦙

I'm back to share with the world what the team has been cooking. Today we're open-sourcing 2 state-of-the-art omni models (Scout, Maverick - including pre-trained weights), previewing a 3rd one (Behemoth) and will drop a reasoning one soon.

frozen skiff Apr 5, 2025, 11:56 PM

#

Spider is behemoth

strong shell Apr 5, 2025, 11:58 PM

#

y u thnk so?

frozen skiff Apr 5, 2025, 11:59 PM

#

Because

#

We already have maverick and scout

#

Why would they put that spider emoji

#

They're hinting

#

Spider out of all emojis

#

Its either their reasoning model or behemoth

eager mica Apr 5, 2025, 11:59 PM

#

Situation unclear.

sage raptor Apr 6, 2025, 12:00 AM

#

frozen skiff Spider is behemoth

is it better than nightwhisper at coding ?

frozen skiff Apr 6, 2025, 12:00 AM

#

Idk

#

We dont have spider anymore

#

I dont think so

#

It got removed a while ago

eager mica Apr 6, 2025, 12:04 AM

#

I mostly really just follow the LLM threads.

#

For what it's worth, the official Llama 4 inference code uses temperature=0.6 and top_p=0.9 as far as I can see.

#

https://github.com/meta-llama/llama-models/blob/main/models/llama4/generation.py

GitHub

llama-models/models/llama4/generation.py at main · meta-llama/llam...

Utilities intended for use with Llama models. Contribute to meta-llama/llama-models development by creating an account on GitHub.
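
What temperature=0.6 and top_p=0.9 do during decoding can be sketched as follows. This is a generic illustration of temperature scaling plus nucleus (top-p) sampling over a plain list of logits, not a copy of the code in Meta's repository; the function name and the `rng` parameter are hypothetical.

```python
# Temperature + top-p (nucleus) sampling over a list of logits.
# Generic sketch using only the standard library; the defaults mirror the
# values mentioned above, everything else is an illustrative assumption.
import math
import random
from typing import Optional, Sequence


def sample_top_p(logits: Sequence[float], temperature: float = 0.6,
                 top_p: float = 0.9,
                 rng: Optional[random.Random] = None) -> int:
    """Return the index of one sampled token."""
    rng = rng or random.Random()
    # Lower temperature sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the kept weights for us.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    print(sample_top_p([2.0, 1.0, 0.5, -1.0], rng=random.Random(0)))
```

A provider that serves the model with different sampling settings could plausibly produce noticeably different output quality, which may be relevant to the OpenRouter versus arena comparisons in this thread.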

#

Hm, it looks like they have int4 quantization loading code too.

young otter Apr 6, 2025, 12:17 AM

#

when you use a system prompt extraction prompt https://github.com/LouisShark/chatgpt_system_prompt/blob/main/GETTING_STARTED.md
the proper default system prompt comes back via meta.ai, whatsapp, messenger, and the api providers: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8#:~:text=Llama 4 models.-,System prompt,-You are an
the lmarena -experimental model brings back nothing, as if the system prompt has been left out

leaden palm Apr 6, 2025, 12:26 AM

#

what the llama

plain zinc Apr 6, 2025, 12:31 AM

#

Unfortunately, I'm not a programmer.

#

This is not my screenshot, but a screenshot of my Friend's discord bot.

#

It tells you when a model was added to the arena or, conversely, when one disappeared

frozen skiff Apr 6, 2025, 12:34 AM

#

Maverick is amazing

plain zinc Apr 6, 2025, 12:34 AM

#

@modern knoll

young otter Apr 6, 2025, 12:40 AM

#

yep thats what i linked

eager mica Apr 6, 2025, 12:43 AM

#

I don't think fp8 alone would reduce quality appreciably.
On-the-fly INT4 sounds like what bitsandbytes would do.
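
To illustrate why 4-bit weights lose more precision than fp8: a minimal sketch of symmetric absmax quantization to the int4 range [-8, 7]. This is a simplified stand-in chosen for clarity; it is an assumption that it resembles what any given loader does, and it is not the scheme bitsandbytes or Meta's loading code actually implements.

```python
# Symmetric absmax int4 quantization round trip (illustrative only).
# Real 4-bit schemes typically use per-group scales and other number
# formats; this sketch uses one scale for the whole list.
from typing import List, Tuple


def quantize_int4(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-8, 7] with a single shared scale."""
    scale = max(abs(v) for v in values) / 7.0 or 1.0  # avoid a zero scale
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int4(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the int4 codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [1.0, -0.5, 0.25, 0.0]
    codes, scale = quantize_int4(weights)
    restored = dequantize_int4(codes, scale)
    for w, r in zip(weights, restored):
        print(f"{w:+.3f} -> {r:+.3f} (error {abs(w - r):.3f})")
```

With only 16 levels, each weight can be off by up to half a quantization step, so small weights next to a large outlier are hit hardest.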

#

It's versioned 03-26-experimental. Is it from 10 days ago? Or did the training start on that date? We can't know, of course.

young otter Apr 6, 2025, 12:51 AM

#

eager mica I don't think fp8 alone would reduce quality appreciably. On-the-fly INT4 sounds...

maverick has been published in fp8 and officially "maintaining quality" so all providers will end up hosting that

frozen skiff Apr 6, 2025, 12:51 AM

#

Dude wtf

#

Maverick teaches u how to

#

Do illegal stuff

#

Make fentanyl and even teaches u how to get the precursors

#

💀

#

Yes

#

Its a real method

#

Yes im a chemist

#

Lil bro

#

Yes

#

Who said i cant make it cus its illegal

#

Walter white made meth and it was illegal lil bro it doesnt matter

#

Rules are made to be broken

eager mica Apr 6, 2025, 1:06 AM

#

https://discord.com/channels/1340554757349179412/1343285970375540839

frozen skiff Apr 6, 2025, 1:06 AM

#

Walter white cant make fentanyl

#

Hes too weak

#

Im ahmad fring

ivory schooner Apr 6, 2025, 1:19 AM

#

Ah, give me back 24k, give me back 24k~

#

If Chatbot Arena doesn't bring it back, could someone deploy a standalone mirror of the model so we can keep playing with it?

balmy mist Apr 6, 2025, 1:21 AM

#

wait do we have full access to llama 4 now?

ivory schooner Apr 6, 2025, 1:35 AM

#

(realization) So the upcoming Behemoth seems to be exactly the old 24k

#

Then I'll wait for Behemoth to launch and play with it then~

#

But...... give me back 24k~

balmy mist Apr 6, 2025, 1:36 AM

#

lmaoo they gotta pray man

raven void Apr 6, 2025, 1:38 AM

#

I mean

#

any model can teach you that even Gemini

keen beacon Apr 6, 2025, 1:56 AM

#

@alpine coral can you dm me your question set if possible? i think i may have found o3 full on a red-teaming platform and am putting it through its paces

#

o3 (?) simplebench "try yourself" public questions score:

✅
❌
✅
❌
✅
✅
❌
✅
✅
❌

6/10, iirc best performance on this specific set minus gpt-4.5 preview

lime coral Apr 6, 2025, 2:32 AM

#

https://x.com/teortaxestex/status/1908706840554197309?s=46

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxe...

What did Meta see coming out on Monday that they rushed?

balmy mist Apr 6, 2025, 2:33 AM

#

yo i have tested llama4 all day and they def lied lmaoo

torn mantle Apr 6, 2025, 2:34 AM

#

balmy mist yo i have tested llama4 all day and they def lied lmaoo

xd

balmy mist Apr 6, 2025, 2:34 AM

#

torn mantle xd

remember we were saying that

#

they really bs them benchmarks

torn mantle Apr 6, 2025, 2:35 AM

#

balmy mist remember we were saying that

Yea its really bad

balmy mist Apr 6, 2025, 2:35 AM

#

and they said the lmarena was an experimental chat version for the results wtfff

#

there is no way this is second

torn mantle Apr 6, 2025, 2:35 AM

#

I have no idea how it reached that no2 spot

balmy mist Apr 6, 2025, 2:35 AM

#

like impossible

#

lmaoo

torn mantle Apr 6, 2025, 2:36 AM

#

Its really one of the worst releases so far

balmy mist Apr 6, 2025, 2:36 AM

#

lmaoo

torn mantle Apr 6, 2025, 2:36 AM

#

Even their llama 405b was better

balmy mist Apr 6, 2025, 2:36 AM

#

the only thing going for it is that 10 mill context but i have yet to test that or see that

#

has anyone else tried that?

torn mantle Apr 6, 2025, 2:37 AM

#

The model is so dumb, its information retrieval is so bad, it hallucinates a lot, it doesn't follow instructions well, it's so bad at multi-turn

#

I can go all day about this model

#

I just dont understand why they even bothered to release this

balmy mist Apr 6, 2025, 2:37 AM

#

this is a joke

Screenshot_2025-04-05_at_10.37.51_PM.png

torn mantle Apr 6, 2025, 2:38 AM

#

Lol

#

Its not beating anything

#

Even qwen35b is better

balmy mist Apr 6, 2025, 2:38 AM

#

lmaooo

#

this is actually devastating

torn mantle Apr 6, 2025, 2:39 AM

#

I won't touch it again

balmy mist Apr 6, 2025, 2:39 AM

#

like we basically can ignore everything they said

#

that 10 mill context is a lie too

#

bc everything else is a lie

#

i feel lowkey bad for zuc

torn mantle Apr 6, 2025, 2:39 AM

#

I knew they self-trained on benchmarks

balmy mist Apr 6, 2025, 2:40 AM

#

he screwed his company with the meta stuff and now this

#

just wait until simple bench gets their hands on this

#

gonna be wild

#

maybe we prompting llama4 wrong

torn mantle Apr 6, 2025, 2:40 AM

#

Themis or behemoth should be on gpt4o level

#

But at what cost?

balmy mist Apr 6, 2025, 2:41 AM

#

there has to be a reason

balmy mist Apr 6, 2025, 2:41 AM

#

torn mantle Themis or behemoth should be on gpt4o level

if it is gg

torn mantle Apr 6, 2025, 2:41 AM

#

They are not taking LLM training seriously obviously

torn mantle Apr 6, 2025, 2:41 AM

#

balmy mist if it is gg

It is

#

We tried it in lmarena already

balmy mist Apr 6, 2025, 2:41 AM

#

really??

torn mantle Apr 6, 2025, 2:41 AM

#

Its at best gpt4o level

#

Its not sota or anything

balmy mist Apr 6, 2025, 2:41 AM

#

thats prob why they still training it

#

what do they do now?

torn mantle Apr 6, 2025, 2:42 AM

#

But that's just so disappointing

#

And we still don't have reasoning models

balmy mist Apr 6, 2025, 2:42 AM

#

yeah its the promises that is annoying

#

just dont release, we can wait

#

but i think its for investors

torn mantle Apr 6, 2025, 2:42 AM

#

Their instruct model is so bad that they had to wait for behemoth to train a reasoner

#

You cant just use maverick with reasoning

#

Like that model is so dumb

#

It doesn't understand anything

balmy mist Apr 6, 2025, 2:42 AM

#

yeah maverick is so badd

#

like it makes me mad

#

i dont really see that big a difference between scout and maverick

#

tbh

torn mantle Apr 6, 2025, 2:43 AM

#

They didn't bring anything new to the table

#

Qwen3 will be a huge blow to them

balmy mist Apr 6, 2025, 2:43 AM

#

"10 mill context"

#

if that is debunked meta might go under

torn mantle Apr 6, 2025, 2:44 AM

#

balmy mist "10 mill context"

Nobody wants a dumb model with 10m context window

balmy mist Apr 6, 2025, 2:44 AM

#

but lmarena needs to fix the ranking asap

torn mantle Apr 6, 2025, 2:44 AM

#

What would i do with it exactly?

#

Write novels?