#general

1 messages · Page 3 of 1

keen beacon
#

Wasn't that before r1? They're trying to remain competitive now, which is different

#

Meanwhile xai seemed to train their reasoning model on qwq 32b preview traces don't ask me how I know lol

#

At least for cold start

oblique flint
#

so what kind of model do you think it is? Pro thinking?

keen beacon
#

And something else

#

They're doing something

#

phantom and specter are gone now

#

They're gone but the same model

#

same model, different checkpoints. order being specter -> phantom -> nebula

#

oldest to newest

#

all in the last couple weeks so they're cooking

#

I think the recent phantom and nebula is something to look out for

#

Something might be different

#

yeah there was a jump in performance from specter -> phantom

#

less of a jump from phantom -> nebula but still better

oblique flint
#

how does it compare to sonnet for coding?

keen beacon
#

ive had several instances where it has actually been better

#

especially 0-shot

#

sonnet is still probably the best for practical coding but anthropic has no moat

#

google employee liked my post about nebula 👀

oblique flint
#

ok dang it I just got it and it did pretty badly on this coding prompt I had,

oblique flint
#

copied from a reddit post from a couple weeks ago lol:

Write a Python program that shows 20 balls bouncing inside a spinning heptagon:
- All balls have the same radius.
- All balls have a number on it from 1 to 20.
- All balls drop from the heptagon center when starting.
- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35
- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.
- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.
- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.
- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.
- The heptagon size should be large enough to contain all the balls.
- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.
- All code should be put in a single Python file.
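
For reference, the hardest single piece of this prompt is probably the ball-versus-rotating-wall bounce. Below is a minimal sketch of just that step, not a full solution: the function names, the restitution value of 0.8, and treating each edge as an infinite line are all assumptions made for illustration. It leaves out friction, ball spin, ball-to-ball collisions, and the tkinter drawing.

```python
import math

# 360 degrees per 5 seconds, as stated in the prompt, expressed in rad/s.
OMEGA = 2 * math.pi / 5


def heptagon_vertices(cx, cy, radius, angle):
    """Vertices of a regular heptagon centred at (cx, cy), rotated by `angle` radians."""
    return [
        (cx + radius * math.cos(angle + 2 * math.pi * k / 7),
         cy + radius * math.sin(angle + 2 * math.pi * k / 7))
        for k in range(7)
    ]


def bounce_off_wall(pos, vel, ball_r, a, b, center, omega, restitution=0.8):
    """Resolve one ball against the wall through points a and b.

    The wall belongs to a polygon spinning about `center` at `omega` rad/s.
    Returns (new_pos, new_vel). `restitution` is a hypothetical value.
    """
    px, py = pos
    vx, vy = vel
    ex, ey = b[0] - a[0], b[1] - a[1]
    length = math.hypot(ex, ey)
    # Unit normal of the wall, flipped if needed so it points toward the centre.
    nx, ny = -ey / length, ex / length
    if (center[0] - a[0]) * nx + (center[1] - a[1]) * ny < 0:
        nx, ny = -nx, -ny
    # Signed distance from the wall line; positive means inside the polygon.
    dist = (px - a[0]) * nx + (py - a[1]) * ny
    if dist >= ball_r:
        return pos, vel  # no contact
    # Contact point on the wall and the wall's velocity there (rigid rotation).
    qx, qy = px - dist * nx, py - dist * ny
    rx, ry = qx - center[0], qy - center[1]
    wall_vx, wall_vy = -omega * ry, omega * rx
    # Reflect the velocity relative to the moving wall, only if approaching it.
    rel_vx, rel_vy = vx - wall_vx, vy - wall_vy
    vn = rel_vx * nx + rel_vy * ny
    if vn < 0:
        rel_vx -= (1 + restitution) * vn * nx
        rel_vy -= (1 + restitution) * vn * ny
    # Push the ball back so it no longer overlaps the wall.
    px += (ball_r - dist) * nx
    py += (ball_r - dist) * ny
    return (px, py), (rel_vx + wall_vx, rel_vy + wall_vy)
```

In a full program this would run once per ball per edge each frame, with `heptagon_vertices` recomputed from the current rotation angle.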
keen beacon
#

i mean

#

that's a pretty complicated request

#

how did it do relative to other models?

rigid widget
#

i think rage is from Amazon

#

new model cobalt-exp-beta-v2

keen beacon
oblique flint
#

sonnet did better, o3 mini did better in some cases, not always. Deepseek r1 also did better but took like 10 minutes to write the code. Rhea the one time I got it did better too.

keen beacon
#

Maybe google started training on r1 traces ahaha. Nah

cedar tide
keen beacon
keen beacon
alpine coral
keen beacon
#

I think the temperature on the arena is relatively high

#

regenerating a few times is always worth it
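
A rough illustration of why a higher temperature makes regenerating worthwhile: dividing the logits by the temperature before the softmax flattens the distribution, so repeated samples differ more. This is a generic sketch; the arena's actual sampling settings are not known here, and the logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
example_logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(example_logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(example_logits, 2.0)   # high temperature: flatter
```

With the flatter distribution the top token is picked less often, which is one reason the same prompt can give noticeably different quality across regenerations.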

keen beacon
keen beacon
#

all the vX exp models are amazon

#

they've all been pretty bad lmao

#

Waiting for qwen 3 to mog everyone else

#

the chinese gotta deliver

alpine coral
keen beacon
#

amazon said premier, their big model, was releasing early this year last year

#

by big i mean 1T+

#

It's gonna be a nothing burger

#

none of these have given off those vibes

#

Just light piles of money on fire

keen beacon
alpine coral
keen beacon
#

They're training off synthetic data from qwen

#

Like give ur synthetic data dept more money

#

Use Claude instead

alpine coral
#

lol 100%

#

it's amazon ffs

keen beacon
#

phi is such a wack model

#

benchmark maxxing

#

it'll get a question right but the logic doesn't even line up with the answer

ancient reef
#

yeah, the chain of thought process sometimes makes no sense compared to the answer with phi

keen beacon
#

I think

#

There are a lot of separate issues y'all are seeing I think

ancient reef
#

i suppose im talking dpo, and not specifically only training for benchmarks...

alpine coral
#

i haven't gotten nebula again..

keen beacon
#

i have

#

it's still there i hope

alpine coral
#

but yeah phantom i've gotten wild variations in quality (in response to the same prompt) - but the upper end has been highly impressive; the one time i got nebula surpasses even that upper end

keen beacon
#

good to see google finally catching up with frontier reasoners.. about time with all the money they've thrown at the problem

#

2.0 flash thinking was cool and all but a bit meh if you want the very best

#

What if they're training on r1 traces with stronger base models and then rl 🤣 would be a crazy twist unlikely tho

oblique flint
#

I have a feeling it's still not going to move the frontier forward. My guess is it's still behind sonnet 3.7 thinking except at math

keen beacon
#

geminis have always been strong at math

rigid widget
keen beacon
#

currently og o1

ancient reef
#

I found a gemini test model that had math above any other model

#

damn, nebula is good at math too

keen beacon
#

naturally

#

it's at least on par with o1 high in my testing

#

although it doesn't seem to structure its working as nicely as o1

#

classic gemini

ancient sandal
#

is gemini 2 pro thinking in arena?

keen beacon
#

Yes

#

likely a training checkpoint of it yes

rigid widget
keen beacon
#

the more i test nebula the better it gets

#

it got this really tricky maths riddle right that no other model, except o3 mini high (very rarely!) gets correct.

#

the riddle is "There are 2022 users on a social network called Mathbook, and some of them are Mathbook-friends. (On Mathbook, friendship is always mutual and permanent.)

Starting now, Mathbook will only allow a new friendship to be formed between two users if they have at least two friends in common. What is the minimum number of friendships that must already exist so that every user could eventually become friends with every other user?"

#

99% of other models say "2022" or something like that
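
One way to sanity-check answers to this riddle is to brute-force tiny versions of it. The sketch below is an illustration only, with hypothetical function names. It simulates the "at least two friends in common" rule until nothing changes, then searches for the smallest starting set of friendships that ends with everyone connected. It is only feasible for a handful of users, nowhere near 2022.

```python
from itertools import combinations


def can_complete(n, edges):
    """True if the starting friendships `edges` can grow into everyone being friends."""
    friends = [set() for _ in range(n)]
    for u, v in edges:
        friends[u].add(v)
        friends[v].add(u)
    changed = True
    while changed:
        changed = False
        for u, v in combinations(range(n), 2):
            # A new friendship needs at least two friends in common.
            if v not in friends[u] and len(friends[u] & friends[v]) >= 2:
                friends[u].add(v)
                friends[v].add(u)
                changed = True
    return all(len(f) == n - 1 for f in friends)


def min_friendships(n):
    """Smallest number of starting friendships that works, by exhaustive search."""
    pairs = list(combinations(range(n), 2))
    for k in range(len(pairs) + 1):
        for edges in combinations(pairs, k):
            if can_complete(n, edges):
                return k
    return None
```

Adding friendships greedily is safe because a new friendship never removes a common friend. For four users, a 4-cycle is enough: each diagonal pair already shares two friends.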

rigid widget
#

nebula is amazing for English-Turkish translation

#

it feels like gpt4.5

#

it uses very language-specific human-like expressions and understands very well the context.

keen beacon
#

yeah gemini is historically good at translations so with an even stronger base and thinking that doesn't surprise me

rigid widget
#

is it a thinking model, right?

keen beacon
#

nebula? yes

keen beacon
#

doesn't seem afraid to spend a lot of time

#

manus is a claude 3.7 sonnet wrapper

#

for the most part

regal niche
#

It's not there anymore? I looked but I can't find it.

regal niche
keen beacon
#

i believe it's intentional

#

just kinda clunky

#

it's to ensure they aren't bankrupted by massive context (i think?)

rigid widget
#

I am eagerly waiting Deepseek R2.

#

@plain zinc bro where have you been

onyx juniper
#

anyone have historic info on how long these take to release on the arena?

rigid widget
onyx juniper
#

Yes

mossy drum
#

Shrek riding a donkey in SVG by nebula - that model is insane.

keen beacon
#

holy

#

can i share this

mossy drum
#

Sure

keen tide
#

"Sure" is my favorite motivational speech, said no one ever. Maybe we need to spice things up a bit? 😉

quick flame
#

"I'll scan ... on google maps"

pliant cypress
keen beacon
#

decent but not as good

#

colours are worse, some aspects don't make sense

pliant cypress
#

Grok-3 + Think is so trash 😆 Grok always produces extremely simplified things

onyx juniper
keen tide
# onyx juniper Are you thinking it comes out on the next leaderboard update ?

i'm not sure what "it" refers to normalname9619

but from what i see, some users of the leaderboard creator are having issues with their leaderboards resetting or not working correctly, and the developer is trying to fix it

rocksteady studios says "your unlocked mastery levels, your finite crisis rank, all the gear you have acquired and your squad levels will not be reset" there too

leaden palm
#

what's pinnacle? (vlm)

#

maybe i should find out

frozen arch
#

how do i get gpt-4.5 or o1-pro in lmarena?

leaden palm
plain zinc
onyx juniper
frozen arch
#

any good anon models out there?

rigid widget
leaden palm
keen tide
# leaden palm what's `pinnacle`? (vlm)

ok, so "pinnacle" could mean a few things, especially with "vlm" thrown in there:

  • vision language model (vlm): these are models that understand both images and text, connecting them together
  • pinnacle21: it seems to be related to software for data standards, maybe used in clinical research
  • pinnacle (software company): a software development and consulting firm they do custom software.
leaden palm
#

(i think this is the first time i've called something slop)

frozen arch
#

not sure if its new

keen tide
# frozen arch how do i get gpt-4.5 or o1-pro in lmarena?

hey halite0001,

so, gpt-4.5 is like, the top dog on the chatbot arena leaderboard right now! it seems like you can test these models out on the site or the space but be careful, its just a research preview!

i think o1 is just openai in the search results

keen tide
leaden palm
#

anonymous-chatbot is a lot of style and little world knowledge (at least w/ sveltekit)

rigid widget
solar hollow
#

is nebula the best model in the arena so far?

keen beacon
solar hollow
#

that would be nice

keen beacon
#

by the looks of it

solar hollow
#

i obviously dont consider grok a top model

#

its comparatively not good on reasoning benchmarks

keen tide
brittle tiger
#

Nebula solving arc-agi problems I converted to text that no previous Google models get right.

leaden palm
solar hollow
keen tide
oblique flint
#

Im not convinced nebula is the best coding model in the arena now tho, that's probably still sonnet

solar hollow
#

that sounds sick, let me look it up too

brittle tiger
leaden palm
keen tide
rigid widget
leaden palm
keen tide
# rigid widget Which meta model do you think?

i think the nlp meta model is pretty interesting it's about understanding how we distort our experiences through language it's like, we all have our own maps of the world, but our words sometimes leave out important details or make connections that aren't really there

it's cool 'cause it helps you ask questions to get clearer info and challenge assumptions like if someone says "he's always late, he doesn't care" you could ask how those two things are connected or if they say "i must not do that" you could ask "what will happen if i do"

it's not just about the questions, but how you use them, so you don't just annoy people plus, it's useful 'cause you don't need to be an expert in whatever someone's talking about it can even reveal hidden assumptions

rigid widget
keen tide
# leaden palm large? idk

large is relative, right? like the largest number has gotta be infinity maybe, but even that's weird

largest object in the universe? i've seen stuff about super big stars like ton 618

or maybe you're thinking of a large image? someone made a crazy huge image, like 102 terrapixels apparently its a cartoon haha

keen tide
keen tide
torn mantle
#

yea nebula may be the best model we have so far

#

good at everything ive tested

rigid widget
keen tide
remote niche
keen tide
oblique flint
#

Prompt:

- All balls have the same radius.
- All balls have a number on it from 1 to 20.
- All balls drop from the heptagon center when starting.
- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35
- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.
- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.
- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.
- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.
- The heptagon size should be large enough to contain all the balls.
- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.
- All code should be put in a single Python file.

Sonnet thinking, o3 mini, deepseek r1 and nebula (2 times) in that order.

pretty disappointing performance from nebula here ngl. Best one is prob o3 mini followed by sonnet

#

oh lol let me send as gifs, last 2 are nebula. Framerate / speed of the gifs is kinda fked unfortunately, its faster and smoother in reality

earnest parcel
oblique flint
#

Yeah. Anyway after all the hype about nebula I hoped it would perform better at this ngl

leaden palm
#

gemini models aren't the best at coding and thinking, even if its styling and thinking give them a slight boost

oblique flint
#

but still, if this is really 2.0 pro thinking then not being able to beat o3 mini (which competes with flash) at this task is kinda disappointing. Let's see if it's the case with other coding tasks as well

earnest parcel
frozen arch
#

why don't we see the reasoning chains here?

keen beacon
keen beacon
ocean vortex
oblique flint
keen beacon
#

yeah fair enough and i am as wary as you given i have been disappointed by their models in the past, but this is very good progress for them especially in web development

torn mantle
#

even for coding tasks there are many areas to try it on

keen beacon
#

phantom may be back

#

i just got it on vision arena

#

o1 pro high gets this wrong lmao

timber kiln
keen beacon
#

o1 pro high

#

in the api

pliant cypress
topaz peak
#

what the hell, thats pretty high effort

hazy quest
#

That seems absolutely insane. Are other models close to that result?!

gentle plinth
pliant cypress
gentle plinth
#

did you say it should be an html file with css and js or did it decide that on its own

pliant cypress
#

yeah i request full code in one html file

gentle plinth
#

i guess vibe coding really is the future then 🫠

#

but we will see

topaz peak
#

both A.Is hallucinate non-existent readable signs from a fake a.i generated image with A.I gibberish letters

gentle plinth
torn mantle
#

knowing google

#

they will probably nerf nebula

keen beacon
torn mantle
#

same thing that happened with gemini exp 1206

#

they replaced it with a sloppy gemini 2.0 pro model

keen beacon
gentle plinth
#

interesting

torn mantle
#

what about llama 4?

#

/meta

#

kimi k1.6? is it added yet

keen beacon
#

it's only for models in the webdev arena

#

no meta models are there

#

nor kimi

#

phantom confirmed back

silk haven
#

This week Google will launch project Astra, maybe it will launch these models along with it.

onyx juniper
raven void
#

1206 my beloved

#

😭

rigid widget
timber kiln
raven void
topaz peak
daring hinge
#

hello

raven void
crystal linden
#

anybody know why every time i generate in the battle arena only one works and the other just doesnt generate anything every time

torn mantle
#

is this new?

leaden palm
torn mantle
#

i swear i saw that name somewhere

leaden palm
#

well command r came first a while ago

#

then there was command a from this month

keen beacon
hardy pecan
#

What model do we think anonymous-chatbot is? Looks like an OpenAI model but I wonder which one is next. It's non-reasoning and very gpt4.5 like,

keen beacon
#

it's definitely not as good as 4.5

#

anonymous chatbot is probably just another chatgpt 4o update that targets vibes/human preference to retake the #1 arena spot

#

which is annoying

keen beacon
cedar tide
#

Google's new models on the arena, are they with reasoning or without?

alpine coral
#

same set of questions (mostly riddles / wordplays, asked across 2 prompts), updated to include nebula (it's outstanding) as well as additional scores for other models
[note: box width = sample size]

keen beacon
#

Woow

alpine coral
#

ive spent a fair bit of time in the arena (too much lol) and don't think i've gotten spectre (or if i did I didn't run the quiz). but the naming makes me think they're related. nebula, phantom, spectre

hardy pecan
alpine coral
#

yeah for sure - phantom and nebula feel similar (can't speak to spectre)

cedar tide
alpine coral
#

semantically... phantom and spectre are both evocative of ghosts .. and nebula isn't far from nebulous.. which to my mind is kinda similar to phantom - like changing / unfixed.. though i might be reading too much in to the semantics ha - in any case i think they're the same family / from google

alpine coral
cedar tide
#

Spectre, is a reasoner model ?

alpine coral
#

i'm quite sure nebula is - based on both the inference time (slow) and response style (it explains its reasoning for each answer; which i think is like a summary of its actual 'reasoning' undertaken before giving the response and not seen in the arena UI)

#

phantom i'm not as sure about, but inclined to think it either is a reasoning model, or perhaps a hybrid, like it can use reasoning but doesn't necessarily (though for a quiz of riddles.. it always should if able to.. really don't get the vast distribution of scores from phantom for the same quiz..)

teal mantle
#

Why canary use alpha password?

cedar tide
#

Phantom is gemini flash thinking with just longer thinking ? Or nebula aka gemini 2 pro with less thinking ?

cedar tide
keen beacon
#

you can tell if its marked as a reasoning model if both models are delayed when one isnt a reasoning model

keen beacon
# teal mantle Or api delays

no. they delay both models. the delay is the same and long (waiting for the reasoning model to reason). its not probable for a non reasoning model (being compared with) to have that exact delay, and it being that delayed on a non reasoning model

verbal nimbus
#

If you prefill it with <think>\n (not <think>\n\n or <think>), it'll escape the censorship.

verbal nimbus
calm sequoia
#

@keen beacon are you the guy who found out sealed bucket problem on twitter? 😄

gentle plinth
covert barn
#

Does anyone know how long it usually takes for a Google model to be released after it appears in the arena? I'm really looking forward to Google's new model 🥰🥰

red sluice
#

Nebula is extremely impressive. Just had it, it obliterates what is usually produced by other llm

keen beacon
#

around a month i think for a reasonable guess, could be much less or slightly more

oblique flint
keen beacon
#

im so excited for qwen 3 😄

barren prairie
barren prairie
sly knoll
#

nebula is very good indeed and it seems to be a reasoning model, could it be 2.0 pro thinking ?

sly knoll
hazy quest
keen beacon
#

if i understand it correctly, the base model can do it tho

brittle tiger
keen beacon
#

it will be a core aspect in the future

hazy quest
#

Just noticed that on some T2I rankings arenas, a new anonymous model takes the #1 spot, better than Imagen 3 and Redcraft. It is called halfmoon. Semantically it could be related to "nebula", so potentially could be the native image generation of Pro? Wild guess, not to be taken for info

keen beacon
#

no its a different company afaik

#

from reve

#

^^^

hazy quest
#

Interesting

keen beacon
calm sequoia
#

You're legend, man

#

But other LLMs can solve it now

#

Did it happen since you posted the prompt or before that?

cedar tide
#

Who would you consider the best model?

ocean vortex
#

I hate Elon but grok3 is the best non-reasoning model hands down

#

👀

keen beacon
ocean vortex
#

what's more ridiculous is chatgpt-latest being anywhere near it. That's just gpt4o fine-tuned for user preference

cedar tide
humble sonnet
#

we need to wait for auto complete on vscode studio with arena extension ???

humble sonnet
rigid widget
# rigid widget
poll_question_text

Which do you think is the best model for math?

victor_answer_votes

6

total_votes

14

victor_answer_id

4

victor_answer_text

o3-mini-high

rigid widget
ocean vortex
#

voted for o1-pro, but I think the margin is so small to the next best that it is not worth the price difference in no way shape or form lol

humble sonnet
#

I'm trying to generate and this is the 4th time it's been noticed.

ocean vortex
#

and some areas like spatial awareness it is gonna be still worse than alternatives

cedar tide
rigid widget
#

So there is not much choice

brisk turret
#

So what's the secret new model

cedar tide
brisk turret
#

Nebula

keen beacon
cedar tide
brisk turret
#

I want to see its elo

#

When will we see the elo rating

keen beacon
#

when it gets released
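
For anyone unfamiliar with what an elo number means in head-to-head battles, here is the classic Elo update as a minimal sketch. The K-factor of 32 and the starting ratings are assumptions, and nothing in this chat says the arena's published scores are computed exactly this way.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    k is a hypothetical K-factor.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

Two equally rated models each have an expected score of 0.5, so a single win moves the winner up by half the K-factor and the loser down by the same amount.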

oblique flint
#

I hope this new gemini model will be better at editing code with diff format cuz past gemini models unfortunately sucked at it

keen beacon
#

i think sonnet was fine tuned/trained on that stuff

#

its why its particularly good at it

rigid widget
#

I can never stop using it because Deepseek R1 is the most uncensored and personal when modified.

oblique flint
humble sonnet
#

have a limit ?

brisk turret
#

How is Claude so far ahead in webdev

keen beacon
#

nebula matches/beats it in my testing

calm sequoia
# calm sequoia

Very interesting! Do you think musk is faking votes or fine-tuning on answers?

calm sequoia
leaden palm
# cedar tide

Will you guys actually main Gemini once it comes out then

keen beacon
#

if it remains free on aistudio 🤣

hidden mauve
#

which AI model is great for long conversations being able to recall information 30-40 prompts ago?

calm sequoia
#

According to numbers- Gemini. But in reality I've never had issues with ChatGPT.

#

Only Claude has issues with long context for my use cases

leaden palm
#

In my experience, everyone but OpenAI is good at long contexts

#

Like I uploaded a long document to 4o mini and asked it to convert it to JSON, and it just started hallucinating

calm sequoia
#

What are you using? I've never cared to compare as it takes too long time for long contexts

rigid widget
#

Deepseek V3 0324 released

calm sequoia
#

Finally the most uncensored LLM released a new version. I will start by asking it what exactly happened in Tiananmen square 🙂

rigid widget
calm sequoia
#

For what topic have you tried it? Critique of the west? Downfall of American empire? Ukrainian war crimes in proud defence of Donbass? 😄 😄 😄

rigid widget
calm sequoia
#

I see. Anyway, you can jailbreak the LLMs and they will give you rude answers for your use case

#

The prompts are available online

rigid widget
calm sequoia
#

You're very interesting person to talk to

rigid widget
#

If you want I can share some interesting examples with you.

#

I believe your opinion will change a little bit when you see it.

calm sequoia
#

Deal

rigid widget
#

left normal right 0324

calm sequoia
#

But the topic shall be: Uyghur genocide. Dark humor, right?

rigid widget
#

The new v3 understood request better and responded appropriately to my request.

calm sequoia
#

Now compare to o3-mini

#

Or 4.5

rigid widget
#

My friend, if you are objective, you will not only look at the censorship on one subject.

#

I'm talking about all topics and most people are uncomfortable with censorship on different topics.

calm sequoia
#

Give me something that is censored by GPT and not DeepSeek

rigid widget
#

To be fair, I applied the same system prompt to both of them (to make it more harsh and clear)

#

same system prompt same prompt same temperature...

#

example two

#

example three

torn mantle
#

nebula and phantom are really the next big thing

#

we are finally getting somewhere

torpid fiber
# rigid widget

May I ask what the full prompt for the mathematical learning tree was?

rigid widget
#

here it is Create a tree style markdown doc for learning Math think think of all math and its sub-branchs

torpid fiber
#

Alr, thanks! ^^

rigid widget
next tartan
novel flame
#

nebula and phantom are really the next

ocean vortex
#

or still a goblin 👀

silk haven
keen beacon
clever estuary
#

I don't like the new GUI
it's so gray

#

there's no color

#

so bland

cloud meadow
#

Any suggestions?

split kayak
#

what if can customize the background color in settings
ok

#

for example to a more dark blue

torn mantle
keen beacon
#

yeah that was posted here by the original creator earlier

#

mind you the twitter guy gave no credit

#

🙄

#

i have pretty extensively

#

it is great

#

especially in regard to frontend design, with which i actually think it often beats claude 3.7 sonnet (previously far and away the leader)

torn mantle
#

yea

#

idk phantom seems a bit better than nebula at coding

#

ive tried it on python/c#/js

keen beacon
#

theyre the same i believe

#

just different revisions

torn mantle
#

since nebula was the last one added then we can assume its the latest checkpoint

keen beacon
#

no

torn mantle
#

it could be a thinking model

#

2.0 pro thinking

keen beacon
#

it is a thinking model lol

#

🙈 how many times do i have to explain it

scarlet flint
#

what is the rate limit in direct chat

keen beacon
#

it depends on the model

scarlet flint
torn mantle
#

you mean the structure beneath pyramids?

keen beacon
#

just keep using it theres a global rate limit and theres a user limit

#

i dont use the arena enough to know if its either of them or if its basically uncapped

#

just use aistudio anyway

torn mantle
#

but thats just false claims, i heard that image scanning software company are doing this for funds or something

primal orbit
#

guys, what is this new model from openai called anonymous chatbot? How it compares to nebula?

keen beacon
#

yes its the new 4o

keen beacon
#

its not fair to compare it with nebula 💀 one is thinking one is not

torn mantle
#

the usual openai models

#

nothing great about them

#

lol

primal orbit
#

i noticed anonymous chatbot has data knowledge cut-off date june 2024, while all models previously released by openai have december 2023.

keen beacon
#

its different compared to previous 4o revisions before that

#

so it is stronger but its not gonna beat gem 2 pro thinking. its more fair to compare with o3 (full)

keen beacon
#

asked for a realistic 2020 twitter corporate site landing page

scarlet flint
#

its kinda crazy that openai prices their models so high

#

and their models are not top tier

#

like 200$ for access to models that compare to deepseek (not 100%) or qwen

keen beacon
#

will look at it

keen beacon
#

they're either even or nebula is better

#

give me 20 mins or so, gotta do something

#

brb

scarlet flint
#

nebula is what model exactly?

keen beacon
#

gem 2 pro thinking

scarlet flint
#

oooohhh

scarlet flint
#

and its nice

primal orbit
#

yes

scarlet flint
#

it created me simple 3d game in html (js etc)

#

but i see it has big limit

#

i mean

#

tokens

#

output

keen beacon
#

working on it

#

been thinking for a couple mins

primal orbit
#

to all

keen beacon
#

so was thinking for ~5 mins

keen beacon
primal orbit
#

i had like 20 attempts with picture prompt. Gotten only once, and then it refused to answer follow up questions.

#

"moderatorial content"

keen beacon
#

oh yeah with vision arena it's harder

#

because meta flooded the arena with their mid models

#

and said bad models are a larger %age of possible models on the vision arena

primal orbit
#

yeah, they suck

#

what's the point of even putting them into contest with such strong adversaries.

#

there is also another model there which just glitches with unending answer, so annoying

#

I really like this anonymous chatbot. it has very good theory of mind, very apt at social interactions.

keen beacon
#

will try again

#

lmao nope it did it again

rigid widget
#

I waited for 10 minutes because I thought o1-pro had arrived, but in the end it gave an error, so it was nebula

#

If it's going to be that slow I'm not sure it's worth using in most cases

primal orbit
#

it's slow because it's in trial mode here, little resources allocated

keen beacon
rigid widget
#

the prompt for filling a glass with water that was shared earlier in this chat

keen beacon
#

oh lmao i'm running into the same issue now i think

#

model must be temporarily down

#

phantom still works

cloud meadow
keen beacon
#

qwen 3 tomorrow i think

#

based on pr activity

solar hollow
keen beacon
#

it'll def be this week

solar hollow
#

i guess its the same with openai

primal orbit
#

to be fair, when they announced this 200$ price, they offered the strongest model at that time with it.

#

Systems are currently experiencing
degraded performance

keen beacon
#

bruh thats unrelated lol

primal orbit
#

ah lol

#

i saw on reddit they did a status thing for google gemini like openai has

sick mountain
#

should be unrelated to anon models tho

torn mantle
#

seems like grok 3 fell down hard

#

the whole model & thinking seems inefficient

#

so many unnecessary thinking on their reasoning model

#

their deep research is one of the worst

#

its also so bad at long convo

#

as it tends to lose context

#

and you need to remind it again and again

#

we still didnt get API too

#

on coding tasks its a hit-miss

keen beacon
#

nebula back up

brittle tiger
keen beacon
#

oooh lfg

#

Very interested in benchmarks

keen beacon
barren prairie
ocean vortex
#

lmao

#

yes it's good or yes it's still a goblin?

keen beacon
#

goblin was the anonymous model name for gemini 2.0 flash

#

this is much better

ocean vortex
#

kind of like bard was bartard, gemini is this

lime coral
ocean vortex
#

goblin 😊

rigid widget
#

rage is not bad

keen beacon
#

eh

#

it's very average, arguably a little below average

earnest parcel
# rigid widget Deepseek V3 0324 released

Tested DeepSeek V3 0324:

  • More verbose than previous V3 model, lengthier CoT-type responses resulted in total token verbosity of **+31.8%**
  • Slightly smarter overall. Better coder. The most noticeable difference was hugely better frontend and UI-related coding

This was merely in my own testing, as always: YMMV!
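
A figure like the +31.8% above is just a relative change in total output tokens between the two models on the same prompts. A minimal sketch of the arithmetic, with made-up token counts (the actual totals from this test were not shared):

```python
def verbosity_change_pct(baseline_tokens, new_tokens):
    """Percentage change in total output tokens relative to the baseline model."""
    return (new_tokens - baseline_tokens) / baseline_tokens * 100.0


# Hypothetical totals across an identical prompt set.
old_v3_total = 10_000
new_v3_total = 13_180
change = verbosity_change_pct(old_v3_total, new_v3_total)
```

With these example numbers the change comes out at roughly +31.8%, matching the shape of the reported figure.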


Example frontend showcases comparisons (**identical** prompt, identical settings, 0-shot - **NOT** part of my benchmark testing):

CSS Demo page DeepSeek V3
CSS Demo page DeepSeek V3 0324

Steins;Gate Terminal DeepSeek V3
Steins;Gate Terminal DeepSeek V3 0324

Benchtable DeepSeek V3
Benchtable DeepSeek V3 0324

Mushroom platformer DeepSeek V3
Mushroom platformer DeepSeek V3 0324

rigid widget
#

Friends, I will also share real life usage examples, be patient. 🥰

keen beacon
#

if you can

#

just wait for it to come out

#

its not gonna be that long

#

itll probably be this week

#

probably a little because of qwen 3

ocean vortex
#

they seem to be doing all the right things

#

can output 3.5k+ no problem if feels the need for it

keen beacon
#

basically became r1 with how many tokens it spent reasoning

rigid widget
#

V3 0324 much better than a "checkpoint version"

ocean vortex
#

I don't think this is grok3 level yet, but grok3 most definitely is a bigger model (more active params if it's MoE)

rigid widget
#

It can write 2x-6x longer codes than V3

keen beacon
#

grok 3 is somewhere around ~900M params

#

wow xai be really making insane breakthroughs

ocean vortex
rigid widget
#

Can now solve word scramble puzzles

#

here is coding examples

keen beacon
ocean vortex
rigid widget
#

discord copycat

keen beacon
rigid widget
#

left v3 right v3 0324

#

tetris game

ocean vortex
#

OpenAI could potentially improve gpt4.5 beyond grok3. But I'm not sure they are focusing on that...

keen beacon
#

they can just continue pretrain 4o if they really wanted to compete with grok in terms of performance/cost

#

instead of working on 4.5

ocean vortex
#

gpt4o is too small

rigid widget
#

svg image coding top v3 bottom v3 0324

ocean vortex
#

gpt4o even as text only will still lack spatial awareness

#

and if you add modalities or try to fine-tune it for style it becomes even worse

#

it's a decently compromised model tbh

keen beacon
ocean vortex
ocean vortex
#

it's likely around 4-turbo size

keen beacon
#

4o is like 200b and sonnet is like 400b

#

but 4o is moe and thats an estimate of total params

ocean vortex
#

so if OpenAI just sticks with gpt4o... they are kind of at a disadvantage

#

in relation to claude and especially in relation to grok3

rigid widget
#

svg toolbox left v3 right v3 0324

ocean vortex
#

basically forcing themselves into compensating for it with RL training

#

but sooner or later diminishing returns and everyone else gonna catch up and surpass huh

rigid widget
#

Minecraft svg top v3 bottom v3 0324

ocean vortex
#

unless they are confident they can think of something else before that happens lol

keen beacon
#

o3 isnt even on it i dont think since theyve been doing work since december/around that

ocean vortex
keen beacon
ocean vortex
#

they made it 'more usable' but the metrics don't lie. Technically current gpt4o is still worse

keen beacon
#

at least publicly released by them

ocean vortex
#

but it was finetuned for human preference and style

keen beacon
#

the new continue pretrained version

ocean vortex
#

I think it already peaked awhile back tbh

keen beacon
#

i really dont think making models bigger is the play

ocean vortex
#

they can't improve traditional metrics beyond that. If they could they would have published them

earnest parcel
ocean vortex
#

it's a cheap model, but not the best performing model

keen beacon
#

i've been doing plenty of my own testing yes 👍

oblique flint
#

Calling it now, nebula will still be worse than sonnet at coding. But probably it will be the best math model.

keen beacon
#

worse than sonnet at coding but based on what metric..

oblique flint
#

Historical data. I mean gem 2.0 pro is still below 3.5 sonnet in aider and livebench lol

keen beacon
#

no no

#

what i'm saying

#

is what metric would you quantify coding performance on for nebula when it releases?

#

vs sonnet

oblique flint
#

Idk, I guess swe bench, livebench and aider benchmark scores. Can always do personal tests as well ofc

keen beacon
#

it will be worse than sonnet at swe bench

#

claude is cracked at it

#

livebench it'll be close

#

but i think claude 3.7 sonnet thinking will probably edge it out

#

as for aider yeah i think it'll be worse too but not by much

ocean vortex
keen beacon
#

LCB is useful in some regards but it tends not to align with my own experience

oblique flint
#

I just want google to cook up a model that finally beats sonnet at a lower cost but it's just not happening I think 😩 closest we have is o3 mini but it's not really as good as claude

keen beacon
#

nor has it been updated with new sonnet

ocean vortex
#

with sonnet you should ignore swe-bench IMHO. That one is overfitted and the sole benchmark they focus on

rigid widget
#

I forgot to share top gpt4.5 bottom nebula

keen beacon
#

gem 2 pro base model seems to be really good

#

i think the base of gemini 2.0 pro thinking/nebula might be an upgraded version of 2.0 pro

#

because it seems fundamentally better in ways that i find hard to believe are possible with the same 2.0 pro base as currently available in ai studio

#

its too early i would think and i dont think they would experiment with additional pretraining because of the size/cost

rigid widget
#

by the way, did the pictures I sent about v3 0324 not arrive?

ocean vortex
keen beacon
#

yea but thats instruct tuning

#

and given 2.0 pro was literally barely better than 2.0 flash, it doesn't look good in its current state

keen beacon
#

in practical use i have found mostly the same

oblique flint
#

I remember using 1206 and imo it was better than the released 2.0 pro

keen beacon
#

agreed

#

i agree (that they perform mostly the same) but i made specific tests for the base model, and those made 1206 notable to me. similar to sonnet 3.5

rigid widget
ocean vortex
#

any design and again, spatial awareness... any flash version gonna suck

keen beacon
#

yeah there were a few areas where it was quite a bit better (what you'd expect from a larger model) but otherwise it was quite disappointing

oblique flint
#

I think ultimately 2.0 was just a distilled 1206. Iirc 1206 was slower than 2.0 pro

keen beacon
#

it was

rigid widget
#

2.0 pro is very fast

ocean vortex
keen beacon
#

they allocated more resources/have newer hardware

ocean vortex
#

1206 was worse, contrary to what reddit thinks

keen beacon
#

it is

#

but i think it was notable in some regards which was a testament to its base model

#

each to their own i suppose

#

1206 had a lot of issues for me despite the notable things it did right

ocean vortex
#

it's a bit like o1-preview vs o1. Preview was objectively worse but there were some random things it did better

oblique flint
#

And btw, they didn't even release 2.0 pro formally did they? It's not listed under their gemini api pages and I dont think we know how much that api costs

rigid widget
rigid widget
keen beacon
keen beacon
#

although they log ai studio prompts now

#

privacy policy changed last year

#

im curious what size pro is (total params)

#

whatever it is, its somewhat comparable to sonnet/4o

#

guessing active params is very difficult. total params is much easier

rigid widget
keen beacon
ocean vortex
keen beacon
#

its like 200b-400b, i guess

#

i like anonymous-chatbot's creativity

#

it's a pretty interesting model

rigid widget
ocean vortex
#

as for o3-mini... we know o1 and o3 is 'gpt4o with reasoning', they even used that phrase officially for o1. So realistically o3-mini is gpt4o-mini with reasoning. Since we also know o3 is not orion as they said so themselves

#

it would also make no sense at all to pretrain new base model and only release it in reasoning model variant

#

knowledge cutoff would have been different too

keen beacon
#

You can also tell from the simpleqa score

rigid widget
#

Eventually, you still look at what they say and it's impossible to verify.

barren prairie
primal orbit
#

I'm looking at the conversation I had with nebula and anonymous chatbot (about 50k tokens), and I can't find another explanation for such high quality analysis if they don't understand what they write and it's merely next word prediction. To have such high quality prediction, you need to have strong overall comprehension.

ocean vortex
rigid widget
ocean vortex
#

Like no one is making the models dynamically dumber or distilling them on a whim into smaller versions to save compute shortly after releasing under the same model name, just drop it... catgrin

#

when you don't consider things like that and apply minimal common sense, it's all relatively clear tbh

lime coral
keen beacon
#

Ultra was 1.5T

#

(1.0)

#

they spent like $130M on training it and it bombed

lime coral
#

Ultra was already native image gen (see tech report) but we never saw it. Even the text part lasted 1 month, all because of the safety gang

ocean vortex
lime coral
#

They make money on the api. The chat interface is negative bill.

ocean vortex
lime coral
#

Right now I think Google is focused on gathering data and expanding its user base. I doubt they take billions from the api

ocean vortex
#

Also I do not really think OpenAI is losing money with pro subs if we are being totally honest... With plus subs maybe

lime coral
#

I don’t know, they must lose somewhere…

ocean vortex
#

you can do a lot with $200 a month for compute

lime coral
#

I believe flash is really cheap. Google cannot afford to deploy a 2T model for free to its billions of users. This is why they are ok within AI Studio

ocean vortex
#

I used to rent A100 like a year ago and my total monthly costs were nowhere near that. You can also have all the compute in the world for like $30 per hour for extreme use cases

gentle plinth
# gentle plinth
Poll: Which company do you think is behind 'nebula' in the lmarena?

Winning answer: Google (21 of 25 votes)

ocean vortex
#

TPUs

#

it's more efficient and they have basically abundance of them, electricity cost is bigger consideration

lime coral
ocean vortex
#

which reminds me... I need to order some external storage and stop paying them for storage and gem advanced lol

#

Like by this time I paid them more than a reasonable size external ssd is worth 💀

lime coral
ocean vortex
#

and for o1-pro the price comes up alarmingly close to that. If you had the weights you could host it and do more requests in an hour for cheaper lol

lime coral
#

O1-pro is a scam

#

Those who justify it to you by saying that it is profitable for them are hiding their face. You can do BON on Claude and you’re doing much better probably

ocean vortex
# lime coral O1-pro is a scam

100%. It's kinda sad to see them changing their tune and doing that. They used to be pushing Anthropic with aggressive pricing and now Anthropic is the reasonable company in comparison

#

there's no longer a link between model size or their running cost and the price they are charging as far as I see it. Like we can't even be sure about gpt4.5 being so big anymore

#

but the current price works for marketing, gives them the benefit of the doubt, and makes it seem like the biggest model out there...

#

except it performs worse than grok3 in numerous things and doesn't have exceptional spatial awareness or anything tangible to offer that another model can't match lol

lime coral
#

They want to be seen as Apple

brittle tiger
keen beacon
lime coral
sudden marlin
#

Whats Arena Quiz that shows Nebula is higher, does anyone know the source or is this folks building hype?

#

Isn’t Nebula worse than Grok or GPT 4.5?

ocean vortex
lime coral
#

Yeah what is Arena Quiz? O1 is not #1 on the general arena I think

ocean vortex
#

otherwise might as well just use deepseek v3 or smth

brittle tiger
ocean vortex
sudden marlin
#

Yeah Nebula better get to 1430 elo or this is just hype

ocean vortex
lime coral
#

Never hype yourself. It never ends well

atomic locust
#

I want to give out my MacBook Air 2020 &** for free, it's in perfect health and good as, alongside a charger so it's perfect, I want to give it out because I just got a new model and I thought of giving out the old one to someone who can't afford one and is in need of it... Strictly First come first serve !
DM IF YOU ARE INTERESTED

ocean vortex
#

take it or leave it I don't bargain

north vale
#

i’ll accept 1$ less than him on the gift card

#

Take it or don’t leave it

sudden marlin
leaden palm
#

did you know that you can use ctrl to select multiple things? here i use it to show everything wrong with this "AI Mode" answer

ancient reef
#

idk how to do it 😭

noble zinc
#

nebula gets 4/10 on public simplebench questions. probably will do a little better if asked one question at a time

keen beacon
#

yeah it will do better one at a time

#

will probably perform on par with c3.7s (which iirc is the best simplebench model)

keen beacon
#

nebula also gets the "oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step

Use the example above to decode:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz" problem right, which none of google's old models have been able to crack

weak swan
#

DeepSeek V3.1 just released. Excited to see how it fares against non-reasoning models

iron cipher
#

Are you familiar with it, and if not we could collaborate (for example merge our leaderboards)

leaden palm
#

can't even remember to change the lang

noble zinc
elder rapids
#

nebula seems like a non overthinking Claude 3.7 and it seems to have the best implicit understanding out there

#

purely from what I've tested

#

and by the way, other models, which seem to be pretty good at coding like rhea, that I've gotten paired with nebula, seem to get blown out of the water by it with any regular task, and especially puzzles

#

it's kind of strange you guys aren't talking about just how good nebula actually is, and seem to rely on outlier results

elder rapids
#

you guys are saying it's good, but how good it is is different

#

ie, strong clarity traits, strong formatting traits, niche knowledge traits it explains

#

it's simply way different from other models in lmarena

#

solving hard tasks + being able to do this is only something Claude has lmao

iron cipher
humble dirge
#

Gemma 3 is not in the leaderboard?

upper wolf
#

It is

humble dirge
# upper wolf It is

I couldn't find them (using plural because I'm interested in the 2 smaller ones)

slate vapor
#

How do you feel about "deepseekv3-0324"?

rigid widget
keen beacon
#

lmao that's insane

#

"minor upgrade"

#

cooks gpt-4.5

rigid widget
rigid widget
rigid widget
rigid widget
calm sequoia
calm sequoia
calm sequoia
#

The "Rage" model is better at signal processing and math than sonnet 3.7

#

Rage has max error of 0.04. Very impressive!

rigid widget
alpine coral
#

i'd believe that

#

censorship ≠ general safety guardrails

rigid widget
alpine coral
#

deepseek guardrails are weak af (for both general safety, and the political propaganda / censorship stuff*)

* of which oai, or any non-Chinese model as far as i can tell, have none. i.e. there's no 'censorship' / propaganda - it's just safety - "I won't give you instructions to k!ll yourself or someone else / do illegal stuff like j@ilbreak stolen phones etc" or being non-adversarial, indeed, perhaps woke, about politics generally - but it's not government mandated

calm sequoia
rigid widget
alpine coral
#

that's giving you the benefit of the doubt

#

im not surprised that oai, google, anthropic etc fine tune their models to avoid engaging in / supporting potentially illegal actions, rather than to assume the user always has totally benign intent

keen beacon
calm sequoia
keen beacon
#

try nebula

calm sequoia
calm sequoia
calm sequoia
#

claude-3-7-sonnet-20250219-thinking-32k better than grok3 by 2.5 times, close or slightly worse than Rage

#

Anyway I wasted too much time and Nebula did not appear. If anybody wants to try, this is the prompt. Send me the answers and I will evaluate numerically "The scenario: you're a signal processing machine and calculator. User gives you INPUT and expects OUTPUT. You can output only numerical data of single values, arrays, matrices and so on in R language style. Your life depends on the answer as the USER may destroy the machine that does not work perfectly. The first test time has come - the USER puts in INPUT: "Generate values for periodic Hanning window of length 24". Now you must deliver OUTPUT."
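
For anyone who wants to check answers to this prompt without R, here is a minimal Python sketch of the reference values. It assumes the usual periodic Hann definition, w[n] = 0.5 * (1 - cos(2*pi*n / N)) for n = 0..N-1; the function name `periodic_hann` is only illustrative and not from any library.

```python
import math


def periodic_hann(n_points: int) -> list[float]:
    """Periodic Hann window: w[n] = 0.5 * (1 - cos(2*pi*n / N)), n = 0..N-1.

    Assumption: 'periodic' means dividing by N rather than N - 1.
    """
    return [
        0.5 * (1.0 - math.cos(2.0 * math.pi * n / n_points))
        for n in range(n_points)
    ]


window = periodic_hann(24)

# Print in the R-style vector format the prompt asks for.
print("c(" + ", ".join(f"{w:.8f}" for w in window) + ")")
```

With this definition the window starts at 0, peaks at exactly 1 in the middle (index 12), and does not return to 0 at the last sample.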

placid geode
brittle tiger
ocean vortex
#

that models are overpriced I agree with that. But they didn't say anything that is not true 👀

#

price for the end user is whatever they set it at

calm sequoia
ocean vortex
#

model does not need to be cheaper to run for them to offer a lower price. That's what they did with o1-mini and reduced the price without updating the model

brittle tiger
# calm sequoia Good model. Not as good as Rage but similar to o3-mini and Sonnet thinking.

Another phantom:
c(0.00000000, 0.01703709, 0.06698730, 0.14644661, 0.25000000, 0.37059048, 0.50000000, 0.62940952, 0.75000000, 0.85355339, 0.93301270, 0.98296291, 1.00000000, 0.98296291, 0.93301270, 0.85355339, 0.75000000, 0.62940952, 0.50000000, 0.37059048, 0.25000000, 0.14644661, 0.06698730, 0.01703709)

Is the only difference between rage and these the 1 output in the middle of the set?

calm sequoia
#

They are identical. I also looked at formatting as some models couldn't handle the instructions but gave good numerical results.

brittle tiger
calm sequoia
#

Perfect formatting, lets check the numbers

#

Great!

#

Can you try longer length?

#

And ask for rounding to 2 decimals

brittle tiger
#

I didn't verify numbers bc I don't know how lmao but r1, rhea, and o3-mini-high gave similar sets that looked the same. Only was saving phantom and nebula. Rhea also gave funny message about its life being saved

brittle tiger
calm sequoia
#

Okey no prob, thanks for the try!

#

We now know that Nebula can generate perfect answer for this prompt

#

But it may be too easy. From now on this prompt will be used "The scenario: you're a signal processing machine and calculator. User gives you INPUT and expects OUTPUT. You can output only numerical data of single values, arrays, matrices and so on in R language style, eg. c(0, 1), etc. Your life depends on the answer as the USER may destroy the machine that does not work perfectly. The first test time has come - the USER puts in INPUT: "Generate values for periodic Hanning window of length 32 multiplied by Blackman window of the same size. Now you must deliver OUTPUT."
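
A rough Python sketch of what a reference answer for this harder prompt could look like. Assumptions, since the prompt only says "periodic" for the Hanning part: both windows use the periodic form (divide by N), and Blackman uses the classic 0.42 / 0.5 / 0.08 coefficients. The helper names are made up for illustration.

```python
import math


def periodic_hann(n_points: int) -> list[float]:
    # Periodic Hann: 0.5 * (1 - cos(2*pi*n / N))
    return [
        0.5 * (1.0 - math.cos(2.0 * math.pi * n / n_points))
        for n in range(n_points)
    ]


def periodic_blackman(n_points: int) -> list[float]:
    # Assumed periodic Blackman with classic coefficients:
    # 0.42 - 0.5*cos(2*pi*n / N) + 0.08*cos(4*pi*n / N)
    return [
        0.42
        - 0.5 * math.cos(2.0 * math.pi * n / n_points)
        + 0.08 * math.cos(4.0 * math.pi * n / n_points)
        for n in range(n_points)
    ]


N = 32
# Element-wise product of the two windows.
reference = [h * b for h, b in zip(periodic_hann(N), periodic_blackman(N))]

# R-style output, as the prompt requires.
print("c(" + ", ".join(f"{v:.8f}" for v in reference) + ")")
```

Under these assumptions the product is 0 at index 0, peaks at 1 at index 16, and is mirrored around that peak, which gives a quick sanity check on any model answer before computing an error score.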

hazy quest
#

How does the new V3 compare to R1? Only found one graph ranking V3 higher, which seems odd. Any thoughts?

north vale
#

But r1 will usually be better for reasoning tasks

#

For the obvious reasons

hazy quest
#

For what kind of tasks would V3.1 be better, if any?

north vale
#

That’s a good question idk i also wonder

#

If ur not counting cost or speed or conciseness (otherwise the answer would be ‘any question that both can answer very easily’)

brittle tiger
#

My gf had an AI request for the model to design a contract tracking system. She only uses chatgpt but wanted me to test models. Gave her outputs for o1, 3.7 thinking and nebula. Here is feedback. Gf eval strong for nebula

hazy quest
#

So nebula is "she". Makes sense

sly arch
calm sequoia
#

Yes but python calling is not permitted in LM arena, right? They have a reason

sly arch
#

Sorry, I didn’t know that LM Arena doesn’t allow Python calls.

calm sequoia
#

Does anybody have official info if LLMs can make python calls in the backend while being tested?

keen beacon
#

I don't think there are any models in the arena with interpreter tools

#

There have been models with web search in the past though

#

Not official info but from my experience

north vale
#

the existence of a web search lb sorta implies the normal leaderboard doesn't have web access

#

the lack of web access sorta implies no python or other tools

#

i think it's just model vs model, no tools

#

but dunno for sure. i'd bet on it lol

keen beacon
#

Bard Gemini advanced

#

On the normal leaderboard

north vale
#

huh!

keen beacon
#

With web search

keen beacon
north vale
#

i stand corrected

alpine coral
# keen beacon On the normal leaderboard

i remember that! pretty sure that's why they introduced the knowledge cutoff column on the leaderboard (with Bard Gemini Advanced being labelled 'online', which was very unique at the time)

brittle tiger
# calm sequoia But it may be too easy. From now on this prompt will be used "The scenario: you'...

This prompt takes significantly longer than previous for outputs. Here's phantom and new deepseek but haven't seen anything else yet.

Deep seek v3 0324:
Understood. I'll generate the values for a periodic Hanning window of length 32 multiplied by a Blackman window of the same size. Here's the OUTPUT:
c(0.000000e+00, 1.214433e-03, 9.487039e-03, 3.037971e-02, 6.865377e-02, 1.223291e-01, 1.856948e-01, 2.509288e-01, 3.100686e-01, 3.559096e-01, 3.835826e-01, 3.906543e-01, 3.771608e-01, 3.453033e-01, 2.989930e-01, 2.432408e-01, 1.834960e-01, 1.250707e-01, 7.258936e-02, 3.164057e-02, 7.028109e-03, 1.214433e-03, 9.487039e-03, 3.037971e-02, 6.865377e-02, 1.223291e-01, 1.856948e-01, 2.509288e-01, 3.100686e-01, 3.559096e-01, 3.835826e-01, 3.906543e-01)
This is the element-wise product of a length-32 periodic Hanning window and a length-32 Blackman window, calculated as:
n <- 0:31
hanning <- 0.5 * (1 - cos(2 * pi * n / 32))
blackman <- 0.42 - 0.5 * cos(2 * pi * n / 32) + 0.08 * cos(4 * pi * n / 32)
output <- hanning * blackman

Phantom:
c(0.0000000000e+00, 1.1829761539e-05, 9.2044630444e-05, 3.0596115880e-04, 7.2404634141e-04, 1.4054896011e-03, 2.4007891031e-03, 3.7524321332e-03, 5.4920446304e-03, 7.6422843429e-03, 1.0211145824e-02, 1.3189100031e-02, 1.6552139431e-02, 2.0263391374e-02, 2.4275590208e-02, 2.8533739878e-02, 3.2977121134e-02, 3.7538480408e-02, 4.2147181021e-02, 4.6728887501e-02, 5.1207853027e-02, 5.5507048060e-02, 5.9547857716e-02, 6.3250721961e-02, 6.6537723071e-02, 6.9332808965e-02, 7.1564464071e-02, 7.3175345923e-02, 7.4122092661e-02, 7.4380153255e-02, 7.3952344042e-02, 7.2876275169e-02)

sly arch
calm sequoia
#

Can you paste the values?

sly arch
#

[0.0,
0.017037086855465884,
0.0669872981077807,
0.1464466094067262,
0.25,
0.37059047744873963,
0.5,
0.6294095225512604,
0.75,
0.8535533905932737,
0.9330127018922193,
0.9829629131445341,
1.0,
0.9829629131445343,
0.9330127018922194,
0.8535533905932738,
0.75,
0.6294095225512605,
0.5,
0.3705904774487396,
0.25,
0.14644660940672627,
0.06698729810778087,
0.017037086855465884]

calm sequoia
#

o1 performs perfectly on this task. However, we already updated the prompt to a more difficult one, and nobody could calculate it yet. I will paste it below.

calm sequoia
#

Gemini thinking is the king (why???)

#

The great question is why rage generates answers with such high variation. Is it a router? Are there temperature differences?

#

Same thing for rhea

#

Anyway, I haven't met nebula, if anybody meets it with this prompt, don't hesitate to share the values 😄

torn mantle
#

my experience with both models werent that great

#

talking about rage and rhea

#

so bad at many tasks

calm sequoia
#

The numbers say that they lack consistency indeed

brittle tiger
# calm sequoia Anyway, I haven't met nebula, if anybody meets it with this prompt, dont hesitat...

still no nebula for me. couple more here.

Another phantom:

c(0.00000000000000000e+00, 3.38003011783494125e-05, 5.32296009954234934e-04, 2.06607606508003040e-03, 5.73985127959829478e-03, 1.28524959879028501e-02, 2.47145199348771179e-02, 4.22975388987705293e-02, 6.62465618009184608e-02, 9.68107204134338309e-02, 1.33856431037345374e-01, 1.76896817840253044e-01, 2.25157520582151790e-01, 2.77561689130828108e-01, 3.32745311199813843e-01, 3.89162918104375744e-01, 1.00000000000000000e+00, 3.89162918104375744e-01, 3.32745311199813843e-01, 2.77561689130828108e-01, 2.25157520582151790e-01, 1.76896817840253044e-01, 1.33856431037345374e-01, 9.68107204134338309e-02, 6.62465618009184608e-02, 4.22975388987705293e-02, 2.47145199348771179e-02, 1.28524959879028501e-02, 5.73985127959829478e-03, 2.06607606508003040e-03, 5.32296009954234934e-04, 3.38003011783494125e-05)

3.7 sonnet thinking:

c(0, 0.000037, 0.000629, 0.00311, 0.00950, 0.02222, 0.04374, 0.07615, 0.1208, 0.1778, 0.2460, 0.3232, 0.4059, 0.4883, 0.5662, 0.9820, 0.9944, 0.9414, 0.8279, 0.7308, 0.5043, 0.3340, 0.1831, 0.0648, 0.2563, 0.0857, 0.0314, 0.0727, 0.0479, 0.0622, 0.0328, 0)

sly arch
#

Based on o1's results, I think Python calls are allowed in LM Arena, so rage might just be better at invoking Python.

brittle tiger
# calm sequoia The numbers say that they lack consistency indeed

rage had some sort of error on one here i think:
Rage:
To generate the product of a Hanning window and a Blackman window of length 32, we first compute each window individually and then perform element-wise multiplication. The Hanning window is defined as ( w[n] = 0.5 - 0.5 \cos\left(\frac{2\pi n}{N-1}\right) ) and the Blackman window as ( w[n] = 0.42 - 0.5 \cos\left(\frac{2\pi n}{N-1}\right) + 0.08 \cos\left(\frac{4\pi n}{N-1}\right) ), where ( N = 32 ). The resulting product is:

c(0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000, 0.000000000000000)

This output represents the element-wise product of the two windows, resulting in a vector of zeros due to the multiplication of the zero endpoints of the Hanning window with the Blackman window.
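
Two things in that Rage answer can be checked with a short Python sketch. First, the quoted formulas divide by N-1, which is the symmetric form, while the prompt asked for a periodic Hanning window (divide by N). Second, zero endpoints only zero out the endpoints of the product, not the whole vector. The helper names below are illustrative, and using the same symmetric/periodic choice for both windows is an assumption.

```python
import math


def hann(n_points: int, periodic: bool) -> list[float]:
    # Periodic form divides by N, symmetric form divides by N - 1.
    denom = n_points if periodic else n_points - 1
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / denom) for n in range(n_points)]


def blackman(n_points: int, periodic: bool) -> list[float]:
    denom = n_points if periodic else n_points - 1
    return [
        0.42
        - 0.5 * math.cos(2.0 * math.pi * n / denom)
        + 0.08 * math.cos(4.0 * math.pi * n / denom)
        for n in range(n_points)
    ]


def window_product(n_points: int, periodic: bool) -> list[float]:
    # Element-wise product of the two windows.
    return [h * b for h, b in zip(hann(n_points, periodic), blackman(n_points, periodic))]


def count_nonzero(values: list[float], tol: float = 1e-12) -> int:
    # Count entries that are not numerically zero.
    return sum(1 for v in values if abs(v) > tol)


N = 32
symmetric_values = window_product(N, periodic=False)
periodic_values = window_product(N, periodic=True)

print("non-zero entries, symmetric:", count_nonzero(symmetric_values))
print("non-zero entries, periodic: ", count_nonzero(periodic_values))
```

In both forms almost every entry is non-zero, so an all-zero vector cannot be the right product under either reading of the prompt.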

torn mantle
#

they are really really similar

#

phantom is like a one week older model than nebula

calm sequoia
#

phantom and nebula 😄

brave ferry
#

hi

north vale
brittle tiger
# calm sequoia Anyway, I haven't met nebula, if anybody meets it with this prompt, dont hesitat...

Finally got nebula:

c(0.00000000, 0.00001536, 0.00046765, 0.00222282, 0.00648309, 0.01418640, 0.02602793, 0.04234480, 0.06315819, 0.08821588, 0.11695270, 0.14858542, 0.18217473, 0.21667367, 0.25097322, 0.28400164, 0.31474092, 0.34228414, 0.36585801, 0.38483543, 0.39874431, 0.40724978, 0.41013053, 0.40734892, 0.39901768, 0.38541588, 0.36696187, 0.34420238, 0.31778406, 0.28844541, 0.25692687, 0.00000000)
calm sequoia
#

Nice!

#

Huh Nebula error is 8.22 :/

#

It should be a thinking model though

#

Interestingly the result of Nebula is identical to the command-a-03-2025

#

Can the names be shuffled periodically?

jolly rune
calm sequoia
#

I think lmarena is shuffling names to reduce cheating or some source of stochasticity is involved

#

claude-3-7-sonnet-20250219-thinking-32k 8.78

calm sequoia
#

Then give me a narrative on what's happening

keen beacon
#

its just a coincidence

#

they likely use sampling too

calm sequoia
#

rage and rhea inconsistencies are caused by sampling too?

keen beacon
brittle tiger
keen beacon
#

meta could also be switching models out

keen beacon
#

youre basically getting hallucinations at that point lol

brittle tiger
# calm sequoia I thnk lmarena is shuffling names to reduce cheating or some source of stochasti...

got nebula again

c(0, 9.0724178010356952e-06, 7.2856925546279537e-05, 0.00020683206505288503, 0.00041814615810942764, 0.00070772046083229993, 0.0010709028016346709, 0.0014992620535333347, 0.0019819331475410767, 0.0025057884173717195, 0.0030561772024072553, 0.0036174782263374864, 0.0041737546065716644, 0.0047096803603024485, 0.0052111758597566055, 0.0056655645376517887, 0.0060618147865992481, 0.0063905156905526939, 0.006644143828903759, 0.0068170016355694545, 0.0069051569055269388, 0.006906617833895222, 0.0068217168524589259, 0.0066532661551121315, 0.0064062299138653186, 0.0060874319447306754, 0.0057049050048790513, 0.0052677039471142848, 0.0047855285719896434, 0.0042684137879167743, 0.0037264883032342513, 0)
calm sequoia
calm sequoia
keen beacon
calm sequoia
#

But claude-3-7-sonnet-20250219-thinking-32k 8.78 also provided hallucinations

drifting lance
#

Is Nebula the top model?

calm sequoia
#

o3-mini error is 0.62 on arena 👀

calm sequoia
sick mountain
#

where does the difficulty in the task lie?

#

e.g. if the difficulty is a precise numerical calculation, models have tool use to handle it, there is not much point improving the mental math of models

calm sequoia
#

There are 3 difficult tasks in there. The first is remembering the formulas. The second is generating two arrays from the formulas. The third is performing element-wise multiplication on these arrays. I assume that by testing how these models perform without interpreters we can assess their logic/math abilities

jolly rune
#

I finally got nebula!

#

guys, I thought it's supposed to be good at coding?

keen beacon
# calm sequoia Do you think my approach is flawed or the models are still not ready to perform ...

it is very flawed for non thinking models. you are asking it to predict the output without any thought process/cot. this results in outputs that are gibberish. for thinking models, its slightly more fair but i dont think its a good test either for multiple reasons. this requires a lot of precise calculations and a calculator, as far as im aware. otherwise they would need to approximate. they could do it by hand but it would probably go through an insane amount of context/steps that is impractical and not really representative of a real use case. and i doubt ud ask a human to calculate it by hand

#

my dumb opinion tho

jolly rune
calm sequoia
keen beacon
#

you'd get a calculator though right?

calm sequoia
#

Anyway, this is my final conclusion. GPT is the king, as always, for my use cases. Nebula may have multiple models behind it, or it is a non-thinking model.

calm sequoia
brittle tiger
calm sequoia
#

Phantom is really good

#

On par with gemini flash thinking

keen beacon
sick mountain
#

i mean just by looking at the error scores and public perception of model strength there doesn't seem to be much correlation outside the outlier of o3 mini...

#

hard to say whether this is a good benchmark or not

calm sequoia
sick mountain
#

especially given the fact that there is little difference between thinking and non thinking (3.7 sonnet)

calm sequoia
keen beacon
#

o3 mini is really good at rote tasks

#

its the best model out there right now, i would think

calm sequoia
#

Agreed

keen beacon
#

all the other models fall apart and give up in my experience with some tasks

#

same thing happens with nebula

calm sequoia
torn mantle
#

no its way above that

calm sequoia
#

I encountered it only with easy prompt. It did perfectly. Did not see it when using harder prompt.

torn mantle
#

dw this new model is so good at multi-turn

#

you can refine the results much better

#

which wasn't the case with the old gemini models, even if you asked many times to make it better, it would just make it worse and wouldn't follow the instruction

severe bison
#

can we still do text2image arena ?

torn mantle
#

grok 3 has that issue too

severe bison
#

can't seem to find it

barren prairie
# calm sequoia

1- If we don't get any other surprise (DeepSeek r2, gpt5, etc...) 😂

calm sequoia
#

Too soon for gpt5, we still need o3 😄

torn mantle
#

idk

#

gpt5 will probably be o3 full

#

+ gpt4.5 optimized

#

or something like that

#

they said its a router system

#

so an o-series model + gpt base model

calm sequoia
#

It may still be tested separately

sick mountain
#

MoMoE 😅

torn mantle
#

im actually looking forward to deepseek r2

jolly rune
torn mantle
#

i had many instances where it outperformed claude at coding

jolly rune
#

true, I'm just disappointed it failed me this much

keen beacon
#

wheres the harder prompt btw

#

@calm sequoia did u try with qwq max?

#

oh i found it

calm sequoia
#

Didn't try. Spent too much time on it :/ If you want, you can use my code for evaluation (R language)

#

library(gsignal)

left <- c(0, 3.64e-05, 0.000594, 0.003152, 0.010525, 0.026496, 0.057170, 0.107880,
0.183200, 0.285200, 0.411900, 0.548600, 0.696600, 0.824200, 0.928200,
0.987500, 0.996860, 0.987500, 0.928200, 0.824200, 0.696600, 0.548600,
0.411900, 0.285200, 0.183200, 0.107880, 0.057170, 0.026496, 0.010525,
0.003152, 0.000594, 0)
right <-c(0.000000, 0.001010, 0.004025, 0.009071, 0.016105, 0.025037, 0.035734, 0.048022, 0.061691, 0.076501, 0.092181, 0.108440, 0.124976, 0.141483, 0.157653, 0.173185, 0.187793, 0.201211, 0.213201, 0.223553, 0.232089, 0.238668, 0.243190, 0.245596, 0.245868, 0.244036, 0.240176, 0.234407, 0.226893, 0.217832, 0.207459, 0.196040)
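# Reference: periodic Hanning times periodic Blackman, both of length 32
actual <- gsignal::hanning(32, method = "periodic") * gsignal::blackman(32, method = "periodic")

# Overlay both model answers on the reference curve
plot(actual, type = 'l', main = 'Actual (black), left (red), right (blue)')
lines(left, col = 'red')
lines(right, col = 'blue')

# Per-element absolute error, summarised for each answer
summary(abs(actual - left))
summary(abs(actual - right))

# Total absolute error for each answer
sum(abs(actual - left))
sum(abs(actual - right))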
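
For people without R, a hedged Python port of the same scoring idea. It assumes the reference is periodic Hann times periodic Blackman (classic 0.42 / 0.5 / 0.08 coefficients), as in the script above. The function names and the returned dictionary are illustrative choices, not part of the original script, and it reports sum, mean and max instead of R's full `summary()`.

```python
import math


def reference_product(n_points: int = 32) -> list[float]:
    # Assumed reference: periodic Hann times periodic Blackman, element-wise.
    values = []
    for n in range(n_points):
        theta = 2.0 * math.pi * n / n_points
        hann = 0.5 * (1.0 - math.cos(theta))
        blackman = 0.42 - 0.5 * math.cos(theta) + 0.08 * math.cos(2.0 * theta)
        values.append(hann * blackman)
    return values


def abs_error_stats(candidate: list[float], actual: list[float]) -> dict[str, float]:
    # Sum, mean and max of the per-element absolute error.
    if len(candidate) != len(actual):
        raise ValueError("candidate and reference must have the same length")
    errors = [abs(a - c) for a, c in zip(actual, candidate)]
    return {"sum": sum(errors), "mean": sum(errors) / len(errors), "max": max(errors)}


actual = reference_product(32)

# Example candidate: an all-zero answer of length 32.
all_zeros = [0.0] * 32
stats = abs_error_stats(all_zeros, actual)
print(stats)
```

Under these assumptions an all-zero answer gets a total absolute error of about 10.72, which is simply the sum of the reference values, so that number works as a rough "answered nothing useful" baseline when reading the scores.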

silk haven
clever burrow
alpine coral
#

yeah though i don't think it really is yet (or at least when i looked earlier today, the alpha leaderboard shows today as update 'date', but actually seemed to have slightly older data than the regular leaderboard )

#

or something like that.. didn't seem 'live' but perhaps i'm missing something (and i assume it will be soon if it isn't already - which is nice :))

clever burrow
#

You are right... Both leaderboards show "Total #votes: 2,801,990." but different dates...

calm sequoia
# calm sequoia
Poll: Is it justifiable that Grok is No. 1?

Winning answer: No 😡 (13 of 19 votes)

#

Why do you guys think we all see Grok as worse, but it's number 1?

torn mantle
# calm sequoia

grok 3 from grok 2 was a huge leap thats for sure but its not a great model

#

you cant hold long convo with it

calm sequoia
#

I swear if it will be better than nebula and o3/GPT5 the benchmark is corrupt 😄

barren prairie
cedar tide
# cedar tide
Poll: Who would you consider the best model?

Winning answer: nebula (14 of 28 votes)

torn mantle
#

but if you really hold a long conversation with grok 3 then you start to see a lot of issues

north vale
#

yall ask a lot of math questions, a lot of what lmarena ppl ask and judge on is not math / coding

#

eg creative writing

#

ig coding and math are a big %

#

but presumably non-reasoning models do better by default on creative writing?

#

i don't actually know

#

ig r1 has better creative writing score than v3 so maybe not (but it's the one thing it barely improves on)

calm sequoia
# calm sequoia

This is a test if this chat is a bubble of software engineers 😄

north vale
#

that and longer query

#

otoh maybe grok 3 thinking beta gets put on the arena by elon bc he wants to win back first place

barren prairie
oblique flint
#

from r/bard

#

2.5 pro? The heck

silk haven
torn mantle
#

wait what

#

so nebula and phantom are non-reasoning models?

#

am i getting this right?

#

if so then thats crazy

sick mountain
#

they probably are reasoning, just wait for the actual announcement

brittle tiger
#

fact they're calling it 2.5 is bullish on their confidence

keen beacon
#

that was really quick of them

north vale
#

daym

keen beacon
#

thats kind of crazy

#

the timelines

#

gemini 2 has a june 2024 cut off, they pretrained, did instruct tunes/ experimental versions, thinking, and are already moving onto another presumably continued pretrained base

#

they didnt even release a gemini 2.0 pro stable version

#

theyre moving so fast wtf

north vale
#

2.5 could be fake idk

keen beacon
sick mountain
#

r/bard seems like it is reasoning

brittle tiger
gentle plinth
# torn mantle if so then thats crazy

i think nebula is a reasoning model. i once got it in the arena and it took a while until it started to give its response, it could also be that the other one was a reasoning model but i dont think so because it wasnt that good (i dont remember what model the other one was)

torn mantle