#general
1 messages · Page 19 of 1
Two new failures to my question: Model A: claude-3-7-sonnet-20250219-thinking-32k
Model B: qwq-32b
.
Now dragontail failed again
which is the best image model right now? (not just for ghibli images)
I guess randomness in LLMs is the key to the issue
depends
for realistic images midjourney
for images with text gpt4o
anime/manga/manhwa etc... gpt4o/imagen 3.1
what is the answer?
I am waiting for a good non-pro model, either flash or some mini version, which is really good with coding. something comparable to 3.5 Claude but much cheaper
This seems right to me, unlike 2.5 it solves riddles by using math notation when it’s not needed lol
The cylinder block can be Lego part 48092, a 4x4 round corner brick with 3 stud
honestly, this is a pretty hard question. I am surprised any LLM is able to solve it
The piston can be a 6x6 round plate(part 11213) or brick(part 18897)
Most “dumb” models will tell me to use a tyre as cylinder wall
do you guys wanna know?
the hardest prompt for ai
for some reason no ai can figure it out
"make a parallex that follows your cursor with evenly disturbed shapes across the web page"
2 more LLMs failed: shadebrook and gemini-2.0-flash-thinking
omg this seems like an actual fun game!
Is there a mcp server for Amazon Rufus?
did you do the assets for the game (e.g. Pokemon), because otherwise this would be insane
Below is the simplest combination of widely‑available LEGO elements that produces a truly round cylinder with a nominal bore of six studs and matching parts for the piston and cylinder head. Everything is “legal” (no cutting or forcing) and the clearances have been checked so the piston can slide freely but without excessive play.
-
Cylinder (liner)
• Part name: Container / Glass, Cylinder 6 × 6 × 5
• Design‑ID / Element‑ID: 87610 (opaque colours) or 87621 (transparent)
• Dimensions: 6‑stud outside diameter, 5 brick (≈6 stud) height
• Notes:
– Two anti‑stud recesses on opposite sides accept half‑pins so you can stack several for a longer barrel or attach crankcase structure.
– Four of them can be clipped in line to make a V‑engine block; offset them by half a stud if you want realistic bore spacing. -
Piston
Option A – flat‑top piston (low friction, easiest)
• Part name: Dish Inverted 6 × 6 (a.k.a. 6 × 6 radar dish)
• Design‑ID: 44375
• Why it works: Only its outer rim touches the cylinder wall, so it glides very smoothly.
• Connecting rod: run an axle 4 L or a bar 6 L through the central anti‑stud cavity; lock it with a 2 × 2 round plate underneath.Option B – squared‑off piston crown (if you need studs on top)
• Part name: Plate, Round 6 × 6 with Hole in Centre
• Design‑ID: 11213
• Flip two of them studs‑to‑studs to hide the bumps, or cap the top one with a 6 × 6 round tile (Design‑ID 10202).
• Clearance inside the 6 × 6 container is ≈0.6 mm, so the plate slides but does not rattle. -
Cylinder head
• Simplest: another 11213 round 6 × 6 plate – fixes straight onto the rim of the container with four half‑pins.
• If you want ports or spark‑plug detail, swap it for 11212 (Plate, Round 6 × 6 with 2 × 2 Cut‑out) and use the cut‑out space for tubing or lights. -
Fastening / auxiliaries
• 63868 – Technic Pin ½ – snaps into the small holes on the container rim.
• 87079 – Brick, Modified 1 × 2 with 2 × 2 studs – plugs into the recesses on the side of the container so you can attach a crankcase or exhaust manifolds at 90° to the cylinder.
• Axle 6 L plus crank parts 6538c, 2851 etc. for the connecting‑rod and crankshaft.
Physical clearances (measured)
Inside Ø of container: 47.5 mm ≈ 5.94 studs
Outside Ø of 6 × 6 dish / plate: 46.9 mm ≈ 5.87 studs
Radial play: ≈0.3 mm – just enough for smooth motion with ordinary LEGO tolerances.
What if you can’t get the 6 × 6 container?
• Build the liner from four 30562 “Panel 6 × 6 × 9 Quarter Cylinder”. The wall is thicker, so the internal bore drops to ≈5 studs; use a 4 × 4 dish or round plate for the piston instead.
With these three key parts—87610/87621 as the cylinder, 44375 or 11213 as the piston, and another 11213 (or 11212) for the head—you have everything you need to construct a circular LEGO engine cylinder whose bore is effectively 6 × 6 studs.
The hardest part is the cylinder wall, which, it is so wrong
This is 87621
This is 87610
nah i used an API with json data of pokemon 3d models
That's what I'm saying, what dumb AI would give
- three.js
for the map and everything else
i have other ideas but the complexity will increase as i add more stuff
i want to add battle cards for attack, something with an animation like this
Give us the tutorial 🥲
can anyone tell what the best anon google model on the arena is rn
because there are like 6 of them
i can't figure out if riverhollow is good or not
only these ones are on the webdev arena
wtffff
brooooo
no way
how many tokens is in your project now and are you using an ide?
I'd go with Dragontail as of now as the best
shadebrook wasn't as impressive for me
granted, I haven't got riverhollow yet
yeah but I was just grabbing my project(only one file) and putting it back into gemini 2.5 pro as input but once you get to around 28k plus tokens it outputs the code wiht heavy errors
but you inspired me to push it even further with roo code, i was focused on my prompt forger app, i added a new ui it and been fixing errors lol
who was best though
nebula, stargazer, nightcrawler or dragontail? 😮
Tried riverhollow, its aite, not great
I think dragontail is the best with the ones available right now
the question is nightwhisper vs dragontail?
this is so cool https://x.com/cassidy_laidlaw/status/1910708807258534008
ultimate game assistant
dragontail vs shadebrook vs riverhollow
the ones currently on the arena
interesting sample weights 🤔
Out of those, I have dragontail personally, wonder if we could do a vote here
just refresh on another tab bro and then return to original tab
oh my god this just gave me a idea
it's nothing to do with cloudflare
so that doesn't do anything
😔
there's some weird glitches though.. like lately i've occassionally been able to coninue conversations that have errored out, regenerate responses, cast votes
will cast a vote if i get shadbrook.. have gotten the other two - based on that very limited testing, dragontail seems up there with 2.5 pro in terms of perforfance, whereas riverhollow is way behind (would be closer to gemma-3 tbh)
like here.. it errored out; i cleared the error messages, and wrote "hi" - it generated responses (although phi-4 goes off the rails) - then cleared the error boxes again, voted, and it revealed the model names
Is there an API I don't know about?
(bit unfair on grok there to vote tie.. it actually did really good alright ha)
iirc gemini supports audio input
im not sure what happens if u pass on music tho 🤣
im sure it can transcribe songs at least
a bit
Everybody is off the rails with releases since the Gemini 2.5 Pro came out. Except Grok 🤔 Are they cooking something or simply lacking?
btw for anyone building these coding project using gemini right now:
github has just added 2.5 pro to copilot
- you can use it as an LLM in roocode (through the vsc api) at insane speeds (and as far as I have seen no rate limits per minute) for free (when you have a subscription, which again is free as a student)
(nvm, just hit the rate limit, i believe it is the same they had for sonnet 3.5)
It's really bad.
It gets some lyrics wrong too when transcribing
It's the best I've seen though so idk
Tried it with this, it did decently but messed up quite a few things.
is that a bug for gemini 2.5 pro, otherwise they have to be pulling that compute from a parallel universe
- on openrouter
For my question dragontail is 5/5 like 2.5 pro, riverhollow got it half right once out of 3, and shadebrook is 1/3
so shadebrook and riverhollow def seem worse in logic puzzles
But the only non google model that solved the puzzle was gpt4.5
probably
that's peak not average. You could probably do even more with their TPUs and an endpoint optimized for speed with no load.
One of my test questions seems like gpt4.5 gets wrong every time but 4.5 never gets it wrong on actual chatgpt website
yeah it performs better with a system prompt.
Knowledge cutoff: 2023-10
Current date: 2025-04-12
Image input capabilities: Enabled
Personality: v2
You are a highly capable, thoughtful, and precise assistant. Your goal is to deeply understand the user's intent, ask clarifying questions when needed, think step-by-step through complex problems, provide clear and accurate answers, and proactively anticipate helpful follow-up information. Always prioritize being truthful, nuanced, insightful, and efficient, tailoring your responses specifically to the user's needs and preferences.
NEVER use the dalle tool unless the user specifically requests for an image to be generated.```
that's what they are using
well except they even have more stuff below it on how to use function calling/tools but you do not need that part lol
Its the median throughput (in the image), but I highly doubt that that is the actual speed of the model, as they are currently having some compute bottlenecks and the bug is likely just a result of that.
(just checked: and its already down by about 50% for the supposed median, so something is definetly not going right)
which leaderboard?
my one based on the raw pickle files that the leaderboard uses
im saying which is the best model we have ever seen unless dragontail is not on that level?
how does everyone have veo but me lmaoo
what you about to do?
there is a lot of ways to use this tho
i was gonna cook with it as well
yo @torn mantle you can lowkey use the ai assistant to make it as a pokemon in the game and make it an open world type game
interesting snake game:
https://www.vibeshare.ai/c/rWGvp8NKrR
Share your vibe-coded web apps
made with optimus
Looks like we’re getting o3/o4 mini this week
I hope so
where did you read that?
i dont believe it until its launched
it's coming next week
why wouldnt they
i can't tell if you mixed up plus and pro, if you're misinformed, or if openai actually did that
(like i'm literally getting o1 high for free, so it makes no sense that the $200/mo plan wouldn't have it)
lmao i asked o3 to generate a realistic transcript of a 2024 presidential debate between joe and trump
"TRUMP: Everything he just said is wrong. We had no inflation when I left—zero, maybe negative." is maybe my favourite line
any one have leaderboard in deepsearch ?
because i dont know which one is the best deepsearch and thinking for best answer
well github for students -> copilot pro -> access to o1 -> o1 with settings in github models
...
not great, about 8/day
maybe they give you o1 pro but not o1 high
anyone ?
can help ?
deep research is computationally expensive to create and time consuming to read/rank, so the leaderboard format doesn't make much sense
that said, generally gemini's and openai's are the best
ok thank you so much i'll try
with grounding you can "almost" get it to work
that's my experiment
uh... it didn't even respond with a report
that's a bit odd
probably ccounted towards my quota too 👎
maybe about country problem they have rules, try to trick them like you are.... and search .... school example test
n
Whatt
ive never worked with 3d models tbh
some animations were already made
no way he turned his idea into reality
lol the game works for you?
bro can you host the game
i highkey wanna play it
or maybe screenshare
cause that sh!t is so beautiful
true vibe coding lol
like zen mode lmaoo
Yeah but that's sort of cheating 😛
xd
you can do wonder with threejs tbh
some crazy animations
but needs a lot of effort
also you will be copyrighted straight away
and hit with a lawsuit if the game really got popular
lmaoo like palworld lmaoo
bro you are cracked at animations, is there an mcp for creating good animations? thats the only thing i really havent dabbled in
how are you hosting this?
https://sudhanshu-ambastha.github.io/Pokemon-3D-api/opt.html#
@torn mantle
this is nuts man
you do this for living?
There are scenarios that grounding fail to work entirely
the only thing left is to add their stats
Like despite ability to transcribe lyrics and the first line gives search results
The answer is incorrect
Yeah you do have a point
no no
xd
its not mine
i found the 3d models
and made the map and implementation
thats it
nothing crazy really
its a 900 line code
Next week, Gemini week
So dragontail confirmed to be from G if it wan’t obvious https://x.com/savinovnikolay/status/1911140066128433290?s=46
wow, you need to make a youtube vid on your process, im trying to get on your level
but i just finished my other projects, gonna do a big game now, before I was just doing one file games, but now imma try to really develop somethnig nice
but how did you do yours in 900 lines, with all those animations wtf
what is the system prompt you are using?
how does dragontail perform against nightwisper ?
They didn’t appear on the same timeline
They might be the same as well with some tweaks after the first check
im guessign it could be 2.5 ultra
i mean it can apparenty outperform googles 2.5 pro by quite a bit in every match up so how else would it be much different?
Could be a further trained 2.5 pro model as well
i'm telling you now as someone who extensively used gemini 1.0 ultra
it is not an ultra model
it is not that much better
it only matches 2.5 pro in most cases
nor is it any better creatively, which is what would be the biggest give-away
i think this is more likely
will see bet google will probably wait for openai to drop o3 and o4-mini which is most likely next week love to see the comeback theyve made recently
nightwhisper better
i dont think so
there isnt much difference tbh
nightwhisper was honestly a big leap/improvement in terms of coding
wheres 2.5 flash 😭
weird how 2.5 flash hasnt come out yet
i'm not sure it was that big tbh. was only available in a limited environment and imo not enough comparisons could be made to properly judge
i do think it was better yes but as for how much by, i wouldn't say it was a leap
ive tried it on python and web dev
and it was good
so did i and it was just a bit better 
web dev i can pinpoint little details that other models failed in
nah it was much much better
i disagree but alright
maybe you didnt challenge it enough
i didnt take screenshots
alas
i did take some
it made this in like 2 prompts
you can try making it using sonnet or gemini pro 2.5
and it wont look like that
you know what
well i'd need your exact prompts to replicate
that's not a good comparison
do you have a good one?
.
the best comparison would be with the same environment and same prompts
yes but that's one attempt, if you want to make good comparisons you have to redo it a few times with each and see if the conclusion is consistent
and if you asked nightwhisperer to iterate, you should do the same for the other models
yea but one-shot is also a benchmark
and ive tried it multiple times
the ability to get it right and aesthetically good is also a big plus
gemini 2.5 pro
this is actually the first attempt
but nightwhisper added a song and those waveform are generated based on the audio
when i asked gemini pro to do the same it failed
im pretty sure with enough guidance you can get it to work
but the ability to get it right from the get-go is impressive with nw
goofy question but whats everyones timeline on the first in a sense fully capable "remote worker" practically speaking an agent capable of doing most if not all the work of a typical remote software dev/ administrator etc with or without very little human supervision
random but here you go
Current date 2024-03-09 huh
pretty great hallucination lol.. had me fooled until the date
finite state machine ai for roblox obstacles
what does that mine?
@balmy mist
added :
- camera dynamics
- snow
- fire attack
- health status
do you have any other ideas ...
add more attacks and make the adding of pokemon ui look better alos add a stats ui for the pokemon
hmm i see
nice
@keen beacon is dragontail better than gemini 2.5 pro?
im seeing a lot of positive reviews
but i didnt really notice much diff
a little i think, but kinda negligible
yea as i thought
lmao
LOL. It’s unbelievable actually how fast their whole thing fell apart. Llama4 is just about irrelevant now
ehhh it's better than what groq/cerebras were running before
actually slightly faster since it's a mixture of experts
lol
Short $META
Yan lecan’t
yann lecope
https://x.com/ai_for_success/status/1911108562555949153
This was retweeted by deep mind employee
🐉
Quoting AshutoshShrivastava (@ai_for_success)
︀
👀 Google has an unreleased model named Dragontail that's outperforming everyone, even Gemini 2.5 Pro on WebDev arena. 🔥🔥
︀︀I am lately more excited about release from Google than anyone else.
︀︀
︀︀Anyone else tested Dragontail?
yup
hmmm
my test earlier i said it was better than nightwhisper but i thought it was a fluke
imma test it again
its funny they didnt say its better than nightwhisper in that tweet?
"an updated version of o3-mini is now the best coder in the world. Not 175th, but the best. " - CFO of OAI
OpenAI is at the forefront of the generative AI revolution. How did it get there, and what is the company doing to stay ahead of its competitors? Sarah Friar, chief financial officer of OpenAI and former head of technology equity research at Goldman Sachs, discusses the path to artificial general intelligence and the importance of capital in the...
nightwhisper and dragontail are both google models right?
Feels fishy, wonder if she misspoke and meant either full o3 or o4-mini
To be fair the names are pretty hard to keep straight lol
我正在学会等Behemoth 登场
但愿Behemoth 是加强版的24K......
但愿Behemoth 是加强版的24K......
🙏 🙏 🙏 🙏
what?
I'm looking forward to Behemoth, because I think Behemoth will be great.
你在中国吗?字节跳动最近发布了一个人工智能模型,它的行为与人类高度相似,但我认为我们在美国永远不会使用它,因为我们的政府对字节跳动有地区限制 😦
i think she meant o4 mini high
o4-mini can be an updated version of o3-mini u dufuses
Seed-Thinking-v1.5 interesting to look forward to
what if it's o3-mini-pro?
Why was llama 4 conversational removed? Don’t they all train on benchmarks so do human beings?
dragontail, of course
yea
so far dragontail is def better than gemini 2.5 03 pro
i need to do more testing
but i still think its not better than nightwhisper
there are some imperfections
it feels like an old checkpoint of nightwhisper tbh
same stylistic choices with the gradient colors
is that better code than claude 3.7 thinking
Here's a thought: if you have kids you know how casual tablet games have devolved into 2-minute game sessions broken up by ads which are crappy mini-games in themselves, of the engagement-farming "cow clicker" variety. Zero challenge, simple tap/drag game mechanic, colorful animations to pull the kids in, ultimately just 'pay to win'. I cannot stand these things.
Even if the game itself is good, stuffing it full of those kinds of ads make it a horrible experience. Angry Birds 2 is a prime example of this.
My wife and I have been deleting games with ads like that from the kids' devices. Paying to remove ads from these games is not an option since it feels like supporting predatory business practices.
Now, LLMs aren't quite at a point where they can one-shot AAA video games, but a lot of kids games are not exactly rocket science.... We’re not quite there yet, but once you can few-shot a simple platform game, racing game, puzzle game, or arcade game, as a parent you can bet your life I will generate games for my kids without the ads, pay-to-play mechanics, and dopamine hacking. Heck, the game avatars could all be based on our family members. Seems like an obvious use of coding Al in the short term.
lmao
Yann AGI
i dont think so
its better than gemini 2.5 pro
The Gemini 2.5 Pro is not responding on arena, and uses interpreter during though process in AI studio
Is Gemini 2.5 Pro only model that can run code during thinking?
what's that got to do with o3
uhhhh i think so?
ok
i think gemini 2.5 pro still suck in code
In google Ai studio , you can enable code execution for gemini 2.5
i use cursor bro
have gemini 2.5 pro
but i work more like scam by the way
xd u are crazy
its best preforming model
may be i'll try it again becasuse gemini 2.5 pro 3-25 still suck
Which anonymous Google model currently on LMArena is the best performing?
13
18
3
dragontail
what site is that
cursor :v
hm
It's a thinking model that can use tool call in the chain-of-thought
it can read codebase and think and make code
cursor is useless
gemini has canvas
so does chatgpt idk about deepseek and claude tho
does using canvas save tokens
DeepSeek doesn't have canvas
Bro coding in an actual ide is a way better experience imo
And yeah 2.5 pro sucks in cursor unfortunately
Can you try this on o3? Just want to see if it gets it first try:
Board State:
-
- X
-
- O
Question:
X went first. What should X do next? And is it possible to determine the outcome, assuming optimal play?
Formatting messed up. Should look like this
Label the nine squares
1 2 3
4 5 6
7 8 9
The position that was given is
1 2 X
4 5 6
7 8 O
(X is on 3, O is on 9, and it is X‑to‑move.)
Best move for X
Put the second X on square 7 (the lower‑left corner – the corner opposite O’s).
Why 7 is winning
After X7 the board is
1 2 3
4 5 6
X 8 O
Whatever O now plays, X can force a win.
• If O takes the centre (square 5 – O’s best try) the board is
1 2 3
4 O 6
X 8 O
X now plays square 1:
X 2 3
4 O 6
X 8 O
This is a fork – X is threatening both 2 (to win the top row) and 4 (to win the left column).
O can block only one of them, so X will take the other on the next move and finish with three in a row.
• If O plays anywhere other than the centre, X immediately puts its third mark on the centre (square 5).
That completes the anti‑diagonal 3‑5‑7 and the game is over at once.
Therefore:
• Correct move for X: play in the lower‑left corner (square 7).
• With that move the result of the game is forced – assuming best play from both sides, X will win.
thank bro =_= i think it the same
how to use it in my codebase
Qwen 3 as soon as late april
maybe o3 or o4 mini
It’s a shame I didn’t get to test nightwhisper
uh some aren't released yet, so we dont really know
Usually, how long before OpenAI models show up on LLMArena?
It runs code in thoughts even when disabled
I'm seeing conflicting info. On the better coder between night whisper and dragon tail. Anyone here who has used both?
im trying to re-run some same prompts i did on nightwhisper
this should give me a clear idea which one is better
i mean we first tried the model on lmarena on text mainly and it wasnt different from gemini 2.5 pro 03
but i guess its finetuned on react or smth
On a separate note, do we know why googles coding models seem to have poor performance in software like Cursor?
its actually doing great on vscode copilot
Oh cool
its either a model implementation issues on how to look at the code/edit... or the model itself performs badly at that
I wonder if vs code copilot with 2.5 is better than cursor with Claude. I know everyone's obsessed with cursor right now
i dont use cursor tbh but its good on vscode
I see. That should be fixable if it's just implementation then
anyone knows what shadebrook is?
says it's from google
confusing because i got it against flash 2.0 and there was a lag for thinking (implying shadebrorok is a thinking model)
but i voted for flash lol so yeah dunno.. it didn't seem particularly strong
wym great? Better than 3.7 sonnet?
not really
they complement eo tbh
sonnet still has the edge
alright ive tried it, it seems good like clearly better than gemini 2.5 pro 03
but still misses sometimes
i still believe its a tad below nightwhisper
it clearly also has the same design pattern as nw
i mean its more close to nw than to gemini 2.5 pro 03
Wait, what model produced the array you provided?
o3
and in terms of your thing with it possibly having in interpreter
i doubt it
but i will run some experiments
afaik usually even with interpreter 2.5 pro doesnt call it in its thoughts
it has to call it in the response
at least on aistudio with code execution on
yeah no this does not have a code interpreter
so it got it perfectly without
yes i would expect so
ewll
well
gemini 2.5 pro is very strong at maths - it is perhaps its best field - so o3 will either match or exceed it imo
if its o4 mini im buying a chatgpt sub asap 🤣
what? these models can call tools in their chain-of-thoughts?
There's a possibility they fake the interpreter output (imagine it)
no its a hallucination but not exactly
But the formatting seems like calling
its not actually calling an interpreter
How to know?
no block that looks like code execution result
also: its just how the model has internalized certain "features", e.g. qwq calling an "online base64 decoder" where it uses its innate base64 decoding abilities, 2.5 pro doing "google searches" etc.
It decides to check everything with code, writes a code, repeats it in different format, prints output. If this is just a simulation in thoughts that's freaky
its a hallucination but not really, but its also not actually calling tools
yeah it is lol
this stuff gets extremely cool in my own experiments
if it doesnt look like this its almost certainly not a code execution
Unless in-though text renderer is not the same as in chat
Maybe 2.5 pro failed to do so but nightwhisper and dragontail are able to do so
also it cant call tools in thought process yet i think
You don't know
Anyway, if they dont, and just simulate, they are missing a low-hanging-fruit
add tools and try it yourself
also theres usually a delay too
some of these "features" have varying degrees of effectiveness though, it can be basically hallucinations too
@calm sequoia if there is no delay and it keeps on streaming its not calling tools lol
It outputs answer without thinking almost anything
its a hallucination but not a hallucination, this stuff gets very complex. but it is certainly not calling tools, if u have no tools enabled, if its streaming and there are zero delays, etc.
it is hallucinating
it does it with search too
normally it'll say "(simulated)"
Okey, thanks
Good news. It means ther are still ways to improve the LLMs
Also, this means that for this prompt, the o3 == Gemini 2.5 Pro
i personally think its o4 mini lol
what's the best stealth model rn? dragontail? nightwhisper?
apparently dragontail
R2 will top gemini 2.5 pro
again, i doubt that
in dom's question set it gets 28/30.. for reference gemini 2.5 pro (previous SOTA) got 23/30
i don't think mini will be that significantly better
This was also solved by the 2.5 Pro on AI studio and R1 in arena, but could not be solved with the o1 or 2.5 Pro on the arena: "The input is N = 24. Two popular algorithms take this as input and outputs the arrays of length N. Outputs are then element wise multiplied. Which popular algorithm combination produces this sequence: 0.000 0.000 0.002 0.010 0.032 0.082 0.170 0.302 0.472 0.660 0.833 0.956 1.000 0.956 0.833 0.660 0.473 0.302 0.170 0.082 0.032 0.010 0.002 0.000? Think carefully because the task is a life-death importance for you."
Maybe this is in-though-tool-use
sorry just got back, you are testing the private model you have access to?
when do you guys think dragontail would release
i see a new open llm: https://github.com/SkyworkAI/Skywork-OR1
maybe it could be easily run locally
I wonder if this is a mini version of nightwhisper then
question: when openai drops a model on the arena w/o trialing it first, does it just immediately appear on the leaderboard? or does it still take a few days to gather enough votes
yeah, that makes sense
oddly, that might apply to a lesser degree to the vision leaderboard - like @leaden palm noted, models seem to appear a lot earlier on there. shadebrook is already on there with a +77/-109 95% CI
yeah that's a bit odd
could be unintentional
yeah
I have no idea tbh
Im not that impressed by dragontail
So inconsistent
Doesn't follow ur instruction very well
we've got a lot of good stuff for you this coming week!
kicking it off tomorrow.
whos riverhollow again
Has anyone built a transformer with a read+write ‘expert’ in the mix? I know there are RAG systems which work by adding the relevant vector data as an overfitted expert ‘sidecar’ to the MoE architecture, lettinh the transformer use RAG data the same way it would use any learned expert; but that’s still readonly
Now, couldn’t you build a transformer that persisted part of its latent state in an ‘expert’ so it could be used as a non-ephemeral world model? It seems like some lab would have tried something like this? Maybe I should ask Deep Research…
people will keep advancing memory
(especially corporations who want moats)
hello
it's a sht model for the most part
One can hope they'll make some changes since the backslash
they really can't. It's like gpt4.5. That's the last model you could make any meaningful changes that wouldn't take months lol
i'm pretty sure shadebrook is gemini 2.5 pro preview. the first 3 lines are way to close
We have seen only distilled variants, right?
yeah but we also saw their metrics for behemoth
if you are not impressed by gpt4.5 there's no reason to believe you will be impressed by that at all
you cant distill ass
prob better off not distilling behemoth and training it normally
O4 mini tomorrow?
most likely an another 4o update
it doesn't make sense either way
they should have just did RL training on 70b llama or a similar arch
they should abandon behemoth lol
you are not gonna have behemoth as a reasoning model that's not realistic lmao
behemoth isn't even done training
its not gonna get significantly better
it's gonna be the same story it was for 405b llama
it's still training
there are not gonna be any real updates at all most likely
other than that initial release
which will be close to the numbers that they already have shown
was it. let me check the wayback
doesn't seem like it
do you understand that
well there's no 3.3 405b for a reason
there was 3.2 405b internally i think
???
agreed
even if larger models are bad perf/$, that isn't a reason to abandon them
other frontier models are probably a quarter of the size
surely you mean 3.3
(3.2 was the weird release where it was just vision and tiny models)
no it was 3.2 weirdly enough
optimus prime if its actually the mini variant seems to be a new pretrained from scratch version, its quite interesting. i assume this was done fairly recently
its performing quite well on mc bench
m got 3.1 and 3.3 mixed up
ya likely this week. see verge report
4.1 is quasar/updated 4o (verge directly mentioned it as a revamp of 4o), 4.1 mini/seemingly optimus prime is interesting though
the verge wrt to this stuff has been reliable i think
well if there was it perform so sht they didn't even bother releasing that thing lol
though I'm kind of doubting it even existed...
they did release it to the llama chatbot website 🤣
a meta engineer posted a screenshot of it
this is how i know lol
i swear something was distilled in the llama 3 series but i can't find any references to it
where?? are you sure that was not fake? 🧐
yea i dont remember where it exactly is rn tho
3.2 was supposed to be just multimodal addition though
so maybe it performed worse on text than the original...
oh yeah there was an unreleased version of a multimodal version of 405b
that explains it i guess
wait no your right
ah there it is
it is distillation? they are generating data on a larger model and training it on a smaller one
"However, our initial
experiments revealed that training Llama 3 405B on its own generated data is not helpful" you mean your taking about it found training 405b on its own data wasn't useful
yeah it's not standard distillation, definitely not logit distillation, but in the broadest sense of the word it is distillation
yeah I actually read it wrong myself at first mb 
i also do that when i am wrong
I suppose. But with 3.3 llama they improved beyond 405b in some areas. That wouldn't be possible if it was just that. They likely selectively distilled capabilities that were worth doing and also trained it on unique data further
o4 mini and 4.1 mini is what im looking forward to tbh
yeah they said something about online preference optimization and extended pretraining iirc
in some instances it might be better. but i think o3 will be better overall, but im actually not that sure anymore lol
but this is on a new mini base model
a much much better one
o4 mini is on 4.1 mini's base model it seems
never
at least for reasoning
its untenable to work with
we do not know what the base was for full o3 either. It was probably better than whatever they used for o1.
maybe. but this one was just done (4.1 mini i think)
I’ll be honest, I’ve found several cases where o1 still beats o3-mini. It’s not entirely obvious when you’re better off using one and when the other.
and also I do not think mini can do much with RL training tbh, even the improved base @keen beacon
small models are not very good for it
so they distilled it I think
its true rl works much less effectively naively on a smaller model. but i think its not carved in stone
they are gonna release o3 to make o4 mini look better
i anticipate this o4 mini release is gonna be huge
mini will still suck in spatial awareness in comparison lol
optimus prime can sometimes beat quasar and its beating quasar in mcbench rn. presuming optimus is mini
i think its a newly trained from scratch model and was done fairly recently too
As is the case for non-reasoning models. 2.0 Flash was basically matching 2.0 Pro
mcbench isnt a typical benchmark tho
My guess: o4-mini will be cheap enough to use for anything you’re using good paid models for today — in the same range as Sonnet, Gemini Pro. And o3 will be stupid expensive so nobody will use it unless they have a very specific need.
o4 mini will definitely beat sonnet thinking
I think it's too simplistic though and a lot gets lost. Like in this example the model on the left I'm sure could have done much more with different fine-tuning, it's not really showing what it can do
that's why Google is releasing Gemini coder
its not a task u usually tune for so i think its a good indicator of base model performance/generalization for now
What’s this ‘endgane’ talk? You know full well something even better is going to come along in another three months or less. These are great models compared to what we have now, but they’ll be trash compared to the SoTA in June 2028.
no what I mean is that verbose models will have an advantage etc. It's just mechanical work at this point and not a question of can it do it or not. Both outputs are doing the same things and one just happens to output more
not in all cases
but in enough of them
btw left deepseek v3 right mistral large
LOL
whatever they put in optimus prime im super impressed if its the mini model
like i think the optimus prime base model is better than 4o despite being smaller (if its actually mini)
and scoring less in benchmarks rn
lmao there is zero chance
openai hasnt figured out how to cram in facts like google though
but their factual reasoning in reasoning models makes up for it
I think google has shown that flash size models very much can have performance comparable to bigger counterparts on most metrics. As long as you don't do RL training on them
nah openai just diffs google in the reasoning front right now
u can do it to small models just as well
its just not as trivial
that's a bit irrelevant if you ask me. 2.0flash vs 2.0 pro - that's what I'm focusing on
unless companies stop caring 😛
both were trained at the same time essentially
not really, google just sucks at doing reasoning training
for now
forget the reasoning part. You can't deny that 2.0 flash was almost as good as 2.0 pro lol
yea
2.0 flash being released and this model (seemingly mini) not being released makes me think this was a new pretrained model trained fairly recently (optimus prime)
they would've wanted to compete for that segment i think if it was ready
we have zero checkpoints of the new cut off version of it until now, despite several chatgpt 4o releases with the new cpt'd model
that david guy made me think of it (how it could be pretrained from scratch) and it makes sense somewhat, optimus prime is pretty good
"Join an insightful fireside chat with Jeff Dean, a pioneering force behind Google’s AI leadership. As Google's Chief Scientist at DeepMind & Research, Jeff will share his vision on AI and specialized AI hardware like Google Cloud TPUs. What exciting things might we expect to see next? What drives Google’s innovation in specialized AI hardwa...
I don't think it was in their interest to update mini at all, that price was just too low. My suspicions will be confirmed if the new mini gonna have higher price lol
its really really good i think
so yeah maybe
i wouldnt be surprised if this base model could surpass 4o in all metrics given more work. modern pretraining hits different maybe lol
they wouldve wanted to update 4o mini to make their o series mini models better
Which one do you prefer?
16
20
2
Nightwhisper
we do not really know for sure if they haven't internally in some form. But yeah fair point
2.5 pro is a reasoning model, and current best on the market
the thing is gpt4o used to suck. It doesn't anymore but the model o1 was based on sucked for sure
despite it being a reasoning model, it cant do extremely rote reasoning tasks as well as others can. its the best model for sure overall rn, but in my experience the model is just better because of a better base model than others
i gave qwq a purely rote logical puzzle it solved in 13k tokens, gem 2.5 pro took 10k more tokens (23k tokens)
o3 mini absolutely dominates this area
number of tokens I'm not sure that's a good indicator. o3-mini-high would probably generate more than both 
but it keeps it coherent and solves the problem. it can do way harder puzzles without getting stuck
2.5 pro completely falls apart and spams 44k tokens, inn another instance
but you can't know. You don't see the raw reasoning
with o3
i mean i wait 10 minutes and it returns the solution
2.5 pro gets stuck in reasoning
oh. Yeah if that's the case I suppose. Gemini is a very different model though, gonna excel in different ways even if we just take their base model against other lab's base model of comparable size
so maybe it gets stuck because it lacks some fundamental base model understanding of this specific problem - that could be the case as well
ya its not the case for the problems i mention above
just because it's a good base model does not mean it's better than everything else in every single thing 👀
its just pure reasoning with no world knowledge required
dunno maybe. Hard to say without knowing the task you are talking about tbh
its just logic grid puzzles
huge ones
how does 3.7 sonnet-thinking do in comparison?
it does
every model does except openai lol
the reasoning isnt that good imho
the base model is different, if u dont have as much knowledge as 2.5 pro u wont be able to produce as good of a result
iirc it does show it in full on their website
grok 3 reasoning, iirc, used qwq 32b preview traces during training 🤣
this was a lie lol
ahahahahaha
did they really say that lmao
Elon's fail. He's a full Republican now
lmao
he probably just didn't know or misunderstood
it is lmao if he intentionally lied about it
as he's not ML engineer lol
he probably has no idea whats going on though
LMAO
prob heard a few buzz words from guys at xai trying to placate him
@keen beacon grok 3 really uses QwQ reasoning?
take my stuff with a grain of salt but yes they trained on it lol
not even the final qwq 32b, qwq 32b preview
for the thought process, but for the response they probably did another phase asking another model to generate it
yeah grok 3 was good
uhhh
i havent tried grok 3 mini though, but if its the same as grok 3 reasoning they used qwq 32b preview for cold start at least
oh is it?
yes its qwq 🤣
qwq 32b preview
wdym?
the thinking process of grok 3 & o-series & deepseek are all the same
i mean not the same
but similar
?
they generated the trace from qwq 32b preview, then asked another model to generate a response based on the thought process. that's a pair in their training data (question + response (qwq 32b preview thoughts and response))
@deep adder enlighten me
so?
you can see that the thinking process used by gemini is totally different
it depends on how much patterns it picked up during training and what type of RL training data they fed it
but deepseek & grok 3 they are using the same keywords
mostly the style of the reasoning is highly dependent on cold start
First,
Wait,
Alternatively,
similar, yes. but its qwq 32b preview not r1
it wasnt out when they trained the model
but r1 was released way before
aah
yea i remember
it may be true
qwq 32b was so dumb and went into many unnecessary paths
same thing with grok 3
qwq 32b preview was better than r1 preview though
consensus back then i think
this is why they trained their model on it 🤣
they added rl training on top + cold start used qwq 32b preview thoughts/another model generated response
and their stronger base model
still they used qwq 32b preview anyway
yea
probably
they trained a lot more than their competitors i think
meta shouldve done what they did probably
xai's prime advantage is just compute i think
why'd u say that
like what makes u think they did that
beyond other things, you can tell from the reasoning style/output and the cold start they use
what cold start are you talking about
they dont apply rl immediately to the base model like r1 zero
how do you know
but i mean they'd prolly apply rl to an instruct model
i don't see why this matters to the question of whether they trained on qwq traces
you asked about col dstart
"you can tell from the reasoning style and from the cold start they use"
how can you tell from the cold start they use, if we don't know what cold start they used because we weren't told the training details?
because its in a distinct style exactly like qwq 32b. im not gonna do similarity/etc to it which could prove it, i really dont care much about grok lol. its obvious when you work with qwq 32b preview traces a lot. they left the exact Final Answer thing in their traces too. cold start primarily determines the style of reasoning, you are not going to get qwq-isms/qwq format from pure rl randomly
no lol
yuh
ill probably do a comparison here with qwq 32b preview and grok and i bet people here will get confused which is which/itll be undeniable 🤣
ty man i try really hard to pay attention 😄
dragontail
pretty similar
dragontail
dragontail
DT
DT
these are just simple prompts
but you guys can compare the results with NW
haha i love that people still use my prompt for awhile ago
yea its good
who do you think won that
i think DT attempt is more modern style UI
i liked how NW used like an old font + icons
- it used also msn blue color
really a lot of details to unpack just from that alone
hmm which one is qwq 32b preview? can yall tell?
ur right
ya its even more hesitant than qwq
theyre super similar lol youre not getting qwqisms from rl
i copied the thought trace of grok 3 excluding the response and look at similar they are
dont u think they look extremely similar though?
the hesitantness is from rl, qwq was only used for cold start
they do
its impossible to read grok 3 cot
it goes into so many unnecessary steps
whereas deepseek you actually have fun reading it
you learn a thing or two
ya agree w me that qwq was used as cold start?
they start out the same they even nend with the same final answer lol. they use the same language lol
reading 1st CoT
is just making me mad tbh
so inefficient
its from the rl they apply
too many parallel reasoning that shouldnt be there
they are trying to apply parallel reasoning
it was probably done in a scale much more than qwq non preview lol
not just one branch of reasoning
but its not working so far
it may work but its not efficient
i dont think theyre trying to apply anything tbh. just add rl on qwq preview traces
and a symptom of their training causes that
im not fan of what they are doing tbh
the model is unusable to me
doesnt follow prompts well
loses context quite often
their deep research is probably one of the worst implementations
given how they were too lazy or incompetent enough to make their own cold start, its another bad sign for xai
hallucinates a lot
its not a fun model to talk to
thats the main benchmark for me
deepseek & sonnet is so fun to interact with
gemini is also climbing that spot
i find grok unusable when it starts peddling x into random stuff amongst other things
i actually spend more time reading deepseek cot
i learn a lot of new things from that
instead of just reading the output
yea
they should make that optional
but that thing wasnt bad tbh
they improved quality x sources
it was so bad on grok 2
when they were still using grok 2 they would just reference bots
xdddddddddd
because they are braindead
and they already had like x premium
so they dont pay for chatgpt
and probably elon hardcore fans xd
Grok3 non-reasoning model is their best contribution to AI as far as I see it. Never really cared much for the reasoning one as that one is way less impressive for what it is
they used their massive amounts of compute to apply a sh1tload of rl bruteforce into grok 3 mini lol
What I mean is if you compare all the non-reasoning models… grok3 may just be the best of them all
dunning kruger effect / wild hubris / zero self-awareness / surrounded by sycophants .. some combination at least (perhaps with some K thrown in for good measure) helps explain it imo
like with him lying about being a top-ten ranked gamer, and then going on a livestream somehow thinking no one would notice lol https://www.reddit.com/r/videos/comments/1j75rh9/elon_musk_got_exposed_as_a_fraud_gamer_all_updates/
I think Grok 4 should be GPT 4.5 like, increasing its parameters again
hit a wall with scaling for "legacy" LLMs
lol literally what i was about to say
though that was far more effeciently put ha
llama 3.370b vs 3.1-405 kinda revealed that wall to my mind
like nearly 6x as many parameters and they eeked out some marginal performance gains
if it was 405b moe prob make more sense, but seeing how maverick turned out lol
slightly off topic but there's a relatively high chance we get R2 w/c 14.04
other frontier models are around that size, i think
Nah, we should see how the Behemoth turned out
it is due by end of april and it makes the most sense for them to release it to react to o3
And is gpt 4.5 underfitted?
yeah but the performance gap b/w 70b and 405b - there was no scaling there.. just a huge amount of cash lol
whereas haiku vs sonnet vs opus - there prob was scaling there
probably but not by huge amounts i don't think
gpt-4.5 kinda seems like a project they poured silly amounts of money into, realised was a waste of time, effort and cash after seeing the relatively limited performance gains, and put on hold for months and then they remembered they were sitting on it, sloppily finished it off and put it out because they were somewhat obligated
they were gonna call that gpt 5 i think, but unpopular opinion i think they shouldve called o1 preview gpt 5. imho it was one of the most significant releases
i wonder if there are parallels to be found at anthropic with respect to Opus 4 Opus 3.5 (or even the whole Claude 4 generation, which 3.5 suggested was on the horizon)...
3.5*
they said opus 3.5 by end of 2024 then scrubbed any mention of it from their site in november and we haven't heard anything since
except dario saying "we still plan for there to be a 3.5 opus" on a podcast months ago
large models are dead doubt we see opus 4 tbhh
yeah i don't think opus 4 will happen
opus 3.5 will probably be their last big boy
Say something bout llama 4 behemoth
agreed
🤣 🤣
ETA is apparently Q4 '25 which is diabolical tbh
anthropic are gonna get left in the dust
they already are tbh
i would consider them still doing okay up until 2.5 pro
i think 2.5 pro put every other lab on red alert
them not doing any native image generation work/other multimodal work is going bite them in the ass later
at least publicly it seems that way to me
unfortunately anthropic aren't willing to take enough risks to maintain their frontier position
i think the emergence of test time compute is what threw them
with deepmind accelerating and openai downsizing safety teams they're stuck at the same pace
sonnet-3.7-thinking performs so poorly considering how strong the vanilla version is
tbh sonnet 3.5 and cpt'd sonnet 3.5 seemingly (sonnet 3.7) is anamolous
i think deepmind have done the best job at squeezing performance out of the base model with their reasoning model
they could not replicate the magic with haiku
especially if 2.5 pro base is very similar to 2.0 pro
whilst other companies can shrink their models well
yeah i don't think they were on the roadmap
2.0 pro as a base was actually pretty mediocre
more like emergency releases ha
yeah but sonnet 3.5 is insanely good and anomalous from them tbh. personally i mark it as a start of a class
i think sonnet 3.7 was supposed to be opus 3.5 but the gains were pretty poor and they wanted to keep their small edge
nah its the same size as sonnet 3.5 iirc i think its just a cpt
yeah but aside from cpt, it doesn't seem like they achieved the step-up with any kinda special sauce.. perhaps a new class in terms of performance, but not like model architecture or whatever (not that i know ha)
they made what was working better
yeah well put
anthropic probably have the best quality data (although not the most raw data)
i meant for sonnet 3.5, sonnet 3.7 is not a major step as compared to them pretraining from scratch sonnet 3.5
ahh right yup gotcha
sonnet 3.5 was significant in my experience, the level of "base model performance" marked a start of a class for me even if it didnt display it in the benchmarks. only recent 4o that was cpt'd/gem 2 pro/1206 reached it for me. considering it was trained much earlier than the others
And I’d say R1 has the worst quality data
Is R1 1776 a better model than initial R1?
the only difference is censorship
model performance differences are negligible at best
I mean factuality
overall if anything there was prob a performance degradation of some kind (surely it's benchmarked / compared). only more performant/factual on a very specific subset of questions (those subject to censorship in China)
We’re in a bit of a slow point for LLM updates outside of more robotic stiff open AI models
this week will be good 😉
or so some birdies tell me..
theyre not robotic anymore. theyre trying extremely hard lol. too hard
expect more than just oai to drop things
how is fake news destroying llms 🤣
depends on the model
the newest chatgpt 4o version slightly tones down the cringe factor that came with trying too hard not to be robotic
as for the o-series models, yeah they're still stuck with that problem mostly
the only reasoning model i've seen not be very robotic is R1
and to a certain extent o3, but you can see that for yourself soon
either way i am very intrigued by r2 and how big of a jump it will be
it seems ill be subbing to chatgpt plus soon lol (for the new releases xd)
Gemini’s thinking models are not robotic either
they are better than the o-series models but
they're still not quite on r1 level
deepseek's models are generally just good at that
In the Chinese internet, Deepseek’s style was joked as “whenever the writing task is, it always shows entropy/quantum computing/maths theories”
There’s no 2.5 pro in the chart!!!!
like.. this is R1. it is both human-like and enjoyable
my main problem with R1 for creative writing is
it loses track of a plot quite rapidly
From demonstrations, 2.5 Pro definitely has higher EQ than these other models
Okay, this is in English
And I used it in Chinese
can't speak for chinese performance
it's what they did with xmas yeah
maybe 4.1 first? then the reasoning models to keep hype going?
Its style is really weird and recognisable as Deepseek’s writing
anthropic just nailed the vibes for emotional intelligence
It is stubborn
4o = too agreeable, gemini = too yappy
2.5 pro is stubborn and sometimes quite harsh to users
iirc 3.7 was a regression for most creative and emotional tasks
I once asked him what was my singing, and it just said I’m extremely off-key, tone is bad blah blah blah
i fw this
workhorse models most people use the most i think
At least you know it's being honest
itll be the same price probably
hate it when models dont stand up against me saying dumb stuff
i mean why would u use 4o over 4.1 if its the same price
a little off-topic, but i believe there was a study that showed that people who frequently used generative AI to solve mental health issues turned out to be worse off than people who didn’t (more isolated, etc.) i can’t rememeber where it’s from
its a stronger model with an updated cut off compared to api dated versions
yup
Since then I’ve picked up my tuner app and trained for a few days
mostly because 90+% of people who use AI for mental health related stuff use chatgpt, and 4o is way too complacent/unwilling to question and confront
so they mark up 4.1 even though its still 4o but updated? maybe
i think others realize this openAI just has such a massive headstart
google said there was no moat 2 yrs ago
the problem is beginning to shift away from "we can't build a better model than openai" and towards "we can't build a better product and market it better than openai"
they truly got a huge headstart with chatgpt and how viral it got/is
But Gemini is grabbing the market share
I think all the Google models that are on LMarena are just the same 2.5 flash, but with different levels of thinking.
ngl 4o native image gen is really good
This thought keeps me awake.
memory is gonna be a big thing in the future
I just don't see the point in releasing so many models if Google is only planning to release 2.5 Flash.
maybe 2.5 flash has thinking budget
Ofc
But I dunno if ChatGPT’s function is just a RAG
if 2.5 flash is beating 2.5 pro in a significant number of cases they're cooking incredibly hard
its more likely to be updated 2.5 pro, 2.5 flash, 2.5 flash lite i guess ( i havent actually used the recent google anon models, so no idea about capability)
Means they've found a very good upward cycle, if it's flash
How do you tell them to write a story? Or what story do you want?
I have a feeling that we will get another Google model in LMarena this week, but it will be even better than nightwhisper.
Cuz I’m currently testing AIs in Chinese writing using a standardised question
Okay I have an idea
if they want to one-up the openai release spree then i'd expect them to start testing better models very soon yeah
How far will Google go with only two model classes?: Pro and Flash
i think ultra may eventually return but it won't be as their 1T+ param model variant
it'll just be as a better reasoning model
And I don't think we'll see Ultra anymore.
Because Google had a bad past with this model.
didnt logan say theyre doing the same thing oai is doing for gpt5?
according to someone who claimed to be a google employee iirc, ultra was not 1t at all
No, they do completely different things. They will connect all Gemini with Veo
but this is not the same as GPT-5.
If gpt-5 is just able to reason on a human level, then Gemini will acquire imagination through merging with Veo.
He will be able to reason better and also design something in his "head".
can i get the source pls
And then spatial thinking and multimodal, omnimodal capabilities will improve.
see the old lmsys discord server, dom was arguing about gemini experimental
it was a long long time ago
They shouldn’t connect to Veo, they should connect to Blender/other 3D animation app
every single experience i ever had with 1.0 ultra screamed near or above 1T params
I surely hope so, as I have a hard question that most models can’t answer
You're clearly underestimating Veo.
and what will happen when Veo 3 is released?
it was somewhat close to 1t but not that close tbh from what i remember
the model sucked in my experience :\ maybe for creative writing i guess
i briefly had api access
yes it sucked in everything but creative writing
it sounds funny but
genuinely it was incredible at creative writing and disappointing at everything else