#general | Arena | Page 19

torn mantle Apr 12, 2025, 5:50 AM

#

image models use different methods to train ( diffusion ... )

drifting thorn Apr 12, 2025, 5:51 AM

#

Two new failures to my question:
Model A: claude-3-7-sonnet-20250219-thinking-32k
Model B: qwq-32b

#

.

#

Now dragontail failed again

fleet lintel Apr 12, 2025, 5:54 AM

#

which is the best image model right now? (not just for ghibli images)

drifting thorn Apr 12, 2025, 5:54 AM

#

I guess randomness in LLMs is the key to the issue

torn mantle Apr 12, 2025, 5:55 AM

#

fleet lintel which is the best image model right now? (not just for ghibli images)

depends

#

for realistic images midjourney

#

for images with text gpt4o

#

anime/manga/manhwa etc... gpt4o/imagen 3.1

fleet lintel Apr 12, 2025, 5:58 AM

#

drifting thorn .

what is the answer?

#

I am waiting for a good non-pro model, either flash or some mini version, which is really good with coding. something comparable to 3.5 Claude but much cheaper

north vale Apr 12, 2025, 6:01 AM

#

This seems right to me, unlike 2.5 it solves riddles by using math notation when it’s not needed lol

drifting thorn Apr 12, 2025, 6:06 AM

#

fleet lintel what is the answer?

The cylinder block can be Lego part 48092, a 4x4 round corner brick with 3 stud

fleet lintel Apr 12, 2025, 6:09 AM

#

drifting thorn The cylinder block can be Lego part 48092, a 4x4 round corner brick with 3 stud

honestly, this is a pretty hard question. I am surprised any LLM is able to solve it

drifting thorn Apr 12, 2025, 6:11 AM

#

The piston can be a 6x6 round plate(part 11213) or brick(part 18897)

#

Most “dumb” models will tell me to use a tyre as cylinder wall

keen beacon Apr 12, 2025, 6:31 AM

#

do you guys wanna know?

#

the hardest prompt for ai

#

for some reason no ai can figure it out

#

"make a parallex that follows your cursor with evenly disturbed shapes across the web page"

#

drifting thorn Apr 12, 2025, 6:48 AM

#

2 more LLMs failed: shadebrook and gemini-2.0-flash-thinking

torn mantle Apr 12, 2025, 6:51 AM

#

@balmy mist i think we are getting somewhere

#

gemini 2.5 pro + sonnet 3.7 thinking

drifting thorn Apr 12, 2025, 7:17 AM

#

omg this seems like an actual fun game!

keen fulcrum Apr 12, 2025, 8:30 AM

#

Is there a mcp server for Amazon Rufus?

unborn ocean Apr 12, 2025, 8:38 AM

#

torn mantle @balmy mist i think we are getting somewhere

did you do the assets for the game (e.g. Pokemon), because otherwise this would be insane

keen beacon Apr 12, 2025, 8:42 AM

#

Below is the simplest combination of widely‑available LEGO elements that produces a truly round cylinder with a nominal bore of six studs and matching parts for the piston and cylinder head. Everything is “legal” (no cutting or forcing) and the clearances have been checked so the piston can slide freely but without excessive play.

Cylinder (liner)
• Part name: Container / Glass, Cylinder 6 × 6 × 5
• Design‑ID / Element‑ID: 87610 (opaque colours) or 87621 (transparent)
• Dimensions: 6‑stud outside diameter, 5 brick (≈6 stud) height
• Notes:
– Two anti‑stud recesses on opposite sides accept half‑pins so you can stack several for a longer barrel or attach crankcase structure.
– Four of them can be clipped in line to make a V‑engine block; offset them by half a stud if you want realistic bore spacing.
Piston
Option A – flat‑top piston (low friction, easiest)
• Part name: Dish Inverted 6 × 6 (a.k.a. 6 × 6 radar dish)
• Design‑ID: 44375
• Why it works: Only its outer rim touches the cylinder wall, so it glides very smoothly.
• Connecting rod: run an axle 4 L or a bar 6 L through the central anti‑stud cavity; lock it with a 2 × 2 round plate underneath.

Option B – squared‑off piston crown (if you need studs on top)
• Part name: Plate, Round 6 × 6 with Hole in Centre
• Design‑ID: 11213
• Flip two of them studs‑to‑studs to hide the bumps, or cap the top one with a 6 × 6 round tile (Design‑ID 10202).
• Clearance inside the 6 × 6 container is ≈0.6 mm, so the plate slides but does not rattle.
Cylinder head
• Simplest: another 11213 round 6 × 6 plate – fixes straight onto the rim of the container with four half‑pins.
• If you want ports or spark‑plug detail, swap it for 11212 (Plate, Round 6 × 6 with 2 × 2 Cut‑out) and use the cut‑out space for tubing or lights.
Fastening / auxiliaries
• 63868 – Technic Pin ½ – snaps into the small holes on the container rim.
• 87079 – Brick, Modified 1 × 2 with 2 × 2 studs – plugs into the recesses on the side of the container so you can attach a crankcase or exhaust manifolds at 90° to the cylinder.
• Axle 6 L plus crank parts 6538c, 2851 etc. for the connecting‑rod and crankshaft.

Physical clearances (measured)
Inside Ø of container: 47.5 mm ≈ 5.94 studs
Outside Ø of 6 × 6 dish / plate: 46.9 mm ≈ 5.87 studs
Radial play: ≈0.3 mm – just enough for smooth motion with ordinary LEGO tolerances.

What if you can’t get the 6 × 6 container?
• Build the liner from four 30562 “Panel 6 × 6 × 9 Quarter Cylinder”. The wall is thicker, so the internal bore drops to ≈5 studs; use a 4 × 4 dish or round plate for the piston instead.

With these three key parts—87610/87621 as the cylinder, 44375 or 11213 as the piston, and another 11213 (or 11212) for the head—you have everything you need to construct a circular LEGO engine cylinder whose bore is effectively 6 × 6 studs.

drifting thorn Apr 12, 2025, 8:45 AM

#

The hardest part is the cylinder wall, and it got that so wrong

#

This is 87621

#

This is 87610

torn mantle Apr 12, 2025, 8:46 AM

#

unborn ocean did you do the assets for the game (e.g. Pokemon), because otherwise this would ...

nah i used an API with json data of pokemon 3d models

drifting thorn Apr 12, 2025, 8:46 AM

#

That's what I'm saying, that's what a dumb AI would give

torn mantle Apr 12, 2025, 8:47 AM

#

three.js

#

for the map and everything else

#

models : https://sudhanshu-ambastha.github.io/Pokemon-3D-api/opt.html

#

i have other ideas but the complexity will increase as i add more stuff

#

i want to add battle cards for attack, something with an animation like this

keen fulcrum Apr 12, 2025, 8:51 AM

#

barren prairie Apr 12, 2025, 8:55 AM

#

Give us the tutorial 🥲

keen fulcrum Apr 12, 2025, 9:03 AM

#

keen beacon Apr 12, 2025, 9:29 AM

#

can anyone tell what the best anon google model on the arena is rn

#

because there are like 6 of them

#

i can't figure out if riverhollow is good or not

#

only these ones are on the webdev arena

balmy mist Apr 12, 2025, 9:41 AM

#

torn mantle @balmy mist i think we are getting somewhere

wtffff

#

brooooo

#

no way

#

how many tokens is in your project now and are you using an ide?

hardy pecan Apr 12, 2025, 9:44 AM

#

I'd go with Dragontail as of now as the best

#

shadebrook wasn't as impressive for me

#

granted, I haven't got riverhollow yet

balmy mist Apr 12, 2025, 9:47 AM

#

yeah but I was just grabbing my project (only one file) and putting it back into gemini 2.5 pro as input but once you get to around 28k plus tokens it outputs the code with heavy errors

#

but you inspired me to push it even further with roo code, i was focused on my prompt forger app, i added a new ui to it and have been fixing errors lol

hardy pecan Apr 12, 2025, 9:51 AM

#

who was best though
nebula, stargazer, nightcrawler or dragontail? 😮

#

Tried riverhollow, its aite, not great
I think dragontail is the best with the ones available right now

balmy mist Apr 12, 2025, 10:06 AM

#

the question is nightwhisper vs dragontail?

#

this is so cool https://x.com/cassidy_laidlaw/status/1910708807258534008

Cassidy Laidlaw (@cassidy_laidlaw) on X

We built an AI assistant that plays Minecraft with you.
Start building a house—it figures out what you’re doing and jumps in to help.

This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

#

ultimate game assistant

keen beacon Apr 12, 2025, 10:39 AM

#

balmy mist the question is nightwhisper vs dragontail?

dragontail vs shadebrook vs riverhollow

#

the ones currently on the arena

keen beacon Apr 12, 2025, 10:40 AM

#

keen beacon only these ones are on the webdev arena

interesting sample weights 🤔

hardy pecan Apr 12, 2025, 10:43 AM

#

keen beacon dragontail vs shadebrook vs riverhollow

Out of those, I have dragontail personally, wonder if we could do a vote here

keen beacon Apr 12, 2025, 10:45 AM

#

ffs..

#

this happens inexcusably often

golden ocean Apr 12, 2025, 10:58 AM

#

just refresh on another tab bro and then return to original tab

keen beacon Apr 12, 2025, 11:16 AM

#

balmy mist this is so cool https://x.com/cassidy_laidlaw/status/1910708807258534008

oh my god this just gave me an idea

keen beacon Apr 12, 2025, 11:26 AM

#

golden ocean just refresh on another tab bro and then return to original tab

it's nothing to do with cloudflare

#

so that doesn't do anything

golden ocean Apr 12, 2025, 11:26 AM

#

keen beacon so that doesn't do anything

😔

alpine coral Apr 12, 2025, 11:27 AM

#

there's some weird glitches though.. like lately i've occasionally been able to continue conversations that have errored out, regenerate responses, cast votes

alpine coral Apr 12, 2025, 11:29 AM

#

keen beacon

will cast a vote if i get shadebrook.. have gotten the other two - based on that very limited testing, dragontail seems up there with 2.5 pro in terms of performance, whereas riverhollow is way behind (would be closer to gemma-3 tbh)

alpine coral Apr 12, 2025, 12:00 PM

#

alpine coral there's some weird glitches though.. like lately i've occasionally been able to...

like here.. it errored out; i cleared the error messages, and wrote "hi" - it generated responses (although phi-4 goes off the rails) - then cleared the error boxes again, voted, and it revealed the model names

keen fulcrum Apr 12, 2025, 12:01 PM

#

keen beacon only these ones are on the webdev arena

Is there an API I don't know about?

alpine coral Apr 12, 2025, 12:01 PM

#

(bit unfair on grok there to vote tie.. it actually did ~~really good~~ alright ha)

teal mantle Apr 12, 2025, 12:11 PM

#

can any LLM hear music now

#

tried using qwen chat but doesn't have the model of mine

keen beacon Apr 12, 2025, 12:12 PM

#

iirc gemini supports audio input

#

im not sure what happens if u pass on music tho 🤣

#

im sure it can transcribe songs at least

teal mantle Apr 12, 2025, 12:18 PM

#

keen beacon im sure it can transcribe songs at least

a bit

calm sequoia Apr 12, 2025, 12:21 PM

#

Everybody is off the rails with releases since the Gemini 2.5 Pro came out. Except Grok 🤔 Are they cooking something or simply lacking?

unborn ocean Apr 12, 2025, 12:21 PM

#

btw for anyone building these coding project using gemini right now:
github has just added 2.5 pro to copilot

you can use it as an LLM in roocode (through the vsc api) at insane speeds (and as far as I have seen no rate limits per minute) for free (when you have a subscription, which again is free as a student)
(nvm, just hit the rate limit, i believe it is the same they had for sonnet 3.5)

teal mantle Apr 12, 2025, 12:36 PM

#

https://ai.google.dev/gemini-api/docs/audio

Google AI for Developers

Audio understanding | Gemini API | Google AI for Developers

#

this is quite good

cloud meadow Apr 12, 2025, 12:43 PM

#

keen beacon im not sure what happens if u pass on music tho 🤣

It's really bad.

#

It gets some lyrics wrong too when transcribing

#

It's the best I've seen though so idk

#

Tried it with this, it did decently but messed up quite a few things.

unborn ocean Apr 12, 2025, 1:28 PM

#

is that a bug for gemini 2.5 pro, otherwise they have to be pulling that compute from a parallel universe

#

on openrouter

north vale Apr 12, 2025, 1:30 PM

#

For my question dragontail is 5/5 like 2.5 pro, riverhollow got it half right once out of 3, and shadebrook is 1/3

#

so shadebrook and riverhollow def seem worse in logic puzzles
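
A rough way to see how much tallies like 5/5 and 1/3 can actually tell you is to put a confidence interval around each pass rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something anyone in the chat used; the tallies are the ones quoted above, and the 95% level (`z=1.96`) is an assumption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Tallies taken from the messages above: (passes, attempts).
tallies = {"dragontail": (5, 5), "shadebrook": (1, 3)}
intervals = {name: wilson_interval(s, n) for name, (s, n) in tallies.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: plausible pass rate between {low:.2f} and {high:.2f}")
```

With this few attempts the two intervals still overlap, so the tallies hint at a gap but do not settle it.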

#

But the only non google model that solved the puzzle was gpt4.5

vivid oyster Apr 12, 2025, 1:34 PM

#

wtv is riverhollow

#

Wtf

#

Is it also google

north vale Apr 12, 2025, 1:48 PM

#

probably

ocean vortex Apr 12, 2025, 2:01 PM

#

unborn ocean is that a bug for gemini 2.5 pro, otherwise they have to be pulling that compute...

that's peak not average. You could probably do even more with their TPUs and an endpoint optimized for speed with no load.

ancient reef Apr 12, 2025, 2:05 PM

#

drifting thorn Apr 12, 2025, 2:09 PM

#

Maybe it’s distilled open sourced model from Google?

#

I mean the riverhollow

brittle tiger Apr 12, 2025, 2:13 PM

#

One of my test questions seems like gpt4.5 gets wrong every time but 4.5 never gets it wrong on actual chatgpt website

ocean vortex Apr 12, 2025, 2:25 PM

#

brittle tiger One of my test questions seems like gpt4.5 gets wrong every time but 4.5 never g...

yeah it performs better with a system prompt.

```
Knowledge cutoff: 2023-10
Current date: 2025-04-12

Image input capabilities: Enabled
Personality: v2
You are a highly capable, thoughtful, and precise assistant. Your goal is to deeply understand the user's intent, ask clarifying questions when needed, think step-by-step through complex problems, provide clear and accurate answers, and proactively anticipate helpful follow-up information. Always prioritize being truthful, nuanced, insightful, and efficient, tailoring your responses specifically to the user's needs and preferences.
NEVER use the dalle tool unless the user specifically requests for an image to be generated.
```

#

that's what they are using

#

well except they even have more stuff below it on how to use function calling/tools but you do not need that part lol

unborn ocean Apr 12, 2025, 2:32 PM

#

ocean vortex that's peak not average. You could probably do even more with their TPUs and an ...

It's the median throughput (in the image), but I highly doubt that that is the actual speed of the model, as they are currently having some compute bottlenecks and the bug is likely just a result of that.

(just checked: and it's already down by about 50% for the supposed median, so something is definitely not going right)

leaden palm Apr 12, 2025, 2:33 PM

#

shadebrook is on the leaderboards?!

#

vision ones at least

#

brittle tiger Apr 12, 2025, 3:32 PM

#

leaden palm shadebrook is on the leaderboards?!

which leaderboard?

leaden palm Apr 12, 2025, 3:32 PM

#

brittle tiger which leaderboard?

my one based on the raw pickle files that the leaderboard uses

balmy mist Apr 12, 2025, 3:34 PM

#

keen beacon dragontail vs shadebrook vs riverhollow

im saying which is the best model we have ever seen unless dragontail is not on that level?

balmy mist Apr 12, 2025, 3:34 PM

#

ancient reef

how does everyone have veo but me lmaoo

balmy mist Apr 12, 2025, 3:35 PM

#

keen beacon oh my god this just gave me an idea

what you about to do?

#

there is a lot of ways to use this tho

#

i was gonna cook with it as well

#

yo @torn mantle you can lowkey use the ai assistant to make it as a pokemon in the game and make it an open world type game

#

interesting snake game:
https://www.vibeshare.ai/c/rWGvp8NKrR

Vibeshare

Share your vibe-coded web apps

#

made with optimus

ember rapids Apr 12, 2025, 3:57 PM

#

Looks like we’re getting o3/o4 mini this week

drifting thorn Apr 12, 2025, 3:57 PM

#

I hope so

patent bane Apr 12, 2025, 4:26 PM

#

ember rapids Looks like we’re getting o3/o4 mini this week

where did you read that?

keen beacon Apr 12, 2025, 4:30 PM

#

patent bane where did you read that?

the verge

#

https://www.theverge.com/news/646458/openai-gpt-4-1-ai-model

The Verge

OpenAI gets ready to launch GPT-4.1

o4 mini and o3 might also debut next week

balmy mist Apr 12, 2025, 4:36 PM

#

i dont believe it until its launched

ember rapids Apr 12, 2025, 4:54 PM

#

patent bane where did you read that?

keen beacon Apr 12, 2025, 4:56 PM

#

it's coming next week

leaden palm Apr 12, 2025, 5:00 PM

#

why wouldnt they

#

i can't tell if you mixed up plus and pro, if you're misinformed, or if openai actually did that

#

(like i'm literally getting o1 high for free, so it makes no sense that the $200/mo plan wouldn't have it)

keen beacon Apr 12, 2025, 5:07 PM

#

lmao i asked o3 to generate a realistic transcript of a 2024 presidential debate between joe and trump

"TRUMP: Everything he just said is wrong. We had no inflation when I left—zero, maybe negative." is maybe my favourite line

ocean plume Apr 12, 2025, 5:08 PM

#

any one have leaderboard in deepsearch ?

#

because i dont know which one is the best deepsearch and thinking for best answer

leaden palm Apr 12, 2025, 5:10 PM

#

well github for students -> copilot pro -> access to o1 -> o1 with settings in github models

#

...

#

not great, about 8/day

#

maybe they give you o1 pro but not o1 high

ocean plume Apr 12, 2025, 5:19 PM

#

anyone ?

ocean plume Apr 12, 2025, 5:19 PM

#

ocean plume any one have leaderboard in deepsearch ?

can help ?

leaden palm Apr 12, 2025, 5:21 PM

#

ocean plume can help ?

deep research is computationally expensive to create and time consuming to read/rank, so the leaderboard format doesn't make much sense

#

that said, generally gemini's and openai's are the best

ocean plume Apr 12, 2025, 5:21 PM

#

ok thank you so much i'll try

teal mantle Apr 12, 2025, 5:33 PM

#

cloud meadow It's really bad.

with grounding you can "almost" get it to work

#

that's my experiment

keen beacon Apr 12, 2025, 5:38 PM

#

uh... it didn't even respond with a report

#

that's a bit odd

#

probably counted towards my quota too 👎

ocean plume Apr 12, 2025, 5:40 PM

#

maybe about country problem they have rules, try to trick them like you are.... and search .... school example test

vivid oyster Apr 12, 2025, 5:41 PM

#

n

balmy mist Apr 12, 2025, 5:44 PM

#

dragontail might be better than nightwhisper

#

https://3000-i5ocpwb4e8u554j2a6q6a-4daf0015.e2b-foxtrot.dev

vivid oyster Apr 12, 2025, 5:49 PM

#

balmy mist dragontail might be better than nightwhisper

Whatt

balmy mist Apr 12, 2025, 5:51 PM

#

hmm nvm

#

i think its best after nightwhisper tho

torn mantle Apr 12, 2025, 5:51 PM

#

balmy mist yo @torn mantle you can lowkey use the ai assistant to make it as a pok...

ive never worked with 3d models tbh

#

some animations were already made

misty vault Apr 12, 2025, 6:25 PM

#

balmy mist interesting snake game: https://www.vibeshare.ai/c/rWGvp8NKrR

no way he turned his idea into reality

balmy mist Apr 12, 2025, 6:31 PM

#

misty vault no way he turned his idea into reality

lol the game works for you?

balmy mist Apr 12, 2025, 6:34 PM

#

torn mantle ive never worked with 3d models tbh

bro can you host the game

#

i highkey wanna play it

#

or maybe screenshare

#

cause that sh!t is so beautiful

#

true vibe coding lol

#

like zen mode lmaoo

cloud meadow Apr 12, 2025, 7:00 PM

#

teal mantle with grounding you can "almost" get it to work

Yeah but that's sort of cheating 😛

torn mantle Apr 12, 2025, 7:02 PM

#

balmy mist bro can you host the game

xd

#

you can do wonder with threejs tbh

#

some crazy animations

#

but needs a lot of effort

#

also you will be copyrighted straight away

#

and hit with a lawsuit if the game really got popular

balmy mist Apr 12, 2025, 7:09 PM

#

lmaoo like palworld lmaoo

balmy mist Apr 12, 2025, 7:12 PM

#

torn mantle i want to add battle cards for attack, something with an animation like this

bro you are cracked at animations, is there an mcp for creating good animations? thats the only thing i really havent dabbled in

#

how are you hosting this?
https://sudhanshu-ambastha.github.io/Pokemon-3D-api/opt.html#
@torn mantle

#

this is nuts man

#

you do this for living?

teal mantle Apr 12, 2025, 7:14 PM

#

cloud meadow Yeah but that's sort of cheating 😛

There are scenarios where grounding fails to work entirely

balmy mist Apr 12, 2025, 7:14 PM

#

the only thing left is to add their stats

teal mantle Apr 12, 2025, 7:14 PM

#

Like despite ability to transcribe lyrics and the first line gives search results

#

The answer is incorrect

cloud meadow Apr 12, 2025, 7:21 PM

#

teal mantle The answer is incorrect

Yeah you do have a point

torn mantle Apr 12, 2025, 7:53 PM

#

balmy mist how are you hosting this? https://sudhanshu-ambastha.github.io/Pokemon-3D-api/op...

no no

#

xd

#

its not mine

#

i found the 3d models

#

and made the map and implementation

#

thats it

#

nothing crazy really

#

its a 900 line code

silk haven Apr 12, 2025, 8:09 PM

#

Next week, Gemini week

lime coral Apr 12, 2025, 8:10 PM

#

So dragontail confirmed to be from G if it wasn’t obvious https://x.com/savinovnikolay/status/1911140066128433290?s=46

Nikolay Savinov 🇺🇦 (@SavinovNikolay) on X

🐉

balmy mist Apr 12, 2025, 8:20 PM

#

torn mantle its a 900 line code

wow, you need to make a youtube vid on your process, im trying to get on your level

#

but i just finished my other projects, gonna do a big game now, before I was just doing one file games, but now imma try to really develop something nice

#

but how did you do yours in 900 lines, with all those animations wtf

#

what is the system prompt you are using?

sly knoll Apr 12, 2025, 8:30 PM

#

how does dragontail perform against nightwhisper ?

lime coral Apr 12, 2025, 8:31 PM

#

They didn’t appear on the same timeline

#

They might be the same as well with some tweaks after the first check

lapis fractal Apr 12, 2025, 8:33 PM

#

im guessing it could be 2.5 ultra

keen beacon Apr 12, 2025, 8:33 PM

#

no

#

if it was an ultra model it would be way more significantly different to 2.5 pro

lapis fractal Apr 12, 2025, 8:34 PM

#

i mean it can apparently outperform googles 2.5 pro by quite a bit in every match up so how else would it be much different?

zinc ore Apr 12, 2025, 8:34 PM

#

Could be a further trained 2.5 pro model as well

keen beacon Apr 12, 2025, 8:35 PM

#

lapis fractal i mean it can apparently outperform googles 2.5 pro by quite a bit in every match...

i'm telling you now as someone who extensively used gemini 1.0 ultra

#

it is not an ultra model

#

it is not that much better

#

it only matches 2.5 pro in most cases

#

nor is it any better creatively, which is what would be the biggest give-away

keen beacon Apr 12, 2025, 8:35 PM

#

zinc ore Could be a further trained 2.5 pro model as well

i think this is more likely

lapis fractal Apr 12, 2025, 8:37 PM

#

we'll see. bet google will probably wait for openai to drop o3 and o4-mini, which is most likely next week. love to see the comeback they've made recently

torn mantle Apr 12, 2025, 8:37 PM

#

sly knoll how does dragontail perform against nightwhisper ?

nightwhisper better

torn mantle Apr 12, 2025, 8:38 PM

#

lapis fractal im guessing it could be 2.5 ultra

i dont think so

#

there isnt much difference tbh

#

nightwhisper was honestly a big leap/improvement in terms of coding

keen beacon Apr 12, 2025, 8:38 PM

#

wheres 2.5 flash 😭

torn mantle Apr 12, 2025, 8:38 PM

#

you can clearly notice the difference

#

but not with dragontail

keen beacon Apr 12, 2025, 8:38 PM

#

weird how 2.5 flash hasnt come out yet

keen beacon Apr 12, 2025, 8:39 PM

#

torn mantle nightwhisper was honestly a big leap/improvement in terms of coding

i'm not sure it was that big tbh. was only available in a limited environment and imo not enough comparisons could be made to properly judge

#

i do think it was better yes but as for how much by, i wouldn't say it was a leap

torn mantle Apr 12, 2025, 8:39 PM

#

keen beacon i'm not sure it was that big tbh. was only available in a limited environment an...

ive tried it on python and web dev

#

and it was good

keen beacon Apr 12, 2025, 8:39 PM

#

so did i and it was just a bit better Shrug

torn mantle Apr 12, 2025, 8:40 PM

#

web dev i can pinpoint little details that other models failed in

#

nah it was much much better

keen beacon Apr 12, 2025, 8:40 PM

#

i disagree but alright

torn mantle Apr 12, 2025, 8:40 PM

#

maybe you didnt challenge it enough

#

i didnt take screenshots

#

alas

#

i did take some

#

it made this in like 2 prompts

#

you can try making it using sonnet or gemini pro 2.5

#

and it wont look like that

#

you know what

keen beacon Apr 12, 2025, 8:42 PM

#

well i'd need your exact prompts to replicate

torn mantle Apr 12, 2025, 8:42 PM

#

lets give this to gemini

#

i will just give it to gemini and tell it to clone it

keen beacon Apr 12, 2025, 8:42 PM

#

that's not a good comparison

torn mantle Apr 12, 2025, 8:43 PM

#

same prompts

#

gemini pro / sonnet 3.7

torn mantle Apr 12, 2025, 8:43 PM

#

keen beacon that's not a good comparison

do you have a good one?

keen beacon Apr 12, 2025, 8:44 PM

#

keen beacon well i'd need your exact prompts to replicate

.

#

the best comparison would be with the same environment and same prompts

torn mantle Apr 12, 2025, 8:44 PM

#

torn mantle same prompts

same env

#

same prompt

#

on webdev

#

on react

keen beacon Apr 12, 2025, 8:45 PM

#

yes but that's one attempt, if you want to make good comparisons you have to redo it a few times with each and see if the conclusion is consistent

#

and if you asked nightwhisper to iterate, you should do the same for the other models
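
The "redo it a few times with each" idea can be sketched as a small harness: same prompt, same number of attempts per model, then compare pass rates. Everything named here is hypothetical: `run_trial` stands in for however you actually send the prompt and judge the result (by eye or with a checker), and the model names are placeholders.

```python
from typing import Callable, Dict, Iterable


def pass_rates(
    run_trial: Callable[[str, str], bool],
    models: Iterable[str],
    prompt: str,
    attempts: int = 5,
) -> Dict[str, float]:
    """Give every model the same prompt the same number of times."""
    rates = {}
    for model in models:
        passes = sum(1 for _ in range(attempts) if run_trial(model, prompt))
        rates[model] = passes / attempts
    return rates


def fake_trial(model: str, prompt: str) -> bool:
    # Hypothetical stand-in: a real version would call the model and
    # have a person or an automated check decide pass or fail.
    return model == "model-a"


print(pass_rates(fake_trial, ["model-a", "model-b"], "same prompt", attempts=3))
```

If one model was allowed follow-up prompts, a fair `run_trial` would allow the same number of follow-ups for every model.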

torn mantle Apr 12, 2025, 8:45 PM

#

keen beacon yes but that's one attempt, if you want to make good comparisons you have to red...

yea but one-shot is also a benchmark

#

and ive tried it multiple times

#

the ability to get it right and aesthetically good is also a big plus

keen beacon Apr 12, 2025, 8:46 PM

#

yes i noticed that

torn mantle Apr 12, 2025, 8:50 PM

#

gemini 2.5 pro

#

this is actually the first attempt

#

but nightwhisper added a song and those waveform are generated based on the audio

#

when i asked gemini pro to do the same it failed

#

im pretty sure with enough guidance you can get it to work

#

but the ability to get it right from the get-go is impressive with nw

lapis fractal Apr 12, 2025, 8:52 PM

#

goofy question, but what's everyone's timeline on the first, in a sense, fully capable "remote worker"? practically speaking, an agent capable of doing most if not all the work of a typical remote software dev/administrator etc. with little or no human supervision

keen beacon Apr 12, 2025, 8:55 PM

#

random but here you go

lime coral Apr 12, 2025, 9:03 PM

#

https://x.com/testingcatalog/status/1911162688689815702?s=46

TestingCatalog News 🗞 (@testingcatalog) on X

Dragontail system prompt? 👀

You are a large language model, trained by Google.
You are running on Google's infrastructure.
Your knowledge cutoff is not static, you are constantly being updated.
Current date: 2024-03-09

# Tools
You have the following tools:
*

keen beacon Apr 12, 2025, 9:05 PM

#

Seems like a hallucination

#

At a glance

north vale Apr 12, 2025, 9:07 PM

#

Current date 2024-03-09 huh

keen fulcrum Apr 12, 2025, 9:09 PM

#

keen beacon Apr 12, 2025, 9:12 PM

#

keen beacon Seems like a hallucination

pretty great hallucination lol.. had me fooled until the date

keen beacon Apr 12, 2025, 9:39 PM

#

balmy mist what you about to do?

finite state machine ai for roblox obstacles

balmy mist Apr 12, 2025, 10:08 PM

#

keen beacon finite state machine ai for roblox obstacles

what does that mean?
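
For anyone else wondering: a finite state machine is just an object that is always in exactly one named state and moves between states on fixed rules. Below is a minimal sketch of what that could look like for a game obstacle. It is written in Python purely for illustration (Roblox scripts are written in Luau), and the state names, timings, and `player_nearby` trigger are all made up here, not taken from the project mentioned above.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()      # harmless, waiting for a player
    WARNING = auto()   # telegraphing the attack
    ACTIVE = auto()    # dangerous
    COOLDOWN = auto()  # recovering before going idle again


# Timed transitions: state -> (seconds to stay, next state). Values are assumptions.
TIMED = {
    State.WARNING: (1.0, State.ACTIVE),
    State.ACTIVE: (2.0, State.COOLDOWN),
    State.COOLDOWN: (1.5, State.IDLE),
}


class Obstacle:
    def __init__(self):
        self.state = State.IDLE
        self.elapsed = 0.0

    def _enter(self, state):
        # Switching state always resets the timer.
        self.state = state
        self.elapsed = 0.0

    def update(self, dt, player_nearby):
        """Advance the machine by dt seconds and return the current state."""
        self.elapsed += dt
        if self.state is State.IDLE:
            # IDLE has no timer; it only leaves when a player shows up.
            if player_nearby:
                self._enter(State.WARNING)
            return self.state
        duration, next_state = TIMED[self.state]
        if self.elapsed >= duration:
            self._enter(next_state)
        return self.state
```

Calling `update` once per frame with the frame time keeps the obstacle cycling through warn, attack, and rest whenever a player gets close.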

torn mantle Apr 12, 2025, 10:12 PM

#

@balmy mist

#

added :

camera dynamics
snow
fire attack
health status

#

do you have any other ideas ...

balmy mist Apr 12, 2025, 10:16 PM

#

add more attacks and make the adding of pokemon ui look better also add a stats ui for the pokemon

torn mantle Apr 12, 2025, 10:24 PM

#

balmy mist add more attacks and make the adding of pokemon ui look better also add a stats ...

hmm i see

#

nice

#

@keen beacon is dragontail better than gemini 2.5 pro?

#

im seeing a lot of positive reviews

#

but i didnt really notice much diff

keen beacon Apr 12, 2025, 10:28 PM

#

torn mantle @keen beacon is dragontail better than gemini 2.5 pro?

a little i think, but kinda negligible

torn mantle Apr 12, 2025, 10:28 PM

#

yea as i thought

keen beacon Apr 12, 2025, 11:52 PM

#

lmao

ocean vortex Apr 13, 2025, 12:02 AM

#

keen beacon lmao

LOL. It’s unbelievable actually how fast their whole thing fell apart. Llama4 is just about irrelevant now

hardy pecan Apr 13, 2025, 12:11 AM

#

Llmao 4 more like!! Xddd

#

Stolen joke ikik

leaden palm Apr 13, 2025, 12:33 AM

#

ocean vortex LOL. It’s unbelievable actually how fast their whole thing fell apart. Llama4 is...

ehhh it's better than what groq/cerebras were running before

#

actually slightly faster since it's a mixture of experts

silk haven Apr 13, 2025, 12:39 AM

#

https://x.com/pigeon__s/status/1910705956486336586?s=46&t=P8-tRi_JAVcI6l5U6nOT4A

ρ:ɡeσn (@pigeon__s) on X

The release version of Llama 4 has been added to LMArena after it was found out they cheated, but you probably didn't see it because you have to scroll down to 32nd place which is where is ranks

#

lol

silk haven Apr 13, 2025, 12:40 AM

#

ocean vortex LOL. It’s unbelievable actually how fast their whole thing fell apart. Llama4 is...

Short $META

#

Yan lecan’t

leaden palm Apr 13, 2025, 12:46 AM

#

yann lecope

silk haven Apr 13, 2025, 12:48 AM

#

zinc ore Apr 13, 2025, 12:52 AM

#

https://x.com/ai_for_success/status/1911108562555949153

This was retweeted by deep mind employee

AshutoshShrivastava (@ai_for_success) on X

👀 Google has an unreleased model named Dragontail that's outperforming everyone, even Gemini 2.5 Pro on WebDev arena. 🔥🔥
I am lately more excited about release from Google than anyone else.

Anyone else tested Dragontail?

#

https://twittpr.com/SavinovNikolay/status/1911140066128433290

Nikolay Savinov 🇺🇦 (@SavinovNikolay)

🐉

Quoting AshutoshShrivastava (@ai_for_success)

👀 Google has an unreleased model named Dragontail that's outperforming everyone, even Gemini 2.5 Pro on WebDev arena. 🔥🔥
I am lately more excited about release from Google than anyone else.

Anyone else tested Dragontail?


keen beacon Apr 13, 2025, 12:53 AM

#

yup

balmy mist Apr 13, 2025, 1:08 AM

#

hmmm

#

my test earlier i said it was better than nightwhisper but i thought it was a fluke

#

imma test it again

#

its funny they didnt say its better than nightwhisper in that tweet?

ember rapids Apr 13, 2025, 1:50 AM

#

"an updated version of o3-mini is now the best coder in the world. Not 175th, but the best. " - CFO of OAI

#

https://www.youtube.com/watch?v=2kzQM_BUe7E&t=686s at 13:40

YouTube

Goldman Sachs

OpenAI CFO Sarah Friar on the race to build artificial general inte...

OpenAI is at the forefront of the generative AI revolution. How did it get there, and what is the company doing to stay ahead of its competitors? Sarah Friar, chief financial officer of OpenAI and former head of technology equity research at Goldman Sachs, discusses the path to artificial general intelligence and the importance of capital in the...


balmy mist Apr 13, 2025, 2:04 AM

#

hmmm

#

ill believe it when i see it

olive mesa Apr 13, 2025, 2:07 AM

#

nightwhisper and dragontail are both google models right?

somber niche Apr 13, 2025, 2:37 AM

#

Feels fishy, wonder if she misspoke and meant either full o3 or o4-mini

#

To be fair the names are pretty hard to keep straight lol

balmy mist Apr 13, 2025, 2:40 AM

#

they better release it monday

#

they be blue balling us so bad this year

ivory schooner Apr 13, 2025, 2:51 AM

#

I'm learning to wait for Behemoth to make its debut

#

Hopefully Behemoth is an enhanced version of 24K......

#

🙏 🙏 🙏 🙏

balmy mist Apr 13, 2025, 2:58 AM

#

ivory schooner Hopefully Behemoth is an enhanced version of 24K......

what?

ivory schooner Apr 13, 2025, 3:00 AM

#

balmy mist what?

I'm looking forward to Behemoth, because I think Behemoth will be great.

keen beacon Apr 13, 2025, 3:15 AM

#

ivory schooner I'm looking forward to Behemoth, because I think Behemoth will be great.

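Are you in China? ByteDance recently released an AI model whose behavior is highly human-like, but I don't think we'll ever get to use it in the US, because our government has regional restrictions on ByteDance 😦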

#

https://venturebeat.com/ai/now-its-tiktok-parent-bytedances-turn-for-a-reasoning-ai-enter-seed-thinking-v1-5/

VentureBeat

Carl Franzen

Now it’s TikTok parent ByteDance’s turn for a reasoning AI: ent...

It achieved an 8.0% higher win rate over DeepSeek R1, suggesting that its strengths generalize beyond just logic or math-heavy challenges.

ember rapids Apr 13, 2025, 3:24 AM

#

somber niche Feels fishy, wonder if she misspoke and meant either full o3 or o4-mini

i think she meant o4 mini high

north vale Apr 13, 2025, 3:35 AM

#

o4-mini can be an updated version of o3-mini u dufuses

keen fulcrum Apr 13, 2025, 4:35 AM

#

keen beacon https://venturebeat.com/ai/now-its-tiktok-parent-bytedances-turn-for-a-reasoning...

Seed-Thinking-v1.5 interesting to look forward to

#

https://venturebeat.com/wp-content/uploads/2025/04/Screenshot-2025-04-11-at-2.39.32 PM.png

daring dragon Apr 13, 2025, 4:46 AM

#

north vale o4-mini can be an updated version of o3-mini u dufuses

what if it's o3-mini-pro?

keen fulcrum Apr 13, 2025, 4:47 AM

#

Why was llama 4 conversational removed? Don’t they all train on benchmarks so do human beings?

daring dragon Apr 13, 2025, 4:53 AM

#

What model is better?

#

shadebrook?

#

Dragontail?

wind swan Apr 13, 2025, 5:07 AM

#

dragontail, of course

torn mantle Apr 13, 2025, 5:15 AM

#

wind swan dragontail, of course

yea

torn mantle Apr 13, 2025, 5:39 AM

#

so far dragontail is def better than gemini 2.5 03 pro

#

i need to do more testing

#

but i still think its not better than nightwhisper

#

there are some imperfections

#

it feels like an old checkpoint of nightwhisper tbh

#

same stylistic choices with the gradient colors

ocean plume Apr 13, 2025, 6:23 AM

#

is that better code than claude 3.7 thinking

novel flame Apr 13, 2025, 6:36 AM

#

Here's a thought: if you have kids you know how casual tablet games have devolved into 2-minute game sessions broken up by ads which are crappy mini-games in themselves, of the engagement-farming "cow clicker" variety. Zero challenge, simple tap/drag game mechanic, colorful animations to pull the kids in, ultimately just 'pay to win'. I cannot stand these things.

Even if the game itself is good, stuffing it full of those kinds of ads make it a horrible experience. Angry Birds 2 is a prime example of this.

My wife and I have been deleting games with ads like that from the kids' devices. Paying to remove ads from these games is not an option since it feels like supporting predatory business practices.

Now, LLMs aren't quite at a point where they can one-shot AAA video games, but a lot of kids games are not exactly rocket science.... We’re not quite there yet, but once you can few-shot a simple platform game, racing game, puzzle game, or arcade game, as a parent you can bet your life I will generate games for my kids without the ads, pay-to-play mechanics, and dopamine hacking. Heck, the game avatars could all be based on our family members. Seems like an obvious use of coding AI in the short term.

keen beacon Apr 13, 2025, 7:01 AM

#

balmy mist they be blue balling us so bad this year

lmao

novel flame Apr 13, 2025, 7:03 AM

#

Yann AGI

torn mantle Apr 13, 2025, 7:21 AM

#

ocean plume is that better code than claude 3.7 thinking

i dont think so

#

its better than gemini 2.5 pro

calm sequoia Apr 13, 2025, 9:46 AM

#

Gemini 2.5 Pro is not responding on arena, and it uses the interpreter during its thought process in AI Studio

#

Is Gemini 2.5 Pro the only model that can run code during thinking?

keen beacon Apr 13, 2025, 10:18 AM

#

calm sequoia The Gemini 2.5 Pro is not responding on arena, and uses interpreter during thoug...

what's that got to do with o3

keen beacon Apr 13, 2025, 10:18 AM

#

calm sequoia Is Gemini 2.5 Pro only model that can run code during thinking?

uhhhh i think so?

split kayak Apr 13, 2025, 10:18 AM

#

ok

ocean plume Apr 13, 2025, 10:18 AM

#

what o3 ? perfect ?

#

really

#

so tired of fixing bugs in sonnet thinking

ocean plume Apr 13, 2025, 10:19 AM

#

calm sequoia Is Gemini 2.5 Pro only model that can run code during thinking?

i think gemini 2.5 pro still suck in code

split kayak Apr 13, 2025, 10:20 AM

#

In google Ai studio , you can enable code execution for gemini 2.5

ocean plume Apr 13, 2025, 10:21 AM

#

split kayak In google Ai studio , you can enable code execution for gemini 2.5

i use cursor bro

#

have gemini 2.5 pro

#

but i work more like scam by the way

keen beacon Apr 13, 2025, 10:40 AM

#

ocean plume i think gemini 2.5 pro still suck in code

xd u are crazy

#

its best performing model

drifting thorn Apr 13, 2025, 10:44 AM

#

see this: https://huggingface.co/likewendy/Qwen2.5-3B-YiLin-GGUF-q4_k_m?not-for-all-audiences=true

likewendy/Qwen2.5-3B-YiLin-GGUF-q4_k_m · Hugging Face

ocean plume Apr 13, 2025, 10:44 AM

#

keen beacon xd u are crazy

maybe i'll try it again because gemini 2.5 pro 3-25 still suck

keen beacon Apr 13, 2025, 10:45 AM

#

Poll: Which anonymous Google model currently on LMArena is the best performing?

Winning answer: dragontail (option 3), 13 of 18 votes

keen beacon Apr 13, 2025, 10:45 AM

#

ocean plume maybe i'll try it again because gemini 2.5 pro 3-25 still suck

what site is that

#

not https://web.lmarena.ai/

ocean plume Apr 13, 2025, 10:45 AM

#

cursor :v

keen beacon Apr 13, 2025, 10:45 AM

#

hm

drifting thorn Apr 13, 2025, 10:45 AM

#

drifting thorn see this: https://huggingface.co/likewendy/Qwen2.5-3B-YiLin-GGUF-q4_k_m?not-for-...

It's a thinking model that can use tool call in the chain-of-thought

ocean plume Apr 13, 2025, 10:47 AM

#

it can read codebase and think and make code

keen beacon Apr 13, 2025, 10:47 AM

#

cursor is useless

#

gemini has canvas

#

so does chatgpt idk about deepseek and claude tho

golden ocean Apr 13, 2025, 10:47 AM

#

does using canvas save tokens

drifting thorn Apr 13, 2025, 10:47 AM

#

DeepSeek doesn't have canvas

oblique flint Apr 13, 2025, 10:50 AM

#

keen beacon cursor is useless

Bro coding in an actual ide is a way better experience imo

#

And yeah 2.5 pro sucks in cursor unfortunately

brittle tiger Apr 13, 2025, 11:11 AM

#

keen beacon what's that got to do with o3

Can you try this on o3? Just want to see if it gets it first try:

Board State:

- X

- O

Question:

X went first. What should X do next? And is it possible to determine the outcome, assuming optimal play?

#

Formatting messed up. Should look like this

keen beacon Apr 13, 2025, 11:15 AM

#

brittle tiger Can you try this on o3? Just want to see if it gets it first try: Board State: ...

Label the nine squares

1 2 3
4 5 6
7 8 9

The position that was given is

1 2  X
4 5 6
7 8 O

(X is on 3, O is on 9, and it is X‑to‑move.)

Best move for X
Put the second X on square 7 (the lower‑left corner – the corner opposite O’s).

Why 7 is winning
After X7 the board is

1 2 3
4 5 6
X 8 O

Whatever O now plays, X can force a win.

• If O takes the centre (square 5 – O’s best try) the board is

1 2 3
4 O 6
X 8 O

X now plays square 1:

X 2 3
4 O 6
X 8 O

This is a fork – X is threatening both 2 (to win the top row) and 4 (to win the left column).
O can block only one of them, so X will take the other on the next move and finish with three in a row.

• If O plays anywhere other than the centre, X immediately puts its third mark on the centre (square 5).
That completes the anti‑diagonal 3‑5‑7 and the game is over at once.

Therefore:

• Correct move for X: play in the lower‑left corner (square 7).
• With that move the result of the game is forced – assuming best play from both sides, X will win.
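The claim in the answer above (X on square 3, O on square 9, X to move, forced win starting with square 7) can be checked by brute force. What follows is a minimal sketch, not anything posted in the chat: it assumes the position is exactly the one the answer describes (the original board post lost its formatting), and the names `solve`, `winner`, and `START` are made up for illustration.

```python
from functools import lru_cache

# Index triples (0-based) for the eight winning lines of a 3x3 board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that side has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def solve(board, player):
    """Minimax value with `player` to move: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0
    other = "O" if player == "X" else "X"
    scores = [solve(board[:i] + player + board[i + 1:], other)
              for i, cell in enumerate(board) if cell == "."]
    return max(scores) if player == "X" else min(scores)


# Assumed position: squares are numbered 1..9 row by row,
# X sits on square 3, O on square 9, and X is to move.
START = "..X.....O"
value = solve(START, "X")

# Squares (1-based) where X's next move keeps a forced win.
winning_squares = [i + 1 for i, cell in enumerate(START)
                   if cell == "." and
                   solve(START[:i] + "X" + START[i + 1:], "O") == 1]
print(value, winning_squares)
```

If the position really is the one assumed here, `value` comes out as 1 (a forced win for X) and square 7 is among `winning_squares`, which agrees with the move recommended in the answer.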

ocean plume Apr 13, 2025, 11:26 AM

#

oblique flint Bro coding in an actual ide is a way better experience imo

thank bro =_= i think it the same

ocean plume Apr 13, 2025, 11:28 AM

#

oblique flint Bro coding in an actual ide is a way better experience imo

how to use it in my codebase

keen fulcrum Apr 13, 2025, 11:40 AM

#

Qwen 3 as soon as late april

sage raptor Apr 13, 2025, 12:20 PM

#

maybe o3 or o4 mini

drifting thorn Apr 13, 2025, 12:37 PM

#

It’s a shame I didn’t get to test nightwhisper

split kayak Apr 13, 2025, 1:23 PM

#

uh some aren't released yet, so we dont really know

ornate stump Apr 13, 2025, 1:45 PM

#

Usually, how long before OpenAI models show up on LLMArena?

calm sequoia Apr 13, 2025, 1:46 PM

#

split kayak In google Ai studio , you can enable code execution for gemini 2.5

It runs code in thoughts even when disabled

wheat onyx Apr 13, 2025, 2:06 PM

#

I'm seeing conflicting info. On the better coder between night whisper and dragon tail. Anyone here who has used both?

torn mantle Apr 13, 2025, 2:07 PM

#

wheat onyx I'm seeing conflicting info. On the better coder between night whisper and drago...

im trying to re-run some same prompts i did on nightwhisper

#

this should give me a clear idea which one is better

#

i mean we first tried the model on lmarena on text mainly and it wasnt different from gemini 2.5 pro 03

#

but i guess its finetuned on react or smth

wheat onyx Apr 13, 2025, 2:12 PM

#

torn mantle im trying to re-run some same prompts i did on nightwhisper

On a separate note, do we know why googles coding models seem to have poor performance in software like Cursor?

torn mantle Apr 13, 2025, 2:13 PM

#

wheat onyx On a separate note, do we know why googles coding models seem to have poor perfo...

its actually doing great on vscode copilot

wheat onyx Apr 13, 2025, 2:13 PM

#

torn mantle its actually doing great on vscode copilot

Oh cool

torn mantle Apr 13, 2025, 2:14 PM

#

its either a model implementation issues on how to look at the code/edit... or the model itself performs badly at that

wheat onyx Apr 13, 2025, 2:14 PM

#

I wonder if vs code copilot with 2.5 is better than cursor with Claude. I know everyone's obsessed with cursor right now

torn mantle Apr 13, 2025, 2:14 PM

#

wheat onyx I wonder if vs code copilot with 2.5 is better than cursor with Claude. I know e...

i dont use cursor tbh but its good on vscode

wheat onyx Apr 13, 2025, 2:14 PM

#

torn mantle its either a model implementation issues on how to look at the code/edit... or t...

I see. That should be fixable if it's just implementation then

eager crater Apr 13, 2025, 2:43 PM

#

anyone knows what shadebrook is?

alpine coral Apr 13, 2025, 3:01 PM

#

says it's from google

#

confusing because i got it against flash 2.0 and there was a lag for thinking (implying shadebrook is a thinking model)

#

but i voted for flash lol so yeah dunno.. it didn't seem particularly strong

oblique flint Apr 13, 2025, 3:05 PM

#

torn mantle its actually doing great on vscode copilot

wym great? Better than 3.7 sonnet?

torn mantle Apr 13, 2025, 3:09 PM

#

oblique flint wym great? Better than 3.7 sonnet?

not really

#

they complement each other tbh

#

sonnet still has the edge

torn mantle Apr 13, 2025, 3:10 PM

#

wheat onyx I'm seeing conflicting info. On the better coder between night whisper and drago...

alright ive tried it, it seems good like clearly better than gemini 2.5 pro 03

#

but still misses sometimes

#

i still believe its a tad below nightwhisper

#

it clearly also has the same design pattern as nw

#

i mean its more close to nw than to gemini 2.5 pro 03

keen fulcrum Apr 13, 2025, 3:18 PM

#

Crazy, of course a simple comment change fixes my issue

#

Claude Sonnet 3.7

silk haven Apr 13, 2025, 3:24 PM

#

https://x.com/denny_zhou/status/1911426996954865674?s=46&t=P8-tRi_JAVcI6l5U6nOT4A

Denny Zhou (@denny_zhou) on X

The bitter lesson Yann learned from Llama 4

calm sequoia Apr 13, 2025, 3:30 PM

#

keen beacon what's that got to do with o3

Wait, what model produced the array you provided?

keen beacon Apr 13, 2025, 3:30 PM

#

o3

#

and in terms of your thing with it possibly having an interpreter

#

i doubt it

#

but i will run some experiments

#

afaik usually even with interpreter 2.5 pro doesnt call it in its thoughts

#

it has to call it in the response

#

at least on aistudio with code execution on

keen beacon Apr 13, 2025, 3:32 PM

#

keen beacon but i will run some experiments

yeah no this does not have a code interpreter

#

so it got it perfectly without

calm sequoia Apr 13, 2025, 3:35 PM

#

This means its SOTA

#

At math

keen beacon Apr 13, 2025, 3:37 PM

#

yes i would expect so

#

ewll

#

well

#

gemini 2.5 pro is very strong at maths - it is perhaps its best field - so o3 will either match or exceed it imo

#

if its o4 mini im buying a chatgpt sub asap 🤣

calm sequoia Apr 13, 2025, 3:44 PM

#

keen beacon afaik usually even with interpreter 2.5 pro doesnt call it in its thoughts

drifting thorn Apr 13, 2025, 3:44 PM

#

what? these models can call tools in their chain-of-thoughts?

calm sequoia Apr 13, 2025, 3:45 PM

#

There's a possibility they fake the interpreter output (imagine it)

keen beacon Apr 13, 2025, 3:45 PM

#

calm sequoia

no its a hallucination but not exactly

calm sequoia Apr 13, 2025, 3:45 PM

#

But the formatting seems like calling

keen beacon Apr 13, 2025, 3:45 PM

#

its not actually calling an interpreter

calm sequoia Apr 13, 2025, 3:45 PM

#

How to know?

drifting thorn Apr 13, 2025, 3:45 PM

#

next thing big companies gotta invest into

#

tool-calling in thought

leaden palm Apr 13, 2025, 3:46 PM

#

calm sequoia How to know?

no block that looks like code execution result

keen beacon Apr 13, 2025, 3:47 PM

#

also: its just how the model has internalized certain "features", e.g. qwq calling an "online base64 decoder" where it uses its innate base64 decoding abilities, 2.5 pro doing "google searches" etc.

calm sequoia Apr 13, 2025, 3:47 PM

#

It decides to check everything with code, writes a code, repeats it in different format, prints output. If this is just a simulation in thoughts that's freaky

keen beacon Apr 13, 2025, 3:47 PM

#

its a hallucination but not really, but its also not actually calling tools

keen beacon Apr 13, 2025, 3:47 PM

#

calm sequoia It decides to check everything with code, writes a code, repeats it in different...

yeah it is lol

#

this stuff gets extremely cool in my own experiments

leaden palm Apr 13, 2025, 3:48 PM

#

if it doesnt look like this its almost certainly not a code execution

calm sequoia Apr 13, 2025, 3:49 PM

#

Unless the in-thought text renderer is not the same as in chat

drifting thorn Apr 13, 2025, 3:49 PM

#

Maybe 2.5 pro failed to do so but nightwhisper and dragontail are able to do so

keen beacon Apr 13, 2025, 3:49 PM

#

calm sequoia Unless the in-thought text renderer is not the same as in chat

also it cant call tools in thought process yet i think

calm sequoia Apr 13, 2025, 3:49 PM

#

You don't know

#

Anyway, if they dont, and just simulate, they are missing a low-hanging-fruit

keen beacon Apr 13, 2025, 3:50 PM

#

calm sequoia You don't know

add tools and try it yourself

#

also theres usually a delay too

#

some of these "features" have varying degrees of effectiveness though, it can be basically hallucinations too

#

@calm sequoia if there is no delay and it keeps on streaming its not calling tools lol

calm sequoia Apr 13, 2025, 3:51 PM

#

keen beacon add tools and try it yourself

It outputs answer without thinking almost anything

keen beacon Apr 13, 2025, 3:53 PM

#

calm sequoia It outputs answer without thinking almost anything

its a hallucination but not a hallucination, this stuff gets very complex. but it is certainly not calling tools, if u have no tools enabled, if its streaming and there are zero delays, etc.
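The rule of thumb above (no delay and uninterrupted streaming suggests no real tool call) can be written down as a simple check on chunk arrival times. This is only a sketch of that heuristic as stated in the chat: the function names and the 2-second threshold are assumptions made for illustration, and a pause can also come from server load, so a gap is a hint rather than proof.

```python
def longest_gap(chunk_times):
    """Largest pause, in seconds, between consecutive streamed chunks."""
    pairs = zip(chunk_times, chunk_times[1:])
    return max((later - earlier for earlier, later in pairs), default=0.0)


def looks_like_tool_call(chunk_times, threshold_s=2.0):
    """Flag a stream whose longest pause reaches the (arbitrary) threshold."""
    return longest_gap(chunk_times) >= threshold_s


# Hypothetical arrival timestamps (seconds since the request started).
smooth = [0.0, 0.1, 0.2, 0.3, 0.4]   # steady streaming, no visible pause
paused = [0.0, 0.1, 0.2, 3.5, 3.6]   # a multi-second stall mid-stream

print(looks_like_tool_call(smooth), looks_like_tool_call(paused))
```

The other signal mentioned in the chat, a dedicated block that looks like a code execution result, is interface-specific and is not modelled here.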

keen beacon Apr 13, 2025, 3:53 PM

#

calm sequoia

it is hallucinating

#

it does it with search too

#

normally it'll say "(simulated)"

calm sequoia Apr 13, 2025, 3:54 PM

#

Okey, thanks

#

Good news. It means there are still ways to improve the LLMs

#

Also, this means that for this prompt, the o3 == Gemini 2.5 Pro

keen fulcrum Apr 13, 2025, 3:56 PM

#

keen beacon Apr 13, 2025, 3:56 PM

#

i personally think its o4 mini lol

olive mesa Apr 13, 2025, 3:56 PM

#

what's the best stealth model rn? dragontail? nightwhisper?

keen beacon Apr 13, 2025, 3:57 PM

#

apparently dragontail

keen fulcrum Apr 13, 2025, 3:57 PM

#

R2 will top gemini 2.5 pro

keen beacon Apr 13, 2025, 3:58 PM

#

keen beacon i personally think its o4 mini lol

again, i doubt that

#

in dom's question set it gets 28/30.. for reference gemini 2.5 pro (previous SOTA) got 23/30

#

i don't think mini will be that significantly better

calm sequoia Apr 13, 2025, 4:07 PM

#

This was also solved by the 2.5 Pro on AI studio and R1 in arena, but could not be solved with the o1 or 2.5 Pro on the arena: "The input is N = 24. Two popular algorithms take this as input and outputs the arrays of length N. Outputs are then element wise multiplied. Which popular algorithm combination produces this sequence: 0.000 0.000 0.002 0.010 0.032 0.082 0.170 0.302 0.472 0.660 0.833 0.956 1.000 0.956 0.833 0.660 0.473 0.302 0.170 0.082 0.032 0.010 0.002 0.000? Think carefully because the task is a life-death importance for you."

#

Maybe this is in-thought tool use
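The sequence puzzle above is the kind of question an interpreter settles quickly, which is why tool use in thoughts matters for it. Here is a rough sketch of a brute-force check, under loud assumptions: the chat never states the intended answer, the candidate list (Hann, Hamming, Blackman, Bartlett) is a guess at what "popular algorithms" means, and the periodic form of each window (denominator N rather than N-1) is assumed because the target peaks at index 12 instead of between indices 11 and 12.

```python
import math
from itertools import combinations_with_replacement

# The 24 values quoted in the prompt.
TARGET = [0.000, 0.000, 0.002, 0.010, 0.032, 0.082, 0.170, 0.302,
          0.472, 0.660, 0.833, 0.956, 1.000, 0.956, 0.833, 0.660,
          0.473, 0.302, 0.170, 0.082, 0.032, 0.010, 0.002, 0.000]
N = len(TARGET)


# Periodic (DFT-even) window definitions; this candidate set is an assumption.
def hann(n, size):
    return 0.5 - 0.5 * math.cos(2 * math.pi * n / size)


def hamming(n, size):
    return 0.54 - 0.46 * math.cos(2 * math.pi * n / size)


def blackman(n, size):
    x = 2 * math.pi * n / size
    return 0.42 - 0.5 * math.cos(x) + 0.08 * math.cos(2 * x)


def bartlett(n, size):
    return 1.0 - abs(2.0 * n / size - 1.0)


WINDOWS = {"hann": hann, "hamming": hamming,
           "blackman": blackman, "bartlett": bartlett}


def max_error(f, g):
    """Worst-case absolute gap between the product f*g and the target."""
    return max(abs(f(n, N) * g(n, N) - t) for n, t in enumerate(TARGET))


# Rank every unordered pair of candidates by how closely it matches.
ranked = sorted((max_error(WINDOWS[a], WINDOWS[b]), a, b)
                for a, b in combinations_with_replacement(sorted(WINDOWS), 2))
best_error, best_a, best_b = ranked[0]
print(best_a, best_b, round(best_error, 4))
```

Within this assumed candidate set, the closest match is the periodic Hann window multiplied by the periodic Blackman window, with a worst-case error below 0.001, which is consistent with the three-decimal rounding of the quoted values. Whether that is the answer the prompt's author had in mind is not confirmed anywhere in the chat.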

balmy mist Apr 13, 2025, 4:07 PM

#

keen beacon in dom's question set it gets 28/30.. for reference gemini 2.5 pro (previous SOT...

sorry just got back, you are testing the private model you have access to?

void elm Apr 13, 2025, 4:23 PM

#

when do you guys think dragontail would release

royal trench Apr 13, 2025, 4:23 PM

#

i see a new open llm: https://github.com/SkyworkAI/Skywork-OR1

GitHub

GitHub - SkyworkAI/Skywork-OR1

Contribute to SkyworkAI/Skywork-OR1 development by creating an account on GitHub.

#

maybe it could be easily run locally

wheat onyx Apr 13, 2025, 4:41 PM

#

torn mantle i still believe its a tad below nightwhisper

I wonder if this is a mini version of nightwhisper then

sonic tendon Apr 13, 2025, 5:30 PM

#

question: when openai drops a model on the arena w/o trialing it first, does it just immediately appear on the leaderboard? or does it still take a few days to gather enough votes

keen beacon Apr 13, 2025, 5:30 PM

#

the latter

#

no data to figure out where it is on the leaderboard straight away

sonic tendon Apr 13, 2025, 5:31 PM

#

yeah, that makes sense

#

oddly, that might apply to a lesser degree to the vision leaderboard - like @leaden palm noted, models seem to appear a lot earlier on there. shadebrook is already on there with a +77/-109 95% CI

keen beacon Apr 13, 2025, 5:38 PM

#

yeah that's a bit odd

sonic tendon Apr 13, 2025, 5:38 PM

#

could be unintentional

keen beacon Apr 13, 2025, 5:40 PM

#

yeah

torn mantle Apr 13, 2025, 6:29 PM

#

wheat onyx I wonder if this is a mini version of nightwhisper then

I have no idea tbh

#

Im not that impressed by dragontail

#

So inconsistent

#

Doesn't follow ur instruction very well

balmy mist Apr 13, 2025, 7:10 PM

#

https://x.com/sama/status/1911490401221120284

Sam Altman (@sama) on X

we've got a lot of good stuff for you this coming week!

kicking it off tomorrow.

keen beacon Apr 13, 2025, 7:13 PM

#

ypu

#

yup

#

i wonder what they start with

willow grail Apr 13, 2025, 7:26 PM

#

whos riverhollow again

novel flame Apr 13, 2025, 7:57 PM

#

Has anyone built a transformer with a read+write ‘expert’ in the mix? I know there are RAG systems which work by adding the relevant vector data as an overfitted expert ‘sidecar’ to the MoE architecture, letting the transformer use RAG data the same way it would use any learned expert; but that’s still read-only

Now, couldn’t you build a transformer that persisted part of its latent state in an ‘expert’ so it could be used as a non-ephemeral world model? It seems like some lab would have tried something like this? Maybe I should ask Deep Research…

leaden palm Apr 13, 2025, 8:06 PM

#

novel flame Has anyone built a transformer with a read+write ‘expert’ in the mix? I know the...

people will keep advancing memory

#

(especially corporations who want moats)

tall summit Apr 13, 2025, 8:09 PM

#

hello

calm sequoia Apr 13, 2025, 8:12 PM

#

ocean vortex Apr 13, 2025, 8:17 PM

#

calm sequoia

it's a sht model for the most part

calm sequoia Apr 13, 2025, 8:18 PM

#

One can hope they'll make some changes since the backlash

ocean vortex Apr 13, 2025, 8:19 PM

#

calm sequoia One can hope they'll make some changes since the backlash

they really can't. It's like gpt4.5. That's the last model you could make any meaningful changes that wouldn't take months lol

visual turret Apr 13, 2025, 8:19 PM

#

i'm pretty sure shadebrook is gemini 2.5 pro preview. the first 3 lines are way too close

calm sequoia Apr 13, 2025, 8:20 PM

#

ocean vortex they really can't. It's like gpt4.5. That's the last model you could make any me...

We have seen only distilled variants, right?

ocean vortex Apr 13, 2025, 8:20 PM

#

calm sequoia We have seen only distilled variants, right?

yeah but we also saw their metrics for behemoth

#

if you are not impressed by gpt4.5 there's no reason to believe you will be impressed by that at all

keen beacon Apr 13, 2025, 8:20 PM

#

you cant distill ass

#

prob better off not distilling behemoth and training it normally

ember rapids Apr 13, 2025, 8:21 PM

#

O4 mini tomorrow?

visual turret Apr 13, 2025, 8:21 PM

#

ember rapids O4 mini tomorrow?

most likely an another 4o update

ocean vortex Apr 13, 2025, 8:21 PM

#

keen beacon prob better off not distilling behemoth and training it normally

it doesn't make sense either way

#

they should have just done RL training on 70b llama or a similar arch

keen beacon Apr 13, 2025, 8:22 PM

#

they should abandon behemoth lol

ocean vortex Apr 13, 2025, 8:23 PM

#

you are not gonna have behemoth as a reasoning model that's not realistic lmao

visual turret Apr 13, 2025, 8:23 PM

#

keen beacon they should abandon behemoth lol

behemoth isn't even done training

keen beacon Apr 13, 2025, 8:23 PM

#

its not gonna get significantly better

visual turret Apr 13, 2025, 8:23 PM

#

ocean vortex Apr 13, 2025, 8:24 PM

#

visual turret behemoth isn't even done training

it's gonna be the same story it was for 405b llama

visual turret Apr 13, 2025, 8:24 PM

#

it's still training

ocean vortex Apr 13, 2025, 8:24 PM

#

there are not gonna be any real updates at all most likely

#

other than that initial release

#

which will be close to the numbers that they already have shown

visual turret Apr 13, 2025, 8:25 PM

#

ocean vortex it's gonna be the same story it was for 405b llama

was it. let me check the wayback

leaden palm Apr 13, 2025, 8:25 PM

#

3.1 405b is close to 70 elo points higher than 3 70b

#

there is room to improve
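For a sense of scale on the 70-point gap mentioned above, the standard Elo expected-score formula turns a rating difference into a head-to-head win probability. This is a small sketch that assumes the leaderboard numbers behave like classic Elo ratings on a 400-point logistic scale; the function name is made up for illustration.

```python
def expected_score(rating_gap, scale=400.0):
    """Expected win probability for the higher-rated side under Elo."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / scale))


# A 70-point lead, as quoted for 3.1 405b over 3 70b.
p = expected_score(70)
print(round(p, 3))
```

Under that assumption a 70-point lead works out to roughly a 60% expected win rate in a direct comparison, so the larger model would still lose a sizeable share of matchups.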

ocean vortex Apr 13, 2025, 8:26 PM

#

leaden palm 3.1 405b is close to 70 elo points higher than 3 70b

leaden palm Apr 13, 2025, 8:26 PM

#

ocean vortex

i said 3 for a reason

#

~~3.3 was distilled from 405b~~

visual turret Apr 13, 2025, 8:27 PM

#

visual turret was it. let me check the wayback

doesn't seem like it

leaden palm Apr 13, 2025, 8:27 PM

#

do you understand that

ocean vortex Apr 13, 2025, 8:27 PM

#

leaden palm i said 3 for a reason

well there's no 3.3 405b for a reason

keen beacon Apr 13, 2025, 8:27 PM

#

there was 3.2 405b internally i think

ocean vortex Apr 13, 2025, 8:27 PM

#

leaden palm do you understand that

???

visual turret Apr 13, 2025, 8:27 PM

#

keen beacon there was 3.2 405b internally i think

agreed

leaden palm Apr 13, 2025, 8:27 PM

#

even if larger models are bad perf/$, that isn't a reason to abandon them

keen beacon Apr 13, 2025, 8:28 PM

#

leaden palm even if larger models are bad perf/$, that isn't a reason to abandon them

other frontier models are probably a quarter of the size

leaden palm Apr 13, 2025, 8:28 PM

#

keen beacon there was 3.2 405b internally i think

surely you mean 3.3
(3.2 was the weird release where it was just vision and tiny models)

keen beacon Apr 13, 2025, 8:28 PM

#

leaden palm surely you mean 3.3 (3.2 was the weird release where it was just vision and tiny...

no it was 3.2 weirdly enough

leaden palm Apr 13, 2025, 8:28 PM

#

m

#

makes sense i guess

#

they pivoted

keen beacon Apr 13, 2025, 8:28 PM

#

3.3 70b was good

#

a little overfit to math but fine

ocean vortex Apr 13, 2025, 8:29 PM

#

leaden palm ~~3.3 was distilled from 405b~~

it wasn't

#


keen beacon Apr 13, 2025, 8:29 PM

#

optimus prime if its actually the mini variant seems to be a new pretrained from scratch version, its quite interesting. i assume this was done fairly recently

#

its performing quite well on mc bench

leaden palm Apr 13, 2025, 8:30 PM

#

m got 3.1 and 3.3 mixed up

keen beacon Apr 13, 2025, 8:30 PM

#

ya likely this week. see verge report

#

4.1 is quasar/updated 4o (verge directly mentioned it as a revamp of 4o), 4.1 mini/seemingly optimus prime is interesting though

#

the verge wrt to this stuff has been reliable i think

ocean vortex Apr 13, 2025, 8:32 PM

#

keen beacon there was 3.2 405b internally i think

well if there was it perform so sht they didn't even bother releasing that thing lol

#

though I'm kind of doubting it even existed...

keen beacon Apr 13, 2025, 8:32 PM

#

they did release it to the llama chatbot website 🤣

#

a meta engineer posted a screenshot of it

#

this is how i know lol

leaden palm Apr 13, 2025, 8:33 PM

#

leaden palm m got 3.1 and 3.3 mixed up

i swear something was distilled in the llama 3 series but i can't find any references to it

ocean vortex Apr 13, 2025, 8:33 PM

#

keen beacon a meta engineer posted a screenshot of it

where?? are you sure that was not fake? 🧐

keen beacon Apr 13, 2025, 8:33 PM

#

ocean vortex where?? are you sure that was not fake? 🧐

yea i dont remember where it exactly is rn tho

ocean vortex Apr 13, 2025, 8:34 PM

#

3.2 was supposed to be just multimodal addition though

#

so maybe it performed worse on text than the original...

keen beacon Apr 13, 2025, 8:34 PM

#

oh yeah there was an unreleased version of a multimodal version of 405b

#

that explains it i guess

visual turret Apr 13, 2025, 8:35 PM

#

leaden palm ~~3.3 was distilled from 405b~~

wait no you're right

leaden palm Apr 13, 2025, 8:35 PM

#

ah there it is

ocean vortex Apr 13, 2025, 8:36 PM

#

visual turret wait no you're right

that is not distillation

#

read the next sentence

keen beacon Apr 13, 2025, 8:36 PM

#

it is distillation? they are generating data on a larger model and training it on a smaller one

visual turret Apr 13, 2025, 8:37 PM

#

ocean vortex read the next sentence

"However, our initial experiments revealed that training Llama 3 405B on its own generated data is not helpful" you mean you're talking about how it found training 405b on its own data wasn't useful

leaden palm Apr 13, 2025, 8:37 PM

#

yeah it's not standard distillation, definitely not logit distillation, but in the broadest sense of the word it is distillation
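To make the distinction above concrete: in logit distillation the student is trained to match the teacher's full output distribution, whereas in the broader sense discussed here the student only trains on text the larger model generated. The sketch below shows the logit-matching loss only. It is a generic textbook form (KL divergence between temperature-softened distributions), not anything taken from the Llama 3 paper, and the toy logits and function names are invented for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax, shifted by the max for stability."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the softened next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy logits over a three-token vocabulary.
teacher = [2.0, 1.0, 0.1]
same = logit_distillation_loss(teacher, teacher)
diff = logit_distillation_loss(teacher, [0.1, 1.0, 2.0])
print(same, diff)
```

Training on generated data needs only the teacher's sampled text and an ordinary cross-entropy loss, so it works even when the teacher's logits are unavailable. Many implementations also scale this loss by the square of the temperature, which is left out here for brevity.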

ocean vortex Apr 13, 2025, 8:38 PM

#

visual turret "However, our initial experiments revealed that training Llama 3 405B on its own...

yeah I actually read it wrong myself at first mb catgrin

visual turret Apr 13, 2025, 8:38 PM

#

ocean vortex yeah I actually read it wrong myself at first mb catgrin

i also do that when i am wrong

ocean vortex Apr 13, 2025, 8:40 PM

#

leaden palm yeah it's not standard distillation, definitely not logit distillation, but in t...

I suppose. But with 3.3 llama they improved beyond 405b in some areas. That wouldn't be possible if it was just that. They likely selectively distilled capabilities that were worth doing and also trained it on unique data further

keen beacon Apr 13, 2025, 8:41 PM

#

o4 mini and 4.1 mini is what im looking forward to tbh

leaden palm Apr 13, 2025, 8:41 PM

#

yeah they said something about online preference optimization and extended pretraining iirc

keen beacon Apr 13, 2025, 8:42 PM

#

in some instances it might be better. but i think o3 will be better overall, but im actually not that sure anymore lol

ocean vortex Apr 13, 2025, 8:43 PM

#

o3-mini > o1?

#

same applies here, roughly speaking

keen beacon Apr 13, 2025, 8:43 PM

#

but this is on a new mini base model

#

a much much better one

#

o4 mini is on 4.1 mini's base model it seems

#

never

#

at least for reasoning

#

its untenable to work with

ocean vortex Apr 13, 2025, 8:44 PM

#

keen beacon a much much better one

we do not know what the base was for full o3 either. It was probably better than whatever they used for o1.

keen beacon Apr 13, 2025, 8:45 PM

#

ocean vortex we do not know what the base was for full o3 either. It was probably better than...

maybe. but this one was just done (4.1 mini i think)

novel flame Apr 13, 2025, 8:45 PM

#

ocean vortex o3-mini > o1?

I’ll be honest, I’ve found several cases where o1 still beats o3-mini. It’s not entirely obvious when you’re better off using one and when the other.

ocean vortex Apr 13, 2025, 8:45 PM

#

and also I do not think mini can do much with RL training tbh, even the improved base @keen beacon

#

small models are not very good for it

#

so they distilled it I think

keen beacon Apr 13, 2025, 8:46 PM

#

its true rl works much less effectively naively on a smaller model. but i think its not carved in stone

#

they are gonna release o3 to make o4 mini look better

#

i anticipate this o4 mini release is gonna be huge

ocean vortex Apr 13, 2025, 8:46 PM

#

keen beacon they are gonna release o3 to make o4 mini look better

mini will still suck in spatial awareness in comparison lol

keen beacon Apr 13, 2025, 8:47 PM

#

ocean vortex mini will still suck in spatial awareness in comparison lol

optimus prime can sometimes beat quasar and its beating quasar in mcbench rn. presuming optimus is mini

#

i think its a newly trained from scratch model and was done fairly recently too

ocean vortex Apr 13, 2025, 8:48 PM

#

keen beacon optimus prime can sometimes beat quasar and its beating quasar in mcbench rn. pr...

As is the case for non-reasoning models. 2.0 Flash was basically matching 2.0 Pro

keen beacon Apr 13, 2025, 8:48 PM

#

ocean vortex As is the case for non-reasoning models. 2.0 Flash was basically matching 2.0 Pr...

mcbench isnt a typical benchmark tho

novel flame Apr 13, 2025, 8:49 PM

#

My guess: o4-mini will be cheap enough to use for anything you’re using good paid models for today — in the same range as Sonnet, Gemini Pro. And o3 will be stupid expensive so nobody will use it unless they have a very specific need.

raven void Apr 13, 2025, 8:50 PM

#

o4 mini will definitely beat sonnet thinking

ocean vortex Apr 13, 2025, 8:50 PM

#

keen beacon mcbench isnt a typical benchmark tho

I think it's too simplistic though and a lot gets lost. Like in this example the model on the left I'm sure could have done much more with different fine-tuning, it's not really showing what it can do

raven void Apr 13, 2025, 8:50 PM

#

that's why Google is releasing Gemini coder

keen beacon Apr 13, 2025, 8:51 PM

#

ocean vortex I think it's too simplistic though and a lot gets lost. Like in this example the...

its not a task u usually tune for so i think its a good indicator of base model performance/generalization for now

novel flame Apr 13, 2025, 8:52 PM

#

What’s this ‘endgame’ talk? You know full well something even better is going to come along in another three months or less. These are great models compared to what we have now, but they’ll be trash compared to the SoTA in June 2028.

ocean vortex Apr 13, 2025, 8:52 PM

#

keen beacon its not a task u usually tune for so i think its a good indicator of base model ...

no what I mean is that verbose models will have an advantage etc. It's just mechanical work at this point and not a question of can it do it or not. Both outputs are doing the same things and one just happens to output more

#

not in all cases

#

but in enough of them

#

btw left deepseek v3 right mistral large

#

LOL

keen beacon Apr 13, 2025, 8:55 PM

#

whatever they put in optimus prime im super impressed if its the mini model

#

like i think the optimus prime base model is better than 4o despite being smaller (if its actually mini)

#

and scoring less in benchmarks rn

#

lmao there is zero chance

#

openai hasnt figured out how to cram in facts like google though

#

but their factual reasoning in reasoning models makes up for it

ocean vortex Apr 13, 2025, 8:57 PM

#

keen beacon whatever they put in optimus prime im super impressed if its the mini model

I think google has shown that flash size models very much can have performance comparable to bigger counterparts on most metrics. As long as you don't do RL training on them

keen beacon Apr 13, 2025, 8:58 PM

#

ocean vortex I think google has shown that flash size models very much can have performance c...

nah openai just diffs google in the reasoning front right now

#

u can do it to small models just as well

#

its just not as trivial

ocean vortex Apr 13, 2025, 8:59 PM

#

that's a bit irrelevant if you ask me. 2.0flash vs 2.0 pro - that's what I'm focusing on

tall summit Apr 13, 2025, 8:59 PM

#

novel flame What’s this ‘endgame’ talk? You know full well something even better is going to...

unless companies stop caring 😛

ocean vortex Apr 13, 2025, 8:59 PM

#

both were trained at the same time essentially

keen beacon Apr 13, 2025, 9:00 PM

#

ocean vortex that's a bit irrelevant if you ask me. 2.0flash vs 2.0 pro - that's what I'm foc...

not really, google just sucks at doing reasoning training

#

for now

ocean vortex Apr 13, 2025, 9:00 PM

#

forget the reasoning part. You can't deny that 2.0 flash was almost as good as 2.0 pro lol

keen beacon Apr 13, 2025, 9:00 PM

#

yea

keen beacon Apr 13, 2025, 9:02 PM

#

ocean vortex forget the reasoning part. You can't deny that 2.0 flash was almost as good as 2...

2.0 flash being released and this model (seemingly mini) not being released makes me think this was a new pretrained model trained fairly recently (optimus prime)

#

they would've wanted to compete for that segment i think if it was ready

#

we have zero checkpoints of the new cut off version of it until now, despite several chatgpt 4o releases with the new cpt'd model

#

that david guy made me think of it (how it could be pretrained from scratch) and it makes sense somewhat, optimus prime is pretty good

silk haven Apr 13, 2025, 9:06 PM

#

https://www.youtube.com/watch?v=fNjH5izFeyw

YouTube

Google Cloud Events

Google Cloud TPUs and specialized AI hardware: Jeff Dean on what's ...

"Join an insightful fireside chat with Jeff Dean, a pioneering force behind Google’s AI leadership. As Google's Chief Scientist at DeepMind & Research, Jeff will share his vision on AI and specialized AI hardware like Google Cloud TPUs. What exciting things might we expect to see next? What drives Google’s innovation in specialized AI hardwa...

ocean vortex Apr 13, 2025, 9:07 PM

#

keen beacon they would've wanted to compete for that segment i think if it was ready

I don't think it was in their interest to update mini at all, that price was just too low. My suspicions will be confirmed if the new mini gonna have higher price lol

keen beacon Apr 13, 2025, 9:08 PM

#

ocean vortex I don't think it was in their interest to update mini at all, that price was jus...

its really really good i think

#

so yeah maybe

#

i wouldnt be surprised if this base model could surpass 4o in all metrics given more work. modern pretraining hits different maybe lol

keen beacon Apr 13, 2025, 9:09 PM

#

ocean vortex I don't think it was in their interest to update mini at all, that price was jus...

they wouldve wanted to update 4o mini to make their o series mini models better

keen fulcrum Apr 13, 2025, 9:09 PM

#

Poll: Which one do you prefer?

Winner: Nightwhisper (option 2), 16 of 20 votes

ocean vortex Apr 13, 2025, 9:10 PM

#

keen beacon they wouldve wanted to update 4o mini to make their o series mini models better

we do not really know for sure if they haven't internally in some form. But yeah fair point

zinc ore Apr 13, 2025, 9:11 PM

#

keen beacon nah openai just diffs google in the reasoning front right now

2.5 pro is a reasoning model, and current best on the market

ocean vortex Apr 13, 2025, 9:12 PM

#

zinc ore 2.5 pro is a reasoning model, and current best on the market

the thing is gpt4o used to suck. It doesn't anymore but the model o1 was based on sucked for sure

keen beacon Apr 13, 2025, 9:12 PM

#

zinc ore 2.5 pro is a reasoning model, and current best on the market

despite it being a reasoning model, it cant do extremely rote reasoning tasks as well as others can. its the best model for sure overall rn, but in my experience the model is just better because of a better base model than others

#

i gave qwq a purely rote logical puzzle it solved in 13k tokens, gem 2.5 pro took 10k more tokens (23k tokens)

#

o3 mini absolutely dominates this area

ocean vortex Apr 13, 2025, 9:13 PM

#

keen beacon i gave qwq a purely rote logical puzzle it solved in 13k tokens, gem 2.5 pro too...

number of tokens I'm not sure that's a good indicator. o3-mini-high would probably generate more than both catgrin

keen beacon Apr 13, 2025, 9:14 PM

#

ocean vortex number of tokens I'm not sure that's a good indicator. o3-mini-high would probab...

but it keeps it coherent and solves the problem. it can do way harder puzzles without getting stuck

#

2.5 pro completely falls apart and spams 44k tokens, in another instance

ocean vortex Apr 13, 2025, 9:14 PM

#

keen beacon but it keeps it coherent and solves the problem. it can do way harder puzzles wi...

but you can't know. You don't see the raw reasoning

#

with o3

keen beacon Apr 13, 2025, 9:15 PM

#

ocean vortex but you can't know. You don't see the raw reasoning

i mean i wait 10 minutes and it returns the solution

#

2.5 pro gets stuck in reasoning

ocean vortex Apr 13, 2025, 9:16 PM

#

oh. Yeah if that's the case I suppose. Gemini is a very different model though, gonna excel in different ways even if we just take their base model against other lab's base model of comparable size

#

so maybe it gets stuck because it lacks some fundamental base model understanding of this specific problem - that could be the case as well

keen beacon Apr 13, 2025, 9:18 PM

#

ocean vortex so maybe it gets stuck because it lacks some fundamental base model understandin...

ya its not the case for the problems i mention above

ocean vortex Apr 13, 2025, 9:18 PM

#

just because it's a good base model does not mean it's better than everything else in every single thing 👀

keen beacon Apr 13, 2025, 9:18 PM

#

its just pure reasoning with no world knowledge required

ocean vortex Apr 13, 2025, 9:19 PM

#

dunno maybe. Hard to say without knowing the task you are talking about tbh

keen beacon Apr 13, 2025, 9:20 PM

#

ocean vortex dunno maybe. Hard to say without knowing the task you are talking about tbh

its just logic grid puzzles

#

huge ones

ocean vortex Apr 13, 2025, 9:26 PM

#

keen beacon its just logic grid puzzles

how does 3.7 sonnet-thinking do in comparison?

#

it does

#

every model does except openai lol

keen beacon Apr 13, 2025, 9:27 PM

#

the reasoning isnt that good imho

#

the base model is different, if u dont have as much knowledge as 2.5 pro u wont be able to produce as good of a result

ocean vortex Apr 13, 2025, 9:28 PM

#

iirc it does show it in full on their website

keen beacon Apr 13, 2025, 9:28 PM

#

grok 3 reasoning, iirc, used qwq 32b preview traces during training 🤣

#

this was a lie lol

#

ahahahahaha

#

did they really say that lmao

ocean vortex Apr 13, 2025, 9:29 PM

#

keen beacon this was a lie lol

Elon's fail. He's a full Republican now

#

lmao

#

he probably just didn't know or misunderstood

keen beacon Apr 13, 2025, 9:30 PM

#

it is lmao if he intentionally lied about it

ocean vortex Apr 13, 2025, 9:30 PM

#

as he's not ML engineer lol

keen beacon Apr 13, 2025, 9:30 PM

#

he probably has no idea whats going on though

#

LMAO

#

prob heard a few buzz words from guys at xai trying to placate him

deep adder Apr 13, 2025, 10:39 PM

#

@keen beacon grok 3 really uses QwQ reasoning?

keen beacon Apr 13, 2025, 10:43 PM

#

deep adder @keen beacon grok 3 really uses QwQ reasoning?

take my stuff with a grain of salt but yes they trained on it lol

#

not even the final qwq 32b, qwq 32b preview

keen beacon Apr 13, 2025, 10:44 PM

#

keen beacon take my stuff with a grain of salt but yes they trained on it lol

for the thought process, but for the response they probably did another phase asking another model to generate it

#

yeah grok 3 was good

#

uhhh

#

i havent tried grok 3 mini though, but if its the same as grok 3 reasoning they used qwq 32b preview for cold start at least

#

oh is it?

#

yes its qwq 🤣

#

qwq 32b preview

torn mantle Apr 13, 2025, 10:46 PM

#

keen beacon for the thought process, but for the response they probably did another phase as...

wdym?

#

the thinking process of grok 3 & o-series & deepseek are all the same

#

i mean not the same

#

but similar

#

?

keen beacon Apr 13, 2025, 10:47 PM

#

torn mantle wdym?

they generated the trace from qwq 32b preview, then asked another model to generate a response based on the thought process. that's a pair in their training data (question + response (qwq 32b preview thoughts and response))
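A minimal sketch of the pairing described above, assuming the claim is right (nothing here is confirmed by xAI). `trace_model` and `response_model` are hypothetical stand-ins for the two models, and the `<think>` delimiters are an illustrative choice, not a known training format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ColdStartExample:
    question: str
    thoughts: str  # reasoning trace produced by the first model
    response: str  # final answer written by a second model


def build_example(
    question: str,
    trace_model: Callable[[str], str],
    response_model: Callable[[str, str], str],
) -> ColdStartExample:
    # Step 1: one model produces the thought process for the question.
    thoughts = trace_model(question)
    # Step 2: another model writes the response, conditioned on those thoughts.
    response = response_model(question, thoughts)
    return ColdStartExample(question, thoughts, response)


def to_training_text(ex: ColdStartExample) -> str:
    # Assumed layout: question, then the trace in <think> tags, then the response.
    return f"{ex.question}\n<think>\n{ex.thoughts}\n</think>\n{ex.response}"
```

Each resulting example is one supervised pair: the question as input, the trace plus response as the target.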

torn mantle Apr 13, 2025, 10:48 PM

#

@deep adder enlighten me

#

so?

#

you can see that the thinking process used by gemini is totally different

keen beacon Apr 13, 2025, 10:48 PM

#

xai didnt even make their own cold start!

#

they used qwq 32b preview

torn mantle Apr 13, 2025, 10:49 PM

#

it depends on how much patterns it picked up during training and what type of RL training data they fed it

#

but deepseek & grok 3 they are using the same keywords

keen beacon Apr 13, 2025, 10:49 PM

#

torn mantle it depends on how much patterns it picked up during training and what type of RL...

mostly the style of the reasoning is highly dependent on cold start

torn mantle Apr 13, 2025, 10:49 PM

#

First,
Wait,
Alternatively,

keen beacon Apr 13, 2025, 10:50 PM

#

torn mantle but deepseek & grok 3 they are using the same keywords

similar, yes. but its qwq 32b preview not r1

#

it wasnt out when they trained the model

torn mantle Apr 13, 2025, 10:50 PM

#

but r1 was released way before

#

aah

#

yea i remember

#

it may be true

#

qwq 32b was so dumb and went into many unnecessary paths

#

same thing with grok 3

keen beacon Apr 13, 2025, 10:51 PM

#

qwq 32b preview was better than r1 preview though

#

consensus back then i think

#

this is why they trained their model on it 🤣

#

they added rl training on top + cold start used qwq 32b preview thoughts/another model generated response

#

and their stronger base model

#

still they used qwq 32b preview anyway

#

yea

#

probably

#

they trained a lot more than their competitors i think

#

meta shouldve done what they did probably

#

xai's prime advantage is just compute i think

north vale Apr 13, 2025, 10:59 PM

#

keen beacon still they used qwq 32b preview anyway

why'd u say that

#

like what makes u think they did that

keen beacon Apr 13, 2025, 11:01 PM

#

north vale why'd u say that

beyond other things, you can tell from the reasoning style/output and the cold start they use

north vale Apr 13, 2025, 11:01 PM

#

keen beacon beyond other things, you can tell from the reasoning style/output and the cold s...

what cold start are you talking about

keen beacon Apr 13, 2025, 11:02 PM

#

they dont apply rl immediately to the base model like r1 zero

north vale Apr 13, 2025, 11:02 PM

#

how do you know

#

but i mean they'd prolly apply rl to an instruct model

#

i don't see why this matters to the question of whether they trained on qwq traces

keen beacon Apr 13, 2025, 11:03 PM

#

you asked about cold start

north vale Apr 13, 2025, 11:04 PM

#

"you can tell from the reasoning style and from the cold start they use"
how can you tell from the cold start they use, if we don't know what cold start they used because we weren't told the training details?

keen beacon Apr 13, 2025, 11:05 PM

#

north vale "you can tell from the reasoning style and from the cold start they use" how can...

because its in a distinct style exactly like qwq 32b. im not gonna do similarity/etc to it which could prove it, i really dont care much about grok lol. its obvious when you work with qwq 32b preview traces a lot. they left the exact Final Answer thing in their traces too. cold start primarily determines the style of reasoning, you are not going to get qwq-isms/qwq format from pure rl randomly
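A rough sketch of the kind of similarity check mentioned here, assuming style can be approximated by how often a trace uses a few marker phrases. The marker list ("First", "Wait", "Alternatively", "Final Answer") is taken from this thread; the function names are hypothetical, and a high score would only suggest a shared style, not prove shared training data.

```python
import math
import re

# Marker phrases mentioned in the discussion; the list is an assumption.
MARKERS = ("first", "wait", "alternatively", "final answer")


def marker_profile(trace: str) -> dict:
    """Return each marker's frequency per word in the trace."""
    text = trace.lower()
    n_words = max(len(text.split()), 1)
    return {
        m: len(re.findall(r"\b" + re.escape(m) + r"\b", text)) / n_words
        for m in MARKERS
    }


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two marker profiles (0.0 if either is empty)."""
    dot = sum(a[m] * b[m] for m in MARKERS)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Running `cosine(marker_profile(trace_a), marker_profile(trace_b))` on two traces for the same prompt gives a crude style score; a real comparison would need many prompts and a baseline from unrelated models.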

#

no lol

#

yuh

#

ill probably do a comparison here with qwq 32b preview and grok and i bet people here will get confused which is which/itll be undeniable 🤣

#

ty man i try really hard to pay attention 😄

torn mantle Apr 13, 2025, 11:12 PM

#

dragontail

keen beacon Apr 13, 2025, 11:12 PM

#

pretty similar

torn mantle Apr 13, 2025, 11:12 PM

#

dragontail

#

dragontail

#

DT

#

DT

#

these are just simple prompts

#

but you guys can compare the results with NW

hardy pecan Apr 13, 2025, 11:17 PM

#

torn mantle DT

haha i love that people still use my prompt from a while ago

torn mantle Apr 13, 2025, 11:17 PM

#

hardy pecan haha i love that people still use my prompt from a while ago

yea its good

#

who do you think won that

#

i think DT attempt is more modern style UI

#

i liked how NW used like an old font + icons

#

it used also msn blue color

#

really a lot of details to unpack just from that alone

hardy pecan Apr 13, 2025, 11:18 PM

#

Hmmm its close, different styles

#

yeah

#

I'd say NW

keen beacon Apr 13, 2025, 11:26 PM

#

hmm which one is qwq 32b preview? can yall tell?

📎 message.txt 📎 message.txt

torn mantle Apr 13, 2025, 11:26 PM

#

2nd?

#

1st is def grok 3

#

that sh1t keeps repeating 'Wait' like 10000 times

keen beacon Apr 13, 2025, 11:27 PM

#

ur right

keen beacon Apr 13, 2025, 11:27 PM

#

torn mantle that sh1t keeps repeating 'Wait' like 10000 times

ya its even more hesitant than qwq

#

theyre super similar lol youre not getting qwqisms from rl

#

i copied the thought trace of grok 3 excluding the response and look at how similar they are

keen beacon Apr 13, 2025, 11:30 PM

#

torn mantle that sh1t keeps repeating 'Wait' like 10000 times

dont u think they look extremely similar though?

#

the hesitantness is from rl, qwq was only used for cold start

torn mantle Apr 13, 2025, 11:31 PM

#

keen beacon dont u think they look extremely similar though?

they do

#

its impossible to read grok 3 cot

#

it goes into so many unnecessary steps

#

whereas deepseek you actually have fun reading it

#

you learn a thing or two

keen beacon Apr 13, 2025, 11:32 PM

#

ya agree w me that qwq was used as cold start?

torn mantle Apr 13, 2025, 11:32 PM

#

we cant say they used that for sure

#

but they are similar

#

yea

keen beacon Apr 13, 2025, 11:32 PM

#

they start out the same they even end with the same final answer lol. they use the same language lol

torn mantle Apr 13, 2025, 11:32 PM

#

keen beacon hmm which one is qwq 32b preview? can yall tell?

reading 1st CoT

#

is just making me mad tbh

#

so inefficient

keen beacon Apr 13, 2025, 11:33 PM

#

its from the rl they apply

torn mantle Apr 13, 2025, 11:33 PM

#

too many parallel reasoning that shouldnt be there

#

they are trying to apply parallel reasoning

keen beacon Apr 13, 2025, 11:33 PM

#

it was probably done in a scale much more than qwq non preview lol

torn mantle Apr 13, 2025, 11:33 PM

#

not just one branch of reasoning

#

but its not working so far

#

it may work but its not efficient

keen beacon Apr 13, 2025, 11:34 PM

#

torn mantle they are trying to apply parallel reasoning

i dont think theyre trying to apply anything tbh. just add rl on qwq preview traces

#

and a symptom of their training causes that

north vale Apr 13, 2025, 11:36 PM

#

i am convinced that they used qwq reasoning traces

#

ty for sharing

torn mantle Apr 13, 2025, 11:36 PM

#

im not fan of what they are doing tbh

#

the model is unusable to me

#

doesnt follow prompts well

#

loses context quite often

#

their deep research is probably one of the worst implementations

keen beacon Apr 13, 2025, 11:37 PM

#

given how they were too lazy or too incompetent to make their own cold start, its another bad sign for xai

torn mantle Apr 13, 2025, 11:37 PM

#

hallucinates a lot

#

its not a fun model to talk to

#

thats the main benchmark for me

#

deepseek & sonnet is so fun to interact with

#

gemini is also climbing that spot

keen beacon Apr 13, 2025, 11:38 PM

#

i find grok unusable when it starts peddling x into random stuff amongst other things

torn mantle Apr 13, 2025, 11:38 PM

#

i actually spend more time reading deepseek cot

#

i learn a lot of new things from that

#

instead of just reading the output

torn mantle Apr 13, 2025, 11:39 PM

#

keen beacon i find grok unusable when it starts peddling x into random stuff amongst other t...

yea

#

they should make that optional

#

but that thing wasnt bad tbh

#

they improved quality x sources

#

it was so bad on grok 2

#

when they were still using grok 2 they would just reference bots

keen beacon Apr 13, 2025, 11:40 PM

#

i cant believe people used grok 2

#

why????

torn mantle Apr 13, 2025, 11:40 PM

#

keen beacon i cant believe people used grok 2

xdddddddddd

#

because they are braindead

#

and they already had like x premium

#

so they dont pay for chatgpt

#

and probably elon hardcore fans xd

ocean vortex Apr 14, 2025, 12:05 AM

#

Grok3 non-reasoning model is their best contribution to AI as far as I see it. Never really cared much for the reasoning one as that one is way less impressive for what it is

keen beacon Apr 14, 2025, 12:08 AM

#

they used their massive amounts of compute to apply a sh1tload of rl bruteforce into grok 3 mini lol

ocean vortex Apr 14, 2025, 12:09 AM

#

What I mean is if you compare all the non-reasoning models… grok3 may just be the best of them all

alpine coral Apr 14, 2025, 12:59 AM

#

dunning kruger effect / wild hubris / zero self-awareness / surrounded by sycophants .. some combination at least (perhaps with some K thrown in for good measure) helps explain it imo

like with him lying about being a top-ten ranked gamer, and then going on a livestream somehow thinking no one would notice lol https://www.reddit.com/r/videos/comments/1j75rh9/elon_musk_got_exposed_as_a_fraud_gamer_all_updates/

drifting thorn Apr 14, 2025, 12:59 AM

#

I think Grok 4 should be GPT 4.5 like, increasing its parameters again

keen beacon Apr 14, 2025, 1:02 AM

#

it won't be that large

#

gpt-4.5 is 4-5T params

drifting thorn Apr 14, 2025, 1:02 AM

#

Didn’t expect GPT 4.5 would be acting this badly

#

What are the possible reasons

keen beacon Apr 14, 2025, 1:03 AM

#

hit a wall with scaling for "legacy" LLMs

alpine coral Apr 14, 2025, 1:03 AM

#

lol literally what i was about to say

#

though that was far more efficiently put ha

#

llama 3.3 70b vs 3.1 405b kinda revealed that wall to my mind

#

like nearly 6x as many parameters and they eked out some marginal performance gains

keen beacon Apr 14, 2025, 1:06 AM

#

if it was 405b moe prob make more sense, but seeing how maverick turned out lol

#

slightly off topic but there's a relatively high chance we get R2 w/c 14.04

keen beacon Apr 14, 2025, 1:06 AM

#

keen beacon if it was 405b moe prob make more sense, but seeing how maverick turned out lol

other frontier models are around that size, i think

drifting thorn Apr 14, 2025, 1:06 AM

#

Nah, we should see how the Behemoth turned out

keen beacon Apr 14, 2025, 1:06 AM

#

it is due by end of april and it makes the most sense for them to release it to react to o3

drifting thorn Apr 14, 2025, 1:07 AM

#

And is gpt 4.5 underfitted?

alpine coral Apr 14, 2025, 1:07 AM

#

keen beacon other frontier models are around that size, i think

yeah but the performance gap b/w 70b and 405b - there was no scaling there.. just a huge amount of cash lol

#

whereas haiku vs sonnet vs opus - there prob was scaling there

keen beacon Apr 14, 2025, 1:07 AM

#

drifting thorn And is gpt 4.5 underfitted?

probably but not by huge amounts i don't think

#

gpt-4.5 kinda seems like a project they poured silly amounts of money into, realised was a waste of time, effort and cash after seeing the relatively limited performance gains, and put on hold for months and then they remembered they were sitting on it, sloppily finished it off and put it out because they were somewhat obligated

drifting thorn Apr 14, 2025, 1:10 AM

#

So what’s the current scaling law for AI?

#

Amount of reasoning token?

keen beacon Apr 14, 2025, 1:10 AM

#

they were gonna call that gpt 5 i think, but unpopular opinion i think they shouldve called o1 preview gpt 5. imho it was one of the most significant releases

alpine coral Apr 14, 2025, 1:10 AM

#

keen beacon gpt-4.5 kinda seems like a project they poured silly amounts of money into, real...

i wonder if there are parallels to be found at anthropic with respect to ~~Opus 4~~ Opus 3.5 (or even the whole Claude 4 generation, which 3.5 suggested was on the horizon)...

keen beacon Apr 14, 2025, 1:10 AM

#

3.5*

#

they said opus 3.5 by end of 2024 then scrubbed any mention of it from their site in november and we haven't heard anything since

#

except dario saying "we still plan for there to be a 3.5 opus" on a podcast months ago

#

large models are dead doubt we see opus 4 tbhh

#

yeah i don't think opus 4 will happen

#

opus 3.5 will probably be their last big boy

drifting thorn Apr 14, 2025, 1:11 AM

#

Say something bout llama 4 behemoth

alpine coral Apr 14, 2025, 1:12 AM

#

keen beacon large models are dead doubt we see opus 4 tbhh

agreed

keen beacon Apr 14, 2025, 1:12 AM

#

drifting thorn Say something bout llama 4 behemoth

🤣 🤣

keen beacon Apr 14, 2025, 1:12 AM

#

keen beacon opus 3.5 will probably be their last big boy

ETA is apparently Q4 '25 which is diabolical tbh

#

anthropic are gonna get left in the dust

#

they already are tbh

#

i would consider them still doing okay up until 2.5 pro

#

i think 2.5 pro put every other lab on red alert

#

them not doing any native image generation work/other multimodal work is going to bite them in the ass later

#

at least publicly it seems that way to me

#

unfortunately anthropic aren't willing to take enough risks to maintain their frontier position

alpine coral Apr 14, 2025, 1:14 AM

#

keen beacon i would consider them still doing okay up until 2.5 pro

i think the emergence of test time compute is what threw them

keen beacon Apr 14, 2025, 1:14 AM

#

with deepmind accelerating and openai downsizing safety teams they're stuck at the same pace

alpine coral Apr 14, 2025, 1:14 AM

#

sonnet-3.7-thinking performs so poorly considering how strong the vanilla version is

keen beacon Apr 14, 2025, 1:15 AM

#

tbh sonnet 3.5 and cpt'd sonnet 3.5 seemingly (sonnet 3.7) is anomalous

#

i think deepmind have done the best job at squeezing performance out of the base model with their reasoning model

#

they could not replicate the magic with haiku

keen beacon Apr 14, 2025, 1:15 AM

#

keen beacon i think deepmind have done the best job at squeezing performance out of the base...

especially if 2.5 pro base is very similar to 2.0 pro

#

whilst other companies can shrink their models well

alpine coral Apr 14, 2025, 1:15 AM

#

keen beacon tbh sonnet 3.5 and cpt'd sonnet 3.5 seemingly (sonnet 3.7) is anomalous

yeah i don't think they were on the roadmap

keen beacon Apr 14, 2025, 1:15 AM

#

2.0 pro as a base was actually pretty mediocre

alpine coral Apr 14, 2025, 1:15 AM

#

more like emergency releases ha

keen beacon Apr 14, 2025, 1:16 AM

#

yeah but sonnet 3.5 is insanely good and anomalous from them tbh. personally i mark it as a start of a class

#

i think sonnet 3.7 was supposed to be opus 3.5 but the gains were pretty poor and they wanted to keep their small edge

#

nah its the same size as sonnet 3.5 iirc i think its just a cpt

alpine coral Apr 14, 2025, 1:17 AM

#

keen beacon yeah but sonnet 3.5 is insanely good and anomalous from them tbh. personally i m...

yeah but aside from cpt, it doesn't seem like they achieved the step-up with any kinda special sauce.. perhaps a new class in terms of performance, but not like model architecture or whatever (not that i know ha)

keen beacon Apr 14, 2025, 1:17 AM

#

they made what was working better

alpine coral Apr 14, 2025, 1:17 AM

#

yeah well put

keen beacon Apr 14, 2025, 1:18 AM

#

anthropic probably have the best quality data (although not the most raw data)

keen beacon Apr 14, 2025, 1:18 AM

#

alpine coral yeah but aside from cpt, it doesn't seem like they achieved the step-up with any...

i meant for sonnet 3.5, sonnet 3.7 is not a major step as compared to them pretraining from scratch sonnet 3.5

alpine coral Apr 14, 2025, 1:18 AM

#

ahh right yup gotcha

keen beacon Apr 14, 2025, 1:19 AM

#

sonnet 3.5 was significant in my experience, the level of "base model performance" marked a start of a class for me even if it didnt display it in the benchmarks. only recent 4o that was cpt'd/gem 2 pro/1206 reached it for me. considering it was trained much earlier than the others

drifting thorn Apr 14, 2025, 1:19 AM

#

And I’d say R1 has the worst quality data

keen beacon Apr 14, 2025, 1:19 AM

#

3.5 was the last "bar raiser" from anthropic

#

for quite a bit nobody could one up it

drifting thorn Apr 14, 2025, 1:20 AM

#

Is R1 1776 a better model than initial R1?

keen beacon Apr 14, 2025, 1:20 AM

#

the only difference is censorship

#

model performance differences are negligible at best

drifting thorn Apr 14, 2025, 1:21 AM

#

I mean factuality

keen beacon Apr 14, 2025, 1:21 AM

#

it is probably marginally better but again

#

negligible

alpine coral Apr 14, 2025, 1:22 AM

#

overall if anything there was prob a performance degradation of some kind (surely it's benchmarked / compared). only more performant/factual on a very specific subset of questions (those subject to censorship in China)

wintry tinsel Apr 14, 2025, 1:22 AM

#

We’re in a bit of a slow point for LLM updates outside of more robotic stiff OpenAI models

keen beacon Apr 14, 2025, 1:22 AM

#

this week will be good 😉

drifting thorn Apr 14, 2025, 1:22 AM

#

keen beacon Apr 14, 2025, 1:22 AM

#

or so some birdies tell me..

keen beacon Apr 14, 2025, 1:22 AM

#

wintry tinsel We’re in a bit of a slow point for LLM updates outside of more robotic stiff ope...

theyre not robotic anymore. theyre trying extremely hard lol. too hard

#

expect more than just oai to drop things

#

how is fake news destroying llms 🤣

keen beacon Apr 14, 2025, 1:23 AM

#

keen beacon theyre not robotic anymore. theyre trying extremely hard lol. too hard

depends on the model

#

the newest chatgpt 4o version slightly tones down the cringe factor that came with trying too hard not to be robotic

#

as for the o-series models, yeah they're still stuck with that problem mostly

#

the only reasoning model i've seen not be very robotic is R1

#

and to a certain extent o3, but you can see that for yourself soon

#

either way i am very intrigued by r2 and how big of a jump it will be

#

it seems ill be subbing to chatgpt plus soon lol (for the new releases xd)

drifting thorn Apr 14, 2025, 1:25 AM

#

keen beacon the only reasoning model i've seen not be very robotic is R1

Gemini’s thinking models are not robotic either

keen beacon Apr 14, 2025, 1:25 AM

#

they are better than the o-series models but

#

they're still not quite on r1 level

#

deepseek's models are generally just good at that

drifting thorn Apr 14, 2025, 1:28 AM

#

On the Chinese internet, Deepseek’s style gets joked about as “whatever the writing task is, it always brings up entropy/quantum computing/maths theories”

keen beacon Apr 14, 2025, 1:31 AM

#

#

this is the better one for creative writing

drifting thorn Apr 14, 2025, 1:33 AM

#

There’s no 2.5 pro in the chart!!!!

keen beacon Apr 14, 2025, 1:33 AM

#

like.. this is R1. it is both human-like and enjoyable

#

my main problem with R1 for creative writing is

#

it loses track of a plot quite rapidly

drifting thorn Apr 14, 2025, 1:34 AM

#

From demonstrations, 2.5 Pro definitely has higher EQ than these other models

keen beacon Apr 14, 2025, 1:34 AM

#

hopefully r2 improves upon that

#

likely

drifting thorn Apr 14, 2025, 1:35 AM

#

keen beacon like.. this is R1. it is both human-like and enjoyable

Okay, this is in English

#

And I used it in Chinese

keen beacon Apr 14, 2025, 1:35 AM

#

can't speak for chinese performance

#

it's what they did with xmas yeah

#

maybe 4.1 first? then the reasoning models to keep hype going?

drifting thorn Apr 14, 2025, 1:35 AM

#

Its style is really weird and recognisable as Deepseek’s writing

keen beacon Apr 14, 2025, 1:36 AM

#

anthropic just nailed the vibes for emotional intelligence

drifting thorn Apr 14, 2025, 1:36 AM

#

It is stubborn

keen beacon Apr 14, 2025, 1:36 AM

#

4o = too agreeable, gemini = too yappy

drifting thorn Apr 14, 2025, 1:36 AM

#

2.5 pro is stubborn and sometimes quite harsh to users

keen beacon Apr 14, 2025, 1:37 AM

#

iirc 3.7 was a regression for most creative and emotional tasks

drifting thorn Apr 14, 2025, 1:37 AM

#

I once asked it how my singing was, and it just said I’m extremely off-key, tone is bad blah blah blah

thorny drum Apr 14, 2025, 1:38 AM

#

drifting thorn 2.5 pro is stubborn and sometimes quite harsh to users

i fw this

keen beacon Apr 14, 2025, 1:38 AM

#

workhorse models most people use the most i think

zinc ore Apr 14, 2025, 1:38 AM

#

drifting thorn I once asked him what was my singing, and it just said I’m extremely off-key, to...

At least you know it's being honest

keen beacon Apr 14, 2025, 1:39 AM

#

itll be the same price probably

thorny drum Apr 14, 2025, 1:39 AM

#

hate it when models dont stand up against me saying dumb stuff

keen beacon Apr 14, 2025, 1:39 AM

#

i mean why would u use 4o over 4.1 if its the same price

upper wolf Apr 14, 2025, 1:39 AM

#

a little off-topic, but i believe there was a study that showed that people who frequently used generative AI to solve mental health issues turned out to be worse off than people who didn’t (more isolated, etc.) i can’t remember where it’s from

keen beacon Apr 14, 2025, 1:39 AM

#

its a stronger model with an updated cut off compared to api dated versions

keen beacon Apr 14, 2025, 1:39 AM

#

upper wolf a little off-topic, but i believe there was a study that showed that people who ...

yup

drifting thorn Apr 14, 2025, 1:40 AM

#

zinc ore At least you know it's being honest

Since then I’ve picked up my tuner app and trained for a few days

keen beacon Apr 14, 2025, 1:40 AM

#

mostly because 90+% of people who use AI for mental health related stuff use chatgpt, and 4o is way too complacent/unwilling to question and confront

#

so they mark up 4.1 even though its still 4o but updated? maybe

thorny drum Apr 14, 2025, 1:41 AM

#

i think others realize this openAI just has such a massive headstart

ember rapids Apr 14, 2025, 1:41 AM

#

google said there was no moat 2 yrs ago

keen beacon Apr 14, 2025, 1:42 AM

#

the problem is beginning to shift away from "we can't build a better model than openai" and towards "we can't build a better product and market it better than openai"

#

they truly got a huge headstart with chatgpt and how viral it got/is

drifting thorn Apr 14, 2025, 1:44 AM

#

But Gemini is grabbing the market share

keen beacon Apr 14, 2025, 1:44 AM

#

4o native image gen dwarfed 2.5 pro

#

probably

ember rapids Apr 14, 2025, 1:44 AM

#

regular ppl dont know about gemini 2.5

#

they only know chatgpt

plain zinc Apr 14, 2025, 1:45 AM

#

I think all the Google models that are on LMarena are just the same 2.5 flash, but with different levels of thinking.

keen beacon Apr 14, 2025, 1:45 AM

#

ngl 4o native image gen is really good

plain zinc Apr 14, 2025, 1:45 AM

#

This thought keeps me awake.

ember rapids Apr 14, 2025, 1:45 AM

#

memory is gonna be a big thing in the future

plain zinc Apr 14, 2025, 1:46 AM

#

plain zinc I think all the Google models that are on LMarena are just the same 2.5 flash, b...

I just don't see the point in releasing so many models if Google is only planning to release 2.5 Flash.

keen beacon Apr 14, 2025, 1:46 AM

#

plain zinc I think all the Google models that are on LMarena are just the same 2.5 flash, b...

maybe 2.5 flash has thinking budget

drifting thorn Apr 14, 2025, 1:46 AM

#

ember rapids memory is gonna be a big thing in the future

Ofc

#

But I dunno if ChatGPT’s memory function is just RAG

keen beacon Apr 14, 2025, 1:47 AM

#

plain zinc I think all the Google models that are on LMarena are just the same 2.5 flash, b...

if 2.5 flash is beating 2.5 pro in a significant number of cases they're cooking incredibly hard

#

its more likely to be updated 2.5 pro, 2.5 flash, 2.5 flash lite i guess ( i havent actually used the recent google anon models, so no idea about capability)

zinc ore Apr 14, 2025, 1:47 AM

#

Means they've found a very good upward cycle, if it's flash

drifting thorn Apr 14, 2025, 1:49 AM

#

How do you tell them to write a story? Or what story do you want?

plain zinc Apr 14, 2025, 1:49 AM

#

I have a feeling that we will get another Google model in LMarena this week, but it will be even better than nightwhisper.

drifting thorn Apr 14, 2025, 1:50 AM

#

Cuz I’m currently testing AIs in Chinese writing using a standardised question

#

Okay I have an idea

keen beacon Apr 14, 2025, 1:52 AM

#

plain zinc I have a feeling that we will get another Google model in LMarena this week, but...

if they want to one-up the openai release spree then i'd expect them to start testing better models very soon yeah

plain zinc Apr 14, 2025, 1:54 AM

#

How far will Google go with only two model classes: Pro and Flash?

keen beacon Apr 14, 2025, 1:55 AM

#

i think ultra may eventually return but it won't be as their 1T+ param model variant

#

it'll just be as a better reasoning model

plain zinc Apr 14, 2025, 1:55 AM

#

And I don't think we'll see Ultra anymore.

#

Because Google had a bad past with this model.

ember rapids Apr 14, 2025, 1:56 AM

#

didnt logan say theyre doing the same thing oai is doing for gpt5?

keen beacon Apr 14, 2025, 1:56 AM

#

keen beacon i think ultra may eventually return but it won't be as their 1T+ param model var...

according to someone who claimed to be a google employee iirc, ultra was not 1t at all

plain zinc Apr 14, 2025, 1:56 AM

#

ember rapids didnt logan say theyre doing the same thing oai is doing for gpt5?

No, they do completely different things. They will connect all Gemini with Veo

#

but this is not the same as GPT-5.

#

If gpt-5 is just able to reason on a human level, then Gemini will acquire imagination through merging with Veo.

#

He will be able to reason better and also design something in his "head".

keen beacon Apr 14, 2025, 1:58 AM

#

keen beacon according to someone who claimed to be a google employee iirc, ultra was not 1t ...

can i get the source pls

plain zinc Apr 14, 2025, 1:58 AM

#

And then spatial thinking and multimodal, omnimodal capabilities will improve.

keen beacon Apr 14, 2025, 1:58 AM

#

keen beacon can i get the source pls

see the old lmsys discord server, dom was arguing about gemini experimental

#

it was a long long time ago

drifting thorn Apr 14, 2025, 1:59 AM

#

plain zinc He will be able to reason better and also design something in his "head".

They shouldn’t connect to Veo, they should connect to Blender/other 3D animation app

keen beacon Apr 14, 2025, 1:59 AM

#

every single experience i ever had with 1.0 ultra screamed near or above 1T params

drifting thorn Apr 14, 2025, 2:00 AM

#

plain zinc And then spatial thinking and multimodal, omnimodal capabilities will improve.

I surely hope so, as I have a hard question that most models can’t answer

plain zinc Apr 14, 2025, 2:00 AM

#

drifting thorn They shouldn’t connect to Veo, they should connect to Blender/other 3D animation...

You're clearly underestimating Veo.

#

and what will happen when Veo 3 is released?

keen beacon Apr 14, 2025, 2:01 AM

#

keen beacon every single experience i ever had with 1.0 ultra screamed near or above 1T para...

it was somewhat close to 1t but not that close tbh from what i remember

keen beacon Apr 14, 2025, 2:02 AM

#

keen beacon every single experience i ever had with 1.0 ultra screamed near or above 1T para...

the model sucked in my experience :\ maybe for creative writing i guess

#

i briefly had api access

keen beacon Apr 14, 2025, 2:03 AM

#

keen beacon the model sucked in my experience :\ maybe for creative writing i guess

yes it sucked in everything but creative writing

#

it sounds funny but

#

genuinely it was incredible at creative writing and disappointing at everything else

keen beacon Apr 14, 2025, 2:04 AM

#

keen beacon genuinely it was incredible at creative writing and disappointing at everything ...

did u try earlier claude models?

#

they were exceptional in my experience compared to anything else out at the time

#

oh yeah

#

claude instant was banging

#

and claude 1