#general


torn mantle
#

image models use different methods to train ( diffusion ... )

drifting thorn
#

Two new failures to my question: Model A: claude-3-7-sonnet-20250219-thinking-32k
Model B: qwq-32b

#

.

#

Now dragontail failed again

fleet lintel
#

which is the best image model right now? (not just for ghibli images)

drifting thorn
#

I guess randomness in LLMs is the key to the issue

torn mantle
#

for realistic images midjourney

#

for images with text gpt4o

#

anime/manga/manhwa etc... gpt4o/imagen 3.1

fleet lintel
#

I am waiting for a good non-pro model, either flash or some mini version, which is really good with coding. something comparable to 3.5 Claude but much cheaper

north vale
#

This seems right to me, unlike 2.5 it solves riddles by using math notation when it’s not needed lol

drifting thorn
fleet lintel
drifting thorn
#

The piston can be a 6x6 round plate(part 11213) or brick(part 18897)

#

Most “dumb” models will tell me to use a tyre as cylinder wall

keen beacon
#

do you guys wanna know?

#

the hardest prompt for ai

#

for some reason no ai can figure it out

#

"make a parallex that follows your cursor with evenly disturbed shapes across the web page"

drifting thorn
#

2 more LLMs failed: shadebrook and gemini-2.0-flash-thinking

torn mantle
#

gemini 2.5 pro + sonnet 3.7 thinking

drifting thorn
#

omg this seems like an actual fun game!

keen fulcrum
#

Is there a mcp server for Amazon Rufus?

unborn ocean
keen beacon
#

Below is the simplest combination of widely‑available LEGO elements that produces a truly round cylinder with a nominal bore of six studs and matching parts for the piston and cylinder head. Everything is “legal” (no cutting or forcing) and the clearances have been checked so the piston can slide freely but without excessive play.

  1. Cylinder (liner)
    • Part name: Container / Glass, Cylinder 6 × 6 × 5
    • Design‑ID / Element‑ID: 87610 (opaque colours) or 87621 (transparent)
    • Dimensions: 6‑stud outside diameter, 5 brick (≈6 stud) height
    • Notes:
    – Two anti‑stud recesses on opposite sides accept half‑pins so you can stack several for a longer barrel or attach crankcase structure.
    – Four of them can be clipped in line to make a V‑engine block; offset them by half a stud if you want realistic bore spacing.

  2. Piston
    Option A – flat‑top piston (low friction, easiest)
    • Part name: Dish Inverted 6 × 6 (a.k.a. 6 × 6 radar dish)
    • Design‑ID: 44375
    • Why it works: Only its outer rim touches the cylinder wall, so it glides very smoothly.
    • Connecting rod: run an axle 4 L or a bar 6 L through the central anti‑stud cavity; lock it with a 2 × 2 round plate underneath.

    Option B – squared‑off piston crown (if you need studs on top)
    • Part name: Plate, Round 6 × 6 with Hole in Centre
    • Design‑ID: 11213
    • Flip two of them studs‑to‑studs to hide the bumps, or cap the top one with a 6 × 6 round tile (Design‑ID 10202).
    • Clearance inside the 6 × 6 container is ≈0.6 mm, so the plate slides but does not rattle.

  3. Cylinder head
    • Simplest: another 11213 round 6 × 6 plate – fixes straight onto the rim of the container with four half‑pins.
    • If you want ports or spark‑plug detail, swap it for 11212 (Plate, Round 6 × 6 with 2 × 2 Cut‑out) and use the cut‑out space for tubing or lights.

  4. Fastening / auxiliaries
    • 63868 – Technic Pin ½ – snaps into the small holes on the container rim.
    • 87079 – Brick, Modified 1 × 2 with 2 × 2 studs – plugs into the recesses on the side of the container so you can attach a crankcase or exhaust manifolds at 90° to the cylinder.
    • Axle 6 L plus crank parts 6538c, 2851 etc. for the connecting‑rod and crankshaft.

Physical clearances (measured)
Inside Ø of container: 47.5 mm ≈ 5.94 studs
Outside Ø of 6 × 6 dish / plate: 46.9 mm ≈ 5.87 studs
Radial play: ≈0.3 mm – just enough for smooth motion with ordinary LEGO tolerances.

What if you can’t get the 6 × 6 container?
• Build the liner from four 30562 “Panel 6 × 6 × 9 Quarter Cylinder”. The wall is thicker, so the internal bore drops to ≈5 studs; use a 4 × 4 dish or round plate for the piston instead.

With these three key parts—87610/87621 as the cylinder, 44375 or 11213 as the piston, and another 11213 (or 11212) for the head—you have everything you need to construct a circular LEGO engine cylinder whose bore is effectively 6 × 6 studs.

drifting thorn
#

The hardest part is the cylinder wall, and it got that so wrong

#

This is 87621

#

This is 87610

torn mantle
drifting thorn
#

That's what I'm saying, that's what a dumb AI would give

torn mantle
#
  • three.js
#

for the map and everything else

#

i have other ideas but the complexity will increase as i add more stuff

keen fulcrum
barren prairie
#

Give us the tutorial 🥲

keen fulcrum
keen beacon
#

can anyone tell what the best anon google model on the arena is rn

#

because there are like 6 of them

#

i can't figure out if riverhollow is good or not

#

only these ones are on the webdev arena

balmy mist
#

brooooo

#

no way

#

how many tokens is in your project now and are you using an ide?

hardy pecan
#

I'd go with Dragontail as of now as the best

#

shadebrook wasn't as impressive for me

#

granted, I haven't got riverhollow yet

balmy mist
#

yeah but I was just grabbing my project (only one file) and putting it back into gemini 2.5 pro as input, but once you get to around 28k+ tokens it outputs the code with heavy errors

#

but you inspired me to push it even further with roo code, i was focused on my prompt forger app, i added a new ui to it and have been fixing errors lol

hardy pecan
#

who was best though
nebula, stargazer, nightcrawler or dragontail? 😮

#

Tried riverhollow, its aite, not great
I think dragontail is the best with the ones available right now

balmy mist
#

the question is nightwhisper vs dragontail?

#

ultimate game assistant

keen beacon
#

the ones currently on the arena

keen beacon
hardy pecan
keen beacon
#

this happens inexcusably often

golden ocean
#

just refresh on another tab bro and then return to original tab

keen beacon
keen beacon
#

so that doesn't do anything

golden ocean
alpine coral
#

there are some weird glitches though.. like lately i've occasionally been able to continue conversations that have errored out, regenerate responses, cast votes

alpine coral
# keen beacon

will cast a vote if i get shadebrook.. have gotten the other two - based on that very limited testing, dragontail seems up there with 2.5 pro in terms of performance, whereas riverhollow is way behind (would be closer to gemma-3 tbh)

alpine coral
keen fulcrum
alpine coral
#

(bit unfair on grok there to vote tie.. it actually did really good alright ha)

teal mantle
#

can any LLM hear music now

#

tried using qwen chat but doesn't have the model of mine

keen beacon
#

iirc gemini supports audio input

#

im not sure what happens if u pass on music tho 🤣

#

im sure it can transcribe songs at least

teal mantle
calm sequoia
#

Everybody is off the rails with releases since the Gemini 2.5 Pro came out. Except Grok 🤔 Are they cooking something or simply lacking?

unborn ocean
#

btw for anyone building these coding project using gemini right now:
github has just added 2.5 pro to copilot

  • you can use it as an LLM in roocode (through the vsc api) at insane speeds (and as far as I have seen no rate limits per minute) for free (when you have a subscription, which again is free as a student)
    (nvm, just hit the rate limit, i believe it is the same they had for sonnet 3.5)
teal mantle
#

this is quite good

cloud meadow
#

It gets some lyrics wrong too when transcribing

#

It's the best I've seen though so idk

unborn ocean
#

is that a bug for gemini 2.5 pro, otherwise they have to be pulling that compute from a parallel universe

#
  • on openrouter
north vale
#

For my question dragontail is 5/5 like 2.5 pro, riverhollow got it half right once out of 3, and shadebrook is 1/3

#

so shadebrook and riverhollow def seem worse in logic puzzles

#

But the only non google model that solved the puzzle was gpt4.5

vivid oyster
#

wtv is riverhollow

#

Wtf

#

Is it also google

north vale
#

probably

ocean vortex
ancient reef
drifting thorn
#

Maybe it’s distilled open sourced model from Google?

#

I mean the riverhollow

brittle tiger
#

One of my test questions seems like gpt4.5 gets wrong every time but 4.5 never gets it wrong on actual chatgpt website

ocean vortex
# brittle tiger One of my test questions seems like gpt4.5 gets wrong every time but 4.5 never g...

yeah it performs better with a system prompt.

```
Knowledge cutoff: 2023-10
Current date: 2025-04-12

Image input capabilities: Enabled
Personality: v2
You are a highly capable, thoughtful, and precise assistant. Your goal is to deeply understand the user's intent, ask clarifying questions when needed, think step-by-step through complex problems, provide clear and accurate answers, and proactively anticipate helpful follow-up information. Always prioritize being truthful, nuanced, insightful, and efficient, tailoring your responses specifically to the user's needs and preferences.
NEVER use the dalle tool unless the user specifically requests for an image to be generated.
```
#

that's what they are using

#

well except they even have more stuff below it on how to use function calling/tools but you do not need that part lol

unborn ocean
leaden palm
#

shadebrook is on the leaderboards?!

#

vision ones at least

brittle tiger
leaden palm
balmy mist
balmy mist
balmy mist
#

there is a lot of ways to use this tho

#

i was gonna cook with it as well

#

yo @torn mantle you can lowkey use the ai assistant to make it as a pokemon in the game and make it an open world type game

#

made with optimus

ember rapids
#

Looks like we’re getting o3/o4 mini this week

drifting thorn
#

I hope so

patent bane
keen beacon
balmy mist
#

i dont believe it until its launched

ember rapids
keen beacon
#

it's coming next week

leaden palm
#

why wouldnt they

#

i can't tell if you mixed up plus and pro, if you're misinformed, or if openai actually did that

#

(like i'm literally getting o1 high for free, so it makes no sense that the $200/mo plan wouldn't have it)

keen beacon
#

lmao i asked o3 to generate a realistic transcript of a 2024 presidential debate between joe and trump

"TRUMP: Everything he just said is wrong. We had no inflation when I left—zero, maybe negative." is maybe my favourite line

ocean plume
#

does anyone have a leaderboard for deep search?

#

because i don't know which deep search / thinking model gives the best answers

leaden palm
#

well github for students -> copilot pro -> access to o1 -> o1 with settings in github models

#

not great, about 8/day

#

maybe they give you o1 pro but not o1 high

ocean plume
#

anyone ?

ocean plume
leaden palm
# ocean plume can help ?

deep research is computationally expensive to create and time consuming to read/rank, so the leaderboard format doesn't make much sense

#

that said, generally gemini's and openai's are the best

ocean plume
#

ok thank you so much i'll try

teal mantle
#

that's my experiment

keen beacon
#

uh... it didn't even respond with a report

#

that's a bit odd

#

probably counted towards my quota too 👎

ocean plume
#

maybe about country problem they have rules, try to trick them like you are.... and search .... school example test

vivid oyster
#

n

balmy mist
#

dragontail might be better than nightwhisper

vivid oyster
balmy mist
#

hmm nvm

#

i think its best after nightwhisper tho

torn mantle
#

some animations were already made

misty vault
balmy mist
balmy mist
#

i highkey wanna play it

#

or maybe screenshare

#

cause that sh!t is so beautiful

#

true vibe coding lol

#

like zen mode lmaoo

cloud meadow
torn mantle
#

you can do wonder with threejs tbh

#

some crazy animations

#

but needs a lot of effort

#

also you will get a copyright claim straight away

#

and hit with a lawsuit if the game really got popular

balmy mist
#

lmaoo like palworld lmaoo

balmy mist
#

this is nuts man

#

you do this for living?

teal mantle
balmy mist
#

the only thing left is to add their stats

teal mantle
#

Like despite ability to transcribe lyrics and the first line gives search results

#

The answer is incorrect

cloud meadow
torn mantle
#

xd

#

its not mine

#

i found the 3d models

#

and made the map and implementation

#

thats it

#

nothing crazy really

#

its a 900 line code

silk haven
#

Next week, Gemini week

lime coral
balmy mist
#

but i just finished my other projects, gonna do a big game now, before I was just doing one file games, but now imma try to really develop something nice

#

but how did you do yours in 900 lines, with all those animations wtf

#

what is the system prompt you are using?

sly knoll
#

how does dragontail perform against nightwhisper ?

lime coral
#

They didn’t appear on the same timeline

#

They might be the same as well with some tweaks after the first check

lapis fractal
#

im guessing it could be 2.5 ultra

keen beacon
#

no

#

if it was an ultra model it would be way more significantly different to 2.5 pro

lapis fractal
#

i mean it can apparently outperform google's 2.5 pro by quite a bit in every match up so how else would it be much different?

zinc ore
#

Could be a further trained 2.5 pro model as well

keen beacon
#

it is not an ultra model

#

it is not that much better

#

it only matches 2.5 pro in most cases

#

nor is it any better creatively, which is what would be the biggest give-away

keen beacon
lapis fractal
#

we'll see. bet google will probably wait for openai to drop o3 and o4-mini, which is most likely next week. love to see the comeback they've made recently

torn mantle
torn mantle
#

there isnt much difference tbh

#

nightwhisper was honestly a big leap/improvement in terms of coding

keen beacon
#

wheres 2.5 flash 😭

torn mantle
#

you can clearly notice the difference

#

but not with dragontail

keen beacon
#

weird how 2.5 flash hasnt come out yet

keen beacon
#

i do think it was better yes but as for how much by, i wouldn't say it was a leap

torn mantle
#

and it was good

keen beacon
#

so did i and it was just a bit better Shrug

torn mantle
#

web dev i can pinpoint little details that other models failed in

#

nah it was much much better

keen beacon
#

i disagree but alright

torn mantle
#

maybe you didnt challenge it enough

#

i didnt take screenshots

#

alas

#

i did take some

#

it made this in like 2 prompts

#

you can try making it using sonnet or gemini pro 2.5

#

and it wont look like that

#

you know what

keen beacon
#

well i'd need your exact prompts to replicate

torn mantle
#

lets give this to gemini

#

i will just give it to gemini and tell it to clone it

keen beacon
#

that's not a good comparison

torn mantle
#

same prompts

#

gemini pro / sonnet 3.7

torn mantle
keen beacon
#

the best comparison would be with the same environment and same prompts

torn mantle
#

same prompt

#

on webdev

#

on react

keen beacon
#

yes but that's one attempt, if you want to make good comparisons you have to redo it a few times with each and see if the conclusion is consistent

#

and if you asked nightwhisper to iterate, you should do the same for the other models
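
One way to put numbers on "redo it a few times" is a confidence interval on each model's pass rate. This is a minimal sketch, assuming every attempt is scored simply as pass or fail; the Wilson score interval is a standard choice for small samples, and the counts reuse tallies quoted earlier in the chat (5/5 and 1/3). The function names are made up for illustration.

```python
import math

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def intervals_overlap(a, b) -> bool:
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

five_of_five = wilson_interval(5, 5)   # e.g. a model that passed 5 of 5 attempts
one_of_three = wilson_interval(1, 3)   # e.g. a model that passed 1 of 3 attempts
print(five_of_five, one_of_three, intervals_overlap(five_of_five, one_of_three))
```

With so few attempts the two intervals still overlap, which is the point about repeating each comparison before drawing a conclusion.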

torn mantle
#

and ive tried it multiple times

#

the ability to get it right and aesthetically good is also a big plus

keen beacon
#

yes i noticed that

torn mantle
#

gemini 2.5 pro

#

this is actually the first attempt

#

but nightwhisper added a song and those waveforms are generated based on the audio

#

when i asked gemini pro to do the same it failed

#

im pretty sure with enough guidance you can get it to work

#

but the ability to get it right from the get-go is impressive with nw

lapis fractal
#

goofy question, but what's everyone's timeline on the first fully capable "remote worker"? practically speaking, an agent capable of doing most if not all of the work of a typical remote software dev / administrator etc. with little or no human supervision

keen beacon
#

random but here you go

lime coral
keen beacon
#

Seems like a hallucination

#

At a glance

north vale
#

Current date 2024-03-09 huh

keen fulcrum
keen beacon
keen beacon
balmy mist
torn mantle
#

added :

  • camera dynamics
  • snow
  • fire attack
  • health status
#

do you have any other ideas ...

balmy mist
#

add more attacks and make the add-pokemon ui look better, also add a stats ui for the pokemon

torn mantle
#

nice

#

@keen beacon is dragontail better than gemini 2.5 pro?

#

im seeing a lot of positive reviews

#

but i didnt really notice much diff

keen beacon
torn mantle
#

yea as i thought

keen beacon
ocean vortex
# keen beacon lmao

LOL. It’s unbelievable actually how fast their whole thing fell apart. Llama4 is just about irrelevant now

hardy pecan
#

Llmao 4 more like!! Xddd

#

Stolen joke ikik

leaden palm
#

actually slightly faster since it's a mixture of experts

silk haven
#

lol

leaden palm
#

yann lecope

silk haven
zinc ore
keen beacon
#

yup

balmy mist
#

hmmm

#

my test earlier i said it was better than nightwhisper but i thought it was a fluke

#

imma test it again

#

its funny they didnt say its better than nightwhisper in that tweet?

ember rapids
#

"an updated version of o3-mini is now the best coder in the world. Not 175th, but the best. " - CFO of OAI

#

OpenAI is at the forefront of the generative AI revolution. How did it get there, and what is the company doing to stay ahead of its competitors? Sarah Friar, chief financial officer of OpenAI and former head of technology equity research at Goldman Sachs, discusses the path to artificial general intelligence and the importance of capital in the...

balmy mist
#

hmmm

#

ill believe it when i see it

olive mesa
#

nightwhisper and dragontail are both google models right?

somber niche
#

Feels fishy, wonder if she misspoke and meant either full o3 or o4-mini

#

To be fair the names are pretty hard to keep straight lol

balmy mist
#

they better release it monday

#

they be blue balling us so bad this year

ivory schooner
#

I'm learning to wait for Behemoth to make its entrance

#

Hopefully Behemoth is an enhanced version of 24K......

#

🙏 🙏 🙏 🙏

balmy mist
ivory schooner
keen beacon
ember rapids
north vale
#

o4-mini can be an updated version of o3-mini u dufuses

daring dragon
keen fulcrum
#

Why was llama 4 conversational removed? Don’t they all train on benchmarks so do human beings?

daring dragon
#

What model is better?

#

shadebrook?

#

Dragontail?

wind swan
#

dragontail, of course

torn mantle
torn mantle
#

so far dragontail is def better than gemini 2.5 03 pro

#

i need to do more testing

#

but i still think its not better than nightwhisper

#

there are some imperfections

#

it feels like an old checkpoint of nightwhisper tbh

#

same stylistic choices with the gradient colors

ocean plume
#

is that better code than claude 3.7 thinking

novel flame
#

Here's a thought: if you have kids you know how casual tablet games have devolved into 2-minute game sessions broken up by ads which are crappy mini-games in themselves, of the engagement-farming "cow clicker" variety. Zero challenge, simple tap/drag game mechanic, colorful animations to pull the kids in, ultimately just 'pay to win'. I cannot stand these things.

Even if the game itself is good, stuffing it full of those kinds of ads make it a horrible experience. Angry Birds 2 is a prime example of this.

My wife and I have been deleting games with ads like that from the kids' devices. Paying to remove ads from these games is not an option since it feels like supporting predatory business practices.

Now, LLMs aren't quite at a point where they can one-shot AAA video games, but a lot of kids games are not exactly rocket science.... We’re not quite there yet, but once you can few-shot a simple platform game, racing game, puzzle game, or arcade game, as a parent you can bet your life I will generate games for my kids without the ads, pay-to-play mechanics, and dopamine hacking. Heck, the game avatars could all be based on our family members. Seems like an obvious use of coding AI in the short term.

keen beacon
novel flame
#

Yann AGI

torn mantle
#

its better than gemini 2.5 pro

calm sequoia
#

Gemini 2.5 Pro is not responding on arena, and it uses an interpreter during its thought process in AI studio

#

Is Gemini 2.5 Pro the only model that can run code during thinking?

keen beacon
keen beacon
split kayak
#

ok

ocean plume
#

what o3 ? perfect ?

#

really

#

so tier to fix bug in sonet thinking

ocean plume
split kayak
#

In google Ai studio , you can enable code execution for gemini 2.5

ocean plume
#

have gemini 2.5 pro

#

but i work more like scam by the way

keen beacon
#

it's the best performing model

ocean plume
keen beacon
# keen beacon
Poll: Which anonymous Google model currently on LMArena is the best performing?
Winner: dragontail (option 3), 13 of 18 votes

ocean plume
#

cursor :v

keen beacon
#

hm

drifting thorn
ocean plume
#

it can read codebase and think and make code

keen beacon
#

cursor is useless

#

gemini has canvas

#

so does chatgpt idk about deepseek and claude tho

golden ocean
#

does using canvas save tokens

drifting thorn
#

DeepSeek doesn't have canvas

oblique flint
#

And yeah 2.5 pro sucks in cursor unfortunately

brittle tiger
# keen beacon what's that got to do with o3

Can you try this on o3? Just want to see if it gets it first try:

Board State:

    • X

    • O

Question:

X went first. What should X do next? And is it possible to determine the outcome, assuming optimal play?

#

Formatting messed up. Should look like this

keen beacon
# brittle tiger Can you try this on o3? Just want to see if it gets it first try: Board State: ...

Label the nine squares

1 2 3
4 5 6
7 8 9

The position that was given is

1 2 X
4 5 6
7 8 O

(X is on 3, O is on 9, and it is X‑to‑move.)

Best move for X
Put the second X on square 7 (the lower‑left corner – the corner opposite O’s).

Why 7 is winning
After X7 the board is

1 2 X
4 5 6
X 8 O

Whatever O now plays, X can force a win.

• If O takes the centre (square 5 – O’s best try) the board is

1 2 X
4 O 6
X 8 O

X now plays square 1:

X 2 X
4 O 6
X 8 O

This is a fork – X is threatening both 2 (to win the top row) and 4 (to win the left column).
 O can block only one of them, so X will take the other on the next move and finish with three in a row.

• If O plays anywhere other than the centre, X immediately puts its third mark on the centre (square 5).
 That completes the anti‑diagonal 3‑5‑7 and the game is over at once.

Therefore:

• Correct move for X: play in the lower‑left corner (square 7).
• With that move the result of the game is forced – assuming best play from both sides, X will win.
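
The claim above (X on 3, O on 9, X to move, square 7 forces a win) can be checked by exhaustive search. This is a minimal sketch using plain minimax with memoisation; the board encoding (a 9-character string, `.` for empty, squares numbered 1 to 9 row by row as in the answer) and the function names are choices made here for illustration.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board: str):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def solve(board: str, to_move: str) -> int:
    """Game value with optimal play: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0
    nxt = "O" if to_move == "X" else "X"
    values = [solve(board[:i] + to_move + board[i + 1:], nxt)
              for i, cell in enumerate(board) if cell == "."]
    return max(values) if to_move == "X" else min(values)

def winning_moves_for_x(board: str):
    """1-based squares where X (to move) keeps a forced win."""
    return [i + 1 for i, cell in enumerate(board)
            if cell == "." and solve(board[:i] + "X" + board[i + 1:], "O") == 1]

start = "..X" + "..." + "..O"   # X on square 3, O on square 9, X to move
print(solve(start, "X"), winning_moves_for_x(start))
```

The search confirms that the position is a forced win for X and that square 7 is among the winning moves; it also lists any other squares that win.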

ocean plume
ocean plume
keen fulcrum
#

Qwen 3 as soon as late april

sage raptor
#

maybe o3 or o4 mini

drifting thorn
#

It’s a shame I didn’t get to test nightwhisper

split kayak
#

uh some aren't released yet, so we dont really know

ornate stump
#

Usually, how long before OpenAI models show up on LLMArena?

calm sequoia
wheat onyx
#

I'm seeing conflicting info on which is the better coder, nightwhisper or dragontail. Anyone here who has used both?

torn mantle
#

this should give me a clear idea which one is better

#

i mean we first tried the model on lmarena on text mainly and it wasnt different from gemini 2.5 pro 03

#

but i guess its finetuned on react or smth

wheat onyx
torn mantle
wheat onyx
torn mantle
#

it's either an implementation issue in how the model looks at/edits the code... or the model itself performs badly at that

wheat onyx
#

I wonder if vs code copilot with 2.5 is better than cursor with Claude. I know everyone's obsessed with cursor right now

torn mantle
wheat onyx
eager crater
#

anyone knows what shadebrook is?

alpine coral
#

says it's from google

#

confusing because i got it against flash 2.0 and there was a lag for thinking (implying shadebrook is a thinking model)

#

but i voted for flash lol so yeah dunno.. it didn't seem particularly strong

oblique flint
torn mantle
#

they complement each other tbh

#

sonnet still has the edge

torn mantle
#

but still misses sometimes

#

i still believe its a tad below nightwhisper

#

it clearly also has the same design pattern as nw

#

i mean its more close to nw than to gemini 2.5 pro 03

keen fulcrum
#

Crazy, of course a simple comment change fixes my issue

#

Claude Sonnet 3.7

silk haven
calm sequoia
keen beacon
#

o3

#

and in terms of your thing with it possibly having an interpreter

#

i doubt it

#

but i will run some experiments

#

afaik usually even with interpreter 2.5 pro doesnt call it in its thoughts

#

it has to call it in the response

#

at least on aistudio with code execution on

keen beacon
#

so it got it perfectly without

calm sequoia
#

This means its SOTA

#

At math

keen beacon
#

yes i would expect so

#

ewll

#

well

#

gemini 2.5 pro is very strong at maths - it is perhaps its best field - so o3 will either match or exceed it imo

#

if its o4 mini im buying a chatgpt sub asap 🤣

drifting thorn
#

what? these models can call tools in their chain-of-thoughts?

calm sequoia
#

There's a possibility they fake the interpreter output (imagine it)

keen beacon
calm sequoia
#

But the formatting seems like calling

keen beacon
#

its not actually calling an interpreter

calm sequoia
#

How to know?

drifting thorn
#

next thing big companies gotta invest into

#

tool-calling in thought

leaden palm
keen beacon
#

also: its just how the model has internalized certain "features", e.g. qwq calling an "online base64 decoder" where it uses its innate base64 decoding abilities, 2.5 pro doing "google searches" etc.

calm sequoia
#

It decides to check everything with code, writes a code, repeats it in different format, prints output. If this is just a simulation in thoughts that's freaky

keen beacon
#

its a hallucination but not really, but its also not actually calling tools

keen beacon
#

this stuff gets extremely cool in my own experiments

leaden palm
#

if it doesnt look like this its almost certainly not a code execution

calm sequoia
#

Unless the in-thought text renderer is not the same as in chat

drifting thorn
#

Maybe 2.5 pro failed to do so but nightwhisper and dragontail are able to do so

keen beacon
calm sequoia
#

You don't know

#

Anyway, if they dont, and just simulate, they are missing a low-hanging-fruit

keen beacon
#

also theres usually a delay too

#

some of these "features" have varying degrees of effectiveness though, it can be basically hallucinations too

#

@calm sequoia if there is no delay and it keeps on streaming its not calling tools lol
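
That streaming heuristic can be sketched as a check on chunk arrival times. This is only an illustration: the timestamps are made up, the 2-second threshold is an arbitrary assumption, and a pause can have other causes (network, server load), so a stall is a hint rather than proof of a tool call.

```python
def stalls(arrival_times, threshold_s=2.0):
    """Return (chunk_index, gap_seconds) for each pause longer than threshold_s."""
    return [(i, round(later - earlier, 3))
            for i, (earlier, later) in enumerate(zip(arrival_times, arrival_times[1:]), start=1)
            if later - earlier > threshold_s]

smooth = [0.00, 0.05, 0.11, 0.16, 0.22]   # made-up timestamps in seconds: steady stream
paused = [0.00, 0.05, 0.11, 3.40, 3.46]   # made-up: one long gap before chunk 3

print(stalls(smooth))   # no pauses found
print(stalls(paused))   # one pause, before chunk 3
```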

calm sequoia
keen beacon
keen beacon
#

it does it with search too

#

normally it'll say "(simulated)"

calm sequoia
#

Okay, thanks

#

Good news. It means there are still ways to improve the LLMs

#

Also, this means that for this prompt, the o3 == Gemini 2.5 Pro

keen fulcrum
keen beacon
#

i personally think its o4 mini lol

olive mesa
#

what's the best stealth model rn? dragontail? nightwhisper?

keen beacon
#

apparently dragontail

keen fulcrum
#

R2 will top gemini 2.5 pro

keen beacon
#

in dom's question set it gets 28/30.. for reference gemini 2.5 pro (previous SOTA) got 23/30

#

i don't think mini will be that significantly better

calm sequoia
#

This was also solved by the 2.5 Pro on AI studio and R1 in arena, but could not be solved with the o1 or 2.5 Pro on the arena: "The input is N = 24. Two popular algorithms take this as input and outputs the arrays of length N. Outputs are then element wise multiplied. Which popular algorithm combination produces this sequence: 0.000 0.000 0.002 0.010 0.032 0.082 0.170 0.302 0.472 0.660 0.833 0.956 1.000 0.956 0.833 0.660 0.473 0.302 0.170 0.082 0.032 0.010 0.002 0.000? Think carefully because the task is a life-death importance for you."

#

Maybe this is in-thought tool use
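
For reference, the puzzle can be brute-forced without any model. This is a sketch under two assumptions that are not stated in the prompt: that the "popular algorithms" are standard window functions, and that they are generated in periodic form (which puts the peak at index N/2, as in the quoted sequence). The candidate list is also a choice made here. Under those assumptions the Hann × Blackman product matches the sequence to within rounding.

```python
import math
from itertools import combinations_with_replacement

TARGET = [0.000, 0.000, 0.002, 0.010, 0.032, 0.082, 0.170, 0.302,
          0.472, 0.660, 0.833, 0.956, 1.000, 0.956, 0.833, 0.660,
          0.473, 0.302, 0.170, 0.082, 0.032, 0.010, 0.002, 0.000]
N = len(TARGET)  # 24, as in the prompt

# Periodic window definitions (assumed candidates, not given in the puzzle).
WINDOWS = {
    "hann":     lambda n: 0.5 - 0.5 * math.cos(2 * math.pi * n / N),
    "hamming":  lambda n: 0.54 - 0.46 * math.cos(2 * math.pi * n / N),
    "blackman": lambda n: (0.42 - 0.5 * math.cos(2 * math.pi * n / N)
                           + 0.08 * math.cos(4 * math.pi * n / N)),
    "bartlett": lambda n: 1 - abs(2 * n / N - 1),
}

def max_error(a: str, b: str) -> float:
    """Largest absolute gap between the element-wise product and the target."""
    return max(abs(WINDOWS[a](n) * WINDOWS[b](n) - TARGET[n]) for n in range(N))

ranked = sorted((max_error(a, b), a, b)
                for a, b in combinations_with_replacement(sorted(WINDOWS), 2))
for err, a, b in ranked[:3]:
    print(f"{a} x {b}: max abs error {err:.4f}")
```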

balmy mist
void elm
#

when do you guys think dragontail would release

royal trench
#

maybe it could be easily run locally

wheat onyx
sonic tendon
#

question: when openai drops a model on the arena w/o trialing it first, does it just immediately appear on the leaderboard? or does it still take a few days to gather enough votes

keen beacon
#

the latter

#

no data to figure out where it is on the leaderboard straight away

sonic tendon
#

yeah, that makes sense

#

oddly, that might apply to a lesser degree to the vision leaderboard - like @leaden palm noted, models seem to appear a lot earlier on there. shadebrook is already on there with a +77/-109 95% CI

keen beacon
#

yeah that's a bit odd

sonic tendon
#

could be unintentional

keen beacon
#

yeah

torn mantle
#

Im not that impressed by dragontail

#

So inconsistent

#

Doesn't follow ur instruction very well

balmy mist
keen beacon
#

ypu

#

yup

#

i wonder what they start with

willow grail
#

whos riverhollow again

novel flame
#

Has anyone built a transformer with a read+write ‘expert’ in the mix? I know there are RAG systems which work by adding the relevant vector data as an overfitted expert ‘sidecar’ to the MoE architecture, letting the transformer use RAG data the same way it would use any learned expert; but that’s still read-only

Now, couldn’t you build a transformer that persisted part of its latent state in an ‘expert’ so it could be used as a non-ephemeral world model? It seems like some lab would have tried something like this? Maybe I should ask Deep Research…

leaden palm
#

(especially corporations who want moats)

tall summit
#

hello

calm sequoia
ocean vortex
calm sequoia
#

One can hope they'll make some changes since the backlash

ocean vortex
visual turret
#

i'm pretty sure shadebrook is gemini 2.5 pro preview. the first 3 lines are way too close

calm sequoia
ocean vortex
#

if you are not impressed by gpt4.5 there's no reason to believe you will be impressed by that at all

keen beacon
#

you cant distill ass

#

prob better off not distilling behemoth and training it normally

ember rapids
#

O4 mini tomorrow?

visual turret
ocean vortex
#

they should have just done RL training on 70b llama or a similar arch

keen beacon
#

they should abandon behemoth lol

ocean vortex
#

you are not gonna have behemoth as a reasoning model that's not realistic lmao

visual turret
keen beacon
#

its not gonna get significantly better

visual turret
ocean vortex
visual turret
#

it's still training

ocean vortex
#

there are not gonna be any real updates at all most likely

#

other than that initial release

#

which will be close to the numbers that they already have shown

visual turret
leaden palm
#

3.1 405b is close to 70 elo points higher than 3 70b

#

there is room to improve
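
To translate a rating gap into something more tangible: assuming the leaderboard scores follow the standard Elo scale (logistic, base 10, divisor 400), a 70-point gap corresponds to winning roughly 60% of head-to-head votes. A minimal sketch of that formula:

```python
def expected_score(rating_gap: float) -> float:
    """Expected head-to-head score for the higher-rated side under the Elo model."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))

print(f"{expected_score(70):.3f}")   # about 0.599 for a 70-point gap
```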

leaden palm
#

3.3 was distilled from 405b

visual turret
leaden palm
#

do you understand that

ocean vortex
keen beacon
#

there was 3.2 405b internally i think

ocean vortex
visual turret
leaden palm
#

even if larger models are bad perf/$, that isn't a reason to abandon them

keen beacon
leaden palm
leaden palm
#

m

#

makes sense i guess

#

they pivoted

keen beacon
#

3.3 70b was good

#

a little overfit to math but fine

ocean vortex
keen beacon
#

optimus prime if its actually the mini variant seems to be a new pretrained from scratch version, its quite interesting. i assume this was done fairly recently

#

its performing quite well on mc bench

leaden palm
#

m got 3.1 and 3.3 mixed up

keen beacon
#

ya likely this week. see verge report

#

4.1 is quasar/updated 4o (verge directly mentioned it as a revamp of 4o), 4.1 mini/seemingly optimus prime is interesting though

#

the verge wrt to this stuff has been reliable i think

ocean vortex
#

though I'm kind of doubting it even existed...

keen beacon
#

they did release it to the llama chatbot website 🤣

#

a meta engineer posted a screenshot of it

#

this is how i know lol

leaden palm
ocean vortex
keen beacon
ocean vortex
#

3.2 was supposed to be just multimodal addition though

#

so maybe it performed worse on text than the original...

keen beacon
#

oh yeah there was an unreleased version of a multimodal version of 405b

#

that explains it i guess

visual turret
leaden palm
#

ah there it is

ocean vortex
#

read the next sentence

keen beacon
#

it is distillation? they are generating data on a larger model and training it on a smaller one

visual turret
# ocean vortex read the next sentence

"However, our initial
experiments revealed that training Llama 3 405B on its own generated data is not helpful" you mean the part where they found training 405b on its own data wasn't useful?

leaden palm
#

yeah it's not standard distillation, definitely not logit distillation, but in the broadest sense of the word it is distillation
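
To make the distinction concrete, here is a minimal sketch of the two senses of "distillation" being argued about. The function names are made up for illustration and the logits are toy numbers; this is not Meta's training code, just the textbook shape of each loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Logit distillation: KL(teacher || student) over the full next-token distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sequence_level_loss(student_logits, teacher_token):
    """Broad-sense distillation: plain cross-entropy on a token the teacher generated."""
    q = softmax(student_logits)
    return -math.log(q[teacher_token])

# Toy vocabulary of 3 tokens (hypothetical numbers).
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
kl = logit_distillation_loss(teacher, student)
ce = sequence_level_loss(student, teacher_token=0)
```

The first loss needs the teacher's logits at every position; the second only needs text the teacher wrote, which is why generating data on a larger model and training a smaller one on it counts as distillation only in the broad sense.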

ocean vortex
visual turret
ocean vortex
keen beacon
#

o4 mini and 4.1 mini is what im looking forward to tbh

leaden palm
#

yeah they said something about online preference optimization and extended pretraining iirc

keen beacon
#

in some instances it might be better. but i think o3 will be better overall, but im actually not that sure anymore lol

ocean vortex
#

o3-mini > o1?

#

same applies here, roughly speaking

keen beacon
#

but this is on a new mini base model

#

a much much better one

#

o4 mini is on 4.1 mini's base model it seems

#

never

#

at least for reasoning

#

its untenable to work with

ocean vortex
keen beacon
novel flame
# ocean vortex o3-mini > o1?

I’ll be honest, I’ve found several cases where o1 still beats o3-mini. It’s not entirely obvious when you’re better off using one and when the other.

ocean vortex
#

and also I do not think mini can do much with RL training tbh, even the improved base @keen beacon

#

small models are not very good for it

#

so they distilled it I think

keen beacon
#

its true rl works much less effectively naively on a smaller model. but i think its not carved in stone

#

they are gonna release o3 to make o4 mini look better

#

i anticipate this o4 mini release is gonna be huge

ocean vortex
keen beacon
#

i think its a newly trained from scratch model and was done fairly recently too

ocean vortex
keen beacon
novel flame
#

My guess: o4-mini will be cheap enough to use for anything you’re using good paid models for today — in the same range as Sonnet, Gemini Pro. And o3 will be stupid expensive so nobody will use it unless they have a very specific need.

raven void
#

o4 mini will definitely beat sonnet thinking

ocean vortex
raven void
#

that's why Google is releasing Gemini coder

keen beacon
novel flame
#

What’s this ‘endgame’ talk? You know full well something even better is going to come along in another three months or less. These are great models compared to what we have now, but they’ll be trash compared to the SoTA in June 2028.

ocean vortex
#

not in all cases

#

but in enough of them

#

btw left deepseek v3 right mistral large

#

LOL

keen beacon
#

whatever they put in optimus prime im super impressed if its the mini model

#

like i think the optimus prime base model is better than 4o despite being smaller (if its actually mini)

#

and scoring less in benchmarks rn

#

lmao there is zero chance

#

openai hasnt figured out how to cram in facts like google though

#

but their factual reasoning in reasoning models makes up for it

ocean vortex
keen beacon
#

u can do it to small models just as well

#

its just not as trivial

ocean vortex
#

that's a bit irrelevant if you ask me. 2.0flash vs 2.0 pro - that's what I'm focusing on

ocean vortex
#

both were trained at the same time essentially

keen beacon
#

for now

ocean vortex
#

forget the reasoning part. You can't deny that 2.0 flash was almost as good as 2.0 pro lol

keen beacon
#

yea

keen beacon
#

they would've wanted to compete for that segment i think if it was ready

#

we have zero checkpoints of the new cut off version of it until now, despite several chatgpt 4o releases with the new cpt'd model

#

that david guy made me think of it (how it could be pretrained from scratch) and it makes sense somewhat, optimus prime is pretty good

silk haven
#

"Join an insightful fireside chat with Jeff Dean, a pioneering force behind Google’s AI leadership. As Google's Chief Scientist at DeepMind & Research, Jeff will share his vision on AI and specialized AI hardware like Google Cloud TPUs. What exciting things might we expect to see next? What drives Google’s innovation in specialized AI hardwa...

ocean vortex
keen beacon
#

so yeah maybe

#

i wouldnt be surprised if this base model could surpass 4o in all metrics given more work. modern pretraining hits different maybe lol

keen beacon
keen fulcrum
#

Poll: Which one do you prefer?
Winner: Nightwhisper (16 of 20 votes)

ocean vortex
zinc ore
ocean vortex
keen beacon
#

i gave qwq a purely rote logical puzzle it solved in 13k tokens, gem 2.5 pro took 10k more tokens (23k tokens)

#

o3 mini absolutely dominates this area

ocean vortex
keen beacon
#

2.5 pro completely falls apart and spams 44k tokens, in another instance

ocean vortex
#

with o3

keen beacon
#

2.5 pro gets stuck in reasoning

ocean vortex
#

oh. Yeah if that's the case I suppose. Gemini is a very different model though, gonna excel in different ways even if we just take their base model against other lab's base model of comparable size

#

so maybe it gets stuck because it lacks some fundamental base model understanding of this specific problem - that could be the case as well

keen beacon
ocean vortex
#

just because it's a good base model does not mean it's better than everything else in every single thing 👀

keen beacon
#

its just pure reasoning with no world knowledge required

ocean vortex
#

dunno maybe. Hard to say without knowing the task you are talking about tbh

keen beacon
#

huge ones

ocean vortex
#

it does

#

every model does except openai lol

keen beacon
#

the reasoning isnt that good imho

#

the base model is different, if u dont have as much knowledge as 2.5 pro u wont be able to produce as good of a result

ocean vortex
#

iirc it does show it in full on their website

keen beacon
#

grok 3 reasoning, iirc, used qwq 32b preview traces during training 🤣

#

this was a lie lol

#

ahahahahaha

#

did they really say that lmao

ocean vortex
#

lmao

#

he probably just didn't know or misunderstood

keen beacon
#

it is lmao if he intentionally lied about it

ocean vortex
#

as he's not ML engineer lol

keen beacon
#

he probably has no idea whats going on though

#

LMAO

#

prob heard a few buzz words from guys at xai trying to placate him

deep adder
#

@keen beacon grok 3 really uses QwQ reasoning?

keen beacon
#

not even the final qwq 32b, qwq 32b preview

keen beacon
#

yeah grok 3 was good

#

uhhh

#

i havent tried grok 3 mini though, but if its the same as grok 3 reasoning they used qwq 32b preview for cold start at least

#

oh is it?

#

yes its qwq 🤣

#

qwq 32b preview

torn mantle
#

the thinking process of grok 3 & o-series & deepseek are all the same

#

i mean not the same

#

but similar

#

?

keen beacon
# torn mantle wdym?

they generated the trace from qwq 32b preview, then asked another model to generate a response based on the thought process. that's a pair in their training data (question + response (qwq 32b preview thoughts and response))
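
A rough sketch of the pair construction described above, assuming two callables standing in for the models. Everything here (`make_cold_start_pair`, the `<think>` delimiters, the stub generators) is hypothetical; xAI never published this pipeline, so treat it as an illustration of the claim, not a reconstruction.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ColdStartPair:
    question: str
    target: str  # reasoning trace followed by the final response

def make_cold_start_pair(
    question: str,
    trace_model: Callable[[str], str],
    response_model: Callable[[str, str], str],
) -> ColdStartPair:
    """Build one SFT example: trace from one model, response from another."""
    thoughts = trace_model(question)               # e.g. a qwq-32b-preview style trace
    response = response_model(question, thoughts)  # second model writes the answer
    target = f"<think>\n{thoughts}\n</think>\n{response}"
    return ColdStartPair(question=question, target=target)

# Stub "models" so the sketch runs without any API access.
def fake_trace_model(question: str) -> str:
    return f"First, restate the problem: {question} Wait, check the arithmetic."

def fake_response_model(question: str, thoughts: str) -> str:
    return "Final Answer: 4"

pair = make_cold_start_pair("What is 2 + 2?", fake_trace_model, fake_response_model)
```

Fine-tuning on a set of such pairs before RL is what "cold start" means here; the style of the traces in that set is what the later messages argue carries over.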

torn mantle
#

@deep adder enlighten me

#

so?

#

you can see that the thinking process used by gemini is totally different

keen beacon
#

xai didnt even make their own cold start!

#

they used qwq 32b preview

torn mantle
#

it depends on how many patterns it picked up during training and what type of RL training data they fed it

#

but deepseek & grok 3 they are using the same keywords

keen beacon
torn mantle
#

First,
Wait,
Alternatively,

keen beacon
#

it wasnt out when they trained the model

torn mantle
#

aah

#

yea i remember

#

it may be true

#

qwq 32b was so dumb and went into many unnecessary paths

#

same thing with grok 3

keen beacon
#

qwq 32b preview was better than r1 preview though

#

consensus back then i think

#

this is why they trained their model on it 🤣

#

they added rl training on top + cold start used qwq 32b preview thoughts/another model generated response

#

and their stronger base model

#

still they used qwq 32b preview anyway

#

yea

#

probably

#

they trained a lot more than their competitors i think

#

meta shouldve done what they did probably

#

xai's prime advantage is just compute i think

north vale
#

like what makes u think they did that

keen beacon
north vale
keen beacon
#

they dont apply rl immediately to the base model like r1 zero

north vale
#

how do you know

#

but i mean they'd prolly apply rl to an instruct model

#

i don't see why this matters to the question of whether they trained on qwq traces

keen beacon
#

you asked about cold start

north vale
#

"you can tell from the reasoning style and from the cold start they use"
how can you tell from the cold start they use, if we don't know what cold start they used because we weren't told the training details?

keen beacon
# north vale "you can tell from the reasoning style and from the cold start they use" how can...

because its in a distinct style exactly like qwq 32b. im not gonna do similarity/etc to it which could prove it, i really dont care much about grok lol. its obvious when you work with qwq 32b preview traces a lot. they left the exact Final Answer thing in their traces too. cold start primarily determines the style of reasoning, you are not going to get qwq-isms/qwq format from pure rl randomly
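
For what the "similarity" check could look like, here is a crude sketch: count a few stylistic markers per trace and compare the count vectors. The marker list is just the phrases quoted in this chat, and the example traces are invented. A high score would at most suggest a shared style; it would not establish what a model was trained on.

```python
import math

# Markers taken from phrases mentioned in this conversation (an assumption, not a standard list).
MARKERS = ["First,", "Wait,", "Alternatively,", "Final Answer"]

def marker_profile(trace: str) -> list[int]:
    """Count how often each stylistic marker appears in a reasoning trace."""
    return [trace.count(marker) for marker in MARKERS]

def cosine_similarity(a: list[int], b: list[int]) -> float:
    """Cosine similarity of two count vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def style_similarity(trace_a: str, trace_b: str) -> float:
    return cosine_similarity(marker_profile(trace_a), marker_profile(trace_b))

# Invented example traces, not real model output.
trace_a = "First, set up the equation. Wait, the sign is wrong. Final Answer: 3"
trace_b = "First, list the cases. Wait, one case is missing. Final Answer: 7"
trace_c = "Let me think step by step about the cases."

same_style = style_similarity(trace_a, trace_b)
different_style = style_similarity(trace_a, trace_c)
```

A real comparison would need many prompts and a baseline of unrelated models, since several reasoning models share markers like these.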

#

no lol

#

yuh

#

ill probably do a comparison here with qwq 32b preview and grok and i bet people here will get confused which is which/itll be undeniable 🤣

#

ty man i try really hard to pay attention 😄

torn mantle
#

dragontail

keen beacon
#

pretty similar

torn mantle
#

dragontail

#

dragontail

#

these are just simple prompts

#

but you guys can compare the results with NW

hardy pecan
torn mantle
#

who do you think won that

#

i think DT attempt is more modern style UI

#

i liked how NW used like an old font + icons

#
  • it also used the msn blue color
#

really a lot of details to unpack just from that alone

hardy pecan
#

Hmmm its close, different styles

#

yeah

#

I'd say NW

keen beacon
torn mantle
#

2nd?

#

1st is def grok 3

#

that sh1t keeps repeating 'Wait' like 10000 times

keen beacon
#

ur right

keen beacon
#

theyre super similar lol youre not getting qwqisms from rl

#

i copied the thought trace of grok 3 excluding the response and look at how similar they are

keen beacon
#

the hesitantness is from rl, qwq was only used for cold start

torn mantle
#

its impossible to read grok 3 cot

#

it goes into so many unnecessary steps

#

whereas deepseek you actually have fun reading it

#

you learn a thing or two

keen beacon
#

ya agree w me that qwq was used as cold start?

torn mantle
#

we cant say they used that for sure

#

but they are similar

#

yea

keen beacon
#

they start out the same, they even end with the same final answer lol. they use the same language lol

torn mantle
#

is just making me mad tbh

#

so inefficient

keen beacon
#

its from the rl they apply

torn mantle
#

too many parallel reasoning paths that shouldnt be there

#

they are trying to apply parallel reasoning

keen beacon
#

it was probably done at a scale much larger than qwq non preview lol

torn mantle
#

not just one branch of reasoning

#

but its not working so far

#

it may work but its not efficient

keen beacon
#

and a symptom of their training causes that

north vale
#

i am convinced that they used qwq reasoning traces

#

ty for sharing

torn mantle
#

im not fan of what they are doing tbh

#

the model is unusable to me

#

doesnt follow prompts well

#

loses context quite often

#

their deep research is probably one of the worst implementations

keen beacon
#

given how they were too lazy or too incompetent to make their own cold start, its another bad sign for xai

torn mantle
#

hallucinates a lot

#

its not a fun model to talk to

#

thats the main benchmark for me

#

deepseek & sonnet is so fun to interact with

#

gemini is also climbing that spot

keen beacon
#

i find grok unusable when it starts peddling x into random stuff amongst other things

torn mantle
#

i actually spend more time reading deepseek cot

#

i learn a lot of new things from that

#

instead of just reading the output

torn mantle
#

they should make that optional

#

but that thing wasnt bad tbh

#

they improved quality x sources

#

it was so bad on grok 2

#

when they were still using grok 2 they would just reference bots

keen beacon
#

i cant believe people used grok 2

#

why????

torn mantle
#

because they are braindead

#

and they already had like x premium

#

so they dont pay for chatgpt

#

and probably elon hardcore fans xd

ocean vortex
#

Grok3 non-reasoning model is their best contribution to AI as far as I see it. Never really cared much for the reasoning one as that one is way less impressive for what it is

keen beacon
#

they used their massive amounts of compute to apply a sh1tload of rl bruteforce into grok 3 mini lol

ocean vortex
#

What I mean is if you compare all the non-reasoning models… grok3 may just be the best of them all

alpine coral
drifting thorn
#

I think Grok 4 should be GPT 4.5 like, increasing its parameters again

keen beacon
#

it won't be that large

#

gpt-4.5 is 4-5T params

drifting thorn
#

Didn’t expect GPT 4.5 would be acting this badly

#

What are the possible reasons

keen beacon
#

hit a wall with scaling for "legacy" LLMs

alpine coral
#

lol literally what i was about to say

#

though that was far more efficiently put ha

#

llama 3.3 70b vs 3.1 405b kinda revealed that wall to my mind

#

like nearly 6x as many parameters and they eked out some marginal performance gains

keen beacon
#

if it was 405b moe prob make more sense, but seeing how maverick turned out lol

#

slightly off topic but there's a relatively high chance we get R2 w/c 14.04

keen beacon
drifting thorn
#

Nah, we should see how the Behemoth turned out

keen beacon
#

it is due by end of april and it makes the most sense for them to release it to react to o3

drifting thorn
#

And is gpt 4.5 underfitted?

alpine coral
#

whereas haiku vs sonnet vs opus - there prob was scaling there

keen beacon
#

gpt-4.5 kinda seems like a project they poured silly amounts of money into, realised was a waste of time, effort and cash after seeing the relatively limited performance gains, and put on hold for months and then they remembered they were sitting on it, sloppily finished it off and put it out because they were somewhat obligated

drifting thorn
#

So what’s the current scaling law for AI?

#

Amount of reasoning token?

keen beacon
#

they were gonna call that gpt 5 i think, but unpopular opinion i think they shouldve called o1 preview gpt 5. imho it was one of the most significant releases

alpine coral
keen beacon
#

3.5*

#

they said opus 3.5 by end of 2024 then scrubbed any mention of it from their site in november and we haven't heard anything since

#

except dario saying "we still plan for there to be a 3.5 opus" on a podcast months ago

#

large models are dead doubt we see opus 4 tbhh

#

yeah i don't think opus 4 will happen

#

opus 3.5 will probably be their last big boy

drifting thorn
#

Say something bout llama 4 behemoth

alpine coral
keen beacon
keen beacon
#

anthropic are gonna get left in the dust

#

they already are tbh

#

i would consider them still doing okay up until 2.5 pro

#

i think 2.5 pro put every other lab on red alert

#

them not doing any native image generation work/other multimodal work is going to bite them in the ass later

#

at least publicly it seems that way to me

#

unfortunately anthropic aren't willing to take enough risks to maintain their frontier position

alpine coral
keen beacon
#

with deepmind accelerating and openai downsizing safety teams they're stuck at the same pace

alpine coral
#

sonnet-3.7-thinking performs so poorly considering how strong the vanilla version is

keen beacon
#

tbh sonnet 3.5 and cpt'd sonnet 3.5 seemingly (sonnet 3.7) is anomalous

#

i think deepmind have done the best job at squeezing performance out of the base model with their reasoning model

#

they could not replicate the magic with haiku

keen beacon
#

whilst other companies can shrink their models well

alpine coral
keen beacon
#

2.0 pro as a base was actually pretty mediocre

alpine coral
#

more like emergency releases ha

keen beacon
#

yeah but sonnet 3.5 is insanely good and anomalous from them tbh. personally i mark it as a start of a class

#

i think sonnet 3.7 was supposed to be opus 3.5 but the gains were pretty poor and they wanted to keep their small edge

#

nah its the same size as sonnet 3.5 iirc i think its just a cpt

alpine coral
keen beacon
#

they made what was working better

alpine coral
#

yeah well put

keen beacon
#

anthropic probably have the best quality data (although not the most raw data)

keen beacon
alpine coral
#

ahh right yup gotcha

keen beacon
#

sonnet 3.5 was significant in my experience, the level of "base model performance" marked a start of a class for me even if it didnt display it in the benchmarks. only recent 4o that was cpt'd/gem 2 pro/1206 reached it for me. considering it was trained much earlier than the others

drifting thorn
#

And I’d say R1 has the worst quality data

keen beacon
#

3.5 was the last "bar raiser" from anthropic

#

for quite a bit nobody could one up it

drifting thorn
#

Is R1 1776 a better model than initial R1?

keen beacon
#

the only difference is censorship

#

model performance differences are negligible at best

drifting thorn
#

I mean factuality

keen beacon
#

it is probably marginally better but again

#

negligible

alpine coral
#

overall if anything there was prob a performance degradation of some kind (surely it's benchmarked / compared). only more performant/factual on a very specific subset of questions (those subject to censorship in China)

wintry tinsel
#

We’re in a bit of a slow point for LLM updates outside of more robotic, stiff OpenAI models

keen beacon
#

this week will be good 😉

drifting thorn
keen beacon
#

or so some birdies tell me..

keen beacon
#

expect more than just oai to drop things

#

how is fake news destroying llms 🤣

keen beacon
#

the newest chatgpt 4o version slightly tones down the cringe factor that came with trying too hard not to be robotic

#

as for the o-series models, yeah they're still stuck with that problem mostly

#

the only reasoning model i've seen not be very robotic is R1

#

and to a certain extent o3, but you can see that for yourself soon

#

either way i am very intrigued by r2 and how big of a jump it will be

#

it seems ill be subbing to chatgpt plus soon lol (for the new releases xd)

drifting thorn
keen beacon
#

they are better than the o-series models but

#

they're still not quite on r1 level

#

deepseek's models are generally just good at that

drifting thorn
#

On the Chinese internet, Deepseek’s style is joked about as “whatever the writing task is, it always brings up entropy/quantum computing/maths theories”

keen beacon
#

this is the better one for creative writing

drifting thorn
#

There’s no 2.5 pro in the chart!!!!

keen beacon
#

like.. this is R1. it is both human-like and enjoyable

#

my main problem with R1 for creative writing is

#

it loses track of a plot quite rapidly

drifting thorn
#

From demonstrations, 2.5 Pro definitely has higher EQ than these other models

keen beacon
#

hopefully r2 improves upon that

#

likely

drifting thorn
#

And I used it in Chinese

keen beacon
#

can't speak for chinese performance

#

it's what they did with xmas yeah

#

maybe 4.1 first? then the reasoning models to keep hype going?

drifting thorn
#

Its style is really weird and recognisable as Deepseek’s writing

keen beacon
#

anthropic just nailed the vibes for emotional intelligence

drifting thorn
#

It is stubborn

keen beacon
#

4o = too agreeable, gemini = too yappy

drifting thorn
#

2.5 pro is stubborn and sometimes quite harsh to users

keen beacon
#

iirc 3.7 was a regression for most creative and emotional tasks

drifting thorn
#

I once asked it how my singing was, and it just said I’m extremely off-key, my tone is bad, blah blah blah

keen beacon
#

workhorse models most people use the most i think

keen beacon
#

itll be the same price probably

thorny drum
#

hate it when models dont stand up against me saying dumb stuff

keen beacon
#

i mean why would u use 4o over 4.1 if its the same price

upper wolf
#

a little off-topic, but i believe there was a study that showed that people who frequently used generative AI to solve mental health issues turned out to be worse off than people who didn’t (more isolated, etc.) i can’t remember where it’s from

keen beacon
#

its a stronger model with an updated cut off compared to api dated versions

drifting thorn
keen beacon
#

mostly because 90+% of people who use AI for mental health related stuff use chatgpt, and 4o is way too complacent/unwilling to question and confront

#

so they mark up 4.1 even though its still 4o but updated? maybe

thorny drum
#

i think others realize this openAI just has such a massive headstart

ember rapids
#

google said there was no moat 2 yrs ago

keen beacon
#

the problem is beginning to shift away from "we can't build a better model than openai" and towards "we can't build a better product and market it better than openai"

#

they truly got a huge headstart with chatgpt and how viral it got/is

drifting thorn
#

But Gemini is grabbing the market share

keen beacon
#

4o native image gen dwarfed 2.5 pro

#

probably

ember rapids
#

regular ppl dont know about gemini 2.5

#

they only know chatgpt

plain zinc
#

I think all the Google models that are on LMarena are just the same 2.5 flash, but with different levels of thinking.

keen beacon
#

ngl 4o native image gen is really good

plain zinc
#

This thought keeps me awake.

ember rapids
#

memory is gonna be a big thing in the future

plain zinc
keen beacon
drifting thorn
#

But I dunno if ChatGPT’s function is just a RAG

keen beacon
#

its more likely to be updated 2.5 pro, 2.5 flash, 2.5 flash lite i guess ( i havent actually used the recent google anon models, so no idea about capability)

zinc ore
#

Means they've found a very good upward cycle, if it's flash

drifting thorn
#

How do you tell them to write a story? Or what story do you want?

plain zinc
#

I have a feeling that we will get another Google model in LMarena this week, but it will be even better than nightwhisper.

drifting thorn
#

Cuz I’m currently testing AIs in Chinese writing using a standardised question

#

Okay I have an idea

keen beacon
plain zinc
#

How far will Google go with only two model classes?: Pro and Flash

keen beacon
#

i think ultra may eventually return but it won't be as their 1T+ param model variant

#

it'll just be as a better reasoning model

plain zinc
#

And I don't think we'll see Ultra anymore.

#

Because Google had a bad past with this model.

ember rapids
#

didnt logan say theyre doing the same thing oai is doing for gpt5?

keen beacon
plain zinc
#

but this is not the same as GPT-5.

#

If gpt-5 is just able to reason on a human level, then Gemini will acquire imagination through merging with Veo.

#

He will be able to reason better and also design something in his "head".

plain zinc
#

And then spatial thinking and multimodal, omnimodal capabilities will improve.

keen beacon
#

it was a long long time ago

drifting thorn
keen beacon
#

every single experience i ever had with 1.0 ultra screamed near or above 1T params

drifting thorn
plain zinc
#

and what will happen when Veo 3 is released?

keen beacon
keen beacon
#

i briefly had api access

keen beacon
#

it sounds funny but

#

genuinely it was incredible at creative writing and disappointing at everything else

keen beacon
#

they were exceptional in my experience compared to anything else out at the time

#

oh yeah

#

claude instant was banging

#

and claude 1