#general

keen beacon
#

i think its close, but o3 has the edge

elder rapids
keen beacon
#

ofc

elder rapids
#

whether that's more meaningful than your benchmarks

#

but that's beside the point

#

therefore it's not the general intelligence you think it is

keen beacon
#

nah , i just use them both in programming for example, o3 has bigger edge
for general stuff its usually gemini 2.5

elder rapids
#

specific coding performance tasks are held back by total coding capability

keen beacon
elder rapids
#

so ye

#

i don't think it's good to compare them with algorithms

ember rapids
#

dayhush is rly good wow

novel flame
#

Correct.

I think the problem for many is that it’s easy to mistake greedy decoding (which is deterministic and also boring to the point where it would be practically useless) for temperature=0 (which still has inherent randomness but the probability distribution is more biased towards the top logits).

Mathematically, temperature=0 isn’t even possible since it would cause a divide-by-zero so LLMs just change it to a very small value >0 behind the scenes.

ocean vortex
#

that's the thing, chatgpt-latest was above o1 on lmarena even when they were using the old gpt4o base model

#

It does very well on lmarena, openai reasoning models much less so... till now at least, if this doesn't change I expect o3 to be no higher than 3rd

ocean vortex
#

but I understand why they implemented it this way. It's just much faster and easier than writing smth like 0.00000001

keen beacon
#

Bruh the misinformation wtf???

calm sequoia
leaden palm
#

i've seen a few docs that say something along the lines of 0 temperature is changed to epsilon

keen beacon
#

Huh really?

ocean vortex
keen beacon
#

Check vllm Lol

ocean vortex
#

you can't divide by 0 lol

keen beacon
#

Omg

#

They check if it's 0

#

Omfg

#

If it's 0 they switch to greedy assuming no other sampling options are enabled. Check vllm docs
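
For reference, a minimal sketch of the zero check described here, assuming a plain softmax sampler over raw logits. The name `sample_token` is made up for illustration and this is not vLLM's actual code. It also shows what the "tiny epsilon" alternative would do in practice.

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Pick a token index from raw logits (illustrative helper, not a real API)."""
    if temperature == 0:
        # Special case: dividing by zero is undefined, so fall back to
        # greedy decoding (argmax) with no randomness at all.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max so exp() never overflows
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.0, 3.0, 2.0]
print(sample_token(logits, 0))      # greedy pick
print(sample_token(logits, 1e-8))   # epsilon temperature
print(sample_token(logits, 1.0))    # ordinary sampling
```

With an epsilon like 1e-8 the non-top weights underflow to 0.0, so the epsilon route picks the same token as greedy unless the top logits are nearly tied. Under these assumptions, neither variant by itself would explain visibly different outputs between identical requests.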

#

Vllm is industry practice

ocean vortex
# keen beacon They check if it's 0

that's why it's custom implementation. If there was no custom implementation and no handling 0 differently than any other value, it wouldn't be possible

keen beacon
#

Don't change the goalpost since you thought they changed it to a small value lmfaoooo

ocean vortex
#

it is not handled the same way as other value, that's the main and only point

#

I don't get why you got so fussed about it lmao

keen beacon
keen beacon
ocean vortex
#

my point was to show what you would need to do if 0 wasn't handled explicitly

#

nothing less

#

nothing more

#

maybe take a chill pill @keen beacon I have no clue what's gotten into you LMAO

leaden palm
novel flame
#

Temp=0 in the math is not possible, if implementations just change to greedy decoding then there’s no longer a temperature value involved in the inference.

But I genuinely thought the standard was to sub a small epsilon value. I could be wrong about that for sure.

keen beacon
#

Otherwise what would you even do there

ocean vortex
ocean vortex
keen beacon
#

You would know this yourself

#

Otherwise you would not be able to use 0

ocean vortex
keen beacon
#

It is a special value for all intents and purposes

keen beacon
ocean vortex
#

that's what we were saying before you replied here with "??????" for no reason at all lmfao. If there was no custom interpretation and temp parameter was strictly temperature directly applied, you would have to use it differently

keen beacon
#

But everyone does that lmao. I wouldn't consider it a custom interpretation lmao. Not setting 0 to that would be a custom implementation at this point, no one expects it

ocean vortex
keen beacon
ocean vortex
#

so no

keen beacon
dark oak
#

how to vote for leaderboard?

ocean vortex
tall summit
#

hello friends!

keen beacon
novel flame
#

OK I had to check... and @keen beacon is wrong. I just made four identical requests to Claude 3.7 Sonnet on Bedrock:

"prompt": "How much wood could a woodchuck chuck?",
"model": "bedrock/claude-3.7-sonnet",
"config": {
  "temperature": 0,
  "max_tokens": 220
}
keen beacon
#

See this

tawdry phoenix
#

Installing mac os on my laptop

novel flame
keen beacon
novel flame
keen beacon
#

That is a ridiculous claim

#

Also iirc deepseek uses a fork of vllm are they building toy llms

#

I can go on about this but I'm going to stop here since you're making ridiculous claims

keen beacon
golden ocean
#

I love lmarena beef

novel flame
# keen beacon https://github.com/deepseek-ai/open-infra-index/tree/main/OpenSourcing_DeepSeek_...

Congratulations, you found one example of a major lab that uses (checks notes) a fork of vLLM which may or may not use the method of replacing temp=0 with greedy decoding. I tested the APIs from OpenAI, Google and Amazon Bedrock and all of them did not. I just don't have the time to discuss what 'industry standard' means, nor to try all the non-SOTA labs' APIs to find out what I already know.

keen beacon
keen beacon
#

You can't even set a seed on the anthropic api lol

novel flame
novel flame
# keen beacon You can't even set a seed on the anthropic api lol

OMG are you just trolling? Anthropic's API docs say exactly what I've been saying: "Note that even with temperature of 0.0, the results will not be fully deterministic." That's my whole point. But they do not say "greedy decoding" because they do not switch to greedy decoding, they use a small epsilon which leaves it non-deterministic.

keen beacon
# novel flame I'm not saying they did, I am saying you don't know they didn't and your entire ...
152334H

It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0. This is an odd behavior if you’re used to dense decoder-only models, where temp=0 should imply greedy sampling which should imply full determinism, because the logits for the next token should be a pure function of the input sequence & the m...

#

Google announced they used MoE too

#

While we don't know for sure for the others it's highly likely to be MoE too

keen beacon
#

That they change with epsilon

keen beacon
novel flame
keen beacon
#

Actual documentation

novel flame
#

See the 'T's right there. This is how I "support my hypothesis". By knowing the actual math.

keen beacon
#

If it's 0 it goes into greedy decoding if no other sampling options are enabled

#

Keep on believing what you want it's a weird hill to die on

novel flame
calm sequoia
#

It's so nice seeing a real talking instead of "look what svg pokemon llm drew for me"

novel flame
#

I apologize on my part for the heated debate here tonight, this was not a productive contribution to the channel.

tall summit
#

i learned a bit by following it

novel flame
# keen beacon While this isn't definitive I found it convincing a while back at least. I dont ...

Actually re-reading this, this can indeed explain non-deterministic outputs even if the implementation replaces the zero temp with greedy decoding. MoE (which they're probably all using) concurrency means partial results may arrive/recombine in different order each time, and that is a totally plausible explanation for small differences in the resulting logprobs. As for whether individual LLM providers do one or the other, who knows; the labs don't reveal the details. I heard about the 'epsilon' trick from Andrew Ng, who is a fairly reliable source of how things are done; but my evidence was hinged on observing non-deterministic results, and the MoE explanation is as valid as the epsilon explanation. I concede the argument.
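
A toy illustration of that ordering effect, assuming nothing about any provider's actual stack: floating-point addition is not associative, so the same partial results summed in a different order can produce slightly different logits, and a greedy pick can flip when two candidates are nearly tied. The names `sum_in_order` and `greedy` are hypothetical.

```python
def sum_in_order(parts):
    """Accumulate partial results left to right, like one arrival order."""
    total = 0.0
    for p in parts:
        total += p
    return total

def greedy(logits):
    """Argmax; ties go to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])

partials = [0.1, 0.2, 0.3]               # same contributions in both runs
logit_run_1 = sum_in_order(partials)       # (0.1 + 0.2) + 0.3
logit_run_2 = sum_in_order(partials[::-1]) # (0.3 + 0.2) + 0.1

rival = 0.6  # a competing token whose logit is almost identical
pick_run_1 = greedy([rival, logit_run_1])
pick_run_2 = greedy([rival, logit_run_2])

print(logit_run_1, logit_run_2)
print(pick_run_1, pick_run_2)
```

The two sums differ only in the last bit, yet that is enough to change which token wins here. Real inference involves far larger reductions, so this is only a sketch of the mechanism, not evidence about what any particular API does.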

torn mantle
#

do you think this is accurate enough?

#

dayhush vs sonnet

ocean vortex
torn mantle
keen beacon
#

sonnet 3.7 still probably best for coding

#

despite benchmarks

ocean vortex
#

or heck, at least a singular easily reproducible coding challenge where it would be clear 2.5Pro, o3 and o4-mini all output a notably worse solution

#

though we probably need to add some non-reasoning model too if you are to do this... gpt4.1 and/or updated grok3 should do

keen fulcrum
ocean vortex
#

for some specific coding tasks reasoning models are at a disadvantage kind of by design 👀

#

especially if you are asking smth that you could easily google for and fit the answer with minimal changes. Reasoning can work against it trying to reinvent the wheel sometimes

barren prairie
#

I just tried this now with Gemini pro 2.5 , Gemini flash 2.5 , O3 , O4 mini

Give me two words with those letters FP DAS .
Answer : pf sad (pf : sound )

Surprised that the winner was Gemini 2.5 flash

novel flame
balmy mist
#

has anyone been testin 4.1? how good is it?

ocean vortex
#

yeah I wrote it too generally looking back at it. 🤷‍♂️ For coding tasks that involve visuals 3.7 sonnet can indeed be the best still. Like even just drawing (coding) in svg it is gonna be better than the competition. Overall though if we consider everything coding related and not just this, sonnet is not the best after all is said and done

keen fulcrum
#

Dom, 3.7 performs really bad at coding

ocean vortex
#

the way they fine-tuned it they are kind of playing on their strengths as well. I had thinking enabled and reasoning was way longer than for o1/o3 with high reasoning effort for this prompt. Chat model is already doing it better, and thinking enabled (since I'm comparing it against reasoning models) just amplifies this

small haven
tawdry phoenix
#

Windows on the Samsung

raven void
#

O3 is better for code architecture and bug fixing not for writing code

verbal nimbus
#

I didn't know Artificial Analysis had an image arena:

#

Midjourney V6 and V7 are on there too, although idk why it's so low

thorn arrow
#

When will o3 be put on the leaderboard?

worthy thunder
#

As requested by several - OpenAI-MRCR results for o3: https://x.com/DillonUzar/status/1913806594199990582

Finally got access. They had some server-side issue with our org account, got resolved today.

Overall, pretty strong performance over its context window! Really strong up to 64k context, then drastically falls after, but that's to be expected (it hasn't specifically been improved for long context like the GPT-4.1 series, or Gemini 2.5). Should be really interesting to see GPT-4.1 applied to o-series!

I've also reran the bench against all 21 models I've tested so far, tracking costs and individual test results, will post results on that over the next couple of days.

Enjoy

calm sequoia
#

Is there an LLM tool which allows an agent to crawl through pages of a specific website?

small haven
#

o3 sux

torn mantle
#

claybrook has a lot of rendering issues on webdev

tall summit
brittle tiger
#

Tomay model from goog?

keen beacon
#

tomay?

#

have you tested it

torn mantle
#

Output wise

#

Claybrook & tomay outputs are more close to eo than to gemini 2.5 pro 03

plain zinc
#

I don't see him.

#

Is it of a rare level?

torn mantle
brittle tiger
torn mantle
#

Its a thinking model + so fast

keen fulcrum
#

Which companies released stealth models on lmarena so far?

hardy pecan
#

To may

torn mantle
#

Could be even smaller

#

Like flash lite

hardy pecan
#

It releases it may then gg!

fleet lintel
#

I haven't seen Tomay yet. Is it any good?

brittle tiger
#

o3 and o4-mini are on Minecraft bench now

harsh flume
#

Did google remove the option to fine tune a model on AI studio or just hid it in a way im not seeing?

brittle tiger
harsh flume
#

Ohh I just found it

#

They moved it to vertex ai studio on google cloud

balmy mist
#

wait we got a new model?

keen beacon
brittle tiger
balmy mist
#

this is such a dope benchmark

brittle tiger
#

o3 and o4-mini are new additions. it's been around a couple months i think

keen beacon
tall summit
#

all his tweets are them using lmarena or webarena

plain zinc
#

Tomay better in coding?

ocean vortex
# torn mantle prompt?

it's below the linked message. Sonnet generated like 32k tokens and went into the mode of "what can be improved" as it sometimes likes to do lmao
But even without that, I'm not an advocate for claude and do not see this model as best overall, but yeah I will still say it is the best with spatial awareness. Competing models are either optimized for cost much more (smaller, like gpt4.1), or they are simply way too extremely big and in effect undertrained and lacking this skill comparatively speaking for other reasons (gpt4.5). In this aspect gpt4.5 is still better than gpt4.1, but it is not as good as sonnet.

#

essentially, spatial awareness is a skill that is very rarely captured in benchmarks. Arc-agi does it, but for some reason until very recently it was not a very popular benchmark to use or reference at all. So this tends to get compromised sooner than their usual target evals.

brittle tiger
#

claude 3.7 is svg king for sure. i think that was purposeful on their part. i think minecraft is better for spatial awareness than svg though

ocean vortex
#

I suspect Anthropic's internal evals do test for it, while OpenAI does lack proper internal testing of this

knotty anchor
#

Good day everyone
Happy Easter Sunday

ocean vortex
#

or they willingly sacrificed it for lower cost, though that is somewhat less likely...

#

especially in the light of them going crazy lately with things like o1-pro lmao

ocean vortex
knotty anchor
keen fulcrum
earnest parcel
# keen fulcrum

yeap, it's a clusterf. even people who interact with AI daily get confused (and I accidentally named some files 4o instead o4)

keen fulcrum
#

I mean google trains new models daily as well
its just why do they all need to be released to the public?

keen beacon
#

o3 > 2.5

tall summit
plain zinc
tall summit
plain zinc
#

He's there.

plain zinc
worthy thunder
balmy mist
#

I think that’s a Pokémon

#

That’s on arena?

#

Hmm

#

That’s cool

#

Came out today?

#

Also does anybody know what the best free model on open router is with favorable rate limits?

keen beacon
#

woaah

balmy mist
#

I miss having quasar on open router lol

#

Did you try them?

keen beacon
balmy mist
#

Yeah I think they lowered it cause I hit rate limits so fast with 2.5 free now

balmy mist
#

What company you think backing it?

balmy mist
balmy mist
#

I actually never used deep seek cause I was scared lol

#

Wait nvm I used v3

keen beacon
#

maybe u can use chutes but ur data is not really private

balmy mist
#

And deepseek doesn’t rate limit heavy?

keen beacon
#

w openrouter u can use chutes hosted models but 1000 rpd i think

balmy mist
#

1000 is good enough

balmy mist
keen beacon
#

rpm of 2.5 flash not high enough?

keen beacon
balmy mist
#

Flash is not free tho

keen beacon
#

the free variants

#

theres a free tier

balmy mist
#

Only pro is

keen beacon
#

really hmm?

balmy mist
#

What I looked only see thinking and normal

keen beacon
#

or use 300 usd in credits on vertex

#

i think u can do that

balmy mist
#

Wait wym 300 usd?

#

Lmaoo

keen beacon
balmy mist
#

Wait these Pokémon models

#

I love it

keen beacon
#

who made them btw

#

did u ask

#

might reveal smthing

balmy mist
keen beacon
#

which one chutes?

balmy mist
#

No the vertex thing you said

keen beacon
#

vertex u have to go direct to them

sage raptor
keen beacon
#

google vertex

#

u can use it as a provider on openrouter but no free credits

#

meh lol

balmy mist
#

Oh I see, so vertex has their own api key? I mean I might as well use the one from Google cause they give 300 as well, yeah I’ll just use that until I run out but I’m tryna plan for when I do run out lmaoo

#

If they gonna use Pokémon names they gotta produce heat

#

Ask why is it called charizard lol

keen beacon
balmy mist
keen beacon
#

dont u need a cc to claim the 300 usd in credits?

#

u used like 4 ccs lmao 🤣?

balmy mist
keen beacon
#

u can do that??

#

huh

balmy mist
#

They don’t have restrictions on it

keen beacon
#

Lmao

balmy mist
#

Yeah Lmaoo

#

I might just farm this

#

But idk they might ban me and I will be devastated

keen beacon
#

hallucination

#

i doubt they trained the cut off in

keen beacon
brittle tiger
#

Are labs specifically trolling this discord now 🤔

balmy mist
#

Ask this:
You are facing north and a rectangular tank is filled with water. The west and east sides of the tank are each painted to look like an island. A small toy boat floats in the tank. On the boat is a small figurine which is facing north. You lift the right side of the tank. From the point of view of the figurine, which island appears to rise: the west side, the east side, both, or none?

sage raptor
#

maybe its from meta

balmy mist
#

Yo did yall try the meta glasses?

#

East side

#

Smh

#

Why put crap on lmarena

#

They need to start screening these models before going on arena

keen beacon
#

cohere are employing the meta strategy of releasing 10 equally horrific models to the arena

#

dw

balmy mist
#

Lmarena doesn’t group or section models based on size? Like separate leaderboards for diff sizes? I don’t usually use arena so this might be an obvious question, I’m a web dev guy

keen beacon
#

it doesn't

balmy mist
#

How is ray?

upper wolf
balmy mist
balmy mist
upper wolf
#

Alts for free credits

balmy mist
#

Yeah don’t even waste your time, giving these companies free benchmarking lol

balmy mist
upper wolf
#

only two

keen beacon
balmy mist
#

Lmaoooo

keen beacon
#

new contender for world's worst thinking model if true

balmy mist
#

Show us

keen beacon
balmy mist
#

Do more tests like that

balmy mist
keen beacon
balmy mist
#

Are the models at least fast?

keen beacon
#

😂

#

im making use of o3 to build repos via the subscription, similar to how one would use cursor or windsurf + o3 via api $$$
but its way cheaper (200$)

#

if rayquaza is from another lab it seems the lmsys staff are naming them and having fun

#

they really ARE pulling a meta. gonna be a 200b model that bombs every meaningful benchmark 🔥🔥🔥

balmy mist
balmy mist
keen beacon
balmy mist
#

Damn I was gonna wait to get that until o3 pro comes

keen beacon
#
  • i split it with another dev so 100$
balmy mist
#

Not a bad idea

#

Hmm I might do that actually, make an account specifically for that

#

And tell people to pay me to use it

keen beacon
#

yeah some do it that way

#

some scam that way

balmy mist
#

Lmaoo

#

How they scam?

#

Google models

#

Try dayhush

#

That’s current best model that we have access to

keen beacon
#

bro we just had 2.5 flash

balmy mist
#

Out of all models?

kind charm
#

I see claybrook not giving output on webarena quite often, especially on complex questions.
I think it takes more time & there seems to be some timeout setup in the webarena, which ends before it gives an output

keen beacon
#

better than 2.5 ? coding or general model ?

#

tbh im rooting for o3 to get nr 1, then on 30 april 2.5 ultra to be nr 1, thatd be hella fun

#

I dont like that its taking too long for openai models to get in leaderboard. They probably have the votes, just making sure they aint biased.

leaden palm
balmy mist
#

@keen beacon that $10 thing applies to all free models on open router

keen beacon
#

yea now

balmy mist
#

so you get 1000 rpd with 2.5 pro

keen beacon
#

u dont get it with 2.5 pro

balmy mist
#

thats cracked

keen beacon
#

its special i think

#

or maybe u do now

balmy mist
#

yeah you do bro, they just said it

#

yeah now

keen beacon
#

oh

#

thats great

balmy mist
#

thats crazy

#

how can they do that?

keen beacon
#

they were getting hammered on free requests before i guess their quota is now sufficient enough with that criteria

balmy mist
#

for a regular ai user that does not use ai heavy in products, you never have to pay for SOTA models ever

keen beacon
#

things are extremely competitive now lol

balmy mist
#

besides o3 lol

keen beacon
#

(u can still use it in direct chat lmao)

balmy mist
#

lmaoooo

#

truee

#

it makes it so hard to pay for ai, thats why i cancelled all my subs

#

only will pay for o3 pro when it drops

#

but even that has me hesitant

keen beacon
#

2.5 pro is still unbeatable in my own personal experience lol for the things im using it on

#

its incredible

balmy mist
#

yeah thats what i am saying, for a general model its perfect

#

so this mean that a year from now we should have models better than 2.5 pro but basically free and fast like how we treat flash lite

#

2.5 is already free basically

#

but a year from now it will be free for production environments or dirt cheap

#

thats crazy

#

any model better than 2.5 is overkill for most usecases, so we basically made it if we assume we will get better and better models

#

i mean i will use it, im just saying its wild that our models are so good now that we have SOTA models that are free, where you get 1000 RPD, for openai plus you get like 50 o4 mini per day i believe lol

torn mantle
#

coming from webdev itself

#

not the model

#

it didnt even send the prompt

#

it could be interfering with JSON format on the POST data

#

not sure

keen beacon
#

Every day at 12pm, Relaxed Voyages spaceship departs from Liverpool for Dublin. Simultaneously, another Relaxed Voyages spaceship starts journey from Dublin to Liverpool. The journey takes 503 full hours in both directions.

How many Relaxed Voyages spaceships, traveling to Liverpool, will the spaceship departing now at 1pm from Liverpool encounter?

opaque adder
#

what's the latest best gemini mode?

#

in arena

#

is it dayhush or nw

torn mantle
kind charm
#

Currently in arena, I would say claybrook.

opaque adder
#

is this a joke?

torn mantle
#
  • better instruction following
  • better on reasoning
  • better on colors/ui choices
opaque adder
#

why are the same models battling each other
and the one on the right is waaay worse

#

tf

opaque adder
#

mini is ass then

torn mantle
#

i would just choose

#

"both are bad"

#

next

#

there is an option for that

leaden palm
#

why would you do that

#

that says that it's a tie

torn mantle
#

because both of them look bad to me

opaque adder
#

ok well it was gemini 2.0 flash

#

so understandable

leaden palm
# opaque adder is this a joke?

if you care about the actual rankings you would say that left is better because it has more details (assuming nothing was cut off)

opaque adder
#

i know, i did put left

torn mantle
# torn mantle

i think i found the issue, i may override the POST data see if it fixes the problem

opaque adder
#

wtf...

#

no fuking way 2.5 pro isnt nerfed in the left side

#

dayhush is seriously insane

#

im sure 2.5 pro is nerfed here somewhat

torn mantle
#

what the hell are these color choices

#

dayhush is so bad at choosing a color palette

opaque adder
#

right is significantly better dude

torn mantle
#

NW was so good at it

#

yea but it still looks bad

keen beacon
#

there's nothing wrong with the colour scheme

opaque adder
#

obviously they don't look like it was made by a 10 year experience developer

#

doesnt mean right isnt better

torn mantle
#

idk

#

looks bad to me

keen beacon
#

nightwhisper generated much better results

torn mantle
#

yea

keen beacon
#

but it could be the prompt

torn mantle
#

finally

#

someone

#

agrees with me

#

high-taste user?

keen beacon
#

the right is definitely better tho

torn mantle
#

wild?

keen beacon
#

🤣

torn mantle
#

are you one of them?

keen beacon
#

maybe 🤣

earnest parcel
#

i prefer the vertical selection, but other than that right is better in almost everything

#

dayhush aint half bad

keen beacon
#

unfortunately the answer is 43

#

no LLM has got it right 😔

earnest parcel
#

from a single attempt just now

leaden palm
keen beacon
leaden palm
torn mantle
#

why is it 43?

sage raptor
leaden palm
keen beacon
#

i don't think that's a particularly good way of figuring out the right answer tbf

torn mantle
#

yea but why is it 43

keen beacon
#

dawg

#

i mentioned dom for a reason

#

it's his question

torn mantle
keen beacon
#

yes, because that's what dom states the answer is

#

duh 😭

torn mantle
#

so you didnt try to solve that on ur own

keen beacon
#

it's not a question i use often

#

you can go and try that if you want

#

but i trust his judgement

earnest parcel
#

I tried solving it on my own, I also don't get 43. might be a trick if anything. either way, without provided correct solution it's kinda pointless

torn mantle
#

i tried it as well

#

i think llms are rounding the numbers

#

using days instead

#

of hours

#

"k ≤ (503 – 1)/24 = 502/24 ≈ 20.92 → k = 0,1,2,…,20. So there are 21 such ships already on their way when we leave."

#

thats o3

#

but on the 21st day its 21 * 24 + 1 = 505 >= 503

#

so it shouldnt count

earnest parcel
#

still makes no sense, then the models would overshoot, not undershoot

torn mantle
#

and they are also taking into account the ship that will fly at the same time as our ship

north vale
#

there's 20 ships that are on the road when it departs, 1 ship that departs at the same time, and then 20 ships that depart between the time it leaves and the time it arrives right?

torn mantle
#

yea

north vale
#

so 41?

torn mantle
#

yea exactly

#

41 is the correct one imo

north vale
#

there's a ship that arrived an hour before our ship left, and a ship that leaves 1 hour after our ship arrives? if it's 43 it could just be that the guy considers that with 1 hour of wiggleroom our ship would encounter those ships in the dock

#

that would be a bit dumb but it's my guess

torn mantle
#

so it could be 41 or 42

#

if you count the ship that departs at the same time

north vale
torn mantle
#

some of them counted the ship that departed at the same time

north vale
#

right i'm counting it in an extra category

#

20 already departed, 1 departs at the same time, 20 depart after our departure but before our arrival

brittle tiger
torn mantle
#

xd

#

wtf

north vale
#

same

#

gdm cooked

north vale
#

how though

#

could you explain that in more detail? where you don't count the boat that departs at the same time

ember rapids
#

Happy Easter to all those who celebrate

#

Google is cooking

torn mantle
# north vale how though

we need ships that departed before day 0 @ 1pm and haven't completed the 503-hour journey yet so its x <= 503 :

  • day 0 @ 12 PM: 1 hr. (1 <= 503) -> 1 ship
  • day -1 @ 12 PM: (1*24)+1 = 25 hrs. (25 <= 503) -> 1 ship
  • day -20 @ 12 PM: (20*24)+1 = 481 hrs. (481 <= 503) -> 1 ship
  • day -21 @ 12 PM: (21*24)+1 = 505 hrs. (505 > 503) -> 0 ships

counting these days gives: 0 - (-20) + 1 = 21 days.
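
The tally above can be checked with a short loop. This is a sketch under the same reading of the riddle (opposite ships leave daily at noon, ours leaves at 1 pm, and a ship still counts while its elapsed time is <= 503 hours). The function name is made up for illustration.

```python
JOURNEY_HOURS = 503
HOURS_AFTER_NOON = 1  # our ship leaves at 1 pm; the others leave at noon

def ships_already_en_route(journey_hours, hours_after_noon):
    """Count earlier noon departures that are still travelling when we leave."""
    count = 0
    days_ago = 0
    # Elapsed time for the ship that left `days_ago` days back, at noon.
    while days_ago * 24 + hours_after_noon <= journey_hours:
        count += 1
        days_ago += 1
    return count

print(ships_already_en_route(JOURNEY_HOURS, HOURS_AFTER_NOON))  # prints 21
```

The loop stops at day -21 because 21 * 24 + 1 = 505 exceeds 503, which matches the bullet list.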

north vale
#

oh it's departing at 1pm i see

torn mantle
#

yea

#

so its either 41 or 42

north vale
#

so 21 before it departs, 0 while it departs, 20 after it departs

#

oh

#

hmm

#

yeah no that seems right

torn mantle
#

0 while it departs = is what differentiated other LLMs

north vale
#

it is obviously right bc it departs at 1pm and the others depart at noon

torn mantle
#

yea

#

thats right

north vale
#

tbc you think it's 41? or are you unsure and think it might be either 41 or 42?

north vale
torn mantle
#

maybe im missing something as well

north vale
#

okok

keen fulcrum
#

Mistral offers access to news (AFP)

#

What is AFP?

torn mantle
#

kinda impressive from dayhush

#

sonnet and other models failed to visualise the solution

#

and they also failed to provide the answer

torn mantle
north vale
#

not agi

torn mantle
#

😭

#

its telling me to review logic

#

tf does that mean

keen beacon
#

the answer is 43 i think (i reread it again, it is not)

#

i might be very wrong (i was)

torn mantle
#

improved version

north vale
#

feel free to explain your reasoning wild

torn mantle
#

tf

ocean vortex
# keen beacon cc <@514836230802898954>

Yeah did not expect to see my prompt randomly here lmfao. The thing here is that the formula is really 2*days+1. And the +1 is because it will encounter an additional ship arriving just as it is leaving. The original prompt is already tricky enough for LLMs, though it can still be trained for. But when you add complexity they tend to fall back on their earlier default reasoning patterns (indicating that they didn't really understand it all that well to begin with)

torn mantle
#

yes

#

2.0

north vale
#

I made the market, ppl can bet on it (with fake money)

torn mantle
#

cool

keen fulcrum
# torn mantle

Spacex sim?
wow
Release it
One of those was very popular

north vale
#

not sure yet, open to ideas

north vale
#

lol ok

#

there obviously are lots of tenable ideas

ocean vortex
north vale
#
  1. ask 10 people what they think the answer is, resolve to majority voting
  2. wait for a mathematical proof that is formally verifiable, of the answer to the riddle
    etc.
north vale
# ocean vortex yeah it is

i don't think you justified that appropriately and i think you're wrong but if you don't wanna explain in more detail that's fine tbc you didn't ask for your riddle to be talked about ig

ocean vortex
north vale
#

that would be much worse because whoever has more mana can dictate any solution

#

it doesn't have a truth-based schelling point

north vale
zinc ore
#

The riddle involves ambiguities (at least one).

keen beacon
#

its a modification of the 15 ship one

zinc ore
#

Which imo isn't good for a mathematical prompt

north vale
#

i agree ambiguities make a math prompt worse but could you point out the ambiguities?

small haven
#

now do o1 pro vs o3

zinc ore
#

So different people use different assumptions to get the answers they get

ocean vortex
# north vale could you point me to the known riddle?
"How many ships belonging to the same shipping company, traveling in the opposite direction, will the steamship departing today at noon from Le Havre encounter?" Answer: 15. Many models answer with 13 or 14 to this day
small haven
#

marginal cost is $0, less expensive than o3

keen beacon
#

lmao polymarket polls

small haven
#

no its unlimited usage

zinc ore
small haven
#

yes i love that statement

#

wtf is going on loool

#

o3 pro gon be amazing

ocean vortex
zinc ore
#

No wait my bad

north vale
zinc ore
#

I was actually talking about the one that left an hour before yours, not arriving

north vale
#

right you would not encounter that one bc it left before you and will arrive before you

zinc ore
#

Yeah, but that's your assumption

north vale
#

i suppose so ok

zinc ore
#

"if they are on the strip together, it doesn't count"

#

So even if they are in proximity you don't count it.

north vale
zinc ore
#

One of the ships left at 12, yours left at 1

small haven
#

huh i literally have 4 tabs running o1 pro at once, never got banned for months

#

only got limited via o3 yesterday

#

they gave it back

north vale
# ocean vortex yeah it is

Does this modified riddle also have 43 as its answer in your opinion? I made it more similar to the original riddle but with the differences i think exist in your riddle

Every day at noon, a steamship departs from Le Havre for New York. The journey takes exactly 503 hours (20 days and 23 hours) in both directions.
How many ships belonging to the same shipping company, traveling in the opposite direction, will a steamship departing today at 1 pm from New York encounter during its trip?

zinc ore
#

Based on the last sentence, I agree that you don't count any grounded flights before pre-launch.

small haven
#

yea o4 mini > o4 mini high, very very cool

ember rapids
#

Possibly

north vale
#

A shipping company operates a route between Le Havre and New York.

Every day at 12:00 noon local time, a company ship departs Le Havre bound for New York.

The journey time between the two ports is exactly 503 hours in either direction.

Consider a specific ship from this company that departs New York for Le Havre today at 1:00 PM local time.

During its entire 503-hour voyage (from the moment it leaves the New York port until it arrives at the Le Havre port), how many of the company's other ships, traveling in the opposite direction (from Le Havre towards New York), will it pass at sea?

ember rapids
#

Google is in their bag rn who knows

#

Google io is may 20th

small haven
ember rapids
#

Logan is confident

#

Hard to say we’ll see

#

Anyway exciting time

#

Progress ain’t halting

small haven
#

i hate that o3 cant give the full code content

#

i asked for full updated code file, but it shows only the changed section, unlike o1 pro which gives the entirety

torn mantle
#

Its a smart model but it doesn't explain itself too much

#

It will use technical jargon without definitions

torn mantle
small haven
small haven
#

ok asking for diff code aint too bad; feed it to 4.1 in cursor to apply it

torn mantle
#

Its really a different experience from other models

#

Im not talking about how its concise /short

#

But just the way its talking and providing infos

zinc ore
#

So I did that riddle manually, and I also got 43. So I agree with Dom, the answer must be 43.

hardy pecan
#

I got 42 doing it manually

zinc ore
#

Yeah, it's 42

#

I read the riddle again, for some reason I thought it had said a ship arrived right when your ship was leaving /.-

keen beacon
#

the problem was reworded in a poor manner that doesnt work as intended i think. i dont think the answer is 43 anymore, i saw it was a modification of the 15 ship one upon first glance but didnt actually read it properly lol

hardy pecan
keen beacon
zinc ore
#

505 is still 42

#

No nvm

zinc ore
#

505 I think is 43.
504 I think is 42.

balmy mist
#

lol

#

promo vid funny, but it seems to just be a software for your laptop and not actually a glasses device

zinc ore
#

Google DeepMind CEO Demis Hassabis showed 60 Minutes Genie 2, an AI model that generates 3D interactive environments, which could be used to train robots in the not-so-distant future.

#

Just dropped 3 hrs ago

verbal nimbus
#

o1 Pro scored lower than Bing? 🤣

#

Are o3 or o4-mini in the Web Arena yet?

leaden palm
elder rapids
#

crazy

alpine coral
# north vale > A shipping company operates a route between Le Havre and New York. > > Every ...

this adds ambiguity. previously, it was:

Every day at noon, a steamship departs from Le Havre for New York. At the same time, another steamship belonging to the same company sets sail from New York to Le Havre
So i interpret that to mean that at 1200hrs (/noon) Le Havre local time, ships simultaneously depart (meaning it's 0600hrs in New York when that ship departs )

But in this variation:

Consider a specific ship from this company that departs New York for Le Havre today at 1:00 PM local time.
So it's not departing 1hr later than in the original formulation, it's departing 7hrs later.

Also:

how many of the company's other ships...will it pass at sea?
"at sea" has a specific meaning; imo it precludes any 'encounters' that occur at / around the zero time instant when one ship is arriving and the other departing from the calculation. Neither ship would strictly be 'at sea' [i.e. "...in navigable waters beyond the immediate confines of the port area."] sonn-3.7

#

fwiw i think both the original and that tweaked version are problematic - too much ambiguity

elder rapids
#

I just did it

#

it is 42

#

absolutely

#

no question about it

alpine coral
#

yeah i get 42 too

#

using the two different time zones anyway

elder rapids
#

I'm lying

#

I used 2.5 pro to solve it

keen beacon
#

well relatively well known

alpine coral
#

but still, i don't think this is one of his finest

keen beacon
#

maybe it was poorly translated

alpine coral
keen beacon
#

anyways it seems doms reworded one was poorly done and doesnt have the intended answer

alpine coral
#

nah fwiw I think dom's reformulation seems more or less the same as the original (at least as it appears in English here). I think it's a crappy question, as the comments highlight - like there's several non-trivial flaws / alternative interpretations imo
[edit: the spaceship reformulation is flawed.. i was confused by what was being referenced]

elder rapids
#

this one is 15

#

it's different

keen beacon
#

yes but doms formulation is 43

#

the intended answer

elder rapids
#

no

keen beacon
#

but its actually 42 i think

elder rapids
#

the intended answer is 42

keen beacon
#

no its 43

elder rapids
#

and it's the answer

keen beacon
#

according to dom

elder rapids
#

lol

elder rapids
#

and mistook it for 43

keen beacon
#

Claybrook vs nightwh8sper

#

What's better

elder rapids
#

41 is more valid than 43

keen beacon
#

no

#

u can see his reasoning lol

elder rapids
#

send

keen beacon
#

it doesnt make sense

elder rapids
#

"every day"

#

"conversely"

#

"exactly 7 days"

#

so might be a skill issue or non native English speakers

#

tbf not everyone cares that much about the wording and assume it'll be super intuitive

keen beacon
#

specific mechanics in le havre dont apply because of the changes he made

alpine coral
#

can you link to the version from Dom where 43 is the putative answer

keen beacon
elder rapids
keen beacon
alpine coral
keen beacon
#

his version doesnt make sense

alpine coral
#

i glazed over the spaceship versions ha

keen beacon
#

if u change it to 505 i think it works though

#

but models still get it anyway

#

so this type of question isnt particularly challenging anymore

#

Is it better than o3

#

?

#

?

#

The Google model

#

How does it compare

#

which google model??

elder rapids
#

tbh

keen beacon
#

The new one on webdev?

elder rapids
#

y'all gotta be comparing o4 mini

#

to 2.5 pro

#

not o3

keen beacon
#

y

balmy mist
#

whats with this 43 i keep seeing

elder rapids
balmy mist
#

oh lol

elder rapids
#

the server seems stuck on

#

someone proposed the wrong answer initially

#

and the models were actually getting it right

#

but nobody could really figure it out without second guessing themselves because the models seem to get answers that are very close

#

just not the specific answer 43

keen beacon
#

dom made a mistake when rewording it, and assumed an incorrect ground truth. this type of question seems conquered by models tho

elder rapids
#

ye

#

and how strong the model forgiveness is somewhat decides how it's going to approach it

#

since the questions are flawed in general

keen beacon
#

Is claybrook a coding model or a new general purpose model

narrow elbow
#

Google has so many models, I'm a bit suspicious if the names of those models come from Google? Shouldn't we come up with some more fun names like: temusk, closesam?🤪

elder rapids
#

ion think it's a truly massive model though, doesn't seem to know more than 2.0 pro or 1.5 pro

#

and running a model like that wouldn't be practical and somewhat dismisses the fact 2.5 flash exists

#

which in comparison to 2.5 pro, isn't a major downgrade, yet is the cheapest and best reasoning/non reasoning model by far

#

we can infer flash is less than 100b due to its 8b variant: a huge flash would be redundant and would undercut the pitch the original flash makes by being fast and small. why create size extremes and therefore room for an intermediate "flash" model

#

but it doesn't exist

elder rapids
#

me saying cheapest is relevant to that entire claim

#

its not saying it's the "best"

#

no?

#

bruh what

#

you're getting confused

#

2.5 FLASH

#

depends

#

2.5 flash is LITERALLY just a smaller 2.5 pro

#

same techniques prob

#

if 2.5 pro is so good generally, since it knows a lot

#

then it's gonna be good generally

#

if 2.5 flash is attempting to be general, but doesn't know as much

#

I think it's gonna simply appear worse

#

because there's nothing specific that stands out

#

whereas o3 mini vs o1, coding

#

o4 mini vs o3, coding/puzzle tasks

#

ye the gap is getting closer and closer for the narrow tasks

#

although o3 and o4 mini hallucinate a ton

#

and o4 mini doesn't quite comprehend some stuff

#

I think it generally does better

#

than o3

#

for direct if then processes

#

ye depends tho

keen beacon
#

it hallucinates significantly more than o1

elder rapids
#

still, depends

keen beacon
#

lmao don't try and downplay it

elder rapids
#

how come 2.5 pro can so effortlessly process this stuff

#

without hallucinating at all

#

I don't want to bring up an example this way

#

but it's def clear

#

p sure this is still the flaw of the model

#

o3 has better reasoning retention than o1

#

and that's why it's doing better at context (just not as good as some benchmarks suggest)

#

but if it's still failing hard at recalling instruction and iterating through it

#

then is it truly a good model

keen beacon
#

the way they have trained the model to respond differs from o1 in that o3 is eager to provide specific facts and figures and sources, which is all well and good but it just ends up all false

elder rapids
#

ion think so anymore

#

I think they released it to dismiss o3

#

if a summer release is true

#

they have time to simply do this with o4

elder rapids
#

in my testing 4.1 stomps deepseek v3

#

but still

#

2.0 pro just seems to be the best

#

as far as just being smart goes

elder rapids
#

so?

#

not at all lmao

#

4.1 is really just

#

good

#

but it just doesn't understand stuff implicitly

#

hn

#

2.0 pro is way better

#

it's not even close tbh

#

and deepseek v3.1 is also mid

#

hallucinates way too much

#

awfully assumptive since my primary concern is practicality

#

not puzzles or narrow tasks lol

#

I just happen to use puzzles for maybe mathematical reasoning

#

or testing how it processes a claim

#

ye

keen beacon
#

You can exclude providers on or

elder rapids
#

2.5 flash 💔

#

.60 lmao

#

why would you reason using it

#

yet it's better than 2.0 flash

#

while flash 2.0 has always been the absolute best price for performance

#

this just doesn't follow

#

bruh?

keen beacon
#

They were working on it for a while so I don't think so

narrow elbow
#

Prefer to hand over data to U.S. companies, which are more trustworthy?

elder rapids
#

def not lmao

keen beacon
#

It's also used as the base in the retrained o3

elder rapids
narrow elbow
#

🤣

elder rapids
#

not "barely"

#

it's pretty visible

#

2.0 flash is hella straightforward

#

albeit concise

#

2.5 flash retains conciseness while giving insight

#

it's obviously a major upgrade

#

including instruction tasks

narrow elbow
#

racism😏

elder rapids
#

although it doesn't get "better" after practically perfect in retaining instructions

keen beacon
#

2.5 flash isn't good?

elder rapids
#

it's good asf

#

I've been using it

keen beacon
#

Do you use the non thinking mode too

elder rapids
#

a ton

elder rapids
keen beacon
#

Oh

#

Hmm I was considering it for a task later

elder rapids
#

ye, this pretty much goes for all smaller models

#

if you want very specific feedback

#

go for them

#

W utility

#

def efficient

#

🙏

keen beacon
#

Qwen max isn't small tho lmao Im interested in them open sourcing it

#

Depends tbh I heard prompt processing can be slow on macs

elder rapids
#

qwen 2.5 seems the right choice

keen beacon
#

Qwen 3 is about to come out anyway

elder rapids
#

try hooking up full deepseek r1

keen beacon
#

Qwen 2.5 is super old

elder rapids
#

this is the most efficient you can get

keen beacon
#

You can probably run qwq well tho and fast on a mac

elder rapids
#

0 temp deepseek r1 for low context output goes crazy

#

🙏

keen beacon
#

Original qwq wasn't that finicky I think

hardy pecan
#

This pretty much sums it up unfortunately

keen beacon
#

Eh usable even if it's somewhat slow I think

#

Definitely spoiled by api speed lol

#

O lol

#

It makes more sense with Mac mini s I guess

#

Yeah much better compute compared to a Mac I think and higher bandwidth

#

It'll do well if it can fit

#

Yea

keen beacon
#

o3 was a step forward for everything except in combatting hallucinations

hardy pecan
#

We all feel it right

#

hallucinates more, and more lazy

#

o3 benefits with a great prompt

#

if you are clear and concise and verbose with your prompt, it'll do well, but it definitely will cut corners if you don't watch it

novel flame
#

What makes you say that? Their bet on JEPA?

plain zinc
fleet lintel
#

what are the incoming models that folks are excited about ?

hardy pecan
#

Tomay appears to be a thinkerrrr

earnest parcel
torn mantle
#

Yea def r2

fleet lintel
hardy pecan
#

I wont lie, I despise R1 when it comes up in lmarena, since im sitting there waiting for 4 mins to an output..

#

I hope r2 is faster

cedar tide
#

New model the v5 and V6 of "cobalt-exp-beta"
(Amazon model)

#

GPT 4.1 mini and nano are in the arena?

torn mantle
#

i think two models were added

kind cloud
#

by Amazon

#

(Yesterday's screenshot)

cedar tide
#

Apricot v1 was on the arena a month ago

keen fulcrum
torn mantle
calm sequoia
#

Im often very impressed with the qwen 32B. I wonder what could they achieve with normal size models and inference compute

alpine coral
#

i don't get 43 for that liverpool-dublin spaceship formulation.. (it's indeed botched)

#

funny thing is.. by changing it from Le Havre/NY to Liverpool/Dublin, it actually removes one of the chief ambiguities in the original version (timezones), as both cities are in the same time zone. Then the other change made, rather than both routes departing at the same times, in this version there is a 1hr delay, so there is no overlap (putting aside the problematic aspect of whether that overlap actually constitutes 'encountering' another company ship). But..using 503 hours.. yeah.. I get 42. With 505hrs, there is sufficient time that the 1hr delay means you encounter an additional Liverpool-bound spaceship, bringing it to 43. The stuff about spending time in the port with people boarding etc is nonsense and cannot be used as part of the 'answer' lol
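
A minimal sketch of the counting argument above, assuming opposite-direction ships leave every 24 hours, ours leaves `delay_h` hours after one of those departures, and both directions take `duration_h` hours. The function name `count_encounters` and its parameters are hypothetical. The `inclusive` flag is an assumption standing in for the disputed question of whether a meeting at the exact moment of a departure or arrival counts.

```python
def count_encounters(duration_h, delay_h=1, period_h=24, inclusive=True):
    """Count opposite-direction ships whose voyage overlaps ours in time.

    Opposite ships depart at t = k * period_h for every integer k and
    travel until t = k * period_h + duration_h. Our ship travels from
    t = delay_h to t = delay_h + duration_h. With inclusive=True, a ship
    that arrives or departs at the exact instant we depart or arrive is
    counted as an encounter; with inclusive=False it is not.
    """
    start, end = delay_h, delay_h + duration_h
    # Scan a window of departures wide enough to cover every possible overlap.
    k_max = (duration_h + delay_h) // period_h + 2
    count = 0
    for k in range(-k_max, k_max + 1):
        dep = k * period_h
        arr = dep + duration_h
        if inclusive:
            overlaps = dep <= end and arr >= start
        else:
            overlaps = dep < end and arr > start
        count += int(overlaps)
    return count


if __name__ == "__main__":
    for hours in (503, 504, 505):
        print(hours,
              count_encounters(hours, inclusive=True),
              count_encounters(hours, inclusive=False))
```

Under these assumptions the inclusive convention gives 42 for 503 hours and 43 for 505 hours, while the strict "at sea" reading drops each by one (41 and 42). With `delay_h=0` and a 168-hour trip, the inclusive count is the classic 15.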

#

what a mess..anyway.. let's move on ha

hardy pecan
#

yep

torn mantle
#

Dom gave us quite the entertainment for the past 2 days

hardy pecan
#

Turns out LLMS actually DO know their stuff, and the question creator is confused XD

#

agi confirmed

brittle tiger
#

AGI would question the premise of spaceship travel between Liverpool and Dublin and also note two launching rockets could see the other while reaching orbit due to proximity

ember rapids
#

Will titans be used in Gemini 3.0?

#

Or perhaps 2.5 ultra if we get it

novel flame
balmy mist
#

do we have any interesting news this week?

#

the only thing about having good weeks is that the next week is usually dry lmaoo

alpine coral
#

not sure but feels like there's a bit of momentum and dynamism atm that was lacking since reasoning models were first introduced

#

just an unrelated curiosity... anyone have any thoughts why India seems to be kinda absent from the AI landscape? (I say that.. though perhaps it's misplaced)

#

not: where is India's deepseek moment?.. just: why doesn't India seem to appear in the picture at all? big country; lots clever minds etc..

wintry tinsel
fleet lintel
hardy pecan
alpine coral
wintry tinsel
#

Google, Claude, Open AI, Grok, Deep Seek, everyone else is chump change

alpine coral
wintry tinsel
fleet lintel
wintry tinsel
#

They are not worried about winning any AI race, Indians are not very creative, hard working and smart, but slow to new technology

alpine coral
fleet lintel
alpine coral
#

i still don't find it satisfactory as an explanation (brain drain / domestic politics).. there's so much Indian pride...

#

i thought maybe the diversity of the country

#

there's so many languages

#

unlike China. where [the central govt] is just like: you're all Han, and speak Mandarin

wintry tinsel
#

Hopefully this helps clarify

wintry tinsel
wintry tinsel
#

Japan is also a country known for its hard working and smart tech industry adopting technologies like the bullet train before most other major urban centers, yet they are absent from AI space too

alpine coral
fleet lintel
#

I am concerned a bit about EU though. With america ruled by mad king, EU might be in trouble

wintry tinsel
#

Japan and India should collaborate on AI, now that Japan has opened floodgates to immigration

#

It won’t happen though

#

Samsung in South Korea is also in panic mode as they don’t have any competitive AI models and their new phone chips are being outperformed by others with better AI implementation

fleet lintel
#

Samsung is too dependent on Android and Google

#

and I think they will continue to

#

Any news in the Industry? Which models are businesses adopting more of ? OAI models?

ember rapids
novel flame
novel flame
# fleet lintel Any news in the Industry? Which models are businesses adopting more of ? OAI m...

I think enterprises are using OpenAI models (and MS Copilot, which has to be a derivative of OpenAI?) if they are in the MS ecosystem; and Google models if they are in the Google ecosystem. Apart from that I'm pretty confident a lot of businesses have accounts with OpenAI just because that's what they know / what came first, and if they build software a lot of them are probably paying for GitHub Copilot (again, MS/OpenAI). Anthropic is a developer favorite of course, but that's a narrow segment.

I don't know to what extent any of the other big labs have captured the enterprise market.

cedar tide
opaque adder
#

dayhush or nw?

cedar tide
#

but I don't have the impression that Amazon really wants to be in the race, but maybe they only want to have their AI for their chatbot and for Alexa

fleet lintel
balmy mist
#

people think that we might get deepseek this week right?

#

r2*

fleet lintel
#

Hard to say but I think it's likely

balmy mist
#

its like what happened with microsoft and OAI

keen beacon
balmy mist
#

they still ended up making their own model

keen beacon
#

either this week or next week for sure

fleet lintel
#

both Google and Amazon partnered with Claude. But Amazon also creates garbage like nova

balmy mist
keen beacon
#

same goes for qwen 3

#

i do have pretty high hopes for R2

#

a few little birdies tell me it gives o3 a run for its money

balmy mist
keen beacon
#

it will be behind in some things and better in others

#

2.5 is very good

balmy mist
#

last time it was pretty much o1 right? during o1 phase?

keen beacon
#

yeah

balmy mist
#

so r2 should be o3 ish

#

wow

#

that would be insane

fleet lintel
#

I'll be disappointed if they don't beat 2.5 Pro in half of the categories

balmy mist
#

your connects talking

balmy mist
keen beacon
#

either way it'll be state of the art as an open model by a big margin

#

i don't know how much of a threat qwen 3 will pose though

#

don't have any connections there lmao

novel flame
# fleet lintel is amazon serious abut their models? Is it worth giving them any benefit of dou...

It's a good question. Their Titan series of models was utter trash; then they suddenly came out with an entirely new set of models - Nova, which is not SOTA nor within spitting distance of it, but let's be fair here, Nova Pro was not terrible either, and it was a gargantuan leap up from Titan. If they can make another leap like that then they could potentially join the top labs. But that's a big if.

keen beacon
#

speaking of nova titan

fleet lintel
keen beacon
#

they said early 2025

#

where the hell is it

balmy mist
#

@keen beacon are your birdies letting you test out the model 👀

keen beacon
#

if cobalt on the arena is titan it sucks

balmy mist
balmy mist
keen beacon
#

they don't really hire people outside of china for that stuff

fleet lintel
balmy mist
#

you can be an unbiased source for all companies

keen beacon
#

for national security reasons mostly

#

the ccp aren't taking any chances

balmy mist
#

but dont forget yall, o3 pro coming out next week

#

and if its the same gap from o1 to o1 pro

fleet lintel
keen beacon
#

i may re-sub to chatgpt pro next week

#

depends on how good o3 pro is

balmy mist
#

i think we dont see a model beat o3 pro until end of year

balmy mist
fleet lintel
keen beacon
#

trust me

fleet lintel
#

yes, month.. wow, they are so generous 😄

keen beacon
#

to be fair you do get practically unlimited o3, o4 mini, 4o + image gen

#

if only they didn't cap o3 pro

balmy mist
keen beacon
#

sora is unfortunately pretty bad

balmy mist
#

its def worth the price imo

keen beacon
#

hopefully sora 2 drops this year

novel flame
keen beacon
#

oh yeah and imagen 4 is coming in the next 1-2 months btw

#

it's in internal testing

balmy mist
#

but whats this talk about openAi branching out so much into shopping, social media, etc..

fleet lintel
ocean vortex
fleet lintel
#

Actually nova pro is 8x more expensive than 2.0 flash

fleet lintel
#

each extra venture takes effort and energy away from other things

#

i have seen businesses fail multiple times because of expanding too fast. it is a common pitfall

ocean vortex
sonic tendon
#

i hope

fleet lintel
sonic tendon
ocean vortex
novel flame
sonic tendon
#

i think they have a pretty good chance

fleet lintel
ocean vortex
#

o3 is really not that impressive despite the hype and "feelings"

#

the biggest practical change is how they made it now use every tool that they have, quite extensively

brittle tiger
# balmy mist right now can we reliably say that o3 is better than 2.5 pro?

Definitely depends on use case. Bigger context makes 2.5 shine in many cases. Tooling and less restrictions vault o3 over others. I was talking to some crazy girl this weekend and her and all her friends have started using o3 and deep research to analyze guys on dating apps. o3 much more willing to dig up social profiles than Gemini. I tested myself and o3 output freaked me out. Gemini didn't touch socials.

fleet lintel
balmy mist
#

as long as r2 is near o3 its a win, an o3 thats around r1 price is nuts

#

it doesnt have to be better

keen beacon
#

r2 is absolutely near o3

#

well more than near