#general
ye but you and I both know there are cases where that's not true at all
ofc
whether that's more meaningful than your benchmarks
but that's beside the point
therefore it's not the general intelligence you think it is
nah , i just use them both in programming for example, o3 has bigger edge
for general stuff its usually gemini 2.5
specific coding performance tasks are held back by total coding capability
I think both gemini and especially o3, are in the non-zero agi % metric
yeah
dayhush is rly good wow
Correct.
I think the problem for many is that it’s easy to mistake greedy decoding (which is deterministic and also boring to the point where it would be practically useless) for temperature=0 (which still has inherent randomness but the probability distribution is more biased towards the top logits).
Mathematically, temperature=0 isn’t even possible since it would cause a divide-by-zero so LLMs just change it to a very small value >0 behind the scenes.
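For reference, a minimal sketch of temperature-scaled softmax in plain Python (the logits are made up for illustration). It only shows the math: T sits in the denominator, so T=0 cannot be plugged into the formula directly, and as T shrinks the distribution collapses onto the top logit. What any particular API does when it receives temperature=0 is not something this sketch can show.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        # The formula divides by temperature, so 0 has to be handled elsewhere.
        raise ValueError("temperature must be > 0 for the softmax formula")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical logits, not from any real model
for t in (1.0, 0.5, 0.1, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

At small T nearly all of the probability mass lands on the first (largest) logit, which is why the T -> 0 limit behaves like picking the top token.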
that's the thing, chatgpt-latest was above o1 on lmarena even when they were using the old gpt4o base model
It does very well on lmarena, openai reasoning models much less so... till now at least, if this doesn't change I expect o3 to be no higher than 3rd
yeah I was actually confused af by the temp 0 at first, it made no sense to me lmao
but I understand why they implemented it this way. It's just much faster and easier than writing smth like 0.00000001
Where did you hear that? 0 temperature does not get changed to a small temperature lol. Assuming no other sampling options are used, it's greedy decoding
Bruh the misinformation wtf???
?????
????
I get it. The o1 was best but took only the lower ranks.
on many platforms they don't know how to greedy decode ig
i've seen a few docs that say something along the lines of 0 temperature is changed to epsilon
Huh really?
???????
Check vllm Lol
you can't divide by 0 lol
Omg
They check if it's 0
Omfg
If it's 0 they switch to greedy assuming no other sampling options are enabled. Check vllm docs
Vllm is industry practice
that's why it's custom implementation. If there was no custom implementation and no handling 0 differently than any other value, it wouldn't be possible
This is industry standard
Don't change the goalpost since you thought they changed it to a small value lmfaoooo
it is not handled the same way as other value, that's the main and only point
I don't get why you got so fussed about it lmao
Can you link me this is news to me
???????
It's one of the most hilarious things Ive read all day
my point was to show what you would need to do if 0 wasn't handled explicitly
nothing less
nothing more
maybe take a chill pill @keen beacon I have no clue what's gotten into you LMAO
Temp=0 in the math is not possible, if implementations just change to greedy decoding then there’s no longer a temperature value involved in the inference.
But I genuinely thought the standard was to sub a small epsilon value. I could be wrong about that for sure.
Yes obviously I thought everyone knew 0 was greedy decoding lmao
Otherwise what would you even do there
yeah exactly. And when you are treating it different to any other value that's basically just confirming that 0 is not possible technically speaking
this doesn't make the temp 0 magically valid lmfao. All it means is that they are interpreting your technically invalid parameter input to provide you with a closest match different solution
The industry has used 0 for that for an extremely long time
You would know this yourself
Otherwise you would not be able to use 0
that doesn't change a single thing or disprove what I was saying really...
It is a special value for all intents and purposes
It is not setting it to a small non zero value though. It just picks the top token
that's what we were saying before you replied here with "??????" for no reason at all lmfao. If there was no custom interpretation and temp parameter was strictly temperature directly applied, you would have to use it differently
But everyone does that lmao. I wouldn't consider it a custom interpretation lmao. Not setting 0 to that would be a custom implementation at this point, no one expects it
omg... that's because temp is not getting passed as "0" and it doesn't use it the same way like if you inputted any higher value.
Did you just figure that out just today lol
we were already talking about it before you showed up 
so no
You agreed that it was set to a small non zero value though
how to vote for leaderboard?
what I meant was 0 was much easier than setting a smallest possible value that would limit it to the only possible highest probability, effectively greedy decoding, which in practice probably is not even possible in most cases to express that as a number...
hello friends!
I definitely overreacted on something that is a trivial correction lol. I apologize I wanted to get mad at smthing
OK I had to check... and @keen beacon is wrong. I just made four identical requests to Claude 3.7 Sonnet on Bedrock:
{
  "prompt": "How much wood could a woodchuck chuck?",
  "model": "bedrock/claude-3.7-sonnet",
  "config": {
    "temperature": 0,
    "max_tokens": 220
  }
}
See this
Installing mac os on my laptop
vLLM is a library, not how LLMs are built
It is the most used open inference engine.
And if you are building toy LLMs at home then I'm sure you're using vLLM, but that does not make it the industry standard, nor does it mean anyone else is following its design decisions
That is a ridiculous claim
Vllm optimizations and how throughput changed dramatically against other providers: https://developers.redhat.com/articles/2025/03/19/how-we-optimized-vllm-deepseek-r1#deepseek_r1__a_complex_model
Also iirc deepseek uses a fork of vllm, are they building toy llms?
I can go on about this but I'm going to stop here since you're making ridiculous claims
I love lmarena beef
Congratulations, you found one example of a major lab that uses (checks notes) a fork of vLLM which may or may not use the method of replacing temp=0 with greedy decoding. I tested the APIs from OpenAI, Google and Amazon Bedrock and all of them did not. I just don't have the time to discuss what 'industry standard' means, nor to try all the non-SOTA labs' APIs to find out what I already know.
Why tf would they replace the greedy decoding code for no reason lmao
0 temperature/greedy decoding is not deterministic. Anthropic's API even notes it lmao
You can't even set a seed on the anthropic api lol
I'm not saying they did, I am saying you don't know they didn't and your entire argument seems to be "vLLM docs say vLLM does this, so this is industry standard". But evidently not Google or OpenAI or Claude on Bedrock, so you are grasping
OMG are you just trolling? Anthropic's API docs say exactly what I've been saying: "Note that even with temperature of 0.0, the results will not be fully deterministic." That's my whole point. But they do not say "greedy decoding" because they do not switch to greedy decoding, they use a small epsilon which leaves it non-deterministic.
It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0. This is an odd behavior if you’re used to dense decoder-only models, where temp=0 should imply greedy sampling which should imply full determinism, because the logits for the next token should be a pure function of the input sequence & the m...
Google announced they used Moe too
While we don't know for sure for the others its highly likely to be Moe too
Link me to a resource anywhere where this behavior occurs and is documented
That they change with epsilon
While this isn't definitive I found it convincing a while back at least. I dont recall it much
"...under capacity constraints". That is describing anomalous behavior. I just ran tests across three APIs and in just four requests (no cherry-picking) I got different results from all of them.
Please show me anything that supports your hypothesis
Actual documentation
It's not a hypothesis -- in a Transformer LLM, temperature is used in the denominator and you cannot divide by zero.
See the 'T's right there. This is how I "support my hypothesis". By knowing the actual math.
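The image being pointed at is not preserved here, so what follows is the textbook temperature-scaled softmax rather than a copy of it. With logits $z_i$ and temperature $T > 0$:

```latex
p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},
\qquad
\lim_{T \to 0^{+}} p_i =
\begin{cases}
1 & \text{if } i = \arg\max_j z_j \text{ (unique maximum)} \\
0 & \text{otherwise}
\end{cases}
```

The expression is undefined at $T = 0$ itself, but its limit is exactly "pick the top logit", which is the greedy choice.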
There's literally a check to see if it's 0. Do you see your api calls error out lmao
If it's 0 it goes into greedy decoding if no other sampling options are enabled
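A rough sketch of the "check if it's 0" behavior being described: temperature == 0 short-circuits to argmax (greedy), and anything else samples from the temperature-scaled softmax. This is an illustration only. `sample_next_token` and its parameters are hypothetical names, and this is not vLLM's actual code.

```python
import math
import random

def sample_next_token(logits, temperature, rng=random):
    """Return a token index; temperature 0 is treated as greedy decoding."""
    if temperature == 0:
        # Special case: no division happens at all, just take the top logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [0.1, 3.2, 1.7]  # hypothetical logits
print(sample_next_token(logits, 0))                      # always index 1
print(sample_next_token(logits, 1.0, random.Random(0)))  # some index in 0..2
```

With this shape the API call never errors out on 0, because the 0 never reaches the division.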
Keep on believing what you want it's a weird hill to die on
There is... in vLLM specifically. And vLLM switches to greedy decoding, which removes (99.99%-100% of) the non-determinism. So since these are not deterministic, they are not using greedy decoding.
It's so nice seeing a real talking instead of "look what svg pokemon llm drew for me"
I apologize on my part for the heated debate here tonight, this was not a productive contribution to the channel.
it was
i learned a bit by following it
Actually re-reading this, this can indeed explain non-deterministic outputs even if the implementation replaces the zero temp with greedy decoding. MoE (which they're probably all using) concurrency means partial results may arrive/recombine in different order each time, and that is a totally plausible explanation for small differences in the resulting logprobs. As for whether individual LLM providers do one or the other, who knows; the labs don't reveal the details. I heard about the 'epsilon' trick from Andrew Ng, who is a fairly reliable source of how things are done; but my evidence was hinged on observing non-deterministic results, and the MoE explanation is as valid as the epsilon explanation. I concede the argument.
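The reordering point can be illustrated without any model at all: floating-point addition is not associative, so combining the same partial results in a different order can change the last bits of the sum. A tiny sketch with made-up numbers (not taken from any real model):

```python
a, b, c = 0.1, 0.2, 0.3  # stand-ins for partial results

left_to_right = (a + b) + c
right_to_left = a + (b + c)

# Same three numbers, different grouping, slightly different result.
print(left_to_right == right_to_left)
print(left_to_right, right_to_left)

# If a competing logit sat at exactly 0.6, only one ordering would beat it.
print(left_to_right > 0.6, right_to_left > 0.6)
```

A difference this small only matters when two candidate tokens are nearly tied, but in that case even greedy decoding could pick a different token from run to run.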
Continued there not to flood this chat #1363229951851237376~~ (ignore the thread name 😂 )~~
functionality wise, sonnet is kinda better, collapsible are working + you can edit the code
then make your own benchmark to prove it or it's just not true lol
or heck, at least a singular easily reproducible coding challenge where it would be clear 2.5Pro, o3 and o4-mini all output a notably worse solution
though we probably need to add some non-reasoning model too if you are to do this... gpt4.1 and/or updated grok3 should do
Bs
o3
some specific coding tasks reasoning models are at a disadvantage kind of by design 👀
especially if you are asking smth that you could easily google for and fit the answer with minimal changes. Reasoning can work against it trying to reinvent the wheel sometimes
I just tried this now with Gemini pro 2.5 , Gemini flash 2.5 , O3 , O4 mini
Give me two words with those letters FP DAS .
Answer : pf sad (pf : sound )
Surprised that the winner was Gemini 2.5 flash
I sent you mine. I wish the others would match their benchmarks, but Sonnet is still overall coding king for my tests 😕
has anyone been testin 4.1? how good is it?
yeah I wrote it too generally looking back at it. 🤷♂️ For coding tasks that involve visuals 3.7 sonnet can indeed be the best still. Like even just drawing (coding) in svg it is gonna be better than the competition. Overall though if we consider everything coding related and not just this, sonnet is not the best after all is said and done
Dom, 3.7 performs really bad at coding
it's behind in many things coding, but visual coding / web design or my mentioned svg... try it and you will see. I would still classify it the best in this more specific aspect
like for this thing: https://discordapp.com/channels/1340554757349179412/1340554757827461211/1362774913038946394
this is what 3.7 sonnet did:
the way they fine-tuned it they are kind of playing on their strengths as well. I had thinking enabled and reasoning was way longer than for o1/o3 with high reasoning effort for this prompt. Chat model is already doing it better, and thinking enabled (since I'm comparing it against reasoning models) just amplifies this
WTF
woah
prompt?
O3 is better for code architecture and bug fixing not for writing code
I didn't know Artificial Analysis had an image arena:
Midjourney V6 and V7 are on there too, although idk why it's so low
When will o3 be put on the leaderboard?
As requested by several - OpenAI-MRCR results for o3: https://x.com/DillonUzar/status/1913806594199990582
Finally got access. They had some server-side issue with our org account, got resolved today.
Overall, pretty strong performance over its context window! Really strong up to 64k context, then drastically falls after, but that's to be expected (it hasn't specifically been improved for long context like the GPT-4.1 series, or Gemini 2.5). Should be really interesting to see GPT-4.1 applied to o-series!
I've also reran the bench against all 21 models I've tested so far, tracking costs and individual test results, will post results on that over the next couple of days.
Enjoy
Is there an LLM tool which allows an agent to crawl through pages of a specific website?
o3 sux
claybrook has a lot of rendering issues on webdev
never heard of it before now but artificial analysis just seems like one of infinite benchmark compilation websites
Tomay model from goog?
So similar to claybrook
Output wise
Claybrook & tomay outputs are more close to eo than to gemini 2.5 pro 03
Where did you find it?
I don't see him.
Is it of a rare level?
Yes
I haven't seen. saw a tweet with image of regular arena
Its a thinking model + so fast
Which companies released stealth models on lmarena so far?
Yea def flash version
To may
If it releases in May then gg!
I haven't seen Tomay yet. Is it any good?
o3 and o4-mini are on Minecraft bench now
Did google remove the option to fine tune a model on AI studio or just hid it in a way im not seeing?
i think they're sunsetting it for aistudio and leaving for api. aistudio can only do 1.5 flash still and it is hidden
wait we got a new model?
whats this bench xD , link ?
o3 and o4-mini are new additions. it's been around a couple months i think
cool
are they in this server i wonder
all his tweets are them using lmarena or webarena
thank you for this!
Tomay better in coding?
it's below the linked message. Sonnet generated like 32k tokens and went into the mode of "what can be improved" as it sometimes likes to do lmao
But even without that, I'm not an advocate for claude and do not see this model as best overall, but yeah I will still say it is the best with spatial awareness. Competing models are either optimized for cost much more (smaller, like gpt4.1), or they are simply way too extremely big and in effect undertrained and lacking this skill comparatively speaking for other reasons (gpt4.5). In this aspect gpt4.5 is still better than gpt4.1, but it is not as good as sonnet.
essentially, spatial awareness is a skill that is very rarely captured in benchmarks. Arc-agi does it, but for some reason until very recently it was not a very popular benchmark to use or reference at all. So this tends to get compromised sooner than their usual target evals.
claude 3.7 is svg king for sure. i think that was purposeful on their part. i think minecraft is better for spatial awareness than svg though
I suspect Anthropic's internal evals do test for it, while OpenAI does lack proper internal testing of this
Good day everyone
Happy Easter Sunday
or they willingly sacrificed it for lower cost, though that is somewhat less likely...
especially in the light of them going crazy lately with things like o1-pro lmao
yeah you too, happy Easter! 
It’s nice seeing y’all here
yeap, it's a clusterf. even people who interact with AI daily get confused (and I accidentally named some files 4o instead of o4)
I mean google trains new models daily as well
its just why do they all need to be released to the public?
o3 > 2.5
makes sense to me now
Gemini 2.5 Pro with Canvas >> o3 😁
what's canvas?
Oops, Canvas is such a thing designed for coding
And for small documents before exporting to docs 😉
I think that’s a Pokémon
That’s on arena?
Hmm
That’s cool
Came out today?
Also does anybody know what the best free model on open router is with favorable rate limits?
woaah
free unlimited gpt 4.1 🤣
Yeah I think they lowered it cause I hit rate limits so fast with 2.5 free now
Fr man we did not appreciate that as much as we should have
What company you think backing it?
deepseek
What do you think is the best free model rn in terms of api and good rate limits?(so 2.5 is not included)
Really?
I actually never used deep seek cause I was scared lol
Wait nvm I used v3
maybe u can use chutes but ur data is not really private
And deepseek doesn’t rate limit heavy?
w openrouter u can use chutes hosted models but 1000 rpd i think
1000 is good enough
Chutes? Hmm I didn’t see that provider
rpm of 2.5 flash not high enough?
ya deepseek v3 r1 i think etc
Flash is not free tho
Only pro is
really hmm?
What I looked only see thinking and normal
ya free credits
Okay I will try that provider, that’s in open router right?
which one chutes?
No the vertex thing you said
vertex u have to go direct to them
google vertex
u can use it as a provider on openrouter but no free credits
meh lol
Oh I see, so vertex has their own api key? I mean I might as well use the one from Google cause they give 300 as well, yeah I’ll just use that until I run out but I’m tryna plan for when I do run out lmaoo
If they gonna use Pokémon names they gotta produce heat
Ask why is it called charizard lol
it might be the same thing tbh im so confused with the google offerings
Fr lol, all these platforms but as long as i get free credits im cool lol, i got like 4 accounts lol
Nahh the same cc
They don’t have restrictions on it
Lmao
Yeah Lmaoo
I might just farm this
But idk they might ban me and I will be devastated
ya i personally wouldnt test it myself ig. but if it was allowed lol
Are labs specifically trolling this discord now 🤔
Ask this:
You are facing north and A rectangular tank is filled with water. Each side The West and East phases of the tank are painted to look like an island. A small toy boat floats in the tank. On the boat is a small figurine which is facing north. You lift the right side of the tank. From the point of view of the figurine Which island appears to rise. The West side, East Side or Both or None.
maybe its from meta
Yo did yall try the meta glasses?
East side
Smh
Why put crap on lmarena
They need to start screening these models before going on arena
if ur a big company u can put any model into the arena it seems lol
cohere are employing the meta strategy of releasing 10 equally horrific models to the arena
dw
Lmarena doesn’t group or section models based on size? Like separate leaderboards for diff sizes? I don’t usually use arena so this might be an obvious question, I’m a web dev guy
it doesn't
How is ray?
they banned me for doing this 😭
That seems like an obvious feature to have tho
What
Alts for free credits
Yeah don’t even waste your time, giving these companies free benchmarking lol
How many alts you had?
only two
o3 with tools >> Gemini 2.5 Pro with Canvas >> o3 😁
Lmaoooo
new contender for world's worst thinking model if true
How do you use canvas? On the Gemini app?
Show us
idk i dont use gemini, im openai fanboy
Do more tests like that
How you know about canvas then?
idk why I read that as openai femboy
Are the models at least fast?
😂
im making use of o3 to build repos via the subscription, similar to how one would use cursor or windsurf + o3 via api $$$
but its way cheaper (200$)
if rayquaza is from another lab it seems the lmsys staff are naming them and having fun
they really ARE pulling a meta. gonna be a 200b model that bombs every meaningful benchmark 🔥🔥🔥
How are you able to use o3 from api? They said I had to have an enterprise
Wait so how are you banned?
true, nvm i wouldnt even be able to use + would be super expensive.
thats why i use it via the subscription , but i have made a script that make questions and extract answers.
so i use it as if i had api access
U pay $200 a month?
Damn I was gonna wait to get that until o3 pro comes
Not a bad idea
Hmm I might do that actually, make an account specifically for that
And tell people to pay me to use it
Lmaoo
How they scam?
Google models
Try dayhush
That’s current best model that we have access to
- pay me 100$ to use gpt , we share win win
- ok here are 100$
- ok bye
bro we just had 2.5 flash
Out of all models?
I see claybrook not giving output on webarena quite often, especially on complex questions.
I think it takes more time & there seems to be some timeout setup in the webarena, which ends before it gives an output
better than 2.5 ? coding or general model ?
tbh im rooting for o3 to get nr 1, then on 30 april 2.5 ultra to be nr 1, thatd be hella fun
I dont like that its taking too long for openai models to get in leaderboard. They probably have the votes, just making sure they aint biased.
they aren't arena tuned so they probably didn't go through the "run it privately and release the elo asap" process
@keen beacon that $10 thing applies to all free models on open router
yea now
so you get 1000 rpd with 2.5 pro
u dont get it with 2.5 pro
thats cracked
they were getting hammered on free requests before i guess their quota is now sufficient enough with that criteria
for a regular ai user that does not use ai heavy in products, you never have to pay for SOTA models ever
things are extremely competitive now lol
besides o3 lol
(u can still use it in direct chat lmao)
lmaoooo
truee
it makes it so hard to pay for ai, thats why i cancelled all my subs
only will pay for o3 pro when it drops
but even that has me hesitant
2.5 pro is still unbeatable in my own personal experience lol for the things im using it on
its incredible
yeah thats what i am saying, for a general model its perfect
so this mean that a year from now we should have models better than 2.5 pro but basically free and fast like how we treat flash lite
2.5 is already free basically
but a year from now it will be free for production environments or dirt cheap
thats crazy
any model better than 2.5 is overkill for most usecases, so we basically made it if we assume we will get better and better models
i mean i will use it, im just saying its wild that our models are so good now that we have SOTA models that are free, where you get 1000 RPD, for openai plus you get like 50 o4 mini per day i believe lol
its a server-side issue
coming from webdev itself
not the model
it didnt even send the prompt
it could be interfering with JSON format on the POST data
not sure
Every day at 12pm, Relaxed Voyages spaceship departs from Liverpool for Dublin. Simultaneously, another Relaxed Voyages spaceship starts journey from Dublin to Liverpool. The journey takes 503 full hours in both directions.
How many Relaxed Voyages spaceships, traveling to Liverpool, will the spaceship departing now at 1pm from Liverpool encounter?
nw
Currently in arena, I would say claybrook.
is this a joke?
- better instruction following
- better on reasoning
- better on colors/ui choices
why are the same models battling each other
and the one on the right is waaay worse
tf
mini
mini is ass then
gpt models arent good at coding
i would just chose
"both are bad"
next
there is an option for that
because both of them look bad to me
if you care about the actual rankings you would say that left is better because it has more details (assuming nothing was cut off)
i know, i did put left
i think i found the issue, i may override the POST data see if it fixes the problem
wtf...
no fuking way 2.5 pro isnt nerfed in the left side
dayhush is seriously insane
im sure 2.5 pro is nerfed here somewhat
both looks bad
what the hell are these color choices
dayhush is so bad at choosing color palettes
right is significantly better dude
why are you such a hater dawg 😭😭
there's nothing wrong with the colour scheme
obviously they don't look like it was made by a 10 year experience developer
doesnt mean right isnt better
nightwhisper generated much better results
yea
but it could be the prompt
the right is definitely better tho
wild?
🤣
are you one of them?
maybe 🤣
i prefer the vertical selection, but other than that right is better in almost everything
dayhush aint half bad
o1 = 42, gpt-4.5 = 41, o4 mini = 42, grok 3 = 41, o3-mini high = 42, 2.5 pro = 41, 3.7 sonnet thinking = 42, R1 = 41
from a single attempt just now
https://websim.ai/@wonderousfog95826785/spaceship-encounter-simulator/6 still yet to figure out why this is incorrect
ive never tried o3 high but o3 med gets it wrong
cc @ocean vortex
why is it 43?
oo, saved
i don't think that's a particularly good way of figuring out the right answer tbf
yea but why is it 43
you said the answer is 43
so you didnt try to solve that on ur own
it's not a question i use often
you can go and try that if you want
but i trust his judgement
I tried solving it on my own, I also don't get 43. might be a trick if anything. either way, without provided correct solution it's kinda pointless
i tried it as well
i think llms are rounding the numbers
using days instead
of hours
"k ≤ (503 – 1)/24 = 502/24 ≈ 20.92 → k = 0,1,2,…,20. So there are 21 such ships already on their way when we leave."
thats o3
but on the 21st day its 21 * 24 + 1 = 505 > 503
so it shouldnt count
still makes no sense, then the models would overshoot, not undershoot
⇒ –20.916…≤k≤21
same here
and they are also taking into account the ship that will fly at the same time as our ship
there's 20 ships that are on the road when it departs, 1 ship that departs at the same time, and then 20 ships that depart between the time it leaves and the time it arrives right?
yea
so 41?
there's a ship that arrived an hour before our ship left, and a ship that leaves 1 hour after our ship arrives? if it's 43 it could just be that the guy considers that with 1 hour of wiggleroom our ship would encounter those ships in the dock
that would be a bit dumb but it's my guess
sorry its 21 on depart since 0-20 day = 21 days
so it could be 41 or 42
if you count the ship that departs at the same time
thats why some concluded 42/41
could you expand on this reasoning bc i don't follow
some of them counted the ship that departed at the same time
right i'm counting it in an extra category
20 already departed, 1 departs at the same time, 20 depart after our departure but before our arrival
21 already departed
how though
could you explain that in more detail? where you don't count the boat that departs at the same time
we need ships that departed before day 0 @ 1pm and haven't completed the 503-hour journey yet so its x <= 503 :
- day 0 @ 12 PM: 1 hr. (1 <= 503) -> 1 ship
- day -1 @ 12 PM: (1*24)+1 = 25 hrs. (25 <= 503) -> 1 ship
- day -20 @ 12 PM: (20*24)+1 = 481 hrs. (481 <= 503) -> 1 ship
- day -21 @ 12 PM: (21*24)+1 = 505 hrs. (505 > 503) -> 0 ships
counting these days gives: 0 - (-20) + 1 = 21 days.
oh it's departing at 1pm i see
so 21 before it departs, 0 while it departs, 20 after it departs
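The count above can be brute-forced. A small sketch, assuming Dublin ships leave at noon every day, ours leaves Liverpool at 1pm on day 0, and only ships that are at sea during our trip count (no dock encounters). The function name and the time convention (hours, with day 0 noon = 0) are made up for the example.

```python
JOURNEY = 503      # hours, from the riddle
OUR_DEPARTURE = 1  # 1pm on day 0, in hours after day 0 noon

def count_ships(journey=JOURNEY, our_departure=OUR_DEPARTURE):
    """Return (ships already at sea when we leave, ships leaving during our trip)."""
    our_arrival = our_departure + journey
    already_at_sea = 0
    depart_during_trip = 0
    for day in range(-60, 61):  # window wide enough for a ~21-day trip
        dep = 24 * day          # noon departure from Dublin
        arr = dep + journey
        if dep < our_departure and arr > our_departure:
            already_at_sea += 1
        elif our_departure < dep < our_arrival:
            depart_during_trip += 1
    return already_at_sea, depart_during_trip

before, after = count_ships()
print(before, after, before + after)  # 21 20 41
```

Note that our ship arrives at hour 504, which is exactly noon on day 21, the moment another ship leaves Dublin. That ship is excluded by the strict comparison here. Counting it is what turns 41 into 42.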
oh
hmm
yeah no that seems right
0 while it departs = is what differentiated other LLMs
it is obviously right bc it departs at 1pm and the others depart at noon
tbc you think it's 41? or are you unsure and think it might be either 41 or 42?
^wrt to this comment
yea not sure tbh
maybe im missing something as well
okok
xddd
kinda impressive from dayhush
sonnet and other models failed to visualise the solution
and they also failed to provide the answer
not agi
the answer is 43 i think (i reread it again, it is not)
i might be very wrong i was
feel free to explain your reasoning wild
ppl are betting about the correct solution here lol https://manifold.markets/Bayesian/the-riddle-of-spaceships-encounters?r=QmF5ZXNpYW4
tf
Yeah did not expect to see my prompt randomly here lmfao. The thing here is that the formula is really 2*days+1. And the +1 is because it will encounter an additional ship arriving just as it is leaving. The original prompt is already tricky enough for LLMs, though it can still be trained for. But when you add complexity they tend to fall back on their earlier reasoning patterns (indicating that they didn't really understand it all that well to begin with)
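For the classic version, the 2*days+1 formula can be checked by brute force. A sketch assuming both ports send a ship at the same time every day, the trip takes a whole number of days, and the ship arriving as we leave and the ship leaving as we arrive both count. The 7-day trip is the usual figure for the 15-ship riddle, assumed here rather than quoted from this chat.

```python
def encounters_classic(days):
    """Count opposite-direction ships met, dock meetings at both ends included."""
    our_dep, our_arr = 0, days  # measured in whole days
    count = 0
    for dep in range(-2 * days, 2 * days + 1):
        arr = dep + days
        # Travel intervals overlap, endpoints included.
        if dep <= our_arr and arr >= our_dep:
            count += 1
    return count

for d in (7, 21):
    print(d, encounters_classic(d), 2 * d + 1)
```

With a 7-day trip this gives 15, and with a full 21-day trip it gives 43. Both rely on the departures being simultaneous and the duration being a whole number of days.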
I made the market, ppl can bet on it (with fake money)
cool
Spacex sim?
wow
Release it
One of those was very popular
not sure yet, open to ideas
do you think the correct answer is 43?
lol ok
there obviously are lots of tenable ideas
yeah it is
- ask 10 people what they think the answer is, resolve to majority voting
- wait for a mathematical proof that is formally verifiable, of the answer to the riddle
etc.
i don't think you justified that appropriately and i think you're wrong but if you don't wanna explain in more detail that's fine tbc you didn't ask for your riddle to be talked about ig
not really necessary since it's a variation of the known riddle with just basic additional math added on top
that would be much worse because whoever has more mana can dictate any solution
it doesn't have a truth-based schelling point
could you point me to the known riddle?
The riddle involves ambiguities (at least one).
its a modification of the 15 ship one
Which imo isn't good for a mathematical prompt
i agree ambiguities make a math prompt worse but could you point out the ambiguities?
now do o1 pro vs o3
So different people use different assumptions to get the answers they get
How many ships belonging to the same shipping company, traveling in the opposite direction, will the steamship departing today at noon from Le Havre encounter?``` Answer 15. Many models answer with 13 or 14 to this day
marginal cost is $0, less expensive than o3
lmao polymarket polls
no its unlimited usage
Do you count the ship that left an hour early or not? What constitutes meeting other ships in that situation pre launch? Does it need to meet them on the air strip? What if our ship is in the storage bay still when the other one launches? This information isn't provided.
if it's departing at the same time another one is arriving they will for a fact meet. Departure would take preparation (boarding the passengers, waiting for clearance to depart etc) so mere logic and common sense dictates for this to happen
One arrived an hour before you launched.
No wait my bad
what ship left an hour early? pre-launch and post-arrival encounters do not count. storage bay encounters do not count. they do not count because they are not specified as happening. we don't know the ships encounter in the bay.
I was actually talking about the one that left an hour before yours, not arriving
right you would not encounter that one bc it left before you and will arrive before you
Yeah, but that's your assumption
i suppose so ok
"if they are on the strip together, it doesn't count"
So even if they are in proximity you don't count it.
did someone say this
One of the ships left at 12, yours left at 1
huh i literally have 4 tabs running o1 pro at once, never got banned for months
only got limited via o3 yesterday
they gave it back
Does this modified riddle also have 43 as its answer in your opinion? I made it more similar to the original riddle but with the differences i think exist in your riddle
Every day at noon, a steamship departs from Le Havre for New York. The journey takes exactly 503 hours (20 days and 23 hours) in both directions.
How many ships belonging to the same shipping company, traveling in the opposite direction, will a steamship departing today at 1 pm from New York encounter during its trip?
Based on the last sentence, I agree that you don't count any grounded flights before pre-launch.
yea o4 mini > o4 mini high, very very cool
Ultra coming late may to compete with o3 pro
Possibly
A shipping company operates a route between Le Havre and New York.
Every day at 12:00 noon local time, a company ship departs Le Havre bound for New York.
The journey time between the two ports is exactly 503 hours in either direction.
Consider a specific ship from this company that departs New York for Le Havre today at 1:00 PM local time.
During its entire 503-hour voyage (from the moment it leaves the New York port until it arrives at the Le Havre port), how many of the company's other ships, traveling in the opposite direction (from Le Havre towards New York), will it pass at sea?
i think o3 pro > ultra if it ever happens
Logan is confident
Hard to say we’ll see
Anyway exciting time
Progress ain’t halting
i hate that o3 cant give the full code content
i asked for full updated code file, but it shows only the changed section, unlike o1 pro which gives the entirety
Same thing with its usual outputs
Its a smart model but it doesn't explain itself too much
It will use technical jargon without definitions
I think they deliberately asked the model to keep it short since it's costly
u think its better to go with asking the diff code? or does it also hallucinate that
yea i heard about its yap score..
ok asking for diff code aint too bad; feed it to 4.1 in cursor to apply it
It gave me that feeling of "figure out the rest urself"
Its really a different experience from other models
Im not talking about how its concise /short
But just the way its talking and providing infos
So I did that riddle manually, and I also got 43. So I agree with Dom, the answer must be 43.
I got 42 doing it manually
Yeah, it's 42
I read the riddle again, for some reason I thought it had said a ship arrived right when your ship was leaving /.-
the problem was reworded in a poor manner that doesnt work as intended i think. i dont think the answer is 43 anymore, i saw it was a modification of the 15 ship one upon first glance but didnt actually read it properly lol
This logic (the 2*days+1 rule) doesn't apply to the puzzle you created though: the ships aren't both leaving at 12pm, and the journey isn't a perfect 21 days, it's a non-integer 20.9583333 days. So I think the important detail here is there is a 1 hour offset which skips a pass/meet
i think if its changed to 505 it works though
505 I think is 43.
504 I think is 42.
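The 503 / 504 / 505 guesses above can be checked with a small brute-force count. This is only a sketch under assumptions the thread does not settle: a single shared clock (no time zones), constant speed, equal journey time both ways, and a flag for whether a meeting exactly at a port (one ship arriving as the other departs) counts. The function name `count_encounters` and its parameters are made up for illustration.

```python
def count_encounters(duration_h, offset_h=1, interval_h=24, count_port_meetings=True):
    """Count opposing ships met by a ship that departs `offset_h` hours after
    the daily departure time and sails for `duration_h` hours.

    Opposing ship k departs at time interval_h * k (k may be negative).
    With equal speeds, the two ships cross somewhere on the route exactly when
    |interval_h * k - offset_h| <= duration_h; equality means the crossing
    happens at a port at the instant of arrival/departure.
    """
    # Search a window of k that is wide enough to cover every candidate.
    k_max = duration_h // interval_h + 2
    count = 0
    for k in range(-k_max, k_max + 1):
        gap = abs(interval_h * k - offset_h)
        if gap < duration_h:
            count += 1  # crossing strictly at sea
        elif gap == duration_h and count_port_meetings:
            count += 1  # boundary case: crossing exactly at a port
    return count


for hours in (503, 504, 505):
    print(hours,
          count_encounters(hours),
          count_encounters(hours, count_port_meetings=False))
```

Under these assumptions the count is 42 / 42 / 43 for 503 / 504 / 505 hours when port-instant meetings count, and 41 / 42 / 42 when they don't, so the disagreement in the thread mostly comes down to that one convention.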
lol
promo vid funny, but it seems to just be a software for your laptop and not actually a glasses device
Google DeepMind CEO Demis Hassabis showed 60 Minutes Genie 2, an AI model that generates 3D interactive environments, which could be used to train robots in the not-so-distant future.
Just dropped 3 hrs ago
where bing is sydney "precise" mode
this adds ambiguity. previously, it was:
Every day at noon, a steamship departs from Le Havre for New York. At the same time, another steamship belonging to the same company sets sail from New York to Le Havre
So i interpret that to mean that at 1200hrs (/noon) Le Havre local time, ships simultaneously depart (meaning it's 0600hrs in New York when that ship departs )
But in this variation:
Consider a specific ship from this company that departs New York for Le Havre today at 1:00 PM local time.
So it's not departing 1hr later than in the original formulation, it's departing 7hrs later.
Also:
how many of the company's other ships...will it pass at sea?
"at sea" has a specific meaning; imo it precludes any 'encounters' that occur at / around the zero time instant when one ship is arriving and the other departing from the calculation. Neither ship would strictly be 'at sea' [i.e. "...in navigable waters beyond the immediate confines of the port area."] sonn-3.7
fwiw i think both the original and that tweaked version are problematic - too much ambiguity
the original is a well known puzzle tho
well relatively well known
yeah the creator (a French mathematician) of it is apparently quite famous for these
but still, i don't think this is one of his finest
maybe it was poorly translated
top rated answer here says there are multiple answers depending on interpretation https://puzzling.stackexchange.com/questions/30012/a-ship-departs-from-le-havre
anyways it seems doms reworded one was poorly done and doesnt have the intended answer
nah fwiw I think dom's reformulation seems more or less the same as the original (at least as it appears in English here). I think it's a crappy question, as the comments highlight - like there's several non-trivial flaws / alternative interpretations imo
[edit: the spaceship reformulation is flawed.. i was confused by what was being referenced]
u got 43?
no
but its actually 42 i think
the intended answer is 42
no its 43
and it's the answer
according to dom
lol
prob tried recalling the answer from 42
and mistook it for 43
41 is more valid than 43
send
it doesnt make sense
it is 15 ye, I think this comes from deliberately choosing other interpretation though
"every day"
"conversely"
"exactly 7 days"
so might be a skill issue or non native English speakers
tbf not everyone cares that much about the wording and assume it'll be super intuitive
im surprised you got 43 how??
specific mechanics in le havre dont apply because of the changes he made
i was referring to this formulation (which is apparently the original ) #general message
I got 42 for this one #general message (but the wording is more flawed imo)
can you link to the version from Dom where 43 is the putative answer
this?
this is his reasoning?
#general message he also justified here
thanks
his version doesnt make sense
i glazed over the spaceship versions ha
if u change it to 505 i think it works though
but models still get it anyway
so this type of question isnt particularly challenging anymore
Is it better than o3
?
?
The Google model
How does it compare
which google model??
tbh
The new one on webdev?
y
whats with this 43 i keep seeing
just an old puzzle thing
oh lol
the server seems stuck on
someone proposed the wrong answer initially
and the models were actually getting it right
but nobody could really figure it out without second guessing themselves because the models seem to get answers that are very close
just not the specific answer 43
dom made a mistake when rewording it, and assumed an incorrect ground truth. this type of question seems conquered by models tho
ye
and how strong the model forgiveness is somewhat decides how it's going to approach it
since the questions are flawed in general
Is claybrook a coding model or a new general purpose mode
Google has so many models, I'm a bit suspicious if the names of those models come from Google? Shouldn't we come up with some more fun names like: temusk, closesam?🤪
ion think it's a truly massive model though, doesn't seem to know more than 2.0 pro or 1.5 pro
and running a model like that wouldn't be practical and somewhat dismisses the fact 2.5 flash exists
which in comparison to 2.5 pro, isn't a major downgrade, yet is the cheapest and best reasoning/non reasoning model by far
we can infer flash is less than a 100b due to its 8b variant, since it would be redundant and concede the proposition the original flash makes by being fast and small, why create size extremities and therefore room for an intermediate "flash" model
but it doesn't exist
me saying cheapest is relevant to that entire claim
its not saying it's the "best"
no?
bruh what
you're getting confused
2.5 FLASH
depends
2.5 flash is LITERALLY just a smaller 2.5 pro
same techniques prob
if 2.5 pro is so good generally, since it knows a lot
then it's gonna be good generally
if 2.5 flash is attempting to be general, but doesn't know as much
I think it's gonna simply appear worse
because there's nothing specific that stands out
whereas o3 mini vs o1, coding
o4 mini vs o3, coding/puzzle tasks
ye the gap is getting closer and closer for the narrow tasks
although o3 and o4 mini hallucinate a ton
and o4 mini doesn't quite comprehend some stuff
I think it generally does better
than o3
for direct if then processes
ye depends tho
it hallucinates significantly more than o1
still, depends
lmao don't try and downplay it
how come 2.5 pro can so effortlessly process this stuff
without hallucinating at all
I don't want to bring up an example this way
but it's def clear
p sure this is still the flaw of the model
o3 has better reasoning retention than o1
and that's why it's doing better at context (just not as good as some benchmarks suggest)
but if it's still failing hard at recalling instruction and iterating through it
then is it truly a good model
the way they have trained the model to respond differs from o1 in that o3 is eager to provide specific facts and figures and sources, which is all well and good but it just ends up all false
ion think so anymore
I think they released it to dismiss o3
if a summer release is true
they have time to simply do this with o4
ye, I read the paper
in my testing 4.1 stomps deepseek v3
but still
2.0 pro just seems to be the best
as far as just being smart goes
lmaoo
so?
not at all lmao
4.1 is really just
good
but it just doesn't understand stuff implicitly
hn
2.0 pro is way better
it's not even close tbh
and deepseek v3.1 is also mid
hallucinates way too much
awfully assumptive since my primary concern is practicality
not puzzles or narrow tasks lol
I just happen to use puzzles for maybe mathematical reasoning
or testing how it processes a claim
ye
You can exclude providers on or
2.5 flash 💔
.60 lmao
why would you reason using it
yet it's better than 2.0 flash
while flash 2.0 has always been the absolute best price for performance
this just doesn't follow
bruh?
They were working on it for a while so I don't think so
Prefer to hand over data to U.S. companies, which are more trustworthy?
def not lmao
It's also used as the base in the retrained o3
ye, historically and legally
🤣
not "barely"
it's pretty visible
2.0 flash is hella straightforward
albeit concise
2.5 flash retains conciseness while giving insight
it's obviously a major upgrade
including instruction tasks
racism😏
although it doesn't get "better" after practically perfect in retaining instructions
2.5 flash isn't good?
Do you use the non thinking mode too
a ton
I use only the non thinking mode
ye, this pretty much goes for all smaller models
if you want very specific feedback
go for them
W utility
def efficient
🙏
Qwen max isn't small tho lmao Im interested in them open sourcing it
Depends tbh I heard prompt processing can be slow on macs
he's joking but ngl I wouldn't stick with it regardless
qwen 2.5 seems the right choice
Qwen 3 is about to come out anyway
try hooking up full deepseek r1
Qwen 2.5 is super old
this is the most efficient you can get
You can probably run qwq well tho and fast on a mac
Original qwq wasn't that finicky I think
Eh usable even if it's somewhat slow I think
Definitely spoiled by api speed lol
O lol
It makes more sense with Mac mini s I guess
Yeah much better compute compared to a Mac I think and higher bandwidth
It'll do well if it can fit
Yea
exactly
o3 was a step forward for everything except in combatting hallucinations
We all feel it right
hallucinates more, and more lazy
o3 benefits with a great prompt
if you are clear and concise and verbose with your prompt, itll do well, but it definitely will cut corners if you don't watch it
What makes you say that? Their bet on JEPA?
Gemini 2.5 Pro with Canvas just WOW
what are the incoming models that folks are excited about ?
Tomay appears to be a thinkerrrr
Qwen3, DeepSeek-R2 for me.
Yea def r2
Same for R2. Will it top the charts? Better than 2.5 or o3?
I wont lie, I despise R1 when it comes up in lmarena, since im sitting there waiting for 4 mins to an output..
I hope r2 is faster
And llama reasoning and behemoth
New model the v5 and V6 of "cobalt-exp-beta"
(Amazon model)
GPT 4.1 mini and nano are in the arena?
yea was added yesterday
i think two models were added
Its good ? Its finally amazon premier ?
Apricot v1 was on the arena a month ago
- Claybrook(LMarena)
- o3
no
Im often very impressed with the qwen 32B. I wonder what could they achieve with normal size models and inference compute
yeah you're spot on
i don't get 43 for that liverpool-dublin spaceship formulation.. (it's indeed botched)
funny thing is.. by changing it from Le Havre/NY to Liverpool/Dublin, it actually removes one of the chief ambiguities in the original version (timezones), as both cities are in the same time zone. Then the other change made, rather than both routes departing at the same times, in this version there is 1hr delay – so there is no overlap (putting aside the problematic aspect of whether that overlap actually constitutes 'encountering' another company ship). But..using 503 hours.. yeah.. I get 42. With 505hrs, there is sufficient time that the 1hr delay means you encounter an additional Liverpool-bound spaceship, bringing it to 43. The stuff about spending time in the port with people boarding etc is nonsense and cannot be used as part of the 'answer' lol
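For what it's worth, the 42 vs 43 split can be written out as a short derivation. This is a sketch under assumptions the chat does not spell out: one shared clock, constant and equal speeds, and meetings at the exact instant of arrival/departure counted. The symbols T, δ, k and t_k are introduced here purely for illustration.

```latex
% T = journey time in hours; \delta = 1 (your ship leaves 1 h after the daily noon departure);
% opposing ship k leaves the far port at time 24k (k integer, t = 0 is today's noon).
\begin{align*}
\frac{t_k-\delta}{T} + \frac{t_k-24k}{T} = 1
  &\;\Longrightarrow\; t_k = \frac{T+\delta+24k}{2},\\
\delta \le t_k \le \delta + T
  &\;\Longleftrightarrow\; |24k-\delta| \le T,\\
T=503:\quad -502 \le 24k \le 504
  &\;\Longrightarrow\; k\in\{-20,\dots,21\}\quad (42\ \text{ships}),\\
T=505:\quad -504 \le 24k \le 506
  &\;\Longrightarrow\; k\in\{-21,\dots,21\}\quad (43\ \text{ships}).
\end{align*}
```

In both cases one counted meeting is a boundary case (24k = 504 for T = 503, 24k = -504 for T = 505), meaning it happens at a port at the instant of arrival or departure. Dropping those gives 41 and 42 instead.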
what a mess..anyway.. let's move on ha
yep
Dom gave us quite the entertainment for the past 2 days
Turns out LLMs actually DO know their stuff, and the question creator is confused XD
agi confirmed
AGI would question the premise of spaceship travel between Liverpool and Dublin and also note two launching rockets could see the other while reaching orbit due to proximity
The question is if it was aka used in 2.5…. They would have had the research internally for a while before it was published
do we have any interesting news this week?
the only thing about having good weeks is that the next week is usually dry lmaoo
not sure but feels like there's a bit of momentum and dynamism atm that was lacking since reasoning models were first introduced
just an unrelated curiosity... anyone have any thoughts why India seems to be kinda absent from the AI landscape? (I say that.. though perhaps it's misplaced)
not: where is India's deepseek moment?.. just: why doesn't India seem to appear in the picture at all? big country; lots clever minds etc..
They all come to the US where they get paid better
is amazon serious about their models? Is it worth giving them any benefit of doubt?
They are just so bad, like bottom of the barrel bad..
ha yeah but still.. i don't find that explanation satisfactory.. like there's still plenty of talent regardless of that kinda brain drain.. and there's capital (increasingly so)
Google, Claude, Open AI, Grok, Deep Seek, everyone else is chump change
oh this was meant to be @wintry tinsel
Their country is in a lot of political unrest now, they seem to be sort of disconnected from the rest of the world like South America, they are their own cultural eco system satisfied with their own internal state of affairs
India? lol. you will not see any Deepseek moment from India.
They will use models from others to create applications. But there is no hope that they will be able to build a competitive model
They are not worried about winning any AI race, Indians are not very creative, hard working and smart, but slow to new technology
yeah i don't wanna go down a political direction.. but i'm fairly familiar with India's politics - i get the crony capitalism under Modi (but also how the BJP was dealt a blow at the last elections.. only scraping by.. that said.. the INC is still a mess)
They dont have any govt support and they dont have good VC capital money in research areas.
They are creative. But creativity shines when they are in countries like US
i still don't find it satisfactory as an explanation (brain drain / domestic politics).. there's so much Indian pride...
i thought maybe the diversity of the country
there's so many languages
unlike China. where [the central govt] is just like: you're all Han, and speak Mandarin
Perhaps Indians are capable of creativity, but the country itself does not foster or support it very much culturally
I agree.
Japan is also a country known for its hard working and smart tech industry adopting technologies like the bullet train before most other major urban centers, yet they are absent from AI space too
shame to see
I am concerned a bit about EU though. With america ruled by mad king, EU might be in trouble
Japan and India should collaborate on AI, now that Japan has opened floodgates to immigration
It won’t happen though
Samsung in South Korea is also in panic mode as they don’t have any competitive AI models and their new phone chips are being outperformed by others with better AI implementation
Samsung is too dependent on Android and Google
and I think they will continue to
Any news in the Industry? Which models are businesses adopting more of ? OAI models?
We’d know if 2.5 pro had it wouldn’t we?
One of the strongest properties of Titans is loooong context recall performance, like drastically better than other models. Have you seen the Gemini 2.5 needle-in-a-haystack results? It fits.
I think enterprises are using OpenAI models (and MS Copilot, which has to be a derivative of OpenAI?) if they are in the MS ecosystem; and Google models if they are in the Google ecosystem. Apart from that I'm pretty confident a lot of businesses have accounts with OpenAI just because that's what they know / what came first, and if they build software a lot of them are probably paying for GitHub Copilot (again, MS/OpenAI). Anthropic is a developer favorite of course, but that's a narrow segment.
I don't know to what extent any of the other big labs have captured the enterprise market.
llama 2, gemini 1.5 and grok 2 were the 3 really bad ones, while now meta google and xai have really interesting models and with a very good price-quality ratio
dayhush or nw?
but I don't have the impression that Amazon really wants to be in the race, but maybe they only want to have their AI for their chatbot and for Alexa
what's the point of doing a half-hearted effort? they are still spending like a billion+ per model and producing garbage.
What's the long term game here?
Hard to say but I think it's likely
and i thought amazon partnered with claude?
its like what happened with microsoft and OA
yeah i think it makes sense
they still ended up making their own model
either this week or next week for sure
both Google and Amazon partnered with Claude. But Amazon also creates garbage like nova
i really hope man!! that would make this month top tier
same goes for qwen 3
i do have pretty high hopes for R2
a few little birdies tell me it gives o3 a run for its money
you think it will beat 2.5?
last time it was pretty much o1 right? during o1 phase?
yeah
I'll be disappointed if they dont beat 2.5 Pro in half of the categories
your connects talking
right now can we reliably say that o3 is better than 2.5 pro?
either way it'll be state of the art as an open model by a big margin
i don't know how much of a threat qwen 3 will pose though
don't have any connections there lmao
It's a good question. Their Titan series of models was utter trash; then they suddenly came out with an entirely new set of models - Nova, which is not SOTA nor within spitting distance of it, but let's be fair here, Nova Pro was not terrible either, and it was a gargantuan leap up from Titan. If they can make another leap like that then they could potentially join the top labs. But that's a big if.
speaking of nova titan
nope.. I consider 2.5 > o3 (but barely)
@keen beacon are your birdies letting you test out the model 👀
if cobalt on the arena is titan it sucks
nope
i think they are better in different stuff tbh
nahh we need to make you the official model tester
they don't really hire people outside of china for that stuff
+1. For me 2.5 is slightly better but o3 is slightly better in many other areas as well
you can be an unbiased source for all companies
yeah reasoning o3 def wins
but dont forget yall, o3 pro coming out next week
and if its the same gap from o1 to o1 pro
umm.. may be they come up with something interesting.. yet to see. Right now, i dont consider them in AI race. They are in the top of AI infra provider though
i think we dont see a model beat o3 pro until end of year
fr
we will
pro is 200$ per week? Yes, i am never subbing it unless it is like crazy better over 2.5 pro
trust me
month
yes, month.. wow, they are so generous 😄
to be fair you do get practically unlimited o3, o4 mini, 4o + image gen
if only they didn't cap o3 pro
hmm i wont be mad if we do lmaoo
and sora lol
sora is unfortunately pretty bad
its def worth the price imo
hopefully sora 2 drops this year
In my testing, Nova Pro ended up in the same tier (third tier) as Grok 3, Gemini 2.0 Flash, o3-mini-high, Mistral Large, and QwQ-32B. So given its price point (very low) it did have some merit there for a while.
but whats this talk about openAi branching out so much into shopping, social media, etc..
i thought nova pro was like 5x+ more expensive than flash.. let me check
I think they are looking into avenues that actually make money, instead of you know... money sinking business
Actually nova pro is 8x more expensive than 2.0 flash
i dont get it? why not focus on couple of things and make it kick ass instead of 10 mediocre things?
each extra venture takes effort and energy away from other things
i have seen businesses fail multiple times because of expanding too fast. it is a common pitfall
I don't think there's a single frontier AI LLM company out there that is actually making money
no way
i hope
i also have my doubt about r2 >>> o3
i mean, i was half joking
I wouldn't expect any other way tbh
Yes, it had merit before Flash came out
i think they have a pretty good chance
yes and spreading in more directions == more losses and less focus
o3 is really not that impressive despite the hype and "feelings"
the biggest practical change is how they made it now use every tool that they have, quite extensively
Definitely depends on use case. Bigger context makes 2.5 shine in many cases. Tooling and less restrictions vault o3 over others. I was talking to some crazy girl this weekend and her and all her friends have started using o3 and deep research to analyze guys on dating apps. o3 much more willing to dig up social profiles than Gemini. I tested myself and o3 output freaked me out. Gemini didn't touch socials.
damn.. never thought people would be using deep research this way 🙂