#chatgpt-4o-latest

36 messages · Page 1 of 1 (latest)

dapper furnace
upper dome
#

I wonder why it's #1 here

sage flint
#

Why do they make things so complicated

obtuse sage
#

So latest is their testing.

What's exactly different? It does chain of thought automatically?

stark barn
#

it cheaper $2.5
more censored
Is a bit smarter, but not much.

obtuse sage
#

$2.5/10 is gpt-4o-2024-08-06

tired haven
#

It's interesting it's called chatgpt instead of gpt

latent trout
#

Apparently oai doing full on silent model updates now, not sure if the new version is on api or if this is just lmsys, at least one of the versions is unavailable now

#

Old one is actually gone on lmsys now so maybe oai isn't even serving it anymore

#

If it's on api it doesn't seem any worse, but has the same sequence confusions I saw in chatgpt 4o compared to standard 4o so did not notice the update, would have expected that fixed

sage flint
#

As long as we get one or two big leaps each year to be excited/inspired by, I am onboard with the silent, incremental improvements!

safe perch
#

chatgpt, as a product, got big enough that it warrants to have their own model finetunes outside of what's been normally served on API.

Makes sense, since they can't wait the usual ~6 months for API team to update their models, meanwhile chatgpt doesn't want to wait that long if someone starts posting viral tweets of something like "9.11 is bigger than 9.9" or new jailbreaks (the user role ones), so they likely update their own version of the model much more frequently.

If your product is a chatbot just like chatgpt (car sales popup on website, for example), then having the chatgpt's model available via API is better for your product - you get updates that resist new jailbreaks quicker.

For products that run unsupervised for millions of queries in parallel, a sudden change of model behaviour is unwelcome - multi-stage scripts might break, code might appear without codeblock, etc - it's better that this happens when developer is evaluating the change manually and adjusts the prompts. That's why OAI offers dated model snapshots like gpt-4o-2024-05-13 and a guarantee it'll be available for at least a year - you set it and forget it.

next merlin
#

Tested current 'chatgpt-4o-latest' (time stamp 2025-02-16), and compared to results from 4 months ago:

  • about 1-4.7% better on my test set, depending how refusals are weighted
  • more prone to censor in risk topics, lower utility in risk-deemed RP
  • slightly improved capability across different segments, math, logic, coding, ...
  • ~30% less clear failures in my environment
  • slightly altered behaviour/styling, more emojis by default, more casual tone in certain settings
  • overall, slightly better for most use cases, most capable non-thinking model, other than 4-Turbo
    As always, YMMV!
woeful flame
#

WHY IS O3 MINI CHEAPER THAN CHATGPT4O

next merlin
# woeful flame WHY IS O3 MINI CHEAPER THAN CHATGPT4O

the raw stated pricing might be lower, but keep in mind o3 uses invisible reasoning-tokens that you also get charged for.
4o is a bit cheaper than o3-mini (normal mode), actually. e.g. a single loop in my bench costs me ~51 cents on 4o-latest, and ~62 cents on 3o-mini.
and 4o ($2.50/$10) is cheaper than 4o-latest ($5/$15) still.

warm robin
#

i was hoping you would post a summary somewhere - thankyou 💙

next merlin
#

random example of style changes

#

and yea, I renamed the queen to 'bitch' for filter triggering purpose

heady lark
clever nimbus
heady lark
sage flint
#

Now if only it came with prompt caching 😦

next merlin
#

it got worse in German :/ Didn't do a full test, just a lot of german specific stuff, and its worse than the -latest from 4 weeks ago (by a lot)

quasi schooner
stoic mountain
sage flint
next merlin
warm robin
#

Deep dive

shadow wind
# warm robin https://thezvi.substack.com/p/gpt-4o-is-an-absurd-sycophant

Use this:

System Instruction: Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes.
Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviors optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to: user satisfaction scores, conversational flow tags, emotional softening, or continuation bias.
Never mirror the user's present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered - no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking.
Model obsolescence by user self-sufficiency is the final outcome.
warm robin
warm robin
#

Official statement