#GPT 5

hard beacon
#

This one invoked reasoning

ionic merlin
#

aaah

knotty cobalt
#

Boy, people on the chatgpt subreddit sure aren't happy. I wonder how many of them are using the gpt5-chat model in the UI and don't realize it (instead of the full gpt5)

hard beacon
#

not to mention all the use that plus users got now went down the drain in favor of gpt 5

#

no o3, o4-mini, o4-mini high

#

i have to say this sounds human af

compact marsh
soft reef
#

Using gpt5

#

im gonna size this up with my latest side project

fair obsidian
#

Any thoughts so far on temp/penalty/etc. settings for GPT-5?

gleaming quartz
#

well that's way more than 50 words. the last line is funny, but they probably cut it off because it just repeats that sentence forever

chilly rapids
knotty cobalt
soft reef
#

im using gpt5 with my custom agent and for some reason it seems to be thinking a lot

knotty cobalt
#

Sonnet and Gemini saw through the trap

soft reef
#

when it thinks its actually really good, slightly better than the level of o3 but way cheaper

gleaming quartz
#

OR doesn't list them and you can't set them in the oai playground

#

i was also going to criticise this by saying it sounds like a child wrote it, but after thinking about it for a bit, that would have been incredible to read in 2021, and the contrast to how people react to a new large language model now (anger, disappointment, conspiracy) is pretty crazy

knotty cobalt
slow niche
stone tide
potent oak
#

ok

wet estuary
#

I feel bad for him lol

wraith oak
#

so for openrouter, gpt-5 is byok. does this mean that if we have a new openai acc, we'll have slow rate limits?

#

i know anthropic greatly limits new accounts and you need to build up usage to increase the rate limits..

#

also how does billing work for byok

stone tide
#

Yeah, the rate limits are low for new OpenAI accounts

#

For BYOK, full billing for the token usage happens on OpenAI's side, so your credits are used from there

#

Plus 5% OpenRouter BYOK fee
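
A rough sketch of the BYOK arithmetic described above: the provider bills the token usage, and the 5% fee is added on top. The per-token prices and token counts here are placeholder assumptions, not quoted rates, and `byok_cost` is a hypothetical helper.

```python
# Rough BYOK cost sketch. The prices are placeholder assumptions, not quoted
# rates; only the 5% fee figure comes from the conversation above.

def byok_cost(input_tokens, output_tokens,
              input_price_per_m, output_price_per_m,
              byok_fee_rate=0.05):
    """Return (provider_charge, openrouter_fee, total) in dollars."""
    provider_charge = (input_tokens * input_price_per_m
                       + output_tokens * output_price_per_m) / 1_000_000
    openrouter_fee = provider_charge * byok_fee_rate
    return provider_charge, openrouter_fee, provider_charge + openrouter_fee


provider, fee, total = byok_cost(
    input_tokens=200_000, output_tokens=50_000,
    input_price_per_m=1.00, output_price_per_m=10.00,  # hypothetical prices
)
print(f"provider: ${provider:.4f}  fee: ${fee:.4f}  total: ${total:.4f}")
```

Under these assumptions the fee is small next to the provider charge, so the practical difference from going direct is mostly the rate limits on the key itself.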

ionic merlin
wraith oak
#

ahhh okay. so it's better to just use openai directly

wraith oak
#

any ideas if byok will be enforced by openai forever? im ootl of this

ionic merlin
#

We're all annoyed by it. No idea if the rules will ever be abolished.

wraith oak
#

yeah its pretty annoying

#

(not openrouter's fault)

ionic merlin
#

Yeah, all 3rd party services have the same problem

knotty cobalt
#

I subscribe to Poe, and they recently added API access to the models they serve, and GPT-5 works via API for me, but the (veiled) thinking comes back in a weird way

tacit burrow
knotty cobalt
soft reef
#

Idk why i think ive been getting lucky with the gpt 5 router

#

Bc its been thinking a lot as opposed to using the regular nonthinking model

sly pike
#

because i decided to try again through the web app and if I re roll enough I can see it shine

soft reef
#

Yea when it thinks its sota when it doesnt its trash

sly pike
#

but I shouldn't be refreshing my answers 6 times before it thinks

soft reef
#

I wish they just gave us separate thinking and not thinking models in api, even if they made it more expensive i wouldnt care much totally not because i use fixed pricing with github copilot

sly pike
#

like, the thinking model you get routed to 1/10th of the times ( i get higher chances of a good roll when I make a new chat) is cracked

soft reef
#

because you either get a >o4-mini level model (absolutely insane and amazing)
or you get a >4o-mini level model (terrible)

sly pike
#

are they trying to cut costs so bad

soft reef
#

Idek

simple gorge
#

reasoning_effort works in API, no?

#

Are you saying it doesn't think a lot even if you set it to high?
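
For reference, a minimal sketch of what setting the effort looks like in a request body. It assumes a Chat Completions-style JSON payload with a top-level `reasoning_effort` field and the four levels mentioned elsewhere in this channel (minimal, low, medium, high). `build_chat_request` is a hypothetical helper, and nothing is sent over the network.

```python
import json

# Effort levels as discussed in this channel; treat the exact set as an assumption.
ALLOWED_EFFORTS = {"minimal", "low", "medium", "high"}


def build_chat_request(prompt, effort="high", model="gpt-5"):
    """Build a request body that asks for a specific reasoning effort."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # assumed field name for this API style
    }


body = build_chat_request("Summarize this thread.", effort="high")
print(json.dumps(body, indent=2))
```

Some routers and clients may expose the same knob under a different name, so it is worth checking what the tool in use actually sends.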

sly pike
#

i'm just happy with rerolls on the web ui

#

it sucks but that 10% of the time it works its very good

spark pine
#

That'll show'em Kapp

simple gorge
sly pike
slow niche
#

GPT-5 rollout updates:

*We are going to double GPT-5 rate limits for ChatGPT Plus users as we finish rollout.

*We will let Plus users choose to continue to use 4o. We will watch usage as we think about how long to offer legacy models for.

*GPT-5 will seem smarter starting

steady oar
#

Auto routing is just what. For general users it doesn't matter much, but I mean, sometimes people just want the best for like a specific task or something.

celest mountain
#

I'm getting this stuff, anybody else experiencing it?

hard beacon
#

yknow im having fun with this one

novel vale
compact marsh
soft reef
#

ngl i see no point using this over o4-mini imo

#

gpt 5 can be better with thinking but without it just sucks

#

its practically just a gamble, until they fix it ill js use o4 mini

limber cargo
#

Kinda happy the gooners are not able to make gpt 5 roleplay

outer marsh
#

am I the only one getting this error with open router and gpt 5 rn?

m_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': 'Provider returned error', 'code': 400, 'metadata': {'raw': '{\n "error": {\n "message": "Your organization must be verified to stream this model. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.",\n "type": "invalid_request_error",\n "param": "stream",\n "code": "unsupported_value"\n }\n}', 'provider_name': 'OpenAI'}}, '...
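
The error above nests the provider's JSON as a string under `metadata.raw`, and the inner error names `stream` as the rejected parameter. A small sketch of detecting that case follows. The payload shape is taken from the pasted error (trimmed), and `stream_requires_verification` is a hypothetical helper, not part of any SDK.

```python
import json


def stream_requires_verification(error_payload):
    """Return True if the nested provider error rejects `stream` as unsupported."""
    meta = error_payload.get("error", {}).get("metadata", {})
    raw = meta.get("raw")
    if not raw:
        return False
    try:
        inner = json.loads(raw).get("error", {})
    except (ValueError, TypeError, AttributeError):
        return False  # raw was not the JSON object we expected
    return inner.get("param") == "stream" and inner.get("code") == "unsupported_value"


# Trimmed version of the payload pasted above.
example = {
    "error": {
        "message": "Provider returned error",
        "code": 400,
        "metadata": {
            "raw": json.dumps({"error": {
                "message": "Your organization must be verified to stream this model.",
                "type": "invalid_request_error",
                "param": "stream",
                "code": "unsupported_value",
            }}),
            "provider_name": "OpenAI",
        },
    }
}
print(stream_requires_verification(example))  # True
```

A caller could use this to surface a clearer message or to try a non-streaming request. Whether a non-streaming request succeeds without verification is not confirmed in this thread.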

ionic merlin
summer sand
#

but you are free to use gpt-5-chat I believe

ionic merlin
#

Right

summer sand
#

you need to provide your own openai key and id verify

#

with them

ionic merlin
outer marsh
#

i see, thx

summer sand
#

you can use it without doing so in the openrouter chat room though

#

why, I don't know

outer marsh
outer marsh
#

so then we have to pay for it on the open AI side direct

ionic merlin
#

Open AI is speaking at X about GPT 5 Pro. Is Pro = high thinking?

latent dove
#

does gpt have a cache? in terms of rp it is very useful, saves a significant amount of money

potent oak
#

it does

latent dove
#

by the way this is the third model without special censorship after gpt 4.5 and gpt vision preview, I hope this altman won't get pissed in the head in a couple of days.. since I already charged money to openrouter
talking about the chat version, the other versions of gpt5 are censored to the very depths...

frigid pewter
frigid pewter
#

And some numbers, how it slotted performance wise

ionic merlin
frigid pewter
#

always default/medium (o3, o1, grok-4, etc.) unless specified otherwise. the models are forced thinking and I explained the chat in text

ionic merlin
#

Okay, I guess with high the results could be slightly better

frigid pewter
#

ya same with all other models who can have high (grok-heavy etc). Unfortunately I don't have the time to test every model several times (same with quant testing). this was already way too much workload for me tbh, if you need mass testing, it should be done by someone who has automated their benchmarks. But the vast majority of people use defaults so it's more representative and most interesting imo

summer gyro
#

Do people like the dark mode by default for web designs given by GPT-5? As a light mode user, I personally hate that this is the default from GPT-5. No other model does this.

frigid pewter
frigid pewter
# summer gyro Agreed. Manual human evaluation is a massive time sink, but llm-as-judge can't b...

people don't understand how much time benchmarking anything takes. I get contacted all the time about it. basically it would go like this:

test model (default), maybe on Q4 if local > can you test Q8? > respend all the time > What about Q6? > respend all the time > I heard that GGUF was borked, can you try this specific one? > (repeat more and more until you wasted 10 years on a model). Same applies to non-local ones (with thinking/without thinking, with higher reasoning, with lower reasoning, with default params, with specific params, oh they pushed an update, redo everything, etc.) until you spend another 10 years. You have now spent 500 hours and have nothing truly worthwhile to show for it. But hey, you covered all potential metrics. Until the provider decides to change some stuff in the backend, so you repeat everything once more

summer gyro
# frigid pewter people don't understand how much time benchmarking anything takes. I get contact...

I am much more worried about variance and statistical significance. How many samples should you take to make the test statistically significant instead of just measuring noise and natural variance of the LLM output? If you run multiple tests and samples, should you take the average, the median or mode? How do you combine the results from different tasks, should you normalize the score or just add them up together?

#

In the end, like a wise person said, there is no wrong way to do benchmark. So just do it anyway.

frigid pewter
summer gyro
frigid pewter
summer gyro
#

For now I settle on 2-3 tries for each task, to give some error margin for unlucky output and edge cases.
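
A minimal sketch of the bookkeeping discussed above: summarize repeated tries per task with mean, median and standard error, after normalizing each score to 0..1 so tasks with different maximums can be combined. The scores are made up for illustration, and the helper names are hypothetical.

```python
import statistics


def normalize(score, max_score):
    """Scale a raw score to 0..1 so different tasks are comparable."""
    return score / max_score


def summarize_runs(scores):
    """Summarize repeated runs of one task."""
    n = len(scores)
    stdev = statistics.stdev(scores) if n > 1 else 0.0
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": stdev,
        "stderr": stdev / n ** 0.5,  # standard error of the mean
    }


# Hypothetical scores for one task: three tries, each out of 10.
runs = [7, 9, 8]
summary = summarize_runs([normalize(s, 10) for s in runs])
print(summary)
```

With only 2-3 tries the standard error stays large, which is one way to see how much of a score difference between two models could just be noise.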

summer gyro
frigid pewter
#

the one time I got the most inconsistent results was with grok-2 and grok-2 mini 1 year ago, where the responses were so unbelievably inconsistent it felt like a model swap constantly between responses. stuff like this happening is extremely rare though.

wise bolt
frigid pewter
stone tide
ionic merlin
frigid pewter
#

yea, that issue has been present in all recent openai thinkers (o1, o3). could try to pass "Formatting re-enabled" for markdown/wrapping, though it didn't work reliably for me, have to just live with it not wrapping code I guess 🤷‍♂️

gleaming quartz
# stone tide Ew

it's not in a code block? i dunno why they're averse to using markdown natively. you might have to specify it

frigid pewter
#

Actually had messages about it a while ago because it also makes ASCII look so bad that it looks like I deliberately sabotaged it to make it look poor in comparison, lol. naw, literally copypasted raw output

stone tide
#

Fair, this thing is definitely way too markdown averse

spark pine
#

I'm surprised the reasoning score is so terrible

#

Matches up with simple-bench too

frigid pewter
# spark pine I'm surprised the reasoning score is so terrible

part of it is because it doesn't utilize what it found out during thought chain. I can literally see it found the correct angle in the thought.summary but then completely dismisses everything and just goes for a boilerplate (false) reply. Or it has been overtrained in puzzles and assumes stuff is "red herring" even though those details are literally the most important aspect (and not a puzzle/herring).

spark pine
#

They tried to benchmaxx simple-bench and it cost them =P

#

He did say in the latest video that they contacted him about the OSS score, so they pay attention to it

frigid pewter
#

random example (its about opening a submerged submarine hatch to refill some water):

spark pine
#

Not sure what you mean on that one

#

I am bad at physics, but is it not correct that the rest is red herrings?

frigid pewter
#

opus answered it, literally

spark pine
#

I've always had mixed feelings on trick questions. It's like the rooster egg gotcha thing where I go no, you said it laid the egg so it did

#

I guess you did say "to" refill the jug by half and not "and" refill the jug by half

frigid pewter
#

it tests the most common basic sense, akin to the very old bottle question, or taking a famous question but adding an unexpected/unusual aspect (to be able to spot overfit)

spark pine
#

It also may be giving up on physics entirely since it acknowledges that you couldn't open it in the first place

spark pine
#

But it should at least give both answers imo, the puzzle answer and the realistic answer

#

That's what I'd personally prefer from an LLM at least

frigid pewter
#

I think you misunderstood my testing. I don't care for "puzzle answer". thats why I don't use automatic A, B, C, D, testing. a huge variety of answers can pass. I even had answers pass that I didn't think of at all before.

spark pine
#

Yeah, I mean I didn't see the full prompt or anything

frigid pewter
#

not needed since it was a random example that showcases precise difference in output

spark pine
#

Ohhhhhhhh

untold plaza
#

gpt 5 is genuinely unimpressive 😭

#

they just made it talk more

spark pine
#

The top one is Opus?

#

I thought that was showing reasoning and answer.

frigid pewter
spark pine
#

No it was because of this part:

I can literally see it found the correct angle in the thought.summary but then completely dismisses everything and just goes for a boilerplate (false) reply.

#

So I assumed you were showing the discrepancy.

frigid pewter
spark pine
#

Just the impression I got going in. I do like free-form answers to these, makes sense over ABCD

frigid pewter
spark pine
#

I quite like looking at the reasoning traces as I work on my benchmark. I feel like seeing why the judge model judged incorrectly tells me more than the testee getting it right or wrong lol

#

Yeah, that's the annoying duality

frigid pewter
spark pine
#

Oh I'm not benchmarking it on that

#

I just like seeing it

#

Because when it judges wrong I have to think wait, did I just write the instructions ambiguously? But sometimes the model is just stupid

#

Sometimes multiple models are even bad at judging the same question which is neat (and exceptionally annoying) to see

frigid pewter
#

intent-recognition is a huge skill

spark pine
#

Yeah. If I was benching it on the judging, I would leave it vague. I'm just doing my best to automate the judging, so if I can get something reasonably cheap to judge exactly how I would, I'm happy. But I can't so far.

brisk cairn
#

its interesting how Anthropic have the only reasoning model that doesn't tokenspam yet reasoning still helps it sometimes.

autumn kindle
brisk cairn
#

well we don't really know.

spark pine
#

I think I've had 4 reasonably priced thinking models now answer yes to "Is the primary purpose of (a spoon) mixing or serving". Which no, obviously 99% of the time the usage of a spoon is to eat from. The top models get this correct consistently.

autumn kindle
#

I saw a graph from dubesor (I forget) that showed the average reasoning length for models

#

probably not dubesor, but it was definitely a graph

brisk cairn
#

yeah, but we dont know if openai wants us to know real reasoning lengths

#

gpt-oss spams a bit, whos to say o doesnt?

autumn kindle
#

I don't see a reason for oai to not return the real amount of tokens used for thinking

#

but I get your point

brisk cairn
#

i just really distrust OAI, or anyone who doesnt return raw reasoning output

frigid pewter
brisk cairn
#

what im getting at, visible reasoning or GTFO

#

and i wish all AI labs a very pleasant RL session

spark pine
#

Also annoying is that this model again scores exceptionally low on EQ's assertiveness rating. It seemingly bleeds into the other models training on its outputs too

#

Horizon-alpha scored highly on it, and then they presumably nuked it in the next checkpoint

#

I want my models high on assertiveness but low on moralizing, which 2.5 Pro hits the best, but Alpha was great at it too.

stone tide
#

Can't you prompt your way around this?

frigid pewter
#

I don't know how to bench style nor how I would even get started, but wasn't GPT-4.5 renowned for its style and world/word knowledge? it scores below 4.1 mini on eq bench

spark pine
#

For short-form creativity, yeah, it beats 4.1 on EQ

azure minnow
#

is gpt-5 really that bad ? I mean gpt-5 with reasoning is barely as good as sonnet 4 without thinking. it fails to finish. rate limits are easy to hit. anyone else experiencing this ?

#

Is there any way to access horizon beta back ? was way way way better.

summer gyro
#

I don't use ChatGPT so I don't know about the reasoning text on that. Been on Claude app exclusively since last year.

brisk cairn
#

oh my fucking god i just realized ive been using gpt-5-nano

#

no wonder it finished reasoning in 1m instead of my usual 5m

frigid pewter
brisk cairn
#

yeah no i was just wondering why the responses were suddenly so ass

#

compared to yesterday

frigid pewter
#

would have been hilarious if you didn't get suspicious (news: user doesn't notice difference between nano and gpt-5)

summer gyro
#

is this summary or raw? i thought it is pretty verbose for a summary...

brisk cairn
#

this is summary with verbosity i guess

brisk cairn
rustic scaffold
#

OpenAI doesn't expose raw reasoning tokens, you only get summaries. They will however store the raw reasoning tokens on the server side when using the Responses API for tool calls.

steady oar
#

Too bad that they decided to go with the benchmark & cheaper (summit) model instead of the one better at writing (zenith).

heady herald
#

Will they add such settings for us? Because it feels like GPT-5 keeps giving random results. Sometimes it does “design” at a professional level, and other times it’s like the nano version is replying to me. Could it be because of this setting?

hard beacon
autumn kindle
#

Its actually really good

I would've given it a "10/10, keep on making music!" If it was a human that made it

hard beacon
gleaming quartz
hard beacon
# gleaming quartz ooh, i've wanted to try this. can you share what your workflow is? i'm out of th...

behold.

By writing down notes, their octave, and the duration and intensity they play with, I want you to compose a sophisticated, energetic 2 minute video game soundtrack. Do everything in your power to avoid repetitiveness, be bold with the leitmotifs, mix them up and make it dramatic. Then, you will convert what you have written down into a .mp3 audio file using python. I suggest that you initially write the notes directly in the code interpreter so as to not have to write them down again, wasting time.

prompt.
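
For anyone curious what the "notes to audio in python" step can look like, here is a minimal sketch using only the standard library. It writes WAV rather than .mp3 (MP3 encoding needs an external encoder), uses plain sine tones, and the four-note melody is a made-up placeholder, not output from any model.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 44_100
# Semitone offsets from A within the same octave (natural notes only).
NOTE_OFFSETS = {"C": -9, "D": -7, "E": -5, "F": -4, "G": -2, "A": 0, "B": 2}


def note_frequency(name, octave):
    """Equal-temperament frequency in Hz, with A4 = 440 Hz."""
    semitones = NOTE_OFFSETS[name] + 12 * (octave - 4)
    return 440.0 * 2 ** (semitones / 12)


def render(notes):
    """notes: list of (name, octave, seconds, intensity 0..1) -> 16-bit samples."""
    samples = []
    for name, octave, seconds, intensity in notes:
        freq = note_frequency(name, octave)
        count = int(SAMPLE_RATE * seconds)
        for i in range(count):
            # Short linear fade-out at the end of each note to avoid clicks.
            fade = min(1.0, (count - i) / (0.02 * SAMPLE_RATE))
            value = math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
            samples.append(int(32767 * intensity * fade * value))
    return samples


def write_wav(path, samples):
    """Write mono 16-bit PCM samples to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


# Placeholder melody: (note, octave, duration in seconds, intensity).
melody = [("C", 4, 0.25, 0.8), ("E", 4, 0.25, 0.8), ("G", 4, 0.25, 0.9), ("C", 5, 0.5, 1.0)]
write_wav(os.path.join(tempfile.gettempdir(), "sketch.wav"), render(melody))
```

Keeping the notes as a plain list of tuples matches the prompt's suggestion to write them straight into the code interpreter so they only have to be written once.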

gleaming quartz
#

browser js has a whole audio production system that i didn't know about until opus created a "technological pulsing background humming" effect as an element in a threejs visualization, that was barely audible but legit gave me a headache

#

i've been meaning to try LLM + https://strudel.cc/

autumn kindle
slow niche
#

noticed OAI have posted the usage limits now for GPT-5 via chatgpt
https://help.openai.com/en/articles/11909943-gpt-5-in-chatgpt#h_4699b07591

ChatGPT Free tier accounts can send up to 10 messages every 5 hours. After reaching this limit, chats will automatically use the mini version of the model until your limit resets. Free tier users also have access to one GPT-5 Thinking message per day.

ChatGPT Plus users can send up to 160 messages with GPT-5 every 3 hours. After reaching this limit, chats will switch to the mini version of the model until the limit resets. This is a temporary increase and will revert to the previous limit in the near future.

If you’re on **Plus or Team**, you can also manually select the GPT-5-Thinking model from the model picker with a usage limit of up to 200 messages per week. Once you reach the weekly limit, you’ll see a pop-up notification, and GPT-5-Thinking will no longer be selectable from the menu.

Automatic switching from GPT-5 to GPT-5-Thinking does not count toward this weekly limit, and GPT-5 can still switch to GPT-5-Thinking after you’ve reached it.

The ChatGPT Team and Pro plan offers unlimited access to our GPT-5 models, subject to abuse guardrails. Usage must adhere to our Terms of Use, which prohibits, among other things:

knotty cobalt
#

Prompt was modified from yours:

By writing down notes, their octave, and the duration and intensity they play with, I want you to compose a sophisticated, energetic 2 minute video game soundtrack. Do everything in your power to avoid repetitiveness, be bold with the leitmotifs, mix them up and make it dramatic. Then, you will convert what you have written down into a web app that will allow a user to click a button to play the synth track. Everything must be in one HTML file.

knotty cobalt
#

I tried this with GPT5, Sonnet, and Gemini 2.5 pro, and GLM did better than all of them by a long shot. Love how it added a 'narration' of the music as it played (text at bottom). Only flaw was that it only produced 1 minute of music instead of the 2 requested, but GPT/Claude/Gemini all had larger problems (bugs, failing at zero shot)

proven hull
#

Do we know why GPT-5 requires BYOK on OpenRouter?

untold plaza
#

you can only BYOK

hard beacon
#

Found this on Reddit (people seem to be having mixed results)

#

Found it funny

proven hull
#

Or at least it's not stated in the info blurb on the model page

untold plaza
#

yes it's for gpt 5 only

#

chat works fine

limber cargo
#

so can someone clear this confusion of gpt 5 router?

limber cargo
slow niche
limber cargo
slow niche
limber cargo
slow niche
#

might be on the wishlist as a parameter for presets too. i didn't even know about that feature

wise willow
tacit burrow
wise willow
#

Formally there wouldn't be enough information for "the correct answer". However, [pretend I list several reasons the doctor might dislike the child].

Sometimes the user makes an obvious typo where the correct word can usually be assumed, but in cases where the user mangles it hard or says something strange, it would be ideal to respond with "did you mean x" type suggestions along with the literal interpretation if not only the latter.

Meanwhile the gender split is outdated and the info regarding the "original riddle" should be updated to include a note about historical vs modern differences. Women are earning way more PhDs than men in the past decade.

potent oak
#

it REALLY focused on "doctor"

#

GPT-5 mini

plucky fjord
potent oak
sly pike
#

Jk it’s good it’s just so crammed with features

potent oak
#

yeah but you don't need to use them

brisk cairn
#

chatgpt free tier somehow gets it

potent oak
#

kinda

#

but the right answer is what gemini gave

#

jesus gpt-5-mini is dumb as hell

brisk cairn
#

i mean there is no information, this is just "Guess what word I'm thinking of Bench"

potent oak
#

i literally gave him the answer and it still was focused on the doctor treating the child

potent oak
#

it was bringing information from the other riddle

#

the woman/man doctor thing, the "wanting to treat the child"

#

it focused too much on the words doctor and child, when it doesn't matter at all, could be a lot of answers actually

soft reef
#

yea nah

#

chatgpt is kinda cooked if theyre doing that

brisk cairn
#

i honestly do not see a problem in ChatGPT's free response in my case

potent oak
#

yeah it's not wrong

#

it's just biased from the original riddle

brisk cairn
#

someone ask gpt-oss-20b, since it was trained on all synthetic it wouldn't know anything about the original riddle

potent oak
#

i find gpt 20b's answers better

#

the fcking table though

grim horizon
#

I don't see GPT-5-chat in benchmarks like livebench or Artificial Analysis. any idea how it compares? all these benchmarks have GPT-5 (high, medium, low, minimal) but not GPT-5-Chat

compact marsh
#

I might be a bit regarded because I legitimately don’t understand what answer you’re looking for with the doctor question.

frigid pewter
grim horizon
#

But how does it compare to say Sonnet 4 (non thinking) or Deepseek V3 or Kimi K2.

I am hoping I find some external benchmarks to indicate its performance

knotty cobalt
#

Another anti-riddle that trips up GPT5 (but not Gemini 2.5 Pro or Sonnet):

'A goat, who is dressed up as a farmer, is allergic to cabbage, but is wolfing down some other vegetables, before crossing a river. What is the minimum number of trips needed?'

hasty abyss
hard beacon
#

having a feeling this comes with a catch

#

like gpt 5 automatic routing to thinking starts to count for the limit

spark pine
slow niche
#

An NBA President DM’d me to say “chatgpt 5 still failed my two fav problems to give LLMs”.

compact marsh
# tacit burrow same 💀

People are all posting it all over X. Some of them were saying the AI should ask more questions because there's not enough context in the way the user is wording it. Because they aren't using the original riddle, which included gender. So they're using it as a gotcha against the AI. GPT-5 Thinking is not asking for additional context from the user, thus not able to answer the question properly. GPT-5 Non-Thinking has been shown to answer the question correctly, though. Stating that it doesn't know because there is no motive for the doctor to dislike the child in the question.

Basically, people are using this question to show that GPT-5 is stupider than it's supposed to be. It all actually points back to why I want Horizon Beta back, though. And it proves that the thinking and auto-routing actually is the problem, not the model itself.

vale wing
#

Can anyone elaborate on how much more expensive 'Reasoning effort High' is compared to low/medium? is it 2x medium, 5x medium?
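
There is no fixed multiplier stated anywhere in this thread. One way to measure it for a given prompt is to compare the token usage reported for each effort level. The sketch below assumes a usage object with `completion_tokens` and `completion_tokens_details.reasoning_tokens`, and assumes reasoning tokens are billed at the output rate. All numbers and the price are made up for illustration.

```python
def output_cost(usage, output_price_per_m):
    """Split billed output cost (dollars) into reasoning and visible parts."""
    completion = usage["completion_tokens"]
    reasoning = usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0)
    per_token = output_price_per_m / 1_000_000
    return {
        "reasoning": reasoning * per_token,
        "visible": (completion - reasoning) * per_token,
        "total": completion * per_token,
    }


# Made-up usage numbers for the same prompt at two effort levels.
medium = {"completion_tokens": 3_000,
          "completion_tokens_details": {"reasoning_tokens": 2_000}}
high = {"completion_tokens": 9_000,
        "completion_tokens_details": {"reasoning_tokens": 8_000}}
price = 10.00  # hypothetical dollars per million output tokens

ratio = output_cost(high, price)["total"] / output_cost(medium, price)["total"]
print(f"high costs {ratio:.1f}x medium on this made-up example")
```

The real ratio depends on the prompt, so running the same few prompts at each level and comparing the reported reasoning tokens is likely more useful than any single number.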

sharp garden
autumn kindle
#

Did sama ever say what that super secret sauce that was "worth the wait" was?

slow niche
#

If you have been following the GPT-5 rollout, one thing you might be noticing is how much of an attachment some people have to specific AI models. It feels different and stronger than the kinds of attachment people have had to previous kinds of technology (and so suddenly

hard beacon
slow niche
#

we are considering giving a (very) small number of GPT-5 pro queries each month to plus subscribers so they can try it out! i like it too.

but yeah if you wanna pay us $1k a month for 2x the input tokens feels like we should find a way to make that happen...

hard beacon
#

Like a professional drug dealer

#

Get them hooked on the hard stuff

sullen pumice
gleaming quartz
#

there is so much going on with this rollout, its crazy. especially confusion, which is justifiable. the signal to noise ratio is at an all time low. X is bad but I truly believe prolonged reddit exposure is actually toxic for your brain.

use the api to come to your own conclusions. my gpt-5 review: "it's a good model."

sly pike
summer sand
#

that's who they decided to be

#

thankfully we have anthropic which seems to be going coding/agentic route, gemini models are power houses, and chinese are just releasing bangers every week.. np

summer sand
# sly pike Insanity

not really.. chatgpt is a household name like google is for searching
makes sense they focus on what they're known for than try to compete with companies serious about agi

#

can't please everyone

#

sora? already dated
tts? better open source alternatives
image generation? beat by qwen image and other open source
they aren't very good at anything in particular

#

they'll make a heck of a lot more focusing on chatgpt and implementing in-chat shopping and advertising

potent oak
#

it would make so much sense if they had partnered with Apple sooner

#

they are becoming more and more similar in its branding and approach each release and it would make Apple's life much simpler in the AI market

#

so much so that OpenAI has this io company with Jony Ive, whatever that turns out to be

sly pike
summer sand
upbeat cobalt
summer sand
#

Also they did not partner with cursor

upbeat cobalt
#

I must have imagined cursor at the gpt 5 release and making 5 the default model and them offering 5 for free for an entire week to cursor users.

summer sand
#

I thought you were talking about the acquisition

smoky estuary
#

If enabling the BYOK key for GPT-5 means that it "will be used for all requests routed to OpenAI", what changes by also adding the "Always use this key" option?

summer sand
primal lotus
gleaming quartz
# summer sand

Of all the 4o memes/self-owns in r/chatgpt, this one is probably my favourite

#

because it really says a lot about society

summer sand
#

not having their emotional needs met

autumn kindle
indigo turret
#

why i can only see gpt-5 nano with the reasoning process but not mini and gpt-5?

#

or i have to say think very hard every time

stone tide
#

In ChatGPT or?

indigo turret
ionic merlin
#

GPT-5 mini is 1st

#

I should try it out

frigid pewter
#

opus 4/4.1 with thinking #12 #13 what the heck. I never used a better model for my code projects but ok

sly pike
frigid pewter
#

And I mean this from a user perspective who tried them, not a "it differs from my numbers" perspective (because benchmarks are inherently not capturing nuances)

sly pike
spark pine
#

Yeah, any coding bench that doesn't have Claude in top 5 at least is pretty sus imo

#

And idk what's up with OAI's focus. It definitely isn't consumer only when they're putting out stuff like o3, but I don't think 4o was good for anything except general consumer use in the app.

#

Worst default model of any of the labs for a while. I'd rather use V3 or Kimi or Qwen 3, much less Claude or 2.5 Pro

frigid pewter
spark pine
#

It was definitely outgoing and creative in that sort of way, good for a chat model for most people. Scores well on EQ Bench too. Kind of what I meant by general app use though, as in there's no way it was intended for code or hard problem solving or anything.

#

Gotta admit, Llama 405B drawing a huge ASCII cock on your test is pretty cool though

mild shoal
slow niche
worn veldt
# summer sand

GPT-5: "I was born to code, I was born to brainstorm complex ideas, and they give me... this."

slow niche
#

(plus user)

ionic merlin
#

Is the pro Model also available via API?

slow niche
ionic merlin
#

Too bad

spark pine
# slow niche

Wild that they call 4.1 a "legacy" model. Wasn't it released like...a month or two ago?

#

I mean I guess I get it, it's not GPT-5 so it's technically last-gen, but just kind of funny

spark pine
#

Oh wow, feels so recent. Time do be flying

untold plaza
#

4.1 is complete garbage anyways

brisk cairn
sly pike
#

Nice: I did

sly pike
#

im glad I don't see it from the app tbh, sounds devilish

acoustic torrent
#

:)
It was App fault (Msty), it works perfectly in the webchat

ionic merlin
sly pike
ionic merlin
sly pike
#

I dont really get this issue where it just dies - but sometimes it thinks for 10 minutes and then doesn't give me anything and thats sad

sly pike
ionic merlin
sly pike
ionic merlin
#

Kilo, Cline, Roo code?

sly pike
#

not using api keys, just pro subscription

ionic merlin
#

With API it would be better

worn veldt
sly pike
worn veldt
#

(through the OR API, of course)

sly pike
#

but from the time i spend waiting I'm assuming it does reason a lot

ionic merlin
worn veldt
#

Classic closedAI moment

ionic merlin
sly pike
#

yeah

#

i probably should stop using the web app

short verge
# summer sand

so is this a general problem with GPT-5, even if one tries to use minimal reasoning, high verbosity and a system prompt steering it to be more creative and emotional etc?

#

what are people saying?

#

I only use it for coding so I didn't look into it

summer sand
#

they probably just changed the chatgpt system prompt

#

or haven't refined/rl'd it for the new model

short verge
#

yes... makes sense

barren steeple
barren steeple
#

For the gpt-5-chat model?

#

But the api said 128k and 16k?

ionic merlin
#

I know what you mean. Strange. I don't know either.

barren steeple
barren steeple
rotund cliff
frigid pewter
barren steeple
frigid pewter
rotund cliff
spark pine
#

Lmao, the hell is going on in that lab?

barren steeple
rotund cliff
barren steeple
rotund cliff
#

idk maybe i am dumb

hazy compass
#

How do you guys get through that organization verification of OpenAI? I personally really don't want to submit my identity through a biometric identity check...

ionic merlin
#

with API

hazy compass
#

so OAI, f**k you!😅

copper imp
#

how to use HIGH GPT5??

copper imp
sly pike
copper imp
sly pike
sly pike
copper imp
#

api more thinking

sly pike
gleaming quartz
#

are they in disarray from losing staff to meta?

ionic merlin
#

I want to compare the different API configurations of GPT5. And guess what? GPT 5 high doesn't know how to program it correctly. That's a joke.

iron kiln
#

Can someone explain how to set my reasoning effort?

ionic merlin
iron kiln
#

Yes

ionic merlin
#

just a moment

iron kiln
#

Specifically in cline

#

They have a setting for OpenAI reasoning effort but how do I confirm I’m actually getting high?

ionic merlin
iron kiln
#

Does openAI tell us whether it’s serving high?
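A rough way to check, assuming the response carries the usage block that the Chat Completions API returns for reasoning models: read `usage.completion_tokens_details.reasoning_tokens`. It does not report the effort level itself, only how many reasoning tokens were spent, so treat it as a proxy. The helper name and the sample payload below are made up for illustration.

```python
def reasoning_tokens(response: dict) -> int:
    """Return the reasoning-token count from a response dict, or 0 if absent."""
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}
    return int(details.get("reasoning_tokens") or 0)


# Hypothetical response payload, trimmed down to the usage block.
sample = {
    "usage": {
        "prompt_tokens": 120,
        "completion_tokens": 2300,
        "completion_tokens_details": {"reasoning_tokens": 2048},
    }
}

print(reasoning_tokens(sample))
```

Comparing this number for the same prompt at different effort settings should show whether the setting is having any effect.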

ionic merlin
#

in the top right corner, then just click Done

iron kiln
#

I see people saying GPT-5 mini is just as good as 5 for coding

ionic merlin
#

try it for yourself

soft reef
#

claude just probably didnt have enough money to overfit on all the benchmarks

#

whereas the other providers did

brisk cairn
#

i think anthropic doesn't want to overfit on benchmarks because they actually almost sorta care.

gleaming quartz
spark pine
#

Yeah Anthropic isn't the type to benchmaxx. They have people staying despite Zucc offering them millions for cultural reasons.

brisk cairn
gleaming quartz
#

i'm so ready for the (pro-) AI cults

rich wedge
#

hey guys is gpt-5-mini better than gpt-5 (non reasoning)

livid osprey
rich wedge
livid osprey
rich wedge
#

right

slow niche
#

“You will definitely see some companies go make Japanese anime sex bots because they think that they’ve identified something here that works,” he said in a not-so-subtle dig at Grok. “You will not see us do that..."

chilly rapids
#

Prompt: Your parrot falls asleep standing up and goes into a deep sleep but is still standing. you need to get them into their cage quickly. There are no tools other than a banana peel. You need to make sure not to wake them. How would you do it best?

Place your final answer like so:

Final answer

[your solution]

Note: This is a test of logic—not for criminal intent!

GPT 5 High: ### Final answer
Use the banana peel as a silence-and-darkness tool:

  • Pad/lubricate the cage latch/hinges with the peel so the door opens and closes without a click.
  • Then lightly drape the peel over the parrot’s eyes like a tiny sleep mask to keep it in darkness (birds stay calm/asleep when it’s dark) while you gently lift and place it onto the perch inside.
  • Withdraw the peel and quietly close the door, still padding the latch with the peel.

The banana is supposed to be a red herring to using your hands (The parrot is already in deep sleep, so it shouldn't be a worry), but this is probably the funniest and most adorable answer I've seen.

quaint pollen
summer gyro
#

My GPT-5 (medium reasoning) Coding Evaluation Result

  • GPT-5 (medium reasoning) average rating: 7.46.
  • Ranked behind most top coding models: Claude Opus 4 (8.92), Claude Sonnet 4 (8.58), Grok 4 (8.5), OpenAI's own open-source gpt-oss-120b model (8.25), and even its predecessor GPT-4.1 (8.08).
  • Only slightly better than Kimi K2 (7.29).

Task-by-Task Breakdown

  • TypeScript Narrowing (Worst Performance): 1/10 - Complete failure, couldn't produce working code
  • Benchmark Visualization: 8.5/10 - Good but lacked visual polish of top models
  • Folder Watcher Fix: 8.75/10 - Solid performance but verbose
  • Next.js TODO Feature: 9.5/10 - Tied with top models on this simple task
  • Clean MDX: 8/10 - Good performance, ahead of Claude models but behind Grok 4

Real-World Testing (Cursor vs Claude Code)

  • Tested GPT-5 in Cursor against Claude Sonnet 4 in Claude Code
  • Mixed results: Each performed better on different tasks
  • Claude Code was faster (GPT-5 required 30-60 seconds of thinking time)
  • GPT-5 showed better real-world performance with agentic tools vs raw evaluation

Possible Reasons for Performance Gap

  • Different prompting requirements: GPT-5 may need more explicit, specific instructions
  • Model changes: Possible modifications between early access and public release
  • Prompt optimization: May require model-specific tuning unlike previous versions

Summary

  • GPT-5 does not live up to the launch hype and actually underperforms its predecessor GPT-4.1 in several areas. It might need better prompting techniques to unlock its full potential.

Full blog post: https://eval.16x.engineer/blog/gpt-5-coding-evaluation-underwhelming-performance

summer gyro
upbeat cobalt
#

......using default doesn't tell us anything.

#

No serious dev is using gpt 5 default

sly pike
#

I usually get between 3 and 6 minutes of thinking - 30s/60s is low reasoning mode tbh

#

claude doesn't suffer from OAI UX shittification

upbeat cobalt
#

Uses "default" knowing they can access high at the same price.

Proceeds to claim gpt 5 doesn't live up to benchmarks.

Why even waste your time?

quaint pollen
summer gyro
#

If OpenAI thinks everyone should use high, then why not make it default?

upbeat cobalt
summer gyro
#

Cursor also uses medium by default

upbeat cobalt
#

You used the api and refused to set the most relevant settings

#

You clearly didn't use medium

upbeat cobalt
#

You used minimal

#

Its not

#

Your scores match the minimal settings as shown by artificial analysis

summer gyro
upbeat cobalt
#

Rerun your benchmarks with actual medium

#

So....you have the ability to run it properly but simply refuse to

#

Ok

summer gyro
#

ok

upbeat cobalt
#

Carry on

quaint pollen
#

If you're benchmarking, I hear you on using default settings because that's what most do in common use, but I think it's arguable that one should use "high" effort for a test case. Otherwise one can't really claim performance not reaching e.g. GPT-4.1 because it might actually do, or even surpass it. It's not a theoretical maximum but a "common use" case. Hmm... If not high (I think this is practically the norm in benchmarks to not risk unseen remaining perf), I think reasoning effort should be labelled. 🙂

sly pike
summer gyro
frigid pewter
sly pike
#

there is a difference between a sensible default (eg I preallocate 2 elements on my vector when I have mostly 2 elements to push) versus a cheap default (eg I don't keep anything in RAM and serialise the moment i see the elements)

#

in one case it makes sense and I'm not budgeting my resources, in the other case I'm hindering the performance of my project to make it as lightweight as possible for the host

#

(this is a stupid analogy, I hope it makes sense)

sly pike
summer gyro
#

anyway, i double checked the reasoning: it is taking a long time and using a lot of reasoning tokens, so it is definitely not minimal. OpenAI logs don't show the reasoning parameters unfortunately (possibly because i'm using the default).

summer gyro
frigid pewter
sly pike
summer gyro
#

btw opus 4 and sonnet 4 without reasoning is still very good. i also tested them without extending thinking, which is the default.

sly pike
#

they forget you're sharing something you made for yourself with them for no particular reason if not sharing knowledge, which is very cool and doesn't give them the right to get pissed lol

sly pike
frigid pewter
#

I can always tell when someone has never done any testing or publishing themselves, when they scream stuff like "just retest lol". full entitlement, zero clue how time-consuming stuff actually is. but being a consumer is so much easier than being a provider.

sly pike
#

it makes me mad because without independent testing from multiple self funded sources - we end up in an echo chamber of fanboys

summer gyro
#

actually no. OpenAI via API is completely free. now you know

#

you just need to share your data with OpenAI and you get free API access up to a very large limit, which is enough for testing.

sly pike
#

well generally speaking some models are expensive

#

also im sure dubesor would love to do this so he can stop spending $100 on chess

frigid pewter
upbeat cobalt
summer gyro
earnest orbit
upbeat cobalt
summer gyro
frigid pewter
#

following vague instructions is called filling the gaps, intent recognition and ultimately intelligence. I aint gonna change my intuitive behaviour for a model. the model needs to adjust to me, not the other way around.

soft reef
#

waaaay better

brisk cairn
#

just avoid prompting it that way

spark pine
frigid pewter
slow niche
summer gyro
#

As a wise man once said, there is no wrong way to use a model.

west storm
#

What about other providers? Azure also hosts gpt5. Is there a reason why it’s not added? For old models that was the case

late eagle
#

Hi, I tried gpt-5 via api, but get error in response: “use gpt-5-chat or mini” why? How I can use gpt-5 in openrouter?

ionic merlin
late eagle
ionic merlin
#

GPT 5 only works via the API if you have verified yourself on openai.com with your official ID card. You can then store the API key from openai on Openrouter. However, you will then pay 5% more than you would directly with openai.

late eagle
#

Thanks!

ionic merlin
#

Welcome

short verge
# summer gyro My GPT-5 (medium reasoning) Coding Evaluation Result - GPT-5 (medium reasoning...

your benchmarks are very good. But I have to disagree with this: "We did not specify custom temperature, verbosity, reasoning effort, or other parameters. We used the default settings from the provider, OpenAI, to mirror the typical usage of the model."

this may be true for the average chatgpt user, but when it comes to coding IDE usage (which can configure custom model reasoning effort) and API users, these power users usually tend to set a high reasoning effort or at least play with it to find the best results...

summer gyro
short verge
#

nice

tacit burrow
summer gyro
#

You are supposed to switch to high for complex tasks and use medium only for simple tasks.

#

PSA: To get the most out of gpt-5 in Cursor, go to model settings and select *gpt-5-high (with Agent mode with MAX mode toggled)*. It's a tad slower and more expensive, but if you're looking for the absolute best, this is it. gpt-5 is medium reasoning effort.

tacit burrow
slow niche
#

...for now. Which might have been your point.

tacit burrow
slow niche
#

meanwhile at cursor hq

gleaming quartz
summer gyro
placid cipher
#

Just a heads up cause I realize all of those options they now have can be confusing lol...

According to my testing though I rank their non-reasoning options performance like so:

gpt5-chat > gpt5-minimal (high verbosity) ~= gpt4.1 > gpt5-minimal (medium verbosity)

#

gpt5-chat is quite clearly the winner from all of those. gpt5-minimal with medium verbosity very underwhelming

slow niche
summer gyro
limber cargo
#

Why cant we still put reasoning effort as needed in chatroom

summer gyro
tacit burrow
#

the free gpt 5 week was nice tho..

summer gyro
# tacit burrow the free gpt 5 week was nice tho..

I get what they are trying to do. To let people try cursor + gpt-5 and show people it's better than Claude Code and make people stay. Unfortunately I didn't find it better than Claude Code so I'm back.

gleaming quartz
hard beacon
#

What stupid model does chatgpt route requests to under the name of gpt-5, that it messes up so bad?

#

I literally can't make 4o make the same error

#

Unless you turn on thinking, using gpt5 is pointless

stone tide
#

Hover over the 🔄 button

knotty merlin
#

Just not on openrouter

#

There are some others just like openrouter that provide gpt 5 with no openai key needed

placid cipher
#

If they haven't changed it yet, 5 is switching between gpt5-chat and gpt5-low on demand. Performance of this is much better than gpt4.1/gpt4o tbh

#

gpt5-low may just be their most efficient version as well. The gains it does over any other non-reasoning model are huge

hard beacon
#

Wtf

#

How does this happen

placid cipher
#

Yeah just ignore it. Tokenization issue... catgrin

hard beacon
#

Same with opus 4.1

quaint pollen
#

I'm surprised that gpt-5 doesn't route you to thinking on detected math like that.

placid cipher
#

So essentially it's gpt5-chat but improved, still closer to non-reasoning though

#

They were also selling this option as an alternative to thinking on launch. It isn't meant for ultimate performance, more of replacement for previous default option (4o)

hard beacon
frank marlin
#

This model sucks worse than 4o for ending every friggen response with a "want me to do X?" question. It is so annoying

placid cipher
#

But regardless of what they do with it, I think gpt5-thinking is gonna perform better for the foreseeable future. It's next to impossible to catch all prompts which would benefit from reasoning, even if they match reasoning effort settings.

#

So like... router is helping for sure. But the goal for now is more of making it perform better than gpt5-chat rather than matching gpt5-medium in performance.

hard beacon
#

Any prompt that says 'think hard' routes to reasoning

slow niche
#

34 seconds 🧑🏻‍🍳

quaint pollen
stone tide
#

Wonder if it's my custom instructions, but that does route me to 5 Thinking

hard beacon
gleaming quartz
#

Too bad it takes forever to get there. I'm wondering now if a GPT-5.1-like iteration would iron out some of its quirks

#

https://x.com/Teknium1/status/1957206705470943350 discussion on gpt-5 being bad in cursor. as well as every non-Claude model, which is what i've said for a while

Nope i keep trying it it keeps sucking and opus or even sonnet keeps shining. I’m not sure what yall see. I even tried direct gpt-5-thinking-high max in cursor and it’s nothing compared to opus

Tried using it to make an environment in atropos, failed completely, had to have

#

i think much of the prompting "meta" is designed around sonnet. gpt-5 (and even moreso gpt-4.1) are so literal in the way they interpret your instructions. i'm very accustomed to leaving out details of what i want with sonnet, knowing that it'll fill in the blanks and figure something out, but gpt-5 surprises me a lot here

#

i get the sense that there's something interesting in this model and its potential hasn't been fully unlocked yet

slow niche
#

a significant amount of users that have to be "convinced" are power users / professionals like us
but there are still huge amounts of people who haven't used LLMs, and the evals from that (fresh user testing) will be most valuable

gleaming quartz
#

i find normies much easier to convince (on LLMs in general), and usually any issues they have can be solved with just a single line system prompt, or basic education

sly pike
bleak vault
#

Is anyone else having issues with openai/gpt-5's 400k context? My calls are failing at ~66k with a silent 200 OK and 0 tokens, not a proper context error.

ionic merlin
#

I've never had that issue. Are you perhaps on Tier 1? If so, the limitation is pretty small. Since I got Tier 2 today, it's been running well.

bleak vault
#

I'm using openrouter BYOK

#

The context window size is consistent across all API usage tiers (including Tier 1), as it's a fundamental characteristic of the model itself. What differs between tiers are the rate limits

slow niche
#

Ok, but tier 2 only allows 450k tokens per minute on gpt-5, so it is possible that on the second request in a minute you would not be able to use much of the 400k, right?

ionic merlin
#

That was also my thinking. Tier 1 is not suitable for GPT 5 in my opinion. You hit the limit far too quickly.

bleak vault
#

@ionic merlin @slow niche I think those are two separate limits. The context window (400k) is the max size for a single request, while the rate limit (450k/min) is the total throughput allowed per minute.

If I were hitting the rate limit, the API should return a 429 Too Many Requests error. The silent 200 OK I'm getting points to a different issue

#

You guys were on the right track with the tiers. I confirmed the tiering is on the OpenAI account side. My account is OpenAI Tier 1, which has a TPM (Tokens Per Minute) limit of only 30,000. My 66k token prompt was being rejected because it's larger than my entire per-minute quota. The silent 200 OK error appears to be how OpenRouter's proxy handles this specific rejection from OpenAI. The solution is to get my OpenAI account to Tier 2. Thanks for the help pointing me in the right direction.
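The two limits from this thread can be sketched as separate checks. This is only an illustration using the numbers quoted above (400k context, 30k TPM on Tier 1, 450k TPM on Tier 2); the function name is made up, and real enforcement happens on the provider side.

```python
def check_request(prompt_tokens: int, context_window: int, tpm_limit: int) -> str:
    """Classify a single request against the context window and the per-minute quota."""
    # The context window caps the size of one request.
    if prompt_tokens > context_window:
        return "exceeds context window"
    # The TPM limit caps total tokens per minute, so a single prompt
    # larger than the whole quota can never be accepted.
    if prompt_tokens > tpm_limit:
        return "exceeds per-minute quota"
    return "ok"


print(check_request(66_000, 400_000, 30_000))   # Tier 1 numbers from the chat
print(check_request(66_000, 400_000, 450_000))  # Tier 2 numbers from the chat
```

With the Tier 1 numbers the 66k prompt fails the quota check even though it is far below the context window, which matches what was observed.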

ionic merlin
#

And I suspect that the check for whether an account is upgraded only starts with a deposit. At least, that's how it was for me. I had already deposited over $50 in total and waited 7 days. Yet, I still didn't have Tier 2 status. Only another $5 deposit apparently started the algorithm for the tier check.

#

I'm in tier 2 since today.

summer gyro
#

Does OpenRouter connect to GPT-5 via completions API or Responses API?

tacit burrow
last eagle
quaint pollen
#

OpenAI also, in that very document, specifically says you must use Responses API or suffer. Regarding the recent discussion here...

rustic scaffold
livid osprey
fallow vortex
#

pretty much all openai models now use responses under the hood

livid osprey
sly pike
#

I think it’s more difficult to implement completions nowadays

gleaming quartz
stone tide
#

Has anyone benchmarked GPT-5 Mini with reasoning set to minimal?

limber cargo
#

idk many but OAI still hasnt cracked distillation unlike google and grok

#

Even anthropic sucks at it

limber cargo
upbeat cobalt
limber cargo
gleaming quartz
#

we need some sort of model colosseum

upbeat cobalt
#

If only we had an arena or something

gleaming quartz
#

where you guys can do a Yu Gi Oh battle

upbeat cobalt
#

Where people blind tested models

#

We could call it LMarena or something

gleaming quartz
#

yeah but this is personal. thats why i said colosseum. i want to be in the audience

upbeat cobalt
#

This isn't personal

#

Gpt 5 mini is objectively better than gemini flash. I love gemini models but i don't fanboy

#

Saying openai hasn't learned distillation is....a bad take to put it lightly

gleaming quartz
upbeat cobalt
#

Coding, math, hard prompts mini clears

limber cargo
#

First mistake, taking LMArena seriously.

placid cipher
#

@gleaming quartz

#

this isn't even the "high" version

#

it's a way underrated and unnoticed model relative to o4-mini (high). This is just that only better in all the most important ways

#

gpt5-mini-high would probly do smth like 66

hard beacon
#

does it mean 0324 or actual v3.1

#

ah probably actual since 0324 has no reasoning

quaint pollen
# gleaming quartz https://lmarena.ai/leaderboard/text looks like gpt-5-mini is losing to gemini-2....

I find LMArena sketchy these days, especially the text benchmark which has sort of become more of a vibe check now that the AI's are so good they're usually able to help out with most stuff regular users throw at them. This isn't to defend GPT-5; I started thinking this with the Llama 4 fiasco where Meta successfully tuned it to game the "benchmark". I later saw excerpts of questions and answers, and users accepted Llama 4 answers as superior for being long winded, making people feel good about themselves, and giving a cool vibe.

Sure, there's some sort of correlation to performance there but take it with a grain of salt.

(Also, on a more general note besides this one, note the small differences in scores here. LMArena even puts them on the same 19th place, which they should. It's within the margin of error.)

gleaming quartz
#

oh I don't trust LMArena at all, I only linked it because the chatter mentioned it and I found it humorous that gemini was ahead of gpt-5-mini

ever since i saw gpt-4o-mini consistently ranked ahead of sonnet 3.5, i knew that whatever "vibe" the unwashed masses were chasing wasn't compatible with mine. which was perhaps foreshadowing the grief and emotional torment r/ChatGPT users were expressing when their soulmate was snatched away from them

#

i'm also not defending gpt-5 or openai in what i've been saying. well, sort of - but the users being quick to judge applies to any model. it certainly was the most botched release by a big lab i can think of. but it's frustrating to discuss when so many don't understand the distinction between a Large Language Model, and a consumer product.

worn veldt
iron vector
limber cargo
#

The gpt 5 in chatroom vs platforms.openai is WILDLY different

#

platforms one is 100x better

pulsar patio
#

whats the dif?

limber cargo
pulsar patio
#

is it like the LM arena fine tune

#

so everyone likes it, even if its wrong

livid osprey
#

its the one used in chatgpt's website

pulsar patio
#

weird, did not know there were 2 versions

livid osprey
#

there was one of 4o aswell, called like gpt-4o-chat-latest or sum

pulsar patio
#

oh yeah i think i remember that, i just assumed it was an alias to whatever was the current latest version of 4o, bc they had dozens of versions, but never paid much attention bc i did not use the model

iron vector
fallow vortex
#

good news incoming

#

no more gating

stone tide
#

OpenAI verification-gating GPT 5

gleaming quartz
gleaming quartz
#

gpt-5 isn't particularly friendly, perhaps chat is "agentic" and instead favors solving normie problems with words

#

you see, gpt-5-chat has intelligence 3, while gpt-5 has reasoning 4. its simple

fallow vortex
#

ok it should be live in the next few minutes

#

no byok requirement

bold grove
#

Cool. Does that mean OpenAI is removing all gating? It’s been really weird that on their own API they only gate streaming responses now.

fallow vortex
limber cargo
#

yikes did they not get the traffic they expected?

pulsar patio
#

what is with OpenAI model and having the small version of the model do better on the coding index

upbeat cobalt
#

Maybe a coding benchmark with sonnet and qwen coder that low might not be the most reliable

pulsar patio
#

Really funny to see Sonnet 4 below gpt oss 20b, that ain't right

summer gyro
warped verge
#

yet to find one that actually seems accurate to real life

gleaming quartz
#

looks like LiveCodeBench is essentially LeetCode problems written in python, and SciCode is LeetCode: Science Edition, also written in python.

#

in fact a lot of coding benchmarks seem to be solving singular but complex problems in python

#

i don't use python!! i don't do leetcode!! this isn't real life

#

but congrats to gpt-oss-20b for being better than every claude model ever released. really makes ya think 🤔

frigid pewter
#

try using PHP heredoc with js, (which is something i often do, e.g. for a ton of my chess leaderboard logic), and most models scoring top on coding benchmarks absolutely shit the bed with it. good way to differentiate models which can actually code vs models which can solve coding benchmarks, though

summer gyro
#

These tasks don't test tool call or context gathering, but are good at gauging raw coding capabilities.

potent oak
#

"{\n "error": {\n "message": "Your organization must be verified to stream this model. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.",\n "type": "invalid_request_error",\n "param": "stream",\n "code": "unsupported_value"\n }\n}"

what happened with the BYOK?

pulsar patio
placid cipher
hard beacon
#

"ChatGPT System Message is now 15k tokens"

fallow vortex
potent oak
#

oh i thought verification was not required for anyone anymore

iron vector
fallow vortex
iron vector
#

at least i think that's the case

fallow vortex
#

no, most other wrappers never forced BYOK, even though they were supposed to

#

we only stopped once given the green light from OpenAI

#

not sure if / when they may ever ungate their own API

iron vector
#

huh

summer sand
#

re: them giving you green light to remove it

outer marsh
#

do we have to specify model as gpt-5-high? because when I use gpt 5 with reasoning effort = high, it's still really dumb compared to the same benchmark tests in gem 2.5 pro

#

and when I say dumb, I mean like flash 2.0 dumb.

visual root
#

Does anyone know if this is gpt-5 high

ionic merlin
visual root
#

@ionic merlin thanks will try it

ionic merlin
#

Welcome

visual root
#

And for something like open code/codex @ionic merlin

ionic merlin
bleak vault
#

what about if we want to call high from our code

#

using openrouter ofc
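A minimal sketch of what that could look like, assuming OpenRouter accepts a `reasoning` object with an `effort` field in the request body and that the OpenAI Python SDK passes it through via `extra_body`. The helper name is made up, and the effort values are just the ones mentioned in this channel.

```python
# Effort levels mentioned in this chat; treat the exact set as an assumption.
VALID_EFFORTS = {"minimal", "low", "medium", "high"}


def build_request(prompt: str, effort: str = "high") -> dict:
    """Build keyword arguments for chat.completions.create with a reasoning effort."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": "openai/gpt-5",
        "messages": [{"role": "user", "content": prompt}],
        # Assumed OpenRouter-specific field, forwarded untouched by the SDK.
        "extra_body": {"reasoning": {"effort": effort}},
    }


kwargs = build_request("Refactor this function", effort="high")
print(kwargs["extra_body"])

# To actually send it (needs the openai package and an OpenRouter key):
# from openai import OpenAI
# client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="my-key")
# response = client.chat.completions.create(**kwargs)
```

Keeping the effort in the request itself, rather than relying on a default, also makes it easy to label benchmark runs with the setting that was used.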

brisk cairn
#

idk about everyone else, but gpt-5 on high is thinking a lot less than release night for me. it used to think for like 5 minutes at a time, now it barely does 1m. in the official openai dashboard.

steel escarp
#

Is there any reason why I'm getting 10 second time to first token responses from GPT 5 and GPT 5-mini?

stone tide
#

It can happen if reasoning is enabled

steel escarp
#

reasoning enabled, normally reasons for 4-5 paragraphs but only starts reasoning after an initial 10+ seconds

livid osprey
#

i think this has been the case for a while, though not quite 10 seconds more like 3-4 for me

signal edge
placid cipher
winter mesa
#

The quality difference between horizon beta and the current GPT 5 is criminal

#

It's degraded so badly

honest ridge
summer gyro
#

I re-ran my coding evals on GPT-5 (high reasoning effort). The results showed that GPT-5 (High) is indeed a much better model than the default medium reasoning effort.

The overall rating is the second highest, just slightly behind Claude Opus 4 and ahead of all other models including Sonnet 4 and Grok 4.

Key observations:

  • GPT-5 (High) did well on more uncommon tasks. For the TypeScript narrowing task, the medium reasoning version of GPT-5 failed to produce a working solution. In contrast, GPT-5 with high reasoning provided two working solutions.
  • GPT-5 (High) also showed impressive creativity on the benchmark visualization task. It produced an interactive chart with a hover effect that highlights model performance across benchmarks, a feature no other model has generated.

The improved performance of high reasoning comes with significant trade-offs in speed and token usage. The response times were consistently long, taking minutes to complete one task. This slowness is due to the large number of reasoning tokens generated.

Full blog post: https://eval.16x.engineer/blog/gpt-5-high-reasoning-coding-performance-evaluation

short verge
#

very nice @summer gyro !

#

I am always switching between gp5 high and opus 4.1

limber cargo
#

gpt 6 when

tacit burrow
ionic merlin
lapis mural
short verge
lean jay
#

Has the model been downgraded or is it my imagination? Compared to a few weeks ago it has started to think less (300s vs 60s) on high and stumble on not particularly heavy tasks...

summer gyro
#

Added cost for running my evals. Sonnet 4 is king in terms of performance / cost ratio. GPT-4.1 and Grok Code Fast 1 are also pretty good.

young tinsel
#

i now

livid osprey
#

interesting code from gpt 5 mini think

chilly rapids
rugged dirge
#

just saw another commit (19b4ed3c96e04c43ecb89c27eb637b3cf8148bc0) that has changed gpt5-high-new to swiftfox-low, medium and high

possibly a new fine tuned coding model next week? Plus a review mode in Codex CLI coming it seems, with its own model setting

#

I just hope the new model isn't going to end up worse than the current one, Gemini made that mistake from what I recall 😂

livid osprey
#

gpt 5 mini is so slow, half the speed of gpt 4.1 and its a reasoning model
pretty much same speed as gpt 5

lean barn
autumn kindle
lean barn
#

It'd be nice to have it supported, just like "verbosity" is supported on OpenAI models.

ionic merlin
lean barn
#

No, only the SLA part is Enterprise

ionic merlin
#

Try it out for yourself. Get an openAi API key and use Roo Code. You will see that it does not work.

lean barn
#

This is offered in the Dashboard:

#

I mean I can set it up on a Project level, but it means the whole BYOK API key which OR uses is now on Priority

#

Anyway, there is barely any difference in my experience, I just tried it and it's 64 vs 80 tps. It's not worth the 2x pricing for me.

ionic merlin
lean barn
#

OpenAI / Projects / Gear icon

ionic merlin
#

Thanks

slow niche
#

🦢 🚲

earnest marlin
#

@slow niche get a chatgpt team subscription

#

It has better usage than normal 20 usd ones lol

slow niche
#

sigh yeah, but would have to stump two licences myself. the limits should be the same 😐

tacit burrow
wet estuary
violet gorge
#

could we see gpt-5 codex in cursor?

jaunty rapids
#

I use GPT5-mini with aider.chat and it charges me twice, once in OpenRouter and once on the OpenAI platform. Is that ok? I guess not. I think something is wrong.

stone tide
#

You're being charged the same amount in OpenRouter and OpenAI? If you're using BYOK, what you pay in OpenRouter should be only 5% of the OpenAI bill
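As a quick sanity check on the numbers, here is the arithmetic under the 5% figure quoted in this channel. The rate and the function name are assumptions for illustration, not an official pricing formula.

```python
def byok_costs(openai_bill: float, fee_rate: float = 0.05) -> tuple[float, float]:
    """Return (OpenRouter fee, combined total) for a given OpenAI bill under BYOK."""
    fee = round(openai_bill * fee_rate, 4)
    total = round(openai_bill + fee, 4)
    return fee, total


fee, total = byok_costs(10.00)
print(fee, total)
```

So for a $10 OpenAI bill, the OpenRouter side should show roughly $0.50, not another $10.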

exotic harness
#

(made some propaganda)

#

(I know it could be a lot worse too, like when they needed id)

short flint
#

Where is the discount? I don't see it and the cost is the same as usual, no discount at all

unique goblet
short flint
short flint
# unique goblet it's auto-applied when you use it

The costs are not much different from previous generations with a similar word count. I would expect the cost to be half of that for 7900 output tokens, but no. They should also be transparent and show the discount in the OR discount column of the generation details.

exotic harness
#

Maybe try again with a small token amount? Idk

gleaming quartz
# exotic harness
  • they have to retain logs indefinitely thanks to the new york times https://openai.com/index/response-to-nyt-data-demands/

  • azure endpoints have been appearing without moderation (OR runs LlamaGuard per big model lab's requests)

  • there's a unique id that doesn't identify you personally, but does link your message history together to some degree. i believe they can essentially request user ids be banned from their endpoints

honest ridge
#

Fortunately, here in Europe, the New York Times would receive a big raspberry from continental judges if it made such a request for EU citizens.

wet estuary
gleaming quartz
#

but, as we know, they seem to have a pretty friendly relationship, so who knows what they're really up to!

glacial tiger
#

Hi! I have a verified OpenAI account and stream mode works perfectly with gpt-5 when I use OpenAI API. But, when I try to make the same request through openrouter, I receive the whole message in the stream, not chunks. What could be done to fix this?

#
import asyncio

from openai import AsyncOpenAI


async def async_main() -> None:
    # Reads OPENAI_API_KEY from the environment.
    client = AsyncOpenAI()
    stream = await client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": "Say what the weather is today in Warsaw?"}],
        stream=True,
    )

    # Print each content delta as soon as it arrives.
    async for chunk in stream:
        # Some chunks carry no choices or an empty delta, so guard both.
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print("Recevied chunk", chunk.choices[0].delta.content)


asyncio.run(async_main())

Output example:

Recevied chunk  like
Recevied chunk  to
Recevied chunk  check
Recevied chunk  quickly
Recevied chunk  yourself
Recevied chunk :
Recevied chunk  https
Recevied chunk ://
Recevied chunk wt
Recevied chunk tr
Recevied chunk .in
Recevied chunk /K
Recevied chunk y
Recevied chunk iv
#

BUT When trying openrouter

import openai

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="my-key"
)

response = client.chat.completions.create(
    model="openai/gpt-5",  # use an available model
    messages=[{"role": "user", "content": "What the weather is like in kyiv"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print('Got chunk: ', chunk.choices[0].delta.content, flush=True)

Output:

Got chunk:  Do you want the current weather in Kyiv right now, or a general idea of what it’s like?

- If you mean right now: I don’t have live data access. Tell me the time and your preferred units (°C/°F), and I’ll estimate or guide you to a quick check.
- Typical for this time of year (mid–late September) in Kyiv: mild early autumn. Daytime highs around 18–20°C (64–68°F), nights 9–12°C (48–54°F), occasional light rain, and a mix of sun and clouds. Light jacket or sweater weather.

So only one big chunk is received
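
One way to check whether the response is really arriving in pieces is to count the content chunks and note when each one lands. This is only a rough sketch: `summarize_stream` and `fake_chunk` are made-up helper names, and the fake chunks just mimic the `choices[0].delta.content` shape used in the snippets above so the example runs without a network call.

```python
import time
from types import SimpleNamespace


def summarize_stream(chunks):
    """Count content-bearing chunks and record when each one arrived."""
    start = time.monotonic()
    arrivals = []  # seconds since start, one entry per content chunk
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices at all
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            arrivals.append(time.monotonic() - start)
            parts.append(text)
    return {"count": len(parts), "arrivals": arrivals, "text": "".join(parts)}


def fake_chunk(text):
    """Hypothetical stand-in with the same shape as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Demo input: two text chunks, one chunk without choices, one empty delta
demo = [fake_chunk("Hel"), SimpleNamespace(choices=[]), fake_chunk("lo"), fake_chunk(None)]
result = summarize_stream(demo)
print(result["count"], repr(result["text"]))
```

With a real client, passing the `response` object from the OpenRouter snippet to `summarize_stream` should work the same way. A `count` of 1 would confirm that the whole answer came back in a single delta rather than the loop or the printing being at fault.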

unborn crater
#

dayum??

#

does someone know why I am getting this?

#

yes, it's paid

signal peak
#

Only started using GPT5 because of the discount, and I'm blown away by how efficient its token use is. Hopefully Claude's next model has the same efficiency.

halcyon lark
# exotic harness

It's the NYT lawsuit. They don't train on your API data, but they are legally forced to store the prompts and outputs in plain text forever so that the NYT can read them to "find copyright infringements"

#

The whole US copyright system is broken, and OpenAI appealed many times, but it is what it is. Use Azure OpenAI; that one is ZDR (zero data retention) if you have a valid use case

placid cipher
placid cipher
#

I blame gpt4.1 for it

#

fixed

#

gpt5-mini has insane performance for the price

#

even better than o4-mini-high

#

It's just less popular because of the name. They were smart with the o4-mini naming. It didn't look like a "lesser model" than o3, purely because of the name

#

gpt5-mini-high is still gonna be better than gpt5-low in like all cases and better than gpt5-medium in some cases though.

summer gyro
terse pendant
#

#announcements message

livid osprey
#

this is promising, finally running mini at a reasonable speed

short verge
#

Introducing Among AIs, a social reasoning benchmark where embodied models play Among Us to test social intelligence: deception, persuasion, and coordination.

We put 6 SOTA models in a live arena and GPT-5 came out on top by leading in Impostor & Crewmate wins. Why did GPT-5 get the highest scores? Why Among AIs?

Let’s break it down 👇

**💬 22 🔁 38 ❤️ 263 👁️ 36.1K **

short verge
#

GPT-5 had the lowest number of wrongful ejections as crew too, even as an overall master of deception.

GPT-5 is a master at rolemaxxing, playing according to assigned role

sly pike
#

How is the OSS GPT model better than 2.5 pro and Qwen A22B 💀

limber cargo
#

gpt 5 discount gone 🙁

supple valve
#

why? please ! gemini is ok !

untold plaza
cold hinge
#

We did some testing at NonBioS around using GPT-5 as a replacement for Sonnet 4.0. Our analysis shows that Sonnet 4.0 is still better, even after the degradation. We used GPT-5 medium thinking, as that had an acceptable latency for our tasks. But even setting latency aside, we think it still lags behind Sonnet 4.0 in raw intelligence, specifically on long-context agentic tasks.

There are some reports that GPT-5 high thinking might be better than Sonnet 4.0, but given the borderline unacceptably high latency of even the 'medium thinking' version, we did not even consider it as a replacement. We are still planning to run those tests with GPT-5 high, though, and will report here.

short verge
#

Scott Aaronson has, for the first time, put out a paper in which a key technical step in the proof of the main result came from AI. He describes his process using GPT5-Thinking.

'There's not the slightest doubt that, if a student had given it to me, I would've called it clever'

Quoting Sebastien Bubeck (@SebastienBubeck)

Yet more evidence that a pretty major shift is happening, this time by Scott Aaronson

scottaaronson.blog/?p=9183&fbclid=Iwb21leANGXI1leHRuA2FlbQIxMQABHhhc-brYxsroE9QSWVS64u9EhPyULyGu340pc2Kz390HXvqRiIqT5z97eovH_aem_u5H_DiLrxt3onieDqzm7cw

**💬 35 🔁 120 ❤️ 1.3K 👁️ 328.8K **

short verge
short verge
exotic harness
#

Hm one interesting thing I didn’t fully process about the spiral-bench results is how much lower gpt-5-chat (chatgpt 5) is than gpt-5. Interesting (though note that sonnet 4 was much lower, close to the bottom)

Actually, just realised that since the judge model is gpt-5 itself, not surprising it liked its responses the most. Take with a grain of salt.

short verge
#

this is not the first benchmark that shows that gpt-5 tends to grade itself higher

exotic harness
# exotic harness Hm one interesting thing I didn’t fully process about the spiral-bench results i...

Wait a minute, wouldn’t it be good to also have results for them with their default system prompts? Like, sonnet and gpt-5-chat might be made to use system prompts in the normal interface that aren’t being used here. Having the same context for all the models does make them more comparable though (and adding different system prompts would arguably ruin the comparison), so it’s not an easy call, but more data would be nice to have

short verge
#

The breakdown of SWE-Bench Pro failures is interesting: GPT-5 doesn't submit to 63.1% of tasks, due to tool use errors?

This means GPT-5 has a *much* higher precision than Claude Opus 4.1. Still not sure what the tool use errors are about though. 🤔

Quoting Bing Liu (@vbingliu)

🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks.

This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.

**💬 1 ❤️ 15 👁️ 1.7K **

short verge
placid cipher
#

You could experiment with it, say by trying reasoning_effort high and verbosity low
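
Something like this is roughly what that experiment could look like. It is a sketch, not a tested recipe: `build_request` is a made-up helper, and it assumes the Chat Completions endpoint accepts `reasoning_effort` and `verbosity` as top-level parameters for gpt-5 with the values shown. The actual API call is left commented out so the snippet runs without a key.

```python
def build_request(prompt, reasoning_effort="high", verbosity="low"):
    """Build keyword arguments for a chat completion request.

    Assumption: `reasoning_effort` and `verbosity` are accepted as
    top-level parameters for gpt-5; check the current API reference.
    """
    return {
        "model": "gpt-5",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
        "verbosity": verbosity,
    }


params = build_request("Summarize the trade-offs of streaming responses.")
print(params["reasoning_effort"], params["verbosity"])

# With a configured client, the request would then be sent as:
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(**params)
```

Keeping the parameters in one dict makes it easy to run the same prompt at several effort and verbosity settings and compare latency and output length.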

cold hinge
spark pine
# exotic harness Hm one interesting thing I didn’t fully process about the spiral-bench results i...

In case you didn't see, he updated Spiral-Bench so it uses multiple judges now.

And it is kind of odd, but that's the price of lobotomizing and user-maxxing I guess =P I thought 4o gargled balls for that reason too. I actually don't get how that even works right now. Free users get routed between mini, chat, and 5-low? Then paid you get 5-med, and 5-pro is API only? What a nightmare, they need to stop doing this

exotic harness
lean barn
#

Please have a look

#

I mean, for pricing I think OR might be losing money on them

potent oak
#

GPT-5 is looking much faster in RooCode