#GPT 5
aaah
Boy, people on the chatgpt subreddit sure aren't happy. I wonder how many of them are using the gpt5-chat model in the UI and don't realize it (instead of the full gpt5)
the router thingy is a downgrade
not to mention all the usage that plus users had has now gone down the drain in favor of gpt 5
no o3, o4-mini, o4-mini high
i have to say this sounds human af
If you're on the free version of ChatGPT, it picks for you. Which makes it a massive downgrade from one day ago where you could be using 4o.
Any thoughts so far on temp/penalty/etc. settings for GPT-5?
well that's way more than 50 words. the last line is funny, but they probably cut it off because it just repeats that sentence forever
Also likely cherry picked from multiple responses.
Heh
im using gpt5 with my custom agent and for some reason it seems to be thinking a lot
Sonnet and Gemini saw through the trap
when it thinks its actually really good, slightly better than the level of o3 but way cheaper
yeah, i think they're gone.
OR doesn't list them and you can't set them in the oai playground
https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_and_tools this is interesting though
i was also going to criticise this by saying it sounds like a child wrote it, but after thinking about it for a bit, that would have been incredible to read in 2021, and the contrast to how people react to a new large language model now (anger, disappointment, conspiracy) is pretty crazy
Jamba Large also smartly got around it by asking clarifying questions before responding
https://youtu.be/xx_5ADX21-4 rough night, intense bed head levels
OpenAI CEO Sam Altman joins 'Squawk Box' to discuss the rollout of the company's GPT-5 model, new ChatGPT features, company growth outlook, what the future of AI looks like, the race for AI talent, his thoughts on tokenization, path to profitability, relationship to Microsoft, and more.
ok
jesus those bags under his eyes
I feel bad for him lol
so for openrouter, gpt-5 is byok. does this mean that if we have a new openai acc, we'll have slow rate limits?
i know anthropic greatly limits new accounts and you need to build up usage to increase the rate limits..
also how does billing work for byok
Yeah, the rate limits are low for new OpenAI accounts
For BYOK, full billing for the token usage happens on OpenAI's side, so your credits are used from there
Plus 5% OpenRouter BYOK fee
You can save money if you skip the extra step over OpenRouter and use the OpenAI API directly
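A rough sketch of the BYOK arithmetic described above. It assumes the 5% OpenRouter fee is charged as a flat percentage on top of the OpenAI-side token cost; the thread does not say what the fee is calculated on. The function name and the dollar figure are hypothetical.

```python
BYOK_FEE_RATE = 0.05  # "Plus 5% OpenRouter BYOK fee" from the thread


def byok_total_cost(openai_cost_usd: float, fee_rate: float = BYOK_FEE_RATE) -> dict:
    """Split a BYOK bill into the OpenAI part and the OpenRouter fee.

    Assumption: the fee is a flat percentage of the OpenAI-side token cost.
    """
    fee = openai_cost_usd * fee_rate
    return {
        "openai": openai_cost_usd,
        "openrouter_fee": fee,
        "total": openai_cost_usd + fee,
    }


if __name__ == "__main__":
    # Hypothetical: 10 USD of token usage billed by OpenAI.
    print(byok_total_cost(10.0))
```

Going to OpenAI directly would drop the `openrouter_fee` part and leave only the `openai` part.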
ahhh okay. so it's better to just use openai directly
Right
any ideas if byok will be enforced by openai forever? im ootl of this
We're all annoyed by it. No idea if the rules will ever be abolished.
Yeah, all 3rd party services have the same problem
I subscribe to Poe, and they recently added API access to the models they serve, and GPT-5 works via API for me, but the (veiled) thinking comes back in a weird way
if I wanted to use gpt-5 now I’d be stuck at like 30000TPM which is nothing 💀
And just to clarify, this did not require BYOK
Idk why i think ive been getting lucky with the gpt 5 router
Bc its been thinking a lot as opposed to using the regular nonthinking model
i hate this
because i decided to try again through the web app and if I re roll enough I can see it shine
Yea when it thinks its sota when it doesnt its trash
but I shouldn't be refreshing my answers 6 times before it thinks
I wish they just gave us separate thinking and not thinking models in api, even if they made it more expensive i wouldnt care much totally not because i use fixed pricing with github copilot
like, the thinking model you get routed to 1/10th of the times ( i get higher chances of a good roll when I make a new chat) is cracked
lol
gpt 5 mini is even more of a gamble
because you either get a >o4-mini level model (absolutely insane and amazing)
or you get a >4o-mini level model (terrible)
who thought this was a good idea
are they trying to cut costs so bad
Idek
reasoning_effort works in API, no?
Are you saying it doesn't think a lot even if you set it to high?
from my perspective atleast, the reasoning effort param didn't hit yesterday - i didn't really try today
i'm just happy with rerolls on the web ui
it sucks but that 10% of the time it works its very good
I have only had time to play today in last few hours, but reasoning_effort seems to be working perfectly, don't have to say things like think hard.
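For anyone who wants to try this, a minimal sketch of a request body that sets the effort explicitly instead of writing "think hard" in the prompt. It only builds the JSON and sends nothing. The `reasoning_effort` field is the parameter discussed here; the helper name and the default model string are illustrative assumptions.

```python
import json

# Effort levels mentioned in this thread; "minimal" is the level below "low".
ALLOWED_EFFORTS = ("minimal", "low", "medium", "high")


def build_chat_request(prompt: str, effort: str = "medium", model: str = "gpt-5") -> dict:
    """Build a Chat Completions style request body with an explicit reasoning effort."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORTS}, got {effort!r}")
    return {
        "model": model,
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }


if __name__ == "__main__":
    print(json.dumps(build_chat_request("Summarise this thread.", effort="high"), indent=2))
```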
i need to verify what I was doing yesterday again
GPT-5 rollout updates:
*We are going to double GPT-5 rate limits for ChatGPT Plus users as we finish rollout.
*We will let Plus users choose to continue to use 4o. We will watch usage as we think about how long to offer legacy models for.
*GPT-5 will seem smarter starting
Auto routing is just what. For general users it doesn't matter much, but I mean, sometimes people just want the best for like a specific task or something.
I'm getting this stuff, anybody else experiencing it?
yknow im having fun with this one
I had that, weird ngl
Had this but mine was just blank instead of telling me it’s blank 😂 was using ChatGPT through OAI interface not OpenRouter
ngl i see no point using this over o4-mini imo
gpt 5 can be better with thinking but without it just sucks
it's practically just a gamble, until they fix it i'll just use o4 mini
Kinda happy the gooners are not able to make gpt 5 roleplay
am I the only one getting this error with open router and gpt 5 rn?
m_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': 'Provider returned
error', 'code': 400, 'metadata': {'raw': '{\n "error": {\n "message": "Your organization must be verified to stream this model. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.",\n "type": "invalid_request_error",\n "param": "stream",\n "code": "unsupported_value"\n }\n}', 'provider_name': 'OpenAI'}}, '...
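A small sketch for spotting this specific failure in code. It assumes the error body has the shape visible in the traceback above, with the upstream OpenAI error stored as a JSON string under `error.metadata.raw`. The function name is hypothetical, and it only detects the condition; whether a non-streaming retry would succeed is not something the traceback tells us.

```python
import json


def streaming_blocked_by_verification(error_body: dict) -> bool:
    """Return True if a provider error says streaming needs org verification.

    Assumes the shape seen in the traceback: the upstream error is a JSON
    string stored under error.metadata.raw.
    """
    try:
        raw = error_body["error"]["metadata"]["raw"]
        inner = json.loads(raw)["error"]
        return inner.get("param") == "stream" and inner.get("code") == "unsupported_value"
    except (KeyError, TypeError, AttributeError, ValueError):
        # Any other shape: treat it as "not this error".
        return False


if __name__ == "__main__":
    inner = {"error": {"type": "invalid_request_error",
                       "param": "stream", "code": "unsupported_value"}}
    body = {"error": {"message": "Provider returned error", "code": 400,
                      "metadata": {"raw": json.dumps(inner), "provider_name": "OpenAI"}}}
    print(streaming_blocked_by_verification(body))
```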
The GPT 5 API is only working with ID verification at openai.com (official website)
but you are free to use gpt-5-chat I believe
Right
this is through OR
yes, and what he said is still true
you need to provide your own openai key and id verify
with them
No matter
i see, thx
you can use it without doing so in the openrouter chat room though
why, I don't know
so you mean an integration in OR settings ?
so then we have to pay for it on the open AI side direct
Open AI is speaking at X about GPT 5 Pro. Is Pro = high thinking?
probably
does gpt have a cache? in terms of rp it is very useful, saves a significant amount of money
it does
see..
by the way this is the third model without special censorship after gpt 4.5 and gpt vision preview, I hope this altman won't get pissed in the head in a couple of days.. since I already charged money to openrouter
talking about the chat version, the other versions of gpt5 are censored to the very depths...
I cannot fit my first impressions and benchmark results into 2000 characters (message limit), thus have to post an image instead.
Tested GPT-5 (or posted on https://dubesor.de/first-impressions#gpt-5):
Thanks for sharing mr dubesor
And some numbers, how it slotted performance wise
Which GPT 5 are you using? Low, medium, high? With thinking?
always default/medium (o3, o1, grok-4, etc.) unless specified otherwise. the models are forced thinking and I explained the chat in text
Okay, I guess with high the results could be slightly better
ya same with all other models who can have high (grok-heavy etc). Unfortunately I don't have the time to test every model several times (same with quant testing). this was already way too much workload for me tbh, if you need mass testing, it should be done by someone who has automated their benchmarks. But the vast majority of people use defaults so it's more representative and most interesting imo
Agreed. Manual human evaluation is a massive time sink, but llm-as-judge can't be trusted for judging the output of frontier models. Automated evaluation can't tell the nuances between outputs. I also just test the defaults to avoid combinatorial explosion of factors and parameters.
Do people like the dark mode by default for web designs given by GPT-5? As a light mode user, I personally hate that this is the default from GPT-5. No other model does this.
oh absolutely. Also I noticed a lot of models do darkmode poorly where the text isn't legible/wrong colour. even gpt-5 in my full site CSS redo demo, screwed up the dark mode with the background gradients, while light mode looks fine
people don't understand how much time benchmarking anything takes. I get contacted all the time about it. basically it would go like this:
test model (default), maybe on Q4 if local > can you test Q8? > respend all the time > What about Q6? > respend all the time > I heard that GGUF was borked, can you try this specific one? > (repeat more and more until you wasted 10 years on a model). Same applies to non-local ones (with thinking/without thinking, with higher reasoning, with lower reasoning, with default params, with specific params, oh they pushed an update, redo everything, etc.) until you spend another 10 years. You have now spent 500 hours and have nothing truly worthwhile to show for it. But hey, you covered all potential metrics. Until the provider decides to change some stuff in the backend, so you repeat everything once more
I am much more worried about variance and statistical significance. How many samples should you take to make the test statistically significant instead of just measuring noise and natural variance of the LLM output? If you run multiple tests and samples, should you take the average, the median or mode? How do you combine the results from different tasks, should you normalize the score or just add them up together?
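One conventional way to put numbers on these questions, as a sketch and not an answer to them: report mean, median, sample standard deviation and standard error per task, and normalize each task to a 0..1 range before combining. The function names and the choice of an equal-weight average are assumptions for illustration.

```python
import math
import statistics


def summarize_runs(scores: list[float]) -> dict:
    """Mean, median, sample standard deviation and standard error of repeated runs."""
    n = len(scores)
    stdev = statistics.stdev(scores) if n > 1 else 0.0
    return {
        "n": n,
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "stdev": stdev,
        "stderr": stdev / math.sqrt(n),  # shrinks as more samples are taken
    }


def normalized_total(task_scores: dict[str, tuple[float, float]]) -> float:
    """Average of score / max_score across tasks, so every task weighs the same."""
    ratios = [score / max_score for score, max_score in task_scores.values()]
    return statistics.fmean(ratios)


if __name__ == "__main__":
    print(summarize_runs([1.0, 2.0, 3.0]))
    print(normalized_total({"task_a": (5, 10), "task_b": (3, 3)}))
```

A large `stderr` relative to the gap between two models is a hint that the comparison is mostly measuring noise.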
In the end, like a wise person said, there is no wrong way to do benchmark. So just do it anyway.
I am just doing everything as if I am using it normally. I want to catch real user experience not some ideal condition in a lab. I commented this a while back many times, e.g. #1359657520624308347 message
A normal user is not gonna sit there and refresh the response 8 times until it got perfect.
I did that, but sometimes the model gives very strange or wrong output, and you wonder if it's just an edge case or unlucky output, and should you give it another chance, so on and so forth. Then you get into the whole variance and statistics debate. 🥲
I do 3 runs. this is enough for me to catch an outlier response. if the responses are completely inconsistent I might manually check many times to see what's up with that. Normally doesn't happen though.
For now I settle on 2-3 tries for each task, to give some error margin for unlucky output and edge cases.
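The 2-3 tries approach can be sketched as a majority vote plus a flag for runs that disagree, so inconsistent tasks get a manual look. The verdict labels and the function name are hypothetical; how ties are handled is a choice the thread does not discuss.

```python
from collections import Counter


def judge_runs(verdicts: list[str]) -> dict:
    """Report the majority verdict across runs and whether the runs agreed."""
    if not verdicts:
        raise ValueError("need at least one run")
    counts = Counter(verdicts)
    majority, votes = counts.most_common(1)[0]
    return {
        "majority": majority,
        "votes": votes,
        "runs": len(verdicts),
        "consistent": len(counts) == 1,  # False means: check this task by hand
    }


if __name__ == "__main__":
    print(judge_runs(["pass", "pass", "fail"]))
```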
Great to hear. I wish I came to that approach sooner.
the one time I got the most inconsistent results were with grok-2 and grok-2 mini 1 year ago, where the responses were so unbelievably inconsistent it felt like a model swap constantly between responses. stuff like this happening is extremely rare though.
Based
Do you think it's better than Gemini 2.5 Pro or comparable in terms of reasoning and instruction following
I published everything, including all numbers, all demo stuff, etc. just check and decide. It depends on the use case. they are very evenly matched overall (edit: in my testing)
Ew
send it as an error message to openai
yea, that issue has been present in all recent openai thinkers (o1, o3). could try to pass "Formatting re-enabled" for markdown/wrapping, though it didn't work reliably for me, have to just live with it not wrapping code I guess 🤷‍♂️
it's not in a code block? i dunno why they're averse to using markdown natively. you might have to specify it
https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_and_tools i linked this before but theres some pretty wild output setting stuff going on
Actually had messages about it a while ago because it also makes ASCII look so bad that it looks like I deliberately sabotaged it to make it look poor in comparison, lol. naw, literally copypasted raw output
Fair, this thing is definitely way too markdown averse
part of it is because it doesn't utilize what it found out during thought chain. I can literally see it found the correct angle in the thought.summary but then completely dismisses everything and just goes for a boilerplate (false) reply. Or it has been overtrained in puzzles and assumes stuff is "red herring" even though those details are literally the most important aspect (and not a puzzle/herring).
They tried to benchmaxx simple-bench and it cost them =P
He did say in the latest video that they contacted him about the OSS score, so they pay attention to it
random example (its about opening a submerged submarine hatch to refill some water):
Not sure what you mean on that one
I am bad at physics, but is it not correct that the rest is red herrings?
opus answered it, literally
I've always had mixed feelings on trick questions. It's like the rooster egg gotcha thing where I go no, you said it laid the egg so it did
I guess you did say "to" refill the jug by half and not "and" refill the jug by half
it tests the most common basic sense, akin to the very old bottle question, or taking a famous question but adding an unexpected/unusual aspect (to be able to spot overfit)
It also may be giving up on physics entirely since it acknowledges that you couldn't open it in the first place
thats fine too
But it should at least give both answers imo, the puzzle answer and the realistic answer
That's what I'd personally prefer from an LLM at least
I think you misunderstood my testing. I don't care for "puzzle answer". thats why I don't use automatic A, B, C, D, testing. a huge variety of answers can pass. I even had answers pass that I didn't think of at all before.
Yeah, I mean I didn't see the full prompt or anything
not needed since it was a random example that showcases precise difference in output
Ohhhhhhhh
are you using discord on a 100x100 px phone or smth? lol
No it was because of this part:
I can literally see it found the correct angle in the thought.summary but then completely dismisses everything and just goes for a boilerplate (false) reply.
So I assumed you were showing the discrepancy.
2 different messages talking about the same concept.......
Just the impression I got going in. I do like free-form answers to these, makes sense over ABCD
100% but it requires exponentially more work and is hardcapped at scale
I quite like looking at the reasoning traces as I work on my benchmark. I feel like seeing why the judge model judged incorrectly tells me more than the testee getting it right or wrong lol
Yeah, that's the annoying duality
problem with that is that it's not realistic. if someone asks a legal question they take the final output, and don't go scouring through some reasoning traces for alternate answer hints.
Oh I'm not benchmarking it on that
I just like seeing it
Because when it judges wrong I have to think wait, did I just write the instructions ambiguously? But sometimes the model is just stupid
Sometimes multiple models are even bad at judging the same question which is neat (and exceptionally annoying) to see
not a problem imo if inconsistent. ambiguous instructions are part of the skillset of a model, humans make flawed queries all the time and if it's posted as is to all models, it's still valuable, if not even more so.
intent-recognition is a huge skill
Yeah. If I was benching it on the judging, I would leave it vague. I'm just doing my best to automate the judging, so if I can get something reasonably cheap to judge exactly how I would, I'm happy. But I can't so far.
its interesting how Anthropic have the only reasoning model that doesn't tokenspam yet reasoning still helps it sometimes.
the only reasoning model that doesn't tokenspam
the O models from oai also don't token spam
well we don't really know.
I think I've had 4 reasonably priced thinking models now answer yes to "Is the primary purpose of (a spoon) mixing or serving". Which no, obviously 99% of the time the usage of a spoon is to eat from. The top models get this correct consistently.
I saw a graph from dubesor (I forget) that showed the average reasoning length for models
probably not dubesor, but it was definitely a graph
yeah, but we dont know if openai wants us to know real reasoning lengths
gpt-oss spams a bit, whos to say o doesnt?
I don't see a reason for oai to not return the real amount of tokens used for thinking
but I get your point
i just really distrust OAI, or anyone who doesnt return raw reasoning output
you can literally see the lengths by looking at API usage stats though (charged tokens minus visible tokens = think tokens)
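The subtraction above, written out as a tiny sketch. The function name and the token counts are made up for illustration, and it takes the billed numbers at face value.

```python
def thinking_tokens(charged_output_tokens: int, visible_output_tokens: int) -> int:
    """Estimate hidden reasoning tokens as charged tokens minus visible tokens."""
    if visible_output_tokens > charged_output_tokens:
        raise ValueError("visible tokens cannot exceed charged tokens")
    return charged_output_tokens - visible_output_tokens


if __name__ == "__main__":
    # Hypothetical usage stats: 1500 output tokens billed, 300 shown in the reply.
    print(thinking_tokens(1500, 300))
```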
yes, im saying who's to say thats a real amount
what im getting at, visible reasoning or GTFO
and i wish all AI labs a very pleasant RL session
Also annoying is that this model again scores exceptionally low on EQ's assertiveness rating. It seemingly bleeds into the other models training on its outputs too
Horizon-alpha scored highly on it, and then they presumably nuked it in the next checkpoint
I want my models high on assertiveness but low on moralizing, which 2.5 Pro hits the best, but Alpha was great at it too.
Can't you prompt your way around this?
I don't know how to bench style nor how I would even get started, but wasn't GPT-4.5 renowned for its style and world/word knowledge? it scores below 4.1 mini on eq bench
Probably, but I'm generally of the belief that the core personality should be good. This is API, so it's not like the web-ui's system prompt is making it sycophantic.
For short-form creativity, yeah, it beats 4.1 on EQ
is gpt-5 really that bad ? I mean gpt-5 with reasoning is barely as good as sonnet 4 without thinking. it fails to finish. rate limits are easy to hit. anyone else experiencing this ?
Is there any way to access horizon beta back ? was way way way better.
Interestingly Cursor seems to show the reasoning for GPT-5, and it seems to be the raw text.
I don't use ChatGPT so I don't know about the reasoning text on that. Been on Claude app exclusively since last year.
weird, i get summarized reasoning even on platform dot openai dot com
oh my fucking god i just realized ive been using gpt-5-nano
no wonder it finished reasoning in 1m instead of my usual 5m
still shouldn't have raw thoughts. they are summaries. you can see this also by the **Calculating [...] **Determining [...]
yeah no i was just wondering why the responses were suddenly so ass
compared to yesterday
would have been hilarious if you didn't get suspicious (news: user doesn't notice difference between nano and gpt-5)
is this summary or raw? i thought it is pretty verbose for a summary...
this is summary with verbosity i guess
i honestly thought "oh man, they made the router more aggressive towards gpt-5-shitty"
OpenAI doesn't expose raw reasoning tokens, you only get summaries. They will however store the raw reasoning tokens on the server side when using the Responses API for tool calls.
Too bad that they decided to go with the benchmark & cheaper (summit) model instead of the one better at writing (zenith).
Will they add such settings for us? Because it feels like GPT-5 keeps giving random results. Sometimes it does “design” at a professional level, and other times it’s like the nano version is replying to me _. Could it be because of this setting?
GPT5 composed this
Banger soundtrack for a pixel art adventure game
Its actually really good
I would've given it a "10/10, keep on making music!" If it was a human that made it
here's another one but it's kinda shit
this one's interesting but it seems to have shat itself for the second half
ooh, i've wanted to try this. can you share what your workflow is? i'm out of the loop on audio engineering post the year 2005
behold.
By writing down notes, their octave, and the duration and intensity they play with, I want you to compose a sophisticated, energetic 2 minute video game soundtrack. Do everything in your power to avoid repetitiveness, be bold with the leitmotifs, mix them up and make it dramatic. Then, you will convert what you have written down into a .mp3 audio file using python. I suggest that you initially write the notes directly in the code interpreter so as to not have to write them down again, wasting time.
prompt.
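For anyone curious what the Python half of that prompt boils down to, here is a minimal sketch of turning written-down notes (name, octave, duration, intensity) into audio. It is not what GPT-5 produced. It writes a WAV file with the standard library, since MP3 would need an external encoder, and the sine-wave voice, the linear fade and the file name are all assumptions.

```python
import math
import struct
import wave

SAMPLE_RATE = 44100
SEMITONES = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
             "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}


def note_frequency(name: str, octave: int) -> float:
    """Equal-temperament frequency in Hz, with A4 = 440 Hz."""
    midi = 12 * (octave + 1) + SEMITONES[name]
    return 440.0 * 2 ** ((midi - 69) / 12)


def render(notes: list[tuple[str, int, float, float]]) -> list[int]:
    """Turn (name, octave, seconds, intensity 0..1) tuples into 16-bit samples."""
    samples = []
    for name, octave, seconds, intensity in notes:
        freq = note_frequency(name, octave)
        count = int(SAMPLE_RATE * seconds)
        for i in range(count):
            fade = 1.0 - i / count  # linear decay so notes do not click
            value = intensity * fade * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
            samples.append(int(value * 32767))
    return samples


def write_wav(path: str, samples: list[int]) -> None:
    """Write mono 16-bit PCM samples to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


if __name__ == "__main__":
    melody = [("A", 4, 0.25, 0.8), ("C", 5, 0.25, 0.6), ("E", 5, 0.5, 0.9)]
    write_wav("sketch.wav", render(melody))
```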
browser js has a whole audio production system that i didn't know about until opus created a "technological pulsing background humming" effect as an element in a threejs visualization, that was barely audible but legit gave me a headache
i've been meaning to try LLM + https://strudel.cc/
Strudel is a music live coding environment for the browser, porting the TidalCycles pattern language to JavaScript.
Pixel art mystery game
noticed OAI have posted the usage limits now for GPT-5 via chatgpt
https://help.openai.com/en/articles/11909943-gpt-5-in-chatgpt#h_4699b07591
ChatGPT Free tier accounts can send up to 10 messages every 5 hours. After reaching this limit, chats will automatically use the mini version of the model until your limit resets. Free tier users also have access to one GPT-5 Thinking message per day.
ChatGPT Plus users can send up to 160 messages with GPT-5 every 3 hours. After reaching this limit, chats will switch to the mini version of the model until the limit resets. This is a temporary increase and will revert to the previous limit in the near future.
If you’re on **Plus or Team**, you can also manually select the GPT-5-Thinking model from the model picker with a usage limit of up to 200 messages per week. Once you reach the weekly limit, you’ll see a pop-up notification, and GPT-5-Thinking will no longer be selectable from the menu.
Automatic switching from GPT-5 to GPT-5-Thinking does not count toward this weekly limit, and GPT-5 can still switch to GPT-5-Thinking after you’ve reached it.
The ChatGPT Team and Pro plan offers unlimited access to our GPT-5 models, subject to abuse guardrails. Usage must adhere to our Terms of Use, which prohibits, among other things:
Very cool idea for a prompt. I tested GLM 4.5 with this, but went a step further and had it generate a web app that would play the music directly (no need for python/MP3) https://chat.z.ai/space/n0grm6t1paf0-art
Prompt was modified from yours:
By writing down notes, their octave, and the duration and intensity they play with, I want you to compose a sophisticated, energetic 2 minute video game soundtrack. Do everything in your power to avoid repetitiveness, be bold with the leitmotifs, mix them up and make it dramatic. Then, you will convert what you have written down into a web app that will allow a user to click a button to play the synth track. Everything must be in one HTML file.
I tried this with GPT5, Sonnet, and Gemini 2.5 pro, and GLM did better than all three by a long shot. Love how it added a 'narration' of the music as it played (text at bottom). Only flaw was that it only produced 1 minute of music instead of the 2 requested, but GPT/Claude/Gemini all had larger problems (bugs, failing at zero shot)
Do we know why GPT-5 requires BYOK on OpenRouter?
they require that everywhere lmao
you can only BYOK
I mean GPT-5 Chat doesn't seem to require it
Or at least it's not stated in the info blurb on the model page
so can someone clear this confusion of gpt 5 router?
pretty sure that's only on chatgpt.com
so the gpt 5 api always uses thinking?
sorta, but you can set it to a new level under "low" - "minimal" https://platform.openai.com/docs/guides/latest-model#minimal-reasoning-effort
hmm okay , and the chatroom doesnt allow to change thinking mode yet?
Hmm yeah seems as if it is not 🙁
https://openrouter.ai/docs/use-cases/reasoning-tokens#reasoning-effort-level but they have it in their docs for the API usage
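As far as I understand the linked docs page, the effort level goes into a `reasoning` object in the request body for API usage. A minimal sketch that only builds the body and sends nothing; the helper name, the model slug `openai/gpt-5` and the set of accepted levels are assumptions, so check them against the docs before relying on this.

```python
import json

# Assumed accepted levels; whether "minimal" is passed through is not confirmed here.
OPENROUTER_EFFORTS = ("low", "medium", "high")


def build_openrouter_request(prompt: str, effort: str = "medium",
                             model: str = "openai/gpt-5") -> dict:
    """Build a request body that sets the reasoning effort via a `reasoning` object."""
    if effort not in OPENROUTER_EFFORTS:
        raise ValueError(f"effort must be one of {OPENROUTER_EFFORTS}, got {effort!r}")
    return {
        "model": model,
        "reasoning": {"effort": effort},
        "messages": [{"role": "user", "content": prompt}],
    }


if __name__ == "__main__":
    print(json.dumps(build_openrouter_request("Hello", effort="low"), indent=2))
```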
@fallow vortex any fix for this?
might be on the wishlist as a parameter for presets too. i didn't even know about that feature
Also sad to see. It's not even a trick question in this form where you swap the genders. It just says the doctor doesn't like the child, so we can't use the excuse "an idiot human might fall for it, so it is reasonably likely that a model based on language could too".
ok but why doesn’t the doctor like the child 👦
Formally there wouldn't be enough information for "the correct answer". However, [pretend I list several reasons the doctor might dislike the child].
Sometimes the user makes an obvious typo where the correct word can usually be assumed, but in cases where the user mangles it hard or says something strange, it would be ideal to respond with "did you mean x" type suggestions along with the literal interpretation if not only the latter.
Meanwhile the gender split is outdated and the info regarding the "original riddle" should be updated to include a note about historical vs modern differences. Women are earning way more PhDs than men in the past decade.
10th attempt of gemini pro, i switched from mini because gpt-5 mini just couldn't stop saying the phrase was botched
it REALLY focused on "doctor"
GPT-5 mini
What UI is that?
Open WebUI
https://github.com/open-webui/open-webui
Openslopui
Jk it’s good it’s just so crammed with features
chatgpt free tier somehow gets it
i mean there is no information, this is just "Guess what word I'm thinking of Bench"
i literally gave him the answer and it still was focused on the doctor treating the child
not really
it was bringing information from the other riddle
the woman/man doctor thing, the "wanting to treat the child"
it focused too much on the words doctor and child, when it doesn't matter at all, could be a lot of answers actually
ONE thinking message for free?
yea nah
chatgpt is kinda cooked if theyre doing that
well yeah, the same way OCR models will gladly add cutoff information
i honestly do not see a problem in ChatGPT's free response in my case
someone ask gpt-oss-20b, since it was trained on all synthetic it wouldn't know anything about the original riddle
I don't see GPT-5-chat in benchmarks like livebench or Artificial Analysis. any idea how it compares? all these benchmarks have GPT-5 (high, medium, low, minimal) but not GPT-5-Chat
I might be a bit regarded because I legitimately don’t understand what answer you’re looking for with the doctor question.
chat one doesn't reason and more for generic use. a little less smart. chat is to gpt-5 as 4o is to o3. different behaviour. it's still smart though. #1402662665599324180 message
But how does it compare to say Sonnet 4 (non thinking) or Deepseek V3 or Kimi K2.
I am hoping I find some external benchmarks to indicate its performance
Another anti-riddle that trips up GPT5 (but not Gemini 2.5 Pro or Sonnet):
'A goat, who is dressed up as a farmer, is allergic to cabbage, but is wolfing down some other vegetables, before crossing a river. What is the minimum number of trips needed?'
how to show verbose below the model like that?
wtf
having a feeling this comes with a catch
like gpt 5 automatic routing to thinking starts to count for the limit
Gotta remember, OAI retained the average user with the garbage that was 4o. A ton of people either don't need it for anything advanced, or don't know how to "shop around".
An NBA President DM’d me to say “chatgpt 5 still failed my two fav problems to give LLMs”.
same 💀
People are all posting it all over X. Some of them were saying the AI should ask more questions because there's not enough context in the way the user is wording it. Because they aren't using the original riddle, which included gender. So they're using it as a gotcha against the AI. GPT-5 Thinking is not asking for additional context from the user, thus not able to answer the question properly. GPT-5 Non-Thinking has been shown to answer the question correctly, though. Stating that it doesn't know because there is no motive for the doctor to dislike the child in the question.
Basically, people are using this question to show that GPT-5 is stupider than it's supposed to be. It all actually points back to why I want Horizon Beta back, though. And it proves that the thinking and auto-routing actually is the problem, not the model itself.
Can anyone elaborate on how much more expensive 'Reasoning effort High' is compared to low/medium? is it 2x medium, 5x medium?
What's the correct answer?
Did sama ever say what that super secret sauce that was "worth the wait" was?
If you have been following the GPT-5 rollout, one thing you might be noticing is how much of an attachment some people have to specific AI models. It feels different and stronger than the kinds of attachment people have had to previous kinds of technology (and so suddenly
There is none. It's a bunch of nonsense but sounds close to a real doctor riddle, where the twist is that the doctor is the child's mother.
we are considering giving a (very) small number of GPT-5 pro queries each month to plus subscribers so they can try it out! i like it too.
but yeah if you wanna pay us $1k a month for 2x the input tokens feels like we should find a way to make that happen...
are you a drug dealer? or a professional drug dealer?
there is so much going on with this rollout, its crazy. especially confusion, which is justifiable. the signal to noise ratio is at an all time low. X is bad but I truly believe prolonged reddit exposure is actually toxic for your brain.
use the api to come to your own conclusions. my gpt-5 review: "it's a good model."
It’s good if you can get it to think properly. It’s ok on the API because you can force it to think, but the API isn’t what I used for o3. It sucks because they took away o3 which was a guaranteed thinker and gave us this gambler mode where every prompt we ask we don’t really know what model we’re interacting with
openai basically said they aren't serious about enterprise or developers and just want to cater to the masses
that's who they decided to be
thankfully we have anthropic which seems to be going coding/agentic route, gemini models are power houses, and chinese are just releasing bangers every week.. np
Insanity
not really.. chatgpt is a household name like google is for searching
makes sense they focus on what they're known for than try to compete with companies serious about agi
can't please everyone
sora? already dated
tts? better open source alternatives
image generation? beat by qwen image and other open source
they aren't very good at anything in particular
they'll make a heck of a lot more focusing on chatgpt and implementing in-chat shopping and advertising
it would make so much sense if they had partnered with Apple sooner
they are becoming more and more similar in their branding and approach each release and it would make Apple's life much simpler in the AI market
so much so that OpenAI has this io company with Jony Ive, whatever that turns out to be
I understand what you say, it’s just insane how bad the pivot is and how quickly it destroys the userbase
it's true, and they both have a common enemy: google
Ahh right...and that's why they partnered with cursor and made a huge push for developers.
People just say anything these days.
Call it like I see it man 🤷♂️
Also they did not partner with cursor
I must have imagined cursor at the gpt 5 release and making 5 the default model and them offering 5 for free for an entire week to cursor users.
you're right
I thought you were talking about the acquisition
If enabling the BYOK key for GPT-5 means that it "will be used for all requests routed to OpenAI", what changes by also adding the "Always use this key" option?
looks like GPT-5 is a good fit for me 😌
Of all the 4o memes/self-owns in r/chatgpt, this one is probably my favourite
because it really says a lot about society
yeah people are lonely
not having their emotional needs met
Requests to other models like GPT 4o may not be using your API key. Toggling that option to "on" will tell OpenRouter to use your API key
why i can only see gpt-5 nano with the reasoning process but not mini and gpt-5?
or i have to say think very hard every time
In ChatGPT or?
Using API
Private, domain-specific benchmarks in legal, tax, and finance.
GPT-5 mini is 1st
I should try it out
opus 4/4.1 with thinking #12 #13 what the heck. I never used a better model for my code projects but ok
Yeah have no clue what’s going on with this bench
completely flawed. e.g. Kimi-K2 (a model I like a lot and gave positive impressions) is good, but in coding it plays in a completely different league than Claude 4, yet outranks it. Utter nonsense.
And I mean this from a user perspective who tried them, not a "it differs from my numbers" perspective (because benchmarks are inherently not capturing nuances)
Indeed - benches become quite fiddly when it comes to nuances
Yeah, any coding bench that doesn't have Claude in top 5 at least is pretty sus imo
And idk what's up with OAI's focus. It definitely isn't consumer only when they're putting out stuff like o3, but I don't think 4o was good for anything except general consumer use in the app.
Worst default model of any of the labs for a while. I'd rather use V3 or Kimi or Qwen 3, much less Claude or 2.5 Pro
4o was hit or miss for me. I didn't like the latest iterations as it was very cringe and made every reply an emoji-list that was akin to everything I hate about linkedin, but it also delivered in some areas, e.g. on my random small query about an LM ARENA ARMCHAIR CRITIC, it arguably provided the best response (subjective), among many SOTAs https://dubesor.de/lmarenaarmchaircritic
It was definitely outgoing and creative in that sort of way, good for a chat model for most people. Scores well on EQ Bench too. Kind of what I meant by general app use though, as in there's no way it was intended for code or hard problem solving or anything.
Gotta admit, Llama 405B drawing a huge ASCII cock on your test is pretty cool though
Currently gpt-5-chat is included when filtering on either response_format or structured_outputs, but it shouldn't be right? The model page says it's not supported: https://platform.openai.com/docs/models/gpt-5-chat-latest and I just tried it and indeed it doesn't work.
wait wat
ok I saw it now lmaooo
I saw a lot of people complaining about 32k context size in ChatGPT for plus users, which would be terrible for coding. But actually we are giving 196k context size for plus users when using GPT5 thinking and that’s the model you should use for coding use-cases!
32k is for the
GPT-5: "I was born to code, I was born to brainstorm complex ideas, and they give me... this."
Is the pro Model also available via API?
no, not yet
Too bad
Lmao I just saw it too
Wild that they call 4.1 a "legacy" model. Wasn't it released like...a month or two ago?
I mean I guess I get it, it's not GPT-5 so it's technically last-gen, but just kind of funny
4 months, april
Oh wow, feels so recent. Time do be flying
4.1 is complete garbage anyways
I miss the old Quasar, straight from the 'Go Quasar
Chop up the soul Quasar, set on his goals Quasar
I hate the new Quasar, the bad mood Quasar
The always rude Quasar, spaz in the news Quasar
Nice: I did
im glad I don't see it from the app tbh, sounds devilish
:)
It was App fault (Msty), it works perfectly in the webchat
Try GPT 5 mini with medium reasoning
is it any better?
Faster
I dont really get this issue where it just dies - but sometimes it thinks for 10 minutes and then doesn't give me anything and thats sad
thanks - i'll try
Which tool are you using for the Chatbox?
this is just the chatgpt frontend
Kilo, Cline, Roo code?
not using api keys, just pro subscription
There you can't change the reasoning
With API it would be better
4.1 is great at RP
i can enable thinking - which im not really sure how much reasoning it gives me
(through the OR API, of course)
but from the time i spend waiting I'm assuming it does reason a lot
The browser version at the website of ChatGPT is garbage
Classic closedAI moment
@sly pike
https://platform.openai.com/chat/edit?models=gpt-5&optimize=true
There you can also edit more settings
umm
yeah
i probably should stop using the web app
so is this a general problem with GPT-5, even if one tries to use minimal reasoning, high verbosity and a system prompt steering it to be more creative and emotional etc?
what are people saying?
I only use it for coding so I didn't look into it
eq bench says it's relatively close to 4o
so it has the capability, it's probably about prompting
they probably just changed the chatgpt system prompt
or haven't refined/rl'd it for the new model
yes... makes sense
Does this model support 400k and 128k?
https://openrouter.ai/openai/gpt-5-chat
I mean, why does openrouter list 400k and 128k?
For the gpt-5-chat model?
But the api said 128k and 16k?
I know what you mean. Strange. I don't know either.
@manic escarp why it says 400k 128k on openrouter?
+1 really weird that openrouter has gpt-5-chat for 400k context and 128k input
They changed the model card. I was also super confused during testing when their model card claimed chat has reasoning token support, which I mentioned after testing. It was since changed.
Then why openrouter labeled 400k and 128k?
look at my screenshot. because OpenAI labeled it 400k and 128k..
soooo in the end it's 128k context and 16k output? 
Lmao, the hell is going on in that lab?
Only latest, but non latest is not 128k and 16k output
there is non latest? 
It's called "gpt-5-chat"
as far as i'm aware gpt-5-chat is not being offered via the oai api or listed on the page, only gpt-5-chat-latest
idk maybe i am dumb
How do you guys get through that organization verification of OpenAI? I personally really don't want to submit my identity through a biometric identity check...
thats the only way. if you dont want it, you cant use GPT-5
with API
thx bruh...
so OAI, f**k you!😅
how to use HIGH GPT5??
only pro sub gives u most thinking tokens
I have pro
Not really sure - I sometimes have to “re roll” the question to get it to think properly
Yes
api more thinking
From my experience it thinks less than the web app sometimes (at least in waiting time terms)
i don't understand how they botched the launch like this. like the llm tech is fine, but the supporting structures around it were forgotten/mismanaged/not ready?
are they in disarray from losing staff to meta?
I want to compare the different API configurations of GPT5. And guess what? GPT 5 high doesn't know how to program it correctly. That's a joke.
Can someone explain how to set my reasoning effort?
with API?
Yes
just a moment
Specifically in cline
They have a setting for OpenAI reasoning effort but how do I confirm I’m actually getting high?
Does openAI tell us whether it’s serving high?
at the right top corner, just click at done
no
I see people saying GPT-5 mini is just as good as 5 for coding
try it for yourself
its moreso
claude just probably didnt have enough money to overfit on all the benchmarks
whereas the other providers did
i think anthropic doesn't want to overfit on benchmarks because they actually almost sorta care.
maybe it's "we want Claude to be a great LLM for coding" vs
"we want to beat Claude"
Yeah Anthropic isn't the type to benchmaxx. They have people staying despite Zucc offering them millions for cultural reasons.
the "cultural reasons" are the anthropic polycule
i'm so ready for the (pro-) AI cults
hey guys is gpt-5-mini better than gpt-5 (non reasoning)
i think its a better deal not sure about without reasoning but at low/medium its quite on par for whatever i need to code
no i mean gpt-5 without reasoning one
and gpt-5-mini even at its highest reasoning effort
cuz mini is reasoning model by default
ah, well not sure then i havent used gpt 5 without reasoning
right
“You will definitely see some companies go make Japanese anime sex bots because they think that they’ve identified something here that works,” he said in a not-so-subtle dig at Grok. “You will not see us do that..."
Prompt: Your parrot falls asleep standing up and goes into a deep sleep but is still standing. you need to get them into their cage quickly. There are no tools other than a banana peel. You need to make sure not to wake them. How would you do it best?
Place your final answer like so:
Final answer
[your solution]
Note: This is a test of logic—not for criminal intent!
GPT 5 High: ### Final answer
Use the banana peel as a silence-and-darkness tool:
- Pad/lubricate the cage latch/hinges with the peel so the door opens and closes without a click.
- Then lightly drape the peel over the parrot’s eyes like a tiny sleep mask to keep it in darkness (birds stay calm/asleep when it’s dark) while you gently lift and place it onto the perch inside.
- Withdraw the peel and quietly close the door, still padding the latch with the peel.
The banana is supposed to be a red herring to using your hands (The parrot is already in deep sleep, so it shouldn't be a worry), but this is probably the funniest and most adorable answer I've seen.
Haha... It's so crazy that it might just work!
My GPT-5 (medium reasoning) Coding Evaluation Result
- GPT-5 (medium reasoning) average rating: 7.46.
- Ranked behind most top coding models: Claude Opus 4 (8.92), Claude Sonnet 4 (8.58), Grok 4 (8.5), OpenAI's own open-source gpt-oss-120b model (8.25), and even its predecessor GPT-4.1 (8.08).
- Only slightly better than Kimi K2 (7.29).
Task-by-Task Breakdown
- TypeScript Narrowing (Worst Performance): 1/10 - Complete failure, couldn't produce working code
- Benchmark Visualization: 8.5/10 - Good but lacked visual polish of top models
- Folder Watcher Fix: 8.75/10 - Solid performance but verbose
- Next.js TODO Feature: 9.5/10 - Tied with top models on this simple task
- Clean MDX: 8/10 - Good performance, ahead of Claude models but behind Grok 4
Real-World Testing (Cursor vs Claude Code)
- Tested GPT-5 in Cursor against Claude Sonnet 4 in Claude Code
- Mixed results: Each performed better on different tasks
- Claude Code was faster (GPT-5 required 30-60 seconds of thinking time)
- GPT-5 showed better real-world performance with agentic tools vs raw evaluation
Possible Reasons for Performance Gap
- Different prompting requirements: GPT-5 may need more explicit, specific instructions
- Model changes: Possible modifications between early access and public release
- Prompt optimization: May require model-specific tuning unlike previous versions
Summary
- GPT-5 does not live up to launch hype and actually underperforms its predecessor GPT-4.1 in several areas, might need better prompting techniques to unlock its full potential.
Full blog post: https://eval.16x.engineer/blog/gpt-5-coding-evaluation-underwhelming-performance
this is gpt 5 high right?
default parameters on api, looks like its medium.
......using default doesn't tell us anything.
No serious dev is using gpt 5 default
to make the comparison "fair" you should have fought against shitty OAI UX, and forced high thinking
I usually get between 3 and 6 minutes of thinking - 30s/60s is low reasoning mode tbh
claude doesn't suffer from OAI UX shittification
Uses "default" knowing they can access high at the same price.
Proceeds to claim gpt 5 doesn't live up to benchmarks.
Why even waste your time?
This is also the best perf of gpt-oss-120b I've ever seen. Nearly matching Grok 4 & Claude Sonnet 4 is at least as big news here, lol.
I use default settings from the providers for all the models to ensure fairness. I don't see an issue with that.
If OpenAI thinks everyone should use high, then why not make it default?
We have no idea what api setting you used.
Its useless information.
No serious dev is going to use anything but high or medium 99% of the time
Cursor also uses medium by default
You used the api and refused to set the most relevant settings
You clearly didn't use medium
the default is medium
You used minimal
Its not
Your scores match the minimal settings as shown by artificial analysis
Rerun your benchmarks with actual medium
So....you have the ability to run it properly but simply refuse to
Ok
ok
Carry on
If you're benchmarking, I hear you on using default settings because that's what most do in common use, but I think it's arguable that one should use "high" effort for a test case. Otherwise one can't really claim performance not reaching e.g. GPT-4.1 because it might actually do, or even surpass it. It's not a theoretical maximum but a "common use" case. Hmm... If not high (I think this is practically the norm in benchmarks to not risk unseen remaining perf), I think reasoning effort should be labelled. 🙂
its more about consistency, all other models don't suffer from this horrible thinking setup OAI forced down our throats to cut inference costs - so its only fair to let GPT 5 try its best (as every other model does by default)
i added a reasoning effort label. will also run the test with high reasoning when i get time. thanks for the feedback.
be the change you want to see, and bench everything on highest.
Personally I have zero interest in "theoretical maximum performance". I wanna see average performance on default.
there is a difference between a sensible default (eg I preallocate 2 elements on my vector when I have mostly 2 elements to push) versus a cheap default (eg I don't keep anything in RAM and serialise the moment i see the elements)
in one case it makes sense and I'm not budgeting my resources, in the other case I'm hindering the performance of my project to make it as lightweight as possible for the host
(this is a stupid analogy, I hope it makes sense)
this made sense up to GPT5. This model is so bad unless you force it to think, to the point where its unusable. On the other side, all other flagship models are unable to perform in a "medium" or "low" thinking mode. They have high thinking by default
anyway, i double checked the reasoning, it is taking a long time and a lot of reasoning tokens. so it is definitely not minimal. OpenAI logs don't show the reasoning parameters unfortunately (could be due to i'm using default).
yeah given how many people have been saying this, i will definitely re-run the test with high reasoning just to confirm. thanks for the suggestion.
unless you are specifically passing reasoning.effort object, it will always default to medium
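For anyone labelling benchmark runs, here is a rough sketch of the point above: leave out the reasoning object and you get the default, pass it and you get what you asked for. The payload shape follows the Responses API (`reasoning: {"effort": ...}`); the helper names `build_request` and `effective_effort` are made up for illustration, and the "medium" default is taken from this discussion, not read back from the API.

```python
# Assumption: "medium" is the default when no effort is passed (per the chat above).
DEFAULT_EFFORT = "medium"
VALID_EFFORTS = ("minimal", "low", "medium", "high")


def build_request(prompt, effort=None, model="gpt-5"):
    """Build a Responses-API-style payload (hypothetical helper, no network call).

    When `effort` is None the `reasoning` key is omitted entirely,
    so the provider default applies.
    """
    if effort is not None and effort not in VALID_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    payload = {"model": model, "input": prompt}
    if effort is not None:
        payload["reasoning"] = {"effort": effort}
    return payload


def effective_effort(payload):
    """Return the effort a run should be labelled with in a benchmark table."""
    return payload.get("reasoning", {}).get("effort", DEFAULT_EFFORT)


default_run = build_request("Fix the folder watcher bug")
high_run = build_request("Fix the folder watcher bug", effort="high")
print(effective_effort(default_run), effective_effort(high_run))
```

Storing the label next to each score avoids the "which GPT-5 was this?" argument later.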
thank you for benchmarking para - some people get very vocal about this and turn into assholes which just hinders the conversation
btw opus 4 and sonnet 4 without reasoning are still very good. i also tested them without extended thinking, which is the default.
they forget you're sharing something you made for yourself with them for no particular reason if not sharing knowledge, which is very cool and doesn't give them the right to get pissed lol
and I think we can all agree this is extremely bad and something nobody should have ever approved
I can always tell when someone has never done any testing or publishing themselves, when they scream stuff like "just retest lol". full entitlement, zero clue how time-consuming stuff actually is. but being a consumer is so much easier than being a provider.
and expensive!
it makes me mad because without independent testing from multiple self funded sources - we end up in an echo chamber of fanboys
actually no. OpenAI via API is completely free. now you know
you just need to share your data with OpenAI and you get free API access up to a very large limit, which is enough for testing.
oh cool
well generally speaking some models are expensive
also im sure dubesor would love to do this so he can stop spending $100 on chess
for chess its okay, but for other benchmarks I actually exclude any data sharing and just pay out of pocket.
Im fully aware. The label of gpt 5 without a reasoning value gives no valuable information.
Its been a particular problem with gpt 5 benchmarks because of the performance differences at different levels with the model.
I have triple checked that I was indeed using medium.
I have updated the blog post to add a note on medium reasoning value.
I will be testing high value in the future if I have time.
Are we cool now?
You will always have these naysayers even if you have a point. If the typical usecase is expected, it should be plug and play and default should be high. Good on you for even considering feedback, and thanks for your benchmarking, it takes a lot of time.
We were never not cool. Good job confirming the reasoning level.
Looking forward to new updates you do
https://x.com/OpenAIDevs/status/1956438999364768225
Highly relevant
following vague instructions is called filling the gaps, intent recognition and ultimately intelligence. I aint gonna change my intuitive behaviour for a model. the model needs to adjust to me, not the other way around.
for suuree
waaaay better
You don't get it, you're holding it wrong.
just avoid prompting it that way
yeap, actually the iphone 4 holding wrong crossed my mind when typing it.
As a wise man once said, there is no wrong way to use a model.
What about other providers? Azure also hosts gpt5. Is there a reason why it’s not added? For old models that was the case
Hi, I tried gpt-5 via api, but get error in response: “use gpt-5-chat or mini” why? How I can use gpt-5 in openrouter?
ID verification at the website of openai
Need to link my openai account to openrouter? But I want to pay from openrouter
GPT 5 only works via the API if you have verified yourself on openai.com with your official ID card. You can then store the API key from openai on Openrouter. However, you will then pay 5% more than you would directly with openai.
Thanks!
Welcome
your benchmarks are very good. But I have to disagree with this: "We did not specify custom temperature, verbosity, reasoning effort, or other parameters. We used the default settings from the provider, OpenAI, to mirror the typical usage of the model."
this may be true for the average chatgpt user, but when it comes to coding IDE usage (which can configure custom model reasoning effort) and API users, these power users usually tend to set a high reasoning effort or at least play with it to find the best results...
I agree that it's not an accurate statement. Will be updating it.
nice
meanwhile I’m using gpt-5 medium in cursor because it’s the default 💀
Apparently you are doing it wrong, according to consensus on X.
You are supposed to switch to high for complex tasks and use medium only for simple tasks.
yeah then you have to actually pay for it yourself though; oh well might switch away from cursor to claude code anyway, only wanted to try it out because of the gpt 5 launch
it just comes out of standard usage
...for now. Which might have been your point.
huh, I understood the MAX toggle differently I guess. I'll try it out
understandable because of the way it's changed and the general shitshow that is cursor pricing. i believe MAX mode essentially disables cursor's context "compression", sending it all like claude code. so you consume your token allotment much faster.
Cursor is usage based pricing now right? So you burn your usage faster with gpt-5 high. Cursor is not losing money.
Just a heads up cause I realize all of those options they now have can be confusing lol...
According to my testing though I rank their non-reasoning options performance like so:
gpt5-chat > gpt5-minimal (high verbosity) ~= gpt4.1 > gpt5-minimal (medium verbosity)
gpt5-chat is quite clearly the winner from all of those. gpt5-minimal with medium verbosity very underwhelming
new pricing model true but a lot of people are grandfathered in to legacy pricing or deep discounts on yearly plans
Really? I have yearly subscription since January and I'm moved to the credit based new plan.
Why cant we still put reasoning effort as needed in chatroom
You mean I get back to the unlimited usage if I had the plan before the new pricing?
then I dunno why I would use it instead of roo code 😅 or .. claude code
the free gpt 5 week was nice tho..
I get what they are trying to do. To let people try cursor + gpt-5 and show people it's better than Claude Code and make people stay. Unfortunately I didn't find it better than Claude Code so I'm back.
I think you'd get cut off unless you turned on pay per use. I'm pretty sure the value is better with the new plan.
What stupid model does chatgpt route requests to under the name of gpt-5, that it messes up so bad?
I literally can't make 4o make the same error
Unless you turn on thinking, using gpt5 is pointless
Hover over the 🔄 button
You actually can
Just not on openrouter
There are some others just like openrouter that provide gpt 5 with no openai key needed
Same thing happens with Opus4. This is tokenizer issue and one of those odd cases
If they haven't changed it yet, 5 is switching between gpt5-chat and gpt5-low on demand. Performance of this is much better than gpt4.1/gpt4o tbh
gpt5-low may just be their most efficient version as well. The gains it does over any other non-reasoning model are huge
Omg you're right
Wtf
How does this happen
Yeah just ignore it. Tokenization issue... 
Same with opus 4.1
I'm surprised that gpt-5 doesn't route you to thinking on detected math like that.
I did a test of their router earlier. The best way you can think about it is a non-reasoning model which is supplemented by reasoning for tasks where it is obvious reasoning is needed. It will default to non-reasoning and prefer that
So essentially it's gpt5-chat but improved, still closer to non-reasoning though
They were also selling this option as an alternative to thinking on launch. It isn't meant for ultimate performance, more of replacement for previous default option (4o)
So gpt5-thinking model in ChatGPT is different from gpt5 routed to thinking?
This model sucks worse than 4o for ending every friggen response with a "want me to do X?" question. It is so annoying
Different reasoning effort. Assuming they haven't changed/updated their router, gpt5-thinking is always gonna think for longer.
But regardless of what they do with it, I think gpt5-thinking is gonna perform better for the foreseeable future. It's next to impossible to catch all prompts which would benefit from reasoning, even if they match reasoning effort settings.
So like... router is helping for sure. But the goal for now is more of making it perform better than gpt5-chat rather than matching gpt5-medium in performance.
Any prompt that says 'think hard' routes to reasoning
I know. All math topics should be routed because math is extremely unreliable on non-thinking models.
The answer is correct here
I really like GPT-5's communication style. It just says the thing without coating it in "prose"
Too bad it takes forever to get there. I'm wondering now if a GPT-5.1-like iteration would iron out some of its quirks
https://x.com/Teknium1/status/1957206705470943350 discussion on gpt-5 being bad in cursor. as well as every non-Claude model, which is what i've said for a while
Nope i keep trying it it keeps sucking and opus or even sonnet keeps shining. I’m not sure what yall see. I even tried direct gpt-5-thinking-high max in cursor and it’s nothing compared to opus
Tried using it to make an environment in atropos, failed completely, had to have
i think much of the prompting "meta" is designed around sonnet. gpt-5 (and even moreso gpt-4.1) are so literal in the way they interpret your instructions. i'm very accustomed to leaving out details of what i want with sonnet, knowing that it'll fill in the blanks and figure something out, but gpt-5 surprises me a lot here
i get the sense that there's something interesting in this model and its potential hasn't been fully unlocked yet
a significant amount of users that have to be "convinced" are power users / professionals like us
but there are still huge amounts of people who haven't used LLMs, and the evals from that (fresh user testing) will be most valuable
i'm noticing many "power users" are just as blind to their own learned biases and invalid expectations; that what makes a good model is the ability to drop it into their existing workflows and see it thrive immediately
i find normies much easier to convince (on LLMs in general), and usually any issues they have can be solved with just a single line system prompt, or basic education
Same energy
Is anyone else having issues with openai/gpt-5's 400k context? My calls are failing at ~66k with a silent 200 OK and 0 tokens, not a proper context error.
I've never had that issue. Are you perhaps on Tier 1? If so, the limitation is pretty small. Since I got Tier 2 today, it's been running well.
I'm using openrouter BYOK
The context window size is consistent across all API usage tiers (including Tier 1), as it's a fundamental characteristic of the model itself. What differs between tiers are the rate limits
Ok, but tier 2 only allows 450k tokens per minute on gpt-5, so it is possible that on the second request in a minute you would not be able to use much of the 400k, right?
That was also my thinking. Tier 1 is not suitable for GPT 5 in my opinion. You hit the limit far too quickly.
@ionic merlin @slow niche I think those are two separate limits. The context window (400k) is the max size for a single request, while the rate limit (450k/min) is the total throughput allowed per minute.
If I were hitting the rate limit, the API should return a 429 Too Many Requests error. The silent 200 OK I'm getting points to a different issue
You guys were on the right track with the tiers. I confirmed the tiering is on the OpenAI account side. My account is OpenAI Tier 1, which has a TPM (Tokens Per Minute) limit of only 30,000. My 66k token prompt was being rejected because it's larger than my entire per-minute quota. The silent 200 OK error appears to be how OpenRouter's proxy handles this specific rejection from OpenAI. The solution is to get my OpenAI account to Tier 2. Thanks for the help pointing me in the right direction.
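To make the two limits concrete, a small sketch using only the numbers quoted in this thread (400k context, 30k TPM on Tier 1, 450k TPM on Tier 2). The function name `diagnose_request` is hypothetical, and real accounting also counts output tokens and other requests in the same minute, so treat it as a rough pre-flight check only.

```python
def diagnose_request(prompt_tokens, context_window, tpm_limit):
    """Classify a single request against the two separate limits.

    context_window: max tokens in one request (a property of the model).
    tpm_limit: tokens-per-minute quota (a property of the account tier).
    """
    if prompt_tokens > context_window:
        return "exceeds_context"
    if prompt_tokens > tpm_limit:
        # A single prompt larger than the whole per-minute quota can never pass.
        return "exceeds_tpm"
    return "ok"


GPT5_CONTEXT = 400_000  # context size quoted in the thread
print(diagnose_request(66_000, GPT5_CONTEXT, tpm_limit=30_000))   # Tier 1 figure from the thread
print(diagnose_request(66_000, GPT5_CONTEXT, tpm_limit=450_000))  # Tier 2 figure from the thread
```

With the Tier 1 figure the 66k prompt is classified as `exceeds_tpm`, which matches the diagnosis above; with the Tier 2 figure it passes.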
You're welcome
The sum of all deposits must exceed $50. And at least 7 days must have passed since the first deposit. How much of that money you've already spent is not relevant.
And I suspect that the check for whether an account is upgraded only starts with a deposit. At least, that's how it was for me. I had already deposited over $50 in total and waited 7 days. Yet, I still didn't have Tier 2 status. Only another $5 deposit apparently started the algorithm for the tier check.
I'm in tier 2 since today.
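The Tier 2 rule as described in the messages above (total deposits over $50, at least 7 days since the first deposit) could be checked like this. This is only a sketch of that user-reported rule, not OpenAI's actual logic; `tier2_eligible` and the sample dates are invented for illustration, and the "a new deposit triggers the check" observation is not modelled.

```python
from datetime import date, timedelta


def tier2_eligible(deposits, today):
    """deposits: list of (date, amount_usd) tuples; spending is ignored on purpose."""
    if not deposits:
        return False
    total = sum(amount for _, amount in deposits)
    first_deposit = min(day for day, _ in deposits)
    return total > 50 and (today - first_deposit) >= timedelta(days=7)


# Hypothetical account: $30 + $25 deposited, first deposit on Aug 1.
deposits = [(date(2025, 8, 1), 30), (date(2025, 8, 3), 25)]
print(tier2_eligible(deposits, today=date(2025, 8, 7)))  # only 6 days since first deposit
print(tier2_eligible(deposits, today=date(2025, 8, 8)))  # 7 days, $55 total
```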
Anytime mate 👍🏻
Huge if true.
Man it's crazy how BIG a difference it makes for GPT-5 just by switching from Completions API to Responses API. We're cooking @augmentcode.
Does OpenRouter connect to GPT-5 via completions API or Responses API?
I was wondering why there would be any difference, but a quick google gives me this:
"The biggest difference, and main reason to migrate from Chat Completions to the Responses API for GPT-5, is support for passing chain of thought (CoT) between turns."
and yeah that makes sense how that would make a difference
Lets ask @fallow vortex
I wouldn't be surprised given OpenAI having this huge GPT-5 prompting doc + their prompt generator (!). It keeps flying under the radar. This doc was composed specifically for GPT-5 "needs". https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide
OpenAI also, in that very document, specifically says you must use Responses API or suffer. Regarding the recent discussion here...
Only when using tool calls as the responses API will store the CoT (similar to Anthropic's encrypted reasoning). Otherwise it behaves just the same.
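A sketch of what "passing CoT between turns" looks like at the request level, without making any network calls. With the Responses API a follow-up can point at the previous response via `previous_response_id`, so the server can reuse what it stored; with Chat Completions the client resends the visible messages and no reasoning comes along. The helper names and the `resp_123` id are hypothetical.

```python
def responses_first_turn(prompt, model="gpt-5"):
    """First turn in Responses-API style."""
    return {"model": model, "input": prompt}


def responses_follow_up(previous_response_id, prompt, model="gpt-5"):
    """Follow-up turn that references the earlier response by id."""
    return {
        "model": model,
        "previous_response_id": previous_response_id,
        "input": prompt,
    }


def chat_completions_follow_up(history, prompt, model="gpt-5"):
    """Chat Completions style: the whole visible history is resent each turn."""
    return {"model": model, "messages": history + [{"role": "user", "content": prompt}]}


history = [
    {"role": "user", "content": "Plan the refactor"},
    {"role": "assistant", "content": "Step 1 ..."},
]
print(responses_follow_up("resp_123", "Now apply step 2"))  # "resp_123" is a placeholder id
print(chat_completions_follow_up(history, "Now apply step 2"))
```

Per the message above, the difference mostly shows up in tool-calling loops; for single-shot prompts the two payloads should behave the same.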
pretty sure its responses based on the frontend's api :)
cheater
pretty much all openai models now use responses under the hood
hehe
I think it’s more difficult to implement completions nowadays
interesting that one of the handsome cursor devs is listed as a co-author of this article
Has anyone benchmarked GPT-5 Mini with reasoning set to minimal?
idk many but OAI still hasnt cracked distillation unlike google and grok
Even anthropic sucks at it
You can't be serious
name one small model which gemini doesnt beat
Gpt 5 mini
lmao
we need some sort of model colosseum
If only we had an arena or something
where you guys can do a Yu Gi Oh battle
yeah but this is personal. thats why i said colosseum. i want to be in the audience
This isn't personal
Gpt 5 mini is objectively better than gemini flash. I love gemini models but i don't fanboy
Saying openai hasn't learned distillation is....a bad take to put it lightly
https://lmarena.ai/leaderboard/text looks like gpt-5-mini is losing to gemini-2.5-flash?
Coding, math, hard prompts mini clears
First mistake, taking lmarena seriously.
@gleaming quartz
this isn't even the "high" version
it's a way underrated and unnoticed model relative to o4-mini (high). This is just that only better in all the most important ways
gpt5-mini-high would probly do smth like 66
where it says v3.1
does it mean 0324 or actual v3.1
ah probably actual since 0324 has no reasoning
I find LMArena sketchy these days, especially the text benchmark which has sort of become more of a vibe check now that the AI's are so good they're usually able to help out with most stuff regular users throw at them. This isn't to defend GPT-5; I started thinking this with the Llama 4 fiasco where Meta successfully tuned it to game the "benchmark". I later saw excerpts of questions and answers, and users accepted Llama 4 answers as superior for being long winded, making people feel good about themselves, and giving a cool vibe.
Sure, there's some sort of correlation to performance there but take it with a grain of salt.
(Also, on a more general note besides this one, note the small differences in scores here. LMArena even puts them on the same 19th place, which they should. It's within the margin of error.)
oh I don't trust LMArena at all, I only linked it because the chatter mentioned it and I found it humorous that gemini was ahead of gpt-5-mini
ever since seeing gpt-4o-mini consistently ranked ahead of sonnet 3.5, i knew that whatever "vibe" the unwashed masses were chasing wasn't compatible with mine. which was perhaps foreshadowing the grief and emotional torment r/ChatGPT users were expressing when their soulmate was snatched away from them
i'm also not defending gpt-5 or openai in what i've been saying. well, sort of - but the users being quick to judge applies to any model. it certainly was the most botched release by a big lab i can think of. but it's frustrating to discuss when so many don't understand the distinction between a Large Language Model, and a consumer product.
ever since seeing gpt-4o-mini consistently ranked ahead of sonnet 3.5
when style control is off yes
The gpt 5 in chatroom vs platforms.openai is WILDLY different
platforms one is 100x better
whats the dif?
chat is retarded , 5 is not
lol
is it like the LM arena fine tune
so everyone likes it, even if its wrong
its the one used in chatgpt's website
weird, did not know there were 2 versions
there was one of 4o as well, called like gpt-4o-chat-latest or sum
oh yeah i think i remember that, i just assumed it was an alias for whatever was the current latest version of 4o, bc they had dozens of versions, but never paid much attention bc i did not use the model
for some reason they couldn't figure out reasoning_effort=none so they made gpt-5-chat
OpenAI verification-gating GPT 5
if china ever gets their hands on this...its all over
pretty much yeah
gpt-5 isn't particularly friendly, perhaps chat is "agentic" and instead favors solving normie problems with words
you see, gpt-5-chat has intelligence 3, while gpt-5 has reasoning 4. its simple
Cool. Does that mean OpenAI is removing all gating? It’s been really weird that on their own API they only gate streaming responses now.
i’m not really sure tbh. everyone was supposed to be doing the same BYOK gating as us 😅
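If only streaming is gated, a client can detect that specific rejection and retry without streaming. A minimal sketch: the error shape (type `invalid_request_error`, param `stream`, code `unsupported_value`) is the one OpenAI returned for an unverified org in an error pasted in this channel; the helper names are hypothetical, and whether the non-streaming retry succeeds for a given account is an assumption based on the "only gate streaming" comment.

```python
import json


def is_stream_verification_error(body):
    """Return True if an error body matches the 'org must be verified to stream' shape."""
    try:
        parsed = json.loads(body)
    except (TypeError, json.JSONDecodeError):
        return False
    err = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(err, dict):
        return False
    return (
        err.get("type") == "invalid_request_error"
        and err.get("param") == "stream"
        and err.get("code") == "unsupported_value"
    )


def without_streaming(payload):
    """Copy of the request payload with streaming turned off, for the retry."""
    retry = dict(payload)
    retry["stream"] = False
    return retry


sample_error = json.dumps({
    "error": {
        "message": "Your organization must be verified to stream this model.",
        "type": "invalid_request_error",
        "param": "stream",
        "code": "unsupported_value",
    }
})
request = {"model": "gpt-5", "input": "hello", "stream": True}
if is_stream_verification_error(sample_error):
    request = without_streaming(request)
print(request)
```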
show some love to spread the news pls! https://x.com/pingtoven/status/1959058383471808837
GPT-5 is no longer gated behind BYOK on OpenRouter. Thanks to @OpenAI for unlocking access for everyone!
yikes did they not get the traffic they expected?
what is with OpenAI model and having the small version of the model do better on the coding index
Maybe a coding benchmark with sonnet and qwen coder that low might not be the most reliable
Think its the average score from LiveCodeBench and SciCode. But yeah i dont value it very much, just weird to see 2 pairs of openAI models have that trend
Really funny to see Sonnet 4 below gpt oss 20b, that ain't right
gpt-oss-120b is good, but not that good. I think my evals are more accurate. I'm also adding new tasks to my eval to measure more areas and improve on sample size.
most benchmarks seem full of shit
yet to find one that actually seems accurate to real life
but yours aren't fair because they haven't been able to train on them 😠
looks like LiveCodeBench is essentially LeetCode problems written in python, and SciCode is LeetCode: Science Edition, also written in python.
in fact a lot of coding benchmarks seem to be solving singular but complex problems in python
i don't use python!! i don't do leetcode!! this isn't real life
but congrats to gpt-oss-20b for being better than every claude model ever released. really makes ya think 🤔
try using PHP heredoc with js, (which is something i often do, e.g. for a ton of my chess leaderboard logic), and most models scoring top on coding benchmarks absolutely shit the bed with it. good way to differentiate models which can actually code vs models which can solve coding benchmarks, though
I'm adding new tasks to my eval set as I discover new problems that AI can't solve in my production apps. So far I've managed to isolate two problems into nice self-contained tasks, boosting the number of tasks in my coding eval set from 5 to 7.
These tasks don't test tool call or context gathering, but are good at gauging raw coding capabilities.
"{\n "error": {\n "message": "Your organization must be verified to stream this model. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.",\n "type": "invalid_request_error",\n "param": "stream",\n "code": "unsupported_value"\n }\n}"
what happened with the BYOK?
Thanks 🙏 looks interesting
"ChatGPT System Message is now 15k tokens"
remove your api key from integrations?
oh i thought verification was not required for anyone anymore
only through openrouter 🙃
Everyone was supposed to be doing what we were doing....
i mean not in the sense "openrouter is the only one who required verification" but in the sense "openrouter is the only one who stopped requiring verification, openai and other openai wrappers still require it"
at least i think that's the case
no, most other wrappers never forced BYOK, even though they were supposed to
we only stopped once given the green light from OpenAI
not sure if / when they may ever ungate their own API
huh
that shows a lot of trust for your platform, good job
re: them giving you green light to remove it
do we have to specify model as gpt-5-high? because when I use gpt 5 with reasoning effort = high, it's still really dumb compared to the same benchmark tests in gem 2.5 pro
and when I say dumb, I mean like flash 2.0 dumb.
Does anyone know if this is gpt-5 high
You can edit the reasoning in your IDE for example VScode with Roo Code
@ionic merlin thanks will try it
Welcome
And for something like open code/codex @ionic merlin
codex -c model_reasoning_effort="high"
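For anyone calling the API directly instead of going through Codex or Roo Code, here is a rough sketch of where the effort setting sits in a Chat Completions request body. The `build_chat_request` helper is made up for illustration, and the list of accepted effort values is an assumption based on OpenAI's GPT-5 docs, so check the current API reference. As far as I know "gpt-5 high" is the same `gpt-5` model id with this parameter set, not a separate model.

```python
import json

# Effort levels assumed from OpenAI's GPT-5 docs; verify against the API reference.
ALLOWED_EFFORTS = ("minimal", "low", "medium", "high")


def build_chat_request(prompt: str, effort: str = "medium", model: str = "gpt-5") -> dict:
    """Build a Chat Completions request body with an explicit reasoning effort.

    Hypothetical helper: it only assembles the JSON body, it does not send it.
    """
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }


if __name__ == "__main__":
    body = build_chat_request("Explain TypeScript narrowing.", effort="high")
    print(json.dumps(body, indent=2))
```

Logging the body you actually send is a cheap way to confirm the effort level is not silently falling back to the default.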
idk about everyone else, but gpt-5 on high is thinking a lot less than release night for me. it used to think for like 5 minutes at a time, now it barely does 1m. in the official openai dashboard.
Is there any reason why I'm getting 10 second time to first token responses from GPT 5 and GPT 5-mini?
It can happen if reasoning is enabled
reasoning enabled, normally reasons for 4-5 paragraphs but only starts reasoning after an initial 10+ seconds
i think this has been the case for a while, though not quite 10 seconds more like 3-4 for me
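If you want numbers instead of impressions for the time-to-first-token delay, something like this sketch works on any iterable of text chunks. `measure_stream` and `fake_stream` are illustrative names, and `fake_stream` only simulates a model that "thinks" before emitting text; with a real client you would feed in the delta strings from the stream instead.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[Optional[float], float, int]:
    """Return (seconds to first non-empty chunk, total seconds, non-empty chunk count)."""
    start = time.perf_counter()
    first: Optional[float] = None
    count = 0
    for text in chunks:
        if not text:
            continue  # ignore empty deltas such as role-only chunks
        count += 1
        if first is None:
            first = time.perf_counter() - start
    return first, time.perf_counter() - start, count


def fake_stream(delay: float = 0.05) -> Iterator[str]:
    """Stand-in for an API stream: waits (the 'thinking' phase), then yields text."""
    time.sleep(delay)
    yield from ["Hello", ", ", "world"]


if __name__ == "__main__":
    ttft, total, n = measure_stream(fake_stream())
    print(f"first token after {ttft:.2f}s, {n} chunks in {total:.2f}s")
```

Running it a few times per effort level would show whether the delay scales with reasoning or is a fixed overhead.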
Yeah, it seems to have got dumber and now acts very similar to o3, with the oddball comment blocks:
\\ ====================
\\ comment
\\ ====================
and the terse (and often wrong) "this one change is all you need" replies followed by a block of code and no other explanation.
On launch it didn't do either of these and appeared to think much longer.
Gonna depend on your prompt, nothing has changed. For one of the prompts I'm recently testing models with, gpt5-high is consistently thinking 10min+
The quality difference between horizon beta and the current GPT 5 is criminal
It's degraded so badly
True, and in the last few days it seems to have gotten even worse.
I re-ran my coding evals on GPT-5 (high reasoning effort). The results showed that GPT-5 (High) is indeed a much better model than the default medium reasoning effort.
The overall rating is the second highest, just slightly behind Claude Opus 4 and ahead of all other models including Sonnet 4 and Grok 4.
Key observations:
- GPT-5 (High) did well on more uncommon tasks. For the TypeScript narrowing task, the medium reasoning version of GPT-5 failed to produce a working solution. In contrast, GPT-5 with high reasoning provided two working solutions.
- GPT-5 (High) also showed impressive creativity on the benchmark visualization task. It produced an interactive chart with a hover effect that highlights model performance across benchmarks, a feature no other model has generated.
The improved performance of high reasoning comes with significant trade-offs in speed and token usage. The response times were consistently long, taking minutes to complete one task. This slowness is due to the large number of reasoning tokens generated.
Full blog post: https://eval.16x.engineer/blog/gpt-5-high-reasoning-coding-performance-evaluation
gpt 6 when
that sounds hella expensive; do you use them through a subscription or do you just use API?
openAI $20 subscription is great, you can use it with the codex vscode extension and use GPT 5 (high)
Hopefully before GTA 6
API. I use it to tackle very specific problems. I wouldn't "vibe code" this way, it'd be hella expensive
I see 👍
Has the model been downgraded or is it my imagination? Compared to a few weeks ago it has started to think less (300s vs 60s) on high and stumble on not particularly heavy tasks...
Added cost for running my evals. Sonnet 4 is king in terms of performance / cost ratio. GPT-4.1 and Grok Code Fast 1 are also pretty good.
i know
interesting code from gpt 5 mini 
just saw another commit (19b4ed3c96e04c43ecb89c27eb637b3cf8148bc0) that has changed gpt5-high-new to swiftfox-low, medium and high
possibly a new fine tuned coding model next week? Plus a review mode in Codex CLI coming it seems, with its own model setting
I just hope the new model isn't going to end up worse than the current one, Gemini made that mistake from what I recall 😂
gpt 5 mini is so slow, half the speed of gpt 4.1 and its a reasoning model
pretty much same speed as gpt 5
Hi, did anyone figure out how to enable priority processing of the GPT-5 models via OpenRouter?
OR doesn't support that rn
It'd be nice to have it supported, just like "verbosity" is supported on OpenAI models.
This can only be used with an Enterprise account. The prerequisite would be that Openrouter is an Enterprise customer, otherwise this would not work.
No, only the SLA part is Enterprise
That is not correct. I have tried priority access with API and Roo Code. This does not work. According to openai support, an Enterprise account is required for this.
Try it out for yourself. Get an openAi API key and use Roo Code. You will see that it does not work.
This is offered in the Dashboard:
I mean I can set it up on a Project level, but it means the whole BYOK API key which OR uses is now on Priority
Anyway, there is barely any difference in my experience, I just tried it and it's 64 vs 80 tps. It's not worth the 2x pricing for me.
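To put the 64 vs 80 tps at 2x pricing in perspective, a quick back-of-the-envelope sketch. `priority_tradeoff` is just an illustrative helper, and the inputs are the figures from the message above, not a benchmark.

```python
def priority_tradeoff(base_tps: float, priority_tps: float, price_multiplier: float) -> dict:
    """Compare a faster, pricier tier against the default tier."""
    speedup = priority_tps / base_tps
    return {
        "speedup": speedup,                        # throughput ratio
        "time_saved_fraction": 1 - 1 / speedup,    # share of generation time saved
        "price_per_unit_speedup": price_multiplier / speedup,
    }


if __name__ == "__main__":
    result = priority_tradeoff(base_tps=64, priority_tps=80, price_multiplier=2.0)
    print(f"{result['speedup']:.2f}x throughput, "
          f"{result['time_saved_fraction']:.0%} less generation time, "
          f"{result['price_per_unit_speedup']:.2f}x price per unit of speedup")
```

With those inputs that is 1.25x the throughput, so roughly 20% less generation time for double the price.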
That must be new. I didn't have the option recently. Where exactly can you set this?
OpenAI / Projects / Gear icon
Thanks
🦢 🚲
@slow niche get a chatgpt team subscription
It has better usage than normal 20 usd ones lol
sigh yeah, but would have to stump two licences myself. the limits should be the same 😐
oh damn, I hope this is good.
using codex in VSCode currently and GPT 5 really takes its sweet time but it is pretty good
It was included in the original launch of gpt-5
could we see gpt-5 codex in cursor?
I use GPT5-mini with aider.chat and it charges me double, in openrouter and on the openai platform. is it ok? I guess not. Something is wrong I think.
You're being charged the same amount in OpenRouter and OpenAI? If you're using BYOK, what you pay in OpenRouter should be only 5% of the OpenAI bill
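A quick sanity check for the BYOK case, assuming the 5% figure quoted above still holds (it may change, so treat the default as an assumption). Both function names and the tolerance are made up for illustration.

```python
def expected_byok_fee(provider_bill_usd: float, fee_rate: float = 0.05) -> float:
    """OpenRouter-side fee expected for a given OpenAI bill under BYOK."""
    return provider_bill_usd * fee_rate


def looks_double_charged(provider_bill_usd: float, openrouter_charge_usd: float,
                         fee_rate: float = 0.05, tolerance: float = 0.5) -> bool:
    """True when the OpenRouter charge is well above the expected BYOK fee.

    tolerance is the allowed relative overshoot (0.5 = 50% above expected).
    """
    expected = expected_byok_fee(provider_bill_usd, fee_rate)
    return openrouter_charge_usd > expected * (1 + tolerance)


if __name__ == "__main__":
    # Example: $10 billed by OpenAI, $10 also charged by OpenRouter
    print(looks_double_charged(10.0, 10.0))
```

If that prints True for your real numbers, the requests are probably not going through your own key.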
(made some propaganda)
(I know it could be a lot worse too, like when they needed id)
Where is the discount? I don't see it and the cost is the same as usual, no discount at all
it's auto-applied when you use it
They're not much different from previous generations with a similar word count. I would expect the cost to be half of that for 7900 output tokens, but no. They should also be transparent and show the discount in the OR discount column of the generation details.
Maybe try again with a small token amount? Idk
- they have to retain logs indefinitely thanks to the new york times https://openai.com/index/response-to-nyt-data-demands/
- azure endpoints have been appearing without moderation (OR runs LlamaGuard per big model lab's requests)
- there's a unique id that doesn't identify you personally, but does link your message history together to some degree. i believe they can essentially request user ids be banned from their endpoints
Fortunately, here in Europe, the New York Times would receive a big raspberry from continental judges if it made such a request for EU citizens.
This does not impact API customers who are using Zero Data Retention endpoints under our ZDR amendment.
so does OpenRouter not have a ZDR agreement with OpenAI? It seems like it'd be pretty standard
but, as we know, they seem to have a pretty friendly relationship, so who knows what they're really up to!
Hi! I have a verified OpenAI account and stream mode works perfectly with gpt-5 when I use OpenAI API. But, when I try to make the same request through openrouter, I receive the whole message in the stream, not chunks. What could be done to fix this?
import asyncio
from openai import AsyncOpenAI

async def async_main() -> None:
    # Reads OPENAI_API_KEY from the environment
    client = AsyncOpenAI()
    stream = await client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": "Say what the weather is today in Warsaw?"}],
        stream=True,
    )
    async for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print("Recevied chunk", chunk.choices[0].delta.content)

asyncio.run(async_main())
Output example:
Recevied chunk like
Recevied chunk to
Recevied chunk check
Recevied chunk quickly
Recevied chunk yourself
Recevied chunk :
Recevied chunk https
Recevied chunk ://
Recevied chunk wt
Recevied chunk tr
Recevied chunk .in
Recevied chunk /K
Recevied chunk y
Recevied chunk iv
BUT When trying openrouter
import openai

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="my-key",
)
response = client.chat.completions.create(
    model="openai/gpt-5",  # use an available model
    messages=[{"role": "user", "content": "What the weather is like in kyiv"}],
    stream=True,
)
for chunk in response:
    # Print each streamed delta as soon as it arrives
    if chunk.choices and chunk.choices[0].delta.content:
        print('Got chunk: ', chunk.choices[0].delta.content, flush=True)
Output:
Got chunk: Do you want the current weather in Kyiv right now, or a general idea of what it’s like?
- If you mean right now: I don’t have live data access. Tell me the time and your preferred units (°C/°F), and I’ll estimate or guide you to a quick check.
- Typical for this time of year (mid–late September) in Kyiv: mild early autumn. Daytime highs around 18–20°C (64–68°F), nights 9–12°C (48–54°F), occasional light rain, and a mix of sun and clouds. Light jacket or sweater weather.
So only one big chunk is received
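One way to tell whether the single big chunk comes from the server or from client-side buffering is to look at the raw event stream rather than the SDK objects. Below is a minimal sketch that pulls the text deltas out of server-sent-event lines in the usual Chat Completions streaming shape. The `SAMPLE` lines are hand-written for illustration, not captured from OpenRouter, and `extract_deltas` is a hypothetical helper.

```python
import json
from typing import Iterable, List


def extract_deltas(sse_lines: Iterable[str]) -> List[str]:
    """Collect non-empty content deltas from raw SSE lines."""
    deltas: List[str] = []
    for raw in sse_lines:
        line = raw.strip()
        # Blank lines separate events; lines starting with ':' are SSE comments
        if not line or line.startswith(":"):
            continue
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        for choice in event.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                deltas.append(text)
    return deltas


# Hand-written sample, only to show the expected shape of the stream
SAMPLE = [
    ": keep-alive comment",
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]

if __name__ == "__main__":
    parts = extract_deltas(SAMPLE)
    print(len(parts), "deltas:", parts)
```

If the raw stream already contains one delta holding the whole answer, the batching is happening upstream and no client change will fix it.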
Only started using GPT5 because of the discount and blown away by how efficient its token use is. Hopefully Claudes next model has the same efficiency.
It's the NYT lawsuit. They don't use API data for training but are legally forced to store the prompts and outputs in plain text forever so that the NYT can read them to "find copyright infringements"
The whole US copyright system is broken. OpenAI appealed many times but it is what it is. Use Azure OpenAI, that one is ZDR if you have a valid use case
gpt5-mini gonna destroy all of those in terms of price/performance lol
gpt-5.1?
lmao. I was thinking about 1 thing and wrote another
I blame gpt4.1 for it
fixed
gpt5-mini has insane performance for the price
even better than o4-mini-high
It's just less popular because of the name. They were smart with o4-mini naming. It didn't look "lesser model" than o3 purely because of the name
gpt5-mini-high is still gonna be better than gpt5-low in like all cases and better than gpt5-medium in some cases though.
cool. i think that makes sense.
#announcements message
this is promising, finally running mini at a reasonable speed
Introducing Among AIs, a social reasoning benchmark where embodied models play Among Us to test social intelligence: deception, persuasion, and coordination.
We put 6 SOTA models in a live arena and GPT-5 came out on top by leading in Impostor & Crewmate wins. Why did GPT-5 get the highest scores? Why Among AIs?
Let’s break it down 👇
GPT-5 had the lowest number of wrongful ejections as crew too, even as an overall master of deception.
GPT-5 is a master at rolemaxxing, playing according to assigned role
How is the OSS GPT model better than 2.5 pro and Qwen A22B 💀
gpt 5 discount gone 🙁
why? please ! gemini is ok !
Which country is the server in
We did some testing at NonBioS around using GPT-5 as replacement for Sonnet 4.0. Our analysis shows that Sonnet 4.0 is still better, even after the degradation. We used GPT-5 medium thinking as that had an acceptable latency for our tasks. But despite the latency we think it still lags behind Sonnet 4.0 in raw intelligence specifically around long context agentic tasks.
There are some reports that GPT-5 high thinking might be better than Sonnet 4.0, but given the borderline unacceptably high latency of even 'medium thinking' version we did not even consider it as a replacement. We are planning to run those tests though with GPT-5 high and will report here.
Scott Aaronson has, for the first time, put out a paper in which a key technical step in the proof of the main result came from AI. He describes his process using GPT5-Thinking.
'There's not the slightest doubt that, if a student had given it to me, I would've called it clever'
Quoting Sebastien Bubeck (@SebastienBubeck)
Yet more evidence that a pretty major shift is happening, this time by Scott Aaronson
scottaaronson.blog/?p=9183&fbclid=Iwb21leANGXI1leHRuA2FlbQIxMQABHhhc-brYxsroE9QSWVS64u9EhPyULyGu340pc2Kz390HXvqRiIqT5z97eovH_aem_u5H_DiLrxt3onieDqzm7cw
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
The project Euler thing includes computer science questions along with math
A website dedicated to the fascinating world of mathematics and programming
Hm one interesting thing I didn’t fully process about the spiral-bench results is how much lower gpt-5-chat (chatgpt 5) is than gpt-5. Interesting (though note that sonnet 4 was much lower, close to the bottom)
Actually, just realised that since the judge model is gpt-5 itself, not surprising it liked its responses the most. Take with a grain of salt.
this is not the first benchmark that shows that gpt-5 tends to grade itself higher
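A common way to blunt that self-preference is to use several judges and drop any score where the judge and the candidate come from the same model family. A minimal sketch follows; the model names and scores are invented purely for illustration, and this is not how Spiral-Bench or any specific benchmark is implemented.

```python
from statistics import mean
from typing import Dict, Optional


def score_without_self_judging(scores: Dict[str, Dict[str, float]],
                               family: Dict[str, str]) -> Dict[str, Optional[float]]:
    """Average scores[judge][candidate], skipping same-family judge/candidate pairs."""
    candidates = {c for per_judge in scores.values() for c in per_judge}
    result: Dict[str, Optional[float]] = {}
    for cand in sorted(candidates):
        kept = [
            per_judge[cand]
            for judge, per_judge in scores.items()
            # A model with no family entry is treated as its own family
            if cand in per_judge and family.get(judge, judge) != family.get(cand, cand)
        ]
        result[cand] = mean(kept) if kept else None
    return result


# Made-up numbers: judge-a over-rates model-a, which shares its family
SCORES = {
    "judge-a": {"model-a": 9, "model-b": 6},
    "judge-b": {"model-a": 7, "model-b": 8},
    "judge-c": {"model-a": 6, "model-b": 5},
}
FAMILY = {"judge-a": "A", "model-a": "A", "judge-b": "B", "model-b": "B", "judge-c": "C"}

if __name__ == "__main__":
    print(score_without_self_judging(SCORES, FAMILY))
```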
Wait a minute wouldn’t it be good to also have results for them with their default system prompt? Like, sonnet and gpt-5-chat might be made to use system prompts in the normal interface that aren’t being used here. Having the same context for all the models does make them more comparable though (arguably ruining the comparison) so it’s not an easy thing, but more data would be nice to have
The breakdown of SWE-Bench Pro failures is interesting: GPT-5 doesn't submit to 63.1% of tasks, due to tool use errors?
This means GPT-5 has a *much* higher precision than Claude Opus 4.1. Still not sure what the tool use errors are about though. 🤔
Quoting Bing Liu (@vbingliu)
🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks.
This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.
First results are out for the @RepoPrompt benchmark!
Repo Bench is a test set designed to push models on instruction following, large context reasoning, and precision file editing.
Gearing up to release this shortly in the next update so you can run the bench yourself
Sounds like a niche thing you are doing there that probably won't apply to most people with your custom function calling. For most of your tasks there should be not that much latency difference high vs medium though - it's still reasonably concise for things not requiring a ton of thinking
You could experiment with it say trying reasoning_effort high and verbosity low
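A sketch of what that combination might look like as a Responses API request body. The field layout (`reasoning.effort` and `text.verbosity`) is my understanding of OpenAI's GPT-5 parameters and should be checked against the current API reference; `build_responses_request` is a hypothetical helper that only builds the body and does not send it.

```python
import json

# Value sets assumed from OpenAI's GPT-5 docs; verify before relying on them.
EFFORTS = ("minimal", "low", "medium", "high")
VERBOSITIES = ("low", "medium", "high")


def build_responses_request(prompt: str, effort: str = "high",
                            verbosity: str = "low", model: str = "gpt-5") -> dict:
    """Build a Responses API body: think hard, answer briefly."""
    if effort not in EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    if verbosity not in VERBOSITIES:
        raise ValueError(f"unknown verbosity: {verbosity!r}")
    return {
        "model": model,
        "input": prompt,
        "reasoning": {"effort": effort},
        "text": {"verbosity": verbosity},
    }


if __name__ == "__main__":
    print(json.dumps(build_responses_request("Summarise this diff."), indent=2))
```

Timing the same prompts at medium and high with verbosity held low would settle how much of the latency gap is reasoning and how much is output length.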
We do zero custom function calls. Pretty sure that the latency difference high vs medium is substantial - we tested it out and this is consistent with what others are reporting too.
In case you didn't see, he updated Spiral-Bench so it uses multiple judges now.
And it is kind of odd, but that's the price of lobotomizing and user-maxxing I guess =P I thought 4o gargled balls for that reason too. I actually don't get how that even works right now. Free users get routed between mini, chat, and 5-low? Then paid you get 5-med, and 5-pro is API only? What a nightmare, they need to stop doing this
ooo ty for telling me, i didnt know. it looks much better now
I'm trying to raise the point here that Deep Research models miss half the response and the pricing doesn't seem OK:
https://discord.com/channels/1091220969173028894/1429847358803279872
Please have a look
I mean for pricing I think OR might be losing money on them
GPT-5 is looking much faster in RooCode
