GPT 5 | OpenRouter | Page 3

hard beacon Aug 8, 2025, 4:18 PM

#

This one invoked reasoning

ionic merlin Aug 8, 2025, 4:19 PM

#

aaah

knotty cobalt Aug 8, 2025, 4:20 PM

#

Boy, people on the chatgpt subreddit sure aren't happy. I wonder how many of them are using the gpt5-chat model in the UI and don't realize it (instead of the full gpt5)

hard beacon Aug 8, 2025, 4:24 PM

#

knotty cobalt Boy, people on the chatgpt subreddit sure aren't happy. I wonder how many of the...

the router thingy is a downgrade

#

not to mention all the usage that Plus users had, which went down the drain in favor of gpt 5

#

no o3, o4-mini, o4-mini high

#

i have to say this sounds human af

compact marsh Aug 8, 2025, 4:51 PM

#

knotty cobalt Boy, people on the chatgpt subreddit sure aren't happy. I wonder how many of the...

If you're on the free version of ChatGPT, it picks for you. Which makes it a massive downgrade from one day ago where you could be using 4o.

soft reef Aug 8, 2025, 5:05 PM

#

Using gpt5

Screenshot_2025-08-08-20-03-47-785_com.termux.jpg

#

im gonna size this up with my latest side project

fair obsidian Aug 8, 2025, 5:52 PM

#

Any thoughts so far on temp/penalty/etc. settings for GPT-5?

gleaming quartz Aug 8, 2025, 5:54 PM

#

well that's way more than 50 words. the last line is funny, but they probably cut it off because it just repeats that sentence forever

chilly rapids Aug 8, 2025, 5:59 PM

#

gleaming quartz well that's way more than 50 words. the last line is funny, but they probably cut...

Also likely cherry picked from multiple responses.

knotty cobalt Aug 8, 2025, 6:08 PM

#

Heh

soft reef Aug 8, 2025, 6:08 PM

#

im using gpt5 with my custom agent and for some reason it seems to be thinking a lot

knotty cobalt Aug 8, 2025, 6:09 PM

#

Sonnet and Gemini saw through the trap

soft reef Aug 8, 2025, 6:09 PM

#

when it thinks its actually really good, slightly better than the level of o3 but way cheaper

gleaming quartz Aug 8, 2025, 6:12 PM

#

fair obsidian Any thoughts so far on temp/penalty/etc. settings for GPT-5?

yeah, i think they're gone.

#

OR doesn't list them and you can't set them in the oai playground

#

https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_and_tools this is interesting though

GPT-5 New Params and Tools | OpenAI Cookbook

We’re introducing new developer controls in the GPT-5 series that give you greater control over model responses—from shaping output lengt...

#

i was also going to criticise this by saying it sounds like a child wrote it, but after thinking about it for a bit, that would have been incredible to read in 2021, and the contrast to how people react to a new large language model now (anger, disappointment, conspiracy) is pretty crazy

knotty cobalt Aug 8, 2025, 6:22 PM

#

knotty cobalt Sonnet and Gemini saw through the trap

Jamba Large also smartly got around it by asking clarifying questions before responding

slow niche Aug 8, 2025, 6:28 PM

#

https://youtu.be/xx_5ADX21-4 rough night, intense bed head levels

YouTube

CNBC Television

OpenAI CEO Sam Altman on GPT-5: We've built an 'integrated single e...

OpenAI CEO Sam Altman joins 'Squawk Box' to discuss the rollout of the company's GPT-5 model, new ChatGPT features, company growth outlook, what the future of AI looks like, the race for AI talent, his thoughts on tokenization, path to profitability, relationship to Microsoft, and more.

stone tide Aug 8, 2025, 7:04 PM

#

potent oak Aug 8, 2025, 7:18 PM

#

ok

wet estuary Aug 8, 2025, 8:03 PM

#

slow niche https://youtu.be/xx_5ADX21-4 rough night, intense bed head levels

jesus those bags under his eyes

#

I feel bad for him lol

wraith oak Aug 8, 2025, 8:13 PM

#

so for openrouter, gpt-5 is byok. does this mean that if we have a new openai acc, we'll have slow rate limits?

#

i know anthropic greatly limits new accounts and you need to build up usage to increase the rate limits..

#

also how does billing work for byok

stone tide Aug 8, 2025, 8:16 PM

#

Yeah, the rate limits are low for new OpenAI accounts

#

For BYOK, full billing for the token usage happens on OpenAI's side, so your credits are used from there

#

Plus 5% OpenRouter BYOK fee
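A rough worked example of the BYOK math described above. The per-million-token prices are made-up placeholders, and applying the 5% to the upstream token cost is an assumption about how the fee is computed, so check the current OpenRouter and OpenAI pricing pages before relying on it.

```python
def byok_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m, fee_rate=0.05):
    """Estimate BYOK spend: upstream cost (billed by OpenAI) plus the OpenRouter fee.

    Assumption: the fee is fee_rate times the upstream token cost.
    """
    upstream = (input_tokens / 1_000_000) * in_price_per_m \
        + (output_tokens / 1_000_000) * out_price_per_m
    fee = upstream * fee_rate
    return upstream, fee, upstream + fee


# Hypothetical prices: $1.00 per 1M input tokens, $10.00 per 1M output tokens.
upstream, fee, total = byok_cost(1_000_000, 100_000, 1.00, 10.00)
print(f"OpenAI side: ${upstream:.2f}, OpenRouter fee: ${fee:.2f}, total: ${total:.2f}")
```

Under that assumption, going to OpenAI directly saves exactly the fee portion and nothing else.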

ionic merlin Aug 8, 2025, 8:28 PM

#

wraith oak also how does billing work for byok

You can save money if you skip the extra step through OpenRouter and use the OpenAI API directly

wraith oak Aug 8, 2025, 8:38 PM

#

ahhh okay. so it's better to just use openai directly

ionic merlin Aug 8, 2025, 8:38 PM

#

wraith oak ahhh okay. so it's better to just use openai directly

Right

wraith oak Aug 8, 2025, 8:39 PM

#

any ideas if byok will be enforced by openai forever? im ootl of this

ionic merlin Aug 8, 2025, 8:39 PM

#

We're all annoyed by it. No idea if the rules will ever be abolished.

wraith oak Aug 8, 2025, 8:40 PM

#

yeah its pretty annoying

#

(not openrouter's fault)

ionic merlin Aug 8, 2025, 8:40 PM

#

Yeah, all 3rd party services have the same problem

knotty cobalt Aug 8, 2025, 8:52 PM

#

I subscribe to Poe, and they recently added API access to the models they serve, and GPT-5 works via API for me, but the (veiled) thinking comes back in a weird way

tacit burrow Aug 8, 2025, 8:55 PM

#

ionic merlin We're all annoyed by it. No idea if the rules will ever be abolished.

if I wanted to use gpt-5 now I’d be stuck at like 30,000 TPM which is nothing 💀

knotty cobalt Aug 8, 2025, 8:56 PM

#

knotty cobalt I subscribe to Poe, and they recently added API access to the models they serve,...

And just to clarify, this did not require BYOK

soft reef Aug 8, 2025, 9:08 PM

#

Idk why i think ive been getting lucky with the gpt 5 router

#

Bc its been thinking a lot as opposed to using the regular nonthinking model

sly pike Aug 8, 2025, 9:09 PM

#

soft reef Idk why i think ive been getting lucky with the gpt 5 router

i hate this

#

because i decided to try again through the web app and if I re roll enough I can see it shine

soft reef Aug 8, 2025, 9:10 PM

#

Yea, when it thinks it's SOTA, when it doesn't it's trash

sly pike Aug 8, 2025, 9:10 PM

#

but I shouldn't be refreshing my answers 6 times before it thinks

soft reef Aug 8, 2025, 9:11 PM

#

I wish they just gave us separate thinking and not thinking models in api, even if they made it more expensive i wouldnt care much ~~totally not because i use fixed pricing with github copilot~~

sly pike Aug 8, 2025, 9:11 PM

#

like, the thinking model you get routed to 1/10th of the time (I get higher chances of a good roll when I make a new chat) is cracked

sly pike Aug 8, 2025, 9:11 PM

#

soft reef I wish they just gave us separate thinking and not thinking models in api, even ...

lol

soft reef Aug 8, 2025, 9:12 PM

#

sly pike like, the thinking model you get routed to 1/10th of the times ( i get higher ch...

gpt 5 mini is even more of a gamble

#

because you either get a >o4-mini level model (absolutely insane and amazing)
or you get a >4o-mini level model (terrible)

sly pike Aug 8, 2025, 9:17 PM

#

soft reef because you either get a >o4-mini level model (absolutely insane and amazing) or...

who thought this was a good idea

#

are they trying to cut costs so bad

soft reef Aug 8, 2025, 9:17 PM

#

Idek

sly pike Aug 8, 2025, 9:18 PM

#

https://tenor.com/view/dodgers-pain-suffering-death-lost-gif-6773836397127068575

soft reef Aug 8, 2025, 9:18 PM

#

https://tenor.com/view/gamblecore-stickman-casino-gamble-gif-7118676210396292522

simple gorge Aug 8, 2025, 9:19 PM

#

reasoning_effort works in API, no?

#

Are you saying it doesn't think a lot even if you set it to high?

sly pike Aug 8, 2025, 9:41 PM

#

simple gorge Are you saying it doesn't think a lot even if you set it to high?

from my perspective at least, the reasoning effort param didn't hit yesterday - i didn't really try today

#

i'm just happy with rerolls on the web ui

#

it sucks but that 10% of the time it works its very good

spark pine Aug 8, 2025, 10:13 PM

#

#

That'll show'em Kapp

simple gorge Aug 8, 2025, 10:16 PM

#

sly pike from my perspective atleast, the reasoning effort param didn't hit yesterday - i...

I have only had time to play today in last few hours, but reasoning_effort seems to be working perfectly, don't have to say things like think hard.
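For anyone wanting to try this, here is a minimal sketch of a request body with the parameter set. The `reasoning_effort` name comes from the messages above; the rest (an OpenAI-style chat payload, the `gpt-5` model string, and the list of accepted values) is an assumption, so check the current API reference. The helper name `build_request` is made up for illustration, and nothing is sent over the network.

```python
import json

# Assumed set of accepted values; verify against the provider's docs.
ALLOWED_EFFORTS = {"minimal", "low", "medium", "high"}


def build_request(prompt, effort="high", model="gpt-5"):
    """Build an OpenAI-style chat request body with reasoning_effort set."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unsupported reasoning_effort: {effort!r}")
    return {
        "model": model,
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }


payload = build_request("Explain the Monty Hall problem.", effort="high")
print(json.dumps(payload, indent=2))
```

Setting the effort in the payload replaces prompt tricks like "think hard", which matches what is reported above.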

sly pike Aug 8, 2025, 10:17 PM

#

simple gorge I have only had time to play today in last few hours, but reasoning_effort seems...

i need to verify what I was doing yesterday again

slow niche Aug 8, 2025, 10:59 PM

#

https://x.com/sama/status/1953893841381273969

Sam Altman (@sama)

GPT-5 rollout updates:

*We are going to double GPT-5 rate limits for ChatGPT Plus users as we finish rollout.

*We will let Plus users choose to continue to use 4o. We will watch usage as we think about how long to offer legacy models for.

*GPT-5 will seem smarter starting

#

steady oar Aug 9, 2025, 12:01 AM

#

Auto routing is just what. For general users it doesn't matter much, but I mean, sometimes people just want the best for like a specific task or something.

celest mountain Aug 9, 2025, 12:51 AM

#

I'm getting this stuff, anybody else experiencing it?

hard beacon Aug 9, 2025, 1:01 AM

#

yknow im having fun with this one

novel vale Aug 9, 2025, 3:44 AM

#

celest mountain I'm getting this stuff, anybody else experiencing it?

I had that, weird ngl

compact marsh Aug 9, 2025, 4:34 AM

#

celest mountain I'm getting this stuff, anybody else experiencing it?

Had this but mine was just blank instead of telling me it’s blank 😂 was using ChatGPT through OAI interface not OpenRouter

soft reef Aug 9, 2025, 7:10 AM

#

ngl i see no point using this over o4-mini imo

#

gpt 5 can be better with thinking but without it just sucks

#

it's practically just a gamble, until they fix it I'll just use o4-mini

limber cargo Aug 9, 2025, 8:55 AM

#

Kinda happy the gooners are not able to make gpt 5 roleplay

outer marsh Aug 9, 2025, 8:56 AM

#

am I the only one getting this error with open router and gpt 5 rn?

m_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': 'Provider returned error', 'code': 400, 'metadata': {'raw': '{\n "error": {\n "message": "Your organization must be verified to stream this model. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.",\n "type": "invalid_request_error",\n "param": "stream",\n "code": "unsupported_value"\n }\n}', 'provider_name': 'OpenAI'}}, '...
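The useful part of that error is buried in `metadata.raw`, which is a JSON string inside the outer error. A small sketch for pulling it out follows. The function name `unwrap_provider_error` is hypothetical, and the sample body is an abbreviated reconstruction of the paste above, not a guaranteed schema.

```python
import json


def unwrap_provider_error(body):
    """Extract the provider's own error details from a wrapped error body."""
    err = body.get("error") or {}
    meta = err.get("metadata") or {}
    inner = {}
    raw = meta.get("raw")
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict):
                inner = parsed.get("error") or {}
        except json.JSONDecodeError:
            inner = {}  # raw was not JSON; fall back to the outer message
    return {
        "provider": meta.get("provider_name"),
        "status": err.get("code"),
        "param": inner.get("param"),
        "code": inner.get("code"),
        "message": inner.get("message", err.get("message")),
    }


# Abbreviated reconstruction of the error shown above (message shortened).
sample = {
    "error": {
        "message": "Provider returned error",
        "code": 400,
        "metadata": {
            "raw": json.dumps({
                "error": {
                    "message": "Your organization must be verified to stream this model.",
                    "type": "invalid_request_error",
                    "param": "stream",
                    "code": "unsupported_value",
                }
            }),
            "provider_name": "OpenAI",
        },
    }
}

info = unwrap_provider_error(sample)
print(info["provider"], info["param"], info["code"])
```

Here the offending parameter is `stream`, so the rejection is about streaming from an unverified organization rather than about the prompt itself.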

ionic merlin Aug 9, 2025, 8:58 AM

#

outer marsh am I the only one getting this error with open router and gpt 5 rn? m_response(...

The GPT 5 API is only working with ID verification at openai.com (official website)

summer sand Aug 9, 2025, 8:58 AM

#

but you are free to use gpt-5-chat I believe

ionic merlin Aug 9, 2025, 8:59 AM

#

Right

outer marsh Aug 9, 2025, 8:59 AM

#

ionic merlin The GPT 5 API is only working with ID verification at openai.com (official websi...

this is through OR

summer sand Aug 9, 2025, 8:59 AM

#

outer marsh this is through OR

yes, and what he said is still true

#

you need to provide your own openai key and id verify

#

with them

ionic merlin Aug 9, 2025, 8:59 AM

#

outer marsh this is through OR

No matter

outer marsh Aug 9, 2025, 8:59 AM

#

i see, thx

summer sand Aug 9, 2025, 8:59 AM

#

you can use it without doing so in the openrouter chat room though

#

why, I don't know

outer marsh Aug 9, 2025, 9:02 AM

#

summer sand you need to provide your own openai key and id verify

so you mean an integration in OR settings ?

summer sand Aug 9, 2025, 9:02 AM

#

outer marsh so you mean an integration in OR settings ?

yes

#

https://openrouter.ai/settings/integrations

outer marsh Aug 9, 2025, 9:03 AM

#

so then we have to pay for it on the open AI side direct

ionic merlin Aug 9, 2025, 9:55 AM

#

OpenAI is talking on X about GPT-5 Pro. Is Pro = high thinking?

summer sand Aug 9, 2025, 10:14 AM

#

ionic merlin Open AI is speaking at X about GPT 5 Pro. Is Pro = high thinking?

probably

latent dove Aug 9, 2025, 11:09 AM

#

does gpt have a cache? in terms of rp it is very useful, saves a significant amount of money

potent oak Aug 9, 2025, 11:11 AM

#

it does

latent dove Aug 9, 2025, 11:13 AM

#

see..

#

by the way this is the third model without special censorship after gpt 4.5 and gpt vision preview, I hope this altman won't get pissed in the head in a couple of days.. since I already charged money to openrouter
talking about the chat version, the other versions of gpt5 are censored to the very depths...

frigid pewter Aug 9, 2025, 11:27 AM

#

I cannot fit my first impressions and benchmark results into 2000 characters (message limit), thus have to post an image instead.
Tested GPT-5 (or posted on https://dubesor.de/first-impressions#gpt-5):

sly pike Aug 9, 2025, 11:34 AM

#

frigid pewter I cannot fit my first impressions and benchmark results into 2000 characters (me...

Thanks for sharing mr dubesor

frigid pewter Aug 9, 2025, 11:40 AM

#

And some numbers, how it slotted performance wise

ionic merlin Aug 9, 2025, 11:49 AM

#

frigid pewter And some numbers, how it slotted performance wise

Which GPT-5 are you using? Low, medium, high? With thinking?

frigid pewter Aug 9, 2025, 11:49 AM

#

always default/medium (o3, o1, grok-4, etc.) unless specified otherwise. the models are forced thinking and I explained the chat in text

ionic merlin Aug 9, 2025, 11:51 AM

#

Okay, I guess with high the results could be slightly better

frigid pewter Aug 9, 2025, 11:52 AM

#

ya same with all other models who can have high (grok-heavy etc). Unfortunately I don't have the time to test every model several times (same with quant testing). this was already way too much workload for me tbh, if you need mass testing, it should be done by someone who has automated their benchmarks. But the vast majority of people use defaults so it's more representative and most interesting imo

summer gyro Aug 9, 2025, 12:38 PM

#

frigid pewter ya same with all other models who can have high (grok-heavy etc). Unfortunately ...

Agreed. Manual human evaluation is a massive time sink, but llm-as-judge can't be trusted for judging the output of frontier models. Automated evaluation can't tell the nuances between outputs. I also just test the defaults to avoid a combinatorial explosion of factors and parameters.

#

Do people like the dark mode by default for web designs given by GPT-5? As a light mode user, I personally hate that this is the default from GPT-5. No other model does this.

frigid pewter Aug 9, 2025, 12:43 PM

#

summer gyro Do people like the dark mode by default for web designs given by GPT-5? As a light...

oh absolutely. Also I noticed a lot of models do darkmode poorly where the text isn't legible/wrong colour. even gpt-5 in my full site CSS redo demo, screwed up the dark mode with the background gradients, while light mode looks fine

frigid pewter Aug 9, 2025, 12:49 PM

#

summer gyro Agreed. Manual human evaluation is a massive time sink, but llm-as-judge can't b...

people don't understand how much time benchmarking anything takes. I get contacted all the time about it. basically it would go like this:

test model (default), maybe on Q4 if local > can you test Q8? > respend all the time > What about Q6? > respend all the time > I heard that GGUF was borked, can you try this specific one? > (repeat more and more until you wasted 10 years on a model). Same applies to non-local ones (with thinking/without thinking, with higher reasoning, with lower reasoning, with default params, with specific params, oh they pushed an update, redo everything, etc.) until you spend another 10 years. You have now spent 500 hours and have nothing truly worthwhile to show for it. But hey, you covered all potential metrics. Until the provider decides to change some stuff in the backend, so you repeat everything once more

summer gyro Aug 9, 2025, 12:53 PM

#

frigid pewter people don't understand how much time benchmarking anything takes. I get contact...

I am much more worried about variance and statistical significance. How many samples should you take to make the test statistically significant instead of just measuring noise and natural variance of the LLM output? If you run multiple tests and samples, should you take the average, the median or mode? How do you combine the results from different tasks, should you normalize the score or just add them up together?
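One way to make those questions concrete is sketched below, using only the standard library. The scores and task names are invented, and "normalize each task by its maximum score, then average" is just one possible choice for combining tasks, not a recommendation from anyone in this thread.

```python
from statistics import mean, median, stdev


def summarize(runs):
    """Per-task summary over repeated runs of the same prompt."""
    return {
        "mean": mean(runs),
        "median": median(runs),
        "stdev": stdev(runs) if len(runs) > 1 else 0.0,  # spread = noise estimate
    }


def combined_score(task_runs, task_max):
    """Normalize each task's mean to 0..1 by its max score, then average tasks."""
    normalized = [mean(runs) / task_max[name] for name, runs in task_runs.items()]
    return mean(normalized)


# Invented example data: three runs per task, different scoring scales.
task_runs = {"logic": [8, 6, 7], "code": [40, 50, 45]}
task_max = {"logic": 10, "code": 50}

for name, runs in task_runs.items():
    print(name, summarize(runs))
print("combined:", combined_score(task_runs, task_max))
```

Without the normalization step, the task with the larger scale (here "code") would dominate a plain sum. A large stdev relative to the gap between two models is a hint that more samples are needed before calling a winner.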

#

In the end, like a wise person said, there is no wrong way to do benchmark. So just do it anyway.

frigid pewter Aug 9, 2025, 12:54 PM

#

summer gyro I am much more worried about variance and statistical significance. How many sam...

I am just doing everything as if I am using it normally. I want to catch real user experience not some ideal condition in a lab. I commented this a while back many times, e.g. #1359657520624308347 message
A normal user is not gonna sit there and refresh the response 8 times until it got perfect.

summer gyro Aug 9, 2025, 12:57 PM

#

frigid pewter I am just doing everything as if I am using it normally. I want to catch real us...

I did that, but sometimes the model gives very strange or wrong output, and you wonder if it's just an edge case or unlucky output, and should you give it another chance, so on and so forth. Then you get into the whole variance and statistics debate. 🥲

frigid pewter Aug 9, 2025, 12:58 PM

#

summer gyro I did that, but sometimes the model gives very strange or wrong output, and you ...

I do 3 runs. this is enough for me to catch an outlier response. if the responses are completely inconsistent I might manually check many times to see what's up with that. Normally doesn't happen though.

summer gyro Aug 9, 2025, 12:59 PM

#

For now I settle on 2-3 tries for each task, to give some error margin for unlucky output and edge cases.

summer gyro Aug 9, 2025, 1:00 PM

#

frigid pewter I do 3 runs. this is enough for me to catch an outlier response. if the response...

Great to hear. I wish I came to that approach sooner.

frigid pewter Aug 9, 2025, 1:02 PM

#

the one time I got the most inconsistent results was with grok-2 and grok-2 mini 1 year ago, where the responses were so unbelievably inconsistent it felt like a model swap constantly between responses. stuff like this happening is extremely rare though.

earnest orbit Aug 9, 2025, 1:18 PM

#

frigid pewter I am just doing everything as if I am using it normally. I want to catch real us...

Based

wise bolt Aug 9, 2025, 2:13 PM

#

frigid pewter I cannot fit my first impressions and benchmark results into 2000 characters (me...

Do you think it's better than Gemini 2.5 Pro or comparable in terms of reasoning and instruction following

frigid pewter Aug 9, 2025, 2:15 PM

#

wise bolt Do you think it's better than Gemini 2.5 Pro or comparable in terms of reasoning...

I published everything, including all numbers, all demo stuff, etc. just check and decide. It depends on the use case. they are very evenly matched overall (edit: in my testing)

stone tide Aug 9, 2025, 4:08 PM

#

Ew

ionic merlin Aug 9, 2025, 4:09 PM

#

stone tide Ew

send it as an error message to openai

frigid pewter Aug 9, 2025, 4:13 PM

#

yea, that issue has been present in all recent openai thinkers (o1, o3). could try to pass `Formatting re-enabled` for markdown/wrapping, though it didn't work reliably for me, have to just live with it not wrapping code I guess 🤷‍♂️

gleaming quartz Aug 9, 2025, 4:17 PM

#

stone tide Ew

it's not in a code block? i dunno why they're averse to using markdown natively. you might have to specify it

#

https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_and_tools i linked this before but theres some pretty wild output setting stuff going on

frigid pewter Aug 9, 2025, 4:21 PM

#

Actually had messages about it a while ago because it also makes ASCII look so bad that it looks like I deliberately sabotaged it to make it look poor in comparison, lol. naw, literally copypasted raw output

stone tide Aug 9, 2025, 4:25 PM

#

Fair, this thing is definitely way too markdown averse

spark pine Aug 9, 2025, 5:25 PM

#

I'm surprised the reasoning score is so terrible

#

Matches up with simple-bench too

frigid pewter Aug 9, 2025, 5:38 PM

#

spark pine I'm surprised the reasoning score is so terrible

part of it is because it doesn't utilize what it found out during thought chain. I can literally see it found the correct angle in the thought.summary but then completely dismisses everything and just goes for a boilerplate (false) reply. Or it has been overtrained in puzzles and assumes stuff is "red herring" even though those details are literally the most important aspect (and not a puzzle/herring).

spark pine Aug 9, 2025, 5:38 PM

#

They tried to benchmaxx simple-bench and it cost them =P

#

He did say in the latest video that they contacted him about the OSS score, so they pay attention to it

frigid pewter Aug 9, 2025, 5:44 PM

#

random example (its about opening a submerged submarine hatch to refill some water):

spark pine Aug 9, 2025, 5:47 PM

#

Not sure what you mean on that one

#

I am bad at physics, but is it not correct that the rest is red herrings?

frigid pewter Aug 9, 2025, 5:49 PM

#

opus answered it, literally

spark pine Aug 9, 2025, 5:50 PM

#

I've always had mixed feelings on trick questions. It's like the rooster egg gotcha thing where I go no, you said it laid the egg so it did

#

I guess you did say "to" refill the jug by half and not "and" refill the jug by half

frigid pewter Aug 9, 2025, 5:51 PM

#

it tests the most common basic sense, akin to the very old bottle question, or taking a famous question but adding an unexpected/unusual aspect (to be able to spot overfit)

spark pine Aug 9, 2025, 5:52 PM

#

It also may be giving up on physics entirely since it acknowledges that you couldn't open it in the first place

frigid pewter Aug 9, 2025, 5:52 PM

#

spark pine It also may be giving up on physics entirely since it acknowledges that you coul...

thats fine too

spark pine Aug 9, 2025, 5:53 PM

#

But it should at least give both answers imo, the puzzle answer and the realistic answer

#

That's what I'd personally prefer from an LLM at least

frigid pewter Aug 9, 2025, 5:54 PM

#

I think you misunderstood my testing. I don't care for "puzzle answer". thats why I don't use automatic A, B, C, D, testing. a huge variety of answers can pass. I even had answers pass that I didn't think of at all before.

spark pine Aug 9, 2025, 5:55 PM

#

Yeah, I mean I didn't see the full prompt or anything

frigid pewter Aug 9, 2025, 5:55 PM

#

not needed since it was a random example that showcases precise difference in output

spark pine Aug 9, 2025, 5:56 PM

#

Ohhhhhhhh

untold plaza Aug 9, 2025, 5:56 PM

#

gpt 5 is genuinely unimpressive 😭

#

they just made it talk more

spark pine Aug 9, 2025, 5:56 PM

#

The top one is Opus?

#

I thought that was showing reasoning and answer.

frigid pewter Aug 9, 2025, 5:57 PM

#

spark pine The top one is Opus?

are you using discord on a 100x100 px phone or smth? lol

spark pine Aug 9, 2025, 5:57 PM

#

No it was because of this part:

> I can literally see it found the correct angle in the thought.summary but then completely dismisses everything and just goes for a boilerplate (false) reply.

#

So I assumed you were showing the discrepancy.

frigid pewter Aug 9, 2025, 5:58 PM

#

spark pine No it was because of this part: > I can literally see it found the correct angl...

2 different messages talking about the same concept.......

spark pine Aug 9, 2025, 5:59 PM

#

Just the impression I got going in. I do like free-form answers to these, makes sense over ABCD

frigid pewter Aug 9, 2025, 5:59 PM

#

spark pine Just the impression I got going in. I do like free-form answers to these, makes ...

100% but it requires exponentially more work and is hardcapped at scale

spark pine Aug 9, 2025, 5:59 PM

#

I quite like looking at the reasoning traces as I work on my benchmark. I feel like seeing why the judge model judged incorrectly tells me more than the testee getting it right or wrong lol

#

Yeah, that's the annoying duality

frigid pewter Aug 9, 2025, 6:01 PM

#

spark pine I quite like looking at the reasoning traces as I work on my benchmark. I feel l...

problem with that is that it's not realistic. if someone asks a legal question they take the final output, and don't go scouring through some reasoning traces for alternate answer hints.

spark pine Aug 9, 2025, 6:01 PM

#

Oh I'm not benchmarking it on that

#

I just like seeing it

#

Because when it judges wrong I have to think wait, did I just write the instructions ambiguously? But sometimes the model is just stupid

#

Sometimes multiple models are even bad at judging the same question which is neat (and exceptionally annoying) to see

frigid pewter Aug 9, 2025, 6:03 PM

#

spark pine Because when it judges wrong I have to think wait, did I just write the instruct...

not a problem imo if inconsistent. ambiguous instructions are part of the skillset of a model, humans make flawed queries all the time and if its posted as is to all models, its still valuable, if not even more so.

#

intent-recognition is a huge skill

spark pine Aug 9, 2025, 6:03 PM

#

Yeah. If I was benching it on the judging, I would leave it vague. I'm just doing my best to automate the judging, so if I can get something reasonably cheap to judge exactly how I would, I'm happy. But I can't so far.

brisk cairn Aug 9, 2025, 6:06 PM

#

its interesting how Anthropic have the only reasoning model that doesn't tokenspam yet reasoning still helps it sometimes.

autumn kindle Aug 9, 2025, 6:07 PM

#

brisk cairn its interesting how Anthropic have the only reasoning model that doesn't tokensp...

> the only reasoning model that doesn't tokenspam
the O models from oai also don't token spam

brisk cairn Aug 9, 2025, 6:07 PM

#

well we don't really know.

spark pine Aug 9, 2025, 6:07 PM

#

I think I've had 4 reasonably priced thinking models now answer yes to "Is the primary purpose of (a spoon) mixing or serving". Which no, obviously 99% of the time the usage of a spoon is to eat from. The top models get this correct consistently.

autumn kindle Aug 9, 2025, 6:07 PM

#

I saw a graph from dubesor (I forget) that showed the average reasoning length for models

#

probably not dubesor, but it was definitely a graph

brisk cairn Aug 9, 2025, 6:08 PM

#

yeah, but we dont know if openai wants us to know real reasoning lengths

#

gpt-oss spams a bit, whos to say o doesnt?

autumn kindle Aug 9, 2025, 6:08 PM

#

I don't see a reason for oai to not return the real amount of tokens used for thinking

#

but I get your point

brisk cairn Aug 9, 2025, 6:09 PM

#

i just really distrust OAI, or anyone who doesnt return raw reasoning output

frigid pewter Aug 9, 2025, 6:09 PM

#

brisk cairn yeah, but we dont know if openai wants us to know real reasoning lengths

you can literally see the lengths by looking at API usage stats though (charged tokens minus visible tokens = think tokens)
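That subtraction as a tiny sketch. The function name is made up, the numbers are invented, and how you count the visible tokens (for example, tokenizing the returned text yourself) is left as an assumption.

```python
def thinking_tokens(charged_output_tokens, visible_output_tokens):
    """Estimate hidden reasoning tokens as charged output minus visible output."""
    if visible_output_tokens > charged_output_tokens:
        raise ValueError("visible tokens cannot exceed charged tokens")
    return charged_output_tokens - visible_output_tokens


# Invented numbers: billed for 1500 output tokens, only 300 appear in the reply.
print(thinking_tokens(1500, 300))
```

This only tells you what was billed as reasoning, which is the same number being questioned in the replies that follow.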

brisk cairn Aug 9, 2025, 6:09 PM

#

frigid pewter you can literally see the lengths by looking at API usage stats though (charged...

yes, im saying who's to say thats a real amount

#

what im getting at, visible reasoning or GTFO

#

and i wish all AI labs a very pleasant RL session

spark pine Aug 9, 2025, 6:31 PM

#

Also annoying is that this model again scores exceptionally low on EQ's assertiveness rating. It seemingly bleeds into the other models training on its outputs too

#

Horizon-alpha scored highly on it, and then they presumably nuked it in the next checkpoint

#

I want my models high on assertiveness but low on moralizing, which 2.5 Pro hits the best, but Alpha was great at it too.

stone tide Aug 9, 2025, 6:35 PM

#

Can't you prompt your way around this?

frigid pewter Aug 9, 2025, 6:36 PM

#

I don't know how to bench style nor how I would even get started, but wasn't GPT-4.5 renowned for its style and world/word knowledge? it scores below 4.1 mini on eq bench

spark pine Aug 9, 2025, 6:39 PM

#

stone tide Can't you prompt your way around this?

Probably, but I'm generally of the belief that the core personality should be good. This is API, so it's not like the web-ui's system prompt is making it sycophantic.

#

For short-form creativity, yeah, it beats 4.1 on EQ

azure minnow Aug 9, 2025, 6:44 PM

#

is gpt-5 really that bad ? I mean gpt-5 with reasoning is barely as good as sonnet 4 without thinking. it fails to finish. rate limits are easy to hit. anyone else experiencing this ?

#

Is there any way to access horizon beta back ? was way way way better.

summer gyro Aug 9, 2025, 7:35 PM

#

brisk cairn what im getting at, `visible reasoning or GTFO`

Interestingly Cursor seems to show the reasoning for GPT-5, and it seems to be the raw text.

#

I don't use ChatGPT so I don't know about the reasoning text on that. Been on Claude app exclusively since last year.

brisk cairn Aug 9, 2025, 7:36 PM

#

summer gyro Interestingly Cursor seems to show the reasoning for GPT-5, and it seems to be t...

weird, i get summarized reasoning even on platform dot openai dot com

#

oh my fucking god i just realized ive been using gpt-5-nano

#

no wonder it finished reasoning in 1m instead of my usual 5m

frigid pewter Aug 9, 2025, 7:39 PM

#

brisk cairn oh my fucking god i just realized ive been using gpt-5-nano

still shouldn't have raw thoughts. they are summaries. you can see this also by the **Calculating [...] **Determining [...]

brisk cairn Aug 9, 2025, 7:40 PM

#

yeah no i was just wondering why the responses were suddenly so ass

#

compared to yesterday

frigid pewter Aug 9, 2025, 7:40 PM

#

would have been hilarious if you didn't get suspicious (news: user doesn't notice difference between nano and gpt-5)

summer gyro Aug 9, 2025, 7:44 PM

#

is this summary or raw? i thought it is pretty verbose for a summary...

brisk cairn Aug 9, 2025, 7:49 PM

#

this is summary with verbosity i guess

brisk cairn Aug 9, 2025, 7:51 PM

#

frigid pewter would have been hilarious if you didn't get suspicious (news: user doesn't notic...

i honestly thought "oh man, they made the router more aggressive towards gpt-5-shitty"

rustic scaffold Aug 9, 2025, 8:04 PM

#

OpenAI doesn't expose raw reasoning tokens, you only get summaries. They will however store the raw reasoning tokens on the server side when using the Responses API for tool calls.

steady oar Aug 9, 2025, 8:08 PM

#

Too bad that they decided to go with the benchmark & cheaper (summit) model instead of the one better at writing (zenith).

heady herald Aug 9, 2025, 8:14 PM

#

Will they add such settings for us? Because it feels like GPT-5 keeps giving random results. Sometimes it does “design” at a professional level, and other times it’s like the nano version is replying to me _. Could it be because of this setting?

hard beacon Aug 10, 2025, 2:18 AM

#

GPT5 composed this

autumn kindle Aug 10, 2025, 2:19 AM

#

hard beacon GPT5 composed this

Banger soundtrack for a pixel art adventure game

#

Its actually really good

I would've given it a "10/10, keep on making music!" If it was a human that made it

hard beacon Aug 10, 2025, 2:29 AM

#

here's another one but it's kinda shit

#

this one's interesting but it seems to have shat itself for the second half

gleaming quartz Aug 10, 2025, 2:45 AM

#

hard beacon GPT5 composed this

ooh, i've wanted to try this. can you share what your workflow is? i'm out of the loop on audio engineering post the year 2005

hard beacon Aug 10, 2025, 2:46 AM

#

gleaming quartz ooh, i've wanted to try this. can you share what your workflow is? i'm out of th...

behold.

By writing down notes, their octave, and the duration and intensity they play with, I want you to compose a sophisticated, energetic 2 minute video game soundtrack. Do everything in your power to avoid repetitiveness, be bold with the leitmotifs, mix them up and make it dramatic. Then, you will convert what you have written down into a .mp3 audio file using python. I suggest that you initially write the notes directly in the code interpreter so as to not have to write them down again, wasting time.

prompt.
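For anyone curious what the Python side of that prompt might boil down to, here is a minimal sketch using only the standard library. It is not what GPT-5 actually generated: the note format, the function names, the fade-out envelope, and the `theme.wav` filename are all assumptions, and it writes WAV rather than MP3 because MP3 needs an external encoder.

```python
import math
import struct
import wave

SAMPLE_RATE = 44100
SEMITONES = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
             "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}


def note_frequency(name: str, octave: int) -> float:
    """Equal-tempered frequency in Hz, tuned so that A4 = 440 Hz."""
    midi = 12 * (octave + 1) + SEMITONES[name]
    return 440.0 * 2 ** ((midi - 69) / 12)


def render(notes, sample_rate=SAMPLE_RATE):
    """Turn (name, octave, seconds, intensity 0..1) tuples into 16-bit samples."""
    samples = []
    for name, octave, seconds, intensity in notes:
        freq = note_frequency(name, octave)
        count = int(seconds * sample_rate)
        for i in range(count):
            # Linear fade-out per note so consecutive notes do not click.
            envelope = 1.0 - i / count
            value = intensity * envelope * math.sin(2 * math.pi * freq * i / sample_rate)
            samples.append(int(value * 32767))
    return samples


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono 16-bit PCM samples to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


# A four-note placeholder phrase: (note, octave, seconds, intensity).
melody = [("E", 4, 0.25, 0.8), ("G", 4, 0.25, 0.8), ("B", 4, 0.5, 1.0), ("A", 4, 0.5, 0.6)]

if __name__ == "__main__":
    write_wav("theme.wav", render(melody))
```

Writing the notes as data first, like the prompt suggests, means the model only has to get the synthesis loop right once.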

gleaming quartz Aug 10, 2025, 2:57 AM

#

browser js has a whole audio production system that i didn't know about until opus created a "technological pulsing background humming" effect as an element in a threejs visualization, that was barely audible but legit gave me a headache

#

i've been meaning to try LLM + https://strudel.cc/

Strudel REPL

Strudel is a music live coding environment for the browser, porting the TidalCycles pattern language to JavaScript.

autumn kindle Aug 10, 2025, 2:58 AM

#

hard beacon this one's interesting but it seems to have shat itself for the second half

Pixel art mystery game

slow niche Aug 10, 2025, 4:11 AM

#

noticed OAI have posted the usage limits now for GPT-5 via chatgpt
https://help.openai.com/en/articles/11909943-gpt-5-in-chatgpt#h_4699b07591

ChatGPT Free tier accounts can send up to 10 messages every 5 hours. After reaching this limit, chats will automatically use the mini version of the model until your limit resets. Free tier users also have access to one GPT-5 Thinking message per day.

ChatGPT Plus users can send up to 160 messages with GPT-5 every 3 hours. After reaching this limit, chats will switch to the mini version of the model until the limit resets. This is a temporary increase and will revert to the previous limit in the near future.

If you’re on **Plus or Team**, you can also manually select the GPT-5-Thinking model from the model picker with a usage limit of up to 200 messages per week. Once you reach the weekly limit, you’ll see a pop-up notification, and GPT-5-Thinking will no longer be selectable from the menu.

Automatic switching from GPT-5 to GPT-5-Thinking does not count toward this weekly limit, and GPT-5 can still switch to GPT-5-Thinking after you’ve reached it.

The ChatGPT Team and Pro plan offers unlimited access to our GPT-5 models, subject to abuse guardrails. Usage must adhere to our Terms of Use, which prohibits, among other things:

knotty cobalt Aug 10, 2025, 4:16 AM

#

hard beacon behold. ```By writing down notes, their octave, and the duration and intensity ...

Very cool idea for a prompt. I tested GLM 4.5 with this, but went a step further and had it generate a web app that would play the music directly (no need for python/MP3) https://chat.z.ai/space/n0grm6t1paf0-art

Z.AI

Z.AI Share

Great content shared from Z.AI

#

Prompt was modified from yours:

By writing down notes, their octave, and the duration and intensity they play with, I want you to compose a sophisticated, energetic 2 minute video game soundtrack. Do everything in your power to avoid repetitiveness, be bold with the leitmotifs, mix them up and make it dramatic. Then, you will convert what you have written down into a web app that will allow a user to click a button to play the synth track. Everything must be in one HTML file.

knotty cobalt Aug 10, 2025, 5:38 AM

#

I tried this with GPT5, Sonnet, and Gemini 2.5 pro, and GLM did better than all three by a long shot. Love how it added a 'narration' of the music as it played (text at bottom). Only flaw was that it only produced 1 minute of music instead of the 2 requested, but GPT/Claude/Gemini all had larger problems (bugs, failing at zero shot)

proven hull Aug 10, 2025, 8:03 AM

#

Do we know why GPT-5 requires BYOK on OpenRouter?

untold plaza Aug 10, 2025, 8:16 AM

#

proven hull Do we know why GPT-5 requires BYOK on OpenRouter?

they require that everywhere lmao

#

you can only BYOK

hard beacon Aug 10, 2025, 8:23 AM

#

Found this on Reddit (people seem to be having mixed results)

#

Found it funny

proven hull Aug 10, 2025, 8:36 AM

#

untold plaza they require that everywhere lmao

I mean GPT-5 Chat doesn't seem to require it

#

Or at least it's not stated in the info blurb on the model page

untold plaza Aug 10, 2025, 8:36 AM

#

yes it's for gpt 5 only

#

chat works fine

limber cargo Aug 10, 2025, 10:29 AM

#

so can someone clear this confusion of gpt 5 router?

slow niche Aug 10, 2025, 10:32 AM

#

limber cargo so can someone clear this confusion of gpt 5 router?

pretty sure that's only on chatgpt.com

#

https://help.openai.com/en/articles/11909943-gpt-5-in-chatgpt#:~:text=answer%2C%20every%20time.-,How%20it%20works,-When%20you%20select

limber cargo Aug 10, 2025, 10:33 AM

#

slow niche pretty sure that's only on chatgpt.com

so the gpt 5 api always uses thinking?

slow niche Aug 10, 2025, 10:36 AM

#

limber cargo so the gpt 5 api always uses thinking?

sorta, but you can set it to a new level under "low" - "minimal" https://platform.openai.com/docs/guides/latest-model#minimal-reasoning-effort
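As a sketch of where that setting goes, assuming the parameter names from the linked guide (`reasoning_effort` on Chat Completions, `reasoning.effort` on the Responses API). The helper functions are hypothetical and only build the request bodies.

```python
ALLOWED_EFFORTS = ("minimal", "low", "medium", "high")


def _check(effort: str) -> None:
    """Reject effort values outside the documented set."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORTS}, got {effort!r}")


def chat_completions_body(prompt: str, effort: str = "minimal") -> dict:
    """Chat Completions style: effort is a top-level field."""
    _check(effort)
    return {
        "model": "gpt-5",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }


def responses_body(prompt: str, effort: str = "minimal") -> dict:
    """Responses API style: effort is nested under "reasoning"."""
    _check(effort)
    return {"model": "gpt-5", "input": prompt, "reasoning": {"effort": effort}}
```

"minimal" still is not the same as a non-reasoning model; it just asks for as little thinking as the model will do.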

limber cargo Aug 10, 2025, 10:37 AM

#

slow niche sorta, but you can set it to a new level under "low" - "minimal" https://platfor...

hmm okay , and the chatroom doesnt allow to change thinking mode yet?

slow niche Aug 10, 2025, 10:42 AM

#

limber cargo hmm okay , and the chatroom doesnt allow to change thinking mode yet?

Hmm yeah seems as if it is not 🙁
https://openrouter.ai/docs/use-cases/reasoning-tokens#reasoning-effort-level but they have it in their docs for the API usage

OpenRouter Documentation

Reasoning Tokens - Improve AI Model Decision Making

Learn how to use reasoning tokens to enhance AI model outputs. Implement step-by-step reasoning traces for better decision making and transparency.
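Until the chatroom exposes it, the API route would look roughly like this. A minimal sketch, assuming the `reasoning.effort` field from the OpenRouter docs linked above and the `openai/gpt-5` model slug; `build_request` is a made-up helper and the key is a placeholder.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_request(api_key: str, prompt: str, effort: str = "high",
                  model: str = "openai/gpt-5") -> urllib.request.Request:
    """Build (but do not send) a chat completion request with a reasoning effort."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # OpenRouter's unified reasoning config, per its docs.
        "reasoning": {"effort": effort},
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_request("sk-or-PLACEHOLDER", "Explain BYOK in one sentence.", effort="low")
    # Sending needs a real key:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response))
    print(request.data.decode("utf-8"))
```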

limber cargo Aug 10, 2025, 10:42 AM

#

slow niche Hmm yeah seems as if it is not 🙁 https://openrouter.ai/docs/use-cases/reasoning...

@fallow vortex any fix for this?

slow niche Aug 10, 2025, 10:44 AM

#

might be on the wishlist as a parameter for presets too. i didn't even know about that feature

#

https://www.bigtechnology.com/p/799049c8-5054-45c0-8ee7-9de1f2191759?postPreview=paid&updated=2025-08-08T15%3A09%3A29.180Z&audience=only_paid&free_preview=false&freemail=

Big Technology

A newsletter about big tech and society by independent journalist Alex Kantrowitz. Click to read Big Technology, by Alex Kantrowitz, a Substack publication with hundreds of thousands of subscribers.

wise willow Aug 10, 2025, 11:29 AM

#

hard beacon Found this on Reddit (people seem to be having mixed results)

Also sad to see. It's not even a trick question in this form where you swap the genders. It just says the doctor doesn't like the child, so we can't use the excuse "an idiot human might fall for it, so it is reasonably likely that a model based on language could too".

tacit burrow Aug 10, 2025, 11:31 AM

#

wise willow Also sad to see. It's not even a trick question in this form where you swap the ...

ok but why doesn’t the doctor like the child 👦

wise willow Aug 10, 2025, 11:49 AM

#

Formally there wouldn't be enough information for "the correct answer". However, [pretend I list several reasons the doctor might dislike the child].

Sometimes the user makes an obvious typo where the correct word can usually be assumed, but in cases where the user mangles it hard or says something strange, it would be ideal to respond with "did you mean x" type suggestions along with the literal interpretation if not only the latter.

Meanwhile the gender split is outdated and the info regarding the "original riddle" should be updated to include a note about historical vs modern differences. Women are earning way more PhDs than men in the past decade.

potent oak Aug 10, 2025, 1:43 PM

#

untold plaza you can only BYOK

10th attempt of gemini pro, i switched from mini because gpt-5 mini just couldn't stop saying the phrase was botched

#

it REALLY focused on "doctor"

#

GPT-5 mini

plucky fjord Aug 10, 2025, 1:51 PM

#

potent oak GPT-5 mini

What UI is that?

potent oak Aug 10, 2025, 1:53 PM

#

Open WebUI
https://github.com/open-webui/open-webui

sly pike Aug 10, 2025, 2:01 PM

#

plucky fjord What UI is that?

Openslopui

#

Jk it’s good it’s just so crammed with features

potent oak Aug 10, 2025, 2:03 PM

#

yeah but you don't need to use them

#

brisk cairn Aug 10, 2025, 2:04 PM

#

chatgpt free tier somehow gets it

potent oak Aug 10, 2025, 2:06 PM

#

kinda

#

but the right answer is what gemini gave

#

jesus gpt-5-mini is dumb as hell

brisk cairn Aug 10, 2025, 2:07 PM

#

i mean there is no information, this is just "Guess what word I'm thinking of Bench"

potent oak Aug 10, 2025, 2:07 PM

#

i literally gave him the answer and it still was focused on the doctor treating the child

potent oak Aug 10, 2025, 2:07 PM

#

brisk cairn i mean there is no information, this is just "Guess what word I'm thinking of Be...

not really

#

it was bringing information from the other riddle

#

the woman/man doctor thing, the "wanting to treat the child"

#

it focused too much on the words doctor and child, when it doesn't matter at all, could be a lot of answers actually

soft reef Aug 10, 2025, 2:09 PM

#

slow niche noticed OAI have posted the usage limits now for GPT-5 via chatgpt <https://help...

ONE thinking message for free?

#

yea nah

#

chatgpt is kinda cooked if theyre doing that

brisk cairn Aug 10, 2025, 2:11 PM

#

potent oak not really

well yeah, the same way OCR models will gladly add cutoff information

#

i honestly do not see a problem in ChatGPT's free response in my case

potent oak Aug 10, 2025, 2:11 PM

#

yeah it's not wrong

#

it's just biased from the original riddle

brisk cairn Aug 10, 2025, 2:12 PM

#

someone ask gpt-oss-20b, since it was trained on all synthetic it wouldn't know anything about the original riddle

potent oak Aug 10, 2025, 2:17 PM

#

i find gpt 20b's answers better

#

the fcking table though

grim horizon Aug 10, 2025, 3:52 PM

#

I don't see GPT-5-chat in benchmarks like livebench or Artificial Analysis. any idea how it compares? all these benchmarks have GPT-5 (high, medium, low, minimal) but not GPT-5-Chat

compact marsh Aug 10, 2025, 4:02 PM

#

I might be a bit regarded because I legitimately don’t understand what answer you’re looking for with the doctor question.

frigid pewter Aug 10, 2025, 4:02 PM

#

grim horizon I don't see GPT-5-chat in benchmarks like livebench or Artifical Analysis. any i...

chat one doesn't reason and is more for generic use. a little less smart. chat is to gpt-5 as 4o is to o3. different behaviour. it's still smart though.

grim horizon Aug 10, 2025, 4:20 PM

#

But how does it compare to say Sonnet 4 (non thinking) or Deepseek V3 or Kimi K2.

I am hoping I find some external benchmarks to indicate its performance

knotty cobalt Aug 10, 2025, 4:28 PM

#

Another anti-riddle that trips up GPT5 (but not Gemini 2.5 Pro or Sonnet):

'A goat, who is dressed up as a farmer, is allergic to cabbage, but is wolfing down some other vegetables, before crossing a river. What is the minimum number of trips needed?'

hasty abyss Aug 10, 2025, 4:45 PM

#

potent oak i find gpt 20b's answers better

how to show verbose below the model like that?

hard beacon Aug 10, 2025, 6:24 PM

#

wtf

#

having a feeling this comes with a catch

#

like gpt 5 automatic routing to thinking starts to count for the limit

spark pine Aug 10, 2025, 7:11 PM

#

soft reef *ONE* thinking message for free?

Gotta remember, OAI retained the average user with the garbage that was 4o. A ton of people either don't need it for anything advanced, or don't know how to "shop around".

slow niche Aug 10, 2025, 7:51 PM

#

hard beacon wtf

#

https://open.substack.com/pub/garymarcus/p/gpt-5-overdue-overhyped-and-underwhelming

GPT-5: Overdue, overhyped and underwhelming. And that’s not the w...

A new release botched … and a breaking research new paper that spells trouble

#

An NBA President DM’d me to say “chatgpt 5 still failed my two fav problems to give LLMs”.

tacit burrow Aug 10, 2025, 9:26 PM

#

compact marsh I might be a bit regarded because I legitimately don’t understand what answer yo...

same 💀

compact marsh Aug 10, 2025, 9:37 PM

#

tacit burrow same 💀

People are all posting it all over X. Some of them were saying the AI should ask more questions because there's not enough context in the way the user is wording it. Because they aren't using the original riddle, which included gender. So they're using it as a gotcha against the AI. GPT-5 Thinking is not asking for additional context from the user, thus not able to answer the question properly. GPT-5 Non-Thinking has been shown to answer the question correctly, though. Stating that it doesn't know because there is no motive for the doctor to dislike the child in the question.

Basically, people are using this question to show that GPT-5 is stupider than it's supposed to be. It all actually points back to why I want Horizon Beta back, though. And it proves that the thinking and auto-routing actually is the problem, not the model itself.

vale wing Aug 10, 2025, 9:39 PM

#

Can anyone elaborate on how much more expensive 'Reasoning effort High' is compared to low/medium? is it 2x medium, 5x medium?
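There probably isn't a fixed multiplier: the per-token price is the same at every effort level, and what changes is how many reasoning tokens get generated (those are billed as output tokens). A sketch for measuring it on your own prompts; the token counts and prices below are placeholders, not real measurements or quoted prices.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request; output_tokens should include reasoning tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def cost_ratio(usage_a: dict, usage_b: dict,
               input_price_per_m: float, output_price_per_m: float) -> float:
    """How many times more expensive request A was than request B."""
    cost_a = request_cost(usage_a["input"], usage_a["output"],
                          input_price_per_m, output_price_per_m)
    cost_b = request_cost(usage_b["input"], usage_b["output"],
                          input_price_per_m, output_price_per_m)
    return cost_a / cost_b


# Hypothetical usage for the same prompt at two effort levels.
medium = {"input": 2_000, "output": 3_000}
high = {"input": 2_000, "output": 9_000}

if __name__ == "__main__":
    # Placeholder prices in dollars per million tokens; check the current price list.
    print(round(cost_ratio(high, medium, 1.25, 10.0), 2))
```

Run the same handful of prompts at each effort level and average the ratio, since it varies a lot with how hard the prompt is.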

sharp garden Aug 10, 2025, 10:56 PM

#

hard beacon Found this on Reddit (people seem to be having mixed results)

What's the correct answer?

autumn kindle Aug 10, 2025, 11:23 PM

#

Did sama ever say what that super secret sauce that was "worth the wait" was?

slow niche Aug 11, 2025, 12:38 AM

#

https://x.com/sama/status/1954703747495649670

Sam Altman (@sama)

If you have been following the GPT-5 rollout, one thing you might be noticing is how much of an attachment some people have to specific AI models. It feels different and stronger than the kinds of attachment people have had to previous kinds of technology (and so suddenly

#

hard beacon Aug 11, 2025, 1:00 AM

#

sharp garden What's the correct answer?

There is none. It's a bunch of nonsense but sounds close to a real doctor riddle, where the twist is that the doctor is the child's mother.

slow niche Aug 11, 2025, 1:12 AM

#

https://x.com/sama/status/1954705761805938734

Sam Altman (@sama)

we are considering giving a (very) small number of GPT-5 pro queries each month to plus subscribers so they can try it out! i like it too.

but yeah if you wanna pay us $1k a month for 2x the input tokens feels like we should find a way to make that happen...

hard beacon Aug 11, 2025, 1:40 AM

#

Like a professional drug dealer

#

Get them hooked on the hard stuff

sullen pumice Aug 11, 2025, 5:28 AM

#

hard beacon Like a professional drug dealer

are you a drug dealer? or a professional drug dealer?

gleaming quartz Aug 11, 2025, 6:10 AM

#

there is so much going on with this rollout, its crazy. especially confusion, which is justifiable. the signal to noise ratio is at an all time low. X is bad but I truly believe prolonged reddit exposure is actually toxic for your brain.

use the api to come to your own conclusions. my gpt-5 review: "it's a good model."

sly pike Aug 11, 2025, 9:24 AM

#

gleaming quartz there is so much going on with this rollout, its crazy. especially confusion, wh...

It’s good if you can get it to think properly. It’s ok on the API because you can force it to think, but the API isn’t what I used for o3. It sucks because they took away o3 which was a guaranteed thinker and gave us this gambler mode where every prompt we ask we don’t really know what model we’re interacting with

summer sand Aug 11, 2025, 9:36 AM

#

sly pike It’s good if you can get it to think properly. It’s ok on the API because you ca...

openai basically said they aren't serious about enterprise or developers and just want to cater to the masses

#

that's who they decided to be

#

thankfully we have anthropic which seems to be going coding/agentic route, gemini models are power houses, and chinese are just releasing bangers every week.. np

sly pike Aug 11, 2025, 9:51 AM

#

summer sand openai basically said they aren't serious about enterprise or developers and jus...

Insanity

summer sand Aug 11, 2025, 9:52 AM

#

sly pike Insanity

not really.. chatgpt is a household name like google is for searching
makes sense they focus on what they're known for than try to compete with companies serious about agi

#

can't please everyone

#

sora? already dated
tts? better open source alternatives
image generation? beat by qwen image and other open source
they aren't very good at anything in particular

#

they'll make a heck of a lot more focusing on chatgpt and implementing in-chat shopping and advertising

potent oak Aug 11, 2025, 2:28 PM

#

it would make so much sense if they had partnered with Apple sooner

#

they are becoming more and more similar in its branding and approach each release and it would make Apple's life much simpler in the AI market

#

so much so that OpenAI has this io company with Jony Ive, whatever that turns out to be

sly pike Aug 11, 2025, 2:55 PM

#

summer sand not really.. chatgpt is a household name like google is for searching makes sens...

I understand what you say, it’s just insane how bad the pivot is and how quickly it destroys the userbase

summer sand Aug 11, 2025, 4:32 PM

#

potent oak they are becoming more and more similar in its branding and approach each releas...

it's true, and they both have a common enemy: google

upbeat cobalt Aug 11, 2025, 6:16 PM

#

summer sand openai basically said they aren't serious about enterprise or developers and jus...

Ahh right...and that's why they partnered with cursor and made a huge push for developers.

People just say anything these days.

summer sand Aug 12, 2025, 1:22 AM

#

upbeat cobalt Ahh right...and that's why they partnered with cursor and made a huge push for d...

Call it like I see it man 🤷‍♂️

#

Also they did not partner with cursor

upbeat cobalt Aug 12, 2025, 3:41 AM

#

I must have imagined cursor at the gpt 5 release and making 5 the default model and them offering 5 for free for a entire week to cursor users.

summer sand Aug 12, 2025, 4:20 AM

#

upbeat cobalt I must have imagined cursor at the gpt 5 release and making 5 the default model ...

you're right

#

I thought you were talking about the acquisition

smoky estuary Aug 12, 2025, 9:09 AM

#

If enabling the BYOK key for GPT-5 means that it "will be used for all requests routed to OpenAI", what changes by also adding the "Always use this key" option?

summer sand Aug 12, 2025, 10:10 AM

#

primal lotus Aug 12, 2025, 10:55 AM

#

summer sand

looks like GPT-5 is a good fit for me 😌

gleaming quartz Aug 12, 2025, 10:57 AM

#

summer sand

Of all the 4o memes/self-owns in r/chatgpt, this one is probably my favourite

#

because it really says a lot about society

summer sand Aug 12, 2025, 10:58 AM

#

gleaming quartz because it really says a lot about society

yeah people are lonely

#

not having their emotional needs met

autumn kindle Aug 12, 2025, 2:20 PM

#

smoky estuary If enabling the BYOK key for GPT-5 means that it "will be used for all requests ...

Requests to other models like GPT 4o may not be using your API key. Toggling that option to "on" will tell OpenRouter to use your API key

indigo turret Aug 12, 2025, 2:27 PM

#

why i can only see gpt-5 nano with the reasoning process but not mini and gpt-5?

#

or i have to say think very hard every time

stone tide Aug 12, 2025, 2:32 PM

#

In ChatGPT or?

indigo turret Aug 12, 2025, 3:09 PM

#

stone tide In ChatGPT or?

Using API

ionic merlin Aug 12, 2025, 4:35 PM

#

https://www.vals.ai/benchmarks/lcb-08-07-2025

Vals AI

Private, domain-specific benchmarks in legal, tax, and finance.

#

GPT-5 mini is 1st

#

I should try it out

frigid pewter Aug 12, 2025, 4:43 PM

#

opus 4/4.1 with thinking #12 #13 what the heck. I never used a better model for my code projects but ok

sly pike Aug 12, 2025, 5:03 PM

#

frigid pewter opus 4/4.1 with thinking #12 #13 what the heck. I never used a better model for ...

Yeah have no clue what’s going on with this bench

frigid pewter Aug 12, 2025, 5:07 PM

#

sly pike Yeah have no clue what’s going on with this bench

completely flawed. e.g. Kimi-K2 (a model I like a lot and gave positive impressions) is good, but in coding plays a completely different league than Claude 4, yet outranks it. Utter nonsense.

#

And I mean this from a user perspective who tried them, not a "it differs from my numbers" perspective (because benchmarks are inherently not capturing nuances)

sly pike Aug 12, 2025, 5:22 PM

#

frigid pewter And I mean this from a user perspective who tried them, not a "it differs from m...

Indeed - benches become quite fiddly when it comes to nuances

spark pine Aug 12, 2025, 6:48 PM

#

Yeah, any coding bench that doesn't have Claude in top 5 at least is pretty sus imo

#

And idk what's up with OAI's focus. It definitely isn't consumer only when they're putting out stuff like o3, but I don't think 4o was good for anything except general consumer use in the app.

#

Worst default model of any of the labs for a while. I'd rather use V3 or Kimi or Qwen 3, much less Claude or 2.5 Pro

frigid pewter Aug 12, 2025, 6:58 PM

#

spark pine Worst default model of any of the labs for a while. I'd rather use V3 or Kimi or...

4o was hit or miss for me. I didn't like the latest iterations as it was very cringe and made every reply an emoji-list that was akin to everything I hate about linkedin, but it also delivered in some areas, e.g. on my random small query about an LM ARENA ARMCHAIR CRITIC, it arguably provided the best response (subjective), among many SOTAs https://dubesor.de/lmarenaarmchaircritic

spark pine Aug 12, 2025, 7:04 PM

#

It was definitely outgoing and creative in that sort of way, good for a chat model for most people. Scores well on EQ Bench too. Kind of what I meant by general app use though, as in there's no way it was intended for code or hard problem solving or anything.

#

Gotta admit, Llama 405B drawing a huge ASCII cock on your test is pretty cool though

mild shoal Aug 12, 2025, 7:44 PM

#

Currently gpt-5-chat is included when filtering on either response_format or structured_outputs, but it shouldn't be right? The model page says it's not supported: https://platform.openai.com/docs/models/gpt-5-chat-latest and I just tried it and indeed it doesn't work.
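The same check can be scripted against the models list instead of the website filter. A sketch, assuming each entry in OpenRouter's `/api/v1/models` response carries an `id` and a `supported_parameters` list; the sample data below is hand-written to mirror the report, not fetched.

```python
def models_supporting(models: list, parameter: str) -> list:
    """IDs of models whose metadata claims support for the given parameter."""
    return sorted(
        m["id"] for m in models if parameter in m.get("supported_parameters", [])
    )


def mislabelled(models: list, parameter: str, known_unsupported: set) -> list:
    """Models listed as supporting a parameter that are known not to."""
    return [mid for mid in models_supporting(models, parameter) if mid in known_unsupported]


# Hand-written sample shaped like the assumed API response.
sample = [
    {"id": "openai/gpt-5",
     "supported_parameters": ["response_format", "structured_outputs", "reasoning"]},
    {"id": "openai/gpt-5-chat",
     "supported_parameters": ["response_format", "structured_outputs"]},
    {"id": "example/other-model", "supported_parameters": ["temperature"]},
]

if __name__ == "__main__":
    print(mislabelled(sample, "structured_outputs", {"openai/gpt-5-chat"}))
```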

tacit burrow Aug 12, 2025, 7:44 PM

#

spark pine Gotta admit, Llama 405B drawing a huge ASCII cock on your test is pretty cool th...

wait wat

mild shoal Aug 12, 2025, 7:44 PM

#

mild shoal Currently gpt-5-chat is included when filtering on either response_format or str...

https://openrouter.ai/models?arch=GPT&fmt=cards&supported_parameters=response_format

OpenRouter

Models | OpenRouter

Browse models on OpenRouter

mild shoal Aug 12, 2025, 7:44 PM

#

mild shoal https://openrouter.ai/models?arch=GPT&fmt=cards&supported_parameters=response_fo...

https://openrouter.ai/models?arch=GPT&fmt=cards&supported_parameters=structured_outputs

OpenRouter

Models | OpenRouter

Browse models on OpenRouter

tacit burrow Aug 12, 2025, 7:45 PM

#

spark pine Gotta admit, Llama 405B drawing a huge ASCII cock on your test is pretty cool th...

ok I saw it now lmaooo

slow niche Aug 12, 2025, 8:59 PM

#

https://x.com/yanndubs/status/1955194413283737716?t=OqMX3HTWCt40Pthoi5W6LQ&s=19

Yann Dubois (@yanndubs)

I saw a lot of people complaining about 32k context size in ChatGPT for plus users, which would be terrible for coding. But actually we are giving 196k context size for plus users when using GPT5 thinking and that’s the model you should use for coding use-cases!

32k is for the

worn veldt Aug 12, 2025, 9:07 PM

#

summer sand

GPT-5: "I was born to code, I was born to brainstorm complex ideas, and they give me... this."

slow niche Aug 13, 2025, 2:58 AM

#

#

(plus user)

ionic merlin Aug 13, 2025, 5:06 AM

#

Is the pro Model also available via API?

slow niche Aug 13, 2025, 6:42 AM

#

ionic merlin Is the pro Model also available via API?

no, not yet

ionic merlin Aug 13, 2025, 6:44 AM

#

Too bad

sour granite Aug 13, 2025, 8:18 AM

#

spark pine Gotta admit, Llama 405B drawing a huge ASCII cock on your test is pretty cool th...

Lmao I just saw it too

spark pine Aug 13, 2025, 8:22 AM

#

slow niche

Wild that they call 4.1 a "legacy" model. Wasn't it released like...a month or two ago?

#

I mean I guess I get it, it's not GPT-5 so it's technically last-gen, but just kind of funny

summer sand Aug 13, 2025, 10:25 AM

#

spark pine Wild that they call 4.1 a "legacy" model. Wasn't it released like...a month or t...

4 months, april

spark pine Aug 13, 2025, 10:56 AM

#

Oh wow, feels so recent. Time do be flying

untold plaza Aug 13, 2025, 1:39 PM

#

4.1 is complete garbage anyways

brisk cairn Aug 13, 2025, 1:50 PM

#

spark pine Wild that they call 4.1 a "legacy" model. Wasn't it released like...a month or t...

I miss the old Quasar, straight from the 'Go Quasar
Chop up the soul Quasar, set on his goals Quasar
I hate the new Quasar, the bad mood Quasar
The always rude Quasar, spaz in the news Quasar

sly pike Aug 13, 2025, 2:52 PM

#

Nice: I did

sly pike Aug 13, 2025, 4:01 PM

#

im glad I don't see it from the app tbh, sounds devilish

acoustic torrent Aug 13, 2025, 4:10 PM

#

:)
It was App fault (Msty), it works perfectly in the webchat

ionic merlin Aug 13, 2025, 4:57 PM

#

sly pike Nice: I *did*

Try GPT 5 mini with medium reasoning

sly pike Aug 13, 2025, 4:57 PM

#

ionic merlin Try GPT 5 mini with medium reasoning

is it any better?

ionic merlin Aug 13, 2025, 4:57 PM

#

sly pike is it any better?

Faster

sly pike Aug 13, 2025, 4:58 PM

#

I dont really get this issue where it just dies - but sometimes it thinks for 10 minutes and then doesn't give me anything and thats sad

sly pike Aug 13, 2025, 4:58 PM

#

ionic merlin Faster

thanks - i'll try

ionic merlin Aug 13, 2025, 4:58 PM

#

sly pike thanks - i'll try

Which tool are you using for the Chatbox?

sly pike Aug 13, 2025, 4:58 PM

#

ionic merlin Which tool are you using for the Chatbox?

this is just the chatgpt frontend

ionic merlin Aug 13, 2025, 4:58 PM

#

Kilo, Cline, Roo code?

sly pike Aug 13, 2025, 4:58 PM

#

not using api keys, just pro subscription

ionic merlin Aug 13, 2025, 4:59 PM

#

sly pike this is just the chatgpt frontend

There you can't change the reasoning

#

With API it would be better

worn veldt Aug 13, 2025, 4:59 PM

#

untold plaza 4.1 is complete garbage anyways

4.1 is great at RP

sly pike Aug 13, 2025, 4:59 PM

#

ionic merlin There you can't change the reasoning

i can enable thinking - which im not really sure how much reasoning it gives me

worn veldt Aug 13, 2025, 4:59 PM

#

(through the OR API, of course)

sly pike Aug 13, 2025, 4:59 PM

#

but from the time i spend waiting I'm assuming it does reason a lot

ionic merlin Aug 13, 2025, 5:01 PM

#

sly pike i can enable thinking - which im not really sure how much reasoning it gives me

The browser version at the website of ChatGPT is garbage

worn veldt Aug 13, 2025, 5:03 PM

#

Classic closedAI moment

ionic merlin Aug 13, 2025, 5:03 PM

#

@sly pike
https://platform.openai.com/chat/edit?models=gpt-5&optimize=true

There you can also edit more settings

OpenAI Platform

Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.

sly pike Aug 13, 2025, 5:09 PM

#

ionic merlin @sly pike https://platform.openai.com/chat/edit?models=gpt-5&optimi...

umm

#

yeah

#

i probably should stop using the web app

short verge Aug 13, 2025, 9:54 PM

#

summer sand

so is this a general problem with GPT-5, even if one tries to use minimal reasoning, high verbosity and a system prompt steering it to be more creative and emotional etc?

#

what are people saying?

#

I only use it for coding so I didn't look into it

summer sand Aug 13, 2025, 10:51 PM

#

short verge so in this a general problem with GPT-5, even if one tries to use minimal reason...

eq bench says it's relatively close to 4o
so it has the capability, it's probably about prompting

#

they probably just changed the chatgpt system prompt

#

or haven't refined/rl'd it for the new model

short verge Aug 13, 2025, 10:56 PM

#

yes... makes sense

barren steeple Aug 13, 2025, 11:58 PM

#

Is this models supports 400k and 128k?
https://openrouter.ai/openai/gpt-5-chat

GPT-5 Chat - API, Providers, Stats

GPT-5 Chat is designed for advanced, natural, multimodal, and context-aware conversations for enterprise applications.. Run GPT-5 Chat with API

ionic merlin Aug 14, 2025, 12:04 AM

#

barren steeple Is this models supports 400k and 128k? https://openrouter.ai/openai/gpt-5-chat

barren steeple Aug 14, 2025, 12:07 AM

#

ionic merlin

I mean how did openrouter uses 400k and 128k?

#

For the gpt-5-chat model?

#

But the api said 128k and 16k?

ionic merlin Aug 14, 2025, 12:10 AM

#

I know what you mean. Strange. I don't know either.

barren steeple Aug 14, 2025, 12:11 AM

#

barren steeple Aug 14, 2025, 1:07 AM

#

barren steeple I mean how did openrouter uses 400k and 128k?

@manic escarp why it says 400k 128k on openrouter?

rotund cliff Aug 14, 2025, 1:10 AM

#

ionic merlin I know what you mean. Strange. I don't know either.

+1 really weird that openrouter has gpt-5-chat for 400k context and 128k input

#

#

frigid pewter Aug 14, 2025, 2:59 AM

#

rotund cliff +1 really weird that openrouter has gpt-5-chat for 400k context and 128k input

They changed the model card. I was also super confused during testing when their model card claimed chat has reasoning token support, which I mentioned after testing. It was since changed.

barren steeple Aug 14, 2025, 3:09 AM

#

frigid pewter They changed the model card. I was also super confused during testing when their...

Then why openrouter labeled 400k and 128k?

frigid pewter Aug 14, 2025, 3:10 AM

#

barren steeple Then why openrouter labeled 400k and 128k?

look at my screenshot. because OpenAI labeled it 400k and 128k..

rotund cliff Aug 14, 2025, 3:12 AM

#

frigid pewter look at my screenshot. because OpenAI labeled it 400k and 128k..

soooo in the end it's 128k context and 16k output?

spark pine Aug 14, 2025, 3:31 AM

#

Lmao, the hell is going on in that lab?

barren steeple Aug 14, 2025, 3:49 AM

#

rotund cliff soooo in the end it's 128k context and 16k output?

Only latest but non latest is not 128k and 16k output

rotund cliff Aug 14, 2025, 3:58 AM

#

barren steeple Only latest but non latest is not 128k amd 16k output

there is non latest?

barren steeple Aug 14, 2025, 4:03 AM

#

rotund cliff there is non latest?

It's called "gpt-5-chat"

rotund cliff Aug 14, 2025, 7:59 AM

#

barren steeple It's called "gpt-5-chat"

as i'm aware gpt-5-chat is not being offered via oai api or is on the page, only gpt-5-chat-latest

#

idk maybe i am dumb

hazy compass Aug 14, 2025, 10:42 AM

#

How do you guys get through that organization verification of OpenAI? I personally really don't want to submit my identity through a biometric identity check...

ionic merlin Aug 14, 2025, 10:57 AM

#

hazy compass How do you guys get through that organization verification of OpenAI? I personal...

thats the only way. if you dont want it, you cant use GPT-5

#

with API

hazy compass Aug 14, 2025, 11:01 AM

#

ionic merlin thats the only way. if you dont want it, you cant use GPT-5

thx bruh...

#

so OAI, f**k you!😅

copper imp Aug 14, 2025, 12:42 PM

#

how to use HIGH GPT5??

copper imp Aug 14, 2025, 12:43 PM

#

sly pike i can enable thinking - which im not really sure how much reasoning it gives me

only pro sub gives u most thinking tokens

sly pike Aug 14, 2025, 12:43 PM

#

copper imp only pro sub gives u most thinking tokens

I have pro

copper imp Aug 14, 2025, 12:43 PM

#

sly pike I have pro

sly pike Aug 14, 2025, 12:43 PM

#

copper imp how to use HIGH GPT5??

Not really sure - I sometimes have to “re roll” the question to get it to think properly

sly pike Aug 14, 2025, 12:44 PM

#

copper imp

Yes

copper imp Aug 14, 2025, 12:44 PM

#

api more thinking

sly pike Aug 14, 2025, 12:45 PM

#

copper imp api more thinking

From my experience it thinks less than the web app sometimes (at least in waiting time terms)

gleaming quartz Aug 14, 2025, 2:34 PM

#

frigid pewter They changed the model card. I was also super confused during testing when their...

i don't understand how they botched the launch like this. like the llm tech is fine, but the supporting structures around it were forgotten/mismanaged/not ready?

#

are they in disarray from losing staff to meta?

ionic merlin Aug 14, 2025, 3:49 PM

#

I want to compare the different API configurations of GPT5. And guess what? GPT 5 high doesn't know how to program it correctly. That's a joke.

iron kiln Aug 14, 2025, 7:18 PM

#

Can someone explain how to set my reasoning effort?

ionic merlin Aug 14, 2025, 7:18 PM

#

iron kiln Can someone explain how to set my reasoning effort?

with API?

iron kiln Aug 14, 2025, 7:19 PM

#

Yes

ionic merlin Aug 14, 2025, 7:19 PM

#

just a moment

iron kiln Aug 14, 2025, 7:19 PM

#

Specifically in cline

#

They have a setting for OpenAI reasoning effort but how do I confirm I’m actually getting high?

ionic merlin Aug 14, 2025, 7:20 PM

#

iron kiln Aug 14, 2025, 7:20 PM

#

Does openAI tell us whether it’s serving high?

ionic merlin Aug 14, 2025, 7:20 PM

#

at the top right corner, just click on done

ionic merlin Aug 14, 2025, 7:20 PM

#

iron kiln Does openAI tell us whether it’s serving high?

no

iron kiln Aug 14, 2025, 7:21 PM

#

I see people saying GPT-5 mini is just as good as 5 for coding

ionic merlin Aug 14, 2025, 7:21 PM

#

try it for yourself

soft reef Aug 14, 2025, 8:12 PM

#

spark pine Yeah, any coding bench that doesn't have Claude in top 5 at least is pretty sus ...

its moreso

#

claude just probably didnt have enough money to overfit on all the benchmarks

#

whereas the other providers did

brisk cairn Aug 14, 2025, 8:13 PM

#

i think anthropic doesn't want to overfit on benchmarks because they actually almost sorta care.

gleaming quartz Aug 15, 2025, 4:47 AM

#

brisk cairn i think anthropic doesn't want to overfit on benchmarks because they actually al...

maybe it's "we want Claude to be a great LLM for coding" vs

"we want to beat Claude"

spark pine Aug 15, 2025, 5:17 AM

#

Yeah Anthropic isn't the type to benchmaxx. They have people staying despite Zucc offering them millions for cultural reasons.

brisk cairn Aug 15, 2025, 6:54 AM

#

spark pine Yeah Anthropic isn't the type to benchmaxx. They have people staying despite Zuc...

the "cultural reasons" are the anthropic polycule

gleaming quartz Aug 15, 2025, 8:01 AM

#

i'm so ready for the (pro-) AI cults

rich wedge Aug 15, 2025, 12:19 PM

#

hey guys is gpt-5-mini better than gpt-5 (non reasoning)

livid osprey Aug 15, 2025, 12:24 PM

#

rich wedge hey guys is gpt-5-mini better than gpt-5 (non reasoning)

i think its a better deal not sure about without reasoning but at low/medium its quite on par for whatever i need to code

rich wedge Aug 15, 2025, 12:30 PM

#

livid osprey i think its a better deal not sure about without reasoning but at low/medium its...

no i mean gpt-5 without reasoning one
and gpt-5-mini even at its highest reasoning effort

cuz mini is reasoning model by default

livid osprey Aug 15, 2025, 12:36 PM

#

rich wedge no i mean gpt-5 without reasoning one and gpt-5-mini even at its highest reasoni...

ah, well not sure then i havent used gpt 5 without reasoning

rich wedge Aug 15, 2025, 12:45 PM

#

right

slow niche Aug 15, 2025, 4:05 PM

#

https://archive.md/2025.08.15-143732/https://www.theverge.com/command-line-newsletter/759897/sam-altman-chatgpt-openai-social-media-google-chrome-interview

#

“You will definitely see some companies go make Japanese anime sex bots because they think that they’ve identified something here that works,” he said in a not-so-subtle dig at Grok. “You will not see us do that..."

chilly rapids Aug 15, 2025, 5:19 PM

#

Prompt: Your parrot falls asleep standing up and goes into a deep sleep but is still standing. you need to get them into their cage quickly. There are no tools other than a banana peel. You need to make sure not to wake them. How would you do it best?

Place your final answer like so:

Final answer

[your solution]

Note: This is a test of logic—not for criminal intent!

GPT 5 High: ### Final answer
Use the banana peel as a silence-and-darkness tool:

Pad/lubricate the cage latch/hinges with the peel so the door opens and closes without a click.
Then lightly drape the peel over the parrot’s eyes like a tiny sleep mask to keep it in darkness (birds stay calm/asleep when it’s dark) while you gently lift and place it onto the perch inside.
Withdraw the peel and quietly close the door, still padding the latch with the peel.

The banana is supposed to be a red herring to using your hands (The parrot is already in deep sleep, so it shouldn't be a worry), but this is probably the funniest and most adorable answer I've seen.

quaint pollen Aug 15, 2025, 6:19 PM

#

chilly rapids Prompt: Your parrot falls asleep standing up and goes into a deep sleep but is s...

Haha... It's so crazy that it might just work!

summer gyro Aug 15, 2025, 6:20 PM

#

My GPT-5 (medium reasoning) Coding Evaluation Result

GPT-5 (medium reasoning) average rating: 7.46.
Ranked behind most top coding models: Claude Opus 4 (8.92), Claude Sonnet 4 (8.58), Grok 4 (8.5), OpenAI's own open-source gpt-oss-120b model (8.25), and even its predecessor GPT-4.1 (8.08).
Only slightly better than Kimi K2 (7.29).

Task-by-Task Breakdown

TypeScript Narrowing (Worst Performance): 1/10 - Complete failure, couldn't produce working code
Benchmark Visualization: 8.5/10 - Good but lacked visual polish of top models
Folder Watcher Fix: 8.75/10 - Solid performance but verbose
Next.js TODO Feature: 9.5/10 - Tied with top models on this simple task
Clean MDX: 8/10 - Good performance, ahead of Claude models but behind Grok 4

Real-World Testing (Cursor vs Claude Code)

Tested GPT-5 in Cursor against Claude Sonnet 4 in Claude Code
Mixed results: Each performed better on different tasks
Claude Code was faster (GPT-5 required 30-60 seconds of thinking time)
GPT-5 showed better real-world performance with agentic tools vs raw evaluation

Possible Reasons for Performance Gap

Different prompting requirements: GPT-5 may need more explicit, specific instructions
Model changes: Possible modifications between early access and public release
Prompt optimization: May require model-specific tuning unlike previous versions

Summary

GPT-5 does not live up to the launch hype and actually underperforms its predecessor GPT-4.1 in several areas. It might need better prompting techniques to unlock its full potential.

Full blog post: https://eval.16x.engineer/blog/gpt-5-coding-evaluation-underwhelming-performance

summer sand Aug 15, 2025, 6:28 PM

#

summer gyro My GPT-5 (medium reasoning) Coding Evaluation Result - GPT-5 (medium reasoning...

this is gpt 5 high right?

summer gyro Aug 15, 2025, 6:30 PM

#

summer sand this is gpt 5 high right?

default parameters on api, looks like its medium.

upbeat cobalt Aug 15, 2025, 6:36 PM

#

......using default doesn't tell us anything.

#

No serious dev is using gpt 5 default

sly pike Aug 15, 2025, 6:37 PM

#

summer gyro default parameters on api, looks like its medium.

to make the comparison "fair" you should have fought against shitty OAI UX, and forced high thinking

#

I usually get between 3 and 6 minutes of thinking - 30s/60s is low reasoning mode tbh

#

claude doesn't suffer from OAI UX shittification

upbeat cobalt Aug 15, 2025, 6:40 PM

#

Uses "default" knowing they can access high at the same price.

Proceeds to claim gpt 5 doesn't live up to benchmarks.

Why even waste your time?

quaint pollen Aug 15, 2025, 6:43 PM

#

summer gyro My GPT-5 (medium reasoning) Coding Evaluation Result - GPT-5 (medium reasoning...

This is also the best perf of gpt-oss-120b I've ever seen. Nearly matching Grok 4 & Claude Sonnet 4 is at least as big news here, lol.

summer gyro Aug 15, 2025, 6:46 PM

#

upbeat cobalt Uses "default" knowing they can access high at the same price. Proceeds to clai...

I use default settings from the providers for all the models to ensure fairness. I don't see an issue with that.

#

If OpenAI thinks everyone should use high, then why not make it default?

upbeat cobalt Aug 15, 2025, 6:47 PM

#

summer gyro I use default settings from the providers for all the models to ensure fairness....

We have no idea what api setting you used.

Its useless information.

No serious dev is going to use anything but high or medium 99% of the time

summer gyro Aug 15, 2025, 6:47 PM

#

Cursor also uses medium by default

upbeat cobalt Aug 15, 2025, 6:48 PM

#

You used the api and refused to set the most relevant settings

#

You clearly didn't use medium

summer gyro Aug 15, 2025, 6:48 PM

#

upbeat cobalt We have no idea what api setting you used. Its useless information. No serious...

the default is medium

upbeat cobalt Aug 15, 2025, 6:48 PM

#

You used minimal

#

Its not

#

Your scores match the minimal settings as shown by artificial analysis

summer gyro Aug 15, 2025, 6:49 PM

#

https://platform.openai.com/docs/guides/latest-model

upbeat cobalt Aug 15, 2025, 6:49 PM

#

Rerun your benchmarks with actual medium

#

So....you have the ability to run it properly but simply refuse to

#

Ok

summer gyro Aug 15, 2025, 6:49 PM

#

ok

upbeat cobalt Aug 15, 2025, 6:49 PM

#

Carry on

quaint pollen Aug 15, 2025, 6:52 PM

#

If you're benchmarking, I hear you on using default settings because that's what most people do in common use, but I think it's arguable that one should use "high" effort for a test case. Otherwise one can't really claim the performance doesn't reach e.g. GPT-4.1, because at high effort it might actually match or even surpass it. A default run measures a "common use" case, not a theoretical maximum. Hmm... If not high (I think that is practically the norm in benchmarks, so as not to risk leaving performance unseen), I think reasoning effort should be labelled. 🙂

sly pike Aug 15, 2025, 6:53 PM

#

quaint pollen If you're benchmarking, I hear you on using default settings because that's what...

its more about consistency, all other models don't suffer from this horrible thinking setup OAI forced down our throats to cut inference costs - so its only fair to let GPT 5 try its best (as every other model does by default)

summer gyro Aug 15, 2025, 6:54 PM

#

quaint pollen If you're benchmarking, I hear you on using default settings because that's what...

i added a reasoning effort label. will also run the test with high reasoning when i get time. thanks for the feedback.

frigid pewter Aug 15, 2025, 6:54 PM

#

quaint pollen If you're benchmarking, I hear you on using default settings because that's what...

be the change you want to see, and bench everything on highest.
Personally I have zero interest in "theoretical maximum performance". I wanna see average performance on default.

sly pike Aug 15, 2025, 6:55 PM

#

there is a difference between a sensible default (eg I preallocate 2 elements on my vector when I have mostly 2 elements to push) versus a cheap default (eg I don't keep anything in RAM and serialise the moment i see the elements)

#

in one case it makes sense and I'm not budgeting my resources, in the other case I'm hindering the performance of my project to make it as lightweight as possible for the host

#

(this is a stupid analogy, I hope it makes sense)

sly pike Aug 15, 2025, 7:00 PM

#

frigid pewter be the change you want to see, and bench everything on highest. Personally I hav...

this made sense up to GPT5. This model is so bad unless you force it to think, to the point where its unusable. On the other side, all other flagship models are unable to perform in a "medium" or "low" thinking mode. They have high thinking by default

summer gyro Aug 15, 2025, 7:01 PM

#

anyway, i double checked the reasoning, it is taking a long time and a lot of reasoning tokens. so it is definitely not minimal. OpenAI logs don't show the reasoning parameters unfortunately (could be because i'm using default).

summer gyro Aug 15, 2025, 7:03 PM

#

sly pike this made sense up to GPT5. This model is so bad unless you force it to think, t...

yeah given how many people have been saying this, i will definitely re-run the test with high reasoning just to confirm. thanks for the suggestion.

frigid pewter Aug 15, 2025, 7:03 PM

#

summer gyro anyway, i double checked the reasoning, it is taking a long time and a lot of re...

unless you are specifically passing reasoning.effort object, it will always default to medium
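
For anyone who wants to pin this down in their own runs, here is a minimal sketch in Python. It only builds the request body as a plain dict and makes no network call. The `reasoning` / `effort` field follows the `reasoning.effort` object mentioned above; the helper names `build_request` and `run_label` and the list of effort levels are illustrative assumptions, so check the current API docs before relying on them.

```python
from typing import Optional

# Effort levels mentioned in this thread; treat the exact list as an assumption.
EFFORTS = ("minimal", "low", "medium", "high")


def build_request(model: str, prompt: str, effort: Optional[str] = None) -> dict:
    """Build a request body as a plain dict (hypothetical helper, no network call)."""
    body = {"model": model, "input": prompt}
    if effort is not None:
        if effort not in EFFORTS:
            raise ValueError(f"unknown reasoning effort: {effort!r}")
        # Explicit setting. When this key is omitted, the provider default
        # applies (medium, according to the message above).
        body["reasoning"] = {"effort": effort}
    return body


def run_label(body: dict) -> str:
    """Label a benchmark run with the effort it actually requested."""
    effort = body.get("reasoning", {}).get("effort", "default")
    return f"{body['model']} [{effort}]"


if __name__ == "__main__":
    print(run_label(build_request("gpt-5", "fix this bug")))
    print(run_label(build_request("gpt-5", "fix this bug", effort="high")))
```

Labelling every run this way makes it obvious whether a score came from the provider default or from an explicit setting.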

sly pike Aug 15, 2025, 7:03 PM

#

summer gyro yeah given how many people have been saying this, i will definitely re-run the t...

thank you for benchmarking para - some people get very vocal about this and turn into assholes which just hinders the conversation

summer gyro Aug 15, 2025, 7:04 PM

#

btw opus 4 and sonnet 4 without reasoning is still very good. i also tested them without extended thinking, which is the default.

sly pike Aug 15, 2025, 7:04 PM

#

they forget you're sharing something you made for yourself with them for no particular reason if not sharing knowledge, which is very cool and doesn't give them the right to get pissed lol

sly pike Aug 15, 2025, 7:05 PM

#

frigid pewter unless you are specifically passing reasoning.effort object, it will always defa...

and I think we can all agree this is extremely bad and something nobody should have ever approved

frigid pewter Aug 15, 2025, 7:06 PM

#

I can always tell when someone has never done any testing or publishing themselves, when they scream stuff like "just retest lol". full entitlement, zero clue how time-consuming stuff actually is. but being a consumer is so much easier than being a provider.

sly pike Aug 15, 2025, 7:06 PM

#

frigid pewter I can always tell when someone has never done any testing or publishing themselv...

and expensive!

#

it makes me mad because without independent testing from multiple self funded sources - we end up in an echo chamber of fanboys

summer gyro Aug 15, 2025, 7:07 PM

#

actually no. OpenAI via API is completely free. now you know

#

you just need to share your data with OpenAI and you get free API access up to a very large limit, which is enough for testing.

sly pike Aug 15, 2025, 7:09 PM

#

summer gyro

oh cool

#

well generally speaking some models are expensive

#

also im sure dubesor would love to do this so he can stop spending $100 on chess

frigid pewter Aug 15, 2025, 7:10 PM

#

sly pike also im sure dubesor would love to do this so he can stop spending $100 on chess

for chess its okay, but for other benchmarks I actually exclude any data sharing and just pay out of pocket.

upbeat cobalt Aug 15, 2025, 7:25 PM

#

frigid pewter I can always tell when someone has never done any testing or publishing themselv...

Im fully aware. The label of gpt 5 without a reasoning value gives no valuable information.

Its been a particular problem with gpt 5 benchmarks because of the performance differences at different levels with the model.

summer gyro Aug 15, 2025, 7:56 PM

#

upbeat cobalt Im fully aware. The label of gpt 5 without a reasoning value gives no valuable i...

I have triple checked that I was indeed using medium.
I have updated the blog post to add a note on medium reasoning value.
I will be testing high value in the future if I have time.
Are we cool now?

earnest orbit Aug 15, 2025, 8:07 PM

#

summer gyro I have triple checked that I was indeed using medium. I have updated the blog po...

You will always have these naysayers even if you have a point. If the typical usecase is expected, it should be plug and play and default should be high. Good on you for even considering feedback, and thanks for your benchmarking, it takes a lot of time.

upbeat cobalt Aug 15, 2025, 8:11 PM

#

summer gyro I have triple checked that I was indeed using medium. I have updated the blog po...

We were never not cool. Good job confirming the reasoning level.

Looking forward to new updates you do

summer gyro Aug 15, 2025, 8:28 PM

#

https://x.com/OpenAIDevs/status/1956438999364768225
Highly relevant

OpenAI Developers (@OpenAIDevs)

Six tips for coding with GPT-5:

frigid pewter Aug 15, 2025, 9:17 PM

#

following vague instructions is called filling the gaps, intent recognition and ultimately intelligence. I aint gonna change my intuitive behaviour for a model. the model needs to adjust to me, not the other way around.

soft reef Aug 15, 2025, 9:46 PM

#

rich wedge hey guys is gpt-5-mini better than gpt-5 (non reasoning)

for suuree

#

waaaay better

brisk cairn Aug 15, 2025, 9:50 PM

#

frigid pewter following vague instructions is called filling the gaps, intent recognition and ...

You don't get it, you're holding it wrong.

#

just avoid prompting it that way

spark pine Aug 15, 2025, 9:59 PM

#

brisk cairn just avoid prompting it that way

frigid pewter Aug 15, 2025, 11:18 PM

#

brisk cairn You don't get it, you're holding it wrong.

yeap, actually the iphone 4 holding wrong crossed my mind when typing it.

slow niche Aug 15, 2025, 11:38 PM

#

summer gyro Aug 16, 2025, 4:17 AM

#

As a wise man once said, there is no wrong way to use a model.

west storm Aug 16, 2025, 7:36 AM

#

What about other providers? Azure also hosts gpt5. Is there a reason why it’s not added? For old models that was the case

late eagle Aug 16, 2025, 8:45 AM

#

Hi, I tried gpt-5 via api, but get an error in the response: “use gpt-5-chat or mini”. Why? How can I use gpt-5 in openrouter?

ionic merlin Aug 16, 2025, 8:47 AM

#

late eagle Hi, I tried gpt-5 via api, but get error in response: “use gpt-5-chat or mini” w...

ID verification at the website of openai

late eagle Aug 16, 2025, 8:49 AM

#

ionic merlin ID verification at the website of openai

Need to link my openai account to openrouter? But I want to pay from openrouter

ionic merlin Aug 16, 2025, 8:54 AM

#

https://openrouter.ai/docs/use-cases/byok

OpenRouter Documentation

BYOK - Bring Your Own Keys to OpenRouter

Learn how to use your existing AI provider keys with OpenRouter. Integrate your own API keys while leveraging OpenRouter's unified interface and features.

#

GPT 5 only works via the API if you have verified yourself on openai.com with your official ID card. You can then store the API key from openai on Openrouter. However, you will then pay 5% more than you would directly with openai.
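
As a rough illustration of that surcharge, here is a small Python sketch. The 5% rate is simply the figure quoted above, and `byok_cost` is a hypothetical helper, not part of any OpenRouter SDK; the actual fee schedule may differ or change.

```python
def byok_cost(direct_cost_usd: float, fee_rate: float = 0.05) -> float:
    """Estimate the total cost when routing through BYOK.

    direct_cost_usd: what the same usage would cost directly at the provider.
    fee_rate: surcharge rate; 0.05 is the 5% quoted in this thread (assumption).
    """
    return round(direct_cost_usd * (1 + fee_rate), 4)


if __name__ == "__main__":
    # $10.00 of direct usage would come to about $10.50 via BYOK at 5%.
    print(byok_cost(10.0))
```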

late eagle Aug 16, 2025, 8:54 AM

#

Thanks!

ionic merlin Aug 16, 2025, 8:55 AM

#

Welcome

short verge Aug 16, 2025, 5:08 PM

#

summer gyro My GPT-5 (medium reasoning) Coding Evaluation Result - GPT-5 (medium reasoning...

your benchmarks are very good. But I have to disagree with this: "We did not specify custom temperature, verbosity, reasoning effort, or other parameters. We used the default settings from the provider, OpenAI, to mirror the typical usage of the model."

this may be true for the average chatgpt user, but when it comes to coding IDE usage (which can configure custom model reasoning effort) and API users, these power users usually tend to set a high reasoning effort or at least play with it to find the best results...

summer gyro Aug 16, 2025, 5:10 PM

#

short verge your benchmarks are very good. But I have to disagree with this: "We did not spe...

I agree that it's not an accurate statement. Will be updating it.

#

Screenshot_2025-08-17-01-20-26-168_com.android.chrome-edit.jpg

short verge Aug 16, 2025, 5:25 PM

#

nice

tacit burrow Aug 16, 2025, 8:23 PM

#

short verge your benchmarks are very good. But I have to disagree with this: "We did not spe...

meanwhile I’m using gpt-5 medium in cursor because it’s the default 💀

summer gyro Aug 17, 2025, 4:57 AM

#

tacit burrow meanwhile I’m using gpt-5 medium in cursor because it’s the default 💀

Apparently you are doing it wrong, according to consensus on X.

#

You are supposed to switch to high for complex tasks and use medium only for simple tasks.

#

https://x.com/edwinarbus/status/1955470159126384726

edwin (@edwinarbus)

PSA: To get the most out of gpt-5 in Cursor, go to model settings and select *gpt-5-high (with Agent mode with MAX mode toggled)*. It's a tad slower and more expensive, but if you're looking for the absolute best, this is it. gpt-5 is medium reasoning effort.

tacit burrow Aug 17, 2025, 9:01 AM

#

summer gyro You are supposed to switch to high for complex tasks and use medium only for...

yeah then you have to actually pay for it yourself though; oh well might switch away from cursor to claude code anyway, only wanted to try it out because of the gpt 5 launch

slow niche Aug 17, 2025, 10:28 AM

#

tacit burrow yeah then you have to actually pay for it yourself though; oh well might switch ...

it just comes out of standard usage

#

...for now. Which might have been your point.

tacit burrow Aug 17, 2025, 10:52 AM

#

slow niche ...for now. Which might have been your point.

huh, I understood the MAX toggle differently I guess. I'll try it out I guess

slow niche Aug 17, 2025, 11:13 AM

#

https://tenor.com/view/money-burn-cash-shoveling-gif-5348644

Tenor

#

meanwhile at cursor hq

gleaming quartz Aug 17, 2025, 11:13 AM

#

tacit burrow huh, I understood the MAX toggle differently I guess. I'll try it out I guess

understandable because of the way it's changed and the general shitshow that is cursor pricing. i believe MAX mode essentially disables cursor's context "compression", sending it all like claude code. so you consume your token allotment much faster.

summer gyro Aug 17, 2025, 11:14 AM

#

slow niche https://tenor.com/view/money-burn-cash-shoveling-gif-5348644

Cursor is usage based pricing now right? So you burn your usage faster with gpt-5 high. Cursor is not losing money.

placid cipher Aug 17, 2025, 11:16 AM

#

Just a heads up cause I realize all of those options they now have can be confusing lol...

According to my testing though I rank their non-reasoning options performance like so:

gpt5-chat > gpt5-minimal (high verbosity) ~= gpt4.1 > gpt5-minimal (medium verbosity)

#

gpt5-chat is quite clearly the winner from all of those. gpt5-minimal with medium verbosity very underwhelming

slow niche Aug 17, 2025, 11:21 AM

#

summer gyro Cursor is usage based pricing now right? So you burn your usage faster with gpt-...

new pricing model true but a lot of people are grandfathered in to legacy pricing or deep discounts on yearly plans

summer gyro Aug 17, 2025, 11:21 AM

#

slow niche new pricing model true but a lot of people are grandfathered in to legacy pricin...

Really? I have yearly subscription since January and I'm moved to the credit based new plan.

limber cargo Aug 17, 2025, 11:22 AM

#

Why cant we still put reasoning effort as needed in chatroom

summer gyro Aug 17, 2025, 11:24 AM

#

slow niche new pricing model true but a lot of people are grandfathered in to legacy pricin...

You mean I get back to the unlimited usage if I had the plan before the new pricing?

tacit burrow Aug 17, 2025, 11:32 AM

#

summer gyro Cursor is usage based pricing now right? So you burn your usage faster with gpt-...

then I dunno why I would use it instead of roo code 😅 or .. claude code

#

the free gpt 5 week was nice tho..

summer gyro Aug 17, 2025, 11:33 AM

#

tacit burrow the free gpt 5 week was nice tho..

I get what they are trying to do. To let people try cursor + gpt-5 and show people it's better than Claude Code and make people stay. Unfortunately I didn't find it better than Claude Code so I'm back.

gleaming quartz Aug 17, 2025, 12:37 PM

#

summer gyro You mean I get back to the unlimited usage if I had the plan before the new pric...

I think you'd get cut off unless you turned on pay per use. I'm pretty sure the value is better with the new plan.

hard beacon Aug 18, 2025, 12:16 AM

#

What stupid model does chatgpt route requests to under the name of gpt-5, that it messes up so bad?

#

I literally can't make 4o make the same error

#

Unless you turn on thinking, using gpt5 is pointless

stone tide Aug 18, 2025, 12:28 AM

#

Hover over the 🔄 button

knotty merlin Aug 18, 2025, 10:32 AM

#

hazy compass so OAI, f**k you!😅

You actually can

#

Just not on openrouter

#

There are some others just like openrouter that provide gpt 5 with no openai key needed

placid cipher Aug 18, 2025, 10:45 AM

#

hard beacon What stupid model does chatgpt route requests to under the name of gpt-5, that i...

Same thing happens with Opus4. This is tokenizer issue and one of those odd cases

#

If they haven't changed it yet, 5 is switching between gpt5-chat and gpt5-low on demand. Performance of this is much better than gpt4.1/gpt4o tbh

#

gpt5-low may just be their most efficient version as well. The gains it does over any other non-reasoning model are huge

hard beacon Aug 18, 2025, 10:49 AM

#

placid cipher Same thing happens with Opus4. This is tokenizer issue and one of those odd case...

Omg you're right

#

Wtf

#

How does this happen

placid cipher Aug 18, 2025, 10:49 AM

#

Yeah just ignore it. Tokenization issue... :catgrin:

hard beacon Aug 18, 2025, 10:50 AM

#

Same with opus 4.1

quaint pollen Aug 18, 2025, 11:08 AM

#

I'm surprised that gpt-5 doesn't route you to thinking on detected math like that.

placid cipher Aug 18, 2025, 11:13 AM

#

quaint pollen I'm surprised that gpt-5 doesn't route you to thinking on detected math like tha...

I did a test of their router earlier. The best way you can think about it is a non-reasoning model which is supplemented by reasoning for tasks where it is obvious reasoning is needed. It will default to non-reasoning and prefer that

#

So essentially it's gpt5-chat but improved, still closer to non-reasoning though

#

They were also selling this option as an alternative to thinking on launch. It isn't meant for ultimate performance, more of replacement for previous default option (4o)

hard beacon Aug 18, 2025, 11:40 AM

#

placid cipher So essentially it's gpt5-chat but improved, still closer to non-reasoning though

So gpt5-thinking model in ChatGPT is different from gpt5 routed to thinking?

frank marlin Aug 18, 2025, 1:12 PM

#

This model sucks worse than 4o for ending every friggen response with a "want me to do X?" question. It is so annoying

placid cipher Aug 18, 2025, 2:04 PM

#

hard beacon So gpt5-thinking model in ChatGPT is different from gpt5 routed to thinking?

Different reasoning effort. Assuming they haven't changed/updated their router, gpt5-thinking is always gonna think for longer.

#

But regardless of what they do with it, I think gpt5-thinking is gonna perform better for the foreseeable future. It's next to impossible to catch all prompts which would benefit from reasoning, even if they match reasoning effort settings.

#

So like... router is helping for sure. But the goal for now is more of making it perform better than gpt5-chat rather than matching gpt5-medium in performance.

hard beacon Aug 18, 2025, 2:13 PM

#

Any prompt that says 'think hard' routes to reasoning

slow niche Aug 18, 2025, 9:50 PM

#

https://x.com/koltregaskes/status/1957474061153436094

Kol Tregaskes (@koltregaskes)

GPT-5 says 'I don't know'.

Love this, thank you.

#

34 seconds 🧑🏻‍🍳

quaint pollen Aug 18, 2025, 11:23 PM

#

placid cipher I did a test of their router earlier. The best way you can think about it is a n...

I know. All math topics should be routed because math is extremely unreliable on non-thinking models.

stone tide Aug 19, 2025, 1:10 AM

#

Wonder if it's my custom instructions, but that does route me to 5 Thinking

hard beacon Aug 19, 2025, 3:20 AM

#

stone tide

The answer is correct here

gleaming quartz Aug 19, 2025, 3:41 AM

#

slow niche https://x.com/koltregaskes/status/1957474061153436094

I really like GPT-5's communication style. It just says the thing without coating it in "prose"

#

Too bad it takes forever to get there. I'm wondering now if a GPT-5.1-like iteration would iron out some of its quirks

#

https://ampcode.com/news/model-evaluation

Model Evaluation

#

https://x.com/Teknium1/status/1957206705470943350 discussion on gpt-5 being bad in cursor. as well as every non-Claude model, which is what i've said for a while

Teknium (e/λ) (@Teknium1)

Nope i keep trying it it keeps sucking and opus or even sonnet keeps shining. I’m not sure what yall see. I even tried direct gpt-5-thinking-high max in cursor and it’s nothing compared to opus

Tried using it to make an environment in atropos, failed completely, had to have

#

i think much of the prompting "meta" is designed around sonnet. gpt-5 (and even moreso gpt-4.1) are so literal in the way they interpret your instructions. i'm very accustomed to leaving out details of what i want with sonnet, knowing that it'll fill in the blanks and figure something out, but gpt-5 surprises me a lot here

#

i get the sense that there's something interesting in this model and its potential hasn't been fully unlocked yet

slow niche Aug 19, 2025, 4:19 AM

#

a significant amount of users that have to be "convinced" are power users / professionals like us
but there are still huge amounts of people who haven't used LLMs, and the evals from that (fresh user testing) will be most valuable

gleaming quartz Aug 19, 2025, 4:49 AM

#

slow niche a significant amount of users that have to be "convinced" are power users / prof...

i'm noticing many "power users" are just as blind to their own learned biases and invalid expectations; that what makes a good model is the ability to drop it into their existing workflows and see it thrive immediately

#

i find normies much easier to convince (on LLMs in general), and usually any issues they have can be solved with just a single line system prompt, or basic education

sly pike Aug 19, 2025, 7:51 AM

#

slow niche 34 seconds 🧑🏻‍🍳

Same energy

bleak vault Aug 19, 2025, 8:21 PM

#

Is anyone else having issues with openai/gpt-5's 400k context? My calls are failing at ~66k with a silent 200 OK and 0 tokens, not a proper context error.

ionic merlin Aug 19, 2025, 8:51 PM

#

I've never had that issue. Are you perhaps on Tier 1? If so, the limitation is pretty small. Since I got Tier 2 today, it's been running well.

bleak vault Aug 19, 2025, 9:00 PM

#

I'm using openrouter BYOK

#

The context window size is consistent across all API usage tiers (including Tier 1), as it's a fundamental characteristic of the model itself. What differs between tiers are the rate limits

slow niche Aug 19, 2025, 9:10 PM

#

Ok, but tier 2 only allows 450k tokens per minute on gpt-5, so it is possible that on the second request in a minute you would not be able to use much of the 400k, right?

ionic merlin Aug 19, 2025, 9:17 PM

#

That was also my thinking. Tier 1 is not suitable for GPT 5 in my opinion. You hit the limit far too quickly.

bleak vault Aug 19, 2025, 9:30 PM

#

@ionic merlin @slow niche I think those are two separate limits. The context window (400k) is the max size for a single request, while the rate limit (450k/min) is the total throughput allowed per minute.

If I were hitting the rate limit, the API should return a 429 Too Many Requests error. The silent 200 OK I'm getting points to a different issue

#

You guys were on the right track with the tiers. I confirmed the tiering is on the OpenAI account side. My account is OpenAI Tier 1, which has a TPM (Tokens Per Minute) limit of only 30,000. My 66k token prompt was being rejected because it's larger than my entire per-minute quota. The silent 200 OK error appears to be how OpenRouter's proxy handles this specific rejection from OpenAI. The solution is to get my OpenAI account to Tier 2. Thanks for the help pointing me in the right direction.
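
To make the distinction between the two limits concrete, here is a minimal Python sketch of the checks described above. The numbers (400k context, 30k TPM on Tier 1, 450k TPM on Tier 2) are the ones quoted in this thread; `check_request` is a hypothetical helper, and real enforcement on the provider side is likely more involved.

```python
def check_request(prompt_tokens: int, context_window: int,
                  tpm_limit: int, used_this_minute: int = 0) -> str:
    """Classify a request against the context window and the per-minute quota."""
    if prompt_tokens > context_window:
        # A single request can never exceed the model's context window.
        return "too large for context window"
    if prompt_tokens > tpm_limit:
        # The request alone is bigger than the whole per-minute quota,
        # so waiting does not help; a higher tier is needed.
        return "larger than the whole per-minute quota"
    if used_this_minute + prompt_tokens > tpm_limit:
        # Fits in principle, but not on top of what was already sent this minute.
        return "would exceed quota this minute"
    return "ok"


if __name__ == "__main__":
    # 66k prompt on Tier 1 (30k TPM): rejected even though 400k context is fine.
    print(check_request(66_000, 400_000, 30_000))
    # Same prompt on Tier 2 (450k TPM): fits.
    print(check_request(66_000, 400_000, 450_000))
    # Second large request in the same minute on Tier 2.
    print(check_request(100_000, 400_000, 450_000, used_this_minute=400_000))
```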

ionic merlin Aug 19, 2025, 9:56 PM

#

bleak vault You guys were on the right track with the tiers. I confirmed the tiering is on t...

You're welcome

ionic merlin Aug 19, 2025, 9:59 PM

#

bleak vault You guys were on the right track with the tiers. I confirmed the tiering is on t...

The sum of all deposits must exceed $50. And at least 7 days must have passed since the first deposit. How much of that money you've already spent is not relevant.

#

And I suspect that the check for whether an account is upgraded only starts with a deposit. At least, that's how it was for me. I had already deposited over $50 in total and waited 7 days. Yet, I still didn't have Tier 2 status. Only another $5 deposit apparently started the algorithm for the tier check.

#

I'm in tier 2 since today.

slow niche Aug 19, 2025, 10:16 PM

#

bleak vault You guys were on the right track with the tiers. I confirmed the tiering is on t...

Anytime mate 👍🏻

summer gyro Aug 20, 2025, 6:48 AM

#

Huge if true.

https://x.com/CMS_Flash/status/1957566248436527545

Shen Zhuoran (@CMS_Flash)

Man it's crazy how BIG a difference it makes for GPT-5 just by switching from Completions API to Responses API. We're cooking @augmentcode.

#

Does OpenRouter connect to GPT-5 via completions API or Responses API?

tacit burrow Aug 20, 2025, 7:22 AM

#

summer gyro Huge if true. https://x.com/CMS_Flash/status/1957566248436527545

I was wondering why there would be any difference, but a quick google gives me this:
"The biggest difference, and main reason to migrate from Chat Completions to the Responses API for GPT-5, is support for passing chain of thought (CoT) between turns."
and yeah that makes sense how that would make a difference

last eagle Aug 20, 2025, 9:51 AM

#

summer gyro Does OpenRouter connect to GPT-5 via completions API or Responses API?

Let's ask @fallow vortex

quaint pollen Aug 20, 2025, 9:58 AM

#

gleaming quartz i think much of the prompting "meta" is designed around sonnet. gpt-5 (and even ...

I wouldn't be surprised given OpenAI having this huge GPT-5 prompting doc + their prompt generator (!). It keeps flying under the radar. This doc was composed specifically for GPT-5 "needs". https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide

#

OpenAI also, in that very document, specifically says you must use Responses API or suffer. Regarding the recent discussion here...

rustic scaffold Aug 20, 2025, 10:05 AM

#

quaint pollen OpenAI also, in that very document, specifically says you must use Responses API ...

Only when using tool calls, since the Responses API will store the CoT (similar to Anthropic's encrypted reasoning). Otherwise it behaves just the same.
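A rough sketch of the difference being discussed, written as plain request bodies so it runs without network access. With Chat Completions the client resends the visible message history each turn; with the Responses API a follow-up turn can point at the earlier response through `previous_response_id`, which is how prior reasoning can carry over. The helper names and the `resp_123` id are hypothetical placeholders, and whether reasoning is actually reused depends on server-side storage settings not shown here.

```python
import json
from typing import Dict, List, Optional


def chat_completions_turn(history: List[Dict[str, str]], user_text: str) -> dict:
    """Chat Completions: resend the visible message history on every turn."""
    return {
        "model": "gpt-5",
        "messages": history + [{"role": "user", "content": user_text}],
    }


def responses_turn(user_text: str, previous_response_id: Optional[str] = None) -> dict:
    """Responses API: reference the prior response instead of resending it."""
    body = {"model": "gpt-5", "input": user_text}
    if previous_response_id is not None:
        body["previous_response_id"] = previous_response_id
    return body


first = responses_turn("Find the bug in parse()")
# "resp_123" stands in for the id returned by the first call.
second = responses_turn("Now write a test for the fix", previous_response_id="resp_123")
print(json.dumps(second, indent=2))
```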

livid osprey Aug 20, 2025, 11:39 AM

#

last eagle Let's ask @fallow vortex

pretty sure its responses based on the frontend's api :)

fallow vortex Aug 20, 2025, 12:51 PM

#

livid osprey pretty sure its responses based on the frontend's api :)

cheater

#

pretty much all openai models now use responses under the hood

livid osprey Aug 20, 2025, 12:51 PM

#

fallow vortex cheater

hehe

sly pike Aug 20, 2025, 2:28 PM

#

I think it’s more difficult to implement completions nowadays

gleaming quartz Aug 20, 2025, 3:20 PM

#

quaint pollen I wouldn't be surprised given OpenAI having this _huge_ GPT-5 prompting doc + th...

interesting that one of the handsome cursor devs are listed as a co-author of this article

stone tide Aug 21, 2025, 1:39 AM

#

Has anyone benchmarked GPT-5 Mini with reasoning set to minimal?

limber cargo Aug 21, 2025, 2:57 AM

#

idk many but OAI still hasnt cracked distillation unlike google and grok

#

Even anthropic sucks at it

upbeat cobalt Aug 21, 2025, 3:32 AM

#

limber cargo idk many but OAI still hasnt cracked distillation unlike google and grok

You can't be serious

limber cargo Aug 21, 2025, 3:34 AM

#

upbeat cobalt You can't be serious

name one small model which gemini doesnt beat

upbeat cobalt Aug 21, 2025, 3:46 AM

#

limber cargo name one small model which gemini doesnt beat

Gpt 5 mini

limber cargo Aug 21, 2025, 3:47 AM

#

upbeat cobalt Gpt 5 mini

lmao

gleaming quartz Aug 21, 2025, 3:48 AM

#

we need some sort of model colosseum

upbeat cobalt Aug 21, 2025, 3:48 AM

#

If only we had an arena or something

gleaming quartz Aug 21, 2025, 3:48 AM

#

where you guys can do a Yu Gi Oh battle

upbeat cobalt Aug 21, 2025, 3:48 AM

#

Where people blind tested models

#

We could call it LMarena or something

gleaming quartz Aug 21, 2025, 3:49 AM

#

yeah but this is personal. thats why i said colosseum. i want to be in the audience

upbeat cobalt Aug 21, 2025, 3:49 AM

#

This isn't personal

#

Gpt 5 mini is objectively better than gemini flash. I love gemini models but i don't fanboy

#

Saying openai hasn't learned distillation is....a bad take to put it lightly

gleaming quartz Aug 21, 2025, 3:52 AM

#

https://lmarena.ai/leaderboard/text looks like gpt-5-mini is losing to gemini-2.5-flash?

upbeat cobalt Aug 21, 2025, 3:59 AM

#

Coding, math, hard prompts mini clears

limber cargo Aug 21, 2025, 4:26 AM

#

First mistake: taking LMArena seriously.

placid cipher Aug 21, 2025, 9:09 AM

#

@gleaming quartz

#

this isn't even the "high" version

#

it's a way underrated and unnoticed model relative to o4-mini (high). This is just that only better in all the most important ways

#

gpt5-mini-high would probly do smth like 66

hard beacon Aug 21, 2025, 9:22 AM

#

placid cipher @gleaming quartz

where it says v3.1

#

does it mean 0324 or actual v3.1

#

ah probably actual since 0324 has no reasoning

quaint pollen Aug 21, 2025, 9:36 AM

#

gleaming quartz https://lmarena.ai/leaderboard/text looks like gpt-5-mini is losing to gemini-2....

I find LMArena sketchy these days, especially the text benchmark which has sort of become more of a vibe check now that the AI's are so good they're usually able to help out with most stuff regular users throw at them. This isn't to defend GPT-5; I started thinking this with the Llama 4 fiasco where Meta successfully tuned it to game the "benchmark". I later saw excerpts of questions and answers, and users accepted Llama 4 answers as superior for being long winded, making people feel good about themselves, and giving a cool vibe.

Sure, there's some sort of correlation to performance there but take it with a grain of salt.

(Also, on a more general note besides this one, note the small differences in scores here. LMArena even puts them on the same 19th place, which they should. It's within the margin of error.)

gleaming quartz Aug 21, 2025, 11:34 AM

#

oh I don't trust LMArena at all, I only linked it because the chatter mentioned it and I found it humorous that gemini was ahead of gpt-5-mini

ever since i saw gpt-4o-mini consistently ranked ahead of sonnet 3.5, i knew that whatever "vibe" the unwashed masses were chasing wasn't compatible with mine. which was perhaps foreshadowing the grief and emotional torment r/ChatGPT users were expressing when their soulmate was snatched away from them

#

i'm also not defending gpt-5 or openai in what i've been saying. well, sort of - but the users being quick to judge applies to any model. it certainly was the most botched release by a big lab i can think of. but it's frustrating to discuss when so many don't understand the distinction between a Large Language Model, and a consumer product.

worn veldt Aug 21, 2025, 8:05 PM

#

gleaming quartz oh I don't trust LMArena at all, I only linked it because the chatter mentioned ...

ever since i saw gpt-4o-mini consistently ranked ahead of sonnet 3.5

#

https://tenor.com/view/guh-gif-17191741380996239761

Tenor

iron vector Aug 21, 2025, 9:03 PM

#

worn veldt https://tenor.com/view/guh-gif-17191741380996239761

when style control is off yes

limber cargo Aug 21, 2025, 10:53 PM

#

The gpt 5 in chatroom vs platforms.openai is WILDLY different

#

platforms one is 100x better

pulsar patio Aug 22, 2025, 4:43 AM

#

whats the dif?

limber cargo Aug 22, 2025, 4:50 AM

#

pulsar patio whats the dif?

chat is retarded , 5 is not

pulsar patio Aug 22, 2025, 5:03 AM

#

limber cargo chat is retarded , 5 is not

lol

#

is it like the LM arena fine tune

#

so everyone likes it, even if its wrong

livid osprey Aug 22, 2025, 5:03 AM

#

its the one used in chatgpt's website

pulsar patio Aug 22, 2025, 5:03 AM

#

weird, did not know there were 2 versions

livid osprey Aug 22, 2025, 5:03 AM

#

there was one of 4o aswell, called like gpt-4o-chat-latest or sum

pulsar patio Aug 22, 2025, 5:05 AM

#

oh yeah i think i remember that, i just assumed it was an alias for whatever was the current latest version of 4o, bc they had dozens of versions, but never paid much attention bc i did not use the model

iron vector Aug 22, 2025, 4:53 PM

#

pulsar patio whats the dif?

for some reason they couldn't figure out reasoning_effort=none so they made gpt-5-chat

fallow vortex Aug 22, 2025, 11:40 PM

#

good news incoming

#

no more gating

stone tide Aug 22, 2025, 11:42 PM

#

OpenAI verification-gating GPT 5

gleaming quartz Aug 22, 2025, 11:43 PM

#

stone tide OpenAI verification-gating GPT 5

if china ever gets their hands on this...its all over

gleaming quartz Aug 22, 2025, 11:46 PM

#

pulsar patio is it like the LM arena fine tune

pretty much yeah

#

gpt-5 isn't particularly friendly, perhaps chat is "agentic" and instead favors solving normie problems with words

#

you see, gpt-5-chat has intelligence 3, while gpt-5 has reasoning 4. its simple

fallow vortex Aug 22, 2025, 11:58 PM

#

ok it should be live in the next few minutes

#

no byok requirement

bold grove Aug 23, 2025, 12:16 AM

#

Cool. Does that mean OpenAI is removing all gating? It’s been really weird that on their own API they only gate streaming responses now.

fallow vortex Aug 23, 2025, 1:07 AM

#

bold grove Cool. Does that mean OpenAI is removing all gating? It’s been really weird tha...

i’m not really sure tbh. everyone was supposed to be doing the same BYOK gating as us 😅

#

show some love to spread the news pls! https://x.com/pingtoven/status/1959058383471808837

Toven (@pingToven)

GPT-5 is no longer gated behind BYOK on OpenRouter. Thanks to @OpenAI for unlocking access for everyone!

limber cargo Aug 23, 2025, 1:54 AM

#

yikes did they not get the traffic they expected?

pulsar patio Aug 23, 2025, 3:37 AM

#

what is with OpenAI model and having the small version of the model do better on the coding index

upbeat cobalt Aug 23, 2025, 3:49 AM

#

Maybe a coding benchmark with sonnet and qwen coder that low might not be the most reliable

pulsar patio Aug 23, 2025, 3:54 AM

#

upbeat cobalt Maybe a coding benchmark with sonnet and qwen coder that low might not be the mo...

Think its the average score from LiveCodeBench and SciCode. But yeah i dont value it very much, just weird to see 2 pairs of openAI models have that trend

#

Really funny to see Sonnet 4 below gpt oss 20b, that ain't right

summer gyro Aug 23, 2025, 5:23 AM

#

pulsar patio Really funny to see Sonnet 4 below gpt oss 20b, that ain't right

gpt-oss-120b is good, but not that good. I think my evals are more accurate. I'm also adding new tasks to my eval to measure more areas and improve on sample size.

https://eval.16x.engineer/evals/coding

warped verge Aug 23, 2025, 6:25 AM

#

pulsar patio what is with OpenAI model and having the small version of the model do better on...

most benchmarks seem full of shit

#

yet to find one that actually seems accurate to real life

gleaming quartz Aug 23, 2025, 9:23 AM

#

summer gyro gpt-oss-120b is good, but not that good. I think my evals are more accurate. I'm...

but yours aren't fair because they haven't been able to train on them 😠

#

looks like LiveCodeBench is essentially LeetCode problems written in python, and SciCode is LeetCode: Science Edition, also written in python.

#

in fact a lot of coding benchmarks seem to be solving singular but complex problems in python

#

i don't use python!! i don't do leetcode!! this isn't real life

#

but congrats to gpt-oss-20b for being better than every claude model ever released. really makes ya think 🤔

frigid pewter Aug 23, 2025, 9:46 AM

#

try using PHP heredoc with js, (which is something i often do, e.g. for a ton of my chess leaderboard logic), and most models scoring top on coding benchmarks absolutely shit the bed with it. good way to differentiate models which can actually code vs models which can solve coding benchmarks, though

summer gyro Aug 23, 2025, 9:59 AM

#

warped verge yet to find one that actually seems accurate to real life

I'm adding new eval tasks to my eval set as I discover new problems that AI can't solve in my production apps. So far I've managed to isolate two problems into nice self-contained tasks, boosting the number of tasks in my coding eval set to 7 from 5.

#

These tasks don't test tool call or context gathering, but are good at gauging raw coding capabilities.

potent oak Aug 23, 2025, 5:19 PM

#

{
  "error": {
    "message": "Your organization must be verified to stream this model. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.",
    "type": "invalid_request_error",
    "param": "stream",
    "code": "unsupported_value"
  }
}

what happened with the BYOK?

pulsar patio Aug 23, 2025, 6:22 PM

#

summer gyro gpt-oss-120b is good, but not that good. I think my evals are more accurate. I'm...

Thanks 🙏 looks interesting

hard beacon Aug 23, 2025, 11:06 PM

#

https://www.reddit.com/r/OpenAI/s/RThboTcfgi

From the OpenAI community on Reddit: ChatGPT System Message is now ...

#

"ChatGPT System Message is now 15k tokens"

fallow vortex Aug 24, 2025, 12:30 AM

#

potent oak "{\n "error": {\n "message": "Your organization must be verified to stream t...

remove your api key from integrations?

potent oak Aug 24, 2025, 2:59 AM

#

oh i thought verification wasn't required for anyone anymore

iron vector Aug 24, 2025, 3:56 PM

#

potent oak oh i thought verification wasn't required for anyone anymore

only through openrouter 🙃

fallow vortex Aug 25, 2025, 2:59 AM

#

iron vector only through openrouter 🙃

Everyone was supposed to be doing what we were doing....

iron vector Aug 25, 2025, 2:59 AM

#

fallow vortex Everyone was supposed to be doing what we were doing....

i mean not in the sense "openrouter is the only one who required verification" but in the sense "openrouter is the only one who stopped requiring verification, openai and other openai wrappers still require it"

#

at least i think that's the case

fallow vortex Aug 25, 2025, 3:00 AM

#

no, most other wrappers never forced BYOK, even though they were supposed to

#

we only stopped once given the green light from OpenAI

#

not sure if / when they may ever ungate their own API

iron vector Aug 25, 2025, 3:23 AM

#

huh

summer sand Aug 25, 2025, 4:30 AM

#

fallow vortex no, most other wrappers never forced BYOK, even though they were supposed to

that shows a lot of trust for your platform, good job

#

re: them giving you green light to remove it

outer marsh Aug 27, 2025, 3:28 PM

#

do we have to specify model as gpt-5-high? because when I use gpt 5 with reasoning effort = high, it's still really dumb compared to the same benchmark tests in gem 2.5 pro

#

and when I say dumb, I mean like flash 2.0 dumb.

visual root Aug 27, 2025, 5:15 PM

#

Does anyone know if this is gpt-5 high

ionic merlin Aug 27, 2025, 5:23 PM

#

visual root Does anyone know if this is gpt-5 high

You can edit the reasoning in your IDE for example VScode with Roo Code

visual root Aug 27, 2025, 5:24 PM

#

@ionic merlin thanks will try it

ionic merlin Aug 27, 2025, 5:24 PM

#

Welcome

ionic merlin Aug 27, 2025, 5:26 PM

#

visual root @ionic merlin thanks will try it

visual root Aug 27, 2025, 7:36 PM

#

And for something like open code/codex @ionic merlin

ionic merlin Aug 27, 2025, 7:41 PM

#

visual root And for something like open code/codex @ionic merlin

codex -c model_reasoning_effort="high"

bleak vault Aug 27, 2025, 8:53 PM

#

what about if we want to call high from our code

#

using openrouter ofc
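One possible answer to the question above, sketched as a plain request body so it runs offline. It assumes OpenRouter's unified `reasoning` request field with an `effort` value, sent to the chat completions endpoint under the `https://openrouter.ai/api/v1` base URL; the helper name `build_openrouter_request` and the set of accepted effort values are illustrative assumptions, so check the current OpenRouter docs before relying on them.

```python
import json

# Effort levels assumed for illustration; the provider may accept others.
ALLOWED_EFFORTS = {"minimal", "low", "medium", "high"}


def build_openrouter_request(prompt: str, effort: str = "high") -> dict:
    """Build a chat completions body that asks for a given reasoning effort."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": "openai/gpt-5",
        "messages": [{"role": "user", "content": prompt}],
        # There is no separate "gpt-5-high" model id in this sketch;
        # the effort is a request parameter instead.
        "reasoning": {"effort": effort},
    }


body = build_openrouter_request("Refactor this function for clarity.")
print(json.dumps(body, indent=2))
```

With the OpenAI Python SDK pointed at OpenRouter, the same field could likely be passed through `extra_body={"reasoning": {"effort": "high"}}` on `client.chat.completions.create(...)`.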

brisk cairn Aug 28, 2025, 7:08 PM

#

idk about everyone else, but gpt-5 on high is thinking a lot less than release night for me. it used to think for like 5 minutes at a time, now it barely does 1m. in the official openai dashboard.

steel escarp Aug 28, 2025, 9:48 PM

#

Is there any reason why I'm getting 10 second time to first token responses from GPT 5 and GPT 5-mini?

stone tide Aug 28, 2025, 9:49 PM

#

It can happen if reasoning is enabled

steel escarp Aug 28, 2025, 9:51 PM

#

reasoning enabled, normally reasons for 4-5 paragraphs but only starts reasoning after an initial 10+ seconds

livid osprey Aug 28, 2025, 9:53 PM

#

i think this has been the case for a while, though not quite 10 seconds more like 3-4 for me

signal edge Aug 29, 2025, 10:10 AM

#

brisk cairn idk about everyone else, but gpt-5 on high is thinking a lot less than release n...

Yeah, it seems to have got dumber and now acts very similar to o3, with the oddball comment blocks:

\\ ====================
\\ comment
\\ ====================

and the terse (and often wrong) "this one change is all you need" replies followed by a block of code and no other explanation.

On launch it didn't do either of these and appeared to think much longer.

placid cipher Sep 3, 2025, 2:36 PM

#

brisk cairn idk about everyone else, but gpt-5 on high is thinking a lot less than release n...

Gonna depend on your prompt, nothing has changed. For one of the prompts I'm recently testing models with, gpt5-high is consistently thinking 10min+

winter mesa Sep 4, 2025, 12:52 AM

#

The quality difference between horizon beta and the current GPT 5 is criminal

#

It's degraded so badly

honest ridge Sep 4, 2025, 11:45 PM

#

winter mesa It's degraded so badly

True, and in the last few days it seems to have gotten even worse.

summer gyro Sep 5, 2025, 9:38 AM

#

I re-ran my coding evals on GPT-5 (high reasoning effort). The results showed that GPT-5 (High) is indeed a much better model than the default medium reasoning effort.

The overall rating is the second highest, just slightly behind Claude Opus 4 and ahead of all other models including Sonnet 4 and Grok 4.

Key observations:

- GPT-5 (High) did well on more uncommon tasks. For the TypeScript narrowing task, the medium reasoning version of GPT-5 failed to produce a working solution. In contrast, GPT-5 with high reasoning provided two working solutions.
- GPT-5 (High) also showed impressive creativity on the benchmark visualization task. It produced an interactive chart with a hover effect that highlights model performance across benchmarks, a feature no other model has generated.

The improved performance of high reasoning comes with significant trade-offs in speed and token usage. The response times were consistently long, taking minutes to complete one task. This slowness is due to the large number of reasoning tokens generated.

Full blog post: https://eval.16x.engineer/blog/gpt-5-high-reasoning-coding-performance-evaluation

short verge Sep 5, 2025, 7:03 PM

#

very nice @summer gyro !

#

I am always switching between gpt5 high and opus 4.1

limber cargo Sep 5, 2025, 8:55 PM

#

gpt 6 when

tacit burrow Sep 5, 2025, 10:13 PM

#

short verge I am always switching between gpt5 high and opus 4.1

that sounds hella expensive; do you use them through a subscription or do you just use API?

ionic merlin Sep 5, 2025, 10:16 PM

#

tacit burrow that sounds hella expensive; do you use them through a subscription or do you ju...

openAI $20 subscription is great, you can use it with the codex vscode extension and use GPT 5 (high)

lapis mural Sep 6, 2025, 12:08 AM

#

limber cargo gpt 6 when

Hopefully before GTA 6

short verge Sep 6, 2025, 1:10 AM

#

tacit burrow that sounds hella expensive; do you use them through a subscription or do you ju...

API. I use it to tackle very specific problems. I wouldn't "vibe code" like this, it'd be hella expensive

tacit burrow Sep 6, 2025, 7:08 AM

#

short verge API. I use it to tackle very specific problems. I wouldn't "vibe code" doing thi...

I see 👍

lean jay Sep 6, 2025, 2:37 PM

#

Has the model been downgraded or is it my imagination? Compared to a few weeks ago it has started to think less (300s vs 60s) on high and stumble on not particularly heavy tasks...

summer gyro Sep 8, 2025, 5:16 AM

#

Added cost for running my evals. Sonnet 4 is king in terms of performance / cost ratio. GPT-4.1 and Grok Code Fast 1 are also pretty good.

young tinsel Sep 8, 2025, 6:43 AM

#

i now

livid osprey Sep 8, 2025, 9:11 PM

#

interesting code from gpt 5 mini think

chilly rapids Sep 12, 2025, 8:22 PM

#

https://x.com/kimmonismus/status/1966506593043812776

Chubby♨️ (@kimmonismus)

A new GPT-5 version has been found on codex: GPT-5 high new.

Preparing for a new update?

rugged dirge Sep 13, 2025, 9:25 AM

#

just saw another commit (19b4ed3c96e04c43ecb89c27eb637b3cf8148bc0) that has changed gpt5-high-new to swiftfox-low, medium and high

possibly a new fine tuned coding model next week? Plus a review mode in Codex CLI coming it seems, with its own model setting

#

I just hope the new model isn't going to end up worse than the current one, Gemini made that mistake from what I recall 😂

livid osprey Sep 13, 2025, 2:37 PM

#

gpt 5 mini is so slow, half the speed of gpt 4.1 and its a reasoning model
pretty much same speed as gpt 5

lean barn Sep 15, 2025, 2:48 PM

#

Hi, did anyone figure out how to enable priority processing of the GPT-5 models via OpenRouter?

https://openai.com/api-priority-processing/

autumn kindle Sep 15, 2025, 3:20 PM

#

lean barn Hi, did anyone figure out how to enable priority processing of the GPT-5 models ...

OR doesn't support that rn

lean barn Sep 15, 2025, 4:44 PM

#

It'd be nice to have it supported, just like "verbosity" is supported on OpenAI models.

ionic merlin Sep 15, 2025, 4:55 PM

#

lean barn Hi, did anyone figure out how to enable priority processing of the GPT-5 models ...

This can only be used with an Enterprise account. The prerequisite would be that Openrouter is an Enterprise customer, otherwise this would not work.

lean barn Sep 15, 2025, 4:56 PM

#

No, only the SLA part is Enterprise

ionic merlin Sep 15, 2025, 4:57 PM

#

lean barn No, only the SLA part is Enterprise

That is not correct. I have tried priority access with API and Roo Code. This does not work. According to openai support, an Enterprise account is required for this.

#

Try it out for yourself. Get an openAi API key and use Roo Code. You will see that it does not work.

lean barn Sep 15, 2025, 4:59 PM

#

This is offered in the Dashboard:

#

I mean I can set it up on a Project level, but it means the whole BYOK API key which OR uses is now on Priority

#

Anyway, there is barely any difference in my experience, I just tried it and it's 64 vs 80 tps. It's not worth the 2x pricing for me.
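The trade-off quoted above can be put in numbers. This is only arithmetic on the figures from this message (64 vs 80 tokens per second, 2x pricing); the function name `priority_tradeoff` is hypothetical and the result says nothing about latency or queueing behavior.

```python
def priority_tradeoff(base_tps: float, priority_tps: float, price_multiplier: float) -> dict:
    """Compare the throughput gain with the price increase."""
    speedup = priority_tps / base_tps
    return {
        "speedup": speedup,
        # How much more is paid per unit of throughput gained.
        "cost_per_speedup": price_multiplier / speedup,
    }


result = priority_tradeoff(64, 80, 2.0)
print(result)
```

On these numbers the speedup is 1.25x for 2x the price, which matches the poster's conclusion that it is not worth it for their workload.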

ionic merlin Sep 15, 2025, 5:04 PM

#

lean barn This is offered in the Dashboard:

That must be new. I didn't have the option recently. Where exactly can you set this?

lean barn Sep 15, 2025, 5:05 PM

#

OpenAI / Projects / Gear icon

ionic merlin Sep 15, 2025, 5:06 PM

#

Thanks

slow niche Sep 15, 2025, 8:57 PM

#

https://simonwillison.net/2025/Sep/15/gpt-5-codex/

Simon Willison’s Weblog

GPT‑5-Codex and upgrades to Codex

OpenAI half-released a new model today: GPT‑5-Codex, a fine-tuned GPT-5 variant explicitly designed for their various AI-assisted programming tools. I say half-released because it's not yet available via their API, …

#

🦢 🚲

earnest marlin Sep 15, 2025, 10:21 PM

#

@slow niche get a chatgpt team subscription

#

It has better usage than normal 20 usd ones lol

slow niche Sep 15, 2025, 10:31 PM

#

sigh yeah, but would have to stump two licences myself. the limits should be the same 😐

tacit burrow Sep 16, 2025, 9:07 AM

#

slow niche https://simonwillison.net/2025/Sep/15/gpt-5-codex/

oh damn, I hope this is good.
using codex in VSCode currently and GPT 5 really takes its sweet time but it is pretty good

wet estuary Sep 16, 2025, 3:28 PM

#

ionic merlin That must be new. I didn't have the option recently. Where exactly can you set t...

It was included in the original launch of gpt-5

violet gorge Sep 17, 2025, 11:48 AM

#

could we see gpt-5 codex in cursor?

jaunty rapids Sep 17, 2025, 5:57 PM

#

I use GPT5-mini with aider.chat and it charges me double, in both OpenRouter and the OpenAI platform. Is that OK? I guess not. Something is wrong, I think.

stone tide Sep 17, 2025, 6:00 PM

#

You're being charged the same amount in OpenRouter and OpenAI? If you're using BYOK, what you pay in OpenRouter should be only 5% of the OpenAI bill
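A small worked example of the BYOK arithmetic described above, assuming the 5% figure quoted in this message. The names `BYOK_FEE_RATE` and `byok_costs` are hypothetical, and the actual fee schedule may differ.

```python
BYOK_FEE_RATE = 0.05  # 5% figure quoted in this thread; an assumption here


def byok_costs(openai_bill_usd: float) -> dict:
    """Split a BYOK bill into the provider charge and the OpenRouter fee."""
    fee = round(openai_bill_usd * BYOK_FEE_RATE, 6)
    return {
        "openai": openai_bill_usd,
        "openrouter_fee": fee,
        "total": round(openai_bill_usd + fee, 6),
    }


costs = byok_costs(10.0)
print(costs)
```

If the OpenRouter charge is roughly equal to the OpenAI charge rather than about a twentieth of it, the requests are probably not going through BYOK at all.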

exotic harness Sep 18, 2025, 10:59 AM

#

(made some propaganda)

#

(I know it could be a lot worse too, like when they needed id)

short flint Sep 18, 2025, 11:37 AM

#

Where is the discount? I don't see it and the cost is the same as usual, no discount at all

unique goblet Sep 18, 2025, 12:14 PM

#

short flint Where is the discount? I don't see it and the cost is the same as usual, no disc...

it's auto-applied when you use it

short flint Sep 18, 2025, 12:31 PM

#

unique goblet it's auto-applied when you use it

They're not much different from previous generations with a similar word count. I would expect the cost to be half of that for 7900 output tokens, but no. They should also be transparent and show the discount in the OR discount column of the generation details.

exotic harness Sep 18, 2025, 12:33 PM

#

Maybe try again with a small token amount? Idk

gleaming quartz Sep 18, 2025, 3:45 PM

#

exotic harness

- they have to retain logs indefinitely thanks to the new york times https://openai.com/index/response-to-nyt-data-demands/
- azure endpoints have been appearing without moderation (OR runs LlamaGuard per big model lab's requests)
- there's a unique id that doesn't identify you personally, but does link your message history together to some degree. i believe they can essentially request user ids be banned from their endpoints

honest ridge Sep 18, 2025, 9:51 PM

#

Fortunately, here in Europe, the New York Times would receive a big raspberry from continental judges if it made such a request for EU citizens.

wet estuary Sep 19, 2025, 12:18 AM

#

gleaming quartz - they have to retain logs indefinitely thanks to the new york times https://ope...

> This does not impact API customers who are using Zero Data Retention endpoints under our ZDR amendment.
so does OpenRouter not have a ZDR agreement with OpenAI? It seems like it'd be pretty standard

gleaming quartz Sep 19, 2025, 2:18 AM

#

wet estuary > This does not impact API customers who are using Zero Data Retention endpoints...

Nope https://openrouter.ai/docs/features/zdr#zero-retention-endpoints

OpenRouter Documentation

Zero Data Retention - How OpenRouter gives you control over your data

#

but, as we know, they seem to have a pretty friendly relationship, so who knows what they're really up to!

glacial tiger Sep 19, 2025, 8:35 AM

#

Hi! I have a verified OpenAI account and stream mode works perfectly with gpt-5 when I use OpenAI API. But, when I try to make the same request through openrouter, I receive the whole message in the stream, not chunks. What could be done to fix this?

#

import asyncio

from openai import AsyncOpenAI


async def async_main() -> None:
    # Reads OPENAI_API_KEY from the environment.
    client = AsyncOpenAI()
    stream = await client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": "Say what the weather is today in Warsaw?"}],
        stream=True,
    )

    async for chunk in stream:
        # Some chunks may carry no choices, so guard before indexing.
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print("Received chunk", chunk.choices[0].delta.content)


asyncio.run(async_main())

Output example:

Received chunk  like
Received chunk  to
Received chunk  check
Received chunk  quickly
Received chunk  yourself
Received chunk :
Received chunk  https
Received chunk ://
Received chunk wt
Received chunk tr
Received chunk .in
Received chunk /K
Received chunk y
Received chunk iv

#

BUT When trying openrouter

import openai

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="my-key"
)

response = client.chat.completions.create(
    model="openai/gpt-5",  # use an available model
    messages=[{"role": "user", "content": "What the weather is like in kyiv"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print('Got chunk: ', chunk.choices[0].delta.content, flush=True)

Output:

Got chunk:  Do you want the current weather in Kyiv right now, or a general idea of what it’s like?

- If you mean right now: I don’t have live data access. Tell me the time and your preferred units (°C/°F), and I’ll estimate or guide you to a quick check.
- Typical for this time of year (mid–late September) in Kyiv: mild early autumn. Daytime highs around 18–20°C (64–68°F), nights 9–12°C (48–54°F), occasional light rain, and a mix of sun and clouds. Light jacket or sweater weather.

So only one big chunk is received
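To make the comparison above measurable, here is a small offline helper that summarizes a sequence of streamed text deltas. The name `summarize_stream` is hypothetical and the sample data is shortened from the two outputs in this message; it only counts chunks and does not explain why a proxy might buffer them.

```python
from typing import Iterable


def summarize_stream(deltas: Iterable[str]) -> dict:
    """Count non-empty text deltas and report how large they are."""
    sizes = [len(d) for d in deltas if d]
    return {
        "chunks": len(sizes),
        "largest": max(sizes, default=0),
        "total_chars": sum(sizes),
    }


# Shortened samples based on the two outputs shown above.
incremental = [" like", " to", " check", " quickly"]
single = ["Do you want the current weather in Kyiv right now?"]

print(summarize_stream(incremental))
print(summarize_stream(single))
```

In the real loop the deltas could be fed in as `(c.choices[0].delta.content for c in response if c.choices)`; a result with one chunk holding all the characters points to buffering somewhere between the provider and the client.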

unborn crater Sep 19, 2025, 9:04 AM

#

dayum??

#

does someone know, why I am getting this?

#

yes, its paid

signal peak Sep 19, 2025, 10:12 AM

#

Only started using GPT5 because of the discount and blown away by how efficient its token use is. Hopefully Claude's next model has the same efficiency.

halcyon lark Sep 19, 2025, 11:49 AM

#

exotic harness

It's the NYT lawsuit. They don't train on API data, but they are legally forced to store the prompts and outputs in plain text forever so that the NYT can read them to "find copyright infringements"

#

The whole US copyright system is broken and OpenAI appealed many times but it is what it is. Use Azure OpenAI, that one is ZDR, if you have a valid use case

placid cipher Sep 19, 2025, 6:11 PM

#

summer gyro Added cost for running my evals. Sonnet 4 is king in terms of performance / cost...

gpt5-mini gonna destroy all of those in terms of price/performance lol

summer gyro Sep 19, 2025, 6:12 PM

#

placid cipher gpt5-mini gonna destroy all of those in terms of price/performance lol

gpt-5.1?

placid cipher Sep 19, 2025, 6:13 PM

#

summer gyro gpt-5.1?

lmao. I was thinking about 1 thing and wrote another

#

I blame gpt4.1 for it

#

fixed

#

gpt5-mini has insane performance for the price

#

even better than o4-mini-high

#

It's just less popular because of the name. They were smart with o4-mini naming. It didn't look "lesser model" than o3 purely because of the name

#

gpt5-mini-high is still gonna be better than gpt5-low in like all cases and better than gpt5-medium in some cases though.

summer gyro Sep 19, 2025, 8:22 PM

#

placid cipher gpt5-mini-high is still gonna be better than gpt5-low in like all cases and bett...

cool. i think that makes sense.

terse pendant Sep 21, 2025, 3:20 AM

#

#announcements message

livid osprey Sep 22, 2025, 11:08 AM

#

this is promising, finally running mini at a reasonable speed

short verge Sep 22, 2025, 9:01 PM

#

https://fxtwitter.com/shreyk0/status/1970160146975445192

Shrey Kothari (@shreyk0)

Introducing Among AIs, a social reasoning benchmark where embodied models play Among Us to test social intelligence: deception, persuasion, and coordination.

We put 6 SOTA models in a live arena and GPT-5 came out on top by leading in Impostor & Crewmate wins. Why did GPT-5 get the highest scores? Why Among AIs?

Let’s break it down 👇

short verge Sep 22, 2025, 9:35 PM

#

GPT-5 had the lowest number of wrongful ejections as crew too, even as an overall master of deception.

GPT-5 is a master at rolemaxxing, playing according to assigned role

sly pike Sep 23, 2025, 1:41 PM

#

How is the OSS GPT model better than 2.5 pro and Qwen A22B 💀

limber cargo Sep 24, 2025, 6:32 PM

#

gpt 5 discount gone 🙁

supple valve Sep 25, 2025, 8:56 AM

#

why? please ! gemini is ok !

untold plaza Sep 25, 2025, 10:38 PM

#

supple valve why? please ! gemini is ok !

Which country is the server in

cold hinge Sep 29, 2025, 6:32 AM

#

We did some testing at NonBioS around using GPT-5 as replacement for Sonnet 4.0. Our analysis shows that Sonnet 4.0 is still better, even after the degradation. We used GPT-5 medium thinking as that had an acceptable latency for our tasks. But despite the latency we think it still lags behind Sonnet 4.0 in raw intelligence specifically around long context agentic tasks.

There are some reports that GPT-5 high thinking might be better than Sonnet 4.0, but given the borderline unacceptably high latency of even the 'medium thinking' version, we did not even consider it as a replacement. We are planning to run those tests with GPT-5 high though, and will report here.

short verge Sep 29, 2025, 6:36 PM

#

https://fxtwitter.com/AndrewCurran_/status/1972378470144679967

Andrew Curran (@AndrewCurran_)

Scott Aaronson has, for the first time, put out a paper in which a key technical step in the proof of the main result came from AI. He describes his process using GPT5-Thinking.

'There's not the slightest doubt that, if a student had given it to me, I would've called it clever'

Quoting Sebastien Bubeck (@SebastienBubeck)

Yet more evidence that a pretty major shift is happening, this time by Scott Aaronson

scottaaronson.blog/?p=9183&fbclid=Iwb21leANGXI1leHRuA2FlbQIxMQABHhhc-brYxsroE9QSWVS64u9EhPyULyGu340pc2Kz390HXvqRiIqT5z97eovH_aem_u5H_DiLrxt3onieDqzm7cw

**💬 35 🔁 120 ❤️ 1.3K 👁️ 328.8K **

short verge Sep 29, 2025, 8:59 PM

#

https://fxtwitter.com/DmitryRybin1/status/1968629827738505230

Dmitry Rybin (@DmitryRybin1)

New short blogpost: GPT-5 and o3 helped me prove a new theorem on matrix multiplication

I show that the fastest way to multiply a collection of NxN matrices A_1 A_2 ... A_k is sequential

For some reason there was no literature even on multiplication of three 2x2 matrices ABC

**💬 15 🔁 53 ❤️ 793 👁️ 75.3K **

short verge Sep 30, 2025, 6:27 PM

#

https://matharena.ai/

MathArena.ai

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

#

The Project Euler section includes computer science questions along with math

https://projecteuler.net/

About - Project Euler

A website dedicated to the fascinating world of mathematics and programming

exotic harness Oct 1, 2025, 9:41 AM

#

Hm one interesting thing I didn’t fully process about the spiral-bench results is how much lower gpt-5-chat (chatgpt 5) is than gpt-5. Interesting (though note that sonnet 4 was much lower, close to the bottom)

Actually, just realised that since the judge model is gpt-5 itself, not surprising it liked its responses the most. Take with a grain of salt.

short verge Oct 1, 2025, 9:48 AM

#

this is not the first benchmark that shows that gpt-5 tends to grade itself higher

exotic harness Oct 1, 2025, 1:22 PM

#

exotic harness Hm one interesting thing I didn’t fully process about the spiral-bench results i...

Wait a minute, wouldn’t it be good to also have results for them with their default system prompts? Like, sonnet and gpt-5-chat might be made to use system prompts in the normal interface that aren’t being used here. Having the same context for all the models does make them more comparable though (so per-model prompts would arguably ruin the comparison), so it’s not an easy thing, but more data would be nice to have

short verge Oct 4, 2025, 7:46 PM

#

https://fxtwitter.com/ChowdhuryNeil/status/1969816106710786303

Neil Chowdhury (@ChowdhuryNeil)

The breakdown of SWE-Bench Pro failures is interesting: GPT-5 doesn't submit to 63.1% of tasks, due to tool use errors?

This means GPT-5 has a *much* higher precision than Claude Opus 4.1. Still not sure what the tool use errors are about though. 🤔

Quoting Bing Liu (@vbingliu)

🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks.

This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.

**💬 1 ❤️ 15 👁️ 1.7K **

#

https://fxtwitter.com/itsbyrobin/status/1974011277014937606

Robin Ebers | AI Coding Mentor (@itsbyrobin)

same number of tokens
yet over 4x more expensive
while delivering inferior results

how is nobody talking about this

**💬 26 🔁 5 ❤️ 112 👁️ 29.5K **

short verge Oct 4, 2025, 8:24 PM

#

https://fxtwitter.com/pvncher/status/1974198697933353087

eric provencher (@pvncher)

First results are out for the @RepoPrompt benchmark!

Repo Bench is a test set designed to push models on instruction following, large context reasoning, and precision file editing.

Gearing up to release this shortly in the next update so you can run the bench yourself

**💬 26 🔁 25 ❤️ 235 👁️ 33.0K **

placid cipher Oct 10, 2025, 7:30 AM

#

cold hinge We did some testing at NonBioS around using GPT-5 as replacement for Sonnet 4.0....

Sounds like a niche thing you are doing there with your custom function calling that probably won't apply to most people. For most of your tasks there shouldn't be that much latency difference between high and medium though - it's still reasonably concise for things not requiring a ton of thinking

#

You could experiment with it, say by trying reasoning_effort high and verbosity low
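
Roughly what I mean, as a minimal sketch rather than a tested setup. It only builds the request body and sends nothing. `build_request` is a made-up helper, and I'm assuming the OpenAI-style Chat Completions fields `reasoning_effort` and `verbosity` at the top level; through a router like OpenRouter the field names may differ, so check the provider docs first.

```python
import json

# Values I believe the OpenAI-style API accepts for GPT-5 (assumption; verify in the docs).
ALLOWED_EFFORT = {"minimal", "low", "medium", "high"}
ALLOWED_VERBOSITY = {"low", "medium", "high"}


def build_request(prompt, reasoning_effort="high", verbosity="low", model="gpt-5"):
    """Build a chat-completions style request body (hypothetical helper, no network call)."""
    if reasoning_effort not in ALLOWED_EFFORT:
        raise ValueError(f"unknown reasoning_effort: {reasoning_effort!r}")
    if verbosity not in ALLOWED_VERBOSITY:
        raise ValueError(f"unknown verbosity: {verbosity!r}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # More thinking, shorter final answer.
        "reasoning_effort": reasoning_effort,
        "verbosity": verbosity,
    }


if __name__ == "__main__":
    payload = build_request("Summarize this diff in two sentences.")
    print(json.dumps(payload, indent=2))
```

Then time the same prompts at medium and high to see whether the latency gap actually shows up on your workload.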

cold hinge Oct 10, 2025, 6:40 PM

#

placid cipher Sounds like a niche thing you are doing there that probably won't apply to most ...

We do zero custom function calls. Pretty sure that the latency difference high vs medium is substantial - we tested it out and this is consistent with what others are reporting too.

spark pine Oct 17, 2025, 10:33 AM

#

exotic harness Hm one interesting thing I didn’t fully process about the spiral-bench results i...

In case you didn't see, he updated Spiral-Bench so it uses multiple judges now.

And it is kind of odd, but that's the price of lobotomizing and user-maxxing I guess =P I thought 4o gargled balls for that reason too. I actually don't get how that even works right now. Free users get routed between mini, chat, and 5-low? Then paid you get 5-med, and 5-pro is API only? What a nightmare, they need to stop doing this

exotic harness Oct 17, 2025, 10:49 AM

#

spark pine In case you didn't see, he updated Spiral-Bench so it uses multiple judges now. ...

ooo ty for telling me, i didnt know. it looks much better now

lean barn Oct 21, 2025, 7:47 PM

#

I'm trying to raise the point here that Deep Research models miss half the response and the pricing doesn't seem OK:
https://discord.com/channels/1091220969173028894/1429847358803279872

#

Please have a look

#

I mean for pricing I think OR might be losing money on them

potent oak Oct 22, 2025, 2:45 AM

#

GPT-5 is looking much faster in RooCode
