#gpt-3.5-turbo requests per minute limits (stream=True)


final orchid
#

Context: My application has a time limit for responding, so to make sure it always returns something I'm using stream=True. In the first two days in production (03/22 and 03/23), some requests took longer than the time limit to return the first partial response (sometimes more than 30 seconds). That led me to believe I was exceeding the API rate limits and being put in a queue as a result. I've since updated my code to handle this case, but now, even with more traffic, it's no longer happening.

I would like to know whether each partial response counts as a request when using stream=True. If so, does the request limit remain the same for stream=False and stream=True? (A single stream=True call can easily produce hundreds of partial responses.)

halcyon plinth
#

Each initial request to an endpoint that supports streaming counts as 1 request. Everything that follows, including the streamed chunks and the end of the stream, falls under that same request.
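
If you want to track this on your side, here is a minimal sketch of a client-side tally that follows the same rule: one entry per call, no matter how many chunks come back. `RequestWindow` and `consume_stream` are hypothetical helpers, not part of any OpenAI library, and `chunks` stands in for whatever iterator your client returns with stream=True.

```python
import time
from collections import deque


class RequestWindow:
    """Client-side tally of API calls sent within the last window (hypothetical helper)."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self._stamps = deque()

    def record_request(self, now=None):
        # Call once per API call, at the moment the request is sent.
        self._stamps.append(time.monotonic() if now is None else now)

    def count(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        return len(self._stamps)


def consume_stream(window, chunks):
    """Tally one request, then read every chunk without touching the tally."""
    window.record_request()
    return [chunk for chunk in chunks]
```

Reading 300 chunks through `consume_stream` leaves `window.count()` at 1, which is how the limit treats a streamed call as described above.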

#

Sometimes a 30-second delay before getting a response from the API is normal, especially during outages and peak traffic.
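
Since delays like that can happen, one option is to enforce your own deadline on the first chunk and fall back to something else if it doesn't arrive in time. This is only a sketch: `first_chunk_or_fallback` is a hypothetical helper, and `stream` is assumed to be a blocking iterator of chunks such as the one you get with stream=True.

```python
import concurrent.futures


def first_chunk_or_fallback(stream, timeout_s, fallback):
    """Return the first chunk from `stream`, or `fallback` if it takes longer than `timeout_s`.

    Returns None if the stream ends without producing any chunk.
    """
    iterator = iter(stream)
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    # Fetch the first chunk in a worker thread so the caller can stop waiting.
    future = executor.submit(next, iterator, None)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker is still blocked on the stream, so don't reuse `stream` here.
        return fallback
    finally:
        # Don't block on the worker; it finishes on its own when the chunk arrives.
        executor.shutdown(wait=False)
```

If the first chunk arrives in time, you can keep iterating over the same stream for the rest of the response.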

#

You weren't hitting any rate limits. If you were, the API would tell you so in its error message.
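
Because a rate limit shows up as an explicit error (an HTTP 429 response) rather than as a silent delay, you can handle it separately from slowness. Below is a rough sketch of retrying with exponential backoff. `RateLimited` is a stand-in for whatever rate-limit exception your client library raises, and `call_with_backoff` is a hypothetical helper.

```python
import time


class RateLimited(Exception):
    """Stand-in for the rate-limit error raised by your client (e.g. on HTTP 429)."""


def call_with_backoff(fn, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call `fn`, retrying only on RateLimited, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the caller see the error.
            sleep(base_delay_s * 2 ** attempt)
```

Any other exception, or a plain slow response, passes through untouched, so the two cases stay easy to tell apart in your logs.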