# How to set max concurrency per worker for a load balancing endpoint?

wet elm

I'm trying to configure the maximum concurrency for each worker on my serverless load balancing endpoint, but I can't seem to find the setting in the new UI.

broken goblet

It's set through an environment variable, right?
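
If it is an environment variable, the worker would read it once at startup. A minimal sketch, assuming a hypothetical variable named `MAX_CONCURRENCY` (the thread doesn't name the actual variable or confirm that one exists):

```python
import os
from typing import Mapping

# Hypothetical variable name; not confirmed as something the platform sets or reads.
ENV_VAR = "MAX_CONCURRENCY"


def read_max_concurrency(env: Mapping[str, str] = os.environ, default: int = 1) -> int:
    """Return a positive per-worker concurrency limit taken from the environment."""
    raw = env.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return default  # variable unset or empty: fall back to the default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{ENV_VAR} must be an integer, got {raw!r}") from None
    if value < 1:
        raise ValueError(f"{ENV_VAR} must be >= 1, got {value}")
    return value
```

Passing a plain dict, e.g. `read_max_concurrency({"MAX_CONCURRENCY": "8"})`, makes the function easy to test without touching the real environment.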

What does maximum concurrency do in vLLM?
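
Assuming the question is about how many requests vLLM processes at once, one engine argument that bounds this is `max_num_seqs` (`--max-num-seqs`), which caps how many sequences are scheduled together; requests beyond the cap wait in a queue. A toy model of that effect, assuming every running sequence emits one token per step and ignoring KV-cache limits (this is a simplification, not vLLM's actual scheduler):

```python
from collections import deque
from typing import Iterable


def simulate_decode_steps(output_lengths: Iterable[int], max_num_seqs: int) -> int:
    """Toy continuous batching: count decode steps until every request finishes.

    Assumes one token per running sequence per step and that memory never
    blocks admission (real vLLM scheduling also depends on the KV cache).
    """
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be >= 1")
    waiting = deque(output_lengths)  # tokens still to generate, per queued request
    running = []                     # tokens still to generate, per running request
    steps = 0
    while waiting or running:
        # Admit queued requests until the concurrency cap is reached.
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.popleft())
        # One decode step for every running sequence; drop the ones that finished.
        running = [left - 1 for left in running if left - 1 > 0]
        steps += 1
    return steps
```

For example, three requests of 2 tokens each take 4 steps with `max_num_seqs=2` and 2 steps with `max_num_seqs=3`, because the third request has to wait for a free slot.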

coarse wagon

@wet elm, were you setting up your endpoint for the first time? When you create an endpoint on Serverless, we do the calculation for you. Once the endpoint is set up, you can edit it and adjust the value as needed.

wet elm
broken goblet

I think it's only the request count.

sterile ledge

Does this work if we set the concurrency within FastAPI itself, since it supports custom endpoints? Could a single worker process n parallel requests this way, with the concurrency set to n?
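
A minimal sketch of capping concurrency inside one worker process with `asyncio.Semaphore`. `ConcurrencyLimiter`, `fake_generate`, and `main` are hypothetical names; `fake_generate` stands in for an awaitable inference call:

```python
import asyncio


class ConcurrencyLimiter:
    """Caps how many requests one worker process handles at the same time."""

    def __init__(self, max_concurrency: int) -> None:
        if max_concurrency < 1:
            raise ValueError("max_concurrency must be >= 1")
        self._sem = asyncio.Semaphore(max_concurrency)
        self.in_flight = 0  # requests currently inside the semaphore
        self.peak = 0       # highest in_flight value observed

    async def run(self, coro_fn, *args):
        async with self._sem:  # extra requests wait here once the cap is reached
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_fn(*args)
            finally:
                self.in_flight -= 1


async def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for an async inference call.
    await asyncio.sleep(0.01)
    return prompt.upper()


async def main(n_requests: int, max_concurrency: int):
    limiter = ConcurrencyLimiter(max_concurrency)
    results = await asyncio.gather(
        *(limiter.run(fake_generate, f"req{i}") for i in range(n_requests))
    )
    return results, limiter.peak
```

In a FastAPI app, an `async def` route could `await limiter.run(...)` so that requests beyond the cap queue inside the worker. This only overlaps work if the inference call really is awaitable, and how many requests actually reach one worker still depends on the endpoint's own per-worker setting.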

broken goblet

Hmm, I'm not sure about that.

How do concurrent handlers work with load balancing? @coarse wagon, is that supported, and what's the logic for distributing requests across workers?
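
The thread doesn't describe the balancer's actual routing logic. As a point of reference only, here is one common strategy (least in-flight requests with a fixed per-worker cap), written as a hypothetical `ToyBalancer`; it is not presented as how the platform routes:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyBalancer:
    """Illustrative least-in-flight router with a fixed per-worker request cap."""

    requests_per_worker: int
    in_flight: Dict[str, int] = field(default_factory=dict)

    def add_worker(self, worker_id: str) -> None:
        self.in_flight.setdefault(worker_id, 0)

    def route(self) -> Optional[str]:
        """Pick the least-loaded worker with spare capacity, or None if all are full."""
        candidates = [w for w, n in self.in_flight.items() if n < self.requests_per_worker]
        if not candidates:
            return None  # a real system might queue the request or scale up here
        # Ties go to the worker added first (dict preserves insertion order).
        worker = min(candidates, key=lambda w: self.in_flight[w])
        self.in_flight[worker] += 1
        return worker

    def finish(self, worker_id: str) -> None:
        self.in_flight[worker_id] -= 1
```

With two workers and `requests_per_worker=2`, the first four calls to `route()` alternate between the workers and the fifth returns `None` until a request finishes.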

brisk cipher

I have a question.

What happens if the worker changes its /ping response code to 204 when the vLLM worker is overloaded?

Does it make the load balancer not route more requests to it?
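
The thread doesn't confirm how the balancer treats a 204 from a worker that is already running, so this is only a sketch of the idea being asked about. `LoadAwarePing` and `overload_threshold` are hypothetical names, and the threshold policy is an assumption:

```python
import threading


class LoadAwarePing:
    """Tracks in-flight requests and chooses a /ping status code (hypothetical policy)."""

    HEALTHY = 200
    BUSY = 204  # the code proposed in the question; its effect on routing is unconfirmed

    def __init__(self, overload_threshold: int) -> None:
        self.overload_threshold = overload_threshold
        self._in_flight = 0
        self._lock = threading.Lock()  # sync routes may run on several threads

    def start_request(self) -> None:
        with self._lock:
            self._in_flight += 1

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def ping_status(self) -> int:
        with self._lock:
            busy = self._in_flight >= self.overload_threshold
        return self.BUSY if busy else self.HEALTHY
```

In a FastAPI app, the `/ping` route could return `Response(status_code=tracker.ping_status())`, with the inference route calling `start_request()` and `end_request()` around each request. Whether this actually diverts traffic is the open question and would need testing against the real endpoint.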

tidal fern

Use the requests-per-worker setting. At the moment it's a constant value, so your FastAPI app should be able to handle that many requests in parallel.
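
Assuming that setting means one worker may receive up to N requests at once, the app has to actually overlap them. One common pitfall: inside an `async def` handler, a blocking call stalls the whole event loop. A minimal stdlib sketch (no FastAPI; the handlers are hypothetical stand-ins) that counts how many requests really overlap:

```python
import asyncio
import time


async def handler_async(state: dict) -> None:
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # yields to the event loop, so other requests proceed
    state["active"] -= 1


async def handler_blocking(state: dict) -> None:
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.01)  # blocks the event loop: nothing else runs meanwhile
    state["active"] -= 1


async def peak_concurrency(handler, n_requests: int) -> int:
    """Run n_requests handlers concurrently and report the maximum overlap."""
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(handler(state) for _ in range(n_requests)))
    return state["peak"]
```

With five requests, the awaiting handler overlaps all five while the blocking one handles them one at a time. FastAPI runs plain `def` routes in a thread pool instead, so blocking code there does not stall the event loop, although the pool size then limits parallelism.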