#How to set max concurrency per worker for a load balancing endpoint?
@wet elm, were you initially setting up your endpoint? When you create an endpoint on Serverless, we do the calculation for you. Once the endpoint is set up, you can edit it and adjust as needed.
I mean that in a queue-based Serverless worker, there is a concurrent handler that lets me control concurrency for each worker (https://docs.runpod.io/serverless/workers/concurrent-handler). I want to know how to get the same control for load balancing workers.
I think it's the request count only
Does this work if we set the concurrency within the FastAPI app itself, since it supports custom endpoints? Can we process n parallel requests with a concurrency of n on 1 worker this way?
Hmm I'm not sure of that
How do concurrent handlers work on load balancing? @coarse wagon, is it supported, or what's the logic for balancing across the workers?
I got a question
What happens if a worker changes its /ping endpoint response code to 204 when the vLLM worker is overloaded?
Does that make the load balancer stop routing more requests to it?
Set the requests-per-worker setting. At the moment it's constant. Your FastAPI app should be able to handle that many requests in parallel.
Oh the scaling config?
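For anyone landing here later: a minimal sketch of what "your FastAPI app should be able to handle that many requests in parallel" could look like on the app side. It uses only the standard library, so it shows the capping technique rather than a real FastAPI route. `MAX_CONCURRENCY`, `ConcurrencyGate`, and `fake_inference` are hypothetical names for illustration, not part of any Runpod or FastAPI API, and the limit value is an assumption you would match to your endpoint's requests-per-worker setting.

```python
import asyncio

# Hypothetical value: set it to match the endpoint's requests-per-worker setting.
MAX_CONCURRENCY = 4


class ConcurrencyGate:
    """Caps in-flight work with a semaphore and records the observed peak."""

    def __init__(self, limit: int) -> None:
        self._sem = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0

    async def run(self, coro_fn, *args):
        # Callers beyond the limit wait here until a slot frees up.
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_fn(*args)
            finally:
                self.in_flight -= 1


async def fake_inference(i: int) -> int:
    # Stand-in for the real model call (an assumption for this sketch).
    await asyncio.sleep(0.01)
    return i * 2


async def main(n_requests: int = 10, limit: int = MAX_CONCURRENCY):
    gate = ConcurrencyGate(limit)
    # Fire all requests at once; the gate lets only `limit` run concurrently.
    results = await asyncio.gather(
        *(gate.run(fake_inference, i) for i in range(n_requests))
    )
    return results, gate.peak


if __name__ == "__main__":
    results, peak = asyncio.run(main())
    print(results, peak)
```

In an async web app, the same gate could wrap the body of a route handler so that one worker never runs more than the configured number of requests at once. Whether the load balancer itself enforces the limit, or reacts to a 204 from /ping, is not confirmed in this thread.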