#How to configure auto scaling for load balancing endpoints?

uneven smelt
#

From the documentation: "The method used to scale up workers on the created Serverless endpoint. If QUEUE_DELAY, workers are scaled based on a periodic check to see if any requests have been in queue for too long. If REQUEST_COUNT, the desired number of workers is periodically calculated based on the number of requests in the endpoint's queue. Use QUEUE_DELAY if you need to ensure requests take no longer than a maximum latency, and use REQUEST_COUNT if you need to scale based on the number of requests."

From what I understand the load balancing endpoints don't have a queue? How do I configure the auto scaling to work with serverless endpoints?

torn arch
#

@uneven smelt hey, what do you mean by that? When no worker is available, your requests are automatically queued until one frees up

#

The reason they give you those two options is that either:

  • You have a fast operation (so you want to base scaling on QUEUE_DELAY)
  • You have a long operation (so you want to use REQUEST_COUNT)
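
As a rough illustration of the two scaling methods quoted from the documentation above, here is a minimal sketch. The function names and the `max_delay_s` and `requests_per_worker` parameters are hypothetical, made up for this example; the thread does not show how RunPod actually implements its scaler.

```python
import math


def workers_by_queue_delay(current_workers, oldest_wait_s, max_delay_s, max_workers):
    """QUEUE_DELAY-style check: add one worker if a request has waited too long.

    Hypothetical policy for illustration only.
    """
    if oldest_wait_s > max_delay_s:
        return min(current_workers + 1, max_workers)
    return current_workers


def workers_by_request_count(queued_requests, requests_per_worker, max_workers):
    """REQUEST_COUNT-style check: size the pool from the number of queued requests.

    Hypothetical policy for illustration only.
    """
    desired = math.ceil(queued_requests / requests_per_worker)
    return min(desired, max_workers)
```

The first function reacts to latency (how long the oldest request has waited), the second to volume (how many requests are waiting), which is the distinction the documentation draws.
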
uneven smelt
#

So load balancing endpoints also have a queue? With several endpoints like /generate, /search and /enhance, is REQUEST_COUNT based on all the requests the server receives?

#

For example if you are using FastAPI?

torn arch
#

Ooooh, no

#

I thought you meant a serverless endpoint, that's what they call each "serverless project"

#

You can duplicate your project, and in your code trigger a different "serverless endpoint" based on the URL

#

that way you will be able to better track how long each URL's operation takes

#

otherwise all your data will be merged, and your metrics will be impacted

#

That's why I recommend splitting each operation into its own "serverless endpoint"

uneven smelt
#

Why not a load balancing endpoint with several endpoints? I have several models on one server; splitting them will add a lot of cost

torn arch
#

When you talk about endpoints, do you mean your FastAPI endpoints?
And what do you mean by "on one server"?

uneven smelt
#

Yeah, FastAPI, several endpoints, e.g. /generate, /search, /enhance, in one Docker image

torn arch
#

Ok, so here is why I do not recommend this:

#

(And thanks for the clarification)

#

When you want to generate something, you will load a model on a GPU, which will be different from the one used for search or enhance, am I correct?

uneven smelt
#

All on the same GPU

#

For example /generate (model.pth), /search (model_search.pth and model.pth), and /enhance (model_enhance.pth)

torn arch
#

This model will take time to load, and you will pay for that time. Once the model is loaded on your GPU, it won't have to be loaded into VRAM again each time you trigger a worker.

If you put multiple models on the same GPU, you can hit the maximum VRAM capacity, and some models will then be unloaded. Each operation would then have to reload its model, so it will be slower and cost more.

uneven smelt
#

They all fit in the VRAM

torn arch
#

All of them AT THE SAME TIME?

uneven smelt
#

Yeah

torn arch
#

Ok ok, then it's fine for this.
Now, if you have different operations, some longer than others, the longer ones might block the shorter ones, which will end up queued behind them

#

let's say you have a 120s operation and a 2s one: if you put them in the same Docker image, the 2s one might be queued for 120s
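
To make the 120s / 2s example concrete, here is a minimal sketch of a single worker that processes jobs strictly in arrival order. It assumes both requests arrive at the same moment and that the worker handles one request at a time; neither assumption is confirmed in this thread.

```python
def fifo_wait_times(durations_s):
    """Return how long each job waits before it starts on one FIFO worker."""
    waits = []
    clock = 0.0
    for duration in durations_s:
        waits.append(clock)  # the job starts when all earlier jobs are done
        clock += duration
    return waits


# The 120s job arrives first, the 2s job right behind it.
print(fifo_wait_times([120.0, 2.0]))  # [0.0, 120.0]
```

The short job waits the full 120s even though it only needs 2s of work, which is the blocking effect described above.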

#

and as a third reason, if you split your operations, you can set up cheaper GPUs for the lighter ones

uneven smelt
#

All are between 10-100ms

torn arch
#

Ok, so in the end you don't need a per-endpoint queue?

#

the RunPod queuing system will be enough

uneven smelt
#

I need it to scale once it hits, say, 30 requests per second

torn arch
#

by setting a higher number of max workers and playing with REQUEST_COUNT or QUEUE_DELAY, RunPod will automatically boot new workers to handle your requests

#

and load balance them
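
As a back-of-the-envelope check for the numbers mentioned in this thread (30 requests per second, 10-100 ms per request), Little's law says the average number of requests in flight is arrival rate times latency. The sketch below turns that into a rough worker count. The `per_worker_concurrency` parameter is an assumption, since the thread does not say how many requests one worker handles at once.

```python
import math


def workers_needed(rate_per_s, latency_ms, per_worker_concurrency=1):
    """Estimate workers from Little's law: in-flight = rate * latency."""
    in_flight = rate_per_s * latency_ms / 1000
    return max(1, math.ceil(in_flight / per_worker_concurrency))


print(workers_needed(30, 100))  # 3 (worst case, 100 ms per request)
print(workers_needed(30, 10))   # 1 (best case, 10 ms per request)
```

This is only an average; bursts above 30 requests per second would need headroom in the max workers setting.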

uneven smelt
#

nice!

torn arch
#

you can try it yourself: set 2 max workers and send 4 requests. In the "request" panel you should see that they are handled by different workers (GPUs)
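
A minimal sketch of that experiment, sending several requests at the same time from Python's standard library. `ENDPOINT_URL` is a placeholder, and the plain GET without headers is an assumption; a real endpoint will likely need its own URL, HTTP method and authentication.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Placeholder URL (assumption): replace with your own endpoint.
ENDPOINT_URL = "https://example.invalid/generate"


def send_one(i, url=ENDPOINT_URL):
    """Send one GET request and return (request index, HTTP status)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return i, resp.status


def fire(n, send=send_one):
    """Send n requests concurrently and return their results in order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(send, range(n)))


# Usage (performs real network calls, so it is left commented out):
# print(fire(4))
```

After running it against a real endpoint, the "request" panel mentioned above is where you would check which workers picked up the requests.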