How to configure auto scaling for load balancing endpoints? | Runpod | Page 1

uneven smelt Oct 1, 2025, 9:22 AM

#

From the documentation: "The method used to scale up workers on the created Serverless endpoint. If QUEUE_DELAY, workers are scaled based on a periodic check to see if any requests have been in queue for too long. If REQUEST_COUNT, the desired number of workers is periodically calculated based on the number of requests in the endpoint's queue. Use QUEUE_DELAY if you need to ensure requests take no longer than a maximum latency, and use REQUEST_COUNT if you need to scale based on the number of requests."

From what I understand the load balancing endpoints don't have a queue? How do I configure the auto scaling to work with serverless endpoints?

torn arch Oct 1, 2025, 9:48 AM

#

@uneven smelt hey, what do you mean by that? When no worker is available, your requests are automatically queued until one becomes available

#

They give you those two options to cover two cases:

You have a fast operation (so you want to scale on QUEUE_DELAY)
You have a long operation (so you want to scale on REQUEST_COUNT)
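
To make the difference between the two options concrete, here is a minimal sketch of what each rule could look like. The thread and the quoted documentation do not give the exact formulas, so the ceiling division, the function names, and the parameters `requests_per_worker` and `max_delay_s` are all assumptions for illustration, not Runpod's implementation.

```python
import math


def desired_workers_request_count(in_queue: int, in_progress: int,
                                  requests_per_worker: int,
                                  max_workers: int) -> int:
    """REQUEST_COUNT-style rule (assumed formula): size the pool from pending requests."""
    wanted = math.ceil((in_queue + in_progress) / requests_per_worker)
    return min(max_workers, wanted)


def should_scale_up_queue_delay(wait_times_s: list, max_delay_s: float) -> bool:
    """QUEUE_DELAY-style rule (assumed): add a worker if any request has waited too long."""
    return any(wait > max_delay_s for wait in wait_times_s)


# 7 queued + 1 running, 4 requests per worker, capped at 10 workers
print(desired_workers_request_count(7, 1, 4, 10))
# One request has waited 5 s against a 4 s limit
print(should_scale_up_queue_delay([0.5, 5.0], 4.0))
```

The first rule reacts to how much work is pending, the second to how long the oldest request has been waiting, which is why the choice depends on whether you care more about volume or about latency.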

uneven smelt Oct 1, 2025, 9:52 AM

#

So load balancing endpoints also have a queue? With several endpoints like /generate, /search, and /enhance, is the REQUEST_COUNT based on all the requests the server receives?

#

For example if you are using FastAPI?

torn arch Oct 1, 2025, 9:53 AM

#

Ooooh, no

#

I thought you meant serverless endpoint, that's what they call each "serverless project"

#

You can duplicate your project, and in your code trigger a different "serverless endpoint" based on the URL

#

that way you can track the duration of each URL's operation separately

#

otherwise all your data will be merged, and your metrics will be impacted

#

That's why I recommend splitting each operation into its own "serverless endpoint"

uneven smelt Oct 1, 2025, 9:57 AM

#

Why not load balancing endpoints with several endpoints? I have several models on one server; splitting them would add a lot of cost

torn arch Oct 1, 2025, 9:59 AM

#

When you talk about endpoints, do you mean your FastAPI endpoints?
And what do you mean by "on one server"?

uneven smelt Oct 1, 2025, 9:59 AM

#

Yeah, FastAPI, several endpoints, e.g. /generate, /search, /enhance, in one Docker image

torn arch Oct 1, 2025, 9:59 AM

#

OK, so here is why I don't recommend this:

#

(And thanks for the precise explanation)

#

When you want to generate something, you load a model onto a GPU, and that model is different from the one used for search or enhance, am I correct?

uneven smelt Oct 1, 2025, 10:00 AM

#

All on the same GPU

#

For example /generate (model.pth), /search (model_search.pth and model.pth), and /enhance (model_enhance.pth)

torn arch Oct 1, 2025, 10:02 AM

#

This model will take time to load, and you pay for that time. Once the model is loaded onto your GPU, it won't have to be loaded into VRAM again each time the worker is triggered.

If you put multiple models on the same GPU, you can hit the max VRAM capacity, and some models will be unloaded. Each operation then has to reload its model, so it will be slower and cost more.
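
The load-once behaviour and the VRAM concern can be sketched as below. This is a minimal illustration only: the model sizes and the 24 GB figure are made-up values (the thread gives file names but no sizes), and `get_model` is a hypothetical stand-in for a real loader such as a framework's checkpoint-loading call.

```python
from functools import lru_cache

# Hypothetical sizes in GB; only the file names come from the thread.
MODEL_SIZES_GB = {"model.pth": 6.0, "model_search.pth": 3.0, "model_enhance.pth": 4.0}
VRAM_GB = 24.0  # assumed GPU memory

load_calls = []  # records real loads so the caching effect is visible


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Stand-in for an expensive load into VRAM; runs once per model name."""
    load_calls.append(name)
    return {"name": name, "size_gb": MODEL_SIZES_GB[name]}


def fits_in_vram(names, vram_gb: float = VRAM_GB) -> bool:
    """True if all listed models can stay resident at the same time."""
    return sum(MODEL_SIZES_GB[n] for n in names) <= vram_gb


def handle_search() -> tuple:
    """The /search case from the thread: needs two models."""
    return get_model("model_search.pth"), get_model("model.pth")


print(fits_in_vram(MODEL_SIZES_GB))  # all three models together
handle_search()
handle_search()  # second call reuses the cached models
print(load_calls)
```

If `fits_in_vram` were false for the full set, something would have to be evicted and reloaded between requests, which is the slowdown described above.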

uneven smelt Oct 1, 2025, 10:02 AM

#

They all fit in the VRAM

torn arch Oct 1, 2025, 10:02 AM

#

All of them AT THE SAME TIME?

uneven smelt Oct 1, 2025, 10:02 AM

#

Yeah

torn arch Oct 1, 2025, 10:03 AM

#

Okok, then it's fine for this.
Now, if some of your operations are longer than others, the longer ones might block the shorter ones, since the shorter ones end up queued behind them

#

let's say you have a 120s operation and a 2s one: if you put them in the same Docker image, the 2s one might be queued for 120s
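
The 120s / 2s example can be checked with a tiny simulation. It is a sketch under simplifying assumptions: a single worker, strict first-in-first-out order, and all requests arriving at the same moment; `fifo_wait_times` is a helper invented for this illustration.

```python
def fifo_wait_times(durations_s):
    """Return how long each request waits before a single worker starts it.

    All requests are assumed to arrive at t=0 and run one at a time, in order.
    """
    waits, clock = [], 0.0
    for duration in durations_s:
        waits.append(clock)
        clock += duration
    return waits


# Long and short operation sharing one worker: the short one waits behind the long one.
print(fifo_wait_times([120.0, 2.0]))
# Short operation on its own worker: no wait.
print(fifo_wait_times([2.0]))
```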

#

and for the third reason: if you split your operations, you can use cheaper GPUs for the lighter ones

uneven smelt Oct 1, 2025, 10:04 AM

#

All are between 10-100ms

torn arch Oct 1, 2025, 10:04 AM

#

OK, so in the end you don't need a per-endpoint queue?

#

the Runpod queuing system will be enough

uneven smelt Oct 1, 2025, 10:05 AM

#

I need it to scale once it hits, say, 30 requests per second

torn arch Oct 1, 2025, 10:06 AM

#

if you set a higher number of max workers and tune REQUEST_COUNT or QUEUE_DELAY, Runpod will automatically boot new workers to handle your requests

#

and load balance across them
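
For picking a max worker count, a rough estimate from the numbers in this thread (30 requests per second, 10-100 ms per request) can be sketched with Little's law: requests in flight = arrival rate × latency. This assumes steady traffic, and `concurrency_per_worker` is a hypothetical parameter; real traffic is bursty, so treat the result as a lower bound rather than a setting to copy.

```python
import math


def workers_needed(requests_per_s: int, latency_ms: int,
                   concurrency_per_worker: int = 1) -> int:
    """Estimate workers from Little's law: in-flight = rate * latency."""
    in_flight = requests_per_s * latency_ms / 1000
    return max(1, math.ceil(in_flight / concurrency_per_worker))


print(workers_needed(30, 100))  # slowest case from the thread: 100 ms
print(workers_needed(30, 10))   # fastest case from the thread: 10 ms
```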

uneven smelt Oct 1, 2025, 10:06 AM

#

nice!

torn arch Oct 1, 2025, 10:08 AM

#

you can try it yourself: set max workers to 2 and send 4 requests. In the "request" panel you should see that they run on different workers (GPUs)
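
A small script for that experiment might look like the sketch below. `fake_send` is a stand-in so the example runs without network access; a real `send` would call your own endpoint URL, and whether the response exposes a worker identifier is an assumption (the thread only says to check the "request" panel).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fire_concurrently(send, n: int) -> Counter:
    """Call send(i) n times in parallel and count requests per reported worker."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        worker_ids = list(pool.map(send, range(n)))
    return Counter(worker_ids)


def fake_send(i: int) -> str:
    # Stand-in for an HTTP request to your endpoint. A real version would send
    # the request and return whatever worker identifier is available to you.
    return f"worker-{i % 2}"


# 4 requests spread over 2 pretend workers
print(fire_concurrently(fake_send, 4))
```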
