#New load balancer serverless endpoint type questions

polar fossil
#

Hey team !

In the past, I tried to use RunPod's queue-based serverless for my voice AI project, but the added job queue latency made this impossible. Voice AI requires sub-200 ms inference latency, and the queue overhead made latency both large and unpredictable. That is fine for long-running jobs, but not for high-frequency, low-latency workloads.

This new load balancer serverless endpoint type looks amazing and seems to fill a real feature gap in the GPU provider game.

However, I'm missing some information:

  • Scaling algorithm type: how does the autoscaler decide it's time to boot up a new worker? In my case I'd like to scale on either the number of sessions per worker or the average time to first token.
  • How is the load balancer actually balancing? Is there any way to implement sticky sessions, for instance? Especially in the vLLM example, it's better if the same conversation stays on the same worker 🙏

None of this appears to be documented, and I think these are pretty important parameters for a load balancer.

Waiting for some guidance on this, as it's the only thing preventing us from migrating our infra to it 🙂

coral prism
polar fossil
#

I couldn't see any of this in the "load balancer" endpoint creation form. I do remember these options being available with regular "job queue" serverless endpoints, though.

coral prism
#

advanced?

#

deploy it first, then edit the endpoint

polar fossil
#

Ah ok I need to deploy first then edit 😅

coral prism
#

Yess 👍

polar fossil
#

Do you think I could use the API to programmatically start or stop workers based on my own metrics?

coral prism
#

Hmm, sure, but you won't be able to control the request routing

#

only how many workers are running (like active workers)
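
For anyone wanting to try the "scale on my own metrics" idea discussed above, here is a minimal sketch of the sizing logic only, assuming a sessions-per-worker target. Everything in it is hypothetical: `desired_workers`, `reconcile`, and the `set_worker_count` callback are illustrative names, not part of any RunPod API. You would need to wire `set_worker_count` to whatever endpoint update call is actually available to you, and as noted above this only changes how many workers run, not where requests are routed.

```python
import math
from typing import Callable


def desired_workers(active_sessions: int, sessions_per_worker: int,
                    min_workers: int = 0, max_workers: int = 10) -> int:
    """Worker count needed for the current sessions, clamped to [min, max]."""
    if sessions_per_worker <= 0:
        raise ValueError("sessions_per_worker must be positive")
    needed = math.ceil(active_sessions / sessions_per_worker)
    return max(min_workers, min(max_workers, needed))


def reconcile(active_sessions: int, current_workers: int,
              set_worker_count: Callable[[int], None],
              sessions_per_worker: int = 4,
              min_workers: int = 0, max_workers: int = 10) -> int:
    """Call the (hypothetical) set_worker_count hook only when the target changes."""
    target = desired_workers(active_sessions, sessions_per_worker,
                             min_workers, max_workers)
    if target != current_workers:
        set_worker_count(target)  # assumption: your own wrapper around the provider API
    return target
```

You would call `reconcile` periodically from your own metrics loop. Scaling on time to first token would follow the same shape, with a different formula inside `desired_workers`.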