# All pods unavailable | help needed for future-proof strategy

celest zephyr
#

All pods in region eu-se-1 are unavailable for serverless.

I need to protect against this because of my SLA. It's hard because I literally don't know how, or where to read about it. From Monday I expect a need of around 1000-2000 USD a month, so I would love some help.

Maybe I am stupid, but I will have to look for alternatives, and of course I'm a bit stressed. I hope you guys figure it out, and/or can help me avoid and monitor this problem in the future.

- Yes, I can set up endpoints on all clouds, but really I would need to set active workers to avoid this issue, which defeats the purpose of serverless, unless I can predict the future and set active workers before others do, and I don't want to have to program that algorithm.
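
The multi-endpoint idea above can be handled on the client side, without active workers, by trying one endpoint per region in order and reporting each failure to monitoring. What follows is only a minimal sketch under assumptions: `call_with_failover`, the `send` callable, the `on_failure` hook, and the region name `backup-region` are hypothetical and not part of any platform API, and it assumes each regional endpoint can serve the request on its own (it does not address endpoints tied to network storage in a single region).

```python
from typing import Callable, Iterable, Optional


class AllRegionsUnavailable(RuntimeError):
    """Raised when every configured endpoint failed."""


def call_with_failover(
    endpoints: Iterable[str],
    send: Callable[[str, dict], dict],
    payload: dict,
    on_failure: Optional[Callable[[str, Exception], None]] = None,
) -> dict:
    """Try each endpoint in order and return the first successful response.

    `send` is a hypothetical function that submits `payload` to one
    endpoint and raises an exception if that endpoint is unavailable.
    """
    errors = {}
    for endpoint in endpoints:
        try:
            return send(endpoint, payload)
        except Exception as exc:  # a real client would catch narrower errors
            errors[endpoint] = exc
            if on_failure is not None:
                on_failure(endpoint, exc)  # hook for alerting or metrics
    raise AllRegionsUnavailable(f"all endpoints failed: {errors}")


def fake_send(endpoint: str, payload: dict) -> dict:
    """Stand-in transport that simulates eu-se-1 being down."""
    if endpoint == "eu-se-1":
        raise ConnectionError("region unavailable")
    return {"endpoint": endpoint, "echo": payload}


if __name__ == "__main__":
    response = call_with_failover(
        ["eu-se-1", "backup-region"],  # "backup-region" is a made-up name
        fake_send,
        {"prompt": "hello"},
        on_failure=lambda ep, exc: print(f"endpoint {ep} failed: {exc}"),
    )
    print(response)
```

The `on_failure` hook is where an alert or metric could be emitted, which covers the monitoring half of the question. The ordering of `endpoints` decides which region is preferred.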

safe root
#

Let me try to help here. eu-se-1 is running into network packet loss issues, and we have disabled it to avoid users getting charged for downtime.

#

You're using network storage, and that's why you need eu-se-1? If so, can you elaborate on how you save to and load from storage? We are actively trying to find solutions for endpoints that need network storage, in particular how to load balance them to other regions without data loss.

sinful zealot
#

is that what you need?

safe root
#

Availability is good across regions. The problem in this single region is related to an outage and issues with their ISP. We are trying to figure out the best ways to provide multi-region endpoints, even with network storage.

#

So if network storage is empty, do you have a way to fill it, or is it the source of truth?

sinful zealot
#

Also, on the vLLM worker I queued one request and it works just fine. It launched a worker, but that worker was still loading a huge model, so it took a bit long (status: running). Then it launched a new worker after running for about 2-3 minutes. Is that normal? @safe root

safe root
#

Our vLLM deploy requires the model to be downloaded while it's running. This is something we are working on improving, so that downloading the model is done in the INIT step and doesn't cost anything.

sinful zealot
#

But why does it launch new workers after 2-3 minutes (for the same request)?

safe root
#

It's likely downloading the model, and serverless sees that the request is still pending and launches another worker. Downloading during init will fix this behaviour.