Hi: inference with my Docker image runs much faster locally on an RTX 4070 than on your RTX 3090 serverless. I was expecting a speedup, or at least the same speed. I'm using the NVIDIA NeMo diarization model, and a 1-hour audio file takes 85 seconds to process on my 4070, using the same image as your worker, while it takes 160 seconds on the 3090 on RunPod. I also use torch.multiprocessing to spawn 2 processes, one for transcription with WhisperX and one for diarization, running in parallel. I don't know if there is some limitation on your side for running parallel processes in the same Docker container.
#Bad performance on runpod
Thank you! I managed to get good inference time on the RTX 4090, the same as locally. However, the delay time before the request starts processing is quite high: 45 seconds. Is any optimisation possible? Does the Docker image size play a role in this? I pre-downloaded all the models into it, so the image is quite large: 25 GB. Pods are not serverless, right?
It's pretty much the same, but there are differences: for example, a pod charges you from the image download onward, while serverless doesn't
And serverless autoscales and is request-based
When you have a constant, or at least frequent, flow of requests, FlashBoot will make cold starts faster
(the time it takes an idle worker to go to running with the model loaded)
Thanks. It's for a SaaS app that already has 100 users. I expect 200 - 300 inferences per day. What would you recommend? The inference takes 120 seconds. Are pods like a server you pay for monthly, or are they usage-based only? What happens if there are too many inferences at the same time on a pod and it doesn't have enough CUDA memory to handle them? Can't it scale?
I'd still use serverless unless you want to orchestrate pods
(scaling up and down using the API)
Usage-based, per minute
But the price is shown per hour
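As a rough way to turn an hourly price into a daily figure for the workload mentioned above (200 to 300 inferences per day, about 120 seconds each), here's a small sketch. The hourly rate is a made-up placeholder, not a real RunPod price, and the calculation assumes billing is proportional to execution time, with no idle or cold-start time counted.

```python
def daily_gpu_hours(requests_per_day, seconds_per_request):
    # Total busy GPU time per day, converted from seconds to hours.
    return requests_per_day * seconds_per_request / 3600


def daily_cost(requests_per_day, seconds_per_request, price_per_hour):
    # price_per_hour is a hypothetical rate; check the pricing page for real values.
    return daily_gpu_hours(requests_per_day, seconds_per_request) * price_per_hour


HYPOTHETICAL_PRICE_PER_HOUR = 0.50  # placeholder value, not an actual price

for n in (200, 300):
    hours = daily_gpu_hours(n, 120)
    cost = daily_cost(n, 120, HYPOTHETICAL_PRICE_PER_HOUR)
    print(f"{n} requests/day: {hours:.2f} GPU-hours, about ${cost:.2f}")
```

At 120 seconds per request, that works out to roughly 6.7 to 10 GPU-hours of actual compute per day, whatever the real rate turns out to be.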
As for OOM: no, it can't handle OOM on its own, and neither can serverless. Usually the app just restarts
So choose a GPU with a safe amount of VRAM, or manage your workload
"Serverless cannot"? Isn't autoscaling the purpose of serverless?
So if I have a burst of requests, it should be handled by your workers unless none are available
OOM won't scale up workers, but it can signal your app to restart, like the vLLM worker does
Serverless autoscales when you have a lot of requests, but surely you won't be loading up one worker with many model loads until it stops working because of CUDA OOM
Okay. I sent multiple requests simultaneously and the delay dropped to 15 - 25 seconds. I don't know if FlashBoot got triggered. I will try your service for my production workload and see how it goes. One last thing: the max number of workers is 5, which means it can't handle more than 5 requests simultaneously. So a burst of 15 requests at the same time will still be handled, but the other 10 requests will see a more significant delay, since they have to wait for the previous ones to finish before a worker frees up. Can the number of workers be greater than 5? What do companies that run hundreds of parallel inferences do? Can't they use serverless, then?
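The arithmetic behind that burst scenario can be sketched as follows. It is a simplified model that assumes every request takes the same 120 seconds, each worker handles one request at a time, and cold-start delay is ignored; the function name is made up for illustration.

```python
def queue_waits(num_requests, max_workers, seconds_per_request):
    # Requests are served in waves of max_workers; request i waits for
    # the i // max_workers waves ahead of it to finish.
    return [(i // max_workers) * seconds_per_request for i in range(num_requests)]


waits = queue_waits(num_requests=15, max_workers=5, seconds_per_request=120)
print(sorted(set(waits)))  # → [0, 120, 240]
print(max(waits) + 120)    # → 360 (seconds until the last request finishes)
```

Under these assumptions, the last five requests in a burst of 15 wait about 4 minutes before starting, on top of any cold-start delay.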