Bad performance on runpod | Runpod | Page 1

coarse sluice Aug 28, 2025, 12:59 PM

#

Hi: Inference with my Docker image runs much faster locally on an RTX 4070 than on your RTX 3090 serverless. I was expecting a speedup, or at least the same speed. I'm using the NVIDIA NeMo diarization model, and a 1-hour audio file takes 85 seconds to process on my 4070, using the same image as the one used by your worker, while it takes 160 seconds on the 3090 on Runpod. I also use torch.multiprocessing to spawn 2 processes in parallel, one for the transcription using WhisperX and one for the diarization. I don't know whether there are limitations on your side for running multiple parallel processes in the same Docker container.

spice ember Aug 30, 2025, 7:46 AM

#

try 4090?

#

on pods

coarse sluice Aug 31, 2025, 12:14 PM

#

Thank you! I managed to get good inference time on the RTX 4090, the same as locally. However, the delay before the request starts processing is quite high: 45 seconds. Is any optimisation possible? Does the Docker image size play a role in this? I pre-downloaded all the models into it, so the image is quite large: 25 GB. Pods are not serverless, right?

spice ember Aug 31, 2025, 12:36 PM

#

It's much the same, but there are differences: a pod charges you from the image download onward, but serverless doesn't

#

And serverless auto scales up, and is request based

spice ember Aug 31, 2025, 12:38 PM

#

coarse sluice thanks you ! I succeeded to have good inference time on the RTX4090 same as loca...

When you have a constant or near-constant flow of requests, FlashBoot will make cold starts faster

#

(the transition time from an idle worker to a running worker with the model loaded)
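
A minimal sketch of one common way to keep that idle-to-running transition short, assuming the worker process stays alive between requests: load the models once per process and reuse them on every request. `load_models` and `handler` are hypothetical names, and the dictionary stands in for the real NeMo and WhisperX objects; nothing here is taken from the Runpod SDK.

```python
import functools
import time


@functools.lru_cache(maxsize=1)
def load_models():
    """Hypothetical loader: stands in for reading model weights baked into the image."""
    time.sleep(0.05)  # placeholder for a slow load from disk to GPU
    return {"diarizer": "loaded", "transcriber": "loaded"}


def handler(job):
    """Hypothetical request handler: reuses the cached models on every call."""
    models = load_models()  # slow on the first call only, cached afterwards
    return {"audio": job["input"]["audio"], "models": sorted(models)}


if __name__ == "__main__":
    for i in range(3):
        start = time.perf_counter()
        handler({"input": {"audio": f"file{i}.wav"}})
        print(f"request {i}: {time.perf_counter() - start:.3f}s")
```

Only the first request pays the load cost; later requests served by the same process skip it.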

coarse sluice Aug 31, 2025, 12:45 PM

#

Thanks. It's for a SaaS app that already has 100 users. I expect 200 - 300 inferences per day. What would you recommend? The inference takes 120 seconds. Is a pod like a server you pay for monthly, or is it only usage-based? What happens if there are too many inferences at the same time on a pod and it does not have enough CUDA memory to handle them? It can't scale?
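
As a rough worked estimate using only the figures from this message (200 to 300 inferences per day, 120 seconds each), and ignoring cold starts and idle time, the daily GPU time can be computed as follows. The function name is hypothetical.

```python
def daily_gpu_hours(requests_per_day, seconds_per_request):
    """Total busy GPU time per day, in hours, ignoring cold starts and idle time."""
    return requests_per_day * seconds_per_request / 3600


for n in (200, 300):
    print(f"{n} requests/day -> {daily_gpu_hours(n, 120):.1f} GPU-hours/day")
# 200 requests/day -> 6.7 GPU-hours/day
# 300 requests/day -> 10.0 GPU-hours/day
```

That is roughly 6.7 to 10 hours of GPU time per day, so on average a single worker would be busy for less than half the day. Bursts of simultaneous requests are a separate question.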

spice ember Aug 31, 2025, 12:48 PM

#

I'd still use serverless unless you want to orchestrate pods

#

(scale up and down using the API)

spice ember Aug 31, 2025, 12:48 PM

#

coarse sluice Thx. It's for a SaaS app that already have 100 users. I expect 200 - 300 inferen...

Usage based, per minute

#

But the price is shown per hour

spice ember Aug 31, 2025, 12:49 PM

#

coarse sluice Thx. It's for a SaaS app that already have 100 users. I expect 200 - 300 inferen...

For OOM: no, it can't handle OOM on its own, and neither can serverless. Usually the app restarts

#

So choose a GPU with a safe amount of VRAM, or manage your workload
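
One way to "manage your workload" is to cap how many jobs run on the GPU at once inside a worker, so extra jobs wait instead of running out of memory. A minimal sketch, assuming a thread-based worker; `GpuGate`, `fake_inference`, and the limit of 2 are hypothetical, and the sleep stands in for the real model call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class GpuGate:
    """Caps concurrent jobs and records the highest concurrency seen."""

    def __init__(self, max_jobs):
        self._slots = threading.BoundedSemaphore(max_jobs)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0

    def run(self, fn, *args):
        with self._slots:  # blocks while max_jobs jobs are already running
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self._running -= 1


def fake_inference(job_id):
    """Placeholder for the real GPU work."""
    time.sleep(0.05)
    return job_id


def run_batch(gate, job_ids, threads=8):
    """Submit all jobs at once; the gate decides how many actually run together."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda j: gate.run(fake_inference, j), job_ids))


if __name__ == "__main__":
    gate = GpuGate(max_jobs=2)  # hypothetical: how many jobs fit in VRAM together
    print(run_batch(gate, range(6)), "peak concurrency:", gate.peak)
```

The right limit depends on how much VRAM one job actually needs, which has to be measured for the real models.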

coarse sluice Aug 31, 2025, 12:51 PM

#

"serverless cannot": isn't autoscaling the purpose of serverless?

#

So if I have a burst of requests, it should be handled by your workers unless none are available

spice ember Aug 31, 2025, 1:18 PM

#

OOM won't scale up workers, but it can help signal your app to restart, like the vLLM worker

#

Serverless autoscales when you have a lot of requests, but surely you won't be loading one worker with many model loads until it stops working because of CUDA OOM

coarse sluice Aug 31, 2025, 1:31 PM

#

Okay. I sent multiple requests simultaneously and the delay dropped to 15 - 25 seconds. I don't know whether FlashBoot got triggered. I will try your service for my production workload and see how it goes. One last thing: the max number of workers is 5, which means it can't handle more than 5 requests simultaneously. So a burst of 15 requests at the same time will still be handled, but the other 10 requests will have a more significant delay, since they have to wait for the previous ones to finish before a worker is free. Can the number of workers be greater than 5? How do companies that have hundreds of parallel inferences do it? Can't they use serverless, then?
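
The burst scenario described here can be worked through with a small simulation. It assumes every request takes the same time (120 seconds, the figure mentioned earlier in the thread), that all 5 workers are already warm and each handles one request at a time, and that requests are served first come, first served; cold starts are ignored. `queue_waits` is a hypothetical helper.

```python
import heapq


def queue_waits(num_requests, num_workers, seconds_per_request):
    """Return how long each request waits before a worker picks it up.

    All requests are assumed to arrive at t = 0.
    """
    free_at = [0.0] * num_workers  # time at which each worker becomes free
    heapq.heapify(free_at)
    waits = []
    for _ in range(num_requests):
        start = heapq.heappop(free_at)  # earliest available worker
        waits.append(start)
        heapq.heappush(free_at, start + seconds_per_request)
    return waits


waits = queue_waits(15, 5, 120)
print(waits)
```

Under these assumptions the first 5 requests start immediately, the next 5 wait 120 seconds, and the last 5 wait 240 seconds, so the whole burst finishes after 360 seconds.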
