Some workers failed with GPU Acceleration | Runpod

abstract temple Dec 26, 2025, 6:07 AM

#

Hi. I deployed a Serverless app. Some workers are experiencing the error shown in the images, but not all of them. I deployed 5 workers, and 1 or 2 of them were affected, even though all the workers have the same deployment.

vital basalt Dec 26, 2025, 6:11 AM

#

I have the same issue. Could someone please check this?

split pantherBOT Dec 26, 2025, 1:53 PM

#

abstract temple Hi. I deployed a Serverless app. Some workers are experiencing the error shown i...

@abstract temple

Escalated To Zendesk

The thread has been escalated to Zendesk!

paper gale Dec 26, 2025, 1:53 PM

#

One error is OOM, meaning you're out of VRAM. If you want to run it smoothly, choose another GPU with more VRAM.

abstract temple Dec 27, 2025, 1:16 AM

#

paper gale One error is OOM, meaning you're out of VRAM. If you want to run it smoothly, c...

Yeah, I see. But I don't think it's OOM from my model. The model is lightweight, just a small TTS model with about 0.3B parameters that takes around 3GB of VRAM.
No concurrency at runtime.
@paper gale

paper gale Dec 27, 2025, 2:28 AM

#

It's more likely that your code is doing something wrong, or Runpod is doing something wrong before giving the worker to you.

abstract temple Dec 30, 2025, 3:12 AM

#

paper gale It's more likely that your code is doing something wrong, or Runpod is doing...

I'm sure that my code is correct, because it works well on my local machine and on another platform.

#

And not all workers have the same issue. The other ones work well.

abstract temple Jan 20, 2026, 3:54 AM

#

@paper gale any updates? I have still been getting this error quite often recently.

paper gale Jan 20, 2026, 9:08 AM

#

You haven't opened the ticket, BTW. You need to press the button above to open it.

#

It seems to happen inconsistently, on some workers only. But what if you used a GPU with much more VRAM for safety, like an H100 or a 48GB card?

Which one are you using right now?

#

And which TTS model is this?

abstract temple Jan 21, 2026, 1:49 AM

#

paper gale It seems to happen inconsistently, on some workers only. But what if you use...

You can see my screenshot above. My model is F5-TTS running on an RTX 4090, which has 24GB of VRAM.

#

It's a really lightweight TTS model. I can run it on a card with 6GB of VRAM as well.

#

You can also see here.
I didn't turn on the concurrency handler. -> That means only one process should run at a time. So why does it show that the worker has 3 processes here?

#

Btw, even if all of those processes were created by my code, their total memory is less than 6GB. That's a small amount compared to the total VRAM of an RTX 4090.
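
One way to back up the headroom argument is to have every worker log what the GPU reports before the model loads, tagged with the worker's ID, so that a failing worker can be examined from its logs afterwards. The following is only a minimal sketch, not Runpod's method: the function names are hypothetical, `RUNPOD_POD_ID` is assumed to be the environment variable that carries the worker ID, and the 6GB threshold simply mirrors the figure mentioned above.

```python
import os
import subprocess

# Real nvidia-smi query flags; values are reported in MiB with "nounits".
GPU_QUERY = ["nvidia-smi", "--query-gpu=memory.total,memory.used",
             "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'total, used' rows (MiB) into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        total, used = (int(field) for field in line.split(","))
        gpus.append({"total_mib": total, "used_mib": used,
                     "free_mib": total - used})
    return gpus


def startup_report(required_mib, csv_text=None):
    """Build a loggable record of GPU memory as seen before model load.

    csv_text can be passed in for testing; otherwise nvidia-smi is called.
    """
    if csv_text is None:
        csv_text = subprocess.run(GPU_QUERY, capture_output=True,
                                  text=True, check=True).stdout
    gpus = parse_gpu_memory(csv_text)
    return {
        # Assumption: the worker ID is exposed through this variable.
        "worker_id": os.environ.get("RUNPOD_POD_ID", "unknown"),
        "gpus": gpus,
        # False when no GPU is visible or any GPU lacks the needed headroom.
        "enough_free": bool(gpus) and all(
            g["free_mib"] >= required_mib for g in gpus),
    }

# Hypothetical usage at the top of the worker script, before loading the model:
#   print(startup_report(required_mib=6 * 1024))
```

If a worker logs `enough_free: False` before the model has even loaded, the memory was already taken by something other than the current request, which narrows down where to look.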

paper gale Jan 21, 2026, 2:45 AM

#

abstract temple You can also see here. I didn't turn on the concurrency handler. -> That means o...

Do you have some kind of batching in that model's inference code? (Not Runpod request batching.)

paper gale Jan 21, 2026, 2:46 AM

#

abstract temple You can also see here. I didn't turn on the concurrency handler. -> That means o...

And I actually can't see the 3 processes. Did you mean the 3 requests that the worker took, or is it on a different screenshot?

#

Maybe watch your nvidia-smi output while running inference. Use the web terminal and keep an active worker for debugging.
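
Watching nvidia-smi by hand can also be scripted so that per-process GPU memory is sampled while a request runs. This is a rough sketch under assumptions: `parse_compute_apps` and `watch` are hypothetical helper names, the sample row in the comment is made up for illustration, and inside a container nvidia-smi may list no processes or report `[N/A]` for memory, which the parser tolerates.

```python
import subprocess
import time

# Real nvidia-smi query flags; used_memory is reported in MiB with "nounits".
APPS_QUERY = ["nvidia-smi",
              "--query-compute-apps=pid,process_name,used_memory",
              "--format=csv,noheader,nounits"]


def parse_compute_apps(csv_text):
    """Parse rows like '180, python3, 1500' into dicts (illustrative row)."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        used = parts[-1]
        rows.append({
            "pid": int(parts[0]),
            # Re-join in case the process name itself contains a comma.
            "name": ",".join(parts[1:-1]),
            # nvidia-smi can print "[N/A]" instead of a number.
            "used_mib": int(used) if used.isdigit() else None,
        })
    return rows


def watch(interval_s=1.0, samples=10):
    """Print a timestamped per-process snapshot every interval_s seconds."""
    for _ in range(samples):
        out = subprocess.run(APPS_QUERY, capture_output=True,
                             text=True, check=True).stdout
        print(time.strftime("%H:%M:%S"), parse_compute_apps(out))
        time.sleep(interval_s)
```

Running `watch()` in the web terminal while sending a test request shows which PIDs hold GPU memory and how the usage changes during inference.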

abstract temple Jan 21, 2026, 3:39 AM

#

paper gale And I actually can't see the 3 processes. Did you mean the 3 requests that the w...

I mean the error message: it says that the worker has 3 running processes (the current process plus 180 and 181, which look like they were spawned from my main process).
Anyway, could you take a look at the GPU memory?

abstract temple Jan 21, 2026, 3:40 AM

#

paper gale Maybe watch your nvidia-smi output while running inference. Use the web terminal...

I couldn't, because as I said, not all workers have the same error. About 90% of requests complete fine, so I don't know in advance which worker will fail in order to debug it.

#

And once the error happens, all requests sent to the same worker fail too.

#

Ah, I found that 5 to 15 minutes after the error, the worker is deleted by the Runpod serverless system.
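
Since a broken worker keeps failing every request until it is removed, one possible mitigation is for the handler to ask for the worker to be recycled as soon as it sees a GPU-related failure. The sketch below is an assumption-heavy illustration, not a confirmed fix: it assumes the Runpod Python SDK honors a `refresh_worker` key in the handler's return value (check the current SDK documentation), `make_handler` and `looks_like_gpu_failure` are hypothetical names, and the string matching is a crude heuristic.

```python
def looks_like_gpu_failure(exc):
    """Crude heuristic: treat CUDA and out-of-memory messages as GPU failures."""
    message = str(exc).lower()
    return "out of memory" in message or "cuda" in message


def make_handler(run_inference):
    """Wrap an inference function so GPU failures request a fresh worker.

    run_inference is whatever callable performs TTS for one job input.
    """
    def handler(job):
        try:
            return run_inference(job["input"])
        except RuntimeError as exc:
            if looks_like_gpu_failure(exc):
                # Assumption: the SDK stops this worker after the job
                # when it sees refresh_worker set to True.
                return {"error": str(exc), "refresh_worker": True}
            raise  # Unrelated errors propagate unchanged.
    return handler

# Hypothetical wiring with the Runpod SDK:
#   runpod.serverless.start({"handler": make_handler(my_tts_function)})
```

This does not explain why some workers start in a bad state, but it would limit how many requests a bad worker can fail before it is replaced.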

paper gale Jan 21, 2026, 3:47 AM

#

abstract temple I mean the error message it says that the worker has 3 running processes (curre...

Ohh right

paper gale Jan 21, 2026, 3:49 AM

#

abstract temple I couldn't, because as I said, not all workers have the same error. About 90% of...

Hmm, I see. But if you retry the same input, can it succeed?

#

I think you really have to ask in a support ticket to check what's going on. I'm not sure either. The best I can do right now is guess, because I lack context.

#

Btw, have you noticed whether your code might start another process using Python's multiprocessing (mp) library?
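
To check whether extra PIDs such as 180 and 181 really are children of the main process, the process table can be inspected from inside the worker. This is a Linux-only sketch that reads `/proc` with the standard library; the function names are hypothetical, and it lists direct children only, not grandchildren.

```python
import os


def parse_stat_ppid(stat_text):
    """Return (pid, comm, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    head, tail = stat_text.rsplit(")", 1)
    pid, comm = head.split(" (", 1)
    fields = tail.split()  # fields[0] is the state, fields[1] is the parent PID
    return int(pid), comm, int(fields[1])


def child_processes(parent_pid=None):
    """List (pid, comm) for direct children of parent_pid (default: this process)."""
    parent_pid = os.getpid() if parent_pid is None else parent_pid
    children = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                pid, comm, ppid = parse_stat_ppid(handle.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # The process exited or is not readable; skip it.
        if ppid == parent_pid:
            children.append((pid, comm))
    return sorted(children)

# Hypothetical usage inside the handler, after the model has loaded:
#   print(child_processes())
```

If the extra PIDs show up here, they were started by the worker's own code or one of its libraries, for example by data-loading or multiprocessing helpers.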

#

If you can share your repo, maybe I can take a look later (DM me if you want).

Or ping me again after you send the code.
