#Some workers failed with GPU Acceleration


abstract temple
#

Hi. I deployed a Serverless app. Some workers are hitting the error shown in the images, but not all of them. I deployed 5 workers, and 1 or 2 of them were affected, even though all the workers have the same deployment.

vital basalt
#

I'm seeing the same issue. Could someone please check this?

paper gale
#

One error is OOM, meaning you're out of VRAM. If you want it to run smoothly, choose another GPU with more VRAM.

paper gale
#

It's more like your code is doing something wrong, or RunPod is doing something wrong before giving the worker to you.

abstract temple
#

And not all workers have the issue. The others work well.

abstract temple
#

@paper gale any updates? I've still been getting this error a lot recently.

paper gale
#

You haven't opened the ticket, BTW. You need to press that button above to open it.

#

It seems to happen inconsistently, on some workers only. But what if you used a GPU with much more VRAM for safety, like an H100 or a 48 GB card?

Which one are you using right now?

#

And which TTS model is this?

abstract temple
#

It's a really lightweight TTS model. I can run it on a 6 GB VRAM card as well.

#

You can also see here.
I didn't turn on the concurrency handler, which means only one process should run at a time. So why does it show the worker has 3 processes here?

#

Btw, even if all of the processes were created by my code, their total memory is less than 6 GB. That's a small amount compared to the total VRAM of an RTX 4090.

paper gale
#

Maybe watch your nvidia-smi output while running inference. Use the web terminal and keep an active worker for debugging.
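For anyone following along, here is a rough sketch of one way to capture that nvidia-smi output from Python, so the per-process VRAM usage can be logged next to each request. It assumes `nvidia-smi` is on the PATH inside the worker; the function names (`parse_compute_apps`, `total_used_mib`, `snapshot`) are made up for illustration and are not part of any RunPod API.

```python
import csv
import io
import subprocess

# Ask nvidia-smi for one CSV row per compute process: pid, name, memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse the CSV rows printed by the query above into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, name, used = (f.strip() for f in fields)
        if not pid.isdigit():
            continue  # not a process row
        # used_memory can be non-numeric on some setups; keep the row anyway
        used_mib = int(used) if used.isdigit() else None
        rows.append({"pid": int(pid), "name": name, "used_mib": used_mib})
    return rows


def total_used_mib(rows):
    """Sum the reported memory, treating unknown values as 0."""
    return sum(r["used_mib"] or 0 for r in rows)


def snapshot():
    """Run nvidia-smi once and return the parsed process list."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

Calling `snapshot()` at the start of the handler and again when an error is caught would show whether the extra processes and the memory they hold are already there before your code runs.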

abstract temple
#

And once the error happens, all requests to the same worker fail too.

#

Ah, I found that 5 to 15 minutes after the error, the worker gets deleted by the RunPod serverless system.

paper gale
#

I think you really have to ask in a support ticket so they can check what's going on. I'm not sure either; the best I can do right now is guess, because of the lack of context.

#

Btw, have you noticed whether your code might be starting another process using the multiprocessing (mp) library in Python?
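A quick way to check this from inside the handler might look like the sketch below. It uses only the standard library; `process_report` is a hypothetical helper name. Note that it only sees children started through Python's `multiprocessing` module in the current process, so processes launched via `subprocess` or by a native library would not show up here.

```python
import multiprocessing
import os
import threading


def process_report():
    """Describe the current process and any multiprocessing children it has."""
    return {
        "pid": os.getpid(),
        "parent_pid": os.getppid(),
        "threads": threading.active_count(),
        # (pid, name) for each live child created via multiprocessing
        "children": [(c.pid, c.name) for c in multiprocessing.active_children()],
    }
```

Logging `process_report()` once per request makes it possible to compare the PIDs your code knows about against the PIDs listed on the GPU.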

#

If you can share your repo, maybe I can take a look later (DM me if you want).

Or ping me again after you send the code.