# Serverless Workers Hang


trim snow
#

I've been rotating through GPUs to test inference, and I've found that workers can hang when reloading onto a new GPU. I've tested many tiers, and it's a bigger issue at higher VRAM sizes. For example, I was going to test 80GB, but a worker hung for about 10 hours while I was away.

I've noticed it's particularly bad if you push a change to the GitHub repo while also changing the VRAM count.

My workaround has been to delete the hanging workers so they reload.

Since I tagged this as a feature request, here's a specific request: could you come up with a fix for this? Perhaps a reload-timer setting would work (if a worker is stuck initializing for more than [time], reload it).
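To make the reload-timer idea concrete, here is a minimal Python sketch of the rule "if stuck initializing for more than [time], reload worker". It is only an illustration: the `Worker` record, its field names, and the `"initializing"` state string are hypothetical and not taken from any RunPod API. It only picks out which workers exceed the limit; reloading them is left to whatever mechanism is available.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical record; field names are illustrative, not a RunPod schema.
    worker_id: str
    state: str          # e.g. "initializing", "building", "running"
    state_since: float  # Unix timestamp when the worker entered `state`


def find_stuck_workers(workers, now, max_init_seconds):
    """Return IDs of workers that have been initializing longer than the limit."""
    return [
        w.worker_id
        for w in workers
        if w.state == "initializing" and now - w.state_since > max_init_seconds
    ]


# Example: one worker has been initializing for 10 hours, one for ~17 minutes.
workers = [
    Worker("a", "running", 0.0),
    Worker("b", "initializing", 0.0),
    Worker("c", "initializing", 35_000.0),
]
print(find_stuck_workers(workers, now=36_000.0, max_init_seconds=1_800))  # ['b']
```

With a 30-minute limit, only the worker that has been initializing for 10 hours is flagged; the one that started recently is left alone.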

worthy galleon
#

When you push a change to the GitHub repo, a new version builds, and you also select another GPU, what happens? No workers available / all initializing?

#

Sure, you can use a runpodctl command to remove/terminate the pod (check the docs) once a certain amount of time has passed and the worker is still loading.
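A rough Python sketch of that suggestion follows: run a terminate command only after a worker has been loading past a time limit. The exact runpodctl subcommand is not given in this thread, so the command is passed in as a caller-supplied template (take the real one from the docs). The function names, the `{id}` placeholder, and the `run` parameter are all assumptions made for illustration.

```python
import shlex
import subprocess


def build_terminate_argv(command_template, worker_id):
    """Split a caller-supplied template into argv, then fill in `{id}`."""
    # Splitting before formatting keeps the ID as a single argument.
    return [part.format(id=worker_id) for part in shlex.split(command_template)]


def terminate_if_stale(worker_id, loading_since, now, limit_seconds,
                       command_template, run=subprocess.run):
    """Run the terminate command only if the worker has loaded too long.

    Returns True if the command was run, False otherwise.
    `run` is injectable so the logic can be tried without a real CLI.
    """
    if now - loading_since <= limit_seconds:
        return False
    run(build_terminate_argv(command_template, worker_id), check=True)
    return True


# Dry run: print the argv instead of executing anything.
terminate_if_stale(
    "worker-123", loading_since=0.0, now=36_000.0, limit_seconds=1_800,
    command_template="echo would-terminate {id}",
    run=lambda argv, check: print(argv),
)
```

Swapping the `echo` template for the actual command from the docs, and dropping the `run` override, would make this call the CLI for real.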

trim snow
#

Sadly, no. I don't use that endpoint anymore. I've determined that 24GB with a 48GB fallback and 5 workers is fine, so I haven't needed to do any more testing.

I recall that when I did those steps, some workers built successfully, but others hung for long periods of time (probably forever?).

worthy galleon
#

Or maybe just set an execution timeout, and check the logs to see why it might hang.

#

Oh, you're talking about builds from the GitHub repo integration taking a long time?

trim snow
#

Nah, that's usually quick. It's specifically when I push a new build, the workers are switching, and then I select a new GPU. That sequence was the most consistently problematic.

worthy galleon
#

I see. I'm just curious how it's able to hang for 10 hours.

Can you explain what you saw (the problem) and how it differs from what you expected?

#

And how many workers do you have in the endpoint?

trim snow
#

I pushed a build and then selected some new GPUs to load while I was away. I came back 10 hours later and one was still hanging

#

5

#

The others rebuilt, so I deleted the hanging worker and then it reloaded fine right after

#

Sadly, I didn't keep track of the details, just that it happened. Sorry about that. I know obscure anecdotes aren't the most helpful.

worthy galleon
#

Ah, sometimes the 80GB or higher-end GPUs are limited in stock, which is why they might take longer to initialize or get stuck.

worthy galleon
# trim snow 5

Maybe you can try adding more workers; the stuck one will be replaced eventually.

#

No, no problem. I'm not staff. The problem description is usually enough if other users generally experience it too.

trim snow
#

By hanging, I mean it gets stuck in a building/initializing phase. I believe it was the initializing phase, where the indicator circles blue (versus building, which is green).

#

So it'll initialize for ages

#

That also affected the build indicator, the one that says something like 1/6 built. It was stuck at around 5/6.

#

Also, I'm confused why it says n/6 rather than n/5 when I have 5 workers.

#

That's a side point though haha