#Workers stay in Initialize


fresh panther
#

It's been 24 hours and I have tried different GPU configurations, all in the 24-48 GB range, specifically RTX4090, RTX5090, L40, and L40S. Workers are stuck in initializing in every data center I tried; it is the same with all of them.

fresh panther
#

Ticket #29446

upper wraith
#

maybe it's trouble pulling your image

fresh panther
#

There are no logs at all, it doesn't even get a chance to pull an image

#

I have created another endpoint and the same thing happens. Tried multiple regions/data centers as well.

upper wraith
#

can't be the runpod servers i guess, could be your browser

#

if it were runpod side, many people would storm this server saying the same thing

fresh panther
#

It has to be the Runpod server, it cannot be my browser. If it is my browser, I have serious questions about why using Firefox instead of Chrome would prevent a worker from starting up correctly.

#

I'll give the support ticket another 24 hours since I opened it during the weekend, but if I cannot use Runpod soon, I'll just have to move to AWS again and pay them 2x. At this point I have lost faith; I cannot use Runpod reliably to build a product for our company. I guess I'd rather pay AWS 2x and know it works consistently. Well, the Ops team will be happy that I'm helping us reach our AWS contract quota haha..

calm jungle
#

you are not the only one, it's happening to me too. only for runpod worker templates like vLLM, not my own. there is another poster as well from a few days ago

fresh panther
#

Closing the loop here. Support got back to me: there is a problem with the internal registry. When you create a new Serverless endpoint, it sets the vLLM image to registry.runpod.net/runpod-workers-worker-vllm-main-dockerfile:3851d53f9, which pulls from an internal Runpod registry. There are some issues with that registry that they are aware of, so you have to use the DockerHub image, runpod/worker-v1-vllm:v2.11.1, until they have fixed the issue.
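For anyone checking several endpoints, here is a minimal Python sketch of the workaround described above. It only inspects an image reference string and suggests the DockerHub image when the reference points at the internal registry host. The function names are hypothetical, the two image strings are copied from this thread, and nothing here calls the Runpod API.

```python
# Values quoted in this thread; adjust if support gives you different ones.
INTERNAL_REGISTRY_HOST = "registry.runpod.net"
DOCKERHUB_FALLBACK = "runpod/worker-v1-vllm:v2.11.1"


def registry_host(image_ref: str) -> str:
    """Return the registry host of an image reference ('docker.io' if none is given)."""
    first, sep, _ = image_ref.partition("/")
    # Docker convention: the first path component is a registry host only if it
    # contains a '.' or ':' or is exactly 'localhost'.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def suggest_image(image_ref: str) -> str:
    """Suggest the DockerHub image if the reference uses the internal registry."""
    if registry_host(image_ref) == INTERNAL_REGISTRY_HOST:
        return DOCKERHUB_FALLBACK
    return image_ref


if __name__ == "__main__":
    current = "registry.runpod.net/runpod-workers-worker-vllm-main-dockerfile:3851d53f9"
    print(suggest_image(current))
```

The suggested image then still has to be set by hand in the endpoint's container image field.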

winter ridge
#

Problem stays the same for me. At least now I get log output.

#
2026-01-08T09:05:59Z loading container image from cache
2026-01-08T09:08:43Z Loaded image: runpod/worker-v1-vllm:v2.11.1
2026-01-08T09:08:44Z v2.11.1 Pulling from runpod/worker-v1-vllm
2026-01-08T09:08:44Z Digest: sha256:aa576ece2d76cac59578a8ef4595719423648b7e52eaebb36df0fd2a3e5dfbda
2026-01-08T09:08:44Z Status: Image is up to date for runpod/worker-v1-vllm:v2.11.1
2026-01-08T09:12:28Z create container runpod/worker-v1-vllm:v2.11.1
2026-01-08T09:12:28Z start container for runpod/worker-v1-vllm:v2.11.1: begin
2026-01-08T09:12:44Z start container for runpod/worker-v1-vllm:v2.11.1: begin
2026-01-08T09:13:00Z start container for runpod/worker-v1-vllm:v2.11.1: begin
2026-01-08T09:13:16Z start container for runpod/worker-v1-vllm:v2.11.1: begin
2026-01-08T09:13:32Z start container for runpod/worker-v1-vllm:v2.11.1: begin
2026-01-08T09:13:48Z start container for runpod/worker-v1-vllm:v2.11.1: begin
2026-01-08T09:14:04Z start container for runpod/worker-v1-vllm:v2.11.1: begin
2026-01-08T09:14:20Z start container for runpod/worker-v1-vllm:v2.11.1: begin
2026-01-08T09:14:35Z remove container
#

Also, the worker now just stays in an unhealthy state. 🤔 If we know it is unhealthy, would it not make sense to remove it? 🤓
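If it helps when filing the ticket, here is a rough Python sketch that summarizes a log like the one above: it counts the repeated "start container ... begin" lines and reports how long the retries ran before "remove container". It assumes the line format is exactly as pasted (an ISO timestamp ending in Z, a space, then the message); the function names are made up for illustration.

```python
from datetime import datetime


def parse_line(line: str):
    """Split a log line into (timestamp, message)."""
    ts, _, msg = line.strip().partition(" ")
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ"), msg


def summarize_start_loop(log_text: str) -> dict:
    """Count container start attempts and check whether the container was removed."""
    starts = []
    removed = False
    for line in log_text.splitlines():
        if not line.strip():
            continue
        ts, msg = parse_line(line)
        if msg.startswith("start container") and msg.endswith(": begin"):
            starts.append(ts)
        elif msg == "remove container":
            removed = True
    # Time between the first and last start attempt, in seconds.
    span = (starts[-1] - starts[0]).total_seconds() if len(starts) > 1 else 0.0
    return {"attempts": len(starts), "retry_span_seconds": span, "removed": removed}
```

Run on the log above, this reports 8 start attempts spread over 112 seconds, followed by a removal, which points at the container failing to start rather than at the image pull.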

winter ridge
#

did not change env variables, took everything from the vllm template. just changed the container path because runpod hub seems broken.

#

logfiles don't tell too much. 🤓

upper wraith
#

hmm ic

#

open a support ticket then

winter ridge
#

i just specified a model from hugging face. as i understand that should be cached. but it never started even once ... both jobs in the queue for hours.