#vLLM initializing forever: Why do the logs contain so little info making debugging impossible?


steady jewel
#

Using the vLLM serverless endpoint in the catalog. I have a simple model that works well in manual serverless endpoint deployments. I add the model's Hugging Face URL and the hf_token, then press go. I did this several times yesterday and the workers never finished initializing. I even let it run overnight, and the workers are still initializing. The logs are stupidly empty, making it impossible to determine what's going on. Do I have to ask RunPod staff to check state for me every time, or is there a way I can self-serve my debugging efforts here?

steady jewel
#

I tried replacing the docker image with runpod/worker-v1-vllm:v2.11.0 and there has been no change in behaviour. Still stuck initializing with no log entries.

steady jewel
#

After replacing the docker image above, now after 30 minutes or so, I get "image ready, model download failed" but no other info and no way to SSH into a still initializing worker and see details.

elfin shadow
#

I am having the same issue. Trying to run tencent/HunyuanOCR, the endpoint is stuck at initialising, and I have no idea about the status. This is very frustrating.

steady jewel
#

Yeah, super frustrating. I got a low-effort tech support response that just said to kill all worker nodes and it would fix itself. Of course that did not change the behaviour. I'm pulling the vLLM docker image down to my local machine and will try to debug it there. If I figure out what's going on, I'll share it with you here in the thread.

elfin shadow
#

Keep in mind that for some reason it also charges you. I got charged $0.51 without any request or anything. This is very bad practice on their end: you shouldn't charge for a process that doesn't even display any logs. I signed up to the platform with high hopes, as it looked really great on the surface, but if it's always like this I'll ditch it, since I spent two hours trying to debug the issue myself. Sorry if I sound rude, but it's a matter of setting an image 🙂 I hope we can find a solution though, regardless.

tiny turret
#

Oh, I didn't realize you're experiencing the stuck-on-initializing issue too.

steady jewel
#

1qae1ne5s7zhh6

steady jewel
#

There's a bug in the Runpod UI for the serverless endpoint settings of the vLLM endpoints. When you edit the settings and set the model using the exact syntax required by https://github.com/runpod/worker-vllm (MODEL_NAME=facebook/opt-125m), in my case with my own private Hugging Face model myHFuser/myPrivateModel, Runpod's UI appends a huge garbage string after the model name. If you try to fix this, it fails. Additionally, the worker node logs show the same problem: "
loading container image from cache
Loaded image: runpod/worker-v1-vllm:v2.11.0
v2.11.0 Pulling from runpod/worker-v1-vllm
Digest: shaxxxxxxxxx
Status: Image is up to date for runpod/worker-v1-vllm:v2.11.0
image ready, initializing model files
image ready, model not found
"

Even if I go back and edit the field, runpod still forces the garbage back in. Stop doing this!
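
Since the worker log quoted above is so terse, one way to get a little more out of it is to map the last recognizable stage message to a hint. The following is a minimal sketch only: the marker strings are copied from the log excerpts in this thread, while the hints and the function name `last_stage_hint` are assumptions made for illustration, not documented RunPod behaviour.

```python
# Stage markers copied from the worker log excerpts quoted in this thread.
# The hints attached to them are assumptions, not RunPod documentation.
STAGE_HINTS = [
    ("model download failed", "model download failed: check the HF token and model name"),
    ("model not found", "model not found: check the model name and any appended version"),
    ("initializing model files", "still fetching model files"),
    ("Status: Image is up to date", "image pulled; model setup not started"),
    ("loading container image", "still loading the container image"),
]


def last_stage_hint(log_text: str) -> str:
    """Return a hint for the last recognized stage in a worker log."""
    # Walk the log from the newest line backwards and stop at the first match.
    for line in reversed(log_text.strip().splitlines()):
        for marker, hint in STAGE_HINTS:
            if marker in line:
                return hint
    return "no recognized stage: worker may have hung before emitting logs"
```

Pasting the log above into `last_stage_hint` points at the model name or version rather than at the image pull, which matches what the final log line says.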

dry escarp
#

The garbage is your model version 🙂

steady jewel
#

Thanks re the version tag. The model version isn't necessary, it's annoying that your developers make this an editable field but then clobber our edits. Either allow us edit control or disable the field if you're going to overwrite whatever we put in there.

I built a linux box locally and pulled down that container image onto it. My HF token works fine via the hugging face cli, allowing full download of my targeted model.
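
For anyone who wants to repeat the token check without the CLI, here is a rough sketch using only the Python standard library against the Hugging Face Hub model-info endpoint (`https://huggingface.co/api/models/<repo_id>`). The function names, the placeholder repo `myHFuser/myPrivateModel`, and the token value are illustrative assumptions; it only checks that the token can see the repo, not that a full download will succeed.

```python
import urllib.error
import urllib.parse
import urllib.request

HUB_API = "https://huggingface.co/api/models/"


def build_model_info_request(repo_id: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for a model's metadata."""
    url = HUB_API + urllib.parse.quote(repo_id, safe="/")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def token_can_see_model(repo_id: str, token: str, timeout: float = 10.0) -> bool:
    """Return True if the Hub answers 200 for this model with this token."""
    request = build_model_info_request(repo_id, token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # An HTTP error here (for example 401 or 404) suggests the token
        # cannot access the repo or the repo name is wrong.
        return False
```

Calling `token_can_see_model("myHFuser/myPrivateModel", "hf_...")` from the local box and getting `True` would at least rule out the token and the bare model name as the cause.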

Additionally, in my local build, the worker vllm container output has all kinds of helpful diagnostic info. Info that is not present in the same container running in runpod's web console.

Why does runpod hide this diag info from us? Shouldn't there be a means to turn this on so that we can view our own diagnostic output in the worker logs of the runpod web console?

#

Also, it's now 3 days that these workers have been initializing

#

If we had proper diag output, it would save runpod so much money in tech support resources, because we wouldn't have our hands tied behind our backs and could help ourselves instead of having to come running for support and wait for days.

tiny turret
#

Maybe because it's the holidays, support is reduced. Please wait.

#

you have opened a support ticket yourself right?

steady jewel
#

I did, I was talking to Roman about it but haven't heard back since Friday. I replied to the email. Last year, I used to get a ticket number and made an account in the runpod zendesk system to track and interact with my tickets. Now the email response I got has no ticket number or ability to track anything in that zendesk site any more. That's why I'm discussing it here because it seems to be the only way to move this forward. I need to have this workload running, people are depending on me and I'm very frustrated I can't get any traction to solving this problem. It feels like runpod treats their serverless workloads as a hobbyist platform, whereas I need production debug tools and stability.

tiny turret
#

I'd recommend building your own image with the model in it, or using network storage instead.

steady jewel
#

I was previously using network storage and it worked. vLLM is supposed to be an improvement, both in cost and response time. If it works as advertised, I can dump a lot of custom code and switch to commodity deployment sans customization. I have a custom trained version of meta llama 3-70B. Is there an example that runpod supports of a custom trained AI running in vLLM? Or have you guys not tested this and only support stock models that haven't been trained?

sick violet
#

Glad it isn't just me. Trying to use RunPod for the first time, trying to use serverless vLLM.

No joy at all so far. Yesterday I was getting "initializing" and workers outputting errors in the logs, showing I was running out of VRAM.

But trying exactly the same thing again today, all I ever get is "initializing" and then "throttled". Nothing in the logs at all. So I have no idea if I'm doing something wrong, something is broken, or I'm just not patient enough.

I guess I need to go and read some more docs...

sick violet
# sick violet Glad it isn't just me. Trying to use RunPod for the first time, trying to use se...

If I go back and just use the vLLM template using the openchat/openchat-3.5-0106 model the quick start uses then that works fine. So I guess I'm getting something wrong with configuring the models I'm trying to use.

But without any logging how am I supposed to have any chance of working out what I've got wrong ?

(and no, there is no point me sharing the model or settings on this thread, it isn't the fact that the config isn't working, it is the lack of any logging that is what is causing me problems)

tiny turret
#

Try not to use the model cache feature for now

To be clear it is the "model" in the endpoint settings

tiny turret
#

If I'm getting you right:
don't use a public registry, it's not a requirement in runpod, you can always use a private image registry

tiny turret
#

Not sure if this was the main reason or if it's a bug that is planned to be fixed. Maybe @dry escarp can help.

steady jewel
#

Hey, so I got things working as a proof of concept using a vLLM pod, and now I've switched back to apply things in my serverless endpoint. I went back to the serverless endpoint and looked for the vLLM v2.11.0 serverless endpoint offering and it's gone. Only nano vLLM runpod edition is there. Where did the vLLM v2.11.0 tile go?

#

You can see that I used vLLM v2.11.0 in the first screenshot in this thread.

tiny turret
#

oh i think they might've delisted it

dry escarp
#

Does this link not work for you all? 😧

#

I can see that it's not listed in the hub view but I'm not sure why the latest version is approved

tiny turret
#

Oh, the tests are failing, huh, and it doesn't let users deploy 2.11.0.

steady jewel
# elfin shadow did you get it working?

Not yet. Combining several threads and support emails here that show the challenges to getting vLLM serverless usable:

Issue 1: No logs. "The workers become stuck during initialization due to a Docker layer load issue on the underlying host, which causes the worker to hang before it reaches a state where logs are emitted. That’s why the logs look empty and why it feels opaque from your perspective." per support email

Issue 2: Drivers out of date when attempting diag via vLLM pod instead of serverless. Silently fixed via presumed pod vLLM patch (discussed in thread "vLLM pod template needs updating"). Works now as of my tests last night.

Issue 3: vLLM v2.11.0 service disappeared from the serverless options. Still being discussed here and in thread "Anyone notice the vLLM v2.x.y series of serverless templates gone missing from the hub?". Currently broken and a barrier to deploying vLLM serverless as per the original start of this thread.

steady jewel
#

Thanks for the info, running over to test it and will report back

dry escarp
#

Issue 2: Drivers out of date when attempting diag via vLLM pod instead of serverless. Silently fixed via presumed pod vLLM patch (discussed in thread "vLLM pod template needs updating"). Works now as of my tests last night.

The image hasn't changed in over a week (before today) - but v2.11.2 included a fix for using the latest GPU family w/ this template

steady jewel
#

I have to drive now, but I'll test asap. Thanks again

elfin shadow
#

I've tried deploying but no. Still initialising

dry escarp
#

If you defined a rather large model in the modelname input you're waiting for one of our servers to download the model from HF

steady jewel
#

I'm the same as @elfin shadow , deployed almost 12 hours ago, still initializing. Nothing useful in the logs. I think it's something else other than a hugging face download.
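
A quick back-of-the-envelope estimate of how long a download alone should take, as a sketch under stated assumptions: a 70B-parameter model (the size mentioned earlier in this thread), 2 bytes per parameter (fp16/bf16 weights, an assumption), and a sustained link speed that is purely a guess. Retries, Hub rate limits, and disk write speed are not modelled.

```python
def estimate_download_minutes(
    params_billion: float, bytes_per_param: float, gbit_per_s: float
) -> float:
    """Estimate minutes needed to download model weights at a sustained speed."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    bytes_per_second = gbit_per_s * 1e9 / 8  # convert gigabits/s to bytes/s
    return total_bytes / bytes_per_second / 60
```

With these assumptions, `estimate_download_minutes(70, 2, 1.0)` comes out just under 19 minutes, and even at an assumed 0.1 Gbit/s it is roughly three hours, well short of 12 hours.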

#

It's been 8 days and I still can't deploy my endpoint. Is there any way to escalate this so it gets fixed?

steady jewel
#

Just tested again, still broken.

steady jewel
#

Endpoint ID = ri56lgiuit4o7v

I just made this fresh now. I also responded to an email I'm having with your colleague where I shared the model name and in the interest of moving this forward, made the model public. I provided these steps to your colleague as well along with the actual model name:

steady jewel
#

Bump. @tiny turret ?

dreamy veldt
#

I've been seeing lots of similar issues lately. Workers fail to provision, network ingress is really slow, and the boundaries between CPU and GPU utilization spill over. Then I'll kill the workers showing that behavior, get new ones, and everything is lightning fast again. But it's only been like this for the last 6-7 days.

Don't want to derail your support ticket. Just throwing that in there.

#

Maybe they're cooking new releases to give us goodies so some bugs have cropped up ¯_(ツ)_/¯

#

Had one worker just now that wasn't able to download anything and the gpu metrics said it was maxed at 100%. Wasn't even using the gpu in that job yet. So it feels like this is all networking issues.

dreamy veldt
#

Cool, yeah. I have the logs but they don't really say anything. Maybe runpod staff can see more metrics on the box itself. It was just stuck at 101% GPU, then would say 0% -1% 100% etc.

dreamy veldt
#

Yeah. I don't have any evidence to back it up... but in my time using runpod I have seen times where the serverless pods will have their networking slow down to a crawl. Like during the docker image step it'll need to download a few hundred megabytes and sit there forever waiting for it to finish.

Then once the pods are provisioned, they'll work fine until it happens again. When I put a job in the queue, the first thing it should do is download a video. That works 99 times out of 100. Then I'll see periods where the download slows down to a crawl.

I hate networking things like this because it could literally be runpod, my buckets, or anything in between across the internet.

#

I've always wondered if it's because someone else is using that same node and hammering the networking, causing me to see reduced download speeds. Again, no evidence to support that theory. But seeing the GPU usage maxed for several minutes today, when my job wasn't using the GPU at all yet, made me think of it again.

#

But outside of that, I've been loving runpod ❤️ . And I found their support to be pretty thorough once someone from their team picks up the case. Hopefully your issues are resolved soon.

elfin shadow
#

I ditched vLLM altogether. Even with a different model like Mistral 24B it doesn't work lol

steady jewel
#

Yeah, I've decided the same unfortunately.

steady jewel
#

Thread seems abandoned unfortunately. Sad.

tiny turret
#

@dry escarp