#vLLM Lag spikes

gritty sand
#

If you look at the logs when one of them is doing this... it's just sitting there... no activity... then eventually it starts to load the model (the one that's supposed to be cached locally) and eventually starts taking requests.

gritty sand
#

engine.py :167 2025-11-03 00:24:33,022 Initialized vLLM engine in 374.43s

gritty sand
#

Nobody seems to think this is a problem at all!

fiery quail
#

We do nothing to censor or filter you from speaking your mind here, and I run the Discord to specifically facilitate those opinions and take them to the people who can do something about it in cases where I cannot make it better myself (it would require more than one engineer).

It's not particularly in our interest to consume any amount of "dead time" just for fun and revenue, losing a customer in the process makes it just a bad investment entirely and it's especially not the job of the support team to intentionally cause situations like this for any revenue boost at all.

Truly, at the end of the day our job is to provide you the hardware - and we do - anything past that is ultimately our constant decision to go above and beyond. From a (very) quick analysis of your message and some metrics, what you describe is a strange combination of known flaws with Ray and resource allocation.

Our team is happy to help you debug this or point you in the right direction and build confidence in deciding if something is our fault or a place for us to improve.

However, this cannot (and will not) happen when you treat the support team the way you have been. This behavior only alienates you, and makes our team just less likely to be willing to continue to go out of their way to assist you.

You've been emailed a link to schedule a call with a member of our technical support team. While following through is your decision, I am here to tell you that we do, have, and will remove access to Zendesk or the Runpod platform for continued abusive behavior.

PS: When I take a look at the worker you shared (xqgyjnc...), I can see this worker spends almost all of its time with its RAM consumed. When I take a look at the request lifecycle, I can see this pod took, processed and finished most requests in under 10 seconds.

gritty sand
#

@fiery quail I understand that. And yes, the communication style of the tech support people was, in my opinion, a little triggering.
They ignored the points I made and suggested that trivial fixes would solve the issue, when you know this has been my primary focus: trying and changing every part of the config to make it right. It's been a long evolution, starting with the early version with serialized, baked-in models.

My theory is that Ray is not configured correctly, or that, due to the unstable platform, every single worker is always slamming the CPU. And we all know the effect of that: literally waiting up to 7 minutes before the model starts loading. People have suggested that it's downloading the model, but this is logically impossible. As you said yourself, 80% of requests will run fine, but the other 20% make the streaming reader time out. After this, the chat client actually goes to sleep and tries to refresh the stream reader. Luckily the backend can wait a little longer, as it's an async connection to the API. So even if the app loses the ability to write to the screen, the message is still in the database, and the app has to be reloaded to get it.

When these dropouts happen, it's not an idle timeout. It can be random: one user reported that 5 times in an hour they would be chatting at an expected speed, reroll a message that had just come in, and have to wait 5-10 mins for the Ray server to restart. Why did it quit when there were still jobs in the queue? Since updating to the later version I've seen enginecore/Ray crash violently numerous times. But mostly it seems like it just decided to quit.

On that last point I do respectfully disagree: the engineers know the network they have built and know how it works optimally. This is why other cloud providers go above and beyond, some even providing an SDK to let the app use GPU time natively. I see it as a partnership: you provide hardware, I provide users. We work together, we both win; I go broke, we both lose.

#

The end effect is that this happens 20-30 times a day, burning credit while the worker is... loading itself? It becomes unsustainable when scaled.

#

And finally... three different versions of vLLM... all the different GPUs... different models... different model serving... all make no difference. So the only thing left is exhausted hardware, which is what you said you provide, along with "flashboot", which obviously does not load the worker in milliseconds.

#

I even tried on my dev server... loading the crap out of it, I still can't get vLLM to take over 500 secs to load its core.

#

At the end of the day I've spent 1000's of dollars developing the app to work with your cloud... moving anywhere else will invalidate that work. So my only choice, I feel, is to drag this over hot coals.

fiery quail
#

I've been following this issue with you from the start, I truly understand the frustration but I cannot interact with you once you've crossed a certain line. I have to be assertive in my delivery of this warning when I'm requested to deliver the warning. Oft I am just a messenger. I do my best to make it obvious when I am stepping in versus when I am being asked to step in.

Workers only slam the CPU when they're directed to. I've run my own production workload on Runpod since before my time as an employee even - no issues on that front in my own experience, and if it were widespread, ideally by now we'd've picked up on the pattern.

I am not here to say "the platform is flawless" but instead offer my own (again, very very quick) analysis of what I identified from the single worker id shared here. Realistically when you're facing a bad worker I don't know what the holdup is, there are a lot of variables to consider and debugging it correctly takes a lot of time. This also isn't to say I would not dedicate that time, but you do have to be respectful or you lose the ability to have me (or anyone, really) do that for you.

vLLM is complicated: inside it are two generation engines and something like 100 different tuning levers. Ray is complicated and comes with its own set of issues. Without knowing and considering all of these variables, the support team can really only guess. The person triaging your ticket has to either be an expert in both of these pieces of software or be capable of grabbing someone who is, in order to recognize things like Ray's PID exhaustion or when vLLM is using its newer engine (or even just that two engines exist!). I don't know what other people think or how closely they pay attention, but searching your ticket ending in 33, I only ever see Ray mentioned in stack traces, and the first time, Ray is emitting the same error I linked in the issue above.

Although I can recognize this as I've seen it in the past, neither support nor I could tell you what's wrong without spending the time to research. We don't use or provide Ray anywhere to my knowledge, so anything I know about it is incidental. If it's misconfigured, the most any of us could do, short of having full access to your codebase, is tell you "yeah this one's Ray, sorry". If we think the worker is bad and there's something we could do, I would be one of the first people notified - but it takes time to identify who is at fault, and truly, while I understand support can be frustrating, they are trying to help.

#

I also recognize that this is by no means actionable, but it's significantly easier for us to help people who want help and allow us to help. If you have a complaint about how your support ticket is handled, or feel ignored or misunderstood by a single support representative, or anything similar, you can always inform support directly or involve me personally.

gritty sand
#

Ok, can you clarify what is happening with your repo, then? I assumed you built releases you knew worked on the platform. If this isn't the case, should I be trying to build my own images and avoiding the repo ones?

fiery quail
#

One more addition, I do want to explicitly state that I am technically not a member of our support team. I am a Software Engineer on our Developer Experience team.

fiery quail
gritty sand
#

Also yes I get stressed and sometimes nasty .. I apologize for that. but this only started when I felt support started ignoring me and reading off a script ... does that not make you frustrated as well?

fiery quail
#

I get it, I want the Discord to be a space for users to collaborate with each other and for me to use the experiences I see people having to improve support as an outsider. I know the people running the support team and it's something I care about deeply and know they're receptive to.

gritty sand
#

My P.O.V is I have a budget ... if that budget is wasted like this it's me that has to face the boss and ask for more money ... I have to explain why my budget was blown .. and saying it's because of hardware issues in a T1 Cloud ... really gets my head bitten off

fiery quail
#

I'm not entirely convinced you're seeing hardware issues, but I understand where you're coming from. If you'd like, give me your endpoint id or a few request ids that take a suspiciously long time, error out, or do anything strange, and I'll look into it. You do not have to schedule a call unless you want to, and I personally offer credit refunds for any time spent working with me on debugging or getting me data.

gritty sand
#

I have booked a call during my day tomorrow with the person who's taken over the ticket. But I'm not looking forward to it. And if it's not hardware, I'd love to know. Really, it's the only variable. But at the end of the day, I have to get this solved. You know the life of a developer, and you know how fast I'll be putting the fries in the bag if I say the last 4 months was a failure.

#

@fiery quail Here's what I can do. I'll get the beta testers to log every timeout, when it occurred, and whether it recovered (delivered a message without a reload). I'll marry these to the events as seen in the logs. Hopefully we can come up with a solution. But my biggest concern is that you say you see no pattern. Surely I can't be the only one using Runpod's image. And I'm using really small models on really big cards... once the model is loaded it takes under a second to deliver over 1000 tokens... it's just these startups. And the only thing I provide for that is a link to HF.

#

If it is just me, is it bad practice? Should I be using fallback endpoints, maybe tightening up my retries? Should I be auto-clearing the queue, trashing workers, and changing endpoints, maybe?
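
A minimal sketch of what "fallback endpoints plus tighter retries" could look like on the client side. Everything here is hypothetical: `call_with_fallback`, the endpoint names, and the `send` callable stand in for whatever client the app really uses, and nothing below is a Runpod API. The 180-second default only mirrors the 3-minute cutoff discussed in this thread.

```python
import time
from typing import Callable, Sequence


def call_with_fallback(
    endpoints: Sequence[str],
    send: Callable[[str, float], str],  # hypothetical: sends one request, may raise TimeoutError
    timeout_s: float = 180.0,           # assumption: give up on a worker after 3 minutes
    attempts_per_endpoint: int = 2,
    backoff_s: float = 0.0,
) -> str:
    """Try each endpoint in order, moving to the next after repeated timeouts."""
    last_error = None
    for endpoint in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                return send(endpoint, timeout_s)
            except TimeoutError as exc:
                last_error = exc
                # Exponential backoff between attempts (no delay when backoff_s is 0).
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("all endpoints timed out") from last_error


# Usage with a fake sender that simulates a primary endpoint stuck in cold start.
calls = []


def fake_send(endpoint: str, timeout_s: float) -> str:
    calls.append(endpoint)
    if endpoint == "primary":
        raise TimeoutError("cold start exceeded timeout")
    return f"reply from {endpoint}"


result = call_with_fallback(["primary", "fallback"], fake_send)
print(result)
```

Capping attempts per endpoint keeps a stuck worker from holding the stream reader for the full cold start; whether a second endpoint is worth its idle cost is a separate budgeting question.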

fiery quail
gritty sand
#

@fiery quail Hi,
This is what I have so far. Will run samples in a few hours. Wanted to get a snapshot of a full day.

fiery quail
#

Fantastic, this will do - I'll be back.

gritty sand
#

So in summary ... about 250 requests at about a 10% failure rate. But those 10% equaled 90% of the credit I spent today.

#

When you scale that to 2500 requests, those numbers just get horrid...

#

Not to mention, this is end-user stuff. A tester doesn't mind reloading, but for an end user, well, stats suggest reloads contribute to attrition the worst.

hexed yew
#

Using vLLM on serverless is a pain. Let me suggest a couple of things that sped up my worker: use ENFORCE_EAGER=True in the image; to avoid crashes due to drivers, enable VLLM_ENGINE_ITERATION_TIMEOUT_S=300; and the best thing for speeding up the worker from cold start is to pack the model inside the container.
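
For reference, the two settings above written out as an environment fragment. The names and values are copied from the message as-is; whether a given worker image actually reads `ENFORCE_EAGER` is an assumption to check against that image's documentation, and `render_env` is a hypothetical helper. In vLLM itself, eager mode skips CUDA graph capture, which generally trades some throughput for a shorter startup.

```python
# Settings quoted from the message above. Assumption: the worker image maps
# these environment variables onto vLLM's own options.
suggested_env = {
    "ENFORCE_EAGER": "True",                   # skip CUDA graph capture at startup
    "VLLM_ENGINE_ITERATION_TIMEOUT_S": "300",  # allow slow engine iterations before erroring
}


def render_env(env: dict) -> str:
    """Render a mapping as KEY=VALUE lines, e.g. for an env file."""
    return "\n".join(f"{key}={value}" for key, value in env.items())


print(render_env(suggested_env))
```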

#

it is about 15-20s from cold start

#

Another tip: if you can quantize your model, it will speed up the loading time.

gritty sand
hexed yew
#

From my experience it is not stable yet

#

If you already have the model baked in, enable eager; the difference will be immediate.

gritty sand
#

@hexed yew anyway thanks again for the tips it all helps, hopefully there's a solution. it's 5am and I haven't slept again 😴

hexed yew
#

What's your context length? In the past I suffered from what looked like a driver issue, but in the end my requests were crashing with a timeout due to a max-length error, weirdly handled by Runpod as a driver issue.

gritty sand
#

same as yours at the moment 28000

hexed yew
#

And still 20% of them crashed?

gritty sand
#

I truncate at 22000 anyway though

#

and the models actually are 10% bigger than 28000

hexed yew
gritty sand
#

Well, at the moment it's just an oldest-first purge. But I'm moving to RAG context soon... nearly finished.

hexed yew
#

I sometimes get issues inside my RAG that uses URLs. Be aware that if the RAG purges based on spaces or word count, a very, very long URL may trigger the context length limit. Is your RAG dealing with documents that contain URLs?
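
A minimal sketch of the pitfall described above, combined with the oldest-first purge mentioned earlier. The 4-characters-per-token figure is a rough assumption, not a property of any particular tokenizer, and `estimate_tokens` and `purge_oldest` are hypothetical helpers. A real implementation would count with the model's own tokenizer.

```python
from collections import deque

CHARS_PER_TOKEN = 4  # rough assumption; real tokenizers vary by model


def estimate_tokens(text: str) -> int:
    """Conservative estimate: the larger of word count and characters / 4."""
    by_words = len(text.split())
    by_chars = -(-len(text) // CHARS_PER_TOKEN)  # ceiling division
    return max(by_words, by_chars)


def purge_oldest(messages: list[str], budget: int) -> list[str]:
    """Drop the oldest messages until the estimated total fits the budget."""
    kept = deque(messages)
    total = sum(estimate_tokens(m) for m in kept)
    while kept and total > budget:
        total -= estimate_tokens(kept.popleft())
    return list(kept)


# A long URL is a single "word" but many tokens.
url = "https://example.com/" + "a" * 400
print(len(url.split()), estimate_tokens(url))

history = ["old " * 10, "new message"]
print(purge_oldest(history, budget=5))
```

Counting by whitespace alone would treat the URL as one unit, so a purge tuned to 22000 "words" could still overflow a 28000-token context.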

gritty sand
#

No it's rpg stuff... think backyard.ai but better .. imo 🙂

#

heavy lore fantasy data mostly ... Tolkien and the likes.

hexed yew
#

Nice, are you able to share the Dockerfile of your baked model?

gritty sand
#

if you remind me tomorrow. it's about 20 builds ago so it's purged from the repo. but I think I still have it on my desktop at work.

#

or more so if I remember 🙂

hexed yew
#

Sure, I can help you tomorrow

gritty sand
#

Send an FR.. probably shouldn't fill up this thread with chat 🙂

fiery quail
#

Glad you two are getting along, noticed sort of the same problem between your two workloads. There's 100% work we can do to help with the issues you're seeing @gritty sand - but that will take me some time. In the meantime:

I've noticed a majority of what you report as dead time is actually the cold start process for the pod. I don't have access to specific customer account logs but because of the open source nature of the image, I can work out what's going on between the time the pod start is emitted from our backend to when you're told the vLLM image is initialized. However, I also noticed one of these workers is stopped then reinitialized (thus triggering the same cold start). You should be able to address this (and in the meantime, the cold start time) by increasing your idle timeout. The specific burstiness is up to you (and how much you're okay with spending on potentially idle time) but I think even a small bump would smooth some of the bumps.

#

I notice also most of the actual CPU usage immediately tanks after the GPU kicks in, so that may be a non starter but I'm also not sure why you use Ray at all (and I'm not currently sure if we ship this as a default for you)

gritty sand
#

@fiery quail So in summary ... about 250 requests at about a 10% failure rate. But those 10% equaled 90% of the credit I spent today.
When you scale that to 2500 requests, those numbers just get horrid...
Not to mention, this is end-user stuff. A tester doesn't mind reloading, but for an end user, well, stats suggest reloads contribute to attrition the worst.

Ok, at this point I'm a little worried. You just repeated what I said, on the very sheet you have. I said it was startup times... however, the stats count 241 cold starts for the period, and 10% are broken. Dude, I'm really, really disappointed here. I'm trying to stay calm. Yes, you were the one that suggested I update to the latest vLLM, including Ray. I'm going to go cry now because I actually had hope this was making progress.

#

I can assure you with utmost certainty that I run the image released by Runpod. It was requested (not suggested) that I update to this version. THIS version clearly uses Ray, as even Jason was talking about it in the other ticket. I run this image locally, taken from Runpod's repo, and I never get timeouts. Everything runs fine. And strangely, when it's running it does not use 100% of the CPU. In fact, are you sure the telemetry windows are for individual instances? Because my test box DOES NOT jump to 14000 threads upon startup.

And yes, it is cold start time... like the other 241 cold starts... the ones that, like on my dev box, take seconds, not MINUTES.

#

And yes... IF THE POD'S STARTUP TIME EXCEEDS 3 MINS I'm going to treat it as crashed. Come on dude. This is really uncool.
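
A minimal sketch of how that 3-minute rule could be applied to worker logs. The regex is written against the single engine-init line quoted at the top of this thread; other log formats are not covered, and `classify_startups` is a hypothetical helper.

```python
import re

# Matches lines such as "... Initialized vLLM engine in 374.43s".
INIT_RE = re.compile(r"Initialized vLLM engine in (?P<seconds>\d+(?:\.\d+)?)s")
CRASH_THRESHOLD_S = 180.0  # the 3-minute cutoff stated above


def classify_startups(log_lines):
    """Return (seconds, is_slow) for every engine-init line found."""
    results = []
    for line in log_lines:
        match = INIT_RE.search(line)
        if match:
            seconds = float(match.group("seconds"))
            results.append((seconds, seconds > CRASH_THRESHOLD_S))
    return results


sample = [
    # The one real line quoted earlier in the thread.
    "engine.py :167 2025-11-03 00:24:33,022 Initialized vLLM engine in 374.43s",
    # Placeholder for any non-matching log line.
    "unrelated line",
]
print(classify_startups(sample))
```

Joining these flags against the beta testers' timeout timestamps would show whether every client-side dropout lines up with a slow engine init.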

#

side note pretty sure @hexed yew doesn't have any problems. but I do implore that no more comments are made on this thread by anyone but myself and DJ .... as I can only assume this is what caused him to miss my summary statement above.

gritty sand
#

In good faith, I'll try to ignore the reply above. However, since you seem to have done a little diving, I'll explain why that's wrong: the 20 cold startups out of 241 (as shown in the endpoint metrics) hang BEFORE loading shards. Most of the time is spent here (turn it upside down).

#

Now, that aside: if this new system caches models, where are they cached? The logs are a bit of a mess to read, but one log entry suggested it took 70 secs to download the model. It doesn't take me that long to download it from HF. (I guess it could if the model isn't locally cached as advised and the link to the internet/HF is saturated.) So herein lies the issue: if I use the image provided by you, copy the env from that release, host it on my own box with a FastAPI queue, and put it under the same load, why do my startups only take a MAXIMUM of 20 secs, on hardware that's half the spec of server grade?

#

If you were to suggest that the telemetry screen is just for my image (not shared server stats like most provide), then why, when I run this image locally with the same load, do I not see 100% CPU usage or 14000 processes?

#

And why would every single worker that fails ironically have what I assume is some sort of set ceiling of 14000, for a total load of 250 requests?

#

Maybe the worker sitting for 7 mins at the screen above was downloading the model... at 20KB/sec ...

#

OK. I've said enough. I really expected better than "everything is normal nothing to see here"

#

I really didn't want to point out the obvious... but why are most cold starts on this platform under 75 secs, while 10% are double that?

#

Waiting 20-30 secs for a reply, like what happened ~220 times out of 250: perfectly fine. And that was 10% of the credit I burned yesterday.

#

90% of the credit spent on 20 requests.
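
Putting rough numbers on that split, using only the figures quoted in this thread (about 20 slow requests taking 90% of the credit, about 220 normal ones taking the remaining 10%). `cost_ratio` is a hypothetical helper, and the inputs are the approximate counts above, not billing data.

```python
def cost_ratio(slow_requests: int, fast_requests: int, slow_credit_share: float) -> float:
    """How many times more credit one slow request burns than one normal request."""
    slow_each = slow_credit_share / slow_requests
    fast_each = (1.0 - slow_credit_share) / fast_requests
    return slow_each / fast_each


ratio = cost_ratio(slow_requests=20, fast_requests=220, slow_credit_share=0.9)
print(round(ratio))  # roughly 99
```

On these figures a single slow cold start costs about as much as 99 normal requests, which is why the per-request failure rate understates the budget impact.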

fiery quail
#

It's unsustainable for me to spend a morning correlating a bunch of data for you and also consume all of the text of every single one of your support tickets. So I'm not sure what you were told or what you've tried outside of your last ticket.

I simply told you I could confirm the issue, that I would be willing to work on the image to try to find why you're seeing such a delay, and that I wasn't sure why you were using Ray. If you weren't either, you could just say that - I don't know if this is something you may've been asked to change or changed on your own. (For posterity's sake, I've learned this is a default we define in the Runpod console, and I don't have enough knowledge here to know whether I should just unilaterally change this for everyone.)

Aside from what I've been doing for you here, we've had a team working on an improvement to the model cache feature, which should help with the time spent transferring the image from our volume to your Pod. But if there is an issue with the worker-vllm repo, I will spend the time sometime this week (maybe next week, it's been a long one) tracking it down.

fiery quail
#

You're going to receive an email from support on behalf of another engineer who's been working on the Model Store side of the problems you're having. Give all of those a try.

gritty sand
#

@fiery quail Cool will keep an eye out. I'm not so sure why the issue only occurs 20% of the time. This is why I think it's a hardware related issue. I feel I've made it pretty clear. But sure, if you can work out why it's crashing unexpectedly and failing to start (it says something about GPU allocation) then great.