# Support Ticket responses sound AI-generated and are non-specific

spiral karma
#

Every support ticket response I've gotten sounds like it was written by AI, and in the worst way.

One example is a recent response I got to an ongoing problem, where the agent said, "This is likely because you have only one specific GPU type selected for this endpoint," which is absolutely false. I sent him my endpoint ID. He could have easily looked and seen that I have many possible GPUs selected.

The "likely because" feels like something written by AI, and I feel like the support agents are just sending out rapid AI responses, instead of looking at the specific endpoint to diagnose the problem.

This has resulted in almost every support ticket I've opened going back and forth needlessly, longer time-to-resolution, and much more frustration.

silk spear
#

Do you know your ticket id?

spiral karma
#

Yes, it is 24807

silk spear
#

hmm, you have 1 outdated worker and 4 throttled

#

I would delete the outdated one, as it could prevent the others from getting requests

spiral karma
#

I thought about that, but I was scared that it would make my endpoint stop working b/c the only non-throttled worker was deleted

#

I also don't understand why the other workers are throttled since I have a variety of GPUs available

silk spear
#

throttling is a bit of a dynamic thing

#

it's all based on supply and demand

spiral karma
#

Actually - even though I have 1 active worker, there are now 63 jobs in queue...and no worker is running

#

Oh wait - the outdated one just started and is processing them

#

So, since it's throttled - does that mean that ALL of the GPUs I have selected are not available? That is a bit scary

silk spear
#

to be honest, I would not worry and would delete the outdated worker so requests can go to the new workers

spiral karma
#

If H200s are throttled - why would it not use H100s or other GPUs? Why would it just show throttled?

#

Was that outdated worker never going away just a bug in RunPod?

#

Obviously I wouldn't expect to have to log in and manually terminate workers.

silk spear
#

it did not go away because you had a lot of requests

#

I suspect you used the new release function

spiral karma
#

Well - if you look at the original ticket, no workers were pulling requests at all...no worker was running. So, I was specifically trying to trigger a new release just to try and get the system unstuck.

There were really two issues:

  1. My initial issue is that I had a bunch of idle workers and none of them were running jobs.
  2. Then, to try and get it "unstuck" I made changes to my endpoint configuration to try and "force" a release. And after that worked, I then ended up with a bunch of throttled workers, despite the fact that the way I triggered the release was by adding more GPU types.

The only weird thing I noticed for issue (1) was that I had a SIGTERM in the logs...so I thought maybe a worker crashed and that caused the system to get stuck.

silk spear
#

It's because the service will prevent sending tickets until all is ready

#

as for the SIGTERM, it does not look like it came from us.

spiral karma
#

So...any advice on how to prevent this in the future?

silk spear
spiral karma
#

Ok - right now I'm just using your standard vLLM template - not a custom serverless function

#

(although we are building one of those for another service)

#

One last question....why won't my endpoint let me add A100 as an option? Every time I click those checkboxes under "Advanced", save, and then reopen the edit dialog, they are unchecked again.

silk spear
#

Oh, another tip: remove CUDA versions under 12.4 from the filters, and also uncheck CUDA 12.9

spiral karma
#

oh - so maybe the A100 doesn't support some of those versions?

#

Or some of those versions don't support A100

silk spear
#

nah, all machines run 12.4 at minimum

spiral karma
#

Well, after making the CUDA change, it now lets me save A100

silk spear
#

Hmm, I think with A100s it might be a bug

spiral karma
#

While I have you - sorry one other question.

One other "problem" I see happen daily is that if we are at 0 workers and jobs start to come in, the worker starts running. But because we are using a 70B model, it takes a while to start up.

While it's starting up, we hit the Queue Delay threshold - and more workers start up...so now we have 2, 3, 4 workers. Of course - we only really needed 1 worker, but the scaling up happens faster than the startup time of the worker.

I don't really want to increase the Queue Delay, because under a scaling scenario I would want them to go up. I guess what I really want is for the clock to start ticking once we have at least one worker running.

I've run into this with other autoscaling services before, where the scaling criteria from 0-to-1 is often different from 1-to-2. Although I guess this same thing could happen when going from 1 worker to 2...where it adds a 3rd or 4th prior to the 2nd one starting to take jobs.

Any advice? Thanks!

#

Other autoscaling services have a "minimum time between scaling events" setting or something like that to help with this too
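
A minimal sketch of the kind of guard described above, purely illustrative and not how RunPod's scaler is known to work. The names `ScalerState` and `should_add_worker`, and the `queue_delay_s`, `cooldown_s`, and `max_workers` values, are all hypothetical. The idea: treat 0-to-1 as a special case, do not add workers while the first one is still loading the model, and otherwise enforce a minimum time between scale-ups.

```python
from dataclasses import dataclass


@dataclass
class ScalerState:
    running: int = 0    # workers that are ready and able to take jobs
    starting: int = 0   # workers launched but still loading the model
    last_scale_up_s: float = float("-inf")  # time of the last scale-up


def should_add_worker(
    state: ScalerState,
    oldest_wait_s: float,        # how long the oldest queued job has waited
    now_s: float,
    queue_delay_s: float = 4.0,  # hypothetical queue delay threshold
    cooldown_s: float = 30.0,    # hypothetical minimum gap between scale-ups
    max_workers: int = 5,        # hypothetical worker cap
) -> bool:
    total = state.running + state.starting
    if total >= max_workers or oldest_wait_s <= queue_delay_s:
        return False
    if total == 0:
        # 0-to-1: nothing is running or starting, so start one right away.
        return True
    if state.running == 0:
        # The first worker is still cold-starting; piling on more workers
        # would only add more cold starts, so wait for it to come up.
        return False
    # 1-to-N: scale on queue delay, but not more often than the cooldown.
    return now_s - state.last_scale_up_s >= cooldown_s


# Usage: a cold endpoint with jobs waiting, checked at two points in time.
state = ScalerState()
if should_add_worker(state, oldest_wait_s=10, now_s=0):
    state.starting += 1
    state.last_scale_up_s = 0
# 100 seconds later the 70B model is still loading, so no second worker yet.
print(should_add_worker(state, oldest_wait_s=110, now_s=100))
```

With a check like this, the queue delay clock effectively only matters once at least one worker is running, which is the behavior I'm after.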

silk spear
#

ok, so I checked: my old vLLM endpoint can't enable A100, but on a new one it works.

spiral karma
#

weird - any idea why?

silk spear
#

you might want to play around here:

#

are you using network storage or model baked into docker image?

spiral karma
#

We are passing a Hugging Face model to the vLLM image.

silk spear
#

so you are using the new model cache?

spiral karma
#

I think so (not sure how to verify as another engineer set it up initially) - and we have FlashBoot enabled

#

We are using the "Model (optional)" setting to set the model - which I think enables the "cached model" feature

silk spear
#

yup it is

spiral karma
#

Yea - it's happening right now. Despite having active workers set to 1....I have 90 jobs in queue...0 jobs in progress and 4 running workers. The first worker took over 3 minutes before it processed the first job