#why the hell are my delay times so high and im bearing all the costs??

shadow cliff
#

yesterday everything was working fine, delay times were a couple seconds. but now the delay times are getting ridiculous and IM being charged for the delay on top of the execution??

echo harness
#

We're currently resolving an incident that affects serverless job times. I will update you when I have news about extra spend as a result of the outage

shadow cliff
#

will we be reimbursed for the unnecessary extra spending in delay times?

#

why are we even being charged for delay times? i was always under the impression im paying for the execution time

shadow cliff
#

if i load in $100 in credits and spin up 10 workers, what happens when my balance falls below $100? @echo harness or what if i want to deploy a new endpoint? will i still be able to deploy 10 workers?

echo harness
#

Yes, it's just a soft check at the time of registration; once you press upgrade you're fine.

mental trench
#

We have a Stable Diffusion endpoint that has failed to boot since the outage last night. The GPUs keep trying to boot up and start serving requests, and we're being charged for that GPU time even though the endpoint is broken due to the outage, which we're still trying to debug. This has been a major hit to our business!

unborn parcel
#

Hey, RunPod support can check that. Have you made a support request?

unborn parcel
#

No, that's the name you sent. The ID looks like random characters; it's in your /run URL.

shadow cliff
#

oh

#

this should be the one:

u7hn1oucmnkkc5

unborn parcel
#

Yep, seems like that's it. Now let dj check it.

echo harness
# shadow cliff this should be the one: u7hn1oucmnkkc5

Everything I see seems to be normal behavior for your workload, but I can only see the lifecycle of each worker (incoming request, pod started, job finished, pod stopped). You should be able to email support for help with receiving reimbursement.

shadow cliff
#

this is from yesterday when i made this thread. are you telling me the 5-12 minute delay times are normal? if so i think we're gonna have to reconsider hosting on runpod

unborn parcel
#

Can you check your endpoint logs? You might be able to see what's wrong with those workers.

wind arrow
#

and cold start / fast boot

shadow cliff
#

I’m telling you it’s not about the model. I’ve run the exact same workflow over the past week and never seen anything remotely close to 12 minutes

#

I’m still handling requests today and the delay time is nowhere near a minute even

wind arrow
#

maybe logs will help debugging?

unborn parcel
#

That's why I'm just asking you to check the logs, if possible.

wind arrow
#

yeah

unborn parcel
#

Especially from that time, and for that specific worker

wind arrow
#

from a dev's perspective

#

the only info they (i mean people here) have is

  1. pods are sometimes taking longer to load

done

#

you can't debug with that

unborn parcel
#

But as dj said, you can create a support ticket or email support with a reimbursement request

wind arrow
#

yeah but if you want to debug together we need logs

shadow cliff
#

How do I get the logs for those ones? They’ve disappeared from the requests tab

unborn parcel
#

Is there a logs tab?

#

Not in requests tab

shadow cliff
#

I can’t find them anymore, they’ve been buried under multiple other requests :(

#

Anyways the bottom line is, will we all be getting reimbursed or not?

unborn parcel
#

I think the best way to get that answer is to ask in a support request / ticket

#

I'm just trying to see what the problem is from the logs if possible, it's fine if you can't find them anymore

wind arrow
#

my thoughts about the delay times are:

  1. the 4 to 5 sec delay time you had b4 was a result of runpod's fast boot feature which essentially keeps the model loaded in VRAM.
  2. the 5 min delay time was probably caused by cold start
#

having

  1. The idle timeout in serverless settings
  2. the image u r using
  3. the model
  4. the interval you send the requests
  5. hopefully the logs if possible

might help debugging

#

possible causes of high delay are:

  1. idle timeout is too low and it causes workers to do a cold boot every time (or you send requests one at a time.)
  2. if u r using a non-official image that may not be cached at the host and cause high boot time
  3. runpod's network volume has speed issues
  4. CUDA Memory leak (the worker could die after processing one request)

unborn parcel
#

Images won't be re-downloaded as long as your worker stays idle, and if your worker is initializing and it counts as delay time, it means your endpoint is new.
So feel free to eliminate that one

unborn parcel
#

What does vocal mean?

shadow cliff
#

they didn't provide any meaningful information other than just saying they have fixed the outage

#

@echo harness can you confirm if we are even supposed to pay for delay times or just execution times? there is no information on this at all

#

if i have a request which has a delay of 2 minutes and execution of 2 minutes, do i pay for 2 or 4 minutes?

echo harness
#

You're not paying for delay time. Delay time is stuff like how long it takes the image to download and start; that's on us. Execution time is how long it takes the model to load and actually do the thing.

#

For that example, 2m

unborn parcel
#

Delay time can be charged too. The thing is, you're charged whenever the worker is running

shadow cliff
#

?? who is right here

#

do we need to bring the ceo in

unborn parcel
#

You're only charged while your worker is running

#

It can be delay time or execution time

shadow cliff
#

@echo harness ^ is that true or not? if it's true then how do i even quantify how much im paying

echo harness
#

Candidly I'm not the best source of information on this, but my understanding is it depends on how your worker is set up. Loading a model should count as delay time, but I'm pretty sure you can do it "wrong" and load your model on request. Technically nothing stops you from shooting yourself in the foot: any code inside the handler function, which is literally responsible for responding to your request, is run time.
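
A rough sketch of the two setups described in this message, assuming a Python worker. `load_model` is a hypothetical stand-in for a slow model load, and the `{"input": {...}}` job shape is used only for illustration. It sticks to the standard library so it runs without the SDK installed.

```python
import time


def load_model():
    """Hypothetical stand-in for a slow model load (weights, pipelines, etc.)."""
    time.sleep(0.05)  # pretend this is the expensive part
    return lambda prompt: prompt.upper()


# Setup A: load once at import time, before any request is handled.
MODEL = load_model()


def handler_preloaded(job):
    # Only the inference itself runs inside the handler.
    return {"output": MODEL(job["input"]["prompt"])}


# Setup B: the "wrong" way from the message above, loading on every request.
def handler_load_per_request(job):
    model = load_model()  # this cost now counts toward every job's run time
    return {"output": model(job["input"]["prompt"])}


def timed(handler, job):
    """Return the handler result and how long the handler call took."""
    start = time.perf_counter()
    result = handler(job)
    return result, time.perf_counter() - start


# In a real worker the chosen handler would be passed to the SDK, e.g.
# runpod.serverless.start({"handler": handler_preloaded}); omitted here
# so the sketch stays runnable on its own.
```

Timing both handlers with `timed(...)` on the same job should show Setup B paying the load cost on each call, while Setup A pays it once before any request arrives.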

echo harness
#

If you want me to take a look at your template and help you understand your delay time I can skim it over now, but it's 2am on a weekend so providing full support is slightly out of my scope at this time.

#

I'm happy to answer questions, etc but fixing a template for you is something I'd rather do on Monday 😛

shadow cliff
#

we'll see what they respond on monday

unborn parcel
#

Ohh okay

echo harness
#

Support was directed to provide reimbursement for the length of the outage iirc (27 minutes) and it was confirmed Pods were unaffected, only serverless users (like you!)

unborn parcel
#

Basically whatever happens before you call runpod.serverless.start() is the delay time that is charged (because the worker is already running)
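
To put numbers on the 2 minute delay / 2 minute execution question from earlier in the thread, here is a small sketch of the billing model as it is described in this thread (you are charged only while a worker is running). It is not an official formula. `JobTiming`, its field names, and the split of delay into "queue wait" and "worker init" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobTiming:
    queue_wait_s: int   # delay with no worker running yet (assumed not billed)
    worker_init_s: int  # delay while the worker is already running (assumed billed)
    execution_s: int    # time spent inside the handler (billed)

    @property
    def delay_s(self) -> int:
        # What would show up as "delay time" for the request.
        return self.queue_wait_s + self.worker_init_s

    @property
    def billed_s(self) -> int:
        # Assumption from the thread: only worker-running time is charged.
        return self.worker_init_s + self.execution_s


# Same 2 min delay and 2 min execution, with the delay spent in different places.
platform_side = JobTiming(queue_wait_s=120, worker_init_s=0, execution_s=120)
worker_side = JobTiming(queue_wait_s=0, worker_init_s=120, execution_s=120)

for name, job in [("platform-side delay", platform_side),
                  ("worker-side delay", worker_side)]:
    print(f"{name}: delay={job.delay_s}s billed={job.billed_s}s")
```

Under these assumptions both jobs report the same 120 s delay, but the first is billed for 120 s and the second for 240 s, which would explain why the two answers in the thread differ.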