#LLAMA 3.1 8B Model Cold Start and Delay time very long


young gale
#

Hey, our cold start time always exceeds a minute, and so does the delay time. For live use we need this to be quicker. We have tried a network volume as well, but it doesn't change anything.

livid scroll
#

do active workers: 1

young gale
#

Is there no other solution?

#

to control costs as well

livid scroll
#

well, you need to use that endpoint every minute or so

young gale
#

It's usually not used every minute. At night our user count is less so it is not used as frequently.

#

The reason our team pushed for RunPod was that we saw it advertised record cold start times.

livid scroll
#

yeah, those are the only solutions that I know of.. there's no free way to cut costs otherwise, I believe

#

a worker going idle after a run doesn't always mean it needs a cold start again
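To put the cost trade-off between an always-on active worker and on-demand workers in numbers, here is a rough sketch. It assumes per-second billing; every name and every price in it is a hypothetical placeholder, not actual RunPod pricing, so plug in the rates shown for your own endpoint.

```python
SECONDS_PER_DAY = 86_400


def daily_cost_always_on(always_on_price_per_s: float) -> float:
    """Cost of keeping one worker running for a full day (hypothetical rate)."""
    return always_on_price_per_s * SECONDS_PER_DAY


def daily_cost_on_demand(
    requests_per_day: float,
    billed_seconds_per_request: float,
    on_demand_price_per_s: float,
) -> float:
    """Cost when you only pay for the seconds each request keeps a worker busy.

    billed_seconds_per_request should include any cold start or idle time
    that is billed along with the request (an assumption; check your bill).
    """
    return requests_per_day * billed_seconds_per_request * on_demand_price_per_s


def break_even_requests_per_day(
    always_on_price_per_s: float,
    on_demand_price_per_s: float,
    billed_seconds_per_request: float,
) -> float:
    """Requests per day above which the always-on worker becomes cheaper."""
    return daily_cost_always_on(always_on_price_per_s) / (
        on_demand_price_per_s * billed_seconds_per_request
    )
```

If your nightly traffic is well below the break-even number, an always-on worker costs more than paying for occasional cold starts, and the question becomes whether the latency is worth it.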

young gale
#

But I thought for LLMs the cold start time was in seconds

#

according to the blog posts

livid scroll
#

Oh, I rarely read the blogs. Which one, I wonder..

#

and how do you load your models?

#

where from?

young gale
#

I tried loading from a network volume and the normal way too

#

both give same result

livid scroll
#

oh okay, so maybe your loading time comes from the download + loading into VRAM

#

next time it loads it will be faster

young gale
#

Yes next time it is faster

#

but for a request after a while it takes over a minute

livid scroll
#

not the first time after a while

#

I believe that's why serverless won't charge you when it's not used...

#

but if you want it to stay "warm" try active workers

young gale
livid scroll
#

oh ya it's not

#

FlashBoot helps with subsequent requests by keeping the model warm without charging you, that's one way to understand it

regal vigil
#

When you create an endpoint, the worker first needs to download the images. Depending on the size of the model you’re running, this can take some time. If you send a request during this initial phase, it will remain in the queue and won’t be processed because the worker isn’t ready to serve yet.

Once the worker is initialized, performance will depend on your request traffic pattern, idle timeout settings, and the minimum number of workers you’ve configured. If your requests are sporadic and there are no active workers, you will experience a cold start delay. However, if you have a steady stream of requests, you’ll benefit from faster response times.
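To make the interaction between traffic pattern, idle timeout, and active workers concrete, here is a minimal simulation sketch. It models a single worker under simplifying assumptions (the function name, parameters, and the 60-second cold start default are illustrative, not RunPod's actual scheduler or API): the worker stays warm for `idle_timeout` seconds after finishing a request, and any request arriving after that window pays the cold start.

```python
from typing import List, Optional, Sequence


def cold_start_requests(
    arrivals: Sequence[float],
    idle_timeout: float,
    run_time: float = 1.0,
    cold_start_time: float = 60.0,
    active_workers: int = 0,
) -> List[int]:
    """Return the indices of requests that hit a cold start.

    arrivals: request arrival times in seconds, sorted ascending.
    idle_timeout: seconds the worker stays warm after finishing a request.
    run_time: seconds of work per request (assumed constant).
    cold_start_time: seconds to bring a cold worker up (assumed constant).
    active_workers: if at least 1, a worker is always warm in this model.
    """
    if active_workers > 0:
        return []

    cold: List[int] = []
    warm_until: Optional[float] = None  # None: the worker has never started
    busy_until = 0.0

    for i, arrival in enumerate(arrivals):
        # A request waits in the queue while the worker is still busy.
        start = max(arrival, busy_until)
        if warm_until is None or start > warm_until:
            cold.append(i)
            start += cold_start_time
        busy_until = start + run_time
        warm_until = busy_until + idle_timeout

    return cold


# Sporadic traffic with a short idle timeout: every request is cold.
print(cold_start_requests([0, 100, 200], idle_timeout=5))
# A longer idle timeout keeps the worker warm between some requests.
print(cold_start_requests([0, 100, 200], idle_timeout=60))
```

Feeding in your real request timestamps (for example, a quiet night) shows roughly how many users would hit the slow path for a given idle timeout.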