Is GPU Cloud suitable for deploying LLMs or only for training? | Runpod

umbral jetty Jan 9, 2024, 3:20 PM

#

I'm pretty new to RunPod. I have already built 4 endpoints on Serverless and it was pretty straightforward, but I don't understand whether GPU Cloud is also suitable for pure LLM inference via API for chatbot purposes, or whether it's only for training models and saving weights. The main question: can I also deploy my LLM for inference on GPU Cloud for production? Where do I find the API I should call? I find Serverless very unstable for production, or maybe it's my fault: whenever a worker starts, it downloads the model weights again, which are sometimes 100GB+ and take 5-15 minutes. So after a user makes a query, they would have to wait up to 15 minutes for a response from Serverless while the worker first downloads the weights from HuggingFace and then runs inference.

naive prawn Jan 9, 2024, 3:24 PM

#

GPU Cloud does not scale as well as Serverless and should typically not be used for production applications. Serverless is better suited for production applications that need to serve multiple concurrent users. Serverless is fine for what you are doing; you have just built your worker incorrectly. You should either build the model into your Docker image or store it on a network volume, but you should definitely not be downloading it within the worker itself.
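
A minimal sketch of the load-once pattern this advice implies: the weights are read from a path that already exists in the image (or on a network volume) and are cached for the lifetime of the worker, so only the first request pays the load cost. `MODEL_DIR`, `load_model`, `get_model`, and the shape of the request and response are hypothetical, and device placement is left out. The RunPod SDK entry point appears only inside `main()`, which is not called here.

```python
import functools

# Hypothetical path where the weights were placed at image build time.
MODEL_DIR = "/models/my-llm"


def load_model(model_dir):
    """Load tokenizer and model from local disk only (no network access)."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)
    return tokenizer, model


@functools.lru_cache(maxsize=1)
def get_model(model_dir=MODEL_DIR, loader=load_model):
    """Return the cached (tokenizer, model) pair, loading it on first use."""
    return loader(model_dir)


def handler(job):
    """Handle one request; assumes job["input"]["prompt"] holds the prompt."""
    tokenizer, model = get_model()
    inputs = tokenizer(job["input"]["prompt"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return {"text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}


def main():
    """Container entry point; requires the runpod package to be installed."""
    import runpod

    runpod.serverless.start({"handler": handler})
```

With this shape, a worker that stays warm between requests reuses the loaded model, and nothing is fetched from HuggingFace at request time.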

umbral jetty Jan 9, 2024, 6:18 PM

#

naive prawn GPU cloud does not scale as well as Serverless and should typically not be used ...

Understood, thanks. Between these two options, building the model into the Docker image or storing it on a network volume, which one is better?

#

I have added a network volume, but it doesn't seem to work

naive prawn Jan 9, 2024, 6:20 PM

#

umbral jetty Understood, thanks, between these two options, building the model inside the doc...

Inside the Docker image, because then you are not restricted to a specific region and have higher GPU availability. It's also very slow to load large models from the network volume disk.

umbral jetty Jan 9, 2024, 6:22 PM

#

naive prawn Inside Docker image because then you are not restricted to a specific region and...

Awesome, thanks, then this is the way to go. One more question about FlashBoost: should I always use it to reduce cold starts to 2s, even for a big 70B LLM, or does it have some restrictions and possible issues?

naive prawn Jan 9, 2024, 6:23 PM

#

umbral jetty Awesome, thanks, than this is the way to go, one more question about FlashBoost,...

It doesn't always reduce cold starts to 2s and is also only really beneficial if you have a constant flow of requests

twin lynx Jan 9, 2024, 7:21 PM

#

umbral jetty Awesome, thanks, than this is the way to go, one more question about FlashBoost,...

Better to keep it on still, imo, and set your max workers to 3. It's better to have more max workers; RunPod has some issues with max workers at 1, because that's more of a development setting.

#

U don't pay for workers unless they go active from a request

naive prawn Jan 9, 2024, 7:26 PM

#

twin lynx Better to keep it on imo still, and set your max workers to 3, is better to have...

Other users using the GPU and causing it to become throttled when you set max workers to 1 isn't a RunPod issue. It's a user issue, because RunPod defaults it to 3, not 1.

twin lynx Jan 9, 2024, 7:29 PM

#

naive prawn Other users using the GPU causing it to become to become throttled when you set ...

Yeah, maybe a user education issue tho.

#

on best practice

#

agree. i guess when most ppl start off with runpod it isn't an expectation to get throttled / sometimes i find that runpod has a weird initialization with max of 1

naive prawn Jan 9, 2024, 7:31 PM

#

twin lynx on best practice

Best practice is to use the sane defaults and not be a moron by changing it 😂

twin lynx Jan 9, 2024, 7:32 PM

#

naive prawn Best practice is to use the sane defaults and not be a moron by changing it 😂

xD fair. but i can see why ppl change it. when i first started, I think it was like, I was running out of endpoints by everything being 3. (cause i only have a max of 10 at the time) And then I changed it to one thinking I just need one active endpoint like a lambda function. Not really discussed in the docs that the GPUs get throttled

#

so a lot of people will assume it's like a lambda - kinda marketed, i feel, as a lambda with a gpu

#

wish in the docs it said max of 2 workers also gets u the 5 workers on the side. I think flash said 1 is considered development, 2 or more is considered prod

naive prawn Jan 9, 2024, 7:34 PM

#

I think too many people use Serverless to save money on occasional inference instead of using it to scale out their production applications

naive prawn Jan 9, 2024, 7:35 PM

#

twin lynx wish in the docs it said max of 2 workers also gets u the 5 workers on the side....

2 to 5 all give you 5 but it's not really 5, the max is still honoured, so if you have 2, only 2 will run at once, the other 3 are just there to help with throttling

twin lynx Jan 9, 2024, 7:35 PM

#

Yup yup

umbral jetty Jan 10, 2024, 12:27 AM

#

naive prawn It doesn't always reduce cold starts to 2s and is also only really beneficial if...

One more question. I believe you have experience with HF transformers and LLMs: do you know what command to put into the Dockerfile to get the weights pre-downloaded while building the Docker image, so that they can later be loaded in the endpoint with .from_pretrained? Or am I looking in the wrong direction?

I thought I could just download the model repo and then use the .from_pretrained method to load the weights from the local folder, but it looks like the files have different extensions or something. It doesn't work, and I still haven't found a reliable solution.

In the Dockerfile:

RUN git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

In the worker (Python):

from transformers import AutoModelForCausalLM

model_path = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(
    f"./{model_path.split('/')[1]}/",
    local_files_only=True)

And I get this error: SafetensorError: Error while deserializing header: HeaderTooLarge
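
One plausible cause of this error, which the thread does not confirm: if `git clone` runs without Git LFS installed, each `.safetensors` file in the clone is a small text pointer rather than the real weights. A safetensors file stores its header size in the first 8 bytes as a little-endian integer, so the pointer's opening text is read as an absurdly large header length. The sketch below checks a cloned folder for such pointers; the helper names are hypothetical.

```python
from pathlib import Path

# Git LFS pointer files are small text files that begin with this line.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts like a Git LFS pointer, not real weights."""
    with open(path, "rb") as f:
        return f.read(len(LFS_PREFIX)) == LFS_PREFIX


def find_lfs_pointers(model_dir):
    """List the .safetensors files in model_dir that are still LFS pointers."""
    weight_files = sorted(Path(model_dir).glob("*.safetensors"))
    return [p.name for p in weight_files if is_lfs_pointer(p)]


def header_length(first_eight_bytes):
    """Read 8 bytes the way a safetensors file encodes its header size."""
    return int.from_bytes(first_eight_bytes, "little")
```

If `find_lfs_pointers("./Mistral-7B-Instruct-v0.2")` returns any names, the clone did not fetch the weights. Installing Git LFS in the image before cloning, or downloading through the `huggingface_hub` package instead, would be the usual ways around that.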

twin lynx Jan 10, 2024, 3:40 AM

#

I'm not sure you can run code like that? Like that f"string split. That's like python code? lol. Idk, Im not a docker expert

But you can put that in a bash script or something

#

Another method is to clone the repo locally to your computer

#

and then do a COPY folderwhereyoucopied /dockerlocationfolder
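
A third option, sketched under assumptions: a small script run during `docker build` (for example from a `RUN python download_model.py` line) that fetches the repository with `huggingface_hub.snapshot_download`, so the files land in a known folder that `.from_pretrained` can later read with `local_files_only=True`. The script name, `TARGET_ROOT`, `local_model_dir`, and the injectable `downloader` parameter are hypothetical; only `snapshot_download` with its `repo_id` and `local_dir` arguments comes from the library.

```python
import os

REPO_ID = "mistralai/Mistral-7B-Instruct-v0.2"
TARGET_ROOT = "/models"  # hypothetical location inside the image


def local_model_dir(repo_id, target_root=TARGET_ROOT):
    """Map 'org/name' to '<target_root>/name'."""
    return os.path.join(target_root, repo_id.split("/")[-1])


def download_model(repo_id=REPO_ID, target_root=TARGET_ROOT, downloader=None):
    """Download all files of repo_id into its local folder; return the path."""
    if downloader is None:
        # Imported lazily; needs the huggingface_hub package at build time.
        from huggingface_hub import snapshot_download as downloader
    target = local_model_dir(repo_id, target_root)
    downloader(repo_id=repo_id, local_dir=target)
    return target
```

At build time the script would call `download_model()`. In the worker, the same folder can then be passed to `AutoModelForCausalLM.from_pretrained(local_model_dir(REPO_ID), local_files_only=True)`, so the build and the worker agree on one path.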
