#Is GPU Cloud suitable for deploying LLMs, or only for training?

25 messages · Page 1 of 1 (latest)

umbral jetty
#

I'm pretty new to RunPod. I have already built 4 endpoints on Serverless and it was pretty straightforward for me. However, I don't understand whether GPU Cloud is also suitable for pure LLM inference via API for chatbot purposes, or whether it's only for training models and saving weights. My main question: can I also deploy my LLM for inference on GPU Cloud in production, and where do I find the API I should make calls to? I ask because I find Serverless very unstable for production, though maybe it's my fault. Whenever a worker starts, it downloads the model weights again, which sometimes weigh 100GB+ and take 5-15 minutes. So after a user sends a query, they may have to wait 15 minutes for a response from Serverless while the worker first downloads the weights from HuggingFace and then runs inference.

naive prawn
#

GPU Cloud does not scale as well as Serverless and should typically not be used for production applications. Serverless is better suited for production applications that need to serve multiple concurrent users. Serverless is fine for what you are doing; you have just built your worker incorrectly. You should either build the model into your Docker image or store it on a network volume, but you should definitely not be downloading it within the worker itself.
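A minimal sketch of the "load from the image or a network volume, never download in the worker" idea. It assumes the weights already sit in a local folder. The `/models` path for weights baked into the image and the `MODEL_VOLUME_ROOT` variable are hypothetical names, and the `/runpod-volume` default is an assumption about where the network volume is mounted on your endpoint, so check your own setup.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical location for weights baked into the Docker image.
IMAGE_ROOT = Path("/models")
# Assumed mount point of the network volume; override via env var if yours differs.
VOLUME_ROOT = Path(os.environ.get("MODEL_VOLUME_ROOT", "/runpod-volume"))


def resolve_model_dir(model_id: str,
                      roots: Iterable[Path] = (IMAGE_ROOT, VOLUME_ROOT)) -> Optional[Path]:
    """Return the first local folder that looks like a saved model, or None."""
    name = model_id.split("/")[-1]  # "org/Model" -> "Model"
    for root in roots:
        candidate = Path(root) / name
        # A saved transformers model folder contains a config.json.
        if (candidate / "config.json").is_file():
            return candidate
    return None


def load_model(model_id: str):
    """Load the model strictly from local disk; fail instead of downloading."""
    # Imported lazily so the path logic above can be checked without transformers.
    from transformers import AutoModelForCausalLM

    model_dir = resolve_model_dir(model_id)
    if model_dir is None:
        raise FileNotFoundError(f"No local copy of {model_id} found")
    return AutoModelForCausalLM.from_pretrained(str(model_dir), local_files_only=True)
```

Calling `load_model(...)` once at module import time, outside the request handler, means each worker loads the weights once at startup instead of on every request.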

umbral jetty
#

I have added a network volume, but it doesn't seem to work

twin lynx
#

You don't pay for workers unless they go active from a request

twin lynx
#

on best practice

#

Agree. I guess when most people start off with RunPod, they don't expect to get throttled. Sometimes I also find that RunPod has a weird initialization with a max of 1

twin lynx
#

So a lot of people will assume it's like a Lambda. I feel it's kind of marketed as a Lambda with a GPU

#

I wish the docs said that a max of 2 workers also gets you the 5 workers on the side. I think flash said a max of 1 is considered development, and 2 or more is considered prod

naive prawn
#

I think too many people use Serverless to save money on occasional inference instead of using it to scale out their production applications

twin lynx
#

Yup yup

umbral jetty
# naive prawn It doesn't always reduce cold starts to 2s and is also only really beneficial if...

One more question. I believe you have experience with HF transformers and LLMs: do you know what command to put into the Dockerfile to get the weights pre-downloaded while building the Docker image, so that they can later be loaded in the endpoint with .from_pretrained? Or am I looking in the wrong direction?

I tried just cloning the model repo and then using the .from_pretrained method to load the weights from the local folder, but it looks like the files have different extensions or something. It doesn't work, and I still haven't found a reliable solution :(

RUN git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

model_path = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(
    f"./{model_path.split('/')[1]}/",
    local_files_only=True)

And I'm getting the error SafetensorError: Error while deserializing header: HeaderTooLarge
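One possible cause of this error, which the thread does not confirm, is that `git clone` ran without Git LFS. In that case the `.safetensors` files in the cloned folder are small text pointers rather than real weights. The safetensors loader reads the first 8 bytes of the file as the header length, so a text pointer can produce an absurdly large value and fail with `HeaderTooLarge`. The sketch below uses a hypothetical helper, `find_lfs_pointers`, to check a cloned folder for such pointer files:

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir, patterns=("*.safetensors", "*.bin")):
    """Return weight files under model_dir that are LFS pointers, not real weights."""
    pointers = []
    for pattern in patterns:
        for path in sorted(Path(model_dir).rglob(pattern)):
            # Only the first few bytes are needed to recognise a pointer file.
            with open(path, "rb") as fh:
                head = fh.read(len(LFS_POINTER_PREFIX))
            if head == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers
```

If `find_lfs_pointers("./Mistral-7B-Instruct-v0.2")` returns any paths, the real weights were never downloaded, and the clone would need Git LFS installed (or a different download method) before `.from_pretrained` can read them.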

twin lynx
#

I'm not sure you can run code like that in a Dockerfile? That f-string split is Python code, right? lol. Idk, I'm not a Docker expert

But you can put that in a bash script or something

#

Another method is to clone this locally to your computer

#

and then do a COPY folderwhereyoucopied /dockerlocationfolder
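Besides a bash script or a local clone plus COPY, a further option is a small Python script that the Dockerfile runs at build time. This is a sketch, not a confirmed RunPod recipe. It assumes the `huggingface_hub` package is installed in the image and uses its `snapshot_download` function. The `/models` root, the `HF_TOKEN` environment variable, and the script name used below are illustrative choices.

```python
import os
import sys
from pathlib import Path


def local_dir_for(repo_id: str, root: str = "/models") -> Path:
    """Map "org/Model" to "<root>/Model", the folder later passed to from_pretrained."""
    return Path(root) / repo_id.split("/")[-1]


def download(repo_id: str, root: str = "/models") -> Path:
    """Download all files of the repo into a plain local folder."""
    # Imported lazily so the path helper can be used without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(
        repo_id=repo_id,
        local_dir=str(target),
        token=os.environ.get("HF_TOKEN"),  # only needed for gated or private repos
    )
    return target


if __name__ == "__main__" and len(sys.argv) > 1:
    print(download(sys.argv[1]))
```

Saved as, say, `download_model.py`, it could be run from the Dockerfile with `RUN python download_model.py mistralai/Mistral-7B-Instruct-v0.2`, after which the handler loads from `/models/Mistral-7B-Instruct-v0.2` with `local_files_only=True`.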