#Pipeline is not using gpu on serverless

plush trail
#

Hi!

I'm running bart-large-mnli on serverless, but as far as I can see from the worker stats it's not using the GPU. Do you know what I'm doing wrong?

The image is my current handler.py

And as docker base I'm using "FROM runpod/base:0.6.2-cuda12.2.0", also tried with "runpod/pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04" but still 0% usage of gpu.

Let me know if you need more details!

Thank you 🙂

ivory stratus
#

How are you running the model?

plush trail
#

this is the Dockerfile. I build the image, push it, and run it on a 24 GB GPU on serverless

#

and this is the model downloader

night prism
#

I have a feeling this line:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Is doing something funky.
You should try doing a print right after that:

print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())

And see if your code thinks it is running on a CPU.

plush trail
#

thank you! I'll try it immediately and let you know

#

@night prism this is the output

#

I can give you the full repo if you need 🙂

ivory stratus
#

Yep, will be useful for us to help you test it

night prism
#

That would be useful yes! Would love to test out and see what is going on.

night prism
#

Risky click 😆

halcyon spire
#

It's just a zip, right? 😊

plush trail
#

if you'd prefer I can give you single files

#

this is the folder structure

halcyon spire
#

Hmm, can you try adding some code to move the HF model you use onto the CUDA GPU?

#

Try searching for code like this:

#

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
model.to(device)

ivory stratus
#

it's already doing that

halcyon spire
#

Oh

#

How long does your process take?

#

In serverless

plush trail
#

with 5 concurrent requests ~5s per request

halcyon spire
#

If you try your pipeline on cpu does it have the same performance?

plush trail
#

let me try again cause I don't remember 😅 I'll launch the 32vcpu and let you know!

halcyon spire
#

Sorry not quite following the thread from the start... But how did you know it wasn't using the gpu again?

#

Right sure

plush trail
#

sure no problem, I see 100% CPU usage and 0% for the GPU

halcyon spire
#

Oh. I ask because I think the usage shown in the UI sometimes isn't up to date, especially if your job only took a couple of seconds

plush trail
#

thanks for the tip, but I'm running stress tests, constantly sending requests for 1 minute to understand how many requests it can handle, so it's always running

halcyon spire
#

I see

plush trail
#

another strange thing is that on a cheap CPU on a Hugging Face inference endpoint it performs faster than on a 24 GB GPU on RunPod (that's also why I think it's not using it) 😅

#

always ~5 seconds with 5 concurrent requests on a 32 vcpu
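
To make figures like "~5 seconds with 5 concurrent requests" comparable between the GPU worker and the 32 vCPU one, the same load can be timed against both. A minimal sketch using only the standard library; `measure_latencies` and `fake_request` are hypothetical names, and `fake_request` is a stand-in that just sleeps, to be swapped for a real call to the endpoint:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def measure_latencies(send_request, concurrency: int, total_requests: int) -> list[float]:
    """Call send_request() total_requests times, at most `concurrency` at once.

    Returns the wall-clock duration of each call in seconds.
    """
    def timed_call(_):
        start = time.perf_counter()
        send_request()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))


def fake_request():
    # Stand-in for one HTTP request to the endpoint (assumption, not a real call).
    time.sleep(0.01)


latencies = measure_latencies(fake_request, concurrency=5, total_requests=20)
print(f"mean {mean(latencies):.3f}s, max {max(latencies):.3f}s")
```

Running it once per worker type with the same `concurrency` and `total_requests` keeps the comparison fair.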

halcyon spire
#

Wow...

#

On GPU it takes longer?

#

Hahahah, if you've got your code right and you think it's a GPU problem, feel free to report it through the site's contact button in the left menu then

#

Btw @plush trail have you tried this: export CUDA_VISIBLE_DEVICES=0

plush trail
#

@halcyon spire tried now, still 100% CPU usage and 0% for the GPU 😦

coarse wolf
#

I might look at it

plush trail
#

thank you 🙂

night prism
#

Hey, so I went through this and I've this input:

{
    "input": {
        "sequence": "The weather is sunny today.",
        "labels": ["weather", "sports", "news"]
    }
}

and this output:

{
  "id": "test-822c3793-23b3-4464-8b65-972bb5776867",
  "status": "COMPLETED",
  "output": {
    "classification_result": {
      "sequence": "The weather is sunny today.",
      "labels": [
        "weather",
        "news",
        "sports"
      ],
      "scores": [
        0.989009439945221,
        0.24655567109584808,
        0.008112689480185509
      ]
    },
    "device": "cuda"
  }
}

#

Here is my python code:

import torch
import runpod
from runpod.serverless.utils.rp_validator import validate
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Use the GPU when torch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(device)

INPUT_SCHEMA = {
    'sequence': {
        'type': str,
        'required': True
    },
    'labels': {
        'type': list,
        'required': True,
    }
}

# Load the model, tokenizer, and pipeline once at import time so they are
# reused across requests instead of being rebuilt on every call.
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli",
    local_files_only=False,  # False allows a download if the model is not available locally
).to(device)
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/bart-large-mnli",
    local_files_only=False,  # False allows a download if the tokenizer is not available locally
)

classifier = pipeline(
    "zero-shot-classification",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,  # 0 = first GPU, -1 = CPU
)

def classify_text(sequence, labels):
    # multi_label=True scores each label independently of the others.
    return classifier(sequence, labels, multi_label=True)

async def handler(job):
    # Validate the request payload against the schema before running the model.
    val_input = validate(job['input'], INPUT_SCHEMA)
    if 'errors' in val_input:
        return {"error": val_input['errors']}
    val_input = val_input['validated_input']

    classification_result = classify_text(val_input["sequence"], val_input["labels"])

    return {
        "classification_result": classification_result,
        "device": str(device)
    }

runpod.serverless.start({"handler": handler, "concurrency_modifier": lambda x: 1000})

night prism
#

So I am getting the GPU to run through CUDA.

#

Yes, the device output is the GPU.
BTW I used the CLI tool runpodctl project create for faster iteration cycles, so I don't have to rebuild the Docker image constantly.

halcyon spire
#

Hmm okay cool, what's the difference from badnoise's code?

night prism
#

I rebuilt the new Docker image based off another image:

FROM runpod/base:0.6.1-cuda12.2.0


COPY builder/requirements.txt /requirements.txt
RUN python3.11 -m pip install --upgrade pip && \
    python3.11 -m pip install --upgrade -r /requirements.txt --no-cache-dir && \
    rm /requirements.txt

ADD . /

CMD python3.11 -u /src/handler.py

sudden zinc
#

I think he's trying to use cache_model.py to cache the model locally when building the Docker image. He set local_files_only=True just to make sure it never downloads from the internet.

sudden zinc
#

I don't see anything wrong with that 😂. I'm still wondering what Patrick changed to make it start using the GPU.

night prism
#

Sorry, my code was a little bit of a red herring. Here is a screenshot of it running on GPU though.

halcyon spire
#

I guess it could be a dependency issue (torch) that's causing it not to use the GPU

plush trail
#

hi! thank you so much for your help, I will try with the suggested docker image 🙂

sudden zinc
#

I think this might be the root cause. In your requirements.txt, you have to set:

torch==2.2.1

coarse wolf
#

Make sure to install the CUDA build of torch, not the CPU one
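
One way to check this from inside the worker is to look at how the installed torch wheel was built. A minimal sketch, assuming torch may or may not be importable in the environment; `describe_torch_build` is a hypothetical helper name. `torch.version.cuda` is `None` on CPU-only builds:

```python
def describe_torch_build() -> dict:
    """Summarize whether the installed torch wheel can use CUDA."""
    try:
        import torch
    except ImportError:
        # torch is not installed at all in this environment.
        return {"installed": False}

    return {
        "installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,        # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),  # False if no usable GPU/driver
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None`, the wheel itself is CPU-only and no Docker base image will make it use the GPU. If it is set but `cuda_available` is `False`, the problem is more likely the driver or the container runtime.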

plush trail
#

I'll try manually setting the torch version, because it's strange that I still see 0% GPU usage

plush trail
#

that's crazy, always 0% 😩

ivory stratus
#

It's using the GPU if the GPU memory is showing as used

#

That telemetry is not real time and not reliable
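
For a reading that does not depend on the dashboard, the GPU can be polled from inside the worker while requests are running. A rough sketch, assuming `nvidia-smi` is on the PATH in the container (it returns `None` otherwise); `parse_sample` and `sample_gpu` are hypothetical helper names, and fields that some GPUs report as `[N/A]` are not handled:

```python
import shutil
import subprocess

# Ask nvidia-smi for utilization (%) and used memory (MiB), one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_sample(line: str) -> tuple[int, int]:
    """Turn a line like '37, 1524' into (utilization_percent, memory_used_mib)."""
    utilization, memory_used = (field.strip() for field in line.split(","))
    return int(utilization), int(memory_used)


def sample_gpu():
    """Return one (utilization, memory_used) pair per GPU, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return [parse_sample(line) for line in result.stdout.splitlines() if line.strip()]
```

Calling `sample_gpu()` every second or so during a stress test gives a record that can be compared with what the UI shows.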

plush trail
#

but it's strange that even if I run stress test on it for over 1 minute it's never used 😅

sudden zinc
#

I added some logs in the code and it is using the GPU.
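
The exact logs added here were not shared. A minimal sketch of what such a log could look like, assuming a PyTorch model object (for a transformers pipeline, `classifier.model`); `log_model_device` is a hypothetical name:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("handler")


def log_model_device(model) -> str:
    """Log and return the device holding the model's first parameter."""
    first_param = next(model.parameters())
    device = str(first_param.device)  # e.g. "cuda:0" or "cpu"
    logger.info("model parameters are on %s", device)
    return device
```

Calling it once after the model is loaded, and again inside the handler, shows whether the weights actually sit on the GPU regardless of what the utilization graph reports.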

ivory stratus
#

Yep, the GPU utilization telemetry always confuses people because it's not real-time

sudden zinc
#

this one is interesting, lol
😂