#Runpod vLLM CUDA out of Memory

hybrid fern
#

Hi, I've been using the default Runpod vLLM template with the Mixtral model loaded on the network volume. I'm encountering CUDA out of memory errors on cold starts.

Here is the error log.

2024-01-15T20:32:13.726720287Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 47.54 GiB of which 16.75 MiB is free. Process 422202 has 47.51 GiB memory in use. Of the allocated memory 47.05 GiB is allocated by PyTorch, and 12.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
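One way to read the error above is to pull the numbers out of the message and compare them. The sketch below is only an illustration: `parse_oom` is a hypothetical helper, not part of PyTorch or vLLM, and its regular expressions assume the message wording shown in this log.

```python
import re

# Conversion factors to MiB for the two units that appear in the message.
_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Patterns assume the wording of the message in this thread
# (the "capac\w*" part tolerates the "capacty" spelling in the log).
_PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"total capac\w* of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch but unallocated",
}


def parse_oom(message: str) -> dict:
    """Return the memory figures found in a CUDA OOM message, in MiB."""
    stats = {}
    for name, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            stats[name] = float(value) * _TO_MIB[unit]
    return stats


message = (
    "CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total "
    "capacty of 47.54 GiB of which 16.75 MiB is free. Of the allocated memory "
    "47.05 GiB is allocated by PyTorch, and 12.67 MiB is reserved by PyTorch "
    "but unallocated."
)
stats = parse_oom(message)

# Fragmentation is an unlikely cause when the reserved-but-unallocated pool
# is smaller than the request that failed.
fragmentation_suspect = stats["reserved_unallocated"] >= stats["requested"]
print(stats, fragmentation_suspect)
```

Read this way, the reserved-but-unallocated pool (12.67 MiB) is much smaller than the failed 112 MiB request, so tuning `max_split_size_mb` looks unlikely to help here; the GPU appears to be simply full.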

unique finch
#

Which Mixtral model?

hybrid fern
#

mistralai/Mixtral-8x7B-v0.1

unique finch
#

That's too big to fit into 48GB unquantized; you need 2 x A100 for it. You should look at using a quantized version instead, such as TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ.
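The size claim can be checked with back-of-the-envelope arithmetic. This is a rough sketch, not a measurement: it assumes about 46.7 billion total parameters for Mixtral-8x7B, counts the weights only, and ignores quantization metadata, activations, and the KV cache. `weight_gib` is a hypothetical helper name.

```python
GIB = 1024 ** 3


def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


# Assumption: ~46.7 billion total parameters for Mixtral-8x7B.
MIXTRAL_PARAMS = 46.7e9
# Total GPU capacity reported in the error log earlier in the thread.
GPU_GIB = 47.54

for label, bits in [("fp16", 16), ("4-bit (GPTQ/AWQ)", 4)]:
    need = weight_gib(MIXTRAL_PARAMS, bits)
    verdict = "fits" if need < GPU_GIB else "does not fit"
    print(f"{label}: ~{need:.1f} GiB of weights, {verdict} in {GPU_GIB} GiB")
```

Under these assumptions the fp16 weights alone come to roughly 87 GiB, while a 4-bit quantized copy is closer to 22 GiB, which leaves room on a 48GB card for the KV cache.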

#

This version is also uncensored

hybrid fern
#

2024-01-15T20:52:23.809750811Z File "/handler.py", line 7, in <module>
2024-01-15T20:52:23.809891157Z vllm_engine = VLLMEngine()
2024-01-15T20:52:23.810037390Z ^^^^^^^^^^^^
2024-01-15T20:52:23.810098653Z File "/engine.py", line 38, in __init__
2024-01-15T20:52:23.810218453Z self.llm = self._initialize_llm()
2024-01-15T20:52:23.810380946Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.810389493Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T20:52:23.810576492Z raise e
2024-01-15T20:52:23.810592982Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T20:52:23.810735102Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T20:52:23.811013662Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.811046045Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T20:52:23.811394368Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T20:52:23.811521064Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.811549097Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T20:52:23.811785870Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T20:52:23.811983677Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.812010660Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T20:52:23.812252599Z return engine_class(*args, **kwargs)
2024-01-15T20:52:23.812439826Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.812447822Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T20:52:23.812621912Z self._init_workers(distributed_init_method)
2024-01-15T20:52:23.812687615Z File "/src/vllm/vllm/engine/llm_engine.py", line 146, in _init_workers
2024-01-15T20:52:23.812863558Z self._run_workers(

#

Looks like I'm getting a "disk quota exceeded" error.

unique finch
#

Is your network volume full? Or did you not add the other environment variables for the Hugging Face cache, etc.?
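A quick way to answer the "is the volume full?" question is to add up the files stored on it and compare the total with the size the volume was created with. The sketch below assumes the volume is mounted at `/runpod-volume` (the `download_dir` shown in the logs) and uses a made-up 100 GB quota; `dir_size_bytes` and `report` are hypothetical helper names.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Hugging Face cache snapshots are symlinks to blobs; skipping
            # links avoids counting the same file twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def report(root: str, quota_gb: float) -> str:
    """Format used space against the quota the volume was created with."""
    used_gb = dir_size_bytes(root) / 1e9
    return f"{root}: {used_gb:.2f} GB used of {quota_gb:.0f} GB"


if __name__ == "__main__":
    # Assumptions: mount point and quota are illustrative values.
    print(report("/runpod-volume", quota_gb=100))
```

Summing file sizes is used here instead of a filesystem free-space query because a quota on a shared network filesystem may not show up in the filesystem-level numbers.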

unique finch
#

Not necessary if the environment variables are set correctly

#

5GB is enough

hybrid fern
#

I increased my network volume size and that got rid of the problem. I'll probably wipe my network volume so it doesn't have the old model on it anymore.

#

My jobs are getting stuck here but the model is loading in fine.

unique finch
#

Use GPTQ not AWQ

#

I sent you this one: TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ. Not sure why you changed it to AWQ.

hybrid fern
#

Changed it because of this lol oops

unique finch
#

Oh, I don't know why the README says that, because your screenshot says AWQ quantization is not fully optimized yet 🤷‍♂️

hybrid fern
#

It says the same for GPTQ.

unique finch
#

Oh okay, my bad sorry, AWQ is probably better then.

hybrid fern
#
config.json:   0%|          | 0.00/1.06k [00:00<?, ?B/s]
config.json: 100%|██████████| 1.06k/1.06k [00:00<00:00, 3.12MB/s]
2024-01-15T21:25:12.631302229Z WARNING 01-15 21:25:12 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-15T21:25:12.631674354Z INFO 01-15 21:25:12 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/dolphin-2.7-mixtral-8x7b-AWQ', tokenizer='TheBloke/dolphin-2.7-mixtral-8x7b-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
2024-01-15T21:25:12.878504399Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-01-15T21:25:12.968988083Z engine.py           :56   2024-01-15 21:25:12,968 Error initializing vLLM engine: CUDA driver initialization failed, you might not have a CUDA gpu.
2024-01-15T21:25:12.969011647Z Traceback (most recent call last):
2024-01-15T21:25:12.969017540Z   File "/handler.py", line 7, in <module>
2024-01-15T21:25:12.969079606Z     vllm_engine = VLLMEngine()
2024-01-15T21:25:12.969193802Z                   ^^^^^^^^^^^^
2024-01-15T21:25:12.969200856Z   File "/engine.py", line 38, in __init__
2024-01-15T21:25:12.969316102Z     self.llm = self._initialize_llm()
2024-01-15T21:25:12.969405051Z                ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.969414475Z   File "/engine.py", line 57, in _initialize_llm
2024-01-15T21:25:12.969528071Z     raise e
2024-01-15T21:25:12.969535724Z   File "/engine.py", line 54, in _initialize_llm
2024-01-15T21:25:12.969631284Z     return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T21:25:12.969861773Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.969879906Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T21:25:12.970089032Z     engine = cls(parallel_config.worker_use_ray,
2024-01-15T21:25:12.970165425Z              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970203878Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T21:25:12.970334541Z     self.engine = self._init_engine(*args, **kwargs)
2024-01-15T21:25:12.970462593Z                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970470800Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T21:25:12.970637632Z     return engine_class(*args, **kwargs)
2024-01-15T21:25:12.970733855Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970746229Z   File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T21:25:12.970891958Z     self._init_workers(distributed_init_method)
2024-01-15T21:25:12.970907978Z   File "/src/vllm/vllm/engine/llm_engine.py", line 142, in _init_workers
2024-01-15T21:25:12.970999331Z     self._run_workers(
2024-01-15T21:25:12.971006647Z   File "/src/vllm/vllm/engine/llm_engine.py", line 763, in _run_workers
2024-01-15T21:25:12.971260200Z     self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-01-15T21:25:12.971432982Z     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AWQ seems to be faulty too. CUDA initialization is failing.

unique finch
#

trust_remote_code needs to be set to True for Mixtral; not sure whether that's causing the issue.

hybrid fern
#

I don't see that as an environment variable in the README.

unique finch
#

Might need to fork it and add it yourself 🙈 . Are you still using the 48GB GPU tier?

hybrid fern
#

Yes.

#

2024-01-15T21:32:29.206204204Z INFO 01-15 21:32:29 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mistral-7B-v0.1', tokenizer='mistralai/Mistral-7B-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
2024-01-15T21:32:29.587369692Z engine.py :56 2024-01-15 21:32:29,586 Error initializing vLLM engine: CUDA driver initialization failed, you might not have a CUDA gpu.
2024-01-15T21:32:29.587459992Z Traceback (most recent call last):
2024-01-15T21:32:29.587477182Z File "/handler.py", line 7, in <module>
2024-01-15T21:32:29.587597751Z vllm_engine = VLLMEngine()
2024-01-15T21:32:29.587676721Z ^^^^^^^^^^^^
2024-01-15T21:32:29.587687568Z File "/engine.py", line 38, in __init__
2024-01-15T21:32:29.587846707Z self.llm = self._initialize_llm()
2024-01-15T21:32:29.587920123Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.587927757Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T21:32:29.588049626Z raise e
2024-01-15T21:32:29.588066343Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T21:32:29.588169416Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T21:32:29.588340362Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.588362955Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T21:32:29.588594264Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T21:32:29.588675837Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.588704403Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T21:32:29.588857283Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T21:32:29.588974769Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589017799Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T21:32:29.589179321Z return engine_class(*args, **kwargs)
2024-01-15T21:32:29.589276287Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589306141Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T21:32:29.589436730Z self._init_workers(distributed_init_method)
2024-01-15T21:32:29.589445070Z File "/src/vllm/vllm/engine/llm_engine.py", line 142, in _init_workers
2024-01-15T21:32:29.589570340Z self._run_workers(
2024-01-15T21:32:29.589578206Z File "/src/vllm/vllm/engine/llm_engine.py", line 763, in _run_workers
2024-01-15T21:32:29.589964835Z self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-01-15T21:32:29.589993004Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589998521Z File "/src/vllm/vllm/engine/llm_engine.py", line 737, in _run_workers_in_batch
2024-01-15T21:32:29.590353319Z output = executor(*args, **kwargs)
2024-01-15T21:32:29.590387619Z ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.590392816Z File "/src/vllm/vllm/worker/worker.py", line 67, in init_model
2024-01-15T21:32:29.590540725Z torch.cuda.set_device(self.device)
2024-01-15T21:32:29.590547462Z File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 404, in set_device
2024-01-15T21:32:29.590728911Z torch._C._cuda_setDevice(device)
2024-01-15T21:32:29.590792281Z File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
2024-01-15T21:32:29.590940554Z torch._C._cuda_init()
2024-01-15T21:32:29.590948904Z RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

#

Tried using the base default Mistral and I'm still getting CUDA errors lol.

unique finch
#

Probably related to trust_remote_code, it has to be true for Mixtral.

hybrid fern
#

It worked before which is super weird.

#

The CUDA error was also for Mistral.

unique finch
#

Oh yeah, that's strange then, didn't realise it was working.

#

Which Mistral model?

hybrid fern
#

mistralai/Mistral-7B-v0.1

unique finch
#

That's a pretty small model, so there shouldn't be issues.

hybrid fern
#

Yeah not sure how this is giving me CUDA errors.

hybrid fern
#

I still just keep getting stuck at this stage.

undone sleet
#

Hey! I had a similar issue with loading AWQ models with this worker. I resolved it by setting the GPU_MEMORY_UTILIZATION variable to 0.90.
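For context, the traceback earlier in the thread shows the worker building its engine with `AsyncEngineArgs(**self.config)`, and `gpu_memory_utilization` is a vLLM engine argument. The sketch below is an assumption about how environment variables could be turned into such a config, not the worker's actual code; `engine_kwargs_from_env` is a hypothetical helper, and vLLM itself is deliberately not imported so the snippet runs anywhere.

```python
import os
from typing import Mapping


def engine_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Build a kwargs dict for the engine from environment variables."""
    # Fraction of GPU memory the engine may use; 0.90 is the value
    # suggested in this thread.
    util = float(env.get("GPU_MEMORY_UTILIZATION", "0.90"))
    if not 0.0 < util <= 1.0:
        raise ValueError(f"GPU_MEMORY_UTILIZATION must be in (0, 1], got {util}")

    kwargs = {
        "model": env.get("MODEL_NAME", ""),
        "gpu_memory_utilization": util,
    }
    quantization = env.get("QUANTIZATION")
    if quantization:
        kwargs["quantization"] = quantization.lower()
    return kwargs


config = engine_kwargs_from_env({
    "MODEL_NAME": "TheBloke/dolphin-2.7-mixtral-8x7b-AWQ",
    "QUANTIZATION": "awq",
    "GPU_MEMORY_UTILIZATION": "0.90",
})
print(config)
# With vLLM installed, a dict like this could be passed on as
# AsyncEngineArgs(**config), matching the call shown in the traceback.
```

Lowering the fraction leaves more headroom on the GPU for anything allocated outside the engine's own budget, which may be why it helped here.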

#

One more thing. It's recommended to use CUDA version 12.1. Try changing it by setting the env variable WORKER_CUDA_VERSION to 12.1.

#

I'm not sure, but you should probably change it in the Dockerfile. Setting it as an env variable probably won't work. (I may be wrong.)

unique finch
#

Yeah we need the CUDA version filter for serverless like GPU cloud has.

hybrid fern
#

Just kidding, found it. @undone sleet Also wondering if you baked your model into the Docker image? The spin-up time while using the network volume is quite slow.

green badger
#

@nova nest if you get a chance to glance this over.

hybrid fern
#

Can't seem to use 12.1

# Install torch and vllm based on CUDA version
RUN if [[ "${WORKER_CUDA_VERSION}" == 12.1* ]]; then \
        python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu121; \
        python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; \
    else \
        python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
    fi && \
    rm -rf /root/.cache/pip

It can't find the branch for https://github.com/runpod/[email protected]#egg=vllm. Should I just have it use 11.8?

nova nest
#

What are some missing features and issues that made you build your own images rather than use the pre-built one? @here

hybrid fern
#

I'm baking in my model to test out if it will make my workers faster. When using the default image, I'm getting delay times of up to 700s for it to load the model.

#

With the long delay time, it spins up other workers thus increasing my cost.

#

I'm also trying the solution that @undone sleet gave of changing the GPU Util value

#

Error log of a fresh fork from the repo.

No module named 'numpy'.

# syntax = docker/dockerfile:1.3
ARG WORKER_CUDA_VERSION=11.8
FROM runpod/base:0.4.2-cuda${WORKER_CUDA_VERSION}.0 as builder

ARG WORKER_CUDA_VERSION=11.8 # Required duplicate to keep in scope

# Set Environment Variables
ENV WORKER_CUDA_VERSION=${WORKER_CUDA_VERSION} \
    HF_DATASETS_CACHE="/runpod-volume/huggingface-cache/datasets" \
    HUGGINGFACE_HUB_CACHE="/runpod-volume/huggingface-cache/hub" \
    TRANSFORMERS_CACHE="/runpod-volume/huggingface-cache/hub" \
    HF_TRANSFER=1


# Install Python dependencies
COPY builder/requirements.txt /requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
    python3.11 -m pip install --upgrade pip && \
    python3.11 -m pip install --upgrade -r /requirements.txt && \
    rm /requirements.txt

# Install torch and vllm based on CUDA version
RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
        python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
        python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; \
    else \
        python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
    fi && \
    rm -rf /root/.cache/pip

# Add source files
COPY src .

# Setup for Option 2: Building the Image with the Model included
ARG MODEL_NAME="TheBloke/mixtral-8x7b-v0.1-AWQ"
ARG MODEL_BASE_PATH="/models"
ARG QUANTIZATION="awq"

ENV MODEL_BASE_PATH=$MODEL_BASE_PATH \
    MODEL_NAME=$MODEL_NAME \
    QUANTIZATION=$QUANTIZATION 

RUN --mount=type=secret,id=HF_TOKEN,required=false \
    if [ -f /run/secrets/HF_TOKEN ]; then \
        export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
    fi && \
    if [ -n "$MODEL_NAME" ]; then \
        python3.11 /download_model.py --model $MODEL_NAME; \
    fi

# Start the handler
CMD ["python3.11", "/handler.py"]
#

I'm trying to use the prebuilt now with 0.1.0. Will report back if it works

hybrid fern
#

This is with the base image of 0.1.0 using OpenChat with no other env variables. It also spun up 3 other workers to accomplish this job.

#

Ran it again and it worked much faster hmm.

#

Different worker

hybrid fern
#

I do have FlashBoot on. I’m just worried about those first-request load times, which is why I looked into baking the model into the image.

hybrid fern
#

This is with FlashBoot on.

nova nest
#

I think for Llama 7B our load times were around 22 seconds on the machine directly.
We’re working on improving the speed of model loading in vLLM.

arctic ether
#

What context length are you running at?

#

Not only do you need space for the fp16 weights, you also need space for the context (the KV cache), which should be about 2 to 3 GB for 4k to 8k context.
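The context estimate can be reproduced with the usual KV cache formula: two tensors (keys and values) per layer, each holding `kv_heads * head_dim` values per token. This is a sketch under stated assumptions: the architecture numbers below are assumed values for Llama-2-7B-style and Mistral-7B-style models, the cache is fp16, and `kv_cache_gib` is a hypothetical helper name.

```python
GIB = 1024 ** 3


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for one sequence of the given length."""
    # Factor 2 covers the separate key and value tensors in every layer.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens / GIB


# Assumed shapes: (layers, KV heads, head dimension).
SHAPES = {
    "Llama-2-7B style (no grouped-query attention)": (32, 32, 128),
    "Mistral-7B style (grouped-query attention)": (32, 8, 128),
}

for name, (layers, kv_heads, head_dim) in SHAPES.items():
    for tokens in (4096, 8192, 32768):
        size = kv_cache_gib(layers, kv_heads, head_dim, tokens)
        print(f"{name}: {tokens} tokens -> {size:.2f} GiB")
```

These figures are per sequence, so they scale with the number of concurrent requests. Under these assumptions a Llama-style 7B model needs about 2 GiB at 4k tokens, in line with the estimate above, while a model with grouped-query attention needs a quarter of that at the same length.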

hybrid fern