Runpod vLLM CUDA out of memory

hybrid fern Jan 15, 2024, 8:35 PM

#

Hi, I've been using the default RunPod vLLM template with the Mixtral model loaded on the network volume. I'm encountering CUDA out of memory errors on cold starts.

Here is the error log.

2024-01-15T20:32:13.726720287Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 47.54 GiB of which 16.75 MiB is free. Process 422202 has 47.51 GiB memory in use. Of the allocated memory 47.05 GiB is allocated by PyTorch, and 12.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

📎 message.txt
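The numbers in the error message above can be pulled apart to check whether the `max_split_size_mb` hint is likely to matter. This is a minimal Python sketch: `parse_oom` and `fragmentation_could_help` are illustrative names, the regexes only cover this particular message shape, and the final check is a rough heuristic rather than anything PyTorch itself computes.

```python
import re

# The OOM message from the log above (PyTorch's own "capacty" spelling kept verbatim).
MESSAGE = (
    "CUDA out of memory. Tried to allocate 112.00 MiB. "
    "GPU 0 has a total capacty of 47.54 GiB of which 16.75 MiB is free. "
    "Process 422202 has 47.51 GiB memory in use. "
    "Of the allocated memory 47.05 GiB is allocated by PyTorch, "
    "and 12.67 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Each pattern captures a number and its unit.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"total capac\w* of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
}


def parse_oom(text: str) -> dict:
    """Extract the memory figures from an OOM message, all converted to MiB."""
    stats = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find {key!r} in message")
        stats[key] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return stats


def fragmentation_could_help(stats: dict) -> bool:
    """Rough heuristic: could free plus cached-but-unused memory cover the request?"""
    return stats["free"] + stats["reserved_unallocated"] >= stats["requested"]


stats = parse_oom(MESSAGE)
for key, value in stats.items():
    print(f"{key}: {value:.2f} MiB")
print("fragmentation could help:", fragmentation_could_help(stats))
```

Here the free and reserved-but-unallocated memory together come to far less than the 112 MiB request, which suggests the GPU is genuinely full rather than fragmented.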

unique finch Jan 15, 2024, 8:36 PM

#

Which Mixtral model?

hybrid fern Jan 15, 2024, 8:37 PM

#

mistralai/Mixtral-8x7B-v0.1

unique finch Jan 15, 2024, 8:38 PM

#

That's too big to fit into 48 GB. You need 2 x A100 for it. You should look at using a quantized version instead, such as TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ

#

This version is also uncensored
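A back-of-the-envelope check of the sizing claim above, as a hedged Python sketch. It assumes roughly 46.7 billion total parameters for Mixtral-8x7B (the publicly stated figure) and counts weights only; the KV cache, activations, and quantization metadata all add to the real footprint, so treat the results as lower bounds.

```python
GIB = 1024 ** 3

# Assumption: approximate total parameter count for Mixtral-8x7B.
MIXTRAL_PARAMS = 46.7e9
# Total GPU memory reported in the error log above.
GPU_TOTAL_GIB = 47.54


def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


for label, bits in [("fp16/bf16", 16), ("4-bit (GPTQ/AWQ)", 4)]:
    size = weight_gib(MIXTRAL_PARAMS, bits)
    verdict = "fits" if size < GPU_TOTAL_GIB else "does not fit"
    print(f"{label}: ~{size:.1f} GiB of weights, {verdict} in {GPU_TOTAL_GIB} GiB")
```

Under these assumptions the 16-bit weights alone need roughly 87 GiB, while a 4-bit quantized copy needs roughly 22 GiB, which is why the quantized model is the realistic option on a 48 GB card.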

hybrid fern Jan 15, 2024, 8:42 PM

#

unique finch Thats too big to fit into 48GB, you need 2 x A100 for it, you should look at usi...

Does this look correct?

shrewd nest Jan 15, 2024, 8:47 PM

#

hybrid fern Hi I've been using the default runpod VLLM template with the mixtrial model load...

Mixtral is just too much of a memory hog

unique finch Jan 15, 2024, 8:50 PM

#

hybrid fern Does this look correct?

Yeah looks fine

hybrid fern Jan 15, 2024, 8:52 PM

#

2024-01-15T20:52:23.809750811Z   File "/handler.py", line 7, in <module>
2024-01-15T20:52:23.809891157Z     vllm_engine = VLLMEngine()
2024-01-15T20:52:23.810037390Z                   ^^^^^^^^^^^^
2024-01-15T20:52:23.810098653Z   File "/engine.py", line 38, in __init__
2024-01-15T20:52:23.810218453Z     self.llm = self._initialize_llm()
2024-01-15T20:52:23.810380946Z                ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.810389493Z   File "/engine.py", line 57, in _initialize_llm
2024-01-15T20:52:23.810576492Z     raise e
2024-01-15T20:52:23.810592982Z   File "/engine.py", line 54, in _initialize_llm
2024-01-15T20:52:23.810735102Z     return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T20:52:23.811013662Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.811046045Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T20:52:23.811394368Z     engine = cls(parallel_config.worker_use_ray,
2024-01-15T20:52:23.811521064Z              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.811549097Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T20:52:23.811785870Z     self.engine = self._init_engine(*args, **kwargs)
2024-01-15T20:52:23.811983677Z                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.812010660Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T20:52:23.812252599Z     return engine_class(*args, **kwargs)
2024-01-15T20:52:23.812439826Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.812447822Z   File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T20:52:23.812621912Z     self._init_workers(distributed_init_method)
2024-01-15T20:52:23.812687615Z   File "/src/vllm/vllm/engine/llm_engine.py", line 146, in _init_workers
2024-01-15T20:52:23.812863558Z     self._run_workers(

#

Looks like I'm getting a "disk quota exceeded" error.

unique finch Jan 15, 2024, 8:54 PM

#

Is your network volume full? Or did you not add the other environment variables for the Hugging Face cache, etc.?
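Both questions above can be checked from inside the worker. This is a minimal Python sketch, assuming the network volume is mounted at `/runpod-volume` (the `download_dir` shown in the logs) and using the cache variable names from the template's Dockerfile; `volume_report` and `cache_env_report` are hypothetical helper names. Note that `shutil.disk_usage` reports filesystem-level numbers, which may not reflect a per-volume quota.

```python
import os
import shutil

GIB = 1024 ** 3

# Cache-related variables set in the worker's Dockerfile.
CACHE_VARS = ("HF_DATASETS_CACHE", "HUGGINGFACE_HUB_CACHE", "TRANSFORMERS_CACHE")


def volume_report(path: str) -> dict:
    """Return total, used, and free space for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    percent = 100 * usage.used / usage.total if usage.total else 0.0
    return {
        "total_gib": usage.total / GIB,
        "used_gib": usage.used / GIB,
        "free_gib": usage.free / GIB,
        "percent_used": percent,
    }


def cache_env_report(env=None) -> dict:
    """Map each cache variable to its value, or None when it is unset."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in CACHE_VARS}


if __name__ == "__main__":
    # Fall back to the current directory when not running on a worker.
    path = "/runpod-volume" if os.path.isdir("/runpod-volume") else "."
    print(path, volume_report(path))
    print(cache_env_report())
```

If the cache variables are unset, model downloads may land on the small container disk instead of the network volume.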

shrewd nest Jan 15, 2024, 8:57 PM

#

hybrid fern Does this look correct?

prob increase ur container volume higher too - 5 is tiny

unique finch Jan 15, 2024, 8:58 PM

#

Not necessary if the environment variables are set correctly

#

5GB is enough

hybrid fern Jan 15, 2024, 8:59 PM

#

I increased my network volume and got rid of that problem. I'll probably wipe my network volume so it doesn't have the old model on there anymore.

#

#

My jobs are getting stuck here but the model is loading in fine.

unique finch Jan 15, 2024, 9:01 PM

#

Use GPTQ not AWQ

#

I sent you this one TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ not sure why you changed it to AWQ

hybrid fern Jan 15, 2024, 9:02 PM

#

Changed it because of this lol oops

#

unique finch Jan 15, 2024, 9:04 PM

#

Oh, don't know why the README says that because your screenshot says AWQ quantization is not fully optimized yet 🤷‍♂️

hybrid fern Jan 15, 2024, 9:06 PM

#

It says the same with GPTQ.

#

unique finch Jan 15, 2024, 9:07 PM

#

Oh okay, my bad sorry, AWQ is probably better then.

hybrid fern Jan 15, 2024, 9:27 PM

#

config.json:   0%|          | 0.00/1.06k [00:00<?, ?B/s]
config.json: 100%|██████████| 1.06k/1.06k [00:00<00:00, 3.12MB/s]
2024-01-15T21:25:12.631302229Z WARNING 01-15 21:25:12 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-15T21:25:12.631674354Z INFO 01-15 21:25:12 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/dolphin-2.7-mixtral-8x7b-AWQ', tokenizer='TheBloke/dolphin-2.7-mixtral-8x7b-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
2024-01-15T21:25:12.878504399Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-01-15T21:25:12.968988083Z engine.py           :56   2024-01-15 21:25:12,968 Error initializing vLLM engine: CUDA driver initialization failed, you might not have a CUDA gpu.
2024-01-15T21:25:12.969011647Z Traceback (most recent call last):
2024-01-15T21:25:12.969017540Z   File "/handler.py", line 7, in <module>
2024-01-15T21:25:12.969079606Z     vllm_engine = VLLMEngine()
2024-01-15T21:25:12.969193802Z                   ^^^^^^^^^^^^
2024-01-15T21:25:12.969200856Z   File "/engine.py", line 38, in __init__
2024-01-15T21:25:12.969316102Z     self.llm = self._initialize_llm()
2024-01-15T21:25:12.969405051Z                ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.969414475Z   File "/engine.py", line 57, in _initialize_llm
2024-01-15T21:25:12.969528071Z     raise e
2024-01-15T21:25:12.969535724Z   File "/engine.py", line 54, in _initialize_llm
2024-01-15T21:25:12.969631284Z     return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T21:25:12.969861773Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.969879906Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T21:25:12.970089032Z     engine = cls(parallel_config.worker_use_ray,
2024-01-15T21:25:12.970165425Z              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970203878Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T21:25:12.970334541Z     self.engine = self._init_engine(*args, **kwargs)
2024-01-15T21:25:12.970462593Z                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970470800Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T21:25:12.970637632Z     return engine_class(*args, **kwargs)
2024-01-15T21:25:12.970733855Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970746229Z   File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T21:25:12.970891958Z     self._init_workers(distributed_init_method)
2024-01-15T21:25:12.970907978Z   File "/src/vllm/vllm/engine/llm_engine.py", line 142, in _init_workers
2024-01-15T21:25:12.970999331Z     self._run_workers(
2024-01-15T21:25:12.971006647Z   File "/src/vllm/vllm/engine/llm_engine.py", line 763, in _run_workers
2024-01-15T21:25:12.971260200Z     self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-01-15T21:25:12.971432982Z     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AWQ seems to be faulty too. CUDA seems to be breaking.

unique finch Jan 15, 2024, 9:31 PM

#

trust_remote_code needs to be set to TRUE for Mixtral; not sure whether that's causing the issue.

hybrid fern Jan 15, 2024, 9:31 PM

#

I don't see that as an environment var in the README.

unique finch Jan 15, 2024, 9:32 PM

#

Might need to fork it and add it yourself 🙈 . Are you still using 48GB GPU tier?
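If someone does fork the worker to expose this setting, the change could look roughly like the Python sketch below. The traceback shows the worker calling `AsyncEngineArgs(**self.config)`, so the idea is to add a `trust_remote_code` key to that config dict. The `TRUST_REMOTE_CODE` variable name and both helper functions are hypothetical and are not part of the worker as discussed in this thread.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False, env=None) -> bool:
    """Interpret an environment variable as a boolean flag."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


def build_engine_config(env=None) -> dict:
    """Assemble a subset of the keyword arguments a fork might pass to AsyncEngineArgs."""
    env = os.environ if env is None else env
    return {
        "model": env.get("MODEL_NAME", ""),
        # Hypothetical variable name; a fork would need to document it.
        "trust_remote_code": env_flag("TRUST_REMOTE_CODE", env=env),
    }


print(build_engine_config({"MODEL_NAME": "example/model", "TRUST_REMOTE_CODE": "TRUE"}))
```

Defaulting the flag to false keeps the behaviour unchanged for anyone who does not set the variable.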

hybrid fern Jan 15, 2024, 9:32 PM

#

Yes.

#

2024-01-15T21:32:29.206204204Z INFO 01-15 21:32:29 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mistral-7B-v0.1', tokenizer='mistralai/Mistral-7B-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
2024-01-15T21:32:29.587369692Z engine.py :56 2024-01-15 21:32:29,586 Error initializing vLLM engine: CUDA driver initialization failed, you might not have a CUDA gpu.
2024-01-15T21:32:29.587459992Z Traceback (most recent call last):
2024-01-15T21:32:29.587477182Z   File "/handler.py", line 7, in <module>
2024-01-15T21:32:29.587597751Z     vllm_engine = VLLMEngine()
2024-01-15T21:32:29.587676721Z                   ^^^^^^^^^^^^
2024-01-15T21:32:29.587687568Z   File "/engine.py", line 38, in __init__
2024-01-15T21:32:29.587846707Z     self.llm = self._initialize_llm()
2024-01-15T21:32:29.587920123Z                ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.587927757Z   File "/engine.py", line 57, in _initialize_llm
2024-01-15T21:32:29.588049626Z     raise e
2024-01-15T21:32:29.588066343Z   File "/engine.py", line 54, in _initialize_llm
2024-01-15T21:32:29.588169416Z     return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T21:32:29.588340362Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.588362955Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T21:32:29.588594264Z     engine = cls(parallel_config.worker_use_ray,
2024-01-15T21:32:29.588675837Z              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.588704403Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T21:32:29.588857283Z     self.engine = self._init_engine(*args, **kwargs)
2024-01-15T21:32:29.588974769Z                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589017799Z   File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T21:32:29.589179321Z     return engine_class(*args, **kwargs)
2024-01-15T21:32:29.589276287Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589306141Z   File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T21:32:29.589436730Z     self._init_workers(distributed_init_method)
2024-01-15T21:32:29.589445070Z   File "/src/vllm/vllm/engine/llm_engine.py", line 142, in _init_workers
2024-01-15T21:32:29.589570340Z     self._run_workers(
2024-01-15T21:32:29.589578206Z   File "/src/vllm/vllm/engine/llm_engine.py", line 763, in _run_workers
2024-01-15T21:32:29.589964835Z     self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-01-15T21:32:29.589993004Z     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589998521Z   File "/src/vllm/vllm/engine/llm_engine.py", line 737, in _run_workers_in_batch
2024-01-15T21:32:29.590353319Z     output = executor(*args, **kwargs)
2024-01-15T21:32:29.590387619Z              ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.590392816Z   File "/src/vllm/vllm/worker/worker.py", line 67, in init_model
2024-01-15T21:32:29.590540725Z     torch.cuda.set_device(self.device)
2024-01-15T21:32:29.590547462Z   File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 404, in set_device
2024-01-15T21:32:29.590728911Z     torch._C._cuda_setDevice(device)
2024-01-15T21:32:29.590792281Z   File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
2024-01-15T21:32:29.590948904Z     torch._C._cuda_init()
2024-01-15T21:32:29.590948904Z RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

#

Tried using base default mistral and getting CUDA errors still lol.
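Since the same failure shows up with a 7B model, model size is unlikely to be the cause. A quick way to see what PyTorch itself reports is sketched below in Python. `cuda_report` is an illustrative helper, and the sketch only relies on standard `torch` attributes; it returns early when `torch` is not installed. One possible cause worth ruling out, stated here as an assumption rather than a diagnosis, is a mismatch between the CUDA version the PyTorch wheels were built for and what the host driver supports.

```python
def cuda_report() -> dict:
    """Collect basic facts about the PyTorch and CUDA setup on this machine."""
    report = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version the installed wheels were built against (None for CPU-only builds).
    report["built_for_cuda"] = torch.version.cuda
    # False when no usable GPU or driver is visible to PyTorch.
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = (
        torch.cuda.device_count() if report["cuda_available"] else 0
    )
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Running this in the failing container and comparing `built_for_cuda` against the driver version on the host narrows down whether the image or the machine is at fault.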

unique finch Jan 15, 2024, 9:36 PM

#

Probably related to trust_remote_code, it has to be true for Mixtral.

hybrid fern Jan 15, 2024, 9:36 PM

#

It worked before which is super weird.

#

The CUDA error was also for mistral

unique finch Jan 15, 2024, 9:37 PM

#

Oh yeah, that's strange then, didn't realise it was working

#

Which mistral model?

hybrid fern Jan 15, 2024, 9:38 PM

#

mistralai/Mistral-7B-v0.1

unique finch Jan 15, 2024, 9:39 PM

#

That's a pretty small model so there shouldn't be issues

hybrid fern Jan 15, 2024, 9:46 PM

#

#

Yeah not sure how this is giving me CUDA errors.

unique finch Jan 15, 2024, 9:48 PM

#

Probably need to log a GitHub issue for it.
https://github.com/runpod-workers/worker-vllm/issues

hybrid fern Jan 15, 2024, 9:58 PM

#

I still keep getting stuck at this stage.

undone sleet Jan 16, 2024, 2:05 PM

#

Hey! I had a similar issue with loading AWQ models with this worker. I resolved it by setting the GPU_MEMORY_UTILIZATION variable to 0.90.

#

One more thing. It's recommended to use CUDA version 12.1. Try changing it by setting the env variable WORKER_CUDA_VERSION to 12.1

#

I'm not sure but you should probably change it in the Dockerfile. Setting it as an env variable probably won't work. (I may be wrong.)
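On the GPU_MEMORY_UTILIZATION suggestion above: vLLM's `gpu_memory_utilization` engine argument is the fraction of GPU memory the engine is allowed to use for weights, activations, and the KV cache. The Python sketch below shows how such a variable might be read and what budget 0.90 implies on the 47.54 GiB card from the first log. That the worker maps the environment variable directly onto this engine argument is an assumption, and the fallback value and helper names are illustrative.

```python
import os


def gpu_memory_utilization(env=None, fallback: float = 0.90) -> float:
    """Read GPU_MEMORY_UTILIZATION as a fraction in (0, 1]."""
    env = os.environ if env is None else env
    value = float(env.get("GPU_MEMORY_UTILIZATION", fallback))
    if not 0.0 < value <= 1.0:
        raise ValueError(f"GPU_MEMORY_UTILIZATION must be in (0, 1], got {value}")
    return value


def memory_budget_gib(total_gib: float, utilization: float) -> float:
    """GPU memory the engine may claim, leaving the remainder as headroom."""
    return total_gib * utilization


utilization = gpu_memory_utilization({"GPU_MEMORY_UTILIZATION": "0.90"})
budget = memory_budget_gib(47.54, utilization)
print(f"budget: {budget:.2f} GiB, headroom: {47.54 - budget:.2f} GiB")
```

Lowering the fraction leaves more headroom for allocations outside the engine's own accounting, at the cost of a smaller KV cache.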

unique finch Jan 16, 2024, 2:10 PM

#

Yeah we need the CUDA version filter for serverless like GPU cloud has.

hybrid fern Jan 17, 2024, 7:54 PM

#

undone sleet Hey! I had a similar issue with loading awq models with this worker. I resolved ...

Is that an environment variable?

hybrid fern Jan 18, 2024, 5:16 PM

#

Just kidding, found it. @undone sleet Also wondering if you baked your model into the Docker image? The spin-up time when using a network volume is quite slow.

green badger Jan 18, 2024, 6:12 PM

#

@nova nest if you get a chance to glance this over.

hybrid fern Jan 18, 2024, 6:13 PM

#

Can't seem to use 12.1

# Install torch and vllm based on CUDA version
RUN if [[ "${WORKER_CUDA_VERSION}" == 12.1* ]]; then \
        python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu121; \
        python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; \
    else \
        python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
    fi && \
    rm -rf /root/.cache/pip

It can't find the branch for https://github.com/runpod/[email protected]#egg=vllm. Should I just have it use 11.8?

nova nest Jan 18, 2024, 6:33 PM

#

hybrid fern Can't seem to use 12.1 ``` # Install torch and vllm based on CUDA version RUN ...

Sorry, it's not that intuitive, but if you build from the main branch with --build-arg WORKER_CUDA_VERSION=12.1, it will correctly install everything for 12.1, so you don't need to modify the Dockerfile.
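The build described above can be scripted. This is a small Python sketch that assembles the `docker build` invocation with the `WORKER_CUDA_VERSION` build argument, plus the optional `MODEL_NAME` and `QUANTIZATION` arguments that the Dockerfile in this thread declares. The image tag and the helper name are placeholders, and the sketch only prints the command rather than running it.

```python
import shlex


def docker_build_command(tag, cuda_version="12.1", model_name=None,
                         quantization=None, context="."):
    """Build the argument list for a docker build that sets the worker's build args."""
    cmd = ["docker", "build", "--build-arg", f"WORKER_CUDA_VERSION={cuda_version}"]
    if model_name:
        # Bakes the model into the image instead of loading it from a network volume.
        cmd += ["--build-arg", f"MODEL_NAME={model_name}"]
    if quantization:
        cmd += ["--build-arg", f"QUANTIZATION={quantization}"]
    cmd += ["-t", tag, context]
    return cmd


# Placeholder tag; replace it with your own registry and image name.
command = docker_build_command(
    "example/worker-vllm:cuda12.1",
    model_name="TheBloke/mixtral-8x7b-v0.1-AWQ",
    quantization="awq",
)
print(shlex.join(command))
```

To actually run the build, the list can be passed to `subprocess.run(command, check=True)` from the repository root.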

hybrid fern Jan 18, 2024, 6:36 PM

#

nova nest Sorry, it's not that intuitive, but if you build from the main branch with --bui...

Okay thank you!

nova nest Jan 18, 2024, 6:38 PM

#

What are some missing features and issues that made you build your own images rather than use the pre-built one? @here

hybrid fern Jan 18, 2024, 6:40 PM

#

I'm baking in my model to test out if it will make my workers faster. When using the default image, I'm getting delay times of up to 700s for it to load the model.

#

With the long delay time, it spins up other workers thus increasing my cost.

#

I'm also trying the solution that @undone sleet gave of changing the GPU Util value

#

Error log of a fresh fork from the repo.

No module named 'numpy'.

# syntax = docker/dockerfile:1.3
ARG WORKER_CUDA_VERSION=11.8
FROM runpod/base:0.4.2-cuda${WORKER_CUDA_VERSION}.0 as builder

ARG WORKER_CUDA_VERSION=11.8 # Required duplicate to keep in scope

# Set Environment Variables
ENV WORKER_CUDA_VERSION=${WORKER_CUDA_VERSION} \
    HF_DATASETS_CACHE="/runpod-volume/huggingface-cache/datasets" \
    HUGGINGFACE_HUB_CACHE="/runpod-volume/huggingface-cache/hub" \
    TRANSFORMERS_CACHE="/runpod-volume/huggingface-cache/hub" \
    HF_TRANSFER=1


# Install Python dependencies
COPY builder/requirements.txt /requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
    python3.11 -m pip install --upgrade pip && \
    python3.11 -m pip install --upgrade -r /requirements.txt && \
    rm /requirements.txt

# Install torch and vllm based on CUDA version
RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
        python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
        python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; \
    else \
        python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
    fi && \
    rm -rf /root/.cache/pip

# Add source files
COPY src .

# Setup for Option 2: Building the Image with the Model included
ARG MODEL_NAME="TheBloke/mixtral-8x7b-v0.1-AWQ"
ARG MODEL_BASE_PATH="/models"
ARG QUANTIZATION="awq"

ENV MODEL_BASE_PATH=$MODEL_BASE_PATH \
    MODEL_NAME=$MODEL_NAME \
    QUANTIZATION=$QUANTIZATION 

RUN --mount=type=secret,id=HF_TOKEN,required=false \
    if [ -f /run/secrets/HF_TOKEN ]; then \
        export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
    fi && \
    if [ -n "$MODEL_NAME" ]; then \
        python3.11 /download_model.py --model $MODEL_NAME; \
    fi

# Start the handler
CMD ["python3.11", "/handler.py"]

📎 message.txt

#

I'm trying to use the prebuilt now with 0.1.0. Will report back if it works

hybrid fern Jan 18, 2024, 7:12 PM

#

This is with the base image of 0.1.0 using OpenChat with no other env variables. It also spun up 3 other workers to accomplish this job.