#Can't use GPU with Jax in serverless endpoint


polar leaf
#

Hi, I'm trying to run a serverless worker to perform point tracking on a video. It works ok, but I think that it is running on CPU.

I read that the telemetry on the UI isn't reliable, but the Container Logs indicate the same thing. There is an image of what the logs say. It finds the Nvidia GPU, but I think there are problems with JAX.

I use the function on the first image to check the device:
And the outputs I get are on the second image:

In my Dockerfile, I'm setting this as base image:
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

I'm running this command to install the jax version that is supposed to work with CUDA 11.8.
RUN pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Then I install requirements.txt (I don't install Jax again here) and do other stuff

And finally I do this to set the library path for CUDA:
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH

I still can't get it to work on the GPU. If someone could tell me where the problem might be, it would be extremely helpful. Thank you.
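
The device-check function was posted as an image and is not reproduced in this thread. A minimal sketch of such a check, assuming only the public `jax.default_backend()` and `jax.devices()` calls, might look like this; the function name `describe_jax_devices` and the report keys are made up for illustration:

```python
import os


def describe_jax_devices():
    """Return a small report on which backend JAX selected (hypothetical helper)."""
    report = {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
        "backend": None,
        "devices": [],
    }
    try:
        import jax  # imported lazily so the report still works without JAX
    except ImportError:
        return report

    # default_backend() is "gpu" when a CUDA device is in use, "cpu" otherwise.
    report["backend"] = jax.default_backend()
    report["devices"] = [str(device) for device in jax.devices()]
    return report
```

Logging the result of `describe_jax_devices()` at the start of the handler shows whether `backend` came out as `gpu` or `cpu`.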

outer light
#

Hey before running the code try setting this env variable

#

export CUDA_VISIBLE_DEVICES=0,1

#

Run that command in a cli

#

Let me know if that works or not

obtuse pawn
#

try adding this to your Dockerfile

ENV PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV NVIDIA_VISIBLE_DEVICES=all
outer light
#

yeah I guess so, but that has already been included in the newest image tag

obtuse pawn
#

I'm also wondering why you use CUDA 11.8 rather than 12.1

#

pip install -U "jax[cuda12]"

polar leaf
obtuse pawn
#

@polar leaf any use case? I might try to make a better JAX template, though I would need to understand how you test it

polar leaf
obtuse pawn
#

you would probably need to add it in the Docker container

polar leaf
obtuse pawn
#

workers are basically pods

polar leaf
#

ok I'll run that command from the python code in the beginning and add your suggestion too
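
Setting the variable from Python only has an effect if it happens before JAX initializes its backends. A minimal sketch, assuming the value `0,1` suggested earlier in the thread; `visible_gpu_ids` is a hypothetical helper used only for logging:

```python
import os

# Keep this at the very top of the entry-point module, above any `import jax`.
# setdefault() leaves a value that the platform already provided untouched.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")


def visible_gpu_ids():
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES as a list of strings."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in raw.split(",") if part.strip()]
```

Unlike `export` in a shell, this only affects the current Python process and the processes it starts.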

obtuse pawn
#

try to run:
pip install --upgrade "jax[cuda12_local]"

polar leaf
obtuse pawn
polar leaf
#

okay, I'll try yes, thank you

polar leaf
# outer light Let me know if that works or not

Hi, I think that it worked, but there is a new error now. I think it is related to cuDNN. These are the logs:

Starting Serverless Worker |  Version 1.6.0 ---
{"requestId": "cbeb73b4-8679-43d1-aaa0-8c68101e76ac-e1", "message": "Started.", "level": "INFO"}
Get inside input_fn
xla_bridge.py       :889  Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
xla_bridge.py       :889  Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
inference.py        :172  Found device: cuda:0
inference.py        :176  JAX is not using the GPU. Check your JAX installation and environment configuration.
inference.py        :177  JAX backend: gpu
inference.py        :182  CUDA_VISIBLE_DEVICES: 0,1
inference.py        :183  LD_LIBRARY_PATH: /opt/venv/lib/python3.9/site-packages/cv2/../../lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
inference.py        :187  libcudart.so loaded successfully.
inference.py        :189  libcudnn.so loaded successfully.
inference.py        :143  Read and resized video, number of frames: 107
E0716  cuda_dnn.cc:535 Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E0716  cuda_dnn.cc:539 Memory usage: 84536328192 bytes free, 84986691584 bytes total.
E0716  cuda_dnn.cc:535 Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E0716  cuda_dnn.cc:539 Memory usage: 84536328192 bytes free, 84986691584 bytes total.
inference.py        :162  Error during processing: FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.
{"requestId": "cbeb73b4-8679-43d1-aaa0-8c68101e76ac-e1", "message": "Finished.", "level": "INFO"}

I've tried with 24GB GPU and 80GB GPU.

I'm using this base image:
FROM nvidia/cuda:12.0.0-cudnn8-devel-ubuntu20.04
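
The log above prints `Found device: cuda:0` and `JAX backend: gpu` next to "JAX is not using the GPU". The code in `inference.py` is not shown, so this is only a guess, but a check that looks for the substring `gpu` in the device string would produce that combination, since the device is printed as `cuda:0`. A more tolerant check could be sketched as follows; `GPU_MARKERS` and `looks_like_gpu` are hypothetical names:

```python
GPU_MARKERS = ("gpu", "cuda", "rocm")


def looks_like_gpu(backend, device_names):
    """True if the backend name or any device string points at a GPU."""
    candidates = [backend, *device_names]
    return any(
        marker in str(name).lower()
        for name in candidates
        for marker in GPU_MARKERS
    )
```

With the values from the log, `looks_like_gpu("gpu", ["cuda:0"])` is true, so the warning would not be printed and the cuDNN error would remain the only real problem.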

outer light
#

this may be related

polar leaf
outer light
#

So what are your versions now?
Jax, cudnn, cuda
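
One way to collect these versions from inside the container is sketched below. It assumes the CUDA libraries are reachable through `LD_LIBRARY_PATH` under the sonames listed in the code; the helper names are illustrative. `cudnnGetVersion` and `cudaRuntimeGetVersion` are the C entry points of cuDNN and the CUDA runtime.

```python
import ctypes
from importlib import metadata


def package_version(name):
    """Installed version of a pip package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def _load_first(lib_names):
    """Return the first shared library that can be loaded, or None."""
    for lib_name in lib_names:
        try:
            return ctypes.CDLL(lib_name)
        except OSError:
            continue
    return None


def cudnn_version(lib_names=("libcudnn.so.8", "libcudnn.so")):
    """Raw cuDNN version integer (cuDNN 8.9.2 is reported as 8902), or None."""
    lib = _load_first(lib_names)
    if lib is None:
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return int(lib.cudnnGetVersion())


def cuda_runtime_version(
    lib_names=("libcudart.so.12", "libcudart.so.11.0", "libcudart.so"),
):
    """Raw CUDA runtime version integer (12.1 is reported as 12010), or None."""
    lib = _load_first(lib_names)
    if lib is None:
        return None
    version = ctypes.c_int()
    status = lib.cudaRuntimeGetVersion(ctypes.byref(version))
    return version.value if status == 0 else None


def collect_versions():
    return {
        "jax": package_version("jax"),
        "jaxlib": package_version("jaxlib"),
        "cudnn": cudnn_version(),
        "cuda_runtime": cuda_runtime_version(),
    }
```

Printing `collect_versions()` in the worker logs gives the three numbers asked for here in one line.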

outer light
#

maybe try GPUs with more VRAM

#

or try a later version of JAX and CUDA

polar leaf
outer light
polar leaf
#

And for Jax I do this to install it:
RUN pip install --upgrade "jax[cuda12_local]"

outer light
#

what about pip install -U "jax[cuda12]"

#

try using CUDA >=12.1 too

#

filter the CUDA version on the serverless endpoint

polar leaf
outer light
#

on the endpoints, edit endpoint, expand the bottom section

#

then select the CUDA version (checkboxes)

polar leaf
#

aah okay, I'll try 11.8 too, thank you

outer light
#

yeah, I just checked, and I think JAX for CUDA 12 works with 12.1 or later

polar leaf
#

I'll try this, I'll tell you if it works, thanks a lot for helping

nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04

obtuse pawn
#

@polar leaf is your worker open source?

polar leaf
polar leaf
# obtuse pawn @polar leaf is your worker open source?

hi, here is a simple version of the worker:
https://github.com/galakurpi/yekar_coaches_point_tracking_simple

for testing it, send the video link i have in this code in that same format:

import requests
url = 'https://api.runpod.ai/v2/sd1ylpcd55dj12/run'
data = {
    'input': {
        'video_url': 'https://drive.google.com/uc?export=download&id=1SER_MwYt0XyOHOX0UbN30iyMCmeWE-dd'
    }
}
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <RUNPOD API KEY MISSING>'  # If authentication is needed
}
response = requests.post(url, json=data, headers=headers)
print(response.json())

thank you
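
The `/run` route queues the job and returns a job id rather than the result, so the output has to be fetched separately. A minimal polling sketch, assuming the `/status/{job_id}` route and the `status` field of RunPod's serverless API; `fetch_status` and `wait_for_job` are illustrative names, and the standard library is used instead of `requests` to keep the sketch dependency-free:

```python
import json
import time
import urllib.request

FINAL_STATES = ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT")


def fetch_status(endpoint_id, job_id, api_key):
    """GET the current status of one job from the endpoint."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_job(job_id, fetch, poll_seconds=2.0, max_polls=150):
    """Call fetch(job_id) until the job reaches a final state, then return it."""
    for _ in range(max_polls):
        status = fetch(job_id)
        if status.get("status") in FINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

For example, `wait_for_job(job_id, lambda j: fetch_status(endpoint_id, j, api_key))` would block until the worker reports a final state, where `job_id` is the `id` field printed by the request above.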


polar leaf
#

let me know if you test anything or need anything

sudden ruin
#

btw did you make sure to filter cuda version on machines in serverless

outer light
#

Have you tried the versions?

polar leaf
#

Actually, no, sorry, but the logs showed that CUDA 12.1 was running

#

But I'll try again with that
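
To see which CUDA version the host driver of a given worker supports, the `nvidia-smi` header can be logged at startup. A rough sketch, assuming `nvidia-smi` is on the `PATH` inside the container and prints its usual `CUDA Version: X.Y` header; both function names are hypothetical:

```python
import re
import shutil
import subprocess


def parse_driver_cuda_version(smi_output):
    """Extract 'X.Y' from the 'CUDA Version: X.Y' part of the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*([0-9]+(?:\.[0-9]+)?)", smi_output)
    return match.group(1) if match else None


def driver_cuda_version():
    """Highest CUDA version supported by the host driver, or None if unknown."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return parse_driver_cuda_version(result.stdout)
```

If the value logged on a failing worker is lower than the CUDA version of the base image, the endpoint's CUDA filter is the likely place to look.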

outer light
#

Try your code and template in a pod if you want, it's faster

#

You can iterate there. Once something works, you can build your Docker image and code from the setup that worked on the pod