#CUDA NO WORKY?


peak aurora
#

I'm unable to get SSH working to pods from a clean CUDA Docker image. Despite the pods saying they're ready and giving me an SSH line (and charging me $$$), they all spit out the same error:

Error response from daemon: container a94707bd5f391d6a3f25d13f3ba02a425757bdbecfcb7de3b1169ddda866d434 is not running

You can try one here. https://console.runpod.io/pods?id=mlbfg4iutwm19c

The only reason I'm using a clean CUDA image without PyTorch is that the official PyTorch CUDA envs appear to be misconfigured. By misconfigured I mean that, no matter what I try, I can't get CUDA visible to Python or see any available CUDA devices.

cd /workspace
rm -rf venv
python3 -m venv venv && source venv/bin/activate

# Install ONLY pip first
pip install --upgrade pip

# Install PyTorch with EXACT CUDA version matching your driver
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124

# TEST IMMEDIATELY before installing anything else
python -c "import torch; print(torch.cuda.is_available())"  # Always False, or cuda undefined

No matter how many times or pods I try this on, I never get cuda defined!!
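
A bare True/False from the one-liner above does not say where the problem is. The following is a minimal diagnostic sketch, not an official RunPod or PyTorch tool: the function name `cuda_report` and its dictionary keys are made up for illustration. It assumes `nvidia-smi` is on PATH whenever the GPU is exposed to the container, and it only uses `torch` if it is importable.

```python
import os
import shutil
import subprocess


def cuda_report():
    """Collect basic facts that help explain why CUDA may be invisible to PyTorch."""
    report = {
        # If this is set to an empty string, CUDA applications see no GPUs.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Assumption: nvidia-smi is present when the GPU is exposed to the container.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
    }

    if report["nvidia_smi_path"]:
        try:
            # `nvidia-smi -L` prints one line per GPU the driver can see.
            result = subprocess.run(
                [report["nvidia_smi_path"], "-L"],
                capture_output=True,
                text=True,
            )
            report["nvidia_smi_ok"] = result.returncode == 0
            report["gpus"] = result.stdout.strip().splitlines()
        except OSError as exc:
            report["nvidia_smi_ok"] = False
            report["nvidia_smi_error"] = str(exc)

    try:
        import torch
    except ImportError:
        report["torch_installed"] = False
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # None here means a CPU-only wheel was installed.
    report["torch_built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Reading the output: if `torch_built_for_cuda` is `None`, pip installed a CPU-only build; if `nvidia_smi_path` is `None` or `gpus` is empty, the GPU is probably not exposed to the container at all, and reinstalling torch will not help.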

peak aurora
#

At this point I'd like to request a refund. I'm at my wits' end. Even the LLMs are telling me RunPod's CUDA envs must be misconfigured.

dusky fog
#

@peak aurora Do you have an image to share? I tried to look at your link and it did not lead anywhere

#

I tried the official template and was able to get CUDA detected.

dusky fog
#

What template is that using?

#

You said a clean cuda thing?

#

I cannot see pods that are on your system

#

but if you are trying to get ssh setup

#

you can maybe try hold on

peak aurora
#
nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
#

it was a custom template with that docker image

dusky fog
#
wget https://raw.githubusercontent.com/justinwlin/Runpod-SSH-Password/main/passwordrunpod.sh && chmod +x passwordrunpod.sh && ./passwordrunpod.sh
#

Got it

peak aurora
#

and ssh would just kick me out every time

dusky fog
#

let me take a look

#

I'm not familiar with this template, but:

  1. I think RunPod is working
  2. If you want to try, I have an SSH script that tries its best
#

to install password-based SSH

#

and tells you how to SSH into the pod when done

#
  3. Let me give it a try
#

This is the repo fyi

#

if curious

dusky fog
#

Do you have a link to your custom Docker image? I'm guessing it doesn't have OpenSSH installed, or something like that

#

You'll need:
openssh-server
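
A quick way to check for this from inside the container (for example via a web terminal) is sketched below. It is only an illustration: the helper names `sshd_installed` and `port_open` are hypothetical, and the port is assumed to be the default SSH port 22, which may not match how a given pod is set up.

```python
import os
import shutil
import socket


def sshd_installed():
    """Return True if an sshd binary is on PATH or at the usual Debian/Ubuntu path."""
    return shutil.which("sshd") is not None or os.path.exists("/usr/sbin/sshd")


def port_open(host="127.0.0.1", port=22, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("sshd installed:", sshd_installed())
    # Assumption: sshd listens on the default port 22 inside the container.
    print("port 22 open:  ", port_open())
```

If `sshd_installed()` is False on an Ubuntu-based image such as the one mentioned above, `apt-get update && apt-get install -y openssh-server` is the usual way to add it; the daemon still has to be started and the port exposed before an SSH client can connect.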

dusky fog
#

You can run my script to set up password SSH on a RunPod official template through the web terminal or JupyterLab, and it should work 🙂 Or you can set up SSH keys properly

#

Once you use a RunPod official template, which has more than the bare minimum set up for you, you can just run my script in the web console or in JupyterLab:

wget https://raw.githubusercontent.com/justinwlin/Runpod-SSH-Password/main/passwordrunpod.sh && chmod +x passwordrunpod.sh && ./passwordrunpod.sh

That should get you SSH. I also tried again on a fresh pod and still got everything working. I did not reinstall torch / torchvision, though. I just went straight from what ships in our pod

#

Summary:

  1. Try a RunPod official template to start
  2. You can run my SSH script, or set up SSH keys (see the docs) so you get automatic SSH for future pods you spin up. Our templates are set up with an OpenSSH server
  3. You can run your check: python -c "import torch; print(torch.cuda.is_available())" which, as my screenshots show, picked up CUDA in two separate instances
peak aurora
#

Got it, thanks for the assist. Unfortunately I'd still like to process a refund, because in the interim I've switched over to Lambda Labs