# Frequent GPU problem with H100

tall latch
#

Hello,
Nine times out of ten when I provision an H100 (PCIe) machine, CUDA doesn't work with torch.
For instance, this machine runs CUDA 12.2, but the torch-CUDA integration appears to be broken.
@mighty adder or someone from the RunPod team, can you please take a look, since it's happening extremely frequently now?
ID: b6d3hcqct79d7o, runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
Please find image attached. Note that this is a freshly provisioned VM, with NO commands executed but the ones shown in the screenshot.

wind herald
#

@tall latch could you give my tool a try?

#

#1213495584539811860

tall latch
#

Thanks @wind herald, I've released the H100 for now (to save costs) and provisioned an A100 cluster instead. But I'll post here once I run into the problem again.

#

Provisioned another one where CUDA doesn't work, and here are the results:

{
  "PyTorch Version": "2.2.0+cu121",
  "Environment Info": {
    "RUNPOD_POD_ID": "7zb8qedy1qzr0v",
    "Template CUDA_VERSION": "Not Available",
    "NVIDIA_DRIVER_CAPABILITIES": "Not Available",
    "NVIDIA_VISIBLE_DEVICES": "Not Available",
    "NVIDIA_PRODUCT_NAME": "Not Available",
    "RUNPOD_GPU_COUNT": "4",
    "machineId": "krn533olhyna"
  },
  "Host Machine Info": {
    "CUDA Version": "12.2",
    "Driver Version": "535.154.05",
    "GPU Name": "NVIDIA H100 PCIe"
  },
  "CUDA Test Result": {
    "GPU 0": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
    "GPU 1": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
    "GPU 2": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
    "GPU 3": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero."
  }
}
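
For anyone who wants to reproduce a report like the one above without the tool, here is a minimal sketch of a per-GPU check. It is not the actual script from this thread: the function name `cuda_report` is hypothetical, and treating `RUNPOD_POD_ID`, `CUDA_VERSION`, `NVIDIA_VISIBLE_DEVICES` and `RUNPOD_GPU_COUNT` as environment variables is an assumption based on the keys in the report. It only tries a tiny tensor operation on each device and records any exception text.

```python
import json
import os


def cuda_report():
    """Build a small per-GPU CUDA health report as a plain dict (hypothetical helper)."""
    report = {
        "PyTorch Version": "Not Available",
        "Environment Info": {
            # Assumption: these names are environment variables inside the pod.
            name: os.environ.get(name, "Not Available")
            for name in (
                "RUNPOD_POD_ID",
                "CUDA_VERSION",
                "NVIDIA_VISIBLE_DEVICES",
                "RUNPOD_GPU_COUNT",
            )
        },
        "CUDA Test Result": {},
    }
    results = report["CUDA Test Result"]

    try:
        import torch
    except ImportError:
        results["error"] = "torch is not installed"
        return report

    report["PyTorch Version"] = torch.__version__

    try:
        count = torch.cuda.device_count()
    except Exception as exc:  # record the failure instead of crashing
        results["error"] = f"Error: {exc}"
        return report

    if count == 0:
        results["error"] = "No CUDA devices visible to torch"

    for index in range(count):
        try:
            # Allocate a small tensor on the device and reduce it.
            x = torch.ones(8, device=f"cuda:{index}")
            total = float(x.sum().item())
            if total == 8.0:
                results[f"GPU {index}"] = "Success"
            else:
                results[f"GPU {index}"] = f"Error: unexpected result {total}"
        except Exception as exc:
            results[f"GPU {index}"] = f"Error: {exc}"

    return report


if __name__ == "__main__":
    print(json.dumps(cuda_report(), indent=2))
```

Running it on a fresh pod and attaching the printed JSON should give roughly the same kind of information as the report above, minus the host machine details.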

#

@wind herald

wind herald
#

Next time, try using my tool and share the errors, as they help with debugging.

#

btw what template do you use?

tall latch
#

I'm using: runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04

#

I don't think there's any specific one for 12.2?

wind herald
#

btw next time you can just upload the JSON file

#

btw after you get the file you can remove the pod, as I've saved the machine ID

tall latch
#

Perfect, thanks!

wind herald
#

I made that script to help gather info on broken H100s. Trust me, they are problematic.

#

In the meantime, pls enjoy a woman crying over a broken GPU.

Btw feel free to give feedback about my tool

mighty adder
#

@tall latch H100 PCIe machines have caused us lots of headaches lately. We will soon release a very powerful detection tool covering all RunPod servers, which will help us fix these non-trivial issues.

The problem always seems to involve some specific kernel version that turns out not to be compatible even though it's supposed to be. That being said, expect a strong resolution in the near term!
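
Since the kernel version seems to matter, it may help to include it alongside the driver version when reporting a broken machine. A minimal sketch, assuming `nvidia-smi` is on the PATH inside the pod (the function name `host_fingerprint` is hypothetical). Containers share the host's kernel, so `platform.release()` run inside the pod should reflect the host kernel.

```python
import platform
import shutil
import subprocess


def host_fingerprint():
    """Collect kernel release, driver version and GPU names (hypothetical helper)."""
    info = {"kernel": platform.release(), "driver": "Not Available", "gpus": []}

    # Without nvidia-smi we can still report the kernel release.
    if shutil.which("nvidia-smi") is None:
        return info

    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=True,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return info

    # Each output line looks like "<driver_version>, <gpu name>".
    for line in out.splitlines():
        if "," not in line:
            continue
        driver, name = (part.strip() for part in line.split(",", 1))
        info["driver"] = driver
        info["gpus"].append(name)

    return info


if __name__ == "__main__":
    print(host_fingerprint())
```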

tall latch
#

Thank you!!

#

It would be great if an update post could be made once that happens

mighty adder
#

So, we have a very good detection tool in place now, but it's manual

#

I believe the problem is largely solved for H100s. We will now look to automate the script and expand it to all servers on RunPod. In the meantime, do not hesitate to reach out if you have any questions 🙂