#We have detected a critical error on this machine which may affect some pods.

void grail
#

Hey all. We're renting a number of H100s as a trial run of Runpod, as we are looking for another compute provider. We paid for 24 hours of compute in order to transfer terabytes of data onto the machine, alongside paying for bandwidth and additional storage. We additionally paid our cloud provider's egress costs, which were more than we paid for the H100 machine, and rented a disk- and network-optimized machine in order to transfer the data quickly to the Runpod machine.

After 24 hours, we are getting this error on the Runpod GUI:
We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime.

Running nvidia-smi gives an ERR! for the 3rd GPU.

What are our options here? Is this an error that will be fixed by Runpod, or have I paid for a faulty machine?
Similarly, is there any way to use the persistent volume disk we are currently paying for and have it attached to a different H100 machine, so we do not have to spend another 24 hours transferring data & paying additional fees? Please advise.
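
For anyone hitting the same symptom, here is a minimal sketch of how the faulty GPU could be identified from a script rather than by reading the `nvidia-smi` table by eye. It assumes that unreadable fields show up as `ERR!` or `[Unknown Error]` in the CSV query output; the exact strings can differ by driver version, and on a badly failed GPU `nvidia-smi` may not return rows at all. The function names are hypothetical.

```python
import subprocess

# Assumption: strings nvidia-smi prints in place of a value it cannot read.
ERROR_MARKERS = ("ERR!", "Unknown Error")


def find_faulty_gpus(csv_text):
    """Return the indices (as strings) of GPUs whose row contains an error marker."""
    faulty = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        # fields[0] is the GPU index; the remaining fields are the queried values.
        if any(marker in field for field in fields[1:] for marker in ERROR_MARKERS):
            faulty.append(fields[0])
    return faulty


def query_gpus():
    """Run nvidia-smi and return its CSV output (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,power.draw",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

On the pod, `find_faulty_gpus(query_gpus())` would list the affected indices; the "3rd GPU" in the report above would correspond to index `2`.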

solemn marsh
#

You can use network storage to be able to attach your persistent storage to different pods, but it restricts your availability.
Can you provide the pod ID so that someone can check it out?
@red violet @pliant crown more issues with H100.

void grail
#

We did not use network storage as we were unable to find any availability for H100s.

Pod ID: brpigbe2fzkzrh

pliant crown
#

Forwarded to team

#

@void grail do you have any data in there?

red violet
#

looking into it, this is a hardware issue and hopefully doesn't require RMA

void grail
#

Yes -- as mentioned above we have terabytes of data we paid to transfer to the machine on a persistent volume disk

#

Got it. I've seen this happen with cables not being plugged in fully or a GPU getting overheated significantly, hopefully not RMA 🙏

void grail
#

Lmk when you have an update

severe cedar
#

Hey @void grail!

#

A request has already been sent to a tech in the datacenter, as others mentioned. That being said, I have reconciled the fees for the pod ID you provided and credited your account for the entire duration, plus a sizable buffer. We will reach out soon with any relevant updates!

Let me know if you have any other questions in the meantime 🙂

void grail
#

Thank you for the update and for crediting the account. Right now it seems like two GPUs are in the error state. Do you have any clue regarding the timing for the tech to fix the issue? Would you recommend spinning up a new instance, or waiting for the issue to be fixed by the tech? For reference, it takes ~24 hours to transfer data and egress costs are substantial.

severe cedar
#

Those servers are on a 10 Gbps backbone, so it could be quite fast to transfer from one to another, provided that you rent in the same location. I would definitely recommend that so you can get started very soon! You can transfer data between pods using our CTL tool here:
https://docs.runpod.io/references/runpodctl/#options

void grail
#

~3 TB. We have more than 16 TB that we want to transfer for when we fully start training.
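
As a rough sanity check on the figures in this thread, the sketch below estimates the best-case transfer time for ~3 TB and 16 TB over the 10 Gbps backbone mentioned above. It assumes decimal terabytes and the full line rate with no protocol or disk overhead, so real transfers will be slower; the `efficiency` parameter is a hypothetical knob for that.

```python
def transfer_hours(size_tb, link_gbps, efficiency=1.0):
    """Best-case hours to move size_tb decimal terabytes over a link_gbps link.

    efficiency is the (assumed) fraction of line rate actually achieved.
    """
    bits = size_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


for size in (3, 16):
    print(f"{size} TB at 10 Gbps: {transfer_hours(size, 10):.2f} h")
# 3 TB at 10 Gbps: 0.67 h
# 16 TB at 10 Gbps: 3.56 h
```

Even at half the line rate (`efficiency=0.5`), 3 TB would take well under two hours, compared with the ~24 hours quoted for the original upload.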

#

Transferring between pods should work - ty

#

I'm assuming CA is Canada, correct?

#

Do you have datacenters not listed on the UI? When I use the dropdown to select a particular region, it says GPUs are not available in any region. When I select "any", GPUs are available.

#

Could you increase my spending limit when you have a chance?

severe cedar
#

It's in Canada. Indeed, most locations with DC tags like these are ones with network storage, or with plans for network storage. @red violet Could we get a DC tag for this location? It is quite sizable.

void grail
#

Could you increase my spending limit?

void grail
#

Trying to rent another machine to transfer the data over, but running into spending limits

red violet
#

PM me the details and I can increase it for you

void grail
#

runpodctl send times out/fails for large files when transferring data. This is a relatively common problem with tools that lack sufficient retries when the network is unstable. rsync and wget work because they retry robustly. Given the inability to obtain a pod in the given geographical region, I'm giving up on the approach of transferring data via two connected pods and will pay the cloud provider fees.
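
To illustrate the retry point above, here is a minimal sketch of a retry-with-exponential-backoff wrapper around an `rsync` invocation. It is not how `runpodctl` or rsync behave internally; the function names, attempt count, and delays are hypothetical, and the rsync flags assume rsync 3.x on both ends.

```python
import subprocess
import time


def run_with_retries(command, max_attempts=5, base_delay=2.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run command until it exits 0, doubling the wait after each failure.

    Returns the number of attempts used; raises RuntimeError if all fail.
    runner and sleep are injectable so the logic can be tested offline.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # 2 s, 4 s, 8 s, ...
    raise RuntimeError(f"command failed after {max_attempts} attempts")


def rsync_command(source, destination):
    """Build an rsync command that can resume interrupted large files."""
    return [
        "rsync",
        "-a",               # archive mode: recurse and preserve metadata
        "--partial",        # keep partially transferred files on failure
        "--append-verify",  # resume by appending, then verify the checksum
        "--timeout=60",     # abort if no I/O happens for 60 seconds
        source,
        destination,
    ]
```

A call such as `run_with_retries(rsync_command("/data/", "user@host:/workspace/data/"))` would then resume from the partial file on each attempt instead of restarting; the paths and host here are placeholders.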

solemn marsh