#MI300X Broken

ruby ore
#

Hello, the MI300X machines are still all broken. Can we get them fixed asap pls 🙏

Is it also possible to get a refund for pod id c9mz48yaslkbzl, as it's been broken since yesterday morning? The pod is still running because I have a script that keeps checking whether it's fixed (I'm not taking compute from anybody else, as I can confirm that spinning up MI300X via another account is just broken). Also happy to pay extra for a machine that works lol. RunPod is the only provider whose 8xMI300X isn't full.

ruby ore
#

Linking previous convo: #🎤|general message
@mint timber

mint timber
#

@ruby ore this machine is going down for emergency maintenance - will credit you in just a moment

#

Just sent

ruby ore
#

thank you

#

wait a second am i still going to have my directory of work??

#

please, i hope this keeps my container disk. it persists across reboots, no?

#

@mint timber

mint timber
#

Container will always be lost on reboot, no matter the reason - are you able to connect to web terminal to move it off first?

ruby ore
#

ugh

#

i think i lost everything

#

i mean idk i can't access the pod at all

#

same pod id

eager obsidian
#

It depends on whether your space is on a mounted path (usually /workspace, but that depends on where the template mounts it; check under Edit Pod).

ruby ore
#

it's still c9mz48yaslkbzl

#

if it's under maintenance that's fine

#

the GPUs weren't working

left epoch
#

Also, host 213.173.96.54 is broken (I already have a ticket open for this host). Is there any way to exclude those hosts from the create pod API, so we can at least boot on other servers if they become available?

pliant ermine
#

I'm also seeing issues with MI300X (GPU applications just seem to hang). Recreating pods doesn't seem to help, since they all land on the same machine, dyz1gco8abdx. Running rocminfo under strace prints openat(AT_FDCWD, "/dev/kfd", O_RDWR and then hangs. I have created a support ticket for this, #35165.
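
For anyone wanting to confirm the same symptom on their own pod, here is a minimal sketch (not RunPod tooling) that runs a command such as rocminfo with a timeout and reports whether it finished, failed, or hung. The names `probe` and `main` and the 30-second timeout are assumptions chosen for illustration. Since a process blocked while opening /dev/kfd may not respond to a kill signal, the sketch avoids waiting on it indefinitely.

```python
import subprocess
import sys


def probe(cmd, timeout_s=30.0):
    """Run cmd and classify the result as 'ok', 'failed', 'hang', or 'missing'.

    The timeout value is an arbitrary assumption; tune it for your machine.
    """
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except FileNotFoundError:
        return "missing"  # the binary is not installed or not on PATH

    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            # Give the kill a moment to take effect, but never block forever:
            # a process stuck inside the kernel may ignore the signal.
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            pass
        return "hang"

    return "ok" if rc == 0 else "failed"


def main():
    # Hypothetical entry point: exit 0 only if rocminfo completes normally.
    status = probe(["rocminfo"])
    print(f"rocminfo: {status}")
    sys.exit(0 if status == "ok" else 1)
```

Calling `main()` from a script or a cron job gives a simple health check; a result of `hang` matches the behaviour described above, while `missing` only means rocminfo is not installed in the container.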

mint timber
#

Yeah, I've identified some other users having problems with these machines so I've sent it back up and asked for more of an RCA. They get automatically pulled once they've been identified as having an issue, but in this case that's not happening

ruby ore
#

i think i'm still getting charged for c9mz48yaslkbzl even tho it's stuck loading

#

i also tried spinning up new MI300X pods and they still hang, so the problematic machines are still live

left epoch
#

MI300X are blocked again (host 213.173.96.54). I've already opened a ticket, #35285.

ruby ore
#

this pod is continuing to charge me even though it's stuck. Why are we getting charged for pods that are stuck and giving errors / have no ssh available?

id: c9mz48yaslkbzl

astral fern
#

also, you can still stop the pod, boot it into CPU-only mode, open the web terminal and access your data

#

let me know if you need further help @ruby ore

ruby ore
#

it's just the default template, the one that used to be working

#

i turned it off, all fine now. could i get credits tho as the pod was never even on lol

ruby ore
#

i mean really a refund for everything, i spun up a bazillion pods over and over again to see if i can get one with rocminfo not hanging but it always hangs.

#

I just spun up 2wtgnkyr0o5nod again to check and it still hangs.

idk why nodes that are clearly broken are still available for rent (and are currently the only ones available for rent, actually).

sharp phoenix
#

Can't verify since they are all taken. If they're taken, that's also an automatic assumption that they are fine, btw.

#

I did run into it hanging a few days ago. That machine was one you had also run into: it was flagged for reboot but didn't actually restart, so they had to do that more forcefully. Then I landed on another one that did work.

sharp phoenix
#

If need be, tag finley with the ticket number, since finley knows about it