#Option to turn off ECC in some GPU types

sonic dome
#

An A40 pod has less than the advertised 48 GB of memory because of ECC.
An option to turn off ECC and reclaim that memory would be nice.

#

output of nvidia-smi -q
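For anyone who wants to check the same thing on their own pod, here is a minimal Python sketch that reads the ECC-related fields through `nvidia-smi --query-gpu`. The helper names `parse_gpu_rows` and `query_gpus` are made up for illustration, and it assumes `nvidia-smi` is on the PATH inside the pod.

```python
import subprocess

# Fields understood by `nvidia-smi --query-gpu`.
QUERY_FIELDS = "name,memory.total,ecc.mode.current,ecc.mode.pending"


def parse_gpu_rows(csv_text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue  # skip lines that do not match the four queried fields
        name, mem_total, ecc_current, ecc_pending = parts
        rows.append(
            {
                "name": name,
                "memory_total": mem_total,
                "ecc_current": ecc_current,
                "ecc_pending": ecc_pending,
            }
        )
    return rows


def query_gpus():
    """Run nvidia-smi and return the parsed rows (hypothetical helper)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_rows(result.stdout)
```

Calling `query_gpus()` on the pod returns one dict per GPU; the `memory_total` value is what the driver reports as usable, so it can be compared directly against the advertised capacity.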

halcyon elbow
#

We previously had an internal discussion about this. For now, we probably won't make any changes.

stiff root
#

Why did RunPod decide NOT to offer disabling this?

The A40 is too slow to do training on, and that's where ECC matters. The card is an inference card, and for inference, the worst that can happen is a bit-flip producing imperceptible output variance in already non-deterministic output. And impact to the user? Well, "reroll prompt" or just shrug because the generated LLM message is full of other invalid predictions anyway due to the nature of the beast.

sonic dome
#

I don't care about it crashing in a personal project.

stiff root
#

Do you often have crashes due to memory corruption? If you don't tolerate crashes, then I assume you also have a fault-tolerant retry layer on top, to reroute failed calls to other pods if something other than the non-ECC memory acts up (as that's much more likely to occur).

My personal beef with this decision is that 48 GB and 96 GB are specific boundaries where my workload will either fit or not. 40 and 80 means I have to spin up an additional pod redundantly.

sonic dome
#

I've never had bit-flips on RunPod.

#

Also, if it happened a lot, consumer GPUs would glitch in games all the time.

stiff root
#

Ah, sorry. I misunderstood you. I thought you meant it was important to you (having ECC memory that is). I now see we're both on the same page 🙂

sonic dome
#

Anyway, I think having this option is good.

#

Since 4-6 gigs of memory can determine whether you can run a model on a single GPU or not.
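To make the "fits or doesn't fit" point concrete, here is a rough back-of-the-envelope sketch. All numbers are hypothetical: a 22B-parameter model at 2 bytes per parameter, 4 GiB of runtime overhead, and usable capacities of 48 GiB versus 44 GiB. The function names are made up for illustration, and real memory use also depends on context length, batch size, and the framework.

```python
GIB = 2**30


def weights_gib(params_billion, bytes_per_param):
    """Memory taken by the weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / GIB


def fits(params_billion, bytes_per_param, usable_gib, overhead_gib=0.0):
    """True if weights plus a fixed overhead fit in the usable memory."""
    needed = weights_gib(params_billion, bytes_per_param) + overhead_gib
    return needed <= usable_gib


# Hypothetical workload: 22B parameters in fp16 (2 bytes each), 4 GiB overhead.
print(round(weights_gib(22, 2), 2))               # about 40.98 GiB of weights
print(fits(22, 2, usable_gib=48, overhead_gib=4))  # True: fits with the full 48
print(fits(22, 2, usable_gib=44, overhead_gib=4))  # False: 4 GiB less and it no longer fits
```

Under these assumed numbers the same workload needs roughly 45 GiB, so losing a few gigs is exactly the difference between one GPU and two.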

stiff root
#

Absolutely agree. Having the option to choose whether to use it is absolutely the best.

mellow rampart
#

Yes, please allow this. We are losing VRAM.

#

@halcyon elbow

fresh tendon
#

I don't think that would be straightforward, since ECC is set at the driver level.
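For context on what "driver level" means in practice: on a machine you control, ECC is toggled with `nvidia-smi -e 0` (or `-e 1`), which needs root and only takes effect after a reboot or GPU reset. The sketch below just builds that command and checks whether a change is still pending; it does not run anything. The helper names are hypothetical, and whether a pod user has the privileges to do this is an assumption that would need checking with the provider.

```python
def ecc_config_command(gpu_index, enable):
    """Build the nvidia-smi command that sets the pending ECC mode for one GPU.

    Running it requires root on the host; this function only returns the argv list.
    """
    return ["nvidia-smi", "-i", str(gpu_index), "-e", "1" if enable else "0"]


def restart_needed(ecc_current, ecc_pending):
    """True when the pending ECC mode differs from the current one.

    The driver applies a pending mode only after a reboot or GPU reset.
    """
    return ecc_current.strip().lower() != ecc_pending.strip().lower()


# Example: command to request ECC off on GPU 0 (not executed here).
print(ecc_config_command(0, enable=False))
# Example: mode was changed but the GPU has not been reset yet.
print(restart_needed("Enabled", "Disabled"))
```

Because the setting applies to the whole physical GPU and needs a reset to take effect, it is plausibly something the host would have to manage per machine rather than per pod.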