#Option to turn off ECC in some GPU types
output of nvidia-smi -q
this claims enabling ECC reduced perf by 15% on an A40
We previously had an internal discussion about this; we probably won’t make any changes at the moment.
Why (the RunPod decision NOT to offer disabling this)?
The A40 is too slow to do training on, and that's where ECC matters. The card is an inference card, and for inference, the worst that can happen is a bit-flip producing imperceptible output variance in already non-deterministic output. And impact to the user? Well, "reroll prompt" or just shrug because the generated LLM message is full of other invalid predictions anyway due to the nature of the beast.
I don't care about it crashing in a personal project
Do you often have crashes due to memory corruption? If you don't tolerate crashes, then I assume you also have a fault-tolerant retry layer on top, to reroute failed calls to other pods if something other than the non-ECC memory acts up (as that's much more likely to occur).
My personal beef with this decision is that 48 GB and 96 GB are specific boundaries where my workload will either fit or not. 40 and 80 means I have to spin up an additional pod redundantly.
I've never had bit flips on RunPod
Also, if it happened a lot, your consumer GPUs would glitch in games all the time
Ah, sorry. I misunderstood you. I thought you meant it was important to you (having ECC memory that is). I now see we're both on the same page 🙂
Anyway, I think having this option is good
Since 4-6 gigs of memory can determine whether you can run a model on a single GPU or not
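To make the fit-or-not point concrete, here is a rough back-of-the-envelope sketch. The 22B-parameter fp16 model and the 2 GB of headroom are hypothetical numbers picked for illustration, the 4 GB loss is just the low end of the range quoted above, and the estimate ignores KV cache and activation memory, so treat it as an approximation rather than a sizing tool.

```python
def weights_gb(params_billion, bytes_per_param):
    """Approximate weight size in GB: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def fits(params_billion, bytes_per_param, usable_gb, headroom_gb=0.0):
    """True if the weights plus a fixed headroom fit in the usable memory."""
    return weights_gb(params_billion, bytes_per_param) + headroom_gb <= usable_gb


# Hypothetical example: 22B parameters in fp16 (2 bytes each) -> 44 GB of weights.
NOMINAL_GB = 48   # card capacity mentioned in the thread
ECC_LOSS_GB = 4   # low end of the "4-6 gigs" figure from the thread, not measured

print(fits(22, 2, NOMINAL_GB, headroom_gb=2))                # ECC off
print(fits(22, 2, NOMINAL_GB - ECC_LOSS_GB, headroom_gb=2))  # ECC on
```

With these assumed numbers the model fits only when the full 48 GB is usable, which is the situation that forces a second pod.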
Absolutely agree. Having the option to choose whether to use it is absolutely the best.
I don't think that would be straightforward, since ECC is set at the driver level
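For context on the driver-level point: on a machine you administer, ECC is normally toggled with `nvidia-smi -e 0` (or `-e 1`) as root, and the change only takes effect after a reboot or GPU reset. Inside a pod you can usually only read the setting. A minimal sketch for checking it, assuming `nvidia-smi` is on the PATH; the helper names `parse_ecc_csv` and `query_ecc` are made up for this example, and the sample line in the comment is illustrative, not captured output.

```python
import csv
import io
import shutil
import subprocess

# Field names accepted by `nvidia-smi --query-gpu` (see `nvidia-smi --help-query-gpu`).
QUERY_FIELDS = ["name", "memory.total", "ecc.mode.current", "ecc.mode.pending"]


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:  # skip blank lines
            continue
        rows.append(dict(zip(QUERY_FIELDS, (field.strip() for field in record))))
    return rows


def query_ecc():
    """Return ECC info for each visible GPU, or [] if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS), "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecc_csv(result.stdout)


# Illustrative input line (not real captured output):
#   "NVIDIA A40, 46068 MiB, Enabled, Enabled"
```

Calling `query_ecc()` in a pod shows whether ECC is currently on and whether a change is pending. A `pending` value that differs from `current` means someone with host access has requested a change that has not been applied yet.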