#How to evaluate cloud GPU providers?
Our main challenge has sometimes been clusters not working properly, for a myriad of reasons, so I suppose reliability is the biggest thing... beyond that it all depends on where you're at and what level of scaling you need to do.
Ahhhh gotcha, is this mainly for like multi-node workloads + interconnect reliability? Which cloud GPU providers have you used in the past?
would it be cool to dm?
If you want reliability, I think setting up machines with two different providers (data centers) is useful. E.g. even if you use Vast.ai, just pick two hosts in different data centers and route traffic between them. Even just two machines should give you good redundancy. Then if one goes down, it's usually ~10 min to bring up another machine with your docker containers.
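For the "route traffic" part, here's a rough sketch of what I mean, not a production setup. It assumes each machine exposes some HTTP health URL (the URLs in the comment are made up) and just picks the first one that answers. The function names are my own, nothing provider-specific.

```python
from typing import Callable, Iterable, Optional
import urllib.error
import urllib.request


def http_health_check(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 2xx within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, non-2xx status, bad URL:
        # all of these are treated as "this machine is down".
        return False


def pick_endpoint(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None


# Hypothetical usage: one machine per data center, primary listed first.
# target = pick_endpoint([
#     "http://gpu-a.example.com:8000/health",
#     "http://gpu-b.example.com:8000/health",
# ])
# if target is None: both are down, time to spin up a replacement machine.
```

You'd call `pick_endpoint` before each request (or on a timer) and send traffic to whatever it returns. Passing the health check in as a function makes it easy to test without any real machines.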