#Learning about systems architecture for large model training?


amber prawn
#

Does anyone have recommendations for blogs or papers that explain the systems architecture side of large model training? For example, how jobs are launched and managed on a large GPU cluster, how to do training checkpoints, how large amounts of data are made available during training, etc. I'd like to learn more about this general area and it's been surprisingly difficult to find resources on it. Thanks!

brittle stratus
#

Hmm, I'm not sure if there's a specific strategy. Someone more experienced may be able to weigh in.

I do work in systems, but not specifically for AI. I'd recommend ZFS over iSCSI maybe, and just run it read-only so lots of clients can concurrently read the large amounts of data needed for training pretty efficiently. NFS would be an option, but also substantially slower. ZFS would also allow transparent compression, which would save significantly on costs since decompression would be done at the destination; I believe NFS would have to decompress, recompress, and decompress again at different points.

The only reason I'm willing to even suggest this is that it'd be highly performant and administration would be very easy. I assume one wouldn't really have to update the training data. If it does need to be updated, I suppose you could have one write primary that updates it. I wouldn't think you'd have the GPU clusters pushing any sort of update to it. For checkpoints, they could instead write to a zvol over the network (iSCSI again if it needs to be performant), and ZFS can do trivial, rapid differential snapshots. I am unsure, though, how much space this would save when snapshotting weights.
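One rough way to get a feel for the snapshot question before building anything: a differential snapshot only saves space for blocks that did not change, so you can compare two consecutive checkpoints block by block. This is a minimal sketch, not a ZFS tool. The function name and the 128 KiB default block size are assumptions for illustration; the real block size depends on how the dataset or zvol is configured. If a training step changes most of the weights, expect the result to be close to 1.0, meaning the snapshot would save very little.

```python
def changed_block_fraction(old: bytes, new: bytes, block_size: int = 128 * 1024) -> float:
    """Return the fraction of fixed-size blocks in `new` that differ from `old`.

    Hypothetical helper: it approximates how much of a checkpoint a
    block-level differential snapshot would have to store again.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    # Number of blocks needed to cover `new`, rounding up for a partial tail.
    n_blocks = (len(new) + block_size - 1) // block_size
    if n_blocks == 0:
        return 0.0

    changed = 0
    for i in range(n_blocks):
        start = i * block_size
        # Slices past the end of `old` are simply shorter, so they compare unequal.
        if old[start:start + block_size] != new[start:start + block_size]:
            changed += 1
    return changed / n_blocks


if __name__ == "__main__":
    old_ckpt = b"\x00" * 32
    new_ckpt = b"\x00" * 31 + b"\x01"   # only the last byte differs
    print(changed_block_fraction(old_ckpt, new_ckpt, block_size=4))  # 0.125
```

In practice you would read two real checkpoint files in chunks rather than loading them fully into memory; the comparison logic stays the same.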

#

As for actually deploying training jobs, I know lots of people just use Docker and similar tooling.

Using ZFS also enables different layers of cache very easily, as well as deduplication when the training data (which could easily be pretty redundant) is uploaded.

#

ZFS caching would let you cache pretty large amounts of data at different levels of the storage hierarchy. So ZFS over iSCSI, then HDD, then presumably NVMe SSDs, then RAM. This helps a lot if you have to go over the data many times, as one would when training a large model.
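To make the tiering idea concrete, here is a toy read cache with a small "RAM" tier in front of a larger "SSD" tier in front of a slow backend. It is only an illustration of the hierarchy described above, under loud assumptions: the class and parameter names are made up, and both tiers use plain LRU eviction, which is simpler than what ZFS actually does in its ARC (RAM) and L2ARC (fast device) caches.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class TieredReadCache:
    """Toy two-tier LRU read cache (hypothetical, not ZFS's real algorithm)."""

    def __init__(self, ram_capacity: int, ssd_capacity: int,
                 fetch: Callable[[Hashable], Any]) -> None:
        self.ram_capacity = ram_capacity
        self.ssd_capacity = ssd_capacity
        self.fetch = fetch                      # slow backend read, e.g. HDD or network
        self.ram: OrderedDict = OrderedDict()   # fastest tier
        self.ssd: OrderedDict = OrderedDict()   # second tier
        self.hits = {"ram": 0, "ssd": 0, "backend": 0}

    def read(self, key: Hashable) -> Any:
        if key in self.ram:
            self.ram.move_to_end(key)           # mark as most recently used
            self.hits["ram"] += 1
            return self.ram[key]
        if key in self.ssd:
            value = self.ssd.pop(key)           # promote from SSD tier to RAM tier
            self.hits["ssd"] += 1
        else:
            value = self.fetch(key)             # miss in both tiers
            self.hits["backend"] += 1
        self._put_ram(key, value)
        return value

    def _put_ram(self, key: Hashable, value: Any) -> None:
        self.ram[key] = value
        if len(self.ram) > self.ram_capacity:
            # Demote the least recently used RAM entry to the SSD tier.
            old_key, old_value = self.ram.popitem(last=False)
            self.ssd[old_key] = old_value
            if len(self.ssd) > self.ssd_capacity:
                self.ssd.popitem(last=False)    # drop the oldest SSD entry


if __name__ == "__main__":
    cache = TieredReadCache(ram_capacity=1, ssd_capacity=1, fetch=lambda k: k * 2)
    for shard in (1, 2, 1, 1):
        cache.read(shard)
    print(cache.hits)
```

Repeated epochs over the same shards are where this pays off: after the first pass, reads that fit in the upper tiers never reach the backend.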

#

Docker and Kubernetes will almost certainly be used for the actual training/evaluation jobs. Similarly those could be snapshotted easily on ZFS.

#

Maybe Apache Spark for data preprocessing and stuff like that. I'm not terribly familiar with that.

#

But 100% you want ZFS, and you likely want iSCSI - I wouldn't accept any substitute for those.

#

Note that updating with a write primary would have to be done carefully.

If you do this across multiple DCs, perhaps GlusterFS could fill the role of making sure everything is replicated. It may be useful within one DC as well, depending on the setup.

#

For actual deployment, I would just have Ansible and define everything down to host, cluster, etc. hierarchically.
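As a sketch of what "define everything hierarchically" could look like, the snippet below resolves settings for one host by layering global, cluster, and host values, with the most specific layer winning. This is plain Python, not Ansible itself; it only loosely mirrors how Ansible lets host variables override group variables. Every name in it (the inventory layout, `cluster-a`, `node-02`, the variable keys) is hypothetical.

```python
from typing import Any, Dict

# Hypothetical inventory: global defaults, then per-cluster, then per-host values.
INVENTORY: Dict[str, Any] = {
    "all": {"vars": {"image": "trainer:latest", "gpus_per_host": 8}},
    "clusters": {
        "cluster-a": {
            "vars": {"checkpoint_dir": "/mnt/ckpt-a"},
            "hosts": {
                "node-01": {},
                "node-02": {"gpus_per_host": 4},  # host-level override
            },
        },
    },
}


def resolve_host_vars(inventory: Dict[str, Any], cluster: str, host: str) -> Dict[str, Any]:
    """Merge variables from least to most specific: all -> cluster -> host."""
    resolved = dict(inventory["all"]["vars"])        # start from global defaults
    cluster_entry = inventory["clusters"][cluster]
    resolved.update(cluster_entry.get("vars", {}))   # cluster values override globals
    resolved.update(cluster_entry["hosts"][host])    # host values override both
    return resolved


if __name__ == "__main__":
    print(resolve_host_vars(INVENTORY, "cluster-a", "node-02"))
```

Keeping the defaults at the top and only the exceptions at the host level means a change to the whole fleet is a one-line edit.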
