Implement (and extend) the low-bit optimizers from torchao with CUDA kernels.
The current Python-based implementation cleanly separates the optimization algorithm from the data format used for the optimizer states. The CUDA implementation should preserve this separation: instead of writing a dedicated 8-bit Adam kernel, write one Adam kernel template that can be instantiated with different parameter and optimizer-state dtypes.
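As a rough illustration of that separation, the following host-side C++ sketch keeps the Adam arithmetic in one function template and moves everything format-specific into a small policy type. All names here (`Fp32State`, `Int8LinearState`, `AdamConfig`, `adam_step_block`) are hypothetical, and the linear absmax int8 format is a simplified stand-in chosen for brevity, not the quantization scheme torchao uses. In a CUDA version the same template parameters could presumably be carried over to a `__global__` kernel, with the per-block absmax computed by a block-level reduction.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical state format: plain fp32 storage, the scale is ignored.
struct Fp32State {
    using storage_t = float;
    static float decode(storage_t s, float /*scale*/) { return s; }
    static storage_t encode(float x, float /*scale*/) { return x; }
};

// Hypothetical state format: signed 8-bit linear quantization with one
// absmax scale per block (an assumption made for this sketch).
struct Int8LinearState {
    using storage_t = std::int8_t;
    static float decode(storage_t s, float scale) {
        return static_cast<float>(s) * scale / 127.0f;
    }
    static storage_t encode(float x, float scale) {
        const float q = std::round(x / scale * 127.0f);
        return static_cast<storage_t>(std::min(127.0f, std::max(-127.0f, q)));
    }
};

struct AdamConfig {
    float lr, beta1, beta2, eps;
    int step;  // counts from 1
};

// One Adam step over a single quantization block of n elements.
// The algorithm is written once; StateFmt decides how m and v are stored.
template <typename ParamT, typename GradT, typename StateFmt>
void adam_step_block(ParamT* p, const GradT* g,
                     typename StateFmt::storage_t* m,
                     typename StateFmt::storage_t* v,
                     float& m_scale, float& v_scale,
                     std::size_t n, const AdamConfig& cfg) {
    std::vector<float> m_new(n), v_new(n);
    float m_max = 0.0f, v_max = 0.0f;

    // Pass 1: decode the states, update the moments in fp32, track absmax.
    for (std::size_t i = 0; i < n; ++i) {
        const float gi = static_cast<float>(g[i]);
        m_new[i] = cfg.beta1 * StateFmt::decode(m[i], m_scale) + (1.0f - cfg.beta1) * gi;
        v_new[i] = cfg.beta2 * StateFmt::decode(v[i], v_scale) + (1.0f - cfg.beta2) * gi * gi;
        m_max = std::max(m_max, std::fabs(m_new[i]));
        v_max = std::max(v_max, v_new[i]);
    }

    // Bias-correction terms of standard Adam.
    const float bc1 = 1.0f - std::pow(cfg.beta1, static_cast<float>(cfg.step));
    const float bc2 = 1.0f - std::pow(cfg.beta2, static_cast<float>(cfg.step));

    // New per-block scales; fall back to 1 for an all-zero block.
    m_scale = m_max > 0.0f ? m_max : 1.0f;
    v_scale = v_max > 0.0f ? v_max : 1.0f;

    // Pass 2: update the parameters and re-encode the states.
    for (std::size_t i = 0; i < n; ++i) {
        const float update = (m_new[i] / bc1) / (std::sqrt(v_new[i] / bc2) + cfg.eps);
        p[i] = static_cast<ParamT>(static_cast<float>(p[i]) - cfg.lr * update);
        m[i] = StateFmt::encode(m_new[i], m_scale);
        v[i] = StateFmt::encode(v_new[i], v_scale);
    }
}
```

Switching the state format is then a matter of changing one template argument, for example `adam_step_block<float, float, Int8LinearState>(...)` versus `adam_step_block<float, float, Fp32State>(...)`, while the update rule itself stays untouched.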
In addition to the features currently available, the low-bit optimizer should also support training in which the parameters themselves, not only the optimizer states, are low-bit. This implies that it should support [stochastic rounding and/or error compensation](https://arxiv.org/abs/2010.06192) for the weight updates.
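To make the stochastic rounding part concrete, here is a minimal host-side C++ sketch for the fp32 to bfloat16 case, using the common bit trick of adding 16 random bits to the discarded low half of the fp32 word before truncating. The function names are hypothetical, and the source of `rand_bits` is left to the caller; in a CUDA kernel it could plausibly come from a counter-based generator such as cuRAND's Philox, but that is an assumption, not something the project description specifies.

```cpp
#include <cstdint>
#include <cstring>

// Round an fp32 value to bfloat16 (returned as its raw 16-bit pattern)
// using stochastic rounding. rand_bits must be uniform in [0, 65535].
std::uint16_t fp32_to_bf16_stochastic(float x, std::uint16_t rand_bits) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);

    // Inf and NaN: truncate without noise, and keep a NaN a NaN.
    if ((bits & 0x7F800000u) == 0x7F800000u) {
        std::uint16_t hi = static_cast<std::uint16_t>(bits >> 16);
        if (bits & 0x007FFFFFu) hi |= 0x0040u;
        return hi;
    }

    // Adding uniform noise to the 16 discarded bits carries into the kept
    // bits with probability (discarded fraction) / 2^16, so the result is
    // the next bf16 value away from zero with exactly that probability.
    bits += rand_bits;
    return static_cast<std::uint16_t>(bits >> 16);
}

// Widen a raw bfloat16 pattern back to fp32 (exact).
float bf16_to_fp32(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

Because the rounding direction is chosen with probability proportional to the discarded fraction, the rounded value equals the fp32 value in expectation. This is what allows weight updates much smaller than one bfloat16 step to still accumulate over many iterations instead of being rounded away every time.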
Stretch goals include efficiently handling multiple tensors in a single kernel launch, and using CUDA's just-in-time compilation to potentially avoid having to precompile the combinatorial explosion of possible low-bit Adam kernels.