Python-only flexible autocast | GPU MODE | Page 1

maiden snow Sep 21, 2024, 5:14 AM

Build a Python-only alternative to torch.amp.autocast that lets people specify which ops get autocast. In LLMs these days, most people want to run softmax and layernorm with half-precision inputs and outputs, but AMP in PyTorch currently casts their inputs to float32 (https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32) and returns float32 tensors. Subsequent matmuls then have to cast these float32 outputs back to half precision. This causes a "ping pong" of casts that can kill performance.

This just barely scratches the surface, though. It stands to reason that people will want flexible control over which operations run in lower precision (fp8, etc.).

I believe that using a TorchDispatchMode will allow for implementing this easily.
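A minimal sketch of what that could look like, not a finished design. `SelectiveCastMode` and its `policy` dict are hypothetical names made up for illustration; `TorchDispatchMode` itself lives in the private module `torch.utils._python_dispatch`, so its import path may change between releases. The mode intercepts each ATen op and casts floating-point tensor inputs only for the ops listed in the policy:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode
from torch.utils._pytree import tree_map


class SelectiveCastMode(TorchDispatchMode):
    """Hypothetical mode: cast float inputs of selected ATen ops to a chosen dtype."""

    def __init__(self, policy):
        super().__init__()
        # policy maps an ATen overload packet (e.g. torch.ops.aten.mm) to a dtype.
        self.policy = policy

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        dtype = self.policy.get(func.overloadpacket)
        if dtype is not None:
            def cast(x):
                # Only floating-point tensors are cast; ints, bools and
                # non-tensor arguments pass through unchanged.
                if isinstance(x, torch.Tensor) and x.is_floating_point():
                    return x.to(dtype)
                return x

            args = tree_map(cast, args)
            kwargs = tree_map(cast, kwargs)
        # Ops that are not in the policy run in whatever dtype they were given.
        return func(*args, **kwargs)


a = torch.randn(4, 8)
b = torch.randn(8, 4)

with SelectiveCastMode({torch.ops.aten.mm: torch.bfloat16}):
    out = torch.mm(a, b)      # listed in the policy, so inputs are cast to bfloat16
    summed = torch.add(a, a)  # not listed, so it stays in float32

print(out.dtype, summed.dtype)
```

One thing to watch for: composite ops are usually decomposed before they reach the mode, so an op like `torch.softmax` will likely show up as a lower-level ATen op (such as `aten._softmax`). The policy keys have to match whatever actually gets dispatched.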

bronze shell Sep 21, 2024, 7:32 AM

I'm interested

full quail Sep 21, 2024, 3:35 PM

Cool idea. How much CUDA knowledge is required for this project?

maiden snow Sep 21, 2024, 3:43 PM

Not a whole lot IMO.

This idea was inspired by some work I did previously, BTW: https://github.com/NVIDIA/NeMo/pull/9198

My coworkers were using AMP and expecting things to "just work". It did work functionally, but unfortunately at a huge cost to performance.

GitHub

Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inferenc...

What does this PR do ?
I demonstrate, using transcribe_speech.py, that simply casting the entire model to bfloat16 gives about 15% higher performance than using automatic mixed precision. The reaso...
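For reference, here is a minimal sketch of the general "cast the whole model" pattern the PR title describes. This is not the PR's code; the toy model below is a made-up stand-in. Because the weights and inputs are all bfloat16, layernorm and softmax stay in bfloat16 and no autocast context is involved:

```python
import torch
import torch.nn as nn

# Toy stand-in model (hypothetical); a real model would be cast the same way.
model = nn.Sequential(
    nn.Linear(16, 16),
    nn.LayerNorm(16),
    nn.Softmax(dim=-1),
)
model = model.to(torch.bfloat16).eval()

x = torch.randn(4, 16, dtype=torch.bfloat16)
with torch.inference_mode():
    y = model(x)

print(y.dtype)
```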
