Improve GemLite-Triton Kernels | GPU MODE | Page 1

visual hearth Sep 18, 2024, 8:17 PM

#

GemLite-Triton is a set of Triton kernels that perform AnWn fused matmul, available at https://github.com/mobiusml/gemlite/
[kernels available from Thursday 19 Sep]
GemLite was designed for simplicity and flexibility without sacrificing performance. The kernels are kept short and simple, so the community can easily build custom kernels on top of the available codebase.

While the current Triton kernels perform very well on large matrices (outperforming Marlin and BitBlas), performance on smaller matrices needs some improvement. This is mainly due to a mix of Triton launch overhead, GEMV implementation, autotuning, etc.

The goal of this project is to make the kernels perform better with smaller matrices/lower loads.

#

As can be seen in these graphs, performance for 4096 x 4096 matrices at lower batch sizes needs to catch up with BitBlas and Marlin.

#

Participants can also adapt the kernels to work faster on newer GPUs.
Resource: https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html

visual hearth Sep 19, 2024, 1:52 PM

#

The GEMV implementation can be improved for slightly larger batch sizes via a better split-K algorithm. The current implementation performs atomic addition over 1D chunks, which is OK for batch-size=1, but performance drops quickly at batch-size 2-8 (batch-size>=16 uses the GEMM implementation anyway, so it is not a problem there).
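
For anyone new to split-K, here is a rough pure-Python sketch of the idea. It is not the GemLite kernel: the function names and the `split_k` parameter are made up for illustration, and the `+=` only stands in for what a GPU kernel would do with an atomic add. Each K-chunk computes a partial product and accumulates it into the shared output, and the sketch counts how many accumulations that takes.

```python
# Illustrative only: a CPU model of split-K accumulation, not GemLite code.

def matmul_reference(a, w):
    """Plain (M x K) @ (K x N) product used as ground truth."""
    m, k, n = len(a), len(w), len(w[0])
    return [[sum(a[i][p] * w[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_split_k(a, w, split_k):
    """Split K into `split_k` chunks; each chunk adds its partial into `out`.

    Returns the result and the number of accumulations performed, which
    stands in for the number of atomic adds a GPU kernel would issue.
    """
    m, k, n = len(a), len(w), len(w[0])
    chunk = (k + split_k - 1) // split_k      # ceil(K / split_k)
    out = [[0] * n for _ in range(m)]
    atomic_adds = 0
    for s in range(split_k):                  # one "program" per K-chunk
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        for i in range(m):                    # batch rows
            for j in range(n):                # output columns
                partial = sum(a[i][p] * w[p][j] for p in range(lo, hi))
                out[i][j] += partial          # hypothetical atomic add
                atomic_adds += 1
    return out, atomic_adds


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8]]          # batch-size 2, K = 4
    w = [[1, 0], [0, 1], [1, 1], [2, 3]]      # K = 4, N = 2
    out, adds = matmul_split_k(a, w, split_k=2)
    print(out == matmul_reference(a, w), adds)
```

In this model the accumulation count is `split_k * M * N`, so it grows linearly with batch size. That may be part of why the atomic-add approach holds up at batch-size=1 but falls off at 2-8, though the real cost on a GPU depends on contention and memory layout, which this sketch does not capture.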

whole hawk Sep 20, 2024, 8:07 AM

#

May be handy

📎 how_to_run_a_gemlite-triton_benchmark_on_a_fresh_modal_h100.txt

#

📎 gemlite-triton_benchmark_results_-_H100.txt
