#Improve GemLite-Triton Kernels


visual hearth

GemLite-Triton is a set of Triton kernels that perform AnWn fused matmul, available at https://github.com/mobiusml/gemlite/
[kernels available from Thursday 19 Sep]
GemLite was designed for simplicity and flexibility without sacrificing performance. The kernels are kept short and simple, so the community can easily build custom kernels on top of the available codebase.

While the current Triton kernels perform very well on large matrices (outperforming Marlin and BitBlas), performance on smaller matrices needs improvement. The gap comes mainly from a mix of Triton launch overhead, the GEMV implementation, autotuning, etc.

The goal of this project is to make the kernels work better with smaller matrices and lower loads.

As can be seen in these graphs, performance for 4096 x 4096 matrices at lower batch sizes needs to catch up with BitBlas and Marlin.

The GEMV implementation can be improved for slightly larger batch sizes via a better split-K algorithm. The current implementation performs atomic addition over 1D chunks, which is fine for batch-size=1, but performance drops quickly at batch-size 2-8. (Batch-size>=16 would use the GEMM implementation anyway, so that's not a problem there.)
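To make the split-K idea concrete, here is a minimal NumPy sketch, not the GemLite kernel itself. It assumes plain float inputs (no quantized weights), and the function names and the `split_k` parameter are hypothetical. The first function mimics accumulating each K-chunk's partial result straight into the output, which is roughly what an atomic add does on the GPU. The second is one common alternative (not necessarily the improvement intended here) that writes the partials to a scratch buffer and reduces them in a separate pass.

```python
import numpy as np


def split_k_accumulate(x, w, split_k=4):
    """Split-K product where every K-chunk adds into the same output.

    x: (M, K) activations, w: (K, N) weights. The in-place `+=` stands in
    for an atomic add performed by independent GPU programs.
    """
    m, k = x.shape
    assert w.shape[0] == k, "inner dimensions must match"
    n = w.shape[1]
    chunk = -(-k // split_k)  # ceil(K / split_k)
    out = np.zeros((m, n), dtype=np.float32)
    for start in range(0, k, chunk):
        stop = min(start + chunk, k)
        # Partial product over one slice of the reduction dimension.
        out += x[:, start:stop] @ w[start:stop, :]
    return out


def split_k_two_stage(x, w, split_k=4):
    """Split-K product with a scratch buffer and a separate reduction pass."""
    m, k = x.shape
    assert w.shape[0] == k, "inner dimensions must match"
    n = w.shape[1]
    chunk = -(-k // split_k)
    starts = list(range(0, k, chunk))
    # Stage 1: each chunk writes its own partial tile, so no atomics are needed.
    partials = np.zeros((len(starts), m, n), dtype=np.float32)
    for i, start in enumerate(starts):
        stop = min(start + chunk, k)
        partials[i] = x[:, start:stop] @ w[start:stop, :]
    # Stage 2: reduce over the split dimension.
    return partials.sum(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 4096)).astype(np.float32)  # batch-size 4
    w = rng.standard_normal((4096, 64)).astype(np.float32)
    ref = x @ w
    print(np.allclose(split_k_accumulate(x, w, 8), ref, atol=1e-2))
    print(np.allclose(split_k_two_stage(x, w, 8), ref, atol=1e-2))
```

Both variants compute the same result. On a GPU the trade-off is between contention on the atomic adds and the extra memory traffic plus second pass of the two-stage version, and which one wins at batch-size 2-8 would need to be measured.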