GemLite-Triton is a set of Triton kernels that perform AnWn fused matmul, available at https://github.com/mobiusml/gemlite/
[kernels available from Thursday 19 Sep]
GemLite was designed for simplicity and flexibility without sacrificing performance. The kernels are kept short and simple, so the community can easily build custom kernels on top of the available codebase.
While the current Triton kernels perform very well on large matrices (outperforming Marlin and BitBLAS), performance on smaller matrices needs some improvement. This is mainly due to a mix of factors, including Triton launch overhead, the GEMV implementation, and autotuning.
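To see why a fixed launch cost matters most for small matrices, here is a deliberately crude back-of-the-envelope model in Python. It is only a sketch: the function name overhead_fraction, the 10 microsecond launch overhead and the 100 TFLOPS throughput are hypothetical placeholders, not measurements of GemLite, Triton, or any particular GPU. The model also ignores memory bandwidth, which in practice dominates GEMV-sized problems.

```python
def overhead_fraction(m, n, k, launch_overhead_us=10.0, peak_tflops=100.0):
    """Estimate the share of wall time spent on fixed launch overhead
    for an (m x k) @ (k x n) matmul.

    Both default values are hypothetical placeholders, not measured numbers.
    """
    flops = 2.0 * m * n * k                       # one multiply and one add per term
    compute_us = flops / (peak_tflops * 1e12) * 1e6  # idealized compute time in microseconds
    return launch_overhead_us / (launch_overhead_us + compute_us)


if __name__ == "__main__":
    # Batch sizes from GEMV-like (m=1) up to a large square matmul.
    for m in (1, 16, 256, 4096):
        frac = overhead_fraction(m, 4096, 4096)
        print(f"m={m:5d}: launch overhead is about {frac:.1%} of total time")
```

Under these assumptions the fixed cost dominates when m is small and becomes negligible for large m, which is consistent with the kernels looking strong on large matrices and weaker on small ones.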
The goal of this project is to make the kernels perform better on smaller matrices and lower loads.