Bit QK attention | GPU MODE

upper elbow Sep 21, 2024, 2:15 PM

Can we make use of tensor core single-bit operations to speed up the QK part? We could either calculate Q and K directly in one bit, or add a random (or trainable) projection, which I think would let us interpret the one-bit matmul as a locality-sensitive hashing lookup.

zealous oyster Sep 21, 2024, 2:18 PM

Can you give an example of previous work on "which I think would let us interpret the one-bit matmul as a locality-sensitive hashing lookup"? I'm familiar with Reformer (https://arxiv.org/pdf/2001.04451), but the purpose of LSH via random projections there was not to deal with low precision.

upper elbow Sep 21, 2024, 2:26 PM

OK, so I think the word "lookup" here is probably not a good choice. What I mean is just that random-projection LSH uses a random hyperplane h for each hash function and sets the corresponding bit to sgn(<h, k>). If we generate these bits for each key and query, then the 1-bit tensor core matmul in XOR mode should just give us the number of hash bits on which each key and query differ, which is (up to an offset and a scale) the negative of the fraction of hash bits on which they coincide.
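
A minimal sketch of that idea in plain Python, not a tensor core kernel: each vector is hashed to an integer bitmask with one bit per random hyperplane, and XOR plus popcount counts the bits where a query and a key disagree. The function names, the Gaussian hyperplanes, and the `seed` parameter are illustrative assumptions, not something from the thread.

```python
import random


def sign_bits(vec, hyperplanes):
    """Pack one bit per hyperplane h into an int: bit i is 1 when <h_i, vec> >= 0."""
    bits = 0
    for i, h in enumerate(hyperplanes):
        dot = sum(hi * vi for hi, vi in zip(h, vec))
        if dot >= 0:
            bits |= 1 << i
    return bits


def xor_popcount(a, b):
    """Number of hash bits on which a and b disagree (Hamming distance)."""
    return bin(a ^ b).count("1")


def bit_scores(queries, keys, n_bits, dim, seed=0):
    """Fraction of hash bits on which each (query, key) pair agrees."""
    rng = random.Random(seed)
    # Assumption: Gaussian random hyperplanes shared by queries and keys.
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    q_bits = [sign_bits(q, hyperplanes) for q in queries]
    k_bits = [sign_bits(k, hyperplanes) for k in keys]
    # XOR + popcount gives disagreements; agreement = 1 - disagreements / n_bits.
    return [[1.0 - xor_popcount(qb, kb) / n_bits for kb in k_bits] for qb in q_bits]
```

A key identical to the query agrees on every bit, and a key pointing in the opposite direction agrees on none, so the agreement fraction orders keys by angle to the query rather than by raw inner product.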

Then you could do the higher-precision inner products for the most promising candidates (I think this would correspond to Reformer), but maybe we could also just feed the bit scores directly into the softmax. Really, I'm just trying to find something to get some use out of tensor core bit operations 🙂
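
Both options can be sketched in plain Python for a single query. This assumes `agree` already holds the per-key fraction of matching hash bits; `top_c`, `scale`, and the function names are hypothetical. The bit-only variant relies on the standard random-hyperplane result that two vectors at angle θ agree on a bit with probability 1 - θ/π, so cos(π(1 - p)) estimates their cosine similarity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def rerank_attention(q, keys, agree, top_c):
    """Exact scaled dot-product softmax, restricted to the top_c keys by bit agreement."""
    d = len(q)
    cand = sorted(range(len(keys)), key=lambda j: agree[j], reverse=True)[:top_c]
    logits = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d) for j in cand]
    weights = [0.0] * len(keys)  # keys outside the candidate set get zero weight
    for j, w in zip(cand, softmax(logits)):
        weights[j] = w
    return weights


def bit_only_attention(agree, scale=1.0):
    """Softmax over cosine estimates derived from the agreement fractions alone."""
    logits = [scale * math.cos(math.pi * (1.0 - p)) for p in agree]
    return softmax(logits)
```

The bit-only variant sees only angles, so any information carried by the norms of Q and K is lost unless it is reintroduced some other way, for example through `scale`.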

zealous oyster Sep 21, 2024, 4:08 PM

This is definitely interesting. I just took a look at the relevant instruction in Hopper: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shape

I think the first question is what shapes you have for the Q and K matrices. Do you have shapes in mind? If k (meaning the inner dimension of the matmul) is fairly small, then I worry about the effectiveness of this, but if it is large, it definitely has merit.
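
One possible reading of the concern about a small inner dimension is tile padding: a single-bit matrix instruction consumes a fixed number of bits along k per issue, so a short hash wastes part of every tile. The sketch below only does that arithmetic; `tile_k=256` is an assumed placeholder, and the actual value should be taken from the shape table in the PTX documentation linked above.

```python
def k_utilization(n_bits, tile_k=256):
    """Fraction of each k-tile holding real hash bits after padding to a multiple of tile_k."""
    n_tiles = -(-n_bits // tile_k)  # ceiling division
    padded = n_tiles * tile_k
    return n_bits / padded


# Assumed tile extent of 256 bits along k: a 64-bit hash fills a quarter of one tile.
print(k_utilization(64))   # 0.25
print(k_utilization(256))  # 1.0
```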
