#Hyper-efficient self-attention

eager night
#

You need to include something to model positional information

#

Unless you’re assuming Qh, Kh, Vh have been produced with biases per token
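To illustrate the point, here is a minimal NumPy sketch of why plain self-attention needs positional information: without it, reordering the tokens just reorders the outputs. The function `self_attention`, the random weights, and the per-position bias `pos_bias` added to the inputs are all assumptions made up for this example, not the setup discussed in the thread.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product self-attention, no positional terms.
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(logits) @ v

rng = np.random.default_rng(0)
n_tokens, d_model = 8, 16
x = rng.normal(size=(n_tokens, d_model))
wq, wk, wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# A fixed, non-trivial reordering of the tokens (here: reversed order).
perm = np.arange(n_tokens)[::-1]

# Without positional information, permuting the inputs permutes the outputs.
out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)
permutation_equivariant = np.allclose(out[perm], out_perm)

# Assumption: a learned bias per position, added to each token vector.
pos_bias = rng.normal(size=(n_tokens, d_model))
out_pos = self_attention(x + pos_bias, wq, wk, wv)
out_pos_perm = self_attention(x[perm] + pos_bias, wq, wk, wv)
position_aware = not np.allclose(out_pos[perm], out_pos_perm)

print(permutation_equivariant, position_aware)
```

With the bias tied to the position rather than to the token, the same tokens in a different order no longer produce the same (reordered) outputs, which is what lets the model tell positions apart.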

deft goblet
#

on it

#

wait can this not just be used with RoPE

eager night
#

I don’t think there’s really any downside to the relative positional embeddings (Shaw et al.) they use; they can be implemented to run really fast, with no trouble for CPU inference
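For reference, a rough sketch of key-side relative positional embeddings in the style of Shaw et al., assuming a 1D token order, a single head, and a clipping distance `max_dist`. The names `relative_attention` and `rel_k` are hypothetical, and the value-side embeddings from the paper are left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_attention(q, k, v, rel_k, max_dist):
    # q, k, v: (n, d); rel_k: (2 * max_dist + 1, d) table of learned embeddings,
    # one per clipped relative offset j - i.
    n, d = q.shape
    idx = np.arange(n)
    # Offset j - i for every query/key pair, clipped and shifted to be a valid row index.
    rel = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist) + max_dist
    a_k = rel_k[rel]  # (n, n, d): embedding for each (i, j) pair
    # Content term q_i . k_j plus position term q_i . a_ij.
    logits = (q @ k.T + np.einsum("id,ijd->ij", q, a_k)) / np.sqrt(d)
    return softmax(logits) @ v

rng = np.random.default_rng(0)
n, d, max_dist = 8, 16, 3
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
rel_k = rng.normal(size=(2 * max_dist + 1, d))

out = relative_attention(q, k, v, rel_k, max_dist)

# Sanity check: with an all-zero table this reduces to plain attention.
plain = softmax(q @ k.T / np.sqrt(d)) @ v
zero_table_out = relative_attention(q, k, v, np.zeros_like(rel_k), max_dist)
```

The index table `rel` depends only on the sequence length and `max_dist`, so for a fixed-size input it can be computed once and reused across calls.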

#

Anyway, you’re proposing a rather obvious simplification of the self-attention mechanism, with no proposal for how to feasibly integrate it into a chess network. How many layers, what you want your inputs to look like, etc. are all more difficult and more important questions to answer

deft goblet
#

fair