#Transformer
And I'll create a new thread just in case, because I don't want it to obstruct other conversations.
Hi! So I was wondering, does the "Masked Multi-Head Attention" block just mean that it performs multi-head attention only on the tokens that precede the token that is currently being produced? Hopefully I explained that alright.
I guess I'm wondering exactly how the masked multi-head attention and regular multi-head attention blocks differ.
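To make the question concrete, here's a rough NumPy sketch of how I currently picture the difference. It's a single head only, with no learned projections, and the names (`attention`, `causal`) are just ones I made up for illustration, so treat it as my guess rather than how any library does it. The only difference between the two calls is that the masked one sets the scores for future positions to negative infinity before the softmax, so each position can only put weight on itself and earlier positions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    # q, k, v: arrays of shape (seq_len, d_k); one head, no learned weights.
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)  # (seq_len, seq_len) similarity scores
    if causal:
        # True above the diagonal, i.e. where the key position is in the future.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        # -inf scores become exactly 0 after the softmax.
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 made-up token vectors of width 8

out_full, w_full = attention(x, x, x, causal=False)
out_masked, w_masked = attention(x, x, x, causal=True)

print(np.round(w_full, 2))    # every position attends to every position
print(np.round(w_masked, 2))  # lower-triangular: no weight on later positions
```

If that's right, then the "masked" block is the same computation as the regular one, just with the upper triangle of the score matrix blocked out.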
Also, does "layer normalization" mean that the individual signals of the nodes in a layer are rescaled based on some sort of "total" measure of the strength of the signals throughout the layer (to keep them from getting too small or too large)?
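Here's a small sketch of what I think layer normalization does, in case it helps someone correct me. As far as I understand it, the "total" measure is really the mean and variance taken across the features of one token's vector, and the vector is shifted and scaled to have roughly zero mean and unit variance. The `gamma` and `beta` arguments stand in for the learned scale and shift; the function name and the `eps` default are my own assumptions.

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    # Statistics are computed over the last axis (the features of each token),
    # not over the batch or over the sequence.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero
    if gamma is not None:
        x_hat = x_hat * gamma  # optional learned per-feature scale
    if beta is not None:
        x_hat = x_hat + beta   # optional learned per-feature shift
    return x_hat

# Two made-up token vectors with very different magnitudes.
tokens = np.array([[1.0, 2.0, 3.0, 4.0],
                   [100.0, 200.0, 300.0, 400.0]])

normed = layer_norm(tokens)
print(normed)  # both rows come out nearly identical after normalization
```

So each token is normalized on its own, which would keep the activations in a similar range no matter how large or small they were going in.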