#Why is the weight matrix divided by the square root of the sum of the input and output sizes?

tiny stump
#

I'm following a tutorial on writing a neural network from scratch. In the part where the author initialises the weights and biases randomly, they divide the matrices by the square root of the sum of the input and output sizes. Can anyone explain why? Doesn't np.random.randn already initialise the parameters from a Gaussian distribution?

gilded garnet
#

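That's a type of weight initialisation called Xavier (or Glorot) initialisation. np.random.randn does draw from a Gaussian, but one with variance 1, so a unit that sums over many inputs ends up with a pre-activation whose variance grows with the number of inputs. Scaling the weights down by a factor based on the layer's input and output sizes keeps the variance of the activations and gradients roughly constant from layer to layer, which improves convergence.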
https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79

and the link to the paper is here: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
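
To make the effect concrete, here is a minimal NumPy sketch (not the tutorial's code) that pushes random data through a few tanh layers and records the standard deviation of the pre-activations, with and without dividing by the square root of the input and output sizes. The layer width, depth, batch size and the helper names `init_weights` and `preactivation_std` are assumptions chosen for illustration. Note that the normal variant in the Glorot paper uses a variance of 2 / (n_in + n_out), so the tutorial's scaling differs from it by a constant factor of sqrt(2).

```python
import numpy as np

def init_weights(n_in, n_out, rng, scaled=True):
    """Draw an (n_in, n_out) matrix from a standard normal.

    With scaled=True the matrix is divided by sqrt(n_in + n_out),
    as described in the question (hypothetical helper).
    """
    w = rng.standard_normal((n_in, n_out))
    if scaled:
        w = w / np.sqrt(n_in + n_out)
    return w

def preactivation_std(scaled, n_layers=5, width=256, batch=1000, seed=0):
    """Return the std of the pre-activations at each layer of a tanh network."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    stds = []
    for _ in range(n_layers):
        z = x @ init_weights(width, width, rng, scaled)
        stds.append(float(z.std()))
        x = np.tanh(z)  # large |z| pushes tanh into its flat, saturated region
    return stds

unscaled = preactivation_std(scaled=False)
scaled = preactivation_std(scaled=True)

print("unscaled:", np.round(unscaled, 2))
print("scaled:  ", np.round(scaled, 2))
```

Without scaling, the first-layer pre-activations have a standard deviation of roughly sqrt(256) = 16, so tanh saturates and its gradient is close to zero. With scaling they stay below 1, where tanh is still responsive.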

gilded garnet
#

Can you please put a ✅ to close this thread?