I'm following a tutorial on writing a neural network from scratch. In the part where the author randomly initialises the weights and biases, they divide the matrices by the square root of the sum of the input and output sizes. Can anyone explain why? Doesn't np.random.randn already draw the parameters from a Gaussian distribution?
#Why is the weight matrix divided by the square root of the sum of the input and output sizes?
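That is a type of network weight initialisation called Xavier (or Glorot) initialisation. np.random.randn does give you Gaussian values, but always with variance 1, and with that scale the signal grows at every layer, because each unit sums over many inputs. Dividing by the square root of (input size + output size) shrinks the weight variance to 1 / (input size + output size). In a nutshell, it improves convergence by keeping the variance of the activations and gradients roughly constant throughout the layers of your network. (The paper's own version gives the weights variance 2 / (input size + output size), so the tutorial's scaling is a close variant of it.)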
https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79
and the link to the paper is here: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
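If it helps to see the effect, here is a minimal sketch comparing plain standard-normal weights with the scaled ones for a single linear layer. The layer sizes, the seed, and the helper name `init_layer` are all made up for illustration and are not taken from the tutorial; `rng.standard_normal` draws from the same distribution as np.random.randn.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen here only for reproducibility


def init_layer(n_in, n_out):
    """Hypothetical helper: standard-normal weights scaled as in the tutorial."""
    # Mean 0, variance 1, the same distribution np.random.randn samples from
    w = rng.standard_normal((n_out, n_in))
    # Dividing by sqrt(n_in + n_out) shrinks the variance to 1 / (n_in + n_out)
    return w / np.sqrt(n_in + n_out)


n_in, n_out = 512, 256                  # assumed layer sizes
x = rng.standard_normal((n_in, 1000))   # 1000 unit-variance input vectors

w_raw = rng.standard_normal((n_out, n_in))  # unscaled weights
w_scaled = init_layer(n_in, n_out)          # scaled weights

# Variance of the layer's pre-activations under each initialisation
var_raw = (w_raw @ x).var()
var_scaled = (w_scaled @ x).var()

print(f"weight variance (scaled): {w_scaled.var():.5f}")
print(f"output variance, unscaled weights: {var_raw:.1f}")
print(f"output variance, scaled weights:   {var_scaled:.3f}")
```

With unscaled weights the output variance comes out close to the input size (about 512 here), so it would blow up after a few layers. With the scaling it stays below 1, around n_in / (n_in + n_out).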
thank you very much!
Can you please put a ✅ to close this thread?