I implemented a fully connected network in NumPy and tested it on MNIST, where it works. However, I noticed that with just a few layers, before any training, the output of the network is dominated by the biases; that is, the output doesn't depend on the input, which I find odd. I figured it could be due to the weight initialization, but it seems in line with the literature for ReLU activations (self.weights = np.random.normal(0, 2/neurons_pre, (neurons_pre + 1, neurons))).
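One thing that may be worth checking in the line above: the second argument of np.random.normal is the standard deviation (scale), not the variance, whereas He initialization for ReLU specifies a variance of 2/fan_in, i.e. a standard deviation of sqrt(2/fan_in). The sketch below is only an illustration under assumptions: the layer sizes, the random inputs, and the helper name forward_std are hypothetical, and biases are left out. It compares how much signal survives a few ReLU layers with each scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_std(scale_fn, sizes, x):
    """Push x through bias-free ReLU layers and return the spread of the output.

    scale_fn maps fan-in to the standard deviation passed to rng.normal.
    """
    h = x
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = rng.normal(0.0, scale_fn(n_in), (n_in, n_out))
        h = np.maximum(h @ w, 0.0)  # ReLU
    return h.std()

# Hypothetical MNIST-sized layout and inputs in [0, 1].
sizes = [784, 256, 128, 64]
x = rng.random((100, 784))

# Scale as written in the post: 2 / fan_in used directly as the std.
std_as_written = forward_std(lambda n: 2.0 / n, sizes, x)
# He initialization: std = sqrt(2 / fan_in).
std_he = forward_std(lambda n: np.sqrt(2.0 / n), sizes, x)

print("std with 2/n:      ", std_as_written)
print("std with sqrt(2/n):", std_he)
```

If the input-dependent part of the activations shrinks at every layer like this, any bias term that is added on top would be expected to dominate the output, which would match what I am seeing. I have not confirmed that this is the cause.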
Regardless, I then tried a more challenging dataset: images of a drone flying, where the background is a building and some grass. I am using ~5000 images (80% train, 20% validation). Each was originally 480x640x3, but since that is too large to train a fully connected network on, I applied min pooling (the drone is dark), which reduced them to 60x80x3 images that I flatten to vectors of length 14400. The labels are vectors of size 4 holding the coordinates of a bounding box, so I'm just using an MSE loss. Both the inputs and the labels are normalized to the range 0 to 1 (I've also tried -1 to 1 for the inputs, but it didn't help).
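For concreteness, here is a minimal sketch of the downsampling step. It assumes non-overlapping 8x8 blocks pooled per channel, which is what takes 480x640 to 60x80; the function name min_pool and the random stand-in image are hypothetical.

```python
import numpy as np

def min_pool(image, block=8):
    """Min-pool an (H, W, C) image over non-overlapping block x block windows."""
    h, w, c = image.shape
    assert h % block == 0 and w % block == 0, "image size must be divisible by block"
    # Split each spatial axis into (number of blocks, block size).
    blocks = image.reshape(h // block, block, w // block, block, c)
    # Take the minimum inside every window, separately for each channel.
    return blocks.min(axis=(1, 3))

# Random stand-in for one 480x640x3 frame with values in [0, 1].
image = np.random.default_rng(0).random((480, 640, 3))
small = min_pool(image)   # pooled image
flat = small.reshape(-1)  # network input vector

print(small.shape, flat.shape)
```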
Visually, in some of the reduced images it is easy to find the drone, while in others it is quite difficult. So it wasn't surprising to see the network converge to a single output regardless of the input; I figured it was just converging to the average label because it was unable to learn the harder task. I then tried to increase the network size, with the plan of overfitting first and regularizing afterwards. I've tried several layouts (14400-500-50-4, 14400-1000-500-250-125-64-32-16-8-4, etc.), but I can't increase the number of parameters too much due to memory constraints. I also tried different learning rates (0.2, 0.1, 0.03, 0.003, etc.), but none of them helped.
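To put the memory constraint in numbers, here is a small sketch that counts parameters for the two layouts above, assuming one bias per neuron (matching the neurons_pre + 1 rows in my weight matrices). The helper name count_params is hypothetical, and the memory figure assumes float64 weights and ignores gradients and activations.

```python
def count_params(layout):
    """Count weights plus one bias per neuron for a fully connected layout."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layout[:-1], layout[1:]))

shallow = count_params([14400, 500, 50, 4])
deep = count_params([14400, 1000, 500, 250, 125, 64, 32, 16, 8, 4])

for name, n in [("14400-500-50-4", shallow), ("14400-1000-...-4", deep)]:
    # 8 bytes per parameter when stored as float64.
    print(f"{name}: {n:,} parameters, {n * 8 / 1e6:.1f} MB as float64")
```

In both cases almost all of the parameters sit in the first layer (roughly 7.2 million and 14.4 million respectively), so the input size rather than the depth is what drives memory use.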
It might be worth pointing out that I'm not using any optimizer beyond plain updates - just vanilla gradient descent with mini-batches of 50 samples. Shouldn't it still be able to memorize the training set?
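As a sanity check of the idea that plain mini-batch gradient descent can memorize a small set, here is a self-contained toy sketch. Everything in it is an assumption made for illustration: the data is random, the sizes are tiny, the biases are kept separate from the weights and start at zero, and the learning rate and batch size were chosen for the toy problem, not for my dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data standing in for the flattened images and box labels.
n_samples, n_in, n_hidden, n_out = 20, 10, 32, 4
X = rng.random((n_samples, n_in))
T = rng.random((n_samples, n_out))

# He-style initialization: std = sqrt(2 / fan_in); biases start at zero.
W1 = rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, np.sqrt(2.0 / n_hidden), (n_hidden, n_out))
b2 = np.zeros(n_out)

def mse(inputs, targets):
    """Mean squared error of the current network on the given data."""
    hidden = np.maximum(inputs @ W1 + b1, 0.0)
    return np.mean((hidden @ W2 + b2 - targets) ** 2)

initial_loss = mse(X, T)
lr, batch_size, epochs = 0.05, 5, 500

for epoch in range(epochs):
    order = rng.permutation(n_samples)  # reshuffle every epoch
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        x, t = X[idx], T[idx]
        # Forward pass.
        z1 = x @ W1 + b1
        h = np.maximum(z1, 0.0)
        y = h @ W2 + b2
        # Backward pass for the mean squared error over the batch.
        dy = 2.0 * (y - t) / y.size
        dW2, db2 = h.T @ dy, dy.sum(axis=0)
        dz1 = (dy @ W2.T) * (z1 > 0)
        dW1, db1 = x.T @ dz1, dz1.sum(axis=0)
        # Vanilla gradient descent update, no momentum.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

final_loss = mse(X, T)
print("initial loss:", initial_loss, "final loss:", final_loss)
```

If the training loss on a handful of real samples does not drop in a setup like this, that would point at the implementation or initialization more than at the network size.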
TL;DR: How should I set up my network if I want to overfit a dataset with 4000 training samples, an input size of 14400, and a label size of 4?