#Can't even overfit to the training set...


cloud elbow
#

I implemented a fully connected network with NumPy, tested it on MNIST, and it works. But I noticed that with just a few layers, before training, the output of the network is dominated by the biases; that is, the output doesn't depend on the input, which I find odd. I figured it could be due to the weight initialization, but it seems in line with the literature for ReLU activations (`self.weights = np.random.normal(0, 2/neurons_pre, (neurons_pre + 1, neurons))`).
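
One thing worth checking in that initialization line: the second argument of `np.random.normal` is the standard deviation, whereas He initialization is usually stated as a variance of `2/fan_in`, i.e. a standard deviation of `sqrt(2/fan_in)`. The sketch below is only an illustration of the size of that difference for `fan_in = 14400`; the names `as_posted` and `he_normal` are made up for the comparison. The two scales differ by a factor of `sqrt(fan_in/2)`, which is about 85 here.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 14400  # input size of the drone network described in this thread

# np.random.normal(loc, scale, size): `scale` is the standard deviation.
as_posted = rng.normal(0.0, 2.0 / fan_in, size=100_000)           # std = 2 / fan_in
he_normal = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=100_000)  # std = sqrt(2 / fan_in)

ratio = he_normal.std() / as_posted.std()
print(f"std as posted: {as_posted.std():.6f}")
print(f"std He normal: {he_normal.std():.6f}")
print(f"ratio:         {ratio:.1f}")
```

Whether this explains the behaviour described here is not confirmed in the thread; it is simply a cheap thing to rule out.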

Regardless, I then tried a more challenging dataset - these are images of a drone flying, where the background is a building and some grass. I am using ~5000 images (80% train 20% val), each was originally 480x640x3 but since that's too large to train a FC network on, I did min pooling (the drone is dark) so they became 60x80x3 images, flattened to 14400. The labels are vectors of size 4, with the coordinates of a bounding box, so I'm just using an MSE loss. Both the inputs and the labels are normalized between 0 and 1 (I've tried between -1 and 1 for the inputs but it didn't help).
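
For readers who want to reproduce the preprocessing, here is a minimal sketch of the min pooling step. It assumes non-overlapping 8x8 windows (inferred from 480/60 and 640/80) and uses a random array as a stand-in for a real frame; `min_pool` is a hypothetical helper, not code from the thread.

```python
import numpy as np

def min_pool(image, factor=8):
    """Downsample an (H, W, C) image by taking the minimum over factor x factor blocks."""
    h, w, c = image.shape
    assert h % factor == 0 and w % factor == 0, "image size must be divisible by factor"
    # Split each spatial axis into (blocks, within-block) and reduce the within-block axes.
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.min(axis=(1, 3))

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)  # stand-in for one drone frame

small = min_pool(frame, factor=8)                  # shape (60, 80, 3)
x = small.reshape(-1).astype(np.float32) / 255.0   # flattened and scaled to [0, 1]
print(small.shape, x.shape)
```

Taking the minimum keeps the darkest pixel in each block, which is why it suits a dark object on a lighter background.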

Visually, in some (reduced) images it is easy to find the drone, in others it is quite difficult, so it wasn't surprising to see the network converge to a point in this case, regardless of the inputs - figured it was just converging to an average due to being unable to learn the harder task. So I tried to increase the network size, overfit first and then regularize. I've tried several layouts (14400-500-50-4, 14400-1000-500-250-125-64-32-16-8-4, etc), but I can't increase the number of parameters too much, due to memory constraints. Also tried different learning rates (0.2, 0.1, 0.03, 0.003, etc), but no help.
It might be worth pointing out that I'm not using any optimizer - just doing vanilla gradient descent with mini-batching (50 samples). Shouldn't it be able to memorize the training set?

TL;DR: How should I make my network if I want to overfit on a dataset with 4000 samples of input size 14400 and label size 4?

dusk topaz
#

Hey there, @cloud elbow you can go through this video to find flaws in your approach [Building a neural network FROM SCRATCH (no Tensorflow/Pytorch, just numpy & math)] https://youtu.be/w8yWXqWQYmU?si=GRhFGdkvhtPJJ0sK.

vapid surge
#

I don't know if I've seen bounding-box image tasks attempted with a fully connected network before and I'll be a little surprised if it works. Have you found an example of this approach elsewhere? Maybe try to recreate their architecture first.

cloud elbow
# dusk topaz Hey there, @cloud elbow you can go through this video to find flaws in...

Hey, that's a great video! I remember it was quite helpful a few years ago, when i followed along and implemented it. What i implemented is meant to be more general, with a dense layer class with init, forward and backward methods, and it works on the MNIST dataset.
But on this drone images dataset, i can't even seem to overfit. Even when i reduced it to 5-10 images, it was still unable to overfit. I've already checked and compared the implementation to a few others and it seems ok, which led me to asking here, in case i'm missing something obvious, and because i'm running out of time.

cloud elbow
# vapid surge I don't know if I've seen bounding-box image tasks attempted with a fully connec...

I'm inclined to agree with you. The goal is actually to make a working CNN that runs on the GPU, which i've done (actually, i implemented a simplified conv layer, where the stride is the same as the kernel size)... except it's not working. Since it's really weird that i can't even overfit to this dataset with a fully connected network, i'm starting there, since that might fix the CNN as well.

merry spoke
#

If you could pass the code it would be easier to find the error

#

Also, you say that you are not using an optimizer, just vanilla gradient descent, but I think that is the optimizer. The optimizer is the part of the code that adjusts the weights using the gradient of the loss function.
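
To make the terminology concrete, a minimal sketch of vanilla gradient descent written as a standalone "optimizer" step. The function name `sgd_step` and the toy quadratic loss are illustrative assumptions, not taken from the code in this thread.

```python
import numpy as np

def sgd_step(weights, grad, learning_rate):
    """Vanilla gradient descent: move against the gradient, scaled by the learning rate."""
    return weights - learning_rate * grad

# Toy problem: minimise f(w) = ||w - target||^2, whose gradient is 2 * (w - target).
target = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(100):
    w = sgd_step(w, 2.0 * (w - target), learning_rate=0.1)

print(w)  # ends up very close to `target`
```

Adam, momentum, or a learning rate schedule would replace or wrap this one update rule; everything else in the training loop stays the same.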

cloud elbow
# merry spoke If you could pass the code it would be easier to find the error

Yes i know, but i didn't want to assume people's help with debugging so i was giving the context first in case someone's been through a similar problem.
And yes that is the optimizer, although i incorporated it directly into the backward function. Just meant that i didn't implement anything such as learning rate scheduler or adam, etc...
I'll do a quick cleanup of the code and send it right away

#

This is just imports and loading. Using MNIST here so training data is 60000x784, but with the (reduced) drone dataset it's 3997x14400.

```python
import numpy as np
import cupy as cp
import matplotlib.pyplot as plt

from keras.datasets import mnist

# MNIST: 60000 training and 10000 test images of 28x28 pixels
(train_xx, train_yy), (test_xx, test_yy) = mnist.load_data()

train_xx, test_x_mnist = train_xx[...,None], test_xx[...,None]

print('X_train: ' + str(train_xx.shape))
print('Y_train: ' + str(train_yy.shape))
print('X_test:  '  + str(test_xx.shape))
print('Y_test:  '  + str(test_yy.shape))

# Flatten the images and one-hot encode the labels
train_x = train_xx.reshape(-1, 784)
train_y = np.zeros((60000,10))
train_y[np.arange(60000), train_yy] = 1
print(train_x.shape, train_y.shape)

val_x = test_xx.reshape(-1, 784)
val_y = np.zeros((10000,10))
val_y[np.arange(10000), test_yy] = 1
print(val_x.shape, val_y.shape)
```
#

A few useful functions:

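```python
def relu(x):
    return x*(x>0)

def relu_d(x):
    # Derivative of ReLU: 1 where x > 0, else 0 (returned as a boolean mask)
    return x>0

def softmax(z):
    # Subtract the row-wise max for numerical stability; the result is unchanged
    m = z.max(axis=-1, keepdims=True)
    e = cp.exp(z-m)
    return e/cp.sum(e, axis=-1, keepdims=True)

def MSE(out, label):
    return cp.mean((out-label)**2)

def MSE_grad(out, label):
    # Per-element gradient; the average over the batch is taken in the layer's backward()
    return 2*(out-label)

def MAE(out, label):
    return cp.mean(cp.abs((out-label)))

def MAE_grad(out, label):
    return cp.sign((out-label))

def cross_entropy(out, label):
    eps = 10**-14   # avoids log(0)
    return -cp.sum(label*cp.log(out+eps))
```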
#

The dense layer class is just this

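```python
class layer_dense:
    def __init__(self, neurons_pre, neurons, act_func, act_func_delta, alpha):
        self.activation = act_func
        # Bias is stored as the extra last row of the weight matrix.
        # Intended as He init; note that the second argument of normal() is the standard deviation.
        self.weights = cp.random.normal(0, 2/neurons_pre, (neurons_pre + 1, neurons))   # He
        self.alpha = alpha   # learning rate
        self.act_func_delta = act_func_delta

    def forward(self, input_vec):
        # Append a column of ones so the last weight row acts as the bias
        self.inp_v = cp.concatenate([input_vec, cp.ones((input_vec.shape[0],1))], axis=-1)
        z = self.inp_v@self.weights   # N,x+1 @ x+1,y -> N,y
        self.z = z.copy()
        a = self.activation(z)
        return a

    def backward(self, grads):
        dz = self.act_func_delta(self.z)*grads
        # Gradient w.r.t. the layer input (bias row excluded), computed before the update
        da = dz@self.weights[:-1,:].T   # N, y @ y, x -> N, x
        # Gradient descent step: batch mean of the per-sample outer products
        self.weights -= cp.mean(dz[:,None,:]*self.inp_v[...,None], axis=0)*self.alpha
        return da
```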
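
A finite-difference gradient check is one way to rule out a bug in `backward`. The sketch below is a NumPy re-statement of a single ReLU layer with the bias folded into the last weight row, not the CuPy class itself; the function names and the tiny shapes are assumptions chosen for illustration. The analytic gradient uses the same batch-mean formula as the update above, which corresponds to the squared error summed over outputs and averaged over the batch.

```python
import numpy as np

def relu(x):
    return x * (x > 0)

def relu_d(x):
    return (x > 0).astype(x.dtype)

def forward(x, w):
    """Dense ReLU layer with the bias folded into the last row of w."""
    x1 = np.concatenate([x, np.ones((x.shape[0], 1))], axis=-1)
    z = x1 @ w
    return relu(z), z, x1

def loss_fn(x, w, y):
    a, _, _ = forward(x, w)
    return np.mean(np.sum((a - y) ** 2, axis=1))  # summed over outputs, averaged over batch

def analytic_grad(x, w, y):
    a, z, x1 = forward(x, w)
    dz = relu_d(z) * 2.0 * (a - y)
    return x1.T @ dz / x.shape[0]  # batch mean of the per-sample outer products

def numeric_grad(x, w, y, eps=1e-6):
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        g[idx] = (loss_fn(x, w_plus, y) - loss_fn(x, w_minus, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6))
y = rng.normal(size=(5, 3))
w = rng.normal(0.0, np.sqrt(2.0 / 6), size=(7, 3))

max_err = np.max(np.abs(analytic_grad(x, w, y) - numeric_grad(x, w, y)))
print(f"max abs difference: {max_err:.2e}")
```

If the two gradients agree to several decimal places, the backward pass for that layer is probably fine and the problem is more likely in initialization, scaling, or the learning rate.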
#

And the training (i'm using 3997 and 997 because those are the train and val sizes for the drones dataset). Not sure why i didn't just grab the length, since i've made the rest quite general...

```python
# Learn_R (the learning rate) must be defined before this point; its value is not shown here.
densy0 = layer_dense(784, 128, relu, relu_d, Learn_R)
densy1 = layer_dense(128, 64, relu, relu_d, Learn_R)
densy2 = layer_dense(64, 10,  lambda x: x, lambda x: 0*x+1, Learn_R)   # linear output layer
# densy2 = layer_dense(64, 10,  softmax, lambda x: 0*x+1, Learn_R)


epochs = 1001   # each "epoch" here is one mini-batch update
layer_n=[densy0,densy1,densy2]


train_losses = []
val_losses = []
for i in range(epochs):
    # Sample a random mini-batch of 100 training examples
    choi = np.random.choice(np.arange(3997), 100)
    out = cp.array(train_x[choi]/255,dtype = 'float32') #- 0.5   # subtracted 0.5
    label_l = cp.array(train_y[choi])
    for flayer in layer_n:
        out = flayer.forward(out)

    train_loss = MSE(out, label_l)
    train_losses.append(train_loss)

    grads = MSE_grad(out, label_l)
    # grads = softmax(flayer.z)-label_l

    # print("epoch ",i, ":",train_loss)
    print(i, np.mean(np.argmax(out.get(), axis=1)==np.argmax(train_y[choi], axis=1)))

    # Backpropagate through the layers in reverse order; each layer updates its own weights
    for blayer in layer_n[::-1]:
        grads = blayer.backward(grads)

    # Every 10 iterations, evaluate on a random batch of 50 validation examples
    if i%10==0:
        choi = np.random.choice(np.arange(997), 50)
        out = cp.array(val_x[choi]/255) #- 0.5   # subtracted 0.5
        label_l = cp.array(val_y[choi])
        for flayer in layer_n:
            out = flayer.forward(out)

        val_loss = MSE(out, label_l)
        val_losses.append(val_loss)
        # print("Validation loss: ", val_loss)
        print("val acc: ", np.mean(np.argmax(out.get(), axis=1)==np.argmax(val_y[choi], axis=1)))
```
cloud elbow
# merry spoke If you could pass the code it would be easier to find the error

Sorry if this is quite long (was long enough to incur dyno bot's wrath a bit), obtuse or messy (didn't want to change things too much to not break anything). That code should work off the bat, and show the biases dominating a bit before training, which goes away after training, but it doesn't go away with the other dataset. I think the reduced one is ~70mb, so let me know if you want me to send it.
Alternatively, I can find the minimum number of images that make the network unable to overfit and send those, which i think is only around 5 images. Any help is much appreciated!

cloud elbow
#

To clarify, even when i reduce the dataset to 3 images, it fails to overfit. In reality, the second image is very similar to the first, so what this means is that for some reason, it is constrained to learn only a point in 4D space as the solution. I find this very weird given that it is able to break away from that with MNIST.
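
As a reference point for the three-image experiment, a small synthetic check can show whether plain gradient descent on a ReLU MLP can fit three samples at all. The sketch below is a stand-in under stated assumptions, not the code from this thread: NumPy instead of CuPy, random inputs and labels in [0, 1], hypothetical layer sizes (64-16-4), a small fixed learning rate, and weights drawn with standard deviation `sqrt(2/fan_in)`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_samples = 64, 16, 4, 3

x = rng.uniform(0.0, 1.0, size=(n_samples, n_in))    # synthetic inputs in [0, 1]
y = rng.uniform(0.0, 1.0, size=(n_samples, n_out))   # synthetic box coordinates in [0, 1]

def init(fan_in, fan_out):
    # Bias folded into the last row; std sqrt(2 / fan_in) is an assumption of this sketch.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in + 1, fan_out))

def add_bias(a):
    return np.concatenate([a, np.ones((a.shape[0], 1))], axis=-1)

w1, w2 = init(n_in, n_hidden), init(n_hidden, n_out)
lr = 0.001
losses = []
for step in range(20000):
    # Forward pass: ReLU hidden layer, linear output layer
    x1 = add_bias(x)
    z1 = x1 @ w1
    h1 = add_bias(z1 * (z1 > 0))
    out = h1 @ w2
    losses.append(np.mean((out - y) ** 2))

    # Backward pass: both gradients are computed before either weight matrix is updated
    d_out = 2.0 * (out - y) / n_samples
    d_h = (d_out @ w2[:-1].T) * (z1 > 0)   # skip the bias row, then go back through the ReLU
    w2 -= lr * h1.T @ d_out
    w1 -= lr * x1.T @ d_h

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

If a check like this drives the loss down on synthetic data while the real pipeline stays stuck at a constant output, the difference between the two setups (initialization scale, input scaling, learning rate) is a reasonable place to look.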
It appears to me that it has something to do with the network being wider, but i am at a loss in regards to how to fix it. From comparing to other approaches, mine seems fine, but i'm likely suffering from perceptual blindness...

sly shoal
#

I didn't read through the whole thing, just saw that you want to overfit your data. Not sure what the use case is, but that's the general rule of thumb

cloud elbow
#

(either way it's failing even with single digit number of images)

sly shoal
# cloud elbow The input vector is 14400 long and output is just 4 numbers. My impression for o...

Yann LeCun thinks that it's specious to say neural network models are interpolating because in high dimensions, everything is extrapolation. Recently Dr. Randall Bellestre...

#

Not that I'm trying to just throw stuff at you, but here's a good explanation of how the curse of dimensionality still exists. What you're facing is not just "predicting 4 outcomes"; it's deciding, for every possible data point across all dimensions, which outcome it should map to.

#

So quite simply, such a shallow model, despite being quite wide per layer, is not going to have enough decision boundaries.

#

This is of course assuming there's nothing wrong in the implementation

#

And we don't really do these exercises anyway, so I may be failing to see something, but at least that's the general rule of thumb.

#

Go deeper, go larger, especially for high dimensional data

cloud elbow
#

Makes a lot of sense. I do think it's the latter in this case, but that contextualizes things a bit better.

cloud elbow
# sly shoal Go deeper, go larger, especially for high dimensional data

I've tried going up to a 12-layer MLP with a gradual descent from 14400 to 4, but it did not help. I also found it weird that the untrained network was dominated by the biases, and a larger size seems to make the problem worse (which is why i think it trains for MNIST but fails with this one).
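
One way to quantify "dominated by the biases" is to push a batch of different inputs through an untrained network and measure how much of the output actually changes from input to input. The sketch below is a NumPy illustration with hypothetical layer sizes and random inputs; `input_dependence` and `std_fn` are names invented here. It compares two choices of the `scale` argument to `normal()`, `2/n` and `sqrt(2/n)`, with the bias folded into the last weight row as in the dense layer posted earlier.

```python
import numpy as np

def input_dependence(layer_sizes, std_fn, n_samples=64, seed=0):
    """Fraction of the untrained output's RMS that varies from input to input."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(0.0, 1.0, size=(n_samples, layer_sizes[0]))
    n_layers = len(layer_sizes) - 1
    for i in range(n_layers):
        fan_in, fan_out = layer_sizes[i], layer_sizes[i + 1]
        w = rng.normal(0.0, std_fn(fan_in), size=(fan_in + 1, fan_out))  # last row = bias
        a = np.concatenate([a, np.ones((n_samples, 1))], axis=-1) @ w
        if i < n_layers - 1:
            a = a * (a > 0)                     # ReLU on hidden layers, linear output
    varying = np.sqrt(np.mean(a.var(axis=0)))   # spread of each output across inputs
    total = np.sqrt(np.mean(a ** 2))            # overall output magnitude
    return varying / total

sizes = [1024, 256, 64, 4]
as_posted = input_dependence(sizes, lambda n: 2.0 / n)           # scale = 2/n
he_normal = input_dependence(sizes, lambda n: np.sqrt(2.0 / n))  # scale = sqrt(2/n)
print(f"scale 2/n:       {as_posted:.4f}")
print(f"scale sqrt(2/n): {he_normal:.4f}")
```

A value near 0 means the output is essentially the same for every input. With the smaller scale, the input-dependent signal shrinks at every layer while each layer adds a fresh bias, which would also be consistent with the effect getting worse as the network gets deeper.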

vapid surge
#

The learning problems are totally different for MNIST and the drones. MNIST is mapping a flat array of high-contrast greyscale sequences to 10 classes. With drones, you are asking for it to map a much larger flat array of low-contrast pixel sequences to specific bounding-box numbers so it's easy to see why those inputs might not be easy to differentiate. It's not really a 'next step' problem, it's a big jump. Having said that, if it's not even learning a few images (where they are very different, especially with lots of different colors) you may have something wrong in your network because at some point it is just mapping an array of values to an output value so it should just memorize it. Maybe try creating the same network in PyTorch and see if that learns with a few very different images. If so, you have a subtle bug to find in your network.