Would like clarification on Lux's layer building | Humans of Julia | Page 1

#

I want to train a very basic feed-forward neural network to learn a nonlinear operator; that is, I want it to learn a function of functions.

For the sake of simplicity, let's say this operator maps a function, call it f, of type R¹ → R¹ to another function g of type R¹ → R¹. The nonlinear operator is defined as y = N(f(x)) := ∫₋₁¹ f(x) * s + f'(x) * sin(π * s²)*cos(x) dx (where s is the grid points of the codomain y, but for simplicity I'm assuming s and x are the same size, same nodes).
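For reference, here is one way to typeset that operator. This assumes everything before dx is the integrand and that s is the variable of the output function g. The second line is a generic nodal discretization; the quadrature weights w_i and the differentiation matrix D are assumed names, and N_x + 1 and N_y + 1 are the numbers of input and output nodes.

```latex
g(s) = N[f](s) := \int_{-1}^{1} \Big( f(x)\, s + f'(x)\, \sin(\pi s^{2}) \cos(x) \Big)\, dx

g(s_j) \approx \sum_{i=0}^{N_x} w_i \Big( f(x_i)\, s_j + (D f)_i\, \sin(\pi s_j^{2}) \cos(x_i) \Big),
\qquad j = 0, \dots, N_y
```

Read this way, every output value g(s_j) depends on all of the input values f(x_0), ..., f(x_{N_x}) at once.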

So for my input functions, I just generate 31 random points, and then have my operator act on them to produce the corresponding "output functions", which serve as target functions. I'll have, say, 1000 samples of those input-output pairs (i.e. I'll have 1000 pairs of 31D vectors, or two 31 x 1000 arrays, one storing my input functions and the other storing my output functions).
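A minimal sketch of how two such 31 x 1000 arrays could be built. To keep it short it uses a uniform grid, trapezoid weights, and a finite-difference derivative as stand-ins; these are not the Gauss-Lobatto-Legendre nodes, weights, and differentiation matrix, and the names `apply_operator`, `input_fns`, and `output_fns` are made up for illustration.

```julia
using Random

Nx = 30                                          # 31 grid points
x = collect(range(-1.0, 1.0; length = Nx + 1))   # uniform grid (stand-in for GLL nodes)
h = x[2] - x[1]

# Trapezoid quadrature weights (stand-in for GLL weights)
w = fill(h, Nx + 1)
w[1] = h / 2
w[end] = h / 2

# Finite-difference differentiation matrix (stand-in for the spectral one)
D = zeros(Nx + 1, Nx + 1)
for i in 2:Nx
    D[i, i - 1] = -1 / (2h)
    D[i, i + 1] = 1 / (2h)
end
D[1, 1:3] = [-3.0, 4.0, -1.0] ./ (2h)            # one-sided stencil at the left end
D[end, (end - 2):end] = [1.0, -4.0, 3.0] ./ (2h) # one-sided stencil at the right end

# Discretized operator: one output value per output node s (here s runs over x)
function apply_operator(f, x, w, D)
    df = D * f
    return [sum(w .* (f .* s .+ df .* sin(pi * s^2) .* cos.(x))) for s in x]
end

rng = Random.MersenneTwister(0)
N_samples = 1000
input_fns = randn(rng, Nx + 1, N_samples)        # one random input function per column
output_fns = reduce(hcat, [apply_operator(input_fns[:, k], x, w, D) for k in 1:N_samples])
```

Each column of `input_fns` is one discretized input function, and the matching column of `output_fns` is its target.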

I'm willing to post more code, e.g. my Gauss-Lobatto-Legendre discretization nodes, weights, and differentiation matrix, for my test operator I've implemented that I want my feed-forward neural network to learn. Issue is that going into that will run well into the limits of this post (and is probably going to be irrelevant to my main question here).

My main question:

In the past, e.g. when I tried following the "Fitting a Polynomial using MLP" Lux tutorial, examples like these would build out their models like so:

model = Chain(Dense(1 => 16, relu), Dense(16 => 1))

i.e. the first layer's weight seems like an n_param × 1 column vector, i.e. it takes 1 point and maps it to 16 parameters.

However, thinking about what the first layer should look like mathematically: L₁= W v_in + b, I changed the first layer to take in my number of points directly. I.e. my first weight matrix is 64 x 31.

# Define the neural network model, Nx = Ny = 30
model = Lux.Chain(
    Dense(Nx+1, 64, leakyrelu),
    Dense(64, 64, leakyrelu),
    Dense(64, Ny+1)
)
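A small shape check for this kind of model, as a sketch. It assumes a recent Lux version where unnamed layers in a `Chain` get parameter fields named `layer_1`, `layer_2`, and so on; the random `X` is placeholder data only.

```julia
using Lux, Random

Nx = Ny = 30
model = Lux.Chain(
    Lux.Dense(Nx + 1 => 64, leakyrelu),
    Lux.Dense(64 => 64, leakyrelu),
    Lux.Dense(64 => Ny + 1),
)

rng = Random.default_rng()
ps, st = Lux.setup(rng, model)

# The first weight matrix is the W in L1 = W * v_in + b
size(ps.layer_1.weight)   # (64, 31)

# A batch of 1000 discretized functions, one function per column
X = randn(Float32, Nx + 1, 1000)
Y, _ = model(X, ps, st)
size(Y)                   # (31, 1000)
```

The first dimension of the input is the feature dimension (the 31 node values) and the last dimension is the batch, so the whole 31 x 1000 array can be passed in one call.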

to be continued below:

#

(Continued, sorry, I was running against the post limit):

When I first made my model to learn the nonlinear operator I defined above, I followed the polynomial MLP tutorial in the Lux documentation (https://lux.csail.mit.edu/stable/tutorials/beginner/2_PolynomialFitting), i.e. I built a model whose first layer is, say, 1 => 64, and made use of the Experimental TrainState they used. That didn't quite work: the neural net was just not learning anything and the loss was flatlining.

My main trainloop at the time looked like this:

# Define the training function
function main_loop(tstate::Lux.Experimental.TrainState,
                    vjp_rule,
                    test_data,
                    test_outfuncdata,
                    N_epochs,
                    N_samples, 
                    Nx,
                    Ny)

  train_loss_history = []
  Threads.@threads for epoch in 1:N_epochs
    total_loss = 0.0
    for sample_function in 1:N_samples
      f_x = reshape(test_data[:, sample_function]', (1, Nx+1))
      N_f_x = reshape(test_outfuncdata[:, sample_function]', (1, Ny+1))
      data = (f_x, N_f_x)
      data = data .|> gpu_device()
      grads, loss, stats, tstate = Lux.Training.compute_gradients(
        vjp_rule, loss_function, data, tstate
      )
      tstate = Lux.Training.apply_gradients(tstate, grads)

      total_loss += loss;
    end

      # avg_loss = total_loss / num_batches
      avg_loss = total_loss / N_samples
      push!(train_loss_history, avg_loss)

      if epoch % 50 == 1 || epoch == N_epochs
          @printf "Epoch: [%3d] Avg Train Loss: %4.5f\n" epoch avg_loss
      end
  end
  return tstate, train_loss_history
end

However, I went ahead and changed how my first layer was defined in my model, and also changed how my neural network would train:

#

dev = cpu_device()

# The forward function
function loss_fn(x_sample, y_target, model, ps, st)
  y_pred, st_new = model(x_sample, ps, st)
  loss = 0.5 * mean((y_pred .- y_target).^2)
  return loss, st_new
end

# Optimisers
opt = Optimisers.Adam(0.001f0)
optimiser_state = Optimisers.setup(opt, parameters)

# Warming up the model:
selection::Int =rand(1:100)
initial_sample = dev(train_input_fns[:, selection]);
initial_target = dev(train_output_fns[:, selection]);



loss_fn(initial_sample, initial_target, model, parameters, operator_layer_states)

(l, _), back = Zygote.pullback(p -> loss_fn(initial_sample,
                                            initial_target,
                                            model,
                                            p,
                                            operator_layer_states), parameters)
back((one(l), nothing))

#

# Main training loop:

N_epochs = 1000
N_samples = 1000
train_loss_history = []

for epoch in 1:N_epochs
  total_loss = 0.0
  for iteration in 1:N_samples
    selection::Int =rand(1:1000)
    f_sample = dev(train_input_fns[:, selection]);
    Nf_sample = dev(train_output_fns[:, selection]);
    # grads, loss, stats, tstate = Lux.Training.compute_gradients(
    #   vjp_rule, loss_function, data, tstate
    # )
    # tstate = Lux.Training.apply_gradients(tstate, grads)
    (l, operator_layer_states), back = Zygote.pullback(p -> loss_fn(f_sample,
                                                                    Nf_sample,
                                                                    model,
                                                                    p,
                                                                    operator_layer_states),
                                                                    parameters)
    gs = back((one(l), nothing))[1]
    optimiser_state, parameters = Optimisers.update(optimiser_state, parameters, gs)
    total_loss += l;
  end

    # avg_loss = total_loss / num_batches
    avg_loss = total_loss / N_samples
    push!(train_loss_history, avg_loss)

    if epoch % 50 == 1 || epoch == N_epochs
        @printf "Epoch: [%3d] Avg Train Loss: %4.5f\n" epoch avg_loss
    end
end

This change ended up giving me way better results.
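As a sketch of a possible variation on this loop, the same Zygote and Optimisers calls can take a mini-batch of columns per step instead of a single column. The data here is a placeholder so the snippet runs on its own (`Y = 0.5 .* X` is a made-up target, not the operator), and `batch_loss` and `train` are hypothetical helper names.

```julia
using Lux, Optimisers, Zygote, Random, Statistics

rng = Random.default_rng()
Nx = Ny = 30
model = Lux.Chain(Lux.Dense(Nx + 1 => 64, leakyrelu),
                  Lux.Dense(64 => 64, leakyrelu),
                  Lux.Dense(64 => Ny + 1))
ps, st = Lux.setup(rng, model)

# Placeholder data: swap in the real 31 x 1000 input/output arrays
X = randn(rng, Float32, Nx + 1, 1000)
Y = 0.5f0 .* X

# Half mean-squared error over a whole batch of columns
function batch_loss(ps, st, x, y)
    y_pred, _ = model(x, ps, st)
    return 0.5f0 * mean(abs2, y_pred .- y)
end

function train(ps, st, X, Y; epochs = 20, batchsize = 50)
    opt_state = Optimisers.setup(Optimisers.Adam(0.001f0), ps)
    history = Float32[]
    for epoch in 1:epochs
        idx = randperm(rng, size(X, 2))          # reshuffle samples every epoch
        epoch_loss = 0.0f0
        nbatches = 0
        for start in 1:batchsize:length(idx)
            cols = idx[start:min(start + batchsize - 1, length(idx))]
            x, y = X[:, cols], Y[:, cols]
            l, back = Zygote.pullback(p -> batch_loss(p, st, x, y), ps)
            gs = back(one(l))[1]
            opt_state, ps = Optimisers.update(opt_state, ps, gs)
            epoch_loss += l
            nbatches += 1
        end
        push!(history, epoch_loss / nbatches)    # average loss per batch
    end
    return ps, history
end

ps, history = train(ps, st, X, Y)
```

Wrapping the loop in a function also avoids reassigning global variables inside a top-level `for` loop, which behaves differently in a script than in the REPL.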

#

Finally, to put a cap on what I'm asking: if I'm trying to train a neural network with, say,

- 2 hidden layers,
- 64 neurons in each layer,

which is trying to learn an operator mapping functions to functions by training on a discrete sample of pairs of functions (one randomly generated, the other, serving as the target, generated by the operator we want to learn),

then is this the model I want to use in Lux?

# Define the neural network model, Nx = Ny = 30
model = Lux.Chain(
    Dense(Nx+1, 64, leakyrelu),
    Dense(64, 64, leakyrelu),
    Dense(64, Ny+1)
)

#

(From vids such as Chris's tutorial linked below, it seems as if this would be correct. It's just that I've seen many other tutorials start with the first layer being 1 => num_params rather than input_data_length => num_params, while others (including Chris's tutorial below) do what seems to have fixed my situation, and I'm not sure about the difference between the two.)

For instance: although I may be discretizing them with 31 points, my operator is mapping 1D functions/curves to other 1D functions/curves. They aren't 31 dimensional. And that's what I took something like model = Chain(Dense(1 => 16, relu), Dense(16 => 1)) to mean initially.
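One way to see the difference between the two layouts, as a sketch: Lux's `Dense` treats the last array dimension as the batch, so a 1 x 31 array fed into `Dense(1 => 16)` is handled as 31 independent scalar samples rather than as one function. The names `pointwise`, `f2`, and `changed` are illustrative only.

```julia
using Lux, Random

rng = Random.default_rng()
pointwise = Lux.Chain(Lux.Dense(1 => 16, relu), Lux.Dense(16 => 1))
ps, st = Lux.setup(rng, pointwise)

f = randn(Float32, 1, 31)      # one function sampled at 31 nodes, shaped 1 x 31
g, _ = pointwise(f, ps, st)    # output is also 1 x 31

# Perturb a single node value and see which outputs move
f2 = copy(f)
f2[1, 5] += 1.0f0
g2, _ = pointwise(f2, ps, st)

# Indices where the output changed: at most node 5, never its neighbours
changed = findall(vec(abs.(g .- g2) .> 1.0f-6))
```

With this layout the output at node j can only depend on the input value at node j. That suits fitting a curve y = p(x) point by point, but an operator with an integral over x couples all 31 input values, which is what the 31 => 64 first layer allows.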

YouTube

Chris Rackauckas

What is (scientific) machine learning? An introduction through Juli...

What is machine learning? How does it work? What is a neural network? In this talk we'll introduce what neural networks are by understanding them mathematically and then implementing neural networks using the Lux.jl library in Julia.


#

(Here's my training loss history from my past implementation, as well as a couple of failed neural nets.)
