#Deliberately make L1 sparse for small net.


steep skiff
#

An intriguing idea. If we sparsify the net, we can get increased speed. Since the feature of the small net is speed, maybe what we really need is a sparse one, not necessarily the smallest one.

junior edge
#

I think that's a fair idea to try.

pastel condor
#

Any details on how to do this? One way, I think, is to rewrite the loss function so that it accounts for sparsity, but then training steps would be immensely slower.

junior edge
#

smallnet is hardly limited by the loss function, I think. Yes, I agree one can add a suitable norm on the weights to achieve this, I think.

vivid flame
#

how much sparser would the net need to be for a nonnegligible speed improvement

junior edge
#

idk

steep skiff
#

My proposal is actually just sigmoid(x-1) for each l1 or something.

#

Or something like one-sided Huber loss.

#

From say -0.5 upward.
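One possible reading of the two penalties mentioned above, as a minimal sketch. The chat does not say exactly what `x` is or how the Huber variant is parameterized, so the `delta` transition width and both function names are assumptions made for illustration.

```python
import math

def sigmoid_penalty(x, shift=1.0):
    """Shifted logistic, i.e. sigmoid(x - shift); shift=1.0 matches sigmoid(x-1)."""
    return 1.0 / (1.0 + math.exp(-(x - shift)))

def one_sided_huber(x, start=-0.5, delta=1.0):
    """Zero at or below `start`, quadratic just above it, linear further out.

    `start=-0.5` follows "from say -0.5 upward"; `delta` is a hypothetical
    choice for where the quadratic part hands over to the linear part.
    """
    d = x - start
    if d <= 0.0:
        return 0.0
    if d <= delta:
        return 0.5 * d * d
    return delta * (d - 0.5 * delta)

if __name__ == "__main__":
    for x in (-1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
        print(x, sigmoid_penalty(x), one_sided_huber(x))
```

Both functions leave values well below the threshold nearly unpenalized and grow as the value rises, which is the property a sparsity-encouraging term needs.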

pastel condor
#

I first tested how much AffineTransformSparseInput speeds up smallnet inference, by switching only the small net to the dense layer:

    using L1AffineTransform = std::conditional_t<L1 == TransformedFeatureDimensionsBig,
                                                Layers::AffineTransformSparseInput<L1, FC_0_OUTPUTS + 1>,
                                                Layers::AffineTransform<L1, FC_0_OUTPUTS + 1>>;

    L1AffineTransform fc_0;

But interestingly:

stockfish-master: 28313813
stockfish: 28455070

It's within the margin of error, but the dense layer seems slightly faster than the sparse layer.

vivid flame
#

I think that checks out, there's a decent constant overhead for the sparse processing
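To illustrate where that constant overhead comes from, here is a simplified scalar sketch of a dense versus a sparse-input affine layer. It is not the Stockfish implementation (which is vectorized); `affine_dense` and `affine_sparse` are hypothetical names, and `find_nnz` only borrows the name used in this thread.

```python
def find_nnz(x):
    """Indices of the nonzero inputs; this scan is paid on every call."""
    return [i for i, v in enumerate(x) if v != 0]

def affine_dense(weights, bias, x):
    """out[o] = bias[o] + sum_i weights[o][i] * x[i], touching every input."""
    return [b + sum(w * v for w, v in zip(row, x)) for row, b in zip(weights, bias)]

def affine_sparse(weights, bias, x):
    """Same result, but only the columns of nonzero inputs are accumulated."""
    out = list(bias)
    for i in find_nnz(x):
        v = x[i]
        for o in range(len(out)):
            out[o] += weights[o][i] * v
    return out

if __name__ == "__main__":
    w = [[1, 2, 3], [4, 5, 6]]
    b = [1, -1]
    x = [0, 2, 0]
    print(affine_dense(w, b, x), affine_sparse(w, b, x))
```

The sparse path saves multiply-adds in proportion to the number of zero inputs, but it always pays for the nonzero scan and the indexed access, so with a small input vector the savings may not cover that fixed cost.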

pastel condor
#

Yes, also I guess making small net more sparse wouldn't benefit that much overall.

vivid flame
#

ye

#

amdahl's law
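For reference, Amdahl's law bounds the overall gain by the fraction of runtime that is actually affected. A minimal sketch follows; the fractions in the example loop are purely illustrative and are not measurements from this thread.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the runtime is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

if __name__ == "__main__":
    # Hypothetical: the layer accounts for 5%, 10% or 20% of runtime and gets 2x faster.
    for fraction in (0.05, 0.10, 0.20):
        print(fraction, amdahl_speedup(fraction, 2.0))
```

If the layer in question is only a small share of total time, even a large local speedup barely moves the overall figure.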

#

😩

pastel condor
#

Excluding sparse layer entirely:

stockfish: 23650404

~16.5% slowdown relative to master (equivalently, the sparse layer gives a +19.7% speedup). Also interesting, because this speedup is greater than what AndrovT originally suggested (~10%)
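The percentages can be reproduced from the three figures quoted in this thread. This sketch assumes they are speed figures where higher means faster and takes stockfish-master as the baseline; the variable names are made up for illustration.

```python
master = 28313813        # stockfish-master (sparse input layer)
dense_small = 28455070   # dense layer for the small net only
dense_all = 23650404     # sparse layer excluded entirely

def percent_change(new, old):
    """Relative change from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old

slowdown = -percent_change(dense_all, master)       # cost of dropping the sparse layer
speedup = percent_change(master, dense_all)         # gain the sparse layer provides
small_delta = percent_change(dense_small, master)   # small-net-only change

if __name__ == "__main__":
    print(f"slowdown {slowdown:.1f}%, speedup {speedup:.1f}%, small net only {small_delta:+.1f}%")
```

The slowdown and the speedup describe the same gap measured against different baselines, which is why the two percentages differ.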

#

I guess bigger L1 size is the main factor for that...

pastel condor
#

What do you mean?

vivid flame
#

where is this number from

#

sry I'm a little confused

pastel condor
#

    diff --git a/src/nnue/nnue_architecture.h b/src/nnue/nnue_architecture.h
    index c020ce05..d417a69d 100644
    --- a/src/nnue/nnue_architecture.h
    +++ b/src/nnue/nnue_architecture.h
    @@ -61,7 +61,7 @@ struct NetworkArchitecture {
         static constexpr int       FC_0_OUTPUTS                 = L2;
         static constexpr int       FC_1_OUTPUTS                 = L3;
     
    -    Layers::AffineTransformSparseInput<TransformedFeatureDimensions, FC_0_OUTPUTS + 1> fc_0;
    +    Layers::AffineTransform<TransformedFeatureDimensions, FC_0_OUTPUTS + 1> fc_0;
         Layers::SqrClippedReLU<FC_0_OUTPUTS + 1>                                           ac_sqr_0;
         Layers::ClippedReLU<FC_0_OUTPUTS + 1>                                              ac_0;
         Layers::AffineTransform<FC_0_OUTPUTS * 2, FC_1_OUTPUTS>                            fc_1;

vivid flame
#

yeah ok

#

that's what I thought

gritty axle
pastel condor
steep skiff
#

That is weird.

vivid flame
#

how small is small net

pastel condor
vivid flame
#

ah yeah

#

that's just too smol I think :(

#

maybe if we move the find_nnz calculation earlier

pastel condor
#

I was thinking of applying the same method to big net but 1) training takes forever and 2) features are already sparse enough (above 70%?)

vivid flame
#

how does ur method work

#

sorry I don't know anything about neural networks

pastel condor
#

Just add a smooth L0 regularization term to the loss so that training drives the transformed features towards zero

#

Mean of 1 - exp(-20.0 * |w|) where w is a feature tensor
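A minimal plain-Python sketch of that term, assuming the features arrive as a flat list of floats. In an actual trainer this would be a tensor operation; the `weight` coefficient and both function names are hypothetical, since the thread does not say how the term is scaled against the main loss.

```python
import math

def smooth_l0_penalty(features, sharpness=20.0):
    """Mean of 1 - exp(-sharpness * |w|): about 0 for zeros, about 1 for clearly nonzero values."""
    return sum(1.0 - math.exp(-sharpness * abs(w)) for w in features) / len(features)

def regularized_loss(base_loss, features, weight=0.01):
    """Base loss plus the scaled sparsity penalty; `weight` is an assumed coefficient."""
    return base_loss + weight * smooth_l0_penalty(features)

if __name__ == "__main__":
    sparse = [0.0, 0.0, 0.0, 1.0]    # one active feature
    spread = [0.25, 0.25, 0.25, 0.25]  # same total magnitude, all active
    print(smooth_l0_penalty(sparse), smooth_l0_penalty(spread))
```

Unlike an L1 norm, this penalty saturates quickly, so it roughly counts how many features are nonzero instead of how large they are. That is why the two inputs in the example, which have the same total magnitude, are penalized very differently.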

#

For visual representation: