# Deliberately make L1 sparse for small net
I think that's a fair idea to try.
Any details on how to do this? One way, I think, is to rewrite the loss function so that it takes sparsity into account, but then training steps would be immensely slower.
Training smallnet is hardly limited by the loss function, I think. Yes, I agree one can add a suitable norm on the weights to achieve this.
how much sparser would the net need to be for a nonnegligible speed improvement
idk
My proposal is actually just sigmoid(x-1) for each L1 output, or something like that.
Or something like one-sided Huber loss.
From say -0.5 upward.
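For concreteness, here is a rough sketch of what those two penalty shapes could look like. This is only an illustration, not anything from the trainer: the function names, the `delta` and `lambda` parameters, and the exact form of the "one-sided Huber" (zero at or below the threshold, quadratic for a small excess, linear beyond it) are all assumptions on my part.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical penalty 1: sigmoid(x - 1), as floated in the thread.
double sigmoid_penalty(double x) {
    return 1.0 / (1.0 + std::exp(-(x - 1.0)));
}

// Hypothetical penalty 2: a one-sided Huber term.
// Zero at or below `threshold` (the thread suggests about -0.5),
// quadratic while the excess is at most `delta`, linear after that.
// `delta` is an assumed parameter, not something from the thread.
double one_sided_huber(double x, double threshold = -0.5, double delta = 1.0) {
    assert(delta > 0.0);
    const double excess = x - threshold;
    if (excess <= 0.0)
        return 0.0;
    if (excess <= delta)
        return 0.5 * excess * excess;
    return delta * (excess - 0.5 * delta);
}

// Mean one-sided Huber penalty over a set of L1 outputs, scaled by `lambda`.
// In a real setup this would be added to the usual training loss.
double sparsity_term(const std::vector<double>& outputs, double lambda) {
    if (outputs.empty())
        return 0.0;
    double sum = 0.0;
    for (double x : outputs)
        sum += one_sided_huber(x);
    return lambda * sum / static_cast<double>(outputs.size());
}
```

Either term only costs one extra elementwise pass over the L1 outputs per batch, so it should be cheap next to the forward and backward passes.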
I first tested how much AffineTransformSparseInput speeds up smallnet inference, by keeping the sparse layer only for the big net:
using L1AffineTransform = std::conditional_t<L1 == TransformedFeatureDimensionsBig,
Layers::AffineTransformSparseInput<L1, FC_0_OUTPUTS + 1>,
Layers::AffineTransform<L1, FC_0_OUTPUTS + 1>>;
L1AffineTransform fc_0;
But interestingly:
stockfish-master: 28313813
stockfish: 28455070
It's within the error range, but the dense layer seems faster than the sparse layer.
I think that checks out, there's a decent constant overhead for the sparse processing
Yes, also I guess making small net more sparse wouldn't benefit that much overall.
Excluding sparse layer entirely:
stockfish: 23650404
~16.5% slowdown relative to master (put the other way, the sparse layer gives a +19.7% speedup). Also interesting, because this speedup is larger than what AndrovT originally suggested (~10%).
I guess bigger L1 size is the main factor for that...
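For anyone checking the two percentages: they are the same ratio read in both directions, assuming the baseline is the stockfish-master figure quoted earlier in the thread. A tiny sketch (the helper name is made up):

```cpp
#include <cassert>

// Relative change of `value` with respect to `baseline`, in percent.
double percent_change(double value, double baseline) {
    assert(baseline != 0.0);
    return (value / baseline - 1.0) * 100.0;
}

// Figures quoted in the thread.
constexpr double kMaster        = 28313813.0;  // sparse layer on the big net
constexpr double kNoSparseLayer = 23650404.0;  // dense layer everywhere

// percent_change(kNoSparseLayer, kMaster) is the slowdown from removing the sparse layer;
// percent_change(kMaster, kNoSparseLayer) is the speedup from having it.
```

The slowdown and the speedup differ in magnitude only because each is expressed relative to a different baseline.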
this figure is for L0?
What do you mean?
diff --git a/src/nnue/nnue_architecture.h b/src/nnue/nnue_architecture.h
index c020ce05..d417a69d 100644
--- a/src/nnue/nnue_architecture.h
+++ b/src/nnue/nnue_architecture.h
@@ -61,7 +61,7 @@ struct NetworkArchitecture {
static constexpr int FC_0_OUTPUTS = L2;
static constexpr int FC_1_OUTPUTS = L3;
- Layers::AffineTransformSparseInput<TransformedFeatureDimensions, FC_0_OUTPUTS + 1> fc_0;
+ Layers::AffineTransform<TransformedFeatureDimensions, FC_0_OUTPUTS + 1> fc_0;
Layers::SqrClippedReLU<FC_0_OUTPUTS + 1> ac_sqr_0;
Layers::ClippedReLU<FC_0_OUTPUTS + 1> ac_0;
Layers::AffineTransform<FC_0_OUTPUTS * 2, FC_1_OUTPUTS> fc_1;
weight permutation has also gotten better since then i think, but unsure by how much (that and bigger L1)
Okay... but a slowdown even without strength penalty?
That is weird.
how small is small net
L1=128, i.e. 32 x 32-bit blocks processed in find_nnz (big net is L1=3072)
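To make the block count concrete, here is a scalar sketch of the idea behind find_nnz. As far as I understand, the real implementation is vectorized, so treat this only as an illustration. `find_nonzero_blocks` is a made-up name, and it assumes the layer input is an array of uint8 activations whose size is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// View `input` as consecutive 32-bit blocks (4 uint8 activations each) and
// return the indices of the blocks that contain at least one nonzero byte.
std::vector<std::uint16_t> find_nonzero_blocks(const std::uint8_t* input, std::size_t size) {
    assert(size % 4 == 0);
    std::vector<std::uint16_t> nnz;
    const std::size_t blocks = size / 4;  // 128 inputs -> 32 blocks, 3072 -> 768
    for (std::size_t i = 0; i < blocks; ++i) {
        std::uint32_t block;
        std::memcpy(&block, input + 4 * i, sizeof(block));  // avoids aliasing issues
        if (block != 0)
            nnz.push_back(static_cast<std::uint16_t>(i));
    }
    return nnz;
}
```

The sparse affine layer then only needs the weight columns for the returned block indices. With just 32 blocks to scan, the fixed cost of building that index list is large compared with the multiply-accumulates it can skip.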
ah yeah
that's just too smol I think :(
maybe if we move the find_nnz calculation earlier
I was thinking of applying the same method to big net but 1) training takes forever and 2) features are already sparse enough (above 70%?)