Current results
10% savings in both latency and compute on Llama 3.2 3B Instruct, with output quality indistinguishable from normal generation (skipping from hidden layer 6 to hidden layer 24, out of 28 total layers excluding the embedding layer).
25-30% compute savings on Llama 3.1 8B and Mistral 7B v0.1, using an older version of the current architecture that relied on Python hooks and strictly predicted the final hidden layer state. That approach was much faster, and the output quality was acceptable, albeit lower than normal generation (skipping from hidden layer 8 to hidden layer 32, out of 32 total layers excluding the embedding layer).
Future plans
Look into training the confidence gate on the xth hidden layer state alone, so that it tries to determine from that state whether the cerebellum's prediction will be accurate. This is unlikely to work at a production level, but it is worth a try because the entire training cycle for the cerebellum can be completed within 40 minutes.
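A minimal sketch of what such a gate could look like, using only the Python standard library. Everything here is an assumption made for illustration and not the project's actual code: "accurate" is taken to mean cosine similarity of at least 0.9 between the cerebellum's predicted state and the real one, the gate is a plain logistic regression, and the names gate_labels, train_gate, and gate_fires are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gate_labels(predicted, actual, threshold=0.9):
    """Label 1 where the cerebellum's prediction was close to the real state.

    The cosine threshold is an assumed accuracy criterion.
    """
    return [1 if cosine(p, a) >= threshold else 0
            for p, a in zip(predicted, actual)]


def train_gate(hidden, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression gate on the xth hidden state with plain SGD."""
    w, b = [0.0] * len(hidden[0]), 0.0
    for _ in range(epochs):
        for h, y in zip(hidden, labels):
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * hi for wi, hi in zip(w, h)]
            b -= lr * grad
    return w, b


def gate_fires(w, b, h, min_conf=0.5):
    """True when the gate expects the cerebellum's prediction to be accurate."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) >= min_conf


# Toy 2-D data standing in for real hidden states.
hidden = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]      # xth layer state
predicted = [[1.0, 0.05], [0.95, 0.0], [1.0, 0.0], [1.0, 0.2]]  # cerebellum output
actual = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]       # real target state

labels = gate_labels(predicted, actual)
w, b = train_gate(hidden, labels)
```

The gate only ever sees the xth hidden state, so at inference time it can run before the cerebellum and decide whether the cerebellum is worth invoking at all.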
Multiple cerebellums attached at different layers
I plan to explore adding multiple cerebellums at different points, each skipping a different number of layers. The biggest roadblock here would be redundant activations of the cerebellums: on tokens where the skip is never activated, running one cerebellum is inconsequential, but running several can start to get expensive. Currently the cerebellum is around one third of the size of one hidden layer.
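The overhead concern can be put into rough numbers. The sketch below is a back-of-the-envelope model under stated assumptions, not a measurement: each cerebellum costs one third of a hidden layer, every cerebellum runs on every token, all hidden layers cost the same, and the embedding, output head, and confidence gate are ignored. The function names are hypothetical, and the figure of 18 skipped layers assumes that skipping from layer 6 to layer 24 bypasses 18 layers, which depends on how the endpoints are counted.

```python
from fractions import Fraction

# Assumed cost of one cerebellum, in units of one hidden layer.
CEREBELLUM_COST = Fraction(1, 3)


def no_skip_overhead(n_cerebellums, total_layers, cost=CEREBELLUM_COST):
    """Extra compute on a token where no skip fires, as a fraction of a full pass."""
    return n_cerebellums * cost / total_layers


def break_even_fire_rate(layers_skipped, cost=CEREBELLUM_COST):
    """Fraction of tokens on which one cerebellum must trigger a skip to pay for itself."""
    return cost / layers_skipped


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, f"{float(no_skip_overhead(n, 28)):.2%}")
    print(f"{float(break_even_fire_rate(18)):.2%}")
```

Under these assumptions, one cerebellum on a 28-layer model adds about 1.2% to a token that is not skipped, while three add about 3.6%, the cost of one full extra layer. That is why the number of cerebellums running on non-skipped tokens matters more than the size of any single one.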