#Dynamically teleporting between layers in transformer-based models


sharp umbra
#

Current results

10% latency and compute savings on Llama 3.2 3B Instruct, with output quality indistinguishable from normal generation (skipping from hidden layer 6 to hidden layer 24, out of 28 total layers excluding embedding).

25-30% compute savings on Llama 3.1 8B and Mistral 7B v0.1 using an older version of the current architecture, which relied on Python hooks and strictly predicted the final hidden layer state. This approach was much faster, and output quality was acceptable, albeit lower than normal generation (skipping from hidden layer 8 to hidden layer 32, out of 32 total layers excluding embedding).
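For readers who want the control flow spelled out, here is a minimal, dependency-free sketch of the skip described above. It is not the actual implementation: `forward_with_skip`, `exit_layer`, `reentry_layer`, and `threshold` are hypothetical names, layers are plain Python callables instead of transformer blocks, and it assumes that "skipping from layer 6 to layer 24" means layers 7 through 23 are replaced by the small predictor (the cerebellum) and execution resumes at layer 24.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward_with_skip(
    hidden: Vector,
    layers: List[Layer],
    exit_layer: int,
    reentry_layer: int,
    cerebellum: Callable[[Vector], Vector],
    gate: Callable[[Vector], float],
    threshold: float = 0.5,
) -> Tuple[Vector, bool]:
    """Run `layers` in order, optionally jumping from exit_layer to reentry_layer.

    Layers are 1-indexed. After layer `exit_layer` has run, the gate scores the
    hidden state. If the score reaches `threshold`, the cerebellum's prediction
    stands in for layers exit_layer+1 .. reentry_layer-1 and execution resumes
    at `reentry_layer`. Setting reentry_layer to len(layers) + 1 models the
    older variant that predicts the final hidden state directly.
    """
    if not 1 <= exit_layer < reentry_layer:
        raise ValueError("need 1 <= exit_layer < reentry_layer")

    skipped = False
    index = 1
    while index <= len(layers):
        hidden = layers[index - 1](hidden)
        if index == exit_layer and gate(hidden) >= threshold:
            hidden = cerebellum(hidden)  # assumed: predicts the state entering reentry_layer
            index = reentry_layer        # jump over the layers in between
            skipped = True
        else:
            index += 1
    return hidden, skipped
```

With 28 layers, `exit_layer=6`, and `reentry_layer=24`, a token that passes the gate runs 11 layers plus the cerebellum instead of 28; a token that fails the gate runs all 28 layers and still pays for the gate check.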

Future plans

Look into training the confidence gate on the hidden state at layer x alone, so that it learns to predict from that state whether the cerebellum's prediction will be accurate. This is unlikely to work at a production level, but it is worth a try since the entire training cycle for the cerebellum can be completed within 40 minutes.
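One way to picture the gate idea above is a linear probe trained on cached hidden states. The sketch below is only an illustration under stated assumptions: "accurate" is defined here as cosine similarity between the cerebellum's predicted state and the true state clearing a cutoff (`min_cosine`, a made-up parameter), the gate is plain logistic regression, and all function names are hypothetical. The real gate may look nothing like this.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def make_gate_labels(predicted: Sequence[Vector], actual: Sequence[Vector],
                     min_cosine: float = 0.9) -> List[int]:
    """Label 1 where the cerebellum's prediction was close enough (assumed criterion)."""
    return [1 if cosine(p, a) >= min_cosine else 0 for p, a in zip(predicted, actual)]


def _sigmoid(z: float) -> float:
    # Numerically stable form for large negative z.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gate_score(weights: Vector, bias: float, state: Vector) -> float:
    """Probability, according to the probe, that skipping is safe for this state."""
    return _sigmoid(sum(w * x for w, x in zip(weights, state)) + bias)


def train_gate(states: Sequence[Vector], labels: Sequence[int],
               epochs: int = 200, lr: float = 0.1) -> Tuple[List[float], float]:
    """Fit a logistic-regression gate on layer-x hidden states with plain SGD."""
    weights = [0.0] * len(states[0])
    bias = 0.0
    for _ in range(epochs):
        for state, label in zip(states, labels):
            error = gate_score(weights, bias, state) - label
            weights = [w - lr * error * x for w, x in zip(weights, state)]
            bias -= lr * error
    return weights, bias
```

The labels only need one pass over a dataset with the cerebellum and the full model run side by side, so training the probe adds little on top of the existing training cycle.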

Multiple cerebellums attached at different layers

I plan to explore adding multiple cerebellums at different points, each skipping a different number of layers. The biggest roadblock here would be redundant cerebellum activations: on tokens where the skip is not triggered at all, running one cerebellum is inconsequential, but running several can start getting expensive. Currently the cerebellum is around one third the size of one hidden layer.
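The trade-off above can be put into a back-of-the-envelope cost model, measured in units of one hidden layer. This is a rough sketch, not a measurement: it assumes every cerebellum runs on every token, that skipped ranges do not overlap, that cost scales with size (so one cerebellum costs about 1/3 of a layer), and it ignores the embedding, the LM head, and the gate itself. The names `Cerebellum`, `relative_savings`, and `break_even_rate` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Cerebellum:
    cost_in_layers: float   # e.g. 1/3 if it is a third the size of a hidden layer
    layers_skipped: int     # layers replaced when its skip is taken
    activation_rate: float  # fraction of tokens on which the skip is taken


def relative_savings(total_layers: int, cerebellums: Sequence[Cerebellum]) -> float:
    """Fraction of per-token layer compute saved; negative means a net slowdown."""
    overhead = sum(c.cost_in_layers for c in cerebellums)  # paid on every token
    saved = sum(c.activation_rate * c.layers_skipped for c in cerebellums)
    return (saved - overhead) / total_layers


def break_even_rate(c: Cerebellum) -> float:
    """Activation rate at which a cerebellum pays for its own overhead."""
    return c.cost_in_layers / c.layers_skipped
```

Under these assumptions, a cerebellum costing 1/3 of a layer and skipping 17 layers breaks even once it fires on roughly 2% of tokens. Three such cerebellums that never fire cost about one extra layer per token, which on a 28-layer model is around a 3.6% slowdown.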

sharp umbra
#

I ran benchmarks; here are the results:


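| Metric          | Stock | Cerebellum | Status            |
|-----------------|-------|------------|-------------------|
| MMLU (General)  | 63.4% | 62.0%      | Success           |
| Marketing/Biz   | ~80%  | 86.3%      | Better than stock |
| Math (GSM8K)    | 48.0% | 12.5%      | Useless           |
| Inference Speed | 100%  | 115%       | Faster            |
| VRAM Cost       | 0 MB  | +20 MB     | Negligible        |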