Hi, I started learning C++ about a year and a half ago, so I am still pretty new to coding in general.
A week ago, I started my biggest (and probably most unrealistic) project yet:
Build an entire transformer from scratch with no external libraries whatsoever.
I am currently at the first step, which consists of building the vocabulary of the AI with an algorithm called BPE (byte pair encoding): you repeatedly take the most frequent bigram (pair of adjacent tokens) in a big dataset and add it to the vocabulary as a new token (not sure if that was clear).
So I am reaching out to this server because I need help optimizing my tokenizer.
Only do it if you have time to spare!
REPO : github.com/Swann7777777/SwaggTransformer
FILE : tokenizer.h
OUTPUT (on my very slow laptop, times in seconds):
loadVocabularyTime : 0.0161939
buildTrieTime : 2.73e-05
loadCorpusTime : 38.7941
normalizeCorpusTime : 69.7368
tokenizeCorpusTime : 92.7972
createPairsTime : 35.1865
calculateFrequenciesTime : 22.119
clearTime : 9.44356
orderFrequenciesTime : 0.0001186
runtime : 270.079s
Btw: I welcome any advice, even about code structure and stuff!
(I am a 16-year-old French student, so sorry for the bad English)