#C++ Tokenizer optimization


whole gate
#

Hi, I started learning C++ about a year and a half ago, so I am still pretty new to coding in general.
A week ago, I started my biggest (and probably most unrealistic) project yet:
Build an entire transformer from scratch with no external libraries whatsoever.
I am currently at the first step, which consists of building the model's vocabulary with an algorithm called BPE (byte pair encoding): you repeatedly take the most frequent bigram (pair of adjacent tokens) in a big dataset, merge it into a single new token, and add that token to the vocabulary.
I am reaching out to this server because I need help optimizing my tokenizer.
Only do it if you have time to spare!
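
To make the BPE step described above concrete, here is a minimal sketch of a single merge iteration. It is not taken from tokenizer.h: the names countPairs, mostFrequentPair, and mergePair are hypothetical, and it uses std::map and std::string for clarity rather than speed.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Token = std::string;
using Pair = std::pair<Token, Token>;

// Count how often each pair of adjacent tokens occurs.
std::map<Pair, std::size_t> countPairs(const std::vector<Token>& tokens) {
    std::map<Pair, std::size_t> counts;
    for (std::size_t i = 0; i + 1 < tokens.size(); ++i) {
        ++counts[Pair{tokens[i], tokens[i + 1]}];
    }
    return counts;
}

// Return the most frequent pair (ties go to the first pair in map order).
// Returns a pair of empty strings if 'counts' is empty.
Pair mostFrequentPair(const std::map<Pair, std::size_t>& counts) {
    Pair best;
    std::size_t bestCount = 0;
    for (const auto& [pair, count] : counts) {
        if (count > bestCount) {
            best = pair;
            bestCount = count;
        }
    }
    return best;
}

// Replace every non-overlapping occurrence of 'pair' with the merged token.
std::vector<Token> mergePair(const std::vector<Token>& tokens, const Pair& pair) {
    std::vector<Token> merged;
    merged.reserve(tokens.size());
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i + 1 < tokens.size() && tokens[i] == pair.first && tokens[i + 1] == pair.second) {
            merged.push_back(pair.first + pair.second);
            ++i;  // skip the second half of the merged pair
        } else {
            merged.push_back(tokens[i]);
        }
    }
    return merged;
}
```

For the tokens a b a b c, the most frequent pair is (a, b), and merging it gives ab ab c. A full BPE run would repeat these three steps until the vocabulary reaches the target size.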

REPO : github.com/Swann7777777/SwaggTransformer
FILE : tokenizer.h

OUTPUT ( on my very slow laptop ) :

loadVocabularyTime : 0.0161939
buildTrieTime : 2.73e-05
loadCorpusTime : 38.7941
normalizeCorpusTime : 69.7368
tokenizeCorpusTime : 92.7972
createPairsTime : 35.1865
calculateFrequenciesTime : 22.119
clearTime : 9.44356
orderFrequenciesTime : 0.0001186
runtime : 270.079s

Btw: I'll take any advice, even about code structure and such!
(I am a 16-year-old French student, so sorry for the bad English.)

fiery citrusBOT
#

When your question is answered use !solved to mark the question as resolved.

Remember to ask specific questions, provide necessary details, and reduce your question to its simplest form. For tips on how to ask a good question use !howto ask.

pallid pollen
#

"Build an entire transformer from scratch with no external libraries whatsoever."
Could you specify what exactly you mean by "transformer"?
The term in and of itself is pretty vague, and there's also no README in your project.

Actually, that should be the most important first step: write a README, so that others know what your code is supposed to do, where it's at, how to compile it, etc.

There is no folder structure in your project whatsoever.

Providing the VS solution files is... meh, because now you've excluded everyone who's not using Windows from being able to help you. You'd be much better off looking into CMake. Setting up a simple project with CMake is really easy (literally 3 lines of CMake).

#

You have waaay too many blank lines.

You don't use namespaces

You don't avoid unnecessary copies
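
On the namespace point above, a minimal sketch of what that could look like. The namespace name swagg::tokenizer and the function splitOnSpaces are made up for illustration and are not from the repo; the nested namespace syntax needs C++17.

```cpp
#include <string>
#include <vector>

// 'swagg::tokenizer' is a hypothetical namespace name chosen for this sketch.
namespace swagg::tokenizer {

// Splits text on spaces, skipping empty pieces; a stand-in for real tokenizer code.
inline std::vector<std::string> splitOnSpaces(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (char c : text) {
        if (c == ' ') {
            if (!current.empty()) words.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

}  // namespace swagg::tokenizer
```

Callers then write swagg::tokenizer::splitOnSpaces(text), so the name cannot clash with a splitOnSpaces from another header.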

pallid pollen
#

"I have already built the 'HelloWorld' of neural networks: a handwritten digit recognizer."
By following the tutorial (to a tee), I assume?

whole gate
#

Not at all

pallid pollen
#

Alright. I don't want to claim I know a lot about LLMs. In fact, I know pretty much nothing about them. They're fucking scary, complex, mathematical beasts.
If you say you've got the background/knowledge required for it, then go for it, but I'm very sceptical considering your age.

whole gate
#

So, any obvious performance flaws in my current code?

whole gate
#

What copies?

pallid pollen
#

Whenever a function signature has a parameter without a & or &&, that parameter is taken by value, so calling the function with a named variable copies the argument.

E.g. when you have:

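#include <vector>

// Takes the vector by value: the caller's vector is copied into 'vec'.
void f(std::vector<int> vec) {
    // ... work on the private copy ...
}

// Takes the vector by lvalue reference: no elements are copied.
void g(std::vector<int>& vec) {
    // ... work directly on the caller's vector ...
}

int main() {
    std::vector<int> vec(10'000'000);  // 10 million elements
    f(vec);  // This performs a copy of all the 10 million elements into a new vector, which will then be passed to 'f'
    g(vec);  // This passes an lvalue reference to vec, which essentially is just a single pointer
}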
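
Building on the example above, two related patterns are worth sketching. When a function only reads its argument, a const reference avoids the copy and also documents that nothing is modified. When a function needs to keep the data, taking it by value and moving it into place lets the caller choose between copying and handing the data over. The names sum and Corpus are hypothetical and only serve the illustration.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Read-only access: no copy, and the function cannot modify the caller's vector.
long long sum(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}

// A "sink" that stores the data: take by value, then move into the member.
struct Corpus {
    std::vector<int> tokens;

    explicit Corpus(std::vector<int> t) : tokens(std::move(t)) {}
};
```

With this, Corpus c(vec); copies vec, while Corpus c(std::move(vec)); transfers its buffer without copying the elements. After the move, vec should not be relied on to still hold the data.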
fiery citrusBOT
#

@whole gate Has your question been resolved? If so, type !solved :)

pallid pollen
#

I have no clue what those VS flags do, but they presumably map to MSVC compiler options, and I've never used MSVC either... so yeah, I'm pretty useless regarding that question

#

Looks like you at least compile with -O2 though

#

-O3 is often (not always) faster with GCC/Clang. MSVC has no /O3, and /O2 is its highest speed setting, which is probably why it's saying O2 is maximum performance

whole gate
#

Well, anyway, thanks a lot for everything @pallid pollen, and especially for the general C++ advice:
From now on I'll use CMake, pass variables by reference, try to use namespaces, think about others and write a README, switch to GCC, and stop putting so many blank lines everywhere :)

#

!solved

fiery citrusBOT
#

Thank you and let us know if you have any more questions!

This thread is now set to auto-hide after an hour of inactivity