Hi everyone, I’m Zabi, a newbie in AI. 😦 I have a rather silly question, sorry in advance! I honestly tried to find answers online, but opinions vary. I don’t want to spend 80 hours training a model only to realize I did everything wrong. Hoping for a kind and clear response!
My issue: I’m building a Russian-language dataset (about 200 MB so far). I also scraped over 500 Wikipedia articles using a script. BUT I’m stuck: should I keep Greek transliterations like “Philosophy (from Ancient Greek φιλοσοφία, literally — ‘love of wisdom’)” or formulas like this:
{\displaystyle {\ce {^{40}{19}K->{}{20}^{40}Ca{}+e^{-}{}+{\bar {\nu }}_{e},}}}? (I don’t even understand what this is!).
If you ask why: I’m training a Transformer model for text/story generation, but I also want to use it as a GPT-style chatbot. I’ll attach one sample .txt file (in Russian) so you can see the data.
ONCE AGAIN, sorry for such a silly question! If you recommend changing the architecture or anything else — I’d really appreciate your advice.
Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.