# What should be removed from a dataset when training AI?


worn fern
#

Hi everyone, I’m Zabi, a newbie in AI. 😦 I have a rather silly question, sorry in advance! I honestly tried to find answers online, but opinions vary. I don’t want to spend 80 hours training a model only to realize I did everything wrong. Hoping for a kind and clear response!

My issue: I’m building a Russian-language dataset (about 200 MB so far). I also scraped over 500 Wikipedia articles using a script. BUT I’m stuck: should I keep Greek transliterations like “Philosophy (from Ancient Greek φιλοσοφία, literally — ‘love of wisdom’)” or formulas like this:
{\displaystyle {\ce {^{40}{19}K->{}{20}^{40}Ca{}+e^{-}{}+{\bar {\nu }}_{e},}}}? (I don’t even understand what this is!).

If you ask why: I’m training a Transformer model for text/story generation, but I also want to use it as a GPT-style chatbot. I’ll attach one sample .txt file (in Russian) so you can see the data.

ONCE AGAIN, sorry for such a silly question! If you recommend changing the architecture or anything else — I’d really appreciate your advice.

https://pastebin.com/4V5KPiED
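For reference, the formula quoted above is LaTeX markup left over from Wikipedia's rendered math. If the decision is to drop such spans from a story-generation corpus, one possible approach is sketched below. It assumes the residue always starts with `{\displaystyle` and has balanced braces; the function name `strip_displaystyle` and the `placeholder` parameter are hypothetical, not taken from any library. Greek-script etymologies are ordinary text and are left untouched by this sketch.

```python
import re

# Assumption: scraped math always begins with this literal marker.
MARKER = "{\\displaystyle"


def strip_displaystyle(text: str, placeholder: str = "") -> str:
    # Remove every balanced span that starts with MARKER.
    # A plain regex cannot match nested braces, so depth is counted by hand.
    out = []
    i = 0
    while True:
        start = text.find(MARKER, i)
        if start == -1:
            out.append(text[i:])  # no more formulas: keep the rest
            break
        out.append(text[i:start])  # keep the prose before the formula
        depth = 0
        j = start
        while j < len(text):
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
                if depth == 0:
                    break  # j is the brace that closes the span
            j += 1
        if j >= len(text):
            # Unbalanced braces: keep the remainder rather than guess.
            out.append(text[start:])
            break
        out.append(placeholder)
        i = j + 1
    # Collapse runs of spaces left behind by the removed spans.
    return re.sub(r"[ \t]{2,}", " ", "".join(out))


if __name__ == "__main__":
    sample = (
        r"Potassium decays: {\displaystyle {\ce {^{40}{19}K->{}{20}^{40}Ca{}"
        r"+e^{-}{}+{\bar {\nu }}_{e},}}} and so on."
    )
    print(strip_displaystyle(sample))
```

Passing something like `placeholder="[formula]"` instead of the empty string keeps a visible marker where math used to be, which may be preferable if the surrounding sentence would otherwise read as broken.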

edgy willow
#

Are you simply trying to train a bot on Russian, or are you focused on specific things, like academic texts, etc?

worn fern
edgy willow
#

Have you looked at ruDial and/or DeepPavlov? They're open-source Russian-language resources that may help you tremendously.

worn fern
edgy willow
#

That's one. I lost ruDial, apparently

#

Those are all the decent-quality Russian-language datasets I know of.

edgy willow
#

Надеюсь, я вам помог! (I hope I was able to help you!)

worn fern
# edgy willow Надеюсь, я вам помог!

Oh, no, there's no need to speak Russian with me. I understand English perfectly well! I just didn't reply because I was busy with things in the physical world. Now I'm going to check the datasets. I can't say yet whether your suggestions will be useful (I probably should have mentioned that I need a custom dataset with a large, diverse, and fairly specialized vocabulary), so if I need more help, I'll reach out to you. Anyway, thank you for taking the time to help me.

edgy willow
#

No worries. I need the practice anyway.