#General Advice for a Low-Resource Language


grand wagon

Hi, apologies if this is the wrong place---I can move this to #general if needed.

I have some knowledge of a low-resource language called Obolo. I'm interested in experimenting with fine-tuning an LLM for both general monolingual functionality in Obolo and for translation into and out of Obolo. There exists a full translation of the Bible in Obolo, and I'd like to see how much mileage I can get out of that alone (there isn't much other Obolo text in general, especially monolingual text in the standard orthography). And of course the Bible is also nice since it's been translated into so many other languages.

I'm curious what advice people have about working with this little data (about 31k sentences). Am I likely to see more success with monolingual functionality or with translation? What sort of model and setup should I use? I've heard of approaches like fine-tuning on higher-resource related languages first. Is that likely to help substantially?

A first pass with the Llama 3 8B Colab, using the prompt "Generate a verse in Obolo:" paired with each verse, got me some very repetitive and largely unintelligible sentences.
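For concreteness, a minimal sketch of that kind of pairing, extended to translation in both directions, might look like the following. The verse dictionary layout, the `build_examples` helper, the translation prompt wording, and the `EOS` placeholder are all assumptions for illustration, and the verse texts are placeholders rather than real data. Each completion ends with an end-of-sequence marker because training examples that never end with one can leave the model unsure when to stop, which is one possible source of repetitive output.

```python
from typing import Dict, List

# Placeholder only: in practice this should be the tokenizer's real
# end-of-sequence token, not a hard-coded string.
EOS = "</s>"


def build_examples(
    verses: Dict[str, Dict[str, str]], eos: str = EOS
) -> List[Dict[str, str]]:
    """Turn verse-aligned text into prompt/completion training records.

    `verses` maps a verse ID to {"obolo": ..., "english": ...}
    (a hypothetical layout chosen for this sketch).
    """
    examples = []
    for verse_id, text in verses.items():
        obolo, english = text["obolo"], text["english"]
        # Monolingual generation, as in the first pass described above.
        examples.append(
            {"prompt": "Generate a verse in Obolo:", "completion": obolo + eos}
        )
        # Translation into Obolo (prompt wording is an assumption).
        examples.append(
            {"prompt": f"Translate into Obolo: {english}", "completion": obolo + eos}
        )
        # Translation out of Obolo (prompt wording is an assumption).
        examples.append(
            {"prompt": f"Translate into English: {obolo}", "completion": english + eos}
        )
    return examples


# Placeholder texts; real verse text would be loaded from the aligned Bibles.
sample = {
    "GEN 1:1": {
        "obolo": "<Obolo text of Genesis 1:1>",
        "english": "<English text of Genesis 1:1>",
    },
}

for example in build_examples(sample):
    print(example)
```

Keeping the verse ID as the key makes it straightforward to hold out whole chapters or books for evaluation, so that the same verse never appears in both the training and test sets.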

I realize that, apart from the particular choice of language, this isn't exactly a novel problem, so I'll try to see what I can find online. One thing I'm curious about that might be hard to find an answer to on Google is, qualitatively, how good I can expect this to get right now. And in the even shorter term, how can I reduce this annoying repetitiveness?
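One way to make "repetitive" measurable is to count how often n-grams recur in a generated sample, so that changes to training or decoding can be compared. The sketch below is only an assumption about how this could be tracked: `repeated_ngram_fraction` is a hypothetical helper, whitespace tokenization is a simplification, and the values in `GENERATION_KWARGS` are illustrative rather than tuned. The two keys are decoding options accepted by the Hugging Face `generate` method.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of n-grams in `text` that repeat an earlier n-gram.

    0.0 means no n-gram occurs twice; values near 1.0 mean the text
    is mostly the same n-grams over and over.
    """
    tokens = text.split()  # crude whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence after the first counts as a repeat.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


# Illustrative decoding settings (values are assumptions, not tuned),
# intended to be passed along the lines of
# model.generate(**inputs, **GENERATION_KWARGS).
GENERATION_KWARGS = {
    "repetition_penalty": 1.2,   # down-weights tokens already generated
    "no_repeat_ngram_size": 3,   # forbids repeating any exact 3-gram
}

print(repeated_ngram_fraction("a b c a b c a b c"))
print(repeated_ngram_fraction("a b c d e"))
```

Decoding settings like these only mask repetition at generation time; if the score stays high without them, the cause is more likely in the training data or setup.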

Thank you!

full elm

Perfect place to ask questions :sloththumbs:

Data is most of the problem for any fine-tune, but even more so for low-resource languages...