#Choosing a model for LoRA training


open gull
#

I would like to train a LoRA for poetry writing. I have a database of 10K poems with metadata and have written short summaries and then one sentence summaries using WizardLM_Uncensored_30B.
I was planning to generate question/answer pairs of the form:
Q: "Write a poem about {one_sentence}. The poem should be written in the style of {author} in the year {year} on the themes of {themes}. The poem should be in the form of {forms}. Describe your plan for writing the poem in 100-300 words followed by the poem."
A: "The poem is about {summary}\n{title}\n{poem}"
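A minimal sketch of how Q/A pairs in this shape could be generated from poem records. The placeholder names come from the templates above; the `make_pair` helper, the record layout, and the sample values are assumptions made up for illustration.

```python
Q_TEMPLATE = (
    "Write a poem about {one_sentence}. The poem should be written in the style "
    "of {author} in the year {year} on the themes of {themes}. The poem should be "
    "in the form of {forms}. Describe your plan for writing the poem in 100-300 "
    "words followed by the poem."
)
A_TEMPLATE = "The poem is about {summary}\n{title}\n{poem}"


def make_pair(record: dict) -> dict:
    """Fill both templates from one poem record (hypothetical schema)."""
    fields = dict(record)
    # Themes and forms may be stored as lists; join them into a readable phrase.
    for key in ("themes", "forms"):
        if isinstance(fields[key], (list, tuple)):
            fields[key] = ", ".join(fields[key])
    return {
        "question": Q_TEMPLATE.format(**fields),
        "answer": A_TEMPLATE.format(**fields),
    }


# Placeholder record, not taken from the real dataset.
example = {
    "one_sentence": "a walk through a winter field",
    "author": "Example Author",
    "year": 1850,
    "themes": ["nature", "loss"],
    "forms": ["sonnet"],
    "summary": "a speaker crossing a frozen field at dusk.",
    "title": "Example Title",
    "poem": "First line of the poem\nSecond line of the poem",
}
pair = make_pair(example)
print(pair["question"])
print(pair["answer"])
```

Keeping the templates in one place makes it easier to reuse exactly the same wording at inference time.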

What should I use for a base model?
Is it a good idea to start with a small 7B model?
Can I train 30B on my 3090?
Also, is this the right format for training? Should I be using the Alpaca format?

gaunt totem
#

ie just pick a base you already like that works in the format you're wanting to work in

#

starting with a small model can be useful for testing/practicing, don't have high expectations of the results tho

#

you can train as much as you can run on your GPU. 30B on an xx90 only barely runs normally so it's a little shifty, if you minimize the settings it might work

harsh gyro
#

you'll want more VRAM to train on if you want to make 30B with even a moderately high rank, i think

#

i went OOM trying to do rank 128 on an A100 80GB with 30B 8bit

#

Had to fall back to 30b 4bit. We really need multi-GPU training in oobabooga.
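A rough back-of-the-envelope sketch of why the bit width matters so much here. It only counts the frozen base weights and treats "30B" as a nominal 30 billion parameters, which is an assumption; activations, the LoRA adapter, gradients, and optimizer state all come on top and grow with rank and sequence length.

```python
def weight_gib(n_params: float, bits: int) -> float:
    """Memory taken by the base model weights alone, in GiB."""
    return n_params * bits / 8 / 2**30


N_PARAMS = 30e9  # nominal "30B"; the real parameter count may differ

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 8-bit weights alone are already larger than the 24 GB on a 3090, while 4-bit leaves some headroom for everything else.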

open gull
#

I was planning on using QLoRA with 4-bit WizardLM_30B.

harsh gyro
#

4bit training works fine but i'd rent a server anyway. It'll cost you less than $10 and save you a lot of time. You only find out you lack memory to really finish when the LoRA is being written, which stinks.

open gull
#

It may come to that but I'd like to see what I can manage on my compute.

harsh gyro
#

I'd highly recommend using Guanaco as the base for your use case, but it's personal preference.

open gull
#

Can you explain why Guanaco?

harsh gyro
#

Metadata may not buy you much. If you look at the loader files for actual datasets, you can see exactly how it will look to the model when it's fed in. You want the input feed to match the input text when you actually run the model to the degree possible.

#

Just personal experience. Guanaco 65B is pretty remarkable for all tasks, but particularly instructions. Could be training parameters, could be dataset, could be dumb luck. This is all stochastic.

#

I just have gotten consistently excellent results from it and it's already instruct-trained.

#

Why not extend that?

open gull
#

I thought WizardLM was also an instruct model?

harsh gyro
#

It is, and you may prefer that.

#

Falcon is great too, but you can't train against it yet.

open gull
#

I will try Guanaco too and probably start with a subset of the data (like sonnets) and on a smaller model, 7B or 13B to start.

#

What do you mean by "metadata may not buy you much"? And "loader files"?

harsh gyro
#

There is no right or wrong answer here. Use your subjective judgment. Your biggest asset in training will be experience and practice.

#

Check out the innards of the training folder. I can't remember what the parsers are called, but you can see that they basically deserialize JSON into plain text.

#

If you can make a text file that is "more coherent" from the ANN's perspective than the JSON unpacked, you'll get better results.

#

User: <msg>
Bot: <msg>

User: <msg>
Bot: <msg>

with a hard cut on \n\n leads to very good chat customization, for instance, because the speaker is very clear and the beginning and end are too
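A small sketch of producing that kind of file from a list of exchanges. The `to_chat_text` helper and the sample messages are hypothetical; the only idea taken from the message above is the `User:`/`Bot:` layout with a blank line (`\n\n`) between exchanges.

```python
import re

CUT = "\n\n"  # separator between exchanges, matching the hard cut string


def to_chat_text(exchanges):
    """Render (user, bot) message pairs as plain text, one block per exchange."""
    blocks = []
    for user_msg, bot_msg in exchanges:
        # Collapse blank lines inside a message so the cut string
        # only ever appears between exchanges.
        user_msg = re.sub(r"\n{2,}", "\n", user_msg.strip())
        bot_msg = re.sub(r"\n{2,}", "\n", bot_msg.strip())
        blocks.append(f"User: {user_msg}\nBot: {bot_msg}")
    return CUT.join(blocks)


chat = [("hi there", "hello"), ("tell me\n\na joke", "no")]
text = to_chat_text(chat)
print(text)
```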

#

that did more for me than any JSON

open gull
#

Thanks, I will take a look at that. If I understand you correctly, you are saying if I can just write a good text file with my inputs, that's just as good as anything I do with the JSON file?

harsh gyro
#

yes, and i was shocked and dismayed to learn that because i wasted a lot of time fiddling with actual pytorch datasets

#

if you're slinging around extreme amounts of data, JSON and loaders are the right way to go

open gull
#

Thank you. I agree with your earlier statement that there is nothing better than trying stuff out to learn. But there's also getting advice from people who've walked this road already.

harsh gyro
#

but if you're curating it, that's even better

open gull
#

At the most, I will have about 12K poems.

harsh gyro
#

it's a meaningfully large dataset and you may find it more convenient to work with it in JSON

#

but ultimately the model wants to and will see a big fat pile of text

#

check out the loaders in the training folder and it'll make more sense

#

esp alpaca chat

#

dirt simple

open gull
#

Do you have any thoughts about using intermediate information to improve inference? I generated long summaries and plan to use that to prime the LLM to generate the full poem from the short sentence prompt. Is this better handled as two steps or in a single input, output pair?

harsh gyro
#

context is everything to a model

#

if you write:

Summary:
<description>

Themes:
<main themes>

Style:
<shakespearean sonnet|limerick|.....>

Poem:
<text>

then you can recycle that when you generate text

#

models get it

#

in that case you'd want a hard cut on \n\n\n so that each "poem unit" is self-contained
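A sketch of what building such a file might look like, assuming each poem is a dict with `summary`, `themes`, `style`, and `poem` keys (that schema and the function names are made up for the example). The same layout is reused to build the inference prompt, stopping right after `Poem:` so the model continues from there.

```python
HARD_CUT = "\n\n\n"  # separates self-contained "poem units"


def render_header(summary, themes, style):
    """Everything up to and including the 'Poem:' label."""
    return f"Summary:\n{summary}\n\nThemes:\n{themes}\n\nStyle:\n{style}\n\nPoem:\n"


def render_unit(summary, themes, style, poem):
    """One complete training unit: header followed by the poem text."""
    return render_header(summary, themes, style) + poem.strip("\n")


def build_corpus(records):
    """Join all units with the hard cut string."""
    return HARD_CUT.join(render_unit(**r) for r in records)


# Placeholder records, not real data.
records = [
    {"summary": "a storm at sea", "themes": "nature, fear",
     "style": "limerick", "poem": "line one\nline two"},
    {"summary": "a quiet morning", "themes": "peace",
     "style": "shakespearean sonnet", "poem": "line one\n\nline two"},
]
corpus = build_corpus(records)
prompt = render_header("a storm at sea", "nature, fear", "limerick")
```

Because the prompt is a prefix of a training unit, what the model sees at generation time matches what it saw during training.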

open gull
#

Sorry. Please explain "hard cut." Is this for the trainer?

#

(Give a man a fish, feed him for a day. Teach a man to fish, expect a thousand fishing related questions.)

harsh gyro
#

you'll see on the training interface a "Hard cut string" option or something like that on the text file loader

#

it's \n\n\n by default, which means the model would receive each of these as separate entities and you won't end up with longer poems getting spliced or shorter poems getting conflated

Summary:
<description>

Themes:
<main themes>

Style:
<shakespearean sonnet|limerick|.....>

Poem:
<text>

Summary:
<description>

Themes:
<main themes>

Style:
<shakespearean sonnet|limerick|.....>

Poem:
<text>

#

very very clear and clean separation for the model to train on

open gull
#

So I also have to make sure there's no triple "\n" in the dataset?

harsh gyro
#

yes
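One way to guard against that, sketched under the assumption that collapsing extra blank lines inside a poem is acceptable for the dataset. The `collapse_blank_lines` helper is hypothetical; it keeps stanza breaks (one blank line) and leading indentation, but never lets three newlines in a row survive.

```python
import re


def collapse_blank_lines(poem: str) -> str:
    """Reduce any run of 3+ newlines to a single stanza break."""
    # Normalize Windows and old Mac line endings first.
    text = poem.replace("\r\n", "\n").replace("\r", "\n")
    # Strip trailing whitespace so whitespace-only lines count as blank.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip("\n")


raw = "first stanza\n\n\n\n   indented line\n \n\nlast line\n"
clean = collapse_blank_lines(raw)
print(repr(clean))
```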

open gull
#

Some poems have funny formatting.

#

In the spirit of a million questions, does it handle unicode characters or does everything need to be in ascii?

harsh gyro
#

for optimal results, you will, and those poetry styles will be very hard for the LLM to understand or perform anyway

#

e.e. cummings' writing style is not going to be anything an LLM can emulate any time soon

open gull
#

Emily Dickinson loves the em dash.

harsh gyro
#

unicode is fine afaik

open gull
#

I think there's some e.e. cummings in the dataset. we'll see what the LoRA can do.

#

I think it would be fun to convert shakespearean sonnets into e.e.cummings style free verse.

gaunt totem
#

also the hardest part of replicating poetry via AI is probably just going to be... the raw complexity and loading of information that is orthogonal to an LLM's awareness. Like, for example, Shakespeare was famous for iambic pentameter, which requires counting syllables and placing stresses, which... are datapoints that LLMs don't have any awareness of and have to figure out second hand.
Rhyming is also generally a problem for LLMs, as of course... they don't know what words sound like.
Deeper meanings, metaphors, etc., LLMs are shockingly capable of at times, but still no match for a mildly clever human

#

You are, in short, playing the AI game on challenge mode here

open gull
#

I am mostly using the metadata as given, but I was looking into programmatically determining the rhyme scheme to add as training data. There are some systems that use phonemes to do rhyming.
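A toy sketch of the phoneme-based idea: two lines rhyme if their last words share the same sounds from the last stressed vowel onward. The tiny pronunciation table below is hand-written for this example in ARPAbet-style notation (stress marked by a trailing digit); in practice a pronouncing dictionary such as the CMU Pronouncing Dictionary would play that role. All function names are made up.

```python
# Hand-written toy pronunciations, for illustration only.
TOY_PRONUNCIATIONS = {
    "day": ["D", "EY1"],
    "may": ["M", "EY1"],
    "night": ["N", "AY1", "T"],
    "light": ["L", "AY1", "T"],
}


def rhyme_key(word):
    """Phonemes from the last stressed vowel to the end, or None if unknown."""
    phones = TOY_PRONUNCIATIONS.get(word.lower().strip(".,;:!?\"'"))
    if phones is None:
        return None
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":  # primary or secondary stress
            return tuple(phones[i:])
    return tuple(phones)  # no stressed vowel found; use the whole word


def rhyme_scheme(lines):
    """Label each line A, B, C... by rhyme key; 'x' marks unknown words."""
    labels = {}
    scheme = []
    for line in lines:
        words = line.split()
        key = rhyme_key(words[-1]) if words else None
        if key is None:
            scheme.append("x")
            continue
        if key not in labels:
            labels[key] = chr(ord("A") + len(labels))
        scheme.append(labels[key])
    return "".join(scheme)


stanza = ["It was a bright day", "Deep in the night,",
          "In the month of May", "There came a light."]
print(rhyme_scheme(stanza))
```

This only handles exact rhymes on the final word and runs out of letters after 26 distinct rhymes, so it is a starting point rather than a full rhyme analyzer.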

gaunt totem
#

just keep your expectations limited to better than chance lol

harsh gyro
#

and let us know how different approaches fare

#

we fish, too

open gull
#

Absolutely. Thanks so much for all your help.

open gull
#

FISHING RESULTS: I've made many modifications in the last few weeks. I've been training on the poem's one-sentence summary, author, and date, then asking the model to generate a summary followed by the poem. I got some decent results, but as @gaunt totem mentioned earlier in this thread, the model does not know about sounds. Metaphors were interesting, but they usually are. Still, the results fit the training layout, and author names seemed to have a reasonable effect on the output. However, rhyme scheme, alliteration, and poetic sound were missing. So I converted each poem into phonemes using Charsiu. I then trained on an output of "summary, sounds, poem." I had some bugs in the sounds (stupid find & replace), and the distance between each word and its sound was also the length of the poem. I am presently running a version where the poem and its phonemes are side-by-side in two 80-character columns. Thanks to everyone for their help, most especially @harsh gyro. I'll keep you posted.
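A sketch of one way the side-by-side layout could be produced, so each line's sounds sit on the same row as its words. The `side_by_side` helper is hypothetical, and the phoneme strings are made-up placeholders, not output from any real grapheme-to-phoneme tool.

```python
def side_by_side(poem_lines, phoneme_lines, width=80):
    """Put each poem line in a fixed-width left column, phonemes on the right."""
    if len(poem_lines) != len(phoneme_lines):
        raise ValueError("poem and phoneme line counts differ")
    rows = []
    for text, phones in zip(poem_lines, phoneme_lines):
        # Fail loudly instead of silently truncating an over-long line.
        if len(text) > width or len(phones) > width:
            raise ValueError(f"line longer than {width} characters: {text!r}")
        # Pad only the left column; the right one needs no trailing spaces.
        rows.append(f"{text:<{width}}{phones}")
    return "\n".join(rows)


poem_lines = ["The day is done", "and night has come"]
phoneme_lines = ["DH AH D EY IH Z D AH N", "AE N D N AY T HH AE Z K AH M"]
layout = side_by_side(poem_lines, phoneme_lines)
print(layout)
```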

harsh gyro
#

Thanks for the update. I'm looking forward to hearing how things go.

Dataset quality is more important than dataset quantity, as LIMA demonstrated. Their results generalize to machine learning overall.

I appreciate the methodical approach you're taking. Might be worth documenting your experience eventually in the resources channel. A lot of other people are struggling with the same thing with less success.

#

Really creative to do phonemes like that. You might be on to something more generally for training this sort of thing.

#

Bravo.