Continuous pre-training on instruct model? | Unsloth AI | Page 1

paper fog Sep 19, 2024, 9:20 PM

#

I want to experiment with doing some continuous pre-training as I want to infuse the already excellent Llama3.1 instruct model with some domain-specific knowledge. Is it a terrible idea to do continuous pre-training on an instruct model? If I started with base Llama3.1, is it possible for me to LoRA my way back to as good of Instruct performance as the Meta Llama3.1 instruct model using open-source instruct datasets ?

#

For some additional details, the reason I want to do this is because Llama3.1-8b-instruct is already really good at performing some data extraction tasks for me, but it makes mistakes in the interpretation of domain-specific terms. I want to teach it better meaning of these terms without losing the task-following/instruction understanding capability.

edgy oak Sep 20, 2024, 7:13 AM

#

just curious, why not try RAG first, simple pdf document with terms explanation?

paper fog Sep 20, 2024, 7:39 AM

#

I’ve tried RAG but trying to force the model to interpret things the way we want it to can be problematic. Even if I explicitly say “interpret X as Y”, it will still interpret X as it was natively trained in, except now Y is broken.

#

And it won’t follow the instructions properly anymore or makes a ton of mistakes.

edgy oak Sep 20, 2024, 7:46 AM

#

RAG answers natively if you don't insist on the specific context like this:

PROMPT_TEMPLATE = """
Answer the question based on the following context:

{context}

Answer the question based on the above context: {question}
"""

So, you the new terms data continuously have been updating? That you're thinking about continuous pretraining?

#

*That is why....

paper fog Sep 20, 2024, 11:53 AM

#

Trust me, I have explored RAG extensively. It’s not nearly as effective as fine tuning.

#

RAG will work, but only with larger models. When testing using the APIs with the biggest models, they work, but going down to the 8B parameter models, the results are poor. The model is very confidently incorrect.

#

If I could use RAG, I wouldn’t be here doing fine tuning. I started with it for the first few months of experimentation.

#

The other big thing is that RAG works like… 90% of the time. But I need it to work 99%+.

#

The other problem is that the question and the content is insanely complex. So now you’re asking a (small parameter) LLM to perform a very difficult task, and then on top of it, try to force it to “re-learn” a bunch of medical terms on the fly.

#

Anyway, my original question was on continuous pretraining. Not sure why RAG was even brought up.

grim sable Sep 20, 2024, 7:48 PM

#

paper fog The other problem is that the question and the content is insanely complex. So n...

try this https://huggingface.co/numind/NuExtract

numind/NuExtract · Hugging Face

paper fog Sep 20, 2024, 8:10 PM

#

Data Extraction is just one thing though. For example, I want to be able to ask the model if there is presence of a lung nodule in a medical report. In my field, Lung Nodules can be represented by more than just the word "nodule". I need it to infer when an opacity with sharp enough edges can be considered a nodule, or when an area of consolidation is dangerous enough to be examined as potential precursor to a nodule.

#

Also I wish the context for that model was larger!

iron ibex Sep 21, 2024, 1:27 AM

#

If you are planning on pretraining, always use the base models, not jnstruct. Same thing with Instruct tuning. If you have enough instruct data to generalise (100k>) always train on base.

#

This will reduce model catastrophic forgetting.

#

Good luck.

grim sable Sep 21, 2024, 8:28 AM

#

paper fog Data Extraction is just one thing though. For example, I want to be able to ask...

you can use the data extractor model on your documents to generate high quality instruct data that both teaches the model the meaning of the new terms, while also retaining the instruction capabilities

paper fog Sep 21, 2024, 3:06 PM

#

Gotcha. So I get the general consensus is absolutely do not CPT an instruct model.

paper fog Sep 25, 2024, 6:28 PM

#

https://www.reddit.com/r/LocalLLaMA/comments/1au1p1g/updating_base_knowledge_continued_pretraining_on/

From the LocalLLaMA community on Reddit

Explore this post and more from the LocalLLaMA community

#

This fellow continued pre-trained mistral v2-instruct for knowledge injection with success

#

I might just have to try both

paper fog Oct 1, 2024, 5:04 PM

#

I have tried both. Seems to be fine to train llama3.1-8b-instruct! After 5 epochs on 2B Tokens per epoch, I still have retained instruction following capacity while also injecting a ton of medical data.

austere ivy Oct 1, 2024, 6:09 PM

#

paper fog I have tried both. Seems to be fine to train llama3.1-8b-instruct! After 5 epo...

Nice I am thinking of doing the same. Its a great headstart for me.

paper fog Oct 1, 2024, 6:33 PM

#

DM me if you have any questions, happy to answer, but hoping that others see this

paper fog Oct 1, 2024, 9:14 PM

#

I only did 5 epochs and honestly it wasn't even a great dataset, but for one of my complex medical data extraction tasks, I'm at an F1 score of 0.86 on my test set. Qwen2.5-70B gets 0.96

#

Before training, Llama3 8B was getting closer to 0.5

iron ibex Oct 1, 2024, 10:25 PM

#

You can get better results if you pretrain the base model on your data, then generate a Q & A dataset from that dataset, then instruction tune the base model you trained before for better results.

#

More steps, but might be better.

slim axle Oct 2, 2024, 5:04 AM

#

@iron ibex - mind if i DM you about this?

iron ibex Oct 2, 2024, 5:04 AM

#

slim axle <@805550380334579772> - mind if i DM you about this?

Of course man.

#

Or we could make a thread.

#

So other people can learn too?

slim axle Oct 2, 2024, 5:04 AM

#

Yeah sure

iron ibex Oct 2, 2024, 5:05 AM

#

Just make a new post.

#

On the Support Forum.

#

I'll respond!

slim axle Oct 2, 2024, 5:06 AM

#

ok will do, let me put it together

iron ibex Oct 2, 2024, 5:07 AM

#

@slim axle

#

I may be unavailable for a bit after 15 mins.

#

Like an hour.

#

Just ping me in the post, and I'll reply when possible.

#

Is that okay with you?

slim axle Oct 2, 2024, 5:12 AM

#

Ok no problem!

slim axle Oct 2, 2024, 5:16 AM

#

iron ibex You can get better results if you pretrain the base model on your data, then gen...

I think it may be easier to keep it combined here given the context if thats ok @iron ibex :

I'm trying to do something similar (domain adapting) and I have access to TPUs, so I'd like to clarify a few points about your proposed method:

When you say "pretrain the base model on your data", do you mean starting from scratch, or continuing pretraining from an existing base model like Llama 3?
After pretraining, you suggest generating a Q&A dataset. Should I use the pretrained model to generate these Q&A pairs, or create them from my original dataset through some other method?
Given that I have access to TPUs, would you recommend I try continuous pretraining on a model like Llama 3 with my domain-specific data first, and then use that to create Q&A pairs?
How does your suggested method compare to directly continuous pretraining an instruct model, as shen did?

I'd really appreciate any insights you can provide on these points. Thanks!

iron ibex Oct 2, 2024, 5:18 AM

#

slim axle I think it may be easier to keep it combined here given the context if thats ok ...

Obviously train an existing base model, like Llama 3/3.1/3.2, Mistral, etc.

#

Use a dataset generation pipeline like distilabel for generating Q&A datasets. I am currently making a pipeline from scratch to do this easily!

#

Yes, exactly.

#

Why would you pretrain an instruct model 😉? Think about it. A model that was pretrained, then instruct trained. Why would you pretrain it again? It can cause catastrophic forgetting about other things. It's better to (cont) pretrain a base model (completion model), then instruction tune IT THEN with domain specifc knowledge.

#

Is that cool?

slim axle Oct 2, 2024, 5:22 AM

#

Yeah, thank you!

iron ibex Oct 2, 2024, 5:24 AM

#

Also, if this is for answering knowledge about a domain specifc thing, attach it to a RAG backend after doing everything, so it has upto date knowledge.

#

@slim axle

slim axle Oct 2, 2024, 5:24 AM

#

Gotcha that makes sense

slim axle Oct 2, 2024, 5:25 AM

#

iron ibex 2. Use a dataset generation pipeline like `distilabel` for generating Q&A datase...

Have you seen this? Could be useful 🙂 https://github.com/e-p-armstrong/augmentoolkit

iron ibex Oct 2, 2024, 5:26 AM

#

slim axle Have you seen this? Could be useful 🙂 https://github.com/e-p-armstrong/augmento...

Yep.

#

It's cool.

iron ibex Oct 2, 2024, 5:44 AM

#

Good luck @slim axle.

paper fog Oct 2, 2024, 5:59 AM

#

Instruction training is already done using a specific instruction format. I don’t think it’s as big of an issue as you claim it is, considering pretraining is done without using any format other than BOS and then EOS.

#

The reason I wanted to attempt this is because llama3.1-8B-instruct is already SO good at many of my tasks. It’s what I use for the foundation of fine tuning for some of my complex tasks. The models I get wipe the floor with chatGPT and any other API based LLM

#

Whatever secret sauce is in metas instructions dataset, I want to try to preserve if possible.

#

Hence the attempt to do pre-training on the instruction model. And for the most part, it works!

iron ibex Oct 2, 2024, 6:06 AM

#

It's not bad or anything.

#

But I have seen improvements using my methods.

#

So I was only recommending it 🙂.

#

You do what you do.

paper fog Oct 2, 2024, 6:12 AM

#

Now that said, you’re definitely right that your approach is better

#

No doubt that pretraining the base model is better

#

Especially if you can generate a great instruct dataset (currently working on that!)

#

I wonder if it would make sense to use the nous Hermes dataset as a starting point.

#

Nous just feels so… outdated these days.

iron ibex Oct 2, 2024, 6:47 AM

#

Lol.

#

Thanks I guess.

iron ibex Oct 2, 2024, 6:48 AM

#

paper fog Especially if you can generate a great instruct dataset (currently working on th...

I can help.

#

I have made a fully custom dataset generation pipeline, that takes raw text in JSONL and turns it into Q&A.

paper fog Oct 2, 2024, 6:50 AM

#

What kind of Q&A do you generate? Just general question and answer?

#

Someone linked a model above that does data extraction and I liked the idea of using that to generate part of a dataset.

#

Otherwise I’ve been built prompts to excise QA and negative QA pairs from a piece of input text.

iron ibex Oct 2, 2024, 7:21 AM

#

paper fog What kind of Q&A do you generate? Just general question and answer?

Yes. Multi turn.

clear vault Oct 2, 2024, 7:26 AM

#

hmmm, i'm doing some experiments on Continuous pre-training. I found an article https://huggingface.co/papers/2406.01375 about scaling law for CP

Paper page - D-CPT Law: Domain-specific Continual Pre-Training Scal...

#

i wonder if you have any other papers that discuss what kind of data should be prepared or how much data should be selected to conduct CP?

iron ibex Oct 2, 2024, 7:27 AM

#

Probably shouldn't use that abbreviation 😅.

#

But yea, paper looks interesting.

clear vault Oct 2, 2024, 7:28 AM

#

iron ibex Probably shouldn't use that abbreviation 😅.

oh sr

#Continuous pre-training on instruct model?