#Continuous pre-training on instruct model?

paper fog
#

I want to experiment with continuous pre-training because I want to infuse the already excellent Llama3.1 instruct model with some domain-specific knowledge. Is it a terrible idea to do continuous pre-training on an instruct model? If I started with base Llama3.1, could I LoRA my way back to instruct performance as good as Meta's Llama3.1 instruct model using open-source instruct datasets?

#

For some additional detail: I want to do this because Llama3.1-8b-instruct is already really good at some data extraction tasks for me, but it makes mistakes when interpreting domain-specific terms. I want to teach it the meaning of these terms without losing its task-following and instruction-understanding capability.

edgy oak
#

Just curious, why not try RAG first, with a simple PDF document that explains the terms?

paper fog
#

I’ve tried RAG, but trying to force the model to interpret things the way we want can be problematic. Even if I explicitly say “interpret X as Y”, it will still interpret X the way it was natively trained to, except now Y is broken.

#

And it no longer follows the instructions properly, or it makes a ton of mistakes.

edgy oak
#

RAG will answer from the model's native knowledge if you don't insist on the specific context, like this:

PROMPT_TEMPLATE = """
Answer the question based on the following context:

{context}


Answer the question based on the above context: {question}
"""

So, is the new terms data continuously being updated? Is that why you're thinking about continuous pretraining?
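
A minimal sketch of the "interpret X as Y" approach discussed here: domain definitions are placed in the context ahead of the retrieved text, using the template from the message above. `GLOSSARY`, `build_prompt`, and the placeholder definitions are hypothetical names for illustration, not part of any library.

```python
# Template copied from the message above.
PROMPT_TEMPLATE = """
Answer the question based on the following context:

{context}

Answer the question based on the above context: {question}
"""

# Hypothetical glossary; real definitions would come from domain experts.
GLOSSARY = {
    "nodule": "<your team's definition of a nodule>",
    "consolidation": "<your team's definition of consolidation>",
}


def build_prompt(question: str, retrieved_chunks: list[str], glossary: dict[str, str]) -> str:
    """Put definitions of terms mentioned in the question ahead of the retrieved text."""
    q = question.lower()
    term_lines = [f"- {term}: {definition}"
                  for term, definition in glossary.items() if term in q]
    parts = []
    if term_lines:
        parts.append("Term definitions:\n" + "\n".join(term_lines))
    parts.extend(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context="\n\n".join(parts), question=question)


prompt = build_prompt(
    "Is a nodule present?",
    ["Report: 8 mm opacity in the right upper lobe."],
    GLOSSARY,
)
```

Only terms that appear in the question are injected, which keeps the context short for small-context models. Whether an 8B model actually honors the injected definitions is exactly what is disputed in this thread.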

paper fog
#

Trust me, I have explored RAG extensively. It’s not nearly as effective as fine tuning.

#

RAG will work, but only with larger models. When testing using the APIs with the biggest models, they work, but going down to the 8B parameter models, the results are poor. The model is very confidently incorrect.

#

If I could use RAG, I wouldn’t be here doing fine tuning. I started with it for the first few months of experimentation.

#

The other big thing is that RAG works like… 90% of the time. But I need it to work 99%+.

#

The other problem is that the question and the content are insanely complex. So now you’re asking a (small parameter) LLM to perform a very difficult task, and then on top of it, trying to force it to “re-learn” a bunch of medical terms on the fly.

#

Anyway, my original question was on continuous pretraining. Not sure why RAG was even brought up.

paper fog
#

Data extraction is just one thing, though. For example, I want to be able to ask the model whether a lung nodule is present in a medical report. In my field, lung nodules can be represented by more than just the word "nodule". I need it to infer when an opacity with sharp enough edges can be considered a nodule, or when an area of consolidation is dangerous enough to be examined as a potential precursor to a nodule.

#

Also I wish the context for that model was larger!

iron ibex
#

If you are planning on pretraining, always use the base model, not the instruct one. Same thing with instruct tuning: if you have enough instruct data to generalise (100k+), always train on base.

#

This will reduce catastrophic forgetting.

#

Good luck.

grim sable
paper fog
#

Gotcha. So I gather the general consensus is: absolutely do not CPT an instruct model.

paper fog
#

This fellow did continued pre-training on mistral v2-instruct for knowledge injection, with success.

#

I might just have to try both

paper fog
#

I have tried both. It seems to be fine to train llama3.1-8b-instruct! After 5 epochs at 2B tokens per epoch, it has still retained its instruction-following capacity while also absorbing a ton of medical data.
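
For a sense of scale, here is a rough sketch of the token budget implied by "5 epochs at 2B tokens per epoch". The sequence length and the number of sequences per optimizer step are assumptions chosen for illustration; the thread does not state the actual training configuration.

```python
def cpt_budget(tokens_per_epoch: int, epochs: int, seq_len: int, seqs_per_step: int):
    """Return (total tokens, tokens per optimizer step, optimizer steps)."""
    total_tokens = tokens_per_epoch * epochs
    tokens_per_step = seq_len * seqs_per_step
    # Integer ceiling division: a final partial batch still costs one step.
    steps = -(-total_tokens // tokens_per_step)
    return total_tokens, tokens_per_step, steps


# seq_len=8192 and seqs_per_step=64 are assumed values, not taken from the thread.
total, per_step, steps = cpt_budget(2_000_000_000, 5, seq_len=8192, seqs_per_step=64)
print(total, per_step, steps)  # 10000000000 524288 19074
```

Under these assumptions the run sees 10B tokens in total, in roughly 19k optimizer steps.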

austere ivy
paper fog
#

DM me if you have any questions, happy to answer, but hoping that others see this

paper fog
#

I only did 5 epochs and honestly it wasn't even a great dataset, but for one of my complex medical data extraction tasks, I'm at an F1 score of 0.86 on my test set. Qwen2.5-70B gets 0.96

#

Before training, Llama3 8B was getting closer to 0.5
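
For readers unfamiliar with the metric, a minimal sketch of how an F1 score can be computed for an extraction task, treating each report's output as a set of extracted items. The thread does not say how the F1 above was actually calculated (for example micro versus macro averaging), and the example tuples are made up.

```python
def f1_score(predicted: set, gold: set) -> float:
    """F1 = harmonic mean of precision and recall over extracted items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (finding, location) pairs for a single report.
gold = {("nodule", "RUL"), ("consolidation", "LLL"), ("effusion", "left")}
predicted = {("nodule", "RUL"), ("effusion", "left"), ("mass", "RML")}

score = f1_score(predicted, gold)  # 2 of 3 predictions correct, 2 of 3 gold items found
```

Here precision and recall are both 2/3, so F1 is also 2/3.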

iron ibex
#

You can get better results if you pretrain the base model on your data, then generate a Q&A dataset from that data, then instruction-tune the base model you just pretrained.

#

More steps, but might be better.

slim axle
#

@iron ibex - mind if i DM you about this?

iron ibex
#

Or we could make a thread.

#

So other people can learn too?

slim axle
#

Yeah sure

iron ibex
#

Just make a new post.

#

On the Support Forum.

#

I'll respond!

slim axle
#

ok will do, let me put it together

iron ibex
#

@slim axle

#

I may be unavailable for a bit after 15 mins.

#

Like an hour.

#

Just ping me in the post, and I'll reply when possible.

#

Is that okay with you?

slim axle
#

Ok no problem!

slim axle
# iron ibex You can get better results if you pretrain the base model on your data, then gen...

I think it may be easier to keep it combined here given the context, if that's OK @iron ibex:

I'm trying to do something similar (domain adapting) and I have access to TPUs, so I'd like to clarify a few points about your proposed method:

  1. When you say "pretrain the base model on your data", do you mean starting from scratch, or continuing pretraining from an existing base model like Llama 3?

  2. After pretraining, you suggest generating a Q&A dataset. Should I use the pretrained model to generate these Q&A pairs, or create them from my original dataset through some other method?

  3. Given that I have access to TPUs, would you recommend I try continuous pretraining on a model like Llama 3 with my domain-specific data first, and then use that to create Q&A pairs?

  4. How does your suggested method compare to directly continuous pretraining an instruct model, as shen did?

I'd really appreciate any insights you can provide on these points. Thanks!

iron ibex
#
  2. Use a dataset generation pipeline like distilabel for generating Q&A datasets. I am currently making a pipeline from scratch to do this easily!
#
  3. Yes, exactly.
#
  4. Why would you pretrain an instruct model 😉? Think about it: the model was pretrained, then instruct-trained. Why would you pretrain it again? It can cause catastrophic forgetting of other things. It's better to (continue to) pretrain a base model (completion model), and only then instruction-tune it with domain-specific knowledge.
#

Is that cool?

slim axle
#

Yeah, thank you!

iron ibex
#

Also, if this is for answering questions about a domain-specific topic, attach it to a RAG backend after doing everything, so it has up-to-date knowledge.

#

@slim axle

slim axle
#

Gotcha that makes sense

iron ibex
#

Good luck @slim axle.

paper fog
#

Instruction training is already done using a specific instruction format. I don’t think it’s as big of an issue as you claim it is, considering pretraining is done without using any format other than BOS and then EOS.
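
To make the "no format other than BOS and then EOS" point concrete, here is a minimal sketch of how raw documents are commonly packed for continued pre-training: each tokenized document is wrapped in BOS/EOS and the stream is cut into fixed-size blocks. The token IDs and the `pack_documents` helper are hypothetical; a real run would take the IDs from the model's tokenizer.

```python
from typing import Iterable, Iterator

BOS_ID = 1  # assumed placeholder; read the real value from your tokenizer
EOS_ID = 2  # assumed placeholder


def pack_documents(docs: Iterable[list[int]], block_size: int) -> Iterator[list[int]]:
    """Wrap each document in BOS/EOS and emit fixed-length training blocks."""
    buffer: list[int] = []
    for token_ids in docs:
        buffer.extend([BOS_ID, *token_ids, EOS_ID])
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Any trailing tokens that do not fill a whole block are dropped.


blocks = list(pack_documents([[10, 11, 12], [20, 21], [30]], block_size=4))
# blocks == [[1, 10, 11, 12], [2, 1, 20, 21], [2, 1, 30, 2]]
```

No chat template or role markers appear anywhere in these blocks, which is the contrast with instruction-tuning data being drawn in the message above.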

#

The reason I wanted to attempt this is that llama3.1-8B-instruct is already SO good at many of my tasks. It’s what I use as the foundation for fine-tuning on some of my complex tasks. The models I get wipe the floor with ChatGPT and any other API-based LLM.

#

Whatever secret sauce is in Meta's instruction dataset, I want to try to preserve it if possible.

#

Hence the attempt to do pre-training on the instruction model. And for the most part, it works!

iron ibex
#

It's not bad or anything.

#

But I have seen improvements using my methods.

#

So I was only recommending it 🙂.

#

You do what you do.

paper fog
#

Now that said, you’re definitely right that your approach is better

#

No doubt that pretraining the base model is better

#

Especially if you can generate a great instruct dataset (currently working on that!)

#

I wonder if it would make sense to use the Nous Hermes dataset as a starting point.

#

Nous just feels so… outdated these days.

iron ibex
#

Lol.

#

Thanks I guess.

iron ibex
#

I have made a fully custom dataset generation pipeline, that takes raw text in JSONL and turns it into Q&A.
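
The pipeline mentioned here is not shown in the thread, so the following is only a sketch of what a "raw text in JSONL to Q&A" step could look like. The `text` field name, the prompt wording, and the `generate` callable (standing in for whatever LLM call you use) are all assumptions.

```python
import json
from typing import Callable, Iterable, Iterator, Optional


def make_prompt(passage: str) -> str:
    """Ask for exactly one Q/A pair in a format that is easy to parse."""
    return (
        "Write one question that can be answered only from the passage below, "
        "followed by its answer.\nFormat:\nQ: <question>\nA: <answer>\n\n"
        "Passage:\n" + passage
    )


def parse_qa(completion: str) -> Optional[dict]:
    """Extract the first 'Q:' and 'A:' lines; return None if either is missing."""
    question = answer = None
    for line in completion.splitlines():
        line = line.strip()
        if line.startswith("Q:") and question is None:
            question = line[2:].strip()
        elif line.startswith("A:") and answer is None:
            answer = line[2:].strip()
    if question and answer:
        return {"question": question, "answer": answer}
    return None


def generate_qa(jsonl_lines: Iterable[str], generate: Callable[[str], str],
                text_key: str = "text") -> Iterator[dict]:
    """Turn each JSONL record's text into a Q&A record, skipping unparsable output."""
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue
        passage = json.loads(raw)[text_key]
        qa = parse_qa(generate(make_prompt(passage)))
        if qa is not None:
            yield {**qa, "source": passage}


# Stub generator used in place of a real model call.
def fake_generate(prompt: str) -> str:
    return "Q: What size is the opacity?\nA: 8 mm"


records = list(generate_qa(['{"text": "An 8 mm opacity is seen."}', ""], fake_generate))
```

Keeping the source passage on every record makes it possible to audit generated pairs against the original text before using them for instruction tuning.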

paper fog
#

What kind of Q&A do you generate? Just general question and answer?

#

Someone linked a model above that does data extraction and I liked the idea of using that to generate part of a dataset.

#

Otherwise, I’ve been building prompts to extract QA and negative QA pairs from a piece of input text.

clear vault
#

I wonder if you have any other papers that discuss what kind of data should be prepared, or how much data should be selected, to conduct CP?

iron ibex
#

Probably shouldn't use that abbreviation 😅.

#

But yea, paper looks interesting.

clear vault