jagged quiver Aug 14, 2024, 4:34 AM

#

Hello Team, I have working on finetuning based six pdf documents content . Each document is having 6-7 page content and prepared the Instruction based Alpaca dataset with and finetuned with llama 3 8B model . After fine tune , model is answering the question exactly from the dataset but when I am ask the question with twisting it's answering wrongly with hallucination . What i am missing here . The finetuning code is below :

#

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "meta-llama/Meta-Llama-3-8B",
# model_name = "unsloth/Meta-Llama-3.1-8B",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
token = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
model = FastLanguageModel.get_peft_model(
model,
r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)

#

Define the Alpaca prompt format

alpaca_prompt = """ Below is an instruction that contains a question.\n\n Write a response that appropriately completes the request.\n\n

Instruction:

{}

Input:

{}

Response:

{}"""

EOS_TOKEN = "<|endoftext|>" # Replace with the actual EOS token for your tokenizer

from datasets import load_dataset

def formatting_prompts_func(examples):
instructions = examples["instruction"]
outputs = examples["response"]
inputs = examples["input"]
texts = []
for instruction, output in zip(instructions, outputs):
input_text = "" # Adjust if you have actual inputs
text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
texts.append(text)
return {"text": texts}

Format the dataset

dataset = dataset.map(formatting_prompts_func, batched=True, batch_size=8)

print(dataset)

Load your custom dataset from the JSONL file

dataset = load_dataset('json', data_files='./dataset/alpaca_instructions_dataset_4_large.json', split='train')

Print the first example to verify the dataset is loaded correctly

print(dataset)

dataset = dataset.map(formatting_prompts_func, batched=True,batch_size=8)

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

#

trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
dataset_num_proc = 2,
packing = False, # Can make training 5x faster for short sequences.
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
num_train_epochs = 100,
warmup_steps = 5,
# max_steps=None,
learning_rate = 2e-4,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
output_dir = "outputs",
),
)
trainer_stats = trainer.train()

#

#help

indigo tartan Aug 14, 2024, 5:30 AM

#

fine-tuning is not a good way to inject new knowledge into the model, what are those pdf documents about? Is it better to do RAG if you are looking to answer queries based on the documents?

jagged quiver Aug 14, 2024, 6:18 AM

#

indigo tartan fine-tuning is not a good way to inject new knowledge into the model, what are t...

We have implemented in RAG , But our requirement is finetune the knowledge base from old archive documents

jagged quiver Aug 14, 2024, 6:18 AM

#

indigo tartan fine-tuning is not a good way to inject new knowledge into the model, what are t...

Thanks

indigo tartan Aug 14, 2024, 9:03 AM

#

That interesting, are you looking to fine-tune in order to reduce hallucinations? I am quite curious why fine-tuning is being used for this purpose.

viral tree Aug 14, 2024, 10:11 AM

#

I have the same hallucination issue, and I also thought to reduce it by fine tuning. Is that an incorrect way to fix this problem?

jagged quiver Aug 14, 2024, 10:46 AM

#

indigo tartan That interesting, are you looking to fine-tune in order to reduce hallucinations...

I have done the finetuning and facing the hallucination problem except model is answering as pected only the dataset Questions

viral tree Aug 14, 2024, 12:04 PM

#

So you mean when you asking the question that was NOT provided in the training dataset it still hallucinating?

jagged quiver Aug 14, 2024, 2:11 PM

#

viral tree So you mean when you asking the question that was NOT provided in the training d...

I am trying to ask the question twisted instead of direct question from dataset . Expecting as end user point of view and all the user won't ask the same question with related topic

paper monolith Aug 14, 2024, 5:00 PM

#

I wonder what would happen if you include the text chunk next to the question and answer to make the model understand the base more

indigo tartan Aug 14, 2024, 7:42 PM

#

jagged quiver I am trying to ask the question twisted instead of direct question from datas...

What is the expected answer in this case? Is the right answer a refusal or you are expecting the model to still answer the question based on the data?

daring folio Aug 15, 2024, 5:18 AM

#

jagged quiver We have implemented in RAG , But our requirement is finetune the knowledge base ...

Yes correct - finetuning definitely reduces hallucinations. You should pair it with RAG for even better results

#

you can finetune a model specifically to make it hallucinate less or more and make it learn that way

daring folio Aug 15, 2024, 5:22 AM

#

viral tree I have the same hallucination issue, and I also thought to reduce it by fine tun...

It really depends on what is is hallucinating. the higher epochs the more deterministic answers it will be

autumn osprey Aug 15, 2024, 5:24 AM

#

You can reduce hallucinations directly by finetuning your model to only output certain outputs ie if you ask a LLM "What is 2+2?" It has 90% chance it'll say 4, but 5% it'll say "four" and small chances on 3, 2, 1, 10, 4 etc

#

so to "force" to model to make the probability to 100%, simply finetune on a dataset with "What is 2+2? It's 4" 10,000 epochs

#

and ull force the probably to go to 100%

#

the issue is the model now becomes non creative

#

which defeats the whole purpose of LLMs

#

the issue with LLMs is we select 1 token from a distribution, but rarely multiple tokens during the decoding step - instead if we output a probability distribution

#

thatll be better

viral tree Aug 15, 2024, 5:29 AM

#

daring folio It really depends on what is is hallucinating. the higher epochs the more determ...

It hallucinate about the data which we include as part of RAG.

To fine tune with RAG we simply need to create dataset where it will be a lot of examples with different data in the context, right?
I just afraid that model just learn that data instead, especially on high number of epoch.

viral tree Aug 15, 2024, 5:36 AM

#

autumn osprey which defeats the whole purpose of LLMs

That’s my another fear, that the model would just answers exactly in the same way as in train data. We need a model to use it as part of customer service (reception) so the answers should be in specific format, but various.
I spend hundred of hours trying to make it work as we need just by prompt. But it still does not work in most cases as we want to, so I am looking in the direction of fine tuning.

autumn osprey Aug 15, 2024, 7:16 AM

#

Ye you could try finetuning i guess

#Hallucination issue

Define the Alpaca prompt format

Instruction:

Input:

Response:

Format the dataset

dataset = dataset.map(formatting_prompts_func, batched=True, batch_size=8)

print(dataset)

Load your custom dataset from the JSONL file

Print the first example to verify the dataset is loaded correctly