#Training or fine-tuning a model on a large number of text documents

little terrace
#

Hi, I'm building a chatbot using GPT-3 for internal use within our business with the intent that employees can ask it the kind of general questions that ChatGPT can answer but also questions specific to our business. Therefore I need to train the model on our Operations Manual and training documents.

I've read the API docs on fine-tuning but this appears to only allow me to train the model with prompt-answer pairs, so I am trying to figure out if there is a way to train it on several hundred pages of unstructured text too?

Or am I going about this in the wrong way and there is another way to make the model aware of the content of our Operations Manual and training documents without "training" or "fine-tuning" it?

Thanks in advance, super new here and very very excited by all of this!!

dull aspen
#

I think you'll need to fine-tune it.

acoustic shale
#

Join the Bugout Slack dev community to connect with fellow data scientists, ML practitioners, and engineers: https://join.slack.com/t/bugout-dev/shared_invite/zt-fhepyt87-5XcJLy0iu702SO_hMFKNhQ

The GPT-3 language models have a range of impressive capabilities; however, they also have a number of performance limitations.

In this talk OpenAI res...

grand nest
#

@little terrace - any luck here? I am considering similar things

little terrace
#

Thanks for the responses. Where I've got to with this: I'm using ChatGPT to help me write some code that will use GPT-3 to parse our docs into a fine-tuning-ready format and then automatically upload them for fine-tuning. Essentially I'm turning into a very lazy dev 😂

grand nest
#

how do you plan to feed all the docs to ChatGPT?

#

Planning to use ChatGPT or GPT-3?

pale fable
#

👍

little terrace
#

I'll feed the docs to GPT-3 via the API, but I'm going to have ChatGPT assist me in writing the code to do this and then to upload the output and fine-tune the model. I think a simple data engine to automate this process is going to be super useful

grand nest
#

working on something similar now - going to try a few different approaches to splitting up the text to see what gets reliable results
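
One possible splitting approach, sketched here as an illustration only (the thread does not say which approaches were tried): cut the text into overlapping word windows so that a sentence falling on a boundary still appears whole in at least one chunk. The function name `split_into_chunks` and the default sizes are assumptions, not anything from the OpenAI tooling.

```python
def split_into_chunks(text, max_words=200, overlap=40):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny demonstration with made-up text: 10 words, windows of 4, overlap of 1.
demo_text = " ".join(f"w{i}" for i in range(10))
demo_chunks = split_into_chunks(demo_text, max_words=4, overlap=1)
print(demo_chunks)
```

Splitting on paragraphs or headings instead of fixed word counts is another option worth comparing; which one gives reliable results likely depends on the documents.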

dull aspen
#

@grand nest I would like to learn some approaches. Could you please share some here?

grand nest
#

once I get one working @dull aspen will share back

grand nest
#

test one - I extracted the text from a PDF and fed it as-is to fine_tunes.prepare_data - it worked but not great, lots of hallucinations

grand nest
#

nah - not at all

dull aspen
pearl bolt
#

"I've read the API docs on fine-tuning but this appears to only allow me to train the model with prompt-answer pairs, so I am trying to figure out if there is a way to train it on several hundred pages of unstructured text too?"
yes the data used to fine-tune has to be structured -- but this is due to what's happening under the hood. you can imagine it as more of a seed prompt direction than a wholesale retraining of the base model

dull aspen
pearl bolt
oblique copper
#

@grand nest I would love to hear how you get on

grand nest
#

Just for clarity - I am far from expert

#

But in your case @oblique copper - if you want to ask it a question about a specific block of text, I think you can do something like a one-shot with context

#

Meaning - you can ask the question and provide gpt with the context "text block" and it will attempt to answer
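
A minimal sketch of what "question plus context" could look like as a prompt. The wording of the instruction, the helper name `build_context_prompt`, and the sample text are all made up for illustration; the resulting string would then be sent to whichever completion model you use.

```python
def build_context_prompt(context, question):
    """Put the text block in front of the question so the model can read it."""
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say \"I don't know\".\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


# Made-up example data, not from any real manual.
prompt = build_context_prompt(
    "Expense reports are due on the last Friday of each month.",
    "When are expense reports due?",
)
print(prompt)
```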

oblique copper
#

hmm, I'm not sure if that will work, maybe if i preprocess stuff to get a context

#

I have a full-page document and I'm trying to pull out a set of fields, so i don't think the data I'm passing in will be local enough for it

grand nest
#

do you have example data and the fields?

oblique copper
#

it's like addresses and names in a specific context

#

and the text is a whole page legal document

grand nest
#

you sure you need gpt-3 for it?

#

sounds like you could probably just use python for it..

oblique copper
#

oh ?

grand nest
#

without seeing I can't really say

#

but if you have documents that have something like address:{address} - you can just parse that out with a script
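
For the clean case described above, a script along these lines might be enough. This is only a sketch: it assumes each field sits on its own line as `label: value`, and the labels `name` and `address` plus the sample text are invented for the example.

```python
import re

# Matches lines such as "Address: 12 Example Street" (label is case-insensitive).
FIELD_PATTERN = re.compile(
    r"^[ \t]*(name|address)[ \t]*:[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_fields(text):
    """Return a dict mapping each lowercased label to the values found for it."""
    fields = {}
    for label, value in FIELD_PATTERN.findall(text):
        fields.setdefault(label.lower(), []).append(value)
    return fields


# Made-up sample document.
sample = """Name: Jane Doe
Address: 12 Example Street, Springfield
Some unrelated clause of the agreement.
"""
print(extract_fields(sample))
```

If the fields are buried in free-flowing legal prose rather than labeled lines, a pattern like this will miss them, which is where a language model becomes more attractive.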

oblique copper
#

Its never that clean

grand nest
#

have you tried giving davinci the data and then asking it for the addresses? without any fine-tuning

oblique copper
#

It does really well, and that is what im using, but i think it would be better if i could fine tune it

#

and i have access to at least 5000 documents with the correct examples pulled out

grand nest
#

then maybe just make some prompts with the question and the expected answer
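
A rough sketch of turning labeled examples into prompt/completion pairs written as JSON Lines, the shape the GPT-3-era fine-tuning tools worked with. The separator and stop strings follow the convention of ending each prompt with a fixed marker and each completion with a fixed stop sequence; the exact strings, the helper names, and the sample record are assumptions for illustration.

```python
import json


def to_finetune_records(examples, separator="\n\n###\n\n", stop=" END"):
    """Build one prompt/completion record per labeled example."""
    records = []
    for ex in examples:
        records.append({
            # The prompt ends with a fixed separator so the model learns where it stops.
            "prompt": f"{ex['document']}\n\nQuestion: {ex['question']}{separator}",
            # The completion starts with a space and ends with a fixed stop sequence.
            "completion": f" {ex['answer']}{stop}",
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


# Made-up example; real data would come from the already-labeled documents.
examples = [{
    "document": "This agreement is made between Jane Doe and Example Ltd.",
    "question": "Who are the parties?",
    "answer": "Jane Doe; Example Ltd",
}]
records = to_finetune_records(examples)
jsonl = to_jsonl(records)
print(jsonl)
```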

#

you have a linux server or VM you can use?

oblique copper
#

and i have a way to reinforce train too

#

I can set one up

grand nest
#

the openai tools seem to work best

#

on linux

oblique copper
#

hmm, i need to have a look at them

#

Mind if I dm you from time to time ?

grand nest
#

nah I don't mind

pearl bolt
oblique copper
#

Thanks @pearl bolt

grand nest
#

had a decent amount of success with various projects over the last week if anyone has any more questions

ancient ruin
ancient ruin
grand nest
#

so based on what I have learned

#

and implemented - you don't fine tune the model to answer the questions per se

#

you build out a two-part system

#

one part to do "retrieval" and one part to answer the question

#

you will see these referred to as "open-domain question answering" in ML papers

#

"huge" might be a challenge and could get expensive - I am not totally sure.

#

in my case I am using the OpenAI embedding models to vectorize and query the documents to figure out which document and sections are most related to the question

#

Then you take that question and pass it to a model like davinci along with the context you pulled in the retrieval step, and it answers the question.

#

so first step is really figuring out which of the massive documents is going to have the answers and the section inside of it that will have the answer, and then second is prompting a model like davinci to answer the question
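
The two steps above can be sketched roughly as follows. To keep the example runnable without an API key, `toy_embed` is a deliberately crude stand-in (word counts) for a real embedding model; in practice you would replace it with calls to an embedding model and send the final prompt to a completion model. All function names and the sample sections are invented for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, sections, top_k=2):
    """Step 1: rank sections by similarity to the question, keep the best top_k."""
    q = toy_embed(question)
    ranked = sorted(sections, key=lambda s: cosine(q, toy_embed(s)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_sections):
    """Step 2: hand the retrieved sections to the answering model as context."""
    context = "\n---\n".join(context_sections)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up document sections.
sections = [
    "Refunds are processed within 14 days of the return being received.",
    "The office is closed on public holidays.",
    "New staff complete safety training in their first week.",
]
question = "How long do refunds take?"
top = retrieve(question, sections, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

With real embeddings the section vectors would normally be computed once and stored, so only the question has to be embedded at query time.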

#

the first approach I took was trying to fine-tune a model with the data by having davinci come up with questions/answers based on every section. That did not give me nearly as good results as retrieval using the embedding models did

ancient ruin
#

Thank you @grand nest! If I understand right, at a high level, what I need to do is first build a retrieval engine that, based on the prompt, ranks the documents/paragraphs that are most relevant to the prompt. Then, use davinci model with context from retrieval to answer the question. Is that right?

#

By huge, I mean probably about 15x as many documents as you, roughly 7000 documents. Lucky for me, a lot of these documents are auto-generated and follow a useful pattern

cyan tartan
#

@grand nest thanks, that's helpful. it sounds like you went with something similar to this, correct? https://github.com/openai/openai-cookbook/blob/main/examples/Semantic_text_search_using_embeddings.ipynb

if so, did you do this with a csv/dataframe like they did in the example? my use-case involves >50k docs, so trying to figure out the right approach

grand nest
#

@ancient ruin - yes that seems to be the case at a high level. My use case is not as large. for large use cases you can use vector databases

cedar coyote
#

I need some training on how to use this app

grand nest
#

What app is that...?

solid dawn
#

Hey

#

I would like to do a similar thing