#Need advice for Dataset Creation for Training


winter musk
#

I have a database full of company policies and regulations. I understand the training dataset needs to be in a prompt-response format. What is a good way to create the dataset, given I want to pull some kind of summarized information related to a single policy from multiple sources present in the database? Any advice would be highly appreciated. Thanks!

stoic crystal
#

One way to create your dataset would be to create a spreadsheet with two columns: "prompt" and "response". In the "prompt" column, you could include a summary or brief description of the policy that you want to retrieve information about. In the "response" column, you could include the summarized information that you want to retrieve from your database.

To create the dataset, you could start by selecting a policy from your database, and then creating a prompt for it in the "prompt" column. Next, you could search through your database to find the relevant information about that policy, and then summarize it in the "response" column. You could repeat this process for each policy that you want to include in your dataset.
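As a rough illustration of the two-column layout described above, here is a minimal sketch that builds prompt-response records and serializes them as JSON Lines instead of a spreadsheet. The policy names, summaries, prompt wording, and function names are all made-up placeholders, not anything from the original database.

```python
import json


def build_records(policies):
    """Turn a {policy_name: summary} mapping into prompt-response records."""
    records = []
    for name, summary in policies.items():
        records.append({
            "prompt": f"Summarize the policy: {name}",  # assumed prompt wording
            "response": summary.strip(),
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


# Hypothetical policies standing in for rows pulled from the database.
policies = {
    "Remote Work": "Employees may work remotely up to three days a week with manager approval. ",
    "Expense Reimbursement": "Submit receipts within 30 days of purchase.",
}

records = build_records(policies)
jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Writing `jsonl_text` to a file gives one training example per line, which is easy to append to as more policies are processed.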

Alternatively, you could use a text annotation tool like Prodigy or Labelbox to create your dataset. These tools allow you to easily highlight and label specific pieces of text within a document, which could be helpful if you want to extract specific pieces of information from your database rather than summarizing it.

I hope this helps! Let me know if you have any other questions.

winter musk
#

@stoic crystal Awesome! Thanks for the input. However, the problem I'm dealing with is one of scale: there are 10,000+ policy documents, so writing a brief description for every policy by hand may not be practical. Is that the only way to do it, or are there better ways?

slate trench
#

You can look into using embeddings for this, @winter musk. You can embed individual pieces of the policy documents, then search for the most relevant policy snippets whenever an input question comes in
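A minimal sketch of that embed-then-search flow, under loud assumptions: `toy_embed` is only a word-count stand-in so the example runs without any API, and in practice it would be swapped for a real embedding model. The documents, chunk size, and function names are illustrative.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Split a document into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(question, index, top_k=2):
    """Return the top_k snippets most similar to the question."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Hypothetical policy documents.
documents = {
    "vacation": "Employees accrue fifteen vacation days per year.",
    "passwords": "Passwords must contain twelve characters and a symbol.",
    "expenses": "Submit receipts within thirty days of purchase.",
}

# Embed every piece once, then reuse the index for each incoming question.
index = [(piece, toy_embed(piece)) for doc in documents.values() for piece in chunk(doc)]
results = search("How many vacation days do employees get?", index, top_k=1)
print(results)
```

The index is a plain list scanned in full for each question, which is fine for a demo but is exactly the part that gets harder with many documents.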

winter musk
#

@slate trench Aren’t embeddings expensive with OpenAI? Could a workaround be to use embeddings from a different model to retrieve all the documents related to the policy in the prompt, and then have GPT summarize them?

slate trench
#

Embeddings are much cheaper IMO for this use case than fine-tuning would be, since the price per token for embeddings is close to nothing

#

and the cost would be a one-time thing as well, once the documents are properly embedded
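One way to keep the embedding cost one-time is to cache vectors by content hash, so a snippet is only ever embedded once. This is a sketch under assumptions: `EmbeddingCache` and `fake_embed` are hypothetical names, and `fake_embed` is a placeholder for whatever paid embedding call is actually used.

```python
import hashlib


class EmbeddingCache:
    """Embed each distinct text once and reuse the stored vector afterwards."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}   # content hash -> vector
        self.calls = 0    # how many times embed_fn was actually invoked

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.calls += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]


def fake_embed(text):
    """Placeholder for a real (billed) embedding call."""
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
snippets = ["Policy A text", "Policy B text", "Policy A text"]  # note the repeat
vectors = [cache.get(s) for s in snippets]
print(cache.calls)  # the repeated snippet does not trigger a second call
```

Saving `cache.store` to disk between runs would extend the same idea across restarts, and only new or changed documents would need embedding.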

#

The hard part of this approach would be effectively indexing and comparing similarity across so many embeddings

#

And you can definitely use hybrid approaches, like using embeddings on top of a fine-tuned model or using the summarization approach you mentioned
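A rough sketch of the retrieve-then-summarize variant mentioned in the thread: snippets found through embedding search are packed into one prompt and handed to a language model. Everything here is an assumption for illustration, including the prompt wording, the `max_chars` budget, and `fake_llm`, which stands in for a real GPT call.

```python
def build_summary_prompt(policy_name, snippets, max_chars=2000):
    """Assemble a summarization prompt from retrieved snippets, within a size budget."""
    lines = [f"Summarize the policy '{policy_name}' using only the excerpts below.", ""]
    used = 0
    for i, snippet in enumerate(snippets, start=1):
        if used + len(snippet) > max_chars:
            break  # stop before the prompt grows past the budget
        lines.append(f"[{i}] {snippet}")
        used += len(snippet)
    lines.append("")
    lines.append("Summary:")
    return "\n".join(lines)


def summarize(prompt, llm):
    """Send the prompt to whichever model callable is supplied."""
    return llm(prompt)


def fake_llm(prompt):
    """Placeholder for a real model call; returns a fixed string."""
    return "stub summary"


# Hypothetical snippets, as if returned by an embedding search.
retrieved = [
    "Employees accrue fifteen vacation days per year.",
    "Unused vacation days expire at the end of March.",
]

prompt = build_summary_prompt("Vacation", retrieved)
summary = summarize(prompt, fake_llm)
print(prompt)
```

The resulting (prompt, summary) pairs could also be saved as prompt-response rows, which ties this back to the original dataset question.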