I have a database full of company policies and regulations. I understand the training dataset needs to be in a prompt-response format. What is a good way to create the dataset, given I want to pull some kind of summarized information related to a single policy from multiple sources present in the database? Any advice would be highly appreciated. Thanks!
#Need advice for Dataset Creation for Training
One way to create your dataset would be to create a spreadsheet with two columns: "prompt" and "response". In the "prompt" column, you could include a summary or brief description of the policy that you want to retrieve information about. In the "response" column, you could include the summarized information that you want to retrieve from your database.
To create the dataset, you could start by selecting a policy from your database, and then creating a prompt for it in the "prompt" column. Next, you could search through your database to find the relevant information about that policy, and then summarize it in the "response" column. You could repeat this process for each policy that you want to include in your dataset.
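The per-policy loop described above could be scripted roughly like this. It is only a sketch: the `rows` layout (policy id, source, text), the policy ids, and the hand-written `summaries` dict are all made-up placeholders, not anything from your actual database.

```python
import json
from collections import defaultdict

# Hypothetical rows pulled from the database: (policy_id, source, text).
rows = [
    ("POL-001", "hr_handbook", "Employees accrue 1.5 vacation days per month."),
    ("POL-001", "legal_memo", "Unused vacation days expire after 18 months."),
    ("POL-002", "it_policy", "Passwords must be rotated every 90 days."),
]

def group_by_policy(rows):
    """Collect every passage that mentions the same policy, across sources."""
    grouped = defaultdict(list)
    for policy_id, source, text in rows:
        grouped[policy_id].append((source, text))
    return dict(grouped)

def build_examples(grouped, summaries):
    """Pair each policy's combined sources (prompt) with its summary (response)."""
    examples = []
    for policy_id, passages in grouped.items():
        if policy_id not in summaries:
            continue  # no summary written yet, so skip this policy
        context = "\n".join(f"[{source}] {text}" for source, text in passages)
        prompt = f"Summarize policy {policy_id} using these sources:\n{context}"
        examples.append({"prompt": prompt, "response": summaries[policy_id]})
    return examples

def to_jsonl(examples):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)

# Hypothetical summary written by a reviewer for one policy.
summaries = {
    "POL-001": "Vacation accrues at 1.5 days per month and expires after 18 months."
}
examples = build_examples(group_by_policy(rows), summaries)
jsonl_text = to_jsonl(examples)
```

Each line of `jsonl_text` is one prompt-response pair, which you can write to a file or load into the two-column spreadsheet.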
Alternatively, you could use a text annotation tool like Prodigy or Labelbox to create your dataset. These tools allow you to easily highlight and label specific pieces of text within a document, which could be helpful if you want to extract specific pieces of information from your database rather than summarizing it.
I hope this helps! Let me know if you have any other questions.
@stoic crystal Awesome! Thanks for the input. However, the problem I'm dealing with is at scale: there are 10,000+ policy documents, so writing a brief description for every policy by hand might not be practical. Is that the only way to do it, or are there better ways?
You can look into using embeddings for this, @winter musk. You can embed individual pieces of the policy documents, then search for the most relevant policy snippets whenever an input question comes in.
@slate trench Aren't embeddings expensive with OpenAI? Could a workaround be to use a different model's embeddings to retrieve all the documents related to the policy in the prompt, and then have GPT summarize them?
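A rough sketch of the embed-then-search idea, to make it concrete. `toy_embed` is a made-up stand-in (a hashed bag of words) so the example runs without any API; in practice you would swap in a real embedding model. The snippets are invented too.

```python
import hashlib
import math

DIM = 256  # size of the toy vectors; real embedding models use their own size

def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        word = word.strip(".,?!")
        if not word:
            continue
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity; inputs are already unit length, so a dot product is enough."""
    return sum(x * y for x, y in zip(a, b))

def top_k(question, index, k=2):
    """Return the k snippets whose vectors are closest to the question's vector."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [snippet for snippet, _ in ranked[:k]]

# Hypothetical policy snippets; embed them once and keep the vectors around.
snippets = [
    "Employees accrue 1.5 vacation days per month.",
    "Unused vacation days expire after 18 months.",
    "Passwords must be rotated every 90 days.",
]
index = [(snippet, toy_embed(snippet)) for snippet in snippets]

best = top_k("How often must passwords be rotated?", index, k=1)
```

This brute-force scan compares the question against every stored vector, which is fine for a demo but is the part that needs a proper index once you have many thousands of chunks.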
Embeddings are much cheaper IMO for this use case than fine-tuning will be, since the price per token for embeddings is close to nothing
and the cost would be a one-time thing as well, once the documents are properly embedded
The complexity and trouble of this approach would be effectively indexing and comparing similarity across so many embeddings
And you can definitely use hybrid approaches, like using embeddings on top of a fine-tuned model or using the summarization approach you mentioned
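For the retrieve-then-summarize variant, the glue is mostly just building a prompt out of whatever snippets the search step returned. A minimal sketch, assuming you already have the snippets as a list of strings; the function name, the prompt wording, and the `max_chars` budget are all made up for illustration:

```python
def build_summary_prompt(policy_name, snippets, max_chars=2000):
    """Pack retrieved snippets into one summarization prompt, within a size budget."""
    lines = []
    used = 0
    for number, snippet in enumerate(snippets, start=1):
        entry = f"{number}. {snippet.strip()}"
        if used + len(entry) > max_chars:
            break  # stop adding excerpts once the budget is spent
        lines.append(entry)
        used += len(entry)
    excerpts = "\n".join(lines)
    return (
        f"Summarize what the following excerpts say about {policy_name}. "
        "Use only the excerpts.\n\n"
        f"Excerpts:\n{excerpts}\n\nSummary:"
    )

# Hypothetical snippets returned by the retrieval step.
prompt = build_summary_prompt(
    "the vacation policy",
    ["Employees accrue 1.5 vacation days per month.",
     "Unused vacation days expire after 18 months."],
)
```

The resulting string is what you would send to the summarizing model; the character budget is a crude stand-in for a real token limit.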