#Using large set of story text to train davinci


harsh condor
#

Hello, I am trying to create a Discord bot that uses the davinci model to talk to users about a worldbuilding project I am working on. The goal is for the bot to be familiar with all the writing in the world, so that it is an "expert" and can hold natural dialogue informed by that writing. Here are my two issues:

  1. The text (around 7000 words currently) is too large to fit into the prompt of a normal davinci model
  2. The text is not in a format that is acceptable for fine-tuning (it's not a set of "prompt"/"completion" pairs, it's only a story)

This leaves me with two choices: find some other way, one I don't know of yet, to familiarize the bot with the writing, or convert the writing into a format that works for fine-tuning. I have been using ChatGPT to generate data based on the story, but it is slow and cumbersome. Is there a better way to translate my story into prompt/completion pairs, or a different method for the bot to become familiar with all the work?
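For illustration, here is a minimal sketch of one mechanical way to turn a plain story into prompt/completion records in JSONL. It assumes the pairing "one chunk of story as the prompt, the following chunk as the completion", which teaches the model to continue the story rather than to answer questions about it. The function names, the chunk size, the `\n\n###\n\n` separator and the ` END` stop marker are all illustrative choices, not anything from the thread.

```python
import json


def story_to_pairs(story: str, words_per_chunk: int = 150) -> list[dict]:
    """Split a story into consecutive word chunks and pair each chunk with the next."""
    words = story.split()
    chunks = [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]
    # Each record asks the model to continue from one chunk into the following one.
    # The separator and the " END" stop marker are illustrative conventions.
    return [
        {
            "prompt": chunks[i] + "\n\n###\n\n",
            "completion": " " + chunks[i + 1] + " END",
        }
        for i in range(len(chunks) - 1)
    ]


def to_jsonl(pairs: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(pair) for pair in pairs)


if __name__ == "__main__":
    sample = " ".join(f"word{i}" for i in range(400))
    print(to_jsonl(story_to_pairs(sample))[:200])
```

Records built this way carry the story's wording but not a question-and-answer shape, so a bot trained on them may imitate the prose more readily than it answers questions about the world.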

#

Another question: say I create 200 prompt/completion pairs and find that the model is trained well enough to converse, but then I develop something new in the story. How many additional pairs would be required to teach the model the new element? For example, if I create a character "Hoot", what additional training is necessary for the model to learn Hoot?

iron escarp
#

Couldn't you just tell the bot to reference it? Just like you can ask the chat bot to quote the information for you?

harsh condor
# iron escarp couldn't you just tell the bot to reference ...

The max for the completion model is 4000 tokens or so. If the current body of work is 7000 words, then the bot won't be able to read it all. And this is 4000 tokens total, so if I save, say, half for the reply, it can only read 2000 tokens from a direct reference. That's why I was thinking of fine-tuning: the model can take in a lot more information without it counting toward the prompt's token count.
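The arithmetic can be sketched as below. The 4000-token limit, the half reserved for the reply, and the 7000-word story come from the message above. The ratio of roughly 4/3 tokens per English word is only a common rule of thumb and is an assumption here, and the function names are hypothetical. A real tokenizer would give the exact count.

```python
def estimate_tokens(word_count: int, tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate; the 4/3 ratio is a rule-of-thumb assumption."""
    return round(word_count * tokens_per_word)


def prompt_budget(context_limit: int, reserved_for_reply: int) -> int:
    """Tokens left for the prompt once space for the reply is set aside."""
    return context_limit - reserved_for_reply


story_tokens = estimate_tokens(7000)   # estimated size of the 7000-word story
budget = prompt_budget(4000, 2000)     # tokens available for reference text
fits = story_tokens <= budget          # whether the whole story fits in one prompt

if __name__ == "__main__":
    print(f"story is about {story_tokens} tokens, budget is {budget}, fits: {fits}")
```

Under these assumptions the story is several times larger than the space available for it, which matches the conclusion in the message.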

iron escarp
#

Well, have it condense the text into sections that would be small enough, maybe? I see what you mean. Perhaps find something to summarize it for you? That's easy to do... But I imagine you wouldn't get the amount of detail you may be

#

looking for... Maybe make multiple summaries of the same thing and use multiple bots working together to divide and conquer. 😅

harsh condor
#

Yeah, I did something similar, but it's ineffective since it loses detail, and it isn't sustainable long term if I ever decide to write more story instead of code 😝

blazing salmon
#

Sorry my bad

iron escarp
# blazing salmon Sorry my bad

Are you saying that if you reference the 7000 words, it will use up your tokens just to read them? I know it uses tokens for the amount of text in an input/output... though I've kind of gotten around that. After the output runs out of room to generate any more text, just delete most of the output, leaving only the last sentence or so, and generate again. It seems to pick up right where it left off. Hope you get it figured out... there's so much that can be done with this stuff, and I don't know much myself, just trying to learn.

harsh condor
#

I don't think davinci 3 has memory like that. It will "continue" where it got cut off, but once the original input of the story is deleted it won't be able to reference details from it. So far I have a fine-tuned model trained on 200 conversations I've had with ChatGPT specifically about the world data, but it's still not great. There is a lot of inaccurate information, and it seems to go on and on asking itself questions and answering them. The next thing I'm going to try is their embeddings system: look up the context relevant to the user's prompt and feed it into the fine-tuned model's input as additional information. Should be a happy medium. https://beta.openai.com/docs/guides/embeddings/what-are-embeddings
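A rough sketch of that lookup idea follows: split the world text into passages, score each passage against the user's question, and prepend the best matches to the prompt. To keep the sketch self-contained, `embed` here is a stand-in that just counts words. In a real setup it would be replaced by a call to an embedding model. All function names, the prompt layout, and the sample passages are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_passages(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str], k: int = 2) -> str:
    """Prepend the most relevant passages to the question as context."""
    context = "\n".join(top_passages(question, passages, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    # Made-up sample passages, not taken from the actual story.
    world = [
        "Hoot is an owl who guards the northern forest.",
        "The river city trades salt and glass.",
        "Dragons sleep under the mountain.",
    ]
    print(build_prompt("Who is Hoot?", world, k=1))
```

Because only the few best-matching passages are sent with each question, the prompt stays within the token limit no matter how long the story grows, and new story material only needs to be added to the passage list rather than retrained.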

novel skiff
#

@harsh condor did you have any luck with embeddings? Were you able to overcome the limitation that you described above?

harsh condor
#

but I saw embeddings just got a big update, so maybe I'll take another look

distant bridge
#

Dakhla