#OpenAI will bring my friend back from the dead


raven steppe
#

I have a lot of written material that an old mentor gave me years ago, plus interviews and recorded discussions.
He was an expert on a very niche topic. I'm not sure exactly how to do this, but I want to ask him questions and get a likely answer based on what he would have said about a given topic.

I want to be able to ask a question and have the answer come predominantly from this proprietary data.

Where do I go to learn how to do this exactly?

Where would I upload the data?

What format does it need to be in? For example, if I have videos, do I need to transcribe the audio first?

Do I need to remove what other people say from the recordings (will their input contaminate the data)?

How would I make ChatGPT reference this concentrated data first (first layer) and then, if the answer is not there, direct it to a specific different expert it may have information on?

Thank you for the help

sly magnet
#

Hiii

#

Try fine-tuning. What a nice project! I was thinking about this last week or last month, and I found someone who's going to build it.

#

It's going to be costly to make it look like him.

#

For fine-tuning you should mostly use text inputs.
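A rough sketch of what "text inputs" could look like in practice, assuming the older GPT-3 style of training file where each line is a JSON object with a `prompt` and a `completion` field. The exact fields expected depend on the API version, so treat the layout as an assumption; the question and answer strings are placeholders, not real data.

```python
import json

def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs as one JSON object per line."""
    lines = []
    for prompt, completion in pairs:
        record = {"prompt": prompt.strip(), "completion": completion.strip()}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"

# Placeholder pair; real pairs would come from the cleaned transcripts.
pairs = [("Placeholder question?", "Placeholder answer from the mentor.")]
jsonl_text = to_jsonl(pairs)
print(jsonl_text)
```

Writing `jsonl_text` to a `.jsonl` file would give one training example per line.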

#

But if it's audio or video, you can transcribe it into text and clean it up.

#

Do I need to remove what other people say from the recordings (will their input contaminate the data)?

Yes
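A minimal sketch of that cleanup step, assuming the transcript is already plain text with one `Speaker: text` line per turn. The speaker labels and the `keep_speaker` helper are hypothetical; real transcripts may need a different parser.

```python
def keep_speaker(transcript, speaker):
    """Return only the lines spoken by `speaker` in a 'Name: text' transcript."""
    kept = []
    for line in transcript.splitlines():
        name, sep, text = line.partition(":")
        # Lines without a 'Name:' label are skipped in this simple version.
        if sep and name.strip().lower() == speaker.lower():
            kept.append(text.strip())
    return kept

# Placeholder transcript for illustration only.
transcript = "Mentor: First point.\nInterviewer: Why?\nMentor: Because."
mentor_lines = keep_speaker(transcript, "Mentor")
print(mentor_lines)
```

Unlabelled continuation lines are dropped here, so a transcript that wraps long turns across several lines would need extra handling.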

#

How would I make ChatGPT reference this concentrated data first (first layer) and then, if the answer is not there, direct it to a specific different expert it may have information on?

It's not available or possible currently. For fine-tuning you must use GPT-3 (latest version).

raven steppe
#

@sly magnet thank you very much. that helps

sly magnet
#

np bro, good luck on your project and share the results with me hahah

raven steppe
#

will do. my mind is going a mile a min. so much fun

sly magnet
#

haha nice

sturdy phoenix
#

Hi, fine-tuning might provide the best experience, but it is probably going to be expensive, and I personally do not know how to do it.

Another possibility is to use semantic search: search the database for passages that seem relevant and feed those parts into a prompt like "Given the information from my mentor below, what is the answer to ..."

#

Or maybe (I'm less sure whether this would work): "In the style of my mentor from the text extracts below, what would they say about ..."
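A minimal sketch of the search-then-prompt idea, under a big assumption: a real semantic search would compare embedding vectors, while this toy version scores passages by word overlap (bag-of-words cosine) so it runs with the standard library alone. The function names and sample passages are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = Counter(tokenize(question))
    return sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)[:k]

def build_prompt(question, chunks, k=2):
    """Put the best-matching chunks under the prompt wording from the message above."""
    context = "\n---\n".join(top_chunks(question, chunks, k))
    return ("Given the information from my mentor below, what is the answer to: "
            f"{question}\n\n{context}")

# Placeholder passages standing in for the mentor's material.
chunks = [
    "Soil drainage matters most for seedlings.",
    "Travel notes from the conference.",
    "Pruning is done in late winter.",
]
prompt = build_prompt("When is pruning done?", chunks, k=1)
print(prompt)
```

Swapping `cosine` over word counts for similarity over embeddings is what would turn this keyword-style match into an actual semantic one.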

#

For semantic search and ChatGPT there might be a few tutorials online. Some keywords are Semantic search, ChatGPT, Pinecone, Langchain.

If your database is large, you might need something like Pinecone, which has algorithms for efficient search. I have not used it, but I have seen it mentioned. I am not sure what Langchain does exactly, but I also see it mentioned sometimes, and the examples in its documentation might help.

#

For certain questions, you might not get a satisfactory answer with semantic search, in particular if the information needed cannot be found through keywords or semantically similar passages and instead requires a global understanding of the database as a whole.

#

With semantic search you do not need to match exact keywords, just words or sentences that are semantically similar.

sturdy phoenix
raven steppe
#

@sturdy phoenix this is helpful.

I'd want to know when the answer comes from my database or when it comes from other information sources. It would be like a multi tier system. level 1 is my database, level 2 is whatever, etc.
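One way such a tiered lookup could be sketched, assuming each tier is just a named list of passages and that a simple word-overlap score with a cutoff decides whether a tier "has" the answer. The `overlap` score, the 0.5 threshold, and the sample passages are illustrative assumptions, not a tested recipe.

```python
import re

def overlap(question, passage):
    """Fraction of the question's words that also appear in the passage."""
    q = set(re.findall(r"[a-z0-9']+", question.lower()))
    p = set(re.findall(r"[a-z0-9']+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0

def answer_source(question, tiers, threshold=0.5):
    """Walk the tiers in order and report the first one with a good enough match."""
    for level, (name, passages) in enumerate(tiers, start=1):
        best = max(passages, key=lambda p: overlap(question, p), default=None)
        if best is not None and overlap(question, best) >= threshold:
            return {"level": level, "source": name, "passage": best}
    # Nothing matched: the answer would have to come from the model's general knowledge.
    return {"level": None, "source": "general model knowledge", "passage": None}

# Placeholder tiers for illustration.
tiers = [
    ("mentor database", ["Pruning is done in late winter."]),
    ("second expert", ["Grafting works best in early spring."]),
]
print(answer_source("When is pruning done?", tiers))
print(answer_source("Is grafting best in spring?", tiers))
```

Returning the level and source name alongside the passage is what lets the final answer be labelled with where it came from.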

Let's say the file size of all the data combined might be 100gb or something. Probably not even that much.

Where would I house that data?

sturdy phoenix
#

@raven steppe I made a tiny edit to my answer just to mention that you could maybe also do it straight from your computer if it fits there, but again, I do not know much about vector databases beyond what I used them for.

raven steppe
#

@sturdy phoenix I'd rather keep it on my computer. I have very limited coding/programming knowledge, but I figure that part won't even be the biggest hurdle. Breaking this all down into bite-sized chunks will be the hurdle, I think.

sturdy phoenix
#

I wrote some code to break text into chunks, but it's not that good because it breaks sentences, and I do not remember whether it can also break words. You could maybe use langchain.text_splitter.TextSplitter from langchain, documented here: https://langchain.readthedocs.io/en/latest/reference/modules/text_splitter.html. Using langchain has the benefit that you would be using a module that seems popular for work with large language models.
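For comparison, a minimal plain-Python sketch of chunking that avoids cutting sentences or words. This is not langchain's API; `chunk_sentences` and the `max_chars` limit are hypothetical names, and the sentence split is a simple punctuation rule that will stumble on abbreviations.

```python
import re

def chunk_sentences(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars where possible."""
    # Split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "One. Two two. Three three three."
print(chunk_sentences(sample, max_chars=14))
```

A single sentence longer than `max_chars` is kept whole rather than split, so chunk sizes are a soft limit in this version.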

#

@raven steppe

#

If you still want my bad code I can copy paste it here

#

I just have not taken the time to look into langchain properly

raven steppe
#

@sturdy phoenix really appreciate your help man. Thank you

violet arrow
sturdy phoenix
#

@raven steppe There is a plugin for uploading your data from a file, but I think you have to sign up for the waiting list, and so far it seems to be mainly ChatGPT Plus users who get access. If you are willing to code, you might be interested in the open-source code of the plugin, which OpenAI has made available on GitHub:

https://github.com/openai/chatgpt-retrieval-plugin

raven steppe
sturdy phoenix
# raven steppe What type of file do you have to upload it as?

I am not sure. According to the link above the plugin uses third-party software for the semantic search. I suppose the type of file depends on those services.

Of the services mentioned in the link, the two I feel I have read about the most are Pinecone and Weaviate. I vaguely remember Qdrant being good, but I am not sure anymore. In any case, I only have a very superficial knowledge of the different services.

kind stream
#

any updates?