#Large Dataset without Embeddings

12 messages · Page 1 of 1 (latest)

tacit lichen
#

So I have created a number of small open-ai sample use cases for my company. But the main issue I am running into handling large datasets without using embeddings which seem inadequate. I have tried embeddings but when a question requires divergent information across the dataset I really want all the information loaded.

The demo yesterday used the tax code which is very close to what I am looking for. Embeddings would not solve that problem as the entire solution set needs to be sent.

So my question is how are people handling the lack of training in the traditional sense (not fine-tuning)? I would love to be able to send 500 pages of data to a model and have it preserved for future interactions as priority data. Is there a product available that allows training of that sort?

heady bramble
#

Am also working on something similar, till now it was all prompts like 300 prompts. Not tried embeddings

#

When you say 500 pages of document , can it be converted to prompts or its more of a information pages, which gpt needs to remember ?

mellow acorn
mellow acorn
heady bramble
mellow acorn
heady bramble
tacit lichen
#

The problem with embeddings -- from what I can tell -- you have to do a manual vector search through your embeddings and send the most relevant ones with the query. That is still limited.

@heady bramble I don't think it can be turned into prompts because the data is connected. Think of the tax code example - they reference other sections in an almost recursive manner. You need large sections of the document to get to an end answer because section C references section Y which in turn references section F. If you search for relevant data and turn up section C and send that embedding it will not come to a satisfactory answer.

My current thought is that this sort of system will require a custom model. The OpenAI API just doesn't have any sort of memory.

#

@mellow acorn Maybe I misunderstand your comment - but embeddings don't store the information on the model. They simply return vectors that can be used for searching. But you still need to send the embedding information is subsequent requests. So there is no real storage -- just search assistance. Although I would be happy to be wrong.

mellow acorn
mellow acorn