#Model training - missing API documentation?

uneven prairie
#

I'm trying to train a model on a Telegram group's chat history, then prompt the model with questions about the conversation (the most frequently raised topics, which topics the group has consensus on, which questions are asked the most, which topics the group disagrees on, etc.), but I can't find how to train a model in the API documentation. It lets me fine-tune a model, but I don't think that's what I want. I want to train a model on a custom dataset, then use text completion to ask questions about the dataset. Help!

torn locust
#

I don't think fine-tuning is what you need. Semantic search with text embeddings might be the best approach.
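A minimal sketch of what semantic search over chat messages could look like. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model (a real setup would call an embeddings endpoint instead), and the sample messages are made up.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, messages, top_k=2):
    """Return up to top_k messages most similar to the query."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(m)), m) for m in messages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop messages that share nothing with the query.
    return [m for score, m in scored[:top_k] if score > 0]


# Made-up sample messages, not taken from any real chat log.
messages = [
    "does anyone know how staking rewards work",
    "the meetup is on friday",
    "staking rewards are paid weekly i think",
]
results = search("how do staking rewards work", messages)
```

The retrieved messages would then be pasted into a completion prompt along with the question, so the model only sees the relevant slice of the log rather than the whole thing.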

uneven prairie
#

I've read about embeddings and I don't think that's what I need either. I want to ask questions about a chat log. For instance, I want to ask questions about a given record of 100 conversations and have the model give me an understanding of what is discussed. Unlike using embeddings to pull up relevant reviews/comments in a post, I need it to understand the back-and-forth conversations between a group of people to see, for example, topics there is agreement/disagreement on, people who have been best able to answer others' questions, topics that there is a lot of interest in, topics that no one is able to answer, etc. I need to train a model on this dataset (a Telegram JSON chat log). Looking at the API, it seems like this is an up-and-coming feature. They're working on a tutorial to have the AI answer questions about local files, for instance. I think that's what I need. Instead of embeddings, I think I need to train a model on a specific dataset.

torn locust
#

On the website you can see another method, "fine-tuning". It trains a base model on many prompts paired with outputs so that the model behaves the way you want. But that is still different from storing the data in the model, so text embeddings are still the best approach we can find.
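To make the "prompts paired with outputs" idea concrete, here is a rough sketch of building a fine-tuning file as JSON Lines. The `prompt` and `completion` field names and the example pairs are assumptions for illustration; the exact format depends on the provider's fine-tuning documentation.

```python
import json

# Hypothetical training pairs: each one shows the model an input and
# the output style we want, not facts we expect it to memorize.
pairs = [
    ("Summarize: A asks about fees. B says fees are 1%.",
     "Topic: fees. B answered A's question."),
    ("Summarize: C asks when the launch is. Nobody replies.",
     "Topic: launch date. Question left unanswered."),
]


def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs, one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": p, "completion": c}) for p, c in pairs
    )


jsonl_text = to_jsonl(pairs)
```

Pairs like these teach a response format or behavior, which is why fine-tuning is different from loading a chat log into the model as a searchable record.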

uneven prairie
#

I need to train a model, neither method will work. I'm going to have 80 megabytes of data, at least, that I need to feed to the model. Current token limits won't let me do this. I basically need to feed an entire book to the model and chunking doesn't seem to fit my use case.
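A back-of-the-envelope estimate of how far 80 megabytes is from a single prompt. The numbers are assumptions, not measurements: roughly 4 characters per token is a common rule of thumb for English text, one byte per character assumes mostly ASCII, and the 4,000-token window is a hypothetical limit.

```python
# All constants below are assumptions for a rough estimate.
DATA_BYTES = 80 * 1024 * 1024   # 80 MB, assuming ~1 byte per character
CHARS_PER_TOKEN = 4             # rule-of-thumb average for English text
CONTEXT_TOKENS = 4000           # hypothetical per-request token limit

est_tokens = DATA_BYTES // CHARS_PER_TOKEN

# Integer ceiling division: how many window-sized pieces the data needs.
chunks_needed = -(-est_tokens // CONTEXT_TOKENS)

print(f"~{est_tokens:,} tokens, ~{chunks_needed:,} chunks")
```

Under these assumptions the log is on the order of 20 million tokens, so it would have to be split into several thousand pieces no matter which method is used.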