I'm trying to train a model on a Telegram group's chat history, then prompt the model with questions about the conversation (the most frequently raised topics, which topics the group has consensus on, which questions are asked the most, which topics the group disagrees on, etc.), but I can't find how to train a model in the API documentation. It allows me to fine-tune a model, but I don't think that's what I want. I want to train a model on a custom dataset, then use text completion to ask questions about the dataset. Help!
#Model training - missing API documentation?
I don't think fine-tuning is what you need. Semantic search with text embeddings might be the best way.
See the embeddings section in https://platform.openai.com/docs/api-reference
You can also look at what I posted here for reference: https://discord.com/channels/974519864045756446/1078442334670307430
I've read about embeddings and I don't think that's what I need either. I want to ask questions about a chat log. For instance, I want to ask questions about a given record of 100 conversations and have the model give me an understanding of what is discussed. Unlike using embeddings to pull up relevant reviews or comments on a post, I need it to understand the back-and-forth conversations between a group of people to see, for example, topics there is agreement or disagreement on, people who have been best able to answer other people's questions, topics that there is a lot of interest in, topics that no one is able to answer, etc. I need to train a model on this dataset (a Telegram JSON chat log). Looking at the API, it seems like this is an up-and-coming feature. They're working on a tutorial to have the AI answer questions about local files, for instance. I think that's what I need. Instead of embeddings, I think I need to train a model on a specific dataset.
Since each model has a maximum token limit, you cannot send an unlimited amount of information to it at once.
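To make the token limit concrete, here is a minimal sketch of grouping chat messages into chunks that each fit under a token budget. The names `estimate_tokens` and `chunk_messages`, the 3000-token default, and the "about 4 characters per token" heuristic are all assumptions for illustration; a real tokenizer would give exact counts.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token for English text.
    # This is only a heuristic, not an exact tokenizer count.
    return max(1, len(text) // 4)


def chunk_messages(messages, max_tokens=3000):
    """Group consecutive messages into chunks within an estimated token budget.

    A single message larger than the budget still gets a chunk of its own.
    """
    chunks, current, used = [], [], 0
    for msg in messages:
        cost = estimate_tokens(msg)
        # Start a new chunk when adding this message would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(msg)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Keeping consecutive messages together means each chunk preserves some of the back-and-forth of the conversation, which matters for questions about agreement and disagreement.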
The website also describes another method, fine-tuning, which trains a base model on many prompts paired with outputs so that the model behaves the way you want. But that is still different from storing the data in the model, so text embeddings are still the best approach we can find.
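As a rough sketch of what the embedding approach looks like: each chunk of the chat log is turned into a vector, the question is turned into a vector too, and the chunks closest to the question are sent to the model as context. The `toy_embed` function below is a hypothetical stand-in (a plain word-count vector) so the example runs without any API call; in practice it would be replaced by vectors from a real embedding model. `cosine` and `top_k` are also illustrative names.

```python
import math
from collections import Counter


def toy_embed(text):
    # Hypothetical stand-in for a real embedding model: a word-count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    # Rank chunks by similarity to the query and keep the k best matches.
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]
```

The selected chunks, not the whole log, are what would go into the prompt alongside the question, which is how this approach stays under the token limit.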
I need to train a model, neither method will work. I'm going to have 80 megabytes of data, at least, that I need to feed to the model. Current token limits won't let me do this. I basically need to feed an entire book to the model and chunking doesn't seem to fit my use case.