# How to format unstructured data?
well i'm no expert, but i'll give you my idea
use an LLM to structure it! if all the examples are as easy as the one you gave, it should do an okay job
might wanna like, check the results before using them
Okay, I'll look into this one, thanks
But like, what should the format of this type of data be?
For the 1st case it's simple: we can give the model input and output pairs to train on
But what about the second one?
prob doesn't matter, but i guess it depends on what model you're using?
I am using Llama and getting this data from YouTube transcripts
this is gpt-3.5-turbo, which i'm sure is gonna be better than llama, and to be fair i did fail once because i wasn't specific enough with the prompt
(chatgpt also generated the source data lol)
but it SHOULD be able to do it
lol this worked with emojis, though tbf it also just knows all the capitals
Ahh I see
What are you trying to train?
I want to train a model on the YouTube transcript of a particular video
you're training a model off one video?
Ya, at least one video transcript... later I would add more videos from that channel
Problem is there is only plain text, so how do I format it in such a way that the model can be trained to answer?
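One possible way to get plain transcript text into a shape that can be trained on or searched is to cut it into overlapping chunks and store each chunk as one JSON line. This is only a minimal sketch: the function names, the field names (`video_id`, `chunk_index`, `text`) and the chunk sizes are assumptions made up for illustration, not a format that Llama or any particular tool requires.

```python
import json


def chunk_transcript(text, video_id, chunk_words=50, overlap=10):
    """Split plain transcript text into overlapping windows of words.

    All parameter names and defaults are illustrative assumptions.
    """
    if overlap >= chunk_words:
        raise ValueError("overlap must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap
    records = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        records.append({
            "video_id": video_id,          # which video the chunk came from
            "chunk_index": len(records),   # position of the chunk in the video
            "text": " ".join(window),
        })
        # Stop once this window has reached the end of the transcript.
        if start + chunk_words >= len(words):
            break
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks.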
In most cases, you don't actually need to train a model. What you would need to do is:
- First, define the solution you actually need
- Check whether there are pre-trained models you can use for it (most likely yes)
- If the pre-trained models don't have domain-specific knowledge (the transcript you posted seems pretty generic, so that's not the case here), what data is needed and for what purpose? Further pretraining of the base model, i.e. domain adaptation? Or just finetuning for specific tasks?
- Only then do you talk about the data format for that specific work
Oh I see... just fine tune for specific tasks
would a vector database be appropriate?
Not necessarily
But what is the task?
The task is: if a user asks a question, the response will be generated by this model. For that I have to fine tune it first
You're not explaining the actual outcome. Why do you need to finetune? What questions would users ask? You most likely don't need finetuning. For example, if it's just Q&A based on your channel, then you can just get all the transcripts, with metadata about which video each one comes from, put them into a vector database as @cobalt glacier suggests, and use a readily available model for question answering, retrieving context from the vector database to answer channel-specific questions
Finetuning also has risks. It's not like you just finetune on your channel's transcripts and the model suddenly becomes your channel's assistant. Finetuning on small, not very distinctive data can easily result in catastrophic forgetting, for example, where the model loses abilities it had before
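The retrieval idea above can be sketched roughly like this. To keep it runnable without extra packages, a word-count cosine similarity stands in for a real embedding model and an in-memory list stands in for the vector database, so treat it as an illustration of the flow, not a recommended setup. The function names, the record fields (`video_id`, `text`) and the prompt wording are all assumptions.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, k=2):
    """Return the k transcript chunks most similar to the question."""
    query = vectorize(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query, vectorize(chunk["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, hits):
    """Put the retrieved chunks in front of the question for a pre-trained model."""
    context = "\n".join(f"[{hit['video_id']}] {hit['text']}" for hit in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The string returned by `build_prompt` is what you would send to whichever existing model you pick, so no finetuning is involved. The questions don't need to be known in advance, because retrieval happens per question.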
See, the video is a lecture given by a professor, so the outcome depends on the questions users ask, which can't be predicted
Oh!