#How to format unstructured data?


zinc torrent
#

I want to train my model on data that has only one field. I mean, generally training data looks like:

{
  "input": "Hi, whats capital of India",
  "output": "Capital of india is New Delhi"
}

but my data looks like:

{
  "transcript": "capital of india is New Delhi"
}

So how do I format this data so that I can train on it?

cobalt glacier
#

well i'm no expert, but i'll give you my idea

use an LLM to structure it! if all the examples are as easy as the one you gave, it should do an okay job

#

might wanna like, check the results before using them
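
A minimal sketch of that idea, assuming the LLM is asked to reply with JSON. `build_prompt` and `parse_pairs` are hypothetical helper names, and the actual LLM call is left out on purpose since it depends on which model or API you use. The parsing step is the "check the results" part: anything malformed gets dropped instead of ending up in the training set.

```python
import json


def build_prompt(transcript: str) -> str:
    """Build an instruction asking an LLM to turn a transcript into input/output pairs."""
    return (
        "Turn the transcript below into question/answer training pairs.\n"
        'Reply with only a JSON list of objects with the keys "input" and "output".\n\n'
        f"Transcript:\n{transcript}"
    )


def parse_pairs(llm_reply: str) -> list:
    """Keep only well-formed pairs from the LLM reply; drop everything else."""
    try:
        data = json.loads(llm_reply)
    except json.JSONDecodeError:
        return []  # the reply was not valid JSON at all
    if not isinstance(data, list):
        return []
    pairs = []
    for item in data:
        if not isinstance(item, dict):
            continue
        question, answer = item.get("input"), item.get("output")
        # Both fields must be non-empty strings to count as a usable example.
        if (isinstance(question, str) and isinstance(answer, str)
                and question.strip() and answer.strip()):
            pairs.append({"input": question.strip(), "output": answer.strip()})
    return pairs
```

You would send `build_prompt(transcript)` to whatever model you have access to, pass its reply through `parse_pairs`, and still read through a sample of the pairs by hand.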

zinc torrent
#

Okay, I'll look into this one, thanks

#

But like, what should the format of this type of data be?

#

For the 1st case it's simple, we can give the model input and output to train on

#

But what about the second one?

cobalt glacier
#

prob doesn't matter, but i guess it depends on what model you're using?

zinc torrent
#

I am using Llama and getting this data from YouTube transcripts

cobalt glacier
#

this is 3.5turbo which i'm sure is gonna be better than llama, and to be fair i did fail once because i wasn't specific enough with the prompt

#

(chatgpt also generated the source data lol)

#

but it SHOULD be able to do it

zinc torrent
#

I see ... but my data is plain text

#

But thanks I'll try this

cobalt glacier
#

lol this worked with emojis, though tbf it also just knows all the capitals

zinc torrent
#

Ahh I see

cobalt glacier
#

you're training a model off one video?

zinc torrent
#

Ya, at least one video transcript... later I would add more videos from that channel

#

Problem is there's only plain text, so how do I format it so that the model can be trained to answer?

shrewd thicket
# zinc torrent I want to train a model with youtube transcript of a particular video

In most cases, you don't actually need to train a model. What you would need to do is:

  1. First, define the solution you actually need
  2. Check whether there are pre-trained models you can use for it (most likely yes)
  3. If the pre-trained models don't have domain-specific knowledge (the transcript you wrote seems pretty generic, so that's not the case here), what data is needed and for what purpose? Further pretraining of the base model, i.e. domain adaptation? Or just finetuning for specific tasks?
  4. Only then talk about the data format for that specific work
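
To make step 4 concrete for the plain-text case: if the goal were domain adaptation (further pretraining), the transcript needs no input/output pairs at all, it just gets cut into chunks of raw text. A rough sketch, assuming a JSONL file with one `"text"` field per line; that key name, the word-based chunking, and the helper names `chunk_words` and `to_jsonl` are assumptions for illustration, so check what your training script actually expects.

```python
import json


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list:
    """Split plain text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def to_jsonl(chunks) -> str:
    """Serialize chunks as JSON Lines, one {"text": ...} record per line."""
    return "\n".join(json.dumps({"text": chunk}) for chunk in chunks)
```

Finetuning for a specific task would instead need the `input`/`output` style records from the first example in this thread, which is why the purpose has to be decided before the format.
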
cobalt glacier
#

would a vector database be appropriate?

shrewd thicket
# zinc torrent Task is like if a user ask a question then the response will be generated by thi...

You're not explaining the actual outcome. Why do you need to finetune? What questions would users ask? You most likely don't need finetuning. For example, if it's just Q&A based on your channel, you can collect all the transcripts with metadata about which video each one came from, put them into a vector database as @cobalt glacier suggests, and use a readily available model for question answering, retrieving context from the vector database to answer channel-specific questions
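
A toy sketch of that retrieval flow, using only the standard library. Word counts with cosine similarity stand in for a real embedding model, and a plain list stands in for the vector database, so this only shows the shape of the idea. The names `embed`, `retrieve`, and `build_rag_prompt` and the `video`/`text` keys are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real setup would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list, k: int = 2) -> list:
    """Return the k transcript chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c["text"])), reverse=True)
    return ranked[:k]


def build_rag_prompt(question: str, hits: list) -> str:
    """Put the retrieved chunks, tagged with their source video, in front of the question."""
    context = "\n".join(f"[{hit['video']}] {hit['text']}" for hit in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The prompt from `build_rag_prompt` then goes to any off-the-shelf model, with no finetuning involved.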

#

Finetuning also has risks. It's not like you just finetune on your channel's transcripts and the model suddenly becomes your channel's assistant. Finetuning on small, not especially distinctive data often results in catastrophic forgetting, for example
