How to format unstructured data? | Learn AI Together | Page 1

zinc torrent Sep 3, 2023, 12:50 PM

#

I want to train my model on data that has only one field. I mean, generally training data looks like:

{
  "input": "Hi, whats capital of India",
  "output": "Capital of india is New Delhi"
}

but my data looks like:

{
  "transcript": "capital of india is New Delhi"
}

So how do I format this data so that I can train on it?

cobalt glacier Sep 3, 2023, 4:21 PM

#

well i'm no expert, but i'll give you my idea

use an LLM to structure it! if all the examples are as easy as the one you gave, it should do an okay job

#

might wanna like, check the results before using them
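A minimal sketch of that idea, assuming a hypothetical `generate` callable that sends a prompt to whatever LLM you use and returns its text reply (it is not a real library function; `fake_generate` below only stands in for it so the example runs offline). The `parse_pair` step is the "check the results" part: any reply that is not valid JSON with non-empty `input` and `output` strings is rejected instead of being added to the training set.

```python
import json
from typing import Callable, Optional

# Prompt asking the model to turn one statement into a question/answer pair.
PROMPT_TEMPLATE = (
    "Rewrite the statement below as one question/answer pair.\n"
    'Reply with JSON only, in the form {{"input": "<question>", "output": "<answer>"}}.\n\n'
    "Statement: {transcript}"
)


def build_prompt(transcript: str) -> str:
    return PROMPT_TEMPLATE.format(transcript=transcript)


def parse_pair(reply: str) -> Optional[dict]:
    """Return a validated {"input", "output"} dict, or None if the reply is unusable."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    question, answer = data.get("input"), data.get("output")
    if not (isinstance(question, str) and isinstance(answer, str)):
        return None
    if not question.strip() or not answer.strip():
        return None
    return {"input": question.strip(), "output": answer.strip()}


def structure_record(record: dict, generate: Callable[[str], str]) -> Optional[dict]:
    """Convert one {"transcript": ...} record into an input/output pair via the LLM."""
    return parse_pair(generate(build_prompt(record["transcript"])))


# Hypothetical stand-in for a real model call.
def fake_generate(prompt: str) -> str:
    return '{"input": "What is the capital of India?", "output": "The capital of India is New Delhi."}'


pair = structure_record({"transcript": "capital of india is New Delhi"}, fake_generate)
```

Structural validation only catches malformed replies; whether the generated question actually matches the transcript still needs a manual spot check.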

zinc torrent Sep 3, 2023, 4:22 PM

#

Okay, I'll look into this, thanks

#

But what should the format of this type of data be?

#

For the first case it's simple: we can give the model the input and output to train on

#

But what about the second one?

cobalt glacier Sep 3, 2023, 4:30 PM

#

prob doesn't matter, but i guess it depends on what model you're using?

zinc torrent Sep 3, 2023, 4:31 PM

#

I am using Llama and getting this data from YouTube transcripts

cobalt glacier Sep 3, 2023, 4:34 PM

#

this is 3.5turbo which i'm sure is gonna be better than llama, and to be fair i did fail once because i wasn't specific enough with the prompt

#

(chatgpt also generated the source data lol)

#

but it SHOULD be able to do it

zinc torrent Sep 3, 2023, 4:36 PM

#

I see ... but my data is plain text

#

But thanks I'll try this

cobalt glacier Sep 3, 2023, 4:36 PM

#

lol this worked with emojis, though tbf it also just knows all the capitals

zinc torrent Sep 3, 2023, 4:36 PM

#

Ahh I see

shrewd thicket Sep 3, 2023, 5:36 PM

#

zinc torrent I want to train my model with a data which has only one input, I mean genrally t...

What are you trying to train?

zinc torrent Sep 3, 2023, 5:37 PM

#

shrewd thicket What are you trying to train?

I want to train a model on the YouTube transcript of a particular video

cobalt glacier Sep 3, 2023, 5:37 PM

#

you're training a model off one video?

zinc torrent Sep 3, 2023, 5:38 PM

#

Yeah, at least one video transcript... later I would add more videos from that channel

#

The problem is that there is only plain text, so how do I format it in such a way that the model can be trained to answer?

shrewd thicket Sep 3, 2023, 5:41 PM

#

zinc torrent I want to train a model with youtube transcript of a particular video

In most cases, you don't actually need to train a model. What you would need to do is:

First, define what solution you actually need
Check whether there are pre-trained models you can use for it (most likely yes)
If the pre-trained models don't have the domain-specific knowledge (the transcript you posted seems pretty generic, so that's not the case here), ask what data is needed and for what purpose: further pretraining the base model, i.e. domain adaptation, or just fine-tuning for specific tasks?
Only then do you talk about the data format for that specific work
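To make the two options concrete: further pretraining (domain adaptation) works on raw text, so a plain transcript can be chunked as-is, while task fine-tuning needs input/output pairs. A rough sketch, assuming a JSON Lines layout with a `text` key for raw chunks (a common convention, but the exact field names depend on the training script you use) and the `input`/`output` keys from the first message. All function names here are made up for illustration.

```python
import json


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split plain text into overlapping windows of at most max_words words."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def to_pretraining_records(transcript: str, **kwargs) -> list[dict]:
    """Raw-text records for further pretraining: no question needed."""
    return [{"text": chunk} for chunk in chunk_words(transcript, **kwargs)]


def to_instruction_record(question: str, answer: str) -> dict:
    """Input/output record for task fine-tuning: needs a question per answer."""
    return {"input": question, "output": answer}


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


transcript = "capital of india is New Delhi"
# Tiny window sizes here only so the overlap is visible on a six-word example.
pretraining_jsonl = to_jsonl(to_pretraining_records(transcript, max_words=4, overlap=1))
```

The overlap keeps a sentence that straddles a chunk boundary visible in both chunks; the window sizes are arbitrary defaults, not recommendations.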

zinc torrent Sep 3, 2023, 5:43 PM

#

shrewd thicket In most cases, you don't actually need to train a model. What you would need to ...

Oh I see... just fine tune for specific tasks

cobalt glacier Sep 3, 2023, 5:43 PM

#

would a vector database be appropriate?

shrewd thicket Sep 3, 2023, 5:47 PM

#

cobalt glacier would a vector database be appropriate?

Not necessarily

shrewd thicket Sep 3, 2023, 5:47 PM

#

zinc torrent Oh I see... just fine tune for specific tasks

But what is the task?

zinc torrent Sep 3, 2023, 5:49 PM

#

shrewd thicket But what is the task?

The task is: if a user asks a question, the response will be generated by this model. For that I have to fine-tune it first

shrewd thicket Sep 3, 2023, 5:52 PM

#

zinc torrent Task is like if a user ask a question then the response will be generated by thi...

You're not explaining the actual outcome. Why do you need to fine-tune? What questions would users ask? You most likely don't need fine-tuning. For example, if it's just Q&A based on your channel, you can get all the transcripts with metadata on which videos they come from, put them into a vector database as @cobalt glacier suggests, and use a readily available model for question answering, retrieving context from the vector database to answer channel-specific questions
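A toy sketch of that retrieval flow, under loud assumptions: word-count vectors and cosine similarity stand in for a real embedding model, and the in-memory `TinyIndex` class stands in for a real vector database. `TinyIndex`, `build_context_prompt`, and the video IDs are hypothetical names for illustration. The resulting prompt would then be sent to an off-the-shelf model, with no fine-tuning involved.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyIndex:
    """Toy stand-in for a vector database: transcript chunks plus video metadata."""

    def __init__(self):
        self.items = []  # list of (vector, text, metadata)

    def add(self, text: str, metadata: dict) -> None:
        self.items.append((Counter(tokenize(text)), text, metadata))

    def search(self, query: str, k: int = 2) -> list[tuple]:
        q = Counter(tokenize(query))
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:k]]


def build_context_prompt(question: str, hits: list[tuple]) -> str:
    """Put the retrieved chunks in front of the question for the model to use."""
    context = "\n".join(f"[{meta['video']}] {text}" for text, meta in hits)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


index = TinyIndex()
index.add("The capital of India is New Delhi.", {"video": "lecture-01"})
index.add("Gradient descent updates weights using the loss gradient.", {"video": "lecture-02"})

question = "What is the capital of India?"
hits = index.search(question, k=1)
prompt = build_context_prompt(question, hits)
```

Keeping the video ID next to each chunk is what lets the final answer point back to the video it came from.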

#

Fine-tuning also has risks. It's not like you just fine-tune on your channel's transcripts and the model suddenly becomes your channel's assistant. Fine-tuning on a small, not especially distinctive dataset often results in catastrophic forgetting, for example

zinc torrent Sep 3, 2023, 5:53 PM

#

shrewd thicket You're not explaining the actual outcome. Why do you need to finetune? What ques...

See, the video is a lecture given by a professor, so the outcome depends on the questions users ask, which can't be predicted

zinc torrent Sep 3, 2023, 5:54 PM

#

shrewd thicket Finetuning also has risks, it's not like you just finetune on your channel trans...

Oh!

shrewd thicket Sep 3, 2023, 5:55 PM

#

Then yeah, you don't need a fine-tuned model. You can check out how to use LLMs with a specific knowledge base through a vector DB with LangChain here