#Help with Synthetic Data Kit

brazen oriole
#

Hello, I have been testing Synthetic Data Kit using the notebook by Unsloth. I wondered whether I could ingest more than one source.
In the example, the code looks like this:

!synthetic-data-kit \
    -c synthetic_data_kit_config.yaml \
    ingest "https://arxiv.org/html/2412.09871v1"

# Truncate document
filenames = generator.chunk_data("data/output/arxiv_org.txt")
print(len(filenames), filenames[:3])

Here, it only uses this one arXiv paper to generate questions, so all the questions are about a single topic and you can't really build a broad dataset. Is it possible to ingest more data?

lament plaza
#

I think you should ask in the GitHub issues of the synthetic-data-kit repo!

#

you might get a better answer from the creator

warm rock
#

Also, has anyone run into vLLM crashing after just 3 loops, even though you specify a bigger number in the for loop? It does 3 iterations fine and then just starts printing "vllm server not found".

mild zodiac
#

Seems like the vLLM server crashed during one of the generations. Maybe it ran out of memory, or the context length isn't set large enough.

warm rock
#

This is the error I am getting

mild zodiac
#

Did you do the check step? I don’t think vllm started successfully

#

You should see output that vllm server is running. If you do, then vllm is crashing on generation. I’m guessing context length getting exceeded or OOM

mild zodiac
#

Can you try increasing the sequence length after restarting the notebook?

warm rock
#

I did, still the same issue.

mild zodiac
#

Let's rewind. Does the base notebook work for you without any modifications?

warm rock
#

No, it has the same issue. It only works for 3.

mild zodiac
#

The error above is not saying that it works for 3. It only attempts 3 times and all three failed, which means vLLM shut down at some point.

#

where are you running the notebook?

warm rock
#

in google colab

mild zodiac
#

The notebook as is runs on a Google Colab T4, so something has changed. It looks like the dataset, but either way the vLLM server is crashing at some point, and you'll need to figure out why that happens with your data. I've mentioned OOM or a context length limit as possible causes. The synthetic-data-kit tool is by Meta, so you may need to work with them to understand what the tool is doing too.

#

You can also look at the intermediate files generated in Google Colab. Inspect the offending file to see what's different, or delete it to see if the rest work. The generator has vllm_process, which is the vLLM subprocess. You can look up how to print its stdout and stderr logs to help you debug.
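
For the logs part, a minimal sketch of dumping the output of a subprocess that has already exited. It assumes generator.vllm_process is a subprocess.Popen started with piped stdout/stderr, which I haven't verified. tail_process_output is a made-up helper, and the demo uses a throwaway Python child process as a stand-in for the vLLM server.

```python
import subprocess
import sys


def tail_process_output(proc, max_lines=20):
    """Return (returncode, stdout_tail, stderr_tail) for a process that has exited.

    Returns (None, [], []) while the process is still running, because
    draining the pipes of a live process would block.
    """
    returncode = proc.poll()
    if returncode is None:
        return None, [], []

    # communicate() collects whatever is left in the pipes after exit.
    out, err = proc.communicate()

    def last_lines(data):
        if data is None:  # stream was not piped
            return []
        if isinstance(data, bytes):
            data = data.decode(errors="replace")
        return data.splitlines()[-max_lines:]

    return returncode, last_lines(out), last_lines(err)


# Stand-in for a crashed server: prints one line per stream, exits with code 1.
child_code = (
    "import sys; print('starting'); "
    "print('simulated failure', file=sys.stderr); sys.exit(1)"
)
demo = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
demo.wait(timeout=60)

code, out_tail, err_tail = tail_process_output(demo)
print("exit code:", code)
print("stdout:", out_tail)
print("stderr:", err_tail)
```

In the notebook you would pass generator.vllm_process instead of demo, if the assumption about piped output holds.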

warm rock
#

okay thx !!

neon spruce
#

Hey, I can confirm that the notebook sometimes works and sometimes doesn't with no changes on the user's side. It also seems to OOM very early.

What I did was keep Unsloth for finetuning but run the Meta synth kit on a different computer/environment, then move the generated data into the Colab with Unsloth. There's some issue with vLLM.

mild zodiac
#

You can play with the following params:

gpu_memory_utilization and max_seq_length in from_pretrained of the synthetic data kit.

You can also play with the prepare qa generation args to help control inference.
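
A back-of-the-envelope sketch of why those settings interact, assuming the usual constraint that prompt tokens plus generated tokens have to fit in the model's context window. fits_context, max_chunk_tokens and the numbers are illustrative only, not taken from the kit. prompt_overhead stands for whatever the QA prompt template adds around each chunk.

```python
def fits_context(chunk_tokens, prompt_overhead, max_generation_tokens, max_seq_length):
    """True if one request (prompt + requested output) fits in the context window."""
    needed = chunk_tokens + prompt_overhead + max_generation_tokens
    return needed <= max_seq_length


def max_chunk_tokens(prompt_overhead, max_generation_tokens, max_seq_length):
    """Largest chunk size (in tokens) that still leaves room for the output."""
    return max(0, max_seq_length - prompt_overhead - max_generation_tokens)


# Illustrative numbers only.
print(fits_context(1500, 200, 512, 2048))  # 1500 + 200 + 512 = 2212 > 2048
print(fits_context(1200, 200, 512, 2048))  # 1200 + 200 + 512 = 1912 <= 2048
print(max_chunk_tokens(200, 512, 2048))    # 2048 - 200 - 512
```

If your chunks come out larger than that budget, either raise the sequence length (which costs GPU memory) or lower the generation length or chunk size.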

#

You can see which files it complains about and edit or remove them to see if it works.
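
Roughly like this, as a sketch: run the generation step one file at a time and keep a list of the ones that fail. split_by_outcome is a hypothetical helper, and run_one is whatever you use to process a single chunk file (for example a wrapper around the notebook's CLI call). Here it is faked so the example runs anywhere.

```python
def split_by_outcome(filenames, run_one):
    """Run `run_one(filename)` for each file; return (succeeded, failed) lists.

    `run_one` should return True on success. Exceptions count as failures,
    so one bad file does not stop the rest from being tried.
    """
    succeeded, failed = [], []
    for name in filenames:
        try:
            ok = bool(run_one(name))
        except Exception as exc:
            print(f"{name}: {exc!r}")
            ok = False
        (succeeded if ok else failed).append(name)
    return succeeded, failed


# Fake runner for demonstration: pretend the second chunk is the bad one.
def fake_run(name):
    if name.endswith("_1.txt"):
        raise RuntimeError("simulated failure")
    return True


files = ["chunk_0.txt", "chunk_1.txt", "chunk_2.txt"]  # made-up names
good, bad = split_by_outcome(files, fake_run)
print(good, bad)
```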

#

Part of the notebook relies on a Meta package; you can check those docs for how to control parameters on that side.

#

@neon spruce if there’s an issue on our side we are happy to investigate but I’ll need some more information about what you’ve tried and outcomes when you dig into the params and data

brazen oriole
#

Will I be able to create a synthetic dataset for other languages as well? E.g. a Hittite dataset for future Hittite models.

mild zodiac
#

Depends if the model you’re using to generate the synthetic data understands the language

#

If it doesn’t then it won’t generate coherent data

brazen oriole
#

I'll ingest Hittite resources and, as a prompt in the config file, state that it'll be a CoT dataset, so it should give the English meanings of the contents, etc.

#

The docs aren't really clear; there isn't any specific example of ingesting multiple resources.

#

i guess I'll just try it myself to figure out whether it works
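
What I'd try first, as a sketch: call ingest once per source, using the same command shape as the notebook. build_ingest_commands is my own helper, the second URL is a placeholder, and I haven't confirmed how the kit names the output file for each source.

```python
def build_ingest_commands(sources, config_path="synthetic_data_kit_config.yaml"):
    """Return one argv list per source, mirroring the notebook's ingest call."""
    return [
        ["synthetic-data-kit", "-c", config_path, "ingest", source]
        for source in sources
    ]


sources = [
    "https://arxiv.org/html/2412.09871v1",  # from the notebook
    "https://example.com/another-document.html",  # placeholder
]
commands = build_ingest_commands(sources)
for argv in commands:
    print(" ".join(argv))
    # To actually run it: subprocess.run(argv, check=True)
```

Each ingested file would then go through generator.chunk_data(...) the same way the single arXiv file does, and the resulting chunk lists can be concatenated.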

#

If you have already used Argilla's synthetic dataset generator, what were the pros and cons compared to Meta's kit?

brazen oriole
#

have you given this kit a try? How were the results in your opinion?

mild zodiac
#

if you need data it works well.