Help with Synthetic Data Kit | Unsloth AI | Page 1

brazen oriole May 26, 2025, 11:04 PM

#

Hello, I have been testing Synthetic Data Kit using the Unsloth notebook. I wondered whether I could ingest more than one source.
For example, the code in the notebook looks like this:

!synthetic-data-kit \
    -c synthetic_data_kit_config.yaml \
    ingest "https://arxiv.org/html/2412.09871v1"

# Truncate document
filenames = generator.chunk_data("data/output/arxiv_org.txt")
print(len(filenames), filenames[:3])

Here, it only uses this one arXiv paper to generate questions, so all the questions are about that topic and you can't really generate a broad dataset. Is it possible to ingest more sources?
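
One possible approach, sketched under the assumption that the `ingest` subcommand can simply be run once per source (the thread does not confirm this, and the kit's own docs are the authority): loop over a list of sources and reuse the exact command from the notebook. The helper names `build_ingest_command` and `ingest_all` are hypothetical, not part of the kit.

```python
import subprocess

CONFIG = "synthetic_data_kit_config.yaml"  # same config file as in the notebook


def build_ingest_command(source, config=CONFIG):
    """Return the notebook's CLI call for a single source (URL or file path)."""
    return ["synthetic-data-kit", "-c", config, "ingest", source]


def ingest_all(sources, run=subprocess.run):
    """Run ingest once per source; return the sources whose command failed.

    Assumption: a non-zero exit code means the ingest did not succeed.
    """
    failed = []
    for source in sources:
        result = run(build_ingest_command(source))
        if result.returncode != 0:
            failed.append(source)
    return failed


# Usage (needs the synthetic-data-kit CLI on PATH):
# failed = ingest_all(["https://arxiv.org/html/2412.09871v1"])  # add more sources here
# print("failed sources:", failed)
```

Each ingested text file could then be passed to `generator.chunk_data` the same way the notebook does for `data/output/arxiv_org.txt`. The output file names for other sources are not specified in this thread.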

lament plaza May 27, 2025, 12:45 AM

#

I think you should ask in the GitHub issues of the synthetic-data-kit repo!

#

you might get a better answer from the creator

warm rock May 27, 2025, 11:18 AM

#

Also, has anyone run into vLLM crashing after just 3 loops, even though you specify a bigger number in the for loop? It runs 3 times fine and then just starts printing vllm server not found.

mild zodiac May 27, 2025, 11:28 AM

#

Seems like the vLLM server crashed during one of the generations. Maybe it ran out of memory, or the context length wasn't set large enough.

warm rock May 27, 2025, 11:32 AM

#

This is the error I am getting

mild zodiac May 27, 2025, 12:10 PM

#

Did you do the check step? I don’t think vllm started successfully

#

You should see output saying that the vLLM server is running. If you do, then vLLM is crashing during generation. I'm guessing the context length is being exceeded or it's an OOM.
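
A minimal sketch of checking for a running server from a notebook cell, independent of the kit's own check step. It assumes the server listens on `localhost:8000` (vLLM's default port) and serves the OpenAI-compatible `/v1/models` route; the address in the notebook's config may differ. `server_is_up` is a hypothetical helper name.

```python
import urllib.request


def server_is_up(url="http://localhost:8000/v1/models", timeout=5.0):
    """Return True if the URL answers with HTTP 200, False otherwise.

    Connection errors, timeouts, and HTTP error statuses are all
    subclasses of OSError here, so they all count as "not up".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False


# Usage: call this before and between generation loops.
# print("vLLM reachable:", server_is_up())
```

If this returns True before the loop and False after a particular file, that file is a likely trigger for the crash.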

warm rock May 27, 2025, 12:20 PM

#

mild zodiac Did you do the check step? I don’t think vllm started successfully

Are you talking about this?

mild zodiac May 27, 2025, 12:52 PM

#

Can you try increasing the sequence length after restarting the notebook?

warm rock May 27, 2025, 1:29 PM

#

I did, still the same issue.

mild zodiac May 27, 2025, 2:24 PM

#

Let's rewind. Does the base notebook work for you without any modifications?

warm rock May 27, 2025, 2:25 PM

#

No, it has the same issue; it only works for 3.

mild zodiac May 27, 2025, 2:28 PM

#

The error above doesn't say that it works for 3. It only attempts 3 times, and all three attempts failed. That means vLLM shut down at some point.

#

where are you running the notebook?

warm rock May 27, 2025, 2:29 PM

#

in google colab

warm rock May 27, 2025, 2:30 PM

#

warm rock This is the error I am getting

Here it does 0 QA pairs, then 1 QA pairs, but on the 2nd it stops.

mild zodiac May 27, 2025, 2:46 PM

#

The notebook as is runs on a Google Colab T4, so something has changed. It seems to be the dataset, but either way the vLLM server is crashing at some point. You'll need to figure out why it happens with your data. I've mentioned an OOM or a context length limit as possible causes. The synthetic-data-kit tool is by Meta, so you may need to work with them to understand what the tool is doing too.

#

You can also look at the intermediate files generated in Google Colab. You can inspect the offending file to see what's different, or delete it to see if the rest work. The generator has vllm_process, which is the vLLM subprocess. You can look up how to print what it logs to stdout and stderr to help you debug.
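
A small sketch of checking whether a subprocess is still alive, assuming `generator.vllm_process` is a standard `subprocess.Popen` object (the thread only calls it "the vLLM subprocess", so this is an assumption). `process_status` is a hypothetical helper; the demo uses a throwaway Python child process in place of vLLM.

```python
import subprocess
import sys


def process_status(proc):
    """Describe a Popen object: still running, or exited with which code.

    On POSIX a negative code -N means the process was killed by signal N.
    """
    code = proc.poll()  # None while the process is still running
    if code is None:
        return "running"
    return f"exited with code {code}"


# Demo with a short-lived child process standing in for the server.
child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(3)"])
child.wait()
print(process_status(child))  # exited with code 3

# In the notebook (assumption: vllm_process is a Popen):
# print(process_status(generator.vllm_process))
```

Whether the server's stdout and stderr can be read from that object depends on how the subprocess was launched, which is not shown in this thread.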

warm rock May 27, 2025, 3:07 PM

#

okay thx !!

neon spruce May 28, 2025, 6:41 AM

#

Hey, I can confirm that the notebook sometimes works and sometimes doesn't, with no changes to user input. Additionally, it seems to OOM very early.

What I did was use Unsloth for finetuning but run the Meta synthetic data kit on a different computer/environment, and then just move the data into the Colab with Unsloth. There seems to be some issue with vLLM.

mild zodiac May 28, 2025, 10:43 AM

#

You can play with the following params:

gpu_memory_utilization and max_seq_length, in from_pretrained of the synthetic data kit.

You can also play with the prepare_qa_generation args to help control inference.

#

You can see which files it complains about and edit or remove them to see if it works.
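
To spot a suspicious file quickly, here is a rough sketch that ranks the chunk files by size. It assumes `filenames` is the list of chunk paths returned by `generator.chunk_data` as in the notebook. Character count is only a loose proxy for token count, so treat the result as a hint, not a diagnosis. `chunk_sizes` is a hypothetical helper.

```python
from pathlib import Path


def chunk_sizes(filenames):
    """Return (path, character_count) pairs, largest chunk first."""
    sizes = []
    for name in filenames:
        # errors="replace" keeps odd bytes from aborting the scan
        text = Path(name).read_text(encoding="utf-8", errors="replace")
        sizes.append((str(name), len(text)))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


# Usage in the notebook:
# for path, n_chars in chunk_sizes(filenames)[:5]:
#     print(n_chars, path)
```

An unusually large chunk near the top of the list is a natural first candidate to edit or remove.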

#

Part of the notebook relies on a Meta package; you can check its docs for how to control parameters on that side.

#

@neon spruce if there’s an issue on our side we are happy to investigate but I’ll need some more information about what you’ve tried and outcomes when you dig into the params and data

brazen oriole May 28, 2025, 11:03 AM

#

Will I be able to create a synthetic dataset for other languages as well? E.g. a Hittite dataset for future Hittite models.

mild zodiac May 28, 2025, 11:06 AM

#

Depends if the model you’re using to generate the synthetic data understands the language

#

If it doesn’t then it won’t generate coherent data

brazen oriole May 28, 2025, 11:08 AM

#

I'll ingest Hittite resources and, in the prompt in the config file, state that it will be a CoT dataset, so it should give the English meanings of the contents, etc.

#

The docs aren't really clear; there isn't any specific example of ingesting multiple resources.

#

I guess I'll just try it myself to figure out whether it works.

#

If you have already used Argilla's synthetic dataset generator, what were the pros and cons compared to Meta's kit?

mild zodiac May 28, 2025, 11:09 AM

#

brazen oriole the docs aren't really clear there isn't any specific usage of ingestion of mult...

You can raise that with meta. Not sure tbh

mild zodiac May 28, 2025, 11:09 AM

#

brazen oriole If you have already used Argilla's synthetic dataset generator, what were the pr...

Have not tried it unfortunately

brazen oriole May 28, 2025, 11:10 AM

#

have you given this kit a try? How were the results in your opinion?

mild zodiac May 28, 2025, 11:15 AM

#

if you need data it works well.
