Hello, I have been testing synthetic-data-kit using the notebook by Unsloth. I wondered whether I could ingest more than one data source.
For example, the code in the notebook looks like this:
!synthetic-data-kit \
-c synthetic_data_kit_config.yaml \
ingest "https://arxiv.org/html/2412.09871v1"
# Truncate document
filenames = generator.chunk_data("data/output/arxiv_org.txt")
print(len(filenames), filenames[:3])
Here, it only uses this one arXiv paper to generate questions, so every question will be about that single topic and you can't really build a broad dataset. Is it possible to ingest more than one document?
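To make the question concrete, here is a rough sketch of what I have in mind. It assumes that `ingest` can simply be called once per source, which I have not confirmed. The second URL is a placeholder and `build_ingest_commands` is a hypothetical helper of mine, not part of synthetic-data-kit or Unsloth. The sketch only builds and prints the commands; it does not run them.

```python
import shlex

# Sources to ingest. Only the first URL comes from the notebook;
# the second is a made-up placeholder for illustration.
SOURCES = [
    "https://arxiv.org/html/2412.09871v1",
    "https://example.com/another-document.html",
]


def build_ingest_commands(sources, config="synthetic_data_kit_config.yaml"):
    """Hypothetical helper: build one `synthetic-data-kit ingest` command per source.

    Assumption: calling ingest repeatedly, once per source, is a valid
    way to bring several documents into data/output.
    """
    return [["synthetic-data-kit", "-c", config, "ingest", src] for src in sources]


for cmd in build_ingest_commands(SOURCES):
    # Print the command as it would be typed in a shell or notebook cell.
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

After that I would presumably call `generator.chunk_data(...)` once for each ingested text file and combine the resulting chunk lists, but I don't know whether that is the intended workflow.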