#Common Pile Edu


mystic canyon
#

The Common Pile dataset is a phenomenal collection of curated data, but can we do better for small LM training?

In recent times we've seen incredibly small but high quality corpora which can produce really capable small LMs by aggressively filtering for only "educational" content.

Why not try the same for Common Pile, creating a pre-shuffled, pre-weighted, high-quality subset designed for efficient SLM training on the order of ~200-300B tokens?

I'd like to gauge interest in such a project, because I've already seen first hand the advantages of training with SmolLM Corpus for small LMs, and I think having a permissively licenced counterpart derived from the Common Pile collection would benefit us a lot.

marble vine
#

I'm very interested in this, however I recommend waiting for the next version of the dataset. We have a number of high quality works I expect to provide a lot of value in the pipeline currently.

Right now 200B tokens is a pretty large % of the dataset and I think a really high quality 200B subset will be easier to make with more data.

#

... as soon as I hit enter the thought occurred to me "why discourage starting now? Worst case scenario they build infra that can be used out of the box with the next version"

mystic canyon
# marble vine I'm very interested in this, however I recommend waiting for the next version of...

Oh I absolutely agree. The "v0.1" definitely speaks to the fact that further work is planned to expand and refine the corpus. But I think starting out by building infra for quality classifiers and filtering schemes would be a good starting point in the meantime. It also gives us time to build momentum because this is absolutely not a one woman job, so getting some hype to attract other contributors absolutely cannot be a bad thing!

west geode
#

Interesting. Question: do you see this differentiating anywhere from the previous approach (eg FineWeb-Edu, Stack-Edu-Python, etc) of just training a classifier for educational content and filtering on that?

mystic canyon
#

So, naturally, the best starting point would be the quality classifiers themselves. While we could follow the recipe for SmolLM corpus, I'm unsure if that's the best idea.

While I cannot speak for the classifier used for the fine-web-edu split, I do have experience with the "educational code classifier" that was used for the python-edu split. If you've followed some of my projects, I released two pre-shuffled mixes of the SmolLM corpus (Avelina/smollm-corpus and Avelina/smollm-corpus-cleaned) and two corresponding python-edu splits (Avelina/python-edu and Avelina/python-edu-cleaned).

The reason for needing to release "cleaned" versions was because the code classifier didn't do a great job in the python edu split. In fact it did a shite job. When training LMs I found weird loss dips (not spikes, dips) and I traced it back to some HORRIFIC python code containing lots of repetitions. I'm talking hardcoded python calculators with several thousand lines of if statements, unit tests for every possible user input, scripts with hard coded lists of all words in the English language, and an instance of a "python file" which was literally just a freaking harry potter book that was clearly copy pasted from a pdf because it had page numbers and chapter headings every dozen lines.

So while in general the code classifier did a good job, it clearly deemed some utter crap as being "high quality". And while I don't expect the text classifier to suffer from the same sort of issues, I think it's definitely worth revisiting things and maybe adding in some more quality control filters to catch the cases that trip the classifiers up.

mystic canyon
# west geode Interesting. Question: do you see this differentiating anywhere from the previou...

I think we should definitely be taking a similar approach, but combine the existing filtering used for Common Pile as a starting point.

I also have an idea for how we can use the classifier quality signals in a different way to prior methods: previously curators have used a specific "cutoff" point when creating such corpora, e.g. retaining only documents which score above 3/5, and then maybe performed further filtering by retaining only the top N scoring documents from the final data mix to create the dataset...

I think we can do things differently. Firstly, we determine the size of the dataset we want, let's say 100B tokens for simplicity, and then we look at the data mix used to train the Comma models (probably just the initial phase, not the cooldown) and scale down those values to sum up to 100B. This gives us a number of tokens for each subset (which we can roughly translate to a number of documents based on average document size) and using that number we retain only the top N documents from each individual subset rather than from Common Pile as a whole. This means we subsample each split individually based on the desired data mix, retaining fewer but highest quality documents from smaller splits and a broader range of documents from the larger splits.

#

And as more work is done on Common Pile and newer Comma models are released, we'll likely get more refined recipes for the "ideal" data mixes. Combined with the higher document counts of future iterations of Common Pile, this quality filtering scheme could land us an excellent dataset for small model training.

#

Doing filtering individually also makes our lives a bit easier. For example if we wanted to just take the entirety of Common Pile and filter first, mix later, we'd have a hard job making sure there is no "bias" towards certain splits over others, e.g. rating arxiv docs higher on average than wikimedia purely down to nuances in the document structures. Classifying each subset separately means the absolute scores no longer matter as we only care about the top documents within each split.
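
A minimal sketch of the per-subset selection described above, under stated assumptions: the mix weights, document IDs, token counts and scores are made-up toy values, and `scale_mix` and `select_top_per_subset` are hypothetical helper names. As a small variation it fills a token budget per subset directly instead of first converting the budget into a document count via average document size.

```python
from collections import defaultdict


def scale_mix(mix_tokens, budget):
    """Scale per-subset token counts so that they sum to `budget`."""
    total = sum(mix_tokens.values())
    return {name: budget * n / total for name, n in mix_tokens.items()}


def select_top_per_subset(docs, budgets):
    """Keep the highest-scoring docs of each subset until its budget is filled.

    `docs` is an iterable of (doc_id, subset, n_tokens, score) tuples.
    Scores are only ever compared within a subset, never across subsets,
    so differences in absolute score between subsets do not matter.
    """
    by_subset = defaultdict(list)
    for doc_id, subset, n_tokens, score in docs:
        by_subset[subset].append((score, n_tokens, doc_id))

    kept = {}
    for subset, rows in by_subset.items():
        rows.sort(key=lambda r: r[0], reverse=True)  # best score first
        used, chosen = 0, []
        for score, n_tokens, doc_id in rows:
            if used >= budgets.get(subset, 0):
                break  # budget reached (may overshoot by one document)
            chosen.append(doc_id)
            used += n_tokens
        kept[subset] = chosen
    return kept


# Toy numbers, made up purely for illustration.
mix = {"arxiv": 60, "wiki": 40}
budgets = scale_mix(mix, budget=10)
docs = [
    ("a1", "arxiv", 4, 0.9), ("a2", "arxiv", 4, 0.2), ("a3", "arxiv", 4, 0.5),
    ("w1", "wiki", 3, 0.1), ("w2", "wiki", 3, 0.3),
]
selected = select_top_per_subset(docs, budgets)
```

Note that the wiki documents all score lower than the arxiv ones in absolute terms, yet the wiki subset still fills its share of the budget, which is the point of ranking within each split.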

west geode
#

Makes sense. I was going to suggest something similar based on splits, since this also allows us to make more domain-specific decisions about what's educational enough to make the cut. E.g. obviously pretty much everything on arxiv is "educational" in the academic sense so what signal is the classifier picking up on? Perhaps it ends up biasing toward a particular subdomain, which is undesirable.

mystic canyon
#

e.g. comparing between clusters of different keywords in the arxiv split, or publisher in the news split, or categories in the wikimedia split

#

another thing we could do is split-specific finetunes of an already good classifier, maybe leveraging the differences between the "raw" and "filtered" splits of common pile to drive the per-split finetuning data

west geode
#

I'm not sure what you mean by that last bit, but a split specific ft would be nice if you can get good quality labeled data (I'm not a huge fan of the automatic labeling method but maybe the models will have improved since Llama 3 70B)

cyan root
#

I like the idea of using metadata (somehow) to cluster. The per-sample metadata is comparatively more extensive (or more can be easily obtained), given the license provenance needed to be tracked

mystic canyon
#

Another important question is.... who is willing to contribute? And I don't just mean contributing to the development by helping out with ideas or contributing code, but also contributing disk space and compute.

Because we're gonna need to run a lot of documents -- some of which are millions of tokens long so may need to be chunked -- through quality classifiers so we can build a database of raw quality signals to then allow us to do clustering/filtering/rebalancing/etc/etc/etc. I have a lot of compute at my disposal, but not a lot of persistent disk space. I can generate TBs worth of transient artefacts, but they get cleaned up very quickly so I won't be able to actually store the entirety of Common Pile myself, and even after filtering I may need someone else to store a "de facto" version.

(Just to drive the point home, remember the cleaned SmolLM corpus shuffles I generated? Only around 300GB compressed, but it no longer exists on my cluster, it got cleaned up for space. I have the code to regenerate it, but the only actual copy of it is stored on the HF hub.)

#

And just to clarify I'm not asking for charity here. If push comes to shove and no one else can really help I could probably annotate the whole corpus myself with quality signals in lots of small chunks, keeping only lists of document IDs and labels in persistent storage and dropping everything else to save space. But in the end I will absolutely need someone else with more disk space to actually generate the dataset from the list of top documents.
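
A rough sketch of that fallback workflow, keeping only document IDs and scores in persistent storage. The table layout, the helper names (`open_store`, `annotate_chunk`, `top_ids`) and the toy scorer are all assumptions for illustration; a real run would plug in an actual quality classifier and a file path instead of an in-memory database.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small database holding only IDs and scores."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signals ("
        "doc_id TEXT, subset TEXT, score REAL, "
        "PRIMARY KEY (doc_id, subset))"
    )
    return conn


def annotate_chunk(conn, subset, docs, score_fn):
    """Score one chunk of (doc_id, text) pairs; the text itself is not stored."""
    rows = [(doc_id, subset, float(score_fn(text))) for doc_id, text in docs]
    # INSERT OR REPLACE makes re-running a chunk safe (no duplicate rows).
    conn.executemany("INSERT OR REPLACE INTO signals VALUES (?, ?, ?)", rows)
    conn.commit()


def top_ids(conn, subset, n):
    """Return the IDs of the n best-scoring documents in one subset."""
    cur = conn.execute(
        "SELECT doc_id FROM signals WHERE subset = ? "
        "ORDER BY score DESC, doc_id LIMIT ?",
        (subset, n),
    )
    return [row[0] for row in cur]


def toy_score(text):
    # Placeholder scorer (fraction of distinct words), not a real classifier.
    words = text.split()
    return len(set(words)) / max(len(words), 1)


conn = open_store()
annotate_chunk(conn, "wiki", [("w1", "a a a a"), ("w2", "a b c d")], toy_score)
best = top_ids(conn, "wiki", 1)
```

The resulting list of top IDs is what would be handed to whoever has the disk space to materialise the actual dataset.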

west geode
#

I have trivial compute but ~10TB of aggregate space I can contribute on my NAS, but it's only 1Gb up/down so not ideal

mystic canyon
#

1Gb up/down is MORE than enough when it comes to actually aggregating the data

west geode
#

oh sick. consider that storage contributed then

#

i only have 1x4090 (desktop rig) to run stuff locally, which can maybe do lightweight low-latency postprocessing

real oar
#

I know of a few of them that actively sue over the transcripts

livid wigeon
#

Been following the post for a couple of days and am very interested in contributing. Can help contribute code, compute and storage.
Just wanted to clarify the end goal of this effort. Are we aiming to simply release an open-source dataset, or are we also open to exploring filtering techniques more deeply, other ideas such as studying the efficacy of the dataset and models trained with it and potentially publishing our findings?

livid wigeon
#

Sounds good. Consider me and my resources a part of the initiative then.
Specifically, I have access to one A100 and about 800GB of disk space available.

mystic canyon
#

because honestly I don't want to fall into the same trap that SmolLM corpus did where they only extracted quality signals from truncated documents. Ideally we should be qualifying every document in its entirety, but I need to figure out if that's even feasible and what the tradeoff may be against more typical models with shorter context length.
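
One way to score a whole document with a short-context classifier is to chunk it and aggregate the per-chunk scores. The sketch below is only an illustration under assumptions: `score_document` is a hypothetical helper, the aggregation (length-weighted mean plus minimum) is one choice among many, and the toy scorer stands in for a real classifier.

```python
def chunk_tokens(tokens, window):
    """Split a token list into consecutive windows of at most `window` tokens."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


def score_document(tokens, window, score_fn):
    """Score every chunk and aggregate, instead of scoring only the first chunk."""
    chunks = chunk_tokens(tokens, window)
    if not chunks:
        return None
    scores = [score_fn(chunk) for chunk in chunks]
    total = sum(len(chunk) for chunk in chunks)
    mean = sum(s * len(c) for s, c in zip(scores, chunks)) / total
    return {"mean": mean, "min": min(scores), "first_chunk": scores[0]}


def toy_score(chunk):
    # Placeholder scorer (fraction of distinct tokens), not a real classifier.
    return len(set(chunk)) / len(chunk)


# A document that starts out varied and then degenerates into repetition.
tokens = ["a", "b", "c", "d"] + ["x"] * 8
result = score_document(tokens, window=4, score_fn=toy_score)
```

In this toy case the first chunk alone scores 1.0 while the length-weighted mean is 0.5, which is the kind of gap that truncated scoring would hide.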

south yacht
#

Hey, I'd pretty much love to contribute except for my lack of resources...

#

Also, anyone want to work on Finance version of this?

dire sparrow
#

I have no resources but am very interested in working on this project

mystic canyon
#

Sorry I've not been super active with this recently. Had a couple tragedies in the family AND had NeurIPS reviews to deal with AND got a friend staying over starting the day NeurIPS reviews are due

marble vine
#

Oh fuck I haven't started my NeurIPS reviews

viral coyote
#

okay this is a cool idea and i am sorry i missed it previously

#

2c: i perpetually have sort of a suspicion that data classifiers are doing something weird and are difficult to get exactly right, i would be really interested to see results from using normal compressors for classification

#

compression ratio under a few compressors will catch things like high-repeat autogenerated python (i assume psychopy by the way! psych students use it and it spits this out) and compression dictionary under eg gzip will have natural clusters

#

the dumbest check is just snappy compression ratio though

#

the size under compression metric will have more than one mode though, with code being notably very different from noncode and i think different languages varying substantially
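
A small sketch of the compression-ratio check, assuming `zlib` from the standard library as a stand-in for snappy (which needs a third-party package). The threshold of 0.1 and the helper names are illustrative assumptions; as noted above, a sensible threshold would likely differ between code and non-code and between languages.

```python
import zlib


def compression_ratio(text, compress=zlib.compress):
    """Compressed size divided by raw size; lower means more repetitive."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(compress(raw)) / len(raw)


def flag_repetitive(docs, threshold):
    """Return IDs of (doc_id, text) pairs whose ratio falls below `threshold`."""
    return [doc_id for doc_id, text in docs
            if compression_ratio(text) < threshold]


# Toy documents: one heavily repeated snippet, one with little repetition.
repetitive = "if x == 1:\n    print(1)\n" * 500
varied = " ".join(str(i * i * 7919 % 100003) for i in range(500))

flagged = flag_repetitive([("rep", repetitive), ("var", varied)], threshold=0.1)
```

Passing a different `compress` callable (for example `lzma.compress` or `bz2.compress`) gives the ratio under another compressor without changing the rest of the code.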

viral coyote
#

what is 200b tokens in bytes again? this would call for approx 800gb, right?
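
The back-of-the-envelope version of that estimate, assuming roughly 4 bytes of uncompressed UTF-8 text per token. That figure is a common rule of thumb for English text with BPE-style tokenizers, not a measured value for Common Pile; the real ratio depends on the tokenizer and on how much code or non-English text is in the mix.

```python
def tokens_to_bytes(n_tokens, bytes_per_token=4.0):
    """Rough uncompressed text size for a given token count."""
    return n_tokens * bytes_per_token


size_gb = tokens_to_bytes(200e9) / 1e9  # 800.0 under the 4 bytes/token assumption
```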

viral coyote
#

i have space but low compute, perfectly fine for some things

i think i have a question about hf storage: do we care if i use it to do things like e.g. upload a shuffled copy of the raw data, or throw metadata files for it somewhere, etc? personal account on hf claims a 300gb limit

#

if we're going to experiment with different filters across different people it's a lot nicer to do specific operations once and then upload but even subsets are likely to run kind of heavy since it's 8tb

nova lynx
#

would anyone with high storage and low compute be interested in making a parquet-formatted common pile dataset repo? it would only really need two columns, one for the text of the document and one labeling what subset the doc is from

#

then we could use the new release of datasets to clean/filter the data in streaming mode, allowing people like me with compute but no storage to contribute
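
A sketch of what the compute-side worker could look like, written against any iterable of row dicts so it runs without network access. The column names `text` and `subset` follow the two-column layout proposed above; the scorer is a placeholder. With the Hugging Face `datasets` library, `rows` could be the iterable returned by `load_dataset(repo_id, split="train", streaming=True)`, where `repo_id` would be whatever repo ends up hosting the data (hypothetical here).

```python
def annotate_stream(rows, score_fn, text_key="text", subset_key="subset"):
    """Yield one small annotation per row without keeping the text around."""
    for index, row in enumerate(rows):
        yield {
            "index": index,
            "subset": row[subset_key],
            "score": score_fn(row[text_key]),
        }


def toy_score(text):
    # Placeholder: word count. Not a real quality classifier.
    return len(text.split())


# Stand-in for a streamed dataset: any iterator of dicts works.
rows = iter([
    {"text": "short doc", "subset": "wiki"},
    {"text": "a somewhat longer document here", "subset": "arxiv"},
])
annotations = list(annotate_stream(rows, toy_score))
```

Because the generator only ever holds one row at a time, the worker needs compute but almost no local storage; only the annotations have to be written out.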

nova lynx
#

I'm gonna test drive this with pile-uncopyrighted, since it's already streamable from hf. I'll try scoring the whole thing with two existing text classifier models (dclm's and fwedu's), and then upload the annotations to HF so y'all can experiment with it

viral coyote
#

i am processing these anyway so it would just be another upload

#

actually: might be a problem because of how long upload is likely to take, otherwise fine

cyan root
#

I can probably do this this weekend, depending on how friendly parquet is with the different subsets. Just to confirm, we want the raw data?

viral coyote
#

i'd really like a version that is preshuffled and one that is streamable

cyan root
#

I should probably be able to add it to Eleuther org

viral coyote
#

(i am also not done shuffling it, i am fiddling with tooling to try to get a reasonable shuffler in the dolma library)

#

but in principle it's "nice" to have the raw stuff in both shuffled and streamable

cyan root
#

yeah, I'll try to upload a split this weekend

viral coyote
#

the stack is freakishly large tbh

cyan root
#

yeah, it's half the dataset

cyan root
#

any parquet wizards know the best settings for streaming? From what I gathered: ~1GB per file, snappy compression and sorted (can use id column)

#

partition strategy?

viral coyote
#

128mb row group and i think you probably want to range partition by id after sort
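
To make "range partition by id after sort" concrete, here is a sketch that only plans the shards: it sorts by ID and cuts a new file whenever the target size would be exceeded, recording each file's ID range. `plan_shards` is a hypothetical helper and the sizes are toy values; it does not write Parquet, and the row group size and compression would still be passed to whichever writer is actually used.

```python
def plan_shards(records, target_bytes):
    """Group (doc_id, n_bytes) records into ID-ordered shards of ~target_bytes.

    Each shard covers a contiguous, non-overlapping ID range, so a reader
    can tell from min_id/max_id alone which file holds a given document.
    """
    shards, current, size = [], [], 0
    for doc_id, n_bytes in sorted(records):
        # Start a new shard if adding this document would exceed the target.
        if current and size + n_bytes > target_bytes:
            shards.append({"min_id": current[0], "max_id": current[-1], "bytes": size})
            current, size = [], 0
        current.append(doc_id)
        size += n_bytes
    if current:
        shards.append({"min_id": current[0], "max_id": current[-1], "bytes": size})
    return shards


# Toy sizes; real targets would be on the order of 1GB per file.
records = [("d3", 40), ("d1", 60), ("d2", 50), ("d4", 30)]
shards = plan_shards(records, target_bytes=100)
```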

#

it has been a bit and tbh in prod use cases i would "go check what the one that already works does"

plain knot
#

Just saw this, is "-edu" really the right trend to continue? I think there are better ways to build higher quality subsets, probably, at the very least due to preserving more diversity

lyric frigate
#

Also define educational. Is that like textbooks and stuff? cuz we probs losing out on linguistic diversity/real world grounding by not including common spoken language tokens right. Also would this be a strictly english dataset?

nova lynx
#

i'm hopeful that one of them will keep more docs than the other, because common pile isn't super large already

plain knot
#

https://arxiv.org/abs/2408.08310 I was thinking something like this perhaps, I would have compute to do this btw

plain knot
#

I think you'd have to train two models specifically for it

#

the interesting thing is also you get a full ranking of all samples, so if you release that people can choose their own top k threshold

#

not that I wanna derail this, maybe someone would like to also do the -edu thing

nova lynx
#

@cyan root @viral coyote any updates on the raw data repo?

viral coyote
#

i need some time for a shuffle to run today and then i can reupload raw + shuffled. i can also do a proportionate subsample at like 1/10th size to get under hf size limits, would need perms on something permitting the reup of the shuffled due to its size

cyan root
#

I almost have most of the (raw) datasets formatted to parquet (except stack). Sorry got stuck in a loop trying to normalize the columns. Will start uploading in a bit

nova lynx
#

No rush! I'm using dolma + the Comma dataset for testing in the meantime, and that seems to be working well for now

cyan root
#

damn, I hate parquet. Am just vibing with these configs now:
FORMAT PARQUET,
file_size_bytes '1GB',
compression 'zstd',
per_thread_output TRUE,
preserve_order FALSE,
ROW_GROUP_SIZE_BYTES '512MB'

#

used to think, why isn't it used more. So convenient. So fast

#

well now I know. It's evil

plain knot
#

Poor baber

cyan root
#

I uploaded most of the sets here, except for 4 of the largest

#

downloading and converting stackv2 now, and will upload the remaining when that's done

cyan root
#

ok think I've uploaded all now

plain knot
#

is this the new version stella mentioned or are we waiting for that one?

#

btw nice, it's good to have it in one place rather than a collection

nova lynx
#

dumb question: has anyone considered using DoReMi or ODM to reweight the comma data?

#

the paper mentions MixMin but I'm not sure how similar that is

marble vine
#

Same

plain knot
#

alright thanks, good to know

marble vine
#

We're also experimenting with synthetic data augmentation and rewriting some of the more unusually structured stuff (e.g., USPTO) in more natural language

nova lynx
#

synthetic as in LLM-generated? I thought the Common Pile argued against using that in the dataset?

marble vine
#

It wasn't in scope for our aims, but synthetic text is not eligible for copyright protection in the US.

real oar
#

I honestly think it might be worth getting a group together to hand label say 100k docs or so from the CommonCrawl subset?

molten marsh
#

Hey all, just saw a random link to this thread so I figured I'd mention a few relevant things:

  • After our aggressive filtering for the comma v0.1 dataset, we already had a small-ish dataset - it's only ~450B unique tokens. We just repeated a lot of stuff a lot to get the 1T and 2T training runs.
  • The comma v0.1 dataset already has some "-edu"-style filtering: for cccc we used the dclm classifier, for stackv2 we used the stack-edu classifier.
  • It's not obvious to me how helpful it would be to apply other classifier-based filtering to other sources. For example I wouldn't expect certain patents to be higher quality or more educational than others. If you applied the same quality classifier to the whole dataset I would just expect it to filter out some sources (like, no patents, none of the old ocr'd text) whole hog.

real oar
#

Or at least a better classification model is what I mean

molten marsh
#

Sure yeah, though I'm not sure there's a ton of room to improve on stack-edu and dclm classifiers, at least for low-hanging fruit

real oar
#

We are evaluating the fine edu classifier now on some textbook data

#

That is what I meant

#

Haha

#

I think just getting something that can filter for textbook-like data out of commoncrawl would be a significant improvement over the base classifier

molten marsh
#

Sure, worth a try

real oar
#

Talking to HarvardLIL about using a labeled version of institutional books

#

Hand labeled

molten marsh
#

Though cccc is such a tiny slice of our data (especially after all the filtering) that I'm not sure better filtering would make a big difference in eventual model quality

real oar
#

I think when I say generally I mean people would probably also use it for non permissive data lol

#

Actually I should have something else for this soon. TBA

marble vine
#

@real oar I've been talking with Greg and we cannot use the IB dataset. Although the books are in the public domain (and it's legally dubious whether they can actually put ToS on them) they do have a non-open ToS. Treating them as open data requires directly picking a legal fight with IDI

real oar
#

Well technically we can as well

#

But you/we can't release them yet

#

Yet being sometime in the next months

marble vine
#

@real oar

  1. I mean that we can't use them as open data. They're permitted to be used under the terms of service, but you can't treat them as open data.
  2. Last week Greg specifically disavowed any timeline and said it could take a year or more

real oar
#

As long as we know it is permissive

marble vine
#

You can use it to train a classifier

real oar
#

EleutherAI, Allen, etc would be well within the realm of the TOS

#

So I think training a classifier with hand labeled data from it would make sense

#

We just, as you stated, can't release the data itself

#

Also, I did bring up to them that the TOS on the public domain data doesn't make much sense to me either haha

#

Maybe I should post that on the HF thread

real oar
#

I'm not sure about fine edu now haha

#

It might actually be worse than random

real oar
#

synthetically generated 300k binary labels on the institutional data. after a review of 5k sample pages I would say it is very accurate

#

Going to see how well a classifier can be trained to filter out the low quality pages, e.g. TOC, appendix, and such
