Common Pile Edu | EleutherAI | Page 1

mystic canyon Jun 12, 2025, 9:25 PM

#

The Common Pile dataset is a phenomenal collection of curated data, but can we do better for small LM training?

In recent times we've seen incredibly small but high-quality corpora that can produce really capable small LMs by aggressively filtering for only "educational" content.

Why not try the same for Common Pile: creating a pre-shuffled, pre-weighted, high-quality subset designed for efficient SLM training, on the order of ~200-300B tokens?

I'd like to gauge interest in such a project, because I've already seen first hand the advantages of training with SmolLM Corpus for small LMs, and I think having a permissively licenced counterpart derived from the Common Pile collection would benefit us a lot.

marble vine Jun 12, 2025, 9:37 PM

#

I'm very interested in this, however I recommend waiting for the next version of the dataset. We currently have a number of high-quality works in the pipeline that I expect to provide a lot of value.

Right now 200B tokens is a pretty large % of the dataset and I think a really high quality 200B subset will be easier to make with more data.

#

... as soon as I hit enter the thought occurred to me "why discourage starting now? Worst case scenario they build infra that can be used out of the box with the next version"

mystic canyon Jun 12, 2025, 9:55 PM

#

marble vine I'm very interested in this, however I recommend waiting for the next version of...

Oh I absolutely agree. The "v0.1" definitely speaks to the fact that further work is planned to expand and refine the corpus. But I think starting out by building infra for quality classifiers and filtering schemes would be a good starting point in the meantime. It also gives us time to build momentum because this is absolutely not a one woman job, so getting some hype to attract other contributors absolutely cannot be a bad thing!

west geode Jun 12, 2025, 10:04 PM

#

Interesting. Question: do you see this differentiating anywhere from the previous approach (eg FineWeb-Edu, Stack-Edu-Python, etc) of just training a classifier for educational content and filtering on that?

mystic canyon Jun 12, 2025, 10:10 PM

#

So, naturally, the best starting point would be the quality classifiers themselves. While we could follow the recipe for SmolLM corpus I'm unsure if that's the best idea.

While I cannot speak for the classifier used for the fineweb-edu split, I do have experience with the "educational code classifier" that was used for the python-edu split; if you've followed some of my projects, I released two pre-shuffled mixes of the SmolLM corpus (Avelina/smollm-corpus and Avelina/smollm-corpus-cleaned) and two corresponding python-edu splits (Avelina/python-edu and Avelina/python-edu-cleaned).

The reason for needing to release "cleaned" versions was because the code classifier didn't do a great job in the python edu split. In fact it did a shite job. When training LMs I found weird loss dips (not spikes, dips) and I traced it back to some HORRIFIC python code containing lots of repetitions. I'm talking hardcoded python calculators with several thousand lines of if statements, unit tests for every possible user input, scripts with hard coded lists of all words in the English language, and an instance of a "python file" which was literally just a freaking harry potter book that was clearly copy pasted from a pdf because it had page numbers and chapter headings every dozen lines.

So while in general the code classifier did a good job, it clearly deemed some utter crap as being "high quality". And while I don't expect the text classifier to suffer from the same sort of issues, I think it's definitely worth revisiting things and maybe adding in some more quality control filters to make sure the classifiers don't get tripped up by documents like these.

mystic canyon Jun 12, 2025, 10:19 PM

#

west geode Interesting. Question: do you see this differentiating anywhere from the previou...

I think we should definitely be taking a similar approach, but combine it with the existing filtering used for Common Pile as a starting point.

I also have an idea for how we can use the classifier quality signals in a different way to prior methods: previously curators have used a specific "cutoff" point when creating such corpora, e.g. retaining only documents which score above 3/5, and then maybe performed further filtering by retaining only the top N scoring documents from the final data mix to create the dataset...

I think we can do things differently. Firstly, we determine the size of the dataset we want, let's say 100B tokens for simplicity, and then we look at the data mix used to train the Comma models (probably just the initial phase, not the cooldown) and scale down those values to sum up to 100B. This gives us a number of tokens for each subset (which we can roughly translate to a number of documents based on average document size) and using that number we retain only the top N documents from each individual subset rather than from Common Pile as a whole. This means we subsample each split individually based on the desired data mix, retaining fewer but highest quality documents from smaller splits and a broader range of documents from the larger splits.

#

And as more work is done on Common Pile and newer Comma models are released, we'll likely get more refined recipes for the "ideal" data mixes. Combined with the higher document counts of future iterations of Common Pile, this quality filtering scheme could land us an excellent dataset for small model training.

#

Doing filtering individually also makes our lives a bit easier. For example if we wanted to just take the entirety of common pile and filter first, mix later, we'd have a hard job making sure there is no "bias" towards certain splits over others, e.g. rating arxiv docs higher on average than wikimedia purely down to nuances in the document structures. Classifying each subset separately means the absolute scores no longer matter as we only care about the top documents within each split.
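A minimal sketch of the per-split selection described above, under stated assumptions: the record keys (`id`, `subset`, `score`, `n_tokens`) and both function names are hypothetical, and the mix numbers in the usage note are made up, not the real Comma recipe. It also spends a token budget directly for each split instead of converting the budget to a document count via average document size, which is a small variation on the idea.

```python
from collections import defaultdict


def scale_mix(mix_tokens, target_tokens):
    """Scale a reference data mix so the per-subset budgets sum to target_tokens."""
    total = sum(mix_tokens.values())
    return {name: target_tokens * n / total for name, n in mix_tokens.items()}


def select_top_per_subset(docs, budgets):
    """Keep the highest-scoring docs of each subset until its token budget is spent.

    docs: iterable of dicts with (hypothetical) keys id, subset, score, n_tokens.
    Scores are only ever compared within a subset, never across subsets.
    """
    by_subset = defaultdict(list)
    for doc in docs:
        by_subset[doc["subset"]].append(doc)

    kept = []
    for name, group in by_subset.items():
        group.sort(key=lambda d: d["score"], reverse=True)
        used = 0
        for doc in group:
            if used >= budgets.get(name, 0):
                break  # budget for this subset is spent
            kept.append(doc["id"])
            used += doc["n_tokens"]  # the last doc kept may overshoot slightly
    return kept
```

For example, `scale_mix({"arxiv": 300, "wiki": 100}, 40)` gives arxiv three quarters of the budget and wiki one quarter. A wiki classifier that scores on a completely different scale from the arxiv one does not change which arxiv documents survive.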

west geode Jun 12, 2025, 10:33 PM

#

Makes sense. I was going to suggest something similar based on splits, since this also allows us to make more domain-specific decisions about what's educational enough to make the cut. E.g. obviously pretty much everything on arxiv is "educational" in the academic sense so what signal is the classifier picking up on? Perhaps it ends up biasing toward a particular subdomain, which is undesirable.

mystic canyon Jun 12, 2025, 11:20 PM

#

west geode Makes sense. I was going to suggest something similar based on splits, since thi...

You raise an excellent point there. Something we could attempt to do would be clustering documents within a split by some metadata or keywords or something and see if we notice any sort of distribution shifts in classifier quality signals between each cluster.

#

e.g. comparing between clusters of different keywords in the arxiv split, or publisher in the news split, or categories in the wikimedia split
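As a rough sketch of that check, assuming each document is a dict carrying a classifier `score` plus some metadata field to group on (the field names and the function name here are hypothetical), per-cluster summary statistics are enough to spot an obvious shift:

```python
from collections import defaultdict
from statistics import mean, median


def score_stats_by_cluster(docs, key):
    """Group docs by a metadata field and summarize classifier scores per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.get(key, "unknown")].append(doc["score"])
    return {
        cluster: {"n": len(scores), "mean": mean(scores), "median": median(scores)}
        for cluster, scores in groups.items()
    }
```

If one arxiv category or one news publisher sits far above the rest on mean or median score, that hints the classifier is keying on the subdomain instead of on quality.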

#

another thing we could do is split-specific finetunes of an already good classifier, maybe leveraging the differences between the "raw" and "filtered" splits of common pile to drive the per-split finetuning data

west geode Jun 12, 2025, 11:28 PM

#

I'm not sure what you mean by that last bit, but a split specific ft would be nice if you can get good quality labeled data (I'm not a huge fan of the automatic labeling method but maybe the models will have improved since Llama 3 70B)

dire sparrow Jun 13, 2025, 5:36 AM

#

marble vine I'm very interested in this, however I recommend waiting for the next version of...

Hi, I'm curious to know where we can additionally learn more about these high quality works.

cyan root Jun 13, 2025, 6:14 AM

#

I like the idea of using metadata (somehow) to cluster. The per-sample metadata is comparatively more extensive (or more can be easily obtained), given that the license provenance needed to be tracked

marble vine Jun 13, 2025, 2:38 PM

#

dire sparrow Hi, I'm curious to know where we can additionally learn more about these high qu...

What are you looking to learn about? It includes public domain books, transcriptions of government audio records, and Wikimedia data that isn't in the main dumps to name a few examples

mystic canyon Jun 13, 2025, 6:35 PM

#

cyan root I like the idea of using metadata (somehow) to cluster. The per-sample metadata ...

honestly this might be a really good starting point for this project. i doubt future versions of the Common Pile will give us less information than what we currently have in v0.1, so anything we build now in regards to a pre-processing pipeline for metadata-informed clustering should almost certainly continue to work in the future.

mystic canyon Jun 14, 2025, 5:22 PM

#

Another important question is.... who is willing to contribute? And I don't just mean contributing to the development by helping out with ideas or contributing code, but also contributing disk space and compute.

Because we're gonna need to run a lot of documents -- some of which are millions of tokens long so may need to be chunked -- through quality classifiers so we can build a database of raw quality signals to then allow us to do clustering/filtering/rebalancing/etc/etc/etc. I have a lot of compute at my disposal, but not a lot of persistent disk space. I can generate TBs worth of transient artefacts, but they get cleaned up very quickly so I won't be able to actually store the entirety of common pile myself, and even after filtering I may need someone else to store a "de facto" version.

(Just to drive the point home, remember the cleaned SmolLM corpus shuffles I generated? Only around 300GB compressed, but it no longer exists on my cluster, it got cleaned up for space. I have the code to regenerate it, but the only actual copy of it is stored on the HF hub.)

#

And just to clarify I'm not asking for charity here. If push comes to shove and no one else can really help I could probably annotate the whole corpus myself with quality signals in lots of small chunks, keeping only lists of document IDs and labels in persistent storage and dropping everything else to save space. But in the end I will absolutely need someone else with more disk space to actually generate the dataset from the list of top documents.

west geode Jun 14, 2025, 7:00 PM

#

I have trivial compute but ~10TB of aggregate space I can contribute on my NAS, but it's only 1Gb up/down so not ideal

mystic canyon Jun 14, 2025, 7:27 PM

#

1Gb up/down is MORE than enough when it comes to actually aggregating the data

west geode Jun 14, 2025, 7:58 PM

#

oh sick. consider that storage contributed then

#

i only have 1x4090 (desktop rig) to run stuff locally, which can maybe do lightweight low-latency postprocessing

real oar Jun 15, 2025, 1:25 AM

#

marble vine What are you looking to learn about? It includes public domain books, transcript...

I need to still call those states

#

I know of a few of them that actively sue over the transcripts

livid wigeon Jun 15, 2025, 2:03 AM

#

Been following the post for a couple of days and am very interested in contributing. Can help contribute code, compute and storage.
Just wanted to clarify the end goal of this effort. Are we aiming to simply release an open-source dataset, or are we also open to exploring filtering techniques more deeply, along with other ideas such as studying the efficacy of the dataset and models trained with it and potentially publishing our findings?

mystic canyon Jun 15, 2025, 3:36 PM

#

livid wigeon Been following the post for couple of days and am very interested in contributin...

The primary goal is the dataset. That is our MVP.

But this is also perfectly ripe material to publish findings. If we have the time and contributors to do so I think we should publish our findings as well, be it just the filtering techniques and statistical analysis of quality signals, or going a step further and training models too.

livid wigeon Jun 15, 2025, 5:19 PM

#

Sounds good. Consider me and my resources a part of the initiative then.
Specifically, I have access to one A100 and about 800GB of disk space available.

marble vine Jun 15, 2025, 6:45 PM

#

mystic canyon Another important question is.... who is willing to contribute? And I don't just...

What are the anticipated compute requirements? My instinct is that I want to say "we will 100% sponsor this" but it's good to know what you're getting into before making such a promise.

mystic canyon Jun 16, 2025, 3:16 AM

#

marble vine What are the anticipated compute requirements? My instinct is that I want to say ...

before I give any answer whatsoever I should probably do a survey of the current RM/quality classifier models out there. the landscape changes SO quickly so I'll get back to you when I have a better answer on the sort of scale we're looking at!

#

because honestly I don't want to fall into the same trap that SmolLM corpus did where they only extracted quality signals from truncated documents. Ideally we should be qualifying every document in its entirety, but I need to figure out if that's even feasible and what the tradeoff may be against more typical models with shorter context length.

cyan root Jun 18, 2025, 6:07 AM

#

https://fixupx.com/essential_ai/status/1935134906071531783

Essential AI (@essential_ai)

[1/5]
🚀 Meet Essential-Web v1.0, a 24-trillion-token pre-training dataset with rich metadata built to effortlessly curate high-performing datasets across domains and use cases!


meager python Jun 20, 2025, 6:13 PM

#

livid wigeon Sounds good. Consider me and my resources a part of the initiative then. Specif...

I think for handling long documents without truncation, we could implement a sliding window approach with overlap, then aggregate scores. We can prototype this with a few different window sizes to find a good spot between computational cost and quality signal preservation. That would help us in the long term if we plan on going further with it
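A small sketch of what that could look like, assuming the document is already tokenized into a list and that `score_fn` is whatever classifier call ends up being used (both function names and the default window sizes are placeholders, not anything agreed in the thread). The aggregation here is a length-weighted mean; taking the max or min over windows would be an equally plausible choice to prototype.

```python
def sliding_windows(tokens, window, overlap):
    """Yield overlapping slices that together cover the whole token list."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    start = 0
    while start < len(tokens):
        yield tokens[start:start + window]
        if start + window >= len(tokens):
            break  # this window already reached the end of the document
        start += step


def score_long_document(tokens, score_fn, window=512, overlap=64):
    """Score every window and return the length-weighted mean of the scores.

    Returns None for an empty document. score_fn is a stand-in for a real
    quality classifier that maps a list of tokens to a float.
    """
    total, weight = 0.0, 0
    for chunk in sliding_windows(tokens, window, overlap):
        total += score_fn(chunk) * len(chunk)
        weight += len(chunk)
    return total / weight if weight else None
```

Trying a few `window` and `overlap` values on a sample of long documents would show how quickly the aggregated score stabilizes relative to the extra classifier calls.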

cyan stump Jun 22, 2025, 1:28 PM

#

mystic canyon Another important question is.... who is willing to contribute? And I don't just...

I have 50TB and 8-12 A100s/H100s

south yacht Jun 22, 2025, 5:53 PM

#

Hey, I'd pretty much love to contribute except for my lack of resources...

#

Also, anyone want to work on a finance version of this?

dire sparrow Jun 22, 2025, 7:47 PM

#

I have no resources but am very interested in working on this project

mystic canyon Jun 30, 2025, 5:02 PM

#

Sorry I've not been super active with this recently. Had a couple tragedies in the family AND had neurips reviews to deal with AND got a friend staying over for a starting the day neurips reviews are due

marble vine Jun 30, 2025, 5:39 PM

#

Oh fuck I haven't started my NeurIPS reviews

viral coyote Jul 2, 2025, 2:35 PM

#

okay this is a cool idea and i am sorry i missed it previously

#

2c: i perpetually have sort of a suspicion that data classifiers are doing something weird and are difficult to get exactly right, i would be really interested to see about results by using normal compressors for classification

#

compression ratio under a few compressors will catch things like high-repeat autogenerated python (i assume psychopy by the way! psych students use it and it spits this out) and compression dictionary under eg gzip will have natural clusters

#

the dumbest check is just snappy compression ratio though

#

the size under compression metric will have more than one mode though, with code being notably very different from noncode and i think different languages varying substantially
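A quick sketch of the compression-ratio check, with one swap named plainly: it uses `zlib` from the Python standard library (DEFLATE, the same algorithm gzip uses) instead of snappy, purely so it runs without extra packages. The function names and the default threshold are placeholders, not tuned values.

```python
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size; lower means more repetitive."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw, 6)) / len(raw)


def flag_repetitive(docs, threshold=0.15):
    """Return ids of docs whose compression ratio falls below the threshold.

    docs: dict mapping id -> text. The threshold is a made-up starting point.
    """
    return [doc_id for doc_id, text in docs.items()
            if compression_ratio(text) < threshold]


# A hardcoded if-chain compresses far better than ordinary prose.
repetitive = "if x == 1: print(1)\n" * 2000
prose = ("The Common Pile is a collection of openly licensed text drawn from "
         "many sources, including code, papers, books and government records.")
print(compression_ratio(repetitive), compression_ratio(prose))
```

Because the ratio has several modes (code versus prose, and likely per language), any threshold would probably need to be set per subset instead of globally.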

viral coyote Jul 2, 2025, 8:53 PM

#

what is 200b tokens in bytes again. this would call for ~approx 800gb, right?
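For the arithmetic: 800GB is what you get if you assume roughly 4 bytes of raw text per token. That figure is a common rule of thumb for English text, not a measured number for the Comma tokenizer, and code or non-English text will differ. A tiny helper (hypothetical name) makes the assumption explicit:

```python
def estimate_bytes(n_tokens, bytes_per_token=4.0):
    """Rough uncompressed text size for a token count.

    bytes_per_token is an assumed average, not a measured value.
    """
    return n_tokens * bytes_per_token


size_gb = estimate_bytes(200e9) / 1e9
print(f"{size_gb:.0f} GB")
```

Swapping in a measured bytes-per-token value from a sample of each subset would tighten the estimate.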

viral coyote Jul 3, 2025, 12:28 AM

#

i have space but low compute, perfectly fine for some things

i think i have a question about hf storage: do we care if i use it to do things like e.g. upload a shuffled copy of the raw data, or throw metadata files for it somewhere, etc? personal account on hf claims a 300gb limit

#

if we're going to experiment with different filters across different people it's a lot nicer to do specific operations once and then upload but even subsets are likely to run kind of heavy since it's 8tb

nova lynx Jul 3, 2025, 8:23 PM

#

would anyone with high storage and low compute be interested in making a parquet-formatted common pile dataset repo? it would only really need two columns, one for the text of the document and one labeling what subset the doc is from

#

then we could use the new release of datasets to clean/filter the data in streaming mode, allowing people like me with compute but no storage to contribute

nova lynx Jul 4, 2025, 2:42 AM

#

I'm gonna test drive this with pile-uncopyrighted, since it's already streamable from hf. I'll try scoring the whole thing with two existing text classifier models (dclm's and fwedu's), and then upload the annotations to HF so y'all can experiment with it

viral coyote Jul 4, 2025, 8:51 PM

#

nova lynx would anyone with high storage and low compute be interested in making a parquet...

i would be good for this but also would need storage since it's gonna be well above hf free tier

#

i am processing these anyway so it would just be another upload

#

actually: might be a problem because of how long upload is likely to take, otherwise fine

nova lynx Jul 4, 2025, 8:57 PM

#

viral coyote i would be good for this but also would need storage since it's gonna be well ab...

oh, i thought datasets were free unless they were private? my bad

cyan root Jul 4, 2025, 9:23 PM

#

I can probably do this this weekend depending how friendly parquet is with the different subsets. Just to confirm we want the raw data?

viral coyote Jul 4, 2025, 9:28 PM

#

nova lynx oh, i thought datasets were free unless they were private? my bad

i'm looking at this specifically for common pile reupload concerns

#

i'd really like a version that is preshuffled and one that is streamable

cyan root Jul 4, 2025, 9:32 PM

#

I should probably be able to add it to Eleuther org

cyan root Jul 4, 2025, 9:35 PM

#

viral coyote i'm looking at this specifically for common pile reupload concerns

idt they are very particular with this. I think they mostly enforce the max no of files/time bracket

viral coyote Jul 4, 2025, 9:38 PM

#

cyan root idt they are very particular with this. I think they mostly enforce the max no o...

since uploading the entire thing a second time looks like a pita i kind of want to avoid uploading it somewhere that it isn't meant to be permanently

#

(i am also not done shuffling it, i am fiddling with tooling to try to get a reasonable shuffler in the dolma library)

#

but in principle it's "nice" to have the raw stuff in both shuffled and streamable

cyan root Jul 4, 2025, 9:41 PM

#

yeah, I'll try to upload a split this weekend

cyan root Jul 4, 2025, 9:41 PM

#

viral coyote (i am also not done shuffling it, i am fiddling tooling to try to get a reasonab...

are you shuffling the whole thing or the comma split?

viral coyote Jul 4, 2025, 9:43 PM

#

cyan root are you shuffling the whole thing or the comma split?

raw

#

the stack is freakishly large tbh

cyan root Jul 4, 2025, 9:47 PM

#

yeah, it's half the dataset

cyan root Jul 4, 2025, 10:08 PM

#

any parquet wizards know the best settings for streaming? From what I gathered: ~1GB per file, snappy compression and sorted (can use id column)

#

partition strategy?

viral coyote Jul 5, 2025, 1:03 AM

#

128mb row group and i think you probably want to range partition by id after sort

#

it has been a bit and tbh in prod use cases i would "go check what the one that already works does"

plain knot Jul 7, 2025, 7:05 PM

#

Just saw this, is "-edu" really the right trend to continue? I think there are better ways to build higher quality subsets, probably, at the very least due to preserving more diversity

lyric frigate Jul 7, 2025, 8:22 PM

#

Also define educational. Is that like textbooks and stuff? cuz we probs losing out on linguistic diversity/real world grounding by not including common spoken language tokens right. Also would this be a strictly english dataset?

nova lynx Jul 7, 2025, 10:55 PM

#

plain knot Just saw this, is "-edu" really the right trend to continue? I think there are b...

i'm planning to do preliminary tests with two existing classifier models: the "edu" one used for fineweb-edu, and a non-edu one used for dclm (primarily based on OpenHermes and ELI5)

#

i'm hopeful that one of them will keep more docs than the other, because common pile isn't super large already

plain knot Jul 8, 2025, 7:52 AM

#

https://arxiv.org/abs/2408.08310 I was thinking something like this perhaps, I would have compute to do this btw

arXiv.org

ScalingFilter: Assessing Data Quality through Inverse Utilization o...

High-quality data is crucial for the pre-training performance of large language models. Unfortunately, existing quality filtering methods rely on a known high-quality dataset as reference, which can introduce potential bias and compromise diversity. In this paper, we propose ScalingFilter, a novel approach that evaluates text quality based on th...

nova lynx Jul 8, 2025, 3:22 PM

#

plain knot https://arxiv.org/abs/2408.08310 I was thinking something like this perhaps, I w...

this looks cool -- which models would you use to calculate the perplexity difference?

plain knot Jul 9, 2025, 9:01 AM

#

I think you'd have to train two models specifically for it

#

the interesting thing is also you get a full ranking of all samples, so if you release that people can choose their own top k threshold
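A sketch of how a released full ranking could be used, under a loudly stated assumption: the per-document score here is the small model's perplexity divided by the large model's perplexity, which is only my reading of the general idea and is not confirmed anywhere in this thread. Which two models to use (purpose-trained or off-the-shelf) is left open, so the inputs are just per-token log-probabilities from "a small model" and "a large model". All names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_quality(docs):
    """Return doc ids sorted by small-model / large-model perplexity ratio, best first.

    docs: dict mapping id -> (small_model_logprobs, large_model_logprobs).
    The ratio as a quality score is an assumption made for this sketch.
    """
    scored = {
        doc_id: perplexity(small) / perplexity(large)
        for doc_id, (small, large) in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


def top_k(ranking, k):
    """Anyone holding the released ranking can pick their own cutoff."""
    return ranking[:k]
```

Releasing the ranking itself, not a pre-cut subset, is what lets downstream users choose their own k.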

#

not that I wanna derail this, maybe someone would like to also do the -edu thing

nova lynx Jul 10, 2025, 2:51 PM

#

plain knot not that I wanna derail this, maybe someone would like to also do the -edu thing

no worries, i was planning on doing that anyways, plus it would be good to compare w your approach

nova lynx Jul 10, 2025, 2:52 PM

#

plain knot I think you'd have to train two models specifically for it

it seemed to me like the paper used gpt2 models with no modifications, did I read something wrong?

nova lynx Jul 12, 2025, 3:31 AM

#

@cyan root @viral coyote any updates on the raw data repo?

viral coyote Jul 14, 2025, 8:23 PM

#

i need some time for a shuffle to run today and then i can reupload raw + shuffled. i can also do a proportionate subsample at like 1/10th size to get under hf size limits, would need perms on something permitting the reup of the shuffled due to its size

cyan root Jul 14, 2025, 10:16 PM

#

I almost have most of the (raw) datasets formatted to parquet (except stack). Sorry got stuck in a loop trying to normalize the columns. Will start uploading in a bit

nova lynx Jul 15, 2025, 12:53 AM

#

No rush! I'm using dolma + the Comma dataset for testing in the meantime, and that seems to be working well for now

cyan root Jul 15, 2025, 2:09 AM

#

damn, I hate parquet. Am just vibing with these configs now:
FORMAT PARQUET,
file_size_bytes '1GB',
compression 'zstd',
per_thread_output TRUE,
preserve_order FALSE,
ROW_GROUP_SIZE_BYTES '512MB'

#

used to think, why isn't it used more. So convenient. So fast

#

well now I know. It's evil

plain knot Jul 15, 2025, 6:07 AM

#

Poor baber

cyan root Jul 15, 2025, 7:56 PM

#

I uploaded most of the sets here, except for 4 of the largest

#

downloading and converting stackv2 now, and will upload the remaining when that's done

cyan root Jul 16, 2025, 7:30 PM

#

ok think I've uploaded all now

plain knot Jul 16, 2025, 7:32 PM

#

is this the new version stella mentioned or are we waiting for that one?

#

btw nice, it's good to have it in one place rather than a collection

cyan root Jul 16, 2025, 7:37 PM

#

plain knot btw nice, it's good to have it in one place rather than a collection

yeah, it's just the 0.1 raw collection, but all converted to parquet

nova lynx Jul 26, 2025, 4:48 PM

#

dumb question: has anyone considered using DoReMi or ODM to reweight the comma data?

#

the paper mentions MixMin but I'm not sure how similar that is

real oar Aug 2, 2025, 4:46 AM

#

nova lynx dumb question: has anyone considered using DoReMi or ODM to reweight the comma d...

I have found doremi to not perform well

marble vine Aug 2, 2025, 2:19 PM

#

Same

plain knot Aug 2, 2025, 2:28 PM

#

marble vine I'm very interested in this, however I recommend waiting for the next version of...

any eta on this next version?

marble vine Aug 2, 2025, 2:35 PM

#

plain knot any eta on this next version?

I'm hoping to have a meaningful update by the end of the month or early next month

plain knot Aug 2, 2025, 2:35 PM

#

alright thanks, good to know

marble vine Aug 2, 2025, 2:36 PM

#

We're also experimenting with synthetic data augmentation and rewriting some of the more unusually structured stuff (e.g., USPTO) in more natural language

nova lynx Aug 2, 2025, 9:51 PM

#

synthetic as in LLM-generated? I thought the Common Pile argued against using that in the dataset?

marble vine Aug 2, 2025, 10:22 PM

#

It wasn't in scope for our aims, but synthetic text is not eligible for copyright protection in the US.

real oar Aug 2, 2025, 11:33 PM

#

I honestly think it might be worth getting a group together to hand label say 100k docs or so from the CommonCrawl subset?

molten marsh Aug 18, 2025, 1:57 PM

#

Hey all, just saw a random link to this thread so I figured I'd mention a few relevant things:

After our aggressive filtering for the comma v0.1 dataset, we already had a small-ish dataset - it's only ~450B unique tokens. We just repeated a lot of stuff a lot to get the 1T and 2T training runs.
The comma v0.1 dataset already has some "-edu"-style filtering: for cccc we used the dclm classifier, for stackv2 we used the stack-edu classifier.
It's not obvious to me how helpful it would be to apply other classifier-based filtering to other sources. For example I wouldn't expect certain patents to be higher quality or more educational than others. If you applied the same quality classifier to the whole dataset I would just expect it to filter out some sources (like, no patents, none of the old ocr'd text) whole hog.

real oar Aug 18, 2025, 2:04 PM

#

molten marsh Hey all, just saw a random link to this thread so I figured I'd mention a few re...

This would likely be on a source by source basis. For something like caselaw, or patents as you mentioned, it would not make sense to filter with a classifier. But for the common crawl and the stackv2 I could see benefits to filtering those sources generally.

#

Or at least better classification models, is what i mean

molten marsh Aug 18, 2025, 2:05 PM

#

Sure yeah, though I'm not sure there's a ton of room to improve on stack-edu and dclm classifiers, at least for low-hanging fruit

real oar Aug 18, 2025, 2:05 PM

#

We are evaluating fine edu classifier now on some textbook data

#

That is what I meant

#

Haha

#

I think just getting something that can filter for textbook like data out of commoncrawl would be a significant improvement over the base classifier

molten marsh Aug 18, 2025, 2:06 PM

#

Sure, worth a try

real oar Aug 18, 2025, 2:07 PM

#

Talking to HarvardLIL about using a labeled version of institutional books

#

Hand labeled

molten marsh Aug 18, 2025, 2:07 PM

#

Though cccc is such a tiny slice of our data (especially after all the filtering) that I'm not sure better filtering would make a big difference in eventual model quality

real oar Aug 18, 2025, 2:08 PM

#

I think when I say generally I mean people would probably also use it for non permissive data lol

#

Actually I should have something else for this soon. TBA

marble vine Aug 18, 2025, 2:47 PM

#

@real oar I've been talking with Greg and we cannot use the IB dataset. Although the books are in the public domain (and it's legally dubious whether they can actually put ToS on them) they do have a non-open ToS. Treating them as open data requires directly picking a legal fight with IDI

real oar Aug 18, 2025, 2:51 PM

#

marble vine @real oar I've been talking with Greg and we cannot use the IB datas...

Hmmm I can use the books

#

Well technically we can as well

#

But you/we can't release them yet

#

Yet being sometime in the next months

marble vine Aug 18, 2025, 2:53 PM

#

@real oar

1. I mean that we can't use them as open data. They're permitted to be used under the terms of service, but you can't treat them as open data.
2. Last week Greg specifically disavowed any timeline and said it could take a year or more

real oar Aug 18, 2025, 2:54 PM

#

marble vine @real oar 1. I mean that we can't use them as open data. They're pe...

Huh that is strange. He said it would be some months not too long ago.

For training a classifier do we need to release the data?

#

As long as we know it is permissive

marble vine Aug 18, 2025, 2:54 PM

#

You can use it to train a classifier

real oar Aug 18, 2025, 2:54 PM

#

EleutherAI, Allen, etc would be well within the realm of the TOS

#

So I think training a classifier with hand labeled data from it would make sense

#

We just, as you stated, can't release the data itself

#

Also, I did bring up to them that the TOS on the public domain data doesn't make much sense to me either haha

#

Maybe I should post that on the HF thread

marble vine Aug 18, 2025, 3:17 PM

#

molten marsh Hey all, just saw a random link to this thread so I figured I'd mention a few re...

Is there an easy way to download that 450B?

molten marsh Aug 18, 2025, 4:07 PM

#

marble vine Is there an easy way to download that 450B?

Yes, it's just the comma dataset on HF (though it's not tokenized, but the tokenizer is available)

marble vine Aug 18, 2025, 4:08 PM

#

molten marsh Yes, it's just the comma dataset on HF (though it's not tokenized, but the token...

These ones: https://huggingface.co/collections/common-pile/common-pile-v01-filtered-data-68300bb0a946d10dda697663

Common Pile v0.1 Filtered Data - a common-pile Collection

molten marsh Aug 18, 2025, 4:10 PM

#

Not quite, https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset is closer, with the caveat that we do under 1 epoch on some of those sources so that includes data that comma never saw.