So, naturally, the best starting point would be the quality classifiers themselves. While we could follow the recipe for the SmolLM corpus, I'm unsure if that's the best idea.
While I cannot speak for the classifier used for the fineweb-edu split, I do have experience with the "educational code classifier" that was used for the python-edu split. If you've followed some of my projects, you may know I released two pre-shuffled mixes of the SmolLM corpus (Avelina/smollm-corpus and Avelina/smollm-corpus-cleaned) and two corresponding python-edu splits (Avelina/python-edu and Avelina/python-edu-cleaned).
I needed to release "cleaned" versions because the code classifier didn't do a great job on the python-edu split. In fact it did a shite job. When training LMs I found weird loss dips (not spikes, dips) and I traced them back to some HORRIFIC Python code containing lots of repetitions. I'm talking hardcoded Python calculators with several thousand lines of if statements, unit tests for every possible user input, scripts with hardcoded lists of all words in the English language, and an instance of a "Python file" which was literally just a freaking Harry Potter book that was clearly copy-pasted from a PDF, because it had page numbers and chapter headings every dozen lines.
So while in general the code classifier did a good job, it clearly deemed some utter crap as being "high quality". And while I don't expect the text classifier to suffer from the same sort of issues, I think it's definitely worth revisiting things and maybe adding in some more quality control filters to make sure the classifiers don't get tripped up by this kind of junk.
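
As a rough sketch of what such extra filters could look like: the snippet below flags files that are mostly repeated lines, files that compress suspiciously well, and "Python files" that don't even parse. To be clear, this is not the procedure used to build the cleaned splits. The function names and the thresholds (`max_dup_fraction`, `min_compression_ratio`) are made up for illustration and would need tuning against real samples.

```python
import ast
import zlib
from collections import Counter


def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(lines)


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; low values hint at heavy repetition."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw)) / len(raw)


def parses_as_python(text: str) -> bool:
    """True if the text is syntactically valid for the running Python version."""
    try:
        ast.parse(text)
    except (SyntaxError, ValueError):
        return False
    return True


def flag_reasons(text: str,
                 max_dup_fraction: float = 0.5,       # hypothetical threshold
                 min_compression_ratio: float = 0.1   # hypothetical threshold
                 ) -> list[str]:
    """Return the list of heuristics a file trips; empty means it passed."""
    reasons = []
    if duplicate_line_fraction(text) > max_dup_fraction:
        reasons.append("duplicate_lines")
    if compression_ratio(text) < min_compression_ratio:
        reasons.append("highly_compressible")
    if not parses_as_python(text):
        reasons.append("not_parseable")
    return reasons
```

Something like `flag_reasons(sample)` could be run before or after the classifier, with flagged files dropped or sent for manual review. One caveat: the parse check would also reject legitimate Python 2 code, so it is probably better treated as a signal than as a hard filter.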