Have you guys found any good content on finding issues/errors/problems in tabular ML data? I searched for a bit and couldn't find much of anything, so I decided to write my own notebook and article.
I basically trained a stock XGBoost model on the data and got a baseline accuracy of 67%. Then I used data-centric techniques (the open-source cleanlab library) to find label errors within the data. After removing the errors and retraining the SAME model, I got an accuracy of 90%! I think it's pretty cool that I got such a big bump in performance without touching the hyperparameters or the model itself.
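For anyone who wants the gist of how label errors can be flagged, here is a rough toy sketch in plain Python. To be clear, this is a simplified illustration of the general idea (flag a row when the model confidently predicts a class other than its given label), not cleanlab's actual implementation and not the code from the notebook. The function name `find_suspect_labels` and the probabilities are made up for the example, and it assumes the predicted probabilities are out-of-sample (e.g. from cross-validation), otherwise the model may just memorize the bad labels.

```python
def find_suspect_labels(labels, pred_probs):
    """Return indices of rows whose given label disagrees with a confident prediction.

    labels:     list of int class labels, one per row
    pred_probs: list of per-class probability lists, one per row (out-of-sample)
    Assumes every class appears at least once in `labels`.
    """
    n_classes = len(pred_probs[0])

    # Per-class confidence threshold: the average probability the model
    # assigns to class k across the rows that are labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))

    suspects = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes the model is "confident" about for this row.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                suspects.append(i)
    return suspects


# Made-up toy data: rows 2 and 5 look mislabeled.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, model strongly says 1
    [0.20, 0.80],
    [0.10, 0.90],
    [0.85, 0.15],  # labeled 1, model strongly says 0
]

suspects = find_suspect_labels(labels, pred_probs)
keep = [i for i in range(len(labels)) if i not in suspects]
print(suspects)  # [2, 5]
print(keep)      # [0, 1, 3, 4]
```

In a real workflow you would then retrain the same model on only the `keep` rows and compare accuracy against the baseline on an untouched test set.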
Here is the blog if you want to check it out :) https://cleanlab.ai/blog/label-errors-tabular-datasets/
How do you guys find issues within your training data?