Handling Errors in Tabular ML Data | Learn AI Together

Have you guys found any good content on finding issues/errors/problems in tabular ML data? I was searching for a bit and couldn't find much of anything so I decided to write my own notebook and article.

I basically trained a stock XGBoost model on the data and got a baseline accuracy of 67%. Then I used data-centric techniques (the open-source cleanlab library) to find label errors in the data. After removing the errors and retraining the SAME model, I got an accuracy of 90%! I think it's pretty cool that I got such a big bump in performance without touching the hyperparameters or the model itself.
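
To give a rough idea of how label errors can be flagged, here's a minimal sketch of the thresholding idea behind confident learning, in plain Python. It's a simplified illustration and not cleanlab's actual implementation: the function name `flag_suspect_labels` and the toy probabilities are made up for this example, and in a real run `pred_probs` would be out-of-sample (cross-validated) predicted probabilities from the model.

```python
from statistics import mean


def flag_suspect_labels(labels, pred_probs):
    """Return indices of examples whose given label looks suspicious.

    labels:     list of int class labels in 0..K-1 (possibly noisy)
    pred_probs: list of per-example probability lists, ideally out-of-sample
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the model's average confidence in class k
    # over the examples that are currently labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(mean(probs_k) if probs_k else 1.0)

    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        # Flag the example if the model is confident in some class,
        # but the given label is not among those classes.
        if confident and y not in confident:
            suspects.append(i)

    # Rank by self-confidence: lowest probability for the given label first.
    suspects.sort(key=lambda i: pred_probs[i][labels[i]])
    return suspects


# Toy data (hypothetical): examples 3 and 7 have labels the model disagrees with.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
pred_probs = [
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15], [0.10, 0.90],
    [0.20, 0.80], [0.10, 0.90], [0.15, 0.85], [0.95, 0.05],
]

suspects = flag_suspect_labels(labels, pred_probs)
kept = [i for i in range(len(labels)) if i not in set(suspects)]
print(suspects)  # [7, 3]
print(kept)      # [0, 1, 2, 4, 5, 6]
```

On this toy data, examples 7 and 3 get flagged because the model is confident in a class other than the given label. In the workflow described above, you would drop the flagged rows and retrain the same model on the rest; cleanlab's `find_label_issues` does a more careful version of this step.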

Here is the blog if you want to check it out :) https://cleanlab.ai/blog/label-errors-tabular-datasets/

How do you guys find issues within your training data?

Cleanlab

Handling Mislabeled Tabular Data to Improve Your XGBoost Model

Learn how to reduce prediction errors by 70% using data-centric techniques with cleanlab.
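
For anyone wondering how a 70% reduction in prediction errors lines up with the accuracies quoted in the post (67% before, 90% after), here's a quick check. This assumes the 70% figure means relative reduction in error rate; the helper name `relative_error_reduction` is just for illustration.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the original errors that were eliminated."""
    err_before = 1.0 - acc_before  # error rate of the baseline model
    err_after = 1.0 - acc_after    # error rate after cleaning the labels
    return (err_before - err_after) / err_before


# Accuracies taken from the post: 67% baseline, 90% after removing label errors.
reduction = relative_error_reduction(0.67, 0.90)
print(f"{reduction:.0%}")  # 70%
```

So going from 33% errors to 10% errors removes roughly 70% of the original mistakes.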
