# Handling Errors in Tabular ML Data

red wave

Have you guys found any good content on finding issues/errors/problems in tabular ML data? I searched for a bit and couldn't find much of anything, so I decided to write my own notebook and article.

I basically trained a stock XGBoost model on the data and got a baseline accuracy of 67%. Then I used data-centric techniques (the open-source cleanlab library) to find label errors in the data. After removing the errors and retraining the SAME model, I got an accuracy of 90%! I think it's pretty cool that I got such a big bump in performance without touching the hyperparameters or the model itself.
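
For a rough idea of what the label-error step looks for, here is a minimal sketch in plain Python. It is not cleanlab's implementation, just a simplified per-class thresholding rule in the same spirit. It assumes you already have out-of-sample predicted probabilities (for example from cross-validation), and the names `find_suspect_labels`, `labels`, and `pred_probs` plus the toy numbers are made up for illustration.

```python
from collections import defaultdict


def find_suspect_labels(labels, pred_probs):
    """Return indices of rows whose given label looks wrong.

    labels: list of int class ids (the given, possibly noisy labels).
    pred_probs: list of per-class probability lists, ideally out-of-sample.
    Simplified illustrative rule, NOT cleanlab's actual algorithm.
    """
    # Per-class threshold: the mean predicted probability of class k
    # over the rows that are labeled k.
    sums, counts = defaultdict(float), defaultdict(int)
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    thresholds = {k: sums[k] / counts[k] for k in counts}

    suspects = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Other classes predicted at or above their own threshold.
        confident_other = [
            k for k in thresholds if k != y and probs[k] >= thresholds[k]
        ]
        # Flag the row if another class is confidently predicted and the
        # given label falls below its own class threshold.
        if confident_other and probs[y] < thresholds[y]:
            suspects.append(i)
    return suspects


# Toy data: row 2 is labeled 0, but the model strongly prefers class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
suspects = find_suspect_labels(labels, pred_probs)
clean_idx = [i for i in range(len(labels)) if i not in suspects]
print(suspects)  # prints [2]
```

The rows in `clean_idx` are what you would retrain the same model on. In practice cleanlab handles this step for you with more careful estimation, so treat the above only as intuition for why a confidently contradicted label gets flagged.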

Here is the blog if you want to check it out :) How do you guys find issues in your training data? https://cleanlab.ai/blog/label-errors-tabular-datasets/

river shard

why did you have errors there to begin with? were these artifacts of data annotation? some naturally occurring processes contain errors, and if you remove them from both training and testing you'll end up with a model that won't work in production.
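
one way to act on that concern, as a small sketch in plain Python: drop the flagged rows from the training split only and leave the held-out test split untouched, so the score still reflects the kind of noisy data the model may see in production. the function name and the toy indices are hypothetical, and whether this is the right call depends on where the errors come from.

```python
def clean_train_only(train_idx, test_idx, suspect_idx):
    """Remove suspected label errors from the training split only."""
    suspects = set(suspect_idx)
    cleaned_train = [i for i in train_idx if i not in suspects]
    # The test split is returned unchanged on purpose.
    return cleaned_train, list(test_idx)


train_idx = [0, 1, 2, 3, 4, 5]
test_idx = [6, 7, 8, 9]
suspect_idx = [2, 5, 7]  # row 7 sits in the test split and stays there

cleaned_train, kept_test = clean_train_only(train_idx, test_idx, suspect_idx)
print(cleaned_train)  # prints [0, 1, 3, 4]
print(kept_test)      # prints [6, 7, 8, 9]
```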