Why do we use the same StandardScaler object on both the test and train data? | Smarter Dev

stoic pier Dec 20, 2022, 3:18 PM

#

Why do we use the same StandardScaler object on both the train and test data (x_train and x_test)? Wouldn't this lead to data leakage between the x_train and x_test data sets?

Calling the .fit_transform() method on x_train calculates the mean (and standard deviation) of x_train, and by using the same object for x_test, the standardisation is done using the statistics calculated for x_train.

stray pollen Dec 20, 2022, 3:39 PM

#

StandardScaler only determines the scaling statistics, stores them, and applies them to the training data. It does not store the individual values from the data, only summary statistics such as the mean and variance. So you fit it to the training data, the scaler stores the result of that fit, and it then transforms the training data. Next you use the scaler with the stored settings to transform the testing data in exactly the same way you transformed the training data.

#

The goal is just to ensure that the training data and testing data are transformed around the same mean/variability/etc.
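A minimal sketch of that fit-on-train, transform-both pattern. The arrays are made-up toy values, and the names `x_train` and `x_test` simply follow the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: one feature, four training rows, two test rows.
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
x_test = np.array([[2.0], [6.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from x_train only,
# then scales x_train with them.
x_train_scaled = scaler.fit_transform(x_train)

# transform reuses the stored training statistics; nothing is learned from x_test.
x_test_scaled = scaler.transform(x_test)

print(scaler.mean_)    # mean learned from x_train
print(x_test_scaled)   # x_test expressed relative to the training mean and spread
```

Note that the scaled test set will generally not have mean 0 and variance 1, which is expected: it is measured against the training statistics, not its own.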

#

If you're curious to see how StandardScaler is implemented you can see it here https://github.com/scikit-learn/scikit-learn/blob/dc580a8ef/sklearn/preprocessing/_data.py#L644

#

You can see which 4 values it stores in the `_reset` method. Those are cleared and recomputed every time `fit` is called.
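A small sketch for inspecting the stored values, assuming the four in question are the fitted attributes `mean_`, `var_`, `scale_`, and `n_samples_seen_` (the thread does not name them). The input arrays are toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# First fit: only summary statistics are kept, not the rows themselves.
scaler.fit(np.array([[1.0], [2.0], [3.0], [4.0]]))
first_mean = scaler.mean_.copy()
first_n = scaler.n_samples_seen_
print(scaler.mean_, scaler.var_, scaler.scale_, scaler.n_samples_seen_)

# Second fit on different data: the earlier statistics are discarded and replaced.
scaler.fit(np.array([[10.0], [20.0]]))
second_mean = scaler.mean_.copy()
second_n = scaler.n_samples_seen_
print(scaler.mean_, scaler.var_, scaler.scale_, scaler.n_samples_seen_)
```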

stoic pier Dec 20, 2022, 4:24 PM

#

stray pollen If you're curious to see how `StandardScaler` is implemented you can see it here...

But scaling the train and test data around the same mean/variance is also a form of data leakage, yes?

#

It leads to some form of correlation, yes?

stray pollen Dec 20, 2022, 4:35 PM

#

I don't really do any DS stuff. I can just tell you what the code does. Where did you get the code that you're asking about?

stoic pier Dec 20, 2022, 4:37 PM

#

I'm doing a course on ML

#

got it from there

stray pollen Dec 20, 2022, 4:38 PM

#

So from that I'd say the values needed for fitting the data don't leak any meaningful values between the data sets.

mellow ibex Dec 20, 2022, 4:45 PM

#

Using the same scaler is precisely what avoids data leakage. If you were to fit a different scaler on the test set, you would be extracting statistics from the test set, and thus implicitly extracting information from the test set.

Think of it this way, you are transforming the data using the statistics you would have access to in a real-world scenario.

Fitting a separate scaler on the test set would lead to an optimistic evaluation of the model.
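To make the difference concrete, here is a hedged sketch comparing a scaler fit on the training data with one fit on the test data. The numbers are toy values, and `shared` and `separate` are illustrative names. It only shows that the second scaler takes its statistics from the test set, so the same raw value ends up encoded differently from what the model saw during training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
x_test = np.array([[2.0], [6.0]])

# Intended usage: statistics come from the training data only.
shared = StandardScaler().fit(x_train)

# What to avoid: this scaler learns its mean and variance from the test set.
separate = StandardScaler().fit(x_test)

raw = np.array([[2.0]])
from_shared = shared.transform(raw)      # scaled with the training mean and std
from_separate = separate.transform(raw)  # scaled with the test mean and std

print(shared.mean_, separate.mean_)
print(from_shared, from_separate)
```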

#

It might be useful to see it as a pre-processing step. It's completely okay to apply transformations to data based on properties you expect them to have, as long as you would have access to those properties in a real scenario.

#

The mean and standard deviation you get from the training data might not be the true mean and standard deviation of the entire possible population for that particular domain. But they are a good approximation as long as your training data is representative of that domain.

stoic pier Dec 20, 2022, 5:01 PM

#

Oh, so ideally we would want a scaling transformation whose statistics have been derived from data we have access to at the time (which is the training data). Is this what you meant?

mellow ibex Dec 20, 2022, 5:01 PM

#

Precisely, the closest we can get to the real statistic, but using only data we have access to.

#

The only time you would use the test data to fit a scaler is when you're putting a model into production. Then you'll train your model on all the data you have, including the test set, because at that point the test data is accessible to you. The problem with this is that you won't have any data left to evaluate the model. But you can assume it will be slightly better than your train/test model, since it will be trained on more data. Hopefully my phrasing makes sense to you.
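A rough sketch of those two phases, under assumptions not in the thread: the data is synthetic, and `LogisticRegression` inside a `make_pipeline` is just a stand-in for whatever model the course uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 10.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Evaluation phase: the scaler and the model see only the training split.
eval_model = make_pipeline(StandardScaler(), LogisticRegression())
eval_model.fit(X_train, y_train)
test_accuracy = eval_model.score(X_test, y_test)
print(test_accuracy)

# Production phase: refit on all available data, test split included.
# There is no held-out data left to score this final model.
final_model = make_pipeline(StandardScaler(), LogisticRegression())
final_model.fit(X, y)
```

Wrapping the scaler and model in one pipeline keeps the "fit on what you have, transform everything else with it" rule automatic, since `fit` and `score` apply the scaler consistently.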

stoic pier Dec 21, 2022, 4:07 AM

#

mellow ibex Precisely, the closest we can get to the *real* statistic, but using only data w...

Yes! It makes sense now, thank you so much! 😃

stoic pier Dec 21, 2022, 4:08 AM

#

Also, @stray pollen, thank you so much!

stray pollen Dec 21, 2022, 4:08 AM

#

👍
