#why do we use the same StandardScaler object on both the test and train data?


stoic pier
#

Why do we use the same StandardScaler object on both the test and train data (x_train and x_test)? Wouldn't this lead to data leakage between the x_train and x_test data sets?

Calling the .fit_transform() method on x_train calculates the mean of x_train, and by using the same object for x_test, the standardisation is done using the mean calculated for x_train.

stray pollen
#

StandardScaler only works out the scaling statistics from the data it is fitted on, stores them, and applies them. It does not store the data values themselves. So you fit to the training data, the scaler stores the statistics from that fit, and it then transforms the training data. Next you use the scaler with the stored settings to transform the testing data in exactly the same way you transformed the training data.

#

The goal is just to ensure that the training data and testing data are transformed around the same mean/variability/etc.
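A minimal sketch of that fit-then-reuse flow, in case it helps. The data here is made up with NumPy purely for illustration, and the variable names are assumptions, not taken from the course code:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration only: 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from X_train only
X_test_scaled = scaler.transform(X_test)        # reuses the stored statistics, no refit

# The stored mean is the training mean; X_test never influenced it.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))

# Transforming the test set is just (x - train_mean) / train_std.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_scaled, manual))
```

Nothing from X_test flows back into the scaler here; the test rows are only read, never used to estimate anything.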

#

You can see what 4 values it stores in the _reset method. Those are reset every time fit is called.
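Those four values are exposed as the fitted attributes mean_, var_, scale_ and n_samples_seen_. A small sketch with toy arrays (invented here just for the example) shows them being replaced when fit is called a second time:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays, made up for illustration.
first = np.array([[1.0], [2.0], [3.0]])
second = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = StandardScaler()

scaler.fit(first)
mean_after_first = scaler.mean_.copy()
print(scaler.mean_, scaler.var_, scaler.scale_, scaler.n_samples_seen_)

# Calling fit again discards the old statistics and recomputes them.
scaler.fit(second)
print(scaler.mean_, scaler.var_, scaler.scale_, scaler.n_samples_seen_)
```

After the second fit, nothing about the first array is left in the scaler.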

stoic pier
#

It leads to some form of correlation, yes?

stray pollen
#

I don't really do any DS stuff. I can just tell you what the code does. Where did you get the code that you're asking about?

stoic pier
#

I'm doing a course on ML

#

got it from there

stray pollen
#

So from that I'd say the values needed for fitting the data don't leak any meaningful values between the data sets.

mellow ibex
#

Using the same scaler is precisely what avoids data leakage. If you were to use a different scaler for the test set, you would be extracting statistics from the test set, thus implicitly extracting information from the test set.

Think of it this way: you are transforming the data using the statistics you would have access to in a real-world scenario.

Fitting a separate scaler on the test set would lead to an optimistic evaluation of the model.
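To make the difference concrete, here is a tiny sketch with hand-picked numbers (not from any real data set). The names shared and separate are just labels chosen for this example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hand-picked toy values so the arithmetic is easy to follow.
X_train = np.array([[0.0], [10.0]])   # mean 5, std 5
X_test = np.array([[10.0], [20.0]])   # mean 15, std 5

# Recommended: one scaler, fitted on the training data only.
shared = StandardScaler().fit(X_train)

# What the question worries about avoiding: a second scaler fitted on the test data.
separate = StandardScaler().fit(X_test)

raw_value = np.array([[10.0]])
shared_result = shared.transform(raw_value)      # (10 - 5) / 5 = 1
separate_result = separate.transform(raw_value)  # (10 - 15) / 5 = -1

print(shared_result, separate_result)
```

With the separate scaler, the same raw value 10 would reach the model as +1 during training and as -1 during testing, and the test-set mean would have been used to produce it.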

#

It might be useful to see it as a pre-processing step. It's completely okay to apply transformations to data based on properties you expect them to have, as long as you would have access to said properties in a real scenario.

#

The mean and standard deviation you get from the training data might not be the true mean and standard deviation of the entire possible population of that particular domain. But they are a good approximation as long as your training data is representative of that domain.

stoic pier
#

Oh, so ideally we would want a scaling transformation whose statistics have been derived from data we have access to at the time (which is the training data). Is this what you meant?

mellow ibex
#

Precisely, the closest we can get to the real statistic, but using only data we have access to.

#

The only time you will use the test data to fit a scaler is when you're putting a model in production. Then you'll train your model with all the data you have, including the test set. In this case, the test data is accessible to you. The problem with this is that you won't have any data left to evaluate the model. But you can assume it will be slightly better than your train/test data model, since it will be trained on more data. Hopefully my phrasing makes sense to you.
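A rough sketch of that two-stage workflow, under a few assumptions: the data is synthetic, LogisticRegression is only a placeholder model, and wrapping the scaler and model with make_pipeline is one convenient way to do it rather than something the course necessarily uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and labels, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stage 1: evaluation. The scaler inside the pipeline is fitted on X_train only.
eval_model = make_pipeline(StandardScaler(), LogisticRegression())
eval_model.fit(X_train, y_train)
test_accuracy = eval_model.score(X_test, y_test)
print(test_accuracy)

# Stage 2: production. Refit scaler and model on everything, test rows included.
# There is no held-out data left to score this final model.
final_model = make_pipeline(StandardScaler(), LogisticRegression())
final_model.fit(X, y)
```

The accuracy from stage 1 is the number you report; the stage 2 model is the one you ship.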

stoic pier
#

Also @stray pollen thank you so much!

stray pollen
#

👍
