strange behaivour of 'inverse_transform' function in sklearn.preprocessing.MinMaxScalar - python

I used MinMaxScalar function in sklearn.preprocessing for normalizing the attributes of some of my variables(array) to use that in a model(linear regression), after the model creation and training
I tested my model with x_test(splited usind train_test_split) and stored the result in some variable(say predicted) ,for evaluating purpose i wanna evaluate my prediction with the original dataset for that i used "MinMaxScalar.inverse_transform" function, that function works well when my code is in below order,
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=27)
sc=MinMaxScaler(feature_range=(0,1))
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_train)
y_train=y_train.reshape(-1,1)
y_train=sc.fit_transform(y_train)
when i changed the order like the below code it throws me error
on-broadcastable output operand with shape (379,1) doesn't match the
broadcast shape (379,13))
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=27)
sc=MinMaxScaler(feature_range=(0,1))
x_train=sc.fit_transform(x_train)
y_train=y_train.reshape(-1,1)
y_train=sc.fit_transform(y_train)
x_test=sc.fit_transform(x_train)
please compare the two photos for better understanding of my query

It can be seen from the linked printscreen figure that you use the same MinMaxScaler to fit and transform both the train and test x-data, and also the training y-data (which does not make sense).
The correct process would be
Fit the scaler with train x-data. The fit_transform() also transforms (scales) the x_train.
sc = MinMaxScaler(feature_range=(0,1))
x_train = sc.fit_transform(x_train)
Scale also the test x-data with the same scaler. Do not fit here; just scale/transform.
x_test = sc.transform(x_test)
If you think scaling is needed also for y-data, you will have to fit another scaler for that purpose. It could also be that there is no need for scaling the y-data.
# Option A: Do not scale y-data
# (do nothing)
# Option B: Scale y-data
sc_y = MinMaxScaler(feature_range=(0,1))
y_train = sc_y.fit_transform(y_train)
After you have trained your model (lr), you can make predictions with the scaled x_test and the model:
# Option A:
predicted = lr.predict(x_test)
# Option B:
y_test_scaled = lr.predict(x_test)
predicted = sc_y.inverse_transform(y_test_scaled)

Related

Can I use StandardScaler() on whole data set, or should I calculate on train and test sets separately?

I'm developing a SVR for ~100 continuous features and a continuous label.
For scaling the data, I wrote:
#Read in
df = pd.read_csv(data_path,sep='\t')
features = df.iloc[:,1:-1] #100 features
target = df.iloc[:,-1] #The label
names = df.iloc[:,0] #Column names
#Scale features
scaler = StandardScaler()
scaled_df = scaler.fit_transform(features)
# rename columns (since now its an np array)
features.columns = df_columns
So now I have a scaled data frame, and my next step was to split into train and test, and then develop a model (SVR):
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
model = SVR()
...and then I fit the model to the data.
But I noticed other people don't fit the StandardScaler() to the whole data frame, but they split the dataframe into train and test first, and then apply StandardScaler() to each separately.
Is there a difference between whether you apply the StandardScaler to the whole data frame, or train and test separately?
The previous answer says that you should separate the training and testing set when scaling, otherwise the testing one might bias the transformation of the training one. This is half correct and half wrong.
If you do the transformation separately, then it might well be that the training set will get scaled to wrong proportions (e.g. if it comes from a narrow continuous time range, thus taking on a subset of the values range). You will end up having wrong values for the variables of the test set.
In general, what you must do is scale on the training set and transfer the scale over to the testing set. This is done by using the methods fit and transform separately, as seen in the documentation.
You need to apply StandardScaler to the training set to prevent the distribution of the test set leaking into the model. If you fit the scaler on the full dataset before splitting, the test set information is used to transform the training set and use it to train the model.

Incomprehension of input data for the LSTM model

I'm facing a problem I can't solve. Indeed, I try to create a model LSTM with keras, but I don't understand what the input data format should be.
My data train and my data test look like this:
date/value/value/value/value/value_i_want_to_predict
I've seen some people doing this:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)X_train = []
y_train = []
for i in range(60, len(training_set_scaled)):
X_train.append(training_set_scaled[i-60: i, 0])
y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
But if I do that how do I predict my features without modifying the test data set?
I have a hard time understanding why we do this. Moreover, what I would like to use the values to predict the target in the last column. With this method I feel like I have to change the format of the data test and it's important that I can test the model on test data that are different and that I don't have to change.
Can someone help me?
EDIT
scaler.fit(df_train_x)
X_train = scaler.fit_transform(df_train_x)
X_test = scaler.transform(df_test_x)
y_train = np.array(df_train_y)
y_train = np.insert(y_train, 0, 0)
y_train = np.delete(y_train, -1)
The shape of the data is: (2420, 7)
That what I did. But The shape still remain 2D. So i used :
generator = TimeseriesGenerator(X_train, y_train, length=n_input, batch_size=32)
And the input shape of first layer is:
model.add(LSTM(150, activation='relu', return_sequences=True,input_shape=(2419, 7)))
but when i fit the generator to the model:
ValueError: Error when checking target: expected dense_10 to have 3 dimensions, but got array with shape (1, 1)
i really don't understand
I'm not sure to fullly understand your question but I will try my best.
I think the code you provided is problem specific, meaning it maybe not suitable for your imlementation.
For an LSTM (and for pretty much any neural network) you always want to scale your data before feeding it to the model. This helps avoid having completely different data ranges across your features. The MinMaxScaler scale your features to the range provided. For an explanation of why do you need scaling, you can have a look at this article.
Usualy, you want to first split your dataset in training and testing sets, using for example the train_test_split function of sklearn, then scale your features.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = data.drop("feature_I_want_to_predict",axis=1)
y = data["feature_I_want_to_predict"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
That way, X_train represent your training data, and y_train represent your labels for the training data. (and similarly for the test data)
I here used the StandardScaler instead of the MinMaxScaler. The standard scaler substracts the mean of the feature then divides by the standard deviation.

Why does calling transform() on test data return an error that the data is not fitted yet?

While performing feature scaling, instead of assigning a variable to StandardScaler(), when coded like this:
from sklearn.preprocessing import StandardScaler
x_train = StandardScaler().fit_transform(x_train)
x_test = StandardScaler().transform(x_test)
It gives the following error:
NotFittedError: This StandardScaler instance is not fitted yet. Call
'fit'
with appropriate arguments before using this method.
whereas, following code works fine (after giving an identifier to StandardScaler()):
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
Here, x_train is the training dataset and x_test is the test dataset.
Can someone please explain that and why is it happening?
When you call StandardScaler(), you create a new (a.k.a. unfitted) object of the standscaler class. If you want to use it, you have to fit it before you can transform any data with it.
What you "told" the code to do was (pseudocode):
Create a new scaler object
Fit it to your training data
Create another new scaler object
Don't fit it to anything, but use it to transform some data
In the secod example, you created a single scaler object, fitted it to your data, then used the same object to transform your test data (which is the correct method to use)
Its always recommended to use fit() on the train data-set and transform() on the test data-set. As the μ and σ is calculated from the training data set, the same needs to be applied to normalize the test data set.
avoid using fittransform() as it combines both the steps on train data-set.
Since no one has posted code to fix the issue, per #G. Anderson's answer, the solution code would be:
# create a Scaler object and fit it to your training data
scaler = StandardScaler().fit(X_train)
# scale x_train and x_test
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Keras Regression using Scikit Learn StandardScaler with Pipeline and without Pipeline

I am comparing the performance of two programs about KerasRegressor using Scikit-Learn StandardScaler: one program with Scikit-Learn Pipeline and one program without the Pipeline.
Program 1:
estimators = []
estimators.append(('standardise', StandardScaler()))
estimators.append(('multiLayerPerceptron', KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
log = pipeline.fit(X_train, Y_train)
Y_deep = pipeline.predict(X_test)
Program 2:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.fit_transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
My problem is that Program 1 can achieve R2 score as 0.98 (3 trials on average) while Program 2 only achieve R2 score as 0.84 (3 trials on average.) Can anyone explain the difference between these two programs?
In the second case, you are calling StandardScaler.fit_transform() on both X_train and X_test. Its wrong usage.
You should call fit_transform() on X_train and then call only transform() on the X_test. Because thats what the Pipeline does.
The Pipeline as the documentation states, will:
fit():
Fit all the transforms one after the other and transform the data,
then fit the transformed data using the final estimator
predict():
Apply transforms to the data, and predict with the final estimator
So you see, it will only apply transform() to the test data, not fit_transform().
So elaborate my point, your code should be:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
#This is the change
X_test = scale.transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
Calling fit() or fit_transform() on test data wrongly scales it to a different scale than what was used on train data. And is a source of change in prediction.
Edit: To answer the question in comment:
See, fit_transform() is just a shortcut function for doing fit() and then transform(). For StandardScaler, fit() doesnt return anything, just learns the mean and standard deviation of data. And then transform() applies the learning on the data to return new scaled data.
So what you are saying leads to below two scenarios:
Scenario 1: Wrong
1) X_scaled = scaler.fit_transform(X)
2) Divide the X_scaled into X_scaled_train, X_scaled_test and run your model.
No need to scale again.
Scenario 2: Wrong (Basically equal to Scenario 1, reversing the scaling and spitting operations)
1) Divide the X into X_train, X_test
2) scale.fit_transform(X) [# You are not using the returned value, only fitting the data, so equivalent to scale.fit(X)]
3.a) X_train_scaled = scale.transform(X_train) #[Equals X_scaled_train in scenario 1]
3.b) X_test_scaled = scale.transform(X_test) #[Equals X_scaled_test in scenario 1]
You can try any of the scenario and maybe it will increase the performance of your model.
But there is one very important thing which is missing in them. When you do scaling on the whole data and then divide them into train and test, it is assumed that you know the test (unseen) data, which will not be true in real world cases. And will give you results which will not be according to real world results. Because in the real world, whole of the data will be our training data. It may also lead to over-fitting because the model has some information about the test data already.
So when evaluating the performance of machine learning models, it is recommended that you keep aside the test data before performing any operations on it. Because it is our unseen data, we know nothing about it. So ideal path of operations would be the one I answered, ie.:
1) Divide X into X_train and X_test (same for y)
2) X_train_scaled = scale.fit_transform(X_train) [#Learn the mean and SD of train data]
3) X_test_scaled = scale.transform(X_test) [#Use the mean and SD learned in step2 to convert test data]
4) Use the X_train_scaled for training the model and X_test_scaled in evaluation.
Hope it makes sense to you.

scikit-learn - Convert pipeline prediction to original value/scale

I've create a pipeline as follows (using the Keras Scikit-Learn API)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, nb_epoch=50, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
and fit it with
pipeline.fit(trainX,trainY)
If I predict with pipline.predict(testX), I (believe) I get standardised predictions.
How do I predict on testX so that predictedY it at the same scale as the actual (untouched) testY (i.e. NOT standardised prediction, but instead the actual values)? I see there is an inverse_transform method for Pipeline, however appears to be for only reverting a transformed X.
Exactly. The StandardScaler() in a pipeline is only mapping the inputs (trainX) of pipeline.fit(trainX,trainY).
So, if you fit your model to approximate trainY and you need it to be standardized as well, you should map your trainY as
scalerY = StandardScaler().fit(trainY) # fit y scaler
pipeline.fit(trainX, scalerY.transform(trainY)) # fit your pipeline to scaled Y
testY = scalerY.inverse_transform(pipeline.predict(testX)) # predict and rescale
The inverse_transform() function maps its values considering the standard deviation and mean calculated in StandardScaler().fit().
You can always fit your model without scaling Y, as you mentioned, but this can be dangerous depending on your data since it can lead your model to overfit. You have to test it ;)

Categories

Resources