scikit-learn - Convert pipeline prediction to original value/scale - python

I've create a pipeline as follows (using the Keras Scikit-Learn API)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, nb_epoch=50, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
and fit it with
pipeline.fit(trainX,trainY)
If I predict with pipline.predict(testX), I (believe) I get standardised predictions.
How do I predict on testX so that predictedY it at the same scale as the actual (untouched) testY (i.e. NOT standardised prediction, but instead the actual values)? I see there is an inverse_transform method for Pipeline, however appears to be for only reverting a transformed X.

Exactly. The StandardScaler() in a pipeline is only mapping the inputs (trainX) of pipeline.fit(trainX,trainY).
So, if you fit your model to approximate trainY and you need it to be standardized as well, you should map your trainY as
scalerY = StandardScaler().fit(trainY) # fit y scaler
pipeline.fit(trainX, scalerY.transform(trainY)) # fit your pipeline to scaled Y
testY = scalerY.inverse_transform(pipeline.predict(testX)) # predict and rescale
The inverse_transform() function maps its values considering the standard deviation and mean calculated in StandardScaler().fit().
You can always fit your model without scaling Y, as you mentioned, but this can be dangerous depending on your data since it can lead your model to overfit. You have to test it ;)

Related

Get all prediction values for each CV in GridSearchCV

I have a time-dependent data set, where I (as an example) am trying to do some hyperparameter tuning on a Lasso regression.
For that I use sklearn's TimeSeriesSplit instead of regular Kfold CV, i.e. something like this:
tscv = TimeSeriesSplit(n_splits=5)
model = GridSearchCV(
estimator=pipeline,
param_distributions= {"estimator__alpha": np.linspace(0.05, 1, 50)},
scoring="neg_mean_absolute_percentage_error",
n_jobs=-1,
cv=tscv,
return_train_score=True,
max_iters=10,
early_stopping=True,
)
model.fit(X_train, y_train)
With this I get a model, which I can then use for predictions etc. The idea behind that cross validation is based on this:
However, my issue is that I would actually like to have the predictions from all the test sets from all cv's. And I have no idea how to get that out of the model ?
If I try the cv_results_ I get the score (from the scoring parameter) for each split and each hyperparameter. But I don't seem to be able to find the prediction values for each value in each test split. And I actually need that for some backtesting. I don't think it would be "fair" to use the final model to predict the previous values. I would imagine there would be some kind of overfitting in that case.
So yeah, is there any way for me to extract the predicted values for each split ?
You can have custom scoring functions in GridSearchCV.With that you can predict outputs with the estimator given to the GridSearchCV in that particular fold.
from the documentation scoring parameter is
Strategy to evaluate the performance of the cross-validated model on the test set.
from sklearn.metrics import mean_absolute_percentage_error
def custom_scorer(clf, X, y):
y_pred = clf.predict(X)
# save y_pred somewhere
return -mean_absolute_percentage_error(y, y_pred)
model = GridSearchCV(estimator=pipeline,
scoring=custom_scorer)
The input X and y in the above code came from the test set. clf is the given pipeline to the estimator parameter.
Obviously your estimator should implement the predict method (should be a valid model in scikit-learn). You can add other scorings to the custom one to avoid non-sense scores from the custom function.

strange behaivour of 'inverse_transform' function in sklearn.preprocessing.MinMaxScalar

I used MinMaxScalar function in sklearn.preprocessing for normalizing the attributes of some of my variables(array) to use that in a model(linear regression), after the model creation and training
I tested my model with x_test(splited usind train_test_split) and stored the result in some variable(say predicted) ,for evaluating purpose i wanna evaluate my prediction with the original dataset for that i used "MinMaxScalar.inverse_transform" function, that function works well when my code is in below order,
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=27)
sc=MinMaxScaler(feature_range=(0,1))
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_train)
y_train=y_train.reshape(-1,1)
y_train=sc.fit_transform(y_train)
when i changed the order like the below code it throws me error
on-broadcastable output operand with shape (379,1) doesn't match the
broadcast shape (379,13))
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=27)
sc=MinMaxScaler(feature_range=(0,1))
x_train=sc.fit_transform(x_train)
y_train=y_train.reshape(-1,1)
y_train=sc.fit_transform(y_train)
x_test=sc.fit_transform(x_train)
please compare the two photos for better understanding of my query
It can be seen from the linked printscreen figure that you use the same MinMaxScaler to fit and transform both the train and test x-data, and also the training y-data (which does not make sense).
The correct process would be
Fit the scaler with train x-data. The fit_transform() also transforms (scales) the x_train.
sc = MinMaxScaler(feature_range=(0,1))
x_train = sc.fit_transform(x_train)
Scale also the test x-data with the same scaler. Do not fit here; just scale/transform.
x_test = sc.transform(x_test)
If you think scaling is needed also for y-data, you will have to fit another scaler for that purpose. It could also be that there is no need for scaling the y-data.
# Option A: Do not scale y-data
# (do nothing)
# Option B: Scale y-data
sc_y = MinMaxScaler(feature_range=(0,1))
y_train = sc_y.fit_transform(y_train)
After you have trained your model (lr), you can make predictions with the scaled x_test and the model:
# Option A:
predicted = lr.predict(x_test)
# Option B:
y_test_scaled = lr.predict(x_test)
predicted = sc_y.inverse_transform(y_test_scaled)

Getting probabilities of best model for RandomizedSearchCV

I'm using RandomizedSearchCV to get the best parameters with a 10-fold cross-validation and 100 iterations. This works well. But now I would like to also get the probabilities of each predicted test data point (like predict_proba) from the best performing model.
How can this be done?
I see two options. First, perhaps it is possible to get these probabilities directly from the RandomizedSearchCV or second, getting the best parameters from RandomizedSearchCV and then doing again a 10-fold cross-validation (with the same seed so that I get the same splits) with this best parameters.
Edit: Is the following code correct to get the probabilities of the best performing model? X is the training data and y are the labels and model is my RandomizedSearchCV containing a Pipeline with imputing missing values, standardization and SVM.
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_prob = np.empty([y.size, nrClasses]) * np.nan
best_model = model.fit(X, y).best_estimator_
for train, test in cv_outer.split(X, y):
probas_ = best_model.fit(X[train], y[train]).predict_proba(X[test])
y_prob[test] = probas_
If I understood it right, you would like to get the individual scores of every sample in your test split for the case with the highest CV score. If that is the case, you have to use one of those CV generators which give you control over split indices, such as those here: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validation-generators
If you want to calculate scores of a new test sample with the best performing model, the predict_proba() function of RandomizedSearchCV would suffice, given that your underlying model supports it.
Example:
import numpy
skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
scores = cross_val_score(svc, X, y, cv=skf, n_jobs=-1)
max_score_split = numpy.argmax(scores)
Now that you know that your best model happens at max_score_split, you can get that split yourself and fit your model with it.
train_indices, test_indices = k_fold.split(X)[max_score_split]
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
model.fit(X_train, y_train) # this is your model object that should have been created before
And finally get your predictions by:
model.predict_proba(X_test)
I haven't tested the code myself but should work with minor modifications.
You need to look in cv_results_ this will give you the scores, and mean scores for all of your folds, along with a mean, fitting time etc...
If you want to predict_proba() for each of the iterations, the way to do this would be to loop through the params given in cv_results_, re-fit the model for each of then, then predict the probabilities, as the individual models are not cached anywhere, as far as I know.
best_params_ will give you the best fit parameters, for if you want to train a model just using the best parameters next time.
See cv_results_ in the information page http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Keras Regression using Scikit Learn StandardScaler with Pipeline and without Pipeline

I am comparing the performance of two programs about KerasRegressor using Scikit-Learn StandardScaler: one program with Scikit-Learn Pipeline and one program without the Pipeline.
Program 1:
estimators = []
estimators.append(('standardise', StandardScaler()))
estimators.append(('multiLayerPerceptron', KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
log = pipeline.fit(X_train, Y_train)
Y_deep = pipeline.predict(X_test)
Program 2:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.fit_transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
My problem is that Program 1 can achieve R2 score as 0.98 (3 trials on average) while Program 2 only achieve R2 score as 0.84 (3 trials on average.) Can anyone explain the difference between these two programs?
In the second case, you are calling StandardScaler.fit_transform() on both X_train and X_test. Its wrong usage.
You should call fit_transform() on X_train and then call only transform() on the X_test. Because thats what the Pipeline does.
The Pipeline as the documentation states, will:
fit():
Fit all the transforms one after the other and transform the data,
then fit the transformed data using the final estimator
predict():
Apply transforms to the data, and predict with the final estimator
So you see, it will only apply transform() to the test data, not fit_transform().
So elaborate my point, your code should be:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
#This is the change
X_test = scale.transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
Calling fit() or fit_transform() on test data wrongly scales it to a different scale than what was used on train data. And is a source of change in prediction.
Edit: To answer the question in comment:
See, fit_transform() is just a shortcut function for doing fit() and then transform(). For StandardScaler, fit() doesnt return anything, just learns the mean and standard deviation of data. And then transform() applies the learning on the data to return new scaled data.
So what you are saying leads to below two scenarios:
Scenario 1: Wrong
1) X_scaled = scaler.fit_transform(X)
2) Divide the X_scaled into X_scaled_train, X_scaled_test and run your model.
No need to scale again.
Scenario 2: Wrong (Basically equal to Scenario 1, reversing the scaling and spitting operations)
1) Divide the X into X_train, X_test
2) scale.fit_transform(X) [# You are not using the returned value, only fitting the data, so equivalent to scale.fit(X)]
3.a) X_train_scaled = scale.transform(X_train) #[Equals X_scaled_train in scenario 1]
3.b) X_test_scaled = scale.transform(X_test) #[Equals X_scaled_test in scenario 1]
You can try any of the scenario and maybe it will increase the performance of your model.
But there is one very important thing which is missing in them. When you do scaling on the whole data and then divide them into train and test, it is assumed that you know the test (unseen) data, which will not be true in real world cases. And will give you results which will not be according to real world results. Because in the real world, whole of the data will be our training data. It may also lead to over-fitting because the model has some information about the test data already.
So when evaluating the performance of machine learning models, it is recommended that you keep aside the test data before performing any operations on it. Because it is our unseen data, we know nothing about it. So ideal path of operations would be the one I answered, ie.:
1) Divide X into X_train and X_test (same for y)
2) X_train_scaled = scale.fit_transform(X_train) [#Learn the mean and SD of train data]
3) X_test_scaled = scale.transform(X_test) [#Use the mean and SD learned in step2 to convert test data]
4) Use the X_train_scaled for training the model and X_test_scaled in evaluation.
Hope it makes sense to you.

Relationship between sklearn .fit() and .score()

While working with a linear regression model I split the data into a training set and test set. I then calculated R^2, RMSE, and MAE using the following:
lm.fit(X_train, y_train)
R2 = lm.score(X,y)
y_pred = lm.predict(X_test)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
MAE = metrics.mean_absolute_error(y_test, y_pred)
I thought that I was calculating R^2 for the entire data set (instead of comparing the training and original data). However, I learned that you must fit the model before you score it, therefore I'm not sure if I'm scoring the original data (as inputted in R2) or the data that I used to fit the model (X_train, and y_train). When I run:
lm.fit(X_train, y_train)
lm.score(X_train, y_train)
I get a different result than what I got when I was scoring X and y. So my question is are the inputs to the .score parameter compared to the model that was fitted (thereby making lm.fit(X,y); lm.score(X,y) the R^2 value for the original data and lm.fit(X_train, y_train); lm.score(X,y) the R^2 value for the original data based off the model created in .fit.) or is something else entirely happening?
fit() that only fit the data which is synonymous to train, that is fit the data means train the data.
score is something like testing or predict.
So one should use different dataset for training the classifier and testing the acuracy
One can do like this.
X_train,X_test,y_train,y_test=cross_validation.train_test_split(X,y,test_size=0.2)
clf=neighbors.KNeighborsClassifier()
clf.fit(X_train,y_train)
accuracy=clf.score(X_test,y_test)

Categories

Resources