I am comparing the performance of two programs that use KerasRegressor with Scikit-Learn's StandardScaler: one program with a Scikit-Learn Pipeline and one without the Pipeline.
Program 1:
estimators = []
estimators.append(('standardise', StandardScaler()))
estimators.append(('multiLayerPerceptron', KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
log = pipeline.fit(X_train, Y_train)
Y_deep = pipeline.predict(X_test)
Program 2:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.fit_transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
My problem is that Program 1 achieves an R2 score of 0.98 (average over 3 trials) while Program 2 only achieves an R2 score of 0.84 (average over 3 trials). Can anyone explain the difference between these two programs?
In the second case, you are calling StandardScaler.fit_transform() on both X_train and X_test. That is the wrong usage.
You should call fit_transform() on X_train and then call only transform() on X_test, because that is what the Pipeline does.
The Pipeline, as the documentation states, will:
fit():
Fit all the transforms one after the other and transform the data,
then fit the transformed data using the final estimator
predict():
Apply transforms to the data, and predict with the final estimator
So you see, it will only apply transform() to the test data, not fit_transform().
To elaborate on my point, your code should be:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
#This is the change
X_test = scale.transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
Calling fit() or fit_transform() on the test data scales it with different parameters (mean and standard deviation) than those learned from the training data, and that is the source of the change in predictions.
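To see why that matters, here is a toy sketch (with made-up numbers) showing that re-fitting on the test data learns completely different scaling parameters:
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0], [20.0]])

scale = StandardScaler().fit(X_train)
print(scale.mean_, scale.scale_)              # mean 2.5, std ~1.12, learned from the train data

scale_wrong = StandardScaler().fit(X_test)
print(scale_wrong.mean_, scale_wrong.scale_)  # mean 15.0, std 5.0 -- a completely different scale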
Edit: To answer the question in the comments:
See, fit_transform() is just a shortcut for calling fit() and then transform(). For StandardScaler, fit() does not return the transformed data; it only learns the mean and standard deviation of the data. transform() then applies that learning to return newly scaled data.
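As a quick check of that equivalence (a minimal sketch with a made-up array):
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

a = StandardScaler().fit_transform(X)   # fit and transform in one call

scaler = StandardScaler()
scaler.fit(X)                           # only learns mean_ and scale_
b = scaler.transform(X)                 # applies the learned parameters

print(np.allclose(a, b))                # True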
So what you are suggesting leads to the two scenarios below:
Scenario 1: Wrong
1) X_scaled = scaler.fit_transform(X)
2) Divide the X_scaled into X_scaled_train, X_scaled_test and run your model.
No need to scale again.
Scenario 2: Wrong (basically equal to Scenario 1, with the scaling and splitting operations reversed)
1) Divide the X into X_train, X_test
2) scale.fit_transform(X)  # You are not using the returned value, only fitting the data, so this is equivalent to scale.fit(X)
3.a) X_train_scaled = scale.transform(X_train)  # Equals X_scaled_train in Scenario 1
3.b) X_test_scaled = scale.transform(X_test)  # Equals X_scaled_test in Scenario 1
You can try either scenario and maybe it will increase the performance of your model.
But there is one very important thing missing in both of them. When you scale the whole data set and only then divide it into train and test, you are assuming that you know the test (unseen) data, which will not be true in real-world cases, and it will give you results that do not reflect real-world performance, because in the real world all of the available data is training data. It may also lead to over-fitting because the model already has some information about the test data.
So when evaluating the performance of machine learning models, it is recommended that you set the test data aside before performing any operations on it, because it is our unseen data and we know nothing about it. The ideal order of operations is therefore the one in my answer, i.e. (a code sketch follows the steps):
1) Divide X into X_train and X_test (same for y)
2) X_train_scaled = scale.fit_transform(X_train)  # Learn the mean and SD of the train data
3) X_test_scaled = scale.transform(X_test)  # Use the mean and SD learned in step 2 to scale the test data
4) Use the X_train_scaled for training the model and X_test_scaled in evaluation.
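Put together as code, the ideal path looks roughly like this (a minimal sketch; model stands for whatever estimator you are using):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1) Split first, so the test data stays unseen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 2) Learn the mean and SD from the training data only
scale = StandardScaler()
X_train_scaled = scale.fit_transform(X_train)

# 3) Reuse those parameters to scale the test data
X_test_scaled = scale.transform(X_test)

# 4) Train on the scaled train data, evaluate on the scaled test data
model.fit(X_train_scaled, y_train)
print(model.score(X_test_scaled, y_test))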
Hope it makes sense to you.
Related
I am trying to find reliable hyperparameters for training a multiclass classifier, using both LightGBM's "gbdt" boosting and scikit-learn's GridSearchCV.
On the feature side of things there is a ~4k x 40 matrix, containing continuous values.
On the label side there is a pool of 4 mutually exclusive categorical classes.
To judge whether any given fold is performing well I would like to use LightGBM's auc_mu metric, but I am OK with any metric at this point. As you can see in the code below, I resorted to balanced accuracy instead.
Below is a simplified version of how the gridsearch is initialised.
param_set = {
    'n_estimators': [15, 25]
}
clf = lgb.LGBMModel(
    boosting_type='gbdt',
    num_leaves=31,
    max_depth=5,
    learning_rate=0.1,
    n_estimators=100,
    objective='multiclass',
    num_class=len(np.unique(training_data.label)),
    min_split_gain=0,
    min_child_weight=1e-3,
    min_child_samples=10,
    subsample=1,
    subsample_freq=0,
    colsample_bytree=0.6,
    reg_alpha=0.3,
    reg_lambda=0.7,
    random_state=42,
    n_jobs=2)
gsearch = GridSearchCV(estimator=clf,
                       param_grid=param_set,
                       scoring="balanced_accuracy",
                       error_score='raise',
                       n_jobs=2,
                       cv=5,
                       verbose=2)
When I try to call the fit function on the GridSearchCV object,
# separate total data into train/validation and test
stratifiedss = StratifiedShuffleSplit(
    n_splits=1, test_size=0.2, train_size=0.8, random_state=723)

for train_ind, test_ind in stratifiedss.split(X, y):
    train_feature_obs = X.loc[train_ind]
    train_labels = y[train_ind]
    validation_feature_obs = X.loc[test_ind]
    validation_labels = y[test_ind]

# transform data into lgb Dataset
training_data = lgb.Dataset(train_feature_obs, label=train_labels)

# call the GridSearchCV.fit
lgb_model2 = gsearch.fit(training_data.data.reset_index(drop=True), training_data.label)
it returns
ValueError: Classification metrics can't handle a mix of unknown and continuous-multioutput targets
So I am guessing that sklearn's GridSearchCV has trouble evaluating the output of LGBMModel.predict().
I tried fitting an LGBMModel separately, and it returns an array with the probability of each observation belonging to each of the four classes, summing to 100%.
I looked at:
ValueError: Classification metrics can't handle a mix of unknown and binary targets
I got the warning "UserWarning: One or more of the test scores are non-finite" when revising a toy scikit-learn gridsearchCV example
But that has not been conclusive yet.
How can I enable sklearn's GridSearchCV to evaluate the performance of each fold of the LGBMModel classifier?
I am mostly confused as to where the "unknown" type is coming from.
Any help would be much appreciated.
Regards, Robert
I used the MinMaxScaler function in sklearn.preprocessing to normalize the attributes of some of my variables (arrays) for use in a model (linear regression). After creating and training the model, I tested it with x_test (split using train_test_split) and stored the result in a variable (say, predicted). For evaluation purposes I want to compare my prediction with the original data set, so I used the MinMaxScaler.inverse_transform function. That function works well when my code is in the order below:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=27)
sc=MinMaxScaler(feature_range=(0,1))
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_train)
y_train=y_train.reshape(-1,1)
y_train=sc.fit_transform(y_train)
When I change the order as in the code below, it throws the error:
non-broadcastable output operand with shape (379,1) doesn't match the broadcast shape (379,13)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=27)
sc=MinMaxScaler(feature_range=(0,1))
x_train=sc.fit_transform(x_train)
y_train=y_train.reshape(-1,1)
y_train=sc.fit_transform(y_train)
x_test=sc.fit_transform(x_train)
Please compare the two screenshots for a better understanding of my query.
It can be seen from the linked screenshot that you use the same MinMaxScaler to fit and transform both the train and test x-data, and also the training y-data (which does not make sense).
The correct process would be:
Fit the scaler with the train x-data. fit_transform() also transforms (scales) x_train.
sc = MinMaxScaler(feature_range=(0,1))
x_train = sc.fit_transform(x_train)
Scale also the test x-data with the same scaler. Do not fit here; just scale/transform.
x_test = sc.transform(x_test)
If you think scaling is needed also for y-data, you will have to fit another scaler for that purpose. It could also be that there is no need for scaling the y-data.
# Option A: Do not scale y-data
# (do nothing)
# Option B: Scale y-data
sc_y = MinMaxScaler(feature_range=(0,1))
y_train = sc_y.fit_transform(y_train)
After you have trained your model (lr), you can make predictions with the scaled x_test and the model:
# Option A:
predicted = lr.predict(x_test)
# Option B:
y_pred_scaled = lr.predict(x_test)
predicted = sc_y.inverse_transform(y_pred_scaled)
Suppose you have numerical time series data and you managed to split it like:
X_train, y_train, X_val, y_val, X_test, y_test.
and you properly scaled everything ending up with:
X_train_scaled, y_train_scaled, X_val_scaled, y_val_scaled,
X_test_scaled, y_test_scaled
And now you run the following code:
linear = Sequential([
    Dense(units=1, activation='linear', input_shape=[X_train_scaled.shape[1]])
])
linear.compile(loss='mse', optimizer='adam')
history = linear.fit(X_train_scaled, y_train_scaled,
                     epochs=50, verbose=1, shuffle=False,
                     validation_data=(X_val_scaled.values, y_val_scaled.values))
If our idea is to calculate the MSE, we can use the scaled test data and compute it in two "different" ways:
mse_linear_scaled_1 = linear.evaluate(X_test_scaled,y_test_scaled)
or using the standalone version from https://www.tensorflow.org/api_docs/python/tf/keras/losses/MeanSquaredError:
y_pred_scaled = linear.predict(X_test_scaled)  # predictions in the scaled space
mse = keras.losses.MeanSquaredError()
mse_linear_scaled_2 = mse(y_test_scaled.values, y_pred_scaled).numpy()
If you do this exercise, mse_linear_scaled_1 equals mse_linear_scaled_2 (as expected).
Now here comes the question (thank you if you have read down to here...). If you do this same last part but with the test data in its original scale (the final idea is to get the RMSE value so it is in the context of the real data), the results are very different from each other.
mse_linear_unscaled_1 = linear.evaluate(X_test,y_test)
gives a very different number than doing
mse_linear_unscaled_2 = mse(y_test,y_pred).numpy()
If I want to get the correct RMSE in the scale of the original time series, I would guess this is the correct way of doing it?
np.sqrt(mse_linear_unscaled_2)
Maybe .evaluate() wasn't intended for this and is doing something under the hood that I'm not aware of, so it won't return the correct number?
When you call linear.evaluate(X_test, y_test) you are using the model linear that was fitted on scaled data, so evaluating with unscaled data is like feeding it a range of data that that particular model never saw.
The way to go is, in pseudocode:
y_pred_scaled = linear.predict(X_test_scaled)
inverse_transform y_pred_scaled with your scaler
mse in original scale comparing y_test to y_pred
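In concrete code that would look roughly like the sketch below; y_scaler is a made-up name here for whatever scaler was fitted on the training y-data:
import numpy as np
from tensorflow import keras

# Predict in the scaled space the model was trained on
y_pred_scaled = linear.predict(X_test_scaled)

# Bring the predictions back to the original units
y_pred = y_scaler.inverse_transform(y_pred_scaled)

# Compare against the untouched y_test, both in original units
mse = keras.losses.MeanSquaredError()
mse_original = mse(np.asarray(y_test).reshape(-1, 1), y_pred).numpy()
rmse_original = np.sqrt(mse_original)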
I'm using RandomizedSearchCV to get the best parameters with a 10-fold cross-validation and 100 iterations. This works well. But now I would like to also get the probabilities of each predicted test data point (like predict_proba) from the best performing model.
How can this be done?
I see two options. First, perhaps it is possible to get these probabilities directly from the RandomizedSearchCV; or second, getting the best parameters from RandomizedSearchCV and then doing a 10-fold cross-validation again (with the same seed so that I get the same splits) with these best parameters.
Edit: Is the following code correct for getting the probabilities of the best-performing model? X is the training data, y are the labels, and model is my RandomizedSearchCV containing a Pipeline with missing-value imputation, standardization, and an SVM.
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_prob = np.empty([y.size, nrClasses]) * np.nan
best_model = model.fit(X, y).best_estimator_
for train, test in cv_outer.split(X, y):
    probas_ = best_model.fit(X[train], y[train]).predict_proba(X[test])
    y_prob[test] = probas_
If I understood it right, you would like to get the individual scores of every sample in your test split for the case with the highest CV score. If that is the case, you have to use one of those CV generators which give you control over split indices, such as those here: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validation-generators
If you want to calculate scores of a new test sample with the best performing model, the predict_proba() function of RandomizedSearchCV would suffice, given that your underlying model supports it.
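For that second, simpler route, a minimal sketch (assuming the SVM step in your pipeline was created with probability=True, since SVC only exposes predict_proba in that case; X_new is a placeholder for whatever samples you want scored):
# model is your RandomizedSearchCV wrapping the imputation/scaling/SVM pipeline
model.fit(X, y)                      # with refit=True (the default) the best pipeline is refitted on all of X
probas = model.predict_proba(X_new)  # delegates to that best, refitted pipeline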
Example for the first case (recovering the best-scoring split with a CV generator):
import numpy
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
scores = cross_val_score(svc, X, y, cv=skf, n_jobs=-1)  # svc is your estimator
max_score_split = numpy.argmax(scores)
Now that you know that your best model happens at max_score_split, you can get that split yourself and fit your model with it.
train_indices, test_indices = list(skf.split(X, y))[max_score_split]
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
model.fit(X_train, y_train) # this is your model object that should have been created before
And finally get your predictions by:
model.predict_proba(X_test)
I haven't tested the code myself but should work with minor modifications.
You need to look at cv_results_; this will give you the scores and mean scores for all of your folds, along with mean fit times etc.
If you want predict_proba() for each of the iterations, the way to do this would be to loop through the params given in cv_results_, re-fit the model for each of them, and then predict the probabilities, as the individual models are not cached anywhere, as far as I know.
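A rough sketch of that loop (assuming estimator is the object you originally passed to RandomizedSearchCV, search is the fitted search, and X_train/y_train/X_test are your data; all names here are placeholders):
from sklearn.base import clone

probas_per_setting = []
for params in search.cv_results_['params']:
    # Re-create and re-fit a model for this parameter setting,
    # since the individual fitted models are not kept by the search
    est = clone(estimator).set_params(**params)
    est.fit(X_train, y_train)
    probas_per_setting.append(est.predict_proba(X_test))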
best_params_ will give you the best-performing parameters, in case you want to train a model using just those parameters next time.
See cv_results_ in the information page http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Does running a standard scaler and then a classifier give the same result as using a pipeline?
Hi, I have a classification problem and am trying to scale the X variables using scikit-learn's StandardScaler(). I see two options for doing this; should they in theory yield the same result? I ask because I am getting a better precision score on my test data set when I use option (1).
(1)
scalar = StandardScaler()
xtrain_ = scalar.fit_transform(xtrain)
RFC = RandomForestClassifier(n_estimators=100)
RFC.fit(xtrain, ytrain)
xtest_ = scalar.transform(xtest)
score = cross_val_score(RFC, xtest_, ytest,cv=10, scoring ='precision')
(2)
RFCs = Pipeline([("scale", StandardScaler()), ("rf", RandomForestClassifier(n_estimators=100))])
RFCs.fit(xtrain, ytrain)
scores = cross_val_score(RFCs, xytest, ytest, cv=10, scoring='precision')
Your option (2) uses a different data set (xytest) than your option (1), which uses xtest. Furthermore, your cross-validation should include the training, not only the prediction.
Apart from that, they should give the same result, though I advise you to use pipelines.
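Concretely, that means cross-validating the whole pipeline on the training data, so every fold re-fits both the scaler and the forest (a minimal sketch reusing your variable names):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RFCs = Pipeline([("scale", StandardScaler()),
                 ("rf", RandomForestClassifier(n_estimators=100))])

# Scaling is learned inside each fold's training part, so there is no leakage,
# and the training step is part of what gets evaluated
scores = cross_val_score(RFCs, xtrain, ytrain, cv=10, scoring='precision')
print(scores.mean())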