From train test split to cross validation in sklearn using pipeline

From train test split to cross validation in sklearn using pipeline - python

I have the following piece of code:
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.pipeline import Pipeline
...
x_train, x_test, y_train, y_test= model_selection.train_test_split(dataframe[features_],dataframe[labels], test_size=0.30,random_state=42, shuffle=True)
classifier = RandomForestClassifier(n_estimators=11)
pipe = Pipeline([('feats', feature), ('clf', classifier)])
pipe.fit(x_train, y_train)
predicts = pipe.predict(x_test)
Instead of train test split, I want to use k-fold cross validation to train my model. However, I do not know how can make it by using pipeline structure. I came across this: https://scikit-learn.org/stable/modules/compose.html but I could not fit to my code.
I want to use from sklearn.model_selection import StratifiedKFold if possible. I can use it without pipeline structure but I can not use it with pipeline.
Update:
I tried this but it generates me error.
x_train = dataframe[features_]
y_train = dataframe[labels]
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
classifier = RandomForestClassifier(n_estimators=11)
#pipe = Pipeline([('feats', feature), ('clf', classifier)])
#pipe.fit(x_train, y_train)
#predicts = pipe.predict(x_test)
predicts = cross_val_predict(classifier, x_train , y_train , cv=skf)

Pipeline is used to assemble several steps such as preprocessing, transformations, and modeling. StratifiedKFold is used to split your dataset to assess the performance of your model. It is not meant to be used as a part of the Pipeline as you do not want to perform it on new data.
Therefore it is normal to perform it out of the pipeline's structure.

While I echo with the answer provided above, below might be the code you are looking for:
x_train = dataframe[features_]
y_train = dataframe[labels]
pipe = Pipeline([('feats', feature), ('clf', RandomForestClassifier(n_estimators=11))])
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
predicts = cross_val_predict(pipe, x , y , scoring='accuracy',cv=skf, n_jobs = -1, error_score = 'raise')
Since you are using Cross validation, train_test_split() is not required to be passed to the Pipeline, the cross validation split (i.e., the training sample) is passed directly to the pipeline for feature extraction and modeling and evaluated on the test set (which is also generated as part of the cv split)

Related

Machine Learning (python) Saving a scaler in a model

I'm using the following scaler:
scaler = preprocessing.MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(dataset_train),
columns=dataset_train.columns,
index=dataset_train.index)
X_train.sample(frac=1)
X_test = pd.DataFrame(scaler.transform(dataset_test),
columns=dataset_test.columns,
index=dataset_test.index)
The model that I'm using for anomaly detection is sequential().
I know how to save the model, but my question is how can I save the scaler in the model, so I can simply apply the model in a new df.
Thank you.

This is a good use-case for an sklearn pipeline. An sklearn pipeline allows you to chain preprocessing steps with your model. Most things you want to apply a .fit() method to can be put in an sklearn pipeline. This way when you persist the pipeline important information is captured within your saved model object. You can find more on the sklearn pipeline page. Here's an example:
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn import datasets
from sklearn.model_selection import train_test_split
# creating an example dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
dataset_train, dataset_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
#example of creating and predicting with a sklearn pipeline
pipeline = Pipeline([('scaler', MinMaxScaler()), ('rf', RandomForestClassifier())])
pipeline.fit(dataset_train, y_train)
pipeline.predict(dataset_test)

How does KMeans and Logistic Regression interact with MNIST dataset in Pipeline class?

I am reviewing the "Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow" book. One method of classification for the MNIST dataset uses KMeans as a means to preprocess the dataset before using a LogsticRegression model to perform the classification.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)
pipeline = Pipeline([
("kmeans", KMeans(random_state=42)),
("log_reg", LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)),
])
param_grid = dict(kmeans__n_clusters=range(45, 50))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)
predict = grid_clf.predict(X_test)
The output of grid_clf.predict(X_test) is in the original digits (digits 0-9) rather than the clusters that are created in the KMeans step in the pipeline. My question is, how is the grid_clf.predict() relating its predictions back to the original labels on the dataset?

Putting aside the grid search, the code
pipeline = Pipeline([
("kmeans", KMeans(n_clusters=45)),
("log_reg", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
is equivalent to:
kmeans = KMeans(n_clusters=45)
log_reg = LogisticRegression()
new_X_train = kmeans.fit_transform(X_train)
log_reg.fit(new_X_train, y_train)
Thus KMeans is used to transform the training data. The original data, which has 64 features, is transformed into data with 45 features consisting of distances of the data points to the centers of the 45 clusters. This transformed data, together with the original labels, is then used to fit LogisticRegression.
Prediction works the same way: the test data is first transformed by KMeans and then LogisticRegression is used with the transformed data to predict labels. Thus, instead of
predict = pipeline.predict(X_test)
one could use:
predict = log_reg.predict(kmeans.transform(X_test))

Simple CV for classic classification models

I'm looking for the easiest way to teach my students how to perform 10CV, for standard classifiers in sklearn such as logisticregression, knnm, decision tree, adaboost, svm, etc.
I was hoping there was a method that created the folds for them instead of having to loop like below:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
X=df1.drop(['Unnamed: 0','ID','target'],axis=1).values
y=df1.target.values
for train_index, test_index in sss.split(X,y):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = LogisticRegressionCV()
clf.fit(X_train, y_train)
train_predictions = clf.predict(X_test)
acc = accuracy_score(y_test, train_predictions)
print(acc)
Seems like there should be an easier way.

I think your question is, whether there is an already existing method for 10-fold cross validation. So to answer it, there is the sklearn documentation, which explains cross validation and also how to use it:
Cross-validation: evaluating estimator performance
Besides that, you can also make use of the sklearn modules for cross validation
Various splitting techniques with modules
Model validation with cross validation
To include a code example, which should work with your code, import the required library
from sklearn.model_selection import cross_val_score
and add this line instead of your loop:
print(cross_val_score(clf, X, y, cv=10))
And your n_splits is just set to 1 by the way, so its 1-fold and not 10-fold in your code.

Should Cross Validation Score be performed on original or split data?

When I want to evaluate my model with cross validation, should I perform cross validation on original (data thats not split on train and test) or on train / test data?
I know that training data is used for fitting the model, and testing for evaluating. If I use cross validation, should I still split the data into train and test, or not?
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)
Or should I do like this:
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
clf = LogisticRegression()
model = clf.fit(features, results)
accuracy_test = cross_val_score(clf, features, results, cv = 5)), 2)
Or maybe something different?

Both your approaches are wrong.
In the first one, you apply cross validation to the test set, which is meaningless
In the second one, you first fit the model with your whole data, and then you perform cross validation, which is again meaningless. Moreover, the approach is redundant (your fitted clf is not used by the cross_val_score method, which does its own fitting)
Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:
Either with a separate test set
Or with cross validation
First way (test set):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)
Second way (cross validation):
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
clf = LogisticRegression()
# shuffle data first:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv = 5, scoring='accuracy')
# fit the model afterwards with the whole data, if satisfied with the performance:
model = clf.fit(features, results)

I will try to summarize the "best practice" here:
1) If you want to train your model, fine-tune parameters, and do final evaluation, I recommend you to split your data into training|val|test.
You fit your model using the training part, and then you check different parameter combinations on the val part. Finally, when you're sure which classifier/parameter obtains the best result on the val part, you evaluate on the test to get the final rest.
Once you evaluate on the test part, you shouldn't change the parameters any more.
2) On the other hand, some people follow another way, they split their data into training and test, and they finetune their model using cross-validation on the training part and at the end they evaluate it on the test part.
If your data is quite large, I recommend you to use the first way, but if your data is small, the 2.

How to apply StandardScaler in Pipeline in scikit-learn (sklearn)?

In the example below,
pipe = Pipeline([
('scale', StandardScaler()),
('reduce_dims', PCA(n_components=4)),
('clf', SVC(kernel = 'linear', C = 1))])
param_grid = dict(reduce_dims__n_components=[4,6,8],
clf__C=np.logspace(-4, 1, 6),
clf__kernel=['rbf','linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
I am using StandardScaler(), is this the correct way to apply it to test set as well?

Yes, this is the right way to do this but there is a small mistake in your code. Let me break this down for you.
When you use the StandardScaler as a step inside a Pipeline then scikit-learn will internally do the job for you.
What happens can be described as follows:
Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
Step 1: the scaler is fitted on the TRAINING data
Step 2: the scaler transforms TRAINING data
Step 3: the models are fitted/trained using the transformed TRAINING data
Step 4: the scaler is used to transform the TEST data
Step 5: the trained models predict using the transformed TEST data
Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train) because the GridSearchCV will automatically split the data into training and testing data (this happen internally).
Use something like this:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
pipe = Pipeline([
('scale', StandardScaler()),
('reduce_dims', PCA(n_components=4)),
('clf', SVC(kernel = 'linear', C = 1))])
param_grid = dict(reduce_dims__n_components=[4,6,8],
clf__C=np.logspace(-4, 1, 6),
clf__kernel=['rbf','linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)
Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search in the result object returned from grid.fit(). The best_score_ member provides access to the best score observed during the optimization procedure and the best_params_ describes the combination of parameters that achieved the best results.
IMPORTANT EDIT 1: if you want to keep a validation dataset of the original dataset use this:
X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation
= train_test_split(X, y, test_size=0.15, random_state=1)
Then use:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)

Quick answer: Your methodology is correct.
Although the above answer is very good, I just would like to point out some subtleties:
best_score_ [1] is the best cross-validation metric, and not the generalization performance of the model [2]. To evaluate how well the best found parameters generalize, you should call the score on the test set, as you've done. Therefore it is needed to start by splitting the data into training and test set, fit the grid search only in the X_train, y_train, and then score it with X_test, y_test [2].
Deep Dive:
A threefold split of data into training set, validation set and test set is one way to prevent overfitting in the parameters during grid search. On the other hand, GridSearchCV uses Cross-Validation in the training set, instead of having both training and validation set, but this does not replace the test set. This can be verified in [2] and [3].
References:
[1] GridSearchCV
[2] Introduction to Machine Learning with Python
[3] 3.1 Cross-validation: evaluating estimator performance

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

From train test split to cross validation in sklearn using pipeline - python

Related

Machine Learning (python) Saving a scaler in a model

How does KMeans and Logistic Regression interact with MNIST dataset in Pipeline class?

Simple CV for classic classification models

Should Cross Validation Score be performed on original or split data?

How to apply StandardScaler in Pipeline in scikit-learn (sklearn)?

Categories

Resources