I'm using the following scaler:
import pandas as pd
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(dataset_train),
                       columns=dataset_train.columns,
                       index=dataset_train.index)
# sample() returns a shuffled copy, so the result has to be assigned
X_train = X_train.sample(frac=1)
X_test = pd.DataFrame(scaler.transform(dataset_test),
                      columns=dataset_test.columns,
                      index=dataset_test.index)
The model that I'm using for anomaly detection is a Sequential() model.
I know how to save the model, but how can I save the scaler together with it, so that I can simply apply the model to a new df?
Thank you.
This is a good use case for an sklearn pipeline. An sklearn pipeline allows you to chain preprocessing steps with your model. Most things that have a .fit() method can be put in an sklearn pipeline. This way, when you persist the pipeline, the fitted preprocessing parameters (here, what the scaler learnt from the training data) are saved together with the model in a single object. You can find more on the sklearn pipeline page. Here's an example:
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn import datasets
from sklearn.model_selection import train_test_split
# creating an example dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
dataset_train, dataset_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
#example of creating and predicting with a sklearn pipeline
pipeline = Pipeline([('scaler', MinMaxScaler()), ('rf', RandomForestClassifier())])
pipeline.fit(dataset_train, y_train)
pipeline.predict(dataset_test)
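To persist the fitted pipeline, scaler included, one common option is joblib. The following is a minimal sketch, assuming the same kind of pipeline as above (a random forest, not a Sequential() model); the file name scaler_and_model.joblib is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)

# Hypothetical file name; any writable path works.
model_path = os.path.join(tempfile.gettempdir(), "scaler_and_model.joblib")
joblib.dump(pipeline, model_path)

# Later, for example in another script: load the pipeline and predict on
# raw, unscaled data. The stored scaler is applied automatically.
restored = joblib.load(model_path)
restored_predictions = restored.predict(X_test)
original_predictions = pipeline.predict(X_test)
print((restored_predictions == original_predictions).all())
```

The restored object is the whole pipeline, so new data only needs the same columns in the same order as the training data.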
I am reviewing the "Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow" book. One method of classification for the MNIST dataset uses KMeans to preprocess the dataset before using a LogisticRegression model to perform the classification.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)
pipeline = Pipeline([
("kmeans", KMeans(random_state=42)),
("log_reg", LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)),
])
param_grid = dict(kmeans__n_clusters=range(45, 50))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)
predict = grid_clf.predict(X_test)
The output of grid_clf.predict(X_test) is in the original digits (digits 0-9) rather than the clusters that are created in the KMeans step in the pipeline. My question is, how is the grid_clf.predict() relating its predictions back to the original labels on the dataset?
Putting aside the grid search, the code
pipeline = Pipeline([
("kmeans", KMeans(n_clusters=45)),
("log_reg", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
is equivalent to:
kmeans = KMeans(n_clusters=45)
log_reg = LogisticRegression()
new_X_train = kmeans.fit_transform(X_train)
log_reg.fit(new_X_train, y_train)
Thus KMeans is used to transform the training data. The original data, which has 64 features, is transformed into data with 45 features consisting of distances of the data points to the centers of the 45 clusters. This transformed data, together with the original labels, is then used to fit LogisticRegression.
Prediction works the same way: the test data is first transformed by KMeans and then LogisticRegression is used with the transformed data to predict labels. Thus, instead of
predict = pipeline.predict(X_test)
one could use:
predict = log_reg.predict(kmeans.transform(X_test))
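The equivalence can be checked directly on a fitted pipeline. This is a small sketch, assuming the digits dataset from the question; n_init=10 and max_iter=5000 are choices made here to keep the run stable across scikit-learn versions, not part of the answer above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, random_state=42)

pipeline = Pipeline([
    # n_init is set explicitly because its default differs between versions.
    ("kmeans", KMeans(n_clusters=45, n_init=10, random_state=42)),
    ("log_reg", LogisticRegression(max_iter=5000)),
])
pipeline.fit(X_train, y_train)

# Pull the fitted steps back out of the pipeline.
kmeans = pipeline.named_steps["kmeans"]
log_reg = pipeline.named_steps["log_reg"]

# KMeans.transform returns one distance per cluster centre: 45 columns.
X_test_distances = kmeans.transform(X_test)
print(X_test_distances.shape)

# The classifier was fitted on y_train, so its classes are the digit labels,
# not cluster indices.
print(log_reg.classes_)

# Applying the two steps by hand gives the same result as pipeline.predict.
manual_predictions = log_reg.predict(X_test_distances)
pipeline_predictions = pipeline.predict(X_test)
print(np.array_equal(manual_predictions, pipeline_predictions))
```

The cluster assignments never reach the classifier as labels; only the distance features do, which is why the predictions come back as digits.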
I have the following piece of code:
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.pipeline import Pipeline
...
x_train, x_test, y_train, y_test= model_selection.train_test_split(dataframe[features_],dataframe[labels], test_size=0.30,random_state=42, shuffle=True)
classifier = RandomForestClassifier(n_estimators=11)
pipe = Pipeline([('feats', feature), ('clf', classifier)])
pipe.fit(x_train, y_train)
predicts = pipe.predict(x_test)
Instead of a train/test split, I want to use k-fold cross-validation to train my model. However, I do not know how to do that with the pipeline structure. I came across this: https://scikit-learn.org/stable/modules/compose.html but I could not adapt it to my code.
I want to use from sklearn.model_selection import StratifiedKFold if possible. I can use it without the pipeline structure, but not with a pipeline.
Update:
I tried this, but it raises an error.
x_train = dataframe[features_]
y_train = dataframe[labels]
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
classifier = RandomForestClassifier(n_estimators=11)
#pipe = Pipeline([('feats', feature), ('clf', classifier)])
#pipe.fit(x_train, y_train)
#predicts = pipe.predict(x_test)
predicts = cross_val_predict(classifier, x_train , y_train , cv=skf)
Pipeline is used to assemble several steps such as preprocessing, transformations, and modeling. StratifiedKFold is used to split your dataset to assess the performance of your model. It is not meant to be used as a part of the Pipeline as you do not want to perform it on new data.
Therefore it is normal to perform it out of the pipeline's structure.
While I agree with the answer above, the code below might be what you are looking for:
x_train = dataframe[features_]
y_train = dataframe[labels]
pipe = Pipeline([('feats', feature), ('clf', RandomForestClassifier(n_estimators=11))])
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
# cross_val_predict takes no scoring or error_score argument; score the returned predictions separately
predicts = cross_val_predict(pipe, x_train, y_train, cv=skf, n_jobs=-1)
Since you are using cross-validation, train_test_split() is not needed. For each CV split, the training portion is passed to the pipeline for feature extraction and model fitting, and the fitted pipeline then predicts on the held-out portion of that same split.
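A runnable version of this idea is sketched below. Since `feature` and `dataframe` are not shown in the question, this sketch assumes the iris dataset and uses StandardScaler as a hypothetical stand-in for the `feats` step.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for dataframe[features_] and dataframe[labels].
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("feats", StandardScaler()),  # hypothetical stand-in for `feature`
    ("clf", RandomForestClassifier(n_estimators=11, random_state=42)),
])
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Each sample is predicted by a pipeline fitted on the folds that do not
# contain it, so the scaler never sees the held-out data during fitting.
predicts = cross_val_predict(pipe, X, y, cv=skf)
print(accuracy_score(y, predicts))
```

If only per-fold scores are needed rather than the predictions themselves, cross_val_score accepts the same pipeline and cv object along with a scoring argument.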
I have a highly imbalanced dataset and would like to perform SMOTE to balance the dataset and perform cross-validation to measure the accuracy. However, most of the existing tutorials use only a single training and testing iteration to perform SMOTE.
Therefore, I would like to know the correct procedure to perform SMOTE using cross-validation.
My current code is as follows. However, as mentioned above, it only uses a single iteration.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sm = SMOTE(random_state=2)
# fit_resample replaces the older fit_sample, which recent imbalanced-learn releases no longer provide
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(X_train_res, y_train_res)
I am happy to provide more details if needed.
You need to perform SMOTE within each fold. Accordingly, you need to avoid train_test_split in favour of KFold:
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score
kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
    X_test = X[test_index]
    y_test = y[test_index]  # See comment on ravel and y_train
    sm = SMOTE()
    # Oversample the training part of the fold only (fit_resample replaces the older fit_sample)
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # Choose a model here
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
You can also, for example, append the scores to a list defined outside.
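Collecting the scores in a list could look like the sketch below. It rests on several assumptions: a synthetic dataset from make_classification, LogisticRegression as the model, and StratifiedKFold for the splits. To keep the sketch free of the imbalanced-learn dependency, plain random oversampling through sklearn.utils.resample stands in for SMOTE; the helper oversample_minority is hypothetical and can be swapped for SMOTE().fit_resample(...).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced binary problem (roughly 90% / 10%).
X, y = make_classification(n_samples=1000, weights=[0.9], flip_y=0,
                           random_state=0)


def oversample_minority(X_part, y_part, random_state):
    """Hypothetical stand-in for SMOTE: duplicate random minority rows
    until both classes have the same size. Assumes binary labels 0/1."""
    counts = np.bincount(y_part)
    minority = int(np.argmin(counts))
    X_minority = X_part[y_part == minority]
    extra = resample(X_minority, replace=True,
                     n_samples=int(counts.max() - counts.min()),
                     random_state=random_state)
    X_out = np.vstack([X_part, extra])
    y_out = np.concatenate([y_part, np.full(len(extra), minority)])
    return X_out, y_out


fold_f1_scores = []  # defined outside the loop, filled once per fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_index, test_index) in enumerate(cv.split(X, y), 1):
    # Resample the training part only; the test part stays untouched.
    X_train, y_train = oversample_minority(
        X[train_index], y[train_index], random_state=fold)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X[test_index])
    fold_f1_scores.append(f1_score(y[test_index], y_pred))

print(np.mean(fold_f1_scores), np.std(fold_f1_scores))
```

Reporting the mean and spread over folds gives a steadier estimate than the score from any single split.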
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # fit_resample replaces the older fit_sample
    X_train, y_train = SMOTE().fit_resample(X_train, y_train)
    ....
I think you can also solve this with a pipeline from the imbalanced-learn library.
I saw this solution in a blog called Machine Learning Mastery https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
The idea is to use a pipeline from imblearn to do the cross-validation. Please, let me know if that works. The example below is with a decision tree, but the logic is the same.
#decision tree evaluated on imbalanced dataset with SMOTE oversampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define pipeline
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
score = mean(scores)
I am doing classification, and I have a list with two elements, like this:
Data=[list1,list2]
list1 has size 1000*784: it holds 1000 images, each reshaped from 28*28 into a vector of 784 values.
list2 has size 1000*1: it holds the label that each image belongs to.
With the below code, I applied PCA:
from matplotlib.mlab import PCA
results = PCA(Data[0])
the output is like this:
Out[40]: <matplotlib.mlab.PCA instance at 0x7f301d58c638>
Now I want to use an SVM as the classifier.
I should add the labels, so I have new data like this for the SVM:
newData=[results,Data[1]]
I do not know how to use an SVM here.
from sklearn.decomposition import PCA
from sklearn.svm import SVC
# sklearn.cross_validation was removed; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
Data=[list1,list2]
X = Data[0]
y = Data[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
pca = PCA(n_components=2)# adjust yourself
pca.fit(X_train)
X_t_train = pca.transform(X_train)
X_t_test = pca.transform(X_test)
clf = SVC()
clf.fit(X_t_train, y_train)
print('score', clf.score(X_t_test, y_test))
print('pred label', clf.predict(X_t_test))
Here is tested code on another dataset.
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.svm import SVC
# sklearn.cross_validation was removed; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
pca = PCA(n_components=2)# adjust yourself
pca.fit(X_train)
X_t_train = pca.transform(X_train)
X_t_test = pca.transform(X_test)
clf = SVC()
clf.fit(X_t_train, y_train)
print('score', clf.score(X_t_test, y_test))
print('pred label', clf.predict(X_t_test))
Based on these references:
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
http://scikit-learn.org/stable/modules/cross_validation.html
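The PCA and SVC steps can also be chained in a Pipeline, so that the projection fitted on the training data is reused automatically at prediction time. The sketch below assumes the iris data and current scikit-learn import paths; n_components=2 is carried over from the code above and should be adjusted for 784-dimensional image data.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

pca_svm = Pipeline([
    ("pca", PCA(n_components=2)),  # adjust for your own data
    ("svc", SVC()),
])

# fit() runs pca.fit_transform on the training data, then fits the SVC on
# the projected features; score() and predict() only call pca.transform.
pca_svm.fit(X_train, y_train)
test_score = pca_svm.score(X_test, y_test)
print("score", test_score)

# The same object can be cross-validated; PCA is refitted inside each fold.
cv_scores = cross_val_score(pca_svm, X_train, y_train, cv=5)
print("cv mean", cv_scores.mean())
```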
I think what you are looking for is scikit-learn, http://scikit-learn.org/. It's a Python library where you'll find PCA, SVM and other cool algorithms for machine learning. It has a good tutorial, but I recommend you follow this one: http://www.astroml.org/sklearn_tutorial/general_concepts.html . For your particular question, the SVM page of scikit-learn should suffice: http://scikit-learn.org/stable/modules/svm.html.
I'm trying to fit a model that I've put together using Pipeline:
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
# the splitter no longer takes the labels; they are supplied when fit() is called
cross_validation_object = StratifiedKFold(n_splits = 10)
scaler = MinMaxScaler(feature_range = (0, 1))
# liblinear supports both the 'l1' and 'l2' penalties searched below
logistic_fit = LogisticRegression(solver = 'liblinear')
pipeline_object = Pipeline([('scaler', scaler),('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01,0.1,1,10],
                     'model__penalty': ['l1','l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters, cv = cross_validation_object, scoring = 'accuracy')
grid_search_object.fit(X_train,Y_train)
My question: Is the best_estimator going to scale the test data based on the values in the training data? For example, if I call:
grid_search_object.best_estimator_.predict(X_test)
It will NOT try to fit the scaler on the X_test data, right? It will just transform it using the original parameters.
Thanks!
The predict methods never fit any data. In this case, exactly as you describe it, the best_estimator_ pipeline is going to scale based on the scaling it has learnt on the training set.
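This can be verified by inspecting the fitted scaler before and after calling predict. The sketch below assumes the breast cancer dataset bundled with scikit-learn and a reduced parameter grid; the check itself relies on data_min_, the per-feature minima that MinMaxScaler stores when it is fitted.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0, stratify=Y)

pipeline_object = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
grid_search_object = GridSearchCV(
    pipeline_object,
    {"model__C": [0.01, 0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
grid_search_object.fit(X_train, Y_train)

# best_estimator_ is the pipeline refitted on all of X_train.
fitted_scaler = grid_search_object.best_estimator_.named_steps["scaler"]
min_before = fitted_scaler.data_min_.copy()

test_predictions = grid_search_object.best_estimator_.predict(X_test)

# predict() only called scaler.transform, so the stored minima are unchanged
# and still equal the training-set minima.
unchanged = np.array_equal(min_before, fitted_scaler.data_min_)
matches_train = np.allclose(fitted_scaler.data_min_, X_train.min(axis=0))
print(unchanged, matches_train)
```

One consequence is that scaled test values can fall outside the [0, 1] range whenever the test data contains values beyond the training minima or maxima.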