sklearn pipeline using KNeighborsClassifier - Python

I am trying to build a GridSearchCV pipeline in sklearn using KNeighborsClassifier and SVM. So far, I have tried the following code:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
from sklearn import svm
from sklearn.svm import SVC
clf = SVC(kernel='linear')
pipeline = Pipeline([ ('knn',neigh), ('sVM', clf)]) # Code breaks here
weight_options = ['uniform','distance']
param_knn = {'weights':weight_options}
param_svc = {'kernel':('linear', 'rbf'), 'C':[1,5,10]}
grid = GridSearchCV(pipeline, param_knn, param_svc, cv=5, scoring='accuracy')
but I am getting the following error:
TypeError: All intermediate steps should be transformers and implement fit and transform. 'KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='uniform')' (type <class 'sklearn.neighbors.classification.KNeighborsClassifier'>) doesn't
Can anyone please help me figure out what I am doing wrong, and how to correct it? I think there is something wrong with the last line as well, re the params.

The error says that KNeighborsClassifier does not have a transform method. A Pipeline can take any number of steps, but every step except the last must be a transformer, i.e. implement both fit and transform; only the final step may be a plain estimator. KNeighborsClassifier and SVC are both estimators, so they cannot be chained one after the other in the same pipeline. Please refer to the link below:
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
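Note also that GridSearchCV takes a single param_grid argument, not one per model; to search over both KNN and SVM, that argument can be a list of grids, each of which also swaps the estimator into a single pipeline step. A minimal sketch, using the iris data as a stand-in for the asker's dataset:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)  # stand-in data

# One swappable estimator slot; GridSearchCV accepts a list of param
# grids and searches each dict independently.
pipe = Pipeline([('clf', KNeighborsClassifier())])
param_grid = [
    {'clf': [KNeighborsClassifier(n_neighbors=3)],
     'clf__weights': ['uniform', 'distance']},
    {'clf': [SVC()],
     'clf__kernel': ['linear', 'rbf'],
     'clf__C': [1, 5, 10]},
]
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_)
Because each dict in the list is searched separately, the KNN parameters are never combined with the SVC ones.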

All intermediate steps of a scikit-learn Pipeline are required to implement the transform() method. You might want to try the pipeline from imblearn instead, which also accepts samplers as intermediate steps.
See for instance here: https://bsolomon1124.github.io/oversamp/

Related

Pipeline including several steps: StandardScaler(), RandomUnderSampler, Classifiers

I have the following code, and it shows the error TypeError: Last step of Pipeline should implement fit or be the string 'passthrough'. '[('sc', StandardScaler()), ('rus', RandomUnderSampler()), ('clf', LogisticRegression(max_iter=10000, multi_class='ovr', solver='sag'))]' (type <class 'list'>) doesn't
My code is as follows:
from sklearn.pipeline import Pipeline
from imblearn.pipeline import make_pipeline
from sklearn.pipeline import make_pipeline
Also, I have imported all the classifiers in my list:
classifiers = [LogisticRegression(solver='sag', penalty='l2', multi_class='ovr',
                                  max_iter=10000, random_state=None, fit_intercept=True),
               LinearDiscriminantAnalysis(shrinkage='auto'),
               LinearSVC(multi_class='ovr', penalty='l2'),
               QuadraticDiscriminantAnalysis(),
               SGDClassifier(max_iter=10000),
               GaussianProcessClassifier(max_iter_predict=10000, multi_class='one_vs_rest'),
               RidgeClassifier(solver='sag', random_state=None, max_iter=10000),
               DecisionTreeClassifier(min_samples_leaf=1),
               BaggingClassifier(),
               RandomForestClassifier()]
for classifier in classifiers:
    model = make_pipeline([('sc', StandardScaler()), ('rus', RandomUnderSampler()),
                           ('clf', classifier)])
    model.fit(X_train, y_train)
I need help to see where I have done something wrong, or maybe I am missing something!
The solution was:
for classifier in classifiers:
    model = Pipeline_imb([('sc', StandardScaler()), ('rus', RandomUnderSampler()),
                          ('clf', classifier)])
    model.fit(X_train, y_train)
I had to import:
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.pipeline import Pipeline as Pipeline_imb
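The underlying problem with the original code is that make_pipeline (sklearn's and imblearn's alike) takes the steps as separate positional arguments and names them itself; passing a list of (name, step) tuples is Pipeline's calling convention, so the whole list was treated as a single, invalid step. A sketch of the make_pipeline style, with LogisticRegression standing in for any of the classifiers and assuming X_train and y_train are already defined:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline as make_pipeline_imb

# Steps are passed as separate positional arguments, not a list of
# (name, step) tuples; make_pipeline generates the names itself.
model = make_pipeline_imb(StandardScaler(),
                          RandomUnderSampler(),
                          LogisticRegression(max_iter=10000))
model.fit(X_train, y_train)  # assumes X_train, y_train exist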

How to be sure that an sklearn pipeline applies the fit_transform method when using feature selection and an ML model in the pipeline?

Assume that I want to apply several feature selection methods using sklearn pipeline. An example is provided below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
fs_pipeline = Pipeline([('vt', VarianceThreshold(0.01)),
                        ('kbest', SelectKBest(chi2, k=5))])
X_new = fs_pipeline.fit_transform(X_train, y_train)
I get the selected features using the fit_transform method. If I use the fit method on the pipeline, I get back the fitted Pipeline object instead.
Now, assume that I want to add a ML model to the pipeline like below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
If I use the fit_transform method in the above code (model.fit_transform(X_train, y_train)), I get the error:
AttributeError: 'GradientBoostingClassifier' object has no attribute 'transform'
So I should use model.fit(X_train, y_train). But how can I be sure that the pipeline applied the fit_transform method for the feature selection steps?
A pipeline is meant for sequential data transformation, for which it makes multiple calls to .fit_transform(). You can be sure that .fit_transform() is called on the intermediate steps (all steps but the last one) of a pipeline, as that is how it works by design.
Namely, when you call .fit() or .fit_transform() on a Pipeline instance, .fit_transform() is called sequentially on all the intermediate transformers, and the output of each call is passed as input to the next. On the very last step, either .fit() or .fit_transform() is called, depending on which method was called on the pipeline itself; indeed, the last step is more commonly an estimator than a transformer (as is the case with your GradientBoostingClassifier).
Whenever the last step is an estimator rather than a transformer, as in your case, you won't be able to call .fit_transform() on the pipeline instance: the pipeline exposes the same methods as its final estimator/transformer, and estimators expose neither .transform() nor .fit_transform().
Summing up:
Case with an estimator in the last step (you can only call .fit() on the pipeline); model.fit(X_train, y_train) means the following, with y_train threaded through each call:
final_estimator.fit(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)
which in your case becomes
gbc.fit(kbest.fit_transform(vt.fit_transform(X_train, y_train), y_train), y_train)
Case with a transformer in the last step (you can call either .fit() or .fit_transform() on the pipeline, but suppose you call .fit_transform()); model.fit_transform(X_train, y_train) means the following:
final_transformer.fit_transform(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)
Eventually, here's the reference in the source code: https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/pipeline.py#L351
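To see the equivalence concretely, here is a small check (a sketch reusing the setup from the question) showing that manually chaining .fit_transform() through the intermediate steps and then fitting the final estimator reproduces the pipeline's predictions:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Pipeline version
model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)

# Manual version: fit_transform each intermediate step, fit the last
vt = VarianceThreshold(0.01)
kbest = SelectKBest(chi2, k=5)
gbc = GradientBoostingClassifier(random_state=0)
Xt = vt.fit_transform(X_train, y_train)
Xt = kbest.fit_transform(Xt, y_train)
gbc.fit(Xt, y_train)

# Same predictions on the identically transformed test set
Xt_test = kbest.transform(vt.transform(X_test))
print(np.array_equal(model.predict(X_test), gbc.predict(Xt_test)))  # True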

How to use KNeighborsClassifier in BaggingClassifier & How to solve "KNN doesn't support sample weights issue"

I am new to sklearn, and I am trying to combine KNN, Decision Tree, SVM, and Gaussian NB in a BaggingClassifier.
Part of my code looks like this:
best_KNN = KNeighborsClassifier(n_neighbors=5, p=1)
best_KNN.fit(X_train, y_train)
majority_voting = VotingClassifier(estimators=[('KNN', best_KNN), ('DT', best_DT), ('SVM', best_SVM), ('gaussian', gaussian_NB)], voting='hard')
majority_voting.fit(X_train, y_train)
bagging = BaggingClassifier(base_estimator=majority_voting)
bagging.fit(X_train, y_train)
But this causes an error saying:
TypeError: Underlying estimator KNeighborsClassifier does not support sample weights.
The "bagging" part worked fine if I remove KNN.
Does anyone have any idea to solve this issue? Thank you for your time.
In BaggingClassifier you can only use base estimators whose fit method supports sample weights here. BaggingClassifier checks whether the base estimator's fit accepts a sample_weight parameter, and since VotingClassifier's does, it implements its bootstrap by passing sample weights to it; VotingClassifier then forwards those weights to each of its inner estimators and raises the error, because KNeighborsClassifier.fit does not accept sample_weight.
You can list all the classifiers whose fit accepts sample_weight like this:
import inspect
from sklearn.utils import all_estimators

for name, clf in all_estimators(type_filter='classifier'):
    if 'sample_weight' in inspect.signature(clf.fit).parameters:
        print(name)
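If you still want KNN inside such an ensemble, one possible workaround (a sketch, not an official sklearn API) is a thin subclass whose fit accepts and ignores sample_weight; note that ignoring the weights changes the statistics of bagging's bootstrap, so use it with care:
from sklearn.neighbors import KNeighborsClassifier

class SampleWeightTolerantKNN(KNeighborsClassifier):
    # Accept sample_weight so VotingClassifier/BaggingClassifier can
    # forward it, but ignore it: KNN has no use for per-sample weights.
    def fit(self, X, y, sample_weight=None):
        return super().fit(X, y)
Dropping SampleWeightTolerantKNN in place of KNeighborsClassifier in the VotingClassifier should let the bagging step run, at the cost of KNN seeing unweighted data.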

Using Scikit-Learn's pipelines to combine transformers and an estimator

I am trying to use Scikit-Learn's Pipeline to organize our transformers and estimator, and I am having a problem building a pipeline that combines a one-hot transformer with a LinearRegression() estimator. It is challenging to connect the following pieces:
from sklearn.preprocessing import OneHotEncoder
cat_feats = np.array([[1,10],[2,20],[3,10],[4,20],[3,10],[2,20],[1,10]])
OneHotEncoder(sparse=False).fit_transform(cat_feats)
one_hot_transformer = OneHotEncoder(sparse=False).fit_transform(X,y)
from sklearn.pipeline import Pipeline
linear_est = Pipeline([one_hot_transformer], LinearRegression())
linear_est.fit(X,y)
predicted = linear_est.predict(X)
grader.score('intro_ml__linear_model', linear_est.predict)
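A minimal sketch of a working version: Pipeline takes a single list of (name, step) tuples, and each step must be an unfitted estimator object, not the array returned by fit_transform. Since the question does not define X or y, this reuses cat_feats with a hypothetical target:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_feats = np.array([[1, 10], [2, 20], [3, 10], [4, 20], [3, 10], [2, 20], [1, 10]])
y = np.array([1.0, 2.0, 1.5, 2.5, 1.5, 2.0, 1.0])  # hypothetical target values

# Pass the unfitted encoder as a named step; the pipeline calls
# fit_transform on it internally and feeds the result to the regressor.
linear_est = Pipeline([('one_hot', OneHotEncoder()),
                       ('lr', LinearRegression())])
linear_est.fit(cat_feats, y)
predicted = linear_est.predict(cat_feats)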

How to optimize a sklearn pipeline, using XGboost, for a different `eval_metric`?

I'm trying to use XGBoost and to optimize the eval_metric as auc (as described here).
This works fine when using the classifier directly, but fails when I try to use it in a pipeline.
What is the correct way to pass a .fit argument to the sklearn pipeline?
Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from xgboost import XGBClassifier
import xgboost
import sklearn
print('sklearn version: %s' % sklearn.__version__)
print('xgboost version: %s' % xgboost.__version__)
X, y = load_iris(return_X_y=True)
# Without using the pipeline:
xgb = XGBClassifier()
xgb.fit(X, y, eval_metric='auc') # works fine
# Making a pipeline with this classifier and a scaler:
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
# using the pipeline, but not optimizing for 'auc':
pipe.fit(X, y) # works fine
# however this does not work (even after correcting the underscores):
pipe.fit(X, y, classifier__eval_metric='auc') # fails
The error:
TypeError: before_fit() got an unexpected keyword argument 'classifier__eval_metric'
Regarding the version of xgboost:
xgboost.__version__ shows 0.6
pip3 freeze | grep xgboost shows xgboost==0.6a2.
The error is because you were using a single underscore between the estimator name and its parameter when using it in a pipeline. It should be two underscores.
From the documentation of Pipeline.fit(), we see the correct way of supplying params in fit:
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
So in your case, the correct usage is:
pipe.fit(X_train, y_train, classifier__eval_metric='auc')
(Notice two underscores between name and param)
When the goal is to optimize, I suggest using the sklearn wrapper and GridSearchCV:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
It looks like this:
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
score = 'roc_auc'
param = {
    'classifier__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # just as an example; note the double underscore
}
gsearch = GridSearchCV(estimator=pipe, param_grid=param, scoring=score)
GridSearchCV performs the cross-validation for you:
gsearch.fit(X, y)
And you get the best params and the best score:
gsearch.best_params_, gsearch.best_score_
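As a final note, in recent xgboost (1.6 and later) eval_metric moved from fit to the constructor, which sidesteps the fit-param routing question entirely. A sketch on a binary problem, where the 'roc_auc' scorer applies directly:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

# eval_metric set at construction time (xgboost >= 1.6), so nothing
# special needs to be passed through Pipeline.fit or GridSearchCV.fit.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', XGBClassifier(eval_metric='auc'))])
param = {'classifier__max_depth': [1, 2, 3, 4, 5]}  # note the double underscore
gsearch = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=5)
gsearch.fit(X, y)
print(gsearch.best_params_, gsearch.best_score_)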
