How to use the imbalanced library with sklearn pipeline? - python

I am trying to solve a text classification problem and want to create a baseline model using MultinomialNB.
My data is highly imbalanced for a few categories, so I decided to use the imbalanced-learn (imblearn) library with an sklearn pipeline, following the tutorial.
The model fails with an error after I add the two resampling stages to the pipeline as suggested in the docs.
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()
# A list of (name, estimator) tuples is passed here; this is the call that fails
pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,
                                                   ngram_range=(1, 2),
                                                   tokenizer=tokenize_and_stem)),
                          ('tfidf', TfidfTransformer(use_idf=True)),
                          ('enn', EditedNearestNeighbours()),
                          ('renn', RepeatedEditedNearestNeighbours()),
                          ('clf-gnb', MultinomialNB())])
Error:
TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
Can someone please help here? I am also open to a different approach (Boosting/SMOTE).

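It seems that make_pipeline from imblearn doesn't take named steps the way the Pipeline class does. From the imblearn documentation:
*steps : list of estimators.
Each estimator is passed as a separate positional argument and the step names are generated automatically, so you should modify your code to: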
pipe = make_pipeline_imb(CountVectorizer(max_features=100000,
                                         ngram_range=(1, 2),
                                         tokenizer=tokenize_and_stem),
                         TfidfTransformer(use_idf=True),
                         EditedNearestNeighbours(),
                         RepeatedEditedNearestNeighbours(),
                         MultinomialNB())

Related

Pipeline including several steps: StandardScaler(), RandomUnderSampler, Classifiers

I have the following code, and it raises this error: TypeError: Last step of Pipeline should implement fit or be the string 'passthrough'. '[('sc', StandardScaler()), ('rus', RandomUnderSampler()), ('clf', LogisticRegression(max_iter=10000, multi_class='ovr', solver='sag'))]' (type <class 'list'>) doesn't
My code is as follows:
from sklearn.pipeline import Pipeline
from imblearn.pipeline import make_pipeline
from sklearn.pipeline import make_pipeline
I have also imported all the classifiers in my list:
classifiers = [LogisticRegression(solver='sag', penalty='l2', multi_class='ovr',
                                  max_iter=10000, random_state=None, fit_intercept=True),
               LinearDiscriminantAnalysis(shrinkage='auto'),
               LinearSVC(multi_class='ovr', penalty='l2'),
               QuadraticDiscriminantAnalysis(),
               SGDClassifier(max_iter=10000),
               GaussianProcessClassifier(max_iter_predict=10000, multi_class='one_vs_rest'),
               RidgeClassifier(solver='sag', random_state=None, max_iter=10000),
               DecisionTreeClassifier(min_samples_leaf=1),
               BaggingClassifier(),
               RandomForestClassifier()]
for classifier in classifiers:
    model = make_pipeline([('sc', StandardScaler()), ('rus', RandomUnderSampler()),
                           ('clf', classifier)])
    model.fit(X_train, y_train)
I need help to see what I have done wrong or what I am missing.
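The solution was to use the Pipeline class from imblearn, which takes a list of named steps and accepts samplers as intermediate steps (make_pipeline expects the estimators as separate arguments, not as a list):
for classifier in classifiers:
    model = Pipeline_imb([('sc', StandardScaler()), ('rus', RandomUnderSampler()),
                          ('clf', classifier)])
    model.fit(X_train, y_train)
For this I had to import:
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.pipeline import Pipeline as Pipeline_imb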

Extract log probabilities from MultinomialNB

I have a scikit-learn Pipeline made of a feature extractor and a VotingClassifier, which contains MultinomialNB and some other models. When I train MultinomialNB separately I can extract the log probabilities using nb.feature_log_prob_, but inside a pipeline this attribute is missing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
vclf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', VotingClassifier(
        estimators=[
            ('nb', MultinomialNB()),
            [...]
        ]
    ))
])
vclf.fit(train_X, train_y)
nb = vclf.named_steps['clf'].estimators[0][1]
nb.feature_log_prob_
AttributeError: 'MultinomialNB' object has no attribute 'feature_log_prob_'
According to the documentation, estimators_ is the correct attribute to access the list of fitted sub-estimators of the VotingClassifier. Your code should therefore look like this:
nb = vclf.named_steps['clf'].estimators_[0]
print(nb.feature_log_prob_)
The MultinomialNB you accessed through estimators is the original, unfitted object (VotingClassifier fits clones of it) and therefore has no feature_log_prob_ attribute. That is where the error came from.
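The difference between estimators and estimators_ can be checked on a toy example. The four-document corpus and the LogisticRegression second voter below are made up for illustration; only the access pattern matters.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration.
train_X = ["good movie", "great film", "bad movie", "awful film"]
train_y = [1, 1, 0, 0]

vclf = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", VotingClassifier(estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression()),
    ])),
])
vclf.fit(train_X, train_y)

voter = vclf.named_steps["clf"]
unfitted_nb = voter.estimators[0][1]       # the object passed in, never fitted
fitted_nb = voter.named_estimators_["nb"]  # the fitted clone, looked up by name

print(hasattr(unfitted_nb, "feature_log_prob_"))
print(fitted_nb.feature_log_prob_.shape)   # (n_classes, n_features)
```

Looking the estimator up by name through named_estimators_ avoids relying on its position in the list.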

How to use KNeighborsClassifier in BaggingClassifier & How to solve "KNN doesn't support sample weights issue"

I am new to Sklearn, and I am trying to combine KNN, Decision Tree, SVM, and Gaussian NB for BaggingClassifier.
Part of my code looks like this:
best_KNN = KNeighborsClassifier(n_neighbors=5, p=1)
best_KNN.fit(X_train, y_train)
majority_voting = VotingClassifier(estimators=[('KNN', best_KNN), ('DT', best_DT), ('SVM', best_SVM), ('gaussian', gaussian_NB)], voting='hard')
majority_voting.fit(X_train, y_train)
bagging = BaggingClassifier(base_estimator=majority_voting)
bagging.fit(X_train, y_train)
But this causes an error saying:
TypeError: Underlying estimator KNeighborsClassifier does not support sample weights.
The "bagging" part worked fine if I remove KNN.
Does anyone have any idea to solve this issue? Thank you for your time.
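The error is raised because sample weights end up being passed to KNeighborsClassifier, whose fit method has no sample_weight parameter. BaggingClassifier passes sample_weight to a base estimator whose fit accepts it; VotingClassifier.fit does accept it and forwards it to each of its sub-estimators, and KNN cannot take it.
You can list all the classifiers whose fit supports sample_weight like this: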
import inspect
from sklearn.utils import all_estimators
for name, clf in all_estimators(type_filter='classifier'):
    # keep the classifiers whose fit() accepts a sample_weight argument
    if 'sample_weight' in inspect.signature(clf.fit).parameters:
        print(name)
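A single estimator can also be checked with has_fit_parameter from sklearn.utils.validation. The sketch below does that and then shows one possible rearrangement, offered as an assumption about what may be acceptable rather than as the original design: bag each base model separately and vote over the bags, so no sample weights ever need to reach KNN. It uses the iris data and leaves out the SVM only to stay short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import has_fit_parameter

candidates = {
    "KNN": KNeighborsClassifier(n_neighbors=5, p=1),
    "DT": DecisionTreeClassifier(random_state=0),
    "gaussian": GaussianNB(),
}

# Which of these accept sample_weight in fit()?
supports_weights = {name: has_fit_parameter(est, "sample_weight")
                    for name, est in candidates.items()}
print(supports_weights)

X, y = load_iris(return_X_y=True)

# Alternative arrangement: one bagging ensemble per base model, then hard voting.
# The base estimator is passed positionally because its keyword name differs
# between scikit-learn versions.
bagged = [(name, BaggingClassifier(est, n_estimators=5, random_state=0))
          for name, est in candidates.items()]
voting_of_bags = VotingClassifier(estimators=bagged, voting="hard")
voting_of_bags.fit(X, y)
predictions = voting_of_bags.predict(X)
print(predictions[:5])
```

This is not equivalent to bagging the whole voting ensemble, since each model now gets its own bootstrap samples, so treat it as a workaround to evaluate rather than a drop-in replacement.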

SKlearn pipeline using KNeighborsClassifier

I am trying to build a GridSearchCV pipeline in sklearn using KNeighborsClassifier and SVM. So far, I have tried the following code:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
from sklearn import svm
from sklearn.svm import SVC
clf = SVC(kernel='linear')
pipeline = Pipeline([ ('knn',neigh), ('sVM', clf)]) # Code breaks here
weight_options = ['uniform','distance']
param_knn = {'weights':weight_options}
param_svc = {'kernel':('linear', 'rbf'), 'C':[1,5,10]}
grid = GridSearchCV(pipeline, param_knn, param_svc, cv=5, scoring='accuracy')
but am getting the following error:
TypeError: All intermediate steps should be transformers and implement fit and transform. 'KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='uniform')' (type <class 'sklearn.neighbors.classification.KNeighborsClassifier'>) doesn't
Can anyone please help me see what I am doing wrong and how to correct it? I think there is something wrong with the last line as well, regarding the params.
The error says that KNeighborsClassifier does not have a transform method. Every step of a Pipeline except the last must be a transformer, i.e. implement both fit and transform; only the final step may be a plain estimator. KNeighborsClassifier and SVC are both classifiers and neither has a transform method, so they cannot be chained one after the other in a single Pipeline. Please refer to the link below:
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
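Intermediate steps of a scikit-learn Pipeline are required to have a transform() method. If the intermediate step you need is a resampler rather than a transformer, you might want to try the pipeline from imblearn instead, although it does not accept a classifier as an intermediate step either.
See for instance here: https://bsolomon1124.github.io/oversamp/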
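One common way to search over both classifiers with a single GridSearchCV is to give the pipeline one final 'clf' step and let the parameter grid swap the estimator in that slot. The sketch below assumes this is what the asker wants; the StandardScaler step and the iris data are additions for illustration and are not part of the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One transformer followed by a single final estimator slot named 'clf'.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])

# A list of grids: each dict sets the estimator for 'clf' and its own parameters.
param_grid = [
    {"clf": [KNeighborsClassifier(n_neighbors=3)],
     "clf__weights": ["uniform", "distance"]},
    {"clf": [SVC()],
     "clf__kernel": ["linear", "rbf"],
     "clf__C": [1, 5, 10]},
]

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)
```

Note that GridSearchCV takes one param_grid argument (a dict or a list of dicts), and parameters of a pipeline step are addressed as stepname__parameter.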

Include customized feature extraction methods in sklearn Pipeline

In sklearn, it is possible to define a pipeline in the following way:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
Is it also possible to include custom feature extraction methods like
extract_features(image, cspace='RGB',
                 pix_per_cell=128, cell_per_block=32,
                 hog_channel=10)
and how would I do that?
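One way this is often done is to wrap the function in a FunctionTransformer. In the sketch below, extract_features is a hypothetical stand-in (a per-channel intensity histogram with a bins parameter), not the function from the question, and the random images and labels are made up so the example runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def extract_features(image, bins=8):
    """Hypothetical stand-in: per-channel intensity histogram of one image."""
    channels = [np.histogram(image[..., c], bins=bins, range=(0.0, 1.0))[0]
                for c in range(image.shape[-1])]
    return np.concatenate(channels).astype(float)


def extract_batch(images, bins=8):
    """Apply the per-image extractor to a batch and stack the rows into a 2D array."""
    return np.array([extract_features(image, bins=bins) for image in images])


# Made-up data: 20 RGB images of 8x8 pixels; the first 10 are darker.
rng = np.random.default_rng(0)
images = rng.random((20, 8, 8, 3))
images[:10] *= 0.3
labels = np.array([0] * 10 + [1] * 10)

pipe = Pipeline([
    # kw_args forwards the extractor's keyword arguments
    ("features", FunctionTransformer(extract_batch, kw_args={"bins": 8})),
    ("reduce_dim", PCA(n_components=2)),
    ("clf", SVC()),
])
pipe.fit(images, labels)

features = pipe.named_steps["features"].transform(images)
predictions = pipe.predict(images)
print(features.shape, predictions.shape)
```

FunctionTransformer fits the case where the extractor is stateless; if it has to learn something from the training data, a small class deriving from BaseEstimator and TransformerMixin with its own fit and transform is the usual alternative.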
