Pipeline with SMOTE and Imputer errors - Python

I am trying to create a pipeline that first imputes missing data, then oversamples with SMOTE, and then fits the model.
My code worked perfectly before I tried SMOTE; now I can't find any solution.
Here is the code without SMOTE:
scoring = ['balanced_accuracy', 'f1_macro']
imputer = SimpleImputer(strategy='most_frequent')
pipeline = Pipeline(steps=[('i', imputer),('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1)
And here's the code after adding SMOTE.
Note: I tried importing make_pipeline from imblearn.
imputer = SimpleImputer(strategy='most_frequent')
pipeline = Pipeline(steps=[('i', imputer),('over', SMOTE()),('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1)
When I import Pipeline from sklearn I get this error:
All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't
When I tried importing make_pipeline from imblearn I get this error:
Last step of Pipeline should implement fit or be the string 'passthrough'. '[('i', SimpleImputer(strategy='most_frequent')), ('over', SMOTE()), ('m', RandomForestClassifier())]' (type <class 'list'>) doesn't

Use the imblearn pipeline:
from imblearn.pipeline import Pipeline
pipeline = Pipeline([('i', imputer),('over', SMOTE()),('m', model)])
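A complete version for reference (a minimal sketch; model, X, and y are assumed to be defined as in the question):
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline, not sklearn's
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# imblearn's Pipeline accepts samplers such as SMOTE as intermediate steps;
# sklearn's Pipeline rejects them, which produced the first error above
imputer = SimpleImputer(strategy='most_frequent')
pipeline = Pipeline(steps=[('i', imputer), ('over', SMOTE()), ('m', model)])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_validate(pipeline, X, y,
                        scoring=['balanced_accuracy', 'f1_macro'],
                        cv=cv, n_jobs=-1)
The sampler only runs during fit, so each cross-validation test fold is still scored on un-resampled data.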

RFE ranking with Gridsearch

I want to use RFE for feature selection in a pipeline. I have no problems getting it to work in pipelines without GridSearch. However, when I try to incorporate GridSearch, I keep getting a value error (NB. the models are fine without RFE).
I have tried to use feature_selection as was suggested in this topic: Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error, but this results in the same error.
What could be wrong?
My error:
ValueError: Invalid parameter alpha for estimator RFE(estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=True, random_state=None, solver='auto',
tol=0.001),
n_features_to_select=4, step=1, verbose=1). Check the list of available parameters with estimator.get_params().keys().
This works fine:
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4, verbose=1)
# set up the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.NaN, strategy='most_frequent')),
         ('reg', rfe)]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the pipeline to the training set:
pipeline.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = pipeline.predict(X_test)
print()
# Print the features and their ranking (high = dropped early on)
print(dict(zip(X.columns, rfe.ranking_)))
# Print the features that are not eliminated
print(X.columns[rfe.support_])
print()
print("R^2: {}".format(pipeline.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
This doesn't work:
rfe = RFE(estimator=Ridge(normalize=True), n_features_to_select=4, verbose=1)
# set up the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.NaN, strategy='most_frequent')),
         ('ridge', rfe)]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
#Define hyperparameters and range of Grid Search
parameters = {"ridge__alpha": np.linspace(0,1,100)}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# run cross validation
cv = GridSearchCV(pipeline, param_grid = parameters, cv=3)
# Fit the pipeline to the training set:
cv.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = cv.predict(X_test)
# Compute and print R^2 and RMSE
print("R^2: {}".format(cv.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
print("Tuned Model Parameters: {}".format(cv.best_params_))
Using feature_selection also doesn't work:
selector = feature_selection.RFE(Ridge(normalize=True))
# set up the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.NaN, strategy='most_frequent')),
         ('RFE', selector)]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
The question is old, but in case someone stumbles upon it:
You can access the hyperparameter alpha, or any other parameter of the estimator wrapped in RFE(estimator=...), with the parameter name '<rfe_step_name>__estimator__<parameter>':
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.feature_selection import RFE

model = RFE(estimator=Ridge())
pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("rfe", model)])
# double prefix: 'rfe' is the pipeline step, 'estimator' is the Ridge wrapped
# by RFE, and 'alpha' is the Ridge parameter being tuned
param = {"rfe__step": np.linspace(0.1, 1, 10),
         "rfe__estimator__alpha": np.logspace(-3, 3, 7)}
tscv = TimeSeriesSplit(n_splits=5).split(X_train)
gridsearch = GridSearchCV(estimator=pipe, cv=tscv, param_grid=param,
                          refit=True, return_train_score=True, n_jobs=-1)
fit = gridsearch.fit(X_train, y_train)
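Applied to the question's own pipeline, where the RFE step is named 'ridge', the grid key becomes ridge__estimator__alpha rather than ridge__alpha (a sketch reusing the question's pipeline, X_train, and y_train):
# Ridge's alpha is reached through the RFE wrapper: <step>__estimator__<param>
parameters = {"ridge__estimator__alpha": np.linspace(0, 1, 100)}
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
cv.fit(X_train, y_train)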

Problem: Python ML error with wine quality Kaggle exercise

I've tried to do the wine quality exercise on Kaggle.
Here is my code (the beginning):
X = data.drop(["quality"], axis=1)
Y = data["quality"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

def encodage(df):
    code = {"positive": 1,
            "negative": 0,
            "detected": 1,
            "not_detected": 0}
    for col in df.select_dtypes("object").columns:
        df.loc[:, col] = df[col].map(code)
    return df

encodage(X_train)
encodage(X_test)

model_test = DecisionTreeClassifier(random_state=0)

def evaluation(model):
    model.fit(X_train, Y_train)
    ypred = model.predict(X_test)
    print(confusion_matrix(Y_test, ypred))
    print(classification_report(Y_test, ypred))

numerical_features = make_column_selector(dtype_include=np.number)
categorical_features = make_column_selector(dtype_include=np.number)
numerical_pipeline = make_pipeline(SimpleImputer(), StandardScaler(), PolynomialFeatures(2, include_bias=False), SelectKBest(f_classif, k=10))
categorical_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder()), SelectKBest(f_classif, k=10)
preprocessor = make_column_transformer((numerical_pipeline, numerical_features), (categorical_pipeline, categorical_features))

RandomForest = make_pipeline(preprocessor, RandomForestClassifier(random_state=0))
AdaBoost = make_pipeline(preprocessor, AdaBoostClassifier(random_state=0))
SVM = make_pipeline(preprocessor, StandardScaler(), SVC(random_state=0))
KNN = make_pipeline(preprocessor, StandardScaler(), KNeighborsClassifier())
dict_of_models = {"RandomForest": RandomForest, "AdaBoost": AdaBoost, "SVM": SVM, "KNN": KNN}

for name, model in dict_of_models.items():
    print(name)
    evaluation(model)
Everything was fine, and I had a score of 0.66 with model_test (not visible here), but when I arrive at for name, model in ..., I get this error:
TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. '(Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')),
('onehotencoder', OneHotEncoder())]), SelectKBest())' (type <class 'tuple'>) doesn't.
The problem is a misplaced closing parenthesis in categorical_pipeline: make_pipeline(...) is closed right after OneHotEncoder(), so SelectKBest(f_classif, k=10) ends up outside the call and categorical_pipeline becomes the (Pipeline, SelectKBest) tuple that the error message complains about.
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html
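Moving the parenthesis so that SelectKBest sits inside the call fixes it (a sketch of the corrected line):
categorical_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                                     OneHotEncoder(),
                                     SelectKBest(f_classif, k=10))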
Check out these other questions:
ColumnTransformer generating a TypeError when trying to fit_transform pipeline in sklearn
sklearn:TypeError: All estimators should implement fit and transform

Problem passing Pipeline through nested RFECV and GridSearchCV

I'm trying to perform feature selection and grid search for the inner loop of a nested CV in sklearn. While I can pass a pipeline as the estimator to the RFECV, I then receive an error on fitting when I pass the RFECV as the estimator to GridSearchCV.
I have found that changing the name of the model in the pipeline to 'estimator' moves the error to the Pipeline, with 'regressor' as the invalid parameter, rather than to the RFECV, where whatever I had named the model was the invalid parameter.
I have verified using rfcv.get_params().keys() and pipeline.get_params().keys() that the parameters I'm calling do exist.
I do not receive this error if I name SGDRegressor() directly as ‘estimator’ and ignore the pipeline entirely, but this model requires feature scaling and log transforming of the Y variable.
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import SGDRegressor
import numpy as np

# random sample data
X = np.random.rand(100, 2)
y = np.random.rand(100)

# passing coef_ and feature_importances_ through the pipe and TransformedTargetRegressor
class MyPipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

class MyTransformedTargetRegressor(TransformedTargetRegressor):
    @property
    def feature_importances_(self):
        return self.regressor_.feature_importances_

    @property
    def coef_(self):
        return self.regressor_.coef_

# build pipeline
pipeline = MyPipeline([('scaler', MinMaxScaler()),
                       ('estimator', MyTransformedTargetRegressor(regressor=SGDRegressor(),
                                                                  func=np.log1p, inverse_func=np.expm1))])

# define tuning grid
parameters = {"estimator__regressor__alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
              "estimator__regressor__l1_ratio": [0.001, 0.25, 0.5, 0.75, 0.999]}

# instantiate inner cv
inner_kv = KFold(n_splits=5, shuffle=True, random_state=42)
rfcv = RFECV(estimator=pipeline, step=1, cv=inner_kv, scoring="neg_mean_squared_error")
cv = GridSearchCV(estimator=rfcv, param_grid=parameters, cv=inner_kv, iid=True,
                  scoring="neg_mean_squared_error", n_jobs=-1, verbose=True)
cv.fit(X, y)
I receive the following error and can confirm that regressor is a parameter for estimator pipeline:
ValueError: Invalid parameter regressor for estimator MyPipeline(memory=None,
steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
('estimator',
MyTransformedTargetRegressor(check_inverse=True,
func=<ufunc 'log1p'>,
inverse_func=<ufunc 'expm1'>,
regressor=SGDRegressor(alpha=0.0001,
average=False,
early_stopping=False,
epsilon=0.1,
eta0=0.01,
fit_intercept=True,
l1_ratio=0.15,
learning_rate='invscaling',
loss='squared_loss',
max_iter=1000,
n_iter_no_change=5,
penalty='l2',
power_t=0.25,
random_state=None,
shuffle=True,
tol=0.001,
validation_fraction=0.1,
verbose=0,
warm_start=False),
transformer=None))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Thanks
It has to be estimator__estimator__regressor because you have the pipeline inside the rfecv.
Try this!
parameters = {"estimator__estimator__regressor__alpha": [1e-5,1e-4,1e-3,1e-2,1e-1],
"estimator__estimator__regressor__l1_ratio": [0.001,0.25,0.5,0.75,0.999]}
Note: having a nested CV like this is not the right approach. Maybe you can do the feature selection separately and then do the model training.
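To see where the double prefix in the grid above comes from, you can list the tunable parameters of the object handed to GridSearchCV (a sketch; rfcv is the RFECV from the question):
# GridSearchCV tunes rfcv; rfcv's 'estimator' is the pipeline, whose final step
# is also named 'estimator' and wraps 'regressor'
print(sorted(rfcv.get_params().keys()))
# the output contains 'estimator__estimator__regressor__alpha', matching the grid above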

Cross validation with multiple parameters using f1-score

I am trying to do feature selection using SelectKBest and to find the best tree depth for binary classification using the f1-score. I have created a scorer function to select the best features and to evaluate the grid search. An error of "__call__() missing 1 required positional argument: 'y_true'" pops up when the classifier is trying to fit the training data.
#Define scorer
f1_scorer = make_scorer(f1_score)
#Split data into training, CV and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state = 0)
#initialize tree and Select K-best features for classifier
kbest = SelectKBest(score_func=f1_scorer, k=all)
clf = DecisionTreeClassifier(random_state=0)
#create a pipeline for features to be optimized
pipeline = Pipeline([('kbest',kbest),('dt',clf)])
#initialize a grid search with features to be optimized
gs = GridSearchCV(pipeline,{'kbest__k': range(2,11), 'dt__max_depth':range(3,7)}, refit=True, cv=5, scoring = f1_scorer)
gs.fit(X_train,y_train)
#order best selected features into a single variable
selector = SelectKBest(score_func=f1_scorer, k=gs.best_params_['kbest__k'])
X_new = selector.fit_transform(X_train,y_train)
On the fit line I get a TypeError: __call__() missing 1 required positional argument: 'y_true'.
The problem is the score_func you used for SelectKBest. score_func must be a function that takes two arrays X and y and returns either a pair of arrays (scores, pvalues) or a single array of scores; in your code you passed the callable f1_scorer, which takes y_true and y_pred and computes the F1 score. You can use one of chi2, f_classif, or mutual_info_classif as your score_func for a classification task. There is also a minor bug in the parameter k for SelectKBest: it should be "all", not all. I have modified your code to incorporate these changes:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_selection import f_classif
from sklearn.metrics import f1_score, make_scorer
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_classes=2,
                           n_informative=4, weights=[0.7, 0.3],
                           random_state=0)

f1_scorer = make_scorer(f1_score)

# Split data into training, CV and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# initialize tree and SelectKBest for the classifier
kbest = SelectKBest(score_func=f_classif)
clf = DecisionTreeClassifier(random_state=0)

# create a pipeline for the features to be optimized
pipeline = Pipeline([('kbest', kbest), ('dt', clf)])
gs = GridSearchCV(pipeline, {'kbest__k': range(2, 11), 'dt__max_depth': range(3, 7)},
                  refit=True, cv=5, scoring=f1_scorer)
gs.fit(X_train, y_train)
gs.best_params_
OUTPUT
{'dt__max_depth': 6, 'kbest__k': 9}
Also modify your last two lines as below:
selector = SelectKBest(score_func=f_classif, k=gs.best_params_['kbest__k'])
X_new = selector.fit_transform(X_train,y_train)
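Since refit=True, the grid search has already refit the winning pipeline, so the fitted selector can also be pulled straight out of it instead of refitting (a sketch):
selected = gs.best_estimator_.named_steps['kbest']   # already fitted on X_train
X_new = selected.transform(X_train)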

How to apply StandardScaler in Pipeline in scikit-learn (sklearn)?

In the example below,
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])
param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
I am using StandardScaler(); is this the correct way to apply it to the test set as well?
Yes, this is the right way to do this but there is a small mistake in your code. Let me break this down for you.
When you use the StandardScaler as a step inside a Pipeline then scikit-learn will internally do the job for you.
What happens can be described as follows:
Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
Step 1: the scaler is fitted on the TRAINING data
Step 2: the scaler transforms TRAINING data
Step 3: the models are fitted/trained using the transformed TRAINING data
Step 4: the scaler is used to transform the TEST data
Step 5: the trained models predict using the transformed TEST data
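Done by hand for a single split, that sequence looks like this (a minimal sketch of what the pipeline automates):
scaler = StandardScaler().fit(X_train)           # Step 1: fit the scaler on TRAINING data only
X_train_scaled = scaler.transform(X_train)       # Step 2: transform TRAINING data
clf = SVC(kernel='linear', C=1).fit(X_train_scaled, y_train)   # Step 3: train the model
X_test_scaled = scaler.transform(X_test)         # Step 4: transform TEST data with the same scaler
y_pred = clf.predict(X_test_scaled)              # Step 5: predict on transformed TEST data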
Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV will automatically split the data into training and testing data (this happens internally).
Use something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])
param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)
Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search in the result object returned from grid.fit(). The best_score_ member provides access to the best score observed during the optimization procedure and the best_params_ describes the combination of parameters that achieved the best results.
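For example (using the fitted grid above):
print(grid.best_params_)   # the parameter combination behind best_score_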
IMPORTANT EDIT 1: if you want to hold back a validation dataset from the original dataset, use this:
X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = \
    train_test_split(X, y, test_size=0.15, random_state=1)
Then use:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
Quick answer: Your methodology is correct.
Although the above answer is very good, I just would like to point out some subtleties:
best_score_ [1] is the best cross-validation metric, and not the generalization performance of the model [2]. To evaluate how well the best found parameters generalize, you should call the score on the test set, as you've done. Therefore it is needed to start by splitting the data into training and test set, fit the grid search only in the X_train, y_train, and then score it with X_test, y_test [2].
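In code, that recommended flow looks like this (a minimal sketch reusing pipe and param_grid from above; train_test_split is assumed to be imported):
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)          # cross-validation runs inside the training set only
print(grid.best_score_)             # best CV metric, not generalization performance
print(grid.score(X_test, y_test))   # generalization estimate on the held-out test set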
Deep Dive:
A threefold split of data into training set, validation set and test set is one way to prevent overfitting in the parameters during grid search. On the other hand, GridSearchCV uses Cross-Validation in the training set, instead of having both training and validation set, but this does not replace the test set. This can be verified in [2] and [3].
References:
[1] GridSearchCV
[2] Introduction to Machine Learning with Python
[3] 3.1 Cross-validation: evaluating estimator performance
