I have the following scenario:
preprocess = make_column_transformer(
(SimpleImputer(strategy='constant',fill_value = 0),numeric_cols),
(ce.TargetEncoder(),['country'])
)
pipeline = make_pipeline(preprocess,XGBClassifier())
pipeline[0].get_params().keys()
dict_keys(['n_jobs', 'remainder', 'sparse_threshold', 'transformer_weights', 'transformers', 'verbose', 'simpleimputer', 'targetencoder', 'simpleimputer__add_indicator', 'simpleimputer__copy', 'simpleimputer__fill_value', 'simpleimputer__missing_values', 'simpleimputer__strategy', 'simpleimputer__verbose', 'targetencoder__cols', 'targetencoder__drop_invariant', 'targetencoder__handle_missing', 'targetencoder__handle_unknown', 'targetencoder__min_samples_leaf', 'targetencoder__return_df', 'targetencoder__smoothing', 'targetencoder__verbose'])
I then want to run a grid search over the smoothing factor:
param_grid = {
'xgbclassifier__learning_rate': [0.01,0.005,0.001],
'targetencoder__smoothing': [1, 10, 30, 50]
}
pipeline = make_pipeline(preprocess,XGBClassifier())
# Initialize the grid search model
clf = GridSearchCV(pipeline,param_grid = param_grid,scoring = 'neg_mean_squared_error',
verbose= 1,iid= True,
refit = True,cv = 3)
clf.fit(X_train,y_train)
However, I get this error:
ValueError: Invalid parameter transformer_targetencoder for estimator Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers...
How can I access the smoothing parameter?
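In your example the key is columntransformer__targetencoder__smoothing: make_pipeline names the ColumnTransformer step columntransformer, so every parameter nested inside it needs that prefix. To reproduce the pipeline, I first build an example dataset and define the columns: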
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import category_encoders as ce
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
X_train = pd.DataFrame({'x1':np.random.normal(0,1,50),
'x2':np.random.normal(0,1,50),
'country':np.random.choice(['A','B','C'],50)})
y_train = np.random.binomial(1,0.5,50)
numeric_cols = ['x1','x2']
preprocess = make_column_transformer(
(SimpleImputer(strategy='constant',fill_value = 0),numeric_cols),
(ce.TargetEncoder(),['country'])
)
pipeline = make_pipeline(preprocess,XGBClassifier())
Inspect the parameter keys of the whole pipeline, not just of its first step:
pipeline.get_params().keys()
Then set up the grid, making sure that the smoothing values are floats (see this issue):
param_grid = { 'columntransformer__targetencoder__smoothing': [1.0, 10.0],
'xgbclassifier__learning_rate': [0.01,0.001]}
pipeline = make_pipeline(preprocess,XGBClassifier())
clf = GridSearchCV(pipeline,param_grid = param_grid,scoring = 'neg_mean_squared_error',
verbose= 1,refit = True,cv = 3)
clf.fit(X_train,y_train)
With those keys the search should run.
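The prefix rule can be checked without xgboost or category_encoders. The sketch below is only an illustration under assumptions: it swaps in OneHotEncoder and LogisticRegression as stand-ins (they are not part of the original pipeline), and the names preprocess_demo and pipeline_demo are made up for this example. It shows that a parameter visible on the column transformer needs the columntransformer__ prefix once the transformer sits inside a pipeline.

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in components (assumption): OneHotEncoder replaces ce.TargetEncoder,
# LogisticRegression replaces XGBClassifier.
preprocess_demo = make_column_transformer(
    (SimpleImputer(strategy="constant", fill_value=0), ["x1", "x2"]),
    (OneHotEncoder(handle_unknown="ignore"), ["country"]),
)
pipeline_demo = make_pipeline(preprocess_demo, LogisticRegression())

# Keys seen from the column transformer alone vs. from the whole pipeline.
inner_keys = set(preprocess_demo.get_params().keys())
outer_keys = set(pipeline_demo.get_params().keys())

print("onehotencoder__handle_unknown" in inner_keys)
print("columntransformer__onehotencoder__handle_unknown" in outer_keys)
```

GridSearchCV calls set_params on the pipeline, so the keys in param_grid have to come from the second set.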
I created a Pipeline with RFE and RandomForestClassifier in it and then applied RandomizedSearchCV to find the best hyperparameter values for both. This is what my code looks like:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
steps = [
("rfe", RFE(estimator = RandomForestClassifier(random_state = 42))),
("est", RandomForestClassifier())
]
rf_clf_pl = Pipeline(steps = steps)
params = {
"rfe__n_features_to_select" : range(2, smote_X_train.shape[1] + 1),
"est__random_state" : np.linspace(0, 42, 5).astype(int),
"est__n_estimators" : range(50, 201, 10),
"est__max_depth" : [None] + list(range(5, max_depth, 3)),
"est__max_leaf_nodes" : [None] + list(range(100, max_leaf_nodes, 20))
}
rs = RandomizedSearchCV(estimator = rf_clf_pl, cv = 4, param_distributions = params, n_jobs = -1, n_iter = 100, random_state = 42)
rs.fit(smote_X_train, smote_y_train)
I tried using the code below but got an error -
rf_clf_pl.named_steps["rfe"].support_
Error -
AttributeError Traceback (most recent call last)
<ipython-input-53-c73290f0e090> in <module>()
----> 1 rf_clf_pl.named_steps["rfe"].support_
AttributeError: 'RFE' object has no attribute 'support_'
How can I get the names of the retained features?
You can access the retained features of the best estimator as follows:
rs.best_estimator_.named_steps['rfe'].support_
That is, access the best_estimator_ attribute of the fitted RandomizedSearchCV instance. It holds the pipeline refitted with the best hyperparameters found, and it exists because refit=True is the default for RandomizedSearchCV.
Accessing support_ on rf_clf_pl does not work because that pipeline object was never fitted. The search calls .fit() on clones of the estimator you pass in, so the original stays unfitted; the only fitted pipeline the search keeps is best_estimator_.
Here's an example:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, train_test_split
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=0)
steps = [
("rfe", RFE(estimator = RandomForestClassifier(random_state = 42))),
("est", RandomForestClassifier())
]
rf_clf_pl = Pipeline(steps = steps)
params = {
"rfe__n_features_to_select" : range(2, X_train.shape[1] + 1),
"est__random_state" : np.linspace(0, 42, 5).astype(int),
"est__n_estimators" : range(50, 201, 10),
"est__max_depth" : [None] + list(range(5, 16, 3)),
"est__max_leaf_nodes" : [None] + list(range(100, 201, 20))
}
rs = RandomizedSearchCV(estimator = rf_clf_pl, cv = 4, param_distributions = params, n_jobs = -1, n_iter = 100, random_state = 42)
rs.fit(X_train, y_train)
rs.best_estimator_.named_steps['rfe'].support_
Finally, if you want the names of the retained features, you can retrieve them via rs.feature_names_in_[np.where(rs.best_estimator_.named_steps['rfe'].support_)[0]].
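The point about clones can be verified directly. The following is a small sketch on the iris data with a deliberately tiny search space; the names pipe_demo and search_demo are made up for the example. After the search is fitted, the pipeline that was passed in still has no support_, while best_estimator_ does.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

pipe_demo = Pipeline([
    ("rfe", RFE(RandomForestClassifier(n_estimators=10, random_state=0))),
    ("est", RandomForestClassifier(n_estimators=10, random_state=0)),
])
search_demo = RandomizedSearchCV(
    pipe_demo,
    {"rfe__n_features_to_select": [2, 3]},
    n_iter=2,
    cv=3,
    random_state=0,
)
search_demo.fit(X_demo, y_demo)

# The pipeline passed to the search was only cloned, never fitted itself.
original_is_fitted = hasattr(pipe_demo.named_steps["rfe"], "support_")

# The refitted best pipeline carries the boolean mask of retained features.
best_rfe = search_demo.best_estimator_.named_steps["rfe"]
selected = list(X_demo.columns[best_rfe.support_])

print(original_is_fitted)
print(selected)
```

Indexing the training DataFrame's columns with the mask is an alternative way to get the names when the data was passed as a DataFrame.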
My ML project is about loan eligibility prediction.
For that I used the data at https://www.kaggle.com/code/sazid28/home-loan-prediction/data?select=train.csv
and my code is shown below:
import random
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import \
SimpleImputer, KNNImputer, IterativeImputer
from sklearn.preprocessing import \
OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.feature_selection import RFECV
df_train_original = pd.read_csv("train.csv.xls")
df = df_train_original.drop(df_train_original.columns[0], axis=1)
# Replace missing 'Credit_History' values with a randomly chosen observed value (0 or 1)
np.random.seed(0)  # np.random.choice draws from NumPy's RNG, so seed that one
df['Credit_History'] = \
    df['Credit_History'].apply(
        lambda x: np.random.choice(df['Credit_History'].dropna().values)
        if np.isnan(x) else x)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# Data pre-processing
numerical_feature, categorical_feature = [], []
for i in X.columns:
    if X[i].dtype == 'O':
        categorical_feature.append(i)
    else:
        numerical_feature.append(i)
imputer = IterativeImputer(random_state=0)
scaler = StandardScaler()
encoder = OrdinalEncoder()
# Replace categorical features with the most frequent value of the column
# Gender (-13) , Married (-3), Self_Employed (-32)
# LoanAmount (-22) Loan_Amount_Term (-14) Credit_History (-50)
numerical_pipeline = make_pipeline(imputer, scaler)
categorical_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), encoder)
preprocessor = make_column_transformer((numerical_pipeline, numerical_feature),
(categorical_pipeline, categorical_feature),
remainder='passthrough')
clf = LogisticRegression(random_state=0, max_iter=df.shape[0])
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.2,
random_state=0)
params = {
'logisticregression__class_weight': [None, 'balanced'],
'logisticregression__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
'logisticregression__C': np.linspace(0.001, 0.1, 30),
}
model = make_pipeline(preprocessor, clf)
selector = RFECV(model, step=1, min_features_to_select=2, cv=5)
selector.fit(X_train, y_train)
When I run the code I get:
ValueError: could not convert string to float: 'Male'
I think the data is not being fitted and transformed before it goes through RFECV.
How can I fix this?
RFECV does not work with a pipeline as its estimator because it requires the estimator to expose either coef_ or feature_importances_. Pipelines do not, and even if they did, there would be no guarantee that the feature importances of the final estimator correspond to the features fed into the pipeline, since arbitrary transformations may sit in between.
What you can do is make the RFECV transformer a step of your pipeline, between the preprocessing and the final estimator, i.e.
preprocessor = make_column_transformer((numerical_pipeline, numerical_feature),
(categorical_pipeline, categorical_feature),
remainder='passthrough')
clf_fs = LogisticRegression(random_state=0, max_iter=df.shape[0])
clf = LogisticRegression(random_state=0, max_iter=df.shape[0])
feature_selector = RFECV(clf_fs , step=1, min_features_to_select=2, cv=5)
model = make_pipeline(preprocessor, feature_selector, clf)
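To show the arrangement end to end without the Kaggle file, here is a minimal sketch on synthetic data. The column names, the toy target and the simplified preprocessing (no imputation) are assumptions made for the example, not the loan dataset. With RFECV placed after the preprocessing, the string column is already encoded when the selector sees it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in data (assumption): three numeric columns and one string column.
rng = np.random.default_rng(0)
n = 120
X_toy = pd.DataFrame({
    "income": rng.normal(size=n),
    "loan": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "gender": rng.choice(["Male", "Female"], size=n),
})
y_toy = (X_toy["income"] + 0.5 * X_toy["loan"]
         + rng.normal(scale=0.5, size=n) > 0).astype(int)

prep = make_column_transformer(
    (StandardScaler(), ["income", "loan", "noise"]),
    (OrdinalEncoder(), ["gender"]),
)
# RFECV wraps its own LogisticRegression; the final classifier is a separate step.
selector_toy = RFECV(LogisticRegression(max_iter=1000), step=1,
                     min_features_to_select=2, cv=5)
model_toy = make_pipeline(prep, selector_toy, LogisticRegression(max_iter=1000))
model_toy.fit(X_toy, y_toy)

# Boolean mask over the four preprocessed columns, in transformer output order.
support = model_toy.named_steps["rfecv"].support_
print(support)
```

Note that the mask refers to the columns produced by the preprocessing step, not to the raw input columns.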
I'm trying to use mlxtend SequentialFeatureSelector() in combination with a pipeline by using ColumnTransformer(). I use the ColumnTransformer() to make power transformations (via PowerTransformer()) only on the numeric variables, but not on the binary variables. My problem is that I either get the error: AttributeError: 'numpy.ndarray' object has no attribute 'columns' or the results make no sense.
If I define like this: numeric_features = ['feature_1','feature_2', 'feature_3', 'feature_4', 'feature_5'], then I get the AttributeError: 'numpy.ndarray' object has no attribute 'columns'
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=10,random_state=123)
cv_define = rskf.split(X, y)
cv = list(cv_define)
# Define pipeline (transformation & model)
numeric_features = ['feature_1','feature_2', 'feature_3', 'feature_4', 'feature_5']
# numeric_features = [0,1,2,3,4,]
numeric_transformer = make_pipeline(PowerTransformer(method = 'yeo-johnson', standardize=False))
preprocessor = make_pipeline(ColumnTransformer(transformers = [('num', numeric_transformer, numeric_features)], remainder = 'drop'))
pipe = make_pipeline((preprocessor), (LogisticRegression(max_iter = 10000, penalty = 'none')))
# Run SFS (Sequential Feature Selector)
sfs1 = SFS(pipe, k_features = 1, scoring = "neg_log_loss", forward = False, floating = False, cv = cv, verbose = 2, n_jobs = -1)
sfs_result = sfs1.fit(X, y)
If I define numeric_features like this: numeric_features = [0,1,2,3,4] , then I get incorrect results (same value for every iteration) which also include NaNs. See pictures below.
[Image: Results of the SequentialFeatureSelector() - Part 1]
[Image: Results of the SequentialFeatureSelector() - Part 2]
The problem lies only in preprocessor = make_pipeline(ColumnTransformer(transformers = [('num', numeric_transformer, numeric_features)], remainder = 'drop')). If, for example, I remove the column-specific transformation and apply the power transformation to the whole dataset, then the code works:
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=10,random_state=123)
cv_define = rskf.split(X, y)
cv = list(cv_define)
# Define pipeline (transformation & model)
numeric_features = ['feature_1','feature_2', 'feature_3', 'feature_4', 'feature_5']
numeric_transformer = make_pipeline(PowerTransformer(method = 'yeo-johnson', standardize=False))
pipe = make_pipeline((numeric_transformer), (LogisticRegression(max_iter = 10000, penalty = 'none')))
# Run SFS (Sequential Feature Selector)
sfs1 = SFS(pipe, k_features = 1, scoring = "neg_log_loss", forward = False, floating = False, cv = cv, verbose = 2, n_jobs = -1)
sfs_result = sfs1.fit(X, y)
Results:
[Image: Working results of SFS - Part 1]
[Image: Working results of SFS - Part 2]
Does anyone have an idea how this could be solved? Any help would be much appreciated.
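One observation that may help narrow this down, offered as a sketch of a possible cause rather than a confirmed diagnosis: a ColumnTransformer that selects columns by name needs a pandas DataFrame, and the error message above indicates that a NumPy array reached it. The example below uses made-up data and column names and only scikit-learn (no mlxtend) to reproduce that difference in isolation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer

# Made-up data (assumption): two numeric columns and one column left untransformed.
rng = np.random.default_rng(0)
X_df = pd.DataFrame(rng.normal(size=(50, 3)),
                    columns=["feature_1", "feature_2", "is_binary"])

ct_by_name = ColumnTransformer(
    [("num", PowerTransformer(method="yeo-johnson", standardize=False),
      ["feature_1", "feature_2"])],
    remainder="drop",
)

# Works: a DataFrame carries the column names the transformer asks for.
out_from_frame = ct_by_name.fit_transform(X_df)

# Fails: a bare NumPy array has no column names to look up.
# The exception type differs between scikit-learn versions, so both are caught.
try:
    ct_by_name.fit_transform(X_df.to_numpy())
    array_failed = False
except (ValueError, AttributeError):
    array_failed = True

print(out_from_frame.shape, array_failed)
```

Integer positions avoid the name lookup, but they refer to positions in whatever array the transformer receives, so they can stop matching the intended features if the selector passes in a subset of the columns.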
I get different results when I run a grid search with sklearn's GridSearchCV and when I do it manually.
The first code block is my procedure when I run the grid search using sklearn:
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection
from sklearn.model_selection import LeaveOneGroupOut
from sklearn import pipeline
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
X = folded_train.drop(columns = ["10_fold", "class_encoded"])
y = folded_train["class_encoded"]
ten_fold = folded_train["10_fold"]
logo = LeaveOneGroupOut()
cross_val_groups = logo.split(X, y, ten_fold)
classifier = (Pipeline([("sampling", RandomUnderSampler()),
("classifier", ensemble.RandomForestClassifier(n_jobs = -1))]))
param_grid = {
"classifier__n_estimators" : [100, 200, 300, 400, 600],
"classifier__max_depth": [1, 3, 5, 7],
"classifier__criterion": ["gini", "entropy"]
}
model = model_selection.GridSearchCV(
estimator = classifier,
param_grid = param_grid,
scoring = "roc_auc",
verbose = 10,
n_jobs = 1,
cv = cross_val_groups
)
model.fit(X,y)
And I am trying to do the same procedure manually. Here is my code:
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection
from sklearn.model_selection import LeaveOneGroupOut
from sklearn import pipeline
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
X = folded_train.drop(columns = ["10_fold", "class_encoded"])
y = folded_train["class_encoded"]
ten_fold = folded_train["10_fold"]
number_of_estimators = [100, 200, 300]
maximum_depths = [1, 3, 5, 7]
criterions = ["gini", "entropy"]
logo = LeaveOneGroupOut()
for criterion in criterions:
    for max_depth in maximum_depths:
        for n_of_estimator in number_of_estimators:
            for train_index, val_index in logo.split(X, y, ten_fold):
                aPipeline = Pipeline(steps=[('sampling', RandomUnderSampler()),
                                            ('classifier', ensemble.RandomForestClassifier(criterion=criterion,
                                                                                            max_depth=max_depth,
                                                                                            n_estimators=n_of_estimator,
                                                                                            n_jobs=-1))])
                X_trn, X_vl = X.iloc[train_index], X.iloc[val_index]
                y_trn, y_vl = y.iloc[train_index], y.iloc[val_index]
                aPipeline.fit(X_trn, y_trn)
                predictions = aPipeline.predict(X_vl)
                print("Criterion", criterion, " Max Depth", max_depth, "Number of estimator ", n_of_estimator, "score ", metrics.roc_auc_score(y_vl, predictions))
With sklearn's GridSearchCV, I obtained the following scores (roc_auc) for specific parameters:
For criterion = "gini", max_depth = 1 and n_estimators = 100,
[0.786, 0.799, 0.789, 0.796, 0.775, 0.776, 0.779, 0.788, 0.770, 0.769] for each cv iteration
With my manual loop, for the same parameters, I get:
[0.730, 0.749, 0.714, 0.710, 0.732, 0.724, 0.711, 0.724, 0.715, 0.734]
The same discrepancy appears for other parameter combinations too. What factors could lead to this?
Note: I found this, but it does not answer my problem: Why GridSearchCV model results are different than the model I manually tuned?
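One factor that can produce a gap like this, offered as a hypothesis rather than a confirmed diagnosis: the "roc_auc" scorer used by GridSearchCV ranks continuous scores (decision_function or predict_proba), whereas the manual loop passes hard class labels from predict() to roc_auc_score. The sketch below uses synthetic data and made-up variable names to compare the two on one fitted forest. The RandomUnderSampler has no fixed random_state either, so some run-to-run variation is expected on top of that.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import get_scorer, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption, not the original dataset).
X_s, y_s = make_classification(n_samples=600, n_features=10,
                               weights=[0.8, 0.2], random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_s, y_s, stratify=y_s, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=1,
                                random_state=0).fit(X_a, y_a)

# What the manual loop computes: AUC from hard 0/1 predictions.
auc_from_labels = roc_auc_score(y_b, forest.predict(X_b))

# What scoring="roc_auc" computes: AUC from the positive-class probability.
auc_from_proba = roc_auc_score(y_b, forest.predict_proba(X_b)[:, 1])
auc_from_scorer = get_scorer("roc_auc")(forest, X_b, y_b)

print(auc_from_labels, auc_from_proba, auc_from_scorer)
```

If this is the cause, replacing predict(X_vl) with predict_proba(X_vl)[:, 1] in the manual loop should bring the two sets of scores much closer.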
I'm trying to predict some prices from a dataset that I scraped. I have never used Python for this (I usually use tidyverse), but this time I wanted to explore pipelines.
So here is the code snippet:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/norhther/idealista/main/idealistaBCN.csv")
df.drop("info", axis = 1, inplace = True)
df["floor"].fillna(1, inplace=True)
df.drop("neigh", axis = 1, inplace = True)
df.dropna(inplace = True)
df = df[df["habs"] < 11]
X = df.drop("price", axis = 1)
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
ct = ColumnTransformer(
[("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
("onehot", OneHotEncoder(), ["type"]
)], remainder="passthrough")
pipe = Pipeline(steps = [("Transformer", ct),
("svr", SVR())])
param_grid = {
"svr__kernel" : ['linear', 'poly', 'rbf', 'sigmoid'],
"svr__degree" : range(3,6),
"svr__gamma" : ['scale', 'auto'],
"svr__coef0" : np.linspace(0.01, 1, 2)
}
search = GridSearchCV(pipe, param_grid, scoring = ['neg_mean_squared_error'], refit='neg_mean_squared_error')
search.fit(X_train, y_train)
print(search.best_score_)
pipe = Pipeline(steps = [("Transformer", ct),
("svr", SVR(coef0 = search.best_params_["svr__coef0"],
degree = search.best_params_["svr__degree"],
kernel =
search.best_params_["svr__kernel"]))])
from sklearn.metrics import mean_squared_error
pipe.fit(X_train, y_train)
preds = pipe.predict(X_train)
mean_squared_error(preds, y_train)
search.best_score_ here is -443829697806.1671, and the MSE is 608953977916.3896.
I think I messed something up, maybe with the transformer, but I'm not completely sure. This MSE seems exaggerated to me; I took a fairly similar approach with tidymodels and got much better results.
So I would like to know whether something is wrong with the transformer, or whether the model is simply this bad.
The reason is that you did not include C in the parameter grid; you need to search over a wide range of C values. If we fit with the default C = 1, you can see where the problem lies:
import matplotlib.pyplot as plt
o = pipe.named_steps["Transformer"].fit_transform(X_train)
mdl = SVR(C=1)
mdl.fit(o,y_train)
plt.scatter(mdl.predict(o),y_train)
Some price values are more than 10x the typical value (1e7 versus a median of 5e5). Metrics such as MSE or R^2 will be dominated by these extreme values, so the model needs to follow the data more closely. That is controlled by C, which you can read more about here. We try a range:
ct = ColumnTransformer(
[("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
("onehot", OneHotEncoder(), ["type"]
)], remainder="passthrough")
pipe = Pipeline(steps = [("Transformer", ct),
("svr", SVR())])
#, 'poly', 'rbf', 'sigmoid'
param_grid = {
"svr__kernel" : ['rbf'],
"svr__gamma" : ['auto'],
"svr__coef0" : [1,2],
"svr__C" : [1e-03,1e-01,1e1,1e3,1e5,1e7]
}
search = GridSearchCV(pipe, param_grid, scoring = ['neg_mean_squared_error'],
refit='neg_mean_squared_error')
search.fit(X_train, y_train)
print(search.best_score_)
-132061065775.25969
Your y values are large, and the MSE will be on the scale of the variance of y, so we check that:
y_train.var()
545423126823.4545
132061065775.25969 / y_train.var()
0.24212590057261346
That is reasonable: the cross-validated MSE is about 24% of the variance of y. We can check this on the test data, and in this case we seem to be somewhat lucky that the chosen C works well:
from sklearn.metrics import mean_squared_error
transformer = pipe.named_steps["Transformer"]
o = transformer.fit_transform(X_train)  # fit the preprocessing on the training data only
mdl = SVR(C=10000000.0, coef0=1, gamma='auto')
mdl.fit(o,y_train)
o_test = transformer.transform(X_test)  # reuse the fitted preprocessing; do not refit on the test set
pred = mdl.predict(o_test)
print( mean_squared_error(y_test,pred) , mean_squared_error(y_test,pred)/y_test.var())
plt.scatter(pred,y_test)
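To see why C matters so much when the target is on the scale of hundreds of thousands, here is a small sketch on synthetic data; the data, the C values and the variable names are assumptions for illustration, not the idealista dataset. It reports the training MSE as a fraction of the target variance, the same ratio used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Synthetic target on a price-like scale (assumption).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = 5e5 + 2e5 * X_syn[:, 0] + rng.normal(scale=5e4, size=200)

ratios = {}
for C in [1.0, 1e3, 1e5]:
    svr = SVR(kernel="rbf", gamma="auto", C=C).fit(X_syn, y_syn)
    mse = mean_squared_error(y_syn, svr.predict(X_syn))
    # Fraction of the target variance left unexplained on the training data.
    ratios[C] = mse / y_syn.var()

for C, ratio in ratios.items():
    print(C, round(ratio, 3))
```

With a small C the dual coefficients are capped so tightly that the prediction can barely move away from a constant, which is why the ratio stays near 1 until C is raised by several orders of magnitude. Rescaling y before fitting is another common way to address this.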