RFE ranking with GridSearch - Python

I want to use RFE for feature selection in a pipeline. I have no problems getting it to work in pipelines without GridSearch. However, when I try to incorporate GridSearch, I keep getting a value error (NB. the models are fine without RFE).
I have tried to use feature_selection as was suggested in this topic: Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error, but this results in the same error.
What could be wrong?
My error:
ValueError: Invalid parameter alpha for estimator RFE(estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=True, random_state=None, solver='auto',
tol=0.001),
n_features_to_select=4, step=1, verbose=1). Check the list of available parameters with estimator.get_params().keys().
This works fine:
rfe=RFE(estimator=LinearRegression(), n_features_to_select=4, verbose=1)
#setup the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.NaN, strategy='most_frequent')),
         ('reg', rfe)]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the pipeline to the training set:
pipeline.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = pipeline.predict(X_test)
print()
# Print the features and their ranking (high = dropped early on)
print(dict(zip(X.columns, rfe.ranking_)))
# Print the features that are not eliminated
print(X.columns[rfe.support_])
print()
print("R^2: {}".format(pipeline.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
This doesn't work:
rfe=RFE(estimator=Ridge(normalize=True), n_features_to_select=4, verbose=1)
#setup the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.NaN, strategy='most_frequent')),
         ('ridge', rfe)]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
#Define hyperparameters and range of Grid Search
parameters = {"ridge__alpha": np.linspace(0,1,100)}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# run cross validation
cv = GridSearchCV(pipeline, param_grid = parameters, cv=3)
# Fit the pipeline to the training set:
cv.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = cv.predict(X_test)
# Compute and print R^2 and RMSE
print("R^2: {}".format(cv.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
print("Tuned Model Parameters: {}".format(cv.best_params_))
Using feature_selection also doesn't work:
selector = feature_selection.RFE(Ridge(normalize=True))
#setup the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.NaN, strategy='most_frequent')),
         ('RFE', selector)]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

The question is old, but in case someone stumbles upon it:
You can access the hyperparameter alpha, or any other parameter of the estimator wrapped inside the feature selector (the estimator= argument), via the parameter name '<feature_selection_step>__estimator__<your parameter>':
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.feature_selection import RFE
model = RFE(estimator=Ridge())
pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("rfe", model)
    ]
)
param = {
    "rfe__step": np.linspace(0.1, 1, 10),
    "rfe__estimator__alpha": np.logspace(-3, 3, 7)
}
tscv = TimeSeriesSplit(n_splits=5).split(X_train)
gridsearch = GridSearchCV(estimator=pipe, cv=tscv, param_grid=param, refit=True, return_train_score=True, n_jobs=-1)
fit = gridsearch.fit(X_train, y_train)
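Applied to the pipeline from the question, where the RFE wrapper sits in a step named 'ridge', the same naming convention would look like the sketch below (a minimal sketch reusing the question's own grid; only the parameter name changes):
# reach the wrapped Ridge's alpha through the 'ridge' step and its inner estimator
parameters = {"ridge__estimator__alpha": np.linspace(0, 1, 100)}
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
cv.fit(X_train, y_train)
print("Tuned Model Parameters: {}".format(cv.best_params_))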

Related

Pipeline with SMOTE and Imputer Errors

I am trying to create a pipeline that first imputes missing data, then oversamples with SMOTE, and then fits the model.
My code worked perfectly before I tried SMOTE; now I can't find any solution.
Here is the code without SMOTE:
scoring = ['balanced_accuracy', 'f1_macro']
imputer = SimpleImputer(strategy='most_frequent')
pipeline = Pipeline(steps=[('i', imputer),('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1)
And here's the code after adding SMOTE.
Note: I tried importing make_pipeline from imblearn.
imputer = SimpleImputer(strategy='most_frequent')
pipeline = Pipeline(steps=[('i', imputer),('over', SMOTE()),('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1)
When I import Pipeline from sklearn I get this error:
All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't
When I try importing make_pipeline from imblearn I get this error:
Last step of Pipeline should implement fit or be the string 'passthrough'. '[('i', SimpleImputer(strategy='most_frequent')), ('over', SMOTE()), ('m', RandomForestClassifier())]' (type <class 'list'>) doesn't
Use the imblearn pipeline:
from imblearn.pipeline import Pipeline
pipeline = Pipeline([('i', imputer),('over', SMOTE()),('m', model)])
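For context, a minimal end-to-end sketch of that fix, assuming the RandomForestClassifier from the error message as the model and the same imputer, scoring and CV setup as in the question:
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
# assumes X, y are already loaded
imputer = SimpleImputer(strategy='most_frequent')
model = RandomForestClassifier()
# imblearn's Pipeline accepts samplers such as SMOTE as intermediate steps;
# SMOTE is then applied only to the training folds during cross-validation
pipeline = Pipeline(steps=[('i', imputer), ('over', SMOTE()), ('m', model)])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_validate(pipeline, X, y, scoring=['balanced_accuracy', 'f1_macro'], cv=cv, n_jobs=-1)
print(np.mean(scores['test_balanced_accuracy']), np.mean(scores['test_f1_macro']))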

Prediction scores on test vs train dataset differ with/without GridSearchCV

I am exploring the use of GridSearchCV from sklearn to predict data. After fitting the data using RandomForestRegressor, I calculate the score (MSE) for the test and the train data. I can see there is a huge difference between the train MSE and the test MSE (even though I expected the scores to be similar).
Here is the code:
# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Create Regressors Pipeline
pipeline_estimators = Pipeline([
    ('RandomForest', RandomForestRegressor()),
])
param_grid = [{'RandomForest__n_estimators': np.linspace(50, 100, 3).astype(int)}]
search = GridSearchCV(estimator=pipeline_estimators,
                      param_grid=param_grid,
                      scoring='neg_mean_squared_error',
                      cv=2)
search.fit(X_train, y_train)
y_test_predicted = search.best_estimator_.predict(X_test)
y_train_predicted = search.best_estimator_.predict(X_train)
print('MSE test predict', metrics.mean_squared_error(y_test, y_test_predicted))
print('MSE train predict',metrics.mean_squared_error(y_train, y_train_predicted))
The output is:
MSE test predict 0.0021045875412650343
MSE train predict 0.000332850878980335
If I don't use GridSearchCV but a for loop over the different 'n_estimators', the MSE scores obtained for the predicted test and train sets are very close.
To add more detail on the for-loop approach, this is the simple version, see the code below:
n_estimators = np.linspace(50, 100, 3).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
mse_train = []
mse_test = []
for val_n_estimators in n_estimators:
    regressor = RandomForestRegressor(n_estimators=val_n_estimators)
    regressor.fit(X, y)  # note: fits on the full X, y rather than on X_train only
    y_pred = regressor.predict(X_test)
    y_test_predicted = regressor.predict(X_test)
    y_train_predicted = regressor.predict(X_train)
    mse_train.append(metrics.mean_squared_error(y_train, y_train_predicted))
    mse_test.append(metrics.mean_squared_error(y_test, y_test_predicted))
For this code, mse_train and mse_test are very similar. But using GridSearchCV (see the code at the top of the post), they are not.
Any suggestions?
Why is there such a difference in scores when using GridSearchCV?
Thank you.
Marc

Performance metric when using XGBoost regressor with sklearn learning_curve

I've created an XGBoost regressor model and want to see how training and test performance change as the training set size increases.
xgbm_reg = XGBRegressor()
tr_sizes, tr_scs, test_scs = learning_curve(estimator=xgbm_reg,
                                            X=ori_X, y=y,
                                            train_sizes=np.linspace(0.1, 1, 5),
                                            cv=5)
What performance metric is it using for tr_scs and test_scs?
The sklearn doc tells me:
scoring : str or callable, default=None
A str (see model evaluation documentation) or a scorer callable object / function
with signature scorer(estimator, X, y)
So I've looked at the XGBoost documentation, which says the default objective is reg:squarederror. Does this mean the results in tr_scs and test_scs are in terms of squared error?
I want to check by using cross_val_score
scoring = "neg_mean_squared_error"
cv_results = cross_val_score(xgbm_reg, ori_X, y, cv=5, scoring=scoring)
However, I'm not quite sure how to get the squared error from cross_val_score.
The XGBRegressor's built-in scorer is R-squared, and this is the default scorer used in learning_curve and cross_val_score when no scoring is specified; see the code below.
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve, cross_val_score, KFold
from sklearn.metrics import r2_score
# generate the data
X, y = make_regression(n_features=10, random_state=100)
# generate 5 CV splits
kf = KFold(n_splits=5, shuffle=False)
# calculate the CV scores using `learning_curve`, use 100% train size for comparison purposes
_, _, lc_scores = learning_curve(estimator=XGBRegressor(), X=X, y=y, train_sizes=[1.0], cv=kf)
print(lc_scores)
# [[0.51444244 0.70020972 0.64521668 0.36608259 0.81670165]]
# calculate the CV scores using `cross_val_score`
cv_scores = cross_val_score(estimator=XGBRegressor(), X=X, y=y, cv=kf)
print(cv_scores)
# [0.51444244 0.70020972 0.64521668 0.36608259 0.81670165]
# calculate the CV scores manually
xgb_scores = []
r2_scores = []
# iterate across the CV splits
for train_index, test_index in kf.split(X):
    # extract the training and test data
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit the model to the training data
    estimator = XGBRegressor()
    estimator.fit(X_train, y_train)
    # score the test data using the XGBRegressor built-in scorer
    xgb_scores.append(estimator.score(X_test, y_test))
    # score the test data using the R-squared
    y_pred = estimator.predict(X_test)
    r2_scores.append(r2_score(y_test, y_pred))
print(xgb_scores)
# [0.5144424362721487, 0.7002097211679331, 0.645216683969211, 0.3660825936288453, 0.8167016490227281]
print(r2_scores)
# [0.5144424362721487, 0.7002097211679331, 0.645216683969211, 0.3660825936288453, 0.8167016490227281]
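To answer the follow-up about getting the squared error itself: sklearn's "neg_mean_squared_error" scorer returns the negated MSE (so that higher is always better), so flipping the sign gives the per-fold MSE. A small sketch reusing the X, y and kf from the example above:
# scoring="neg_mean_squared_error" yields negated MSE values
neg_mse_scores = cross_val_score(estimator=XGBRegressor(), X=X, y=y, cv=kf,
                                 scoring="neg_mean_squared_error")
mse_scores = -neg_mse_scores  # flip the sign to recover the plain MSE per fold
print(mse_scores)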

How do I handle a multiclass decision tree?

I am new to Python & ML, but I am trying to use sklearn to build a decision tree. I have many categorical features and I have transformed them into numerical variables. However, my target feature is multiclass and I am running into an error. How should I handle targets that are multiclass?
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
from sklearn.model_selection import train_test_split
#SPLIT DATA INTO TRAIN AND TEST SET
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.30,  # by default is 75%-25%
                                                    # shuffle is set True by default
                                                    stratify=y,  # preserve target proportions
                                                    random_state=123)  # fix random seed for replicability
print(X_train.shape, X_test.shape)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_split=4, min_samples_leaf=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# criterion : "gini", "entropy"
# max_depth : The maximum depth of the tree.
# min_samples_split : The minimum number of samples required to split an internal node:
# min_samples_leaf : The minimum number of samples required to be at a leaf node.
#DEFINE YOUR CLASSIFIER and THE PARAMETERS GRID
from sklearn.tree import DecisionTreeClassifier
import numpy as np
classifier = DecisionTreeClassifier()
parameters = {'criterion': ['entropy', 'gini'],
              'max_depth': [3, 4, 5],
              'min_samples_split': [5, 10],
              'min_samples_leaf': [2]}
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(classifier, parameters, cv=3, scoring = 'f1', verbose=50, n_jobs=-1, refit=True)
You should specify the score function manually:
from sklearn.metrics import f1_score, make_scorer
f1 = make_scorer(f1_score, average='weighted')
....
gs = GridSearchCV(classifier, parameters, cv=3, scoring=f1, verbose=50, n_jobs=-1, refit=True)
Thank you so much for your help. I figured it out. It was actually on the gs line: in scoring, I needed to adjust what you mentioned, so I revised it to scoring='f1_macro'.
gs = GridSearchCV(classifier, parameters, cv=3, scoring='f1_macro', verbose=50, n_jobs=-1, refit=True)
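For completeness, a small sketch showing that the built-in 'f1_macro' string and an explicit make_scorer call are two equivalent ways of passing the macro-averaged F1 here (reusing the classifier and parameters defined above):
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
# the string scorer and the explicit scorer object select the same metric
gs_string = GridSearchCV(classifier, parameters, cv=3, scoring='f1_macro', refit=True)
gs_scorer = GridSearchCV(classifier, parameters, cv=3,
                         scoring=make_scorer(f1_score, average='macro'), refit=True)
gs_string.fit(X_train, y_train)
print(gs_string.best_params_, gs_string.best_score_)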

My R-squared score is coming out negative but my accuracy score using k-fold cross-validation is about 92%

For the code below, my R-squared score comes out negative, but my accuracy score using k-fold cross-validation comes out at about 92%. How is this possible? I'm using the random forest regression algorithm to predict some data. The dataset is linked below:
https://www.kaggle.com/ludobenistant/hr-analytics
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:,:-1].values ##Independent variable
y = dataset.iloc[:,9].values ##Dependent variable
##Encoding the categorical variables
le_x1 = LabelEncoder()
x[:,7] = le_x1.fit_transform(x[:,7])
le_x2 = LabelEncoder()
x[:,8] = le_x1.fit_transform(x[:,8])
ohe = OneHotEncoder(categorical_features = [7,8])
x = ohe.fit_transform(x).toarray()
##splitting the dataset in training and testing data
from sklearn.cross_validation import train_test_split
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
print(y_pred)
from sklearn.metrics import r2_score
r2_score(y_test , y_pred)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
There are several issues with your question...
For starters, you are making a very basic mistake: you think you are using accuracy as a metric, while you are in a regression setting and the actual metric used underneath is the mean squared error (MSE).
Accuracy is a metric used in classification, and it has to do with the percentage of the correctly classified examples - check the Wikipedia entry for more details.
The metric used internally in your chosen regressor (Random Forest) is included in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
verbose=0, warm_start=False)
MSE is a positive continuous quantity, and it is not upper-bounded by 1, i.e. if you got a value of 0.92, this means... well, 0.92, and not 92%.
Knowing that, it is good practice to include explicitly the MSE as the scoring function of your cross-validation:
cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error')
cv_mse.mean()
# -2.433430574463703e-28
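As a side note, the 'neg_mean_squared_error' scorer reports the negated MSE (so that larger is always better for sklearn's scorers); flipping the sign recovers the plain MSE, a tiny sketch:
# negate the scores returned by cross_val_score to read off the actual MSE
mse_per_fold = -cv_mse
print(mse_per_fold.mean())
# 2.433430574463703e-28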
For all practical purposes, this is zero - you fit the training set almost perfectly; for confirmation, here is the (perfect again) R-squared score on your training set:
train_pred = regressor.predict(x_train)
r2_score(y_train , train_pred)
# 1.0
But, as always, the moment of truth comes when you apply your model on the test set; your second mistake here is that, since you train your regressor with scaled y_train, you should also scale y_test before evaluating:
y_test = sc_y.fit_transform(y_test)
r2_score(y_test , y_pred)
# 0.9998476914664215
and you get a very nice R-squared in the test set (close to 1).
What about the MSE?
from sklearn.metrics import mean_squared_error
mse_test = mean_squared_error(y_test, y_pred)
mse_test
# 0.00015230853357849051
