I have managed to write some code that performs nested cross-validation using LightGBM as my regressor and wraps everything with sklearn.pipeline.
Ultimately, I now want to do feature selection (or really just get the feature importances for the final model), but I am wondering what the best path is from here. I see two possibilities:
#1 Use this methodology to build a model (using .fit and .predict) with the best hyperparameters, then check the feature importances for that model.
#2 Do feature selection in the inner folds of the nested CV, but I am unsure how exactly to do this.
I guess #1 would be the easiest, but I am unsure how to get the best hyperparameters for each outer fold.
This thread touches on it:
Putting together sklearn pipeline+nested cross-validation for KNN regression
But the selected answer drops cross_val_score altogether, meaning it isn't nested cross-validation anymore (I would still like to perform CV on the outer folds after getting the best hyperparameters on the inner folds).
So my problem is the following:
Can I get feature importances for each fold of the outer CV (I am aware that if I have 5 folds, I will get 5 different sets of feature importances)? And if yes, how?
Alternatively, should I just get the best hyperparameters for each fold (how?) and build a new model without CV on the whole dataset, based on those hyperparameters?
Here is the code I have so far:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import scipy.stats as st
# Parameters for model building and reproducibility
X = X_age
y = y_age
RNGesus = 42
state = 13
outer_scoring = 'neg_mean_absolute_error'
inner_scoring = 'neg_mean_absolute_error'
#### Nested CV with randomized search ####
# Pipeline with standard scaling and the regressor
regressors = [lgb.LGBMRegressor(random_state=state)]
continuous_transformer = Pipeline([('scaler', StandardScaler())])
preprocessor = ColumnTransformer([('cont', continuous_transformer, continuous_variables)], remainder='passthrough')

for reg in regressors:
    steps = [('preprocessor', preprocessor), ('regressor', reg)]
    pipeline = Pipeline(steps)

    # Inner and outer folds to be used
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)

    # Hyperparameters of the regressor to be optimized with randomized search
    params = {
        'regressor__max_depth': (3, 5, 7, 10),
        'regressor__lambda_l1': st.uniform(0, 5),
        'regressor__lambda_l2': st.uniform(0, 3)
    }

    # Pass the RandomizedSearchCV to cross_val_score
    regression = RandomizedSearchCV(estimator=pipeline, param_distributions=params, scoring=inner_scoring, cv=inner_cv, n_iter=200, verbose=3, n_jobs=-1)
    nested_score = cross_val_score(regression, X=X, y=y, cv=outer_cv, scoring=outer_scoring)

    print('\n MAE for lightGBM model predicting age: %.3f' % abs(nested_score.mean()))
    print('\n' + str(nested_score) + ' <- outer CV')
Edit: Stated the problem clearly.
I encountered problems importing the LightGBM module, so I couldn't run your code. But here is a post explaining why you cannot get the "winning" (optimal) hyperparameters, nor the feature importances, out of nested cross-validation via cross_val_score. Briefly, the reason is that cross_val_score only returns the scores, not the fitted estimators.
Can I get feature importances for each fold of the outer CV (I am aware that if I have 5 folds, I will get 5 different sets of feature importance)? And if yes, how?
The answer is no with cross_val_score. But if you follow the code from that post and run the outer loop yourself, you can get the importances inside the for loop after GSCV.fit(): since the search wraps a pipeline, they are available from the fitted regressor step as GSCV.best_estimator_.named_steps['regressor'].feature_importances_.
Alternatively, should I just get the best hyperparameters for each fold (how?) and build a new model without CV on the whole dataset, based on these hyperparameters?
This is exactly what that post is about: getting the "best" hyperparameters via nested CV. Ideally, you will observe one combination of hyperparameters that wins every fold, and that is the combination to use for the final model (trained on the entire training set). But when different "best" combinations appear across folds, there is no standard way to deal with it as far as I know.
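To make this concrete, here is a minimal sketch of that manual outer loop, reusing the pipeline, params, inner_cv, outer_cv and inner_scoring names from your code above (it assumes X is a pandas DataFrame and y a Series; swap .iloc for plain indexing otherwise). The LightGBM sklearn wrapper exposes importances as feature_importances_ on the fitted regressor step:
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV

fold_scores, fold_best_params, fold_importances = [], [], []
for train_idx, test_idx in outer_cv.split(X, y):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]

    # Inner CV: tune hyperparameters on the training part of this outer fold
    search = RandomizedSearchCV(estimator=pipeline, param_distributions=params,
                                scoring=inner_scoring, cv=inner_cv,
                                n_iter=200, n_jobs=-1)
    search.fit(X_tr, y_tr)

    # Outer CV: score the refit best pipeline on the held-out fold
    fold_scores.append(mean_absolute_error(y_te, search.predict(X_te)))
    fold_best_params.append(search.best_params_)

    # Per-fold feature importances from the LGBMRegressor inside the best pipeline
    regressor = search.best_estimator_.named_steps['regressor']
    fold_importances.append(regressor.feature_importances_)

print('Nested MAE: %.3f' % np.mean(fold_scores))
print(fold_best_params)
This reproduces the nested estimate you already get from cross_val_score, while also keeping one set of best hyperparameters and one set of importances per outer fold.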
Related
I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.
Here is what I've tried (and the result):
Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic; predicted probabilities for the positive and negative classes are both the same, i.e., the model predicts a very low probability for most of the positive class)
Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set, due to the sheer number of negatives in that unbalanced set, see second pic)
Leaving it unbalanced (weird scoring distribution again; precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for the test and out-of-year test sets)
XGBoost (slightly better recall on the out-of-year test set, 0.11)
What should I try next? I'd like to optimize for F1, as false positives and false negatives are equally bad in my case. I would like to incorporate k-fold cross-validation, and I have read that I should do this before upsampling: a) should I do this / is it likely to help, and b) how can I incorporate it into a pipeline similar to this:
from imblearn.pipeline import make_pipeline, Pipeline
from imblearn.combine import SMOTEENN
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote=sm)
kf = StratifiedKFold(n_splits=5)

pipeline = make_pipeline(??)

pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)
Upsampling with SMOTE and SMOTEENN: I am far from an expert with these, but by upsampling your dataset you might amplify existing noise, which induces overfitting. This could explain why your algorithm cannot classify correctly, giving the results in the first graph.
I found a bit more information, and possibly ways to improve your results, here:
https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf
When you downsample, you seem to run into the same overfitting problem, as I understand it (at least for the previous-year target). It is hard to deduce the reason without a look at the data, though.
The overfitting might come from the number of features you use, which could add unnecessary noise. You might try reducing the number of features and gradually increasing it (using an RFE model, as sketched below). More info here:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
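For instance, a minimal sketch of RFE's cross-validated variant (RFECV), which grows and shrinks the feature count for you; the RandomForestClassifier here is only an illustrative importance source, and the X_train / y_train names are assumptions taken from your snippet:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Recursively drops the weakest feature and cross-validates each feature count
selector = RFECV(estimator=RandomForestClassifier(n_estimators=25, random_state=1),
                 step=1,
                 cv=StratifiedKFold(n_splits=5),
                 scoring='f1')
selector.fit(X_train, y_train)
print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask of the kept features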
As for the models: you mention Random Forest and XGBoost, but not any simpler models. You could try simpler models and focus on your data engineering.
If you have not tried it yet, maybe you could:
Downsample your data
Normalize all your data with a StandardScaler
Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Define steps of the pipeline
# (liblinear supports both the l1 and l2 penalties used in the grid below)
steps = [('scaler', StandardScaler()),
         ('log_reg', LogisticRegression(solver='liblinear'))]

pipeline = Pipeline(steps)

# Specify the hyperparameters (prefixed with the pipeline step name)
parameters = {'log_reg__C': [1, 10, 100],
              'log_reg__penalty': ['l1', 'l2']}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training set
cv.fit(X_train, y_train)
Anyway, for your example the pipeline could be as follows (I made it with Logistic Regression, but you can swap in another ML algorithm and change the parameter grid accordingly):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

param_grid = {'C': [1, 10, 100]}
clf = LogisticRegression(solver='lbfgs', multi_class='auto')
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="f1")

pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv)
I hope this helps.
(Edit: I added scoring="f1" to the GridSearchCV.)
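For completeness, here is the same pattern with the random forest from your question instead of logistic regression. This is a sketch only: the grid values are placeholders, and no scaler is needed for a tree-based model:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

rf = RandomForestClassifier(random_state=1)
param_grid = {'n_estimators': [25, 100], 'max_depth': [None, 10]}  # placeholder values
grid = GridSearchCV(estimator=rf, param_grid=param_grid, scoring="f1")

# Resampling happens only on the training part of each CV split
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)
pipeline = Pipeline([('SMOTEENN', sme),
                     ('grid', grid)])

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")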
I created a text classifier, and I'm trying to use k-fold cross-validation. I can't figure out why my first fold has an accuracy of 55% while my other folds overfit at 99-100% accuracy. My data set is a 5109x2 dataframe with columns df["Features"] as the features and df["Labels"] as the labels. df["Features"] contains descriptors based on some product-mapping keywords, separated by commas, as seen here: Features. I'm creating indicator variables from the sub-features with CountVectorizer(). This is the result of a 5-fold CV: Result
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

def train(classifier, X, y):
    count_vect = CountVectorizer(min_df=1, lowercase=False)
    y = pd.Series(y)
    X = count_vect.fit_transform(X)
    y = count_vect.fit_transform(y)
    kf = KFold(n_splits=5, shuffle=True)
    k_fold = pd.Series(np.zeros(5))
    for i, (train_index, test_index) in enumerate(kf.split(X)):
        print("Train", train_index, "Test", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        k_fold[i] = (print("For K=", i + 1, " Classifier accuracy= ",
                           classifier.fit(X_train, y_train).score(X_test, y_test),
                           "n = ", X_train.shape[0]))

train(MLPClassifier(hidden_layer_sizes=(100,), activation='relu', random_state=2,
                    max_iter=100, warm_start=True),
      df["Features"], df["Labels"])
It is entirely possible that this is just a result of the data. There is no reason to implement this by hand; scikit-learn has the functionality built in (see the sketch below). If you want to test your implementation, try running the experiment with the shuffle parameter turned off to see whether you get the same results.
It is best practice to shuffle your data prior to running cross-validation anyway.
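A minimal sketch of the built-in route, with the vectorizer and classifier in one pipeline so the vocabulary is refit on each training fold. It assumes each row of df["Labels"] is a single label (not something to vectorize); the random_state is arbitrary:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([('vect', CountVectorizer(min_df=1, lowercase=False)),
                 ('clf', MLPClassifier(hidden_layer_sizes=(100,), activation='relu',
                                       random_state=2, max_iter=100))])

# shuffle=True mixes the rows before splitting, as recommended above
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, df["Features"], df["Labels"], cv=cv)
print(scores)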
I'm a beginner, and I have the code below.
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV

pca = PCA()
model = GaussianNB()
steps = [('pca', pca), ('model', model)]
pipeline = Pipeline(steps)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
modelwithpca = GridSearchCV(pipeline, param_grid= ,cv=cv)
modelwithpca.fit(X_train,y_train)
This is local testing; what I'm trying to accomplish is:
i. Perform PCA on the dataset
ii. Use Gaussian Naive Bayes with only the default parameters
iii. Use StratifiedShuffleSplit
So in the end I want the above steps to be carried over to another function that dumps the classifier, the dataset and the feature list to test for performance.
dump_classifier_and_data(modelwithpca, dataset, features)
In the param_grid part, I don't want to test any list of parameters. I just want to have the default parameters used in Gaussian Naive Bayes if that makes sense. What do I change?
Also, should there be any changes to how I instantiate the classifier objects?
The purpose of GridSearchCV is to test different parameter values for at least one step in your pipeline (if you don't want to test different parameters, you don't need GridSearchCV).
So, in general, say you want to test different values of n_components for PCA.
The format to use a pipeline with GridSearchCV would be the following:
gscv = GridSearchCV(pipeline, param_grid={'{step_name}__{parameter_name}': [possible values]}, cv=cv)
e.g.:
# this would perform cv for the 3 different values of n_components for pca
gscv = GridSearchCV(pipeline, param_grid={'pca__n_components': [3, 6, 10]}, cv=cv)
If you use GridSearchCV to tune only PCA as above, your model (GaussianNB) would of course keep its default values.
If you don't need parameter tuning at all, GridSearchCV is not the way to go: passing only your model's default parameters produces a parameter grid with a single combination, so it amounts to just performing CV. It wouldn't make much sense, but if I have understood your question correctly it would look like this:
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV

pca = PCA()
model = GaussianNB()
steps = [('pca', pca), ('model', model)]
pipeline = Pipeline(steps)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# get the default parameters of your model and use them as a param_grid
modelwithpca = GridSearchCV(pipeline,
                            param_grid={'model__' + k: [v] for k, v in model.get_params().items()},
                            cv=cv)

# will run 5 times, as your cv is configured
modelwithpca.fit(X_train, y_train)
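For comparison, if no tuning is needed at all, plain cross-validation over the same pipeline and splitter gives the same effect with less machinery (a sketch reusing the names above):
from sklearn.model_selection import cross_val_score

# Five fit/score rounds of the default-parameter pipeline, one per split of cv
scores = cross_val_score(pipeline, X_train, y_train, cv=cv)
print(scores.mean())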
Hope this helps, good luck!
I'm trying to figure out how to built a workflow for sklearn.neighbors.KNeighborsRegressor that includes:
normalize features
feature selection (best subset of 20 numeric features, no specific total)
cross-validates hyperparameter K in range 1 to 20
cross-validates model
uses RMSE as error metric
There are so many different options in scikit-learn that I'm a bit overwhelmed trying to decide which classes I need.
Besides sklearn.neighbors.KNeighborsRegressor, I think I need:
sklearn.pipeline.Pipeline
sklearn.preprocessing.Normalizer
sklearn.model_selection.GridSearchCV
sklearn.model_selection.cross_val_score
sklearn.feature_selection.SelectKBest
OR
sklearn.feature_selection.SelectFromModel
Would someone please show me what defining this pipeline/workflow might look like? I think it should be something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
# build regression pipeline
pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try regressor__n_neighbors from 1 to 20, and feature count from 1 to len(features)
parameters = {'kbest__k': list(range(1, X.shape[1] + 1)),
              'regressor__n_neighbors': list(range(1, 21))}

# outer cross-validation on model, inner cross-validation on hyperparameters
scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10),
                         X, y, cv=10, scoring="neg_mean_squared_error", verbose=2)
rmses = np.abs(scores) ** (1 / 2)
avg_rmse = np.mean(rmses)
print(avg_rmse)
It doesn't seem to error out, but a few of my concerns are:
Did I perform the nested cross-validation properly so that my RMSE is unbiased?
If I want the final model to be selected according to the best RMSE, am I supposed to use scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV?
Is SelectKBest, f_classif the best option to use for selecting features for the KNeighborsRegressor model?
How can I see:
which subset of features was selected as best
which K was selected as best
Any help is greatly appreciated!
Your code seems okay.
As for using scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV: I would do the same to make sure things run consistently, but the only way to test this is to remove one of the two and see whether the results change.
SelectKBest is a good approach, but you can also use SelectFromModel or the other methods you can find here.
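For example, a minimal sketch with SelectFromModel in the same pipeline; since KNeighborsRegressor exposes no importances, it borrows a tree-based estimator to rank features (the RandomForestRegressor and its settings are assumptions, not something from your post):
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

pipeline_sfm = Pipeline([
    ('normalize', Normalizer()),
    # keep features whose forest importance is above the mean importance
    ('select', SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=0))),
    ('regressor', KNeighborsRegressor())])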
Finally, in order to get the best parameters and the feature scores, I modified your code a bit as follows:
import ...  # same imports as in the question above

pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try regressor__n_neighbors from 1 to 20, and feature count from 1 to len(features)
parameters = {'kbest__k': list(range(1, X.shape[1] + 1)),
              'regressor__n_neighbors': list(range(1, 21))}

# changes here
grid = GridSearchCV(pipeline, parameters, cv=10, scoring="neg_mean_squared_error")
grid.fit(X, y)

# get the best parameters and the best estimator
print("the best estimator is \n {} ".format(grid.best_estimator_))
print("the best parameters are \n {}".format(grid.best_params_))

# get the feature scores rounded to 2 decimals
pip_steps = grid.best_estimator_.named_steps['kbest']

features_scores = ['%.2f' % elem for elem in pip_steps.scores_]
print("the features scores are \n {}".format(features_scores))

feature_scores_pvalues = ['%.3f' % elem for elem in pip_steps.pvalues_]
print("the feature_pvalues is \n {} ".format(feature_scores_pvalues))

# create a tuple of feature names, scores and pvalues, name it "features_selected_tuple"
featurelist = ['age', 'weight']
features_selected_tuple = [(featurelist[i], features_scores[i], feature_scores_pvalues[i])
                           for i in pip_steps.get_support(indices=True)]

# Sort the tuple by score, in reverse order
features_selected_tuple = sorted(features_selected_tuple,
                                 key=lambda feature: float(feature[1]), reverse=True)

# Print
print('Selected Features, Scores, P-Values')
print(features_selected_tuple)
Results using my data:
the best estimator is
Pipeline(steps=[('normalize', Normalizer(copy=True, norm='l2')), ('kbest', SelectKBest(k=2, score_func=<function f_classif at 0x0000000004ABC898>)), ('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=18, p=2,
weights='uniform'))])
the best parameters are
{'kbest__k': 2, 'regressor__n_neighbors': 18}
the features scores are
['8.98', '8.80']
the feature_pvalues is
['0.000', '0.000']
Selected Features, Scores, P-Values
[('correlation', '8.98', '0.000'), ('gene', '8.80', '0.000')]
I plan on using scikit svm for class prediction.
I have a two-class dataset consisting of about 100 experiments. Each experiment encapsulates my data-points (vectors) + classification.
Training an SVM according to http://scikit-learn.org/stable/modules/svm.html should be straightforward.
I will have to put all vectors in an array, generate another array with the corresponding class labels, and train the SVM. However, in order to run leave-one-out error estimation, I need to leave out a specific subset of vectors - one experiment.
How do I achieve that with the available score function?
Cheers,
EL
You could manually train on everything but the one observation, using numpy indexing to drop it out. Then you can use any of sklearn's helpers to evaluate the classification. For example:
import numpy as np
from sklearn import svm

clf = svm.SVC(...)

idx = np.arange(len(observations))
preds = np.zeros(len(observations))
for i in idx:
    is_train = idx != i
    clf.fit(observations[is_train, :], labels[is_train])
    # slice with i:i+1 so predict receives a 2D array of one sample
    preds[i] = clf.predict(observations[i:i + 1, :])[0]
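With preds filled in, any of those helpers applies, for instance (a minimal sketch, assuming numeric labels as in the loop above):
from sklearn.metrics import accuracy_score

print(accuracy_score(labels, preds))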
Alternatively, scikit-learn has a helper to do leave-one-out, and another helper to get cross-validation scores:
from sklearn import svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

clf = svm.SVC(...)
loo = LeaveOneOut()
was_right = cross_val_score(clf, observations, labels, cv=loo)
total_acc = np.mean(was_right)
See the user's guide for more. cross_val_score actually returns a score for each fold (which is a little strange IMO), but since we have one fold per observation, this will just be 0 if it was wrong and 1 if it was right.
Of course, leave-one-out is very slow and has terrible statistical properties to boot, so you should probably use KFold instead.
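A sketch of that KFold alternative, using default SVC settings (the fold count and random_state are arbitrary choices):
from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score

clf = svm.SVC()
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, observations, labels, cv=cv)
print(scores.mean())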