I created a text classifier and I'm trying to use K-fold cross-validation. I can't figure out why my first fold has an accuracy of 55% while my other folds overfit at 99-100% accuracy. My data set is a 5109x2 dataframe, with df["Features"] as the features and df["Labels"] as the labels. df["Features"] contains descriptors based on some product-mapping keywords, separated by commas. I'm creating indicator variables for the sub-features through CountVectorizer(). My code for the 5-fold CV is below.
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
def train(classifier, X, y):
    # Vectorize the comma-separated feature strings (and the labels) into indicator matrices
    count_vect = CountVectorizer(min_df=1, lowercase=False)
    y = pd.Series(y)
    X = count_vect.fit_transform(X)
    y = count_vect.fit_transform(y)
    kf = KFold(n_splits=5, shuffle=True)
    k_fold = pd.Series(np.zeros(5))
    for i, (train_index, test_index) in enumerate(kf.split(X)):
        print("Train", train_index, "Test", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        k_fold[i] = classifier.fit(X_train, y_train).score(X_test, y_test)
        print("For K=", i + 1, " Classifier accuracy= ", k_fold[i], "n = ", X_train.shape[0])

train(MLPClassifier(hidden_layer_sizes=(100,), activation='relu', random_state=2,
                    max_iter=100, warm_start=True),
      df["Features"], df["Labels"])
It is entirely possible that this is just a result of the data. There is no reason to implement the fold loop by hand, though: scikit-learn has this functionality built in (cross_val_score). If you want to test your implementation, try running the experiment with the shuffle parameter turned off and see whether you get the same results.
It is best practice to shuffle your data anyway before running cross-validation.
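For example, here is a minimal sketch of the built-in route. It assumes the dataframe df with the "Features" and "Labels" columns from the question and treats df["Labels"] as a plain 1-D label column; the vectorizer and MLP settings are just carried over, not tuned:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold, cross_val_score

# Vectorizer and classifier live in one pipeline, so the vocabulary is
# learned on each training fold only and the folds are scored for you.
pipe = make_pipeline(
    CountVectorizer(min_df=1, lowercase=False),
    MLPClassifier(hidden_layer_sizes=(100,), activation='relu',
                  max_iter=100, random_state=2),
)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, df["Features"], df["Labels"], cv=cv)
print(scores, scores.mean())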
After K-fold cross validation, which evaluation metric was averaged?
Precision and recall, or F-measure?
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
KFold(n_splits=2, random_state=None, shuffle=False)
sklearn.model_selection.KFold is a utility class that provides the fold indices but does not actually perform the k-fold validation or compute any metric. You have to implement that part yourself!
See documentation description:
Provides train/test indices to split data in train/test sets. Split
dataset into k consecutive folds (without shuffling by default).
Each fold is then used once as a validation while the k - 1 remaining
folds form the training set.
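For illustration, here is a minimal sketch of that loop on placeholder data (the toy X, y and the LogisticRegression are just stand-ins): KFold only hands back the train/test indices; fitting, scoring, and averaging the metric of your choice is up to you.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Toy data just to make the snippet runnable; replace with your own X and y.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.randint(0, 2, size=100)

kf = KFold(n_splits=2, shuffle=False)
scores = []
for train_idx, test_idx in kf.split(X):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(np.mean(scores))  # the averaged per-fold accuracy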
I've encountered a problem when using the kernel PCA implemented in sklearn: the order of the samples before KPCA significantly influences the classification accuracy.
Here is my processing procedure:
1. run kernel PCA on the input X (n_samples, n_features);
2. shuffle X, then split it into a training set and a test set (10-fold);
3. use the extra-trees classifier, SVC, or other classifiers implemented in sklearn to perform the binary classification task.
My code:
import numpy as np
import scipy.io as sio
from sklearn.utils import shuffle
from sklearn.decomposition import KernelPCA
import sklearn.metrics as metrics
from sklearn.model_selection import KFold
from sklearn.ensemble import ExtraTreesClassifier as etclf

# load data
datapath = r"F:\..."
data = sio.loadmat(datapath + "\\...")
x = data["x"]
labels = data["labels"]

# kernel pca
gm = 1e-5
nfea = x.shape[1]
kpca = KernelPCA(n_components=nfea, kernel='rbf', gamma=gm, eigen_solver="auto", random_state=42)
x_pca = kpca.fit_transform(x)

# shuffle x_pca together with labels
x_shuffle, y_shuffle = shuffle(x_pca, labels, random_state=42)
data_label = np.concatenate((x_shuffle, y_shuffle), axis=1)

# 10-fold cross validation
kf = KFold(n_splits=10, shuffle=False)
for train, test in kf.split(data_label):
    x_train = data_label[train, :-1]
    x_test = data_label[test, :-1]
    y_train = data_label[train, -1]
    y_test = data_label[test, -1]

    # binary classification prediction
    clf = etclf(n_estimators=10, criterion='gini', random_state=42)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    acc = metrics.accuracy_score(y_test, y_pred)
Before applying kernel PCA, I tried two different orderings of x:
When I sorted x by label, 1s first and then 0s (i.e., 1111111...111000000...000), I got an accuracy close to 0.99.
When I shuffled x together with its labels (i.e., 1100011100...00101100100011), I got an accuracy of about 0.50.
I also tried other classifiers such as SVC and Gaussian naive Bayes, and the results were similar, so I don't think it is a matter of the classifier, or of leakage between the training and test sets. It seems more likely that KPCA introduces high correlations between samples that are close together in the original order. I don't know how to explain this result.
Thanks for help!
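For reference, a minimal sketch of a leakage-free variant, fitting the KPCA on the training fold only and transforming the test fold with it; it reuses x and labels from the code above, with the same gamma, n_components, and classifier settings:

from sklearn.decomposition import KernelPCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

kf = KFold(n_splits=10, shuffle=True, random_state=42)
accs = []
for train_idx, test_idx in kf.split(x):
    # Fit the transform on the training fold only, then apply it to the test fold
    kpca = KernelPCA(n_components=x.shape[1], kernel='rbf', gamma=1e-5, random_state=42)
    x_train = kpca.fit_transform(x[train_idx])
    x_test = kpca.transform(x[test_idx])

    clf = ExtraTreesClassifier(n_estimators=10, criterion='gini', random_state=42)
    clf.fit(x_train, labels[train_idx].ravel())
    accs.append(accuracy_score(labels[test_idx].ravel(), clf.predict(x_test)))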
I have managed to write some code doing nested cross-validation using lightGBM as my regressor and wrapping everything with sklearn.pipeline.
Ultimately, I now want to do feature selection (or really just get the features' importance for the final model), but I am wondering what the best path to take from here is. I guess there would be two possibilities:
#1 Use this methodology to build a model (using .fit and .predict) with the best hyperparameters, then check the importance of the features for this model.
#2 Do feature selection in the inner fold of the nested CV, but I am unsure how to do this exactly.
I guess #1 would be the easiest, but I am unsure how to get the best hyperparameters for each outer fold.
This thread touches on it:
Putting together sklearn pipeline+nested cross-validation for KNN regression
But the selected answer drops the cross_val_score altogether, meaning that it isn't nested cross-validation anymore (I would still like to perform the CV on the outer fold after getting the best hyperparameters on the inner fold).
So my problem is the following:
Can I get feature importances for each fold of the outer CV (I am aware that if I have 5 folds, I will get 5 different sets of feature importance)? And if yes, how?
Alternatively, should I just get the best hyperparameters for each fold (how?) and build a new model without CV on the whole dataset, based on these hyperparameters?
Here is the code I have so far:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import scipy.stats as st
# Parameters for model building and reproducibility
X = X_age
y = y_age
RNGesus = 42
state = 13
outer_scoring = 'neg_mean_absolute_error'
inner_scoring = 'neg_mean_absolute_error'

#### Nested CV with randomized search ####
# Pipeline with standard scaling and the regressor
regressors = [lgb.LGBMRegressor(random_state=state)]
continuous_transformer = Pipeline([('scaler', StandardScaler())])
preprocessor = ColumnTransformer([('cont', continuous_transformer, continuous_variables)], remainder='passthrough')

for reg in regressors:
    steps = [('preprocessor', preprocessor), ('regressor', reg)]
    pipeline = Pipeline(steps)

    # Inner and outer folds to be used
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)

    # Hyperparameters of the regressor to be optimized using randomized search
    params = {
        'regressor__max_depth': (3, 5, 7, 10),
        'regressor__lambda_l1': st.uniform(0, 5),
        'regressor__lambda_l2': st.uniform(0, 3)
    }

    # Pass the RandomizedSearchCV to cross_val_score
    regression = RandomizedSearchCV(estimator=pipeline, param_distributions=params, scoring=inner_scoring, cv=inner_cv, n_iter=200, verbose=3, n_jobs=-1)
    nested_score = cross_val_score(regression, X=X, y=y, cv=outer_cv, scoring=outer_scoring)

    print('\n MAE for lightGBM model predicting age: %.3f' % (abs(nested_score.mean())))
    print('\n' + str(nested_score) + ' <- outer CV')
Edit: Stated the problem clearly.
I encountered problems importing the lightGBM module, so I couldn't run your code. But here is a post explaining why you cannot get the "winning" or optimal hyperparameters (nor the feature_importances_) out of nested cross-validation with cross_val_score. Briefly, the reason is that cross_val_score only returns the scores.
Can I get feature importances for each fold of the outer CV (I am aware that if I have 5 folds, I will get 5 different sets of feature importance)? And if yes, how?
The answer is no with cross_val_score. But if you follow the code from that post, you'll be able to get the feature importances simply from GSCV.best_estimator_.feature_importances_ (for a pipeline, from its fitted regressor step) inside the for loop, after GSCV.fit().
Alternatively, should I just get the best hyperparameters for each fold (how?) and build a new model without CV on the whole dataset, based on these hyperparameters?
This is exactly what that post is talking about: getting the "best" hyperparameters by nested CV. Ideally, you'll observe one combination of hyperparameters that wins all the time, and those are the hyperparameters you'll use for the final model (trained on the entire training set). But when different "best" hyperparameter combinations appear across the CV folds, there is no standard way to deal with it as far as I know.
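For illustration, here is a minimal sketch of such a manual outer loop, reusing pipeline, params, inner_cv, outer_cv, inner_scoring, X and y from the question. It assumes X and y are pandas objects and that the pipeline's final step is named 'regressor', as in the question's code:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error

outer_scores, fold_params, fold_importances = [], [], []
for train_idx, test_idx in outer_cv.split(X):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]

    # Inner loop: hyperparameter search on the training part of this outer fold
    search = RandomizedSearchCV(estimator=clone(pipeline), param_distributions=params,
                                scoring=inner_scoring, cv=inner_cv,
                                n_iter=200, n_jobs=-1)
    search.fit(X_tr, y_tr)

    # Outer loop: score the tuned model on the held-out outer fold
    outer_scores.append(mean_absolute_error(y_te, search.predict(X_te)))
    fold_params.append(search.best_params_)
    # feature_importances_ of the fitted LGBMRegressor inside the pipeline
    fold_importances.append(
        search.best_estimator_.named_steps['regressor'].feature_importances_)

print(np.mean(outer_scores), fold_params)

This gives you one MAE, one best-hyperparameter combination, and one set of feature importances per outer fold, which is exactly the per-fold information cross_val_score hides.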
I am trying to build a predictive model using Python. The training and test data sets have over 400 variables. After applying feature selection on the training data set, the number of variables is reduced to 180:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold = .9)
and then I am training a model using a gradient boosting algorithm, achieving .84 AUC in cross-validation.
from sklearn import ensemble
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score as auc
df_fit, df_eval, y_fit, y_eval= train_test_split( df, y, test_size=0.2, random_state=1 )
boosting_model = ensemble.GradientBoostingClassifier(n_estimators=100, max_depth=3,
min_samples_leaf=100, learning_rate=0.1,
subsample=0.5, random_state=1)
boosting_model.fit(df_fit, y_fit)
But when I try to use this model to predict on the prediction data set, it gives me an error:
predict_target = boosting_model.predict(df_prediction)
Error: Number of variables in prediction data set 'df_prediction' does not match the number of variables in the model
This makes sense, because the total number of variables in the prediction data is still over 400.
My question: is there any way to get around this problem and keep using feature selection for predictive modeling? If I remove the feature selection, the model's AUC drops to .5, which is very poor.
Thanks!
You should run your prediction matrix through the feature selection step too. So somewhere in your code you do
df = sel.fit_transform(X)
and before predicting
df_prediction = sel.transform(X_prediction)
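Put together with the names from the question (df, y, df_prediction), a minimal sketch of the full flow could look like this; the deprecated sklearn.cross_validation import is kept only to match the question, newer versions provide train_test_split in sklearn.model_selection:

from sklearn.feature_selection import VarianceThreshold
from sklearn import ensemble
from sklearn.cross_validation import train_test_split

# Fit the selector on the training features only, then reuse it everywhere
sel = VarianceThreshold(threshold=.9)
df_reduced = sel.fit_transform(df)  # the reduced feature set (180 columns in the question)
df_fit, df_eval, y_fit, y_eval = train_test_split(df_reduced, y, test_size=0.2, random_state=1)

boosting_model = ensemble.GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                                     min_samples_leaf=100, learning_rate=0.1,
                                                     subsample=0.5, random_state=1)
boosting_model.fit(df_fit, y_fit)

# Apply the already-fitted selector to the prediction set so the column counts match
predict_target = boosting_model.predict(sel.transform(df_prediction))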
I have a matrix with 20 columns. The last column contains 0/1 labels.
The link to the data is here.
I am trying to run random forest on the dataset, using cross validation. I use two methods of doing this:
using sklearn.cross_validation.cross_val_score
using sklearn.cross_validation.train_test_split
I am getting different results when I do what I think is pretty much exactly the same thing. To exemplify, I run a two-fold cross-validation using the two methods above, as in the code below.
import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
#read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:,0:18]
y = data.iloc[:,19]
depth = 5
maxFeat = 3
result = cross_val_score(ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False), X, y, scoring='roc_auc', cv=2)
result
# result is now something like array([ 0.66773295, 0.58824739])
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)
RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain,ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:,1:2])
print auc #something like 0.83
RFModel.fit(xtest,ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:,1:2])
print auc #also something like 0.83
My question is:
why am I getting different results, i.e., why is the AUC (the metric I am using) higher when I use train_test_split?
Note:
When I use more folds (say 10), there appears to be some kind of pattern in my results, with the first calculation always giving me the highest AUC.
In the case of the two-fold cross validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.
Thanks for your help!
When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator:
http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics
http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold
By default, cross_val_score will not randomize your data, which can produce odd results like this if your data isn't random to begin with.
The KFold iterator has a random_state parameter:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
So does train_test_split, which does randomize by default:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
Patterns like the one you described are usually a result of a lack of randomness in the train/test split.
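For example, reusing X, y, depth, and maxFeat from the question (and keeping its old sklearn.cross_validation API and Python 2 print), a shuffled KFold can be passed to cross_val_score like this; in current scikit-learn the same class lives in sklearn.model_selection and takes n_splits instead of n and n_folds:

from sklearn import ensemble
from sklearn.cross_validation import cross_val_score, KFold

# Shuffled folds instead of the default unshuffled split
cv = KFold(len(y), n_folds=2, shuffle=True, random_state=1)
result = cross_val_score(
    ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                    max_features=maxFeat),
    X, y, scoring='roc_auc', cv=cv)
print result.mean()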
The answer is what @KCzar pointed out. I just want to note that the easiest way I found to randomize the data (shuffling X and y with the same index permutation) is the following:
p = np.random.permutation(len(X))
X, y = X[p], y[p]
source: Better way to shuffle two numpy arrays in unison
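If you'd rather stay inside scikit-learn, sklearn.utils.shuffle does the same unison shuffling in one call:

from sklearn.utils import shuffle

# Shuffles X and y with the same permutation
X, y = shuffle(X, y, random_state=0)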