I am using LinearSVC to classify my documents into categories. However, my dataset is unbalanced: some categories have 48,000 documents and some as few as 100. When I train my model, even using StratifiedKFold, I see that the category with 48,000 documents gets a much larger share of documents (3,300) than the others. In such a case it would surely give me biased predictions. How can I make sure this selection isn't biased?
kf = StratifiedKFold(labels, n_folds=10, shuffle=True)
for train_index, test_index in kf:
    X_train, X_test = docs[train_index], docs[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]
Then I'm writing these (X_train, Y_train) to a file, computing the feature matrix, and passing them to the classifier as follows:
model1 = LinearSVC()
model1 = model1.fit(matrix, label_tmp)
pred = model1.predict(matrix_test)
print("Accuracy is:")
print(metrics.accuracy_score(label_test, pred))
print(metrics.classification_report(label_test, pred))
StratifiedKFold by default preserves the ratio of the class labels, meaning that each fold will have (almost exactly) the same class proportions as the full dataset. Whether you want to adjust for this or not is somewhat up to you: you can either let the classifier learn some kind of bias towards labels with more samples (as you are doing now), or you can do one of two things:
Construct a separate train/test set, where the training set has an equal number of samples for each label (therefore, in your case, each class label in the training set might only have 50 examples, which is not ideal). Then you can train on your training set and test on the rest. If you do this multiple times with different samples, you are essentially doing k-fold cross-validation, just choosing your sample sizes in a different way.
You can change your loss function (i.e. the way you initialize LinearSVC()) to account for the class imbalance. For example: model = LinearSVC(class_weight='balanced'). This weights the loss by the inverse class frequencies, so the model takes the class imbalance into account.
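As a rough sketch of that second option (names such as docs_matrix are placeholders for your own feature matrix and label array), the class-weighted classifier can be combined with the stratified splits you already have:
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn import metrics

# docs_matrix: your precomputed feature matrix; labels: the class labels
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in skf.split(docs_matrix, labels):
    X_train, X_test = docs_matrix[train_index], docs_matrix[test_index]
    y_train, y_test = labels[train_index], labels[test_index]

    # class_weight='balanced' reweights each class by the inverse of its
    # frequency, so the 100-document categories count as much as the
    # 48,000-document one in the loss.
    model = LinearSVC(class_weight='balanced')
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(metrics.classification_report(y_test, pred))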
Related
Suppose I iterate with the following code until I acquire an accuracy that I'm satisfied with:
import numpy as np
from sklearn.model_selection import train_test_split

x, y = ...  # read in some data set
c = 3000    # iterate over some arbitrary range
for i in range(c):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=i)
    model = ...  # initialize some classifier of choice
    model.fit(x_train, y_train)
    p = model.predict(x_test)
    p = np.round(p).reshape(-1)
    test_accuracy = np.mean(p == y_test) * 100
For a particular data set and range, say I build a classifier such that the training accuracy is 97% and the test accuracy is 96%. Can I truly claim the model is 96% accurate? For the same range and data set, I can also build a classifier whose training and test accuracies are 99% and as low as 70%, respectively.
Since I have selected random_state based on the test set accuracy, isn't the test set really a validation set here? I can't quite articulate why, but I think claiming the first model is 96% accurate would not be true. What should I do instead in order to make a correct claim about the model's accuracy?
Is it bad practice to iterate over many random training & test set splits until a high accuracy is achieved?
Yes, this is bad practice. You should be evaluating on data that your model has never been trained on, and this wouldn't really be the case if you train many times to find the best train/test split.
You can put aside a test set before you train the model. Then you can create as many train/validation splits as you want and train the model multiple times. You would evaluate on the test set, on which the model was never trained.
You can also look into nested cross-validation.
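A hedged sketch of that workflow, using iris and logistic regression purely as stand-ins for your own data and classifier: hold out the test set once, do the repeated splits only on the remainder, and touch the test set a single time at the end.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)  # stand-in for your own data set

# Hold out a test set once, before any model selection.
x_trainval, x_test, y_trainval, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

best_model, best_val_acc = None, -np.inf
for i in range(100):
    # Repeated train/validation splits are fine for model selection,
    # because the held-out test set is never touched inside this loop.
    x_train, x_val, y_train, y_val = train_test_split(
        x_trainval, y_trainval, test_size=0.25, random_state=i)
    model = LogisticRegression(max_iter=1000)  # stand-in for your classifier
    model.fit(x_train, y_train)
    val_acc = np.mean(model.predict(x_val) == y_val)
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

# Report accuracy once, on the untouched test set.
test_acc = np.mean(best_model.predict(x_test) == y_test)
print(test_acc)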
Kind of. There is cross-validation, which is similar to what you have described: the train/test split is randomised and the model is trained each time, except the final value quoted is the average test accuracy, not simply the best. This sort of thing is done in tricky situations, e.g. with very small datasets.
In terms of the bigger picture, the test data should be representative of the training data and vice versa. Sure you can cheese it that way, but if the atypical 'weird' cases are hidden in your training set and the test set is just full of easy cases (e.g. only the digit 0 for MNIST) then you aren't really achieving anything. You're only cheating yourself.
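A minimal sketch of that averaging, assuming model, x and y are your (unfitted) classifier and data:
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Randomised splits, but the number you quote is the mean over all of them.
cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)
scores = cross_val_score(model, x, y, cv=cv)
print("%.3f +/- %.3f" % (scores.mean(), scores.std()))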
I'm using RandomizedSearchCV to get the best parameters with a 10-fold cross-validation and 100 iterations. This works well. But now I would like to also get the probabilities of each predicted test data point (like predict_proba) from the best performing model.
How can this be done?
I see two options: first, perhaps it is possible to get these probabilities directly from the RandomizedSearchCV; or second, get the best parameters from RandomizedSearchCV and then run another 10-fold cross-validation (with the same seed, so that I get the same splits) with those best parameters.
Edit: Is the following code correct for getting the probabilities of the best performing model? X is the training data, y are the labels, and model is my RandomizedSearchCV containing a Pipeline that imputes missing values, standardizes, and fits an SVM.
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_prob = np.empty([y.size, nrClasses]) * np.nan
best_model = model.fit(X, y).best_estimator_

for train, test in cv_outer.split(X, y):
    probas_ = best_model.fit(X[train], y[train]).predict_proba(X[test])
    y_prob[test] = probas_
If I understood it right, you would like to get the individual scores of every sample in your test split for the case with the highest CV score. If that is the case, you have to use one of those CV generators which give you control over split indices, such as those here: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validation-generators
If you want to calculate scores of a new test sample with the best performing model, the predict_proba() function of RandomizedSearchCV would suffice, given that your underlying model supports it.
Example:
import numpy
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
scores = cross_val_score(svc, X, y, cv=skf, n_jobs=-1)  # svc is your estimator
max_score_split = numpy.argmax(scores)
Now that you know that your best model happens at max_score_split, you can get that split yourself and fit your model with it.
# split() returns a generator, so materialise it as a list before indexing;
# reuse the same skf object (and pass y, so the splits match) as above.
train_indices, test_indices = list(skf.split(X, y))[max_score_split]
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
model.fit(X_train, y_train)  # model is your estimator object, created beforehand
And finally get your predictions by:
model.predict_proba(X_test)
I haven't tested the code myself, but it should work with minor modifications.
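For the second case mentioned above (scoring new samples with the best performing model), a fitted RandomizedSearchCV can be used directly, provided the underlying estimator exposes probabilities; the names here are placeholders:
# search: a fitted RandomizedSearchCV whose final estimator supports
# predict_proba (for an SVC that means probability=True); X_new: new samples.
probs = search.predict_proba(X_new)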
You need to look in cv_results_: it gives you the per-fold scores and the mean score for every parameter setting, along with fitting times etc.
If you want predict_proba() for each of the iterations, the way to do this would be to loop through the params given in cv_results_, re-fit the model for each of them, and then predict the probabilities, as the individual models are not cached anywhere, as far as I know.
best_params_ gives you the best-performing parameters, in case you want to train a model using just those parameters next time.
See cv_results_ on the documentation page: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
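A minimal sketch of that loop, assuming search is your fitted RandomizedSearchCV, (X, y) is the data it was fitted on, and X_new holds the samples you want probabilities for (the underlying estimator must support predict_proba, e.g. SVC(probability=True)):
from sklearn.base import clone

for params in search.cv_results_['params']:
    # Re-fit each candidate, since the per-iteration models are not cached.
    candidate = clone(search.estimator).set_params(**params)
    candidate.fit(X, y)
    probs = candidate.predict_proba(X_new)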
According to the scikit-learn documentation for cross_val_score:
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None,
    scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None,
    pre_dispatch='2*n_jobs')
X and y:
X : array-like
    The data to fit. Can be, for example, a list or an array.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.
I am wondering whether [X, y] should be X_train and y_train, or whether [X, y] should be the whole dataset. In some Kaggle notebooks people use the whole dataset and in others they use X_train and y_train.
To my knowledge, cross-validation just evaluates the model and shows whether you overfit/underfit your data (it does not actually train the model for you to keep). And in my view, the more data you have the better the performance will be, so I would use the whole dataset.
What do you think?
Model performance depends on the way the data is split, and sometimes the model does not have the ability to generalize; that's why we need cross-validation.
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used, because over the course of the procedure the model is trained on, and tested against, all of the available data.
"I am wondering whether [X, y] is X_train and y_train or [X, y] should be the whole dataset."
[X, y] should be the whole dataset, because internally cross-validation splits the data into training data and test data.
Suppose you use cross validation with 5 folds (cv = 5).
We begin by splitting the dataset into five groups, or folds. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set, and compute the metric of interest.
Next, we hold out the second fold as our test set, fit on the remaining data, predict on the test set, and compute the metric of interest, and so on for the remaining folds.
By default, scikit-learn's cross_val_score() function uses the R^2 score as the metric of choice for regression. The R^2 score is also called the coefficient of determination.
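A short sketch of the point above, assuming X and y are the whole dataset and using a plain linear regression as the estimator:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()

# cross_val_score does the train/test splitting internally, so you pass
# the full X and y; for a regressor each fold's score is R^2 by default.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())

# cross_val_score only evaluates; if you want a model to actually use,
# fit it on all of the data afterwards.
model.fit(X, y)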
I'd like to use scikit-learn's GridSearchCV to perform a grid search and calculate the cross validation error using a predefined development and validation split (1-fold cross validation).
I'm afraid that I've done something wrong, because my validation accuracy is suspiciously high. Where I think I'm going wrong: I'm splitting up my training data into development and validation sets, training on the development set and recording the cross validation score on the validation set. My accuracy might be inflated because I am really training on a mix of the development and validation sets, then testing on the validation set. I'm not sure if I'm using scikit-learn's PredefinedSplit module correctly. Details below:
Following this answer, I did the following:
import numpy as np
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
# I split up my data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    data[training_features], data[training_response], test_size=0.2, random_state=550)
# sanity check - dimensions of training and test splits
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# dimensions of X_train and y_train are (323430, 26) and (323430, 1) respectively
# dimensions of X_test and y_test are (80858, 26) and (80858, 1)
''' Now, I define indices for a pre-defined split.
this is a 323430 dimensional array, where the indices for the development
set are set to -1, and the indices for the validation set are set to 0.'''
validation_idx = np.repeat(-1, y_train.shape)
np.random.seed(550)
validation_idx[np.random.choice(validation_idx.shape[0],
                                int(round(.2 * validation_idx.shape[0])), replace=False)] = 0
# Now, create a list which contains a single tuple of two elements,
# which are arrays containing the indices for the development and
# validation sets, respectively.
validation_split = list(PredefinedSplit(validation_idx).split())
# sanity check
print(len(validation_split[0][0])) # outputs 258744
print(len(validation_split[0][0]) / float(validation_idx.shape[0]))  # outputs .8
print(validation_idx.shape[0] == y_train.shape[0]) # True
print(set(validation_split[0][0]).intersection(set(validation_split[0][1]))) # set([])
Now, I run a grid search using GridSearchCV. My intention is that a model will be fit on the development set for each parameter combination over the grid, and the cross validation score will be recorded when the resulting estimator is applied to the validation set.
# a vanilla XGboost model
model1 = XGBClassifier()
# create a parameter grid for the number of trees and depth of trees
n_estimators = range(300, 1100, 100)
max_depth = [8, 10]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
# A grid search.
# NOTE: I'm passing the list of (train, validation) index tuples derived from
# PredefinedSplit as the argument to the `cv` parameter.
grid_search = GridSearchCV(model1, param_grid,
                           scoring='neg_log_loss',
                           n_jobs=-1,
                           cv=validation_split,
                           verbose=1)
Now, here is where a red flag is raised for me. I use the best estimator found by the grid search to compute the accuracy on the validation set. It's very high: 0.89207865689639176. What's worse is that it's almost identical to the accuracy I get if I use the classifier on the development set (on which I just trained): 0.89295597192591902. But when I use the classifier on the true test set, I get a much lower accuracy, roughly 0.78:
# accuracy score on the validation set. This yields 0.89207865...
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][1]]),
               y_true=y_train[validation_split[0][1]])

# accuracy score when applied to the development set. This yields 0.8929559...
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][0]]),
               y_true=y_train[validation_split[0][0]])

# finally, the score when applied to the test set. This yields 0.783
accuracy_score(y_pred=grid_result2.predict(X_test), y_true=y_test)
To me, the almost exact correspondence between the model's accuracy when applied to the development and validation datasets, and the significant loss in accuracy when applied to the test set is a clear sign that I'm training on the validation data by accident, and thus my cross validation score is not representative of the true accuracy of the model.
I can't seem to find where I went wrong - mostly because I don't know what GridSearchCV is doing under the hood when it receives a PredefinedSplit object as the argument to the cv parameter.
Any ideas where I went wrong? If you need more details/elaboration, please let me know. The code is also in this notebook on github.
Thanks!
You need to set refit=False (refit=True is the default); otherwise the grid search will refit the best estimator on the whole dataset passed to fit (ignoring cv) after the search completes.
Yes, there was a data-leakage problem for the validation data. Set refit=False for GridSearchCV and it will not refit on the whole data (training plus validation) at the end.
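A hedged sketch of how that might look, reusing the question's validation_idx and param_grid. With refit=False and a single metric, best_params_ is still available but there is no best_estimator_, so you refit the winning parameters on the development rows yourself (the boolean row indexing below assumes your X_train/y_train containers support it):
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from xgboost import XGBClassifier

ps = PredefinedSplit(validation_idx)  # -1 marks development rows, 0 marks validation rows

grid_search = GridSearchCV(XGBClassifier(),
                           param_grid,
                           scoring='neg_log_loss',
                           cv=ps,
                           refit=False,   # do NOT refit on dev + validation at the end
                           n_jobs=-1)
grid_search.fit(X_train, y_train)

# Refit the winning parameters on the development rows only,
# then evaluate once on the held-out test set.
best = XGBClassifier(**grid_search.best_params_)
dev_rows = validation_idx == -1
best.fit(X_train[dev_rows], y_train[dev_rows])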
As part of the Enron project, I built the model attached below; a summary of the steps follows after the code.
The model below gives nearly perfect scores:
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.predict(x_test)
The model below gives more reasonable but lower scores:
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.fit(x_train, y_train)
    gcv.best_estimator_.predict(x_test)
Used SelectKBest to get the feature scores, sorted the features, and tried combinations of the higher- and lower-scoring ones.
Used an SVM with GridSearchCV and a StratifiedShuffleSplit.
Used the best_estimator_ to predict and calculate the precision and recall.
The problem is that the estimator is spitting out perfect scores, in some cases 1.0.
But when I refit the best classifier on the training data and then run the test, it gives reasonable scores.
My question is: what exactly does GridSearchCV do with the test portion of each split produced by the StratifiedShuffleSplit object we pass in? I assumed it would not fit anything on the test data; if that were true, then when I predict on that same test data it should not give such high scores, right? Since I used a fixed random_state, the shuffle split should have produced the same splits for the grid search fit and for my predictions.
So, is using the same StratifiedShuffleSplit for both wrong?
GridSearchCV, as @Gauthier Feuillen said, is used to search for the best parameters of an estimator on the given data.
Description of GridSearchCV:
1. gcv = GridSearchCV(pipe, clf_params, cv=cv)
2. gcv.fit(features, labels)
3. clf_params will be expanded to get all possible separate combinations using ParameterGrid.
4. features will now be split into features_train and features_test using cv; the same happens for labels.
5. Now the grid search estimator (pipe) will be trained using features_train and labels_train and scored using features_test and labels_test.
6. For each possible combination of parameters in step 3, steps 4 and 5 will be repeated for the cv iterations. The average score across the cv iterations will be calculated and assigned to that parameter combination. This can be accessed using the cv_results_ attribute of the grid search.
7. For the parameters which give the best score, the internal estimator will be re-initialized using those parameters and refit on the whole data supplied to it (features and labels).
Because of that last step, you are getting different scores with your first and second approaches. In the first approach, all the data was used for training (the final refit) and you are predicting on that same data; the second approach refits on x_train only and predicts on x_test, which that fit has not seen.
Basically the grid search will:
Try every combination of your parameter grid
For each of them it will do a K-fold cross validation
Select the best available.
So your second case is the right one. Otherwise you are actually predicting on data that you trained with, which is not the case in the second option; there you only keep the best parameters from your grid search and refit them on the training fold.
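For completeness, a rough sketch of the cleaner workflow, reusing the question's pipe and clf_params as placeholders: hold out a test set that the grid search never sees, then evaluate the best estimator on it once.
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, GridSearchCV
from sklearn.metrics import classification_report

# Hold out a test set that the grid search never sees.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(x_train, y_train)  # the search (and its final refit) uses training data only

print(classification_report(y_test, gcv.best_estimator_.predict(x_test)))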