Scikit-learn has CalibratedClassifierCV, which allows us to calibrate our models on a particular X, y pair. Its documentation also states clearly that data for fitting the classifier and for calibrating it must be disjoint.
If they must be disjoint, is it legitimate to train the classifier with the following?
model = CalibratedClassifierCV(my_classifier)
model.fit(X_train, y_train)
I fear that by using the same training set I'm breaking the disjoint data rule. An alternative might be to have a validation set
my_classifier.fit(X_train, y_train)
model = CalibratedClassifierCV(my_classifier, cv='prefit')
model.fit(X_valid, y_valid)
This has the disadvantage of leaving less data for training. Also, if CalibratedClassifierCV should only be fit on models fit on a different training set, why is its default option cv=3, which will also fit the base estimator? Does the cross-validation handle the disjoint rule on its own?
Question: what is the correct way to use CalibratedClassifierCV?
I already answered the exact same question on Cross Validated. I'm leaving it here anyway since it is not clear to me whether this question belongs here or on Cross Validated.
--- Original answer ---
There are two things mentioned in the CalibratedClassifierCV docs that hint towards the ways it can be used:
base_estimator: If cv=prefit, the classifier must have been fit already on data.
cv: If “prefit” is passed, it is assumed that base_estimator has been fitted already and all data is used for calibration.
I may obviously be interpreting this wrong, but it appears you can use the CCCV (short for CalibratedClassifierCV) in two ways:
Number one:
You train your model as usual, your_model.fit(X_train, y_train).
Then, you create your CCCV instance, your_cccv = CalibratedClassifierCV(your_model, cv='prefit'). Notice you set cv to flag that your model has already been fit.
Finally, you call your_cccv.fit(X_validation, y_validation). This validation data is used solely for calibration purposes.
Number two:
You have a new, untrained model.
Then you create your_cccv=CalibratedClassifierCV(your_untrained_model, cv=3). Notice cv is now the number of folds.
Finally, you call your_cccv.fit(X, y). Because your model is untrained, X and y have to be used for both training and calibration. The way to ensure the data is 'disjoint' is cross validation: for any given fold, CCCV will split X and y into your training and calibration data, so they do not overlap.
TLDR: Method one allows you to control what is used for training and for calibration. Method two uses cross validation to try and make the most out of your data for both purposes.
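For concreteness, here is a minimal, self-contained sketch of both methods; the generated toy data and RandomForestClassifier are only stand-ins for your own dataset and classifier:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data so the snippet runs on its own.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

# Method one: fit first, then calibrate on a disjoint validation set.
base = RandomForestClassifier(random_state=0).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(base, cv='prefit')
calibrated.fit(X_valid, y_valid)

# Method two: pass an unfitted model and let the internal cross-validation
# keep the fitting and calibration folds disjoint.
calibrated_cv = CalibratedClassifierCV(RandomForestClassifier(random_state=0), cv=3)
calibrated_cv.fit(X_train, y_train)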
Related
I'm currently working on a problem which compares the performance of three different machine learning algorithms on the same data-set. I divided the data-set into 70/30 training/testing sets and then performed a grid search for the best parameters of each algorithm using GridSearchCV with X_train, y_train.
First question: am I supposed to perform the grid search on the training set, or is it supposed to be on the whole data-set?
Second question: I know that GridSearchCV uses K-fold in its implementation. Does it mean that I performed cross-validation if I used the same X_train, y_train for all three algorithms I compare in GridSearchCV?
Any answer would be appreciated, thank you.
All estimators in scikit-learn whose names end with CV perform cross-validation.
But you need to keep a separate test set for measuring the performance.
So you need to split your whole data into train and test sets. Forget about this test data for a while.
Then pass only this train data to the grid search. GridSearchCV will split the train data further into train and validation folds to tune the hyper-parameters passed to it, and finally fit the model on the whole train data with the best parameters found.
Now you need to test this model on the test data you kept aside in the beginning. This will give you a near real-world estimate of the model's performance.
If you pass the whole data to GridSearchCV, test data would leak into the parameter tuning, and the final model may not perform that well on newer unseen data.
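For illustration, here is a rough sketch of that workflow (the toy dataset, the SVC estimator and the parameter grid are arbitrary choices, not taken from your question):
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Toy data so the snippet runs on its own.
X, y = make_classification(n_samples=500, random_state=0)

# Hold out a test set and forget about it during tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# GridSearchCV splits the training data internally to tune the hyper-parameters,
# then refits the best model on the whole training set (refit=True by default).
grid = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Only now touch the held-out test set to estimate real-world performance.
print(grid.score(X_test, y_test))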
You can look at my other answers which describe the GridSearch in more detail:
Model help using Scikit-learn when using GridSearch
scikit-learn GridSearchCV with multiple repetitions
Yes, GridSearchCV performs cross-validation. If I understand the concept correctly, you want to keep part of your data set unseen by the model in order to test it.
So you train your models on the training data set and test them on a separate testing data set.
Here I was doing almost the same - you might want to check it...
So, I need to use some of the estimators in scikit-learn, namely LogisticRegression and SVM, but I have a problem: I have an extremely unbalanced dataset and need to run K-fold cross validation. The thing is, sometimes the fold I am fitting on can contain only one of the available target classes. I wanted to know if there's any way with these estimators to predefine the number of classes, maybe something like passing them a one-hot encoded representation of the target, where it doesn't matter if all the examples are from one class because the shape of the target matrix already defines the number of classes.
Is there any way to do this with scikit-learn? Maybe with another library? I know those two algorithms use liblinear, maybe there's some interface I can use in that case.
Anyway, thank you for your time.
EDIT: StratifiedKFold cross validation is not useful for me because sometimes I have fewer occurrences of a class than the number of folds. E.g. it can happen that I have a dataset with 50 instances and 3 classes, but 46 instances belong to one class, 2 to a second class and 2 to a third class. Although I can go for 3-fold cross validation, I would generally need results from more folds than that, and even with 3 folds this still leaves open the case where one class is the only one available in a fold.
The comment that said you need to gather more data may be right. However, if you believe you have enough data for your model to learn something useful, you can oversample your minority classes (or possibly undersample the majority classes, but this sounds like a problem for oversampling). Having only one class present in the data makes it pretty much impossible for your model to learn anything about the other classes.
Here are some links to oversampling and undersampling libraries in Python. The famous imbalanced-learn library is great.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Your case sounds like a good candidate for SMOTE. You also mentioned you wanted to change the ratio. There is a parameter in imblearn.over_sampling.SMOTE called ratio, where you would pass a dictionary. You can also do it with percentages (see the documentation).
SMOTE uses the K-Nearest-Neighbors algorithm to create data points "similar" to those of the under-represented classes. This is more powerful than traditional over-sampling because it helps your model avoid memorizing exact duplicates of specific training examples. Instead, SMOTE creates a "similar" data point in the multi-dimensional feature space, so your model can learn to generalize better.
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST use SMOTE on the training set only (i.e. after you split), and then validate on the validation and test sets to see if your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and you will get a model that doesn't even closely resemble what you want.
from collections import Counter

import numpy as np
import warnings

from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

warnings.filterwarnings(action='ignore', category=DeprecationWarning)

# Split first so SMOTE never sees the test data (see the NOTE above).
X_train_normalized, X_test_normalized, y_train, y_test = train_test_split(X_normalized, y, random_state=0)

# 'ratio' maps each class label to its target sample count; newer imbalanced-learn
# versions call this parameter 'sampling_strategy' and the method 'fit_resample'.
sm = SMOTE(random_state=0, n_jobs=8, ratio={'class1': 100, 'class2': 100, 'class3': 80, 'class4': 60, 'class5': 90})
X_train_smote, y_train_smote = sm.fit_sample(X_train_normalized, y_train)

print('Original training set shape:', Counter(y_train))
print('Resampled training set shape:', Counter(y_train_smote))

# Train on the resampled training data only.
smote_xgbc = XGBClassifier(n_jobs=8).fit(X_train_smote, y_train_smote)

# Evaluate on the original (non-resampled) splits; average='macro' because this is multi-class.
print('TRAIN')
print(accuracy_score(y_train, smote_xgbc.predict(np.array(X_train_normalized))))
print(f1_score(y_train, smote_xgbc.predict(np.array(X_train_normalized)), average='macro'))
print('TEST')
print(accuracy_score(y_test, smote_xgbc.predict(np.array(X_test_normalized))))
print(f1_score(y_test, smote_xgbc.predict(np.array(X_test_normalized)), average='macro'))
Is there a way to retrieve the list of feature names used for training a classifier, once it has been trained with the fit method? I would like to get this information before applying the model to unseen data.
The data used for training is a pandas DataFrame and in my case, the classifier is a RandomForestClassifier.
I have a solution which works but is not very elegant. This is an old post with no existing solutions so I suppose there are not any.
Create and fit your model. For example
model = GradientBoostingRegressor(**params)
model.fit(X_train, y_train)
Then you can add a 'feature_names' attribute, since you know the names at training time
model.feature_names = list(X_train.columns.values)
I typically then put the model into a binary file to pass it around but you can ignore this
import joblib

joblib.dump(model, filename)
loaded_model = joblib.load(filename)
Then you can get the feature names back from the model to use them when you predict
f_names = loaded_model.feature_names
loaded_model.predict(X_pred[f_names])
Based on the documentation and previous experience, there is no way to get a list of the features considered in at least one of the splits.
Is your concern that you do not want to use all your features for prediction, just the ones actually used for training? In this case I suggest listing the feature_importances_ after fitting and eliminating the features that do not seem relevant. Then train a new model with only the relevant features and use those same features for prediction as well.
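For example, a rough sketch of that idea (the toy DataFrame and the 0.01 importance cutoff are purely illustrative):
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy DataFrame so the snippet runs on its own.
X_arr, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(8)])

clf = RandomForestClassifier(random_state=0).fit(X_train, y)

# Pair each column name with its importance and keep the seemingly relevant ones.
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
relevant = importances[importances > 0.01].index.tolist()

# Retrain on those features only and use the same columns at prediction time.
clf_small = RandomForestClassifier(random_state=0).fit(X_train[relevant], y)
preds = clf_small.predict(X_train[relevant])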
You don't need to know which features were selected for the training. Just make sure to give, during the prediction step, to the fitted classifier the same features you used during the learning phase.
The Random Forest Classifier will only use the features on which it makes its splits. Those will be the same as those learnt during the first phase. Others won't be considered.
If the shape of your test data is not the same as the training data, it will throw an error, even if the test data contains all the features used for the splits of your decision trees.
What's more, since Random Forests make a random selection of features for each decision tree (called estimators in sklearn), all the features are likely to be used at least once.
However, if you want to know the features used, you can just call the attributes n_features_ (n_features_in_ in recent scikit-learn versions) and feature_importances_ on your classifier once fitted.
You can look here to see how you can retrieve the names of the most important features you used.
You can extract feature names from a trained XGBoost model as follows:
model.get_booster().feature_names
train_index, test_index = next(iter(ShuffleSplit(821, train_size=0.2, test_size=0.80, random_state=42)))
print train_index, len(train_index)
print test_index, len(test_index)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, train_size=0.33, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test, labels_test)
print pred, len(pred)
A few questions from this code:
Why do I need the cross_validation.train_test_split line in order to fit and predict with my classifier? (I am not doing any preprocessing on my data except for the stopword removal I have already done.)
Do the test and train indexes correspond to the classified & predicted labels? My goal is to get all my labels, in their original order, after fitting and predicting them. My features and labels used for training and testing are from a pandas dataframe (two columns), and I need the predicted labels, in order, so that I can feed them back into the pandas dataframe.
Is there a way to predict the labels for the whole set, and not just the test set?
tl;dr
Because your decision tree classifier has to be trained before it can predict anything. It's not a magic algorithm. It has to be shown examples of what to do before it can work out what to do on other things.
cross_validation.train_test_split() facilitates this by splitting your data into a test and a training dataset in such a way that you can analyse how well the model performed later on. Without this, you have no way of assessing how well your decision tree classifier actually performed.
You can create your own testing and training data without train_test_split() (and I suspect that was what you were trying to do with ShuffleSplit()), but you will need at least some training data.
test_index and train_index have nothing to do with the split you actually trained on. Full stop. They come from a separate, randomly generated process that is completely unrelated to what train_test_split() does.
The purpose of ShuffleSplit() is to give you indices with which to partition your data into training and test sets yourself. train_test_split() instead chooses its own indices and returns the partitioned data directly. Use one or the other, and use it sensibly.
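For reference, a minimal sketch of actually using ShuffleSplit indices yourself, written against the current sklearn.model_selection API with toy data:
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data so the snippet runs on its own.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# ShuffleSplit hands back indices; you do the partitioning yourself.
splitter = ShuffleSplit(n_splits=1, train_size=0.2, test_size=0.8, random_state=42)
train_index, test_index = next(splitter.split(X, y))

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]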
Yes. You can always just call
pred = clf.predict(features), or concatenate the two splits first, e.g. pred = clf.predict(np.concatenate([features_train, features_test]))
The Full Story
You need cross-validation if you want to do this right. The whole purpose of cross-validation is to avoid overfitting.
Basically, if you run your model on both the training and the testing data, then your model is going to perform really well on the training set (because, well, that's what you trained it on) and that's going to skew your overall metrics of how well your model will perform on real data.
It's a lot like asking a student to perform in an exam and then in real life: if you want to know whether your student learned from the process of preparing for an exam, you don't give him another exam, you ask him to demonstrate his skills in the real world dealing with unknown and complex data.
If you want to know if your model will be useful, then you want to cross-validate. Wikipedia puts it best:
In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset). The goal of cross validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.
cross_validation.train_test_split doesn't do anything except split the dataset into training and testing data for you.
But perhaps you don't care about metrics, and that's fine. The question then becomes: is it possible to run a decision tree classifier without a training dataset?
The answer is no. Decision tree classifiers are supervised algorithms: they need to be trained on data before they can generalise their model to new results. If you don't give it any data to train on, it will be unable to do anything with whatever data you later feed to predict.
Finally, while it is perfectly possible to get the labels for the whole set (see tl;dr), it is a really bad idea if you actually care about whether or not you're getting sensible results.
You already have the labels for the testing and training data. You don't need another column of predictions on the training data, because they'll either come out identical to the labels you already have or close enough to identical.
I can't think of a single meaningful reason to get back predicted results for your training data short of trying to optimise how it's performing on your training data. If that's what you are trying to do, then do that. What you are doing right now is definitely not that, and I encourage you to think strongly about what your reasons are for blindly inserting numbers into your table without due cause to believe they actually mean something.
There are ways to improve this: get back an accuracy metric, for example, or try k-fold cross-validation to model accuracy, or look at log-loss or AUC or any one of a number of metrics to gauge whether or not your model is performing well.
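For example, a quick sketch of such a metric-based check (the toy data, the DecisionTreeClassifier and the choice of metrics are illustrative only):
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification data so the snippet runs on its own.
X, y = make_classification(n_samples=400, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validated accuracy, AUC and log-loss estimates.
print(cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean())
print(cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean())
print(cross_val_score(clf, X, y, cv=5, scoring='neg_log_loss').mean())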
Using both ShuffleSplit and train_test_split is redundant. You do not even appear to be using the indices returned by ShuffleSplit.
An example of how to split your data using indices (similar to those ShuffleSplit returns) is below. X and y are np.array: X is number of instances by number of features, and y contains the label of each row.
train_inds, test_inds = train_test_split(range(len(y)),test_size=0.33, random_state=42)
X_train, y_train = X[train_inds], y[train_inds]
X_test , y_test = X[test_inds] , y[test_inds]
You should not test on your training data! But if you want to see what happens just do
pred = clf.predict(features_train)
Also you do not need to pass the labels to predict. You should be using
score = metrics.accuracy_score(y_test, pred)
I'm currently working on a research study comparing the performance of several classifiers. To evaluate those performances, I'm computing the accuracy, the area under the curve and the squared error for each classifier on all the datasets I have. Besides, I need to perform parameter tuning for some of the classifiers in order to select the best parameters in terms of accuracy, so a validation set is required (I chose 20% of the dataset).
I was told that, in order to make this comparison even more meaningful, the cross validation should be performed on the same sets for each classifier.
So basically, is there a way to use the cross_val_score method so that it always runs on the same folds for all the classifiers, or should I write some code from scratch that can do this job?
Thank you in advance.
cross_val_score accepts a cv parameter which represents the cross validation object you want to use. You probably want StratifiedKFold, which accepts a shuffle parameter, which specifies if you want to shuffle the data prior to running cross validation on it.
cv can also be an int, in which case a StratifiedKFold or KFold object will be created automatically with K = cv.
As you can tell from the documentation, shuffle is False by default, so by default cross-validation will already be performed on the same folds for all of your classifiers.
You can test it by running it twice on the same classifier to make sure (you should get the exact same results).
You can specify it yourself like this:
your_cv = StratifiedKFold(n_splits=10, shuffle=True)  # or shuffle=False
cross_val_score(your_estimator, your_X, y=your_y, cv=your_cv)
If you do shuffle, also fix random_state when creating the StratifiedKFold, so that every classifier sees exactly the same folds.
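As a quick sanity check along the lines suggested above, here is a self-contained sketch (toy data; LogisticRegression is just an illustrative estimator) showing that a fixed CV object yields identical folds, and hence identical scores, across runs:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores_first = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_second = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(np.allclose(scores_first, scores_second))  # True: the same folds are used both times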