Incrementally fitting sklearn RandomForestClassifier

Incrementally fitting sklearn RandomForestClassifier - python

I'm working with a environment that generates data at each iteration. I want to retain the model from previous iteration and add new data to the existing model.
I want to understand how model fit works. Will it fit the new data with the existing model or will it create a new model with the new data.
calling fit with the new data:
clf = RandomForestClassifier(n_estimators=100)
for i in customRange:
get_data()
clf.fit(new_train_data) #directly fitting new train data
clf.predict(new_test_data)
Or
Saving the history of train data and calling fit over all the historic data is the only solution
clf = RandomForestClassifier(n_estimators=100)
global_train_data = new dict()
for i in customRange:
get_data()
global_train_data.append(new_train_data) #Appending new train data
clf.fit(global_train_data) #Fitting on global train data
clf.predict(new_test_data)
My goal is to train model efficiently so i don't want to waste CPU time re-learning models.
I want to confirm the right approach and also want to know if that approach is consistent across all classifiers

Your second approach is "correct", in the sense that, as you have already guessed, it will fit a new classifier from scratch each time the data is appended; but arguably this is not what you are looking for.
What you are actually looking for is the argument warm_start; from the docs:
warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See the Glossary.
So, you should use your 1st approach, with the following modification:
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
This is not necessarily consistent across classifiers (some come with a partial_fit method instead) - see for example Is it possible to train a sklearn model (eg SVM) incrementally? for the SGDClasssifier; you should always check the relevant documentation.

Related

Computing TF-IDF on the whole dataset or only on training data?

In the chapter seven of this book "TensorFlow Machine Learning Cookbook" the author in pre-processing data uses fit_transform function of scikit-learn to get the tfidf features of text for training. The author gives all text data to the function before separating it into train and test. Is it a true action or we must separate data first and then perform fit_transform on train and transform on test?

According to the documentation of scikit-learn, fit() is used in order to
Learn vocabulary and idf from training set.
On the other hand, fit_transform() is used in order to
Learn vocabulary and idf, return term-document matrix.
while transform()
Transforms documents to document-term matrix.
On the training set you need to apply both fit() and transform() (or just fit_transform() that essentially joins both operations) however, on the testing set you only need to transform() the testing instances (i.e. the documents).
Remember that training sets are used for learning purposes (learning is achieved through fit()) while testing set is used in order to evaluate whether the trained model can generalise well to new unseen data points.
For more details you can refer to the article fit() vs transform() vs fit_transform()

Author gives all text data before separating train and test to
function. Is it a true action or we must separate data first then
perform tfidf fit_transform on train and transform on test?
I would consider this as already leaking some information about the test set into the training set.
I tend to always follow the rule that before any pre-processing first thing to do is to separate the data, create a hold-out set.

As we are talking about text data, we have to make sure that the model is trained only on the vocabulary of the training set as when we will deploy a model in real life, it will encounter words that it has never seen before so we have to do the validation on the test set keeping that in mind.
We have to make sure that the new words in the test set are not a part of the vocabulary of the model.
Hence we have to use fit_transform on the training data and transform on the test data.
If you think about doing cross validation, then you can use this logic across all the folds.

Scikit correct way to calibrate classifiers with CalibratedClassifierCV

Scikit has CalibratedClassifierCV, which allows us to calibrate our models on a particular X, y pair. It also states clearly that data for fitting the classifier and for calibrating it must be disjoint.
If they must be disjoint, is it legitimate to train the classifier with the following?
model = CalibratedClassifierCV(my_classifier)
model.fit(X_train, y_train)
I fear that by using the same training set I'm breaking the disjoint data rule. An alternative might be to have a validation set
my_classifier.fit(X_train, y_train)
model = CalibratedClassifierCV(my_classifier, cv='prefit')
model.fit(X_valid, y_valid)
Which has the disadvantage of leaving less data for training. Also, if CalibratedClassifierCV should only be fit on models fit on a different training set, why would it's default options be cv=3, which will also fit the base estimator? Does the cross validation handle the disjoint rule on its own?
Question: what is the correct way to use CalibratedClassifierCV?

I already answered this in CrossValidated to the exact same question. I'm leaving it here anyways since it is not clear for me whether this question belongs here or to CrossVal.
--- Original answer ---
There are two things mentioned in the CalibratedClassifierCV docs that hint towards the ways it can be used:
base_estimator: If cv=prefit, the classifier must have been fit already on data.
cv: If “prefit” is passed, it is assumed that base_estimator has been fitted already and all data is used for calibration.
I may obviously be interpreting this wrong, but it appears you can use the CCCV (short for CalibratedClassifierCV) in two ways:
Number one:
You train your model as usual, your_model.fit(X_train, y_train).
Then, you create your CCCV instance, your_cccv = CalibratedClassifierCV(your_model, cv='prefit'). Notice you set cv to flag that your model has already been fit.
Finally, you call your_cccv.fit(X_validation, y_validation). This validation data is used solely for calibration purposes.
Number two:
You have a new, untrained model.
Then you create your_cccv=CalibratedClassifierCV(your_untrained_model, cv=3). Notice cv is now the number of folds.
Finally, you call cccv_instance.fit(X, y). Because your model is untrained, X and y have to be used for both training and calibration. The way to ensure the data is 'disjoint' is cross validation: for any given fold, CCCV will split X and y into your training and calibration data, so they do not overlap.
TLDR: Method one allows you to control what is used for training and for calibration. Method two uses cross validation to try and make the most out of your data for both purposes.

Retrieve list of training features names from classifier

Is there a way to retrieve the list of feature names used for training of a classifier, once it has been trained with the fit method? I would like to get this information before applying to unseen data.
The data used for training is a pandas DataFrame and in my case, the classifier is a RandomForestClassifier.

I have a solution which works but is not very elegant. This is an old post with no existing solutions so I suppose there are not any.
Create and fit your model. For example
model = GradientBoostingRegressor(**params)
model.fit(X_train, y_train)
Then you can add an attribute which is the 'feature_names' since you know them at training time
model.feature_names = list(X_train.columns.values)
I typically then put the model into a binary file to pass it around but you can ignore this
joblib.dump(model, filename)
loaded_model = joblib.load(filename)
Then you can get the feature names back from the model to use them when you predict
f_names = loaded_model.feature_names
loaded_model.predict(X_pred[f_names])

Based on the documentation and previous experience, there is no way to get a list of the features considered at least at one of the splitting.
Is your concern that you do not want to use all your features for prediction, just the ones actually used for training? In this case I suggest to list the feature_importances_ after fitting and eliminate the features that does not seem relevant. Then train a new model with only the relevant features and use those features for prediction as well.

You don't need to know which features were selected for the training. Just make sure to give, during the prediction step, to the fitted classifier the same features you used during the learning phase.
The Random Forest Classifier will only use the features on which it makes its splits. Those will be the same as those learnt during the first phase. Others won't be considered.
If the shape of your test data is not the same as the training data it will throw an error, even if the test data contains all the features used for the splits of you decision trees.
What's more, since Random Forests make random selection of features for your decision trees (called estimators in sklearn) all the features are likely to be used at least once.
However, if you want to know the features used, you can just call the attributes n_features_ and feature_importances_ on your classifier once fitted.
You can look here to see how you can retrieve the names of the most important features you used.

You can extract feature names from a trained XGBOOST model as follows:
model.get_booster().feature_names

How to get ordered list of labels after fitting sklearn

train_index, test_index = next(iter(ShuffleSplit(821, train_size=0.2, test_size=0.80, random_state=42)))
print train_index, len(train_index)
print test_index, len(test_index)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, train_size=0.33, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test, labels_test)
print pred, len(pred)
A few questions from this code:
Why do I need the cross_validation.train_test_split line in order to fit and predict with my classifier? (I am not doing any preprocessing on my data except for stopword removal I have already done)
Do the test and train indexes correspond to the classified & predicted labels? My goal is to get all my labels, in their original order, after fitting and predicting them. My features and labels used for training and testing are from a pandas dataframe (two columns), and I need the predicted labels, in order, so that I can feed them back into the pandas dataframe.
Is there a way to predict the labels for the whole set, and not just the test set?

tl;dr
Because your decision tree classifier has to be trained before it can predict anything. It's not a magic algorithm. It has to be shown examples of what to do before it can work out what to do on other things.
cross_validation.test_train_split() facilitates this by splitting your data into a test and training dataset in such a way that you can analyse how well it performed later on. Without this, you have no way of assessing how well your decision tree classifier actually performed.
You can create your own testing and training data without test_train_split() (and I suspect that was what you were trying to do with ShuffleSplit()), but you will need at least some training data.
test_index and train_index have nothing to do with your data. Full stop. They come from a randomly generated process that is completely unrelated to what test_train_split() does.
The purpose of ShuffleSplit() is to give you the indices to partition your data into training and test yourself. test_train_split() will instead choose their own indices and partition based on those indices. You should either use one or the other and sensibly.
Yes. You can always just call
pred = clf.predict(features) or pred = clf.predict(features_test + features_train)
The Full Story
You need cross_validation if you want to do this right. The whole purpose of cross-validation is to avoid overfit.
Basically, if you run your model on both the training and the testing data, then your model is going to perform really well on the training set (because, well, that's what you trained it on) and that's going to skew your overall metrics of how well your model will perform on real data.
It's a lot like asking a student to perform in an exam and then in real life: if you want to know whether your student learned from the process of preparing for an exam, you don't give him another exam, you ask him to demonstrate his skills in the real world dealing with unknown and complex data.
If you want to know if your model will be useful, then you want to cross-validate. Wikipedia puts it best:
In a prediction problem, a model is usually given a dataset of known
data on which training is run (training dataset), and a dataset of
unknown data (or first seen data) against which the model is tested
(testing dataset).
The goal of cross validation is to define a
dataset to "test" the model in the training phase (i.e., the
validation dataset), in order to limit problems like overfitting, give
an insight on how the model will generalize to an independent dataset
(i.e., an unknown dataset, for instance from a real problem), etc.
cross_validation.train_test_split doesn't do anything except split the dataset into training and testing data for you.
But perhaps you don't care about metrics, and that's fine. The question then becomes: is it possible to run a decision tree classifier without a training dataset?
The answer is no. Decision tree classifiers are supervised algorithms: they need to be trained on data before they can generalise their model to new results. If you don't give them any data to train on, it will be unable to do anything with any data you feed it in predict.
Finally, while it is perfectly possible to get the labels for the whole set (see tl;dr) , it is a really bad idea if you actually care about whether or not you're getting sensible results.
You already have the labels for the testing and training data. You don't need another column that includes prediction on the testing data, because they'll either come out to be identical or close enough to identical.
I can't think of a single meaningful reason to get back predicted results for your training data short of trying to optimise how it's performing on your training data. If that's what you are trying to do, then do that. What you are doing right now is definitely not that, and I encourage you to think strongly about what your reasons are for blindly inserting numbers into your table without due cause to believe they actually mean something.
There are ways to improve this: get back an accuracy metric, for example, or try to do k-fold cross-validation to model accuracy, or look at log-loss or AUC or any one of number of metrics to gauge whether or not your model is performing well.

Using both ShuffleSplit and train_test_split is redundant. You do not even appear to be using the indices returned by ShuffleSplit.
An example of how to use the indices return by ShuffleSplit is below. X and y are np.array. X is number of instances by number of features. y contains the labels of each row.
train_inds, test_inds = train_test_split(range(len(y)),test_size=0.33, random_state=42)
X_train, y_train = X[train_inds], y[train_inds]
X_test , y_test = X[test_inds] , y[test_inds]
You should not test on your training data! But if you want to see what happens just do
pred = clf.predict(features_train)
Also you do not need to pass the labels to predict. You should be using
score = metrics.accuracy_score(y_test, pred)

Vectorizer where or how fit information is stored?

in text mining/classification when a vectorizer is used to transform a text into numerical features, in the training TfidfVectorizer(...).fit_transform(text) or TfidfVectorizer(...).fit(text) is used. In testing it supposes to utilize former training info and just transform the data following the training fit.
In general case the test run(s) is completely separate from train run. But it needs some info regarding the fit obtained during the training stage otherwise the transformation fails with error sklearn.utils.validation.NotFittedError: idf vector is not fitted . It's not just a dictionary, it's something else.
What should be saved after the training is done, to make the test stage passing smoothly?
In other words train and test are separated in time and space, how to make test working, utilizing training results?
Deeper question would be what 'fit' means in scikit-learn context, but it's probably out of scope

In the test phase you should use the same model names as you used in the trainings phase. In this way you will be able to use the model parameters which are derived in the training phase. Here is an example below;
First give a name to your vectorizer and to your predictive algoritym (It is NB in this case)
vectorizer = TfidVectorizer()
classifier = MultinomialNB()
Then, use these names to vectorize and predict your data
trainingdata_counts = vectorizer.fit_transform(trainingdata.values)
classifier.fit(trainingdata_counts, trainingdatalabels)
testdata_counts = vectorizer.transform(testdata.values)
predictions=classifier.predict(testdata_counts)
By this way, your code will be able to process training and the test phases continuously.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.