I am using Python to fit an XGBoost model incrementally (chunk by chunk). I came across a solution that uses xgboost.train, but I do not know what to do with the Booster object it returns. By contrast, XGBClassifier has methods like fit, predict, predict_proba, etc.
Here is what happens inside the for loop in which I read the data in little by little:
dtrain = xgb.DMatrix(X_train, label=y)
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
modelXG = xgb.train(param, dtrain, xgb_model='xgbmodel')  # continue training from the previously saved model
modelXG.save_model("xgbmodel")
XGBClassifier is a scikit-learn-compatible class which can be used in conjunction with other scikit-learn utilities.
Other than that, it is just a wrapper over xgb.train, so you don't need to supply lower-level objects like Booster yourself.
Just pass your data to fit(), predict(), etc., and internally it will be converted to the appropriate objects automatically.
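For instance, a minimal sketch of the sklearn-style workflow (assuming X_train, y_train and X_test are numpy arrays and the labels are binary):

import xgboost as xgb

# No DMatrix needed: plain arrays go straight into the sklearn-style API
clf = xgb.XGBClassifier(max_depth=2, learning_rate=1.0, objective='binary:logistic')
clf.fit(X_train, y_train)            # conversion to DMatrix happens internally
labels = clf.predict(X_test)         # predicted classes
probas = clf.predict_proba(X_test)   # class probabilities
booster = clf.get_booster()          # the underlying Booster, if you need it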
I'm not entirely sure what your question is. xgb.XGBClassifier.fit() calls xgb.train() under the hood, so it is just a matter of matching up the arguments of the relevant functions.
If you are interested how to implement the learning that you have in mind, then you can do
clf = xgb.XGBClassifier(**params)
clf.fit(X, y, xgb_model=your_model)
See the documentation here. On each iteration you will have to save the booster using something like clf.get_booster().save_model(xxx).
P.S. I hope you do the learning in mini-batches, i.e. chunks, and not literally line by line, i.e. example by example, as that would cause a performance drop due to writing/reading the model each time.
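Putting the pieces together, a minimal sketch of chunk-wise training (iter_chunks() is a hypothetical generator yielding (X_chunk, y_chunk) pairs; keeping the Booster in memory avoids the save/load round trip on every chunk):

import xgboost as xgb

params = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
booster = None

for X_chunk, y_chunk in iter_chunks():   # hypothetical chunk generator
    dtrain = xgb.DMatrix(X_chunk, label=y_chunk)
    # First chunk: xgb_model is None, so a fresh model is trained.
    # Later chunks: training continues from the in-memory Booster.
    booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=booster)

booster.save_model('xgbmodel')           # persist once at the end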
As per the documentation of the RandomizedSearchCV and GridSearchCV modules of sklearn, they support only the fit method of the classifier that is passed to them and do not support the partial_fit method, which can be used for training on an incremental basis. Currently, I am trying to use SGDClassifier, which can be trained on incremental data using the partial_fit method, and also to find the best set of hyper-parameters for it. I was just wondering why RandomizedSearchCV and GridSearchCV don't support partial_fit. I don't see any technical reason why this cannot be done (please correct me if I am wrong here). Any leads will be really appreciated.
Yeah, technically you could write a GridSearchCV for partial_fit as well, but when you think about
what is it that you are searching for?
what is it that you are optimizing for?
it becomes quite different from what we do with the .fit() approach. Here is my list of reasons for not having partial_fit in GridSearchCV/RandomizedSearchCV.
What is it that you are searching for?
When we optimize the hyper-parameters of a model for one batch of data, they could be sub-optimal for the final model (which is trained on the complete data using multiple partial_fits). The problem then becomes finding the best schedule for the hyper-parameters, i.e. the optimal value of each hyper-parameter at each batch/time step. One example of this is the decaying learning rate in neural networks, where we train the model using multiple partial_fits and the learning rate is not a single value but a series of values, one for each time step/batch.
Also you need to loop through the entire dataset multiple times (multiple epochs) to know the best scheduling of the hyper parameters. This needs a basic API change for GridSearchCV.
What is it that you are optimizing for?
There is a need to change the evaluation metric of the model now. The metric could be achieving the best performance at the end of all partial_fits, or reaching the sweet spot quickly (in fewer batches) for the usual metrics (precision, recall, f1-score, etc.), or some combination of the two. Hence, this also needs an API change for computing the single value that summarizes the performance of a model trained using multiple partial_fits.
I think this can be solved in a different way. I have encountered the problem that only partial_fit works (the data is too big to do full-batch learning via fit), so I think scikit-learn should have partial_fit support somewhere.
Instead of having partial_fit in GridSearchCV, you can write a simple wrapper (something like a PyTorch DataLoader) which turns a partial_fit model into a fit model, doing the batch splitting and shuffling inside the wrapper's fit. Then you can make GridSearchCV work, with the extra parameters to be tuned provided by the wrapper (batch_size and is_shuffle).
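A minimal sketch of such a wrapper (hypothetical names, assuming numpy arrays and an estimator with a partial_fit(X, y, classes=...) signature such as SGDClassifier):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class MiniBatchWrapper(BaseEstimator, ClassifierMixin):
    """Turn a partial_fit estimator into a plain fit estimator,
    so it can be tuned with GridSearchCV/RandomizedSearchCV."""

    def __init__(self, estimator, batch_size=1000, n_epochs=1, shuffle=True, random_state=0):
        self.estimator = estimator
        self.batch_size = batch_size
        self.n_epochs = n_epochs
        self.shuffle = shuffle
        self.random_state = random_state

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator)
        classes = np.unique(y)
        rng = np.random.RandomState(self.random_state)
        indices = np.arange(len(y))
        for _ in range(self.n_epochs):
            if self.shuffle:
                rng.shuffle(indices)
            for start in range(0, len(y), self.batch_size):
                batch = indices[start:start + self.batch_size]
                self.estimator_.partial_fit(X[batch], y[batch], classes=classes)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

It can then be dropped into a normal search, e.g. GridSearchCV(MiniBatchWrapper(SGDClassifier()), {'batch_size': [500, 2000], 'estimator__alpha': [1e-4, 1e-3]}, cv=3); the nested estimator__ parameters work because the wrapper follows the BaseEstimator conventions.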
Do you know whether models from scikit-learn automatically use multithreading, or only sequential instructions?
Thanks
No. All scikit-learn estimators will by default work on a single thread only.
But then again, it all depends on the algorithm and the problem. If the algorithm inherently requires processing the data sequentially, we cannot do anything. If the dataset is multi-class or multi-label and the algorithm works on a one-vs-rest basis, then yes, it can use multi-threading.
Look for a parameter n_jobs in the utilities or algorithms you want to use, and set it to -1 to enable multi-threading across all cores.
For example:
LogisticRegression, when working on a binary problem, will train only a single model, which requires the data sequentially, so using n_jobs has no effect here. But it handles multi-class problems as one-vs-rest (OvR), so it has to train that many estimators on the same data; in this case you can use n_jobs=-1.
DecisionTreeClassifier is inherently multi-class capable and doesn't need to train multiple models, so it doesn't have that parameter.
Ensemble methods like RandomForestClassifier will train multiple estimators (irrespective of problem type), each working on some part of the data, so here again we can make use of n_jobs.
Cross-validation utilities like cross_val_score or GridSearchCV again work on parts of the data or on individual parameter candidates independently of the other folds, so here too we can use the multi-threading capabilities.
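For instance, a small sketch of the last two cases (assuming X and y are already loaded):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# The forest builds its trees in parallel on all available cores
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# The CV folds are evaluated in parallel as well
scores = cross_val_score(rf, X, y, cv=5, n_jobs=-1)

# Parameter candidates x folds are evaluated in parallel
grid = GridSearchCV(rf, {'max_depth': [3, 5, None]}, cv=5, n_jobs=-1)
grid.fit(X, y)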
How are we supposed to use the dictionary output from lightgbm.cv to improve our predictions?
Here's an example - we train our cv model using the code below:
cv_mod = lgb.cv(params,
                d_train,
                500,
                nfold=10,
                early_stopping_rounds=25,
                stratified=True)
How can we use the parameters found from the best iteration of the above code to predict an output? In this case, cv_mod has no "predict" method like lightgbm.train, and the dictionary output from lightgbm.cv throws an error when used in lightgbm.train.predict(..., pred_parameters=cv_mod).
Am I missing an important transformation step?
In general, the purpose of CV is NOT to do hyperparameter optimisation. The purpose is to evaluate the performance of the model-building procedure.
A basic train/test split is conceptually identical to a 1-fold CV (with a custom size of the split, in contrast to the 1/K train size in k-fold CV). The advantage of doing more splits (i.e. k>1 CV) is to get more information about the estimate of the generalisation error: you get the error estimate plus its statistical uncertainty. There is an excellent discussion on CrossValidated (start with the links added to the question, which cover the same question formulated in a different way). It covers nested cross-validation and is absolutely not straightforward, but if you wrap your head around the concept in general, it will help you in various non-trivial situations. The idea to take away is: the purpose of CV is to evaluate the performance of the model-building procedure.
Keeping that idea in mind, how does one approach hyperparameter estimation in general (not only in LightGBM)?
You want to train a model with a set of parameters on some data and evaluate each variation of the model on an independent (validation) set. Then you intend to choose the best parameters by choosing the variant that gives the best evaluation metric of your choice.
This can be done with a simple train/test split. But evaluated performance, and thus the choice of the optimal model parameters, might be just a fluctuation on a particular split.
Thus, you can evaluate each of those models in a statistically more robust way by averaging the evaluation over several train/test splits, i.e. k-fold CV.
Then you can go a step further and keep an additional hold-out set that was separated before the hyperparameter optimisation started. This way you can evaluate the chosen best model on that set to measure the final generalisation error. You can go even one step further and, instead of a single test sample, use an outer CV loop, which brings us to nested cross-validation.
Technically, lightgbm.cv() only allows you to evaluate performance on a k-fold split with fixed model parameters. For hyper-parameter tuning you will need to run it in a loop, providing different parameters and recording the averaged performance, in order to choose the best parameter set after the loop is complete. This interface is different from sklearn, which provides you with complete functionality to do hyperparameter optimisation in a CV loop. Personally, I would recommend using the sklearn API of lightgbm. It is just a wrapper around the native lightgbm.train() functionality, so it is not slower, but it lets you use the full stack of sklearn tooling, which makes your life MUCH easier.
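As a minimal sketch of that sklearn-API route (assuming X, y are the training data and X_test is held out; the parameter grid here is purely illustrative):

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

param_grid = {'num_leaves': [15, 31, 63], 'learning_rate': [0.05, 0.1]}

search = GridSearchCV(lgb.LGBMClassifier(n_estimators=500),
                      param_grid, cv=10, scoring='neg_log_loss')
search.fit(X, y)

print(search.best_params_, search.best_score_)
best_model = search.best_estimator_        # refit on all of X, y by default
test_preds = best_model.predict(X_test)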
If you're happy with your CV results, you just use those parameters to call the lightgbm.train method. Like #pho said, CV is usually just for parameter tuning. You don't use the actual CV object for predictions.
You should use CV for parameter optimization.
If your model performs well on all folds use these parameters to train on the whole training set.
Then evaluate that model on the external test set.
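A minimal sketch of those last two steps with the native API (assuming best_params and best_num_rounds came out of the CV stage, d_train is the full training Dataset, and X_test/y_test form the external test set):

import lightgbm as lgb
from sklearn.metrics import accuracy_score

# Retrain on the whole training set with the parameters chosen via CV
final_model = lgb.train(best_params, d_train, num_boost_round=best_num_rounds)

# Evaluate exactly once on the held-out test set
test_probs = final_model.predict(X_test)              # probabilities for a binary objective
print(accuracy_score(y_test, (test_probs > 0.5).astype(int)))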
Is there a way to feed the coefficients into the clf of SVC in my script and then apply the clf.score() or clf.predict() functions for further testing?
Currently I am using joblib.dump(clf, 'file.plk') to save all the information of a trained clf, but this involves disk writing/reading. It would be helpful if I could just define a clf with a few arrays representing the support vectors (clf.support_vectors_), the weights (clf.coef_ / clf.dual_coef_), and the bias (clf.intercept_) respectively.
This line calls the prediction function from libsvm. It looks like this (but please take a look at the whole function _dense_predict):
libsvm.predict(
    X, self.support_, self.support_vectors_, self.n_support_,
    self.dual_coef_, self._intercept_,
    self.probA_, self.probB_, svm_type=svm_type, kernel=kernel,
    degree=self.degree, coef0=self.coef0, gamma=self._gamma,
    cache_size=self.cache_size)
You can use this line, give it all the relevant information directly, and you will obtain a raw prediction. In order to do this, you must import libsvm: from sklearn.svm import libsvm. If your initial fitted classifier is called svc, then you can obtain all the relevant information from it by replacing all the self keywords with svc and keeping the values. If svc._impl gives you "c_svc", then you set svm_type=0.
Note that at the beginning of the _dense_predict function you have X = self._compute_kernel(X). If your data is X, then you need to transform it by doing K = svc._compute_kernel(X), and call the libsvm.predict function with K as the first argument
Scoring is independent from all this. Take a look at sklearn.metrics, where you will find e.g. the accuracy_score, which is the default score in SVM.
This is of course a somewhat suboptimal way of doing things, but in this specific case, if it is impossible (I didn't check very hard) to set the coefficients directly, then going into the code, seeing what it does, and extracting the relevant part is certainly an option.
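To illustrate the arithmetic behind that prediction call, here is a minimal numpy sketch (assuming a binary SVC fitted with an RBF kernel) that recomputes the decision function from exactly the arrays the question mentions, without touching libsvm:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def manual_decision_function(X, support_vectors, dual_coef, intercept, gamma):
    # Kernel values between the new samples and the stored support vectors
    K = rbf_kernel(X, support_vectors, gamma=gamma)
    # f(x) = sum_i dual_coef_i * K(x, sv_i) + b
    return K @ dual_coef.ravel() + intercept[0]

# Hypothetical usage with a fitted binary classifier `svc`:
# scores = manual_decision_function(X_test, svc.support_vectors_,
#                                   svc.dual_coef_, svc.intercept_, svc._gamma)
# labels = svc.classes_[(scores > 0).astype(int)]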
Check out this blog post on memory usage of sklearn models using succinct tries to see if it is applicable.
If the other location does not have access to the sklearn packages, you would need to create your own score and predict functions. clf.score() and clf.predict() require clf to be an sklearn object.
I would like to use GridSearchCV to determine the parameters of a classifier, and using pipelines seems like a good option.
The application will be image classification using Bag-of-Words features, but the issue is that a different logical pipeline applies depending on whether training or test examples are used.
For each training set, KMeans must run to produce a vocabulary that will be used for testing, but for test data no KMeans process is run.
I cannot see how it is possible to specify this difference in behavior for a pipeline.
You probably need to derive from the KMeans class and override the following methods to use your vocabulary logic:
fit_transform will only be called on the train data
transform will be called on the test data
Maybe class derivation is not always the best option. You can also write your own transformer class that wraps calls to an embedded KMeans model and provides the fit / fit_transform / transform API that the Pipeline class expects from its first stages.
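A minimal sketch of such a transformer (hypothetical class and parameter names; it assumes each sample is an array of local descriptors, as is typical for Bag-of-Words image features):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class BagOfVisualWords(BaseEstimator, TransformerMixin):
    """Learn a visual vocabulary with KMeans on the training descriptors only,
    then encode every sample as a normalised histogram of word assignments."""

    def __init__(self, n_words=100, random_state=0):
        self.n_words = n_words
        self.random_state = random_state

    def fit(self, X, y=None):
        # X: list of (n_descriptors_i, n_features) arrays, one per image
        all_descriptors = np.vstack(X)
        self.kmeans_ = KMeans(n_clusters=self.n_words, random_state=self.random_state)
        self.kmeans_.fit(all_descriptors)
        return self

    def transform(self, X):
        histograms = np.zeros((len(X), self.n_words))
        for i, descriptors in enumerate(X):
            words = self.kmeans_.predict(descriptors)
            counts = np.bincount(words, minlength=self.n_words)
            histograms[i] = counts / max(counts.sum(), 1)
        return histograms

It then plugs into a normal pipeline, e.g. Pipeline([('bow', BagOfVisualWords(n_words=200)), ('clf', LinearSVC())]), and GridSearchCV can tune bow__n_words along with the classifier parameters; the vocabulary is only ever fitted on the training folds, while test folds just pass through transform.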