xgboost.fit() vs. xgboost.train() - python

I am looking over the XGBoost docs, but I don't understand 1) whether there are any differences between using xgboost.fit() and xgboost.train(), and 2) whether there are any advantages/disadvantages to using one over the other?
The only difference I've identified so far is that you can specify more params with the train() function, but I'm not entirely convinced that you cannot specify those same params somewhere within the fit() function as well.

Taking most points from this related answer on the use of DMatrix in xgboost:
The XGBoost Python package allows choosing between two APIs. The Scikit-Learn API has objects XGBRegressor and XGBClassifier trained via calling fit.
XGBoost's own Learning API has xgboost.train.
You can probably specify most models with either of the two APIs.
It is mostly a matter of personal preference. The scikit-learn API makes it easy to use the tools available in scikit-learn (model selection, pipelines, etc.).
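To make the correspondence concrete, here is a minimal sketch of the same model expressed through both APIs; the dataset and parameter values are illustrative only, and eta/num_boost_round in the Learning API correspond to learning_rate/n_estimators in the wrapper.

import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Learning API: build the DMatrix explicitly and call xgb.train
dtrain = xgb.DMatrix(X, label=y)
params = {"max_depth": 3, "eta": 0.1, "objective": "binary:logistic"}
booster = xgb.train(params, dtrain, num_boost_round=100)

# Scikit-Learn API: the wrapper builds the DMatrix internally
clf = xgb.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100,
                        objective="binary:logistic")
clf.fit(X, y)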

Related

Converting H2O AutoML Model to Sklearn Model

I have an H2O AutoML generated GBM model using python. I wonder if we can convert this into a standard sklearn model so that I can fit it into my ecosystem of other sklearn models.
I can see the model properties as below when I print the model.
If direct conversion from H2O to sklearn is not feasible, is there a way we can use the above properties to recreate GBM in sklearn? These terminologies look slightly different from the standard sklearn GBM parameters.
Thanks in advance.
It will be tricky, since the packages are quite different: scikit-learn is based on Python/Cython/C, while H2O runs on Java. The underlying algorithms may also differ. However, you can try matching/translating your hyperparameters between the two, since they are similar.
Additionally, it would be a good idea to have an ecosystem that is library-agnostic, so that you can interchange different models.
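As a starting point, here is a rough, hand-rolled sketch of translating a few common H2O GBM hyperparameters to sklearn's GradientBoostingClassifier; the parameter values are made up, the mapping is approximate, and the two implementations will not produce identical models.

from sklearn.ensemble import GradientBoostingClassifier

h2o_params = {          # example values, as reported by the H2O model
    "ntrees": 50,
    "max_depth": 5,
    "learn_rate": 0.1,
    "sample_rate": 0.8,
    "min_rows": 10,
}

sklearn_gbm = GradientBoostingClassifier(
    n_estimators=h2o_params["ntrees"],
    max_depth=h2o_params["max_depth"],
    learning_rate=h2o_params["learn_rate"],
    subsample=h2o_params["sample_rate"],
    min_samples_leaf=int(h2o_params["min_rows"]),
)
# Re-fit on the original training data to recreate the model:
# sklearn_gbm.fit(X, y)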

Is it possible to use HyperDriveStep with time-series cross-validation?

I want to deploy a stacked model to Azure Machine Learning Service. The architecture of the solution consists of three models and one meta-model.
Data is a time-series data.
I'd like the model to automatically re-train based on some schedule. I'd also like to re-tune hyperparameters during each re-training.
AML Service offers HyperDriveStep class that can be used in the pipeline for automatic hyperparameter optimization.
Is it possible - and if so, how to do it - to use HyperDriveStep with time-series CV?
I checked the documentation, but haven't found a satisfying answer.
AzureML HyperDrive is a black-box optimizer, meaning that it will simply run your code with different parameter combinations based on the configuration you choose. It supports Random and Bayesian sampling and has different policies for early stopping (see here for the relevant docs and here for an example -- HyperDrive is towards the end of the notebook).
The only thing your model/script/training needs to adhere to is being launched from a script that takes --param style parameters. As long as that holds, you could optimize the parameters for each of your models individually and then tune the meta-model, or you could tune them all in one run. It mainly depends on the size of the parameter space and the amount of compute you want to use (or pay for).
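For illustration, here is a rough sketch of a HyperDrive-compatible training script that performs time-series CV internally; the parameter names, the metric name, and the load_my_timeseries() helper are assumptions for the example, not AzureML requirements.

# train.py
import argparse
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import GradientBoostingRegressor
from azureml.core import Run

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.1)
parser.add_argument("--max_depth", type=int, default=3)
args = parser.parse_args()

run = Run.get_context()
X, y = load_my_timeseries()  # hypothetical helper returning numpy arrays

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = GradientBoostingRegressor(learning_rate=args.learning_rate,
                                      max_depth=args.max_depth)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# HyperDrive compares parameter combinations using this logged metric
run.log("mean_cv_r2", float(np.mean(scores)))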

What's the purpose of the different kinds of TensorFlow SignatureDefs?

It seems like the Predict SignatureDef encompasses all the functionality of the Classification and Regression SignatureDefs. When would there be an advantage to using Classification or Regression SignatureDefs rather than just using Predict for everything? We're looking to keep complexity down in our production environment, and if it's possible to use just Predict SignatureDefs in all cases, that would seem like a good idea.
From what I can see in the documentation (https://www.tensorflow.org/serving/signature_defs), the "Classify" and "Regress" SignatureDefs enforce a simple and consistent interface for the simple cases (classification and regression): respectively, "inputs" -> "classes + scores" and "inputs" -> "outputs". There is also the added benefit that the "Classify" and "Regress" SignatureDefs don't require a serving function to be constructed as part of the model export.
Also from the docs, the Predict SignatureDef allows a more generic interface, with the benefit of being able to swap models in and out. From the docs:
Predict SignatureDefs enable portability across models. This means that you can swap in different SavedModels, possibly with different underlying Tensor names (e.g. instead of x:0 perhaps you have a new alternate model with a Tensor z:0), while your clients can stay online continuously querying the old and new versions of this model without client-side changes.
Predict SignatureDefs also allow you to add optional additional Tensors to the outputs, that you can explicitly query. Let's say that in addition to the output key below of scores, you also wanted to fetch a pooling layer for debugging or other purposes.
However, the docs don't explain, aside from the minor benefit of not having to export a serving function, why one wouldn't just use the Predict SignatureDef for everything, since it appears to be a superset with plenty of upside. I'd love to see a definitive answer on this, as the benefits of the specialized functions (Classify, Regress) seem quite minimal.
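To make the interface difference tangible, here is a minimal TF1-style sketch (assuming the tf.compat.v1.saved_model helpers and graph mode); the tensors are dummies and only serve to show what each SignatureDef constrains.

import tensorflow as tf
tf.compat.v1.disable_eager_execution()

serialized_examples = tf.compat.v1.placeholder(tf.string, name="examples")
classes = tf.constant([["a", "b"]], name="classes")   # must be a string tensor
scores = tf.constant([[0.9, 0.1]], name="scores")     # must be a float tensor

# Classify: fixed contract -- serialized tf.Examples in, classes/scores out
classify_sig = tf.compat.v1.saved_model.classification_signature_def(
    examples=serialized_examples, classes=classes, scores=scores)

# Regress: serialized tf.Examples in, a single float predictions tensor out
regress_sig = tf.compat.v1.saved_model.regression_signature_def(
    examples=serialized_examples, predictions=scores)

# Predict: arbitrary named input/output tensors, so extra outputs
# (e.g. a pooling layer) can be exposed and queried explicitly
predict_sig = tf.compat.v1.saved_model.predict_signature_def(
    inputs={"examples": serialized_examples},
    outputs={"classes": classes, "scores": scores})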
The differences I've seen so far are...
1) If you use tf.feature_column.indicator_column wrapping tf.feature_column.categorical_column_with_vocabulary_* in a DNNClassifier model, I've had problems when querying the TensorFlow server: the Predict API is sometimes unable to parse/map string inputs according to the vocabulary file/list. The Classify API, on the other hand, properly mapped strings to their index in the vocabulary (categorical_column), then to the one-hot/multi-hot encoding (indicator_column), and provided (what seems to be) the correct classification response to the query.
2) The response format: [[class, score], [class, score], ...] for the Classify API vs. [class[], score[]] for the Predict API. One or the other may be preferable depending on how you need to parse the data afterwards.
TL;DR: With indicator_column wrapped around categorical_column_with_vocabulary_*, I've experienced issues with the vocabulary mapping when serving with the Predict API, so I'm using the Classify API.

xgboost.train versus XGBClassifier

I am using python to fit an xgboost model incrementally (chunk by chunk). I came across a solution that uses xgboost.train but I do not know what to do with the Booster object that it returns. For instance, the XGBClassifier has options like fit, predict, predict_proba etc.
Here is what happens inside the for loop in which I read in the data little by little:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y)
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
modelXG = xgb.train(param, dtrain, xgb_model='xgbmodel')  # continue training from the saved model
modelXG.save_model("xgbmodel")
XGBClassifier is a scikit-learn compatible class which can be used in conjunction with other scikit-learn utilities.
Other than that, it's just a wrapper over xgb.train, in which you don't need to supply advanced objects like Booster etc.
Just send your data to fit(), predict(), etc. and internally it will be converted to the appropriate objects automatically.
I'm not entirely sure what your question was. xgb.XGBClassifier.fit() under the hood calls xgb.train(), so it is a matter of matching up the arguments of the relevant functions.
If you are interested how to implement the learning that you have in mind, then you can do
clf = xgb.XGBClassifier(**params)
clf.fit(X, y, xgb_model=your_model)
See the documentation here. On each iteration you will have to save the booster using something like clf.get_booster().save_model(xxx).
PS: I hope you do the learning in mini-batches, i.e. chunks, and not literally line by line (example by example), as that would result in a performance drop due to writing/reading the model each time.
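Putting the pieces together, here is a rough sketch of chunk-wise continued training with the sklearn wrapper; the chunks iterable, the parameter values and the file name are placeholders, not part of the original question.

import xgboost as xgb

model_path = "xgbmodel.json"
booster = None  # no previous model on the first chunk
params = {"max_depth": 2, "learning_rate": 1.0, "objective": "binary:logistic"}

for X_chunk, y_chunk in chunks:  # `chunks` yields (features, labels) arrays
    clf = xgb.XGBClassifier(**params)
    clf.fit(X_chunk, y_chunk, xgb_model=booster)  # continue from the previous booster
    booster = clf.get_booster()
    booster.save_model(model_path)  # optional: persist between runs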

Scikit-learn model parameters unavailable? If so what ML workbench alternative?

I am doing machine learning using scikit-learn as recommended in this question. To my surprise, it does not appear to provide access to the actual models it trains. For example, if I create an SVM, linear classifier or even a decision tree, it doesn't seem to provide a way for me to see the parameters selected for the actual trained model.
Seeing the actual model is useful if the model is being created partly to get a clearer picture of what features it is using (e.g., decision trees). Seeing the model is also a significant issue if one wants to use Python to train the model and some other code to actually implement it.
Am I missing something in scikit-learn, or is there some way to get at this in scikit-learn? If not, what is a good free machine learning workbench, not necessarily in Python, in which models are transparently available?
The fitted model parameters are stored directly as attributes on the model instance. There is a specific naming convention for those fitted parameters: they all end with a trailing underscore, as opposed to user-provided constructor parameters (a.k.a. hyperparameters), which don't.
The type of the fitted attributes is algorithm-dependent. For instance, for a kernel Support Vector Machine you will have the arrays of support vectors, dual coefficients and intercepts, while for random forests and extremely randomized trees you will have a collection of binary trees (internally represented in memory as contiguous numpy arrays for performance reasons: a structure-of-arrays representation).
See the Attributes section of the docstring of each model for more details, for instance for SVC:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
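As a small illustration of the trailing-underscore convention (constructor hyperparameters vs. fitted attributes), here is a sketch using SVC on the iris dataset; the dataset and values are just for demonstration.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="linear", C=1.0)   # C, kernel: user-provided hyperparameters
clf.fit(X, y)

print(clf.support_vectors_.shape)   # fitted support vectors
print(clf.dual_coef_.shape)         # fitted dual coefficients
print(clf.intercept_)               # fitted intercepts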
For tree-based models you also have a helper function to export the learned trees to Graphviz format:
http://scikit-learn.org/stable/modules/tree.html#classification
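For instance, a short sketch of exporting a fitted decision tree to Graphviz "dot" format (the output file name is arbitrary):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
export_graphviz(tree, out_file="tree.dot")  # render with: dot -Tpng tree.dot -o tree.png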
To find the importance of features in forest models, you should also have a look at the compute_importances parameter; see the following examples:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#example-ensemble-plot-forest-importances-faces-py
