I've been playing with the boosting functions in Sklearn and I've noticed a key difference between sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.AdaBoostRegressor. While the latter allows the user to specify the base learner, the former does not. Specifically, sklearn.ensemble.GradientBoostingRegressor only uses trees. This is a bit annoying, as it would be nice to use OLS and spline base learners within gradient boosting. Am I missing something? Does another function within the Sklearn library, or a different Python library, offer this functionality?
Here's one way of doing it.
Replace sklearn/ensemble/gradient_boosting.py with this script.
You'll then be able to pass base_estimator to GradientBoostingRegressor.
If that's all you need, enjoy. If not, please see below for a discussion.
Demonstration
Before we begin, I'd like to mention a few things. All of the plots in this post can be reproduced from the scripts available in my repo for this project. Simply replace the contents of restore.sh and run.sh with the appropriate directories for your local installation of Sklearn. Just keep in mind that the two bash scripts will permanently overwrite your existing Sklearn file (gradient_boosting.py); you can recover the file by simply copying it from the Sklearn repo. I don't claim that anything demonstrated in this post is good practice for developing new features for Sklearn: I'm not affiliated with Sklearn, nor am I an experienced software developer. All scripts were tested on Sklearn 0.18.2.
Now, using the new gradient_boosting.py script, we may specify a linear base learner (as you typically would for AdaBoostRegressor). As a sanity check, we can fit the linear-based GBR on a single variable and confirm that it produces a linear fit when plotted against that variable. Here's the result for four (arbitrary) variables from the Boston Housing dataset.
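For concreteness, here is roughly how the patched estimator is meant to be used. This is only a sketch: it assumes the modified GradientBoostingRegressor accepts a base_estimator argument (that is exactly the modification described here, not stock Sklearn), and the dataset choice is arbitrary.

from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor  # the patched version
from sklearn.linear_model import LinearRegression

X, y = load_boston(return_X_y=True)

# base_estimator is the new argument added by the patched gradient_boosting.py;
# it does not exist in the unmodified library.
gbr_linear = GradientBoostingRegressor(base_estimator=LinearRegression(),
                                       n_estimators=100, learning_rate=0.1)
gbr_linear.fit(X, y)
preds = gbr_linear.predict(X)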
As another sanity check, we may ensure that the results of the original tree-based GBR can be reproduced by passing a tree base_estimator to the new GBR. The following plot shows a 10-fold cross-validation MSE profile with respect to the number of boosting iterations. That is, for each number of boosting iterations, I cross-validate to obtain a vector of 10 MSE scores and compute the mean/min of this vector. The left plot is the original GBR, while the right plot uses the new gradient_boosting.py.
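For reference, the cross-validation profile can be generated along these lines. This is my own sketch, not the exact script from the repo; the iteration grid is an arbitrary choice.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = load_boston(return_X_y=True)

mean_mse, min_mse = [], []
n_grid = range(1, 101, 5)
for n in n_grid:
    gbr = GradientBoostingRegressor(n_estimators=n)  # or the patched GBR with a tree base_estimator
    # 10-fold CV gives a vector of 10 MSE scores for each number of iterations.
    scores = -cross_val_score(gbr, X, y, cv=10, scoring='neg_mean_squared_error')
    mean_mse.append(scores.mean())
    min_mse.append(scores.min())
# mean_mse and min_mse can then be plotted against n_grid.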
What exactly has been changed?
We can use any diffchecker to compare the original gradient_boosting.py to the new one. This will show you all the steps I took to create this new script, but the main step is to modify the _fit_stage() and _decision_function() methods. Within _fit_stage(), which is responsible for fitting the learner at each boosting iteration, we notice that the base learner (named tree) is instantiated with DecisionTreeRegressor, so we simply need to add the following conditions in order for this method to use the custom learner specified by the base_estimator argument instead:
if (self.base_estimator is None or
        isinstance(self.base_estimator,
                   (BaseDecisionTree, BaseForest))):
    # Original code for decision trees will go here.
else:
    # Fit the user-supplied base estimator on the current residuals.
    base_learner = self.base_estimator
    if X_csc is not None:
        base_learner.fit(X_csc, residual, sample_weight=sample_weight)
    else:
        base_learner.fit(X, residual, sample_weight=sample_weight)
    # Update the running prediction with this stage's shrunken contribution.
    if X_csr is not None:
        y_pred[:, k] += self.learning_rate * base_learner.predict(X_csr).ravel()
    else:
        y_pred[:, k] += self.learning_rate * base_learner.predict(X).ravel()
    # Store the fitted learner for this stage/output.
    self.estimators_[i, k] = base_learner
Next, we may inspect _decision_function(), which is responsible for computing the boosted result when the top-level predict() is called on GradientBoostingRegressor. Here, the heart of the calculation lies in a function called predict_stages(), a rather low-level Cython (.pyx) implementation of the boosting step intended for the tree-based ensemble only. In order to compute this step for an arbitrary base learner instead, we bypass the entire predict_stages() calculation and write it out manually at the Python level:
def _decision_function(self, X):
    score = self._init_decision_function(X)
    # Sum each stage's prediction, scaled by the learning rate, in pure Python
    # instead of calling the Cython predict_stages().
    score += self.learning_rate * sum(
        estimator[0].predict(X) for estimator in self.estimators_
    ).reshape((-1, 1))
    # predict_stages(self.estimators_, X, self.learning_rate, score)
    return score
The above two changes are what I consider to be the main changes that need to be made. The rest are minor technical details, and they include:
Showing an error message if feature importances are requested when using a base learner that doesn't include a feature importance method
Including the base estimator in parameter checking and estimator validation (refer to the original script for details on these, as I'm not entirely familiar with them)
Including the base estimator in the class headers and interfaces (e.g., adding it to the super() calls in the class definitions for BaseGradientBoosting and GradientBoostingRegressor)
What can't be done (yet)?
The base learner that you choose must support a sample_weight parameter in fit(). For this reason, I was unable to plug in a spline-like regressor (such as PyEarth). Please let me know if you or anyone else manages to achieve this.
Also note that when using non-tree base learners, you may still pass tree-related arguments without error, but of course, they will be entirely ignored.
Here are things my script can't (necessarily) do:
Work with other losses besides the default, 'ls'. (Perhaps it can, but I haven't tried.)
Allow base_estimator to be passed to GradientBoostingClassifier
Overall, this was only a modest beginning toward truly custom base learners. Hope it helps.
Context:
I was reading the Common Pitfalls documentation of scikit-learn. Since in my case I've always used random_state with an integer, I was surprised to read in the Controlling randomness section that the estimator should use an instance of np.random.RandomState in the initialization of the classifier, but not in the CV split; after reading the Robustness of cross-validation results section I thought 'OK, I get it', BUT there is a warning in the cloning section that says:
from sklearn import clone
from sklearn.ensemble import RandomForestClassifier
import numpy as np
rng = np.random.RandomState(0)
a = RandomForestClassifier(random_state=rng)
b = clone(a)
Since a RandomState instance was passed to a, a and b are not clones in the strict sense, but rather clones in the statistical sense: a and b will still be different models, even when calling fit(X, y) on the same data. Moreover, a and b will influence each-other since they share the same internal RNG: calling a.fit will consume b’s RNG, and calling b.fit will consume a’s RNG, since they are the same. This bit is true for any estimators that share a random_state parameter; it is not specific to clones.
If an integer were passed, a and b would be exact clones and they would not influence each other
Warning: Even though clone is rarely used in user code, it is called pervasively throughout the scikit-learn codebase: in particular, most meta-estimators that accept non-fitted estimators call clone internally (GridSearchCV, StackingClassifier, CalibratedClassifierCV, etc.).
Question
If I am in the development phase of a project and trying to identify which model(s) work best with, let's say, a GridSearchCV, how can I get the random_state value of my model so I can use it in production?
At first I thought, let's use random_state in the grid, but the Robustness of cross-validation results section writes:
Passing instances leads to more robust CV results, and makes the comparison between various algorithms fairer. It also helps limiting the temptation to treat the estimator’s RNG as a hyper-parameter that can be tuned.
The random_state parameter is really intended for deterministic reproducibility & is especially useful when developing more complex pipelines.
GridSearchCV is used to find the best settings for your learning procedure. I emphasize procedure, because the cross-validation aspect is there to get statistical results based on an estimate rather than an actual concrete model. Your RandomForest classifier & cross-validation, like many techniques in machine learning, rely on entropy/randomness to fairly approximate things. random_state should NOT be treated as a hyper-parameter, else you are metric-climbing on noise.
Once you know the optimal settings that statistically yield a good model beyond random chance, you want to re-apply the procedure to derive your production model. Note that the metric performance of this model will be bounded within (but not identical to) what grid-search estimated. Specifying random_state for the production model is bogus.
Here is a valid way to do things with/out specifying the random seeds:
# define your procedure as you wish. random_state is optional
# good for testing, irrelevant for predicting.
my_pipeline = RandomForestClassifier(random_state=42, ..)
# search parameter space & then refit another model with your
# procedure on the discovered parameters.
optimal = GridSearchCV(my_pipeline, params, refit=True, ...)
optimal.fit(train_X, train_y)
# get the new model trained with the best found parameters,
# rather than best performing model of the cross-validation!!
prod_model = optimal.best_estimator_
The warning stems from using a random number generator object rather than an integer seed.
When you use the generator, subsequent calls to it produce different seeds in the random sequence. This means that, since the clones share the same generator, the order of operations, the rate at which each gets invoked, etc. are implicitly coupled & thus produce different but statistically equivalent objects. Using integer seeds is safe.
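A small self-contained illustration of the point (my own sketch, not from the scikit-learn docs): with an integer seed, the original and the clone train to identical models; with a shared RandomState instance, the two fits draw from the same stream and end up as different but statistically equivalent models.

import numpy as np
from sklearn import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Integer seed: exact clones, identical fitted models.
a_int = RandomForestClassifier(random_state=42, n_estimators=20)
b_int = clone(a_int)
a_int.fit(X, y)
b_int.fit(X, y)
print(np.allclose(a_int.feature_importances_, b_int.feature_importances_))  # True

# Shared RandomState instance: "statistical" clones that consume the same RNG.
rng = np.random.RandomState(0)
a_rng = RandomForestClassifier(random_state=rng, n_estimators=20)
b_rng = clone(a_rng)
a_rng.fit(X, y)  # consumes part of the shared stream
b_rng.fit(X, y)  # continues from where a_rng left off, so the trees differ
print(np.allclose(a_rng.feature_importances_, b_rng.feature_importances_))  # typically False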
I want to make an accumulated SGD optimizer for tf.keras (not standalone Keras). I have found a couple of implementations of standalone Keras accumulated SGD optimizers, including this one on PyPI. Nevertheless, I am using a project which makes use of tf.keras, and as far as I have seen it's not a good idea to mix the two.
The problem is that the documentation for writing such a custom optimizer is not really straightforward. The base class (which I should inherit from) is OptimizerV2 in optimizer_v2.py, which contains some information about the task in its comment section.
The required methods that should be overridden are:
- resource_apply_dense (update variable given gradient tensor is dense)
- resource_apply_sparse (update variable given gradient tensor is sparse)
- create_slots (if your optimizer algorithm requires additional variables)
- get_config (serialization of the optimizer, include all hyper parameters)
Of these, only get_config() actually exists in the base class. resource_apply_dense is actually _resource_apply_dense, resource_apply_sparse is _resource_apply_sparse, and create_slots does not even exist in the base class; in subclasses such as SGD in gradient_descent.py, it exists as _create_slots.
Anyway, the documentation is apparently not up to date (there is also an issue about this on GitHub which pointed out the inconsistency with the documentation, but I don't remember the link), and this makes the whole procedure difficult. For example, in SGD I have to override the _resource_apply_dense() method, but I cannot understand where the gradients are calculated and where they are applied to the variables.
The actual code is given below:
def _resource_apply_dense(self, grad, var, apply_state=None):
    var_device, var_dtype = var.device, var.dtype.base_dtype
    coefficients = ((apply_state or {}).get((var_device, var_dtype))
                    or self._fallback_apply_state(var_device, var_dtype))
    if self._momentum:
        momentum_var = self.get_slot(var, "momentum")
        return training_ops.resource_apply_keras_momentum(
            var.handle,
            momentum_var.handle,
            coefficients["lr_t"],
            grad,
            coefficients["momentum"],
            use_locking=self._use_locking,
            use_nesterov=self.nesterov)
    else:
        return training_ops.resource_apply_gradient_descent(
            var.handle, coefficients["lr_t"], grad, use_locking=self._use_locking)
which obviously relies on training_ops.resource_apply_keras_momentum and training_ops.resource_apply_gradient_descent to do the actual job. How can I split, in the above code, the two parts mentioned in the minimize() method of OptimizerV2, namely _compute_gradients() and apply_gradients()?
There are a lot of confusing parts in these comments; for example, in the base class:
Many optimizer subclasses, such as Adam and Adagrad allocate and
manage additional variables associated with the variables to train.
These are called Slots. Slots have names and you can ask the
optimizer for the names of the slots that it uses.
although if I declare an Adam optimizer and ask for slot names I get an empty list (?).
optimizer = Adam(lr=1e-3)
optimizer.get_slot_names()
[]
Another confusing issue is the use of private methods: it is not clear when they are called or what their purpose is. For example, _prepare_local() is contained within SGD and includes the line:
apply_state[(var_device, var_dtype)]["momentum"] = array_ops.identity(self._get_hyper("momentum", var_dtype))
Anyway, the problem here is that I do not know exactly which approach to follow to create a custom tf.keras optimizer. The instructions included in the comments seem to contradict the actual implemented subclasses, and the latter also seem to delegate the dirty work to the actual C++ functions without it being clear how this is done or how (in my case) to separate the actions (like the gradient calculation and application). So, is there any advice someone can provide on how to proceed, and the steps to follow, to accomplish this (relatively) simple task?
I am using tf 1.15 by the way (so the links are from there).
Reference for an optimizer: DiffGrad (kind of Adam-like)
https://github.com/evanatyourservice/diffGrad-tf/blob/master/diffgrad.py
It is based on a paper called DiffGrad; they have good explanations and it is generally a good read.
First of all, good question; secondly, the TensorFlow documentation could be a lot better. Answers to the various questions, in no particular order:
Regarding the empty slot list for Adam: as far as I have seen, you have to run model.fit once on a model for the optimizer to initialize its slots. I remember reading about this while looking up saving and loading of optimizer states (check whether model.compile alone is enough).
As for _prepare_local, that line creates the momentum variable from the hyper-parameter you set on creation. I suppose it makes it accessible to all the weights the optimizer is trying to update; why they use identity is deep TensorFlow graph stuff.
The reason they use _prepare_local in general is to create variables that are common across all weights being updated, like decays, learning rates, time steps and such. For every iteration these variables are shared across all variables tracked in the optimizer's var_list.
Unlike _prepare_local above, slots are separate variables for each weight tracked by the optimizer, so you might have moments, history or a cumulative sum: anything to do with that specific individual weight.
Gradient compute and apply: if I understand this correctly, computing the gradients takes the loss, does backpropagation and automatic differentiation, and gets you the "gradients" for each weight. When you go to apply them is when the optimizer comes into play with its slots and variables; finally, the optimizer does the updating with the computed gradients as inputs.
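To tie these pieces together, here is a minimal sketch of what a custom tf.keras optimizer can look like under the OptimizerV2 API (TF 1.15 / early 2.x). This is not the asker's accumulated SGD; the class name AccumSGD and the "accum" slot are hypothetical. The point is to show that by the time _resource_apply_dense receives grad, _compute_gradients has already run, so the method only has to apply the update.

import tensorflow as tf

class AccumSGD(tf.keras.optimizers.Optimizer):
    def __init__(self, learning_rate=0.01, name="AccumSGD", **kwargs):
        super(AccumSGD, self).__init__(name, **kwargs)
        self._set_hyper("learning_rate", kwargs.get("lr", learning_rate))

    def _create_slots(self, var_list):
        # One accumulator variable per trainable weight.
        for var in var_list:
            self.add_slot(var, "accum")

    def _resource_apply_dense(self, grad, var, apply_state=None):
        # grad is already computed; this method only applies it to var.
        lr_t = self._decayed_lr(var.dtype.base_dtype)
        accum = self.get_slot(var, "accum")
        # Simplified: add the gradient to the accumulator, apply it, then reset.
        # A real accumulated SGD would only apply/reset every N steps.
        accum_t = accum.assign_add(grad)
        var_update = var.assign_sub(lr_t * accum_t)
        with tf.control_dependencies([var_update]):
            reset = accum.assign(tf.zeros_like(accum))
        return tf.group(var_update, reset)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        raise NotImplementedError("sparse gradients are not handled in this sketch")

    def get_config(self):
        config = super(AccumSGD, self).get_config()
        config.update({
            "learning_rate": self._serialize_hyperparameter("learning_rate"),
        })
        return config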
This might sound silly, but I'm just wondering about the possibility of modifying a neural network to obtain a probability density function rather than a single value when you are trying to predict a scalar. I know that when you are trying to classify images or words you can get a probability for each class, so I'm thinking there might be a way to do something similar with a continuous value and plot it (similar to the posterior plot in Bayesian optimisation).
Such details could be interesting when deploying a model for prediction and could provide more flexibility than a single value.
Does anyone know a way to obtain such an output?
Thanks!
OK, so I found a solution to this issue, though it adds a lot of overhead.
Initially I thought a Keras callback could be of use, since it provides the flexibility I wanted (i.e. run only on the test data, or only on a subset, and not for every test), but it seems that callbacks are only given summary data from the logs.
So the first step was to create a custom metric that does the same calculation as any metric using the two arrays (the true values and the predicted values) and, once those calculations are done, writes them out for later use; a sketch of this idea is given below.
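As an illustration of that first step, here is a sketch (my own code, not the actual implementation, and assuming tf.keras with tf.py_function available) of a "metric" whose real job is to record every batch's true and predicted values for later analysis. The function names and the in-memory list are hypothetical.

import numpy as np
import tensorflow as tf

collected = []  # (y_true, y_pred) pairs accumulated across batches

def _record(y_true, y_pred):
    # Runs eagerly through tf.py_function, so .numpy() is available.
    collected.append((y_true.numpy(), y_pred.numpy()))
    # Still return a scalar so Keras can log something sensible.
    return np.float32(np.mean(np.abs(y_true.numpy() - y_pred.numpy())))

def recording_mae(y_true, y_pred):
    # Keras treats this like any other metric; the recording is a side effect.
    return tf.py_function(_record, [y_true, y_pred], tf.float32)

# Hypothetical usage:
# model.compile(optimizer="adam", loss="mse", metrics=[recording_mae])
# After training/evaluation, `collected` (or a file dump of it, e.g. via
# np.save) can be fed to the error-measure methods described below.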
Then, once we have a way to gather all the data for every sample, the next step is to implement a method that gives a good measure of error. I'm currently implementing a handful of methods, but the most fitting one seems to be Bayesian bootstrapping (user lmc2179 has a great Python implementation). I also implemented ensemble methods and Gaussian processes as alternatives, or to use as other metrics, along with some other Bayesian methods.
I'll try to find out whether there are internals in Keras that are set during the training and testing phases, to see if I can set a trigger for my metric. The main issue with using all the data is that you obtain a lot of unreliable data points at the start, since the network is not yet optimized. Some data filtering could be useful to remove a good number of those points and improve the results of the error predictors.
I'll update if I find anything interesting.
I'm trying to write something similar to Google's wide and deep learning after running into difficulties doing multi-class classification (12 classes) with the sklearn API. I've tried to follow the advice in a couple of posts and used tf.group(logistic_regression_optimizer, deep_model_optimizer). It seems to work, but I'm trying to figure out how to get predictions out of this model. I'm hoping that with the tf.group operator the model is learning to weight the logistic and deep models differently, but I don't know how to get these weights out so I can get the right combination of the two models' predictions. Thanks in advance for any help.
https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/Cs0R75AGi8A
How to set layer-wise learning rate in Tensorflow?
tf.group() creates a node that forces a list of other nodes to run using control dependencies. It's really just a handy way to package up logic that says "run this set of nodes, and I don't care about their output". In the discussion you point to, it's just a convenient way to create a single train_op from a pair of training operators.
If you're interested in the value of a Tensor (e.g., weights), you should pass it to session.run() explicitly, either in the same call as the training step, or in a separate session.run() invocation. You can pass a list of values to session.run(), for example, your tf.group() expression, as well as a Tensor whose value you would like to compute.
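To make this concrete, here is a small self-contained sketch in TF 1.x graph style (toy stand-ins, matching the era of the linked discussion) of grouping two training ops and fetching the combined prediction and the weights in the same session.run() call. All the variable names here are hypothetical, not taken from the asker's model.

import numpy as np
import tensorflow as tf

# Toy stand-ins for the wide and deep parts.
x = tf.placeholder(tf.float32, shape=[None, 3])
y = tf.placeholder(tf.float32, shape=[None, 1])

w_wide = tf.Variable(tf.zeros([3, 1]), name="wide_weights")
w_deep = tf.Variable(tf.random_normal([3, 1]), name="deep_weights")

combined = tf.matmul(x, w_wide) + tf.matmul(x, w_deep)
loss = tf.reduce_mean(tf.square(combined - y))

wide_train_op = tf.train.FtrlOptimizer(0.1).minimize(loss, var_list=[w_wide])
deep_train_op = tf.train.AdagradOptimizer(0.1).minimize(loss, var_list=[w_deep])

# One node that runs both optimizers; it has no useful output of its own.
train_op = tf.group(wide_train_op, deep_train_op)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x = np.random.rand(8, 3).astype(np.float32)
    batch_y = np.random.rand(8, 1).astype(np.float32)
    # Fetch predictions and the weights in the same run() call as the train step.
    _, preds, wide_w, deep_w = sess.run(
        [train_op, combined, w_wide, w_deep],
        feed_dict={x: batch_x, y: batch_y})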
Hope that helps!
Is there a way to feed the coefficients into an SVC clf in my script, and then apply the clf.score() or clf.predict() function for further testing?
Currently I am using joblib.dump(clf, 'file.plk') to save all the information of a trained clf, but this involves disk writing/reading. It would be helpful if I could just define a clf from arrays representing the support vectors (clf.support_vectors_), weights (clf.coef_/clf.dual_coef_), and bias (clf.intercept_).
This line calls the prediction function from libsvm. It looks like this (but please take a look at the whole function _dense_predict):
libsvm.predict(
    X, self.support_, self.support_vectors_, self.n_support_,
    self.dual_coef_, self._intercept_,
    self.probA_, self.probB_, svm_type=svm_type, kernel=kernel,
    degree=self.degree, coef0=self.coef0, gamma=self._gamma,
    cache_size=self.cache_size)
You can use this call and give it all the relevant information directly, and you will obtain a raw prediction. In order to do this, you must import libsvm with from sklearn.svm import libsvm. If your initial fitted classifier is called svc, then you can obtain all the relevant information from it by replacing all the self keywords with svc and keeping the values. If svc._impl gives you "c_svc", then you set svm_type=0.
Note that at the beginning of the _dense_predict function you have X = self._compute_kernel(X). If your data is X, then you need to transform it by doing K = svc._compute_kernel(X), and call the libsvm.predict function with K as the first argument.
Scoring is independent from all this. Take a look at sklearn.metrics, where you will find e.g. the accuracy_score, which is the default score in SVM.
This is of course a somewhat suboptimal way of doing things, but in this specific case, if it is impossible (I didn't check very hard) to set the coefficients directly, then going into the code, seeing what it does, and extracting the relevant part is surely an option. A sketch of the whole procedure is given below.
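Putting the pieces together, a sketch along those lines could look like this. It assumes an older scikit-learn (around 0.18 to 0.21) where sklearn.svm.libsvm is importable, probability=False so that probA_/probB_ are empty arrays, and an 'rbf' kernel; these are my assumptions, not something stated above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, libsvm  # libsvm is importable in older versions

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
svc = SVC(kernel='rbf', gamma=0.5).fit(X, y)

# As suggested above: apply the kernel transformation first (a no-op for
# built-in string kernels, but required for callable/precomputed kernels).
K = svc._compute_kernel(np.ascontiguousarray(X, dtype=np.float64))

# Mirror the _dense_predict call, replacing self with svc.
# svm_type=0 because svc._impl == "c_svc".
raw = libsvm.predict(
    K, svc.support_, svc.support_vectors_, svc.n_support_,
    svc.dual_coef_, svc._intercept_,
    svc.probA_, svc.probB_, svm_type=0, kernel='rbf',
    degree=svc.degree, coef0=svc.coef0, gamma=svc._gamma,
    cache_size=svc.cache_size)

# The raw output indexes into svc.classes_; here the classes are 0/1,
# so it should line up with svc.predict(X) directly.
print(np.array_equal(raw, svc.predict(X)))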
Check out this blog post on memory usage of sklearn models using succinct tries to see if it is applicable.
If the other location does not have access to the sklearn packages, you would need to create your own score and predict functions. clf.score() and clf.predict() require clf to be an sklearn object.