Understanding Partial Dependence for Gradient Boosted Regression trees

Understanding Partial Dependence for Gradient Boosted Regression trees - python

I am looking at the tutorial for partial dependence plots in Python. No equation is given in the tutorial or in the documentation. The documentation of the R function gives the formula I expected:
This does not seem to make sense with the results given in the Python tutorial. If it is an average of the prediction of house prices, then how is it negative and small? I would expect values in the millions. Am I missing something?
Update:
For regression it seems the average is subtracted off of the above formula. How would this be added back? For my trained model I can get the partial dependence by
from sklearn.ensemble.partial_dependence import partial_dependence
partial_dependence, independent_value = partial_dependence(model, features.index(independent_feature),X=df2[features])
I want to add (?) back on the average. Would I get this by just using model.predict() on the df2 values with the independent_feature values changed?

how the R formula works
The r formula presented in the question applies to a randomForest. Each tree in a random forest tries to predict the target variable directly. Thus, prediction of each tree lies in the expected interval (in your case, all house prices are positive), and prediction of the ensemble is just the average of all the individual predictions.
ensemble_prediction = mean(tree_predictions)
This is what the formula tells you: just take predictions of all the trees x and average them.
why the Python PDP values are small
In sklearn, however, partial dependence is calculated for a GradientBoostingRegressor. In gradient boosting, each tree predicts the derivative of the loss function at current prediction, which is only indirectly related to the target variable. For GB regression, prediction is given as
ensemble_prediction = initial_prediction + sum(tree_predictions * learning_rate)
and for GB classification predicted probability is
ensemble_prediction = softmax(initial_prediction + sum(tree_predictions * learning_rate))
For both cases, partial dependency is reported as just
sum(tree_predictions * learning_rate)
Thus, initial_prediction (for GradientBoostingRegressor(loss='ls') it equals just the mean of the training y) is not included into the PDP, which makes the predictions negative.
As for the small range of its values, the y_train in your example is small: mean hous value is roughly 2, so house prices are probably expressed in millions.
how the sklearn formula actually works
I have already said that in sklearn the value of partial dependence function is an average of all trees. There is one more tweak: all irrelevant features are averaged away. To describe the actual way of averaging, I will just quote the documentation of sklearn:
For each value of the ‘target’ features in the grid the partial
dependence function need to marginalize the predictions of a tree over
all possible values of the ‘complement’ features. In decision trees
this function can be evaluated efficiently without reference to the
training data. For each grid point a weighted tree traversal is
performed: if a split node involves a ‘target’ feature, the
corresponding left or right branch is followed, otherwise both
branches are followed, each branch is weighted by the fraction of
training samples that entered that branch. Finally, the partial
dependence is given by a weighted average of all visited leaves. For
tree ensembles the results of each individual tree are again averaged.
And if you are still not satisfied, see the source code.
an example
To see that the prediction is already on the scale of the dependent variable (but is just centered), you can look at a very toy example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import plot_partial_dependence
np.random.seed(1)
X = np.random.normal(size=[1000, 2])
# yes, I will try to fit a linear function!
y = X[:, 0] * 10 + 50 + np.random.normal(size=1000, scale=5)
# mean target is 50, range is from 20 to 80, that is +/- 30 standard deviations
model = GradientBoostingRegressor().fit(X, y)
fig, subplots = plot_partial_dependence(model, X, [0, 1], percentiles=(0.0, 1.0), n_cols=2)
subplots[0].scatter(X[:, 0], y - y.mean(), s=0.3)
subplots[1].scatter(X[:, 1], y - y.mean(), s=0.3)
plt.suptitle('Partial dependence plots and scatters of centered target')
plt.show()
You can see that partial dependence plots reflect the true distribution of the centered target variable pretty well.
If you want not only the units, but the mean to coincide with your y, you have to add the "lost" mean to the result of the partial_dependence function and then plot the results manually:
from sklearn.ensemble.partial_dependence import partial_dependence
pdp_y, [pdp_x] = partial_dependence(model, X=X, target_variables=[0], percentiles=(0.0, 1.0))
plt.scatter(X[:, 0], y, s=0.3)
plt.plot(pdp_x, pdp_y.ravel() + model.init_.mean)
plt.show()
plt.title('Partial dependence plot in the original coordinates');

You are looking at a Partial Dependence Plot. A PDP is a graph that represents
a set of variables/predictors and their effect on the target field (in this case price). Those graphs do not estimate actual prices.
It is important to realize that a PDP is not a representation of the dataset values or price. It is a representation of the variables effect on the target field. The negative numbers are logits of probabilities, not raw probabilities.

Related

Chi Squared Analysis on Data sets that don't have matching frequencies

I have 15 data sets each of which I have fitted with a curve. Now I am trying to determine the quality of fit by doing a chi-squared test; however, when I run my code:
chi, p_value = stats.chisquare(n, y)
where n is the actual data and y is the predicted data, I get the error
For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.1350785306607008
I can't seem to understand why they have to add up to the same total - are there any ways I can run a chi-squared test without muddling my data?

This chi-squared test for goodness of fit indeed requires the sums of both inputs to be (almost) the same. So, if you want to check whether your model fits the observations n well, you have to adjust the counts y of your model as described e.g. here. This could be done with a small wrapper:
from scipy.stats import chisquare
import numpy as np
def cs(n, y):
return chisquare(n, np.sum(n)/np.sum(y) * y)
Another possibility would be to go for R and use chisq.test.

Logistic Regression vs predicting probability by splitting data into bin

So I am exploring using a logistic regression model to predict the probability of a shot resulting in a goal. I have two predictors but for simplicity lets assume I have one predictor: distance from the goal. When doing some data exploration I decided to investigate the relationship between distance and the result of a goal. I did this graphical by splitting the data into equal size bins and then taking the mean of all the results (0 for a miss and 1 for a goal) within each bin. Then I plotted the average distance from goal for each bin vs the probability of scoring. I did this in python
#use the seaborn library to inspect the distribution of the shots by result (goal or no goal)
fig, axes = plt.subplots(1, 2,figsize=(11, 5))
#first we want to create bins to calc our probability
#pandas has a function qcut that evenly distibutes the data
#into n bins based on a desired column value
df['Goal']=df['Goal'].astype(int)
df['Distance_Bins'] = pd.qcut(df['Distance'],q=50)
#now we want to find the mean of the Goal column(our prob density) for each bin
#and the mean of the distance for each bin
dist_prob = df.groupby('Distance_Bins',as_index=False)['Goal'].mean()['Goal']
dist_mean = df.groupby('Distance_Bins',as_index=False)['Distance'].mean()['Distance']
dist_trend = sns.scatterplot(x=dist_mean,y=dist_prob,ax=axes[0])
dist_trend.set(xlabel="Avg. Distance of Bin",
ylabel="Probabilty of Goal",
title="Probability of Scoring Based on Distance")
Probability of Scoring Based on Distance
So my question is why would we go through the process of creating a logistic regression model when I could fit a curve to the plot in the image? Would that not provide a function that would predict a probability for a shot with distance x.
I guess the problem would be that we are reducing say 40,000 data point into 50 but I'm not entirely sure why this would be a problem for predict future shot. Could we increase the number of bins or would that just add variability? Is this a case of bias-variance trade off? Im just a little confused about why this would not be as good as a logistic model.

The binning method is a bit more finicky than the logistic regression since you need to try different types of plots to fit the curve (e.g. inverse relationship, log, square, etc.), while for logistic regression you only need to adjust the learning rate to see results.
If you are using one feature (your "Distance" predictor), I wouldn't see much difference between the binning method and the logistic regression. However, when you are using two or more features (I see "Distance" and "Angle" in the image you provided), how would you plan to combine the probabilities for each to make a final 0/1 classification? It can be tricky. For one, perhaps "Distance" is more useful a predictor than "Angle". However, logistic regression does that for you because it can adjust the weights.
Regarding your binning method, if you use fewer bins you might see more bias since the data may be more complicated than you think, but this is not that likely because your data looks quite simple at first glance. However, if you use more bins that would not significantly increase variance, assuming that you fit the curve without varying the order of the curve. If you change the order of the curve you fit, then yes, it will increase variance. However, your data seems like it is amenable to a very simple fit if you go with this method.

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built a XGBoostRegressor model using around 200 categorical features predicting a countinous time variable.
But I would want to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I both want and P(Y|X) as output. Any idea how to do this?

There is no probability in regression, In regression the only output you will get is a predicted value thats why it is called regression, so for any regressor probability of a prediction is not possible. Its only there in classification.

As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time based phenomenon. Specifically, you have the temperature (y) after (x) time (in sec for instance) inside an oven. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
You can probably try to get the distribution of your known outputs and compare the prediction on that curve, and check the pvalue. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of sample
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want to get a normed histogram (since this is PDF, if we integrate
# it must be equal to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, normed=True)
plt.hist(outputs, bins=n, normed=True)
x = x[:-1] + (x[ 1] - x[0])/2 # converting bin edges to centers
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()
# To check :
area = integrate.quad(f, x[0], x[-1])
print(area) # (should be close to 1)
Now, the interpolate method is not great for outliers. if a predicted data is extremely far (more than 3 times the std) from your distribution, it wont work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are some better ways to do it. If your data follow a normal law, it becomes trivial.

I suggest you to look into Ngboost (essentially a wrapper of Xgboost which provides eventually a probabilistic model.
Here you can find slides on the Ngboost functioning and the seminal Ngboost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default is the Gaussian distribution) and fit an Xgboost model to estimate the best parameters of the distribution (for the Gaussian $\mu$ and $\sigma$. The model will split the variables' space into different regions with different distributions, i.e. same family (eg. Gaussian) but different parameters.
After training the model, you're provided with the method '''pred_dist''' which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$

Generalised additive model - Python

I'm trying to fit a non linear model using Generalized Additive model. How do I determine the number of splines to use. Is there a specific way to choose the number of splines? I have used a 3rd order (cubic) spline fitting. Below is the code.
from pygam import LinearGAM
from pygam.utils import generate_X_grid
# Curve fitting using GAM model - Penalised spline curve.
def modeltrain(time,value):
return LinearGAM(n_splines=58,spline_order=3).gridsearch(time, value)
model=modeltrain(t1,x1)
# samples random x-values for prediction
XX = generate_X_grid(model)
#plots for vizualisation
plt.plot(XX, model.predict(XX), 'r--')
plt.plot(XX, model.prediction_intervals(XX,width=0.25), color='b', ls='-- ')
plt.scatter(t1, x1)
plt.show()
This is the expected result
Original data scatter plot
If the number of splines is not chosen correctly, then I get a incorrect fit.
Please, I would like a suggestion of methods to choose the number of splines accurately.

Typically for splines you choose a fairly high number of splines (~25) and you let the lambda smoothing parameter do the work of reducing the flexibility of the model.
For your use-case I would choose the default n_splines=25 and then do a gridsearch over the lambda parameter lam to find the best amount of smoothing:
def modeltrain(time,value):
return LinearGAM(n_splines=25,spline_order=3).gridsearch(time, value, lam=np.logspace(-3, 3, 11))
This will try 11 models from lam = 1e-3 to 1e3.
I think your choice of n_splines=58 is too high because it looks like it produces one spline per data-point.
If you really want to do a search over n_splines then you could do:
LinearGAM(n_splines=25,spline_order=3).gridsearch(time, value, n_splines=np.arange(50))
Note: the function generate_X_grid does NOT do random sampling for prediction, it actually just makes a dense linear-spacing of your X-values (time). The reason for this is to visualize how the learned model will interpolate.

sklearn LogisticRegression - plot displays too small coefficient

I am attempting to fit a logistic regression model to sklearn's iris dataset. I get a probability curve that looks like it is too flat, aka the coefficient is too small. I would expect a probability over ninety percent by sepal length > 7 :
Is this probability curve indeed wrong? If so, what might cause that in my code?
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import math
from sklearn.linear_model import LogisticRegression
data = datasets.load_iris()
#get relevent data
lengths = data.data[:100, :1]
is_setosa = data.target[:100]
#fit model
lgs = LogisticRegression()
lgs.fit(lengths, is_setosa)
m = lgs.coef_[0,0]
b = lgs.intercept_[0]
#generate values for curve overlay
lgs_curve = lambda x: 1/(1 + math.e**(-(m*x+b)))
x_values = np.linspace(2, 10, 100)
y_values = lgs_curve(x_values)
#plot it
plt.plot(x_values, y_values)
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")

If you refer to http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression, you will find a regularization parameter C that can be passed as argument while training the logistic regression model.
C : float, default: 1.0 Inverse of regularization strength; must be a
positive float. Like in support vector machines, smaller values
specify stronger regularization.
Now, if you try different values of this regularization parameter, you will find that larger values of C leads to fitting curves that has sharper transitions from 0 to 1 value of the output (response) binary variable, and still larger values fit models that have high variance (try to model the training data transition more closely, i think that's what you are expecting, then you may try to set C value as high as 10 and plot) but at the same time are likely to have the risk to overfit, while the default value C=1 and values smaller than that lead to high bias and are likely to underfit and here comes the famous bias-variance trade-off in machine learning.
You can always use techniques like cross-validation to choose the C value that is right for you. The following code / figure shows the probability curve fitted with models of different complexity (i.e., with different values of the regularization parameter C, from 1 to 10):
x_values = np.linspace(2, 10, 100)
x_test = np.reshape(x_values, (100,1))
C = list(range(1, 11))
labels = map(str, C)
for i in range(len(C)):
lgs = LogisticRegression(C = C[i]) # pass a value for the regularization parameter C
lgs.fit(lengths, is_setosa)
y_values = lgs.predict_proba(x_test)[:,1] # use this function to compute probability directly
plt.plot(x_values, y_values, label=labels[i])
plt.scatter(lengths, is_setosa, c='r', s=2)
plt.xlabel("Sepal Length")
plt.ylabel("Probability is Setosa")
plt.legend()
plt.show()
Predicted probs with models fitted with different values of C

Although you do not describe what you want to plot, I assume you want to plot the separating line. It seems that you are confused with respect to the Logistic/sigmoid function. The decision function of Logistic Regression is a line.

Your probability graph looks flat because you have, in a sense, "zoomed in" too much.
If you look at the middle of a sigmoid function, it get's to be almost linear, as the second derivative get's to be almost 0 (see for example a wolfram alpha graph)
Please note that the value's we are talking about are the results of -(m*x+b)
When we reduce the limits of your graph, say by using
x_values = np.linspace(4, 7, 100), we get something which looks like a line:
But on the other hand, if we go crazy with the limits, say by using x_values = np.linspace(-10, 20, 100), we get the clearer sigmoid:

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Understanding Partial Dependence for Gradient Boosted Regression trees - python

Related

Chi Squared Analysis on Data sets that don't have matching frequencies

Logistic Regression vs predicting probability by splitting data into bin

Is there a way to get the probability of a prediction using XGBoostRegressor?

Generalised additive model - Python

sklearn LogisticRegression - plot displays too small coefficient

Categories

Resources