Logistic Regression vs predicting probability by splitting data into bin

So I am exploring using a logistic regression model to predict the probability of a shot resulting in a goal. I have two predictors, but for simplicity let's assume I have one: distance from the goal. While exploring the data I decided to investigate the relationship between distance and whether the shot resulted in a goal. I did this graphically by splitting the data into equal-sized bins and then taking the mean of all the results (0 for a miss and 1 for a goal) within each bin. Then I plotted the average distance from goal for each bin against the probability of scoring. I did this in Python:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# df is assumed to be a DataFrame with 'Distance' and 'Goal' columns
fig, axes = plt.subplots(1, 2, figsize=(11, 5))
# first we want to create bins to calculate our probability
# pandas has a function qcut that evenly distributes the data
# into n bins based on a desired column value
df['Goal'] = df['Goal'].astype(int)
df['Distance_Bins'] = pd.qcut(df['Distance'], q=50)
# now we want to find the mean of the Goal column (our estimated probability) for each bin
# and the mean of the distance for each bin
dist_prob = df.groupby('Distance_Bins', as_index=False)['Goal'].mean()['Goal']
dist_mean = df.groupby('Distance_Bins', as_index=False)['Distance'].mean()['Distance']
# use seaborn to plot each bin's goal rate against its mean distance
dist_trend = sns.scatterplot(x=dist_mean, y=dist_prob, ax=axes[0])
dist_trend.set(xlabel="Avg. Distance of Bin",
               ylabel="Probability of Goal",
               title="Probability of Scoring Based on Distance")
[Figure: Probability of Scoring Based on Distance]
So my question is: why would we go through the process of creating a logistic regression model when I could fit a curve to the plot in the image? Would that not provide a function that predicts a probability for a shot at distance x?
I guess the problem would be that we are reducing, say, 40,000 data points to 50, but I'm not entirely sure why this would be a problem for predicting future shots. Could we increase the number of bins, or would that just add variability? Is this a case of the bias-variance trade-off? I'm just a little confused about why this would not be as good as a logistic model.

The binning method is a bit more finicky than logistic regression, since you need to try different functional forms to fit the curve (e.g. inverse, log, square, etc.), while for logistic regression you only need to adjust the learning rate to see results.
If you are using one feature (your "Distance" predictor), I wouldn't expect much difference between the binning method and logistic regression. However, when you are using two or more features (I see "Distance" and "Angle" in the image you provided), how would you combine the probabilities from each to make a final 0/1 classification? It can be tricky. For one thing, "Distance" may be a more useful predictor than "Angle". Logistic regression handles that for you because it adjusts the weights.
Regarding your binning method, if you use fewer bins you might see more bias, since the data may be more complicated than you think, but that is not very likely because your data looks quite simple at first glance. Using more bins would not significantly increase variance, assuming that you keep the order of the fitted curve fixed. If you increase the order of the curve you fit, then yes, variance will increase. However, your data seems amenable to a very simple fit if you go with this method.
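To make the comparison concrete, here is a minimal sketch on synthetic data (the distance range, the true coefficients 1.5 and -0.15, and all function names are assumptions for illustration, not the asker's data). It estimates the goal probability both ways: equal-count bins as in the question, and a one-feature logistic regression fitted with Newton's method in plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic shots: the scoring chance decays with distance (assumed relationship).
n = 40_000
distance = rng.uniform(1.0, 40.0, size=n)
true_p = 1.0 / (1.0 + np.exp(-(1.5 - 0.15 * distance)))
goal = (rng.uniform(size=n) < true_p).astype(float)


def fit_logistic(x, y, n_iter=25):
    """Fit p = sigmoid(b0 + b1 * x) by Newton's method."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def binned_probability(x, y, n_bins=50):
    """Equal-count bins: mean distance and goal rate per bin."""
    order = np.argsort(x)
    x_bins = np.array_split(x[order], n_bins)
    y_bins = np.array_split(y[order], n_bins)
    centers = np.array([b.mean() for b in x_bins])
    rates = np.array([b.mean() for b in y_bins])
    return centers, rates


beta = fit_logistic(distance, goal)
centers, rates = binned_probability(distance, goal)
logistic_at_centers = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * centers)))

print("fitted coefficients:", beta)
print("max |binned - logistic|:", np.abs(rates - logistic_at_centers).max())
```

With one feature the two estimates track each other closely, which matches the point above. The logistic fit uses all 40,000 rows directly and extends to a second feature by adding a column to X, whereas the binned estimate still needs a curve fitted through its 50 points.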

Related

How to evaluate K-Means Clustering since automatic indexes of clusters don't match true labels?

How do we measure the accuracy of a K-Means clustering algorithm (say, generate a confusion matrix) since the automatic indexes of cluster is probably a permutation of the original labels?

I'm not entirely sure what you mean either. Your original labels are presumably the ground-truth labeling. The clustering result provided by k-means is usually one integer per sample, ranging over the k clusters you asked the k-means algorithm to give you.
I typically use the pandas.crosstab function to cross-tabulate the ground-truth labeling against the k-means labeling.
For better visualization, you may want to use the following:
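import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# y_true and kmeans_labels are placeholder names for your ground-truth and cluster labels
crosstab_groundtruth_kmeans = pd.crosstab(y_true, kmeans_labels)
plt.figure(figsize=(30, 10))
# plot the cross-tabulation as a heatmap
ax = sns.heatmap(crosstab_groundtruth_kmeans.T,
                 square=True, annot=True, fmt='.2f')
ax.set_yticklabels(
    ax.get_yticklabels(),
    rotation=0)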
Good luck!~

k-means is a clustering (grouping) algorithm, not a classifier, so accuracy is not directly defined for it. The core idea of k-means is to find clusters of data points that minimize the within-cluster sum of squared distances (which amounts to maximizing the between-cluster separation). It has no concept of labels, so you can't get an accuracy matrix out of it directly. More insights: https://scikit-learn.org/stable/modules/clustering.html#k-means
The "accuracy" (assuming you want to see which cluster contains which data points) has to be analyzed manually using the predict method of sklearn.cluster.KMeans. It basically "Predicts the closest cluster each sample in X belongs to." (from the documentation)
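If you do have ground-truth labels and only the cluster numbering is arbitrary, one possible sketch is to cross-tabulate and then pick the cluster-to-label permutation with the most agreement. The arrays below are made-up toy data and the helper names are hypothetical. Trying every permutation is only practical for small k; for larger k, scipy.optimize.linear_sum_assignment solves the same matching problem efficiently.

```python
from itertools import permutations

import numpy as np


def contingency(y_true, y_cluster, k):
    """Rows: true labels 0..k-1, columns: cluster indices 0..k-1."""
    table = np.zeros((k, k), dtype=int)
    for t, c in zip(y_true, y_cluster):
        table[t, c] += 1
    return table


def best_relabeling(table):
    """Try every label-to-cluster matching and keep the one with most agreement."""
    k = table.shape[0]
    best_perm, best_hits = None, -1
    for perm in permutations(range(k)):
        # perm[label] is the cluster index matched to that true label
        hits = sum(int(table[label, perm[label]]) for label in range(k))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return best_perm, best_hits


# Toy data: cluster indices are a permutation of the labels, with one mistake.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_cluster = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1, 1])

table = contingency(y_true, y_cluster, k=3)
perm, hits = best_relabeling(table)
confusion = table[:, list(perm)]  # columns reordered to line up with the true labels
accuracy = hits / len(y_true)

print(confusion)
print("accuracy after matching:", accuracy)
```

After the reordering, the diagonal of the matrix counts the points whose cluster agrees with its matched label, so it can be read like an ordinary confusion matrix.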

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would like to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the prediction and P(Y|X) as output. Any idea how to do this?

There is no probability in regression. In regression the only output you get is a predicted value, which is why it is called regression, so for any regressor a probability of a prediction is not available. That only exists in classification.

As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time based phenomenon. Specifically, you have the temperature (y) after (x) time (in sec for instance) inside an oven. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
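As a rough sketch of the interval idea under that equal-variance assumption: hold out some data, look at the absolute residuals of the fitted model on it, and use one of their quantiles as the half-width of the interval. Strictly speaking this is a residual-quantile prediction interval (in the spirit of split conformal prediction), not something XGBoost itself provides. The straight-line fit below is only a stand-in for any regressor, and the data, the 90% level and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Synthetic data with the same noise level everywhere (homoscedastic)."""
    x = rng.uniform(0.0, 10.0, size=n)
    y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=n)
    return x, y


x_train, y_train = make_data(500)
x_cal, y_cal = make_data(500)      # held-out data used only to size the interval
x_test, y_test = make_data(2000)

# Stand-in regressor: a straight line fitted by least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)


def predict(x):
    return slope * x + intercept


# Half-width of the interval: 90th percentile of the held-out absolute residuals.
half_width = np.quantile(np.abs(y_cal - predict(x_cal)), 0.90)

lower = predict(x_test) - half_width
upper = predict(x_test) + half_width
coverage = np.mean((y_test >= lower) & (y_test <= upper))

print(f"half-width: {half_width:.2f}, empirical coverage: {coverage:.3f}")
```

If the noise level changes with the input (heteroscedasticity), a single half-width like this will be too wide in some regions and too narrow in others, which is exactly the caveat raised above.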
You can probably try to get the distribution of your known outputs, place the prediction on that curve, and check the p-value. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of samples
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want a normalised histogram (since this is a PDF, it must
# integrate to 1)
nbins = N / 10
n = int(N / nbins)
# density=True normalises the histogram so that it integrates to 1
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0])/2 # converting bin edges to centers
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()
# To check :
area, abs_error = integrate.quad(f, x[0], x[-1]) # quad returns (integral, error estimate)
print(area) # (should be close to 1)
Now, the interpolation method is not great for outliers. If a predicted value is extremely far (more than 3 times the std) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are better ways to do it. If your data follows a normal distribution, it becomes trivial.

I suggest you look into NGBoost (natural gradient boosting), a boosting method that ultimately provides a probabilistic model.
See the slides on how NGBoost works and the original NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (Gaussian by default) and fit a boosted model that estimates the best parameters of that distribution (for the Gaussian, $\mu$ and $\sigma$). The model splits the feature space into regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you have the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.

Convert independent sklearn GaussianMixture log probability scores to probabilities summing to 1

I have labeled 2D data. There are 4 labels in the set, and I know the correspondence of every point to its label. I'd like to, given a new arbitrary data point, find the probability that it has each of the 4 labels. It must belong to one and only one of the labels, so the probabilities should sum to 1.
What I've done so far is to train 4 independent sklearn GMMs (sklearn.mixture.GaussianMixture) on the data points associated with each label. It should be noted that I do not wish to train a single GMM with 4 components because I already know the labels, and don't want to re-cluster in a way that is worse than my known labels. (It would appear that there is a way to provide Y= labels to the fit() function, but I can't seem to get it to work).
In the above plot, points are colored by their known labels, and the contours represent the four independent GMMs fitted to these 4 sets of points.
For a new point, I attempted to compute the probability of its label in a couple ways:
GaussianMixture.predict_proba(): Since each independent GMM has only one distribution, this simply returns a probability of 1 for all models.
GaussianMixture.score_samples(): According to the documentation, this one returns the "weighted log probabilities for each sample". My procedure is, for a single new point, to call this function once on each of the four independently trained GMMs representing the distributions above. I do get semi-sensible results here: typically a positive number for the correct model and negative numbers for each of the three incorrect models, with more muddled results for points near intersecting distribution boundaries. Here's a typical clear-cut result:
2.904136, -60.881554, -20.824841, -30.658509
This point is actually associated with the first label and is least likely to be the second label (is farthest from the second distribution). My issue is how to convert the above scores into probabilities that sum to 1 and accurately represent the chance that the given point belongs to one and only one of the four distributions? Given that these are 4 independent models, is this possible? If not, is there another method I have overlooked that could allow me to train GMM(s) based on known labels and will provide probabilities that sum to 1?

In general, if you don't know how the scores are calculated but you know that there is a monotonic relationship between the scores and the probability, you can simply use the softmax function to approximate a probability, with an optional temperature variable that controls the spikiness of the distribution.
Let V be your scores as a NumPy array (e.g. np.asarray of the four values) and tau be the temperature. Then,
p = np.exp(V/tau) / np.sum(np.exp(V/tau))
is your answer.
PS: Luckily, we know how sklearn GMM scoring works: the scores are log-likelihoods, so softmax with tau=1 is your exact answer, provided the four labels are equally likely a priori (otherwise add the log of each label's prior to its score first).
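A minimal sketch of that conversion, using the four scores quoted in the question. The function name and the optional priors argument are my own additions, not part of sklearn. Subtracting the maximum before exponentiating leaves the result unchanged but avoids underflow when all scores are very negative.

```python
import numpy as np


def scores_to_probabilities(log_likelihoods, priors=None):
    """Softmax (tau = 1) over per-model log-likelihoods, optionally weighted by class priors."""
    log_post = np.asarray(log_likelihoods, dtype=float)
    if priors is not None:
        log_post = log_post + np.log(np.asarray(priors, dtype=float))
    log_post = log_post - log_post.max()  # numerical stabilisation only
    unnormalised = np.exp(log_post)
    return unnormalised / unnormalised.sum()


scores = [2.904136, -60.881554, -20.824841, -30.658509]
p = scores_to_probabilities(scores)
print(p, p.sum())

# Very negative scores would give 0/0 with the naive formula, but work here.
q = scores_to_probabilities([-1000.0, -1001.0])
print(q)
```

If the labels are not equally frequent, passing the label frequencies as priors applies Bayes' rule with unequal class weights.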

Understanding Partial Dependence for Gradient Boosted Regression trees

I am looking at the tutorial for partial dependence plots in Python. No equation is given in the tutorial or in the documentation. The documentation of the R function gives the formula I expected:
This does not seem to make sense with the results given in the Python tutorial. If it is an average of the prediction of house prices, then how is it negative and small? I would expect values in the millions. Am I missing something?
Update:
For regression it seems the average is subtracted off of the above formula. How would this be added back? For my trained model I can get the partial dependence with
from sklearn.ensemble.partial_dependence import partial_dependence
pdp_values, independent_value = partial_dependence(model, features.index(independent_feature), X=df2[features])
I want to add (?) the average back on. Would I get this by just using model.predict() on the df2 values with the independent_feature values changed?

how the R formula works
The R formula presented in the question applies to a randomForest. Each tree in a random forest tries to predict the target variable directly. Thus, the prediction of each tree lies in the expected interval (in your case, all house prices are positive), and the prediction of the ensemble is just the average of all the individual predictions.
ensemble_prediction = mean(tree_predictions)
This is what the formula tells you: just take the predictions of all the trees at x and average them.
why the Python PDP values are small
In sklearn, however, partial dependence is calculated for a GradientBoostingRegressor. In gradient boosting, each tree predicts the derivative of the loss function at current prediction, which is only indirectly related to the target variable. For GB regression, prediction is given as
ensemble_prediction = initial_prediction + sum(tree_predictions * learning_rate)
and for GB classification predicted probability is
ensemble_prediction = softmax(initial_prediction + sum(tree_predictions * learning_rate))
For both cases, partial dependency is reported as just
sum(tree_predictions * learning_rate)
Thus, initial_prediction (for GradientBoostingRegressor(loss='ls') it is just the mean of the training y) is not included in the PDP, which is why the values can be negative.
As for the small range of the values, the y_train in your example is small: the mean house value is roughly 2, so house prices are probably expressed in millions.
how the sklearn formula actually works
I have already said that in sklearn the value of the partial dependence function is built from the predictions of all the trees. There is one more tweak: all irrelevant features are averaged away. To describe the actual way of averaging, I will just quote the sklearn documentation:
For each value of the ‘target’ features in the grid the partial
dependence function need to marginalize the predictions of a tree over
all possible values of the ‘complement’ features. In decision trees
this function can be evaluated efficiently without reference to the
training data. For each grid point a weighted tree traversal is
performed: if a split node involves a ‘target’ feature, the
corresponding left or right branch is followed, otherwise both
branches are followed, each branch is weighted by the fraction of
training samples that entered that branch. Finally, the partial
dependence is given by a weighted average of all visited leaves. For
tree ensembles the results of each individual tree are again averaged.
And if you are still not satisfied, see the source code.
an example
To see that the prediction is already on the scale of the dependent variable (but is just centered), you can look at a very toy example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
# in newer scikit-learn releases the partial dependence tools live in sklearn.inspection
from sklearn.ensemble.partial_dependence import plot_partial_dependence
np.random.seed(1)
X = np.random.normal(size=[1000, 2])
# yes, I will try to fit a linear function!
y = X[:, 0] * 10 + 50 + np.random.normal(size=1000, scale=5)
# mean target is 50, range is roughly from 20 to 80, that is +/- 30 (three standard deviations of the first feature)
model = GradientBoostingRegressor().fit(X, y)
fig, subplots = plot_partial_dependence(model, X, [0, 1], percentiles=(0.0, 1.0), n_cols=2)
subplots[0].scatter(X[:, 0], y - y.mean(), s=0.3)
subplots[1].scatter(X[:, 1], y - y.mean(), s=0.3)
plt.suptitle('Partial dependence plots and scatters of centered target')
plt.show()
You can see that partial dependence plots reflect the true distribution of the centered target variable pretty well.
If you want not only the units, but the mean to coincide with your y, you have to add the "lost" mean to the result of the partial_dependence function and then plot the results manually:
from sklearn.ensemble.partial_dependence import partial_dependence
pdp_y, [pdp_x] = partial_dependence(model, X=X, target_variables=[0], percentiles=(0.0, 1.0))
plt.scatter(X[:, 0], y, s=0.3)
# add the initial prediction (the training mean) back onto the partial dependence
plt.plot(pdp_x, pdp_y.ravel() + model.init_.mean)
# set the title before calling show(), otherwise it is not drawn
plt.title('Partial dependence plot in the original coordinates')
plt.show()
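For intuition about what is being averaged, here is a brute-force sketch of partial dependence that needs no scikit-learn at all: fix the feature of interest at each grid value for every row, predict, and average. The predict function below is a hypothetical stand-in for a fitted model, and the final centering step only illustrates how values on the target's scale turn negative once a mean is subtracted. It is not the tree-traversal method quoted from the documentation.

```python
import numpy as np


def partial_dependence_brute(predict, X, feature, grid):
    """Average prediction over the data with one feature forced to each grid value."""
    averaged = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value
        averaged.append(predict(X_mod).mean())
    return np.array(averaged)


rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))


def predict(X):
    # Hypothetical fitted model: linear in feature 0, quadratic in feature 1.
    return 50.0 + 10.0 * X[:, 0] + X[:, 1] ** 2


grid = np.linspace(-2.0, 2.0, 5)
pd_raw = partial_dependence_brute(predict, X, feature=0, grid=grid)
pd_centered = pd_raw - predict(X).mean()  # same shape, shifted down by the mean prediction

print("uncentered:", pd_raw)
print("centered:  ", pd_centered)
```

The uncentered curve sits around 50 like the target itself, while the centered one straddles zero. That constant shift is the only difference, which is why adding the mean back recovers the original coordinates.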

You are looking at a Partial Dependence Plot. A PDP is a graph that represents a set of variables/predictors and their effect on the target field (in this case price). Those graphs do not estimate actual prices.
It is important to realize that a PDP is not a representation of the dataset values or price. It is a representation of the variables' effect on the target field. For a classifier the values would be on the logit scale rather than raw probabilities; for a regressor, as here, the negative numbers are predictions with the initial (mean) prediction left out.

What's a good metric to analyze the quality of the output of a clustering algorithm?

I've been trying out the kmeans clustering algorithm implementation in scipy. Are there any standard, well-defined metrics that could be used to measure the quality of the clusters generated?
ie, I have the expected labels for the data points that are clustered by kmeans. Now, once I get the clusters that have been generated, how do I evaluate the quality of these clusters with respect to the expected labels?

I am doing this very thing at the moment with Spark's KMeans.
I am using:
The sum of squared distances of points to their nearest center (implemented in computeCost()).
The unbalanced factor (see "Unbalanced factor of KMeans?" for an implementation and "Understanding the quality of the KMeans algorithm" for an explanation).
Both quantities indicate a better clustering when they are small (the smaller, the better).

KMeans attempts to minimise the sum of squared distances to cluster centers. I would compare the value of this sum for the KMeans clusters with its value for the clusters you get if you group by expected labels.
There are two possibilities for the result. If the KMeans sum of squares is larger than that of the expected label clustering, then your KMeans implementation is buggy or did not start from a good set of initial cluster assignments, and you could think about increasing the number of random starts you are using or debugging it. If the KMeans sum of squares is smaller than the expected label clustering sum of squares and the KMeans clusters are not very similar to the expected label clustering (similar meaning that two points chosen at random are usually in the same KMeans cluster exactly when they are in the same expected label cluster), then the sum of squares from cluster centers is not a good way of splitting your points up into clusters, and you need to use a different distance function, look at different attributes, or use a different sort of clustering.
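A small sketch of this comparison on synthetic data. Everything here is an assumption for illustration: the data deliberately has a large-scale feature unrelated to the label, and lloyd_kmeans is a bare-bones Lloyd's algorithm, not the scipy or Spark implementation. It computes the within-cluster sum of squares for both partitions and the agreement between them.

```python
import numpy as np


def within_cluster_sse(X, labels):
    """Sum of squared distances from each point to the mean of its group."""
    total = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


def lloyd_kmeans(X, k, n_starts=10, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with several random starts; returns the best labels and SSE."""
    rng = np.random.default_rng(seed)
    best_labels, best_sse = None, np.inf
    for _ in range(n_starts):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = sq_dists.argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):  # an empty cluster keeps its old center
                    centers[j] = X[labels == j].mean(axis=0)
        sse = within_cluster_sse(X, labels)
        if sse < best_sse:
            best_labels, best_sse = labels, sse
    return best_labels, best_sse


rng = np.random.default_rng(42)
n = 400
expected = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(scale=10.0, size=n),            # large-scale feature unrelated to the label
    expected + rng.normal(scale=0.1, size=n),  # small-scale feature that carries the label
])

kmeans_labels, sse_kmeans = lloyd_kmeans(X, k=2)
sse_expected = within_cluster_sse(X, expected)

# Agreement up to swapping the two cluster indices.
agreement = np.mean(kmeans_labels == expected)
agreement = max(agreement, 1.0 - agreement)

print(f"SSE k-means: {sse_kmeans:.0f}, SSE expected labels: {sse_expected:.0f}")
print(f"agreement with expected labels: {agreement:.2f}")
```

Here k-means reaches a much smaller sum of squares than the expected labels while agreeing with them little better than chance. That is the second case described above: the squared-distance criterion on these raw attributes is not measuring what the labels measure, and rescaling the features would be the first thing to try.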

In your case, where you do have the samples' true labels, validation is very easy.
First of all, compute the confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix). Then derive all the relevant measures from it: true positives, false negatives, false positives and true negatives. From those you can find the precision, recall, miss rate, etc.
Make sure you understand the meaning of all of the above. They basically tell you how well your clustering predicted / recognized the true nature of your data.
If you're using python, just use the sklearn package:
http://scikit-learn.org/stable/modules/model_evaluation.html
In addition, it's nice to run some internal validation, to see how well your clusters are separated. There are known internal validity measures, like:
Silhouette
DB index
Dunn index
Calinski-Harabasz measure
Gamma score
Normalized Cut
etc.
Read more here: "An extensive comparative study of cluster validity indices" by Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez and Iñigo Perona.
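As an example of one internal measure from the list, here is a plain NumPy sketch of the silhouette coefficient (sklearn.metrics.silhouette_score computes the same quantity; the function and data below are my own illustration). For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b), averaged over all points.

```python
import numpy as np


def silhouette(X, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton cluster gets a silhouette of 0 by convention
        a = dists[i, own].sum() / (n_own - 1)  # excludes the zero distance to itself
        b = min(dists[i, labels == other].mean()
                for other in unique if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

good_labels = np.repeat([0, 1], 50)
random_labels = rng.integers(0, 2, size=100)

score_good = silhouette(X, good_labels)
score_random = silhouette(X, random_labels)
print(f"well-separated clustering: {score_good:.2f}, random labels: {score_random:.2f}")
```

Values close to 1 mean tight, well-separated clusters, and values near 0 mean the clusters overlap. Unlike the confusion-matrix measures, this needs no true labels.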
