I have a question (maybe I'm just misunderstanding something). My target value is binary (Yes/No). I make predictions with scikit-learn for a few classifiers and plot the ROC curves. Everything looks fine except the ROC curves for DecisionTreeClassifier() and ExtraTreeClassifier(). For those I get something like this:
For the other classifiers I get something similar to this:
I tried every scikit-learn function for displaying a ROC curve and I get the same plot each time. Could you point me towards how I can "improve" my model or plot? My code for DecisionTreeClassifier():
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay, roc_curve
from sklearn import metrics
import matplotlib.pyplot as plt

model3 = make_pipeline(preprocessor, DecisionTreeClassifier())
model3 = model3.fit(data_train, target_train)
m = model3[:-1].get_feature_names_out()

# Plot the fitted tree (first five levels only)
plt.figure(figsize=(12, 12))
plot_tree(model3.named_steps['decisiontreeclassifier'], fontsize=10, node_ids=True,
          feature_names=m, max_depth=5)

# Test-set predictions used below (assumed definition, not shown in the original snippet)
y_pred3 = model3.predict(data_test)

# Confusion matrix
cm3 = confusion_matrix(target_test, y_pred3, normalize='all')
cm3_display = ConfusionMatrixDisplay(cm3).plot()
plt.xlabel('Predicted class (test result)')
plt.ylabel('Actual class')
plt.show()

# ROC curve, three different ways
RocCurveDisplay.from_estimator(model3, data_test, target_test)
plt.show()

RocCurveDisplay.from_predictions(target_test, y_pred3)
plt.show()

model3_probs = model3.predict_proba(data_test)[:, 1]
model3_fpr, model3_tpr, _ = roc_curve(target_test, model3_probs)
roc_auc = metrics.auc(model3_fpr, model3_tpr)
display = metrics.RocCurveDisplay(fpr=model3_fpr, tpr=model3_tpr,
                                  roc_auc=roc_auc, estimator_name='example estimator')
display.plot()
How many examples are in your test data? Possibly only three? What the curve looks like also depends on the number of samples used for testing. Using more samples will produce an output closer to what you expect.
For clarification: the curve is produced by sweeping a cut-off threshold over your predictions and plotting the false positive rate and true positive rate for increasing values of this threshold (see here). If you only include very few samples, your curve has only very few points to plot, so it looks just like yours.
Edit from the comments: what happened here is that your tree fits your data completely because its complexity is unlimited (e.g. no max_depth or min_samples per leaf was set). This means that all leaves (also at test time) are pure, so your predicted probabilities are only 0 or 1, nothing in between. Since the threshold then does not matter, your ROC only changes once, from (0,0) to the point determined by the (fpr, tpr) on the test set and then to (1,1), which looks like your curve. This can be circumvented by using a RandomForest (introducing randomness) or by restricting the decision tree, so that counting what is in each leaf gives probabilities between 0 and 1. A related thread can be found here.
Nevertheless, there is nothing wrong with your plot if pure leaves are okay for you!
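As a minimal sketch of that second suggestion (my addition; the max_depth and min_samples_leaf values are only illustrative, and preprocessor, data_train, data_test, target_train and target_test are the names from your code):

from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import RocCurveDisplay

# A restricted tree keeps some leaves impure, so predict_proba returns values
# strictly between 0 and 1 and the ROC curve gets more than three points.
model3_restricted = make_pipeline(
    preprocessor,
    DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, random_state=0),
)
model3_restricted.fit(data_train, target_train)
RocCurveDisplay.from_estimator(model3_restricted, data_test, target_test)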
So I am exploring using a logistic regression model to predict the probability of a shot resulting in a goal. I have two predictors, but for simplicity let's assume I have one predictor: distance from the goal. While doing some data exploration I decided to investigate the relationship between distance and the result of a shot. I did this graphically by splitting the data into equal-size bins and then taking the mean of all the results (0 for a miss and 1 for a goal) within each bin. Then I plotted the average distance from goal for each bin against the probability of scoring. I did this in Python:
# use the seaborn library to inspect the distribution of the shots by result (goal or no goal)
fig, axes = plt.subplots(1, 2, figsize=(11, 5))

# first we want to create bins to calculate our probability
# pandas has a function qcut that evenly distributes the data
# into n bins based on a desired column value
df['Goal'] = df['Goal'].astype(int)
df['Distance_Bins'] = pd.qcut(df['Distance'], q=50)

# now we want to find the mean of the Goal column (our probability) for each bin
# and the mean of the distance for each bin
dist_prob = df.groupby('Distance_Bins', as_index=False)['Goal'].mean()['Goal']
dist_mean = df.groupby('Distance_Bins', as_index=False)['Distance'].mean()['Distance']
dist_trend = sns.scatterplot(x=dist_mean, y=dist_prob, ax=axes[0])
dist_trend.set(xlabel="Avg. Distance of Bin",
               ylabel="Probability of Goal",
               title="Probability of Scoring Based on Distance")
(Figure: Probability of Scoring Based on Distance)
So my question is: why would we go through the process of building a logistic regression model when I could just fit a curve to the plot in the image? Would that not provide a function that predicts the probability of a goal for a shot at distance x?
I guess the problem is that we are reducing, say, 40,000 data points to 50, but I'm not entirely sure why this would be a problem for predicting future shots. Could we increase the number of bins, or would that just add variability? Is this a case of the bias-variance trade-off? I'm just a little confused about why this would not be as good as a logistic model.
The binning method is a bit more finicky than logistic regression, since you need to try different kinds of curves to fit (e.g. inverse relationship, log, square, etc.), while for logistic regression you only need to adjust the learning rate to see results.
If you are using one feature (your "Distance" predictor), I wouldn't see much difference between the binning method and logistic regression. However, when you are using two or more features (I see "Distance" and "Angle" in the image you provided), how would you combine the probabilities from each to make a final 0/1 classification? It can be tricky. For one, perhaps "Distance" is a more useful predictor than "Angle". Logistic regression handles that for you, because it can adjust the weights.
Regarding your binning method: if you use fewer bins you might see more bias, since the data may be more complicated than you think, but this is not that likely because your data looks quite simple at first glance. Using more bins would not significantly increase variance, assuming you fit the curve without varying its order. If you change the order of the curve you fit, then yes, it will increase variance. However, your data seems amenable to a very simple fit if you go with this method.
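For comparison, here is a minimal sketch (my addition, not part of the original answer) that fits a one-feature logistic regression on the raw shots and overlays its predicted probability on the binned scatter plot; it reuses the df, Distance, Goal and axes names from the question.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Fit on the raw 0/1 outcomes instead of the 50 bin averages
log_reg = LogisticRegression()
log_reg.fit(df[['Distance']], df['Goal'])

# Predicted probability of a goal over a grid of distances
grid = pd.DataFrame({'Distance': np.linspace(df['Distance'].min(),
                                             df['Distance'].max(), 200)})
p_goal = log_reg.predict_proba(grid)[:, 1]

# Overlay the smooth curve on the binned estimates
axes[0].plot(grid['Distance'], p_goal, color='red', label='logistic regression')
axes[0].legend()

The two estimates should roughly agree wherever the bins contain plenty of shots; the logistic curve simply uses all 40,000 points at once instead of 50 summaries.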
I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would like to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the prediction $\hat{y}$ and $P(Y|X)$ as output. Any idea how to do this?
There is no probability in regression. In regression the only output you get is a predicted value; that is why it is called regression. So for any regressor, a probability for a prediction is not possible. That only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval around that regression, to see whether or not your regression can be trusted.
One thing to note, though, is that the variance might not be the same across the data.
Let's assume that you study a time-based phenomenon. Specifically, you have the temperature (y) after (x) time (in seconds, for instance) inside an oven. At x = 0 s it is at 20 °C, you start heating it, and you want to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you have taken care of heteroscedasticity, so your interval is valid across all the data.
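One concrete way to get such an interval (my suggestion, not part of the original answer) is quantile regression: train two extra models for a lower and an upper quantile. The sketch below uses sklearn's GradientBoostingRegressor as a stand-in for the XGBoost model, with hypothetical X_train, y_train and X_test arrays.

from sklearn.ensemble import GradientBoostingRegressor

# Point prediction plus a rough 80% prediction interval
point = GradientBoostingRegressor().fit(X_train, y_train)
lower = GradientBoostingRegressor(loss='quantile', alpha=0.1).fit(X_train, y_train)
upper = GradientBoostingRegressor(loss='quantile', alpha=0.9).fit(X_train, y_train)

y_hat = point.predict(X_test)
y_low = lower.predict(X_test)    # estimated 10th percentile of y given X
y_high = upper.predict(X_test)   # estimated 90th percentile of y given X

Because the quantiles are learned per sample, the interval width can vary with X, which is exactly what you want under heteroscedasticity.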
You can probably try to get the distribution of your known outputs, compare the prediction against that curve, and check the p-value. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Here the outputs are simulated; in practice they would be your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of sample
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want a normalized histogram (since this is a PDF, it must integrate to 1)
nbins = N / 10
n = int(N / nbins)   # = 10 bins here
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0]) / 2   # converting bin edges to bin centers
# Now we want to interpolate:
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9 * std, 2.9 * std, 10000)
plt.plot(x, f(x))
plt.show()
# To check: the area under the interpolated PDF should be close to 1
area = integrate.quad(f, x[0], x[-1])
print(area)
Now, the interpolation method is not great for outliers: if a predicted value is extremely far (more than 3 times the std) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in the time I had. I'm sure there are better ways to do it. If your data follow a normal distribution, it becomes trivial.
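As a minimal sketch of that last remark (my addition, assuming the outputs really are roughly Gaussian), you can skip the histogram and interpolation entirely and use scipy.stats.norm:

import numpy as np
from scipy import stats

outputs = np.random.normal(loc=0, scale=1, size=1000)   # stand-in for your real training outputs
mu, sigma = outputs.mean(), outputs.std()

prediction = 1.7   # some value returned by the regressor
density = stats.norm.pdf(prediction, loc=mu, scale=sigma)    # how typical the value is
p_value = 2 * stats.norm.sf(abs(prediction - mu) / sigma)    # rough two-sided p-value
print(density, p_value)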
I suggest you look into NGBoost (essentially a wrapper around XGBoost that ultimately provides a probabilistic model).
Here you can find slides on how NGBoost works, as well as the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit an XGBoost model to estimate the best parameters of that distribution (for the Gaussian, $\mu$ and $\sigma$). The model splits the feature space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you are provided with the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
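A minimal usage sketch (my addition; it assumes the ngboost package is installed and that X_train, y_train and X_test are your own feature and target arrays):

from ngboost import NGBRegressor

# By default NGBoost fits a Normal distribution for P(Y|X) with tree base learners
ngb = NGBRegressor().fit(X_train, y_train)

y_hat = ngb.predict(X_test)       # point prediction (the mean of the fitted Normal)
y_dist = ngb.pred_dist(X_test)    # full predictive distribution P(Y|X=x)

print(y_dist.params['loc'][:5])    # per-sample mu
print(y_dist.params['scale'][:5])  # per-sample sigma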
I am looking at the tutorial for partial dependence plots in Python. No equation is given in the tutorial or in the documentation. The documentation of the R function gives the formula I expected (the formula itself was an image and is not reproduced here).
This does not seem to make sense with the results given in the Python tutorial. If it is an average of the predicted house prices, then how is it negative and small? I would expect values in the millions. Am I missing something?
Update:
For regression it seems the average is subtracted from the above formula. How would this be added back? For my trained model I can get the partial dependence with

from sklearn.ensemble.partial_dependence import partial_dependence

# note: don't reuse the name 'partial_dependence' for the result; it would shadow the imported function
pdp, independent_value = partial_dependence(model, features.index(independent_feature),
                                            X=df2[features])

I want to somehow add the average back on. Would I get this by just using model.predict() on the df2 values with the independent_feature values changed?
how the R formula works
The R formula presented in the question applies to a randomForest. Each tree in a random forest tries to predict the target variable directly. Thus the prediction of each tree lies in the expected interval (in your case, all house prices are positive), and the prediction of the ensemble is just the average of all the individual predictions.
ensemble_prediction = mean(tree_predictions)
This is what the formula tells you: just take predictions of all the trees x and average them.
why the Python PDP values are small
In sklearn, however, partial dependence is calculated for a GradientBoostingRegressor. In gradient boosting, each tree predicts the (negative) gradient of the loss function at the current prediction, which is only indirectly related to the target variable. For GB regression, the prediction is given as
ensemble_prediction = initial_prediction + sum(tree_predictions * learning_rate)
and for GB classification predicted probability is
ensemble_prediction = softmax(initial_prediction + sum(tree_predictions * learning_rate))
For both cases, partial dependency is reported as just
sum(tree_predictions * learning_rate)
Thus initial_prediction (for GradientBoostingRegressor(loss='ls') it is just the mean of the training y) is not included in the PDP, which is why the values can be negative.
As for the small range of the values: the y_train in your example is small. The mean house value is roughly 2, so house prices are probably expressed in millions.
how the sklearn formula actually works
I have already said that in sklearn the value of the partial dependence function is an average over all trees. There is one more tweak: all irrelevant features are averaged out. To describe the actual way of averaging, I will just quote the sklearn documentation:
For each value of the ‘target’ features in the grid the partial
dependence function need to marginalize the predictions of a tree over
all possible values of the ‘complement’ features. In decision trees
this function can be evaluated efficiently without reference to the
training data. For each grid point a weighted tree traversal is
performed: if a split node involves a ‘target’ feature, the
corresponding left or right branch is followed, otherwise both
branches are followed, each branch is weighted by the fraction of
training samples that entered that branch. Finally, the partial
dependence is given by a weighted average of all visited leaves. For
tree ensembles the results of each individual tree are again averaged.
And if you are still not satisfied, see the source code.
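To make that marginalization concrete, here is a minimal brute-force sketch (my addition; sklearn itself uses the weighted tree traversal described above, not this loop): for each grid value of the 'target' feature, overwrite that column for every row of the data and average the model's predictions.

import numpy as np

def partial_dependence_brute(model, X, feature_idx, grid):
    """Average prediction over the data with one feature fixed to each grid value."""
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value                    # fix the 'target' feature
        pd_values.append(model.predict(X_mod).mean())    # marginalize over the rest
    return np.array(pd_values)

Because this averages full predictions (including the initial prediction), its output is already on the scale of y; the recursion-based values discussed above differ from it by roughly the training mean.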
an example
To see that the prediction is already on the scale of the dependent variable (it is just centered), you can look at a very small toy example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import plot_partial_dependence
np.random.seed(1)
X = np.random.normal(size=[1000, 2])
# yes, I will try to fit a linear function!
y = X[:, 0] * 10 + 50 + np.random.normal(size=1000, scale=5)
# mean target is 50; X[:, 0] spans roughly +/- 3 standard deviations, so y ranges from about 20 to 80
model = GradientBoostingRegressor().fit(X, y)
fig, subplots = plot_partial_dependence(model, X, [0, 1], percentiles=(0.0, 1.0), n_cols=2)
subplots[0].scatter(X[:, 0], y - y.mean(), s=0.3)
subplots[1].scatter(X[:, 1], y - y.mean(), s=0.3)
plt.suptitle('Partial dependence plots and scatters of centered target')
plt.show()
You can see that partial dependence plots reflect the true distribution of the centered target variable pretty well.
If you want not only the units but also the mean to coincide with your y, you have to add the "lost" mean back to the result of the partial_dependence function and then plot the results manually:
from sklearn.ensemble.partial_dependence import partial_dependence
pdp_y, [pdp_x] = partial_dependence(model, X=X, target_variables=[0], percentiles=(0.0, 1.0))
plt.scatter(X[:, 0], y, s=0.3)
plt.plot(pdp_x, pdp_y.ravel() + model.init_.mean)   # add the training mean back
plt.title('Partial dependence plot in the original coordinates')
plt.show()
You are looking at a partial dependence plot. A PDP is a graph that represents the effect of a set of variables/predictors on the target field (in this case price). Those graphs do not estimate actual prices.
It is important to realize that a PDP is not a representation of the dataset values or of price. It is a representation of the variables' effect on the target field. The negative numbers are logits of probabilities, not raw probabilities.
I am performing preliminary tests using sklearn in my code.
I am testing:
1) sklearn.cross_validation.cross_val_score
2) sklearn.cross_validation.train_test_split
like in this question.
The code is the following:
from sklearn import svm, cross_validation
from sklearn.metrics import roc_auc_score

# X is my data and Y the corresponding binary labels
# My classifier
clf = svm.SVC(class_weight='auto', kernel=kernel, gamma=gamma,
              degree=degree, cache_size=cache_size, probability=probability)

# 1st method: ShuffleSplit and cross-validation
cv = cross_validation.ShuffleSplit(X.shape[0], n_iter=5,
                                   test_size=0.4, random_state=0)
# Scoring
scores = cross_validation.cross_val_score(clf, X, Y,
                                          cv=cv, n_jobs=3, scoring="roc_auc")

# 2nd method: train_test_split
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, Y, test_size=0.4, random_state=42)
clf.fit(X_train, y_train)
pred_test = clf.predict(X_test)
# Scoring
score = roc_auc_score(y_test, pred_test)
The difference with the other question is that my data is being randomized in both cases 1) and 2).
However I get for case 1) the following scores:
[ 0.9453893 0.94878745 0.95197478 0.95150763 0.94971746]
and for case 2):
0.867637
I don't quite understand the reason for these different scores and can't see what I'm missing here.
Shouldn't the scores be similar?
Thank you for your time.
I know that I'm late to this, but I have just been having a similar problem and happened to stumble upon this post. I was having exactly the same issue when comparing answers from train_test_split and cross_val_score, using the roc_auc_score metric.
I think the problem arises from putting the predicted binary outputs from the classifier into the roc_auc_score comparison. This means that the metric only has two arrays of binary outputs to compute the score from. If you try using predict_proba instead, this will give you an array with two columns (presuming that you have a two-class problem here) of the class probabilities for the different sample points.
On my dataset, I took the second column of this and entered it into roc_auc_score along with the true values, and this returned answers that were far more in line with the output of cross_val_score.
UPDATE:
Having learnt some more (and read the docs!), this isn't the best way to go about it, as it requires setting probability=True for the SVC, which is far more computationally expensive. Instead of using either predict or predict_proba, use decision_function, and then enter those values into roc_auc_score as the predicted values.
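In the setting of the question, that amounts to something like this minimal sketch (my addition, reusing the clf, X_test and y_test names from the question):

from sklearn.metrics import roc_auc_score

# Use continuous scores instead of hard 0/1 predictions
decision_scores = clf.decision_function(X_test)
score = roc_auc_score(y_test, decision_scores)

# Equivalent with probabilities (requires probability=True on the SVC, which is slower):
# prob_scores = clf.predict_proba(X_test)[:, 1]
# score = roc_auc_score(y_test, prob_scores)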
UPDATE:
In response to a comment on this approach, I've also attached a couple of figures to explain the process. I'll also provide some background information that aided me when learning about this.
The Receiver Operating Characteristic curve is made by seeing how the relative amounts of true vs false positives change as the threshold for a decision boundary moves from strict to more relaxed. This explanation, however, can seem somewhat inscrutable, so a figure is provided here. It shows the decision boundary for a linear Support Vector Machine on some generated data with two features, a 'blue' class and a 'red' class. The solid line represents the threshold for binary decisions that is found by training the SVM. All of the points represent data that was used to train the model. Any new data can be added to the plot; if a point appears in the bottom left it will be labelled 'red', and in the top right 'blue'. We can think of 'red' as the 'positive' class, so the output from prediction is a binary {0, 1} output (red = 1, blue = 0).
One thing to notice is that the data points are not perfectly linearly separable; there is a region near the decision boundary where the red and blue points overlap a lot. Therefore, a linear model can never achieve perfect performance here.
The dotted lines represent the margins of the SVM. Training the SVM aims to maximise the width of this margin, and this is very dependent on the value of the hyper-parameter C provided. In effect, higher values of C will force the model to fit the training data more tightly, whereas lower values will allow for misclassifications here, with the intent of better generalisability to new and unseen data. A full description can be seen in the scikit-learn docs: http://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html#sphx-glr-auto-examples-svm-plot-svm-margin-py. Note all of the points that are either misclassified or fall inside this margin region; about the other points we are very confident of being correct.
So to the main point, how the AUC is calculated. I have added two extra lines on this graph, the red and blue boundary lines. These can be thought of as a shift of the main decision line from a highly selective area, where only the most confident red points are actually classified as red, to a very relaxed boundary, where every point will be classed as red. Remember, any point to the bottom right of this moving threshold will be classed as red.
Initially, no data points meet the criteria to be classified as red, but as the line moves in the direction of the arrows it starts to scoop up these points. In the early stages all of these are correct, since all of the data points picked up are red, but as we head towards the margin area we soon start to pick up false positives (blue points) while getting more of the reds. This pattern of collecting true and false positives at different rates is what shapes the ROC curve. The best way to show this is with another figure:
Imagine that we start to draw the curve from the bottom left, and make a small movement any time we change the threshold position. As we collect the true, red, positives, we draw our line in the y-axis direction, but as we collect blues, we draw in the x-axis direction. The aim is to send the line as close to the top left corner as possible, as in the end we will take the Area-Under-the-Curve (AUC) as our metric. Note that at the end, the line always gets to the top right (as eventually, all the data points will be classed as red), and in this case it is just travelling along the top of the graph. This is because, in this dataset, as the threshold moves closer to the blue line, we are only getting false positives.
Now imagine two very different situations. If the data were perfectly linearly separable, so that none of the training data points were on the 'wrong' side of the boundary, the ROC line would head straight up the y-axis until it reached the top left, and then head along the top of the graph to the top right, giving an AUC of 1. However, if the data points were just a cloud of noise, all mixed in the centre, you would get false positives at the same rate as true positives, your line would head in the direction of the diagonal, and you would get an AUC of 0.5. This is why that value represents chance-level performance.
I am not a contributor to scikit-learn and I haven't examined the source code here, but I can imagine that roc_auc_score uses the values from decision_function or predict_proba as a representation of how confident the model is that a point belongs to the positive (in our case red) class. Therefore the same logic of relaxing the boundary and looking at the changing rates of true to false positives still holds. If this is not right, then please correct me.
I am using nltk with Python and I would like to plot the ROC curve of my classifier (Naive Bayes). Is there any function for plotting it, or do I have to track the true positive rate and false positive rate myself?
It would be great if someone could point me to some code that already does it...
Thanks.
PyROC looks simple enough: tutorial, source code
This is how it would work with the NLTK naive bayes classifier:
# class labels are 0 and 1
labeled_data = [
    (1, featureset_1),
    (0, featureset_2),
    (1, featureset_3),
    # ...
]

# naive_bayes is your already trained classifier,
# preferably not trained on the data you're testing on :)
from pyroc import ROCData

roc_data = ROCData(
    (label, naive_bayes.prob_classify(featureset).prob(1))
    for label, featureset in labeled_data
)
roc_data.plot()
Edits:
ROC is for binary classifiers only. If you have three classes, you can measure the performance of your positive and negative classes separately (by counting the other two classes as 0, as you proposed).
The library expects the output of a decision function as the second value of each tuple. It then tries all possible thresholds, e.g. f(x) >= 0.8 => classify as 1, and plots a point for each threshold (that's why you get a curve in the end). So if your classifier guesses class 0, you actually want a value closer to zero. That's why I proposed .prob(1).
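For reference, the same (label, score) pairs can also be fed to scikit-learn, which sweeps the thresholds in exactly the same way; a minimal sketch (my addition, reusing the labeled_data and naive_bayes names from above):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [label for label, featureset in labeled_data]
y_score = [naive_bayes.prob_classify(featureset).prob(1)
           for label, featureset in labeled_data]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (fpr, tpr) point per threshold
print('AUC:', auc(fpr, tpr))
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()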