NLTK & Python, plotting ROC curve

I am using nltk with Python and I would like to plot the ROC curve of my classifier (Naive Bayes). Is there any function for plotting it, or do I have to track the true positive and false positive rates myself?
It would be great if someone could point me to some code that already does this...
Thanks.

PyROC looks simple enough: tutorial, source code
This is how it would work with the NLTK naive bayes classifier:
# class labels are 0 and 1
labeled_data = [
    (1, featureset_1),
    (0, featureset_2),
    (1, featureset_3),
    # ...
]
# naive_bayes is your already trained classifier,
# preferably not trained on the data you're testing on :)
from pyroc import ROCData
roc_data = ROCData(
    (label, naive_bayes.prob_classify(featureset).prob(1))
    for label, featureset in labeled_data
)
roc_data.plot()
Edits:
ROC is for binary classifiers only. If you have three classes, you can measure the performance for each class separately, treating it as the positive class and counting the other two classes as 0 (like you proposed).
The library expects the output of a decision function as the second value of each tuple. It then tries every possible threshold, e.g. f(x) >= 0.8 => classify as 1, and plots a point for each threshold (that's why you get a curve in the end). So if your classifier guesses class 0, you actually want a value closer to zero; that's why I proposed .prob(1).
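If you would rather not add the PyROC dependency, scikit-learn can compute the same curve from the (label, score) pairs. A minimal sketch, assuming the same labeled_data and trained naive_bayes objects as above:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

labels = [label for label, _ in labeled_data]
scores = [naive_bayes.prob_classify(featureset).prob(1) for _, featureset in labeled_data]

# roc_curve tries every threshold and returns one (fpr, tpr) point per threshold
fpr, tpr, thresholds = roc_curve(labels, scores)
plt.plot(fpr, tpr, label='AUC = %.3f' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()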

Related

Problem with ROC curve in scikit-learn - DecisionTreeClassifier(), ExtraTreeClassifier()

I have a question for you (maybe I don't understand something). My target value is binary (Yes/No). I make predictions using scikit-learn for a few classifiers and plot the ROC curves. Everything looks good except for the ROC curves of DecisionTreeClassifier() and ExtraTreeClassifier(). I get something like this:
For the other classifiers I get something similar to this:
I tried all the scikit-learn functions for displaying ROC curves and I get the same plot. Could you show me how I can "improve" my model or the plot? My code for DecisionTreeClassifier():
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             roc_curve, RocCurveDisplay)
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier, plot_tree

model3 = make_pipeline(preprocessor, DecisionTreeClassifier())
model3 = model3.fit(data_train, target_train)
m = model3[:-1].get_feature_names_out()
y_pred3 = model3.predict(data_test)

plt.figure(figsize=(12, 12))
plot_tree(model3.named_steps['decisiontreeclassifier'], fontsize=10, node_ids=True,
          feature_names=m, max_depth=5)

cm3 = confusion_matrix(target_test, y_pred3, normalize='all')
cm3_display = ConfusionMatrixDisplay(cm3).plot()
plt.xlabel('Klasa predykowana – wynik testu')  # "Predicted class (test result)"
plt.ylabel('Klasa rzeczywista')                # "Actual class"
plt.show()

RocCurveDisplay.from_estimator(model3, data_test, target_test)
plt.show()

RocCurveDisplay.from_predictions(target_test, y_pred3)
plt.show()

model3_probs = model3.predict_proba(data_test)
model3_probs = model3_probs[:, 1]

model3_fpr, model3_tpr, _ = roc_curve(target_test, model3_probs)
roc_auc = metrics.auc(model3_fpr, model3_tpr)
display = metrics.RocCurveDisplay(fpr=model3_fpr, tpr=model3_tpr,
                                  roc_auc=roc_auc, estimator_name='example estimator')
display.plot()
How many examples are included in your test data? Possibly only three? What the curve looks like also depends on the number of samples you used for testing; using more samples will produce an output closer to what you expect.
For clarification: the curve is produced by sweeping a cut-off threshold over your predictions and plotting the false positive and true positive rates at each threshold value (see here). If you only include very few samples, your curve only has very few points to plot, and thus it looks just like yours.
Edit from the comments: what happened here is that your tree fits your data completely due to its unlimited complexity (e.g. no max_depth or min_samples per leaf is set). This means that all leaves (also at test time) are pure, so your predictions only have probabilities of 0 and 1, nothing in between. Since the threshold then does not matter, your ROC only changes once, from (0,0) to the point determined by the (fpr, tpr) on the test set and then to (1,1), which looks like your curve. This can be circumvented by using a RandomForest (introducing randomness) or by restricting the decision tree, so that counting what's in a leaf gives probabilities between 0 and 1. A related thread can be found here.
Nevertheless, there is nothing wrong with your plot if pure leaves are okay for you!
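For illustration, a minimal sketch of the restriction idea (the max_depth and min_samples_leaf values here are hypothetical, and preprocessor, data_train, data_test, target_train, target_test are the objects from your code):
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import RocCurveDisplay

# restricting the tree keeps the leaves impure, so predict_proba returns
# values between 0 and 1 and the ROC curve gets more than one bend
model3 = make_pipeline(preprocessor,
                       DecisionTreeClassifier(max_depth=5, min_samples_leaf=20))
model3.fit(data_train, target_train)
RocCurveDisplay.from_estimator(model3, data_test, target_test)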

Determining accuracy for k-means clustering

I want to classify the Iris flower dataset (I removed the labels, so it's unlabeled data now) using sklearn's k-means clustering function. I have built the prediction model and the output seems to classify the data correctly for the most part; however, it chooses the labels randomly (0, 1 and 2), and I cannot compare them to my own labels to determine the accuracy (I have marked setosa as 0, versicolor as 1, virginica as 2). Is there any way to correctly label the flowers?
Here's the code:
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

cluster = KMeans(n_clusters=3)
cluster.fit(features)
pred = cluster.labels_
score = round(accuracy_score(pred, name_val), 4)
print('Accuracy scored using k-means clustering: ', score)
features, as expected, contains the features; name_val is an array containing the true flower labels: 0 for setosa, 1 for versicolor, 2 for virginica.
Edit: one solution I came up with was setting random_state to a fixed number so that the labeling is constant. Is there any other solution?
You need to take a look at clustering metrics to evaluate your predictions; these include
Homogeneity Score
V-measure
Completeness Score, and so on.
Now take Completeness Score for example,
A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
For example
from sklearn.metrics.cluster import completeness_score
print(completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))
# Output: 1.0
which is similar to what you want. For your case the call would be completeness_score(name_val, pred) (the expected argument order is (labels_true, labels_pred)). Note that the actual label assigned to a data point is not important; what matters is how the points are labelled relative to each other.
Homogeneity, on the other hand, focuses on the quality of the data points within the same cluster, whereas V-measure is defined as 2 * (homogeneity * completeness) / (homogeneity + completeness).
Read the official documentation here: Homogeneity, completeness and V-measure
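A short sketch computing all three metrics on the k-means result (pred and name_val as in the question):
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

print(homogeneity_score(name_val, pred))    # is each cluster made up of a single class?
print(completeness_score(name_val, pred))   # does each class end up in a single cluster?
print(v_measure_score(name_val, pred))      # harmonic mean of the two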
First of all, you are not classifying, you are clustering the data. Classification is a different process.
The K-Means algorithm includes randomness when choosing the initial cluster centers. By setting the random_state you manage to reproduce the same clustering, as the initial cluster centers will be the same. However, this does not fix your problem. What you want is for the cluster with id 0 to be setosa, 1 to be versicolor, etc. This is not possible, because the K-Means algorithm has no knowledge of these categories; it only groups flowers depending on their similarity. What you can do is create a rule to determine which cluster corresponds to which category: for example, if more than 50% of the flowers that belong to a cluster are in the setosa category, then that cluster should be mapped to setosa.
That's the best way of doing it that I can think of. However, this is not how we usually evaluate clustering quality; there are metrics you can use, such as the Silhouette Coefficient. I hope I helped.
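A rough sketch of that majority-vote mapping, assuming pred and name_val are NumPy integer arrays as in the question:
import numpy as np
from sklearn.metrics import accuracy_score

mapping = {}
for cluster_id in np.unique(pred):
    labels_in_cluster = name_val[pred == cluster_id]
    # assign the cluster to the most common true label among its members
    mapping[cluster_id] = np.bincount(labels_in_cluster).argmax()

remapped_pred = np.array([mapping[c] for c in pred])
print('Accuracy after remapping:', accuracy_score(name_val, remapped_pred))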
Reference from this blog https://smorbieu.gitlab.io/accuracy-from-classification-to-clustering-evaluation/
You need to get the label correspondence from the confusion matrix with the Hungarian algorithm.
The code is below:
import numpy as np
from scipy.optimize import linear_sum_assignment as linear_assignment
from sklearn import metrics

def cluster_acc(y_true, y_pred):
    cm = metrics.confusion_matrix(y_true, y_pred)
    _make_cost_m = lambda x: -x + np.max(x)
    indexes = linear_assignment(_make_cost_m(cm))
    indexes = np.concatenate([indexes[0][:, np.newaxis], indexes[1][:, np.newaxis]], axis=-1)
    js = [e[1] for e in sorted(indexes, key=lambda x: x[0])]
    cm2 = cm[:, js]
    acc = np.trace(cm2) / np.sum(cm2)
    return acc
Or just use the coclust library:
from coclust.evaluation.external import accuracy
accuracy(labels, predicted_labels)
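Applied to the k-means example from the question, usage would look like this (a sketch, assuming name_val and pred as defined there):
print('Accuracy after Hungarian matching:', cluster_acc(name_val, pred))
# or, equivalently, with coclust:
# print(accuracy(name_val, pred))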

Understanding Partial Dependence for Gradient Boosted Regression trees

I am looking at the tutorial for partial dependence plots in Python. No equation is given in the tutorial or in the documentation. The documentation of the R function gives the formula I expected:
This does not seem to make sense with the results given in the Python tutorial. If it is an average of the prediction of house prices, then how is it negative and small? I would expect values in the millions. Am I missing something?
Update:
For regression it seems the average is subtracted from the above formula. How would this be added back? For my trained model I can get the partial dependence by
from sklearn.ensemble.partial_dependence import partial_dependence

partial_dependence, independent_value = partial_dependence(
    model, features.index(independent_feature), X=df2[features])
I want to add the average back on. Would I get this by just using model.predict() on the df2 values with the independent_feature values changed?
how the R formula works
The R formula presented in the question applies to a randomForest. Each tree in a random forest tries to predict the target variable directly, so the prediction of each tree lies in the expected interval (in your case, all house prices are positive) and the prediction of the ensemble is just the average of all the individual predictions:
ensemble_prediction = mean(tree_predictions)
This is what the formula tells you: just take the predictions of all the trees and average them.
why the Python PDP values are small
In sklearn, however, partial dependence is calculated for a GradientBoostingRegressor. In gradient boosting, each tree predicts the derivative of the loss function at the current prediction, which is only indirectly related to the target variable. For GB regression, the prediction is given as
ensemble_prediction = initial_prediction + sum(tree_predictions * learning_rate)
and for GB classification predicted probability is
ensemble_prediction = softmax(initial_prediction + sum(tree_predictions * learning_rate))
In both cases, the partial dependence is reported as just
sum(tree_predictions * learning_rate)
Thus, initial_prediction (for GradientBoostingRegressor(loss='ls') it equals just the mean of the training y) is not included in the PDP, which is what makes the values negative.
As for the small range of the values, the y_train in your example is small: the mean house value is roughly 2, so house prices are probably expressed in millions.
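To make the centering concrete, here is a toy illustration with made-up numbers (not taken from the tutorial data):
initial_prediction = 2.07                   # mean of y_train (the baseline)
pdp_values = [-0.5, -0.1, 0.3, 0.8]         # what the PDP reports: sum(tree_predictions * learning_rate)

full_scale = [initial_prediction + v for v in pdp_values]
print(pdp_values)   # partly negative and small
print(full_scale)   # [1.57, 1.97, 2.37, 2.87] -> plausible values on the original scale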
how the sklearn formula actually works
I have already said that in sklearn the value of the partial dependence function is an average over all trees. There is one more tweak: all irrelevant features are averaged away. To describe the actual way of averaging, I will just quote the documentation of sklearn:
For each value of the ‘target’ features in the grid the partial
dependence function need to marginalize the predictions of a tree over
all possible values of the ‘complement’ features. In decision trees
this function can be evaluated efficiently without reference to the
training data. For each grid point a weighted tree traversal is
performed: if a split node involves a ‘target’ feature, the
corresponding left or right branch is followed, otherwise both
branches are followed, each branch is weighted by the fraction of
training samples that entered that branch. Finally, the partial
dependence is given by a weighted average of all visited leaves. For
tree ensembles the results of each individual tree are again averaged.
And if you are still not satisfied, see the source code.
an example
To see that the prediction is already on the scale of the dependent variable (but is just centered), you can look at a very toy example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import plot_partial_dependence
np.random.seed(1)
X = np.random.normal(size=[1000, 2])
# yes, I will try to fit a linear function!
y = X[:, 0] * 10 + 50 + np.random.normal(size=1000, scale=5)
# mean target is 50, range is from 20 to 80, that is +/- 30 standard deviations
model = GradientBoostingRegressor().fit(X, y)
fig, subplots = plot_partial_dependence(model, X, [0, 1], percentiles=(0.0, 1.0), n_cols=2)
subplots[0].scatter(X[:, 0], y - y.mean(), s=0.3)
subplots[1].scatter(X[:, 1], y - y.mean(), s=0.3)
plt.suptitle('Partial dependence plots and scatters of centered target')
plt.show()
You can see that partial dependence plots reflect the true distribution of the centered target variable pretty well.
If you want not only the units but also the mean to coincide with your y, you have to add the "lost" mean to the result of the partial_dependence function and then plot the results manually:
from sklearn.ensemble.partial_dependence import partial_dependence

pdp_y, [pdp_x] = partial_dependence(model, X=X, target_variables=[0], percentiles=(0.0, 1.0))
plt.scatter(X[:, 0], y, s=0.3)
plt.plot(pdp_x, pdp_y.ravel() + model.init_.mean)
plt.title('Partial dependence plot in the original coordinates')
plt.show()
You are looking at a Partial Dependence Plot. A PDP is a graph that represents
a set of variables/predictors and their effect on the target field (in this case price). Those graphs do not estimate actual prices.
It is important to realize that a PDP is not a representation of the dataset values or prices; it shows the effect of the variables on the target relative to a baseline, which is why the numbers can be small and negative (for a classifier they would be on the log-odds scale rather than raw probabilities).

Calculating probability with sklearn GMM

I want to determine the probability that a data point belongs to a population of data. I read that sklearn GMM can do this. I tried the following....
import numpy as np
from sklearn.mixture import GMM
training_data = np.hstack((
    np.random.normal(500, 100, 2000).reshape(-1, 1),
    np.random.normal(500, 100, 2000).reshape(-1, 1),
))
# train the classifier and get max score
g = GMM(n_components=1)
g.fit(training_data)
scores = g.score(training_data)
max_score = np.amax(scores)
# create a candidate data point and calculate the probability
# it belongs to the training population
candidate_data = np.array([[490, 450]])
candidate_score = g.score(candidate_data)
From this point on I'm not sure what to do. I was reading that I have to normalize the log probability in order to get the probability of a candidate data point belonging to a population. Would that be something like this...
candidate_probability = (np.exp(candidate_score)/np.exp(max_score)) * 100
print candidate_probability
>>> [ 87.81751913]
The number does not seem unreasonable, but I'm really out of my comfort zone here so I thought I'd ask. Thanks!
The candidate_probability you are using would not be statistically correct.
I think what you would have to do is calculate, for each of the individual Gaussians, the probability that the sample point is a member of it (from the weights and multivariate cumulative distribution functions (CDFs)) and then sum up those probabilities. The biggest problem is that I cannot find a good Python package that would calculate the multivariate CDFs. Unless you are able to find one, this paper would be a good starting point: https://upload.wikimedia.org/wikipedia/commons/a/a2/Cumulative_function_n_dimensional_Gaussians_12.2013.pdf
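A rough, untested sketch of this idea using scipy's multivariate normal CDF (scipy.stats.multivariate_normal.cdf) and the newer sklearn GaussianMixture API (means_, covariances_ and weights_ are the fitted parameters; the old GMM class used covars_ instead). Whether a weighted CDF is really the quantity you want depends on how you define "belongs to the population":
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

g = GaussianMixture(n_components=1).fit(training_data)

x = np.array([490, 450])
# CDF of the candidate point under each fitted component
component_cdfs = [multivariate_normal.cdf(x, mean=mu, cov=cov)
                  for mu, cov in zip(g.means_, g.covariances_)]
# weight each component's CDF by its mixture weight and sum
weighted_cdf = np.dot(g.weights_, component_cdfs)
print(weighted_cdf)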

Library in Python for neural networks to plot ROC, AUC, DET [closed]

I am new to machine learning in Python, so forgive my naive question. Is there a library in Python for implementing neural networks that also gives me ROC and AUC curves? I know about libraries in Python which implement neural networks, but I am searching for one which also helps me plot ROC, DET and AUC curves.
In this case it makes sense to divide your question into two topics, since neural networks are hardly directly related to ROC curves.
Neural Networks
I think there's nothing better than learning by example, so I'll show you an approach to your problem using a binary classification problem trained by a feed-forward neural network, inspired by this tutorial from pybrain.
First thing is to define a dataset. The easiest way to visualize it is to use a binary dataset on a 2D plane, with points generated from normal distributions, each of them belonging to one of the 2 classes. In this case the data will be linearly separable.
from pybrain.datasets import ClassificationDataSet
from pybrain.utilities import percentError
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules import SoftmaxLayer
from pylab import ion, ioff, figure, draw, contourf, clf, show, hold, plot
from scipy import diag, arange, meshgrid, where
from numpy.random import multivariate_normal
means = [(-1,0),(2,4),(3,1)]
cov = [diag([1,1]), diag([0.5,1.2]), diag([1.5,0.7])]
n_klass = 2
alldata = ClassificationDataSet(2, 1, nb_classes=n_klass)
for n in xrange(400):
    for klass in range(n_klass):
        input = multivariate_normal(means[klass], cov[klass])
        alldata.addSample(input, [klass])
To visualize, it looks something like this:
Now you want to split it into training and test set:
tstdata, trndata = alldata.splitWithProportion(0.25)
trndata._convertToOneOfMany()
tstdata._convertToOneOfMany()
And to create your network:
fnn = buildNetwork( trndata.indim, 5, trndata.outdim, outclass=SoftmaxLayer )
trainer = BackpropTrainer( fnn, dataset=trndata, momentum=0.1, verbose=True, weightdecay=0.01)
ticks = arange(-3.,6.,0.2)
X, Y = meshgrid(ticks, ticks)
# need column vectors in dataset, not arrays
griddata = ClassificationDataSet(2,1, nb_classes=n_klass)
for i in xrange(X.size):
    griddata.addSample([X.ravel()[i], Y.ravel()[i]], [0])
griddata._convertToOneOfMany() # this is still needed to make the fnn feel comfy
Now you need to train your network and see what results you get in the end:
for i in range(20):
    trainer.trainEpochs(1)
    trnresult = percentError(trainer.testOnClassData(),
                             trndata['class'])
    tstresult = percentError(trainer.testOnClassData(dataset=tstdata),
                             tstdata['class'])

    print "epoch: %4d" % trainer.totalepochs, \
          "  train error: %5.2f%%" % trnresult, \
          "  test error: %5.2f%%" % tstresult

    out = fnn.activateOnDataset(griddata)
    out = out.argmax(axis=1)  # the highest output activation gives the class
    out = out.reshape(X.shape)

    figure(1)
    ioff()      # interactive graphics off
    clf()       # clear the plot
    hold(True)  # overplot on
    for c in range(n_klass):
        here, _ = where(tstdata['class'] == c)
        plot(tstdata['input'][here, 0], tstdata['input'][here, 1], 'o')
    if out.max() != out.min():  # safety check against flat field
        contourf(X, Y, out)     # plot the contour
    ion()       # interactive graphics on
    draw()      # update the plot
Which gives you a very bad boundary at the beginning:
But in the end a pretty good result:
ROC curves
As for ROC curves, here is a nice and simple Python library to do it on a random toy problem:
from pyroc import *
random_sample = random_mixture_model() # Generate a custom set randomly
# Example instance labels (first index) with the decision function score (second index)
# -- positive class should be +1 and negative 0.
roc = ROCData(random_sample) #Create the ROC Object
roc.auc() #get the area under the curve
roc.plot(title='ROC Curve') #Create a plot of the ROC curve
Which gives you a single ROC curve:
Of course you can also plot multiple ROC curves on the same graph:
x = random_mixture_model()
r1 = ROCData(x)
y = random_mixture_model()
r2 = ROCData(y)
lista = [r1,r2]
plot_multiple_roc(lista,'Multiple ROC Curves',include_baseline=True)
(remember that the diagonal just means that your classifier is random and that you're probably doing something wrong)
You can probably easily use these modules in any of your classification tasks (not limited to neural networks) and they will produce ROC curves for you.
Now, to get the class/probability needed to plot your ROC curve from your neural network, you just need to look at its activations: activateOnDataset in pybrain will give you the probability for both classes (in my example above we just take the max of the probabilities to determine which class to consider). From there, just transform it to the format expected by PyROC, as for random_mixture_model, and it should give you your ROC curve.
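A rough, untested sketch of that last step, reusing the fnn and tstdata objects from the pybrain example above (the softmax output's second column is taken as the score for class 1):
from pyroc import ROCData

out = fnn.activateOnDataset(tstdata)     # one row of class activations per sample
scores = out[:, 1]                       # activation (probability) of class 1
true_labels = tstdata['class'].ravel()   # original 0/1 labels kept by pybrain

roc = ROCData(zip(true_labels, scores))
roc.plot(title='ROC of the pybrain classifier')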
Sure. First, check out this
https://stackoverflow.com/questions/2276933/good-open-source-neural-network-python-library
This is my general idea; I'm sketching out how I might approach this, and none of it is tested.
From
http://pybrain.org/docs/tutorial/netmodcon.html#feed-forward-networks
>>> from pybrain.structure import FeedForwardNetwork
>>> n = FeedForwardNetwork()
>>> n.activate((2, 2))
array([-0.1959887])
We build a neural net, train it (not shown) and get the output. You have a test set, right? You use the test set to generate the data for the ROC curve. For a single-output neural net, you want to create a threshold for the output values to translate them into yes or no responses in a way that gives the best degree of specificity/sensitivity for your task.
This is a good tutorial
http://webhome.cs.uvic.ca/~mgbarsky/DM_LABS/LAB_5/Lab5_ROC_weka.pdf
Then you just plot them. Or you can try to find a library that does it for you
I saw this
http://pypi.python.org/pypi/yard
The point is that generating a ROC curve is not specific to neural nets, so you may not find a library that does it for you. I've provided the above to show that it's fairly simple to roll your own.
* More detail *
Your neural network is going to have an output that you will have to translate into a classification (likely yes/no). To calculate the ROC curve, you take a few thresholds for yes/no (in other words, >= 0.75 means yes, < 0.75 means no). From each threshold, you translate the output of your neural net into classifications. By comparing those classifications to the true classifications, you get a false positive and true positive rate. You then plot the false positive rate against the true positive rate as you tweak that threshold.
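A bare-bones sketch of rolling your own ROC points this way (the scores and labels below are made-up placeholders):
import numpy as np

scores = np.array([0.1, 0.4, 0.35, 0.8])   # neural net outputs on the test set
y_true = np.array([0, 0, 1, 1])            # true classes

points = []
for threshold in np.unique(scores):
    y_pred = (scores >= threshold).astype(int)
    tpr = np.mean(y_pred[y_true == 1])     # true positive rate at this threshold
    fpr = np.mean(y_pred[y_true == 0])     # false positive rate at this threshold
    points.append((fpr, tpr))
# plotting the (fpr, tpr) pairs (plus the (0,0) and (1,1) endpoints) gives the ROC curve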
