I want to classify Iris flower dataset (I removed labels though, so its an unlabeled data now) using sklearns k-means clustering function. I have made the prediction model and the output seems to be classifying the data correctly for the most part, however it is choosing the labels randomly (0, 1 and 2) and I cannot compare it to my own labels to determine the accuracy (I have marked setosa as 0, versicolor as 1, virginica as 2). Is there any way to correctly label the flowers?
Heres the code:
from sklearn.cluster import KMeans
cluster = KMeans(n_clusters = 3)
cluster.fit(features)
pred = cluster.labels_
score = round(accuracy_score(pred, name_val), 4)
print('Accuracy scored using k-means clustering: ', score)
features, as expected contains the features, name_val is matrix containing flower values, 0 for setosa, 1 for versicolor, 2 for virginica
Edit: one solution I came up with was setting random_state to any number so that the labeling is constant, is there any other solution?
You need to take a look at clustering metrics to evaluate your predicitons, these include
Homegenity Score
V measure
Completenss Score and so on
Now take Completeness Score for example,
A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
For example
from sklearn.metrics.cluster import completeness_score
print completeness_score([0, 0, 1, 1], [1, 1, 0, 0])
#Output : 1.0
Which similar to what you want. For you the code would be completeness_score(pred, name_val). Here note that the label assigned to a data point is not important rather their labelling with respect to each other is important.
Homogenity on the other hand focus on the quality of data points within the same cluster. Whereas, V-measure is defined as 2 * (homogeneity * completeness) / (homogeneity + completeness)
Read the official documentation here : Homogenity, completeness and V-measure
First of all, you are not classifying, you are clustering the data. Classification is a different process.
The K-Means algorithm includes randomness in choosing the initial cluster centers. By setting the random_state you manage to reproduce the same clustering, as the initial cluster centers will be the same. However, this does not fix your problem. What you want is the cluster with id 0 to be setosa, 1 to be versicolor etc. This is not possible because the K-Means algorithm has no knowledge of these categories, it only groups flowers depending on their similarity. What you can do is create a rule to determine which cluster corresponds to which category. For example you can say that if more than 50% of the flowers that belong to a cluster are also in the setosa category, then this cluster's documents should be compared to the set of documents in the setosa category.
That's the best way of doing it, that I can think of. However, this is not the way we evaluate custering quality, there are metrics you can use such as the Silhouette Coefficient. I hope I helped.
Reference from this blog https://smorbieu.gitlab.io/accuracy-from-classification-to-clustering-evaluation/
You need to got the relation from confusion matrix with Hungarian algorithm.
The code is below:
from scipy.optimize import linear_sum_assignment as linear_assignment
def cluster_acc(y_true, y_pred):
cm = metrics.confusion_matrix(y_true, y_pred)
_make_cost_m = lambda x:-x + np.max(x)
indexes = linear_assignment(_make_cost_m(cm))
indexes = np.concatenate([indexes[0][:,np.newaxis],indexes[1][:,np.newaxis]], axis=-1)
js = [e[1] for e in sorted(indexes, key=lambda x: x[0])]
cm2 = cm[:, js]
acc = np.trace(cm2) / np.sum(cm2)
return acc
Or just import library coclust
from coclust.evaluation.external import accuracy
accuracy(labels, predicted_labels)
Related
I am using sklearn isolation forest for an anomaly detection task. Isolation forest consists of iTrees. As this paper describes, the nodes of the iTrees are split in the following way:
We select any feature (uniformly) randomly and perform a split on a random value of that feature.
But I want to give more weight to some features than the others. So instead of selecting the features with equal probability, I want to draw some features with a higher probability (giving more weight to those features) and other features with a lower probability.
How can I do that? From the source code it seems I have to change the function _generate_bagging_indices in _bagging.py, but not sure.
You can achieve this without changing the source code. Instead, you can tweak your input data by duplicating the features you wish to increase the weight for. If you have a feature appearing twice, the trees will use it twice to split your data, which in practice will mean the same as having doubled the weight of the feature.
In addition to this, you can also choose to reduce the amount of features used by your isolation forest in each tree. This is controlled by the argument max_features. The default value of 1.0 ensures that every feature will be used for each tree. By reducing it, more trees will be trained without the less frequent features in your input.
Illustration
Load Data
from sklearn.ensemble import IsolationForest
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
data = load_iris()
X = data.data
df = pd.DataFrame(X, columns=data.feature_names)
Default settings
IF = IsolationForest()
IF.fit(df)
preds = IF.predict(df)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds)
plt.title("Default settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
Weighted Settings
df1 = df.copy()
weight_feature = 10
for i in range(weight_feature):
df1["duplicated_" + str(i)] = df1["sepal length (cm)"]
IF1 = IsolationForest(max_features=0.3)
IF1.fit(df1)
preds1 = IF1.predict(df1)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds1)
plt.title("Weighted settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
As you can see visually, the second option has used the X-axis more intensively to determine which are the outliers.
I am looking at the tutorial for partial dependence plots in Python. No equation is given in the tutorial or in the documentation. The documentation of the R function gives the formula I expected:
This does not seem to make sense with the results given in the Python tutorial. If it is an average of the prediction of house prices, then how is it negative and small? I would expect values in the millions. Am I missing something?
Update:
For regression it seems the average is subtracted off of the above formula. How would this be added back? For my trained model I can get the partial dependence by
from sklearn.ensemble.partial_dependence import partial_dependence
partial_dependence, independent_value = partial_dependence(model, features.index(independent_feature),X=df2[features])
I want to add (?) back on the average. Would I get this by just using model.predict() on the df2 values with the independent_feature values changed?
how the R formula works
The r formula presented in the question applies to a randomForest. Each tree in a random forest tries to predict the target variable directly. Thus, prediction of each tree lies in the expected interval (in your case, all house prices are positive), and prediction of the ensemble is just the average of all the individual predictions.
ensemble_prediction = mean(tree_predictions)
This is what the formula tells you: just take predictions of all the trees x and average them.
why the Python PDP values are small
In sklearn, however, partial dependence is calculated for a GradientBoostingRegressor. In gradient boosting, each tree predicts the derivative of the loss function at current prediction, which is only indirectly related to the target variable. For GB regression, prediction is given as
ensemble_prediction = initial_prediction + sum(tree_predictions * learning_rate)
and for GB classification predicted probability is
ensemble_prediction = softmax(initial_prediction + sum(tree_predictions * learning_rate))
For both cases, partial dependency is reported as just
sum(tree_predictions * learning_rate)
Thus, initial_prediction (for GradientBoostingRegressor(loss='ls') it equals just the mean of the training y) is not included into the PDP, which makes the predictions negative.
As for the small range of its values, the y_train in your example is small: mean hous value is roughly 2, so house prices are probably expressed in millions.
how the sklearn formula actually works
I have already said that in sklearn the value of partial dependence function is an average of all trees. There is one more tweak: all irrelevant features are averaged away. To describe the actual way of averaging, I will just quote the documentation of sklearn:
For each value of the ‘target’ features in the grid the partial
dependence function need to marginalize the predictions of a tree over
all possible values of the ‘complement’ features. In decision trees
this function can be evaluated efficiently without reference to the
training data. For each grid point a weighted tree traversal is
performed: if a split node involves a ‘target’ feature, the
corresponding left or right branch is followed, otherwise both
branches are followed, each branch is weighted by the fraction of
training samples that entered that branch. Finally, the partial
dependence is given by a weighted average of all visited leaves. For
tree ensembles the results of each individual tree are again averaged.
And if you are still not satisfied, see the source code.
an example
To see that the prediction is already on the scale of the dependent variable (but is just centered), you can look at a very toy example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import plot_partial_dependence
np.random.seed(1)
X = np.random.normal(size=[1000, 2])
# yes, I will try to fit a linear function!
y = X[:, 0] * 10 + 50 + np.random.normal(size=1000, scale=5)
# mean target is 50, range is from 20 to 80, that is +/- 30 standard deviations
model = GradientBoostingRegressor().fit(X, y)
fig, subplots = plot_partial_dependence(model, X, [0, 1], percentiles=(0.0, 1.0), n_cols=2)
subplots[0].scatter(X[:, 0], y - y.mean(), s=0.3)
subplots[1].scatter(X[:, 1], y - y.mean(), s=0.3)
plt.suptitle('Partial dependence plots and scatters of centered target')
plt.show()
You can see that partial dependence plots reflect the true distribution of the centered target variable pretty well.
If you want not only the units, but the mean to coincide with your y, you have to add the "lost" mean to the result of the partial_dependence function and then plot the results manually:
from sklearn.ensemble.partial_dependence import partial_dependence
pdp_y, [pdp_x] = partial_dependence(model, X=X, target_variables=[0], percentiles=(0.0, 1.0))
plt.scatter(X[:, 0], y, s=0.3)
plt.plot(pdp_x, pdp_y.ravel() + model.init_.mean)
plt.show()
plt.title('Partial dependence plot in the original coordinates');
You are looking at a Partial Dependence Plot. A PDP is a graph that represents
a set of variables/predictors and their effect on the target field (in this case price). Those graphs do not estimate actual prices.
It is important to realize that a PDP is not a representation of the dataset values or price. It is a representation of the variables effect on the target field. The negative numbers are logits of probabilities, not raw probabilities.
I'm analyzing a corpus of roughly 2M raw words. I build a model using gensim's word2vec, embed the vectors using sklearn TSNE, and cluster the vectors (from word2vec, not TSNE) using sklearn DBSCAN. The TSNE output looks about right: the layout of the words in 2D space seems to reflect their semantic meaning. There's a group of misspellings, clothes, etc.
However, I'm having trouble getting DBSCAN to output meaningful results. It seems to label almost everything in the "0" group (colored teal in the images). As I increase epsilon, the "0" group takes over everything. Here are screenshots with epsilon=10, and epsilon=12.5. With epsilon=20, almost everything is in the same group.
I would expect, for instance, the group of "clothing" words to all get clustered together (they're unclustered # eps=10). I would also expect more on the order of 100 clusters, as opposed to 5 - 12 clusters, and to be able to control the size and number of the clusters using epsilon.
A few questions, then. Am I understanding the use of DBSCAN correctly? Is there another clustering algorithm that might be a better choice? How can I know what a good clustering algorithm for my data is?
Is it safe to assume my model is tuned pretty well, given that the TSNE looks about right?
What other techniques can I use in order to isolate the issue with clustering? How do I know if it's my word2vec model, my use of DBSCAN, or something else?
Here's the code I'm using to perform DBSCAN:
import sys
import gensim
import json
from optparse import OptionParser
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# snip option parsing
model = gensim.models.Word2Vec.load(options.file);
words = sorted(model.vocab.keys())
vectors = StandardScaler().fit_transform([model[w] for w in words])
db = DBSCAN(eps=options.epsilon).fit(vectors)
labels = db.labels_
core_indices = db.core_sample_indices_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated {:d} clusters".format(n_clusters), file=sys.stderr)
output = [{'word': w, 'label': np.asscalar(l), 'isCore': i in core_indices} for i, (l, w) in enumerate(zip(labels, words))]
print(json.dumps(output))
I'm having the same problem and trying these solutions, posting it here hoping it could help you or someone else:
Adapting the min_samples value in DBSCAN to your problem, in my case the default value, 4, was too high as some clusters could also be formed by 2 words.
Obviously, starting from a better corpus could be the solution to your problem, if the model is badly initialized, it won't perform
Perhaps DBSCAN is not the better choiche, I am also approaching K-Means for this problem
Iterating the creation of the model also helped me understand better which parameters to choose:
for eps in np.arange(0.1, 50, 0.1):
dbscan_model = DBSCAN(eps=eps, min_samples=3, metric_params=None, algorithm="auto", leaf_size=30, p=None, n_jobs=1)
labels = dbscan_model.fit_predict(mat_words)
clusters = {}
for i, w in enumerate(words_found):
clusters[w] = labels[i]
dbscan_clusters = sorted(clusters.items(), key=operator.itemgetter(1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = len([lab for lab in labels if lab == -1])
print("EPS: ", eps, "\tClusters: ", n_clusters, "\tNoise: ", n_noise)
As far as I can tell from the various visualizations of word2vec, the vectors probably won't cluster well.
First of all, there is nothing in the word2vec objective that would encourage clustering. On the contrary, it optimizes words to resemble the neighbors, so nearby words will get similar vectors. That is necessary for the word substitution aim.
Secondly, based on the plots, I am not sure there are "dense" regions separated by areas of low density in there. Instead, the data usually more looks like one big blob. But when almost all the vectors are in that big blob, they will almost all be in the same cluster!
Last but not least, most words probably don't cluster. Yes, numbers will likely cluster. You'd expect verbs to cluster vs. nouns, but "to bear" and "a bear" is the same to word2vec, and so is "bar" (verb and noun) etc. - there are too many polysemies for such clusters to be well separated even if the embedding were perfect!
Your best guess is to increase minors and lower epsilon until most data is noise, and you find some remaining clusters.
I want to determine the probability that a data point belongs to a population of data. I read that sklearn GMM can do this. I tried the following....
import numpy as np
from sklearn.mixture import GMM
training_data = np.hstack((
np.random.normal(500, 100, 2000).reshape(-1, 1),
np.random.normal(500, 100, 2000).reshape(-1, 1),
))
# train the classifier and get max score
g = GMM(n_components=1)
g.fit(training_data)
scores = g.score(training_data)
max_score = np.amax(scores)
# create a candidate data point and calculate the probability
# it belongs to the training population
candidate_data = np.array([[490, 450]])
candidate_score = g.score(candidate_data)
From this point on I'm not sure what to do. I was reading that I have to normalize the log probability in order to get the probability of a candidate data point belonging to a population. Would that be something like this...
candidate_probability = (np.exp(candidate_score)/np.exp(max_score)) * 100
print candidate_probability
>>> [ 87.81751913]
The number does not seem unreasonable, but I'm really out of my comfort zone here so I thought I'd ask. Thanks!
The candidate_probability you are using would not be statistically correct.
I think that what you would have to do is calculate the probabilities of the sample point being a member of only one of the individual gaussians (from the weights and multivariate cumulative distribution functions (CDFs)) and sum up those probabilities. The largest problem is that I cannot find a good python package that would calculate the multivariate CDFs. Unless you are able to find one, this paper would be a good starting point https://upload.wikimedia.org/wikipedia/commons/a/a2/Cumulative_function_n_dimensional_Gaussians_12.2013.pdf
I am using nltk with Python and I would like to plot the ROC curve of my classifier (Naive Bayes). Is there any function for plotting it or should I have to track the True Positive rate and False Positive rate ?
It would be great if someone would point me to some code already doing it...
Thanks.
PyROC looks simple enough: tutorial, source code
This is how it would work with the NLTK naive bayes classifier:
# class labels are 0 and 1
labeled_data = [
(1, featureset_1),
(0, featureset_2),
(1, featureset_3),
# ...
]
# naive_bayes is your already trained classifier,
# preferrably not on the data you're testing on :)
from pyroc import ROCData
roc_data = ROCData(
(label, naive_bayes.prob_classify(featureset).prob(1))
for label, featureset
in labeled_data
)
roc_data.plot()
Edits:
ROC is for binary classifiers only. If you have three classes, you can measure the performance of your positive and negative class separately (by counting the other two classes as 0, like you proposed).
The library expects the output of a decision function as the second value of each tuple. It then tries all possible thresholds, e.g. f(x) >= 0.8 => classify as 1, and plots a point for each threshold (that's why you get a curve in the end). So if your classifier guesses class 0, you actually want a value closer to zero. That's why I proposed .prob(1)