I have a set of data with known labels. I want to try clustering and see if I can get the same clusters given by known labels. To measure the accuracy, I need to get something like a confusion matrix.
I know I can get a confusion matrix easily for a test set of a classification problem. I already tried that like this.
However, it can't be used for clustering, as it expects both columns and rows to have the same set of labels, which makes sense for a classification problem. But for a clustering problem what I expect is something like this.
Rows - Actual labels
Columns - New cluster names (i.e. cluster-1, cluster-2 etc.)
Is there a way to do this?
Edit: Here are more details.
In sklearn.metrics.confusion_matrix, it expects y_test and y_pred to have the same values, and labels to be the labels of those values.
That's why it gives a matrix which has the same labels for both rows and columns like this.
But in my case (KMeans clustering), the real values are strings and the estimated values are numbers (i.e. cluster numbers).
Therefore, if I call confusion_matrix(y_true, y_pred) it gives the error below.
ValueError: Mix of label input types (string and number)
This is the real problem. For a classification problem, this makes sense. But for a clustering problem, this restriction shouldn't be there, because real label names and new cluster names don't need to be the same.
With this, I understand I'm trying to use a tool that is meant for classification problems on a clustering problem. So, my question is: is there a way I can get such a matrix for my clustered data?
Hope the question is now clearer. Please let me know if it isn't.
I wrote a function myself.
# Compute a clustering "confusion matrix": rows = actual labels, columns = clusters
def confusion_matrix(act_labels, pred_labels):
    unique_labels = list(set(act_labels))
    clusters = list(set(pred_labels))
    cm = [[0 for _ in range(len(clusters))] for _ in range(len(unique_labels))]
    for i, act_label in enumerate(unique_labels):
        for j, pred_label in enumerate(pred_labels):
            if act_labels[j] == act_label:
                # look up the column for this cluster instead of assuming
                # cluster ids are already 0..n-1
                cm[i][clusters.index(pred_label)] += 1
    return cm
# Example
labels = ['a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c']
pred = [1, 1, 2,
        0, 1, 2,
        1, 1, 1,
        0, 1, 2]
cnf_matrix = confusion_matrix(labels, pred)
print('\n'.join([''.join(['{:4}'.format(item) for item in row])
                 for row in cnf_matrix]))
Edit:
(Dayyyuumm) just found that I could do this easily with Pandas Crosstab :-/.
import pandas as pd

labels = ['a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c']
pred = [1, 1, 2,
        0, 1, 2,
        1, 1, 1,
        0, 1, 2]
# Create a DataFrame with labels and clusters as columns: df
df = pd.DataFrame({'Labels': labels, 'Clusters': pred})
# Create crosstab: ct
ct = pd.crosstab(df['Labels'], df['Clusters'])
# Display ct
print(ct)
You can easily compute a pairwise intersection matrix.
But you may need to do this yourself, since sklearn's confusion matrix is built for the classification use case.
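One way to get that pairwise intersection matrix without writing the loops yourself is the contingency-matrix helper in sklearn's clustering metrics; a minimal sketch, reusing the example data from above:
from sklearn.metrics.cluster import contingency_matrix

labels = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
pred = [1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 1, 2]

# rows follow the sorted true labels, columns the sorted cluster ids
print(contingency_matrix(labels, pred))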
I have a Keras neural network with 26 features and 100 targets I want to explain with the SHAP python library.
In order to plot the force plot, for instance, I do:
shap.force_plot(exp.expected_value[i], shap_values[j][k], x_val.columns)
Where:
exp.expected_value is a list of size 100 with the base values for each of my targets (this is at least what I understand). The index i refers to the i-th target, I assume.
shap_values refers to the Shapley values of all the features for each of the targets in each validation case. Therefore, j runs from 0 to 99 (i.e. over my 100 targets) and k runs from 0 to the total number of validation cases.
What I find confusing is that i and j can actually be different and I do get a plot that looks OK. However, shouldn't they always be the same index? Shouldn't the i-th baseline target always be compared to the shap values of the i-th target?
Am I understanding the indices wrong?
i and j should be the same, because you're plotting how the i-th target is affected by the features, from base value to prediction:
shap.force_plot(exp.expected_value[i], shap_values[i][k], x_val.columns)
where:
i stands for ith target class
k stands for kth sample to be explained.
The reason is that exp.expected_value will be of shape [num_targets] (these are the base values the shap values are added to), and shap_values, if converted to a numpy array, should be of shape [num_targets, num_samples, num_features].
So, e.g., to get shap values for kth datapoint in raw space, one would do:
shap_values[:,k,:].sum(1) + base_values
and for models using softmax to get to probability space one would do:
softmax(shap_values[:,k,:].sum(1) + base_values)
Note, this is assuming shap_values are of numpy array type.
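For concreteness, here is a minimal numpy sketch of that reconstruction; the shapes and the softmax helper are illustrative assumptions, not part of the SHAP API:
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# assumed shapes: shap_values -> (num_targets, num_samples, num_features),
#                 base_values -> (num_targets,); random data just for illustration
num_targets, num_samples, num_features = 100, 50, 26
shap_values = np.random.randn(num_targets, num_samples, num_features)
base_values = np.random.randn(num_targets)

k = 0                                             # k-th sample to be explained
raw = shap_values[:, k, :].sum(1) + base_values   # raw (margin) space
prob = softmax(raw)                               # probability space, if the model ends in softmax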
Please ask if something is not clear.
Scikit documentation states that:
Method for initialization:
‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
My data has 10 (predicted) clusters and 7 features. However, I would like to pass an array of shape 10 by 6, i.e. I want 6 dimensions of each centroid to be predefined by me, but the 7th dimension to be initialized freely, k-means++ style. (In other words, I do not want to specify the full initial centroids, but rather fix 6 dimensions and only leave one dimension to vary for the initial clusters.)
I tried to pass a 10x6 array, in the hope it would work, but it just throws an error.
Sklearn does not allow you to perform this kind of fine-grained operation.
The only possibility is to provide a 7th feature value that is random, or similar to what Kmeans++ would have achieved.
So basically you can estimate a good value for this as follows:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

nb_clust = 10
# your data
X = np.random.randn(7 * 1000).reshape((1000, 7))
# your 6-column centroids
cent_6cols = np.random.randn(6 * nb_clust).reshape((nb_clust, 6))

# find the points lying in each cluster given your partial (6-column) centroids
initial_prediction = pairwise_distances_argmin(X[:, 0:6], cent_6cols)

# For the 7th column, provide the average value of the points
# lying in the cluster given by your partial centroids
cent_7cols = np.zeros((nb_clust, 7))
cent_7cols[:, 0:6] = cent_6cols
for i in range(nb_clust):
    cent_7cols[i, 6] = X[initial_prediction == i, 6].mean()

# now the 7th column is initialized in a k-means++-like fashion,
# so you can use cent_7cols as your initial centroids
truekm = KMeans(n_clusters=nb_clust, init=cent_7cols, n_init=1)
That is a very nonstandard variation of k-means. So you cannot expect sklearn to be prepared for every exotic variation. That would make sklearn slower for everybody else.
In fact, your approach is more like certain regression approaches (predicting the last value of the cluster centers) than like clustering. I also doubt the results will be much better than simply setting the last value to the average of all points assigned to the cluster center using the other 6 dimensions only. Try partitioning your data based on the nearest center (ignoring the last column) and then setting the last column to be the arithmetic mean of the assigned data.
However, sklearn is open source.
So get the source code and modify k-means: initialize the last component randomly, and while running k-means only update the last column. It's easy to modify it this way, but it's very hard to design an efficient API that allows such customizations through trivial parameters, so use the source code to customize at this level.
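For what it's worth, a rough standalone sketch of that idea in plain numpy (this is not sklearn's implementation, just a Lloyd-style loop where the assignment uses all 7 dimensions but only the 7th centroid column is ever updated):
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 7))           # toy data with 7 features
fixed_cols = rng.standard_normal((10, 6))    # your 10x6 predefined centroid columns

# start with a random 7th column, then iterate
centers = np.hstack([fixed_cols, rng.standard_normal((10, 1))])
for _ in range(50):
    # assignment step: nearest centroid over all 7 dimensions
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(axis=1)
    # update step: recompute only the last centroid column
    for j in range(len(centers)):
        mask = labels == j
        if mask.any():
            centers[j, 6] = X[mask, 6].mean()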
I have been using sklearn's KMeans implementation.
I have been clustering a labeled dataset, and I have been using sklearn's clustering metrics to test the clustering performance.
Sklearn's KMeans output is, as you know, a list of numbers in the range of k_clusters. However, my labels are strings.
So far I had no problems with them since the metrics from sklearn.metrics.cluster work with mixed inputs (int & str label lists).
However, now I want to use some of the classification metrics, and from what I gather the inputs k_true and k_pred need to come from the same set: either numbers in the range of k, or the string labels that my dataset is using. If I try it anyway, it returns the following error:
AttributeError: 'bool' object has no attribute 'sum'
So, how could I translate the k-means labels into another type of labels? Or even the other way around (string labels -> integer labels)?
How could I even begin implementing it? Since k-means is pretty non-deterministic, the labels might change from iteration to iteration. Is there a legitimate way to correctly translate the KMeans labels?
EDIT:
EXAMPLE
for k = 4
kmeans output: [0,3,3,2,........0]
class labels : ['CAT','DOG','DOG','BIRD',.......'CHICKEN']
Clustering is not classification.
The methods do not predict a label, so you must not use a classification evaluation measure. That would be like measuring the quality of an apple in miles per gallon...
If you insist on doing the wrong thing(tm), then use the Hungarian algorithm to find the best mapping. But beware: the number of clusters and the number of classes will usually not be the same. If that is the case, such a mapping will either be unfairly negative (extra clusters are left unmapped) or unfairly positive (mapping multiple clusters to the same label would make the trivial "N points as N clusters" solution look optimal). It's better to only use clustering measures.
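If you do go the Hungarian-algorithm route, here is a minimal sketch using scipy; the small arrays are made-up stand-ins in the spirit of the question's example, and the mapping only covers min(n_clusters, n_classes) pairs:
import numpy as np
from scipy.optimize import linear_sum_assignment

y_true = np.array(['CAT', 'DOG', 'DOG', 'BIRD', 'CAT', 'CHICKEN'])
y_pred = np.array([0, 3, 3, 2, 0, 1])

classes = np.unique(y_true)
clusters = np.unique(y_pred)

# contingency table: how many points of each class fall into each cluster
cont = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                 for c in classes])

# Hungarian algorithm on the negated counts = maximize the matched counts
row_ind, col_ind = linear_sum_assignment(-cont)
mapping = {clusters[j]: classes[i] for i, j in zip(row_ind, col_ind)}
print(mapping)  # cluster id -> best-matching class name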
You can create a mapping using a dictionary, say
mapping_dict = { 0: 'cat', 1: 'chicken', 2:'bird', 3:'dog'}
Then you can simply apply this mapping using, say, a list comprehension, etc.
Suppose your labels are stored in a list kmeans_predictions
mapped_predictions = [ mapping_dict[x] for x in kmeans_predictions]
Then use mapped_predictions as your predictions
Update: Based on your comments, I believe you have to do it the other way round, i.e. convert your labels into `int` mappings.
Also, you cannot use just any classification metric here. Use the completeness score, v-measure and homogeneity, as these are better suited to clustering problems. It would be incorrect to just blindly use any random classification metric here.
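For instance, a small sketch with made-up arrays (sklearn's clustering metrics also accept the string labels directly, so the integer mapping is only needed if another tool insists on it):
from sklearn.metrics import homogeneity_completeness_v_measure
from sklearn.preprocessing import LabelEncoder

class_labels = ['CAT', 'DOG', 'DOG', 'BIRD', 'CAT', 'CHICKEN']
kmeans_labels = [0, 3, 3, 2, 0, 1]

# string labels -> integer labels, if you need them in int form
y_true_int = LabelEncoder().fit_transform(class_labels)

h, c, v = homogeneity_completeness_v_measure(y_true_int, kmeans_labels)
print(h, c, v)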
I am using scikit-learn's feature agglomeration to run a hierarchical clustering procedure on features rather than on observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had the shape (990, 15), after using feature agglomeration, df_reduced now has (990, 5).
How do now find out how the original 15 features have been clustered together? In other words, what original features from df make up each of the 5 new features in df_reduced?
How the features within each cluster are combined during transform is set by the pooling function (the mean by default). The reduced feature set then simply consists of the n_clusters cluster centers, which are n_samples-dimensional vectors. For certain applications you might think of computing the centers manually using a different definition of the cluster center (e.g. the median instead of the mean, to avoid the influence of outliers, etc.).
import numpy as np

n_clusters = 5
n_features = 15
feature_identifier = np.arange(n_features)
# which original features ended up in each of the 5 clusters
feature_groups = [feature_identifier[agglo.labels_ == i] for i in range(n_clusters)]
# recompute each reduced feature as the mean over its group's columns (per sample)
new_features = [df.loc[:, df.columns[group]].mean(axis=1) for group in feature_groups]
Don't forget to standardize the features beforehand (for example using sklearn's scaler). Otherwise you will be grouping the scales of the quantities rather than clustering similar behavior.
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting the clusterer, agglo.labels_ tells you to which cluster in the reduced dataset each feature in the original dataset belongs.
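For a quick look at the grouping by column name, a small sketch (assuming the df and agglo objects from the question):
# print the original column names that make up each of the 5 reduced features
for cluster_id in range(5):
    print(cluster_id, list(df.columns[agglo.labels_ == cluster_id]))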
I would like to run k-means clustering with more than 3 features. I've tried it with two features and am wondering how to provide more than 3 features to sklearn.cluster.KMeans.
Here's my code and the dataframe from which I'd like to select features. I have multiple dataframes as input and I have to provide them as features.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# currently two features are selected
# I'd like to combine more than 3 features and provide them to the dataset
df_features = pd.merge(df_max[['id', 'max']],
                       df_var[['id', 'variance']], on='id', how='left')

cols = list(df_features.loc[:, 'max':'variance'])
X = df_features.as_matrix(columns=cols)  # in newer pandas: df_features[cols].to_numpy()

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroid = kmeans.cluster_centers_
labels = kmeans.labels_

colors = ["g.", "r.", "c."]

for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

plt.scatter(centroid[:, 0], centroid[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
Generally you wouldn't want id to be a feature because, unless you have good reason to believe otherwise, it does not correlate with anything.
As long as you feed a valid matrix X into kmeans.fit(X), it will run the KMeans algorithm for you regardless of the number of features in X. Though, if you have a huge number of features, it may take longer to finish.
The problem is then how to construct X. As you have shown in your example, you can simply merge the dataframes, select the wanted columns, and extract the feature matrix with an .as_matrix() call. If you have more dataframes and columns, I guess you just merge more and select more.
Feature selection and dimensional reduction may come in handy once you have more than enough features in your dataset. Read more about them when you have time.
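For instance, a rough sketch with toy stand-ins for the question's per-id dataframes (df_mean and its 'mean' column are hypothetical, just to show a third feature being merged in):
import pandas as pd
from sklearn.cluster import KMeans

df_max = pd.DataFrame({'id': [1, 2, 3, 4], 'max': [10.0, 2.0, 8.0, 1.0]})
df_var = pd.DataFrame({'id': [1, 2, 3, 4], 'variance': [0.5, 0.1, 0.4, 0.2]})
df_mean = pd.DataFrame({'id': [1, 2, 3, 4], 'mean': [5.0, 1.0, 4.0, 0.5]})

# merge as many per-id dataframes as you have features
df_features = (df_max.merge(df_var, on='id', how='left')
                     .merge(df_mean, on='id', how='left'))

# 3 features here; add more columns to the list as needed
X = df_features[['max', 'variance', 'mean']].to_numpy()

kmeans = KMeans(n_clusters=3).fit(X)
print(kmeans.labels_)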
P.S. Why scipy in the title?