I am using scikit-learn's FeatureAgglomeration to apply a hierarchical clustering procedure to features rather than to observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had shape (990, 15); after feature agglomeration, df_reduced now has shape (990, 5).
How do I now find out how the original 15 features have been clustered together? In other words, which original features from df make up each of the 5 new features in df_reduced?
How the features within each cluster are combined during transform is determined by how you set up the hierarchical clustering. The reduced feature set simply consists of the n_clusters cluster centers (each an n_samples-dimensional vector). For certain applications you might want to compute the centers manually using a different definition of the cluster center (e.g. the median instead of the mean, to reduce the influence of outliers).
import numpy as np

n_features = 15
n_clusters = 5
feature_identifier = np.arange(n_features)
feature_groups = [feature_identifier[agglo.labels_ == i] for i in range(n_clusters)]
# each new feature is the mean over its group of original columns (one value per sample)
new_features = [df.loc[:, df.columns[group]].mean(axis=1) for group in feature_groups]
Don't forget to standardize the features beforehand (for example with sklearn's StandardScaler). Otherwise you end up grouping features by the scale of their values rather than clustering similar behavior.
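For instance, a minimal sketch (assuming the same df as above) of scaling before the agglomeration could look like this:
from sklearn import cluster
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)  # zero mean, unit variance per column
agglo = cluster.FeatureAgglomeration(n_clusters=5)
df_reduced = agglo.fit_transform(X_scaled)  # shape (990, 5)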
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting the clusterer, agglo.labels_ contains an array that tells which cluster of the reduced dataset each feature of the original dataset belongs to.
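As a small illustration (the printed column names will of course be whatever your df actually contains), you could group the original column names by cluster like this:
from collections import defaultdict

groups = defaultdict(list)
for column, cluster_id in zip(df.columns, agglo.labels_):
    groups[cluster_id].append(column)
print(dict(groups))  # e.g. {0: ['feat_1', 'feat_7'], 1: [...], ...}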
Scikit documentation states that:
Method for initialization:
‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
My data has 10 (predicted) clusters and 7 features. However, I would like to pass an array of shape 10 by 6, i.e. I want 6 dimensions of each centroid to be predefined by me, but the 7th dimension to be iterated freely using k-means++. (In other words, I do not want to specify the initial centroids completely, but rather control 6 dimensions and leave only one dimension to vary for the initial clusters.)
I tried passing a 10x6 array, hoping it would work, but it just throws an error.
Sklearn does not allow you to perform this kind of fine-grained operation.
The only possibility is to provide a 7th feature value that is random, or similar to what Kmeans++ would have achieved.
So basically you can estimate a good value for this as follows:
import numpy as np
from sklearn.cluster import KMeans

nb_clust = 10
# your data
X = np.random.randn(7 * 1000).reshape((1000, 7))
# your 6-column centroids
cent_6cols = np.random.randn(6 * nb_clust).reshape((nb_clust, 6))
# artificially fix your centroids
# (note: assigning cluster_centers_ without fitting may not work on newer scikit-learn versions)
km = KMeans(n_clusters=nb_clust)
km.cluster_centers_ = cent_6cols
# find the points lying in each cluster given your partial initialization
initial_prediction = km.predict(X[:, 0:6])
# for the 7th column, use the average value of the points
# lying in the cluster given by your partial centroids
cent_7cols = np.zeros((nb_clust, 7))
cent_7cols[:, 0:6] = cent_6cols
for i in range(nb_clust):
    init_7th = X[np.where(initial_prediction == i), 6].mean()
    cent_7cols[i, 6] = init_7th
# now the 7th column is initialized in a k-means++-like way,
# so you can use cent_7cols as your initial centroids
truekm = KMeans(n_clusters=nb_clust, init=cent_7cols, n_init=1)
truekm.fit(X)
That is a very nonstandard variation of k-means. So you cannot expect sklearn to be prepared for every exotic variation. That would make sklearn slower for everybody else.
In fact, your approach is more like certain regression approaches (predicting the last value of the cluster centers) rather than clustering. I also doubt the results will be much better than simply setting the last value to the average of all points assigned to the cluster center using the other 6 dimensions only. Try partitioning your data based on the nearest center (ignoring the last column) and then setting the last column to be the arithmetic mean of the assigned data.
However, sklearn is open source.
So get the source code and modify k-means: initialize the last component randomly, and while running k-means only update the last column. It's easy to modify it this way - but it's very hard to design an efficient API that allows such customization through trivial parameters, so use the source code to customize at this level.
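To make that idea concrete, here is a rough numpy sketch of such a modified Lloyd loop (not sklearn code; the function name and the 20 iterations are arbitrary), where the first 6 center coordinates stay fixed and only the 7th is updated:
import numpy as np

def kmeans_fixed_first6(X, fixed_centers6, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    k = fixed_centers6.shape[0]
    # initialize the last component randomly
    centers = np.hstack([fixed_centers6, rng.standard_normal((k, 1))])
    for _ in range(n_iter):
        # assign each point to its nearest center (full 7-D distance)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        # update only the last column of each center
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centers[j, 6] = members[:, 6].mean()
    return centers, assign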
I read a paper whose retrieval system is based on SIFT descriptors and fast approximate k-means clustering. I installed pyflann. If I am not mistaken, the following commands only find the indices of the closest data points to a specific sample (for example, here, the indices of the 5 nearest points in dataset for each point in testset):
from pyflann import *
from numpy import *
from numpy.random import *
dataset = rand(10000, 128)
testset = rand(1000, 128)
flann = FLANN()
result, dists = flann.nn(dataset, testset, 5, algorithm="kmeans",
                         branching=32, iterations=7, checks=16)
I went through the user manual but could not find how to do k-means clustering with FLANN, nor how to then assign the test set based on the cluster centers. With scikit-learn we can use k-means++ clustering and then fit the dataset with the model:
kmeans=KMeans(n_clusters=100,init='k-means++',random_state = 0, verbose=0)
kmeans.fit(dataset)
and later we can assign labels to the test set by using KDTree for example.
from sklearn.neighbors import KDTree  # for example

kdt = KDTree(kmeans.cluster_centers_)
Q = testset  # query
kdt_dist, kdt_idx = kdt.query(Q, k=1)  # knn
test_labels = kdt_idx  # knn=1 labels
Could someone please explain how I can use the same procedure with FLANN? (I mean clustering the dataset (finding the cluster centers and quantizing the features) and then quantizing the testset based on the cluster centers found in the previous step.)
You won't be able to implement the best variations with FLANN, because those use two indexes at the same time and are ugly to implement.
You can, however, build a new index on the centers for every iteration. Unless you have k > 1000 it probably will not help much.
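As a rough sketch of that idea (reusing only the flann.nn call from the question, with the same parameter values shown there), each Lloyd iteration rebuilds an index on the current centers and assigns every point to its nearest one:
import numpy as np
from pyflann import FLANN

def approx_kmeans(data, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # start from k random points of the dataset as centers
    centers = data[rng.choice(len(data), k, replace=False)].copy()
    flann = FLANN()
    for _ in range(n_iter):
        # build a fresh index on the current centers and assign each point
        assign, _ = flann.nn(centers, data, 1, algorithm="kmeans",
                             branching=32, iterations=7, checks=16)
        assign = assign.ravel()
        # recompute each center as the mean of its assigned points
        for j in range(k):
            members = data[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assign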
I have a set of data with known labels. I want to try clustering and see if I can get the same clusters given by known labels. To measure the accuracy, I need to get something like a confusion matrix.
I know I can easily get a confusion matrix for a test set in a classification problem, and I have already tried that.
However, it can't be used for clustering, as it expects both columns and rows to have the same set of labels, which makes sense for a classification problem. For a clustering problem what I expect is something like this:
Rows - Actual labels
Columns - New cluster names (i.e. cluster-1, cluster-2 etc.)
Is there a way to do this?
Edit: Here are more details.
In sklearn.metrics.confusion_matrix, it expects y_test and y_pred to have the same values, and labels to be the labels of those values.
That's why it gives a matrix which has the same labels for both rows and columns.
But in my case (KMeans clustering), the true values are strings and the estimated values are numbers (i.e. cluster numbers).
Therefore, if I call confusion_matrix(y_true, y_pred) it gives below error.
ValueError: Mix of label input types (string and number)
This is the real problem. For a classification problem, this makes sense. But for a clustering problem, this restriction shouldn't be there, because real label names and new cluster names don't need to be the same.
With this, I understand I'm trying to use a tool that is meant for classification problems on a clustering problem. So, my question is: is there a way I can get such a matrix for my clustered data?
Hope the question is now clearer. Please let me know if it isn't.
I wrote some code myself:
# Compute confusion matrix
def confusion_matrix(act_labels, pred_labels):
    unique_labels = sorted(set(act_labels))
    clusters = sorted(set(pred_labels))
    cm = [[0 for _ in clusters] for _ in unique_labels]
    for i, act_label in enumerate(unique_labels):
        for j, pred_label in enumerate(pred_labels):
            if act_labels[j] == act_label:
                # map the cluster id to its column position
                cm[i][clusters.index(pred_label)] += 1
    return cm

# Example
labels = ['a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c']
pred = [1, 1, 2,
        0, 1, 2,
        1, 1, 1,
        0, 1, 2]
cnf_matrix = confusion_matrix(labels, pred)
print('\n'.join([''.join(['{:4}'.format(item) for item in row])
                 for row in cnf_matrix]))
Edit:
(Dayyyuumm) just found that I could do this easily with Pandas Crosstab :-/.
import pandas as pd

labels = ['a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c',
          'a', 'b', 'c']
pred = [1, 1, 2,
        0, 1, 2,
        1, 1, 1,
        0, 1, 2]
# Create a DataFrame with labels and clusters as columns: df
df = pd.DataFrame({'Labels': labels, 'Clusters': pred})
# Create crosstab: ct
ct = pd.crosstab(df['Labels'], df['Clusters'])
# Display ct
print(ct)
You can easily compute a pairwise intersection matrix.
But you may need to do this yourself, because sklearn's confusion matrix is geared towards the classification use case.
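A minimal numpy sketch of such a pairwise intersection matrix (assuming y_true holds the string labels and y_pred the cluster numbers):
import numpy as np

def intersection_matrix(y_true, y_pred):
    true_values, true_idx = np.unique(y_true, return_inverse=True)
    pred_values, pred_idx = np.unique(y_pred, return_inverse=True)
    m = np.zeros((len(true_values), len(pred_values)), dtype=int)
    # count how many points fall into each (true label, cluster) pair
    np.add.at(m, (true_idx, pred_idx), 1)
    return true_values, pred_values, m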
I would like to run k-means clustering with more than 3 features. I've tried it with two features and I'm wondering how to provide more than 3 features to sklearn.cluster.KMeans.
Here's my code and the dataframe from which I'd like to select features. I have multiple dataframes as input and have to provide them as features.
# currently two features are selected
# I'd like to combine more than 3 features and provide them to the dataset
df_features = pd.merge(df_max[['id', 'max']],
                       df_var[['id', 'variance']], on='id', how='left')
cols = list(df_features.loc[:, 'max':'variance'])
X = df_features.as_matrix(columns=cols)

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
centroid = kmeans.cluster_centers_
labels = kmeans.labels_

colors = ["g.", "r.", "c."]
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
plt.scatter(centroid[:, 0], centroid[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
Generally you wouldn't want id to be a feature, because, unless you have good reason to believe otherwise, it does not correlate with anything.
As long as you feed a valid matrix X to kmeans.fit(X), it will run the k-means algorithm for you regardless of the number of features in X. Though, if you have a huge number of features, it may take longer to finish.
The problem is then how to construct X. As you have shown in your example, you can simply merge dataframes, select the wanted columns, and extract the feature matrix with an .as_matrix() call. If you have more dataframes and columns, I guess you just merge more and select more.
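For example, a quick sketch of merging a third dataframe and feeding three features to KMeans (df_mean and its 'mean' column are hypothetical, as are the other frames from the question):
import pandas as pd
from sklearn.cluster import KMeans

df_features = (df_max[['id', 'max']]
               .merge(df_var[['id', 'variance']], on='id', how='left')
               .merge(df_mean[['id', 'mean']], on='id', how='left'))  # hypothetical third frame
X = df_features[['max', 'variance', 'mean']].to_numpy()  # .as_matrix() is gone in newer pandas
kmeans = KMeans(n_clusters=3).fit(X)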
Feature selection and dimensional reduction may come in handy once you have more than enough features in your dataset. Read more about them when you have time.
P.S. Why scipy in the title?
I have a text corpus that contains 1000+ articles, each on a separate line. I used hierarchical clustering with sklearn in Python to produce clusters of related articles. This is the code I used to do the clustering.
Note: X is a sparse 2D array, with rows corresponding to documents and columns corresponding to terms.
# Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(affinity="euclidean",linkage="complete",n_clusters=3)
model.fit(X.toarray())
clustering = model.labels_
print (clustering)
I specify n_clusters = 3 as the level at which to cut off the tree to get a flat clustering, like k-means.
My question is: how do I get the top N most frequent words in each cluster, so that I can suggest a topic for each cluster?
Thanks
One option is to convert X from a sparse array to a pandas DataFrame. The rows will still correspond to documents and the columns to words. If you have a list of your vocabulary in the order of your array columns (used as your_word_list below), you could try something like this:
import pandas as pd

X = pd.DataFrame(X.toarray(), columns=your_word_list)  # columns argument is optional
X['Cluster'] = clustering  # add a column with the cluster number of each document
word_frequencies_by_cluster = X.groupby('Cluster').sum()
# To get the sorted list for a numbered cluster, in this case 1
print(word_frequencies_by_cluster.loc[1, :].sort_values(ascending=False))
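To get just the top N words for every cluster (N = 10 here is arbitrary), a small follow-up sketch:
N = 10
for cluster_id, counts in word_frequencies_by_cluster.iterrows():
    top_words = counts.sort_values(ascending=False).head(N).index.tolist()
    print(cluster_id, top_words)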
As a side note, you may want to look into algorithms (e.g. LDA) and distance metrics (cosine) that are more commonly used for natural language processing. If you are looking to extract topics, there is a nice sklearn tutorial on topic modeling.