I have a numpy text file array at: https://github.com/alvations/anythingyouwant/blob/master/WN_food.matrix
It's a distance matrix between terms and each other, my list of terms are as such: http://pastebin.com/2xGt7Xjh
I used the following code to generate a hierarchical clustering:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
matrix = np.loadtxt('WN_food.matrix')
n_clusters = 518
model = AgglomerativeClustering(n_clusters=n_clusters,
                                linkage="average", affinity="cosine")
model.fit(matrix)
To get the clusters for each term, I could have done:
for term, clusterid in enumerate(model.labels_):
    print(term, clusterid)
But how do I traverse the tree that the AgglomerativeClustering outputs?
Is it possible to convert it into a scipy dendrogram (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html)? And after that how do I traverse the dendrogram?
I've answered a similar question for sklearn.cluster.ward_tree:
How do you visualize a ward tree from sklearn.cluster.ward_tree?
AgglomerativeClustering outputs the tree in the same way, in the children_ attribute. Here's an adaptation of the code in the ward tree question for AgglomerativeClustering. It outputs the structure of the tree in the form (node_id, left_child, right_child) for each node of the tree.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
import itertools
X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
model = AgglomerativeClustering(linkage="average", affinity="cosine")
model.fit(X)
ii = itertools.count(X.shape[0])
[{'node_id': next(ii), 'left': x[0], 'right':x[1]} for x in model.children_]
https://stackoverflow.com/a/26152118
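On the second half of the question (converting the result to a SciPy dendrogram): here is a minimal sketch, not from the linked answer, of one common way to do it. It assumes a scikit-learn version that exposes distances_ when the model is fit with compute_distances=True (or with a distance_threshold); older versions only provide children_, in which case you would need to supply the merge heights yourself.
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
# Depending on your scikit-learn version this keyword is `affinity` or `metric`.
model = AgglomerativeClustering(linkage="average", affinity="cosine",
                                compute_distances=True)
model.fit(X)
# For each merge in children_, count how many original samples it covers.
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    counts[i] = sum(1 if child < n_samples else counts[child - n_samples]
                    for child in merge)
# SciPy linkage format: one row [left, right, height, n_samples_under_node] per merge.
linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float)
dendrogram(linkage_matrix)
Traversing the resulting hierarchy can then be done with scipy.cluster.hierarchy.to_tree, which returns a ClusterNode with left/right children.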
Adding to A.P.'s answer, here is code that will give you a dictionary of membership: members[node_id] gives all the data point indices (0 to n-1) that fall under that node.
on_split is a simple reformat of A.P.'s clusters that gives the two clusters formed when node_id is split.
up_merge tells which node a given node_id merges into and which node_id it must be combined with to form that merge.
import copy
# data_x is your data array and fit_cluster the fitted AgglomerativeClustering
# model (X and model in the answer above).
ii = itertools.count(data_x.shape[0])
clusters = [{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in fit_cluster.children_]
n_points = data_x.shape[0]
members = {i: [i] for i in range(n_points)}
for cluster in clusters:
    node_id = cluster["node_id"]
    members[node_id] = copy.deepcopy(members[cluster["left"]])
    members[node_id].extend(copy.deepcopy(members[cluster["right"]]))
on_split = {c["node_id"]: [c["left"], c["right"]] for c in clusters}
up_merge = {c["left"]: {"into": c["node_id"], "with": c["right"]} for c in clusters}
up_merge.update({c["right"]: {"into": c["node_id"], "with": c["left"]} for c in clusters})
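A quick usage sketch (my own illustration, assuming data_x and fit_cluster are the X and model from the toy example above, so the five samples produce merge nodes 5 through 8):
print(members[5])    # original sample indices grouped under merge node 5
print(on_split[5])   # the two child node ids that were merged to form node 5
print(up_merge[0])   # which node sample 0 merges into, and with which sibling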
Related
I am currently trying to cluster a list of sequences based on their similarity using python.
ex:
DFKLKSLFD
DLFKFKDLD
LDPELDKSL
...
The way I pre-process my data is by computing the pairwise distances, for example the Levenshtein distance. After calculating all the pairwise distances and creating the distance matrix, I want to use it as input for the clustering algorithm.
I have already tried Affinity Propagation, but its convergence is a bit unpredictable and I would like to get around this problem.
Does anyone have any suggestions regarding other suitable clustering algorithms for this case?
Thank you!!
sklearn actually does show this example using DBSCAN, just like Luke once answered here.
This is based on that example, using !pip install python-Levenshtein. The custom metric below computes the Levenshtein distance on the fly; if you have pre-calculated all the distances, you can swap in a metric that looks them up instead, as shown further below.
from Levenshtein import distance
import numpy as np
from sklearn.cluster import dbscan
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return distance(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
And if you have pre-calculated all the distances into a matrix DISTANCES, you could define pre_lev_metric(x, y) along the lines of
def pre_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return DISTANCES[i, j]
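For completeness, here is a hedged sketch of how the DISTANCES matrix assumed above could be built and plugged in; DISTANCES and pre_lev_metric are just the placeholder names from the snippet above.
from Levenshtein import distance
import numpy as np
from sklearn.cluster import dbscan
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
# Pre-compute every pairwise Levenshtein distance once.
DISTANCES = np.array([[distance(a, b) for b in data] for a in data])
def pre_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return DISTANCES[i, j]
X = np.arange(len(data)).reshape(-1, 1)
core_samples, labels = dbscan(X, metric=pre_lev_metric, eps=5, min_samples=2)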
An alternative answer, based on K-Medoids using sklearn_extra.cluster.KMedoids. K-Medoids is not yet that well known, but it also only needs distances.
I had to install it like this:
!pip uninstall -y enum34
!pip install scikit-learn-extra
Then I was able to create clusters with:
from sklearn_extra.cluster import KMedoids
import numpy as np
from Levenshtein import distance
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return distance(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)
The labels/centers are in
kmedoids.labels_
kmedoids.cluster_centers_
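A small follow-up sketch (my own addition) to map each sequence back to its cluster; medoid_indices_ points at the input sequences chosen as medoids:
for seq, label in zip(data, kmedoids.labels_):
    print(seq, label)
# Each medoid is itself one of the input sequences.
print([data[i] for i in kmedoids.medoid_indices_])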
Try this.
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',') #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Results:
- *LDPELDKSL:* LDPELDKSL
- *DFKLKSLFD:* DFKLKSLFD
- *XYZ:* ABC, XYZ
- *DLFKFKDLD:* DLFKFKDLD
Say that I have many vectors, some of them are:
a: [1,2,3,4,3,2,1,0,0,0,0,0]
b: [5,5,5,5,5,10,20,30,5,10]
c: [1,2,3,2,1,0,0,0,0,0,0,0]
We can see a similar pattern in vectors a and c.
My question is whether it is possible to classify these two into the same cluster and classify b into another cluster.
I'd rather not use algorithms like KMeans, because the absolute values are not interesting, only the patterns are.
Any advice is welcome, especially solutions in Python.
Thanks
You may want to use Support Vector Classifier as it produces boundaries between clusters based on the patterns (generalized directions) between points in the clusters, rather than naive distance between points (like KMeans and Spectral Clustering will do). You will however have to construct labels Y yourself as SVC is a supervised method. Here is an example:
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
a = [1,2,3,4,3,2,1,0,0,0,0,0]
b = [5,5,5,5,5,10,20,30,5,10]
c = [1,2,3,2,1,0,0,0,0,0,0,0]
d = [100,2,300,4,100,0,0,0,0,0,0,0]
vectors = [a, b, c]
# Vectors have different lengths. Pad them with zeros to get equal dimensions.
L = max(len(elem) for elem in vectors)
imputed = []
for elem in vectors:
    l = len(elem)
    imputed.append(elem + [0]*(L-l))
print(imputed)
X = np.array(imputed)
print(X)
Y = np.array([0, 1, 0])
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X, Y)
print(clf.predict(np.array([d])))
I've used Nearest Neighbors to find closely related customers in order to recommend popular products to a target customer. I've fit a sparse matrix of training users to get the cosine distances. However, I cannot get indices and distances of new users on the fitted model because those users aren't in the original matrix. Is there a way around this, or do I have to refit the model each time new users are introduced?
Thanks!!
import numpy as np
from scipy.sparse import csr_matrix
train_df = train.pivot(index = 'user', columns = 'product_id', values = 'rating').fillna(0)
test_df = test.pivot(index = 'user', columns = 'product_id', values = 'rating').fillna(0)
train_mat = csr_matrix(train_df.values)
test_mat = csr_matrix(test_df.values)
from sklearn.neighbors import NearestNeighbors
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute', n_neighbors=30)
model_knn.fit(train_mat)
test_user = list(np.sort(test_df.user.unique()))
list1=[]
query_index = np.random.choice(test_user)
distances, indices = model_knn.kneighbors(test_df.loc[query_index, :].values.reshape(1, -1))
for i in range(0, len(distances.flatten())):
    list1.append(test_df.index[indices.flatten()[i]])
Here is the error message:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 1605 while Y.shape[1] == 2724
In the documentation it states:
Returns:
dist : array
Array representing the lengths to points, only present if return_distance=True
So you can try:
distances, indices = model_knn.kneighbors(
test_df.loc[query_index,:].values.reshape(1, -1),
return_distance=True
)
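As a hedged aside, not part of the answer above: the ValueError usually means the test pivot has a different set of product_id columns than the training pivot the model was fitted on (1605 vs 2724). One possible workaround is to align the test frame to the training columns before querying, e.g.:
# Align the test users to the exact column layout the model was trained on,
# filling products missing from the test pivot with 0.
test_aligned = test_df.reindex(columns=train_df.columns, fill_value=0)
distances, indices = model_knn.kneighbors(
    test_aligned.loc[query_index, :].values.reshape(1, -1)
)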
Using the AdaBoostClassifier from scikit-learn and graphviz, I was able to create the subtree visual in the image above, and I need help interpreting the values in each node. For example, what does "gini" mean? What is the significance of the "samples" and "value" fields? What does it mean that attribute F5 <= 0.5?
Here is my code (I did this all in jupyter notebook):
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
f = open('dtree-data.txt')
d = dict()
for i in range(1,9):
    key = 'F' + str(i)
    d[key] = []
d['RES'] = []
for line in f:
    values = [(True if x == 'True' else False) for x in line.split()[:8]]
    result = line.split()[8]
    d['RES'].append(result)
    for i in range(1, 9):
        key = 'F' + str(i)
        d[key].append(values[i-1])
df = pd.DataFrame(data=d, columns=['F1','F2','F3','F4','F5','F6','F7','F8','RES'])
from sklearn.model_selection import train_test_split
X = df.drop('RES', axis=1)
y = df['RES']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)
from IPython.display import Image
from sklearn.externals.six import StringIO
from sklearn.tree import export_graphviz
import pydot
# https://stackoverflow.com/questions/46192063/not-fitted-error-when-using-sklearns-graphviz
sub_tree = ada.estimators_[0]
dot_data = StringIO()
features = list(df.columns[1:])
export_graphviz(sub_tree, out_file=dot_data,feature_names=features,filled=True,rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())
NOTE: External packages may need to be installed in order to view the data locally (obviously)
Here is a link to the data file:
https://cs.rit.edu/~jro/courses/intelSys/dtree-data
A decision tree is a binary tree where each node represents a portion of the data. Each node that is not a leaf (i.e. the root or a branch) splits its part of the data into two sub-parts. The root node contains all the data (from the training set). Furthermore, this is a classification tree: it predicts class probabilities, which are the node values.
Root/branch node:
samples = 134 means the node 'contains' 134 samples. Since it is the root node, the tree was trained on 134 samples.
value = [0.373, 0.627] are class frequencies. About 1/3 of the samples belong to class A and 2/3 to class B.
gini = 0.468 is the gini impurity of the node. It describes how much the classes are mixed up (see the short check after these notes).
F5 <= 0.5 is the split rule; F5 is one of the feature columns of the data. It means the node is split so that all samples where feature F5 is less than or equal to 0.5 go to the left child and the samples where it is higher go to the right child.
Leaf nodes:
These nodes are not further split, so there is no need for an F <= something field.
samples = 90 / 44 sum to 134. 90 samples went to the left child and 44 samples to the right child.
values = [0.104, 0.567] / [0.269, 0.06] are the class frequencies in the children. Most samples in the left child belong to class B (56% vs 10%) and most samples in the right child belong to class A (27% vs 6%).
gini = 0.263 / 0.298 are the remaining impurities in the child nodes. They are lower than in the parent node, which means the split improved separability between the classes, but there is still some uncertainty left.
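As a quick check of the numbers quoted above (my own illustration): gini impurity is one minus the sum of squared class proportions in a node.
root_value = [0.373, 0.627]              # class frequencies at the root
gini = 1 - sum(p**2 for p in root_value)
print(round(gini, 3))                    # 0.468, matching the root node
For the leaf nodes the value arrays sum to that node's share of the data rather than to 1, so they have to be normalised first; doing so reproduces the quoted 0.263 and 0.298 as well.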
I'd like to find a threshold value for a bimodal distribution. For example, a bimodal distribution could look like the following:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000; b = n//10; i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
An attempt to find the cluster centers did not work, as I wasn't sure how the matrix, h, should be formatted:
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(h)
I would expect to be able to find the cluster centers around -2 and 2. The threshold value would then be the midpoint of the two cluster centers.
Your question is a bit confusing to me, so please let me know if I've interpreted it incorrectly. I think you are basically trying to do 1D kmeans, and try to introduce frequency as a second dimension to get KMeans to work, but would really just be happy with [-2,2] as the output for the centers instead of [(-2,y1), (2,y2)].
To do a 1D kmeans you can just reshape your data to be n of 1-length vectors (similar question: Scikit-learn: How to run KMeans on a one-dimensional array?)
code:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000;
b = n//10;
i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(x.reshape(n,1))
print(kmeans.cluster_centers_)
output:
[[-1.9896414]
[ 2.0176039]]
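A short follow-up sketch (my addition) for the threshold the question asks about, taken as the midpoint of the two 1D centers:
centers = kmeans.cluster_centers_.ravel()
threshold = centers.mean()   # roughly 0.014 for this seed, midway between the two modes
print(threshold)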