How do I read the matrix resulting from a linkage? - python

I am doing hierarchical agglomerative clustering. Everything is working fine, but I want to plot t (the cut height) versus the number of clusters of the dendrogram.
The only information I have about the dendrogram is the Z matrix in the following code, but I don't know what that matrix means.
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
iris = sns.load_dataset("iris")
species = iris.pop("species")
X = iris.values  # the remaining numeric columns form the feature matrix
Z = linkage(X, 'ward')
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z)
So for this case I would have, as (number of clusters, t) pairs, the values (2, 30), (3, 10) and so on, but the closer we get to t = 0 the harder it is to count them all.
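For reference, each row i of the linkage matrix Z describes one merge: Z[i, 0] and Z[i, 1] are the indices of the two clusters being joined, Z[i, 2] is the distance at which they are joined (the height t of that merge in the dendrogram), and Z[i, 3] is the number of original observations in the newly formed cluster. Since every merge reduces the cluster count by one, the number of clusters at a cut height t is n minus the number of merge heights at or below t. A minimal sketch of the t versus number-of-clusters plot, reusing Z from the snippet above:
import numpy as np
heights = Z[:, 2]                     # merge heights, non-decreasing for Ward linkage
n = Z.shape[0] + 1                    # number of original observations
ts = np.linspace(0, heights.max(), 200)
n_clusters = [n - np.searchsorted(heights, t, side='right') for t in ts]
plt.plot(n_clusters, ts, '-')
plt.xlabel('number of clusters')
plt.ylabel('t')
plt.show()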

Related

Feature selection from dendrogram

I have a dataframe with 105 radiomics features. I have computed Spearman's rank correlation between each pair of features and have created a dendrogram. My simplified data looks like the following:
1.000000 0.723548 0.018779
0.723548 1.000000 0.118595
0.018779 0.118595 1.000000
I want to apply a cut-off value to the dendrogram (for example 0.3) and then, for a first round of feature selection, find the feature that best represents each cluster (see example). All clusters below the threshold are collapsed into as many features as there are branches above the threshold, keeping the feature that represents its cluster best. This is my code, implemented from here:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
# Perform Spearman's rank correlation
df = pd.read_excel(r'H:\Documenten\file.xlsx',header=None)
rank = df.T.corr(method='spearman')
print(rank)
sns.heatmap(rank, annot = False, vmin=-1, vmax=1)
plt.show()
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform
plt.figure(figsize=(12,5))
dissimilarity = 1 - abs(rank)
Z = linkage(squareform(dissimilarity), 'complete')
dendrogram(Z, labels=df.T.columns, orientation='top',leaf_rotation=90)
threshold = 0.8
labels = fcluster(Z, threshold, criterion='distance')
# Keep the indices to sort labels
labels_order = np.argsort(labels)
# Build a new dataframe with the sorted columns
for idx, i in enumerate(df.T.columns[labels_order]):
    if idx == 0:
        clustered = pd.DataFrame(df.T[i])
    else:
        df_to_append = pd.DataFrame(df.T[i])
        clustered = pd.concat([clustered, df_to_append], axis=1)
plt.figure(figsize=(15,10))
correlations = clustered.corr()
sns.heatmap(round(correlations,2), cmap='RdBu', annot=False, vmin=-1, vmax=1)
plt.show()
I think I have to apply something with either Z or dissimilarity. How do I do this?
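One way to do it (a sketch, and only one possible definition of "represents the cluster best"): cut the tree with fcluster at the chosen correlation cut-off, then within every cluster keep the feature with the highest average absolute correlation to the other members; singleton clusters are kept as they are. Building on the rank and Z objects above:
cluster_ids = fcluster(Z, 0.3, criterion='distance')  # 0.3 = the example cut-off from the question
selected = []
for c in np.unique(cluster_ids):
    members = rank.columns[cluster_ids == c]
    if len(members) == 1:
        selected.append(members[0])
    else:
        sub = rank.loc[members, members].abs()
        # mean absolute correlation with the other members (the self-correlation of 1 is excluded)
        mean_corr = (sub.sum() - 1) / (len(members) - 1)
        selected.append(mean_corr.idxmax())
print(selected)  # one representative feature per cluster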

How to perform Spectral Clustering on 3 circles dataset with three different classes

I want to perform spectral clustering on the 3 circles dataset that I have generated using make_circles, as shown in the figure. All three circles belong to different classes.
from sklearn.datasets import make_circles
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt
import pylab as pl
import networkx as nx
X_small, y_small = make_circles(n_samples=(100, 200), random_state=3,
                                noise=0.07, factor=0.7)
X_large, y_large = make_circles(n_samples=(100, 200), random_state=3,
                                noise=0.07, factor=0.4)
y_large[y_large==1] = 2
df = pd.DataFrame(np.vstack([X_small,X_large]),columns=['x1','x2'])
df['label'] = np.hstack([y_small,y_large])
df.label.value_counts()
sns.scatterplot(data=df,x='x1',y='x2',hue='label',style='label',palette="bright")
Since I can't flag this question as duplicate (the similar question has no accepted answer), here is a working example of Spectral Clustering on 3 circles using your code:
X_small, y_small = make_circles(n_samples=(1000, 2000), random_state=3,
                                noise=0.07, factor=0.1)
X_large, y_large = make_circles(n_samples=(1000, 2000), random_state=3,
                                noise=0.07, factor=0.6)
y_large[y_large==1] = 2
df = pd.DataFrame(np.vstack([X_small,X_large]),columns=['x1','x2'])
df['label'] = np.hstack([y_small,y_large])
df.label.value_counts()
sns.scatterplot(data=df,x='x1',y='x2',hue='label',style='label',palette="bright")
Then adapt the slightly modified 3-circles dataset (more samples and more widely spread circles) to the code of this SO answer:
x1 = np.expand_dims(df['x1'].values,axis=1)
x2 = np.expand_dims(df['x2'].values,axis=1)
X = np.concatenate((x1,x2),axis=1)
y = df['label'].values
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(n_clusters=3, gamma=1000).fit(X)
colors = ['r','g','b']
colors = np.array([colors[label] for label in clustering.labels_])
plt.scatter(X[y==0, 0], X[y==0, 1], c=colors[y==0], marker='X')
plt.scatter(X[y==1, 0], X[y==1, 1], c=colors[y==1], marker='o')
plt.scatter(X[y==2, 0], X[y==2, 1], c=colors[y==2], marker='*')
plt.show()
The np.expand_dims(..., axis=1) is necessary to create the dimension along which to concatenate the two features with np.concatenate(): we initially have 1D vectors, and we do not want to concatenate along the existing dimension, which indexes the samples. Each plt.scatter() line plots the points of a single true class (hence the y == y_true selection) with its own marker, while the colors indicate the class assigned by the clustering.
Resulting dataset:
Resulting clusters:
Edit: to use different markers to identify the true classes (the colors already indicate the clustering classes), as asked by OP in the comments. Unfortunately we cannot use an array of markers (as we do for colors) to produce the plot in a single call, because marker does not accept a list as input (discussed here).
Edit2: added the motivation for the use of np.expand_dims(..., axis=1) and some explanation of the plt.scatter() lines, as asked by OP in the comments.
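As an alternative to a large gamma, a nearest-neighbors affinity often separates concentric circles without manual tuning; a minimal sketch of that variant (the n_neighbors value is just an illustrative choice):
clustering_knn = SpectralClustering(n_clusters=3, affinity='nearest_neighbors',
                                    n_neighbors=20, random_state=0).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=clustering_knn.labels_, s=10)
plt.show()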

How to find cut-off height in agglomerative clustering with a predefined number of clusters in sklearn?

I'm deploying sklearn's hierarchical clustering algorithm with the following code:
AgglomerativeClustering(compute_distances = True, n_clusters = 15, linkage = 'complete', affinity = 'cosine').fit(X_scaled)
How can I extract the exact height at which the dendrogram has been cut off to create the 15 clusters?
Try this code with your feature data set X to find heights vs # of clusters:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, cut_tree
heights = np.arange(0, 20)
n_clusters = np.zeros(len(heights))
linked = linkage(X, metric="euclidean", method="average")
for i, d in enumerate(heights):
    t = cut_tree(linked, height=d)
    n_clusters[i] = len(np.unique(t))
plt.plot(n_clusters, heights, '-o')
plt.grid()
plt.xlabel('k')
plt.ylabel('heights')
Or, try this code with your feature data set X to find distance vs # of clusters:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
distance = np.arange(0, 20, 0.1)
n_clusters = np.zeros(len(distance))
for i, d in enumerate(distance):
    cluster = AgglomerativeClustering(distance_threshold=d, n_clusters=None,
                                      affinity='euclidean', linkage='ward')
    cluster.fit(X)
    n_clusters[i] = cluster.n_clusters_
plt.plot(n_clusters, distance, '-o')
plt.grid()
plt.xlabel('k')
plt.ylabel('distance')
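Since the model in the question is fitted with compute_distances=True, the cut height can also be read off the fitted estimator itself: distances_ holds the heights of all n_samples - 1 merges of the full tree, and the threshold that yields exactly 15 clusters lies between the (n - 15)-th and the (n - 15 + 1)-th smallest merge heights. A minimal sketch, assuming the estimator from the question was stored in a variable called model:
import numpy as np
# model = AgglomerativeClustering(compute_distances=True, n_clusters=15,
#                                 linkage='complete', affinity='cosine').fit(X_scaled)
k = 15
n = len(model.labels_)              # number of samples
merges = np.sort(model.distances_)  # the n - 1 merge heights of the full tree
lower = merges[n - k - 1]           # height of the last merge that was performed
upper = merges[n - k]               # height of the first merge that was not performed
print(f"any cut height in [{lower:.4f}, {upper:.4f}) yields {k} clusters")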

How to plot the principal vectors of each variable after performing PCA?

My question mainly comes from this post: https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance
In the article, the author plotted the vector direction and length of each variable. Based on my understanding, after performing PCA all we get are the eigenvectors and eigenvalues. For a dataset of dimension M x N, each eigenvector should be a 1 x N vector. So my question is: maybe the length of the vector is the eigenvalue, but how do I find the direction of the vector for each variable mathematically? And what is the physical meaning of the length of the vector?
Also, if it is possible, can I do similar work with the scikit-learn PCA function in Python?
Thanks!
This plot is called a biplot and it is very useful for understanding the PCA results. The length of the vectors is just the value that each feature/variable has on each principal component, a.k.a. the PCA loadings.
Example:
These loadings are accessible through print(pca.components_). Using the Iris dataset, the loadings are:
[[ 0.52106591, -0.26934744,  0.5804131 ,  0.56485654],
 [ 0.37741762,  0.92329566,  0.02449161,  0.06694199],
 [-0.71956635,  0.24438178,  0.14212637,  0.63427274],
 [-0.26128628,  0.12350962,  0.80144925, -0.52359713]]
Here, each row is one PC and each column corresponds to one variable/feature. So feature/variable 1 has the value 0.52106591 on PC1 and 0.37741762 on PC2. These are the values used to plot the vectors that you see in the biplot: the coordinates of Var1 are exactly those values.
Finally, to create this plot in Python you can use sklearn:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
pca.fit(X,y)
x_new = pca.transform(X)
def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    plt.scatter(xs, ys, c=y)  # without scaling
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
#Call the function.
myplot(x_new[:,0:2], pca.components_.T)
plt.show()
See also this post: https://stackoverflow.com/a/50845697/5025009
and
https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
Try the pca library. This will plot the explained variance, and create a biplot.
pip install pca
A small example:
from pca import pca
# Initialize to reduce the data up to the number of componentes that explains 95% of the variance.
model = pca(n_components=0.95)
# Or reduce the data towards 2 PCs
model = pca(n_components=2)
# Load example dataset
import pandas as pd
import sklearn
from sklearn.datasets import load_iris
X = pd.DataFrame(data=load_iris().data, columns=load_iris().feature_names, index=load_iris().target)
# Fit transform
results = model.fit_transform(X)
# Plot explained variance
fig, ax = model.plot()
# Scatter first 2 PCs
fig, ax = model.scatter()
# Make biplot with the number of features
fig, ax = model.biplot(n_feat=4)
The results object is a dict containing many statistics of the PCs, the loadings, etc.

What is the best way to cluster data containing categorical and numeric variables with Python?

I need to cluster customer data that contains categorical and numerical features. The numerical features are not on the same ranges (age, income, ...). I tried Mclust on the numerical data after scaling it with StandardScaler, but that gave me overlapping groups.
1. Should I normalize instead, if the results with StandardScaler are not satisfying?
2. What would be the best way to cluster with K-Prototypes?
3. Should the clustering method depend on the data distribution?
I use pandas. This is what I have used:
# K-means clustering: search for K (elbow method)
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance as sci_distance
from sklearn import cluster as sk_cluster
cdata = data
K = range(1, 10)
KM = (sk_cluster.KMeans(n_clusters=k).fit(cdata) for k in K)
centroids = (k.cluster_centers_ for k in KM)
D_k = (sci_distance.cdist(cdata, cent, 'euclidean') for cent in centroids)
dist = (np.min(D, axis=1) for D in D_k)
avgWithinSS = [sum(d) / cdata.shape[0] for d in dist]
plt.plot(K, avgWithinSS, 'b*-')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')
plt.show()
# K-means clustering
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation  # For clustering
from sklearn.mixture import GaussianMixture  # For GMM clustering
import matplotlib.pyplot as plt  # For graphics
import seaborn as sns
#Clustering
def doKmeans(X, nclust=3):
    model = KMeans(nclust)
    model.fit(X)
    clust_labels = model.predict(X)
    cent = model.cluster_centers_
    return (clust_labels, cent)
clust_labels, cent = doKmeans(data, 3)
kmeans = pd.DataFrame(clust_labels)
data.insert((data.shape[1]),'kmeans',kmeans)
#Plot the clusters obtained using k means
fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(data['var1'],data['var2'],
c=kmeans[0],s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('var1')
ax.set_ylabel('var2')
plt.colorbar(scatter)
You are approaching this the very wrong way.
Do not choose an approach just because you manage to get the code to run. This will never give you good results.
First figure out what you need. What is a cluster? What is a clustering (are all points in clusters? probably not. etc.)? What is a good clustering, and how can I measure this? Only then choose algorithms based on how well they match your requirements.
Otherwise, you will be solving the wrong problem.
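If you do decide to try K-Prototypes (question 2), a run could look like the sketch below. It assumes the third-party kmodes package (pip install kmodes), that data is the customer DataFrame from the question, and that its categorical columns are the ones listed in cat_cols (illustrative names); the numeric columns are assumed to be scaled already.
import numpy as np
from kmodes.kprototypes import KPrototypes
cat_cols = ['gender', 'region']  # hypothetical categorical column names
cat_idx = [data.columns.get_loc(c) for c in cat_cols]
kproto = KPrototypes(n_clusters=3, init='Cao', random_state=0)
labels = kproto.fit_predict(data.to_numpy(), categorical=cat_idx)
print(np.bincount(labels))  # cluster sizes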
