Feature selection from dendrogram - python

I have a dataframe with 105 radiomics features. I have computed Spearman's rank correlation between each pair of features and created a dendrogram from it. My simplified data looks like the following:
1.000000 0.723548 0.018779
0.723548 1.000000 0.118595
0.018779 0.118595 1.000000
I want to apply a cut-off value to the dendrogram (for example 0.3) and then, as a first round of feature selection, find the feature that best represents each cluster (see example). Everything below the threshold is collapsed so that the number of selected features equals the number of dendrogram branches crossing the threshold, with one representative feature chosen per cluster. This is my code, implemented from here:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
# Perform Spearman's rank correlation
df = pd.read_excel(r'H:\Documenten\file.xlsx',header=None)
rank = df.T.corr(method='spearman')
print(rank)
sns.heatmap(rank, annot = False, vmin=-1, vmax=1)
plt.show()
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform
plt.figure(figsize=(12,5))
dissimilarity = 1 - abs(rank)
Z = linkage(squareform(dissimilarity), 'complete')
dendrogram(Z, labels=df.T.columns, orientation='top',leaf_rotation=90)
threshold = 0.8
labels = fcluster(Z, threshold, criterion='distance')
# Keep the indices to sort labels
labels_order = np.argsort(labels)
# Build a new dataframe with the sorted columns
for idx, i in enumerate(df.T.columns[labels_order]):
    if idx == 0:
        clustered = pd.DataFrame(df.T[i])
    else:
        df_to_append = pd.DataFrame(df.T[i])
        clustered = pd.concat([clustered, df_to_append], axis=1)
plt.figure(figsize=(15,10))
correlations = clustered.corr()
sns.heatmap(round(correlations,2), cmap='RdBu', annot=False, vmin=-1, vmax=1)
plt.show()
I think I have to apply something with either Z or dissimilarity. How do I do this?
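One possible way to do this, sketched under assumptions: cut the tree with fcluster at the chosen threshold, then within each flat cluster keep the feature with the highest mean absolute correlation to the members of that cluster. The feature names f0, f1, f2 and the "highest mean absolute correlation" rule are assumptions made for illustration; other selection rules (for example, clinical relevance or variance) are equally valid.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy correlation matrix from the question; the feature names are hypothetical.
names = ["f0", "f1", "f2"]
rank = pd.DataFrame(
    [[1.000000, 0.723548, 0.018779],
     [0.723548, 1.000000, 0.118595],
     [0.018779, 0.118595, 1.000000]],
    index=names, columns=names)

# Same dissimilarity and linkage as in the question.
dissimilarity = 1 - rank.abs()
Z = linkage(squareform(dissimilarity.values, checks=False), "complete")

# Flat cluster label for every feature at the chosen cut-off.
threshold = 0.3
labels = fcluster(Z, t=threshold, criterion="distance")

# Assumed rule: per cluster, keep the feature with the highest
# mean absolute correlation to the members of its own cluster.
representatives = []
for cluster_id in np.unique(labels):
    members = rank.index[labels == cluster_id]
    within = rank.loc[members, members].abs()
    representatives.append(within.mean(axis=1).idxmax())

print(dict(zip(names, labels)))
print(representatives)
```

With the real data, rank would be the 105 x 105 Spearman matrix, and the selected columns can then be taken from the original dataframe by name.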

Related

How do I read the matrix resulting of a linkage?

I am doing hierarchical agglomerative clustering. Everything is working fine, but I want to plot the threshold t against the number of clusters in the dendrogram.
The only information about the dendrogram is the Z matrix in the following code, but I don't know what this linkage matrix means.
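import seaborn as sns
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
iris = sns.load_dataset("iris")
species = iris.pop("species")
# Assumption: X is the numeric iris feature table (X was not defined in the original snippet)
X = iris.values
Z = linkage(X, 'ward')
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z)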
So for this case I would have (number of clusters, t) values of (2, 30), (3, 10) and so on, but the closer we get to t=0 the harder it is to count all
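For reference, each row of the matrix returned by scipy's linkage describes one merge: the first two columns are the indices of the clusters being merged, the third is the distance at which they merge, and the fourth is the number of observations in the new cluster. The sketch below uses that third column to get the number of clusters for any threshold without counting branches by hand. It assumes a monotonic linkage such as 'ward', and uses random stand-in data instead of iris; the helper n_clusters_at is a hypothetical name.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in data (assumption): 20 observations with 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

Z = linkage(X, "ward")
n = X.shape[0]
heights = Z[:, 2]  # merge distances, one per row of Z

def n_clusters_at(t):
    """Number of flat clusters left when cutting the tree at distance t."""
    # Every merge at or below t has already happened, and each merge
    # reduces the cluster count by one.
    return n - int(np.sum(heights <= t))

# Evaluate between consecutive merge heights to get (number of clusters, t) pairs.
midpoints = (heights[:-1] + heights[1:]) / 2
for t in midpoints:
    print(f"t = {t:.3f} -> {n_clusters_at(t)} clusters")
```

The same counts can be cross-checked with fcluster(Z, t, criterion='distance'), which returns the actual cluster label of every observation.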

Grouping based on and plotting error statistics in python

I have implemented a regression model and retrieved results. To evaluate them, I want to create a plot where the MAE and its standard deviation are shown in the same figure. However, I want to group the data into intervals and evaluate the statistics per interval. Although I can use sklearn metrics to calculate the mean absolute error, that works on the entire range of data. Can someone give me an idea of how to group the data into intervals?
The data is very large, so I cannot share it here. However, I am attaching random data and the code I implemented for calculating the bias below.
import pandas as pd
import random
import matplotlib.pyplot as plt
yact = random.sample(range(1, 100), 50)
ypred=random.sample(range(1, 100), 50)
df = pd.DataFrame(yact,columns=['yact'])
df['ypred']=ypred
df['bias']=df['yact']-df['ypred']
#groups=[20,40,60,80,100]
I want to create groups of ypred based on yact (similar to the groups given above).
A reference figure which I am trying to plot is present in the first quadrant of below attached figure.
We could use only pandas/matplotlib, but seaborn makes this kind of plotting so much easier. First, we categorize the data with pd.cut based on the bins provided, then we plot them with seaborn's pointplot. The estimator mean is the default, but I wanted to point out that you can feed other functions into the plot here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#random data generation
rng = np.random.default_rng(123)
n=500
yact = rng.choice(range(1, 100), n)
ypred = rng.choice(range(1, 100), n)
df = pd.DataFrame({"yact": yact, "ypred": ypred})
df['bias']=df['yact']-df['ypred']
#binning of data
bins = [0, 30, 50, 80, 100]
labels = [f"({first}; {second}]" for first, second in zip(bins[:-1], bins[1:])]
df["cats"] = pd.cut(x=df['yact'], bins=bins, labels=labels, include_lowest=True)
#plotting with seaborn
#note: written for seaborn >= 0.13; older versions used ci="sd", join=False instead
sns.pointplot(x="cats", y="ypred", data=df, order=labels, estimator=np.mean, errorbar="sd", linestyle="none")
plt.show()
(Unsurprisingly uniform) sample output:
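Since the question asks specifically for the MAE and its standard deviation per interval, here is a pandas-only sketch of that calculation. It reuses the bins from the answer; the column name abs_err and the use of rng.integers for the random data are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Random stand-in data, as in the answer above.
rng = np.random.default_rng(123)
n = 500
df = pd.DataFrame({"yact": rng.integers(1, 100, n),
                   "ypred": rng.integers(1, 100, n)})

# Absolute error per sample; its mean per bin is the MAE of that bin.
df["abs_err"] = (df["yact"] - df["ypred"]).abs()

# Bin the samples by the actual value.
bins = [0, 30, 50, 80, 100]
df["cats"] = pd.cut(df["yact"], bins=bins, include_lowest=True)

# Mean (MAE), standard deviation, and sample count per interval.
stats = df.groupby("cats", observed=True)["abs_err"].agg(["mean", "std", "count"])
print(stats)
```

The resulting table can be drawn with matplotlib alone, for example with plt.errorbar using stats["mean"] as the values and stats["std"] as yerr.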

How to perform Spectral Clustering on 3 circles dataset with three different classes

I want to perform spectral clustering on the 3 circles dataset that I generated using make_circles, as shown in the figure. Each of the three circles belongs to a different class.
from sklearn.datasets import make_circles
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt
import pylab as pl
import networkx as nx
X_small, y_small = make_circles(n_samples=(100,200), random_state=3,
                                noise=0.07, factor = 0.7)
X_large, y_large = make_circles(n_samples=(100,200), random_state=3,
                                noise=0.07, factor = 0.4)
y_large[y_large==1] = 2
df = pd.DataFrame(np.vstack([X_small,X_large]),columns=['x1','x2'])
df['label'] = np.hstack([y_small,y_large])
df.label.value_counts()
sns.scatterplot(data=df,x='x1',y='x2',hue='label',style='label',palette="bright")
Since I can't flag this question as duplicate (the similar question has no accepted answer), here is a working example of Spectral Clustering on 3 circles using your code:
X_small, y_small = make_circles(n_samples=(1000,2000), random_state=3,
                                noise=0.07, factor = 0.1)
X_large, y_large = make_circles(n_samples=(1000,2000), random_state=3,
                                noise=0.07, factor = 0.6)
y_large[y_large==1] = 2
df = pd.DataFrame(np.vstack([X_small,X_large]),columns=['x1','x2'])
df['label'] = np.hstack([y_small,y_large])
df.label.value_counts()
sns.scatterplot(data=df,x='x1',y='x2',hue='label',style='label',palette="bright")
Then adapt the slightly modified 3 circles dataset (added samples and spread the circles) to the code of this SO answer:
x1 = np.expand_dims(df['x1'].values,axis=1)
x2 = np.expand_dims(df['x2'].values,axis=1)
X = np.concatenate((x1,x2),axis=1)
y = df['label'].values
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(n_clusters=3, gamma=1000).fit(X)
colors = ['r','g','b']
colors = np.array([colors[label] for label in clustering.labels_])
plt.scatter(X[y==0, 0], X[y==0, 1], c=colors[y==0], marker='X')
plt.scatter(X[y==1, 0], X[y==1, 1], c=colors[y==1], marker='o')
plt.scatter(X[y==2, 0], X[y==2, 1], c=colors[y==2], marker='*')
plt.show()
The np.expand_dims(...,axis=1) is necessary to create the dimension along which to concatenate features with np.concatenate() (we initially have 1D vectors, and we don't want to concatenate along the existing initial dimension which is the samples index dimension). Each plt.scatter() line plots the points of a single true data class (hence the y==y_true index selection) using the associated marker, the colors indicating the class provided by the clustering.
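The shape handling described above can be checked on a tiny example. The arrays x1 and x2 below are made-up stand-ins for the two dataframe columns; np.column_stack is shown only as an equivalent shortcut, not as what the answer uses.

```python
import numpy as np

# Two 1D feature vectors with three samples each (made-up values).
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([10.0, 20.0, 30.0])

# expand_dims adds a feature axis: shape (3,) becomes (3, 1).
a = np.expand_dims(x1, axis=1)
b = np.expand_dims(x2, axis=1)

# Concatenating along axis=1 puts the features side by side: shape (3, 2).
X = np.concatenate((a, b), axis=1)

# Without expand_dims, the 1D vectors are joined along the sample axis: shape (6,).
flat = np.concatenate((x1, x2))

# Equivalent shortcut for building the (samples, features) matrix.
X_alt = np.column_stack((x1, x2))

print(a.shape, X.shape, flat.shape)
```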
Resulting dataset:
Resulting clusters:
Edit: different markers are now used to identify the true classes (colors already indicate the clustering classes), as asked by OP in the comments. Unfortunately, we cannot pass an array of markers (as we can for colors) to produce the plot in a single line of code, because marker does not accept a list as input (discussed here).
Edit2: added motivation for the use of np.expand_dims(...,axis=1) and some explanation for the plt.scatter() lines, as asked by OP in the comments.

No runtime error, but wrong iris PCA plotting

I am using the following code to perform PCA on the iris dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# get iris data to a dataframe:
from sklearn import datasets
iris = datasets.load_iris()
varnames = ['SL', 'SW', 'PL', 'PW']
irisdf = pd.DataFrame(data=iris.data, columns=varnames)
irisdf['Species'] = [iris.target_names[a] for a in iris.target]
# perform pca:
from sklearn.decomposition import PCA
model = PCA(n_components=2)
scores = model.fit_transform(irisdf.iloc[:,0:4])
loadings = model.components_
# plot results:
scoredf = pd.DataFrame(data=scores, columns=['PC1','PC2'])
scoredf['Grp'] = irisdf.Species
sns.lmplot(fit_reg=False, x="PC1", y='PC2', hue='Grp', data=scoredf) # plot point;
loadings = loadings.T
for e, pt in enumerate(loadings):
    plt.plot([0,pt[0]], [0,pt[1]], '--b')
    plt.text(x=pt[0], y=pt[1], s=varnames[e], color='b')
plt.show()
I am getting the following plot:
However, when I compare it with plots from other sites (e.g. at http://marcoplebani.com/pca/ ), my plot is not correct. The following differences seem to be present:
Petal length and petal width lines should have similar lengths.
Sepal length line should be closer to petal length and petal width lines rather than closer to sepal width line.
All 4 lines should be on the same side of x-axis.
Why is my plot not correct? Where is the error and how can it be corrected?
It depends on whether you scale the variance or not. The "other site" uses scale=TRUE. If you want to do this with sklearn, add StandardScaler before fitting the model and fit the model with scaled data, like this:
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(irisdf.iloc[:,0:4])
scores = model.fit_transform(X)
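As a self-contained sketch of the scaled variant (plotting left out, variable names taken from the question), the loadings can be printed directly to check the points raised in the question. On the standardized data, petal length and petal width end up with PC1 loadings of similar magnitude, and sepal length shares their sign on PC1. Note that the overall sign of a principal component is arbitrary, so the arrows may appear mirrored relative to another tool's plot.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
varnames = ['SL', 'SW', 'PL', 'PW']

# Standardize every feature to zero mean and unit variance before PCA.
X_scaled = StandardScaler().fit_transform(iris.data)

model = PCA(n_components=2)
scores = model.fit_transform(X_scaled)

# One row per variable, one column per principal component.
loadings = model.components_.T
for name, (pc1, pc2) in zip(varnames, loadings):
    print(f"{name}: PC1={pc1:+.3f}, PC2={pc2:+.3f}")

print(model.explained_variance_ratio_)
```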
Edit: Difference between StandardScaler and normalize
Here is an answer which points out a key difference (row vs column). Even if you use normalize here, you might want to consider X = normalize(X.T).T. The following code shows some differences after transformation:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize
iris = datasets.load_iris()
varnames = ['SL', 'SW', 'PL', 'PW']
fig, ax = plt.subplots(2, 2, figsize=(16, 12))
irisdf = pd.DataFrame(data=iris.data, columns=varnames)
irisdf.plot(kind='kde', title='Raw data', ax=ax[0][0])
irisdf_std = pd.DataFrame(data=StandardScaler().fit_transform(irisdf), columns=varnames)
irisdf_std.plot(kind='kde', title='StandardScaler', ax=ax[0][1])
irisdf_norm = pd.DataFrame(data=normalize(irisdf), columns=varnames)
irisdf_norm.plot(kind='kde', title='normalize', ax=ax[1][0])
irisdf_norm = pd.DataFrame(data=normalize(irisdf.T).T, columns=varnames)
irisdf_norm.plot(kind='kde', title='normalize', ax=ax[1][1])
plt.show()
I'm not sure how deep I can go with the algorithm/math. The point of StandardScaler is to get a uniform/consistent mean and variance across features. The assumption is that variables with large measurement units are not necessarily (and should not be) dominant in PCA. In other words, StandardScaler makes features contribute equally to PCA. As you can see, normalize won't give a consistent mean or variance.
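The row-versus-column difference can also be checked numerically rather than visually. This is a minimal sketch on the iris data: StandardScaler works per column (feature), while normalize with its default settings rescales each row (sample) to unit L2 norm.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler, normalize

X = datasets.load_iris().data

# Column-wise: every feature gets mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))

# Row-wise (default norm='l2', axis=1): every sample gets unit length.
X_norm = normalize(X)
print(np.linalg.norm(X_norm, axis=1)[:5])

# The per-feature mean and standard deviation stay inconsistent after normalize.
print(X_norm.mean(axis=0).round(3), X_norm.std(axis=0).round(3))
```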

What is the best way for clustering data containing categorical and numeric variables with python

I need to cluster customer data that contains categorical and numerical features. The numerical features are not on the same ranges (age, income, ...). I tried Mclust on the numerical data after scaling it with StandardScaler, but that gave me overlapping groups.
1- Should I normalize if the results with StandardScaler are not satisfying?
2- What is the best way to cluster with K-Prototypes?
3- Should the clustering method depend on the data distribution?
I use pandas.
This is what I have used:
# K-means clustering: search for K
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance as sci_distance
from sklearn import cluster as sk_cluster
cdata = data
K = range(1, 10)
KM = (sk_cluster.KMeans(n_clusters=k).fit(cdata) for k in K)
centroids = (k.cluster_centers_ for k in KM)
D_k = (sci_distance.cdist(cdata, cent, 'euclidean') for cent in centroids)
dist = (np.min(D, axis=1) for D in D_k)
avgWithinSS = [sum(d) / cdata.shape[0] for d in dist]
plt.plot(K, avgWithinSS, 'b*-')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')
plt.show()
#KMean Cluster
from sklearn.cluster import (KMeans, AgglomerativeClustering,
                             AffinityPropagation)  # For clustering
from sklearn.mixture import GaussianMixture #For GMM clustering
import matplotlib.pyplot as plt # For graphics
import seaborn as sns
#Clustering
def doKmeans(X, nclust=3):
    model = KMeans(nclust)
    model.fit(X)
    clust_labels = model.predict(X)
    cent = model.cluster_centers_
    return (clust_labels, cent)
clust_labels, cent = doKmeans(data, 3)
kmeans = pd.DataFrame(clust_labels)
data.insert((data.shape[1]),'kmeans',kmeans)
#Plot the clusters obtained using k means
fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(data['var1'],data['var2'],
                     c=kmeans[0],s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('var1')
ax.set_ylabel('var2')
plt.colorbar(scatter)
You are approaching this the very wrong way.
Do not choose an approach just because you manage to get the code to run. This will never give you good results.
First figure out what you need. What is a cluster? What is a clustering (are all points in clusters? probably not. etc.)? What is a good clustering, and how can I measure this? Only then choose algorithms based on how well they match your requirements.
Otherwise, you will be solving the wrong problem.
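To make the "how can I measure this?" step concrete, here is a sketch of one commonly used internal measure, the silhouette score, evaluated for several cluster counts. Whether it matches your actual requirements is exactly the question raised above. The synthetic blobs and their centers are assumptions chosen for illustration, and this covers numeric data only, not the mixed categorical/numeric case from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated numeric data (assumption for illustration only).
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=0)

# Score each candidate number of clusters; higher silhouette is better.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k = {k}: silhouette = {s:.3f}")
print("best k:", best_k)
```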
