I am using hierarchical clustering from seaborn.clustermap to cluster my data. This works fine to nicely visualize the clusters in a heatmap. However, now I would like to extract all row values that are assigned to the different clusters.
This is what my data looks like:
import pandas as pd
# load DataFrame
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
log_HU1 log_HU2
EEF1A1 13.439499 13.746856
HSPA8 13.169191 12.983910
FTH1 13.861164 13.511200
PABPC1 12.142340 11.885885
TFRC 11.261368 10.433607
RPL26 13.837205 13.934710
NPM1 12.381585 11.956855
RPS4X 13.359880 12.588574
EEF2 11.076926 11.379336
RPS11 13.212654 13.915813
RPS2 12.910164 13.009184
RPL11 13.498649 13.453234
CA1 9.060244 13.152061
RPS3 11.243343 11.431791
YBX1 12.135316 12.100374
ACTB 11.592359 12.108637
RPL4 12.168588 12.184330
HSP90AA1 10.776370 10.550427
HSP90AB1 11.200892 11.457365
NCL 11.366145 11.060236
Then I perform the clustering using seaborn as follows:
fig = sns.clustermap(df)
Which produces the following clustermap:
For this example I may be able to manually interpret the values belonging to each cluster (e.g. that TFRC and HSP90AA1 cluster together). However, I am planning to run this clustering analysis on much bigger data sets.
So my question is: does anyone know how to get the row values belonging to each cluster?
The fcluster function from the scipy.cluster.hierarchy module lets you retrieve the clusters:
import pandas as pd
import seaborn as sns
import scipy.cluster.hierarchy as sch
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
# retrieve clusters using fcluster
d = sch.distance.pdist(df)
L = sch.linkage(d, method='complete')
# 0.2 can be modified to retrieve more stringent or relaxed clusters
clusters = sch.fcluster(L, 0.2*d.max(), 'distance')
# cluster indices correspond to the indices of the original df
for i, cluster in enumerate(clusters):
    print(df.index[i], cluster)
EEF1A1 2
FTH1 2
RPL26 2
NPM1 3
EEF2 4
RPS11 2
RPS2 1
RPL11 2
CA1 6
RPS3 4
YBX1 3
RPL4 3
HSP90AA1 5
HSP90AB1 4
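To go from per-row labels to the actual rows of each cluster, one option is to group the DataFrame by the label array. The following is a minimal sketch under stated assumptions: it uses only five rows copied from the question's table, and the name rows_by_cluster is made up for illustration. Note also that seaborn.clustermap defaults to method='average', so labels computed with 'complete' linkage may not match the plotted dendrogram exactly.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# A few rows copied from the question's table (illustration only)
df = pd.DataFrame(
    {
        "log_HU1": [13.439499, 13.861164, 11.261368, 10.776370, 9.060244],
        "log_HU2": [13.746856, 13.511200, 10.433607, 10.550427, 13.152061],
    },
    index=["EEF1A1", "FTH1", "TFRC", "HSP90AA1", "CA1"],
)

# Same recipe as in the answer: pairwise distances, linkage, flat clusters
d = pdist(df)
L = linkage(d, method="complete")
labels = fcluster(L, 0.2 * d.max(), "distance")

# Group the original rows by their cluster label
rows_by_cluster = {int(label): group for label, group in df.groupby(labels)}

for label, group in rows_by_cluster.items():
    print(label, list(group.index))
```

Each value in rows_by_cluster is a regular DataFrame, so the full row values of a cluster are available, not only the row names.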
Let’s suppose we have the below data (a sample of my whole dataset that counts thousands of rows):
Node Target
1 2
1 3
1 5
2 1
2 3
2 6
7 8
7 12
9 13
9 15
9 14
Clearly, if I plot this data as a graph, I get two components that are disconnected.
I am wondering how to isolate or remove one component from my network, e.g., the smallest one.
I would say that I should first identify the components, then isolate or filter out the component(s) that I am not interested in.
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
connected_com=[len(c) for c in sorted(nx.connected_components(G), key=len, reverse=True)]
Now I should create a network only with data from the largest component:
largest_cc = max(nx.connected_components(G), key=len)
This is easy in the case of two components. However, if I would like to select two components and exclude one, how should I do it? This is my question.
In the example data you provided, 3 islands are obtained when plotting the graph with the code below:
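import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
# edge list from the question
df = pd.DataFrame({'Node': [1, 1, 1, 2, 2, 2, 7, 7, 9, 9, 9],
                   'Target': [2, 3, 5, 1, 3, 6, 8, 12, 13, 15, 14]})
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
nx.draw(G, with_labels=True)
plt.show()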
And the graph looks like that:
Now if you want to only keep the biggest two islands you can use the nx.connected_components(G) function that you mentioned and store the two biggest components. Below is the code to do this:
N_subs = 2  # number of biggest islands you want to keep
# sort the components once, largest first, and keep the first N_subs
components = sorted(nx.connected_components(G), key=len, reverse=True)
largest_components = components[:N_subs]
You will then need to create a subgraph of G that is composed of both islands. You can use nx.compose_all to do that, and you can then just plot your subgraph.
So overall the code looks like this:
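import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
# df is the edge list DataFrame from the question
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
N_subs = 2  # number of biggest islands you want to keep
components = sorted(nx.connected_components(G), key=len, reverse=True)
largest_components = components[:N_subs]
# build one subgraph per kept island and merge them into a single graph
G_sub = nx.compose_all([G.subgraph(c) for c in largest_components])
nx.draw(G_sub, with_labels=True)
plt.show()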
And the output of this code gives:
Note: according to the documentation, nx.connected_components is only implemented for undirected graphs. If you are dealing with directed graphs, you might want to use strongly_connected_components(G) or weakly_connected_components(G) instead.
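As a self-contained sketch of the steps above, the snippet below rebuilds the question's edge list, keeps the two largest components, and drops the rest. It takes the union of the kept node sets and calls G.subgraph, which is an alternative to nx.compose_all rather than the method used in the answer. The names keep, kept_nodes and G_kept are made up for illustration.

```python
import networkx as nx
import pandas as pd

# Edge list from the question
df = pd.DataFrame({
    "Node":   [1, 1, 1, 2, 2, 2, 7, 7, 9, 9, 9],
    "Target": [2, 3, 5, 1, 3, 6, 8, 12, 13, 15, 14],
})
G = nx.from_pandas_edgelist(df, "Node", "Target")  # undirected by default

# Components as sets of nodes, largest first
components = sorted(nx.connected_components(G), key=len, reverse=True)

# Keep the two largest components; any other selection of sets works the same way
keep = components[:2]
kept_nodes = set().union(*keep)

# Induced subgraph on the kept nodes; copy() makes it independent of G
G_kept = G.subgraph(kept_nodes).copy()

print(sorted(G_kept.nodes()))
print(G_kept.number_of_edges())
```

To exclude a specific component instead of the smallest one, the same pattern applies: filter the list of node sets by whatever criterion you need, for example whether a component contains a given node.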
Currently my data frame consists of both numerical and categorical values (mixed data types). My data frame looks like this:
id age txn_duration Statename amount gender religion
1 27 275 bihar 110 m hindu
2 33 163 maharashtra 50 f muslim
3 53 63 delhi 50 f muslim
4 47 100 up 50 m hindu
5 39 263 punjab 100 m punjabi
6 41 303 delhi 50 m punjabi
There are 20 states (Statename) and 7 religions. I applied get_dummies to both Statename and religion but got a lot of noise. I have also detected outliers. My questions are:
1. How do I find the optimal number of clusters for mixed data types?
2. I am currently using the k-means algorithm. Can I use k-modes or any other method that would improve my results? I am not getting good results with k-means.
3. How do I interpret my cluster results? I have used
print (cluster_data[clmns].groupby(['clusters']).mean())
Is there any other way I can inspect or plot them? Please provide the code.
My code is -
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
#Importing libraries
import os
import matplotlib.pyplot as plt#visualization
from PIL import Image
%matplotlib inline
import seaborn as sns#visualization
import itertools
import warnings
import io
from scipy import stats
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes
cluster_data = pd.read_csv("cluster.csv")
cluster_data = pd.get_dummies(cluster_data, columns=['StateName'])
cluster_data = pd.get_dummies(cluster_data, columns=['gender'])
cluster_data = pd.get_dummies(cluster_data, columns=['religion'])
clmns = ['mobile', 'age', 'txn_duration', 'amount', 'StateName_Bihar',
'StateName_Delhi', 'StateName_Gujarat', 'StateName_Karnataka',
'StateName_Maharashtra', 'StateName_Punjab', 'StateName_Rajasthan',
'StateName_Telangana', 'StateName_Uttar Pradesh',
'StateName_West Bengal', 'gender_female',
'gender_male', 'religion_buddhist',
'religion_christian', 'religion_hindu',
'religion_jain', 'religion_muslim',
'religion_other', 'religion_sikh']
df_tr_std = stats.zscore(cluster_data[clmns])
#Cluster the data
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
#Glue back to original data
cluster_data['clusters'] = labels
#Lets analyze the clusters
print (cluster_data[clmns].groupby(['clusters']).mean())
You can run something like this code:
Look at the attached image: in that plot you can see that having more than 3 clusters (for the dataset it was run on) does not significantly decrease the distortion. So the optimum number of clusters would be 3 in that case (simple synthetic data). For noisy data the decision might be harder.
Reference: A. Mueller's scipy notes on sklearn
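from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
distortions = []
for i in range(1, 11):
    # X is the numeric feature matrix to cluster
    km = KMeans(n_clusters=i)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()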
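To make the elbow idea concrete without a plot, here is a small sketch on synthetic data. The three blob centers, the range of k, and the drops list are assumptions chosen for illustration; they are not taken from the original data. The point is only that the distortion (KMeans.inertia_) falls sharply until k reaches the true number of groups and flattens out afterwards.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated blobs (assumed for illustration only)
X, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 10), (-10, 10)],
    cluster_std=1.0,
    random_state=0,
)

distortions = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_)  # within-cluster sum of squared distances

# How much the distortion falls when going from k-1 to k clusters
drops = [distortions[i] - distortions[i + 1] for i in range(len(distortions) - 1)]
for k, drop in zip(range(2, 7), drops):
    print(f"k={k}: distortion falls by {drop:.1f}")
```

On data like this, the fall from 2 to 3 clusters is much larger than the fall from 3 to 4, which is the "elbow". On real, noisy data the bend is usually less pronounced.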
Edit for ValueError:
For the ValueError: k-means needs numerical columns only, so you can do this:
df_numerics = df.drop(['Statename', 'gender', 'religion'], axis=1)
You can also drop other columns that you don't want included in clustering analysis.
With df_numerics, try the elbow method to find a good number of clusters.
Then, let's say you found that 3 clusters works well; you can run:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(df_numerics)
labels contains the cluster number (0, 1, 2 for 3 clusters) for each row in your dataframe. You can also save this as a column in your dataframe:
df['cluster_labels'] = labels
Then, to visualize it, you can pick 2 columns (more than that is difficult to visualize). Let's say you picked 'txn_duration' and 'amount'; you can plot those columns and add the cluster labels as color like this:
import matplotlib.pyplot as plt
plt.scatter(df['txn_duration'],df['amount'], c=df['cluster_labels'])
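The steps above can be sketched end to end on the six sample rows from the question. This is only an illustration under stated assumptions: the data is the tiny sample table, n_clusters=2 is picked arbitrarily because there are only six rows, and the StandardScaler step is an addition (similar in spirit to the zscore call in the question) so that no single numeric column dominates the distances.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Sample rows copied from the question
df = pd.DataFrame({
    "age": [27, 33, 53, 47, 39, 41],
    "txn_duration": [275, 163, 63, 100, 263, 303],
    "Statename": ["bihar", "maharashtra", "delhi", "up", "punjab", "delhi"],
    "amount": [110, 50, 50, 50, 100, 50],
    "gender": ["m", "f", "f", "m", "m", "m"],
    "religion": ["hindu", "muslim", "muslim", "hindu", "punjabi", "punjabi"],
})

# Keep only the numeric columns, as suggested in the answer
df_numerics = df.drop(["Statename", "gender", "religion"], axis=1)

# Assumption: standardize so every column contributes on the same scale
X = StandardScaler().fit_transform(df_numerics)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
df["cluster_labels"] = kmeans.fit_predict(X)

# Per-cluster means of the numeric columns, one way to interpret the clusters
summary = df.groupby("cluster_labels")[list(df_numerics.columns)].mean()
print(summary)
```

The summary table gives one row per cluster, which is often easier to read than a scatter plot when there are more than two numeric columns.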
I have a correlation matrix of 21 industry sectors. Now I want to split these 21 sectors into 4 or 5 groups, with sectors that behave similarly grouped together.
Could anyone shed some light on how to do this in Python? Thanks in advance!
You might explore the use of pandas DataFrame.corr and the scipy.cluster hierarchical clustering package:
import pandas as pd
import scipy.cluster.hierarchy as spc
df = pd.DataFrame(my_data)
corr = df.corr().values
pdist = spc.distance.pdist(corr)
linkage = spc.linkage(pdist, method='complete')
idx = spc.fcluster(linkage, 0.5 * pdist.max(), 'distance')
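If a fixed number of groups is wanted (such as the 4 or 5 mentioned in the question), fcluster also accepts criterion='maxclust'. The sketch below is a variation, not the exact method of the answer: it uses 1 - correlation as the distance between sectors instead of pdist on the rows of the correlation matrix, and it runs on synthetic data with six made-up sector names driven by two hidden factors.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic "sector" series: A* follow factor f1, B* follow factor f2
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 500))
data = {}
for name in ["A1", "A2", "A3"]:
    data[name] = f1 + 0.3 * rng.normal(size=500)
for name in ["B1", "B2", "B3"]:
    data[name] = f2 + 0.3 * rng.normal(size=500)
df = pd.DataFrame(data)

corr = df.corr()

# Assumption: treat 1 - correlation as the dissimilarity between two sectors
dist = 1.0 - corr.to_numpy()
condensed = squareform(dist, checks=False)  # condensed form expected by linkage

Z = linkage(condensed, method="average")

# Ask for at most 2 flat clusters; use t=4 or t=5 for the 21-sector case
labels = fcluster(Z, t=2, criterion="maxclust")

groups = pd.Series(labels, index=corr.columns)
print(groups.sort_values())
```

With criterion='maxclust' the threshold is chosen for you, which avoids tuning a distance cutoff such as 0.5 * pdist.max() by hand.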
I have a text corpus that contains 1000+ articles, each on a separate line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles.
This is the code I used to do the clustering:
# Agglomerative Clustering
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hac
tree = hac.linkage(X.toarray(), method="complete",metric="euclidean")
and I got this plot
Then I cut off the tree at the third level with fcluster()
from scipy.cluster.hierarchy import fcluster
clustering = fcluster(tree,3,'maxclust')
and I got this output:
[2 2 2 ..., 2 2 2]
My question is: how can I find the 10 most frequent words in each cluster in order to suggest a topic for each cluster?
You can do the following:
1. Align your results (your clustering variable) with your input (the 1000+ articles).
2. Using the pandas library, you can use a groupby function with the cluster # as its key.
3. Per group (using the get_group function), fill up a defaultdict of integers for every word you encounter.
4. You can now sort the dictionary of word counts in descending order and get your desired number of most frequent words.
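The steps above can be sketched as follows. The four short articles and the label list are made up for illustration, with the labels standing in for what fcluster would return. The sketch uses collections.Counter in place of a defaultdict of integers, since Counter provides most_common directly, and it splits on whitespace as a deliberately naive tokenizer.

```python
from collections import Counter

import pandas as pd

# Hypothetical input: one article per entry, plus one cluster label per article
articles = [
    "stocks fall as markets slide",
    "markets rally and stocks rise",
    "team wins football match",
    "football team loses final",
]
clustering = [1, 1, 2, 2]  # stand-in for the array returned by fcluster

# Step 1: align labels with the articles
df = pd.DataFrame({"text": articles, "cluster": clustering})

# Steps 2 to 4: group by cluster, count words, keep the most frequent ones
top_n = 2
top_words = {}
for label, group in df.groupby("cluster"):
    counts = Counter()
    for text in group["text"]:
        counts.update(text.lower().split())  # naive whitespace tokenization
    top_words[int(label)] = counts.most_common(top_n)

for label, words in top_words.items():
    print(label, words)
```

For real articles you would probably want to remove stop words and punctuation before counting, otherwise words like "the" will dominate every cluster.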
Good luck with what you're doing and please do accept my answer if it's what you're looking for.
I'd do it like this. Given a df with article names and article texts like
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Argument 6 non-null object
1 Article 6 non-null object
dtypes: object(2)
memory usage: 224.0+ bytes
create the articles matrix
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import CountVectorizer
# initialize
cv = CountVectorizer(stop_words='english')
cv_matrix = cv.fit_transform(df['Article'])
# create document term matrix (one row per article, one column per word)
df_dtm = pd.DataFrame(
    cv_matrix.toarray(),
    index=df['Argument'].values,
    columns=cv.get_feature_names_out()
)
tree = linkage(df_dtm, method="complete", metric="euclidean")
then get the chosen clustering
clustering = fcluster(tree, 2, 'maxclust')
and add clustering to df_dtm
df_dtm['_cluster_'] = clustering
df_dtm.index.name = '_article_'
df_word_count = df_dtm.groupby('_cluster_').sum().reset_index().melt(
    id_vars=['_cluster_'], var_name='_word_', value_name='_count_'
)
finally take the most frequent words in each cluster
words_1 = df_word_count[df_word_count._cluster_==1].sort_values(
    by=['_count_'], ascending=False).head(3)
words_2 = df_word_count[df_word_count._cluster_==2].sort_values(
    by=['_count_'], ascending=False).head(3)
dataset is a pandas DataFrame. This is sklearn.cluster.KMeans:
km = KMeans(n_clusters = n_Clusters)
prediction = km.predict(dataset)
This is how I decide which entity belongs to which cluster:
for i in range(len(prediction)):
    cluster_fit_dict[dataset.index[i]] = prediction[i]
This is how dataset looks:
A 1 2 3 4 5 6
B 2 3 4 5 6 7
C 1 4 2 7 8 1
where A,B,C are indices
Is this the correct way of using k-means?
Assuming all the values in the dataframe are numeric,
import pandas
import sklearn.cluster
# Convert DataFrame to matrix
mat = dataset.values
# Using sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
# Fit the model before reading the labels
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame
results = pandas.DataFrame([dataset.index, labels]).T
Alternatively, you could try KMeans++ for Pandas.
To know if your dataframe dataset has suitable content you can explicitly convert to a numpy array:
dataset_array = dataset.values
If the array has a homogeneous numerical dtype (typically numpy.float64) then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, with sklearn.preprocessing.StandardScaler for instance.
If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
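Putting the advice from both answers together, a minimal sketch could look like the following. It uses the three rows from the question, checks the dtype of the converted array, scales the features, fits the model before reading labels_, and keeps the DataFrame index attached to the labels. The choice of n_clusters=2 and the name cluster_fit are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows from the question, indexed by entity name
dataset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6],
     [2, 3, 4, 5, 6, 7],
     [1, 4, 2, 7, 8, 1]],
    index=["A", "B", "C"],
)

# An object dtype here would mean the frame mixes types and needs cleaning first
mat = dataset.to_numpy()
if mat.dtype == object:
    raise TypeError("dataset contains non-numeric columns")

# Optional normalization so that all columns are on a comparable scale
scaled = StandardScaler().fit_transform(mat)

# fit() must run before labels_ is available
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Map each index entry to its cluster label
cluster_fit = pd.Series(km.labels_, index=dataset.index, name="cluster")
print(cluster_fit)
```

Using a Series indexed like the original frame replaces the manual loop that fills a dictionary, and cluster_fit.to_dict() gives the same mapping if a dict is preferred.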