Text clustering using SciPy hierarchical clustering in Python

I have a text corpus of 1000+ articles, one per line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles.
This is the code I used to do the clustering:
# Agglomerative Clustering
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hac
tree = hac.linkage(X.toarray(), method="complete",metric="euclidean")
plt.clf()
hac.dendrogram(tree)
plt.show()
and I got this plot
Then I cut the tree into at most three flat clusters with fcluster()
from scipy.cluster.hierarchy import fcluster
clustering = fcluster(tree,3,'maxclust')
print(clustering)
and I got this output:
[2 2 2 ..., 2 2 2]
My question is: how can I find the 10 most frequent words in each cluster, in order to suggest a topic for each cluster?

You can do the following:
Align your results (your clustering variable) with your input (the 1000+ articles).
Using the pandas library, group the articles with groupby, using the cluster number as the key.
For each group (retrieved with get_group), fill a defaultdict of integers, incrementing the count for every word you encounter.
Sort the dictionary of word counts in descending order and take the desired number of most frequent words.
Good luck with what you're doing and please do accept my answer if it's what you're looking for.
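The steps above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual pipeline: the `articles` list and `clustering` list are made-up stand-ins, the whitespace tokenizer is an assumption (a real corpus would want proper tokenization and stop-word removal), and `collections.Counter` is used in place of the `defaultdict` of integers because it does the same counting and sorts for free.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-ins for the real corpus and the fcluster() output
articles = [
    "stocks fall as markets react to rates",
    "markets rally while stocks recover",
    "team wins the final match",
    "coach praises team after match",
]
clustering = [1, 1, 2, 2]

# Align each article with its cluster label
df = pd.DataFrame({"text": articles, "cluster": clustering})

top_words = {}
for cluster_id, group in df.groupby("cluster"):
    counts = Counter()
    for text in group["text"]:
        # Naive whitespace tokenization (assumption; no stop-word removal)
        counts.update(text.lower().split())
    # most_common(n) returns up to n (word, count) pairs, highest count first
    top_words[cluster_id] = counts.most_common(10)

for cluster_id, words in top_words.items():
    print(cluster_id, words)
```

With real articles, passing the text through the same vectorizer that produced X would keep the word list consistent with the features used for clustering.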

I would do it as follows. Given a DataFrame df with the article name and article text, like
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Argument  6 non-null      object
 1   Article   6 non-null      object
dtypes: object(2)
memory usage: 224.0+ bytes
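create the document-term matrix
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import CountVectorizer
# initialize the vectorizer and count the words of each article
cv = CountVectorizer(stop_words='english')
cv_matrix = cv.fit_transform(df['Article'])
# create the document-term matrix (rows: articles, columns: words)
df_dtm = pd.DataFrame(
    cv_matrix.toarray(),
    index=df['Argument'].values,
    # get_feature_names() was removed in scikit-learn 1.2
    columns=cv.get_feature_names_out()
)
# linkage is imported directly, so it is called without a module prefix
tree = linkage(df_dtm, method="complete", metric="euclidean")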
then get the chosen clustering
clustering = fcluster(tree, 2, 'maxclust')
and add the cluster labels to df_dtm, then sum the word counts per cluster
df_dtm['_cluster_'] = clustering
df_dtm.index.name = '_article_'
df_word_count = df_dtm.groupby('_cluster_').sum().reset_index().melt(
    id_vars=['_cluster_'], var_name='_word_', value_name='_count_'
)
finally take the most frequent words of each cluster
words_1 = df_word_count[df_word_count._cluster_==1].sort_values(
    by=['_count_'], ascending=False).head(3)
words_2 = df_word_count[df_word_count._cluster_==2].sort_values(
    by=['_count_'], ascending=False).head(3)
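The per-cluster selections above can be generalized to any number of clusters without writing one variable per cluster. A sketch, assuming a long-format frame shaped like df_word_count; the toy values and the `n_top` name are made up for illustration:

```python
import pandas as pd

# Toy long-format word counts, shaped like df_word_count (made-up values)
df_word_count = pd.DataFrame({
    "_cluster_": [1, 1, 1, 2, 2, 2],
    "_word_": ["market", "stock", "team", "market", "stock", "team"],
    "_count_": [5, 3, 0, 1, 0, 4],
})

n_top = 2  # number of words to keep per cluster

# Sort by count once, then keep the first n_top rows of every cluster
top_words = (
    df_word_count
    .sort_values("_count_", ascending=False)
    .groupby("_cluster_")
    .head(n_top)
    .sort_values(["_cluster_", "_count_"], ascending=[True, False])
)
print(top_words)
```

Setting `n_top = 10` would give the ten words per cluster that the question asks for.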

Related

Create clusters depending on scores performance

I have data from students who took a test with 2 sections: the first section tests their digital skills at level 2, and the second section tests their digital skills at level 3. I need to come up with 3 clusters of students based on their scores, to place them in 3 different skill levels (1, 2 and 3). A code sample is below:
import pandas as pd
# Build the data as a dict of lists.
data = {'Name': ['Marc','Fay', 'Emile','bastian', 'Karine','kathia', 'John','moni'],
        'Scores_section1': [12,24,14,20,8,10,5,23],
        'Scores_section2' : [20,4,1,0,18,9,12,10],
        'Sum_all_scores': [32,28,15,20,26,19,17,33]}
# Create DataFrame
df = pd.DataFrame(data)
# print dataframe.
df
I thought about using K-means clustering, but following a tutorial online, I'd need to use x,y coordinates. Should I use scores_section1 as x, and Scores_section2 as y or vice-versa, and why?
Many thanks in advance for your help!
Try it this way.
import pandas as pd
# Build the data as a dict of lists.
data = {'Name': ['Marc','Fay', 'Emile','bastian', 'Karine','kathia', 'John','moni'],
        'Scores_section1': [12,24,14,20,8,10,5,23],
        'Scores_section2' : [20,4,1,0,18,9,12,10],
        'Sum_all_scores': [32,28,15,20,26,19,17,33]}
# Create DataFrame
df = pd.DataFrame(data)
# print dataframe.
df
# Import the required module
from sklearn.cluster import KMeans
# Initialize the estimator
kmeans = KMeans(n_clusters=3)
# Select the numeric score columns as features (df itself, including Name, stays intact)
features = df[['Scores_section1', 'Scores_section2', 'Sum_all_scores']]
# Fit the model and predict the cluster label of each student
label = kmeans.fit_predict(features)
label
# Store the labels alongside the original rows
df['kmeans'] = label
df
# K-Means Clustering may be the most widely known clustering algorithm and involves assigning examples to
# clusters in an effort to minimize the variance within each cluster.
# The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets
# on the basis of a sample. The process, which is called ‘k-means,’ appears to give partitions which are reasonably
# efficient in the sense of within-class variance.
# plot X & Y coordinates and color by cluster number
import plotly.express as px
fig = px.scatter(df, x="Scores_section1", y="Scores_section2", color="kmeans", size='Sum_all_scores', hover_data=['kmeans'])
fig.show()
Feel free to modify the code to suit your needs.
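On the original question of which column should be x and which y: K-means works on Euclidean distances between rows, so the column order should not change the grouping. A small check under stated assumptions: `random_state` and `n_init` are fixed here only to make the run repeatable, and Sum_all_scores is left out because it is the sum of the other two columns.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

df = pd.DataFrame({
    "Scores_section1": [12, 24, 14, 20, 8, 10, 5, 23],
    "Scores_section2": [20, 4, 1, 0, 18, 9, 12, 10],
})

def cluster(features):
    # Fixed seed and n_init so repeated runs give the same labels
    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    return km.fit_predict(features)

labels_xy = cluster(df[["Scores_section1", "Scores_section2"]])
labels_yx = cluster(df[["Scores_section2", "Scores_section1"]])

# An adjusted Rand index of 1.0 means the two partitions are identical
print(adjusted_rand_score(labels_xy, labels_yx))
```

The x/y choice only matters for the scatter plot, where it decides which score appears on which axis.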

How can I use SVM classifier to detect outliers in percentage changes?

I have a pandas dataframe that is in the following format:
This contains the % change in stock prices each day for 3 companies MSFT, F and BAC.
I would like to use OneClassSVM to detect whether each data point is an outlier or not. I have tried the following code, which I believe detects the rows that contain outliers.
#Import libraries
import numpy as np
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt
#Create SVM Classifier
svm = OneClassSVM(kernel='rbf',
                  gamma=0.001, nu=0.03)
#Use svm to fit and predict
svm.fit(delta)
pred = svm.predict(delta)
#If the values are outlier the prediction
#would be -1
outliers = np.where(pred==-1)
#Print rows with outliers
print(outliers)
This gives the following output:
I would like to then add a new column to my dataframe that includes whether the data is an outlier or not. I have tried the following code but I get an error due to the lists being different lengths as shown below.
condition = (delta.index.isin(outliers))
assigned_value = "outlier"
df['isoutlier'] = np.select(condition,
assigned_value)
Could you let me know how I could add this column, given that the list of rows containing outliers is much shorter than the dataframe?
It's not very clear what delta and df are in your code. I am assuming they are the same data frame.
You can use the result of svm.predict directly, since it has one entry per row. Here the column is left blank ('') for rows that are not outliers:
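import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM
# random example data standing in for the real data frame
df = pd.DataFrame(np.random.uniform(0,1,(100,3)),columns=['A','B','C'])
svm = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.03)
svm.fit(df)
pred = svm.predict(df)
# pred is -1 for outliers and 1 for inliers
df['isoutlier'] = np.where(pred == -1 ,'outlier','')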
           A         B         C isoutlier
0   0.869475  0.752420  0.388898
1   0.177420  0.694438  0.129073
2   0.011222  0.245425  0.417329
3   0.791647  0.265672  0.401144
4   0.538580  0.252193  0.142094
..       ...       ...       ...       ...
95  0.742192  0.079426  0.676820   outlier
96  0.619767  0.702513  0.734390
97  0.872848  0.251184  0.887500   outlier
98  0.950669  0.444553  0.088101
99  0.209207  0.882629  0.184912
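If the row labels of the flagged observations are needed as well (the list the question tried to build with where), a boolean mask over the predictions gives them directly. A sketch on synthetic data: the values are random, and the date index is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the daily % changes (hypothetical values)
rng = np.random.default_rng(0)
delta = pd.DataFrame(
    rng.normal(0, 1, size=(100, 3)),
    columns=["MSFT", "F", "BAC"],
    index=pd.date_range("2021-01-01", periods=100, freq="D"),
)

svm = OneClassSVM(kernel="rbf", gamma=0.001, nu=0.03)
pred = svm.fit_predict(delta)  # -1 for outliers, 1 for inliers

# pred has one entry per row, so it lines up with the DataFrame index
is_outlier = pred == -1
delta["isoutlier"] = is_outlier

# Row labels (dates) of the flagged rows
outlier_dates = delta.index[is_outlier]
print(outlier_dates)
```

Because the mask has the same length as the data frame, no length mismatch can occur when assigning the new column.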

Extract rows of clusters in hierarchical clustering using seaborn clustermap

I am using hierarchical clustering from seaborn.clustermap to cluster my data. This works fine to nicely visualize the clusters in a heatmap. However, now I would like to extract all row values that are assigned to the different clusters.
This is what my data looks like:
import pandas as pd
# load DataFrame
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
df
log_HU1 log_HU2
EEF1A1 13.439499 13.746856
HSPA8 13.169191 12.983910
FTH1 13.861164 13.511200
PABPC1 12.142340 11.885885
TFRC 11.261368 10.433607
RPL26 13.837205 13.934710
NPM1 12.381585 11.956855
RPS4X 13.359880 12.588574
EEF2 11.076926 11.379336
RPS11 13.212654 13.915813
RPS2 12.910164 13.009184
RPL11 13.498649 13.453234
CA1 9.060244 13.152061
RPS3 11.243343 11.431791
YBX1 12.135316 12.100374
ACTB 11.592359 12.108637
RPL4 12.168588 12.184330
HSP90AA1 10.776370 10.550427
HSP90AB1 11.200892 11.457365
NCL 11.366145 11.060236
Then I perform the clustering using seaborn as follows:
fig = sns.clustermap(df)
Which produces the following clustermap:
For this example I may be able to manually interpret the values belonging to each cluster (e.g. that TFRC and HSP90AA1 cluster together). However, I am planning to run this clustering analysis on much bigger data sets.
So my question is: does anyone know how to get the row values belonging to each cluster?
Thanks,
Using scipy.cluster.hierarchy module with fcluster allows cluster retrieval:
import pandas as pd
import seaborn as sns
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
# retrieve clusters using fcluster
d = pdist(df)
L = sch.linkage(d, method='complete')
# 0.2 can be modified to retrieve more stringent or relaxed clusters
clusters = sch.fcluster(L, 0.2*d.max(), 'distance')
# cluster indices correspond to the row indices of the original df
for i, cluster in enumerate(clusters):
    print(df.index[i], cluster)
Out:
EEF1A1 2
HSPA8 1
FTH1 2
PABPC1 3
TFRC 5
RPL26 2
NPM1 3
RPS4X 1
EEF2 4
RPS11 2
RPS2 1
RPL11 2
CA1 6
RPS3 4
YBX1 3
ACTB 3
RPL4 3
HSP90AA1 5
HSP90AB1 4
NCL 4
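To collect the row labels per cluster rather than print them one by one, the label array can be folded into a dictionary. A sketch on a handful of rows copied from the question's table; the `members` name is arbitrary. Note that sns.clustermap defaults to method='average', so the linkage method here should match whatever was passed to clustermap if the groups are meant to mirror the heatmap's dendrogram.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# A few rows copied from the question's table
df = pd.DataFrame(
    {
        "log_HU1": [13.439499, 13.861164, 11.261368, 10.776370, 9.060244],
        "log_HU2": [13.746856, 13.511200, 10.433607, 10.550427, 13.152061],
    },
    index=["EEF1A1", "FTH1", "TFRC", "HSP90AA1", "CA1"],
)

d = pdist(df)                     # condensed Euclidean distance matrix
L = linkage(d, method="average")  # clustermap's default linkage method
clusters = fcluster(L, 0.2 * d.max(), "distance")

# Map each cluster id to the list of row labels assigned to it
members = {}
for label, cluster_id in zip(df.index, clusters):
    members.setdefault(int(cluster_id), []).append(label)

print(members)
```

The rows of one cluster can then be pulled out with df.loc[members[cluster_id]].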

Extract regression coefficients from statsmodels

I'm estimating an OLS model, as seen below. I need the coefficients on the categorical variable along with their values.
Here's my code:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(25, 1), columns=list('A'))
df['groupid'] = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,5,5,5,5,5,6,6,6,6,6]
df['groupid'] = df['groupid'].astype('int')
###Fixed effects models
FE_ols = smf.ols(formula = 'A ~ C(groupid) - 1', data=df).fit()
FE_coeffs = FE_ols.params #Save coeffs
FE_coeffs.GroupID = FE_coeffs.index #Extract value of GroupID
FE_coeffs.GroupID = FE_coeffs.GroupID.str.extract('(\d+)') #Parse number from string
I'm able to extract the coefficients on the dummy variables. I put them in a new data frame.
C(groupid)[1] 0.2329694463342642
C(groupid)[2] 0.7567034333090062
C(groupid)[3] 0.31355791920072623
C(groupid)[5] -0.05131898650395289
C(groupid)[6] 0.31757453138500547
However, I want the data frame to be like:
1 0.2329694463342642
2 0.7567034333090062
3 0.31355791920072623
5 -0.05131898650395289
6 0.31757453138500547
The code seems to work, including the parsing. When I run this in Jupyter, it even shows the correct output. But the change isn't saved to the data frame. There doesn't seem to be an inplace=True kind of option.
Will appreciate any help.
FE_coeffs is a Series, so setting an attribute GroupID on it as if you were adding a column is the wrong approach: the attribute is not part of the data. Instead, just overwrite the index with the extracted integer values:
In [80]: FE_coeffs = FE_ols.params.copy()
In [81]: FE_coeffs.index = FE_coeffs.index.str.extract(r"(\d+)", expand=False).astype(int)
In [82]: FE_coeffs
Out[82]:
1 0.232969
2 0.756703
3 0.313558
5 -0.051319
6 0.317575
dtype: float64
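If a two-column data frame is wanted rather than a re-indexed Series, the same extraction can feed a DataFrame constructor. A pandas-only sketch: the `params` Series below is typed in by hand from the question's output rather than taken from a fitted model, and the column names `groupid` and `coef` are arbitrary.

```python
import pandas as pd

# Hand-copied stand-in for FE_ols.params (values from the question)
params = pd.Series(
    [0.2329694463342642, 0.7567034333090062, 0.31355791920072623,
     -0.05131898650395289, 0.31757453138500547],
    index=["C(groupid)[1]", "C(groupid)[2]", "C(groupid)[3]",
           "C(groupid)[5]", "C(groupid)[6]"],
)

# Pull the group number out of labels such as "C(groupid)[3]"
group_ids = params.index.str.extract(r"\[(\d+)\]", expand=False).astype(int)

# One row per group: the parsed id and its coefficient
coef_table = pd.DataFrame({"groupid": group_ids, "coef": params.to_numpy()})
print(coef_table)
```

Anchoring the pattern on the square brackets keeps it from picking up digits that might appear in the variable name itself.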

Will pandas dataframe object work with sklearn kmeans clustering?

dataset is a pandas DataFrame, and the estimator is sklearn.cluster.KMeans:
km = KMeans(n_clusters = n_Clusters)
km.fit(dataset)
prediction = km.predict(dataset)
This is how I decide which entity belongs to which cluster:
for i in range(len(prediction)):
    cluster_fit_dict[dataset.index[i]] = prediction[i]
This is how dataset looks:
A 1 2 3 4 5 6
B 2 3 4 5 6 7
C 1 4 2 7 8 1
...
where A,B,C are indices
Is this the correct way of using k-means?
Assuming all the values in the dataframe are numeric,
import pandas
import sklearn.cluster
# Convert DataFrame to a NumPy array
mat = dataset.values
# Using sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame (one row per index label and its cluster)
results = pandas.DataFrame([dataset.index,labels]).T
Alternatively, you could try KMeans++ for Pandas.
To know if your dataframe dataset has suitable content you can explicitly convert to a numpy array:
dataset_array = dataset.values
print(dataset_array.dtype)
print(dataset_array)
If the array has a homogeneous numerical dtype (typically numpy.float64) then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, for instance with sklearn.preprocessing.StandardScaler.
If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
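The preparation described above might look like this in practice. A sketch with a made-up mixed-type frame: the `sample_id`, `color`, `height`, and `weight` columns are hypothetical, pd.get_dummies stands in for whatever categorical encoding suits the data, and `n_init` and `random_state` are set only to make the run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical heterogeneous data: an identifier, a categorical, two numerics
dataset = pd.DataFrame({
    "sample_id": ["A", "B", "C", "D", "E", "F"],
    "color": ["red", "blue", "red", "blue", "red", "blue"],
    "height": [1.0, 1.2, 5.0, 5.3, 9.1, 9.0],
    "weight": [10.0, 11.0, 50.0, 52.0, 90.0, 88.0],
})

# Drop the identifier and turn the categorical column into dummy variables
features = pd.get_dummies(dataset.drop(columns="sample_id"), columns=["color"])
features = features.astype(float)  # homogeneous numeric dtype

# Standardize so that no single column dominates the distances
scaled = StandardScaler().fit_transform(features)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = pd.Series(km.fit_predict(scaled), index=dataset["sample_id"], name="cluster")
print(labels)
```

Keeping the identifier as the index of the result, rather than as a feature, preserves the mapping from each sample to its cluster without letting the identifier influence the distances.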
