Get clusters of words using KMeans and TF-IDF - Python

I am trying to cluster text words.
Suppose I have a list of texts:
text = ["WhatsApp extends 'confusing' update deadline",
        "India begins world's biggest Covid vaccine drive",
        "Nepali climbers make history with K2 winter summit"]
I implemented TF-IDF on this data
vec = TfidfVectorizer()
feat = vec.fit_transform(text)
After that, I applied KMeans:
kmeans = KMeans(n_clusters=num).fit(feat)
What I am confused about is how to get clusters of words such as:
cluster 0
WhatsApp, update, biggest
cluster 1
history, biggest, world's
etc.

You can use the get_feature_names() method from the TfidfVectorizer class with the predictions from KMeans to inspect the words in each cluster.
Here's a minimal example with two clusters and the three sentences you provided:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["WhatsApp extends 'confusing' update deadline",
        "India begins world's biggest Covid vaccine drive",
        "Nepali climbers make history with K2 winter summit"]

vec = TfidfVectorizer()
feat = vec.fit_transform(text)

kmeans = KMeans(2).fit(feat)
pred = kmeans.predict(feat)

for i in range(2):
    print(f"Cluster #{i}:")
    words = []
    for sentence in np.array(text)[pred == i]:
        words += [fn for fn in vec.get_feature_names() if fn in sentence]
    print(words)
Result:
Cluster #0:
['confusing', 'deadline', 'extends', 'update', 'begins', 'biggest', 'drive', 'vaccine', 'world']
Cluster #1:
['climbers', 'history', 'make', 'summit', 'winter', 'with']
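If you would rather rank words by how strongly they weigh in each cluster (instead of collecting the words of the sentences assigned to it), here is a complementary sketch, not part of the original answer, that sorts the learned centroids; it reuses the vec and kmeans objects fitted above:
import numpy as np

# Each centroid holds one weight per vocabulary term; a higher weight means the
# term is more characteristic of that cluster.
terms = np.array(vec.get_feature_names())
order = kmeans.cluster_centers_.argsort()[:, ::-1]  # term indices, highest weight first

for i in range(2):
    print(f"Cluster #{i}:", terms[order[i, :5]].tolist())  # top 5 terms per cluster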

Related

Visualization and Clustering in Python

I would like to classify comments using an NLP algorithm (tf-idf).
I managed to compute these clusters, but I want to visualize them graphically (histogram, scatter plot, ...).
import collections
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import string

data = pd.read_excel(r'C:\Users\cra\One\intern\Book2.xlsx')
def word_tokenizer(text):
    # tokenizes and stems the text
    tokens = word_tokenize(text)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in tokens
              if t not in stopwords.words('english')]
    return tokens
# tf-idf: convert text data to vectors
def cluster_sentences(sentences, nb_of_clusters=5):
    tfidf_vectorizer = TfidfVectorizer(tokenizer=word_tokenizer,
                                       stop_words=stopwords.words('english'),  # remove stopwords
                                       max_df=0.95, min_df=0.05,
                                       lowercase=True)
    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)
    kmeans = KMeans(n_clusters=nb_of_clusters)
    kmeans.fit(tfidf_matrix)
    clusters = collections.defaultdict(list)
    for i, label in enumerate(kmeans.labels_):
        clusters[label].append(i)
    return dict(clusters)
if __name__ == "__main__":
    sentences = data.Comment
    nclusters = 20
    # dictionary mapping each cluster to the indices of its comments in the dataframe
    clusters = cluster_sentences(sentences, nclusters)
    for cluster in range(nclusters):
        print("cluster ", cluster, ":")
        for i, sentence in enumerate(clusters[cluster]):
            print("\tsentence ", i, ": ", sentences[sentence])
For example, here is the result I got:
cluster 6 :
sentence 0 : 26 RIH DP std
sentence 1 : 32 RIH DP std
sentence 2 : 68 RIH Liner with DP std in hole
sentence 3 : 105 RIH DP std
sentence 4 : 118 RIH std no of DP in hole
sentence 5 : 154 RIH DP std
Could you help me, please? Thank you.
You will need to use t-SNE to visualize the clusters - this article on visualizing and clustering US Laws using tf-idf can get you started.
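As a concrete starting point, here is a minimal sketch (not from the linked article) that projects the tf-idf vectors to 2D with t-SNE and colours each comment by its KMeans cluster; it assumes you also return or otherwise expose tfidf_matrix and the fitted kmeans from cluster_sentences above:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Convert the sparse tf-idf matrix to a dense array and project it to 2 dimensions.
coords = TSNE(n_components=2, random_state=42).fit_transform(tfidf_matrix.toarray())

# One point per comment, coloured by the cluster KMeans assigned to it.
plt.scatter(coords[:, 0], coords[:, 1], c=kmeans.labels_, cmap='tab20', s=10)
plt.title('Comments projected with t-SNE, coloured by KMeans cluster')
plt.show()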

K-means predict and fit predictions

I have a simple K-means program that extracts 2 clusters and then tries to predict clusters for new sentences. I would like to find how well each prediction 'fits' its cluster.
In my example, predict_c0 fits cluster 0 well, while predict_bad_fit straddles clusters 0 and 1.
I suppose I have to calculate the average distance to the cluster centroid for each predicted sentence. How do I do that?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

c0_sents = ['cats and dogs', 'i like cats', 'cats not like dogs', 'cats and dogs animals']
c1_sents = ['computer is for typing', 'i play games on my computer', 'programs run on computer', 'computer has screen']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(c0_sents + c1_sents)

k_means = KMeans(n_clusters=2)
k_means.fit(X)

order_centroids = k_means.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(2):
    for ind in order_centroids[i, :4]:
        print(i, terms[ind])
#0 computer
#0 on
#0 screen
#0 has
#1 cats
#1 dogs
#1 like
#1 and
predict_c0 = ['cats are not dogs']
predict_c1 = ['typing on computers']
predict_bad_fit = ['cats on computers dogs on screen']

for sent in predict_c0 + predict_c1 + predict_bad_fit:
    X = vectorizer.transform([sent])
    predicted = k_means.predict(X)
    print(sent, predicted)
#cats are not dogs [0]
#typing on computers [1]
#cats on computers dogs on screen [1]
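The thread does not include an answer, but one way to quantify the fit (an assumption on my part, not from the original post) is KMeans.transform, which returns the distance from each sample to every centroid; a small gap between the two distances suggests the sentence sits between the clusters:
for sent in predict_c0 + predict_c1 + predict_bad_fit:
    vec = vectorizer.transform([sent])
    dists = k_means.transform(vec)[0]   # distance from this sentence to each of the 2 centroids
    nearest = dists.argmin()
    margin = abs(dists[0] - dists[1])   # small margin -> the sentence is a poor fit for either cluster
    print(f"{sent!r}: cluster {nearest}, distances {dists.round(3)}, margin {margin:.3f}")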

How to get top n terms with highest tf-idf score - Big sparse matrix

There is this code:
feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]
n = 3
top_n = feature_array[tfidf_sorting][:n]
coming from this answer.
My question is: how can I do this efficiently when my sparse matrix is too big to convert to a dense matrix all at once (with response.toarray())?
Apparently, the general approach is to split the sparse matrix into chunks, convert each chunk inside a for loop, and then combine the results across all chunks.
But I would like to see the complete code that does this.
If you look closely at that question, it is about the top tf-idf scores for a single document.
When you want to do the same thing for a large corpus, you need to sum the scores of each feature across all documents (which is still not very meaningful, because the scores are l2-normalized in TfidfVectorizer(); read here). I would recommend using the .idf_ attribute to find the features with a high inverse document frequency.
If instead you want to know the top features based on their number of occurrences, use CountVectorizer():
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus'
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
feature_array = vectorizer.get_feature_names()

top_n = 3
print('tf_idf scores: \n', sorted(list(zip(vectorizer.get_feature_names(),
                                           X.sum(0).getA1())),
                                  key=lambda x: x[1], reverse=True)[:top_n])
# tf_idf scores:
# [('document', 1.4736296010332683), ('check', 0.6227660078332259), ('like', 0.6227660078332259)]

print('idf values: \n', sorted(list(zip(feature_array, vectorizer.idf_)),
                               key=lambda x: x[1], reverse=True)[:top_n])
# idf values:
# [('aim', 1.6931471805599454), ('capture', 1.6931471805599454), ('check', 1.6931471805599454)]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
feature_array = vectorizer.get_feature_names()
print('Frequency: \n', sorted(list(zip(vectorizer.get_feature_names(),
                                       X.sum(0).getA1())),
                              key=lambda x: x[1], reverse=True)[:top_n])
# Frequency:
# [('document', 2), ('aim', 1), ('capture', 1)]
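If you do need the original per-document top-n terms without ever densifying the whole matrix, a minimal sketch (my reading of the chunked approach, not part of the answer above) is to walk the rows of the sparse matrix one at a time; it reuses the tfidf and response names from the question's snippet:
import numpy as np

n = 3
feature_array = np.array(tfidf.get_feature_names())

# response is the large sparse tf-idf matrix (CSR format) returned by fit_transform/transform.
for row_idx in range(response.shape[0]):
    row = response.getrow(row_idx)                      # a 1 x n_features sparse row
    if row.nnz == 0:
        continue
    top = row.indices[np.argsort(row.data)[::-1][:n]]   # column indices of the n largest scores
    print(row_idx, feature_array[top])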

Get values from K-Means clusters using dataframe

I have this dataframe (text_df):
There are 10 different authors with 13834 rows of text.
I then created a bag of words and used a TfidfVectorizer like so:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True)

X = tfidf_v.fit_transform(corpus).toarray()  # corpus --> bag of words
y = text_df.iloc[:, 1].values
Shape of X is (13834,2701)
I decided to use 7 clusters for KMeans:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=7,random_state=42)
I'd like to extract the authors of the texts in each cluster to see if the authors are consistently grouped into the same cluster. Not sure about the best way to go about this. Thanks!
Update:
I am trying to visualize the author count per cluster using a nested dictionary, like so:
author_cluster = {}
for i in range(len(y_kmeans)):
    # pick a random prediction each time through the loop
    j = np.random.randint(0, 13833, 1)[0]
    if y_kmeans[j] not in author_cluster:
        author_cluster[y_kmeans[j]] = {}
    if y[j] not in author_cluster[y_kmeans[j]]:
        author_cluster[y_kmeans[j]][y[j]] = 1
    else:
        author_cluster[y_kmeans[j]][y[j]] += 1
Output:
There should be a larger count per cluster and probably more than one author per cluster. I'd like to use all of the predictions to get a more accurate count instead of using a subset. But open to alternative solutions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True)

X = tfidf_v.fit_transform(corpus)  # I removed .toarray() - not sure why it was there except maybe for print debugging?
y = text_df.iloc[:, 1].values

km = KMeans(n_clusters=7, random_state=42)
model = km.fit(X)
result = model.predict(X)

for i in range(20):
    # check 20 random predictions
    container = np.random.randint(low=0, high=13833, size=1)
    j = container[0]
    print(f'Author {y[j]} wrote {X[j]} and was put in cluster {result[j]}')
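To count every author in every cluster rather than sampling a handful of predictions (my suggestion, not part of the original answer), a pandas cross-tabulation of the predicted labels against the author labels gives the full picture in one call; it assumes the result and y arrays from the snippet above:
import pandas as pd

# Rows are clusters, columns are authors, and each cell counts how many of that
# author's texts landed in that cluster.
counts = pd.crosstab(pd.Series(result, name='cluster'), pd.Series(y, name='author'))
print(counts)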

Should TF-IDF Score be Identical for Every Ngram with the Same Frequency in a Document

I am using sklearn to find frequencies and tf-idf scores for bigrams from a set of 50 documents. Let's say one of the cleaned documents is "run fast slow".
The output is:
ngram       freq    tfidf
run fast    1       .23
fast slow   1       .23
The ngrams in the output for that one document are found in other documents. Let's say "run fast" is found 20 times in the document collection and "fast slow" is found 30 times. Why are the tfidf scores the same for ngrams within a document that have the same frequency?
This doesn't intuitively seem like the correct output since the frequency across the document collection varies.
This is the code I am using to extract the features. It takes a grouped df and a text column from that df:
def extractFeatures(groupedDF, textCol):
    cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
    tv = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
    features = pd.DataFrame()
    for id, group in tqdm(groupedDF):
        freq = cv.fit_transform(group[textCol])
        tfidf = tv.fit_transform(group[textCol])
        freq = sum(freq).toarray()[0]
        tfidf.todense()
        tfidf = tfidf.toarray()[0]
        freq = pd.DataFrame(freq, columns=['frequency'])
        tfidf = pd.DataFrame(tfidf, columns=['tfidf'])
        dfinner = pd.DataFrame(cv.get_feature_names(), columns=['ngram'])
        dfinner['map'] = id
        dfinner = dfinner.join(freq)
        results = dfinner.join(tfidf)
        features = features.append(results)
    return features
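The thread does not include an answer, but one likely cause (my observation, not something stated in the post) is that cv and tv are re-fit inside the loop on each group rather than on the whole 50-document collection, so the idf only reflects that group; if a group amounts to a single document, every bigram gets the same idf and the tf-idf score depends only on the in-document frequency. A small sketch with a made-up toy corpus illustrates the difference:
from sklearn.feature_extraction.text import TfidfVectorizer

doc = "run fast slow"
# A made-up toy collection in which 'run fast' appears in several documents
# but 'fast slow' only appears in the first one.
corpus = [doc, "run fast today", "they run fast", "walk slow"]

bigram_args = dict(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2), analyzer='word')

# Fit on the single document only: every bigram has the same idf, so equal
# in-document frequencies give equal tf-idf scores.
tv_single = TfidfVectorizer(**bigram_args)
row_single = tv_single.fit_transform([doc]).toarray()[0]
print(dict(zip(tv_single.get_feature_names(), row_single)))

# Fit on the whole collection: document frequencies now differ, so 'run fast'
# and 'fast slow' get different scores within the same document.
tv_all = TfidfVectorizer(**bigram_args)
row_all = tv_all.fit_transform(corpus).toarray()[0]
print(dict(zip(tv_all.get_feature_names(), row_all)))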
