I have a simple K-means program that extracts 2 clusters and then tries to predict clusters for new sentences. I would like to find the best 'fit' for each cluster.
In my example, 'predict_c0' is a good fit for cluster 0, while 'predict_bad_fit' covers both cluster 0 and cluster 1.
I suppose I have to calculate the average distance to the cluster centroid for each predicted sentence. How do I do that?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
c0_sents = ['cats and dogs','i like cats','cats not like dogs','cats and dogs animals',]
c1_sents = ['computer is for typing','i play games on my computer','programs run on computer','computer has screen']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(c0_sents+c1_sents)
k_means = KMeans(n_clusters=2)
k_means.fit(X)
order_centroids = k_means.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(2):
    for ind in order_centroids[i, :4]:
        print(i, terms[ind])
#0 computer
#0 on
#0 screen
#0 has
#1 cats
#1 dogs
#1 like
#1 and
predict_c0 = ['cats are not dogs']
predict_c1 = ['typing on computers']
predict_bad_fit = ['cats on computers dogs on screen']
for sent in predict_c0+predict_c1+predict_bad_fit:
    X = vectorizer.transform([sent])
    predicted = k_means.predict(X)
    print(sent, predicted)
#cats are not dogs [0]
#typing on computers [1]
#cats on computers dogs on screen [1]
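One way to quantify the fit is KMeans.transform, which returns the distance from each sample to every cluster centre; a sentence that is close to one centroid and far from the other is a good fit, while roughly equal distances indicate a poor fit. A minimal sketch reusing the fitted vectorizer and k_means from above:
# Distance of each predicted sentence to both centroids
for sent in predict_c0 + predict_c1 + predict_bad_fit:
    X_new = vectorizer.transform([sent])
    cluster = k_means.predict(X_new)[0]
    distances = k_means.transform(X_new)[0]  # shape (n_clusters,)
    print(sent, cluster, distances)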
I did k-means clustering by running the code below:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_std = StandardScaler().fit_transform(df_logret)
km = KMeans(n_clusters=2, max_iter=100)
km.fit(X_std)
centroids = km.cluster_centers_
I'd like to put cluster 1 into x_1 and cluster 2 into x_2 and run a regression of the form y = a*x_1 + b*x_2.
I've been searching for a way to do this all day but can't find one.
The dataset 'df_logret' looks like this:
Any help would be greatly appreciated!
You've just applied KMeans clustering on X_std. With scikit-learn, you can extract the fitted labels and use them to split the data into the appropriate clusters.
Assuming your X_std is an N x 2 NumPy array (e.g. np.array([[1,2],[3,4],[4,5], ...])):
cluster_1 = []
cluster_2 = []
for i in range(len(X_std)):
    if km.labels_[i] == 0:
        cluster_1.append(X_std[i])
    else:
        cluster_2.append(X_std[i])
cluster_1_array = np.array(cluster_1)
cluster_2_array = np.array(cluster_2)
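A more compact alternative (a sketch assuming X_std and km from above) is NumPy boolean indexing on the fitted labels, which produces the same arrays directly:
# Boolean masks on km.labels_ select the rows belonging to each cluster
cluster_1_array = X_std[km.labels_ == 0]
cluster_2_array = X_std[km.labels_ == 1]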
I am trying to cluster text words.
Suppose I have a list of texts:
text=["WhatsApp extends 'confusing' update deadline",
"India begins world's biggest Covid vaccine drive",
"Nepali climbers make history with K2 winter summit"]
I implemented TF-IDF on this data
vec = TfidfVectorizer()
feat = vec.fit_transform(text)
After that, I applied KMeans:
kmeans = KMeans(n_clusters=num).fit(feat)
The thing I am confused about is how to get clusters of words, such as:
cluster 0: WhatsApp, update, biggest
cluster 1: history, biggest, world's
etc.
You can use the get_feature_names() method of TfidfVectorizer together with the predictions from KMeans to inspect the words in each cluster.
Here's a minimal example with two clusters and the three sentences you provided:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
text = ["WhatsApp extends 'confusing' update deadline",
"India begins world's biggest Covid vaccine drive",
"Nepali climbers make history with K2 winter summit"]
vec = TfidfVectorizer()
feat = vec.fit_transform(text)
kmeans = KMeans(2).fit(feat)
pred = kmeans.predict(feat)
for i in range(2):
    print(f"Cluster #{i}:")
    words = []
    for sentence in np.array(text)[pred == i]:
        words += [fn for fn in vec.get_feature_names() if fn in sentence]
    print(words)
Result:
Cluster #0:
['confusing', 'deadline', 'extends', 'update', 'begins', 'biggest', 'drive', 'vaccine', 'world']
Cluster #1:
['climbers', 'history', 'make', 'summit', 'winter', 'with']
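If you want the most characteristic words of each cluster rather than every word occurring in its sentences, an alternative sketch (same fitted vec and kmeans as above) is to sort each cluster centroid by weight:
# Rank the vocabulary by each centroid's tf-idf weights
terms = vec.get_feature_names()
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
for i in range(2):
    print(f"Cluster #{i}:", [terms[ind] for ind in order_centroids[i, :5]])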
I am trying to understand scikit-learn's TfidfVectorizer a bit better. The following code has two documents, doc1 = 'The car is driven on the road' and doc2 = 'The truck is driven on the highway'. Calling fit_transform generates a matrix of tf-idf weights.
According to the tf-idf value matrix, shouldn't highway, truck, car be the top words instead of highway, truck, driven, since highway = truck = car = 0.63 and driven = 0.44?
#testing tfidfvectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(tokenizer= lambda x:x.split(),stop_words = 'english')
response = vectorizer.fit_transform(tn)
feature_array = np.array(vectorizer.get_feature_names()) #list of features
print(feature_array)
print(response.toarray())
sorted_features = np.argsort(response.toarray()).flatten()[:-1] #index of highest valued features
print(sorted_features)
#printing top 3 weighted features
n = 3
top_n = feature_array[sorted_features][:n]
print(top_n)
['car' 'driven' 'highway' 'road' 'truck']
[[0.6316672 0.44943642 0. 0.6316672 0. ]
[0. 0.44943642 0.6316672 0. 0.6316672 ]]
[2 4 1 0 3 0 3 1 2]
['highway' 'truck' 'driven']
As you can see from the result, the tf-idf matrix is indeed giving a higher score to highway, truck and car (and road):
import pandas as pd

tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(stop_words = 'english')
response = vectorizer.fit_transform(tn)
terms = vectorizer.get_feature_names()
pd.DataFrame(response.toarray(), columns=terms)
car driven highway road truck
0 0.631667 0.449436 0.000000 0.631667 0.000000
1 0.000000 0.449436 0.631667 0.000000 0.631667
What's wrong is the further check you do by flattening the array. To get the top scores across all rows, you could instead do something like:
max_scores = response.toarray().max(0).argsort()
np.array(terms)[max_scores[-4:]]
array(['car', 'highway', 'road', 'truck'], dtype='<U7')
The highest scores correspond to the feature names that have a 0.63 score in the dataframe.
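If the goal was the top n terms per document rather than across the whole matrix, a per-row sort (a sketch reusing response and terms from above) avoids the flattening issue:
# Top 3 terms for each document, sorted by that document's tf-idf row
n = 3
for row in response.toarray():
    top_idx = row.argsort()[::-1][:n]  # note [::-1] (reverse), not [:-1]
    print(np.array(terms)[top_idx])    # tied 0.63 scores may appear in either order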
I have a KMeans clustering script that organises some documents based on the contents of their text. The documents fall into 1 of 3 clusters, but the assignment seems very YES or NO; I'd like to be able to see how relevant each document is to its cluster.
e.g. Document A is in Cluster 1 with a 90% match, Document B is in Cluster 1 but with only a 45% match.
That way I can set a threshold to say, for example, that I only want documents matching 80% or higher.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

dict_of_docs = {'Document A':'some text content',...'Document Z':'some more text content'}
# Vectorizing the data, my data is held in a Dict, so I just want the values.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())
X = X.toarray()
# 3 Clusters as I know that there are 3, otherwise use Elbow method
# Then add the vectorized data to the Vocabulary
NUMBER_OF_CLUSTERS = 3
km = KMeans(
    n_clusters=NUMBER_OF_CLUSTERS,
    init='k-means++',
    max_iter=500)
km.fit(X)
# First: for every document we get its corresponding cluster
clusters = km.predict(X)
# We train the PCA on the dense version of the tf-idf.
pca = PCA(n_components=2)
two_dim = pca.fit_transform(X)
scatter_x = two_dim[:, 0]  # first principal component
scatter_y = two_dim[:, 1]  # second principal component
plt.style.use('ggplot')
fig, ax = plt.subplots()
fig.set_size_inches(20,10)
# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red'}
# group by clusters and scatter plot every cluster
# with a colour and a label
for group in np.unique(clusters):
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)
ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# Print out top terms for each cluster
terms = vectorizer.get_feature_names()
for i in range(3):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
for doc in dict_of_docs:
    text = dict_of_docs[doc]
    Y = vectorizer.transform([text])
    prediction = km.predict(Y)
    print(prediction, doc)
I don't believe it is possible to do exactly what you want, because k-means is not really a probabilistic model and its scikit-learn implementation (which is what I'm assuming you're using) just doesn't provide the right interface.
One option I'd suggest is the KMeans.score method, which does not provide a probabilistic output but gives a score that is larger (less negative) the closer a point is to the nearest cluster. You could threshold on this, e.g. "Document A is in cluster 1 with a score of -0.01, so I keep it" or "Document B is in cluster 2 with a score of -1000, so I ignore it".
Another option is to use the GaussianMixture model instead. A Gaussian mixture is a very similar model to k-means, and it provides the probabilities you want via GaussianMixture.predict_proba.
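For completeness, a minimal sketch of the second option, assuming the dense X and NUMBER_OF_CLUSTERS from the question (with high-dimensional tf-idf features a diagonal covariance is usually needed, and the exact probabilities depend on your data):
from sklearn.mixture import GaussianMixture

# Fit a Gaussian mixture on the dense tf-idf matrix and get per-document
# membership probabilities for each component
gmm = GaussianMixture(n_components=NUMBER_OF_CLUSTERS,
                      covariance_type='diag', random_state=42)
gmm.fit(X)
probs = gmm.predict_proba(X)  # shape: (n_documents, n_clusters)

# Keep only documents assigned to their best cluster with >= 80% probability
confident = probs.max(axis=1) >= 0.8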
I have a dataframe (text_df) with 10 different authors and 13834 rows of text.
I then created a bag of words and used a TfidfVectorizer like so:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True)
X = tfidf_v.fit_transform(corpus).toarray() # corpus --> bagofwords
y = text_df.iloc[:,1].values
The shape of X is (13834, 2701).
I decided to use 7 clusters for KMeans:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=7,random_state=42)
I'd like to extract the authors of the texts in each cluster to see if the authors are consistently grouped into the same cluster. Not sure about the best way to go about this. Thanks!
Update:
I'm trying to visualize the author count per cluster using a nested dictionary, like so:
author_cluster = {}
for i in range(len(y_kmeans)):
    # check 20 random predictions
    j = np.random.randint(0, 13833, 1)[0]
    if y_kmeans[j] not in author_cluster:
        author_cluster[y_kmeans[j]] = {}
    if y[j] not in author_cluster[y_kmeans[j]]:
        author_cluster[y_kmeans[j]][y[j]] = 1
    else:
        author_cluster[y_kmeans[j]][y[j]] += 1
Output:
There should be a larger count per cluster, and probably more than one author per cluster. I'd like to use all of the predictions to get a more accurate count instead of using a subset, but I'm open to alternative solutions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True)
X = tfidf_v.fit_transform(corpus) # I removed .toarray() - not sure why it was there except maybe for print debugging?
y = text_df.iloc[:,1].values
km = KMeans(n_clusters=7,random_state=42)
model = km.fit(X)
result = model.predict(X)
for i in range(20):
    # check 20 random predictions
    container = np.random.randint(low=0, high=13833, size=1)
    j = container[0]
    print(f'Author {y[j]} wrote {X[j]} and was put in cluster {result[j]}')
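To get author counts per cluster using all of the predictions rather than a random subset, a simple cross-tabulation is one option (a sketch assuming result and y from the snippet above):
import pandas as pd

# Rows are clusters, columns are authors, cells count texts per (cluster, author)
counts = pd.crosstab(result, y, rownames=['cluster'], colnames=['author'])
print(counts)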