I have a dataframe (text_df) with 13834 rows of text written by 10 different authors.
I then created a bag of words and used a TfidfVectorizer like so:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True)
X = tfidf_v.fit_transform(corpus).toarray() # corpus --> bagofwords
y = text_df.iloc[:,1].values
The shape of X is (13834, 2701).
I decided to use 7 clusters for KMeans:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=7,random_state=42)
I'd like to extract the authors of the texts in each cluster to see if the authors are consistently grouped into the same cluster. I'm not sure about the best way to go about this. Thanks!
Update:
I'm trying to visualize the author count per cluster using a nested dictionary, like so:
author_cluster = {}
for i in range(len(y_kmeans)):
    # pick a random index, so only a random subset ends up being counted
    j = np.random.randint(0, 13833, 1)[0]
    if y_kmeans[j] not in author_cluster:
        author_cluster[y_kmeans[j]] = {}
    if y[j] not in author_cluster[y_kmeans[j]]:
        author_cluster[y_kmeans[j]][y[j]] = 1
    else:
        author_cluster[y_kmeans[j]][y[j]] += 1
Output:
There should be a larger count per cluster, and probably more than one author per cluster. I'd like to use all of the predictions to get an accurate count instead of using a subset, but I'm open to alternative solutions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True)
X = tfidf_v.fit_transform(corpus) # I removed .toarray() - not sure why it was there except maybe for print debugging?
y = text_df.iloc[:,1].values
km = KMeans(n_clusters=7,random_state=42)
model = km.fit(X)
result = model.predict(X)
for i in range(20):
    # check 20 random predictions
    container = np.random.randint(low=0, high=13833, size=1)
    j = container[0]
    print(f'Author {y[j]} wrote {X[j]} and was put in cluster {result[j]}')
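To count the authors per cluster over all predictions instead of a random subset, one option (a minimal sketch, assuming y holds the author labels and result the cluster assignments as above) is a pandas crosstab:
import pandas as pd

# rows = cluster id, columns = author, values = number of texts
author_counts = pd.crosstab(pd.Series(result, name='cluster'),
                            pd.Series(y, name='author'))
print(author_counts)

# top 3 authors in each cluster
for cluster_id, row in author_counts.iterrows():
    print(cluster_id, row.sort_values(ascending=False).head(3).to_dict())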
Related
I did k-means clustering by running the code below:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_std = StandardScaler().fit_transform(df_logret)
km = KMeans(n_clusters=2, max_iter=100)
km.fit(X_std)
centroids = km.cluster_centers_
I'd like to put cluster 1 in x_1 and cluster 2 in x_2, and then run a regression of the form y = a*x_1 + b*x_2. I've been searching for a way to do this all day but can't find one.
The dataset 'df_logret' looks like this:
Any help would be greatly appreciated!
You've just applied KMeans clustering on X_std. With scikit-learn, you can extract the fitted labels (km.labels_) and sort each row into the appropriate cluster.
Assuming your X_std is an (n, 2) NumPy array (i.e. np.array([[1, 2], [3, 4], [4, 5], ...])):
import numpy as np

cluster_1 = []
cluster_2 = []
for i in range(len(X_std)):
    if km.labels_[i] == 0:
        cluster_1.append(X_std[i])
    else:
        cluster_2.append(X_std[i])

cluster_1_array = np.array(cluster_1)
cluster_2_array = np.array(cluster_2)
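As an aside, the same split can be written without the loop by boolean-masking on km.labels_ (a sketch assuming km has already been fitted on X_std as above):
cluster_1_array = X_std[km.labels_ == 0]  # rows assigned to cluster 0
cluster_2_array = X_std[km.labels_ == 1]  # rows assigned to cluster 1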
I am quite new at programming and word2vec models and I would really appreciate your help.
I have performed a BoW analysis on my data and obtained the top 100 most predictive words. Now I want to run those words through a pretrained word2vec model and organize them into clusters using k-means. The main goal is to establish the most predictive clusters. However, when I do the clustering something goes wrong: the model gives me clusters containing single letters (even though my 100 words are actual words and my data does not contain single letters).
Below is a snippet of the code I am using (note that I followed a tutorial with similar steps):
text = np.take(vectorizer2.get_feature_names(), pos_class_prob_sorted[:100]) #these are the extracted 100 predictive words from the BoW analysis
text = str(text)
tokenized_docs = word_tokenize(text)
tokenized_docs = list(tokenized_docs)
list_of_docs = text
#Trying to make clusters
def vectorize(list_of_docs, model=wv):
    """Generate vectors for a list of documents using a word embedding.

    Args:
        list_of_docs: List of documents
        model: Gensim's word embedding

    Returns:
        List of document vectors
    """
    features = []
    for tokens in list_of_docs:
        zero_vector = np.zeros(model.vector_size)
        vectors = []
        for token in tokens:
            if token in wv:
                try:
                    vectors.append(wv[token])
                except KeyError:
                    continue
        if vectors:
            vectors = np.asarray(vectors)
            avg_vec = vectors.mean(axis=0)
            features.append(avg_vec)
        else:
            features.append(zero_vector)
    return features
vectorized_docs = vectorize(tokenized_docs,model=wv)
len(vectorized_docs), len(vectorized_docs[0])
def mbkmeans_clusters(
    X,
    k,
    mb,
    print_silhouette_values,
):
    """Generate clusters and print silhouette metrics using MiniBatchKMeans.

    Args:
        X: Matrix of features.
        k: Number of clusters.
        mb: Size of mini-batches.
        print_silhouette_values: Print silhouette values per cluster.

    Returns:
        Trained clustering model and labels based on X.
    """
    km = MiniBatchKMeans(n_clusters=k, batch_size=mb).fit(X)
    print(f"For n_clusters = {k}")
    print(f"Silhouette coefficient: {silhouette_score(X, km.labels_):0.2f}")
    print(f"Inertia: {km.inertia_}")
    if print_silhouette_values:
        sample_silhouette_values = silhouette_samples(X, km.labels_)
        print("Silhouette values:")
        silhouette_values = []
        for i in range(k):
            cluster_silhouette_values = sample_silhouette_values[km.labels_ == i]
            silhouette_values.append(
                (
                    i,
                    cluster_silhouette_values.shape[0],
                    cluster_silhouette_values.mean(),
                    cluster_silhouette_values.min(),
                    cluster_silhouette_values.max(),
                )
            )
        silhouette_values = sorted(
            silhouette_values, key=lambda tup: tup[2], reverse=True
        )
        for s in silhouette_values:
            print(
                f"    Cluster {s[0]}: Size:{s[1]} | Avg:{s[2]:.2f} | Min:{s[3]:.2f} | Max: {s[4]:.2f}"
            )
    return km, km.labels_
clustering, cluster_labels = mbkmeans_clusters(
    X=vectorized_docs,
    k=10,
    mb=500,
    print_silhouette_values=True,
)
df_clusters = pd.DataFrame({
    "tokens": [" ".join(text) for text in tokenized_docs],
    "cluster": cluster_labels,
})
print("Most representative terms per cluster (based on centroids):")
for i in range(10):
    tokens_per_cluster = ""
    most_representative = wv.most_similar(positive=[clustering.cluster_centers_[i]], topn=5)
    for t in most_representative:
        tokens_per_cluster += f"{t[0]} "
    print(f"Cluster {i}: {tokens_per_cluster}")
Can you please help me determine where I am going wrong with this code? To me it seems that the model does not actually take those top 100 words into account, and that they are not going through the word2vec model.
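My guess (treat this as a hedged sketch, not a confirmed fix): str(text) turns the array of 100 words into one long string, word_tokenize splits that string back into tokens (including brackets and quotes), and the inner "for token in tokens" loop in vectorize then iterates over the characters of each word, so the embeddings and the clusters end up built from single letters. Keeping the 100 words as a plain list and embedding each whole word directly avoids that; something along these lines, reusing vectorizer2, pos_class_prob_sorted and wv from the snippet above:
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# keep the 100 predictive words as a plain Python list of strings
words = list(np.take(vectorizer2.get_feature_names(), pos_class_prob_sorted[:100]))

# look each whole word up in the pretrained embedding (skip out-of-vocabulary words)
kept_words = [w for w in words if w in wv]
word_vectors = np.array([wv[w] for w in kept_words])

# cluster the word vectors themselves
km = MiniBatchKMeans(n_clusters=10, batch_size=50).fit(word_vectors)
for w, label in zip(kept_words, km.labels_):
    print(label, w)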
I'm trying to make a stratified random sampler with tf.data.Dataset for imbalanced classes but I can't find a way to do it.
import tensorflow as tf
dogs = [f'dog_{i}' for i in range(2000)]
cats = [f'cat_{i}' for i in range(100)]
monkeys = [f'monkey_{i}' for i in range(500)]
dogs_ds = tf.data.Dataset.from_tensor_slices(dogs)
cats_ds = tf.data.Dataset.from_tensor_slices(cats)
monkeys_ds = tf.data.Dataset.from_tensor_slices(monkeys)
Maybe with .concatenate()? Or .interleave()? .concatenate() and .shuffle()?
I found it myself. It's possible to use tf.data.experimental.sample_from_datasets, which takes a list of datasets and randomly selects from them.
For a finite dataset (one that doesn't repeat indefinitely), you can repeat the minority datasets so they have approximately the same number of elements as the majority category. The sampling weights should then be equal for the 3 categories, i.e. about 0.333 each.
import tensorflow as tf
from collections import Counter
dogs = [f'dog_{i}' for i in range(2000)]
cats = [f'cat_{i}' for i in range(100)]
monkeys = [f'monkey_{i}' for i in range(500)]
dogs_ds = tf.data.Dataset.from_tensor_slices(dogs)
cats_ds = tf.data.Dataset.from_tensor_slices(cats).repeat(20)
monkeys_ds = tf.data.Dataset.from_tensor_slices(monkeys).repeat(4)
all_ds = tf.data.experimental.sample_from_datasets(
    [dogs_ds, cats_ds, monkeys_ds], weights=[.33, .33, .33])

c = Counter()
for elem in all_ds.take(1000):
    category, _ = tf.strings.split(elem, '_')
    c.update([category.numpy()])
Counter({b'monkey': 363, b'dog': 329, b'cat': 308})
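Note: depending on your TensorFlow version, this helper may also be available outside the experimental namespace (I believe it was promoted in later 2.x releases), in which case the call would look like:
all_ds = tf.data.Dataset.sample_from_datasets(
    [dogs_ds, cats_ds, monkeys_ds], weights=[.33, .33, .33])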
I have used NLTK to perform k-means clustering, as I would like to change the distance metric to cosine distance. However, how do I obtain the centroids of all the clusters?
kclusterer = KMeansClusterer(8, distance = nltk.cluster.util.cosine_distance, repeats = 1)
predict = kclusterer.cluster(features, assign_clusters = True)
centroids = kclusterer._centroid
df_clustering['cluster'] = predict
#df_clustering['centroid'] = centroids[df_clustering['cluster'] - 1].tolist()
df_clustering['centroid'] = centroids
I am trying to perform k-means clustering on a pandas dataframe, and would like the coordinates of the centroid of each data point's cluster to be stored in the dataframe column 'centroid'.
Thank you in advance!
import pandas as pd
import numpy as np
import nltk
from nltk.cluster import KMeansClusterer

# create a dummy dataframe with 3 features
df = pd.DataFrame([[1, 2, 3], [50, 51, 52], [2.0, 6.0, 8.5], [50.11, 53.78, 52]],
                  columns=['feature1', 'feature2', 'feature3'])
print(df)

obj = KMeansClusterer(2, distance=nltk.cluster.util.cosine_distance)  # number of clusters = 2
vectors = [np.array(f) for f in df.values]
df['predicted_cluster'] = obj.cluster(vectors, assign_clusters=True)
print(obj.means())
# Output:
# [array([50.055, 52.39 , 52. ]), array([1.5 , 4. , 5.75])]  # the mean of the three features for each of the 2 clusters
# now if you want the cluster center in the pandas dataframe
df['centroid'] = df['predicted_cluster'].apply(lambda x: obj.means()[x])
I have a simple k-means program that extracts 2 clusters and then tries to predict the cluster for new sentences. I would like to find the best 'fit' for each cluster.
In my example, predict_c0 is a good fit for cluster 0, while predict_bad_fit covers clusters 0 and 1.
I suppose I have to calculate the average distance to the cluster centroid for each predicted sentence. How do I do that?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
c0_sents = ['cats and dogs','i like cats','cats not like dogs','cats and dogs animals',]
c1_sents = ['computer is for typing','i play games on my computer','programs run on computer','computer has screen']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(c0_sents+c1_sents)
k_means = KMeans(n_clusters=2)
k_means.fit(X)
order_centroids = k_means.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(2):
    for ind in order_centroids[i, :4]:
        print(i, terms[ind])
#0 computer
#0 on
#0 screen
#0 has
#1 cats
#1 dogs
#1 like
#1 and
predict_c0 = ['cats are not dogs']
predict_c1 = ['typing on computers']
predict_bad_fit = ['cats on computers dogs on screen']
for sent in predict_c0+predict_c1+predict_bad_fit:
    X = vectorizer.transform([sent])
    predicted = k_means.predict(X)
    print(sent, predicted)
#cats are not dogs [0]
#typing on computers [1]
#cats on computers dogs on screen [1]
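To measure how well a predicted sentence fits its cluster, one option (a minimal sketch reusing the fitted vectorizer and k_means above) is KMeans.transform, which returns the distance from each sample to every cluster center; a sentence that straddles both clusters has two similar distances, so the gap between them is small:
for sent in predict_c0 + predict_c1 + predict_bad_fit:
    X_new = vectorizer.transform([sent])
    distances = k_means.transform(X_new)[0]   # distance to each of the 2 centroids
    assigned = k_means.predict(X_new)[0]
    gap = abs(distances[0] - distances[1])    # small gap -> poor fit to either cluster
    print(sent, assigned, round(distances[assigned], 3), round(gap, 3))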