Clustering similar words with a pretrained word2vec and K means - python

I am quite new to programming and word2vec models, and I would really appreciate your help.
I have performed a BoW analysis on my data and obtained the top 100 most predictive words. Now I want to run those words through a pretrained word2vec model and group them into clusters using K-means, with the goal of finding the most predictive clusters. However, when I do the clustering something goes wrong: the model gives me clusters containing single letters, even though my 100 words are actual words and my data contains no single letters.
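Conceptually, the word-level flow I am aiming for is roughly the sketch below (top_100_words is just a placeholder name for my list of 100 words, and wv is the loaded pretrained gensim KeyedVectors):

import numpy as np
from sklearn.cluster import KMeans

# keep only words the pretrained model knows, then look up one vector per word
words = [w for w in top_100_words if w in wv]
word_vectors = np.array([wv[w] for w in words])

# cluster the word vectors and print each word with its cluster label
kmeans = KMeans(n_clusters=10, random_state=42).fit(word_vectors)
for word, label in zip(words, kmeans.labels_):
    print(word, label)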
Below is a snippet of the code I am using (note that I have used a tutorial that has similar steps as these):
text = np.take(vectorizer2.get_feature_names(), pos_class_prob_sorted[:100]) #these are the extracted 100 predictive words from the BoW analysis
text = str(text)
tokenized_docs = word_tokenize(text)
tokenized_docs = list(tokenized_docs)
list_of_docs = text
#Trying to make clusters
def vectorize(list_of_docs, model=wv):
    """Generate vectors for list of documents using a Word Embedding

    Args:
        list_of_docs: List of documents
        model: Gensim's Word Embedding

    Returns:
        List of document vectors
    """
    features = []
    for tokens in list_of_docs:
        zero_vector = np.zeros(model.vector_size)
        vectors = []
        for token in tokens:
            if token in wv:
                try:
                    vectors.append(wv[token])
                except KeyError:
                    continue
        if vectors:
            vectors = np.asarray(vectors)
            avg_vec = vectors.mean(axis=0)
            features.append(avg_vec)
        else:
            features.append(zero_vector)
    return features
vectorized_docs = vectorize(tokenized_docs,model=wv)
len(vectorized_docs), len(vectorized_docs[0])
def mbkmeans_clusters(
    X,
    k,
    mb,
    print_silhouette_values,
):
    """Generate clusters and print Silhouette metrics using MBKmeans

    Args:
        X: Matrix of features.
        k: Number of clusters.
        mb: Size of mini-batches.
        print_silhouette_values: Print silhouette values per cluster.

    Returns:
        Trained clustering model and labels based on X.
    """
    km = MiniBatchKMeans(n_clusters=k, batch_size=mb).fit(X)
    print(f"For n_clusters = {k}")
    print(f"Silhouette coefficient: {silhouette_score(X, km.labels_):0.2f}")
    print(f"Inertia:{km.inertia_}")

    if print_silhouette_values:
        sample_silhouette_values = silhouette_samples(X, km.labels_)
        print(f"Silhouette values:")
        silhouette_values = []
        for i in range(k):
            cluster_silhouette_values = sample_silhouette_values[km.labels_ == i]
            silhouette_values.append(
                (
                    i,
                    cluster_silhouette_values.shape[0],
                    cluster_silhouette_values.mean(),
                    cluster_silhouette_values.min(),
                    cluster_silhouette_values.max(),
                )
            )
        silhouette_values = sorted(
            silhouette_values, key=lambda tup: tup[2], reverse=True
        )
        for s in silhouette_values:
            print(
                f" Cluster {s[0]}: Size:{s[1]} | Avg:{s[2]:.2f} | Min:{s[3]:.2f} | Max: {s[4]:.2f}"
            )
    return km, km.labels_
clustering, cluster_labels = mbkmeans_clusters(
    X=vectorized_docs,
    k=10,
    mb=500,
    print_silhouette_values=True,
)
df_clusters = pd.DataFrame({
    "tokens": [" ".join(text) for text in tokenized_docs],
    "cluster": cluster_labels
})
print("Most representative terms per cluster (based on centroids):")
for i in range(10):
    tokens_per_cluster = ""
    most_representative = wv.most_similar(positive=[clustering.cluster_centers_[i]], topn=5)
    for t in most_representative:
        tokens_per_cluster += f"{t[0]} "
    print(f"Cluster {i}: {tokens_per_cluster}")
Can you please help me figure out where I am going wrong with this code? To me, it seems that the model does not actually take those top 100 words into account and that they never really pass through the word2vec model.

Related

DBSCAN. Detecting spam emails using fuzzy hashing

I have a task: detecting spam emails using fuzzy hashes. After watching a bunch of videos and reading at least as many articles, I arrived at this algorithm:
I read the dataset, process it, and delete the None values and duplicates.
import pandas as pd
import numpy as np
import ppdeep as pp
from sklearn.cluster import DBSCAN
init_data = pd.read_csv('./spam2.csv')
data = init_data[['Label', 'Body']]
data.dropna(inplace=True)
data = data.rename(columns={'Body': 'message', 'Label': 'class'})
data = data.reindex(columns=['class', 'message'])
data = data.drop_duplicates(subset='message', keep="first")
In the next step I process the messages themselves; I am not sure this step is needed, although if I don't skip it, more emails end up with the same fuzzy hash, which is logical.
import re
import html
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
# nltk.download('stopwords')
# nltk.download('punkt')
EMAIL_RE = re.compile("[\w.+-]+@[\w-]+\.[\w.-]+")
URLS_RE = re.compile("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:\%[0-9a-fA-F][0-9a-fA-F]))+")
PUNCTUATION_RE = re.compile(r'[!"\#\$%\&\'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~]')
NOT_LETTERS_OR_SPACE_RE = re.compile("[^A-Za-z ]")
REPEATING_LETTERS_RE = re.compile(r'([a-z])\1{2,}')
def prepare_message(message):
    # Convert to lower case
    text = message.lower()
    # Convert HTML codes into characters
    text = html.unescape(text)
    # Remove emails
    text = re.sub(EMAIL_RE, ' ', text)
    # Remove urls
    text = re.sub(URLS_RE, ' ', text)
    # Remove all punctuation symbols
    text = re.sub(PUNCTUATION_RE, ' ', text)
    # Remove everything except letters and spaces
    text = re.sub(NOT_LETTERS_OR_SPACE_RE, '', text)
    # Collapse repeating letters
    text = re.sub(REPEATING_LETTERS_RE, r'\1', text)
    # Split by space and stem with PorterStemmer
    ps = PorterStemmer()
    return ' '.join([ps.stem(word) for word in text.split()])
data['message'] = data['message'].apply(prepare_message)
After this stage I get processed words separated by spaces, with no numbers or other characters.
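For example, on a made-up message (not from the real dataset) the cleaning behaves roughly like this:

sample = "Hello!!! Visit https://example.com or mail me at foo@bar.com"
print(prepare_message(sample))
# prints (roughly): hello visit or mail me at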
Next, I add a column with a fuzzy hash of each message.
data['hash'] = data['message'].apply(pp.hash)
Next, I split the dataset into training and test samples (although, as I found out, DBSCAN cannot predict labels for new values; they can only be added to the distance matrix and the model refit, because DBSCAN has no fixed cluster centers like, for example, KMeans does).
from sklearn.model_selection import train_test_split

x = data['hash']
y = data['class']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, stratify=y)
In the next step, I calculate the distance matrix between each pair of messages.
import numba

@numba.jit(parallel=True, cache=True, fastmath=True)
def calc_distances(x_train, x):
    count = 0
    n = len(x_train)
    max_count = (n**2 - n) // 2
    for i in range(n):
        for j in range(i):
            x[i,j] = pp.compare(x_train[i], x_train[j])
            x[j,i] = x[i,j]
            count += 1
            print(f"\r{count}/{max_count}", end='')
n = len(x_train)
distances = np.zeros((n, n))
calc_distances(list(x_train), distances)
np.fill_diagonal(distances, 100.0)
I could not come up with or find any way to speed up the calculation of this matrix, so I sit waiting for 30 minutes, or about 70 with the entire dataset :(
The final stage. Clustering.
db = DBSCAN(eps=0.5, min_samples=2, metric='precomputed').fit(distances)
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
At this stage I'm having problems: no matter how I change the eps and min_samples parameters, I always get 1 cluster and 0 noise points at the output.
Please tell me where exactly I am going wrong. What exactly is the problem? Should I use some other algorithm besides DBSCAN, or something else entirely? There is very little information on the Internet specifically about spam detection using fuzzy hashes.
I have tried several options: leaving the messages unprocessed, exactly as they are in the dataset; using fuzz.ratio instead of pp.compare; and changing the eps and min_samples parameters. I still get 1 cluster at the output.

Figuring out the percentage/probability a string belongs in a cluster?

I have a KMeans clustering script that organises some documents based on their text content. The documents fall into 1 of 3 clusters, but the assignment seems very yes-or-no; I'd like to be able to see how relevant each document is to its cluster.
E.g. Document A is in Cluster 1 with a 90% match, Document B is in Cluster 1 but with only a 45% match.
That way I could set a threshold and keep only documents that match 80% or higher.
dict_of_docs = {'Document A':'some text content',...'Document Z':'some more text content'}
# Vectorizing the data, my data is held in a Dict, so I just want the values.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())
X = X.toarray()
# 3 Clusters as I know that there are 3, otherwise use Elbow method
# Then add the vectorized data to the Vocabulary
NUMBER_OF_CLUSTERS = 3
km = KMeans(
    n_clusters=NUMBER_OF_CLUSTERS,
    init='k-means++',
    max_iter=500)
km.fit(X)
# First: for every document we get its corresponding cluster
clusters = km.predict(X)
# We train the PCA on the dense version of the tf-idf.
pca = PCA(n_components=2)
two_dim = pca.fit_transform(X)
scatter_x = two_dim[:, 0] # first principal component
scatter_y = two_dim[:, 1] # second principal component
plt.style.use('ggplot')
fig, ax = plt.subplots()
fig.set_size_inches(20,10)
# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red'}
# group by clusters and scatter plot every cluster
# with a colour and a label
for group in np.unique(clusters):
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)
ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# Print out top terms for each cluster
terms = vectorizer.get_feature_names()
for i in range(3):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
for doc in dict_of_docs:
    text = dict_of_docs[doc]
    Y = vectorizer.transform([text])
    prediction = km.predict(Y)
    print(prediction, doc)
I don't believe it is possible to do exactly what you want because k-means is not really a probabilistic model and its scikit-learn implementation (which is what I'm assuming you're using) just doesn't provide the right interface.
One option I'd suggest is to use the KMeans.score method, which does not provide a probabilistic output but provides a score that is larger the closer a point is to the closest cluster. You could threshold by this, such as by saying "Document A is in cluster 1 with a score of -.01 so I keep it" or "Document B is in cluster 2 with a score of -1000 so I ignore it".
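As a rough sketch of that score-based filtering (reusing km, vectorizer, and dict_of_docs from your code; the threshold value here is purely a made-up example):

SCORE_THRESHOLD = -0.9  # hypothetical cutoff; tune it on your own data

for doc, text in dict_of_docs.items():
    Y = vectorizer.transform([text]).toarray()
    cluster = km.predict(Y)[0]
    score = km.score(Y)  # negative squared distance to the nearest centroid, so closer to 0 is better
    verdict = "keep" if score >= SCORE_THRESHOLD else "ignore"
    print(f"{verdict} {doc}: cluster {cluster}, score {score:.3f}")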
Another option is to use the GaussianMixture model instead. A Gaussian mixture is a very similar model to k-means, and it provides the probabilities you want with GaussianMixture.predict_proba.
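A minimal sketch of that route, assuming the dense tf-idf matrix X from your code (the 0.8 cutoff is just an illustrative value):

from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
probs = gm.predict_proba(X)  # shape (n_documents, 3); each row sums to 1
best_cluster = probs.argmax(axis=1)
best_prob = probs.max(axis=1)

# keep only documents whose best-cluster probability clears the threshold
for doc, cluster, p in zip(dict_of_docs, best_cluster, best_prob):
    if p >= 0.8:
        print(f"{doc}: cluster {cluster} ({p:.0%})")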

Gensim Doc2vec model: how to compute similarity on a corpus obtained using a pre-trained doc2vec model?

I have a doc2vec model trained on multiple documents. I would like to use that model to infer the vectors of another document, which I want to use as the corpus for comparison. So, when I look for the sentence most similar to one I introduce, it should use these new document vectors instead of the trained corpus.
Currently, I am using infer_vector() to compute a vector for each sentence of the new document, but I can't use the most_similar() function with the list of vectors I obtain; it needs a KeyedVectors instance.
I would like to know if there's any way that I can compute these vectors for the new document that will allow the use of the most_similar() function, or if I have to compute the similarity between each one of the sentences of the new document and the sentence I introduce individually (in this case, is there any implementation in Gensim that allows me to compute the cosine similarity between 2 vectors?).
I am new to Gensim and NLP, and I'm open to your suggestions.
I cannot provide the complete code, since it is a university project, but here are the main parts I'm having problems with.
After doing some pre-processing of the data, this is how I train my model:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train_data)]
assert gensim.models.doc2vec.FAST_VERSION > -1
cores = multiprocessing.cpu_count()
doc2vec_model = Doc2Vec(vector_size=200, window=5, workers=cores)
doc2vec_model.build_vocab(documents)
doc2vec_model.train(documents, total_examples=doc2vec_model.corpus_count, epochs=30)
I try to compute the vectors for the new document this way:
questions = [doc2vec_model.infer_vector(line) for line in lines_4]
And then I try to compute the similarity between the new document vectors and an input phrase:
text = str(input('Me: '))
tokens = text.split()
new_vector = doc2vec_model.infer_vector(tokens)
index = questions[i].most_similar([new_vector])
A dirty solution I used about a month ago in gensim==3.2.0 (the syntax might have changed).
You can save your inferred vectors in KeyedVectors format.
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec

vectors = dict()
# y_names = doc2vec_model.docvecs.doctags.keys()
y_names = range(len(questions))

for name in y_names:
    # vectors[name] = doc2vec_model.docvecs[name]
    vectors[str(name)] = questions[name]

# start with an empty file
f = open("question_vectors.txt", "w")
f.write("")
f.flush()
f.close()

# append the header line and one line per vector (word2vec text format)
f = open("question_vectors.txt", "a")
f.write("{} {}\n".format(len(questions), doc2vec_model.vector_size))
for v in vectors:
    line = "{} {}\n".format(v, " ".join(vectors[v].astype(str)))
    f.write(line)
f.close()
Then you can load it and use the most_similar function:
keyed_model = KeyedVectors.load_word2vec_format("question_vectors.txt")
keyed_model.most_similar(str(list(y_names)[0]))
Another solution (especially if the number of questions is not that high) would be to just convert questions to a np.array and compute cosine similarities, e.g.:
import numpy as np
questions = np.array(questions)
texts_norm = np.linalg.norm(questions, axis=1)[np.newaxis].T
norm = texts_norm * texts_norm.T
product = np.matmul(questions, questions.T)
product = product.T / norm
# Otherwise the item is the closest to itself
for j in range(len(questions)):
    product[j, j] = 0

# Indices of the 10 most similar items to the 0th question
np.argpartition(product[0], -10)[-10:]

Get values from K-Means clusters using dataframe

I have this dataframe (text_df):
There are 10 different authors with 13834 rows of text.
I then created a bag of words and used a TfidfVectorizer like so:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True)
X = tfidf_v.fit_transform(corpus).toarray() # corpus --> bagofwords
y = text_df.iloc[:,1].values
Shape of X is (13834,2701)
I decided to use 7 clusters for KMeans:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=7,random_state=42)
I'd like to extract the authors of the texts in each cluster to see if the authors are consistently grouped into the same cluster. Not sure about the best way to go about this. Thanks!
Update:
Trying to visualize the author count per cluster using a nested dictionary, like so:
author_cluster = {}
for i in range(len(y_kmeans)):
    # check 20 random predictions
    j = np.random.randint(0, 13833, 1)[0]
    if y_kmeans[j] not in author_cluster:
        author_cluster[y_kmeans[j]] = {}
    if y[j] not in author_cluster[y_kmeans[j]]:
        author_cluster[y_kmeans[j]][y[j]] = 1
    else:
        author_cluster[y_kmeans[j]][y[j]] += 1
Output:
There should be a larger count per cluster and probably more than one author per cluster. I'd like to use all of the predictions to get a more accurate count instead of using a subset. But open to alternative solutions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True)
X = tfidf_v.fit_transform(corpus) # I removed .toarray() - not sure why it was there except maybe for print debugging?
y = text_df.iloc[:,1].values
km = KMeans(n_clusters=7,random_state=42)
model = km.fit(X)
result = model.predict(X)
for i in range(20):
    # check 20 random predictions
    container = np.random.randint(low=0, high=13833, size=1)
    j = container[0]
    print(f'Author {y[j]} wrote {X[j]} and was put in cluster {result[j]}')
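If you want to count every document rather than a random sample, one option is a pandas crosstab (a quick sketch, assuming y holds the author labels and result the cluster assignments from above):

import pandas as pd

# rows = authors, columns = cluster ids, cells = number of documents
author_per_cluster = pd.crosstab(pd.Series(y, name='author'),
                                 pd.Series(result, name='cluster'))
print(author_per_cluster)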

How to use TfidfVectorizer correctly?

I always get an error when using TfidfVectorizer for k-means clustering.
There are 3 cases:
I use the tokenizer parameter of TfidfVectorizer to customize the tokenization process for my dataset. Here is my code:
vectorizer = TfidfVectorizer(stop_words=stops,tokenizer=tokenize)
X = vectorizer.fit_transform(titles)
However, I got this error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
I made a vocabulary consisting of the terms that resulted from the tokenization, so the code became like this:
vectorizer = TfidfVectorizer(stop_words=stops,tokenizer=tokenize,vocabulary=vocab)
But I got another error:
ValueError: Vocabulary contains repeated indices.
Lastly, I removed the tokenizer and vocabulary parameters. The code becomes like this:
vectorizer = TfidfVectorizer(stop_words=stops)
X = vectorizer.fit_transform(titles)
terms = vectorizer.get_feature_names()
true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print "Top terms per cluster:"
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
for i in range(true_k):
    print "Cluster %d:" % i,
    for ind in order_centroids[i, :10]:
        print ' %s' % terms[ind],
    print
Well, the program runs successfully but the clustering results are like this:
Cluster 0: bangun, rancang, lunak, perangkat, aplikasi, berbasis, menggunakan, service, sistem, pembangunan,
Cluster 1: sistem, aplikasi, berbasis, web, menggunakan, pembuatan, mobile, informasi, teknologi, pengembangan,
Cluster 2: android, berbasis, aplikasi, perangkat, rancang, bangun, bergerak, mobile, sosial, menggunakan,
Cluster 3: implementasi, algoritma, menggunakan, klasifikasi, data, game, fuzzy, vector, support, machine,
Cluster 4: metode, menggunakan, video, penerapan, implementasi, steganografi, pengenalan, berbasis, file, analisis,
Cluster 5: citra, segmentasi, menggunakan, implementasi, metode, warna, tekstur, kembali, berwarna, temu,
Cluster 6: jaringan, tiruan, protokol, voip, syaraf, saraf, menggunakan, implementasi, kinerja, streaming,
Cluster 7: studi, kasus, its, informatika, teknik, sistem, informasi, data, surabaya, jurusan,
Some terms end up in multiple clusters, like the term data, which appears in both Cluster 3 and Cluster 7.
Can you tell me how to use TfidfVectorizer and KMeans properly? Your help is my happiness :)
