Understanding and applying k-means clustering for topic modeling - python

I have the following code that I found at https://pythonprogramminglanguage.com/kmeans-text-clustering/ on document clustering. While I understand the k-means algorithm as a whole, I have a little trouble wrapping my head around what the top terms per cluster represent and how they are computed. Are they the most frequent words that occur in the cluster? One blog post I read said that the words output at the end represent the "top n words that are nearest to the cluster centroid" (but what does it mean for an actual word to be "closest" to the cluster centroid?). I really want to understand the details and nuances of what is going on. Thank you!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["This little kitty came to play when I was eating at a restaurant.",
"Merley has the best squooshy kitten belly.",
"Google Translate app is incredible.",
"If you open 100 tab in google you get a smiley face.",
"Best cat photo I've ever taken.",
"Climbing ninja cat.",
"Impressed with google map feedback.",
"Key promoter extension for Google Chrome."]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

'Top' in this context is directly related to the way in which the text has been transformed into an array of numerical values. By using TFIDF you are, for each individual document, assigning each word a score based on how prevalent it is in that document, inverse to the prevalence across the entire set of documents. A word with a high score in a document indicates that it is more significant or more representative of that document than the other words.
The top terms generated for each cluster are therefore the words that are, on average, most significant in the documents belonging to that cluster.
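For example, a quick way to inspect these per-document scores (a rough sketch reusing the X and terms objects from the snippet above; the choice of document 0 is just for illustration):
import numpy as np

# tf-idf scores of the first document, highest-scoring terms first
doc_scores = np.asarray(X[0].todense()).ravel()
for idx in doc_scores.argsort()[::-1][:5]:
    print(terms[idx], round(doc_scores[idx], 3))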
The way it has been done here works and is efficient, but I find it hard to follow myself and not particularly intuitive: it is difficult to see why, if cluster_centers_ are the coordinates of the centroids, the features with the highest coordinate values should be the top words. I kind of get it but not quite (if anyone wants to explain how this works, that would be great!).
I use a different method to find the top terms for a cluster which I find more intuitive. I just tested the method you posted with my own on a corpus of 250 documents and the top words are exactly the same. The value of my method is that it works however you cluster the documents as long as you can provide a list of the cluster assignments (which any clustering algorithm should provide), meaning you're not reliant on the presence of a cluster_centers_ attribute. It's also, I think, more intuitive.
import numpy as np

def term_scorer(doc_term_matrix, feature_name_list, labels=None, target=None, n_top_words=10):
    # Keep only the rows (documents) assigned to the target cluster
    if target is not None:
        filter_bool = np.array(labels) == target
        doc_term_matrix = doc_term_matrix[filter_bool]
    # Sum the scores for each term across those documents and flatten to a 1-D array
    term_scores = np.asarray(doc_term_matrix.sum(axis=0)).ravel()
    # Term indexes ordered from highest to lowest summed score
    top_term_indices = np.argsort(term_scores)[::-1]
    return [feature_name_list[term_idx] for term_idx in top_term_indices[:n_top_words]]

term_scorer(X, terms, labels=model.labels_, target=1, n_top_words=10)
The model.labels_ attribute gives you a list of the cluster assignments for each document. In this example I want to find the top words for cluster 1, so I set target=1; the function filters the X array, keeping only the rows assigned to cluster 1. It then sums the scores of those documents together, leaving a single row with one column per word. It then uses argsort to order that row from highest to lowest value, replacing the values with the original index positions of the words. Finally, it takes the indexes of the top n_top_words scores and builds a list of words by looking up those indexes in feature_name_list.
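As a sanity check tying the two approaches together (a rough sketch reusing model, X and terms from above): once k-means has converged, each coordinate of cluster_centers_ is simply the mean tf-idf weight of that term over the cluster's documents, so ranking by centroid coordinates and ranking by summed (or averaged) scores should pick out the same top terms.
import numpy as np

cluster_id = 1
mask = model.labels_ == cluster_id
# Mean tf-idf weight of each term over the documents in this cluster
mean_scores = np.asarray(X[mask].mean(axis=0)).ravel()

# Should be (approximately) True once k-means has converged
print(np.allclose(mean_scores, model.cluster_centers_[cluster_id]))
# Same ranking as order_centroids[cluster_id, :10] from the question's code
print([terms[i] for i in mean_scores.argsort()[::-1][:10]])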

When words are converted into vectors, we talk about the closeness of words in terms of how similar they are. For instance, you could use cosine similarity to determine how close two words are to each other: the vectors for "dog" and "puppy" will be similar, so you could say the two words are close to each other.
In other words, closeness is also determined by context: the word pair (the, cat) can be close because the two words appear in similar sentences. That is how word2vec and similar algorithms work to create word vectors.
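A minimal sketch of cosine similarity between two word vectors (the vectors here are made up purely for illustration; in practice they would come from word2vec or a similar model):
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dog = np.array([0.8, 0.1, 0.6])    # hypothetical embedding for "dog"
puppy = np.array([0.7, 0.2, 0.5])  # hypothetical embedding for "puppy"
print(cosine_similarity(dog, puppy))  # close to 1.0, so the words are "close"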

Related

Reduce Dimension of word-vectors from TFIDFVectorizer / CountVectorizer

I want to use the TFIDFVectorizer (or CountVectorizer followed by TFIDFTransformer) to get a vector representation of my terms. That means, I want a vector for a term where the documents are the features. That's simply the transpose of a TF-IDF matrix created by the TFIDFVectorizer.
>>> vectorizer = TfidfVectorizer()
>>> model = vectorizer.fit_transform(corpus)
>>> model.transpose()
However, I have 800k documents, which means my term vectors are very sparse and very large (800k dimensions). The flag max_features in the CountVectorizer would do exactly what I'm looking for: I can specify a dimension and the CountVectorizer tries to fit all the information into this dimension. Unfortunately, this option is for the document vectors rather than the terms in the vocabulary, so it reduces the size of my vocabulary, because the terms are the features.
Is there any way to do the opposite? Like, perform a transpose on the TFIDFVectorizer object before it starts cutting and normalizing everything? And if such an approach exists, how can I do that? Something like this:
>>> countVectorizer = CountVectorizer(input='filename', max_features=300, transpose=True)
I was looking for such an approach for a while now but every guide, code example, whatever is talking about the document TF-IDF vectors rather than the term vectors.
Thank you so much in advance!
I am not aware of any straightforward way to do this, but let me propose a way it could be achieved.
You are trying to represent each term in your corpus as a vector that uses the documents in your corpus as its component features. Because the number of documents (which are the features in your case) is very large, you would like to limit them in a way similar to what max_features does.
According to CountVectorizer user guide (same for the TfidfVectorizer):
max_features int, default=None
If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
In a similar way, you want to keep the top documents ordered by their "frequency across the terms", as confusing as this may sound. This could be rephrased simplistically as "keep those documents that contain the most unique terms".
One way I can think of doing that is by using inverse_transform and performing the following steps:
vectorizer = TfidfVectorizer()
model = vectorizer.fit_transform(corpus)

# We use inverse_transform, which returns the
# terms per document with nonzero entries
inverse_model = vectorizer.inverse_transform(model)

# Each line in the inverse model corresponds to a document
# and contains a list of feature names (the terms).
# As we want to rank the documents, we transform the list
# of feature names into the number of features
# that each document is represented by.
inverse_model_count = list(map(lambda doc_vec: len(doc_vec), inverse_model))

# As we are going to sort the list, we need to keep track of the
# document id (its index in the corpus), so we create tuples with
# the list index of each item before we sort the list.
inverse_model_count_tuples = list(zip(range(len(inverse_model_count)),
                                      inverse_model_count))

# Then we sort the list by the count of terms
# in each document (the second component)
max_features = 100
top_documents_tuples = sorted(inverse_model_count_tuples,
                              key=lambda item: item[1],
                              reverse=True)[:max_features]

# We are interested only in the document ids (the first tuple component)
top_documents, _ = zip(*top_documents_tuples)

# Having the top_documents ids we can slice the initial model
# to keep only the documents indicated by the top_documents list
# (converted to a list so scipy treats it as row selection)
reduced_model = model[list(top_documents)]
Please note that this approach only takes into account the number of terms per document, regardless of their count (CountVectorizer) or weight (TfidfVectorizer).
If the direction of this approach is acceptable to you, then with some more code it would also be possible to take the count or weight of the terms into account.
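For example, one possible variant along those lines (a sketch only, not part of the approach above) would be to rank the documents by their total weight in the model rather than by the number of distinct terms they contain:
import numpy as np

# Total count/tf-idf weight per document (row sums of the model matrix)
doc_weights = np.asarray(model.sum(axis=1)).ravel()
top_documents = np.argsort(doc_weights)[::-1][:max_features]
reduced_model = model[top_documents]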
I hope this helps!

How to identify doc2vec instances separately in gensim in python

I have a list of 1000 documents, where the first 500 belong to movies (i.e. list indexes 0 to 499) and the remaining 500 belong to tv series (i.e. list indexes 500 to 999).
For movies the document tag starts with movie_ (e.g., movie_fast_and_furious) and for tv series the document tag starts with tv_series_ (e.g., tv_series_the_office)
I use these movies and tv series dataset to build a doc2vec model as follows.
import json

from gensim.models import doc2vec
from collections import namedtuple

dataset = json.load(open(input_file))

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))

model = doc2vec.Doc2Vec(docs, vector_size=100, window=10, min_count=1, workers=4, epochs=20)
Now for each movie, I want to get its nearest 5 tv series along with their cosine similarity.
I know gensim provides the function model.docvecs.most_similar. However, the results of this include movies as well (which is not my intention). Is it possible to do this in gensim? (I assume that the document vectors are created in the order of the documents list that we provide.)
I am happy to provide more details if needed.
All tags are opaque identifiers to Doc2Vec, so if there are internal distinctions in your data, you'll need to model and filter on them yourself.
So, my main recommendation would be to ask for a much larger topn than you need, then discard those results of the type you don't want, or in excess of the number you actually need.
(Note that every calculation of most_similar() requires a comparison against the whole known set of doc-tags, and using a smaller topn only saves some computation in sorting of those full results. So using a larger topn, even up to the full size of the known doc-tags, isn't as costly as you might fear.)
With just two categories, to get the 10 tv-shows closest to a query movie you could make topn equal to the count of movies, minus 1 (the query), plus 10; then, in the absolute worst case where all movies are closer than the 1st tv-show, you'd still get 10 valid tv-show results.
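A minimal sketch of that over-fetch-and-filter idea (the helper name is made up; model and the tag prefixes come from the question, and topn=1000 covers all of the question's documents):
def closest_tv_series(model, movie_tag, n=5):
    # Ask for neighbours over all known doc-tags, then keep only tv-series tags
    candidates = model.docvecs.most_similar(movie_tag, topn=1000)
    tv_hits = [(tag, sim) for tag, sim in candidates if tag.startswith('tv_series_')]
    return tv_hits[:n]

print(closest_tv_series(model, 'movie_fast_and_furious', n=5))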
(The most_similar() function also includes a restrict_vocab parameter. It takes an int count, and limits results to only the 1st that-many items, in the internal storage order. So if in fact the 1st 500 documents were all tv-shows, restrict_vocab=500 would give only results from that subset. However, I wouldn't recommend relying on this, as (a) it'd only work for one category that was front-loaded, not any others; (b) ideally for training, you wouldn't clump all similar documents together, but shuffle them to be interspersed with contrasting documents. Generally Word2Vec vector-sets are sorted to put the highest-frequency words 1st – no matter the order-of-appearance in the original data. That makes restrict_vocab more useful there, as often only results for the most-common words with the strongest vectors are most interesting.)

How to generate the result of bigrams with highest probabilities with a list of individual alphabetical strings as input

I am learning natural language processing for bigram topic. At this stage, I am having difficulty in the Python computation, but I try.
I will be using this corpus, which has not been subjected to tokenization, as my main raw dataset. I can generate the bigram results using the nltk module. However, my question is how to compute, in Python, the bigrams that contain specific words. More specifically, I wish to find all the bigrams that are available in corpus_A and contain words from word_of_interest.
corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']
I want to get the bigrams for each of the individual words from the list word_of_interest. Next, I want to get the frequency of each bigram based on its appearance in corpus_A. With the frequencies available, I want to sort and print out the bigrams by their probability from highest to lowest.
I have tried code from an online search, but it does not give me an output. The code is mentioned below:
for i in corpus:
    bigrams_i = BigramCollocationFinder.from_words(corpus, window_size=5)
    bigram_j = lambda i[x] not in i
    x += 1
print(bigram_j)
Unfortunately, the output did not return what I am planning to achieve.
Please advise me. The output I want is the bigrams containing the specific words from word_of_interest, with their probabilities sorted from highest to lowest, as shown below.
[(('santa', 'clauss'), 0.89), (('he', 'and'), 0.67), (('stands', 'firm'), 0.34)]
You can try this code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(ngram_range=(2,2),use_idf=False)
corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']
matrix = vec.fit_transform(corpus).toarray()
vocabulary = vec.get_feature_names()
all_bigrams = []
all_frequencies = []
for word in word_of_interest:
    for bigram in vocabulary:
        if word in bigram:
            index = vocabulary.index(bigram)
            tuple_bigram = tuple(bigram.split(' '))
            frequency = matrix[:, index].sum()
            all_bigrams.append(tuple_bigram)
            all_frequencies.append(frequency)
df = pd.DataFrame({'bigram':all_bigrams,'frequency':all_frequencies})
df.sort_values('frequency',inplace=True)
df.head()
The output is a pandas dataframe showing the bigrams sorted by frequency:
              bigram  frequency
0       (for, santa)   0.109764
19    (stands, firm)   0.109764
18      (he, stands)   0.109764
17     (their, hand)   0.109764
16        (hand, to)   0.109764
The rationale here is that TfidfVectorizer counts how many times a token is present in each document of the corpus, then computes the term-specific frequency and stores this information in the column associated with that token. The index of that column is the same as the index of the associated term in the vocabulary retrieved with the .get_feature_names() method of the fitted vectorizer.
Then you just have to select all rows from the matrix containing the relative frequencies of the tokens, and sum along the column of interest.
The double-nested for loop is not ideal though, and there may be a more efficient implementation for it. The issue is that get_feature_names returns not tuples, but a list of strings in the form ['first_token second_token',].
I would be interested in seeing a better implementation of the second half of the above code.
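For what it's worth, one possible alternative (just a sketch, and note it matches whole tokens rather than substrings, so 'hand' would no longer match a bigram containing 'handy') builds the result in a single pass over the vocabulary instead of calling vocabulary.index() inside a nested loop:
rows = []
for index, bigram in enumerate(vocabulary):
    tokens = bigram.split(' ')
    if any(word in tokens for word in word_of_interest):
        rows.append((tuple(tokens), matrix[:, index].sum()))

df = pd.DataFrame(rows, columns=['bigram', 'frequency']).sort_values('frequency', ascending=False)
print(df.head())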

Check the tf-idf scores of sklearn in python

I am following the example here to calculate the TF-IDF values using sklearn.
My code is as follows.
from sklearn.feature_extraction.text import TfidfVectorizer
myvocabulary = ['life', 'learning']
corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}
tfidf = TfidfVectorizer(vocabulary = myvocabulary, ngram_range = (1,3))
tfs = tfidf.fit_transform(corpus.values())
I want to calculate the tf-idf values for the two words life and learning for the 3 documents in corpus.
According to the article I am referring to (see the table below), I should get the following values for my example.
However, the values I get from my code are completely different. Please help me find what is wrong in my code and how to fix it.
The main point is that you should not restrict the vocabulary to just two words ('life', 'learning') before constructing the term frequency matrix. If you do that, all other words will be ignored and it will affect the term frequency counting.
There are also several other steps that need to be taken into account if one wants to get exactly the same numbers as in the example by using sklearn:
The features in the example are unigrams (single words), so I have set ngram_range=(1,1).
The example uses different normalization than sklearn for the term frequency part (the term counts are normalized by document length in the example, whereas sklearn uses raw term counts by default). Because of this, I have counted and normalized the term frequencies separately before calculating the idf part.
The normalization in the example for the idf part is also not the default for sklearn. This can be adjusted to match the example by setting smooth_idf to False.
Sklearn's vectorizers discard words with just one character by default, but such words are kept in the example. In the code below, I have modified token_pattern to also allow 1-character words.
The final tfidf matrix is obtained by multiplying the normalized counts by the idf vector.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import normalize
import pandas as pd
corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}
cvect = CountVectorizer(ngram_range=(1,1), token_pattern='(?u)\\b\\w+\\b')
counts = cvect.fit_transform(corpus.values())
normalized_counts = normalize(counts, norm='l1', axis=1)
tfidf = TfidfVectorizer(ngram_range=(1,1), token_pattern='(?u)\\b\\w+\\b', smooth_idf=False)
tfs = tfidf.fit_transform(corpus.values())
new_tfs = normalized_counts.multiply(tfidf.idf_)
feature_names = tfidf.get_feature_names()
corpus_index = [n for n in corpus]
df = pd.DataFrame(new_tfs.T.todense(), index=feature_names, columns=corpus_index)
print(df.loc[['life', 'learning']])
However, in practice such modifications are rarely needed. One usually obtains good results just by using TfidfVectorizer directly.

tag generation from a text content

I am curious whether there is an algorithm/method to generate keywords/tags from a given text, using weight calculations, occurrence ratios, or other tools.
Additionally, I would be grateful if you could point me to any Python-based solution/library for this.
Thanks
One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say in a larger collection of documents the term 'Markov' is almost never seen. However, in a particular document from the same collection Markov shows up very frequently. This would suggest that Markov might be a good keyword or tag to associate with the document.
To identify keywords like this, you could use the pointwise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]. This will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection.
To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
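A minimal sketch of that ranking (the helper and its names are illustrative only, assuming docs is a list of token lists and doc_idx is the document of interest):
import math
from collections import Counter

def top_keywords_by_pmi(docs, doc_idx, n=5):
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    total = sum(corpus_counts.values())
    doc_counts = Counter(docs[doc_idx])
    doc_total = sum(doc_counts.values())

    def pmi(term):
        # log [ P(term | doc) / P(term) ], equivalent to the formula above
        return math.log((doc_counts[term] / doc_total) / (corpus_counts[term] / total))

    return sorted(doc_counts, key=pmi, reverse=True)[:n]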
If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.
Borrowing from my answer to that question, the NLTK collocations how-to covers how to extract interesting multiword expressions using n-gram PMI in about 7 lines of code, e.g.:
import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)
First, the key python library for computational linguistics is NLTK ("Natural Language Toolkit"). This is a stable, mature library created and maintained by professional computational linguists. It also has an extensive collection of tutorials, FAQs, etc. I recommend it highly.
Below is a simple template, in Python code, for the problem raised in your question. Although it's a template, it runs: supply any text as a string (as I've done) and it will return a list of word frequencies as well as a ranked list of those words in order of 'importance' (or suitability as keywords) according to a very simple heuristic.
Keywords for a given document are (obviously) chosen from among important words in a document--ie, those words that are likely to distinguish it from another document. If you had no a priori knowledge of the text's subject matter, a common technique is to infer the importance or weight of a given word/term from its frequency, or importance = 1/frequency.
text = """ The intensity of the feeling makes up for the disproportion of the objects. Things are equal to the imagination, which have the power of affecting the mind with an equal degree of terror, admiration, delight, or love. When Lear calls upon the heavens to avenge his cause, "for they are old like him," there is nothing extravagant or impious in this sublime identification of his age with theirs; for there is no other image which could do justice to the agonising sense of his wrongs and his despair! """
BAD_CHARS = ".!?,\'\""
# transform text into a list words--removing punctuation and filtering small words
words = [ word.strip(BAD_CHARS) for word in text.strip().split() if len(word) > 4 ]
word_freq = {}
# generate a 'word histogram' for the text--ie, a list of the frequencies of each word
for word in words:
    word_freq[word] = word_freq.get(word, 0) + 1
# sort the word list by frequency
# (just a DSU sort, there's a python built-in for this, but i can't remember it)
tx = [ (v, k) for (k, v) in word_freq.items()]
tx.sort(reverse=True)
word_freq_sorted = [ (k, v) for (v, k) in tx ]
# eg, what are the most common words in that text?
print(word_freq_sorted)
# returns: [('which', 4), ('other', 4), ('like', 4), ('what', 3), ('upon', 3)]
# obviously using a text larger than 50 or so words will give you more meaningful results
term_importance = lambda word : 1.0/word_freq[word]
# select document keywords from the words at/near the top of this list:
map(term_importance, word_freq.keys())
Latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) tries to represent each document in a training corpus as a mixture of topics, which in turn are distributions mapping words to probabilities.
I had used it once to dissect a corpus of product reviews into the latent ideas that were being spoken about across all the documents, such as 'customer service', 'product usability', etc. The basic model does not provide a way to convert the topic models into a single word describing what a topic is about, but people have come up with all kinds of heuristics to do that once their model is trained.
I recommend you try playing with http://mallet.cs.umass.edu/ and seeing if this model fits your needs.
LDA is a completely unsupervised algorithm, meaning it doesn't require you to hand-annotate anything, which is great; but on the flip side, it might not deliver the topics you were expecting it to give.
A very simple solution to the problem would be:
count the occurrences of each word in the text
consider the most frequent terms as the key phrases
have a black-list of 'stop words' to remove common words like the, and, it, is etc
I'm sure there are cleverer, stats-based solutions though.
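A rough sketch of that recipe (the stop-word list here is deliberately tiny and purely illustrative):
from collections import Counter

STOP_WORDS = {'the', 'and', 'it', 'is', 'a', 'of', 'to', 'in', 'on'}

def simple_tags(text, n=5):
    words = [w.strip('.,!?"\'').lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

print(simple_tags("The cat sat on the mat and the cat purred."))  # 'cat' comes out first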
If you need a solution to use in a larger project rather than for interest's sake, Yahoo BOSS has a key term extraction method.
Latent Dirichlet allocation or Hierarchical Dirichlet Process can be used to generate tags for individual texts within a greater corpus (body of texts) by extracting the most important words from the derived topics.
A basic example would be if we were to run LDA over a corpus and define it to have two topics, and that we find further that a text in the corpus is 70% one topic, and 30% another. The top 70% of the words that define the first topic and 30% that define the second (without duplication) could then be considered as tags for the given text. This method provides strong results where tags generally represent the broader themes of the given texts.
A general reference for the preprocessing needed for this code can be found here; with that in place, we can find tags through the following process using gensim.
A heuristic way of deriving the optimal number of topics for LDA is found in this answer. Although HDP does not require the number of topics as an input, the standard in such cases is still to use LDA with a derived topic number, as HDP can be problematic. Assume here that the corpus is found to have 10 topics, and we want 5 tags per text:
from gensim.models import LdaModel, HdpModel
from gensim import corpora
num_topics = 10
num_tags = 5
Assume further that we have a variable corpus, which is a preprocessed list of lists, with the sublist entries being word tokens. Initialize a Dirichlet dictionary and create a bag of words where texts are converted to the indexes of their component tokens (words):
dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
Create an LDA or HDP model:
dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')

# dirichlet_model = HdpModel(corpus=bow_corpus,
#                            id2word=dirichlet_dict,
#                            chunksize=len(bow_corpus))
The following code produces ordered lists for the most important words per topic (note that here is where num_tags defines the desired tags per text):
shown_topics = dirichlet_model.show_topics(num_topics=num_topics,
                                           num_words=num_tags,
                                           formatted=False)
model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
Then find the coherence of the topics across the texts:
topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0
topics_per_text = [text for text in topic_corpus]
From here we have the percentage that each text coheres to a given topic, and the words associated with each topic, so we can combine them for tags with the following:
corpus_tags = []

for i in range(len(bow_corpus)):
    # The complexity here is to make sure that it works with HDP
    significant_topics = list(set([t[0] for t in topics_per_text[i]]))
    topic_indexes_by_coherence = [tup[0] for tup in sorted(enumerate(topics_per_text[i]), key=lambda x: x[1])]
    significant_topics_by_coherence = [significant_topics[i] for i in topic_indexes_by_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_coherence][:num_topics]  # subset for HDP
    ordered_topic_coherences = [topics_per_text[i] for i in topic_indexes_by_coherence][:num_topics]  # subset for HDP

    text_tags = []
    for i in range(num_topics):
        # Find the number of indexes to select, which can later be extended if the word has already been selected
        selection_indexes = list(range(int(round(num_tags * ordered_topic_coherences[i]))))
        if selection_indexes == [] and len(text_tags) < num_tags:
            # Fix potential rounding error by giving this topic one selection
            selection_indexes = [0]

        for s_i in selection_indexes:
            # ignore_words is a list of words that should not be included
            if ordered_topics[i][s_i] not in text_tags and ordered_topics[i][s_i] not in ignore_words:
                text_tags.append(ordered_topics[i][s_i])
            else:
                selection_indexes.append(selection_indexes[-1] + 1)

    # Fix for if too many were selected
    text_tags = text_tags[:num_tags]

    corpus_tags.append(text_tags)
corpus_tags will be a list of tags for each text based on how coherent the text is to the derived topics.
See this answer for a similar version of this that generates tags for a whole text corpus.
