I am curious whether an algorithm/method exists to generate keywords/tags from a given text, using weight calculations, occurrence ratios, or other tools.
Additionally, I would be grateful if you could point me to any Python-based solution/library for this.
Thanks
One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say in a larger collection of documents the term 'Markov' is almost never seen. However, in a particular document from the same collection Markov shows up very frequently. This would suggest that Markov might be a good keyword or tag to associate with the document.
To identify keywords like this, you could use the point-wise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]. This will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection.
To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
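A minimal sketch of this idea in Python, assuming docs is a list of tokenized, lowercased documents and estimating the probabilities from simple unigram counts (an illustration rather than a polished implementation):
from collections import Counter
from math import log

def top_keywords_by_pmi(docs, doc_index, n=5):
    corpus_counts = Counter(t for d in docs for t in d)  # term counts over the whole collection
    doc_counts = Counter(docs[doc_index])                # term counts in the target document
    total = sum(corpus_counts.values())                  # total tokens in the collection
    p_doc = sum(doc_counts.values()) / total             # P(doc): this document's share of all tokens

    def pmi(term):
        p_term_doc = doc_counts[term] / total            # P(term, doc)
        p_term = corpus_counts[term] / total             # P(term)
        return log(p_term_doc / (p_term * p_doc))

    # sort the document's terms by PMI with the document and keep the n highest-scoring ones
    return sorted(doc_counts, key=pmi, reverse=True)[:n]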
If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.
Borrowing from my answer to that question, the NLTK collocations how-to covers how to
extract interesting multiword expressions using n-gram PMI in about 7 lines of code, e.g.:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
# change this to read in your data
finder = BigramCollocationFinder.from_words(
nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)
First, the key python library for computational linguistics is NLTK ("Natural Language Toolkit"). This is a stable, mature library created and maintained by professional computational linguists. It also has an extensive collection of tutorials, FAQs, etc. I recommend it highly.
Below is a simple template, in Python code, for the problem raised in your question; although it's a template it runs--supply any text as a string (as I've done) and it will return a list of word frequencies as well as a ranked list of those words in order of 'importance' (or suitability as keywords) according to a very simple heuristic.
Keywords for a given document are (obviously) chosen from among important words in a document--ie, those words that are likely to distinguish it from another document. If you had no a priori knowledge of the text's subject matter, a common technique is to infer the importance or weight of a given word/term from its frequency, or importance = 1/frequency.
text = """ The intensity of the feeling makes up for the disproportion of the objects. Things are equal to the imagination, which have the power of affecting the mind with an equal degree of terror, admiration, delight, or love. When Lear calls upon the heavens to avenge his cause, "for they are old like him," there is nothing extravagant or impious in this sublime identification of his age with theirs; for there is no other image which could do justice to the agonising sense of his wrongs and his despair! """
BAD_CHARS = ".!?,\'\""
# transform text into a list words--removing punctuation and filtering small words
words = [ word.strip(BAD_CHARS) for word in text.strip().split() if len(word) > 4 ]
word_freq = {}
# generate a 'word histogram' for the text--ie, a list of the frequencies of each word
for word in words:
    word_freq[word] = word_freq.get(word, 0) + 1
# sort the word list by frequency
# (a DSU sort; the built-in sorted() with a key argument does the same thing)
tx = [ (v, k) for (k, v) in word_freq.items()]
tx.sort(reverse=True)
word_freq_sorted = [ (k, v) for (v, k) in tx ]
# eg, what are the most common words in that text?
print(word_freq_sorted)
# prints a list of (word, frequency) tuples, most frequent words first
# obviously using a text larger than 50 or so words will give you more meaningful results
term_importance = lambda word : 1.0/word_freq[word]
# select document keywords from the words at/near the top of this list
# (in Python 3, map() returns a lazy iterator, so build and print a ranked list instead):
importance_ranked = sorted(word_freq, key=term_importance, reverse=True)
print(importance_ranked)
Latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) tries to represent each document in a training corpus as a mixture of topics, which in turn are distributions mapping words to probabilities.
I had used it once to dissect a corpus of product reviews into the latent ideas that were being spoken about across all the documents, such as 'customer service', 'product usability', etc. The basic model does not advocate a way to convert the topic models into a single word describing what a topic is about, but people have come up with all kinds of heuristics to do that once their model is trained.
I recommend you try playing with http://mallet.cs.umass.edu/ and seeing if this model fits your needs.
LDA is a completely unsupervised algorithm, meaning it doesn't require you to hand-annotate anything, which is great; but on the flip side, it might not deliver the topics you were expecting.
A very simple solution to the problem would be:
count the occurrences of each word in the text
consider the most frequent terms as the key phrases
have a blacklist of 'stop words' to remove common words like the, and, it, is, etc.
I'm sure there are cleverer, stats based solutions though.
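For instance, a rough sketch of that simple counting approach (the stop-word set here is just a placeholder to extend):
from collections import Counter

STOP_WORDS = {'the', 'and', 'it', 'is', 'a', 'of', 'to', 'in', 'on'}  # extend as needed

def simple_keywords(text, n=5):
    # lowercase, split on whitespace, strip punctuation, drop stop words
    words = [w.strip(".,!?'\"").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    return Counter(words).most_common(n)

print(simple_keywords("The cat sat on the mat and the cat slept.", n=3))
# -> [('cat', 2), ('sat', 1), ('mat', 1)]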
If you need a solution to use in a larger project rather than for interests sake, Yahoo BOSS has a key term extraction method.
Latent Dirichlet allocation or Hierarchical Dirichlet Process can be used to generate tags for individual texts within a greater corpus (body of texts) by extracting the most important words from the derived topics.
A basic example would be if we were to run LDA over a corpus and define it to have two topics, and that we find further that a text in the corpus is 70% one topic, and 30% another. The top 70% of the words that define the first topic and 30% that define the second (without duplication) could then be considered as tags for the given text. This method provides strong results where tags generally represent the broader themes of the given texts.
A general reference for the preprocessing needed for this code can be found here; given preprocessed texts, we can find tags through the following process using gensim.
A heuristic way of deriving the optimal number of topics for LDA is found in this answer. Although HDP does not require the number of topics as an input, the standard in such cases is still to use LDA with a derived topic number, as HDP can be problematic. Assume here that the corpus is found to have 10 topics, and we want 5 tags per text:
from gensim.models import LdaModel, HdpModel
from gensim import corpora
num_topics = 10
num_tags = 5
Assume further that we have a variable corpus, which is a preprocessed list of lists, with the sub-list entries being word tokens. Initialize a Dirichlet dictionary and create a bag of words where texts are converted to the indexes of their component tokens (words):
dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
Create an LDA or HDP model:
dirichlet_model = LdaModel(corpus=bow_corpus,
id2word=dirichlet_dict,
num_topics=num_topics,
update_every=1,
chunksize=len(bow_corpus),
passes=20,
alpha='auto')
# dirichlet_model = HdpModel(corpus=bow_corpus,
# id2word=dirichlet_dict,
# chunksize=len(bow_corpus))
The following code produces ordered lists for the most important words per topic (note that here is where num_tags defines the desired tags per text):
shown_topics = dirichlet_model.show_topics(num_topics=num_topics,
num_words=num_tags,
formatted=False)
model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
Then find the coherence of the topics across the texts:
topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0
topics_per_text = [text for text in topic_corpus]
From here we have the percentage that each text coheres to a given topic, and the words associated with each topic, so we can combine them for tags with the following:
ignore_words = []  # words that should not be included as tags (fill in as needed)
corpus_tags = []

for i in range(len(bow_corpus)):
    # The complexity here is to make sure that it works with HDP
    significant_topics = [t[0] for t in topics_per_text[i]]
    # Order this text's topics from most to least coherent (highest to lowest probability)
    topic_indexes_by_coherence = [tup[0] for tup in sorted(enumerate(topics_per_text[i]),
                                                           key=lambda x: x[1][1], reverse=True)]
    significant_topics_by_coherence = [significant_topics[idx] for idx in topic_indexes_by_coherence]
    ordered_topics = [model_topics[topic_id] for topic_id in significant_topics_by_coherence][:num_topics]  # subset for HDP
    ordered_topic_coherences = [topics_per_text[i][idx][1] for idx in topic_indexes_by_coherence][:num_topics]  # subset for HDP

    text_tags = []
    for t in range(len(ordered_topics)):
        # Find the number of indexes to select, which can later be extended if the word has already been selected
        selection_indexes = list(range(int(round(num_tags * ordered_topic_coherences[t]))))
        if selection_indexes == [] and len(text_tags) < num_tags:
            # Fix potential rounding error by giving this topic one selection
            selection_indexes = [0]

        for s_i in selection_indexes:
            # ignore_words is a list of words that should not be included
            if ordered_topics[t][s_i] not in text_tags and ordered_topics[t][s_i] not in ignore_words:
                text_tags.append(ordered_topics[t][s_i])
            elif selection_indexes[-1] + 1 < len(ordered_topics[t]):
                # Extend the selection, without running off the end of this topic's word list
                selection_indexes.append(selection_indexes[-1] + 1)

    # Fix for if too many were selected
    text_tags = text_tags[:num_tags]

    corpus_tags.append(text_tags)
corpus_tags will be a list of tags for each text based on how coherent the text is to the derived topics.
See this answer for a similar version of this that generates tags for a whole text corpus.
Related
I am working on text embedding in Python, where I find the similarity between two documents with the Doc2Vec model. The code is as follows:
for doc_id in range(len(train_corpus)):
    # take each document's words as input and produce a vector for that document
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # compare that vector against all trained document vectors and rank them by similarity
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

    print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
    print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
        print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))
Now, from these two embedded documents, how can I extract a set of semantically similar words for those particular documents?
Please help me out.
Only some Doc2Vec modes also train word-vectors: dm=1 (the default), or dm=0, dbow_words=1 (DBOW doc-vectors but with added skip-gram word-vectors). If you've used such a mode, then there will be word-vectors in your model.wv property.
A call to model.wv.similarity(word1, word2) method will give you the pairwise similarity for any 2 words.
So, you could iterate over all the words in doc1, then collect the similarities to each word in doc2, and report the single highest similarity for each word.
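A rough sketch of that loop, assuming model is a trained Doc2Vec model with word-vectors and doc1_words / doc2_words are the token lists of the two documents (closest_word_pairs is just an illustrative name):
def closest_word_pairs(model, doc1_words, doc2_words):
    pairs = []
    for w1 in doc1_words:
        if w1 not in model.wv:
            continue  # skip words the model never learned
        # the best-matching word in doc2 for this word from doc1
        candidates = [(w2, model.wv.similarity(w1, w2)) for w2 in doc2_words if w2 in model.wv]
        if candidates:
            best_word, best_sim = max(candidates, key=lambda p: p[1])
            pairs.append((w1, best_word, best_sim))
    # word pairs ordered by similarity, highest first
    return sorted(pairs, key=lambda p: p[2], reverse=True)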
I have a list of 1000 documents, where the first 500 belong to documents in movies (i.e. list index from 0 to 499) and the remaining 500 belong to documents in tv series (i.e. list index from 500 to 999).
For movies the document tag starts with movie_ (e.g., movie_fast_and_furious) and for tv series the document tag starts with tv_series_ (e.g., tv_series_the_office)
I use this movies and tv series dataset to build a doc2vec model as follows.
from gensim.models import doc2vec
from collections import namedtuple
import json

dataset = json.load(open(input_file))
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))
model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 10, min_count = 1, workers = 4, epochs = 20)
Now for each movie, I want to get its nearest 5 tv series along with their cosine similarity.
I know gensim provides the function model.docvecs.most_similar. However, the results of this include movies as well (which is not my intention). Is it possible to do this in gensim? (I assume that the document vectors are created in the order of the documents list that we provide.)
I am happy to provide more details if needed.
All the tags are opaque identifiers to Doc2Vec. So, if there's internal distinctions to your data, you'll need to model and filter on that yourself.
So, my main recommendation would be to ask for a much larger topn than you need, then discard those results of the type you don't want, or in excess of the number you actually need.
(Note that every calculation of most_similar() requires a comparison against the whole known set of doc-tags, and using a smaller topn only saves some computation in sorting of those full results. So using a larger topn, even up to the full size of the known doc-tags, isn't as costly as you might fear.)
With just two categories, to get the 10 tv-shows closest to a query movie, you could make topn equal to the count of movies, minus 1 (the query), plus 10 – then in the absolute worst case, where all movies are closer than the 1st tv-show, you'd still get 10 valid tv-show results.
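A sketch of that approach, assuming model is the trained Doc2Vec model from the question and the tag prefixes described above (closest_tv_series is just an illustrative name):
def closest_tv_series(model, movie_tag, n=5, n_movies=500):
    # ask for enough results that, even in the worst case, n tv_series_ tags survive the filter
    raw = model.docvecs.most_similar(movie_tag, topn=n_movies - 1 + n)
    tv_only = [(tag, sim) for tag, sim in raw if tag.startswith('tv_series_')]
    return tv_only[:n]

# e.g. closest_tv_series(model, 'movie_fast_and_furious', n=5)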
(The most_similar() function also includes a restrict_vocab parameter. It takes an int count, and limits results to only the 1st that-many items, in the internal storage order. So if in fact the 1st 500 documents were all tv-shows, restrict_vocab=500 would give only results from that subset. However, I wouldn't recommend relying on this, as (a) it'd only work for one category that was front-loaded, not any others; (b) ideally for training, you wouldn't clump all similar documents together, but shuffle them to be interspersed with contrasting documents. Generally Word2Vec vector-sets are sorted to put the highest-frequency words 1st – no matter the order-of-appearance in the original data. That makes restrict_vocab more useful there, as often only results for the most-common words with the strongest vectors are most interesting.)
I am learning natural language processing, currently on the topic of bigrams. At this stage, I am having difficulty with the Python computation, but I am trying.
I will be using this corpus that has not been subjected to tokenization as my main raw dataset. I can generate the bigram results using the nltk module. However, my question is how to compute, in Python, the bigrams containing any of several specific words. More specifically, I wish to find all the bigrams, which are available in corpus_A, that contain words from the word_of_interest.
corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']
I want to get the bigrams for each of the individual words from the list word_of_interest. Next, I want to get the frequency for each bigram available based on their appearance in corpus_A. With the frequency available, I want to sort and print out the bigrams based on their probability from highest to lowest.
I have tried out code from an online search, but it does not give me an output. The code is mentioned below:
for i in corpus:
bigrams_i = BigramCollocationFinder.from_words(corpus, window_size=5)
bigram_j = lambda i[x] not in i
x += 1
print(bigram_j)
Unfortunately, the output did not return what I am planning to achieve.
Please advise me. The output that I want will have the bigrams with the specific words from word_of_interest and their probabilities, sorted as shown below.
[((santa, clauss), 0.89), ((he, and), 0.67), ((stands, firm), 0.34))]
You can try this code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(ngram_range=(2,2),use_idf=False)
corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']
matrix = vec.fit_transform(corpus).toarray()
vocabulary = vec.get_feature_names()
all_bigrams = []
all_frequencies = []
for word in word_of_interest:
    for bigram in vocabulary:
        if word in bigram:
            index = vocabulary.index(bigram)
            tuple_bigram = tuple(bigram.split(' '))
            frequency = matrix[:, index].sum()
            all_bigrams.append(tuple_bigram)
            all_frequencies.append(frequency)
df = pd.DataFrame({'bigram':all_bigrams,'frequency':all_frequencies})
df.sort_values('frequency',inplace=True)
df.head()
The output is a pandas dataframe showing the bigrams sorted by frequency.
bigram frequency
0 (for, santa) 0.109764
19 (stands, firm) 0.109764
18 (he, stands) 0.109764
17 (their, hand) 0.109764
16 (hand, to) 0.109764
The rationale here is that TfidfVectorizer counts how many times a token is present in each document of a corpus, computes a (normalized) term frequency, and stores this information in a column associated with that token. The index of that column is the same as the index of the associated word in the vocabulary retrieved with the .get_feature_names() method on the fitted vectorizer.
Then you just have to select the column of the matrix holding the relative frequencies of the token of interest and sum it across the documents.
The double-nested for loop is not ideal though, and there may be a more efficient implementation for it. The issue is that get_feature_names returns not tuples, but a list of strings in the form ['first_token second_token',].
I would be interested in seeing a better implementation of the second half of the above code.
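For example, one possible sketch that avoids calling vocabulary.index() inside a nested loop, reusing vec, matrix, vocabulary, and word_of_interest from above (note it also collapses the duplicates the nested loop would produce when a bigram matches several words of interest):
frequencies = matrix.sum(axis=0)  # column sums: the weight of each bigram across the corpus
rows = [
    (tuple(bigram.split(' ')), frequencies[idx])
    for idx, bigram in enumerate(vocabulary)
    if any(word in bigram for word in word_of_interest)  # same substring test as above
]
df = pd.DataFrame(rows, columns=['bigram', 'frequency'])
df = df.sort_values('frequency', ascending=False)
print(df.head())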
I am trying to filter out tokens by their frequency using the filter_extremes function in Gensim (https://radimrehurek.com/gensim/corpora/dictionary.html). Specifically, I am interested in filtering out words that occur in "Less frequent than no_below documents" and "More frequent than no_above documents".
id2word_ = corpora.Dictionary(texts)
print(len(id2word_))
id2word_.filter_extremes(no_above = 0.600)
print(len(id2word_))
The first print statement gives 11918 and the second print statement gives 3567. However, if I do the following:
id2word_ = corpora.Dictionary(texts)
print(len(id2word_))
id2word_.filter_extremes(no_below = 0.599)
print(len(id2word_))
The first print statement gives 11918 (as expected) and the second gives 11406. Shouldn't id2word_.filter_extremes(no_below = 0.599) and id2word_.filter_extremes(no_above = 0.600) add up to the number of total words? However, 11406 + 3567 > 11918, so how come this sum exceeds the number of words in the corpus? That does not make sense since the filters should cover non-overlapping words, based off the explanation in the documentation.
If you have any ideas, I would really appreciate your input! Thanks!
According to the definition:
no_below (int, optional) – Keep tokens which are contained in at least no_below documents.
no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).
no_below is an int: it is a threshold on the number of documents a token must appear in, so tokens contained in fewer than no_below documents are filtered out, e.g. use no_below to filter out words appearing in fewer than 10 documents.
On the contrary, no_above is not an int but a float that represents a fraction of the total corpus size, e.g. use no_above to filter out words appearing in more than 10% of all documents.
It's a little odd that no_below and no_above do not represent the same unit, hence the confusion.
Hope this answers your question.
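As a toy illustration of the two units (a sketch; the token lists are made up):
from gensim import corpora

texts = [['cat', 'sat', 'mat'],
         ['cat', 'ran', 'far'],
         ['dog', 'sat', 'mat'],
         ['cat', 'dog', 'mat']]
d = corpora.Dictionary(texts)
# no_below=2 drops 'ran' and 'far' (each in only 1 document);
# no_above=0.6 drops 'cat' and 'mat' (each in 3 of the 4 documents, i.e. 75% > 60%)
d.filter_extremes(no_below=2, no_above=0.6)
print(sorted(d.token2id))  # ['dog', 'sat']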
Regarding the filter_extremes in Gensim, the units for "no_above" and "no_below" parameters are actually DIFFERENT. This is a bit odd, to be honest.
For "no_above", you want to put a number between 0 and 1 there (float). It should be a percentage that represents the portion of a word in total corpus size.
For "no_below", you want to have a integer. It should be the number of times when a word shows in corpus. It is a threshold.
Hope it clarifies your question.
I write this as an extension to other users' answers. Yes, the two parameters are different and control different kinds of token frequencies. In addition, note the following pitfall: filter_extremes assigns default values to no_above and no_below, so if you write:
id2word_.filter_extremes(no_below = 0.599)
it is effectively
id2word_.filter_extremes(no_below = 0.599, no_above=0.5)
Hope you have found the answer to your question. I have been dabbling with the gensim library and found that these two parameters, 'no_below' and 'no_above', are probably best used together. no_below, as others said, is straightforward and keeps a token if its document frequency is not less than the parameter.
Some code samples follow. I am not sure if there are best practices, but I usually tinker with no_above starting from 1 (i.e. 100% of the corpus), which means the upper bound filters out nothing, and tighten it from there.
# keep tokens contained in at least 2 documents
# and in no more than 90% of the corpus
dictionary.filter_extremes(no_below=2, no_above=0.9)
print(len(dictionary))

# keep tokens contained in at least 3 documents
# and in no more than 100% of the corpus
dictionary.filter_extremes(no_below=3, no_above=1)
print(len(dictionary))
In general, start with no_above at 100% of the corpus and make adjustments from there.
I have the following code that I found at https://pythonprogramminglanguage.com/kmeans-text-clustering/ on document clustering. While I understand the k-means algorithm as a whole, I have a little trouble wrapping my head around what the top terms per cluster represent and how they are computed. Is it the most frequent words that occur in the cluster? One blog post I read said that the outputted words represent the "top n words that are nearest to the cluster centroid" (but what does it mean for an actual word to be "closest" to the cluster centroid?). I really want to understand the details and nuances of what is going on. Thank you!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["This little kitty came to play when I was eating at a restaurant.",
"Merley has the best squooshy kitten belly.",
"Google Translate app is incredible.",
"If you open 100 tab in google you get a smiley face.",
"Best cat photo I've ever taken.",
"Climbing ninja cat.",
"Impressed with google map feedback.",
"Key promoter extension for Google Chrome."]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()
'Top' in this context is directly related to the way in which the text has been transformed into an array of numerical values. By using TFIDF you are, for each individual document, assigning each word a score based on how prevalent it is in that document, inverse to the prevalence across the entire set of documents. A word with a high score in a document indicates that it is more significant or more representative of that document than the other words.
Therefore with this generation of top terms for each cluster, they are the words that, on average, are most significant in the documents for that cluster.
The way it has been done here works and is efficient, but I find it difficult to understand myself, and I don't think it is particularly intuitive: it is hard to see why, if cluster_centers_ are the co-ordinates for the centroids, the features with the highest co-ordinate values are the top words. I kind of get it but not quite (if anyone wants to explain how this works that would be great!).
I use a different method to find the top terms for a cluster which I find more intuitive. I just tested the method you posted with my own on a corpus of 250 documents and the top words are exactly the same. The value of my method is that it works however you cluster the documents as long as you can provide a list of the cluster assignments (which any clustering algorithm should provide), meaning you're not reliant on the presence of a cluster_centers_ attribute. It's also, I think, more intuitive.
import numpy as np

def term_scorer(doc_term_matrix, feature_name_list, labels=None, target=None, n_top_words=10):
    if target is not None:
        filter_bool = np.array(labels) == target
        doc_term_matrix = doc_term_matrix[filter_bool]
    # sum each term's scores across the (remaining) documents;
    # np.asarray(...).ravel() keeps this working for both sparse matrices and dense arrays
    term_scores = np.asarray(doc_term_matrix.sum(axis=0)).ravel()
    # indices of the terms, ordered from highest to lowest summed score
    top_term_indices = np.argsort(term_scores)[::-1]
    return [feature_name_list[term_idx] for term_idx in top_term_indices[:n_top_words]]

term_scorer(X, terms, labels=model.labels_, target=1, n_top_words=10)
The model.labels_ attribute gives you a list of the cluster assignments for each document. In this example I want to find the top words for cluster 1 so I assign target=1, the function filters the X array keeping only rows assigned to cluster 1. It then sums all the scores across the documents row wise so it has one single row with a column for each word. It then uses argsort to sort that row by highest values to lowest, replaces the values with the original index positions of the words. Finally it uses a list comprehension to grab index numbers from the top score to n_top_words and then builds a list of words by looking up those indexes in feature_name_list.
When words are converted into vectors, we talk about the closeness of words as how similar they are. So, for instance, you could use cosine similarity to determine how close two words are to each other. The vectors for "dog" and "puppy" will be similar, so you could say the two words are close to each other.
In other words, closeness is also determined by the context words. So a word pair like (the, cat) can be close, depending on the sentences they appear in. That is how word2vec and similar algorithms work to create word vectors.
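For reference, a minimal sketch of cosine similarity between two word vectors (the wv lookups in the comments are hypothetical, standing in for any trained model):
import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two vectors: values near 1.0 mean very similar directions
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# e.g., with word vectors from a trained model:
# cosine_similarity(wv['dog'], wv['puppy'])    # relatively high
# cosine_similarity(wv['dog'], wv['algebra'])  # relatively low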