Term weighting for original LDA in gensim - python

I am using the gensim library to apply LDA to a set of documents. Using gensim I can apply LDA to a corpus whatever the term weights are: binary, tf, tf-idf...
My question is, what is the term weighting that should be used for the original LDA? If I have understood correctly, the weights should be term frequencies, but I am not sure.

It should be a corpus represented as a "bag of words". Or, yes, lists of term counts.
The correct format is that of the corpus defined in the first tutorial on the Gensim webpage (these are really useful).
Namely, if you have a dictionary as defined in Radim's tutorial, and the following documents,
doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]
then your corpus (for use with LDA) should be an iterable object (such as a list) of lists of tuples of the form (dictKey, count), where dictKey refers to the dictionary key of a term and count is the number of times it occurs in the document. This is done for you with
corpus = [dictionary.doc2bow(doc) for doc in docs]
That doc2bow function means "document to bag of words".
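For completeness, a minimal end-to-end sketch of that flow (the num_topics value here is arbitrary and just for illustration): build the dictionary from the tokenized documents, convert them with doc2bow, and feed the resulting count-weighted corpus straight to LdaModel without any extra weighting.
from gensim import corpora
from gensim.models import LdaModel

doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]

# map each term to an integer id
dictionary = corpora.Dictionary(docs)

# each document becomes a list of (dictKey, count) tuples, i.e. plain term frequencies
corpus = [dictionary.doc2bow(doc) for doc in docs]
print(corpus)

# LDA consumes these raw counts directly; no tf-idf or other weighting is applied
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)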

Related

Text similarity using Word2Vec

I would like to use Word2Vec to check similarity of texts.
I am currently using another logic:
from fuzzywuzzy import fuzz

def sim(name, dataset):
    matches = dataset.apply(lambda row: fuzz.ratio(row['Text'], name) >= 0.5, axis=1)
    return matches
(name is my column).
For applying this function I do the following:
df['Sim']=df.apply(lambda row: sim(row['Text'], df), axis=1)
Could you please tell me how to replace fuzz.ratio with Word2Vec in order to compare texts in a dataset?
Example of dataset:
Text
Hello, this is Peter, what would you need me to help you with today?
I need you
Good Morning, John here, are you calling regarding your cell phone bill?
Hi, this this is John. What can I do for you?
...
The first text and the last one are quite similar, although they use different words to express a similar concept.
I would like to create a new column where, for each row, I put the texts that are similar to it.
I hope you can help me.
TLDR; skip to the last section (part 4.) for code implementation
1. Fuzzy vs Word embeddings
Unlike a fuzzy match, which is basically edit distance or Levenshtein distance matching strings at the character level, word2vec (and other models such as fastText and GloVe) represents each word in an n-dimensional Euclidean space. The vector that represents each word is called a word vector or word embedding.
These word embeddings are n-dimensional vector representations of a large vocabulary of words. These vectors can be summed to create a representation of a sentence's embedding. Sentences with words with similar semantics will have similar vectors, and thus their sentence embeddings will also be similar. Read more about how word2vec works internally here.
Let's say I have a sentence with 2 words. Word2Vec will represent each word here as a vector in some Euclidean space. Summing them up, just like standard vector addition, will result in another vector in the same space. This can be a good choice for representing a sentence using individual word embeddings.
NOTE: There are other methods of combining word embeddings such as a weighted sum with tf-idf weights OR just directly using sentence embeddings with an algorithm called Doc2Vec. Read more about this here.
2. Similarity between word vectors / sentence vectors
“You shall know a word by the company it keeps”
Words that occur with the same surrounding words (context) are usually similar in semantics/meaning. The great thing about word2vec is that vectors for words with similar contexts lie closer to each other in the Euclidean space. This lets you do stuff like clustering or just simple distance calculations.
A good way to find how similar two word vectors are is cosine similarity. Read more here.
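As a quick illustration, cosine similarity is just the dot product of two vectors divided by the product of their norms; a minimal numpy sketch (the two 3-dimensional vectors below are made-up placeholders, not real embeddings):
import numpy as np

def cosine_similarity(u, v):
    # dot product normalized by the vector lengths; 1.0 means same direction
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

v_king = np.array([0.5, 0.1, 0.3])
v_queen = np.array([0.45, 0.12, 0.28])
print(cosine_similarity(v_king, v_queen))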
3. Pre-trained word2vec models (and others)
The awesome thing about word2vec and such models is that you don't need to train them on your data in most cases. You can use pre-trained word embeddings that have been trained on a ton of data and encode the contextual/semantic similarities between words based on their co-occurrence with other words in sentences.
You can check similarity between these sentence embeddings using cosine_similarity
4. Sample code implementation
I use a GloVe model (similar to word2vec) which is already trained on Wikipedia, where each word is represented as a 50-dimensional vector. You can choose models other than the one I used from here - https://github.com/RaRe-Technologies/gensim-data
import numpy as np
from scipy import spatial
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # choose from multiple models: https://github.com/RaRe-Technologies/gensim-data

s0 = 'Mark zuckerberg owns the facebook company'
s1 = 'Facebook company ceo is mark zuckerberg'
s2 = 'Microsoft is owned by Bill gates'
s3 = 'How to learn japanese'

def preprocess(s):
    return [i.lower() for i in s.split()]

def get_vector(s):
    # sum the word vectors of a sentence to get a single sentence vector
    return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)

print('s0 vs s1 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s1)))
print('s0 vs s2 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s2)))
print('s0 vs s3 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s3)))
#Semantic similarity between sentence pairs
s0 vs s1 -> 0.965923011302948
s0 vs s2 -> 0.8659112453460693
s0 vs s3 -> 0.5877998471260071
If you want to compare sentences, you should not use Word2Vec or GloVe embeddings. They translate every word in a sentence into a vector, and it is quite cumbersome to judge how similar two sentences are from the two resulting sets of vectors. You should use something that is tailored to convert a whole sentence into a single vector; then you just need to compare how similar two vectors are. The Universal Sentence Encoder is one of the best encoders considering the computational cost and accuracy trade-off (the DAN variant). See an example of usage in this post; I believe it describes a use case which is quite close to yours.
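For reference, a minimal sketch of that approach with the TensorFlow Hub release of the Universal Sentence Encoder (TensorFlow 2.x assumed; the two sentences are taken from the question's example data):
import numpy as np
import tensorflow_hub as hub

# DAN-based Universal Sentence Encoder published on TF Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "Hello, this is Peter, what would you need me to help you with today?",
    "Hi, this is John. What can I do for you?",
]
vectors = embed(sentences).numpy()  # one 512-dimensional vector per sentence

# cosine similarity between the two sentence vectors
sim = np.dot(vectors[0], vectors[1]) / (np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
print(sim)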

I want to get a list of semantically similar words from the two embedded documents in python

I am working on text embedding in Python, where I found the similarity between two documents with the Doc2Vec model. The code is as follows:
for doc_id in range(len(train_corpus)):
    # take each document's words as input and produce a vector for that document
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # compare the inferred vector with all trained document vectors, most similar first
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
    print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
        print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))
Now, from these two embedded documents, how can I extract a set of semantically similar words for those particular documents?
Please help me out.
Only some Doc2Vec modes also train word-vectors: dm=1 (the default), or dm=0, dbow_words=1 (DBOW doc-vectors plus added skip-gram word-vectors). If you've used such a mode, then there will be word-vectors in your model.wv property.
A call to the model.wv.similarity(word1, word2) method will give you the pairwise similarity for any 2 words.
So, you could iterate over all the words in doc1, then collect the similarities to each word in doc2, and report the single highest similarity for each word.
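A minimal sketch of that loop, assuming model was trained in one of the word-vector-producing modes above; doc1_words and doc2_words are hypothetical token lists standing in for your two documents (word in model.wv checks vocabulary membership):
doc1_words = ['big', 'data', 'technique']    # placeholder tokens
doc2_words = ['large', 'dataset', 'method']  # placeholder tokens

for w1 in doc1_words:
    if w1 not in model.wv:
        continue
    # best matching word in doc2 for this doc1 word, by word-vector similarity
    candidates = [(w2, model.wv.similarity(w1, w2)) for w2 in doc2_words if w2 in model.wv]
    if candidates:
        best_word, best_sim = max(candidates, key=lambda pair: pair[1])
        print(w1, '->', best_word, round(float(best_sim), 3))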

Remove single occurrences of words in vocabulary TF-IDF

I am attempting to remove words that occur once in my vocabulary to reduce my vocabulary size. I am using the sklearn TfidfVectorizer() and then the fit_transform function on my data frame.
tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))
My first thought is to use the preprocessor field of the TfidfVectorizer, or the preprocessing package before the machine learning step.
Any tips or links to further implementation?
You are looking for the min_df parameter (minimum document frequency); from the documentation of scikit-learn's TfidfVectorizer:
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
# remove words occurring less than 5 times
tfidf = TfidfVectorizer(min_df=5)
You can also remove common words:
# remove words occurring in more than half the documents
tfidf = TfidfVectorizer(max_df=0.5)
You can also remove stopwords like this:
tfidf = TfidfVectorizer(stop_words='english')
ShmulikA's answer will most likely work well, but it removes words based on document frequency. Thus, if a specific word occurs 200 times in only 1 document, it will be removed. The TF-IDF vectorizer does not provide exactly what you want. You would have to:
Fit the vectorizer to your corpus and extract the complete vocabulary from the vectorizer.
Take the words as keys in a new dictionary.
Count every word occurrence:
for every document in corpus: for word in document: vocabulary[word] += 1
Now, find the entries whose value == 1 and drop them from the dictionary. Put the remaining keys into a list and pass the list as the vocabulary parameter to the TF-IDF vectorizer.
It will need a lot of looping; maybe just use min_df, which works well in practice.
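For reference, a rough sketch of that count-and-filter approach, using collections.Counter for the counting and the vocabulary parameter of TfidfVectorizer (df and the original_post column are taken from the question; the rest is illustrative):
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

texts = df['original_post'].values.astype('U')
analyzer = TfidfVectorizer().build_analyzer()  # same tokenization the vectorizer would use

# count total occurrences of every token across the corpus
counts = Counter()
for doc in texts:
    counts.update(analyzer(doc))

# keep only words that occur more than once overall
kept_vocabulary = [word for word, count in counts.items() if count > 1]

tfidf = TfidfVectorizer(vocabulary=kept_vocabulary)
tfs = tfidf.fit_transform(texts)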

Gensim word2vec augment or merge pre-trained vectors

I am loading pre-trained vectors from a binary file generated from the word2vec C code with something like:
model_1 = Word2Vec.load_word2vec_format('vectors.bin', binary=True)
I am using those vectors to generate vector representations of sentences that contain words that may not have already existing vectors in vectors.bin. For example, if vectors.bin has no associated vector for the word "yogurt", and I try
yogurt_vector = model_1['yogurt']
I get KeyError: 'yogurt', which makes good sense. What I want is to be able to take the sentence words that do not have corresponding vectors and add representations for them to model_1. I am aware from this post that you cannot continue to train the C vectors. Is there then a way to train a new model, say model_2, for the words with no vectors and merge model_2 with model_1?
Alternatively, is there a way to test if the model contains a word before I actually try to retrieve it, so that I can at least avoid the KeyError?
Avoiding the key error is easy:
[x for x in 'this model hus everything'.split() if x in model_1.vocab]
The more difficult problem is merging a new word into an existing model. The problem is that word2vec calculates the likelihood of 2 words being next to each other, and if the word 'yogurt' wasn't in the first corpus the model was trained on, it's not next to any of those words, so the second model would not correlate with the first.
You can look at the internals when a model is saved (it uses numpy.save), and I would be interested in working with you to come up with code to allow adding vocabulary.
This is a great question, and unfortunately there is no way to add to the vocabulary without changing the internals of the code. Check out this discussion: https://groups.google.com/forum/#!searchin/word2vec-toolkit/online$20word2vec/word2vec-toolkit/L9zoczopPUQ/_Zmy57TzxUQJ
My advice is to ignore words that are not in the vocabulary, and only use the ones that are in the vocabulary. If you are using python, you can do this by:
present = []
absent = []
for word in wordlist:
    if word in model.vocab:
        present.append(word)
    else:
        # these are all the words that are absent from your model;
        # might be useful for debugging, ignore if you don't need this info
        absent.append(word)
# <Do whatever you want with the words in the list 'present'>
A possible alternative for handling absent/missing words is suggested by Yoon Kim in "Convolutional Neural Networks for Sentence Classification".
Its code: https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py#L88
import numpy as np

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.
    0.25 is chosen so the unknown vectors have (approximately) the same variance as the pre-trained ones.
    """
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)
But this only works because you use the model as a lookup table for the corresponding vectors; functionality like similarity queries is lost for these randomly initialized words.
You can continue adding new words/sentences to a model's vocabulary and train the augmented model with gensim's online training algorithm (https://rutumulkar.com/blog/2015/word2vec/), see
https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html
model = gensim.models.Word2Vec.load(temporary_filepath)
more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences'],
]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)
Related:
Update gensim word2vec model
Is it possible to re-train a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in python?

Tag generation from text content

I am curious whether an algorithm/method exists to generate keywords/tags from a given text, using some weight calculations, occurrence ratio, or other tools.
Additionally, I will be grateful if you point any Python based solution / library for this.
Thanks
One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say in a larger collection of documents the term 'Markov' is almost never seen. However, in a particular document from the same collection Markov shows up very frequently. This would suggest that Markov might be a good keyword or tag to associate with the document.
To identify keywords like this, you could use the point-wise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]. This will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection.
To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
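A rough sketch of that scoring, where documents is a hypothetical list of token lists and the probabilities are estimated from raw counts (other estimation choices are possible):
import math
from collections import Counter

documents = [['markov', 'chains', 'model', 'state', 'transitions', 'markov'],
             ['cats', 'and', 'dogs', 'are', 'pets']]  # placeholder corpus

corpus_counts = Counter(w for doc in documents for w in doc)  # term counts over the whole collection
total_tokens = sum(corpus_counts.values())

def top_keywords(doc, n=5):
    doc_counts = Counter(doc)
    p_doc = len(doc) / total_tokens
    scores = {}
    for term, count in doc_counts.items():
        p_term = corpus_counts[term] / total_tokens
        p_term_doc = count / total_tokens                       # joint probability of term and document
        scores[term] = math.log(p_term_doc / (p_term * p_doc))  # PMI(term, doc)
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(top_keywords(documents[0]))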
If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.
Borrowing from my answer to that question, the NLTK collocations how-to covers how to extract interesting multiword expressions using n-gram PMI in about 7 lines of code, e.g.:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
# change this to read in your data
finder = BigramCollocationFinder.from_words(
nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)
First, the key python library for computational linguistics is NLTK ("Natural Language Toolkit"). This is a stable, mature library created and maintained by professional computational linguists. It also has an extensive collection of tutorials, FAQs, etc. I recommend it highly.
Below is a simple template, in python code, for the problem raised in your Question; although it's a template it runs--supply any text as a string (as I've done) and it will return a list of word frequencies as well as a ranked list of those words in order of 'importance' (or suitability as keywords) according to a very simple heuristic.
Keywords for a given document are (obviously) chosen from among important words in a document--ie, those words that are likely to distinguish it from another document. If you had no a priori knowledge of the text's subject matter, a common technique is to infer the importance or weight of a given word/term from its frequency, or importance = 1/frequency.
text = """ The intensity of the feeling makes up for the disproportion of the objects. Things are equal to the imagination, which have the power of affecting the mind with an equal degree of terror, admiration, delight, or love. When Lear calls upon the heavens to avenge his cause, "for they are old like him," there is nothing extravagant or impious in this sublime identification of his age with theirs; for there is no other image which could do justice to the agonising sense of his wrongs and his despair! """
BAD_CHARS = ".!?,\'\""
# transform text into a list words--removing punctuation and filtering small words
words = [ word.strip(BAD_CHARS) for word in text.strip().split() if len(word) > 4 ]
word_freq = {}
# generate a 'word histogram' for the text--ie, a list of the frequencies of each word
for word in words :
    word_freq[word] = word_freq.get(word, 0) + 1
# sort the word list by frequency
# (just a DSU sort; there's a python built-in for this, but I can't remember it)
tx = [ (v, k) for (k, v) in word_freq.items()]
tx.sort(reverse=True)
word_freq_sorted = [ (k, v) for (v, k) in tx ]
# eg, what are the most common words in that text?
print(word_freq_sorted)
# returns: [('which', 4), ('other', 4), ('like', 4), ('what', 3), ('upon', 3)]
# obviously using a text larger than 50 or so words will give you more meaningful results
term_importance = lambda word : 1.0/word_freq[word]
# select document keywords from the words at/near the top of this list:
importance_scores = { word: term_importance(word) for word in word_freq }
Latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) tries to represent each document in a training corpus as a mixture of topics, which in turn are distributions mapping words to probabilities.
I had used it once to dissect a corpus of product reviews into the latent ideas that were being spoken about across all the documents, such as 'customer service', 'product usability', etc. The basic model does not advocate a way to convert the topic models into a single word describing what a topic is about, but people have come up with all kinds of heuristics to do that once their model is trained.
I recommend you try playing with http://mallet.cs.umass.edu/ and seeing if this model fits your needs.
LDA is a completely unsupervised algorithm, meaning it doesn't require you to hand-annotate anything, which is great; but on the flip side, it might not deliver the topics you were expecting.
A very simple solution to the problem would be:
count the occurrences of each word in the text
consider the most frequent terms as the key phrases
have a black-list of 'stop words' to remove common words like the, and, it, is, etc.
I'm sure there are cleverer, stats based solutions though.
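For illustration, a minimal sketch of that count-and-blacklist approach using collections.Counter (the stop-word set and the sample text are placeholders):
from collections import Counter

STOP_WORDS = {'the', 'and', 'it', 'is', 'a', 'of', 'to', 'in', 'on'}  # extend as needed

def simple_tags(text, n=5):
    words = [w.strip('.,!?"\'').lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    # the most frequent remaining words become the key phrases
    return [word for word, count in Counter(words).most_common(n)]

print(simple_tags("The cat sat on the mat and the cat slept beside the mat."))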
If you need a solution to use in a larger project rather than for interests sake, Yahoo BOSS has a key term extraction method.
Latent Dirichlet allocation or Hierarchical Dirichlet Process can be used to generate tags for individual texts within a greater corpus (body of texts) by extracting the most important words from the derived topics.
A basic example would be if we were to run LDA over a corpus and define it to have two topics, and that we find further that a text in the corpus is 70% one topic, and 30% another. The top 70% of the words that define the first topic and 30% that define the second (without duplication) could then be considered as tags for the given text. This method provides strong results where tags generally represent the broader themes of the given texts.
A general reference for the preprocessing needed for this code can be found here; given that, we can find tags through the following process using gensim.
A heuristic way of deriving the optimal number of topics for LDA is found in this answer. Although HDP does not require the number of topics as an input, the standard in such cases is still to use LDA with a derived topic number, as HDP can be problematic. Assume here that the corpus is found to have 10 topics, and we want 5 tags per text:
from gensim.models import LdaModel, HdpModel
from gensim import corpora
num_topics = 10
num_tags = 5
Assume further that we have a variable corpus, which is a preprocessed list of lists, with the sublist entries being word tokens. Initialize a Dirichlet dictionary and create a bag of words, where texts are converted to the indexes of their component tokens (words):
dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
Create an LDA or HDP model:
dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')

# dirichlet_model = HdpModel(corpus=bow_corpus,
#                            id2word=dirichlet_dict,
#                            chunksize=len(bow_corpus))
The following code produces ordered lists for the most important words per topic (note that here is where num_tags defines the desired tags per text):
shown_topics = dirichlet_model.show_topics(num_topics=num_topics,
                                           num_words=num_tags,
                                           formatted=False)
model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
Then find the coherence of the topics across the texts:
topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0
topics_per_text = [text for text in topic_corpus]
From here we have the percentage that each text coheres to a given topic, and the words associated with each topic, so we can combine them for tags with the following:
ignore_words = []  # words that should not be included as tags

corpus_tags = []
for i in range(len(bow_corpus)):
    # The complexity here is to make sure that it works with HDP
    significant_topics = list(set([t[0] for t in topics_per_text[i]]))
    topic_indexes_by_coherence = [tup[0] for tup in sorted(enumerate(topics_per_text[i]), key=lambda x: x[1][1], reverse=True)]
    significant_topics_by_coherence = [significant_topics[j] for j in topic_indexes_by_coherence]
    ordered_topics = [model_topics[j] for j in significant_topics_by_coherence][:num_topics]  # subset for HDP
    ordered_topic_coherences = [topics_per_text[i][j][1] for j in topic_indexes_by_coherence][:num_topics]  # subset for HDP

    text_tags = []
    for t in range(len(ordered_topics)):
        # Find the number of indexes to select, which can later be extended if the word has already been selected
        selection_indexes = list(range(int(round(num_tags * ordered_topic_coherences[t]))))
        if selection_indexes == [] and len(text_tags) < num_tags:
            # Fix potential rounding error by giving this topic one selection
            selection_indexes = [0]

        for s_i in selection_indexes:
            if ordered_topics[t][s_i] not in text_tags and ordered_topics[t][s_i] not in ignore_words:
                text_tags.append(ordered_topics[t][s_i])
            else:
                selection_indexes.append(selection_indexes[-1] + 1)

    # Fix for if too many were selected
    text_tags = text_tags[:num_tags]
    corpus_tags.append(text_tags)
corpus_tags will be a list of tags for each text based on how coherent the text is to the derived topics.
See this answer for a similar version of this that generates tags for a whole text corpus.
