Python Gensim: how to calculate document similarity using the LDA model? - python

I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on.
After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks!

Depends what similarity metric you want to use.
Cosine similarity is universally useful & built-in:
sim = gensim.matutils.cossim(lda_vec1, lda_vec2)
Hellinger distance is useful for similarity between probability distributions (such as LDA topics):
import gensim
import numpy as np
dense1 = gensim.matutils.sparse2full(lda_vec1, lda.num_topics)
dense2 = gensim.matutils.sparse2full(lda_vec2, lda.num_topics)
# Hellinger is a distance: 0 means identical distributions, 1 means no overlap
dist = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())
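To make explicit what the two measures compute, here is a minimal pure-Python sketch that works directly on sparse (topic_id, probability) lists of the kind an LDA model returns. The helper names (to_dense, cosine_similarity, hellinger_distance) and the toy vectors are my own and are not part of gensim. Note that cosine is a similarity (higher means closer) while Hellinger is a distance (lower means closer).

```python
import math

def to_dense(sparse_vec, num_topics):
    """Expand [(topic_id, prob), ...] into a list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vec:
        dense[topic_id] = prob
    return dense

def cosine_similarity(vec1, vec2, num_topics):
    """Cosine of the angle between two sparse topic vectors."""
    a, b = to_dense(vec1, num_topics), to_dense(vec2, num_topics)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hellinger_distance(vec1, vec2, num_topics):
    """Hellinger distance between two sparse probability distributions."""
    a, b = to_dense(vec1, num_topics), to_dense(vec2, num_topics)
    return math.sqrt(0.5 * sum((math.sqrt(x) - math.sqrt(y)) ** 2
                               for x, y in zip(a, b)))

# Made-up topic distributions over 3 topics (illustrative values only)
lda_vec1 = [(0, 0.7), (1, 0.2), (2, 0.1)]
lda_vec2 = [(0, 0.1), (1, 0.2), (2, 0.7)]

print(cosine_similarity(lda_vec1, lda_vec2, 3))
print(hellinger_distance(lda_vec1, lda_vec2, 3))
print(hellinger_distance(lda_vec1, lda_vec1, 3))  # identical distributions give 0.0
```

If you need a similarity rather than a distance, 1 minus the Hellinger distance is one simple option, since the distance is bounded between 0 and 1.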

I don't know whether this will help, but I got good results for document matching and similarity by using the actual document as the query.
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("model.lda")  # result from running online LDA (training)
index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")
docname = "docs/the_doc.txt"
with open(docname, 'r') as f:
    doc = f.read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
The similarity between the query document and each document in the corpus is the second element of each (document index, score) tuple in sims.
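To show what the sorted list looks like and how to read it, here is a small sketch with made-up scores standing in for the output of the similarity index. The values are purely illustrative.

```python
# Hypothetical similarity scores, one per corpus document (made-up values)
sims = [0.12, 0.98, 0.45, 0.77]

# Pair each score with its document position, then sort by score, highest first
ranked = sorted(enumerate(sims), key=lambda item: -item[1])

for doc_index, score in ranked:
    print(doc_index, score)

# The first tuple is the corpus document most similar to the query
best_index, best_score = ranked[0]
```

The first element of each tuple is the position of the document in the corpus, so it can be used to look the document up again.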

The answers provided are good, but they aren't very beginner-friendly. I want to start from training the LDA model and then calculate cosine similarity.
Training model part:
docs = ["latent Dirichlet allocation (LDA) is a generative statistical model",
"each document is a mixture of a small number of topics",
"each document may be viewed as a mixture of various topics"]
# Convert document to tokens
docs = [doc.split() for doc in docs]
# A mapping from token to id in each document
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)
# Representing the corpus as a bag of words
corpus = [dictionary.doc2bow(doc) for doc in docs]
# Training the model
from gensim.models import LdaModel
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)
To extract the probability assigned to each topic for a document, there are generally two ways. I show both here:
# Preprocess the test documents the same way as the training documents
test_doc = ["LDA is an example of a topic model",
"topic modelling refers to the task of identifying topics"]
test_doc = [doc.split() for doc in test_doc]
test_corpus = [dictionary.doc2bow(doc) for doc in test_doc]
# Method 1
from gensim.matutils import cossim
doc1 = model.get_document_topics(test_corpus[0], minimum_probability=0)
doc2 = model.get_document_topics(test_corpus[1], minimum_probability=0)
print(cossim(doc1, doc2))
# Method 2
doc1 = model[test_corpus[0]]
doc2 = model[test_corpus[1]]
print(cossim(doc1, doc2))
output:
#Method 1
0.8279631530869963
#Method 2
0.828066885140262
As you can see, the two methods give nearly the same result. The difference is that the probabilities returned by the 2nd method sometimes don't add up to one, as discussed here.
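As far as I can tell, the reason is that model[bow] applies the model's own minimum_probability threshold (0.01 by default), so low-probability topics are dropped from the result, while Method 1 passes minimum_probability=0 and keeps them. The sketch below only imitates that filtering step on a made-up distribution. The filter_topics helper is hypothetical and is not gensim code.

```python
def filter_topics(topic_probs, minimum_probability):
    """Keep only (topic_id, prob) pairs at or above the threshold."""
    return [(topic_id, prob)
            for topic_id, prob in enumerate(topic_probs)
            if prob >= minimum_probability]

# Made-up distribution over 10 topics that sums to 1
full = [0.905, 0.05, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.01]

kept_all = filter_topics(full, 0.0)    # like Method 1: every topic is kept
kept_some = filter_topics(full, 0.01)  # like Method 2 with the default threshold

print(len(kept_all), sum(prob for _, prob in kept_all))
print(len(kept_some), sum(prob for _, prob in kept_some))
```

With the threshold applied, only three topics survive and their probabilities sum to roughly 0.965 instead of 1.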
For a large corpus, the probability vectors can be obtained by passing the whole corpus:
#Method 1
possibility_vector = model.get_document_topics(test_corpus, minimum_probability=0)
#Method 2
possibility_vector = model[test_corpus]
NOTE: The sum of the probabilities assigned to the topics of a document may be slightly higher or slightly lower than 1. That is because of floating-point rounding errors.

Related

Different cosine similarity coefficients from Doc2Vec and Word2Vec

BACKGROUND
At the beginning of my project, the focus was on comparing the requests/questions received by how their content differs. I trained a Doc2Vec model and the results were pretty good (for reference, my data included 14 million requests).
import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models.phrases import Phrases, Phraser
class PhrasingIterable():
    def __init__(self, my_phraser, texts):
        self.my_phraser = my_phraser
        self.texts = texts
    def __iter__(self):
        return iter(self.my_phraser[self.texts])
docs = DocumentIterator()  # my own document iterable (definition not shown)
bigram_transformer = Phrases(docs, min_count=1, threshold=10)
bigram = Phraser(bigram_transformer)
corpus = PhrasingIterable(bigram, docs)
sentences = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]
model = Doc2Vec(window=5,
                vector_size=300,
                min_count=10,
                workers=multiprocessing.cpu_count(),
                epochs=10,
                compute_loss=True)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
However, in a second stage, the focus of the analysis shifted from requests to individuals per week. To measure how an individual's requests differ from week to week, I extracted all words from requests in a given week t and compared them with all words from requests in the previous week t-1 using d2v_model.wv.n_similarity. Since I need to replicate this in other areas, it occurred to me that I was wasting too much memory and time training Doc2Vec models when I could use Word2Vec to get the same measure. Thus, I trained the following Word2Vec model:
docs = DocumentIterator()
bigram_transformer = gensim.models.Phrases(docs, min_count=1, threshold=10)
bigram = gensim.models.phrases.Phraser(bigram_transformer)
sentences = PhrasingIterable(bigram, docs)
model = Word2Vec(window=5,
                 size=300,  # gensim 3.x name; gensim 4 renamed it to vector_size
                 min_count=10,
                 workers=multiprocessing.cpu_count(),
                 iter=10,  # gensim 3.x name; gensim 4 renamed it to epochs
                 compute_loss=True)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
I again used cosine similarity to compare the content from week to week, via w2v_model.wv.n_similarity. As a sanity check, I compared the similarities generated by Word2Vec and Doc2Vec: the correlation coefficient between them is around 0.70, and the scales differ a lot. My implicit assumption was that comparing sets of extracted words using d2v_model.wv.n_similarity takes advantage of the Word2Vec model within the trained Doc2Vec.
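For context, my understanding is that n_similarity averages the word vectors of each word set and returns the cosine between the two mean vectors. It therefore uses only the word vectors of whichever model it is called on, so two models with different word vectors can give different scores for the same word sets. A minimal pure-Python sketch of that computation, with a made-up two-dimensional vocabulary (toy_vectors and n_similarity_sketch are illustrative names, not gensim API):

```python
import math

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def n_similarity_sketch(word_vectors, words1, words2):
    """Cosine between the mean word vectors of two word sets."""
    v1 = mean_vector([word_vectors[w] for w in words1])
    v2 = mean_vector([word_vectors[w] for w in words2])
    return cosine(v1, v2)

# Made-up 2-dimensional word vectors (illustrative values only)
toy_vectors = {
    "refund": [1.0, 0.0],
    "invoice": [0.8, 0.2],
    "login": [0.0, 1.0],
    "password": [0.1, 0.9],
}

week_t = ["refund", "invoice"]
week_t_minus_1 = ["login", "password"]
print(n_similarity_sketch(toy_vectors, week_t, week_t_minus_1))
```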
MY QUESTION
Should cosine similarity measures between two sets of extracted words differ as we move from Doc2Vec to Word2Vec? If so, why? If not, any suggestions on my code?

How to get the nearest documents for a word in gensim in python

I am using the doc2vec model as follows to construct my document vectors.
import json
from gensim.models import doc2vec
from collections import namedtuple
dataset = json.load(open(input_file))
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))
model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 10, min_count = 1, workers = 4, epochs = 20)
I have seen that gensim doc2vec also includes word vectors. Suppose I have a word vector for the term "deep learning". My question is: is it possible to get the documents nearest to the "deep learning" word vector in gensim?
I am happy to provide more details if needed.
Some Doc2Vec modes will co-train doc-vectors and word-vectors in the "same space". Then, if you have a word-vector for 'deep_learning', you can ask for documents near that vector, and the results may be useful for you. For example:
similar_docs = d2v_model.docvecs.most_similar(
    positive=[d2v_model.wv['deep_learning']]
)
But:
That's only going to be as good as your model has learned 'deep_learning', as a word, to mean what you think it means.
A training set of known-good documents fitting the category 'deep_learning' (and other categories) could be better, whether you hand-curate those or try to bootstrap from other sources (like, say, the Wikipedia category 'Deep Learning' or other curated/search-result sets that you trust).
Reducing a category to a single summary point (one vector) may not be as good as having a variety of examples (many points) that all fit the category. (Relevant docs may not form a neat sphere around a summary point, but rather populate exotically-shaped regions of the high-dimensional doc-vector space.) If you have a lot of good examples of each category, you could train a classifier to then label, or rank-in-relation-to-trained-categories, any further uncategorized docs.
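Conceptually, the word-vector lookup above amounts to ranking every doc-vector by its cosine similarity to the query vector. A rough pure-Python sketch of that idea, with made-up two-dimensional vectors (most_similar_docs and the tags are hypothetical, not the gensim implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar_docs(query_vec, doc_vectors, topn=2):
    """Return the topn (tag, score) pairs closest to query_vec."""
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scored, key=lambda item: -item[1])[:topn]

# Made-up doc-vectors and a made-up word vector in the same space
doc_vectors = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.1, 0.9],
    "doc_c": [0.7, 0.3],
}
word_vec = [1.0, 0.0]

print(most_similar_docs(word_vec, doc_vectors))
```

This also illustrates the caveat about a single summary point: every document is scored against one vector only, so the ranking is only as good as that vector.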

How to get document vectors for a given topic in gensim

I have about 9000 documents and I am using Gensim's doc2vec to embed my documents. My code is as follows:
import json
from gensim.models import doc2vec
from collections import namedtuple
dataset = json.load(open(input_file))
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))
model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 10, min_count = 1, workers = 4, epochs = 20)
I would like to get all the documents related to the topic "deep learning", i.e. the documents whose content is mainly about deep learning. Is it possible to do this with a doc2vec model in gensim?
I am happy to provide more details if needed.
If there was a document in your training set that was a great example of "deep learning" – say, docs[17] – then after successful training you could ask for documents similar to that example document, and that could be roughly what you'd need. For example:
sims = model.docvecs.most_similar(docs[17].tags[0])
You'd then have in sims a ranked, scored list of the 10 most-similar documents to the tag for the target document.

How to classify new documents with tf-idf?

If I use the TfidfVectorizer from sklearn to generate feature vectors as:
features = TfidfVectorizer(min_df=0.2, ngram_range=(1,3)).fit_transform(myDocuments)
How would I then generate feature vectors to classify a new document, since you can't calculate the tf-idf for a single document?
Would it be a correct approach to extract the feature names with:
feature_names = TfidfVectorizer.get_feature_names()
and then count the term frequency for the new document according to the feature_names?
But then I won't get the weights that carry the information about a word's importance.
You need to keep the fitted TfidfVectorizer instance; it remembers the vocabulary and idf weights learned when it was fit. It may make things clearer if, rather than using fit_transform, you use fit and transform separately:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(min_df=0.2, ngram_range=(1,3))
vec.fit(myDocuments)
features = vec.transform(myDocuments)
new_features = vec.transform(myNewDocuments)
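To see why a single new document can still be weighted, here is a toy fit/transform sketch in which the idf weights are learned once from the training documents and then reused. TinyTfidf is a hypothetical stand-in written for illustration, not sklearn's implementation: it uses the smoothed idf formula ln((1 + n) / (1 + df)) + 1, splits on whitespace, and skips the row normalisation that sklearn applies by default.

```python
import math
from collections import Counter

class TinyTfidf:
    """Toy fit/transform vectorizer (illustrative only)."""

    def fit(self, docs):
        tokenized = [doc.lower().split() for doc in docs]
        n_docs = len(tokenized)
        # Document frequency: in how many training documents each term occurs
        doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
        self.vocabulary_ = {term: i for i, term in enumerate(sorted(doc_freq))}
        # Smoothed idf, learned once from the training documents
        self.idf_ = {term: math.log((1 + n_docs) / (1 + df)) + 1
                     for term, df in doc_freq.items()}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            # Terms that were not seen during fit are ignored
            counts = Counter(t for t in doc.lower().split() if t in self.vocabulary_)
            row = [0.0] * len(self.vocabulary_)
            for term, tf in counts.items():
                row[self.vocabulary_[term]] = tf * self.idf_[term]
            rows.append(row)
        return rows

my_documents = ["the cat sat", "the dog sat", "the cat ran"]
vec = TinyTfidf().fit(my_documents)

# A single new document reuses the stored vocabulary and idf weights
new_features = vec.transform(["the cat flew"])
print(sorted(vec.vocabulary_))
print(new_features[0])
```

The new document gets a vector of the same length as the training vectors, and the unseen word "flew" simply contributes nothing.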
I would rather use gensim, with Latent Semantic Indexing as a wrapper over the original corpus: bow->tfidf->lsi
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
Then if you need to continue the training:
new_tfidf = models.TfidfModel(corpus)
new_corpus_tfidf = new_tfidf[corpus]
lsi.add_documents(another_tfidf_corpus) # now LSI has been trained on corpus_tfidf + another_tfidf_corpus
lsi_vec = lsi[tfidf_vec] # convert some new document into the LSI space
Here, corpus is a bag-of-words corpus.
As you can read in their tutorials:
LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!
If you like scikit-learn, gensim is also compatible with numpy.

LDA model generates different topics everytime i train on the same corpus

I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics.
Why do the same LDA parameters and corpus generate different topics every time?
And how do I stabilize the topic generation?
I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj) and here's my code:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from collections import defaultdict
import codecs, os, glob, math
stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i[0] != "#" and i != ""]
def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]
    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)
    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))
    return lda, corpus_lda, top_clusters, top_wordonly
#######################################################################
# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
lemma = line.split("\t")[3]
documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
         for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)
for i in topic_wordonly:
    print(i)
Why do the same LDA parameters and corpus generate different topics every time?
Because LDA uses randomness in both training and inference steps.
And how do I stabilize the topic generation?
By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, with numpy.random.seed:
import numpy as np
SOME_FIXED_SEED = 42
# before training/inference:
np.random.seed(SOME_FIXED_SEED)
(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
Set the random_state parameter when initializing LdaModel():
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=1,
                                            passes=num_passes,
                                            alpha='auto')
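As a toy illustration of why fixing the seed stabilises the output: any procedure that starts from random values gives the same starting point, and hence the same result, whenever the generator is seeded identically. The random_topic_init function below is a made-up stand-in for a random initialisation and is not gensim's code.

```python
import random

def random_topic_init(num_topics, vocab_size, seed=None):
    """Toy random initialisation of a topic-word matrix (not gensim's code)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(vocab_size)] for _ in range(num_topics)]

run1 = random_topic_init(2, 3, seed=1)
run2 = random_topic_init(2, 3, seed=1)
run3 = random_topic_init(2, 3, seed=2)

print(run1 == run2)  # same seed, same starting point
print(run1 == run3)  # different seed, different starting point
```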
I had the same problem, even with about 50,000 comments. But you can get much more consistent topics by increasing the number of iterations the LDA runs for. It defaults to 50, and when I raise it to 300, it usually gives me the same results, probably because it is much closer to convergence.
Specifically, you just add the following option:
ldamodel.LdaModel(corpus, ..., iterations = <your desired iterations>)
This is due to the probabilistic nature of LDA, as noted by others. However, I don't believe setting the random_state argument to a fixed number is the proper solution.
Definitely try increasing the number of iterations first to make sure your algorithm is converging. Even then, each starting point may land you on a different local minimum. So you can run LDA multiple times without setting random_state, and then compare the results using the coherence score of each model. This helps you avoid suboptimal local minima.
Gensim's CoherenceModel already has the most common coherence metrics implemented for you, such as c_v, u_mass, and c_npmi.
You may find that these steps make the results more stable, but they won't actually guarantee the same results from run to run. However, IMO it's better to get as close to the global optimum as possible than to be stuck on the same local minimum because of a fixed random_state.
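The "train several times and keep the most coherent model" idea can be sketched as a small selection loop. This is only an outline under assumptions: best_of_runs, toy_train and toy_score are hypothetical names, the toy functions exist only so the sketch runs without gensim, and it assumes a higher score means better coherence. Unlike the suggestion above, it passes an explicit seed to each run so that the winning run can be reproduced later.

```python
def best_of_runs(train_fn, score_fn, seeds):
    """Train once per seed and return (best_score, best_seed, best_model)."""
    best = None
    for seed in seeds:
        model = train_fn(seed)
        score = score_fn(model)
        if best is None or score > best[0]:
            best = (score, seed, model)
    return best

# Toy stand-ins so the sketch runs without gensim. With gensim, train_fn could
# wrap LdaModel(..., random_state=seed) and score_fn could wrap
# CoherenceModel(model=..., texts=..., dictionary=..., coherence='c_v').get_coherence()
def toy_train(seed):
    return {"seed": seed}

def toy_score(model):
    # Pretends that seed 3 yields the most coherent topics
    return -abs(model["seed"] - 3)

score, seed, model = best_of_runs(toy_train, toy_score, seeds=[1, 2, 3, 4])
print(seed, score)
```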
