Different cosine similarity coefficients from Doc2Vec and Word2Vec - python

BACKGROUND
At the beginning of my project, the focus was on comparing the requests/questions we received in terms of content. I trained a Doc2Vec model and the results were pretty good (for reference, my data included 14 million requests):
class PhrasingIterable():
    def __init__(self, my_phraser, texts):
        self.my_phraser = my_phraser
        self.texts = texts
    def __iter__(self):
        return iter(self.my_phraser[self.texts])

docs = DocumentIterator()
bigram_transformer = Phrases(docs, min_count=1, threshold=10)
bigram = Phraser(bigram_transformer)
corpus = PhrasingIterable(bigram, docs)
sentences = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]

model = Doc2Vec(window=5,
                vector_size=300,
                min_count=10,
                workers=multiprocessing.cpu_count(),
                epochs=10,
                compute_loss=True)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
However, in a second stage, the focus of analysis shifted from requests to individuals per week. To measure how an individual's requests differ from week to week, I extracted all words from requests in a given week t and compared them with all words from requests in the previous week t-1 using d2v_model.wv.n_similarity. Since I need to replicate this in other areas, it occurred to me that I was wasting too much memory and time training Doc2Vec models when I could use Word2Vec to get the same measure. Thus, I trained the following Word2Vec model:
docs = DocumentIterator()
bigram_transformer = gensim.models.Phrases(docs, min_count=1, threshold=10)
bigram = gensim.models.phrases.Phraser(bigram_transformer)
sentences = PhrasingIterable(bigram, docs)

model = Word2Vec(window=5,
                 size=300,
                 min_count=10,
                 workers=multiprocessing.cpu_count(),
                 iter=10,
                 compute_loss=True)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
I again used cosine similarity (w2v_model.wv.n_similarity) to compare the content from week to week. As a sanity check, I compared the similarities generated by Word2Vec and Doc2Vec: the correlation coefficient between them is around 0.70, and the scales differ a lot. My implicit assumption was that comparing sets of extracted words with d2v_model.wv.n_similarity was simply using the Word2Vec word-vectors inside the trained Doc2Vec model.
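For concreteness, here is a minimal sketch of that week-to-week comparison, where d2v_model and w2v_model stand for the two trained models above and the word lists are made-up placeholders:

# all words must be in each model's vocabulary, otherwise n_similarity raises a KeyError
words_week_t = ['invoice', 'refund', 'delay']        # words extracted from requests in week t
words_week_prev = ['invoice', 'shipping', 'delay']   # words extracted from requests in week t-1

# both models expose the same KeyedVectors API; n_similarity returns the cosine
# similarity between the mean vectors of the two word sets
sim_d2v = d2v_model.wv.n_similarity(words_week_t, words_week_prev)
sim_w2v = w2v_model.wv.n_similarity(words_week_t, words_week_prev)
print(sim_d2v, sim_w2v)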
MY QUESTION
Should cosine similarity measures between two sets of extracted words differ when we switch from Doc2Vec to Word2Vec? If so, why? If not, any suggestions on my code?

Related

LDA topic modeling changes results every time I run the code [duplicate]

I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process it generates different topics.
Why do the same LDA parameters and corpus generate different topics every time?
And how do I stabilize the topic generation?
I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj), and here's my code:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import defaultdict
import codecs, os, glob, math

stopwords = [i.strip() for i in codecs.open('stopmild', 'r', 'utf8').readlines()
             if i[0] != "#" and i != ""]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]
    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)
    # Generate word-only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))
    return lda, corpus_lda, top_clusters, top_wordonly

#######################################################################
# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all", "r", "utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
         for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)
for i in topic_wordonly:
    print i
Why do the same LDA parameters and corpus generate different topics every time?
Because LDA uses randomness in both training and inference steps.
And how do I stabilize the topic generation?
By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, with numpy.random.seed:
SOME_FIXED_SEED = 42
# before training/inference:
np.random.seed(SOME_FIXED_SEED)
(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
Set the random_state parameter when initializing LdaModel().
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=1,
                                            passes=num_passes,
                                            alpha='auto')
I had the same problem, even with about 50,000 comments. But you can get much more consistent topics by increasing the number of iterations the LDA runs for. It defaults to 50, and when I raise it to 300 it usually gives me the same results, probably because it is much closer to convergence.
Specifically, you just add the following option:
ldamodel.LdaModel(corpus, ..., iterations=<your desired iterations>)
This is due to the probabilistic nature of LDA, as noted by others. However, I don't believe setting the random_state argument to a fixed number is the proper solution.
Definitely try increasing the number of iterations first to make sure your algorithm is converging. Even then, each starting point may land you on a different local minimum. So you can run LDA multiple times without setting random_state, and then compare the results using the coherence score of each model. This helps you avoid suboptimal local minima.
Gensim's CoherenceModel already has the most common coherence metrics implemented for you, such as c_v, u_mass, and c_npmi.
You might find these make the results more stable, but they won't actually guarantee the same results from run to run. However, in my opinion it's better to get as close to the global optimum as possible instead of being stuck in the same local minimum because of a fixed random_state.
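As a rough sketch of that idea (assuming the corpus, dictionary, and tokenized texts from the question above are already built, and using illustrative parameter values):

from gensim.models import LdaModel, CoherenceModel

best_model, best_score = None, float('-inf')
for run in range(5):
    # train a fresh model per run; no fixed random_state
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, iterations=300)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
    score = cm.get_coherence()
    if score > best_score:
        best_model, best_score = lda, score

print('best c_v coherence:', best_score)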

Doc2Vec: find the most similar sentence

I am trying to find similar sentences using doc2vec. What I am not able to get is the actual sentence that matches from the trained sentences.
Below is the code from this article:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]

max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model = Doc2Vec.load("d2v.model")

# to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)

# to find the most similar docs using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)

# to find the vector of a doc in training data using tags,
# i.e. the vector of the document at index 1 in the training data
print(model.docvecs['1'])
But the above code only gives me vectors or numbers. How can I get the actual sentence matched from the training data? E.g., in this case I am expecting the result "I love building chatbots".
The output of similar_doc is: [('2', 0.991769552230835), ('0', 0.989276111125946), ('3', 0.9854298830032349)]
This shows the similarity score of each document in the data against the requested document, sorted in descending order.
Based on this, index '2' in the data is the closest to the requested data, i.e. test_data.
print(data[int(similar_doc[0][0])])
# prints: I love building chatbots
Note: this code gives different results every time; you may need a better model or more training data.
Doc2Vec isn't going to give good results on toy-sized datasets, so you shouldn't expect anything meaningful until using much more data.
But also, a Doc2Vec model doesn't retain within itself the full texts you supply during training. It just remembers the learned vectors for each text's tag – which is usually a unique identifier. So when you get back results from most_similar(), you'll be getting back tag values, which you then need to look-up yourself, in your own code/data, to retrieve full documents.
Separately:
Calling train() multiple times in a loop like you're doing is a bad and error-prone idea, as is managing alpha/min_alpha explicitly. You should not follow any tutorial/guide which recommends that approach.
Instead, leave the alpha parameters at their defaults and call train() once with your desired epochs count – it will do the right number of passes and the right learning-rate management.
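A minimal sketch of that recommended pattern, reusing the data list from the question (parameter names follow gensim 3.x, where size became vector_size):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

tagged_data = [TaggedDocument(words=word_tokenize(d.lower()), tags=[str(i)])
               for i, d in enumerate(data)]

# leave alpha/min_alpha at their defaults and pass the epoch count once
model = Doc2Vec(vector_size=20, min_count=1, dm=1, epochs=100)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)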
To get the actual result you have to pass the inferred vector of your text to the most_similar method. Hard-coding most_similar('1') will always give static results.
similar_doc = model.docvecs.most_similar([v1])
Modified version of your code
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

def output_sentences(most_similar):
    for label, index in [('MOST', 0), ('SECOND-MOST', 1),
                         ('MEDIAN', len(most_similar)//2),
                         ('LEAST', len(most_similar) - 1)]:
        print(u'%s %s: %s\n' % (label, most_similar[index][1], data[int(most_similar[index][0])]))

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]

max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model = Doc2Vec.load("d2v.model")

# to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)

# to find the most similar docs using the inferred vector
similar_doc = model.docvecs.most_similar([v1])
print(similar_doc)

# to print the similar sentences
output_sentences(similar_doc)

# to find the vector of a doc in training data using tags,
# i.e. the vector of the document at index 1 in the training data
print(model.docvecs['1'])
Semantic “Similar Sentences” with your dataset - NLP
If you are looking for accurate predictions on your own, smaller dataset, you can try:
pip install similar-sentences

How to get the nearest documents for a word in gensim in python

I am using the doc2vec model as follows to construct my document vectors.
import json
from gensim.models import doc2vec
from collections import namedtuple

dataset = json.load(open(input_file))

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))

model = doc2vec.Doc2Vec(docs, vector_size=100, window=10, min_count=1, workers=4, epochs=20)
I have seen that gensim doc2vec also includes word vectors. Suppose I have a word vector created for the word deep learning. My question is: is it possible to get the documents nearest to the deep learning word vector in gensim in Python?
I am happy to provide more details if needed.
Some Doc2Vec modes will co-train doc-vectors and word-vectors in the "same space". Then, if you have a word-vector for 'deep_learning', you can ask for documents near that vector, and the results may be useful for you. For example:
similar_docs = d2v_model.docvecs.most_similar(
    positive=[d2v_model.wv['deep_learning']]
)
But:
that will only work as well as your model has learned 'deep_learning' as a word meaning what you think it means
a training set of known-good documents fitting the category 'deep_learning' (and other categories) could work better – whether you hand-curate those or bootstrap them from other sources (say, the Wikipedia category 'Deep Learning' or other curated/search-result sets that you trust)
reducing a category to a single summary point (one vector) may not be as good as having a variety of examples – many points – that all fit the category. (Relevant docs may not form a neat sphere around a summary point, but rather populate exotically-shaped regions of the high-dimensional doc-vector space.) If you have a lot of good examples of each category, you could train a classifier to label, or rank in relation to the trained categories, any further uncategorized docs – as in the rough sketch below.
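A rough sketch of that classifier idea with scikit-learn, where labelled_docs (tokenized documents), labels, and new_doc_tokens are hypothetical placeholders and d2v_model is an already-trained Doc2Vec model:

from sklearn.linear_model import LogisticRegression

# represent each labelled example document as an inferred doc-vector
train_vecs = [d2v_model.infer_vector(tokens) for tokens in labelled_docs]
clf = LogisticRegression(max_iter=1000).fit(train_vecs, labels)

# rank an uncategorized document against the trained categories
new_vec = d2v_model.infer_vector(new_doc_tokens)
print(clf.predict_proba([new_vec]))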

Python Gensim: how to calculate document similarity using the LDA model?

I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on.
After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks!
Depends what similarity metric you want to use.
Cosine similarity is universally useful & built-in:
sim = gensim.matutils.cossim(vec_lda1, vec_lda2)
Hellinger distance is useful for similarity between probability distributions (such as LDA topics):
import numpy as np
dense1 = gensim.matutils.sparse2full(lda_vec1, lda.num_topics)
dense2 = gensim.matutils.sparse2full(lda_vec2, lda.num_topics)
sim = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())
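For reference, recent gensim versions also ship a helper that computes the Hellinger distance directly; I believe gensim.matutils.hellinger accepts the same LDA topic vectors, but verify against your version:

sim = gensim.matutils.hellinger(lda_vec1, lda_vec2)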
I don't know if this will help, but I managed to attain successful results on document matching and similarity when using the actual document as a query.
dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("model.lda") #result from running online lda (training)
index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")
docname = "docs/the_doc.txt"
doc = open(docname, 'r').read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print sims
Each entry of sims pairs a document's index in the corpus with its similarity score to the query document; the score is the second element of each pair.
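For instance, a small illustration of reading that output (variable names as above):

# sims is a list of (document index in the corpus, similarity score) pairs,
# sorted by score in descending order
best_doc_position, best_score = sims[0]
print(best_doc_position, best_score)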
The provided answers are good, but they aren't very beginner-friendly. I want to start from training the LDA model and then calculate cosine similarity.
Training model part:
docs = ["latent Dirichlet allocation (LDA) is a generative statistical model",
        "each document is a mixture of a small number of topics",
        "each document may be viewed as a mixture of various topics"]

# Convert each document to a list of tokens
docs = [doc.split() for doc in docs]

# A mapping from token to id in each document
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

# Represent the corpus as a bag of words
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Train the model
from gensim.models import LdaModel
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)
For extracting the probability assigned to each topic for a document, there are generally two ways; I provide both here:
# Preprocess the test documents in the same way as the training documents
test_doc = ["LDA is an example of a topic model",
            "topic modelling refers to the task of identifying topics"]
test_doc = [doc.split() for doc in test_doc]
test_corpus = [dictionary.doc2bow(doc) for doc in test_doc]

# Method 1
from gensim.matutils import cossim
doc1 = model.get_document_topics(test_corpus[0], minimum_probability=0)
doc2 = model.get_document_topics(test_corpus[1], minimum_probability=0)
print(cossim(doc1, doc2))

# Method 2
doc1 = model[test_corpus[0]]
doc2 = model[test_corpus[1]]
print(cossim(doc1, doc2))
output:
#Method 1
0.8279631530869963
#Method 2
0.828066885140262
As you can see, both methods give generally the same result; the difference is that the probabilities returned by the second method sometimes don't add up to one, as discussed here.
For a large corpus, the topic-probability vectors can be obtained by passing the whole corpus:
# Method 1
probability_vector = model.get_document_topics(test_corpus, minimum_probability=0)
# Method 2
probability_vector = model[test_corpus]
NOTE: The sum of the probabilities assigned to the topics of a document may be slightly above or below 1. That is because of floating-point rounding errors.

