I am trying to find similar sentences using doc2vec. What I am not able to find is the actual sentence that matches from the trained sentences.
Below is the code from this article:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love machine learning. Its awesome.",
"I love coding in python",
"I love building chatbots",
"they chat amagingly well"]
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
alpha=alpha,
min_alpha=0.00025,
min_count=1,
dm =1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
print("Model Saved")
model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)
# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)
# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])
But the above code only gives me vectors of numbers. How can I get the actual sentence matched from the training data? For example, in this case I am expecting the result to be "I love building chatbots".
The output of similar_doc is: [('2', 0.991769552230835), ('0', 0.989276111125946), ('3', 0.9854298830032349)]
This shows the similarity score of each document in the data with the requested document and it is sorted in descending order.
Based on this, index '2' in the data is the closest to the requested data, i.e. test_data.
print(data[int(similar_doc[0][0])])
# prints: I love building chatbots
Note: this code gives different results every time; you may need a better model or more training data.
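If you need more repeatable runs while experimenting, gensim's Doc2Vec accepts a seed parameter; fully deterministic results also require a single worker thread and (because of Python's string hashing) a fixed PYTHONHASHSEED environment variable. A minimal sketch, assuming recent gensim parameter names (vector_size was called size in older versions):
# Reduces run-to-run variation; set PYTHONHASHSEED before starting Python for exact repeatability
model = Doc2Vec(vector_size=vec_size, min_count=1, dm=1,
                epochs=max_epochs, seed=42, workers=1)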
Doc2Vec isn't going to give good results on toy-sized datasets, so you shouldn't expect anything meaningful until using much more data.
But also, a Doc2Vec model doesn't retain within itself the full texts you supply during training. It just remembers the learned vectors for each text's tag – which is usually a unique identifier. So when you get back results from most_similar(), you'll be getting back tag values, which you then need to look-up yourself, in your own code/data, to retrieve full documents.
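For example, a minimal sketch of that look-up, assuming the data list and str(i) tags from your question:
# Map each tag back to its original text
tag_to_text = {str(i): text for i, text in enumerate(data)}
for tag, score in model.docvecs.most_similar('1'):
    print(score, tag_to_text[tag])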
Separately:
Calling train() multiple times in a loop like you're doing is a bad and error-prone idea, as is managing alpha/min_alpha explicitly. You should not follow any tutorial/guide which recommends that approach.
Don't change the defaults for the alpha parameters, and call train() once, with your desired epochs count – and it will do the right number of passes, and right learning-rate management.
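For instance, a minimal sketch of that simpler pattern (parameter names assume a recent gensim; in older versions vector_size was size):
model = Doc2Vec(vector_size=vec_size, min_count=1, dm=1, epochs=max_epochs)
model.build_vocab(tagged_data)
# A single train() call handles all epochs and the learning-rate decay internally
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)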
To get the actual result you have to pass the text as a vector to the most_similar() method. Hard-coding most_similar('1') will always give static results.
similar_doc = model.docvecs.most_similar([v1])
Modified version of your code
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love machine learning. Its awesome.",
"I love coding in python",
"I love building chatbots",
"they chat amagingly well"]
def output_sentences(most_similar):
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(most_similar)//2), ('LEAST', len(most_similar) - 1)]:
        print(u'%s %s: %s\n' % (label, most_similar[index][1], data[int(most_similar[index][0])]))
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
alpha=alpha,
min_alpha=0.00025,
min_count=1,
dm =1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
print("Model Saved")
model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)
# to find most similar doc using tags
similar_doc = model.docvecs.most_similar([v1])
print(similar_doc)
# to print similar sentences
output_sentences(similar_doc)
# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])
Semantic "Similar Sentences" with your dataset (NLP)
If you are looking for accurate predictions and your dataset is small, you can try:
pip install similar-sentences
Related
I am using python gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics.
Why do the same LDA parameters and corpus generate different topics every time?
And how do I stabilize the topic generation?
I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj) and here's my code:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import defaultdict
import codecs, os, glob, math
stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i[0] != "#" and i != ""]
def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]
    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)
    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))
    return lda, corpus_lda, top_clusters, top_wordonly
#######################################################################
# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)
for i in topic_wordonly:
    print i
Why do the same LDA parameters and corpus generate different topics every time?
Because LDA uses randomness in both training and inference steps.
And how do I stabilize the topic generation?
By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, with numpy.random.seed:
SOME_FIXED_SEED = 42
# before training/inference:
np.random.seed(SOME_FIXED_SEED)
(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
Set the random_state parameter when initializing LdaModel().
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
id2word=id2word,
num_topics=num_topics,
random_state=1,
passes=num_passes,
alpha='auto')
I had the same problem, even with about 50,000 comments. But you can get much more consistent topics by increasing the number of iterations the LDA runs for. It defaults to 50, and when I raise it to 300, it usually gives me the same results, probably because it is much closer to convergence.
Specifically, you just add the following option:
ldamodel.LdaModel(corpus, ..., iterations=<your desired iterations>)
This is due to the probabilistic nature of LDA as noted by others. However, I don't believe setting the random_seed argument to a fixed number is the proper solution.
Definitely try increasing the number of iterations first to make sure your algorithm is converging. Even then, each starting point may land you on a different local minimum. So you can run LDA multiple times without setting random_seed, and then compare the results using the coherence score of each model. This helps you avoid suboptimal local minima.
Gensim's CoherenceModel already has the most common coherence metrics implemented for you, such as c_v, u_mass, and c_npmi.
These steps will make the results more stable, but they won't guarantee identical results from run to run. However, in my opinion it's better to get as close as possible to the global optimum instead of being stuck in the same local minimum because of a fixed random_seed.
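As a rough sketch of that run-and-compare workflow with gensim's CoherenceModel (the variable names reuse the question's corpus, dictionary, and texts; the number of runs and the c_v metric are only illustrative choices):
from gensim.models import CoherenceModel

best_lda, best_score = None, float('-inf')
for run in range(5):
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50, iterations=300)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
    score = cm.get_coherence()
    if score > best_score:
        best_lda, best_score = lda, score
print("best c_v coherence:", best_score)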
The program should return the second text in the list as most similar, since it matches word for word. But that's not the case here.
import gensim
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
data = ["I love machine learning. Its awesome.",
"I love coding in python",
"I love building chatbots",
"they chat amagingly well"]
tagged_data=[TaggedDocument(word_tokenize(_d.lower()),tags=[str(i)]) for i,_d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
alpha=alpha,
min_alpha=0.00025,
min_count=1,
negative=0,
dm =1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    #print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
loaded_model=Doc2Vec.load("d2v.model")
test_data=["I love coding in python".lower()]
v1=loaded_model.infer_vector(test_data)
similar_doc=loaded_model.docvecs.most_similar([v1])
print similar_doc
Output:
[('0', 0.17585766315460205), ('2', 0.055697083473205566), ('3', -0.02361609786748886), ('1', -0.2507985532283783)]
It's showing the first text in the list as most similar instead of the second text. Can you please help with this?
First, you won't get good results from Doc2Vec-style models with toy-sized datasets. Just four documents, and a vocabulary of about 20 unique words, can't create a meaningfully-contrasting "dense embedding" vector model full of 20-dimensional vectors.
Second, if you set negative=0 in your model initialization, you're disabling the default model-training-correction mode (negative=5) – and you're not enabling the non-default, less-recommended alternative (hs=1). No training at all will be occurring. There may also be an error shown in the code output – but also, if you're running with at least INFO-level logging, you might notice other issues in the output.
Third, infer_vector() requires a list-of-word-tokens as its argument. You're providing a plain string. That will look like a list of one-character words to the code, so it's like you're asking it to infer on the 23-word sentence:
['i', ' ', 'l', 'o', 'v', 'e', ' ', 'c', ...]
The argument to infer_vector() should be tokenized exactly the same as the training texts were tokenized. (If you used word_tokenize() during training, use it during inference, too.)
infer_vector() will also use a number of repeated inference-passes over the text equal to the 'epochs' value inside the Doc2Vec model, unless you specify another value. Since you didn't specify an epochs value, the model will still have its default value (inherited from Word2Vec) of epochs=5. Most Doc2Vec work uses 10-20 epochs during training, and using at least as many during inference seems a good practice.
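Putting those fixes together, a minimal sketch of a correct inference call (the epochs value here is only illustrative; older gensim versions call this parameter steps):
test_tokens = word_tokenize("I love coding in python".lower())  # tokenized the same way as the training data
v1 = loaded_model.infer_vector(test_tokens, epochs=20)
similar_doc = loaded_model.docvecs.most_similar([v1])
print(similar_doc)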
But also:
Don't try to call train() more than once in a loop, or manage alpha in your own code, unless you are an expert.
Whatever online example suggested a code block like yours...
for epoch in range(max_epochs):
    #print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
...is a bad example. It sends the effective alpha rate down and back up incorrectly, it's very fragile if you ever want to change the number of epochs, it actually winds up running 500 training passes (100 * model.iter), and it's far more code than necessary.
Instead, don't change default alpha options, and specify your desired number of epochs when the model is created. So, the model will have a meaningful epochs value cached to be used by a later infer_vector().
Then, only call train() once. It will handle all epochs & alpha-management correctly. For example:
model = Doc2Vec(size=vec_size,
min_count=1, # not good idea w/ real corpuses but OK
dm=1, # not necessary to specify since it's the default but OK
epochs=max_epochs)
model.build_vocab(tagged_data)
model.train(tagged_data,
total_examples=model.corpus_count,
epochs=model.epochs)
I am using the doc2vec model as follows to construct my document vectors.
import json
from gensim.models import doc2vec
from collections import namedtuple
dataset = json.load(open(input_file))
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))
model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 10, min_count = 1, workers = 4, epochs = 20)
I have seen that gensim doc2vec also includes word vectors. Suppose I have a word vector created for the word deep learning. My question is: is it possible to get the documents nearest to the deep learning word vector in gensim in Python?
I am happy to provide more details if needed.
Some Doc2Vec modes will co-train doc-vectors and word-vectors in the "same space". Then, if you have a word-vector for 'deep_learning', you can ask for documents near that vector, and the results may be useful for you. For example:
similar_docs = d2v_model.docvecs.most_similar(
positive=[d2v_model.wv['deep_learning']]
)
But:
that's only going to be as good as how well your model learned 'deep_learning' as a word meaning what you think it means
a training set of known-good documents fitting the category 'deep_learning' (and other categories) could be better - whether you hand-curate those, or try to bootstrap from other sources (like say the Wikipedia category 'Deep Learning' or other curated/search-result sets that you trust).
reducing a category to a single summary point (one vector) may not be as good as having a variety of examples – many points – that all fit the category. (Relevant docs may not be a neat sphere around a summary point, but rather populate exotically-shaped regions of the high-dimensional doc-vector space.) If you have a lot of good examples of each category, you could train a classifier to label, or rank in relation to the trained categories, any further uncategorized docs (see the sketch below).
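A hypothetical sketch of that classifier idea, assuming you have hand-labeled (tag, category) pairs in a labeled_examples list, a trained d2v_model, and tokenized new_doc_tokens; scikit-learn's LogisticRegression is just one possible choice:
from sklearn.linear_model import LogisticRegression

# Doc-vectors of the labeled training documents become the classifier features
train_X = [d2v_model.docvecs[tag] for tag, _ in labeled_examples]
train_y = [label for _, label in labeled_examples]

clf = LogisticRegression(max_iter=1000)
clf.fit(train_X, train_y)

# Score a new, uncategorized document against the trained categories
new_vec = d2v_model.infer_vector(new_doc_tokens)
print(clf.predict_proba([new_vec]))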
I roughly followed this tutorial:
https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
A notable difference is that I use 2 LSTM layers with dropout. My dataset is different (a music dataset in ABC notation). I do get some songs generated, but after a certain number of steps (ranging from 30 to a couple hundred) in the generation process, the LSTM keeps generating the exact same sequence over and over again. For example, it once got stuck generating URLs for songs:
F: http://www.youtube.com/watch?v=JPtqU6pipQI
and so on ...
It also once got stuck with generating the same two songs (the two songs are a sequence of about 300 characters). In the beginning it generated 3-4 good pieces but afterwards, it kept regenerating the two songs almost indefinitely.
I am wondering, does anyone have some insight into what could be happening ?
I want to clarify that any sequence generated whether repeating or non-repeating seems to be new (model is not memorising). The validation loss and training loss decrease as expected.
Andrej Karpathy is able to generate a document of thousands of characters and I couldn't find this pattern of getting stuck indefinitely.
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Instead of taking the argmax on the prediction output, try introducing some randomness by changing something like this:
np.argmax(prediction_output)
to:
np.random.choice(len(prediction_output), p=prediction_output)
I struggled with this repeating-sequences issue for a while until I discovered this Colab notebook, where I figured out why their model was able to generate some really good samples: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/shakespeare_with_tpu_and_keras.ipynb#scrollTo=tU7M-EGGxR3E
After I changed this single line, my model went from generating a few words over and over to something actually interesting!
To use and train a text generation model, follow these steps:
1. Draw a probability distribution over the next character from the model, given the text available so far (these would be our prediction scores).
2. Reweight the distribution to a certain "temperature" (see the code below).
3. Sample the next character at random according to the reweighted distribution (see the code below).
4. Add the new character to the end of the available text.
See the sample function:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
You should use the sample function during training as follows:
for epoch in range(1, 60):
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)
    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)
        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.
            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]
            generated_text += next_char
            generated_text = generated_text[1:]
            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
A low temperature results in extremely repetitive and predictable text, but where local structure is highly realistic: in particular, all words (a word being a local pattern of characters) are real English words. With higher temperatures, the generated text becomes more interesting, surprising, even creative.
See this notebook