How to identify doc2vec instances separately in gensim in Python

I have a list of 1000 documents, where the first 500 belong to movies (i.e., list indices 0 to 499) and the remaining 500 belong to TV series (i.e., list indices 500 to 999).
For movies the document tag starts with movie_ (e.g., movie_fast_and_furious), and for TV series it starts with tv_series_ (e.g., tv_series_the_office).
I use this movies and TV series dataset to build a doc2vec model as follows.
from gensim.models import doc2vec
from collections import namedtuple
import json

dataset = json.load(open(input_file))

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')

for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))

model = doc2vec.Doc2Vec(docs, vector_size=100, window=10, min_count=1, workers=4, epochs=20)
Now, for each movie, I want to get its 5 nearest TV series along with their cosine similarities.
I know gensim provides the function model.docvecs.most_similar. However, its results include movies as well (which is not my intention). Is it possible to do this in gensim? (I assume the document vectors are created in the order of the documents list that we provide.)
I am happy to provide more details if needed.

All the tags are opaque identifiers to Doc2Vec. So, if there are internal distinctions in your data, you'll need to model and filter on them yourself.
So, my main recommendation would be to ask for a much larger topn than you need, then discard those results of the type you don't want, or in excess of the number you actually need.
(Note that every calculation of most_similar() requires a comparison against the whole known set of doc-tags, and using a smaller topn only saves some computation in sorting of those full results. So using a larger topn, even up to the full size of the known doc-tags, isn't as costly as you might fear.)
With just two categories, to get the 10 tv-shows closest to a query movie, you could make topn equal to the count of movies, minus 1 (the query), plus 10 – then in the absolute worst case, where all movies are closer than the 1st tv-show, you'd still get 10 valid tv-show results.
(The most_similar() function also includes a restrict_vocab parameter. It takes an int count, and limits results to only the 1st that-many items, in the internal storage order. So if in fact the 1st 500 documents were all tv-shows, restrict_vocab=500 would give only results from that subset. However, I wouldn't recommend relying on this, as (a) it'd only work for one category that was front-loaded, not any others; (b) ideally for training, you wouldn't clump all similar documents together, but shuffle them to be interspersed with contrasting documents. Generally Word2Vec vector-sets are sorted to put the highest-frequency words 1st – no matter the order-of-appearance in the original data. That makes restrict_vocab more useful there, as often only results for the most-common words with the strongest vectors are most interesting.)
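For example, a minimal sketch of that over-fetch-and-filter idea, assuming the movie_/tv_series_ tag prefixes from the question and that n_movies is the number of movie documents (in gensim 4.x the same lookup lives on model.dv rather than model.docvecs):
def nearest_tv_series(model, movie_tag, n_movies, topn=5):
    # Worst case: every other movie ranks above the first TV series,
    # so ask for (n_movies - 1) + topn results and filter afterwards.
    raw = model.docvecs.most_similar(movie_tag, topn=n_movies - 1 + topn)
    tv_only = [(tag, sim) for tag, sim in raw if tag.startswith('tv_series_')]
    return tv_only[:topn]

# e.g. nearest_tv_series(model, 'movie_fast_and_furious', n_movies=500)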

Related

Reduce Dimension of word-vectors from TFIDFVectorizer / CountVectorizer

I want to use the TFIDFVectorizer (or CountVectorizer followed by TFIDFTransformer) to get a vector representation of my terms. That means, I want a vector for a term where the documents are the features. That's simply the transpose of a TF-IDF matrix created by the TFIDFVectorizer.
>>> vectorizer = TfidfVectorizer()
>>> model = vectorizer.fit_transform(corpus)
>>> model.transpose()
However, I have 800k documents, which means my term vectors are very sparse and very large (800k dimensions). The flag max_features in the CountVectorizer would do exactly what I'm looking for: I can specify a dimension and the CountVectorizer tries to fit all information into this dimension. Unfortunately, this option is for the document vectors rather than the terms in the vocabulary. Hence, it reduces the size of my vocabulary, because the terms are the features.
Is there any way to do the opposite? Like, perform a transpose on the TFIDFVectorizer object before it starts cutting and normalizing everything? And if such an approach exists, how can I do that? Something like this:
>>> countVectorizer = CountVectorizer(input='filename', max_features=300, transpose=True)
I was looking for such an approach for a while now but every guide, code example, whatever is talking about the document TF-IDF vectors rather than the term vectors.
Thank you so much in advance!
I am not aware of any straightforward way to do this, but let me propose how it could be achieved.
You are trying to represent each term in your corpus as a vector that uses the documents in your corpus as its component features. Because the number of documents (which are the features in your case) is very large, you would like to limit them in a way similar to what max_features does.
According to CountVectorizer user guide (same for the TfidfVectorizer):
max_features int, default=None
If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
In a similar way, you want to keep the top documents ordered by their "frequency across the terms", as confusing as this may sound. This could be rephrased simplistically as "keep those documents that contain the most unique terms".
One way I can think of doing that is by using inverse_transform, performing the following steps:
vectorizer = TfidfVectorizer()
model = vectorizer.fit_transform(corpus)

# We use inverse_transform, which returns the
# terms per document with nonzero entries
inverse_model = vectorizer.inverse_transform(model)

# Each line in the inverse model corresponds to a document
# and contains a list of feature names (the terms).
# As we want to rank the documents, we transform the list
# of feature names into the number of features
# that each document is represented by.
inverse_model_count = list(map(lambda doc_vec: len(doc_vec), inverse_model))

# As we are going to sort the list, we need to keep track of the
# document id (its index in the corpus), so we create tuples with
# the list index of each item before we sort the list.
inverse_model_count_tuples = list(zip(range(len(inverse_model_count)),
                                      inverse_model_count))

# Then we sort the list by the count of terms
# in each document (the second component)
max_features = 100
top_documents_tuples = sorted(inverse_model_count_tuples,
                              key=lambda item: item[1],
                              reverse=True)[:max_features]

# We are interested only in the document ids (the first tuple component)
top_documents, _ = zip(*top_documents_tuples)

# Having the top_documents ids we can slice the initial model
# to keep only the documents indicated by the top_documents list
reduced_model = model[list(top_documents)]
Please note that this approach only takes into account the number of terms per document, regardless of their count (CountVectorizer) or weight (TfidfVectorizer).
If the direction of this approach is acceptable to you, then with some more code it would also be possible to take into account the count or weight of the terms.
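For instance, a minimal sketch of one such variation (my own, not part of the approach above) that ranks documents by their total TF-IDF weight instead of by their number of distinct terms:
import numpy as np

# Sum of TF-IDF weights per document (row of the document-term matrix)
doc_weight_sums = np.asarray(model.sum(axis=1)).ravel()

max_features = 100
# Indices of the max_features documents with the largest total weight
top_documents = np.argsort(doc_weight_sums)[::-1][:max_features]

# Slice the model as before; its transpose gives term vectors with
# only the selected documents as features
reduced_model = model[top_documents]
term_vectors = reduced_model.transpose()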
I hope this helps!

Filtering tokens by frequency using filter_extremes in Gensim

I am trying to filter out tokens by their frequency using the filter_extremes function in Gensim (https://radimrehurek.com/gensim/corpora/dictionary.html). Specifically, I am interested in filtering out words that occur in fewer than no_below documents and words that occur in more than no_above documents.
id2word_ = corpora.Dictionary(texts)
print(len(id2word_))
id2word_.filter_extremes(no_above = 0.600)
print(len(id2word_))
The first print statement gives 11918 and the second print statement gives 3567. However, if I do the following:
id2word_ = corpora.Dictionary(texts)
print(len(id2word_))
id2word_.filter_extremes(no_below = 0.599)
print(len(id2word_))
The first print statement gives 11918 (as expected) and the second gives 11406. Shouldn't id2word_.filter_extremes(no_below = 0.599) and id2word_.filter_extremes(no_above = 0.600) add up to the number of total words? However, 11406 + 3567 > 11918, so how come this sum exceeds the number of words in the corpus? That does not make sense since the filters should cover non-overlapping words, based off the explanation in the documentation.
If you have any ideas, I would really appreciate your input! Thanks!
According to the definition:
no_below (int, optional) – Keep tokens which are contained in at least no_below
documents.
no_above (float, optional) – Keep tokens which are contained in no more than
no_above documents (fraction of total corpus size, not an absolute number).
no_below is an int: a threshold on how many documents a token must appear in. E.g., use no_below=10 to filter out words appearing in fewer than 10 documents.
In contrast, no_above is not an int but a float representing a fraction of the total corpus size. E.g., use no_above=0.1 to filter out words appearing in more than 10% of all documents.
It's a little odd that no_below and no_above do not use the same unit, hence the confusion.
Hope this answers your question.
Regarding the filter_extremes in Gensim, the units for "no_above" and "no_below" parameters are actually DIFFERENT. This is a bit odd, to be honest.
For "no_above", you want to put a number between 0 and 1 there (float). It should be a percentage that represents the portion of a word in total corpus size.
For "no_below", you want to have a integer. It should be the number of times when a word shows in corpus. It is a threshold.
Hope it clarifies your question.
I write this as an extension to other users' answers. Yes, the two parameters are different and control different kinds of token frequencies. In addition, notice the following pitfall: filter_extremes assigns default values to no_above and no_below, so if you write:
id2word_.filter_extremes(no_below = 0.599)
it is effectively
id2word_.filter_extremes(no_below = 0.599, no_above=0.5)
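To keep the two filters from interfering, a minimal sketch (relying on gensim's documented defaults of no_below=5, no_above=0.5 and keep_n=100000) is to pass every parameter explicitly and neutralise the ones you don't want:
from gensim import corpora

id2word_ = corpora.Dictionary(texts)

# Apply only the "rare token" filter: no_above=1.0 keeps tokens appearing in
# up to 100% of documents, and keep_n=None disables the top-N cap.
id2word_.filter_extremes(no_below=5, no_above=1.0, keep_n=None)

# Apply only the "too common" filter instead: no_below=1 keeps any token
# that appears in at least one document.
# id2word_.filter_extremes(no_below=1, no_above=0.5, keep_n=None)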
Hope you have found the answer to your question by now. I have been dabbling with the gensim library and found that these two parameters, 'no_below' and 'no_above', are probably best used together. no_below, as stated above, is straightforward and keeps a token if its document frequency is not less than the parameter.
Some code samples: I am not sure if there are best practices, but I usually tinker with no_above starting from 1 (i.e., 100% of the corpus), which means the upper filter excludes nothing.
# filter if frequency not less than 2
# and tokens contained not more than 90% of corpus
dictionary.filter_extremes(no_below=2, no_above=0.9)
print(len(dictionary))
# filter if frequency not less than 3
# and tokens contained not more than 100% of corpus
dictionary.filter_extremes(no_below=3, no_above=1)
print(len(dictionary))
In general, start with no_above at 100% of the corpus and make adjustments from there.

Gensim docvecs.most_similar returns IDs that don't exist

I'm trying to create an algorithm capable of showing the top n documents similar to a specific document.
For that I used gensim doc2vec. The code is below:
model = gensim.models.doc2vec.Doc2Vec(size=400, window=8, min_count=5, workers=11,
                                      dm=0, alpha=0.025, min_alpha=0.025, dbow_words=1)

model.build_vocab(train_corpus)

for x in xrange(10):
    model.train(train_corpus)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
    model.train(train_corpus)

model.save('model_EN_BigTrain')

sims = model.docvecs.most_similar([408], topn=10)
The sims var should give me 10 tuples, the first element being the id of the doc and the second the score.
The problem is that some IDs do not correspond to any document in my training data.
I've been trying for some time now to make sense of the IDs that aren't in my training data, but I don't see any logic.
PS: This is the code that I used to create my train_corpus:
def readData(train_corpus, jData):
    print("The response contains {0} properties".format(len(jData)))
    print("\n")
    for i in xrange(len(jData)):
        print "> Reading offers from Aux array"
        if i % 10 == 0:
            print ">>", i, "offers processed..."
        train_corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(jData[i][1]), tags=[jData[i][0]]))
    print "> Finished processing offers"
Each position of the aux array is itself an array in which position 0 is an int (that I want to be the id) and position 1 is a description.
Thanks in advance.
Are you using plain integer IDs as your tags, but not using exactly all of the integers from 0 to whatever your MAX_DOC_ID is?
If so, that could explain the appearance of tags within that range. When you use plain ints, gensim Doc2Vec avoids creating a dict mapping provided tags to index-positions in its internal vector-array – and just uses the ints themselves.
Thus that internal vector-array must be allocated to include MAX_DOC_ID + 1 rows. Any rows corresponding to unused IDs are still initialized as random vectors, like all the positions, but won't receive any of the training from actual text examples to push them into meaningful relative positions. It's thus possible these random-initialized-but-untrained vectors could appear in later most_similar() results.
To avoid that, either use only contiguous ints from 0 to the last ID you need. Or, if you can afford the memory cost of the string-to-index mapping, use string tags instead of plain ints. Or, keep an extra record of the valid IDs and manually filter the unwanted IDs from results.
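For that last option, a minimal filtering sketch, assuming valid_ids is built from the same jData used in the question:
valid_ids = set(jData[i][0] for i in xrange(len(jData)))  # the tags actually trained on

raw_sims = model.docvecs.most_similar([408], topn=100)    # over-fetch
sims = [(doc_id, score) for doc_id, score in raw_sims if doc_id in valid_ids][:10]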
Separately: by not specifying iter=1 in your Doc2Vec model initialization, the default of iter=5 will be in effect, meaning each call to train() does 5 iterations over your data. Oddly, also, your xrange(10) for-loop includes two separate calls to train() each iteration (and the 1st is just using whatever alpha/min_alpha was already in place). So you're actually doing 10 * 2 * 5 = 100 passes over the data, with an odd learning-rate schedule.
I suggest instead if you want 10 passes to just set iter=10, leave default alpha/min_alpha untouched, and then call train() only once. The model will do 10 passes, smoothly managing alpha from its starting to ending values.
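A minimal sketch of that simpler setup, keeping the hyperparameters from the question (this uses the older gensim API shown above, where passing the corpus at construction time together with iter triggers a single managed training run; in gensim 4.x the parameter is named epochs):
model = gensim.models.doc2vec.Doc2Vec(train_corpus,
                                      size=400, window=8, min_count=5, workers=11,
                                      dm=0, dbow_words=1, iter=10)
# alpha decays smoothly from its default start to min_alpha over the 10 passes
model.save('model_EN_BigTrain')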
I was having this problem as well. I was initializing my doc2vec with the following:
for idx, doc in data.iterrows():
    alldocs.append(TruthDocument(doc['clean_text'], [idx], doc['label']))
I was passing it a dataframe that had some wonky indexes. All I had to do was:
df.reset_index(inplace=True)

Use Gensim for scoring features in each document. Also a Python memory issue

I am using gensim on a corpus of 50,000 documents along with a dictionary of around 4,000 features. I also have an LSI model already prepared for the same.
Now I want to find the highest matching features for each of the added documents. To find the best features in a particular document, I am running gensim's similarity module for each of the features over all the documents. This gives us a score for each feature that we want to use later on. But as you can imagine, this is a costly process, as we have to iterate over 50,000 indices and run 4,000 iterations of similarity on each.
I need a better way of doing this as I run out of 8 GB memory on my system at around 1000 iterations. There's actually no reason for the memory to keep rising as I am only reallocating it during the iterations. Surprisingly the memory starts rising only after around 200 iterations.
Why the memory issue? How can it be solved?
Is there a better way of finding the highest scored features in a particular document (not topics)?
Here's a snippet of the code that runs out of memory:
dictionary = corpora.Dictionary.load('features-dict.dict')
corpus = corpora.MmCorpus('corpus.mm')
lsi = models.LsiModel.load('model.lsi')
corpus_lsi = lsi[corpus]
index = similarities.MatrixSimilarity(list(corpus_lsi))

newDict = dict()

for feature in dictionary.token2id.keys():
    vec_bow = dictionary.doc2bow([feature])
    vec_lsi = lsi[vec_bow]
    sims = index[vec_lsi]
    li = sorted(enumerate(sims * 100), key=lambda item: -item[1])

    for data in li:
        newDict[data[0]] = (feature, data[1])  # Store feature and score for each document

# Do something with the dict created above
EDIT:
The memory issue was resolved using a memory profiler. There was something else in that loop that caused it to rise drastically.
Let me explain the purpose in detail. Imagine we are dealing with various recipes (each recipe is a document) and each item in our dictionary is an ingredient. Find six such recipes below.
corpus = [['Olive Oil', 'Tomato', 'Brocolli', 'Oregano'],
          ['Garlic', 'Olive Oil', 'Bread', 'Cheese', 'Oregano'],
          ['Avocado', 'Beans', 'Cheese', 'Lime'],
          ['Jalepeneo', 'Lime', 'Tomato', 'Tortilla', 'Sour Cream'],
          ['Chili Sauce', 'Vinegar', 'Mushrooms', 'Rice'],
          ['Soy Sauce', 'Noodles', 'Brocolli', 'Ginger', 'Vinegar']]
There are thousands of such recipes. What I am trying to achieve is to assign a weight between 0 and 100 to each ingredient (where a higher-weighted ingredient is more important or more unique). What would be the best way to achieve this?
Let's break this down:
unless I misunderstood your purpose, you can simply use the left singular vectors from lsi.projection.u to get your weights:
# create #features x #corpus 2D matrix of weights
doc_feature_matrix = numpy.dot(lsi.projection.u, index.index.T)
Rows of this matrix should be the "documents weights" you're looking for, one row for one feature.
the call to list() in your list(lsi[corpus]) makes your code very inefficient. It basically serializes the entire doc-topic matrix into RAM. Drop the list() and use the streamed version directly, it's much more memory-efficient: index = MatrixSimilarity(lsi[corpus], num_features=lsi.num_topics).
LSI usually works better over regularized input. Consider transforming the plain bag-of-words vectors (=integers) via e.g. TF-IDF or log entropy transformation before passing it to LSI.
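A minimal sketch of that last point, reusing the dictionary and corpus objects loaded in the question (num_topics=200 is just an illustrative choice):
from gensim import models, similarities

# Weight the plain bag-of-words counts with TF-IDF before building the LSI model
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=200)

# Stream the transformed corpus into the index (no list()), as suggested above
index = similarities.MatrixSimilarity(lsi[corpus_tfidf], num_features=lsi.num_topics)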

Tag generation from text content

I am curious whether an algorithm/method exists to generate keywords/tags from a given text, using weight calculations, occurrence ratios, or other tools.
Additionally, I would be grateful if you could point me to any Python-based solution/library for this.
Thanks
One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say in a larger collection of documents the term 'Markov' is almost never seen. However, in a particular document from the same collection Markov shows up very frequently. This would suggest that Markov might be a good keyword or tag to associate with the document.
To identify keywords like this, you could use the point-wise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]. This will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection.
To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
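For illustration, a minimal sketch of that ranking using simple maximum-likelihood estimates of the probabilities (the docs list-of-token-lists format is an assumption here, not something from the question):
import math
from collections import Counter

def top_keywords(docs, doc_index, k=5):
    """docs: list of token lists. Returns the k terms with the highest PMI for one document."""
    term_counts = Counter(t for doc in docs for t in doc)   # corpus-wide term counts
    total_terms = sum(term_counts.values())
    doc = docs[doc_index]
    doc_counts = Counter(doc)

    def pmi(term):
        p_term = term_counts[term] / total_terms            # P(term)
        p_term_given_doc = doc_counts[term] / len(doc)      # P(term | doc)
        # PMI(term, doc) = log P(term, doc) / (P(term) P(doc)) = log P(term | doc) / P(term)
        return math.log(p_term_given_doc / p_term)

    return sorted(doc_counts, key=pmi, reverse=True)[:k]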
If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.
Borrowing from my answer to that question, the NLTK collocations how-to covers how to extract interesting multiword expressions using n-gram PMI in about 7 lines of code, e.g.:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)
First, the key python library for computational linguistics is NLTK ("Natural Language Toolkit"). This is a stable, mature library created and maintained by professional computational linguists. It also has an extensive collection of tutorials, FAQs, etc. I recommend it highly.
Below is a simple template, in Python code, for the problem raised in your question; although it's a template, it runs--supply any text as a string (as I've done) and it will return a list of word frequencies as well as a ranked list of those words in order of 'importance' (or suitability as keywords) according to a very simple heuristic.
Keywords for a given document are (obviously) chosen from among important words in a document--ie, those words that are likely to distinguish it from another document. If you had no a priori knowledge of the text's subject matter, a common technique is to infer the importance or weight of a given word/term from its frequency, or importance = 1/frequency.
text = """ The intensity of the feeling makes up for the disproportion of the objects. Things are equal to the imagination, which have the power of affecting the mind with an equal degree of terror, admiration, delight, or love. When Lear calls upon the heavens to avenge his cause, "for they are old like him," there is nothing extravagant or impious in this sublime identification of his age with theirs; for there is no other image which could do justice to the agonising sense of his wrongs and his despair! """
BAD_CHARS = ".!?,\'\""
# transform text into a list of words--removing punctuation and filtering small words
words = [ word.strip(BAD_CHARS) for word in text.strip().split() if len(word) > 4 ]
word_freq = {}
# generate a 'word histogram' for the text--ie, a list of the frequencies of each word
for word in words:
    word_freq[word] = word_freq.get(word, 0) + 1
# sort the word list by frequency
# (just a DSU sort, there's a python built-in for this, but i can't remember it)
tx = [ (v, k) for (k, v) in word_freq.items()]
tx.sort(reverse=True)
word_freq_sorted = [ (k, v) for (v, k) in tx ]
# eg, what are the most common words in that text?
print(word_freq_sorted)
# returns: [('which', 4), ('other', 4), ('like', 4), ('what', 3), ('upon', 3)]
# obviously using a text larger than 50 or so words will give you more meaningful results
term_importance = lambda word : 1.0/word_freq[word]
# select document keywords from the words at/near the top of this list:
map(term_importance, word_freq.keys())
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation tries to represent each document in a training corpus as a mixture of topics, which in turn are distributions mapping words to probabilities.
I had used it once to dissect a corpus of product reviews into the latent ideas that were being spoken about across all the documents, such as 'customer service', 'product usability', etc. The basic model does not provide a way to convert the topic models into a single word describing what a topic is about, but people have come up with all kinds of heuristics to do that once their model is trained.
I recommend you try playing with http://mallet.cs.umass.edu/ and seeing if this model fits your needs..
LDA is a completely unsupervised algorithm meaning it doesn't require you to hand annotate anything which is great, but on the flip side, might not deliver you the topics you were expecting it to give.
A very simple solution to the problem would be:
count the occurrences of each word in the text
consider the most frequent terms as the key phrases
have a black-list of 'stop words' to remove common words like the, and, it, is, etc.
I'm sure there are cleverer, stats-based solutions though; a quick sketch of the simple approach follows.
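For illustration, a minimal sketch of that count-and-blacklist idea (the tiny stop-word list is only a placeholder):
from collections import Counter
import re

STOP_WORDS = {'the', 'and', 'it', 'is', 'a', 'of', 'to', 'in', 'that', 'for'}  # illustrative only

def simple_tags(text, n=5):
    # lowercase, keep alphabetic tokens, drop stop words
    tokens = re.findall(r'[a-z]+', text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

print(simple_tags("The cat sat on the mat. The cat is a very happy cat.", n=3))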
If you need a solution to use in a larger project rather than for interest's sake, Yahoo BOSS has a key term extraction method.
Latent Dirichlet allocation or Hierarchical Dirichlet Process can be used to generate tags for individual texts within a greater corpus (body of texts) by extracting the most important words from the derived topics.
A basic example would be if we were to run LDA over a corpus and define it to have two topics, and that we find further that a text in the corpus is 70% one topic, and 30% another. The top 70% of the words that define the first topic and 30% that define the second (without duplication) could then be considered as tags for the given text. This method provides strong results where tags generally represent the broader themes of the given texts.
A general reference for the preprocessing needed for this code can be found here; given that, we can find tags through the following process using gensim.
A heuristic way of deriving the optimal number of topics for LDA is found in this answer. Although HDP does not require the number of topics as an input, the standard in such cases is still to use LDA with a derived topic number, as HDP can be problematic. Assume here that the corpus is found to have 10 topics, and we want 5 tags per text:
from gensim.models import LdaModel, HdpModel
from gensim import corpora
num_topics = 10
num_tags = 5
Assume further that we have a variable corpus, which is a preprocessed list of lists, with the sublist entries being word tokens. Initialize a Dirichlet dictionary and create a bag of words where texts are converted to their indexes for their component tokens (words):
dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
Create an LDA or HDP model:
dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')

# dirichlet_model = HdpModel(corpus=bow_corpus,
#                            id2word=dirichlet_dict,
#                            chunksize=len(bow_corpus))
The following code produces ordered lists for the most important words per topic (note that here is where num_tags defines the desired tags per text):
shown_topics = dirichlet_model.show_topics(num_topics=num_topics,
                                           num_words=num_tags,
                                           formatted=False)
model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
Then find the coherence of the topics across the texts:
topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0
topics_per_text = [text for text in topic_corpus]
From here we have the percentage that each text coheres to a given topic, and the words associated with each topic, so we can combine them for tags with the following:
corpus_tags = []
for i in range(len(bow_corpus)):
    # The complexity here is to make sure that it works with HDP
    significant_topics = list(set([t[0] for t in topics_per_text[i]]))
    topic_indexes_by_coherence = [tup[0] for tup in sorted(enumerate(topics_per_text[i]), key=lambda x: x[1])]
    significant_topics_by_coherence = [significant_topics[i] for i in topic_indexes_by_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_coherence][:num_topics]  # subset for HDP
    ordered_topic_coherences = [topics_per_text[i] for i in topic_indexes_by_coherence][:num_topics]  # subset for HDP

    text_tags = []
    for i in range(num_topics):
        # Find the number of indexes to select, which can later be extended if the word has already been selected
        selection_indexes = list(range(int(round(num_tags * ordered_topic_coherences[i]))))
        if selection_indexes == [] and len(text_tags) < num_tags:
            # Fix potential rounding error by giving this topic one selection
            selection_indexes = [0]
        for s_i in selection_indexes:
            # ignore_words is a list of words that should not be included
            if ordered_topics[i][s_i] not in text_tags and ordered_topics[i][s_i] not in ignore_words:
                text_tags.append(ordered_topics[i][s_i])
            else:
                selection_indexes.append(selection_indexes[-1] + 1)

    # Fix for if too many were selected
    text_tags = text_tags[:num_tags]
    corpus_tags.append(text_tags)
corpus_tags will be a list of tags for each text based on how coherent the text is to the derived topics.
See this answer for a similar version of this that generates tags for a whole text corpus.
