Updating dictionary and TFIDF for new documents using Gensim - python

My use case is that I have a corpus of documents, and when new docs come in I update the Dictionary and vectorize them. The result should be a sparse matrix of TFIDF vectors, which I'm using corpus2csc for.
I think I have a solution, but I have seen other answers that suggest my solution is impossible, and I've seen some unexpected behavior. I'm seeking a gut-check on the approach, with some specific questions below.
Overall approach:
Use Dictionary.doc2bow with allow_update=True
Construct a TFIDF model using that updated Dictionary.
Questions and example code below.
from gensim import corpora
from gensim import models
from gensim import matutils
docs = [['definition', 'addition', 'term', 'defined'],
        ['term', 'sweet', 'generally', 'subject'],
        ['gene', 'extent', 'provided', 'gene'],
        ['additional', 'cost', 'gene', 'adequacy'],
        ['initial', 'condition', 'sweet', 'effectiveness']]
gensim_dict = corpora.Dictionary([], prune_at=None) # start with empty dict because I don't know what the corpus looks like at this point
BoW_corpus0 = [gensim_dict.doc2bow(d, allow_update=True) for d in docs] # Add to the dictionary by setting allow_update=True
# documentation says: corpus (iterable of iterable of (int, number)) – Input corpus in BoW format
matutils.corpus2csc(corpus=BoW_corpus0) # Expected. A 16x5 matrix with 19 stored elements
matutils.corpus2csc(corpus=BoW_corpus0, num_terms=len(gensim_dict.keys()), num_docs=len(BoW_corpus0), num_nnz=sum([len(doc) for doc in BoW_corpus0])) # Also expected for the 'efficient' path
model = models.TfidfModel(BoW_corpus0)
vecs = model[BoW_corpus0]
matutils.corpus2csc(corpus=vecs) # Unexpected - asks for a BoW format, but this is TransformedCorpus. But ignoring that, it's expected -- 16x5 with 19 stored elements.
matutils.corpus2csc(corpus=vecs, num_terms=len(gensim_dict.keys()), num_docs=vecs.obj.num_docs, num_nnz=vecs.obj.num_nnz) # Expected
The corpus2csc documentation asks for a BoW-format corpus, not a TransformedCorpus. Does it accept a TransformedCorpus just by chance?
Now I add new docs:
new_docs = [['provided', 'communication', 'provided', 'writing']]
BoW_corpus1 = [gensim_dict.doc2bow(d, allow_update=True) for d in new_docs]
len(gensim_dict.cfs) # 18
gensim_dict.num_nnz # 22
len(BoW_corpus1) # 1
matutils.corpus2csc(corpus=BoW_corpus1) # Expected. An 18x1 matrix, with 3 stored elements
matutils.corpus2csc(corpus=BoW_corpus1, num_terms=len(gensim_dict.keys()), num_docs=len(BoW_corpus1), num_nnz=sum([len(doc) for doc in BoW_corpus1])) # Expected. An 18x1 matrix, with 3 stored elements
# Let's use the original model. Ideally we would update the model with the new document but I'm not sure of the best way to do that.
new_vecs = model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs) # Unexpected. Using the original model, a 10x1 with 1 stored elements. Why not 18x1?
Why does this approach result in a 10x1 matrix?
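A likely explanation, worth verifying: the original model only knows the 16 terms from BoW_corpus0, so 'communication' and 'writing' are dropped, leaving a single feature id ('provided', which happens to get id 9 in this run), and with no num_terms hint corpus2csc sizes the matrix from the largest feature id it actually sees. Passing num_terms explicitly keeps the dictionary-sized shape; a minimal check, reusing the objects above:
# Force the row count instead of letting corpus2csc infer it from the data.
explicit = matutils.corpus2csc(model[BoW_corpus1], num_terms=len(gensim_dict))
print(explicit.shape)  # (18, 1), still with a single stored element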
# let's try using the Dictionary that's been updated on the fly
dict_based_model = models.TfidfModel(dictionary=gensim_dict)
new_vecs2 = dict_based_model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs2) # Expected. 18x1 with 3 stored elements.
matutils.corpus2csc(corpus=new_vecs2, num_terms=len(gensim_dict.keys()), num_docs=len(new_vecs2), num_nnz=sum([len(v) for v in new_vecs2])) # Also expected. 18x1 with 3 stored elements.
# But why are these not the same?
assert new_vecs2.obj.num_docs == len(new_vecs)
assert new_vecs2.obj.num_nnz == sum([len(v) for v in new_vecs2])
# Finally, let's make a model based on the new corpus, I know this isn't right but curious why the output is what it is
new_corpus_based_model = models.TfidfModel(BoW_corpus1)
new_vecs3 = new_corpus_based_model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs3) # 0x1 with 0 elements. Kinda expected. But I would have thought it would have produced a 2x1 or 3x1 matrix
Can you confirm that dict_based_model is the right approach?
What is the new_vecs2.obj all about?
Why do I get a 0x1 instead of a 2x1?
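For reference, a compact sketch of the flow the post is converging on: update the Dictionary incrementally, rebuild the TfidfModel from the updated Dictionary, and pass num_terms explicitly to corpus2csc so the matrix height always matches the vocabulary. The helper name stream_to_csc is made up for illustration, and the code reuses the models/matutils imports from above:
def stream_to_csc(new_docs, gensim_dict):
    # Sketch only: grow the dictionary, rebuild TF-IDF weights from it,
    # and return a vocabulary-sized sparse column matrix for the new docs.
    bow = [gensim_dict.doc2bow(d, allow_update=True) for d in new_docs]
    tfidf = models.TfidfModel(dictionary=gensim_dict)  # idf from the updated dict
    # num_terms alone fixes the shape; a hand-computed num_nnz can mismatch
    # because TF-IDF drops terms whose idf is zero.
    return matutils.corpus2csc(tfidf[bow], num_terms=len(gensim_dict))

# X_new = stream_to_csc([['provided', 'communication', 'provided', 'writing']], gensim_dict)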

Related

NLP for multi feature data set using TensorFlow

I am just a beginner in this subject. I have tested some NNs for image recognition, as well as NLP for sequence classification.
This second topic is interesting to me.
Using
sentences = [
    'some test sentence',
    'and the second sentence'
]
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sentences = tokenizer.texts_to_sequences(sentences)
will result in an array of size [n, 1], where n is the number of words in the sentence. And assuming I have implemented padding correctly, each training example in the set will be of size [n, 1], where n is the max sentence length.
That prepared training set I can pass into Keras model.fit.
But what about when I have multiple features in my data set?
Let's say I would like to build an event prioritization algorithm, and my data structure would look like:
[event_description, event_category, event_location, label]
Trying to tokenize such an array would result in an [n, m] matrix, where n is the maximum sequence length and m is the number of features.
How should I prepare such a dataset so that a model can be trained on it?
Would this approach be OK:
# Going through the training set to get all features into specific arrays
for data in dataset:
    training_sentence.append(data['event_description'])
    training_category.append(data['event_category'])
    training_location.append(data['event_location'])
    training_labels.append(data['label'])

# Fit the tokenizer on each array, then turn each array into sequences
tokenizer.fit_on_texts(training_sentence)
tokenizer.fit_on_texts(training_category)
tokenizer.fit_on_texts(training_location)

sequences = tokenizer.texts_to_sequences(training_sentence)
categories = tokenizer.texts_to_sequences(training_category)
locations = tokenizer.texts_to_sequences(training_location)

# Concatenate the feature arrays into one
training_example = numpy.concatenate([sequences, categories, locations])

# omitting the model definition, training the model
model.fit(training_example, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
I haven't tested it yet. I just want to make sure I understand everything correctly and that my assumptions are correct.
Is this a correct approach to creating an NLP model using a NN?
I know of two common ways to manage multiple input sequences, and your approach lands somewhere between them.
One approach is to design a multi-input model with each of your text columns as a different input. They can share the vocabulary and/or embedding layer later, but for now you still need a distinct input sub-model for each of description, category, etc.
Each of these becomes an input to the network, using the Model(inputs=[...], outputs=rest_of_nn) syntax. You will need to design rest_of_nn so it can take multiple inputs. This can be as simple as your current concatenation, or you could use additional layers to do the synthesis.
It could look something like this:
# Imports added for completeness, assuming the tf.keras API.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Flatten, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K

# Build separate vocabularies. This could be shared.
desc_tokenizer = Tokenizer()
desc_tokenizer.fit_on_texts(training_sentence)
desc_vocab_size = len(desc_tokenizer.word_index) + 1  # +1: Keras word indices start at 1
categ_tokenizer = Tokenizer()
categ_tokenizer.fit_on_texts(training_category)
categ_vocab_size = len(categ_tokenizer.word_index) + 1
# Inputs.
desc = Input(shape=(desc_maxlen,))
categ = Input(shape=(categ_maxlen,))
# Input encodings, opting for different embeddings.
# Descriptions go through an LSTM as a demo of extra processing.
embedded_desc = Embedding(desc_vocab_size, desc_embed_size, input_length=desc_maxlen)(desc)
encoded_desc = LSTM(categ_embed_size, return_sequences=True)(embedded_desc)
encoded_categ = Embedding(categ_vocab_size, categ_embed_size, input_length=categ_maxlen)(categ)
# Rest of the NN, which knows how to put everything together to get an output.
merged = concatenate([encoded_desc, encoded_categ], axis=1)
rest_of_nn = Dense(hidden_size, activation='relu')(merged)
rest_of_nn = Flatten()(rest_of_nn)
rest_of_nn = Dense(output_size, activation='softmax')(rest_of_nn)
# Create the model, assuming some sort of classification problem.
model = Model(inputs=[desc, categ], outputs=rest_of_nn)
model.compile(optimizer='adam', loss=K.categorical_crossentropy)
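A usage sketch for fitting this two-input model; desc_data, categ_data and labels are hypothetical placeholders for padded integer arrays and one-hot labels you prepare separately, passed in the same order as the inputs list:
# Hypothetical arrays: desc_data (num_examples, desc_maxlen),
# categ_data (num_examples, categ_maxlen), labels (num_examples, output_size).
model.fit([desc_data, categ_data], labels, epochs=10, validation_split=0.1)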
The second approach is to concatenate all of your data before encoding it, and then treat everything as a more standard single-sequence problem after that. It is common to use a unique token to separate or define the different fields, similar to BOS and EOS for the beginning and end of the sequence.
It would look something like this:
XXBOS XXDESC This event will be fun. XXCATEG leisure XXLOC Seattle, WA XXEOS
You can also do end tags for the fields like DESCXX, omit the BOS and EOS tokens, and generally mix and match however you want. You can even use this to combine some of your input sequences, but then use a multi-input model as above to merge the rest.
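As a rough sketch of the pre-tokenization step this approach implies (the combine_fields helper is made up for this sketch; training_sentence, training_category and training_location are the lists from the question):
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def combine_fields(description, category, location):
    # Helper made up for this sketch: join the fields into one marked-up
    # string that a single tokenizer can handle.
    return f"XXBOS XXDESC {description} XXCATEG {category} XXLOC {location} XXEOS"

combined = [
    combine_fields(d, c, l)
    for d, c, l in zip(training_sentence, training_category, training_location)
]
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(combined)
padded = pad_sequences(tokenizer.texts_to_sequences(combined), padding='post')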
Speaking of mixing and matching, you also have the option to treat some of your inputs directly as an embedding. Low-cardinality fields like category and location do not need to be tokenized, and can be embedded directly without any need to split into tokens. That is, they don't need to be a sequence.
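A minimal sketch of that direct-embedding idea, assuming the category has already been mapped to a single integer id per example and reusing categ_vocab_size and categ_embed_size from the earlier snippet:
from tensorflow.keras.layers import Input, Embedding, Flatten

# One integer id per example instead of a padded token sequence.
categ = Input(shape=(1,))
encoded_categ = Flatten()(Embedding(categ_vocab_size, categ_embed_size)(categ))
# encoded_categ has shape (batch, categ_embed_size) and can be merged with the
# other encodings in rest_of_nn.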
If you are looking for a reference, I enjoyed this paper on Large Scale Product Categorization using Structured and Unstructured Attributes. It tests all or most of the ideas I have just outlined, on real data at scale.

applying the Similar function in Gensim.Doc2Vec

I am trying to get the doc2vec function to work in python 3.
I have the following code:
tekstdata = [[index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])]
             for index, row in data.iterrows()]

def prep(x):
    low = x.lower()
    return word_tokenize(low)

def cleanMuch(data, clean):
    output = []
    for x, y in data:
        z = clean(y)
        output.append([str(x), z])
    return output

tekstdata = cleanMuch(tekstdata, prep)

def tagdocs(docs):
    output = []
    for x, y in docs:
        output.append(gensim.models.doc2vec.TaggedDocument(y, x))
    return output

tekstdata = tagdocs(tekstdata)
print(tekstdata[100])

vectorModel = gensim.models.doc2vec.Doc2Vec(tekstdata, size=100, window=4, min_count=3, iter=2)

ranks = []
second_ranks = []
for x, y in tekstdata:
    print(x)
    print(y)
    inferred_vector = vectorModel.infer_vector(y)
    sims = vectorModel.docvecs.most_similar([inferred_vector], topn=1001, restrict_vocab=None)
    rank = [docid for docid, sim in sims].index(y)
    ranks.append(rank)
Everything works, as far as I can understand, until the rank line.
The error I get is that the value is not in my list, i.e. the documents I am putting in do not have '10' in the list:
File "C:/Users/Niels Helsø/Documents/github/Speciale/Test/Data prep.py", line 59, in <module>
rank = [docid for docid, sim in sims].index(y)
ValueError: '10' is not in list
It seems to me that it is the most_similar function that does not work.
The model trains on my data (1000 documents) and builds a vocab which is tagged.
The documentation I have mainly used is:
Gensim documentation
Tutorial
I hope that someone can help. If any additional info is needed, please let me know.
best
Niels
If you're getting ValueError: '10' is not in list, you can rely on the fact that '10' is not in the list. So have you looked at the list, to see what is there, and if it matches what you expect?
It's not clear from your code excerpts that tagdocs() is ever called, and thus unclear what form tekstdata is in when provided to Doc2Vec. The intent is a bit convoluted, and there's nothing to display what the data appears as in its raw, original form.
But perhaps the tags you are supplying to TaggedDocument are not the required list-of-tags, but rather a simple string, which will be interpreted as a list-of-characters. As a result, even if you're supplying a tags of '10', it will be seen as ['1', '0'] – and len(vectorModel.docvecs.doctags) will be just 10 (for the 10 single-digit strings).
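A minimal sketch of that fix, rewriting the asker's tagdocs() so each tag is wrapped in a single-element list rather than passed as a bare string:
import gensim

def tagdocs(docs):
    output = []
    for x, y in docs:
        # tags must be a list of tags: [str(x)] is one tag per document,
        # not the individual characters of str(x)
        output.append(gensim.models.doc2vec.TaggedDocument(words=y, tags=[str(x)]))
    return output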
Separate comments on your setup:
1000 documents is pretty small for Doc2Vec, where most published results use tens-of-thousands to millions of documents
an iter of 10-20 is more common in Doc2Vec work (and even larger values might be helpful with smaller datasets)
infer_vector() often works better with non-default values in its optional parameters, especially a steps that's much larger (20-200) or a starting alpha that's more like the bulk-training default (0.025); see the sketch below
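For example, a one-line sketch under the gensim 3.x API (in gensim 4.x the steps parameter became epochs):
# More inference passes and a larger starting alpha than the defaults.
inferred_vector = vectorModel.infer_vector(y, steps=100, alpha=0.025)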

Convert Python dictionary to Word2Vec object

I have obtained a dictionary mapping words to their vectors in Python, and I am trying to scatter plot the n most similar words, since t-SNE on a huge number of words is taking forever. The best option is to convert the dictionary to a Word2Vec object to deal with it.
I had the same issue and I finally found the solution.
So, I assume that your dictionary looks like mine:
import numpy as np

d = {}
d['1'] = np.random.randn(300)
d['2'] = np.random.randn(300)
Basically, the keys are the users' ids and each of them has a vector with shape (300,).
So now, in order to use it as word2vec, I first need to save it to a binary file and then load it with the gensim library:
import gensim
from numpy import float32 as REAL
from gensim import utils

m = gensim.models.keyedvectors.Word2VecKeyedVectors(vector_size=300)
m.vocab = d
m.vectors = np.array(list(d.values()))
my_save_word2vec_format(binary=True, fname='train.bin', total_vec=len(d), vocab=m.vocab, vectors=m.vectors)
Where my_save_word2vec_format function is:
def my_save_word2vec_format(fname, vocab, vectors, binary=True, total_vec=2):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec-tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vectors : numpy.array
        The vectors to be stored.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
    total_vec : int, optional
        Explicitly specify total number of vectors
        (in case word vectors are appended with document vectors afterwards).

    """
    if not (vocab or vectors):
        raise RuntimeError("no input")
    if total_vec is None:
        total_vec = len(vocab)
    vector_size = vectors.shape[1]
    assert (len(vocab), vector_size) == vectors.shape
    with utils.smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)
        fout.write(utils.to_utf8("%s %s\n" % (total_vec, vector_size)))
        # store in sorted order: most frequent words at the top
        for word, row in vocab.items():
            if binary:
                row = row.astype(REAL)
                fout.write(utils.to_utf8(word) + b" " + row.tostring())
            else:
                fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
And then use:
m2 = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format('train.bin', binary=True)
to load the model as word2vec.
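As a quick usage check, the loaded object supports the usual KeyedVectors queries, which is what the question needs for finding the n most similar entries; '1' is just one of the example keys above:
# Nearest neighbours of the vector stored under key '1'.
print(m2.most_similar('1', topn=5))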
If you've calculated the word-vectors with your own code, you may want to write them to a file in a format compatible with Google's original word2vec.c or gensim. You can review the gensim code in KeyedVectors.save_word2vec_format() to see exactly how its vectors are written – it's less than 20 lines of code – and do something similar to your vectors. See:
https://github.com/RaRe-Technologies/gensim/blob/3d2227d58b10d0493006a3d7e63b98d64e991e60/gensim/models/keyedvectors.py#L130
Then you could re-load vectors that originated with your code and use them almost directly with examples like the one from Jeff Delaney you mention.

Creating features function for further classification in python

I have read a description of how to apply random forest regression here. In this example the authors use the following code to create the features:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",max_features = 5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
I am thinking of combining several possibilities as features and turning them on and off. And I don't know how to do it.
What I have so far is that I define a class where I will be able to turn features on and off and see whether it helps (for example, all unigrams and the 20 most frequent unigrams; it could then be the 10 most frequent adjectives, or tf-idf). But for now I don't understand how to combine them.
The code looks like this, and it's the function part where I am lost (the kind of function I have would replicate what they do in the tutorial, but it doesn't seem to be really helpful the way I do it):
class FeatureGen:  # for example, feat = FeatureGen(unigrams=False) creates a feature set without the turned-off feature
    def __init__(self, unigrams=True, unigrams_freq=True):
        self.unigrams = unigrams
        self.unigrams_freq = unigrams_freq

    def get_features(self, input):
        vectorizer = CountVectorizer(analyzer="word", max_features=5000)
        tokens = input["token"]
        if self.unigrams:
            train_data_features = vectorizer.fit_transform(tokens)
        return train_data_features
What should I do to add one more feature possibility, like the 10 most frequent words?
if self.unigrams:
    train_data_features = vectorizer.fit_transform(tokens)
if self.unigrams_freq:
    # something else
return features  # and this should somehow be a combination
Looks like you need np.hstack.
However, you need each feature array to have one row per training case.
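A minimal sketch of what that could look like as a standalone feature builder (the build_features name and the second feature block are made up for illustration):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def build_features(texts):
    # Hypothetical helper: build each feature block with one row per document,
    # then stack the blocks column-wise.
    blocks = []

    # Feature block 1: unigram counts (dense so np.hstack works;
    # scipy.sparse.hstack is the memory-friendly alternative).
    unigram_vec = CountVectorizer(analyzer="word", max_features=5000)
    blocks.append(unigram_vec.fit_transform(texts).toarray())

    # Feature block 2: counts restricted to the 10 most frequent words.
    top10_vec = CountVectorizer(analyzer="word", max_features=10)
    blocks.append(top10_vec.fit_transform(texts).toarray())

    return np.hstack(blocks)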

Gensim - LDA create a document- topic matrix

I am working on a project where I need to apply topic modelling to a set of documents, and I need to create a matrix:
DT, a D × T matrix, where D is the number of documents and T is the number of topics. DT(ij) contains the number of times a word in document Di has been assigned to topic Tj.
So far I have followed this tutorial: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
I am new to gensim, and so far I have:
1. created a document list
2. preprocessed and tokenized the documents
3. used corpora.Dictionary() to create the id -> term dictionary (id2word)
4. converted the tokenized documents into a document-term matrix
5. generated an LDA model
So now I get the topics.
How can I now get the matrix that I mentioned before?
I will be using this matrix to calculate the similarity between 2 documents on topic t as:
sim(a, b) = 1 - |DT(a, t) - DT(b, t)|
There is an implementation in the pyLDAvis source code which returns lists that may be helpful for building the matrix you are interested in.
Snippet from the _extract_data method in gensim.py:
def _extract_data(topic_model, corpus, dictionary, doc_topic_dists=None):
    ...
    ...
    ...
    return {'topic_term_dists': topic_term_dists, 'doc_topic_dists': doc_topic_dists,
            'doc_lengths': doc_lengths, 'vocab': vocab, 'term_frequency': term_freqs}
The number of topics for your model will be static. Maybe you're interested in finding the document topic distribution for the T matrix. In that case, the DxT matrix would be doc_lengths x doc_topic_dists.
Showing your code would be helpful, but if we were to go off of the example in the tutorial you linked then the model is identified by:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
you could put into your script something like:
model_name = "name_of_my_model"
ldamodel.save(model_name)
Then when you run it, this will create a model in the same directory that the script is run from.
Then you can get topic probability distribution with:
print(ldamodel[doc_bow])
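To go from per-document distributions to the full D x T matrix asked about, one possible sketch (not from the original answer): get_document_topics() with minimum_probability=0 returns every topic's probability for a document, which can be written into a NumPy array. Note these are probabilities rather than the raw word-assignment counts described in the question, so treat it as an approximation; corpus here is assumed to be the BoW corpus the model was trained on:
import numpy as np

doc_topic = np.zeros((len(corpus), ldamodel.num_topics))
for d, bow in enumerate(corpus):
    # minimum_probability=0 keeps every topic, even the near-zero ones
    for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=0):
        doc_topic[d, topic_id] = prob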
If you want to get similarity to this model then you need to create a model for the query document, too, and then get cosine similarity between the two:
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("name_of_my_model.lda")

index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")

docname = "docs/the_doc.txt"
doc = open(docname, 'r').read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]

sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
