I am working on a project where I need to apply topic modelling to a set of documents and create a matrix:
DT, a D × T matrix, where D is the number of documents and T is the number of topics. DT(i, j) contains the number of times a word in document Di has been assigned to topic Tj.
So far I have followed this tutorial: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
I am new to gensim and so far I have:
1. Created a document list.
2. Preprocessed and tokenized the documents.
3. Used corpora.Dictionary() to create an id -> term dictionary (id2word).
4. Converted the tokenized documents into a document-term matrix.
5. Generated an LDA model.
So now I get the topics.
How can I now get the matrix that I mentioned before?
I will be using this matrix to calculate the similarity between two documents a and b on topic t as:
sim(a,b) = 1- |DT(a,t) - DT(b, t)|
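To make the formula concrete, here is a minimal sketch with a made-up DT matrix. The numbers and the helper name topic_similarity are purely illustrative and do not come from any library. The rows are assumed to be normalised to topic proportions, which keeps the result in [0, 1]; with raw counts the value can become negative.

```python
# Hypothetical D x T matrix: rows are documents, columns are topics.
# Each row holds topic proportions (assumed normalised to sum to 1).
DT = [
    [0.70, 0.20, 0.10],  # document 0
    [0.40, 0.50, 0.10],  # document 1
]


def topic_similarity(dt, a, b, t):
    """sim(a, b) = 1 - |DT(a, t) - DT(b, t)| for a single topic t."""
    return 1 - abs(dt[a][t] - dt[b][t])


# Similarity of documents 0 and 1 on each of the three topics.
sims = [topic_similarity(DT, 0, 1, t) for t in range(3)]
print(sims)
```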
There is an implementation in the pyLDAvis source code which returns the lists that may be helpful for building the matrix you are interested in.
Snippet from the _extract_data method in gensim.py:
def _extract_data(topic_model, corpus, dictionary, doc_topic_dists=None):
    ...
    ...
    ...
    return {'topic_term_dists': topic_term_dists, 'doc_topic_dists': doc_topic_dists,
            'doc_lengths': doc_lengths, 'vocab': vocab, 'term_frequency': term_freqs}
The number of topics for your model is fixed when you train it. You may be interested in the document-topic distribution to build the DT matrix. In that case, an approximate D × T count matrix can be obtained by scaling each row of doc_topic_dists by the matching entry of doc_lengths.
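A minimal sketch of that scaling, using small made-up lists in place of the doc_topic_dists and doc_lengths values returned by _extract_data. The result is an expected count per topic rather than the model's exact word-to-topic assignments, so treat it as an approximation.

```python
# Made-up stand-ins for the values returned by _extract_data.
doc_topic_dists = [
    [0.5, 0.5],    # document 0: topic proportions
    [0.25, 0.75],  # document 1
]
doc_lengths = [10, 8]  # number of tokens in each document

# Scale each row of proportions by the document length to get an
# approximate D x T matrix of per-topic word counts.
DT = [
    [proportion * length for proportion in row]
    for row, length in zip(doc_topic_dists, doc_lengths)
]
print(DT)  # [[5.0, 5.0], [2.0, 6.0]]
```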
Showing your code would be helpful, but if we were to go off of the example in the tutorial you linked then the model is identified by:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
You could put something like this into your script:
model_name = "name_of_my_model"
ldamodel.save(model_name)
Then when you run it, this will create a model in the same directory that the script is run from.
Then you can get the topic probability distribution of a bag-of-words document (doc_bow) with:
print(ldamodel[doc_bow])
If you want the similarity between a query document and the documents in the corpus, you need to transform the query document into the same LDA topic space and then compute the cosine similarity between the two:
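from gensim import corpora, models, similarities

# Load the dictionary, corpus and trained LDA model saved earlier
dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("name_of_my_model")

# Index every corpus document by its topic distribution
index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")

# Project the query document into the same LDA topic space
docname = "docs/the_doc.txt"
with open(docname, 'r') as f:
    doc = f.read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]

# Rank corpus documents by similarity to the query, most similar first
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)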
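MatrixSimilarity ranks documents by the cosine similarity between their vectors. The sketch below reproduces that ranking in plain Python on made-up dense topic vectors. The vectors and the helper name cosine are illustrative only, and real lda[...] output is a sparse list of (topic_id, probability) pairs rather than a dense list.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up topic distributions for three corpus documents and one query.
corpus_topics = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
query_topics = [0.8, 0.2]

# Score every corpus document against the query, most similar first.
sims = [cosine(query_topics, doc) for doc in corpus_topics]
ranked = sorted(enumerate(sims), key=lambda item: -item[1])
print(ranked)
```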
My use case is that I have a corpus of documents, and when new docs come in I update Dictionary and vectorize. The result should be a sparse matrix of TFIDF vectors, which I'm using corpus2csc for.
I think I have a solution, but I have seen other answers that suggest my solution is impossible, and I've seen some unexpected behavior. I'm seeking a gut-check on the approach, with some specific questions below.
Overall approach:
Use Dictionary.doc2bow with allow_update=True
Construct TFIDF model using that updated Dictionary.
Questions and example code below.
from gensim import corpora
from gensim import models
from gensim import matutils
docs = [['definition', 'addition', 'term', 'defined'],
['term', 'sweet', 'generally', 'subject'],
['gene', 'extent', 'provided', 'gene'],
['additional', 'cost', 'gene', 'adequacy'],
['initial', 'condition', 'sweet', 'effectiveness']]
gensim_dict = corpora.Dictionary([], prune_at=None) # start with empty dict because I don't know what the corpus looks like at this point
BoW_corpus0 = [gensim_dict.doc2bow(d, allow_update=True) for d in docs] # Add to the dictionary by setting allow_update=True
# documentation says: corpus (iterable of iterable of (int, number)) – Input corpus in BoW format
matutils.corpus2csc(corpus=BoW_corpus0) # Expected. A 16x5 matrix with 19 stored elements
matutils.corpus2csc(corpus=BoW_corpus0, num_terms=len(gensim_dict.keys()), num_docs=len(BoW_corpus0), num_nnz=sum([len(doc) for doc in BoW_corpus0])) # Also expected for the 'efficient' path
model = models.TfidfModel(BoW_corpus0)
vecs = model[BoW_corpus0]
matutils.corpus2csc(corpus=vecs) # Unexpected - asks for a BoW format, but this is TransformedCorpus. But ignoring that, it's expected -- 16x5 with 19 stored elements.
matutils.corpus2csc(corpus=vecs, num_terms=len(gensim_dict.keys()), num_docs=vecs.obj.num_docs, num_nnz=vecs.obj.num_nnz) # Expected
corpus2csc documentation asks for a BoW format, not a TransformedCorpus. Does it accept a TransformedCorpus just by chance?
Now I add new docs:
new_docs = [['provided', 'communication', 'provided', 'writing']]
BoW_corpus1 = [gensim_dict.doc2bow(d, allow_update=True) for d in new_docs]
len(gensim_dict.cfs) # 18
gensim_dict.num_nnz # 22
len(BoW_corpus1) # 1
matutils.corpus2csc(corpus=BoW_corpus1) # Expected. An 18x1 matrix, with 3 stored elements
matutils.corpus2csc(corpus=BoW_corpus1, num_terms=len(gensim_dict.keys()), num_docs=len(BoW_corpus1), num_nnz=sum([len(doc) for doc in BoW_corpus1])) # Expected. An 18x1 matrix, with 3 stored elements
# Let's use the original model. Ideally we would update the model with the new document but I'm not sure of the best way to do that.
new_vecs = model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs) # Unexpected. Using the original model, a 10x1 with 1 stored elements. Why not 18x1?
Why does this approach result in a 10x1 matrix?
# let's try using the Dictionary that's been updated on the fly
dict_based_model = models.TfidfModel(dictionary=gensim_dict)
new_vecs2 = dict_based_model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs2) # Expected. 18x1 with 3 stored elements.
matutils.corpus2csc(corpus=new_vecs2, num_terms=len(gensim_dict.keys()), num_docs=len(new_vecs2), num_nnz=sum([len(v) for v in new_vecs2])) # Also expected. 18x1 with 3 stored elements.
# But why are these not the same?
assert new_vecs2.obj.num_docs == len(new_vecs)
assert new_vecs2.obj.num_nnz == sum([len(v) for v in new_vecs2])
# Finally, let's make a model based on the new corpus, I know this isn't right but curious why the output is what it is
new_corpus_based_model = models.TfidfModel(BoW_corpus1)
new_vecs3 = new_corpus_based_model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs3) # 0x1 with 0 elements. Kinda expected. But I would have thought it would have produced a 2x1 or 3x1 matrix
Can you confirm that dict_based_model is the right approach?
What is the new_vecs2.obj all about?
Why do I get a 0x1 instead of a 2x1?
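Two of the surprising shapes can be reproduced without gensim, under two assumptions about its behaviour that are worth checking against the documentation for your version: that the default idf weight is log2(total_docs / doc_freq), and that corpus2csc, when num_terms is not given, infers the number of rows as the largest term id present plus one. The helper names and the term ids below are hypothetical, and normalisation of the weights is left out.

```python
import math


def idf(total_docs, doc_freq):
    """Assumed default idf weight: log2(total_docs / doc_freq)."""
    return math.log2(total_docs / doc_freq)


def tfidf(bow, total_docs, dfs):
    """Weight a bag-of-words vector; drop unknown terms and zero weights."""
    weighted = [(term_id, tf * idf(total_docs, dfs[term_id]))
                for term_id, tf in bow if term_id in dfs]
    return [(term_id, w) for term_id, w in weighted if w != 0.0]


def inferred_num_terms(vec):
    """Assumed shape inference: largest term id seen, plus one."""
    return max(term_id for term_id, _ in vec) + 1 if vec else 0


# Hypothetical ids for the new document's three distinct terms.
new_doc = [(9, 2), (16, 1), (17, 1)]

# Model fitted on this one document: every idf is log2(1/1) = 0.
vec = tfidf(new_doc, total_docs=1, dfs={9: 1, 16: 1, 17: 1})
print(vec, inferred_num_terms(vec))

# Model fitted on five earlier documents: it knows term 9 but not 16 or 17.
old_vec = tfidf(new_doc, total_docs=5, dfs={9: 1})
print(old_vec, inferred_num_terms(old_vec))
```

If these assumptions hold, passing num_terms explicitly, as the "efficient" calls in the question already do, keeps the row count tied to the dictionary size rather than to whichever ids happen to survive weighting.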
I have this function to apply a topic model to every document. However, I want to extract the words that have more than, e.g., a 70% probability of contributing to each topic and write them to a CSV file. How can I do so?
I have heard Mallet does this for you, but I am using Gensim.
The code I have at the moment looks like this:
import gensim
import nltk
import pandas as pd
from gensim import corpora

def topic_model_new(grid_document):
    ''' this function is used to conduct topic modelling for each grid/document '''
    df = pd.read_csv(grid_document)
    tokens = df['english_only_tags'].astype(str).apply(nltk.word_tokenize)
    # convert tokenized lists into dictionary
    dictionary = corpora.Dictionary(tokens)
    # create document term matrix
    doc_term_matrix = [dictionary.doc2bow(tag) for tag in tokens]
    # initialise topic model from gensim
    LDA = gensim.models.ldamodel.LdaModel
    # build and train topic model
    lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=8, random_state=100,
                    chunksize=400, passes=50, iterations=100)
    return lda_model
Any help appreciated. I am a beginner in Python and NLP so please do write out the code and explain it as much as possible! Thanks!
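One possible approach, sketched on made-up data: gensim's show_topic(topic_id, topn=...) returns (word, probability) pairs for a topic, and those pairs can be filtered against a threshold and written out as CSV rows. The topic_words dictionary below stands in for that output, and words_above_threshold is a hypothetical helper. Note that a single word rarely carries 70% of a topic's probability mass, so the threshold may need to be much lower in practice.

```python
import csv
import io

# Stand-in for {topic_id: lda_model.show_topic(topic_id, topn=...)}.
topic_words = {
    0: [("beach", 0.75), ("sea", 0.15), ("sand", 0.10)],
    1: [("city", 0.45), ("street", 0.35), ("night", 0.20)],
}


def words_above_threshold(topics, threshold):
    """Return (topic_id, word, probability) rows with probability > threshold."""
    return [(topic_id, word, prob)
            for topic_id, pairs in sorted(topics.items())
            for word, prob in pairs
            if prob > threshold]


rows = words_above_threshold(topic_words, threshold=0.7)

# Write to an in-memory buffer here; pass an open file to write to disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["topic_id", "word", "probability"])
writer.writerows(rows)
print(buffer.getvalue())
```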
I have built a word2vec model from the sentences of several scientific research papers and retrieved the top 100 most similar words for the term "online_platform":
# Create a model with a context window of 4 words (vector size left at the
# library default). Only consider words that appear at least 10 times.
# Use 3 CPU cores for estimating the model.
model_online_platform = word2vec.Word2Vec(sentences_online_platform, window = 4, min_count = 10, workers=3)
online_platform_most_similar = model_online_platform.wv.most_similar("online_platform", topn=100)
Here are some of the most similar words:
most_similar_words=[('event', 0.6746241450309753),
('date', 0.612309992313385),
('stems', 0.6009981632232666),
('draws', 0.5935811996459961),
('company', 0.5848336219787598), ...]
From this list of tuples I retrieved only the words, leaving out the similarity scores:
first_tuple_elements = ['event', 'date', 'stems', 'draws', 'company',...]
Then I used a pre-trained word-vector model (glove-wiki-gigaword-300) to get the vector representation of each word from this pre-trained model instead of from my own word2vec model:
vectores = []
for word in first_tuple_elements:
    # look up the loop variable, not the literal string "word"
    vectores.append(word_vectors_wiki_gigaword[word])
print(vectores)
This gave me a list of arrays, one for each word.
Right now I am struggling to link the arrays (vector representations) to their associated words.
What I want to have is a list of tuples that I can then use for k-means clustering:
[
('event', array([...,...,..])),
('date', array([...,...,..])),
etc....
]
I think that with this kind of list of tuples I might be able to do k means clustering, but I also might be wrong...
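A minimal sketch of the pairing, with short made-up lists standing in for the 300-dimensional vectors. The list of tuples keeps each word next to its vector; clustering routines generally expect just the matrix of vectors, so the sketch also splits the pairs back into aligned labels and rows.

```python
# Made-up stand-ins; in the question the vectors come from
# word_vectors_wiki_gigaword[word] and have 300 dimensions.
words = ["event", "date", "stems"]
fake_vectors = {
    "event": [0.1, 0.2, 0.3],
    "date": [0.0, 0.5, 0.1],
    "stems": [0.9, 0.8, 0.7],
}

# List of (word, vector) tuples, skipping words missing from the vocabulary.
word_vector_pairs = [(word, fake_vectors[word])
                     for word in words if word in fake_vectors]

# Split into aligned labels and a matrix for a clustering routine.
labels = [word for word, _ in word_vector_pairs]
matrix = [vector for _, vector in word_vector_pairs]
print(labels)
print(matrix[0])
```

Because labels and matrix share the same order, the cluster assigned to row i belongs to labels[i].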
If I am using python Sklearn for LDA topic modeling, I can use the transform function to get a "document topic distribution" of the LDA-results like here:
document_topic_distribution = lda_model.transform(document_term_matrix)
Now I have also tried the R structural topic models (stm) package and I want to get the same thing. Is there any function in the stm package that can produce the same output (the document-topic distribution)?
I have the stm-object created as follows:
stm_model <- stm(documents = out$documents, vocab = out$vocab,
K = number_of_topics, data = out$meta,
max.em.its = 75, init.type = "Spectral" )
But I couldn't find out how to get the desired distribution out of this object. The documentation didn't really help me either.
As emilliman5 pointed out, your stm_model provides access to the underlying parameters of the model, as is shown in the documentation.
Indeed, the theta parameter is a
Number of Documents by Number of Topics matrix of topic proportions.
This requires some linguistic parsing: it is an N_DOCS by N_TOPICS matrix, i.e. it has N_DOCS rows, one per document, and N_TOPICS columns, one per topic. The values are the topic proportions, i.e. if stm_model$theta[1, ] == c(.3, .2, .5), that means Document 1 is 30% Topic 1, 20% Topic 2 and 50% Topic 3.
To find out what topic dominates a document, you have to find the (column!) index of the maximum value, which can be retrieved e.g. by calling apply with MARGIN=1, which basically says "do this row-wise"; which.max simply returns the index of the maximum value:
apply(stm_model$theta, MARGIN=1, FUN=which.max)
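For comparison with the Python side of the question, the same row-wise argmax can be sketched in plain Python on a made-up theta matrix. The result is shifted by one so that it matches R's 1-based which.max.

```python
# Made-up document-topic proportions: one row per document.
theta = [
    [0.3, 0.2, 0.5],
    [0.6, 0.3, 0.1],
]

# Index of the largest value in each row, 1-based like which.max in R.
dominant_topic = [max(range(len(row)), key=row.__getitem__) + 1 for row in theta]
print(dominant_topic)  # [3, 1]
```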
I am applying the LDA method using Gensim to extract keywords from documents.
I can extract topics, and then assign these topics and their associated keywords to the documents.
I would like to have the ids of these terms (or keywords) instead of the terms themselves. I know that corpus[i] returns a list of pairs [(term_id, term_frequency) ...] for document i, but I can't see how I could use this in my code to extract only the ids and assign them to my results.
My code is as follows:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=passes, minimum_probability=0)
# Assigning the topics to the documents in the corpus
lda_corpus = ldamodel[corpus]
# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id,score in topic] \
for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores)/len(scores)
print(threshold)
for t in range(len(topic_tuple)):
    key_words.append([topic_tuple[t][j][0] for j in range(num_words)])
    df_key_words = pd.DataFrame({'key_words' : key_words})
    documents_corpus.append([j for i,j in zip(lda_corpus,doc_set) if i[t][1] > threshold])
    df_documents_corpus = pd.DataFrame({'documents_corpus' : documents_corpus})
    documents_corpus_id.append([i for d,i in zip(lda_corpus, doc_set_id) if d[t][1] > threshold])
    df_documents_corpus_id = pd.DataFrame({'documents_corpus_id' : documents_corpus_id})
    result.append(pd.concat([df_key_words, df_documents_corpus, df_documents_corpus_id ], axis=1))
Thank you in advance, and let me know if more information is needed.
In case someone has the same issue that I had, the answer is to build a reverse map from term to id:
reverse_map = dict((ldamodel.id2word[id],id) for id in ldamodel.id2word)
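The same inversion on a plain dict, as a small sketch with made-up ids and terms. If dictionary is a gensim corpora.Dictionary, its token2id attribute already holds this term-to-id mapping.

```python
# Made-up id -> term mapping, standing in for ldamodel.id2word.
id2word = {0: "gene", 1: "sweet", 2: "term"}

# Invert it so that terms can be looked up by name.
reverse_map = {term: term_id for term_id, term in id2word.items()}

# Translate a list of keywords into their ids.
key_words = ["term", "gene"]
key_word_ids = [reverse_map[term] for term in key_words]
print(key_word_ids)  # [2, 0]
```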
Thanks to bigdeeperadvisors