How to classify new documents with tf-idf? - python

If I use the TfidfVectorizer from sklearn to generate feature vectors as:
features = TfidfVectorizer(min_df=0.2, ngram_range=(1,3)).fit_transform(myDocuments)
How would I then generate feature vectors to classify a new document, since you can't calculate the tf-idf for a single document on its own?
Would it be a correct approach, to extract the feature names with:
feature_names = TfidfVectorizer.get_feature_names()
and then count the term frequency for the new document according to the feature_names?
But then I won't get the weights that carry the information about a word's importance.

You need to save the instance of the TfidfVectorizer; it will remember the vocabulary and the idf weights that were learned when it was fitted. It may make things clearer if, rather than using fit_transform, you use fit and transform separately:
vec = TfidfVectorizer(min_df=0.2, ngram_range=(1,3))
vec.fit(myDocuments)
features = vec.transform(myDocuments)
new_features = vec.transform(myNewDocuments)
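From there, classification is just a matter of feeding those vectors to a classifier; a minimal sketch (assuming scikit-learn's LinearSVC and a hypothetical list myLabels with one label per training document):
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(features, myLabels)              # train on the tf-idf features
predictions = clf.predict(new_features)  # classify the new documents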

I would rather use gensim with Latent Semantic Indexing (LSI) as a wrapper over the original corpus: bow -> tfidf -> lsi
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
Then if you need to continue the training:
new_tfidf = models.TfidfModel(new_corpus) # new_corpus: bag-of-words for the new documents
another_tfidf_corpus = new_tfidf[new_corpus]
lsi.add_documents(another_tfidf_corpus) # now LSI has been trained on corpus_tfidf + another_tfidf_corpus
lsi_vec = lsi[tfidf_vec] # convert some new document into the LSI space
Here corpus is the bag-of-words representation of the original documents.
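For a single new document, a minimal sketch of producing the tfidf_vec used above (the example text is illustrative):
# tokens -> bag-of-words (same dictionary) -> tf-idf -> LSI space
new_doc = "human computer interaction"
bow_vec = dictionary.doc2bow(new_doc.lower().split())
tfidf_vec = tfidf[bow_vec]
lsi_vec = lsi[tfidf_vec]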
As you can read in their tutorials:
LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!
If you like scikit-learn, gensim is also compatible with numpy.


How to get the nearest documents for a word in gensim in python

I am using the doc2vec model as follows to construct my document vectors.
import json

from gensim.models import doc2vec
from collections import namedtuple

dataset = json.load(open(input_file))

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))

model = doc2vec.Doc2Vec(docs, vector_size=100, window=10, min_count=1, workers=4, epochs=20)
I have seen that gensim doc2vec also includes word vectors. Suppose I have a word vector created for the phrase "deep learning". My question is: is it possible to get the documents nearest to the "deep learning" word vector in gensim in python?
I am happy to provide more details if needed.
Some Doc2Vec modes will co-train doc-vectors and word-vectors in the "same space". Then, if you have a word-vector for 'deep_learning', you can ask for documents near that vector, and the results may be useful for you. For example:
similar_docs = d2v_model.docvecs.most_similar(
    positive=[d2v_model.wv['deep_learning']]
)
But:
that will only be as good as how well your model learned 'deep_learning' as a word meaning what you think it means
a training set of known-good documents fitting the category 'deep_learning' (and other categories) could be better - whether you hand-curate those, or try to bootstrap from other sources (like say the Wikipedia category 'Deep Learning' or other curated/search-result sets that you trust).
reducing a category to a single summary point (one vector) may not be as good as having a variety of examples – many points - that all fit the category. (Relevant docs may not be a neat sphere around a summary point, but rather populate exotically-shaped regions of the doc-vector high-dimensional space.) If you have a lot of good examples of each category, you could train a classifier to then label, or rank-in-relation-to-trained-categories, any further uncategorized docs.
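As a rough sketch of that last idea (purely illustrative: scikit-learn's LogisticRegression, plus hypothetical labeled_tags / unlabeled_tags collections you would assemble yourself; none of this is in the answer above):
from sklearn.linear_model import LogisticRegression

# labeled_tags: hypothetical dict mapping a doc tag -> category label
train_X = [d2v_model.docvecs[tag] for tag in labeled_tags]
train_y = [labeled_tags[tag] for tag in labeled_tags]
clf = LogisticRegression().fit(train_X, train_y)

# rank uncategorized docs by predicted probability per category
probs = clf.predict_proba([d2v_model.docvecs[tag] for tag in unlabeled_tags])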

Adding feature to k-means

I am trying to use k-means clustering to classify text documents. Is it possible to take a set of documents, tf-idf vectorize them, and perform the clustering, then add more documents to be classified?
This is what I have so far
true_k = 4
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
How would I add more documents to X? Because I would like to pickle X and save it.
Actually this is pretty simple (contrary to the accepted answer, which suggests it is complex; it is not). Just concatenate your data and reuse the same vectorizer (if you create a new one, or refit the old one, as suggested in the accepted answer, it will change its estimates and you will consequently get a different feature space), so you have to pickle the vectorizer too:
true_k = 4
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
Now, when you get new data, documents2, simply do:
from scipy.sparse import vstack

X2 = vectorizer.transform(documents2)
X = vstack((X, X2))  # X is sparse, so use scipy.sparse.vstack rather than np.vstack
model.fit(X) # ideally you would warm-start from the previous solution, but sklearn does not yet support it
However, remember that this assumes your first batch of documents was already representative of the whole dataset. In other words, you limit yourself to the words seen in the first documents, and the idf normalization will not be refitted either. You could remove both limitations, but you would have to implement your own online tf-idf vectorizer that can update its estimates. That is not hard to do, but after each new batch of documents you would also have to update the previous ones (as the idf part would change). An easier solution is to keep only a CountVectorizer and update that, computing the "idf" part independently and applying it on top just before k-means; see the sketch below.
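A rough sketch of that count-plus-separate-idf idea (illustrative, not a full online tf-idf vectorizer; note the vocabulary is still fixed by the first batch):
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
counts = cv.fit_transform(documents)                # raw term counts, batch 1
n_docs = counts.shape[0]
df = np.asarray((counts > 0).sum(axis=0)).ravel()   # document frequencies

# a new batch arrives: transform with the same vocabulary, update the stats
counts2 = cv.transform(documents2)
counts = vstack((counts, counts2)).tocsr()
n_docs += counts2.shape[0]
df += np.asarray((counts2 > 0).sum(axis=0)).ravel()

# recompute idf over the full collection and apply it (smooth idf, as in sklearn)
idf = np.log((1 + n_docs) / (1 + df)) + 1
X = counts.multiply(idf.reshape(1, -1)).tocsr()     # tf-idf-style features for KMeans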
The problem is that your X feature matrix is of the shape [n_docs, n_features]. Hence, if you create a new feature matrix with new documents, you would have to make sure that the new feature matrix (X2) has exactly the same features as X. I cannot imagine an application where this is feasible.
But if you know that both have the same feature space, you can use scipy.sparse.vstack to append the new documents to your feature matrix:
from scipy.sparse import vstack
X = vstack((X, X2))
EDIT: To ensure the same feature space in X2, you can use the vocabulary keyword argument in TfidfVectorizer, e.g.:
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer = vectorizer.fit(documents)
X = vectorizer.transform(documents)
# do whatever with X
new_vectorizer = TfidfVectorizer(stop_words='english', vocabulary=vectorizer.vocabulary_)
X2 = new_vectorizer.fit_transform(new_documents)
X = vstack((X, X2))
That means, on top of saving X you also need to store vectorizer.vocabulary_.
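For the pickling part, a minimal sketch (the file name is illustrative):
import pickle

# save the feature matrix together with the vocabulary that defines its columns
with open('tfidf_state.pkl', 'wb') as f:
    pickle.dump({'X': X, 'vocabulary': vectorizer.vocabulary_}, f)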

Python Gensim: how to calculate document similarity using the LDA model?

I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on.
After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks!
Depends what similarity metric you want to use.
Cosine similarity is universally useful & built-in:
sim = gensim.matutils.cossim(vec_lda1, vec_lda2)
Hellinger distance is useful for similarity between probability distributions (such as LDA topics):
import numpy as np
dense1 = gensim.matutils.sparse2full(lda_vec1, lda.num_topics)
dense2 = gensim.matutils.sparse2full(lda_vec2, lda.num_topics)
sim = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())
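Recent gensim versions also ship a ready-made helper, which should match the manual computation above (assuming gensim.matutils.hellinger is available in your version):
from gensim.matutils import hellinger

sim = hellinger(dense1, dense2)  # same Hellinger distance as the manual version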
Don't know if this'll help, but I managed to attain successful results on document matching and similarity when using the actual document as a query.
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("model.lda") # result from running online LDA (training)

index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")

docname = "docs/the_doc.txt"
doc = open(docname, 'r').read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]

sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
The similarity scores between the query document and every document in the corpus are in sims: each entry is a (document_index, similarity_score) pair.
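For example, to pull out the best match:
best_doc_id, best_score = sims[0]  # sims is sorted by descending similarity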
The provided answers are good, but they aren't very beginner-friendly. I want to start from training the LDA model and then calculate cosine similarity.
Training model part:
docs = ["latent Dirichlet allocation (LDA) is a generative statistical model",
"each document is a mixture of a small number of topics",
"each document may be viewed as a mixture of various topics"]
# Convert document to tokens
docs = [doc.split() for doc in docs]
# A mapping from token to id in each document
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)
# Representing the corpus as a bag of words
corpus = [dictionary.doc2bow(doc) for doc in docs]
# Training the model
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)
For extracting the probability assigned to each topic for a document, there are generally two ways. I provide both here:
# The test documents get the same preprocessing as the training documents
test_doc = ["LDA is an example of a topic model",
            "topic modelling refers to the task of identifying topics"]
test_doc = [doc.split() for doc in test_doc]
test_corpus = [dictionary.doc2bow(doc) for doc in test_doc]
# Method 1
from gensim.matutils import cossim
doc1 = model.get_document_topics(test_corpus[0], minimum_probability=0)
doc2 = model.get_document_topics(test_corpus[1], minimum_probability=0)
print(cossim(doc1, doc2))
# Method 2
doc1 = model[test_corpus[0]]
doc2 = model[test_corpus[1]]
print(cossim(doc1, doc2))
output:
#Method 1
0.8279631530869963
#Method 2
0.828066885140262
As you can see, both methods are generally the same; the difference is that the probabilities returned by the second method sometimes don't add up to one, as discussed here.
For a large corpus, the probability vectors can be obtained by passing the whole corpus:
#Method 1
probability_vector = model.get_document_topics(test_corpus, minimum_probability=0)
#Method 2
probability_vector = model[test_corpus]
NOTE: The sum of the probabilities assigned to the topics of a document may be a bit higher than 1, or in some cases a bit less than 1. That is because of floating-point rounding errors.

LDA model generates different topics every time I train on the same corpus

I am using python gensim to train a Latent Dirichlet Allocation (LDA) model on a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics.
Why do the same LDA parameters and corpus generate different topics every time?
And how do I stabilize the topic generation?
I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj) and here's my code:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import defaultdict
import codecs, os, glob, math

stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i[0] != "#" and i != ""]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]
    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)
    # Generate word-only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))
    return lda, corpus_lda, top_clusters, top_wordonly

#######################################################################
# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)

texts = [[word for word in document.lower().split() if word not in stopwords]
         for document in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)
for i in topic_wordonly:
    print i
Why do the same LDA parameters and corpus generate different topics every time?
Because LDA uses randomness in both training and inference steps.
And how do I stabilize the topic generation?
By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, with numpy.random.seed:
import numpy as np

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)
(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
Set the random_state parameter when initializing LdaModel:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=1,
                                            passes=num_passes,
                                            alpha='auto')
I had the same problem, even with about 50,000 comments. But you can get much more consistent topics by increasing the number of iterations the LDA runs for. It defaults to 50, and when I raise it to 300 it usually gives me the same results, probably because it is much closer to convergence.
Specifically, you just add the following option:
ldamodel.LdaModel(corpus, ..., iterations=<your desired iterations>)
This is due to the probabilistic nature of LDA, as noted by others. However, I don't believe setting the random_state argument to a fixed number is the proper solution.
Definitely try increasing the number of iterations first to make sure your algorithm is converging. Even then, each starting point may land you on a different local minimum. So you can run LDA multiple times without setting random_state, and then compare the results using the coherence score of each model. This helps you avoid suboptimal local minima.
Gensim's CoherenceModel already has the most common coherence metrics implemented for you, such as c_v, u_mass, and c_npmi.
You might find that these suggestions make the results more stable, but they won't actually guarantee the same results from run to run. However, it's better to get as close to the global optimum as possible than to be stuck on the same local minimum because of a fixed random_state, IMO.
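A hypothetical sketch of that run-and-compare approach (the loop count is arbitrary; assumes the corpus, dictionary, and tokenized texts from the question):
from gensim.models import LdaModel, CoherenceModel

best_model, best_score = None, float('-inf')
for _ in range(5):  # train several candidates with different random starts
    candidate = LdaModel(corpus=corpus, id2word=dictionary,
                         num_topics=50, iterations=300)
    score = CoherenceModel(model=candidate, texts=texts, dictionary=dictionary,
                           coherence='c_v').get_coherence()
    if score > best_score:
        best_model, best_score = candidate, score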
