I am using online LDA to perform a topic modeling task. The core code is based on the original online LDA paper (Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation", NIPS 2010) and is available at https://github.com/blei-lab/onlineldavb.
I am using a training set of ~167,000 documents. The code outputs lambda files, which I use to generate the topics (https://github.com/wellecks/online_lda_python, printtopics.py). However, I am not sure how to use the trained model to infer topics for new test data (something similar to model.get_document_topics in gensim).
Please help me resolve this confusion.
Follow the same preprocessing steps on the test data (tokenization etc.) and then use your training vocabulary to transform the test data into a gensim corpus.
Once you have the test corpus, use the trained LDA model to infer the document-topic distribution, as in the sketch below. Hope this helps.
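A minimal, self-contained gensim sketch of that recipe; the toy corpus and the preprocessing (lower-casing and whitespace splitting) are only illustrative stand-ins for your own pipeline:

from gensim import corpora
from gensim.models import LdaModel

# toy training corpus (stand-in for the ~167k preprocessed documents)
train_docs = [["topic", "modeling", "with", "lda"],
              ["online", "variational", "bayes", "for", "lda"],
              ["document", "topic", "distribution"]]
dictionary = corpora.Dictionary(train_docs)
corpus = [dictionary.doc2bow(doc) for doc in train_docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# new test document: same preprocessing, then the *training* dictionary
test_tokens = "new document about topic modeling".lower().split()
test_bow = dictionary.doc2bow(test_tokens)   # words unseen in training are dropped
print(lda.get_document_topics(test_bow, minimum_probability=0.0))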
The code you already have is enough to do this. What you have is lambda (the word-topic matrix); what you want to compute is gamma (the document-topic matrix).
All you need to do is call OnlineLDA.do_e_step on the new documents; the returned gamma rows are the topic vectors. Performance might be improved by stripping out the sstats computation, since the sufficient statistics are only needed to update lambda. The result would be a function that only infers topic vectors under the fixed model (see the sketch below).
You don't need to update the model, since you aren't training it; updating is what update_lambda does after calling do_e_step.
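A rough sketch of that idea. It assumes the onlineldavb API in which OnlineLDA(vocab, K, D, alpha, eta, tau0, kappa) is the constructor and do_e_step takes a list of raw document strings and returns (gamma, sstats); the file names, K, and the hyperparameters are placeholders, so check them against your copy of onlineldavb.py and the lambda files you actually saved:

import numpy as np
import onlineldavb   # Hoffman's onlineldavb.py, assumed to be on the Python path

# vocabulary file and lambda file produced during training (names are placeholders)
vocab = open('./dictnostops.txt').read().split()
K = 100          # number of topics the model was trained with
D = 167000       # (approximate) number of training documents

# rebuild the model object and plug the saved word-topic matrix back in
olda = onlineldavb.OnlineLDA(vocab, K, D, 1.0/K, 1.0/K, 1024., 0.7)
olda._lambda = np.loadtxt('lambda-final.dat')
olda._Elogbeta = onlineldavb.dirichlet_expectation(olda._lambda)
olda._expElogbeta = np.exp(olda._Elogbeta)

# E-step only: gamma is the (num_docs x K) document-topic matrix
test_docs = ["some new unseen document text", "another unseen document"]
gamma, _sstats = olda.do_e_step(test_docs)
doc_topics = gamma / gamma.sum(axis=1)[:, np.newaxis]   # normalize to proportions
print(doc_topics)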
I built an LDA topic model using a large training data set. Now I want to use this LDA model to classify a new sentence that was not part of the training data.
How can I find the closest topic number for a new input sentence?
Should I use the LDA topic model's output as the input to a classification model?
Feel free to share example code in Python.
In classification problems, since the ground-truth label is known, we only need to consider how to extract features from the training data. For LDA, the feature is usually the topic probability distribution, i.e. if there are 5 topics in the corpus, then the feature vector has dimension 5, and that should be a better feature than the closest topic number (the most probable topic) alone.
For how to get the topic probability distribution for new input sentences, you can take a look here (a sketch is also included below); other packages should have similar functions.
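A minimal gensim sketch of turning a new sentence into a dense topic-probability feature vector; the toy corpus and the 5 topics are only illustrative:

from gensim import corpora
from gensim.models import LdaModel

# toy corpus standing in for the real training data
train_docs = [["cat", "sat", "mat"], ["dog", "barked", "loudly"],
              ["stocks", "rose", "sharply"], ["market", "fell", "today"]]
dictionary = corpora.Dictionary(train_docs)
corpus = [dictionary.doc2bow(d) for d in train_docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=5, passes=10)

# new sentence -> dense 5-dimensional topic-probability feature vector
new_sentence = "the market and stocks"
bow = dictionary.doc2bow(new_sentence.lower().split())
dist = sorted(lda.get_document_topics(bow, minimum_probability=0.0))
features = [prob for _topic_id, prob in dist]   # use this as the classifier input
print(features)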
How can we use an ANN to find similar documents? I know it's a silly question, but I am new to the NLP field.
I have built a model using kNN and a bag-of-words approach to solve my problem. With it I can get the n documents (along with their closeness) that are most similar to the input, but now I want to implement the same thing using an ANN and I have no idea how to start.
Thanks in advance for any help or suggestions.
You can use "word embeddings" - technique, that presents words in the dense vector representation. To find similar documents as the vectors, you can simply use cosine similarity.
An example how to build word2vec model using TensorFlow. One more example how to use embeddings layer from Keras.
The way to obtain embeddings for your language is either training them yourself on your corpus of choice (large enough - e.g. wikipedia) or downloading the trained embeddings (for python there are plenty of sources for embeddings trained or loadable with gensim module - which is a de facto standard for Python word2vec).
You can also use GloVe (using glove-python) or FastText word embeddings.
If you are interested you can find more detailed descriptions of embeddings with code examples and source papers.
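A minimal sketch of the averaged-embedding approach with cosine similarity, assuming gensim 4.x (vector_size/epochs; older versions use size/iter); the toy documents stand in for a real corpus or pre-trained vectors:

import numpy as np
from gensim.models import Word2Vec

# toy documents; in practice train on a large corpus or load pre-trained vectors
docs = [["machine", "learning", "is", "fun"],
        ["deep", "learning", "for", "nlp"],
        ["the", "cat", "sat", "on", "the", "mat"]]
model = Word2Vec(docs, vector_size=50, min_count=1, epochs=50)

def doc_vector(tokens):
    # average the vectors of the tokens the model knows about
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = ["neural", "learning"]          # unseen words are simply skipped
q = doc_vector(query)
for doc in sorted(docs, key=lambda d: cosine(q, doc_vector(d)), reverse=True):
    print(round(cosine(q, doc_vector(doc)), 3), " ".join(doc))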
Have a look at the paper https://arxiv.org/pdf/1805.10685.pdf, which gives you an overall idea.
Check this link for more references: https://github.com/Hironsan/awesome-embedding-models
I'm trying to build a vectorizer for a text mining problem. The vocabulary should be fitted from given files. However, the number of files that will build the dictionary vocabulary_ is relatively large (say 10^5). Is there a simple way to parallelize that?
Update: as I found out, there is a "manual" way. Unfortunately, it only works for min_df=1. Let me describe, by example, what I do for two cores:
Split your input into two chunks. Train two vectorizers (say vec1 and vec2), each on one core and on one chunk of your data (I used multiprocessing.Pool). Then
# Use sets to dedupe tokens
vocab = set(vec1.vocabulary_) | set(vec2.vocabulary_)
# Create the final vectorizer with the merged vocabulary
final_vec = CountVectorizer(vocabulary=vocab)
# Build the dictionary final_vec.vocabulary_ (note: _validate_vocabulary is private)
final_vec._validate_vocabulary()
will do the job. A complete sketch with multiprocessing.Pool follows.
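Here is a self-contained sketch of that two-core recipe, assuming min_df=1 and a top-level helper function so multiprocessing can pickle it; the toy corpus is only illustrative:

from multiprocessing import Pool
from sklearn.feature_extraction.text import CountVectorizer

def fit_vocab(texts):
    # fit a throwaway vectorizer on one chunk and return its vocabulary keys
    vec = CountVectorizer()          # min_df must stay at 1 for the merge to be exact
    vec.fit(texts)
    return set(vec.vocabulary_)

if __name__ == '__main__':
    docs = ["some text here", "more text there", "even more text", "and so on"]
    chunks = [docs[::2], docs[1::2]]            # split the input into two chunks
    with Pool(2) as pool:
        vocabs = pool.map(fit_vocab, chunks)    # one vectorizer per core
    vocab = set.union(*vocabs)                  # merge and dedupe the token sets
    final_vec = CountVectorizer(vocabulary=vocab)
    final_vec._validate_vocabulary()            # builds final_vec.vocabulary_
    X = final_vec.transform(docs)               # vectorize with the merged vocabulary
    print(X.shape)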
You can use MLlib, the machine learning library included in Apache Spark, which will handle the distribution across nodes.
Here's a tutorial on how to use it for feature extraction: https://spark.apache.org/docs/latest/mllib-feature-extraction.html
You can also check the scikit-learn documentation on "How to optimize for speed" to get some inspiration.
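For instance, a minimal PySpark sketch using the DataFrame-based pyspark.ml.feature API (the newer counterpart of the RDD-based mllib API); it assumes a working local Spark installation, and the toy data is illustrative:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer

spark = SparkSession.builder.appName("vocab-fit").getOrCreate()
df = spark.createDataFrame([("some example text",), ("more example text",)], ["text"])

# tokenize and fit the vocabulary in a distributed fashion
tokens = Tokenizer(inputCol="text", outputCol="tokens").transform(df)
cv_model = CountVectorizer(inputCol="tokens", outputCol="features").fit(tokens)

print(cv_model.vocabulary)                        # the learned vocabulary
cv_model.transform(tokens).select("features").show(truncate=False)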
I am using gensim version 0.12.4 and have trained two separate word embeddings using the same text and the same parameters. After training, I compute the Pearson correlation between word occurrence frequency and vector length. One model I saved using save_word2vec_format(fname, binary=True) and then loaded using load_word2vec_format; the other I saved using model.save(fname) and then loaded using Word2Vec.load(). I understand that the word2vec algorithm is non-deterministic, so the results will vary, but the difference in the correlation between the two models is quite drastic. Which method should I be using in this instance?
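For reference, this is roughly what the two save/load paths and the correlation computation look like. The sketch uses a recent gensim (4.x) API, so names such as get_vecattr and index_to_key differ from 0.12.4, and the toy corpus and file names are illustrative:

import numpy as np
from scipy.stats import pearsonr
from gensim.models import Word2Vec, KeyedVectors

docs = [["some", "training", "text", "here"], ["more", "training", "text"]] * 200
model = Word2Vec(docs, vector_size=50, min_count=1, epochs=5)

# path 1: native gensim save/load (keeps the full model, including word counts)
model.save("model.gensim")
kv1 = Word2Vec.load("model.gensim").wv

# path 2: word2vec C binary format (stores only words and vectors)
model.wv.save_word2vec_format("vectors.bin", binary=True)
kv2 = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Pearson correlation between word frequency and vector norm, on path 1
freqs = [kv1.get_vecattr(w, "count") for w in kv1.index_to_key]
norms = [np.linalg.norm(kv1[w]) for w in kv1.index_to_key]
print(pearsonr(freqs, norms)[0])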
Regarding "correlation between the word occurrence-frequency and vector-length": I don't quite follow. Aren't all your vectors the same length? Or are you not referring to the embedding vectors?
I've got BOW vectors and I'm wondering whether there's a supervised dimensionality reduction algorithm in sklearn or gensim that can take high-dimensional, labeled data and project it into a lower-dimensional space that preserves the variance between the classes.
Actually I'm trying to find a proper metric for classification/regression, and I believe dimensionality reduction can help me. I know there are unsupervised methods, but I want to keep the label information along the way.
fastText, an implementation from Facebook Research, essentially helps you achieve what you are asking for. Since you were asking about gensim, I assume you are aware of word2vec in gensim.
word2vec was proposed by Mikolov while at Google. Mikolov and his team at Facebook have since come up with fastText, which takes into account both word and sub-word information. It also supports text classification; a small example follows.
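A minimal sketch using the fasttext Python package (pip install fasttext); the training file name and labels are hypothetical, and the file follows fastText's __label__ convention:

import fasttext

# train.txt is a hypothetical file with one labeled example per line, e.g.:
#   __label__sports the team won the final match
#   __label__finance stocks rallied after the earnings report
model = fasttext.train_supervised(input="train.txt", epoch=25, wordNgrams=2)

# predict the label (and its probability) for a new sentence
label, prob = model.predict("the market closed higher today")
print(label, prob)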
You can only perform dimensionality reduction in an unsupervised manner, or in a supervised manner but with labels different from your target labels.
For example, you could train a logistic regression classifier on a dataset containing 100 topics. The output of this classifier (100 values) on your training data could then serve as your dimensionality-reduced feature set, as sketched below.
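A minimal scikit-learn sketch of that idea, with synthetic data and 10 topics standing in for the 100 (make_classification just fakes a BOW-like matrix):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy stand-in: 1000 "documents", 2000 BOW-like features, 10 auxiliary "topics"
X, topic_labels = make_classification(n_samples=1000, n_features=2000,
                                       n_informative=50, n_classes=10,
                                       n_clusters_per_class=1, random_state=0)

# train a classifier against the auxiliary (topic) labels ...
clf = LogisticRegression(max_iter=1000).fit(X, topic_labels)

# ... and use its per-topic probabilities as a 10-dimensional feature set
X_reduced = clf.predict_proba(X)
print(X.shape, "->", X_reduced.shape)        # (1000, 2000) -> (1000, 10)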