I built an LDA topic model using a large training data set. Now I want to use this LDA model to classify new sentences that were not part of the training data.
How can I find the closest (most probable) topic number for a new input sentence?
Should I use LDA Topic Models as a Classification Model Input?
Feel free to share example code in Python.
In classification problems, since the ground-truth label is known, we only need to consider how to extract features from the training data. For LDA, the features are usually the topic probability distribution, i.e. if there are 5 topics in the corpus, then the feature vector has dimension 5, and that should be a better feature than the closest topic number (the most probable topic) alone.
For how to get the topic probability distribution for new input sentences, you can take a look here; other packages should have similar functions as well.
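For example, with gensim, a minimal sketch (the toy corpus and the naive whitespace tokenizer below are only placeholders for your own training corpus and preprocessing pipeline):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Tiny illustration; in practice `texts` is your big tokenized training corpus
texts = [["cat", "dog", "pet"], ["python", "code", "bug"], ["dog", "bark", "pet"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# New sentence not seen during training: same preprocessing, then map to the training vocab
new_tokens = "my dog is a good pet".lower().split()
bow = dictionary.doc2bow(new_tokens)

# Full topic probability distribution -> use this vector as the classifier feature
topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
print(topic_dist)  # e.g. [(0, 0.93), (1, 0.07)]

# The single most probable topic, if you only want the "closest" topic number
best_topic = max(topic_dist, key=lambda pair: pair[1])[0]
print(best_topic)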
I have a multi-label classification problem: I want to classify texts with six labels. Each text can have one to six labels, but the label distribution is not equal. For example, 10 people annotated sentence1 as below:
These labels are the number of votes for each class. I can normalize them, e.g. sad 0.7, anger 0.2, fear 0.1, happy 0.0, ...
What is the best classifier for this problem? What is the best form for the labels, i.e. should I normalize them or not?
What keywords should I search for to find work on this kind of multi-label classification problem, where the label probabilities are not equal?
Well, first, let me clarify whether I understand your problem correctly. You have sentences = [sent1, sent2, ..., sentn] and you want to classify them into six labels labels = [l1, l2, ..., l6]. Your data isn't the labels themselves but the probability of each label applying to the text. You also mentioned that the six labels come from human annotation (I'll take the 10 people's votes to be annotations).
If this is the case, you can approach the problem from a multi-label classification or a multi-target regression perspective. I'll describe what you can do with your data in both cases:
Multi-label Classification: In this case, you need to define the classes for each sentence so that you can train your model; right now you have only the probabilities. You can do that by choosing a threshold: the labels whose probabilities are above the threshold are treated as the labels of the sentence (see the sketch below). You can read more about the evaluation metrics here.
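A minimal sketch of the thresholding idea (the vote proportions and the 0.3 cut-off are made-up illustrations, not recommendations):

import numpy as np

labels = ["sad", "anger", "fear", "happy", "surprise", "neutral"]
# Normalized vote proportions for one sentence (illustrative numbers only)
probs = np.array([0.7, 0.2, 0.1, 0.0, 0.0, 0.0])

threshold = 0.3                                     # cut-off to tune on validation data
binary_targets = (probs >= threshold).astype(int)   # -> array([1, 0, 0, 0, 0, 0])
assigned = [label for label, keep in zip(labels, binary_targets) if keep]
print(assigned)                                     # ['sad']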
Multi-target Regression: In this case, you don't need to define the classes; you just use the training input and predict the probabilities for each label directly. I think it is a better and easier framing, given your data collection. If you want to know more about multi-target regression, you can read more about it here, but be aware that the models used in this tutorial are not state-of-the-art.
Training Models: You can use both shallow and deep models for this task. You need a model that can receive a sentence as input and predict six labels or six probabilities. I suggest you take a look at this example; it can be a very good starting point for your work. The author provides a tutorial on how to build a multi-label text classifier using deep neural networks. He basically built an LSTM with a feed-forward layer at the end to classify the labels. If you decide to use regression instead of classification, you can just drop the activation at the end.
The best results are likely to be obtained by deep neural networks, so the article I sent you can work very well. I also suggest you take a look at state-of-the-art methods for text classification, such as BERT or XLNet. I implemented a multi-label classification method using BERT; maybe it can be helpful to you.
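To make the classification-vs-regression choice concrete, here is a minimal Keras sketch in the spirit of that tutorial (the vocabulary size and layer sizes are arbitrary placeholders, not values from the tutorial):

from tensorflow.keras import layers, models

VOCAB_SIZE = 10000   # placeholder: size of your tokenizer vocabulary
NUM_LABELS = 6

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),
    layers.LSTM(64),
    # Multi-label classification: one independent sigmoid per label
    layers.Dense(NUM_LABELS, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# For multi-target regression on the normalized vote proportions instead,
# drop the sigmoid and switch to a regression loss:
#   layers.Dense(NUM_LABELS)          # linear output
#   model.compile(optimizer="adam", loss="mse")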
I am using online LDA to perform a topic modeling task. My core code is based on the original online LDA paper (Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation," NIPS 2010), and the code is available at https://github.com/blei-lab/onlineldavb.
I am using a training set of ~167,000 documents. The code generates lambda files as output, which I use to generate the topics (https://github.com/wellecks/online_lda_python, printtopics.py). But I am not sure how I can use it to find topics for new test data (similar to model.get_document_topics in gensim).
Please help to resolve my confusion.
Follow the same data-processing steps on the test data (tokenization, etc.) and then use your training-data vocabulary to transform the test data into a gensim corpus.
Once you have the test corpus, use the LDA model to find the document-topic distribution. Hope this helps.
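A minimal sketch of that in gensim terms (it assumes you already have a trained gensim LdaModel lda and the Dictionary dictionary built from the training data, and the whitespace tokenization stands in for your real preprocessing):

# Assumed objects: `dictionary` (built on the TRAINING data) and `lda` (trained LdaModel)
test_docs = ["first unseen document ...", "second unseen document ..."]
test_tokens = [doc.lower().split() for doc in test_docs]          # reuse the training preprocessing
test_corpus = [dictionary.doc2bow(tokens) for tokens in test_tokens]

# Document-topic distribution for each test document
for bow in test_corpus:
    print(lda.get_document_topics(bow, minimum_probability=0.0))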
The code you already have is enough to do this. What you have is the lambda (the word-topic matrix); what you want to compute is the gamma (the document-topic matrix).
All you need to do is call OnlineLDA.do_e_step on the documents; the results are the topic vectors. Performance might be improved by stripping out the sstats computation, as those statistics are only needed to update the lambda. The result would be a function that only infers the topic vectors for the model.
You don't need to update the model since you aren't training it; updating is what update_lambda does after calling do_e_step.
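A minimal sketch of that inference step (it assumes, as in the reference onlineldavb.py, that do_e_step accepts a list of raw document strings and returns a (gamma, sstats) pair, that OnlineLDA takes the hyperparameters shown, and that lambda-final.dat is whatever lambda file your training run saved; some forks differ, so check your copy's signatures):

import numpy as np
import onlineldavb

vocab = open("dictnostops.txt").read().split()   # must match the training vocabulary
K = 100                                          # number of topics used in training
olda = onlineldavb.OnlineLDA(vocab, K, 167000, 1.0/K, 1.0/K, 1024.0, 0.7)
olda._lambda = np.loadtxt("lambda-final.dat")    # load the trained word-topic matrix

test_docs = ["some new document text ...", "another unseen document ..."]
# E-step only: infers gamma (the document-topic matrix); sstats is ignored
(gamma, sstats) = olda.do_e_step(test_docs)

# Normalize each row of gamma to get per-document topic proportions
topic_dist = gamma / gamma.sum(axis=1)[:, np.newaxis]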
I am thinking of doing a project on keyword extraction from Stack Exchange questions in Python.
I have input data from kaggle.com which has id, title, body, and tags for training.
I am thinking of implementing some machine learning algorithms like SVMs, neural networks, etc. to train classifiers.
The problem is that these algorithms need features as input.
And I have no idea how to extract features from this input for these algorithms, as I have never extracted features from a paragraph before.
Any help will be appreciated.
Feature selection is of crucial importance: it gives information about the relevance of features for your problem. A good theoretical explanation is given in the book Pattern Recognition by Sergios Theodoridis and Konstantinos Koutroumbas.
I found this simple code example:
# Feature Importance
from sklearn import datasets
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
# load the iris datasets
dataset = datasets.load_iris()
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(dataset.data, dataset.target)
# display the relative importance of each attribute
print(model.feature_importances_)
Result
0.1087327 0.06409384 0.32304493 0.50412853
You can read more, with examples, at http://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/.
Many keyword extraction algorithms are based on classical statistical techniques (including graphical models). The popular features are mostly frequency-based, and algorithms for ranking words also exist.
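As a minimal frequency-based sketch, you can rank a question's words by their TF-IDF weight and treat the top-ranked ones as keyword candidates (the toy documents below are only illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the Kaggle title + body texts
docs = [
    "how to parse json in python",
    "python list comprehension performance",
    "parse xml with java",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

# Rank the words of the first document by TF-IDF weight
row = tfidf[0].toarray().ravel()
top = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)[:3]
print(top)   # the highest-weighted words are keyword candidates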
For further study, consider this paper:
http://www.hlt.utdallas.edu/~saidul/acl14.pdf
I've got BOW vectors and I'm wondering if there's a supervised dimensionality reduction algorithm in sklearn or gensim capable of taking high-dimensional, labelled data and projecting it into a lower-dimensional space that preserves the variance between the classes.
Actually, I'm trying to find a proper metric for classification/regression, and I believe dimensionality reduction can help me. I know there are unsupervised methods, but I want to keep the label information along the way.
FastText, an implementation from Facebook Research, essentially helps you achieve what you have been asking for. Since you were asking about gensim, I assume you might be aware of word2vec in gensim.
word2vec was proposed by Mikolov while at Google. Mikolov and his team at Facebook have since come up with fastText, which takes both word and sub-word information into consideration. It also allows for classification of text.
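A minimal sketch with the fasttext Python package (the file name and labels are placeholders; fastText's supervised mode expects one example per line, prefixed with __label__):

import fasttext

# train.txt is assumed to contain lines like:
#   __label__positive  great battery life and fast delivery
#   __label__negative  stopped working after two days
model = fasttext.train_supervised(input="train.txt", dim=100, epoch=10)

# Predict the label of a new piece of text
labels, probs = model.predict("screen cracked on arrival")
print(labels, probs)

# Dense sentence vector in the learned lower-dimensional space,
# usable as a reduced feature set for downstream models
vec = model.get_sentence_vector("screen cracked on arrival")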
You can only perform dimensionality reduction in an unsupervised manner, OR in a supervised manner but with different labels than your target labels.
For example, you could train a logistic regression classifier on a dataset containing 100 topics. The output of this classifier (100 values) on your training data could then be your dimensionality-reduced feature set.
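A minimal sketch of that idea (the tiny auxiliary corpus and its two topics below stand in for a real auxiliary dataset with ~100 topic labels that differ from your target labels):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Auxiliary corpus labelled with its own (non-target) topics
aux_texts = [
    "the match ended with a late goal",
    "new phone released with a better camera",
    "the team wins the championship final",
    "chip maker unveils a faster processor",
]
aux_topics = [0, 1, 0, 1]            # in practice ~100 topic ids, not 2

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(aux_texts), aux_topics)

# Your actual data: high-dimensional BOW -> low-dimensional topic-probability features
my_texts = ["a document from my real task"]
X_reduced = clf.predict_proba(vec.transform(my_texts))
print(X_reduced.shape)               # (n_documents, n_topics)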
I have a dataset of reviews from various e-commerce sites.
My task is to classify them into spam or not using SVM in Python.
How should I convert the text dataset into SVM features? Are there other features that need to be considered, and if so, how do I convert them into SVM feature vectors?
Is there any sample code or a tutorial available for this task? I need to implement it, so please guide me.
A classic way of converting text into input you can provide to a machine learning algorithm like SVM:
Divide your text into a list of tokens (for instance each word, each group of 2 words, etc.)
Represent the number of occurrences of your tokens according to a given model. For instance, TF-IDF is a model that weighs each token according to its frequency across the whole corpus of documents.
Each document is therefore represented by a vector where each component corresponds to one word of your corpus vocabulary, and the associated weight is a statistical indicator about that word relative to the document considered.
See the scikit-learn documentation, http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction, for more information and for an implementation of the most classic methods for representing text as valid input for machine learning algorithms.
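A minimal end-to-end sketch with scikit-learn (the tiny reviews and labels are only illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy reviews standing in for the e-commerce dataset (1 = spam, 0 = not spam)
reviews = [
    "great product works as described",
    "click here to win a free iphone",
    "fast shipping and good quality",
    "earn money from home visit our site",
]
labels = [0, 1, 0, 1]

# TF-IDF turns each review into a weighted bag-of-words vector,
# which the linear SVM consumes directly
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(reviews, labels)

print(clf.predict(["win a free gift now"]))   # should come out as spam on this toy data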