The title says it all; I have an SQL database bursting at the seams with online conversation text. I've already done most of this project in Python, so I would like to do this using Python's NLTK library (unless there's a strong reason not to).
The data is organized by Thread, Username, and Post. Each thread more or less focuses on discussing one "product" of the Category that I am interested in analyzing. Ultimately, when this is finished, I would like to have an estimated opinion (like/dislike sort of deal) from each user for any of the products they had discussed at some point.
So, what I would like to know:
1) How can I go about determining what product each thread is about? I was reading about keyword extraction... is that the correct method?
2) How do I determine a specific user's sentiment based on their posts? From my limited understanding, I must first "train" NLTK to recognize certain indicators of opinion, and then do I simply determine the context of those words when they appear in the text?
As you may have guessed by now, I have no prior experience with NLP. From my reading so far, I think I can handle learning it though. Even just a basic and crude working model for now would be great if someone can point me in the right direction. Google was not very helpful to me.
P.S. I have permission to analyze this data (in case it matters)
Training any classifier requires a training set of labeled data and a feature extractor to obtain feature sets for each text. After you have a trained classifier, you can apply it to previously unseen text (unlabeled) and obtain a classification based on the machine learning algorithm used. NLTK gives a good explanation and some samples to play around with.
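To make the train-then-classify workflow concrete, here is a minimal sketch using NLTK's NaiveBayesClassifier. The example posts, the labels and the extract_features helper are invented for illustration; a real training set would need to be far larger.

```python
import nltk

def extract_features(text):
    """Feature extractor: one boolean 'contains(word)' feature per word."""
    return {f"contains({word})": True for word in text.lower().split()}

# Tiny hand-labeled training set (made up for this example).
labeled_posts = [
    ("I love this phone", "pos"),
    ("great battery and great screen", "pos"),
    ("really happy with it", "pos"),
    ("I hate this phone", "neg"),
    ("terrible battery and awful screen", "neg"),
    ("really disappointed with it", "neg"),
]

# Turn each (text, label) pair into a (feature set, label) pair and train.
train_set = [(extract_features(text), label) for text, label in labeled_posts]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Apply the trained classifier to previously unseen, unlabeled text.
prediction = classifier.classify(extract_features("awful phone"))
print(prediction)
```

The feature extractor is the part you would experiment with most; the training and classification calls stay the same.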
If you are interested in building a classifier for positive/negative sentiment using your own training dataset, I would avoid simple keyword counts, as they aren't accurate for a number of reasons (e.g. negation of positive words: "not happy"). An alternative, where you can still use a large training set without having to manually label anything, is distant supervision. Basically, this approach uses emoticons or other specific text elements as noisy labels. You still have to choose which features are relevant, but many studies have had good results with simply using unigrams or bigrams (individual words or pairs of words, respectively).
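As a rough sketch of distant supervision, the code below derives noisy labels from emoticons and extracts unigram and bigram features. The emoticon sets, the function names and the sample posts are my own assumptions, not taken from any particular study.

```python
import re

# Hypothetical emoticon sets used as noisy sentiment labels.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-("}
EMOTICONS = POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS

def noisy_label(text):
    """Return 'pos' or 'neg' from emoticons; None if absent or conflicting."""
    tokens = text.split()
    has_pos = any(tok in POSITIVE_EMOTICONS for tok in tokens)
    has_neg = any(tok in NEGATIVE_EMOTICONS for tok in tokens)
    if has_pos == has_neg:
        return None
    return "pos" if has_pos else "neg"

def ngram_features(text):
    """Unigram and bigram presence features; emoticons are removed first
    so the classifier cannot simply learn the label source."""
    kept = " ".join(tok for tok in text.split() if tok not in EMOTICONS)
    words = re.findall(r"[a-z']+", kept.lower())
    features = {f"uni({w})": True for w in words}
    features.update({f"bi({a} {b})": True for a, b in zip(words, words[1:])})
    return features

posts = ["love the new update :)", "not happy with this :(", "arrived on monday"]
train_set = [
    (ngram_features(p), noisy_label(p)) for p in posts if noisy_label(p) is not None
]
```

Note that the bigram feature "bi(not happy)" keeps the negation that a plain keyword count would miss.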
All of this can be done relatively easily with Python and NLTK. You can also choose to use a tool like NLTK-trainer, which is a wrapper for NLTK and requires less code.
I think this study by Go et al. is one of the easiest to understand. You can also read other studies for distant supervision, distant supervision sentiment analysis, and sentiment analysis.
There are a few built-in classifiers in NLTK with both training and classification methods (Naive Bayes, MaxEnt, etc.), but if you are interested in using Support Vector Machines (SVM) then you should look elsewhere. Technically NLTK provides you with an SVM class, but it's really just a wrapper for PySVMLight, which itself is a wrapper for SVMLight, written in C. I had numerous problems with this approach though, and would instead recommend LIBSVM.
For determining the topic, many have used simple keywords but there are some more complex methods available.
You could train any classifier with similar datasets and see what the results are when you apply it to your data. For example, the NLTK contains the Movie Reviews Corpus that contains 1000 positive and 1000 negative reviews. Here is an example on how to train a Naive Bayes Classifier with it. Some other review datasets like Amazon Product Review data are available here.
Another possibility is to take a list of positive and negative words like this one and count their frequencies in your dataset. If you want a complete list, use SentiWordNet.
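A crude version of the word-list approach could look like the following. The two word sets are tiny placeholders I made up; in practice you would load a real lexicon instead.

```python
import re
from collections import Counter

# Placeholder lexicons; replace with a real positive/negative word list.
POSITIVE_WORDS = {"good", "great", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "awful", "hate", "terrible"}

def lexicon_score(text):
    """Count positive minus negative lexicon words in a text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    positive = sum(counts[w] for w in POSITIVE_WORDS)
    negative = sum(counts[w] for w in NEGATIVE_WORDS)
    return positive - negative

print(lexicon_score("Great phone, great camera, bad battery"))
```

A score above zero suggests a positive post and below zero a negative one, but keep in mind that this ignores negation and context entirely.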
I am going to do topic modeling via LDA. I ran my commands to find the optimal number of topics. The output was as follows: It is a bit different from any other plots that I have ever seen. Do you think it is okay, or is it better to use other algorithms rather than LDA? It is worth mentioning that when I ran my commands to visualize the topics-keywords for 10 topics, the plot showed 2 main topics and the others overlapped strongly. Is there any valid range for coherence?
Many thanks for sharing your comments, as I am a beginner in topic modeling.
Shameless self-promotion: I suggest you use the OCTIS library:
It allows you to run different topic models and optimize their hyperparameters (also the number of topics) in order to select the best result.
There might be many reasons why you get those results. Here are some hints and observations:
Make sure that you've preprocessed the text appropriately. This usually includes removing punctuation and numbers, removing stopwords and words that are too frequent or rare, (optionally) lemmatizing the text. Preprocessing is dependent on the language and the domain of the texts.
LDA is a probabilistic model, which means that if you re-train it with the same hyperparameters, you will generally get different results each time (unless you fix the random seed). A good practice is to run the model with the same number of topics multiple times and then average the topic coherence.
There are a lot of topic models, and LDA usually works fine. The choice of the topic model depends on the data that you have. For example, if you are working with tweets (i.e. short texts), I wouldn't recommend using LDA because it does not handle sparse texts well.
Check how you set the hyperparameters. They may have a huge impact on the performance of the topic model.
The range for coherence (I assume you used NPMI which is the most well-known) is between -1 and 1, but values very close to the upper and lower bound are quite rare.
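To see where the -1 to 1 range comes from, here is a small sketch that computes NPMI from document-level co-occurrence. This is a simplified illustration under my own assumptions (documents as sets of words, no sliding window), not how any particular library implements it.

```python
import math
from itertools import combinations

def npmi(word_a, word_b, documents):
    """NPMI of two words, with probabilities estimated as the fraction of
    documents (given as sets of words) that contain them."""
    n = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n
    p_b = sum(word_b in doc for doc in documents) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur: lower bound by convention
    if p_ab == 1:
        return 0.0   # degenerate case (assumed convention): both words in every document
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

def topic_npmi(top_words, documents):
    """Coherence of one topic: mean NPMI over all pairs of its top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)

docs = [{"cat", "dog"}, {"cat", "dog"}, {"car", "road"}, {"car", "road"}]
print(npmi("cat", "dog", docs))   # always together: upper bound
print(npmi("cat", "car", docs))   # never together: lower bound
print(topic_npmi(["cat", "dog", "car"], docs))
```

Words that always appear together score 1, words that never co-occur score -1, and a topic mixing both kinds lands somewhere in between, which is why values near the bounds are rare on real data.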
Imagine I have a fasttext model that was trained on Wikipedia articles (as explained on the official website).
Would it be possible to train it again with another corpus (scientific documents) that could add new / more pertinent links between words, especially for the scientific ones?
To summarize, I would need the classic links that exist between all the English words coming from Wikipedia. But I would like to enhance this model with new documents about specific sectors. Is there a way to do that? And if yes, is there a way to maybe weight the trainings so relations coming from my custom documents would be 'more important'?
My final wish is to compute cosine similarity between documents that can be very scientific (that's why to have better results I thought about adding more scientific documents)
Adjusting more-generic models with your specific domain training data is often called "fine-tuning".
The gensim implementation of FastText allows an existing model to expand its known-vocabulary via what's seen in new training data (via build_vocab(..., update=True)) and then for further training cycles including that new vocabulary to occur (through train()).
But, doing this particular form of updating introduces murky issues of balance between older and newer training data, with no clear best practices.
(As just one example, to the extent there are tokens/ngrams in the original model that don't recur in the new data, new training is pulling those in the new data into new positions that are optimal for the new data... but potentially arbitrarily far from comparable compatibility with the older tokens/ngrams.)
Further, it's likely some model modes (like negative-sampling versus hierarchical-softmax), and some mixes of data, have a better chance of net-benefiting from this approach than others – but you pretty much have to hammer out the tradeoffs yourself, without general rules to rely upon.
(There may be better fine-tuning strategies for other kinds of models; this is just speaking to the ability of the gensim FastText to update-vocabulary and repeat-train.)
But perhaps, your domain of interest is scientific texts. And maybe you also have a lot of representative texts – perhaps even, at training time, the complete universe of papers you'll want to compare.
In that case, are you sure you want to deal with the complexity of starting with a more-generic word-model? Why would you want to contaminate your analysis with any of the dominant word-senses in generic reference material, like Wikipedia, if in fact you already have sufficiently-varied and representative examples of your domain words in your domain contexts?
So I would recommend 1st trying to train your own model, from your own representative data. And only if you then fear you're missing important words/senses, try mixing in Wikipedia-derived senses. (At that point, another way to mix in that influence would be to mix Wikipedia texts with your other corpus. And you should also be ready to test whether that really helps or hurts – because it could be either.)
Also, to the extent your real goal is comparing full papers, you might want to look into other document-modeling strategies, including bag-of-words representations, the Doc2Vec ('Paragraph Vector') implementation in gensim, or others. Those approaches will not necessarily require per-word vectors as an input, but might still work well for quantifying text-to-text similarities.
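As a baseline that needs no word vectors at all, here is a rough sketch of cosine similarity over plain bag-of-words counts. The tokenizer and the sample sentences are simplifications of my own; a real pipeline would likely add stopword removal and tf-idf weighting.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase word counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

paper_a = bag_of_words("Protein folding measured with infrared spectroscopy")
paper_b = bag_of_words("Infrared spectroscopy of protein samples")
paper_c = bag_of_words("Medieval land law")
print(cosine_similarity(paper_a, paper_b), cosine_similarity(paper_a, paper_c))
```

If a simple representation like this already separates your papers well, that is a useful yardstick for judging whether a fine-tuned word model is worth the extra complexity.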
I have a dataset of comments which I want to classify into five categories:
jewelries, clothes, shoes, electronics, food & beverages
So if someone's talking about pork, steak, wine, soda, eat: it's classified into f&b
Whereas if somebody's talking about, say, gold, pendant, locket etc.: it's classified into jewelries
I want to know what tags/tokens I should be looking for in a comment/tweet so as to classify it into any of these categories, and finally which classifier to use. I just need some guidance and suggestions; I'll take it from there.
Please help. Thanks
This answer can be a bit long and perhaps I abstract a few things away, but it's just to give you an idea and some advice.
Supervised Vs Unsupervised
As others have already mentioned, in the land of machine learning there are 2 main roads: supervised and unsupervised learning. As you probably already know by now, if your corpus (documents) is labeled, you are talking about supervised learning. The labels are the categories and are in this case boolean values.
For instance if a text is related to clothes and shoes the labels for those categories should be true.
Since a text can be related to multiple categories (multiple labels), we are looking at multi-label classification.
What to use?
I presume that the dataset is not yet labeled, since twitter does not do this categorisation for you. So here comes a big decision on your part.
You label the data manually, which means you try to look at as many tweets/fb messages in your dataset as you can, and for each of them you consider the 5 categories and answer each with True/False.
You decide to use an unsupervised learning algorithm and hope that you discover these 5 categories. Keep in mind that approaches like clustering will just try to find categories on their own, and these don't have to match your 5 predefined categories.
I've used quite some supervised learning in the past and have had good experience with this type of learning, therefore I will continue explaining this path.
Feature Engineering
You have to come up with the features that you want to use. For text classification, a good approach is to use each possible word in the document as a feature. A value of True means the word is present in the document; False means it is absent.
Before doing this, you need to do some preprocessing. This can be done by using various features provided by the NLTK library.
Tokenization: this will break your text up into a list of words. You can use this module.
Stopword removal: this will remove common words from the tokens, words like 'a', 'the', ... You can take a look at this.
Stemming: stemming will transform words to their stem form. For example, the words 'working', 'worked', 'works' will be transformed to 'work'. Take a look at this.
Now if you have preprocessed the data, then generate a featureset for each word that exists in the documents. There exist automatic methods and filters for this, but I'm not sure how to do this in Python.
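For what it's worth, the preprocessing steps and the True/False word features can be sketched in Python roughly as follows. I use a regex tokenizer and a tiny hand-written stopword list so the example runs without downloading NLTK data; both are assumptions for illustration, and only the PorterStemmer comes from NLTK.

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "of", "to", "in", "on", "at", "it"}
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [stemmer.stem(t) for t in tokens]

def document_features(text, vocabulary):
    """One boolean feature per vocabulary word: present in the text or not."""
    stems = set(preprocess(text))
    return {word: (word in stems) for word in vocabulary}

comments = ["A gold locket and a pendant", "Steak and wine at the bar"]
# The vocabulary is every stem seen anywhere in the (training) documents.
vocabulary = sorted({stem for c in comments for stem in preprocess(c)})
print(document_features("gold pendant", vocabulary))
```

The resulting dictionaries can be fed to a classifier as feature sets, one per document.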
There are multiple classifiers that you can use for this purpose. I suggest taking a deeper look at the ones that exist and their benefits. You can use the nltk classifier, which supports multiclassification, but to be honest I never tried that one before. In the past I've used Logistic Regression and SVM.
Training & testing
You will use part of your data for training and part for validating whether the trained model performs well. I suggest you use cross-validation, because you will have a small dataset (you have to label the data manually, which is cumbersome). The benefit of cross-validation is that you don't have to commit to a single fixed split into a training set and a test set. Instead, it runs in multiple rounds, each time using a different part of the data for testing and the rest for training, so every document ends up being used for both training and testing.
Once your model is built and the outcome of the predictions on the test data is plausible, you can use your model in the wild to predict the categories of new Facebook messages/tweets.
The NLTK library is great for preprocessing and natural language processing, but I never used it before for classification. I've heard a lot of great things about the scikit-learn Python library. But to be honest, I prefer to use Weka, which is a data mining tool written in Java, offering a great UI and which speeds up your task a lot!
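To show what the rounds look like, here is a bare-bones k-fold splitter in plain Python. The function name and defaults are my own; libraries such as scikit-learn ship ready-made equivalents.

```python
import random

def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(k_fold_splits(range(10), k=5))
for train, test in splits:
    # Train on `train`, evaluate on `test`, then average the k scores.
    print(len(train), len(test))
```

With 10 labeled documents and k=5, each round trains on 8 documents and tests on the remaining 2.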
From a different angle: Topic modelling
In your question you state that you want to classify the dataset into five categories. I would like to show you the idea of topic modelling. It might not be useful in your scenario if you are really only targeting those categories (that's why I leave this part at the end of my answer). However if your goal is to categorise the tweets/fb messages into non-predefined categories, topic modelling is the way to go.
Topic modeling is an unsupervised learning method, where you decide in advance the number of topics (categories) you want to 'discover'. This number can be high (e.g. 40). Now the cool thing is that the algorithm will find 40 topics, each containing words that are somehow related. It will also output for each document a distribution that indicates to which topics the document is related. This way you can discover a lot more categories than your 5 predefined ones.
Now I'm not gonna go much deeper into this, but just google it if you want more information. In addition you could consider to use MALLET which is an excellent tool for topic modelling.
Well this is kind of a big subject.
You mentioned Python, so you should have a look at the NLTK library which allows you to process natural language, such as your comments.
After this step, you should have a classifier which will map the words you retrieved to a certain class. NLTK also has tools for classification that are linked to knowledge databases. If you are lucky, the categories you are looking for are already available; otherwise you may have to build them yourself. You can have a look at this example which uses NLTK and the WordNet database. You can have access to the Synset, which seems to be pretty broad; and you can also have a look at the hypernyms (see for example list(dog.closure(hyper)) ).
Basically you should consider using a multiclassifier on the whole tokenized text (comments on Facebook and tweets are usually short. You might also decide to only consider FB comments below 200 characters, your choice). The choice of a multiclassifier is motivated by non-orthogonality of your classification set (clothes, shoes and jewelries can be the same object; you could have electronic jewelry [ie smartwatches], etc.). This is a fairly simple setup but it's an interesting first step, whose strengths and weaknesses will allow you to iterate easily (if needed).
Good luck!
What you're looking for is in the subject of
Natural Language Processing (NLP) : processing text data and
Machine learning (where the classification models are built)
First I would suggest going through NLP tutorials and then text classification tutorials, the most appropriate being
If you're looking for libraries available in python or java, take a look at Java or Python for Natural Language Processing
If you're new to text processing, please take a look at the NLTK library that provides a nice introduction to doing NLP, see
Now to the hard core details:
First, ask yourself whether you have twitter/facebook comments (let's call them documents from now on) that are manually labelled with the categories you want.
1a. If YES, look at supervised machine learning, see
1b. If NO, look at UNsupervised machine learning; I suggest clustering and topic modelling,
After knowing which kind of machine learning you need, split the documents up into at least training (70-90%) and testing (10-30%) set, see
Note. I suggest at least because there are other ways to split up your documents, e.g. for development or cross-validation. (if you don't understand this, it's all right, just follow step 2)
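A simple shuffled split along those lines might look like this. The helper name and the 80/20 default are just one choice within the ranges above.

```python
import random

def train_test_split(documents, train_fraction=0.8, seed=0):
    """Shuffle the documents and cut them into a training and a testing part."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

documents = [f"document {i}" for i in range(100)]
train_docs, test_docs = train_test_split(documents, train_fraction=0.8)
print(len(train_docs), len(test_docs))
```

Shuffling before cutting matters when the documents are stored in some order (by date or by author, for instance), because otherwise the test set may not resemble the training set.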
Finally, Train and Test your model
3a. If supervised, use the training set to train your supervised model. Apply your model onto the test set and then see how well you performed.
3b. If unsupervised, use the training set to generate documents clusters (that means to group similar documents) but they still have no labels. So you need to think of some smart way to label the groups of documents correctly. (To this date, there is no real good solution to this, even super effective neural networks cannot know what the neurons are firing, they just know each neuron is firing something specific)
I'm currently trying to classify Tweets using the Naive Bayes classifier in NLTK. I'm classifying tweets related to particular stock symbols, using the '$' prefix (e.g. $AAPL). I've been basing my Python script off of this blog post: Twitter Sentiment Analysis using Python and NLTK. So far, I've been getting reasonably good results. However, I feel there is much, much room for improvement.
In my word-feature selection method, I decided to implement the tf-idf algorithm to select the most informative words. After having done this though, I felt that the results weren't that impressive.
I then implemented the technique on the following blog: Text Classification Sentiment Analysis Eliminate Low Information Features. The results were very similar to the ones obtained with the tf-idf algorithm, which led me to inspect my classifier's 'Most Informative Features' list more thoroughly. That's when I realized I had a bigger problem:
Tweets and real language don't use the same grammar and wording. In a normal text, many articles and verbs can be singled out using tf-idf or stopwords. However, in a tweet corpus, some extremely uninformative words, such as 'the', 'and', 'is', etc., occur just as much as words that are crucial to categorizing text correctly. I can't just remove all words that have less than 3 letters, because some uninformative features are bigger than that, and some informative ones are smaller.
If I could, I would like to not have to use stopwords, because of the need to frequently update the list. However, if that's my only option, I guess I'll have to go with it.
So, to summarize my question, does anyone know how to truly get the most informative words in the specific source that is a Tweet?
EDIT: I'm trying to classify into three groups: positive, negative, and neutral. Also, I was wondering, for TF-IDF, should I only be cutting off the words with the low scores, or also some with the higher scores? In each case, what percentage of the vocabulary of the text source would you exclude from the feature selection process?
The blog post you link to describes the show_most_informative_features method, but the NaiveBayesClassifier also has a most_informative_features method that returns the features rather than just printing them. You could simply set a cutoff based on your training set: features like "the", "and" and other unimportant features would be at the bottom of the list in terms of informativeness.
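Roughly, that could look like the sketch below. The training tweets are made up, and the cutoff of 4 entries is arbitrary; with real data you would tune it. Note that most_informative_features returns (feature name, feature value) pairs, so a word can show up more than once.

```python
import nltk

def word_features(text):
    """One boolean feature per word in the tweet."""
    return {word: True for word in text.lower().split()}

# Invented training tweets; 'good' and 'bad' are deliberately noisy signals.
labeled_tweets = [
    ("the results are good", "pos"),
    ("the outlook is good", "pos"),
    ("the earnings are good", "pos"),
    ("the guidance is not bad", "pos"),
    ("the results are bad", "neg"),
    ("the outlook is bad", "neg"),
    ("the earnings are bad", "neg"),
    ("the guidance is not good", "neg"),
]
train_set = [(word_features(text), label) for text, label in labeled_tweets]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Keep only the words behind the top-ranked (name, value) pairs.
top = classifier.most_informative_features(4)
keep = {name for name, value in top}

def filtered_features(text):
    """Feature extractor restricted to the informative words."""
    return {w: True for w in text.lower().split() if w in keep}

print(keep)
```

You would then retrain the classifier with filtered_features in place of word_features.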
It's true that this approach could be subject to overfitting (some features would be much more important in your training set than in your test set), but that would be true of anything that filters features based on your training set.
I have a human-tagged corpus of over 5000 subject-indexed documents in XML. They vary in size from a few hundred kilobytes to a few hundred megabytes, ranging from short articles to manuscripts. They have all been subject indexed as deep as the paragraph level. I am lucky to have such a corpus available, and I am trying to teach myself some NLP concepts. Admittedly, I've only begun, thus far reading only the freely available NLTK book, streamhacker, and skimming jacobs(?) NLTK cookbook. I'd like to experiment with some ideas.
It was suggested to me that perhaps I could take bi-grams and use naive Bayes classification to tag new documents. I feel as if this is the wrong approach. Naive Bayes is proficient at a true/false sort of relationship, but to use it on my hierarchical tag set I would need to build a new classifier for each tag, nearly 1000 of them. I have the memory and processor power to undertake such a task, but am skeptical of the results. However, I will be trying this approach first, to appease someone's request. I should likely have this accomplished in the next day or two, but I predict the accuracy to be low.
So my question is a bit open ended. Largely because of the nature of the discipline and the general unfamiliarity with my data, it will likely be hard to give an exact answer.
What sort of classifier would be appropriate for this task? Was I wrong; can Naive Bayes be used for more than a true/false sort of operation?
What feature extraction should I pursue for such a task? I am not expecting much with the bigrams.
Each document also contains some citational information including, author/s, an authors gender of m,f,mix(m&f),and other (Gov't inst et al.), document type, published date(16th cent. to current), human analyst, and a few other general elements. I'd also appreciate some useful descriptive tasks to help investigate this data better for gender bias, analyst bias, etc. But realize that is a bit beyond the scope of this question.
What sort of classifier would be appropriate for this task? Was I wrong; can Naive Bayes be used for more than a true/false sort of operation?
You can easily build a multilabel classifier by building a separate binary classifier for each class, that can distinguish between that class and all others. The classes for which the corresponding classifier yields a positive value are the combined classifier's output. You can use Naïve Bayes for this or any other algorithm. (You could also play tricks with NB's probability output and a threshold value, but NB's probability estimates are notoriously bad; only its ranking among them is what makes it valuable.)
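Here is a small sketch of that one-classifier-per-tag setup using NLTK's NaiveBayesClassifier. The documents, tags and helper names are invented for illustration; with around 1000 tags you would run the same loop, just with real features and far more data.

```python
import nltk

def features(text):
    """One boolean feature per word in the text."""
    return {word: True for word in text.lower().split()}

def train_one_vs_rest(labeled_docs, all_tags):
    """Train one binary classifier per tag: 'has this tag' vs. 'does not'."""
    classifiers = {}
    for tag in all_tags:
        binary_set = [(features(text), tag in tags) for text, tags in labeled_docs]
        classifiers[tag] = nltk.NaiveBayesClassifier.train(binary_set)
    return classifiers

def predict_tags(classifiers, text):
    """The combined output is every tag whose classifier answers True."""
    feats = features(text)
    return {tag for tag, clf in classifiers.items() if clf.classify(feats)}

# Invented multi-label training data: each document carries a set of tags.
docs = [
    ("court ruling on medieval land law", {"law", "history"}),
    ("medieval plague and medicine", {"history", "medicine"}),
    ("modern patent court ruling", {"law"}),
    ("clinical trial of a new medicine", {"medicine"}),
]
classifiers = train_one_vs_rest(docs, ["law", "history", "medicine"])
print(predict_tags(classifiers, "medieval court ruling"))
```

A document can come back with several tags or with none at all, which is exactly the behaviour a single multi-class classifier cannot give you.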
what feature extraction should I pursue for such a task
For text classification, tf-idf vectors are known to work well, but you haven't specified what the exact task is. Any metadata on the documents might work as well; try doing some simple statistical analysis. If any feature of the data is more frequently present in some classes than in others, it may be a useful feature.
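In case tf-idf is new to you, this is a bare-bones sketch of the weighting in plain Python. It uses one common variant (relative term frequency times log inverse document frequency); libraries differ in the exact smoothing and normalization they apply.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict each."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = [["the", "court", "ruling"], ["the", "plague"], ["the", "court"]]
vectors = tf_idf(docs)
print(vectors[1])
```

A term that occurs in every document (like "the" here) gets weight zero, while terms concentrated in few documents get the highest weights.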
I understand that you have two tasks to solve here. The 1st one is that you want to tag an article based on its topic(?), and since the article can be classified in more than one category/class, you have a multi-label classification problem. There are several algorithms proposed for solving a multi-label classification problem; please check the literature. I found this paper quite helpful when I was dealing with a similar problem:
The 2nd problem you want to solve is to tag the paper with authors, gender, type of document. This is a multi-class problem - each class has more than two potential values but all documents have some values for these classes.
I think as a first step it is important to understand the differences between multi-class and multi-label classification.