Building a Training Classifier in Python with NLTK

I'm currently working on a project that detects emotions (happy, sad, etc.) from text in a chat application, using Python and NLTK. I'm not very familiar with NLP or Python. As a basic approach, I hope to use a keyword-based method: I would make a list of emotional keywords under each emotional state, check whether a given sentence contains any of those keywords, and identify the relevant emotional state accordingly. What I need to know is: do I need to create a training data set and feature list for this task? If so, how can I do it? Please help me.

You will need a set of words that have been labeled. One place to start is the AFINN sentiment lexicon, a large set of words that have been manually labeled. The slides by Wei-Ting Kuo show how to use the AFINN word set.
Laurent Luce's blog walks through the entire sentiment analysis process using tweets, although he starts with a labeled training set.
Also take a look at NLTK's 'How To' on sentiment analysis.
There are a number of emotion data sets that may help at https://www.w3.org/community/sentiment/wiki/Datasets#Emotions_datasets_by_Media_Core_.40_UFL.
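A minimal sketch of the keyword-based method described in the question, assuming hand-picked keyword lists per emotion. The words below are invented for illustration; in practice the lists would come from a labelled resource such as the AFINN lexicon or the emotion datasets linked above.

```python
import re

# Hypothetical keyword lists; a real system would load these from a
# labelled lexicon rather than hard-coding them.
EMOTION_KEYWORDS = {
    "happy": {"happy", "glad", "joy", "great", "awesome"},
    "sad": {"sad", "unhappy", "miserable", "down", "cry"},
    "angry": {"angry", "furious", "mad", "annoyed"},
}

def detect_emotion(sentence):
    """Return the emotion whose keyword list matches the sentence most
    often, or None if no emotional keyword is found."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    scores = {emotion: len(tokens & keywords)
              for emotion, keywords in EMOTION_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(detect_emotion("I am so happy and glad today"))     # happy
print(detect_emotion("The weather report for tomorrow"))  # None
```

With this approach no training set is strictly required; you only need the keyword lists. A training set becomes necessary once you move to a learned classifier.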

Related

Extracting and ranking keywords from short text

I am working on a project to extract keywords from short texts (3-4 sentences). Using the spaCy library I extract noun phrases and named entities and use them as keywords. However, I would like to sort them by their importance with respect to the original text.
I tried standard information retrieval approaches, like TF-IDF, and even a couple of graph-based algorithms, but with such short texts the results weren't great.
I was thinking that maybe a neural network with an attention mechanism could help me rank those keywords. Is there any way to use the pre-trained models that come with spaCy to do some kind of ranking?
How about something like maximal marginal relevance? http://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf
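To make the MMR idea concrete, here is an illustrative re-ranking loop over toy vectors. In practice the similarity function would be cosine similarity between spaCy phrase/document vectors; the 3-d embeddings below are made up purely to show the selection mechanics.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_rank(doc_vec, candidates, lam=0.3):
    """candidates: {keyword: vector}. Orders keywords so that each pick
    balances relevance to the document against redundancy with already
    picked keywords (Carbonell & Goldstein, 1998). A small lam favours
    diversity over raw relevance."""
    selected = []
    remaining = dict(candidates)
    while remaining:
        def score(kw):
            relevance = cosine(remaining[kw], doc_vec)
            redundancy = max((cosine(remaining[kw], candidates[s])
                              for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

doc = (1.0, 0.2, 0.0)
keywords = {
    "neural network": (0.9, 0.3, 0.1),
    "neural nets": (0.9, 0.3, 0.1),    # exact duplicate vector
    "training data": (0.4, 0.9, 0.0),
}
print(mmr_rank(doc, keywords))
# ['neural network', 'training data', 'neural nets']
```

Note how the duplicate keyword is pushed to the end: once "neural network" is selected, its twin's redundancy term outweighs its relevance.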

How to use a machine learning algorithm to predict the most discussed Twitter category

INPUT I HAVE
I have a CSV file which contains 2000 sentences as below:
WHAT I WANT TO DO
I want to:
  A) Categorize each sentence
One way I am thinking of is to create a dictionary in which I put the words related to each category.
But I don't like the idea of creating a dictionary; I'd rather have the machine determine/predict this itself.
Is there a better way of achieving this?
How can I use machine learning here?
Can you suggest a step-by-step process/code/ML algorithm that can be trained?
I have experience with the Python language.
This is not necessarily a good application for machine learning. Essentially you're analyzing each word in a tweet and seeing if that word belongs in a predefined category. Machine learning might be used for something like sentiment analysis, where it can "learn" that individual words or groups of words convey a certain feeling, but to classify individual words doesn't really make sense. You would be trying to "train" a model to learn definitions of words.
I think your approach with the dictionary is viable, and much easier to accomplish. For each category you care about, add a few words and then you can use a thesaurus API to programmatically find synonyms for each word in the category to expand the vocabulary of your dictionary.
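A sketch of the dictionary approach with synonym expansion, as suggested above. The seed words and the synonym map are invented for illustration; in practice the synonyms would come from a thesaurus API.

```python
# Hypothetical seed words per category.
CATEGORY_SEEDS = {
    "sports": {"football", "match", "goal"},
    "politics": {"election", "vote", "parliament"},
}

# Stand-in for a thesaurus API lookup (made-up data).
SYNONYMS = {
    "match": {"game", "fixture"},
    "vote": {"ballot", "poll"},
}

def expand(seeds):
    """Grow a seed set with synonyms of each seed word."""
    expanded = set(seeds)
    for word in seeds:
        expanded |= SYNONYMS.get(word, set())
    return expanded

CATEGORIES = {cat: expand(words) for cat, words in CATEGORY_SEEDS.items()}

def categorize(sentence):
    """Assign the category whose vocabulary overlaps the sentence most."""
    tokens = set(sentence.lower().split())
    scores = {cat: len(tokens & vocab) for cat, vocab in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else "unknown"

print(categorize("the game ended with a late goal"))  # sports
print(categorize("new budget announced yesterday"))   # unknown
```

The "unknown" bucket is useful in practice: sentences matching no category can be reviewed by hand and their words fed back into the seed lists.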

Generating paraphrases of English text using PPDB

I need to generate paraphrases of an English sentence using the PPDB paraphrase database.
I have downloaded the datasets from the website.
I would say your first step needs to be reducing the problem into more manageable components. Second, figure out whether you want to paraphrase on a one-to-one, lexical, syntactic, phrasal, or combination basis. To inform this decision I would take one sentence and paraphrase it myself in order to get an idea of what I am looking for. Next I would start writing a parser for the downloaded data. Then I would remove the stopwords and incorporate a part-of-speech tagger, like the ones included in spaCy or NLTK, for your example phrase.
Since they seem to give you all the information needed to build a successive dictionary filter, that is where I would start. I would write a filter which finds the part of speech for each word in my sentence in the [LHS] column of the dataset and selects a source that matches the word while minimizing/maximizing the value of one feature (like minimizing WordLenDiff), which in the case of "businessnow" <- "business now" is -1.5. Keeping track of the target feature, you will then have a basic paraphrased sentence.
Using this strategy, your output could turn:
"the business uses 4 gb standard."
sent_score = 0
into:
"businessnow uses 4gb standard"
sent_score = -3
After you have a basic example, you can start exploring feature selection algorithms like those in scikit-learn, and incorporate word alignment. But I would seriously cut down the scope of the problem first and increase it gradually. In the end, how you approach the problem depends on what the designated use is and how functional it needs to be.
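A minimal sketch of the dictionary-filter step. The three sample lines below mimic PPDB's "LHS ||| phrase ||| paraphrase ||| features" layout; real entries carry many more features plus an alignment field, and the feature values here are made up.

```python
PPDB_SAMPLE = [
    "[NN] ||| business now ||| businessnow ||| WordLenDiff=-1.5",
    "[NN] ||| business now ||| the business ||| WordLenDiff=1.0",
    "[VB] ||| uses ||| employs ||| WordLenDiff=3.0",
]

def load_rules(lines):
    """Index paraphrase candidates by (POS tag, source phrase)."""
    rules = {}
    for line in lines:
        lhs, phrase, para, feats = [f.strip() for f in line.split("|||")]
        feat_map = dict(f.split("=") for f in feats.split())
        rules.setdefault((lhs, phrase), []).append(
            (para, float(feat_map["WordLenDiff"])))
    return rules

def paraphrase(pos, phrase, rules):
    """Pick the candidate minimizing WordLenDiff, as described above;
    fall back to the original phrase when nothing matches."""
    candidates = rules.get((pos, phrase))
    if not candidates:
        return phrase, 0.0
    return min(candidates, key=lambda c: c[1])

rules = load_rules(PPDB_SAMPLE)
print(paraphrase("[NN]", "business now", rules))  # ('businessnow', -1.5)
```

Summing the returned feature values over a sentence gives the running sent_score shown in the example above.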
Hope this helps.

Classifying text documents using nltk

I'm currently working on a project where I'm taking emails, stripping out the message bodies using the email package, then I want to categorize them using labels like sports, politics, technology, etc...
I've successfully stripped the message bodies out of my emails; now I'm looking to start classifying. I've done the classic example of sentiment-analysis classification using the movie_reviews corpus, separating documents into positive and negative reviews.
I'm just wondering how I could apply this approach to my project? Can I create multiple classes like sports, technology, politics, entertainment, etc.? I have hit a road block here and am looking for a push in the right direction.
If this isn't an appropriate question for SO I'll happily delete it.
Edit: Hello everyone, I see that this post has gained a bit of popularity. I did end up successfully completing this project; here is a link to the code in the project's GitHub repo:
https://github.com/codyreandeau/Email-Categorizer/blob/master/Email_Categorizer.py
The task of text classification is a Supervised Machine Learning problem. This means that you need to have labelled data. When you approached the movie_review problem, you used the +1/-1 labels to train your sentiment analysis system.
Getting back to your problem:
If you have labels for your data, approach the problem in the same manner. I suggest you use the scikit-learn library. You can draw some inspiration from here: Scikit-Learn for Text Classification
If you don't have labels, you can try an unsupervised learning approach. If you have any clue about how many categories you have (call that number K), you can try a KMeans approach. This means grouping the emails into K clusters based on how similar they are; similar emails will end up in the same bucket. Then inspect the clusters by hand and come up with a label for each. Assign new emails to the most similar cluster. If you need help with KMeans, check this quick recipe: Text Clustering Recipe
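A sketch of the unsupervised route with scikit-learn: TF-IDF vectors plus KMeans with K=2. The four toy "emails" are invented; real input would be the stripped message bodies. Cluster numbering is arbitrary, so the labels still need a human pass to name them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

emails = [
    "the team won the football match last night",
    "a late goal won the football game",
    "parliament will vote on the election bill",
    "the election results were announced today",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(emails)

# K=2 because we suspect two topics; random_state makes runs repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# Inspect the clusters by hand, then attach names (e.g. 0 -> sports).
for email, label in zip(emails, kmeans.labels_):
    print(label, email)

# Assign a new email to the most similar cluster.
new_vec = vectorizer.transform(["who scored the winning goal"])
print(kmeans.predict(new_vec))
```

Note that the same fitted vectorizer must be reused for new emails, otherwise the feature columns won't line up with what KMeans was trained on.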
Suggestion: Getting labels for emails can be easier than you think. For example, Gmail lets you export your emails with folder information. If you have categorised your email, you can take advantage of this.
To create a classifier, you need a training data set with the classes you are looking for. In your case, you may need to either:
create your own data set
use a pre-existing dataset
The Brown Corpus is a seminal text collection with many of the categories you mention. It could be a starting point to help classify your emails, using a package like gensim to find semantically similar texts.
Once you classify your emails, you can then train a system to predict a label for each unseen email.

Using WordNet-Affect with NLTK [duplicate]

I downloaded WN-Affect. However, I am not sure how to use it to detect the mood of a sentence. For example, given the string "I hate football.", I want to be able to detect whether the mood is bad and the emotion is fear. WN-Affect has no tutorial on how to do this, and I am kind of new to Python. Any help would be great!
In short: Use SentiWordNet instead and look at https://github.com/kevincobain2000/sentiment_classifier
In Long:
Affectedness vs Sentiment
The line between affect and sentiment is very fine. One should look into affectedness in linguistics studies, e.g. http://compling.hss.ntu.edu.sg/events/2014-ws-affectedness/, and sentiment analysis in computational research. For now, let's call both the task of identifying affect and sentiment "sentiment analysis".
Also note that WN-Affect is a rather old resource compared to SentiWordNet, http://sentiwordnet.isti.cnr.it/.
Here's a good resource for using SentiWordNet for sentiment analysis: https://github.com/kevincobain2000/sentiment_classifier.
Often sentiment analysis has only two classes, positive or negative sentiment, whereas WN-Affect uses 11 types of affectedness labels:
emotion
mood
trait
cognitive state
physical state
hedonic signal
emotion-eliciting
emotional response
behaviour
attitude
sensation
For each type, there are multiple classes, see https://github.com/larsmans/wordnet-domains-sentiwords/blob/master/wn-domains/wn-affect-1.1/a-hierarchy.xml
To answer the question of how one can use the WN-Affect, there're several things you need to do:
First, map WN 1.6 to WN 3.0 (not an easy task; you have to chain several mappings, especially the one between versions 2.0 and 2.1)
Now, using WN-Affect with WN 3.0, you can apply
the same classification technique as the SentiWordNet sentiment classifier, or
try to maximize the affect classes within the text and then use some heuristics to choose 'positive' / 'negative'
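A toy illustration of the "maximize the classes within the text" heuristic. The word-to-affect-class map below is invented; with the real WN-Affect you would derive it from a-hierarchy.xml and the synset lists (after the WN 1.6 to 3.0 mapping above).

```python
from collections import Counter

# Hypothetical stand-in for a WN-Affect-derived lexicon.
AFFECT = {
    "hate": "negative-emotion",
    "fear": "negative-emotion",
    "love": "positive-emotion",
    "happy": "positive-emotion",
    "calm": "mood",
}

def dominant_affect(sentence):
    """Count affect-class hits over the tokens and return the class
    that occurs most, or None when no affect word is found."""
    hits = Counter(AFFECT[w] for w in sentence.lower().split()
                   if w in AFFECT)
    return hits.most_common(1)[0][0] if hits else None

print(dominant_affect("I hate football and fear losing"))
# negative-emotion
```

A final heuristic then collapses the dominant class to 'positive' or 'negative' (e.g. any "*-emotion" class maps by its prefix).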
WordNet-Affect uses WordNet 1.6 offsets.
However, WordNet 1.6 is still available for download. You can use the nltk.corpus.WordNetCorpusReader class to load it. I wrote all the code to do it here.
