Phrase corpus for sentimental analysis

Phrase corpus for sentimental analysis - python

Good day,
I'm attempting to write a sentimental analysis application in python (Using naive-bayes classifier) with the aim to categorize phrases from news as being positive or negative.
And I'm having a bit of trouble finding an appropriate corpus for that.
I tried using "General Inquirer" (http://www.wjh.harvard.edu/~inquirer/homecat.htm) which works OK but I have one big problem there.
Since it is a word list, not a phrase list I observe the following problem when trying to label the following sentence:
He is not expected to win.
This sentence is categorized as being positive, which is wrong. The reason for that is that "win" is positive, but "not" does not carry any meaning since "not win" is a phrase.
Can anyone suggest either a corpus or a work around for that issue?
Your help and insight is greatly appriciated.

See for example: "What's great and what's not: learning to classify the scope of negation for improved sentiment analysis" by Councill, McDonald, and Velikovich
http://dl.acm.org/citation.cfm?id=1858959.1858969
and followups,
http://scholar.google.com/scholar?cites=3029019835762139237&as_sdt=5,33&sciodt=0,33&hl=en
e.g. by Morante et al 2011
http://eprints.pascal-network.org/archive/00007634/

In this case, the work not modifies the meaning of the phrase expecteed to win, reversing it. To identify this, you would need to POS tag the sentence and apply the negative adverb not to the (I think) verb phrase as a negation. I don't know if there is a corpus that would tell you that not would be this type of modifier or not, however.

Related

Find similar/synonyms/context words Python

Hello i'm looking to find a solution of my issue :
I Want to find a list of similar words with french and english
For example :
name could be : first name, last name, nom, prénom, username....
Postal address could be : city, country, street, ville, pays, code postale ....

The other answer, and comments, describe how to get synonyms, but I think you want more than that?
I can suggest two broad approaches: WordNet and word embeddings.
Using nltk and wordnet, you want to explore the adjacent graph nodes. See http://www.nltk.org/howto/wordnet.html for an overview of the functions available. I'd suggest that once you've found your start word in Wordnet, follow all its relations, but also go up to the hypernym, and do the same there.
Finding the start word is not always easy:
http://wordnetweb.princeton.edu/perl/webwn?s=Postal+address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
Instead it seems I have to use "address": http://wordnetweb.princeton.edu/perl/webwn?s=address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
and then decide which of those is the correct sense here. Then try clicking the hypernym, hyponym, sister term, etc.
To be honest, none of those feels quite right.
Open Multilingual WordNet tries to link different languages. http://compling.hss.ntu.edu.sg/omw/ So you could take your English WordNet code, and move to the French WordNet with it, or vice versa.
The other approach is to use word embeddings. You find the, say, 300 dimensional, vector of your source word, and then hunt for the nearest words in that vector space. This will be returning words that are used in similar contexts, so they could be similar meaning, or similar syntactically.
Spacy has a good implementation, see https://spacy.io/usage/spacy-101#vectors-similarity and https://spacy.io/usage/vectors-similarity
Regarding English and French, normally you would work in the two languages independently. But if you search for "multilingual word embeddings" you will find some papers and projects where the vector stays the same for the same concept in different languages.
Note: the API is geared towards telling you how two words are similar, not finding similar words. To find similar words you need to take your vector and compare with every other word vector, which is O(N) in the size of the vocabulary. So you might want to do this offline, and build your own "synonyms-and-similar" dictionary for each word of interest.

from PyDictionary import PyDictionary
dictionary=PyDictionary()
answer = dictionary.synonym(word)
word is the word for which you are finding the synonyms.

Which technique is most appropriate for identifing various sentiments in the same text using python?

I'm studying NLP and as example I'm trying to identify what feelings are in customer feedback in the online course platform.
I was able to identify the feelings of the students with only simple sentence, such as "The course is very nice, I learned a lot from it", "The teaching platform is complete and I really enjoy using it", "I could have more courses related to marine biology", and so on.
My doubt is how to correctly identify the various sentiments in one sentence or in several sentences. For example:
A sentiment per sentence:
"The course is very good! it could be cool to create a section of questions on the site."
More than one sentiment per sentence:
"The course is very good, but the site is not."
Involving both:
"The course is very good, but the teaching platform is very slow. There could be more tasks and examples in the courses, interaction by video or microphone on the forum, for example."
I thought of splitting text in sentences, but it is not so good for the example 2.

You can think that comas, other punctuation marcs and some conjunctions and prepositions actually split sentences. This actually goes beyond code into the field of linguistics as they sometimes, but not always, separate sentences.
In the 2nd case you actually have two sentences: "The course is very good" -, but- "The site is not [very good]".
I believe there are NPL packages that can split sentences (Probably by knowing that most sentences follow the subject/predicate/object structure, so if you wind more than one verb then propbably you'll find the same ammount of sentences) and you could use those to parse your text first. Take a look for libraries doing that for your language of choice.

There is a lib specific for multi-label classification:
scikit-multilearn
when you train your model you have to split classes into binary columns.

Change sentiment of a single word

I've been working with NLTK in Python for a few days for sentiment analysis and it's a wonderful tool. My only concern is the sentiment it has for the word 'Quick'. Most of the data that I am dealing with has comments about a certain service and MOST refer to the service as being 'Quick' which clearly has Positive sentiments to it. However, NLTK refers to it as being Neutral. I want to know if it's even possible to retrain NLTK to now refer to the Quick adjective as having positive annotations?

I have fixed the problem. Found the vader Lexicon file in AppData\Roaming\nltk_data\sentiment. Going through the file I found that the word Quick wasn't even in it. The format of the file is as following:
Token Mean-sentiment StandardDeviation [list of sentiment score collected from 10 people ranging from -4 to 4]
I edited the file. Zipped it. Now NLTK refers to Quick as having positive sentiments.

The models used for sentiment analysis are generally the result of a machine-learning process. You can produce your own model by running the model creation on a training set where the sentiments are tagged the way you like, but this is a significant undertaking, especially if you are unfamiliar with the underpinnings.
For a quick and dirty fix, maybe just make your code override the sentiment for an individual word, or (somewhat more challenging) figure out how to change its value in the existing model. Though if you can get a hold of the corpus the NLTK maintainers trained their sentiment analysis on and can modify it, that's probably much simpler than figuring out how to change an existing model. If you have a corpus of your own with sentiments for all the words you care about, even better.
In general usage, "quick" is not superficially a polarized word -- indeed, "quick and dirty" is often vaguely bad, and a "quick assessment" is worse than a thorough one; while of course in your specific context, a service which delivers quickly will dominantly be a positive thing. There will probably be other words which have a specific polarity in your domain, even though they cannot be assigned a generalized polarity, and vice versa -- some words with a polarity in general usage will be neutral in your domain. Thus, training your own model may well be worth the effort, especially if you are exploring utterances in a very specific register.

Negation handling in NLP

I'm currently working on a project, where I want to extract emotion from text. As I'm using conceptnet5 (a semantic network), I can't however simply prefix words in a sentence that contains a negation-word, as those words would simply not show up in conceptnet5's API.
Here's an example:
The movie wasn't that good.
Hence, I figured that I could use wordnet's lemma functionality to replace adjectives in sentences that contain negation-words like (not, ...).
In the previous example, the algorithm would detect wasn't and would replace it with was not.
Further, it would detect a negation-word not, and replace good with it's antonym bad.
The sentence would read:
The movie was that bad.
While I see that this isn't the most elegant way, and it does probably in many cases produce the wrong result, I'd still like to handle negation that way as I frankly don't know any better approach.
Considering my problem:
Unfortunately, I did not find any library that would allow me to replace all occurrences of appended negation-words (wasn't => was not).
I mean I could do it manually, by replacing the occurrences with a regex, but then I would be stuck with the english language.
Therefore I'd like to ask if some of you know a library, function or better method that could help me here.
Currently I'm using python nltk, still it doesn't seem that it contains such functionality, but I may be wrong.
Thanks in advance :)

Cases like wasn't can be simply parsed by tokenization (tokens = nltk.word_tokenize(sentence)): wasn't will turn into was and n't.
But negative meaning can also be formed by 'Quasi negative words, like hardly, barely, seldom' and 'Implied negatives, such as fail, prevent, reluctant, deny, absent', look into this paper. Even more detailed analysis can be found in Christopher Potts' On the negativity of negation
.
Considering your initial problem, sentiment analysis, most modern approaches, as far as I know, don't process negations explicitly; instead, they use supervised approaches with high-order n-grams. Those actually processing negation usually append special prefix NOT_ to all words between negation and punctuation marks.

question on sentiment analysis

I have a question regarding sentiment analysis that i need help with.
Right now, I have a bunch of tweets I've gathered through the twitter search api. Because I used my search terms, I know what are the subjects or entities (Person names) that I want to look at. I want to know how others feel about these people.
For starters, I downloaded a list of english words with known valence/sentiment score and calculate the sentiments (+/-) based on availability of these words in the tweet. The problem is that sentiments calculated this way - I'm actually looking more at the tone of the tweet rather than ABOUT the person.
For instance, I have this tweet:
"lol... Person A is a joke. lmao!"
The message is obviously in a positive tone, but person A should get a negative.
To improve my sentiment analysis, I can probably take into account negation and modifiers from my word list. But how exactly can I get my sentiments analysis to look at the subject of the message (and possibly sarcasm) instead?
It would be great if someone can direct me towards some resources....

While awaiting for answers from researchers in AI field I will give you some clues on what you can do quickly.
Even though this topic requires knowledge from natural language processing, machine learning and even psychology, you don't have to start from scratch unless you're desperate or have no trust in the quality of research going on in the field.
One possible approach to sentiment analysis would be to treat it as a supervised learning problem, where you have some small training corpus that includes human made annotations (later about that) and a testing corpus on which you test how well you approach/system is performing. For training you will need some classifiers, like SVM, HMM or some others, but keep it simple. I would start from binary classification: good, bad. You could do the same for a continuous spectrum of opinion ranges, from positive to negative, that is to get a ranking, like google, where the most valuable results come on top.
For a start check libsvm classifier, it is capable of doing both classification {good, bad} and regression (ranking).
The quality of annotations will have a massive influence on the results you get, but where to get it from?
I found one project about sentiment analysis that deals with restaurants. There is both data and code, so you can see how they extracted features from natural language and which features scored high in the classification or regression.
The corpus consists of opinions of customers about restaurants they recently visited and gave some feedback about the food, service or atmosphere.
The connection about their opinions and numerical world is expressed in terms of numbers of stars they gave to the restaurant. You have natural language on one site and restaurant's rate on another.
Looking at this example you can devise your own approach for the problem stated.
Take a look at nltk as well. With nltk you can do part of speech tagging and with some luck get names as well. Having done that you can add a feature to your classifier that will assign a score to a name if within n words (skip n-gram) there are words expressing opinions (look at the restaurant corpus) or use weights you already have, but it's best to rely on a classfier to learn weights, that's his job.

In the current state of technology this is impossible.
English (and any other language) is VERY complicated and cannot be "parsed" yet by programs. Why? Because EVERYTHING has to be special-cased. Saying that someone is a joke is a special-case of a joke, which is another exception in your program. Etcetera, etc, etc.
A good example (posted by ScienceFriction somewhere here on SO):
Similarly, the sentiment word "unpredictable" could be positive in the context of a thriller but negative when describing the breaks system of the Toyota.
If you are willing to spend +/-40 years of your life on this subject, go ahead, it will be much appreciated :)

I don't entirely agree with what nightcracker said. I agree that it is a hard problem, but we are making a good progress towards the solution.
For example, 'part-of-speech' might help you to figure out subject, verb and object in the sentence. And 'n-grams' might help you in the Toyota vs. thriller example to figure out the context. Look at TagHelperTools. It is built on top of weka and provides part-of-speech and n-grams tagging.
Still, it is difficult to get the results that OP wants, but it won't take 40 years.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.