How to determine the "sentiment" between two named entities with Python/NLTK?

I'm using NLTK to extract named entities and I'm wondering how it would be possible to determine the sentiment between entities in the same sentence. For example, for "Jon loves Paris." I would get two entities, Jon and Paris. How would I be able to determine the sentiment between these two entities? In this case it should be something like Jon -> Paris = positive.

In short "you cannot". This task is far beyond simple text processing which is provided with NLTK. Such objects relations sentiment analysis could be the topic of the research paper, not something solvable with a simple approach. One possible method would be to perform a grammar analysis, extraction of the conceptual relation between objects and then independent sentiment analysis of words included, but as I said before - it is rather a reasearch topic.

Related

NER for German natural objects

I have some familiarity with R, and I am just starting with Python to get into NLP, with a specific interest in semantic analysis and Named Entity Recognition (I am currently learning spaCy).
I have a background in Humanities and very little computational knowledge.
With this in mind, I am interested in exploring sentiments in German literature of a specific period, in relation to the use of, and references to, geographical places and natural elements of the specific area and time in which this literature was produced.
I thought I could use dictionaries with tagged places/natural elements in combination with dictionaries for sentiments, and proceed in R with the text mining of my corpus, by analysing how emotions are expressed in proximity (or in relation to) the entities I am interested in.
Thus two questions: do such NER dictionaries exist for geographical/natural elements, and do they exist in German? Where could I find them?
I would be very happy to read any sort of suggestion. Thanks.
Stanford CoreNLP provides a good NER tagger, and NER models are available for German as well. See their website: https://nlp.stanford.edu/software/CRF-NER.html. Check how the predictions come out on your data.
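If you prefer to stay in Python, NLTK ships a wrapper for the Stanford tagger. A minimal sketch, assuming you have downloaded the Stanford NER release and its German model (the file names below depend on the release you download, so treat them as placeholders):

from nltk.tag.stanford import StanfordNERTagger

# Paths are assumptions: point them at your local Stanford NER download
st = StanfordNERTagger(
    'german.conll.hgc_175d_600dLBFGS.ser.gz',  # German CRF model shipped with the Stanford release
    'stanford-ner.jar')

tokens = 'Goethe wanderte durch den Harz .'.split()
print(st.tag(tokens))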

Can I use Natural Language Processing to identify given words in a paragraph, or do I need to use machine learning algorithms?

I need to identify some given words using NLP.
As an example, take the sentence:
Mary Lives in France
Suppose the given words are Australia, Germany, and France. This sentence includes only France. So among the above three given words, I need to identify that the sentence includes only France.
I would comment but I don't have enough reputation. It's a bit unclear exactly what you are trying to achieve here and how representative your example is - please edit your question to make it clearer.
Anyhow, like Guy Coder says, if you know exactly the words you are looking for, you don't really need machine learning or NLP libraries at all. However, if this is not the case and you don't have an example of every word you are looking for, the below might help:
It seems like what you are trying to do is perform Named Entity Recognition (NER), i.e. identify the named entities (e.g. countries) in your sentences. If so, the short answer is: you don't need to use any machine learning algorithms. You can just use a Python library such as spaCy, which comes out of the box with a pretrained language model that can already perform a bunch of tasks, for instance NER, to a high degree of performance. The following snippet should get you started:
import spacy

nlp = spacy.load('en')  # in newer spaCy versions, use a full model name like 'en_core_web_sm'
doc = nlp("Mary Lives in France")
for entity in doc.ents:
    if entity.label_ == "GPE":  # keep only geopolitical entities
        print(entity.text)
The output of the above snippet is "France". Named entities cover a wide range of possible things. In the snippet above I have filtered for Geopolitical entities (GPE).
Learn more about spaCy here: https://spacy.io/usage/spacy-101

Identifying the subject of a sentence

I have been exploring NLP techniques with the goal of identifying the subject of survey comments (which I then use in conjunction with sentiment analysis). I want to make high level statements such as "10% of survey respondents made a positive comment (+ sentiment) about Account Managers".
My approach has used Named Entity Recognition (NER). Now that I am working with real data, I am getting visibility of some of the complexities & nuances associated with identifying the subject of a sentence. Here are 5 examples of sentences where the subject is the Account Manager. I have put the named entity in bold for demonstration purposes.
Our account manager is great, he always goes the extra mile!
Steve our account manager is great, he always goes the extra mile!
Steve our relationship manager is great, he always goes the extra mile!
Steven is great, he always goes the extra mile!
Steve Smith is great, he always goes the extra mile!
Our business mgr. is great, he always goes the extra mile!
I see three challenges that add complexity to my task:
Synonyms: Account manager vs relationship manager vs business mgr. This is somewhat domain specific and tends to vary with the survey target audience.
Abbreviations: Mgr. vs manager
Ambiguity - whether "Steven" is "Steve Smith" and therefore an "account manager".
Of these, the synonym problem is the most frequent issue, followed by the ambiguity issue. Based on what I have seen, the abbreviation issue isn't that frequent in my data.
Are there any NLP techniques that can help deal with any of these issues to a relatively high degree of confidence?
As far as I understood, what you call the "subject" is, given a sentence, the entity that a statement is made about - in your example, Steve the account manager.
Based on this assumption, here are a few techniques and how they might help you:
(Dependency) Parsing
Since you don't mean subject in the strict grammatical sense, the approach suggested by user7344209 based on dependency parsing probably won't help you. In a sentence such as "I like Steve", the grammatical subject is "I", although you probably want to find "Steve" as the "subject".
Named Entity Recognition
You already use this, and it will be great for detecting names of persons such as Steve. What I'm not so sure about is the example of the "account manager". Both the output provided by Daniel and my own test with Stanford CoreNLP did not identify it as a named entity - which is correct, since it really is not a named entity.
Something broader, such as the suggested mention identification, might be better, but it basically marks every noun phrase, which is probably too broad. If I understood correctly, you want to find one subject per sentence.
Coreference Resolution
Coreference resolution is the key technique for detecting that "Steve" and the "account manager" are the same entity. Stanford CoreNLP has such a module, for example.
In order for this to work in your example, you have to let it process several sentences at once, since you want to find the links between them. Here is an example with shortened versions of some of your sentences.
The visualization in the CoreNLP demo is a bit messy, but it basically found the following coreference chains:
Steve <-> Steve Smith
Steve our account manager <-> He <-> Our account manager
Our <-> Our
the extra mile <-> the extra mile
Given the first two chains, and a bit of post-processing, you could figure out that all four statements are about the same entity.
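If you want to reproduce this programmatically, here is a minimal sketch using the pycorenlp wrapper. It assumes a CoreNLP server is already running locally on port 9000, and that the standard 'corefs' key of CoreNLP's JSON output is what you want to inspect:

from pycorenlp import StanfordCoreNLP  # pip install pycorenlp

# Assumption: a CoreNLP server was started locally first, e.g. with
# java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
nlp = StanfordCoreNLP('http://localhost:9000')

text = ("Steve our account manager is great. "
        "Steve Smith is great, he always goes the extra mile!")
out = nlp.annotate(text, properties={'annotators': 'coref',
                                     'outputFormat': 'json'})

# 'corefs' maps chain ids to lists of mentions
for chain in out['corefs'].values():
    print([mention['text'] for mention in chain])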
Semantic Similarity
In the case of account, business and relationship manager, I found that the CoreNLP coreference resolver actually already finds chains despite the different terms.
More generally, if you think that the coreference resolver cannot handle synonyms and paraphrases well enough, you could also try to include measures of semantic similarity. There is a lot of work in NLP on predicting whether two phrases are synonymous or not.
Some approaches are:
Looking up synonyms in a thesaurus such as WordNet - e.g. with NLTK (Python) as shown here
Better, compute a similarity measure based on the relationships defined in WordNet - e.g. using SEMILAR (Java)
Using continuous representations of words to compute similarities, for example based on LSA or LDA - also possible with SEMILAR
Using more recent neural-network-style word embeddings such as word2vec or GloVe - the latter are easily usable with spaCy (Python)
An idea for using these similarity measures would be to identify the entities in two sentences, then make pairwise comparisons between the entities in both sentences; if a pair's similarity is higher than a threshold, consider them to be the same entity. A minimal sketch of this follows.
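As a minimal sketch of the threshold idea, using spaCy's GloVe-based vectors (the model name and the 0.7 threshold are assumptions; the threshold would need tuning on real data):

import spacy

# Assumption: a model with real word vectors, e.g. en_core_web_md;
# the small models ship without proper vectors
nlp = spacy.load('en_core_web_md')

def same_entity(phrase_a, phrase_b, threshold=0.7):
    # Treat two mentions as the same entity if their vector similarity
    # exceeds the (hand-picked) threshold
    return nlp(phrase_a).similarity(nlp(phrase_b)) >= threshold

print(same_entity("account manager", "relationship manager"))
print(same_entity("account manager", "extra mile"))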
If you don't have much data to train on, you can probably try a dependency parser and extract dependency pairs that have a SUBJECT identified (usually nsubj if you use the Stanford Parser). A small sketch of this follows.
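For illustration, here is a small sketch of that idea using spaCy's dependency parser in place of the Stanford Parser (a swap of my own; the idea is the same):

import spacy

nlp = spacy.load('en_core_web_sm')  # model name is an assumption; any English model works
doc = nlp("Steve our account manager is great, he always goes the extra mile!")

for token in doc:
    if token.dep_ == "nsubj":  # grammatical subject of its head verb
        print(token.text, "<-nsubj-", token.head.text)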
I like your approach using NER. This is what I see in our system for your inputs (figures not reproduced here).
Mention-detection output might also be useful.
On your 2nd point, which involves abbreviations, it is a hard problem. But we have an entity-similarity module here that might be useful. It takes into account things like honorifics etc.
About your 3rd point, the coreference problem, try the coref module.
Btw, the figures mentioned above are from the demo here: http://deagol.cs.illinois.edu:8080

Named Entity Resolution Algorithm

I was trying to build an entity resolution system, where my entities are:
(i) General named entities, that is, organization, person, location, date, time, money, and percent.
(ii) Some other entities like product, or a person's title such as president, CEO, etc.
(iii) Coreferred entities like pronoun, determiner phrase, synonym, string match, demonstrative noun phrase, alias, apposition.
From various literature and other references, I have defined the scope such that I do not consider the ambiguity of an entity beyond its entity category. That is, I take the Oxford of Oxford University as different from Oxford the place, as the former is the first word of an organization entity and the latter is a location entity.
My task is to construct one resolution algorithm, where I would extract and resolve the entities.
So, I am working out an entity extractor in the first place.
In the second place, if I try to relate the coreferences as I found in various literature, like this seminal work, they work out a decision-tree-based algorithm with features like distance, i-pronoun, j-pronoun, string match, definite noun phrase, demonstrative noun phrase, number agreement, semantic class agreement, gender agreement, both proper names, alias, apposition, etc.
The algorithm seems a nice one, where entities are extracted with a Hidden Markov Model (HMM).
I could work out an entity recognition system with an HMM.
Now I am trying to work out a coreference as well as an entity resolution system. I was wondering whether, instead of using so many features, I could use an annotated corpus and train an HMM-based tagger on it directly, with a view to solving a relationship extraction like:
*"Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA
small/NA as/NA it/NA may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"
where, PERS-> PERSON
PPERS->PERSONAL PRONOUN TO PERSON
PoPERS-> POSSESSIVE PRONOUN TO PERSON
APPERS-> APPOSITIVE TO PERSON
LOC-> LOCATION
NA-> NOT AVAILABLE*
Would I be wrong? I made an experiment with around 10,000 words, and early results seem encouraging. With support from one of my colleagues, I am trying to insert some semantic information like PERSUSPOL, LOCCITUS, PoPERSM, etc. (for PERSON OF US IN POLITICS, LOCATION CITY US, POSSESSIVE PERSON MALE) into the tagset, to incorporate entity disambiguation in one go. My feeling is that relationship extraction would be much better then. A toy version of this training setup is sketched below.
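For what it's worth, here is a toy sketch of that training setup with NLTK's HMM trainer (the two-sentence corpus below is a made-up stand-in for the real annotated corpus described above):

from nltk.tag import hmm

# Stand-in training data in the custom tagset described above;
# a real experiment needs the ~10,000-word annotated corpus
train = [
    [("Obama", "PERS"), ("is", "NA"), ("delivering", "NA"), ("a", "NA"),
     ("lecture", "NA"), ("in", "NA"), ("Washington", "LOC")],
    [("he", "PPERS"), ("knew", "NA"), ("it", "NA"), ("was", "NA"),
     ("going", "NA"), ("to", "NA"), ("be", "NA"), ("small", "NA")],
]

tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)
print(tagger.tag("Obama knew it".split()))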
Please see this new thought too.
I got some good results with a Naive Bayes classifier as well, where sentences having predominantly one set of keywords are marked as one class.
If anyone can suggest a different approach, please feel free to do so.
I use Python 2.x on MS Windows, with libraries like NLTK, scikit-learn, Gensim, pandas, NumPy, SciPy, etc.
Thanks in advance.
It seems that you are going down three totally different paths, each of which could be a standalone PhD. There is plenty of literature on each. My first advice: focus on the main task and outsource the rest. If you are developing this for a less-resourced language, you can also build on others' work.
Named Entity Recognition
Stanford NLP has gone really far here, especially for English. They resolve named entities really well, they are widely used, and they have a nice community.
Other solutions may exist, e.g. in OpenNLP for Python.
Some have tried to extend it to unusual fine-grained types, but you need much bigger training data to cover the cases, and the decision becomes much harder.
Edit: Stanford NER is available from NLTK in Python; a minimal sketch follows.
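A minimal sketch of the NLTK wrapper (the jar and model paths are assumptions; point them at your local Stanford NER download):

from nltk.tag.stanford import StanfordNERTagger

st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',  # standard 3-class English model
                       'stanford-ner.jar')
print(st.tag('Oxford University is in Oxford .'.split()))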
Named Entity Resolution/Linking/Disambiguation
This is concerned with linking a name to a knowledge base, and solves the problem of whether Oxford refers to Oxford University or the city of Oxford.
AIDA is one of the state-of-the-art systems here. It uses different kinds of context information as well as coherence information. They have also tried to support several languages, and they have a good benchmark.
Babelfy offers an interesting API that does NER and NED for entities and concepts. They support many languages, but in my experience it never worked very well.
Others include TagMe and Wikifier, etc.
Coreference Resolution
Stanford CoreNLP also has some good work in that direction. I can also recommend this work, where they combined coreference resolution with NED.

Question on sentiment analysis

I have a question regarding sentiment analysis that I need help with.
Right now, I have a bunch of tweets I've gathered through the Twitter search API. Because I chose the search terms, I know what the subjects or entities (person names) are that I want to look at. I want to know how others feel about these people.
For starters, I downloaded a list of English words with known valence/sentiment scores and calculate the sentiment (+/-) based on the presence of these words in the tweet. The problem is that with sentiment calculated this way, I'm actually looking at the tone of the tweet rather than the sentiment ABOUT the person.
For instance, I have this tweet:
"lol... Person A is a joke. lmao!"
The message is obviously in a positive tone, but person A should get a negative.
To improve my sentiment analysis, I can probably take into account negation and modifiers from my word list. But how exactly can I get my sentiment analysis to look at the subject of the message (and possibly sarcasm) instead?
It would be great if someone can direct me towards some resources....
While awaiting answers from researchers in the AI field, I will give you some clues about what you can do quickly.
Even though this topic requires knowledge from natural language processing, machine learning and even psychology, you don't have to start from scratch unless you're desperate or have no trust in the quality of research going on in the field.
One possible approach to sentiment analysis is to treat it as a supervised learning problem, where you have a small training corpus that includes human-made annotations (more on that later) and a testing corpus on which you measure how well your approach/system is performing. For training you will need a classifier, like an SVM, an HMM, or some other, but keep it simple. I would start from binary classification: good vs. bad. You could do the same for a continuous spectrum of opinion ranges, from positive to negative, that is, produce a ranking, like Google, where the most valuable results come out on top.
For a start, check the libsvm classifier; it is capable of doing both classification {good, bad} and regression (ranking). A toy sketch of the classification setup follows.
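As a toy illustration of the supervised setup, here scikit-learn's LinearSVC stands in for raw libsvm, and the four training texts are made up and far too few for real use:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data; a real system needs a human-annotated corpus
texts = ["great service, loved it", "terrible, would not recommend",
         "absolutely fantastic", "awful experience, a joke"]
labels = ["good", "bad", "good", "bad"]

# TF-IDF over unigrams and bigrams, then a linear SVM
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["the service was fantastic"]))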
The quality of annotations will have a massive influence on the results you get, but where to get it from?
I found one project about sentiment analysis that deals with restaurants. There is both data and code, so you can see how they extracted features from natural language and which features scored high in the classification or regression.
The corpus consists of opinions of customers about restaurants they recently visited and gave some feedback about the food, service or atmosphere.
The connection between their opinions and the numerical world is expressed in terms of the number of stars they gave to the restaurant. You have natural language on one side and the restaurant's rating on the other.
Looking at this example you can devise your own approach for the problem stated.
Take a look at NLTK as well. With NLTK you can do part-of-speech tagging, and with some luck get names as well. Having done that, you can add a feature to your classifier that assigns a score to a name if within n words (a skip n-gram) there are words expressing opinions (look at the restaurant corpus), or use the weights you already have; but it's best to rely on a classifier to learn the weights, that's its job. A rough sketch of the name-window idea follows.
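A rough sketch of that windowing idea with NLTK (the tiny opinion lexicon and the window size are placeholders; substitute your downloaded valence list):

import nltk

# One-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
OPINION_WORDS = {"joke": -1, "great": 1, "awful": -1}  # stand-in for a real valence list

def score_names(sentence, window=4):
    # Naively attribute opinion words to proper nouns within `window` tokens
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    scores = {}
    for i, (word, tag) in enumerate(tagged):
        if tag == "NNP":  # proper noun, the candidate "subject"
            context = tagged[max(0, i - window): i + window + 1]
            scores[word] = sum(OPINION_WORDS.get(w.lower(), 0) for w, _ in context)
    return scores

print(score_names("lol ... Steve is a joke . lmao !"))  # roughly {'Steve': -1}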
In the current state of technology this is impossible.
English (and any other language) is VERY complicated and cannot be "parsed" yet by programs. Why? Because EVERYTHING has to be special-cased. Saying that someone is a joke is one such special case, which becomes yet another exception in your program. Etcetera, etc., etc.
A good example (posted by ScienceFriction somewhere here on SO):
Similarly, the sentiment word "unpredictable" could be positive in the context of a thriller but negative when describing the brake system of the Toyota.
If you are willing to spend +/-40 years of your life on this subject, go ahead, it will be much appreciated :)
I don't entirely agree with what nightcracker said. I agree that it is a hard problem, but we are making good progress towards a solution.
For example, part-of-speech tagging might help you figure out the subject, verb, and object in a sentence, and n-grams might help in the Toyota vs. thriller example to figure out the context. Look at TagHelperTools. It is built on top of Weka and provides part-of-speech and n-gram tagging.
Still, it is difficult to get the results that the OP wants, but it won't take 40 years.
