Identifying the subject of a sentence - python

I have been exploring NLP techniques with the goal of identifying the subject of survey comments (which I then use in conjunction with sentiment analysis). I want to make high level statements such as "10% of survey respondents made a positive comment (+ sentiment) about Account Managers".
My approach has used Named Entity Recognition (NER). Now that I am working with real data, I am getting visibility of some of the complexities and nuances associated with identifying the subject of a sentence. Here are some example sentences where the subject is the Account Manager. I have put the named entity in bold for demonstration purposes.
Our account manager is great, he always goes the extra mile!
Steve our account manager is great, he always goes the extra mile!
Steve our relationship manager is great, he always goes the extra mile!
Steven is great, he always goes the extra mile!
Steve Smith is great, he always goes the extra mile!
Our business mgr. is great, he always goes the extra mile!
I see three challenges that add complexity to my task:
Synonyms: Account manager vs relationship manager vs business mgr. This is somewhat domain specific and tends to vary with the survey target audience.
Abbreviations: Mgr. vs manager
Ambiguity - Whether “Steven” is “Steve Smith” and therefore an “account manager”.
Of these the synonym problem is the most frequent issue, followed by the ambiguity issues. Based on what I have seen, the abbreviation issue isn’t that frequent in my data.
Are there any NLP techniques that can help deal with any of these issues to a relatively high degree of confidence?

As far as I understood, what you call the "subject" is, given a sentence, the entity that a statement is made about - in your example, Steve the account manager.
Based on this assumption, here are a few techniques and how they might help you:
(Dependency) Parsing
Since you don't mean subject in the strict grammatical sense, the approach suggested by user7344209 based on dependency parsing probably won't help you. In a sentence such as "I like Steve", the grammatical subject is "I", although you probably want to find "Steve" as the "subject".
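To make that concrete, here is a minimal sketch with spaCy's dependency parser (it assumes the small English model is installed via python -m spacy download en_core_web_sm); it shows that the grammatical subject of "I like Steve" is "I", not "Steve":

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I like Steve")
    for token in doc:
        if token.dep_ == "nsubj":
            print(token.text)   # prints "I", not "Steve"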
Named Entity Recognition
You already use this, and it will be great to detect names of persons such as Steve. What I'm not so sure about is the example of the "account manager". Neither the output provided by Daniel nor my own test with Stanford CoreNLP identified it as a named entity - which is correct, it really is not a named entity.
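For what it's worth, a quick check with spaCy's NER (used here as a stand-in, not the Stanford output referred to above) shows the same thing:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Steve our account manager is great, he always goes the extra mile!")
    print([(ent.text, ent.label_) for ent in doc.ents])
    # typically yields something like [('Steve', 'PERSON')];
    # "account manager" is a plain noun phrase, not a named entity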
Something broader such as the suggested mention identification might be better, but it basically marks every noun phrase which is probably too broad. If I understood it correctly, you want to find one subject per sentence.
Coreference Resolution
Coreference Resolution is the key technique to detect that "Steve" and the "account manager" are the same entity. Stanford CoreNLP has such a module, for example.
In order for this to work in your example, you have to let it process several sentences at once, since you want to find the links between them. Here is an example with shortened versions of some of your sentences:
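If you want to script this rather than use the demo, a hedged sketch with the stanza CoreNLPClient wrapper around Stanford CoreNLP might look like the following (it assumes a local CoreNLP installation, Java on the path, and CORENLP_HOME set; the exact annotator list is an assumption):

    from stanza.server import CoreNLPClient

    text = ("Steve our account manager is great. "
            "He always goes the extra mile. "
            "Steve Smith is great.")

    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma",
                                   "ner", "depparse", "coref"],
                       timeout=60000, memory="4G") as client:
        ann = client.annotate(text)
        for chain in ann.corefChain:
            # each chain groups mentions that refer to the same entity
            print(chain)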
The visualization is a bit messy, but it basically found the following coreference chains:
Steve <-> Steve Smith
Steve our account manager <-> He <-> Our account manager
Our <-> Our
the extra mile <-> the extra mile
Given the first two chains, and a bit of post-processing, you could figure out that all four statements are about the same entity.
Semantic Similarity
In the case of account, business and relationship manager, I found that the CoreNLP coreference resolver actually already finds chains despite the different terms.
More generally, if you think that the coreference resolver cannot handle synonyms and paraphrases well enough, you could also try to include measures of semantic similarity. There is a lot of work in NLP on predicting whether two phrases are synonymous or not.
Some approaches are:
Looking up synonyms in a thesaurus such as WordNet - e.g. with nltk (python) as shown here
Better, compute a similarity measure based on the relationships defined in WordNet - e.g. using SEMILAR (Java)
Using continuous representations for words to compute similarities, for example based on LSA or LDA - also possible with SEMILAR
Using more recent neural-network-style word embeddings such as word2vec or GloVe - the latter are easily usable with spacy (python)
One way to use these similarity measures would be to identify entities in two sentences, then make pairwise comparisons between the entities in both sentences, and if a pair has a similarity higher than a threshold, consider it to be the same entity.
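As a rough illustration of the first and last bullets above, here is a hedged sketch combining a WordNet synonym lookup (NLTK, requires the wordnet data) with vector similarity (spaCy, requires a model with vectors such as en_core_web_md); the 0.8 threshold is an arbitrary assumption you would tune:

    from nltk.corpus import wordnet as wn
    import spacy

    # 1) thesaurus lookup
    synonyms = {lemma.name() for syn in wn.synsets("manager") for lemma in syn.lemmas()}
    print(synonyms)  # e.g. {'manager', 'director', 'coach', 'handler', ...}

    # 2) embedding similarity with a (hypothetical) threshold
    nlp = spacy.load("en_core_web_md")
    a, b = nlp("account manager"), nlp("relationship manager")
    if a.similarity(b) > 0.8:      # threshold is an assumption to tune
        print("treat as the same entity")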

If you don't have much data to train on, you can probably try a dependency analysis tool and extract dependency pairs which have a SUBJECT identified (usually the nsubj relation if you use the Stanford Parser).
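For example, with spaCy (as a stand-in for the Stanford Parser) the nsubj pairs can be pulled out like this:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Our account manager is great, he always goes the extra mile!")
    pairs = [(tok.text, tok.head.text) for tok in doc if tok.dep_ == "nsubj"]
    print(pairs)  # e.g. [('manager', 'is'), ('he', 'goes')]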

I like your approach using NER. This is what I see in our system for your inputs:
Mention-Detection output might also be useful:
On your 2nd point, which involves abbreviations, it is a hard problem. But we have an entity-similarity module here that might be useful; it takes into account things like honorifics etc.
About your 3rd point, the co-reference problem, try the coref module:
Btw the above figures are from the demo here: http://deagol.cs.illinois.edu:8080

Related

How to classify unseen text data?

I am training a text classifier for addresses, i.e. to determine whether a given sentence is an address or not.
Sentence examples:
(1) Mirdiff City Centre, DUBAI United Arab Emirates
(2) Ultron Inc. <numb> Toledo Beach Rd #1189 La Salle, MI <numb>
(3) Avenger - HEAD OFFICE P.O. Box <numb> India
As addresses can be of many types, it's very difficult to build such a classifier. Is there any pre-trained model or database for this, or any other non-ML way?
As mentioned earlier, verifying that an address is valid is probably better formalized as an information retrieval problem rather than a machine learning problem (e.g. using a service).
However, from the examples you gave, it seems like you have several entity types that reoccur, such as organizations and locations.
I'd recommend enriching the data with a NER, such as spaCy, and using the entity types as either a feature or a rule.
Note that named-entity recognizers rely more on context than the typical bag-of-words classifier, and are usually more robust to unseen data.
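A minimal sketch of that enrichment step with spaCy (the feature names below are just illustrative, not a fixed recipe):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def entity_features(text):
        doc = nlp(text)
        labels = [ent.label_ for ent in doc.ents]
        return {
            "has_gpe": "GPE" in labels,          # countries, cities, states
            "has_org": "ORG" in labels,
            "has_fac_or_loc": any(l in ("FAC", "LOC") for l in labels),
        }

    print(entity_features("Mirdiff City Centre, DUBAI United Arab Emirates"))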
When I did this the last time, the problem was very hard, especially since I had international addresses and the variation across countries is enormous. Add to that the variation introduced by people, and the problem becomes quite hard even for humans.
I finally built a heuristic (does it contain something like PO BOX, a likely country name (grepped from Wikipedia), maybe city names) and then threw every remaining maybe-address at the Google Maps API. Google Maps is quite good at recognizing addresses, but even that will produce false positives, so manual checking will most likely be needed.
I did not use ML because my address db was "large" but not large enough for training; in particular, we lacked labeled training data.
As you are asking for literature recommendations (btw, this question is probably too broad for this place), I can recommend a few links:
https://www.reddit.com/r/datasets/comments/4jz7og/how_to_get_a_large_at_least_100k_postal_address/
https://www.red-gate.com/products/sql-development/sql-data-generator/
https://openaddresses.io/
You need to build labeled data, as #Christian Sauer has already mentioned, where you have examples with addresses. And you probably need to make negative examples with non-addresses as well! So, for example, you have to make sentences with only telephone numbers or whatever. In any case this will be quite an imbalanced dataset, as you will have a lot of correct addresses and only a few which are not addresses. In total you would need around 1000 examples as a starting point.
Another option is to identify the basic addresses manually and do a similarity analysis to identify the sentences which are closest to them.
As mentioned by Uri Goren, the problem is one of named entity recognition, and there are a lot of trained models on the market. Still, the best one can get is the Stanford NER.
https://nlp.stanford.edu/software/CRF-NER.shtml
It is a conditional random field (CRF) NER and is available in Java.
If you are looking for a Python implementation of the same, have a look at:
How to install and invoke Stanford NERTagger?
Here you can gather info from sequences of tags (e.g. consecutive LOCATION tags), or any other sequence like that. If it doesn't give you exactly the right span, it will still get you closer to an address in the whole document. That's a head start.
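A hedged sketch of that workflow with NLTK's Stanford wrapper; the model and jar paths are placeholders for your local Stanford NER download, Java must be available, and nltk's punkt tokenizer data is assumed:

    from nltk.tag import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    st = StanfordNERTagger(
        "english.all.3class.distsim.crf.ser.gz",   # path to your local model
        "stanford-ner.jar",                        # path to your local jar
    )

    tokens = word_tokenize("Ultron Inc. Toledo Beach Rd La Salle, MI")
    tagged = st.tag(tokens)
    print(tagged)  # e.g. [('Ultron', 'ORGANIZATION'), ..., ('MI', 'LOCATION')]

    # group consecutive non-O tags into candidate address spans
    spans, current = [], []
    for tok, tag in tagged:
        if tag != "O":
            current.append(tok)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    print(spans)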
Thanks.

Named Entity Resolution Algorithm

I was trying to build an entity resolution system, where my entities are:
(i) General named entities, that is organization, person, location, date, time, money, and percent.
(ii) Some other entities like product, or a person's title like president, CEO, etc.
(iii) Coreferred entities like pronoun, determiner phrase, synonym, string match, demonstrative noun phrase, alias, apposition.
From various literature and other references, I have defined the scope such that I do not consider the ambiguity of an entity beyond its entity category. That is, I treat the Oxford of Oxford University as different from Oxford the place, since the former is the first word of an organization entity and the latter is a location entity.
My task is to construct one resolution algorithm, where I would extract and resolve the entities.
So, I am working out an entity extractor in the first place.
In the second place, if I try to relate the coreferences as found in various literature, like this seminal work, they work out a decision-tree-based algorithm with features such as distance, i-pronoun, j-pronoun, string match, definite noun phrase, demonstrative noun phrase, number agreement, semantic class agreement, gender agreement, both proper names, alias, apposition, etc.
The algorithm seems a nice one, where entities are extracted with a Hidden Markov Model (HMM).
I could work out one entity recognition system with HMM.
Now I am trying to work out a coreference as well as an entity resolution system. I was wondering whether, instead of using so many features, I could use an annotated corpus and train it directly with an HMM-based tagger, with a view to solving relationship extraction like:
"Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA small/NA as/NA it/NA may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"
where,
PERS -> PERSON
PPERS -> PERSONAL PRONOUN TO PERSON
PoPERS -> POSSESSIVE PRONOUN TO PERSON
APPERS -> APPOSITIVE TO PERSON
LOC -> LOCATION
NA -> NOT AVAILABLE
Would I be wrong? I made an experiment with around 10,000 words, and early results seem encouraging. With support from one of my colleagues I am trying to insert some semantic information like PERSUSPOL, LOCCITUS, PoPERSM, etc. (for PERSON OF US IN POLITICS, LOCATION CITY US, POSSESSIVE PERSON MALE) into the tagset, to incorporate entity disambiguation in one go. My feeling is that relationship extraction would be much better now.
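For what it's worth, a toy sketch of training an HMM tagger on such a custom tagset with NLTK (a tiny made-up corpus, not the poster's actual data or setup) could look like this:

    from nltk.tag import hmm

    train_data = [
        [("Obama", "PERS"), ("is", "NA"), ("delivering", "NA"), ("a", "NA"),
         ("lecture", "NA"), ("in", "NA"), ("Washington", "LOC")],
        [("he", "PPERS"), ("knew", "NA"), ("his", "PoPERS"), ("speech", "NA"),
         ("was", "NA"), ("short", "NA")],
    ]

    trainer = hmm.HiddenMarkovModelTrainer()
    tagger = trainer.train_supervised(train_data)
    print(tagger.tag("Obama knew his lecture".split()))  # toy output only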
Please see this new thought too.
I also got some good results with a Naive Bayes classifier, where sentences having predominantly one set of keywords are marked as one class.
If any one may suggest any different approach, please feel free to suggest so.
I use Python 2.x on MS Windows and try to use libraries like NLTK, scikit-learn, Gensim, pandas, NumPy, SciPy, etc.
Thanks in Advance.
It seems that you are going down three totally different paths, each of which could be a standalone PhD. There is a lot of literature on all of them. My first advice: focus on the main task and outsource the rest. If you are developing this for a less common language, you can also build on others' work.
Named Entity Recognition
Stanford NLP has come a really long way here, especially for English. They resolve named entities really well, are widely used, and have a nice community.
Other solutions may exist in OpenNLP for Python.
Some have tried to extend it to unusual fine-grained types, but you need much more training data to cover the cases and the decision becomes much harder.
Edit: Stanford NER is available through NLTK in Python.
Named Entity Resolution/Linking/Disambiguation
This is concerned with linking the name to some knowledge base, and solves the problem of whether a mention of Oxford refers to Oxford University or the city of Oxford.
AIDA is one of the state-of-the-art systems here. It uses context information as well as coherence information, they have tried supporting several languages, and they have a good benchmark.
Babelfy offers an interesting API that does NER and NED for entities and concepts. It supports many languages, but never worked very well.
Others include TagMe, Wikifier, etc.
Coreference Resolution
Stanford CoreNLP also has some good work in that direction. I can also recommend this work, where they combined coreference resolution with NED.

How to automatically label a cluster of words using semantics?

The context is: I already have clusters of words (phrases actually) resulting from k-means applied to internet search queries, using common URLs in the search engine results as a distance (co-occurrence of URLs rather than words, if I simplify a lot).
I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together.
For example - sorry for the subject of my example - if I have the following bunch of queries : ['my husband attacked me','he was arrested by the police','the trial is still going on','my husband can go to jail for harrassing me ?','free lawyer']
My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem so the label could be "legal" for example.
I am new to NLP, but I should be clear that I don't want to extract words using POS tagging (or at least that is not the expected final outcome, though it may be a necessary preliminary step).
I read about WordNet for sense disambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input), nor obtain the definition of one selected word from the context provided by the whole bunch of words (which word would I select in that case?). I want to use the whole bunch of words to provide a context (maybe using synsets or categorization with the XML structure of WordNet) and then summarize the context in one or a few words.
Any ideas? I can use R or Python. I read a little about NLTK but I can't find a way to use it in my context.
Your best bet is probably to label the clusters manually, especially if there are few of them. This is a difficult problem even for humans to solve, because you might need a domain expert. Anyone claiming they could do that automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.
Also, going through the clusters yourself will have benefits. 1) You may discover you had the wrong number of clusters (the k parameter) or that there was too much junk in the input to begin with. 2) You will gain qualitative insight into what is being talked about and what topics there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need quantitative results too, you could then train a classifier on the manually labelled topics to 1) predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, ...
When we talk about semantics in this area, we mean statistical semantics. Statistical or distributional semantics is very different from other definitions of semantics which have logic and reasoning behind them. Statistical semantics is based on the distributional hypothesis, which treats context as the meaning-bearing aspect of words and phrases. Meaning in a very abstract and general sense is, in different literature, called topics. There are several unsupervised methods for modelling topics, such as LDA, or even word2vec, which basically provides a word similarity metric or suggests a list of similar words as another kind of context. Usually when you have these unsupervised clusters, you need a domain expert to tell the meaning of each cluster.
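As a rough illustration of that unsupervised route, here is a hedged sketch using gensim's LDA to surface candidate label words for one cluster (a domain expert would still pick the final label; the preprocessing is deliberately minimal):

    from gensim import corpora, models

    cluster = [
        "my husband attacked me",
        "he was arrested by the police",
        "the trial is still going on",
        "free lawyer",
    ]
    texts = [q.lower().split() for q in cluster]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    lda = models.LdaModel(corpus, num_topics=1, id2word=dictionary, passes=10)
    print(lda.show_topic(0, topn=5))   # top words, e.g. police / trial / lawyer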
However, for several reasons you might accept a lower-accuracy assignment of a word as the general topic (or, in your words, the "global semantic") of a list of phrases. If this is the case, I would suggest taking a look at Word Sense Disambiguation tasks that look for coarse-grained word senses. For WordNet, this is called the supersense tagging task.
This paper is worth a look: More or less supervised supersense tagging of Twitter.
And about your question on choosing words from the current phrases: there is also an active question about converting phrases to vectors; my word2vec-style answer to that question might be useful:
How can a sentence or a document be converted to a vector?
I can add more related papers later if it comes to my mind.
The paper Automatic Labelling of Topic Models explains the authors' approach to this problem. To provide an overview, I can tell you that they generate some label candidates using information retrieved from Wikipedia and Google, and once they have the list of candidates in place they rank those candidates to find the best label.
I think the code is not available online, but I have not looked for it.
The package chowmein claims to do this in python using the algorithm outlined in Automatic Labeling of Multinomial Topic Models.
One possible approach, which the papers below suggest, is identifying the set of keywords from the cluster, getting all their synonyms, and then finding the hypernyms of each synonym.
The idea is to get a more abstract meaning for the cluster by using the hypernym.
Example: a word cluster containing the words dog and wolf should not be labelled with either word, but as canids. They achieve this using synonymy and hypernymy (see the sketch after the papers below).
Cluster Labeling by Word Embeddings and WordNet's Hypernymy
Automated Text Clustering and Labeling using Hypernyms
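Here is the sketch of the hypernym idea using WordNet via NLTK (it assumes the WordNet data has been downloaded with nltk.download("wordnet")):

    from nltk.corpus import wordnet as wn

    dog = wn.synset("dog.n.01")
    wolf = wn.synset("wolf.n.01")
    # the lowest common hypernym gives a more abstract label for the pair
    print(dog.lowest_common_hypernyms(wolf))   # e.g. [Synset('canine.n.02')]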

Named entity recognition : For new/latest entities

Sorry for the weird question title, but I couldn't think of a more appropriate one.
I'm new to NLP concepts, so I used the NER demo (http://cogcomp.cs.illinois.edu/demo/ner/results.php). Now the issue is how, and in what ways, I can use these taggings produced by NER. I mean, what answers or inferences can one draw from these named entities which have been tagged into certain groups - location, person, organization, etc.? If I have data which contains the names of entirely new companies, places, etc., then how am I going to do this NER tagging for such data?
Please don't downvote or block me, I just need guidance/expert suggestions, that's it. Reading about a concept is one thing, while knowing where and when to apply it is another, which is why I'm asking for guidance. Thanks a ton!
A snippet from the demo:
Dogs have been used in cargo areas for some time, but have just been introduced recently in
passenger areas at LOC Newark and LOC JFK airports. LOC JFK has one dog and LOC Newark has a
handful, PER Farbstein said.
Usually NER is a step in a pipeline. For example, once all entities have been tagged, if you have many sentences like "[PER John Smith], CEO of [ORG IBM], said...", then you can set up a table of companies and CEOs. This is a form of knowledge base population.
There are plenty of other uses, though, depending on the type of data you already have and what you are trying to accomplish.
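As a toy illustration of that pipeline idea, with spaCy standing in for the NER step (the ", CEO of" trigger and the table layout are just assumptions for this example):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("John Smith, CEO of IBM, said the quarter went well.")

    persons = [e for e in doc.ents if e.label_ == "PERSON"]
    orgs = [e for e in doc.ents if e.label_ == "ORG"]
    # naive pairing rule: a PERSON followed by ", CEO of" and an ORG
    if persons and orgs and ", CEO of" in doc.text:
        print({"company": orgs[0].text, "ceo": persons[0].text})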
I think there are two parts in your question:
What is the purpose of NER?
This is a vast question; generally it is used for Information Retrieval (IR) tasks such as indexing, document classification, and Knowledge Base Population (KBP), but also many, many others (speech recognition, translation)... quite hard to give an exhaustive list.
How can NER be extended to also recognize new/unknown entities?
E.g. how can we recognize entities that have never been seen by the NER system? At a glance, two solutions are likely to work:
Let's say you have some linked database that is updated on a regular basis: then the system may rely on generic categories. For instance, let's say "Marina Silva" comes up in the news and is added to the lexicon associated with the category "POLITICIAN". Since the system knows that every POLITICIAN should be tagged as a person, i.e. it relies on categories rather than lexical items, it will tag "Marina Silva" as a PERS named entity. You don't have to re-train the whole system, just update its lexicon.
Using morphological and contextual clues, the system may guess at new named entities that have never been seen (and are not in the lexicon). For instance, a rule like "The presidential candidate XXX YYY" (or "Marina YYY") will guess that "XXX YYY" (or just "YYY") is a PERS (or part of a PERS). This usually involves probabilistic modelling.
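A purely illustrative sketch of that second idea, using a contextual trigger phrase and a regular expression to guess an unseen PERS entity (a real system would use probabilistic features rather than one hand-written rule):

    import re

    # trigger phrase followed by 1-3 capitalized tokens
    rule = re.compile(r"[Tt]he presidential candidate ((?:[A-Z][a-z]+ ?){1,3})")

    text = "The presidential candidate Marina Silva gave a speech yesterday."
    match = rule.search(text)
    if match:
        print("PERS:", match.group(1).strip())   # PERS: Marina Silva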
Hope this helps :)

How to extract meaning from sentences after running named entity recognition?

First: Any recs on how to modify the title?
I am using my own named entity recognition algorithm to parse data from plain text. Specifically, I am trying to extract lawyer practice areas. A common sentence structure that I see is:
1) Neil focuses his practice on employment, tax, and copyright litigation.
or
2) Neil focuses his practice on general corporate matters including securities, business organizations, contract preparation, and intellectual property protection.
My entity extraction is doing a good job of finding the key words, for example, my output from sentence one might look like this:
Neil focuses his practice on (employment), (tax), and (copyright litigation).
However, that doesn't really help me. What would be more helpful is if i got an output that looked more like this:
Neil focuses his practice on (employment - litigation), (tax - litigation), and (copyright litigation).
Is there a way to accomplish this goal using an existing Python framework such as NLTK? After my algorithm extracts the practice areas, can I use NLTK to extract the other words that my "practice areas" modify, in order to get a more complete picture?
Named entity recognition (NER) systems typically use grammar-based rules or statistical language models. What you have described here seems to be based only on keywords, though.
Typically, and much like most complex NLP tasks, NER systems should be trained on domain-specific data so that they perform well on previously unseen (test) data. You will require adequate knowledge of machine learning to go down that path.
In "normal" language, if you want to extract words or phrases and categorize them into classes defined by you (e.g. litigation), it often makes sense to use category labels from external ontologies. An example could be:
You want to extract words and phrases related to sports.
Such a categorization (i.e. detecting whether or not a word is indeed related to sports) is not a "general"-enough problem. Which means you will not find ready-made systems that will solve the problem (e.g. algorithms in the NLTK library). You can, however, use an ontology like Wikipedia and exploit the category labels available there.
E.g., if you search Wikipedia for "football", you can check that it has a category label "ball games", which in turn is under "sports".
Note that the Wikipedia category labels form a directed graph. If you build a system which exploits the category structure of such an ontology, you should be able to categorize terms in your texts as you see fit. Moreover, you can even control the granularity of the categorization (e.g. do you want just "sports", or "individual sports" and "team sports").
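A hedged sketch of that lookup using the public MediaWiki API via requests (page and category names vary over time; error handling and the walk up the category graph are omitted):

    import requests

    def wiki_categories(title):
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={"action": "query", "prop": "categories", "titles": title,
                    "cllimit": "max", "format": "json"},
        )
        pages = resp.json()["query"]["pages"]
        page = next(iter(pages.values()))
        return [c["title"] for c in page.get("categories", [])]

    # e.g. may include something like "Category:Ball games"
    print(wiki_categories("Football"))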
I have built such a system for categorizing terms related to computer science, and it worked remarkably well. The closest freely available system that works in a similar way is the Wikifier built by the cognitive computing group at the University of Illinois at Urbana-Champaign.
Caveat: You may need to tweak a simple category-based code to suit your needs. E.g. there is no wikipedia page for "litigation". Instead, it redirects you to a page titled "lawsuit". Such cases need to be handled separately.
Final Note: This solution is really not in the area of NLP, but my past experience suggests that for some domains, this kind of ontology-based approach works really well. Also, I have used the "sports" example in my answer because I know nothing about legal terminology. But I hope my example helps you understand the underlying process.
I do not think your "algo" is even doing entity recognition... however, stretching the problem you presented quite a bit, what you want to do looks like coreference resolution in coordinated structures containing ellipsis. Not easy at all: start by googling for some relevant literature in linguistics and computational linguistics. I use the standard terminology from the field below.
In practical terms, you could start by assigning the nearest antecedent (the most frequently used approach in English). Using your examples:
first extract all the "entities" in a sentence
from the entity list, identify antecedent candidates ("litigation", etc.). This is a very difficult task, involving many different problems... you might avoid it if you know in advance the "entities" that will be interesting for you.
finally, you assign (resolve) each anaphora/cataphora to the nearest antecedent.
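As a toy sketch (not a general solution): for a coordinated list where only the last conjunct carries the head noun, you could copy that nearest full antecedent's head back onto the elliptical conjuncts, which produces exactly the output asked for above:

    def expand_ellipsis(conjuncts):
        # the last conjunct carries the head noun, e.g. "litigation"
        head = conjuncts[-1].split()[-1]
        return [c if c.endswith(head) else f"{c} {head}" for c in conjuncts]

    print(expand_ellipsis(["employment", "tax", "copyright litigation"]))
    # ['employment litigation', 'tax litigation', 'copyright litigation']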
Have a look at CogComp NER tagger:
https://github.com/CogComp/cogcomp-nlp/tree/master/ner
