Named entity recognition for new/latest entities - Python

Sorry for the weird question title, but I couldn't think of a more appropriate one.
I'm new to NLP concepts, so I tried the NER demo (http://cogcomp.cs.illinois.edu/demo/ner/results.php). My question is how, and in what ways, I can use the tags produced by NER. What answers or inferences can one draw from named entities that have been tagged into groups such as location, person, organization, etc.? And if my data contains names of entirely new companies, places, etc., how am I going to get NER tags for such data?
Please don't downvote or block me; I just need guidance and expert suggestions. Reading about a concept is one thing, but knowing where and when to apply it is another, and that is what I'm asking for guidance on. Thanks a ton!
A snippet from the demo:
Dogs have been used in cargo areas for some time, but have just been introduced recently in
passenger areas at LOC Newark and LOC JFK airports. LOC JFK has one dog and LOC Newark has a
handful, PER Farbstein said.

Usually NER is a step in a pipeline. For example, once all entities have been tagged, if you have many sentences like [PER John Smith], CEO of [ORG IBM] said..., then you can set up a table of Companies and CEOs. This is a form of knowledge base population.
There are plenty of other uses, though, depending on the type of data you already have and what you are trying to accomplish.
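To make the pipeline idea concrete, here is a minimal sketch using spaCy (an assumption; the answer above does not name a tool) that turns NER output into a tiny company/CEO table. The "CEO of" pattern match is deliberately naive and purely illustrative.

    # Minimal sketch: turning NER output into a tiny "CEO of company" table.
    # Assumes spaCy and its small English model are installed
    # (pip install spacy && python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")

    text = "John Smith, CEO of IBM, said the quarter went well."
    doc = nlp(text)

    persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]

    # Naive heuristic: if the sentence contains "CEO of", pair the first PERSON
    # with the first ORG. A real KBP system would use patterns or a relation model.
    ceo_table = []
    if "CEO of" in text and persons and orgs:
        ceo_table.append({"company": orgs[0], "ceo": persons[0]})

    print(ceo_table)  # e.g. [{'company': 'IBM', 'ceo': 'John Smith'}]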

I think there are two parts to your question:
What is the purpose of NER?
This is a vast question. Generally, NER is used for Information Retrieval (IR) tasks such as indexing, document classification, and Knowledge Base Population (KBP), but also for many, many others (speech recognition, translation, ...); it is quite hard to come up with an exhaustive list.
How can NER be extended to also recognize new/unknown entities?
That is, how can we recognize entities that have never been seen by the NER system? At a glance, two solutions are likely to work:
Let's say you have some linked database that is updated on a regular basis: then the system may rely on generic categories. For instance, suppose "Marina Silva" comes up in the news and is added to the lexicon under the category "POLITICIAN". Since the system knows that every POLITICIAN should be tagged as a person, i.e. it relies on categories rather than on lexical items, it will tag "Marina Silva" as a PERS named entity. You don't have to re-train the whole system, just update its lexicon.
Using morphological and contextual clues, the system may guess new named entities that it has never seen (and that are not in the lexicon). For instance, a rule like "The presidential candidate XXX YYY" (or "Marina YYY") will guess that "XXX YYY" (or just "YYY") is a PERS (or part of one). This usually involves probabilistic modeling.
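A minimal sketch of both ideas, with a made-up lexicon and a single hand-written contextual rule (a real system would use many such rules or a probabilistic model; all names and categories below are illustrative):

    import re

    # 1) A lexicon that maps surface forms to generic categories; categories map
    #    to NER tags, so adding "Marina Silva" -> POLITICIAN needs no retraining.
    lexicon = {"Marina Silva": "POLITICIAN", "IBM": "COMPANY"}
    category_to_tag = {"POLITICIAN": "PERS", "COMPANY": "ORG"}

    def tag_from_lexicon(text):
        tags = []
        for surface, category in lexicon.items():
            if surface in text:
                tags.append((surface, category_to_tag[category]))
        return tags

    # 2) A contextual rule that guesses a PERS entity it has never seen,
    #    e.g. "The presidential candidate XXX YYY".
    CANDIDATE_RULE = re.compile(r"[Tt]he presidential candidate ((?:[A-Z]\w+ ?){1,3})")

    def guess_person(text):
        match = CANDIDATE_RULE.search(text)
        return (match.group(1).strip(), "PERS") if match else None

    print(tag_from_lexicon("Marina Silva gave a speech."))
    print(guess_person("The presidential candidate Jane Doe arrived yesterday."))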
Hope this helps :)

Related

How to classify unseen text data?

I am training a text classifier for addresses, i.e. to decide whether a given sentence is an address or not.
Sentence examples:
(1) Mirdiff City Centre, DUBAI United Arab Emirates
(2) Ultron Inc. <numb> Toledo Beach Rd #1189 La Salle, MI <numb>
(3) Avenger - HEAD OFFICE P.O. Box <numb> India
As addresses can come in many forms, it's very difficult to build such a classifier. Is there any pre-trained model or database for this, or any other non-ML way?
As mentioned earlier, verifying that an address is valid is probably better formalized as an information retrieval problem than as a machine learning problem (e.g. by using a lookup service).
However, from the examples you gave, it seems like you have several entity types that recur, such as organizations and locations.
I'd recommend enriching the data with an NER such as spaCy, and using the entity types as either features or rules.
Note that named-entity recognizers rely more on context than the typical bag-of-words classifier, and are usually more robust to unseen data.
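Here is a minimal sketch of that enrichment idea, assuming spaCy with the en_core_web_sm model; the feature names and the toy rule are illustrative assumptions, not a recommended recipe:

    # Sketch: run spaCy NER over a candidate sentence and expose the entity types
    # as features (or a crude rule). Assumes the en_core_web_sm model is installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def entity_features(text):
        doc = nlp(text)
        labels = [ent.label_ for ent in doc.ents]
        return {
            "n_gpe": labels.count("GPE"),   # countries, cities, states
            "n_org": labels.count("ORG"),
            "n_fac": labels.count("FAC"),   # buildings, airports, highways, ...
            "has_digits": any(ch.isdigit() for ch in text),
        }

    def looks_like_address(text):
        feats = entity_features(text)
        # Toy rule: a location-ish entity plus some digits is a weak address signal.
        return (feats["n_gpe"] + feats["n_fac"] > 0) and feats["has_digits"]

    print(entity_features("Mirdiff City Centre, DUBAI United Arab Emirates"))
    print(looks_like_address("221B Baker Street, London"))

The same feature dictionary could also be fed to a downstream classifier instead of a hand-written rule.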
When I did this the last time, the problem was very hard, especially since I had international addresses and the variation across countries is enormous. Add to that the variation introduced by people, and the problem becomes quite hard even for humans.
I finally built a heuristic (does it contain something like "P.O. Box", a likely country name (grep Wikipedia), maybe city names) and then threw every remaining maybe-address into the Google Maps API. Google Maps is quite good at recognizing addresses, but even it will produce false positives, so manual checking will most likely be needed.
I did not use ML because my address DB was "large" but not large enough for training; in particular, we lacked labeled training data.
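A rough sketch of such a heuristic pre-filter, with tiny stand-in gazetteers (a real setup would use full country/city lists and then send the survivors to a geocoding API):

    import re

    # Cheap lexical checks first; only the remaining "maybe addresses" would go
    # to an external geocoding API. The country/city sets are tiny stand-ins.
    COUNTRIES = {"india", "united arab emirates", "germany", "france", "usa"}
    CITIES = {"dubai", "toledo", "berlin", "paris"}
    PO_BOX = re.compile(r"\bp\.?\s*o\.?\s*box\b", re.IGNORECASE)

    def maybe_address(text):
        lowered = text.lower()
        if PO_BOX.search(text):
            return True
        if any(country in lowered for country in COUNTRIES):
            return True
        return any(city in lowered for city in CITIES)

    candidates = [
        "HEAD OFFICE, P.O. Box, New Delhi, India",
        "Quarterly revenue grew by twelve percent.",
    ]
    for text in candidates:
        print(maybe_address(text), "->", text)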
As you are asking for literature recommendations (by the way, this question is probably too broad for this site), I can recommend a few links:
https://www.reddit.com/r/datasets/comments/4jz7og/how_to_get_a_large_at_least_100k_postal_address/
https://www.red-gate.com/products/sql-development/sql-data-generator/
https://openaddresses.io/
You need to build labeled data, as @Christian Sauer has already mentioned, where you have examples of addresses. You will probably also need to create negative examples, i.e. sentences that are not addresses, for example sentences containing only telephone numbers or the like. Either way, this will be quite an imbalanced dataset, as you will have a lot of correct addresses and only a few non-addresses. In total you would need around 1000 examples as a starting point.
Another option is to identify some basic addresses manually and do a similarity analysis to find the sentences that are closest to them.
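For the similarity idea, here is a minimal sketch using TF-IDF character n-grams and cosine similarity; the seed and test sentences are illustrative, and the choice of threshold is left open:

    # Compare unknown sentences against a handful of manually identified
    # addresses using TF-IDF character n-grams and cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    seed_addresses = [
        "Mirdiff City Centre, DUBAI United Arab Emirates",
        "HEAD OFFICE P.O. Box New Delhi India",
    ]
    unknown = [
        "Please ship the parcel to our Dubai office near City Centre.",
        "The meeting was rescheduled to next Tuesday.",
    ]

    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    seed_matrix = vectorizer.fit_transform(seed_addresses)
    unknown_matrix = vectorizer.transform(unknown)

    # For each unknown sentence, report its best match among the seed addresses.
    similarities = cosine_similarity(unknown_matrix, seed_matrix)
    for sentence, scores in zip(unknown, similarities):
        print(round(float(scores.max()), 2), sentence)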
As mentioned by Uri Goren, this is a named entity recognition problem, and there are a lot of trained models on the market. Still, one of the best you can get is the Stanford NER.
https://nlp.stanford.edu/software/CRF-NER.shtml
It is a conditional random field NER, available in Java.
If you are looking for a Python way to invoke it, have a look at:
How to install and invoke Stanford NERTagger?
From its output you can gather information from sequences of tags (for example consecutive LOCATION or ORGANIZATION tags). Even if it doesn't give you exactly the right thing, it will still get you closer to any address in the whole document (see the sketch below). That's a head start.
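A minimal sketch of that idea via NLTK's Stanford NER wrapper; the jar and model paths are placeholders for wherever you unpacked the Stanford NER download, and the example sentence is made up:

    # Invoke Stanford NER from Python via NLTK, then collapse consecutive
    # LOCATION/ORGANIZATION tokens into spans as rough address-like chunks.
    from nltk.tag import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    st = StanfordNERTagger(
        "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",  # placeholder path
        "stanford-ner/stanford-ner.jar",                                    # placeholder path
    )

    text = "Ultron Inc. is located on Toledo Beach Rd, La Salle, Michigan."
    tagged = st.tag(word_tokenize(text))  # list of (token, tag) pairs

    spans, current = [], []
    for token, tag in tagged:
        if tag in ("LOCATION", "ORGANIZATION"):
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    print(spans)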
Thanks.

Identifying the subject of a sentence

I have been exploring NLP techniques with the goal of identifying the subject of survey comments (which I then use in conjunction with sentiment analysis). I want to make high level statements such as "10% of survey respondents made a positive comment (+ sentiment) about Account Managers".
My approach has used Named Entity Recognition (NER). Now that I am working with real data, I am getting visibility into some of the complexities and nuances associated with identifying the subject of a sentence. Here are some examples of sentences where the subject is the Account Manager. I have put the named entity in bold for demonstration purposes.
Our account manager is great, he always goes the extra mile!
Steve our account manager is great, he always goes the extra mile!
Steve our relationship manager is great, he always goes the extra mile!
Steven is great, he always goes the extra mile!
Steve Smith is great, he always goes the extra mile!
Our business mgr. is great, he always goes the extra mile!
I see three challenges that add complexity to my task:
Synonyms: Account manager vs relationship manager vs business mgr. This is somewhat domain specific and tends to vary with the survey target audience.
Abbreviations: Mgr. vs manager
Ambiguity - Whether “Steven” is “Steve Smith” and therefore an “account manager”.
Of these, the synonym problem is the most frequent, followed by the ambiguity issue. Based on what I have seen, the abbreviation issue isn't that frequent in my data.
Are there any NLP techniques that can help deal with any of these issues to a relatively high degree of confidence?
As far as I understood, what you call the "subject" is, given a sentence, the entity that a statement is made about - in your example, Steve the account manager.
Based on this assumption, here are a few techniques and how they might help you:
(Dependency) Parsing
Since you don't mean subject in the strict grammatical sense, the approach suggested by user7344209 based on dependency parsing probably won't help you. In a sentence such as "I like Steve", the grammatical subject is "I", although you probably want to find "Steve" as the "subject".
Named Entity Recognition
You already use this, and it will be great for detecting names of persons such as Steve. What I'm not so sure about is the "account manager" example. Both the output provided by Daniel and my own test with Stanford CoreNLP did not identify it as a named entity - which is correct, since it really is not a named entity.
Something broader, such as the suggested mention identification, might be better, but it basically marks every noun phrase, which is probably too broad. If I understood correctly, you want to find one subject per sentence.
Coreference Resolution
Coreference Resolution is the key technique to detect that "Steve" and the "account manager" are the same entity. Stanford CoreNLP has such module for example.
In order for this to work in your example, you have to let it process several sentences at once, since you want to find the links between them. Here is an example with (shortened versions of) some of your examples:
The visualization is a bit messy, but it basically found the following coreference chains:
Steve <-> Steve Smith
Steve our account manager <-> He <-> Our account manager
Our <-> Our
the extra mile <-> the extra mile
Given the first two chains, and a bit of post-processing, you could figure out that all four statements are about the same entity.
Semantic Similarity
In the case of account, business and relationship manager, I found that the CoreNLP coreference resolver actually already finds chains despite the different terms.
More generally, if you think that the coreference resolver cannot handle synonyms and paraphrases well enough, you could also try to include measures of semantic similarity. There is a lot of work in NLP on predicting whether two phrases are synonymous or not.
Some approaches are:
Looking up synonyms in a thesaurus such as WordNet - e.g. with NLTK (Python) as shown here
Better, computing a similarity measure based on the relationships defined in WordNet - e.g. using SEMILAR (Java)
Using continuous representations of words to compute similarities, for example based on LSA or LDA - also possible with SEMILAR
Using more recent neural-network-style word embeddings such as word2vec or GloVe - the latter are easily usable with spaCy (Python)
One way to use these similarity measures would be to identify entities in two sentences, then make pairwise comparisons between the entities in both sentences, and if a pair has a similarity higher than a threshold, consider them to be the same entity.
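As a minimal sketch of the WordNet option with NLTK (requires nltk.download('wordnet'); the word pairs and the 0.8 threshold are illustrative assumptions):

    # Highest Wu-Palmer similarity over all noun senses of two words, used as a
    # crude "same entity?" test.
    from nltk.corpus import wordnet as wn

    def max_wup_similarity(word1, word2):
        best = 0.0
        for s1 in wn.synsets(word1, pos=wn.NOUN):
            for s2 in wn.synsets(word2, pos=wn.NOUN):
                score = s1.wup_similarity(s2) or 0.0
                best = max(best, score)
        return best

    pairs = [("manager", "director"), ("manager", "mile")]
    for w1, w2 in pairs:
        score = max_wup_similarity(w1, w2)
        same_entity = score > 0.8  # hypothetical threshold
        print(w1, w2, round(score, 2), same_entity)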
If you don't have much data to train on, you can probably try a dependency analysis tool and extract dependency pairs that have a SUBJECT identified (usually nsubj if you use the Stanford Parser).
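The same idea can be sketched with spaCy instead of the Stanford Parser (assuming the en_core_web_sm model); note this only finds the grammatical subject, which, as discussed above, is not always the subject you are after:

    # Pull out (subject, verb) pairs via the nsubj dependency relation.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Steve our account manager is great, he always goes the extra mile!")

    for token in doc:
        if token.dep_ in ("nsubj", "nsubjpass"):
            print(token.text, "<-nsubj-", token.head.text)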
I like your approach using NER. This is what I see in our system for your inputs:
Mention-Detection output might also be useful:
On your 2nd point, abbreviations: this is a hard problem, but we have an entity-similarity module here that might be useful; it takes into account things like honorifics, etc.
About your 3rd point, the coreference problem, try the coref module:
Btw the above figures are from the demo here: http://deagol.cs.illinois.edu:8080

Named Entity Resolution Algorithm

I was trying to build an entity resolution system, where my entities are:
(i) General named entities, that is organization, person, location, date, time, money, and percent.
(ii) Some other entities like product and titles of persons such as president, CEO, etc.
(iii) Coreferred entities like pronouns, determiner phrases, synonyms, string matches, demonstrative noun phrases, aliases, and appositions.
From various literature and other references, I have defined the scope so that I do not consider the ambiguity of an entity beyond its entity category. That is, I treat the Oxford of Oxford University as different from Oxford the place, since the former is the first word of an organization entity and the latter is a location entity.
My task is to construct a resolution algorithm, where I would extract and resolve the entities.
So, I am working on an entity extractor in the first place.
In the second place, if I try to relate the coreferences, as found in various works such as this seminal one, they build a decision-tree-based algorithm with features like distance, i-pronoun, j-pronoun, string match, definite noun phrase, demonstrative noun phrase, number agreement, semantic class agreement, gender agreement, both proper names, alias, apposition, etc.
The algorithm seems a nice one, where entities are extracted with a Hidden Markov Model (HMM).
I could work out an entity recognition system with an HMM.
Now I am trying to work out a coreference as well as an entity resolution system. I was wondering whether, instead of using so many features, I could use an annotated corpus and train an HMM-based tagger on it directly, with a view to solving relationship extraction, as in:
*"Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA
small/NA as/NA it/NA may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"
where, PERS-> PERSON
PPERS->PERSONAL PRONOUN TO PERSON
PoPERS-> POSSESSIVE PRONOUN TO PERSON
APPERS-> APPOSITIVE TO PERSON
LOC-> LOCATION
NA-> NOT AVAILABLE*
Would I be wrong to do this? I made an experiment with around 10,000 words, and early results seem encouraging. With support from one of my colleagues I am trying to insert some semantic information into the tagset, like PERSUSPOL, LOCCITUS, PoPERSM, etc. (for PERSON OF US IN POLITICS, LOCATION CITY US, POSSESSIVE PERSON MALE), to incorporate entity disambiguation in one go. My feeling is that relationship extraction would be much better then.
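For concreteness, a minimal sketch of supervised HMM training on such a tagset with NLTK; the two training sentences are toy stand-ins for the real annotated corpus:

    # Train an HMM tagger directly on sentences annotated with the custom
    # PERS/PPERS/PoPERS/... tagset.
    from nltk.tag import hmm

    train_data = [
        [("Obama", "PERS"), ("is", "NA"), ("delivering", "NA"), ("a", "NA"),
         ("lecture", "NA"), ("in", "NA"), ("Washington", "LOC")],
        [("he", "PPERS"), ("knew", "NA"), ("his", "PoPERS"), ("speech", "NA"),
         ("was", "NA"), ("short", "NA")],
    ]

    trainer = hmm.HiddenMarkovModelTrainer()
    tagger = trainer.train_supervised(train_data)

    print(tagger.tag(["Obama", "is", "delivering", "a", "lecture"]))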
Please see this new thought too.
I also got some good results with a Naive Bayes classifier, where sentences having predominantly one set of keywords are marked as one class.
If anyone can suggest a different approach, please feel free to do so.
I use Python 2.x on MS Windows and try to use libraries like NLTK, scikit-learn, Gensim, pandas, NumPy, SciPy, etc.
Thanks in advance.
It seems that you are going down three different paths that are quite distinct, each of which could be a standalone PhD; there is a lot of literature on each of them. My first advice: focus on the main task and outsource the rest. If you are developing this for a less well-resourced language, you can also build on others' work.
Named Entity Recognition
Stanford NLP has gone really far in this area, especially for English. Their tools resolve named entities really well, are widely used, and have a nice community.
Other solutions may exist, e.g. OpenNLP for Python.
Some have tried to extend it to unusual fine-grained types, but you need much more training data to cover the cases, and the decision becomes much harder.
Edit: the Stanford NER is also usable from Python via NLTK.
Named Entity Resolution/Linking/Disambiguation
This is concerned with linking a name to some knowledge base, and it solves the problem of whether "Oxford" refers to Oxford University or to the city of Oxford.
AIDA: one of the state-of-the-art systems here. It uses different kinds of context information as well as coherence information, they have tried to support several languages, and they have a good benchmark.
Babelfy: offers an interesting API that does NER and NED for entities and concepts. It also supports many languages, but it never worked very well for me.
Others, like TagMe, Wikifier, etc.
Coreference Resolution
Stanford CoreNLP also has some good work in that direction. I can also recommend this work, where they combined coreference resolution with NED.

Identifying an entity in an article

I am working with Python on a data-science-related task. I have extracted some news articles, and now I want to selectively pick only those articles about a specific person and determine whether the person mentioned in an article is the same person I am interested in.
Let's say a person can be identified either by his name or by certain attributes that describe him; for example, a person with name "X" is a political figure. When an article about that person is published, we 'know' it is referring to that person only by reading the context of the article. By 'context' I mean the article contains any (or a combination) of the following:
That person's name
The name of his political party
Names of other people closely associated with him mentioned in the article
Other attributes that describe that person
Because names are common, I want to determine the probability that a given article is about that person "X" and not about some other person who has the same name as "X".
Alright so this is my best shot.
Initial assumptions
First, we assume that you have articles that already contain mentions of people, and these mentions are either a) mentions of the particular person you are looking for or b) mentions of other people sharing the same name.
I think disambiguating each mention (as you would do in Entity Linking) is overkill, since you also assume that an article is either about the person or not. So we'll say that any article that contains at least one mention of the person is an article about the person.
General solution: Text classification
You have to develop a classification algorithm that extracts features from an article and feeds those features to a model you obtained through supervised learning. The model will output one of two answers, for example True or False. This necessitates a training set. For evaluation purposes (knowing that your solution works), you will also need a testing set.
So the first step will be to tag these training and testing sets using one of two tags each time ("True" and "False" or whatever). You have to assign these tags manually, by examining the articles yourself.
What features to use
@eldams mentions using contextual clues. In my (attempt at a) solution, the article is the context, so basically you have to ask yourself what might give away that the article is about this person in particular. At this point you can either choose the features yourself or let a more complex model find specific features within a more general feature category.
Two examples, assuming we are looking for articles about Justin Trudeau, the newly elected Canadian Prime Minister, as opposed to anyone else who is also named Justin Trudeau.
A) Choosing features yourself
With a bit of research, you will learn that Justin Trudeau leads the Liberal Party of Canada, so some good features would be to check whether or not the article contains these strings:
Liberal Party of Canada, Parti Libéral du Canada, LPC, PLC, Liberals,
Libéraux, Jean Chrétien, Paul Martin, etc
Since Trudeau is a politician, looking for these might be a good idea:
politics, politician, law, reform, parliament, house of commons, etc
You might want to gather information about his personal life, close collaborators, name of wife and kids, and so on, and add these as well.
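A minimal sketch of option A, using a hand-picked keyword list (the keywords below are illustrative, not exhaustive):

    # Hand-crafted boolean keyword features for a "is this article about the
    # politician Justin Trudeau?" classifier.
    KEYWORDS = [
        "liberal party of canada", "parti libéral du canada", "lpc", "plc",
        "liberals", "libéraux", "parliament", "house of commons", "prime minister",
    ]

    def keyword_features(article_text):
        lowered = article_text.lower()
        return {f"contains({kw})": kw in lowered for kw in KEYWORDS}

    article = "Justin Trudeau addressed the House of Commons on Tuesday."
    print(keyword_features(article))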
B) Letting the learning algorithm do the work
Your other option would be to train an n-gram model using every n-gram in the training set (e.g. all unigrams and bigrams). This results in a more complex model that can be more robust, but is also heavier to train and use.
Software resources
Whatever you choose to do, if you need to train a classifier, you should use scikit-learn. Its SVM classifier would be the most popular choice. Naive Bayes is the more classic approach to document classification.
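A minimal sketch of option B with scikit-learn, combining unigram/bigram features with a linear SVM; the four labelled articles are placeholders for a real hand-tagged training set:

    # Unigram+bigram TF-IDF features feeding a linear SVM.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_texts = [
        "Justin Trudeau spoke in the House of Commons about the Liberal Party.",
        "The Prime Minister announced a new parliamentary reform.",
        "Justin Trudeau, a local chef, opened a restaurant downtown.",
        "A man named Justin Trudeau won the regional darts tournament.",
    ]
    train_labels = [True, True, False, False]  # True = about the politician

    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
        LinearSVC(),
    )
    model.fit(train_texts, train_labels)

    print(model.predict(["Trudeau defended the Liberals in parliament today."]))

Swapping LinearSVC for MultinomialNB in the same pipeline gives the Naive Bayes baseline mentioned above.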
This task is usually known as Entity Linking. If you are working on popular entities, e.g. those that have an article in Wikipedia, then you may have a look at DBpedia Spotlight or BabelNet that address this issue.
If you'd like to implement your own linker, then you may have a look at related articles. In most cases, a named entity linker first detects mentions (person names in your case); then a disambiguation step is required, which computes probabilities over the available references (including NIL, since a mention may have no available reference) for each mention in the text, using contextual clues (e.g. the words of the sentence, paragraph, or whole article containing the mention).
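As an illustration, here is a hedged sketch of calling the public DBpedia Spotlight demo endpoint from Python; the endpoint URL and response field names reflect the public demo service and may change, and a production setup would typically self-host Spotlight instead:

    # Annotate a text with DBpedia Spotlight and print the linked entities.
    import requests

    text = "Justin Trudeau spoke in Ottawa about the Liberal Party of Canada."
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": 0.5},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()

    for resource in resp.json().get("Resources", []):
        # Each resource links a surface form in the text to a DBpedia entity URI.
        print(resource.get("@surfaceForm"), "->", resource.get("@URI"))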

How to extract meaning from sentences after running named entity recognition?

First: Any recs on how to modify the title?
I am using my own named entity recognition algorithm to parse data from plain text. Specifically, I am trying to extract lawyer practice areas. A common sentence structure that I see is:
1) Neil focuses his practice on employment, tax, and copyright litigation.
or
2) Neil focuses his practice on general corporate matters including securities, business organizations, contract preparation, and intellectual property protection.
My entity extraction is doing a good job of finding the key words, for example, my output from sentence one might look like this:
Neil focuses his practice on (employment), (tax), and (copyright litigation).
However, that doesn't really help me. What would be more helpful is if i got an output that looked more like this:
Neil focuses his practice on (employment - litigation), (tax - litigation), and (copyright litigation).
Is there a way to accomplish this goal using an existing Python framework such as NLTK? After my algorithm extracts the practice areas, can I use NLTK to extract the other words that my "practice areas" modify, in order to get a more complete picture?
Named entity recognition (NER) systems typically use grammar-based rules or statistical language models. What you have described here seems to be based only on keywords, though.
Typically, and much like most complex NLP tasks, NER systems should be trained on domain-specific data so that they perform well on previously unseen (test) data. You will require adequate knowledge of machine learning to go down that path.
In "normal" language, if you want to extract words or phrases and categorize them into classes defined by you (e.g. litigation), if often makes sense to use category labels in external ontologies. An example could be:
You want to extract words and phrases related to sports.
Such a categorization (i.e. detecting whether or not a word is indeed related to sports) is not a "general"-enough problem. Which means you will not find ready-made systems that will solve the problem (e.g. algorithms in the NLTK library). You can, however, use an ontology like Wikipedia and exploit the category labels available there.
E.g., if you search Wikipedia for "football", you will see that it has the category label "ball games", which in turn is under "sports".
Note that the Wikipedia category labels form a directed graph. If you build a system that exploits the category structure of such an ontology, you should be able to categorize terms in your texts as you see fit. Moreover, you can even control the granularity of the categorization (e.g. do you want just "sports", or "individual sports" and "team sports"?).
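As a minimal sketch, category labels can be fetched through the MediaWiki API; the helper below is hypothetical and only shows the lookup step, not the traversal of the category graph:

    # Look up Wikipedia category labels for a title, following redirects.
    import requests

    def wikipedia_categories(title):
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={
                "action": "query",
                "prop": "categories",
                "titles": title,
                "redirects": 1,
                "clshow": "!hidden",   # skip maintenance categories
                "format": "json",
            },
            timeout=10,
        )
        pages = resp.json()["query"]["pages"]
        for page in pages.values():
            return [c["title"] for c in page.get("categories", [])]
        return []

    print(wikipedia_categories("Football"))
    print(wikipedia_categories("Litigation"))  # follows the redirect to "Lawsuit"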
I have built such a system for categorizing terms related to computer science, and it worked remarkably well. The closest freely available system that works in a similar way is the Wikifier built by the cognitive computing group at the University of Illinois at Urbana-Champaign.
Caveat: you may need to tweak a simple category-based approach to suit your needs. E.g. there is no Wikipedia page for "litigation"; instead, it redirects you to a page titled "Lawsuit". Such cases need to be handled separately.
Final Note: This solution is really not in the area of NLP, but my past experience suggests that for some domains, this kind of ontology-based approach works really well. Also, I have used the "sports" example in my answer because I know nothing about legal terminology. But I hope my example helps you understand the underlying process.
I do not think your "algo" is even doing entity recognition... however, stretching the problem you presented quite a bit, what you want to do looks like coreference resolution in coordinated structures containing ellipsis. Not easy at all: start by googling for some relevant literature in linguistics and computational linguistics. I use the standard terminology from the field below.
In practical terms, you could start by assigning the nearest antecedent (the most frequently used approach in English). Using your examples (a sketch follows after this list):
first extract all the "entities" in a sentence
from the entity list, identify antecedent candidates ("litigation", etc.). This is a very difficult task, involving many different problems... you might avoid it if you know in advance the "entities" that will be interesting for you.
finally, you assign (resolve) each anaphora/cataphora to the nearest antecedent.
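A rough sketch of the nearest-antecedent idea for coordinated structures, using spaCy's dependency parse (en_core_web_sm assumed); the exact parse, and therefore the output, depends on the model, so treat this as an illustration of the approach rather than a robust solution:

    # Conjuncts that lack their own head noun borrow it from the nearest
    # conjunct that has one, e.g. "employment" -> "employment litigation".
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Neil focuses his practice on employment, tax, and copyright litigation.")

    for token in doc:
        if token.dep_ == "pobj":                      # the noun group under "on"
            group = [token] + list(token.conjuncts)   # coordinated nouns (model-dependent)
            head = group[-1]                          # rightmost conjunct usually carries the head noun
            for item in group[:-1]:
                print(f"{item.text} {head.text}")     # e.g. "employment litigation"
            # print the last conjunct together with its own modifiers
            print(" ".join(t.text for t in head.subtree if not t.is_punct and t.dep_ != "cc"))
            break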
Have a look at CogComp NER tagger:
https://github.com/CogComp/cogcomp-nlp/tree/master/ner
