I am working with Python on a data science task. I have extracted some news articles, and now I want to pick out only the articles about a specific person and determine whether the person mentioned in an article is the same person I am interested in.
Let's say a person can be identified either by his name or by certain attributes that describe him; for example, a person with the name "X" is a political figure. When an article about that person is published, we 'know' that it is referring to that person only by reading the context of the article. By 'context' I mean whether the article contains any (or a combination) of the following:
That person's name
The name of his political party
Names of other people closely associated with him mentioned in the article
Other attributes that describe that person
Because names are common, I want to determine the probability that a given article is about this particular person "X", and not some other person who happens to share the same name.
Alright so this is my best shot.
Initial assumptions
First, we assume that you have articles that already contain mentions of people, and these mentions are either a) mentions of the particular person you are looking for or b) mentions of other people sharing the same name.
I think disambiguating each mention (as you would do in Entity Linking) is overkill, as you also assume that the articles are either about the person or not. So we'll say that any article that contains at least one mention of the person is an article about the person.
General solution: Text classification
You have to develop a classification algorithm that extracts features from an article and feeds those features to a model you obtained through supervised learning. The model will output one of two answers, for example True or False. This necessitates a training set. For evaluation purposes (knowing that your solution works), you will also need a testing set.
So the first step will be to tag these training and testing sets using one of two tags each time ("True" and "False" or whatever). You have to assign these tags manually, by examining the articles yourself.
What features to use
#eldams mentions using contextual clues. In my (attempt at a) solution, the article is the context, so basically you have to ask yourself what might give away that the article is about the person in particular. At this point you can either choose the features yourself or let a more complex model find specific features in a more general feature category.
Two examples, assuming we are looking for articles about Justin Trudeau, the newly elected Canadian Prime Minister, as opposed to anyone else who is also named Justin Trudeau.
A) Choosing features yourself
With a bit of research, you will learn that Justin Trudeau leads the Liberal Party of Canada, so some good features would be to check whether or not the article contains these strings:
Liberal Party of Canada, Parti Libéral du Canada, LPC, PLC, Liberals,
Libéraux, Jean Chrétien, Paul Martin, etc
Since Trudeau is a politician, looking for these might be a good idea:
politics, politician, law, reform, parliament, house of commons, etc
You might want to gather information about his personal life, close collaborators, name of wife and kids, and so on, and add these as well.
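For illustration, a minimal sketch of such hand-picked features, assuming each article is available as a plain string; the keyword lists and feature names below are just examples, not a complete feature set.

PARTY_KEYWORDS = ["liberal party of canada", "parti libéral du canada",
                  "liberals", "libéraux"]
POLITICS_KEYWORDS = ["politics", "politician", "parliament", "house of commons"]

def extract_features(article_text):
    """Return a simple binary feature vector for one article."""
    text = article_text.lower()
    return [
        int(any(kw in text for kw in PARTY_KEYWORDS)),           # mentions the party?
        int(any(kw in text for kw in POLITICS_KEYWORDS)),        # politics vocabulary?
        int("jean chrétien" in text or "paul martin" in text),   # closely associated people?
    ]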
B) Letting the learning algorithm do the work
Your other option would be to train an n-gram model using every n-gram that appears in the training set (e.g. all unigrams and bigrams). This results in a more complex model that can be more robust, but is also heavier to train and use.
Software resources
Whatever you choose to do, if you need to train a classifier, you should use scikit-learn. Its SVM classifier would be the most popular choice. Naive Bayes is the more classic approach to document classification.
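For example, a rough scikit-learn sketch of the n-gram approach (option B) could look like the following; the tiny training set here is only a placeholder for your own manually tagged articles, and you can swap MultinomialNB for an SVM such as LinearSVC.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny placeholder training set: replace with your own manually tagged articles.
train_texts = [
    "Justin Trudeau and the Liberal Party of Canada won the election.",
    "The Prime Minister addressed the House of Commons in Ottawa.",
    "Justin Trudeau, a firefighter from Maine, rescued a cat.",       # a different Justin Trudeau
    "A man named Justin Trudeau opened a bakery in Florida.",         # a different Justin Trudeau
]
train_labels = [True, True, False, False]

# Unigrams + bigrams weighted by TF-IDF, fed to a Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Trudeau spoke to Parliament about the reform bill."]))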
This task is usually known as Entity Linking. If you are working on popular entities, e.g. those that have an article in Wikipedia, then you may have a look at DBpedia Spotlight or BabelNet that address this issue.
If you'd like to implement your own linker, then you may have a look at related articles. In most cases, a named entity linker first detects mentions (person names in your case); then a disambiguation step computes probabilities over the available referents (including NIL, since a mention may have no referent available) for each specific mention in the text, using contextual clues (e.g. the words of the sentence, paragraph, or whole article containing the mention).
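If you want to try an existing linker quickly, a minimal sketch of calling the public DBpedia Spotlight REST endpoint might look like this; the URL, parameters, and response keys reflect the public demo service at the time of writing and may change, so check the project documentation.

import requests

text = "Justin Trudeau leads the Liberal Party of Canada."
resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": text, "confidence": 0.5},
    headers={"Accept": "application/json"},
)
# Each annotated mention is linked to a DBpedia URI (or omitted if no referent is found).
for resource in resp.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])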
Related
I am able to extract person names using the spaCy NER model, but it includes lawyers, police officers, and everyone else who is a person. My problem is to extract the name of the person who is accused, convicted, or has committed the crime, based on a news article.
e.g. the news article below: https://www.channelnewsasia.com/news/world/turkey-frees-opposition-figure-pending-terrorism-trial---anadolu-11095480
ANKARA: A Turkish court on Monday ordered the release on bail of a former opposition lawmaker while he is being tried on terrorism-related charges, state-owned Anadolu news agency said.
Eren Erdem, who lost his seat in mid-2018 elections that granted President Tayyip Erdogan sweeping new powers, has been jailed since June and accused of publishing illegal wiretaps while editor of an opposition newspaper in 2014.
He denies charges of assisting followers of U.S.-based cleric Fethullah Gulen, who is accused of orchestrating a failed 2016 putsch.
Eren Erdem is the prime accused and I need only this name, but the spaCy model extracts all the people's names:
Tayyip Erdogan (president)
Fethullah Gulen
Enis Berberoglu
Tuvan Gumrukcu
etc
I need the name of the criminal, not the president or the police.
Can we do it using Python/NER?
Edit: Can we apply the knowledge graph concept here? I explored it a lot but couldn't find a convincing article covering this case. It would be great if someone could walk through this concept or provide links to relevant articles.
Firstly, you have to ask yourself how a reader of the text is capable of identifying the criminal. The proper name representing the criminal takes the argument function of a verb (be it a copular verb as in "He is a criminal" or a semantically more complex verb as in "the man also committed the murder 2 years ago"). This argument function (the "subject" in the case of these examples) perfectly identifies the criminal entity. What you have to do is:
identifying the sentence containing the criminal, including the so-called subcategorisation frame of the verb (giving the arguments, e.g. "SUBJECT", "OBJECT" etc.).
Parsing the sentence, such that the arguments are made accessible (using nltk or spaCy) and using NER
extracting the entity, which is both recognized by NER and subcategorized by the verb in the argument position that assigns the role of the criminal to the entity
if necessary, performing anaphora resolution, when a personal pronoun is used, which needs to be matched with the entity to which the pronoun refers (you can imagine this as some sort of reference chaining of pronouns).
Really, there is no out-of-the-box model; it's rather a linguistic pipeline, with an implementation for each separate step, that takes you there. For anything more detailed, you would really need to paste some code for direct questions on the implementation pipeline.
You can use machine learning, but for that you need to perform steps 1 and 2 anyway, so better try those steps first; a rough sketch follows.
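As a starting point, here is a minimal sketch of steps 2 and 3 with spaCy, assuming the small English model (en_core_web_sm) is installed; the verb list and the matching logic are purely illustrative, not the full pipeline described above.

import spacy

nlp = spacy.load("en_core_web_sm")
CRIME_VERBS = {"accuse", "arrest", "jail", "convict", "charge"}   # illustrative list

def find_accused(text):
    doc = nlp(text)
    persons = {ent.text for ent in doc.ents if ent.label_ == "PERSON"}
    accused = set()
    for token in doc:
        if token.lemma_ in CRIME_VERBS:
            for child in token.children:
                if child.dep_ in ("nsubj", "nsubjpass"):
                    # keep only subjects that NER also recognises as a person
                    span = " ".join(w.text for w in child.subtree)
                    if any(p in span for p in persons):
                        accused.add(span)
    return accused

print(find_accused("Eren Erdem has been jailed since June and accused of "
                   "publishing illegal wiretaps."))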
I'm also using spaCy in my project to extract victim names, and I also get a lot of non-victim names like police officers, doctors, suspects, etc. Tools like spaCy are very useful, but you also need to help them out in order to identify which type of PERSON entity you want to extract. To filter out the names I want, what I do is:
Analyze the articles and recognize some common patterns. Usually, articles from the same source follow the same formats. In your case, I checked a few articles from the given website and they follow formats like "Suspect name, age, was accused/arrested/other synonyms" or "Suspect name, who ..., was accused/arrested/other synonyms". This is a pretty common format for crime-related articles. There could be other formats, of course, but it's unlikely that there will be too many, since these sites usually follow a certain standard or the articles are written by a few authors.
What pattern do you see from this? It's that the sentence containing the suspect's name is often divided into three chunks. The [1] first one is the name followed by a comma, the [2] second one is either digits (age) or some description beginning with "who" followed by a comma, and the [3] third one includes verbs similar to "arrest", such as arrested, jailed, accused, etc.
In your example: "[1] Eren Erdem, [2] who lost his seat in mid-2018 elections that granted President Tayyip Erdogan sweeping new powers, [3] has been jailed since June and accused of publishing illegal wiretaps while editor of an opposition newspaper in 2014."
Use a regular expression to catch only the phrases that match this pattern. In Python:
import re

# `text` holds the article body; here, the example sentence from the question.
text = "Eren Erdem, who lost his seat in mid-2018 elections that granted President Tayyip Erdogan sweeping new powers, has been jailed since June and accused of publishing illegal wiretaps."
pattern = r'(\w+\W+\w+){1,5},\swho\s(\w+\W+\w+){0,20},\s(\w+\W+){0,5}(arrested|jailed)\s(\w+\W+){0,10}'
for result in re.finditer(pattern, text, flags=re.I):
    print(result.group())                # pass this whole match to spaCy
    print(result.group().split(",")[0])  # or just take the leading name chunk
You can use machine learning, but there will always be some results that require tuning. You can also utilize scoring. If the articles are about a suspect, then the PERSON entity that occurs the most is often the suspect himself; other entities will probably be mentioned only a few times, sometimes just once.
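For the scoring idea, a minimal sketch (assuming spaCy's en_core_web_sm model is installed) could simply count PERSON mentions and keep the most frequent one; the heuristic is illustrative only.

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def most_mentioned_person(article_text):
    """Return the PERSON entity mentioned most often, or None if there is none."""
    doc = nlp(article_text)
    counts = Counter(ent.text for ent in doc.ents if ent.label_ == "PERSON")
    return counts.most_common(1)[0][0] if counts else None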
I'm in need of suggestions on how to extract keywords from a large document. The keywords should be in line with what we have defined as the intended search results.
For example,
Given a document about a company, I need the owner's name, where the office is situated, and what the operating industry is, and the defined set of words would be,
{owner, director, office, industry...}-(1)
the intended output has to be something like,
{Mr.Smith James, ,Main Street, Financial Banking}-(2)
I was looking at a method based on semantic similarity, where sentences containing words similar to the given set (1) would be extracted, and POS tagging would then be used to extract nouns from those sentences.
It would be useful if further resources supporting this approach could be provided.
What you want to do is referred to as Named Entity Recognition.
In Python there is a popular library called spaCy that can be used for that. The standard models are able to detect 18 different entity types, which is a fairly good amount.
Persons and company names should be extracted easily, while whole addresses and the industry might be more difficult. Maybe you would have to train your own model on these entity types. SpaCy also provides an API for training your own models.
Please note that you need quite a lot of training data to get decent results. Start with 1000 examples per entity type and see if that is sufficient for your needs. POS tags can be used as a feature.
If your data is unstructured, this is probably one of the most suitable approaches. If you have more structured data, you could maybe take advantage of that.
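As a starting point, a minimal spaCy sketch (assuming the en_core_web_sm model has been downloaded) applied to an example like the one in the question would look like this:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mr. Smith James founded the company, headquartered on Main Street, "
          "operating in financial banking.")
# Print every detected entity with its label (PERSON, ORG, GPE, FAC, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)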
I was trying to build an entity resolution system, where my entities are,
(i) General named entities, that is organization, person, location, date, time, money, and percent.
(ii) Some other entities like product, and titles of people such as president, CEO, etc.
(iii) Coreferred entities, like pronouns, determiner phrases, synonyms, string matches, demonstrative noun phrases, aliases, and appositions.
From various literature and other references, I have defined the scope such that I do not consider the ambiguity of each entity beyond its entity category. That is, I treat the Oxford of Oxford University as different from Oxford the place, since the former is the first word of an organization entity and the latter is a location entity.
My task is to construct one resolution algorithm with which I would extract and resolve the entities.
So, I am working out an entity extractor in the first place.
In the second place, if I try to relate the coreferences as found in various literature, such as this seminal work, they work out a decision-tree-based algorithm with features like distance, i-pronoun, j-pronoun, string match, definite noun phrase, demonstrative noun phrase, number agreement, semantic class agreement, gender agreement, both proper names, alias, apposition, etc.
The algorithm seems a nice one, where entities are extracted with a Hidden Markov Model (HMM).
I was able to work out an entity recognition system with an HMM.
Now I am trying to work out a coreference as well as an entity resolution system. I was wondering whether, instead of using so many features, I could use an annotated corpus and train an HMM-based tagger on it directly, with a view to solving relationship extraction like:
"Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA small/NA as/NA it/NA may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"
where:
PERS -> PERSON
PPERS -> PERSONAL PRONOUN TO PERSON
PoPERS -> POSSESSIVE PRONOUN TO PERSON
APPERS -> APPOSITIVE TO PERSON
LOC -> LOCATION
NA -> NOT AVAILABLE
Would I be wrong? I made an experiment with around 10,000 words, and early results seem encouraging. With support from one of my colleagues, I am trying to insert some semantic information like PERSUSPOL, LOCCITUS, PoPERSM, etc. (for PERSON OF US IN POLITICS, LOCATION CITY US, POSSESSIVE PERSON MALE) into the tagset, to incorporate entity disambiguation in one go. My feeling is that relationship extraction would be much better now.
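For what it's worth, a rough sketch of that setup with NLTK's HMM tagger might look like the following; the tiny training corpus is just the tagged example sentence from above, and a real experiment would of course use the full annotated corpus.

from nltk.tag import hmm

# Annotated corpus: a list of sentences, each a list of (word, tag) pairs.
train_data = [[
    ("Obama", "PERS"), ("is", "NA"), ("delivering", "NA"), ("a", "NA"),
    ("lecture", "NA"), ("in", "NA"), ("Washington", "LOC"), ("he", "PPERS"),
    ("knew", "NA"), ("it", "NA"), ("was", "NA"), ("going", "NA"), ("to", "NA"),
    ("be", "NA"), ("small", "NA"), ("as", "NA"), ("it", "NA"), ("may", "NA"),
    ("not", "NA"), ("be", "NA"), ("his", "PoPERS"), ("speech", "NA"),
    ("as", "NA"), ("Mr.", "APPERS"), ("President", "APPERS"),
]]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)
print(tagger.tag("Obama is delivering a lecture in Washington".split()))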
Please see this new thought too.
I also got some good results with a Naive Bayes classifier, where sentences having predominantly one set of keywords are marked as one class.
If anyone can suggest a different approach, please feel free to do so.
I use Python 2.x on MS Windows and try to use libraries like NLTK, scikit-learn, Gensim, pandas, NumPy, SciPy, etc.
Thanks in Advance.
It seems that you are going down three paths that are totally different, and each could be a standalone PhD. There is a lot of literature on all of them. My first advice: focus on the main task and outsource the rest. Also, if you are developing this for a less-resourced language, you can build on others' work.
Named Entity Recognition
Stanford NLP has gone really far in this area, especially for English. It resolves named entities really well, is widely used, and has a nice community.
Other solutions may exist in OpenNLP for Python.
Some have tried to extend it to unusual fine-grained types, but you need much more training data to cover the cases, and the decision becomes much harder.
Edit: Stanford NER is available in NLTK for Python.
Named Entity Resolution/Linking/Disambiguation
This is concerned with linking a name to some knowledge base, and it solves the problem of whether Oxford refers to Oxford University or to the city of Oxford.
AIDA: one of the state-of-the-art systems for this. It uses different kinds of context information as well as coherence information, it has also tried to support several languages, and it has a good benchmark.
Babelfy: offers an interesting API that does NER and NED for entities and concepts. It also supports many languages, but it never worked very well for me.
Others, like TagMe and Wikifier, etc.
Coreference Resolution
Stanford CoreNLP also has some good work in that direction. I can also recommend this work, where they combined coreference resolution with NED.
Sorry for the weird "question title", but I couldn't think of an appropriate one.
I'm new to NLP concepts, so I used the NER demo (http://cogcomp.cs.illinois.edu/demo/ner/results.php). Now the issue is: how, and in what ways, can I use these taggings produced by NER? I mean, what answers or inferences can one draw from these named entities that have been tagged into certain groups (location, person, organization, etc.)? And if I have data that contains names of entirely new companies, places, etc., then how am I going to produce NER taggings for such data?
Please don't downvote or block me, I just need guidance/expert suggestions, that's it. Reading about a concept is one thing, while knowing where and when to apply it is another, and that is where I'm asking for guidance. Thanks a ton!
A snippet from the demo:-
Dogs have been used in cargo areas for some time, but have just been introduced recently in passenger areas at LOC Newark and LOC JFK airports. LOC JFK has one dog and LOC Newark has a handful, PER Farbstein said.
Usually NER is a step in a pipeline. For example, once all entities have been tagged, if you have many sentences like [PER John Smith], CEO of [ORG IBM] said..., then you can set up a table of Companies and CEOs. This is a form of knowledge base population.
There are plenty of other uses, though, depending on the type of data you already have and what you are trying to accomplish.
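As a toy illustration of that kind of knowledge base population, one could combine spaCy's PERSON/ORG tags with a naive pattern over the text between them (assuming en_core_web_sm is installed; the pattern is deliberately simplistic):

import re
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_ceo_pairs(text):
    """Collect (CEO, company) pairs from sentences like 'X, CEO of Y, said ...'."""
    doc = nlp(text)
    persons = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    orgs = [ent for ent in doc.ents if ent.label_ == "ORG"]
    pairs = []
    for per in persons:
        for org in orgs:
            between = doc.text[per.end_char:org.start_char]
            if re.search(r",\s*CEO of\s*$", between):
                pairs.append((per.text, org.text))
    return pairs

print(extract_ceo_pairs("John Smith, CEO of IBM, said the deal was done."))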
I think there are two parts in your question:
What is the purpose of NER?
This is a vast question. Generally, NER is used for Information Retrieval (IR) tasks such as indexing, document classification, and Knowledge Base Population (KBP), but also for many, many others (speech recognition, translation, ...). It is quite hard to come up with an exhaustive list.
How can NER be extended to also recognize new/unknown entities?
E.g. how can we recognize entities that have never been seen by the NER system? At a glance, two solutions are likely to work:
Let's say you have some linked database that is updated on a regular basis: then the system may rely on generic categories. For instance, say "Marina Silva" comes up in the news and is now added to the lexicon, associated with the category "POLITICIAN". The system knows that every POLITICIAN should be tagged as a person, i.e. it doesn't rely on lexical items but on categories, and it will thus tag "Marina Silva" as a PERS named entity. You don't have to re-train the whole system, just update its lexicon.
Using morphological and contextual clues, the system may guess new named entities that have never been seen (and are not in the lexicon). For instance, a rule like "The presidential candidate XXX YYY" (or "Marina YYY") will guess that "XXX YYY" (or just "YYY") is a PERS (or part of a PERS). Most of the time, this involves probabilistic modeling.
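A toy sketch combining both ideas (a known-entity lexicon that can be updated without retraining, plus a contextual guessing rule); the names and the rule are purely illustrative:

import re

politician_lexicon = {"Marina Silva"}   # updated from a linked database, no retraining needed

def tag_persons(sentence):
    found = {name for name in politician_lexicon if name in sentence}
    # contextual rule: "presidential candidate <First> <Last>" suggests a PERS
    for match in re.finditer(r"presidential candidate ([A-Z]\w+ [A-Z]\w+)", sentence):
        found.add(match.group(1))
    return found

print(tag_persons("The presidential candidate Marina Silva spoke yesterday."))
print(tag_persons("The presidential candidate John Doe spoke yesterday."))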
Hope this helps :)
First: Any recs on how to modify the title?
I am using my own named entity recognition algorithm to parse data from plain text. Specifically, I am trying to extract lawyer practice areas. A common sentence structure that I see is:
1) Neil focuses his practice on employment, tax, and copyright litigation.
or
2) Neil focuses his practice on general corporate matters including securities, business organizations, contract preparation, and intellectual property protection.
My entity extraction is doing a good job of finding the key words, for example, my output from sentence one might look like this:
Neil focuses his practice on (employment), (tax), and (copyright litigation).
However, that doesn't really help me. What would be more helpful is if I got an output that looked more like this:
Neil focuses his practice on (employment - litigation), (tax - litigation), and (copyright litigation).
Is there a way to accomplish this goal using an existing Python framework such as NLTK? After my algorithm extracts the practice areas, can I use NLTK to extract the other words that my "practice areas" modify, in order to get a more complete picture?
Named entity recognition (NER) systems typically use grammar-based rules or statistical language models. What you have described here seems to be based only on keywords, though.
Typically, and much like most complex NLP tasks, NER systems should be trained on domain-specific data so that they perform well on previously unseen (test) data. You will require adequate knowledge of machine learning to go down that path.
In "normal" language, if you want to extract words or phrases and categorize them into classes defined by you (e.g. litigation), if often makes sense to use category labels in external ontologies. An example could be:
You want to extract words and phrases related to sports.
Such a categorization (i.e. detecting whether or not a word is indeed related to sports) is not a "general"-enough problem. Which means you will not find ready-made systems that will solve the problem (e.g. algorithms in the NLTK library). You can, however, use an ontology like Wikipedia and exploit the category labels available there.
E.g., if you search Wikipedia for "football", you will see that it has the category label "ball games", which in turn is under "sports".
Note that the Wikipedia category labels form a directed graph. If you build a system which exploits the category structure of such an ontology, you should be able to categorize terms in your texts as you see fit. Moreover, you can even control the granularity of the categorization (e.g. do you want just "sports", or "individual sports" and "team sports").
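As a sketch of how such a category lookup could work, the MediaWiki API can return the category labels of a page; the parameters below are the standard ones for category queries, but check the API documentation for your exact needs.

import requests

def wikipedia_categories(term):
    """Return the category labels of the Wikipedia page for `term`."""
    resp = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query", "titles": term, "prop": "categories",
        "cllimit": "max", "redirects": 1, "format": "json",
    })
    pages = resp.json()["query"]["pages"]
    return [c["title"] for page in pages.values()
            for c in page.get("categories", [])]

print(wikipedia_categories("Football"))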
I have built such a system for categorizing terms related to computer science, and it worked remarkably well. The closest freely available system that works in a similar way is the Wikifier built by the cognitive computing group at the University of Illinois at Urbana-Champaign.
Caveat: You may need to tweak a simple category-based code to suit your needs. E.g. there is no Wikipedia page for "litigation". Instead, it redirects you to a page titled "lawsuit". Such cases need to be handled separately.
Final Note: This solution is really not in the area of NLP, but my past experience suggests that for some domains, this kind of ontology-based approach works really well. Also, I have used the "sports" example in my answer because I know nothing about legal terminology. But I hope my example helps you understand the underlying process.
I do not think your "algo" is even doing entity recognition... however, stretching the problem you presented quite a bit, what you want to do looks like coreference resolution in coordinated structures containing ellipsis. Not easy at all: start by googling for some relevant literature in linguistics and computational linguistics. I use the standard terminology from the field below.
In practical terms, you could start by assigning the nearest antecedent (the most frequently used approach in English). Using your examples:
first extract all the "entities" in a sentence
from the entity list, identify antecedent candidates ("litigation", etc.). This is a very difficult task, involving many different problems... you might avoid it if you know in advance the "entities" that will be interesting for you.
finally, you assign (resolve) each anaphora/cataphora to the nearest antecedent.
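On the question's own example, a toy sketch of the nearest-antecedent idea for the elided head noun could look like this; the splitting logic is deliberately simplistic and only works when the shared head appears in the last conjunct.

def resolve_ellipsis(conjuncts):
    """Copy the head noun of the last conjunct back onto the earlier conjuncts."""
    *early, last = conjuncts
    head = last.split()[-1]                 # e.g. "litigation"
    return [f"{c} - {head}" for c in early] + [last]

print(resolve_ellipsis(["employment", "tax", "copyright litigation"]))
# ['employment - litigation', 'tax - litigation', 'copyright litigation']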
Have a look at CogComp NER tagger:
https://github.com/CogComp/cogcomp-nlp/tree/master/ner