I need to extract entities from sentences using NER and POS tags. For example,
given the sentence below:
import spacy
nlp = spacy.load("en_core_web_sm")
docx = nlp("The two blue cars belong to the tall Lorry Jim.")
the desired entities are (two blue cars, tall Lorry Jim). Running spaCy NER on the sentence:
for ent in docx.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
It returns:
two 4 7 CARDINAL
Lorry Jim 37 46 PERSON
My goal is to attach the adjectives/numbers in front of the entities identified by NER. In the case above, tall is an ADJ and should be appended to the Lorry Jim entity, and two blue cars should be extracted using a NUM ADJ NOUN pattern from the POS tagger.
First, I have to say that the task you describe is NOT what you said in the title. "Entity" has a standard definition, and an ADJ, for example, is not part of an entity.
I think that to solve your problem you should use dependency parsing and analyze the sentence's dependency tree. It can tell you which words modify which, as in the sketch below.
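As a rough illustration (a minimal sketch, assuming spaCy's en_core_web_sm model; the exact parses and entity spans vary by model), you can expand each NER span leftward over modifier dependencies:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The two blue cars belong to the tall Lorry Jim.")

for ent in doc.ents:
    start = ent.start
    # Absorb preceding tokens whose head lies inside the entity and whose
    # relation is a modifier (amod = adjective, nummod = number).
    while start > 0:
        prev = doc[start - 1]
        if prev.dep_ in ("amod", "nummod", "compound") and prev.head.i >= start:
            start -= 1
        else:
            break
    print(doc[start:ent.end].text, ent.label_)

With the parse above, the PERSON span grows to "tall Lorry Jim"; for "two blue cars" you would additionally match a NUM ADJ NOUN pattern, e.g. with spaCy's Matcher.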
Alternatively, you can define a chunking task for your problem, build a dataset for what you mean, and try to train a model for that type of chunking.
If you want to do this for practical use, you need to make your problem very clear and also simple, so that you can choose a practical method for solving it. If you accept some error, you can define simple rules for the NOUN and ADJ parts, so that once you have the POS tags and the NER output together, you can solve it. It also depends on the language you are working in. Take your example:
blue car
In English, adjectives are commonly placed before nouns; this is known as the attributive (modifier) position. But you have to watch out for sentences like this:
All the cars he had were blue.
For future work you can also look into coreference resolution, for sentences like this:
I saw the cars that he drives and they were all blue.
I am looking for ideas/thoughts on the following problem:
I am working with food ingredient data such as: milk, sugar, eggs, flour, may contain nuts
From such a piece of text I want to be able to identify and extract phrases like "may contain nuts", to preprocess them separately.
These kinds of phrases can vary quite a lot in length and content. I thought of using NER taggers, but I don't know whether they will do the job correctly, as they are mainly used for identifying single-word entities...
Any ideas on what to use as a phrase-entity-recognition system? Also which package would you use? Cheers
IMHO, NER (or model-based entity extraction in general) alone is a poor choice of methodology for this particular problem, as it requires LOTS of manual annotation to do right. Instead I suggest using Word2Vec (https://radimrehurek.com/gensim/models/word2vec.html) with phrasing (https://radimrehurek.com/gensim/models/phrases.html).
The idea is to have an unsupervised model containing phrases and their similarities, which can then be queried with some seed words to list all possible ingredients (e.g. "cat" produces similar words like "dog" or "rat"). The next step would be either to create dictionaries containing the ingredient words & phrases, or to try clustering the model's vocabulary using the cosine similarity between each word/phrase pair.
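Roughly, a sketch of that pipeline (assuming gensim 4.x; the toy sentences stand in for your real tokenized ingredient corpus, and the thresholds are placeholders you would tune):

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

sentences = [
    ["milk", "sugar", "eggs", "flour", "may", "contain", "nuts"],
    ["water", "salt", "may", "contain", "nuts"],
    # ... many more tokenized ingredient lists
]

# Two Phrases passes, so bigrams can merge further into trigrams
# such as "may_contain_nuts".
bigrams = Phrases(sentences, min_count=1, threshold=0.1)
trigrams = Phrases([bigrams[s] for s in sentences], min_count=1, threshold=0.1)
phrased = [trigrams[bigrams[s]] for s in sentences]

# Train embeddings over the phrased corpus, then query with seed words.
model = Word2Vec(phrased, vector_size=50, min_count=1)
print(model.wv.most_similar("milk", topn=5))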
Now if you want to take things further, you can always match your dictionaries/clusters back to the corpus the W2V model was trained on, and then train a custom entity recognition model using those matches, since you now have annotated examples.
I believe this is a Multiword-Expression problem.
There are a few ways you can try to solve this:
Build a named entity recognition model (NER)
Search with Regex for a fixed set of known phrases
Chunking tokens with POS tags
Find collocations of tokens
Let's look at each of these
Build a named entity recognition model (NER)
Named Entity Recognition labels known spans of tokens as an entity type.
For each input token you have to label it as part of a known named entity.
Eddy N PERSON
Bonte N PERSON
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N ORG
. Punc O
This is costly and requires a lot of time for labelling.
It is probably not a good choice for your task.
Search with Regex
This is not a bad idea: using some known phrases, you can easily search input texts, with word boundaries to avoid partial matches.
import re

text = "Ingredients: milk, sugar, eggs, flour, may contain nuts"
print(re.findall(r"\bmay contain nuts\b", text))
This would require you to know all the phrases you want to search for up front, which might not be possible.
Chunking tokens with POS tags
This could be a good intermediate step but could give many false positives.
You could do this by knowing the sequences of POS tags you expect:
may MD
contain VB
nuts NNS
Then you could use chunking with the known tag sequence (MD, VB, NNS).
The problem is that you may not know these sequences in advance and would have to capture many use cases. It will also capture many sequences which you won't want (false positives).
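A rough sketch of the chunking approach with NLTK's RegexpParser (the tag pattern is just the MD VB NNS example above; NLTK's tokenizer and tagger data packages need to be downloaded first):

import nltk

# Chunk grammar: a single rule matching the tag sequence MD VB NNS.
parser = nltk.RegexpParser("WARN: {<MD><VB><NNS>}")

tagged = nltk.pos_tag(nltk.word_tokenize("These cookies may contain nuts."))
tree = parser.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == "WARN"):
    print(" ".join(word for word, tag in subtree.leaves()))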
Find collocations of tokens
This is probably the best way, as it seems you are looking for highly common sequences of words (tokens) in a corpus.
You can do this using:
Word2Vec Phrases
NLTK Collocations
Both do the same thing: they look for statistically common sequences of tokens that occur in a corpus.
Those can then be used to extract the same collocation phrases from new texts.
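For example, with NLTK collocations (a minimal sketch; corpus.txt stands in for your real ingredient corpus):

import nltk
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

tokens = nltk.word_tokenize(open("corpus.txt").read().lower())

finder = TrigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # ignore very rare trigrams
print(finder.nbest(TrigramAssocMeasures().pmi, 10))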
It looks like your ingredient data is easy to split into a list. In that case you don't really need a sequence tagger; I wouldn't treat this problem as phrase extraction or NER. What I would do is train a classifier on the items in the list to label them as "food" or "non-food". You should be able to start with rules and then train a basic classifier using anything, really.
Before training a model, an even simpler step would be to run each list item through a PoS tagger (say spaCy); if there's a verb, you can guess that it's not a food item.
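A quick sketch of that heuristic with spaCy (the en_core_web_sm model is an assumption):

import spacy

nlp = spacy.load("en_core_web_sm")
items = "milk, sugar, eggs, flour, may contain nuts".split(", ")

for item in items:
    # Items containing a verb or auxiliary are probably phrases, not foods.
    has_verb = any(tok.pos_ in ("VERB", "AUX") for tok in nlp(item))
    print(item, "->", "phrase" if has_verb else "food?")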
I want to replace a noun in a sentence with its pronoun. I will be using this to create a dataset for an NLP task. For example, if my sentences are:
"Jack and Ryan are friends. Jack is also friends with Michelle."
Then I want to replace the second Jack with "He".
I have done the POS tagging to find the Nouns in my sentences. But I do not know how to proceed from here.
If I have a list of all possible pronouns that can be used, is there a corpus or system that can tell me the most appropriate pronoun for the word?
You can almost do this with tools in Stanford CoreNLP. If you run the "coref" annotator, then it will attempt to determine the reference of a pronoun to other entity mentions in the text. There is also a "gender" annotator, which can assign a (binary) gender to an English name (based just on overall frequency statistics). (This gender annotator can at present only be accessed programmatically; its output doesn't appear in our standard output formats.)
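A rough sketch via stanza's CoreNLPClient wrapper (this assumes a local CoreNLP installation; the field names follow CoreNLP's protobuf output):

from stanza.server import CoreNLPClient

text = "Jack and Ryan are friends. Jack is also friends with Michelle."

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma",
                               "ner", "parse", "coref"]) as client:
    ann = client.annotate(text)
    # Each chain groups mentions that refer to the same entity.
    for chain in ann.corefChain:
        print([(m.sentenceIndex, m.beginIndex, m.endIndex)
               for m in chain.mention])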
However, both coreference resolution and automated gender assignment are tasks with mediocre accuracy, and the second has further assumptions that make it generally questionable. I find it hard to believe that doing this automatically will be a useful strategy to automatically produce data for an NLP task.
I have used HuggingFace BERT for sentence classification with very good results, but now I want to apply it to another use case. Below is the kind of dataset (not exact) I have in mind.
set_df.head()
   sentence                                 subject               object
0  my big red dog has a big fat bone        my big red dog        big fat bone
1  The Queen of Spades lives in a Castle    The Queen of spades   lives in a castle
I have a training dataset with these three columns, and I want the model to be able to bisect the test sentences into their constituents. I have looked into the different pre-trained BERT models, but I haven't gotten any success. Am I using the wrong tool?
I think the better question is how you are framing the task: if, in fact, the constituents are non-overlapping, this might be a case for BertForTokenClassification. Essentially, you are trying to predict the label of each individual token, in your case something like no label, subject, or object.
A great example for this kind of task is Named Entity Recognition (NER), which is generally framed in a similar fashion. Specifically, HuggingFace's transformer repository has a very extensive example available for you, that can serve as inspiration on how to format inputs, and how to train properly.
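A minimal sketch of that framing (the O/SUBJECT/OBJECT label set is an assumption for this use case; the head here is untrained, so predictions are random until you fine-tune, e.g. with the Trainer API):

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "SUBJECT", "OBJECT"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

inputs = tokenizer("my big red dog has a big fat bone", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

print([labels[p] for p in logits.argmax(-1)[0].tolist()])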
So, I solved this problem by modifying the data into CONLL format, where each row contains only one word with the matching subject/object label.
my subject
big subject
red subject
... ....
Voilà, it became a regular entity recognition problem, to be solved using BERT for token classification. As an extra piece of advice: I got it to work with RoBERTa, but I needed to go through a complex de-tokenization process.
I'm writing a program to analyze the usage of color in text. I want to search for color words such as "apricot" or "orange". For example, an author might write "the apricot sundress billowed in the wind." However, I want to only count the apricots/oranges that actually describe color, not something like "I ate an apricot" or "I drank orange juice."
Is there any way to do this, perhaps using context() in NLTK?
Welcome to the broad field of homonymy, polysemy and WSD (word sense disambiguation). In corpus linguistics, one approach uses collocations (e.g. "orange juice" and "orange jacket") to determine the probability of the juice having the colour "orange" versus being made of the respective fruit. Both probabilities are high there, but the probability of a "jacket" being made of the fruit should be much lower.

There are different methods you can use. You could ask corpus annotators (specialists, crowdsourcing, etc.) to annotate data in a text, which you can use to train your (machine learning) model, in this case a simple classifier. Otherwise you could use large text data to gather collocation counts in combination with WordNet, which may give you semantic information about whether it is usual for a jacket to be made of fruit. A fortunate detail is that people only rarely use stereotypical colours in text, so you don't have to care about cases like "the yellow banana".
Shallow parsing may also help, since colour adjectives should preferably be used in attributive position.
A different approach would be to use word similarity measures (vector space semantics) or embeddings for word sense disambiguation (WSD).
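For instance, NLTK ships a (rather noisy) Lesk implementation for WSD, which picks a WordNet sense from the surrounding words (a minimal sketch; requires the wordnet and tokenizer data packages):

from nltk import word_tokenize
from nltk.wsd import lesk

for sent in ["the apricot sundress billowed in the wind",
             "I ate an apricot for breakfast"]:
    sense = lesk(word_tokenize(sent), "apricot")
    print(sent, "->", sense, "-", sense.definition() if sense else "?")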
Maybe this helps:
https://web.stanford.edu/~jurafsky/slp3/slides/Chapter18.wsd.pdf
https://towardsdatascience.com/a-simple-word-sense-disambiguation-application-3ca645c56357
I'm using NLTK to extract named entities, and I'm wondering how it would be possible to determine the sentiment between entities in the same sentence. For example, for "Jon loves Paris." I would get two entities, Jon and Paris. How would I be able to determine the sentiment between these two entities? In this case it should be something like Jon -> Paris = positive.
In short "you cannot". This task is far beyond simple text processing which is provided with NLTK. Such objects relations sentiment analysis could be the topic of the research paper, not something solvable with a simple approach. One possible method would be to perform a grammar analysis, extraction of the conceptual relation between objects and then independent sentiment analysis of words included, but as I said before - it is rather a reasearch topic.