Changing a Noun to its Pronoun in a sentence - python

I want to replace a noun in a sentence with its pronoun. I will be using this to create a dataset for an NLP task. For example, if my sentences are:
"Jack and Ryan are friends. Jack is also friends with Michelle."
Then I want to replace the second "Jack" with "He".
I have done the POS tagging to find the Nouns in my sentences. But I do not know how to proceed from here.
If I have a list of all possible pronouns that can be used, is there a corpus or system that can tell me the most appropriate pronoun for the word?

You can almost do this with tools in Stanford CoreNLP. If you run the "coref" annotator, then it will attempt to determine the reference of a pronoun to other entity mentions in the text. There is also a "gender" annotator, which can assign a (binary) gender to an English name (based just on overall frequency statistics). (This gender annotator can at present only be accessed programmatically; its output doesn't appear in our standard output formats.)
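If you want to experiment, here is a minimal sketch of running the coref annotator through the stanza CoreNLP client; it assumes a local CoreNLP installation with CORENLP_HOME set, and reads the mention spans from CoreNLP's protobuf output:
# Minimal sketch: CoreNLP coreference via the stanza client.
# Assumes CoreNLP is installed locally and CORENLP_HOME points to it.
from stanza.server import CoreNLPClient
text = "Jack and Ryan are friends. Jack is also friends with Michelle."
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "coref"],
                   timeout=30000, memory="4G") as client:
    ann = client.annotate(text)
    for chain in ann.corefChain:
        # Each chain groups the mentions of one entity as token spans.
        for m in chain.mention:
            print(m.sentenceIndex, m.beginIndex, m.endIndex)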
However, both coreference resolution and automated gender assignment are tasks with mediocre accuracy, and the second has further assumptions that make it generally questionable. I find it hard to believe that doing this automatically will be a useful strategy to automatically produce data for an NLP task.

Related

Alternatives to NER taggers for long, heterogeneous phrases?

I am looking for ideas/thoughts on the following problem:
I am working with food ingredient data such as: milk, sugar, eggs, flour, may contain nuts
From such a piece of text I want to be able to identify and extract phrases like may contain nuts, to preprocess them separately.
These kinds of phrases can change quite a lot in terms of length and content. I thought of using NER taggers, but I don't know if they will do the job correctly as they are mainly used for identifying single-word entities...
Any ideas on what to use as a phrase-entity-recognition system? Also which package would you use? Cheers
IMHO NER (or model-based entity extraction in general) alone is a poor choice of methodology for this particular problem as it requires LOTS of manual annotation to do it right. Instead I suggest using Word2Vec (https://radimrehurek.com/gensim/models/word2vec.html) with phrasing (https://radimrehurek.com/gensim/models/phrases.html).
The idea is to have an unsupervised model containing phrases and their similarities, which can then be queried with some seed words to list all possible ingredients (e.g. "cat" produces similar words like "dog" or "rat"). The next step would be either to create dictionaries containing the ingredient words & phrases, or to try clustering the vocabulary of the model using cosine similarity between each word/phrase pair.
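A minimal sketch of that idea with gensim (4.x API); the toy corpus and the min_count/threshold values below are placeholders for illustration, not tuned recommendations:
# Learn collocations with Phrases, then train Word2Vec on the phrased corpus.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases
corpus = [
    ["milk", "sugar", "eggs", "flour", "may", "contain", "nuts"],
    ["water", "salt", "flour", "may", "contain", "nuts"],
    ["milk", "cocoa", "sugar", "may", "contain", "nuts"],
]
phrases = Phrases(corpus, min_count=1, threshold=0.1)  # joins bigrams like "may_contain"
phrased = [phrases[sentence] for sentence in corpus]
model = Word2Vec(phrased, vector_size=50, window=3, min_count=1, seed=1)
print(model.wv.most_similar("milk", topn=3))  # query with a seed word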
Now if you want to take things further you can always match your created dictionaries/clusters back to the corpus the W2V model was trained on and then train a custom entity recognition model using those matches as you now have annotated examples.
I believe this is a Multiword-Expression problem.
There are a few ways you can try to solve this:
Build a named entity recognition model (NER)
Search with Regex for a fixed set of known phrases
Chunking tokens with POS tags
Find collocations of tokens
Let's look at each of these:
Build a named entity recognition model (NER)
Named entity recognition labels known spans of tokens as an entity type.
For each input token, you label it as part of a known named entity, e.g.:
Eddy N PERSON
Bonte N PERSON
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N ORG
. Punc O
This is costly and requires a lot of time for labelling.
It is probably not a good choice for your task.
Search with Regex
This is not a bad idea: given some known phrases, you can easily search input texts, using word boundaries (\b) to avoid matching inside other words.
import re
text = "Ingredients: milk, sugar, eggs, flour, may contain nuts"
re.findall(r"\bmay contain nuts\b", text)  # ['may contain nuts']
This requires knowing all the phrases you want to search for up front, which might not be possible.
Chunking tokens with POS tags
This could be a good intermediate step but could give many false positives.
You could do this by knowing the sequences of POS tags you expect:
may MD
contain VB
nuts NNS
Then you could use chunking with the known tag sequence (MD, VB, NNS).
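A quick sketch of this with NLTK's RegexpParser (the PHRASE label is arbitrary):
import nltk
tagged = [("may", "MD"), ("contain", "VB"), ("nuts", "NNS")]
parser = nltk.RegexpParser("PHRASE: {<MD><VB><NNS>}")
tree = parser.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "PHRASE"):
    print(subtree.leaves())  # [('may', 'MD'), ('contain', 'VB'), ('nuts', 'NNS')]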
The problem is that you may not know the pattern, and would have to capture many use cases. It will also capture many sequences you won't want (false positives).
Find collocations of tokens
This is probably the best way, as it seems you are looking for highly common sequences of words (tokens) in a corpus.
You can do this using:
Word2Vec Phrases
NLTK Collocations
Both do the same thing: they look for statistically common sequences of tokens which occur in a corpus.
Those can then be used to extract the same collocation phrases from new texts.
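For example, with NLTK collocations (toy token list for illustration):
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
tokens = ("milk sugar eggs flour may contain nuts "
          "water salt may contain nuts").split()
measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep trigrams seen at least twice
print(finder.nbest(measures.pmi, 5))  # e.g. [('may', 'contain', 'nuts')]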
It looks like your ingredient text is easy to split into a list of items. In that case you don't really need a sequence tagger; I wouldn't treat this problem as phrase extraction or NER. What I would do is train a classifier on the individual items in the list to label them as "food" or "non-food". You should be able to start with rules and then train a basic classifier using anything, really.
Before training a model, an even simpler step would be to run each list item through a PoS tagger (say spaCy); if there's a verb, you can guess that it's not a food item.
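A minimal sketch of that heuristic with spaCy (assumes the en_core_web_sm model is installed):
import spacy
nlp = spacy.load("en_core_web_sm")
items = ["milk", "sugar", "eggs", "flour", "may contain nuts"]
for item in items:
    has_verb = any(tok.pos_ in ("VERB", "AUX") for tok in nlp(item))
    print(item, "->", "non-food" if has_verb else "food")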

Entity extraction using POS and NER in spacy

I need to extract entities from sentences using NER and POS tags. For example,
Given the sentence below:
docx = nlp("The two blue cars belong to the tall Lorry Jim.")
where the entities I want are (two blue cars, tall Lorry Jim). Running spaCy NER on the sentence:
for ent in docx.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
It returns:
two 4 7 CARDINAL
Lorry Jim 37 46 PERSON
My goal is to attach the adjectives/numerals in front of the entities identified by NER: in the case above, "tall" is an ADJ and should be appended to the "Lorry Jim" entity, and "two blue cars" should be extracted as a NUM ADJ NOUN pattern from the POS tagger.
First, I have to say that the task you describe is not really what the title says. "Entity" has a standard definition, and an ADJ, for example, is not part of an entity.
I think that to solve your problem you have to use dependency parsing and analyze the sentence's dependency tree. That can help you find which words modify or refer to each other.
Alternatively, you can define a chunking task for your problem, build a dataset for exactly what you mean, and train a model for that type of chunking.
If you want to do this for practical usage, you need to make your problem very clear and simple, so that you can choose a practical method for solving it. If you accept some error, you can define simple rules for the NOUN and ADJ parts, so that with the POS and NER output together you can solve it (see the sketch at the end of this answer). It also depends on the language you are working in. Take your example:
blue car
In English, adjectives are commonly placed before nouns; this is known as the modifier or attributive position. But you have to be careful with sentences like this:
All the cars he had were blue.
For future work you can also look at coreference resolution, for sentences like this:
I saw the cars that he drives and it was all blue.
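A minimal sketch of the rule-based idea above with spaCy (assumes en_core_web_sm): expand each NER span leftward over adjectives and numerals, which covers "tall Lorry Jim"; noun chunks cover patterns like "the two blue cars":
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The two blue cars belong to the tall Lorry Jim.")
for ent in doc.ents:
    start = ent.start
    # Walk left while the previous token is an adjective or numeral.
    while start > 0 and doc[start - 1].pos_ in ("ADJ", "NUM"):
        start -= 1
    print(doc[start:ent.end].text, ent.label_)
print([chunk.text for chunk in doc.noun_chunks])  # e.g. ['The two blue cars', ...]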

Can I use Natural Language Processing to identify given words in a paragraph, or do I need to use machine learning algorithms?

I need to identify some given words using NLP.
As an example,
Mary Lives in France
Suppose the given words are Australia, Germany, France. But this sentence includes only France.
So among the above 3 given words, I need to identify that the sentence includes only France.
I would comment but I don't have enough reputation. It's a bit unclear exactly what you are trying to achieve here and how representative your example is - please edit your question to make it clearer.
Anyhow, like Guy Coder says, if you know exactly the words you are looking for, you don't really need machine learning or NLP libraries at all; a simple membership check like the sketch below will do. However, if you don't have every example of what you are looking for up front, the NER approach described after the snippet might help.
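For the exact-words case, something like this is enough:
given = {"Australia", "Germany", "France"}
sentence = "Mary Lives in France"
found = {word for word in given if word in sentence.split()}
print(found)  # {'France'}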
It seems like what you are trying to do is perform Named Entity Recognition (NER), i.e. identify the named entities (e.g. countries) in your sentences. If so, the short answer is: you don't need to use any machine learning algorithms yourself. You can just use a Python library such as spaCy, which comes out of the box with a pretrained language model that can already perform a bunch of tasks, for instance NER, to a high degree of performance. The following snippet should get you started:
import spacy
nlp = spacy.load('en')  # note: in spaCy 3.x this shortcut is gone; use spacy.load('en_core_web_sm')
doc = nlp("Mary Lives in France")
for entity in doc.ents:
    if entity.label_ == "GPE":
        print(entity.text)
The output of the above snippet is "France". Named entities cover a wide range of possible things. In the snippet above I have filtered for Geopolitical entities (GPE).
Learn more about spaCy here: https://spacy.io/usage/spacy-101

Which technique is most appropriate for identifying various sentiments in the same text using Python?

I'm studying NLP and, as an example, I'm trying to identify the feelings expressed in customer feedback on an online course platform.
I was able to identify the students' feelings in simple sentences, such as "The course is very nice, I learned a lot from it", "The teaching platform is complete and I really enjoy using it", "I could have more courses related to marine biology", and so on.
My question is how to correctly identify multiple sentiments in one sentence or across several sentences. For example:
A sentiment per sentence:
"The course is very good! it could be cool to create a section of questions on the site."
More than one sentiment per sentence:
"The course is very good, but the site is not."
Involving both:
"The course is very good, but the teaching platform is very slow. There could be more tasks and examples in the courses, interaction by video or microphone on the forum, for example."
I thought of splitting the text into sentences, but that does not work well for example 2.
You can think of commas, other punctuation marks, and some conjunctions and prepositions as actually splitting sentences. This goes beyond code into the field of linguistics, as they sometimes, but not always, separate clauses.
In the 2nd case you actually have two clauses: "The course is very good" -but- "the site is not [very good]".
I believe there are NLP packages that can split sentences into clauses (probably by relying on the fact that most clauses follow a subject/predicate/object structure, so if you find more than one verb you'll probably find the same number of clauses), and you could use those to parse your text first. Look for libraries doing that in your language of choice.
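One rough way to approximate this with spaCy is to split at coordinating conjunctions; a heuristic sketch (assumes en_core_web_sm):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The course is very good, but the site is not.")
clauses, start = [], 0
for tok in doc:
    if tok.dep_ == "cc":  # coordinating conjunction such as "but"
        clauses.append(doc[start:tok.i].text)
        start = tok.i + 1
clauses.append(doc[start:].text)
print(clauses)  # e.g. ['The course is very good,', 'the site is not.']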
There is a library specifically for multi-label classification:
scikit-multilearn
When you train your model, you have to split the classes into binary columns (one per label).
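That binarization step can be done with scikit-learn's MultiLabelBinarizer; a small sketch with made-up label names:
from sklearn.preprocessing import MultiLabelBinarizer
y = [["course_positive"], ["course_positive", "site_negative"]]
mlb = MultiLabelBinarizer()
print(mlb.fit_transform(y))  # one binary column per label
print(mlb.classes_)          # label order of the columns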

How to Grab meaning of sentence using NLP?

I am new to NLP. My requirement is to parse meaning from sentences.
Example
"Perpetually Drifting is haunting in all the best ways."
"When The Fog Rolls In is a fantastic song
From the above sentences, I need to extract the following phrases:
"haunting in all the best ways."
"fantastic song"
Is it possible to achieve this in spacy?
It is not possible to extract summarized sentences like that using spaCy alone. The following methods might work for you:
The simplest is to extract the noun phrases or verb phrases. Most of the time that should give you the text you want (phrase structure grammar); see the sketch after this list.
You can use dependency parsing and extract the dependencies of the central (head) word.
dependency grammar
You can train a sequence model where the input is the full sentence and the output is your summarized sentence.
Sequence models for text summarization
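A minimal sketch of the first option, extracting noun phrases with spaCy's noun_chunks (assumes en_core_web_sm):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('"When The Fog Rolls In" is a fantastic song.')
print([chunk.text for chunk in doc.noun_chunks])  # e.g. ['a fantastic song']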
Extracting the meaning of a sentence is quite an arbitrary task. What do you mean by the meaning? Using spaCy you can extract the dependencies between the words (which specify much of the meaning of the sentence), find the POS tags to check how words are used in the sentence, and also find places, organizations, and people using the NER tagger. However, the meaning of a sentence is too general a notion, even for humans.
Maybe you are searching for a specific meaning? If that's the case, you have to train your own classifier. This will get you started.
If your task is summarization of a couple of sentences, consider also using gensim. You can have a look here.
Hope it helps :)
