named entity recognition with spacy

I'm working on natural language processing using the spaCy library in Python.
From the input I get several sentences, which I process separately using:
for sent in doc.sents:
For each sent I search for named entities using the .ents attribute.
What I would like to achieve is to replace the initial sent with a new one in which every recognized named entity is replaced by its label.
Here is an example:
First sentence: Apple is looking at buying U.K. startup for $1 billion
After replacing: ORG is looking at buying GPE startup for MONEY
Of course a simple string.replace doesn't work, since I would like to end up with a new spacy.Doc.
Any idea how to achieve this?

You may wish to try:
import spacy
nlp = spacy.load("en_core_web_md")
in_ = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(in_)
out = []
for sent in doc.sents:
    sent_out = ""
    for tok in sent:
        # Preserve the original spacing after each token
        ws = " " if tok.whitespace_ else ""
        if tok.ent_type_:
            sent_out += tok.ent_type_ + ws
        else:
            sent_out += tok.text + ws
    out.append(sent_out)
print(out)
['ORG is looking at buying GPE startup for MONEYMONEY MONEY']
Note the peculiar pattern MONEYMONEY MONEY: the entity "$1 billion" spans three tokens ("$", "1" and "billion"), each of which is replaced by its label, and the first two are not separated by whitespace.
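One way to avoid the repeated label is to replace each entity span once instead of each token. The following is a minimal sketch, assuming the entity character offsets are already known (in spaCy they are available as ent.start_char, ent.end_char and ent.label_ for each ent in doc.ents). The helper name replace_spans and the hand-written offsets are illustrative only and are not part of the answer above.

```python
def replace_spans(text, spans):
    """Replace each (start_char, end_char, label) span in text with its label.

    Spans are assumed not to overlap, as is the case for spaCy's doc.ents.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(label)               # one label per entity, not per token
        cursor = end
    pieces.append(text[cursor:])           # trailing text after the last entity
    return "".join(pieces)


text = "Apple is looking at buying U.K. startup for $1 billion"
# Hypothetical offsets, written by hand to mirror what doc.ents might report.
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]
print(replace_spans(text, spans))
```

The result is a plain string; passing it back through nlp would produce a new Doc if one is needed.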

Replace personal pronoun with previous person mentioned (noisy coref)

I want to do a noisy resolution such that, given a personal pronoun, that pronoun is replaced by the previous (nearest) person.
For example:
Alex is looking at buying a U.K. startup for $1 billion. He is very confident that this is going to happen. Sussan is also in the same situation. However, she has lost hope.
The output is:
Alex is looking at buying a U.K. startup for $1 billion. Alex is very confident that this is going to happen. Sussan is also in the same situation. However, Sussan has lost hope.
Another example:
Peter is a friend of Gates. But Gates does not like him.
In this case, the output would be:
Peter is a friend of Gates. But Gates does not like Gates.
Yes! This is super noisy.
Using spaCy:
I have extracted the persons using NER, but how can I replace the pronouns appropriately?
Code:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)  # text holds the input string
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        print(ent.text, ent.label_)

There is a dedicated library, neuralcoref, for resolving coreference. See the minimal reproducible example below:
import spacy
import neuralcoref
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)
doc = nlp(
'''Alex is looking at buying a U.K. startup for $1 billion.
He is very confident that this is going to happen.
Sussan is also in the same situation.
However, she has lost hope.
Peter is a friend of Gates. But Gates does not like him.
''')
print(doc._.coref_resolved)
Alex is looking at buying a U.K. startup for $1 billion.
Alex is very confident that this is going to happen.
Sussan is also in the same situation.
However, Sussan has lost hope.
Peter is a friend of Gates. But Gates does not like Peter.
Note that you may run into issues with neuralcoref if you pip install it, so it's better to build it from source, as I outlined here.

I have written a function that works for your two examples.
Consider using a larger model such as en_core_web_lg for more accurate tagging.
import spacy
nlp = spacy.load("en_core_web_lg")
def pronoun_coref(text):
    doc = nlp(text)
    # Personal pronouns, and (name, index of first token) for each PERSON entity
    pronouns = [tok for tok in doc if tok.tag_ == "PRP"]
    names = [(ent.text, ent[0].i) for ent in doc.ents if ent.label_ == 'PERSON']
    pieces = [tok.text_with_ws for tok in doc]
    for p in pronouns:
        # Nearest PERSON entity that starts before the pronoun, if any
        replace = max(filter(lambda x: x[1] < p.i, names),
                      key=lambda x: x[1], default=False)
        if replace:
            # Keep the pronoun's own trailing whitespace, which also avoids
            # indexing past the end when the pronoun is the last token
            pieces[p.i] = replace[0] + p.whitespace_
    return ''.join(pieces)
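The "nearest previous person" lookup at the heart of this approach can be tried without a model. This is a rough sketch under stated assumptions: the token list and the PERSON and pronoun positions are written by hand (in practice they would come from spaCy), and nearest_previous is a hypothetical helper name. It uses a binary search over sorted positions instead of the max/filter scan.

```python
from bisect import bisect_left


def nearest_previous(positions, index):
    """Return the largest position strictly less than index, or None."""
    i = bisect_left(positions, index)
    return positions[i - 1] if i else None


# Hand-written stand-ins for what a tagger and NER model might report.
tokens = ["Peter", "is", "a", "friend", "of", "Gates", ".",
          "But", "Gates", "does", "not", "like", "him", "."]
person_positions = [0, 5, 8]   # sorted token indices tagged PERSON (assumed)
pronoun_positions = [12]       # token indices tagged as pronouns (assumed)

resolved = list(tokens)
for p in pronoun_positions:
    prev = nearest_previous(person_positions, p)
    if prev is not None:
        resolved[p] = tokens[prev]
print(resolved[12])
```

As in the question, "him" resolves to the nearest preceding name ("Gates") rather than the linguistically correct "Peter", which is the expected noise of this heuristic.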

Extracting sentence from a dataframe with description column based on a phrase

I have a dataframe with a 'description' column containing details about the product. Each description in the column is a long paragraph, like:
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence that contains the phrase "superb product" and place it in a new column?
So for this case the result would be:
[image: expected output]
I have used this:
searched_words=['superb product','SUPERB PRODUCT']
print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
The output of this is not suitable, although it works if I put just one word in the searched_words list.

There are lots of ways to do this. ChootsMagoots gave you a good answer, but spaCy is also very efficient: you can simply define a pattern that will lead you to that sentence. Before that, you need to define a function that sets the sentence boundaries. Here's the code:
import spacy
from spacy.matcher import Matcher
def product_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False  # Tell the default sentencizer to ignore this token
    return doc
nlp = spacy.load('en_core_web_sm', disable=['ner'])
# spaCy v2 style; in v3, register the function with @Language.component and add it by name
nlp.add_pipe(product_sentencizer, before="parser")  # Insert before the parser can build its own sentences
text = "This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much."
doc = nlp(text)
matcher = Matcher(nlp.vocab)
# Match the two consecutive tokens "superb" and "product", ignoring case
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]
matcher.add("SUPERB_PRODUCT", None, pattern)  # spaCy v2 signature; v3 takes matcher.add(name, [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)

Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:
for index, paragraph in df['column_name'].items():
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            # .loc avoids chained assignment and creates the column if needed
            df.loc[index, 'extracted_sentence'] = sentence
This is going to be quite slow, but I don't know if there's a better way.
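The per-row work can also be written as a small standalone function, which is easier to test than a loop over the dataframe. This is only a sketch: the function name sentences_with_phrase is made up for illustration, and the sentence split is a naive regular expression (end punctuation followed by whitespace), not a real sentence tokenizer.

```python
import re


def sentences_with_phrase(text, phrase):
    """Return the sentences of text that contain phrase, ignoring case."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    needle = phrase.lower()
    return [s for s in sentences if needle in s.lower()]


description = ("This is a superb product. I so so loved this superb product "
               "that I wanna gift to all. This is like the quality and "
               "packaging. I like it very much")
print(sentences_with_phrase(description, "superb product"))
```

With pandas, a function like this could be applied per row, for example df['description'].apply(lambda t: sentences_with_phrase(t, 'superb product')), assuming the column holds plain strings.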

a python library that accepts some text, and replaces phone numbers, names, and so on with tokens

I need a python library that accepts some text, and replaces phone numbers, names, and so on with tokens. Example:
Input: Please call Robert on 0430013454 to discuss this further.
Output: Please call NAME on PHONE to discuss this further.
In other words, I need to take any sentence, run the program on it, and replace anything that looks like a name, phone number or other identifier with a token, e.g. NAME or PHONE NUMBER. The token is just text that replaces the information so that it is no longer displayed.
Must be Python 2.7 compatible. Would anybody know how this could be done?
Cheers!

As Harrison pointed out, nltk has named entity recognition, which is what you want for this task. Here is a good sample to get you started.
From the site:
import nltk
# Assumed sample input, taken from the question; the original sample leaves text undefined
text = "Please call Robert on 0430013454 to discuss this further."
sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
def extract_entity_names(t):
    # Recursively collect the text of every subtree labelled 'NE'
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names
entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print(extract_entity_names(tree))
    entity_names.extend(extract_entity_names(tree))
# Print all entity names
# print(entity_names)
# Print unique entity names
print(set(entity_names))

Not really sure about name recognition. However, if you know the names that you are looking for, it would be easy. You could keep a list of all of those names, check whether each one is in the string and, if so, just use string.replace. If the names are arbitrary you could look into NLTK; I think it has some named entity recognition. I really don't know anything about it though...
But as for phone numbers, that's easy. You can split the string into a list and check whether any element consists only of digits. You could even check the length to make sure it's 10 digits (I'm assuming all numbers will be 10 digits, based on your example).
Something like this...
example_input = 'Please call Robert on 0430013454 to discuss this further.'
new_list = example_input.split(' ')
for pos, word in enumerate(new_list):
    # Replace any element made up entirely of digits
    if word.isdigit():
        new_list[pos] = 'PHONE'
example_output = ' '.join(new_list)
print(example_output)
This would be the output: 'Please call Robert on PHONE to discuss this further.'
The if statement would be something like if word.isdigit() and len(word) == 10: if you wanted to make sure the length of the digits is 10.
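A regular expression is one alternative to the split-and-isdigit approach, because it also catches a number that is directly followed by punctuation (for example "0430013454."), which str.isdigit would reject. The sketch below assumes, as the answer does, that a phone number is a standalone run of exactly 10 digits; the names PHONE_RE and mask_phones are illustrative.

```python
import re

# Assumption: a phone number is a standalone run of exactly 10 digits.
PHONE_RE = re.compile(r"\b\d{10}\b")


def mask_phones(text, token="PHONE"):
    """Replace every 10-digit run in text with token."""
    return PHONE_RE.sub(token, text)


print(mask_phones("Please call Robert on 0430013454 to discuss this further."))
```

Real phone numbers with spaces, dashes or country codes would need a broader pattern than this one.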

NER naive algorithm

I have never really dealt with NLP, but I had an idea about NER which should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works, why it sometimes doesn't, or whether it can be extended.
The idea was to extract the names of the main characters in a story by:
1. Building a dictionary with an entry for each word
2. Filling, for each word, a list with the words that appear right next to it in the text
3. Finding, for each word, the word with the maximum correlation of lists (meaning that the two words are used similarly in the text)
4. Given the name of one character in the story, the words that are used like it should be names as well (bogus, this is what should not work, but since I had never dealt with NLP until this morning I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for upper-case words (and receives "Alice" as the word to cluster around), there are originally ~500 upper-case words, and it's still pretty spot on as far as main characters go.
It does not work that well with other characters or in other stories, though it gives interesting results.
Any idea whether this idea is usable or extendable, or why it works at all in this story for "Alice"?
Thanks!
#English Name recognition
import re
import sys
import random
from string import upper
def mimic_dict(filename):
    dict = {}
    f = open(filename)
    text = f.read()
    f.close()
    prev = ""
    words = text.split()
    for word in words:
        m = re.search("\w+",word)
        if m == None:
            continue
        word = m.group()
        if not prev in dict:
            dict[prev] = [word]
        else :
            dict[prev] = dict[prev] + [word]
        prev = word
    return dict
def main():
    if len(sys.argv) != 2:
        print 'usage: ./main.py file-to-read'
        sys.exit(1)
    dict = mimic_dict(sys.argv[1])
    upper = []
    for e in dict.keys():
        if len(e) > 1 and e[0].isupper():
            upper.append(e)
    print len(upper),upper
    exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
    exclude = [ x for x in exclude if dict.has_key(x)]
    for s in exclude :
        del dict[s]
    scores = {}
    for key1 in dict.keys():
        max = 0
        for key2 in dict.keys():
            if key1 == key2 : continue
            a = dict[key1]
            k = dict[key2]
            diff = []
            for ia in a:
                if ia in k and ia not in diff:
                    diff.append( ia)
            if len(diff) > max:
                max = len(diff)
                scores[key1]=(key2,max)
    dictscores = {}
    names = []
    for e in scores.keys():
        if scores[e][0]=="Alice" and e[0].isupper():
            names.append(e)
    print len(names), names
if __name__ == '__main__':
    main()

From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example); detecting them even at the start of a sentence, where all words are capitalized; classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]

What you are doing is building a distributional thesaurus: finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but it means they are in some way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc.; the details of Alice in Wonderland escape me), they are more likely to be retrieved. It turns out that whether a word is capitalised or not is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers, Step 2 is building a context vector for the word from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran this against a hand-annotated corpus, you would find it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places or organisations... Nevertheless, it is a great first attempt at NLP, keep it up!
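To make the comparison with the Jaccard coefficient concrete, here is a small sketch. It builds, for each word, the set of words that directly follow it (a window of size 1 to the right, as in the question's code) and compares two such sets. The toy sentence is made up for illustration, and unlike the question's code, which counts the raw overlap, the Jaccard coefficient normalises the overlap by the size of the union.

```python
from collections import defaultdict


def next_word_sets(words):
    """Map each word to the set of words that directly follow it."""
    followers = defaultdict(set)
    for prev, nxt in zip(words, words[1:]):
        followers[prev].add(nxt)
    return followers


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy text, invented for this example.
words = "Alice said hello . Queen said off . Alice walked on".split()
ctx = next_word_sets(words)
print(jaccard(ctx["Alice"], ctx["Queen"]))
```

Here "Alice" is followed by {"said", "walked"} and "Queen" by {"said"}, so the two words share one of two distinct contexts.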

Python: How to format large text outputs to be 'prettier' and user defined

Ahoy StackOverflow-ers!
I have a rather trivial question, but it's something that I haven't been able to find in other questions here or in online tutorials: how might we format the output of a Python program so that it fits a certain aesthetic format without any extra modules?
I have a block of plain text, like that from a newspaper article, and I've filtered it earlier to extract just the words I want. Now I'd like to print it so that each line has at most 70 characters and no word is broken where it would otherwise fall across a line break.
Using .ljust(70) as in stdout.write(article.ljust(70)) doesn't seem to do anything to it.
As for not breaking words, instead of output like this:
Latest news tragic m
urder innocent victi
ms family quiet neig
hbourhood
it should look more like this:
Latest news tragic
murder innocent
victims family
quiet neighbourhood
Thank you all kindly in advance!

Check out the Python textwrap module (a standard module):
>>> import textwrap
>>> t="""Latest news tragic murder innocent victims family quiet neighbourhood"""
>>> print("\n".join(textwrap.wrap(t, width=20)))
Latest news tragic
murder innocent
victims family quiet
neighbourhood
>>>

Use the textwrap module:
http://docs.python.org/library/textwrap.html
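As a short illustration of the module, textwrap.fill wraps the text and returns a single string with the line breaks already inserted. The width of 20 below is only chosen so that the question's short sample wraps visibly; for the 70-character lines the question asks for, width=70 would be used instead.

```python
import textwrap

article = ("Latest news tragic murder innocent victims family quiet "
           "neighbourhood")
# fill() wraps on whitespace and never splits a word that fits on a line.
wrapped = textwrap.fill(article, width=20)
print(wrapped)
```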

I'm sure this can be improved on. Without any libraries:
def wrap_text(text, wrap_column=80):
    sentence = ''
    for word in text.split(' '):
        if not sentence:
            # First word on the line: no leading space
            sentence = word
        elif len(sentence) + 1 + len(word) <= wrap_column:
            # +1 accounts for the space that joins the word to the line
            sentence += ' ' + word
        else:
            print(sentence)
            sentence = word
    print(sentence)
EDIT: From the comment, if you want to use regular expressions to pick out just the words, use this:
import re
def wrap_text(text, wrap_column=80):
    sentence = ''
    for word in re.findall(r'\w+', text):
        if not sentence:
            sentence = word
        elif len(sentence) + 1 + len(word) <= wrap_column:
            sentence += ' ' + word
        else:
            print(sentence)
            sentence = word
    print(sentence)
