I want to know if there is an elegant way to get the index of an Entity with respect to its Sentence. I know I can get the index of an Entity in a string using ent.start_char and ent.end_char, but those values are with respect to the entire string.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion. Apple just launched a new Credit Card.")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
I want the Entity Apple in both sentences to have start and end indexes 0 and 5 respectively. How can I do that?
You need to subtract the sentence's start character from the entity's start and end characters:
for ent in doc.ents:
    print(ent.text, ent.start_char - ent.sent.start_char, ent.end_char - ent.sent.start_char, ent.label_)
    #               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Output:
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
Apple 0 5 ORG
Credit Card 26 37 ORG
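If you would rather work with token offsets than character offsets, the same subtraction works with the token-level attributes; a small sketch along the same lines:

for ent in doc.ents:
    sent = ent.sent
    # ent.start / ent.end are token offsets into the doc;
    # subtracting sent.start makes them relative to the sentence
    print(ent.text, ent.start - sent.start, ent.end - sent.start, ent.label_)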
I have a table which has people's names in the text. I would like to de-identify that text by removing the people's names from every instance, while keeping the rest of the sentence.
Row Num   Current Sent                            Ideal Sent
1         Garry bought a cracker.                 bought a cracker.
2         He named the parrot Eric.               He named the parrot.
3         The ship was maned by Captain Jones.    The ship was maned by Captain.
How can I do that with spaCy? I know you have to identify the label as 'PERSON' and then apply it to each row, but I can't seem to get the intended result. This is what I have so far:
def pro_nn_finder(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

df.apply(pro_nn_finder)
One approach:
import pandas as pd
import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")

df = pd.DataFrame(data=["Garry bought a cracker.",
                        "He named the parrot Eric.",
                        "The ship was maned by Captain Jones."],
                  columns=["Current Sent"])

def remove_person(txt):
    doc = nlp(txt)
    # Character ranges covered by PERSON entities
    chunks = [range(entity.start_char, entity.end_char) for entity in doc.ents if entity.label_ == 'PERSON']
    to_remove = set().union(*chunks)
    # Keep every character whose index falls outside all PERSON spans
    return "".join(c for i, c in enumerate(txt) if i not in to_remove)

df["Ideal Sent"] = df["Current Sent"].apply(remove_person)
print(df)
Output:

                            Current Sent                       Ideal Sent
0                Garry bought a cracker.                bought a cracker.
1               He named the parrot Eric.            He named the parrot .
2    The ship was maned by Captain Jones.  The ship was maned by Captain .
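Note the stray space left before the final period, which comes from deleting only the entity's characters. If that matters, one option (a sketch, adding a regex cleanup on top of the function above) is:

import re

def remove_person_clean(txt):
    # Same character-dropping logic as above, then tidy the leftover whitespace
    kept = remove_person(txt)
    kept = re.sub(r"\s+", " ", kept)            # collapse doubled spaces
    kept = re.sub(r"\s+([.,!?])", r"\1", kept)  # drop the space left before punctuation
    return kept.strip()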
Here's how I would do it. The thing to note here is that nlp can take a long time, so I'd do it once, store the resulting doc objects in a new column, and then proceed with the filtering. Since you are interested in the whole document and not just the entities, it's better to use the Token.ent_type_ attribute than going the doc.ents route.
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_md")

df = pd.DataFrame({"sent": ["Garry bought a cracker.",
                            "He named the parrot Eric.",
                            "The ship was maned by Captain Jones."]})

df["proc_sent"] = df.sent.apply(nlp)  # expensive step
df["ideal_sent"] = df.proc_sent.apply(
    lambda doc: ' '.join(tok.text for tok in doc if tok.ent_type_ != "PERSON"))
Alternatively, you can explode the doc column so you end up with one token per cell. That allows for more panda-esque data processing.
df2 = df.explode("proc_sent")
Now df2.proc_sent looks like this:
0 Garry
0 bought
0 a
0 cracker
0 .
1 He
1 named
1 the
1 parrot
1 Eric
1 .
So you can filter out PERSON entities via
>>> df2[df2.proc_sent.apply(lambda tok: tok.ent_type_) != "PERSON"]
sent proc_sent
0 Garry bought a cracker. bought
0 Garry bought a cracker. a
0 Garry bought a cracker. cracker
0 Garry bought a cracker. .
1 He named the parrot Eric. He
1 He named the parrot Eric. named
1 He named the parrot Eric. the
1 He named the parrot Eric. parrot
1 He named the parrot Eric. .
...
Of course, that only makes sense if you need to do more complex processing; to get the sentence strings back you need a groupby etc., which makes it more complicated overall for this application.
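For reference, a rough sketch of that groupby step, stitching the surviving tokens of df2 back into one string per original row (the new column name is just illustrative):

filtered = df2[df2.proc_sent.apply(lambda tok: tok.ent_type_) != "PERSON"]
# The exploded frame keeps the original row index, so group on it
# and join the remaining token texts back into sentence strings
df["ideal_sent_v2"] = filtered.groupby(level=0)["proc_sent"].apply(
    lambda toks: " ".join(tok.text for tok in toks))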
I am struggling with a very strange thing in spaCy.
I want to determine all entities except some of them. So I did this:
for ent in x3.ents:
    # print(str(spacy.explain(ent.label_)))
    if ent.label_ not in ['ORG', 'PERSON']:
        if ent.text not in {'technician', 'service', ' hcc'}:
            print(ent.text)
but technician is still printed.
My doc has many rows, for example:
agricultural English
balancer
front office director
clinical laboratory technician ii
For these 4 rows my ent.text is:
English
technician
The issue is not with spaCy, but with the way you are trying to filter the sentence. You will need to compare each word in ent.text with the set of words you want to discard ({'technician', 'service', 'hcc'}). For example:
# This could be your ent.text
s = "my sentence contains technician"

new_s = []
for w in s.split(" "):
    if w not in {'technician', 'service', 'hcc'}:
        new_s.append(w)

# Here you would replace the original ent.text
print(" ".join(new_s))
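Folded back into your original loop, that per-word check could look something like this (a sketch, keeping your label filter unchanged):

discard = {'technician', 'service', 'hcc'}

for ent in x3.ents:
    if ent.label_ not in ['ORG', 'PERSON']:
        # Keep only the words of ent.text that are not in the discard set
        kept = [w for w in ent.text.split(" ") if w not in discard]
        if kept:
            print(" ".join(kept))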
I currently have a dataframe with a column of already tokenized words and other columns with tags:
   token    tag
1  I        PRN
2  like     VBD
3  apples   NNP
4  .        .
5  John     PRN
6  likes    VBD
7  pears    NNP
8  .        .
I would like to add sentence numbering within the df, by adding an extra column:
   token    tag   sentence #
1  I        PRN   sentence 1
2  like     VBD   sentence 1
3  apples   NNP   sentence 1
4  .        .     sentence 1
5  John     PRN   sentence 2
6  likes    VBD   sentence 2
7  pears    NNP   sentence 2
8  .        .     sentence 2
I am working with a human-annotated dataset that has been pre-tokenized. I already tried de-tokenizing it, adding the sentence count, and then re-tokenizing it, which unfortunately gave me an entirely different token count. That method would result in the tag column not aligning with the token column.
Thank you very much!
Good morning,
If what you would like to do is add the sentence that contains the word alongside the token, I would suggest adding a primary key reference to the sentence that you are parsing from. I would love to help more, but unless I know the method you are using to get the tokens and tags I cannot assist any further. I have given a step-by-step approach below, with a rough sketch after the list. Are you using a self-built method/module? Are you using a package/module from scikit-learn to tokenize strings? Have a lovely day!
My approach:
1. Take the dataset.
2. Clean the dataset.
3. Assign a reference/GUID key to each phrase that is being tokenized.
4. Run the tokenizing method.
5. Join the two data sets to create the model view that you want.
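A minimal sketch of steps 3-5, assuming the raw phrases are still available; the column names and the whitespace split below are placeholders for whatever tokenizer you actually use:

import pandas as pd

# Hypothetical input: the original (untokenized) phrases, keyed up front
phrases = pd.DataFrame({"sent_key": [1, 2],
                        "text": ["I like apples .", "John likes pears ."]})

# Step 4: run the tokenizing method (whitespace split stands in for the real one)
tokens = (phrases.assign(token=phrases["text"].str.split())
                 .explode("token")[["sent_key", "token"]]
                 .reset_index(drop=True))

# Step 5: the key travels with every token, so tags or other per-token
# columns can be joined back on sent_key (plus token position)
tokens["sentence #"] = "sentence " + tokens["sent_key"].astype(str)
print(tokens)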
I know that spaCy provides the start and end of each entity in a sentence. I want the start of the entity in the whole document (not just the sentence).
You may get the entity start position in the whole document using ent.start_char:
for ent in doc.ents:
    print(ent.text, ent.start_char)
A quick test:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"The live in New York City. They went to Manhattan in the morning.")
for ent in doc.ents:
    print(ent.text, ent.start_char)
Output:
New York City 12
Manhattan 40
Here I have a pandas.Series named 'traindata'.
0 Published: 4:53AM Friday August 29, 2014 Sourc...
1 8 Have your say\n\n\nPlaying low-level club c...
2 Rohit Shetty has now turned producer. But the ...
3 A TV reporter in Serbia almost lost her job be...
4 THE HAGUE -- Tony de Brum was 9 years old in 1...
5 Australian TV cameraman Harry Burton was kille...
6 President Barack Obama sharply rebuked protest...
7 The car displaying the DIE FOR SYRIA! sticker....
8 \nIf you've ever been, you know that seeing th...
9 \nThe former executive director of JBWere has ...
10 Waterloo Road actor Joe Slater has revealed hi...
...
Name: traindata, Length: 2284, dtype: object
What I want to do is replace the series values with the stemmed sentences.
My thought is to build a new series and put the stemmed sentences in it.
My code is as below:
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

stem_word_data = np.zeros([2284, 1])
ps = PorterStemmer()
for i in range(0, len(traindata)):
    tst = word_tokenize(traindata[i])
    for word in tst:
        word = ps.stem(word)
        stem_word_data[i] = word
and then an error occurs:
ValueError: could not convert string to float: 'publish'
Does anyone know how to fix this error, or have a better idea of how to replace the series values with the stemmed sentences? Thanks.
You can use apply on a series and avoid writing loops.
import pandas as pd
from nltk import word_tokenize
from nltk.stem import PorterStemmer

# initialise stemmer class
pst = PorterStemmer()

# sample data frame
df = pd.DataFrame({'senten': ['I am not dancing', 'You are playing']})

# tokenize, then stem each token and rejoin
df['senten'] = df['senten'].apply(word_tokenize)
df['senten'] = df['senten'].apply(lambda x: ' '.join([pst.stem(y) for y in x]))
print(df)
senten
0 I am not danc
1 you are play
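Applied to the traindata series from the question, the same pattern would look roughly like this (an untested sketch, reusing pst and word_tokenize from above):

traindata = traindata.apply(word_tokenize)
traindata = traindata.apply(lambda toks: ' '.join(pst.stem(t) for t in toks))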