Quick question (hopefully): is it possible to get the named entities of the tokens except for the ones with the CARDINAL label (the label ID is 397)? Here is my code:
import spacy

spacy_model = spacy.load('en_core_web_lg')  # note: model names use underscores, not hyphens
f = open('temp.txt')
tokens = spacy_model(f.read())
named_entities = tokens.ents  # except the ones where the label is CARDINAL (397)
Is this possible? Any help would be greatly appreciated.
You can filter out the entities using a list comprehension:
named_entities = [t for t in tokens.ents if t.label_ != 'CARDINAL']
Here is a test:
import spacy
nlp = spacy.load("en_core_web_sm")
tokens = nlp('The basket costs $10. I bought 6.')
print([(ent.text, ent.label_) for ent in tokens.ents])
# => [('10', 'MONEY'), ('6', 'CARDINAL')]
print([t for t in tokens.ents if t.label_ != 'CARDINAL'])
# => [10]
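If you prefer to compare against the integer label ID mentioned in the question rather than the string, you can resolve it from the vocabulary's string store (the exact number such as 397 varies between models, so looking it up at runtime is safer):
cardinal_id = tokens.vocab.strings['CARDINAL']   # integer ID of the CARDINAL label in this model
named_entities = [t for t in tokens.ents if t.label != cardinal_id]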
I have written code that searches a text file for multiple terms, which in my case are 'Capex' and its variants:
import spacy
import pandas as pd
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
pm = PhraseMatcher(nlp.vocab)
tipe = PhraseMatcher(nlp.vocab)
doc = nlp(text)   # text holds the contents of the file being searched
sents = [sent for sent in doc.sents]
phrases = ['capex', 'capacity expansion', 'Capacity expansion', 'CAPEX', 'Capacity Expansion', 'Capex']
patterns = [nlp(text) for text in phrases]
pm.add('CAPEX', None, *patterns)
matches = pm(doc)
After I find where these terms occur in the text file, I get the sentence in which each term was used, and then search that sentence further for the date, value and type of the CAPEX.
The issue I am facing is that there can be multiple instances of the type of CAPEX ("Greenfield", etc.) within a sentence, although my code only runs as many times as there are matches of the word 'CAPEX'. Any solution to align all of these into one DataFrame? (A sketch of one possible approach follows the code below.)
def findmatch(doc, phrases, name):
    p = phrases
    pa = [nlp(text) for text in p]
    name = PhraseMatcher(nlp.vocab)
    name.add('Type', None, *pa)
    results = name(doc)
    return results

def getext(matches):
    # returns the text of the first match only
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        text = span.text
        return text
allcapex = pd.DataFrame(columns=['Type', 'Value', 'Date', 'business segment', 'Location', 'source'])

for ind, match in enumerate(matches):
    for sent in sents:
        if matches[ind][1] < sent.end:
            typematches = findmatch(sent, ['Greenfield', 'greenfield', 'brownfield', 'Brownfield', 'de-bottlenecking', 'De-bottlenecking'], 'Type')
            valuematches = findmatch(sent, ['Crore', 'Cr', 'crore', 'cr'], 'Value')
            datematches = findmatch(sent, ['2020', '2021', '2022', '2023', '2024', '2025', 'FY21', 'FY22', 'FY23', 'FY24', 'FY25', 'FY26'], 'Date')
            capextype = getext(typematches)
            capexvalue = getext(valuematches)
            capexdate = getext(datematches)
            allcapex.loc[len(allcapex.index)] = [capextype, capexvalue, capexdate, '', '', sent]
            break

print(allcapex)
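One possible direction (a sketch only, not the asker's code, assuming the same findmatch helper defined above): have the extraction step return every match text instead of stopping at the first one, then join the hits for each sentence into a single DataFrame cell.
def getext_all(sent_doc, matches):
    # Collect the text of every match instead of returning after the first hit.
    return [sent_doc[start:end].text for match_id, start, end in matches]

for ind, match in enumerate(matches):
    for sent in sents:
        if matches[ind][1] < sent.end:
            sent_doc = sent.as_doc()   # match inside the sentence, with sentence-relative indices
            typematches = findmatch(sent_doc, ['Greenfield', 'greenfield', 'brownfield', 'Brownfield', 'de-bottlenecking', 'De-bottlenecking'], 'Type')
            capextype = ', '.join(getext_all(sent_doc, typematches))   # keeps every Type hit in one cell
            # ...repeat for Value and Date, then append the row to allcapex as before...
            break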
I'm trying to write Python code that does aspect-based sentiment analysis of product reviews using a dependency parser. I created an example review:
"The Sound Quality is great but the battery life is bad."
The output is: [['soundquality', ['great']], ['batterylife', ['bad']]]
I can properly get the aspect and its adjective with this sentence, but when I change the text to:
"The Sound Quality is not great but the battery life is not bad."
the output still stays the same. How can I add negation handling to my code? And are there ways to improve what I currently have?
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
import stanfordnlp
stanfordnlp.download('en')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
txt = "The Sound Quality is not great but the battery life is not bad."
txt = txt.lower()
sentList = nltk.sent_tokenize(txt)
taggedList = []
for line in sentList:
    txt_list = nltk.word_tokenize(line)                # tokenize sentence
    taggedList = taggedList + nltk.pos_tag(txt_list)   # perform POS tagging
print(taggedList)
newwordList = []
flag = 0
for i in range(0, len(taggedList) - 1):
    if taggedList[i][1] == 'NN' and taggedList[i+1][1] == 'NN':   # merge consecutive nouns into one token
        newwordList.append(taggedList[i][0] + taggedList[i+1][0])
        flag = 1
    else:
        if flag == 1:
            flag = 0
            continue
        newwordList.append(taggedList[i][0])
        if i == len(taggedList) - 2:
            newwordList.append(taggedList[i+1][0])
finaltxt = ' '.join(word for word in newwordList)
print(finaltxt)
stop_words = set(stopwords.words('english'))
new_txt_list = nltk.word_tokenize(finaltxt)
wordsList = [w for w in new_txt_list if not w in stop_words]
taggedList = nltk.pos_tag(wordsList)
nlp = stanfordnlp.Pipeline()
doc = nlp(finaltxt)
dep_node = []
for dep_edge in doc.sentences[0].dependencies:
    dep_node.append([dep_edge[2].text, dep_edge[0].index, dep_edge[1]])

for i in range(0, len(dep_node)):
    if int(dep_node[i][1]) != 0:
        dep_node[i][1] = newwordList[int(dep_node[i][1]) - 1]
print(dep_node)
featureList = []
categories = []
totalfeatureList = []
for i in taggedList:
    if i[1] == 'JJ' or i[1] == 'NN' or i[1] == 'JJR' or i[1] == 'NNS' or i[1] == 'RB':
        featureList.append(list(i))
        totalfeatureList.append(list(i))   # stores all the features for every sentence
        categories.append(i[0])
print(featureList)
print(categories)
fcluster = []
for i in featureList:
    filist = []
    for j in dep_node:
        if (j[0] == i[0] or j[1] == i[0]) and (j[2] in ["nsubj", "acl:relcl", "obj", "dobj", "agent", "advmod", "amod", "neg", "prep_of", "acomp", "xcomp", "compound"]):
            if j[0] == i[0]:
                filist.append(j[1])
            else:
                filist.append(j[0])
    fcluster.append([i[0], filist])
print(fcluster)
finalcluster = []
dic = {}
for i in featureList:
    dic[i[0]] = i[1]

for i in fcluster:
    if dic[i[0]] == 'NN':
        finalcluster.append(i)
print(finalcluster)
You may wish to try spaCy. The following pattern will catch:
a noun phrase
followed by is or are
optionally followed by not
followed by an adjective
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
output = []
doc = nlp('The product is very good')
matcher = Matcher(nlp.vocab)
matcher.add("mood",None,[{"LOWER":{"IN":["is","are"]}},{"LOWER":{"IN":["no","not"]},"OP":"?"},{"LOWER":"very","OP":"?"},{"POS":"ADJ"}])
for nc in doc.noun_chunks:
d = doc[nc.root.right_edge.i+1:nc.root.right_edge.i+1+3]
matches = matcher(d)
if matches:
_, start, end = matches[0]
output.append((nc.text, d[start+1:end].text))
print(output)
[('The product', 'very good')]
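Note that matcher.add("mood", None, pattern) is the spaCy v2 call signature; on spaCy v3 the patterns are passed as a list and the callback becomes a keyword argument, roughly:
# spaCy v3 form of the same call
matcher.add("mood", [[{"LOWER": {"IN": ["is", "are"]}},
                      {"LOWER": {"IN": ["no", "not"]}, "OP": "?"},
                      {"LOWER": "very", "OP": "?"},
                      {"POS": "ADJ"}]])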
Alternatively, you may broaden the matching pattern with information from the dependency parser, adding a definition of an adjectival phrase:
output = []
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("mood",None,[{"LOWER":{"IN":["is","are"]}},{"LOWER":{"IN":["no","not"]},"OP":"?"},{"DEP":"advmod","OP":"?"},{"DEP":"acomp"}])
for nc in doc.noun_chunks:
d = doc[nc.root.right_edge.i+1:nc.root.right_edge.i+1+3]
matches = matcher(d)
if matches:
_, start, end = matches[0]
output.append((nc.text, d[start+1:end].text))
print(output)
[('The product', 'very good')]
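To check negation handling against the review from the question, you can run the same loop on the negated sentence; since the optional "no"/"not" token is part of the pattern, the matched span should now include the negation (a sketch, output not verified here):
doc = nlp('The Sound Quality is not great but the battery life is not bad.')
output = []
for nc in doc.noun_chunks:
    d = doc[nc.root.right_edge.i + 1 : nc.root.right_edge.i + 1 + 3]
    matches = matcher(d)
    if matches:
        _, start, end = matches[0]
        output.append((nc.text, d[start + 1:end].text))
print(output)   # expected to include pairs like ('the battery life', 'not bad')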
I have many text documents that I want to compare to one another, removing all text that is exactly the same between them. The goal is to find boilerplate text that is consistent across documents so it can be stripped out before NLP.
The best way I have figured out to do this is to find the longest common substrings that exist, or are mostly present, in all the documents. However, doing this has been incredibly slow.
Here is an example of what I am trying to accomplish:
DocA:
Title: To Kill a Mocking Bird
Author: Harper Lee
Published: July 11, 1960
DocB:
Title: 1984
Author: George Orwell
Published: June 1949
DocC:
Title: The Great Gatsby
Author: F. Scott Fitzgerald
The output would show something like:
{
'Title': 3,
'Author': 3,
'Published': 2,
}
The results would then be used to strip out the commonalities between documents.
Here is some code I have tested in Python. It's incredibly slow with any significant number of permutations:
import itertools
from difflib import SequenceMatcher
import pandas as pd

# 'files' is the list of document strings being compared
file_perms = list(itertools.permutations(files, 2))
results = {}
for p in file_perms:
    doc_a = p[0]
    doc_b = p[1]
    while True:
        seq_match = SequenceMatcher(a=doc_a, b=doc_b)
        match = seq_match.find_longest_match(0, len(doc_a), 0, len(doc_b))
        if match.size >= 5:
            doc_a_start, doc_a_stop = match.a, match.a + match.size
            doc_b_start, doc_b_stop = match.b, match.b + match.size
            match_word = doc_a[doc_a_start:doc_a_stop]
            if match_word in results:
                results[match_word] += 1
            else:
                results[match_word] = 1
            # remove the matched block and look for the next longest match
            doc_a = doc_a[:doc_a_start] + doc_a[doc_a_stop:]
            doc_b = doc_b[:doc_b_start] + doc_b[doc_b_stop:]
        else:
            break

df = pd.DataFrame(
    {
        'Value': [x for x in results.keys()],
        'Count': [x for x in results.values()]
    }
)
print(df)
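One small observation on the pairing (a side note, not a fix for the overall quadratic cost): find_longest_match treats the two documents symmetrically, so iterating over unordered pairs with itertools.combinations does the same work once instead of twice:
import itertools

# Each unordered pair of documents is compared once instead of twice.
file_pairs = list(itertools.combinations(files, 2))   # n*(n-1)/2 pairs instead of n*(n-1)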
Create a set from each document,
build a counter of how many documents each word appears in,
then iterate over every document: when you find a word that appears in 70%-90% of the documents,
append it and the word after it as a tuple to a new counter,
and so on.
from collections import Counter

# 'docs' is the list of document strings
one_word = Counter()
for doc in docs:
    word_list = doc.split(" ")
    word_set = set(word_list)
    for word in word_set:
        one_word[word] += 1

two_word = Counter()
threshold = len(docs) * 0.7
for doc in docs:
    word_list = doc.split(" ")
    for i in range(len(word_list) - 1):
        if one_word[word_list[i]] > threshold:
            key = (word_list[i], word_list[i + 1])
            two_word[key] += 1
You can play with the threshold and continue as long as the counter is not empty.
Here the docs are the lyrics of the songs 'believer', 'by the river of Babylon', 'I could stay awake', and 'rattlin bog'.
from collections import Counter
import os
import glob

TR = 1  # threshold
dir = r"D:\docs"
path = os.path.join(dir, "*.txt")
files = glob.glob(path)

one_word = {}
all_docs = {}
for file in files:
    one_word[file] = set()
    all_docs[file] = []
    with open(file) as doc:
        for row in doc:
            for word in row.split():
                one_word[file].add(word)
                all_docs[file].append(word)
# now one_word is a dict where the key is the file name and the value is the set of words in it
# all_docs is a dict where the file name is the key and the value is the complete doc stored as a list, word by word
common_Frase = Counter()
for key in one_word:
    for word in one_word[key]:
        common_Frase[word] += 1
# common_Frase now contains a count of how many files each word appears in (every file can add a word once)
two_word = {}
for key in all_docs:
    two_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc) - 1):
        if common_Frase[doc[index]] > TR:
            val = (doc[index], doc[index + 1])
            two_word[key].add(val)

for key in two_word:
    for word in two_word[key]:
        common_Frase[word] += 1
# now common_Frase also contains a count of every two-word phrase
three_word = {}
for key in all_docs:
    three_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc) - 2):
        val2 = (doc[index], doc[index + 1])
        if common_Frase[val2] > TR:
            val3 = (doc[index], doc[index + 1], doc[index + 2])
            three_word[key].add(val3)

for key in three_word:
    for word in three_word[key]:
        common_Frase[word] += 1

for k in common_Frase:
    if common_Frase[k] > 1:
        print(k)
This is the output:
when like all Don't And one the my hear and feeling Then your of I'm in me The you away I never to be what a ever thing there from By down Now words that was ('all', 'the') ('And', 'the') ('the', 'words') ('By', 'the') ('and', 'the') ('in', 'the')
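Since the one-word, two-word and three-word blocks above follow the same pattern, the "continue as long as the counter is not empty" idea can also be written as a single loop over n-gram length (a sketch, reusing all_docs and TR from above; not tested on the lyrics files):
from collections import Counter

# Keep extending n-grams while at least one (n-1)-gram is still shared by enough files.
ngram_counts = Counter()
n = 1
while True:
    new_counts = Counter()
    for key in all_docs:
        doc = all_docs[key]
        seen = set()                       # each file counts an n-gram once
        for index in range(len(doc) - n + 1):
            gram = tuple(doc[index:index + n])
            if n == 1 or ngram_counts[gram[:-1]] > TR:
                seen.add(gram)
        for gram in seen:
            new_counts[gram] += 1
    if not new_counts:
        break
    ngram_counts.update(new_counts)
    n += 1

for gram, count in ngram_counts.items():
    if count > 1:
        print(gram, count)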
I'm using the Stanford Named Entity Recognizer with Python to find the proper names in the novel "One Hundred Years of Solitude". Many of them are composed of a first and a last name, e.g. "Aureliano Buendía" or "Santa Sofía de la Piedad", but these tokens are always separated, e.g. "Aureliano", "Buendía", because of the tokenizer I am using.
I would like to have them together as one token, so they can be tagged together as "PERSON" by the Stanford NER.
The code I wrote:
import nltk
from nltk.tag import StanfordNERTagger
from nltk import word_tokenize
from nltk import FreqDist
sentence1 = open('book1.txt').read()
sentence = sentence1.split()
path_to_model = "C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = "C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"
st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)
def findtags(tagged_text, tag_prefix):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in taggedSentence
                                   if tag.endswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(1000)) for tag in cfd.conditions())

print(findtags('_', 'PERSON'))
The result looks like this:
{'PERSON': [('Aureliano', 397), ('José', 294), ('Arcadio', 286), ('Buendía', 251), ...
Does anybody have a solution? I would be more than grateful.
import nltk
from nltk.tag import StanfordNERTagger
sentence1 = open('book1.txt').read()
sentence = sentence1.split()
path_to_model = "C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = "C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"
st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)
test = []
test_dict = {}
for element in range(len(taggedSentence)):
    a = ''
    if element < len(taggedSentence):
        while taggedSentence[element][1] == 'PERSON':
            a += taggedSentence[element][0] + ' '
            taggedSentence.pop(element)
    if len(a) > 1:
        test.append(a.strip())

test_dict[data.split('.')[0]] = tuple(test)
print(test_dict)
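As an alternative that avoids popping items from taggedSentence while indexing into it (a sketch, meant to replace the loop above rather than follow it), consecutive PERSON tokens can be grouped with itertools.groupby:
from itertools import groupby

# Join runs of consecutive tokens that are all tagged PERSON.
people = [' '.join(word for word, _ in group)
          for tag, group in groupby(taggedSentence, key=lambda tok: tok[1])
          if tag == 'PERSON']
print(people)   # e.g. ['Aureliano Buendía', ...]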
I am working with Python to take a Facebook status and tell both what the status is about and the sentiment. Essentially I need to tell what the sentiment refers to; I have already successfully coded a sentiment analyzer, so the trouble is getting a POS tagger to compute what the sentiment is referring to.
If you have any suggestions from experience I would be grateful. I've read some papers on computing aboutness from subject-object, NP-PP, and NP-NP relations, but I haven't seen any good examples and haven't found many papers.
Lastly, if you have worked with POS taggers, what would be my best bet in Python as a non-computer-scientist? I'm a physicist, so I can hack code together, but I don't want to reinvent the wheel if there is a package that has everything I'm going to need.
Thank you very much in advance!
This is what I found to work; I'm going to edit it, use it with the NLTK POS tagger, and see what results I can get.
import csv
import nltk
from nltk.corpus import brown
# http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/
# This is our fast Part of Speech tagger
#############################################################################
brown_train = brown.tagged_sents(categories='news')
regexp_tagger = nltk.RegexpTagger(
    [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
     (r'(-|:|;)$', ':'),
     (r'\'*$', 'MD'),
     (r'(The|the|A|a|An|an)$', 'AT'),
     (r'.*able$', 'JJ'),
     (r'^[A-Z].*$', 'NNP'),
     (r'.*ness$', 'NN'),
     (r'.*ly$', 'RB'),
     (r'.*s$', 'NNS'),
     (r'.*ing$', 'VBG'),
     (r'.*ed$', 'VBD'),
     (r'.*', 'NN')
    ])
unigram_tagger = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)
bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger)
#############################################################################
# This is our semi-CFG; Extend it according to your own needs
#############################################################################
cfg = {}
cfg["NNP+NNP"] = "NNP"
cfg["NN+NN"] = "NNI"
cfg["NNI+NN"] = "NNI"
cfg["JJ+JJ"] = "JJ"
cfg["JJ+NN"] = "NNI"
#############################################################################
class NPExtractor(object):

    def __init__(self, sentence):
        self.sentence = sentence

    # Split the sentence into single words/tokens
    def tokenize_sentence(self, sentence):
        tokens = nltk.word_tokenize(sentence)
        return tokens

    # Normalize brown corpus' tags ("NN", "NN-PL", "NNS" > "NN")
    def normalize_tags(self, tagged):
        n_tagged = []
        for t in tagged:
            if t[1] == "NP-TL" or t[1] == "NP":
                n_tagged.append((t[0], "NNP"))
                continue
            if t[1].endswith("-TL"):
                n_tagged.append((t[0], t[1][:-3]))
                continue
            if t[1].endswith("S"):
                n_tagged.append((t[0], t[1][:-1]))
                continue
            n_tagged.append((t[0], t[1]))
        return n_tagged

    # Extract the main topics from the sentence
    def extract(self):
        tokens = self.tokenize_sentence(self.sentence)
        tags = self.normalize_tags(bigram_tagger.tag(tokens))
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = "%s+%s" % (t1[1], t2[1])
                value = cfg.get(key, '')
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = "%s %s" % (t1[0], t2[0])
                    pos = value
                    tags.insert(x, (match, pos))
                    break
        matches = []
        for t in tags:
            if t[1] == "NNP" or t[1] == "NNI":
                # if t[1] == "NNP" or t[1] == "NNI" or t[1] == "NN":
                matches.append(t[0])
        return matches
# Main method, just run "python np_extractor.py"
Summary="""
Verizon has not honored this appointment or notified me of the delay in an appropriate manner. It is now 1:20 PM and the only way I found out of a change is that I called their chat line and got a message saying my appointment is for 2 PM. My cell phone message says the original time as stated here.
"""
def main(Topic):
    facebookData = []
    readdata = csv.reader(open('fb_data1.csv', 'r'))
    for row in readdata:
        facebookData.append(row)
    relevant_sentence = []
    for status in facebookData:
        summary = status.split('.')
        for sentence in summary:
            np_extractor = NPExtractor(sentence)
            result = np_extractor.extract()
            if Topic in result:
                relevant_sentence.append(sentence)
                print(sentence)
                print("This sentence is about: %s" % ", ".join(result))
    return relevant_sentence

if __name__ == '__main__':
    result = main('Verizon')
Note that it will save only sentences that are relevant to the topic you define. So if I am analyzing statuses about cheese, I could use that as the topic, extract all of the sentences about cheese, and then run sentiment analysis on those. Please let me know if you have comments or suggestions for improving this!
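For a quick standalone check of the extractor, independent of the CSV file, you can run it over the Summary string defined above (a small sketch; splitting on '.' is as naive as in main(), but enough for a smoke test):
# Hypothetical smoke test of NPExtractor using the Summary text defined earlier.
for sentence in Summary.split('.'):
    sentence = sentence.strip()
    if sentence:
        topics = NPExtractor(sentence).extract()
        print("This sentence is about: %s" % ", ".join(topics))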