Named Entity Extraction - python

I am trying to extract list of persons using Stanford Named Entity Recognizer (NER) in Python NLTK. Code and obtained output is like this
Code
from nltk.tag import StanfordNERTagger
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
sent = 'joel thompson tracy k smith new work world premierenew york philharmonic commission'
strin = sent.title()
value = st.tag(strin.split())
def get_continuous_chunks(tagged_sent):
continuous_chunk = []
current_chunk = []
for token, tag in tagged_sent:
if tag != "O":
current_chunk.append((token, tag))
else:
if current_chunk: # if the current chunk is not empty
continuous_chunk.append(current_chunk)
current_chunk = []
# Flush the final current_chunk into the continuous_chunk, if any.
if current_chunk:
continuous_chunk.append(current_chunk)
return continuous_chunk
named_entities = get_continuous_chunks(value)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
print(named_entities_str)
Obtained Output
[('Joel Thompson Tracy K Smith New Work World Premierenew York Philharmonic Commission',
'PERSON')]
Desired Output
Person 1: Joel Thompson
Person 2: Tracy K Smith
Data : New Work World Premierenew York Philharmonic Commission

Related

Python Regex Capturing Multiple Matches in separate observations

I am trying to create variables location; contract items; contract code; federal aid using regex on the following text:
PAGE 1
BID OPENING DATE 07/25/18 FROM 0.2 MILES WEST OF ICE HOUSE 07/26/18 CONTRACT NUMBER 03-2F1304 ROAD TO 0.015 MILES WEST OF CONTRACT CODE 'A '
LOCATION 03-ED-50-39.5/48.7 DIVISION HIGHWAY ROAD 44 CONTRACT ITEMS
INSTALL SANDTRAPS AND PULLOUTS FEDERAL AID ACNH-P050-(146)E
PAGE 1
BID OPENING DATE 07/25/18 IN EL DORADO COUNTY AT VARIOUS 07/26/18 CONTRACT NUMBER 03-2H6804 LOCATIONS ALONG ROUTES 49 AND 193 CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR 13 CONTRACT ITEMS
TREE REMOVAL FEDERAL AID NONE
PAGE 1
BID OPENING DATE 07/25/18 IN LOS ANGELES, INGLEWOOD AND 07/26/18 CONTRACT NUMBER 07-296304 CULVER CITY, FROM I-105 TO PORT CONTRACT CODE 'B '
LOCATION 07-LA-405-R21.5/26.3 ROAD UNDERCROSSING 55 CONTRACT ITEMS
ROADWAY SAFETY IMPROVEMENT FEDERAL AID ACIM-405-3(056)E
This text is from one word file; I'll be looping my code on multiple doc files. In the text above are three location; contract items; contract code; federal aid pairs. But when I use regex to create variables, only the first instance of each pair is included.
The code I have right now is:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
all_bod = []
all_cn = []
all_location = []
all_fedaid = []
all_contractcode = []
all_contractitems = []
all_file = []
text = ' PAGE 1
BID OPENING DATE 07/25/18 FROM 0.2 MILES WEST OF ICE HOUSE 07/26/18 CONTRACT NUMBER 03-2F1304 ROAD TO 0.015 MILES WEST OF CONTRACT CODE 'A '
LOCATION 03-ED-50-39.5/48.7 DIVISION HIGHWAY ROAD 44 CONTRACT ITEMS
INSTALL SANDTRAPS AND PULLOUTS FEDERAL AID ACNH-P050-(146)E
PAGE 1
BID OPENING DATE 07/25/18 IN EL DORADO COUNTY AT VARIOUS 07/26/18 CONTRACT NUMBER 03-2H6804 LOCATIONS ALONG ROUTES 49 AND 193 CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR 13 CONTRACT ITEMS
TREE REMOVAL FEDERAL AID NONE
PAGE 1
BID OPENING DATE 07/25/18 IN LOS ANGELES, INGLEWOOD AND 07/26/18 CONTRACT NUMBER 07-296304 CULVER CITY, FROM I-105 TO PORT CONTRACT CODE 'B '
LOCATION 07-LA-405-R21.5/26.3 ROAD UNDERCROSSING 55 CONTRACT ITEMS
ROADWAY SAFETY IMPROVEMENT FEDERAL AID ACIM-405-3(056)E'
bod1 = re.search('BID OPENING DATE \s+ (\d+\/\d+\/\d+)', text)
bod2 = re.search('BID OPENING DATE\n\n(\d+\/\d+\/\d+)', text)
if not(bod1 is None):
bod = bod1.group(1)
elif not(bod2 is None):
bod = bod2.group(1)
else:
bod = 'NA'
all_bod.append(bod)
# creating contract number
cn1 = re.search('CONTRACT NUMBER\n+(.*)', text)
cn2 = re.search('CONTRACT NUMBER\s+(.........)', text)
if not(cn1 is None):
cn = cn1.group(1)
elif not(cn2 is None):
cn = cn2.group(1)
else:
cn = 'NA'
all_cn.append(cn)
# location
location1 = re.search('LOCATION \s+\S+', text)
location2 = re.search('LOCATION \n+\S+', text)
if not(location1 is None):
location = location1.group(0)
elif not(location2 is None):
location = location2.group(0)
else:
location = 'NA'
all_location.append(location)
# federal aid
fedaid = re.search('FEDERAL AID\s+\S+', text)
fedaid = fedaid.group(0)
all_fedaid.append(fedaid)
# contract code
contractcode = re.search('CONTRACT CODE\s+\S+', text)
contractcode = contractcode.group(0)
all_contractcode.append(contractcode)
# contract items
contractitems = re.search('\d+ CONTRACT ITEMS', text)
contractitems = contractitems.group(0)
all_contractitems.append(contractitems)
This code parses the only first instance of these variables in the text.
contract-number
location
contract-items
contract-code
federal-aid
03-2F1304
03-ED-50-39.5/48.7
44
A
ACNH-P050-(146)E
But, I am trying to figure out a way to get all possible instances in different observations.
contract-number
location
contract-items
contract-code
federal-aid
03-2F1304
03-ED-50-39.5/48.7
44
A
ACNH-P050-(146)E
03-2H6804
03-ED-0999-VAR
13
C
NONE
07-296304
07-LA-405-R21.5/26.3
55
B
ACIM-405-3(056)E
The all_variables in the code are for looping over multiple word files - we can ignore that if we want :).
Any leads would be super helpful. Thanks so much!
import re
data = []
df = pd.DataFrame()
regex_contract_number =r"(?:CONTRACT NUMBER\s+(?P<contract_number>\S+?)\s)"
regex_location = r"(?:LOCATION\s+(?P<location>\S+))"
regex_contract_items = r"(?:(?P<contract_items>\d+)\sCONTRACT ITEMS)"
regex_federal_aid =r"(?:FEDERAL AID\s+(?P<federal_aid>\S+?)\s)"
regex_contract_code =r"(?:CONTRACT CODE\s+\'(?P<contract_code>\S+?)\s)"
regexes = [regex_contract_number,regex_location,regex_contract_items,regex_federal_aid,regex_contract_code]
for regex in regexes:
for match in re.finditer(regex, text):
data.append(match.groupdict())
df = pd.concat([df, pd.DataFrame(data)], axis=1)
data = []
df

spacy remove only org and person names

I have written the below function which removes all named entities from text. How could I modify it to remove only org and person names? I don't want to remove 6 from $6 from below. Thanks
import spacy
sp = spacy.load('en_core_web_sm')
def NER_removal(text):
document = sp(text)
text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
if item.text in ents:
pass
else:
text_no_namedentities.append(item.text)
return (" ".join(text_no_namedentities))
NER_removal("John loves to play at Sofi stadium at 6.00 PM and he earns $6")
'loves to play at stadium at 6.00 PM and he earns $'
I think item.ent_type_ will be useful here.
import spacy
sp = spacy.load('en_core_web_sm')
def NER_removal(text):
document = sp(text)
text_no_namedentities = []
# define ent types not to remove
ent_types_to_stay = ["MONEY"]
ents = [e.text for e in document.ents]
for item in document:
# add condition to leave defined ent types
if all((item.text in ents, item.ent_type_ not in ent_types_to_stay)):
pass
else:
text_no_namedentities.append(item.text)
return (" ".join(text_no_namedentities))
print(NER_removal("John loves to play at Sofi stadium at 6.00 PM and he earns $6"))
# loves to play at Sofi stadium at 6.00 PM and he earns $ 6

Search a series for a word. Return that word and N others in a new column?

Okay, I need help. I created a function to search a string for a specific word. If the function finds the search_word it will return the word the and N words that precede it. The function works fine with my test strings but I cannot figure out how to apply the function to an entire series?
My goal is to create a new column in the data frame that contains the n_words_prior whenever the search_word exists.
n_words_prior = []
test = "New School District, Dale County"
def n_before_string(string, search_word, N):
global n_words_prior
n_words_prior = []
found_word = string.find(search_word)
if found_word == -1: return ""
sentence= string[0:found_word]
n_words_prior = sentence.split()[N:]
n_words_prior.append(search_word)
return n_words_prior
The current dataframe looks like this:
data = [['Alabama', 'New School District, Dale County'],
['Alaska', 'Matanuska-Susitna Borough'],
['Arizona', 'Pima County - Tuscon Unified School District']]
df = pd.DataFrame(data, columns = ['State', 'Place'])
The improved function would take the inputs 'Place','County',-1 and create the following result.
improved_function(column, search_word, N)
new_data = [['Alabama', 'New School District, Dale County','Dale County'],
['Alaska', 'Matanuska-Susitna Borough', ''],
['Arizona', 'Pima County - Tuscon Unified School District','Pima County']]
new_df = pd.DataFrame(new_data, columns = ['State', 'Place','Result'])
I thought embedding this function would help, but it has only made things more confusing.
def fast_add(place, search_word):
df[search_word] = df[Place].str.contains(search_word).apply(lambda search_word: 1 if search_word == True else 0)
def fun(sentence, search_word, n):
"""Return search_word and n preceding words from sentence."""
words = sentence.split()
for i,word in enumerate(words):
if word == search_word:
return ' '.join(words[i-n:i+1])
return ''
Example:
df['Result'] = df.Place.apply(lambda x: fun(x, 'County', 1))
Result:
State Place Result
0 Alabama New School District, Dale County Dale County
1 Alaska Matanuska-Susitna Borough
2 Arizona Pima County - Tuscon Unified School District Pima County

posting more than one line of text on fb status [Python] [Facebook API]

I started this project in the school holidays as a way to keep practising and enhancing my python knowledge. To put it shortly the code is a facebook bot that randomly generates NBA teams, Players and positions which should look like this when run.
Houston Rockets
[PG] Ryan Anderson
[PF] Michael Curry
[SF] Marcus Morris
[C] Bob Royer
[SF] Brian Heaney
I'm currently having trouble when it comes to posting my code to my facebook page where instead of posting 1 team and 5 players/positions the programme will only post 1 single player like this
Ryan Anderson
Here is my code
import os
import random
import facebook
token = "...."
fb = facebook.GraphAPI(access_token = token)
parent_dir = "../NBAbot"
os.chdir(parent_dir)
file_name = "nba_players.txt"
def random_position():
"""Random Team from list"""
position = ['[PG]','[SG]','[SF]','[PF]','[C]',]
random.shuffle(position)
position = position.pop()
return(position)
def random_team():
"""Random Team from list"""
Team = ['Los Angeles Lakers','Golden State Warriors','Toronto Raptors','Boston Celtics','Cleveland Cavaliers','Houston Rockets','San Antonio Spurs','New York Knicks','Chicago Bulls','Minnesota Timberwolves','Philadelphia 76ers','Miami Heat','Milwaukee','Portland Trail Blazers','Dallas Mavericks','Phoenix Suns','Denver Nuggets','Utah Jazz','Indiana Pacers','Los Angeles Clippers','Washington Wizards','Brooklyn Nets','New Orleans Pelicans','Sacramento Kings','Atlanta Hawks','Detroit Pistons','Memphis Grizzlies','Charlotte Hornets','Orlando Magic']
random.shuffle(Team)
Team = Team.pop()
return(Team)
def random_player(datafile):
read_mode = "r"
with open (datafile, read_mode) as read_file:
the_line = read_file.readlines()
return(random.choice(the_line))
def main():
return(
random_team(),
random_position(),
random_player(file_name),
random_position(),
random_player(file_name),
random_position(),
random_player(file_name),
random_position(),
random_player(file_name),
random_position(),
random_player(file_name))
fb.put_object(parent_object="me", connection_name='feed', message=main())
any help is appreciated.

html scraping using python topboxoffice list from imdb website

URL: http://www.imdb.com/chart/?ref_=nv_ch_cht_2
I want you to print top box office list from above site (all the movies' rank, title, weekend, gross and weeks movies in the order)
Example output:
Rank:1
title: godzilla
weekend:$93.2M
Gross:$93.2M
Weeks: 1
Rank: 2
title: Neighbours
This is just a simple way to extract those entities by BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
rows = page.findAll("tr", {'class': ['odd', 'even']})
for tr in rows:
for data in tr.findAll("td", {'class': ['titleColumn', 'weeksColumn','ratingColumn']}):
print data.get_text()
P.S.-Arrange according to your will.
There is no need to scrape anything. See the answer I gave here.
How to scrape data from imdb business page?
The below Python script will give you, 1) List of Top Box Office movies from IMDb 2) And also the List of Cast for each of them.
from lxml.html import parse
def imdb_bo(no_of_movies=5):
bo_url = 'http://www.imdb.com/chart/'
bo_page = parse(bo_url).getroot()
bo_table = bo_page.cssselect('table.chart')
bo_total = len(bo_table[0][2])
if no_of_movies <= bo_total:
count = no_of_movies
else:
count = bo_total
movies = {}
for i in range(0, count):
mo = {}
mo['url'] = 'http://www.imdb.com'+bo_page.cssselect('td.titleColumn')[i][0].get('href')
mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()
m_page = parse(mo['url']).getroot()
m_casttable = m_page.cssselect('table.cast_list')
flag = 0
mo['cast'] = []
for cast in m_casttable[0]:
if flag == 0:
flag = 1
else:
m_starname = cast[1][0][0].text_content().strip()
mo['cast'].append(m_starname)
movies[i] = mo
return movies
if __name__ == '__main__':
no_of_movies = raw_input("Enter no. of Box office movies to display:")
bo_movies = imdb_bo(int(no_of_movies))
for k,v in bo_movies.iteritems():
print '#'+str(k+1)+' '+v['title']+' ('+v['year']+')'
print 'URL: '+v['url']
print 'Weekend: '+v['weekend']
print 'Gross: '+v['gross']
print 'Weeks: '+v['weeks']
print 'Cast: '+', '.join(v['cast'])
print '\n'
Output (run in terminal):
parag#parag-innovate:~/python$ python imdb_bo_scraper.py
Enter no. of Box office movies to display:3
#1 Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan SkarsgÄrd, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden
#2 Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski
#3 Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long

Categories

Resources