I am trying to use a library called pyap to parse addresses from text in a dataframe column.
My dataframe df has data in the following format:
MID TEXT_BODY
1 I live at 4998 Stairstep Lane Toronto ON
2 Let us catch up at the Ruby Restaurant. Here is the address 1234 Food Court Dr, Atlanta, GA 30030
The package website gives the following sample:
import pyap
test_address = """
I live at 4998 Stairstep Lane Toronto ON
"""
addresses = pyap.parse(test_address, country='CA')
for address in addresses:
    # shows found address
    print(address)
The sample returns the addresses as a list, but I would like to keep the result in the dataframe as a new column.
The output I am expecting is a data frame like this:
MID ADDRESS TEXT_BODY
1 4998 Stairstep Lane Toronto ON I live at 4998 Stairstep Lane Toronto ON
2 1234 Food Court Dr, Atlanta, GA 30030 Let us catch up at the Ruby Restaurant. Here is the address 1234 Food Court Dr, Atlanta, GA 30030
I tried this:
df["ADDRESS"] = df['TEXT_BODY'].apply(lambda row: pyap.parse(row, country='US'))
But this does not work. I get an error:
TypeError: expected string or bytes-like object
How do I do this?
Apply is indeed the right direction.
def parse_address(addr):
    address = pyap.parse(addr, country="US")
    if not address:
        address = pyap.parse(addr, country="CA")
    # guard against rows with no parseable address, which would otherwise raise IndexError
    return address[0] if address else None

df["addr"] = df.TEXT_BODY.apply(parse_address)
The result is:
MID TEXT_BODY addr
0 1 I live at 4998 Stairstep Lane Toronto ON 4998 Stairstep Lane Toronto ON
1 2 Let us catch up at the Ruby Restaurant. Here i... 1234 Food Court Dr, Atlanta, GA 30030
Background:
Given the following pandas df -
Holding Account                                                                                                          Model Type  Entity ID  Direct Owner ID
WF LLC | 100 Jones Street 26th Floor San Francisco Ca Ltd Liability - Income Based Gross USA Only (486941515)            51364633    4564564    5646546
RF LLC | Neuberger | LLC | Aukai Services LLC-Neuberger Smid - Income Accuring Net of Fees Worldwide Fund (456456218)    46256325    1645365    4926654
The ask:
What is the most Pythonic way to enforce an 80-character limit on the Holding Account column (dtype = object) values?
Context: I am writing df to a .csv and then uploading it to a system with an 80-character limit. The values of the Holding Account column are unique, so I just want to sacrifice whatever characters take a string over 80 characters.
My attempt:
This is what I attempted - df['column'] = df['column'].str[:80]
Why not just use .str, like you were doing?
df['Holding Account'] = df['Holding Account'].str[:80]
Output:
>>> df
Holding Account Model Type Entity ID Direct Owner ID
0 WF LLC | 100 Jones Street 26th Floor San Francisco Ca Ltd Liability - Income Bas 51364633 4564564 5646546
1 RF LLC | Neuberger | LLC | Aukai Services LLC-Neuberger Smid - Income Accuring N 46256325 1645365 4926654
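One thing worth checking after slicing: truncation can silently create duplicates if two long values share the same first 80 characters, and the upload system presumably relies on uniqueness. A minimal sketch of the slice plus a sanity check, using the two sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Holding Account": [
        "WF LLC | 100 Jones Street 26th Floor San Francisco Ca Ltd Liability - Income Based Gross USA Only (486941515)",
        "RF LLC | Neuberger | LLC | Aukai Services LLC-Neuberger Smid - Income Accuring Net of Fees Worldwide Fund (456456218)",
    ]
})

# Truncate to the 80-character limit
df["Holding Account"] = df["Holding Account"].str[:80]

# Sanity checks: every value fits, and truncation kept the values unique
assert df["Holding Account"].str.len().max() <= 80
assert df["Holding Account"].is_unique
```

If the `is_unique` assertion ever fails, that is the signal to fall back to something like the factorize approach below instead of plain slicing.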
Slicing will lose information. I would suggest creating a mapping table from the factorized codes instead; this also saves storage space on the server or in the db:
codes, uniques = df['Holding Account'].factorize()
d = dict(enumerate(uniques))   # code -> original value
df['Holding Account'] = codes
If you would like to get the original values back, just do:
df['new'] = df['Holding Account'].map(d)
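A self-contained version of the factorize round trip (the column values here are invented for illustration). Note that the code-to-value mapping has to be built before the column is overwritten with the codes:

```python
import pandas as pd

df = pd.DataFrame({"Holding Account": ["Alpha Fund", "Beta Fund", "Alpha Fund"]})

# factorize returns (integer codes, unique values)
codes, uniques = df["Holding Account"].factorize()
mapping = dict(enumerate(uniques))   # code -> original string

df["Holding Account"] = codes                         # store the short codes
df["restored"] = df["Holding Account"].map(mapping)   # recover the originals

print(df)
```

The codes are small integers, so they always fit the 80-character limit, and the mapping table lets you translate them back after the round trip through the external system.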
I have a bit of code which pulls the latitude and longitude for a location. It is here:
address = 'New York University'
url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) +'?format=json'
response = requests.get(url).json()
print(response[0]["lat"])
print(response[0]["lon"])
I'm wanting to apply this as a function to a long column of "address".
I've seen loads of questions about 'apply' and 'map', but they're almost all simple math examples.
Here is what I tried last night:
def locate(address):
    response = requests.get(url).json()
    print(response[0]["lat"])
    print(response[0]["lon"])
    return

df['lat'] = df['lat'].map(locate)
df['lon'] = df['lon'].map(locate)
This ended up just applying the first row lat / lon to the entire csv.
What is the best method to turn the code into a custom function and apply it to each row?
Thanks in advance.
EDIT: Thank you @PacketLoss for your assistance. I'm getting an IndexError: list index out of range, but it does work on his sample dataframe.
Here is the read_csv I used to pull in the data:
df = pd.read_csv('C:\\Users\\CIHAnalyst1\\Desktop\\InstitutionLocations.csv', sep=',', error_bad_lines=False, index_col=False, dtype='unicode', encoding = "utf-8", warn_bad_lines=False)
Here is a text copy of the rows from the dataframe:
address
0 GRAND CANYON UNIVERSITY
1 SOUTHERN NEW HAMPSHIRE UNIVERSITY
2 WESTERN GOVERNORS UNIVERSITY
3 FLORIDA INTERNATIONAL UNIVERSITY - UNIVERSITY ...
4 PENN STATE UNIVERSITY UNIVERSITY PARK
... ...
4292 THE ART INSTITUTES INTERNATIONAL LLC
4293 INTERCOAST - ONLINE
4294 CAROLINAS COLLEGE OF HEALTH SCIENCES
4295 DYERSBURG STATE COMMUNITY COLLEGE COVINGTON
4296 ULTIMATE MEDICAL ACADEMY - NY
You need to return your values from your function, or nothing will happen.
We can use apply here and pass the address from the df as well.
import urllib.parse

import pandas as pd
import requests

data = {'address': ['New York University', 'Sydney Opera House', 'Paris', 'SupeRduperFakeAddress']}
df = pd.DataFrame(data)

def locate(row):
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(row['address']) + '?format=json'
    response = requests.get(url).json()
    if response:
        row['lat'] = response[0]['lat']
        row['lon'] = response[0]['lon']
    return row

df = df.apply(locate, axis=1)
Outputs
address lat lon
0 New York University 40.72925325 -73.99625393609625
1 Sydney Opera House -33.85719805 151.21512338473752
2 Paris 48.8566969 2.3514616
3 SupeRduperFakeAddress NaN NaN
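One caveat: this issues one HTTP request per row, so on a 4,000-row csv you will want caching and a delay between calls (Nominatim's usage policy limits request rates). The apply pattern itself can be exercised offline by swapping in a stub lookup; the dictionary below is invented for illustration, with coordinates copied from the sample output above:

```python
import pandas as pd

# Hypothetical stand-in for the Nominatim call, so the pattern runs offline
FAKE_GEOCODER = {"New York University": ("40.72925325", "-73.99625393609625")}

def locate(row):
    hit = FAKE_GEOCODER.get(row["address"])
    if hit:  # mirrors the `if response:` guard above
        row["lat"], row["lon"] = hit
    return row

df = pd.DataFrame({"address": ["New York University", "SupeRduperFakeAddress"]})
df = df.apply(locate, axis=1)
print(df)
```

Rows the stub cannot resolve simply come back without lat/lon and end up as NaN, exactly as in the real output above.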
I have a list of addresses that I would like to put into a dataframe where each row is a new address and the columns are the units of the address (title, street, city).
However, the way the list is structured, some addresses are longer than others. For example:
address = ['123 Some Street, City','45 Another Place, PO Box 123, City']
I have a pandas dataframe with the following columns:
Index Court Address Zipcode Phone
0 Court 1 123 Court Dr, Springfield 12345 11111
1 Court 2 45 Court Pl, PO Box 45, Pawnee 54321 11111
2 Court 3 1725 Slough Ave, Scranton 18503 11111
3 Court 4 101 Court Ter, Unit 321, Eagleton 54322 11111
I would like to split the Address column into up to three columns depending on how many comma separators there are in the address, with NaN filling in where values will be missing.
For example, I hope the data will look like this:
Index Court Address Address2 City Zip Phone
0 Court 1 123 Court Dr NaN Springfield ... ...
1 Court 2 45 Court Pl PO Box 45 Pawnee ... ...
2 Court 3 1725 Slough Ave NaN Scranton ... ...
3 Court 4 101 Court Ter Unit 321 Eagleton ... ...
I have plowed through and tried a ton of different solutions on StackOverflow to no avail. The closest I have gotten is with this code:
df2 = pd.concat([df, df['Address'].str.split(', ', expand=True)], axis=1)
But that returns a dataframe that adds the following three columns to the end structured as such:
... 0 1 2
... 123 Court Dr Springfield None
... 45 Court Pl PO Box 45 Pawnee
This is close, but as you can see, for the shorter entries, the city lines up with the second address line for the longer entries.
Ideally, column 2 should populate every single row with a city, and column 1 should alternate between "None" and the second address line if applicable.
I hope this makes sense -- this is a tough one to put into words. Thanks!
You could do something like this:
df['Address1'] = df['Address'].str.split(',').str[0]
df['Address2'] = df['Address'].str.extract(',(.*),')
df['City'] = df['Address'].str.split(',').str[-1]
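Putting those three lines together on the sample data (with two small additions: `expand=False` so `extract` returns a Series rather than a one-column DataFrame, and `.str.strip()` to drop the spaces left behind after the commas):

```python
import pandas as pd

df = pd.DataFrame({"Address": [
    "123 Court Dr, Springfield",
    "45 Court Pl, PO Box 45, Pawnee",
]})

df["Address1"] = df["Address"].str.split(",").str[0].str.strip()
# The middle part only exists when there are two commas; otherwise extract yields NaN
df["Address2"] = df["Address"].str.extract(",(.*),", expand=False).str.strip()
df["City"] = df["Address"].str.split(",").str[-1].str.strip()

print(df)
```

Because the city is taken from the last comma-separated piece rather than a fixed position, short and long entries line up correctly, with NaN filling the missing second address line.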
Addresses, especially those produced by human input, can be tricky. But if your addresses only fit those two formats, this will work:
Note: if there is an additional format you have to account for, this will print the culprit.
def split_address(df):
    for index, row in df.iterrows():
        full_address = row['address']
        if full_address.count(',') == 2:
            split = full_address.split(',')
            df.at[index, 'address_1'] = split[0]
            df.at[index, 'address_2'] = split[1]
            df.at[index, 'city'] = split[2]
        elif full_address.count(',') == 1:
            split = full_address.split(',')
            df.at[index, 'address_1'] = split[0]
            df.at[index, 'city'] = split[1]
        else:
            print("address does not fit known formats {0}".format(full_address))
Essentially the two things that should help you are str.count(), which tells you the number of commas in a string, and str.split(), which you already found, which splits the input into a list. You can reference the elements of this list to allocate the pieces to the correct columns.
You can look into creating a function using the package usaddress. It has been very helpful for me when I need to split address into parts:
import usaddress
import pandas as pd

df = pd.DataFrame(['123 Main St. Suite 100 Chicago, IL', '123 Main St. PO Box 100 Chicago, IL'], columns=['Address'])
Then create functions for how you want to split the data:
def Address1(x):
    try:
        data = usaddress.tag(x)
        if 'AddressNumber' in data[0].keys() and 'StreetName' in data[0].keys() and 'StreetNamePostType' in data[0].keys():
            return data[0]['AddressNumber'] + ' ' + data[0]['StreetName'] + ' ' + data[0]['StreetNamePostType']
    except:
        pass

def Address2(x):
    try:
        data = usaddress.tag(x)
        if 'OccupancyType' in data[0].keys() and 'OccupancyIdentifier' in data[0].keys():
            return data[0]['OccupancyType'] + ' ' + data[0]['OccupancyIdentifier']
        elif 'USPSBoxType' in data[0].keys() and 'USPSBoxID' in data[0].keys():
            return data[0]['USPSBoxType'] + ' ' + data[0]['USPSBoxID']
    except:
        pass

def PlaceName(x):
    try:
        data = usaddress.tag(x)
        if 'PlaceName' in data[0].keys():
            return data[0]['PlaceName']
    except:
        pass

df['Address1'] = df.apply(lambda x: Address1(x['Address']), axis=1)
df['Address2'] = df.apply(lambda x: Address2(x['Address']), axis=1)
df['City'] = df.apply(lambda x: PlaceName(x['Address']), axis=1)
out:
Address Address1 Address2 City
0 123 Main St. Suite 100 Chicago, IL 123 Main St. Suite 100 Chicago
1 123 Main St. PO Box 100 Chicago, IL 123 Main St. PO Box 100 Chicago
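Each helper above re-parses the same string, so usaddress.tag runs three times per row. Parsing once and returning a pd.Series builds all three columns in a single pass. A minimal sketch of that pattern, with a hypothetical fake_tag standing in for usaddress.tag so it runs without the package installed:

```python
import pandas as pd

def fake_tag(addr):
    # Hypothetical stand-in for usaddress.tag(addr)[0]; the real tags would
    # come from the parser, these values are hard-coded for illustration.
    return {"AddressNumber": "123", "StreetName": "Main", "StreetNamePostType": "St.",
            "OccupancyType": "Suite", "OccupancyIdentifier": "100", "PlaceName": "Chicago"}

def split_parts(addr):
    tagged = fake_tag(addr)  # parse once per row
    return pd.Series({
        "Address1": " ".join(tagged[k] for k in ("AddressNumber", "StreetName", "StreetNamePostType") if k in tagged),
        "Address2": " ".join(tagged[k] for k in ("OccupancyType", "OccupancyIdentifier") if k in tagged),
        "City": tagged.get("PlaceName"),
    })

df = pd.DataFrame({"Address": ["123 Main St. Suite 100 Chicago, IL"]})
df = df.join(df["Address"].apply(split_parts))
```

With the real library you would replace the fake_tag call with usaddress.tag(addr)[0] inside a try/except, as in the helpers above.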
I have following data in a dataframe:
... location amount
... Jacksonville, FL 40
... Provo, UT 20
... Montgomery, AL 22
... Los Angeles, CA 34
My dataset only contains U.S. cities in the form of [city name, state code] and I have no ZIP codes.
I want to determine the county of each city, in order to visualize my data with ggcounty (like here).
I looked on the website of the U.S. Census Bureau but couldn't really find a table of city,state,county, or similar.
Assuming that I would prefer solving the problem in R only, does anyone have an idea how to solve this?
You can get a ZIP code and more detailed info doing this:
library(ggmap)
revgeocode(as.numeric(geocode('Jacksonville, FL')))
Hope it helps
from pandas import DataFrame,Series
import pandas as pd
df
text region
The Five College Region The Five College Region
South Hadley (Mount Holyoke College) South Hadley
Waltham (Bentley University), (Brandeis Univer..) Waltham
The region should be extracted from text.
If the row contains "(", remove anything after "(" and then remove the trailing white space.
If the row doesn't contain "(", keep it and copy it to region.
I know I can deal with this using the str.extract function, but I'm having trouble writing the right regex pattern:
df['Region'] = df['text'].str.extract(r'(.+)\(.*')
This regex pattern cannot extract the first row, which has no parenthesis.
I also know that using the split function works for this problem:
str.split('(')[0]
But I don't know how to put the result in a column.
I hope to receive answers covering both methods.
option 1
assign + str.split
df.text.str.split(r'\s*\(').str[0]
0 The Five College Region
1 South Hadley
2 Waltham
Name: text, dtype: object
df.assign(region=df.text.str.split(r'\s*\(').str[0])
text region
0 The Five College Region The Five College Region
1 South Hadley (Mount Holyoke College) South Hadley
2 Waltham (Bentley University), (Brandeis Univer..) Waltham
option 2
join + str.extract
df.text.str.extract(r'(?P<region>[^\(]+)\s*\(*', expand=False)
0 The Five College Region
1 South Hadley
2 Waltham
Name: text, dtype: object
df.join(df.text.str.extract(r'(?P<region>[^\(]+)\s*\(*', expand=False))
text region
0 The Five College Region The Five College Region
1 South Hadley (Mount Holyoke College) South Hadley
2 Waltham (Bentley University), (Brandeis Univer..) Waltham