I have a pandas DataFrame with zip code, city, state and country for ~600,000 locations. Let's call it my_df
I'd like to look up the corresponding longitude and latitude for each of these locations. Thankfully, there is a database for this. Let's call this dataframe zipdb.
zipdb has, among others, columns for zip codes, city, state and country.
So, I'd like to look up all of the locations (zip, city, state and country) in zipdb.
def zipdb_lookup(zipcode, city, state, country):
    countries_mapping = { "UNITED STATES":"US"
                        , "CANADA":"CA"
                        , "KOREA REP OF":"KR"
                        , "ITALY":"IT"
                        , "AUSTRALIA":"AU"
                        , "CHILE":"CL"
                        , "UNITED KINGDOM":"GB"
                        , "BERMUDA":"BM"
                        }
    try:
        slc = zipdb[ (zipdb.Zipcode == str(zipcode)) &
                     (zipdb.City == str(city).upper()) &
                     (zipdb.State == str(state).upper()) &
                     (zipdb.Country == countries_mapping[country].upper()) ]
        if slc.shape[0] == 1:
            return np.array(slc["Lat"])[0], np.array(slc["Long"])[0]
        else:
            return None
    except:
        return None
I have tried pandas' .apply as well as a for loop to do this.
Both are very slow. I recognize there are a large number of rows, but I can't help but think something faster must be possible.
zipdb = pandas.read_csv("free-zipcode-database.csv") #linked to above
Note: I've also performed this transformation on zipdb:
zipdb["Zipcode"] = zipdb["Zipcode"].astype(str)
Function Call:
#Defined a wrapper function:
def lookup(row):
    """
    :param row:
    :return:
    """
    lnglat = zipdb_lookup(
          zipcode = my_df["organization_zip"][row]
        , city    = my_df["organization_city"][row]
        , state   = my_df["organization_state"][row]
        , country = my_df["organization_country"][row]
    )
    return lnglat

lnglat = list()
for l in range(0, my_df.shape[0]):
    # if l % 5000 == 0: print(round((float(l) / my_df.shape[0])*100, 2), "%")
    lnglat.append(lookup(row = l))
Sample Data from my_df:
organization_zip organization_city organization_state organization_country
0 60208 EVANSTON IL United Sates
1 77555 GALVESTON TX United Sates
2 23284 RICHMOND VA United Sates
3 53233 MILWAUKEE WI United Sates
4 10036 NEW YORK NY United Sates
5 33620 TAMPA FL United Sates
6 10029 NEW YORK NY United Sates
7 97201 PORTLAND OR United Sates
8 97201 PORTLAND OR United Sates
9 53715 MADISON WI United Sates
Using merge() will be a lot faster than calling a function on every row. Make sure the field types match and strings are stripped:
# prepare your dataframe
data['organization_zip'] = data.organization_zip.astype(str)
data['organization_city'] = data.organization_city.apply(lambda v: v.strip())
# get the zips database
zips = pd.read_csv('/path/to/free-zipcode-database.csv')
zips['Zipcode'] = zips.Zipcode.astype(str)
# left join
# -- prepare common join columns
zips.rename(columns=dict(Zipcode='organization_zip',
                         City='organization_city'),
            inplace=True)
# specify join columns along with zips' columns to copy
cols = ['organization_zip', 'organization_city', 'Lat', 'Long']
data.merge(zips[cols], how='left')
Note you may need to extend the merge columns and/or add more columns to copy from the zips dataframe.
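For example, if you also need to match on state and country, a possible extension (a sketch building on the code above, assuming zips has State and Country columns with two-letter codes, and reusing the countries_mapping dict from the question) is to normalize those fields and merge on all four keys:
# normalize state and country on the data side (assumed column names)
data['organization_state'] = data.organization_state.str.strip().str.upper()
data['organization_country'] = (data.organization_country.str.strip().str.upper()
                                    .map({"UNITED STATES": "US", "CANADA": "CA",
                                          "KOREA REP OF": "KR", "ITALY": "IT",
                                          "AUSTRALIA": "AU", "CHILE": "CL",
                                          "UNITED KINGDOM": "GB", "BERMUDA": "BM"}))

# rename the remaining join columns on the zips side
zips.rename(columns=dict(State='organization_state',
                         Country='organization_country'),
            inplace=True)

cols = ['organization_zip', 'organization_city',
        'organization_state', 'organization_country', 'Lat', 'Long']
result = data.merge(zips[cols], how='left')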
I have a dataframe that looks like this:
Premise  Thoroughfare       Locality            PostalCode  Country    FullAddress
         Yew Tree Lane      Holmbridge          HD9 2NR     N Ireland  Old Thorn, Yew Tree Lane, Holmbridge HD9 2NR, N Ireland
3        Cysgod Y Castell   Llandudno Junction  LL31 9LJ    Uk         3 Cysgod Y Castell, Llandudno Junction LL31 9LJ
1168     Christchurch Road  Bournemouth         BH7 6DY     Wales UK   1168 Christchurch Road, BH7 6DY Bournemouth
And I want to create another column or dataframe that looks like this:
FullAddress:        Old Thorn, Yew Tree Lane, Holmbridge HD9 2NR, N Ireland
FullAdressWithTag:  Old^Others Thorn^Others, Yew^Thoroughfare Tree^Thoroughfare Lane^Thoroughfare, Holmbridge^Locality HD9^PostalCode 2NR^PostalCode, N^Country Ireland^Country

FullAddress:        3 Cysgod Y Castell, Llandudno Junction LL31 9LJ
FullAdressWithTag:  3^Premise Cysgod^Thoroughfare Y^Thoroughfare Castell^Thoroughfare, Llandudno^Locality Junction^Locality LL31^PostalCode 9LJ^PostalCode

FullAddress:        1168 Christchurch Road, BH7 6DY Bournemouth
FullAdressWithTag:  1168^Premise Christchurch^Thoroughfare Road^Thoroughfare, BH7^PostalCode 6DY^PostalCode Bournemouth^Locality
I am trying to build the FullAddressWithTag column based on the data available in the individual columns such as Locality, Premise, PostalCode, etc. Do note that the pattern of the FullAddress may vary.
For example, it can be:
Premise -> Thoroughfare -> Postalcode -> Locality
Premise -> Thoroughfare -> Locality -> PostalCode
Thoroughfare -> Premise -> Postalcode -> Locality
The order can vary depending on how the FullAddress was given. If an element in FullAddress doesn't match any column, it is tagged as "Others".
I have about a million records to map this way.
Here is a long and bulky code that could get your job done:
def getDFwithFullTag(df1):
    cols_to_check = ['Premise', 'Thoroughfare', 'Locality', 'PostalCode', 'Country']
    def checkAddr(row):
        ans = ''
        for st in row['FullAddress'].split(' '):
            # compare without any trailing comma so "Lane," still matches "Lane"
            word = st[:-1] if st[-1] == ',' else st
            flg = False
            for col in cols_to_check:
                if word in str(row[col]).split(' '):
                    if st[-1] == ',':
                        ans += word + '^' + col + ', '
                    else:
                        ans += word + '^' + col + ' '
                    flg = True
                    break
            if not flg:
                # token not found in any column -> tag it as "Others"
                if st[-1] == ',':
                    ans += word + '^Others, '
                else:
                    ans += word + '^Others '
        return ans
    ftag = []
    for i in range(len(df1)):
        ftag.append(checkAddr(df1.loc[i]))
    df_new = pd.DataFrame(data=df1.FullAddress)
    df_new.insert(1, 'FullAdressWithTag', ftag, True)
    return df_new
Apart from being large and ugly, this function takes around 20-30 minutes to process 1 million rows. It takes a dataframe as input and outputs a dataframe in your desired format.
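If the runtime matters, one possible speed-up (a rough sketch, not the code above; it assumes the same column names and that untagged tokens should become "Others") is to build a token-to-tag lookup once per row and use df.apply, so each column is split once per row instead of once per token:
cols_to_check = ['Premise', 'Thoroughfare', 'Locality', 'PostalCode', 'Country']

def tag_address(row):
    # map each token that appears in one of the columns to that column's name
    token_tag = {}
    for col in cols_to_check:
        for token in str(row[col]).split(' '):
            token_tag.setdefault(token, col)
    parts = []
    for st in row['FullAddress'].split(' '):
        word, sep = (st[:-1], ',') if st.endswith(',') else (st, '')
        parts.append(word + '^' + token_tag.get(word, 'Others') + sep)
    return ' '.join(parts)

df1['FullAdressWithTag'] = df1.apply(tag_address, axis=1)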
Let's say I have some customer data. The data is generated by the customer and it's messy, so they put their city into either the city or county field or both! That means I may need to check both columns to find out which city they are from.
mydf = pd.DataFrame({'name':['jim','jon'],
                     'city':['new york',''],
                     'county':['','los angeles']})
print(mydf)
name city county
0 jim new york
1 jon los angeles
And I am using an api to get their zipcode. There is a different api function for each city, and it returns the zipcode for the customer's address, e.g. 123 main street, new york. I haven't included the full address here to save time.
# api for new york addresses
def get_NY_zipcode_api():
return 'abc123'
# api for chicago addresses
def get_CH_zipcode_api():
return 'abc124'
# api for los angeles addresses
def get_LA_zipcode_api():
return 'abc125'
# api for miami addresses
def get_MI_zipcode_api():
return 'abc126'
Depending on the city, I will call a different api. So for now, I am checking if city == x or county ==x, call api_x:
def myfunc(row):
    city = row['city']
    county = row['county']
    if city == 'chicago' or county == 'chicago':
        # call chicago api
        zipcode = get_CH_zipcode_api()
        return zipcode
    elif city == 'new york' or county == 'new york':
        # call new york api
        zipcode = get_NY_zipcode_api()
        return zipcode
    elif city == 'los angeles' or county == 'los angeles':
        # call los angeles api
        zipcode = get_LA_zipcode_api()
        return zipcode
    elif city == 'miami' or county == 'miami':
        # call miami api
        zipcode = get_MI_zipcode_api()
        return zipcode
And I apply() this to the df and get my results:
mydf['result'] = mydf.apply(myfunc,axis=1)
print(mydf)
name city county result
0 jim new york abc123
1 jon los angeles abc125
I actually have about 30 cities and therefore 30 conditions to check, so I want to avoid a long list of elif statements. What would be the most efficient way to do this?
I found some suggestions from a similar stack overflow question. Such as creating a dictionary with key:city and value:function and calling it based on city:
operationFuncs = {
    'chicago': get_CH_zipcode_api,
    'new york': get_NY_zipcode_api,
    'los angeles': get_LA_zipcode_api,
    'miami': get_MI_zipcode_api
}
But as far as I can see this only works if I am checking a single column / single condition. I can't see how it can work with if city == x or county == x
mydf['result'] = mydf.apply(lambda row : operationFuncs.get(row['county']) or operationFuncs.get(row['city']),axis=1)
I think you are referring to this. You can just perform this operation twice for city and county and save the result in two different variables, for each zipcode respectively. You can then compare the results and decide what to do if they differ (I am not sure if this can be the case with your dataset).
Since the dictionary lookup is O(1) and I assume your get_MI_zipcode_api() isn't any more expensive, this will have no real performance drawbacks.
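For example, a minimal sketch of that idea (using the operationFuncs dict and mydf from the question; the city-then-county fallback order is an assumption) looks up the function and then calls whichever is found:
def get_zipcode(row):
    # try the city column first, then fall back to the county column
    fn = operationFuncs.get(row['city']) or operationFuncs.get(row['county'])
    return fn() if fn else None

mydf['result'] = mydf.apply(get_zipcode, axis=1)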
Maybe not the most elegant solution but you could use the dict approach and just call it twice, once on city and once on county. The second would overwrite the first but the same is true of your if block, and this would only be a problem if you had city='New York' county ='Chicago' for example which I assume cannot occur.
Or you could use the dict and iterate through it, this seems unnecessary though.
for key, fn in fdict.items():
    if key in (city, county):
        fn()
I'd do this join in SQL before reading in the data. I'm sure there's a way to do the same in pandas, but I was trying to make suggestions that build on your existing research, even if they are not the best.
If it's guaranteed that the value will be present in either city or county and not in both, then you can merge both columns into one.
df['region'] = df['city'] + df['county']
Then create a mapping of region to zipcode, instead of mapping city to an api function. Since there are only 30 unique values, you can store each city's zipcode once rather than calling the zipcode functions each time, as making an api call is expensive.
mappings = {
    'chicago': 'abc123',
    'new york': 'abc234',
    'los angeles': 'abc345',
    'miami': 'abc456'}
Create a dataframe using this dictionary & then merge with the original dataframe
mappings_df = pd.DataFrame(list(mappings.items()), columns=['region', 'zipcode'])
df.merge(mappings_df, how='left', on='region')
Hope this helps!!
You need a relation table which can be represented by a dict.
df = pd.DataFrame({'name':['jim','jon'],
                   'city':['new york',''],
                   'county':['','los angeles']})
df['region'] = df['city'] + df['county']
table = {'new york': 'abc123', 'chicago': 'abc124', 'los angeles': 'abc125', 'miami': 'abc126'}
df['region'] = df.region.apply(lambda row: table[row])
print(df)
Output
name city county region
0 jim new york abc123
1 jon los angeles abc125
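Note that table[row] raises a KeyError for any region missing from the dict; a variant of the same idea that leaves unmatched regions as NaN is Series.map:
df['region'] = df.region.map(table)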
I have a list of city names and a df with city, state, and zipcode columns. Some of the zipcodes are missing. When a zipcode is missing, I want to use a generic zipcode based on the city. For example, the city is San Jose so the zipcode should be a generic 'SJ_zipcode'.
pattern_city = '|'.join(cities) #works
foundit = ( (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)) & (df['zip_cd']==0) & (df['st_cd'].str.match('CA') ) ) #works--is this foundit a df?
df['zip_cd'] = foundit.replace( 'SJ_zipcode' ) #nope, error
Error: "Invalid dtype for pad_1d [bool]"
Implemented with where
df['zip_cd'].where( (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)) & (df['zip_cd']==0) & (df['st_cd'].str.match('CA') ), "SJ_Zipcode", inplace = True) #nope, empty set; all set to nan?
Implemented with loc
df['zip_cd'].loc[ (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)) & (df['zip_cd']==0) & (df['st_cd'].str.match('CA') ) ] = "SJ_Zipcode"
Some possible solutions that did not work
df.loc[df['First Season'] > 1990, 'First Season'] = 1, which I used as df.loc[foundit, 'zip_cd'] = 'SJ_zipcode' (from "Pandas DataFrame: replace all values in a column, based on condition", similar to "Conditional Replace Pandas")
df['c'] = df.apply(lambda row: row['a']*row['b'] if np.isnan(row['c']) else row['c'], axis=1); however, I am not multiplying values (https://datascience.stackexchange.com/questions/17769/how-to-fill-missing-value-based-on-other-columns-in-pandas-dataframe)
I tried a solution using where; however, it seemed to replace the values where the condition was not met with nan, and the nan value was not helpful (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html)
This conditional approach looked promising, but without looping over each value I was confused about how anything actually happens: What should replace comparisons with False in python?
An example using replace, which does not handle the multiple conditions and pattern: Replacing few values in a pandas dataframe column with another value
An additional 'want': I want to update the existing dataframe in place, not create a new dataframe.
Try this:
df = pd.DataFrame(data)
df
city state zip
0 Burbank California 44325
1 Anaheim California nan
2 El Cerrito California 57643
3 Los Angeles California 56734
4 san Fancisco California 32819
def generate_placeholder_zip(row):
    if pd.isnull(row['zip']):
        row['zip'] = row['city'] + '_ZIPCODE'
    return row

df.apply(generate_placeholder_zip, axis=1)
city state zip
0 Burbank California 44325
1 Anaheim California Anaheim_ZIPCODE
2 El Cerrito California 57643
3 Los Angeles California 56734
4 san Fancisco California 32819
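Since the question also asks to update the existing dataframe in place rather than build a new one, a minimal sketch with a boolean mask and .loc (assuming the same 'zip' and 'city' columns as above) would be:
# fill only the rows where zip is missing, modifying df in place
missing = df['zip'].isnull()
df.loc[missing, 'zip'] = df.loc[missing, 'city'] + '_ZIPCODE'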
I have a list of addresses that I would like to put into a dataframe where each row is a new address and the columns are the units of the address (title, street, city).
However, the way the list is structured, some addresses are longer than others. For example:
address = ['123 Some Street, City','45 Another Place, PO Box 123, City']
I have a pandas dataframe with the following columns:
Index Court Address Zipcode Phone
0 Court 1 123 Court Dr, Springfield 12345 11111
1 Court 2 45 Court Pl, PO Box 45, Pawnee 54321 11111
2 Court 3 1725 Slough Ave, Scranton 18503 11111
3 Court 4 101 Court Ter, Unit 321, Eagleton 54322 11111
I would like to split the Address column into up to three columns depending on how many comma separators there are in the address, with NaN filling in where values will be missing.
For example, I hope the data will look like this:
Index Court Address Address2 City Zip Phone
0 Court 1 123 Court Dr NaN Springfield ... ...
1 Court 2 45 Court Pl PO Box 45 Pawnee ... ...
2 Court 3 1725 Slough Ave NaN Scranton ... ...
3 Court 4 101 Court Ter Unit 321 Eagleton ... ...
I have plowed through and tried a ton of different solutions on StackOverflow to no avail. The closest I have gotten is with this code:
df2 = pd.concat([df, df['Address'].str.split(', ', expand=True)], axis=1)
But that returns a dataframe that adds the following three columns to the end structured as such:
... 0 1 2
... 123 Court Dr Springfield None
... 45 Court Pl PO Box 45 Pawnee
This is close, but as you can see, for the shorter entries, the city lines up with the second address line for the longer entries.
Ideally, column 2 should populate every single row with a city, and column 1 should alternate between "None" and the second address line if applicable.
I hope this makes sense -- this is a tough one to put into words. Thanks!
You could do something like this:
df['Address1'] = df['Address'].str.split(',').str[0]
df['Address2'] = df['Address'].str.extract(',(.*),')
df['City'] = df['Address'].str.split(',').str[-1]
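Note that the pieces keep the space that follows each comma; if that matters, a small variation of the same idea strips the whitespace:
df['Address1'] = df['Address'].str.split(',').str[0].str.strip()
df['Address2'] = df['Address'].str.extract(',(.*),')[0].str.strip()
df['City'] = df['Address'].str.split(',').str[-1].str.strip()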
Addresses, especially those produced by human input, can be tricky. But if your addresses only fit those two formats, this will work:
Note: If there is an additional format you have to account for, this will print the culprit.
def split_address(df):
    for index, row in df.iterrows():
        full_address = row['Address']
        if full_address.count(',') == 2:
            # e.g. "45 Court Pl, PO Box 45, Pawnee"
            split = full_address.split(',')
            df.at[index, 'address_1'] = split[0]
            df.at[index, 'address_2'] = split[1]
            df.at[index, 'city'] = split[2]
        elif full_address.count(',') == 1:
            # e.g. "123 Court Dr, Springfield"
            split = full_address.split(',')
            df.at[index, 'address_1'] = split[0]
            df.at[index, 'city'] = split[1]
        else:
            print("address does not fit known formats {0}".format(full_address))
Essentially the two things that should help you are string.count(), which tells you the number of commas in a string, and string.split(), which you already found, which splits the input into a list. You can reference the portions of that list to allocate the pieces to the correct columns.
You can look into creating a function using the package usaddress. It has been very helpful for me when I need to split address into parts:
import usaddress
df = pd.DataFrame(['123 Main St. Suite 100 Chicago, IL', '123 Main St. PO Box 100 Chicago, IL'], columns=['Address'])
Then create functions for how you want to split the data:
def Address1(x):
    try:
        data = usaddress.tag(x)
        if 'AddressNumber' in data[0].keys() and 'StreetName' in data[0].keys() and 'StreetNamePostType' in data[0].keys():
            return data[0]['AddressNumber'] + ' ' + data[0]['StreetName'] + ' ' + data[0]['StreetNamePostType']
    except:
        pass

def Address2(x):
    try:
        data = usaddress.tag(x)
        if 'OccupancyType' in data[0].keys() and 'OccupancyIdentifier' in data[0].keys():
            return data[0]['OccupancyType'] + ' ' + data[0]['OccupancyIdentifier']
        elif 'USPSBoxType' in data[0].keys() and 'USPSBoxID' in data[0].keys():
            return data[0]['USPSBoxType'] + ' ' + data[0]['USPSBoxID']
    except:
        pass

def PlaceName(x):
    try:
        data = usaddress.tag(x)
        if 'PlaceName' in data[0].keys():
            return data[0]['PlaceName']
    except:
        pass
df['Address1'] = df.apply(lambda x: Address1(x['Address']), axis=1)
df['Address2'] = df.apply(lambda x: Address2(x['Address']), axis=1)
df['City'] = df.apply(lambda x: PlaceName(x['Address']), axis=1)
out:
Address Address1 Address2 City
0 123 Main St. Suite 100 Chicago, IL 123 Main St. Suite 100 Chicago
1 123 Main St. PO Box 100 Chicago, IL 123 Main St. PO Box 100 Chicago
I am doing a triple for loop on a dataframe with almost 70 thousand entries. How do I optimize it?
My ultimate goal is to create a new column that has the country of a seismic event. I have a latitude, longitude and 'place' (ex: '17km N of North Nenana, Alaska') column. I tried to reverse geocode, but with 68,488 entries, there is no free service that lets me do that. And as a student, I cannot afford it.
So I am using a dataframe with a list of countries and a dataframe with a list of states to compare to USGS['place']'s values. To do that, I ultimately settled on using 3 for loops.
As you can assume, it takes a long time. I was hoping there is a way to speed things up. I am using python, but I use r as well. The for loops just run better on python.
Any better options I'll take.
USGS = pd.DataFrame(data = {'latitude':[64.7385, 61.116], 'longitude':[-149.136, -138.655], 'place':['17km N of North Nenana, Alaska', '74km WNW of Haines Junction, Canada'], 'country':[None, None]})
states = pd.DataFrame(data = {'state':['AK', 'AL'], 'name':['Alaska', 'Alabama']})
countries = pd.DataFrame(data = {'country':['Afghanistan', 'Canada']})
for head in states:
    for state in states[head]:
        for p in USGS['place']:
            if state in p:
                USGS['country'] = USGS['country'].map({p : 'United States'})
# I have not finished the code for the countries dataframe
You do have options for geocoding. MapQuest offers 15,000 free calls per month. You can also look at using geopy, which is what I use.
import pandas as pd
import geopy
from geopy.geocoders import Nominatim
USGS_df = pd.DataFrame(data = {'latitude':[64.7385, 61.116], 'longitude':[-149.136, -138.655], 'place':['17km N of North Nenana, Alaska', '74km WNW of Haines Junction, Canada'], 'country':[None, None]})
geopy.geocoders.options.default_user_agent = "locations-application"
geolocator=Nominatim(timeout=10)
for i, row in USGS_df.iterrows():
    try:
        lat = row['latitude']
        lon = row['longitude']
        location = geolocator.reverse('%s, %s' %(lat, lon))
        country = location.raw['address']['country']
        print ('Found: ' + location.address)
        USGS_df.loc[i, 'country'] = country
    except:
        print ('Location not identified: %s, %s' %(lat, lon))
Input:
print (USGS_df)
latitude longitude place country
0 64.7385 -149.136 17km N of North Nenana, Alaska None
1 61.1160 -138.655 74km WNW of Haines Junction, Canada None
Output:
print (USGS_df)
latitude longitude place country
0 64.7385 -149.136 17km N of North Nenana, Alaska USA
1 61.1160 -138.655 74km WNW of Haines Junction, Canada Canada
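One caveat worth adding (not part of the code above): with ~68,000 rows, Nominatim's free endpoint expects roughly one request per second, so a long run should be throttled. A sketch using geopy's RateLimiter, assuming the geolocator set up above:
from geopy.extra.rate_limiter import RateLimiter

# wrap the reverse call so consecutive requests are spaced out
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)

for i, row in USGS_df.iterrows():
    location = reverse('%s, %s' % (row['latitude'], row['longitude']))
    if location is not None:
        USGS_df.loc[i, 'country'] = location.raw['address'].get('country')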