Suppose I have a dataframe that maps a child-level address to a more macro level address:
Child            Child's Level   Parent           Parent's Level
Pivet Drive      Street          Little Whinging  Town
Little Whinging  Town            England          Country
England          Country         Europe           Continent
State Street     Street          New York         City
New York         City            USA              Country
USA              Country         North America    Continent
I have a second dataframe that lists each person's address, but the address may be stated at a different level of the hierarchy:
Name  Address       Continent?
Adam  Pivet Drive
Mary  New York
Dave  State Street
How can I fill up the continent column in the 2nd dataframe using python?
A naive way is to either turn the 1st dataframe into a dictionary and repeatedly map upwards, or to repeatedly merge the two dataframes. However, this does not scale well once I have millions of rows in both dataframes, especially since not every record starts from the same level in the hierarchy.
I've previously filled up the continent column using a graph database (Neo4j), but I can't seem to google any hint on how to do this using python instead.
A graph DB is built for cases like this. If you want to handle it with a relational DB or a dataframe (they are essentially the same thing here), you can't avoid queries with many outer joins. The underlying concept is how to store and retrieve a tree-like data structure in a relational DB; you can treat a dataframe like a table in a DB.
Here I use the Union-Find algorithm to handle this. Notice that I didn't use the level information except for Continent, which may be buggy if two continents contain places with the same name, whether at the same level or at different levels. The following code is just a demo of the idea: it works for the sample data you provided, but may not work well for your entire dataset:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'Child': ['Pivet Drive', 'Little Whinging', 'England', 'State Street', 'New York', 'USA'],
                   "ChildLevel": ['Street', 'Town', 'Country', 'Street', 'City', 'Country'],
                   "Parent": ['Little Whinging', 'England', 'Europe', 'New York', 'USA', 'North America'],
                   "ParentLevel": ['Town', 'Country', 'Continent', 'City', 'Country', 'Continent']})

df_to_fill = pd.DataFrame({
    'Name': ['Adam', 'Mary', 'Dave'],
    'Address': ['Pivet Drive', 'New York', 'State Street'],
})

# Build an undirected graph from the child-parent pairs
child_parent_value_pairs = df[["Child", "Parent"]].values.tolist()
tree = lambda: defaultdict(tree)
G = tree()
for child, parent in child_parent_value_pairs:
    G[child][parent] = 1
    G[parent][child] = 1

E = [(G[u][v], u, v) for u in G for v in G[u]]
T = set()
C = {u: u for u in G}  # C stands for components
R = {u: 0 for u in G}  # R stands for ranks

def find(C, u):
    if C[u] != u:
        C[u] = find(C, C[u])  # Path compression
    return C[u]

def union(C, R, u, v):
    u = find(C, u)
    v = find(C, v)
    if R[u] > R[v]:  # Union by rank
        C[v] = u
    else:
        C[u] = v
    if R[u] == R[v]:
        R[v] += 1

for __, u, v in sorted(E):
    if find(C, u) != find(C, v):
        T.add((u, v))
        union(C, R, u, v)

# Each connected component contains exactly one continent, so map each
# component representative to its continent
all_continents = set(df[df['ParentLevel'] == 'Continent']['Parent'].tolist())
continent_lookup = {find(C, continent): continent for continent in all_continents}

df_to_fill['Continent'] = df_to_fill['Address'].apply(lambda x: continent_lookup.get(find(C, x), None))
print(df_to_fill)
Output:
Name Address Continent
0 Adam Pivet Drive Europe
1 Mary New York North America
2 Dave State Street North America
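For comparison, the dictionary-walk approach the question calls naive can be sketched in a few lines (reusing df and df_to_fill from above; it assumes every chain terminates at a continent, so it simply walks Child -> Parent until no parent is left):
# map each child directly to its parent
parent_of = dict(zip(df['Child'], df['Parent']))
def climb(address):
    # follow the chain upwards until the top, which is assumed to be a continent
    while address in parent_of:
        address = parent_of[address]
    return address
df_to_fill['Continent'] = df_to_fill['Address'].map(climb)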
I have a pandas dataset like below:
import pandas as pd
data = {'id': ['001', '002', '003'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA, United States",
                    "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA, United States",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA, United States"]
        }
df = pd.DataFrame(data)
print(df)
I need to split the address column on the \n delimiter and create new columns like Name, address line 1, City, State, Zipcode and Country, like below:
id Name addressline1 City State Zipcode Country
1 William J. Clare 290 Valley Dr. Casper WY 82604 United States
2 null 1180 Shelard Tower Minneapolis MN 55426 United States
3 William N. Barnard 145 S. Durbin Casper WY 82601 United States
I am learning Python and have been working on this since morning. Any help will be greatly appreciated.
Thanks,
Right now, pandas is returning you a table with 2 columns. If you look at the value in the second column, the essential information is separated by commas. Therefore, if you saved your dataframe to df, you can do the following:
df['address_and_city'] = df['address'].apply(lambda x: x.split(',')[0])
df['state_and_postal'] = df['address'].apply(lambda x: x.split(',')[1])
df['country'] = df['address'].apply(lambda x: x.split(',')[2])
Now you have three additional columns in your dataframe; the last one already contains the full country information. From the first two columns you created, you can extract the info you need in a similar way.
df['address_first_line'] = df['address_and_city'].apply(lambda x: ' '.join(x.split('\n')[:-1]))
df['city'] = df['address_and_city'].apply(lambda x: x.split('\n')[-1])
df['state'] = df['state_and_postal'].apply(lambda x: x.split(' ')[1])
df['postal'] = df['state_and_postal'].apply(lambda x: x.split(' ')[2].split('\n')[0])
Now you should have all the columns you need. You can remove the excess columns with:
df.drop(columns=['address','address_and_city','state_and_postal'], inplace=True)
Of course, it all can be done faster and with fewer lines of code, but I think it is the clearest way of doing it, which I hope you will find useful. If you don't understand what I did there, check the documentation for split and join methods, and also for apply method, native to pandas.
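For reference, a more compact version of the same idea could look like the sketch below. It starts again from the original df in the question and assumes the last two lines of every address are always "City, ST ZIP" and "..., Country", with a name line present only when there are four lines:
lines = df['address'].str.split('\n')
df['Name'] = lines.apply(lambda x: x[0] if len(x) == 4 else None)  # name only in 4-line addresses
df['addressline1'] = lines.str[-3]
df['City'] = lines.str[-2].str.split(',').str[0]
df['State'] = lines.str[-2].str.split().str[-2]
df['Zipcode'] = lines.str[-2].str.split().str[-1]
df['Country'] = lines.str[-1].str.split(', ').str[-1]
Indexing the line lists from the right means the optional name line never shifts the other columns.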
Let's say I have some customer data. The data is generated by the customer and it's messy, so they put their city into either the city or county field or both! That means I may need to check both columns to find out which city they are from.
mydf = pd.DataFrame({'name':['jim','jon'],
                     'city':['new york',''],
                     'county':['','los angeles']})
print(mydf)
name city county
0 jim new york
1 jon los angeles
And I am using an API to get their zipcode. There is a different API function for each city, and it returns the zipcode for the customer's address, e.g. 123 main street, new york. I haven't included the full address here to save time.
# api for new york addresses
def get_NY_zipcode_api():
    return 'abc123'

# api for chicago addresses
def get_CH_zipcode_api():
    return 'abc124'

# api for los angeles addresses
def get_LA_zipcode_api():
    return 'abc125'

# api for miami addresses
def get_MI_zipcode_api():
    return 'abc126'
Depending on the city, I will call a different api. So for now, I am checking if city == x or county ==x, call api_x:
def myfunc(row):
    city = row['city']
    county = row['county']
    if city == 'chicago' or county == 'chicago':
        # call chicago api
        zipcode = get_CH_zipcode_api()
        return zipcode
    elif city == 'new york' or county == 'new york':
        # call new york api
        zipcode = get_NY_zipcode_api()
        return zipcode
    elif city == 'los angeles' or county == 'los angeles':
        # call los angeles api
        zipcode = get_LA_zipcode_api()
        return zipcode
    elif city == 'miami' or county == 'miami':
        # call miami api
        zipcode = get_MI_zipcode_api()
        return zipcode
And I apply() this to the df and get my results:
mydf['result'] = mydf.apply(myfunc,axis=1)
print(mydf)
name city county result
0 jim new york abc123
1 jon los angeles abc125
I actually have about 30 cities and therefore 30 conditions to check, so I want to avoid a long list of elif statements. What would be the most efficient way to do this?
I found some suggestions from a similar stack overflow question. Such as creating a dictionary with key:city and value:function and calling it based on city:
operationFuncs = {
    'chicago': get_CH_zipcode_api,
    'new york': get_NY_zipcode_api,
    'los angeles': get_LA_zipcode_api,
    'miami': get_MI_zipcode_api
}
But as far as I can see, this only works if I am checking a single column / single condition. I can't see how it can work with if city == x or county == x.
mydf['result'] = mydf.apply(lambda row: (operationFuncs.get(row['city']) or operationFuncs.get(row['county']))(), axis=1)
I think you are referring to this. You can just perform this operation twice for city and county and save the result in two different variables, for each zipcode respectively. You can then compare the results and decide what to do if they differ (I am not sure if this can be the case with your dataset).
Since the dictionary lookup is O(1) and I assume your get_MI_zipcode_api() isn't any more expensive, this will have no real performance drawbacks.
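A minimal sketch of that two-lookup idea (the helper name lookup_zip is just for illustration; here it keeps the city's result when the two columns disagree):
def lookup_zip(row):
    city_fn = operationFuncs.get(row['city'])
    county_fn = operationFuncs.get(row['county'])
    city_zip = city_fn() if city_fn else None
    county_zip = county_fn() if county_fn else None
    if city_zip and county_zip and city_zip != county_zip:
        # the two columns point to different cities; decide what to do here
        return city_zip
    return city_zip or county_zip
mydf['result'] = mydf.apply(lookup_zip, axis=1)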
Maybe not the most elegant solution, but you could use the dict approach and just call it twice, once on city and once on county. The second would overwrite the first, but the same is true of your if block, and this would only be a problem if you had city='new york' and county='chicago', for example, which I assume cannot occur.
Or you could use the dict and iterate through it, this seems unnecessary though.
for key, fn in fdict.items():
    if key in (city, county):
        fn()
I'd do this join in SQL before reading in the data. I'm sure there's a way to do the same in pandas, but I was trying to make suggestions that build on your existing research, even if they are not the best.
If it's guaranteed that the value will be present in either city or county but not in both, then you can merge the two columns into one.
df['region'] = df['city'] + df['county']
Then create a mapping from region to zipcode, instead of a mapping from city to API function. Since there are only 30 unique values, you can store the zipcodes once rather than calling the zipcode functions each time, as making an API call is expensive.
mappings = {
    'chicago': 'abc123',
    'new york': 'abc234',
    'los angeles': 'abc345',
    'miami': 'abc456'}
Create a dataframe using this dictionary & then merge with the original dataframe
mappings_df = pd.DataFrame(list(mappings.items()), columns=['region', 'zipcode'])
df.merge(mappings_df, how='left', on='region')
Hope this helps!!
You need a relation table which can be represented by a dict.
df = pd.DataFrame({'name':['jim','jon'],
                   'city':['new york',''],
                   'county':['','los angeles']})
df['region'] = df['city'] + df['county']
table = {'new york': 'abc123', 'chicago': 'abc124', 'los angeles': 'abc125', 'miami': 'abc126'}
df['region'] = df.region.apply(lambda row: table[row])
print(df)
Output
name city county region
0 jim new york abc123
1 jon los angeles abc125
I am doing a triple for loop on a dataframe with almost 70 thousand entries. How do I optimize it?
My ultimate goal is to create a new column that has the country of a seismic event. I have a latitude, longitude and 'place' (ex: '17km N of North Nenana, Alaska') column. I tried to reverse geocode, but with 68,488 entries, there is no free service that lets me do that. And as a student, I cannot afford it.
So I am using a dataframe with a list of countries and a dataframe with a list of states to compare to USGS['place']'s values. To do that, I ultimately settled on using 3 for loops.
As you can assume, it takes a long time. I was hoping there is a way to speed things up. I am using Python, but I use R as well. The for loops just run better in Python.
Any better options I'll take.
USGS = pd.DataFrame(data = {'latitude':[64.7385, 61.116], 'longitude':[-149.136, -138.655], 'place':['17km N of North Nenana, Alaska', '74km WNW of Haines Junction, Canada'], 'country':[None, None]})
states = pd.DataFrame(data = {'state':['AK', 'AL'], 'name':['Alaska', 'Alabama']})
countries = pd.DataFrame(data = {'country':['Afghanistan', 'Canada']})

for head in states:
    for state in states[head]:
        for p in USGS['place']:
            if state in p:
                USGS['country'] = USGS['country'].map({p : 'United States'})
# I have not finished the code for the countries dataframe
You do have options for geocoding. MapQuest offers 15,000 free calls per month. You can also look at using geopy, which is what I use.
import pandas as pd
import geopy
from geopy.geocoders import Nominatim
USGS_df = pd.DataFrame(data = {'latitude':[64.7385, 61.116], 'longitude':[-149.136, -138.655], 'place':['17km N of North Nenana, Alaska', '74km WNW of Haines Junction, Canada'], 'country':[None, None]})
geopy.geocoders.options.default_user_agent = "locations-application"
geolocator=Nominatim(timeout=10)
for i, row in USGS_df.iterrows():
    try:
        lat = row['latitude']
        lon = row['longitude']
        location = geolocator.reverse('%s, %s' %(lat, lon))
        country = location.raw['address']['country']
        print ('Found: ' + location.address)
        USGS_df.loc[i, 'country'] = country
    except:
        print ('Location not identified: %s, %s' %(lat, lon))
Input:
print (USGS_df)
latitude longitude place country
0 64.7385 -149.136 17km N of North Nenana, Alaska None
1 61.1160 -138.655 74km WNW of Haines Junction, Canada None
Output:
print (USGS_df)
latitude longitude place country
0 64.7385 -149.136 17km N of North Nenana, Alaska USA
1 61.1160 -138.655 74km WNW of Haines Junction, Canada Canada
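If the free API tiers are still a concern, the string-matching idea from the question can also be vectorized with pandas string methods instead of the triple loop. A rough sketch, reusing the states and countries frames from the question and USGS_df from above (it treats any place that mentions a US state name as 'United States' and otherwise tries to extract a known country name):
# build alternation patterns from the lookup frames
state_pattern = '|'.join(states['name'])
country_pattern = '|'.join(countries['country'])
# places that mention a US state name
is_us = USGS_df['place'].str.contains(state_pattern, na=False)
USGS_df.loc[is_us, 'country'] = 'United States'
# otherwise, try to pull a known country name out of the place string
found = USGS_df['place'].str.extract('(' + country_pattern + ')')[0]
USGS_df['country'] = USGS_df['country'].fillna(found)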
I am trying to use a dictionary's keys to replace strings in a pandas column with the corresponding values. However, the column contains sentences. Therefore, I must first tokenize each sentence, detect whether a word in the sentence corresponds to a key in my dictionary, and then replace the string with the corresponding value.
However, the result that I keep getting is None. Is there a more pythonic way to approach this problem?
Here is my MCVE for the moment. In the comments, I have marked where the issue is happening.
import pandas as pd

data = {'Categories': ['animal','plant','object'],
        'Type': ['tree','dog','rock'],
        'Comment': ['The NYC tree is very big','The cat from the UK is small','The rock was found in LA.']
        }

ids = {'Id':['NYC','LA','UK'],
       'City':['New York City','Los Angeles','United Kingdom']}

df = pd.DataFrame(data)
ids = pd.DataFrame(ids)

def col2dict(ids):
    data = ids[['Id', 'City']]
    idDict = data.set_index('Id').to_dict()['City']
    return idDict

def replaceIds(data,idDict):
    ids = idDict.keys()
    types = idDict.values()
    data['commentTest'] = data['Comment']
    words = data['commentTest'].apply(lambda x: x.split())
    for (i,word) in enumerate(words):
        #Here we can see that the words appear
        print word
        print ids
        if word in ids:
            #Here we can see that they are not being recognized. What happened?
            print ids
            print word
            words[i] = idDict[word]
    data['commentTest'] = ' '.apply(lambda x: ''.join(x))
    return data

idDict = col2dict(ids)
results = replaceIds(df, idDict)
Results:
None
I am using Python 2.7, and when I print out the dict, the strings have u'' Unicode prefixes.
My expected outcome is:
Categories Comment Type commentTest
0 animal The NYC tree is very big tree The New York City tree is very big
1 plant The cat from the UK is small dog The cat from the United Kingdom is small
2 object The rock was found in LA. rock The rock was found in Los Angeles.
You can create a dictionary and then replace:
ids = {'Id':['NYC','LA','UK'],
       'City':['New York City','Los Angeles','United Kingdom']}
ids = dict(zip(ids['Id'], ids['City']))
print (ids)
{'UK': 'United Kingdom', 'LA': 'Los Angeles', 'NYC': 'New York City'}
df['commentTest'] = df['Comment'].replace(ids, regex=True)
print (df)
Categories Comment Type \
0 animal The NYC tree is very big tree
1 plant The cat from the UK is small dog
2 object The rock was found in LA. rock
commentTest
0 The New York City tree is very big
1 The cat from the United Kingdom is small
2 The rock was found in Los Angeles.
It's actually much faster to use str.replace() than replace(), even though str.replace() requires a loop:
ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}
for old, new in ids.items():
df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
# Categories Type Comment
# 0 animal tree The New York City tree is very big
# 1 plant dog The cat from the United Kingdom is small
# 2 object rock The rock was found in Los Angeles
The only time replace() outperforms a str.replace() loop is with small dataframes.
The timing functions for reference:
def Series_replace(df):
    df['Comment'] = df['Comment'].replace(ids, regex=True)
    return df

def Series_str_replace(df):
    for old, new in ids.items():
        df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
    return df
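A rough way to run these timing functions yourself might look like this (a sketch only; big_df is a hypothetical scaled-up copy of the example frame, and the exact numbers will depend on your pandas version and data):
import timeit
# repeat the toy frame so the comparison is meaningful
big_df = pd.concat([df] * 10000, ignore_index=True)
print (timeit.timeit(lambda: Series_replace(big_df.copy()), number=10))
print (timeit.timeit(lambda: Series_str_replace(big_df.copy()), number=10))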
Note that if ids is a dataframe instead of dictionary, you can get the same performance with itertuples():
ids = pd.DataFrame({'Id': ['NYC', 'LA', 'UK'], 'City': ['New York City', 'Los Angeles', 'United Kingdom']})
for row in ids.itertuples():
    df['Comment'] = df['Comment'].str.replace(row.Id, row.City, regex=False)
I have a pandas data frame with zip codes, city, state and country of ~ 600,000 locations. Let's call it my_df
I'd like to look up the corresponding longitude and latitude for each of these locations. Thankfully, there is a database for this. Let's call this dataframe zipdb.
zipdb has, among others, columns for zip codes, city, state and country.
So, I'd like to look up all of the locations (zip, city, state and country) in zipdb.
def zipdb_lookup(zipcode, city, state, country):
    countries_mapping = { "UNITED STATES":"US"
                        , "CANADA":"CA"
                        , "KOREA REP OF":"KR"
                        , "ITALY":"IT"
                        , "AUSTRALIA":"AU"
                        , "CHILE":"CL"
                        , "UNITED KINGDOM":"GB"
                        , "BERMUDA":"BM"
                        }
    try:
        slc = zipdb[ (zipdb.Zipcode == str(zipcode)) &
                     (zipdb.City == str(city).upper()) &
                     (zipdb.State == str(state).upper()) &
                     (zipdb.Country == countries_mapping[country].upper()) ]
        if slc.shape[0] == 1:
            return np.array(slc["Lat"])[0], np.array(slc["Long"])[0]
        else:
            return None
    except:
        return None
I have tried pandas' .apply as well as a for loop to do this.
Both are very slow. I recognize there are a large number of rows, but I can't help but think something faster must be possible.
zipdb = pandas.read_csv("free-zipcode-database.csv") #linked to above
Note: I've also performed this transformation on zipdb:
zipdb["Zipcode"] = zipdb["Zipcode"].astype(str)
Function Call:
#Defined a wrapper function:
def lookup(row):
    """
    :param row:
    :return:
    """
    lnglat = zipdb_lookup(
        zipcode = my_df["organization_zip"][row]
        , city = my_df["organization_city"][row]
        , state = my_df["organization_state"][row]
        , country = my_df["organization_country"][row]
    )
    return lnglat

lnglat = list()
for l in range(0, my_df.shape[0]):
    # if l % 5000 == 0: print(round((float(l) / my_df.shape[0])*100, 2), "%")
    lnglat.append(lookup(row = l))
Sample Data from my_df:
organization_zip organization_city organization_state organization_country
0 60208 EVANSTON IL United Sates
1 77555 GALVESTON TX United Sates
2 23284 RICHMOND VA United Sates
3 53233 MILWAUKEE WI United Sates
4 10036 NEW YORK NY United Sates
5 33620 TAMPA FL United Sates
6 10029 NEW YORK NY United Sates
7 97201 PORTLAND OR United Sates
8 97201 PORTLAND OR United Sates
9 53715 MADISON WI United Sates
Using merge() will be a lot faster than calling a function on every row. Make sure the field types match and strings are stripped:
# prepare your dataframe
data['organization_zip'] = data.organization_zip.astype(str)
data['organization_city'] = data.organization_city.apply(lambda v: v.strip())
# get the zips database
zips = pd.read_csv('/path/to/free-zipcode-database.csv')
zips['Zipcode'] = zips.Zipcode.astype(str)
# left join
# -- prepare common join columns
zips.rename(columns=dict(Zipcode='organization_zip',
                         City='organization_city'),
            inplace=True)
# specify join columns along with zips' columns to copy
cols = ['organization_zip', 'organization_city', 'Lat', 'Long']
data.merge(zips[cols], how='left')
Note you may need to extend the merge columns and/or add more columns to copy from the zips dataframe.
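For example, a sketch of also matching on state (this assumes the zips CSV has a 'State' column with two-letter codes, as used in the question's lookup function):
# normalize the state column the same way as the other join keys
data['organization_state'] = data.organization_state.apply(lambda v: v.strip().upper())
# rename so the join key names line up
zips.rename(columns=dict(State='organization_state'), inplace=True)
cols = ['organization_zip', 'organization_city', 'organization_state', 'Lat', 'Long']
data.merge(zips[cols], how='left')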