Lookup table with 'wildcards' in Pandas - python

I've been looking for an answer to this problem for a few days, but can't find anything similar in other threads.
I have a lookup table to define classification for some input data. The classification depends on continent, country and city. However, some classes may depend on a subset of these variables, e.g. only continent and country (no city). An example of such lookup table is below. In my example, I'm using one and two stars as wildcards:
- One Star: I want all cities in France to be classified as France, and
- Two Stars: All cities in US, excepting New York and San Francisco as USA - Other.
lookup_df = pd.DataFrame({'Continent': ['Europe', 'Europe', 'Asia', 'America', 'America', 'America', 'America', 'Africa'],
                          'Country': ['France', 'Italy', 'Japan', 'USA', 'USA', 'USA', 'Argentina', '*'],
                          'City': ['*', '*', '*', 'New York', 'San Francisco', '**', '*', '*'],
                          'Classification': ['France', 'Italy', 'Japan', 'USA - NY', 'USA - SF', 'USA - Other', 'Argentina', 'Africa']})
If my dataframe is
df = pd.DataFrame({'Continent': ['Europe', 'Europe', 'Asia', 'America ', 'America', 'America', 'Africa'],
                   'Country': ['France', 'Italy', 'Japan', 'USA', 'USA', 'USA', 'Egypt'],
                   'City': ['Paris', 'Rome', 'Tokyo', 'San Francisco', 'Houston', 'DC', 'Cairo']})
I am trying to get the following result:
Continent Country City Classification
0 Europe France Paris France
1 Europe Italy Rome Italy
2 Asia Japan Tokyo Japan
3 America USA San Francisco USA - SF
4 America USA Houston USA - Other
5 America USA DC USA - Other
6 Africa Egypt Cairo Africa
I need to start from a lookup table or similar because it's easier to maintain, easier to explain and it's also used by other processes. I can't create a full table, because I would have to consider all possible cities in the world.
Is there any pythonic way of doing this? I thought I could use pd.merge, but I haven't found any examples of this online.

One easy-to-maintain way to handle your task is to use maps:
import numpy as np
import pandas as pd

df2 = df.copy()
# left-merge: fills df2.Classification where all of Continent, Country and City match exactly, otherwise NaN
df2 = df2.merge(lookup_df, how='left', on=['Continent', 'Country', 'City'])

# map1: rows of lookup_df where City is a wildcard but Country is not
map1 = (lookup_df.loc[lookup_df.City.str.match(r'^\*+$') & ~lookup_df.Country.str.match(r'^\*+$')]
                 .set_index(['Continent', 'Country']).Classification.to_dict())
map1
#{('Europe', 'France'): 'France',
# ('Europe', 'Italy'): 'Italy',
# ('Asia', 'Japan'): 'Japan',
# ('America', 'USA'): 'USA - Other',
# ('America', 'Argentina'): 'Argentina'}

# map2: rows of lookup_df where both City and Country are wildcards
map2 = (lookup_df.loc[lookup_df.City.str.match(r'^\*+$') & lookup_df.Country.str.match(r'^\*+$')]
                 .set_index('Continent').Classification.to_dict())
map2
#{'Africa': 'Africa'}

# fall back from the exact merge result to map1, then to map2
def set_classification(x):
    if pd.notna(x.Classification):          # exact merge matched
        return x.Classification
    if (x.Continent, x.Country) in map1:    # country-level wildcard
        return map1[(x.Continent, x.Country)]
    return map2.get(x.Continent, np.nan)    # continent-level wildcard

# apply the function to each row of df2
df2['Classification'] = df2.apply(set_classification, axis=1)
Note: in your original df, the Continent value in the 4th row contains a trailing space ('America '), which will prevent the merge above from matching that row. You will need to fix this data issue first.
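One way to guard against such stray whitespace before merging is to normalize the key columns first. A minimal sketch (the small frame here is illustrative, not the full df from the question):

```python
import pandas as pd

# illustrative frame reproducing the trailing-space issue from the question
df = pd.DataFrame({'Continent': ['Europe', 'America '],
                   'Country': ['France', 'USA'],
                   'City': ['Paris', 'Houston']})

# strip leading/trailing whitespace from every merge key column
for col in ['Continent', 'Country', 'City']:
    df[col] = df[col].str.strip()

print(df.loc[1, 'Continent'])  # America
```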


How can I get geopy to read addresses from a data frame?

I'm testing this code.
results = [['city', 'state', 'location_raw'],
           ['Juneau', 'AK', """3260 HOSPITAL DR JUNEAU 99801"""],
           ['Palmer', 'AK', """2500 SOUTH WOODWORTH LOOP PALMER 99645"""],
           ['Anchorage', 'AK', """3200 PROVIDENCE DRIVE ANCHORAGE 99508"""]]
df = pd.DataFrame(results)
print(type(df))

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="ryan_app")
for x in range(len(df.index)):
    try:
        location = geolocator.geocode(df['location_raw'].iloc[x])
        print(location.raw)
        df['location_lat_lng'] = location.raw
    except:
        df['location_lat_lng'] = 'can not find this one...'
        print('can not find this one...')
I keep getting this result.
can not find this one...
can not find this one...
can not find this one...
can not find this one...
However, if I pass in an address like this, below, it seems to work fine.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="ryan_app")
for x in range(len(df.index)):
    try:
        location = geolocator.geocode("3200 PROVIDENCE DRIVE ANCHORAGE 99508")
        print(location.raw)
        df['location_lat_lng'] = location.raw
    except:
        df['location_lat_lng'] = 'can not find this one...'
        print('can not find this one...')
Result.
{'place_id': 254070826, 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', 'osm_type': 'way', 'osm_id': 784375677, 'boundingbox': ['61.1873261', '61.1895066', '-149.8220256', '-149.8181122'], 'lat': '61.18841865', 'lon': '-149.82059674184666', 'display_name': 'Providence Alaska Medical Center, 3200, Providence Drive, Anchorage, Alaska, 99508, Anchorage', 'class': 'building', 'type': 'hospital', 'importance': 0.5209999999999999}
I must be missing something simple here, but I'm not sure what it is.
That's because you didn't set the first row as the column headers.
results = [['city', 'state', 'location_raw'],
           ['Juneau', 'AK', """3260 HOSPITAL DR JUNEAU 99801"""],
           ['Palmer', 'AK', """2500 SOUTH WOODWORTH LOOP PALMER 99645"""],
           ['Anchorage', 'AK', """3200 PROVIDENCE DRIVE ANCHORAGE 99508"""]]
df = pd.DataFrame(results)
print(df)
0 1 2
0 city state location_raw
1 Juneau AK 3260 HOSPITAL DR JUNEAU 99801
2 Palmer AK 2500 SOUTH WOODWORTH LOOP PALMER 99645
3 Anchorage AK 3200 PROVIDENCE DRIVE ANCHORAGE 99508
The columns are [0, 1, 2], not ['city', 'state', 'location_raw'], so df['location_raw'] raises a KeyError, which your bare except then swallows. You should add this code after df = pd.DataFrame(results):
headers = df.iloc[0]
df = pd.DataFrame(df.values[1:], columns=headers)
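Alternatively (a sketch of my own, not part of the answer above), you can split the header row off while constructing the DataFrame and skip the reassignment entirely:

```python
import pandas as pd

results = [['city', 'state', 'location_raw'],
           ['Juneau', 'AK', '3260 HOSPITAL DR JUNEAU 99801'],
           ['Palmer', 'AK', '2500 SOUTH WOODWORTH LOOP PALMER 99645'],
           ['Anchorage', 'AK', '3200 PROVIDENCE DRIVE ANCHORAGE 99508']]

# first row becomes the column names, the rest becomes the data
df = pd.DataFrame(results[1:], columns=results[0])
print(list(df.columns))  # ['city', 'state', 'location_raw']
```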
similar question: convert-row-to-column-header-for-pandas-dataframe

Sum columns by key values in another column

I have a pandas DataFrame like this:
city country city_population
0 New York USA 8300000
1 London UK 8900000
2 Paris France 2100000
3 Chicago USA 2700000
4 Manchester UK 510000
5 Marseille France 860000
I want to create a new column country_population by calculating a sum of every city for each country. I have tried:
df['Country population'] = df['city_population'].sum().where(df['country'])
But this won't work; could I have some advice on the problem?
Sounds like you're looking for groupby:
import pandas as pd

data = {
    'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
    'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
    'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000],
}
df = pd.DataFrame.from_dict(data)

# group by country, select the 'city_population' column, then sum
pop = df.groupby('country')['city_population'].sum()
print(pop)
print(pop)
output:
country
France 2960000
UK 9410000
USA 11000000
Name: city_population, dtype: int64
You can then append this Series to the DataFrame (arguably discouraged, though, since it stores information redundantly and doesn't really fit the structure of the original DataFrame):
# add to existing df
pop.rename('country_population', inplace=True)
# how='left' to preserve original ordering of df
df = df.merge(pop, how='left', on='country')
print(df)
output:
city country city_population country_population
0 New York USA 8300000 11000000
1 London UK 8900000 9410000
2 Paris France 2100000 2960000
3 Chicago USA 2700000 11000000
4 Manchester UK 510000 9410000
5 Marseille France 860000 2960000
Based on @Vaishali's comment, a one-liner:
df['Country population'] = df.groupby('country').transform('sum')['city_population']
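A slightly safer variant of that one-liner selects the column before transform, so only city_population is aggregated. A minimal sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
    'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
    'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000],
})

# transform broadcasts each group's sum back onto the original row order
df['country_population'] = df.groupby('country')['city_population'].transform('sum')
print(df.loc[0, 'country_population'])  # 11000000
```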

Is there a way to bin categorical data in pandas?

I've got a dataframe where one column is U.S. states. I'd like to create a new column and bin the states according to region, i.e., South, Southwest, etc. It looks like pd.cut is only used for continuous variables, so binning that way doesn't seem like an option. Is there a good way to create a column that's conditional on categorical data in another column?
import pandas as pd

def label_states(row):
    if row['state'] in ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']:
        return 'north-east'
    if row['state'] in ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']:
        return 'midwest'
    if row['state'] in ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']:
        return 'south'
    return 'etc'

df = pd.DataFrame([{'state': "Illinois", 'data': "aaa"}, {'state': "Rhode Island", 'data': "aba"},
                   {'state': "Georgia", 'data': "aba"}, {'state': "Iowa", 'data': "aba"},
                   {'state': "Connecticut", 'data': "bbb"}, {'state': "Ohio", 'data': "bbb"}])
df['label'] = df.apply(label_states, axis=1)
df
Assume that your df contains:
- State - the US state code,
- other columns; for the test below I included only State Name.
Of course it can contain more columns and more than one row for each state.
To add region names (a new column), define a regions DataFrame containing the columns:
- State - the US state code,
- Region - the region name.
Then merge these DataFrames and save the result back under df:
df = df.merge(regions, on='State')
A part of the result is:
State Name State Region
0 Alabama AL Southeast
1 Arizona AZ Southwest
2 Arkansas AR South
3 California CA West
4 Colorado CO Southwest
5 Connecticut CT Northeast
6 Delaware DE Northeast
7 Florida FL Southeast
8 Georgia GA Southeast
9 Idaho ID Northwest
10 Illinois IL Central
11 Indiana IN Central
12 Iowa IA East North Central
13 Kansas KS South
14 Kentucky KY Central
15 Louisiana LA South
Of course, there are numerous variants of how to assign US states to regions,
so if you want to use other variant, define regions DataFrame according
to your classification.
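If you would rather not maintain a separate file, the regions DataFrame can be built inline from plain lists. A hedged sketch (the state-to-region assignments below are illustrative only, not an authoritative classification):

```python
import pandas as pd

# illustrative mapping only; adjust to your own regional classification
regions = pd.DataFrame({
    'State': ['IL', 'RI', 'GA', 'IA', 'CT', 'OH'],
    'Region': ['Midwest', 'Northeast', 'South', 'Midwest', 'Northeast', 'Midwest'],
})

df = pd.DataFrame({'State': ['IL', 'GA', 'CT'], 'data': ['aaa', 'aba', 'bbb']})

# how='left' keeps every row of df even if a state has no region entry
df = df.merge(regions, on='State', how='left')
print(df['Region'].tolist())  # ['Midwest', 'South', 'Northeast']
```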

Trying to use a list to populate a dataframe column

I have a dataframe (df) and I would like to create a new column called country, calculated by looking at the region column: where the region value is present in the EnglandRegions list, the country value is set to England; otherwise it is the value from the region column.
Please see below for my desired output:
name salary region B1salary country
0 Jason 42000 London 42000 England
1 Molly 52000 South West England
2 Tina 36000 East Midland England
3 Jake 24000 Wales Wales
4 Amy 73000 West Midlands England
You can see that all the values in country are set to England except for the value assigned to Jake's record, which is set to Wales (as Wales is not in the EnglandRegions list). The code below produces the following error:
File "C:/Users/stacey/Documents/scripts/stacey.py", line 20
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
^
SyntaxError: invalid syntax
The code is as follows:
import pandas as pd
import numpy as np

EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'salary': [42000, 52000, 36000, 24000, 73000],
        'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']}
df = pd.DataFrame(data, columns=['name', 'salary', 'region'])
df['B1salary'] = np.where((df['salary'] >= 40000) & (df['salary'] <= 50000), df['salary'], '')
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
print(df)
The specific issue the error references is that you are missing a ] to close your .loc. However, fixing that alone still won't work. Try:
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
This is essentially what you already had in the line above it for B1salary anyways.
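Putting that fix into a minimal runnable sketch (columns trimmed to the ones this line needs):

```python
import numpy as np
import pandas as pd

EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]
df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']})

# np.where takes the boolean mask directly; no .loc indexing needed
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
print(df['country'].tolist())  # ['England', 'England', 'England', 'Wales', 'England']
```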

remove numbers after the string and () along with what is inside

I have a problem removing numbers and parentheses (along with what is inside them) in Python. It was suggested to use str.replace. However, the challenge is that the numbers are not fixed: I only know I need to remove whatever number appears. Likewise for the parentheses: I need to remove the () along with whatever is inside, and the content inside also varies. For instance, if I have the following data set:
import pandas as pd
a = pd.Series({'Country':'China 1', 'Capital': 'Bei Jing'})
b = pd.Series({'Country': 'United States (of American)', 'Capital': 'Washington'})
c = pd.Series({'Country': 'United Kingdom (of Great Britain and Northern Ireland)', 'Capital': 'London'})
d = pd.Series({'Country': 'France 2', 'Capital': 'Paris'})
e = pd.DataFrame([a,b,c,d])
Now in the Country column, the values are 'China 1', 'United States (of American)', 'United Kingdom (of...)' and 'France 2'. After the replacement/removal, I want to get rid of all numbers and of the parentheses along with their content, so that the values in the Country column become 'China', 'United States', 'United Kingdom', 'France'.
You can use str.replace here with a regex (note the raw string and regex=True):
series1.str.replace(r"^([a-zA-Z]+(?:\s+[a-zA-Z]+)*).*", r"\1", regex=True)
See the demo; you can substitute your own series and make other modifications.
https://regex101.com/r/lIScpi/2
You can also modify the frame directly:
import pandas as pd

a = pd.Series({'Country': 'China 1', 'Capital': 'Bei Jing'})
b = pd.Series({'Country': 'United States (of American)', 'Capital': 'Washington'})
c = pd.Series({'Country': 'United Kingdom (of Great Britain and Northern Ireland)', 'Capital': 'London'})
d = pd.Series({'Country': 'France 2', 'Capital': 'Paris'})
e = pd.DataFrame([a, b, c, d])
print(e)
e['Country'] = e['Country'].str.replace(r"^([a-zA-Z]+(?:\s+[a-zA-Z]+)*).*", r"\1", regex=True)
print(e)
Output before the replace:
Capital Country
0 Bei Jing China 1
1 Washington United States (of American)
2 London United Kingdom (of Great Britain and Northern ...
3 Paris France 2
Output after the replace:
Capital Country
0 Bei Jing China
1 Washington United States
2 London United Kingdom
3 Paris France
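An alternative pattern (my own sketch, not from the answer above) removes the unwanted pieces instead of capturing the part to keep:

```python
import pandas as pd

s = pd.Series(['China 1', 'United States (of American)',
               'United Kingdom (of Great Britain and Northern Ireland)', 'France 2'])

# drop any parenthesized group and any run of digits, then trim whitespace
cleaned = s.str.replace(r'\s*\([^)]*\)|\s*\d+', '', regex=True).str.strip()
print(cleaned.tolist())  # ['China', 'United States', 'United Kingdom', 'France']
```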
