I'm trying to apply the dict get() method in a vectorized way, from one column containing dictionaries to another column in the same dataframe. For example, I would like the cities in the address column dictionaries to populate the address.city column.
import pandas as pd

df = pd.DataFrame({'address': [{'city': 'Lake Ashley', 'state': 'MN', 'street': '56833 Baker Branch', 'zip': '15884'},
                               {'city': 'Reginaldfurt', 'state': 'MO', 'street': '045 Bennett Motorway Suite 404', 'zip': '68916'},
                               {'city': 'East Stephaniefurt', 'state': 'VI', 'street': '908 Matthew Ports Suite 313', 'zip': '15956-9706'}],
                   'address.city': [None, None, None],
                   'address.street': [None, None, None]})
I was trying
df['address.city'].apply(df.address.get('city'))
but that doesn't work. I figured I was close since df.address[0].get('city') does extract the city value for that row. As you can imagine I want to do the same for address.street.
I think what you want is the second snippet below. First, though, you can parse the whole address column into separate columns like this:
df.address.apply(pd.Series).add_prefix('address.')
# or
# pd.DataFrame(df.address.tolist()).add_prefix('address.')
address.city address.state address.street address.zip
0 Lake Ashley MN 56833 Baker Branch 15884
1 Reginaldfurt MO 045 Bennett Motorway Suite 404 68916
2 East Stephaniefurt VI 908 Matthew Ports Suite 313 15956-9706
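If you want these as real columns on df (a sketch; it drops the placeholder columns first, since join won't overwrite them):

parsed = df['address'].apply(pd.Series).add_prefix('address.')
df = df.drop(columns=['address.city', 'address.street']).join(parsed)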
This answers your question:
df['address.city'] = df.address.apply(lambda d: d['city'])
df
address address.city address.street
0 {'city': 'Lake Ashley', 'state': 'MN', 'street... Lake Ashley None
1 {'city': 'Reginaldfurt', 'state': 'MO', 'stree... Reginaldfurt None
2 {'city': 'East Stephaniefurt', 'state': 'VI', ... East Stephaniefurt None
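As an aside, the Series.str accessor dispatches to dict keys as well (the same trick shown in the next answer below), so the lambda can be avoided entirely:

df['address.city'] = df['address'].str['city']
df['address.street'] = df['address'].str['street']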
Sorry for the newbie question, I am taking baby steps in Python. My DataFrame has a column address of type object. Each address has a country, like this: {... "city": "...", "state": "...", "country": "..."}. How do I add a column country that's derived from the column address?
Without the data it's difficult to answer, but if the values are Python dicts, applying pd.Series over the column should work:
df['address'].apply(pd.Series)
You will have to assign the result back to the original dataframe, and if the values are JSON strings, you may first want to convert them to dictionaries using json.loads.
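A minimal sketch of that conversion, assuming the address column holds JSON strings rather than dicts:

import json

df['address'] = df['address'].apply(json.loads)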
SAMPLE RUN:
>>> df
x address
0 1 {'city': 'xyz', 'state': 'Massachusetts', 'country': 'US'}
1 2 {'city': 'ABC', 'state': 'LONDON', 'country': 'UK'}
>>> df.assign(country=df['address'].apply(pd.Series)['country'])
x address country
0 1 {'city': 'xyz', 'state': 'Massachusetts', 'country': 'US'} US
1 2 {'city': 'ABC', 'state': 'LONDON', 'country': 'UK'} UK
Even better, use the key directly along with Series.str (its element-wise lookup works on dicts too, and missing keys come back as NaN):
>>> df.assign(country=df['address'].str['country'])
x address country
0 1 {'city': 'xyz', 'state': 'Massachusetts', 'country': 'US'} US
1 2 {'city': 'ABC', 'state': 'LONDON', 'country': 'UK'} UK
My pandas DataFrame has a few missing and bad values. I'd like to replace / fill these by parsing data from a dictionary stored in a pandas Series. Here's an example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
                   'Lat': [32.312, 33.312, np.nan],
'CL': [{'street':'134 Street name',
'city':'City name',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189},
{'street':'134 Street name',
'city':'City name',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189},
{'street':'134 Str',
'city':'phx',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189}]
})
For rows where Lat is np.nan, I'd like to parse the data from CL column. After filling the data from dict, the 2 columns of the row would look like this:
Addr Lat
134 Str phx 85312 34.661056
P.S. In reality the dict is fairly long, so I'd prefer a way to extract only the values that are needed: in this case latitude for Lat, plus street, city and zip, which make up the Addr column.
You can normalize the 'CL' column and join the newly created columns to 'Addr' and 'Lat', then fill 'Lat' from 'latitude' where it's np.nan:
df = df[['Addr', 'Lat']].join(pd.json_normalize(df['CL']))
df.loc[df['Lat'].isna(), 'Lat'] = df.loc[df['Lat'].isna(), 'latitude']
print(df)
Output:
Addr Lat street city zip latitude longitude
0 123 Street, City, 85036 32.312000 134 Street name City name 85312 34.661056 -118.146189
1 234 Street1, City, 85036 33.312000 134 Street name City name 85312 34.661056 -118.146189
2 542js 34.661056 134 Str phx 85312 34.661056 -118.146189
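Since you mentioned the dict is fairly long and you only need a couple of values, you could drop the helper columns once the fill is done (a small cleanup sketch):

df = df[['Addr', 'Lat']]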
Edit: after reading your comments and edited question, it seems you don't want to build such a huge df, but rather work from your dictionary:
your_dict = {'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
'Lat': [32.312, 33.312, np.nan],
'CL': [{'street':'134 Street name',
'city':'City name',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189},
{'street':'134 Street name',
'city':'City name',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189},
{'street':'134 Str',
'city':'phx',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189}]
}
df_lat = pd.Series(your_dict['Lat'])
df_cl = pd.DataFrame(your_dict['CL'])
print(df_cl.loc[df_lat.isna(), ['latitude', 'street', 'city', 'zip']])
That way only rows with 'Lat' initially equal to np.nan will be considered:
latitude street city zip
2 34.661056 134 Str phx 85312
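If you also want to rebuild Addr for those rows (an assumption based on your expected output, which joins street, city and zip with spaces):

mask = df_lat.isna()
addr_fill = df_cl.loc[mask, ['street', 'city', 'zip']].apply(' '.join, axis=1)
# addr_fill holds '134 Str phx 85312' for row 2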
If it's only that one column you want, then:
>>> df['Lat'] = df['Lat'].fillna(pd.DataFrame(df['CL'].tolist())['latitude'])
>>> df
Addr Lat CL
0 123 Street, City, 85036 32.312000 {'street': '134 Street name', 'city': 'City na...
1 234 Street1, City, 85036 33.312000 {'street': '134 Street name', 'city': 'City na...
2 542js 34.661056 {'street': '134 Str', 'city': 'phx', 'zip': '8...
If the dicts are long and you only need a few keys, you can pull just those out with a for loop, convert to a df and then fillna (note the dict key is latitude, not Lat):
keys = []
for i in df['CL'].tolist():
    keys.append({'Lat': i['latitude'], 'street': i['street'], 'city': i['city'], 'zip': i['zip']})
ddf = pd.DataFrame(keys)
df['Lat'] = df['Lat'].fillna(ddf['Lat'])
I'm testing this code.
import pandas as pd

results = [['city', 'state', 'location_raw'],
['Juneau', 'AK', """3260 HOSPITAL DR JUNEAU 99801"""],
['Palmer', 'AK', """2500 SOUTH WOODWORTH LOOP PALMER 99645"""],
['Anchorage', 'AK', """3200 PROVIDENCE DRIVE ANCHORAGE 99508"""]]
df = pd.DataFrame(results)
print(type(df))
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="ryan_app")
for x in range(len(df.index)):
    try:
        location = geolocator.geocode(df['location_raw'].iloc[x])
        print(location.raw)
        df['location_lat_lng'] = location.raw
    except:
        df['location_lat_lng'] = 'can not find this one...'
        print('can not find this one...')
I keep getting this result.
can not find this one...
can not find this one...
can not find this one...
can not find this one...
However, if I pass in an address like this, below, it seems to work fine.
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="ryan_app")
for x in range(len(df.index)):
    try:
        location = geolocator.geocode("3200 PROVIDENCE DRIVE ANCHORAGE 99508")
        print(location.raw)
        df['location_lat_lng'] = location.raw
    except:
        df['location_lat_lng'] = 'can not find this one...'
        print('can not find this one...')
Result.
{'place_id': 254070826, 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', 'osm_type': 'way', 'osm_id': 784375677, 'boundingbox': ['61.1873261', '61.1895066', '-149.8220256', '-149.8181122'], 'lat': '61.18841865', 'lon': '-149.82059674184666', 'display_name': 'Providence Alaska Medical Center, 3200, Providence Drive, Anchorage, Alaska, 99508, Anchorage', 'class': 'building', 'type': 'hospital', 'importance': 0.5209999999999999}
I must be missing something simple here, but I'm not sure what it is.
Because you didn't set the first row as columns.
results = [['city', 'state', 'location_raw'],
['Juneau', 'AK', """3260 HOSPITAL DR JUNEAU 99801"""],
['Palmer', 'AK', """2500 SOUTH WOODWORTH LOOP PALMER 99645"""],
['Anchorage', 'AK', """3200 PROVIDENCE DRIVE ANCHORAGE 99508"""]]
df = pd.DataFrame(results)
print(df)
0 1 2
0 city state location_raw
1 Juneau AK 3260 HOSPITAL DR JUNEAU 99801
2 Palmer AK 2500 SOUTH WOODWORTH LOOP PALMER 99645
3 Anchorage AK 3200 PROVIDENCE DRIVE ANCHORAGE 99508
The columns are [0, 1, 2], not ['city', 'state', 'location_raw'], so df['location_raw'] raises a KeyError, which your bare except then turns into 'can not find this one...'.
You should add this code after df = pd.DataFrame(results):
headers = df.iloc[0]
df = pd.DataFrame(df.values[1:], columns=headers)
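Equivalently, you could split the header row out when building the frame in the first place:

df = pd.DataFrame(results[1:], columns=results[0])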
similar question: convert-row-to-column-header-for-pandas-dataframe
I'm using the example given in the json_normalize documentation (pandas.json_normalize — pandas 1.0.3 documentation). Unfortunately I can't paste my actual JSON, but this example works. Pasted from the documentation:
data = [{'state': 'Florida',
'shortname': 'FL',
'info': {'governor': 'Rick Scott'},
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': {'governor': 'John Kasich'},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}]
from pandas import json_normalize

result = json_normalize(data, 'counties', ['state', 'shortname',
                                           ['info', 'governor']])
result
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
What if the JSON was the one below instead where info is an array instead of a dict:
data = [{'state': 'Florida',
'shortname': 'FL',
'info': [{'governor': 'Rick Scott'},
{'governor': 'Rick Scott 2'}],
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': [{'governor': 'John Kasich'},
{'governor': 'John Kasich 2'}],
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}]
How would you get the following output using json_normalize:
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Dade 12345 Florida FL Rick Scott 2
2 Broward 40000 Florida FL Rick Scott
3 Broward 40000 Florida FL Rick Scott 2
4 Palm Beach 60000 Florida FL Rick Scott
5 Palm Beach 60000 Florida FL Rick Scott 2
6 Summit 1234 Ohio OH John Kasich
7 Summit 1234 Ohio OH John Kasich 2
8 Cuyahoga 1337 Ohio OH John Kasich
9 Cuyahoga 1337 Ohio OH John Kasich 2
Or if there is another way to do it, please do let me know.
json_normalize is designed for convenience rather than flexibility. It can't handle all forms of JSON out there (and JSON is just too flexible to write a universal parser for).
How about calling json_normalize twice and then merging? This assumes each state appears only once in your JSON:
counties = json_normalize(data, 'counties', ['state', 'shortname'])
governors = json_normalize(data, 'info', ['state'])
result = counties.merge(governors, on='state')
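With the sample data this yields the ten county-governor rows shown in the question. One detail (based on how record_path naming works): governors comes out with a plain governor column rather than info.governor, so a rename gets you to the exact expected output:

result = result.rename(columns={'governor': 'info.governor'})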
I've been looking for an answer to this problem for a few days, but can't find anything similar in other threads.
I have a lookup table to define classification for some input data. The classification depends on continent, country and city. However, some classes may depend on a subset of these variables, e.g. only continent and country (no city). An example of such lookup table is below. In my example, I'm using one and two stars as wildcards:
- One star: I want all cities in France to be classified as France, and
- Two stars: all cities in the US except New York and San Francisco to be classified as USA - Other.
lookup_df = pd.DataFrame({'Continent': ['Europe', 'Europe', 'Asia', 'America', 'America', 'America', 'America', 'Africa'],
'Country': ['France', 'Italy', 'Japan', 'USA', 'USA', 'USA', 'Argentina', '*'],
'City': ['*', '*', '*', 'New York', 'San Francisco', '**', '*', '*'],
'Classification': ['France', 'Italy', 'Japan', 'USA - NY', 'USA - SF', 'USA - Other', 'Argentina', 'Africa']})
If my dataframe is
df = pd.DataFrame({'Continent': ['Europe', 'Europe', 'Asia', 'America ', 'America', 'America', 'Africa'],
'Country': ['France', 'Italy', 'Japan', 'USA', 'USA', 'USA', 'Egypt'],
'City': ['Paris', 'Rome', 'Tokyo', 'San Francisco', 'Houston', 'DC', 'Cairo']})
I am trying to get the following result:
Continent Country City Classification
0 Europe France Paris France
1 Europe Italy Rome Italy
2 Asia Japan Tokyo Japan
3 America USA San Francisco USA - SF
4 America USA Houston USA - Other
5 America USA DC USA - Other
6 Africa Egypt Cairo Africa
I need to start from a lookup table or similar because it's easier to maintain, easier to explain and it's also used by other processes. I can't create a full table, because I would have to consider all possible cities in the world.
Is there any pythonic way of doing this? I thought I could use pd.merge, but I haven't found any examples of this online.
One easy-to-maintain way to handle your task is to use maps:
df2 = df.copy()
# the merge below yields a Classification column, filled where Continent,
# Country and City all match a lookup row, otherwise NaN
df2 = df2.merge(lookup_df, how='left', on=['Continent', 'Country', 'City'])
# create map1 from lookup_df where City is '*' but Country is not '*'
map1 = lookup_df.loc[lookup_df.City.str.match(r'^\*+$') & ~lookup_df.Country.str.match(r'^\*+$')].set_index(['Continent', 'Country']).Classification.to_dict()
map1
#{('Europe', 'France'): 'France',
# ('Europe', 'Italy'): 'Italy',
# ('Asia', 'Japan'): 'Japan',
# ('America', 'USA'): 'USA - Other',
# ('America', 'Argentina'): 'Argentina'}
# create map2 from lookup_df where both City and Country are '*'
map2 = lookup_df.loc[lookup_df.City.str.match(r'^\*+$') & lookup_df.Country.str.match(r'^\*+$')].set_index('Continent').Classification.to_dict()
map2
#{'Africa': 'Africa'}
# create a function to define your logic: prefer the exact merge match,
# then the (Continent, Country) map, then the Continent-only map
def set_classification(x):
    if pd.notna(x.Classification):
        return x.Classification
    if (x.Continent, x.Country) in map1:
        return map1[(x.Continent, x.Country)]
    return map2.get(x.Continent, np.nan)
# apply the above function to each row of the df2
df2["Classification"] = df2.apply(set_classification, axis = 1)
Note: your original df.Continent in the 4th row contains a trailing space ('America '), which will keep that row from matching in the merge above. You will need to fix this data issue first.
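A one-line cleanup for that (a sketch; it assumes stray whitespace is the only problem):

df['Continent'] = df['Continent'].str.strip()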