My pandas DataFrame has a few missing and bad values. I'd like to replace/fill them by parsing data from dictionaries stored in a pandas Series. Here's an example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
                   'Lat': [32.312, 33.312, np.nan],
                   'CL': [{'street': '134 Street name',
                           'city': 'City name',
                           'zip': '85312',
                           'latitude': 34.661056,
                           'longitude': -118.146189},
                          {'street': '134 Street name',
                           'city': 'City name',
                           'zip': '85312',
                           'latitude': 34.661056,
                           'longitude': -118.146189},
                          {'street': '134 Str',
                           'city': 'phx',
                           'zip': '85312',
                           'latitude': 34.661056,
                           'longitude': -118.146189}]
                   })
For rows where Lat is np.nan, I'd like to parse the data from the CL column. After filling in the data from the dict, the two columns of that row would look like this:
Addr Lat
134 Str phx 85312 34.661056
P.S. In reality the dict is fairly long, so I'd prefer a way to extract only the values that are needed: in this case latitude (for Lat) and street, city and zip, which make up the Addr column.
You can normalize the 'CL' column and join the newly created columns to 'Addr' and 'Lat'. Then set 'Lat' to the 'latitude' value wherever it is np.nan:
df = df[['Addr', 'Lat']].join(pd.json_normalize(df['CL']))
df.loc[df['Lat'].isna(), 'Lat'] = df.loc[df['Lat'].isna(), 'latitude']
print(df)
Output:
Addr Lat street city zip latitude longitude
0 123 Street, City, 85036 32.312000 134 Street name City name 85312 34.661056 -118.146189
1 234 Street1, City, 85036 33.312000 134 Street name City name 85312 34.661056 -118.146189
2 542js 34.661056 134 Str phx 85312 34.661056 -118.146189
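If you also want to rebuild Addr from the dict fields, as in your desired output, here is a minimal variant of the above (it assumes street, city and zip are all strings, as in the example; note the mask is taken before Lat is filled):
df = df[['Addr', 'Lat']].join(pd.json_normalize(df['CL']))
mask = df['Lat'].isna()                  # remember which rows were missing
df.loc[mask, 'Lat'] = df.loc[mask, 'latitude']
df.loc[mask, 'Addr'] = df.loc[mask, ['street', 'city', 'zip']].agg(' '.join, axis=1)
df = df[['Addr', 'Lat']]                 # drop the helper columns again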
Edit: after reading your comments and the edited question, it seems you don't want to build such a huge df, but rather work directly from your dictionary:
your_dict = {'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
             'Lat': [32.312, 33.312, np.nan],
             'CL': [{'street': '134 Street name',
                     'city': 'City name',
                     'zip': '85312',
                     'latitude': 34.661056,
                     'longitude': -118.146189},
                    {'street': '134 Street name',
                     'city': 'City name',
                     'zip': '85312',
                     'latitude': 34.661056,
                     'longitude': -118.146189},
                    {'street': '134 Str',
                     'city': 'phx',
                     'zip': '85312',
                     'latitude': 34.661056,
                     'longitude': -118.146189}]
             }
df_lat = pd.Series(your_dict['Lat'])
df_cl = pd.DataFrame(your_dict['CL'])
print(df_cl.loc[df_lat.isna(), ['latitude', 'street', 'city', 'zip']])
That way, only rows where 'Lat' was initially np.nan will be considered:
latitude street city zip
2 34.661056 134 Str phx 85312
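To push the parsed values back into your data, a short sketch reusing the variables above:
mask = df_lat.isna()                                       # rows that were originally missing
df_lat = df_lat.fillna(df_cl['latitude'])                  # fill Lat from the dicts
new_addr = df_cl.loc[mask, ['street', 'city', 'zip']].agg(' '.join, axis=1)  # replacement Addr strings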
If it's only that one column you want, then:
>>> df['Lat'] = df['Lat'].fillna(pd.DataFrame(df['CL'].tolist())['latitude'])
>>> df
Addr Lat CL
0 123 Street, City, 85036 32.312000 {'street': '134 Street name', 'city': 'City na...
1 234 Street1, City, 85036 33.312000 {'street': '134 Street name', 'city': 'City na...
2 542js 34.661056 {'street': '134 Str', 'city': 'phx', 'zip': '8...
If the dict is too long for memory, you can parse it with a for loop keeping only the keys you need, convert to a DataFrame, and then fillna (as shown after the loop):
keys = []
for i in df['CL'].tolist():
    # note: the dict key is 'latitude' (there is no 'Lat' key in CL)
    keys.append({'latitude': i['latitude'], 'street': i['street'], 'city': i['city'], 'zip': i['zip']})
ddf = pd.DataFrame(keys)
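Then the fillna step mentioned above is, e.g.:
df['Lat'] = df['Lat'].fillna(ddf['latitude'])  # ddf's default index lines up with df's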
Related
I'm testing this code.
import pandas as pd

results = [['city', 'state', 'location_raw'],
           ['Juneau', 'AK', """3260 HOSPITAL DR JUNEAU 99801"""],
           ['Palmer', 'AK', """2500 SOUTH WOODWORTH LOOP PALMER 99645"""],
           ['Anchorage', 'AK', """3200 PROVIDENCE DRIVE ANCHORAGE 99508"""]]
df = pd.DataFrame(results)
print(type(df))
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="ryan_app")
for x in range(len(df.index)):
    try:
        location = geolocator.geocode(df['location_raw'].iloc[x])
        print(location.raw)
        df['location_lat_lng'] = location.raw
    except:
        df['location_lat_lng'] = 'can not find this one...'
        print('can not find this one...')
I keep getting this result.
can not find this one...
can not find this one...
can not find this one...
can not find this one...
However, if I pass an address in directly, like below, it seems to work fine.
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="ryan_app")
for x in range(len(df.index)):
    try:
        location = geolocator.geocode("3200 PROVIDENCE DRIVE ANCHORAGE 99508")
        print(location.raw)
        df['location_lat_lng'] = location.raw
    except:
        df['location_lat_lng'] = 'can not find this one...'
        print('can not find this one...')
Result.
{'place_id': 254070826, 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', 'osm_type': 'way', 'osm_id': 784375677, 'boundingbox': ['61.1873261', '61.1895066', '-149.8220256', '-149.8181122'], 'lat': '61.18841865', 'lon': '-149.82059674184666', 'display_name': 'Providence Alaska Medical Center, 3200, Providence Drive, Anchorage, Alaska, 99508, Anchorage', 'class': 'building', 'type': 'hospital', 'importance': 0.5209999999999999}
I must be missing something simple here, but I'm not sure what it is.
That's because you didn't set the first row as the column headers.
results = [['city', 'state', 'location_raw'],
['Juneau', 'AK', """3260 HOSPITAL DR JUNEAU 99801"""],
['Palmer', 'AK', """2500 SOUTH WOODWORTH LOOP PALMER 99645"""],
['Anchorage', 'AK', """3200 PROVIDENCE DRIVE ANCHORAGE 99508"""]]
df = pd.DataFrame(results)
print(df)
0 1 2
0 city state location_raw
1 Juneau AK 3260 HOSPITAL DR JUNEAU 99801
2 Palmer AK 2500 SOUTH WOODWORTH LOOP PALMER 99645
3 Anchorage AK 3200 PROVIDENCE DRIVE ANCHORAGE 99508
The columns are [0, 1, 2], not ['city', 'state', 'location_raw'], so df['location_raw'] raises a KeyError, which your bare except swallows and turns into the fallback message.
You should add this code after df = pd.DataFrame(results):
headers = df.iloc[0]
df = pd.DataFrame(df.values[1:], columns=headers)
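Alternatively, you can sidestep the problem when building the DataFrame by splitting the header row off up front:
df = pd.DataFrame(results[1:], columns=results[0])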
Similar question: convert-row-to-column-header-for-pandas-dataframe
I would like to concatenate row values into one row in a dataframe, grouped by one column, and then receive the edited dataframe.
Input Data :
ID F_Name L_Name Address SSN Phone
123 Sam Doe 123 12345 111-111-1111
123 Sam Doe 123 12345 222-222-2222
123 Sam Doe abc345 12345 111-111-1111
123 Sam Doe abc345 12345 222-222-2222
456 Naveen Gupta 456 45678 333-333-3333
456 Manish Gupta 456 45678 333-333-3333
Expected Output Data :
myschema = {
    "ID": "123",
    "F_Name": "Sam",
    "L_Name": "Doe",
    "Address": ["123", "abc345"],
    "Phone": ["111-111-1111", "222-222-2222"],
    "SSN": "12345"
}
{
    "ID": "456",
    "F_Name": ["Naveen", "Manish"],
    "L_Name": "Gupta",
    "Address": "456",
    "Phone": ["333-333-3333"],
    "SSN": "45678"
}
Code tried:
df = pd.read_csv('data.csv')
print(df)
Try groupby() + agg():
myschema = (df.groupby('ID', as_index=False)
              .agg(lambda x: list(set(x))[0] if len(set(x)) == 1 else list(set(x)))
              .to_dict('records'))
OR
If order is important, then aggregate with pd.unique():
myschema = (df.groupby('ID', as_index=False)
              .agg(lambda x: pd.unique(x)[0] if len(pd.unique(x)) == 1 else pd.unique(x).tolist())
              .to_dict('records'))
So in the above code we group the dataframe on 'ID' and aggregate the remaining columns: for each group we find the unique values (via set or pd.unique), keep the single value at position 0 when there is only one, and keep a list otherwise; finally we convert the aggregated result to a list of dictionaries with to_dict('records').
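If the lambda gets hard to read, the same logic can be written as a named helper; a small sketch (same behaviour, only restructured):
def collapse(col):
    # keep a scalar when the group has one unique value, else a list
    vals = pd.unique(col)
    return vals[0] if len(vals) == 1 else vals.tolist()

myschema = df.groupby('ID', as_index=False).agg(collapse).to_dict('records')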
output of myschema:
[{'ID': 123,
'F_Name': 'Sam',
'L_Name': 'Doe',
'Address': ['abc345', '123'],
'SSN': 12345,
'Phone': ['222-222-2222', '111-111-1111']},
{'ID': 456,
'F_Name': ['Naveen', 'Manish'],
'L_Name': 'Gupta',
'Address': '456',
'SSN': 45678,
'Phone': '333-333-3333'}]
I have a dataframe (df) and I would like to create a new column called country, calculated by looking at the region column: where the region value is present in the EnglandRegions list, country is set to England; otherwise it takes the value from the region column.
Please see below for my desired output:
name salary region B1salary country
0 Jason 42000 London 42000 England
1 Molly 52000 South West England
2 Tina 36000 East Midland England
3 Jake 24000 Wales Wales
4 Amy 73000 West Midlands England
You can see that all the values in country are set to England except for the value assigned to Jakes record that is set to Wales (as Wales is not in the EnglandRegions list). The code below produces the following error:
File "C:/Users/stacey/Documents/scripts/stacey.py", line 20
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
^
SyntaxError: invalid syntax
The code is as follows:
import pandas as pd
import numpy as np
EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'salary': [42000, 52000, 36000, 24000, 73000],
'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']}
df = pd.DataFrame(data, columns = ['name', 'salary', 'region'])
df['B1salary'] = np.where((df['salary']>=40000) & (df['salary']<=50000) , df['salary'], '')
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
print(df)
The specific issue the error is referencing is that you are missing a ] to close your .loc. However, fixing this alone won't work anyway, since .loc would filter the rows rather than produce the boolean condition np.where needs. Try:
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
This is essentially the same pattern you already used for B1salary on the line above.
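As an aside, the same result is possible without numpy via Series.mask, if you prefer staying in pandas:
df['country'] = df['region'].mask(df['region'].isin(EnglandRegions), 'England')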
I'm trying to vectorize the get() method from one column containing dictionaries to another column in the same dataframe. For example, I would like the cities in the address column dictionaries to populate the address.city column.
df = pd.DataFrame({'address': [{'city': 'Lake Ashley', 'state': 'MN', 'street': '56833 Baker Branch', 'zip': '15884'},
{'city': 'Reginaldfurt', 'state': 'MO',
'street': '045 Bennett Motorway Suite 404', 'zip': '68916'},
{'city': 'East Stephaniefurt', 'state': 'VI', 'street': '908 Matthew Ports Suite 313', 'zip': '15956-9706'}],
'address.city': [None, None, None],
'address.street': [None, None, None]})
I was trying
df['address.city'].apply(df.address.get('city'))
but that doesn't work. I figured I was close, since df.address[0].get('city') does extract the city value for that row. As you can imagine, I want to do the same for address.street.
The answer to your exact question is at the bottom. First, however, you can parse the whole address column like this:
df.address.apply(pd.Series).add_prefix('address.')
# or
# pd.DataFrame(df.address.tolist()).add_prefix('address.')
address.city address.state address.street address.zip
0 Lake Ashley MN 56833 Baker Branch 15884
1 Reginaldfurt MO 045 Bennett Motorway Suite 404 68916
2 East Stephaniefurt VI 908 Matthew Ports Suite 313 15956-9706
This answers your question:
df['address.city'] = df.address.apply(lambda d: d['city'])
df
address address.city address.street
0 {'city': 'Lake Ashley', 'state': 'MN', 'street... Lake Ashley None
1 {'city': 'Reginaldfurt', 'state': 'MO', 'stree... Reginaldfurt None
2 {'city': 'East Stephaniefurt', 'state': 'VI', ... East Stephaniefurt None
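The same pattern covers address.street; a small sketch filling both target columns in one loop (using .get so a missing key yields None instead of raising):
for key in ['city', 'street']:
    df['address.' + key] = df['address'].apply(lambda d: d.get(key))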
I have a dataframe with duplicate IDs but the data is partially completed in multiple areas.
df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan, np.nan],
[1234, 'Customer A', np.nan, '333 Street', np.nan],
[1234, 'Customer A', '12345 Street', np.nan, np.nan],
[1234, 'Customer A', np.nan, np.nan, np.nan],
[1233, 'Customer B', '444 Street', '3335 Street', np.nan],
[1233, 'Customer B', '555 Street', '666 Street', np.nan],
[1233, 'Customer B', '553 Street', '666 Street', 'abc#email.com'],
[1235, 'Customer C', '1553 Street', '644 Street', 'abc#email.com'],
[1235, 'Customer C', '2553 Street', '644 Street', 'abc#email.com']],
columns=['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Contact'])
df
ID Customer Billing Address Shipping Address Contact
0 1234 Customer A 123 Street NaN NaN
1 1234 Customer A NaN 333 Street NaN
2 1234 Customer A 12345 Street NaN NaN
3 1234 Customer A NaN NaN NaN
4 1233 Customer B 444 Street 3335 Street NaN
5 1233 Customer B 555 Street 666 Street NaN
6 1233 Customer B 553 Street 666 Street abc#email.com
7 1235 Customer C 1553 Street 644 Street abc#email.com
8 1235 Customer C 2553 Street 644 Street abc#email.com
I want to preserve all of the data by creating new numbered columns whenever an ID has more than one value (e.g. Billing Address 1, Billing Address 2).
I tried the following, but it removes data that I want to preserve.
df.drop_duplicates(subset=['ID'], inplace=True)
df
ID Customer Billing Address Shipping Address Contact
0 1234 Customer A 123 Street NaN NaN
4 1233 Customer B 444 Street 3335 Street NaN
7 1235 Customer C 1553 Street 644 Street abc#email.com
EDIT: I added more data because it was unclear from the original post that there can be IDs with multiple rows.
Here's one approach: use apply to create the new columns, building a dict for pd.Series.
In [1057]: cols = ['Billing Address', 'Shipping Address']
In [1058]: (df.groupby(['ID', 'Customer'])
                .apply(lambda g: pd.Series({'%s %s' % (x, i+1): v[x]
                                            for i, v in enumerate(g[cols].to_dict('records'))
                                            for x in v})))
Out[1058]:
Billing Address 1 Billing Address 2 Shipping Address 1 \
ID Customer
1233 Customer B 444 Street 555 Street 333 Street
1234 Customer A 123 Street NaN NaN
Shipping Address 2
ID Customer
1233 Customer B 666 Street
1234 Customer A 333 Street
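For what it's worth, a sketch of an apply-free alternative is to number each ID's rows with cumcount and unstack; all remaining columns get numbered, and IDs with fewer rows simply get NaN in the extra columns:
n = df.groupby(['ID', 'Customer']).cumcount() + 1          # 1-based occurrence number per ID
wide = df.set_index(['ID', 'Customer', n]).unstack()
wide.columns = ['%s %s' % (col, i) for col, i in wide.columns]  # flatten the MultiIndex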
Here is a potential solution, though it is not at all efficient in terms of the memory used in the process.
The idea is to loop over the number of rows a unique ID can have and repeatedly merge your dataframe with the nth occurrence of each ID:
new_df = df.drop_duplicates(subset=['ID'])
temp_df = df.drop(new_df.index)
nth_address = 1
while len(temp_df) > 0:
    temp = temp_df.drop_duplicates(subset=['ID'])
    new_df = new_df.merge(temp, suffixes=('_' + str(nth_address), '_' + str(nth_address + 1)),
                          on='ID', how='left')
    temp_df = temp_df.drop(temp.index)
    nth_address += 1
ID Customer_1 Billing Address_1 Shipping Address_1 Customer_2 Billing Address_2 Shipping Address_2
0 1234 Customer A 123 Street NaN Customer A NaN 333 Street
1 1233 Customer B 444 Street 333 Street Customer B 555 Street 666 Street
To fit your desired output, we need to merge on ['ID', 'Customer'], since in your example they form the same key:
new_df = df.drop_duplicates(subset=['ID'])
temp_df = df.drop(new_df.index)
nth_address = 1
while len(temp_df) > 0:
    temp = temp_df.drop_duplicates(subset=['ID'])
    new_df = new_df.merge(temp, suffixes=('_' + str(nth_address), '_' + str(nth_address + 1)),
                          on=['ID', 'Customer'], how='left')
    temp_df = temp_df.drop(temp.index)
    nth_address += 1
ID Customer Billing Address_1 Shipping Address_1 Billing Address_2 Shipping Address_2
0 1234 Customer A 123 Street NaN NaN 333 Street
1 1233 Customer B 444 Street 333 Street 555 Street 666 Street