How can I get geopy to read addresses from a data frame? - python

I'm testing this code.
import pandas as pd

results = [['city', 'state', 'location_raw'],
           ['Juneau', 'AK', """3260 HOSPITAL DR JUNEAU 99801"""],
           ['Palmer', 'AK', """2500 SOUTH WOODWORTH LOOP PALMER 99645"""],
           ['Anchorage', 'AK', """3200 PROVIDENCE DRIVE ANCHORAGE 99508"""]]
df = pd.DataFrame(results)
print(type(df))

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="ryan_app")
for x in range(len(df.index)):
    try:
        location = geolocator.geocode(df['location_raw'].iloc[x])
        print(location.raw)
        df['location_lat_lng'] = location.raw
    except:
        df['location_lat_lng'] = 'can not find this one...'
        print('can not find this one...')
I keep getting this result.
can not find this one...
can not find this one...
can not find this one...
can not find this one...
However, if I pass in an address directly, as below, it seems to work fine.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="ryan_app")
for x in range(len(df.index)):
    try:
        location = geolocator.geocode("3200 PROVIDENCE DRIVE ANCHORAGE 99508")
        print(location.raw)
        df['location_lat_lng'] = location.raw
    except:
        df['location_lat_lng'] = 'can not find this one...'
        print('can not find this one...')
Result.
{'place_id': 254070826, 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', 'osm_type': 'way', 'osm_id': 784375677, 'boundingbox': ['61.1873261', '61.1895066', '-149.8220256', '-149.8181122'], 'lat': '61.18841865', 'lon': '-149.82059674184666', 'display_name': 'Providence Alaska Medical Center, 3200, Providence Drive, Anchorage, Alaska, 99508, Anchorage', 'class': 'building', 'type': 'hospital', 'importance': 0.5209999999999999}
I must be missing something simple here, but I'm not sure what it is.

That's because you didn't set the first row as the column headers.
results = [['city', 'state', 'location_raw'],
           ['Juneau', 'AK', """3260 HOSPITAL DR JUNEAU 99801"""],
           ['Palmer', 'AK', """2500 SOUTH WOODWORTH LOOP PALMER 99645"""],
           ['Anchorage', 'AK', """3200 PROVIDENCE DRIVE ANCHORAGE 99508"""]]
df = pd.DataFrame(results)
print(df)
           0      1                                       2
0       city  state                            location_raw
1     Juneau     AK           3260 HOSPITAL DR JUNEAU 99801
2     Palmer     AK  2500 SOUTH WOODWORTH LOOP PALMER 99645
3  Anchorage     AK   3200 PROVIDENCE DRIVE ANCHORAGE 99508
The columns are [0, 1, 2], not ['city', 'state', 'location_raw'], so df['location_raw'] raises a KeyError, which your bare except then swallows. You should add this after df = pd.DataFrame(results):
headers = df.iloc[0]
df = pd.DataFrame(df.values[1:], columns=headers)
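Alternatively, you can build the frame with the header row in a single step. A minimal sketch below; note the per-row .at assignment is an assumption about your intent, since your original loop overwrites the whole location_lat_lng column on every iteration:
import pandas as pd
from geopy.geocoders import Nominatim

# Use the first inner list as the header row directly
df = pd.DataFrame(results[1:], columns=results[0])
df['location_lat_lng'] = None  # object column, one result per row

geolocator = Nominatim(user_agent="ryan_app")
for x in range(len(df.index)):
    try:
        location = geolocator.geocode(df['location_raw'].iloc[x])
        # .at writes a single cell, so each row keeps its own result
        df.at[x, 'location_lat_lng'] = location.raw
    except Exception:
        df.at[x, 'location_lat_lng'] = 'can not find this one...'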
similar question: convert-row-to-column-header-for-pandas-dataframe

Related

Parse data from a dict with condition - pandas dataframe

My pandas DataFrame has a few missing and bad values. I'd like to replace/fill these by parsing data from a dictionary stored in a pandas Series. Here's an example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
                   'Lat': [32.312, 33.312, np.nan],
                   'CL': [{'street': '134 Street name',
                           'city': 'City name',
                           'zip': '85312',
                           'latitude': 34.661056,
                           'longitude': -118.146189},
                          {'street': '134 Street name',
                           'city': 'City name',
                           'zip': '85312',
                           'latitude': 34.661056,
                           'longitude': -118.146189},
                          {'street': '134 Str',
                           'city': 'phx',
                           'zip': '85312',
                           'latitude': 34.661056,
                           'longitude': -118.146189}]
                   })
For rows where Lat is np.nan, I'd like to parse the data from the CL column. After filling in the data from the dict, the two columns of that row would look like this:
               Addr        Lat
  134 Str phx 85312  34.661056
P.S. In reality, the dict is fairly long, so I'd prefer a way to extract only the values that are needed: in this case latitude (for Lat), plus street, city, and zip (which make up the Addr column).
You can normalize the 'CL' column and join the newly created columns to 'Addr' and 'Lat'. Then fill the values of Lat from 'latitude' where Lat is np.nan:
df = df[['Addr', 'Lat']].join(pd.json_normalize(df['CL']))
df.loc[df['Lat'].isna(), 'Lat'] = df.loc[df['Lat'].isna(), 'latitude']
print(df)
Output:
                       Addr        Lat           street       city    zip   latitude   longitude
0   123 Street, City, 85036  32.312000  134 Street name  City name  85312  34.661056 -118.146189
1  234 Street1, City, 85036  33.312000  134 Street name  City name  85312  34.661056 -118.146189
2                     542js  34.661056          134 Str        phx  85312  34.661056 -118.146189
Edit: after reading your comments and the edited question, it seems you don't want to build such a huge df, but rather work from your dictionary:
your_dict = {'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
             'Lat': [32.312, 33.312, np.nan],
             'CL': [{'street': '134 Street name',
                     'city': 'City name',
                     'zip': '85312',
                     'latitude': 34.661056,
                     'longitude': -118.146189},
                    {'street': '134 Street name',
                     'city': 'City name',
                     'zip': '85312',
                     'latitude': 34.661056,
                     'longitude': -118.146189},
                    {'street': '134 Str',
                     'city': 'phx',
                     'zip': '85312',
                     'latitude': 34.661056,
                     'longitude': -118.146189}]
             }
df_lat = pd.Series(your_dict['Lat'])
df_cl = pd.DataFrame(your_dict['CL'])
print(df_cl.loc[df_lat.isna(), ['latitude', 'street', 'city', 'zip']])
That way only rows with 'Lat' initially equal to np.nan will be considered:
    latitude   street city    zip
2  34.661056  134 Str  phx  85312
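If you then want to write those values back into 'Lat' (an assumption about your end goal), a short follow-up sketch:
# Fill the missing Lat values in place from the parsed 'latitude' column
df_lat = df_lat.fillna(df_cl['latitude'])
print(df_lat)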
If it's only that one column you want, then:
>>> df['Lat'] = df['Lat'].fillna(pd.DataFrame(df['CL'].tolist())['latitude'])
>>> df
                       Addr        Lat                                                 CL
0   123 Street, City, 85036  32.312000  {'street': '134 Street name', 'city': 'City na...
1  234 Street1, City, 85036  33.312000  {'street': '134 Street name', 'city': 'City na...
2                     542js  34.661056  {'street': '134 Str', 'city': 'phx', 'zip': '8...
If the dict is too long for memory, then you can parse it with a for loop, convert to a df, and then fillna:
keys = []
for i in df['CL'].tolist():
    # the dicts store the coordinate under 'latitude', not 'Lat'
    keys.append({'Lat': i['latitude'], 'street': i['street'], 'city': i['city'], 'zip': i['zip']})
ddf = pd.DataFrame(keys)
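The snippet stops at building ddf; a possible final step (an assumption, mirroring the fillna approach above):
# Fill the missing Lat values from the per-row parsed dicts
df['Lat'] = df['Lat'].fillna(ddf['Lat'])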

How to perform a rowwise function on a column of data and append the output of the function to a Pandas data.frame?

I am very new to Python (I generally use R). I have a list of addresses that I need to normalize. I would like to do the following:
1. Normalize each address in the list using the scourgify package.
2. Append the data to my original data.frame.
On GitHub, it is easy to see how to do one address, but how would I do this for a list or vector of addresses?
from scourgify import normalize_address_record

normalize_address_record('123 southwest Main street, Boring, or, 97203')
normalize_address_record({
    'address_line_1': '123 southwest Main street',
    'address_line_2': 'unit 2',
    'city': 'Boring',
    'state': 'or',
    'postal_code': '97203'
})
Expected output:
{
    'address_line_1': '123 SW MAIN ST',
    'address_line_2': 'UNIT 2',
    'city': 'BORING',
    'state': 'OR',
    'postal_code': '97203'
}
Here is some dummy data based on my original dataset:
import pandas as pd

# initialize list of lists
data = [['a', '123 southwest Main street, Boring, or, 97203'],
        ['b', '4285 balsam la n plymouth mn 55441'],
        ['c', '632 bloomington ave s minneapolis mn 55417']]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['id', 'address_original'])
I need the output in tabular format appended to my original data.frame.
Any help would be greatly appreciated.
Try:
import pandas as pd
from scourgify import normalize_address_record

data = [
    ["a", "123 southwest Main street, Boring, or, 97203"],
    ["b", "4285 balsam la n plymouth mn 55441"],
    ["c", "632 bloomington ave s minneapolis mn 55417"],
]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=["id", "address_original"])

# Normalize each address; each returned dict becomes a set of new columns
df_out = pd.concat(
    [
        df,
        df["address_original"].apply(normalize_address_record).apply(pd.Series),
    ],
    axis=1,
)
print(df_out)
Prints:
  id                              address_original         address_line_1 address_line_2         city state postal_code
0  a  123 southwest Main street, Boring, or, 97203         123 SW MAIN ST           None       BORING    OR       97203
1  b            4285 balsam la n plymouth mn 55441       4285 BALSAM LA N           None     PLYMOUTH    MN       55441
2  c    632 bloomington ave s minneapolis mn 55417  632 BLOOMINGTON AVE S           None  MINNEAPOLIS    MN       55417
EDIT: To handle errors gracefully:
def fn(x):
    try:
        return normalize_address_record(x)
    except Exception:
        # an unparseable address yields an empty dict -> NaN columns for that row
        return {}

df_out = pd.concat(
    [
        df,
        df["address_original"].apply(fn).apply(pd.Series),
    ],
    axis=1,
)
print(df_out)

How to remove rows from one Excel file based on data from a second Excel file with pandas

I have one Excel file with school data such as address, school name, principal's name, etc., and a second Excel file with address, school name, rating, telephone number, etc. The question is: how can I delete particular rows in the first Excel file based on the addresses in the second?
first excel file:
   Unnamed: 0                       School                                        Address
0           0      Alabama School For Deaf        205 E South Street, Talladega, AL 35160
1           1          Helen Keller School  1101 Fort Lashley Avenue, Talladega, AL 35160
2           2              Tutwiler Prison    1209 Fort Lashley Ave., Talladega, AL 35160
3           3  Alabama School Of Fine Arts          8966 Us Hwy 231 N, Wetumpka, AL 36092
second:
                           School_Name  ...                                        Address
0                     Pine View School  ...          0 Mp 1361 Ak Hwy, Dot Lake, AK 99737
1     A.D. Henderson University School  ...             1 168 3Rd Avenue, Eagle, AK 99738
2  School For Advanced Studies - South  ...           2 249 Jon Summar Way, Tok, AK 99780
3                             Tutwiler       3  1209 Fort Lashley Ave., Talladega, AL 35160
the output must be:
   Unnamed: 0                       School                                        Address
0           0      Alabama School For Deaf        205 E South Street, Talladega, AL 35160
1           1          Helen Keller School  1101 Fort Lashley Avenue, Talladega, AL 35160
3           3  Alabama School Of Fine Arts          8966 Us Hwy 231 N, Wetumpka, AL 36092
I tried to use a for loop and pandas:
import pandas as pd
from pandas import ExcelWriter

writer = pd.ExcelWriter('US1234.xlsx', engine='xlsxwriter')
data = []
data_schools = []
df = pd.read_excel('DZ13288pubprin.xlsx')
lists = [[] for i in range(2)]
states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY',
          'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
          'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
print(len(states))

def checking_top_100(nameofschool):
    for i in states:
        df2 = pd.read_excel('TOP-100.xlsx', sheet_name=[i])
        for a in df2[i]['SchoolName']:
            if nameofschool in a:
                pass
            else:
                return nameofschool

def sort_by_value(state, index):
    for i in range(len(df.SchoolName)):
        if df.LocationState[i] == state:
            # print(df.SchoolName[i])
            school_name = checking_top_100(df.SchoolName[i])
            lists[index].append(school_name)
            lists[index].append(
                df.LocationAddress[i] + ', ' + df.LocationCity[i] + ', ' + df.LocationState[i] + ' ' + df.LocationZip[i])
            # lists[index].append(df.EmailAddress[i])
    print(lists[index][0::2])

def data_to_excel(state, index):
    dfi = pd.DataFrame({
        'SchoolName': lists[index][0::2],
        # 'Principal Name': lists[index][1::3],
        # 'Email Address': lists[index][2::3],
        'Address': lists[index][1::2]
    })
    dfi.to_excel(writer, sheet_name=state)

# checking_top_100()
for i in range(len(states)):
    sort_by_value(states[i], i)
    data_to_excel(states[i], i)
writer.save()
I suggest you take a look at pandas.DataFrame.isin (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html). It returns a boolean array (True or False) depending on whether each address is found in the second dataframe, so you can then use boolean indexing to keep only the rows whose address is not found there.
In other words, you could do something like:
dataframe1[~dataframe1.Address.isin(dataframe2.Address)]
This should give you the result you want.
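For illustration, a minimal sketch with toy stand-ins for the two files (the frames and column names here are assumptions based on the question's printouts):
import pandas as pd

# Toy versions of the two Excel files
df1 = pd.DataFrame({'School': ['Alabama School For Deaf', 'Tutwiler Prison'],
                    'Address': ['205 E South Street, Talladega, AL 35160',
                                '1209 Fort Lashley Ave., Talladega, AL 35160']})
df2 = pd.DataFrame({'School_Name': ['Tutwiler'],
                    'Address': ['1209 Fort Lashley Ave., Talladega, AL 35160']})

# Keep only rows whose Address does not appear in the second file
print(df1[~df1.Address.isin(df2.Address)])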

Lookup table with 'wildcards' in Pandas

I've been looking for an answer to this problem for a few days, but can't find anything similar in other threads.
I have a lookup table that defines the classification for some input data. The classification depends on continent, country, and city. However, some classes may depend on a subset of these variables, e.g. only continent and country (no city). An example of such a lookup table is below. In my example, I'm using one and two stars as wildcards:
- One Star: I want all cities in France to be classified as France, and
- Two Stars: All cities in US, excepting New York and San Francisco as USA - Other.
lookup_df = pd.DataFrame({'Continent': ['Europe', 'Europe', 'Asia', 'America', 'America', 'America', 'America', 'Africa'],
                          'Country': ['France', 'Italy', 'Japan', 'USA', 'USA', 'USA', 'Argentina', '*'],
                          'City': ['*', '*', '*', 'New York', 'San Francisco', '**', '*', '*'],
                          'Classification': ['France', 'Italy', 'Japan', 'USA - NY', 'USA - SF', 'USA - Other', 'Argentina', 'Africa']})
If my dataframe is
df = pd.DataFrame({'Continent': ['Europe', 'Europe', 'Asia', 'America ', 'America', 'America', 'Africa'],
                   'Country': ['France', 'Italy', 'Japan', 'USA', 'USA', 'USA', 'Egypt'],
                   'City': ['Paris', 'Rome', 'Tokyo', 'San Francisco', 'Houston', 'DC', 'Cairo']})
I am trying to get the following result:
  Continent Country           City Classification
0    Europe  France          Paris         France
1    Europe   Italy           Rome          Italy
2      Asia   Japan          Tokyo          Japan
3   America     USA  San Francisco       USA - SF
4   America     USA        Houston    USA - Other
5   America     USA             DC    USA - Other
6    Africa   Egypt          Cairo         Africa
I need to start from a lookup table or similar because it's easier to maintain, easier to explain and it's also used by other processes. I can't create a full table, because I would have to consider all possible cities in the world.
Is there any pythonic way of doing this? I thought I could use pd.merge, but I haven't found any examples of this online.
One easy-to-maintain way to handle your task is to use maps:
import numpy as np

df2 = df.copy()

# the merge yields a df2.Classification column holding the value when all of
# "Continent", "Country" and "City" match exactly, otherwise np.nan
df2 = df2.merge(lookup_df, how='left', on=["Continent", "Country", "City"])

# create map1 from lookup_df when City is '*' but Country is not '*'
map1 = lookup_df.loc[lookup_df.City.str.match(r'^\*+$') & ~lookup_df.Country.str.match(r'^\*+$')].set_index(['Continent', 'Country']).Classification.to_dict()
map1
#{('Europe', 'France'): 'France',
# ('Europe', 'Italy'): 'Italy',
# ('Asia', 'Japan'): 'Japan',
# ('America', 'USA'): 'USA - Other',
# ('America', 'Argentina'): 'Argentina'}

# create map2 from lookup_df when both City and Country are '*'
map2 = lookup_df.loc[lookup_df.City.str.match(r'^\*+$') & lookup_df.Country.str.match(r'^\*+$')].set_index('Continent').Classification.to_dict()
map2
#{'Africa': 'Africa'}

# a function to encode the precedence: exact match, then (Continent, Country), then Continent
def set_classification(x):
    return x.Classification if x.Classification is not np.nan else \
           map1[(x.Continent, x.Country)] if (x.Continent, x.Country) in map1 else \
           map2[x.Continent] if x.Continent in map2 else \
           np.nan

# apply the above function to each row of df2
df2["Classification"] = df2.apply(set_classification, axis=1)
Note: your original df.Continent on the 4th row contains an extra trailing space ('America '), which will make the df2 = df2.merge(...) line above miss that row. You will need to fix this data issue first.
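A one-line cleanup would handle it (a sketch, assuming stray whitespace is the only problem):
# Strip leading/trailing whitespace before merging
df['Continent'] = df['Continent'].str.strip()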

Trying to use a list to populate a dataframe column

I have a dataframe (df) and I would like to create a new column called country, which is calculated by looking at the region column: where the region value is present in the EnglandRegions list, the country value is set to England; otherwise it is the value from the region column.
Please see below for my desired output:
    name  salary         region B1salary  country
0  Jason   42000         London    42000  England
1  Molly   52000     South West           England
2   Tina   36000   East Midland           England
3   Jake   24000          Wales             Wales
4    Amy   73000  West Midlands           England
You can see that all the values in country are set to England except for the value assigned to Jake's record, which is set to Wales (as Wales is not in the EnglandRegions list). The code below produces the following error:
File "C:/Users/stacey/Documents/scripts/stacey.py", line 20
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
^
SyntaxError: invalid syntax
The code is as follows:
import pandas as pd
import numpy as np
EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'salary': [42000, 52000, 36000, 24000, 73000],
        'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']}
df = pd.DataFrame(data, columns=['name', 'salary', 'region'])
df['B1salary'] = np.where((df['salary'] >= 40000) & (df['salary'] <= 50000), df['salary'], '')
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
print(df)
The specific issue the error is referencing is that you are missing a ] to close your .loc[...]. However, fixing that alone won't work either: df.loc[...] returns a filtered DataFrame, while np.where needs a boolean mask the same length as df. Try:
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
This is essentially what you already had in the line above it for B1salary anyways.
