I have been using geopy for this, but since I am working with a dataframe, the data often ranges from a few hundred to thousands of rows. Reverse-geocoding every Lat+Long pair into a state name takes a long time, and since my application is hosted on Heroku, the request gets timed out every single time.
Here's my code:
from geopy.geocoders import Nominatim

def fetchState(col1, col2):
    geolocator = Nominatim(user_agent="geoapiExercises")
    location = geolocator.reverse(str(col1) + "," + str(col2))
    return location.raw['address']['state']

df['Tag2'] = df.apply(lambda x: fetchState(x['Latitude'], x['Longitude']), axis=1)
sample values:
Latitude Longitude
39.1157643 -94.6306511
39.1168735 -94.7547431
39.0669827 -94.5849949
39.1280774 -94.8245061
I also happen to have a full address column, with sample data:
Fulladdress
812 Minnesota Ave, Kansas City, KS 66101
7530 State Ave, Kansas City, KS 66112
3255 Main St, Kansas City, MO 64111
10555 Parallel Pkwy, Kansas City, KS 66111
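Since the Fulladdress column already ends in a two-letter state code and a ZIP, one way to avoid the per-row Nominatim calls (and the Heroku timeout) entirely is to parse the state out of the address string locally. A minimal sketch, assuming every address follows the "..., City, ST 12345" pattern shown above; STATE_NAMES is a hypothetical lookup covering only the sampled states:

import re

# Hypothetical abbreviation-to-name lookup; extend to all fifty states as needed.
STATE_NAMES = {'KS': 'Kansas', 'MO': 'Missouri'}

def state_from_address(addr):
    # The state abbreviation sits between the last comma and the ZIP code,
    # e.g. "812 Minnesota Ave, Kansas City, KS 66101" -> "KS".
    match = re.search(r',\s*([A-Z]{2})\s+\d{5}', addr)
    return STATE_NAMES.get(match.group(1)) if match else None

df['Tag2'] = df['Fulladdress'].apply(state_from_address)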
I have a bit of code which pulls the latitude and longitude for a location. It is here:
import urllib.parse
import requests

address = 'New York University'
url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) + '?format=json'
response = requests.get(url).json()
print(response[0]["lat"])
print(response[0]["lon"])
I want to apply this as a function to a long column of "address".
I've seen loads of questions about 'apply' and 'map', but they're almost all simple math examples.
Here is what I tried last night:
def locate(address):
    response = requests.get(url).json()
    print(response[0]["lat"])
    print(response[0]["lon"])
    return
df['lat'] = df['lat'].map(locate)
df['lon'] = df['lon'].map(locate)
This ended up just applying the first row lat / lon to the entire csv.
What is the best method to turn the code into a custom function and apply it to each row?
Thanks in advance.
EDIT: Thank you @PacketLoss for your assistance. I'm getting an IndexError: list index out of range, but it does work on his sample dataframe.
Here is the read_csv I used to pull in the data:
df = pd.read_csv('C:\\Users\\CIHAnalyst1\\Desktop\\InstitutionLocations.csv', sep=',', error_bad_lines=False, index_col=False, dtype='unicode', encoding = "utf-8", warn_bad_lines=False)
Here is a text copy of the rows from the dataframe:
address
0 GRAND CANYON UNIVERSITY
1 SOUTHERN NEW HAMPSHIRE UNIVERSITY
2 WESTERN GOVERNORS UNIVERSITY
3 FLORIDA INTERNATIONAL UNIVERSITY - UNIVERSITY ...
4 PENN STATE UNIVERSITY UNIVERSITY PARK
... ...
4292 THE ART INSTITUTES INTERNATIONAL LLC
4293 INTERCOAST - ONLINE
4294 CAROLINAS COLLEGE OF HEALTH SCIENCES
4295 DYERSBURG STATE COMMUNITY COLLEGE COVINGTON
4296 ULTIMATE MEDICAL ACADEMY - NY
You need to return your values from your function, or nothing will happen.
We can use apply here and pass the address from the df as well.
import urllib.parse
import pandas as pd
import requests

data = {'address': ['New York University', 'Sydney Opera House', 'Paris', 'SupeRduperFakeAddress']}
df = pd.DataFrame(data)

def locate(row):
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(row['address']) + '?format=json'
    response = requests.get(url).json()
    if response:
        row['lat'] = response[0]['lat']
        row['lon'] = response[0]['lon']
    return row

df = df.apply(locate, axis=1)
Outputs
address lat lon
0 New York University 40.72925325 -73.99625393609625
1 Sydney Opera House -33.85719805 151.21512338473752
2 Paris 48.8566969 2.3514616
3 SupeRduperFakeAddress NaN NaN
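One caveat when running this over thousands of rows: the public Nominatim service asks for at most about one request per second and an identifying User-Agent, so a throttled variant may avoid failing partway through. A minimal sketch of the same function with a delay added (the User-Agent string is a placeholder):

import time

def locate(row):
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(row['address']) + '?format=json'
    response = requests.get(url, headers={'User-Agent': 'my-geocoding-script'}).json()
    if response:
        row['lat'] = response[0]['lat']
        row['lon'] = response[0]['lon']
    time.sleep(1)  # stay under ~1 request/second, per Nominatim's usage policy
    return row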
Let's say I have some customer data. The data is generated by the customer and it's messy, so they put their city into either the city or county field or both! That means I may need to check both columns to find out which city they are from.
mydf = pd.DataFrame({'name': ['jim', 'jon'],
                     'city': ['new york', ''],
                     'county': ['', 'los angeles']})
print(mydf)
name city county
0 jim new york
1 jon los angeles
And I am using an api to get their zipcode. There is a different api function for each city, and it returns the zipcode for the customer's address, e.g. 123 main street, new york. I haven't included the full address here to save time.
# api for new york addresses
def get_NY_zipcode_api():
    return 'abc123'

# api for chicago addresses
def get_CH_zipcode_api():
    return 'abc124'

# api for los angeles addresses
def get_LA_zipcode_api():
    return 'abc125'

# api for miami addresses
def get_MI_zipcode_api():
    return 'abc126'
Depending on the city, I will call a different api. So for now, I am checking if city == x or county == x and calling api_x:
def myfunc(row):
    city = row['city']
    county = row['county']
    if city == 'chicago' or county == 'chicago':
        # call chicago api
        zipcode = get_CH_zipcode_api()
        return zipcode
    elif city == 'new york' or county == 'new york':
        # call new york api
        zipcode = get_NY_zipcode_api()
        return zipcode
    elif city == 'los angeles' or county == 'los angeles':
        # call los angeles api
        zipcode = get_LA_zipcode_api()
        return zipcode
    elif city == 'miami' or county == 'miami':
        # call miami api
        zipcode = get_MI_zipcode_api()
        return zipcode
And I apply() this to the df and get my results:
mydf['result'] = mydf.apply(myfunc,axis=1)
print(mydf)
name city county result
0 jim new york abc123
1 jon los angeles abc125
I actually have about 30 cities and therefore 30 conditions to check, so I want to avoid a long list of elif statements. What would be the most efficient way to do this?
I found some suggestions from a similar stack overflow question. Such as creating a dictionary with key:city and value:function and calling it based on city:
operationFuncs = {
    'chicago': get_CH_zipcode_api,
    'new york': get_NY_zipcode_api,
    'los angeles': get_LA_zipcode_api,
    'miami': get_MI_zipcode_api
}
But as far as I can see this only works if I am checking a single column / single condition. I can't see how it can work with if city == x or county == x
mydf['result'] = mydf.apply(lambda row: (operationFuncs.get(row['county']) or operationFuncs.get(row['city']))(), axis=1)
I think you are referring to this. You can just perform this operation twice for city and county and save the result in two different variables, for each zipcode respectively. You can then compare the results and decide what to do if they differ (I am not sure if this can be the case with your dataset).
Since the dictionary lookup is O(1), and I assume your get_MI_zipcode_api() isn't any more expensive, this has no real performance drawbacks.
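Spelled out, a minimal sketch of that two-lookup version (lookup_zipcode is a hypothetical helper name; it returns None when neither column matches):

def lookup_zipcode(row):
    # Look up the api function once per column, then call whichever matched.
    city_fn = operationFuncs.get(row['city'])
    county_fn = operationFuncs.get(row['county'])
    fn = city_fn or county_fn
    return fn() if fn else None

mydf['result'] = mydf.apply(lookup_zipcode, axis=1)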
Maybe not the most elegant solution but you could use the dict approach and just call it twice, once on city and once on county. The second would overwrite the first but the same is true of your if block, and this would only be a problem if you had city='New York' county ='Chicago' for example which I assume cannot occur.
Or you could use the dict and iterate through it, this seems unnecessary though.
for key, fn in fdict.items():
    if key in (city, county):
        fn()
I'd do this join in SQL before reading in the data; I'm sure there's a way to do the same in pandas, but I was trying to make suggestions that build on your existing research, even if they are not the best.
If it's guaranteed that the value will be present in either city or county and not in both, then you can merge both columns together into one.
df['region'] = df['city'] + df['county']
Then create a mapping from region to zipcode, instead of a mapping from city to api function. Since there are only 30 unique values, you can store each city's zipcode once rather than calling the zipcode functions each time, as making an api call is expensive.
mappings = {
    'chicago': 'abc123',
    'new york': 'abc234',
    'los angeles': 'abc345',
    'miami': 'abc456'
}
Create a dataframe using this dictionary & then merge with the original dataframe
mappings_df = pd.DataFrame(list(mappings.items()), columns=['region', 'zipcode'])
df.merge(mappings_df, how='left', on='region')
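For the sample frame above, the merged result would look roughly like this (zipcodes taken from the mappings dict):

  name      city       county       region zipcode
0  jim  new york                  new york  abc234
1  jon            los angeles  los angeles  abc345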
Hope this helps!!
You need a relation table which can be represented by a dict.
df = pd.DataFrame({'name': ['jim', 'jon'],
                   'city': ['new york', ''],
                   'county': ['', 'los angeles']})
df['region'] = df['city'] + df['county']
table = {'new york': 'abc123', 'chicago': 'abc124', 'los angeles': 'abc125', 'miami': 'abc126'}
df['region'] = df.region.apply(lambda region: table[region])
print(df)
Output
name city county region
0 jim new york abc123
1 jon los angeles abc125
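(For a plain dict lookup like this, df['region'].map(table) does the same thing more idiomatically, and yields NaN instead of raising a KeyError when a region is missing from the table.)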
I have the following data in a dataframe:
... location amount
... Jacksonville, FL 40
... Provo, UT 20
... Montgomery, AL 22
... Los Angeles, CA 34
My dataset only contains U.S. cities in the form of [city name, state code] and I have no ZIP codes.
I want to determine the county of each city, in order to visualize my data with ggcounty (like here).
I looked on the website of the U.S. Census Bureau but couldn't really find a table of city, state, county, or similar.
Assuming that I would prefer solving the problem in R only, does anyone have an idea how to solve this?
You can get a ZIP code and more detailed info doing this:
library(ggmap)
revgeocode(as.numeric(geocode('Jacksonville, FL ')))
Hope it helps
I'm a rookie with pandas (and Python, for that matter) and I'm working with a 311 dataset. The output I'm trying to get is a plot with 5 time series, one for each NYC borough, where each point in the plot represents the total number of complaints for each "created date" in that period of time. My data is as follows:
Created Date          Agency Name                      Complaint Type           Borough
2013-08-30 23:58:55   New York City Police Department  Noise - Vehicle          BROOKLYN
2013-08-30 23:58:28   New York City Police Department  Noise - Vehicle          QUEENS
2013-08-30 23:57:46   New York City Police Department  Noise - Street/Sidewalk  MANHATTAN
2013-08-30 23:55:07   New York City Police Department  Noise - Street/Sidewalk  QUEENS
2013-08-30 23:55:06   New York City Police Department  Noise - Commercial       MANHATTAN
X = created date, Y = total number of complaints.
My code so far (after looking at some Stack Overflow questions and libraries):
import sys
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(sys.argv[1], parse_dates=True)
df.set_index("Created Date", inplace=True)
df2 = df[["Borough", "Complaint Type"]]
df3 = df2.groupby("Complaint Type").count()
df3.plot()
plt.show()
(resulting plot: http://imgur.com/D9jrYLf)
I made some changes, but it still doesn't work:
df=pd.read_csv(sys.argv[1], parse_dates=True)
df.set_index("Created Date", inplace=True)
df2=df[["Borough","Complaint Type"]]
df3=df[df2.groupby("Complaint Type")].count()
df3.plot()
I'd really appreciate any help. :)
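For reference, one way to get one line per borough is to group by both the date and the Borough column, then unstack the borough level into columns so each borough becomes its own series. A minimal sketch, assuming the CSV has "Created Date" and "Borough" columns as shown above:

import sys
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(sys.argv[1], parse_dates=['Created Date'])

# Count complaints per (day, borough), then pivot the boroughs into columns.
counts = (df.groupby([df['Created Date'].dt.date, 'Borough'])
            .size()
            .unstack('Borough', fill_value=0))

counts.plot()  # one time series per borough
plt.xlabel('Created Date')
plt.ylabel('Total complaints')
plt.show()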