I have been using geopy to do this, but since I am working with a dataframe, the data often ranges from a few hundred to thousands of rows. It takes a long time to turn every Lat+Long pair into a state name, and since my application is hosted on Heroku, the request gets timed out every single time.
here's my code:
from geopy.geocoders import Nominatim

def fetchState(col1, col2):
    geolocator = Nominatim(user_agent="geoapiExercises")
    location = geolocator.reverse(str(col1) + "," + str(col2))
    return location.raw['address']['state']

df['Tag2'] = df.apply(lambda x: fetchState(x['Latitude'], x['Longitude']), axis=1)
sample values:
Latitude Longitude
39.1157643 -94.6306511
39.1168735 -94.7547431
39.0669827 -94.5849949
39.1280774 -94.8245061
I also happen to have a full address column with sample data:
Fulladdress
812 Minnesota Ave, Kansas City, KS 66101
7530 State Ave, Kansas City, KS 66112
3255 Main St, Kansas City, MO 64111
10555 Parallel Pkwy, Kansas City, KS 66111
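Incidentally, since the full address column already ends in a two-letter state code, one way to sidestep the API (and the Heroku timeout) entirely is to parse the state out of the string. A minimal sketch, assuming every address follows the "street, city, ST zip" pattern; the regex and the abbreviation-to-name map below are illustrative, not part of the original setup:

STATE_NAMES = {'KS': 'Kansas', 'MO': 'Missouri'}  # extend for the states in your data

# capture the two capital letters that precede the trailing zip code
df['Tag2'] = (df['Fulladdress']
              .str.extract(r',\s*([A-Z]{2})\s+\d{5}', expand=False)
              .map(STATE_NAMES))

This runs in milliseconds even on thousands of rows because it makes no network calls.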
I have a bit of code which pulls the latitude and longitude for a location. It is here:
import urllib.parse
import requests

address = 'New York University'
url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) + '?format=json'
response = requests.get(url).json()
print(response[0]["lat"])
print(response[0]["lon"])
I want to apply this as a function to a long column of "address".
I've seen loads of questions about 'apply' and 'map', but they're almost all simple math examples.
Here is what I tried last night:
def locate(address):
    response = requests.get(url).json()
    print(response[0]["lat"])
    print(response[0]["lon"])
    return
df['lat'] = df['lat'].map(locate)
df['lon'] = df['lon'].map(locate)
This ended up just applying the first row lat / lon to the entire csv.
What is the best method to turn the code into a custom function and apply it to each row?
Thanks in advance.
EDIT: Thank you @PacketLoss for your assistance. I'm getting an IndexError: list index out of range, but it does work on his sample dataframe.
Here is the read_csv I used to pull in the data:
df = pd.read_csv('C:\\Users\\CIHAnalyst1\\Desktop\\InstitutionLocations.csv', sep=',', error_bad_lines=False, index_col=False, dtype='unicode', encoding = "utf-8", warn_bad_lines=False)
Here is a text copy of the rows from the dataframe:
address
0 GRAND CANYON UNIVERSITY
1 SOUTHERN NEW HAMPSHIRE UNIVERSITY
2 WESTERN GOVERNORS UNIVERSITY
3 FLORIDA INTERNATIONAL UNIVERSITY - UNIVERSITY ...
4 PENN STATE UNIVERSITY UNIVERSITY PARK
... ...
4292 THE ART INSTITUTES INTERNATIONAL LLC
4293 INTERCOAST - ONLINE
4294 CAROLINAS COLLEGE OF HEALTH SCIENCES
4295 DYERSBURG STATE COMMUNITY COLLEGE COVINGTON
4296 ULTIMATE MEDICAL ACADEMY - NY
You need to return your values from your function, or nothing will happen.
We can use apply here and pass the address from the df as well.
import urllib.parse

import pandas as pd
import requests

data = {'address': ['New York University', 'Sydney Opera House', 'Paris', 'SupeRduperFakeAddress']}
df = pd.DataFrame(data)

def locate(row):
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(row['address']) + '?format=json'
    response = requests.get(url).json()
    if response:
        row['lat'] = response[0]['lat']
        row['lon'] = response[0]['lon']
    return row
df = df.apply(locate, axis=1)
Outputs
address lat lon
0 New York University 40.72925325 -73.99625393609625
1 Sydney Opera House -33.85719805 151.21512338473752
2 Paris 48.8566969 2.3514616
3 SupeRduperFakeAddress NaN NaN
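One caveat worth adding, since the column is long: Nominatim's usage policy allows roughly one request per second, so unthrottled apply calls risk being blocked. A sketch of the same function with a pause and an identifying User-Agent; both the pause length and the agent string are placeholder choices, not from the original answer:

import time

def locate_throttled(row, pause=1.0):
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(row['address']) + '?format=json'
    # identify the client and pause between calls, per Nominatim's usage policy
    response = requests.get(url, headers={'User-Agent': 'my-geocoder'}).json()
    if response:
        row['lat'] = response[0]['lat']
        row['lon'] = response[0]['lon']
    time.sleep(pause)
    return row

df = df.apply(locate_throttled, axis=1)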
Let's say I have some customer data. The data is generated by the customer and it's messy, so they put their city into either the city or county field or both! That means I may need to check both columns to find out which city they are from.
mydf = pd.DataFrame({'name': ['jim', 'jon'],
                     'city': ['new york', ''],
                     'county': ['', 'los angeles']})
print(mydf)
name city county
0 jim new york
1 jon los angeles
And I am using an api to get their zipcode. There is a different api function for each city, and it returns the zipcode for the customer's address, e.g. 123 main street, new york. I haven't included the full address here to save time.
# api for new york addresses
def get_NY_zipcode_api():
    return 'abc123'

# api for chicago addresses
def get_CH_zipcode_api():
    return 'abc124'

# api for los angeles addresses
def get_LA_zipcode_api():
    return 'abc125'

# api for miami addresses
def get_MI_zipcode_api():
    return 'abc126'
Depending on the city, I call a different api. So for now, I am checking: if city == x or county == x, call api_x:
def myfunc(row):
    city = row['city']
    county = row['county']
    if city == 'chicago' or county == 'chicago':
        # call chicago api
        zipcode = get_CH_zipcode_api()
        return zipcode
    elif city == 'new york' or county == 'new york':
        # call new york api
        zipcode = get_NY_zipcode_api()
        return zipcode
    elif city == 'los angeles' or county == 'los angeles':
        # call los angeles api
        zipcode = get_LA_zipcode_api()
        return zipcode
    elif city == 'miami' or county == 'miami':
        # call miami api
        zipcode = get_MI_zipcode_api()
        return zipcode
And I apply() this to the df and get my results:
mydf['result'] = mydf.apply(myfunc,axis=1)
print(mydf)
name city county result
0 jim new york abc123
1 jon los angeles abc125
I actually have about 30 cities, and therefore 30 conditions to check, so I want to avoid a long chain of elif statements. What would be the most efficient way to do this?
I found some suggestions in a similar Stack Overflow question, such as creating a dictionary with key: city and value: function, and calling the function based on the city:
operationFuncs = {
    'chicago': get_CH_zipcode_api,
    'new york': get_NY_zipcode_api,
    'los angeles': get_LA_zipcode_api,
    'miami': get_MI_zipcode_api
}
But as far as I can see, this only works if I am checking a single column / single condition. I can't see how it can work with if city == x or county == x:
mydf['result'] = mydf.apply(lambda row: operationFuncs.get(row['county']) or operationFuncs.get(row['city']), axis=1)
I think you are referring to this. You can simply perform the lookup twice, once for city and once for county, and save the results in two different variables. You can then compare the results and decide what to do if they differ (I am not sure whether that can happen with your dataset).
Since a dictionary lookup is O(1), and I assume your get_MI_zipcode_api() isn't any more expensive than the others, this has no real performance drawback.
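A minimal sketch of that double lookup (the helper name and the None fallback are mine, not from the answer):

def get_zipcode(row):
    # try the city column first, then the county column
    fn = operationFuncs.get(row['city']) or operationFuncs.get(row['county'])
    return fn() if fn else None

mydf['result'] = mydf.apply(get_zipcode, axis=1)

Note the fn() call at the end: the lambda in the question returns the function object itself rather than calling it.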
Maybe not the most elegant solution, but you could use the dict approach and just call it twice, once on city and once on county. The second call would overwrite the first, but the same is true of your if block, and this would only be a problem if you had city='New York' and county='Chicago', for example, which I assume cannot occur.
Or you could use the dict and iterate through it, though this seems unnecessary:
for key, fn in fdict.items():
    if key in (city, county):
        fn()
I'd do this join in SQL before reading in the data; I'm sure there's a way to do the same in Pandas, but I was trying to make suggestions that build on your existing research, even if they are not the best.
If it's guaranteed that the value will be present in either city or county and never in both, you can merge the two columns into one.
df['region'] = df['city'] + df['county']
Then create a mapping of region to zipcode instead of a mapping of city to api function. Since there are only 30 unique values, you can store each city's zipcode once rather than calling the zipcode functions every time, as making an api call is expensive.
mappings = {
    'chicago': 'abc123',
    'new york': 'abc234',
    'los angeles': 'abc345',
    'miami': 'abc456'
}
Create a dataframe from this dictionary and then merge it with the original dataframe:
mappings_df = pd.DataFrame(list(mappings.items()), columns=['region', 'zipcode'])
df.merge(mappings_df, how='left', on='region')
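One caveat, assuming the blanks in the real data could be NaN rather than the empty strings shown in the sample: concatenating with NaN yields NaN, so fill the blanks first.

df['region'] = df['city'].fillna('') + df['county'].fillna('')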
Hope this helps!!
You need a relation table which can be represented by a dict.
df = pd.DataFrame({'name': ['jim', 'jon'],
                   'city': ['new york', ''],
                   'county': ['', 'los angeles']})
df['region'] = df['city'] + df['county']
table = {'new york': 'abc123', 'chicago': 'abc124', 'los angeles': 'abc125', 'miami': 'abc126'}
df['region'] = df['region'].apply(lambda region: table[region])
print(df)
Output
name city county region
0 jim new york abc123
1 jon los angeles abc125
I am doing a triple for loop on a dataframe with almost 70 thousand entries. How do I optimize it?
My ultimate goal is to create a new column that holds the country of each seismic event. I have latitude, longitude, and 'place' (e.g. '17km N of North Nenana, Alaska') columns. I tried to reverse geocode, but with 68,488 entries, there is no free service that lets me do that, and as a student, I cannot afford a paid one.
So I am using a dataframe with a list of countries and a dataframe with a list of states to compare against the values in USGS['place']. To do that, I ultimately settled on three nested for loops.
As you can imagine, it takes a long time, and I was hoping there is a way to speed things up. I am using Python, but I know R as well; the for loops just run better in Python.
I'll take any better options.
USGS = pd.DataFrame(data={'latitude': [64.7385, 61.116],
                          'longitude': [-149.136, -138.655],
                          'place': ['17km N of North Nenana, Alaska', '74km WNW of Haines Junction, Canada'],
                          'country': [None, None]})
states = pd.DataFrame(data={'state': ['AK', 'AL'], 'name': ['Alaska', 'Alabama']})
countries = pd.DataFrame(data={'country': ['Afghanistan', 'Canada']})

for head in states:
    for state in states[head]:
        for p in USGS['place']:
            if state in p:
                USGS['country'] = USGS['country'].map({p: 'United States'})
# I have not finished the code for the countries dataframe
You do have options for geocoding. MapQuest offers 15,000 free calls per month. You can also look at geopy, which is what I use.
import pandas as pd
import geopy
from geopy.geocoders import Nominatim

USGS_df = pd.DataFrame(data={'latitude': [64.7385, 61.116],
                             'longitude': [-149.136, -138.655],
                             'place': ['17km N of North Nenana, Alaska', '74km WNW of Haines Junction, Canada'],
                             'country': [None, None]})

geopy.geocoders.options.default_user_agent = "locations-application"
geolocator = Nominatim(timeout=10)

for i, row in USGS_df.iterrows():
    try:
        lat = row['latitude']
        lon = row['longitude']
        location = geolocator.reverse('%s, %s' % (lat, lon))
        country = location.raw['address']['country']
        print('Found: ' + location.address)
        USGS_df.loc[i, 'country'] = country
    except:
        print('Location not identified: %s, %s' % (lat, lon))
Input:
print (USGS_df)
latitude longitude place country
0 64.7385 -149.136 17km N of North Nenana, Alaska None
1 61.1160 -138.655 74km WNW of Haines Junction, Canada None
Output:
print (USGS_df)
latitude longitude place country
0 64.7385 -149.136 17km N of North Nenana, Alaska USA
1 61.1160 -138.655 74km WNW of Haines Junction, Canada Canada
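If even throttled geocoding is too slow for all 68,488 rows, the string-matching idea from the question can also be vectorized, with no explicit loops at all. A sketch, assuming every place string ends with ', <state or country>' (that format assumption is mine, not from the thread):

# take the text after the last comma, then translate known state names to the US
states_set = set(states['name'])
USGS['region'] = USGS['place'].str.rsplit(', ', n=1).str[-1]
USGS['country'] = USGS['region'].where(~USGS['region'].isin(states_set), 'United States')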
I have following data in a dataframe:
... location amount
... Jacksonville, FL 40
... Provo, UT 20
... Montgomery, AL 22
... Los Angeles, CA 34
My dataset only contains U.S. cities in the form of [city name, state code] and I have no ZIP codes.
I want to determine either county of a city, in order to visualize my data with ggcounty (like here).
I looked on the website of the U.S. Census Bureau but couldn't really find a table of city, state, county, or similar.
Assuming I would prefer to solve the problem in R only, does anyone have an idea how to tackle this?
You can get a ZIP code and more detailed info doing this:
library(ggmap)
revgeocode(as.numeric(geocode('Jacksonville, FL')))
Hope it helps
I have a model
class Neighborhood(db.Model):
    name = db.StringProperty(required=True)
    city = EncodedProperty(encoder=_get_city_name)
and I would like to retrieve all of the Neighborhood objects and sort them by both the "name" and "city" attributes. Initially, I tried:
def retrieveAndSort():
    query = Neighborhood.all()
    query.order("name")
    query.order("city")
    return query
but after some further research, it seems GAE doesn't support sorting on EncodedProperty objects. Then I tried to sort the data in Python after retrieving the query object, using the sort() method, but the Query object doesn't have that method. Lastly, I tried sorting the Query object with the sorted() built-in:
neighborhoods = sorted(neighborhoods, key=attrgetter('city', 'name'))
and it almost worked. However, data seemed jumbled and I received some output like the following below.
Abu Dhabi - Al Maryah Island
Minneapolis - Montrose
Atlanta - Buckhead
Atlanta - Home Park
Atlantic City - Boardwalk
Atlantic City - Marina
...
California - Saint Helena
California - Yountville
New York City - Central Park
Charlotte - South End
Charlotte - Third End
...
I have absolutely no idea why this occurs and would appreciate any help.
Edit:
Shorter sample output:
New York City - Meatpacking District
New York City - Brooklyn
New York City - Midtown West
I think you have to fetch the entities and then sort them:
from operator import attrgetter

neighborhoods = Neighborhood.all().fetch(1000)
neighborhoods = sorted(neighborhoods, key=attrgetter('city', 'name'))
The question is tagged as gae-datastore but "EncodedProperty" isn't a datastore property type.
https://cloud.google.com/appengine/docs/python/datastore/typesandpropertyclasses
The model should be:
class Neighborhood(db.Model):
    name = db.StringProperty(required=True)
    city = db.StringProperty(required=True)
Then your query function could simply be:
def retrieveAndSort():
    """
    Returns everything in the Neighborhood model, ordered by city and then name
    """
    query = db.GqlQuery('SELECT * FROM Neighborhood ORDER BY city, name')
    for place in query:
        print "{0} - {1}".format(place.city, place.name)