How to improve performance with apply function - python

Context:
I have one city dataset with coordinates (lat, long)
I have several other specific datasets (hospitals, shops,...) with coordinates (lat, long) too
My objective is to find, for each city, the closest entry (or the N closest entries) in each of the other datasets.
Code:
I defined a function to calculate a Haversine distance:
from math import radians, sin, cos, asin, sqrt

def dist(lat1, long1, lat2, long2):
    # convert decimal degrees to radians
    lat1, long1, lat2, long2 = map(radians, [lat1, long1, lat2, long2])
    # haversine formula
    dlon = long2 - long1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    # Radius of earth in kilometers is 6371
    km = 6371 * c
    return km
Now I use this function to find the nearest point:
def find_nearest(lat, long, recherche):
    distances = recherche.apply(
        lambda x: dist(lat, long, x['rech_lat'], x['rech_lon']),
        axis=1)
    return recherche.loc[distances.idxmin(), 'rech_id']
Which I call like this:
CITY['hospital_id'] = CITY.apply(lambda x: find_nearest(x['COM_LAT'], x['COM_LONG'],hospital),axis=1)
Problem:
Doing so, I need to pass the hospital dataframe on every call. I am not sure this performs well.
I thought of passing the name of the dataframe and resolving it with the eval function instead:
def find_nearest(lat, long, recherch):
    recherche = eval(recherch)
    distances = recherche.apply(
        lambda x: dist(lat, long, x['rech_lat'], x['rech_lon']),
        axis=1)
    return recherche.loc[distances.idxmin(), 'rech_id']
CITY['hospital_id'] = CITY.apply(lambda x: find_nearest(x['COM_LAT'], x['COM_LONG'],'hospital'),axis=1)
Is it better? I still can't get a fast answer. Do you know how I can improve it further?
Thanks for your answers.
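One possible direction, offered as a sketch rather than a definitive answer: Python passes the dataframe by reference, so handing it to find_nearest does not copy it, and replacing it with eval is unlikely to change the running time. The cost comes from the nested row-by-row apply calls. Assuming the full city-by-hospital distance matrix fits in memory, NumPy broadcasting can compute all distances at once. The function names haversine_matrix and nearest_index and the sample coordinates below are hypothetical.

```python
import numpy as np

def haversine_matrix(lat1, lon1, lat2, lon2, earth_radius=6371.0):
    """Distances in km between every point of set 1 (rows) and set 2 (columns)."""
    lat1, lon1, lat2, lon2 = (np.radians(np.asarray(v, dtype=float))
                              for v in (lat1, lon1, lat2, lon2))
    # Broadcasting: set 1 varies along rows, set 2 along columns
    dlat = lat2[None, :] - lat1[:, None]
    dlon = lon2[None, :] - lon1[:, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1)[:, None] * np.cos(lat2)[None, :] * np.sin(dlon / 2) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

def nearest_index(lat1, lon1, lat2, lon2):
    """Position in set 2 of the closest point for each point of set 1."""
    return haversine_matrix(lat1, lon1, lat2, lon2).argmin(axis=1)

# Hypothetical sample data: two cities, three hospitals
city_lat, city_lon = [48.85, 43.30], [2.35, 5.37]
hosp_lat, hosp_lon = [43.29, 48.86, 45.76], [5.40, 2.34, 4.84]
idx = nearest_index(city_lat, city_lon, hosp_lat, hosp_lon)
print(idx)
```

With the dataframes from the question, this could be used as pos = nearest_index(CITY['COM_LAT'], CITY['COM_LONG'], hospital['rech_lat'], hospital['rech_lon']) followed by CITY['hospital_id'] = hospital['rech_id'].to_numpy()[pos]. For the N closest, np.argsort on the matrix along axis=1 and keeping the first N columns would be one option. If the matrix is too large, the cities can be processed in chunks.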

Related

Pandas dataframe : working with Latitude and longitude features

I have total 32 variables in dataframe,
X1 to X16 - Latitude values and
Y1 to Y16 - Longitude values for 16 different positions.
I want to perform the following steps on these values using Python:
Calculate the distance between each position (X1,Y1) and every other position. Do it for all the positions and then average the distances.
E.g., calculate the distance between (X1,Y1) & (X2,Y2), (X1,Y1) & (X3,Y3), (X1,Y1) & (X4,Y4), etc., then average the distances (A1).
Calculate the distance between (X2,Y2) & (X1,Y1), (X2,Y2) & (X3,Y3), etc., then average the distances (A2), and so on.
Finally, I want to take the mean of A1, A2, ..., A16 and insert it in a column for the corresponding rows.
I want to do this to compare the final column (mean of the A's) with the dependent variable.
I know there is code like the following for working with latitude and longitude, but I don't know how to use it in my case.
# vectorized haversine function
import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    slightly modified version: of http://stackoverflow.com/a/29546836/2901002
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# distance from the previous row to the current row; the first row is NaN
# (full columns are passed so that all four inputs have equal length)
df['dist'] = haversine(df.LAT.shift(), df.LONG.shift(), df.LAT, df.LONG)
The below should help you to find the distance between two coordinates:
# Python 3 program to calculate Distance Between Two Points on Earth
from math import radians, cos, sin, asin, sqrt

def distance(lat1, lat2, lon1, lon2):
    # The math module contains a function named
    # radians which converts from degrees to radians.
    lon1 = radians(lon1)
    lon2 = radians(lon2)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * asin(sqrt(a))
    # Radius of earth in kilometers. Use 3956 for miles
    r = 6371
    # calculate the result
    return c * r

# driver code
lat1 = 53.32055555555556
lat2 = 53.31861111111111
lon1 = -1.7297222222222221
lon2 = -1.6997222222222223
print(distance(lat1, lat2, lon1, lon2), "K.M")
To do the same for all the positions, a 'for' loop should help. The distances can then be stored in a new column and the mean can be calculated.
Edited:
I am sure the below code will help you. I have created a sample dataset as per your requirement and worked on it. Since you are new to python, I did the whole code for you. Let me know if this is your requirement - attaching the sample dataset, code, and output for you.
Sample input/dataset : sample dataset that i created as per your requirement
Sample Output : sample output
import pandas as pd
from math import radians, cos, sin, asin, sqrt

df = pd.read_excel(r'sample.xlsx', engine='openpyxl')

# function to calculate the distance
def distance(lat1, lat2, lon1, lon2):
    # The math module contains a function named
    # radians which converts from degrees to radians.
    lon1 = radians(lon1)
    lon2 = radians(lon2)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * asin(sqrt(a))
    # Radius of earth in kilometers. Use 3956 for miles
    r = 6371
    # calculate the result
    return c * r

# driver code
# number of rows in df
df_len = df.shape[0]
# 'for' loop that iterates through every row of the dataframe
for i in range(df_len):
    dist_list = []
    # float() keeps the decimal part of each coordinate; int() would truncate it
    lat1 = float(df['x'].iloc[i])
    lon1 = float(df['y'].iloc[i])
    for j in range(df_len):
        lat2 = float(df['x'].iloc[j])
        lon2 = float(df['y'].iloc[j])
        # distance between (x_i, y_i) and (x_j, y_j), appended to "dist_list"
        dist_list.append(distance(lat1, lat2, lon1, lon2))
    col_name = "dist between ({}, {}) and every other points".format(lat1, lon1)
    df[col_name] = dist_list
# let's now print the dataframe
print(df)
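As an alternative to the nested loops, the per-row average described in the question can be sketched with NumPy broadcasting. This assumes the latitudes and longitudes are available as two arrays of shape (rows, positions); the function name mean_pairwise_distance and the sample values are hypothetical. Because the distance of a position to itself is zero, the mean of A1, ..., A16 equals the sum of the whole distance matrix divided by n * (n - 1).

```python
import numpy as np

def mean_pairwise_distance(lat, lon, earth_radius=6371.0):
    """lat, lon: arrays of shape (n_rows, n_positions) in decimal degrees.
    Returns, for each row, the mean haversine distance in km over all
    ordered pairs of distinct positions."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    # Pairwise differences within each row: shape (n_rows, n, n)
    dlat = lat[:, :, None] - lat[:, None, :]
    dlon = lon[:, :, None] - lon[:, None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, :, None] * np.cos(lat)[:, None, :] * np.sin(dlon / 2) ** 2)
    d = earth_radius * 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    n = lat.shape[1]
    # The diagonal is zero, so this averages only the off-diagonal entries
    return d.sum(axis=(1, 2)) / (n * (n - 1))

# Hypothetical row with three positions on the equator, one degree apart
result = mean_pairwise_distance([[0.0, 0.0, 0.0]], [[0.0, 1.0, 2.0]])
print(result)
```

With the column names from the question, something like df['mean_dist'] = mean_pairwise_distance(df[[f'X{i}' for i in range(1, 17)]], df[[f'Y{i}' for i in range(1, 17)]]) should fill the new column in one call.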

Haversine Function using Pandas Data Frame

I am new to Python. I am trying to calculate the Haversine distance on a pandas DataFrame. I have 2 dataframes, like this: [image: first 3 rows of first dataframe]
Second one: [image: first 3 rows of second dataframe]
Here is my haversine function.
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371 # Radius of earth in kilometers (3956 would be miles).
    return c * r
I took the longitude and latitude values in the first dataframe as centers and drew circles on the map (with a radius of 1000 m). First, I want to pass all the lon and lat values in the second dataframe to the haversine function together with the lon and lat values in the first row of the first dataframe. Then I'll do the same for the other rows of the first dataframe. That way I can find out whether the coordinates in the second dataframe fall inside the circles centered on the coordinates in the first dataframe. It works when I call it like this:
a = haversine(29.023165,40.992752,28.844604,41.113586)
radius = 1.00 # in kilometers
if a <= radius:
    print('Inside the area')
else:
    print('Outside the area')
In the code I wrote, I could not get the exact order I wanted. I tried passing all the lon and lat values of the first and the second dataframe at once, but logically this is wrong (or an unnecessary operation). I tried the code below (based on the question Haversine Distance Calc using Pandas Data Frame "cannot convert the series to <class 'float'>"), but it gives an error: ('LONGITUDE', 'occurred at index 0').
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371 # Radius of earth in kilometers (3956 would be miles).
    return c * r
iskeleler.loc['density'] = iskeleler.apply(lambda row: haversine(iskeleler['lon'], iskeleler['lat'], row['LONGITUDE'], row['LATITUDE']), axis=1)
Can you help me with how I can do this? Thanks in advance.
The haversine function you are using expects one float per argument, so you need to pass floats. In your call, iskeleler['lon'] and iskeleler['lat'] are whole Series, which is why it fails.
This should work to calculate the distance between coordinates in the same row (note iskeleler['density'], which creates a column, rather than iskeleler.loc['density'], which addresses a row):
iskeleler['density'] = iskeleler.apply(lambda row: haversine(
    row['lon'], row['lat'],
    row['LONGITUDE'], row['LATITUDE']
), axis=1)
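But you are looking for pair-wise distances, which would otherwise require a for loop, and that is not efficient. Try sklearn.metrics.pairwise.haversine_distances. It expects [latitude, longitude] pairs in radians and returns distances on the unit sphere, so convert the degrees first and multiply by the Earth radius:
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# inputs in radians; result scaled to kilometers
distance_matrix = 6371 * haversine_distances(
    np.radians(iskeleler[['lat', 'lon']].to_numpy()),
    np.radians(iskeleler[['LATITUDE', 'LONGITUDE']].to_numpy())
)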
If you prefer the table structure, then:
distance_table = pd.DataFrame(
    distance_matrix,
    index=pd.MultiIndex.from_frames(iskeleler[['lat', 'lon']]),
    columns=pd.MultiIndex.from_frames(iskeleler[['LATITUDE', 'LONGITUDE']]),
).stack([0, 1]).reset_index(name='distance')
This is an example, there are many ways to create the dataframe from the matrix.
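For the "inside the circle" check described in the question, a NumPy-only sketch is given below. It is an illustration under assumptions, not the answerer's method: the function name within_radius is hypothetical, the first coordinates are the two points from the question, and the last point is made up to sit a few tens of meters from the center.

```python
import numpy as np

def within_radius(center_lat, center_lon, point_lat, point_lon,
                  radius_km=1.0, earth_radius=6371.0):
    """Boolean matrix: entry [i, j] is True when point j lies within
    radius_km of center i. All coordinates are in decimal degrees."""
    c_lat, c_lon, p_lat, p_lon = (
        np.radians(np.asarray(v, dtype=float))
        for v in (center_lat, center_lon, point_lat, point_lon))
    # Centers vary along rows, points along columns
    dlat = p_lat[None, :] - c_lat[:, None]
    dlon = p_lon[None, :] - c_lon[:, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(c_lat)[:, None] * np.cos(p_lat)[None, :] * np.sin(dlon / 2) ** 2)
    dist_km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    return dist_km <= radius_km

# One center and two points (the second point is hypothetical)
mask = within_radius([40.992752], [29.023165],
                     [41.113586, 40.9930], [28.844604, 29.0235])
print(mask)              # one row per center, one column per point
print(mask.sum(axis=1))  # number of points inside each circle
```

Summing the mask along axis=1 gives a per-center count, which may be what the 'density' column in the question was meant to hold.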

Calculate Haversine and initial bearing for successive points of a pandas DataFrame in Python

There are many packages that provide this calculation, although most of them work on individual points rather than dataframes, or maybe I am making a mistake!
I have found this method that works with my pandas dataframe of Latitude and Longitude columns:
import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6378137):
    """
    slightly modified version: of http://stackoverflow.com/a/29546836/2901002
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
but none of the initial bearing or azimuth functions I tried accept dataframe series, and trying with numpy arrays still returns zeros!
Is there a way to do this for successive rows of a dataframe? I would like to calculate the initial bearing between successive points. In R the bearing function does the job with a dataframe; I am just wondering if there is an equivalent in Python.
Update:
I found the problem. I was using the R approach to find the bearing between successive rows, so I was removing the first row and the last row to make two dataframes with two columns each. It worked perfectly with shift() instead, and I wrote my own bearing function, which was easier than using the ones out there...
So I made the two dataframes below from pts, my main dataframe:
latlon_a = pts
latlon_b = pts.shift()
and my own initial bearing function:
def initial_bearing(lon1, lat1, lon2, lat2):
    """
    My own version based on R source
    Calculate the initial bearing from point 1 to point 2
    All (latitude, longitude) coordinates must have numeric dtypes and be of equal length.
    """
    # unpack in the same (lon, lat) order as the arguments
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    delta1 = lon2 - lon1
    term1 = np.sin(delta1) * np.cos(lat2)
    term2 = np.cos(lat1) * np.sin(lat2)
    term3 = np.sin(lat1) * np.cos(lat2) * np.cos(delta1)
    rad = np.arctan2(term1, (term2-term3))
    bearing = np.rad2deg(rad)
    return (bearing + 360) % 360

bearing = initial_bearing(latlon_a['longitude'], latlon_a['latitude'],
                          latlon_b['longitude'], latlon_b['latitude'])
This worked perfectly for me and returns the initial bearing. For the final bearing you can just replace the return line with the one below:
return (bearing + 180) % 360
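As a small sanity check of the bearing idea, here is a sketch using the standard forward-azimuth formula on plain arrays, pairing each point with the next one by slicing instead of shift(). The function name bearings_to_next and the sample track are hypothetical; a track that first goes north and then east should give bearings close to 0 and 90 degrees.

```python
import numpy as np

def bearings_to_next(lat, lon):
    """Initial bearing in degrees (0 = north, 90 = east) from each point
    to the following one; the result has one element fewer than the input."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    lat1, lat2 = lat[:-1], lat[1:]
    dlon = lon[1:] - lon[:-1]
    x = np.sin(dlon) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    return (np.degrees(np.arctan2(x, y)) + 360) % 360

# Hypothetical track: start at (0, 0), go one degree north, then one degree east
b = bearings_to_next([0.0, 1.0, 1.0], [0.0, 0.0, 1.0])
print(b)
```

The second value comes out slightly below 90 because the initial bearing of a great circle heading east from a point north of the equator tilts a little toward the pole.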

Returning the lat long with minimum distance in python

I want to find the lat, long combination with the minimum distance. x_lat and x_long are constant. I want to build the combinations of y_latitude and y_longitude, calculate the distance for each, find the minimum distance, and return the corresponding y_latitude, y_longitude.
This is what I am trying:
x_lat = 33.50194395
x_long = -112.048885
y_latitude = ['56.16', '33.211045400000003', '37.36']
y_longitude = ['-117.3700631', '-118.244']
I have a distance function which would return the distance,
from math import radians, cos, sin, asin, sqrt

def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
So I tried something like the following,
import itertools

dist = []
for i in itertools.product(y_latitude, y_longitude):
    print(i)
    dist.append(distance(float(i[1]), float(i[0]), float(x_long), float(x_lat)))
print(dist.index(min(dist)))
So this creates all possible combinations of y_latitude and y_longitude and calculates distance and returns the index of minimum distance. I am not able to make it return the corresponding y_latitude and y_longitude.
Here the index of minimum distance is 2 and output is 2. The required output is ('33.211045400000003', '-117.3700631'), which I am not able to make it return.
Can anybody help me in solving the last piece?
Thanks
Try this:
import itertools

dist = []
for i in itertools.product(y_latitude, y_longitude):
    dist.append([distance(float(i[1]), float(i[0]), float(x_long), float(x_lat)), i])
min_lat, min_lng = min(dist, key=lambda x: x[0])[1]
Append the lat/long pair along with the distance, then take the minimum by the first element (the distance).
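A slightly shorter variant, sketched here with the data from the question, passes the distance as the key to min() so the winning pair is returned directly and no intermediate list is needed. The variable name closest is hypothetical.

```python
import itertools
from math import radians, cos, sin, asin, sqrt

def distance(lon1, lat1, lon2, lat2):
    """Great circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 6367 * 2 * asin(sqrt(a))

x_lat = 33.50194395
x_long = -112.048885
y_latitude = ['56.16', '33.211045400000003', '37.36']
y_longitude = ['-117.3700631', '-118.244']

# min() evaluates the key for every (lat, lon) pair and returns the pair itself
closest = min(
    itertools.product(y_latitude, y_longitude),
    key=lambda p: distance(float(p[1]), float(p[0]), x_long, x_lat),
)
print(closest)
```

The pair is returned as the original strings, so closest can be unpacked as min_lat, min_lng = closest.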

Iterate over Pandas index pairs [0,1],[1,2][2,3]

I have a pandas dataframe of lat/lng points created from a gps device.
My question is how to generate a distance column for the distance between each point in the gps track line.
Some googling has given me the haversine method below, which works using single values selected using iloc, but I'm struggling with how to iterate over the dataframe for the method inputs.
I had thought I could run a for loop, with something along the lines of
for i in len(df):
    df['dist'] = haversine(df['lng'].iloc[i],df['lat'].iloc[i],df['lng'].iloc[i+1],df['lat'].iloc[i+1])
but I get the error TypeError: 'int' object is not iterable. I was also thinking about df.apply but I'm not sure how to get the appropriate inputs. Any help or hints on how to do this would be appreciated.
Sample DF
lat lng
0 -7.11873 113.72512
1 -7.11873 113.72500
2 -7.11870 113.72476
3 -7.11870 113.72457
4 -7.11874 113.72444
Method
import math

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    km = 6367 * c
    return km
Are you looking for a result like this?
lat lon dist2next
0 -7.11873 113.72512 0.013232
1 -7.11873 113.72500 0.026464
2 -7.11873 113.72476 0.020951
3 -7.11873 113.72457 0.014335
4 -7.11873 113.72444 NaN
There's probably a clever way to use pandas.rolling_apply... but for a quick solution, I'd do something like this.
import math
import numpy as np

def haversine(loc1, loc2):
    # convert decimal degrees to radians
    lon1, lat1 = map(math.radians, loc1)
    lon2, lat2 = map(math.radians, loc2)
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    km = 6367 * c
    return km

df['dist2next'] = np.nan
for i in df.index[:-1]:
    # .loc replaces .ix, which was removed in pandas 1.0
    loc1 = df.loc[i, ['lon', 'lat']]
    loc2 = df.loc[i+1, ['lon', 'lat']]
    df.loc[i, 'dist2next'] = haversine(loc1, loc2)
Alternatively, if you don't want to modify your haversine function like that, you can just pick off lats and lons one at a time using df.loc[i, 'lon'], df.loc[i, 'lat'], df.loc[i+1, 'lon'], etc.
I would recommend a quicker way of looping through a df, such as:
# prefix the shifted columns so they join as lag_lat / lag_lng
df_shift = df.shift(1).add_prefix("lag_")
df = df.join(df_shift)
log = []
for rows in df.itertuples():
    log.append(haversine(rows.lng, rows.lat, rows.lag_lng, rows.lag_lat))
pd.DataFrame(log)
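If a loop is not required at all, the same shift idea can be combined with a NumPy version of the formula so that the whole column is computed in one call. This is a sketch under assumptions: the function name haversine_np is hypothetical, the Earth radius is taken as 6371 km, and the dataframe is the sample from the question.

```python
import numpy as np
import pandas as pd

def haversine_np(lon1, lat1, lon2, lat2, earth_radius=6371.0):
    """Element-wise great circle distance in km; inputs in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    "lat": [-7.11873, -7.11873, -7.11870, -7.11870, -7.11874],
    "lng": [113.72512, 113.72500, 113.72476, 113.72457, 113.72444],
})

# shift(-1) lines each row up with the next one; the last row has no
# successor, so its distance is NaN
df["dist2next"] = haversine_np(df["lng"], df["lat"],
                               df["lng"].shift(-1), df["lat"].shift(-1))
print(df)
```

Using shift(1) instead of shift(-1) would give the distance from the previous point, with the NaN in the first row.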
