How to use DistanceMetric.pairwise() with haversine distance in scikit-learn (Python)

How do you get the distance in kilometers from the haversine pairwise function in the sklearn library? Compared with the example at https://stackoverflow.com/a/38685263/8378399, the numbers returned by scikit-learn are not correct, which leads me to believe I'm not calling it correctly.
Sample code:
from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')
paris = (48.8566, 2.3522)
lyon = (45.7640, 4.8357)
hdist = haversine(paris[1],paris[0], lyon[1], lyon[0])
skdist = dist.pairwise([paris], [lyon]) * 6371
# Returns: The distance between Paris and Lyon is 391km. sklearn=17766km
"The distance between Paris and Lyon is {0:.3g}km. sklearn={1:.5g}km".format(hdist, skdist[0][0])

From sklearn docs:
Note that the haversine distance metric requires data in the form of
[latitude, longitude] and both inputs and outputs are in units of
radians.
So convert latitude and longitude to radians before calling the function:
import numpy as np
skdist = dist.pairwise(np.radians([paris]), np.radians([lyon])) * 6371
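As a cross-check that does not depend on the installed scikit-learn version, here is a minimal NumPy sketch of what a haversine metric computes under the convention quoted above ([latitude, longitude] in radians, result as an angle in radians). The function name `haversine_pairwise_rad` is made up for this illustration and is not part of scikit-learn.

```python
import numpy as np

def haversine_pairwise_rad(X, Y):
    """Pairwise haversine angles (radians) between rows of X and Y.

    X and Y are array-likes of [latitude, longitude] pairs in radians,
    mirroring the input convention quoted from the sklearn docs.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    lat1, lon1 = X[:, 0][:, None], X[:, 1][:, None]   # column vectors
    lat2, lon2 = Y[:, 0][None, :], Y[:, 1][None, :]   # row vectors
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * np.arcsin(np.sqrt(a))

paris = (48.8566, 2.3522)   # (latitude, longitude) in degrees
lyon = (45.7640, 4.8357)

# Degrees must be converted to radians first; the result is an angle,
# so multiply by the Earth's radius (6371 km) to get kilometers.
angle = haversine_pairwise_rad(np.radians([paris]), np.radians([lyon]))
km = angle * 6371
print("Paris-Lyon: {:.0f} km".format(km[0, 0]))
```

The value should land close to the 391 km reported by the hand-written haversine function in the question, which suggests the 17766 km figure came from feeding degrees where radians were expected.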

Related

Pandas dataframe : working with Latitude and longitude features

I have 32 variables in a dataframe:
X1 to X16 - latitude values, and
Y1 to Y16 - longitude values for 16 different positions.
I want to perform the following steps on these values using Python:
Calculate the distance between each position (X1, Y1) and every other position. Do this for all the positions and then average the distances.
E.g., calculate the distance between (X1, Y1) & (X2, Y2), (X1, Y1) & (X3, Y3), (X1, Y1) & (X4, Y4), etc., then average the distances (A1).
Calculate the distance between (X2, Y2) & (X1, Y1), (X2, Y2) & (X3, Y3), etc., then average the distances (A2), and so on.
Finally, I want to take the mean of A1, A2, ..., A16 and insert it in a column for the corresponding rows.
I want to do this to compare the final column (mean of the A's) with the dependent variable.
I know there is something like the following code for working with latitude and longitude, but I don't know how to use it in my case.
# vectorized haversine function
import numpy as np
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    slightly modified version: of http://stackoverflow.com/a/29546836/2901002
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
df['dist'] = haversine(df.LAT.shift(), df.LONG.shift(), df.loc[1:, 'LAT'], df.loc[1:, 'LONG'])
The below should help you to find the distance between two coordinates:
# Python 3 program to calculate Distance Between Two Points on Earth
from math import radians, cos, sin, asin, sqrt
def distance(lat1, lat2, lon1, lon2):
    # The math module contains a function named
    # radians which converts from degrees to radians.
    lon1 = radians(lon1)
    lon2 = radians(lon2)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * asin(sqrt(a))
    # Radius of earth in kilometers. Use 3956 for miles
    r = 6371
    # calculate the result
    return(c * r)
# driver code
lat1 = 53.32055555555556
lat2 = 53.31861111111111
lon1 = -1.7297222222222221
lon2 = -1.6997222222222223
print(distance(lat1, lat2, lon1, lon2), "K.M")
To do the same for all the positions, use a 'for' loop. The results can then be stored in a new column and the mean calculated.
Edited:
The code below should help. I created a sample dataset matching your description and worked on it. Since you are new to Python, I wrote the whole code for you. Let me know if this meets your requirement.
import pandas as pd
from math import radians, cos, sin, asin, sqrt
df = pd.read_excel(r'sample.xlsx', engine='openpyxl')
# function to calculate the distance
def distance(lat1, lat2, lon1, lon2):
    # The math module contains a function named
    # radians which converts from degrees to radians.
    lon1 = radians(lon1)
    lon2 = radians(lon2)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * asin(sqrt(a))
    # Radius of earth in kilometers. Use 3956 for miles
    r = 6371
    # calculate the result
    return(c * r)
# driver code
# number of rows in df
df_len = df.shape[0]
# 'for' loop that iterates through every row of the dataframe
for i in range(df_len):
    dist_list = []
    # float() keeps the fractional part of each coordinate (int() would truncate it)
    lat1 = float(df['x'].iloc[i])
    lon1 = float(df['y'].iloc[i])
    for j in range(df_len):
        lat2 = float(df['x'].iloc[j])
        lon2 = float(df['y'].iloc[j])
        # calculate the distance between (x1, y1) and (x2, y2), and so on
        dist_btwn = distance(lat1, lat2, lon1, lon2)
        # append the distance to "dist_list"
        dist_list.append(dist_btwn)
    col_name = "dist between ({}, {}) and every other points".format(lat1,lon1)
    df[col_name] = dist_list
# let's now print the dataframe
print(df)
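For the original goal (an average distance per position, then the mean of those averages), the nested loops can also be replaced by NumPy broadcasting. The sketch below assumes the positions of one row are available as two sequences, `lats` and `lons`, in decimal degrees. The function name `mean_pairwise_km` and the sample coordinates are invented for illustration.

```python
import numpy as np

def mean_pairwise_km(lats, lons, earth_radius=6371.0):
    """Mean of the per-position average distances (A1..An) in km."""
    lat = np.radians(np.asarray(lats, dtype=float))
    lon = np.radians(np.asarray(lons, dtype=float))
    # Broadcasting a column vector against a row vector gives an n x n matrix
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    dist = earth_radius * 2 * np.arcsin(np.sqrt(a))
    n = len(lat)
    # The diagonal (distance from a point to itself) is zero, so dividing each
    # row sum by n - 1 averages over the *other* positions only.
    per_position_avg = dist.sum(axis=1) / (n - 1)
    return per_position_avg.mean()

lats = [25.42, 25.39, 24.48, 24.51]   # hypothetical sample positions
lons = [55.47, 55.47, 54.38, 54.54]
print(mean_pairwise_km(lats, lons))
```

For the dataframe in the question, a function like this could be applied once per row over the X and Y columns. That wiring is left out here because the exact column layout is an assumption.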

Fastest way to find closest points in Numpy [duplicate]

I have a dataset as follows,
Id Latitude longitude
1 25.42 55.47
2 25.39 55.47
3 24.48 54.38
4 24.51 54.54
I want to find the distance to the nearest neighbor for every point in the dataset. I found the following distance function on the internet:
from math import radians, cos, sin, asin, sqrt
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
I am using the following loop:
shortest_distance = []
for i in range(1,len(data)):
    distance1 = []
    for j in range(1,len(data)):
        distance1.append(distance(data['Longitude'][i], data['Latitude'][i], data['Longitude'][j], data['Latitude'][j]))
    shortest_distance.append(min(distance1))
But this code loops over the whole dataset once for every entry, which results in n^2 iterations and makes it very slow. My dataset contains nearly 1 million records, and looping through all the elements this way is becoming very costly.
I want to find a better way to find the nearest point for each row. Can anybody help me find a way to solve this in Python?
Thanks
The brute force method of finding the nearest of N points to a given point is O(N) -- you'd have to check each point.
In contrast, if the N points are stored in a KD-tree, then finding the nearest point is on average O(log(N)).
There is also the additional one-time cost of building the KD-tree, which typically requires O(N*log(N)) time.
If you need to repeat this process N times, then the brute force method is O(N**2) and the kd-tree method is O(N*log(N)).
Thus, for large enough N, the KD-tree will beat the brute force method.
See here for more on nearest neighbor algorithms (including KD-tree).
Below (in the function using_kdtree) is a way to compute the great circle arclengths of nearest neighbors using scipy.spatial.KDTree.
scipy.spatial.KDTree uses the Euclidean distance between points, but there is a formula for converting Euclidean chord distances between points on a sphere to great circle arclength (given the radius of the sphere).
So the idea is to convert the latitude/longitude data into cartesian coordinates, use a KDTree to find the nearest neighbors, and then apply the great circle distance formula to obtain the desired result.
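To make the chord-to-arc step concrete, here is a small standard-library sketch that converts two points to Cartesian coordinates, measures the straight-line chord, converts it to an arc, and compares the result with the haversine formula. The helper names (`to_cartesian`, `chord_to_arc`, `haversine_km`) are illustrative only, and the two points are taken from the question's table.

```python
from math import radians, sin, cos, asin, sqrt

R = 6367  # sphere radius in km, matching the value used in this thread

def to_cartesian(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to (x, y, z) on a sphere of radius R."""
    phi, theta = radians(lat_deg), radians(lon_deg)
    return (R * cos(phi) * cos(theta), R * cos(phi) * sin(theta), R * sin(phi))

def chord_to_arc(chord):
    """Great-circle arc length for a straight-line (Euclidean) chord length."""
    return R * 2 * asin(chord / (2.0 * R))

def haversine_km(lat1, lon1, lat2, lon2):
    """Reference great-circle distance via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return R * 2 * asin(sqrt(a))

p, q = (25.42, 55.47), (24.48, 54.38)   # (latitude, longitude) pairs
chord = sqrt(sum((a - b) ** 2 for a, b in zip(to_cartesian(*p), to_cartesian(*q))))
arc = chord_to_arc(chord)
print(arc, haversine_km(*p, *q))   # the two values should agree closely
```

Because the arc length grows monotonically with the chord length, the Euclidean nearest neighbor found by the tree is also the great-circle nearest neighbor, so the conversion only needs to be applied to the final distances.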
Here are some benchmarks. Using N = 100, using_kdtree is 39x faster than the orig (brute force) method.
In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop
In [181]: %timeit using_sklearn(data)
1 loop, best of 3: 214 ms per loop
In [179]: %timeit orig(data)
1 loop, best of 3: 728 ms per loop
For N = 10000:
In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop
In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop
In [7]: %timeit orig(data)
# untested; too slow
Since using_kdtree is O(N log(N)) and orig is O(N**2), the factor by
which using_kdtree is faster than orig will grow as N, the length of
data, grows.
import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt
R = 6367
def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R))
        arclength = R*central_angle
        return arclength
    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y','z']])
    distance, index = tree.query(data[['x', 'y','z']], k=2)
    return dist_to_arclength(distance[:, 1])
def orig(data):
    def distance(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        c = 2 * asin(sqrt(a))
        km = R * c
        return km
    shortest_distance = []
    for i in range(len(data)):
        distance1 = []
        for j in range(len(data)):
            if i == j: continue
            distance1.append(distance(data['Longitude'][i], data['Latitude'][i],
                                      data['Longitude'][j], data['Latitude'][j]))
        shortest_distance.append(min(distance1))
    return shortest_distance
def using_sklearn(data):
    """
    Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
    """
    def distance(p1, p2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        lon1, lat1 = p1
        lon2, lat2 = p2
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a))
        km = R * c
        return km
    points = data[['Longitude', 'Latitude']]
    nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
    distances, indices = nbrs.kneighbors(points)
    result = distances[:, 1]
    return result
np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N),
                     'Longitude':np.random.uniform(0,360,size=N)})
expected = orig(data)
for func in [using_kdtree, using_sklearn]:
    result = func(data)
    assert np.allclose(expected, result)
You can do this very efficiently by calling a library that implements smart algorithms for this. One example is sklearn, whose NearestNeighbors class does exactly this.
Example of the code modified to do this:
from sklearn.neighbors import NearestNeighbors
import numpy as np
def distance(p1, p2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    lon1, lat1 = p1
    lon2, lat2 = p2
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
# NOTE: distance() unpacks each point as (lon, lat), while the rows below are
# written in the question's (Latitude, Longitude) order; swap the columns to
# get true distances. The result shown was produced with the rows as written.
points = [[25.42, 55.47],
          [25.39, 55.47],
          [24.48, 54.38],
          [24.51, 54.54]]
nbrs = NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
distances, indices = nbrs.kneighbors(points)
result = distances[:, 1]
which gives
>>> result
array([ 1.889697 , 1.889697 , 17.88530556, 17.88530556])
You can use a dictionary to cache some calculations. Your code calculates the distance from A to B many times (A and B being two arbitrary points in your dataset).
Either implement your own cache:
from math import radians, cos, sin, asin, sqrt
dist_cache = {}
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # key on the original arguments (degrees), before they are converted to radians
    key = (lon1, lat1, lon2, lat2)
    try:
        return dist_cache[key]
    except KeyError:
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a))
        km = 6367 * c
        dist_cache[key] = km
        return km
Or use lru_cache:
from math import radians, cos, sin, asin, sqrt
from functools import lru_cache
@lru_cache(maxsize=None)
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
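To see what the cache actually does, the sketch below wraps a haversine function with `functools.lru_cache` and inspects its hit and miss counters through `cache_info()`. The function name `cached_distance` and the sample coordinates are chosen for this illustration.

```python
from functools import lru_cache
from math import radians, cos, sin, asin, sqrt

@lru_cache(maxsize=None)  # unbounded cache; arguments must be hashable (floats are)
def cached_distance(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6367 * 2 * asin(sqrt(a))

first = cached_distance(55.47, 25.42, 54.38, 24.48)    # computed (cache miss)
second = cached_distance(55.47, 25.42, 54.38, 24.48)   # served from the cache (hit)
swapped = cached_distance(54.38, 24.48, 55.47, 25.42)  # different key, so another miss

info = cached_distance.cache_info()
print(info.hits, info.misses)
```

Note that the cache only helps when exactly the same argument tuple recurs: A to B and B to A are stored under separate keys, and the number of calls made by the nested loop is still quadratic.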

I cannot create new column in my data series in python for an assignment on the haversine formula

I am trying to create a new column to manipulate the data set:
df['longitude'] = df['longitude'].astype(float)
df['latitude'] = df['latitude'].astype(float)
then ran the function for haversine:
from math import radians, cos, sin, asin, sqrt
def haversine(lon1,lat1,lat2,lon2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
But when I run this code:
df['d_centre']=haversine(lon1,
                         lat1,
                         df.longitude.astype(float),
                         df.latitude.astype(float))
to create a new column in my df, I get this error:
Error: cannot convert the series to <class 'float'>
I tried this as well:
df['d_centre']= haversine(lon1,lat1,lat2,lon2)
The haversine function itself works, but when I try to create the new column in my df, I get this error. I have tried converting to a list as well, but I get the same result.
I figured out the answer: use NumPy for all the math and write the code for the new column with the df.
import numpy as np
def haversine_np(lon1,lat1,lon2,lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
Create a new column:
df2['d_centre'] = haversine_np(df2['lon1'],df2['lat1'],df2['lon2'],df2['lat2'])
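The reason the NumPy version works can be shown in isolation: functions from the `math` module accept only single numbers, while NumPy functions operate element by element on whole arrays (and on pandas Series). A minimal sketch, with sample longitudes invented for the demonstration:

```python
import math
import numpy as np

lons = np.array([-100.248093, -77.441536, -72.376370])  # sample longitudes in degrees

# math.radians expects a single number, so a whole array (or Series) is rejected.
try:
    math.radians(lons)
    scalar_call_failed = False
except TypeError as exc:
    scalar_call_failed = True
    print("math.radians failed:", exc)

# np.radians is a ufunc: it converts every element and returns an array.
lons_rad = np.radians(lons)
print(lons_rad.shape)
```

The same distinction applies to `sin`, `cos`, `asin` and `sqrt`, which is why every math call in the function had to be replaced by its NumPy counterpart.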

Returning the lat long with minimum distance in python

I want to find the lat, long combination with the minimum distance. x_lat and x_long are constant. I want to form combinations of y_latitude and y_longitude, calculate the distance for each, find the minimum distance, and return the corresponding y_latitude and y_longitude.
This is what I am trying:
x_lat = 33.50194395
x_long = -112.048885
y_latitude = ['56.16', '33.211045400000003', '37.36']
y_longitude = ['-117.3700631', '-118.244']
I have a distance function which would return the distance,
from math import radians, cos, sin, asin, sqrt
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
So I tried something like the following:
import itertools
dist = []
for i in itertools.product(y_latitude, y_longitude):
    print(i)
    dist.append(distance(float(i[1]), float(i[0]), float(x_long), float(x_lat)))
print(dist.index(min(dist)))
This creates all possible combinations of y_latitude and y_longitude, calculates the distance for each, and returns the index of the minimum distance. I am not able to make it return the corresponding y_latitude and y_longitude.
Here the index of the minimum distance is 2, so the output is 2. The required output is ('33.211045400000003', '-117.3700631'), which I cannot get it to return.
Can anybody help me solve this last piece?
Thanks
Try this:
dist = []
for i in itertools.product(y_latitude, y_longitude):
    dist.append([distance(float(i[1]), float(i[0]), float(x_long), float(x_lat)), i])
min_lat, min_lng = min(dist, key=lambda x: x[0])[1]
Append the lat and long along with the distance, then take the minimum by the first element.
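A variation on the same idea, as a sketch: `min()` with a `key` function can be applied directly to the combinations, so the intermediate list is not needed. The values are the ones from the question, and the `distance` function is restated so the snippet runs on its own.

```python
import itertools
from math import radians, cos, sin, asin, sqrt

def distance(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6367 * 2 * asin(sqrt(a))

x_lat, x_long = 33.50194395, -112.048885
y_latitude = ['56.16', '33.211045400000003', '37.36']
y_longitude = ['-117.3700631', '-118.244']

# min() with key= compares candidates by distance but returns the candidate itself.
best = min(
    itertools.product(y_latitude, y_longitude),
    key=lambda pair: distance(float(pair[1]), float(pair[0]), x_long, x_lat),
)
print(best)
```

Here `best` is the (latitude, longitude) pair of strings exactly as they appear in the input lists, which matches the required output described in the question.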

Finding distance between two gps points in Python

I have the method below (haversine) that returns the distance between two GPS points. The table below is my dataframe.
When I apply the function to the dataframe as follows, I get the error "cannot convert the series to <class 'float'>". I am not sure what I am missing. Any help would be appreciated.
distdf1['distance'] = distdf1.apply(lambda x: haversine(distdf1['SLongitude'], distdf1['SLatitude'], distdf1['ClosestLong'], distdf1['ClosestLat']), axis=1)
Dataframe:
SLongitude SLatitude ClosestLong ClosestLat
0 -100.248093 25.756313 -98.220240 26.189491
1 -77.441536 38.991512 -77.481600 38.748722
2 -72.376370 40.898690 -73.662870 41.025640
Method:
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
Try:
distdf1.apply(lambda x: haversine(x['SLongitude'], x['SLatitude'], x['ClosestLong'], x['ClosestLat']), axis=1)
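The fix works because, with axis=1, the lambda receives one row at a time as `x`, so `x['SLongitude']` is a single number, whereas `distdf1['SLongitude']` is the whole column. The sketch below rebuilds the dataframe from the question and compares the row-wise `apply` with a column-wise NumPy variant. The name `haversine_np` is used here only for illustration.

```python
import numpy as np
import pandas as pd
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Scalar great-circle distance in km (inputs in decimal degrees)."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6367 * 2 * asin(sqrt(a))

def haversine_np(lon1, lat1, lon2, lat2):
    """Same formula with NumPy functions, so whole columns can be passed."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

distdf1 = pd.DataFrame({
    'SLongitude': [-100.248093, -77.441536, -72.376370],
    'SLatitude': [25.756313, 38.991512, 40.898690],
    'ClosestLong': [-98.220240, -77.481600, -73.662870],
    'ClosestLat': [26.189491, 38.748722, 41.025640],
})

# Row-wise: inside the lambda, x is a single row, so x['...'] is a scalar.
row_wise = distdf1.apply(
    lambda x: haversine(x['SLongitude'], x['SLatitude'], x['ClosestLong'], x['ClosestLat']),
    axis=1,
)

# Column-wise: NumPy functions accept the Series directly, no apply needed.
vectorized = haversine_np(distdf1['SLongitude'], distdf1['SLatitude'],
                          distdf1['ClosestLong'], distdf1['ClosestLat'])

distdf1['distance'] = vectorized
print(distdf1)
```

Both approaches should give the same numbers. The column-wise form avoids a Python-level function call per row, which tends to matter on large dataframes.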
