Related
I have total 32 variables in dataframe,
X1 to X16 - Latitude values and
Y1 to Y16 - Longitude values for 16 different positions.
I want to perform following steps on these values using python,
calculate distance between each position (X1,Y1) with every other position. Do it for all the positions and then average the distance.
e.g., calculate distance between (X1,Y1) & (x2,y2), (X1,Y1) & (x3,y3), (x1,y1)&(x4,y4) etc - then average distance(A1)
calculate distance between (X2,Y2) & (x1,y1),(X2,Y2) & (x3,y3) etc - then average distance (A2)...etc
Finally i want to take the mean of A1+A2+...+A16 and insert in a column for corresponding rows.
I want to do it to compare the final column (mean of A's) with dependent variable.
I know there is something like following code to work with latitude and longitude but dont know how can i use it in my case.
vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
"""
slightly modified version: of http://stackoverflow.com/a/29546836/2901002
Calculate the great circle distance between two points
on the earth (specified in decimal degrees or in radians)
All (lat, lon) coordinates must have numeric dtypes and be of equal length.
"""
if to_radians:
lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
a = np.sin((lat2-lat1)/2.0)**2 + \
np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
return earth_radius * 2 * np.arcsin(np.sqrt(a))
df['dist'] = haversine(df.LAT.shift(), df.LONG.shift().df.loc[1:, 'LAT'], df.loc[1:, 'LONG'])
The below should help you to find the distance between two coordinates:
# Python 3 program to calculate Distance Between Two Points on Earth
from math import radians, cos, sin, asin, sqrt
def distance(lat1, lat2, lon1, lon2):
# The math module contains a function named
# radians which converts from degrees to radians.
lon1 = radians(lon1)
lon2 = radians(lon2)
lat1 = radians(lat1)
lat2 = radians(lat2)
# Haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
c = 2 * asin(sqrt(a))
# Radius of earth in kilometers. Use 3956 for miles
r = 6371
# calculate the result
return(c * r)
# driver code
lat1 = 53.32055555555556
lat2 = 53.31861111111111
lon1 = -1.7297222222222221
lon2 = -1.6997222222222223
print(distance(lat1, lat2, lon1, lon2), "K.M")
To find the same, for all the positions, using a 'for' loop should help you. It can be there stored in a new column and the mean can be calculated.
Edited:
I am sure the below code will help you. I have created a sample dataset as per your requirement and worked on it. Since you are new to python, I did the whole code for you. Let me know if this is your requirement - attaching the sample dataset, code, and output for you.
Sample input/dataset : sample dataset that i created as per your requirement
Sample Output : sample output
import pandas as pd
from math import radians, cos, sin, asin, sqrt
df = pd.read_excel(r'sample.xlsx', engine='openpyxl')
#function to calculate the distance
def distance(lat1, lat2, lon1, lon2):
# The math module contains a function named
# radians which converts from degrees to radians.
lon1 = radians(lon1)
lon2 = radians(lon2)
lat1 = radians(lat1)
lat2 = radians(lat2)
# Haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
c = 2 * asin(sqrt(a))
# Radius of earth in kilometers. Use 3956 for miles
r = 6371
# calculate the result
return(c * r)
#driver code
#finds the number of rows in df
df_len = df.shape[0]
dist_list = []
#'for' loop that iterates through the every rows of the dataframe
for i in range(df_len):
dist_list = []
for j in range(df_len):
val1 = df.iloc[[i]]
lat1 = int(val1['x'])
lon1 = int(val1['y'])
val2 = df.iloc[[j]]
lat2 = int(val2['x'])
lon2 = int(val2['y'])
#function calling to calculate the distance between the (x1, y1) and (x2, y2), and so on.
dist_btwn = distance(lat1, lat2, lon1, lon2)
# appending the distance to a "dist_list"
dist_list.append(dist_btwn)
col_name = "dist between ({}, {}) and every other points".format(lat1,lon1)
df[col_name] = dist_list
#lets now print the dataframe
print(df)
I have a dataset as follows,
Id Latitude longitude
1 25.42 55.47
2 25.39 55.47
3 24.48 54.38
4 24.51 54.54
I want to find the nearest distance for every point for the dataset. I found the following distance function in the internet,
from math import radians, cos, sin, asin, sqrt
def distance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
return km
I am using the following function,
shortest_distance = []
for i in range(1,len(data)):
distance1 = []
for j in range(1,len(data)):
distance1.append(distance(data['Longitude'][i], data['Latitude'][i], data['Longitude'][j], data['Latitude'][j]))
shortest_distance.append(min(distance1))
But this code is looping twice for each entry and return n^2 iterations and in turn it is very slow. My dataset contains nearly 1 million records and each time looping through all the elements twice is becoming very costly.
I want to find the better way to find out the nearest point for each row. Can anybody help me in finding a way to solve this in python ?
Thanks
The brute force method of finding the nearest of N points to a given point is O(N) -- you'd have to check each point.
In contrast, if the N points are stored in a KD-tree, then finding the nearest point is on average O(log(N)).
There is also the additional one-time cost of building the KD-tree, which requires O(N) time.
If you need to repeat this process N times, then the brute force method is O(N**2) and the kd-tree method is O(N*log(N)).
Thus, for large enough N, the KD-tree will beat the brute force method.
See here for more on nearest neighbor algorithms (including KD-tree).
Below (in the function using_kdtree) is a way to compute the great circle arclengths of nearest neighbors using scipy.spatial.kdtree.
scipy.spatial.kdtree uses the Euclidean distance between points, but there is a formula for converting Euclidean chord distances between points on a sphere to great circle arclength (given the radius of the sphere).
So the idea is to convert the latitude/longitude data into cartesian coordinates, use a KDTree to find the nearest neighbors, and then apply the great circle distance formula to obtain the desired result.
Here are some benchmarks. Using N = 100, using_kdtree is 39x faster than the orig (brute force) method.
In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop
In [181]: %timeit using_sklearn(data)
1 loop, best of 3: 214 ms per loop
In [179]: %timeit orig(data)
1 loop, best of 3: 728 ms per loop
For N = 10000:
In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop
In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop
In [7]: %timeit orig(data)
# untested; too slow
Since using_kdtree is O(N log(N)) and orig is O(N**2), the factor by
which using_kdtree is faster than orig will grow as N, the length of
data, grows.
import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt
R = 6367
def using_kdtree(data):
"Based on https://stackoverflow.com/q/43020919/190597"
def dist_to_arclength(chord_length):
"""
https://en.wikipedia.org/wiki/Great-circle_distance
Convert Euclidean chord length to great circle arc length
"""
central_angle = 2*np.arcsin(chord_length/(2.0*R))
arclength = R*central_angle
return arclength
phi = np.deg2rad(data['Latitude'])
theta = np.deg2rad(data['Longitude'])
data['x'] = R * np.cos(phi) * np.cos(theta)
data['y'] = R * np.cos(phi) * np.sin(theta)
data['z'] = R * np.sin(phi)
tree = spatial.KDTree(data[['x', 'y','z']])
distance, index = tree.query(data[['x', 'y','z']], k=2)
return dist_to_arclength(distance[:, 1])
def orig(data):
def distance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
c = 2 * asin(sqrt(a))
km = R * c
return km
shortest_distance = []
for i in range(len(data)):
distance1 = []
for j in range(len(data)):
if i == j: continue
distance1.append(distance(data['Longitude'][i], data['Latitude'][i],
data['Longitude'][j], data['Latitude'][j]))
shortest_distance.append(min(distance1))
return shortest_distance
def using_sklearn(data):
"""
Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
"""
def distance(p1, p2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
lon1, lat1 = p1
lon2, lat2 = p2
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
km = R * c
return km
points = data[['Longitude', 'Latitude']]
nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
distances, indices = nbrs.kneighbors(points)
result = distances[:, 1]
return result
np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N),
'Longitude':np.random.uniform(0,360,size=N)})
expected = orig(data)
for func in [using_kdtree, using_sklearn]:
result = func(data)
assert np.allclose(expected, result)
You can do this very efficiently by calling a library that implements smart algorithms for this, one example would be sklearn which has a NearestNeighbors method that does exactly this.
Example of the code modified to do this:
from sklearn.neighbors import NearestNeighbors
import numpy as np
def distance(p1, p2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
lon1, lat1 = p1
lon2, lat2 = p2
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
km = 6367 * c
return km
points = [[25.42, 55.47],
[25.39, 55.47],
[24.48, 54.38],
[24.51, 54.54]]
nbrs = NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
distances, indices = nbrs.kneighbors(points)
result = distances[:, 1]
which gives
>>> result
array([ 1.889697 , 1.889697 , 17.88530556, 17.88530556])
You can use a dictionary to hash some calculations. Your code calculates the distance A to B many times (A and B being 2 arbitrary points in your dataset).
Either implement your own cache:
from math import radians, cos, sin, asin, sqrt
dist_cache = {}
def distance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
try:
return dist_cache[(lon1, lat1, lon2, lat2)]
except KeyError:
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
dist_cache[(lon1, lat1, lon2, lat2)] = km
return km
Or use lru_cache:
from math import radians, cos, sin, asin, sqrt
from functools import lru_cache
#lru_cache
def distance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
return km
I am new to Python. I am trying to calculate Haversine on a Panda Dataframe. I have 2 dataframes. Like this: First 3 rows of first dataframe
Second one: First 3 rows of second dataframe
Here is my haversine function.
from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
r = 3956 # Radius of earth in kilometers.
return c * r
I took the longitude and latitude values in the first dataframe as centers and drew circles on the map (I took the Radius as 1000m). First, I try to give all the lon and lat values in the second dataframe to the haversine function with the lon and lat values in the first row in the first dataframe. Then I'll do the same for the other rows in the first dataframe. Thus, I will be able to find out whether the coordinates (longitude and latitude values) in the second dataframe are located in circles with central longitude and latitude values in the first dataframe. It works when i use like this:
a = haversine(29.023165,40.992752,28.844604,41.113586)
radius = 1.00 # in kilometer
if a <= radius:
print('Inside the area')
else:
print('Outside the area')
In the codes I wrote, I could not give the exact order I wanted. I mean I tried my code by giving all the lon and lat values in the first dataframe and the second dataframe, but logically this is wrong (or unnecessary operation). I tried the below code (I tried the code Haversine Distance Calc using Pandas Data Frame "cannot convert the series to <class 'float'>") But it gives an error: ('LONGITUDE', 'occurred at index 0').
from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
r = 3956 # Radius of earth in kilometers.
return c * r
iskeleler.loc['density'] = iskeleler.apply(lambda row: haversine(iskeleler['lon'], iskeleler['lat'], row['LONGITUDE'], row['LATITUDE']), axis=1)
Can you help me with how I can do this? Thanks in advance.
The code you are using to calculate haversine distance receives one float in each argument, so indeed you need to pass floats for each argument. In this case iskeleler['lon'] and iskeleler['lat'] are Series.
This should work to calculate the distance between coordinates in the same row:
iskeleler.loc['density'] = iskeleler.apply(lambda row: haversine(
row['lon'], row['lat'],
row['LONGITUDE'], row['LATITUDE']
),axis=1)
But you are looking for a pair-wise distance which might require a for loop and this is not efficient. Try sklearn.metrics.pairwise.haversine_distances
from sklearn.metrics.pairwise import haversine_distances
distance_matrix = haversine_distances(
iskeleler[['lat', 'lon']],
iskeleler[['LATITUDE', 'LONGITUDE']]
)
If you prefer the table structure, then:
distance_table = pd.DataFrame(
distance_matrix,
index=pd.MultiIndex.from_frames(iskeleler[['lat', 'lon']]),
columns=pd.MultiIndex.from_frames(iskeleler[['LATITUDE', 'LONGITUDE']]),
).stack([0, 1]).reset_index(name='distance')
This is an example, there are many ways to create the dataframe from the matrix.
I am trying to create a new column to manipulate the data set:
df['longitude'] = df['longitude'].astype(float)
df['latitude'] = df['latitude'].astype(float)
then ran the function for haversine:
from math import radians, cos, sin, asin, sqrt
def haversine(lon1,lat1,lat2,lon2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
return km
But when I run this code :
df['d_centre']=haversine(lon1,
lat1,
df.longitude.astype(float),
df.latitude.astype(float))
to create a new column in my df I get this error:
Error: cannot convert the series to <class 'float'>
I tried this as well:
df['d_centre']= haversine(lon1,lat1,lat2,lon2)
the haversine is working but when I try to create the new column in my df, I get this error. I have tried converting to a list as well but I'm getting the same result
I figured out the answer: have to use numpy for all the math and write the code for the new column with the df
from math import radians, cos, sin, asin, sqrt
def haversine_np(lon1,lat1,lon2,lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
km = 6367 * c
return km
Create a new column:
df2['d_centre'] =haversine_np(df2['lon1'],df2['lat1'],df2['lon2'],df2['lat2'])
I want to find the lat, long combination with minimum distance. x_lat, x_long are constant. I want to get combinations of y_latitude, y_longitude and calculate the distance and find out the minimum distance and return the corresponding y_latitude, y_longitude.
The following is trying,
x_lat = 33.50194395
x_long = -112.048885
y_latitude = ['56.16', '33.211045400000003', '37.36']
y_longitude = ['-117.3700631', '-118.244']
I have a distance function which would return the distance,
from math import radians, cos, sin, asin, sqrt
def distance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
return km
So I tried something like the following,
dist = []
for i in itertools.product(y_latitude , y_longitude):
print i
dist.append(distance(float(i[1]),float(i[0]),float(x_long), float(x_lat)))
print dist.index(min(dist))
So this creates all possible combinations of y_latitude and y_longitude and calculates distance and returns the index of minimum distance. I am not able to make it return the corresponding y_latitude and y_longitude.
Here the index of minimum distance is 2 and output is 2. The required output is ('33.211045400000003', '-117.3700631'), which I am not able to make it return.
Can anybody help me in solving the last piece?
Thanks
Try this,
dist = []
for i in itertools.product(y_latitude , y_longitude):
dist.append([distance(float(i[1]),float(i[0]),float(x_long), float(x_lat)),i])
min_lat,min_lng = min(dist, key = lambda x: x[0])[1]
Append the lat and long along with the dist, And get min of first index,