I have a DataFrame containing GPS coordinates (Timestamp, Latitude, Longitude) of vehicle tracks. The interval between points varies from 1 second to 30 seconds. It depends on some logic in the GPS receiver, including speed, but is not very reliable.
These tracks can be very long and contain many thousands of GPS points, especially when a vehicle is moving slowly or is at rest. The data looks like this:
Timestamp        Latitude  Longitude
0 days 00:00:00  51.1513   9.61053
0 days 00:00:28  51.1513   9.61049
0 days 00:00:29  51.1513   9.61048
0 days 00:00:31  51.1513   9.61048
0 days 00:00:33  51.1513   9.61048
I want to reduce the size of the DataFrames by keeping only GPS points that are at least 50 meters away from the previously kept GPS position. The distance between two GPS positions is calculated using the haversine formula:
from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees) in meters
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    # Radius of earth is 6371 km, i.e. 6371000 m
    m = 6371000 * c
    return m
Currently I use a very naive approach: I loop over the DataFrame and build a mask that marks the points at least 50 meters apart. This is very inefficient, and I am looking for an efficient way to calculate this for large DataFrames.
import numpy as np

def reduce_gps(df):
    mask = np.full(len(df), False)
    cpos = 0  # position of the last kept point
    lat_col = df.columns.get_loc('Latitude')
    lon_col = df.columns.get_loc('Longitude')
    for pos in range(len(mask)):
        # keep a point if it is more than 50 m from the last kept point, or if it is the last row
        if haversine(df.iloc[cpos, lat_col], df.iloc[cpos, lon_col],
                     df.iloc[pos, lat_col], df.iloc[pos, lon_col]) > 50 or (pos == len(mask) - 1):
            cpos = pos
            mask[pos] = True
    return df[mask]
The haversine formula can be vectorized if this is helpful:
def haversine_vec(df):
    data = np.deg2rad(df[['Latitude', 'Longitude']])
    diff = data.shift() - data
    d = np.sin(diff['Latitude']/2)**2 + np.cos(data['Latitude'])*np.cos(data['Latitude'].shift()) * np.sin(diff['Longitude']/2)**2
    return 2 * 6371000 * np.arcsin(np.sqrt(d))
I uploaded a small set of sample data here:
pd.read_csv('https://pastebin.com/raw/qeUDKr9z')
Try using a list comprehension:
df['distance']=[haversine(df.Latitude[i],df.Longitude[i],df.Latitude[i+1],df.Longitude[i+1]) if i!=len(df)-1 else 0 for i in range(len(df))]
df[df.distance>50]
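Note that the list comprehension above measures the distance between consecutive rows, while the loop in the question measures the distance from the last kept point. Because each decision depends on the previous one, that logic is hard to vectorize fully. A minimal sketch of a cheaper single pass, assuming the coordinates are first pulled out of the DataFrame into plain sequences so that the per-row df.iloc overhead disappears. The helper name keep_indices is hypothetical, and unlike the question's loop it always keeps the first point as the starting reference:

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6371000 * 2 * asin(sqrt(a))

def keep_indices(lats, lons, min_dist=50.0):
    """Indices of points more than min_dist meters from the previously kept point."""
    n = len(lats)
    if n == 0:
        return []
    keep = [0]  # assumption: the first point is always kept as the reference
    ref_lat, ref_lon = lats[0], lons[0]
    for i in range(1, n):
        if haversine(ref_lat, ref_lon, lats[i], lons[i]) > min_dist:
            keep.append(i)
            ref_lat, ref_lon = lats[i], lons[i]
    if keep[-1] != n - 1:  # like the question's loop, always keep the last point
        keep.append(n - 1)
    return keep

# points along a meridian, roughly 11 m per 0.0001 degrees of latitude
lats = [0.0, 0.0001, 0.0002, 0.0005, 0.0006, 0.0011, 0.0012]
lons = [0.0] * len(lats)
kept = keep_indices(lats, lons)
```

With a DataFrame this could be used as df.iloc[keep_indices(df['Latitude'].to_numpy(), df['Longitude'].to_numpy())].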
Related
I have a large dataset of around 2 million rows and 4 columns: observation_id, latitude, longitude and class_id, like this:
Observation_id  Latitude   Longitude  Class_id
10131188        45.146973  6.416794   101
10799362        46.783695  -2.072855  700
10392536        48.604866  -2.825003  1456
...             ...        ...        ...
22068176        29.806055  -98.41853  5532
There are 18,000 classes, but some of them are over-represented and some are under-represented. Note that each observation is either in France or in the USA.
I need to find, for each observation, the distance to the closest observation of every class.
For example, for the first observation (which belongs to the class 101 if we look at the table above), I will have a vector of size 18,000. The first value of the vector will represent the distance in km to the closest occurrence of class 1, the second value will represent the distance in km to the closest occurrence of class 2, and so on until the last value which will represent, you guessed it, the distance in km to the closest occurrence of class 18,000.
If the distance is too large (let's say more than 50km), I don't need the exact distance but a fixed value (50 km in this case). So if the closest occurrence from one class to my observation is more than 50km (whether it's 51km or 9,000km), I can fill 50 for the corresponding value of the observation's vector.
But I see two problems here:
My code will take forever to run.
The created file will be huge.
I started writing a small script that calculates the haversine distance, but it takes around 8 seconds for a single observation, so running it for 2 million would be impractical. Here it is anyway:
from math import radians, sin, cos, asin, sqrt
import numpy as np

lat1 = radians(45.705116)  # lat for observation 10561949, converted to radians once
lon1 = radians(1.424622)  # lon for observation 10561949, converted to radians once
df = df[df.observation_id != 10561949]  # removing observation 10561949 from the DataFrame
list_obs = np.full(18000, 50.0)  # Array of size 18 000 filled with the value 50 (float, so distances are not truncated)
for class_id, lat2, lon2 in zip(df['class_id'], df['latitude'], df['longitude']):
    lat2, lon2 = radians(lat2), radians(lon2)  # convert to radians
    a = sin((lat2 - lat1)/2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1)/2)**2  # Haversine distance (1/2)
    dist = 2 * asin(sqrt(a)) * 6371  # Haversine distance (2/2)
    if list_obs[class_id] >= dist:  # indexed by class, assuming class ids run from 0 to 17 999
        list_obs[class_id] = dist
Do you have an idea how to speed up the algorithm (the distance doesn't have to be calculated perfectly, I just need a rough idea of the nearest neighbor of each class for each observation) and how to store the gigantic result afterwards (it will be an array of 2,000,000 x 18,000)?
The idea after this is to feed the result to a neural network (let's say an MLP), to see the difference with a simple K-Nearest Neighbor.
Since you only care about distances below 50 km, the best saving I can think of is to pre-assign the points (approximately) to cells of a grid, so that distances to far-away points never have to be computed.
Below is my best attempt to solve this. It has a setup complexity of O(len(df)) but a search complexity of just O(9 * avg. bin size), which is significantly less than the O(len(df)) of your example.
Note 1) there are large parts of this that can be vectorized to improve performance.
Note 2) there are most certainly better ways to bin distances on a sphere, I am just not that familiar with them, but the idea to first index values such that you can quickly find all data points within distance x is the key.
Note 3) I would be surprised if this code is bug free.
# generate dummy data --------------------------
import random
from collections import defaultdict
from math import radians, sin, cos, asin, sqrt
import numpy as np
import pandas as pd

random.seed(10)
rand_float = lambda: (random.random() - .5) * 90 * 2
rand_int = lambda: int(random.random() * 18000)
dummy_data = [(rand_float(), rand_float(), rand_int()) for i in range(100_000)]
df = pd.DataFrame(data=dummy_data, columns=('lat', 'lon', 'class'))

# bin data points -------------------------------
def find_bin(lat, lon, bin_size=60):
    """Approximately bin data points
    https://stackoverflow.com/questions/1253499/simple-calculations-for-working-with-lat-lon-and-km-distance
    approximate distance conversion to exclude "far away" distances - only needs to be approximate since we add buffer,
    however I am sure there are much better methods one could use to do this approximation, I spent 10 mins googling
    """
    # roughly 110 km per degree of latitude; a degree of longitude shrinks with cos(latitude).
    # Because that scale changes with latitude, bins in neighbouring latitude rows do not
    # line up exactly, so the 3x3 neighbourhood search below stays approximate.
    return int(110 * lat // bin_size), int(110 * lon * cos(radians(lat)) // bin_size)

bins = defaultdict(list)
for i, row in df.iterrows():  # O(len(df))
    bins[find_bin(row['lat'], row['lon'])].append(int(i))  # this is slow, it can be vectorized less elegantly but only needs run once
print(f'average bin size {sum(map(len, bins.values()))/len(bins)}')

# find distances to neighbours ------------------
def compute_distance(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])  # convert to radians
    a = sin((lat2 - lat1)/2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1)/2)**2  # Haversine distance (1/2)
    return 2 * asin(sqrt(a)) * 6371  # Haversine distance (2/2)

def neighbours(x, y):
    yield x-1, y-1
    yield x-1, y
    yield x-1, y+1
    yield x  , y-1
    yield x  , y
    yield x  , y+1
    yield x+1, y-1
    yield x+1, y
    yield x+1, y+1

obs = 1000  # 10561949
lat1, lon1, _ = df.iloc[obs].values
print(f'finding {lat1}, {lon1}')
b = find_bin(lat1, lon1)
list_obs = np.full(18000, 50.0)  # Array of size 18 000 filled with the value 50 (float, so distances are not truncated)
for adj_bin in neighbours(*b):  # O(9)
    for i in bins[adj_bin]:  # O(avg. bin size)
        lat2, lon2, class_ = df.loc[i].values
        dist = compute_distance(lon1, lat1, lon2, lat2)  # arguments in (lon, lat) order, matching the signature
        if dist < list_obs[int(class_)]:
            print(f'dist {dist} to {i} ({class_})')
            list_obs[int(class_)] = dist
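Note 1 above mentions that much of this can be vectorized. A minimal sketch of the inner step, assuming the candidate rows (for example, the contents of the nine neighbouring bins) have already been gathered into NumPy arrays. The function name nearest_per_class and its parameters are hypothetical. np.minimum.at performs an unbuffered "minimum per index" reduction, so several candidates of the same class are handled correctly:

```python
import numpy as np

def nearest_per_class(lat1, lon1, lats, lons, classes, n_classes, cap_km=50.0):
    """Distance in km from (lat1, lon1) to the closest candidate of each class, capped at cap_km."""
    lat1r, lon1r = np.radians(lat1), np.radians(lon1)
    latr, lonr = np.radians(lats), np.radians(lons)
    a = np.sin((latr - lat1r) / 2) ** 2 + np.cos(lat1r) * np.cos(latr) * np.sin((lonr - lon1r) / 2) ** 2
    dist = 2 * 6371 * np.arcsin(np.sqrt(a))  # haversine distance to every candidate at once
    out = np.full(n_classes, cap_km)  # classes with no candidate within cap_km keep the cap value
    np.minimum.at(out, classes, dist)  # per-class minimum, safe with repeated class ids
    return out

# four candidates on the meridian through (0, 0); class ids are 0-based here
lats = np.array([0.0, 0.1, 0.2, 10.0])
lons = np.zeros(4)
classes = np.array([0, 0, 1, 2])
result = nearest_per_class(0.0, 0.0, lats, lons, classes, n_classes=4)
```

Since most entries of each 18,000-long vector will equal the cap, storing only the entries below the cap (a sparse layout) is one way to keep the result file manageable.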
I am trying to find the minimum distance between each customer to the store. Currently, there are ~1500 stores and ~670K customers in my data. I have to calculate the geo distance for 670K customers x 1500 stores and find the minimum distance for each customer.
I have created the haversine function below:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    miles = 6367 * c/1.609
    return miles
and my data set looks like below, 1 data frame for the customer (cst_geo) and 1 data frame for the store (store_geo). The numbers below are made up as I can't share the snippet of the real data:
Customer ID  Latitude  Longitude
A123         39.342    -40.800
B456         38.978    -41.759
C789         36.237    -77.348

Store ID  Latitude  Longitude
S1        59.342    -60.800
S2        28.978    -71.759
S3        56.237    -87.348
I wrote a for loop below to attempt this calculation but it took >8 hours to run. I have tried to use deco but wasn't able to optimize it any further.
mindist = []
for i in cst_geo.index:
    dist = []
    for j in store_geo.index:
        dist.append(haversine_np(cst_geo.longitude[i], cst_geo.latitude[i],
                                 store_geo.longitude[j], store_geo.latitude[j]))
    mindist.append(min(dist))
This can be done with geopy:
from geopy.distance import geodesic
customers = [
    (39.342, -40.800),
    (38.978, -41.759),
    (36.237, -77.348),
]
stores = [
    (59.342, -60.800),
    (28.978, -71.759),
    (56.237, -87.348),
]
# build each row separately; [[None] * n] * m would make every row the same list object
matrix = [[None] * len(customers) for _ in stores]
for index, i in enumerate(customers):
    for j_index, j in enumerate(stores):
        matrix[j_index][index] = geodesic(i, j).meters
After the loop, matrix[j_index][index] holds the distance in meters from customer index to store j_index.
You can also get the distance in other units, such as kilometers, miles, or feet.
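For the sizes in the question (about 670K customers and 1500 stores), a pure Python double loop is likely to stay slow whichever distance function it calls. A minimal sketch of an alternative, assuming the haversine formula is accurate enough: NumPy broadcasting computes a block of customers against all stores at once, and processing in chunks keeps memory bounded. The function name nearest_store and the chunk parameter are hypothetical:

```python
import numpy as np

def nearest_store(cust_lat, cust_lon, store_lat, store_lon, chunk=5_000):
    """For each customer, return (distance in km, index) of the closest store."""
    c_lat = np.radians(np.asarray(cust_lat, dtype=float))
    c_lon = np.radians(np.asarray(cust_lon, dtype=float))
    s_lat = np.radians(np.asarray(store_lat, dtype=float))
    s_lon = np.radians(np.asarray(store_lon, dtype=float))
    best = np.empty(len(c_lat))
    idx = np.empty(len(c_lat), dtype=int)
    for start in range(0, len(c_lat), chunk):
        sl = slice(start, start + chunk)
        # (customers in chunk) x (stores) matrices via broadcasting
        dlat = s_lat[None, :] - c_lat[sl, None]
        dlon = s_lon[None, :] - c_lon[sl, None]
        a = np.sin(dlat / 2) ** 2 + np.cos(c_lat[sl, None]) * np.cos(s_lat[None, :]) * np.sin(dlon / 2) ** 2
        d = 2 * 6371 * np.arcsin(np.sqrt(a))
        idx[sl] = d.argmin(axis=1)
        best[sl] = d.min(axis=1)
    return best, idx

# two customers and three stores on the equator (plus one far-away store)
best, idx = nearest_store([0.0, 0.0], [0.0, 10.0], [0.0, 0.0, 50.0], [1.0, 9.0, 50.0], chunk=1)
```

With the question's frames this would be called as nearest_store(cst_geo.latitude, cst_geo.longitude, store_geo.latitude, store_geo.longitude). The result is in km; divide by 1.609 for miles, as the question's function does.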
I have two datasets, one describing locations and second having various points:
locations.head()
latitude longitude geobounds_lon1 geobounds_lat1 geobounds_lon2 geobounds_lat2
0 52.5054 13.33320 13.08830 52.6755 13.7611 52.3382
1 54.6192 9.99778 7.86496 55.0581 11.3129 53.3608
2 41.6671 -71.27420 -71.90730 42.0188 -71.0886 41.0958
3 25.9859 -80.12280 -87.81370 30.9964 -78.9917 24.5071
4 43.7004 11.51330 9.63364 44.5102 12.4104 42.1654
points.head()
category lat lon
0 161 47.923132 11.507743
1 161 47.926479 11.531736
2 161 47.943670 11.576099
3 161 57.617577 12.040591
4 23 52.124071 -0.491918
I need to calculate distances from each offer (based on locations.latitude and locations.longitude) to every point of each category (for example, 161). Only points that are not too far from the location matter to me. I thought that using the location's boundaries might help, so that I wouldn't need to calculate all distances and then filter them.
The biggest problem for me is how to efficiently filter the points for every location (based on category and boundaries) and calculate the distances from the location to these points, as the data is quite big (there are almost 9 million rows in locations and more than 10 million rows in points).
For distance calculation I tried BallTree:
import numpy as np
from sklearn.neighbors import BallTree

RADIANT_TO_KM_CONSTANT = 6367

class BallTreeIndex:
    def __init__(self, lat_longs):
        # the haversine metric expects (lat, lon) pairs in radians
        self.lat_longs = np.radians(lat_longs)
        self.ball_tree_index = BallTree(self.lat_longs, leaf_size=40, metric='haversine')

    def query_radius(self, query, radius):
        radius_radiant = radius / RADIANT_TO_KM_CONSTANT  # km to radians
        query = np.radians(np.array([query]))
        result = self.ball_tree_index.query_radius(query, r=radius_radiant,
                                                   return_distance=True)
        return result[1][0]  # distances (in radians) for the single query point
And for filtering points:
condition = (points.category == c) & (points.lat > lat2) & (points.lat < lat1) & (points.lon < lon2) & (points.lon > lon1)
tmp = points[condition]
where c is the specific category, lat1, lat2, lon1, lon2 are the location boundaries.
However, this would take a lot of time, so I wonder if there is any way to make it faster.
I would like to have a new column in locations dataframe, for example:
distances_161
0 [distance0_0, distance0_1, ...]
1 [distance1_0, distance1_1, ...]
2 [distance2_1, distance2_2, ...]
I'm not 100% certain that this is what you want, but it seems to make sense to me.
import numpy as np
import pandas

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

df = {'lon1': [40.7454513],
      'lat1': [-73.9536799],
      'lon2': [40.7060268],
      'lat2': [-74.0110188]}
df
df['distance'] = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])
Result:
array([6.48545403])
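So Python gives about 6.485 and Google says 6.5 miles. Note that haversine_np returns kilometers (6367 is the Earth's radius in km), and that the dictionary above stores what look like latitudes (about 40.7) under the lon keys and longitudes (about -74) under the lat keys, so the inputs and units should be checked before comparing the two figures.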
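The bounding-box filter from the question and a vectorized haversine can be combined so that distances are only computed for points inside a location's bounds. A minimal sketch, assuming the points of one category have already been selected once (for example with points[points.category == c]) and converted to NumPy arrays. The function name distances_in_bounds and the order of the bounds tuple are hypothetical:

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; inputs in decimal degrees, arrays broadcast."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

def distances_in_bounds(loc_lat, loc_lon, bounds, pt_lat, pt_lon):
    """Distances from one location to the points inside bounds = (lon_min, lat_min, lon_max, lat_max)."""
    lon_min, lat_min, lon_max, lat_max = bounds
    inside = (pt_lat > lat_min) & (pt_lat < lat_max) & (pt_lon > lon_min) & (pt_lon < lon_max)
    return haversine_np(loc_lon, loc_lat, pt_lon[inside], pt_lat[inside])

# first location from the question; the third point is made up so that one point falls inside
pt_lat = np.array([47.923132, 57.617577, 52.5])
pt_lon = np.array([11.507743, 12.040591, 13.4])
d = distances_in_bounds(52.5054, 13.3332, (13.0883, 52.3382, 13.7611, 52.6755), pt_lat, pt_lon)
```

Selecting each category once, rather than re-evaluating points.category == c for every location, avoids scanning the full points table millions of times.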
I want to find the (lat, long) combination with the minimum distance. x_lat and x_long are constant. I want to form all combinations of y_latitude and y_longitude, calculate the distance for each, find the minimum distance, and return the corresponding y_latitude and y_longitude.
This is what I am trying:
x_lat = 33.50194395
x_long = -112.048885
y_latitude = ['56.16', '33.211045400000003', '37.36']
y_longitude = ['-117.3700631', '-118.244']
I have a distance function which would return the distance,
from math import radians, cos, sin, asin, sqrt

def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
So I tried something like the following,
import itertools

dist = []
for i in itertools.product(y_latitude, y_longitude):
    print(i)
    dist.append(distance(float(i[1]), float(i[0]), float(x_long), float(x_lat)))
print(dist.index(min(dist)))
So this creates all possible combinations of y_latitude and y_longitude and calculates distance and returns the index of minimum distance. I am not able to make it return the corresponding y_latitude and y_longitude.
Here the index of minimum distance is 2 and output is 2. The required output is ('33.211045400000003', '-117.3700631'), which I am not able to make it return.
Can anybody help me in solving the last piece?
Thanks
Try this:
dist = []
for i in itertools.product(y_latitude, y_longitude):
    dist.append([distance(float(i[1]), float(i[0]), float(x_long), float(x_lat)), i])
min_lat, min_lng = min(dist, key=lambda x: x[0])[1]
Append the lat and long along with the distance, then take the minimum by the first element.
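The same result can be obtained without building an intermediate list, by passing the distance as the key to min over the combinations directly. A minimal sketch, reusing the question's data and distance function:

```python
import itertools
from math import radians, cos, sin, asin, sqrt

def distance(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6367 * 2 * asin(sqrt(a))

x_lat = 33.50194395
x_long = -112.048885
y_latitude = ['56.16', '33.211045400000003', '37.36']
y_longitude = ['-117.3700631', '-118.244']

# min() returns the (lat, long) pair itself; the key computes its distance to the fixed point
best = min(itertools.product(y_latitude, y_longitude),
           key=lambda p: distance(float(p[1]), float(p[0]), x_long, x_lat))
min_lat, min_lng = best
```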
I want to be able to get an estimate of the distance between two (latitude, longitude) points. I want to undershoot, as this will be for A* graph search and I want it to be fast. The points will be at most 800 km apart.
The answers to Haversine Formula in Python (Bearing and Distance between two GPS points) provide Python implementations that answer your question.
Using the implementation below I performed 100,000 iterations in less than 1 second on an older laptop. I think for your purposes this should be sufficient. However, you should profile anything before you optimize for performance.
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    # Radius of earth in kilometers is 6371
    km = 6371 * c
    return km
To underestimate, use haversine(lon1, lat1, lon2, lat2) * 0.90 or whatever factor you want. I don't see how introducing error to your underestimation is useful.
Since the distance is relatively small, you can use the equirectangular distance approximation. This approximation is faster than using the Haversine formula. So, to get the distance from your reference point (lat1, lon1) to the point you're testing (lat2, lon2) use the formula below:
from math import sqrt, cos, radians
R = 6371 # radius of the earth in km
x = (radians(lon2) - radians(lon1)) * cos(0.5 * (radians(lat2) + radians(lat1)))
y = radians(lat2) - radians(lat1)
d = R * sqrt(x*x + y*y)
Since R is in km, the distance d will be in km.
Reference: http://www.movable-type.co.uk/scripts/latlong.html
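The equirectangular formula above can be wrapped in a function so it can be dropped in wherever a haversine call would go. A minimal sketch, where the function name equirectangular_km is hypothetical and inputs are in decimal degrees. Note that this is an approximation and is not guaranteed to undershoot the true distance:

```python
from math import radians, cos, sqrt

def equirectangular_km(lat1, lon1, lat2, lon2, R=6371.0):
    """Approximate distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    x = (lon2 - lon1) * cos(0.5 * (lat2 + lat1))  # east-west offset, scaled by the mean latitude
    y = lat2 - lat1  # north-south offset
    return R * sqrt(x * x + y * y)

one_degree_at_equator = equirectangular_km(0.0, 0.0, 0.0, 1.0)
munich_berlin = equirectangular_km(48.1372, 11.5756, 52.5186, 13.4083)
```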
One idea for speed is to transform the long/lat coordinates into 3D (x, y, z) coordinates. After preprocessing the points, use the Euclidean distance between the points as a quickly computed undershoot of the actual distance.
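A minimal sketch of this idea, assuming a spherical Earth of radius 6371 km. The names to_xyz and chord_km are hypothetical. The straight line (chord) through the sphere is never longer than the arc along the surface, so on this model it is a true undershoot. For points 800 km apart it falls short of the arc by less than 0.1%:

```python
from math import radians, cos, sin, sqrt

R = 6371.0  # assumed spherical Earth radius in km

def to_xyz(lat, lon):
    """Convert decimal degrees to 3D Cartesian coordinates in km (done once per point)."""
    lat, lon = radians(lat), radians(lon)
    return (R * cos(lat) * cos(lon), R * cos(lat) * sin(lon), R * sin(lat))

def chord_km(p, q):
    """Straight-line distance between two preprocessed points; never exceeds the arc length."""
    return sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2)

a = to_xyz(0.0, 0.0)
b = to_xyz(0.0, 1.0)
chord = chord_km(a, b)
arc = R * radians(1.0)  # great-circle distance for one degree along the equator
```

After the one-time conversion, each heuristic evaluation needs only subtractions, multiplications and one square root, with no trigonometric calls.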
If the distance between points is relatively small (meters to a few km), then one of the fast approaches could be:
from math import cos, sqrt

def quick_distance(Lat1, Long1, Lat2, Long2):
    x = Lat2 - Lat1
    y = (Long2 - Long1) * cos((Lat2 + Lat1) * 0.00872664626)
    return 111.319 * sqrt(x*x + y*y)
Lat and Long are in degrees, the distance is in km.
Deviation from the haversine distance is on the order of 1%, while the speed gain is more than ~10x.
0.00872664626 = 0.5 * pi/180, which converts the mean latitude to radians.
111.319 is the distance in km that corresponds to 1 degree at the Equator; you could replace it with your median value, like here
https://www.cartographyunchained.com/cgsta1/
or replace it with a simple lookup table.
For maximal speed, you could create something like a rainbow table for coordinate distances. It sounds like you already know the area that you are working with, so it seems like pre-computing them might be feasible. Then, you could load the nearest combination and just use that.
For example, in the continental United States, the longitude is a 55 degree span and latitude is 20, which would be 1100 whole number points. The distance between all the possible combinations is a handshake problem which is answered by (n-1)(n)/2 or about 600k combinations. That seems pretty feasible to store and retrieve. If you provide more information about your requirements, I could be more specific.
You can use cdist from the scipy.spatial.distance module. For example:
from scipy.spatial.distance import cdist
df1_latlon = df1[['lat', 'lon']]
df2_latlon = df2[['lat', 'lon']]
# haversine must be a callable that takes two (lat, lon) rows and returns their distance
distanceCalc = cdist(df1_latlon, df2_latlon, metric=haversine)
To calculate the haversine distance between two points, you can simply use mpu.haversine_distance() from the mpu library, like this:
>>> import mpu
>>> munich = (48.1372, 11.5756)
>>> berlin = (52.5186, 13.4083)
>>> round(mpu.haversine_distance(munich, berlin), 1)
504.2
Please use the following code.
import math
from math import sin, cos, atan2, sqrt

def distance(lat1, lng1, lat2, lng2):
    # returns the distance in meters; for kilometers, remove "* 1000"
    radius = 6371 * 1000
    dLat = (lat2-lat1) * math.pi / 180
    dLng = (lng2-lng1) * math.pi / 180
    lat1 = lat1 * math.pi / 180
    lat2 = lat2 * math.pi / 180
    val = sin(dLat/2) * sin(dLat/2) + sin(dLng/2) * sin(dLng/2) * cos(lat1) * cos(lat2)
    ang = 2 * atan2(sqrt(val), sqrt(1-val))
    return radius * ang