I have two datasets: one describing locations and a second containing various points:
locations.head()
latitude longitude geobounds_lon1 geobounds_lat1 geobounds_lon2 geobounds_lat2
0 52.5054 13.33320 13.08830 52.6755 13.7611 52.3382
1 54.6192 9.99778 7.86496 55.0581 11.3129 53.3608
2 41.6671 -71.27420 -71.90730 42.0188 -71.0886 41.0958
3 25.9859 -80.12280 -87.81370 30.9964 -78.9917 24.5071
4 43.7004 11.51330 9.63364 44.5102 12.4104 42.1654
points.head()
category lat lon
0 161 47.923132 11.507743
1 161 47.926479 11.531736
2 161 47.943670 11.576099
3 161 57.617577 12.040591
4 23 52.124071 -0.491918
I need to calculate distances from each offer (based on locations.latitude and locations.longitude) to every point of a given category (for example, 161). Only the points that are not too far from a location matter to me - I thought the location boundaries might help with that, so I wouldn't need to calculate all the distances and filter them afterwards.
The biggest problem for me is how to efficiently filter the points for every location (by category and boundaries) and calculate the distances from the location to those points, since the data is quite big (there are almost 9 million rows in locations and more than 10 million rows in points).
For distance calculation I tried BallTree:
import numpy as np
from sklearn.neighbors import BallTree

RADIANT_TO_KM_CONSTANT = 6367  # mean Earth radius in km

class BallTreeIndex:
    def __init__(self, lat_longs):
        self.lat_longs = np.radians(lat_longs)
        self.ball_tree_index = BallTree(self.lat_longs, leaf_size=40,
                                        metric='haversine')

    def query_radius(self, query, radius):
        # the tree works in radians, so convert the query radius from km
        radius_radiant = radius / RADIANT_TO_KM_CONSTANT
        query = np.radians(np.array([query]))
        result = self.ball_tree_index.query_radius(query, r=radius_radiant,
                                                   return_distance=True)
        # result is (indices, distances); the distances are in radians
        return result[1][0]
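Called like this (a minimal sketch with hypothetical values; note that query_radius hands back the raw tree distances, which are in radians, so I scale by the constant to get km):

index = BallTreeIndex(points[['lat', 'lon']].to_numpy())
# distances (km) to all points within 10 km of one location
dists_km = index.query_radius((52.5054, 13.3332), radius=10) * RADIANT_TO_KM_CONSTANT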
And for filtering points:
condition = (points.category == c) & (points.lat > lat2) & (points.lat < lat1) & (points.lon < lon2) & (points.lon > lon1)
tmp = points[condition]
where c is the specific category and lat1, lat2, lon1, lon2 are the location boundaries.
However, this would take a lot of time, so I wonder if there is any way to make it faster.
I would like to have a new column in locations dataframe, for example:
distances_161
0 [distance0_0, distance0_1, ...]
1 [distance1_0, distance1_1, ...]
2 [distance2_1, distance2_2, ...]
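A sketch of how the pieces above could fit together (assuming a 50 km cut-off; untested at the 9-million-row scale): build one BallTree per category once, then query each location against the tree of the category of interest. The tree works in radians, so the query radius is converted on the way in and the distances are scaled back to km on the way out; the final loop is still per-location, but each query is logarithmic instead of a full scan.

import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6367
MAX_DIST_KM = 50  # assumed cut-off for "not too far away"

# one tree per category, built once
trees = {
    c: BallTree(np.radians(grp[['lat', 'lon']].to_numpy()), metric='haversine')
    for c, grp in points.groupby('category')
}

def distances_to_category(lat, lon, c, radius_km=MAX_DIST_KM):
    """Distances (km) from one location to all category-c points within radius_km."""
    q = np.radians([[lat, lon]])
    ind, dist = trees[c].query_radius(q, r=radius_km / EARTH_RADIUS_KM,
                                      return_distance=True)
    return dist[0] * EARTH_RADIUS_KM

locations['distances_161'] = [
    distances_to_category(lat, lon, 161)
    for lat, lon in zip(locations.latitude, locations.longitude)
]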
I'm not 100% certain that this is what you want, but it seems to make sense to me.
import numpy as np
import pandas

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
df = {'lon1': [40.7454513],
      'lat1': [-73.9536799],
      'lon2': [40.7060268],
      'lat2': [-74.0110188]}

df['distance'] = haversine_np(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
Result:
array([6.48545403])
So, Python says 6.485 and Google says 6.5 miles. (Two caveats: the function actually returns kilometres, since it uses an Earth radius of 6367 km, and the sample data above appears to have latitude and longitude swapped, so treat the agreement with Google's miles figure with caution.)
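Because haversine_np is built entirely from NumPy ufuncs, it also broadcasts: a single fixed point can be compared against whole columns in one call. A small sketch with made-up data:

import numpy as np
import pandas as pd

pts = pd.DataFrame({'lat': [47.9231, 47.9265, 47.9437],
                    'lon': [11.5077, 11.5317, 11.5761]})

# distances (km) from one fixed location to every row, no Python-level loop
pts['km_to_loc'] = haversine_np(pts['lon'], pts['lat'], 13.3332, 52.5054)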
I have a dataframe containing GPS coordinates (Timestamp, Latitude, Longitude) of vehicle tracks. The interval between points ranges from 1 to 30 seconds; it depends on some logic in the GPS receiver, including speed, but is not very reliable.
These tracks can be very long and contain many thousands of GPS points, especially when a vehicle is moving slowly or is at rest. The data looks like this:
Timestamp        Latitude  Longitude
0 days 00:00:00  51.1513   9.61053
0 days 00:00:28  51.1513   9.61049
0 days 00:00:29  51.1513   9.61048
0 days 00:00:31  51.1513   9.61048
0 days 00:00:33  51.1513   9.61048
I want to reduce the size of the data frames by only including GPS points which are at least 50 meters away from the previously kept GPS position. The distance between two GPS positions is calculated using the haversine formula:
from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees) in meters
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))

    # mean radius of the earth is 6371 km, i.e. 6,371,000 m
    m = 6371000 * c
    return m
Currently I use a very naive approach: looping over the dataframe and building a mask of elements at least 50 meters apart. But this is very inefficient, and I am looking for an efficient way to calculate this for large data frames.
def reduce_gps(df):
    mask = np.full(len(df), False)
    cpos = 0  # index of the last kept point
    lat_col = df.columns.get_loc('Latitude')   # was df_gps in the original, a leftover global
    lon_col = df.columns.get_loc('Longitude')
    for pos in range(len(mask)):
        if haversine(df.iloc[cpos, lat_col], df.iloc[cpos, lon_col],
                     df.iloc[pos, lat_col], df.iloc[pos, lon_col]) > 50 or (pos == len(mask)-1):
            cpos = pos
            mask[pos] = True
    return df[mask]
The haversine formula can be vectorized if this is helpful:
def haversine_vec(df):
    data = np.deg2rad(df[['Latitude', 'Longitude']])
    diff = data.shift() - data
    d = (np.sin(diff['Latitude']/2)**2
         + np.cos(data['Latitude']) * np.cos(data['Latitude'].shift()) * np.sin(diff['Longitude']/2)**2)
    return 2 * 6371000 * np.arcsin(np.sqrt(d))
I uploaded a small set of sample data here:
pd.read_csv('https://pastebin.com/raw/qeUDKr9z')
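A quick way to try the vectorized version on that sample (assuming the CSV has the Latitude/Longitude columns shown above; the first value is NaN since there is no previous point):

import pandas as pd

df = pd.read_csv('https://pastebin.com/raw/qeUDKr9z')
df['dist2prev'] = haversine_vec(df)  # metres to the previous row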
Try using a list comprehension:
df['distance'] = [haversine(df.Latitude[i], df.Longitude[i], df.Latitude[i+1], df.Longitude[i+1])
                  if i != len(df)-1 else 0
                  for i in range(len(df))]
df[df.distance > 50]
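One caveat: the comprehension measures each point against its immediate successor, while reduce_gps measures from the last kept point, so the two can keep different subsets. If the original semantics matter, a sketch that keeps the sequential scan but reads plain NumPy arrays instead of repeated .iloc lookups (usually far faster) could look like this:

import numpy as np

def reduce_gps_fast(df, min_dist=50.0):
    lat = np.radians(df['Latitude'].to_numpy())
    lon = np.radians(df['Longitude'].to_numpy())
    keep = np.zeros(len(df), dtype=bool)
    last = 0  # index of the last kept point
    for i in range(1, len(df)):
        a = (np.sin((lat[i] - lat[last])/2)**2
             + np.cos(lat[last]) * np.cos(lat[i]) * np.sin((lon[i] - lon[last])/2)**2)
        if 2 * 6371000 * np.arcsin(np.sqrt(a)) > min_dist or i == len(df) - 1:
            keep[i] = True
            last = i
    return df[keep]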
I have a large dataset of around 2 million rows and 4 columns: observation_id, latitude, longitude and class_id, like this:

Observation_id  Latitude   Longitude  Class_id
10131188        45.146973  6.416794   101
10799362        46.783695  -2.072855  700
10392536        48.604866  -2.825003  1456
...             ...        ...        ...
22068176        29.806055  -98.41853  5532
There are 18,000 classes, but some of them are over-represented and some are under-represented. Note that each observation is either in France or in the USA.
I need to find, for each observation, the distance to the closest observation of every class.
For example, for the first observation (which belongs to the class 101 if we look at the table above), I will have a vector of size 18,000. The first value of the vector will represent the distance in km to the closest occurrence of class 1, the second value will represent the distance in km to the closest occurrence of class 2, and so on until the last value which will represent, you guessed it, the distance in km to the closest occurrence of class 18,000.
If the distance is too large (let's say more than 50km), I don't need the exact distance but a fixed value (50 km in this case). So if the closest occurrence from one class to my observation is more than 50km (whether it's 51km or 9,000km), I can fill 50 for the corresponding value of the observation's vector.
But I see two problems here:
My code will take forever to run.
The created file will be huge.
I started to create a small script that calculates the haversine distance, but for one observation it takes around 8 seconds to run, so it would be impossible for 2 million. Here it is anyway:
from math import radians, sin, cos, asin, sqrt
import numpy as np

lat1 = 45.705116  # lat for observation 10561949
lon1 = 1.424622   # lon for observation 10561949
df = df[df.observation_id != 10561949]  # removing observation 10561949 from the DataFrame

list_obs = np.full(18000, 50.0)  # array of size 18,000 filled with 50 (float, so distances aren't truncated)

for class_id, lat2, lon2 in zip(df['class_id'], df['latitude'], df['longitude']):
    # use fresh names so the reference point isn't converted to radians again on every iteration
    rlon1, rlat1, rlon2, rlat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((rlat2 - rlat1)/2)**2 + cos(rlat1) * cos(rlat2) * sin((rlon2 - rlon1)/2)**2  # Haversine distance (1/2)
    dist = 2 * asin(sqrt(a)) * 6371                                                      # Haversine distance (2/2)
    if dist < list_obs[class_id]:  # index by class, not by observation_id
        list_obs[class_id] = dist
Do you have an idea how to speed up the algorithm (the distance doesn't have to be perfectly accurate; I just need a rough idea of the nearest neighbor of each class for each observation) and how to store the gigantic file afterwards (it will be an array-like of 2,000,000 x 18,000)?
The idea after this is to feed it to a Neural Network (say an MLP), to see the difference with a simple K-Nearest Neighbor.
Since you only care about distances below 50 km, the biggest saving I can think of is to pre-bin the points on an approximate grid, so distances never have to be computed for far-away pairs.
Below is my best attempt at solving this; it has a setup complexity of O(len(df)) but a search complexity of just O(9 * avg. bin size), which is significantly less than the O(len(df)) of your example.
Note 1) large parts of this can be vectorized to improve performance.
Note 2) there are most certainly better ways to bin distances on a sphere; I am just not that familiar with them. The key idea is to index the values first so that you can quickly find all data points within distance x.
Note 3) I would be surprised if this code were bug free.
# generate dummy data --------------------------
import pandas as pd
import random

random.seed(10)
rand_float = lambda: (random.random() - .5) * 90 * 2
rand_int = lambda: int(random.random() * 18000)
dummy_data = [(rand_float(), rand_float(), rand_int()) for i in range(100_000)]
df = pd.DataFrame(data=dummy_data, columns=('lat', 'lon', 'class'))

# bin data points -------------------------------
from collections import defaultdict
from math import cos, radians

def find_bin(lat, lon, bin_size=60):
    """Approximately bin data points into ~bin_size-km grid cells.
    https://stackoverflow.com/questions/1253499/simple-calculations-for-working-with-lat-lon-and-km-distance
    The conversion only needs to be approximate since we search the
    neighbouring bins as a buffer: one degree of latitude is ~110 km,
    and one degree of longitude is ~110*cos(latitude) km.
    """
    return int(110 * lat // bin_size), int(110 * lon * cos(radians(lat)) // bin_size)

bins = defaultdict(list)
for i, row in df.iterrows():  # O(len(df))
    # this is slow; it can be vectorized less elegantly but only needs to run once
    bins[find_bin(row['lat'], row['lon'])].append(int(i))
print(f'average bin size {sum(map(len, bins.values()))/len(bins)}')
# find distances to neighbours ------------------
import numpy as np
from math import radians, sin, cos, asin, sqrt

def compute_distance(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])  # convert to radians
    a = sin((lat2 - lat1)/2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1)/2)**2  # Haversine distance (1/2)
    return 2 * asin(sqrt(a)) * 6371                                               # Haversine distance (2/2)

def neighbours(x, y):
    """Yield bin (x, y) and its 8 surrounding bins."""
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            yield x + dx, y + dy

obs = 1000  # 10561949
lat1, lon1, _ = df.iloc[obs].values
print(f'finding {lat1}, {lon1}')
b = find_bin(lat1, lon1)

list_obs = np.full(18000, 50.0)  # array of size 18,000 filled with the value 50
for adj_bin in neighbours(*b):  # O(9)
    for i in bins[adj_bin]:     # O(avg. bin size)
        lat2, lon2, class_ = df.loc[i].values
        # note the argument order: the original called this with (lat1, lon1, ...),
        # which silently swapped latitude and longitude
        dist = compute_distance(lon1, lat1, lon2, lat2)
        if dist < list_obs[int(class_)]:
            print(f'dist {dist} to {i} ({class_})')
            list_obs[int(class_)] = dist
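If an extra dependency is acceptable, scikit-learn's BallTree implements the same "index first, then only look at nearby points" idea with a proper spatial structure; a sketch under the same dummy-data assumptions (lat/lon/class columns; the tree returns haversine distances in radians):

import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371
coords = np.radians(df[['lat', 'lon']].to_numpy())
tree = BallTree(coords, metric='haversine')  # built once

# all points within 50 km of observation `obs`
ind, dist = tree.query_radius(coords[obs:obs+1], r=50 / EARTH_RADIUS_KM,
                              return_distance=True)
list_obs = np.full(18000, 50.0)
for i, d in zip(ind[0], dist[0]):
    km = d * EARTH_RADIUS_KM
    class_ = int(df['class'].iloc[i])
    if km < list_obs[class_]:
        list_obs[class_] = km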
There are so many packages that provide this calculation, although most of them work on points rather than data frames - or maybe I am making a mistake! I have found this method that works with my pandas dataframe of Latitude and Longitude columns:
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6378137):
    """
    Slightly modified version of http://stackoverflow.com/a/29546836/2901002
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))
but none of the initial bearing or azimuth functions I tried accept dataframe series, and trying numpy arrays still returns zeros!
Is there a way to do this for successive rows of a dataframe? I would like to calculate the initial bearing between successive points. In R, the bearing function does the job with a dataframe; I'm just wondering if there is an equivalent in Python.
Update:
I found the problem. I was using the R approach of finding the bearing between successive rows, basically removing the first and last rows to make two dataframes of two columns each, but it worked perfectly with shift(), and I wrote my own bearing function, which was easier than using the ones out there...
So I made the two dataframes below from pts, my main dataframe:
latlon_a = pts
latlon_b = pts.shift()
and my own initial bearing function:
def initial_bearing(lon1, lat1, lon2, lat2):
    """
    My own version based on the R source
    Calculate the initial bearing between two points
    All (latitude, longitude) coordinates must have numeric dtypes and be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1  # delta longitude, as in the R bearing() source
    term1 = np.sin(dlon) * np.cos(lat2)
    term2 = np.cos(lat1) * np.sin(lat2)
    term3 = np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    rad = np.arctan2(term1, term2 - term3)
    bearing = np.rad2deg(rad)
    return (bearing + 360) % 360
bearing = initial_bearing(latlon_a['longitude'], latlon_a['latitude'],
                          latlon_b['longitude'], latlon_b['latitude'])
This worked perfectly for me and returns the initial bearing. For the final bearing you can just replace the return line with:
return (bearing + 180) % 360
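A quick sanity check of the corrected formula (a sketch: moving due east along the equator should give an initial bearing of 90 degrees):

import numpy as np

initial_bearing(np.array([0.0]), np.array([0.0]),   # lon1, lat1
                np.array([1.0]), np.array([0.0]))   # lon2, lat2 -> array([90.])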
This question already has an answer here: Efficient computation of minimum of Haversine distances (closed 2 years ago as a duplicate).
I am trying to find the minimum distance from each customer to a store. Currently, there are ~1500 stores and ~670K customers in my data, so I have to calculate the geo distance for 670K customers x 1500 stores and find the minimum distance for each customer.
I have created the haversine function below:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    miles = 6367 * c / 1.609  # km converted to miles
    return miles
and my data set looks like the tables below: one data frame for the customers (cst_geo) and one for the stores (store_geo). The numbers are made up, as I can't share a snippet of the real data:
Customer ID  Latitude  Longitude
A123         39.342    -40.800
B456         38.978    -41.759
C789         36.237    -77.348

Store ID  Latitude  Longitude
S1        59.342    -60.800
S2        28.978    -71.759
S3        56.237    -87.348
I wrote a for loop below to attempt this calculation but it took >8 hours to run. I have tried to use deco but wasn't able to optimize it any further.
mindist = []
for i in cst_geo.index:
    dist = []
    for j in store_geo.index:
        dist.append(haversine_np(cst_geo.longitude[i], cst_geo.latitude[i],
                                 store_geo.longitude[j], store_geo.latitude[j]))
    mindist.append(min(dist))
This can be done with geopy:

from geopy.distance import geodesic

customers = [
    (39.342, -40.800),
    (38.978, -41.759),
    (36.237, -77.348),
]
stores = [
    (59.342, -60.800),
    (28.978, -71.759),
    (56.237, -87.348),
]

# note: [[None] * len(customers)] * len(stores) would alias every row
# to the same list, so build the rows independently
matrix = [[None] * len(customers) for _ in range(len(stores))]
for index, i in enumerate(customers):
    for j_index, j in enumerate(stores):
        matrix[j_index][index] = geodesic(i, j).meters
output: a 3 x 3 matrix of distances in meters, one row per store and one column per customer. (The output originally posted showed three identical rows, which is exactly what the row-aliasing bug noted in the comment above produces.)
You can also get the distance in other units: kilometers, miles, feet...
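At ~670K customers x ~1500 stores, though, any pairwise loop will be slow. A sketch of an alternative (using the lowercase latitude/longitude column names from the loop above): index the ~1500 stores once with scikit-learn's BallTree and answer every customer's nearest-store query against it in one vectorized call.

import numpy as np
from sklearn.neighbors import BallTree

# index the stores once
store_rad = np.radians(store_geo[['latitude', 'longitude']].to_numpy())
tree = BallTree(store_rad, metric='haversine')

# one nearest-neighbour query for all customers at once
cst_rad = np.radians(cst_geo[['latitude', 'longitude']].to_numpy())
dist, _ = tree.query(cst_rad, k=1)  # haversine distances in radians

cst_geo['min_dist_miles'] = dist[:, 0] * 6367 / 1.609  # same constants as haversine_np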
I have a pandas dataframe of lat/lng points created from a gps device.
My question is how to generate a distance column for the distance between each point in the gps track line.
Some googling has given me the haversine method below, which works using single values selected with iloc, but I'm struggling with how to iterate over the dataframe for the method inputs.
I had thought I could run a for loop, with something along the lines of
for i in len(df):
df['dist'] = haversine(df['lng'].iloc[i],df['lat'].iloc[i],df['lng'].iloc[i+1],df['lat'].iloc[i+1]))
but I get the error TypeError: 'int' object is not iterable. I was also thinking about df.apply, but I'm not sure how to get the appropriate inputs. Any help or hints on how to do this would be appreciated.
Sample DF
lat lng
0 -7.11873 113.72512
1 -7.11873 113.72500
2 -7.11870 113.72476
3 -7.11870 113.72457
4 -7.11874 113.72444
Method
import math

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    km = 6367 * c
    return km
are you looking for a result like this?
lat lon dist2next
0 -7.11873 113.72512 0.013232
1 -7.11873 113.72500 0.026464
2 -7.11873 113.72476 0.020951
3 -7.11873 113.72457 0.014335
4 -7.11873 113.72444 NaN
There's probably a clever way to use pandas.rolling_apply... but for a quick solution, I'd do something like this.
def haversine(loc1, loc2):
    # convert decimal degrees to radians
    lon1, lat1 = map(math.radians, loc1)
    lon2, lat2 = map(math.radians, loc2)

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    km = 6367 * c
    return km

df['dist2next'] = np.nan
for i in df.index[:-1]:
    # .ix is long deprecated; .loc works here since the index is a plain RangeIndex
    loc1 = df.loc[i, ['lon', 'lat']]
    loc2 = df.loc[i+1, ['lon', 'lat']]
    df.loc[i, 'dist2next'] = haversine(loc1, loc2)
Alternatively, if you don't want to modify your haversine function like that, you can just pick off lats and lons one at a time using df.loc[i, 'lon'], df.loc[i, 'lat'], df.loc[i+1, 'lon'], etc.
I would recommend a quicker variation of looping through a df, such as:

df_shift = df.shift(1).add_prefix('lag_')  # previous row as lag_lat / lag_lng
df = df.join(df_shift)

log = []
for rows in df.itertuples():
    log.append(haversine(rows.lng, rows.lat, rows.lag_lng, rows.lag_lat))
pd.DataFrame(log)
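If the itertuples loop is still too slow, the same arithmetic vectorizes with a single shift and no Python-level loop; a sketch assuming the lat/lng columns from the sample DF:

import numpy as np

lat1 = np.radians(df['lat'])
lon1 = np.radians(df['lng'])
lat2, lon2 = lat1.shift(-1), lon1.shift(-1)  # the next point in the track

a = np.sin((lat2 - lat1)/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1)/2)**2
df['dist2next'] = 6367 * 2 * np.arcsin(np.sqrt(a))  # km; NaN on the last row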