I have a large dataset of around 2 million rows and 4 columns: observation_id, latitude, longitude and class_id, like this:
Observation_id    Latitude     Longitude    Class_id
10131188          45.146973     6.416794         101
10799362          46.783695    -2.072855         700
10392536          48.604866    -2.825003        1456
...                     ...           ...         ...
22068176          29.806055    -98.41853        5532
There are 18,000 classes, but some of them are over-represented and some are under-represented. Note that each observation is either in France or in the USA.
I need to find, for each observation, the distance to the closest observation of every class.
For example, for the first observation (which belongs to the class 101 if we look at the table above), I will have a vector of size 18,000. The first value of the vector will represent the distance in km to the closest occurrence of class 1, the second value will represent the distance in km to the closest occurrence of class 2, and so on until the last value which will represent, you guessed it, the distance in km to the closest occurrence of class 18,000.
If the distance is too large (let's say more than 50km), I don't need the exact distance but a fixed value (50 km in this case). So if the closest occurrence from one class to my observation is more than 50km (whether it's 51km or 9,000km), I can fill 50 for the corresponding value of the observation's vector.
But I see two problems here:
My code will take forever to run.
The created file will be huge.
I started to create a small script that calculates the haversine distance, but for a single observation it takes around 8 seconds to run, so it would be impossible for 2 million. Here it is anyway:
from math import radians, sin, cos, asin, sqrt
import numpy as np

lat1 = 45.705116  # lat for observation 10561949
lon1 = 1.424622   # lon for observation 10561949

df = df[df.observation_id != 10561949]  # removing observation 10561949 from the DataFrame

list_obs = np.full(18000, 50.0)  # array of size 18,000 filled with the value 50

lon1, lat1 = radians(lon1), radians(lat1)  # convert the reference point to radians once
for class_id, lat2, lon2 in zip(df['class_id'], df['latitude'], df['longitude']):
    lon2, lat2 = radians(lon2), radians(lat2)  # convert to radians
    a = sin((lat2 - lat1)/2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1)/2)**2  # Haversine distance (1/2)
    dist = 2 * asin(sqrt(a)) * 6371                                                # Haversine distance (2/2)
    if dist < list_obs[class_id]:  # keep only the closest occurrence of each class
        list_obs[class_id] = dist
Do you have an idea of how to speed up the algorithm (the distance doesn't have to be perfectly accurate, I just need a rough idea of the nearest neighbor of each class for each observation) and how to store the gigantic file afterwards (it will be an array-like of 2,000,000 x 18,000)?
The idea after this is to try to feed it to a neural network (let's say an MLP), to see the difference with a simple K-Nearest Neighbor approach.
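Since the distances are capped at 50 km, one option for the file-size concern (a sketch, assuming a 1 km resolution is acceptable; the filename is made up) is to store each vector as unsigned bytes in a memory-mapped array, which brings the full 2,000,000 x 18,000 matrix down to roughly 36 GB instead of ~288 GB as float64:

# A sketch of one storage option (the filename is illustrative), assuming a
# 1 km resolution is acceptable: with every value capped at 50, the whole
# matrix fits in uint8, and np.memmap lets it be filled one row at a time.
import numpy as np

out = np.memmap('class_distances.uint8', dtype=np.uint8, mode='w+',
                shape=(2_000_000, 18_000))
out[0, :] = np.minimum(list_obs, 50).round().astype(np.uint8)  # vector for one observation
out.flush()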
Since you only care about distances under 50 km, the best saving I can think of is to pre-compute (approximately) which grid cell each point falls into, so that far-away points can be excluded without ever computing their distances.
Below is my best attempt at solving this. It has a setup complexity of O(len(df)) but a search complexity of just O(9 * avg. bin size), which is significantly less than the O(len(df)) of your example.
Note 1) large parts of this can be vectorized to improve performance (see the sketch after the code).
Note 2) there are most certainly better ways to bin distances on a sphere; I am just not that familiar with them. The key idea is to first index the values so that you can quickly find all data points within distance x.
Note 3) I would be surprised if this code were bug-free.
# generate dummy data --------------------------
import pandas as pd
import random
random.seed(10)
rand_float = lambda :(random.random()-.5)*90*2
rand_int = lambda :int(random.random()*18000)
dummy_data = [(rand_float(), rand_float(), rand_int()) for i in range(100_000)]
df = pd.DataFrame(data=dummy_data, columns=('lat', 'lon', 'class'))
# bin data points -------------------------------
from collections import defaultdict
from math import cos, radians

def find_bin(lat, lon, bin_size=60):
    """Approximately bin data points.

    https://stackoverflow.com/questions/1253499/simple-calculations-for-working-with-lat-lon-and-km-distance
    The degree-to-km conversion only needs to be approximate since we also search the
    neighbouring bins as a buffer; there are certainly better ways to do this.
    """
    # ~110 km per degree of latitude; ~110*cos(lat) km per degree of longitude
    return int(110 * lat // bin_size), int(110 * cos(radians(lat)) * lon // bin_size)

bins = defaultdict(list)
for i, row in df.iterrows():  # O(len(df))
    bins[find_bin(row['lat'], row['lon'])].append(int(i))  # slow; can be vectorized (see note 1) but only needs to run once

print(f'average bin size {sum(map(len, bins.values()))/len(bins)}')
# find distances to neighbours ------------------
from math import radians, sin, cos, asin, sqrt

def compute_distance(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])  # convert to radians
    a = sin((lat2 - lat1)/2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1)/2)**2  # Haversine distance (1/2)
    return 2 * asin(sqrt(a)) * 6371                                                # Haversine distance (2/2)

def neighbours(x, y):
    yield x-1, y-1
    yield x-1, y
    yield x-1, y+1
    yield x,   y-1
    yield x,   y
    yield x,   y+1
    yield x+1, y-1
    yield x+1, y
    yield x+1, y+1
import numpy as np

obs = 1000  # 10561949
lat1, lon1, _ = df.iloc[obs].values
print(f'finding {lat1}, {lon1}')
b = find_bin(lat1, lon1)

list_obs = np.full(18000, 50.0)  # array of size 18,000 filled with the value 50
for adj_bin in neighbours(*b):   # O(9)
    for i in bins[adj_bin]:      # O(avg. bin size)
        lat2, lon2, class_ = df.loc[i].values
        dist = compute_distance(lon1, lat1, lon2, lat2)  # (lon, lat) order to match the signature above
        if dist < list_obs[int(class_)]:
            print(f'dist {dist} to {i} ({class_})')
            list_obs[int(class_)] = dist
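As a rough illustration of note 1, the per-row binning loop above could be replaced with a vectorized version along these lines (a sketch that mirrors the approximate degree-to-km conversion used in find_bin):

import numpy as np

bin_size = 60
lat_bin = (110 * df['lat'] // bin_size).astype(int)
lon_bin = (110 * np.cos(np.radians(df['lat'])) * df['lon'] // bin_size).astype(int)
bins = df.groupby([lat_bin, lon_bin]).indices  # dict mapping (lat_bin, lon_bin) -> row positions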
I have a dataframe containing GPS coordinates (Timestamp, Latitude, Longitude) of vehicle tracks. The interval between points ranges from 1 second up to 30 seconds; it depends on some logic in the GPS receiver, including speed, but is not very reliable.
These tracks can be very long and contain many thousands of GPS points, especially when a vehicle is moving slowly or is at rest. The data looks like this:
Timestamp          Latitude    Longitude
0 days 00:00:00    51.1513     9.61053
0 days 00:00:28    51.1513     9.61049
0 days 00:00:29    51.1513     9.61048
0 days 00:00:31    51.1513     9.61048
0 days 00:00:33    51.1513     9.61048
I want to reduce the size of the data frames by only keeping GPS points that are at least 50 meters away from the previously kept position. The distance between two GPS positions is calculated using the haversine formula:
from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees) in meters
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))

    # Radius of earth in kilometers is 6371
    m = 6371000 * c
    return m
Currently I use a very naive approach: I loop over the dataframe and build a mask of the elements that are at least 50 meters apart. But this is very inefficient, and I am looking for an efficient way to calculate this for large data frames.
def reduce_gps(df):
    mask = np.full(len(df), False)
    cpos = 0
    lat_col = df.columns.get_loc('Latitude')
    lon_col = df.columns.get_loc('Longitude')
    for pos in range(len(mask)):
        if haversine(df.iloc[cpos, lat_col], df.iloc[cpos, lon_col],
                     df.iloc[pos, lat_col], df.iloc[pos, lon_col]) > 50 or (pos == len(mask) - 1):
            cpos = pos  # the current point becomes the new reference point
            mask[pos] = True
    return df[mask]
The haversine formula can be vectorized if this is helpful:
def haversine_vec(df):
    data = np.deg2rad(df[['Latitude', 'Longitude']])
    diff = data.shift() - data
    d = np.sin(diff['Latitude']/2)**2 + np.cos(data['Latitude']) * np.cos(data['Latitude'].shift()) * np.sin(diff['Longitude']/2)**2
    return 2 * 6371000 * np.arcsin(np.sqrt(d))
I uploaded a small set of sample data here:
pd.read_csv('https://pastebin.com/raw/qeUDKr9z')
Try using a list comprehension:
df['distance'] = [haversine(df.Latitude[i], df.Longitude[i], df.Latitude[i+1], df.Longitude[i+1])
                  if i != len(df)-1 else 0
                  for i in range(len(df))]
df[df.distance > 50]
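Note that the list comprehension only compares consecutive points, while reduce_gps measures from the last kept point. If that behaviour matters, a cheaper drop-in is to pull the coordinates out of the DataFrame once and run the same loop over plain NumPy arrays instead of repeated .iloc lookups (a sketch, keeping the original semantics):

import numpy as np
from math import sin, cos, asin, sqrt

def reduce_gps_fast(df, min_dist=50.0):
    lat = np.radians(df['Latitude'].to_numpy())
    lon = np.radians(df['Longitude'].to_numpy())
    mask = np.zeros(len(df), dtype=bool)
    cpos = 0
    for pos in range(len(df)):
        a = (sin((lat[pos] - lat[cpos]) / 2) ** 2
             + cos(lat[cpos]) * cos(lat[pos]) * sin((lon[pos] - lon[cpos]) / 2) ** 2)
        if 2 * 6371000 * asin(sqrt(a)) > min_dist or pos == len(df) - 1:
            cpos = pos  # new reference point: the last kept position
            mask[pos] = True
    return df[mask]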
I'm building a program in Python that loads in the geographic location of all addresses in my country and then calculates a "remoteness" value for each address by considering its distance to all the other addresses.
Since there are 3-5 million addresses in my country, I ultimately need to run 3-5 million iterations of the algorithm that calculates the index.
I have already taken some measures to bring down running time, including:
trying to process the data efficiently using NumPy
instead of each address looking at its distance to each other address, I am dividing the country into zones, which have each already been assigned a population, and each address is then merely aware of how many of those zone centers fall within each of the distance values enumerated in "RADII"
It's still taking 23 seconds to get through a list of 5000 addresses, which means I expect 24h+ running time for the full dataset of 3-5 million.
So my question is: Which parts of my algorithm should be improved in order to save time? What stands out to you as slow, redundant or inefficient code?
Does my use of JSON and NumPy make sense, for example? Could simplifying the "distance" function be of much help? Or would I gain a significant advantage by throwing it all out and using a language other than Python?
Any advice would be much appreciated. I'm a newbie programmer so there could easily be issues I'm not aware of.
Here's the code. It's currently working on a limited dataset (one minor island) which makes up about 0.1% of the full dataset. The main loop starts at the #Main loop comment:
import json
import math
import numpy as np
from cs50 import SQL  # assumption: the SQL helper used below comes from the cs50 library

RADII = [888, 1480, 2368, 3848, 6216, 10064, 16280]
R = 6371000
remoteness = []
db = SQL("sqlite:///samsozones.db")

def main():
    with open("samso.json", "r") as json_data:
        data = json.load(json_data)
    rows = db.execute("SELECT * FROM zones")

    # Establish amount of zones with addresses in them
    ZONES = len(rows)

    # Initialize matrix with the location of the center of each zone and the population of the zone
    zonematrix = np.zeros((ZONES, 3), dtype="float")
    for i, row in enumerate(rows):
        zonematrix[i, :] = row["x"], row["y"], row["population"]

    # Initialize matrix with the distance from the current address to the center of each zone
    # and the population of the zone (filled out for each address in the main loop)
    distances = np.zeros((ZONES, 2), dtype="float")

    # Main loop (calculate remoteness index for each address)
    for address in data:
        # Reset remoteness index for new address
        index = 0

        # Calculate distance from address to each zone center and insert the distances
        # into the distances matrix along with the population
        for j in range(ZONES):
            distances[j, 0] = distance(address["x"], address["y"], zonematrix[j, 0], zonematrix[j, 1])
            distances[j, 1] = zonematrix[j, 2]

        # Calculate remoteness index
        for radius in RADII:
            # Count zone centers within the radius and add up their population
            allwithincircle = distances[distances[:, 0] < radius]
            count = len(allwithincircle)
            pop = allwithincircle[:, 1].sum()

            # Calculate average within-radius zone population (guard against empty circles)
            factor = pop / count if count else 0

            # Increment remoteness index
            index += (1 / radius) * factor

        remoteness.append((address["betegnelse"], index))

# Haversine function by Deduplicator, adapted from
# https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula
def distance(lat1, lon1, lat2, lon2):
    dLat = deg2rad(lat2 - lat1)
    dLon = deg2rad(lon2 - lon1)
    a = math.sin(dLat/2) * math.sin(dLat/2) + math.cos(deg2rad(lat1)) * math.cos(deg2rad(lat2)) * math.sin(dLon/2) * math.sin(dLon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = R * c
    return d

def deg2rad(deg):
    return deg * (math.pi / 180)

if __name__ == "__main__":
    main()

# Running time (N = 5070): 23 seconds
# Results quite reasonable
# Scales more or less linearly
I can recommend a number of potential speed-ups:
Don't use Haversine to calculate your distances. Find a good local projection for your country (you don't say where you are, so I can't) and reproject your data into that CRS, which will allow you to use simple Euclidean distances that are much faster to compute (especially if you work with the square of the distance to save a bunch of square roots); see the sketch after this list.
I would avoid calculating the distances from all the possible zones by converting the calculation to a raster surface of population, calculating the remoteness for each cell, and then looking up the address's point on that raster. If one house on my street is remote, then I don't need to look the rest up! I would look at, for example, Wilderness attribute mapping in the United Kingdom by my former colleagues Carver, Evan and Fritz for good examples.
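A minimal sketch of the reprojection idea (EPSG:25832 is only an example UTM zone, and zone_lons, zone_lats, addr_lon, addr_lat are illustrative names; pick the CRS that fits your country):

import numpy as np
from pyproj import Transformer

# lon/lat in degrees (EPSG:4326) -> planar metres in a local CRS (example: EPSG:25832)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:25832", always_xy=True)
zx, zy = transformer.transform(zone_lons, zone_lats)  # all zone centres at once (arrays)
ax, ay = transformer.transform(addr_lon, addr_lat)    # one address

# squared Euclidean distances: compare against radius**2 and skip the sqrt entirely
sq_dist = (zx - ax) ** 2 + (zy - ay) ** 2
within_smallest_radius = sq_dist < 888 ** 2  # e.g. the smallest value in RADII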
My answer focuses on the computational tricks you can try and implement in these kinds of situations rather than on domain-specific optimizations (which you should definitely prioritize as they'll usually give you the largest speedup).
For that, I use Cython, which adds static typing to Python and tries to transpile as much of the code as possible to C.
calc_dist takes a Nx2 numpy array of N latitudes and longitudes as an input, and uses it to compute a NxN distance array where every value above the diagonal at coordinates i,j represents the distance between locations i and j:
# To test, I recommend using a notebook with the %%cython cell magic
# %load_ext cython
# %%cython
import numpy as np
cimport numpy as np
from libc.math cimport sin, cos, asin, sqrt, pi
import cython

ctypedef np.float64_t dtype_t

@cython.boundscheck(False)
@cython.wraparound(False)
def calc_dist(dtype_t[:, :] coords):
    result = np.zeros((coords.shape[0], coords.shape[0]), dtype=np.float64)
    cdef dtype_t[:, :] result_view = result
    cdef int d1, d2
    for d1 in range(coords.shape[0]):
        for d2 in range(d1, coords.shape[0]):
            result_view[d1][d2] = haversine(coords[d1][0], coords[d1][1], coords[d2][0], coords[d2][1])
    return result

@cython.cdivision(True)
cdef dtype_t haversine(dtype_t lat1, dtype_t lon1, dtype_t lat2, dtype_t lon2):
    """Pure C haversine. Based on https://stackoverflow.com/a/29546836/13145954"""
    cdef dtype_t deg_to_rad = pi/180
    lon1 = lon1 * deg_to_rad
    lon2 = lon2 * deg_to_rad
    lat1 = lat1 * deg_to_rad
    lat2 = lat2 * deg_to_rad
    cdef dtype_t dlon = lon2 - lon1
    cdef dtype_t dlat = lat2 - lat1
    a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
Running calc_dist on the following random coordinates takes only 780ms on my laptop, which translates to a running time of about 13 minutes for 5 million rows.
coords = np.random.random((5000,2))
coords[:,0] = (coords[:,0]*2-1)*90
coords[:,1] = (coords[:,1]*2-1)*180
NB: the code will have to be adapted to process the data in chunks in order to avoid OOM errors on the full database.
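For reference, a small usage sketch of calc_dist on the random coordinates above (remember that only the upper triangle of the result is filled):

dist_matrix = calc_dist(coords)
print(dist_matrix[0, 1])   # distance in km between points 0 and 1
print(dist_matrix[1, 0])   # 0.0 -- values below the diagonal are never written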
There are so many packages that provide this calculation, although most of them are based on points rather than data frames, or maybe I am making a mistake!
I have found this method that works with my pandas DataFrame of Latitude and Longitude columns:
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6378137):
    """
    Slightly modified version of: http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))
but none of the initial bearing or azimuth functions I tried accept DataFrame series, and trying NumPy arrays still returns zeros!
Is there a way to do this for successive rows of a dataframe? I would like to calculate the initial bearing between successive points. In R, the bearing function does the job with a dataframe, so I am wondering if there is an equivalent in Python.
Update:
I found the problem. I was following the R approach of finding the bearing between successive rows, so I was basically removing the first row and the last row to make two offset dataframes, but it worked perfectly with shift() and I wrote my own bearing function, which was easier than using the ones out there...
So I made the two dataframes below from pts, my main dataframe:
latlon_a = pts
latlon_b = pts.shift()
and my own initial bearing function:
def initial_bearing(lon1, lat1, lon2, lat2):
"""
My own version based on R source
Calculate the initial bearing between two points
All (latitude, longitude) coordinates must have numeric dtypes and be of equal length.
"""
lat1, lon1, lat2, lon2 = map(np.radians, [lon1, lat1, lon2, lat2])
delta1 = lon1-lon2
term1 = np.sin(delta1) * np.cos(lat2)
term2 = np.cos(lat1) * np.sin(lat2)
term3 = np.sin(lat1) * np.cos(lat2) * np.cos(delta1)
rad = np.arctan2(term1, (term2-term3))
bearing = np.rad2deg(rad)
return (bearing + 360) % 360
bearing = initial_bearing(latlon_a['longitude'],latlon_a['latitude'],
latlon_b['longitude'],latlon_b['latitude'])
This worked perfectly for me and returns the initial bearing. For the final bearing you can just replace the return line with:
return (bearing + 180) % 360
I have two datasets, one describing locations and a second one containing various points:
locations.head()
latitude longitude geobounds_lon1 geobounds_lat1 geobounds_lon2 geobounds_lat2
0 52.5054 13.33320 13.08830 52.6755 13.7611 52.3382
1 54.6192 9.99778 7.86496 55.0581 11.3129 53.3608
2 41.6671 -71.27420 -71.90730 42.0188 -71.0886 41.0958
3 25.9859 -80.12280 -87.81370 30.9964 -78.9917 24.5071
4 43.7004 11.51330 9.63364 44.5102 12.4104 42.1654
points.head()
category lat lon
0 161 47.923132 11.507743
1 161 47.926479 11.531736
2 161 47.943670 11.576099
3 161 57.617577 12.040591
4 23 52.124071 -0.491918
I need to calculate distances from each offer (based on locations.latitude and locations.longitude) to every point of a given category (for example, 161). Only the points that are not too far away from the location matter to me - I thought that using the boundaries of the location might be helpful, so I wouldn't need to calculate all distances and then filter them.
The biggest problem for me is how to efficiently filter points for every location (based on category and boundaries) and calculate distances from the location to those points, as the data is quite big (there are almost 9 million rows in locations and more than 10 million rows in points).
For distance calculation I tried BallTree:
import numpy as np
from sklearn.neighbors import BallTree

RADIANT_TO_KM_CONSTANT = 6367

class BallTreeIndex:
    def __init__(self, lat_longs):
        self.lat_longs = np.radians(lat_longs)
        self.ball_tree_index = BallTree(self.lat_longs, leaf_size=40, metric='haversine')

    def query_radius(self, query, radius):
        radius_radiant = radius / RADIANT_TO_KM_CONSTANT
        query = np.radians(np.array([query]))
        result = self.ball_tree_index.query_radius(query, r=radius_radiant,
                                                   return_distance=True)
        return result[1][0]
And for filtering points:
condition = (points.category == c) & (points.lat > lat2) & (points.lat < lat1) & (points.lon < lon2) & (points.lon > lon1)
tmp = points[condition]
where c is the specific category, lat1, lat2, lon1, lon2 are the location boundaries.
However, this would take a lot of time, so I wonder if there is any way to make it faster.
I would like to have a new column in locations dataframe, for example:
distances_161
0 [distance0_0, distance0_1, ...]
1 [distance1_0, distance1_1, ...]
2 [distance2_1, distance2_2, ...]
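For what it's worth, one way the BallTreeIndex above could be wired up per category (a sketch; the 50 km radius is an arbitrary cut-off, and the column names follow the frames shown above):

import numpy as np

# build one index per category once
indexes = {c: BallTreeIndex(g[['lat', 'lon']].values)
           for c, g in points.groupby('category')}

radius_km = 50
# distances come back in radians with the haversine metric,
# so multiply by RADIANT_TO_KM_CONSTANT to get km
locations['distances_161'] = [
    indexes[161].query_radius((row.latitude, row.longitude), radius_km) * RADIANT_TO_KM_CONSTANT
    for row in locations.itertuples()
]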
I'm not 100% certain that this is what you want, but it seems to make sense to me.
import numpy as np
import pandas
def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
df = {'lon1': [40.7454513],
'lat1': [-73.9536799],
'lon2': [40.7060268],
'lat2': [-74.0110188]}
df
df['distance'] = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])
Result:
array([6.48545403])
So, Python is saying 6.485 miles and Google says 6.5 miles.
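Because haversine_np is fully vectorized, it can also compare one location against every point of a category in a single call, which is closer to what the question asks for (a sketch; the example coordinates are taken from row 2 of locations above):

cat_points = points[points.category == 161]
dists_km = haversine_np(-71.2742, 41.6671,                     # one location's lon, lat
                        cat_points['lon'], cat_points['lat'])  # all points of category 161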
I want to be able to get an estimate of the distance between two (latitude, longitude) points. I want to undershoot, as this will be for A* graph search and I want it to be fast. The points will be at most 800 km apart.
The answers to Haversine Formula in Python (Bearing and Distance between two GPS points) provide Python implementations that answer your question.
Using the implementation below I performed 100,000 iterations in less than 1 second on an older laptop. I think for your purposes this should be sufficient. However, you should profile anything before you optimize for performance.
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))

    # Radius of earth in kilometers is 6371
    km = 6371 * c
    return km
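To reproduce the rough timing claim, a quick check with timeit (the two points here, roughly Paris and London, are just an example):

import timeit

t = timeit.timeit(lambda: haversine(2.3522, 48.8566, -0.1276, 51.5072), number=100_000)
print(f"{t:.2f} s for 100,000 haversine calls")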
To underestimate, use haversine(lat1, long1, lat2, long2) * 0.90 or whatever factor you want. I don't see how introducing error into your underestimation is useful, though.
Since the distance is relatively small, you can use the equirectangular distance approximation. This approximation is faster than using the Haversine formula. So, to get the distance from your reference point (lat1, lon1) to the point you're testing (lat2, lon2) use the formula below:
from math import sqrt, cos, radians

R = 6371  # radius of the earth in km

def equirectangular(lat1, lon1, lat2, lon2):
    x = (radians(lon2) - radians(lon1)) * cos(0.5 * (radians(lat2) + radians(lat1)))
    y = radians(lat2) - radians(lat1)
    d = R * sqrt(x*x + y*y)
    return d
Since R is in km, the distance d will be in km.
Reference: http://www.movable-type.co.uk/scripts/latlong.html
One idea for speed is to transform the lon/lat coordinates into 3D (x, y, z) coordinates. After preprocessing the points, use the Euclidean distance between them as a quickly computed undershoot of the actual distance.
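A sketch of that idea, assuming a spherical earth of radius 6371 km: the straight-line (chord) distance through the sphere is never larger than the great-circle distance, so it is an admissible A* underestimate.

from math import radians, sin, cos, sqrt

R = 6371.0  # spherical earth radius in km

def to_xyz(lat, lon):
    # precompute these once per point, then only the Euclidean part runs in the search
    lat, lon = radians(lat), radians(lon)
    return R * cos(lat) * cos(lon), R * cos(lat) * sin(lon), R * sin(lat)

def chord_km(lat1, lon1, lat2, lon2):
    x1, y1, z1 = to_xyz(lat1, lon1)
    x2, y2, z2 = to_xyz(lat2, lon2)
    return sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)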
If the distance between points is relatively small (in the metres to few km range), then one fast approach could be:
from math import cos, sqrt

def quick_distance(Lat1, Long1, Lat2, Long2):
    x = Lat2 - Lat1
    y = (Long2 - Long1) * cos((Lat2 + Lat1) * 0.00872664626)
    return 111.319 * sqrt(x*x + y*y)
Lat and Long are in degrees, and the distance comes out in km.
Deviation from the haversine distance is on the order of 1%, while the speed gain is more than ~10x.
0.00872664626 = 0.5 * pi/180, which converts the mean latitude from degrees to radians inside the cosine.
111.319 km is the distance that corresponds to 1 degree at the equator; you could replace it with your median value, like here:
https://www.cartographyunchained.com/cgsta1/
or replace it with a simple lookup table.
For maximal speed, you could create something like a rainbow table for coordinate distances. It sounds like you already know the area that you are working with, so it seems like pre-computing them might be feasible. Then, you could load the nearest combination and just use that.
For example, in the continental United States, the longitude is a 55 degree span and latitude is 20, which would be 1100 whole number points. The distance between all the possible combinations is a handshake problem which is answered by (n-1)(n)/2 or about 600k combinations. That seems pretty feasible to store and retrieve. If you provide more information about your requirements, I could be more specific.
You can use cdist from scipy.spatial.distance:
For example:
from scipy.spatial.distance import cdist
df1_latlon = df1[['lat','lon']]
df2_latlon = df2[['lat', 'lon']]
distanceCalc = cdist(df1_latlon, df2_latlon, metric=haversine)
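One caveat: with a callable metric, cdist passes each pair of rows as two 1-D arrays, so a four-argument haversine(lat1, lon1, lat2, lon2) like the ones defined earlier in this thread needs a small wrapper (a sketch; it assumes such a haversine function is already in scope):

# cdist calls the metric as metric(u, v) where u and v are single rows, e.g. [lat, lon]
distanceCalc = cdist(df1_latlon, df2_latlon,
                     metric=lambda u, v: haversine(u[0], u[1], v[0], v[1]))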
To calculate the haversine distance between 2 points you can simply use mpu.haversine_distance() from the mpu library, like this:
>>> import mpu
>>> munich = (48.1372, 11.5756)
>>> berlin = (52.5186, 13.4083)
>>> round(mpu.haversine_distance(munich, berlin), 1)
504.2
Please use the following code.
from math import sin, cos, atan2, sqrt, pi

def distance(lat1, lng1, lat2, lng2):
    # returns the distance in meters; if you want km, remove "* 1000"
    radius = 6371 * 1000

    dLat = (lat2 - lat1) * pi / 180
    dLng = (lng2 - lng1) * pi / 180

    lat1 = lat1 * pi / 180
    lat2 = lat2 * pi / 180

    val = sin(dLat/2) * sin(dLat/2) + sin(dLng/2) * sin(dLng/2) * cos(lat1) * cos(lat2)
    ang = 2 * atan2(sqrt(val), sqrt(1 - val))
    return radius * ang