I am trying to calculate a distance matrix for a long list of locations identified by latitude & longitude, using a haversine function that takes two coordinate tuples and produces the distance:
def haversine(point1, point2, miles=False):
""" Calculate the great-circle distance bewteen two points on the Earth surface.
:input: two 2-tuples, containing the latitude and longitude of each point
in decimal degrees.
Example: haversine((45.7597, 4.8422), (48.8567, 2.3508))
:output: Returns the distance bewteen the two points.
The default unit is kilometers. Miles can be returned
if the ``miles`` parameter is set to True.
"""
I can calculate the distance between all points using a nested for loop as follows:
data.head()
id coordinates
0 1 (16.3457688674, 6.30354512503)
1 2 (12.494749307, 28.6263955635)
2 3 (27.794615136, 60.0324947881)
3 4 (44.4269923769, 110.114216113)
4 5 (-69.8540884125, 87.9468778773)
using a simple function:
distance = {}

def haver_loop(df):
    for i, point1 in df.iterrows():
        distance[i] = []
        for j, point2 in df.iterrows():
            distance[i].append(haversine(point1.coordinates, point2.coordinates))
    return pd.DataFrame.from_dict(distance, orient='index')
But this takes quite a while given the quadratic complexity, running at around 20 s for 500 points, and I have a much longer list. This has me looking at vectorization, and I've come across numpy.vectorize (docs), but I can't figure out how to apply it in this context.
From haversine's function definition, it looked readily vectorizable. So, using one of the best tools for vectorization with NumPy, namely broadcasting, and replacing the math functions with their NumPy ufunc equivalents, here's one vectorized solution -
# Get data as a Nx2 shaped NumPy array
data = np.array(df['coordinates'].tolist())
# Convert to radians
data = np.deg2rad(data)
# Extract col-1 and 2 as latitudes and longitudes
lat = data[:,0]
lng = data[:,1]
# Elementwise differences of latitudes & longitudes
diff_lat = lat[:,None] - lat
diff_lng = lng[:,None] - lng
# Finally, calculate the haversine distances (in km)
d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
dist = 2 * 6371 * np.arcsin(np.sqrt(d))
Runtime tests -
The np.vectorize-based solution in another answer showed some promise of a performance improvement over the original code, so this section compares the posted broadcasting-based approach against that one.
Function definitions -
def vectorized_based(df):
    haver_vec = np.vectorize(haversine, otypes=[np.int16])
    return df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))
def broadcasting_based(df):
    data = np.array(df['coordinates'].tolist())
    data = np.deg2rad(data)
    lat = data[:,0]
    lng = data[:,1]
    diff_lat = lat[:,None] - lat
    diff_lng = lng[:,None] - lng
    d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))
Timings -
In [123]: # Input
...: length = 500
...: d1 = np.random.uniform(-90, 90, length)
...: d2 = np.random.uniform(-180, 180, length)
...: coords = tuple(zip(d1, d2))
...: df = pd.DataFrame({'id':np.arange(length), 'coordinates':coords})
...:
In [124]: %timeit vectorized_based(df)
1 loops, best of 3: 1.12 s per loop
In [125]: %timeit broadcasting_based(df)
10 loops, best of 3: 68.7 ms per loop
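If you need the result in the same layout as the original loop (an N x N DataFrame of distances), you can wrap the returned array yourself; a small sketch, assuming df as constructed above:
dist_arr = broadcasting_based(df)   # (N, N) array of distances in kilometres
dist_df = pd.DataFrame(dist_arr, index=df['id'], columns=df['id'])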
You would provide your function as an argument to np.vectorize(), and could then use the result inside a pandas groupby().apply(), as illustrated below:
haver_vec = np.vectorize(haversine, otypes=[np.int16])
distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))
For instance, with sample data as follows:
length = 500
df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})
Compare for 500 points:
def haver_vect(data):
    distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
    return distance
%timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop
%timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop
Start by getting all combinations using itertools.product:
results = [(p1, p2, haversine(p1, p2)) for p1, p2 in itertools.product(points, repeat=2)]
That said, I'm not sure how fast it will be; this looks like it might be a duplicate of Python: speeding up geographic comparison.
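Since the distance is symmetric, itertools.combinations visits each unordered pair only once and roughly halves the work; a sketch:
import itertools

# each unordered pair (p1, p2) appears exactly once
results = [(p1, p2, haversine(p1, p2)) for p1, p2 in itertools.combinations(points, 2)]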
Related
I have an M x 3 array of 3D coordinates, coords (M ~1000-10000), and I would like to compute the sum of Gaussians centered at these coordinates over a mesh grid 3D array. The mesh grid 3D array is typically something like 64 x 64 x 64, but sometimes upwards of 256 x 256 x 256, and can go even larger. I’ve followed this question to get started, by converting my meshgrid array into an array of N x 3 coordinates, xyz, where N is 64^3 or 256^3, etc. However, for large array sizes it takes too much memory to vectorize the entire calculation (understandable since it could approach 1e11 elements and consume a terabyte of RAM) so I’ve broken it up into a loop over M coordinates. However, this is too slow.
I’m wondering if there is any way to speed this up at all without overloading memory. By converting the meshgrid to xyz, I feel like I’ve lost any advantage of the grid being equally spaced, and that somehow, maybe with scipy.ndimage, I should be able to take advantage of the even spacing to speed things up.
Here’s my initial start:
import numpy as np
from scipy import spatial
#create meshgrid
side = 100.
n = 64 #could be 256 or larger
x_ = np.linspace(-side/2,side/2,n)
x,y,z = np.meshgrid(x_,x_,x_,indexing='ij')
#convert meshgrid to list of coordinates
xyz = np.column_stack((x.ravel(),y.ravel(),z.ravel()))
#create some coordinates
coords = np.random.random(size=(1000,3))*side - side/2
def sumofgauss(coords,xyz,sigma):
    """Simple isotropic gaussian sum at coordinate locations."""
    n = int(round(xyz.shape[0]**(1/3.))) #get n samples for reshaping to 3D later
    #this version overloads memory
    #dist = spatial.distance.cdist(coords, xyz)
    #dist *= dist
    #values = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist/(2*sigma**2))
    #values = np.sum(values,axis=0)
    #run cdist in a loop over coords to avoid overloading memory
    values = np.zeros((xyz.shape[0]))
    for i in range(coords.shape[0]):
        dist = spatial.distance.cdist(coords[None,i], xyz)
        dist *= dist
        values += 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist[0]/(2*sigma**2))
    return values.reshape(n,n,n)
image = sumofgauss(coords,xyz,1.0)
import matplotlib.pyplot as plt
plt.imshow(image[n//2]) #show a central slice
plt.show()
M = 1000, N = 64: ~5 seconds. M = 1000, N = 256: ~10 minutes.
Considering that many of your distance calculations will give zero weight after the exponential, you can probably drop a lot of your distances. Doing big chunks of distance calculations while dropping distances which are greater than a threshold is usually faster with a KDTree:
import numpy as np
from scipy.spatial import cKDTree # so we can get a `coo_matrix` output
def gaussgrid(coords, sigma = 1, n = 64, side = 100, eps = None):
    x_ = np.linspace(-side/2,side/2,n)
    x,y,z = np.meshgrid(x_,x_,x_,indexing='ij')
    xyz = np.column_stack((x.ravel(),y.ravel(),z.ravel()))
    if eps is None:
        eps = np.finfo('float64').eps
    thr = -np.log(eps) * 2 * sigma**2
    data_tree = cKDTree(coords)
    discr = 1000 # you can tweak this to get best results on your system
    values = np.empty(n**3)
    for i in range(n**3//discr + 1):
        slc = slice(i * discr, i * discr + discr)
        grid_tree = cKDTree(xyz[slc])
        dists = grid_tree.sparse_distance_matrix(data_tree, thr, output_type = 'coo_matrix')
        dists.data = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dists.data/(2*sigma**2))
        values[slc] = dists.sum(1).squeeze()
    return values.reshape(n,n,n)
Now, even if you keep eps = None it'll be a bit faster, as you're still returning only about 10% of your distances, but with eps = 1e-6 or so you should get a big speedup. On my system:
%timeit out = sumofgauss(coords, xyz, 1.0)
1 loop, best of 3: 23.7 s per loop
%timeit out = gaussgrid(coords)
1 loop, best of 3: 2.12 s per loop
%timeit out = gaussgrid(coords, eps = 1e-6)
1 loop, best of 3: 382 ms per loop
I have the following matrix, which represents some points:
points = np.random.uniform(30, 50, size=(5, 3))
# gives array([[ 45.98139489, 40.27871523, 41.91617071],
#              [ 41.1404787 , 34.56098247, 35.91171313],
#              [ 34.46375465, 49.89872417, 39.04753134],
#              [ 49.28112722, 32.01837698, 32.83394596],
#              [ 48.96623168, 33.58271833, 33.54690091]])
Now each column is a coordinate axis, and each column has values within the range [30, 50]. I want to map each column to a different interval. I know how to map points from one interval to another thanks to this question:
Algorithm to map an interval to a smaller interval
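For reference, the linear map described in that linked question, from an interval [a, b] onto an interval [c, d], is x -> (x - a) * (d - c) / (b - a) + c; as a tiny sketch:
def remap(x, a, b, c, d):
    """Linearly map x from the interval [a, b] onto [c, d]."""
    return (x - a) * (d - c) / (b - a) + c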
But I want something very fast that maps each column (possibly) to a different interval. For instance, suppose we have
intervals = np.array([[0, 10], [3,7], [100,200]])
Or we could have them as separate arrays, such as xinterval = np.array([0, 10]); it doesn't matter.
My Slow try
I collected all the intervals in intervals and then applied the transformation to each column in a loop:
for col, interval in zip(range(points.shape[1]), intervals):
    points[:, col] = ((points[:,col]-min(points[:,col]))*(interval[1]-interval[0]) / (max(points[:,col])-min(points[:,col])) ) + interval[0]
where for simplicity I have used the min-max range of each column as the source interval, but I could just as well have used 30, 50:
for col, interval in zip(range(points.shape[1]), intervals):
    points[:, col] = ((points[:,col]-30)*(interval[1]-interval[0]) / (50-30) ) + interval[0]
Is there a faster way, without using a loop?
Straight-forward broadcasting
Here's one vectorized way making use of broadcasting -
mins = points.min(0)
a1 = (points - mins)* (intervals[:,1]-intervals[:,0])
a2 = points.max(0) - mins
out = a1/a2 + intervals[:,0]
Improvement: less broadcasting
Looking closely, we are performing broadcasting in a few places. Though broadcasting is a very efficient way to vectorize things, it still has some cost. We can improve on it by rearranging things so as to reduce the number of broadcasting steps to just two, compared to four before.
Hence, the modified one would be -
mins = points.min(0)
scale = (intervals[:,1]-intervals[:,0])/(points.max(0) - mins)
offset = mins*scale - intervals[:,0]
out = points *scale - offset
I. Broadcasting steps before:
Two at: (points - mins) * (intervals[:,1]-intervals[:,0]).
Two at: a1/a2 + intervals[:,0].
II. Broadcasting steps after the improvement:
One at points*scale and one at the subtraction thereafter.
Runtime test
Approaches -
def app1(points, intervals):
    mins = points.min(0)
    a1 = (points - mins)*(intervals[:,1]-intervals[:,0])
    a2 = points.max(0) - mins
    out = a1/a2 + intervals[:,0]
    return out

def app2(points, intervals):
    mins = points.min(0)
    scale = (intervals[:,1]-intervals[:,0])/(points.max(0) - mins)
    offset = mins*scale - intervals[:,0]
    out = points*scale - offset
    return out
Timings -
In [104]: points = np.array([[ 45.98139489, 40.27871523, 41.91617071],
...: [ 41.1404787 , 34.56098247, 35.91171313],
...: [ 34.46375465, 49.89872417, 39.04753134],
...: [ 49.28112722, 32.01837698, 32.83394596],
...: [ 48.96623168, 33.58271833, 33.54690091]])
...: points = np.repeat(points, 100000,axis=0)
...:
...: intervals = np.array([[0, 10], [3,7], [100,200]])
...:
In [105]: %timeit app1(points, intervals)
10 loops, best of 3: 26.3 ms per loop
In [106]: %timeit app2(points, intervals)
100 loops, best of 3: 17.9 ms per loop
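Both approaches should of course produce the same values; a quick sanity check:
print(np.allclose(app1(points, intervals), app2(points, intervals)))
# expected: True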
I'm trying to apply an operation to each pair of rows separated by a lag n, and get the minimum (also maximum and mean) of the results for each n from 0 to N-1, where N is the number of rows. For example, if Data=[1,2,3,4] and the operation is addition, Minimum=[2,3,4,5], Maximum=[8,7,6,5], and Mean=[5,5,5,5].
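To make the definition concrete, here is a tiny brute-force illustration of the Data=[1,2,3,4] example with addition (not meant to be fast):
data = [1, 2, 3, 4]
for n in range(len(data)):
    # combine each row with the row n positions earlier
    vals = [data[j] + data[j - n] for j in range(n, len(data))]
    print(n, min(vals), max(vals), sum(vals) / len(vals))
# n=0: 2 8 5.0, n=1: 3 7 5.0, n=2: 4 6 5.0, n=3: 5 5 5.0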
I have the following code that uses ratio as the operation which works OK for a small data size but takes more than 10 seconds for 10,000 rows. Since I will be working with data that can have 1,000,000 rows, what would be a better way to do this?
import pandas as pd
import numpy as np
low=250
high=5000
length=10
x=pd.DataFrame({'A': np.random.uniform(low, high=high, size=length)})
x['mean']=x['min']=x['max']=x['A'].copy()
for i in range(0,len(x)):
    ratio=x['A']/x['A'].shift(i)
    x['mean'].iloc[[i]]=ratio.mean()
    x['max'].iloc[[i]]=ratio.max()
    x['min'].iloc[[i]]=ratio.min()
print (x)
Approach #1: For efficiency, and considering that you might have up to 1,000,000 rows, I would suggest working with the underlying array data in a similar-looking loopy solution, using efficient array slicing so that each iteration works on a gradually shrinking slice of the data. Together, these should bring a noticeable performance boost.
Thus, an implementation would be -
a = x['A'].values
N = len(a)
out = np.zeros((N,4))
out[:,0] = a
for i in range(N):
    ratio = a[i:]/a[:N-i]
    out[i,1] = ratio.mean()
    out[i,2] = ratio.min()
    out[i,3] = ratio.max()
df_out = pd.DataFrame(out, columns= (('A','mean','min','max')))
Approach #2: For a smaller data size, we can use a vectorized solution that creates a square 2D array of shape (N, N) holding shifted versions of the input data. Then we mask out the upper triangular region with NaNs and finally employ numpy.nanmean, numpy.nanmin and numpy.nanmax to perform the pandas mean, min and max operations -
a = x['A'].values
N = len(a)
r = np.arange(N)
shifting_idx = (r[:,None] - r)%N
vals = a[:,None]/a[shifting_idx]
upper_tri_mask = r[:,None] < r
vals[upper_tri_mask] = np.nan
out = np.zeros((N,4))
out[:,0] = a
out[:,1] = np.nanmean(vals, 0)
out[:,2] = np.nanmin(vals, 0)
out[:,3] = np.nanmax(vals, 0)
df_out = pd.DataFrame(out, columns= (('A','mean','min','max')))
Runtime test
Approaches -
def org_app(x):
    x['mean']=x['min']=x['max']=x['A'].copy()
    for i in range(0,len(x)):
        ratio=x['A']/x['A'].shift(i)
        x['mean'].iloc[[i]]=ratio.mean()
        x['max'].iloc[[i]]=ratio.max()
        x['min'].iloc[[i]]=ratio.min()
    return x

def app1(x):
    a = x['A'].values
    N = len(a)
    out = np.zeros((N,4))
    out[:,0] = a
    for i in range(N):
        ratio = a[i:]/a[:N-i]
        out[i,1] = ratio.mean()
        out[i,2] = ratio.min()
        out[i,3] = ratio.max()
    return pd.DataFrame(out, columns= (('A','mean','min','max')))
Timings -
In [3]: low=250
...: high=5000
...: length=10000
...: x=pd.DataFrame({'A': np.random.uniform(low, high=high, size=length)})
...:
In [4]: %timeit app1(x)
1 loop, best of 3: 185 ms per loop
In [5]: %timeit org_app(x)
1 loop, best of 3: 8.59 s per loop
In [6]: 8590.0/185
Out[6]: 46.432432432432435
46x+ speedup on 10,000 rows data!
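As a quick sanity check (a sketch; org_app mutates its argument, so pass a copy), the two should agree:
cols = ['A', 'mean', 'min', 'max']
print(np.allclose(org_app(x.copy())[cols].values, app1(x)[cols].values))
# expected: True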
I have code that sequentially checks whether each (x, y) coordinate pair in my DataFrame falls inside certain enclosed geometric areas. But it is rather slow, I suspect because it is not vectorized. Here is an example:
import pandas as pd
from matplotlib.patches import Rectangle

r1 = Rectangle((0,0), 10, 10)
r2 = Rectangle((50,50), 10, 10)

df = pd.DataFrame([[1,2],[-1,5], [51,52]], columns=['x', 'y'])

for j in range(df.shape[0]):
    coordinates = df.x.iloc[j], df.y.iloc[j]
    if r1.contains_point(coordinates):
        df.loc[j, 'location'] = 0
    elif r2.contains_point(coordinates):
        df.loc[j, 'location'] = 1
Can someone propose an approach for speed-up?
It's better to convert the rectangular patches into arrays of their extents and work on those directly:
def seqcheck_vect(df):
    xy = df[["x", "y"]].values
    e1 = np.asarray(rec1.get_extents())
    e2 = np.asarray(rec2.get_extents())
    r1m1, r1m2 = np.min(e1), np.max(e1)
    r2m1, r2m2 = np.min(e2), np.max(e2)
    out = np.where(((xy >= r1m1) & (xy <= r1m2)).all(axis=1), 0,
                   np.where(((xy >= r2m1) & (xy <= r2m2)).all(axis=1), 1, np.nan))
    return df.assign(location=out)
For the given sample, the function assigns location 0 to the first point, NaN to the second (it falls inside neither rectangle), and 1 to the third.
Benchmarks:
def loopy_version(df):
    for j in range(df.shape[0]):
        coordinates = df.x.iloc[j], df.y.iloc[j]
        if rec1.contains_point(coordinates):
            df.loc[j, "location"] = 0
        elif rec2.contains_point(coordinates):
            df.loc[j, "location"] = 1
        else:
            pass
    return df
testing on a DF of 10K rows:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, (10000,2)), columns=list("xy"))
# check if both give same outcome
loopy_version(df).equals(seqcheck_vect(df))
True
%timeit loopy_version(df)
1 loop, best of 3: 3.8 s per loop
%timeit seqcheck_vect(df)
1000 loops, best of 3: 1.73 ms per loop
So, the vectorized approach is approximately 2200 times faster compared to the loopy one.
I have an input of 36,742 points, which means if I wanted to calculate the lower triangle of a distance matrix (using the Vincenty approximation) I would need to generate 36,742 * 36,741 * 0.5 = 674,968,911 distances.
I want to keep the pair combinations which are within 50 km of each other. My current set-up is as follows:
shops = [[id,lat,lon]...]

def lower_triangle_mat(points):
    for i in range(len(shops)-1):
        for j in range(i+1, len(shops)):
            yield [shops[i], shops[j]]

def return_stores_cutoff(points, cutoff_km=0):
    below_cut = []
    counter = 0
    for x in lower_triangle_mat(points):
        dist_km = vincenty(x[0][1:3], x[1][1:3]).km
        counter += 1
        if counter % 1000000 == 0:
            print("%d out of %d" % (counter, len(shops)*(len(shops)-1)*0.5))
        if dist_km <= cutoff_km:
            below_cut.append([x[0][0], x[1][0], dist_km])
    return below_cut

start = time.clock()
stores = return_stores_cutoff(points=shops, cutoff_km=50)
print(time.clock() - start)
This will obviously take hours and hours. Some possibilities I was thinking of:
Use numpy to vectorise these calculations rather than looping through
Use some kind of hashing to get a quick rough-cut off (all stores within 100km) and then only calculate accurate distances between those stores
Instead of storing the points in a list use something like a quad-tree but I think that only helps with the ranking of close points rather than actual distance -> so I guess some kind of geodatabase
I can obviously try the haversine formula, or project the points and use Euclidean distances; however, I am interested in using the most accurate measure possible
Make use of parallel processing (however, I was having a bit of difficulty coming up with how to cut the list so as to still get all the relevant pairs).
Edit: I think geohashing is definitely needed here - an example using the geoindex package:
import random
from geoindex import GeoGridIndex, GeoPoint

geo_index = GeoGridIndex()
for _ in range(10000):
    lat = random.random()*180 - 90
    lng = random.random()*360 - 180
    geo_index.add_point(GeoPoint(lat, lng))

center_point = GeoPoint(37.7772448, -122.3955118)
for distance, point in geo_index.get_nearest_points(center_point, 10, 'km'):
    print("We found {0} in {1} km".format(point, distance))
However, I would also like to vectorise (instead of loop) the distance calculations for the stores returned by the geo-hash.
Edit2: Pouria Hadjibagheri - I tried using lambda and map:
# [B]: Mapping approach
lwr_tr_mat = ((shops[i],shops[j]) for i in range(len(shops)-1) for j in range(i+1,len(shops)))
func = lambda x: (x[0][0],x[1][0],vincenty(x[0],x[1]).km)
# Trying to see if conditional statements slow this down
func_cond = lambda x: (x[0][0],x[1][0],vincenty(x[0],x[1]).km) if vincenty(x[0],x[1]).km <= 50 else None
start = time.clock()
out_dist = list(map(func,lwr_tr_mat))
print(time.clock() - start)
start = time.clock()
out_dist = list(map(func_cond,lwr_tr_mat))
print(time.clock() - start)
And they were both around 61 seconds (I restricted the number of stores to 2,000 from 32,000). Perhaps I used map incorrectly?
This sounds like a classic use case for k-D trees.
If you first transform your points into Euclidean space then you can use the query_pairs method of scipy.spatial.cKDTree:
from scipy.spatial import cKDTree
tree = cKDTree(data)
# where data is (nshops, ndim) containing the Euclidean coordinates of each shop
# in units of km
pairs = tree.query_pairs(50, p=2) # 50km radius, L2 (Euclidean) norm
pairs will be a set of (i, j) tuples corresponding to the row indices of pairs of shops that are ≤50km from each other.
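The transform into Euclidean space isn't shown here; one simple option (a sketch, assuming a spherical Earth of radius 6371 km and shops = [[id, lat, lon], ...] as in the question) is to convert each lat/lon to 3D Cartesian coordinates, where the Euclidean (chord) distance is essentially identical to the great-circle distance at a 50 km scale:
import numpy as np

R = 6371.0  # mean Earth radius in km (spherical approximation)
lat = np.radians([s[1] for s in shops])
lng = np.radians([s[2] for s in shops])
# 3D coordinates on the sphere, in km; one way to build the (nshops, 3) `data` array used above
data = np.column_stack((R * np.cos(lat) * np.cos(lng),
                        R * np.cos(lat) * np.sin(lng),
                        R * np.sin(lat)))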
The output of tree.sparse_distance_matrix is a scipy.sparse.dok_matrix. Since the matrix will be symmetric and you're only interested in unique row/column pairs, you could use scipy.sparse.tril to zero out the upper triangle, giving you a scipy.sparse.coo_matrix. From there you can access the nonzero row and column indices and their corresponding distance values via the .row, .col and .data attributes:
from scipy import sparse
tree_dist = tree.sparse_distance_matrix(tree, max_distance=10000, p=2)
udist = sparse.tril(tree_dist, k=-1) # keep the strictly lower triangle (k=-1 also drops the main diagonal)
ridx = udist.row # row indices
cidx = udist.col # column indices
dist = udist.data # distance values
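From there, packing the surviving pairs into a table is straightforward; a small sketch:
import pandas as pd

pairs_df = pd.DataFrame({'i': ridx, 'j': cidx, 'distance': dist})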
Have you tried mapping entire arrays and functions instead of iterating through them? An example would be as follows:
from numpy.random import rand
my_array = rand(int(5e7), 1) # An array of 50,000,000 random numbers in double.
Now what is normally done is:
squared_list_iter = [value**2 for value in my_array]
This of course works, but is far from optimal.
The alternative would be to map the array with a function. This is done as follows:
func = lambda x: x**2 # Here is what I want to do on my array.
squared_list_map = map(func, my_array) # Here I am doing it!
Now, one might ask, how is this any different, or even better for that matter? Since now we have added a call to a function, too! Here is your answer:
For the former solution (via iteration):
1 loop: 1.11 minutes.
Compared to the latter solution (mapping):
500 loops, 560 ns per loop on average.
Converting the map() result to a list with list(map(func, my_array)) would increase the time by a factor of ten, to approximately 500 ms.
You choose!
Thanks for everyone's help. I think I have solved this by incorporating all the suggestions.
I use NumPy to import the geographic co-ordinates and then project them using "France Lambert - 93". This lets me fill scipy.spatial.cKDTree with the points and then calculate a sparse_distance_matrix by specifying a cut-off of 50 km (my projected points are in metres). I then extract the lower triangle to a CSV.
import numpy as np
import csv
import time
from pyproj import Proj, transform
#http://epsg.io/2154 (accuracy: 1.0m)
fr = '+proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3 \
+x_0=700000 +y_0=6600000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 \
+units=m +no_defs'
#http://epsg.io/27700-5339 (accuracy: 1.0m)
uk = '+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 \
+x_0=400000 +y_0=-100000 +ellps=airy \
+towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs'
path_to_csv = '.../raw_in.csv'
out_csv = '.../out.csv'
def proj_arr(points):
    inproj = Proj(init='epsg:4326')
    outproj = Proj(uk)
    # origin|destination|lon|lat
    func = lambda x: transform(inproj,outproj,x[2],x[1])
    return np.array(list(map(func, points)))
tstart = time.time()
# Import points as geographic coordinates
# ID|lat|lon
#Sample to try and replicate
#points = np.array([
# [39007,46.585012,5.5857829],
# [88086,48.192370,6.7296289],
# [62627,50.309155,3.0218611],
# [14020,49.133972,-0.15851507],
# [1091, 42.981765,2.0104902]])
#
points = np.genfromtxt(path_to_csv,
delimiter=',',
skip_header=1)
print("Total points: %d" % len(points))
print("Triangular matrix contains: %d" % (len(points)*((len(points))-1)*0.5))
# Get projected co-ordinates
proj_pnts = proj_arr(points)
# Fill quad-tree
from scipy.spatial import cKDTree
tree = cKDTree(proj_pnts)
cut_off_metres = 1600
tree_dist = tree.sparse_distance_matrix(tree,
max_distance=cut_off_metres,
p=2)
# Extract triangle
from scipy import sparse
udist = sparse.tril(tree_dist, k=-1) # keep the strictly lower triangle (k=-1 also drops the main diagonal)
print("Distances after quad-tree cut-off: %d " % len(udist.data))
# Export CSV
import csv
f = open(out_csv, 'w', newline='')
w = csv.writer(f, delimiter=",", )
w.writerow(['id_a','lat_a','lon_a','id_b','lat_b','lon_b','metres'])
w.writerows(np.column_stack((points[udist.row ],
points[udist.col],
udist.data)))
f.close()
"""
Get ID labels
"""
id_to_csv = '...id.csv'
id_labels = np.genfromtxt(id_to_csv,
delimiter=',',
skip_header=1,
dtype='U')
"""
Try vincenty on the un-projected co-ordinates
"""
from geopy.distance import vincenty
vout_csv = '.../out_vin.csv'
test_vin = np.column_stack((points[udist.row].T[1:3].T,
points[udist.col].T[1:3].T))
func = lambda x: vincenty(x[0:2],x[2:4]).m
output = list(map(func,test_vin))
# Export CSV
f = open(vout_csv, 'w', newline='')
w = csv.writer(f, delimiter=",", )
w.writerow(['id_a','id_a2', 'lat_a','lon_a',
'id_b','id_b2', 'lat_b','lon_b',
'proj_metres','vincenty_metres'])
w.writerows(np.column_stack((list(id_labels[udist.row]),
points[udist.row ],
list(id_labels[udist.col]),
points[udist.col],
udist.data,
output,
)))
f.close()
print("Finished in %.0f seconds" % (time.time()-tstart)
This approach took 164 seconds to generate (for 5,306,434 distances) - compared to 9 - and also around 90 seconds to save to disk.
I then compared the difference in the vincenty distance and the hypotenuse distance (on the projected co-ordinates).
The mean difference in metres was 2.7 and the mean difference/metres was 0.0073% - which looks great.
"Use some kind of hashing to get a quick rough-cut off (all stores within 100km) and then only calculate accurate distances between those stores"
I think this might be better called gridding. So first make a dict keyed by a grid cell (a rounded coordinate pair), and put each shop into the roughly 50 km bucket around that point. Then, when you are calculating distances, you only look in nearby buckets rather than iterating through every shop in the whole universe.
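A rough sketch of that idea (the cell size and helper names are illustrative; at high latitudes a fixed-degree cell spans fewer kilometres of longitude, so neighbouring cells alone may miss some pairs there):
from collections import defaultdict

CELL_DEG = 0.5  # ~55 km of latitude per cell; illustrative choice

def cell_of(lat, lon):
    # integer grid-cell index containing (lat, lon)
    return (int(lat // CELL_DEG), int(lon // CELL_DEG))

buckets = defaultdict(list)
for shop_id, lat, lon in shops:
    buckets[cell_of(lat, lon)].append((shop_id, lat, lon))

def candidates(lat, lon):
    """Shops in the cell containing (lat, lon) and its 8 neighbours."""
    ci, cj = cell_of(lat, lon)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            yield from buckets.get((ci + di, cj + dj), [])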
You can use vectorization with the haversine formula discussed in this thread: Haversine Formula in Python (Bearing and Distance between two GPS points).
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
c = 2 * np.arcsin(np.sqrt(a))
km = 6371 * c
Here is the %%timeit result for 7,451,653 distances:
642 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
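Here lon1, lat1, lon2 and lat2 are assumed to be NumPy arrays of matching (broadcastable) shape rather than scalars; for example, the full pairwise matrix for one set of points can be obtained by broadcasting a column against a row (a sketch with made-up sample coordinates):
import numpy as np

lat = np.radians(np.array([45.7597, 48.8567, 43.2965]))
lon = np.radians(np.array([4.8422, 2.3508, 5.3698]))

# broadcast (N, 1) against (N,) to get all N x N pairwise distances at once
dlat = lat[:, None] - lat
dlon = lon[:, None] - lon
a = np.sin(dlat/2.0)**2 + np.cos(lat[:, None]) * np.cos(lat) * np.sin(dlon/2.0)**2
km = 6371 * 2 * np.arcsin(np.sqrt(a))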