Efficient nearest neighbour search on streaming data - python

I'm looking to harness the magic of Python to perform a nearest neighbour (NN) search on a dataset that will grow over time, for example because new entries arrive through streaming. On static data, a NN search is doable as shown below. To start off the example, e and f are two sets of static data points. For each entry in e we need to know which one in f is nearest to it. It's simple enough to do:
import pandas as pd
e=pd.DataFrame({'lbl':['a','b','c'],'x':[1.5,2,2.5],'y':[1.5,3,2]})
f=pd.DataFrame({'x':[2,2],'y':[2,1.5]})
from sklearn.neighbors import BallTree as bt
def runtree():
    tree=bt(f[['x','y']],2)
    dy,e['nearest']=tree.query(e[['x','y']],1)
    return e
runtree() returns e with the index of the nearest data point in the final column:
lbl x y nearest
0 a 1.5 1.5 1
1 b 2.0 3.0 0
2 c 2.5 2.0 0
Now, let's treat f as a dataframe that will grow over time, and add a new record to it:
f.loc[2]=[2.5]+[1.75]
When running runtree() again, the record with lbl=c is closer to the new entry (the bottom right entry shows index=2 now). Before the new entry was added, the same record was closest to index 0 (see above):
lbl x y nearest
0 a 1.5 1.5 1
1 b 2.0 3.0 0
2 c 2.5 2.0 2
The question is, is there a way to get this final result without rerunning runtree() for all the records in e, but instead refreshing only the ones that are relatively close to the new entry we've added in f? In the example it would be great to have a way to know that only the final row needs to be refreshed, without running all the rows to find out.
To put this into context: in the example for e above, we have three records in two dimensions. A real world example could have millions of records in more than two dimensions. It seems inefficient to rerun all the calculations every time that a new record arrives in f. A more efficient method might factor in that some of the entries in e are nowhere near the new one in f so they should not need updating.
It might be possible to delve into Euclidean distance maths, but my sense is that all the heavy lifting has already been done in packages like BallTree.
Does a package exist that can do what is needed here, on growing rather than static data, without lifting the bonnet on some serious math?
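For what it's worth, here is a minimal sketch of the kind of incremental refresh described above. It assumes we also keep each row's current nearest distance in an extra column (called ndist here), so that when a new point lands in f we only re-check the rows of e that the new point could possibly beat. The helper add_to_f and the column names are mine, not from any package.
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

e = pd.DataFrame({'lbl': ['a', 'b', 'c'], 'x': [1.5, 2, 2.5], 'y': [1.5, 3, 2]})
f = pd.DataFrame({'x': [2, 2], 'y': [2, 1.5]})

# Full query once: remember the index of, and the distance to, the current nearest.
dist, idx = BallTree(f[['x', 'y']].values).query(e[['x', 'y']].values, 1)
e['nearest'] = idx[:, 0]
e['ndist'] = dist[:, 0]

# Tree over e, used to localise the effect of each new arrival in f.
e_tree = BallTree(e[['x', 'y']].values)

def add_to_f(new_xy):
    """Append a point to f and refresh only the rows of e it can affect."""
    global f
    new_idx = len(f)
    f.loc[new_idx] = new_xy
    # A row can only change if its distance to the new point is smaller than its
    # current nearest distance, so every candidate lies within max(ndist) of it.
    radius = e['ndist'].max()
    (candidates,) = e_tree.query_radius(np.asarray(new_xy, dtype=float).reshape(1, -1), r=radius)
    for i in candidates:
        d = np.hypot(e['x'].iat[i] - new_xy[0], e['y'].iat[i] - new_xy[1])
        if d < e['ndist'].iat[i]:
            e.loc[e.index[i], 'nearest'] = new_idx
            e.loc[e.index[i], 'ndist'] = d

add_to_f([2.5, 1.75])
print(e)   # only the row with lbl=c is touched; its 'nearest' is now 2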

Related

Is it possible to breakdown a numpy array to run through 1 different value in every iteration?

So, I have an excel spreadsheet which I want my programme to be able to access and get data from.
To do this I installed pandas and have managed to import the spreadsheet into the code.
Satellites_Path = (r'C:\Users\camer\Documents\Satellite_info.xlsx')
df = pd.read_excel(Satellites_Path, engine='openpyxl')
So this all works.
The problem is that I want it to grab a piece of data, say the distance between two things, and run that number through a loop. Then I want it to go one row down in the spreadsheet and do the same for the new number, until it finishes the column, and then stop.
The data file reads as:
Number ObjectName DistanceFromEarth(km) Radius(km) Mass(10^24kg)
0 0 Earth 0.0 6378.1 5.97240
1 1 Moon 384400.0 1783.1 0.07346
I put the 'Number' column in because I thought I could loop while Number is less than a limit and run through those numbers, but I have found that the dataframe doesn't behave like an array or an integer, so that doesn't work.
Since then, I have tried to put it into an integer by turning it into a NumPy array:
import numpy as np
N = df.loc[:,'Number']
D = np.array(df.loc[:,'DistanceFromEarth(km)'])
R = np.array(df.loc[:,'Radius(km)'])
However, the arrays are still problematic. I have tried to split them up like:
a = (np.array(N))
print(a)
newa = np.array_split(a,3)
and this sort of now works but as a test, I made this little bit and it repeats infinitely:
while True:
    if (newa[0]) < 1:
        print(newa)
If 1 is made a 0, it prints once and then stops. I just want it to run a couple of times.
What I am getting at is, is it possible to read this file, grab a number from it and run through calculations using it, and then repeat that for the next satellite in the list? The reason I thought to do it this way is that I am going to make quite a long list. I already have a working simulation of local planets in the solar system but I wanted to add more bodies and doing it in the way that I was would make it extremely long to write, very dense and introduce more problems.
Reading from a file from excel would make my life a lot easier and make it more future-proof but I don't know if it's possible and I can't see anywhere online that is similar.
Pandas is absolutely a good choice here, but a lack of familiarity with how to use it appears to be holding you back.
Here are a couple of examples that may be applicable to your situation:
Simple row by row calculations to make a new column:
df['Diameter(km)'] = df['Radius(km)']*2
print(df)
Output:
Number ObjectName DistanceFromEarth(km) Radius(km) Mass(10^24kg) Diameter(km)
0 0 Earth 0.0 6378.1 5.97240 12756.2
1 1 Moon 384400.0 1783.1 0.07346 3566.2
Running each row through a function:
def do_stuff(row):
    print(f"Number: {row['Number']}", f"Object Name: {row['ObjectName']}")

df.apply(do_stuff, axis=1)
Output:
Number: 0 Object Name: Earth
Number: 1 Object Name: Moon
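If the eventual goal is to feed each satellite's numbers into a longer calculation one at a time, plain iteration over the columns also works. Here's a small sketch using the columns from the spreadsheet above; the volume calculation is just a placeholder for whatever you actually need to compute.
import pandas as pd

Satellites_Path = r'C:\Users\camer\Documents\Satellite_info.xlsx'
df = pd.read_excel(Satellites_Path, engine='openpyxl')

# zip walks the columns row by row, which avoids any index bookkeeping
for name, dist_km, radius_km in zip(df['ObjectName'],
                                    df['DistanceFromEarth(km)'],
                                    df['Radius(km)']):
    volume_km3 = 4 / 3 * 3.14159 * radius_km ** 3   # placeholder calculation
    print(name, dist_km, volume_km3)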

Hausdorff distance for large dataset in a fastest way

My dataset has 500000+ rows. I need the Hausdorff distance between every id and every other id, repeated over the whole dataset.
I have a huge data set. Here is the small part:
df =
id_easy ordinal latitude longitude epoch day_of_week
0 aaa 1.0 22.0701 2.6685 01-01-11 07:45 Friday
1 aaa 2.0 22.0716 2.6695 01-01-11 07:45 Friday
2 aaa 3.0 22.0722 2.6696 01-01-11 07:46 Friday
3 bbb 1.0 22.1166 2.6898 01-01-11 07:58 Friday
4 bbb 2.0 22.1162 2.6951 01-01-11 07:59 Friday
5 ccc 1.0 22.1166 2.6898 01-01-11 07:58 Friday
6 ccc 2.0 22.1162 2.6951 01-01-11 07:59 Friday
I want to calculate the Hausdorff distance:
import pandas as pd
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from scipy.spatial.distance import pdist, squareform
u = np.array([(2.6685,22.0701),(2.6695,22.0716),(2.6696,22.0722)]) # coordinates of `id_easy` of taxi `aaa`
v = np.array([(2.6898,22.1166),(2.6951,22.1162)]) # coordinates of `id_easy` of taxi `bbb`
directed_hausdorff(u, v)[0]
Output is 0.05114626086039758
Now I want to calculate this distance for the whole dataset, for all id_easys. The desired output is a matrix with 0 on the diagonal (because the distance between aaa and aaa is 0):
aaa bbb ccc
aaa 0 0.05114 ...
bbb ... 0
ccc 0
You're talking about calculating 500000^2+ distances. If you calculate 1000 of these distances every second, it will take you 7.93 years to complete your matrix. I'm not sure whether the Hausdorff distance is symmetric, but even if it is, that only saves you a factor of two (3.96 years).
The matrix will also take about a terabyte of memory.
I recommend calculating this only when needed, or if you really need the whole matrix, you'll need to parallelize the calculations. On the bright side, this problem can easily be broken up. For example, with four cores, you can split the problem thusly (in pseudocode):
n = len(u) // 2
m = len(v) // 2
A = hausdorff_distance_matrix(u[:n], v[:m])
B = hausdorff_distance_matrix(u[:n], v[m:])
C = hausdorff_distance_matrix(u[n:], v[:m])
D = hausdorff_distance_matrix(u[n:], v[m:])
results = [[A, B],
           [C, D]]
Where hausdorff_distance_matrix(u, v) returns all distance combinations between u and v. You'll probably need to split it into a lot more than four segments though.
What is the application? Can you get away with only calculating these piece-wise as needed?
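To make the "break it up and parallelize" idea a bit more concrete, here is a rough sketch using multiprocessing and scipy's directed_hausdorff. The names (pair_distance, hausdorff_matrix, tracks) are mine, tracks is assumed to be a list with one (n_i, 2) coordinate array per id_easy, and the symmetric Hausdorff distance is taken as the max of the two directed distances.
import numpy as np
from itertools import combinations
from multiprocessing import Pool
from scipy.spatial.distance import directed_hausdorff

def pair_distance(args):
    i, j, u, v = args
    # symmetric Hausdorff distance = max of the two directed distances
    return i, j, max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

def hausdorff_matrix(tracks, processes=4):
    n = len(tracks)
    out = np.zeros((n, n))
    jobs = ((i, j, tracks[i], tracks[j]) for i, j in combinations(range(n), 2))
    with Pool(processes) as pool:
        for i, j, d in pool.imap_unordered(pair_distance, jobs, chunksize=256):
            out[i, j] = out[j, i] = d
    return out

if __name__ == '__main__':
    # toy data: 200 trajectories of random length stand in for the real ones
    rng = np.random.default_rng(0)
    tracks = [rng.random((int(rng.integers(2, 20)), 2)) for _ in range(200)]
    print(hausdorff_matrix(tracks).shape)   # (200, 200)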
First I define a method which provides some sample data. It would be a lot easier if you provided something like that in the question. In most performance-related problems the size of the real problem is needed to find an optimal solution.
In the following answer I will assume that the average size of id_easy is 17 and that there are 30,000 different ids, which results in a data set size of 510,000.
Create sample data
import numpy as np
import numba as nb
N_ids=30_000
av_id_size=17
#create_data (pre sorting according to id assumed)
lat_lon=np.random.rand(N_ids*av_id_size,2)
#create_ids (sorted array with ids)
ids=np.empty(N_ids*av_id_size,dtype=np.int64)
ind=0
for i in range(N_ids):
    for j in range(av_id_size):
        ids[i*av_id_size+j]=ind
    ind+=1
Hausdorff function
The following function is a slightly modified version from scipy-source.
The following modifications are made:
For very small input arrays I commented out the shuffling part (enable shuffling on larger arrays and try out on your real data what works best)
At least on Windows the Anaconda SciPy function looks to have some performance issues (much slower than on Linux); LLVM-based Numba looks to be consistent
Indices of the Hausdorff pair removed
Distance loop unrolled for the (N,2) case
#Modified Code from Scipy-source
#https://github.com/scipy/scipy/blob/master/scipy/spatial/_hausdorff.pyx
#Copyright (C) Tyler Reddy, Richard Gowers, and Max Linke, 2016
#Copyright © 2001, 2002 Enthought, Inc.
#All rights reserved.
#Copyright © 2003-2013 SciPy Developers.
#All rights reserved.
#Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
#Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
#Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
#disclaimer in the documentation and/or other materials provided with the distribution.
#Neither the name of Enthought nor the names of the SciPy Developers may be used to endorse or promote products derived
#from this software without specific prior written permission.
#THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
#BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
#IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
#OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
#OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
#(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@nb.njit()
def directed_hausdorff_nb(ar1, ar2):
    N1 = ar1.shape[0]
    N2 = ar2.shape[0]
    data_dims = ar1.shape[1]

    # Shuffling for very small arrays disabled
    # Enable it for larger arrays
    #resort1 = np.arange(N1)
    #resort2 = np.arange(N2)
    #np.random.shuffle(resort1)
    #np.random.shuffle(resort2)
    #ar1 = ar1[resort1]
    #ar2 = ar2[resort2]

    cmax = 0
    for i in range(N1):
        no_break_occurred = True
        cmin = np.inf
        for j in range(N2):
            # faster performance with square of distance
            # avoid sqrt until very end
            # Simplification (loop unrolling) for (n,2) arrays
            d = (ar1[i, 0] - ar2[j, 0])**2+(ar1[i, 1] - ar2[j, 1])**2
            if d < cmax: # break out of `for j` loop
                no_break_occurred = False
                break
            if d < cmin: # always true on first iteration of for-j loop
                cmin = d
        # always true on first iteration of for-j loop, after that only
        # if d >= cmax
        if cmin != np.inf and cmin > cmax and no_break_occurred == True:
            cmax = cmin
    return np.sqrt(cmax)
Calculating Hausdorff distance on subsets
@nb.njit(parallel=True)
def get_distance_mat(def_slice,lat_lon):
    Num_ids=def_slice.shape[0]-1
    out=np.empty((Num_ids,Num_ids),dtype=np.float64)
    for i in nb.prange(Num_ids):
        ar1=lat_lon[def_slice[i]:def_slice[i+1],:]
        for j in range(i,Num_ids):
            ar2=lat_lon[def_slice[j]:def_slice[j+1],:]
            dist=directed_hausdorff_nb(ar1, ar2)
            out[i,j]=dist
            out[j,i]=dist
    return out
Example and Timings
#def_slice defines the start and end of the slices
_,def_slice=np.unique(ids,return_index=True)
def_slice=np.append(def_slice,ids.shape[0])
%timeit res_1=get_distance_mat(def_slice,lat_lon)
#1min 2s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Try computing the distances with cdist from scipy:
from scipy.spatial.distance import cdist
hausdorff_distance = cdist(df[['latitude', 'longitude']], df[['latitude', 'longitude']], lambda u, v: directed_hausdorff(u, v)[0])
hausdorff_distance_df = pd.DataFrame(hausdorff_distance)
As a note though, whatever method you end up using, it will take a lot of time to calculate, just due to the sheer volume of the data. Ask yourself if you genuinely need every single pair of distances.
Practically, these kinds of problems are solved by restricting the number of pairings to a manageable number. For example, slice your data frame into smaller sets, with each set restricted to a geographical area, and then find the pairwise distances within that geographical area.
The above approach is used by supermarkets to identify spots for their new stores. They are not calculating distances between every single store they own and every store their competitors own. First they restrict the area, which will have only 5-10 stores in total, and only then do they proceed to calculate the distances.
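A minimal sketch of that kind of geographic restriction, assuming the df layout from the question; the cell size is arbitrary, and trajectories that land in neighbouring cells are simply never compared, which is the trade-off of this approach.
from collections import defaultdict
from itertools import combinations
from scipy.spatial.distance import directed_hausdorff

cell_size = 0.05  # degrees; purely illustrative

# one (n_i, 2) coordinate array per id, in ordinal order
tracks = {k: g.sort_values('ordinal')[['longitude', 'latitude']].to_numpy()
          for k, g in df.groupby('id_easy')}

# bucket ids by the grid cell of their first point
buckets = defaultdict(list)
for k, arr in tracks.items():
    buckets[(int(arr[0, 0] // cell_size), int(arr[0, 1] // cell_size))].append(k)

# only compare trajectories that fall in the same cell
distances = {(a, b): directed_hausdorff(tracks[a], tracks[b])[0]
             for ids in buckets.values()
             for a, b in combinations(ids, 2)}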
It comes down to your exact use case:
If what you want is to pre-compute values because you need ballpark answers quickly, you can 'bin' your answers. In the roughest case, imagine you have 1.2343 * 10^7 and 9.2342 * 10^18. You drop the resolution markedly and do a straight lookup. That is, you look up '1E7 and 1E19' in your precomputed lookup table. You get the correct answer accurate to a power of 10.
If you want to characterize your distribution of answers globally, pick a statistically sufficient number of random pairs and use that instead.
If you want to graph your answers, run successive passes of more precise answers. For example, make your graph from 50x50 points equally spaced. After they are computed, if the user is still there, start filling in 250x250 resolution, and so on. There are cute tricks to deciding where to focus more computational power, just follow the user's mouse instead.
If you may want the answers for some pieces, but just don't know which ones until a later computation, then make the evaluation lazy. That is, attempting to __getitem__() the value actually computes it (see the sketch at the end of this answer).
If you think you actually need all the answers now, think harder; no real problem has that many independent moving parts. You should find some exploitable constraint in the data that already exists, otherwise the data would already be overwhelming.
Keep thinking! Keep hacking! Keep notes.
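As an illustration of the lazy-evaluation point above, a tiny sketch: distances are only computed when a pair is actually looked up, and cached afterwards. The class name and the tracks dict of id -> (n, 2) arrays are mine, not from any library.
from scipy.spatial.distance import directed_hausdorff

class LazyHausdorff:
    """Compute (and cache) a pairwise distance only when it is asked for."""
    def __init__(self, tracks):            # tracks: dict id -> (n, 2) array
        self.tracks = tracks
        self.cache = {}

    def __getitem__(self, key):
        a, b = key
        if a == b:
            return 0.0
        k = (a, b) if a <= b else (b, a)   # store each unordered pair once
        if k not in self.cache:
            u, v = self.tracks[k[0]], self.tracks[k[1]]
            self.cache[k] = max(directed_hausdorff(u, v)[0],
                                directed_hausdorff(v, u)[0])
        return self.cache[k]

# d = LazyHausdorff(tracks)
# d['aaa', 'bbb']   # computed on first access, cached for the next one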

Is there a non brute force based solution to optimise the minimum sum of a 2D array only using 1 value from each row and column

I have two arrays: one is an ordered array generated from a set of previous positions of connected points; the second is a new set of points specifying the new positions of those points. The task is to match up each old point with the best-fitting new position. The differences between each pair of points are stored in a new array of size n*n. The objective is to find a way to map each previous point to a new point resulting in the smallest total sum. As such, each old point is a row of the matrix and must be matched to a single column.
I have already looked into an exhaustive search. Although this works, it has complexity O(n!), which is just not a valid solution.
The code below can be used to generate test data for the 2D array.
import numpy as np
def make_data():
    org = np.random.randint(5000, size=(100, 2))
    new = np.random.randint(5000, size=(100, 2))
    arr = []
    # ranges = []
    for i,j in enumerate(org):
        values = np.linalg.norm(new-j, axis=1)
        arr.append(values)
    # print(arr)
    # print(ranges)
    arr = np.array(arr)
    return arr
Here are some small examples of the array and the expected output.
Ex. 1
1 3 5
0 2 3
5 2 6
The above output should return [0,2,1] to signify that row 0 maps to column 0, row 1 to column 2 and row 2 to column 1, as the optimal solution picks the values 1, 3 and 2.
Ideally the algorithm would be 100% accurate, although something much quicker that is 85%+ accurate would also be acceptable.
Google search terms: "weighted graph minimum matching". You can consider your array to be a weighted graph, and you're looking for a matching that minimizes edge length.
The assignment problem is a fundamental combinatorial optimization problem. It consists of finding, in a weighted bipartite graph, a matching in which the sum of weights of the edges is as large as possible. A common variant consists of finding a minimum-weight perfect matching.
https://en.wikipedia.org/wiki/Assignment_problem
The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time and which anticipated later primal-dual methods.
https://en.wikipedia.org/wiki/Hungarian_algorithm
I'm not sure whether to post the whole algorithm here; it's several paragraphs and in wikipedia markup. On the other hand I'm not sure whether leaving it out makes this a "link-only answer". If people have strong feelings either way, they can mention them in the comments.
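For reference, SciPy ships a ready-made solver for exactly this problem as scipy.optimize.linear_sum_assignment; applied to the 3x3 example above it reproduces the expected [0, 2, 1]. A minimal sketch:
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[1, 3, 5],
                 [0, 2, 3],
                 [5, 2, 6]])

row_ind, col_ind = linear_sum_assignment(cost)
print(col_ind)                       # [0 2 1] -- row 0 -> col 0, row 1 -> col 2, row 2 -> col 1
print(cost[row_ind, col_ind].sum())  # 6, i.e. 1 + 3 + 2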

How to import my data into python

I'm currently working on Project Euler 18 which involves a triangle of numbers and finding the value of the maximum path from top to bottom. It says you can do this project either by brute forcing it or by figuring out a trick to it. I think I've figured out the trick, but I can't even begin to solve this because I don't know how to start manipulating this triangle in Python.
https://projecteuler.net/problem=18
Here's a smaller example triangle:
3
7 4
2 4 6
8 5 9 3
In this case, the maximum route would be 3 -> 7 -> 4 -> 9 for a value of 23.
Some approaches I considered:
I've used NumPy quite a lot for other tasks, so I wondered if an array would work. For that 4 number base triangle, I could maybe do a 4x4 array and fill up the rest with zeros, but aside from not knowing how to import the data in that way, it also doesn't seem very efficient. I also considered a list of lists, where each sublist was a row of the triangle, but I don't know how I'd separate out the terms without going through and adding commas after each term.
Just to emphasise, I'm not looking for a method or a solution to the problem, just a way I can start to manipulate the numbers of the triangle in python.
Here is a little snippet that should help you with reading the data:
rows = []
with open('problem-18-data') as f:
    for line in f:
        rows.append([int(i) for i in line.rstrip('\n').split(" ")])
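If you'd rather not deal with a file at all while experimenting, the same split() trick works on a triple-quoted string, e.g. for the small triangle above:
triangle_text = """\
3
7 4
2 4 6
8 5 9 3"""

triangle = [[int(n) for n in line.split()] for line in triangle_text.splitlines()]
print(triangle)        # [[3], [7, 4], [2, 4, 6], [8, 5, 9, 3]]
print(triangle[3][2])  # 9 -- row 3, position 2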

speeding up processing 5 million rows of coordinate data

I have a csv file with two columns (latitude, longitude) that contains over 5 million rows of geolocation data.
I need to identify the points which are not within 5 miles of any other point in the list, and output everything back into another CSV that has an extra column (CloseToAnotherPoint) which is True if there is another point within 5 miles, and False if there isn't.
Here is my current solution using geopy (not making any web calls, just using the function to calculate distance):
from geopy.point import Point
from geopy.distance import vincenty
import csv
class CustomGeoPoint(object):
    def __init__(self, latitude, longitude):
        self.location = Point(latitude, longitude)
        self.close_to_another_point = False

try:
    output = open('output.csv','w')
    writer = csv.writer(output, delimiter = ',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

    # 5 miles
    close_limit = 5

    geo_points = []
    with open('geo_input.csv', newline='') as geo_csv:
        reader = csv.reader(geo_csv)
        next(reader, None)  # skip the headers
        for row in reader:
            geo_points.append(CustomGeoPoint(row[0], row[1]))

    # for every point, look at every point until one is found within 5 miles
    for geo_point in geo_points:
        for geo_point2 in geo_points:
            dist = vincenty(geo_point.location, geo_point2.location).miles
            if 0 < dist <= close_limit:  # (0,close_limit]
                geo_point.close_to_another_point = True
                break

        writer.writerow([geo_point.location.latitude, geo_point.location.longitude,
                         geo_point.close_to_another_point])

finally:
    output.close()
As you might be able to tell from looking at it, this solution is extremely slow. So slow in fact that I let it run for 3 days and it still didn't finish!
I've thought about trying to split up the data into chunks (multiple CSV files or something) so that the inner loop doesn't have to look at every other point, but then I would have to figure out how to make sure the borders of each section checked against the borders of its adjacent sections, and that just seems overly complex and I'm afraid it would be more of a headache than it's worth.
So any pointers on how to make this faster?
Let's look at what you're doing.
You read all the points into a list named geo_points.
Now, can you tell me whether the list is sorted? Because if it was sorted, we definitely want to know that. Sorting is valuable information, especially when you're dealing with 5 million of anything.
You loop over all the geo_points. That's 5 million, according to you.
Within the outer loop, you loop again over all 5 million geo_points.
You compute the distance in miles between the two loop items.
If the distance is less than your threshold, you record that information on the first point, and stop the inner loop.
When the inner loop stops, you write information about the outer loop item to a CSV file.
Notice a couple of things. First, you're looping 5 million times in the outer loop. And then you're looping 5 million times in the inner loop.
This is what O(n²) means.
The next time you see someone talking about "Oh, this is O(log n) but that other thing is O(n log n)," remember this experience - you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?
Anyway, you have some problems.
Problem 1: You'll eventually wind up comparing every point against itself. Which should have a distance of zero, meaning they will all be marked as within whatever distance threshold. If your program ever finishes, all the cells will be marked True.
Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance from each other, you are recording that information about point #1. But you don't record the same information about the other point. You know that point #12345 (geo_point2) is reflexively within the threshold of point #1, but you don't write that down. So you're missing a chance to just skip over 5 million comparisons.
Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop is starting from the beginning of the list every time, but you know that you have already compared the start of the list with the end of the list. You can reduce your problem space by half just by making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million).
Answers?
Consider your latitude and longitude on a flat plane. You want to know if there's a point within 5 miles. Let's think about a 10 mile square, centered on your point #1. That's a square centered on (X1, Y1), with a top left corner at (X1 - 5miles, Y1 + 5miles) and a bottom right corner at (X1 + 5miles, Y1 - 5miles). Now, if a point is within that square, it might not be within 5 miles of your point #1. But you can bet that if it's outside that square, it's more than 5 miles away.
As @SeverinPappadeaux points out, distance on a spheroid like Earth is not quite the same as distance on a flat plane. But so what? Set your square a little bigger to allow for the difference, and proceed!
Sorted List
This is why sorting is important. If all the points were sorted by X, then Y (or Y, then X - whatever) and you knew it, you could really speed things up. Because you could simply stop scanning when the X (or Y) coordinate got too big, and you wouldn't have to go through 5 million points.
How would that work? Same way as before, except your inner loop would have some checks like this:
five_miles = ... # Whatever math, plus an error allowance!
list_len = len(geo_points)  # Don't call this 5 million times

for i, pi in enumerate(geo_points):

    if pi.close_to_another_point:
        continue  # Remember if close to an earlier point

    pi0max = pi[0] + five_miles
    pi1min = pi[1] - five_miles
    pi1max = pi[1] + five_miles

    for j in range(i+1, list_len):
        pj = geo_points[j]
        # Assumes geo_points is sorted on [0] then [1]
        if pj[0] > pi0max:
            # Can't possibly be close enough, nor any later points
            break
        if pj[1] < pi1min or pj[1] > pi1max:
            # Can't be close enough, but a later point might be
            continue

        # Now do "real" comparison using accurate functions.
        if ...:
            pi.close_to_another_point = True
            pj.close_to_another_point = True
            break
What am I doing there? First, I'm getting some numbers into local variables. Then I'm using enumerate to give me an i value and a reference to the outer point. (What you called geo_point). Then, I'm quickly checking to see if we already know that this point is close to another one.
If not, we'll have to scan. So I'm only scanning "later" points in the list, because I know the outer loop scans the early ones, and I definitely don't want to compare a point against itself. I'm using a few temporary variables to cache the result of computations involving the outer loop. Within the inner loop, I do some stupid comparisons against the temporaries. They can't tell me if the two points are close to each other, but I can check if they're definitely not close and skip ahead.
Finally, if the simple checks pass then go ahead and do the expensive checks. If a check actually passes, be sure to record the result on both points, so we can skip doing the second point later.
Unsorted List
But what if the list is not sorted?
@RootTwo points you at a kD tree (where D is for "dimensional" and k in this case is "2"). The idea is really simple, if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels in the tree and comparing Y at odd levels (or vice versa). The idea would be this:
def insert_node(node, treenode, depth=0):
    dimension = depth % 2  # even/odd -> lat/long
    dn = node.coord[dimension]
    dt = treenode.coord[dimension]
    if dn < dt:
        # go left
        if treenode.left is None:
            treenode.left = node
        else:
            insert_node(node, treenode.left, depth+1)
    else:
        # go right
        if treenode.right is None:
            treenode.right = node
        else:
            insert_node(node, treenode.right, depth+1)
What would this do? This would get you a searchable tree where points could be inserted in O(log n) time. That means O(n log n) for the whole list, which is way better than n squared! (The log base 2 of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)
It also means you can do a targeted search. Since the tree is ordered, it's fairly straightforward to look for "close" points (the Wikipedia link from @RootTwo provides an algorithm).
Advice
My advice is to just write code to sort the list, if needed. It's easier to write, and easier to check by hand, and it's a separate pass you will only need to make one time.
Once you have the list sorted, try the approach I showed above. It's close to what you were doing, and it should be easy for you to understand and code.
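If pulling in a library is acceptable, a ready-made spatial index can do the targeted search described above without hand-rolling a tree. Below is a hedged sketch using scikit-learn's BallTree with the haversine metric (my choice of tool, not something mentioned in this thread), run on random test points standing in for the real CSV data.
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8
close_limit = 5.0  # miles

# stand-in data: (n, 2) array of [latitude, longitude] in degrees
rng = np.random.default_rng(0)
points = np.column_stack([rng.uniform(-88, 88, 100_000),
                          rng.uniform(-180, 180, 100_000)])

tree = BallTree(np.radians(points), metric='haversine')
counts = tree.query_radius(np.radians(points),
                           r=close_limit / EARTH_RADIUS_MILES,
                           count_only=True)

# every point matches itself, so "close to another point" means count > 1
close_to_another = counts > 1
print(close_to_another.sum(), "of", len(points), "points have a neighbor within 5 miles")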
As the answer to Python calculate lots of distances quickly points out, this is a classic use case for k-D trees.
An alternative is to use a sweep line algorithm, as shown in the answer to How do I match similar coordinates using Python?
Here's the sweep line algorithm adapted for your questions. On my laptop, it takes < 5 minutes to run through 5M random points.
import itertools as it
import operator as op
import sortedcontainers # handy library on Pypi
import time
from collections import namedtuple
from math import cos, degrees, pi, radians, sqrt
from random import sample, uniform
Point = namedtuple("Point", "lat long has_close_neighbor")
miles_per_degree = 69
number_of_points = 5000000
data = [Point(uniform( -88.0, 88.0),   # lat
              uniform(-180.0, 180.0),  # long
              True
              )
        for _ in range(number_of_points)
        ]
start = time.time()
# Note: lat is first in Point, so data is sorted by .lat then .long.
data.sort()
print(time.time() - start)
# Parameter that determines the size of a sliding latitude window
# and therefore how close two points need to be to get flagged.
threshold = 5.0 # miles
lat_span = threshold / miles_per_degree
coarse_threshold = (.98 * threshold)**2
# Sliding latitude window. Within the window, observations are
# ordered by longitude.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('long'))
# lag_pt is the 'southernmost' point within the sliding window.
point = iter(data)
lag_pt = next(point)
milepost = len(data)//10
# lead_pt is the 'northernmost' point in the sliding window.
for i, lead_pt in enumerate(data):
    if i == milepost:
        print('.', end=' ')
        milepost += len(data)//10

    # The latitude of lead_pt represents the leading edge of the window.
    window.add(lead_pt)

    # Remove observations further than the trailing edge of the window.
    while lead_pt.lat - lag_pt.lat > lat_span:
        window.discard(lag_pt)
        lag_pt = next(point)

    # Calculate the 'east-west' width of the window at the latitude of lead_pt.
    long_span = lat_span / cos(radians(lead_pt.lat))
    east_long = lead_pt.long + long_span
    west_long = lead_pt.long - long_span

    # Check all observations in the sliding window within
    # long_span of lead_pt.
    for other_pt in window.irange_key(west_long, east_long):
        if other_pt != lead_pt:
            # lead_pt is at the top center of a box 2 * long_span wide by
            # 1 * long_span tall. other_pt is in that box. If desired,
            # put additional fine-grained 'closeness' tests here.

            # coarse check if any pts within 80% of threshold distance
            # then don't need to check distance to any more neighbors
            average_lat = (other_pt.lat + lead_pt.lat) / 2
            delta_lat = other_pt.lat - lead_pt.lat
            delta_long = (other_pt.long - lead_pt.long)/cos(radians(average_lat))
            if delta_lat**2 + delta_long**2 <= coarse_threshold:
                break

            # put vincenty test here
            #if 0 < vincenty(lead_pt, other_pt).miles <= close_limit:
            #    break
    else:
        data[i] = data[i]._replace(has_close_neighbor=False)
print()
print(time.time() - start)
If you sort the list by latitude (n log(n)), and the points are roughly evenly distributed, it will bring it down to about 1000 points within 5 miles for each point (napkin math, not exact). By only looking at the points that are near in latitude, the runtime goes from n^2 to n*log(n)+.0004n^2. Hopefully this speeds it up enough.
I would give pandas a try. Pandas is made for efficient handling of large amounts of data. That may help with the efficiency of the csv portion anyhow. But from the sounds of it, you've got yourself an inherently inefficient problem to solve. You take point 1 and compare it against 4,999,999 other points. Then you take point 2 and compare it with 4,999,998 other points and so on. Do the math. That's 12.5 trillion comparisons you're doing. If you can do 1,000,000 comparisons per second, that's 144 days of computation. If you can do 10,000,000 comparisons per second, that's 14 days. For just additions in straight python, 10,000,000 operations can take something like 1.1 seconds, but I doubt your comparisons are as fast as an add operation. So give it at least a fortnight or two.
Alternately, you could come up with an alternate algorithm, though I don't have any particular one in mind.
I would redo the algorithm in three steps:
Use great-circle distance, and assume a 1% error, so make the limit equal to 1.01*limit.
Code the great-circle distance as an inlined function; this test should be fast.
You'll get some false positives, which you could further test with vincenty.
A better solution, building on Oscar Smith's suggestion: you have a csv file, so just sort it in Excel (it is very efficient). Then use binary search in your program to find the cities within 5 miles (you can make a small change to the binary search method so it breaks as soon as it finds one city satisfying your condition).
Another improvement is to use a map to remember pairs of cities once you find that one city is within 5 miles of another. For example, when you find that city A is within 5 miles of city B, use the map to store the pair (B is the key and A is the value). So the next time you meet B, search for it in the map first; if it has a corresponding value, you do not need to check it again. But it may use more memory, so be careful about that. Hope it helps you.
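A small sketch of the sort-then-binary-search idea, assuming the points are already sorted by latitude; the bisect window only narrows the candidates, so a real distance check is still needed inside it (function and variable names here are mine).
import bisect
from geopy.distance import great_circle

MILES_PER_DEGREE_LAT = 69.0
close_limit = 5.0
lat_window = close_limit / MILES_PER_DEGREE_LAT

def has_close_neighbor(i, points, lats):
    """points: list of (lat, lon) tuples sorted by latitude; lats: [p[0] for p in points]."""
    lat = points[i][0]
    lo = bisect.bisect_left(lats, lat - lat_window)
    hi = bisect.bisect_right(lats, lat + lat_window)
    for j in range(lo, hi):
        if j != i and great_circle(points[i], points[j]).miles <= close_limit:
            return True
    return False

# points.sort(); lats = [p[0] for p in points]
# flags = [has_close_neighbor(i, points, lats) for i in range(len(points))]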
This is just a first pass, but I've sped it up by half so far by using great_circle() instead of vincenty(), and cleaning up a couple of other things. The difference is explained here, and the loss in accuracy is about 0.17%:
from geopy.point import Point
from geopy.distance import great_circle
import csv
CLOSE_LIMIT = 5  # miles (the 5-mile threshold from the question; referenced below)

class CustomGeoPoint(Point):
    def __init__(self, latitude, longitude):
        super(CustomGeoPoint, self).__init__(latitude, longitude)
        self.close_to_another_point = False

def isCloseToAnother(pointA, points):
    for pointB in points:
        dist = great_circle(pointA, pointB).miles
        if 0 < dist <= CLOSE_LIMIT:  # (0, close_limit]
            return True
    return False

with open('geo_input.csv', 'r') as geo_csv:
    reader = csv.reader(geo_csv)
    next(reader, None)  # skip the headers
    geo_points = sorted(map(lambda x: CustomGeoPoint(x[0], x[1]), reader))

with open('output.csv', 'w') as output:
    writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

    # for every point, look at every other point until one is found within the limit
    for point in geo_points:
        point.close_to_another_point = isCloseToAnother(point, geo_points)
        writer.writerow([point.latitude, point.longitude,
                         point.close_to_another_point])
I'm going to improve this further.
Before:
$ time python geo.py
real 0m5.765s
user 0m5.675s
sys 0m0.048s
After:
$ time python geo.py
real 0m2.816s
user 0m2.716s
sys 0m0.041s
This problem can be solved with a VP tree. These allow querying data
with distances that are a metric obeying the triangle inequality.
The big advantage of VP trees over a k-D tree is that they can be blindly
applied to geographic data anywhere in the world without having to worry
about projecting it to a suitable 2D space. In addition a true geodesic
distance can be used (no need to worry about the differences between
geodesic distances and distances in the projection).
Here's my test: generate 5 million points randomly and uniformly on the
world. Put these into a VP tree.
Looping over all the points, query the VP tree to find any neighbor a
distance in (0km, 10km] away. (0km is not included in this set to avoid
the query point being found.) Count the number of points with no such
neighbor (which is 229573 in my case).
Cost of setting up the VP tree = 5000000 * 20 distance calculations.
Cost of the queries = 5000000 * 23 distance calculations.
Time for setup and queries is 5m 7s.
I am using C++ with GeographicLib for calculating distances, but
the algorithm can of course be implemented in any language and here's
the python version of GeographicLib.
ADDENDUM: The C++ code implementing this approach is given here.
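For anyone wanting the same true geodesic distance from Python, the geographiclib package exposes it directly; a minimal sketch (the wrapper function is mine, not from the answer):
from geographiclib.geodesic import Geodesic

def geodesic_miles(p1, p2):
    """True geodesic (WGS84) distance between two (lat, lon) points, in miles."""
    g = Geodesic.WGS84.Inverse(p1[0], p1[1], p2[0], p2[1])
    return g['s12'] / 1609.344   # s12 is the geodesic distance in metres

print(geodesic_miles((50.0, 5.0), (50.0, 5.1)))   # roughly 4.4 miles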
