Hausdorff distance for large dataset in a fastest way

My dataset has 500,000+ rows. For every id I need the Hausdorff distance between it and every other id, across the whole dataset.
Here is a small part of the data:
df =
   id_easy  ordinal  latitude  longitude  epoch           day_of_week
0  aaa      1.0      22.0701   2.6685     01-01-11 07:45  Friday
1  aaa      2.0      22.0716   2.6695     01-01-11 07:45  Friday
2  aaa      3.0      22.0722   2.6696     01-01-11 07:46  Friday
3  bbb      1.0      22.1166   2.6898     01-01-11 07:58  Friday
4  bbb      2.0      22.1162   2.6951     01-01-11 07:59  Friday
5  ccc      1.0      22.1166   2.6898     01-01-11 07:58  Friday
6  ccc      2.0      22.1162   2.6951     01-01-11 07:59  Friday
I want to calculate the Hausdorff distance:
import pandas as pd
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from scipy.spatial.distance import pdist, squareform
u = np.array([(2.6685,22.0701),(2.6695,22.0716),(2.6696,22.0722)]) # coordinates of `id_easy` of taxi `aaa`
v = np.array([(2.6898,22.1166),(2.6951,22.1162)]) # coordinates of `id_easy` of taxi `bbb`
directed_hausdorff(u, v)[0]
Output is 0.05114626086039758
Now I want to calculate this distance for the whole dataset. For all id_easys. Desired output is matrix with 0 on diagonal (because distance between the aaa and aaa is 0):
     aaa  bbb      ccc
aaa  0    0.05114  ...
bbb  ...  0
ccc                0

You're talking about calculating 500000^2+ distances. If you calculate 1000 of these distances every second, it will take you 7.93 years to complete your matrix. I'm not sure whether the Hausdorff distance is symmetric, but even if it is, that only saves you a factor of two (3.96 years).
The matrix will also take about a terabyte of memory.
I recommend calculating this only when needed, or if you really need the whole matrix, you'll need to parallelize the calculations. On the bright side, this problem can easily be broken up. For example, with four cores, you can split the problem thusly (in pseudocode):
n = len(u) // 2  # split each input in half
m = len(v) // 2
A = hausdorff_distance_matrix(u[:n], v[:m])
B = hausdorff_distance_matrix(u[:n], v[m:])
C = hausdorff_distance_matrix(u[n:], v[:m])
D = hausdorff_distance_matrix(u[n:], v[m:])
results = [[A, B],
           [C, D]]
Where hausdorff_distance_matrix(u, v) returns all distance combinations between u and v. You'll probably need to split it into a lot more than four segments though.
What is the application? Can you get away with only calculating these piece-wise as needed?
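A minimal sketch of the block split above, on the three sample tracks from the question. `hausdorff_distance_matrix` is the hypothetical helper named in the pseudocode; here it is assumed to take two lists of per-id point arrays. `directed_hausdorff_np` is a plain NumPy stand-in written for this sketch, not SciPy's implementation.

```python
import numpy as np

def directed_hausdorff_np(a, b):
    """Directed Hausdorff distance from point set a to point set b (both (n, 2))."""
    diff = a[:, None, :] - b[None, :, :]          # all pairwise coordinate differences
    d = np.sqrt((diff ** 2).sum(axis=-1))         # (len(a), len(b)) distance table
    return d.min(axis=1).max()                    # farthest "nearest neighbour"

def hausdorff_distance_matrix(us, vs):
    """Hypothetical helper: directed distances between every track in us and vs."""
    out = np.empty((len(us), len(vs)))
    for i, a in enumerate(us):
        for j, b in enumerate(vs):
            out[i, j] = directed_hausdorff_np(a, b)
    return out

# (longitude, latitude) points for aaa, bbb, ccc from the question
tracks = [
    np.array([(2.6685, 22.0701), (2.6695, 22.0716), (2.6696, 22.0722)]),
    np.array([(2.6898, 22.1166), (2.6951, 22.1162)]),
    np.array([(2.6898, 22.1166), (2.6951, 22.1162)]),
]

half = len(tracks) // 2
A = hausdorff_distance_matrix(tracks[:half], tracks[:half])
B = hausdorff_distance_matrix(tracks[:half], tracks[half:])
C = hausdorff_distance_matrix(tracks[half:], tracks[:half])
D = hausdorff_distance_matrix(tracks[half:], tracks[half:])
full = np.block([[A, B], [C, D]])                 # reassemble the four blocks
```

Each of the four calls is independent of the others, so they could be handed to separate worker processes.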

First I define a method which provides some sample data. It would be a lot easier if you provided something like that in the question. In most performance-related problems the size of the real problem is needed to find an optimal solution.
In the following answer I will assume that the average size of id_easy is 17 and there are 30000 different ids which results in a data set size of 510_000.
Create sample data
import numpy as np
import numba as nb
N_ids=30_000
av_id_size=17
#create_data (pre sorting according to id assumed)
lat_lon=np.random.rand(N_ids*av_id_size,2)
#create_ids (sorted array with ids)
ids=np.empty(N_ids*av_id_size,dtype=np.int64)
ind=0
for i in range(N_ids):
    for j in range(av_id_size):
        ids[i*av_id_size+j]=ind
    ind+=1
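As a side note, the same sorted id array can probably be built without the Python loops. The sketch below uses deliberately tiny sizes and checks that `np.repeat` gives the same result as the double loop (assuming, as above, that every id has exactly `av_id_size` points).

```python
import numpy as np

N_ids = 5          # tiny values so the comparison runs instantly
av_id_size = 3

# loop version, as in the sample-data code
ids_loop = np.empty(N_ids * av_id_size, dtype=np.int64)
ind = 0
for i in range(N_ids):
    for j in range(av_id_size):
        ids_loop[i * av_id_size + j] = ind
    ind += 1

# vectorised version: each id repeated av_id_size times
ids_vec = np.repeat(np.arange(N_ids, dtype=np.int64), av_id_size)
same = np.array_equal(ids_loop, ids_vec)
```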
Hausdorff function
The following function is a slightly modified version from scipy-source.
The following modifications are made:
For very small input arrays I commented out the shuffling part (enable shuffling on larger arrays and try out on your real data what works best).
At least on Windows the Anaconda scipy function looks to have some performance issues (much slower than on Linux); LLVM-based Numba looks to be consistent.
Indices of the Hausdorff pair removed
Distance loop unrolled for the (N,2) case
#Modified Code from Scipy-source
#https://github.com/scipy/scipy/blob/master/scipy/spatial/_hausdorff.pyx
#Copyright (C) Tyler Reddy, Richard Gowers, and Max Linke, 2016
#Copyright © 2001, 2002 Enthought, Inc.
#All rights reserved.
#Copyright © 2003-2013 SciPy Developers.
#All rights reserved.
#Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
#Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
#Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
#disclaimer in the documentation and/or other materials provided with the distribution.
#Neither the name of Enthought nor the names of the SciPy Developers may be used to endorse or promote products derived
#from this software without specific prior written permission.
#THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
#BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
#IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
#OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
#OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
#(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@nb.njit()
def directed_hausdorff_nb(ar1, ar2):
    N1 = ar1.shape[0]
    N2 = ar2.shape[0]
    data_dims = ar1.shape[1]
    # Shuffling for very small arrays disabled
    # Enable it for larger arrays
    #resort1 = np.arange(N1)
    #resort2 = np.arange(N2)
    #np.random.shuffle(resort1)
    #np.random.shuffle(resort2)
    #ar1 = ar1[resort1]
    #ar2 = ar2[resort2]
    cmax = 0
    for i in range(N1):
        no_break_occurred = True
        cmin = np.inf
        for j in range(N2):
            # faster performance with square of distance
            # avoid sqrt until very end
            # Simplification (loop unrolling) for (n,2) arrays
            d = (ar1[i, 0] - ar2[j, 0])**2+(ar1[i, 1] - ar2[j, 1])**2
            if d < cmax: # break out of `for j` loop
                no_break_occurred = False
                break
            if d < cmin: # always true on first iteration of for-j loop
                cmin = d
        # always true on first iteration of for-j loop, after that only
        # if d >= cmax
        if cmin != np.inf and cmin > cmax and no_break_occurred == True:
            cmax = cmin
    return np.sqrt(cmax)
Calculating Hausdorff distance on subsets
@nb.njit(parallel=True)
def get_distance_mat(def_slice,lat_lon):
    Num_ids=def_slice.shape[0]-1
    out=np.empty((Num_ids,Num_ids),dtype=np.float64)
    for i in nb.prange(Num_ids):
        # rows def_slice[i] .. def_slice[i+1]-1 belong to id i
        ar1=lat_lon[def_slice[i]:def_slice[i+1],:]
        for j in range(i,Num_ids):
            ar2=lat_lon[def_slice[j]:def_slice[j+1],:]
            dist=directed_hausdorff_nb(ar1, ar2)
            # mirrors d(i -> j) into [j, i]; the directed distance is not symmetric in general
            out[i,j]=dist
            out[j,i]=dist
    return out
Example and Timings
#def_slice defines the start and end of the slices
_,def_slice=np.unique(ids,return_index=True)
def_slice=np.append(def_slice,ids.shape[0])
%timeit res_1=get_distance_mat(def_slice,lat_lon)
#1min 2s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
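The functions above assume numeric data that is already sorted by id. A minimal sketch of getting from string ids like those in the question to a `def_slice` array could look like this; the names `id_easy`, `lon_lat` and `tracks` and the shuffled row order are illustrative assumptions.

```python
import numpy as np

# unsorted sample rows: string ids plus (longitude, latitude)
id_easy = np.array(["bbb", "aaa", "ccc", "aaa", "bbb", "aaa", "ccc"])
lon_lat = np.array([
    (2.6898, 22.1166), (2.6685, 22.0701), (2.6898, 22.1166), (2.6695, 22.0716),
    (2.6951, 22.1162), (2.6696, 22.0722), (2.6951, 22.1162),
])

# stable sort keeps the original point order inside each id
order = np.argsort(id_easy, kind="stable")
ids_sorted = id_easy[order]
pts_sorted = lon_lat[order]

# first row of every id, plus the total length as the final end marker
labels, def_slice = np.unique(ids_sorted, return_index=True)
def_slice = np.append(def_slice, ids_sorted.shape[0])

# rows def_slice[k] .. def_slice[k+1]-1 belong to labels[k]
tracks = {str(label): pts_sorted[def_slice[k]:def_slice[k + 1]]
          for k, label in enumerate(labels)}
```

`pts_sorted` and `def_slice` are then in the layout that `get_distance_mat` expects, and `labels` maps matrix rows back to the original ids.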

Try using the computed distance from scipy
from scipy.spatial.distance import cdist
hausdorff_distance = cdist(df[['latitude', 'longitude']], df[['latitude', 'longitude']], lambda u, v: directed_hausdorff(u, v)[0])
hausdorff_distance_df = pd.DataFrame(hausdorff_distance)
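One caveat, as far as I can tell: `cdist` hands the callable one row at a time, whereas `directed_hausdorff` expects two 2-D point sets, and the goal is a distance per pair of ids rather than per pair of rows. A hedged per-id variant on the question's sample data might look like the sketch below (the `tracks` and `names` variables are illustrative).

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import directed_hausdorff

df = pd.DataFrame({
    "id_easy": ["aaa", "aaa", "aaa", "bbb", "bbb", "ccc", "ccc"],
    "latitude": [22.0701, 22.0716, 22.0722, 22.1166, 22.1162, 22.1166, 22.1162],
    "longitude": [2.6685, 2.6695, 2.6696, 2.6898, 2.6951, 2.6898, 2.6951],
})

# one (n_points, 2) array per id
tracks = {key: grp[["longitude", "latitude"]].to_numpy()
          for key, grp in df.groupby("id_easy")}
names = list(tracks)

dist = np.zeros((len(names), len(names)))
for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i != j:  # the diagonal stays 0
            dist[i, j] = directed_hausdorff(tracks[a], tracks[b])[0]

hausdorff_distance_df = pd.DataFrame(dist, index=names, columns=names)
```

This is still quadratic in the number of ids, so it only illustrates the shape of the computation on small inputs.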
As a note though, whatever method you end up using, it will take a lot of time to calculate, simply due to the sheer volume of the data. Ask yourself whether you genuinely need every single pair of distances.
Practically, these kinds of problems are solved by restricting the number of pairings to a manageable number. For example, slice your data frame into smaller sets, with each set restricted to a geographical area, and then find the pairwise distances within that geographical area.
The above approach is used by supermarkets to identify spots for their new stores. They are not calculating a pair of distances between every single store they own and their competitors own. First they restrict the area, which will have only 5-10 stores in total and only then they proceed to calculate the distances.
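One rough way to restrict the pairings by area is to bucket each id by the grid cell of its centroid and only pair ids that share a cell. The sketch below is an illustration under that assumption; `candidate_pairs`, the centroid values and the cell size are hypothetical, and ids that sit just across a cell border from each other are missed unless neighbouring cells are also checked.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(centroids, cell_size):
    """centroids: dict id -> (x, y). Returns the id pairs whose centroids share a grid cell."""
    cells = defaultdict(list)
    for key, (x, y) in centroids.items():
        cells[(int(x // cell_size), int(y // cell_size))].append(key)
    pairs = set()
    for members in cells.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

# approximate (longitude, latitude) centroids, made up for the example
centroids = {
    "aaa": (2.669, 22.071),
    "bbb": (2.692, 22.116),
    "ccc": (2.692, 22.116),
    "zzz": (10.0, 50.0),
}
pairs = candidate_pairs(centroids, cell_size=0.1)
```

Only the pairs returned here would then be passed to the Hausdorff computation.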

It comes down to your exact use case:
If what you want is to pre-compute values because you need ballpark answers quickly, you can 'bin' your answers. In the most rough case, imagine you have 1.2343 * 10^7 and 9.2342 * 10^18. You drop the resolution markedly and do a straight lookup. That is, you lookup '1E7 and 1E19' in your precomputed lookup table. You get the correct answer accurate to a power of 10.
If you want to characterize your distribution of answers globally, pick a statistically sufficient number of random pairs and use that instead.
If you want to graph your answers, run successive passes of more precise answers. For example, make your graph from 50x50 points equally spaced. After they are computed, if the user is still there, start filling in 250x250 resolution, and so on. There are cute tricks to deciding where to focus more computational power, just follow the user's mouse instead.
If you may want the answers for some pieces, but don't know which ones until a later computation, then make the evaluation lazy. That is, attempting to __getitem__() the value actually computes it.
If you think you actually need all the answers now, think harder. There is no problem with that many moving parts. You should find some exploitable constraint in the data that already exists, otherwise the data would already be overwhelming.
Keep thinking! Keep hacking! Keep notes.
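The lazy-evaluation idea from the list above can be sketched as a small wrapper whose `__getitem__` computes a distance the first time it is asked for and caches it afterwards. The class name `LazyDistances` and the pure-Python `directed_hausdorff_py` helper are assumptions made for this illustration.

```python
import math

def directed_hausdorff_py(u, v):
    """Directed Hausdorff distance from point list u to point list v."""
    return max(min(math.dist(p, q) for q in v) for p in u)

class LazyDistances:
    """Distance 'matrix' that computes entries only when they are requested."""

    def __init__(self, tracks, metric):
        self._tracks = tracks      # dict: id -> list of points
        self._metric = metric
        self._cache = {}
        self.computed = 0          # how many entries were really calculated

    def __getitem__(self, key):
        a, b = key
        if key not in self._cache:
            self._cache[key] = self._metric(self._tracks[a], self._tracks[b])
            self.computed += 1
        return self._cache[key]

tracks = {
    "aaa": [(2.6685, 22.0701), (2.6695, 22.0716), (2.6696, 22.0722)],
    "bbb": [(2.6898, 22.1166), (2.6951, 22.1162)],
}
lazy = LazyDistances(tracks, directed_hausdorff_py)
first = lazy["aaa", "bbb"]     # computed now
again = lazy["aaa", "bbb"]     # served from the cache
```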

Related

Is there a non brute force based solution to optimise the minimum sum of a 2D array only using 1 value from each row and column

I have a 2 arrays; one is an ordered array generated from a set of previous positions for connected points; the second is a new set of points specifying the new positions of the points. The task is to match up each old point with the best fitting new position. The differential between each set of points is stored in a new Array which is of size n*n. The objective is to find a way to map each previous point to a new point resulting in the smallest total sum. As such each old point is a row of the matrix and must match to a single column.
I have already looked into a exhaustive search. Although this works it has complexity O(n!) which is just not a valid solution.
The code below can be used to generate test data for the 2D array.
import numpy as np
def make_data():
    org = np.random.randint(5000, size=(100, 2))
    new = np.random.randint(5000, size=(100, 2))
    arr = []
    # ranges = []
    for i,j in enumerate(org):
        values = np.linalg.norm(new-j, axis=1)
        arr.append(values)
    # print(arr)
    # print(ranges)
    arr = np.array(arr)
    return arr
Here are some small examples of the array and the expected output.
Ex. 1
1 3 5
0 2 3
5 2 6
The above input should return [0,2,1] to signify that row 0 maps to column 0, row 1 to column 2 and row 2 to column 1, as the optimal solution picks the values 1, 3 and 2 (total 6).
In
The algorithm would be nice to be 100% accurate although something much quicker that is 85%+ would also be valid.

Google search terms: "weighted graph minimum matching". You can consider your array to be a weighted graph, and you're looking for a matching that minimizes edge length.
The assignment problem is a fundamental combinatorial optimization problem. It consists of finding, in a weighted bipartite graph, a matching in which the sum of weights of the edges is as large as possible. A common variant consists of finding a minimum-weight perfect matching.
https://en.wikipedia.org/wiki/Assignment_problem
The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time and which anticipated later primal-dual methods.
https://en.wikipedia.org/wiki/Hungarian_algorithm
I'm not sure whether to post the whole algorithm here; it's several paragraphs and in wikipedia markup. On the other hand I'm not sure whether leaving it out makes this a "link-only answer". If people have strong feelings either way, they can mention them in the comments.
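For a quick experiment, SciPy ships a solver for this assignment problem, `scipy.optimize.linear_sum_assignment`. A minimal sketch on Ex. 1 from the question (the variable names are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[1, 3, 5],
                 [0, 2, 3],
                 [5, 2, 6]])

# rows[k] is matched with cols[k]; the total cost is minimised
rows, cols = linear_sum_assignment(cost)
total = cost[rows, cols].sum()
```

For this matrix `cols` comes out as `[0, 2, 1]` with a total of 6, which matches the expected output in the question. The 100x100 array from `make_data()` can be passed in the same way.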

How to implement a cost minimization objective function correctly in Gurobi?

Given transport costs, per single unit of delivery, for a supermarket from three distribution centers to ten separate stores.
Note: Please look in the #data section of my code to see the data that I'm not allowed to post in photo form. ALSO note while my costs are a vector with 30 entries. Each distribution centre can only access 10 costs each. So DC1 costs = entries 1-10, DC2 costs = entries 11-20 etc..
I want to minimize the transport cost subject to each of the ten stores demand (in units of delivery).
This can be done by inspection, the minimum cost being $150313. The problem is implementing the solution with Python and Gurobi and producing the same result.
What I've tried is a somewhat sloppy model of the problem in Gurobi so far. I'm not sure how to correctly index and iterate through my sets that are required to produce a result.
This is my main problem: The objective function I define to minimize transport costs is not correct as I produce a non-answer.
The code "runs" though. If I change to maximization I just get an unbounded problem. So I feel like I am definitely not calling the correct data/iterations through sets into play.
My solution so far is quite small, so I feel like I can format it into the question and comment along the way.
from gurobipy import *
#Sets
Distro = ["DC0","DC1","DC2"]
Stores = ["S0", "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"]
D = range(len(Distro))
S = range(len(Stores))
Here I define my sets of distribution centres and set of stores. I am not sure where or how to exactly define the D and S iteration variables to get a correct answer.
#Data
Demand = [10,16,11,8,8,18,11,20,13,12]
Costs = [1992,2666,977,1761,2933,1387,2307,1814,706,1162,
2471,2023,3096,2103,712,2304,1440,2180,2925,2432,
1642,2058,1533,1102,1970,908,1372,1317,1341,776]
Just a block of my relevant data. I am not sure if my cost data should be 3 separate sets considering each distribution centre only has access to 10 costs and not 30. Or if there is a way to keep my costs as one set but make sure each centre can only access the costs relevant to itself I would not know.
m = Model("WonderMarket")
#Variables
X = {}
for d in D:
    for s in S:
        X[d,s] = m.addVar()
Declaring my objective variable. Again, I'm blindly iterating at this point to produce something that works. I've never programmed before. But I'm learning and putting as much thought into this question as possible.
#set objective
m.setObjective(quicksum(Costs[s] * X[d, s] * Demand[s] for d in D for s in S), GRB.MINIMIZE)
My objective function is attempting to multiply the cost of each delivery from a centre to a store, subject to a stores demand, then make that the smallest value possible. I do not have a non zero constraint yet. I will need one eventually?! But right now I have bigger fish to fry.
m.optimize()
I produce a 0 row, 30 column with 0 nonzero entries model that gives me a solution of 0. I need to set up my program so that I get the value that can be calculated easily by hand. I believe the issue is my general declaring of variables and low knowledge of iteration and general "what goes where" issues. A lot of thinking for just a study exercise!
Appreciate anyone who has read all the way through. Thank you for any tips or help in advance.

Your objective is 0 because you have not defined any constraints. By default all variables have a lower bound of 0, and hence minimizing an unconstrained problem puts all variables at this lower bound.
A few comments:
Unless you need the names for the distribution centers and stores, you could define them as follows:
D = 3
S = 10
Distro = range(D)
Stores = range(S)
You could define the costs as a 2-dimensional array, e.g.
Costs = [[1992,2666,977,1761,2933,1387,2307,1814,706,1162],
[2471,2023,3096,2103,712,2304,1440,2180,2925,2432],
[1642,2058,1533,1102,1970,908,1372,1317,1341,776]]
Then the cost of transportation from distribution center d to store s is stored in Costs[d][s].
You can add all variables at once and I assume you want them to be binary:
X = m.addVars(D, S, vtype=GRB.BINARY)
(or use Distro and Stores instead of D and S if you need to use the names).
Your definition of the objective function then becomes:
m.setObjective(quicksum(Costs[d][s] * X[d, s] * Demand[s] for d in Distro for s in Stores), GRB.MINIMIZE)
(This is all assuming that each store can only be delivered from one distribution center, but since your distribution centers do not have a maximal capacity this seems to be a fair assumption.)
You need constraints ensuring that the stores' demands are actually satisfied. For this it suffices to ensure that each store is being delivered from one distribution center, i.e., that for each s one X[d, s] is 1.
m.addConstrs(quicksum(X[d, s] for d in Distro) == 1 for s in Stores)
When I optimize this, I indeed get an optimal solution with value 150313.
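The value can be cross-checked without a solver. Since the distribution centers have no capacity limit, each store can simply be served by its cheapest center; the sketch below does that "by inspection" calculation in plain Python (the names `cheapest_dc` and `total_cost` are illustrative).

```python
Demand = [10, 16, 11, 8, 8, 18, 11, 20, 13, 12]
Costs = [[1992, 2666, 977, 1761, 2933, 1387, 2307, 1814, 706, 1162],
         [2471, 2023, 3096, 2103, 712, 2304, 1440, 2180, 2925, 2432],
         [1642, 2058, 1533, 1102, 1970, 908, 1372, 1317, 1341, 776]]

n_stores = len(Demand)

# for every store, the index of the distribution center with the lowest unit cost
cheapest_dc = [min(range(len(Costs)), key=lambda d: Costs[d][s])
               for s in range(n_stores)]

# unit cost of the chosen center times the store's demand, summed over stores
total_cost = sum(Costs[cheapest_dc[s]][s] * Demand[s] for s in range(n_stores))
print(total_cost)  # 150313
```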

speeding up processing 5 million rows of coordinate data

I have a csv file with two columns (latitude, longitude) that contains over 5 million rows of geolocation data.
I need to identify the points which are not within 5 miles of any other point in the list, and output everything back into another CSV that has an extra column (CloseToAnotherPoint) which is True if there is another point within 5 miles, and False if there isn't.
Here is my current solution using geopy (not making any web calls, just using the function to calculate distance):
from geopy.point import Point
from geopy.distance import vincenty
import csv
class CustomGeoPoint(object):
    def __init__(self, latitude, longitude):
        self.location = Point(latitude, longitude)
        self.close_to_another_point = False
try:
    output = open('output.csv','w')
    writer = csv.writer(output, delimiter = ',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])
    # 5 miles
    close_limit = 5
    geo_points = []
    with open('geo_input.csv', newline='') as geo_csv:
        reader = csv.reader(geo_csv)
        next(reader, None) # skip the headers
        for row in reader:
            geo_points.append(CustomGeoPoint(row[0], row[1]))
    # for every point, look at every point until one is found within 5 miles
    for geo_point in geo_points:
        for geo_point2 in geo_points:
            dist = vincenty(geo_point.location, geo_point2.location).miles
            if 0 < dist <= close_limit: # (0,close_limit]
                geo_point.close_to_another_point = True
                break
        writer.writerow([geo_point.location.latitude, geo_point.location.longitude,
                         geo_point.close_to_another_point])
finally:
    output.close()
As you might be able to tell from looking at it, this solution is extremely slow. So slow in fact that I let it run for 3 days and it still didn't finish!
I've thought about trying to split up the data into chunks (multiple CSV files or something) so that the inner loop doesn't have to look at every other point, but then I would have to figure out how to make sure the borders of each section checked against the borders of its adjacent sections, and that just seems overly complex and I'm afraid it would be more of a headache than it's worth.
So any pointers on how to make this faster?

Let's look at what you're doing.
You read all the points into a list named geo_points.
Now, can you tell me whether the list is sorted? Because if it was sorted, we definitely want to know that. Sorting is valuable information, especially when you're dealing with 5 million of anything.
You loop over all the geo_points. That's 5 million, according to you.
Within the outer loop, you loop again over all 5 million geo_points.
You compute the distance in miles between the two loop items.
If the distance is less than your threshold, you record that information on the first point, and stop the inner loop.
When the inner loop stops, you write information about the outer loop item to a CSV file.
Notice a couple of things. First, you're looping 5 million times in the outer loop. And then you're looping 5 million times in the inner loop.
This is what O(n²) means.
The next time you see someone talking about "Oh, this is O(log n) but that other thing is O(n log n)," remember this experience - you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?
Anyway, you have some problems.
Problem 1: You'll eventually wind up comparing every point against itself, which has a distance of zero. Your 0 < dist test keeps that from marking the point as close, but it also means exact duplicate points never count as neighbors of each other, and every self-comparison is wasted work.
Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance from each other, you are recording that information about point #1. But you don't record the same information about the other point. You know that point #12345 (geo_point2) is reflexively within the threshold of point #1, but you don't write that down. So you're missing a chance to just skip over 5 million comparisons.
Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop is starting from the beginning of the list every time, but you know that you have already compared the start of the list with the end of the list. You can reduce your problem space by half just by making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million).
Answers?
Consider your latitude and longitude on a flat plane. You want to know if there's a point within 5 miles. Let's think about a 10 mile square, centered on your point #1. That's a square centered on (X1, Y1), with a top left corner at (X1 - 5miles, Y1 + 5miles) and a bottom right corner at (X1 + 5miles, Y1 - 5miles). Now, if a point is within that square, it might not be within 5 miles of your point #1. But you can bet that if it's outside that square, it's more than 5 miles away.
As @SeverinPappadeaux points out, distance on a spheroid like Earth is not quite the same as distance on a flat plane. But so what? Set your square a little bigger to allow for the difference, and proceed!
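A minimal sketch of that square test, working directly in degrees. The 69 miles per degree of latitude, the 5% slack and the function name `outside_box` are assumptions of the sketch, and it ignores wrap-around at the ±180° meridian.

```python
from math import cos, radians

MILES_PER_DEGREE_LAT = 69.0   # rough figure, an assumption of this sketch

def outside_box(lat1, lon1, lat2, lon2, limit_miles=5.0, slack=1.05):
    """True if point 2 is certainly farther than limit_miles from point 1."""
    lat_span = slack * limit_miles / MILES_PER_DEGREE_LAT
    if abs(lat2 - lat1) > lat_span:
        return True
    # a degree of longitude shrinks with cos(latitude), so widen the box accordingly
    widest_lat = max(abs(lat1), abs(lat2))
    lon_span = lat_span / max(cos(radians(widest_lat)), 1e-12)
    return abs(lon2 - lon1) > lon_span

far = outside_box(40.0, -75.0, 41.0, -75.0)       # one degree of latitude apart
near = outside_box(40.0, -75.0, 40.01, -75.01)    # inside the square
```

A `False` result only means the pair still needs the accurate distance check.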
Sorted List
This is why sorting is important. If all the points were sorted by X, then Y (or Y, then X - whatever) and you knew it, you could really speed things up. Because you could simply stop scanning when the X (or Y) coordinate got too big, and you wouldn't have to go through 5 million points.
How would that work? Same way as before, except your inner loop would have some checks like this:
five_miles = ... # Whatever math, plus an error allowance!
list_len = len(geo_points) # Don't call this 5 million times
for i, pi in enumerate(geo_points):
    if pi.close_to_another_point:
        continue # Remember if close to an earlier point
    pi0max = pi[0] + five_miles
    pi1min = pi[1] - five_miles
    pi1max = pi[1] + five_miles
    for j in range(i+1, list_len):
        pj = geo_points[j]
        # Assumes geo_points is sorted on [0] then [1]
        if pj[0] > pi0max:
            # Can't possibly be close enough, nor any later points
            break
        if pj[1] < pi1min or pj[1] > pi1max:
            # Can't be close enough, but a later point might be
            continue
        # Now do "real" comparison using accurate functions.
        if ...:
            pi.close_to_another_point = True
            pj.close_to_another_point = True
            break
What am I doing there? First, I'm getting some numbers into local variables. Then I'm using enumerate to give me an i value and a reference to the outer point. (What you called geo_point). Then, I'm quickly checking to see if we already know that this point is close to another one.
If not, we'll have to scan. So I'm only scanning "later" points in the list, because I know the outer loop scans the early ones, and I definitely don't want to compare a point against itself. I'm using a few temporary variables to cache the result of computations involving the outer loop. Within the inner loop, I do some stupid comparisons against the temporaries. They can't tell me if the two points are close to each other, but I can check if they're definitely not close and skip ahead.
Finally, if the simple checks pass then go ahead and do the expensive checks. If a check actually passes, be sure to record the result on both points, so we can skip doing the second point later.
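A runnable sketch of the sorted scan, with the placeholders filled in by a haversine great-circle distance. The constants, the 5% slack and the function names are assumptions of the sketch. Unlike the outline above, it neither skips already-flagged points nor stops at the first match, because a later point may have the current point as its only close neighbor.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8   # mean radius, an assumption of this sketch
MILES_PER_DEGREE_LAT = 69.0   # rough figure, also an assumption

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    h = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))

def flag_close_points(points, limit_miles=5.0):
    """Return a list of booleans, True where a point has a neighbor within the limit."""
    order = sorted(range(len(points)), key=lambda k: points[k])  # by lat, then lon
    lat_span = 1.05 * limit_miles / MILES_PER_DEGREE_LAT         # threshold plus slack
    close = [False] * len(points)
    for a, i in enumerate(order):
        lat_i, lon_i = points[i]
        for b in range(a + 1, len(order)):
            j = order[b]
            lat_j, lon_j = points[j]
            if lat_j - lat_i > lat_span:
                break   # sorted by latitude: no later point can be close
            if 0 < haversine_miles(lat_i, lon_i, lat_j, lon_j) <= limit_miles:
                close[i] = close[j] = True
    return close

flags = flag_close_points([(40.0, -75.0), (45.0, -75.0), (40.01, -75.0)])
```

The result is indexed like the input list, so it can be written straight into the extra CSV column.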
Unsorted List
But what if the list is not sorted?
@RootTwo points you at a kD tree (where D is for "dimensional" and k in this case is "2"). The idea is really simple, if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels in the tree and comparing Y at odd levels (or vice versa). The idea would be this:
def insert_node(node, treenode, depth=0):
    dimension = depth % 2 # even/odd -> lat/long
    dn = node.coord[dimension]
    dt = treenode.coord[dimension]
    if dn < dt:
        # go left
        if treenode.left is None:
            treenode.left = node
        else:
            insert_node(node, treenode.left, depth+1)
    else:
        # go right
        if treenode.right is None:
            treenode.right = node
        else:
            insert_node(node, treenode.right, depth+1)
What would this do? This would get you a searchable tree where points could be inserted in O(log n) time. That means O(n log n) for the whole list, which is way better than n squared! (The log base 2 of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)
It also means you can do a targeted search. Since the tree is ordered, it's fairly straightforward to look for "close" points (the Wikipedia link from @RootTwo provides an algorithm).
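If a ready-made tree is acceptable, SciPy's `cKDTree` can stand in for the hand-written one. The sketch below is one possible setup, not the method from the linked answer: it maps latitude/longitude to 3-D points on a sphere (radius and function name are assumptions) and compares straight-line chord lengths, which avoids distortion from treating degrees as flat coordinates.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_MILES = 3958.8   # mean radius, an assumption of this sketch

def has_neighbor_within(lat, lon, limit_miles=5.0):
    """Boolean array: True where another point lies within limit_miles (great-circle)."""
    phi, lam = np.radians(lat), np.radians(lon)
    xyz = EARTH_RADIUS_MILES * np.column_stack(
        (np.cos(phi) * np.cos(lam), np.cos(phi) * np.sin(lam), np.sin(phi)))
    # chord length that corresponds to an arc of limit_miles
    chord = 2 * EARTH_RADIUS_MILES * np.sin(limit_miles / (2 * EARTH_RADIUS_MILES))
    tree = cKDTree(xyz)
    # k=2: the nearest hit is the point itself, the second is its nearest neighbor
    dist, _ = tree.query(xyz, k=2)
    # note: exact duplicates (distance 0) count as neighbors here,
    # unlike the `0 < dist` test in the question
    return dist[:, 1] <= chord

lat = np.array([40.0, 40.01, 45.0])
lon = np.array([-75.0, -75.0, -75.0])
result = has_neighbor_within(lat, lon)
```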
Advice
My advice is to just write code to sort the list, if needed. It's easier to write, and easier to check by hand, and it's a separate pass you will only need to make one time.
Once you have the list sorted, try the approach I showed above. It's close to what you were doing, and it should be easy for you to understand and code.

As the answer to Python calculate lots of distances quickly points out, this is a classic use case for k-D trees.
An alternative is to use a sweep line algorithm, as shown in the answer to How do I match similar coordinates using Python?
Here's the sweep line algorithm adapted for your questions. On my laptop, it takes < 5 minutes to run through 5M random points.
import itertools as it
import operator as op
import sortedcontainers # handy library on Pypi
import time
from collections import namedtuple
from math import cos, degrees, pi, radians, sqrt
from random import sample, uniform
Point = namedtuple("Point", "lat long has_close_neighbor")
miles_per_degree = 69
number_of_points = 5000000
data = [Point(uniform( -88.0, 88.0), # lat
uniform(-180.0, 180.0), # long
True
)
for _ in range(number_of_points)
]
start = time.time()
# Note: lat is first in Point, so data is sorted by .lat then .long.
data.sort()
print(time.time() - start)
# Parameter that determines the size of a sliding latitude window
# and therefore how close two points need to be to get flagged.
threshold = 5.0 # miles
lat_span = threshold / miles_per_degree
coarse_threshold = (.98 * threshold)**2
# Sliding latitude window. Within the window, observations are
# ordered by longitude.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('long'))
# lag_pt is the 'southernmost' point within the sliding window.
point = iter(data)
lag_pt = next(point)
milepost = len(data)//10
# lead_pt is the 'northernmost' point in the sliding window.
for i, lead_pt in enumerate(data):
    if i == milepost:
        print('.', end=' ')
        milepost += len(data)//10
    # Dec of lead_obs represents the leading edge of window.
    window.add(lead_pt)
    # Remove observations further than the trailing edge of window.
    while lead_pt.lat - lag_pt.lat > lat_span:
        window.discard(lag_pt)
        lag_pt = next(point)
    # Calculate 'east-west' width of window_size at dec of lead_obs
    long_span = lat_span / cos(radians(lead_pt.lat))
    east_long = lead_pt.long + long_span
    west_long = lead_pt.long - long_span
    # Check all observations in the sliding window within
    # long_span of lead_pt.
    for other_pt in window.irange_key(west_long, east_long):
        if other_pt != lead_pt:
            # lead_pt is at the top center of a box 2 * long_span wide by
            # 1 * long_span tall. other_pt is in that box. If desired,
            # put additional fine-grained 'closeness' tests here.
            # coarse check if any pts within 80% of threshold distance
            # then don't need to check distance to any more neighbors
            average_lat = (other_pt.lat + lead_pt.lat) / 2
            delta_lat = other_pt.lat - lead_pt.lat
            delta_long = (other_pt.long - lead_pt.long)/cos(radians(average_lat))
            if delta_lat**2 + delta_long**2 <= coarse_threshold:
                break
            # put vincenty test here
            #if 0 < vincenty(lead_pt, other_pt).miles <= close_limit:
            #    break
    else:
        data[i] = data[i]._replace(has_close_neighbor=False)
print()
print(time.time() - start)

If you sort the list by latitude (n log(n)), and the points are roughly evenly distributed, it will bring it down to about 1000 points within 5 miles for each point (napkin math, not exact). By only looking at the points that are near in latitude, the runtime goes from n^2 to n*log(n)+.0004n^2. Hopefully this speeds it up enough.

I would give pandas a try. Pandas is made for efficient handling of large amounts of data. That may help with the efficiency of the csv portion anyhow. But from the sounds of it, you've got yourself an inherently inefficient problem to solve. You take point 1 and compare it against 4,999,999 other points. Then you take point 2 and compare it with 4,999,998 other points and so on. Do the math. That's 12.5 trillion comparisons you're doing. If you can do 1,000,000 comparisons per second, that's 144 days of computation. If you can do 10,000,000 comparisons per second, that's 14 days. For just additions in straight python, 10,000,000 operations can take something like 1.1 seconds, but I doubt your comparisons are as fast as an add operation. So give it at least a fortnight or two.
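The arithmetic behind those estimates, as a short sketch (the two comparison rates are the ones assumed in the paragraph above):

```python
n = 5_000_000
pairs = n * (n - 1) // 2          # every unordered pair of points, about 12.5 trillion

days_by_rate = {}
for rate in (1_000_000, 10_000_000):      # comparisons per second
    days_by_rate[rate] = pairs / rate / 86_400
    print(f"{rate:>10,} comparisons/s -> {days_by_rate[rate]:.1f} days")
```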
Alternately, you could come up with an alternate algorithm, though I don't have any particular one in mind.

I would redo algorithm in three steps:
Use great-circle distance, and assume 1% error so make limit equal to 1.01*limit.
Code great-circle distance as inlined function, this test should be fast
You'll get some false positives, which you could further test with vincenty

A better solution builds on Oscar Smith's answer. You have a CSV file, so just sort it in Excel (it is very efficient). Then use binary search in your program to find the cities within 5 miles (you can make a small change to the binary search method so that it breaks as soon as it finds one city satisfying your condition).
Another improvement is to use a map to remember the pair of cities when you find that one city is within range of another. For example, when you find city A is within 5 miles of city B, use a Map to store the pair (B is the key and A is the value). The next time you meet B, look it up in the Map first; if it has a corresponding value, you do not need to check it again. This may use more memory, so keep an eye on it. Hope it helps you.
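A minimal sketch of the binary-search step using Python's `bisect` module: given latitudes sorted in ascending order, it returns the index range of points whose latitude lies within a given span of the query. The function name and the sample values are illustrative, and the span of 0.0725 degrees assumes roughly 69 miles per degree.

```python
from bisect import bisect_left, bisect_right

def latitude_window(sorted_lats, lat, span):
    """Index range [lo, hi) of entries with lat - span <= value <= lat + span."""
    lo = bisect_left(sorted_lats, lat - span)
    hi = bisect_right(sorted_lats, lat + span)
    return lo, hi

sorted_lats = [10.0, 39.95, 40.0, 40.05, 41.0]
window = latitude_window(sorted_lats, 40.0, 0.0725)   # about 5 miles of latitude
```

Only the points inside the returned range need the accurate distance check.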

This is just a first pass, but I've sped it up by half so far by using great_circle() instead of vincenty(), and cleaning up a couple of other things. The difference is explained here, and the loss in accuracy is about 0.17%:
from geopy.point import Point
from geopy.distance import great_circle
import csv
CLOSE_LIMIT = 5 # miles; assumed here to be the question's threshold
class CustomGeoPoint(Point):
    def __init__(self, latitude, longitude):
        super(CustomGeoPoint, self).__init__(latitude, longitude)
        self.close_to_another_point = False
def isCloseToAnother(pointA, points):
    for pointB in points:
        dist = great_circle(pointA, pointB).miles
        if 0 < dist <= CLOSE_LIMIT: # (0, close_limit]
            return True
    return False
with open('geo_input.csv', 'r') as geo_csv:
    reader = csv.reader(geo_csv)
    next(reader, None) # skip the headers
    geo_points = sorted(map(lambda x: CustomGeoPoint(x[0], x[1]), reader))
with open('output.csv', 'w') as output:
    writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])
    # for every point, look at every point until one is found within CLOSE_LIMIT miles
    for point in geo_points:
        point.close_to_another_point = isCloseToAnother(point, geo_points)
        writer.writerow([point.latitude, point.longitude,
                         point.close_to_another_point])
I'm going to improve this further.
Before:
$ time python geo.py
real 0m5.765s
user 0m5.675s
sys 0m0.048s
After:
$ time python geo.py
real 0m2.816s
user 0m2.716s
sys 0m0.041s

This problem can be solved with a VP tree. These allow querying data
with distances that are a metric obeying the triangle inequality.
The big advantage of VP trees over a k-D tree is that they can be blindly
applied to geographic data anywhere in the world without having to worry
about projecting it to a suitable 2D space. In addition a true geodesic
distance can be used (no need to worry about the differences between
geodesic distances and distances in the projection).
Here's my test: generate 5 million points randomly and uniformly on the
world. Put these into a VP tree.
Looping over all the points, query the VP tree to find any neighbor a
distance in (0km, 10km] away. (0km is not included in this set to avoid
the query point being found.) Count the number of points with no such
neighbor (which is 229573 in my case).
Cost of setting up the VP tree = 5000000 * 20 distance calculations.
Cost of the queries = 5000000 * 23 distance calculations.
Time for setup and queries is 5m 7s.
I am using C++ with GeographicLib for calculating distances, but
the algorithm can of course be implemented in any language and here's
the python version of GeographicLib.
ADDENDUM: The C++ code implementing this approach is given here.
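For readers who want to see the idea in Python, here is a small sketch of a VP tree with a "any neighbor within (0, r]" query. It is not the C++ code referred to above: it uses a haversine distance on a sphere instead of GeographicLib's geodesic distance, and the function names, the tuple-based node layout and the Earth radius are all assumptions made for illustration.

```python
import random
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def dist_km(p, q):
    """Great-circle (haversine) distance in km; p and q are (lat, lon) in degrees."""
    phi1, phi2 = radians(p[0]), radians(q[0])
    dlam = radians(q[1] - p[1])
    h = sin((phi2 - phi1) / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, h)))


def build(points):
    """Build a VP tree as nested tuples (vantage, radius, inside, outside)."""
    points = list(points)
    if not points:
        return None
    vantage = points.pop(random.randrange(len(points)))
    if not points:
        return (vantage, 0.0, None, None)
    dists = sorted((dist_km(vantage, p), p) for p in points)
    mid = len(dists) // 2
    radius = dists[mid][0]                   # median distance to the vantage point
    inside = [p for _, p in dists[:mid]]     # all at distance <= radius
    outside = [p for _, p in dists[mid:]]    # all at distance >= radius
    return (vantage, radius, build(inside), build(outside))


def has_neighbor(node, q, r):
    """True if some point in the tree lies at a distance in (0, r] from q."""
    if node is None:
        return False
    vantage, radius, inside, outside = node
    d = dist_km(q, vantage)
    if 0 < d <= r:
        return True
    # Triangle inequality: inside points are at least d - radius away from q.
    if d - r <= radius and has_neighbor(inside, q, r):
        return True
    # Triangle inequality: outside points are at least radius - d away from q.
    if d + r >= radius and has_neighbor(outside, q, r):
        return True
    return False
```

Counting the points with no neighbor then amounts to building the tree once and calling has_neighbor(tree, p, 10.0) for every point p. The two pruning tests are where the triangle inequality pays off: whole subtrees are skipped without computing any distance to their points.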

Speeding up distance between all possible pairs in an array

I have an array of x,y,z coordinates of several (~10^10) points (only 5 shown here)
a= [[ 34.45 14.13 2.17]
[ 32.38 24.43 23.12]
[ 33.19 3.28 39.02]
[ 36.34 27.17 31.61]
[ 37.81 29.17 29.94]]
I want to make a new array with only those points which are at least some distance d away from all other points in the list. I wrote a code using while loop,
import numpy as np
from scipy.spatial import distance
d=0.1 #or some distance
i=0
selected_points=[]
while i < len(a):
    interdist=[]
    j=i+1
    while j<len(a):
        interdist.append(distance.euclidean(a[i],a[j]))
        j+=1
    if all(dis >= d for dis in interdist):
        np.array(selected_points.append(a[i]))
    i+=1
This works, but it is taking really long to perform this calculation. I read somewhere that while loops are very slow.
I was wondering if anyone has any suggestions on how to speed up this calculation.
EDIT: My objective of finding the particles that are at least some distance away from all the others stays the same, but I just realized that there is a serious flaw in my code. Say I have 3 particles. On the first iteration of i, the code calculates the distances 1->2 and 1->3; if 1->2 is less than the threshold distance d, it throws away particle 1. On the next iteration of i, it only calculates 2->3; if that is greater than d, it keeps particle 2. But this is wrong, since 2 should be discarded along with particle 1. The solution by @svohara is the correct one!

For big data sets and low-dimensional points (such as your 3-dimensional data), sometimes there is a big benefit to using a spatial indexing method. One popular choice for low-dimensional data is the k-d tree.
The strategy is to index the data set. Then query the index using the same data set, to return the 2-nearest neighbors for each point. The first nearest neighbor is always the point itself (with dist=0), so we really want to know how far away the next closest point is (2nd nearest neighbor). For those points where the 2-NN is > threshold, you have the result.
from scipy.spatial import cKDTree as KDTree
import numpy as np
#a is the big data as numpy array N rows by 3 cols
a = np.random.randn(10**8, 3).astype('float32')
# This will create the index, prepare to wait...
# NOTE: took 7 minutes on my mac laptop with 10^8 rand 3-d numbers
# there are some parameters that could be tweaked for faster indexing,
# and there are implementations (not in scipy) that can construct
# the kd-tree using parallel computing strategies (GPUs, e.g.)
k = KDTree(a)
#ask for the 2-nearest neighbors by querying the index with the
# same points
(dists, idxs) = k.query(a, 2)
# (dists, idxs) = k.query(a, 2, n_jobs=4) # to use more CPUs on query...
# (newer SciPy releases name this parameter `workers` instead of `n_jobs`)
#Note: 9 minutes for query on my laptop, 2 minutes with n_jobs=6
# So less than 10 minutes total for 10^8 points.
# If the second NN is > thresh distance, then there is no other point
# in the data set closer.
thresh_d = 0.1 #some threshold, equiv to 'd' in O.P.'s code
d_slice = dists[:, 1] #distances to second NN for each point
res = np.flatnonzero( d_slice >= thresh_d )

Here's a vectorized approach using distance.pdist -
# Store number of pts (number of rows in a)
m = a.shape[0]
# Get the first of pairwise indices formed with the pairs of rows from a
# Simpler version, but a bit slow : idx1,_ = np.triu_indices(m,1)
shifts_arr = np.zeros(m*(m-1)//2,dtype=int) # integer division, needed on Python 3
shifts_arr[np.arange(m-1,1,-1).cumsum()] = 1
idx1 = shifts_arr.cumsum()
# Get the first-row IDs of pairs of rows that are less than "d" apart and thus
# select the rest of the rows using a boolean mask created with np.in1d for the
# entire range of number of rows in a. Index into a to get the selected points.
selected_pts = a[~np.in1d(np.arange(m),idx1[distance.pdist(a) < d])]
For a huge dataset of around 10^10 points, we might have to perform the operations in chunks based on the system memory available.
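The mask above only uses the first index of each close pair, which mirrors the loop in the question. If both members of a close pair should be discarded, as the question's EDIT asks, a variant along these lines might work. It is a sketch using the simpler np.triu_indices form, and the function name is made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist


def isolated_points(a, d):
    """Return the rows of `a` that are at least `d` away from every other row."""
    m = a.shape[0]
    # Row indices of every pair, in the same order as pdist's condensed output.
    i, j = np.triu_indices(m, 1)
    close = pdist(a) < d
    too_close = np.zeros(m, dtype=bool)
    too_close[i[close]] = True   # discard both members of each close pair
    too_close[j[close]] = True
    return a[~too_close]
```

Like the original, this materializes all m*(m-1)/2 distances, so it is only practical for moderate m or for chunks of the data.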

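Your algorithm is quadratic (about 10^20 operations). Here is a linear approach, provided the distribution is nearly random.
Split your space into cubic boxes of side d/sqrt(3) (volume (d/sqrt(3))^3), so that any two points in the same box are less than d apart. Put each point in its box.
Then for each box:
if there is just one point, you only have to calculate its distance to the points in a small neighborhood of boxes;
otherwise there is nothing to do, since all points in that box are too close to each other and are discarded.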
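The box idea could be sketched like this. It is a plain-Python illustration, with a hypothetical function name, that assumes 3-D points given as tuples and Python 3.8+ for math.dist. With boxes of side d/sqrt(3), any point closer than d lies at most two boxes away along each axis, so the neighborhood checked is the surrounding 5x5x5 block.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor, sqrt


def isolated_points_grid(points, d):
    """Return the points that are at least `d` away from all other points."""
    side = d / sqrt(3)  # cube diagonal is d, so points sharing a box are closer than d
    boxes = defaultdict(list)
    for p in points:
        boxes[tuple(floor(c / side) for c in p)].append(p)
    # Offsets to every other box in the surrounding 5x5x5 block.
    offsets = [o for o in product(range(-2, 3), repeat=3) if o != (0, 0, 0)]
    keep = []
    for key, members in boxes.items():
        if len(members) > 1:
            continue  # all members are too close to each other: nothing to do
        p = members[0]
        neighbors = (q for o in offsets
                     for q in boxes.get(tuple(k + s for k, s in zip(key, o)), ()))
        if all(dist(p, q) >= d for q in neighbors):
            keep.append(p)
    return keep
```

For a roughly uniform distribution each box holds only a few points, so the work per point is bounded by a constant and the whole pass is linear in the number of points.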

Drop the append, it must be really slow. You can preallocate a fixed-size array of distances and use [] to put each number in the right position.
Use min instead of all. You only need to check whether the minimum distance is bigger than x.
Actually, you can break out of the loop the moment you find a distance smaller than your limit, and then you can drop both points. That way you do not even have to save any distances (unless you need them later).
Since d(a,b)=d(b,a), you can run the inner loop only over the following points and skip the distances you have already calculated. If you need them, you can look them up in the array.
From your comment, I believe this would do, if you have no repeated points.
from scipy.spatial import distance
def select_points(a, d):
    selected_points = []
    for i, p1 in enumerate(a):
        save_point = True
        for j, p2 in enumerate(a):
            # compare indices: p1 != p2 is ambiguous for NumPy rows
            if i != j and distance.euclidean(p1, p2) < d:
                save_point = False
                break
        if save_point:
            selected_points.append(p1)
    return selected_points
In the end I check both a,b and b,a because you should not modify a list while processing it, but you can be smarter using some additional variables.

Np.cross produces wrong results, search for a working alternative

I am rewriting an analysis code for Molecular Dynamics time series. Due to the huge amount of time steps (150 000 for each simulation run) which have to be analysed, it is very important that my code is as fast as possible.
The old code is very slow (actually it requires 300 to 500 times more time compared to my one) because it was written for the analysis of a few thousand PDB files and not a bunch full of different simulations (around 60), each one having 150 000 time steps. I know that C or Fortran would be the swiss army knife in this case but my experience with c is .....
Therefore I am trying to use as much as possible numpy/scipy routines for my python code. Because I've a license for the accelerated distribution of anaconda with the mkl, this is a really significant speedup.
Now I am facing a problem, and I hope I can explain it clearly.
I have three arrays, each with a shape of (n, 3, 20). The first axis holds the residues of my peptide, commonly around 23 to 31. The second axis holds the coordinates in the order xyz, and the third axis holds some specific time steps.
Now I am calculating the torsion for each residue at each time step. My code for the case of arrays with shape (n,3,1) is:
def fast_torsion(d1, d2, d3):
    tt = dot(d1, np.cross(d2, d3))
    tb = dot(d1, d1) * dot(d2, d2)
    torsion = np.zeros([len(d1), 1])
    for i in xrange(len(d1)):
        if tb[i] != 0:
            torsion[i] = tt[i]/tb[i]
    return torsion
Now I tried to use the same code for the arrays with the extended third axis, but the cross product function produces wrong values compared to the original slow code, which uses a for loop.
When I tried this code with the big arrays, it was around 10 to 20 times faster than a for loop solution and around 200 times faster than the old code.
What I want is for np.cross() to compute the cross product only over the second (xyz) axis and iterate over the other two axes. With the short third axis it works fine, but with the big arrays it only works for the first time step. I also tried the axis settings, but without success.
I can also use Cython or numba if this is the only solution for my problem.
P.S. Sorry for my english I hope you can understand everything.

np.cross has axisa, axisb and axisc keyword arguments to select where in the input and output arrays the vectors to be cross multiplied are. I think you want to use:
np.cross(d2, d3, axisa=1, axisb=1, axisc=1)
If you don't include the axisc=1, the result of the multiplication will be at the end of the output array.
Also, you can avoid explicitly looping over your torsion array by doing:
torsion = np.zeros((len(d1), 1))
idx = (tb !=0)
torsion[idx] = tt[idx] / tb[idx]
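Putting the two suggestions together for arrays of shape (n, 3, t) might look like the sketch below. It assumes that the dot products in the question are meant as per-vector dot products over the xyz axis, which is why they are written as an explicit sum over axis 1. The single axis=1 argument of np.cross is shorthand for setting axisa, axisb and axisc to 1.

```python
import numpy as np


def fast_torsion(d1, d2, d3):
    """d1, d2, d3: arrays of shape (n, 3, t); returns an array of shape (n, t)."""
    # Cross product over the xyz axis only, for every residue and time step.
    cross = np.cross(d2, d3, axis=1)
    tt = np.sum(d1 * cross, axis=1)                         # d1 . (d2 x d3)
    tb = np.sum(d1 * d1, axis=1) * np.sum(d2 * d2, axis=1)  # |d1|^2 * |d2|^2
    torsion = np.zeros(tt.shape, dtype=float)
    idx = tb != 0                                           # avoid division by zero
    torsion[idx] = tt[idx] / tb[idx]
    return torsion
```

The result has one value per residue and time step, so no Python-level loop over either axis is needed.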
