So, like the title says, I need help mapping points from a 2D plane to a number line in such a way that each point is associated with a unique positive integer. Put another way, I need a function f:ZxZ->Z+ and I need f to be injective. Additionally, I need it to run in a reasonable time.
The way I've thought about doing this is to basically just count points, starting at (1,1) and spiraling outwards.
Below I've written some Python code to do this for some point (i,j):
def plot_to_int(i, j):
    a = max(abs(i), abs(j))   # we want to find which "square" (ring) we are in
    b = (a - 1) ** 2          # we can start the count from the last square
    J = abs(j)
    I = abs(i)
    if i > 0 and j > 0:       # the first quadrant
        # we start counting anticlockwise
        if I > J:
            b += J
            # we start from the edge and count up along j
        else:
            b += J + (J - I)
            # when we turn the corner, we add to the count, increasing as i decreases
    elif i < 0 and j > 0:     # the second quadrant
        b += 2 * a - 1        # the total count from the first quadrant
        if J > I:
            b += I
        else:
            b += I + (I - J)
    elif i < 0 and j < 0:     # the third quadrant
        b += (2 * a - 1) * 2  # the count from the first two quadrants
        if I > J:
            b += J
        else:
            b += J + (J - I)
    else:                     # the fourth quadrant
        b += (2 * a - 1) * 3
        if J > I:
            b += I
        else:
            b += I + (I - J)
    return b
I'm pretty sure this works, but as you can see it's quite a bulky function. I'm trying to think of some way to simplify this "spiral counting" logic. Or, if there's another counting method that is simpler to code, that would work too.
Here's a half-baked idea:
1. For every point, calculate f = x + (y-y_min)/(y_max-y_min).
2. Find the smallest delta d between any given f_n and f_{n+1}. Multiply all the f values by 1/d so that all f values are at least 1 apart.
3. Take the floor() of all the f values.
This is sort of like a projection onto the x-axis, but it tries to spread out the values so that it preserves uniqueness.
UPDATE:
If you don't know all the data and will need to feed in new data in the future, maybe there's a way to hardcode an arbitrarily large or small constant for y_max and y_min in step 1, and an arbitrary delta d for step 2 according to the boundaries of the data values you expect. Or a way to calculate values for these according to the limits of the floating point arithmetic.
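For reference, here is a sketch of a standard alternative to the spiral that avoids the case analysis entirely: fold each signed coordinate onto the non-negative integers, then combine the two with the Cantor pairing function. This is a well-known construction (the helper names below are mine), it is injective, and it runs in O(1):

def fold(n):
    """Bijection from Z to {0, 1, 2, ...}: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * n if n >= 0 else -2 * n - 1

def plot_to_int(i, j):
    """Injective (in fact bijective) map from ZxZ to the positive integers."""
    a, b = fold(i), fold(j)
    # Cantor pairing: a bijection from NxN to N, shifted by 1 to land in Z+.
    return (a + b) * (a + b + 1) // 2 + b + 1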
I'm currently writing a program that solves the 8-puzzle game with an A* search algorithm in Python. However, when I time my code, I find that get_manhattan_distance takes a very long time.
I ran my code with cProfile for Python, and the results are below what is printed out by the program. Here is a gist for my issue.
I've already made my program more efficient by using NumPy arrays instead of Python lists for copying. I don't quite know how to make this step more efficient. My current code for get_manhattan_distance is:
def get_manhattan(self):
    """Returns the Manhattan heuristic for this board

    Will attempt to use the cached Manhattan value for speed, but if it hasn't
    already been calculated, then it will need to calculate it (which is
    extremely costly!).
    """
    if self.cached_manhattan != -1:
        return self.cached_manhattan

    # Set the value to zero, so we can add elements based off them being out of
    # place.
    self.cached_manhattan = 0
    for r in range(self.get_dimension()):
        for c in range(self.get_dimension()):
            if self.board[r][c] != 0:
                num = self.board[r][c]
                # Solves for what row and column this number should be in.
                correct_row, correct_col = np.divmod(num - 1, self.get_dimension())
                # Adds the Manhattan distance from its current position to its correct
                # position.
                manhattan_dist = abs(correct_col - c) + abs(correct_row - r)
                self.cached_manhattan += manhattan_dist
    return self.cached_manhattan
The idea behind this is that the goal puzzle for a 3x3 grid is the following:
1 2 3
4 5 6
7 8
Where there is a blank tile (the blank tile is represented by a 0 in the int array). So, if we have the puzzle:
3 2 1
4 6 5
7 8
It should have a Manhattan distance of 6. This is because, 3 is two places away from where it should be. 1 is two places away from where it should be. 5 is one place away from where it should be, and 6 is one place away from where it should be. Hence, 2 + 2 + 1 + 1 = 6.
Unfortunately, this calculation takes a very long time because there are hundreds of thousands of different boards. Is there any way to speed this calculation up?
It looks to me like you should only need to calculate the full Manhattan distance sum for an entire board once - for the first board. After that, you're creating new Board entities from existing ones by swapping two adjacent numbers. The total Manhattan distance on the new board will differ only by the sum of changes in Manhattan distance for these two numbers.
If one of the numbers is the blank (0), then the total distance changes by minus one or one depending on whether the non-blank number moved closer to its proper place or farther from it. If both of the numbers are non-blank, as when you're making "twins", the total distance changes by minus two, zero, or two.
Here's what I would do: add a manhattan_distance = None argument to Board.__init__. If this is not given, calculate the board's total Manhattan distance; otherwise simply store the given distance. Create your first board without this argument. When you create a new board from an existing one, calculate the change in the total distance and pass the result in to the new board. (The cached_manhattan becomes irrelevant.)
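A rough sketch of that incremental idea (the Board constructor argument and the helper name here are illustrative guesses based on the question, not the asker's actual code):

def make_child(self, blank_r, blank_c, tile_r, tile_c):
    """Create a child board by sliding the tile at (tile_r, tile_c) into the blank.

    Sketch only: assumes self.board is a NumPy array, self.get_dimension()
    returns the board width, and Board accepts a manhattan_distance argument
    as suggested above.
    """
    n = self.get_dimension()
    num = self.board[tile_r][tile_c]
    correct_row, correct_col = divmod(num - 1, n)

    # Only the moved tile's contribution to the total changes (the blank counts as 0).
    before = abs(correct_row - tile_r) + abs(correct_col - tile_c)
    after = abs(correct_row - blank_r) + abs(correct_col - blank_c)

    new_board = self.board.copy()
    new_board[blank_r][blank_c] = num
    new_board[tile_r][tile_c] = 0

    # Pass the updated total instead of recomputing it from scratch.
    return Board(new_board, manhattan_distance=self.manhattan_distance + after - before)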
This should reduce the total number of calculations involved with distance by quite a bit - I'd expect it to speed things up by several times, more the larger your board size.
I'm working on a Project Euler problem (number 15, lattice paths). I've solved the problem another way, but I'm curious how to optimize the algorithm I initially used to try to solve it, because it grows very quickly and I'm kind of surprised at how long it actually takes. So I'm really looking to learn how to analyze and continue to optimize the algorithm.
This algorithm's approach is to use the corners as points - (0,0) in the top left, (2,2) in the bottom right for a 2x2 grid. From the top point, the path will only be x+1 or y+1. So I pretty much iteratively form these paths by checking if the next allowable move exists in the space of points in the grid.
I initially started from the top left (x+1, y+1), but found it more efficient to go backwards from the bottom, removed some redundancies, and started to store only the valuable data in memory. So that's where I am now. Can it be optimized any further? And what other types of applications would this have?
givenPoints is a list of all the points in the grid, each stored as a string, e.g. '0202'. The algorithm stores the most recent point of each unique path, as opposed to the whole path, and at the end the number of entries in the list is equivalent to the number of unique paths.
import math
import time

def calcPaths4(givenPoints):
    paths = []
    paths.append(givenPoints[-1])
    dims = int(math.sqrt(len(givenPoints))) - 1
    numMoves = 2*dims
    numPaths = 0
    for x in range(0, numMoves):
        t0 = time.clock()
        newPaths = []
        for i in paths:
            origin = int(i)
            dest1 = origin - 1    # step back along one axis
            dest3 = origin - 100  # step back along the other axis
            if ('%04d' % dest1) in givenPoints:
                newPaths.append(('%04d' % dest1))
                numPaths += 1
            if ('%04d' % dest3) in givenPoints:
                newPaths.append(('%04d' % dest3))
                numPaths += 1
        t = time.clock() - t0
        paths = newPaths
        print(str(x) + ": " + str(t) + ": " + str(len(paths)))
    return(paths)
You've got the wrong approach. Starting from the top left going to the bottom right corner takes 20 moves to the right and 20 moves down.
So you can think of any path as a sequence of length 40 with 20 elements that are right and 20 elements that are down. You simply have to count how many arrangements there are.
Once you have fixed, say, the right moves, the down ones are fixed, so the whole problem reduces to: in how many ways can you choose 20 positions from a set of 40?
This is simply solved by the binomial coefficient.
Hence a solution is:
from math import factorial
def number_of_paths(k):
    """Number of paths from the top left to the bottom right corner in a kxk grid."""
    return factorial(2*k)//(factorial(k)**2)
Which can be made more efficient by noting that, with n = 2k, n!/(k!*k!) = (n·(n-1)···(k+1))/k!:
import operator as op
from functools import reduce
from math import factorial

def number_of_paths(k):
    """Number of paths from the top left to the bottom right corner in a kxk grid."""
    return reduce(op.mul, range(2*k, k, -1), 1)//factorial(k)
Note that the number of paths grows rapidly, which means any algorithm that works by creating the different paths is going to be slow. The only way to seriously "optimize" this is to change approach and avoid creating the paths but just counting them.
I'll point out a different, more general, approach: recursion and memoization/dynamic programming.
When the path is at a certain position (x,y) it can either go right to (x-1,y) or go down to (x, y-1). So the number of paths from that point to the bottom right is the sum of the number of paths that reach the bottom right from (x-1,y) and those that reach the bottom right from (x, y-1):
Base case is when you are on the edge, i.e. x==0 or y==0.
def number_of_paths(x, y):
    if not x or not y:
        return 1
    return number_of_paths(x-1, y) + number_of_paths(x, y-1)
This solution follows your reasoning, but it only keeps track of the number of paths. You can see that again it is very inefficient.
The problem is that when we try to compute number_of_paths(x, y)
we end up doing the following steps:
- Compute number_of_paths(x-1, y)
  - This is done by computing number_of_paths(x-2, y) and number_of_paths(x-1, y-1)
- Compute number_of_paths(x, y-1)
  - This is done by computing number_of_paths(x-1, y-1) and number_of_paths(x, y-2)
Note how number_of_paths(x-1, y-1) is computed twice. But the result is obviously the same! So we can just compute it the first time, and the next time we see that call we return the already known result:
def number_of_paths(x, y, table=None):
    table = table if table is not None else {(0,0): 1}
    try:
        # first look if we already computed this:
        return table[x,y]
    except KeyError:
        # okay we didn't compute it, so we do it now:
        if not x or not y:
            result = table[x,y] = 1
        else:
            result = table[x,y] = number_of_paths(x-1, y, table) + number_of_paths(x, y-1, table)
        return result
And now this executes pretty fast:
>>> number_of_paths(20,20)
137846528820
You could think "performing a call twice isn't a big deal", but you have to take into account that if the call for (x-1,y-1) is computed twice, then each of those makes two calls involving (x-2, y-2), thus resulting in computing (x-2, y-2) four times. And then (x-3, y-3) eight times, ... and then (x-20, y-20) 1048576 times!
Alternatively we could have built a kxk matrix and filled it from the bottom right:
def number_of_paths(x, y):
    # table[i][j] holds the number of paths from intersection (i, j)
    # to the bottom-right corner (x, y).
    table = [[0]*(y+1) for _ in range(x+1)]
    table[-1][-1] = 1
    for i in reversed(range(x+1)):
        for j in reversed(range(y+1)):
            if i == x or j == y:
                table[i][j] = 1
            else:
                table[i][j] = table[i+1][j] + table[i][j+1]
    return table[0][0]
Note that here the table represents the intersections so we end up with a +1 in the sizes.
This technique of storing previous calls to reuse them later is called memoization. A more general principle is dynamic programming, where you basically reduce the problem to filling a tabular data structure, as we did here using recursion and memoization, and then you "backtrack" over the cells, using the pointers or values you filled earlier, to obtain a solution to the original problem.
I have a csv file with two columns (latitude, longitude) that contains over 5 million rows of geolocation data.
I need to identify the points which are not within 5 miles of any other point in the list, and output everything back into another CSV that has an extra column (CloseToAnotherPoint) which is True if there is another point within 5 miles, and False if there isn't.
Here is my current solution using geopy (not making any web calls, just using the function to calculate distance):
from geopy.point import Point
from geopy.distance import vincenty
import csv

class CustomGeoPoint(object):
    def __init__(self, latitude, longitude):
        self.location = Point(latitude, longitude)
        self.close_to_another_point = False

try:
    output = open('output.csv', 'w')
    writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

    # 5 miles
    close_limit = 5

    geo_points = []
    with open('geo_input.csv', newline='') as geo_csv:
        reader = csv.reader(geo_csv)
        next(reader, None)  # skip the headers
        for row in reader:
            geo_points.append(CustomGeoPoint(row[0], row[1]))

    # for every point, look at every point until one is found within 5 miles
    for geo_point in geo_points:
        for geo_point2 in geo_points:
            dist = vincenty(geo_point.location, geo_point2.location).miles
            if 0 < dist <= close_limit:  # (0, close_limit]
                geo_point.close_to_another_point = True
                break
        writer.writerow([geo_point.location.latitude, geo_point.location.longitude,
                         geo_point.close_to_another_point])
finally:
    output.close()
As you might be able to tell from looking at it, this solution is extremely slow. So slow in fact that I let it run for 3 days and it still didn't finish!
I've thought about trying to split up the data into chunks (multiple CSV files or something) so that the inner loop doesn't have to look at every other point, but then I would have to figure out how to make sure the borders of each section checked against the borders of its adjacent sections, and that just seems overly complex and I'm afraid it would be more of a headache than it's worth.
So any pointers on how to make this faster?
Let's look at what you're doing.
You read all the points into a list named geo_points.
Now, can you tell me whether the list is sorted? Because if it was sorted, we definitely want to know that. Sorting is valuable information, especially when you're dealing with 5 million of anything.
You loop over all the geo_points. That's 5 million, according to you.
Within the outer loop, you loop again over all 5 million geo_points.
You compute the distance in miles between the two loop items.
If the distance is less than your threshold, you record that information on the first point, and stop the inner loop.
When the inner loop stops, you write information about the outer loop item to a CSV file.
Notice a couple of things. First, you're looping 5 million times in the outer loop. And then you're looping 5 million times in the inner loop.
This is what O(n²) means.
The next time you see someone talking about "Oh, this is O(log n) but that other thing is O(n log n)," remember this experience - you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?
Anyway, you have some problems.
Problem 1: You'll eventually wind up comparing every point against itself. Which should have a distance of zero, meaning they will all be marked as within whatever distance threshold. If your program ever finishes, all the cells will be marked True.
Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance from each other, you are recording that information about point #1. But you don't record the same information about the other point. You know that point #12345 (geo_point2) is reflexively within the threshold of point #1, but you don't write that down. So you're missing a chance to just skip over 5 million comparisons.
Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop is starting from the beginning of the list every time, but you know that you have already compared the start of the list with the end of the list. You can reduce your problem space by half just by making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million).
Answers?
Consider your latitude and longitude on a flat plane. You want to know if there's a point within 5 miles. Let's think about a 10 mile square, centered on your point #1. That's a square centered on (X1, Y1), with a top left corner at (X1 - 5miles, Y1 + 5miles) and a bottom right corner at (X1 + 5miles, Y1 - 5miles). Now, if a point is within that square, it might not be within 5 miles of your point #1. But you can bet that if it's outside that square, it's more than 5 miles away.
As @SeverinPappadeaux points out, distance on a spheroid like Earth is not quite the same as distance on a flat plane. But so what? Set your square a little bigger to allow for the difference, and proceed!
Sorted List
This is why sorting is important. If all the points were sorted by X, then Y (or Y, then X - whatever) and you knew it, you could really speed things up. Because you could simply stop scanning when the X (or Y) coordinate got too big, and you wouldn't have to go through 5 million points.
How would that work? Same way as before, except your inner loop would have some checks like this:
five_miles = ...  # Whatever math, plus an error allowance!
list_len = len(geo_points)  # Don't call this 5 million times

for i, pi in enumerate(geo_points):

    if pi.close_to_another_point:
        continue  # Remember if close to an earlier point

    pi0max = pi[0] + five_miles
    pi1min = pi[1] - five_miles
    pi1max = pi[1] + five_miles

    for j in range(i+1, list_len):
        pj = geo_points[j]

        # Assumes geo_points is sorted on [0] then [1]
        if pj[0] > pi0max:
            # Can't possibly be close enough, nor any later points
            break
        if pj[1] < pi1min or pj[1] > pi1max:
            # Can't be close enough, but a later point might be
            continue

        # Now do "real" comparison using accurate functions.
        if ...:
            pi.close_to_another_point = True
            pj.close_to_another_point = True
            break
What am I doing there? First, I'm getting some numbers into local variables. Then I'm using enumerate to give me an i value and a reference to the outer point. (What you called geo_point). Then, I'm quickly checking to see if we already know that this point is close to another one.
If not, we'll have to scan. So I'm only scanning "later" points in the list, because I know the outer loop scans the early ones, and I definitely don't want to compare a point against itself. I'm using a few temporary variables to cache the result of computations involving the outer loop. Within the inner loop, I do some stupid comparisons against the temporaries. They can't tell me if the two points are close to each other, but I can check if they're definitely not close and skip ahead.
Finally, if the simple checks pass then go ahead and do the expensive checks. If a check actually passes, be sure to record the result on both points, so we can skip doing the second point later.
Unsorted List
But what if the list is not sorted?
@RootTwo points you at a k-D tree (where D is for "dimensional" and k in this case is "2"). The idea is really simple, if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels in the tree and comparing Y at odd levels (or vice versa). The idea would be this:
def insert_node(node, treenode, depth=0):
    dimension = depth % 2  # even/odd -> lat/long
    dn = node.coord[dimension]
    dt = treenode.coord[dimension]
    if dn < dt:
        # go left
        if treenode.left is None:
            treenode.left = node
        else:
            insert_node(node, treenode.left, depth+1)
    else:
        # go right
        if treenode.right is None:
            treenode.right = node
        else:
            insert_node(node, treenode.right, depth+1)
What would this do? This would get you a searchable tree where points could be inserted in O(log n) time. That means O(n log n) for the whole list, which is way better than n squared! (The log base 2 of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)
It also means you can do a targeted search. Since the tree is ordered, it's fairly straightforward to look for "close" points (the Wikipedia link from @RootTwo provides an algorithm).
Advice
My advice is to just write code to sort the list, if needed. It's easier to write, and easier to check by hand, and it's a separate pass you will only need to make one time.
Once you have the list sorted, try the approach I showed above. It's close to what you were doing, and it should be easy for you to understand and code.
As the answer to Python calculate lots of distances quickly points out, this is a classic use case for k-D trees.
An alternative is to use a sweep line algorithm, as shown in the answer to How do I match similar coordinates using Python?
Here's the sweep line algorithm adapted for your question. On my laptop, it takes < 5 minutes to run through 5M random points.
import itertools as it
import operator as op
import sortedcontainers  # handy library on PyPI
import time
from collections import namedtuple
from math import cos, degrees, pi, radians, sqrt
from random import sample, uniform

Point = namedtuple("Point", "lat long has_close_neighbor")

miles_per_degree = 69

number_of_points = 5000000
data = [Point(uniform( -88.0,  88.0),   # lat
              uniform(-180.0, 180.0),   # long
              True
              )
        for _ in range(number_of_points)
        ]

start = time.time()

# Note: lat is first in Point, so data is sorted by .lat then .long.
data.sort()

print(time.time() - start)

# Parameter that determines the size of a sliding latitude window
# and therefore how close two points need to be to get flagged.
threshold = 5.0  # miles
lat_span = threshold / miles_per_degree
coarse_threshold = (.98 * threshold)**2

# Sliding latitude window. Within the window, observations are
# ordered by longitude.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('long'))

# lag_pt is the 'southernmost' point within the sliding window.
point = iter(data)
lag_pt = next(point)

milepost = len(data)//10

# lead_pt is the 'northernmost' point in the sliding window.
for i, lead_pt in enumerate(data):
    if i == milepost:
        print('.', end=' ')
        milepost += len(data)//10

    # Dec of lead_obs represents the leading edge of window.
    window.add(lead_pt)

    # Remove observations further than the trailing edge of window.
    while lead_pt.lat - lag_pt.lat > lat_span:
        window.discard(lag_pt)
        lag_pt = next(point)

    # Calculate 'east-west' width of window_size at dec of lead_obs
    long_span = lat_span / cos(radians(lead_pt.lat))
    east_long = lead_pt.long + long_span
    west_long = lead_pt.long - long_span

    # Check all observations in the sliding window within
    # long_span of lead_pt.
    for other_pt in window.irange_key(west_long, east_long):
        if other_pt != lead_pt:
            # lead_pt is at the top center of a box 2 * long_span wide by
            # 1 * long_span tall. other_pt is in that box. If desired,
            # put additional fine-grained 'closeness' tests here.

            # coarse check if any pts within 80% of threshold distance;
            # then don't need to check distance to any more neighbors
            average_lat = (other_pt.lat + lead_pt.lat) / 2
            delta_lat = other_pt.lat - lead_pt.lat
            delta_long = (other_pt.long - lead_pt.long)/cos(radians(average_lat))
            if delta_lat**2 + delta_long**2 <= coarse_threshold:
                break

            # put vincenty test here
            #if 0 < vincenty(lead_pt, other_pt).miles <= close_limit:
            #    break
    else:
        data[i] = data[i]._replace(has_close_neighbor=False)

print()
print(time.time() - start)
If you sort the list by latitude (n log(n)), and the points are roughly evenly distributed, it will bring it down to about 1000 points within 5 miles for each point (napkin math, not exact). By only looking at the points that are near in latitude, the runtime goes from n^2 to n*log(n)+.0004n^2. Hopefully this speeds it up enough.
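A rough sketch of that idea (illustrative only; it assumes point objects with .latitude attributes and leaves the expensive distance test as a placeholder, rather than using the asker's exact classes):

import bisect

MILES_PER_DEGREE_LAT = 69.0
lat_window = 5.0 / MILES_PER_DEGREE_LAT  # ~0.072 degrees of latitude

points.sort(key=lambda p: p.latitude)
lats = [p.latitude for p in points]

for i, p in enumerate(points):
    # Only points in this latitude band can possibly be within 5 miles.
    lo = bisect.bisect_left(lats, p.latitude - lat_window)
    hi = bisect.bisect_right(lats, p.latitude + lat_window)
    for j in range(lo, hi):
        if j == i:
            continue
        # run the expensive distance test (e.g. vincenty) only on these candidates
        ...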
I would give pandas a try. Pandas is made for efficient handling of large amounts of data. That may help with the efficiency of the csv portion anyhow. But from the sounds of it, you've got yourself an inherently inefficient problem to solve. You take point 1 and compare it against 4,999,999 other points. Then you take point 2 and compare it with 4,999,998 other points and so on. Do the math. That's 12.5 trillion comparisons you're doing. If you can do 1,000,000 comparisons per second, that's 144 days of computation. If you can do 10,000,000 comparisons per second, that's 14 days. For just additions in straight python, 10,000,000 operations can take something like 1.1 seconds, but I doubt your comparisons are as fast as an add operation. So give it at least a fortnight or two.
Alternately, you could come up with an alternate algorithm, though I don't have any particular one in mind.
I would redo the algorithm in three steps:
1. Use great-circle distance, and assume a 1% error, so make the limit equal to 1.01*limit.
2. Code the great-circle distance as an inlined function; this test should be fast (a sketch is below).
3. You'll get some false positives, which you can then further test with vincenty.
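A minimal sketch of what that inlined great-circle test could look like (a haversine formula; the 3959-mile Earth radius and the helper name are my assumptions, not from the answer):

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3959.0  # mean Earth radius; close enough for a coarse filter

def great_circle_miles(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in miles.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))

# Coarse pass: pad the limit by 1% as suggested above, then confirm any hits
# with the slower, more accurate vincenty test.
coarse_limit = 1.01 * 5.0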
A better solution, building on Oscar Smith's: you have a CSV file, so just sort it in Excel (it is very efficient). Then use binary search in your program to find the cities within 5 miles (you can make a small change to the binary search method so it breaks when it finds one city satisfying your condition).
Another improvement is to use a map to remember pairs of cities once you find that one is within 5 miles of another. For example, when you find that city A is within 5 miles of city B, use the map to store the pair (B is the key and A is the value). So the next time you meet B, search the map first; if it has a corresponding value, you do not need to check it again (a rough sketch is below). This may use more memory, so bear that in mind. Hope it helps.
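A rough sketch of that map idea (illustrative only; points and dist_miles are placeholders, not names from the question):

close_to = {}  # maps a point's index to the index of a point known to be within 5 miles

for i in range(len(points)):
    if i in close_to:
        continue  # already proven close by an earlier pair, no need to search again
    for j in range(i + 1, len(points)):
        if 0 < dist_miles(points[i], points[j]) <= 5:
            close_to[i] = j  # record the pair in both directions,
            close_to[j] = i  # so point j can be skipped later
            break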
This is just a first pass, but I've sped it up by half so far by using great_circle() instead of vincenty(), and by cleaning up a couple of other things. The difference is explained here, and the loss in accuracy is about 0.17%:
from geopy.point import Point
from geopy.distance import great_circle
import csv

CLOSE_LIMIT = 5  # miles (same limit as in the question)

class CustomGeoPoint(Point):
    def __init__(self, latitude, longitude):
        super(CustomGeoPoint, self).__init__(latitude, longitude)
        self.close_to_another_point = False

def isCloseToAnother(pointA, points):
    for pointB in points:
        dist = great_circle(pointA, pointB).miles
        if 0 < dist <= CLOSE_LIMIT:  # (0, close_limit]
            return True
    return False

with open('geo_input.csv', 'r') as geo_csv:
    reader = csv.reader(geo_csv)
    next(reader, None)  # skip the headers
    geo_points = sorted(map(lambda x: CustomGeoPoint(x[0], x[1]), reader))

with open('output.csv', 'w') as output:
    writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

    # for every point, look at every point until one is found within the limit
    for point in geo_points:
        point.close_to_another_point = isCloseToAnother(point, geo_points)
        writer.writerow([point.latitude, point.longitude,
                         point.close_to_another_point])
I'm going to improve this further.
Before:
$ time python geo.py
real 0m5.765s
user 0m5.675s
sys 0m0.048s
After:
$ time python geo.py
real 0m2.816s
user 0m2.716s
sys 0m0.041s
This problem can be solved with a VP tree. A VP tree allows querying data with distances that form a metric obeying the triangle inequality.
The big advantage of VP trees over a k-D tree is that they can be blindly
applied to geographic data anywhere in the world without having to worry
about projecting it to a suitable 2D space. In addition a true geodesic
distance can be used (no need to worry about the differences between
geodesic distances and distances in the projection).
Here's my test: generate 5 million points randomly and uniformly on the
world. Put these into a VP tree.
Looping over all the points, query the VP tree to find any neighbor a distance in (0km, 10km] away. (0km is not included in this set, to avoid the query point being found.) Count the number of points with no such neighbor (which is 229573 in my case).
Cost of setting up the VP tree = 5000000 * 20 distance calculations.
Cost of the queries = 5000000 * 23 distance calculations.
Time for setup and queries is 5m 7s.
I am using C++ with GeographicLib for calculating distances, but
the algorithm can of course be implemented in any language and here's
the python version of GeographicLib.
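For illustration, here is a rough Python sketch of the same idea (a hand-rolled vantage-point tree; the class layout and helper names are mine, not taken from the linked C++ code), using GeographicLib for true geodesic distances:

import random
from geographiclib.geodesic import Geodesic

def geodesic_km(p, q):
    """True geodesic distance in kilometres between two (lat, lon) points."""
    return Geodesic.WGS84.Inverse(p[0], p[1], q[0], q[1])['s12'] / 1000.0

class VPTree:
    """Minimal vantage-point tree supporting "is any other point within r?" queries."""

    def __init__(self, points, dist):
        self.dist = dist
        self.root = self._build(list(points))

    def _build(self, pts):
        if not pts:
            return None
        vp = pts.pop(random.randrange(len(pts)))
        if not pts:
            return (vp, 0.0, None, None)
        dists = [self.dist(vp, p) for p in pts]
        mu = sorted(dists)[len(dists) // 2]  # median distance splits inner/outer
        inner = [p for p, dd in zip(pts, dists) if dd < mu]
        outer = [p for p, dd in zip(pts, dists) if dd >= mu]
        return (vp, mu, self._build(inner), self._build(outer))

    def has_neighbor_within(self, q, r):
        """True if some point at distance in (0, r] of q is in the tree."""
        stack = [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            vp, mu, inner, outer = node
            d = self.dist(q, vp)
            if 0 < d <= r:   # exclude the query point itself, as in the answer
                return True
            if d - r < mu:   # the query ball overlaps the inner region
                stack.append(inner)
            if d + r >= mu:  # the query ball overlaps the outer region
                stack.append(outer)
        return False

# Usage sketch: count points with no neighbor within 10 km.
# tree = VPTree(points, geodesic_km)
# lonely = sum(1 for p in points if not tree.has_neighbor_within(p, 10.0))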
ADDENDUM: The C++ code implementing this approach is given here.
I have an array of x,y,z coordinates of several (~10^10) points (only 5 shown here)
a= [[ 34.45 14.13 2.17]
[ 32.38 24.43 23.12]
[ 33.19 3.28 39.02]
[ 36.34 27.17 31.61]
[ 37.81 29.17 29.94]]
I want to make a new array with only those points which are at least some distance d away from all other points in the list. I wrote some code using while loops:
import numpy as np
from scipy.spatial import distance

d = 0.1  # or some distance
i = 0
selected_points = []
while i < len(a):
    interdist = []
    j = i + 1
    while j < len(a):
        interdist.append(distance.euclidean(a[i], a[j]))
        j += 1
    if all(dis >= d for dis in interdist):
        selected_points.append(a[i])
    i += 1
This works, but it is taking a really long time to perform this calculation. I read somewhere that while loops are very slow.
I was wondering if anyone has any suggestions on how to speed up this calculation.
EDIT: While my objective of finding the particles which are at least some distance away from all the others stays the same, I just realized that there is a serious flaw in my code. Let's say I have 3 particles; my code does the following: for the first iteration of i, it calculates the distances 1->2 and 1->3. Let's say 1->2 is less than the threshold distance d, so the code throws away particle 1. For the next iteration of i, it only does 2->3, and let's say it finds that it is greater than d, so it keeps particle 2 - but this is wrong! Particle 2 should also be discarded along with particle 1. The solution by @svohara is the correct one!
For big data sets and low-dimensional points (such as your 3-dimensional data), sometimes there is a big benefit to using a spatial indexing method. One popular choice for low-dimensional data is the k-d tree.
The strategy is to index the data set. Then query the index using the same data set, to return the 2-nearest neighbors for each point. The first nearest neighbor is always the point itself (with dist=0), so we really want to know how far away the next closest point is (2nd nearest neighbor). For those points where the 2-NN is > threshold, you have the result.
from scipy.spatial import cKDTree as KDTree
import numpy as np
#a is the big data as numpy array N rows by 3 cols
a = np.random.randn(10**8, 3).astype('float32')
# This will create the index, prepare to wait...
# NOTE: took 7 minutes on my mac laptop with 10^8 rand 3-d numbers
# there are some parameters that could be tweaked for faster indexing,
# and there are implementations (not in scipy) that can construct
# the kd-tree using parallel computing strategies (GPUs, e.g.)
k = KDTree(a)
#ask for the 2-nearest neighbors by querying the index with the
# same points
(dists, idxs) = k.query(a, 2)
# (dists, idxs) = k.query(a, 2, n_jobs=4) # to use more CPUs on query...
#Note: 9 minutes for query on my laptop, 2 minutes with n_jobs=6
# So less than 10 minutes total for 10^8 points.
# If the second NN is > thresh distance, then there is no other point
# in the data set closer.
thresh_d = 0.1 #some threshold, equiv to 'd' in O.P.'s code
d_slice = dists[:, 1] #distances to second NN for each point
res = np.flatnonzero( d_slice >= thresh_d )
Here's a vectorized approach using distance.pdist -
# Store number of pts (number of rows in a)
m = a.shape[0]
# Get the first of pairwise indices formed with the pairs of rows from a
# Simpler version, but a bit slow : idx1,_ = np.triu_indices(m,1)
shifts_arr = np.zeros(m*(m-1)//2, dtype=int)
shifts_arr[np.arange(m-1,1,-1).cumsum()] = 1
idx1 = shifts_arr.cumsum()
# Get the IDs of pairs of rows that are more than "d" apart and thus select
# the rest of the rows using a boolean mask created with np.in1d for the
# entire range of number of rows in a. Index into a to get the selected points.
selected_pts = a[~np.in1d(np.arange(m),idx1[distance.pdist(a) < d])]
For a huge dataset like 10e10, we might have to perform the operations in chunks based on the system memory available.
Your algorithm is quadratic (~10^20 operations). Here is a linear approach, assuming the distribution is nearly random.
Split your space into cubic boxes with side d/sqrt(3), so that the diagonal of a box is d. Put each point in its box.
Then, for each box:
- if it contains just one point, you only have to calculate distances to the points in the neighboring boxes;
- otherwise, there is nothing to do: its points are all within d of each other, so none of them qualify.
(A rough sketch is below.)
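Here is that sketch (purely illustrative; the function and variable names are mine, and the neighbor check is brute force over the 27 surrounding boxes):

import numpy as np
from collections import defaultdict

def select_isolated(a, d):
    """Return the points of a (N x 3 array) that are at least d away from every other point."""
    cell = d / np.sqrt(3)  # cube side chosen so the cube's diagonal is d
    boxes = defaultdict(list)
    for idx, p in enumerate(a):
        boxes[tuple((p // cell).astype(int))].append(idx)

    keep = []
    for key, members in boxes.items():
        if len(members) > 1:
            continue  # points sharing a box are at most d apart: treat them all as too close
        i = members[0]
        isolated = True
        # check the 26 neighboring boxes (plus this one, which only holds point i)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in boxes.get((key[0]+dx, key[1]+dy, key[2]+dz), []):
                        if j != i and np.linalg.norm(a[i] - a[j]) < d:
                            isolated = False
        if isolated:
            keep.append(a[i])
    return np.array(keep)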
- Drop the append; it must be really slow. You can have a static vector of distances and use [] to put each number in the right position.
- Use min instead of all. You only need to check whether the minimum distance is bigger than d.
- Actually, you can break out of the loop the moment you find a distance smaller than your limit, and then you can drop both points. This way you don't even have to save any distances (unless you need them later).
- Since d(a,b) = d(b,a), you can do the internal loop only over the following points and forget about the distances you already calculated. If you need them, you can pick them out of the array, which is faster.
From your comment, I believe this would do, if you have no repeated points.
selected_points = []
for p1 in a:
    save_point = True
    for p2 in a:
        # note: rows of a NumPy array need array_equal (not !=) to get a single truth value
        if not np.array_equal(p1, p2) and distance.euclidean(p1, p2) < d:
            save_point = False
            break
    if save_point:
        selected_points.append(p1)
In the end I check a,b and b,a because you should not modify a list while processing it, but you can be smarter using some additional variables.