I have the following code which randomly generates a list of (X,Y) tuples:
import random
coords = []
for i in range(10):
x = random.randint(85,939)
y = random.randint(75,693)
coords.append((x,y))
In the final list, the X values of each tuple are considered to overlap if the absolute difference between them is less then 85, and the Y values are considered to overlap if the absoulte difference is less than 75. How can I make sure that none of the tuples in the final list will overlap in both dimensions?
The easiest way to do this is to just keep sampling and discarding coordinates which will create overlap. This will however become very inefficient when you come close to filling the available space. If that's not an issue you should go with this solution.
A bit more efficient and as far as I can tell statistically equivalent is to sample one coordinate, like the row, first. Then compute the occupied area in that row and sample from the remaining positions if there are any.
To avoid the same problem as in the easy solution, if there are no available spaces in a row, it should be removed from the possible sampling outcomes for the row (plus the margin of 75 in both directions).
Ideally, you don't compute the occupied regions each time but you keep a mapping from the row to the occupied space in that row and the amount of non-full rows and just update this mapping when inserting new images. You will need storage for n_rows + 1 extra numbers.
To clarify: When sampling from a restricted space just subtract the occupied positions and get a sampling result n. Then find the correct position for n by walking along the coordinate axis n steps, skipping all the occupied positions.
I have a csv file with two columns (latitude, longitude) that contains over 5 million rows of geolocation data.
I need to identify the points which are not within 5 miles of any other point in the list, and output everything back into another CSV that has an extra column (CloseToAnotherPoint) which is True if there is another point is within 5 miles, and False if there isn't.
Here is my current solution using geopy (not making any web calls, just using the function to calculate distance):
from geopy.point import Point
from geopy.distance import vincenty
import csv
class CustomGeoPoint(object):
def __init__(self, latitude, longitude):
self.location = Point(latitude, longitude)
self.close_to_another_point = False
try:
output = open('output.csv','w')
writer = csv.writer(output, delimiter = ',', quoting=csv.QUOTE_ALL)
writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])
# 5 miles
close_limit = 5
geo_points = []
with open('geo_input.csv', newline='') as geo_csv:
reader = csv.reader(geo_csv)
next(reader, None) # skip the headers
for row in reader:
geo_points.append(CustomGeoPoint(row[0], row[1]))
# for every point, look at every point until one is found within 5 miles
for geo_point in geo_points:
for geo_point2 in geo_points:
dist = vincenty(geo_point.location, geo_point2.location).miles
if 0 < dist <= close_limit: # (0,close_limit]
geo_point.close_to_another_point = True
break
writer.writerow([geo_point.location.latitude, geo_point.location.longitude,
geo_point.close_to_another_point])
finally:
output.close()
As you might be able to tell from looking at it, this solution is extremely slow. So slow in fact that I let it run for 3 days and it still didn't finish!
I've thought about trying to split up the data into chunks (multiple CSV files or something) so that the inner loop doesn't have to look at every other point, but then I would have to figure out how to make sure the borders of each section checked against the borders of its adjacent sections, and that just seems overly complex and I'm afraid it would be more of a headache than it's worth.
So any pointers on how to make this faster?
Let's look at what you're doing.
You read all the points into a list named geo_points.
Now, can you tell me whether the list is sorted? Because if it was sorted, we definitely want to know that. Sorting is valuable information, especially when you're dealing with 5 million of anything.
You loop over all the geo_points. That's 5 million, according to you.
Within the outer loop, you loop again over all 5 million geo_points.
You compute the distance in miles between the two loop items.
If the distance is less than your threshold, you record that information on the first point, and stop the inner loop.
When the inner loop stops, you write information about the outer loop item to a CSV file.
Notice a couple of things. First, you're looping 5 million times in the outer loop. And then you're looping 5 million times in the inner loop.
This is what O(n²) means.
The next time you see someone talking about "Oh, this is O(log n) but that other thing is O(n log n)," remember this experience - you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?
Anyway, you have some problems.
Problem 1: You'll eventually wind up comparing every point against itself. Which should have a distance of zero, meaning they will all be marked as within whatever distance threshold. If your program ever finishes, all the cells will be marked True.
Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance from each other, you are recording that information about point #1. But you don't record the same information about the other point. You know that point #12345 (geo_point2) is reflexively within the threshold of point #1, but you don't write that down. So you're missing a chance to just skip over 5 million comparisons.
Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop is starting from the beginning of the list every time, but you know that you have already compared the start of the list with the end of the list. You can reduce your problem space by half just by making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million).
Answers?
Consider your latitude and longitude on a flat plane. You want to know if there's a point within 5 miles. Let's think about a 10 mile square, centered on your point #1. That's a square centered on (X1, Y1), with a top left corner at (X1 - 5miles, Y1 + 5miles) and a bottom right corner at (X1 + 5miles, Y1 - 5miles). Now, if a point is within that square, it might not be within 5 miles of your point #1. But you can bet that if it's outside that square, it's more than 5 miles away.
As #SeverinPappadeaux points out, distance on a spheroid like Earth is not quite the same as distance on a flat plane. But so what? Set your square a little bigger to allow for the difference, and proceed!
Sorted List
This is why sorting is important. If all the points were sorted by X, then Y (or Y, then X - whatever) and you knew it, you could really speed things up. Because you could simply stop scanning when the X (or Y) coordinate got too big, and you wouldn't have to go through 5 million points.
How would that work? Same way as before, except your inner loop would have some checks like this:
five_miles = ... # Whatever math, plus an error allowance!
list_len = len(geo_points) # Don't call this 5 million times
for i, pi in enumerate(geo_points):
if pi.close_to_another_point:
continue # Remember if close to an earlier point
pi0max = pi[0] + five_miles
pi1min = pi[1] - five_miles
pi1max = pi[1] + five_miles
for j in range(i+1, list_len):
pj = geo_points[j]
# Assumes geo_points is sorted on [0] then [1]
if pj[0] > pi0max:
# Can't possibly be close enough, nor any later points
break
if pj[1] < pi1min or pj[1] > pi1max:
# Can't be close enough, but a later point might be
continue
# Now do "real" comparison using accurate functions.
if ...:
pi.close_to_another_point = True
pj.close_to_another_point = True
break
What am I doing there? First, I'm getting some numbers into local variables. Then I'm using enumerate to give me an i value and a reference to the outer point. (What you called geo_point). Then, I'm quickly checking to see if we already know that this point is close to another one.
If not, we'll have to scan. So I'm only scanning "later" points in the list, because I know the outer loop scans the early ones, and I definitely don't want to compare a point against itself. I'm using a few temporary variables to cache the result of computations involving the outer loop. Within the inner loop, I do some stupid comparisons against the temporaries. They can't tell me if the two points are close to each other, but I can check if they're definitely not close and skip ahead.
Finally, if the simple checks pass then go ahead and do the expensive checks. If a check actually passes, be sure to record the result on both points, so we can skip doing the second point later.
Unsorted List
But what if the list is not sorted?
#RootTwo points you at a kD tree (where D is for "dimensional" and k in this case is "2"). The idea is really simple, if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels in the tree and comparing Y at odd levels (or vice versa). The idea would be this:
def insert_node(node, treenode, depth=0):
dimension = depth % 2 # even/odd -> lat/long
dn = node.coord[dimension]
dt = treenode.coord[dimension]
if dn < dt:
# go left
if treenode.left is None:
treenode.left = node
else:
insert_node(node, treenode.left, depth+1)
else:
# go right
if treenode.right is None:
treenode.right = node
else:
insert_node(node, treenode.right, depth+1)
What would this do? This would get you a searchable tree where points could be inserted in O(log n) time. That means O(n log n) for the whole list, which is way better than n squared! (The log base 2 of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)
It also means you can do a targeted search. Since the tree is ordered, it's fairly straightforward to look for "close" points (the Wikipedia link from #RootTwo provides an algorithm).
Advice
My advice is to just write code to sort the list, if needed. It's easier to write, and easier to check by hand, and it's a separate pass you will only need to make one time.
Once you have the list sorted, try the approach I showed above. It's close to what you were doing, and it should be easy for you to understand and code.
As the answer to Python calculate lots of distances quickly points out, this is a classic use case for k-D trees.
An alternative is to use a sweep line algorithm, as shown in the answer to How do I match similar coordinates using Python?
Here's the sweep line algorithm adapted for your questions. On my laptop, it takes < 5 minutes to run through 5M random points.
import itertools as it
import operator as op
import sortedcontainers # handy library on Pypi
import time
from collections import namedtuple
from math import cos, degrees, pi, radians, sqrt
from random import sample, uniform
Point = namedtuple("Point", "lat long has_close_neighbor")
miles_per_degree = 69
number_of_points = 5000000
data = [Point(uniform( -88.0, 88.0), # lat
uniform(-180.0, 180.0), # long
True
)
for _ in range(number_of_points)
]
start = time.time()
# Note: lat is first in Point, so data is sorted by .lat then .long.
data.sort()
print(time.time() - start)
# Parameter that determines the size of a sliding lattitude window
# and therefore how close two points need to be to be to get flagged.
threshold = 5.0 # miles
lat_span = threshold / miles_per_degree
coarse_threshold = (.98 * threshold)**2
# Sliding lattitude window. Within the window, observations are
# ordered by longitude.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('long'))
# lag_pt is the 'southernmost' point within the sliding window.
point = iter(data)
lag_pt = next(point)
milepost = len(data)//10
# lead_pt is the 'northernmost' point in the sliding window.
for i, lead_pt in enumerate(data):
if i == milepost:
print('.', end=' ')
milepost += len(data)//10
# Dec of lead_obs represents the leading edge of window.
window.add(lead_pt)
# Remove observations further than the trailing edge of window.
while lead_pt.lat - lag_pt.lat > lat_span:
window.discard(lag_pt)
lag_pt = next(point)
# Calculate 'east-west' width of window_size at dec of lead_obs
long_span = lat_span / cos(radians(lead_pt.lat))
east_long = lead_pt.long + long_span
west_long = lead_pt.long - long_span
# Check all observations in the sliding window within
# long_span of lead_pt.
for other_pt in window.irange_key(west_long, east_long):
if other_pt != lead_pt:
# lead_pt is at the top center of a box 2 * long_span wide by
# 1 * long_span tall. other_pt is is in that box. If desired,
# put additional fine-grained 'closeness' tests here.
# coarse check if any pts within 80% of threshold distance
# then don't need to check distance to any more neighbors
average_lat = (other_pt.lat + lead_pt.lat) / 2
delta_lat = other_pt.lat - lead_pt.lat
delta_long = (other_pt.long - lead_pt.long)/cos(radians(average_lat))
if delta_lat**2 + delta_long**2 <= coarse_threshold:
break
# put vincenty test here
#if 0 < vincenty(lead_pt, other_pt).miles <= close_limit:
# break
else:
data[i] = data[i]._replace(has_close_neighbor=False)
print()
print(time.time() - start)
If you sort the list by latitude (n log(n)), and the points are roughly evenly distributed, it will bring it down to about 1000 points within 5 miles for each point (napkin math, not exact). By only looking at the points that are near in latitude, the runtime goes from n^2 to n*log(n)+.0004n^2. Hopefully this speeds it up enough.
I would give pandas a try. Pandas is made for efficient handling of large amounts of data. That may help with the efficiency of the csv portion anyhow. But from the sounds of it, you've got yourself an inherently inefficient problem to solve. You take point 1 and compare it against 4,999,999 other points. Then you take point 2 and compare it with 4,999,998 other points and so on. Do the math. That's 12.5 trillion comparisons you're doing. If you can do 1,000,000 comparisons per second, that's 144 days of computation. If you can do 10,000,000 comparisons per second, that's 14 days. For just additions in straight python, 10,000,000 operations can take something like 1.1 seconds, but I doubt your comparisons are as fast as an add operation. So give it at least a fortnight or two.
Alternately, you could come up with an alternate algorithm, though I don't have any particular one in mind.
I would redo algorithm in three steps:
Use great-circle distance, and assume 1% error so make limit equal to 1.01*limit.
Code great-circle distance as inlined function, this test should be fast
You'll get some false positives, which you could further test with vincenty
A better solution generated from Oscar Smith. You have a csv file and just sorted it in excel it is very efficient). Then utilize binary search in your program to find the cities within 5 miles(you can make small change to binary search method so it will break if it finds one city satisfying your condition).
Another improvement is to set a map to remember the pair of cities when you find one city is within another one. For example, when you find city A is within 5 miles of city B, use Map to store the pair (B is the key and A is the value). So next time you meet B, search it in the Map first, if it has a corresponding value, you do not need to check it again. But it may use more memory so care about it. Hope it helps you.
This is just a first pass, but I've sped it up by half so far by using great_circle() instead of vincinty(), and cleaning up a couple of other things. The difference is explained here, and the loss in accuracy is about 0.17%:
from geopy.point import Point
from geopy.distance import great_circle
import csv
class CustomGeoPoint(Point):
def __init__(self, latitude, longitude):
super(CustomGeoPoint, self).__init__(latitude, longitude)
self.close_to_another_point = False
def isCloseToAnother(pointA, points):
for pointB in points:
dist = great_circle(pointA, pointB).miles
if 0 < dist <= CLOSE_LIMIT: # (0, close_limit]
return True
return False
with open('geo_input.csv', 'r') as geo_csv:
reader = csv.reader(geo_csv)
next(reader, None) # skip the headers
geo_points = sorted(map(lambda x: CustomGeoPoint(x[0], x[1]), reader))
with open('output.csv', 'w') as output:
writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])
# for every point, look at every point until one is found within a mile
for point in geo_points:
point.close_to_another_point = isCloseToAnother(point, geo_points)
writer.writerow([point.latitude, point.longitude,
point.close_to_another_point])
I'm going to improve this further.
Before:
$ time python geo.py
real 0m5.765s
user 0m5.675s
sys 0m0.048s
After:
$ time python geo.py
real 0m2.816s
user 0m2.716s
sys 0m0.041s
This problem can be solved with a VP tree. These allows querying data
with distances that are a metric obeying the triangle inequality.
The big advantage of VP trees over a k-D tree is that they can be blindly
applied to geographic data anywhere in the world without having to worry
about projecting it to a suitable 2D space. In addition a true geodesic
distance can be used (no need to worry about the differences between
geodesic distances and distances in the projection).
Here's my test: generate 5 million points randomly and uniformly on the
world. Put these into a VP tree.
Looping over all the points, query the VP tree to find any neighbor a
distance in (0km, 10km] away. (0km is not include in this set to avoid
the query point being found.) Count the number of points with no such
neighbor (which is 229573 in my case).
Cost of setting up the VP tree = 5000000 * 20 distance calculations.
Cost of the queries = 5000000 * 23 distance calculations.
Time for setup and queries is 5m 7s.
I am using C++ with GeographicLib for calculating distances, but
the algorithm can of course be implemented in any language and here's
the python version of GeographicLib.
ADDENDUM: The C++ code implementing this approach is given here.
I have a dictionary which has coordinates as keys. They are by default in 3 dimensions, like dictionary[(x,y,z)]=values, but may be in any dimension, so the code can't be hard coded for 3.
I need to find if there are other values within a certain radius of a new coordinate, and I ideally need to do it without having to import any plugins such as numpy.
My initial thought was to split the input into a cube and check no points match, but obviously that is limited to integer coordinates, and would grow exponentially slower (radius of 5 would require 729x the processing), and with my initial code taking at least a minute for relatively small values, I can't really afford this.
I heard finding the nearest neighbor may be the best way, and ideally, cutting down the keys used to a range of +- a certain amount would be good, but I don't know how you'd do that when there's more the one point being used.Here's how I'd do it with my current knowledge:
dimensions = 3
minimumDistance = 0.9
#example dictionary + input
dictionary[(0,0,0)]=[]
dictionary[(0,0,1)]=[]
keyToAdd = [0,1,1]
closestMatch = 2**1000
tooClose = False
for keys in dictionary:
#calculate distance to new point
originalCoordinates = str(split( dictionary[keys], "," ) ).replace("(","").replace(")","")
for i in range(dimensions):
distanceToPoint = #do pythagors with originalCoordinates and keyToAdd
#if you want the overall closest match
if distanceToPoint < closestMatch:
closestMatch = distanceToPoint
#if you want to just check it's not within that radius
if distanceToPoint < minimumDistance:
tooClose = True
break
However, performing calculations this way may still run very slow (it must do this to millions of values). I've searched the problem, but most people seem to have simpler sets of data to do this to. If anyone can offer any tips I'd be grateful.
You say you need to determine IF there are any keys within a given radius of a particular point. Thus, you only need to scan the keys, computing the distance of each to the point until you find one within the specified radius. (And if you do comparisons to the square of the radius, you can avoid the square roots needed for the actual distance.)
One optimization would be to sort the keys based on their "Manhattan distance" from the point (that is, add the component offsets), since the Euclidean distance will never be less than this. This would avoid some of the more expensive calculations (though I don't think you need and trigonometry).
If, as you suggest later in the question, you need to handle multiple points, you can obviously process each individually, or you could find the center of those points and sort based on that.
I have two arrays of coordinates from two separate files from a CFD calculation. One is a mesh file which contains the connectivity information and the other is the results file.
My problem is that the coordinates from each file are not in the same order. What I would like to be able to do is order ALL the arrays from the results file to be in the same order as the mesh file.
My idea would be to find the matching values of xyz coordinates and create a mapping such that the rest of the result arrays can be ordered.
I was thinking something like:
mapping = np.empty(len(co_mesh))
for i,coord in enumerate(co_mesh):
for j in range(len(co_res)):
if (coord[0]==co_res[j,0]) and (coord[1]==co_res[j,1]) and (coord[2]==co_res[j,2]):
mapping[i] = j
where co_mesh, co_res are arrays containing the x,y,z coords.
The problem is that I suspect this loop will take a long time. At the moment I'm only looping over around 70000 points but in future this could increase to 1 million or more.
Is there a faster way to write this in Python.
I'm using Python 2.6.5.
Ben
For those who are interested this is what I am currently using:
mesh_coords = zip(xm_list,ym_list,zm_list,range(len(x_po)))
res_coords = zip(xr_list,yr_list,zr_list,range(len(x)))
mesh_coords = sorted(mesh_coords , key = lambda x:(x[0],x[1],x[2]))
res_coords = sorted(res_coords , key = lambda x:(x[0],x[1],x[2]))
mapping = zip(np.array(listym)[:,-1],np.array(listyr)[:,-1])
mapping = sorted(mapping , key = lambda x:(x[0]))
How about sorting coordinate vectors in both files along x than y and than least z coordinate?
You can do this efficient and fast if you use numpy arrays for vectors.
Update:
If you don't have the node ids of the nodes in the result mesh. But the coordinates are the same. Do the following:
Add a numbering as an additional information to your vectors. Sort both mesh by x,y,z add the now unsorted numbering of your mesh to your comesh and sort the comesh along that axis. Now the comesh contains the exact order as the original mesh.