The fastest way of checking the closest distance between points

The fastest way of checking the closest distance between points - python

I have 2 dictionaries. Both have key value pairs of an index and a world space location.
Something like:
{
"vertices" :
{
1: "(0.004700, 130.417480, -13.546420)",
2: "(0.1, 152.4, 13.521)",
3: "(58.21, 998.412, -78.0051)"
}
}
Dictionary 1 will always have about 20 - 100 entries, dictionary 2 will always have around 10,000 entries.
For every point in dictionary 1, I want to find the point in dictionary 2 that's closest to it. What is the fastest way of doing that? For each entry in dictionary 1, loop through all entries in dictionary 2 and return the one that's closest by.
Some untested pseudo code:
for point, distance in dict_1.iteritems():
closest_point = get_closest_point(dict_1.get(point))
def get_closest_point(self, start_point)
furthest_distance = 2000000
closest_point = 0
for index, end_point in dict_1.iteritems():
distance = get_distance(self, start_point, end_point)
if distance < furthest_distance:
furthest_distance = distance
closest_point = closest_point
return closest_point
I think something like this will work. The "problem" is that if I have 100 entries in dictionary 1, it will be 100 x 10,000 = 1,000,000 iterations. That just doesn't seem very fast or elegant to me.
Is there a better way of doing this in Maya/Python?
EDIT:
Just want to comment that I've used a closestPointOnMesh node before, which works just fine and is a lot easier if the points you're checking against are actually part of a mesh. You could do something like this:
selected_object = pm.PyNode(pm.selected()[0])
cpom = pm.createNode("closestPointOnMesh", name="cpom")
for vertex, distance in dict_1.iteritems():
selected_object.worldMesh >> cpom.inMesh
cpom.inPosition.set(dict_1.get(vertex))
print "closest vertex is %s " % cpom.closestVertexIndex.get()
Instant reply from the node and all is dandy. However, if the list of point you're checking against are not part of a mesh you can't use this. Would it actually be possible/quicker to:
Construct a mesh out of the points in dictionary 2
Use mesh with closestPointOnMesh node
Delete mesh

You definitely need an acceleration structure for non-trivial amounts of points. A KD tree or an octree is what you want -- KD trees are more performant on search but slower to build and can be harder to code. Also since Octrees are spatial rather than binary they may make it easier to do trivial tests.
You can get a python octree here: http://code.activestate.com/recipes/498121-python-octree-implementation/
if you're doing a lot of distance checks you'll definitely want to use Maya API vector classes to do the actual math compares -- that will be much, much faster than the equivalent python although. You can get these from pymel.datatypes if you don't know the API well, although using the newer API2 versions is pretty painless.

You need what is called a KD Tree. Build a KD Tree with points in your second dictionary and query for the closest point to each point in first dictionary.
I am not familiar with maya, if you can use scipy, you can use this.
PS: There seems to be an implementation in C++ here.

Related

Is it possible to automate partitioning in ABAQUS? If so, how?

I am trying to automate the partitioning of a model in ABAQUS using Python script. So far I have a feeling that I am going down a rabbit hole with no solution. I have a feeling that even if I manage to do it, the algorithm will be very inefficient and slower than manual partitioning.
I want the script to:
to join Interesting Points on each face with lines that are perpendicular to the edges.
to be applicable to any model.
to create partitions that can be deleted/edited later on.
My question is: is automatic partitioning possible? If so, what kind of algorithm should I use?
In the meantime, I have made an initial code below to get an idea of the problem using the Partition by shortest path function:
(note that I am looping through vertices and not Interesting Points because I haven’t found a way to access them.)
The problems I have are:
New faces are created will be created as I partition the faces through the range function. My alternative is to select all the faces.
New interesting points are created as I partition. I could make a shallow copy of the initial interesting points and then extract the coordinates and then use these coordinates to do the partitioning. Before partitioning I will need to convert the coordinates back to a dictionary object.
I cannot seem to access the interesting points from the commands.
from abaqus import *
from abaqusConstants import *
#Define Functions
def Create_cube(myPart,myString):
s = mdb.models[myString].ConstrainedSketch(name='__profile__',sheetSize=200.0)
g, v, d, c = s.geometry, s.vertices, s.dimensions, s.constraints
s.setPrimaryObject(option=STANDALONE)
s.rectangle(point1=(10.0, 10.0), point2=(-10.0, -10.0))
p = mdb.models[myString].Part(name=myPart, dimensionality=THREE_D,type=DEFORMABLE_BODY)
p = mdb.models[myString].parts[myPart]
p.BaseSolidExtrude(sketch=s, depth=20.0)
s.unsetPrimaryObject()
p = mdb.models[myString].parts[myPart]
session.viewports['Viewport: 1'].setValues(displayedObject=p)
del mdb.models[myString].sketches['__profile__']
def subtractTheMatrix(matrix1,matrix2):
matrix = [0,0,0]
for i in range(0, 3):
matrix[i] = matrix1[i] - matrix2[i]
if matrix[i]==0.0:
matrix[i]=int(matrix[i])
return matrix
#Define Variables
myString='Buckling_Analysis'
myRadius= 25.0
myThickness= 2.5
myLength=1526.0
myModel= mdb.Model(name=myString)
myPart='Square'
myOffset=0.0
set_name='foobar'
#-------------------------------------------------------------------MODELLING-----------------------------------------------------------------
#Function1: Create Part
Create_cube(myPart,myString)
#Function2: Extract Coordinates from vertices (using string manipulation)
#Input: vertices in vertex form
#Output: coordinates of vertices in the form [[x,y,z],[x1,y1,z1],[x2,y2,z2]] (name: v1_coordinates)
p = mdb.models[myString].parts[myPart]
v1=p.vertices
v1_coordinates=[]
for x in range(len(v1)):
dictionary_object=v1[x]
dictionary_object_str= str(dictionary_object)
location_pointon=dictionary_object_str.find("""pointOn""")
location_coord=location_pointon+12
coordinates_x_string=dictionary_object_str[location_coord:-5]
coordinates_x_list=coordinates_x_string.split(',') #convert string to list of strings
for lo in range(3):
coordinates_x_list[lo]=float(coordinates_x_list[lo]) #change string list to float list
v1_coordinates.append(coordinates_x_list) #append function. adds float list to existing list
print("""these are all the coordinates for the vertices""",v1_coordinates)
#Function3: Partioning loop though List of Coordinates
#Input: List of Coordinates
#Output: Partioned faces of model (Can only be seen in ABAQUS viewport.)
f = p.faces
v1 = p.vertices
#try and except to ignore when vertex is not in plane
final_number_of_faces=24
for i in range(0,final_number_of_faces,2):
print("this is for face:")
for j in range(len(v1_coordinates)):
fixed_vertex_coord = v1_coordinates[j]
fixed_vertex_dict = v1.getClosest(coordinates=((fixed_vertex_coord[0], fixed_vertex_coord[1], fixed_vertex_coord[2]),))
fixed_vertex_dict_str= str(fixed_vertex_dict[0])
location_1=fixed_vertex_dict_str.find("""],""")
fixed_vertex_index=int(fixed_vertex_dict_str[location_1-1:location_1])
for k in range(len(v1_coordinates)):
try:
if subtractTheMatrix(v1_coordinates[j], v1_coordinates[k])==[0,0,0]:
continue
else:
moving_vertex_coord=v1_coordinates[k]
moving_vertex_dict=v1.getClosest(coordinates=((moving_vertex_coord[0], moving_vertex_coord[1], moving_vertex_coord[2]),))
moving_vertex_dict_str= str(moving_vertex_dict[0])
location_2=moving_vertex_dict_str.find("""],""")
moving_vertex_index=int(moving_vertex_dict_str[location_2-1:location_2])
p.PartitionFaceByShortestPath(point1=v1[fixed_vertex_index], point2=v1[moving_vertex_index], faces=f[i])
except:
print("face error")
continue

Short answer
"Is it possible to automate partitioning in ABAQUS?" -- Yes
"How" -- It depends. For your example you probably will be perfectly fine with the PartitionEdgeByDatumPlane() method.
Long answer
Generally speaking, you cannot create a method that will be applicable to any model. You can automate/generalize partitioning for similar geometries and when partitioning is performed using similar logic.
Depending on your problem you have several methods to perform a partition, for example:
For face: ByShortestPath, BySketch, ByDatumPlane, etc.;
For cell: ByDatumPlane, ByExtrudeEdge, BySweepEdge, etc.
Depending on your initial geometry and required results you could need to use different of those. And your approach (the logic of your script) would evolve accordingly.
Abaqus scripting language is not very optimal for checking intersections, geometrical dependences, etc., so yes, if your geometry/task requires complicated mix of several partitioning methods applied to a complex geometry then it could require some slow approaches (e.g. looping trhough all verticies).
Some additional comments:
no need in re-assigning variable p by p = mdb.models[myString].parts[myPart]: mdb.models[myString].Part(..) already return part object.
do you really need methods setPrimaryObject/unsetPrimaryObject? When automating you generally don't need viewport methods (also session.viewports['Viewport: 1'].setValues(displayedObject=p)).
Please, use attributes of Abaqus objects (as discussed in your previous question);
Don't forget that you can loop through sequences: use for v_i in v1 instead of for x in range(len(v1)) when you don't need to know index explicitly; use enumerate() when you need both object and its index.

How to check if a GPS coordinate is close to thousands of others efficiently?

I have 11 millions of GPS coordinates to analyse, the efficiency is my major problem. The problem is the following:
I want to keep only 1 GPS coordinates (call it a node) per 50 meters radius around it. So the code is pretty simple, I have a set G and for every node in G I check if the one I want to add is too close to any other one. If it's too close (<50 meters) I don't add it. Otherwise I do add it.
The problem is that the set G is growing pretty fast and at the end to check if I want to add one node to the set I need to run a for loop over millions of elements...
Here is a simplified code for the Node class:
from geopy import distance
class Node: #a point on the map
def __init__(self, lat, long): #lat and long in degree
self.lat = lat
self.long = long
def distanceTo(self, otherNode):
return distance.distance((self.lat, self.long), (otherNode.lat, otherNode.long)).km
def equivalent(self, otherNode):
return self.distanceTo(otherNode) < 0.05 #50 meters away
Here is the 'add' process:
currentNode = Node(lat, long)
alreadyIn = False
for n in graph:
if n.equivalent(currentNode):
alreadyIn = True
break
#set of Nodes
if alreadyIn == False:
G.add(currentNode)
This is not a problem of node clustering because I am not trying to detect any pattern in the dataset. I am just trying to group nodes inside a 50 meter radius.
I think the best would be to have a data structure that given coordinates return True or False if a similar node is in the set. However I can't figure out which one to use since I don't divide the environment in squares but in circles. (Yes a Node A can be equivalent to B and C without B and C being equivalent but I don't really mind...).
Thank you for your help !

Using an object oriented approach is usually slower for calculations like this (though more readable).
You could transform your latitude,longitude to cartesian x,y,z and create numpy arrays from your nodes and use scipy's very fast cKDTree. It provides several methods for operations like this, in your case query_ball_point might be the correct one.

speeding up processing 5 million rows of coordinate data

I have a csv file with two columns (latitude, longitude) that contains over 5 million rows of geolocation data.
I need to identify the points which are not within 5 miles of any other point in the list, and output everything back into another CSV that has an extra column (CloseToAnotherPoint) which is True if there is another point is within 5 miles, and False if there isn't.
Here is my current solution using geopy (not making any web calls, just using the function to calculate distance):
from geopy.point import Point
from geopy.distance import vincenty
import csv
class CustomGeoPoint(object):
def __init__(self, latitude, longitude):
self.location = Point(latitude, longitude)
self.close_to_another_point = False
try:
output = open('output.csv','w')
writer = csv.writer(output, delimiter = ',', quoting=csv.QUOTE_ALL)
writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])
# 5 miles
close_limit = 5
geo_points = []
with open('geo_input.csv', newline='') as geo_csv:
reader = csv.reader(geo_csv)
next(reader, None) # skip the headers
for row in reader:
geo_points.append(CustomGeoPoint(row[0], row[1]))
# for every point, look at every point until one is found within 5 miles
for geo_point in geo_points:
for geo_point2 in geo_points:
dist = vincenty(geo_point.location, geo_point2.location).miles
if 0 < dist <= close_limit: # (0,close_limit]
geo_point.close_to_another_point = True
break
writer.writerow([geo_point.location.latitude, geo_point.location.longitude,
geo_point.close_to_another_point])
finally:
output.close()
As you might be able to tell from looking at it, this solution is extremely slow. So slow in fact that I let it run for 3 days and it still didn't finish!
I've thought about trying to split up the data into chunks (multiple CSV files or something) so that the inner loop doesn't have to look at every other point, but then I would have to figure out how to make sure the borders of each section checked against the borders of its adjacent sections, and that just seems overly complex and I'm afraid it would be more of a headache than it's worth.
So any pointers on how to make this faster?

Let's look at what you're doing.
You read all the points into a list named geo_points.
Now, can you tell me whether the list is sorted? Because if it was sorted, we definitely want to know that. Sorting is valuable information, especially when you're dealing with 5 million of anything.
You loop over all the geo_points. That's 5 million, according to you.
Within the outer loop, you loop again over all 5 million geo_points.
You compute the distance in miles between the two loop items.
If the distance is less than your threshold, you record that information on the first point, and stop the inner loop.
When the inner loop stops, you write information about the outer loop item to a CSV file.
Notice a couple of things. First, you're looping 5 million times in the outer loop. And then you're looping 5 million times in the inner loop.
This is what O(n²) means.
The next time you see someone talking about "Oh, this is O(log n) but that other thing is O(n log n)," remember this experience - you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?
Anyway, you have some problems.
Problem 1: You'll eventually wind up comparing every point against itself. Which should have a distance of zero, meaning they will all be marked as within whatever distance threshold. If your program ever finishes, all the cells will be marked True.
Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance from each other, you are recording that information about point #1. But you don't record the same information about the other point. You know that point #12345 (geo_point2) is reflexively within the threshold of point #1, but you don't write that down. So you're missing a chance to just skip over 5 million comparisons.
Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop is starting from the beginning of the list every time, but you know that you have already compared the start of the list with the end of the list. You can reduce your problem space by half just by making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million).
Answers?
Consider your latitude and longitude on a flat plane. You want to know if there's a point within 5 miles. Let's think about a 10 mile square, centered on your point #1. That's a square centered on (X1, Y1), with a top left corner at (X1 - 5miles, Y1 + 5miles) and a bottom right corner at (X1 + 5miles, Y1 - 5miles). Now, if a point is within that square, it might not be within 5 miles of your point #1. But you can bet that if it's outside that square, it's more than 5 miles away.
As #SeverinPappadeaux points out, distance on a spheroid like Earth is not quite the same as distance on a flat plane. But so what? Set your square a little bigger to allow for the difference, and proceed!
Sorted List
This is why sorting is important. If all the points were sorted by X, then Y (or Y, then X - whatever) and you knew it, you could really speed things up. Because you could simply stop scanning when the X (or Y) coordinate got too big, and you wouldn't have to go through 5 million points.
How would that work? Same way as before, except your inner loop would have some checks like this:
five_miles = ... # Whatever math, plus an error allowance!
list_len = len(geo_points) # Don't call this 5 million times
for i, pi in enumerate(geo_points):
if pi.close_to_another_point:
continue # Remember if close to an earlier point
pi0max = pi[0] + five_miles
pi1min = pi[1] - five_miles
pi1max = pi[1] + five_miles
for j in range(i+1, list_len):
pj = geo_points[j]
# Assumes geo_points is sorted on [0] then [1]
if pj[0] > pi0max:
# Can't possibly be close enough, nor any later points
break
if pj[1] < pi1min or pj[1] > pi1max:
# Can't be close enough, but a later point might be
continue
# Now do "real" comparison using accurate functions.
if ...:
pi.close_to_another_point = True
pj.close_to_another_point = True
break
What am I doing there? First, I'm getting some numbers into local variables. Then I'm using enumerate to give me an i value and a reference to the outer point. (What you called geo_point). Then, I'm quickly checking to see if we already know that this point is close to another one.
If not, we'll have to scan. So I'm only scanning "later" points in the list, because I know the outer loop scans the early ones, and I definitely don't want to compare a point against itself. I'm using a few temporary variables to cache the result of computations involving the outer loop. Within the inner loop, I do some stupid comparisons against the temporaries. They can't tell me if the two points are close to each other, but I can check if they're definitely not close and skip ahead.
Finally, if the simple checks pass then go ahead and do the expensive checks. If a check actually passes, be sure to record the result on both points, so we can skip doing the second point later.
Unsorted List
But what if the list is not sorted?
#RootTwo points you at a kD tree (where D is for "dimensional" and k in this case is "2"). The idea is really simple, if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels in the tree and comparing Y at odd levels (or vice versa). The idea would be this:
def insert_node(node, treenode, depth=0):
dimension = depth % 2 # even/odd -> lat/long
dn = node.coord[dimension]
dt = treenode.coord[dimension]
if dn < dt:
# go left
if treenode.left is None:
treenode.left = node
else:
insert_node(node, treenode.left, depth+1)
else:
# go right
if treenode.right is None:
treenode.right = node
else:
insert_node(node, treenode.right, depth+1)
What would this do? This would get you a searchable tree where points could be inserted in O(log n) time. That means O(n log n) for the whole list, which is way better than n squared! (The log base 2 of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)
It also means you can do a targeted search. Since the tree is ordered, it's fairly straightforward to look for "close" points (the Wikipedia link from #RootTwo provides an algorithm).
Advice
My advice is to just write code to sort the list, if needed. It's easier to write, and easier to check by hand, and it's a separate pass you will only need to make one time.
Once you have the list sorted, try the approach I showed above. It's close to what you were doing, and it should be easy for you to understand and code.

As the answer to Python calculate lots of distances quickly points out, this is a classic use case for k-D trees.
An alternative is to use a sweep line algorithm, as shown in the answer to How do I match similar coordinates using Python?
Here's the sweep line algorithm adapted for your questions. On my laptop, it takes < 5 minutes to run through 5M random points.
import itertools as it
import operator as op
import sortedcontainers # handy library on Pypi
import time
from collections import namedtuple
from math import cos, degrees, pi, radians, sqrt
from random import sample, uniform
Point = namedtuple("Point", "lat long has_close_neighbor")
miles_per_degree = 69
number_of_points = 5000000
data = [Point(uniform( -88.0, 88.0), # lat
uniform(-180.0, 180.0), # long
True
)
for _ in range(number_of_points)
]
start = time.time()
# Note: lat is first in Point, so data is sorted by .lat then .long.
data.sort()
print(time.time() - start)
# Parameter that determines the size of a sliding lattitude window
# and therefore how close two points need to be to be to get flagged.
threshold = 5.0 # miles
lat_span = threshold / miles_per_degree
coarse_threshold = (.98 * threshold)**2
# Sliding lattitude window. Within the window, observations are
# ordered by longitude.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('long'))
# lag_pt is the 'southernmost' point within the sliding window.
point = iter(data)
lag_pt = next(point)
milepost = len(data)//10
# lead_pt is the 'northernmost' point in the sliding window.
for i, lead_pt in enumerate(data):
if i == milepost:
print('.', end=' ')
milepost += len(data)//10
# Dec of lead_obs represents the leading edge of window.
window.add(lead_pt)
# Remove observations further than the trailing edge of window.
while lead_pt.lat - lag_pt.lat > lat_span:
window.discard(lag_pt)
lag_pt = next(point)
# Calculate 'east-west' width of window_size at dec of lead_obs
long_span = lat_span / cos(radians(lead_pt.lat))
east_long = lead_pt.long + long_span
west_long = lead_pt.long - long_span
# Check all observations in the sliding window within
# long_span of lead_pt.
for other_pt in window.irange_key(west_long, east_long):
if other_pt != lead_pt:
# lead_pt is at the top center of a box 2 * long_span wide by
# 1 * long_span tall. other_pt is is in that box. If desired,
# put additional fine-grained 'closeness' tests here.
# coarse check if any pts within 80% of threshold distance
# then don't need to check distance to any more neighbors
average_lat = (other_pt.lat + lead_pt.lat) / 2
delta_lat = other_pt.lat - lead_pt.lat
delta_long = (other_pt.long - lead_pt.long)/cos(radians(average_lat))
if delta_lat**2 + delta_long**2 <= coarse_threshold:
break
# put vincenty test here
#if 0 < vincenty(lead_pt, other_pt).miles <= close_limit:
# break
else:
data[i] = data[i]._replace(has_close_neighbor=False)
print()
print(time.time() - start)

If you sort the list by latitude (n log(n)), and the points are roughly evenly distributed, it will bring it down to about 1000 points within 5 miles for each point (napkin math, not exact). By only looking at the points that are near in latitude, the runtime goes from n^2 to n*log(n)+.0004n^2. Hopefully this speeds it up enough.

I would give pandas a try. Pandas is made for efficient handling of large amounts of data. That may help with the efficiency of the csv portion anyhow. But from the sounds of it, you've got yourself an inherently inefficient problem to solve. You take point 1 and compare it against 4,999,999 other points. Then you take point 2 and compare it with 4,999,998 other points and so on. Do the math. That's 12.5 trillion comparisons you're doing. If you can do 1,000,000 comparisons per second, that's 144 days of computation. If you can do 10,000,000 comparisons per second, that's 14 days. For just additions in straight python, 10,000,000 operations can take something like 1.1 seconds, but I doubt your comparisons are as fast as an add operation. So give it at least a fortnight or two.
Alternately, you could come up with an alternate algorithm, though I don't have any particular one in mind.

I would redo algorithm in three steps:
Use great-circle distance, and assume 1% error so make limit equal to 1.01*limit.
Code great-circle distance as inlined function, this test should be fast
You'll get some false positives, which you could further test with vincenty

A better solution generated from Oscar Smith. You have a csv file and just sorted it in excel it is very efficient). Then utilize binary search in your program to find the cities within 5 miles(you can make small change to binary search method so it will break if it finds one city satisfying your condition).
Another improvement is to set a map to remember the pair of cities when you find one city is within another one. For example, when you find city A is within 5 miles of city B, use Map to store the pair (B is the key and A is the value). So next time you meet B, search it in the Map first, if it has a corresponding value, you do not need to check it again. But it may use more memory so care about it. Hope it helps you.

This is just a first pass, but I've sped it up by half so far by using great_circle() instead of vincinty(), and cleaning up a couple of other things. The difference is explained here, and the loss in accuracy is about 0.17%:
from geopy.point import Point
from geopy.distance import great_circle
import csv
class CustomGeoPoint(Point):
def __init__(self, latitude, longitude):
super(CustomGeoPoint, self).__init__(latitude, longitude)
self.close_to_another_point = False
def isCloseToAnother(pointA, points):
for pointB in points:
dist = great_circle(pointA, pointB).miles
if 0 < dist <= CLOSE_LIMIT: # (0, close_limit]
return True
return False
with open('geo_input.csv', 'r') as geo_csv:
reader = csv.reader(geo_csv)
next(reader, None) # skip the headers
geo_points = sorted(map(lambda x: CustomGeoPoint(x[0], x[1]), reader))
with open('output.csv', 'w') as output:
writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])
# for every point, look at every point until one is found within a mile
for point in geo_points:
point.close_to_another_point = isCloseToAnother(point, geo_points)
writer.writerow([point.latitude, point.longitude,
point.close_to_another_point])
I'm going to improve this further.
Before:
$ time python geo.py
real 0m5.765s
user 0m5.675s
sys 0m0.048s
After:
$ time python geo.py
real 0m2.816s
user 0m2.716s
sys 0m0.041s

This problem can be solved with a VP tree. These allows querying data
with distances that are a metric obeying the triangle inequality.
The big advantage of VP trees over a k-D tree is that they can be blindly
applied to geographic data anywhere in the world without having to worry
about projecting it to a suitable 2D space. In addition a true geodesic
distance can be used (no need to worry about the differences between
geodesic distances and distances in the projection).
Here's my test: generate 5 million points randomly and uniformly on the
world. Put these into a VP tree.
Looping over all the points, query the VP tree to find any neighbor a
distance in (0km, 10km] away. (0km is not include in this set to avoid
the query point being found.) Count the number of points with no such
neighbor (which is 229573 in my case).
Cost of setting up the VP tree = 5000000 * 20 distance calculations.
Cost of the queries = 5000000 * 23 distance calculations.
Time for setup and queries is 5m 7s.
I am using C++ with GeographicLib for calculating distances, but
the algorithm can of course be implemented in any language and here's
the python version of GeographicLib.
ADDENDUM: The C++ code implementing this approach is given here.

Getting Keys Within Range/Finding Nearest Neighbor From Dictionary Keys Stored As Tuples

I have a dictionary which has coordinates as keys. They are by default in 3 dimensions, like dictionary[(x,y,z)]=values, but may be in any dimension, so the code can't be hard coded for 3.
I need to find if there are other values within a certain radius of a new coordinate, and I ideally need to do it without having to import any plugins such as numpy.
My initial thought was to split the input into a cube and check no points match, but obviously that is limited to integer coordinates, and would grow exponentially slower (radius of 5 would require 729x the processing), and with my initial code taking at least a minute for relatively small values, I can't really afford this.
I heard finding the nearest neighbor may be the best way, and ideally, cutting down the keys used to a range of +- a certain amount would be good, but I don't know how you'd do that when there's more the one point being used.Here's how I'd do it with my current knowledge:
dimensions = 3
minimumDistance = 0.9
#example dictionary + input
dictionary[(0,0,0)]=[]
dictionary[(0,0,1)]=[]
keyToAdd = [0,1,1]
closestMatch = 2**1000
tooClose = False
for keys in dictionary:
#calculate distance to new point
originalCoordinates = str(split( dictionary[keys], "," ) ).replace("(","").replace(")","")
for i in range(dimensions):
distanceToPoint = #do pythagors with originalCoordinates and keyToAdd
#if you want the overall closest match
if distanceToPoint < closestMatch:
closestMatch = distanceToPoint
#if you want to just check it's not within that radius
if distanceToPoint < minimumDistance:
tooClose = True
break
However, performing calculations this way may still run very slow (it must do this to millions of values). I've searched the problem, but most people seem to have simpler sets of data to do this to. If anyone can offer any tips I'd be grateful.

You say you need to determine IF there are any keys within a given radius of a particular point. Thus, you only need to scan the keys, computing the distance of each to the point until you find one within the specified radius. (And if you do comparisons to the square of the radius, you can avoid the square roots needed for the actual distance.)
One optimization would be to sort the keys based on their "Manhattan distance" from the point (that is, add the component offsets), since the Euclidean distance will never be less than this. This would avoid some of the more expensive calculations (though I don't think you need and trigonometry).
If, as you suggest later in the question, you need to handle multiple points, you can obviously process each individually, or you could find the center of those points and sort based on that.

GPS co-ordinate search --R-trees

I have a list of lists in the form of
[ [ x1,.....,x8],[x1,.......,x8],...............,[x1,.....x[8]] ] . The number of lists in that list can go upto a million. Each list has 4 gps co-ordinates which show the four points of a rectangle ( assumed that each segment is in the form of a rectangle].
Problem : Given a new point, I need to determine which segment the point falls on and create a new one if it falls in none of them. I am not uploading the data into MySQL as of now, it comes in as a simple text file. I find out the co-ordinates from the text file for any given car.
What I tried : I am thinking of using R-trees to find all points which are near to the given point . ( Near== 200 meters maximum) . But even in R-trees, there seem to be too many options . R,R*,Hilbert.
Q1. Which one should be opted for ?
Q2. Is there a better option than R-trees?Can something be done by searching faster within the list ?
Thanks a lot.
[ {a1:[........]},{a2:[.......]},{a3:[.........]},.... ,{a20:[.....]}] .

Isn't the problem "find whether a given point falls within a certain rectangle in 2D space"?
That could be separated dimensionally, couldn't it? Give each rectangle an ID, then separate into lists of one-dimensional ranges ((id, x0, x1), (id, y0, y1)) and find all the ranges in both dimensions the point falls in. (I'm fairly sure there are very efficient algorithms for this. Heck, you could even leverage, say, sqlite already.) Then just intersect the ID sets you get and you should find all rectangles the point falls in, if any. (Of course you can exit early if either of the single dimensional queries returns no result.)
Not sure if this'd be faster or smarter than R-trees or other spatial indexes though. Hope this helps anyway.

import random as ra
# my _data will hold tuples of gps readings
# under the key of (row,col), knowing that the size of
# the row and col is 10, it will give an overall grid coverage.
# Another dict could translate row/col coordinates into some
# more usefull region names
my_data = {}
def get_region(x,y,region_size=10):
"""Build a tuple of row/col based on
the values provided and region square dimension.
It's for demonstration only and it uses rather naive calculation as
coordinate / grid cell size"""
row = int(x / region_size)
col = int(y / region_size)
return (row,col)
#make some examples and build my_data
for loop in range(10000):
#simule some readings
x = ra.choice(range(100))
y = ra.choice(range(100))
my_coord = get_region(x,y)
if my_data.get(my_coord):
my_data[my_coord].append((x,y))
else:
my_data[my_coord]= [(x,y),]
print my_data

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.