I am looking at the Wikipedia page for KD trees. As an example, I implemented, in Python, the algorithm it lists for building a kd tree.
The algorithm for doing KNN search with a KD tree, however, switches languages and isn't totally clear. The English explanation starts making sense, but parts of it (such as the area where they "unwind recursion" to check other leaf nodes) don't really make any sense to me.
How does this work, and how can one do a KNN search with a KD tree in python? This isn't meant to be a "send me the code!" type question, and I don't expect that. Just a brief explanation please :)
This book introduction, page 3:
Given a set of n points in a d-dimensional space, the kd-tree is constructed
recursively as follows. First, one finds a median of the values of the ith
coordinates of the points (initially, i = 1). That is, a value M is computed,
so that at least 50% of the points have their ith coordinate greater-or-equal
to M, while at least 50% of the points have their ith coordinate smaller
than or equal to M. The value M is stored, and the set P is partitioned
into PL and PR, where PL contains only the points with their ith coordinate
smaller than or equal to M, and |PR| = |PL| ± 1. The process is then repeated
recursively on both PL and PR , with i replaced by i + 1 (or 1, if i = d).
When the set of points at a node has size 1, the recursion stops.
The following paragraphs discuss its use in solving nearest neighbor.
Or, here is the original 1975 paper by Jon Bentley.
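For illustration, here is a minimal Python sketch of that construction. It follows the common variant in which the median point itself is stored at each node rather than only at the leaves; the node layout (a dict with 'point', 'axis', 'left', 'right') is my own assumption, not the book's:

def build_kdtree(points, depth=0, k=2):
    # Recursion stops when the set of points at a node is empty or has size 1.
    if not points:
        return None
    axis = depth % k  # split on the i-th coordinate, cycling 0 .. k-1
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2  # keeps the two halves within one point of each other
    return {
        'point': points[median],
        'axis': axis,
        'left': build_kdtree(points[:median], depth + 1, k),
        'right': build_kdtree(points[median + 1:], depth + 1, k),
    }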
EDIT: I should add that SciPy has a kdtree implementation:
scipy.spatial
another Stack Overflow question
I've just spent some time puzzling over the Wikipedia description of the algorithm myself, and came up with the following Python implementation that may help: https://gist.github.com/863301
The first phase of closest_point is a simple depth first search to find the best matching leaf node.
Instead of simply returning the best node found back up the call stack, a second phase checks to see if there could be a closer node on the "away" side: (ASCII art diagram)
n current node
b | best match so far
| p | point we're looking for
|< >| | error
|< >| distance to "away" side
|< | >| error "sphere" extends to "away" side
| x possible better match on the "away" side
The current node n splits the space along a line, so we only need to look on the "away" side if the "error" between the point p and the best match b is greater than the distance from the point p to the line through n. If it is, then we check to see whether there are any points on the "away" side that are closer.
Because our best matching node is passed into this second test, it doesn't have to do a full traversal of the branch and will stop pretty quickly if it's on the wrong track (only heading down the "near" child nodes until it hits a leaf.)
To compute the distance between the point p and the line splitting the space through the node n, we can simply "project" the point down onto the axis by copying the appropriate coordinate as the axes are all orthogonal (horizontal or vertical).
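As a rough illustration of that test (the node layout with 'point' and 'axis' keys is just an assumption for this sketch, not the gist's actual structure):

def euclidean(a, b):
    # Straight-line distance between two points given as tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def should_check_away_side(p, best, node):
    # The splitting line through the node is axis-aligned, so the distance from
    # p to that line is just the difference along the splitting coordinate
    # (i.e. "project" p onto the axis).
    axis = node['axis']
    distance_to_line = abs(p[axis] - node['point'][axis])
    # Only cross over if the error "sphere" around p reaches the splitting line.
    return euclidean(p, best) > distance_to_line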
Let's consider an example. For simplicity, take d = 2; the resulting kd-tree is shown below.
Your query point is Q and you want to find its k nearest neighbours.
The tree above is a representation of the kd-tree.
We search down the tree until we fall into one of the regions. In a kd-tree, each region is represented by a single point.
Then we find the distance between this point and the query point.
Then we draw a circle with that distance as its radius, to check whether there is any point nearer to the query point.
Finally, for the splitting axes that fall within that circle, we backtrack into those subtrees and look for nearer points.
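As a hedged sketch of that backtracking search extended to k neighbours (the dict-based node layout and helper names here are assumptions, not anything from the figure), keeping the current best k candidates in a max-heap:

import heapq

def knn(node, query, k, heap=None):
    # heap holds (-distance, point) pairs, so the worst of the current k
    # candidates sits at heap[0] and can be replaced cheaply.
    if heap is None:
        heap = []
    if node is None:
        return heap

    d = sum((q - c) ** 2 for q, c in zip(query, node['point'])) ** 0.5
    if len(heap) < k:
        heapq.heappush(heap, (-d, node['point']))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, node['point']))

    # Descend first into the region the query point falls into.
    axis = node['axis']
    if query[axis] <= node['point'][axis]:
        near, away = node['left'], node['right']
    else:
        near, away = node['right'], node['left']
    knn(near, query, k, heap)

    # Backtrack: if the circle around the query (radius = distance to the worst
    # kept candidate) crosses the splitting line, the other side may still
    # contain closer points and must be searched too.
    distance_to_line = abs(query[axis] - node['point'][axis])
    if len(heap) < k or distance_to_line < -heap[0][0]:
        knn(away, query, k, heap)

    return heap

Calling knn(root, Q, k) and sorting the returned heap by distance would give the k nearest points in this sketch.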
I was looking at crossover techniques for route optimization and have gone through a few of them, as mentioned below:
1 - single point crossover
2 - two point crossover
3 - uniform crossover
In single point crossover, we generally swap one variable from each parent to get the child, and the same goes for two point crossover, where we swap two variables between the two parents.
In my problem, the parents' lengths are not the same, for example p1: ['a','b','c'] and p2: ['v','n','m','h','k']. Since the lengths of the two parents differ, I was able to use single point crossover based on an even/odd technique.
Now I want to use uniform crossover with masking, and I am finding it difficult to apply with different lengths.
Any suggestions ?
What length are the offspring to be? If they are to be the same length as the parents, then you could just do a normal uniform order crossover. For example:
[a,b,c] = p1
[v,n,m,h,k] = p2
[0,0,1,0,1] = mask (this should be random)
[v,n,c] = o1
[a,b,m,h,k] = o2
You could even randomly place where the smaller one sits on the mask for example:
[-,-,v,n,c]
[a,b,m,h,k]
so offspring would be
[v,h,c]
[a,b,m,n,k]
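A rough Python sketch of this idea, assuming the offspring keep the parents' lengths and the shorter parent is placed at a random offset on the mask (function and variable names are just illustrative):

import random

def uniform_crossover_unequal(p1, p2):
    # Ensure p1 is the shorter parent; o1 gets p1's length, o2 gets p2's length.
    if len(p1) > len(p2):
        p1, p2 = p2, p1
    offset = random.randint(0, len(p2) - len(p1))  # where the short parent sits on the mask
    mask = [random.randint(0, 1) for _ in p2]      # one random bit per position of the long parent

    o1, o2 = [], list(p2)
    for i in range(len(p1)):
        j = i + offset            # aligned position in the long parent
        if mask[j]:
            o1.append(p1[i])      # mask 1: each offspring keeps its own gene here
        else:
            o1.append(p2[j])      # mask 0: swap the genes at this position
            o2[j] = p1[i]
    return o1, o2

With offset 0 and the mask [0,0,1,0,1], this reproduces the first example above: o1 = ['v','n','c'] and o2 = ['a','b','m','h','k'].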
There exists a set of points (or items, it doesn't matter). Each point a is at a specific distance from other points in the set. The distance can be retrieved via the function retrieve_dist(a, b).
This question is about programming (in Python) an algorithm to pick a point, with replacement, from this set of points. The picked point:
i) has to be at the maximum possible distance from all already-selected points, while adhering to the requirement in (ii)
ii) the number of times an already-selected point occurs in the sample must carry weight in this calculation. I.e. more frequently-selected points should be weighed more heavily.
E.g. imagine a and b have already been selected (100 and 10 times respectively). Then when the next point is to be selected, its distance from a matters more than its distance from b, in line with the frequency of occurrence of a in the already-selected sample.
What I can try:
This would have been easy to accomplish if weights/frequencies weren't in play. I could do:
from collections import defaultdict

distances = defaultdict(int)
for new_point in set_of_points:
    for already_selected_point in selected_points:
        distances[new_point] += retrieve_dist(new_point, already_selected_point)
Then I'd sort distances.items() by the second entry in each tuple, and would get the desired item to select.
However, when frequencies of already-selected points come into play, I just can't seem to wrap my head around this problem.
Can an expert help out? Thanks in advance.
A solution to your problem would be to make selected_points a list rather than a set. In this case, each new point is compared to a and b (and all other points) as many times as they have already been found.
If each point is typically found many times, it might be possible to improve performance by using a dict instead, with the keys being the points and the values being the number of times each point has been selected. In that case I think your algorithm would be:
from collections import defaultdict

distances = defaultdict(int)
for new_point in set_of_points:
    for already_selected_point, occurrences in selected_points.items():
        distances[new_point] += occurrences * retrieve_dist(new_point, already_selected_point)
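Wrapping that up, a minimal sketch of how the next point could then be chosen (the function name is illustrative; retrieve_dist and the frequency dict are as described above):

from collections import defaultdict

def pick_next(set_of_points, selected_points, retrieve_dist):
    # selected_points: dict (or collections.Counter) mapping each already-selected
    # point to how many times it has been selected.
    distances = defaultdict(float)
    for new_point in set_of_points:
        for picked, count in selected_points.items():
            distances[new_point] += count * retrieve_dist(new_point, picked)
    # The next pick is the candidate with the largest frequency-weighted total distance.
    return max(distances, key=distances.get)

After picking, increment the chosen point's count (selected_points[pick] += 1) and repeat.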
Objective: Given a coordinate X, find "n" nearest line-polygon for coordinate X, not just "n" nearest points. Example: https://i.imgur.com/qyxV2MF.png
I have a group of spatial line-polygons which can have more than 2 coordinates. Their coordinates are stored in a (scipy)KDtree to enable NN search.
First, I will query the "i" nearest coordinates and then look up the corresponding line-polygons; those "i" coordinates will not necessarily produce "i" lines.
In order to get "n" nearest lines, I will need to increase "i". My problem is that "i" can be unpredictable, because the number of coordinates varies between line-polygons. For example, one line-polygon can be represented by 2 coordinates, while another is represented by 10. Most of the time, I only need the 2 nearest neighboring line-polygons to point X.
In the example image, I need lines A and B as my result. Even with "i" = 3, only line A will be found, because A1, A2 and A3 are the nearest neighbors to X.
Question: Is there a way to group coordinates of a shape together and then carry out NN search to get "n" unique shapes? (besides brute forcing "i" to ensure "n" unique shapes)
Current workaround pseudocode:
found = []
while True:
    if first_loop:
        result = look up N nearest coords
    else:
        result = look up Nth nearest coord
    look up shapes using result and append to found
    perform de-duplication of found
    if len(found) >= required:
        return found
    else:
        N = N + 1  # to check the Nth neighbor next iteration
If I understood your question correctly, it's a problem of having the right data structures.
Let's use the following data structures:
1. A dictionary from line-polygons to points
2. Another dictionary from points to line-polygons (or, equivalently, a single bidirectional map from bidict instead of a couple of dictionaries)
3. A boolean array visited, with size equal to the number of points
Now the following algorithm should solve your problem (and it can be implemented efficiently with the above data structures):
For all the points, initialize the visited array to False.
Find the nearest point to the query point from the kd-tree, mark the matched point and all the points of the polygon it belongs to as visited, and return that particular polygon (id) as the nearest polygon (if there are multiple such polygons, return all of them).
Repeat step 2 until n such (distinct) polygons are returned. Consider a new point returned from the kd-tree as matched with the query point iff it has not yet been visited (if a matched point returned by the kd-tree is already visited, discard it and query the next nearest point). Once a point is matched, mark it and all the points of the corresponding polygon(s) as visited and return the polygon(s).
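A hedged sketch of that procedure using scipy's cKDTree; coords, point_to_polygon and polygon_to_points are assumed inputs standing in for the data structures above, not anything from the question:

import numpy as np
from scipy.spatial import cKDTree

def n_nearest_polygons(query, coords, point_to_polygon, polygon_to_points, n):
    # coords: (num_points, 2) array of all polygon vertices;
    # point_to_polygon: point index -> polygon id;
    # polygon_to_points: polygon id -> list of point indices.
    tree = cKDTree(coords)
    visited = np.zeros(len(coords), dtype=bool)
    found = []
    k = 1
    while len(found) < n and k <= len(coords):
        # Ask for one more neighbour each round; results come back sorted by distance.
        dists, idxs = tree.query(query, k=k)
        idx = np.atleast_1d(idxs)[-1]  # the k-th nearest point overall
        if not visited[idx]:
            poly = point_to_polygon[idx]
            found.append(poly)
            # Mark every point of this polygon so it is not returned again.
            visited[list(polygon_to_points[poly])] = True
        k += 1
    return found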
I see two ways of doing this efficiently:
Index the complete "line-polygons": For this, you could bound each line-polygon by a minimum bounding rectangle and then index all the rectangles with an appropriate index structure like an R-tree (a small sketch follows below). Instead of points, you will then have line-polygons at the lowest level, so you will have to adapt the distance computation for this case.
Use distance browsing: The idea here is to attach to each point the id of its line-polygon and index the points in an index structure (e.g., a KD-tree). Then you successively retrieve the next nearest point to your query using distance browsing, and proceed until you have found points from n different line-polygons.
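For the first option, a minimal sketch with the rtree package (the polygons list of coordinate lists is an assumed input). Note that nearest here ranks by bounding-box distance, so exact results would still need the true geometric distance checked on the returned candidates:

from rtree import index

def build_rtree(polygons):
    # polygons: list of coordinate lists, e.g. [[(x1, y1), (x2, y2), ...], ...]
    idx = index.Index()
    for poly_id, coords in enumerate(polygons):
        xs = [x for x, _ in coords]
        ys = [y for _, y in coords]
        # Insert the minimum bounding rectangle of each line-polygon.
        idx.insert(poly_id, (min(xs), min(ys), max(xs), max(ys)))
    return idx

def n_nearest_by_bbox(idx, point, n):
    x, y = point
    # Query with a degenerate rectangle; ids come back ordered by distance
    # between the query point and each bounding box.
    return list(idx.nearest((x, y, x, y), num_results=n))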
I am trying to find all the nearest neighbors which are within a 1 km radius. Here is my script to construct the tree and search for the nearest points:
import pysal
from pysal.cg.kdtree import KDTree

def construct_tree(s):
    data_geopoints = [tuple(x) for x in s[['longitude','latitude']].to_records(index=False)]
    tree = KDTree(data_geopoints, distance_metric='Arc', radius=pysal.cg.RADIUS_EARTH_KM)
    return tree

def get_neighbors(s, tree):
    indices = tree.query_ball_point(s, 1)
    return indices

# Constructing the tree for the search
tree = construct_tree(data)

# Finding the nearest neighbours within 1 km
data['neighborhood'] = data['lat_long'].apply(lambda row: get_neighbors(row, tree))
From what I read on the pysal page, it says:
kd-tree built on top of kd-tree functionality in scipy. If using scipy
0.12 or greater uses the scipy.spatial.cKDTree, otherwise uses scipy.spatial.KDTree.
In my case it should be using cKDTree. This works fine for a sample dataset, but tree.query_ball_point returns a list of indices for each point, and each list can have hundreds of elements. For my data (2 million records), this grows bigger and bigger and eventually stops due to a memory issue after a certain point. Any idea on how to solve this?
Just in case anyone is looking for an answer to this: I solved it by finding the nearest neighbours for one group at a time (tree.query_ball_point can handle batches), writing the results to a database, and then processing the next group, rather than keeping everything in memory. Thanks.
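For reference, a rough sketch of that batching approach (the chunk size and the yield-then-write pattern are illustrative, not from the original code):

def neighbors_in_batches(tree, points, chunk_size=10000):
    # query_ball_point accepts an array of points, so process one chunk at a
    # time and hand each chunk's result off (e.g. to a database) instead of
    # keeping every index list for 2 million records in memory at once.
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        neighbor_lists = tree.query_ball_point(chunk, 1)  # 1 km radius, as before
        yield start, neighbor_lists  # caller writes these out, then they can be freed

Each yielded chunk can then be written out by whatever database layer you use before the next chunk is queried.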
I'm trying to calculate a kind of fuzzy Jaccard index between two sets with the following rationale: as with the Jaccard index, I want to calculate the ratio between the number of items that are common to both sets and the total number of different items in both sets. The problem is that I want to use a similarity function with a threshold to determine what counts as the "same" item being in both sets, so that items that are similar:
Aren't counted twice in the union
Are counted in the intersection.
I have a working implementation here (in python):
import itertools

def fuzzy_jaccard(set1, set2, similarity, threshold):
    intersection_size = union_size = len(set1 & set2)
    shorter_difference, longer_difference = sorted([set2 - set1, set1 - set2], key=len)
    while len(shorter_difference) > 0:
        # Pick the most similar pair across the two difference sets.
        item1, item2 = max(
            itertools.product(longer_difference, shorter_difference),
            key=lambda pair: similarity(*pair)
        )
        longer_difference.remove(item1)
        shorter_difference.remove(item2)
        if similarity(item1, item2) > threshold:
            union_size += 1
            intersection_size += 1
        else:
            union_size += 2
    union_size = union_size + len(longer_difference)
    return intersection_size / union_size
The problem here is that this is quadratic in the size of the sets, because in itertools.product I iterate over all possible pairs of items, one taken from each set(*). Now, I think I must do this because I want to match each item a from set1 with the best possible candidate b from set2 that isn't more similar to another item a' from set1.
I have a feeling that there should be a O(n) way of doing that I'm not grasping. Do you have any suggestions?
There are other issues too, like recalculating the similarity for each pair once I get the best match, but I don't care too much about them.
I doubt there's any way that would be O(n) in the general case, but you can probably do a lot better than O(n^2) at least for most cases.
Does the similarity behave like a distance obeying the triangle inequality? By this I mean: can you assume that distance(a, c) <= distance(a, b) + distance(b, c)? If not, this answer probably won't help. I'm treating similarities like distances.
Try clumping the data:
Pick a radius r. Based on intuition, I suggest setting r to one-third of the average of the first 5 similarities you calculate, or something.
The first point you pick in set1 becomes the centre of your first clump. Classify the points in set2 as being in the clump (similarity to the centre point <= r) or outside the clump. Also keep track of points that are within 2r of the clump centre.
You can require that clump centre points be at least a distance of 2r from each other; in that case some points may not be in any clump. I suggest making them at least r from each other. (Maybe less if you're dealing with a large number of dimensions.) You could treat every point as a clump centre but then you wouldn't save any processing time.
When you pick a new point, first compare it with the clump centre points (even though they're in the same set). Either it's in an already existing clump, or it becomes a new clump centre, (or perhaps neither if it's between r and 2r of a clump centre). If it's within r of a clump centre, then compare it with all points in the other set that are within 2r of that clump centre. You may be able to ignore points further than 2r from the clump centre. If you don't find a similar point within the clump (perhaps because the clump has no points left), then you may have to scan all the rest of the points for that case. Hopefully this would mostly happen only when there aren't many points left in the set. If this works well, then in most cases you'd find the most similar point within the clump and would know that it's the most similar point.
This idea may require some tweaking.
If there are a large number of dimensions involved, then you might find that for a given radius r, frustratingly many points are within 2r of each other while few are within r of each other.
Here's another algorithm. The more time-consuming it is to calculate your similarity function (as compared to the time it takes to maintain sorted lists of points) the more index points you might want to have. If you know the number of dimensions, it might make sense to use that number of index points. You might reject a point as a candidate index point if it's too similar to another index point.
For the first point you use, and any others you decide to use as index points, generate a list of all the remaining points in the other set, sorted in order of distance from the index point.
When you're comparing a point P1 to points in the other set, I think you can skip over points for two possible reasons. Consider the most similar point P2 you've found to P1. If P2 is similar to an index point, then you can skip all points which are sufficiently dissimilar from that index point. If P2 is dissimilar to an index point, then you can skip over all points which are sufficiently similar to that index point. I think in some cases you can skip over some of both types of point for the same index point.
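A rough sketch of that index-point pruning, assuming the similarity behaves like a distance obeying the triangle inequality (all names here are illustrative):

import bisect

def build_index(index_point, candidates, dist):
    # Precompute each candidate's distance to the index point once, sorted so we
    # can binary-search on distance later. Parallel lists avoid comparing points.
    pairs = sorted(((dist(index_point, c), c) for c in candidates), key=lambda t: t[0])
    return [d for d, _ in pairs], [c for _, c in pairs]

def plausible_candidates(dists, cands, d_query_to_index, best_so_far):
    # Triangle inequality: a candidate c can only be closer to the query than
    # best_so_far if |dist(index, c) - dist(index, query)| <= best_so_far,
    # so everything outside that window can be skipped without computing dist.
    lo = bisect.bisect_left(dists, d_query_to_index - best_so_far)
    hi = bisect.bisect_right(dists, d_query_to_index + best_so_far)
    return cands[lo:hi]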