I have a list of Shapely polygons and a point like so:
from shapely.geometry import Point, Polygon
polygons = [Polygon(...), Polygon(...), ...]
point = Point(2.5, 5.7)
and I want to find the polygon in the list that is closest to that point. I'm already aware of the object.distance(other) method, which returns the minimum distance between two geometric shapes, and I thought about computing all the distances in a loop to find the closest polygon:
polygons = [Polygon(...), Polygon(...), ...]
point = Point(2.5, 5.7)
min_dist = 10000
closest_polygon = None
for polygon in polygons:
    dist = polygon.distance(point)
    if dist < min_dist:
        min_dist = dist
        closest_polygon = polygon
My question is: Is there a more efficient way to do it?
There is a shorter way, e.g.
from shapely.geometry import Point, Polygon
import random
from operator import itemgetter
def random_coords(n):
    return [(random.randint(0, 100), random.randint(0, 100)) for _ in range(n)]

polys = [Polygon(random_coords(3)) for _ in range(4)]
point = Point(random_coords(1))
min_distance, min_poly = min(((poly.distance(point), poly) for poly in polys), key=itemgetter(0))
As Georgy mentioned in the comments, this can be even more concise:
min_poly = min(polys, key=point.distance)
Note, however, that the distance computation is in general computationally expensive, and every polygon in the list is still visited.
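If the list is large and you query many points, a spatial index avoids scanning every polygon per query. A minimal sketch using Shapely's built-in STRtree (assuming Shapely 2.0, where nearest returns the index of the nearest geometry; in 1.8 it returned the geometry itself):
from shapely.geometry import Point, Polygon
from shapely.strtree import STRtree

polys = [Polygon([(0, 0), (1, 0), (1, 1)]),
         Polygon([(5, 5), (6, 5), (6, 6)])]  # sample polygons (assumed)
tree = STRtree(polys)          # build the index once
point = Point(2.5, 5.7)
idx = tree.nearest(point)      # index of the nearest polygon (Shapely 2.x)
closest_polygon = polys[idx]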
I have a solution that works if you have at least two polygons whose distance to each other is nonzero. Let's call these two polygons "basePolygon0" and "basePolygon1". The idea is to build a KD-tree from the distance of each polygon to each of the two "basis" polygons.
Once the KD-tree has been built, we can query it by computing the distance of the query point to each of the basis polygons.
Here's a working example:
from shapely.geometry import Point, Polygon
import numpy as np
from scipy.spatial import KDTree
# prepare a test with triangles
poly0 = Polygon([(3,-1),(5,-1),(4,2)])
poly1 = Polygon([(-2,1),(-4,2),(-3,4)])
poly2 = Polygon([(-3,-3),(-4,-6),(-2,-6)])
poly3 = Polygon([(-1,-4),(1,-4),(0,-1)])
polys = [poly0,poly1,poly2,poly3]
p0 = Point(4,-3)
p1 = Point(-4,1)
p2 = Point(-4,-2)
p3 = Point(0,-2.5)
testPoints = [p0,p1,p2,p3]
# select basis polygons
# it works with any pair of polygons that have non zero distance
basePolygon0 = polys[0]
basePolygon1 = polys[1]
# compute tree query
def buildQuery(point):
    distToBasePolygon0 = basePolygon0.distance(point)
    distToBasePolygon1 = basePolygon1.distance(point)
    return np.array([distToBasePolygon0, distToBasePolygon1])
distances = np.array([buildQuery(poly) for poly in polys])
# build the KD tree
tree = KDTree(distances)
# test it
for p in testPoints:
    q = buildQuery(p)
    output = tree.query(q)
    print(output)
This yields, as expected (note the reported distance is measured in the embedded distance-to-basis space, and the second value is the index, so the nearest polygon is polys[output[1]]):
# (distance in embedded space, polygon index in the KD-tree)
(2.0248456731316584, 0)
(1.904237866994273, 1)
(1.5991500555008626, 2)
(1.5109986459170694, 3)
There is one way that might be faster, but without doing any actual tests, it's hard for me to say for sure.
This might not work for your situation, but the basic idea is that each time a Shapely object is added to the array, you adjust the positions of the array elements so that it always stays "sorted" in this manner. In Python, this can be done with the heapq module. The only issue with that module is that it's hard to supply a custom comparison function, so you would have to do something like this answer, where you make a custom class that wraps each object in a tuple whose first element is the key.
import heapq

class MyHeap(object):
    def __init__(self, initial=None, key=lambda x: x):
        self.key = key
        if initial:
            self._data = [(key(item), item) for item in initial]
            heapq.heapify(self._data)
        else:
            self._data = []

    def push(self, item):
        heapq.heappush(self._data, (self.key(item), item))

    def pop(self):
        return heapq.heappop(self._data)[1]
The first element in the tuple is a "key", which in this case would be the distance to the point, and then the second element would be the actual Shapely object, and you could use it like so:
point = Point(2.5, 5.7)
heap = MyHeap(initial=None, key=lambda x:x.distance(point))
heap.push(Polygon(...))
heap.push(Polygon(...))
# etc...
And at the end, the object you're looking for will be returned by heap.pop().
Ultimately, though, both approaches are roughly linear (the simple scan is O(n); heapifying an existing list is O(n), although pushing the items one by one is O(n log n)), so any speed-up would not be a significant one.
Related
I have a list of unordered points (2D) and I want to calculate the sum of distances between them.
My background is C++ development, so I would do it like this:
import math

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

def distance(P1, P2):
    return math.sqrt((P2.x-P1.x)**2 + (P2.y-P1.y)**2)

points = [Point(rand(1), rand(1)) for i in range(10)]

# this part should be done in a nicer way
pathLen = 0
for i in range(1, 10):
    pathLen += distance(points[i-1], points[i])
Is there a more Pythonic way to replace the for loop, e.g. with reduce or something like that?
You can use a generator expression with sum, zip and itertools.islice to avoid duplicating data:
from itertools import islice
pathLen = sum(distance(x, y) for x, y in zip(points, islice(points, 1, None)))
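For reference, a self-contained sketch of the same one-liner, reusing the Point class and distance function from the question with fixed sample data:
from itertools import islice
import math

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

def distance(P1, P2):
    return math.sqrt((P2.x - P1.x)**2 + (P2.y - P1.y)**2)

points = [Point(0, 0), Point(3, 4), Point(3, 8)]
# pair each point with its successor and sum the segment lengths
pathLen = sum(distance(x, y) for x, y in zip(points, islice(points, 1, None)))
print(pathLen)  # 9.0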
A few fixes, as a C++ approach is probably not the best here:
import math
# you need this import here, Python has no rand in the main namespace
from random import random

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    # there's usually no need to encapsulate variables in Python

def distance(P1, P2):
    # your distance formula was wrong
    # you were adding positions on each axis instead of subtracting them
    return math.sqrt((P1.x-P2.x)**2 + (P1.y-P2.y)**2)

points = [Point(random(), random()) for i in range(10)]

# use a sum over a list comprehension; start the range at 1
# so the last point isn't paired with the first
pathLen = sum([distance(points[i-1], points[i]) for i in range(1, len(points))])
Robin Zigmond's zip approach is also a neat way to achieve it, though it wasn't immediately obvious to me that it could be used here.
I ran into a similar problem and pieced together a numpy solution which I think works nicely.
Namely, if you cast your list of points as a numpy array you can then do the following:
import numpy as np

pts = np.asarray(points)
dist = np.sqrt(np.sum((pts[np.newaxis, :, :] - pts[:, np.newaxis, :])**2, axis=2))
dist is then an n×n symmetric numpy array in which the distance from each point to every other point appears above and below the diagonal. The diagonal holds each point's distance to itself, so it is all zeros.
You can then use:
path_leng = np.sum(dist[np.triu_indices(pts.shape[0], 1)])
to collect the upper triangle of the array and sum its entries. (Note that this sums the distances over all pairs of points, not just consecutive ones.)
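The same all-pairs sum can also be obtained directly with SciPy; a short sketch, assuming pts is the (n, 2) array from above:
import numpy as np
from scipy.spatial.distance import pdist

pts = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 8.0]])  # sample coordinates (assumed)
# pdist returns the condensed distance matrix, i.e. exactly the upper triangle
total = np.sum(pdist(pts))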
I've got a dataset of refineries in Texas (GeoJSON here - https://pastebin.com/R0D9fif9 ):
Name,Latitude,Longitude
Marathon Petroleum,29.374722,-94.933611
Marathon Petroleum,29.368733,-94.903253
Valero,29.367617,-94.909515
LyondellBasell,29.71584,-95.234814
Valero,29.722213,-95.255198
Exxon,29.743865,-95.009208
Shell,29.720425,-95.12495
Petrobras,29.722466,-95.208807
I would like to create a printed map out of these points. But they lie too closely together at a given resolution.
Since every refinery should get mentioned in the legend, I can't cluster. So I would like to
Get the centroid - that was easy
import json
import csv
from shapely.geometry import shape, Point, MultiPoint

with open('refineries.csv', 'rU') as infile:
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

listo = list(zip(data['Longitude'], data['Latitude']))
points1 = MultiPoint(points=listo)
points = MultiPoint([(-94.933611, 29.374722), (-94.903253, 29.368733), (-94.909515, 29.367617), (-95.234814, 29.71584), (-95.255198, 29.722213), (-95.009208, 29.743865), (-95.12495, 29.720425), (-95.208807, 29.722466)])
print(points.centroid)
Shift all points away from the centroid until a minimum distance between all is reached
Could you please help me here? Thanks in advance!
It depends on how exactly you want to shift the points away from the centroid. One way would be to calculate for each point its great-circle distance and azimuth with respect to the centroid and rescale all distances in order to ensure that the distance between the two closest points is larger than a specified threshold. In the example below, pyproj is used for the calculation of the azimuths and distances.
import json
import csv
import sys
from shapely.geometry import shape, Point, MultiPoint
from pyproj import Geod

with open('refineries.csv', 'rU') as infile:
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            if header not in data:
                data[header] = []
            data[header].append(value)

listo = list(zip(map(float, data['Longitude']), map(float, data['Latitude'])))

def scale_coords(coords, required_dist=1000.):
    g = Geod(ellps='WGS84')
    num_of_points = len(coords)
    # calculate the centroid
    C = MultiPoint(coords).centroid
    # determine the minimum distance among points
    dist_min, dist_max = float('inf'), float('-inf')
    for i in range(num_of_points):
        lon_i, lat_i = coords[i]
        for j in range(i+1, num_of_points):
            lon_j, lat_j = coords[j]
            _, _, dist = g.inv(lon_i, lat_i, lon_j, lat_j)
            dist_min = min(dist_min, dist)
            dist_max = max(dist_max, dist)
    # nothing to do if all points are already far enough apart
    if dist_min > required_dist:
        return coords
    coords_scaled = [None]*num_of_points
    scaling = required_dist / dist_min
    # move each point radially away from the centroid
    for i, (lon_i, lat_i) in enumerate(coords):
        az, _, dist = g.inv(C.x, C.y, lon_i, lat_i)
        lon_f, lat_f, _ = g.fwd(C.x, C.y, az, dist*scaling)
        coords_scaled[i] = (lon_f, lat_f)
    return coords_scaled
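For completeness, a usage sketch, assuming listo holds the float coordinates read above (the 2000 m threshold is just an example value):
# spread the points so that no two end up closer than ~2 km
coords_spread = scale_coords(listo, required_dist=2000.)
for lon, lat in coords_spread:
    print(lon, lat)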
Alternatively, this might be combined with an approach in which you also relax the azimuths. That would in principle result in a smaller scaling factor for the "radial" distances, but it would also slightly distort the "visual distribution" of the points. The method presented above might also be "improved" by ignoring outlier points in the rescaling, i.e., points which are already sufficiently far from the centroid and have no nearby neighbors.
My goal is to find the nearest x, y point coordinate for every pixel. Based on that, I have to colour the pixels.
Here is what I have tried.
The below code will draw the points.
import numpy as np
import matplotlib.pyplot as plt
points = np.array([[0,40],[0,0],[5,30],[4,10],[10,25],[20,5],[30,35],[35,3],[50,0],[45,15],[40,22],[50,40]])
print (points)
x1, y1 = zip(*points)
plt.plot(x1,y1,'.')
plt.show()
Now I need to find the nearest point for each pixel.
I found something like the following, where I have to give each pixel's coordinates manually to get the nearest point.
from scipy import spatial
import numpy as np
A = np.random.random((10,2))*100
print (A)
pt = np.array([[6, 30],[9,80]])
print (pt)
for each in pt:
    A[spatial.KDTree(A).query(each)[1]]  # <-- the nearest point
    distance, index = spatial.KDTree(A).query(each)
    print(distance)  # <-- the distance to the nearest neighbor
    print(index)     # <-- the location of the neighbor
    print(A[index])
The output will be like this,
[[1.76886192e+01 1.75054781e+01]
[4.17533199e+01 9.94619127e+01]
[5.30943347e+01 9.73358766e+01]
[3.05607891e+00 8.14782701e+01]
[5.88049334e+01 3.46475520e+01]
[9.86076676e+01 8.98375851e+01]
[9.54423012e+01 8.97209269e+01]
[2.62715747e+01 3.81651805e-02]
[6.59340306e+00 4.44893348e+01]
[6.66997434e+01 3.62820929e+01]]
[[ 6 30]
[ 9 80]]
14.50148095039858
8
[ 6.59340306 44.48933479]
6.124988197559344
3
[ 3.05607891 81.4782701 ]
Instead of giving each point manually, I want to take each pixel from the image and find the nearest blue point. This is my first question.
After that I want to classify those points into two categories:
based on pixel and point, I want to colour it; basically I want to run a clustering on it.
This is not in proper form, but in the end I want it like this.
Thanks in advance, guys.
Use cKDTree instead of KDTree, which is faster (see this answer).
You can give the kdtree an array of points to query instead of looping over all of them.
Constructing a kdtree is a costly operation compared to querying it, so construct it once and query many times.
Compare the following two code snippets; in my tests the second one ran about 800 times faster.
import numpy as np
from scipy import spatial
from timeit import default_timer as timer

np.random.seed(0)
A = np.random.random((1000, 2)) * 100
pt = np.random.randint(0, 100, (100, 2))

start1 = timer()
for each in pt:
    A[spatial.KDTree(A).query(each)[1]]
    distance, index = spatial.KDTree(A).query(each)
end1 = timer()
print(end1 - start1)

start2 = timer()
kdt = spatial.cKDTree(A)  # cKDTree + construction outside the loop
distance, index = kdt.query(pt)
A[index]
end2 = timer()
print(end2 - start2)
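To address the "every pixel" part of the question: build the full grid of pixel coordinates once and query them all in a single call. A sketch, assuming a 100x100 image and the points array from the question:
import numpy as np
from scipy import spatial

points = np.array([[0, 40], [0, 0], [5, 30], [4, 10], [10, 25], [20, 5]])
h, w = 100, 100                                     # assumed image size
yy, xx = np.mgrid[0:h, 0:w]                         # coordinates of every pixel
pixels = np.column_stack([xx.ravel(), yy.ravel()])
kdt = spatial.cKDTree(points)
dist, idx = kdt.query(pixels)                       # nearest point per pixel
nearest_map = idx.reshape(h, w)                     # per-pixel index of the nearest point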
You can use scikit-learn for this:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1)
labels = list(range(len(points)))
neigh.fit(points, labels)
pred = neigh.predict(np.random.random((10,2))*50)
If you want the points themselves and not their class labels, you can do:
points[pred]
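Since each point is its own class here, scikit-learn's unsupervised NearestNeighbors is arguably a more direct fit; a sketch under the same assumptions:
import numpy as np
from sklearn.neighbors import NearestNeighbors

points = np.array([[0, 40], [0, 0], [5, 30], [4, 10]])  # sample points (assumed)
nn = NearestNeighbors(n_neighbors=1).fit(points)
queries = np.random.random((10, 2)) * 50
dist, idx = nn.kneighbors(queries)  # distances and indices of the nearest points
nearest = points[idx.ravel()]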
I'm trying to create regions of polygons on the condition that they touch. In my example I have an example dataset with 382 polygons that need to be grouped together (but the full dataset contains 6355 polygons). (I would show a picture, but I don't have enough reputation to do that..)
I thought of doing this by brute force, but of course that takes very long and is not very optimal.
def groupBuildings(blds):
    # blds is a list of shapely polygons
    groups = []
    for bld in blds:
        group = []
        group.append(bld)
        for other in blds:
            for any in group:
                if any != other and any.intersects(other):
                    group.append(other)
        groups.append(group)
    return groups
I learned about region growing and thought that it would be a possible solution, but the performance is still terrible. I've implemented it in the following way:
def groupBuildings(blds):
    # blds is a list of shapely polygons
    others = blds
    groups = []
    while blds != []:
        done = []
        group = []
        first = blds.pop(0)
        done.append(first)
        group.append(first)
        for other in others:
            if (other in blds) and first.touches(other):
                group.append(other)
                blds.remove(other)
        groups.append(group)
    return groups
But I think the problem here is that I don't have any nearest neighbors, so I still have to iterate over every building twice.
So my question is: are nearest neighbors essential for region growing? Or is there another way of doing this efficiently?
You will be best served using shapely.ops.cascaded_union() (in recent versions of Shapely this is deprecated in favor of the equivalent shapely.ops.unary_union()).
from shapely.geometry import Point, Polygon, MultiPolygon
from shapely.ops import cascaded_union
import numpy as np
polygons = [Point(200*x,200*y).buffer(b) for x,y,b in np.random.random((6000,3))]
multi = MultiPolygon(polygons)
unioned = cascaded_union(multi)
%%timeit
unioned = cascaded_union(multi)
# 2.8 seconds for me
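Each connected cluster of touching polygons collapses into a single polygon of the unioned result, so the groups can be read off from it. A sketch, assuming the unioned and polygons objects from above (the membership test is a simple O(n*m) illustration, not tuned for speed):
# one part per connected group of touching polygons
parts = list(unioned.geoms) if unioned.geom_type == 'MultiPolygon' else [unioned]

# map every original polygon to the part (group) it belongs to
group_of = [next(i for i, part in enumerate(parts) if part.intersects(poly))
            for poly in polygons]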
My data object is an instance of:
class data_instance:
    def __init__(self, data, tlabel):
        self.data = data          # 1xd numpy array
        self.true_label = tlabel  # integer {1,-1}
So far in my code, I have a list called data_history filled with data_instance objects, and a set of centers (a numpy array with shape (k, d)).
For a given data_instance new_data, I want to:
1/ Get the nearest center to new_data from centers (by Euclidean distance); let it be called Nearest_center.
2/ Iterate through data_history and:
2.1/ select the elements whose nearest center is Nearest_center (the result of 1/) into a list called neighbors.
2.2/ get the labels of the objects in neighbors.
Below is my code, which works but is still slow, and I am looking for something more efficient.
My Code
For 1/
def getNearestCenter(data, centers):
    if centers.shape != (1, 2):
        # compute the distance between data and all centers
        dist_ = np.sqrt(np.sum(np.power(data-centers, 2), axis=1))
        # return the center with the minimum distance from data
        center = centers[np.argmin(dist_)]
    else:
        center = centers[0]
    return center
For 2/ (To optimize)
def getLabel(dataPoint, C, history):
    labels = []
    cluster = getNearestCenter(dataPoint.data, C)
    for x in history:
        if np.all(getNearestCenter(x.data, C) == cluster):
            labels.append(x.true_label)
    return labels
You should rather use the optimized cdist from scipy.spatial.distance, which is more efficient than computing the distances by hand with numpy:
import numpy as np
from scipy.spatial.distance import cdist

dist = cdist(data, C, metric='euclidean')
dist_idx = np.argmin(dist, axis=1)
An even more elegant solution is to use scipy.spatial.cKDTree (as pointed out by @Saullo Castro in the comments), which can be faster for a large dataset:
from scipy.spatial import cKDTree
tr = cKDTree(C)
dist, dist_idx = tr.query(data, k=1)
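To cover step 2/ as well, you can compute the nearest-center index once for the whole history and filter by it. A sketch, assuming history is the list of data_instance objects and C is the (k, d) centers array (the function name is just for illustration):
import numpy as np
from scipy.spatial import cKDTree

def get_labels(new_data, C, history):
    tr = cKDTree(C)
    # nearest center index for the new point and for every point in history
    _, new_idx = tr.query(new_data.data.reshape(1, -1), k=1)
    hist = np.vstack([x.data for x in history])
    _, hist_idx = tr.query(hist, k=1)
    # labels of all history points that share the new point's nearest center
    return [x.true_label for x, i in zip(history, hist_idx) if i == new_idx[0]]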
Found it:
nearest_idx = np.argmin(np.sqrt(np.sum(np.power(data[:, None]-C, 2), axis=2)), axis=1)
This returns, for each point of data, the index of the nearest center in centers. (The sqrt is not actually needed for the argmin, since it is monotonic, but it does no harm.)