I'm trying to create regions of polygons on the condition that they touch. My example dataset has 382 polygons that need to be grouped together (the full dataset contains 6355 polygons). (I would show a picture, but I don't have enough reputation for that.)
I thought of doing this by brute force, but of course that takes very long and is far from optimal.
def groupBuildings(blds):
    # blds is a list with shapely polygons
    groups = []
    for bld in blds:
        group = []
        group.append(bld)
        for other in blds:
            for any in group:
                if any != other and any.intersects(other):
                    group.append(other)
        groups.append(group)
    return groups
I learned about region growing and thought that would be a possible solution, but the performance is still terrible. I implemented it in the following way:
def groupBuildings(blds):
    # blds is a list with shapely polygons
    others = blds
    groups = []
    while blds != []:
        done = []
        group = []
        first = blds.pop(0)
        done.append(first)
        group.append(first)
        for other in others:
            if (other in blds) and first.touches(other):
                group.append(other)
                blds.remove(other)
        groups.append(group)  # without this append, groups would stay empty
    return groups
But I think the problem here is that I don't have any nearest neighbors, so I still have to iterate over every building twice.
So my question is: are nearest neighbors essential for region growing? Or is there another way of doing this efficiently?
You will be best served using shapely.ops.cascaded_union() (see the Shapely docs).
from shapely.geometry import Point, Polygon, MultiPolygon
from shapely.ops import cascaded_union
import numpy as np
polygons = [Point(200*x,200*y).buffer(b) for x,y,b in np.random.random((6000,3))]
multi = MultiPolygon(polygons)
unioned = cascaded_union(multi)
%%timeit
unioned = cascaded_union(multi)
# 2.8 seconds for me
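Note that in recent Shapely versions unary_union supersedes the deprecated cascaded_union. Also, if you need the groups of original buildings rather than just the merged outlines, one option is to map each building to the union component that contains it. A minimal sketch, reusing the polygons list from above:
from shapely.ops import unary_union

# modern replacement for the deprecated cascaded_union
merged = unary_union(polygons)
# the union's components are exactly the merged regions
components = list(merged.geoms) if merged.geom_type == 'MultiPolygon' else [merged]

# assign every original polygon to the component that contains it
groups = [[] for _ in components]
for poly in polygons:
    for i, comp in enumerate(components):
        if comp.contains(poly.representative_point()):
            groups[i].append(poly)
            break
This is O(len(polygons) × len(components)); if there are many components, querying an STRtree built over them would cut that down.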
I have a polydata structure and its edges, extracted with the extract_feature_edges function, but they come back as unconnected cells (separate lines).
Is it possible to connect those cells (lines) at their common points and then get the different features (land masses and islands such as those you can see in the image: Antarctica, Australia, ...; they are paleo-continents)?
In summary, I would like to extract from my grid and its edges the different land parts as separate polydata. I have tried the python module shapely and its polygonize function; it works, but not with 3D coordinates (https://shapely.readthedocs.io/en/latest/reference/shapely.polygonize.html).
import pyvista as pv
! wget -q -nc https://thredds-su.ipsl.fr/thredds/fileServer/ipsl_thredds/brocksce/pyvista/mesh.vtk
mesh = pv.PolyData('mesh.vtk')
edges = mesh.extract_feature_edges(boundary_edges=True)
pl = pv.Plotter()
pl.add_mesh(pv.Sphere(radius=0.999, theta_resolution=360, phi_resolution=180))
pl.add_mesh(mesh, show_edges=True, edge_color="gray")
pl.add_mesh(edges, color="red", line_width=2)
viewer = pl.show(jupyter_backend='pythreejs', return_viewer=True)
display(viewer)
Any idea?
Here is a solution using vtk.vtkStripper() to join contiguous segments into polylines.
See thread from https://discourse.vtk.org/t/get-a-continuous-line-from-a-polydata-structure/9864
import pyvista as pv
import vtk
import random
! wget -q -nc https://thredds-su.ipsl.fr/thredds/fileServer/ipsl_thredds/brocksce/pyvista/mesh.vtk
mesh = pv.PolyData('mesh.vtk')
edges = mesh.extract_feature_edges(boundary_edges=True)
pl = pv.Plotter()
pl.add_mesh(pv.Sphere(radius=0.999, theta_resolution=360, phi_resolution=180))
pl.add_mesh(mesh, show_edges=True, edge_color="gray")
regions = edges.connectivity()
regCount = len(set(pv.get_array(regions, name="RegionId")))
connectivityFilter = vtk.vtkPolyDataConnectivityFilter()
stripper = vtk.vtkStripper()
for r in range(regCount):
    connectivityFilter.SetInputData(edges)
    connectivityFilter.SetExtractionModeToSpecifiedRegions()
    connectivityFilter.InitializeSpecifiedRegionList()
    connectivityFilter.AddSpecifiedRegion(r)
    connectivityFilter.Update()
    stripper.SetInputData(connectivityFilter.GetOutput())
    stripper.SetJoinContiguousSegments(True)
    stripper.Update()
    reg = stripper.GetOutput()
    random_color = "#" + ''.join([random.choice('0123456789ABCDEF') for i in range(6)])
    pl.add_mesh(reg, color=random_color, line_width=4)
viewer = pl.show(jupyter_backend='pythreejs', return_viewer=True)
display(viewer)
This has come up before in github discussions. The conclusion was that PyVista doesn't have anything built-in to reorder edges, but there might be third-party libraries that can do this (this answer mentioned libigl, but I have no experience with that).
I have some ideas on how to tackle this, but there are concerns about the applicability of such a helper in the generic case. In your specific case, however, we know that every edge is a closed loop, and that there aren't very many of them, so we don't have to worry about performance (and especially memory footprint) that much.
Here's a manual approach to reordering the edges by building an adjacency graph and walking until we end up where we started on each loop:
from collections import defaultdict
import pyvista as pv
# load example mesh
mesh = pv.read('mesh.vtk')
# get edges
edges = mesh.extract_feature_edges(boundary_edges=True)
# build undirected adjacency graph from edges (2-length lines)
# (potential performance improvement: use connectivity to only do this for each closed loop)
# (potentially via calling edges.split_bodies())
lines = edges.lines.reshape(-1, 3)[:, 1:]
adjacency = defaultdict(set) # {2: {1, 3}, ...} if there are lines from point 2 to point 1 and 3
for first, second in lines:
    adjacency[first].add(second)
    adjacency[second].add(first)
# start looping from whichever point, keep going until we run out of adjacent points
points_left = set(range(edges.n_points))
loops = []
while points_left:
    point = points_left.pop()  # starting point for next loop
    loop = [point]
    loops.append(loop)
    while True:
        # keep walking the loop
        neighb = adjacency[point].pop()
        loop.append(neighb)
        if neighb == loop[0]:
            # this loop is done
            break
        # make sure we never backtrack
        adjacency[neighb].remove(point)
        # bookkeeping
        points_left.discard(neighb)
        point = neighb
# assemble new lines based on the existing ones, flatten
lines = sum(([len(loop)] + loop for loop in loops), [])
# overwrite the lines in the original edges; optionally we could create a copy here
edges.lines = lines
# edges are long, closed loops by construction, so it's probably correct
# plot each curve with an individual colour just to be safe
plotter = pv.Plotter()
plotter.add_mesh(pv.Sphere(radius=0.999))
plotter.add_mesh(edges, scalars=range(edges.n_cells), line_width=3, show_scalar_bar=False)
plotter.enable_anti_aliasing('msaa')
plotter.show()
This code replaces your original 1760 2-length lines with 14 larger lines defining each loop. You have to be a bit careful, though: north of Australia you have a loop that self-intersects:
The intersection point appears 4 times instead of 2. This means that my brute-force solver doesn't give a well-defined result: it will choose a direction at the intersection randomly, and if by (bad) luck we start the loop from the intersection point, the algorithm will probably fail. Making it more robust is left as an exercise to the reader (my comment about splitting the edges into individual loops could help with this issue).
I have two different GeoDataFrames: one contains polygon squares in a large grid, the other contains fewer, larger polygons.
I wish to calculate the area of overlap within each of the grid squares with the larger polygons.
To do so, I made a simple loop method:
for _, patch in tqdm(layer.iterrows(), total=layer.shape[0], desc=name):
    # Index of intersecting squares
    idx = joined.intersects(patch.geometry)
    intersection_polygon = joined[idx].intersection(patch.geometry)
    area_of_intersection = intersection_polygon.area
    joined.loc[idx, "value"] += area_of_intersection
In an attempt to speed up this method, I converted the layer DataFrame, which contains the larger patches, to a Dask DataFrame.
I implemented it the following way:
def multi_area(patch, joined=None):
    # Index of intersecting squares
    idx = joined.intersects(patch.geometry)
    intersection_polygon = joined[idx].intersection(patch.geometry)
    area_of_intersection = intersection_polygon.area
    joined.loc[idx, "value"] += area_of_intersection
    return joined["value"]
import dask_geopandas
from dask.diagnostics import ProgressBar

layer_dask = dask_geopandas.from_geopandas(layer, npartitions=8)
with ProgressBar():
    joined["value"] = layer_dask.apply(multi_area, meta=joined, joined=joined, axis=1).compute(scheduler='multiprocessing')
This, however, returns the error AttributeError: 'GeoDataFrame' object has no attribute 'name', and at this point I am unsure whether this is the optimal approach and what I am doing wrong.
The job I will be doing will have 400 million grid squares, so I am planning on batching this calculation out over smaller areas later, as I can't come up with a smarter way of doing it...
I managed to speed up the process quite a bit using spatial joins and overlay as suggested by Michael in the comments.
In addition, I implemented Dask DataFrames, so the final code becomes:
import dask_geopandas as dg
import geopandas as gpd
import numpy as np
import rasterio
import shapely.geometry
import xarray as xr
from geocube.api.core import make_geocube
from rasterio import features
from shapely.geometry import Polygon

def dissolve_shuffle(ddf, by=None, **kwargs):
    """Shuffle and map partition."""
    meta = ddf._meta.dissolve(by=by, as_index=False, **kwargs)
    shuffled = ddf.shuffle(
        by, npartitions=ddf.npartitions, shuffle="tasks", ignore_index=True
    )
    return shuffled.map_partitions(
        gpd.GeoDataFrame.dissolve, by=by, as_index=False, meta=meta, **kwargs
    )

def calculate_area_overlap_dask(
    df_grid,
    layer,
    nthreads=8,
) -> gpd.GeoDataFrame:
    """Calculate the area of overlap in each grid cell for a given map layer."""
    layer = layer[["geometry"]]
    df_grid = df_grid[["geometry"]]
    # Split up the layer using the grid
    _overlay = gpd.overlay(layer, df_grid, how="intersection")
    # Convert the overlay to a dask-geopandas dataframe and calculate the area of each new polygon
    _overlay = dg.from_geopandas(_overlay, npartitions=nthreads)
    _overlay["area"] = _overlay.area
    _overlay = _overlay.compute()
    # Convert the grid to a dask-geopandas dataframe and spatially join all split layer polygons to their grid cells
    df_grid = dg.from_geopandas(df_grid, npartitions=nthreads)
    joined = dg.sjoin(df_grid, _overlay, how="inner").reset_index()
    # Faster dissolve of area within each grid cell
    scored_grid = dissolve_shuffle(
        joined,
        "index",
    )
    scored_grid = scored_grid.compute()
    return scored_grid

def polygon_to_grid(name: str, gdf) -> gpd.GeoDataFrame:
    """Convert a geodataframe to a grid of polygons."""
    gdf["value"] = range(len(gdf.index))
    # Rasterize the polygon
    out_grid: xr.Dataset = make_geocube(
        vector_data=gdf,
        measurements=["value"],
        resolution=(-100, 100),
        fill=np.nan,
    )
    vals = out_grid.value.values
    vals[~np.isnan(vals)] = np.arange(len(vals[~np.isnan(vals)]), dtype=np.int32)
    vals[np.isnan(vals)] = -9999
    out_grid.value.values = vals
    out_grid.rio.to_raster(f"{name}_raster.tif")
    # Read the saved raster back
    src = rasterio.open(f"{name}_raster.tif")
    r = src.read(1).astype(np.int32)
    # Convert raster cells back to polygons
    shapes = features.shapes(r, mask=r != -9999, transform=src.transform)
    polygons = list(shapes)  # (geometry mapping, value) pairs
    geom: list[Polygon] = [shapely.geometry.shape(i[0]) for i in polygons]
    # Convert to geodataframe
    grid = gpd.GeoDataFrame(
        geometry=gpd.GeoSeries(
            geom,
        ),
    )
    return grid

if __name__ == "__main__":
    area = gpd.read_file("some_area.shp")
    layer = gpd.read_file("some_map_layer.shp")
    area_grid = polygon_to_grid("area", area)
    grid_evaluated = calculate_area_overlap_dask(area_grid, layer)
This mess ended up working, but it was very prone to memory issues with large datasets, so I opted for a solution that was less precise but much faster.
So when one exports with r.out.vtk from GRASS GIS, we get a bad surface with -99999 points instead of nulls:
I want to remove them, yet a simple clip is not enough:
import pyvista as pv

pd = pv.read('./pid1.vtk')
pd = pd.clip((0, 1, 1), invert=False).extract_surface()
p = pv.Plotter()
p.add_mesh(pd)  # add atoms to scene
p.show()
resulting in:
So I wonder how to keep only the top (> -999) points and connected vertices, in order to get only the top plane (it is actually curved, not flat), using pyvista?
link to example .vtk
There is an easy way to do this and there isn't...
You could use pyvista's threshold filter with all_scalars=True as long as you have only one set of scalars:
import pyvista as pv
pd = pv.read('./pid1.vtk')
pd = pd.threshold(-999, all_scalars=True)
plotter = pv.Plotter()
plotter.add_mesh(pd) #add atoms to scene
plotter.show()
Since all_scalars starts filtering based on every scalar array, this will only do what you'd expect if there are no other scalars. Furthermore, unfortunately there seems to be a bug in pyvista (expected to be fixed in version 0.32.0) which makes the use of this keyword impossible.
What you can do in the meantime (if you don't want to use pyvista's main branch before the fix is released) is to threshold the data yourself using numpy:
import pyvista as pv
pd = pv.read('./pid1.vtk')
scalars = pd.active_scalars
keep_inds = (scalars > -999).nonzero()[0]
pd = pd.extract_points(keep_inds, adjacent_cells=False)
plotter = pv.Plotter()
plotter.add_mesh(pd) #add atoms to scene
plotter.show()
The main point of both all_scalars (in threshold) and adjacent_cells (in extract_points) is to only keep cells where every point satisfies the condition.
With both of the above I get the following figure using your data:
I have a list of Shapely polygons and a point like so:
from shapely.geometry import Point, Polygon
polygons = [Polygon(...), Polygon(...), ...]
point = Point(2.5, 5.7)
and I want to find the closest polygon in the list to that point. I'm already aware of the object.distance(other) function which returns the minimum distance between two geometric shapes, and I thought about computing all the distances in a loop to find the closest polygon:
polygons = [Polygon(...), Polygon(...), ...]
point = Point(2.5, 5.7)
min_dist = 10000
closest_polygon = None
for polygon in polygons:
    dist = polygon.distance(point)
    if dist < min_dist:
        min_dist = dist
        closest_polygon = polygon
My question is: Is there a more efficient way to do it?
There is a shorter way, e.g.
from shapely.geometry import Point, Polygon
import random
from operator import itemgetter
def random_coords(n):
    return [(random.randint(0, 100), random.randint(0, 100)) for _ in range(n)]
polys = [Polygon(random_coords(3)) for _ in range(4)]
point = Point(random_coords(1))
min_distance, min_poly = min(((poly.distance(point), poly) for poly in polys), key=itemgetter(0))
as Georgy mentioned (++awesome!), even more concise:
min_poly = min(polys, key=point.distance)
but distance computation is, in general, computationally intensive.
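If the polygon list is large or the lookup happens many times, a spatial index avoids computing every distance. A minimal sketch using Shapely's STRtree, assuming Shapely 2.x (where nearest returns the index of the closest geometry; in Shapely 1.8 it returned the geometry itself):
from shapely.strtree import STRtree

tree = STRtree(polys)      # build once, reuse for many queries
idx = tree.nearest(point)  # Shapely 2.x: index into polys
min_poly = polys[idx]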
I have a solution that works if you have at least 2 polygons with a distance different from 0. Let's call these 2 polygons "basePolygon0" and "basePolygon1". The idea is to build a KD tree with the distance of each polygon to each of the "basis" polygons.
Once the KD tree has been built, we can query it by computing the distance to each of the basis polygons.
Here's a working example:
from shapely.geometry import Point, Polygon
import numpy as np
from scipy.spatial import KDTree
# prepare a test with triangles
poly0 = Polygon([(3,-1),(5,-1),(4,2)])
poly1 = Polygon([(-2,1),(-4,2),(-3,4)])
poly2 = Polygon([(-3,-3),(-4,-6),(-2,-6)])
poly3 = Polygon([(-1,-4),(1,-4),(0,-1)])
polys = [poly0,poly1,poly2,poly3]
p0 = Point(4,-3)
p1 = Point(-4,1)
p2 = Point(-4,-2)
p3 = Point(0,-2.5)
testPoints = [p0,p1,p2,p3]
# select basis polygons
# it works with any pair of polygons that have non zero distance
basePolygon0 = polys[0]
basePolygon1 = polys[1]
# compute tree query
def buildQuery(point):
    distToBasePolygon0 = basePolygon0.distance(point)
    distToBasePolygon1 = basePolygon1.distance(point)
    return np.array([distToBasePolygon0, distToBasePolygon1])
distances = np.array([buildQuery(poly) for poly in polys])
# build the KD tree
tree = KDTree(distances)
# test it
for p in testPoints:
    q = buildQuery(p)
    output = tree.query(q)
    print(output)
This yields as expected:
# (distance, polygon_index_in_KD_tree)
(2.0248456731316584, 0)
(1.904237866994273, 1)
(1.5991500555008626, 2)
(1.5109986459170694, 3)
There is one way that might be faster, but without doing any actual tests, it's hard for me to say for sure.
This might not work for your situation, but the basic idea is that each time a Shapely object is added to the array, you adjust the positions of the other array elements so that it always stays "sorted" in this manner. In Python, this can be done with the heapq module. The only issue is that heapq makes it hard to compare objects by a custom key, so you have to do something like this answer: make a custom class that stores each object in the heap as a (key, object) tuple.
import heapq

class MyHeap(object):
    def __init__(self, initial=None, key=lambda x: x):
        self.key = key
        if initial:
            self._data = [(key(item), item) for item in initial]
            heapq.heapify(self._data)
        else:
            self._data = []

    def push(self, item):
        heapq.heappush(self._data, (self.key(item), item))

    def pop(self):
        return heapq.heappop(self._data)[1]
The first element in the tuple is a "key", which in this case would be the distance to the point, and then the second element would be the actual Shapely object, and you could use it like so:
point = Point(2.5, 5.7)
heap = MyHeap(initial=None, key=lambda x:x.distance(point))
heap.push(Polygon(...))
heap.push(Polygon(...))
# etc...
And at the end, the object you're looking for will be at heap.pop().
Ultimately, though, both algorithms seem to be (roughly) O(n), so any speed up would not be a significant one.
I have two sets of points, one is a map consisting of x,y coordinates, and the second is a path of x,y coordinates. I'm trying to find the closest map points to my path points, pretty simple. Except my map is 380000 points and my paths (of which I have several) each consist of ~ 350000 points themselves.
Other than sampling my data to get smaller datasets, I'm trying to find a faster way to accomplish this task.
base algorithm:
import pandas as pd
from scipy.spatial.distance import cdist
...

def closest_point(point, points):
    return points[cdist([point], points).argmin()]

# log['point'].shape; 333000
# map_data['point'].shape; 380000
closest = [closest_point(log_p, list(map_data['point'])) for log_p in log['point']]
as per this example: Find closest point in Pandas DataFrames
After converting this to a tqdm progress bar to see how long it would take (as it was taking a while, obviously), I noticed it would take about 10hrs to complete.
tqdm loop:
for i in trange(len(log), desc='finding closest points'):
    closest.append(closest_point(log['point'].loc[i], list(map_data['point'])))
>> finding closest points: 5%| | 16432/333456 [32:11<10:13:52], 8.60it/s
While 10 hours is not impossible, I wonder if there is a way to speed this up? I have a solid GPU/CPU/RAM at my disposal, so I feel this should be doable. I'm also learning TensorFlow (but honestly my math is atrocious, so I'm very much in the dark with it).
Any ideas on how to speed this up with multi-threading, GPU computation, TensorFlow, or some other sort of wizardry?
inb4 python is slow ;)
*edit: the image shows what I'm trying to do. Green is the path, blue is the map, orange is what I'm trying to find.
The following is a mini example of what you're trying to do. Consider the variable coords1 as your variable log['point'] and coords2 as your variable map_data['point']. The end result is the index of the coords2 point closest to each coords1 point.
from scipy.spatial import distance
import numpy as np
coords1 = [(35.0456, -85.2672),
(35.1174, -89.9711),
(35.9728, -83.9422),
(36.1667, -86.7833)]
coords2 = [(35.0456, -85.2672),
(35.1174, -89.9711),
(35.9728, -83.9422),
(34.9728, -83.9422),
(36.1667, -86.7833)]
tmp = distance.cdist(coords1, coords2, "sqeuclidean") # sqeuclidean based on Mark Setchell comment to improve speed further
result = np.argmin(tmp,1)
# result: array([0, 1, 2, 4])
This should be way faster, because it does everything in a single vectorized call instead of one iteration per point.
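One caveat when scaling this up: cdist materializes the full distance matrix, which for 333000 × 380000 points would be on the order of a terabyte in float64. For sizes like those in the question, a k-d tree is the usual tool, since it answers each nearest-neighbour query without building the whole matrix. A sketch of the same mini example using scipy.spatial.cKDTree:
from scipy.spatial import cKDTree
import numpy as np

coords1 = np.array([(35.0456, -85.2672),
                    (35.1174, -89.9711),
                    (35.9728, -83.9422),
                    (36.1667, -86.7833)])
coords2 = np.array([(35.0456, -85.2672),
                    (35.1174, -89.9711),
                    (35.9728, -83.9422),
                    (34.9728, -83.9422),
                    (36.1667, -86.7833)])

tree = cKDTree(coords2)          # build once over the map points
dist, idx = tree.query(coords1)  # nearest coords2 index for each coords1 point
# idx: array([0, 1, 2, 4]), same result as the cdist version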
After 3 years, but if anyone is looking at this issue... you may want to try Numba. I get almost a 9x speedup over scipy's distance.cdist on a 1.5 million point set against a 1.5 K set of path points. Also, as @Mark Setchell said, removing the np.sqrt can save considerable time on a big enough set of points.
Results
size: (1459383, 2)
numba: 0.06402060508728027
cdist: 0.5371212959289551
Code
import numba
import numpy as np

# EUCLIDEAN DISTANCE
@numba.njit('(float64[:,::1], float64[::1], float64[::1])', parallel=True, fastmath=True)
def pz_dist(p_array, x_flat, y_flat):
    m = p_array.shape[0]
    n = x_flat.shape[0]
    d = np.empty(shape=(m, n), dtype=np.float64)
    for i in numba.prange(m):
        p1 = p_array[i, :]
        for j in range(n):
            _x = x_flat[j] - p1[0]
            _y = y_flat[j] - p1[1]
            _d = np.sqrt(_x**2 + _y**2)
            d[i, j] = _d
    return d
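For completeness, a usage sketch (the array sizes here are illustrative, not the benchmark's): pz_dist returns the full m × n distance matrix, so the nearest-neighbour indices fall out of an argmin, just as in the cdist approach above:
import numpy as np

rng = np.random.default_rng(0)
p_array = rng.random((1_000, 2))  # e.g. path points, shape (m, 2)
x_flat = rng.random(10_000)       # map point x coordinates, shape (n,)
y_flat = rng.random(10_000)       # map point y coordinates, shape (n,)

d = pz_dist(p_array, x_flat, y_flat)  # (m, n) distance matrix
closest = d.argmin(axis=1)            # for each path point, index of its nearest map point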