K-nearest points from two dataframes with GeoPandas - python

GeoPandas uses shapely under the hood. To get the nearest neighbor I saw the use of nearest_points from shapely. However, this approach does not extend to k-nearest points.
I needed to compute distances to the nearest points from two GeoDataFrames and insert the distance into the GeoDataFrame containing the "from this point" data.
This is my approach using GeoSeries.distance() without using another package or library. Note that when k == 1 the returned value essentially shows the distance to the nearest point. There is also a GeoPandas-only solution for the nearest point by @cd98 which inspired my approach.
This works well for my data, but I wonder if there is a better or faster approach, or another benefit to using shapely or sklearn.neighbors?
import pandas as pd
import geopandas as gp

# gdf1 => GeoDataFrame with point-type geometry column - distance from these points
# gdf2 => GeoDataFrame with point-type geometry column - distance to these points

def knearest(from_point, to_points, k):
    distlist = to_points.distance(from_point)
    distlist.sort_values(ascending=True, inplace=True)  # To have the closest ones first
    return distlist[:k].mean()

# looping through a list of k values
for Ks in [1, 2, 3, 4, 5, 10]:
    name = 'dist_to_closest_' + str(Ks)  # to set the column name
    gdf1[name] = gdf1.geometry.apply(knearest, args=(gdf2, Ks))
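For reference, a minimal runnable version of the above (the toy points are made up purely for illustration, and nsmallest(k).mean() condenses the sort-and-slice step):

from shapely.geometry import Point
import geopandas as gp

# hypothetical toy data, for illustration only
gdf1 = gp.GeoDataFrame(geometry=[Point(0, 0), Point(1, 1)])
gdf2 = gp.GeoDataFrame(geometry=[Point(0, 1), Point(2, 2), Point(5, 5)])

def knearest(from_point, to_points, k):
    # mean distance to the k nearest points of to_points
    return to_points.distance(from_point).nsmallest(k).mean()

for Ks in [1, 2, 3]:
    gdf1['dist_to_closest_' + str(Ks)] = gdf1.geometry.apply(
        knearest, args=(gdf2.geometry, Ks))
print(gdf1)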

Yes there is, but first I must credit the University of Helsinki's Automating GIS processes course; here's the source code. Here's how:
First, read the data. For example, we find the nearest bus stops for each building.
import geopandas as gpd

# Filepaths (read_gdf_from_zip is a helper from the course notebook)
stops = gpd.read_file('data/pt_stops_helsinki.gpkg')
buildings = read_gdf_from_zip('data/building_points_helsinki.zip')
Then define the functions; you can adjust k_neighbors here:
from sklearn.neighbors import BallTree
import numpy as np

def get_nearest(src_points, candidates, k_neighbors=1):
    """Find nearest neighbors for all source points from a set of candidate points"""

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='haversine')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]

    # Return indices and distances
    return (closest, closest_dist)
def nearest_neighbor(left_gdf, right_gdf, return_dist=False):
    """
    For each point in left_gdf, find closest point in right GeoDataFrame and return them.

    NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
    """

    left_geom_col = left_gdf.geometry.name
    right_geom_col = right_gdf.geometry.name

    # Ensure that index in right gdf is formed of sequential numbers
    right = right_gdf.copy().reset_index(drop=True)

    # Parse coordinates from points and insert them into a numpy array as RADIANS
    left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())
    right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())

    # Find the nearest points
    # -----------------------
    # closest ==> index in right_gdf that corresponds to the closest point
    # dist ==> distance between the nearest neighbors (in meters)
    closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)

    # Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
    closest_points = right.loc[closest]

    # Ensure that the index corresponds to the one in left_gdf
    closest_points = closest_points.reset_index(drop=True)

    # Add distance if requested
    if return_dist:
        # Convert to meters from radians
        earth_radius = 6371000  # meters
        closest_points['distance'] = dist * earth_radius

    return closest_points
Do the nearest neighbour analysis:
# Find closest public transport stop for each building and get also the distance based on haversine distance
# Note: haversine distance which is implemented here is a bit slower than using e.g. 'euclidean' metric
# but useful as we get the distance between points in meters
closest_stops = nearest_neighbor(buildings, stops, return_dist=True)
Now join the from and to data frames:
# Rename the geometry of closest stops gdf so that we can easily identify it
closest_stops = closest_stops.rename(columns={'geometry': 'closest_stop_geom'})
# Merge the datasets by index (for this, it is good to use '.join()' -function)
buildings = buildings.join(closest_stops)

The answer above using Automating GIS-processes is really nice, but there is an error when converting the points to a numpy array as RADIANS: the latitude and longitude are reversed.
left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())
Indeed, points must be given as (lat, lon) for the haversine metric, but the longitude corresponds to the x-axis of a plane or sphere and the latitude to the y-axis.
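Note that the same reordering applies to right_radians; assuming the variable names from the answer above, the corrected pair is:

left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())
right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())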

If your data are in grid coordinates, then the approach is a bit leaner, but with one key gotcha.
Building on sutan's answer and streamlining the block from the Uni Helsinki...
To get multiple neighbors, you edit the k_neighbors argument... and must ALSO hard-code variables within the body of the function (see my additions below closest and closest_dist) AND add them to the return statement.
Thus, if you want the 2 closest points, it looks like:
from sklearn.neighbors import BallTree
import numpy as np

def get_nearest(src_points, candidates, k_neighbors=2):
    """
    Find nearest neighbors for all source points from a set of candidate points
    modified from: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html
    """

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='euclidean')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]
    closest_second = indices[1]  # *manually added per comment above*
    closest_second_dist = distances[1]  # *manually added per comment above*

    # Return indices and distances
    return (closest, closest_dist, closest_second, closest_second_dist)
The inputs are lists of (x, y) tuples. Thus, since (by the question title) your data is in a GeoDataFrame:
# easier to read
in_pts = [(row.geometry.x, row.geometry.y) for idx, row in gdf1.iterrows()]
qry_pts = [(row.geometry.x, row.geometry.y) for idx, row in gdf2.iterrows()]
# faster (by about 7X)
in_pts = [(x,y) for x,y in zip(gdf1.geometry.x , gdf1.geometry.y)]
qry_pts = [(x,y) for x,y in zip(gdf2.geometry.x , gdf2.geometry.y)]
I'm not interested in distances, so instead of commenting out of the function, I run:
idx_nearest, _, idx_2ndnearest, _ = get_nearest(in_pts, qry_pts)
and get two arrays of the same length as in_pts that contain, respectively, the positional indices (into qry_pts, i.e. into gdf2) of the closest and second-closest points.
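If you want the matching rows rather than bare positional indices, they can be pulled straight from the query-side GeoDataFrame (gdf2, per the question's naming):

nearest_rows = gdf2.iloc[idx_nearest].reset_index(drop=True)
second_nearest_rows = gdf2.iloc[idx_2ndnearest].reset_index(drop=True)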

Great solution! If you are using the Automating GIS-processes solution, make sure to reset the index of the buildings GeoDataFrame before the join (only if you are using a subset of left_gdf):
buildings.insert(0, 'Number', range(0,len(buildings)))
buildings.set_index('Number' , inplace = True)
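A shorter equivalent, assuming the original index values are not needed afterwards, is pandas' built-in reset:

buildings = buildings.reset_index(drop=True)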

Based on the answers before, here is an all-in-one solution for you which takes two GeoDataFrames as input and searches for the k nearest neighbors.
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

def get_nearest_neighbors(gdf1, gdf2, k_neighbors=2):
    '''
    Find k nearest neighbors for all source points from a set of candidate points
    modified from: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html

    Parameters
    ----------
    gdf1 : geopandas.GeoDataFrame
        Geometries to search from.
    gdf2 : geopandas.GeoDataFrame
        Geometries to be searched.
    k_neighbors : int, optional
        Number of nearest neighbors. The default is 2.

    Returns
    -------
    gdf_final : geopandas.GeoDataFrame
        gdf1 with distance, index and all other columns from gdf2.
    '''

    src_points = [(x, y) for x, y in zip(gdf1.geometry.x, gdf1.geometry.y)]
    candidates = [(x, y) for x, y in zip(gdf2.geometry.x, gdf2.geometry.y)]

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='euclidean')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Collect the k nearest rows of gdf2, suffixed by neighbor rank
    closest_gdfs = []
    for k in np.arange(k_neighbors):
        gdf_new = gdf2.iloc[indices[k]].reset_index()
        gdf_new['distance'] = distances[k]
        gdf_new = gdf_new.add_suffix(f'_{k+1}')
        closest_gdfs.append(gdf_new)

    closest_gdfs.insert(0, gdf1)
    gdf_final = pd.concat(closest_gdfs, axis=1)

    return gdf_final
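A minimal usage sketch (the toy data and column names here are made up purely for illustration):

import geopandas as gpd
from shapely.geometry import Point

gdf_a = gpd.GeoDataFrame({'name': ['a1', 'a2']},
                         geometry=[Point(0, 0), Point(10, 10)])
gdf_b = gpd.GeoDataFrame({'name': ['b1', 'b2', 'b3']},
                         geometry=[Point(1, 0), Point(9, 9), Point(5, 5)])

result = get_nearest_neighbors(gdf_a, gdf_b, k_neighbors=2)
# result holds gdf_a plus, for each neighbor rank k, the matched gdf_b row
# with suffixed columns such as 'name_1', 'distance_1', 'name_2', 'distance_2'
print(result.columns.tolist())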

Related

How to implement in Python a function to compute the Euclidean distance between two arbitrary points on a torus

Given a 10x10 grid (2D array) filled randomly with the numbers 0, 1, or 2, how can I find the Euclidean distance (the l2-norm of the distance vector) between two given points, considering periodic boundaries?
Let us consider an arbitrary grid point called centre. Now, I want to find the nearest grid point containing the same value as centre. I need to take periodic boundaries into account, so that the matrix/grid can be seen as a torus rather than a flat plane. In that case, say centre = matrix[0,2], and we find the same number in matrix[9,2], which is at the southern boundary of the matrix. The Euclidean distance computed with my code would be, for this example, np.sqrt(0**2 + 9**2) = 9.0. However, because of periodic boundaries, the distance should actually be 1, because matrix[9,2] is the northern neighbour of matrix[0,2]. Hence, if periodic boundary values are implemented correctly, distances of magnitude above 8 should not exist.
So, I would be interested in how to implement in Python a function to compute the Euclidean distance between two arbitrary points on a torus by applying a wrap-around at the boundaries.
import numpy as np

matrix = np.random.randint(0, 3, (10, 10))
centre = matrix[0, 2]

# rewrite the centre to be the number 5 (to exclude itself as shortest distance)
matrix[0, 2] = 5

# find the points where entries are the same as the centre
same = np.where(matrix == centre)
idx_row, idx_col = same

# find distances from the centre to all cells with the same value
dist = np.zeros(len(same[0]))
for i in range(len(same[0])):
    delta_row = same[0][i] - 0  # row coord of centre
    delta_col = same[1][i] - 2  # col coord of centre
    dist[i] = np.sqrt(delta_row**2 + delta_col**2)

# retrieve the index of the smallest distance
idx = dist.argmin()
print('Centre value: %i. The nearest cell with same value is at (%i,%i)'
      % (centre, same[0][idx], same[1][idx]))
For each axis, you can check whether the distance is shorter when you wrap around or when you don't. Consider the row axis, with rows i and j.
When not wrapping around, the difference is abs(i - j).
When wrapping around, the difference is "flipped", as in 10 - abs(i - j). In your example with i == 0 and j == 9 you can check that this correctly produces a distance of 1.
Then simply take whichever is smaller:
delta_row = abs(same[0][i] - 0)  # row coord of centre
delta_row = min(delta_row, 10 - delta_row)
And similarly for delta_col.
The final dist[i] calculation needs no changes.
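Putting that together, the loop from the question becomes (a sketch, keeping the 10x10 grid and the centre at (0, 2) from above):

for i in range(len(same[0])):
    delta_row = abs(same[0][i] - 0)  # row coord of centre
    delta_col = abs(same[1][i] - 2)  # col coord of centre
    delta_row = min(delta_row, 10 - delta_row)  # wrap around the rows
    delta_col = min(delta_col, 10 - delta_col)  # wrap around the columns
    dist[i] = np.sqrt(delta_row**2 + delta_col**2)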
I have a working 'sketch' of how this could work. In short, I calculate the distance 9 times: once for the normal distance, and 8 more times with shifted copies to possibly correct for a closer 'torus' distance.
As n gets larger, the calculation cost can go sky high. But the torus effect is probably not needed anyway, as there is almost always a point nearby without 'wrap-around'.
You can easily test this, because for a grid of size 1, if a point is found at distance 1/2 or closer, you know there is no closer torus point (right?)
import numpy as np
n=10000
np.random.seed(1)
A = np.random.randint(low=0, high=10, size=(n,n))
I create a 10000x10000 grid of values, and store the locations of the 0's in ONES:
ONES = np.argwhere(A == 0)
Now I define my torus distance, which tries which of the 9 mirrors is the closest.
from sklearn.neighbors import BallTree

def distance_on_torus(point=[500, 500]):
    index_diff = [[1], [1], [0], [0], [0, 1], [0, 1], [0, 1], [0, 1]]
    coord_diff = [[-1], [1], [-1], [1], [-1, -1], [-1, 1], [1, -1], [1, 1]]

    tree = BallTree(ONES, leaf_size=5*n, metric='euclidean')
    dist, indi = tree.query([point], k=1, return_distance=True)
    distances = [dist[0]]

    for indici_to_shift, coord_direction in zip(index_diff, coord_diff):
        MIRROR = ONES.copy()
        for i, shift in zip(indici_to_shift, coord_direction):
            MIRROR[:, i] = MIRROR[:, i] + (shift * n)
        tree = BallTree(MIRROR, leaf_size=5*n, metric='euclidean')
        dist, indi = tree.query([point], k=1, return_distance=True)
        distances.append(dist[0])

    return np.min(distances)
%%time
distance_on_torus([2,3])
It is slow; the above takes 15 minutes... For n = 1000 it takes less than a second.
An optimisation would be to first consider the non-torus distance, and only if that minimum might not be the smallest, calculate with the minimal set of extra 'mirror blocks' around. This would greatly increase speed.
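For what it's worth, the wrap-around idea from the previous answer can also be folded directly into the coordinate differences, which avoids building nine mirrored BallTrees entirely. A sketch, assuming the n-by-n grid and the ONES array from above:

def torus_distances(point, candidates, n):
    # wrap-around difference per axis, then the Euclidean norm
    delta = np.abs(candidates - np.asarray(point))
    delta = np.minimum(delta, n - delta)
    return np.sqrt((delta**2).sum(axis=1))

d = torus_distances([2, 3], ONES, n)
print(d.min())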

Calculate minimum distance between a point and a polygon in geopandas

I am trying to compute the minimum distance between a set of points and a set of polygons.
My code looks like this:
polys = sf.geometry.tolist()
cities = sf.CITY_LABEL.tolist()
min_dist = np.empty(len(points))
min_city = ['NA'] * len(points)
min_coord = ['NA'] * len(points)
inside = ['NA'] * len(points)
for i, point in enumerate(points):
    aux = sf.boundary.distance(point).tolist()
    idx = aux.index(min(aux))
    min_dist[i] = aux[idx]
    min_city[i] = cities[idx]
    min_coord[i] = polys[idx].boundary.interpolate(polys[idx].boundary.project(point)).wkt
    inside[i] = polys[idx].contains(point)
where the variable points contain the points and the variable sf is my shapefile with polygons.
I then save to a file the min distance in degrees, km (deg*111), closest polygon name (a city), the closest polygon point, and whether the point is inside the polygon. So, I have something like this:
But the minimum distance I computed (column D) is larger than the distance between the point (column A) and the polygon's closest point (column F).
Any idea what I am doing wrong? Why does the distance function return a different distance than the distance between columns A and F?
(I also replicated the code in R, and indeed the minimum distance I get is the one computed between columns A and F of the file above, not the one in column D.)

How can I determine which curve is closest to a given set of points?

I have several dataframes which each contain two columns of x and y values, so each row represents a point on a curve. The different dataframes then represent contours on a map. I have another series of data points (fewer in number), and I'd like to see which contour they are closest to on average.
I would like to establish the distance from each datapoint to each point on the curve, with sqrt(x^2+y^2) - sqrt(x_1^2 + y_1^2), and add them up for each point on the curve. The trouble is that there are several thousand points on the curve and only a few dozen datapoints to assess, so I can't simply put these in columns next to each other.
I think I need to cycle through the datapoints, checking the squared distance between them and each point in the curve.
I don't know whether there is an easy function or module that can do this.
Thanks in advance!
Edit: Thanks for the comments. @Alexander: I've tried the vectorize function, as follows, with a sample dataset. I'm actually using contours comprising several thousand datapoints, and the dataset to compare against has 100+, so I'd like to automate as much as possible. I'm currently able to create a distance measurement from the first datapoint against my contour, but I would ideally like to cycle through j as well. When I try it, it comes up with an error:
import numpy as np
from numpy import vectorize
import pandas as pd
from pandas import DataFrame

df1 = {'X1': ['1', '2', '2', '3'], 'Y1': ['2', '5', '7', '9']}
df1 = DataFrame(df1, columns=['X1', 'Y1'])
df2 = {'X2': ['3', '5', '6'], 'Y2': ['10', '15', '16']}
df2 = DataFrame(df2, columns=['X2', 'Y2'])
df1 = df1.astype(float)
df2 = df2.astype(float)

Distance = pd.DataFrame()
i = range(0, len(df1))
j = range(0, len(df2))

def myfunc(x1, y1, x2, y2):
    return np.sqrt((x2-x1)**2+np.sqrt(y2-y1)**2)

vfunc = np.vectorize(myfunc)
Distance['Distance of Datapoint j to Contour'] = vfunc(df1.iloc[i]['X1'], df1.iloc[i]['Y1'], df2.iloc[0]['X2'], df2.iloc[0]['Y2'])
Distance['Distance of Datapoint j to Contour'] = vfunc(df1.iloc[i]['X1'], df1.iloc[i]['Y1'], df2.iloc[1]['X2'], df2.iloc[1]['Y2'])
Distance
General idea
The "curve" is actually a polygon with lots of points. There are definitely libraries to calculate the distance between the polygon and the point. But generally it will be something like:
Calculate an "approximate distance" to the whole polygon, e.g. to the bounding box of the polygon (from the point to 4 line segments), or to the center of the bounding box
Calculate distances to the lines of the polygon. If you have too many points, then as an extra step the "resolution" of the polygon might be reduced.
The smallest found distance is the distance from the point to the polygon.
Repeat for each point and each polygon.
Existing solutions
Some libraries already can do that:
shapely question, shapely Geo-Python docs
Using shapely in geopandas to calculate distance
scipy.spatial.distance: scipy can be used to calculate distance between arbitrary number of points
numpy.linalg.norm(point1-point2): some answers propose different ways to calculate distance using numpy. Some even show performance benchmarks
sklearn.neighbors: not really about curves and distances to them, but can be used if you want to check "to which area point is most likely related"
And you can always calculate distances yourself using D(x1, y1, x2, y2) = sqrt((x₂-x₁)² + (y₂-y₁)²) and search for the combination of points that gives the minimal distance.
Example:
# get distance from points of 1 dataset to all the points of another dataset
from scipy.spatial import distance
d = distance.cdist(df1.to_numpy(), df2.to_numpy(), 'euclidean')
print(d)
# Results will be a matrix of all possible distances:
# [[ D(Point_df1_0, Point_df2_0), D(Point_df1_0, Point_df2_1), D(Point_df1_0, Point_df2_2)]
# [ D(Point_df1_1, Point_df2_0), D(Point_df1_1, Point_df2_1), D(Point_df1_1, Point_df2_2)]
# [ D(Point_df1_2, Point_df2_0), D(Point_df1_2, Point_df2_1), D(Point_df1_2, Point_df2_2)]
# [ D(Point_df1_3, Point_df2_0), D(Point_df1_3, Point_df2_1), D(Point_df1_3, Point_df2_2)]]
[[ 8.24621125 13.60147051 14.86606875]
[ 5.09901951 10.44030651 11.70469991]
[ 3.16227766 8.54400375 9.8488578 ]
[ 1. 6.32455532 7.61577311]]
What to do next is up to you. For example, as a metric of "general distance between curves" you can:
Pick the smallest values in each row and each column (if you skip some columns/rows, then you might end up with a candidate that "matches only a part of the contour"), and calculate their median: np.median(np.hstack([np.amin(d, axis) for axis in range(len(d.shape))])).
Or you can calculate the median of:
all the distances: np.median(d)
the "smallest 2/3 of distances": np.median(d[d<np.percentile(d, 66, interpolation='higher')])
the "smallest distances that cover at least each row and each column":
for min_value in np.sort(d, None):
    chosen_indices = d <= min_value
    if np.all(np.hstack([np.amax(chosen_indices, axis) for axis in range(len(chosen_indices.shape))])):
        break
similarity = np.median(d[chosen_indices])
Or maybe you can use a different type of distance from the beginning (e.g. "correlation distance" looks promising for your task)
Maybe use "Procrustes analysis, a similarity test for two data sets" together with distances.
Maybe you can use minkowski distance as a similarity metric.
Alternative approach
Alternative approach would be to use some "geometry" library to compare areas of concave hulls:
Build concave hulls for contours and for "candidate datapoints" (not easy, but possible: using shapely , using concaveman). But if you are sure that your contours are already ordered and without overlapping segments, then you can directly build polygons from those points without need for concave hull.
Use "intersection area" minus "non-common area" as a metric of similarity (shapely can be used for that; see the sketch after this list):
Non-common area is: union - intersection, or simply the "symmetric difference"
Final metric: intersection.area - symmetric_difference.area
This approach might be better than processing distances in some situations, for example:
You want to prefer "fewer points covering whole area" over "huge amount of very close points that cover only half of the area"
It's more obvious way to compare candidates with different number of points
But it has its disadvantages too (just draw some examples on paper and experiment to find them)
Other ideas:
instead of using polygons or concave hull you can:
build a linear ring from your points and then use contour.buffer(some_distance). This way you ignore "internal area" of the contour and only compare contour itself (with tolerance of some_distance). Distance between centroids (or double of that) may be used as value for some_distance
You can build polygons/lines from segments using ops.polygonize
instead of using intersection.area - symmetric_difference.area you can:
Snap one object to another, and then compare snapped object to original
Before comparing real objects you can compare "simpler" versions of the objects to filter out obvious mismatches:
For example you can check if boundaries of objects intersect
Or you can simplify geometries before comparing them
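As a concrete illustration of the area-based metric above, a sketch with two made-up triangles (shapely assumed installed):

from shapely.geometry import Polygon

# hypothetical example polygons
poly_a = Polygon([(0, 0), (4, 0), (2, 3)])
poly_b = Polygon([(1, 0), (5, 0), (3, 3)])

intersection = poly_a.intersection(poly_b)
sym_diff = poly_a.symmetric_difference(poly_b)
similarity = intersection.area - sym_diff.area  # larger = more overlap, less mismatch
print(similarity)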
For the distance, you need to change your formula to
from math import sqrt

def getDistance(x, y, x_i, y_i):
    return sqrt((x_i - x)**2 + (y_i - y)**2)
with (x,y) being your datapoint and (x_i, y_i) being a point from the curve.
Consider using NumPy for vectorization. Explicitly looping through your data points will most likely be less efficient; depending on your use case, it might however be quick enough. (If you need to run it on a regular basis, I think vectorization will easily outspeed the explicit way.) This could look something like this:
import numpy as np  # Universal abbreviation for the module

datapoints = np.random.rand(3, 2)  # Returns an array with randomized entries of size 3x2 (imagine it as 3 sets of x- and y-values)
contour1 = np.random.rand(1000, 2)  # Other than the size (which is 1000x2), no different from datapoints
contour2 = np.random.rand(1000, 2)
contour3 = np.random.rand(1000, 2)

def squareDistanceUnvectorized(datapoint, contour):
    retVal = 0.
    print("Using datapoint with values x:{}, y:{}".format(datapoint[0], datapoint[1]))

    lengthOfContour = np.size(contour, 0)  # This gets you the number of lines in the vector

    for pointID in range(lengthOfContour):
        squaredXDiff = np.square(contour[pointID, 0] - datapoint[0])
        squaredYDiff = np.square(contour[pointID, 1] - datapoint[1])
        retVal += np.sqrt(squaredXDiff + squaredYDiff)

    retVal = retVal / lengthOfContour  # As we want the average, we are dividing the sum by the element count
    return retVal

if __name__ == "__main__":
    noOfDatapoints = np.size(datapoints, 0)
    contID = 0
    for currentDPID in range(noOfDatapoints):
        dist1 = squareDistanceUnvectorized(datapoints[currentDPID, :], contour1)
        dist2 = squareDistanceUnvectorized(datapoints[currentDPID, :], contour2)
        dist3 = squareDistanceUnvectorized(datapoints[currentDPID, :], contour3)
        # the closest contour is the one with the SMALLEST average distance
        if dist1 < dist2 and dist1 < dist3:
            contID = 1
        elif dist2 < dist1 and dist2 < dist3:
            contID = 2
        elif dist3 < dist1 and dist3 < dist2:
            contID = 3
        else:
            contID = 0
        if contID == 0:
            print("Datapoint {} is inbetween two contours".format(currentDPID))
        else:
            print("Datapoint {} is closest to contour {}".format(currentDPID, contID))
Okay, now moving on to vector-land.
I have taken the liberty of adjusting this part to what I think is your dataset. Try it and let me know if it works.
import numpy as np
import pandas as pd

# Generate 1000 points (2-dim vectors) with random values between 0 and 1. Make them strings afterwards.
# This is the first contour
random2Ddata1 = np.random.rand(1000, 2)
listOfX1 = [str(x) for x in random2Ddata1[:, 0]]
listOfY1 = [str(y) for y in random2Ddata1[:, 1]]

# Do the same for a second contour, except that we de-center this 255 units into the first dimension
random2Ddata2 = np.random.rand(1000, 2) + [255, 0]
listOfX2 = [str(x) for x in random2Ddata2[:, 0]]
listOfY2 = [str(y) for y in random2Ddata2[:, 1]]

# After this step, our 'contours' are basically two blobs of datapoints whose centers are approx. 255 units apart.

# Generate a set of 4 datapoints and make them a Pandas-DataFrame
datapoints = {'X': ['0.5', '0', '255.5', '0'], 'Y': ['0.5', '0', '0.5', '-254.5']}
datapoints = pd.DataFrame(datapoints, columns=['X', 'Y'])

# Do the same for the two contours
contour1 = {'Xf': listOfX1, 'Yf': listOfY1}
contour1 = pd.DataFrame(contour1, columns=['Xf', 'Yf'])
contour2 = {'Xf': listOfX2, 'Yf': listOfY2}
contour2 = pd.DataFrame(contour2, columns=['Xf', 'Yf'])

# We do now have 4 datapoints.
# - The first datapoint is basically where we expect the mean of the first contour to be.
#   Contour 1 consists of 1000 points with x, y-values between 0 and 1
# - The second datapoint is at the origin. Its distances should be similar to the ones of the first datapoint
# - The third datapoint would be the result of shifting the first datapoint 255 units into the positive first dimension
# - The fourth datapoint would be the result of shifting the first datapoint 255 units into the negative second dimension

# Transformation into numpy array
# First the x and y values of the data points
dpArray = ((datapoints.values).T).astype(float)
c1Array = ((contour1.values).T).astype(float)
c2Array = ((contour2.values).T).astype(float)

# This did the following:
# - Transform the datapoints and contours into numpy arrays
# - Transpose them afterwards so that if we want all x values, we can write var[0,:] instead of var[:,0].
#   A personal preference, maybe
# - Convert all the values into floats.

# Now, we iterate through the contours. If you have a lot of them, putting them into a list beforehand would do the job
for contourid, contour in enumerate([c1Array, c2Array]):
    # Now for the datapoints
    for _index, _value in enumerate(dpArray[0, :]):
        # The next two lines do vectorization magic.
        # First, we square the difference between one dpArray entry and the contour x values.
        # You might notice that contour[0,:] returns a 1x1000 vector while dpArray[0,_index] is a 1x1 float value.
        # This works because dpArray[0,_index] is broadcast to fit the size of contour[0,:].
        dx = np.square(dpArray[0, _index] - contour[0, :])
        # The same happens for dpArray[1,_index] and contour[1,:]
        dy = np.square(dpArray[1, _index] - contour[1, :])
        # Now, we take (for one datapoint and one contour) the mean value and print it.
        # You could write it into an array or do basically anything with it that you can imagine
        distance = np.mean(np.sqrt(dx + dy))
        print("Mean distance between contour {} and datapoint {}: {}".format(contourid + 1, _index + 1, distance))

# But you want to be able to call this... so here we go, generating a function out of it!
def getDistanceFromDatapointsToListOfContoursFindBetterName(datapoints, listOfContourDataFrames):
    """ Takes a DataFrame with points and a list of different contours to return the average distance for each combination"""
    dpArray = ((datapoints.values).T).astype(float)

    listOfContours = []
    for item in listOfContourDataFrames:
        listOfContours.append(((item.values).T).astype(float))

    retVal = np.zeros((np.size(dpArray, 1), len(listOfContours)))
    for contourid, contour in enumerate(listOfContours):
        for _index, _value in enumerate(dpArray[0, :]):
            dx = np.square(dpArray[0, _index] - contour[0, :])
            dy = np.square(dpArray[1, _index] - contour[1, :])
            distance = np.mean(np.sqrt(dx + dy))
            print("Mean distance between contour {} and datapoint {}: {}".format(contourid + 1, _index + 1, distance))
            retVal[_index, contourid] = distance
    return retVal

# And just to see that it is, indeed, returning the same results, run it once
getDistanceFromDatapointsToListOfContoursFindBetterName(datapoints, [contour1, contour2])

Get the distance of each point with every other, and find where the curve approach itself

I am programming a randomly generated spline curve by first generating control points and then interpolating with scipy.interpolate.splev. Here is an example.
Points are given like this:
np.array([[ 1.00000000e+01, -4.65000000e+02],
          [ 1.78319153e+01, -4.60252579e+02],
          ...])
I now want to get the distance of every point to every other point of the spline, to see if at some point the spline comes too close to itself, which would include self-collision.
Before and after every point there should be an interval where points are ignored, as these are always the closest points to each point:
def collision(splinePoints, interval):
    length = len(splinePoints)
    mylist = []
    i = -1
    for item in splinePoints:
        i += 1
        first = item
        lowerLimit = i - interval
        upperLimit = i + interval
        if lowerLimit >= 0:
            for item in splinePoints[:lowerLimit]:
                mylist.append(first)
                mylist.append(item)
        if upperLimit <= length:
            for item in splinePoints[upperLimit:]:
                mylist.append(first)
                mylist.append(item)
    return np.amin(lengthOfLines(np.array(mylist)))
Lengths of lines are checked with this:
def lengthOfLines(points):
    return np.sqrt(np.sum(np.diff(points.T)**2, axis=0))
It somehow works, but not always. I am also struggling with debugging, as the generated data is big and hard to read, check or compare. Any idea how to do this better?
All pairwise distances can be obtained with the pdist function of the scipy.spatial.distance module. It returns a flat array of distances, with redundancies eliminated. The utility function squareform unpacks them into a symmetric square matrix, which is often more convenient.
You also want to find the nearest point that is not directly before or after the given point on the curve. In the example below, I penalize the distances between neighbors (within 20 index values) by setting those distances to infinity. Then argmin finds the nearest point for each one, and I visualize it by drawing a red line to that nearest point.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

t = np.linspace(0, 10, 50)
points = np.stack(((t+5)*np.cos(t), (t+5)*np.sin(t)), axis=-1)  # for example

distances = squareform(pdist(points))  # distance matrix
i, j = np.meshgrid(np.arange(t.size), np.arange(t.size))
distances[np.abs(i-j) <= 20] = np.inf  # don't count neighbors

nearest = np.argmin(distances, axis=0)  # nearest to each

plt.plot(points[:, 0], points[:, 1])
for k in range(len(t)):
    npoint = points[nearest[k]]
    plt.plot([points[k, 0], npoint[0]], [points[k, 1], npoint[1]], 'r')
plt.gca().set_aspect('equal', 'datalim')
plt.show()
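And if you only need the single closest self-approach (the collision criterion from the question), it is just the minimum of the masked matrix:

min_approach = distances.min()  # neighbor entries were already set to inf above
print(min_approach)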

find indices of lat lon point on a grid using python

I am new to python, and I can't figure out how to find the minimum distance from a given lat/lon point (which is not taken from the grid, but selected by me) in order to find the indices of the closest lat/lon point on a grid.
Basically , I am reading in an ncfile that contains 2D coordinates:
from netCDF4 import Dataset

coords = 'coords.nc'
fh = Dataset(coords, mode='r')
lats = fh.variables['latitudes'][:, :]
lons = fh.variables['longitudes'][:, :]
fh.close()
>>> lons.shape
(94, 83)
>>> lats.shape
(94, 83)
I want to find the indices in the above grid for the nearest lat lon to the below values:
sel_lat=71.60556
sel_lon=-161.458611
I tried to make lat/lon pairs in order to use the scipy.spatial.distance function, but I am still having problems, because I did not set up the input arrays in the format it wants and I don't understand how to do that:
latLon_pairsGrid = np.vstack(([lats.T],[lons.T])).T
>>> latLon_pairsGrid.shape
(94, 83, 2)
distance.cdist([sel_lat,sel_lon],latLon_pairsGrid,'euclidean')
Any help or hints would be appreciated
Check out the pyresample package. It provides spatial nearest-neighbour search using a fast kd-tree approach:
import pyresample
import numpy as np

# Define lat-lon grid
lon = np.linspace(30, 40, 100)
lat = np.linspace(10, 20, 100)
lon_grid, lat_grid = np.meshgrid(lon, lat)
grid = pyresample.geometry.GridDefinition(lats=lat_grid, lons=lon_grid)

# Generate some random data on the grid
data_grid = np.random.rand(lon_grid.shape[0], lon_grid.shape[1])

# Define some sample points
my_lons = np.array([34.5, 36.5, 38.5])
my_lats = np.array([12.0, 14.0, 16.0])
swath = pyresample.geometry.SwathDefinition(lons=my_lons, lats=my_lats)

# Determine nearest (w.r.t. great circle distance) neighbour in the grid.
_, _, index_array, distance_array = pyresample.kd_tree.get_neighbour_info(
    source_geo_def=grid, target_geo_def=swath, radius_of_influence=50000,
    neighbours=1)

# get_neighbour_info() returns indices in the flattened lat/lon grid. Compute
# the 2D grid indices:
index_array_2d = np.unravel_index(index_array, grid.shape)

print("Indices of nearest neighbours:", index_array_2d)
print("Longitude of nearest neighbours:", lon_grid[index_array_2d])
print("Latitude of nearest neighbours:", lat_grid[index_array_2d])
print("Great Circle Distance:", distance_array)
There is also a shorthand method for directly obtaining the data values at the nearest grid points:
data_swath = pyresample.kd_tree.resample_nearest(
    source_geo_def=grid, target_geo_def=swath, data=data_grid,
    radius_of_influence=50000)
print("Data at nearest grid points:", data_swath)
I think I found an answer, but it is a workaround that avoids calculating the distance between the chosen lat/lon and the lat/lons on the grid. This doesn't seem completely accurate, because I am never calculating distances, just the smallest difference between the lat/lon values themselves.
I used the answer to the question find (i,j) location of closest (long,lat) values in a 2D array
a = abs(lats-sel_lat)+abs(lons-sel_lon)
i,j = np.unravel_index(a.argmin(),a.shape)
Using those returned indices i,j, I can then find on the grid the coordinates that correspond most closely to my selected lat, lon value:
>>> lats[i,j]
71.490295
>>> lons[i,j]
-161.65045
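If you do want true great-circle nearness rather than this degree-space approximation, the BallTree/haversine trick from the answers further up applies here too. A sketch, assuming the lats/lons arrays and the selected point from above:

from sklearn.neighbors import BallTree
import numpy as np

grid_radians = np.radians(np.column_stack([lats.ravel(), lons.ravel()]))
tree = BallTree(grid_radians, metric='haversine')
dist, idx = tree.query(np.radians([[sel_lat, sel_lon]]), k=1)
i, j = np.unravel_index(idx[0][0], lats.shape)
print(lats[i, j], lons[i, j], dist[0][0] * 6371000)  # distance in metres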
