I have a list of users' latitude and longitude.
The input will be the user's lat/lon and a range, e.g. 500 meters.
I want to find out which users are in the range of 500 meters from that list.
Using geopy.distance I can find the distance between two points:
from geopy import distance

newport_ri = (41.49008, -71.312796, 100)
cleveland_oh = (41.499498, -81.695391, 100)
print(distance.distance(newport_ri, cleveland_oh).km)
What I want is to find which points from the list fall within a given distance.
Something like this:
coor = [(35.441339, -88.092403),
        (35.453793, -88.061769),
        (35.559426, -88.014642),
        (35.654535, -88.060918),
        (35.812953, -88.120935)]

def findClosest(coor, userCoor, ranges):
    pass

userCoor = [35.829042, -88.039396]
ranges = 500  # meters or km

findClosest(coor, userCoor, ranges)
# Output: the coordinates from coor that are within 500 meters of userCoor
For example, if the number of users is not very large, you can compute the distance from each point to the user, sort by distance, and return the closest users according to ranges. The function would begin like this:
def findClosest(coor, userCoor, ranges):
    dist = []
    for i, c in enumerate(coor):
        dist.append((distance.distance(c, userCoor).km, i))
    dist = sorted(dist)
    ...
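For completeness, here is a minimal sketch of how the function could finish, assuming ranges is given in kilometres and that the matching coordinates should be returned (this completion is mine, not part of the original answer):

from geopy import distance

def findClosest(coor, userCoor, ranges):
    dist = []
    for i, c in enumerate(coor):
        dist.append((distance.distance(c, userCoor).km, i))
    dist.sort()
    # keep only the points whose distance to the user is within the requested range
    return [coor[i] for d, i in dist if d <= ranges]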
Otherwise, if a faster solution is needed, preprocessing the users might be necessary, for example by computing a Voronoi diagram of their locations or something similar.
GeoPandas uses shapely under the hood. To get the nearest neighbor I saw the use of nearest_points from shapely. However, this approach does not include k-nearest points.
I needed to compute distances to the nearest points from two GeoDataFrames and insert the distance into the GeoDataFrame containing the "from this point" data.
This is my approach using GeoSeries.distance() without using another package or library. Note that when k == 1 the returned value is essentially the distance to the nearest point. There is also a GeoPandas-only solution for the nearest point by #cd98 which inspired my approach.
This works well for my data, but I wonder if there is a better or faster approach, or another benefit to using shapely or sklearn.neighbors?
import pandas as pd
import geopandas as gp
# gdf1 - GeoDataFrame with point-type geometry column - distance from these points
# gdf2 - GeoDataFrame with point-type geometry column - distance to these points

def knearest(from_points, to_points, k):
    distlist = to_points.distance(from_points)
    distlist.sort_values(ascending=True, inplace=True)  # to have the closest ones first
    return distlist[:k].mean()

# looping through a list of k values (number of nearest points to average)
for Ks in [1, 2, 3, 4, 5, 10]:
    name = 'dist_to_closest_' + str(Ks)  # to set the column name
    gdf1[name] = gdf1.geometry.apply(knearest, args=(gdf2, Ks))
Yes there is, but first, I must credit the University of Helsinki's Automating GIS processes course; here's the source code. Here's how:
First, read the data. In this example, we find the nearest bus stop for each building.
import geopandas as gpd

# Filepaths
stops = gpd.read_file('data/pt_stops_helsinki.gpkg')
buildings = read_gdf_from_zip('data/building_points_helsinki.zip')  # helper defined in the course material
Define the functions; you can adjust k_neighbors here:
from sklearn.neighbors import BallTree
import numpy as np
def get_nearest(src_points, candidates, k_neighbors=1):
    """Find nearest neighbors for all source points from a set of candidate points"""

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='haversine')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]

    # Return indices and distances
    return (closest, closest_dist)
def nearest_neighbor(left_gdf, right_gdf, return_dist=False):
    """
    For each point in left_gdf, find closest point in right GeoDataFrame and return them.

    NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
    """

    left_geom_col = left_gdf.geometry.name
    right_geom_col = right_gdf.geometry.name

    # Ensure that index in right gdf is formed of sequential numbers
    right = right_gdf.copy().reset_index(drop=True)

    # Parse coordinates from points and insert them into a numpy array as RADIANS
    left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())
    right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())

    # Find the nearest points
    # -----------------------
    # closest ==> index in right_gdf that corresponds to the closest point
    # dist ==> distance between the nearest neighbors (in meters)
    closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)

    # Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
    closest_points = right.loc[closest]

    # Ensure that the index corresponds the one in left_gdf
    closest_points = closest_points.reset_index(drop=True)

    # Add distance if requested
    if return_dist:
        # Convert to meters from radians
        earth_radius = 6371000  # meters
        closest_points['distance'] = dist * earth_radius

    return closest_points
Do the nearest-neighbour analysis:
# Find closest public transport stop for each building and get also the distance based on haversine distance
# Note: haversine distance which is implemented here is a bit slower than using e.g. 'euclidean' metric
# but useful as we get the distance between points in meters
closest_stops = nearest_neighbor(buildings, stops, return_dist=True)
Now join the 'from' and 'to' data frames:
# Rename the geometry of closest stops gdf so that we can easily identify it
closest_stops = closest_stops.rename(columns={'geometry': 'closest_stop_geom'})
# Merge the datasets by index (for this, it is good to use '.join()' -function)
buildings = buildings.join(closest_stops)
The answer above using Automating GIS-processes is really nice, but there is an error when converting the points into a numpy array as RADIANS: the latitude and longitude are reversed.
left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())
Indeed, points are given as (lat, lon), but the longitude corresponds to the x-axis of a plane or sphere and the latitude to the y-axis.
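For the same reason, the companion line for the right-hand GeoDataFrame presumably needs the same swap (my addition, following the correction above):

right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())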
If your data are in grid coordinates, then the approach is a bit leaner, but with one key gotcha.
Building on sutan's answer and streamlining the block from the Uni Helsinki...
To get multiple neighbors, you edit the k_neighbors argument... and must ALSO hard-code variables within the body of the function (see my additions below 'closest' and 'closest_dist') AND add them to the return statement.
Thus, if you want the 2 closest points, it looks like:
from sklearn.neighbors import BallTree
import numpy as np
def get_nearest(src_points, candidates, k_neighbors=2):
    """
    Find nearest neighbors for all source points from a set of candidate points
    modified from: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html
    """

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='euclidean')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]
    closest_second = indices[1]  # *manually add per comment above*
    closest_second_dist = distances[1]  # *manually add per comment above*

    # Return indices and distances
    return (closest, closest_dist, closest_second, closest_second_dist)
The inputs are lists of (x, y) tuples. Thus, since (by the question title) your data is in a GeoDataFrame:
# easier to read
in_pts = [(row.geometry.x, row.geometry.y) for idx, row in gdf1.iterrows()]
qry_pts = [(row.geometry.x, row.geometry.y) for idx, row in gdf2.iterrows()]
# faster (by about 7X)
in_pts = [(x,y) for x,y in zip(gdf1.geometry.x , gdf1.geometry.y)]
qry_pts = [(x,y) for x,y in zip(gdf2.geometry.x , gdf2.geometry.y)]
I'm not interested in distances, so instead of commenting them out of the function, I run:
idx_nearest, _, idx_2ndnearest, _ = get_nearest(in_pts, qry_pts)
and get two arrays of the same length as in_pts that, respectively, contain the index values of the closest and second closest points for each input point (the indices refer to the geodataframe that produced qry_pts).
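If you then want to carry those indices around, one option (hypothetical column names; assumes gdf1 is the GeoDataFrame that produced in_pts) is simply:

gdf1['nearest_idx'] = idx_nearest          # index of the closest point in the qry_pts frame
gdf1['second_nearest_idx'] = idx_2ndnearest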
Great solution! If you are using the Automating GIS-processes solution, make sure to reset the index of the buildings GeoDataFrame before the join (only if you are using a subset of left_gdf):
buildings.insert(0, 'Number', range(0,len(buildings)))
buildings.set_index('Number' , inplace = True)
Based on the previous answers, here is an all-in-one solution that takes two GeoDataFrames as input and searches for the k nearest neighbors.
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

def get_nearest_neighbors(gdf1, gdf2, k_neighbors=2):
    '''
    Find k nearest neighbors for all source points from a set of candidate points
    modified from: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html

    Parameters
    ----------
    gdf1 : geopandas.GeoDataFrame
        Geometries to search from.
    gdf2 : geopandas.GeoDataFrame
        Geometries to be searched.
    k_neighbors : int, optional
        Number of nearest neighbors. The default is 2.

    Returns
    -------
    gdf_final : geopandas.GeoDataFrame
        gdf1 with distance, index and all other columns from gdf2.
    '''
    src_points = [(x, y) for x, y in zip(gdf1.geometry.x, gdf1.geometry.y)]
    candidates = [(x, y) for x, y in zip(gdf2.geometry.x, gdf2.geometry.y)]

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='euclidean')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Build one frame per neighbor rank and concatenate them column-wise
    closest_gdfs = []
    for k in np.arange(k_neighbors):
        gdf_new = gdf2.iloc[indices[k]].reset_index()
        gdf_new['distance'] = distances[k]
        gdf_new = gdf_new.add_suffix(f'_{k+1}')
        closest_gdfs.append(gdf_new)

    closest_gdfs.insert(0, gdf1)
    gdf_final = pd.concat(closest_gdfs, axis=1)

    return gdf_final
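A hedged usage sketch (the GeoDataFrame names are hypothetical; both frames are assumed to hold point geometries in the same projected CRS, since the metric is euclidean):

result = get_nearest_neighbors(gdf_buildings, gdf_stops, k_neighbors=2)
# result holds all columns of gdf_buildings plus, for each neighbor rank k,
# the columns of gdf_stops suffixed with _k and a distance_k column (in CRS units)
print(result.filter(like='distance').head())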
I have a CSV of names, transaction amounts, and the exact longitude and latitude of the location where each transaction was performed.
I want the final document to be anonymized - for that I need to change it into a CSV where the names are hashed (that should be easy enough), and the longitude and latitude are obscured within a radius of 2km.
I.e., the coordinates should be changed so that they are no more than 2 km from the original location, but in a randomized way, so that the transformation is not revertible by a formula.
Does anyone know how to work with coordinates that way?
You could use locality sensitive hashing (LSH) to map similar co-ordinates (i.e. within a 2 KM radius), to the same value with a high probability. Hence, co-ordinates that map to the same bucket would be located closer together in Euclidean space.
Alternatively, another technique would be to use any standard hash function y = H(x) and compute y modulo N, where N is the range of co-ordinates. Assume your co-ordinates are P = (500, 700) and you would like to return a randomized value within a range of [-x, x] KM from P.
import random

P = (500, 700)
Range = 1000  # 1000 meters, for example

# Anonymize co-ordinates to within the specified range
ANON_X = hash(P[0]) % Range
ANON_Y = hash(P[1]) % Range

# Randomly add/subtract the offset from each co-ordinate
P = (P[0] + ANON_X * random.choice([-1, 1]), P[1] + ANON_Y * random.choice([-1, 1]))
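If the co-ordinates are geographic (lat/lon) rather than grid co-ordinates, a different, non-hashed variant of the same idea is a purely random offset within the 2 km radius. The sketch below is mine and relies on the rough approximation of about 111,320 m per degree:

import math
import random

def jitter_latlon(lat, lon, max_radius_m=2000):
    """Displace a (lat, lon) pair by a random offset of at most max_radius_m metres."""
    r = max_radius_m * math.sqrt(random.random())  # sqrt gives uniform density over the disc
    theta = random.uniform(0, 2 * math.pi)
    dlat = (r * math.cos(theta)) / 111_320                                   # metres -> degrees of latitude (approx.)
    dlon = (r * math.sin(theta)) / (111_320 * math.cos(math.radians(lat)))   # shrink with latitude
    return lat + dlat, lon + dlon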
I have this distribution of points (allPoints, which is a list of lists: [[x1,y1], [x2,y2], [x3,y3], [x4,y4], ..., [xn,yn]]):
From which I'd like to select points, randomly.
In Python I would do something like:
from random import *
point = choice(allPoints)
Except, I need the random pick to not be biased by the existing density. For instance, here, "choice" would tend to pick a point in the upmost-leftmost part of the plot.
How can I, in Python, get rid of this bias?
I've tried to divide the space into portions of size div and then sample within one of these portions, but in many cases no points exist there at all and the while loop never finds a solution:
def column(matrix, i):
    return [row[i] for row in matrix]

div = 10

min_x, max_x = min(column(allPoints, 0)), max(column(allPoints, 0))
min_y, max_y = min(column(allPoints, 1)), max(column(allPoints, 1))

zone_x_min = randint(1, div-1) * (max_x - min_x) / div + min_x
zone_x_max = zone_x_min + (max_x - min_x) / div
zone_y_min = randint(1, div-1) * (max_y - min_y) / div + min_y
zone_y_max = zone_y_min + (max_y - min_y) / div

p = choice(allPoints)
cont = True
while cont == True:
    if (p[0] > zone_x_min and p[0] < zone_x_max) and (p[1] > zone_y_min and p[1] < zone_y_max):
        cont = False
    else:
        p = choice(allPoints)
What would be a correct, inexpensive (if possible) solution to this problem?
If it weren't ridiculous, I think something like this would work for me, in theory:
p = [uniform(min_x, max_x), uniform(min_y, max_y)]
while p not in allPoints:
    p = [uniform(min_x, max_x), uniform(min_y, max_y)]
The question is a little ill-formed, but here's a stab.
The idea is to use a gaussian kernel density estimate, then sample from your data with weights equal to the inverse of the pdf at each point.
This is not statistically justifiable in any real sense.
import numpy as np
from scipy import stats
#random data
x = np.random.normal(size = 200)
y = np.random.normal(size = 200)
#estimate the density
kernel = stats.gaussian_kde(np.vstack([x,y]))
#calculate the inverse of pdf for each point, and normalise to sum to 1
pvector = 1/kernel.pdf(np.vstack([x,y]))/sum(1/kernel.pdf(np.vstack([x,y])))
#get a vector of indices based on your weights
np.random.choice(range(len(x)), size = 10, replace = True, p = pvector)
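To turn those indices back into coordinates, for example:

idx = np.random.choice(range(len(x)), size=10, replace=True, p=pvector)
sampled_points = np.column_stack([x[idx], y[idx]])  # 10 rows of (x, y)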
I believe you want to randomly select a datum point from your graph, that is, one of the little black dots.
Compute a centroid, or pick a point like (1.0, 70).
Compute the distance from each point to the centroid and let that be the probability of your choice of that point.
That is, if distance(P,C) is 100 and distance(Q,C) is 1, then let P be 100x more likely to be chosen. All points are eligible to win, but the crowded ones are individually less likely (they make it up in volume).
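A minimal sketch of that weighting, assuming allPoints is the list of [x, y] pairs from the question and using the distance to the centroid directly as the weight, as described above:

import numpy as np

pts = np.asarray(allPoints)
centroid = pts.mean(axis=0)
d = np.linalg.norm(pts - centroid, axis=1)
weights = d / d.sum()                      # farther (sparser) points get a larger probability
idx = np.random.choice(len(pts), p=weights)
point = pts[idx]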
If I understand your initial attempt correctly, I believe there is a simple adjustment you can make to make this work.
Randomly generate an x value (0,4.5), and a y value (0,70).
Then loop through allPoints to find the closest dot.
This has the downside that large empty areas all converge to a single point. A way to help with (not remove) this problem would be to give your random point a range: if no dot exists within that range, randomly generate a new point.
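A rough sketch of that idea, assuming allPoints as in the question (the accept/reject range is left out for brevity):

import numpy as np
from random import uniform

pts = np.asarray(allPoints)
rand_pt = np.array([uniform(pts[:, 0].min(), pts[:, 0].max()),
                    uniform(pts[:, 1].min(), pts[:, 1].max())])
point = pts[np.argmin(np.linalg.norm(pts - rand_pt, axis=1))]  # closest existing dot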
Assuming you want your selected points to be visually spread I can think of at least one "efficient/easy" method.
Choose a random point (with random.choice, for example);
remove from your initial set any point that is "close"*;
repeat until there is no point left in your set.
*This requires that you know from the beginning how dense you want your sample to be.
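A sketch of that procedure (min_dist is the "close" threshold you have to choose up front):

import random

def spread_sample(points, min_dist):
    """Greedy thinning: repeatedly pick a random point and drop everything within min_dist of it."""
    remaining = list(points)
    chosen = []
    while remaining:
        p = random.choice(remaining)
        chosen.append(p)
        remaining = [q for q in remaining
                     if (q[0] - p[0])**2 + (q[1] - p[1])**2 > min_dist**2]
    return chosen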
I have a 3D LiDAR point cloud representing a tree, loaded into Python with the laspy package. It is now stored as a numpy array. My purpose is to calculate the height of the tree by finding the point with the highest z-value and calculating the distance to the lowest z-value beneath it.
So I imported the data via:
inFile = laspy.file.File("~/DATA/tree.las", mode='r')
point_records = inFile.points
At the moment, I calculate the height by:
min = inFile.header.min
max = inFile.header.max
zdist = max[2] -min[2]
The problem is that this way, I do not take slope in the terrain into account. How can I index the point that is exactly below the highest one?
This is just a blind guess, because for a good answer, there is a lot of information missing.
Suppose we have an array of 3 points with (x, y, z):
A = [1,2,3]
B = [1,2,4]
C = [0,1,2]
We have identified point B as having the maximum z, and we have its x and y (its "lat" and "long") with
lat = 1
long = 2
Basically, you go through the list of points, filter out the points you want to look at (those sharing that x/y position), and take the minimal one. Below is a straightforward way to do that using a for loop. This is not ideal for speed; np.where() and fancy indexing could be used to do it more easily and faster, but this is more readable and adjustable:
import numpy as np

# This is some test data, with three data points
a = np.array([[1, 2, 3], [1, 2, 4], [0, 1, 2]])

# Now we define the lat and long we want to filter on
filter_x = 1
filter_y = 2

filtered_points = []
for i in range(a.shape[0]):  # iterating through all points
    if a[i][0] == filter_x and a[i][1] == filter_y:
        filtered_points.append(a[i][2])  # append z of the point to the list

print(min(filtered_points))  # print the minimum z
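As mentioned above, the same filtering can be done with a boolean mask and fancy indexing, which avoids the explicit loop:

mask = (a[:, 0] == filter_x) & (a[:, 1] == filter_y)  # points sharing the chosen x/y
print(a[mask, 2].min())                               # lowest z among them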
Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?
This is a screenshot of a book which contains a good definition of a covariogram:
If I understood it correctly, for a given lag/width h, I'm supposed to get all the pairs of points that are separated by h (or less than h), multiply their values, and for each of these points calculate its mean, which in this case is defined as m(x_i). However, according to the definition of m(x_i), if I want to compute m(x_1), I need to obtain the average of the values located within distance h from x_1. This looks like a very intensive computation.
First of all, am I understanding this correctly? If so, what is a good way to compute this in a two-dimensional space? I tried to code this in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, which is why I will refrain from posting that code here. Here is another attempt, a very naive implementation:
import numpy as np
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(np.array(coordinates)))  # coordinates is a nx2 array
z = np.array(z)  # z are the values
cutoff = np.max(distances) / 3.0  # somewhat arbitrary cutoff
width = cutoff / 15.0
widths = np.arange(0, cutoff + width, width)
Z = []
Cov = []

for w in np.arange(len(widths) - 1):  # for each width
    # for each pairwise distance
    for i in np.arange(distances.shape[0]):
        for j in np.arange(distances.shape[1]):
            if distances[i, j] <= widths[w + 1] and distances[i, j] > widths[w]:
                m1 = []
                m2 = []
                # when a distance is within a given width, calculate the means of
                # the points involved
                for x in np.arange(distances.shape[1]):
                    if distances[i, x] <= widths[w + 1] and distances[i, x] > widths[w]:
                        m1.append(z[x])
                for y in np.arange(distances.shape[1]):
                    if distances[j, y] <= widths[w + 1] and distances[j, y] > widths[w]:
                        m2.append(z[y])
                mean_m1 = np.array(m1).mean()
                mean_m2 = np.array(m2).mean()
                Z.append(z[i] * z[j] - mean_m1 * mean_m2)
    Z_mean = np.array(Z).mean()  # calculate covariogram for width w
    Cov.append(Z_mean)  # collect covariances for all widths
However, now I have confirmed that there is an error in my code. I know that because I used the variogram to calculate the covariogram (covariogram(h) = covariogram(0) - variogram(h)) and I get a different plot:
And it is supposed to look like this:
Finally, if you know a Python/R/MATLAB library to calculate empirical covariograms, let me know. At least, that way I can verify what I did.
One could use scipy.cov, but if one does the calculation directly (which is very easy), there are more ways to speed this up.
First, make some fake data that has some spacial correlations. I'll do this by first making the spatial correlations, and then using random data points that are generated using this, where the data is positioned according to the underlying map, and also takes on the values of the underlying map.
Edit 1:
I changed the data point generator so positions are purely random, but z-values are proportional to the spatial map. And I changed the map so that the left and right sides are shifted relative to each other, to create negative correlation at large h.
from numpy import *
import math
import random
import matplotlib.pyplot as plt

S = 1000
N = 900

# first, make some fake data, with correlations on two spatial scales
# density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8 * abs(outer(sx, sx))
density[:, :S//2] += .2

# make a point cloud motivated by this density
random.seed(10)  # so this can be repeated
points = []
while len(points) < N:
    v, ix, iy = random.random(), random.randint(0, S-1), random.randint(0, S-1)
    if True:  # v < density[ix, iy]:
        points.append([ix, iy, density[ix, iy]])

locations = array(points).transpose()
print(locations.shape)
plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1, :], locations[0, :], '.k')
plt.xlim((0, S))
plt.ylim((0, S))
plt.show()

# build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0,i]-L[0,j])**2 + (L[1,i]-L[1,j])**2), L[2,i], L[2,j]]
           for i in range(N) for j in range(N) if i > j])
Which gives:
The above is just the simulated data, and I made no attempt to optimize its production, etc. I assume this is where the OP starts, with the task below, since the data already exists in a real situation.
Now calculate the "covariogram" (which is much easier than generating the fake data, btw). The idea here is to sort all the pairs and associated values by h, and then index into these using ihvals. That is, summing up to index ihval is the sum over N(h) in the equation, since this includes all pairs with hs below the desired values.
Edit 2:
As suggested in the comments below, N(h) is now only the pairs that are between h-dh and h, rather than all pairs between 0 and h (where dh is the spacing of h-values in ihvals -- ie, S/1000 was used below).
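In symbols, my reading of what each pass of the loop below computes (this is my paraphrase, not a quote from the book) is

$$\hat C(h) \;=\; \frac{1}{N(h)} \sum_{(i,j)\in N(h)} z(x_i)\,z(x_j) \;-\; \bar z_h\,\bar z'_h ,$$

where $N(h)$ is the set of pairs whose separation lies between $h-dh$ and $h$, and $\bar z_h$, $\bar z'_h$ are the within-bin means of $z(x_i)$ and $z(x_j)$ (the szz, mnh and mph in the code).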
# now do the real calculations for the covariogram
# sort by h and give clear names
i = argsort(m[:,0])  # h sorting
h = m[i,0]
zh = m[i,1]
zsh = m[i,2]
zz = zh*zsh

hvals = linspace(0, S, 1000)  # the values of h to use (S should be in the units of distance, here I just used ints)
ihvals = searchsorted(h, hvals)
result = []
for i, ihval in enumerate(ihvals[1:]):
    start, stop = ihvals[i], ihval  # pairs whose h lies between consecutive hvals
    N = stop - start
    if N > 0:
        mnh = sum(zh[start:stop])/N
        mph = sum(zsh[start:stop])/N
        szz = sum(zz[start:stop])/N
        C = szz - mnh*mph
        result.append([h[ihval], C])

result = array(result)
plt.plot(result[:,0], result[:,1])
plt.grid()
plt.show()
which looks reasonable to me, as one can see bumps or troughs at the expected h values, but I haven't done a careful check.
The main speedup here over scipy.cov is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. This calculation could be sped up even more by doing partial sums, i.e., from ihvals[n-1] to ihvals[n] at each step n, but I doubt that will be necessary.
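For what it's worth, the partial-sum idea can be sketched with cumulative sums over the already-sorted arrays (my sketch, reusing the names from the code above):

czh = concatenate(([0.0], cumsum(zh)))
czsh = concatenate(([0.0], cumsum(zsh)))
czz = concatenate(([0.0], cumsum(zz)))
# each bin's sums then become differences, e.g. sum(zh[start:stop]) == czh[stop] - czh[start]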