I've got a dataset of refineries in Texas (GeoJSON here - https://pastebin.com/R0D9fif9 ):
Name,Latitude,Longitude
Marathon Petroleum,29.374722,-94.933611
Marathon Petroleum,29.368733,-94.903253
Valero,29.367617,-94.909515
LyondellBasell,29.71584,-95.234814
Valero,29.722213,-95.255198
Exxon,29.743865,-95.009208
Shell,29.720425,-95.12495
Petrobras,29.722466,-95.208807
I would like to create a printed map from these points, but at the target resolution some of them lie too close together.
Since every refinery should be mentioned in the legend, I can't cluster them. So I would like to:
1. Get the centroid - that was easy:
import json
import csv
from shapely.geometry import shape, Point, MultiPoint

with open('refineries.csv', 'r') as infile:
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

listo = list(zip(map(float, data['Longitude']), map(float, data['Latitude'])))
points1 = MultiPoint(points=listo)
points = MultiPoint([(-94.933611, 29.374722), (-94.903253, 29.368733), (-94.909515, 29.367617), (-95.234814, 29.71584), (-95.255198, 29.722213), (-95.009208, 29.743865), (-95.12495, 29.720425), (-95.208807, 29.722466)])
print(points.centroid)
2. Shift all points away from the centroid until a minimum distance between all of them is reached.
Could you please help me here? Thanks in advance!
It depends on how exactly you want to shift the points away from the centroid. One way would be to calculate, for each point, its great-circle distance and azimuth with respect to the centroid and then rescale all distances so that the distance between the two closest points exceeds a specified threshold. In the example below, pyproj is used to calculate the azimuths and distances.
import csv
from shapely.geometry import MultiPoint
from pyproj import Geod

with open('refineries.csv', 'r') as infile:
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            if header not in data:
                data[header] = []
            data[header].append(value)

listo = list(zip(map(float, data['Longitude']), map(float, data['Latitude'])))

def scale_coords(coords, required_dist=1000.):
    g = Geod(ellps='WGS84')
    num_of_points = len(coords)
    # calculate the centroid
    C = MultiPoint(coords).centroid
    # determine the minimum and maximum distance among the points
    dist_min, dist_max = float('inf'), float('-inf')
    for i in range(num_of_points):
        lon_i, lat_i = coords[i]
        for j in range(i + 1, num_of_points):
            lon_j, lat_j = coords[j]
            _, _, dist = g.inv(lon_i, lat_i, lon_j, lat_j)
            dist_min = min(dist_min, dist)
            dist_max = max(dist_max, dist)
    # nothing to do if the points are already far enough apart
    if dist_min > required_dist:
        return coords
    # rescale the distance of every point from the centroid
    coords_scaled = [None] * num_of_points
    scaling = required_dist / dist_min
    for i, (lon_i, lat_i) in enumerate(coords):
        az, _, dist = g.inv(C.x, C.y, lon_i, lat_i)
        lon_f, lat_f, _ = g.fwd(C.x, C.y, az, dist * scaling)
        coords_scaled[i] = (lon_f, lat_f)
    return coords_scaled
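For the data above, a minimal usage sketch (the 1 km minimum spacing is just an example value):

coords_scaled = scale_coords(listo, required_dist=1000.)
print(coords_scaled)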
Alternatively, this could be combined with an approach in which you also relax the azimuths. That would in principle allow a smaller scaling factor for the "radial" distances, but it would also slightly distort the "visual distribution" of the points. The method above might also be "improved" by ignoring outlier points in the rescaling, i.e., points which are already sufficiently far from the centroid and have no nearby neighbors; a rough sketch of that idea follows.
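A rough sketch of that last idea (my own variant, not part of the answer above): compute each point's nearest-neighbor distance first and rescale only the points that actually have a too-close neighbor. Note that a single pass may still leave some pairs slightly under the threshold, so it may need to be iterated.

from shapely.geometry import MultiPoint
from pyproj import Geod

def scale_crowded_coords(coords, required_dist=1000.):
    g = Geod(ellps='WGS84')
    C = MultiPoint(coords).centroid
    # nearest-neighbor distance of every point
    nn = []
    for i, (lon_i, lat_i) in enumerate(coords):
        dists = [g.inv(lon_i, lat_i, lon_j, lat_j)[2]
                 for j, (lon_j, lat_j) in enumerate(coords) if j != i]
        nn.append(min(dists))
    # only points with a too-close neighbor get moved
    crowded = [i for i, d in enumerate(nn) if d < required_dist]
    if not crowded:
        return list(coords)
    scaling = required_dist / min(nn[i] for i in crowded)
    coords_scaled = list(coords)
    for i in crowded:
        lon_i, lat_i = coords[i]
        az, _, dist = g.inv(C.x, C.y, lon_i, lat_i)
        lon_f, lat_f, _ = g.fwd(C.x, C.y, az, dist * scaling)
        coords_scaled[i] = (lon_f, lat_f)
    return coords_scaled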
Related
As seen in the picture, I have an outlier that I would like to remove (not the red point, but the green one above it, which is not aligned with the other points). To do that I am trying to find the minimum distances between points and then eliminate the outlier. But given the huge dataset it takes an eternity to execute. This is my code below. I appreciate any solution that helps, thanks!
import math

# list of 11600 points
dataset = [[2478, 3534], [4217, 953], ...]  # 11600 points in total
copy_dataset = dataset

Indices = []
Min_Dists = []
Distance = []
Copy_Dist = []

for p1 in range(len(dataset)):
    p1_x = dataset[p1][0]
    p1_y = dataset[p1][1]
    for p2 in range(len(copy_dataset)):
        p2_x = copy_dataset[p2][0]
        p2_y = copy_dataset[p2][1]
        dist = math.sqrt((p1_x - p2_x) ** 2 + (p1_y - p2_y) ** 2)
        Distance.append(dist)
        Copy_Dist.append(dist)
    min_dist_1 = min(Distance)
    Distance.remove(min_dist_1)
    if min_dist_1 != 0:
        Min_Dists.append(min_dist_1)
        ind_1 = Copy_Dist.index(min_dist_1)
        Indices.append(ind_1)
    min_dist_2 = min(Distance)
    Distance.remove(min_dist_2)
    if min_dist_2 != 0:
        Min_Dists.append(min_dist_2)
        ind_2 = Copy_Dist.index(min_dist_2)
        Indices.append(ind_2)
    To_Remove = copy_dataset.index([p1_x, p1_y])
    copy_dataset.remove(copy_dataset[To_Remove])
Not sure how to solve this problem in general, but it's probably a lot faster to compute the distances in a vectorized fashion:

import numpy as np

dataset = np.asarray(dataset)           # the list of points as an (N, 2) array
dataset_copy = dataset.copy()
dataset_copy = dataset_copy[:, np.newaxis]
distance = np.sqrt(np.sum(np.square(dataset - dataset_copy), axis=~0))
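To get from that matrix to each point's nearest-neighbor distance (a small follow-up sketch building on the code above), mask the diagonal and take the row-wise minima:

np.fill_diagonal(distance, np.inf)   # a point's zero distance to itself is not interesting
nearest_dist = distance.min(axis=1)  # distance from each point to its nearest neighbor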
Thank you for the answers, mates! I tried the approach below to solve the issue and it worked pretty quickly.
import numpy as np
from statistics import mean
from scipy.spatial import distance

D = distance.squareform(distance.pdist(dataset))
closest = np.argsort(D, axis=1)

d1 = []
for i in range(len(dataset)):
    d1.append(D[i][closest[i][1]])
avg_dist = int(mean(d1))

for i in range(len(dataset)):
    d1 = D[i][closest[i][1]]
    d2 = D[i][closest[i][2]]
    if abs(avg_dist - d1) > 2:
        if abs(avg_dist - d2) > 2:
            print(dataset[i])
            # note: removing items while indexing into the precomputed D
            # shifts the remaining indices, so D no longer matches dataset
            dataset.remove(dataset[i])
If you need all distances at once:

import scipy.spatial

distances = scipy.spatial.distance_matrix(dataset, dataset)

If you need the distances from one point to all others:

import numpy as np

for pt in dataset:
    distances = scipy.spatial.distance_matrix([pt], dataset)[0]
    # distances.min() will be 0 because the point has 0 distance to itself
    # the nearest neighbor will be the second element in sorted order
    indices = np.argpartition(distances, 1)  # or use argsort for a complete sort
    nearest_neighbor = indices[1]

Documentation: distance_matrix, argpartition
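Applied to the outlier problem in the question, a rough sketch (the 3-sigma cut is my own choice, and note that the full N x N matrix needs roughly 1 GB of memory for ~11600 points):

import numpy as np
import scipy.spatial

pts = np.asarray(dataset, dtype=float)
D = scipy.spatial.distance_matrix(pts, pts)
np.fill_diagonal(D, np.inf)               # ignore zero self-distances
nn_dist = D.min(axis=1)                   # nearest-neighbor distance of every point
threshold = nn_dist.mean() + 3 * nn_dist.std()
outlier_idx = set(np.where(nn_dist > threshold)[0])
cleaned = [p for i, p in enumerate(dataset) if i not in outlier_idx]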
I have a list of Shapely polygons and a point like so:
from shapely.geometry import Point, Polygon
polygons = [Polygon(...), Polygon(...), ...]
point = Point(2.5, 5.7)
and I want to find the closest polygon in the list to that point. I'm already aware of the object.distance(other) function which returns the minimum distance between two geometric shapes, and I thought about computing all the distances in a loop to find the closest polygon:
polygons = [Polygon(...), Polygon(...), ...]
point = Point(2.5, 5.7)

min_dist = 10000
closest_polygon = None

for polygon in polygons:
    dist = polygon.distance(point)
    if dist < min_dist:
        min_dist = dist
        closest_polygon = polygon
My question is: Is there a more efficient way to do it?
There is a shorter way, e.g.

from shapely.geometry import Point, Polygon
import random
from operator import itemgetter

def random_coords(n):
    return [(random.randint(0, 100), random.randint(0, 100)) for _ in range(n)]

polys = [Polygon(random_coords(3)) for _ in range(4)]
point = Point(random_coords(1))

min_distance, min_poly = min(((poly.distance(point), poly) for poly in polys), key=itemgetter(0))
As Georgy mentioned (++awesome!), an even more concise version is:

min_poly = min(polys, key=point.distance)

But distance computation is, in general, computationally intensive; a spatial index can help when there are many queries, as sketched below.
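A minimal sketch of that spatial-index alternative, assuming Shapely 2.x (where STRtree.nearest returns the index of the nearest geometry; in Shapely 1.8 it returned the geometry itself):

from shapely.geometry import Point, Polygon
from shapely.strtree import STRtree

polygons = [Polygon([(0, 0), (1, 0), (1, 1)]),
            Polygon([(5, 5), (6, 5), (6, 6)])]
tree = STRtree(polygons)                        # build the index once
point = Point(2.5, 5.7)
closest_polygon = polygons[tree.nearest(point)]

This pays off mainly when many points are queried against the same set of polygons, since the index is built once and reused.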
I have a solution that works if you have at least 2 polygons whose mutual distance is not 0. Let's call these two polygons "basePolygon0" and "basePolygon1". The idea is to build a KD-tree with the distance of each polygon to each of the two "basis" polygons.
Once the KD tree has been built, we can query it by computing the distance to each of the basis polygons.
Here's a working example:
from shapely.geometry import Point, Polygon
import numpy as np
from scipy.spatial import KDTree

# prepare a test with triangles
poly0 = Polygon([(3,-1),(5,-1),(4,2)])
poly1 = Polygon([(-2,1),(-4,2),(-3,4)])
poly2 = Polygon([(-3,-3),(-4,-6),(-2,-6)])
poly3 = Polygon([(-1,-4),(1,-4),(0,-1)])
polys = [poly0,poly1,poly2,poly3]

p0 = Point(4,-3)
p1 = Point(-4,1)
p2 = Point(-4,-2)
p3 = Point(0,-2.5)
testPoints = [p0,p1,p2,p3]

# select basis polygons
# it works with any pair of polygons that have non zero distance
basePolygon0 = polys[0]
basePolygon1 = polys[1]

# compute tree query
def buildQuery(point):
    distToBasePolygon0 = basePolygon0.distance(point)
    distToBasePolygon1 = basePolygon1.distance(point)
    return np.array([distToBasePolygon0,distToBasePolygon1])

distances = np.array([buildQuery(poly) for poly in polys])

# build the KD tree
tree = KDTree(distances)

# test it
for p in testPoints:
    q = buildQuery(p)
    output = tree.query(q)
    print(output)
This yields as expected:
# (distance, polygon_index_in_KD_tree)
(2.0248456731316584, 0)
(1.904237866994273, 1)
(1.5991500555008626, 2)
(1.5109986459170694, 3)
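To map a query result back to an actual polygon, a small usage sketch reusing the names above (note that the returned distance is measured in the two-dimensional "distance to the basis polygons" space, not the true point-to-polygon distance):

q0 = buildQuery(p0)
dist_in_basis_space, idx = tree.query(q0)
nearest_poly = polys[idx]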
There is one way that might be faster, but without doing any actual tests, it's hard for me to say for sure.
This might not work for your situation, but the basic idea is that each time a Shapely object is added to the array, you adjust the positions of the other array elements so that the array always stays sorted by the key. In Python, this can be done with the heapq module. The only issue with that module is that it's hard to supply a custom comparison function for the objects, so you have to do something like this answer, where you make a custom class whose heap entries are (key, object) tuples.
import heapq

class MyHeap(object):
    def __init__(self, initial=None, key=lambda x: x):
        self.key = key
        if initial:
            self._data = [(key(item), item) for item in initial]
            heapq.heapify(self._data)
        else:
            self._data = []

    def push(self, item):
        heapq.heappush(self._data, (self.key(item), item))

    def pop(self):
        return heapq.heappop(self._data)[1]
The first element in the tuple is the "key", which in this case is the distance to the point; the second element is the actual Shapely object. You could use it like so:
point = Point(2.5, 5.7)
heap = MyHeap(initial=None, key=lambda x:x.distance(point))
heap.push(Polygon(...))
heap.push(Polygon(...))
# etc...
And at the end, the object you're looking for will be at heap.pop().
Ultimately, though, both algorithms seem to be (roughly) O(n), so any speed up would not be a significant one.
I want to simulate movement on a real-world map (spherical) and represent the current position on (Google|OpenStreet) maps.
I have an initial lat/long pair, e.g. (51.506314, -0.088455), and want to move to e.g. (51.509359, -0.087221) at a certain speed by getting interpolated coordinates periodically.
Pseudocode for clarification:
loc_init = (51.506314, -0.088455)
loc_target = (51.509359, -0.087221)

move_path = Something.path(loc_init, loc_target, speed=50)

for loc in move_path.get_current_loc():
    map.move_to(loc)
    device.notify_new_loc(loc)
    ...
    time.sleep(1)
Retrieving the current interpolated position could happen in different ways, e.g. calculating it with a fixed refresh time (1 s), or running a thread that holds and continuously calculates new positions.
Unfortunately, I have never worked with geo data before and can't find anything useful on the internet. Maybe there is already a module or an implementation that does this?
Solved my problem:
I found the C++ library geographiclib, which has been ported to Python and does exactly what I was looking for.
Example code to calculate an inverse geodesic line and get positions for specific distances along it:
from geographiclib.geodesic import Geodesic
import math

# define the WGS84 ellipsoid
geod = Geodesic.WGS84

loc_init = (51.501218, -0.093773)
loc_target = (51.511020, -0.086563)

g = geod.Inverse(loc_init[0], loc_init[1], loc_target[0], loc_target[1])
l = geod.InverseLine(loc_init[0], loc_init[1], loc_target[0], loc_target[1])

print("The distance is {:.3f} m.".format(g['s12']))

# interval in m for the interpolated line between locations
interval = 500
step = int(math.ceil(l.s13 / interval))

for i in range(step + 1):
    if i == 0:
        print("distance latitude longitude azimuth")
    s = min(interval * i, l.s13)
    loc = l.Position(s, Geodesic.STANDARD | Geodesic.LONG_UNROLL)
    print("{:.0f} {:.5f} {:.5f} {:.5f}".format(
        loc['s12'], loc['lat2'], loc['lon2'], loc['azi2']))
Gives:
The distance is 1199.958 m.
distance latitude longitude azimuth
0 51.50122 -0.09377 24.65388
500 51.50530 -0.09077 24.65623
1000 51.50939 -0.08776 24.65858
1200 51.51102 -0.08656 24.65953
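To tie this back to the pseudocode in the question, here is a rough sketch of a generator that yields interpolated (lat, lon) positions at a given speed (the function name, the metres-per-second speed unit, and the one-second period are my own assumptions):

import time
from geographiclib.geodesic import Geodesic

def interpolated_path(loc_init, loc_target, speed=50, period=1.0):
    # speed in m/s, period in seconds between yielded positions
    line = Geodesic.WGS84.InverseLine(loc_init[0], loc_init[1],
                                      loc_target[0], loc_target[1])
    travelled = 0.0
    while travelled < line.s13:
        pos = line.Position(travelled, Geodesic.STANDARD | Geodesic.LONG_UNROLL)
        yield pos['lat2'], pos['lon2']
        travelled += speed * period
    end = line.Position(line.s13, Geodesic.STANDARD | Geodesic.LONG_UNROLL)
    yield end['lat2'], end['lon2']

for loc in interpolated_path((51.501218, -0.093773), (51.511020, -0.086563)):
    print(loc)
    time.sleep(1)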
My data object is an instance of:
class data_instance:
    def __init__(self, data, tlabel):
        self.data = data          # 1xd numpy array
        self.true_label = tlabel  # integer {1,-1}
So far in my code, I have a list called data_history full of data_instance objects and a set of centers (a numpy array with shape (k,d)).
For a given data_instance new_data, I want to:
1/ Get the nearest center to new_data from centers (by Euclidean distance); let it be called Nearest_center.
2/ Iterate through data_history and:
2.1/ Select the elements whose nearest center is Nearest_center (the result of 1/) into a list called neighbors.
2.2/ Get the labels of the objects in neighbors.
Below is my code, which works, but it is still slow, and I am looking for something more efficient.
My Code
For 1/
import numpy as np

def getNearestCenter(data, centers):
    if centers.shape != (1, 2):
        # compute the distance between data and all centers
        dist_ = np.sqrt(np.sum(np.power(data - centers, 2), axis=1))
        # the center with the minimum distance from data
        center = centers[np.argmin(dist_)]
    else:
        center = centers[0]
    return center
For 2/ (To optimize)
def getLabel(dataPoint, C, history):
    labels = []
    cluster = getNearestCenter(dataPoint.data, C)
    for x in history:
        if np.all(getNearestCenter(x.data, C) == cluster):
            labels.append(x.true_label)
    return labels
You should rather use the optimized cdist from scipy.spatial, which is more efficient than computing the distances with plain numpy:
from scipy.spatial.distance import cdist
dist = cdist(data, C, metric='euclidean')
dist_idx = np.argmin(dist, axis=1)
An even more elegant solution is to use scipy.spatial.cKDTree (as pointed out by @Saullo Castro in the comments), which could be faster for a large dataset:
from scipy.spatial import cKDTree
tr = cKDTree(C)
dist, dist_idx = tr.query(data, k=1)
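Either variant can also be used to vectorize step 2/ of the question (collecting the labels of the history points that share the same nearest center). A rough sketch, with a helper name of my own choosing, assuming the data attributes can be stacked into an (n, d) array:

import numpy as np
from scipy.spatial.distance import cdist

def getLabelVectorized(dataPoint, C, history):
    X = np.vstack([x.data for x in history])               # (n, d) history data
    hist_nearest = np.argmin(cdist(X, C), axis=1)          # nearest-center index per history point
    point_nearest = np.argmin(cdist(np.atleast_2d(dataPoint.data), C), axis=1)[0]
    return [x.true_label for x, c in zip(history, hist_nearest) if c == point_nearest]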
Found it:
dist_ = np.argmin(np.sqrt(np.sum(np.power(data[:, None] - C, 2), axis=2)), axis=1)
This returns, for each point in data, the index of the nearest center in C (the centers array).
I need to compare some theoretical data with real data in python.
The theoretical data comes from solving an equation.
To improve the comparison I would like to remove data points that fall far from the theoretical curve. I mean, I want to remove the points below and above the red dashed lines in the figure (made with matplotlib).
Both the theoretical curves and the data points are arrays of different length.
I can try to remove the points in a rough, by-eye way; for example, the first upper point can be detected using:
data2[(data2.redshift < 0.4) & (data2.dmodulus > 1)]
rec.array([('1997o', 0.374, 1.0203223485103787, 0.44354759972859786)], dtype=[('SN_name', '|S10'), ('redshift', '<f8'), ('dmodulus', '<f8'), ('dmodulus_error', '<f8')])
But I would like to use a less rough, more systematic way.
So, can anyone help me finding an easy way of removing the problematic points?
Thank you!
This might be overkill and is based on your comment:
Both the theoretical curves and the data points are arrays of different length.
I would do the following:
1. Truncate the data set so that its x values lie within the max and min values of the theoretical set.
2. Interpolate the theoretical curve using scipy.interpolate.interp1d and the above truncated data x values. The reason for step (1) is to satisfy the constraints of interp1d.
3. Use numpy.where to find the data y values that are outside the range of acceptable theory values.
4. DON'T discard these values, as was suggested in comments and other answers. If you want clarity, point them out by plotting the 'inliers' in one color and the 'outliers' in another color.
Here's a script that is close to what you are looking for, I think. It hopefully will help you accomplish what you want:
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt

# make up data
def makeUpData():
    '''Make many more data points (x,y,yerr) than theory (x,y),
    with theory yerr corresponding to a constant "sigma" in y,
    about the x,y value'''
    NX = 150
    dataX = (np.random.rand(NX)*1.1)**2
    dataY = (1.5*dataX+np.random.rand(NX)**2)*dataX
    dataErr = np.random.rand(NX)*dataX*1.3
    theoryX = np.arange(0,1,0.1)
    theoryY = theoryX*theoryX*1.5
    theoryErr = 0.5
    return dataX,dataY,dataErr,theoryX,theoryY,theoryErr

def makeSameXrange(theoryX,dataX,dataY):
    '''
    Truncate the dataX and dataY ranges so that dataX min and max are within
    the max and min of theoryX.
    '''
    minT,maxT = theoryX.min(),theoryX.max()
    goodIdxMax = np.where(dataX<maxT)
    goodIdxMin = np.where(dataX[goodIdxMax]>minT)
    return (dataX[goodIdxMax])[goodIdxMin],(dataY[goodIdxMax])[goodIdxMin]

# take 'theory' and get values at every 'data' x point
def theoryYatDataX(theoryX,theoryY,dataX):
    '''For every dataX point, find the interpolated theoryY value. theoryX is
    needed for the interpolation.'''
    f = interpolate.interp1d(theoryX,theoryY)
    return f(dataX[np.where(dataX<np.max(theoryX))])

# collect valid points
def findInlierSet(dataX,dataY,interpTheoryY,theoryErr):
    '''Find where theoryY-theoryErr < dataY < theoryY+theoryErr and return
    the valid points.'''
    withinUpper = np.where(dataY<(interpTheoryY+theoryErr))
    withinLower = np.where(dataY[withinUpper]
                           >(interpTheoryY[withinUpper]-theoryErr))
    return (dataX[withinUpper])[withinLower],(dataY[withinUpper])[withinLower]

def findOutlierSet(dataX,dataY,interpTheoryY,theoryErr):
    '''Find where dataY is outside theoryY +- theoryErr and return
    the outlying points.'''
    withinUpper = np.where(dataY>(interpTheoryY+theoryErr))
    withinLower = np.where(dataY<(interpTheoryY-theoryErr))
    return (dataX[withinUpper],dataY[withinUpper],
            dataX[withinLower],dataY[withinLower])

if __name__ == "__main__":
    dataX,dataY,dataErr,theoryX,theoryY,theoryErr = makeUpData()
    TruncDataX,TruncDataY = makeSameXrange(theoryX,dataX,dataY)
    interpTheoryY = theoryYatDataX(theoryX,theoryY,TruncDataX)
    inDataX,inDataY = findInlierSet(TruncDataX,TruncDataY,interpTheoryY,
                                    theoryErr)
    outUpX,outUpY,outDownX,outDownY = findOutlierSet(TruncDataX,
                                                     TruncDataY,
                                                     interpTheoryY,
                                                     theoryErr)

    fig = plt.figure()
    ax = fig.add_subplot(211)
    ax.errorbar(dataX,dataY,dataErr,fmt='.',color='k')
    ax.plot(theoryX,theoryY,'r-')
    ax.plot(theoryX,theoryY+theoryErr,'r--')
    ax.plot(theoryX,theoryY-theoryErr,'r--')
    ax.set_xlim(0,1.4)
    ax.set_ylim(-.5,3)
    ax = fig.add_subplot(212)
    ax.plot(inDataX,inDataY,'ko')
    ax.plot(outUpX,outUpY,'bo')
    ax.plot(outDownX,outDownY,'ro')
    ax.plot(theoryX,theoryY,'r-')
    ax.plot(theoryX,theoryY+theoryErr,'r--')
    ax.plot(theoryX,theoryY-theoryErr,'r--')
    ax.set_xlim(0,1.4)
    ax.set_ylim(-.5,3)
    fig.savefig('findInliers.png')
The resulting figure (findInliers.png) shows the data with the theory band in the top panel, and the inliers and outliers in different colors in the bottom panel.
In the end I used some of Yann's code:
def theoryYatDataX(theoryX,theoryY,dataX):
    '''For every dataX point, find the interpolated theoryY value. theoryX is
    needed for the interpolation.'''
    f = interpolate.interp1d(theoryX,theoryY)
    return f(dataX[np.where(dataX<np.max(theoryX))])

def findOutlierSet(data,interpTheoryY,theoryErr):
    '''Find where data.dmodulus is outside theoryY +- theoryErr and return
    the inlying and outlying records.'''
    up = np.where(data.dmodulus > (interpTheoryY+theoryErr))
    low = np.where(data.dmodulus < (interpTheoryY-theoryErr))
    # join all the indices together in one flat array
    out = np.hstack([up,low]).ravel()
    index = np.array(np.ones(len(data),dtype=bool))
    index[out] = False
    datain = data[index]
    dataout = data[out]
    return datain, dataout

def selectdata(data,theoryX,theoryY):
    """
    Data selection: z<1 and +-0.5 LFLRW separation
    """
    # select data with redshift z<1
    data1 = data[data.redshift < 1]
    # from modulus to light distance:
    data1.dmodulus, data1.dmodulus_error = modulus2distance(data1.dmodulus,data1.dmodulus_error)
    # order the data by redshift
    data1.sort(order='redshift')
    # outliers: distance to the LFLRW curve bigger than +-0.5
    theoryErr = 0.5
    # interpolate the theory curve to get values at the same points as the data
    interpy = theoryYatDataX(theoryX,theoryY,data1.redshift)
    datain, dataout = findOutlierSet(data1,interpy,theoryErr)
    return datain, dataout
Using those functions I finally obtain the separation I was after.
Thank you all for your help.
Just look at the difference between the red curve and the points: if it is bigger than the difference between the red curve and the dashed red curve, remove the point.

diff = np.abs(points - red_curve)
index = (diff > (dashed_curve - red_curve))
filtered = points[~index]   # keep only the points inside the band

But please take the comment from NickLH seriously. Your data looks pretty good without any filtering; your "outliers" all have a very big error and won't affect the fit much.
You could either use numpy.where() to identify which xy pairs meet your plotting criteria, or perhaps enumerate to do pretty much the same thing. Example:

x_list = [1, 2, 3, 4, 5, 6]
y_list = ['f', 'o', 'o', 'b', 'a', 'r']
result = [y_list[i] for i, x in enumerate(x_list) if 2 <= x < 5]
print(result)

I'm sure you could change the conditions so that '2' and '5' in the above example are replaced by functions of your curves; the numpy.where() version of the same filter is sketched below.
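A small sketch of that numpy.where() variant, using the same toy lists (my own example, not from the answer above):

import numpy as np

x_arr = np.array([1, 2, 3, 4, 5, 6])
y_arr = np.array(['f', 'o', 'o', 'b', 'a', 'r'])
idx = np.where((x_arr >= 2) & (x_arr < 5))   # indices where the condition holds
print(y_arr[idx])                            # -> ['o' 'o' 'b']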