My data object is an instance of:
class data_instance:
    def __init__(self, data, tlabel):
        self.data = data              # 1 x d numpy array
        self.true_label = tlabel      # integer in {1, -1}
So far I have a list called data_history filled with data_instance objects, and a set of centers (a numpy array of shape (k, d)).
For a given data_instance new_data, I want:
1/ Get the nearest center to new_data from centers (by Euclidean distance); let it be called Nearest_center.
2/ Iterate through data_history and:
2.1/ Select the elements whose nearest center is Nearest_center (the result of 1/) into a list called neighbors.
2.2/ Get the labels of the objects in neighbors.
Below is my code, which works but is still slow; I am looking for something more efficient.
My Code
For 1/
def getNearestCenter(data, centers):
    if centers.shape != (1, 2):
        dist_ = np.sqrt(np.sum(np.power(data - centers, 2), axis=1))  # distance between data and every center
        center = centers[np.argmin(dist_)]  # the center with the minimum distance to data
    else:
        center = centers[0]
    return center
For 2/ (To optimize)
def getLabel(dataPoint, C, history):
    labels = []
    cluster = getNearestCenter(dataPoint.data, C)
    for x in history:
        if np.all(getNearestCenter(x.data, C) == cluster):
            labels.append(x.true_label)
    return labels
You should rather use the optimized cdist from scipy.spatial.distance, which is more efficient than computing the distances by hand with numpy:
from scipy.spatial.distance import cdist

dist = cdist(data, C, metric='euclidean')  # pairwise distances, shape (n_points, n_centers)
dist_idx = np.argmin(dist, axis=1)         # index of the nearest center for each point
An even more elegant solution is to use scipy.spatial.cKDTree (as pointed out by @Saullo Castro in the comments), which could be faster for a large dataset:
from scipy.spatial import cKDTree

tr = cKDTree(C)                       # build the tree on the centers once
dist, dist_idx = tr.query(data, k=1)  # nearest-center distance and index for each point
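As a quick usage note (my addition, assuming C is the (k, d) array of centers and data is a 2-D array of query points), the returned indices map straight back to the center coordinates:

nearest_centers = C[dist_idx]  # shape (n_points, d): the nearest center of each query point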
Found it:
dist_ = np.argmin(np.sqrt(np.sum(np.power(data[:, None] - C, 2), axis=2)), axis=1)
This returns, for each row of data, the index of its nearest center in centers.
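A minimal sketch of how this could replace the loop in getLabel, assuming data_history and the new data_instance new_data as described above (the stacked helper arrays are my own names, not part of the original code):

import numpy as np

# stack the history once (assumed helpers built from data_history)
history_data = np.vstack([x.data for x in data_history])         # shape (n, d)
history_labels = np.array([x.true_label for x in data_history])  # shape (n,)

# nearest-center index for every history point and for the new point
history_idx = np.argmin(np.sqrt(np.sum((history_data[:, None] - C) ** 2, axis=2)), axis=1)
new_idx = np.argmin(np.sqrt(np.sum((new_data.data - C) ** 2, axis=1)))

labels = history_labels[history_idx == new_idx]  # labels of the points sharing new_data's nearest center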
Related
I have a dataframe df with four columns = ['ID', 'Lat', 'Lon', 'Elevation'] and n rows; each row represents one point. I want to create four new columns = ['Aggregated_ID', 'Lat_mean', 'Lon_mean', 'Ele_mean'] and add them to df, so that points closer than a certain z-value share the same 'Aggregated_ID' and get the average Lat, Lon and Elevation in the other columns. I also have the n x n distance matrix between the points. I tried this:
from scipy.cluster.hierarchy import fclusterdata
def create_ID_column(df, distances, z, start_id=0):
    clusters = fclusterdata(distances, z, criterion='distance')
    df['AggregatedID'] = np.char.add("Aggregate_", (clusters + start_id).astype(str))
    return df, max(clusters) + start_id
def create_mean_coordinate_columns(df):
    mean_coordinates = df.groupby('AggregatedID').mean().reset_index()
    mean_coordinates = mean_coordinates[['AggregatedID', 'Lon', 'Lat', 'Elevation']]
    mean_coordinates = mean_coordinates.rename(columns={'Lon': 'Lon_mean', 'Lat': 'Lat_mean', 'Elevation': 'Ele_mean'})
    df = df.merge(mean_coordinates, on='AggregatedID')
    return df
z = 1000
dist = ...  # the n x n matrix of distances
df, start_id = create_ID_column(df, dist, z)
df = create_mean_coordinate_columns(df)
It works quite fast if I have few points, but now I need to do this operation on 60,000 points. Although I have 32 GB of RAM, this code uses all of it and has been running since yesterday. Is there a way to make it faster? I show both steps, but the problem is only the call to the create_ID_column function, which does the clustering. Thanks!
EDIT: I think the only option is to use a different algorithm instead of fclusterdata, but I don't know which one. Now I'm trying this:
from sklearn.cluster import AgglomerativeClustering

def create_ID_column(df, distances, z, start_id=0):
    clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=z, linkage='complete')
    clustering.fit(distances)
    clusters = clustering.labels_
    df['AggregatedID'] = np.char.add("Aggregate_", (clusters + start_id).astype(str))
    return df, max(clusters) + start_id

df, start_id = create_ID_column(df, dist, z)
I ran it and I hope it's faster, but I'm not optimistic because it seems to show the same behaviour. I don't understand why it's so difficult to obtain the result: 60,000 points are a lot, but not that many.
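Two observations that may explain what you are seeing (my notes, not part of the original post): first, a 60,000 x 60,000 distance matrix in float64 already takes 60000² x 8 bytes ≈ 29 GB, which on its own nearly fills 32 GB of RAM; storing it as float32 halves that. Second, as far as I know, if the matrix passed to fit is meant to be interpreted as precomputed distances, AgglomerativeClustering has to be told so explicitly; otherwise each row is treated as a 60,000-dimensional feature vector. A minimal sketch under those assumptions:

from sklearn.cluster import AgglomerativeClustering

# 'dist' is assumed to be the precomputed n x n distance matrix from the question
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=z,
    metric='precomputed',   # older scikit-learn versions call this parameter 'affinity'
    linkage='complete',     # 'ward' does not accept precomputed distances
)
labels = clustering.fit_predict(dist)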
As seen in the picture, I have an outlier I would like to remove (not the red one but the green one above it, which is not aligned with the other points), and hence I am trying to find the minimum distances and then eliminate it. But given the huge dataset it takes an eternity to execute. My code is below. I appreciate any solution that helps, thanks!
import math

# list of 11600 points
dataset = [[2478, 3534], [4217, 953], ......, 11600 points]
copy_dataset = dataset  # NOTE: this is an alias, not a copy; removing from copy_dataset also removes from dataset

Indices = []
Min_Dists = []
Distance = []
Copy_Dist = []

for p1 in range(len(dataset)):
    p1_x = dataset[p1][0]
    p1_y = dataset[p1][1]
    for p2 in range(len(copy_dataset)):
        p2_x = copy_dataset[p2][0]
        p2_y = copy_dataset[p2][1]
        dist = math.sqrt((p1_x - p2_x) ** 2 + (p1_y - p2_y) ** 2)
        Distance.append(dist)
        Copy_Dist.append(dist)
    min_dist_1 = min(Distance)
    Distance.remove(min_dist_1)
    if min_dist_1 != 0:
        Min_Dists.append(min_dist_1)
        ind_1 = Copy_Dist.index(min_dist_1)
        Indices.append(ind_1)
    min_dist_2 = min(Distance)
    Distance.remove(min_dist_2)
    if min_dist_2 != 0:
        Min_Dists.append(min_dist_2)
        ind_2 = Copy_Dist.index(min_dist_2)
        Indices.append(ind_2)
    To_Remove = copy_dataset.index([p1_x, p1_y])
    copy_dataset.remove(copy_dataset[To_Remove])
Not sure how to solve this problem in general, but it's probably a lot faster to compute the distances in a vectorized fashion.
import numpy as np

dataset = np.asarray(dataset)          # make sure we work with an array, not a list of lists
dataset_copy = dataset[:, np.newaxis]  # shape (n, 1, 2), so broadcasting produces all pairs
distance = np.sqrt(np.sum(np.square(dataset - dataset_copy), axis=-1))  # (n, n) distance matrix
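One possible follow-up (my addition, not part of the original answer): with the full matrix in hand, the nearest-neighbour distance of every point can be read off by masking the zero diagonal, which is what the outlier hunt needs:

np.fill_diagonal(distance, np.inf)   # ignore each point's zero distance to itself
nearest_dist = distance.min(axis=1)  # nearest-neighbour distance per point
suspect = np.argmax(nearest_dist)    # index of the point farthest from its nearest neighbour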
Thank you for the answers, mates! I tried the approach below to solve the issue; it worked pretty quickly.
from statistics import mean
import numpy as np
from scipy.spatial import distance

D = distance.squareform(distance.pdist(dataset))  # full pairwise distance matrix
closest = np.argsort(D, axis=1)                   # per row: column indices sorted by distance (column 0 is the point itself)

d1 = []
for i in range(len(dataset)):
    d1.append(D[i][closest[i][1]])                # distance to the nearest neighbour
avg_dist = int(mean(d1))

for i in range(len(dataset)):
    d1 = D[i][closest[i][1]]
    d2 = D[i][closest[i][2]]
    if abs(avg_dist - d1) > 2:
        if abs(avg_dist - d2) > 2:
            print(dataset[i])
            dataset.remove(dataset[i])
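As an aside (my sketch, not part of the original post), the two Python loops over D can be replaced by fancy indexing with the same logic, assuming D and closest are as above:

n = len(D)
rows = np.arange(n)
d1 = D[rows, closest[:, 1]]  # nearest-neighbour distance for every point
d2 = D[rows, closest[:, 2]]  # second-nearest-neighbour distance
avg_dist = d1.mean()
outlier_mask = (np.abs(avg_dist - d1) > 2) & (np.abs(avg_dist - d2) > 2)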
If you need all distances at once:
import scipy.spatial

distances = scipy.spatial.distance_matrix(dataset, dataset)
If you need distances of one point to all others:
for pt in dataset:
    distances = scipy.spatial.distance_matrix([pt], dataset)[0]
    # distances.min() will be 0 because the point has 0 distance to itself,
    # so the nearest neighbor will be the second element in sorted order
    indices = np.argpartition(distances, 1)  # or use argsort for a complete sort
    nearest_neighbor = indices[1]
Documentation: distance_matrix, argpartition
I have a list of Shapely polygons and a point like so:
from shapely.geometry import Point, Polygon
polygons = [Polygon(...), Polygon(...), ...]
point = Point(2.5, 5.7)
and I want to find the closest polygon in the list to that point. I'm already aware of the object.distance(other) function which returns the minimum distance between two geometric shapes, and I thought about computing all the distances in a loop to find the closest polygon:
polygons = [Polygon(...), Polygon(...), ...]
point = Point(2.5, 5.7)

min_dist = 10000
closest_polygon = None
for polygon in polygons:
    dist = polygon.distance(point)
    if dist < min_dist:
        min_dist = dist
        closest_polygon = polygon
My question is: Is there a more efficient way to do it?
There is a shorter way, e.g.
from shapely.geometry import Point, Polygon
import random
from operator import itemgetter

def random_coords(n):
    return [(random.randint(0, 100), random.randint(0, 100)) for _ in range(n)]

polys = [Polygon(random_coords(3)) for _ in range(4)]
point = Point(random_coords(1))

min_distance, min_poly = min(((poly.distance(point), poly) for poly in polys), key=itemgetter(0))
As Georgy mentioned (++awesome!), even more concise:
min_poly = min(polys, key=point.distance)
But distance computation is, in general, computationally intensive.
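If the list of polygons is large, a spatial index can avoid scanning every polygon for every query point. A minimal sketch using Shapely's STRtree, assuming Shapely 2.x (where nearest returns the index of the closest geometry; in Shapely 1.8 it returns the geometry itself):

from shapely.geometry import Point, Polygon
from shapely.strtree import STRtree

polygons = [Polygon([(0, 0), (1, 0), (1, 1)]), Polygon([(5, 5), (6, 5), (6, 6)])]
point = Point(2.5, 5.7)

tree = STRtree(polygons)         # build the index once
idx = tree.nearest(point)        # index of the nearest polygon (Shapely 2.x)
closest_polygon = polygons[idx]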
I have a solution that works if you have at least 2 polygons whose mutual distance is not 0. Let's call these 2 polygons "basePolygon0" and "basePolygon1". The idea is to build a KD-tree with the distance of each polygon to each of the two "basis" polygons.
Once the KD-tree has been built, we query it with the distances from the query point to each of the basis polygons.
Here's a working example:
from shapely.geometry import Point, Polygon
import numpy as np
from scipy.spatial import KDTree

# prepare a test with triangles
poly0 = Polygon([(3, -1), (5, -1), (4, 2)])
poly1 = Polygon([(-2, 1), (-4, 2), (-3, 4)])
poly2 = Polygon([(-3, -3), (-4, -6), (-2, -6)])
poly3 = Polygon([(-1, -4), (1, -4), (0, -1)])
polys = [poly0, poly1, poly2, poly3]

p0 = Point(4, -3)
p1 = Point(-4, 1)
p2 = Point(-4, -2)
p3 = Point(0, -2.5)
testPoints = [p0, p1, p2, p3]

# select the basis polygons
# it works with any pair of polygons that have non-zero distance
basePolygon0 = polys[0]
basePolygon1 = polys[1]

# compute a tree query: the distances to the two basis polygons
def buildQuery(point):
    distToBasePolygon0 = basePolygon0.distance(point)
    distToBasePolygon1 = basePolygon1.distance(point)
    return np.array([distToBasePolygon0, distToBasePolygon1])

distances = np.array([buildQuery(poly) for poly in polys])

# build the KD tree
tree = KDTree(distances)

# test it
for p in testPoints:
    q = buildQuery(p)
    output = tree.query(q)
    print(output)
This yields as expected:
# (distance, polygon_index_in_KD_tree)
(2.0248456731316584, 0)
(1.904237866994273, 1)
(1.5991500555008626, 2)
(1.5109986459170694, 3)
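As a small usage note (my addition), the second element of each query result is the row index in the tree, so the closest polygon can be recovered directly; keep in mind that the reported distance lives in the 2-D "basis distance" space, not in map units:

dist_in_basis_space, poly_idx = tree.query(buildQuery(p0))
closest_poly = polys[poly_idx]  # polygon whose basis-distance signature is closest to that of p0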
There is one way that might be faster, but without doing any actual tests, it's hard for me to say for sure.
This might not work for your situation, but the basic idea is that each time a Shapely object is added to the collection, you adjust the positions of the elements so that it always stays "sorted" by distance. In Python, this can be done with the heapq module. The only issue with that module is that it's hard to supply a custom comparison function, so you have to do something like this answer, where you make a custom class that stores each object in the heap as a (key, object) tuple.
import heapq

class MyHeap(object):
    def __init__(self, initial=None, key=lambda x: x):
        self.key = key
        if initial:
            self._data = [(key(item), item) for item in initial]
            heapq.heapify(self._data)
        else:
            self._data = []

    def push(self, item):
        heapq.heappush(self._data, (self.key(item), item))

    def pop(self):
        return heapq.heappop(self._data)[1]
The first element in the tuple is a "key", which in this case would be the distance to the point, and then the second element would be the actual Shapely object, and you could use it like so:
point = Point(2.5, 5.7)
heap = MyHeap(initial=None, key=lambda x:x.distance(point))
heap.push(Polygon(...))
heap.push(Polygon(...))
# etc...
And at the end, the object you're looking for will be at heap.pop().
Ultimately, though, both algorithms seem to be (roughly) O(n), so any speed up would not be a significant one.
My goal is to find the nearest x, y point coordinate for every pixel. Based on that, I have to colour the pixels.
Here is what I have tried.
The code below will draw the points.
import numpy as np
import matplotlib.pyplot as plt
points = np.array([[0,40],[0,0],[5,30],[4,10],[10,25],[20,5],[30,35],[35,3],[50,0],[45,15],[40,22],[50,40]])
print (points)
x1, y1 = zip(*points)
plt.plot(x1,y1,'.')
plt.show()
Now to find the nearest point for each pixel.
I found something like this, where I have to give each pixel's coordinates manually to get the nearest point.
from scipy import spatial
import numpy as np

A = np.random.random((10, 2)) * 100
print(A)

pt = np.array([[6, 30], [9, 80]])
print(pt)

for each in pt:
    A[spatial.KDTree(A).query(each)[1]]       # <-- the nearest point
    distance, index = spatial.KDTree(A).query(each)
    print(distance)                           # <-- the distance to the nearest neighbor
    print(index)                              # <-- the location of the neighbor
    print(A[index])
The output will be like this,
[[1.76886192e+01 1.75054781e+01]
[4.17533199e+01 9.94619127e+01]
[5.30943347e+01 9.73358766e+01]
[3.05607891e+00 8.14782701e+01]
[5.88049334e+01 3.46475520e+01]
[9.86076676e+01 8.98375851e+01]
[9.54423012e+01 8.97209269e+01]
[2.62715747e+01 3.81651805e-02]
[6.59340306e+00 4.44893348e+01]
[6.66997434e+01 3.62820929e+01]]
[[ 6 30]
[ 9 80]]
14.50148095039858
8
[ 6.59340306 44.48933479]
6.124988197559344
3
[ 3.05607891 81.4782701 ]
Instead of giving each point manually, I want to take each pixel from the image and find the nearest blue point. This is my first question.
After that, I want to classify those points into two categories.
Based on pixel and point, I want to colour them; basically, I want to do a clustering on them.
This is not in proper form, but in the end I want it like this.
Thanks in advance guys.
Use cKDTree instead of KDTree, which is faster (see this answer).
You can give the kdtree an array of points to query instead of looping over all of them.
Constructing a kdtree is a costly operation compared to querying it, so construct it once and query many times.
Compare the following two code snippets; in my tests the second one ran about 800 times faster.
from timeit import default_timer as timer
from scipy import spatial
import numpy as np

np.random.seed(0)
A = np.random.random((1000, 2)) * 100
pt = np.random.randint(0, 100, (100, 2))

start1 = timer()
for each in pt:
    A[spatial.KDTree(A).query(each)[1]]
    distance, index = spatial.KDTree(A).query(each)
end1 = timer()
print(end1 - start1)

start2 = timer()
kdt = spatial.cKDTree(A)         # cKDTree, constructed once outside the loop
distance, index = kdt.query(pt)  # query all points at once
A[index]
end2 = timer()
print(end2 - start2)
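To apply this to every pixel of an image (my sketch, not part of the original answer; the image size and variable names are assumptions), the pixel grid can be built once and queried in a single call:

import numpy as np
from scipy import spatial

height, width = 50, 60                              # assumed image size
ys, xs = np.mgrid[0:height, 0:width]                # pixel coordinates
pixels = np.column_stack([xs.ravel(), ys.ravel()])  # shape (height * width, 2)

tree = spatial.cKDTree(points)                      # 'points' is the blue-point array from the question
_, nearest_idx = tree.query(pixels)                 # nearest point index for every pixel
label_image = nearest_idx.reshape(height, width)    # per-pixel nearest-point index, ready for colouring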
you can use scikit-learn for this:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1)
labels = list(range(len(points)))
neigh.fit(points, labels)
pred = neigh.predict(np.random.random((10,2))*50)
If you want the points themselves and not their class labels, you can do
points[pred]
I have two sets of points, one is a map consisting of x,y coordinates, and the second is a path of x,y coordinates. I'm trying to find the closest map points to my path points, pretty simple. Except my map is 380000 points and my paths (of which I have several) each consist of ~ 350000 points themselves.
Other than sampling my data to get smaller datasets, I'm trying to find a faster way to accomplish this task.
base algorithm:
import pandas as pd
from scipy.spatial.distance import cdist
...

def closest_point(point, points):
    return points[cdist([point], points).argmin()]

# log['point'].shape; 333000
# map_data['point'].shape; 380000
closest = [closest_point(log_p, list(map_data['point'])) for log_p in log['point']]
as per this example: Find closest point in Pandas DataFrames
After converting this to a tqdm progress bar to see how long it would take (as it was taking a while, obviously), I noticed it would take about 10hrs to complete.
tqdm loop:
for i in trange(len(log), desc='finding closest points'):
    closest.append(closest_point(log['point'].loc[i], list(map_data['point'])))

>> finding closest points:   5%|          | 16432/333456 [32:11<10:13:52, 8.60it/s]
While 10 hours is not impossible, I wonder if there is a way to speed this up? I have a solid gpu/cpu/ram at my disposal so I feel this should be doable. I'm also learning tensorflow (but honestly my math is atrocious so I'm very in the dark with it)
Any ideas on how to speed this up with either multi-threading, gpu computation, tensorflow or some other sort of wizardry?
inb4 python is slow ;)
*edit: the image shows what I'm trying to do: green is the path, blue is the map, orange is what I'm trying to find.
The following is a mini example of what you're trying to do. Consider the variable coords1 as your log['point'] and coords2 as your map_data['point']. The end result is, for each point in coords1, the index of the closest point in coords2.
from scipy.spatial import distance
import numpy as np

coords1 = [(35.0456, -85.2672),
           (35.1174, -89.9711),
           (35.9728, -83.9422),
           (36.1667, -86.7833)]

coords2 = [(35.0456, -85.2672),
           (35.1174, -89.9711),
           (35.9728, -83.9422),
           (34.9728, -83.9422),
           (36.1667, -86.7833)]

tmp = distance.cdist(coords1, coords2, "sqeuclidean")  # sqeuclidean per Mark Setchell's comment, to improve speed further
result = np.argmin(tmp, 1)
# result: array([0, 1, 2, 4])
This should be way faster, because everything is done in one vectorized call.
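One caveat worth flagging (my addition, not part of the original answer): at the scale in the question, roughly 333,000 path points against 380,000 map points, the full cdist matrix would need about 333000 * 380000 * 8 bytes ≈ 1 TB, so it would have to be computed in chunks. A KD-tree avoids materializing the matrix entirely; a minimal sketch, assuming path_points and map_points are (n, 2) arrays built from log['point'] and map_data['point']:

import numpy as np
from scipy.spatial import cKDTree

map_points = np.asarray(list(map_data['point']))   # (380000, 2) map coordinates (assumed layout)
path_points = np.asarray(list(log['point']))       # (333000, 2) path coordinates (assumed layout)

tree = cKDTree(map_points)                         # build the tree once on the map
dist, idx = tree.query(path_points, workers=-1)    # 'workers' needs a recent SciPy; older versions call it n_jobs
closest = map_points[idx]                          # nearest map point for every path point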
After 3 years, but if anyone is still looking at this issue... You may want to try Numba: I get almost a 9x speedup over scipy's distance.cdist on a set of 1.5 million points against 1.5 K path points. Also, as @Mark Setchell said, removing the np.sqrt can save considerable time on a big enough set of points.
Results
size: (1459383, 2)
numba: 0.06402060508728027
cdist: 0.5371212959289551
Code
# EUCLIDEAN DISTANCE
import numba
import numpy as np

@numba.njit('(float64[:,::1], float64[::1], float64[::1])', parallel=True, fastmath=True)
def pz_dist(p_array, x_flat, y_flat):
    m = p_array.shape[0]
    n = x_flat.shape[0]
    d = np.empty(shape=(m, n), dtype=np.float64)
    for i in numba.prange(m):        # parallel loop over the points in p_array
        p1 = p_array[i, :]
        for j in range(n):
            _x = x_flat[j] - p1[0]
            _y = y_flat[j] - p1[1]
            _d = np.sqrt(_x**2 + _y**2)
            d[i, j] = _d
    return d
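A quick usage sketch (my addition; the input arrays are hypothetical and just match the contiguous float64 layout required by the signature above):

import numpy as np

path_points = np.ascontiguousarray(np.random.rand(1000, 2))  # (m, 2) float64, C-contiguous
map_x = np.ascontiguousarray(np.random.rand(5000))           # flattened map x coordinates
map_y = np.ascontiguousarray(np.random.rand(5000))           # flattened map y coordinates

d = pz_dist(path_points, map_x, map_y)  # full (1000, 5000) distance matrix
nearest_idx = d.argmin(axis=1)          # index of the closest map point for each path point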