I am brute force calculating the shortest distance from one point to many others on a 2D plane, with the data coming from pandas DataFrames via df['column'].to_numpy().
Currently I do this with nested for loops over NumPy arrays: I fill a list with squared distances, take the minimum of that list, and store the corresponding value in another list.
Checking 1,000 points (from df_point) against 25,000 (from df_compare) takes about a minute, which is understandably inefficient. My code is below.
point_x = df_point['x'].to_numpy()
compare_x = df_compare['x'].to_numpy()
point_y = df_point['y'].to_numpy()
compare_y = df_compare['y'].to_numpy()
dumarr = []
minvals = []
# Brute force: calculate the closest point by using the Pythagorean theorem,
# comparing each point to every other point
for k in range(len(point_x)):
    for i, j in np.nditer([compare_x, compare_y]):
        dumarr.append((point_x[k] - i)**2 + (point_y[k] - j)**2)
    minvals.append(df_compare['point_name'][dumarr.index(min(dumarr))])
    # Clear dummy array (otherwise it will keep appending to it)
    dumarr = []
This isn't particularly pythonic. Is there a way to do this with vectorization, or at least without nested for loops?
The approach is to create a 1000 x 25000 matrix, and then find the indices of the row minimums.
# distances for all combinations (1000x25000 matrix)
dum_arr = (point_x[:, None] - compare_x)**2 + (point_y[:, None] - compare_y)**2
# indices of minimums along rows
idx = np.argmin(dum_arr, axis=1)
# Not sure what is needed from the indices; this gets the values
# from the `point_name` column of df_compare using the found indices
min_vals = df_compare['point_name'].iloc[idx]
Here is the approach:
1. Create a DataFrame with the columns pointID, CoordX, CoordY.
2. Create a secondary DataFrame with an offset value of 1 (oldDF.iloc[pointIDx] = newDF.iloc[pointIDx]-1).
3. Loop this offset value from 1 to the number of coordinates - 1.
4. tempDF["Euclid Dist"] = sqrt(square(oldDF["CoordX"]-newDF["CoordX"])+square(oldDF["CoordY"]-newDF["CoordY"]))
5. Append this tempDF to a list.
Reasons why this will be faster:
Only one loop, iterating the offset from 1 to the number of coordinates - 1.
The vectorization is taken care of by step 4.
NumPy's sqrt and square functions are used for the element-wise math.
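A minimal sketch of this offset idea, assuming a single DataFrame with the pointID, CoordX and CoordY columns named above; np.roll stands in here for the shifted secondary DataFrame:
import numpy as np
import pandas as pd
# Toy DataFrame with the assumed column layout
df = pd.DataFrame({'pointID': np.arange(5),
                   'CoordX': np.random.rand(5),
                   'CoordY': np.random.rand(5)})
x = df['CoordX'].to_numpy()
y = df['CoordY'].to_numpy()
n = len(df)
dist_frames = []
for offset in range(1, n):  # the single loop over offsets
    dx = x - np.roll(x, offset)  # vectorized differences against the shifted copy
    dy = y - np.roll(y, offset)
    tempDF = pd.DataFrame({'pointID': df['pointID'],
                           'otherID': np.roll(df['pointID'].to_numpy(), offset),
                           'Euclid Dist': np.sqrt(np.square(dx) + np.square(dy))})
    dist_frames.append(tempDF)
# Each pair of points appears once per direction across the offsets
all_dists = pd.concat(dist_frames, ignore_index=True)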
To find the closest point, you could instead try finding the closest in the x and y directions separately, and then compare those two to find which is closer, using the built-in min function like the top answer from this question:
min(myList, key=lambda x:abs(x-myNumber))
from list of integers, get number closest to a given value
EDIT:
Your loop would end up something like this if you do it all in one function call. Also, I'm not sure whether the min function will end up looping through the compare arrays in a way that takes the same amount of time as your current code:
for k, m in np.nditer([point_x, point_y]):
    closest = min(zip(compare_x, compare_y), key=lambda p: (p[0] - k)**2 + (p[1] - m)**2)
Another alternative could be to pre-compute the distance from (0,0) or another point like (-1000,1000) for all the points in the compare array, sort the compare array based on that, then only check points with a similar distance from the reference.
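A hedged sketch of that pruning idea, assuming a fixed search window and the arrays from the question (a complete solution would widen the window whenever no candidate falls inside it):
# Pre-compute each compare point's distance to a reference point and sort by it
ref_x, ref_y = 0.0, 0.0
compare_ref = np.hypot(compare_x - ref_x, compare_y - ref_y)
order = np.argsort(compare_ref)
compare_ref_sorted = compare_ref[order]
# By the triangle inequality, any point whose reference distance differs from the
# query's by more than `window` must itself be farther away than `window`
window = 10.0  # assumed search radius
nearest_idx = []
for px, py in zip(point_x, point_y):
    p_ref = np.hypot(px - ref_x, py - ref_y)
    lo = np.searchsorted(compare_ref_sorted, p_ref - window)
    hi = np.searchsorted(compare_ref_sorted, p_ref + window)
    cand = order[lo:hi]
    d2 = (compare_x[cand] - px)**2 + (compare_y[cand] - py)**2
    nearest_idx.append(cand[np.argmin(d2)] if cand.size else None)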
Here’s an example using scipy cdist, which is ideal for this type of problem:
import numpy as np
from scipy.spatial.distance import cdist
point = np.array([[1, 2], [3, 5], [4, 7]])
compare = np.array([[3, 2], [8, 5], [4, 1], [2, 2], [8, 9]])
# create 3x5 distance matrix
dm = cdist(point, compare)
# get row-wise mins
mins = dm.min(axis=1)
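If the identity of the nearest point is needed as well (as in the question above), argmin over the same matrix gives the column indices; the point_name lookup is just one possible use of them:
# index of the closest compare point for each row
idx = dm.argmin(axis=1)
# e.g. nearest_names = df_compare['point_name'].iloc[idx] if names are needed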
I have 2 sets of data: one contains the coordinates of fixed locations, called locations,
and a secondary table of vehicle movements, called movements.
What would be the fastest way to iterate through both tables to find whether any of the movements are within a certain distance of a location, e.g. the Euclidean distance between a point in movements and a point in any of the locations?
Currently I am using a nested loop, which is incredibly slow. Both pandas DataFrames have been converted using
locations_dict=locations.to_dict('records')
movements_dict=movements.to_dict('records')
then iterated via:
for movement in movements_dict:
    visit = 'no visit'
    for location in locations_dict:
        distance = np.sqrt((location['Latitude'] - movement['Lat'])**2 + (location['Longitude'] - movement['Lng'])**2)
        if distance < 0.05:
            visit = location['Location']
            break
        else:
            continue
    movement['distance'] = distance
    movement['visit'] = visit
Is there any way to make this faster? The main issue is that this operation is a Cartesian product, and any inserts will increase the complexity of the operation significantly.
You can export the pandas data directly to numpy for example like this:
loc_lat=locations['Latitude' ].to_numpy()
loc_lon=locations['Longitude'].to_numpy()
mov_lat=movements['Lat' ].to_numpy()
mov_lon=movements['Lng' ].to_numpy()
From here on there is no need to use loops to obtain results, as you can rely on NumPy operating on entire arrays at once. This should give a great speedup over the approach that loops over dictionary values in Python.
Check out the following code example showing how to get an array with all pairs from two arrays:
import numpy as np
a = np.array([1,2,3])
b = np.array([4,5])
print( np.transpose([np.tile(a, len(b)), np.repeat(b,len(a))]) )
gives_as_print = """
[[1 4]
[2 4]
[3 4]
[1 5]
[2 5]
[3 5]]"""
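Building on those arrays, here is a hedged sketch of how the whole check could be done with broadcasting instead of materializing the pair list. As an assumption on my part, it records the nearest location rather than the first one found within range:
# (n_movements, n_locations) matrix of Euclidean distances via broadcasting
dist = np.sqrt((mov_lat[:, None] - loc_lat[None, :])**2 + (mov_lon[:, None] - loc_lon[None, :])**2)
nearest = dist.argmin(axis=1)  # closest location per movement
nearest_dist = dist[np.arange(len(mov_lat)), nearest]
movements['distance'] = nearest_dist
movements['visit'] = np.where(nearest_dist < 0.05,
                              locations['Location'].to_numpy()[nearest],
                              'no visit')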
Say I have the following matrix:
A = [[7, 5, 1, 2],
     [10, 1, 3, 8],
     [2, 2, 2, 3]]
I need to extract the row with elements closest to 0 compared to all other rows, i.e. the row with the minimal elements. So I need [2,2,2,3].
I have tried a number of things: np.min, np.amin, np.argmin.
But they all give me the minimum value of each column, for example:
[2,1,1,2]
This is not what I'm looking for.
If someone knows the right function could you point me to the documentation of the function?
Thank you.
It depends on how you define distance when you say closest. I'm guessing you are looking for the Euclidean distance, i.e. the L2 norm, here. In that case, you can just find the row with the minimum sum of squares:
A[(A ** 2).sum(1).argmin()]
# array([2, 2, 2, 3])
You can also find the closest by L1 norm or the sum of absolute difference against 0s:
A[np.abs(A).sum(1).argmin()]
# array([2, 2, 2, 3])
In this dummy example, the two methods give the same result, but they could be different depending on the actual data.
import numpy as np
A = np.array([[7, 5, 1, 2],
              [10, 1, 3, 8],
              [2, 2, 2, 3]])
print(A[np.argmin(A.sum(axis=1))])
# [2 2 2 3]
Sum the rows, then find the row index of the minimum value, and finally find the row.
The first way that comes to mind is to find the minimum of each row. Then find the argmin of that array.
row_mins = A.min(axis=1)
row_with_minimum = row_mins.argmin()
Then to get the row with the minimum element, do
A[row_with_minimum, :]
I currently have an edge array of dimension (n_edges, 2) containing node pairs described as [NodeID1, NodeID2], where both IDs are integers. I need to efficiently re-enumerate these NodeIDs so that I can use them as indices into an adjacency matrix. My current approach is to extract the sorted set of unique NodeIDs, map them to the range 0 through the number of distinct nodes - 1, and then replace the entries using pandas.DataFrame.replace(mapping). Here is an example of what I am doing:
import numpy as np
import pandas as pd
a = np.random.randint(0, 100000000, (40000000, 2))
df = pd.DataFrame(a)
unique_values = np.unique(a)
mapping = dict(zip(unique_values, np.arange(len(unique_values))))
df.replace(mapping)
I have also tried defining a function which applies this map and vectorizing it with NumPy, but it is still quite slow. Any ideas as to how I can implement this more efficiently?
Turns out np.unique has an option to return the indices of the original numbers in the unique array, you just need to reshape it.
u, indices = np.unique(a, return_inverse=True)
b = indices.reshape(a.shape)
This runs in about 20 seconds on your example.
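As a hedged follow-up, the re-indexed pairs in b can be fed into an adjacency matrix directly, for example with scipy.sparse (assuming an unweighted graph; this snippet is illustrative rather than part of the original answer):
from scipy.sparse import coo_matrix
n_nodes = len(u)  # number of distinct node IDs
# One entry per edge; duplicate edges would have their weights summed
adj = coo_matrix((np.ones(len(b)), (b[:, 0], b[:, 1])), shape=(n_nodes, n_nodes))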
I have a matrix storing the k minimum distances for N elements. Whenever a new element arrives, I want to compute the distances to all N elements, and if any distance is lower than the maximum distance stored, I want to update that value and store the new distance. Initially the distances are set to np.inf.
elems = np.array([[5, 5],[4, 4],[8, 8]])
k=2
center_mindists = np.full((len(elems),k), np.inf)
So when a new element arrives, let's say x=np.array([1,1]), I have to compute the distance to all elements and store it if it is less than the maximum distance stored at the time:
distances = np.sum(np.abs(elems - x), axis=1)  # [8 6 14]
To do so, I find the index of the maximum stored distance in each row, and then select the stored maxima that are higher than the recently computed distances:
max_min_idx = np.argmax(center_mindists, axis=1) #[0 0 0]
id0 = np.indices(max_min_idx.shape)
lower_idx = distances < center_mindists[id0, max_min_idx]
Finally I have to update those values with the new ones:
center_mindists[id0, max_min_idx][lower_idx] = distances[lower_idx[0]]
The thing is that the assignment does not change the values in the center_mindists matrix, and I couldn't find a solution for this.
Thanks a lot!!
You can perform the assignment in two steps, since you have a double index, the first part of which makes a copy. Instead of
center_mindists[id0, max_min_idx][lower_idx] = distances[lower_idx[0]]
explicitly update the copy, and assign it back:
temp = center_mindists[id0, max_min_idx]
temp[lower_idx] = distances[lower_idx[0]]
center_mindists[id0, max_min_idx] = temp
This is actually pretty convenient because you really use temp to compute the lower_idx mask in the first place.
center_mindists[id0, max_min_idx] is a copy, because the indices are arrays (advanced indexing) rather than slices (basic indexing).
center_mindists[id0, max_min_idx][lower_idx] = ...
modifies that copy, not the original, so nothing ends up happening.
You have to somehow combine indices so that you have only one set of advanced indexing
center_mindists[idx0, idx1] = ....
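A minimal sketch of that combined, single-step assignment, assuming the arrays from the question (the row index is spelled out explicitly here):
rows = np.arange(len(center_mindists))                  # one row index per element
mask = distances < center_mindists[rows, max_min_idx]   # where the new distance is smaller
# A single advanced-indexing assignment updates the original array in place
center_mindists[rows[mask], max_min_idx[mask]] = distances[mask]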
I am trying to do the following in NumPy without using a loop:
I have a matrix X of dimensions N*d and a vector y of dimension N.
y contains integers ranging from 1 to K.
I am trying to get a matrix M of size K*d, where M[i,:]=np.mean(X[y==i,:],0)
Can I achieve this without using a loop?
With a loop, it would go something like this.
import numpy as np
N=3
d=3
K=2
X=np.eye(N)
y=np.random.randint(1,K+1,N)
M=np.zeros((K,d))
for i in np.arange(0, K):
    line = X[y == i+1, :]
    if line.size == 0:
        M[i, :] = np.zeros(d)
    else:
        M[i, :] = np.mean(line, 0)
Thank you in advance.
The code is basically collecting specific rows of X and adding them up, for which we have a NumPy builtin in np.add.reduceat. So, with that in focus, the steps to solve it in a vectorized way could be as listed next -
# Get sort indices of y
sidx = y.argsort()
# Collect rows of X based on their IDs so that they come in consecutive order
Xr = X[sidx]
# Get unique row IDs, start positions of each unique ID
# and their counts to be used for average calculations
unq,startidx,counts = np.unique((y-1)[sidx],return_index=True,return_counts=True)
# Add rows off Xr based on the slices signified by the start positions
vals = np.true_divide(np.add.reduceat(Xr,startidx,axis=0),counts[:,None])
# Setup output array and set row summed values into it at unique IDs row positions
out = np.zeros((K,d))
out[unq] = vals
This solves the question, but creates an intermediate K×N boolean matrix, and doesn't use the built-in mean function. This may lead to worse performance or worse numerical stability in some cases. I'm letting the class labels range from 0 to K-1 rather than 1 to K.
import numpy as np
# Define constants
K, N, d = 10, 1000, 3
# Sample data
Y = np.random.randint(0, K-1, N)  # K-1 to omit one class to test the no-examples case
X = np.random.randn(N, d)
# Calculate means for each class, vectorized
# Map samples to labels by taking a logical "outer product"
mark = Y[None, :] == np.arange(0, K)[:, None]
# Count number of examples in each class
count = np.sum(mark, 1)
# Avoid divide by zero if no examples
count += count == 0
# Sum within each class and normalize
M = (np.dot(mark, X).T / count).T
print(M, np.shape(M), np.shape(mark))