I have a matrix that stores the k minimum distances for each of N elements. Whenever a new element arrives, I want to compute its distance to all N elements, and if any of those distances is lower than the maximum distance stored for that element, I want to replace that stored value with the new distance. Initially the distances are set to np.inf.
elems = np.array([[5, 5],[4, 4],[8, 8]])
k=2
center_mindists = np.full((len(elems),k), np.inf)
So when a new element arrives, say x = np.array([1, 1]), I have to compute the distance to all elements and store it if it is less than the maximum distance stored at that time:
distances = np.sum(np.abs(elems - x), axis=1)  # [8 6 14]
To do so, I find the index of the maximum stored distance in each row and then select the rows whose stored maximum is higher than the newly computed distance:
max_min_idx = np.argmax(center_mindists, axis=1) #[0 0 0]
id0 = np.indices(max_min_idx.shape)
lower_idx = distances < center_mindists[id0, max_min_idx]
Finally I have to update those values with the new ones:
center_mindists[id0, max_min_idx][lower_idx] = distances[lower_idx[0]]
The thing is that this assignment does not change the values in the center_mindists matrix, and I couldn't find a solution for this.
Thanks a lot!!
You can perform the assignment in two steps, since you have a double index, the first part of which makes a copy. Instead of
center_mindists[id0, max_min_idx][lower_idx] = distances[lower_idx[0]]
explicitly update the copy, and assign it back:
temp = center_mindists[id0, max_min_idx]
temp[lower_idx] = distances[lower_idx[0]]
center_mindists[id0, max_min_idx] = temp
This is actually pretty convenient, because you need temp to compute the lower_idx mask in the first place.
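Putting that together, a minimal sketch of the whole update step under the question's setup (it assumes distances is the 1-D array computed above, and keeps the question's id0 = np.indices(...) shapes):

temp = center_mindists[id0, max_min_idx]    # copy, shape (1, N)
lower_idx = distances < temp                # boolean mask with the same shape
temp[lower_idx] = distances[lower_idx[0]]   # update the copy
center_mindists[id0, max_min_idx] = temp    # write the copy back in one assignment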
center_mindists[id0, max_min_idx] is a copy, because the indices are arrays rather than slices: this is advanced indexing, not basic indexing, so it returns a copy instead of a view.
center_mindists[id0, max_min_idx][lower_idx] = ...
modifies that copy, not the original, so nothing ends up happening.
You have to combine the indices somehow, so that you end up with a single advanced-indexing assignment:
center_mindists[idx0, idx1] = ....
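Here is one possible way to put that together; it is a sketch that uses np.arange for the row index (an assumption, the question's np.indices works too) so the update is a single advanced-indexing assignment on the original array:

import numpy as np

elems = np.array([[5, 5], [4, 4], [8, 8]])
k = 2
center_mindists = np.full((len(elems), k), np.inf)

x = np.array([1, 1])
distances = np.sum(np.abs(elems - x), axis=1)     # Manhattan distances: [8 6 14]

max_min_idx = np.argmax(center_mindists, axis=1)  # column holding each row's max
row_idx = np.arange(len(elems))
lower_idx = distances < center_mindists[row_idx, max_min_idx]

# one advanced-indexing assignment, so it modifies center_mindists in place
center_mindists[row_idx[lower_idx], max_min_idx[lower_idx]] = distances[lower_idx]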
Pretty new to python so any advice is always welcome.
I am trying to map data from multiple sets of coordinates to one set and am trying to use Bilinear interpolation to do it.
I have a set of DataFrames I iterate over and am trying to find the nearest neighbors for my interpolation.
Since my grids may not be uniform in spacing I am sorting by Y position first:
for i in range(0, len(df_x['X'])):
    x_pos = df_x._get_value(i, 'X')  # pull x coord, y coord
    y_pos = df_y._get_value(i, 'Y')
    for n in data_list:
        df = data_list[n]
        d_y = abs(df['Y'] - y_pos)  # array of distance from Y pos
        d_y.drop_duplicates()  # remove duplicates
        nn_y1 = d_y.nsmallest(1)  # finds closest row
        nn_y2 = d_y.nsmallest(2).iloc[-1]  # finds next closest row
        print(type(nn_y1))
        d_x_y1 = df[df['DesignY'] == nn_y1]  # creates list of X at closest row
I think this should provide me with my upper and lower bounds nearest my points.
However, when I then sort by X position I get an error:
ValueError: Can only compare identically-labeled Series objects
I think this is due to the fact that nn_y1 comes out as <class 'pandas.core.series.Series'>.
Any advice on how to get the value instead of the Series? I could create a dataframe with one element, but that seems hacky. I tried some combinations of _get_value(), but to no avail.
nsmallest returns:
"The n smallest values in the Series, sorted in increasing order." (Type Series)
In this case the simple way is to unpack from nsmallest(2) since both values are needed:
nn_y1, nn_y2 = d_y.nsmallest(2)
To modify the code directly, iloc is needed to get the first value from the Series:
nn_y1 = d_y.nsmallest(1).iloc[0]
Alternatively, d_y.nsmallest(2) could be computed once and iloc used twice to get both values:
smallest = d_y.nsmallest(2)
nn_y1 = smallest.iloc[0]
nn_y2 = smallest.iloc[1]
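As a hypothetical follow-up (the DataFrame and y_pos below are made up, just to illustrate), the unpacked scalars then compare cleanly against a column, which is what avoids the "identically-labeled Series" error:

import pandas as pd

df = pd.DataFrame({'Y': [0.0, 1.0, 2.5, 4.0], 'X': [10, 11, 12, 13]})
y_pos = 1.2

d_y = (df['Y'] - y_pos).abs()
nn_y1, nn_y2 = d_y.nsmallest(2)   # scalars, not Series
rows_at_y1 = df[d_y == nn_y1]     # comparing a Series with a scalar works fine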
I have a space of zeros with a variable dimension and an array of ones with a variable dimension, for instance:
import numpy
space = numpy.zeros((1000,5))
a = numpy.ones((150))
I would like to insert the ones of the array into the matrix so that they are homogeneously distributed within it.
You can use numpy.linspace to obtain the indices.
It's not obvious whether you'd like to assign a row of five ones at every index or just a single one in the first column of each selected row. This is how both would work:
space = numpy.zeros((1000,5))
a = numpy.ones((150, 5))
b = numpy.ones((150,))
index = numpy.rint(numpy.linspace(start=0, stop=999, num=150)).astype(int)
# This would assign five ones to every location
space[index] = a
# This would assign a one to the first element at every location
space[index, 0] = b
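As a quick sanity check on the snippet above (illustrative only; it relies on the 150 rounded linspace indices being distinct, which holds for this spacing):

# each selected row now contains five ones, and 150 rows were selected in total
assert (space.sum(axis=1) == 5).sum() == 150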
I am brute force calculating the shortest distance from one point to many others on a 2D plane with data coming from pandas dataframes using df['column'].to_numpy().
Currently, I am doing this using nested for loops on numpy arrays to fill up a list, taking the minimum value of that list, and storing that value in another list.
Checking 1000 points (from df_point) against 25,000 (from df_compare) takes about one minute, as this is understandably an inefficient process. My code is below.
point_x = df_point['x'].to_numpy()
compare_x = df_compare['x'].to_numpy()
point_y = df_point['y'].to_numpy()
compare_y = df_compare['y'].to_numpy()
dumarr = []
minvals = []
# Brute force calculate the closest point by using the Pythagorean theorem,
# comparing each point to every other point
for k in range(len(point_x)):
    for i, j in np.nditer([compare_x, compare_y]):
        dumarr.append((point_x[k] - i)**2 + (point_y[k] - j)**2)
    minvals.append(df_compare['point_name'][dumarr.index(min(dumarr))])
    # Clear dummy array (otherwise it will continuously append to it)
    dumarr = []
This isn't particularly pythonic. Is there a way to do this with vectorization, or at least without nested for loops?
The approach is to create a 1000 x 25000 matrix, and then find the indices of the row minimums.
# distances for all combinations (1000x25000 matrix)
dum_arr = (point_x[:, None] - compare_x)**2 + (point_y[:, None] - compare_y)**2
# indices of minimums along rows
idx = np.argmin(dum_arr, axis=1)
# Not sure what is needed from the indices; this gets the values
# from the `point_name` column using the found indices
min_vals = df_compare['point_name'].iloc[idx]
I'm going to give you the approach:
1. Create a DataFrame with columns pointID, CoordX, CoordY.
2. Create a secondary DataFrame with an offset value of 1 (oldDF.iloc[pointIDx] = newDF.iloc[pointIDx] - 1).
3. Loop this offset value from 1 up to the number of coordinates - 1.
4. tempDF["Euclid Dist"] = sqrt(square(oldDF["CoordX"] - newDF["CoordX"]) + square(oldDF["CoordY"] - newDF["CoordY"]))
5. Append this tempDF to a list (see the sketch after this list).
Reasons why this will be faster:
- Only one loop, iterating the offset from 1 up to the number of coordinates - 1.
- Vectorization is taken care of by step 4.
- Using numpy's square root and square functions ensures the best results.
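Here is a rough, hypothetical sketch of those steps; the data, the column names, and the use of pandas shift to build the offset copy are all assumptions rather than anything taken from the question:

import numpy as np
import pandas as pd

# Step 1: a made-up DataFrame of points
df = pd.DataFrame({
    'pointID': np.arange(6),
    'CoordX': np.random.rand(6),
    'CoordY': np.random.rand(6),
})

dist_frames = []
for offset in range(1, len(df)):        # step 3: single loop over the offset
    shifted = df.shift(-offset)         # step 2: offset copy of the points
    tempDF = df[['pointID']].copy()
    tempDF['Euclid Dist'] = np.sqrt(    # step 4: vectorized distance per offset
        np.square(df['CoordX'] - shifted['CoordX'])
        + np.square(df['CoordY'] - shifted['CoordY'])
    )
    dist_frames.append(tempDF)          # step 5: collect the per-offset distances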
Instead of finding the closest point directly, you could try finding the closest in the x and y directions separately, and then compare those two to find which is closer by using the built-in min function, like the top answer from this question:
min(myList, key=lambda x:abs(x-myNumber))
from list of integers, get number closest to a given value
EDIT:
Your loop would end up something like this if you do it all in one function call. Also, I'm not sure if the min function will end up looping through the compare arrays in a way that would take the same amount of time as your current code:
for k, m in np.nditer([point_x, point_y]):
    # pair up the candidates so min's key receives a single (x, y) tuple
    closest = min(zip(compare_x, compare_y), key=lambda p: (p[0] - k)**2 + (p[1] - m)**2)
Another alternative could be to pre-compute the distance from (0,0) or another point like (-1000,1000) for all the points in the compare array, sort the compare array based on that, then only check points with a similar distance from the reference.
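A possible sketch of that pre-sorting idea; the sample arrays, the reference point, and the search window below are all made-up stand-ins for the question's data:

import numpy as np

# made-up sample data standing in for the question's arrays
compare_x = np.random.rand(25_000) * 100
compare_y = np.random.rand(25_000) * 100
point_x = np.random.rand(1_000) * 100
point_y = np.random.rand(1_000) * 100

ref_x, ref_y = -1000.0, 1000.0                 # arbitrary reference point
compare_ref = np.sqrt((compare_x - ref_x)**2 + (compare_y - ref_y)**2)
order = np.argsort(compare_ref)                # sort candidates by reference distance
sorted_ref = compare_ref[order]

window = 5.0                                   # hypothetical search window
q_ref = np.sqrt((point_x[0] - ref_x)**2 + (point_y[0] - ref_y)**2)
lo, hi = np.searchsorted(sorted_ref, [q_ref - window, q_ref + window])
candidates = order[lo:hi]                      # only these need the full distance check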
Here’s an example using scipy cdist, which is ideal for this type of problem:
import numpy as np
from scipy.spatial.distance import cdist
point = np.array([[1, 2], [3, 5], [4, 7]])
compare = np.array([[3, 2], [8, 5], [4, 1], [2, 2], [8, 9]])
# create 3x5 distance matrix
dm = cdist(point, compare)
# get row-wise mins
mins = dm.min(axis=1)
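If you also need to know which compare point produced each row minimum (for example to look up point_name as in the question), argmin gives the column index instead of the value:

# index of the closest compare point for every query point
idx = dm.argmin(axis=1)
closest_points = compare[idx]
# with the question's DataFrames this would be df_compare['point_name'].iloc[idx]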
I am trying to do the following on Numpy without using a loop :
I have a matrix X of dimensions N*d and a vector y of dimension N.
y contains integers ranging from 1 to K.
I am trying to get a matrix M of size K*d, where M[i,:]=np.mean(X[y==i,:],0)
Can I achieve this without using a loop?
With a loop, it would go something like this.
import numpy as np

N = 3
d = 3
K = 2
X = np.eye(N)
y = np.random.randint(1, K + 1, N)
M = np.zeros((K, d))
for i in np.arange(0, K):
    line = X[y == i + 1, :]
    if line.size == 0:
        M[i, :] = np.zeros(d)
    else:
        M[i, :] = np.mean(line, 0)
Thank you in advance.
The code is basically collecting specific rows of X and adding them up, for which we have a NumPy builtin in np.add.reduceat. So, with that in focus, the steps to solve it in a vectorized way could be as listed next -
# Get sort indices of y
sidx = y.argsort()
# Collect rows off X based on their IDs so that they come in consecutive order
Xr = X[np.arange(N)[sidx]]
# Get unique row IDs, start positions of each unique ID
# and their counts to be used for average calculations
unq,startidx,counts = np.unique((y-1)[sidx],return_index=True,return_counts=True)
# Add rows off Xr based on the slices signified by the start positions
vals = np.true_divide(np.add.reduceat(Xr,startidx,axis=0),counts[:,None])
# Setup output array and set row summed values into it at unique IDs row positions
out = np.zeros((K,d))
out[unq] = vals
This solves the question, but creates an intermediate K×N boolean matrix, and doesn't use the built-in mean function. This may lead to worse performance or worse numerical stability in some cases. I'm letting the class labels range from 0 to K-1 rather than 1 to K.
import numpy as np

# Define constants
K, N, d = 10, 1000, 3

# Sample data
Y = np.random.randint(0, K - 1, N)  # K-1 to omit one class, to test the no-examples case
X = np.random.randn(N, d)

# Calculate means for each class, vectorized
# Map samples to labels by taking a logical "outer product"
mark = Y[None, :] == np.arange(0, K)[:, None]
# Count number of examples in each class
count = mark.sum(1)
# Avoid divide by zero if no examples
count += count == 0
# Sum within each class and normalize
M = (np.dot(mark, X).T / count).T

print(M, np.shape(M), np.shape(mark))
Suppose I have a two-dimensional numpy array with a given shape, and I would like to get a view of the values that satisfy a predicate based on each value's position. That is, if x and y are the column and row indices respectively and the predicate is x > y, the function should return only the array's values for which the column index is greater than the row index.
The easy way to do this is a double loop, but I would like a possibly faster (vectorized, maybe?) approach.
Is there a better way?
In general, you could do this by constructing an open mesh grid corresponding to the row/column indices, applying your predicate to get a boolean mask, and then indexing into your array with that mask:
A = np.zeros((10,20))
y, x = np.ogrid[:A.shape[0], :A.shape[1]]
mask = x > y
A[mask] = 1
Your specific example happens to be the upper triangle - you can get a copy of it using np.triu, or you can get the corresponding row/column indices using np.triu_indices.
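For instance, a minimal sketch of the triu_indices route for the x > y case (k=1 excludes the diagonal, and m handles a non-square shape):

import numpy as np

A = np.arange(200).reshape(10, 20)
rows, cols = np.triu_indices(A.shape[0], k=1, m=A.shape[1])
upper_values = A[rows, cols]   # 1-D array of the entries whose column index exceeds the row index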