How to find the nearest neighbour index from one series to another - python

I have a target array A, which represents isobaric pressure levels in NCEP reanalysis data.
I also have the pressure at which a cloud is observed as a long time series, B.
What I am looking for is a k-nearest-neighbour lookup that returns the indices of those nearest neighbours, something like knnsearch in Matlab, called in Python as: indices, distance = knnsearch(A, B, n)
where indices holds the n nearest indices in A for every value in B, distance holds how far each value in B is from its nearest values in A, and A and B can be of different lengths (this last point is the bottleneck with most solutions I have found so far, which force me to loop over each value in B to collect the indices and distances).
import numpy as np
A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10]) # this is a fixed 17-by-1 array
B = np.array([923, 584.2, 605.3, 153.2]) # this can be any n-by-1 array
n = 2
What I would like returned from indices, distance = knnsearch(A, B, n) is this:
indices = [[1, 2], [4, 5], etc...]
where 923 in B is matched first to A[1]=925 and then to A[2]=850,
and 584.2 in B is matched first to A[4]=600 and then to A[5]=500.
distance = [[2, 73], [15.8, 84.2], etc...]
where 2 is the distance from the queried value in B to the nearest value in A, e.g. distance[0, 0] == np.abs(B[0] - A[1])
The only solution I have been able to come up with is:
import numpy as np
def knnsearch(A, B, n):
    indices = np.zeros((len(B), n))
    distances = np.zeros((len(B), n))
    for i in range(len(B)):
        a = A
        for N in range(n):
            dif = np.abs(a - B[i])
            ind = np.argmin(dif)
            indices[i, N] = ind + N
            distances[i, N] = dif[ind + N]
            # remove this neighbour from future consideration
            np.delete(a, ind)
    return indices, distances
array_A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10])
array_B = np.array([923, 584.2, 605.3, 153.2])
neighbours = 2
indices, distances = knnsearch(array_A, array_B, neighbours)
print(indices)
print(distances)
returns:
[[ 1. 2.]
[ 4. 5.]
[ 4. 3.]
[10. 11.]]
[[ 2. 73. ]
[ 15.8 84.2]
[ 5.3 94.7]
[ 3.2 53.2]]
There must be a way to remove the for loops, as I need the performance should my A and B arrays contain many thousands of elements with many nearest neighbours...
Please help! Thanks :)

The second loop can easily be vectorized. The most straightforward way to do it is to use np.argsort and select the indices corresponding to the n smallest dif values. However, for large arrays, since only the n smallest values need to be found, it is better to use np.argpartition.
The code would then look something like this:
def vector_knnsearch(A, B, n):
    indices = np.empty((len(B), n))
    distances = np.empty((len(B), n))
    for i, b in enumerate(B):
        dif = np.abs(A - b)
        min_ind = np.argpartition(dif, n)[:n]    # indexes of the n smallest differences,
                                                 # not necessarily sorted
        ind = min_ind[np.argsort(dif[min_ind])]  # sort the output of argpartition just in case
        indices[i, :] = ind
        distances[i, :] = dif[ind]
    return indices, distances
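For reference, here is a quick check of vector_knnsearch against the arrays from the question (the expected values in the comments are mine):
array_A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10])
array_B = np.array([923, 584.2, 605.3, 153.2])
indices, distances = vector_knnsearch(array_A, array_B, 2)
print(indices)    # [[ 1.  2.]
                  #  [ 4.  5.]
                  #  [ 4.  3.]
                  #  [10.  9.]]  <- for 153.2 the second-nearest level is 200 (index 9), not 100 (index 11)
print(distances)  # [[ 2.   73. ]
                  #  [15.8  84.2]
                  #  [ 5.3  94.7]
                  #  [ 3.2  46.8]]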
As said in the comments, the first loop can also be removed using a meshgrid. However, the extra memory and computation needed to construct the meshgrid make this approach slower for the dimensions I tried (and this will probably get worse for large arrays, possibly ending in a MemoryError). In addition, the readability of the code decreases. Overall, this probably makes the approach less Pythonic.
def mesh_knnsearch(A, B, n):
    m = len(B)
    rng = np.arange(m).reshape((m, 1))
    Amesh, Bmesh = np.meshgrid(A, B)
    dif = np.abs(Amesh - Bmesh)
    min_ind = np.argpartition(dif, n, axis=1)[:, :n]
    ind = min_ind[rng, np.argsort(dif[rng, min_ind], axis=1)]
    return ind, dif[rng, ind]
Note that it is important to define this rng as a 2d array in order to retrieve a[rng[0],ind[0]], a[rng[1],ind[1]], etc. and maintain the dimensions of the array, as opposed to a[:,ind], which retrieves a[:,ind[0]], a[:,ind[1]], etc.
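A minimal illustration of that indexing point (the small arrays d and ind here are hypothetical, not taken from the answer above):
d = np.array([[10, 11, 12],
              [20, 21, 22]])
ind = np.array([[2, 0],
                [1, 2]])
rng = np.arange(2).reshape((2, 1))
print(d[rng, ind])      # [[12 10]
                        #  [21 22]]  -> one row of d per row of ind
print(d[:, ind].shape)  # (2, 2, 2)  -> every row of d combined with every row of ind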

Related

Apply custom sum over numpy ndarray

I would like to do this particular computation: given a square 4-dimensional ndarray A of shape (N,)*4, I would like to compute the 2-dimensional array B such that
for n in range(N):
    for m in range(N):
        B[n, m] = sum(A[i, j, n-i, m-j] for i in range(n) for j in range(m))
Is it possible to vectorize this computation with numpy?
It somehow looks like a kind of convolution, but on one array only
It's hard to visualize the whole array action
N=4; A = np.arange(N**4).reshape(N,N,N,N)
this tests the same:
for n in range(N):
    for m in range(N):
        I = np.arange(n)[:,None]; J = np.arange(m)
        B[n,m] = A[I,J,n-I,m-J].sum()
It's harder to "vectorize" the n,m since the indexed portion of A changes with n and m. At the last iteration:
In [248]: n,m
Out[248]: (3, 3)
In [249]: A[I,J,n-I,m-J]
Out[249]:
array([[ 15, 30, 45],
[ 75, 90, 105],
[135, 150, 165]])
while if either n or m is 0, it's an "empty" array with sum of 0.
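For completeness, here is a self-contained sketch that checks the partially vectorized inner sum against the original loops (the names B_ref and B_vec are mine):
import numpy as np

N = 4
A = np.arange(N**4).reshape(N, N, N, N)

# reference: the original triple-nested Python loops
B_ref = np.zeros((N, N), dtype=A.dtype)
for n in range(N):
    for m in range(N):
        B_ref[n, m] = sum(A[i, j, n-i, m-j] for i in range(n) for j in range(m))

# inner double sum replaced with fancy indexing
B_vec = np.zeros((N, N), dtype=A.dtype)
for n in range(N):
    for m in range(N):
        I = np.arange(n)[:, None]
        J = np.arange(m)
        B_vec[n, m] = A[I, J, n - I, m - J].sum()

assert np.array_equal(B_ref, B_vec)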

Iterating within a numpy array

I have 3 2x2 Matrices, P1, P2, and P3, which are populated with randomly generated integers. I want to make sure that these matrices are positive definite (i.e. All eigenvalues are all greater than 0). My code is below.
P1 = np.random.randint(10, size=(m,n))
P2 = np.random.randint(10, size=(m,n))
P3 = np.random.randint(10, size=(m,n))
lambda1 = np.linalg.eigvals(P1)
lambda2 = np.linalg.eigvals(P2)
lambda3 = np.linalg.eigvals(P3)
for i in lambda1:
    if (i <= 0): P1 = np.random.randint(10, size=(m,n))
for i in lambda2:
    if (i <= 0): P2 = np.random.randint(10, size=(m,n))
for i in lambda3:
    if (i <= 0): P3 = np.random.randint(10, size=(m,n))
print('Eigenvalue output to verify that matrices are positive definite:\n')
print(u'\u03BB(P\u2081) = ' + str(np.linalg.eigvals(P1)))
print(u'\u03BB(P\u2082) = ' + str(np.linalg.eigvals(P2)))
print(u'\u03BB(P\u2083) = ' + str(np.linalg.eigvals(P3)))
Right now, the if statement will pretty much re-generate the matrix once or twice if the eigenvalues are not positive, but it will not verify that the eigenvalues are always positive. My first guess was to nest a while loop within the for loop, but I could not figure out a way to get that to work, and I'm unsure if that is the most efficient way.
This function creates an array with positive eigenvalues:
def create_arr_with_pos_ev(m, n):
    ev = np.array([-1, -1])
    while not all(ev > 0):
        arr = np.random.randint(10, size=(m, n))
        ev = np.linalg.eigvals(arr)
    return arr, ev
First I define dummy eigenvalues that are lower than 0. Then I create a new array and calculate its eigenvalues. If any eigenvalue is not positive (while not all(ev > 0)), a new array is created.
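As a quick usage check (m = n = 2 here, matching the 2x2 matrices in the question):
P1, eig1 = create_arr_with_pos_ev(2, 2)
print(P1)
print(eig1)   # all eigenvalues are > 0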
As a supplement to the answer above, this can also be simplified a little further by taking out any input arguments to the function, and just defining the original matrix within the function:
def create_arr_with_pos_ev():
    # m and n are assumed to be defined in the enclosing scope (e.g. m = n = 2)
    arr = np.random.randint(10, size=(m, n))
    ev = np.linalg.eigvals(arr)
    while not all(ev > 0):
        arr = np.random.randint(10, size=(m, n))
        ev = np.linalg.eigvals(arr)
    print('\nMatrix: \n' + str(arr) + '\nEigenvalues: \n', ev)
    return arr, ev
Usage:
P1,eig1=create_arr_with_pos_ev()
P2,eig2=create_arr_with_pos_ev()
P3,eig3=create_arr_with_pos_ev()
Output:
Matrix:
[[6 0]
[3 7]]
Eigenvalues:
[7. 6.]
Matrix:
[[9 3]
[4 2]]
Eigenvalues:
[10.4244289 0.5755711]
Matrix:
[[5 6]
[3 8]]
Eigenvalues:
[ 2. 11.]

numpy irreversibly change items in a list at random

I have a list
a = np.ones(100)
I want to turn 50 random items in that list to 0. Once those elements are 0, I want to turn 25 random elements of the remaining items in the list to 0. Of the remaining 25 ones, I want to turn 13 of the remaining elements to zero at random etc.
I will then run this through a simple loop e.g. (pseudocode) "if item == 1, print red particle, else print blue particle" etc.
This is basically to simulate exponential decay, but I'm struggling to think of an algorithm to do this.
This is not a duplicate of "Numpy: Replace random elements in an array" because once the elements have changed I do not want them to be considered for change again.
You basically need a diminishing number of ones in the array, with the constraint that the ones replaced by zeros are drawn only from the existing set of ones. One way to solve it is with np.random.choice, with its optional argument replace set to False so that we get unique indices to zero out per iteration, and np.flatnonzero to get the indices of the remaining ones per iteration.
Hence, the implementation would look something like this -
# Counts of ones to be set to zero per iteration
counts = np.array([50, 25, 13, 6, 3, 2, 1])

a = np.ones(100, dtype=int)
for c in counts:
    a[np.random.choice(np.flatnonzero(a), c, replace=False)] = 0
Sample run -
In [49]: counts = np.array([50,25,13,6,3,2,1])
    ...: a = np.ones(100,dtype=int)
    ...: for c in counts:
    ...:     a[np.random.choice(np.flatnonzero(a), c, replace=False)] = 0
    ...:     print(a.sum())  # verify with the number of remaining ones at each iteration
50
25
12
6
3
1
0
To set up the counts array for an input array a of generic length, we can do something like this -
N = int(np.log(len(a))/np.log(2)) # number of iterations
counts = (len(a)*((0.5)**(np.arange(N)+1))).astype(int)
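As a sanity check of that setup (my own computation, not from the answer): for len(a) == 100 it yields the counts below; note that the integer truncation means a few ones can be left over at the end.
a = np.ones(100, dtype=int)
N = int(np.log(len(a)) / np.log(2))                       # 6 iterations for len(a) == 100
counts = (len(a) * ((0.5)**(np.arange(N) + 1))).astype(int)
print(counts)   # [50 25 12  6  3  1]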
Here's a simple version. It's an alternative to the other answers and diverges from the instructions by randomly setting elements to zero, as befits exponential decay, rather than removing a deterministic half each round.
This prints the entire array each round, but alternatives include printing just the first element, print(a[0]), or the number remaining, print(a.sum()).
import random
import numpy as np

a = np.ones(100)
while any(a):
    print(a)
    for i, _ in enumerate(a):
        if random.random() > 0.5:
            a[i] = 0
The way you want to implement it is to always set exactly 50% of the remaining elements to zero. This means in each epoch you will always decay exactly 50, 25, 12.5, ... elements.
However I don't believe that's how exponential decay processes work.
To my understanding you should, with a probability p, set each individual element to zero. This means for a probability p = 0.5 in each epoch you will on average decay 50, 25, 12.5, ... elements.
This method also simplifies the solution a lot:
import numpy as np

x = np.ones(100)
p = .5  # probability of decay
for i in range(20):
    mask = np.random.choice([True, False], size=len(x), p=[p, 1-p])
    x[mask] = 0
If speed is a factor (but memory consumption isn't), you can also vectorize this operation:
x = np.random.choice([-1, 0], size=(100, 20), p=[p, 1-p])
x = np.cumsum(x, axis=1)
x = x == 0
after which
x[:, n]
is your population after n + 1 epochs and
np.sum(x, axis=0)
is the population size over time.
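Putting the vectorized variant together as one runnable sketch (the names pop_size and n_epochs are mine):
import numpy as np

p = 0.5                         # probability of decay per epoch
pop_size, n_epochs = 100, 20

# -1 marks "decays in this epoch"; once the cumulative sum drops below 0
# the element stays decayed, so a cumulative sum of 0 means "still alive"
steps = np.random.choice([-1, 0], size=(pop_size, n_epochs), p=[p, 1 - p])
alive = np.cumsum(steps, axis=1) == 0

print(alive[:, 0])           # the population after the first epoch
print(alive.sum(axis=0))     # population size over time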
Here's one way you could do it. The essential idea is to create an array of indices [0, 1, 2, ..., 99], shuffle that array, and then use slices of decreasing size from that array as the indices of a to be zeroed.
In [75]: a = np.ones(100)
In [76]: sizes = (len(a)*np.power(2.0, [-1, -2, -3, -4, -5, -6, -7]) + 0.5).astype(int)
In [77]: sizes
Out[77]: array([50, 25, 13, 6, 3, 2, 1])
In [78]: indices = np.arange(len(a))
In [79]: np.random.shuffle(indices)
In [80]: start = 0
In [81]: for k in range(len(sizes)):
    ...:     end = start + sizes[k]
    ...:     a[indices[start:end]] = 0
    ...:     print(np.count_nonzero(a))
    ...:     start = end
    ...:
50
25
12
6
3
1
0

Numpy: get the lowest N elements of an array X, considering only elements whose index is not an element in another array Y

To get the lowest 10 values of an array X I do something like:
lowest10 = np.argsort(X)[:10]
What is the most efficient way, avoiding loops, to filter the results so that I get the lowest 10 values whose indices are not elements of another array Y?
So for example if the array Y is:
[2,20,51]
X[2], X[20] and X[51] shouldn't be taken into consideration to compute the lowest 10.
After some benchmarking, here is my humble recommendation:
Swapping out appears to be more or less always faster than masking (even if 99% of X is forbidden). So use something along the lines of
swap = X[Y]
X[Y] = np.inf
Sorting is expensive, therefore use argpartition and only sort what's necessary:
lowest10 = np.argpartition(X, 10)[:10]
lowest10 = lowest10[np.argsort(X[lowest10])]
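Putting the two points together as a standalone helper (a sketch; the name lowest_k_excluding is mine, and it assumes X is a float array so that np.inf can be assigned):
import numpy as np

def lowest_k_excluding(X, Y, k=10):
    """Indices of the k smallest values of X, skipping the positions listed in Y."""
    swap = X[Y]                      # remember the forbidden values
    X[Y] = np.inf                    # make them impossible to pick
    idx = np.argpartition(X, k)[:k]  # k smallest, unordered
    idx = idx[np.argsort(X[idx])]    # order them by value
    X[Y] = swap                      # restore X before returning
    return idx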
Here are some benchmarks:
import numpy as np
from timeit import timeit
def swap_out():
    global sol
    swap = X[Y]
    X[Y] = np.inf
    sol = np.argpartition(X, K)[:K]
    sol = sol[np.argsort(X[sol])]
    X[Y] = swap

def app1():
    sidx = X.argsort()
    return sidx[~np.in1d(sidx, Y)][:K]

def app2():
    sidx = np.argpartition(X, range(K+Y.size))
    return sidx[~np.in1d(sidx, Y)][:K]

def app3():
    sidx = np.argpartition(X, K+Y.size)
    return sidx[~np.in1d(sidx, Y)][:K]
K = 10 # number of small elements wanted
N = 10000 # size of X
M = 10 # size of Y
S = 10 # number of repeats in benchmark
X = np.random.random((N,))
Y = np.random.choice(N, (M,))
so = timeit(swap_out, number=S)
print(sol)
print(X[sol])
d1 = timeit(app1, number=S)
print(sol)
print(X[sol])
d2 = timeit(app2, number=S)
print(sol)
print(X[sol])
d3 = timeit(app3, number=S)
print(sol)
print(X[sol])
print('pp', f'{so:8.5f}', ' d1(um)', f'{d1:8.5f}', ' d2', f'{d2:8.5f}', ' d3', f'{d3:8.5f}')
# pp 0.00053 d1(um) 0.00731 d2 0.00313 d3 0.00149
Here's one approach -
sidx = X.argsort()
idx_out = sidx[~np.in1d(sidx, Y)][:10]
Sample run -
# Setup inputs
In [141]: X = np.random.choice(range(60), 60)
In [142]: Y = np.array([2,20,51])
# For testing, let's set the Y positions as 0s and
# we want to see them skipped in o/p
In [143]: X[Y] = 0
# Use proposed approach
In [144]: sidx = X.argsort()
In [145]: X[sidx[~np.in1d(sidx, Y)][:10]]
Out[145]: array([ 0, 2, 4, 5, 5, 9, 9, 10, 12, 14])
# Print the first 13 numbers and skip three 0s and
# that should match up with the output from proposed approach
In [146]: np.sort(X)[:13]
Out[146]: array([ 0, 0, 0, 0, 2, 4, 5, 5, 9, 9, 10, 12, 14])
Alternatively, for performance, we might want to use np.argpartition, like so -
sidx = np.argpartition(X,range(10+Y.size))
idx_out = X[sidx[~np.in1d(sidx, Y)][:10]]
This would be beneficial if the length of X is a much larger number than 10.
If you don't care about the order of elements in that list of 10 indices, for a further boost, we can simply pass the scalar length instead of the range array to np.argpartition : np.argpartition(X,10+Y.size).
We can optimize np.in1d with searchsorted to have one more approach (listing next).
Listing below all the discussed approaches in this post -
def app1(X, Y, n=10):
    sidx = X.argsort()
    return sidx[~np.in1d(sidx, Y)][:n]

def app2(X, Y, n=10):
    sidx = np.argpartition(X, range(n+Y.size))
    return sidx[~np.in1d(sidx, Y)][:n]

def app3(X, Y, n=10):
    sidx = np.argpartition(X, n+Y.size)
    return sidx[~np.in1d(sidx, Y)][:n]

def app4(X, Y, n=10):
    n_ext = n+Y.size
    sidx = np.argpartition(X, np.arange(n_ext))[:n_ext]
    ssidx = sidx.argsort()
    mask = np.ones(ssidx.size, dtype=bool)
    search_idx = np.searchsorted(sidx, Y, sorter=ssidx)
    search_idx[search_idx==sidx.size] = 0
    idx = ssidx[search_idx]
    mask[idx[sidx[idx] == Y]] = 0
    return sidx[mask][:n]
You can work on a subset of the original array using numpy.delete():
lowest10 = np.argsort(np.delete(X, Y))[:10]
Note that np.delete builds a new array containing only the entries to keep, so this costs an extra linear pass over X on top of the argsort.
Warning: this solution works on a subset of the original X array (X without the elements indexed in Y), so the end result is the lowest 10 of that subset, and the returned indices refer to the reduced array rather than to the original X.
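If the indices are needed with respect to the original X, a possible fix-up (my addition, not part of the original answer) is to map them back through the list of surviving positions:
import numpy as np

X = np.random.random(60)
Y = np.array([2, 20, 51])                        # indices to exclude, as in the question

kept = np.setdiff1d(np.arange(len(X)), Y)        # positions of X that survive the delete
lowest10_sub = np.argsort(np.delete(X, Y))[:10]  # indices into the reduced array
lowest10 = kept[lowest10_sub]                    # the same positions as indices into the original X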

Decrease array size by averaging adjacent values with numpy

I have a large array of thousands of vals in numpy. I want to decrease its size by averaging adjacent values.
For example:
a = [2,3,4,8,9,10]
#average down to 2 values here
a = [3,9]
#it averaged 2,3,4 and 8,9,10 together
So, basically, I have n elements in the array, and I want to tell it to average down to X values, averaging like above.
Is there some way to do that with numpy (I'm already using it for other things, so I'd like to stick with it)?
Using reshape and mean, you can average every m adjacent values of a 1D array of size N*m, with N being any positive integer. For example:
import numpy as np
m = 3
a = np.array([2, 3, 4, 8, 9, 10])
b = a.reshape(-1, m).mean(axis=1)
#array([3., 9.])
1) a.reshape(-1, m) creates a 2D view of the array without copying data:
array([[ 2, 3, 4],
[ 8, 9, 10]])
2) taking the mean along the second axis (axis=1) then calculates the mean value of each row, resulting in:
array([3., 9.])
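To go from a desired number of output values to the group size m (assuming, as above, that len(a) is divisible by that target):
import numpy as np

a = np.array([2, 3, 4, 8, 9, 10])
target = 2                   # desired number of output values
m = len(a) // target         # adjacent elements averaged per output value
b = a.reshape(-1, m).mean(axis=1)
print(b)                     # [3. 9.]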
Try this:
n_averaged_elements = 3
averaged_array = []
a = np.array([2, 3, 4, 8, 9, 10])
for i in range(0, len(a), n_averaged_elements):
    slice_from_index = i
    slice_to_index = slice_from_index + n_averaged_elements
    averaged_array.append(np.mean(a[slice_from_index:slice_to_index]))

>>> averaged_array
[3.0, 9.0]
Looks like a simple non-overlapping moving window average to me, how about:
In [3]:
import numpy as np
a = np.array([2,3,4,8,9,10])
window_sz = 3
a[:len(a)//window_sz*window_sz].reshape(-1,window_sz).mean(1)
# you want to be sure your array can be reshaped properly, hence the [:len(a)//window_sz*window_sz] part
Out[3]:
array([ 3., 9.])
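The same idea as a plain Python 3 snippet, with a length that does not divide evenly so the trimming actually matters (the 7-element array is my own example):
import numpy as np

a = np.array([2, 3, 4, 8, 9, 10, 7])                 # length 7: the trailing 7 is dropped
window_sz = 3
trimmed = a[:len(a) // window_sz * window_sz]        # keep only complete windows
print(trimmed.reshape(-1, window_sz).mean(axis=1))   # [3. 9.]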
In this example, I presume that a is the 1D numpy array that needs to be averaged. In the method given below, we first find the factors of the length of this array a, and then choose an appropriate factor as the step size to average the array with.
Here is the code.
import numpy as np
from functools import reduce

'''Function to find the factors of a given number n.'''
def factors(n):
    return sorted(set(reduce(list.__add__,
        ([i, n//i] for i in range(1, int(n**0.5) + 1) if n % i == 0))))

a = [2, 3, 4, 8, 9, 10]  # Given array.

'''fac: list of factors of len(a).
In this example, len(a) = 6, so fac = [1, 2, 3, 6].'''
fac = factors(len(a))

'''step: choose an appropriate step size from the list fac.
In this example, we choose one of the middle numbers in fac (3).'''
step = fac[int(len(fac)/3) + 1]

'''avg: initialize an empty array.'''
avg = np.array([])
for i in range(0, len(a), step):
    avg = np.append(avg, np.mean(a[i:i+step]))  # append averaged values to avg

print(avg)  # Prints the final result:
# [3. 9.]
