I have two 1D numpy arrays A and B of size (n, ) and (m, ) respectively which correspond to the x positions of points on a line. I want to calculate the distance between every point in A to every point in B. I then need to use these distances at a set y distance, d, to work out the potential at each point in A.
I'm currently using the following:
V = numpy.zeros(n)
for i in range(n):
    xdist = A[i] - B
    r = numpy.sqrt(xdist**2 + d**2)
    dV = 1/r
    V[i] = numpy.sum(dV)
This works but for large data sets it can take a while so I would like to use a function similar to scipy.spatial.distance.cdist which doesn't work for 1D arrays and I don't want to add another dimension to the arrays as they become too large.
Vectorized approach
One vectorized approach is to extend A to 2D by introducing a new axis with np.newaxis/None, so that broadcasting computes all pairwise differences at once -
(1/(np.sqrt((A[:,None] - B)**2 + d**2))).sum(1)
Hybrid approach for large arrays
Now, for large arrays, we might have to divide the data into chunks.
Thus, with BSZ as the block size, we would have a hybrid approach, like so -
dsq = d**2
V = np.zeros((n//BSZ,BSZ))  # assumes n is a multiple of BSZ
for i in range(n//BSZ):
    V[i] = (1/(np.sqrt((A[i*BSZ:(i+1)*BSZ,None] - B)**2 + dsq))).sum(1)
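The block loop above only covers the first (n//BSZ)*BSZ entries of A, so it assumes n is an exact multiple of BSZ. A minimal sketch of a variant that also handles a trailing partial block follows; the helper name potential_chunked and its block_size parameter are hypothetical and not part of the original answer.

```python
import numpy as np

def potential_chunked(A, B, d, block_size=100):
    """Sum 1/sqrt((A[i]-B[j])**2 + d**2) over j, one block of A at a time."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    dsq = d ** 2
    V = np.empty(A.shape[0])
    for start in range(0, A.shape[0], block_size):
        stop = start + block_size          # slicing clips this at len(A)
        diff = A[start:stop, None] - B     # shape (block, m)
        V[start:stop] = (1.0 / np.sqrt(diff ** 2 + dsq)).sum(axis=1)
    return V
```

Only one (block_size, m) temporary is alive at a time, so peak memory is set by block_size rather than by n.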
Runtime test
Approaches -
def original_app(A,B,d):
    n = A.shape[0]  # take n from the input instead of a global
    V = np.zeros(n)
    for i in range(n):
        xdist = A[i] - B
        r = np.sqrt(xdist**2 + d**2)
        dV = 1/r
        V[i] = np.sum(dV)
    return V
def vectorized_app1(A,B,d):
    return (1/(np.sqrt((A[:,None] - B)**2 + d**2))).sum(1)
def vectorized_app2(A,B,d, BSZ = 100):
    n = A.shape[0]  # assumes n is a multiple of BSZ
    dsq = d**2
    V = np.zeros((n//BSZ,BSZ))
    for i in range(n//BSZ):
        V[i] = (1/(np.sqrt((A[i*BSZ:(i+1)*BSZ,None] - B)**2 + dsq))).sum(1)
    return V.ravel()
Timings and verification -
In [203]: # Setup inputs
...: n,m = 10000,2000
...: A = np.random.rand(n)
...: B = np.random.rand(m)
...: d = 10
In [204]: out1 = original_app(A,B,d)
...: out2 = vectorized_app1(A,B,d)
...: out3 = vectorized_app2(A,B,d, BSZ = 100)
...: print np.allclose(out1, out2)
...: print np.allclose(out1, out3)
In [205]: %timeit original_app(A,B,d)
10 loops, best of 3: 133 ms per loop
In [206]: %timeit vectorized_app1(A,B,d)
10 loops, best of 3: 138 ms per loop
In [207]: %timeit vectorized_app2(A,B,d, BSZ = 100)
10 loops, best of 3: 65.2 ms per loop
We can play around with the parameter block size BSZ -
In [208]: %timeit vectorized_app2(A,B,d, BSZ = 200)
10 loops, best of 3: 74.5 ms per loop
In [209]: %timeit vectorized_app2(A,B,d, BSZ = 50)
10 loops, best of 3: 67.4 ms per loop
Thus, the best one seems to be giving a 2x speedup with a block size of 100 at my end.
EDIT: My answer turned out to be nearly identical to Divakar's after a closer look. However, you can save some memory by doing the operations in-place. Taking the sum along the second axis is more efficient than along the first.
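import numpy
a = numpy.random.randint(0,10,10) * 1.
b = numpy.random.randint(0,10,10) * 1.
d = 10.  # y distance; any nonzero value works here
xdist = a[:,None] - b  # pairwise differences, shape (n, m)
xdist **= 2
xdist += d**2
xdist **= -0.5  # 1/sqrt(xdist**2 + d**2), computed in place
V = numpy.sum(xdist, axis=1)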
which gives the same solution as your code.
I would like to use a function similar to scipy.spatial.distance.cdist which doesn't work for 1D arrays and I don't want to add another dimension to the arrays as they become too large.
cdist works fine, you just have to reshape the arrays to have shape (n, 1) instead of (n,). You can add another dimension to a one-dimensional array A without copying the underlying data by using A[:, None] or A.reshape(-1, 1).
For example,
In [56]: from scipy.spatial.distance import cdist
In [57]: A
Out[57]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [58]: B
Out[58]: array([0, 2, 4, 6, 8])
In [59]: A[:, None]
In [60]: cdist(A[:, None], B[:, None])
array([[ 0., 2., 4., 6., 8.],
[ 1., 1., 3., 5., 7.],
[ 2., 0., 2., 4., 6.],
[ 3., 1., 1., 3., 5.],
[ 4., 2., 0., 2., 4.],
[ 5., 3., 1., 1., 3.],
[ 6., 4., 2., 0., 2.],
[ 7., 5., 3., 1., 1.],
[ 8., 6., 4., 2., 0.],
[ 9., 7., 5., 3., 1.]])
To compute V as shown in your code, you can use cdist with metric='sqeuclidean', as follows:
In [72]: d = 3.
In [73]: r = np.sqrt(cdist(A[:,None], B[:,None], metric='sqeuclidean') + d**2)
In [74]: V = (1/r).sum(axis=1)
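As a self-contained sketch, the cdist route can be wrapped in a function and checked against the loop from the question. The function name potential_cdist is hypothetical. Note that although A[:, None] is only a view, cdist still allocates the full (n, m) distance matrix, so this does not by itself reduce peak memory.

```python
import numpy as np
from scipy.spatial.distance import cdist

def potential_cdist(A, B, d):
    """Potential at each point of A from all points of B at y distance d."""
    # Squared x distances between every pair, shape (n, m).
    sq = cdist(A[:, None], B[:, None], metric='sqeuclidean')
    return (1.0 / np.sqrt(sq + d ** 2)).sum(axis=1)
```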
Say I have a 2D pytorch tensor and a 2D numpy boolean as follows,
a = torch.tensor([[ 0., 1., 2.],
[ 3., 4., 5.],
[ 6., 7., 8.],
[ 9., 10., 11.],
[12., 13., 14.]])
m = numpy.array([[ False, True, False],
[ True, False, True],
[ False, True, True],
[ False, False, False],
[ True, False, False]])
They have the same dimension and the number of True's in each column of m is the same.
I need to get the 2x3 tensor that holds, column by column, the elements of a where m is True,
which is
tensor([[ 3., 1., 5.],
[12., 7., 8.]])
The actual tensor is very large, and the operation needs to be performed many times. So I want to ask what is an efficient way of doing this (or the most efficient way).
In my benchmarks, a jitted numba solution is the fastest I could find.
My benchmarks for a, m with shape (10000, 200) (equal result tensors):
numba: 13.2 ms (3.46x)
list comprehension: 31.3 ms (1.46x)
vectorized baseline (torch gather): 45.7 ms (1.00x)
Generation of sufficiently large sample data for benchmarking
import torch
import numpy as np
def generate_data(rows=500, columns=100):
    a = torch.from_numpy(np.random.uniform(1,10, (rows,columns)).astype(np.float32))
    # argsort trick by @divakar https://stackoverflow.com/a/55317373/14277722
    def shuffle_along_axis(a, axis):
        idx = np.random.rand(*a.shape).argsort(axis=axis)
        return np.take_along_axis(a,idx,axis=axis)
    m = shuffle_along_axis(np.full((columns,rows), np.random.randint(2, size=rows)), 1).astype('bool').T
    return a, np.ascontiguousarray(m)
a, m = generate_data(10000,200)
A jitted numba implementation
import numba as nb
@nb.njit
def gather2d(arr1, arr2):
    # every column of arr2 holds the same number of True values
    res = np.zeros((np.count_nonzero(arr2[:,0]), arr1.shape[1]), np.float32)
    counter = np.zeros(arr1.shape[1], dtype=np.intp)
    for i in range(arr1.shape[0]):
        for j in range(arr1.shape[1]):
            if arr2[i,j]:
                res[counter[j], j] = arr1[i,j]
                counter[j] += 1
    return res
# %timeit 10 loops, best of 5: 13.2 ms per loop
torch.from_numpy(gather2d(a.numpy(), m))
tensor([[2.1846, 7.8890, 8.8218, ..., 4.8309, 9.2853, 6.4404],
[5.8842, 3.7332, 6.7436, ..., 1.2914, 3.2983, 3.5627],
[9.5128, 2.4283, 2.2152, ..., 4.9512, 9.7335, 9.6252],
[7.3193, 7.8524, 9.6654, ..., 3.3665, 8.8926, 4.7660],
[1.3829, 1.3347, 6.6436, ..., 7.1956, 4.0446, 6.4633],
[6.4264, 3.6283, 3.6385, ..., 8.4152, 5.8498, 5.0281]])
Against a vectorized baseline solution
# %timeit 10 loops, best of 5: 45.7 ms per loop
a.gather(0, torch.from_numpy(np.nonzero(m.T)[1].reshape(-1, m.shape[1], order='F')))
A python list comprehension turns out to be surprisingly fast
def g(arr1,arr2):
    return np.array([i[j] for i,j in zip(arr1.T,arr2.T)]).T
# %timeit 10 loops, best of 5: 31.3 ms per loop
torch.from_numpy(g(a.numpy(), m))
You can try this way by using only NumPy and PyTorch:
b,c = m.nonzero()
b = torch.tensor(b)
c = torch.tensor(c)
tensor([[ 1., 3., 5.],
[ 7., 8., 12.]]) # True values are taken on axis=1
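The final indexing step is not shown in the snippet above. One plausible reading, sketched here in plain NumPy as an assumption rather than as the answerer's exact code, is to index with the two coordinate arrays and reshape. The sketch also shows why the result differs from the output requested in the question: nonzero() walks the mask row by row, while the question wants the values gathered column by column.

```python
import numpy as np

a = np.arange(15, dtype=np.float32).reshape(5, 3)
m = np.array([[False, True, False],
              [True, False, True],
              [False, True, True],
              [False, False, False],
              [True, False, False]])

# Row-major order: nonzero() returns the True positions row by row.
rows, cols = m.nonzero()
row_major = a[rows, cols].reshape(-1, a.shape[1])

# Column-wise order: walk the transposed mask, then transpose the result back.
cols_t, rows_t = m.T.nonzero()
column_wise = a[rows_t, cols_t].reshape(a.shape[1], -1).T
```

Both reshapes rely on every column of m containing the same number of True values, as stated in the question.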
I used the same example provided by @Michael Szczesny to measure the time:
import numpy as np
import timeit
import torch
rows, columns = (10000,200)
a = torch.from_numpy(np.random.uniform(1,10, (rows,columns)).astype(np.float32))
m = np.random.choice([False, True], size=(rows, columns))
starttime = timeit.default_timer()
b,c = m.nonzero()
b = torch.tensor(b)
c = torch.tensor(c)
print(f"The time difference is :{(timeit.default_timer() - starttime)*1000} ms")
The time difference is : 26.4 ms
It is faster than the second and third approaches of @Michael Szczesny.
I tried searching for an answer, but couldn't find what I needed. Apologies if this is a duplicate question.
Suppose I have a 2d-array with shape (n, n*m). What I want to do is an outer sum of this array to its transpose that results in an array with shape (n*m, n*m). For example, suppose I have
A = array([[1., 1., 2., 2.],
[1., 1., 2., 2.]])
I want to do an outer sum of A and A.T such that the output is:
>>> array([[2., 2., 3., 3.],
[2., 2., 3., 3.],
[3., 3., 4., 4.],
[3., 3., 4., 4.]])
Note that np.add.outer does not work because it ravels the inputs into vectors. I can achieve something similar by doing
np.tile(A, (2, 1)) + np.tile(A.T, (1, 2))
but this does not seem reasonable when n and m are reasonably large (n > 100 and m > 1000). Is it possible to write this sum using einsum? I just can't seem to figure out einsum.
To leverage broadcasting, we need to break it down to 3D and then permute axes and add -
n = A.shape[0]
m = A.shape[1]//n
a = A.reshape(n,m,n) # reshape to 3D
out = (a[None,:,:,:] + a.transpose(1,2,0)[:,:,None,:]).reshape(n*m,-1)
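As a quick check, the same steps can be wrapped in a function and run on the small array from the question. The function name outer_sum is hypothetical.

```python
import numpy as np

def outer_sum(A):
    """Outer sum of A (shape (n, n*m)) with its transpose, shape (n*m, n*m)."""
    n = A.shape[0]
    m = A.shape[1] // n
    a = A.reshape(n, m, n)  # split the columns into m groups of n
    # Shapes (1, n, m, n) and (m, n, 1, n) broadcast to (m, n, m, n).
    out = a[None, :, :, :] + a.transpose(1, 2, 0)[:, :, None, :]
    return out.reshape(n * m, -1)

A = np.array([[1., 1., 2., 2.],
              [1., 1., 2., 2.]])
result = outer_sum(A)
```

Here result matches the 4x4 array the question asks for, and the temporaries created by np.tile are avoided.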
Sample run for verification -
In [359]: # Setup input array
...: np.random.seed(0)
...: n,m = 3,4
...: A = np.random.randint(1,10,(n,n*m))
In [360]: # Original soln
...: out0 = np.tile(A, (m, 1)) + np.tile(A.T, (1, m))
In [361]: # Posted soln
...: n = A.shape[0]
...: m = A.shape[1]//n
...: a = A.reshape(n,m,n)
...: out = (a[None,:,:,:] + a.transpose(1,2,0)[:,:,None,:]).reshape(n*m,-1)
In [362]: np.allclose(out0, out)
Out[362]: True
Timings with large n,m -
In [363]: # Setup input array
...: np.random.seed(0)
...: n,m = 100,100
...: A = np.random.randint(1,10,(n,n*m))
In [364]: %timeit np.tile(A, (m, 1)) + np.tile(A.T, (1, m))
1 loop, best of 3: 407 ms per loop
In [365]: %%timeit
...: # Posted soln
...: n = A.shape[0]
...: m = A.shape[1]//n
...: a = A.reshape(n,m,n)
...: out = (a[None,:,:,:] + a.transpose(1,2,0)[:,:,None,:]).reshape(n*m,-1)
1 loop, best of 3: 219 ms per loop
Further performance boost with numexpr
For large data, we can use the numexpr module, which evaluates the expression on multiple cores and avoids large intermediate arrays, improving both memory efficiency and performance -
import numexpr as ne
n = A.shape[0]
m = A.shape[1]//n
a = A.reshape(n,m,n)
p1 = a[None,:,:,:]
p2 = a.transpose(1,2,0)[:,:,None,:]
out = ne.evaluate('p1+p2').reshape(n*m,-1)
Timings with same large n, m setup -
In [367]: %%timeit
...: # Posted soln
...: n = A.shape[0]
...: m = A.shape[1]//n
...: a = A.reshape(n,m,n)
...: p1 = a[None,:,:,:]
...: p2 = a.transpose(1,2,0)[:,:,None,:]
...: out = ne.evaluate('p1+p2').reshape(n*m,-1)
10 loops, best of 3: 152 ms per loop
one way is
I think this will take a lot of memory with n>100 and m>1000.
but isn't this the same as
Given two numpy arrays of shape nx3 and mx3, what is an efficient way to determine the row indices (counter) where the rows are common to the two arrays? For instance, I have the following solution, which becomes very slow even for moderately larger arrays
def arrangment(arr1,arr2):
    hits = []
    for i in range(arr2.shape[0]):
        current_row = np.repeat(arr2[i,:][None,:],arr1.shape[0],axis=0)
        x = current_row - arr1
        for j in range(arr1.shape[0]):
            if np.isclose(x[j,0],0.0) and np.isclose(x[j,1],0.0) and np.isclose(x[j,2],0.0):
                hits.append(j)  # row j of arr1 matches row i of arr2
    return hits
It checks if rows of arr2 exist in arr1 and returns the row indices of arr1 where the rows match. I need this arrangement to be always sequentially ascending in terms of rows of arr2. For instance given
arr1 = np.array([[-1., -1., -1.],
[ 1., -1., -1.],
[ 1., 1., -1.],
[-1., 1., -1.],
[-1., -1., 1.],
[ 1., -1., 1.],
[ 1., 1., 1.],
[-1., 1., 1.]])
arr2 = np.array([[-1., 1., -1.],
[ 1., 1., -1.],
[ 1., 1., 1.],
[-1., 1., 1.]])
The function should return:
[3, 2, 6, 7]
quick and dirty answer
(arr1[:, None] == arr2).all(-1).argmax(0)
array([3, 2, 6, 7])
Better answer
Takes care of the chance that a row in arr2 doesn't match anything in arr1
t = (arr1[:, None] == arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)
array([ 3., 2., 6., 7.])
As pointed out by @Divakar, np.isclose accounts for rounding error when comparing floats
t = np.isclose(arr1[:, None], arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)
I had a similar problem in the past and I came up with a fairly optimised solution for it.
First you need a generalisation of numpy.unique for multidimensional arrays, which, for the sake of completeness, I copy here
def unique2d(arr,consider_sort=False,return_index=False,return_inverse=False):
    """Get unique values along an axis for 2D arrays.
    input:
        arr:
            2D array
        consider_sort:
            Does permutation of the values within the axis matter?
            Two rows can contain the same values but with
            different arrangements. If consider_sort
            is True then those rows would be considered equal
        return_index:
            Similar to numpy unique
        return_inverse:
            Similar to numpy unique
    returns:
        2D array of unique rows
        If return_index is True also returns indices
        If return_inverse is True also returns the inverse array
    """
    if consider_sort is True:
        a = np.sort(arr,axis=1)
    else:
        a = arr
    # view each row as one opaque item, flattened to 1D
    b = np.ascontiguousarray(a).view(np.dtype((np.void,
            a.dtype.itemsize * a.shape[1]))).ravel()
    if return_inverse is False:
        _, idx = np.unique(b, return_index=True)
    else:
        _, idx, inv = np.unique(b, return_index=True, return_inverse=True)
    if return_index == False and return_inverse == False:
        return arr[idx]
    elif return_index == True and return_inverse == False:
        return arr[idx], idx
    elif return_index == False and return_inverse == True:
        return arr[idx], inv
    else:
        return arr[idx], idx, inv
Now all you need is to concatenate (np.vstack) your arrays and find the unique rows. The inverse mapping together with np.searchsorted will give you the indices you need. So let's write another function similar to numpy.in1d but for multidimensional (2D) arrays
def in2d_unsorted(arr1, arr2, axis=1, consider_sort=False):
    """Find the elements in arr1 which are also in
       arr2 and sort them as they appear in arr2"""
    assert arr1.dtype == arr2.dtype
    if axis == 0:
        arr1 = np.copy(arr1.T,order='C')
        arr2 = np.copy(arr2.T,order='C')
    if consider_sort is True:
        sorter_arr1 = np.argsort(arr1)
        arr1 = arr1[np.arange(arr1.shape[0])[:,None],sorter_arr1]
        sorter_arr2 = np.argsort(arr2)
        arr2 = arr2[np.arange(arr2.shape[0])[:,None],sorter_arr2]
    arr = np.vstack((arr1,arr2))
    _, inv = unique2d(arr, return_inverse=True)
    size1 = arr1.shape[0]
    size2 = arr2.shape[0]
    arr3 = inv[:size1]
    arr4 = inv[-size2:]
    # Sort the indices as they appear in arr2
    sorter = np.argsort(arr3)
    idx = sorter[arr3.searchsorted(arr4, sorter=sorter)]
    return idx
Now all you need to do is call in2d_unsorted with your input parameters
>>> in2d_unsorted(arr1,arr2)
array([ 3, 2, 6, 7])
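On NumPy versions where np.unique accepts an axis argument, the same stack, unique, searchsorted idea can be sketched without the void-view helper. This is a hedged alternative, not the code from the answer above: the name match_rows is hypothetical, rows are compared for exact equality (not np.isclose), and every row of arr2 is assumed to occur in arr1.

```python
import numpy as np

def match_rows(arr1, arr2):
    """Index in arr1 of each row of arr2, in the order of arr2's rows."""
    stacked = np.vstack((arr1, arr2))
    # inv assigns the same integer id to identical rows.
    _, inv = np.unique(stacked, axis=0, return_inverse=True)
    inv = inv.ravel()  # keep the ids 1D regardless of NumPy version
    ids1, ids2 = inv[:len(arr1)], inv[len(arr1):]
    sorter = np.argsort(ids1)
    # Locate each arr2 id among the sorted arr1 ids, then map back.
    pos = np.searchsorted(ids1, ids2, sorter=sorter)
    return sorter[pos]

arr1 = np.array([[-1., -1., -1.], [1., -1., -1.], [1., 1., -1.], [-1., 1., -1.],
                 [-1., -1., 1.], [1., -1., 1.], [1., 1., 1.], [-1., 1., 1.]])
arr2 = np.array([[-1., 1., -1.], [1., 1., -1.], [1., 1., 1.], [-1., 1., 1.]])
idx = match_rows(arr1, arr2)
```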
While it may not be fully optimised, this approach is much faster. Let's benchmark it against @piRSquared's solutions
def indices_piR(arr1,arr2):
    t = np.isclose(arr1[:, None], arr2).all(-1)
    return np.where(t.any(0), t.argmax(0), np.nan)
with the following arrays
arr1 = np.random.permutation(n).reshape(n//3, 3)
idx = np.random.permutation(n//3)
arr2 = arr1[idx]
In [13]: np.allclose(in2d_unsorted(arr1,arr2),indices_piR(arr1,arr2))
In [14]: %timeit indices_piR(arr1,arr2)
10000 loops, best of 3: 181 µs per loop
In [15]: %timeit in2d_unsorted(arr1,arr2)
10000 loops, best of 3: 85.7 µs per loop
Now, for n=1500
In [24]: %timeit indices_piR(arr1,arr2)
100 loops, best of 3: 10.3 ms per loop
In [25]: %timeit in2d_unsorted(arr1,arr2)
1000 loops, best of 3: 403 µs per loop
and for n=15000
In [28]: %timeit indices_piR(arr1,arr2)
1 loop, best of 3: 1.02 s per loop
In [29]: %timeit in2d_unsorted(arr1,arr2)
100 loops, best of 3: 4.65 ms per loop
So for largeish arrays this is over 200X faster compared to @piRSquared's vectorised solution.
My goal is to assign the values of an existing 2D array, or create a new array, using two 2D arrays of the same shape, one with values and one with indices to assign the corresponding value to.
X = np.array([range(5),range(5)])
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
Y= np.array([range(5), [2,3,4,1,0]])
array([[0, 1, 2, 3, 4],
[2, 3, 4, 1, 0]])
My desired output is an array of the same shape as X and Y, with the values of X given in the index from the corresponding row in Y. This result can be achieved by looping through each row in the following way:
output = np.zeros(X.shape)
for i in range(X.shape[0]):
    output[i][Y[i]] = X[i]
array([[ 0., 1., 2., 3., 4.],
[ 4., 3., 0., 1., 2.]])
Is there a more efficient way to apply this sort of assignment?
np.take(output, Y)
Will return the items of the output array that I would like to assign the values of X to, but I believe np.take returns a new array rather than a reference to the original array.
for i in range(X.shape[0]):
    output[i][Y[i]] = X[i]
is equivalent to
I = np.arange(X.shape[0])[:, np.newaxis]
output[I, Y] = X
For example,
X = np.array([range(5),range(5)])
Y = np.array([range(5), [2,3,4,1,0]])
output = np.zeros(X.shape)
I = np.arange(X.shape[0])[:, np.newaxis]
output[I, Y] = X
>>> output
array([[ 0., 1., 2., 3., 4.],
[ 4., 3., 0., 1., 2.]])
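On NumPy versions that provide np.put_along_axis, the same row-wise scatter can also be written without building the row index array I by hand. This is a sketch of an alternative, not part of the original answer.

```python
import numpy as np

X = np.array([range(5), range(5)])
Y = np.array([range(5), [2, 3, 4, 1, 0]])

output = np.zeros(X.shape)
# For every row i and position j: output[i, Y[i, j]] = X[i, j]
np.put_along_axis(output, Y, X, axis=1)
```

This modifies output in place, just like output[I, Y] = X.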
There is not much difference in performance when the loop has few iterations.
But if X.shape[0] is large, then using indexing is much faster:
def using_loop(X, Y):
    output = np.zeros(X.shape)
    for i in range(X.shape[0]):
        output[i][Y[i]] = X[i]
    return output
def using_indexing(X, Y):
    output = np.zeros(X.shape)
    I = np.arange(X.shape[0])[:, np.newaxis]
    output[I, Y] = X
    return output
X2 = np.tile(X, (100,1))
Y2 = np.tile(Y, (100,1))
In [77]: %timeit using_loop(X2, Y2)
1000 loops, best of 3: 376 µs per loop
In [78]: %timeit using_indexing(X2, Y2)
100000 loops, best of 3: 15.2 µs per loop
I have in my code the following expression:
a = (b / x[:, np.newaxis]).sum(axis=1)
where b is an ndarray of shape (M, N), and x is an ndarray of shape (M,). Now, b is actually sparse, so for memory efficiency I would like to substitute in a scipy.sparse.csc_matrix or csr_matrix. However, broadcasting in this way is not implemented and raises a NotImplementedError, even though division or multiplication is guaranteed to maintain sparsity (the entries of x are non-zero). Is there a sparse function I'm not aware of that would do what I want? (dot() would sum along the wrong axis.)
If b is in CSC format, then b.data has the non-zero entries of b, and b.indices has the row index of each of the non-zero entries, so you can do your division as:
b.data /= np.take(x, b.indices)
It's hackier than Warren's elegant solution, but it will probably also be faster in most settings:
import numpy as np
import scipy.sparse as sps
b = sps.rand(1000, 1000, density=0.01, format='csc')
x = np.random.rand(1000)
def row_divide_col_reduce(b, x):
    data = b.data.copy() / np.take(x, b.indices)
    ret = sps.csc_matrix((data, b.indices.copy(), b.indptr.copy()),
                         shape=b.shape)
    return ret.sum(axis=1)
def row_divide_col_reduce_bis(b, x):
    d = sps.spdiags(1.0/x, 0, len(x), len(x))
    return (d * b).sum(axis=1)
In [2]: %timeit row_divide_col_reduce(b, x)
1000 loops, best of 3: 210 us per loop
In [3]: %timeit row_divide_col_reduce_bis(b, x)
1000 loops, best of 3: 697 us per loop
In [4]: np.allclose(row_divide_col_reduce(b, x),
...: row_divide_col_reduce_bis(b, x))
Out[4]: True
You can cut the time almost in half in the above example if you do the division in-place, i.e.:
def row_divide_col_reduce(b, x):
    b.data /= np.take(x, b.indices)
    return b.sum(axis=1)
In [2]: %timeit row_divide_col_reduce(b, x)
10000 loops, best of 3: 131 us per loop
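The trick above depends on b being in CSC format, where b.indices holds row indices. In CSR format b.indices holds column indices instead, so the row of each stored entry has to be recovered from b.indptr. A minimal sketch of that adaptation, assuming a hypothetical helper name row_divide_sum_csr:

```python
import numpy as np
import scipy.sparse as sps

def row_divide_sum_csr(b, x):
    """Row sums of b / x[:, None] for a sparse b, without densifying it."""
    b = sps.csr_matrix(b)  # converts to CSR if needed; b is not modified
    # Row i owns indptr[i+1] - indptr[i] stored entries.
    row_of_entry = np.repeat(np.arange(b.shape[0]), np.diff(b.indptr))
    scaled = b.data / x[row_of_entry]
    # Add up the scaled entries of each row; empty rows give 0.
    return np.bincount(row_of_entry, weights=scaled, minlength=b.shape[0])
```

The result is a 1D ndarray of length M rather than an (M, 1) matrix.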
To implement a = (b / x[:, np.newaxis]).sum(axis=1), you can use a = b.sum(axis=1).A1 / x. The A1 attribute returns the 1D ndarray, so the result is a 1D ndarray, not a matrix. This concise expression works because you are both scaling by x and summing along axis 1. For example:
In [190]: b
<3x3 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [191]: b.A
array([[ 1., 0., 2.],
[ 0., 3., 0.],
[ 4., 0., 5.]])
In [192]: x
Out[192]: array([ 2., 3., 4.])
In [193]: b.sum(axis=1).A1 / x
Out[193]: array([ 1.5 , 1. , 2.25])
More generally, if you want to scale the rows of a sparse matrix with a vector x, you could multiply b on the left with a sparse matrix containing 1.0/x on the diagonal. The function scipy.sparse.spdiags can be used to create such a matrix. For example:
In [71]: from scipy.sparse import csc_matrix, spdiags
In [72]: b = csc_matrix([[1,0,2],[0,3,0],[4,0,5]], dtype=np.float64)
In [73]: b.A
array([[ 1., 0., 2.],
[ 0., 3., 0.],
[ 4., 0., 5.]])
In [74]: x = array([2., 3., 4.])
In [75]: d = spdiags(1.0/x, 0, len(x), len(x))
In [76]: d.A
array([[ 0.5 , 0. , 0. ],
[ 0. , 0.33333333, 0. ],
[ 0. , 0. , 0.25 ]])
In [77]: p = d * b
In [78]: p.A
array([[ 0.5 , 0. , 1. ],
[ 0. , 1. , 0. ],
[ 1. , 0. , 1.25]])
In [79]: a = p.sum(axis=1)
In [80]: a
matrix([[ 1.5 ],
[ 1. ],
[ 2.25]])
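The same row scaling can be sketched as a plain script with scipy.sparse.diags, which builds the diagonal matrix from the vector alone. This assumes a reasonably recent SciPy and is an equivalent alternative to the spdiags call above, not a replacement prescribed by the answer.

```python
import numpy as np
from scipy import sparse

b = sparse.csc_matrix([[1, 0, 2], [0, 3, 0], [4, 0, 5]], dtype=np.float64)
x = np.array([2., 3., 4.])

d = sparse.diags(1.0 / x)  # sparse diagonal matrix with 1/x on the diagonal
p = d @ b                  # scales row i of b by 1/x[i]; stays sparse
a = np.asarray(p.sum(axis=1)).ravel()  # flatten the (3, 1) result to 1D
```

Here a holds the same values as the matrix printed above, 1.5, 1. and 2.25, as a 1D array.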