Fastest way to delete/extract a submatrix from a numpy matrix - python

I have a square matrix that is NxN (N is usually >500). It is constructed using a numpy array.
I need to extract a new matrix that has the i-th column and row removed from this matrix. The new matrix is (N-1)x(N-1).
I am currently using the following code to extract this matrix:
new_mat = np.delete(old_mat, idx_2_remove, 0)
new_mat = np.delete(new_mat, idx_2_remove, 1)
I have also tried to use:
row_indices = [i for i in range(0,idx_2_remove)]
row_indices += [i for i in range(idx_2_remove+1,N)]
col_indices = row_indices
rows = [i for i in row_indices for j in col_indices]
cols = [j for i in row_indices for j in col_indices]
old_mat[(rows, cols)].reshape(len(row_indices), len(col_indices))
But I found this to be slower than the np.delete() approach, which is itself still quite slow for my application.
Is there a faster way to accomplish what I want?
Edit 1:
It seems the following is even faster than the above two, but not by much:
new_mat = old_mat[row_indices,:][:,col_indices]

Here are 3 alternatives I quickly wrote:
Repeated delete:
def foo1(arr, i):
    return np.delete(np.delete(arr, i, axis=0), i, axis=1)
Maximal use of slicing (may need some edge checks):
def foo2(arr,i):
    N = arr.shape[0]
    res = np.empty((N-1,N-1), arr.dtype)
    res[:i, :i] = arr[:i, :i]
    res[:i, i:] = arr[:i, i+1:]
    res[i:, :i] = arr[i+1:, :i]
    res[i:, i:] = arr[i+1:, i+1:]
    return res
Advanced indexing:
def foo3(arr,i):
    N = arr.shape[0]
    idx = np.r_[:i,i+1:N]
    return arr[np.ix_(idx, idx)]
Test that they work:
In [874]: x = np.arange(100).reshape(10,10)
In [875]: np.allclose(foo1(x,5),foo2(x,5))
Out[875]: True
In [876]: np.allclose(foo1(x,5),foo3(x,5))
Out[876]: True
Compare timings:
In [881]: timeit foo1(arr,100).shape
4.98 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [882]: timeit foo2(arr,100).shape
526 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [883]: timeit foo3(arr,100).shape
2.21 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the slicing is fastest, even if the code is longer. It looks like np.delete works like foo3, but one dimension at a time.
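The "edge checks" mentioned for the slicing version can be made explicit. The following is a minimal sketch, assuming a square 2-D input; the name remove_row_col and the validation are illustrative additions, not part of the answer above. Zero-length slices are valid assignment targets in NumPy, so i = 0 and i = N-1 need no special casing.

```python
import numpy as np

def remove_row_col(arr, i):
    """Return a copy of square `arr` without row i and column i (hypothetical helper)."""
    arr = np.asarray(arr)
    if arr.ndim != 2 or arr.shape[0] != arr.shape[1]:
        raise ValueError("expected a square 2-D array")
    n = arr.shape[0]
    if not 0 <= i < n:
        raise IndexError(f"index {i} out of range for size {n}")
    res = np.empty((n - 1, n - 1), dtype=arr.dtype)
    # Copy the four blocks that surround the removed row and column.
    res[:i, :i] = arr[:i, :i]
    res[:i, i:] = arr[:i, i + 1:]
    res[i:, :i] = arr[i + 1:, :i]
    res[i:, i:] = arr[i + 1:, i + 1:]
    return res

x = np.arange(100).reshape(10, 10)
for i in (0, 5, 9):  # first, middle, and last index
    expected = np.delete(np.delete(x, i, axis=0), i, axis=1)
    assert np.array_equal(remove_row_col(x, i), expected)
```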

Related

Numpy irregular samples (with replacement) of nonzero indices in each row

I would like to generate bootstrap samples of each row's nonzero indices. E.g. for this array:
m = np.array([[1,1,0,0], [1,1,0,1]])
I want to select two indices from the first row, and three from the second, with replacement. The non-vectorized solution is a for loop over the rows:
for row in m:
    idx = np.nonzero(row)[0]
    boot_idx = np.random.choice(idx, len(idx), replace=True)
    print(boot_idx)
To clarify the need, the array m is actually a mask of a 3D tensor, and I want to take bootstrap averages of that tensor based on the indices selected here.
If speed is the concern, you could use numba:
import numpy as np
import numba as nb
@nb.njit
def func(m):
    for row in m:
        idx = np.nonzero(row)[0]
        boot_idx = np.random.choice(idx, len(idx), replace=True)
    return  # return what you want
This results in significant speed increases:
def func_op(m):
    for row in m:
        idx = np.nonzero(row)[0]
        boot_idx = np.random.choice(idx, len(idx), replace=True)
    return
func(m)  # Run once to JIT
%timeit func(m)
%timeit func_op(m)
Output:
706 ns ± 2.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
30.2 µs ± 859 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
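If adding numba is not an option, the per-row loop can also be avoided in plain NumPy. This is only a sketch of one possible approach, not something from the answer above: the helper name bootstrap_nonzero_indices is hypothetical, and it assumes a 2-D mask and a np.random.Generator. It draws one random position per nonzero entry, bounded by that row's nonzero count, and maps the positions back to column indices.

```python
import numpy as np

def bootstrap_nonzero_indices(m, rng):
    """Resample each row's nonzero column indices with replacement (hypothetical helper)."""
    rows, cols = np.nonzero(m)               # row-major order, so cols are grouped by row
    counts = np.bincount(rows, minlength=m.shape[0])
    starts = np.cumsum(counts) - counts      # offset of each row's group inside `cols`
    # One draw per nonzero entry, uniform over [0, count of its row).
    draws = rng.integers(0, np.repeat(counts, counts))
    picked = cols[np.repeat(starts, counts) + draws]
    # Split the flat result back into one array per row.
    return np.split(picked, np.cumsum(counts)[:-1])

m = np.array([[1, 1, 0, 0], [1, 1, 0, 1]])
samples = bootstrap_nonzero_indices(m, np.random.default_rng(0))
for s in samples:
    print(s)  # two indices for the first row, three for the second
```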

How to create a mask in numpy conditioned on index?

N = 5
mask = np.zeros((N, N, N))
for i in range(N):
    for j in range(N):
        for k in range(N):
            if j==k and i!=j:
                mask[i,j,k] = 1
Currently I am doing it as shown in the code above. I feel there must be a more efficient and Pythonic way to achieve this.
You can do:
import numpy as np
N = 5
i, j, k = np.ogrid[:N, :N, :N]
mask = (j == k) & (i != j)
I suggest the following way:
tiled = np.tile(np.identity(N),(N,1))
for i in range(N):
    tiled[i*N+i,i] = 0
mask = np.reshape(tiled,(N,N,N))
Here, you first create a 2D array of stacked identity matrices, zero out the entries where i == j == k, and then reshape it into a 3D array. This executes faster than the original:
Original: 13.5 µs ± 305 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
New: 10.2 µs ± 396 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
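As a quick sanity check, the np.ogrid version can be compared against the original triple loop. This is a small sketch with illustrative variable names (gi, gj, gk, loop_mask, ogrid_mask). Note that the broadcasting version yields a boolean array, whereas the loop fills a float array, so one of them has to be cast before comparing.

```python
import numpy as np

N = 5
# Reference: the triple loop from the question.
loop_mask = np.zeros((N, N, N))
for i in range(N):
    for j in range(N):
        for k in range(N):
            if j == k and i != j:
                loop_mask[i, j, k] = 1

# Broadcasting version: open grids of shape (N,1,1), (1,N,1) and (1,1,N).
gi, gj, gk = np.ogrid[:N, :N, :N]
ogrid_mask = (gj == gk) & (gi != gj)

assert ogrid_mask.shape == (N, N, N)
assert np.array_equal(ogrid_mask, loop_mask.astype(bool))
```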

numpy - einsum vs naive implementation runtime performance

I have a two dimensional array Y of size (N,M), say for instance:
N, M = 200, 100
Y = np.random.normal(0,1,(N,M))
For each row i, I want to compute the product of the (M,1) column vector with its transpose, which returns an (M,M) matrix. One inefficient way to do it is:
Y = Y[:,:,np.newaxis]
[Y[i,:,:] @ Y[i,:,:].T for i in range(N)]
which is quite slow: timeit on the second line returns
11.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I thought a much better way to do it would be to use the numpy einsum function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html):
np.einsum('ijk,imk->ijm', Y, Y, optimize=True)
(which means: for each row i, create a (j,m) matrix whose elements result from the dot product over the last dimension k)
The two methods return exactly the same result, but the runtime of this new version is disappointing (only about three times faster):
3.82 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One would expect a much larger improvement from the vectorized einsum function, since the first method is very inefficient. Do you have an explanation for this? Is there a better way to do this calculation?
In [60]: N, M = 200, 100
...: Y = np.random.normal(0,1,(N,M))
In [61]: Y1 = Y[:,:,None]
Your iteration, 200 steps to produce (100,100) arrays:
In [62]: timeit [Y1[i,:,:]@Y1[i,:,:].T for i in range(N)]
18.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
einsum only modestly faster:
In [64]: timeit np.einsum('ijk,imk->ijm', Y1,Y1)
14.5 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but you could apply the @ in full 'batch' mode with:
In [65]: timeit Y[:,:,None]@Y[:,None,:]
7.63 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But as Divakar notes, the sum axis is size 1, so you could use plain broadcasted multiply. This is an outer product, not a matrix one.
In [66]: timeit Y[:,:,None]*Y[:,None,:]
8.2 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
'Vectorizing' gives big gains when it replaces many iterations of a simple operation. With fewer iterations of a more complex operation, the gain isn't as great.
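To make the equivalence concrete, here is a small sketch (sizes and variable names chosen only for illustration) checking that the batched @, the broadcasted multiply, and a two-operand einsum all produce the same stack of outer products as an explicit loop. The subscript string 'ij,ik->ijk' works on the original 2-D Y directly, without the trailing size-1 axis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3  # small sizes, for illustration only
Y = rng.normal(0, 1, (N, M))

# Reference: one outer product per row, stacked into shape (N, M, M).
looped = np.stack([np.outer(Y[i], Y[i]) for i in range(N)])

batched = Y[:, :, None] @ Y[:, None, :]      # batched matrix product
broadcast = Y[:, :, None] * Y[:, None, :]    # plain broadcasted multiply
einsummed = np.einsum('ij,ik->ijk', Y, Y)    # einsum with no summed axis

assert looped.shape == (N, M, M)
assert np.allclose(looped, batched)
assert np.allclose(looped, broadcast)
assert np.allclose(looped, einsummed)
```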
This is an old post, yet it covers the subject in great detail: efficient outer product.
In particular, if you are willing to add a numba dependency, that may be your fastest option.
Updating part of numba code from the original post and adding the multi outer product:
import numpy as np
from numba import jit
from numba.typed import List
@jit(nopython=True)
def outer_numba(a, b):
    m = a.shape[0]
    n = b.shape[0]
    result = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i]*b[j]
    return result
@jit(nopython=True)
def multi_outer_numba(Y):
    all_result = List()
    for k in range(Y.shape[0]):
        y = Y[k]
        n = y.shape[0]
        tmp_res = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                tmp_res[i, j] = y[i]*y[j]
        all_result.append(tmp_res)
    return all_result
r = [outer_numba(Y[i],Y[i]) for i in range(N)]
r = multi_outer_numba(Y)

How to Vectorize This Matrix Operation?

(I asked a similar question before but this is a different operation.)
I have 2 arrays of boolean masks and I am looking to calculate an operation on every combination of two masks.
The slow version
N = 10000
M = 580
masksA = np.array(np.random.randint(0,2, size=(N,M)), dtype=bool)
masksB = np.array(np.random.randint(0,2, size=(N,M)), dtype=bool)
result = np.zeros(shape=(N,N), dtype=float)
for i in range(N):
    for j in range(N):
        result[i,j] = np.float64(np.count_nonzero(np.logical_and(masksA[i,:],masksB[j,:]))) / M
It seems the first input would be masksA as the question text reads - "operation on every combination of two masks".
We can use matrix-multiplication to solve it, like so -
result = masksA.astype(np.float64).dot(masksB.T)/M
Alternatively, use the lower-precision np.float32 for the dtype conversion for faster computation. Since we are only counting, the lower precision should be fine.
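The reason a matrix product works here is that, with the masks cast to 0/1, entry (i, j) of the product counts the positions where both masksA[i] and masksB[j] are True. The sketch below checks this against the double loop on deliberately small, illustrative sizes. float32 represents integers exactly up to 2**24, so the counts themselves are not rounded for M = 580.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 10  # small sizes, for illustration only
masksA = rng.integers(0, 2, size=(N, M)).astype(bool)
masksB = rng.integers(0, 2, size=(N, M)).astype(bool)

# Reference: the explicit double loop from the question.
expected = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        expected[i, j] = np.count_nonzero(masksA[i] & masksB[j]) / M

# The product of 0/1 matrices counts positions where both masks are True.
result = masksA.astype(np.float32) @ masksB.T.astype(np.float32) / M

assert np.allclose(result, expected)
```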
Timings -
In [5]: N = 10000
...: M = 580
...:
...: np.random.seed(0)
...: masksA = np.array(np.random.randint(0,2, size=(N,M)), dtype=bool)
...: masksB = np.array(np.random.randint(0,2, size=(N,M)), dtype=bool)
In [6]: %timeit masksA.astype(np.float64).dot(masksB.T)
1.87 s ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %timeit masksA.astype(np.float32).dot(masksB.T)
1 s ± 7.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python: Taking the outer product of each row of a matrix with itself, summing it, then returning a vector of sums

Say I have a matrix A of dimension N by M.
I wish to return an N-dimensional vector V where the nth element is the double sum of all pairwise products of the entries in the nth row of A.
In loops, I guess I could do:
V = np.zeros(A.shape[0])
for n in range(A.shape[0]):
    for i in range(A.shape[1]):
        for j in range(A.shape[1]):
            V[n] += A[n,i] * A[n,j]
I want to vectorise this and I guess I could do:
V_temp = np.einsum('ij,ik->ijk', A, A)
V = np.einsum('ijk->i', V_temp)
But I don't think this is very memory efficient, as the intermediate V_temp unnecessarily stores the whole outer products when all I need are the sums. Is there a better way to do this?
Thanks
You can use
V=np.einsum("ni,nj->n",A,A)
You are actually calculating
A.sum(-1)**2
In other words, the sum over an outer product is just the product of the sums of the factors.
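Written out for a single row a = (a_1, ..., a_M) of A, the double sum factorizes because each factor depends on only one of the two summation indices:

```latex
\sum_{i=1}^{M} \sum_{j=1}^{M} a_i a_j
  = \Bigl(\sum_{i=1}^{M} a_i\Bigr) \Bigl(\sum_{j=1}^{M} a_j\Bigr)
  = \Bigl(\sum_{i=1}^{M} a_i\Bigr)^{2}
```

So only the row sums are needed, and no (N, M, M) intermediate ever has to be formed.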
Demo:
A = np.random.random((1000,1000))
np.allclose(np.einsum('ij,ik->i', A, A), A.sum(-1)**2)
# True
t = timeit.timeit('np.einsum("ij,ik->i",A,A)', globals=dict(A=A,np=np), number=10)*100; f"{t:8.4f} ms"
# '948.4210 ms'
t = timeit.timeit('A.sum(-1)**2', globals=dict(A=A,np=np), number=10)*100; f"{t:8.4f} ms"
# ' 0.7396 ms'
Perhaps you can use
np.einsum('ij,ik->i', A, A)
or the equivalent
np.einsum(A, [0,1], A, [0,2], [0])
On a 2015 Macbook, I get
In [35]: A = np.random.rand(100,100)
In [37]: %timeit for_loops(A)
640 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [38]: %timeit np.einsum('ij,ik->i', A, A)
658 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit np.einsum(A, [0,1], A, [0,2], [0])
672 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
