How to Vectorize This Matrix Operation? - python

(I asked a similar question before but this is a different operation.)
I have 2 arrays of boolean masks and I am looking to calculate an operation on every combination of two masks.
The slow version
N = 10000
M = 580
masksA = np.array(np.random.randint(0, 2, size=(N, M)), dtype=bool)
masksB = np.array(np.random.randint(0, 2, size=(N, M)), dtype=bool)
result = np.zeros(shape=(N, N), dtype=np.float64)
for i in range(N):
    for j in range(N):
        result[i, j] = np.count_nonzero(np.logical_and(masksA[i, :], masksB[j, :])) / M

Going by the question text ("an operation on every combination of two masks"), the first input is masksA and the second is masksB.
We can use matrix-multiplication to solve it, like so -
result = masksA.astype(np.float64).dot(masksB.T)/M
Alternatively, use the lower-precision np.float32 for the dtype conversion for faster computation. Since we are only counting, the lower precision should be fine.
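As a quick sanity check (not part of the original answer), the vectorized form can be compared against the loop on a reduced problem size:
np.random.seed(0)
n, m = 50, 30
A = np.random.randint(0, 2, size=(n, m)).astype(bool)
B = np.random.randint(0, 2, size=(n, m)).astype(bool)
# reference result from the double loop
expected = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        expected[i, j] = np.count_nonzero(A[i] & B[j]) / m
# vectorized result via matrix multiplication
result = A.astype(np.float32).dot(B.T) / m
print(np.allclose(expected, result))  # True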
Timings -
In [5]: N = 10000
...: M = 580
...:
...: np.random.seed(0)
...: masksA = np.array(np.random.randint(0,2, size=(N,M)), dtype=bool)
...: masksB = np.array(np.random.randint(0,2, size=(N,M)), dtype=bool)
In [6]: %timeit masksA.astype(np.float64).dot(masksB.T)
1.87 s ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %timeit masksA.astype(np.float32).dot(masksB.T)
1 s ± 7.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Related

Python: get maximum occurrence in array

I implemented code to get the maximum occurrence count in a numpy array. I was satisfied with the numba version, but it has limitations. I wonder whether it can be generalized while keeping the speed.
numba implementation
import numba as nb
import numpy as np
import collections
@nb.njit("int64(int64[:])")
def max_count_unique_num(x):
    """
    Counts the maximum occurrence of a unique integer in x.

    Args:
        x (numpy array): Integer array.

    Returns:
        int
    """
    # get maximum value
    m = x[0]
    for v in x:
        if v > m:
            m = v
    if m == 0:
        return x.size
    # count each unique value
    num = np.zeros(m + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1
    # maximum count
    m = 0
    for k in num:
        if k > m:
            m = k
    return m
For comparison, I also implemented versions using numpy's unique and collections.Counter:
def np_unique(x):
    """ Counts maximum occurrence using numpy's unique. """
    ux, uc = np.unique(x, return_counts=True)
    return uc.max()

def counter(x):
    """ Counts maximum occurrence using collections.Counter. """
    counts = collections.Counter(x)
    return max(counts.values())
timeit
Edit: Added np.bincount for additional comparison, as suggested by @MechanicPig.
In [1]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [2]: %timeit max_count_unique_num(x)
30 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: %timeit np_unique(x)
1.14 ms ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit counter(x)
2.68 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [6]: %timeit counter(x)
3.07 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit np_unique(x)
1.3 ms ± 7.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit max_count_unique_num(x)
490 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [10]: %timeit np.bincount(x).max()
32.3 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [11]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [12]: %timeit np.bincount(x).max()
830 µs ± 6.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The limitations of the numba implementation are quite obvious: it is only efficient when all values in x are small positive integers, its performance degrades significantly for very large integers, and it is not applicable to floats or negative values.
Any way I can generalize the implementation and keep the speed?
Update
After checking the source code of np.unique, an implementation for the general case can be:
@nb.njit(["int64(int64[:])", "int64(float64[:])"])
def max_count_unique_num_2(x):
    x.sort()
    n = 0
    k = 0
    x0 = x[0]
    for v in x:
        if x0 == v:
            k += 1
        else:
            if k > n:
                n = k
            k = 1
            x0 = v
    # for the last run in x, in case it equals the previous value
    if k > n:
        n = k
    return n
timeit
In [154]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [155]: %timeit max_count_unique_num(x)
519 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [156]: %timeit np_unique(x)
1.3 ms ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [157]: %timeit max_count_unique_num_2(x)
240 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [158]: x = np.random.randint(0, 200000, size=300000).astype(np.int64)
In [159]: %timeit max_count_unique_num(x)
1.01 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [160]: %timeit np_unique(x)
18.1 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [161]: %timeit max_count_unique_num_2(x)
3.58 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So:
With large integers in x and a moderate array size, max_count_unique_num_2 beats max_count_unique_num.
Both max_count_unique_num and max_count_unique_num_2 are significantly faster than np.unique.
A small modification to max_count_unique_num_2 can return the item that has the maximum occurrence, or even all items sharing that maximum occurrence (see the sketch below).
max_count_unique_num_2 can be accelerated further by removing x.sort() if x is already sorted.
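A minimal sketch of that modification (my own variant of the same sorted scan; the name max_count_unique_val is hypothetical), returning both the value and its count:
@nb.njit
def max_count_unique_val(x):
    # Hypothetical variant of max_count_unique_num_2: same sorted scan,
    # but it also remembers which value produced the longest run.
    x.sort()
    n = 0
    k = 0
    best = x[0]
    x0 = x[0]
    for v in x:
        if x0 == v:
            k += 1
        else:
            if k > n:
                n = k
                best = x0
            k = 1
            x0 = v
    if k > n:
        n = k
        best = x0
    return best, n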
What about shortening your code:
@nb.njit("int64(int64[:])", fastmath=True)
def shortened(x):
    num = np.zeros(x.max() + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1
    return num.max()
or parallelized:
@nb.njit("int64(int64[:])", parallel=True, fastmath=True)
def shortened_paralleled(x):
    num = np.zeros(x.max() + 1, dtype=x.dtype)
    for k in nb.prange(x.size):
        num[x[k]] += 1
    return num.max()
Parallelizing wins for larger data sizes. Note that the parallel version has a race condition on num (concurrent += on the same bins), so it can give different results on some runs and needs to be fixed if possible.
For handling floats (or negative values) with Numba:
@nb.njit("int64(float64[:])", fastmath=True)
def shortened_float(x):
    # O(n**2) pairwise comparison; int64 counts avoid overflowing the
    # original int8 dtype when a value occurs more than 127 times
    num = np.zeros(x.size, dtype=np.int64)
    for k in x:
        for j in range(x.shape[0]):
            if k == x[j]:
                num[j] += 1
    return num.max()
IMO, np.unique(x, return_counts=True)[1].max() is the best choice, since it handles both integers and floats with a very fast implementation. Numba can be faster for integers (it depends on the data size; its advantage shrinks for larger data, presumably because it loops element by element rather than using array operations), but for floats the code would need further optimization if that is even possible, and I don't think Numba can beat NumPy's unique, particularly for large data.
Note: np.bincount can handle only integers.
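As a small aside (my addition, not from the answer), negative integers can still be handled by np.bincount if you shift the data to be non-negative first:
x = np.array([-3, -1, -3, 2, 2, 2])
# shift by the minimum so every value becomes a valid non-negative bin index
max_occurrence = np.bincount(x - x.min()).max()
print(max_occurrence)  # 3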
You can do that without using numpy too.
arr = [1,1,2,2,3,3,4,5,6,1,3,5,7,1]
counts = list(map(list(arr).count, set(arr)))
list(set(arr))[counts.index(max(counts))]
If you want to use numpy then try this,
arr = np.array([1,1,2,2,3,3,4,5,6,1,3,5,7,1])
uniques, counts = np.unique(arr, return_counts = True)
uniques[np.where(counts == counts.max())]
Both do the exact same job. To check which method is more efficient just do this,
time_i = time.time()
<arr declaration>  # Creating a new array each iteration would increase the total time, which would be biased against the numpy method.
for i in range(10**5):
    <method you want>
time_f = time.time()
When I ran this I got 0.39 seconds for the first method and 2.69 seconds for the second one, so for a small array like this the first method is more efficient (np.unique has per-call overhead that dominates on tiny inputs).
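For reference, here is a self-contained sketch of that harness using the timeit module (the array, repeat count and the placement of the array creation follow the snippets above):
import timeit

setup = (
    "import numpy as np; "
    "arr = [1,1,2,2,3,3,4,5,6,1,3,5,7,1]; "
    "a = np.array(arr)"
)
list_stmt = (
    "counts = list(map(list(arr).count, set(arr))); "
    "list(set(arr))[counts.index(max(counts))]"
)
numpy_stmt = (
    "uniques, counts = np.unique(a, return_counts=True); "
    "uniques[np.where(counts == counts.max())]"
)
print(timeit.timeit(list_stmt, setup=setup, number=10**5))
print(timeit.timeit(numpy_stmt, setup=setup, number=10**5))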
What I want to say is that your implementation is almost the same as numpy.bincount. If you want to make it universal, you can consider encoding the original data:
def encode(ar):
    # Equivalent to numpy.unique(ar, return_inverse=True)[1] when ar.ndim == 1
    flatten = ar.ravel()
    perm = flatten.argsort()
    sort = flatten[perm]
    mask = np.concatenate(([False], sort[1:] != sort[:-1]))
    encoded = np.empty(sort.shape, np.int64)
    encoded[perm] = mask.cumsum()
    encoded.shape = ar.shape
    return encoded

def count_max(ar):
    return max_count_unique_num(encode(ar))
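For example, this lets the integer-only numba routine handle floats and negative values (a quick check, my addition):
x = np.array([1.5, -2.0, 1.5, 3.0, 1.5])
print(count_max(x))  # 3, since 1.5 occurs three times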

Multiplying integers by booleans and understanding numpy array comparison

I'm writing a program in which there is a numpy array a whose elements can take three possible values: -1, 0 or 1. I am trying to multiply some of its elements by a number c. The idea is to obtain this behaviour:
for i,el in enumerate(a):
    if el == b:
        a[i] *= c
I came up with a solution that does not require any loops and runs a couple of orders of magnitude faster than the previous one. This is the code I used to test them:
# Long array with random integers between -1 and 1
a = np.random.choice(3,1000000) - 1
a1 = a.copy()
a2 = a.copy()
# Reference values for b and c
b = 1
c = 10
# Solution with loop
t0 = time.time()
for i,el in enumerate(a1):
    if el == b:
        a1[i] *= c
t1 = time.time()
# Solution without loop
a2 = a2*((a2 == b)*c + (a2 != b))
t2 = time.time()
print("Loop: %f s"%(t1 - t0))
print("No loop: %f s"%(t2 - t1))
Although it seems to work fine, I'm not entirely comfortable multiplying integers by booleans, though I don't know whether I should be. I would appreciate it if anyone could tell me a bit more about what NumPy is doing and/or whether there is a better way to do this that I am not considering.
Thanks in advance!
NumPy will cast the bool type to the integer type, with False and True converted to 0 and 1 respectively. This casting is safe, so don't worry, be happy.
In [8]: np.can_cast(np.bool_, np.intc)
Out[8]: True
If you prefer to be explicit, you could do that casting yourself by replacing (a2 == b) with (a2 == b).astype(int), but that is not necessary.
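To see what the boolean arithmetic builds, here is a tiny example (my addition): the expression forms a multiplier array that is c where the mask is True and 1 elsewhere, so non-matching entries are left unchanged.
a2 = np.array([-1, 0, 1, 1, 0])
b, c = 1, 10
multiplier = (a2 == b)*c + (a2 != b)  # array([ 1,  1, 10, 10,  1])
print(a2 * multiplier)                # [-1  0 10 10  0]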
Some comparative timings:
In [66]: %%timeit a2=a.copy()
...: a2*((a2==b)*10 + (a2!=b))
14.4 ms ± 36.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [67]: %%timeit a2=a.copy()
...: a2[a2==b] *= 10
1.96 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [68]: %%timeit a2=a.copy()
...: a2[a2==b] = a2[a2==b]*10
3.28 ms ± 5.63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [69]: %%timeit a2=a.copy()
...: np.multiply(a2, 10, where=a2==b, out=a2)
1.63 ms ± 3.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The fastest ones only evaluate a2==b once. The multiply with the where parameter is fastest, but also a bit harder to understand.
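Roughly, where= selects which elements the ufunc writes and out= supplies the destination; positions where the condition is False are simply left with whatever out already held. A small illustration (my addition, based on the documented ufunc behaviour):
a3 = np.array([-1, 0, 1, 1, 0])
# multiply by 10 only where a3 == 1; other positions keep their original
# values because out=a3 already holds them and the write is skipped there
np.multiply(a3, 10, where=(a3 == 1), out=a3)
print(a3)  # [-1  0 10 10  0]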
And to verify that the fastest produces the same thing:
In [73]: a2=a.copy();a2=a2*((a2==b)*10 + (a2!=b))
In [74]: a3=a.copy();np.multiply(a3, 10, where=a3==b, out=a3);
In [75]: np.allclose(a2,a3)
Out[75]: True

numpy - einsum vs naive implementation runtime performance

I have a two dimensional array Y of size (N,M), say for instance:
N, M = 200, 100
Y = np.random.normal(0,1,(N,M))
For each of the N rows, I want to compute the product of the (M,1) vector with its transpose, which gives an (M,M) matrix. One inefficient way to do it is:
Y = Y[:,:,np.newaxis]
[Y[i,:,:] @ Y[i,:,:].T for i in range(N)]
which is quite slow: timeit on the second line returns
11.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I thought a much better way to do it is the use the einsum numpy function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html):
np.einsum('ijk,imk->ijm', Y, Y, optimize=True)
(meaning: for each row i, build a (j, m) matrix whose elements result from the dot product over the last dimension k)
The two methods return exactly the same result, but the runtime of this new version is disappointing (only a bit more than twice as fast):
3.82 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One would expect a much bigger improvement from the vectorized einsum function, since the first method is very inefficient... Do you have an explanation for this? Is there a better way to do this calculation?
In [60]: N, M = 200, 100
...: Y = np.random.normal(0,1,(N,M))
In [61]: Y1 = Y[:,:,None]
Your iteration, 200 steps to produce (100,100) arrays:
In [62]: timeit [Y1[i,:,:] @ Y1[i,:,:].T for i in range(N)]
18.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
einsum only modestly faster:
In [64]: timeit np.einsum('ijk,imk->ijm', Y1,Y1)
14.5 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but you could apply the @ in full 'batch' mode with:
In [65]: timeit Y[:,:,None] @ Y[:,None,:]
7.63 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But as Divakar notes, the sum axis is size 1, so you could use plain broadcasted multiply. This is an outer product, not a matrix one.
In [66]: timeit Y[:,:,None]*Y[:,None,:]
8.2 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
'vectorizing' gives big gains when doing many iterations on a simple operation. For fewer operations on a more complex operation, the gain isn't as great.
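For completeness (my addition, not from the answer), the same per-row outer products can be written in einsum directly on the 2-D Y, without inserting the singleton axis:
out = np.einsum('ij,ik->ijk', Y, Y)  # out[i, j, k] = Y[i, j] * Y[i, k]
print(np.allclose(out, Y[:, :, None] * Y[:, None, :]))  # True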
This is an old post, but it covers the subject in detail: efficient outer product.
In particular, if you are willing to add a numba dependency, that may be your fastest option.
Updating part of the numba code from the original post and adding the multi outer product:
import numpy as np
from numba import jit
from numba.typed import List
@jit(nopython=True)
def outer_numba(a, b):
    m = a.shape[0]
    n = b.shape[0]
    result = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i]*b[j]
    return result

@jit(nopython=True)
def multi_outer_numba(Y):
    all_result = List()
    for k in range(Y.shape[0]):
        y = Y[k]
        n = y.shape[0]
        tmp_res = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                tmp_res[i, j] = y[i]*y[j]
        all_result.append(tmp_res)
    return all_result
r = [outer_numba(Y[i],Y[i]) for i in range(N)]
r = multi_outer_numba(Y)
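A quick sanity check (my addition) that the numba versions agree with the broadcasted multiply:
ref = Y[:, :, None] * Y[:, None, :]
print(np.allclose(outer_numba(Y[0], Y[0]), ref[0]))   # True
print(np.allclose(multi_outer_numba(Y)[0], ref[0]))   # True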

Creating an array of numbers that add up to 1 with given length

I'm trying to use different weights for my model, and I need those weights to add up to 1, like this:
def func(length):
    return ['a list of numbers that add up to 1 with the given length']
func(4) returns [0.1, 0.2, 0.3, 0.4]
The numbers should be linearly spaced and they should not start from 0. Is there any way to achieve this with numpy or scipy?
This can be done quite simply using numpy arrays:
def func(length):
    linArr = np.arange(1, length+1)
    return linArr/linArr.sum()
First we create an array of length length containing the integers from 1 to length. Then we normalize by the sum.
Thanks to Paul Panzer for pointing out that the efficiency of this function can be improved by using Gauss's formula for the sum of the first n integers:
def func(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum
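A quick check against the example in the question:
print(func(4))  # [0.1 0.2 0.3 0.4]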
For large inputs, you might find that using np.linspace is faster than the accepted answer
def f1(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum

def f2(l):
    delta = 2/(l*(l+1))
    return np.linspace(delta, l*delta, l)
Ensure that the two things produce the same result:
In [39]: np.allclose(f1(1000000), f2(1000000))
Out[39]: True
Check timing of both:
In [68]: %timeit f1(10000000)
515 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [69]: %timeit f2(10000000)
247 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's tempting to just use np.arange(delta, l*delta, delta), which should be even faster, but this presents the risk of rounding errors causing the array to have a length different from l (as happens e.g. for l = 10000000).
If speed is more important than code style, it might also be possible to squeeze out a bit more by using Numba:
from numba import jit

@jit
def f3(l):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for n in range(l):
        a[n] = (n+1)*delta
    return a
In [96]: %timeit f3(10000000)
216 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While we're at it, let's note that it's possible to parallelize this loop. Doing so naively with Numba doesn't appear to give much, but helping it out a bit and pre-splitting the array into num_parallel parts does give further improvement on a quad core system:
from numba import njit, prange

@njit(parallel=True)
def f4(l, num_parallel=4):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for j in prange(num_parallel):
        # The last chunk gets whatever's left over from the rounding
        offset = 0 if j != num_parallel - 1 else l % num_parallel
        for n in range(l//num_parallel + offset):
            i = j*(l//num_parallel) + n
            a[i] = (i+1)*delta
    return a
In [171]: %timeit f4(10000000, 4)
163 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [172]: %timeit f4(10000000, 8)
158 ms ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [173]: %timeit f4(10000000, 12)
157 ms ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fastest way to delete/extract a submatrix from a numpy matrix

I have a square matrix that is NxN (N is usually >500). It is constructed using a numpy array.
I need to extract a new matrix that has the i-th column and row removed from this matrix. The new matrix is (N-1)x(N-1).
I am currently using the following code to extract this matrix:
new_mat = np.delete(old_mat, idx_2_remove, 0)
new_mat = np.delete(new_mat, idx_2_remove, 1)
I have also tried to use:
row_indices = [i for i in range(0,idx_2_remove)]
row_indices += [i for i in range(idx_2_remove+1,N)]
col_indices = row_indices
rows = [i for i in row_indices for j in col_indices]
cols = [j for i in row_indices for j in col_indices]
old_mat[(rows, cols)].reshape(len(row_indices), len(col_indices))
But I found this to be slower than using np.delete() as above, and even that is still quite slow for my application.
Is there a faster way to accomplish what I want?
Edit 1:
It seems the following is even faster than the above two, but not by much:
new_mat = old_mat[row_indices,:][:,col_indices]
Here are 3 alternatives I quickly wrote:
Repeated delete:
def foo1(arr, i):
    return np.delete(np.delete(arr, i, axis=0), i, axis=1)
Maximal use of slicing (may need some edge checks):
def foo2(arr, i):
    N = arr.shape[0]
    res = np.empty((N-1, N-1), arr.dtype)
    res[:i, :i] = arr[:i, :i]
    res[:i, i:] = arr[:i, i+1:]
    res[i:, :i] = arr[i+1:, :i]
    res[i:, i:] = arr[i+1:, i+1:]
    return res
Advanced indexing:
def foo3(arr, i):
    N = arr.shape[0]
    idx = np.r_[:i, i+1:N]
    return arr[np.ix_(idx, idx)]
Test that they work:
In [874]: x = np.arange(100).reshape(10,10)
In [875]: np.allclose(foo1(x,5),foo2(x,5))
Out[875]: True
In [876]: np.allclose(foo1(x,5),foo3(x,5))
Out[876]: True
Compare timings:
In [881]: timeit foo1(arr,100).shape
4.98 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [882]: timeit foo2(arr,100).shape
526 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [883]: timeit foo3(arr,100).shape
2.21 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the slicing is fastest, even if the code is longer. It looks like np.delete works like foo3, but one dimension at a time.
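The array used in those timings isn't shown; a sketch to reproduce the comparison, assuming for example a 1000x1000 float array:
arr = np.random.rand(1000, 1000)  # assumed test size, not stated in the answer
print(np.allclose(foo1(arr, 100), foo2(arr, 100)))  # True
print(np.allclose(foo1(arr, 100), foo3(arr, 100)))  # True
# then: %timeit foo1(arr, 100), %timeit foo2(arr, 100), %timeit foo3(arr, 100)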
