Avoid Race Condition in Numba

Here is a toy njit function that takes a distance matrix, loops through each row of the matrix, and records the minimum value in each column along with the row that minimum came from. However, if I understand correctly, using prange here could cause a race condition (especially for larger input arrays), because several threads may read and write the same P[col] and I[col] at the same time:
from numba import njit, prange
import numpy as np

@njit
def some_transformation_func(D, row_i):
    """
    This function applies some transformation to the ith row (`row_i`) in the `D` matrix in place.
    However, the transformation time is random (but all less than a second), which means
    that the rows can take
    """
    # Apply some inplace transformation on the ith row of D

@njit(parallel=True)
def some_func(D):
    P = np.empty((D.shape[1]))
    I = np.empty((D.shape[1]), np.int64)
    P[:] = np.inf
    I[:] = -1
    for row in prange(D.shape[0]):
        some_transformation_func(D, row)
        for col in range(D.shape[1]):
            # Every thread reads and writes the shared P[col] and I[col] here
            if P[col] > D[row, col]:
                P[col] = D[row, col]
                I[col] = row
    return P, I

if __name__ == "__main__":
    D = np.array([[4,1,6,9,9],
                  [1,3,8,2,7],
                  [2,8,0,0,1],
                  [3,7,4,6,5]
                 ])
    P, I = some_func(D)
    print(P)
    print(I)
    # [1. 1. 0. 0. 1.]
    # [1 0 2 2 2]
How would I confirm whether or not there is a race condition (especially if D is very large with many more rows and columns)? And, more importantly, if there is a race condition, how could I avoid it?

In these situations, instead of running prange over every row of the array, the best thing to do is to manually split the rows into n_threads chunks, let each thread process its own chunk into its own output row, and finally perform a reduction over the per-thread results. So, something like this:
from numba import njit, prange, config
import numpy as np

@njit
def wrapper_func(thread_idx, start_indices, stop_indices, D, P, I):
    # Each thread only touches its own row of P and I
    for row in range(start_indices[thread_idx], stop_indices[thread_idx]):
        some_transformation_func(D, row)
        for col in range(D.shape[1]):
            if P[thread_idx, col] > D[row, col]:
                P[thread_idx, col] = D[row, col]
                I[thread_idx, col] = row

@njit
def some_transformation_func(D, row_i):
    """
    This function applies some transformation to the ith row (`row_i`) in the `D` matrix in place.
    However, the transformation time is random (but all less than a second), which means
    that the rows can take
    """
    # Apply some inplace transformation on the ith row of D

@njit(parallel=True)
def some_func(D):
    # In practice this would be config.NUMBA_NUM_THREADS. It is fixed at 2 here
    # so that it matches the hard-coded chunk boundaries below.
    n_threads = 2
    P = np.empty((n_threads, D.shape[1]))
    I = np.empty((n_threads, D.shape[1]), np.int64)
    P[:, :] = np.inf
    I[:, :] = -1
    start_indices = np.array([0, 2], np.int64)
    stop_indices = np.array([2, 4], np.int64)  # Note that these are exclusive
    for thread_idx in prange(n_threads):
        wrapper_func(thread_idx, start_indices, stop_indices, D, P, I)
    # Perform reduction from all threads and store results in P[0]
    for thread_idx in range(1, n_threads):
        for i in prange(D.shape[1]):
            if P[0, i] > P[thread_idx, i]:
                P[0, i] = P[thread_idx, i]
                I[0, i] = I[thread_idx, i]
    return P[0], I[0]

if __name__ == "__main__":
    D = np.array([[4,1,6,9,9],
                  [1,3,8,2,7],
                  [2,8,0,0,1],
                  [3,7,4,6,5]
                 ])
    P, I = some_func(D)
    print(P)
    print(I)
    # [1. 1. 0. 0. 1.]
    # [1 0 2 2 2]
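Note that this will cost you more memory (P and I each become n_threads times as large), but you benefit from the parallelization, and because each thread only writes to its own row of P and I, the threads no longer compete for the same elements. Additionally, the code becomes cleaner and much easier to maintain. What remains is to figure out the best way to chunk up the data, that is, to determine the start (inclusive) and stop (exclusive) row indices for each thread.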
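The chunking and reduction steps described above can be sketched in plain NumPy, without Numba, which also gives a reference result to compare a parallel version against. This is only an illustration under stated assumptions: the helper names chunk_bounds and chunked_column_min are hypothetical and not part of the original answer, the chunk loop runs sequentially here (it is the loop that would become a prange), and the in-place row transformation is omitted.

```python
import numpy as np

def chunk_bounds(n_rows, n_chunks):
    """Hypothetical helper: split range(n_rows) into n_chunks contiguous chunks.

    Returns (starts, stops), where stops are exclusive. Chunks may be empty
    when n_chunks > n_rows.
    """
    edges = (np.arange(n_chunks + 1, dtype=np.int64) * n_rows) // n_chunks
    return edges[:-1], edges[1:]

def chunked_column_min(D, n_chunks):
    """Per-chunk column minima followed by a reduction into row 0."""
    starts, stops = chunk_bounds(D.shape[0], n_chunks)
    P = np.full((n_chunks, D.shape[1]), np.inf)
    I = np.full((n_chunks, D.shape[1]), -1, dtype=np.int64)
    for c in range(n_chunks):  # this loop would be the prange in Numba
        for row in range(starts[c], stops[c]):
            for col in range(D.shape[1]):
                if P[c, col] > D[row, col]:
                    P[c, col] = D[row, col]
                    I[c, col] = row
    # Reduction: a strict comparison keeps the lowest row index on ties
    for c in range(1, n_chunks):
        better = P[c] < P[0]
        P[0, better] = P[c, better]
        I[0, better] = I[c, better]
    return P[0], I[0]

D = np.array([[4, 1, 6, 9, 9],
              [1, 3, 8, 2, 7],
              [2, 8, 0, 0, 1],
              [3, 7, 4, 6, 5]])
P, I = chunked_column_min(D, n_chunks=2)
```

Comparing the output with D.min(axis=0) and D.argmin(axis=0) for several chunk counts is one simple way to check that the chunked logic is correct before moving it into a parallel function.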

Related

slicing for each column of 2D numpy array

In the code, I have a 2D array (D), and for each column I want to extract some number "k" of neighboring columns (left and right) and sum them up. A naive approach would be to use a for loop, but to speed this up I am trying to slice the 2D matrix for each column to get a submatrix and sum it column-wise. Surprisingly, the naive approach is faster than the slicing option for k > 6. Any suggestions on how I can make the implementation efficient?
Naive implementation:
k = 64
index = np.arange(D.shape[1])
index_kp = index + k
index_kn = index - k
# neighbors can be less than k if sufficient neighbors not available; for ex. near beginning and the end of an array
index_kn[np.where(index_kn <0)] = np.where(index_kn <0)
index_kp[np.where(index_kp > (len(index)-1))] = np.where(index_kp > (len(index)-1))
Dsmear = np.empty_like(D) #stores the summation of neighboring k columns for each col
for i in range(len(index_kp)):
    Dsmear[:,i] = np.sum(D[:, index_kn[i]:index_kp[i]], axis=1)
Slicing implementation:
D1 = np.concatenate((np.repeat(D[:,0].reshape(-1,1),k,axis=1), D, np.repeat(D[:,-1].reshape(-1,1),k,axis=1)),axis=1) #padding the edges with k columns
idx = np.asarray([np.arange(i-k,i+k+1) for i in range(k, D.shape[1]+k)], dtype=np.int32)
D_broadcast = D1[:, idx] # 3D array; is a bottleneck
Dsmear = np.sum(D_broadcast, axis=2)
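One possible way to avoid the large 3D intermediate array is to take a prefix sum along the columns once and subtract two shifted copies of it. The sketch below assumes the same edge padding as the slicing implementation (k copies of the first and last column) and the same window of 2k+1 columns; the function name smear_cumsum is hypothetical. Because it subtracts cumulative sums, results can differ from a direct sum by floating-point rounding.

```python
import numpy as np

def smear_cumsum(D, k):
    """Sum each column with its k left and k right neighbours via prefix sums."""
    # Pad each side with k copies of the edge column, as in the slicing version
    D1 = np.pad(D, ((0, 0), (k, k)), mode="edge")
    # Prefix sums with a leading zero column: c[:, j] == D1[:, :j].sum(axis=1)
    zeros = np.zeros((D.shape[0], 1))
    c = np.concatenate((zeros, np.cumsum(D1, axis=1)), axis=1)
    # The window for output column i covers padded columns i .. i + 2k inclusive
    w = 2 * k + 1
    return c[:, w:] - c[:, :-w]

rng = np.random.default_rng(0)
D = rng.random((5, 40))
Dsmear = smear_cumsum(D, k=3)
```

This needs memory proportional to the padded array only, instead of rows × columns × (2k+1) for the broadcast version.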

numba cuda on matrix decomposition code in python [closed]

I'm trying to implement this code to use my gpu
import numpy as np
import math
import numba
from numba import vectorize , cuda

@cuda.jit(['float32(float 32)'], device=True)
def CholeskyInc(A):
    n,m = A.shape
    if n!=m:
        print('La matriz no es cuadrada')  # "The matrix is not square"
    else:
        L = np.zeros((n,n))
        for k in range(n):
            L[k,k] = math.sqrt(A[k,k])
            for i in range(k+1,n):
                if A[i,k] != 0:
                    L[i,k]=A[i,k]/A[k,k]
            for j in range(k+1,n):
                for i in range(j,n):
                    if A[i,j] != 0:
                        L[i,j] = A[i,j]-A[i,k]*A[j,k]
        return L
Actually, I don't have any experience using numba. I have tried some options, but none of them work yet. Could anyone explain how I can execute this code on the GPU rather than the CPU using numba and cuda?

I don't have the needed rep here to just comment, so here are a couple of suggestions to start:
The decorator usage:
Using @cuda.jit as a bare decorator lets you skip defining the argument types and return type of your function.
You only need @cuda.jit(device=True) if you plan on calling this function from another CUDA kernel function, where a CUDA kernel function is any function for which you haven't set device=True. See the code sample at the end of this response.
Functions compiled to CUDA kernels can't return a value, and printing from device code is limited at best.
In place of returning L, create L in the host code (the code you are running on the CPU) and pass that L as a second parameter to your function. After you run the function, the results will be stored in L.
In place of the print statement, you'll have to do something like assign an error value to L and check for that error value before you later use L.
import numpy as np
from numba import cuda
from math import sqrt

def main():
    """ Creates 2 matrices with randomized values which are then
    passed to a cuda kernel where the incomplete Cholesky factorization
    is done.
    these first several lines set up the input arrays (your A and L)
    as well as the thread and block shape that numba will need when
    launching your kernel onto the GPU.
    """
    np.set_printoptions(linewidth=240)
    rand_n = np.random.randint(8, 11)
    A = np.random.random((rand_n, rand_n)).astype(np.float32)
    L = np.zeros((A.shape[0], A.shape[0]), dtype=A.dtype)
    # tpb meaning "threads per block"
    tpb_y = 16
    tpb_x = 16
    tpb = tpb_x, tpb_y
    # bpg meaning "blocks per grid"
    bpg_y = (A.shape[0] + tpb_y - 1) // tpb_y  # ensures enough blocks to cover whole matrix height
    bpg_x = (A.shape[1] + tpb_x - 1) // tpb_x  # ensures enough blocks to cover whole matrix width
    # Alternatively, you could use the following commented lines
    # which will allocate fewer thread-blocks (less overhead) but requires that
    # we implement striding loops in our kernel/device functions to
    # cover the rest of the matrix data.
    # bpg_y = (A.shape[0]//2 + tpb_y-1)//tpb_y # cover half of matrix height, and stride the other half
    # bpg_x = (A.shape[1]//2 + tpb_x-1)//tpb_x # cover half of matrix width, and stride the other half
    bpg = bpg_x, bpg_y
    # now we launch the kernel
    my_cuda_kernel[bpg, tpb](A, L)
    if np.isnan(L[0, 0]):
        print('La matriz no es cuadrada')  # "The matrix is not square"
    print(f"A:\n{A}")
    print(f"L:\n{L}")
@cuda.jit
def my_cuda_kernel(arr_in :np.ndarray, arr_out :np.ndarray):
    # in your kernel, you can set up shared memory
    # or implement looping structures to stride through global device memory.
    n, m = arr_in.shape
    if n != m:
        # set the first element on the diagonal to a NaN value,
        # as a way to signal that no operations were done on L.
        arr_out[0, 0] = np.nan  # check by calling np.isnan(L[0,0])
        return
    # get the current thread's location within the grid
    x, y = cuda.grid(2)  # 2 because we launched the kernel specifying only x, and y dimensions
    # these x,y coordinates don't explicitly map to coordinates in the array objects
    # but we can simplify our indexing by treating them like they do.
    x_stride, y_stride = cuda.gridsize(2)
    # this loop serves 2 purposes:
    # 1. Together with the `k < n` check below, it makes sure we don't attempt
    #    to read memory locations outside of the input/output arrays.
    # 2. It also allows our threads to "stride" through the data, and do more
    #    work to justify the overhead cost of setting this thread-block up.
    #    * Note that threads will only be able to "stride"
    #      if we don't launch the kernel with full grid coverage.
    for _k in range(0, n, x_stride * y_stride):
        k = _k + y
        # threads whose index falls outside the matrix do nothing
        if k < n:
            # we are telling each thread to stride the
            # diagonal and compute the cholesky for arr[:,k] rows and arr[k,:] cols.
            CholeskyInc(arr_in, arr_out, n, k)
@cuda.jit(device=True)
def CholeskyInc(A, L, n, k):
    L[k, k] = sqrt(A[k, k])
    for i in range(k + 1, n):
        if A[i, k] != 0:
            L[i, k] = A[i, k] / L[k, k]
    for j in range(k + 1, n):
        for i in range(j, n):
            if A[i, j] != 0:
                L[i, j] = A[i, j] - A[i, k] * A[j, k]

if __name__ == '__main__':
    main()
which produces the following output:
A:
[[0.29083535 0.80408204 0.63088804 0.90458757 0.86371994 0.7966909 0.5818828 0.8885034 ]
[0.24579939 0.8107 0.9785071 0.40308124 0.96477604 0.39282414 0.18642609 0.3129212 ]
[0.18401423 0.11662608 0.3512116 0.97926706 0.4021766 0.23531164 0.81310475 0.93359345]
[0.5243785 0.0469533 0.49699584 0.507422 0.24569689 0.4899143 0.61420184 0.9332651 ]
[0.1070556 0.5214806 0.24065676 0.8860097 0.5074029 0.43745205 0.09919663 0.9222924 ]
[0.17103161 0.25640044 0.94678307 0.26446953 0.9416109 0.8391528 0.69582105 0.5433431 ]
[0.5520146 0.10083573 0.7929039 0.44067022 0.6251738 0.6831893 0.23636419 0.97260725]
[0.47044474 0.13215688 0.5002679 0.72581047 0.8298903 0.55161124 0.6673608 0.5644971 ]]
L:
[[ 0.5392915 0. 0. 0. 0. 0. 0. 0. ]
[ 0.45578206 0.75028265 0. 0. 0. 0. 0. 0. ]
[ 0.34121478 0.07139549 0.31735036 0. 0. 0. 0. 0. ]
[ 0.972347 -0.08193862 0.40050274 0.23244919 0. 0. 0. 0. ]
[ 0.19851157 0.49516642 0.22095701 0.829872 0.495942 0. 0. 0. ]
[ 0.3171413 0.21436097 0.9153108 0.17478424 0.923301 0.809901 0. 0. ]
[ 1.0235922 -0.03484913 0.69132537 0.15120566 0.56607753 0.5887773 -0.06835592 0. ]
[ 0.8723385 0.01652185 0.4136994 0.47911936 0.7795266 0.47115034 0.4076684 0.34317887]]

Most efficient way to reduce-sum a numpy array (with autograd)

I have two arrays:
index = [2,1,0,0,1,1,1,2]
values = [1,2,3,4,5,4,3,2]
I would like to produce:
[sum(v for i,v in zip(index, values) if i == ui) for ui in sorted(set(index))]
in the most efficient way possible.
my values are computed via autograd
doing a groupby in pandas is really not efficient because of the point above
I have to do it hundreds of times on the same index but with different values
len(values) ~ 10**7
len(set(index)) ~ 10**6
Counter(index).most_common(1)[0][1] ~ 1000
I think a pure numpy solution would be the best.
I tried to precompute the reduced version of index, and then do:
[values[l].sum() for l in reduced_index]
but it is not efficient enough.
Here is a minimal code sample:
import numpy as np
import autograd.numpy as anp
from autograd import grad
import pandas as pd

EASY = True
if EASY:
    index = np.random.randint(10, size=10**3)
    values = anp.random.rand(10**3) * 2 - 1
else:
    index = np.random.randint(1000, size=10**7)
    values = anp.random.rand(10**7) * 2 - 1

# doesn't work
def f1(values):
    return anp.exp(anp.bincount(index, weights=values)).sum()

index_unique = sorted(set(index))
index_map = {j: i for i, j in enumerate(index_unique)}
index_mapped = [index_map[i] for i in index]
index_lists = [[] for _ in range(len(index_unique))]
for i, j in enumerate(index_mapped):
    index_lists[j].append(i)

def f2(values):
    s = anp.array([values[l].sum() for l in index_lists])
    return anp.exp(s).sum()

ans = grad(f2)(values)

If your indices are non-negative integers, you can use np.bincount with values as weights:
np.bincount(index, weights=values)
# array([ 7., 14., 3.])
This gives the sum at each position from 0 to max(index).
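As a quick check in plain NumPy, the bincount result can be compared against the list comprehension from the question. The sketch below also shows one way to handle index values that are not contiguous, by first mapping them to 0..n-1 with np.unique; that mapping step is an addition for illustration, and none of this addresses differentiating through the operation with autograd.

```python
import numpy as np

index = np.array([2, 1, 0, 0, 1, 1, 1, 2])
values = np.array([1, 2, 3, 4, 5, 4, 3, 2], dtype=float)

# Grouped sums: position j holds the sum of values where index == j
sums = np.bincount(index, weights=values)

# Reference result, written as in the question
reference = [sum(v for i, v in zip(index, values) if i == ui)
             for ui in sorted(set(index))]

# For sparse or non-contiguous labels, map them to 0..n-1 first.
# The mapping only depends on the labels, so it can be computed once and reused.
labels = np.array([20, 10, 5, 5, 10, 10, 10, 20])
uniq, inverse = np.unique(labels, return_inverse=True)
sums_mapped = np.bincount(inverse, weights=values)
```

Since the question reuses the same index hundreds of times, precomputing inverse once and calling np.bincount for each new values array keeps the per-call work small.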

Element-wise maximum of two sparse matrices

Is there an easy/built-in way to get the element-wise maximum of two (or ideally more) sparse matrices? I.e. a sparse equivalent of np.maximum.

This did the trick:
def maximum(A, B):
    BisBigger = A-B
    # 1 where B is bigger than A, 0 elsewhere
    BisBigger.data = np.where(BisBigger.data < 0, 1, 0)
    return A - A.multiply(BisBigger) + B.multiply(BisBigger)

No, there's no built-in way to do this in scipy.sparse. The easy solution is
np.maximum(X.toarray(), Y.toarray())
but this is obviously going to be very memory-intensive when the matrices have large dimensions, and it might crash your machine. A memory-efficient (but by no means fast) solution is
from scipy.sparse import coo_matrix
# convert to COO, if necessary
X = X.tocoo()
Y = Y.tocoo()
Xdict = dict(((i, j), v) for i, j, v in zip(X.row, X.col, X.data))
Ydict = dict(((i, j), v) for i, j, v in zip(Y.row, Y.col, Y.data))
keys = list(set(Xdict).union(Ydict))
XmaxY = [max(Xdict.get((i, j), 0), Ydict.get((i, j), 0)) for i, j in keys]
rows, cols = zip(*keys)
XmaxY = coo_matrix((XmaxY, (rows, cols)), shape=X.shape)
Note that this uses pure Python instead of vectorized idioms. You can try shaving some of the running time off by vectorizing parts of it.

Here's another memory-efficient solution that should be a bit quicker than larsmans'. It's based on finding the set of unique indices for the nonzero elements in the two arrays using code from Jaime's excellent answer here.
import numpy as np
from scipy import sparse

def sparsemax(X, Y):
    # note: X is modified in place and returned
    # the indices of all non-zero elements in both arrays
    idx = np.hstack((X.nonzero(), Y.nonzero()))
    # find the set of unique non-zero indices
    idx = tuple(unique_rows(idx.T).T)
    # take the element-wise max over only these indices
    X[idx] = np.maximum(X[idx].A, Y[idx].A)
    return X

def unique_rows(a):
    void_type = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    b = np.ascontiguousarray(a).view(void_type)
    idx = np.unique(b, return_index=True)[1]
    return a[idx]

Testing:
def setup(n=1000, fmt='csr'):
    return sparse.rand(n, n, format=fmt), sparse.rand(n, n, format=fmt)

X, Y = setup()
Z = sparsemax(X, Y)
print(np.all(Z.toarray() == np.maximum(X.toarray(), Y.toarray())))
# True
%%timeit X, Y = setup()
sparsemax(X, Y)
# 100 loops, best of 3: 4.92 ms per loop

The latest scipy (0.13.0) defines element-wise booleans for sparse matrices. So:
BisBigger = B>A
A - A.multiply(BisBigger) + B.multiply(BisBigger)
np.maximum does not (yet) work because it uses np.where, which is still trying to get the truth value of an array.
Curiously B>A returns a boolean dtype, while B>=A is float64.

Here is a function that returns a sparse matrix that is element-wise maximum of two sparse matrices. It implements the answer by hpaulj:
import numpy as np
from scipy import sparse

def sparse_max(A, B):
    """
    Return the element-wise maximum of sparse matrices `A` and `B`.
    """
    AgtB = (A > B).astype(int)
    M = AgtB.multiply(A - B) + B
    return M

Testing:
A = sparse.csr_matrix(np.random.randint(-9,10, 25).reshape((5,5)))
B = sparse.csr_matrix(np.random.randint(-9,10, 25).reshape((5,5)))
M = sparse_max(A, B)
M2 = sparse_max(B, A)
# Test symmetry:
print((M.toarray() == M2.toarray()).all())
# Test that M is larger or equal to A and B, element-wise:
print((M.toarray() >= A.toarray()).all())
print((M.toarray() >= B.toarray()).all())
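The question also asks about more than two matrices. One way to extend the pairwise function, sketched here under the assumption that all inputs share the same shape, is to fold it over a list with functools.reduce. The name sparse_max_many is hypothetical, and the dense comparison at the end is only practical for small test matrices.

```python
from functools import reduce

import numpy as np
from scipy import sparse

def sparse_max(A, B):
    """Element-wise maximum of two sparse matrices (same approach as above)."""
    AgtB = (A > B).astype(int)
    return AgtB.multiply(A - B) + B

def sparse_max_many(matrices):
    """Fold sparse_max over a non-empty sequence of same-shaped sparse matrices."""
    return reduce(sparse_max, matrices)

rng = np.random.default_rng(0)
mats = [sparse.csr_matrix(rng.integers(-9, 10, size=(5, 5))) for _ in range(4)]
M = sparse_max_many(mats)

# Dense reference for checking small cases only
dense_max = np.maximum.reduce([m.toarray() for m in mats])
```

Each step keeps the intermediate result sparse, so memory use depends on the number of stored entries rather than on the full matrix dimensions.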

from scipy import sparse
from numpy import array
I = array([0,3,1,0])
J = array([0,3,1,2])
V = array([4,5,7,9])
A = sparse.coo_matrix((V,(I,J)),shape=(4,4))
A.data.max()
9
If you haven't already, you should try out ipython. You could have saved yourself time by making your sparse matrix A, then simply typing A. followed by Tab, which prints a list of methods that you can call on A. From this you would see that A.data gives you the non-zero entries as an array, and hence you just want the maximum of this.

Cumulative summation of a numpy array by index

Assume you have an array of values that will need to be summed together
d = [1,1,1,1,1]
and a second array specifying which elements need to be summed together
i = [0,0,1,2,2]
The result will be stored in a new array of size max(i)+1. So for example i=[0,0,0,0,0] would be equivalent to summing all the elements of d and storing the result at position 0 of a new array of size 1.
I tried to implement this using
c = zeros(max(i)+1)
c[i] += d
However, the += operation adds each element only once, thus giving the unexpected result of
[1,1,1]
instead of
[2,1,2]
How would one correctly implement this kind of summation?

If I understand the question correctly, there is a fast function for this (as long as the data array is 1d)
>>> i = np.array([0,0,1,2,2])
>>> d = np.array([0,1,2,3,4])
>>> np.bincount(i, weights=d)
array([ 1., 2., 7.])
np.bincount returns an array with one entry for every integer in range(max(i)+1), even if some counts are zero.
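For the exact `c[i] += d` pattern from the question, NumPy also provides an unbuffered in-place variant, np.add.at, which accumulates repeated indices instead of applying each index only once. A minimal sketch using the question's data:

```python
import numpy as np

i = np.array([0, 0, 1, 2, 2])
d = np.array([1, 1, 1, 1, 1])

# Buffered fancy-index assignment: each repeated index is written only once
c_buffered = np.zeros(i.max() + 1)
c_buffered[i] += d

# Unbuffered accumulation: every occurrence of an index contributes
c = np.zeros(i.max() + 1)
np.add.at(c, i, d)
```

Here c_buffered reproduces the unexpected [1, 1, 1] from the question, while c holds the intended [2, 1, 2].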

Juh_'s comment is the most efficient solution. Here's working code:
import numpy as np
import scipy.ndimage as ni
i = np.array([0,0,1,2,2])
d = np.array([0,1,2,3,4])
n_indices = i.max() + 1
print(ni.sum(d, i, np.arange(n_indices)))

This solution should be more efficient for large arrays when the number of distinct index values is small (it iterates over the possible index values instead of the individual entries of i):
import numpy as np
i = np.array([0,0,1,2,2])
d = np.array([0,1,2,3,4])
i_max = i.max()
c = np.empty(i_max+1)
for j in range(i_max+1):
    c[j] = d[i==j].sum()
print(c)
# [1. 2. 7.]

def zeros(ilen):
    r = []
    for i in range(0,ilen):
        r.append(0)
    return r

i_list = [0,0,1,2,2]
d = [1,1,1,1,1]
result = zeros(max(i_list)+1)
# pos is the position in d, index is the target slot in result
for pos, index in enumerate(i_list):
    result[index]+=d[pos]
print(result)

In the general case when you want to sum submatrices by labels you can use the following code
import numpy as np
from scipy.sparse import coo_matrix
def labeled_sum1(x, labels):
    # sparse indicator matrix: P[label, row] == 1 when row has that label
    P = coo_matrix((np.ones(x.shape[0]), (labels, np.arange(len(labels)))))
    res = P.dot(x.reshape((x.shape[0], np.prod(x.shape[1:]))))
    return res.reshape((res.shape[0],) + x.shape[1:])

def labeled_sum2(x, labels):
    res = np.empty((np.max(labels) + 1,) + x.shape[1:], x.dtype)
    # one bincount per position in the trailing dimensions
    for i in np.ndindex(x.shape[1:]):
        res[(...,)+i] = np.bincount(labels, x[(...,)+i])
    return res

The first method uses sparse matrix multiplication. The second one is a generalization of user333700's answer. Both methods have comparable speed:
x = np.random.randn(100000, 10, 10)
labels = np.random.randint(0, 1000, 100000)
%time res1 = labeled_sum1(x, labels)
%time res2 = labeled_sum2(x, labels)
np.all(res1 == res2)
Output:
Wall time: 73.2 ms
Wall time: 68.9 ms
True
