Accelerate float calculations in nested loops (with CUDA if possible) - python

Is it possible to accelerate the following nested loop in Python (possibly with CUDA or parallel processing)? The order of the elements appended to outputList[i] doesn't matter.
Currently, the code takes a long time to complete. Which part is slowing it down: the append() or the calculation f = c*sin(deg*pi/180) + (1/c)*cos(deg*pi/180)?
from math import sqrt, sin, cos, pi

N = 640*480
outputList = [[] for i in range(N)]

def foo(a, b):  # a, b are always integers
    c = sqrt(a**2 + b**2)
    for deg in range(360):
        f = c*sin(deg*pi/180) + (1/c)*cos(deg*pi/180)
        if f < 1:
            outputList[a].append(f)

if __name__ == '__main__':
    for x in range(N):
        for y in range(60):
            foo(x, y)

As talonmies mentioned, your first step should be using numpy.
import math
import numpy as np

def foo(a, b):
    c = math.hypot(a, b)
    degs = np.arange(360, dtype='f8')
    rads = np.deg2rad(degs)
    fs = c * np.sin(rads) + (1/c) * np.cos(rads)
    below1 = fs[fs < 1.]
    outputList[a].extend(below1)
There are fancier methods to do the sin + cos computation, but I've kept it simple for now. Also, depending on your use case, consider changing outputList to a 2D numpy array or a list of 1D numpy arrays.
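Depending on available memory, the whole computation can also be broadcast over b and deg at once for a block of a values. A minimal sketch of that direction (foo_block is a hypothetical helper, not from the original answer; the full (N, 60, 360) float64 intermediate would be ~53 GB, hence processing a in blocks; a = b = 0 still divides by zero, exactly as in the original):

import numpy as np

rads = np.deg2rad(np.arange(360.0))
sin_r, cos_r = np.sin(rads), np.cos(rads)

def foo_block(a_block):
    # One row per a, one column per (b, deg) pair, all computed in one shot.
    a = np.asarray(a_block, dtype='f8')[:, None]   # shape (A, 1)
    b = np.arange(60, dtype='f8')[None, :]         # shape (1, 60)
    c = np.hypot(a, b)[:, :, None]                 # shape (A, 60, 1)
    f = c * sin_r + (1.0 / c) * cos_r              # shape (A, 60, 360)
    f = f.reshape(len(a_block), -1)
    return [row[row < 1.0] for row in f]           # ragged list of 1D arrays

# outputList = foo_block(np.arange(N))  # or loop over np.array_split(np.arange(N), 100)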

Related

Can any for loop be vectorized in python(numpy)?

I have been recently getting into Python optimization by learning about broadcasting and vectorization. I would say I have got the basics down, but there are just some for loops I am incapable of vectorizing. My question is: is it possible to convert any Python for loop into a C++ for loop using numpy?
As an example:
import numpy as np

N = 10
A = np.eye(4) * 0.5
r = np.eye(4)
W = np.random.normal(0, np.sqrt(1), N)

r_t = r[None]
for i in range(N):
    z = np.trace(A @ r)
    temp = r * W[i] * z
    r_t = np.concatenate((r_t, temp[None]))
    r = temp
The code is an example and is not supposed to do anything in particular (aside from returning r_t as an (N+1, 4, 4) array of the stacked r matrices).
W is an array of N values randomly picked from a normal distribution; A is the identity matrix scaled by 0.5, and r is initially the identity matrix.
The problem I am finding is that I want r to be updated at the end of every loop, so that the value of z is different at every loop as well. Is there any way one could vectorize a loop of this sort?
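One possible direction: because each iteration reads the r produced by the previous one, the loop carries a true sequential dependence and cannot be collapsed into a single broadcast NumPy expression; a common workaround is to keep the loop but JIT-compile it. A minimal sketch, assuming Numba (and SciPy, which Numba uses for the matmul) is installed; propagate is a hypothetical helper name:

import numba as nb
import numpy as np

@nb.njit
def propagate(A, r0, W):
    # The recurrence is sequential: each r depends on the previous one through z,
    # so the iterations cannot be reordered or broadcast.
    N = W.shape[0]
    r_t = np.empty((N + 1, r0.shape[0], r0.shape[1]))
    r_t[0] = r0
    r = r0.copy()
    for i in range(N):
        z = np.trace(A @ r)   # recomputed from the current r at every step
        r = r * W[i] * z      # elementwise update, as in the question
        r_t[i + 1] = r
    return r_t

# r_t = propagate(A, np.eye(4), W)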

Python code to calculate base Lagrange Polynomial is taking too long for large numbers of data

So, I need help minimizing the time it takes to run the code with large amounts of data, using only NumPy. I think the for loops made my code inefficient, but I do not know how to turn them into a list comprehension, which might help it run faster.
def lagrange(p, node, n, x):
    m = []
    # base lagrange polynomial
    for i in range(n):
        for j in range(p+1):
            L = 1
            for k in range(p+1):
                if k != j:
                    L = L*(x[i] - node[k])/(node[j] - node[k])
            m.append(L)
    lagrange = np.array(m).reshape(n, p+1)
    return lagrange

def interpolant(a, b, p, n, x, f):
    m = []
    node = np.linspace(a, b, p+1)
    for j in range(n):
        polynomial = 0
        for i in range(p+1):
            polynomial += f(node[i]) * lagrange(p, node, n, x)
        m.append(polynomial)
    interpolant = np.array(m)
    return interpolant
It appears the value of lagrange_poly(...) is recomputed n*(p+1) times for no reason, which is very expensive! You can compute it once before the loop, store it in a variable, and reuse the variable later.
Here is the fixed code:
def uniform_poly_interpolation(a, b, p, n, x, f, produce_fig):
    inter = []
    xhat = np.linspace(a, b, p+1)
    # Compute the Lagrange matrix once, outside the loops, and reuse it.
    mat = lagrange_poly(p, xhat, n, x, 1e-10)[0]
    for j in range(n):
        po = 0
        for i in range(p+1):
            po += f(xhat[i]) * mat[j, i]
        inter.append(po)
    interpolant = np.array(inter)
    return interpolant
This should be much, much faster.
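Going one step further: assuming mat is the (n, p+1) matrix returned by lagrange_poly, and that f accepts NumPy arrays, the remaining double loop is itself just a matrix-vector product. A sketch:

mat = lagrange_poly(p, xhat, n, x, 1e-10)[0]   # shape (n, p+1)
inter = mat @ f(xhat)                          # inter[j] = sum_i mat[j, i] * f(xhat[i])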
Moreover, the execution is slow because accessing scalar values of NumPy arrays from CPython is very slow: NumPy is designed to work with whole arrays, not to extract scalar values in loops, and loops in the CPython interpreter are relatively slow. You can solve this problem efficiently with Numba, which compiles your code to fast native code using a JIT compiler.
Here is the Numba code:
import numba as nb

@nb.njit
def lagrange_poly(p, xhat, n, x, tol):
    error_flag = 0
    er = 1
    lagrange_matrix = np.empty((n, p+1), dtype=np.float64)
    # Flag nearly coincident interpolation nodes
    for l in range(p):
        if abs(xhat[l] - xhat[l+1]) < tol:
            error_flag = er
    # Base lagrange polynomial
    for i in range(n):
        for j in range(p+1):
            L = 1.0
            for k in range(p+1):
                if k != j:
                    L = L * (x[i] - xhat[k]) / (xhat[j] - xhat[k])
            lagrange_matrix[i, j] = L
    return lagrange_matrix, error_flag
Overall, this should be several orders of magnitude faster.
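A quick sanity-check sketch, assuming the pure-Python lagrange() from the question is still in scope (the values of p and n here are arbitrary):

p, n = 5, 200
xhat = np.linspace(0.0, 1.0, p + 1)   # interpolation nodes
x = np.linspace(0.0, 1.0, n)          # evaluation points
mat, err = lagrange_poly(p, xhat, n, x, 1e-10)
assert err == 0
assert np.allclose(mat, lagrange(p, xhat, n, x))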

How can I improve performance in my forward substitution method for lower triangle matrices?

I tried implementing the forward substitution method, a solving process for the problem Lx = b, where L is a lower triangular matrix and x, b are vectors.
This was an easy task:
def tri_solve(L, b):
    n = len(b)
    x = np.zeros(n)
    x[0] = b[0]/L[0,0]
    for i in range(1, n):
        comp = 0
        for k in range(0, i):
            index = L[i,k]
            preSolution = x[k]
            comp = comp + index * preSolution
        x[i] = 1/L[i,i] * (b[i] - comp)
    return x
Now I compared my calculation times for different sized matrices several times with linalg.solve from the scipy module, and it turns out that it is much faster. This makes sense in some respects, since SciPy is written in C and C++, but I still expected similar or better calculation times for matrices up to 10x10. Beginning with 6x6 matrices, linalg.solve becomes slightly faster on average.
Is there a way to improve my rather simple solution?
You could try solve_triangular
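A minimal usage sketch of that suggestion (scipy.linalg.solve_triangular is LAPACK-backed; lower=True tells it that L is lower triangular):

from scipy.linalg import solve_triangular
x = solve_triangular(L, b, lower=True)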
If you want to accelerate your code, what you could do is to vectorize the inner loop.
def tri_solve(L, b):
    n = len(b)
    x = np.zeros(n)
    x[0] = b[0]/L[0,0]
    for i in range(1, n):
        comp = np.sum(L[i,:i] * x[:i])
        x[i] = 1/L[i,i] * (b[i] - comp)
    return x
Edit: How to use it
You have to pass a square lower triangular matrix as the first argument and a 1D array as the second:
N = 20
A = np.tril(np.random.randn(N, N))
b = np.random.randn(N)
assert np.allclose(np.linalg.solve(A, b), tri_solve(A, b))
Of course, this is a naive implementation and is not stable; you can't use it to solve very large or ill-conditioned systems.

Vectorizing three nested loops with Numpy

I have a complex matrix C with dimensions (r, r) as well as a complex vector v of size r. I need to compute a new matrix from C and v following this equation:
K_{m,n} = Im( \sum_{i=1}^{r} C_{i,m} * C_{i,n}^* * sgn(Im(v_i)) )
where K is also a square matrix of dimensions (r, r). Here is the code to compute K with three loops:
import numpy as np
import matplotlib.pyplot as plt

r = 9
# Create random matrix
C = np.random.rand(r,r) + np.random.rand(r,r) * 1j
v = np.random.rand(r) + np.random.rand(r) * 1j

# Original loops
K = np.zeros((r, r))
for m in range(r):
    for n in range(r):
        for i in range(r):
            K[m,n] += np.imag( C[i,m] * np.conj(C[i,n]) * np.sign(np.imag(v[i])) )

plt.figure()
plt.imshow(K)
plt.show()
Removing the loop with i is relatively easy:
# First optimization
K = np.zeros((r, r))
for m in range(r):
    for n in range(r):
        K[m,n] = np.imag(np.sum(C[:,m] * np.conj(C[:,n]) * np.sign(np.imag(v))))
but I am not sure how to proceed to vectorize the two remaining loops. Is it actually possible in this case?
I have had a lot of these problems, and here is how I usually proceed to find vectorized solutions.
Here is what I noticed about your summation. The nice conclusion is that you probably do not need vectorization at all, as you can express your whole calculation as a single product of 2D matrices. Here goes...
Let's first define the following matrices (sorry for the lack of LaTeX notation, Stack Overflow does not support MathJax):
A_{i,j} = C_{i,j}
B_{i,j} = C_{i,j} * sgn(Im(v_i))
Then you can write your summation as:
k_{m,n} = Im( \sum_{i=1}^{r} C_{i,m} * sgn(Im(v_i)) * C_{i,n}^* ) = Im( \sum_{i=1}^{r} B_{i,m} * A_{i,n}^* ) = Im( \sum_{i=1}^{r} (B^T)_{m,i} * (A^*)_{i,n} )
The expression inside Im(.) is, by the definition of matrix multiplication, equivalent to:
k_{m,n} = Im( (B^T A^*)_{m,n} )
This means your matrix k can be expressed as the imaginary part of the product of the transpose of B with the conjugate of A. In your code, matrix A is already assigned to the variable C, so the vectorization can be done as follows:
C = np.random.rand(r,r) + np.random.rand(r,r) * 1j
v = np.random.rand(r) + np.random.rand(r) * 1j
k = np.imag((C * np.sign(np.imag(v))[:, None]).T @ np.conj(C))
And you have avoided both nasty loops and convoluted expressions.
This looks like matrix multiplication:
out = np.imag((C*np.sign(np.imag(v))[:,None]).T @ np.conj(C))
Or you can use np.einsum:
out = np.imag(np.einsum('im,in,i', C, np.conj(C), np.sign(np.imag(v))))
Verification with your approach:
np.all(np.abs(out-K) < 1e-6)
# True
I found something that works for now. However, one loop remains, and since the resulting matrix is symmetric, there is still some optimization to be made.
Instead of removing the i loop, I removed the two other ones:
K = np.zeros((r, r), dtype=np.complex128)
for i in range(r):
    row = C[i:i+1, :]   # i-th row of C as a (1, r) matrix
    # conj() flips adjoint(row) @ row into the C[i,m] * conj(C[i,n]) order used above
    K += np.sign(np.imag(v[i])) * np.conj(adjointMatrix(row) @ row)
K = np.imag(K)
with:
def adjointMatrix(X):
    return np.conjugate(np.transpose(X))
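As an aside, NumPy arrays expose the same operation directly: X.conj().T is equivalent to np.conjugate(np.transpose(X)).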

Numba Matrix Vector multiplication

I'm trying to use numbapro to write a simple matrix vector multiplication below:
from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time
n = 100
@cuda.jit('void(float32[:,:], float32[:], float32[:])')
def cu_matrix_vector(A, b, c):
    y, x = cuda.grid(2)
    if y < n:
        c[y] = 0.0
    if x < n and y < n:
        for i in range(n):
            c[y] += A[y, i] * b[i]
A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, 1)), dtype=np.float32)
C = np.empty_like(B)
s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
cu_matrix_vector(dA, dB, dC)
dC.to_host()
e = time()
tcuda = e - s
but I'm getting following error:
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED Failed to copy memory D->H
I don't understand why the device to host copy is failing. Please help
Your code has multiple problems:
1. The B and C vectors are Nx1 2D matrices, not 1D vectors, but the type signature of your kernel lists them as "float32[:]" -- 1D vectors. The kernel also indexes them with a single index, which results in runtime errors on the GPU due to misaligned access (cuda-memcheck is your friend here!).
2. Your kernel assumes a 2D grid but only uses one column of it, meaning many threads do the same computation and overwrite each other.
3. There is no execution configuration given, so NumbaPro launches the kernel with 1 block of 1 thread (nvprof is your friend here!).
Here is a version that works. Note that it uses a 1D grid of 1D blocks and loops over the columns of the matrix; it is therefore optimized for the case where the number of rows in the vector/matrix is large. A kernel optimized for a short, wide matrix would need another approach (parallel reductions). But I would use CUBLAS sgemv (which is also exposed in NumbaPro) instead.
from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time
m = 100000
n = 100
@cuda.jit('void(f4[:,:], f4[:], f4[:])')
def cu_matrix_vector(A, b, c):
    row = cuda.grid(1)
    if row < m:
        sum = 0.0
        for i in range(n):
            sum += A[row, i] * b[i]
        c[row] = sum
A = np.array(np.random.random((m, n)), dtype=np.float32)
B = np.array(np.random.random(n), dtype=np.float32)  # length n, matching A's columns
C = np.empty(m, dtype=np.float32)                    # one result per row of A
s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
cu_matrix_vector[(m + 511) // 512, 512](dA, dB, dC)
dC.to_host()
print(C)
e = time()
tcuda = e - s
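A quick host-side sanity check against NumPy, as a sketch (tolerances are loose since everything here is float32):

assert np.allclose(C, A @ B, rtol=1e-4, atol=1e-4)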
