numba cuda on matrix decomposition code in python [closed]

numba cuda on matrix decomposition code in python [closed] - python

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 1 year ago.
Improve this question
I'm trying to implement this code to use my gpu
import numpy as np
import math
import numba
from numba import vectorize , cuda
#cuda.jit(['float32(float 32)'], device=True)
def CholeskyInc(A):
n,m = A.shape
if n!=m:
print('La matriz no es cuadrada')
else:
L = np.zeros((n,n))
for k in range(n):
L[k,k] = math.sqrt(A[k,k])
for i in range(k+1,n):
if A[i,k] != 0:
L[i,k]=A[i,k]/A[k,k]
for j in range(k+1,n):
for i in range(j,n):
if A[i,j] != 0:
L[i,j] = A[i,j]-A[i,k]*A[j,k]
return L
Actually, I don't have any experience using numba and I prove some options but any of them work yet. Does anyone could explainme how I can execute this code in the GPU rather than the CPU using numba and cuda.

I don't have the needed rep here to just comment, so here's a couple suggestions to start:
The decorator usage:
Usin #cuda.jit as a decorator lets you skip defining the arg types and return type of your function.
You only need to call #cuda.jit(device=True) if you plan on calling this function from another cuda kernel function. Where a cuda kernel function is any function where you haven't set device=True. See the code sample at the end of this response.
Functions compiled to cuda kernels can't have an explicit return, nor can they print to console.
In place of returning L, create L in the host code (the code you are running on cpu) and pass that L as a second parameter to your function. After you run the function the results will be stored in L.
In place of the print statement you'll have to do something like assign an error value to L and check for that error value before you later use L
import numpy as np
from numba import cuda
from math import sqrt
def main():
""" Creates 2 matrices with randomized values which are then
passed to a cuda kernel where the incomplete Cholesky factorization
is done.
these first several lines set up the input arrays (your A and L)
as well as the thread and block shape that numba will need when
launching your kernel onto the GPU.
"""
np.set_printoptions(linewidth=240)
rand_n = np.random.randint(8 ,11)
A = np.random.random((rand_n ,rand_n)).astype(np.float32)
L = np.zeros((A.shape[0] ,A.shape[0]) ,dtype=A.dtype)
# tpb meaning "threads per block"
tpb_y = 16
tpb_x = 16
tpb = tpb_x ,tpb_y
# bpg meaning "blocks per grid"
bpg_y = (A.shape[0] + tpb_y -1 )//tpb_y # ensures enough blocks to cover whole matrix height
bpg_x = (A.shape[1] + tpb_x -1 )//tpb_x # ensures enough blocks to cover whole matrix width
# Alternatively, you could use the following commented lines
# which will allocate fewer thread-blocks (less overhead) but requires that
# we implement striding loops in our kernel/device functions to
# cover the rest of the matrix data.
# bpg_y = (A.shape[0]//2 + tpb_y-1)//tpb_y # cover half of matrix height, and stride the other half
# bpg_x = (A.shape[1]//2 + tpb_x-1)//tpb_x # cover half of matrix width, and stride the other half
bpg = bpg_x ,bpg_y
# now we launch the kernel
my_cuda_kernel[bpg ,tpb](A ,L)
if np.isnan(L[0 ,0]):
print('La matriz no es cuadrada')
print(f"A:\n{A}")
print(f"L:\n{L}")
#cuda.jit
def my_cuda_kernel(arr_in :np.ndarray, arr_out :np.ndarray):
# in your kernel, you can set up shared memory
# or implement looping structures to stride through global device memory.
n ,m = arr_in.shape
if n!= m:
# set the first element on the diagonal to a NaN value,
# as a way to signal that no operations were done on L.
arr_out[0, 0] = np.nan # check by calling np.isnan(L[0,0])
return
# get the current thread's location within the grid
x, y = cuda.grid(2) # 2 because we launched the kernel specifying only x, and y dimensions
# these x,y coordinates don't explicitly map to coordinates in the array objects
# but we can simplify our indexing by treating them like they do.
x_stride, y_stride = cuda.gridsize(2)
# this loop provides 2 purposes:
# 1. It acts as boundary check to make sure we don't attempt to read
# memory locations outside of the input/output arrays.
# 2. It also allows our threads to "stride" through the data, and do more
# work to justify the overhead cost setting this thread-block up.
# * Note that threads will only be able to "stride"
# if we don't launch the kernel with full grid coverage.
for _k in range(0, n, x_stride * y_stride):
k = _k + y
# we are telling each thread to stride the
# diagonal and compute the cholesky for arr[:,k] rows and arr[k,:] cols.
CholeskyInc(arr_in, arr_out, n, k)
#cuda.jit(device=True)
def CholeskyInc(A, L, n, k):
L[k, k] = sqrt(A[k, k])
for i in range(k + 1, n):
if A[i, k] != 0:
L[i, k] = A[i, k] / L[k, k]
for j in range(k + 1, n):
for i in range(j, n):
if A[i, j] != 0:
L[i, j] = A[i, j] - A[i, k] * A[j, k]
if __name__ == '__main__':
main()
which produces the following output:
A:
[[0.29083535 0.80408204 0.63088804 0.90458757 0.86371994 0.7966909 0.5818828 0.8885034 ]
[0.24579939 0.8107 0.9785071 0.40308124 0.96477604 0.39282414 0.18642609 0.3129212 ]
[0.18401423 0.11662608 0.3512116 0.97926706 0.4021766 0.23531164 0.81310475 0.93359345]
[0.5243785 0.0469533 0.49699584 0.507422 0.24569689 0.4899143 0.61420184 0.9332651 ]
[0.1070556 0.5214806 0.24065676 0.8860097 0.5074029 0.43745205 0.09919663 0.9222924 ]
[0.17103161 0.25640044 0.94678307 0.26446953 0.9416109 0.8391528 0.69582105 0.5433431 ]
[0.5520146 0.10083573 0.7929039 0.44067022 0.6251738 0.6831893 0.23636419 0.97260725]
[0.47044474 0.13215688 0.5002679 0.72581047 0.8298903 0.55161124 0.6673608 0.5644971 ]]
L:
[[ 0.5392915 0. 0. 0. 0. 0. 0. 0. ]
[ 0.45578206 0.75028265 0. 0. 0. 0. 0. 0. ]
[ 0.34121478 0.07139549 0.31735036 0. 0. 0. 0. 0. ]
[ 0.972347 -0.08193862 0.40050274 0.23244919 0. 0. 0. 0. ]
[ 0.19851157 0.49516642 0.22095701 0.829872 0.495942 0. 0. 0. ]
[ 0.3171413 0.21436097 0.9153108 0.17478424 0.923301 0.809901 0. 0. ]
[ 1.0235922 -0.03484913 0.69132537 0.15120566 0.56607753 0.5887773 -0.06835592 0. ]
[ 0.8723385 0.01652185 0.4136994 0.47911936 0.7795266 0.47115034 0.4076684 0.34317887]]

Related

How to use use cuda to optimize my matrix multiplication code [duplicate]

Lately I've been trying to get into programming for GPUs in Python using the Numba library. I have been reading up on it on their website using the tutorial there and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the example for the fast matrix multiplication a bit (which is of the form A*B=C). When testing I noticed that matrices with dimensions that are not perfectly divisible by the number of threads per block (TPB) do not yield a correct answer.
I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4 by 4 matrices. If I choose TPB=2 everything is fine, but when I set TPB=3 it goes wrong. I understand that the code goes out of the bounds of the matrices, but I am unable to prevent this from happening (I tried some if statements on ty + i * TPB and tx + i * TPB, but these did not work.
from numba import cuda, float32
import numpy as np
import math
#cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
if x >= C.shape[0] and y >= C.shape[1]:
# Quit if (x, y) is outside of valid C boundary
return
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = 0.
for i in range(bpg):
# Preload data into shared memory
sA[tx, ty] = A[x, ty + i * TPB]
sB[tx, ty] = B[tx + i * TPB, y]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[tx, j] * sB[j, ty]
# Wait until all threads finish computing
cuda.syncthreads()
C[x, y] = tmp
#%%
x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
I would like to write some code that is not dependent on the matrices A, B, and C having dimensions that are perfectly divisible by the TPB, as these are sometimes out of my control. I understand that GPUs are only faster with matrix multiplication for very large matrices, but I wanted to use small examples to be able to check whether the answer is correct before applying it to actual data.

There are arguably at least two errors in that posted code:
This can't possibly be a correct range check:
if x >= C.shape[0] and y >= C.shape[1]:
In order for us to decide that a particular thread in the grid not do any loading activity, we require either that x is out of range or that y is out of range. The and should have been an or.
It is illegal to use cuda.syncthreads() in conditional code, if all the threads in the block cannot participate in that statement. The previous return statement in item 1 above (even if corrected from and to or) pretty much guarantees this illegal behavior for problem sizes not whole-number-divisible by the threadblock size.
Therefore, to fix these issues, we cannot use just a simple return statement for an out-of-bounds thread. Instead, at the point of load, we must only allow threads to load from global to shared memory, if the computed global load indices (for A or B) are in-bounds (the shared indices are in-bounds, by definition). Furthermore, when writing a result, we must only write computed results that are in-bounds for C.
The following code has those items fixed. It seems to work correctly for your given test case:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math
#cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = float32(0.)
for i in range(bpg):
# Preload data into shared memory
sA[tx, ty] = 0
sB[tx, ty] = 0
if x < A.shape[0] and (ty+i*TPB) < A.shape[1]:
sA[tx, ty] = A[x, ty + i * TPB]
if y < B.shape[1] and (tx+i*TPB) < B.shape[0]:
sB[tx, ty] = B[tx + i * TPB, y]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[tx, j] * sB[j, ty]
# Wait until all threads finish computing
cuda.syncthreads()
if x < C.shape[0] and y < C.shape[1]:
C[x, y] = tmp
#%%
x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h#y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$
(Note the use of and here in the bounds tests is correct. Testing whether a set of indices are in-bound is different in a boolean sense compared to testing whether a set of indices is out-of-bounds. In the in-bounds test, we require both to be in-bounds. In the out-of-bounds test, either index out-of-bounds is disqualifying).
I'm not suggesting the above code is defect-free or suitable for any particular purpose. It is offered to demonstrate possible fixes for the issues I identified. Getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, as you have discovered, and I've not tested it beyond what is shown here. (For example, if you decided to make TPB larger than 32, you would run into other problems. Also, the original posted code is advertised only for square matrix multiplication, and this will not work in the general non-square case.)
As noted above, the posted code and the above code with "fixes" will not correctly handle the general non-square case. I believe some straightforward modifications will allow us to handle the non-square case. In a nutshell, we must size the grid large enough to handle the dimensions of both input matrices, while still only writing results for the in-bounds values of the output matrix. Here is a lightly tested example:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math
#cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = float32(0.)
for i in range(bpg):
# Preload data into shared memory
sA[ty, tx] = 0
sB[ty, tx] = 0
if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
sA[ty, tx] = A[y, tx + i * TPB]
if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
sB[ty, tx] = B[ty + i * TPB, x]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[ty, j] * sB[j, tx]
# Wait until all threads finish computing
cuda.syncthreads()
if y < C.shape[0] and x < C.shape[1]:
C[y, x] = tmp
#%%
x_h = np.arange(115).reshape([5,23])
y_h = np.ones([23,7])
z_h = np.zeros([5,7])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
#TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0],y_h.shape[0])
grid_x_max = max(x_h.shape[1],y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h#y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$
I've also reordered the sense of x and y (and usage of tx and ty) to fix a performance issue in the above code. The same performance issue was present in the original posted doc code as well.
Again, no claims of defect free. Furthermore I'm sure "more optimal" code could be arrived at. However optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using cupy here for the GPU approach should be a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.
EDIT: As discussed here OP's code (and, it seems, the doc example) also had a performance issue around the setup of the tmp variable. Changing that to a proper 32-bit float variable makes an important performance difference.

Avoid Race Condition in Numba

Here is a toy njit function that takes in a distance matrix, loops through each row of the matrix, and records the minimum value in each column and also which row that minimum value came from. However, IIUC, with the use of prange, this could possibly cause a race condition (especially for larger input arrays):
from numba import njit, prange
import numpy as np
#njit
def some_transformation_func(D, row_i):
"""
This function applies some transformation to the ith row (`row_i`) in the `D` matrix in place.
However, the transformation time is random (but all less than a second), which means
that the rows can take
"""
# Apply some inplace transformation on the ith row of D
#njit(parallel=True)
def some_func(D):
P = np.empty((D.shape[1]))
I = np.empty((D.shape[1]), np.int64)
P[:] = np.inf
I[:] = -1
for row in prange(D.shape[0]):
some_transformation_func(D, row)
for col in range(D.shape[1]):
if P[col] > D[row, col]:
P[col] = D[row, col]
I[col] = row
return P, I
if __name__ == "__main__":
D = np.array([[4,1,6,9,9],
[1,3,8,2,7],
[2,8,0,0,1],
[3,7,4,6,5]
])
P, I = some_func(D)
print(P)
print(I)
# [1. 1. 0. 0. 1.]
# [1 0 2 2 2]
How would I confirm whether or not there is a race condition (especially if D is very large with many more rows and columns)? And, more importantly, if there is a race condition, how could I avoid it?

In these situations, instead of setting prange to the size of the array, the best thing to do is to manually chunk up the data into n_threads number of chunks, then to distribute the processing accordingly, and finally perform a reduction. So, something like this:
from numba import njit, prange, config
import numpy as np
#njit
def wrapper_func(thread_idx, start_indices, stop_indices, D, P, I):
for row in range(start_indices[thread_idx], stop_indices[thread_idx]):
some_transformation_func(D, row)
for col in range(D.shape[1]):
if P[thread_idx, col] > D[row, col]:
P[thread_idx, col] = D[row, col]
I[thread_idx, col] = row
#njit
def some_transformation_func(D, row_i):
"""
This function applies some transformation to the ith row (`row_i`) in the `D` matrix in place.
However, the transformation time is random (but all less than a second), which means
that the rows can take
"""
# Apply some inplace transformation on the ith row of D
#njit(parallel=True)
def some_func(D):
n_threads = config.NUMBA_NUM_THREADS # Let's assume that there are 2 threads
P = np.empty((n_threads, D.shape[1]))
I = np.empty((n_threads, D.shape[1]), np.int64)
P[:, :] = np.inf
I[:, :] = -1
start_indices = np.array([0, 2], np.int64)
stop_indices = np.array([2, 4], np.int64) # Note that these are exclusive
for thread_idx in prange(n_threads):
wrapper_func(thread_idx, start_indices, stop_indices, D, P, I)
# Perform reduction from all threads and store results in P[0]
for thread_idx in range(1, n_threads):
for i in prange(l):
if P[0, i] > P[thread_idx, i]:
P[0, i] = P[thread_idx, i]
I[0, i] = I[thread_idx, i]
return P[0], I[0]
if __name__ == "__main__":
D = np.array([[4,1,6,9,9],
[1,3,8,2,7],
[2,8,0,0,1],
[3,7,4,6,5]
])
P, I = some_func(D)
print(P)
print(I)
# [1. 1. 0. 0. 1.]
# [1 0 2 2 2]
Note that this will cost you more memory (exactly n_threads more memory) but you benefit from the parallelization. Additionally, the code becomes cleaner and much easier to maintain. What one needs to do is figure the best way to chunk up the data and determine the start_row and stop_row (exclusive) indices.

How to real-time filter with scipy and lfilter?

Disclaimer: I am probably not as good at DSP as I should be and therefore have more issues than I should have getting this code to work.
I need to filter incoming signals as they happen. I tried to make this code to work, but I have not been able to so far.
Referencing scipy.signal.lfilter doc
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt
from lib import fnlib
samples = 100
x = np.linspace(0, 7, samples)
y = [] # Unfiltered output
y_filt1 = [] # Real-time filtered
nyq = 0.5 * samples
f1_norm = 0.1 / nyq
f2_norm = 2 / nyq
b, a = scipy.signal.butter(2, [f1_norm, f2_norm], 'band', analog=False)
zi = scipy.signal.lfilter_zi(b,a)
zi = zi*(np.sin(0) + 0.1*np.sin(15*0))
This sets zi as zi*y[0 ] initially, which in this case is 0. I have got it from the example code in the lfilter documentation, but I am not sure if this is correct at all.
Then it comes to the point where I am not sure what to do with the few initial samples.
The coefficients a and b are len(a) = 5 here.
As lfilter takes input values from now to n-4, do I pad it with zeroes, or do I need to wait until 5 samples have gone by and take them as a single bloc, then continuously sample each next step in the same way?
for i in range(0, len(a)-1): # Append 0 as initial values, wrong?
y.append(0)
step = 0
for i in xrange(0, samples): #x:
tmp = np.sin(x[i]) + 0.1*np.sin(15*x[i])
y.append(tmp)
# What to do with the inital filterings until len(y) == len(a) ?
if (step> len(a)):
y_filt, zi = scipy.signal.lfilter(b, a, y[-len(a):], axis=-1, zi=zi)
y_filt1.append(y_filt[4])
print(len(y))
y = y[4:]
print(len(y))
y_filt2 = scipy.signal.lfilter(b, a, y) # Offline filtered
plt.plot(x, y, x, y_filt1, x, y_filt2)
plt.show()

I think I had the same problem, and found a solution on https://github.com/scipy/scipy/issues/5116:
from scipy import zeros, signal, random
def filter_sbs():
data = random.random(2000)
b = signal.firwin(150, 0.004)
z = signal.lfilter_zi(b, 1) * data[0]
result = zeros(data.size)
for i, x in enumerate(data):
result[i], z = signal.lfilter(b, 1, [x], zi=z)
return result
if __name__ == '__main__':
result = filter_sbs()
The idea is to pass the filter state z in each subsequent call to lfilter. For the first few samples the filter may give strange results, but later (depending on the filter length) it starts to behave correctly.

The problem is not how you are buffering the input. The problem is that in the 'offline' version, the state of the filter is initialized using lfilter_zi which computes the internal state of an LTI so that the output will already be in steady-state when new samples arrive at the input. In the 'real-time' version, you skip this so that the filter's initial state is 0. You can either initialize both versions to using lfilter_zi or else initialize both to 0. Then, it doesn't matter how many samples you filter at a time.
Note, if you initialize to 0, the filter will 'ring' for a certain amount of time before reaching a steady state. In the case of FIR filters, there is an analytic solution for determining this time. For many IIR filters, there is not.
This following is correct. For simplicity's sake I initialize to 0 and feed the input on sample at a time. However, any non-zero block size will produce equivalent output.
from scipy import signal, random
from numpy import zeros
def filter_sbs(data, b):
z = zeros(b.size-1)
result = zeros(data.size)
for i, x in enumerate(data):
result[i], z = signal.lfilter(b, 1, [x], zi=z)
return result
def filter(data, b):
result = signal.lfilter(b,1,data)
return result
if __name__ == '__main__':
data = random.random(20000)
b = signal.firwin(150, 0.004)
result1 = filter_sbs(data, b)
result2 = filter(data, b)
print(result1 - result2)
Output:
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -5.55111512e-17
0.00000000e+00 1.66533454e-16]

Improving a numpy implementation of a simple spring network

I wanted a very simple spring system written in numpy. The system would be defined as a simple network of knots, linked by links. I'm not interested in evaluating the system over time, but instead I want to go from an initial state, change a variable (usually move a knot to a new position) and solve the system until it reaches a stable state (last applied force is below a given threshold). The knots have no mass, there's no gravity, the forces are all derived from each link's current lengths/init lengths. And the only "special" variable is that each knot can bet set as "anchored" (doesn't move).
So I wrote this simple solver below, and included a simple example. Jump to the very end for my question.
import numpy as np
from numpy.core.umath_tests import inner1d
np.set_printoptions(precision=4)
np.set_printoptions(suppress=True)
np.set_printoptions(linewidth =150)
np.set_printoptions(threshold=10)
def solver(kPos, kAnchor, link0, link1, w0, cycles=1000, precision=0.001, dampening=0.1, debug=False):
"""
kPos : vector array - knot position
kAnchor : float array - knot's anchor state, 0 = moves freely, 1 = anchored (not moving)
link0 : int array - array of links connecting each knot. each index corresponds to a knot
link1 : int array - array of links connecting each knot. each index corresponds to a knot
w0 : float array - initial link length
cycles : int - eval stops when n cycles reached
precision : float - eval stops when highest applied force is below this value
dampening : float - keeps system stable during each iteration
"""
kPos = np.asarray(kPos)
pos = np.array(kPos) # copy of kPos
kAnchor = 1-np.clip(np.asarray(kAnchor).astype(float),0,1)[:,None]
link0 = np.asarray(link0).astype(int)
link1 = np.asarray(link1).astype(int)
w0 = np.asarray(w0).astype(float)
F = np.zeros(pos.shape)
i = 0
for i in xrange(cycles):
# Init force applied per knot
F = np.zeros(pos.shape)
# Calculate forces
AB = pos[link1] - pos[link0] # get link vectors between knots
w1 = np.sqrt(inner1d(AB,AB)) # get link lengths
AB/=w1[:,None] # normalize link vectors
f = (w1 - w0) # calculate force vectors
f = f[:,None] * AB
# Apply force vectors on each knot
np.add.at(F, link0, f)
np.subtract.at(F, link1, f)
# Update point positions
pos += F * dampening * kAnchor
# If the maximum force applied is below our precision criteria, exit
if np.amax(F) < precision:
break
# Debug info
if debug:
print 'Iterations: %s'%i
print 'Max Force: %s'%np.amax(F)
return pos
Here's some test data to show how it works. In this case i'm using a grid, but in reality this can be any type of network, like a string with many knots, or a mess of polygons...:
import cProfile
# Create a 5x5 3D knot grid
z = np.linspace(-0.5, 0.5, 5)
x = np.linspace(-0.5, 0.5, 5)[::-1]
x,z = np.meshgrid(x,z)
kPos = np.array([np.array(thing) for thing in zip(x.flatten(), z.flatten())])
kPos = np.insert(kPos, 1, 0, axis=1)
'''
array([[-0.5 , 0. , 0.5 ],
[-0.25, 0. , 0.5 ],
[ 0. , 0. , 0.5 ],
...,
[ 0. , 0. , -0.5 ],
[ 0.25, 0. , -0.5 ],
[ 0.5 , 0. , -0.5 ]])
'''
# Define the links connecting each knots
link0 = [0,1,2,3,5,6,7,8,10,11,12,13,15,16,17,18,20,21,22,23,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
link1 = [1,2,3,4,6,7,8,9,11,12,13,14,16,17,18,19,21,22,23,24,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]
AB = kPos[link0]-kPos[link1]
w0 = np.sqrt(inner1d(AB,AB)) # this is a square grid, each link's initial length will be 0.25
# Set the anchor states
kAnchor = np.zeros(len(kPos)) # All knots will be free floating
kAnchor[12] = 1 # Middle knot will be anchored
This is what the grid looks like:
If we run my code using this data, nothing will happen since the links aren't pushing or stretching:
print np.allclose(kPos,solver(kPos, kAnchor, link0, link1, w0, debug=True))
# Returns True
# Iterations: 0
# Max Force: 0.0
Now lets move that middle anchored knot up a bit and solve the system:
# Move the center knot up a little
kPos[12] = np.array([0,0.3,0])
# eval the system
new = solver(kPos, kAnchor, link0, link1, w0, debug=True) # positions will have moved
#Iterations: 102
#Max Force: 0.000976603249133
# Rerun with cProfile to see how fast it runs
cProfile.run('solver(kPos, kAnchor, link0, link1, w0)')
# 520 function calls in 0.008 seconds
And here's what the grid looks like after being pulled by that single anchored knot:
Question:
My actual use cases are a little more complex than this example and solve a little too slow for my taste: (100-200 knots with a network anywhere between 200-300 links, solves in a few seconds).
How can i make my solver function run faster? I'd consider Cython but i have zero experience with C. Any help would be greatly appreciated.

Your method, at a cursory glance, appears to be an explicit under-relaxation type of method. Calculate the residual force at each knot, apply a factor of that force as a displacement, repeat until convergence. It's the repeating until convergence that takes the time. The more points you have, the longer each iteration takes, but you also need more iterations for the constraints at one end of the mesh to propagate to the other.
Have you considered an implicit method? Write the equation for the residual force at each non-constrained node, assemble them into a large matrix, and solve in one step. Information now propagates across the entire problem in a single step. As an additional benefit, the matrix you construct should be sparse, which scipy has a module for.
Wikipedia: explicit and implicit methods
EDIT Example of an implicit method matching (roughly) your problem. This solution is linear, so it doesn't take into account the effect of the calculated displacement on the force. You would need to iterate (or use non-linear techniques) to calculate this. Hope it helps.
#!/usr/bin/python3
import matplotlib.pyplot as pp
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import scipy as sp
import scipy.sparse
import scipy.sparse.linalg
#------------------------------------------------------------------------------#
# Generate a grid of knots
nX = 10
nY = 10
x = np.linspace(-0.5, 0.5, nX)
y = np.linspace(-0.5, 0.5, nY)
x, y = np.meshgrid(x, y)
knots = list(zip(x.flatten(), y.flatten()))
# Create links between the knots
links = []
# Horizontal links
for i in range(0, nY):
for j in range(0, nX - 1):
links.append((i*nX + j, i*nX + j + 1))
# Vertical links
for i in range(0, nY - 1):
for j in range(0, nX):
links.append((i*nX + j, (i + 1)*nX + j))
# Create constraints. This dict takes a knot index as a key and returns the
# fixed z-displacement associated with that knot.
constraints = {
0 : 0.0,
nX - 1 : 0.0,
nX*(nY - 1): 0.0,
nX*nY - 1 : 1.0,
2*nX + 4 : 1.0,
}
#------------------------------------------------------------------------------#
# Matrix i-coordinate, j-coordinate and value
Ai = []
Aj = []
Ax = []
# Right hand side array
B = np.zeros(len(knots))
# Loop over the links
for link in links:
# Link geometry
displacement = np.array([ knots[1][i] - knots[0][i] for i in range(2) ])
distance = np.sqrt(displacement.dot(displacement))
# For each node
for i in range(2):
# If it is not a constraint, add the force associated with the link to
# the equation of the knot
if link[i] not in constraints:
Ai.append(link[i])
Aj.append(link[i])
Ax.append(-1/distance)
Ai.append(link[i])
Aj.append(link[not i])
Ax.append(+1/distance)
# If it is a constraint add a diagonal and a value
else:
Ai.append(link[i])
Aj.append(link[i])
Ax.append(+1.0)
B[link[i]] += constraints[link[i]]
# Create the matrix and solve
A = sp.sparse.coo_matrix((Ax, (Ai, Aj))).tocsr()
X = sp.sparse.linalg.lsqr(A, B)[0]
#------------------------------------------------------------------------------#
# Plot the links
fg = pp.figure()
ax = fg.add_subplot(111, projection='3d')
for link in links:
x = [ knots[i][0] for i in link ]
y = [ knots[i][1] for i in link ]
z = [ X[i] for i in link ]
ax.plot(x, y, z)
pp.show()

Create a loop that runs function (with parameters which are indices of data set) on all items of the data set?

so I've got a function:
def connection(n,m,r):
is_connected = ((x[n]-x[m])**2 + (y[n]-y[m])**2)**0.5
if is_connected < 2*r:
return n + " " + "connects with" + " " + m
else:
return "no connection"
This basically sees whether two circles (with coordinates that correspond to the indices n and m) connect. The n and m parameters refer to the indices in the data sets x and y, which come from a numpy.random array:
array([[ 0.31730234, 0.73662906],
[ 0.54488759, 0.09462212],
[ 0.07500703, 0.36148366],
[ 0.33200281, 0.04550565],
[ 0.3420866 , 0.9425797 ],
[ 0.36115391, 0.16670599],
[ 0.95586938, 0.52599398],
[ 0.13707665, 0.6574444 ],
[ 0.77766138, 0.56875582],
[ 0.79618595, 0.7139309 ]])
Since the array is basically 10 sets of coordinates, I have produced two lists out of them, x and y (x is the first column of the array, y is the second). m and n are indices in these lists. Therefore, n and m correspond to indices in the array, but I'm not sure how?
What I've been doing now is manually inputting the indices to see whether any two circles in this array connect - is there a -for loop that can do this in a more efficient way?

You should be doing things differently anyway. Unfortunatly the cKDTree which is much faster does not have the necessary features, but even the other KDTree should give you a vast speed increase (and solve it much more elegantly)
from scipy.spatial import KDTree
from itertools import chain
tree = KDTree(circles)
# unfortunatly only a list of lists, because there may be a different amount
# also the point itself is included every time.
connections = tree.query_ball_tree(tree, 2*r)
# if all you want is a list of lists of what connects with what
# connections is already what you need. The rest creates a connectivity matrix:
repeats = [len(l) for l in connections]
x_point = np.arange(len(circles)).repeat(repeats)
y_point = np.fromiter(chain(*connections), dtype=np.intp)
# or construct a sparse matrix here instead, scipy.sparse has some graph tools
# maybe it even has a better thing to do this.
connected = np.zeros((len(circles),) * 2, dtype=bool)
connected[x_point, y_point] = True
While it doesn't use cKDTree unfortunatly, this still saves you the O(N^2) complexity... Of course if len(circles) is small, that does not matter, but then you can just use broadcasting, (or also distance_matrix from scipy.spatial):
distances = np.sqrt(((circles[:,None,:] - circles)**2).sum(-1))
connected = distances < (2 * r)
# if you need the list of lists/arrays here you can do:
connections = [np.flatnonzero(c) for c in connected]
But note that the second method is a memory hungry monster and only any good if circles is small.

EDIT: Just realized what follows is just an expanded version of seberg's last method...
If your data sets are small, as in (very) few thousands of elements, you can brute force things with numpy:
import numpy as np
n = 10 # the number of circles
circles = np.random.rand(n, 2) # the array of centers
distances = circles.reshape(n, 1, 2) - circles.reshape(1, n, 2)
# distances now has shape (n, n, 2)
distances = np.sqrt(np.sum(distances**2, axis=2))
# distances now has shape (n, n)
# distances[i, j] holds the distance between the i-th and j-th circle centers
When you want to check which circles of radius r overlap, you can do something like this:
r = 0.1
overlap = distances < 2 * r
# overlap[i, j] is True if the i-th and j-th circle overlap, False if not
These last 2 lines you can reuse for any values of r you want, without having to do the more calculation intensive previous step.
It uses a lot of unnecessary memory, so it will break down for (moderately) large data sets, but since all loops are being run under hood by numpy, it should be fast.

you can use map
to do so, simply change connection to accept a circle as its parameter, and have the r as part of the circle
def connection(circle):
n, m, r = circle
is_connected = ((x[n]-x[m])**2 + (y[n]-y[m])**2)**0.5
if is_connected < 2*r:
return n + " " + "connects with" + " " + m
else:
return "no connection"
and have your list be a list of circles AND their radius.
circles = array([
[ 0.31730234, 0.73662906, r],
[ 0.54488759, 0.09462212, r],
[ 0.07500703, 0.36148366, r],
[ 0.33200281, 0.04550565, r],
[ 0.3420866 , 0.9425797 , r],
[ 0.36115391, 0.16670599, r],
[ 0.95586938, 0.52599398, r],
[ 0.13707665, 0.6574444 , r],
[ 0.77766138, 0.56875582, r],
[ 0.79618595, 0.7139309 , r]])
then simply do this:
map(connection, circles)
if the r is external, or a member then you want to use this:
def connection(coord):
n, m = coord
is_connected = ((x[n]-x[m])**2 + (y[n]-y[m])**2)**0.5
if is_connected < 2*r:
return n + " " + "connects with" + " " + m
else:
return "no connection"
coords = array([
[ 0.31730234, 0.73662906],
[ 0.54488759, 0.09462212],
[ 0.07500703, 0.36148366],
[ 0.33200281, 0.04550565],
[ 0.3420866 , 0.9425797 ],
[ 0.36115391, 0.16670599],
[ 0.95586938, 0.52599398],
[ 0.13707665, 0.6574444 ],
[ 0.77766138, 0.56875582],
[ 0.79618595, 0.7139309 ]])
map(connection, coords)
or you can keep your current format, and do a slightly uglier implementation. and i still dont know where you get the r from.
for item in circles:
connection(item[0], item[1], r)

A straightforward example would be:
# m is your array
rows = m.shape[0]
for x in range(rows):
for y in range(rows)[x+1:]:
# x, y correspond to indices of two circles
conn = connection(x, y, r)

After reviewing some of your responses, I have written you a new answer which changes your code entirely.
def check_overlap(circleA, circleB, r):
# your code.
# in each circle you have [x-coord, y-coord]
circles = [
[0.34, 0.74], [0.27, 0.19],
[0.24. 0.94], [0.64, 1.42]]
for a, b in ((a,b) for a in circles for b in circles):
if a != b:
check_overlap(a, b, r)
this makes it clear what you are doing in your code, python is all about readability.
note: the last function is the same as using itertools.product but without importing it.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

numba cuda on matrix decomposition code in python [closed] - python

Related

How to use use cuda to optimize my matrix multiplication code [duplicate]

Avoid Race Condition in Numba

How to real-time filter with scipy and lfilter?

Improving a numpy implementation of a simple spring network

Create a loop that runs function (with parameters which are indices of data set) on all items of the data set?

Categories

Resources