How to use use cuda to optimize my matrix multiplication code [duplicate]

How to use use cuda to optimize my matrix multiplication code [duplicate] - python

Lately I've been trying to get into programming for GPUs in Python using the Numba library. I have been reading up on it on their website using the tutorial there and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the example for the fast matrix multiplication a bit (which is of the form A*B=C). When testing I noticed that matrices with dimensions that are not perfectly divisible by the number of threads per block (TPB) do not yield a correct answer.
I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4 by 4 matrices. If I choose TPB=2 everything is fine, but when I set TPB=3 it goes wrong. I understand that the code goes out of the bounds of the matrices, but I am unable to prevent this from happening (I tried some if statements on ty + i * TPB and tx + i * TPB, but these did not work.
from numba import cuda, float32
import numpy as np
import math
#cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
if x >= C.shape[0] and y >= C.shape[1]:
# Quit if (x, y) is outside of valid C boundary
return
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = 0.
for i in range(bpg):
# Preload data into shared memory
sA[tx, ty] = A[x, ty + i * TPB]
sB[tx, ty] = B[tx + i * TPB, y]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[tx, j] * sB[j, ty]
# Wait until all threads finish computing
cuda.syncthreads()
C[x, y] = tmp
#%%
x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
I would like to write some code that is not dependent on the matrices A, B, and C having dimensions that are perfectly divisible by the TPB, as these are sometimes out of my control. I understand that GPUs are only faster with matrix multiplication for very large matrices, but I wanted to use small examples to be able to check whether the answer is correct before applying it to actual data.

There are arguably at least two errors in that posted code:
This can't possibly be a correct range check:
if x >= C.shape[0] and y >= C.shape[1]:
In order for us to decide that a particular thread in the grid not do any loading activity, we require either that x is out of range or that y is out of range. The and should have been an or.
It is illegal to use cuda.syncthreads() in conditional code, if all the threads in the block cannot participate in that statement. The previous return statement in item 1 above (even if corrected from and to or) pretty much guarantees this illegal behavior for problem sizes not whole-number-divisible by the threadblock size.
Therefore, to fix these issues, we cannot use just a simple return statement for an out-of-bounds thread. Instead, at the point of load, we must only allow threads to load from global to shared memory, if the computed global load indices (for A or B) are in-bounds (the shared indices are in-bounds, by definition). Furthermore, when writing a result, we must only write computed results that are in-bounds for C.
The following code has those items fixed. It seems to work correctly for your given test case:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math
#cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = float32(0.)
for i in range(bpg):
# Preload data into shared memory
sA[tx, ty] = 0
sB[tx, ty] = 0
if x < A.shape[0] and (ty+i*TPB) < A.shape[1]:
sA[tx, ty] = A[x, ty + i * TPB]
if y < B.shape[1] and (tx+i*TPB) < B.shape[0]:
sB[tx, ty] = B[tx + i * TPB, y]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[tx, j] * sB[j, ty]
# Wait until all threads finish computing
cuda.syncthreads()
if x < C.shape[0] and y < C.shape[1]:
C[x, y] = tmp
#%%
x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h#y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$
(Note the use of and here in the bounds tests is correct. Testing whether a set of indices are in-bound is different in a boolean sense compared to testing whether a set of indices is out-of-bounds. In the in-bounds test, we require both to be in-bounds. In the out-of-bounds test, either index out-of-bounds is disqualifying).
I'm not suggesting the above code is defect-free or suitable for any particular purpose. It is offered to demonstrate possible fixes for the issues I identified. Getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, as you have discovered, and I've not tested it beyond what is shown here. (For example, if you decided to make TPB larger than 32, you would run into other problems. Also, the original posted code is advertised only for square matrix multiplication, and this will not work in the general non-square case.)
As noted above, the posted code and the above code with "fixes" will not correctly handle the general non-square case. I believe some straightforward modifications will allow us to handle the non-square case. In a nutshell, we must size the grid large enough to handle the dimensions of both input matrices, while still only writing results for the in-bounds values of the output matrix. Here is a lightly tested example:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math
#cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = float32(0.)
for i in range(bpg):
# Preload data into shared memory
sA[ty, tx] = 0
sB[ty, tx] = 0
if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
sA[ty, tx] = A[y, tx + i * TPB]
if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
sB[ty, tx] = B[ty + i * TPB, x]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[ty, j] * sB[j, tx]
# Wait until all threads finish computing
cuda.syncthreads()
if y < C.shape[0] and x < C.shape[1]:
C[y, x] = tmp
#%%
x_h = np.arange(115).reshape([5,23])
y_h = np.ones([23,7])
z_h = np.zeros([5,7])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
#TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0],y_h.shape[0])
grid_x_max = max(x_h.shape[1],y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h#y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$
I've also reordered the sense of x and y (and usage of tx and ty) to fix a performance issue in the above code. The same performance issue was present in the original posted doc code as well.
Again, no claims of defect free. Furthermore I'm sure "more optimal" code could be arrived at. However optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using cupy here for the GPU approach should be a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.
EDIT: As discussed here OP's code (and, it seems, the doc example) also had a performance issue around the setup of the tmp variable. Changing that to a proper 32-bit float variable makes an important performance difference.

Related

optimize this numpy operation

I have inherited some code and there is one particular operation that takes an inordinate amount of time.
The operation is defined as:
cutoff = 0.2
# X has shape (76187, 247, 20)
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
weightfun = lambda x: 1.0 / np.sum(np.dot(X_flat, x) / np.dot(x, x) > 1 - cutoff)
# This is expensive...
N_list = np.array(list(map(weightfun, X_flat)))
This takes hours to compute on my machine. I am wondering if there is a way to optimize this. The code is computing normalized hamming distances between vector sequences.

weightfun performs two dot product operations for every row of X_flat. The worst one is np.dot(X_flat, x), where the dot product is performed against the whole X_flat matrix. But there's a trick to speed things up. The iterative part in the first dot product can be computed only once with:
X_matmut = X_flat # X_flat.T
Also, I noticed that the second dot product is nothing more than the diagonal of the result of the first one.
The rewritten code looks like this:
cutoff = 0.2
# X has shape (76187, 247, 20)
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
X1 = X_flat # X_flat.T
X2 = X1.diagonal()
N_list = 1.0 / (X1/X2 > 1 - cutoff).sum(axis=0)
Edit
For such a large input, when performing the operation above the memory becomes the new bottleneck as the new matrix won't fit into RAM. So there's also the option of breaking the computation into chunks, as the code below shows.
The code gets a little messy, but at least it didn't try to destroy my PC :-P
import numpy as np
import time
# Sample data
X = np.random.random([76187, 247, 20])
start = time.time()
cutoff = 0.2
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
# Divide data into 20 chuncks
X_parts = np.array_split(X_flat, 20)
# Diagonal will be saved incrementally
diagonal = []
for i in range(len(X_parts)):
part = X_parts[i]
X_parts[i] = part # X_flat.T
diagonal.extend(X_parts[i][range(len(X_parts[i])), range(len(diagonal), len(diagonal)+len(X_parts[i]))])
# Performs the second part of the calculation
diagonal = np.array(diagonal)
X_list = np.zeros(len(diagonal))
for x in X_parts:
X_list += (x/diagonal > 1 - cutoff).sum(axis=0)
X_list = 1.0 / X_list
print('Time to solve: %.2f secs' % (time.time() - start))
I would love to be able to perform all the computation on a single loop and discard the used chunks, but it is obligatory to run over the whole matrix once to retrieve the diagonal. Don't believe it's worth to compute everything twice to save memory.
While I use a decent setup (16 GB of RAM in a i7 intel and SSD drive for storage), the whole processing took me around 15 minutes.

Understanding shared memory use for improvement in Numba

I'm trying to learn more about the use of shared memory to improve performance in some cuda kernels in Numba, for this I was looking at the Matrix multiplication Example in the Numba documentation and tried to implement to see the gain.
This is my test implementation, I'm aware that the example in the documentation has some issues that I followed from Here, so I copied the fixed example code.
from timeit import default_timer as timer
import numba
from numba import cuda, jit, int32, int64, float64, float32
import numpy as np
from numpy import *
#cuda.jit
def matmul(A, B, C):
"""Perform square matrix multiplication of C = A * B
"""
i, j = cuda.grid(2)
if i < C.shape[0] and j < C.shape[1]:
tmp = 0.
for k in range(A.shape[1]):
tmp += A[i, k] * B[k, j]
C[i, j] = tmp
# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16
#cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = 0.
for i in range(bpg):
# Preload data into shared memory
sA[ty, tx] = 0
sB[ty, tx] = 0
if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
sA[ty, tx] = A[y, tx + i * TPB]
if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
sB[ty, tx] = B[ty + i * TPB, x]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[ty, j] * sB[j, tx]
# Wait until all threads finish computing
cuda.syncthreads()
if y < C.shape[0] and x < C.shape[1]:
C[y, x] = tmp
size = 1024*4
tpbx,tpby = 16, 16
tpb = (tpbx,tpby)
bpgx, bpgy = int(size/tpbx), int(size/tpby)
bpg = (bpgx, bpgy)
a_in = cuda.to_device(np.arange(size*size, dtype=np.float32).reshape((size, size)))
b_in = cuda.to_device(np.ones(size*size, dtype=np.float32).reshape((size, size)))
c_out1 = cuda.device_array_like(a_in)
c_out2 = cuda.device_array_like(a_in)
s = timer()
cuda.synchronize()
matmul[bpg,tpb](a_in, b_in, c_out1);
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
c_host1 = c_out1.copy_to_host()
print(c_host1)
s = timer()
cuda.synchronize()
fast_matmul[bpg,tpb](a_in, b_in, c_out2);
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
c_host2 = c_out2.copy_to_host()
print(c_host2)
The time of execution of the above kernels are essentially the same, actually the matmul was making faster for some larger input matrices. I would like to know what I'm missing in order to see the gain as the documentation suggests.
Thanks,
Bruno.

I made a performance mistake in the code I put in that other answer. I've now fixed it. In a nutshell this line:
tmp = 0.
caused numba to create a 64-bit floating point variable tmp. That triggered other arithmetic in the kernel to be promoted from 32-bit floating point to 64-bit floating point. That is inconsistent with the rest of the arithmetic and also inconsistent with the intent of the demonstration in the other answer. This error affects both kernels.
When I change it in both kernels to
tmp = float32(0.)
both kernels get noticeably faster, and on my GTX960 GPU, your test case shows that the shared code runs about 2x faster than the non-shared code (but see below).
The non-shared kernel also has a performance issue related to memory access patterns. Similar to the indices swap in that other answer, for this particular scenario only, we can rectify this problem simply by reversing the assigned indices:
j, i = cuda.grid(2)
in the non-shared kernel. This allows that kernel to perform approximately as well as it can, and with that change the shared kernel runs about 2x faster than the non-shared kernel. Without that additional change to the non-shared kernel, the performance of the non-shared kernel is much worse.

numba cuda on matrix decomposition code in python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 1 year ago.
Improve this question
I'm trying to implement this code to use my gpu
import numpy as np
import math
import numba
from numba import vectorize , cuda
#cuda.jit(['float32(float 32)'], device=True)
def CholeskyInc(A):
n,m = A.shape
if n!=m:
print('La matriz no es cuadrada')
else:
L = np.zeros((n,n))
for k in range(n):
L[k,k] = math.sqrt(A[k,k])
for i in range(k+1,n):
if A[i,k] != 0:
L[i,k]=A[i,k]/A[k,k]
for j in range(k+1,n):
for i in range(j,n):
if A[i,j] != 0:
L[i,j] = A[i,j]-A[i,k]*A[j,k]
return L
Actually, I don't have any experience using numba and I prove some options but any of them work yet. Does anyone could explainme how I can execute this code in the GPU rather than the CPU using numba and cuda.

I don't have the needed rep here to just comment, so here's a couple suggestions to start:
The decorator usage:
Usin #cuda.jit as a decorator lets you skip defining the arg types and return type of your function.
You only need to call #cuda.jit(device=True) if you plan on calling this function from another cuda kernel function. Where a cuda kernel function is any function where you haven't set device=True. See the code sample at the end of this response.
Functions compiled to cuda kernels can't have an explicit return, nor can they print to console.
In place of returning L, create L in the host code (the code you are running on cpu) and pass that L as a second parameter to your function. After you run the function the results will be stored in L.
In place of the print statement you'll have to do something like assign an error value to L and check for that error value before you later use L
import numpy as np
from numba import cuda
from math import sqrt
def main():
""" Creates 2 matrices with randomized values which are then
passed to a cuda kernel where the incomplete Cholesky factorization
is done.
these first several lines set up the input arrays (your A and L)
as well as the thread and block shape that numba will need when
launching your kernel onto the GPU.
"""
np.set_printoptions(linewidth=240)
rand_n = np.random.randint(8 ,11)
A = np.random.random((rand_n ,rand_n)).astype(np.float32)
L = np.zeros((A.shape[0] ,A.shape[0]) ,dtype=A.dtype)
# tpb meaning "threads per block"
tpb_y = 16
tpb_x = 16
tpb = tpb_x ,tpb_y
# bpg meaning "blocks per grid"
bpg_y = (A.shape[0] + tpb_y -1 )//tpb_y # ensures enough blocks to cover whole matrix height
bpg_x = (A.shape[1] + tpb_x -1 )//tpb_x # ensures enough blocks to cover whole matrix width
# Alternatively, you could use the following commented lines
# which will allocate fewer thread-blocks (less overhead) but requires that
# we implement striding loops in our kernel/device functions to
# cover the rest of the matrix data.
# bpg_y = (A.shape[0]//2 + tpb_y-1)//tpb_y # cover half of matrix height, and stride the other half
# bpg_x = (A.shape[1]//2 + tpb_x-1)//tpb_x # cover half of matrix width, and stride the other half
bpg = bpg_x ,bpg_y
# now we launch the kernel
my_cuda_kernel[bpg ,tpb](A ,L)
if np.isnan(L[0 ,0]):
print('La matriz no es cuadrada')
print(f"A:\n{A}")
print(f"L:\n{L}")
#cuda.jit
def my_cuda_kernel(arr_in :np.ndarray, arr_out :np.ndarray):
# in your kernel, you can set up shared memory
# or implement looping structures to stride through global device memory.
n ,m = arr_in.shape
if n!= m:
# set the first element on the diagonal to a NaN value,
# as a way to signal that no operations were done on L.
arr_out[0, 0] = np.nan # check by calling np.isnan(L[0,0])
return
# get the current thread's location within the grid
x, y = cuda.grid(2) # 2 because we launched the kernel specifying only x, and y dimensions
# these x,y coordinates don't explicitly map to coordinates in the array objects
# but we can simplify our indexing by treating them like they do.
x_stride, y_stride = cuda.gridsize(2)
# this loop provides 2 purposes:
# 1. It acts as boundary check to make sure we don't attempt to read
# memory locations outside of the input/output arrays.
# 2. It also allows our threads to "stride" through the data, and do more
# work to justify the overhead cost setting this thread-block up.
# * Note that threads will only be able to "stride"
# if we don't launch the kernel with full grid coverage.
for _k in range(0, n, x_stride * y_stride):
k = _k + y
# we are telling each thread to stride the
# diagonal and compute the cholesky for arr[:,k] rows and arr[k,:] cols.
CholeskyInc(arr_in, arr_out, n, k)
#cuda.jit(device=True)
def CholeskyInc(A, L, n, k):
L[k, k] = sqrt(A[k, k])
for i in range(k + 1, n):
if A[i, k] != 0:
L[i, k] = A[i, k] / L[k, k]
for j in range(k + 1, n):
for i in range(j, n):
if A[i, j] != 0:
L[i, j] = A[i, j] - A[i, k] * A[j, k]
if __name__ == '__main__':
main()
which produces the following output:
A:
[[0.29083535 0.80408204 0.63088804 0.90458757 0.86371994 0.7966909 0.5818828 0.8885034 ]
[0.24579939 0.8107 0.9785071 0.40308124 0.96477604 0.39282414 0.18642609 0.3129212 ]
[0.18401423 0.11662608 0.3512116 0.97926706 0.4021766 0.23531164 0.81310475 0.93359345]
[0.5243785 0.0469533 0.49699584 0.507422 0.24569689 0.4899143 0.61420184 0.9332651 ]
[0.1070556 0.5214806 0.24065676 0.8860097 0.5074029 0.43745205 0.09919663 0.9222924 ]
[0.17103161 0.25640044 0.94678307 0.26446953 0.9416109 0.8391528 0.69582105 0.5433431 ]
[0.5520146 0.10083573 0.7929039 0.44067022 0.6251738 0.6831893 0.23636419 0.97260725]
[0.47044474 0.13215688 0.5002679 0.72581047 0.8298903 0.55161124 0.6673608 0.5644971 ]]
L:
[[ 0.5392915 0. 0. 0. 0. 0. 0. 0. ]
[ 0.45578206 0.75028265 0. 0. 0. 0. 0. 0. ]
[ 0.34121478 0.07139549 0.31735036 0. 0. 0. 0. 0. ]
[ 0.972347 -0.08193862 0.40050274 0.23244919 0. 0. 0. 0. ]
[ 0.19851157 0.49516642 0.22095701 0.829872 0.495942 0. 0. 0. ]
[ 0.3171413 0.21436097 0.9153108 0.17478424 0.923301 0.809901 0. 0. ]
[ 1.0235922 -0.03484913 0.69132537 0.15120566 0.56607753 0.5887773 -0.06835592 0. ]
[ 0.8723385 0.01652185 0.4136994 0.47911936 0.7795266 0.47115034 0.4076684 0.34317887]]

How to real-time filter with scipy and lfilter?

Disclaimer: I am probably not as good at DSP as I should be and therefore have more issues than I should have getting this code to work.
I need to filter incoming signals as they happen. I tried to make this code to work, but I have not been able to so far.
Referencing scipy.signal.lfilter doc
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt
from lib import fnlib
samples = 100
x = np.linspace(0, 7, samples)
y = [] # Unfiltered output
y_filt1 = [] # Real-time filtered
nyq = 0.5 * samples
f1_norm = 0.1 / nyq
f2_norm = 2 / nyq
b, a = scipy.signal.butter(2, [f1_norm, f2_norm], 'band', analog=False)
zi = scipy.signal.lfilter_zi(b,a)
zi = zi*(np.sin(0) + 0.1*np.sin(15*0))
This sets zi as zi*y[0 ] initially, which in this case is 0. I have got it from the example code in the lfilter documentation, but I am not sure if this is correct at all.
Then it comes to the point where I am not sure what to do with the few initial samples.
The coefficients a and b are len(a) = 5 here.
As lfilter takes input values from now to n-4, do I pad it with zeroes, or do I need to wait until 5 samples have gone by and take them as a single bloc, then continuously sample each next step in the same way?
for i in range(0, len(a)-1): # Append 0 as initial values, wrong?
y.append(0)
step = 0
for i in xrange(0, samples): #x:
tmp = np.sin(x[i]) + 0.1*np.sin(15*x[i])
y.append(tmp)
# What to do with the inital filterings until len(y) == len(a) ?
if (step> len(a)):
y_filt, zi = scipy.signal.lfilter(b, a, y[-len(a):], axis=-1, zi=zi)
y_filt1.append(y_filt[4])
print(len(y))
y = y[4:]
print(len(y))
y_filt2 = scipy.signal.lfilter(b, a, y) # Offline filtered
plt.plot(x, y, x, y_filt1, x, y_filt2)
plt.show()

I think I had the same problem, and found a solution on https://github.com/scipy/scipy/issues/5116:
from scipy import zeros, signal, random
def filter_sbs():
data = random.random(2000)
b = signal.firwin(150, 0.004)
z = signal.lfilter_zi(b, 1) * data[0]
result = zeros(data.size)
for i, x in enumerate(data):
result[i], z = signal.lfilter(b, 1, [x], zi=z)
return result
if __name__ == '__main__':
result = filter_sbs()
The idea is to pass the filter state z in each subsequent call to lfilter. For the first few samples the filter may give strange results, but later (depending on the filter length) it starts to behave correctly.

The problem is not how you are buffering the input. The problem is that in the 'offline' version, the state of the filter is initialized using lfilter_zi which computes the internal state of an LTI so that the output will already be in steady-state when new samples arrive at the input. In the 'real-time' version, you skip this so that the filter's initial state is 0. You can either initialize both versions to using lfilter_zi or else initialize both to 0. Then, it doesn't matter how many samples you filter at a time.
Note, if you initialize to 0, the filter will 'ring' for a certain amount of time before reaching a steady state. In the case of FIR filters, there is an analytic solution for determining this time. For many IIR filters, there is not.
This following is correct. For simplicity's sake I initialize to 0 and feed the input on sample at a time. However, any non-zero block size will produce equivalent output.
from scipy import signal, random
from numpy import zeros
def filter_sbs(data, b):
z = zeros(b.size-1)
result = zeros(data.size)
for i, x in enumerate(data):
result[i], z = signal.lfilter(b, 1, [x], zi=z)
return result
def filter(data, b):
result = signal.lfilter(b,1,data)
return result
if __name__ == '__main__':
data = random.random(20000)
b = signal.firwin(150, 0.004)
result1 = filter_sbs(data, b)
result2 = filter(data, b)
print(result1 - result2)
Output:
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -5.55111512e-17
0.00000000e+00 1.66533454e-16]

Improving a numpy implementation of a simple spring network

I wanted a very simple spring system written in numpy. The system would be defined as a simple network of knots, linked by links. I'm not interested in evaluating the system over time, but instead I want to go from an initial state, change a variable (usually move a knot to a new position) and solve the system until it reaches a stable state (last applied force is below a given threshold). The knots have no mass, there's no gravity, the forces are all derived from each link's current lengths/init lengths. And the only "special" variable is that each knot can bet set as "anchored" (doesn't move).
So I wrote this simple solver below, and included a simple example. Jump to the very end for my question.
import numpy as np
from numpy.core.umath_tests import inner1d
np.set_printoptions(precision=4)
np.set_printoptions(suppress=True)
np.set_printoptions(linewidth =150)
np.set_printoptions(threshold=10)
def solver(kPos, kAnchor, link0, link1, w0, cycles=1000, precision=0.001, dampening=0.1, debug=False):
"""
kPos : vector array - knot position
kAnchor : float array - knot's anchor state, 0 = moves freely, 1 = anchored (not moving)
link0 : int array - array of links connecting each knot. each index corresponds to a knot
link1 : int array - array of links connecting each knot. each index corresponds to a knot
w0 : float array - initial link length
cycles : int - eval stops when n cycles reached
precision : float - eval stops when highest applied force is below this value
dampening : float - keeps system stable during each iteration
"""
kPos = np.asarray(kPos)
pos = np.array(kPos) # copy of kPos
kAnchor = 1-np.clip(np.asarray(kAnchor).astype(float),0,1)[:,None]
link0 = np.asarray(link0).astype(int)
link1 = np.asarray(link1).astype(int)
w0 = np.asarray(w0).astype(float)
F = np.zeros(pos.shape)
i = 0
for i in xrange(cycles):
# Init force applied per knot
F = np.zeros(pos.shape)
# Calculate forces
AB = pos[link1] - pos[link0] # get link vectors between knots
w1 = np.sqrt(inner1d(AB,AB)) # get link lengths
AB/=w1[:,None] # normalize link vectors
f = (w1 - w0) # calculate force vectors
f = f[:,None] * AB
# Apply force vectors on each knot
np.add.at(F, link0, f)
np.subtract.at(F, link1, f)
# Update point positions
pos += F * dampening * kAnchor
# If the maximum force applied is below our precision criteria, exit
if np.amax(F) < precision:
break
# Debug info
if debug:
print 'Iterations: %s'%i
print 'Max Force: %s'%np.amax(F)
return pos
Here's some test data to show how it works. In this case i'm using a grid, but in reality this can be any type of network, like a string with many knots, or a mess of polygons...:
import cProfile
# Create a 5x5 3D knot grid
z = np.linspace(-0.5, 0.5, 5)
x = np.linspace(-0.5, 0.5, 5)[::-1]
x,z = np.meshgrid(x,z)
kPos = np.array([np.array(thing) for thing in zip(x.flatten(), z.flatten())])
kPos = np.insert(kPos, 1, 0, axis=1)
'''
array([[-0.5 , 0. , 0.5 ],
[-0.25, 0. , 0.5 ],
[ 0. , 0. , 0.5 ],
...,
[ 0. , 0. , -0.5 ],
[ 0.25, 0. , -0.5 ],
[ 0.5 , 0. , -0.5 ]])
'''
# Define the links connecting each knots
link0 = [0,1,2,3,5,6,7,8,10,11,12,13,15,16,17,18,20,21,22,23,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
link1 = [1,2,3,4,6,7,8,9,11,12,13,14,16,17,18,19,21,22,23,24,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]
AB = kPos[link0]-kPos[link1]
w0 = np.sqrt(inner1d(AB,AB)) # this is a square grid, each link's initial length will be 0.25
# Set the anchor states
kAnchor = np.zeros(len(kPos)) # All knots will be free floating
kAnchor[12] = 1 # Middle knot will be anchored
This is what the grid looks like:
If we run my code using this data, nothing will happen since the links aren't pushing or stretching:
print np.allclose(kPos,solver(kPos, kAnchor, link0, link1, w0, debug=True))
# Returns True
# Iterations: 0
# Max Force: 0.0
Now lets move that middle anchored knot up a bit and solve the system:
# Move the center knot up a little
kPos[12] = np.array([0,0.3,0])
# eval the system
new = solver(kPos, kAnchor, link0, link1, w0, debug=True) # positions will have moved
#Iterations: 102
#Max Force: 0.000976603249133
# Rerun with cProfile to see how fast it runs
cProfile.run('solver(kPos, kAnchor, link0, link1, w0)')
# 520 function calls in 0.008 seconds
And here's what the grid looks like after being pulled by that single anchored knot:
Question:
My actual use cases are a little more complex than this example and solve a little too slow for my taste: (100-200 knots with a network anywhere between 200-300 links, solves in a few seconds).
How can i make my solver function run faster? I'd consider Cython but i have zero experience with C. Any help would be greatly appreciated.

Your method, at a cursory glance, appears to be an explicit under-relaxation type of method. Calculate the residual force at each knot, apply a factor of that force as a displacement, repeat until convergence. It's the repeating until convergence that takes the time. The more points you have, the longer each iteration takes, but you also need more iterations for the constraints at one end of the mesh to propagate to the other.
Have you considered an implicit method? Write the equation for the residual force at each non-constrained node, assemble them into a large matrix, and solve in one step. Information now propagates across the entire problem in a single step. As an additional benefit, the matrix you construct should be sparse, which scipy has a module for.
Wikipedia: explicit and implicit methods
EDIT Example of an implicit method matching (roughly) your problem. This solution is linear, so it doesn't take into account the effect of the calculated displacement on the force. You would need to iterate (or use non-linear techniques) to calculate this. Hope it helps.
#!/usr/bin/python3
import matplotlib.pyplot as pp
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import scipy as sp
import scipy.sparse
import scipy.sparse.linalg
#------------------------------------------------------------------------------#
# Generate a grid of knots
nX = 10
nY = 10
x = np.linspace(-0.5, 0.5, nX)
y = np.linspace(-0.5, 0.5, nY)
x, y = np.meshgrid(x, y)
knots = list(zip(x.flatten(), y.flatten()))
# Create links between the knots
links = []
# Horizontal links
for i in range(0, nY):
for j in range(0, nX - 1):
links.append((i*nX + j, i*nX + j + 1))
# Vertical links
for i in range(0, nY - 1):
for j in range(0, nX):
links.append((i*nX + j, (i + 1)*nX + j))
# Create constraints. This dict takes a knot index as a key and returns the
# fixed z-displacement associated with that knot.
constraints = {
0 : 0.0,
nX - 1 : 0.0,
nX*(nY - 1): 0.0,
nX*nY - 1 : 1.0,
2*nX + 4 : 1.0,
}
#------------------------------------------------------------------------------#
# Matrix i-coordinate, j-coordinate and value
Ai = []
Aj = []
Ax = []
# Right hand side array
B = np.zeros(len(knots))
# Loop over the links
for link in links:
# Link geometry
displacement = np.array([ knots[1][i] - knots[0][i] for i in range(2) ])
distance = np.sqrt(displacement.dot(displacement))
# For each node
for i in range(2):
# If it is not a constraint, add the force associated with the link to
# the equation of the knot
if link[i] not in constraints:
Ai.append(link[i])
Aj.append(link[i])
Ax.append(-1/distance)
Ai.append(link[i])
Aj.append(link[not i])
Ax.append(+1/distance)
# If it is a constraint add a diagonal and a value
else:
Ai.append(link[i])
Aj.append(link[i])
Ax.append(+1.0)
B[link[i]] += constraints[link[i]]
# Create the matrix and solve
A = sp.sparse.coo_matrix((Ax, (Ai, Aj))).tocsr()
X = sp.sparse.linalg.lsqr(A, B)[0]
#------------------------------------------------------------------------------#
# Plot the links
fg = pp.figure()
ax = fg.add_subplot(111, projection='3d')
for link in links:
x = [ knots[i][0] for i in link ]
y = [ knots[i][1] for i in link ]
z = [ X[i] for i in link ]
ax.plot(x, y, z)
pp.show()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to use use cuda to optimize my matrix multiplication code [duplicate] - python

Related

optimize this numpy operation

Understanding shared memory use for improvement in Numba

numba cuda on matrix decomposition code in python [closed]

How to real-time filter with scipy and lfilter?

Improving a numpy implementation of a simple spring network

Categories

Resources