I have written some code in Python and wanted to improve it using Numba's function decorators. Using a just-in-time compiler works fine (`@jit`). However, when I tried to parallelize my code to speed it up even more, the program strangely runs slower than the non-parallelized version.
Here is my code:
@numba.njit(parallel=True)
def get_output(input, depth, input_size, kernel_size, input_depth, kernels, bias):
    output = np.zeros((depth, input_size[0] - kernel_size + 1, input_size[0] - kernel_size + 1))
    for k in numba.prange(depth):
        for i in numba.prange(input_depth):
            output[k] += valid_correlate(input[i], kernels[k][i])
        output[k] += bias[k]
    return output
@numba.njit(fastmath=True)
def apply_filter(mat, filter, point):
    result = 0
    end_point = (min(mat.shape[0], point[0] + filter.shape[0]),
                 min(mat.shape[1], point[1] + filter.shape[1]))
    point = (max(0, point[0]), max(0, point[1]))
    area = mat[point[0]:end_point[0], point[1]:end_point[1]]
    if filter.shape != area.shape:
        s_filter = filter[0:area.shape[0], 0:area.shape[1]]
    else:
        s_filter = filter
    i_result = np.multiply(area, s_filter)
    result = np.sum(i_result)
    return result
@numba.njit(nogil=True)
def valid_correlate(mat, filter):
    f_mat = np.zeros((mat.shape[0] - filter.shape[0] + 1, mat.shape[1] - filter.shape[0] + 1))
    for x in range(f_mat.shape[0]):
        for y in range(f_mat.shape[1]):
            f_mat[x, y] = apply_filter(mat, filter, (x, y))
    return f_mat
With `parallel=True` it takes about 0.07 seconds, while it takes about 0.056 seconds without it.
I can't seem to figure out the problem here and would be glad for any help!
Regards
-Eirik
As pointed out in the comments, creating temporary NumPy arrays is expensive, both because allocations do not scale in parallel and because memory-bound code does not scale either. Here is a modified (untested) version of the code:
@numba.njit(fastmath=True)
def apply_filter(mat, filter, point):
    end_point = (min(mat.shape[0], point[0] + filter.shape[0]),
                 min(mat.shape[1], point[1] + filter.shape[1]))
    point = (max(0, point[0]), max(0, point[1]))
    area = mat[point[0]:end_point[0], point[1]:end_point[1]]
    if filter.shape != area.shape:
        s_filter = filter[0:area.shape[0], 0:area.shape[1]]
    else:
        s_filter = filter
    res = 0.0
    # Replace the slow `np.multiply` call creating a temporary array
    # with an in-place accumulation loop
    for i in range(area.shape[0]):
        for j in range(area.shape[1]):
            res += area[i, j] * s_filter[i, j]
    return res
I assumed the shapes of `area` and `s_filter` are the same. If this is not the case, then please provide a complete reproducible example.
Also note that the first call is slow because of compilation time. One way to fix that is to provide the function signature. See the Numba documentation for more information about this.
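For illustration, here is a minimal sketch of eager compilation via an explicit signature (the toy function below is hypothetical, not code from the question); because the signature is known at decoration time, Numba compiles immediately instead of on the first call:

import numba
import numpy as np

# Hypothetical toy function: the explicit signature string makes Numba
# compile eagerly when the module loads, not lazily on the first call.
@numba.njit('float64(float64[:,:])', fastmath=True)
def mat_sum(mat):
    s = 0.0
    for i in range(mat.shape[0]):
        for j in range(mat.shape[1]):
            s += mat[i, j]
    return s

mat_sum(np.ones((4, 4)))  # no compilation pause on this first call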
By the way, the operation looks like a custom convolution. If so, then note that FFTs are known to speed this up a lot when the filter is relatively big. For a small filter the code can be faster if the compiler knows the size of the array at compile time. If the filter is relatively big and separable, then the whole computation can be split into much faster steps.
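As a rough illustration of the FFT route (a sketch assuming SciPy is available; `valid_correlate_fft` is my name, not from the question): correlation is convolution with the kernel flipped along both axes, and mode='valid' matches the output size of the original valid_correlate.

from scipy.signal import fftconvolve

def valid_correlate_fft(mat, filt):
    # Correlation = convolution with a doubly-flipped kernel;
    # mode='valid' matches the output shape of valid_correlate
    return fftconvolve(mat, filt[::-1, ::-1], mode='valid')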
I'm trying to learn more about the use of shared memory to improve performance in some CUDA kernels in Numba. For this I was looking at the matrix multiplication example in the Numba documentation and tried to implement it to see the gain.
This is my test implementation. I'm aware that the example in the documentation has some issues; I followed the fixes from Here and copied the fixed example code.
from timeit import default_timer as timer
import numba
from numba import cuda, jit, int32, int64, float64, float32
import numpy as np
from numpy import *
@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16
@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp
size = 1024*4
tpbx,tpby = 16, 16
tpb = (tpbx,tpby)
bpgx, bpgy = int(size/tpbx), int(size/tpby)
bpg = (bpgx, bpgy)
a_in = cuda.to_device(np.arange(size*size, dtype=np.float32).reshape((size, size)))
b_in = cuda.to_device(np.ones(size*size, dtype=np.float32).reshape((size, size)))
c_out1 = cuda.device_array_like(a_in)
c_out2 = cuda.device_array_like(a_in)
s = timer()
cuda.synchronize()
matmul[bpg,tpb](a_in, b_in, c_out1);
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
c_host1 = c_out1.copy_to_host()
print(c_host1)
s = timer()
cuda.synchronize()
fast_matmul[bpg,tpb](a_in, b_in, c_out2);
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
c_host2 = c_out2.copy_to_host()
print(c_host2)
The execution times of the above kernels are essentially the same; in fact, matmul was faster for some larger input matrices. I would like to know what I'm missing in order to see the gain the documentation suggests.
Thanks,
Bruno.
I made a performance mistake in the code I put in that other answer. I've now fixed it. In a nutshell this line:
tmp = 0.
caused Numba to create a 64-bit floating-point variable tmp. That triggered other arithmetic in the kernel to be promoted from 32-bit to 64-bit floating point, which is inconsistent with the rest of the arithmetic and also with the intent of the demonstration in the other answer. This error affects both kernels.
When I change it in both kernels to
tmp = float32(0.)
both kernels get noticeably faster, and on my GTX960 GPU, your test case shows that the shared code runs about 2x faster than the non-shared code (but see below).
The non-shared kernel also has a performance issue related to memory access patterns. Similar to the indices swap in that other answer, for this particular scenario only, we can rectify this problem simply by reversing the assigned indices:
j, i = cuda.grid(2)
in the non-shared kernel. This allows that kernel to perform approximately as well as it can, and with that change the shared kernel runs about 2x faster than the non-shared kernel. Without that additional change to the non-shared kernel, the performance of the non-shared kernel is much worse.
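For clarity, here is a sketch of the non-shared kernel with both fixes described above applied (the float32 accumulator plus the swapped grid indices); this is a sketch of the described changes, not verbatim code from the other answer:

from numba import cuda, float32

@cuda.jit
def matmul_fixed(A, B, C):
    # Swapped indices give coalesced global-memory access in this scenario
    j, i = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        # float32 accumulator keeps the arithmetic from being promoted to float64
        tmp = float32(0.)
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp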
I'm currently working on a project aimed at finding blurred regions using the Walsh-Hadamard transform. The basic idea is to extract a local patch pixel-wise and apply the Walsh-Hadamard transform to that local patch. To do the transform, I generate the Hadamard matrix H in advance and compute H×T(local_patch)×H_transpose. This operation costs 5 ms per pixel, which is time-consuming. I'm wondering whether there is some technique to speed up the matrix multiplication in NumPy/Python, or some other fast Walsh-Hadamard transform technique to replace H×T×H'. Any help would be appreciated.
for i in range(h):
    for j in range(w):
        local_patch_gray = gray_pad[i:i+patch_size, j:j+patch_size]
        local_patch_gray = local_patch_gray[1:, 1:]  # extract 2^n×2^n part
        local_patch_blur = blur_pad[i:i + patch_size, j:j + patch_size]
        local_patch_blur = local_patch_blur[1:, 1:]
        patch_WHT = np.dot(np.dot(H, local_patch_gray), H)
        blur_WHT = np.dot(np.dot(H, local_patch_blur), H)
        num = np.power(np.sum(np.power(np.abs(blur_WHT), p)), 1/p)
        denomi = np.power(np.sum(np.power(np.abs(patch_WHT), p)), 1/p)
        if denomi == 0:
            blur_map[i, j] = 0
            continue
        blur_map[i, j] = num / denomi
It sounds like this is a job for Numba, check out their 5-minute starting guide.
In short, Numba compiles a function on its first call into a fast-callable format, so that every subsequent call of the same function is at light speed. Numba also has options which can make function calls at ludicrous speed. The options that pertain to your example are likely fastmath and parallel.
As a starting point, here's what your new Numba function might look like (with the loops moved inside the function so Numba can compile and parallelize them):
from numba import njit, prange
import numpy as np

@njit(fastmath=True, parallel=True)
def lightning_fast_numba_function(gray_pad, blur_pad, H, p, h, w, patch_size):
    blur_map = np.zeros((h, w))
    for i in prange(h):
        for j in range(w):
            # np.ascontiguousarray because Numba's np.dot needs contiguous arrays
            local_patch_gray = np.ascontiguousarray(gray_pad[i:i+patch_size, j:j+patch_size][1:, 1:])  # extract 2^n×2^n part
            local_patch_blur = np.ascontiguousarray(blur_pad[i:i+patch_size, j:j+patch_size][1:, 1:])
            patch_WHT = np.dot(np.dot(H, local_patch_gray), H)
            blur_WHT = np.dot(np.dot(H, local_patch_blur), H)
            num = np.power(np.sum(np.power(np.abs(blur_WHT), p)), 1/p)
            denomi = np.power(np.sum(np.power(np.abs(patch_WHT), p)), 1/p)
            if denomi == 0:
                blur_map[i, j] = 0
            else:
                blur_map[i, j] = num / denomi
    return blur_map

blur_map = lightning_fast_numba_function(gray_pad, blur_pad, H, p, h, w, patch_size)
Another option you may consider is using np.nditer instead of range. But don't hesitate to cross-check options against NumPy's iteration docs.
Lastly, I noticed the Wikipedia article for your algorithm has a section on the fast transform, with Python code. You might find it useful.
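For reference, here is a minimal sketch of that fast transform (my own sketch, assuming the patches are 2^n×2^n and H is the unnormalized, Sylvester-ordered Hadamard matrix that scipy.linalg.hadamard produces); it replaces the two O(n^3) matrix products with O(n^2 log n) butterfly passes:

import numpy as np

def fwht2(patch):
    # Unnormalized fast Walsh-Hadamard transform along both axes;
    # for a Sylvester-ordered H this equals H @ patch @ H
    a = patch.astype(np.float64)
    for axis in (0, 1):
        a = np.swapaxes(a, 0, axis)
        n = a.shape[0]
        h = 1
        while h < n:
            for i in range(0, n, 2 * h):
                x = a[i:i + h].copy()
                y = a[i + h:i + 2 * h]
                a[i:i + h] = x + y
                a[i + h:i + 2 * h] = x - y
            h *= 2
        a = np.swapaxes(a, 0, axis)
    return a

# Quick check against the matrix-product version:
# H = scipy.linalg.hadamard(32); np.allclose(fwht2(P), H @ P @ H)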
I have a function that takes a [32, 32, 3] tensor, and outputs a [256,256,3] tensor.
Specifically, the function interprets the smaller array as if it were an .svg file, and 'renders' it to a 256x256 array as a canvas using this algorithm
For an explanation of WHY I would want to do this, see This question
The function behaves exactly as intended, until I try to include it in the training loop of a GAN. The current error I'm seeing is:
NotImplementedError: Cannot convert a symbolic Tensor (mul:0) to a numpy array.
A lot of other answers to similar errors seem to boil down to "You need to re-write the function using tensorflow, not numpy"
Here's the working code using numpy - is it possible to re-write it to exclusively use tensorflow functions?
def convert_to_bitmap(input_tensor, target, j):
    #implied conversion to nparray - the tensorflow docs seem to indicate this is okay, but the error is thrown here when training
    array = input_tensor
    outputArray = target
    output = target
    for i in range(32):
        col = float(array[i,0,j])
        if ((float(array[i,0,0]))+(float(array[i,0,1]))+(float(array[i,0,2]))/3)< 0:
            continue
        #slice only the red channel from the i line, multiply by 255
        red_array = array[i,:,0]*255
        #slice only the green channel, multiply by 255
        green_array = array[i,:,1]*255
        #combine and flatten them
        combined_array = np.dstack((red_array, green_array)).flatten()
        #remove the first two and last two indices of the combined array
        index = [0,1,62,63]
        clipped_array = np.delete(combined_array,index)
        #filter array to remove values less than 0
        filtered = clipped_array > 0
        filtered_array = clipped_array[filtered]
        #check array has an even number of values, delete the last index if it doesn't
        if len(filtered_array) % 2 == 0:
            pass
        else:
            filtered_array = np.delete(filtered_array,-1)
        #convert into a set of tuples
        l = filtered_array.tolist()
        t = list(zip(l, l[1:] + l[:1]))
        if not t:
            continue
        output = fill_polygon(t, outputArray, col)
    return(output)
The 'fill polygon' function is copied from the 'mahotas' library:
def fill_polygon(polygon, canvas, color):
    if not len(polygon):
        return
    min_y = min(int(y) for y,x in polygon)
    max_y = max(int(y) for y,x in polygon)
    polygon = [(float(y),float(x)) for y,x in polygon]
    if max_y < canvas.shape[0]:
        max_y += 1
    for y in range(min_y, max_y):
        nodes = []
        j = -1
        for i,p in enumerate(polygon):
            pj = polygon[j]
            if p[0] < y and pj[0] >= y or pj[0] < y and p[0] >= y:
                dy = pj[0] - p[0]
                if dy:
                    nodes.append( (p[1] + (y-p[0])/(pj[0]-p[0])*(pj[1]-p[1])) )
            elif p[0] == y:
                nodes.append(p[1])
            j = i
        nodes.sort()
        for n,nn in zip(nodes[::2],nodes[1::2]):
            nn += 1
            canvas[y, int(n):int(nn)] = color
    return(canvas)
NOTE: I'm not trying to get someone to convert the whole thing for me! There are some functions that are pretty obvious (tf.stack instead of np.dstack), but others that I don't even know how to start, like the last few lines of the fill_polygon function above.
Yes, you can actually do this: you can wrap a Python function in something called `tf.py_function`. It's a Python wrapper, but it's extremely slow in comparison to plain TensorFlow. TensorFlow (and CUDA, for example) is so fast because it uses techniques like vectorization, meaning you can rewrite many of the loops in terms of mathematical tensor operations, which are very fast.
In general:
If you want to use custom code as a custom layer, I would recommend you rethink the algebra behind those loops and try to express them differently. If it's just preprocessing before the training starts, you can use TensorFlow, but doing the same with NumPy and other libraries is easier.
To your main question: yes, it's possible, but better don't use loops. TensorFlow has a built-in loop optimizer, but then you have to use `tf.while_loop()` and that's annoying (maybe just for me). I just skimmed over your code, but it looks like you should be able to vectorize it quite well using the standard TensorFlow vocabulary. If you want it fast, I mean really fast with GPU support, write it all in TensorFlow, but nothing like 50/50 with `tf.convert_to_tensor()`, because then it's going to be slow again: you'd be switching between the GPU, the CPU, the plain Python interpreter, and the TensorFlow low-level API. Hope I could help you at least a bit.
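As a minimal hedged sketch of that wrapper route (np_double is a made-up toy, not from the question):

import numpy as np
import tensorflow as tf

def np_double(x):
    # Arbitrary NumPy code that TensorFlow cannot trace symbolically
    return (np.asarray(x) * 2.0).astype(np.float32)

@tf.function
def step(x):
    # Runs the NumPy code eagerly on the host: convenient, but slow and
    # invisible to graph optimizations and the GPU
    return tf.py_function(np_double, inp=[x], Tout=tf.float32)

print(step(tf.constant([1.0, 2.0])))  # tf.Tensor([2. 4.], ...)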
This code 'works', in that it only uses tensorflow functions, and does allow the model to train when used in a training loop:
def convert_image(x):
    #split off the first column of the generator output, and store it for later (remove the 'colours' column)
    colours_column = tf.slice(img_to_convert, tf.constant([0,0,0], dtype=tf.int32), tf.constant([32,1,3], dtype=tf.int32))
    #split off the rest of the data, only keeping R + G, and discarding B
    image_data_red = tf.slice(img_to_convert, tf.constant([0,1,0], dtype=tf.int32), tf.constant([32,31,1], dtype=tf.int32))
    image_data_green = tf.slice(img_to_convert, tf.constant([0,1,1], dtype=tf.int32), tf.constant([32, 31,1], dtype=tf.int32))
    #roll each row by 1 position, and make two more 2D tensors
    rolled_red = tf.roll(image_data_red, shift=-1, axis=0)
    rolled_green = tf.roll(image_data_green, shift=-1, axis=0)
    #remove all values where either the red OR green channels are 0
    zeroes = tf.constant(0, dtype=tf.float32)
    #this is for the 'count_nonzero' command
    boolean_red_data = tf.not_equal(image_data_red, zeroes)
    boolean_green_data = tf.not_equal(image_data_green, zeroes)
    initial_data_mask = tf.logical_and(boolean_red_data, boolean_green_data)
    #count non-zero values per row and flatten it
    count = tf.math.count_nonzero(initial_data_mask, 1)
    count_flat = tf.reshape(count, [-1])
    flat_red = tf.reshape(image_data_red, [-1])
    flat_green = tf.reshape(image_data_green, [-1])
    boolean_red = tf.math.logical_not(tf.equal(flat_red, tf.zeros_like(flat_red)))
    boolean_green = tf.math.logical_not(tf.equal(flat_green, tf.zeros_like(flat_red)))
    mask = tf.logical_and(boolean_red, boolean_green)
    flat_red_without_zero = tf.boolean_mask(flat_red, mask)
    flat_green_without_zero = tf.boolean_mask(flat_green, mask)
    # create a ragged tensor
    X0_ragged = tf.RaggedTensor.from_row_lengths(values=flat_red_without_zero, row_lengths=count_flat)
    Y0_ragged = tf.RaggedTensor.from_row_lengths(values=flat_green_without_zero, row_lengths=count_flat)
    #do the same for the rolled version
    rolled_data_mask = tf.roll(initial_data_mask, shift=-1, axis=1)
    flat_rolled_red = tf.reshape(rolled_red, [-1])
    flat_rolled_green = tf.reshape(rolled_green, [-1])
    #from SO "shift zeros to the end"
    boolean_rolled_red = tf.math.logical_not(tf.equal(flat_rolled_red, tf.zeros_like(flat_rolled_red)))
    boolean_rolled_green = tf.math.logical_not(tf.equal(flat_rolled_green, tf.zeros_like(flat_rolled_red)))
    rolled_mask = tf.logical_and(boolean_rolled_red, boolean_rolled_green)
    flat_rolled_red_without_zero = tf.boolean_mask(flat_rolled_red, rolled_mask)
    flat_rolled_green_without_zero = tf.boolean_mask(flat_rolled_green, rolled_mask)
    # create a ragged tensor
    X1_ragged = tf.RaggedTensor.from_row_lengths(values=flat_rolled_red_without_zero, row_lengths=count_flat)
    Y1_ragged = tf.RaggedTensor.from_row_lengths(values=flat_rolled_green_without_zero, row_lengths=count_flat)
    #available outputs for future use are:
    X0 = X0_ragged.to_tensor(default_value=0.)
    Y0 = Y0_ragged.to_tensor(default_value=0.)
    X1 = X1_ragged.to_tensor(default_value=0.)
    Y1 = Y1_ragged.to_tensor(default_value=0.)
    #Example tensor cel (replace with (x))
    P = tf.cast(x, dtype=tf.float32)
    #split out P.x and P.y, and fill a ragged tensor to the same shape as Rx
    Px_value = tf.cast(x, dtype=tf.float32) - tf.cast((tf.math.floor(x/255)*255), dtype=tf.float32)
    Py_value = tf.cast(tf.math.floor(x/255), dtype=tf.float32)
    Px = tf.squeeze(tf.ones_like(X0)*Px_value)
    Py = tf.squeeze(tf.ones_like(Y0)*Py_value)
    #for each pair of values (Y0, Y1), make a vector, and check to see if it crosses the y-value (Py) either up or down
    a = tf.math.less(Y0, Py)
    b = tf.math.greater_equal(Y1, Py)
    c = tf.logical_and(a, b)
    d = tf.math.greater_equal(Y0, Py)
    e = tf.math.less(Y1, Py)
    f = tf.logical_and(d, e)
    g = tf.logical_or(c, f)
    #Makes boolean bitwise mask
    #calculate the intersection of the line with the y-value, assuming it intersects
    #P.x <= (G.x - R.x) * (P.y - R.y) / (G.y - R.y + R.x) - use tf.divide_no_nan for safe divide
    h = tf.math.less(Px,(tf.math.divide_no_nan(((X1-X0)*(Py-Y0)),(Y1-Y0+X0))))
    #combine using AND with the mask above
    i = tf.logical_and(g,h)
    #tf.count_nonzero
    #reshape to make a column tensor with the same dimensions as the colours
    #divide by 2 using tf.floor_mod (returns remainder of division - any remainder means the value is odd, and hence the point is IN the polygon)
    final_count = tf.cast((tf.math.count_nonzero(i, 1)), dtype=tf.int32)
    twos = tf.ones_like(final_count, dtype=tf.int32)*tf.constant([2], dtype=tf.int32)
    divide = tf.cast(tf.math.floormod(final_count, twos), dtype=tf.int32)
    index = tf.cast(tf.range(0,32, delta=1), dtype=tf.int32)
    clipped_index = divide*index
    sort = tf.sort(clipped_index)
    reverse = tf.reverse(sort, [-1])
    value = tf.slice(reverse, [0], [1])
    pair = tf.constant([0], dtype=tf.int32)
    slice_tensor = tf.reshape(tf.stack([value, pair, pair], axis=0),[-1])
    output_colour = tf.slice(colours_column, slice_tensor, [1,1,3])
    return output_colour
This is where the convert_image function is applied using tf.vectorized_map:
def convert_images(image_to_convert):
    global img_to_convert
    img_to_convert = image_to_convert
    process_list = tf.reshape((tf.range(0,65536, delta=1, dtype=tf.int32)), [65536, 1])
    output_line = tf.vectorized_map(convert_image, process_list)
    output_line_squeezed = tf.squeeze(output_line)
    output_reshape = (tf.reshape(output_line_squeezed, [256,256,3])/127.5)-1
    output = tf.expand_dims(output_reshape, axis=0)
    return output
It is PAINFULLY slow, though - it does not appear to be using the GPU, and looks to be single-threaded as well.
I'm adding it as an answer to my own question because it clearly IS possible to do this numpy function entirely in tensorflow - it just probably shouldn't be done like this.
I am looking for an efficient way to implement a simple filter with one coefficient that is time-varying and specified by a vector with the same length as the input signal.
The following is a simple implementation of the desired behavior:
import numpy as np

def myfilter(signal, weights):
    output = np.empty_like(weights)
    val = signal[0]
    for i in range(len(signal)):
        val += weights[i]*(signal[i] - val)
        output[i] = val
    return output
weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
output = myfilter(signal, weights)
Is there a way to do this more efficiently with numpy or scipy?
You can trade in the overhead of the loop for a couple of additional ops:
import numpy as np

def myfilter(signal, weights):
    output = np.empty_like(weights)
    val = signal[0]
    for i in range(len(signal)):
        val += weights[i]*(signal[i] - val)
        output[i] = val
    return output

def vectorised(signal, weights):
    wp = np.r_[1, np.multiply.accumulate(1 - weights[1:])]
    sw = weights * signal
    sw[0] = signal[0]
    sws = np.add.accumulate(sw / wp)
    return wp * sws
weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
print(np.allclose(myfilter(signal, weights), vectorised(signal, weights)))
On my machine the vectorised version is several times faster. It uses a "closed form" solution of your recurrence equation.
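To spell out that closed form: the recurrence is v_i = (1 - w_i) * v_{i-1} + w_i * s_i. With the cumulative product p_i = (1 - w_1) * ... * (1 - w_i) (and p_0 = 1), dividing through by p_i gives

v_i / p_i = v_{i-1} / p_{i-1} + w_i * s_i / p_i

which is a plain cumulative sum, so v_n = p_n * sum over i <= n of w_i * s_i / p_i. That is exactly what wp, sw / wp and the accumulated sws compute, with sw[0] = signal[0] seeding v_0.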
Edit: For very long signals/weights (100,000 samples, say) this method doesn't work because of overflow. In that regime you can still save a bit (more than 50% on my machine) using the following trick, which has the added bonus that you needn't solve the recurrence formula, only invert it.
from scipy import linalg

def solver(signal, weights):
    rw = 1 / weights[1:]
    v = np.r_[1, rw, 1-rw, 0]
    v.shape = 2, -1
    return linalg.solve_banded((1, 0), v, signal)
This trick uses the fact that your recurrence is formally similar to a Gauss elimination on a matrix with only one nonvanishing subdiagonal. It piggybacks on a library function that specialises in doing precisely that.
Actually, quite proud of this one.
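For completeness, a quick sanity check of the banded-solver version against the reference loop (assuming myfilter and solver from above are in scope):

import numpy as np

weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
print(np.allclose(solver(signal, weights), myfilter(signal, weights)))  # True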
I'm trying to use fancy indexing instead of looping to speed up a function in Numpy. To the best of my knowledge, I've implemented the fancy indexing version correctly. The problem is that the two functions (loop and fancy-indexed) do not return the same result. I'm not sure why. It's worth pointing out that the functions do return the same result if a smaller array is used (e.g., 20 x 20 x 20).
Below I've included everything necessary to reproduce the error. If the functions do return the same result, then the line find_maxdiff(data) - find_maxdiff_fancy(data) should return an array full of zeroes.
from numpy import *
def rms(data, axis=0):
    return sqrt(mean(data ** 2, axis))

def find_maxdiff(data):
    samples, channels, epochs = shape(data)
    window_size = 50
    maxdiff = zeros(epochs)
    for epoch in xrange(epochs):
        signal = rms(data[:, :, epoch], axis=1)
        for t in xrange(window_size, alen(signal) - window_size):
            amp_a = mean(signal[t-window_size:t], axis=0)
            amp_b = mean(signal[t:t+window_size], axis=0)
            the_diff = abs(amp_b - amp_a)
            if the_diff > maxdiff[epoch]:
                maxdiff[epoch] = the_diff
    return maxdiff

def find_maxdiff_fancy(data):
    samples, channels, epochs = shape(data)
    window_size = 50
    maxdiff = zeros(epochs)
    signal = rms(data, axis=1)
    for t in xrange(window_size, alen(signal) - window_size):
        amp_a = mean(signal[t-window_size:t], axis=0)
        amp_b = mean(signal[t:t+window_size], axis=0)
        the_diff = abs(amp_b - amp_a)
        maxdiff[the_diff > maxdiff] = the_diff
    return maxdiff
data = random.random((600, 20, 100))
find_maxdiff(data) - find_maxdiff_fancy(data)
data = random.random((20, 20, 20))
find_maxdiff(data) - find_maxdiff_fancy(data)
The problem is this line:
maxdiff[the_diff > maxdiff] = the_diff
The left side selects only some elements of maxdiff, but the right side contains all elements of the_diff. This should work instead:
replaceElements = the_diff > maxdiff
maxdiff[replaceElements] = the_diff[replaceElements]
or simply:
maxdiff = maximum(maxdiff, the_diff)
As for why the 20x20x20 size seems to work: your window size is too large for it, so the inner loop never gets executed.
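A tiny demonstration of the fix with made-up values (note that recent NumPy versions raise a ValueError for the original mismatched assignment instead of silently truncating):

import numpy as np

maxdiff = np.array([0.5, 0.1, 0.9])
the_diff = np.array([0.2, 0.4, 0.3])
replace = the_diff > maxdiff          # [False, True, False]
maxdiff[replace] = the_diff[replace]  # assigns only the matching elements
print(maxdiff)                        # [0.5 0.4 0.9]
# equivalently: maxdiff = np.maximum(maxdiff, the_diff)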
First, in find_maxdiff_fancy your signal is now 2D if I understand correctly - so I think it would be clearer to index it explicitly (e.g. amp_a = mean(signal[t-window_size:t, :], axis=0)). Similarly with alen(signal) - this should just be samples in both cases, so I think it would be clearer to use that.
It goes wrong whenever you are actually doing something in the t loop - when samples < window_length, as in the 20x20x20 example, that loop never gets executed. As soon as that loop is executed more than once (i.e. samples > 2*window_length + 1), the errors appear. Not sure why though - they do look equivalent to me.