Rolling window PCA in python

I'm wondering if anyone knows how to implement a rolling/moving window PCA that reuses the calculated PCA when adding and removing measurements.
The idea is that I have a large set of data (measurement) over a very long time, and I would like to have a moving window (say, 200 days) starting at the beginning of my dataset and each step, I include the next day's measurement and throw out the last measurement, so my window is always 200 days long. However, I would not like to simply recalculate the PCA each time.
Is it possible to make an algorithm that is more efficient than simply calculating the PCA for each window independently? Thanks in advance!

A complete answer depends on a lot of factors. I'll cover what I think are the most important such factors, and hopefully that'll be enough information to point you in the right direction.
First, directly answering your question, yes it is possible to make an algorithm that is more efficient than simply calculating the PCA for each window independently.
Improving the Naive PCA Algorithm (low-dimensional inputs)
As a first pass at the problem, let's assume that you're doing a naive PCA calculation with no normalization (i.e., you're leaving the data alone, computing the covariance matrix, and finding that matrix's eigenvalues/eigenvectors).
When faced with an input matrix X whose PCA we want to compute, the naive algorithm first computes the covariance matrix W = X.T @ X. Once we've computed that for some window of 200 elements, we can cheaply add or remove elements from consideration from the original data set by adding or removing their contribution to the covariance.
"""
W: shape (p, p)
row: shape (1, p)
"""
def add_row(W, row):
    return W + (row.T @ row)
def remove_row(W, row):
    return W - (row.T @ row)
Your description of a sliding window is equivalent to removing a row and adding a new one, so we can quickly compute a new covariance matrix using O(p^2) computations rather than the O(n p^2) a typical matrix multiply would take (with n==200 for this problem).
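As a rough illustration of the row-swap update described above, here is a minimal sketch that slides a 200-row window over made-up data and checks the incrementally updated matrix against a direct recomputation. The function name `slide_gram` and the random data are assumptions for the example, not part of the original answer.

```python
import numpy as np

def slide_gram(W, old_row, new_row):
    """Swap one row's contribution in W = X.T @ X; rows have shape (1, p)."""
    return W - old_row.T @ old_row + new_row.T @ new_row

rng = np.random.default_rng(0)
data = rng.standard_normal((250, 4))   # hypothetical data: 250 days, p = 4
n = 200                                # window length

W = data[:n].T @ data[:n]              # one full O(n p^2) product up front
for start in range(1, data.shape[0] - n + 1):
    old_row = data[start - 1 : start]          # row leaving the window
    new_row = data[start + n - 1 : start + n]  # row entering the window
    W = slide_gram(W, old_row, new_row)        # O(p^2) work per step

# Compare with recomputing the last window from scratch
window = data[-n:]
max_drift = np.abs(W - window.T @ window).max()
```

Rounding errors accumulate slowly over many updates, so for very long series it is probably worth recomputing W from scratch every so often.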
The covariance matrix isn't the final answer though, and we still need to find the principal components. If you aren't hand-rolling the eigensolver yourself there isn't a lot to be done -- you'll pay the cost for new eigenvalues and eigenvectors every time.
However, if you are writing your own eigensolver, most such methods accept a starting input and iterate till some termination condition (usually a max number of iterations or if the error becomes low enough, whichever you hit first). Swapping out a single data point isn't likely to drastically alter the principal components, so for typical data one might expect that re-using the existing eigenvalues/eigenvectors as inputs into the eigensolver would allow you to terminate in far fewer iterations than when starting from randomized inputs, affording an additional speedup.
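To make the warm-start idea concrete, here is a minimal sketch using orthogonal (subspace) iteration, which is just one simple choice of iterative eigensolver and not necessarily the one you would use in practice. The function name, tolerance, and synthetic data are all assumptions for illustration.

```python
import numpy as np

def subspace_iteration(W, Q0, tol=1e-10, max_iter=10_000):
    """Orthogonal iteration for the leading eigenpairs of a symmetric PSD matrix W.

    Q0 has shape (p, k) and is the starting guess for the leading subspace.
    Returns (eigenvalues in descending order, eigenvectors, iterations used).
    """
    Q, _ = np.linalg.qr(Q0)
    for it in range(1, max_iter + 1):
        Q_new, _ = np.linalg.qr(W @ Q)
        # Compare projectors so that sign flips of columns do not matter
        change = np.linalg.norm(Q_new @ Q_new.T - Q @ Q.T)
        Q = Q_new
        if change < tol:
            break
    # Rayleigh-Ritz step: solve the small k x k problem inside the subspace
    evals, V = np.linalg.eigh(Q.T @ W @ Q)
    order = np.argsort(evals)[::-1]
    return evals[order], Q @ V[:, order], it

rng = np.random.default_rng(1)
scales = np.array([4.0, 3.0, 2.0, 1.5, 1.0, 0.5])
data = rng.standard_normal((201, 6)) * scales   # hypothetical measurements
W_old = data[:200].T @ data[:200]               # previous window
W_new = data[1:].T @ data[1:]                   # window after sliding by one row
k = 2

_, Q_old, _ = subspace_iteration(W_old, rng.standard_normal((6, k)))
vals_cold, _, cold_iters = subspace_iteration(W_new, rng.standard_normal((6, k)))
vals_warm, _, warm_iters = subspace_iteration(W_new, Q_old)   # warm start
```

Comparing `warm_iters` with `cold_iters` shows how much the previous window's components help; the saving depends on how well separated the eigenvalues are and on how much one swapped row perturbs the matrix.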
Improving Covariance-Free Algorithms (high-dimensional inputs)
Usually (maybe always?), covariance-free PCA algorithms have some kind of iterated solver (much like an eigensolver), but they have computational shortcuts that allow finding eigenvalues/eigenvectors without explicitly materializing the covariance matrix.
Any individual such method might have additional tricks that allow you to save some information from one window to the next, but in general one would expect that you can reduce the total number of iterations simply by re-using the existing principal components instead of using random inputs to start the solver (much like in the eigensolver case above).
Window Normalization w/ Naive Algorithm
Supposing you're normalizing each window to have a mean of 0 in each column (common in PCA), you'll have some additional work when modifying the covariance matrix.
First I'll assume you already have a rolling mechanism for keeping track of any differences that need to be applied from one window to the next. If not, consider something like the following:
"""
We're lazy and don't want to handle a change in sample
size, so only work with row swaps -- good enough for
a sliding window.
old_row: shape (1, p)
new_row: shape (1, p)
"""
def replaced_row_mean_adjustment(old_row, new_row):
    return (new_row - old_row)/200. # whatever your window size is
The effect on the covariance matrix isn't too bad to compute, but I'll put some code here anyway.
"""
W: shape (p, p)
center: shape (1, p)
exactly equal to the mean diff vector we referenced above
X: shape (200, p)
exactly equal to the window you're examining after any
previous mean centering has been applied, but before the
current centering has happened. Note that we only use
its column sums, so you could get away with
a rolling computation for those instead, but that's
a little more code, and I want to leave at least some
of the problem for you to work on
"""
import numpy as np
def update_covariance(W, center, X):
    col_sums = np.sum(X, axis=0).reshape(1, -1)  # shape (1, p)
    result = W.copy()  # avoid modifying the caller's matrix in place
    result -= center.T @ col_sums
    result -= col_sums.T @ center
    result += 200 * center.T @ center # or whatever your window size is
    return result
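The identity behind this kind of update is (X - 1c).T @ (X - 1c) = X.T @ X - s.T @ c - c.T @ s + n * c.T @ c, where s holds the column sums of X and c is the shift subtracted from every row. A small sketch to check it numerically, with the function name `center_gram` and the random window being assumptions for the example:

```python
import numpy as np

def center_gram(W, col_sums, c, n):
    """Gram matrix of (X - c), given W = X.T @ X and the column sums of X.

    W: (p, p); col_sums: (1, p); c: (1, p) shift subtracted from every row;
    n: number of rows in the window.
    """
    return W - c.T @ col_sums - col_sums.T @ c + n * (c.T @ c)

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.standard_normal((n, p)) + 5.0       # hypothetical window with nonzero mean
W = X.T @ X
col_sums = X.sum(axis=0, keepdims=True)
c = col_sums / n                            # shifting by the mean centers the window

W_centered = center_gram(W, col_sums, c, n)
X_centered = X - c                          # direct centering, for comparison
```

Only W and the column sums are needed, so both can be maintained with rolling updates instead of touching the whole window.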
Rescaling to have a standard deviation of 1 is also common in PCA. That's pretty easy to accommodate as well.
"""
Updates the covariance matrix assuming you're modifying a window
of data X with shape (200, p) by multiplying each column by
its corresponding element in v. A rolling algorithm to compute
v isn't covered here, but it shouldn't be hard to figure out.
W: shape (p, p)
v: shape (1, p)
"""
def update_covariance(W, v):
    return W * (v.T @ v) # Note that this is element-wise multiplication of W
Window Normalization w/ Covariance-free Algorithm
The tricks that you have available here will vary quite a bit depending on the algorithm that you're using, but the general strategy I'd try first is to use a rolling algorithm to keep track of the mean and standard deviation for each column for the current window and to modify the iterative solver to take that into account (i.e., given a window X you want to iterate on the rescaled window Y -- substitute Y=a*X+b into the iterative algorithm of your choice and simplify symbolically to hopefully yield a version with a small additional constant cost).
As before you'll want to re-use any principal components you find instead of using a random initialization vector for each window.
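For the rolling mean and standard deviation mentioned above, one simple option is to keep running sums and sums of squares per column. This is a minimal sketch under that assumption; the class name `RollingStats` is made up, and it uses the population standard deviation (dividing by n), which matches NumPy's default.

```python
import numpy as np

class RollingStats:
    """Per-column mean and standard deviation over a fixed-length window."""

    def __init__(self, window):
        self.n = window.shape[0]
        self.s1 = window.sum(axis=0)           # running sum per column
        self.s2 = (window ** 2).sum(axis=0)    # running sum of squares per column

    def swap(self, old_row, new_row):
        """Replace one row of the window; both rows have shape (p,)."""
        self.s1 += new_row - old_row
        self.s2 += new_row ** 2 - old_row ** 2

    @property
    def mean(self):
        return self.s1 / self.n

    @property
    def std(self):
        var = self.s2 / self.n - self.mean ** 2
        return np.sqrt(np.maximum(var, 0.0))   # guard against tiny negative values

rng = np.random.default_rng(5)
data = rng.standard_normal((260, 3)) * [1.0, 2.0, 3.0] + [0.0, 10.0, -5.0]
n = 200
stats = RollingStats(data[:n])
for start in range(1, data.shape[0] - n + 1):
    stats.swap(data[start - 1], data[start + n - 1])

window = data[-n:]   # the window the rolling statistics should now describe
```

The sum-of-squares formula can lose precision when the mean is large relative to the spread; a Welford-style update is a more robust alternative in that situation.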

Related

theano gradient with respect to matrix row

As the question suggests, I would like to compute the gradient with respect to a matrix row. In code:
import numpy.random as rng
import theano.tensor as T
from theano import function
t_x = T.matrix('X')
t_w = T.matrix('W')
t_y = T.dot(t_x, t_w.T)
t_g = T.grad(t_y[0,0], t_x[0]) # my wish, but DisconnectedInputError
t_g = T.grad(t_y[0,0], t_x) # no problems, but a lot of unnecessary zeros
f = function([t_x, t_w], [t_y, t_g])
y,g = f(rng.randn(2,5), rng.randn(7,5))
As the comments indicate, the code works without any problems when I compute the gradient with respect to the entire matrix. In this case the gradient is correctly computed, but the problem is that the result has only non-zero entries in row 0 (because other rows of x obviously do not appear in the equations for the first row of y).
I have found this question, suggesting to store all rows of the matrix in separate variables and build graphs from these variables. In my setting though, I have no idea how many rows might be in X.
Would anybody have an idea how to get the gradient with respect to a single row of a matrix or how I could omit the extra zeros in the output? If anybody would have suggestions how an arbitrary amount of vectors can be stacked, that should work as well, I guess.

I realised that it is possible to get rid of the zeros when computing derivatives with respect to the entries in row i:
t_g = T.grad(t_y[i,0], t_x)[i]
and for computing the Jacobian, I found out that
t_g = T.jacobian(t_y[i], t_x)[:,i]
does the trick. However it seems to have a rather heavy impact on computation speed.
It would also be possible to approach this problem mathematically. The Jacobian of the matrix multiplication t_y w.r.t. t_x is simply the transpose of t_w.T, which is t_w in this case (the transpose of the transpose is the original matrix). Thus, the computation would be as simple as
t_g = t_w
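The claim that the Jacobian of row i of the product with respect to row i of X is just W can be checked numerically without Theano. The following sketch uses plain NumPy and central differences; the helper `numerical_jacobian` is an assumption for illustration.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at the 1-D point x."""
    n_out = f(x).size
    J = np.zeros((n_out, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

rng = np.random.default_rng(3)
X = rng.standard_normal((2, 5))
W = rng.standard_normal((7, 5))
i = 0

def row_map(x_row):
    # y[i] depends only on x[i]: y[i] = x[i] @ W.T
    return x_row @ W.T

J = numerical_jacobian(row_map, X[i])   # shape (7, 5), should match W
```

Because the map is linear, the result does not depend on which row i is chosen or on the values in X.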

scipy.sparse.linalg.eigsh does not work for generalised eigenvalues

I'm working on a machine learning project which involves doing a Principal Component Analysis on some labeled data and using those labels to extract more valuable information from the data.
To do that, I'm calculating a scatter matrix for each class, and for each pair of classes I need to solve a generalised eigenvalue problem for their scatter matrices, as follows:
S_i * v = w * (S_j + b.I) * v
where b is a multiplier and I is the identity matrix. Now, this is the code in python:
jeigenvalues = eigsh(scatter_j, k=10, return_eigenvectors=False, maxiter=100)
print('eigenvalues made')
beta = betaMult*mean(jeigenvalues)
print(beta)
print(scatter_j+beta*eye(shape(x_data)[1]))
w, v = eigsh(scatter_i,M=scatter_j+beta*eye(shape(x_data)[1]),k=int(numberOfEVs/45), maxiter=100)
print(i,j,'done')
numberOfEVs is 90 in my current code (so that it's divisible by 45).
But the problem is, at the line where I use the eigsh for the aforementioned formula, it never gives me an answer. It keeps eating more and more memory without even completing a single iteration (I set its maxiter input to 1, and it still didn't give an answer). When I don't give the eigsh function the M argument (which is the matrix on the right side of the generalised EV problem and it is assumed to be "I" when not specified), it works correctly. But when M is provided, it becomes unresponsive.
Any ideas?
EDIT: The scatter matrices have rather small entries, mostly around 10^-5. I've also tried multiplying the left hand side by the inverse of the RHS matrix, and again it's having the same issue (goes on for a long time without an answer). Is the smallness of these entries the issue? How can I solve it, then?
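This does not diagnose why eigsh stalls, but if the scatter matrices are small enough to hold as dense arrays, one thing that might be worth trying is the dense solver scipy.linalg.eigh, which accepts a second matrix for the generalised problem. The sketch below uses random stand-ins for the scatter matrices, takes the mean over all eigenvalues of scatter_j rather than the top ten, and uses 0.1 as a placeholder for betaMult; all of those are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(4)
p = 30
A = rng.standard_normal((100, p)) * 1e-3
B = rng.standard_normal((100, p)) * 1e-3
scatter_i = A.T @ A                     # hypothetical stand-ins for the scatter matrices
scatter_j = B.T @ B                     # (entries are small, as in the question)

beta = 0.1 * np.mean(np.linalg.eigvalsh(scatter_j))   # 0.1 plays the role of betaMult
M = scatter_j + beta * np.eye(p)                      # symmetric positive definite

# Solve scatter_i @ v = w * M @ v; eigenvalues come back in ascending order
w, v = eigh(scatter_i, M)
k = 2
w_top, v_top = w[-k:], v[:, -k:]        # the k largest generalised eigenpairs

residual = np.abs(scatter_i @ v_top - (M @ v_top) * w_top).max()
```

The dense solver costs O(p^3) time and O(p^2) memory, so this only makes sense when p is modest.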

Best way to calculate the fundamental matrix of an absorbing Markov Chain?

I have a very large absorbing Markov chain (scales to problem size -- from 10 states to millions) that is very sparse (most states can react to only 4 or 5 other states).
I need to calculate one row of the fundamental matrix of this chain (the average frequency of each state given one starting state).
Normally, I'd do this by calculating (I - Q)^(-1), but I haven't been able to find a good library that implements a sparse matrix inverse algorithm! I've seen a few papers on it, most of them Ph.D. level work.
Most of my Google results point me to posts talking about how one shouldn't use a matrix inverse when solving linear (or non-linear) systems of equations... I don't find that particularly helpful. Is the calculation of the fundamental matrix similar to solving a system of equations, and I simply don't know how to express one in the form of the other?
So, I pose two specific questions:
What's the best way to calculate a row (or all the rows) of the inverse of a sparse matrix?
OR
What's the best way to calculate a row of the fundamental matrix of a large absorbing Markov chain?
A Python solution would be wonderful (as my project is still currently a proof-of-concept), but if I have to get my hands dirty with some good ol' Fortran or C, that's not a problem.
Edit: I just realized that the inverse B of matrix A can be defined as AB=I, where I is the identity matrix. That may allow me to use some standard sparse matrix solvers to calculate the inverse... I've got to run off, so feel free to complete my train of thought, which I'm starting to think might only require a really elementary matrix property...

Assuming that what you're trying to work out is the expected number of steps before absorption, the equation from "Finite Markov Chains" (Kemeny and Snell), which is reproduced on Wikipedia, is:
$$t = N \mathbf{1}$$
Or, expanding the fundamental matrix:
$$t = (I - Q)^{-1} \mathbf{1}$$
Rearranging:
$$(I - Q)\, t = \mathbf{1}$$
which is in the standard format for functions that solve systems of linear equations.
Putting this into practice to demonstrate the difference in performance (even for much smaller systems than those you're describing).
import networkx as nx
import numpy
def example(n):
    """Generate a very simple transition matrix from a directed graph
    """
    g = nx.DiGraph()
    for i in range(n-1):
        g.add_edge(i+1, i)
        g.add_edge(i, i+1)
    g.add_edge(n-1, n)
    g.add_edge(n, n)
    # to_numpy_matrix was removed in networkx 3.0; to_numpy_array returns an ndarray
    m = nx.to_numpy_array(g)
    # normalize rows to ensure m is a valid right stochastic matrix
    m = m / numpy.sum(m, axis=1, keepdims=True)
    return m
Presenting the two alternative approaches for calculating the number of expected steps.
def expected_steps_fundamental(Q):
    I = numpy.identity(Q.shape[0])
    N = numpy.linalg.inv(I - Q)
    o = numpy.ones(Q.shape[0])
    return numpy.dot(N, o)
def expected_steps_fast(Q):
    I = numpy.identity(Q.shape[0])
    o = numpy.ones(Q.shape[0])
    return numpy.linalg.solve(I - Q, o)
Picking an example that's big enough to demonstrate the types of problems that occur when calculating the fundamental matrix:
P = example(2000)
# drop the absorbing state
Q = P[:-1,:-1]
Produces the following timings:
%timeit expected_steps_fundamental(Q)
1 loops, best of 3: 7.27 s per loop
And:
%timeit expected_steps_fast(Q)
10 loops, best of 3: 83.6 ms per loop
Further experimentation is required to test the performance implications for sparse matrices, but it's clear that calculating the inverse is much much slower than what you might expect.
A similar approach to the one presented here can also be used for the variance of the number of steps
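For the sparse case raised in the question, a single row of the fundamental matrix can be obtained from one sparse solve: row i of N = (I - Q)^(-1) is the solution x of (I - Q).T x = e_i. The sketch below assumes SciPy's sparse direct solver and a small hand-built chain; `random_walk_Q` and `fundamental_row` are illustrative names, not library functions.

```python
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import spsolve

def random_walk_Q(n):
    """Transient block Q of a simple chain with n transient states.

    State 0 always moves to state 1; every other state moves one step down
    or up with probability 1/2, and the step up from state n-1 is absorbed.
    """
    upper = np.full(n - 1, 0.5)
    upper[0] = 1.0
    lower = np.full(n - 1, 0.5)
    return diags([lower, upper], offsets=[-1, 1], format="csc")

def fundamental_row(Q, i):
    """Row i of (I - Q)^(-1), computed without forming the inverse."""
    n = Q.shape[0]
    A = (identity(n, format="csc") - Q).T.tocsc()
    e_i = np.zeros(n)
    e_i[i] = 1.0
    return spsolve(A, e_i)

Q = random_walk_Q(200)
row0 = fundamental_row(Q, 0)            # expected visits to each state, starting from 0
expected_steps_from_0 = row0.sum()      # expected steps to absorption from state 0
```

For chains with millions of states an iterative solver may be a better fit than a direct factorization, depending on the structure of Q.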

The reason you're getting the advice not to use matrix inverses for solving equations is because of numerical stability. When your matrix has eigenvalues that are zero or near zero, you have problems either from lack of an inverse (if zero) or numerical stability (if near zero). The way to approach the problem, then, is to use an algorithm that doesn't require that an inverse exist. The solution is to use Gaussian elimination. This doesn't provide a full inverse, but rather gets you to row-echelon form, a generalization of upper-triangular form. If the matrix is invertible, then the last row of the result matrix contains a row of the inverse. So just arrange that the last row you eliminate on is the row you want.
I'll leave it to you to understand why I-Q is always invertible.

pseudo inverse of sparse matrix in python

I am working with data from neuroimaging and because of the large amount of data, I would like to use sparse matrices for my code (scipy.sparse.lil_matrix or csr_matrix).
In particular, I will need to compute the pseudo-inverse of my matrix to solve a least-square problem.
I have found the method sparse.lsqr, but it is not very efficient. Is there a method to compute the Moore-Penrose pseudo-inverse (corresponding to pinv for normal matrices)?
The size of my matrix A is about 600'000x2000 and in every row of the matrix I'll have from 0 up to 4 non zero values. The matrix A size is given by voxel x fiber bundle (white matter fiber tracts) and we are expecting maximum 4 tracts to cross in a voxel. In most of the white matter voxels we expect to have at least 1 tract, but I will say that around 20% of the lines could be zeros.
The vector b should not be sparse, actually b contains the measure for each voxel, which is in general not zero.
I would need to minimize the error, but there are also some conditions on the vector x. As I tried the model on smaller matrices, I never needed to constrain the system in order to satisfy these conditions (in general 0
Is that of any help? Is there a way to avoid taking the pseudo-inverse of A?
Thanks
Update 1st June:
thanks again for the help.
I can't really show you anything about my data, because the code in python gives me some problems. However, in order to understand how I could choose a good k I've tried to create a testing function in Matlab.
The code is as follows:
F=zeros(100000,1000);
for k=1:150000
    p=rand(1);
    a=0;
    b=0;
    while a<=0 || b<=0
        a=random('Binomial',100000,p);
        b=random('Binomial',1000,p);
    end
    F(a,b)=rand(1);
end
solution=repmat([0.5,0.5,0.8,0.7,0.9,0.4,0.7,0.7,0.9,0.6],1,100);
size(solution)
solution=solution';
measure=F*solution;
%check=pinvF*measure;
k=250;
F=sparse(F);
[U,S,V]=svds(F,k);
s=svds(F,k);
plot(s)
max(max(U*S*V'-F))
for s=1:k
    if S(s,s)~=0
        S(s,s)=1/S(s,s);
    end
end
inv=V*S'*U';
inv*measure
max(inv*measure-solution)
Do you have any idea of what k should be compared to the size of F? I've taken 250 (over 1000) and the results are not satisfactory (the waiting time is acceptable, but not short).
Also now I can compare the results with the known solution, but how could one choose k in general?
I also attached the plot of the 250 singular values that I get and their squares normalized. I don't know exactly how to better do a scree plot in Matlab. I'm now proceeding with bigger k to see if suddenly the value will be much smaller.
Thanks again,
Jennifer

You could study more on the alternatives offered in scipy.sparse.linalg.
Anyway, please note that a pseudo-inverse of a sparse matrix is most likely to be a (very) dense one, so it's not really a fruitful avenue (in general) to follow, when solving sparse linear systems.
You may like to describe your particular problem (dot(A, x)= b+ e) in slightly more detail. At least specify:
'typical' size of A
'typical' percentage of nonzero entries in A
least-squares implies that norm(e) is minimized, but please indicate whether your main interest is on x_hat or on b_hat, where e= b- b_hat and b_hat= dot(A, x_hat)
Update: If you have some idea of the rank of A (and it's much smaller than the number of columns), you could try the total least squares method. Here is a simple implementation, where k is the number of first singular values and vectors to use (i.e. 'effective' rank).
from scipy.sparse import hstack
from scipy.sparse.linalg import svds
def tls(A, b, k= 6):
    """A tls solution of Ax= b, for sparse A."""
    u, s, v= svds(hstack([A, b]), k)
    return v[-1, :-1]/ -v[-1, -1]

Regardless of the answer to my comment, I would think you could accomplish this fairly easily using the Moore-Penrose SVD representation. Find the SVD with scipy.sparse.linalg.svds, replace Sigma by its pseudoinverse, and then multiply V*Sigma_pi*U' to find the pseudoinverse of your original matrix.
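A minimal sketch of that SVD route follows, applying the truncated pseudoinverse to b directly so the (dense) pseudoinverse matrix is never formed. The function name, the rcond cutoff, and the low-rank test matrix are assumptions; with only k singular triplets the result is an approximation unless A really has rank k or less.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def truncated_pinv_solve(A, b, k, rcond=1e-10):
    """Approximate pinv(A) @ b from the k largest singular triplets of sparse A."""
    u, s, vt = svds(A, k=k)
    keep = s > rcond * s.max()             # drop numerically zero singular values
    coeffs = (u[:, keep].T @ b) / s[keep]  # Sigma^+ U' b
    return vt[keep].T @ coeffs             # V Sigma^+ U' b

rng = np.random.default_rng(6)
# Hypothetical test matrix of exact rank 3 (svds requires k < min(A.shape))
A_dense = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 8))
A = csr_matrix(A_dense)
b = rng.standard_normal(40)

x = truncated_pinv_solve(A, b, k=3)        # minimum-norm least-squares solution
```

Choosing k in general remains a judgment call; looking for a sharp drop in the singular values, as in the question's scree plot, is one common heuristic.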

Using strides for an efficient moving average filter

I recently learned about strides in the answer to this post, and was wondering how I could use them to compute a moving average filter more efficiently than what I proposed in this post (using convolution filters).
This is what I have so far. It takes a view of the original array then rolls it by the necessary amount and sums the kernel values to compute the average. I am aware that the edges are not handled correctly, but I can take care of that afterward... Is there a better and faster way? The objective is to filter large floating point arrays up to 5000x5000 x 16 layers in size, a task that scipy.ndimage.filters.convolve is fairly slow at.
Note that I am looking for 8-neighbour connectivity, that is a 3x3 filter takes the average of 9 pixels (8 around the focal pixel) and assigns that value to the pixel in the new image.
import numpy, scipy
filtsize = 3
a = numpy.arange(100).reshape((10,10))
b = numpy.lib.stride_tricks.as_strided(a, shape=(a.size,filtsize), strides=(a.itemsize, a.itemsize))
for i in range(0, filtsize-1):
    if i > 0:
        b += numpy.roll(b, -(pow(filtsize,2)+1)*i, 0)
filtered = (numpy.sum(b, 1) / pow(filtsize,2)).reshape((a.shape[0],a.shape[1]))
scipy.misc.imsave("average.jpg", filtered)
EDIT Clarification on how I see this working:
Current code:
1) Use stride_tricks to generate an array like [[0,1,2],[1,2,3],[2,3,4]...] which corresponds to the top row of the filter kernel.
2) Roll along the vertical axis to get the middle row of the kernel [[10,11,12],[11,12,13],[12,13,14]...] and add it to the array I got in 1).
3) Repeat to get the bottom row of the kernel [[20,21,22],[21,22,23],[22,23,24]...]. At this point, I take the sum of each row and divide it by the number of elements in the filter, giving me the average for each pixel (shifted by 1 row and 1 col, and with some oddities around edges, but I can take care of that later).
What I was hoping for is a better use of stride_tricks to get the 9 values or the sum of the kernel elements directly, for the entire array, or that someone can convince me of another more efficient method...

For what it's worth, here's how you'd do it using "fancy" striding tricks. I was going to post this yesterday, but got distracted by actual work! :)
@Paul & @eat both have nice implementations using various other ways of doing this. Just to continue things from the earlier question, I figured I'd post the N-dimensional equivalent.
You're not going to be able to significantly beat scipy.ndimage functions for >1D arrays, however. (scipy.ndimage.uniform_filter should beat scipy.ndimage.convolve, though)
Moreover, if you're trying to get a multidimensional moving window, you risk having memory usage blow up whenever you inadvertently make a copy of your array. While the initial "rolling" array is just a view into the memory of your original array, any intermediate steps that copy the array will make a copy that is orders of magnitude larger than your original array (i.e. Let's say that you're working with a 100x100 original array... The view into it (for a filter size of (3,3)) will be 98x98x3x3 but use the same memory as the original. However, any copies will use the amount of memory that a full 98x98x3x3 array would!!)
Basically, using crazy striding tricks is great for when you want to vectorize moving window operations on a single axis of an ndarray. It makes it really easy to calculate things like a moving standard deviation, etc with very little overhead. When you want to start doing this along multiple axes, it's possible, but you're usually better off with more specialized functions. (Such as scipy.ndimage, etc)
At any rate, here's how you do it:
import numpy as np
def rolling_window_lastaxis(a, window):
    """Directly taken from Erik Rigtorp's post to numpy-discussion.
    <http://www.mail-archive.com/numpy-discussion@scipy.org/msg29450.html>"""
    if window < 1:
        raise ValueError("`window` must be at least 1.")
    if window > a.shape[-1]:
        raise ValueError("`window` is too long.")
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def rolling_window(a, window):
    if not hasattr(window, '__iter__'):
        return rolling_window_lastaxis(a, window)
    for i, win in enumerate(window):
        if win > 1:
            a = a.swapaxes(i, -1)
            a = rolling_window_lastaxis(a, win)
            a = a.swapaxes(-2, i)
    return a
filtsize = (3, 3)
a = np.zeros((10,10), dtype=float)  # np.float was removed from NumPy; the builtin float is equivalent
a[5:7,5] = 1
b = rolling_window(a, filtsize)
blurred = b.mean(axis=-1).mean(axis=-1)
So what we get when we do b = rolling_window(a, filtsize) is an 8x8x3x3 array, that's actually a view into the same memory as the original 10x10 array. We could have just as easily used different filter size along different axes or operated only along selected axes of an N-dimensional array (i.e. filtsize = (0,3,0,3) on a 4-dimensional array would give us a 6 dimensional view).
We can then apply an arbitrary function to the last axis repeatedly to effectively calculate things in a moving window.
However, because we're storing temporary arrays that are much bigger than our original array on each step of mean (or std or whatever), this is not at all memory efficient! It's also not going to be terribly fast, either.
The equivalent for ndimage is just:
blurred = scipy.ndimage.uniform_filter(a, filtsize, output=a)
This will handle a variety of boundary conditions, do the "blurring" in-place without requiring a temporary copy of the array, and be very fast. Striding tricks are a good way to apply a function to a moving window along one axis, but they're not a good way to do it along multiple axes, usually....
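On recent NumPy versions (1.20 and later) the same kind of windowed view is available through `numpy.lib.stride_tricks.sliding_window_view`, which avoids hand-computing strides. Here is a small sketch comparing it with `scipy.ndimage.uniform_filter` on the interior pixels, where edge handling plays no role; it is meant as an illustration rather than a benchmark.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from scipy.ndimage import uniform_filter

a = np.zeros((10, 10), dtype=float)
a[5:7, 5] = 1

# Read-only view of every 3x3 neighbourhood: shape (8, 8, 3, 3), no copy made
windows = sliding_window_view(a, (3, 3))
blurred_view = windows.mean(axis=(-2, -1))      # shape (8, 8), edges not handled

# ndimage filters the full array and applies a boundary mode at the edges
blurred_ndimage = uniform_filter(a, size=3)     # shape (10, 10)

# Away from the border the two results should agree
interior_matches = np.allclose(blurred_view, blurred_ndimage[1:-1, 1:-1])
```

The memory caveat above still applies: the view itself is free, but reductions over it create temporaries, so the ndimage route is usually the better choice for large multidimensional arrays.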
Just my $0.02, at any rate...

I'm not familiar enough with Python to write out code for that, but the two best ways to speed up convolutions are to either separate the filter or to use the Fourier transform.
Separated filter : Convolution is O(M*N), where M and N are number of pixels in the image and the filter, respectively. Since average filtering with a 3-by-3 kernel is equivalent to filtering first with a 3-by-1 kernel and then a 1-by-3 kernel, you can get (3+3)/(3*3) = ~30% speed improvement by consecutive convolution with two 1-d kernels (this obviously gets better as the kernel gets larger). You may still be able to use stride tricks here, of course.
Fourier Transform : conv(A,B) is equivalent to ifft(fft(A)*fft(B)), i.e. a convolution in direct space becomes a multiplication in Fourier space, where A is your image and B is your filter. Since the (element-wise) multiplication of the Fourier transforms requires that A and B are the same size, B is an array of size(A) with your kernel at the very center of the image and zeros everywhere else. To place a 3-by-3 kernel at the center of an array, you may have to pad A to odd size. Depending on your implementation of the Fourier transform, this can be a lot faster than the convolution (and if you apply the same filter multiple times, you can pre-compute fft(B), saving another 30% of computation time).
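Both ideas can be sketched in plain NumPy for a 3-by-3 mean filter. To keep the comparison exact, this example assumes wrap-around (periodic) edges, and it places the kernel at the array origin instead of the center so that no shift correction is needed after the inverse transform; both choices are simplifications for illustration.

```python
import numpy as np

def box_direct(a):
    """3x3 mean with wrap-around edges, as the sum of 9 shifted copies."""
    total = sum(np.roll(np.roll(a, dy, axis=0), dx, axis=1)
                for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return total / 9.0

def box_separable(a):
    """Same filter as a 1x3 pass followed by a 3x1 pass (6 shifts instead of 9)."""
    rows = sum(np.roll(a, d, axis=1) for d in (-1, 0, 1)) / 3.0
    return sum(np.roll(rows, d, axis=0) for d in (-1, 0, 1)) / 3.0

def box_fft(a):
    """Same filter as a pointwise product in Fourier space."""
    kernel = np.zeros(a.shape, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            kernel[dy, dx] = 1.0 / 9.0     # negative indices wrap around the origin
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(kernel)))

rng = np.random.default_rng(7)
image = rng.standard_normal((16, 12))      # hypothetical small image

direct = box_direct(image)
separable = box_separable(image)
via_fft = box_fft(image)
```

For a kernel this small the FFT route is unlikely to win on speed; it tends to pay off for large kernels or when the transformed kernel is reused across many images.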

Let's see:
It's not so clear from your question, but I'm assuming now that you'd like to significantly improve this kind of averaging.
import numpy as np
from numpy.lib import stride_tricks as st
def mf(A, k_shape= (3, 3)):
    m= A.shape[0]- 2
    n= A.shape[1]- 2
    strides= A.strides+ A.strides
    new_shape= (m, n, k_shape[0], k_shape[1])
    A= st.as_strided(A, shape= new_shape, strides= strides)
    return np.sum(np.sum(A, -1), -1)/ np.prod(k_shape)
if __name__ == '__main__':
    A= np.arange(100).reshape((10, 10))
    print(mf(A))
Now, what kind of performance improvements would you actually expect?
Update:
First of all, a warning: the code in its current state does not adapt properly to the 'kernel' shape. However, that's not my primary concern right now (the idea of how to adapt it properly is already there anyway).
I have just chosen the new shape of the 4D A intuitively; to me it really makes sense to think of a 2D 'kernel' centered on each grid position of the original 2D A.
But that 4D shaping may not actually be the 'best' one. I think the real problem here is the performance of summing. One should be able to find the 'best order' (of the 4D A) in order to fully utilize your machine's cache architecture. However, that order may not be the same for 'small' arrays, which kind of 'co-operate' with your machine's cache, and for larger ones, which don't (at least not in so straightforward a manner).
Update 2:
Here is a slightly modified version of mf. Clearly it's better to reshape to a 3D array first and then, instead of summing, just do a dot product (this also has the advantage that the kernel can be arbitrary). However, it's still some 3x slower (on my machine) than Paul's updated function.
def mf(A):
    k_shape= (3, 3)
    k= np.prod(k_shape)
    m= A.shape[0]- 2
    n= A.shape[1]- 2
    strides= A.strides* 2
    new_shape= (m, n)+ k_shape
    A= st.as_strided(A, shape= new_shape, strides= strides)
    w= np.ones(k)/ k
    return np.dot(A.reshape((m, n, -1)), w)

One thing I am confident needs to be fixed is your view array b.
It has a few items from unallocated memory, so you'll get crashes.
Given your new description of your algorithm, the first thing that needs fixing is the fact that you are striding outside the allocation of a:
bshape = (a.size-filtsize+1, filtsize)
bstrides = (a.itemsize, a.itemsize)
b = numpy.lib.stride_tricks.as_strided(a, shape=bshape, strides=bstrides)
Update
Because I'm still not quite grasping the method and there seems to be simpler ways to solve the problem, I'm just going to put this here:
A = numpy.arange(100, dtype=float).reshape((10,10))  # float, so the in-place division below is valid
shifts = [(-1,-1),(-1,0),(-1,1),(0,-1),(0,1),(1,-1),(1,0),(1,1)]
B = A[1:-1, 1:-1].copy()
for dx,dy in shifts:
    xstop = -1+dx or None
    ystop = -1+dy or None
    B += A[1+dx:xstop, 1+dy:ystop]
B /= 9
...which just seems like the straightforward approach. The only extraneous operation is that it has to allocate and populate B only once. All the addition, division and indexing has to be done regardless. If you are doing 16 bands, you still only need to allocate B once if your intent is to save an image. Even if this is no help, it might clarify why I don't understand the problem, or at least serve as a benchmark to time the speedups of other methods. This runs in 2.6 sec on my laptop on a 5k x 5k array of float64's, 0.5 of which is the creation of B.
