Python, fast computation of rolling percentile

Python, fast computation of rolling percentile - python

Given a multidimensional array, I want to compute a rolling percentile over one of its axes, with the rolling windows truncated near the boundaries of the array. Below is a minimal example implementation using only numpy via np.nanpercentile() applied to stacked, rolled (through np.roll()) arrays. However, the input array may be very large (~ 1 GB or more), so two issues arise:
For the current implementation, the stacked, rolled array may
not fit into RAM memory. Avoidable with for-loops over all axes
unaffected by the rolling, but may be slow.
Even fully vectorized (as below), the computation time is quite long,
understandably due to the sheer amount of computations performed.
Questions: Is there a more efficient python implementation of a rolling percentile (with axis/axes argument or the like and with truncated windows near the boundaries)?** If not, how could the computation be sped up (and, if possible, without exceeding the RAM)? C-code called from Python? Computation of percentiles at fewer "central" points, and approximation in between via (e.g. linear) interpolation? Other ideas?
Related post (implementing rolling percentiles): How to compute moving (or rolling, if you will) percentile/quantile for a 1d array in numpy? Issues are:
pandas implementation via pd.Series().rolling().quantile() works only for pd.Series or pd.DataFrame objects, not multidimensional (4D or arbitrary D) arrays;
implementation via np.lib.stride_tricks.as_strided() with np.nanpercentile() is similar to the one below and should not be much faster given that np.nanpercentile() is the speed bottleneck, see below
Minimal example implementation:
import numpy as np
np.random.seed(100)
# random array of numbers
a = np.random.rand(10000,1,70,70)
# size of rolling window
n_window = 150
# percentile to compute
p = 0.7
# NaN values to prepend/append to array before rolling
nan_temp = np.full(tuple([n_window] + list(np.array(a.shape)[1:])), fill_value=np.nan)
# prepend and append NaN values to array
a_temp = np.concatenate((nan_temp, a, nan_temp), axis=0)
# roll array, stack rolled arrays along new dimension, compute percentile (ignoring NaNs) using np.nanpercentile()
res = np.nanpercentile(np.concatenate([np.roll(a_temp, shift=i, axis=0)[...,None] for i in range(-n_window, n_window+1)],axis=-1),p*100,axis=-1)
# cut away the prepended/appended NaN values
res = res[n_window:-n_window]
Computation times (in seconds), example (for the case of a having a shape of (1000,1,70,70) instead of (10000,1,70,70)):
create random array: 0.0688176155090332
prepend/append NaN values: 0.03478217124938965
stack rolled arrays: 38.17830514907837
compute nanpercentile: 1145.1418626308441
cut out result: 0.0004646778106689453

Related

Pad trues in a binary matrix (more efficiently than with a binary dilation)

Problem statement:
I have a very large binary matrix, lets say with dimensions (1000000,500), for which I want to spread the existing trues along its columns.
More specifically, for each True in the matrix, I'd like to pad $N$ Trues along the column right after it.
One approach to solving this is with a binary_dilation from skimage or scipy.
For example, the following would pad two Trues after each True as desired.
import time
import numpy as np
from skimage import morphology
np.random.seed(0)
a = np.random.rand(1000000,500)
t1=time.time()
#in this case N=2
s = np.array([[0],
[0],
[1],
[1],
[1]])
out = morphology.binary_dilation(a > 0.95,s)
t2=time.time()
print(t2 - t1) #9.127300500869751
However, I suspect there's some redundancy in the operation performed by the dilation, since it's designed to solve a far more general problem. Moreover, the dilation's time scales linearly when increasing the number of columns, so I think there's got to be a faster way.
Any ideas for a more efficient approach?
Alternatively, a similar problem which I'm also interested in is the following:
Given a binary matrix, fill Trues between two Trues a distance of at most $N$ rows from each other and at most a single column from each other (if they are in neighboring columns, fill the column of the first true: the one with a smaller row number). I have used skimage.morphology.remove_small_holes iteratively over the columns, to partly solve this, but it doesn't handle neighboring columns, and again I think there's a more efficient approach.
Important context:
My goal here is to help connect nearby regions before labeling them with scipy.ndiamge.label or skimage.measure.label. Therefore, both the problems above are equivalent, since after labeling I can conjunct with the original binary matrix (the one before the padding) to remove all the trues that were not originally there.

Scipy spline interpolation: Determine array length of vector of knots / B-spline coefficients in tck before actual computation

Is it somehow possible to determine the array length of the arrays in the tck tuple returned by scipy.interpolate.splprep before computing the values?
I have to fit a spline interpolation to noisy data with 5 million data points (or less, can be varying).
My observation is that the interpolation at an array length of ~ 90 is pretty good, while it takes a long time to compute the interpolation for higher array lengths (it sometimes also directly jumps from ~ 90 to ~ 1000 while making s one step smaller and the interpolation also becomes noisy) and it is not appropriate enough, if the array length is far less (<50)...
Actually, this array length depends on the smoothing factor s provided to the splprep function, but for different measurement data, s varies a lot to get a consistent array length of around 90. E.g. for data1 s has a value of around 1000 to get len(cfk[0]) equals to 90, for data2 s has a value of around 100 to get len(cfk[0]) equals to 90 at same lengths of data1 and data2. It might be dependent on the noise of the data...
I have thought about a loop where s starts at some point and decreases through the loop while len(cfk[0]) is constantly being checked - but this takes ages, especially if len(cfk[0]) gets closer to 90.
Therefore, it would be useful to somehow know the smoothing factor to get the desired array length before computing the cfk tuple.

Short answer: no, not easily. Dierckx Fortran library, which splrep wraps, uses some fairly non-trivial logic for determining the knot vector, and it's all baked into the Fortran code. So, the only way is to carefully trace the latter. It's available from netlib, also scipy/interpolate/fitpack

Data comparison using numpy

I am trying to make an algorithm using just numpy (i saw others using PIL, but it has some drawbacks) that can compare and plot the difference between two maps that show ice levels from different years. I load the images and set NaNs to zero, as I have some.
data = np.load(filename)
data[np.isnan(data)]=0
The data arrays contain values between 0 and 100 and represent concentration levels (100 is the deep blue).
The data looks like this:
I am trying to compute the difference so that a loss in ice over time will correspond to a negative value, and a gain in ice will correspond to a positive value. The ice is denoted by the blue color in the plots above.
Any hints? Comparing element by element seems to be not the best idea...

To get the difference between 2 same sized numpy arrays of data, just take one from the other:
diff = img1 - img2
Numpy is basically a Python wrapper for an underlying C code base, designed for these sorts of operations. Although underneath it is comparing element to element (as you say above); it is significantly faster at these sorts of operations.

Find the most frequent element in a masked array

I need to find the most frequent element in a numpy array "label", only if those elements lie inside the mask array. Here is the brute force approach:
def getlabel(mask, label):
# get majority label
assert label.shape == mask.shape
tmp = []
for i in range(mask.shape[0]):
for j in range(mask.shape[1]):
if mask[i][j] == True:
tmp.append(label[i][j])
return Counter(tmp).most_common(1)[0][0]
However I don't think this is the most elegant and fastest approach yet. Which other data structures should I use? (hasing, dictionary, etc... )?

Assuming your mask is a boolean array:
import numpy as np
cnt = np.bincount(label[mask].flat)
This gives you a vector of number of occurrences of values 0, 1, 2, ... max(label)
You can find the most frequent then by
most_frequent = np.argmax(cnt)
And naturally, the number of these elements in your input data is
cnt[most_frequent]
Usually, np.bincount is fast. Let us try with labels with maximum number of 999 (i.e. 1000 bins) and a 10 000 000 element array masked by 8 000 000 values:
data = np.random.randint(0, 1000, (1000, 10000))
mask = np.random.random((1000, 10000)) < 0.8
# time this section
cnt = np.bincount(data[mask].flat)
With my machine this takes 80 ms. The argmax takes maybe 2 ns/bin, so even if your label integers are a bit scattered, it does not really matter.
This approach is probably the fastest approach if the following conditions hold:
the labels are integers within range 0..N, where N is not much more than the size of the input array
the input data is in a NumPy array
This solution may be applied to some other cases, but then it is more a question of how and whether there are better solutions available. (See metaperture's answer.) For example, a simple conversion of a Python list into ndarray is rather costly, and the speed benefit gained by bincount will be lost if the input is a Python list, and the amount of data is not large.
The sparsity of labels in the integer space is not a problem per se. Creating and zeroing the output vector is relatively fast, and it is easy and fast to compress back with np.nonzero. However, if the maximum label value is large compared to the size of the input array, then the speed benefit may be lost.

np.bincount is not a general approach.np.bincount will be faster for bounded, low entropy, discrete distributions. However, it will fail:
if the distribution is unbounded, the memory used is unbounded (can be arbitrarily large for an arbitrarily small input array)
if the distribution is continuous, the argmax of bincount is not the mode (technically it's the MAP of a KDE, where the KDE is generated using histogram-like methods)
if the distribution has high entropy/dispersal, then the bin-based representation of np.bincount doesn't make sense (won't fail but will just be worse)
For a general solution, you should do one of:
cnt = Counter((l for m, l in zip(mask.flat, label.flat) if m)) # or...
cnt = Counter(label[mask].flat)
Or:
scipy.stats.mode(label[mask].flat)
In my testing the former is ~20x faster. If you know the distribution is discrete with a relatively low bound and entropy then bincount will be faster.
If the above is not fast enough, a better general approach than bincount is to sample your data
collections.Counter(np.random.choice(data[mask], 1000)).most_common(1)
scipy.stats.mode(np.random.choice(data[mask], 1000))
Both of the above are an order of magnitude faster than the unsampled versions and converge to the mode quickly for even the most pathological distributions.

Using strides for an efficient moving average filter

I recently learned about strides in the answer to this post, and was wondering how I could use them to compute a moving average filter more efficiently than what I proposed in this post (using convolution filters).
This is what I have so far. It takes a view of the original array then rolls it by the necessary amount and sums the kernel values to compute the average. I am aware that the edges are not handled correctly, but I can take care of that afterward... Is there a better and faster way? The objective is to filter large floating point arrays up to 5000x5000 x 16 layers in size, a task that scipy.ndimage.filters.convolve is fairly slow at.
Note that I am looking for 8-neighbour connectivity, that is a 3x3 filter takes the average of 9 pixels (8 around the focal pixel) and assigns that value to the pixel in the new image.
import numpy, scipy
filtsize = 3
a = numpy.arange(100).reshape((10,10))
b = numpy.lib.stride_tricks.as_strided(a, shape=(a.size,filtsize), strides=(a.itemsize, a.itemsize))
for i in range(0, filtsize-1):
if i > 0:
b += numpy.roll(b, -(pow(filtsize,2)+1)*i, 0)
filtered = (numpy.sum(b, 1) / pow(filtsize,2)).reshape((a.shape[0],a.shape[1]))
scipy.misc.imsave("average.jpg", filtered)
EDIT Clarification on how I see this working:
Current code:
use stride_tricks to generate an array like [[0,1,2],[1,2,3],[2,3,4]...] which corresponds to the top row of the filter kernel.
Roll along the vertical axis to get the middle row of the kernel [[10,11,12],[11,12,13],[13,14,15]...] and add it to the array I got in 1)
Repeat to get the bottom row of the kernel [[20,21,22],[21,22,23],[22,23,24]...]. At this point, I take the sum of each row and divide it by the number of elements in the filter, giving me the average for each pixel, (shifted by 1 row and 1 col, and with some oddities around edges, but I can take care of that later).
What I was hoping for is a better use of stride_tricks to get the 9 values or the sum of the kernel elements directly, for the entire array, or that someone can convince me of another more efficient method...

For what it's worth, here's how you'd do it using "fancy" striding tricks. I was going to post this yesterday, but got distracted by actual work! :)
#Paul & #eat both have nice implementations using various other ways of doing this. Just to continue things from the earlier question, I figured I'd post the N-dimensional equivalent.
You're not going to be able to significantly beat scipy.ndimage functions for >1D arrays, however. (scipy.ndimage.uniform_filter should beat scipy.ndimage.convolve, though)
Moreover, if you're trying to get a multidimensional moving window, you risk having memory usage blow up whenever you inadvertently make a copy of your array. While the initial "rolling" array is just a view into the memory of your original array, any intermediate steps that copy the array will make a copy that is orders of magnitude larger than your original array (i.e. Let's say that you're working with a 100x100 original array... The view into it (for a filter size of (3,3)) will be 98x98x3x3 but use the same memory as the original. However, any copies will use the amount of memory that a full 98x98x3x3 array would!!)
Basically, using crazy striding tricks is great for when you want to vectorize moving window operations on a single axis of an ndarray. It makes it really easy to calculate things like a moving standard deviation, etc with very little overhead. When you want to start doing this along multiple axes, it's possible, but you're usually better off with more specialized functions. (Such as scipy.ndimage, etc)
At any rate, here's how you do it:
import numpy as np
def rolling_window_lastaxis(a, window):
"""Directly taken from Erik Rigtorp's post to numpy-discussion.
<http://www.mail-archive.com/numpy-discussion#scipy.org/msg29450.html>"""
if window < 1:
raise ValueError, "`window` must be at least 1."
if window > a.shape[-1]:
raise ValueError, "`window` is too long."
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def rolling_window(a, window):
if not hasattr(window, '__iter__'):
return rolling_window_lastaxis(a, window)
for i, win in enumerate(window):
if win > 1:
a = a.swapaxes(i, -1)
a = rolling_window_lastaxis(a, win)
a = a.swapaxes(-2, i)
return a
filtsize = (3, 3)
a = np.zeros((10,10), dtype=np.float)
a[5:7,5] = 1
b = rolling_window(a, filtsize)
blurred = b.mean(axis=-1).mean(axis=-1)
So what we get when we do b = rolling_window(a, filtsize) is an 8x8x3x3 array, that's actually a view into the same memory as the original 10x10 array. We could have just as easily used different filter size along different axes or operated only along selected axes of an N-dimensional array (i.e. filtsize = (0,3,0,3) on a 4-dimensional array would give us a 6 dimensional view).
We can then apply an arbitrary function to the last axis repeatedly to effectively calculate things in a moving window.
However, because we're storing temporary arrays that are much bigger than our original array on each step of mean (or std or whatever), this is not at all memory efficient! It's also not going to be terribly fast, either.
The equivalent for ndimage is just:
blurred = scipy.ndimage.uniform_filter(a, filtsize, output=a)
This will handle a variety of boundary conditions, do the "blurring" in-place without requiring a temporary copy of the array, and be very fast. Striding tricks are a good way to apply a function to a moving window along one axis, but they're not a good way to do it along multiple axes, usually....
Just my $0.02, at any rate...

I'm not familiar enough with Python to write out code for that, but the two best ways to speed up convolutions is to either separate the filter or to use the Fourier transform.
Separated filter : Convolution is O(M*N), where M and N are number of pixels in the image and the filter, respectively. Since average filtering with a 3-by-3 kernel is equivalent to filtering first with a 3-by-1 kernel and then a 1-by-3 kernel, you can get (3+3)/(3*3) = ~30% speed improvement by consecutive convolution with two 1-d kernels (this obviously gets better as the kernel gets larger). You may still be able to use stride tricks here, of course.
Fourier Transform : conv(A,B) is equivalent to ifft(fft(A)*fft(B)), i.e. a convolution in direct space becomes a multiplication in Fourier space, where A is your image and B is your filter. Since the (element-wise) multiplication of the Fourier transforms requires that A and B are the same size, B is an array of size(A) with your kernel at the very center of the image and zeros everywhere else. To place a 3-by-3 kernel at the center of an array, you may have to pad A to odd size. Depending on your implementation of the Fourier transform, this can be a lot faster than the convolution (and if you apply the same filter multiple times, you can pre-compute fft(B), saving another 30% of computation time).

Lets see:
It's not so clear form your question, but I'm assuming now that you'll like to improve significantly this kind of averaging.
import numpy as np
from numpy.lib import stride_tricks as st
def mf(A, k_shape= (3, 3)):
m= A.shape[0]- 2
n= A.shape[1]- 2
strides= A.strides+ A.strides
new_shape= (m, n, k_shape[0], k_shape[1])
A= st.as_strided(A, shape= new_shape, strides= strides)
return np.sum(np.sum(A, -1), -1)/ np.prod(k_shape)
if __name__ == '__main__':
A= np.arange(100).reshape((10, 10))
print mf(A)
Now, what kind of performance improvements you would actually expect?
Update:
First of all, a warning: the code in it's current state does not adapt properly to the 'kernel' shape. However that's not my primary concern right now (anyway the idea is there allready how to adapt properly).
I have just chosen the new shape of a 4D A intuitively, for me it really make sense to think about a 2D 'kernel' center to be centered to each grid position of original 2D A.
But that 4D shaping may not actually be the 'best' one. I think the real problem here is the performance of summing. One should to be able to find 'best order' (of the 4D A) inorder to fully utilize your machines cache architecture. However that order may not be the same for 'small' arrays which kind of 'co-operates' with your machines cache and those larger ones, which don't (at least not so straightforward manner).
Update 2:
Here is a slightly modified version of mf. Clearly it's better to reshape to a 3D array first and then instead of summing just do dot product (this has the advantage all so, that kernel can be arbitrary). However it's still some 3x slower (on my machine) than Pauls updated function.
def mf(A):
k_shape= (3, 3)
k= np.prod(k_shape)
m= A.shape[0]- 2
n= A.shape[1]- 2
strides= A.strides* 2
new_shape= (m, n)+ k_shape
A= st.as_strided(A, shape= new_shape, strides= strides)
w= np.ones(k)/ k
return np.dot(A.reshape((m, n, -1)), w)

One thing I am confident needs to be fixed is your view array b.
It has a few items from unallocated memory, so you'll get crashes.
Given your new description of your algorithm, the first thing that needs fixing is the fact that you are striding outside the allocation of a:
bshape = (a.size-filtsize+1, filtsize)
bstrides = (a.itemsize, a.itemsize)
b = numpy.lib.stride_tricks.as_strided(a, shape=bshape, strides=bstrides)
Update
Because I'm still not quite grasping the method and there seems to be simpler ways to solve the problem, I'm just going to put this here:
A = numpy.arange(100).reshape((10,10))
shifts = [(-1,-1),(-1,0),(-1,1),(0,-1),(0,1),(1,-1),(1,0),(1,1)]
B = A[1:-1, 1:-1].copy()
for dx,dy in shifts:
xstop = -1+dx or None
ystop = -1+dy or None
B += A[1+dx:xstop, 1+dy:ystop]
B /= 9
...which just seems like the straightforward approach. The only extraneous operation is that it has allocate and populate B only once. All the addition, division and indexing has to be done regardless. If you are doing 16 bands, you still only need to allocate B once if your intent is to save an image. Even if this is no help, it might clarify why I don't understand the problem, or at least serve as a benchmark to time the speedups of other methods. This runs in 2.6 sec on my laptop on a 5k x 5k array of float64's, 0.5 of which is the creation of B

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python, fast computation of rolling percentile - python

Related

Pad trues in a binary matrix (more efficiently than with a binary dilation)

Scipy spline interpolation: Determine array length of vector of knots / B-spline coefficients in tck before actual computation

Data comparison using numpy

Find the most frequent element in a masked array

Using strides for an efficient moving average filter

Categories

Resources