I carry out some computations to obtain a list of numpy arrays. Subsequently, I would like to find the largest values along the first axis. My current implementation (see below) is very slow and I would like to find alternatives.
Original
pending = [<list of items>]
matrix = [compute(item) for item in pending if <some condition on item>]
dominant = np.max(matrix, axis = 0)
Revision 1: This implementation is faster (~10x; presumably because numpy does not need to figure out the shape of the array)
pending = [<list of items>]
matrix = [compute(item) for item in pending if <some condition on item>]
matrix = np.vstack(matrix)
dominant = np.max(matrix, axis = 0)
I ran a couple of tests, and the slowdown seems to be due to an internal conversion of the list of arrays to a numpy array:
Timer unit: 1e-06 s
Total time: 1.21389 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
4 def direct_max(list_of_arrays):
5 1000 1213886 1213.9 100.0 np.max(list_of_arrays, axis = 0)
Total time: 1.20766 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 def numpy_max(list_of_arrays):
9 1000 1151281 1151.3 95.3 list_of_arrays = np.array(list_of_arrays)
10 1000 56384 56.4 4.7 np.max(list_of_arrays, axis = 0)
Total time: 0.15437 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
12 @profile
13 def stack_max(list_of_arrays):
14 1000 102205 102.2 66.2 list_of_arrays = np.vstack(list_of_arrays)
15 1000 52165 52.2 33.8 np.max(list_of_arrays, axis = 0)
Is there any way to speed up the max function or is it possible to populate a numpy array efficiently with the results of my calculation such that max is fast?
You can use reduce(np.maximum, matrix). Here is a test:
import numpy as np
from functools import reduce  # built-in in Python 2, moved to functools in Python 3

np.random.seed(0)
N, M = 1000, 1000
matrix = [np.random.rand(N) for _ in range(M)]
%timeit np.max(matrix, axis = 0)
%timeit np.max(np.vstack(matrix), axis = 0)
%timeit reduce(np.maximum, matrix)
The result is:
10 loops, best of 3: 116 ms per loop
10 loops, best of 3: 10.6 ms per loop
100 loops, best of 3: 3.66 ms per loop
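As an aside, to address the second part of the question (populating a numpy array directly), you can preallocate the output and fill it row by row, avoiding the intermediate list entirely. A minimal sketch, assuming every compute(item) returns a 1-D array of the same known length N; pending, compute and the filter condition below are hypothetical stand-ins for your own:
import numpy as np

N = 1000
pending = range(500)                               # placeholder for your items
compute = lambda item: np.random.rand(N)           # placeholder for your computation
kept = [item for item in pending if item % 2 == 0] # placeholder condition

matrix = np.empty((len(kept), N))  # preallocate once
for row, item in enumerate(kept):
    matrix[row] = compute(item)    # write each result in place
dominant = matrix.max(axis=0)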
Edit
argmax() is more difficult, but you can use a for loop:
def argmax_list(matrix):
    m = matrix[0].copy()
    idx = np.zeros(len(m), dtype=int)
    for i, a in enumerate(matrix[1:], 1):
        mask = m < a
        m[mask] = a[mask]
        idx[mask] = i
    return idx
It's still faster than argmax():
%timeit np.argmax(matrix, axis=0)
%timeit np.argmax(np.vstack(matrix), axis=0)
%timeit argmax_list(matrix)
result:
10 loops, best of 3: 131 ms per loop
10 loops, best of 3: 21 ms per loop
100 loops, best of 3: 13.1 ms per loop
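As a sanity check (same matrix as above), the loop should agree with the stacked argmax, since both keep the first index on ties:
idx_loop = argmax_list(matrix)
idx_stack = np.argmax(np.vstack(matrix), axis=0)
print((idx_loop == idx_stack).all())  # expected: True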
Related
I want to vectorise the triple sum

$$\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{m=1}^{J} a_{ijm}$$

such that I end up with a matrix $A \in \mathbb{R}^{I \times J}$ where

$$A_{kl} = \sum_{i=1}^{k}\sum_{j=1}^{l}\sum_{m=1}^{l} a_{ijm} \qquad \text{for } k = 1,\dots,I \text{ and } l = 1,\dots,J,$$

carrying forward the sums to avoid pointless recomputation.
I currently use this code:
np.cumsum(np.cumsum(np.cumsum(a, axis = 0), axis = 1), axis = 2).diagonal(axis1 = 1, axis2 = 2)
but it is inefficient as it does lots of extra work and extracts the correct matrix at the end with the diagonal method. I can't think of how to make this faster.
The main challenge here is to compute the inner two sums, i.e. the sums over square slices of each matrix originating from the top-left corner. The final sum is just a cumsum on top of that along the 0th axis.
Setup:
import numpy as np
I, J = 100, 100
arr = np.random.rand(I, J, J)
Your implementation:
%%timeit
out = np.cumsum(np.cumsum(np.cumsum(arr, axis = 0), axis = 1), axis = 2).diagonal(axis1 = 1, axis2 = 2)
# 10.9 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your implementation improved by taking the diagonal before cumsumming over the 0th axis:
%%timeit
out = arr.cumsum(axis=1).cumsum(axis=2).diagonal(axis1=1, axis2=2).cumsum(axis=0)
# 6.25 ms ± 34.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Finally, some tril/triu trickery:
%%timeit
out = np.cumsum(np.cumsum(np.tril(arr, k=-1).sum(axis=2) + np.triu(arr).sum(axis=1), axis=1), axis=0)
# 3.15 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
which is already better, but admittedly still not ideal. I don't see a better way to compute the inner two sums noted above with pure numpy.
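For what it's worth, a quick check under the setup above confirms that all three variants agree:
out1 = np.cumsum(np.cumsum(np.cumsum(arr, axis=0), axis=1), axis=2).diagonal(axis1=1, axis2=2)
out2 = arr.cumsum(axis=1).cumsum(axis=2).diagonal(axis1=1, axis2=2).cumsum(axis=0)
out3 = np.cumsum(np.cumsum(np.tril(arr, k=-1).sum(axis=2) + np.triu(arr).sum(axis=1), axis=1), axis=0)
print(np.allclose(out1, out2) and np.allclose(out1, out3))  # True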
You can use Numba to produce a very fast implementation. Here is the code:
import numba as nb
import numpy as np
@nb.njit('(float64[:,:,::1],)', parallel=True)
def compute(arr):
    ni, nj, nk = arr.shape
    assert nj == nk
    result = np.empty((ni, nj))
    # Parallel cumsum along axes 1 and 2 + extraction of the diagonal
    for i in nb.prange(ni):
        tmp = np.zeros(nk)
        for j in range(nj):
            for k in range(nk):
                tmp[k] += arr[i, j, k]
            result[i, j] = np.sum(tmp[:j+1])
    # Cumsum along axis 0
    for i in range(1, ni):
        for k in range(nk):
            result[i, k] += result[i-1, k]
    return result

result = compute(arr)
Here are performance results on my 6-core i5-9600KF with a 100x100x100 float64 input array:
Initial code: 12.7 ms
Chryophylaxs v1: 7.1 ms
Chryophylaxs v2: 5.5 ms
Numba: 0.2 ms
This implementation is significantly faster than all the others: about 64 times faster than the initial implementation. It is also essentially optimal on my machine, since it completely saturates my RAM bandwidth just reading the input array (which is unavoidable). Note that it is better not to use multiple threads for very small arrays.
Note that this code also uses far less memory, as it only needs 8 * nk * num_threads bytes of temporary storage, as opposed to 16 * ni * nj * nk bytes for the initial solution.
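A quick way to validate the Numba version against the one-liner from the question (same arr as in the setup above; note the signature requires a C-contiguous float64 array):
expected = np.cumsum(np.cumsum(np.cumsum(arr, axis=0), axis=1), axis=2).diagonal(axis1=1, axis2=2)
print(np.allclose(compute(arr), expected))  # True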
I want to bin the data every time the threshold 10000 is exceeded.
I have tried this with no luck:
# data which is an array of floats
diff = np.diff(np.cumsum(data)//10000, prepend=0)
indices = (np.argwhere(diff > 0)).flatten()
The problem is that not every bin ends up containing at least 10000, which was my goal.
Expected output
input_data = [4000, 5000, 6000, 2000, 8000, 3000]
# (4000+5000+6000 >= 10000. Index 2)
# (2000+8000 >= 10000. Index 4)
Output: [2, 4]
I wonder if there is any alternative to a for loop?
Not sure how this could be vectorized, if it even can be, since by taking the cumulative sum you'll be propagating the remainders each time the threshold is surpassed. So probably this is a good case for numba, which will compile the code down to C level, allowing for a loopy but performant approach:
import numpy as np
from numba import njit, int32

@njit('int32[:](int32[:], uintc)')
def windowed_cumsum(a, thr):
    indices = np.zeros(len(a), int32)
    window = 0
    ix = 0
    for i in range(len(a)):
        window += a[i]
        if window >= thr:
            indices[ix] = i
            ix += 1
            window = 0
    return indices[:ix]
The explicit signature implies ahead-of-time compilation, though this enforces specific dtypes on the input array. The inferred dtype for the example array is int32; if that might not always be the case, or if you want a more flexible solution, you can omit the dtypes in the signature, which simply means the function will be compiled on its first execution.
input_data = np.array([4000, 5000, 6000, 2000, 8000, 3000])
windowed_cumsum(input_data, 10000)
# array([2, 4])
Also, @jdehesa raises an interesting point: for very long arrays compared to the number of bins, a better option might be to just append the indices to a list. So here is an alternative approach using lists (also in no-python mode), along with timings under different scenarios:
from numba import njit

@njit
def windowed_cumsum_list(a, thr):
    indices = []
    window = 0
    for i in range(len(a)):
        window += a[i]
        if window >= thr:
            indices.append(i)
            window = 0
    return indices
a = np.random.randint(0,10,10_000)
%timeit windowed_cumsum(a, 20)
# 16.1 µs ± 232 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit windowed_cumsum_list(a, 20)
# 65.5 µs ± 623 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit windowed_cumsum(a, 2000)
# 7.38 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit windowed_cumsum_list(a, 2000)
# 7.1 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So it seems that in most scenarios preallocating a numpy array is the faster option: even in the second case, with a length-10000 array yielding only about 20 bin indices, both perform similarly. For memory-efficiency reasons, though, the list-based version might be more convenient in some cases.
Here is how you can do it fairly efficiently with a loop, using np.searchsorted to find bin boundaries fast:
import numpy as np
np.random.seed(0)
bin_size = 10_000
data = np.random.randint(100, size=20_000)
# Naive solution (incorrect, for comparison)
data_f = np.floor(np.cumsum(data) / bin_size).astype(int)
bin_starts = np.r_[0, np.where(np.diff(data_f) > 0)[0] + 1]
# Check bin sizes
bin_sums = np.add.reduceat(data, bin_starts)
# We go over the limit!
print(bin_sums.max())
# 10080
# Better solution with loop
data_c = np.cumsum(data)
ref_val = 0
bin_starts = [0]
while True:
    # Search next split point
    ref_idx = bin_starts[-1]
    # Binary search through remaining cumsum
    next_idx = np.searchsorted(data_c[ref_idx:], ref_val + bin_size, side='right')
    next_idx += ref_idx
    # If we finished the array, stop
    if next_idx >= len(data_c):
        break
    # Add new bin boundary
    bin_starts.append(next_idx)
    ref_val = data_c[next_idx - 1]
# Convert bin boundaries to array
bin_starts = np.array(bin_starts)
# Check bin sizes
bin_sums = np.add.reduceat(data, bin_starts)
# Does not go over limit
print(bin_sums.max())
# 10000
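If you also need the contents of each bin rather than just the boundaries, the same bin_starts can be fed to np.split. A small illustration (bin_starts[1:] skips the leading 0 so no empty first bin is produced):
# Materialize the bins themselves from the boundaries
bins = np.split(data, bin_starts[1:])
print([b.sum() for b in bins[:3]])  # per-bin totals; each should be at most bin_size here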
In my code I need to calculate the values of a vector many times, where the values are means over different patches of another array.
Here is an example of my code showing how I do it, but I found it too inefficient:
import numpy as np
vector_a = np.zeros(10)
array_a = np.random.random((100, 100))
for i in range(len(vector_a)):
    vector_a[i] = np.mean(array_a[:, i+20:i+40])
Is there any way to make it more efficient? Any comments or suggestions are very welcome! Many thanks!
Edit: yes, the 20 and 40 are fixed.
EDIT:
Actually you can do this much faster. The previous function can be improved by operating on summed columns like this:
def rolling_means_faster1(array_a, n, first, size):
    # Sum each relevant column
    sum_a = np.sum(array_a[:, first:(first + size + n - 1)], axis=0)
    # Reshape as before
    strides_b = (sum_a.strides[0], sum_a.strides[0])
    array_b = np.lib.stride_tricks.as_strided(sum_a, (n, size), strides_b)
    # Average
    v = np.sum(array_b, axis=1)
    v /= (len(array_a) * size)
    return v
Another way is to work with accumulated sums, adding and removing as necessary for each output element.
def rolling_means_faster2(array_a, n, first, size):
    # Sum each relevant column
    sum_a = np.sum(array_a[:, first:(first + size + n - 1)], axis=0)
    # Add a zero at the beginning so the next operation works fine
    sum_a = np.insert(sum_a, 0, 0)
    # Sum the initial `size` elements and add and remove partial sums as necessary
    v = np.sum(sum_a[:size]) - np.cumsum(sum_a[:n]) + np.cumsum(sum_a[-n:])
    # Average
    v /= (size * len(array_a))
    return v
Benchmarking against the previous solutions:
import numpy as np
np.random.seed(100)
array_a = np.random.random((1000, 1000))
n = 100
first = 100
size = 200
%timeit rolling_means_orig(array_a, n, first, size)
# 12.7 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means(array_a, n, first, size)
# 5.49 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means_faster1(array_a, n, first, size)
# 166 µs ± 874 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit rolling_means_faster2(array_a, n, first, size)
# 182 µs ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So these last two seem to be very close in performance. It may depend on the relative sizes of the inputs.
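Since the stride and cumsum bookkeeping is easy to get off by one, a consistency check between the variants (same inputs as in the benchmark) is worth running:
ref = rolling_means_orig(array_a, n, first, size)
print(np.allclose(ref, rolling_means_faster1(array_a, n, first, size)))  # True
print(np.allclose(ref, rolling_means_faster2(array_a, n, first, size)))  # True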
This is a possible vectorized solution:
import numpy as np
# Data
np.random.seed(100)
array_a = np.random.random((100, 100))
# Take all the relevant columns
slice_a = array_a[:, 20:40 + 10]
# Make a "rolling window" with stride tricks
strides_b = (slice_a.strides[1], slice_a.strides[0], slice_a.strides[1])
array_b = np.lib.stride_tricks.as_strided(slice_a, (10, 100, 20), strides_b)
# Take mean
result = np.mean(array_b, axis=(1, 2))
# Original method for testing correctness
vector_a = np.zeros(10)
idv1 = np.arange(10) + 20
idv2 = np.arange(10) + 40
for i in range(len(vector_a)):
    vector_a[i] = np.mean(array_a[:, idv1[i]:idv2[i]])
print(np.allclose(vector_a, result))
# True
Here is a quick benchmark in IPython (sizes increased for appreciation):
import numpy as np
def rolling_means(array_a, n, first, size):
    slice_a = array_a[:, first:(first + size + n)]
    strides_b = (slice_a.strides[1], slice_a.strides[0], slice_a.strides[1])
    array_b = np.lib.stride_tricks.as_strided(slice_a, (n, len(array_a), size), strides_b)
    return np.mean(array_b, axis=(1, 2))

def rolling_means_orig(array_a, n, first, size):
    vector_a = np.zeros(n)
    idv1 = np.arange(n) + first
    idv2 = np.arange(n) + (first + size)
    for i in range(len(vector_a)):
        vector_a[i] = np.mean(array_a[:, idv1[i]:idv2[i]])
    return vector_a
np.random.seed(100)
array_a = np.random.random((1000, 1000))
n = 100
first = 100
size = 200
%timeit rolling_means(array_a, n, first, size)
# 5.48 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rolling_means_orig(array_a, n, first, size)
# 32.8 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This solution works on the assumption that you are trying to compute a rolling average over a subset of windows of columns.
As an example, ignoring rows: given [0, 1, 2, 3, 4] and a window of 2, the averages are [0.5, 1.5, 2.5, 3.5], and you might only want the second and third of them.
Your current solution is inefficient, as it recomputes the mean of each column for every output in vector_a. Given that (a / n) + (b / n) == (a + b) / n, we can get away with computing the mean of each column only once and then combining the column means as needed to produce the final output.
window_first_start = idv1.min() # or idv1[0]
window_last_end = idv2.max() # or idv2[-1]
window_size = idv2[0] - idv1[0]
assert ((idv2 - idv1) == window_size).all(), "sanity check, not needed if assumption holds true"
# a view of the columns we are interested in, no copying is done here
view = array_a[:,window_first_start:window_last_end]
# calculate the means for each column
col_means = view.mean(axis=0)
# cumsum is used to find the rolling sum of means and so the rolling average
# We use an out variable to make sure we have a 0 in the first element of cum_sum.
# This makes life a little easier in the next step.
cum_sum = np.empty(len(col_means) + 1, dtype=col_means.dtype)
cum_sum[0] = 0
np.cumsum(col_means, out=cum_sum[1:])
result = (cum_sum[window_size:] - cum_sum[:-window_size]) / window_size
Having tested this against your own code, the above is significantly faster (increasing with the size of the input array), and slightly faster than the solution provided by jdehesa. With an input array of 1000x1000, it is two orders of magnitude faster than your solution and one order of magnitude faster than jdehesa's.
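For completeness, a minimal check of this approach against the question's loop (with idv1/idv2 as defined in the benchmarks above, i.e. np.arange(n) + first and np.arange(n) + first + size):
# Reference: recompute each window mean directly
expected = np.array([array_a[:, lo:hi].mean() for lo, hi in zip(idv1, idv2)])
print(np.allclose(result, expected))  # True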
Try this:
import numpy as np
array_a = np.random.random((100,100))
vector_a = [np.mean(array_a[:,i+20:i+40]) for i in range(10)]
Let's say I have a 4-D numpy array (e.g. np.random.rand(x, y, z, t)) of data with dimensions corresponding to X, Y, Z, and time.
For each X and Y point, and at each time step, I want to find the largest index in Z for which the data is larger than some threshold n.
So my end result should be an X-by-Y-by-t array. Instances where there are no values in the Z-column greater than the threshold should be represented by a 0.
I can loop through element by element and construct a new array as I go, but I am operating on a very large array and it takes too long.
Unfortunately, following the example of Python builtins, numpy doesn't make it easy to get the last index, although the first is trivial. Still, something like
def slow(arr, axis, threshold):
    return (arr > threshold).cumsum(axis=axis).argmax(axis=axis)

def fast(arr, axis, threshold):
    compare = (arr > threshold)
    reordered = compare.swapaxes(axis, -1)
    flipped = reordered[..., ::-1]
    first_above = flipped.argmax(axis=-1)
    last_above = flipped.shape[-1] - first_above - 1
    are_any_above = compare.any(axis=axis)
    # patch the no-matching-element found values
    patched = np.where(are_any_above, last_above, 0)
    return patched
gives me
In [14]: arr = np.random.random((100,100,30,50))
In [15]: %timeit a = slow(arr, axis=2, threshold=0.75)
1 loop, best of 3: 248 ms per loop
In [16]: %timeit b = fast(arr, axis=2, threshold=0.75)
10 loops, best of 3: 50.9 ms per loop
In [17]: (slow(arr, axis=2, threshold=0.75) == fast(arr, axis=2, threshold=0.75)).all()
Out[17]: True
(There's probably a slicker way to do the flipping but it's the end of day here and my brain is shutting down. :-)
Here's a faster approach -
def faster(a, n, invalid_specifier):
    mask = a > n
    idx = a.shape[2] - (mask[:,:,::-1]).argmax(2) - 1
    idx[~mask[:,:,-1] & (idx == a.shape[2]-1)] = invalid_specifier
    return idx
Runtime test -
# Using @DSM's benchmarking setup
In [553]: a = np.random.random((100,100,30,50))
...: n = 0.75
...:
In [554]: out1 = faster(a,n,invalid_specifier=0)
...: out2 = fast(a, axis=2, threshold=n) # @DSM's soln
...:
In [555]: np.allclose(out1,out2)
Out[555]: True
In [556]: %timeit fast(a, axis=2, threshold=n) # @DSM's soln
10 loops, best of 3: 64.6 ms per loop
In [557]: %timeit faster(a,n,invalid_specifier=0)
10 loops, best of 3: 43.7 ms per loop
For a large set of randomly distributed points in a 2D lattice, I want to efficiently extract a subarray containing only the elements whose coordinates, floored to indices, map to non-zero values in a separate 2D binary matrix. Currently, my script is the following:
lat_len = 100 # lattice length
input = np.random.random(size=(1000,2)) * lat_len
binary_matrix = np.random.choice(2, lat_len * lat_len).reshape(lat_len, -1)
def landed(input):
    output = []
    input_as_indices = np.floor(input)
    for i in range(len(input)):
        if binary_matrix[input_as_indices[i,0], input_as_indices[i,1]] == 1:
            output.append(input[i])
    output = np.asarray(output)
    return output
However, I suspect there must be a better way of doing this. The above script can take quite long to run for 10000 iterations.
You are correct. The calculation above can be done more efficiently, without a Python for loop, using advanced numpy indexing:
def landed2(input):
    idx = np.floor(input).astype(int)
    mask = binary_matrix[idx[:,0], idx[:,1]] == 1
    return input[mask]
res1 = landed(input)
res2 = landed2(input)
np.testing.assert_allclose(res1, res2)
This results in a ~150x speed-up.
It seems you can squeeze out a noticeable performance boost if you work with linearly indexed arrays. Here's a vectorized implementation to solve our case, similar to @rth's answer, but using linear indexing -
# Get floor-ed indices
idx = np.floor(input).astype(int)
# Calculate linear indices
lin_idx = idx[:,0]*lat_len + idx[:,1]
# Index raveled/flattened version of binary_matrix with lin_idx
# to extract and form the desired output
out = input[binary_matrix.ravel()[lin_idx] == 1]
Thus, in short we have:
out = input[binary_matrix.ravel()[idx[:,0]*lat_len + idx[:,1]] == 1]
Runtime tests -
This section compares the proposed approach in this solution against the other solution that uses row-column indexing.
Case #1 (original data sizes):
In [62]: lat_len = 100 # lattice length
    ...: input = np.random.random(size=(1000,2)) * lat_len
    ...: binary_matrix = np.random.choice(2, lat_len * lat_len).reshape(lat_len, -1)
    ...:
In [63]: idx = np.floor(input).astype(int)
In [64]: %timeit input[binary_matrix[idx[:,0], idx[:,1]] == 1]
10000 loops, best of 3: 121 µs per loop
In [65]: %timeit input[binary_matrix.ravel()[idx[:,0]*lat_len + idx[:,1]] == 1]
10000 loops, best of 3: 103 µs per loop
Case #2 (larger data sizes):
In [75]: lat_len = 1000 # lattice length
    ...: input = np.random.random(size=(100000,2)) * lat_len
    ...: binary_matrix = np.random.choice(2, lat_len * lat_len).reshape(lat_len, -1)
    ...:
In [76]: idx = np.floor(input).astype(int)
In [77]: %timeit input[binary_matrix[idx[:,0], idx[:,1]] == 1]
100 loops, best of 3: 18.5 ms per loop
In [78]: %timeit input[binary_matrix.ravel()[idx[:,0]*lat_len + idx[:,1]] == 1]
100 loops, best of 3: 13.1 ms per loop
Thus, the performance boost from linear indexing seems to be about 20-30%.
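As a closing note, numpy ships a helper that performs exactly this row-major linearization, so the manual arithmetic can be swapped out without changing the result (and it additionally validates that the indices are in bounds):
# Equivalent linear indices via numpy's built-in helper
lin_idx = np.ravel_multi_index((idx[:,0], idx[:,1]), (lat_len, lat_len))
out = input[binary_matrix.ravel()[lin_idx] == 1]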