When broadcasting is a bad idea ? (numpy) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations.
Example 1:
from numpy import array
a = array([1.0,2.0,3.0])
b = array([2.0,2.0,2.0]) # multiply element-by-element
a * b
>> array([ 2., 4., 6.])
Example 2 :
from numpy import array
a = array([1.0,2.0,3.0])
b = 2.0 # broadcast b to all a
a * b
>>array([ 2., 4., 6.])
We can think of the scalar b being stretched during the arithmetic operation into an array with the same shape as a. Numpy is smart enough to use the original scalar value without actually making copies so that broadcasting operations are as memory and computationally efficient as possible (b is a scalar, not an array)
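One way to check the "no copies" claim yourself is np.broadcast_to, which exposes the stretched operand as a view. This is only a minimal sketch; np.float64(2.0) stands in for the scalar b, and the variable name stretched is mine:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.float64(2.0)

# broadcast_to returns a read-only view; the scalar is not duplicated.
stretched = np.broadcast_to(b, a.shape)

print(stretched.shape)    # same shape as a
print(stretched.strides)  # zero stride: every element aliases the one value
print(np.array_equal(a * b, a * stretched))
```

A stride of zero means stepping along the axis never moves in memory, which is how the "stretching" costs nothing.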
A small benchmark by @Eric Duminil in another question on memory performance shows that broadcasting makes a difference in terms of speed and memory.
However, I am quoting from the same article linked above:
There are cases where broadcasting is a bad idea because it
leads to inefficient use of memory that slows computation.
The question is: when does broadcasting use unnecessarily large amounts of memory and result in sluggish performance?
In other words, when should we use a hybrid broadcasting/Python-looping algorithm over the pure broadcasting approach?

There isn't a real clear case when broadcasting is bad. Often broadcasting is the simplest, most readable solution, which is probably what you want. Only if that proves too slow after benchmarking would I consider optimising the broadcasts or sequential operations away.
As many of the existing comments say, there is often a tradeoff between memory and compute with regard to broadcasting. However, if your algorithms are designed incorrectly you can hurt both aspects.
The biggest problem I find is that while numpy may try to optimise different steps such that it uses views of an array, often it won't be able to do these optimisations for broadcasts or sequential operations. Clever use of numpy may still not be able to get around this problem, so it might be worthwhile rewriting your program using loops so that you can merge your operations together manually. This can minimise memory usage and maximise performance. Doing this in plain Python would be extremely slow, but fortunately we have things like numba, which can JIT (just in time) compile an annotated function down into efficient machine code.
Another problem is very large arrays: broadcasting can rapidly increase memory usage, often moving from O(n) to O(n^2) or even worse if the arrays have different shapes (I don't mean broadcasting with a scalar). While this may be fine for small arrays, it quickly becomes an issue as the size of the arrays increases. Multiplying by a scalar may just double memory usage, which isn't nearly as bad.
import numpy as np
a = np.arange(128, dtype='float64')
# largest temp. array memory usage -- A: ~1KB, B: ~128KB, C: ~16MB
A: float = a.sum()
B: float = (a[:, None] + a[None, :]).sum()
C: float = (a[:, None, None] + a[None, :, None] + a[None, None, :]).sum()
# If a is small, then we are OK
# e.g. a.shape == (16,) gives -- A: ~128B, B: ~2KB, C: ~32KB
The example above, while quite readable, becomes inefficient as the arrays grow because of the temporary arrays used in the reduction. It could be converted to a loop-based format that would only use O(n) memory. If the output you wanted was the broadcast arrays themselves rather than the reduction, then broadcasting would be very near optimal.
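As a rough illustration of that loop-based format for case B, here is a sketch that reduces one row at a time, so only an (n,) temporary exists at any moment. The function names are hypothetical, and in practice you would probably JIT-compile the loop rather than run it in plain Python:

```python
import numpy as np

def pairwise_sum_broadcast(a):
    # Materialises an (n, n) temporary before reducing: O(n^2) memory.
    return (a[:, None] + a[None, :]).sum()

def pairwise_sum_rowwise(a):
    # One (n,) temporary per iteration: O(n) memory.
    total = 0.0
    for value in a:
        total += (value + a).sum()
    return total

a = np.arange(128, dtype="float64")
print(np.isclose(pairwise_sum_broadcast(a), pairwise_sum_rowwise(a)))
```

Both functions compute the same reduction; the second trades a Python-level loop for a much smaller peak memory footprint.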
Example: I recently ran into this problem myself. I had a 1D binary mask that I needed to broadcast with itself into a 2D matrix so that I could use it to extract elements from a large pre-computed distance matrix. The extra condition was that I had to exclude the diagonal too (I also did not have access to the original 1D positions).
Naturally this would look as follows:
import numpy as np
def broadcast_masked_tril_total(dists2d, mask):
    # broadcast 1d mask into 2d array
    # - this can be very slow, moving from O(N) to O(N^2) memory
    mask2d = mask[None, :] & mask[:, None]
    # ignore diagonal
    # - the 2D array needs to exist in memory to make these edits, a view cannot work.
    np.fill_diagonal(mask2d, False)
    # index array with elements
    # - 2d mask means O(N^2) memory is read, instead of O(N)
    total = dists2d[mask2d].sum()
    # elems are repeated so divide by 2
    return total / 2
The problem was of course the memory usage and the intermediate storage of values.
There may be a clever numpy fix, but the obvious solution is just to convert it to loops, as you say, where you don't need to make use of broadcasting. As a bonus, you can try identifying which operations can be merged together instead of chaining them as in numpy.
As a general rule of thumb, the fewer memory accesses and intermediate storage locations, the faster it will run.
import numba
@numba.njit()
def efficient_masked_tril_total(dists2d, mask):
    total = 0.
    for y, y_val in enumerate(mask):
        # we only ever need to read from the same O(N) mask memory
        # we can immediately skip invalid rows
        if not y_val:
            continue
        # skip the diagonal and one triangle of the distance matrix
        # - can't do this efficiently with numpy broadcasting and
        #   mask, intermediate storage of 2d mask was required
        for x in range(y+1, len(mask)):
            # again accessing the same O(n) mask item without broadcasting
            if not mask[x]:
                continue
            total += dists2d[y, x]
    return total
for example using this:
N = int(np.sqrt((10*1024**2)/(64/8))) # enough elems for 10.0 MB
# make distance matrices
mask = np.random.random(N) > 0.5
positions = np.random.random(N)
# again we broadcast, note that we could further optimise
# our efficient approach by just passing in the positions directly.
dists2d = np.abs(positions[:, None] - positions[None, :])
# warmup
print(broadcast_masked_tril_total(dists2d, mask))
print(efficient_masked_tril_total(dists2d, mask))
# timeit
import timeit
print(timeit.timeit(lambda: broadcast_masked_tril_total(dists2d, mask), number=1000))
print(timeit.timeit(lambda: efficient_masked_tril_total(dists2d, mask), number=1000))
tl;dr: In short, I would suggest that you always use the simplest, most readable solution. Only if it becomes a performance problem should you spend time benchmarking and optimising your approach. Just remember that "premature optimisation is the root of all evil."
So there isn't really a specific case where broadcasting is a bad idea. Often it is the simplest solution, and that is a good thing. Sometimes broadcasting will be the optimal solution, if you need to return the actual broadcast array. If you are using the broadcast with some sort of secondary operation such as a reduction, you can probably optimise the combined operation by converting it to loops. Just remember that it isn't necessarily bad; it's only a problem if performance becomes an issue. Smaller arrays are usually not a problem, but if you are working with much larger ones, broadcasting can easily cause memory issues too.

Related

best way to store numbers in a multidimensional (sparse) array in python

What is the best container object for a calculation in N dimensions, when the problem is symmetric so that only some numbers need to be calculated?
Concretely, for N=4 I have:
import numpy as np
M=50
results = np.zeros((M,M,M,M))
for ii in range(M):
    for jj in range(ii,M):
        for kk in range(jj,M):
            for ll in range(kk, M):
                res=1 #really some calculation
                results[ii,jj,kk,ll] = res
Many elements in this array are completely redundant and aren't even accessed. This is even more true for higher N (I'd like to go up to N=10 or ideally N=15).
Is it better to use lists and append in each step for such a problem, or a dictionary, or sparse matrices? I tried a sparse matrix, but it keeps warning me that I shouldn't frequently change elements in a sparse matrix, so presumably this is not a good idea.
The only functionality that I'd need to retain is finding maxima (ideally along each dimension).
Any insights would be appreciated!

The "density" of the matrix will be 1 / D**2, where D is the number of dimensions - so you can see that the payoff in space is exponential, while the performance penalty compared to lists or dense matrices is constant.
So, when the number of dimensions is high, sparse matrices will provide a HUGE advantage in space used, and they're still faster than just lists. If the number of dimensions is small, dense matrices will be slightly bigger but also only slightly faster (slightly here: a few times faster, but since the total execution time is small, the absolute difference is still small).
Overall, unless the number of dimensions is fixed, it makes more sense to stick with sparse matrices. However, if D is fixed, it's better to just benchmark for this specific case.
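For comparison, here is a sketch of the dictionary option the question mentions, storing only the index tuples the nested loops actually visit. It assumes the calculation is symmetric in its indices, as the question states; the sizes and the lookup helper are illustrative only:

```python
import math
from itertools import combinations_with_replacement

M, N = 5, 4  # small illustrative sizes, not the question's M=50

results = {}
for idx in combinations_with_replacement(range(M), N):
    # idx is already sorted: ii <= jj <= kk <= ll
    results[idx] = 1  # placeholder for the real calculation

def lookup(results, *indices):
    # Any permutation of the indices maps to the same stored entry.
    return results[tuple(sorted(indices))]

# Stored entries vs. the size of the dense array.
print(len(results), math.comb(M + N - 1, N), M ** N)
print(lookup(results, 3, 0, 2, 2))
```

The global maximum is then just max(results.values()); per-dimension maxima would need an extra pass over the keys.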

Central difference with Convolution

So basically I am trying to do finite differencing on a 2D array without doing too many for loops. I would like to have the Hessian matrix of the array, and the gradient. So I need both the first-order and second-order derivatives of the array.
This can be achieved by evaluating the central difference (f(x + dx) - f(x - dx)) / (2*dx) on the array.
To deal with boundaries we only compute it for the interior points, so code for this derivative might look something like the following
import numpy as np
dx = 1.0  # grid spacing; not given in the question, assumed here
arr = np.random.rand(16).reshape(4,4)
result = np.zeros_like(arr)
w, h = arr.shape
for i in range(1, w-1):
    for j in range(1, h-1):
        result[i,j] = (arr[i+1, j] - arr[i-1, j]) / (2*dx)
This gives the correct answer but can be very slow compared to NumPy operations, so I thought to myself: this is basically just a convolution with a kernel that looks like this
kernel = [1, 0 , -1]
So we execute the following code
from scipy.signal import convolve
result = np.pad((convolve(arr,kernel,mode='same',
                          method = 'direct')/(2*dx))[1:-1, 1:-1], 1).T
Since we are only dealing with the interior points, we cut them off and pad with zeros afterwards, to mimic what would happen in the previous naive case.
This works! But with some arrays, the mean squared error between the naive case and the convolution case skyrockets. So it seems that the numerical error increases very much in some cases.
I would like the speed gained by convolution with the stability of the naive case. Any help?

We can simply slice and operate. Hence, after output initialization, do -
result[1:-1,1:-1] = (arr[2:,1:-1] - arr[:-2,1:-1])/(2*dx)
Convolution IMHO would be overkill when working with NumPy arrays, as slicing arrays is virtually free in memory and performance. If the work is compute heavy, one can look into numexpr to leverage multiple cores.
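To convince yourself that the sliced expression matches the naive double loop, a quick check along these lines should do. It is only a sketch; dx = 1.0 is an assumed grid spacing and the seeded generator is just there for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((4, 4))
dx = 1.0  # assumed grid spacing

# Reference: explicit loop over the interior points.
expected = np.zeros_like(arr)
for i in range(1, arr.shape[0] - 1):
    for j in range(1, arr.shape[1] - 1):
        expected[i, j] = (arr[i + 1, j] - arr[i - 1, j]) / (2 * dx)

# Sliced version: arr[2:] is "row i+1", arr[:-2] is "row i-1".
result = np.zeros_like(arr)
result[1:-1, 1:-1] = (arr[2:, 1:-1] - arr[:-2, 1:-1]) / (2 * dx)

print(np.allclose(result, expected))
```

Because both versions perform the same subtraction and division per element, there is no extra numerical error compared with the loop.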

Why is NumPy subtraction slower on one large matrix $M$ than when dividing $M$ into smaller matrices and then subtracting?

I'm working on some code where I have several matrices and want to subtract a vector $v$ from each row of each matrix (and then do some other stuff with the result). As I'm using NumPy and want to 'vectorise' as much as possible, I thought I'd speed up my running time by storing all the matrices as one large ('concatenated') matrix and subtracting $v$ from that. The issue is that my code runs slower after this supposed optimisation. In fact, in some scenarios breaking up the matrices and subtracting separately is significantly faster (see code example below).
Can you tell me what is causing this? Naively, I would assume that both approaches require the same number of elementary subtraction operations and the large matrix approach is faster as we avoid looping through all matrices separately with a pure Python loop.
Initially, I thought the slow-down may be due to initialising a larger matrix to store the result of subtracting. To test this, I initialised a large matrix outside my test function and passed it to the np.subtract command. Then I thought that broadcasting may be causing the slow performance, so I manually broadcast the vector into the same shape as large matrix and then subtracted the resulting broadcasted matrix. Both attempts have failed to make the large matrix approach competitive.
I've made the following MWE to showcase the issue.
Import NumPy and a timer:
import numpy as np
from timeit import default_timer as timer
Then I have some parameters that control the size and number of matrices.
n = 100 # width of matrix
m = 500 # height of matrix
k = 100 # number of matrices
M = 100 # upper bound on entries
reps = 100 # repetitions for timings
We can generate a list of test matrices as follows. The large matrix is just the concatenation of all matrices in the list. The vector we subtract from the matrices is randomly generated.
list_of_matrices = [np.random.randint(0, M+1, size=(m,n)) for _ in range(k)]
large_matrix = np.row_stack(list_of_matrices)
vector = np.random.randint(0, M+1, size=n)
Here are the three functions I use to evaluate the speed of subtraction. The first subtracts the vector from each matrix in the list, the second subtracts the vector from the (concatenated) large matrix and the last function is an attempt to speed up the latter approach by pre_initialising an output matrix and broadcasting the vector.
def list_compute(list_of_matrices, vector):
    for j in range(k):
        np.subtract(list_of_matrices[j], vector)
def array_compute(bidlists, vector):
    np.subtract(large_matrix, vector_matrix, out=pre_allocated)
pre_allocated = np.empty(shape=large_matrix.shape)
vector_matrix = np.broadcast_to(vector, shape=large_matrix.shape)
def faster_array_compute(large_matrix, vector_matrix, out_matrix):
    np.subtract(large_matrix, vector_matrix, out=out_matrix)
I benchmark the three functions by running
start = timer()
for _ in range(reps):
    list_compute(list_of_matrices, vector)
print(timer() - start)
start = timer()
for _ in range(reps):
    array_compute(large_matrix, vector)
print(timer() - start)
start = timer()
for _ in range(reps):
    faster_array_compute(large_matrix, vector_matrix, pre_allocated)
print(timer() - start)
For the above parameters, I get timings of
0.539432048798
1.12959504128
1.10976290703
Naively, I would expect the large matrix approach to be faster or at least competitive compared to the several matrices approach. I hope someone can give me some insights into why this is not the case and how I can speed up my code!

The dtype of pre_allocated is float64 (the default for np.empty), while the input matrices are int, so there is an implicit conversion. Try changing the pre-allocation to:
pre_allocated = np.empty_like(large_matrix)
Before the change, the execution times on my machine were:
0.6756095182868318
1.2262537249271794
1.250292605883855
After the change:
0.6776479894965846
0.6468182835551346
0.6538956945388001
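The dtype mismatch behind this difference is easy to confirm. The snippet below is a small sketch with a stand-in integer matrix (explicitly int64 here; the question's np.random.randint uses the platform default integer):

```python
import numpy as np

# Stand-ins for the question's integer inputs.
large_matrix = np.zeros((4, 3), dtype=np.int64)
vector = np.ones(3, dtype=np.int64)

default_out = np.empty(shape=large_matrix.shape)  # dtype defaults to float64
matching_out = np.empty_like(large_matrix)        # inherits int64

print(default_out.dtype, matching_out.dtype)

# Both calls succeed, but the first one has to convert integers to floats.
np.subtract(large_matrix, vector, out=default_out)
np.subtract(large_matrix, vector, out=matching_out)
```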
The performance is similar in all cases. There is a large variance in those measurements. One may even observe that the first one is the fastest.
It seems that there is no gain due to pre-allocation.
Note that the allocation is very fast because it reserves only address space. The RAM is actually consumed only on access. The buffer is 20MiB, so it is larger than the L3 caches on the CPU. The execution time will be dominated by page faults and refilling of the caches. Moreover, in the first case the memory is re-allocated just after being freed, so the resource is likely to be "hot" for the memory allocator. Therefore you cannot directly compare solution A with the others.
Modify the "action" line in the first case to keep the actual result:
np.subtract(list_of_matrices[j], vector, out=pre_allocated[m*j:m*(j+1)])
Then the gain from vectorized operations becomes more observable:
0.8738251849091547
0.678185239557866
0.6830777283598941

numpy.sum transition to kahan but with masked arrays for increased precision

I have a multi-array stack of data that is masked to exclude 'bad' or problematic values- this is in the 3rd dimension. Current code utilizes np.sum, but the level of precision (both large and small numbers) has negatively impacted results. I've attempted to implement the kahan_sum referenced here but forgotten about the masked arrays, and the results are not similar (due to masking). It is my hope that the added precision retention by utilizing a kahan summation and accumulator will permit downstream operations to maintain less error.
Source/research:
https://github.com/numpy/numpy/issues/8786
Kahan summation
Python floating point precision sum (I've jacked up the precision as far as possible but it doesn't help)
import numpy as np
import numpy.ma as ma
def kahan_sum(a, axis=None):
    s = np.zeros(a.shape[:axis] + a.shape[axis+1:])
    c = np.zeros(s.shape)
    for i in range(a.shape[axis]):
        # http://stackoverflow.com/a/42817610/353337
        y = a[(slice(None),) * axis + (i,)] - c
        t = s + y
        c = (t - s) - y
        s = t.copy()
    return s
data=np.random.rand(5,5,5)
dd=np.ma.masked_array(data=data, mask=np.random.rand(5,5,5)<0.2)
I want to sum along the 3rd (axis=2) as that's essentially my 'stack' of photos.
The masks are not coming out as I expected. It's possible I'm just overtired...
np.sum(dd, axis=2)
kahan_sum(dd, axis=2)
np.sum provides a fully populated array of data and excluded the 'masked' values.
kahan_sum essentially or'd all of the masked values, and I've been unable to come up with a pattern for it.
Printing the mask makes it pretty evident that that's where the problem is; I'm just not figuring out how to fix it or why it's operating the way it is.
Thank you.

If you really need more precision, consider using math.fsum which is accurate to fp resolution. If A is your 3D masked array, something like:
import math
i,j,k = A.shape
np.frompyfunc(lambda i,j:math.fsum(A[i,j].compressed().tolist()),2,1)(*np.ogrid[:i,:j])
But before that I'd triple check that np.sum really isn't good enough. As far as I know it uses pairwise summation along contiguous axes, which in practice tends to be pretty good.
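If the frompyfunc one-liner is hard to read, the same idea can be spelled out with explicit loops. This is a sketch only: fsum_last_axis is a hypothetical helper name, and the random test data mirrors the question's setup rather than real photo stacks:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((5, 5, 5))
A = np.ma.masked_array(data, mask=rng.random((5, 5, 5)) < 0.2)

def fsum_last_axis(A):
    # Sum the unmasked values along axis 2 with math.fsum.
    out = np.empty(A.shape[:2])
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            out[i, j] = math.fsum(A[i, j].compressed().tolist())
    return out

exact = fsum_last_axis(A)
approx = A.sum(axis=2).filled(0.0)

# How far np.sum on the masked array is from the fsum result.
print(np.abs(exact - approx).max())
```

compressed() drops the masked entries before summing, so the mask never gets or'd together the way it does in kahan_sum.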

Using strides for an efficient moving average filter

I recently learned about strides in the answer to this post, and was wondering how I could use them to compute a moving average filter more efficiently than what I proposed in this post (using convolution filters).
This is what I have so far. It takes a view of the original array then rolls it by the necessary amount and sums the kernel values to compute the average. I am aware that the edges are not handled correctly, but I can take care of that afterward... Is there a better and faster way? The objective is to filter large floating point arrays up to 5000x5000 x 16 layers in size, a task that scipy.ndimage.filters.convolve is fairly slow at.
Note that I am looking for 8-neighbour connectivity, that is a 3x3 filter takes the average of 9 pixels (8 around the focal pixel) and assigns that value to the pixel in the new image.
import numpy, scipy
filtsize = 3
a = numpy.arange(100).reshape((10,10))
b = numpy.lib.stride_tricks.as_strided(a, shape=(a.size,filtsize), strides=(a.itemsize, a.itemsize))
for i in range(0, filtsize-1):
    if i > 0:
        b += numpy.roll(b, -(pow(filtsize,2)+1)*i, 0)
filtered = (numpy.sum(b, 1) / pow(filtsize,2)).reshape((a.shape[0],a.shape[1]))
scipy.misc.imsave("average.jpg", filtered)
EDIT Clarification on how I see this working:
Current code:
1) Use stride_tricks to generate an array like [[0,1,2],[1,2,3],[2,3,4]...] which corresponds to the top row of the filter kernel.
2) Roll along the vertical axis to get the middle row of the kernel [[10,11,12],[11,12,13],[12,13,14]...] and add it to the array I got in 1).
3) Repeat to get the bottom row of the kernel [[20,21,22],[21,22,23],[22,23,24]...]. At this point, I take the sum of each row and divide it by the number of elements in the filter, giving me the average for each pixel (shifted by 1 row and 1 col, and with some oddities around edges, but I can take care of that later).
What I was hoping for is a better use of stride_tricks to get the 9 values or the sum of the kernel elements directly, for the entire array, or that someone can convince me of another more efficient method...

For what it's worth, here's how you'd do it using "fancy" striding tricks. I was going to post this yesterday, but got distracted by actual work! :)
@Paul & @eat both have nice implementations using various other ways of doing this. Just to continue things from the earlier question, I figured I'd post the N-dimensional equivalent.
You're not going to be able to significantly beat scipy.ndimage functions for >1D arrays, however. (scipy.ndimage.uniform_filter should beat scipy.ndimage.convolve, though)
Moreover, if you're trying to get a multidimensional moving window, you risk having memory usage blow up whenever you inadvertently make a copy of your array. While the initial "rolling" array is just a view into the memory of your original array, any intermediate steps that copy the array will make a copy that is orders of magnitude larger than your original array (i.e. Let's say that you're working with a 100x100 original array... The view into it (for a filter size of (3,3)) will be 98x98x3x3 but use the same memory as the original. However, any copies will use the amount of memory that a full 98x98x3x3 array would!!)
Basically, using crazy striding tricks is great for when you want to vectorize moving window operations on a single axis of an ndarray. It makes it really easy to calculate things like a moving standard deviation, etc with very little overhead. When you want to start doing this along multiple axes, it's possible, but you're usually better off with more specialized functions. (Such as scipy.ndimage, etc)
At any rate, here's how you do it:
import numpy as np
def rolling_window_lastaxis(a, window):
    """Directly taken from Erik Rigtorp's post to numpy-discussion.
    <http://www.mail-archive.com/numpy-discussion@scipy.org/msg29450.html>"""
    if window < 1:
        raise ValueError("`window` must be at least 1.")
    if window > a.shape[-1]:
        raise ValueError("`window` is too long.")
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def rolling_window(a, window):
    if not hasattr(window, '__iter__'):
        return rolling_window_lastaxis(a, window)
    for i, win in enumerate(window):
        if win > 1:
            a = a.swapaxes(i, -1)
            a = rolling_window_lastaxis(a, win)
            a = a.swapaxes(-2, i)
    return a
filtsize = (3, 3)
a = np.zeros((10,10), dtype=float)
a[5:7,5] = 1
b = rolling_window(a, filtsize)
blurred = b.mean(axis=-1).mean(axis=-1)
So what we get when we do b = rolling_window(a, filtsize) is an 8x8x3x3 array, that's actually a view into the same memory as the original 10x10 array. We could have just as easily used different filter size along different axes or operated only along selected axes of an N-dimensional array (i.e. filtsize = (0,3,0,3) on a 4-dimensional array would give us a 6 dimensional view).
We can then apply an arbitrary function to the last axis repeatedly to effectively calculate things in a moving window.
However, because we're storing temporary arrays that are much bigger than our original array on each step of mean (or std or whatever), this is not at all memory efficient! It's also not going to be terribly fast, either.
The equivalent for ndimage is just:
blurred = scipy.ndimage.uniform_filter(a, filtsize, output=a)
This will handle a variety of boundary conditions, do the "blurring" in-place without requiring a temporary copy of the array, and be very fast. Striding tricks are a good way to apply a function to a moving window along one axis, but they're not a good way to do it along multiple axes, usually....
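As a side note, newer NumPy releases (1.20 and later, as far as I know) ship np.lib.stride_tricks.sliding_window_view, which builds the same kind of windowed view without hand-written strides. A minimal sketch on the same 10x10 example, assuming such a NumPy version is available:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.zeros((10, 10), dtype=float)
a[5:7, 5] = 1

# 8x8 grid of 3x3 windows, still backed by the original buffer.
windows = sliding_window_view(a, (3, 3))
print(windows.shape)
print(np.shares_memory(windows, a))

# Averaging over the two window axes gives the "valid" part of a 3x3 blur.
blurred = windows.mean(axis=(-2, -1))
print(blurred.shape)
```

The memory caveat above still applies: the view itself is free, but the reduction works on the full 8x8x3x3 logical array.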
Just my $0.02, at any rate...

I'm not familiar enough with Python to write out code for that, but the two best ways to speed up convolutions are to either separate the filter or to use the Fourier transform.
Separated filter : Convolution is O(M*N), where M and N are number of pixels in the image and the filter, respectively. Since average filtering with a 3-by-3 kernel is equivalent to filtering first with a 3-by-1 kernel and then a 1-by-3 kernel, you can get (3+3)/(3*3) = ~30% speed improvement by consecutive convolution with two 1-d kernels (this obviously gets better as the kernel gets larger). You may still be able to use stride tricks here, of course.
Fourier Transform : conv(A,B) is equivalent to ifft(fft(A)*fft(B)), i.e. a convolution in direct space becomes a multiplication in Fourier space, where A is your image and B is your filter. Since the (element-wise) multiplication of the Fourier transforms requires that A and B are the same size, B is an array of size(A) with your kernel at the very center of the image and zeros everywhere else. To place a 3-by-3 kernel at the center of an array, you may have to pad A to odd size. Depending on your implementation of the Fourier transform, this can be a lot faster than the convolution (and if you apply the same filter multiple times, you can pre-compute fft(B), saving another 30% of computation time).
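For the separated-filter idea, a NumPy sketch could look like the following. It only covers the interior ("valid") pixels, the function names are made up for illustration, and it makes no claim about actual speed on large images:

```python
import numpy as np

def box3_direct(img):
    # Nine shifted slices summed together: the plain 3-by-3 average.
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += img[dy:dy + h - 2, dx:dx + w - 2]
    return out / 9.0

def box3_separable(img):
    # 1-by-3 pass along each row, then 3-by-1 pass down each column:
    # six slices in total instead of nine.
    rows = img[:, :-2] + img[:, 1:-1] + img[:, 2:]
    return (rows[:-2] + rows[1:-1] + rows[2:]) / 9.0

img = np.arange(100, dtype=float).reshape(10, 10)
print(np.allclose(box3_direct(img), box3_separable(img)))
```

The saving grows with kernel size, since a k-by-k box needs 2k slices when separated instead of k*k.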

Let's see:
It's not so clear from your question, but I'm assuming now that you'd like to significantly improve this kind of averaging.
import numpy as np
from numpy.lib import stride_tricks as st
def mf(A, k_shape= (3, 3)):
    m= A.shape[0]- 2
    n= A.shape[1]- 2
    strides= A.strides+ A.strides
    new_shape= (m, n, k_shape[0], k_shape[1])
    A= st.as_strided(A, shape= new_shape, strides= strides)
    return np.sum(np.sum(A, -1), -1)/ np.prod(k_shape)
if __name__ == '__main__':
    A= np.arange(100).reshape((10, 10))
    print(mf(A))
Now, what kind of performance improvements would you actually expect?
Update:
First of all, a warning: the code in its current state does not adapt properly to the 'kernel' shape. However, that's not my primary concern right now (anyway, the idea of how to adapt it properly is already there).
I have just chosen the new shape of a 4D A intuitively; for me it really makes sense to think of a 2D 'kernel' centered on each grid position of the original 2D A.
But that 4D shaping may not actually be the 'best' one. I think the real problem here is the performance of summing. One should be able to find the 'best order' (of the 4D A) in order to fully utilize your machine's cache architecture. However, that order may not be the same for 'small' arrays, which kind of 'co-operate' with your machine's cache, and for larger ones, which don't (at least not in so straightforward a manner).
Update 2:
Here is a slightly modified version of mf. Clearly it's better to reshape to a 3D array first and then, instead of summing, just do a dot product (this also has the advantage that the kernel can be arbitrary). However, it's still some 3x slower (on my machine) than Paul's updated function.
def mf(A):
    k_shape= (3, 3)
    k= np.prod(k_shape)
    m= A.shape[0]- 2
    n= A.shape[1]- 2
    strides= A.strides* 2
    new_shape= (m, n)+ k_shape
    A= st.as_strided(A, shape= new_shape, strides= strides)
    w= np.ones(k)/ k
    return np.dot(A.reshape((m, n, -1)), w)

One thing I am confident needs to be fixed is your view array b.
It has a few items from unallocated memory, so you'll get crashes.
Given your new description of your algorithm, the first thing that needs fixing is the fact that you are striding outside the allocation of a:
bshape = (a.size-filtsize+1, filtsize)
bstrides = (a.itemsize, a.itemsize)
b = numpy.lib.stride_tricks.as_strided(a, shape=bshape, strides=bstrides)
Update
Because I'm still not quite grasping the method and there seems to be simpler ways to solve the problem, I'm just going to put this here:
# float dtype so the in-place division at the end is valid
A = numpy.arange(100, dtype=float).reshape((10,10))
shifts = [(-1,-1),(-1,0),(-1,1),(0,-1),(0,1),(1,-1),(1,0),(1,1)]
B = A[1:-1, 1:-1].copy()
for dx,dy in shifts:
    xstop = -1+dx or None
    ystop = -1+dy or None
    B += A[1+dx:xstop, 1+dy:ystop]
B /= 9
...which just seems like the straightforward approach. The only extraneous operation is that it has to allocate and populate B, and only once. All the addition, division and indexing has to be done regardless. If you are doing 16 bands, you still only need to allocate B once if your intent is to save an image. Even if this is no help, it might clarify why I don't understand the problem, or at least serve as a benchmark to time the speedups of other methods. This runs in 2.6 sec on my laptop on a 5k x 5k array of float64's, 0.5 sec of which is the creation of B.
