Why does padding an FFT in NumPy make it run much slower?

I had written a script using NumPy's fft function, where I was padding my input array to the nearest power of 2 to get a faster FFT.
After profiling the code, I found that the FFT call was taking the longest time, so I fiddled around with the parameters and found that if I didn't pad the input array, the FFT ran several times faster.
Here's a minimal example to illustrate what I'm talking about (I ran this in IPython and used the %timeit magic to time the execution).
x = np.arange(-4.*np.pi, 4.*np.pi, 1000)
dat1 = np.sin(x)
The timing results:
%timeit np.fft.fft(dat1)
100000 loops, best of 3: 12.3 µs per loop
%timeit np.fft.fft(dat1, n=1024)
10000 loops, best of 3: 61.5 µs per loop
Padding the array to a power of 2 leads to a very drastic slowdown.
Even if I create an array with a prime number of elements (hence the theoretically slowest FFT)
x2 = np.arange(-4.*np.pi, 4.*np.pi, 1009)
dat2 = np.sin(x2)
The time it takes to run still doesn't change so drastically!
%timeit np.fft.fft(dat2)
100000 loops, best of 3: 12.2 µs per loop
I would have thought that padding the array would be a one-time operation, after which calculating the FFT should be quicker.
Am I missing anything?
EDIT: I should have used np.linspace rather than np.arange. Below are the timing results using linspace:
In [2]: import numpy as np
In [3]: x = np.linspace(-4*np.pi, 4*np.pi, 1000)
In [4]: x2 = np.linspace(-4*np.pi, 4*np.pi, 1024)
In [5]: dat1 = np.sin(x)
In [6]: dat2 = np.sin(x2)
In [7]: %timeit np.fft.fft(dat1)
10000 loops, best of 3: 55.1 µs per loop
In [8]: %timeit np.fft.fft(dat2)
10000 loops, best of 3: 49.4 µs per loop
In [9]: %timeit np.fft.fft(dat1, n=1024)
10000 loops, best of 3: 64.9 µs per loop
Padding still causes a slowdown. Could this be a local issue, i.e., some quirk in my NumPy setup that makes it act this way?

FFT algorithms like NumPy's are fast for array sizes that factorize into a product of small primes, not just powers of two. If you increase the array size by padding, the computational work increases. The speed of FFT algorithms also depends critically on cache use; if padding results in an array size with less efficient cache behavior, performance suffers. The really fast FFT implementations, like FFTW and Intel MKL, actually generate plans based on the array size and its factorization to get the most efficient computation, using both heuristics and actual measurements. So no, padding to the nearest power of two is only beneficial in introductory textbooks, not necessarily in practice. As a rule of thumb, you usually benefit from padding only if the array size factorizes into one or more very large primes.
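As a hedged illustration of that rule of thumb: rather than hand-picking a power of two, recent SciPy versions provide scipy.fft.next_fast_len, which returns the nearest length that factors into the small primes its FFT handles efficiently. A minimal sketch (the specific lengths are just an example):
import numpy as np
from scipy.fft import next_fast_len
dat = np.sin(np.linspace(-4*np.pi, 4*np.pi, 1009))  # awkward prime length
n_fast = next_fast_len(len(dat))                     # a nearby small-prime length (1024 here)
spec = np.fft.fft(dat, n=n_fast)                     # pad only as far as actually helps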

You're using np.arange when you want to be using np.linspace
In [2]: x = np.arange(-4.*np.pi, 4.*np.pi, 1000)
In [3]: x
Out[3]: array([-12.56637061])
np.arange takes arguments (start, stop, step), whereas np.linspace takes (start, stop, number_of_pts). When you run the calculation with the data you probably intended to use, you get the expected behavior:
In [4]: x = np.linspace(-4.*np.pi, 4.*np.pi, 1000)
In [5]: dat1 = np.sin(x)
In [6]: %timeit np.fft.fft(dat1)
1 loops, best of 3: 28.1 µs per loop
In [7]: %timeit np.fft.fft(dat1, n=1024)
10000 loops, best of 3: 26.7 µs per loop
In [8]: x = np.linspace(-4.*np.pi, 4.*np.pi, 1009)
In [9]: dat2 = np.sin(x)
In [10]: %timeit np.fft.fft(dat2)
10000 loops, best of 3: 53 µs per loop
In [11]: %timeit np.fft.fft(dat2, n=1024)
10000 loops, best of 3: 26.8 µs per loop

Related

Efficient way to sample a large array many times with NumPy?

If you don't care about the details of what I'm trying to implement, just skip past the lower horizontal line
I am trying to do a bootstrap error estimation on some statistic with NumPy. I have an array x, and wish to compute the error on the statistic f(x), for which the usual Gaussian assumptions in error analysis do not hold. x is very large.
To do this, I resample x using numpy.random.choice(), where the size of my resample is the size of the original array, with replacement:
resample = np.random.choice(x, size=len(x), replace=True)
This gives me a new realization of x. This operation must now be repeated ~1,000 times to give an accurate error estimate. If I generate 1,000 resamples of this nature;
resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
and then compute the statistic f(x) on each realization;
results = [f(arr) for arr in resamples]
then I have inferred the error of f(x) to be something like
np.std(results)
the idea being that even though f(x) itself cannot be described using Gaussian error analysis, a distribution of f(x) measurements subject to random error can be.
Okay, so that's a bootstrap. Now, my problem is that the line
resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
is very slow for large arrays. Is there a smarter way to do this without a list comprehension? The second list comprehension
results = [f(arr) for arr in resamples]
can be pretty slow too, depending on the details of the function f(x).
Since we are allowing repetitions, we could generate all the indices in one go with np.random.randint and then simply index into x to get the equivalent of resamples, like so -
num_samples = 1000
idx = np.random.randint(0,len(x),size=(num_samples,len(x)))
resamples_arr = x[idx]
One more approach would be to generate random numbers from a uniform distribution with numpy.random.rand and scale them by the length of the array, like so -
resamples_arr = x[(np.random.rand(num_samples,len(x))*len(x)).astype(int)]
Runtime test with x of 5000 elems -
In [221]: x = np.random.randint(0,10000,(5000))
# Original soln
In [222]: %timeit [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
10 loops, best of 3: 84 ms per loop
# Proposed soln-1
In [223]: %timeit x[np.random.randint(0,len(x),size=(1000,len(x)))]
10 loops, best of 3: 76.2 ms per loop
# Proposed soln-2
In [224]: %timeit x[(np.random.rand(1000,len(x))*len(x)).astype(int)]
10 loops, best of 3: 59.7 ms per loop
For very large x
With a very large array x of 600,000 elements, you might not want to create all those indices for 1000 samples. In that case, the per-sample solutions would have timings something like this -
In [234]: x = np.random.randint(0,10000,(600000))
# Original soln
In [235]: %timeit np.random.choice(x, size=len(x), replace=True)
100 loops, best of 3: 13 ms per loop
# Proposed soln-1
In [238]: %timeit x[np.random.randint(0,len(x),len(x))]
100 loops, best of 3: 12.5 ms per loop
# Proposed soln-2
In [239]: %timeit x[(np.random.rand(len(x))*len(x)).astype(int)]
100 loops, best of 3: 9.81 ms per loop
As alluded to by @Divakar, you can pass a tuple to size to get a 2D array of resamples rather than using a list comprehension.
Assume for a second that f is just sum rather than some other function. Then:
x = np.random.randn(100000)
resamples = np.random.choice(x, size=(1000, x.shape[0]), replace=True)
# resamples.shape = (1000, 100000)
results = np.apply_along_axis(f, axis=1, arr=resamples)
print(results.shape)
# (1000,)
Here np.apply_along_axis is admittedly just a glorified for-loop, equivalent to [f(arr) for arr in resamples]. But I am not exactly sure, based on your question, whether you actually need to index x here.
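As a quick aside on the sum example: because the resamples already sit in one 2D array, a plain axis reduction avoids the Python-level loop entirely. A minimal sketch using the arrays defined above:
# Equivalent to np.apply_along_axis(np.sum, axis=1, arr=resamples),
# but as a single vectorized reduction
results = resamples.sum(axis=1)   # shape (1000,)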

Python3 - Computationally efficient correlation between matrix and array

I'd like to correlate the columns of an mxn matrix with a 1xm array. This should give me a 1xn array back. At the moment I am doing this a bit clumsily with:
c = np.corrcoef(X, y)[:-1,-1]
I find the correlations I want in the last column, with the last entry corresponding to the correlation the array has with itself (so r = 1.0).
This works, but I need to do it on quite big matrices, and that is when it becomes too computationally heavy and my computer gives up.
For example the largest matrix I am doing this for has the size:
48x290400 (= X) and 48x1 (=y), where I want to end up with 290400 r-values
This works fine in Matlab, but not in Python using np.corrcoef. Does anyone have a good solution for this?
Cheers
Daniel
We could use corr2_coeff from this post after transposing the input arrays -
corr2_coeff(a.T,b.T).ravel()
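corr2_coeff itself is not reproduced in this thread; a minimal sketch of that kind of vectorized row-wise Pearson correlation (my reconstruction from the usual formula, not a verbatim copy of the linked post) could look like:
import numpy as np
def corr2_coeff(A, B):
    # Pearson correlation between every row of A and every row of B
    A_mA = A - A.mean(axis=1, keepdims=True)   # row-center A
    B_mB = B - B.mean(axis=1, keepdims=True)   # row-center B
    ssA = (A_mA ** 2).sum(axis=1)              # row-wise sums of squares
    ssB = (B_mB ** 2).sum(axis=1)
    return A_mA.dot(B_mB.T) / np.sqrt(np.outer(ssA, ssB))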
Sample run -
In [160]: a = np.random.rand(3, 5)
In [161]: b = np.random.rand(3, 1)
# Proposed in the question
In [162]: np.corrcoef(a.T, b.T)[:-1,-1]
Out[162]: array([-0.0716, 0.1905, 0.9699, 0.7482, -0.1511])
# Proposed in this post
In [163]: corr2_coeff(a.T,b.T).ravel()
Out[163]: array([-0.0716, 0.1905, 0.9699, 0.7482, -0.1511])
Runtime test -
In [171]: a = np.random.rand(48, 10000)
In [172]: b = np.random.rand(48, 1)
In [173]: %timeit np.corrcoef(a.T, b.T)[:-1,-1]
1 loops, best of 3: 619 ms per loop
In [174]: %timeit corr2_coeff(a.T,b.T).ravel()
1000 loops, best of 3: 1.72 ms per loop
In [176]: 619.0/1.72
Out[176]: 359.8837209302326
Massive 360x speedup there!
Scaling it further -
In [239]: a = np.random.rand(48, 29040)
In [240]: b = np.random.rand(48, 1)
In [241]: %timeit np.corrcoef(a.T, b.T)[:-1,-1]
1 loops, best of 3: 5.19 s per loop
In [242]: %timeit corr2_coeff(a.T,b.T).ravel()
100 loops, best of 3: 8.09 ms per loop
In [244]: 5190.0/8.09
Out[244]: 641.5327564894932
640x+ speedup on this bigger dataset and should scale better as we go towards actual dataset sizes!

Efficiently compute columnwise sum of sparse array where every non-zero element is 1

I have a bunch of data in SciPy compressed sparse row (CSR) format. Of course the majority of elements are zero, and I further know that all non-zero elements have a value of 1. I want to compute sums over different subsets of rows of my matrix. At the moment I am doing the following:
import numpy as np
import scipy as sp
import scipy.sparse
# create some data with sparsely distributed ones
data = np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
data = sp.sparse.csr_matrix(data, dtype='int8')
# generate column-wise sums over random subsets of rows
nrand = 1000
for k in range(nrand):
    inds = np.random.choice(data.shape[0], size=100, replace=False)
    # 60% of time is spent here
    extracted_rows = data[inds]
    # 20% of time is spent here
    row_sum = extracted_rows.sum(axis=0)
The last few lines there are the bottleneck in a larger computational pipeline. As I annotated in the code, 60% of time is spent slicing the data from the random indices, and 20% is spent computing the actual sum.
It seems to me I should be able to use my knowledge about the data in the array (i.e., any non-zero value in the sparse matrix will be 1; no other values present) to compute these sums more efficiently. Unfortunately, I cannot figure out how. Dealing with just data.indices perhaps? I have tried other sparsity structures (e.g. CSC matrix), as well as converting to dense array first, but these approaches were all slower than this CSR matrix approach.
It is well known that indexing of sparse matrices is relatively slow, and there have been SO questions about getting around that by accessing the data attributes directly.
But first some timings. Using data and inds as you show, I get
In [23]: datad=data.A # times at 3.76 ms per loop
In [24]: timeit row_sumd=datad[inds].sum(axis=0)
1000 loops, best of 3: 529 µs per loop
In [25]: timeit row_sum=data[inds].sum(axis=0)
1000 loops, best of 3: 890 µs per loop
In [26]: timeit d=datad[inds]
10000 loops, best of 3: 55.9 µs per loop
In [27]: timeit d=data[inds]
1000 loops, best of 3: 617 µs per loop
The sparse version is slower than the dense one, but not by a lot. The sparse indexing is much slower, but its sum is somewhat faster.
The sparse sum is done with a matrix product
def sparse.spmatrix.sum
....
return np.asmatrix(np.ones((1, m), dtype=res_dtype)) * self
That suggests a faster way - turn inds into an appropriate array of 1s and multiply.
In [49]: %%timeit
....: b=np.zeros((1,data.shape[0]),'int8')
....: b[:,inds]=1
....: rowmul=b*data
....:
1000 loops, best of 3: 587 µs per loop
That makes the sparse operation about as fast as the equivalent dense one. (but converting to dense is much slower)
==================
The last time test is missing the np.asmatrix that is present in the sparse sum. But times are similar, and the results are the same
In [232]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x1=np.asmatrix(b)*data
1000 loops, best of 3: 661 µs per loop
In [233]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x2=b*data
1000 loops, best of 3: 605 µs per loop
One produces a matrix, the other an array. But both are doing a matrix product, the 2nd dim of b against the 1st of data. Even though b is an array, the task is actually delegated to data and its matrix product - in a not-so-transparent way.
In [234]: x1
Out[234]: matrix([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)
In [235]: x2
Out[235]: array([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)
b*data.A is element multiplication and raises an error; np.dot(b,data.A) works but is slower.
Newer numpy/python has a matmul operator. I see the same time pattern:
In [280]: timeit b@dataA # dense product
100 loops, best of 3: 2.64 ms per loop
In [281]: timeit b@data.A # slower due to `.A` conversion
100 loops, best of 3: 6.44 ms per loop
In [282]: timeit b@data # sparse product
1000 loops, best of 3: 571 µs per loop
np.dot may also delegate action to sparse, though you have to be careful. I just hung my machine with np.dot(csr_matrix(b),data.A).
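Picking up the question's own hint about data.indices: because every stored value is 1, the column-wise sum over a subset of rows is just a count of how often each column index occurs among those rows' non-zeros. A minimal sketch of that idea (rowsubset_colsum is a name made up here, and it is not benchmarked against the approaches above):
import numpy as np
def rowsubset_colsum(csr, inds):
    # Column indices of all non-zeros in the selected rows; since every
    # stored value is 1, counting them per column gives the column sums.
    cols = np.concatenate([csr.indices[csr.indptr[i]:csr.indptr[i + 1]]
                           for i in inds])
    return np.bincount(cols, minlength=csr.shape[1])
# e.g. row_sum = rowsubset_colsum(data, inds)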
Here's a vectorized approach: convert data to a dense array and also get all those inds in a vectorized manner using an argpartition-based method -
# Number of selections as a parameter
n = 100
# Get inds across all iterations in a vectorized manner as a 2D array.
inds2D = np.random.rand(nrand,data.shape[0]).argpartition(n)[:,:n]
# Convert to a dense NumPy array, index with those 2D array indices,
# then reshape and sum-reduce to get the final output
out = np.array(data.todense())[inds2D.ravel()].reshape(nrand,n,-1).sum(1)
Runtime test -
1) Function definitions :
def org_app(nrand,n):
    out = np.zeros((nrand,data.shape[1]),dtype=int)
    for k in range(nrand):
        inds = np.random.choice(data.shape[0], size=n, replace=False)
        extracted_rows = data[inds]
        out[k] = extracted_rows.sum(axis=0)
    return out
def vectorized_app(nrand,n):
    inds2D = np.random.rand(nrand,data.shape[0]).argpartition(n)[:,:n]
    return np.array(data.todense())[inds2D.ravel()].reshape(nrand,n,-1).sum(1)
Timings :
In [205]: # create some data with sparsely distributed ones
...: data = np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
...: data = sp.sparse.csr_matrix(data, dtype='int8')
...:
...: # generate column-wise sums over random subsets of rows
...: nrand = 1000
...: n = 100
...:
In [206]: %timeit org_app(nrand,n)
1 loops, best of 3: 1.38 s per loop
In [207]: %timeit vectorized_app(nrand,n)
1 loops, best of 3: 826 ms per loop

Numpy Pure Functions for performance, caching

I'm writing some moderately performance-critical code in NumPy.
This code will be in the innermost loop of a computation whose run time is measured in hours.
A quick calculation suggests that this code will be executed something like 10^12 times, in some variations of the calculation.
One function calculates sigmoid(X) and another calculates its derivative (gradient).
Sigmoid has the property that for y = sigmoid(x), dy/dx = y(1 - y).
In Python with NumPy this looks like:
sigmoid = np.vectorize(lambda x: 1.0 / (1.0 + np.exp(-x)))
grad_sigmoid = np.vectorize(lambda x: sigmoid(x) * (1 - sigmoid(x)))
As can be seen, both functions are pure (without side effects), so they seem like ideal candidates for memoization, at least in the short term. I do have some worries about caching every single call to sigmoid ever made: storing 10^12 floats would take several terabytes of RAM.
Is there a good way to optimise this?
Will python pick up that these are pure functions and cache them for me, as appropriate?
Am I worrying over nothing?
These functions already exist in scipy. The sigmoid function is available as scipy.special.expit.
In [36]: from scipy.special import expit
Compare expit to the vectorized sigmoid function:
In [38]: x = np.linspace(-6, 6, 1001)
In [39]: %timeit y = sigmoid(x)
100 loops, best of 3: 2.4 ms per loop
In [40]: %timeit y = expit(x)
10000 loops, best of 3: 20.6 µs per loop
expit is also faster than implementing the formula yourself:
In [41]: %timeit y = 1.0 / (1.0 + np.exp(-x))
10000 loops, best of 3: 27 µs per loop
The CDF of the logistic distribution is the sigmoid function. It is available as the cdf method of scipy.stats.logistic, but cdf eventually calls expit, so there is no point in using that method. You can use the pdf method to compute the derivative of the sigmoid function, or the _pdf method which has less overhead, but "rolling your own" is faster:
In [44]: def sigmoid_grad(x):
....: ex = np.exp(-x)
....: y = ex / (1 + ex)**2
....: return y
Timing (x has length 1001):
In [45]: from scipy.stats import logistic
In [46]: %timeit y = logistic._pdf(x)
10000 loops, best of 3: 73.8 µs per loop
In [47]: %timeit y = sigmoid_grad(x)
10000 loops, best of 3: 29.7 µs per loop
Be careful with your implementation if you are going to use values that are far into the tails. The exponential function can overflow pretty easily. logistic._pdf is a bit more robust than my quick implementation of sigmoid_grad:
In [60]: sigmoid_grad(-500)
/home/warren/anaconda/bin/ipython:3: RuntimeWarning: overflow encountered in double_scalars
import sys
Out[60]: 0.0
In [61]: logistic._pdf(-500)
Out[61]: 7.1245764067412855e-218
An implementation using sech**2 (1/cosh**2) is a bit slower than the above sigmoid_grad:
In [101]: def sigmoid_grad_sech2(x):
.....: y = (0.5 / np.cosh(0.5*x))**2
.....: return y
.....:
In [102]: %timeit y = sigmoid_grad_sech2(x)
10000 loops, best of 3: 34 µs per loop
But it handles the tails better:
In [103]: sigmoid_grad_sech2(-500)
Out[103]: 7.1245764067412855e-218
In [104]: sigmoid_grad_sech2(500)
Out[104]: 7.1245764067412855e-218
Just expanding on my comment, here is a comparison between your sigmoid through vectorize and using numpy directly:
In [1]: x = np.random.normal(size=10000)
In [2]: sigmoid = np.vectorize(lambda x: 1.0 / (1.0 + np.exp(-x)))
In [3]: %timeit sigmoid(x)
10 loops, best of 3: 63.3 ms per loop
In [4]: %timeit 1.0 / (1.0 + np.exp(-x))
1000 loops, best of 3: 250 us per loop
As you can see, not only does vectorize make it much slower, the fact is that you can calculate 10000 sigmoids in 250 microseconds (that is, 25 nanoseconds for each). A single dictionary look-up in Python is slower than that, let alone all the other code to get the memoization in place.
The only way to optimize this that I can think of is writing a sigmoid ufunc for numpy, which basically would implement the operation in C. That way, you wouldn't have to apply each intermediate operation in the sigmoid to the entire array, even though numpy does that really fast.
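As a small bridge between the two answers: scipy.special.expit from the answer above is already such a compiled ufunc, so it evaluates the whole sigmoid in one C-level pass per element and, like any ufunc, can write into a preallocated output buffer. A minimal sketch:
import numpy as np
from scipy.special import expit
x = np.random.normal(size=10000)
buf = np.empty_like(x)
expit(x, out=buf)   # compiled elementwise sigmoid, result stored in buf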
If you are looking to memoize this process, I'd wrap that code in a function and decorate with functools.lru_cache(maxsize=n). Experiment with the maxsize value to find the appropriate size for your application. For best results, use a maxsize argument that is a power of two.
from functools import lru_cache

@lru_cache(maxsize=8096)
def sigmoids(x):
    sigmoid = np.vectorize(lambda x: 1.0 / (1.0 + np.exp(-x)))
    grad_sigmoid = np.vectorize(lambda x: sigmoid(x) * (1 - sigmoid(x)))
    return sigmoid, grad_sigmoid
If you're on 2.7 (which I expect you are since you're using numpy), you can take a look at https://pypi.python.org/pypi/repoze.lru/ for a memoization library with identical syntax.
You can install it via pip: pip install repoze.lru
from repoze.lru import lru_cache

@lru_cache(maxsize=8096)
def sigmoids(x):
    sigmoid = np.vectorize(lambda x: 1.0 / (1.0 + np.exp(-x)))
    grad_sigmoid = np.vectorize(lambda x: sigmoid(x) * (1 - sigmoid(x)))
    return sigmoid, grad_sigmoid
Mostly I agree with Warren Weckesser and his answer above.
But for the derivative of the sigmoid, the following can be used:
In [002]: def sg(x):
...: s = scipy.special.expit(x)
...: return s * (1.0 - s)
Timings:
In [003]: %timeit y = logistic._pdf(x)
10000 loops, best of 3: 45 µs per loop
In [004]: %timeit y = sg(x)
10000 loops, best of 3: 20.4 µs per loop
The only problem is accuracy:
In [005]: sg(37)
Out[005]: 0.0
In [006]: logistic._pdf(37)
Out[006]: 8.5330476257440658e-17
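A hedged side note on that accuracy point: since 1 - sigmoid(x) equals sigmoid(-x), writing the derivative as sigmoid(x) * sigmoid(-x) avoids the precision loss when 1 - s rounds to zero in the tail. A minimal sketch:
from scipy.special import expit
def sg_stable(x):
    # sigmoid(x) * (1 - sigmoid(x)) == sigmoid(x) * sigmoid(-x);
    # the right-hand form keeps precision deep in the tails.
    return expit(x) * expit(-x)
# sg_stable(37) stays around 8.5e-17 instead of collapsing to 0.0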

Improve performance of array handling

I have a large piece of code which takes a while to run. I've tracked down the two lines that take up most of the time and I'd like to know if there's a way to speed them up. Here's a MWE:
import numpy as np
def setup(k=2, m=100, n=300):
    return np.random.randn(k,m), np.random.randn(k,n), np.random.randn(k,m)
# make some random points and weights
a, b, w = setup()
# Weighted euclidean distance between arrays a and b.
wdiff = (a[np.newaxis,...] - b[np.newaxis,...].T) / w[np.newaxis,...]
# This is the set of operations that need a performance boost:
dist_1 = np.exp(-0.5*(wdiff*wdiff)) / w
dist_2 = np.array([i[0]*i[1] for i in dist_1])
BTW, I'm coming from this question: Fast weighted euclidean distance between points in arrays, where ali_m suggested an amazing answer that saved me a lot of time by applying broadcasting (of which I know absolutely nothing, yet at least). Could something like that be applied to these lines?
Your dist_2 calculation can be sped up by a factor of 10 or so:
>>> dist_1.shape
(300, 2, 100)
>>> %timeit dist_2 = np.array([i[0]*i[1] for i in dist_1])
1000 loops, best of 3: 1.35 ms per loop
>>> %timeit dist_2 = dist_1.prod(axis=1)
10000 loops, best of 3: 116 µs per loop
>>> np.allclose(np.array([i[0]*i[1] for i in dist_1]), dist_1.prod(axis=1))
True
I couldn't manage to do much with your dist_1 as the majority of time is spent in the exponentiation:
>>> %timeit (-0.5*(wdiff*wdiff)) / w
1000 loops, best of 3: 467 µs per loop
>>> %timeit np.exp((-0.5*(wdiff*wdiff)))/w
100 loops, best of 3: 3.3 ms per loop
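One further hedged observation on dist_2, under the shapes used above (wdiff of shape (300, 2, 100), w of shape (2, 100)): since exp(a) * exp(b) == exp(a + b), the product over the middle axis can be folded into the exponent, so np.exp only runs over a (300, 100) array instead of (300, 2, 100):
# Sum the squared terms over axis 1 first, take one exp over the smaller
# array, and divide by the product of the weights along that axis.
dist_2_alt = np.exp(-0.5 * (wdiff * wdiff).sum(axis=1)) / w.prod(axis=0)
# np.allclose(dist_2_alt, dist_1.prod(axis=1)) should hold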
