FFT convolution not being faster than the canonical convolution computation

A couple of months ago I learned that the fastest way to compute a convolution is with the FFT algorithm (even more so with the FFTW library).
With the following code, however, I get results that contradict this.
Imports
from scipy import fftpack
from numba import jit
Convolution with FFT:
def conv_fft(X, R):
    n = len(X)
    a = fftpack.fft(X)
    b = fftpack.fft(R)
    c = a * b
    e = fftpack.ifft(c)
    result = e[n]
    return result
Convolution using the formula:
@jit(cache=True)
def conv(X, R):
    n = len(X)
    result = complex_type(0)
    for i in range(n+1):
        result += X[n-i] * R[i]
    return result
These are critical functions in a much more complex process; the only difference between the runs is which version is used.
        no FFT     with FFT   increment
Test1   0.028761   0.034139   0.0053780
Test2   0.098565   0.103180   0.0046150
* Test2 computes more convolutions per test.
The tests show that the code with FFT is slower, and I cannot see why, since fftpack apparently calls the FFTW library, which is "the fastest in the west"...
Any guidance is appreciated.
One conclusion for me is that Numba JIT compilation is unbelievably fast.

You're only returning a single value (the n-th one) of the convolution, not the full array. With the FFT you always calculate all values, whereas your conv function only calculates the one you're after. Complexity-wise, the FFT is O(N*log(N)) and your implementation of conv is O(N). A naive conv function that returned the full convolution would be O(N^2).
So, if you want the full convolved array, your best bet is the FFT. If you only want the n-th value, your method is the best in terms of complexity.
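To make the difference concrete, here is a minimal sketch (not the asker's code) that computes the full linear convolution with zero-padded FFTs and a single output sample with the direct sum. The helper names full_conv_fft and conv_at are made up for this illustration, and it assumes real-valued inputs so that numpy's rfft/irfft can be used.

```python
import numpy as np

def full_conv_fft(x, r):
    """Full linear convolution via zero-padded FFTs: O(N log N) for all samples."""
    size = len(x) + len(r) - 1          # length of the linear convolution
    spectrum = np.fft.rfft(x, size) * np.fft.rfft(r, size)
    return np.fft.irfft(spectrum, size)

def conv_at(x, r, k):
    """Only output sample k of the linear convolution: O(N) work."""
    lo = max(0, k - len(r) + 1)         # keep r[k - i] inside its bounds
    hi = min(k, len(x) - 1)             # keep x[i] inside its bounds
    return sum(x[i] * r[k - i] for i in range(lo, hi + 1))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
r = rng.standard_normal(64)
k = 63

full = full_conv_fft(x, r)              # every sample at once
single = conv_at(x, r, k)               # just the one sample
```

If only one sample is ever needed, conv_at does far less work than computing and then discarding the rest of full.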

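You should be able to get away with creating fewer temporary arrays by using this type of syntax, which should make it faster. Note that overwrite_x only permits fftpack to destroy the input; the transform is still the return value, so it has to be assigned.
def conv_fft(X, R):
    # overwrite_x=True lets fftpack reuse the input buffer as scratch space
    a = fftpack.fft(X, overwrite_x=True)
    # multiply in place instead of allocating another array
    a *= fftpack.fft(R)
    return fftpack.ifft(a, overwrite_x=True)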

Related

Double antiderivative computation in python

I have the following problem. I have a function f defined in python using numpy functions. The function is smooth and integrable on positive reals. I want to construct the double antiderivative of the function (assuming that both the value and the slope of the antiderivative at 0 are 0) so that I can evaluate it on any positive real smaller than 100.
Definition of antiderivative of f at x:
integrate f(s) with s from 0 to x
Definition of double antiderivative of f at x:
integrate (integrate f(t) with t from 0 to s) with s from 0 to x
The actual form of f is not important, so I will use a simple one for convenience. But please note that even though my example has a known closed form, my actual function does not.
import numpy as np
f = lambda x: np.exp(-x)*x
My solution is to construct the antiderivative as an array using naive numerical integration:
N = 10000
delta = 100/N
xs = np.linspace(0,100,N+1)
vs = f(xs)
avs = np.cumsum(vs)*delta
aavs = np.cumsum(avs)*delta
This of course works but it gives me arrays instead of functions. But this is not a big problem as I can interpolate aavs using a spline to get a function and get rid of the arrays.
from scipy.interpolate import UnivariateSpline
aaf = UnivariateSpline(xs, aavs)
The function aaf is approximately the double antiderivative of f.
The problem is that even though it works, there is quite a bit of overhead before I can get my function and precision is expensive.
My other idea was to interpolate f by a spline and take the antiderivative of that, however this introduces numerical errors that are too big for what I want to use the function.
Is there any better way to do that? By better I mean faster without sacrificing accuracy.
Edit: What I hope is possible is to use some kind of Fourier transform to avoid integrating twice. I hope that there is some convenient transform of vs that allows to multiply the values component-wise with xs and transform back to get the double antiderivative. I played with this a bit, but I got lost.
Edit: I figured out that using the trapezoidal rule instead of a naive sum increases the accuracy quite a bit. Using Simpson's rule should increase the accuracy further, but it's somewhat fiddly to do with numpy arrays.
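For reference, a sketch of the trapezoidal variant, assuming a SciPy version that provides scipy.integrate.cumulative_trapezoid (older releases call it cumtrapz). It uses the example f from above, whose double antiderivative has the closed form x - 2 + (2 + x)*exp(-x), so the two approaches can be compared against an exact answer.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

f = lambda x: np.exp(-x) * x
# Closed form of the double antiderivative of the example f (value and slope 0 at 0)
exact = lambda x: x - 2.0 + (2.0 + x) * np.exp(-x)

N = 10000
xs = np.linspace(0.0, 100.0, N + 1)
delta = xs[1] - xs[0]
vs = f(xs)

# Naive rectangle sums, as in the question
aavs_naive = np.cumsum(np.cumsum(vs) * delta) * delta

# Trapezoidal rule applied twice; initial=0 keeps the output the same length as xs
avs_trap = cumulative_trapezoid(vs, xs, initial=0.0)
aavs_trap = cumulative_trapezoid(avs_trap, xs, initial=0.0)

err_naive = np.max(np.abs(aavs_naive - exact(xs)))
err_trap = np.max(np.abs(aavs_trap - exact(xs)))
print("max abs error, naive sum:  ", err_naive)
print("max abs error, trapezoidal:", err_trap)
```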
Edit: As @user202729 rightfully complains, this seems off. It seems off because I have skipped some details. I explain here why what I say makes sense, but it does not affect my question.
My actual goal is not to find the double antiderivative of f, but to find a transformation of this. I have skipped that because I think it only confuses the matter.
The function f decays exponentially as x approaches 0 or infinity. I minimize the numerical error in the integration by starting the sum from 0 and going up to approximately the peak of f. This ensures that the relative error is approximately constant. Then I start from the opposite direction, from some very big x, and go back to the peak. Then I do the same for the antiderivative values.
Then I transform the aavs by another function which is sensitive to numerical errors. Then I find the region where the errors are big (the values oscillate violently) and drop these values. Finally I approximate what I believe are good values by a spline.
Now if I use spline to approximate f, it introduces an absolute error which is the dominant term in a rather large interval. This gets "integrated" twice and it ends up being a rather large relative error in aavs. Then once I transform aavs, I find that the 'good region' has shrunk considerably.
EDIT: The actual form of f is something I'm still looking into. However, it is going to be a generalisation of the lognormal distribution. Right now I am playing with the following family.
I start by defining a generalization of the normal distribution:
from scipy import special  # provides expit, used below
def pdf_n(params, center=0.0, slope=8):
    scale, min, diff = params
    if diff > 0:
        r = min
        l = min + diff
    else:
        r = min - diff
        l = min
    def retfun(m):
        x = (m - center)/scale
        E = special.expit(slope*x)*(r - l) + l
        return np.exp( -np.power(1 + x*x, E)/2 )
    return np.vectorize(retfun)
It may not be obvious what is happening here, but the result is quite simple. The function decays as exp(-x^(2l)) on the left and as exp(-x^(2r)) on the right. For min=1 and diff=0, this is the normal distribution. Note that this is not normalized. Then I define
g = pdf_n(params)
f = np.vectorize(lambda x:g(np.log(x))/x/area)
where area is the normalization constant.
Note that this is not the actual code I use. I stripped it down to the bare minimum.

You can compute the two np.cumsum (and the divisions) at once more efficiently using Numba. This is significantly faster since there is no need for several temporary arrays to be allocated, filled, read again and freed. Here is a naive implementation:
import numba as nb
@nb.njit('float64[::1](float64[::1], float64)') # Assume vs is contiguous
def doubleAntiderivative_naive(vs, delta):
    res = np.empty(vs.size, dtype=np.float64)
    sum1, sum2 = 0.0, 0.0
    for i in range(vs.size):
        sum1 += vs[i] * delta
        sum2 += sum1 * delta
        res[i] = sum2
    return res
However, the sum is not very good in terms of numerical stability. A Kahan summation is needed to improve the accuracy (or possibly the alternative Kahan–Babuška–Klein algorithm if you are paranoid about accuracy and performance does not matter so much). Note that NumPy uses a pairwise algorithm, which is quite good but far from perfect in terms of accuracy (it is a good compromise between performance and accuracy).
Moreover, delta can be factored out of the summation (i.e. the result just needs to be multiplied by delta**2 at the end).
Here is an implementation using the more accurate Kahan summation:
@nb.njit('float64[::1](float64[::1], float64)')
def doubleAntiderivative_accurate(vs, delta):
    res = np.empty(vs.size, dtype=np.float64)
    delta2 = delta * delta
    sum1, sum2 = 0.0, 0.0
    c1, c2 = 0.0, 0.0
    for i in range(vs.size):
        # Kahan summation of the antiderivative of vs
        y1 = vs[i] - c1
        t1 = sum1 + y1
        c1 = (t1 - sum1) - y1
        sum1 = t1
        # Kahan summation of the double antiderivative of vs
        y2 = sum1 - c2
        t2 = sum2 + y2
        c2 = (t2 - sum2) - y2
        sum2 = t2
        res[i] = sum2 * delta2
    return res
Here is the performance of the approaches on my machine (with an i5-9600KF processor):
Numpy cumsum:       51.3 us
Naive Numba:        11.6 us
Accurate Numba:     37.2 us
Here is the relative error of the approaches (based on the provided input function):
Numpy cumsum:       1e-13
Naive Numba:        5e-14
Accurate Numba:     2e-16
Perfect precision:  1e-16 (assuming 64-bit numbers are used)
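The effect of compensated summation can also be seen without Numba. The following is a small pure-Python sketch (the helper names naive_sum and kahan_sum are made up here, and this is not the benchmark used for the numbers above); it uses math.fsum as an accurately rounded reference.

```python
import math

def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Kahan (compensated) summation: carries the rounding error along."""
    total, comp = 0.0, 0.0
    for v in values:
        y = v - comp            # apply the correction from the previous step
        t = total + y           # low-order bits of y may be lost here...
        comp = (t - total) - y  # ...and are recovered into comp
        total = t
    return total

values = [0.1] * 1_000_000
reference = math.fsum(values)   # accurately rounded sum from the standard library

err_naive = abs(naive_sum(values) - reference)
err_kahan = abs(kahan_sum(values) - reference)
print("naive error:", err_naive)
print("Kahan error:", err_kahan)
```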
If f can be easily computed using Numba (this is the case here), then vs[i] can be replaced by calls to f (inlined by Numba). This helps to reduce the memory consumption of the computation (N can be huge without saturating your RAM).
As for the interpolation, splines often give good numerical results, but they are quite expensive to compute and, AFAIK, they require the whole array to be computed first (each item of the array affects the whole spline, although some items may have a negligible impact on their own). For your needs, you could consider Lagrange polynomials instead. Be careful when using Lagrange polynomials near the edges. In your case, you can easily avoid the numerical divergence at the edges by extending the array with the border values (since you know the derivative at each edge of vs is 0). With this method the interpolation can be applied on the fly, which can be good for both performance (typically if the computation is parallelized) and memory usage.

First, I created a version of the code I found more intuitive. Here I multiply cumulative sum values by bin widths. I believe there is a small error in the original version of the code related to the bin width issue.
import numpy as np
f = lambda x: np.exp(-x)*x
N = 1000
xs = np.linspace(0,100,N+1)
domainwidth = ( np.max(xs) - np.min(xs) )
binwidth = domainwidth / N
vs = f(xs)
avs = np.cumsum(vs)*binwidth
aavs = np.cumsum(avs)*binwidth
Next, for visualization here is some very simple plotting code:
import matplotlib
import matplotlib.pyplot as plt
plt.figure()
plt.scatter( xs, vs )
plt.figure()
plt.scatter( xs, avs )
plt.figure()
plt.scatter( xs, aavs )
plt.show()
The first integral matches the known result for the example expression, which can be checked on Wolfram.
Below is a simple function that extracts an element from the double antiderivative. Note that int is a bad rounding function. I assume this is what you have implemented already.
def extract_double_antideriv_value(x):
    return aavs[int(x/binwidth)]
singleresult = extract_double_antideriv_value(50.24)
print('singleresult', singleresult)
Whatever full computation steps are required, we need to know them before we can start optimizing. Do you have a million different functions to integrate? If you only need to query a single double anti-derivative many times, your original solution should be fairly ideal.
Symbolic Approximation:
Have you considered approximations to the original function f, which can have closed form integration solutions? You have a limited domain on which the function lives. Perhaps approximate f with a Taylor series (which can be constructed with known maximum error) then integrate exactly? (consider Pade, Taylor, Fourier, Cheby, Lagrange(as suggested by another answer), etc...)
Log Tricks:
Another alternative to dealing with spiky errors, would be to take the log of your original function. Is f always positive? Is the integration error caused because the neighborhood around the max is very small? If so, you can study ln(f) or even ln(ln(f)) instead. It would really help to understand what f looks like more.
Approximation Integration Tricks:
There are countless integration tricks that give approximate closed-form solutions to otherwise intractable integrals. A very common one when exponential functions are involved (I think yours is exponential?) is Laplace's method. But which trick to pull out of the bag depends heavily on the conditions that f satisfies.

How to speed up an operation between two arrays of different sizes?

I have two arrays which are lists of points in 3D space. These arrays have different lengths.
np.shape(arr1) == (34709, 3)
np.shape(arr2) == (4835053, 3)
I have a function which can compute the Pythagorean distance between a single point in one array and all points in another, given periodic boundary conditions:
def pythag_periodic(array, point, dimensions):
    delta = np.abs(array - point)
    delta = np.where(delta > 0.5 * dimensions, delta - dimensions, delta)
    return np.sqrt((delta ** 2).sum(axis=-1))
I am trying to apply this operation to all points in both arrays. I have a loop which calls this function repeatedly, but it is agonisingly slow.
for i in arr1:
    pp.append(pythag_periodic(arr2, i, dimensions))
Any suggestions as to how I might speed this up would be much appreciated.

You should use numba: https://numba.pydata.org/ (disclosure: I am not the author). It is a library that translates Python functions to optimized machine code at runtime. Thus, Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.
Applying it to your code is really simple. In a nutshell, import the library and then use the decorator. There are also further options that may be relevant for you, such as parallelizing your algorithms (have a look at their website).
For instance:
from numba import jit
@jit(nopython=True)
def pythag_periodic(array, point, dimensions):
    delta = np.abs(array - point)
    delta = np.where(delta > 0.5 * dimensions, delta - dimensions, delta)
    return np.sqrt((delta ** 2).sum(axis=-1))

Another cool option would be to exploit NumPy's broadcasting (via the None keyword when indexing the arrays) to avoid the loop, and the super neat einsum function to perform the square and sum operations simultaneously.
Note, however, that this approach is slightly slower for small matrices, but once you get to sizes greater than 4000 elements or so it is much faster. Also, beware of running out of RAM, as vectorization has this downside (although you are already storing the NxM array in your code anyway).
import numpy as np
def pythag_periodic_vectorized(a1, a2, dimensions):
    # delta[i, j, :] is the per-axis separation between a1[i] and a2[j]
    delta = np.abs(a1[:,None,:] - a2[None,...])
    # wrap using the periodic box size (not a1.shape[1], which is the number of coordinates)
    delta = np.where(delta > 0.5 * dimensions, delta - dimensions, delta)
    return np.sqrt(np.einsum("ijk,ijk->ij", delta, delta))
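Since the full distance matrix for the array sizes in the question would not fit in RAM, one way to keep the broadcasting idea is to process arr1 in blocks. Below is a minimal sketch under that assumption; periodic_distances_chunked and its chunk parameter are hypothetical names for this illustration, and the small random arrays only stand in for the real data.

```python
import numpy as np

def pythag_periodic(array, point, dimensions):
    """Reference: distances from one point to all rows of array (as in the question)."""
    delta = np.abs(array - point)
    delta = np.where(delta > 0.5 * dimensions, delta - dimensions, delta)
    return np.sqrt((delta ** 2).sum(axis=-1))

def periodic_distances_chunked(a1, a2, dimensions, chunk=256):
    """Yield (start, block) with block[i, j] = distance(a1[start + i], a2[j])."""
    for start in range(0, len(a1), chunk):
        # Broadcast one block of a1 against all of a2: shape (block, len(a2), 3)
        delta = np.abs(a1[start:start + chunk, None, :] - a2[None, :, :])
        delta = np.where(delta > 0.5 * dimensions, delta - dimensions, delta)
        yield start, np.sqrt(np.einsum("ijk,ijk->ij", delta, delta))

rng = np.random.default_rng(0)
box = 10.0                                   # assumed cubic periodic box size
a1 = rng.uniform(0, box, (50, 3))
a2 = rng.uniform(0, box, (200, 3))

blocks = dict(periodic_distances_chunked(a1, a2, box, chunk=16))
```

Each block can be reduced (for example to a minimum or a histogram) before the next one is computed, so only chunk * len(a2) distances are held in memory at a time.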

FFT polynomial multiplication in Python using inbuilt Numpy.fft

I want to multiply two polynomials fast in Python. My polynomials are rather large (> 100000 elements) and I have to multiply lots of them. Below you will find my approach:
from numpy.random import seed, randint
from numpy import polymul, pad
from numpy.fft import fft, ifft
from timeit import default_timer as timer
length=100
def test_mul(arr_a,arr_b): #inbuilt python multiplication
    c=polymul(arr_a,arr_b)
    return c
def sb_mul(arr_a,arr_b): #my schoolbook multiplication
    c=[0]*(len(arr_a) + len(arr_b) - 1 )
    for i in range( len(arr_a) ):
        for j in range( len(arr_b) ):
            k=i+j
            c[k]=c[k]+arr_a[i]*arr_b[j]
    return c
def fft_test(arr_a,arr_b): #fft based polynomial multiplication
    arr_a1=pad(arr_a,(0,length),'constant')
    arr_b1=pad(arr_b,(0,length),'constant')
    a_f=fft(arr_a1)
    b_f=fft(arr_b1)
    c_f=[0]*(2*length)
    for i in range( len(a_f) ):
        c_f[i]=a_f[i]*b_f[i]
    return c_f
if __name__ == '__main__':
    seed(int(timer()))
    random=1
    if(random==1):
        x=randint(1,1000,length)
        y=randint(1,1000,length)
    else:
        x=[1]*length
        y=[1]*length
    start=timer()
    res=test_mul(x,y)
    end=timer()
    print("time for built in pol_mul", end-start)
    start=timer()
    res1=sb_mul(x,y)
    end=timer()
    print("time for schoolbook mult", end-start)
    res2=fft_test(x,y)
    print(res2)
    #########check############
    if( len(res)!=len(res1) ):
        print("ERROR")
    for i in range( len(res) ):
        if( res[i]!=res1[i] ):
            print("ERROR at pos ",i,"res[i]:",res[i],"res1[i]:",res1[i])
Here is my approach in detail:
1. First, I tried a naive implementation of schoolbook multiplication with complexity O(n^2). As you may expect, it turned out to be very slow.
2. Second, I came across polymul in the NumPy library. This function is a lot faster than the previous one, but I realized it also has O(n^2) complexity: if you increase the length by a factor of k, the time increases by a factor of k^2.
3. My third approach is an FFT-based multiplication using the inbuilt FFT functions. I followed the well-known approach also described here, but I am not able to get it to work.
Now my questions are:
1. Where am I going wrong in my FFT-based approach? Can you please tell me how I can fix it?
2. Is my observation that the polymul function has O(n^2) complexity correct?
Please let me know if you have any questions.
Thanks in advance.

Where am I going wrong in my FFT based approach? Can you please tell me how can I fix it?
The main problem is that in the FFT-based approach you should take the inverse transform after the multiplication, but that step is missing from your code. With this missing step added, your code should look like the following:
def fft_test(arr_a,arr_b): #fft based polynomial multiplication
    arr_a1=pad(arr_a,(0,length),'constant')
    arr_b1=pad(arr_b,(0,length),'constant')
    a_f=fft(arr_a1)
    b_f=fft(arr_b1)
    c_f=[0]*(2*length)
    for i in range( len(a_f) ):
        c_f[i]=a_f[i]*b_f[i]
    return ifft(c_f)
Note that there may also be a few opportunities for improvement:
- The zero padding can be handled directly by passing the required FFT length as the second argument (e.g. a_f = fft(arr_a, length)).
- The coefficient multiplication in your for loop may be handled directly by numpy.multiply.
- If the polynomial coefficients are real-valued, then you can use numpy.fft.rfft and numpy.fft.irfft (instead of numpy.fft.fft and numpy.fft.ifft) for some extra performance boost.
So an implementation for real-valued inputs may look like:
from numpy.fft import rfft, irfft
def fftrealpolymul(arr_a, arr_b): #fft based real-valued polynomial multiplication
    L = len(arr_a) + len(arr_b)
    a_f = rfft(arr_a, L)
    b_f = rfft(arr_b, L)
    # pass L explicitly: irfft otherwise assumes an even output length
    return irfft(a_f * b_f, L)
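As a quick sanity check of the real-valued FFT approach, the sketch below compares it with numpy.polymul on random integer coefficients. The function name fft_polymul_int is made up for this example; it trims the output to len(a) + len(b) - 1 coefficients and rounds back to integers, which assumes the coefficients are small enough that the floating-point error stays well below 0.5 (for very long polynomials with large coefficients that assumption may not hold).

```python
import numpy as np
from numpy.fft import rfft, irfft

def fft_polymul_int(arr_a, arr_b):
    """FFT-based product of two integer-coefficient polynomials."""
    n_out = len(arr_a) + len(arr_b) - 1       # number of product coefficients
    # rfft zero-pads each input to n_out; irfft needs n_out to handle odd lengths
    prod = irfft(rfft(arr_a, n_out) * rfft(arr_b, n_out), n_out)
    return np.rint(prod).astype(np.int64)     # undo floating-point round-off

rng = np.random.default_rng(1)
x = rng.integers(1, 1000, 100)
y = rng.integers(1, 1000, 101)

res_fft = fft_polymul_int(x, y)
res_ref = np.polymul(x, y)                    # O(n^2) reference
print(np.array_equal(res_fft, res_ref))
```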
Is my observation that polymul function has O(n^2) complexity correct?
That also seems to be the performance I am observing, and it matches the available code in my numpy installation (version 1.15.4, and there doesn't seem to be any change in that part in the more recent 1.16.1 version).

Is there an efficient function to calculate a product?

I'm looking for a numpy function (or a function from any other package) that would efficiently evaluate the product
\prod_{i=1}^{N} f(x_i)
with f being a vector-valued function of a vector-valued input x. The product is taken to be a simple component-wise multiplication.
The issue here is that both the length of each x vector and the total number of result vectors (f of x) to be multiplied (N) is very large, in the order of millions. Therefore, it is impossible to generate all the results at once (it wouldn't fit in memory) and then multiply them afterwards using np.multiply.reduce or the like .
A toy example of the type of code I would like to replace is:
import numpy as np
x = np.ones(1000000)
prod = f(x)
for i in range(2, 1000000):
    prod *= f(i * np.ones(1000000))
with f a vector-valued function with the dimension of its output equal to the dimension of its input.
To be sure: I'm not looking for equivalent code, but for a single, highly optimized function. Is there such a thing?
For those familiar with Wolfram Mathematica: It would be the equivalent to Product. In Mathematica, I would be able to simply write Product[f[i ConstantArray[1,1000000]],{i,1000000}].

Numpy ufuncs all have a reduce method. np.multiply is a ufunc. So it's a one-liner:
np.multiply.reduce(v)
Where v is the vector of values you compute in what is hopefully an equally efficient manner.
To compute the vector, just apply your function to the input:
v = f(x)
So with your example:
np.multiply.reduce(np.sin(x))
Alternative
A simpler way to phrase the same thing is np.prod:
np.prod(v)
You can also use the prod method directly on your vector:
v.prod()
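A small illustration of the three spellings, plus the axis argument, which is what gives a component-wise product when several result vectors are stacked as rows. The arrays here are made-up sample values.

```python
import numpy as np

v = np.array([1.5, 2.0, 4.0])

p_reduce = np.multiply.reduce(v)   # ufunc reduction
p_prod = np.prod(v)                # same thing via np.prod
p_method = v.prod()                # same thing as an array method
print(p_reduce, p_prod, p_method)  # 12.0 12.0 12.0

# Two "result vectors" stacked as rows; axis=0 multiplies them component-wise
rows = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
colwise = np.multiply.reduce(rows, axis=0)
print(colwise)                     # [ 4. 10. 18.]
```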

python numpy optimization n-dimensional projection

I am relatively new to python and am interested in any ideas to optimize and speed up this function. I have to call it tens~hundreds of thousands of times for a numerical computation I am doing and it takes a major fraction of the code's overall computational time.
I have written this in c, but I am interested to see any tricks to make it run faster in python specifically.
This code calculates a stereographic projection of a bigD-length vector to a littleD-length vector, per http://en.wikipedia.org/wiki/Stereographic_projection. The variable a is a numpy array of length ~ 96.
import numpy as np
def nsphere(a):
    bigD = len(a)
    littleD = 3
    temp = a
    # normalize before calculating projection
    temp = temp/np.sqrt(np.dot(temp,temp))
    # calculate projection
    for i in range(bigD-littleD + 2,2,-1 ):
        temp = temp[0:-1]/(1.0 - temp[-1])
    return temp
#USAGE:
q = np.random.rand(96)
b = nsphere(q)
print(b)

This should be faster:
def nsphere(a, littleD=3):
    a = a / np.sqrt(np.dot(a, a))
    z = a[littleD:].sum()
    return a[:littleD] / (1. - z)
Please do the math to double check that this is in fact the same as your iterative algorithm.
Obviously the main speedup here comes from the fact that this is an O(n) algorithm replacing your O(n**2) algorithm for computing the projection. But specifically for speeding things up in Python, you want to "vectorize your inner loop": avoid loops and anything else with high Python overhead in the most performance-critical parts of your code, and instead use Python and numpy builtins, which are highly optimized. Hope that helps.
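The math does work out: one step maps each remaining x_i to x_i / (1 - x_n), and applying the next step to that gives x_i / (1 - x_n - x_(n-1)), so by induction the denominators collapse to 1 minus the sum of all dropped components. A quick numerical check, with the two variants renamed nsphere_iterative and nsphere_closed for this sketch:

```python
import numpy as np

def nsphere_iterative(a, littleD=3):
    """Drop one trailing component per step, as in the question."""
    temp = a / np.sqrt(np.dot(a, a))
    for _ in range(len(a) - littleD):
        temp = temp[:-1] / (1.0 - temp[-1])
    return temp

def nsphere_closed(a, littleD=3):
    """Single division by 1 minus the sum of the dropped components."""
    a = a / np.sqrt(np.dot(a, a))
    return a[:littleD] / (1.0 - a[littleD:].sum())

rng = np.random.default_rng(0)
q = rng.random(96)

b_iter = nsphere_iterative(q)
b_closed = nsphere_closed(q)
print(np.allclose(b_iter, b_closed))
```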
