A long term puzzle, how to optimize multi-level loops in python? - python

I have written a function in python to calculate Delta function in Gauss broadening, which involves 4-level loops. However, the efficiency is very low, about 10 times slower than using Fortran in a similar way.
def Delta_Gaussf(Nw, N_bd, N_kp, hw, eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    if ( j1 >= i1 ):
                        Delta_Gauss[w1][k1][i1][j1] = np.exp(pow((eigv[k1][j1]-eigv[k1][i1]-hw[w1])/width,2))
    return Delta_Gauss
I have removed some constants to make it look simpler.
Could anyone help me optimize this script to increase its efficiency?

Simply compile it
To get the best performance I recommend Numba (easy to use, good performance). Alternatively, Cython may be a good idea, but it requires somewhat more changes to your code.
You actually got everything right and implemented a solution that is easy to understand (for a human and, most importantly, for a compiler).
There are basically two ways to gain performance:
Vectorize the code as @scnerd showed. This is usually a bit slower and more complex than simply compiling quite simple code that only uses some for loops. Don't vectorize your code and then use a compiler. Starting from a simple looping approach, vectorizing usually takes some work and leads to a slower and more complex result. The advantage of this approach is that you only need numpy, which is a standard dependency in nearly every Python project that deals with numerical calculations.
Compile the code. If you already have a solution with a few loops and no other, or only a few, non-numpy functions involved, this is often the simplest and fastest solution.
A solution using Numba
You do not have to change much. I changed the pow function to np.power and made some slight changes to the way the arrays are accessed in numpy (this isn't really necessary).
import numba as nb
import numpy as np

#performance-debug info
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit
def Delta_Gaussf_nb(Nw, N_bd, N_kp, hw, width,eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    if ( j1 >= i1 ):
                        Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]-eigv[k1,i1]-hw[w1])/width,2))
    return Delta_Gauss
Due to the 'if' the SIMD-vectorization fails. In the next step we can remove it (maybe a call outside the njited function to np.triu(Delta_Gauss) will be necessary). I also parallelized the function.
@nb.njit(parallel=True)  # parallel=True is required for nb.prange to run in parallel
def Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width,eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=np.float64)
    for w1 in nb.prange(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]-eigv[k1,i1]-hw[w1])/width,2))
    return Delta_Gauss

Nw = 20
N_bd = 20
N_kp = 20
hw = np.linspace(0., 1.0, Nw)
eigv = np.zeros((N_kp, N_bd),dtype=np.float64)
Your version: 0.5s
first_compiled version: 1.37ms
parallel version: 0.55ms
These easy optimizations lead to about 1000x speedup.

BLUF: Using Numpy's full functionality, plus another neat module, you can get the Python code over 100x faster than this raw for-loop code. Using @max9111's answer, however, you can get even faster with much cleaner code and less work.
The resulting code looks nothing like the original, so I'll do the optimization one step at a time so that the process and final code make sense. Essentially, we're going to use a lot of broadcasting in order to get Numpy to perform the looping under the hood (which is always faster than looping in Python). The result computes the full square of results, which means we're necessarily duplicating some work since the result is symmetrical, but it's easier, and honestly probably faster, to do this work in high-performance ways than to have an if at the deepest level of looping in order to avoid the computation. This might be avoidable in Fortran, but probably not in Python. If you want the result to be identical to your provided source, we'll need to take the upper triangle of the result of my code below (which I do in the sample code below... feel free to remove the triu call in actual production, it's not necessary).
First, we'll notice a few things. The main equation has a denominator that performs np.sqrt, but the content of that computation doesn't change at any iteration of the loop, so we'll compute it once and re-use the result. This turns out to be minor, but we'll do it anyway. Next, the main function of the inner two loops is to perform eigv[k1][j1] - eigv[k1][i1], which is quite easy to vectorize. If eigv is a matrix, then eigv[k1] - eigv[k1].T produces a matrix where result[i1, j1] = eigv[k1][j1] - eigv[k1][i1]. This allows us to entirely remove the innermost two loops:
def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    eigv = np.matrix(eigv)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            this_eigv = (eigv[k1] - eigv[k1].T - hw[w1])
            v = np.power(this_eigv / width, 2)
            Delta_Gauss[w1, k1, :, :] = np.exp(-0.5 * v) / denom
    # Take the upper triangle to make the result exactly equal to the original code
    return np.triu(Delta_Gauss)
Well, now that we're on the broadcasting bandwagon, it really seems like the remaining two loops should be possible to remove in the same way. As it happens, it is easy! The only thing we need k1 for is to get the row out of eigv that we're trying to pairwise-subtract... so why not do this to all rows at the same time? We're currently basically subtracting matrices of shapes (1, B) - (B, 1) for each of N rows in eigv (where B is N_bd). We can abuse broadcasting to do this for all rows of eigv simultaneously by subtracting matrices of shapes (N, 1, B) - (N, B, 1) (where N is N_kp):
def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    for w1 in range(Nw):
        this_eigv = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2) - hw[w1]
        v = np.power(this_eigv / width, 2)
        Delta_Gauss[w1, :, :, :] = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
The next step should be clear now. We're only using w1 to index hw, so let's do some more broadcasting to make numpy do the looping instead. We're currently subtracting a scalar value from a matrix of shape (N, B, B), so to get the resulting matrix for each of the W values in hw, we need to perform subtraction on matrices of the shapes (1, N, B, B) - (W, 1, 1, 1) and numpy will broadcast everything to produce a matrix of the shape (W, N, B, B):
def Delta_Gaussf(hw, width, eigv):
    eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
    # -1 lets numpy infer the length W of hw, giving shape (W, 1, 1, 1)
    w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
    v = np.power(w_sub / width, 2)
    denom = np.sqrt(2.0 * np.pi) * width
    Delta_Gauss = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
On my example data, this code is ~100x faster (~900ms to ~10ms). Your mileage might vary.
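As a sanity check on the broadcasting steps above, here is a minimal sketch that compares a plain loop version against a fully broadcast one on small random inputs. The function names delta_gauss_loops and delta_gauss_broadcast and the test data are placeholders of mine, and the sketch assumes the Gaussian form exp(-0.5 * x**2) / denom used in the snippets above.

```python
import numpy as np

def delta_gauss_loops(hw, width, eigv):
    """Reference version: explicit loops, upper triangle (j >= i) only."""
    Nw = len(hw)
    N_kp, N_bd = eigv.shape
    out = np.zeros((Nw, N_kp, N_bd, N_bd))
    denom = np.sqrt(2.0 * np.pi) * width
    for w in range(Nw):
        for k in range(N_kp):
            for i in range(N_bd):
                for j in range(i, N_bd):
                    x = (eigv[k, j] - eigv[k, i] - hw[w]) / width
                    out[w, k, i, j] = np.exp(-0.5 * x * x) / denom
    return out

def delta_gauss_broadcast(hw, width, eigv):
    """Broadcast version: computes the full square, then keeps the upper triangle."""
    # diff[k, i, j] = eigv[k, j] - eigv[k, i], shape (N_kp, N_bd, N_bd)
    diff = eigv[:, np.newaxis, :] - eigv[:, :, np.newaxis]
    # hw reshaped to (Nw, 1, 1, 1) broadcasts against (1, N_kp, N_bd, N_bd)
    x = (diff[np.newaxis] - hw[:, np.newaxis, np.newaxis, np.newaxis]) / width
    denom = np.sqrt(2.0 * np.pi) * width
    # np.triu acts on the last two axes of an N-d array
    return np.triu(np.exp(-0.5 * x ** 2) / denom)

rng = np.random.default_rng(0)   # arbitrary test data
hw = np.linspace(0.0, 1.0, 4)
eigv = rng.standard_normal((3, 5))
width = 0.3

same = np.allclose(delta_gauss_loops(hw, width, eigv),
                   delta_gauss_broadcast(hw, width, eigv))
print("loop and broadcast versions agree:", same)
```

Indexing with np.newaxis does the same job as np.expand_dims here; which one reads better is mostly a matter of taste.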
But wait! There's more! Since our code is all numeric/numpy/python, we can use another handy module called numba to compile this function into an equivalent one with higher performance. Under the hood, it's basically reading what functions we're calling and converting the function into C-types and C-calls to remove the Python function call overhead. It's doing more than that, but that gives the gist of where we're going to gain benefit. Gaining this benefit is trivial in this case:
import numba

@numba.jit
def Delta_Gaussf(hw, width, eigv):
    eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
    w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
    v = np.power(w_sub / width, 2)
    denom = np.sqrt(2.0 * np.pi) * width
    Delta_Gauss = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
The resulting function is down to about ~7ms on my sample data, down from ~10ms, just by adding that decorator. Pretty nice for no effort.
EDIT: @max9111 gave a better answer that points out that numba works much better with the loop syntax than with numpy broadcasting code. With almost no work besides the removal of the inner if statement, he shows that numba.jit can make the almost-original code even faster. The result is much cleaner, in that you still have just the single innermost equation that shows what each value is, and you don't have to follow the magical broadcasting used above. I highly recommend using his answer.
For my given sample data (Nw = 20, N_bd = 20, N_kp = 20), my final runtimes are the following (I've included timings on the same computer for @max9111's solution, first without using parallel execution and then with it on my 2-core VM):
Original code: ~900 ms
Fortran estimate: ~90 ms (based on OP saying it was ~10x faster)
Final numpy code: ~10 ms
Final code with numba.jit: ~7 ms
max9111's solution (serial): ~4ms
max9111 (parallel 2-core): ~3ms
Overall vectorized speedup: ~130x
max9111's numba speedup: ~300x (potentially more with more cores)
I don't know exactly how fast your Fortran code is, but it looks like proper usage of numpy allows you to easily beat it by an order of magnitude, and @max9111's numba solution gives you potentially another order of magnitude.


Efficient computation of a loop of integrals in Python

I was wondering how to speed up the following code, in which I compute a probability function that involves numerical integrals and then compute some confidence margins.
Some possibilities that I have thought about are Numba or vectorization of the code
I have made minor modifications because there was a mistake. I am looking for some modifications that provide major time improvements (I know that there are some minor changes that would provide some minor time improvements, such as repeated functions, but I am not concerned about them)
The code is:
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 26 17:05:46 2021
@author: Ignacio
"""
import numpy as np
from scipy.integrate import simps
def pdf(V,alfa_points):
    return simps(1/np.sqrt(2*np.pi)/np.sqrt(sigma_R2)*np.exp(-(V*np.cos(alfa)-eR)**2/2/sigma_R2)*1/np.sqrt(2*np.pi)/np.sqrt(sigma_I2)*np.exp(-(V*np.sin(alfa)-eI)**2/2/sigma_I2),alfa)
def find_nearest(array,value):
    idx = (np.abs(array-value)).argmin()
    return array[idx]
N = 20
for tt in range(len(th)):
for vv in range(len(Vs)):
This version's speedup: 31x
A simple profiling (%%prun) reveals that most of the time is spent in simps.
You are in control of the integration done in pdf(): for example, you can use the trapezoidal rule instead of Simpson's rule with negligible numerical difference if you slightly increase the resolution of alpha. In fact, the higher resolution obtained by sampling alpha more densely more than makes up for the difference between simps and the trapezoidal rule (see picture at the bottom as for why). This is by far the largest speedup. We go one step further by implementing the trapezoidal rule ourselves instead of using scipy, since it is so simple. This alone yields a marginal gain, but opens the door for a more drastic optimization (below, about pdf2D).
Also, the remaining simps(PDF, ...) goes faster when it knows that the dx step is constant, so we can just say so instead of passing the whole alpha array.
You can avoid doing the loop to compute PDF and use np.vectorize(pdf) directly on Vs, or better (as in the code below), do a 2-D version of that calculation.
There are some other minor things (such as using an index directly fmin[tt] = Vs[closest(values, 0.05)] instead of finding the index, returning the value, and then using a boolean mask for where values == xval_05), or taking all the constants (including alpha) outside functions and avoid recalculating every time.
The above gives us a 5.2x improvement. There are a number of things I don't understand in your code, e.g. why have An (ones) and Pn (zeros)?
But, importantly, another ~6x speedup comes from the observation that, since we are implementing our own trapezoidal rule using numpy primitives, we can actually do it in 2D in one go for the whole PDF.
The final speed up of the code below is 31x. I believe that a better understanding of "the big picture" of what you want to do would yield additional, perhaps substantial, speed gains.
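To make the "2D in one go" idea concrete, here is a small sketch of a hand-written trapezoidal rule applied to every column of a 2-D array at once. The integrand V * sin(alpha)**2 is a stand-in of mine (chosen because its integral over a full period is exactly pi * V), and trapezoid_columns and the Vs values are hypothetical names and data, not the ones from the question.

```python
import numpy as np

alpha = np.linspace(0.0, 2.0 * np.pi, 200)
d_alpha = alpha[1] - alpha[0]          # constant step
Vs = np.linspace(0.5, 3.0, 7)          # hypothetical grid of V values

# One column per V: y[a, v] = Vs[v] * sin(alpha[a])**2
y = np.outer(np.sin(alpha) ** 2, Vs)

def trapezoid_columns(y, dx):
    """Trapezoidal rule along axis 0 for equally spaced samples."""
    # Every sample gets weight dx, except the two end points which get dx / 2
    return (y.sum(axis=0) - 0.5 * (y[0] + y[-1])) * dx

integrals = trapezoid_columns(y, d_alpha)
exact = np.pi * Vs                     # analytic value of the integral

print(np.max(np.abs(integrals - exact)))
```

A single call integrates all columns, so the Python-level loop over V disappears entirely.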
Modified code:
import numpy as np
from scipy.integrate import simps
alpha_points = 200 # more points as we'll use trapeze not simps
alpha = np.linspace(0, 2*np.pi, alpha_points)
cosalpha = np.cos(alpha)
sinalpha = np.sin(alpha)
d_alpha = np.mean(np.diff(alpha)) # constant dx
coeff = 1 / np.sqrt(2*np.pi)
d_Vs = np.mean(np.diff(Vs)) # constant dx
def f2D(Vs, eR, sigma_R2, eI, sigma_I2):
    a = coeff / np.sqrt(sigma_R2)
    b = coeff / np.sqrt(sigma_I2)
    y = a * np.exp(-(np.outer(cosalpha, Vs) - eR)**2 / 2 / sigma_R2) * b * np.exp(-(np.outer(sinalpha, Vs) - eI)**2 / 2 / sigma_I2)
    return y
def pdf2D(Vs, eR, sigma_R2, eI, sigma_I2):
    y = f2D(Vs, eR, sigma_R2, eI, sigma_I2)
    s = y.sum(axis=0) - (y[0] + y[-1]) / 2 # our own impl of trapeze, on 2D y
    return s * d_alpha
def closest(a, val):
    return np.abs(a - val).argmin()
N = 20
n = np.linspace(0,N-1,N)
d = 1
sigma_An = 0.1
sigma_Pn = 0.2
th = np.linspace(0,np.pi/2,250)
R = np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
I = np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
for tt in range(len(th)):
    PDF=pdf2D(Vs, eR, sigma_R2, eI, sigma_I2)
    total = simps(PDF, dx=d_Vs)
    values = np.cumsum(PDF) * inc / total
    fmin[tt] = Vs[closest(values, 0.05)]
    fmax[tt] = Vs[closest(values, 0.95)]
Note: most of the fmin and fmax values pass np.allclose() when compared with the original function, but some of them have a small error: after some digging, it turns out that the implementation here is more precise, since the function f() can be pretty abrupt and more alpha points actually help (and more than compensate for the minuscule loss of precision due to using the trapezoidal rule instead of Simpson's rule).
For example, at index tt=244, vv=400:
Of the several methods considered, the one that provides the largest time improvement is the Numba method. The method proposed by Pierre is very interesting and it does not require installing other packages, which is an asset.
However, in the examples that I have computed, the time improvement is not as large as with the Numba example, especially when the number of points in th grows to a few tens of thousands (which is my actual case). I post the Numba code here just in case someone is interested:
import numpy as np
from numba import njit
def margins(val_min,val_max):
for tt in range(len(th)):
for vv in range(len(Vs)):
idx = (np.abs(values-val_min)).argmin()
idx = (np.abs(values-val_max)).argmin()
return fmin,fmax
N = 20
fmin, fmax = margins(0.05,0.95)

Suggestions on how to speed up this python function?

Any suggestions on how to speed up this function?
def smooth_surface(z,c):
    hph_arr_list = []
    for x in xrange(c,len(z)-(c+1)):
        new_arr = np.hstack(z[x-c:x+c])
        hph_arr_list.append(np.percentile(new_arr[((new_arr >= np.percentile(new_arr,15)) & (new_arr <= np.percentile(new_arr,85)))],99))
    return np.array(map(float,hph_arr_list))
The variable z has a length of ~15 million and c is value of window size + and -. The function is basically a sliding window that calculates a percentile value per iteration. Any help would be appreciated! z is an array of arrays (hence the np.hstack). Maybe any idea if numba would help with this. If so, how to implement?
The slow part of the computation appears to be the line np.percentile(new_arr[((new_arr >= np.percentile(new_arr,15)) & (new_arr <= np.percentile(new_arr,85)))],99). This is due to np.percentile being unexpectedly slow on small arrays, as well as to the creation of several intermediate arrays.
Since new_arr is actually quite small, it is much faster to just sort it and make the interpolation yourself. Moreover, numba can also help to speed the computation up.
import math
from numba import njit

@njit  # Use @njit instead of @jit to increase speed
def filter(arr):
    arr = arr.copy() # This line can be removed to modify arr in-place
    arr.sort() # Sort so that the percentiles can be read off by index
    lo = int(math.ceil(len(arr)*0.15))
    hi = int(len(arr)*0.85)
    interp = 0.99 * (hi - 1 - lo)
    interp = interp - int(interp)
    assert lo <= hi-2
    return arr[hi-2]* (1.0 - interp) + arr[hi-1] * interp
This code is 160 times faster with arrays of size 20 on my machine and should produce the same result.
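To illustrate the "sort it and interpolate yourself" idea in isolation, here is a minimal sketch of a single percentile computed by hand on a sorted array and checked against np.percentile. It is not the trimmed 15/85 filter above, only the underlying technique; percentile_sorted and the random window are hypothetical, and the sketch assumes numpy's default linear interpolation.

```python
import math
import numpy as np

def percentile_sorted(sorted_arr, q):
    """q-th percentile (0..100) of an already sorted 1-D array, linear interpolation."""
    pos = q / 100.0 * (len(sorted_arr) - 1)   # fractional index into the sorted data
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_arr) - 1)     # clamp at the last element
    frac = pos - lo
    return sorted_arr[lo] * (1.0 - frac) + sorted_arr[hi] * frac

rng = np.random.default_rng(0)
window = rng.standard_normal(20)              # stand-in for one small sliding window

by_hand = percentile_sorted(np.sort(window), 99)
reference = np.percentile(window, 99)
print(by_hand, reference)
```

On a window of 20 values the sort is cheap, and everything after it is plain index arithmetic, which is the kind of code numba compiles well.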
Finally, you can speed up smooth_surface too by using automatic parallelization in numba (see here for more information). Here is an untested prototype:
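from numba import njit, prange

@njit(parallel=True)  # parallel=True is needed for prange to run in parallel
def smooth_surface(z,c):
    hph_arr = np.zeros(len(z)-(c+1)-c)
    for x in prange(c,len(z)-(c+1)):
        hph_arr[x-c] = filter(np.hstack(z[x-c:x+c]))
    return hph_arr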

Vectorize a Newton method in Python/Numpy

I am trying to figure out if Python/Numpy is a viable alternative for developing my numerical software, which is already available in C++. In order to get performance in Python/Numpy, one needs to "vectorize" the code. But it turns out that as soon as I move away from very simple examples, I struggle to vectorize the code (I am not talking about SIMD instructions but "efficient Numpy code" without loops). Here is an algorithm that I want to implement efficiently in Python/Numpy.
Create an numpy array containing: 1.0, 1.0 + 1/n, 1.0 + 2/n, ..., 2.0
For every u in the array, compute the root of x^2 - u, using a Newton method, stopping when |dx| <= 1.0e-7. Store the result in an array result.
Sum all the elements of the result array
Here is the algorithm in Python I want to speed up
import numpy as np
n = 1000000
data = np.arange(1.0, 2.0, 1.0 / n)
def newton(u):
    x = 2.0
    while True:
        f = x**2 - u
        df_dx = 2 * x
        dx = f / df_dx
        if (abs(dx) <= 1.0e-7):
            break
        x -= dx
    return x
result = map(newton, data)
print result[n - 1]
Here is a version of the algorithm in C++11
#include <iostream>
#include <vector>
#include <cmath>

int main (int argc, char const *argv[]) {
  auto n = std::size_t{100000000};
  auto v = std::vector<double>(n + 1);
  for(size_t k = 0; k < v.size(); ++k) {
    v[k] = 1.0 + static_cast<double>(k) / n;
  }
  auto result = std::vector<double>(n + 1);
  for(size_t k = 0; k < v.size(); ++k) {
    auto x = double{2.0};
    while(true) {
      auto f = double{x * x - v[k]};
      auto df_dx = double{2 * x};
      auto dx = double{f / df_dx};
      if (std::abs(dx) <= 1.0e-7) {
        break;
      }
      x -= dx;
    }
    result[k] = x;
  }
  auto somme = double{0.0};
  for(size_t k = 0; k < result.size(); ++k) {
    somme += result[k];
  }
  std::cout << somme << std::endl;
  return 0;
}
It takes 2.9 seconds to run on my machine. Is there a way to make a fast Python/Numpy algorithm that does the same thing (I am willing to get something that is less than 5 times slower).
You can do step 1. with numpy efficiently:
1.0 + np.arange(n + 1) / n
however I think you would need the np.vectorize() method to feed back x into your calculated values and it's not an efficient function (basically a wrapper for a python loop). If you can use scipy then there are built in methods that might do what you want http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html
EDIT: Having thought a bit more about this, I followed up on @ev-br's point and tried some alternatives. The masking uses too much processing, but the abs().max() is pretty fast, so a compromise might be to "divide the problem into blocks" both in the 1st dimension of the array and in the iteration direction. The following doesn't do too badly (< 20s) on my pretty low-powered laptop, certainly much faster than np.vectorize() or any of the scipy solving systems I could find. (If I set m too big it runs out of something (memory?) and grinds to a complete halt!)
n = 100000000
m = 5000000
block = 3
u = 1.0 + np.arange(n + 1) / n
x = np.full(u.shape, 2.0)
dx = np.ones(u.shape)
for i in range(0, n, m):
    while np.abs(dx[i:i+m]).max() > 1.0e-7:
        for j in range(block):
            dx[i:i+m] = (x[i:i+m] ** 2 - u[i:i+m]) / (2 * x[i:i+m])
            x[i:i+m] -= dx[i:i+m]
Here's a toy example. Notice that often vectorization means writing your code as if you're manipulating numbers, and letting numpy do its magic:
>>> import numpy as np
>>> a = np.array([1., 2., 3.])
>>> def f(x):
...     return x**2 - a, 2.*x   # function and derivative
...
>>> def newt(f, x0):
...     x = np.asarray(x0)
...     for _ in range(5):   # hardcode the number of iterations (I know)
...         v, dv = f(x)
...         x -= v / dv
...     return x
...
>>> newt(f, [1., 1., 1.])
array([ 1. , 1.41421356, 1.73205081])
If this is a performance bottleneck, this is unlikely to be competitive with hand-written C++ code: first of all, you're manipulating python objects with all the overhead; then numpy is likely doing a bunch of array allocations under the hood.
An often viable strategy is to start by writing things in python/numpy, and then move bottlenecks into a compiled code --- eg Cython or C++ wrapped by Cython. In this particular case since you already have the C++ code, just wrapping it with Cython is likely easiest but YMMV.
I'm not looking to wave small snippets of code as a solution, but here's something to get you started. I have a strong suspicion that you're having troubles just declaring such an array in python without spending too much time on it, so I'll mostly help you out there.
As far as the square roots come in, please add your example python code and I'll see what I can help optimize from that point on. In my example roots and sums are found with the default numpy functions/methods.
def summing():
    n = 1000000
    ar = np.arange(0, n)
    ar = ar/float(n)
    ar = ar + np.ones(n)
    sqrt = np.sqrt(ar)
    return np.sum(sqrt)  # sum the square roots, not the input array
In short, to get the starting array it's best to use a "workaround".
initialize an array ar with values [0, 1, 2, ..., n-1]
divide ar by n. This gets us the 0, 1/n, 2/n ... members
add to that an array of the same dimensions that contains just the number 1.0
This gets us the full array [ 1., 1.000001, 1.000002, ..., 1.999998, 1.999999] we're after, if I understood you right.
find square roots, sum it
Average of 10 sequential execution times is 0.018786 seconds.
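One detail worth checking is whether the end point 2.0 is part of the array: the question lists 1.0, 1.0 + 1/n, ..., 2.0 (n + 1 values), while np.arange(0, n) / n stops one step short. The following sketch, with a deliberately tiny n of my own choosing so the arrays can be inspected, shows both constructions next to np.linspace.

```python
import numpy as np

n = 10  # small on purpose; the question uses a much larger n

# n + 1 points, end point 2.0 included
with_endpoint = 1.0 + np.arange(n + 1) / n
same_thing = np.linspace(1.0, 2.0, n + 1)

# n points, last value is 2.0 - 1/n
without_endpoint = 1.0 + np.arange(n) / n

# Built-in square root and sum over the full array
total = np.sqrt(with_endpoint).sum()

print(with_endpoint[-1], without_endpoint[-1], total)
```

This only replaces the array construction and the final sum; it says nothing about the cost of the Newton iteration itself, which is what the question is really about.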
Obviously I'm 6 years late to this party, but this question is a common stumbling block for people trying to use numpy effectively for real scientific work. The basic idea is covered in @ev-br's answer. The OP points out that the solution offered there (even modified to stop iterating when a convergence criterion is met rather than after a fixed number of iterations) takes the same number of passes for each element of u. I want to show how you can avoid that objection using pure numpy code, making explicit the mask suggestion in @ev-br's comment.
However, I also want to point out that in many real world situations, the number of passes for Newton-like iteration to converge varies so little that the general technique I illustrate here will actually slow numpy code down significantly. If the average number of iterations will be within a factor of two or three of the maximum number of iterations, you should stick with something closer to @ev-br's answer (including his first comment).
The numpy performance numbers you need to understand are these: Loops over array indices will run 200 to 500 times slower in pure numpy code than in compiled code. On the other hand, if you manage to use numpy's array syntax to avoid all index loops, you can get within about a factor of 5 of compiled speed. (The factor of 5 is partly because of memory management as @ev-br mentions, but also because optimized compiled code overlaps many different arithmetical operations inside each index loop, while numpy just performs a single arithmetic operation, storing everything back to memory after each operation.) The point is that this factor of 100 difference means that it often pays to do substantial amounts of "extra" work in numpy code: Even if you do 10 times the number of floating point operations in vectorized numpy code, it will still run 10 times faster than the index-loop code that avoids the "extra" work. (Incidentally, the python map function is implemented as an interpreted index loop - it has nothing to do with numpy array operations.)
from numpy import asfarray, broadcast_arrays, arange

# Begin by defining the function to be inverted by Newton's method.
def f_dfdx(x):
    x = asfarray(x)  # always avoid repeated type conversions
    return x**2, 2.*x

# First, the simplest algorithm to find x such that f(x)=y.
# We must supply a starting guess x0 for x.
def f_inverse0(f_dfdx, y, x0, tol=1.e-7):
    y, x = broadcast_arrays(asfarray(y), asfarray(x0))
    x = x.copy()  # without this may clobber input x0
    for npass in range(20):
        f, dfdx = f_dfdx(x)
        dx = (f - y) / dfdx
        if (abs(dx) <= tol).all():
            break  # iterate all x until all have converged
        x -= dx
    else:
        # only reached if the loop ran out of passes without a break
        raise RuntimeError("failed to converge")
    return x

# A frequently slower algorithm that avoids extra iterations.
def f_inverse1(f_dfdx, y, x0, tol=1.e-7):
    y, x = broadcast_arrays(asfarray(y), asfarray(x0))
    shape = x.shape
    y, x = y.ravel(), x.flatten()  # avoid clobbering x0
    unconverged = arange(y.size)
    for npass in range(20):
        f, dfdx = f_dfdx(x[unconverged])
        dx = (f - y[unconverged]) / dfdx
        unc = abs(dx) > tol
        unconverged = unconverged[unc]
        if not unconverged.size:
            break  # iterate all x until all have converged
        x[unconverged] -= dx[unc]
    else:
        # only reached if the loop ran out of passes without a break
        raise RuntimeError("failed to converge")
    return x.reshape(shape)
On my machine, the OP's C++ program runs in 2.03 s (1.64+0.38 user+sys). For n=100 million as for the C++ program, f_inverse0 runs in 20.4 s (4.7+15.6 user+sys). As expected, f_inverse1 is slower, 51.3 s (11.5+39.8 user+sys). Again, don't automatically try to minimize total operation count when you are writing numpy code. The high system overhead is probably due to heavy memory management - every vector temporary is 0.8 GB and the memory manager is struggling.
Cutting the array size to n = 1 million elements (8 MB), then multiplying the runtime by 100 brings the system time down by a large factor, f_inverse0 now takes 16.1 s (12.5+3.6), while f_inverse1 takes 22.3 s (16.2+5.1). This factor of 8 to 10 slower than compiled code is not unreasonable to expect for numpy performance.
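For readers who want to try the two strategies on the original square-root problem without the generic machinery, here is a compact, self-contained sketch. The names newton_sqrt and newton_sqrt_masked, the max_pass limit, and the small n are my own choices, and unlike the functions above the update is applied before the convergence test; treat it as an illustration rather than a drop-in replacement.

```python
import numpy as np

def newton_sqrt(u, x0=2.0, tol=1.0e-7, max_pass=50):
    """Iterate every element until the largest step is below tol."""
    u = np.asarray(u, dtype=float)
    x = np.full(u.shape, x0)
    for _ in range(max_pass):
        dx = (x * x - u) / (2.0 * x)
        x -= dx
        if np.abs(dx).max() <= tol:
            return x
    raise RuntimeError("failed to converge")

def newton_sqrt_masked(u, x0=2.0, tol=1.0e-7, max_pass=50):
    """Only keep iterating the elements that have not converged yet."""
    u = np.asarray(u, dtype=float).ravel()
    x = np.full(u.shape, x0)
    active = np.arange(u.size)               # indices still being iterated
    for _ in range(max_pass):
        xa = x[active]
        dx = (xa * xa - u[active]) / (2.0 * xa)
        x[active] = xa - dx
        active = active[np.abs(dx) > tol]    # drop converged elements
        if active.size == 0:
            return x
    raise RuntimeError("failed to converge")

n = 1000
u = 1.0 + np.arange(n + 1) / n
roots = newton_sqrt(u)
roots_masked = newton_sqrt_masked(u)
print(roots.sum(), roots_masked.sum())
```

The masked variant does fewer floating point operations but pays for fancy indexing on every pass, which is the trade-off described above.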

Convolution computations in Numpy/Scipy

Profiling some computational work I'm doing showed me that one bottleneck in my program was a function that basically did this (np is numpy, sp is scipy):
def mix1(signal1, signal2):
    spec1 = np.fft.fft(signal1, axis=1)
    spec2 = np.fft.fft(signal2, axis=1)
    return np.fft.ifft(spec1*spec2, axis=1)
Both signals have shape (C, N) where C is the number of sets of data (usually less than 20) and N is the number of samples in each set (around 5000). The computation for each set (row) is completely independent of any other set.
I figured that this was just a simple convolution, so I tried to replace it with:
def mix2(signal1, signal2):
    outputs = np.empty_like(signal1)
    for idx, row in enumerate(outputs):
        outputs[idx] = sp.signal.convolve(signal1[idx], signal2[idx], mode='same')
    return outputs
...just to see if I got the same results. But I didn't, and my questions are:
Why not?
Is there a better way to compute the equivalent of mix1()?
(I realise that mix2 probably wouldn't have been faster as-is, but it might have been a good starting point for parallelisation.)
Here's the full script I used to quickly check this:
import numpy as np
import scipy as sp
import scipy.signal
N = 4680
C = 6
def mix1(signal1, signal2):
    spec1 = np.fft.fft(signal1, axis=1)
    spec2 = np.fft.fft(signal2, axis=1)
    return np.fft.ifft(spec1*spec2, axis=1)
def mix2(signal1, signal2):
    outputs = np.empty_like(signal1)
    for idx, row in enumerate(outputs):
        outputs[idx] = sp.signal.convolve(signal1[idx], signal2[idx], mode='same')
    return outputs
def test(num, chans):
    sig1 = np.random.randn(chans, num)
    sig2 = np.random.randn(chans, num)
    res1 = mix1(sig1, sig2)
    res2 = mix2(sig1, sig2)
    np.testing.assert_almost_equal(res1, res2)
if __name__ == "__main__":
    test(N, C)
So I tested this out and can now confirm a few things:
1) numpy.convolve is not circular, which is what the fft code is giving you:
2) FFT does not internally pad to a power of 2. Compare the vastly different speeds of the following operations:
x1 = np.random.uniform(size=2**17-1)
x2 = np.random.uniform(size=2**17)
3) Normalization is not a difference -- if you do a naive circular convolution by adding up a(k)*b(i-k), you will get the result of the FFT code.
The thing is padding to a power of 2 is going to change the answer. I've heard tales that there are ways to deal with this by cleverly using prime factors of the length (mentioned but not coded in Numerical Recipes) but I've never seen people actually do that.
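The difference between the two kinds of convolution can be seen on a small example. The sketch below, using helper names and random test signals of my own, checks that the plain FFT product matches a naive circular sum, and that zero-padding both inputs to length len(a) + len(b) - 1 before the FFT reproduces the linear result of np.convolve.

```python
import numpy as np

def circular_convolve_naive(a, b):
    """Direct circular convolution: out[i] = sum_k a[k] * b[(i - k) mod n]."""
    n = len(a)
    out = np.zeros(n)
    for i in range(n):
        for k in range(n):
            out[i] += a[k] * b[(i - k) % n]
    return out

def circular_convolve_fft(a, b):
    """Same length in and out; this is what multiplying unpadded spectra gives."""
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def linear_convolve_fft(a, b):
    """Zero-pad to the full output length so that nothing wraps around."""
    n = len(a) + len(b) - 1
    return np.fft.ifft(np.fft.fft(a, n) * np.fft.fft(b, n)).real

rng = np.random.default_rng(0)
a = rng.standard_normal(16)
b = rng.standard_normal(16)

circular_matches = np.allclose(circular_convolve_fft(a, b),
                               circular_convolve_naive(a, b))
linear_matches = np.allclose(linear_convolve_fft(a, b), np.convolve(a, b))
print(circular_matches, linear_matches)
```

Whether the circular or the linear result is the one you want depends on the application; the point is only that the two differ unless the inputs are padded.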
scipy.signal.fftconvolve does convolution by FFT, and it's Python code. You can study the source code and correct your mix1 function.
As mentioned before, the scipy.signal.convolve function does not perform a circular convolution. If you want a circular convolution performed in realspace (in contrast to using fft's) I suggest using the scipy.ndimage.convolve function. It has a mode parameter which can be set to 'wrap' making it a circular convolution.
for idx, row in enumerate(outputs):
    outputs[idx] = sp.ndimage.convolve(signal1[idx], signal2[idx], mode='wrap')

Rewriting a for loop in pure NumPy to decrease execution time

I recently asked about trying to optimise a Python loop for a scientific application, and received an excellent, smart way of recoding it within NumPy which reduced execution time by a factor of around 100 for me!
However, calculation of the B value is actually nested within a few other loops, because it is evaluated at a regular grid of positions. Is there a similarly smart NumPy rewrite to shave time off this procedure?
I suspect the performance gain for this part would be less marked, and the disadvantages would presumably be that it would not be possible to report back to the user on the progress of the calculation, that the results could not be written to the output file until the end of the calculation, and possibly that doing this in one enormous step would have memory implications? Is it possible to circumvent any of these?
import numpy as np
import time
def reshape_vector(v):
    b = np.empty((3,1))
    for i in range(3):
        b[i][0] = v[i]
    return b
def unit_vectors(r):
    return r / np.sqrt((r*r).sum(0))
def calculate_dipole(mu, r_i, mom_i):
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    A = 1e-7
    num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i)
    den = np.sqrt(np.sum(relative*relative, 0))**3
    B = np.sum(num/den, 1)
    return B
N = 20000 # number of dipoles
r_i = np.random.random((3,N)) # positions of dipoles
mom_i = np.random.random((3,N)) # moments of dipoles
a = np.random.random((3,3)) # three basis vectors for this crystal
n = [10,10,10] # points at which to evaluate sum
gamma_mu = 135.5 # a constant
t_start = time.clock()
for i in range(n[0]):
    r_frac_x = np.float(i)/np.float(n[0])
    r_test_x = r_frac_x * a[0]
    for j in range(n[1]):
        r_frac_y = np.float(j)/np.float(n[1])
        r_test_y = r_frac_y * a[1]
        for k in range(n[2]):
            r_frac_z = np.float(k)/np.float(n[2])
            r_test = r_test_x +r_test_y + r_frac_z * a[2]
            r_test_fast = reshape_vector(r_test)
            B = calculate_dipole(r_test_fast, r_i, mom_i)
            omega = gamma_mu*np.sqrt(np.dot(B,B))
            # write r_test, B and omega to a file
    frac_done = np.float(i+1)/(n[0]+1)
    t_elapsed = (time.clock()-t_start)
    t_remain = (1-frac_done)*t_elapsed/frac_done
    print frac_done*100,'% done in',t_elapsed/60.,'minutes...approximately',t_remain/60.,'minutes remaining'
One obvious thing you can do is replace the line
r_test_fast = reshape_vector(r_test)
with
r_test_fast = r_test.reshape((3,1))
Probably won't make any big difference in performance, but in any case it makes sense to use the numpy builtins instead of reinventing the wheel.
Generally speaking, as you probably have noticed by now, the trick with optimizing numpy is to express the algorithm with the help of numpy whole-array operations or at least with slices instead of iterating over each element in python code. What tends to prevent this kind of "vectorization" is so-called loop-carried dependencies, i.e. loops where each iteration is dependent on the result of a previous iteration. Looking briefly at your code, you have no such thing, and it should be possible to vectorize your code just fine.
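To make the distinction concrete, here is a small sketch contrasting a loop whose iterations are independent (and which therefore maps directly onto a whole-array expression) with a loop that carries a dependency from one iteration to the next. The functions and the formulas in them are arbitrary examples of mine, not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

# Independent iterations: y[i] depends only on x[i]
def poly_loop(x):
    y = np.empty_like(x)
    for i in range(len(x)):
        y[i] = 3.0 * x[i] ** 2 + 1.0
    return y

def poly_vectorized(x):
    # The same computation as one whole-array expression
    return 3.0 * x ** 2 + 1.0

# Loop-carried dependency: y[i] needs y[i - 1], so the iterations
# cannot simply be replaced by one elementwise expression
def recursive_filter_loop(x, a=0.5):
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = a * y[i - 1] + x[i]
    return y

agree = np.allclose(poly_loop(x), poly_vectorized(x))
print(agree, recursive_filter_loop(np.array([1.0, 1.0, 1.0])))
```

Loops of the second kind are typical candidates for a compiler such as Cython or numba rather than for array expressions.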
EDIT: One solution
I haven't verified this is correct, but should give you an idea of how to approach it.
First, take the cartesian() function, which we'll use. Then
def calculate_dipole_vect(mus, r_i, mom_i):
    # Treat each mu sequentially
    Bs = []
    omega = []
    for mu in mus:
        rel = mu - r_i
        r_norm = np.sqrt((rel * rel).sum(1))
        r_unit = rel / r_norm[:, np.newaxis]
        A = 1e-7
        # Per-dipole dot product: after the transpose the coordinates are on axis 1
        m_dot_r = np.sum(mom_i * r_unit, 1)[:, np.newaxis]
        num = A*(3*m_dot_r*r_unit - mom_i)
        den = r_norm ** 3
        B = np.sum(num / den[:, np.newaxis], 0)
        Bs.append(B)  # collect B so the returned list is not empty
        omega.append(gamma_mu * np.sqrt(np.dot(B, B)))
    return Bs, omega

# Transpose to get more "natural" ordering with row-major numpy
r_i = r_i.T
mom_i = mom_i.T
t_start = time.clock()
r_frac = cartesian((np.arange(n[0]) / float(n[0]),
                    np.arange(n[1]) / float(n[1]),
                    np.arange(n[2]) / float(n[2])))
r_test = np.dot(r_frac, a)
B, omega = calculate_dipole_vect(r_test, r_i, mom_i)
print 'Total time for vectorized: %f s' % (time.clock() - t_start)
Well, in my testing, this is in fact slightly slower than the loop-based approach I started from. The thing is, in the original version in the question, it was already vectorized with whole-array operations over arrays of shape (20000, 3), so any further vectorization doesn't really bring much further benefit. In fact, it may worsen the performance, as above, maybe due to big temporary arrays.
If you profile your code, you'll see that 99% of the running time is in calculate_dipole so reducing the time for this looping really won't give a noticeable reduction in execution time. You still need to focus on calculate_dipole if you want to make this faster. I tried my Cython code for calculate_dipole on this and got a reduction by about a factor of 2 in the overall time. There might be other ways to improve the Cython code too.
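The cartesian() helper used in the vectorized attempt is not shown in this thread. A minimal stand-in, assuming it should return every combination of the input 1-D arrays as rows with the last coordinate varying fastest (which matches the i, j, k order of the original nested loops), could look like the sketch below; the implementation via np.meshgrid is my own choice and not necessarily what the answer's author used.

```python
import numpy as np

def cartesian(arrays):
    """All combinations of the given 1-D arrays, one combination per row."""
    # indexing="ij" keeps the first array on the slowest-varying axis
    grids = np.meshgrid(*arrays, indexing="ij")
    return np.stack(grids, axis=-1).reshape(-1, len(arrays))

n = [2, 3, 2]  # small grid so the result is easy to inspect
r_frac = cartesian([np.arange(m) / float(m) for m in n])

print(r_frac.shape)
print(r_frac[:4])
```

With the fractional coordinates in this layout, np.dot(r_frac, a) turns each row into a Cartesian position, one row per grid point.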

