In a Python program, the following function is called about 20,000 times from another function that is called about 1,000 times from yet another function that executes 30 times, so this particular function is called about 600,000,000 times in total. In Python the program takes more than two hours (perhaps much longer; I aborted it without waiting for it to finish), while essentially the same task coded in Java takes less than 5 minutes. If I change the 20,000 above to 400 (keeping everything else in the program untouched), the total time drops to about 4 minutes, which means this particular function is the culprit. What can I do to speed up the Python version, or is it just not possible?
No lists are manipulated inside this function (there are lists elsewhere in the program, but in those places I tried to use numpy arrays as far as possible). I understand that replacing Python lists with numpy arrays speeds things up, but there are cases in my program (not in this particular function) where I must build a list iteratively, using append; and those must-have lists are lists of objects (not floats or ints), so numpy would be of little help even if I converted them to numpy arrays.
def compute_something(arr):
    '''
    arr is received as a numpy array of ints and floats (I think Python upcasts them all to floats,
    doesn't it?).
    Inside this function, elements of arr are accessed using indexing (arr[0], arr[1], etc.), because
    each element of the array has its own unique use. It's not that I need the array as a whole (as in
    arr**2 or sum(arr)).
    The arr elements are used in several simple arithmetic operations involving nothing costlier than
    +, -, *, /, and numpy.log(). There is no other loop inside this function; there are a few if's though.
    Inside this function, use is made of constants imported from other modules (I doubt the
    importing, as in AnotherModule.x, is expensive).
    '''
    for x in numpy.arange(float1, float2, float3):
        do stuff
    return a, b, c  # return a tuple of three floats
Edit:
Thanks for all the comments. Here is the inside of the function (I made the variable names short for convenience). The ndarray arr has only 3 elements in it. Can you suggest any improvements?
def compute_something(arr):
    a = Mod.b * arr[1] * arr[2] + Mod.c
    max = 0.0
    for c in np.arange(a, arr[1] * arr[2] * (Mod.d - Mod.e), Mod.f):
        i = c / arr[2]
        m1 = Mod.A * np.log((i / (arr[1] * Mod.d)) + (Mod.d / Mod.e))
        m2 = -Mod.B * np.log(1.0 - (i / (arr[1] * Mod.d)) - (Mod.d / Mod.e))
        V = arr[0] * (Mod.E - Mod.r * i / arr[1] - Mod.r * Mod.d - m1 - m2)
        p = c * V / 1000.0
        if p > max:
            max = p
            vmp = V
    pen = Mod.COEFF1 * (Mod.COEFF2 - max) if max < Mod.CONST else 0.0
    wo = Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) + Mod.COEFF6 * arr[2]
    w = wo + pen
    return vmp, max, w
Python supports profiling of code (module cProfile). There is also the line_profiler tool for finding the most expensive lines.
So you do not need to guess which part of the code is the most expensive.
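For example, a minimal sketch of profiling a single call with cProfile (compute_something and arr here stand in for your own function and data, and the kernprof command is how line_profiler is usually run):
import cProfile

# sort the report by cumulative time so the hot spots appear first
cProfile.run('compute_something(arr)', sort='cumulative')

# line_profiler instead works by decorating the function with @profile and running:
#     kernprof -l -v your_script.py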
In the code you present, the problem is the use of the for loop, which generates many conversions between object types. If you use numpy you can vectorize your calculation.
I tried to rewrite your code to vectorize the operations. You did not provide information about what the Mod object is, but I hope it will work.
def compute_something(arr):
    a = Mod.b * arr[1] * arr[2] + Mod.c
    # start the calculation on vectors instead of a for loop
    c_arr = np.arange(a, arr[1] * arr[2] * (Mod.d - Mod.e), Mod.f)
    i_arr = c_arr / arr[2]
    m1_arr = Mod.A * np.log((i_arr / (arr[1] * Mod.d)) + (Mod.d / Mod.e))
    m2_arr = -Mod.B * np.log(1.0 - (i_arr / (arr[1] * Mod.d)) - (Mod.d / Mod.e))
    V_arr = arr[0] * (Mod.E - Mod.r * i_arr / arr[1] - Mod.r * Mod.d - m1_arr - m2_arr)
    p = c_arr * V_arr / 1000.0
    max_val = p.max()  # renamed to avoid conflict with the builtin function
    max_ind = np.nonzero(p == max_val)[0][0]
    vmp = V_arr[max_ind]
    pen = Mod.COEFF1 * (Mod.COEFF2 - max_val) if max_val < Mod.CONST else 0.0
    wo = Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) + Mod.COEFF6 * arr[2]
    w = wo + pen
    return vmp, max_val, w
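As a small aside (my own sketch, not part of the answer above): np.argmax returns the index of the first maximum directly, so the lookup of the maximum could also be written as
max_ind = int(np.argmax(p))
max_val = p[max_ind]
vmp = V_arr[max_ind]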
I would suggest using range, as it is approximately 2 times faster for plain iteration (looping over an ndarray yields NumPy scalar objects, which are slower to work with in a pure-Python loop than plain ints):
import numpy as np
from timeit import timeit

def python():
    for i in range(100000):
        pass

def numpy():
    for i in np.arange(100000):
        pass

print(timeit(python, number=1000))
print(timeit(numpy, number=1000))
Output:
5.59282787179696
10.027646953771665
I'm baffled. I just ported my code from Java to Python. The good news is that the Python alternative for the library I'm using is much quicker. The bad news is that my custom processing code is much slower with the Python version I wrote :( I even removed some parts I deemed unnecessary, and it is still much slower. The Java version takes about half a second, the Python version takes 5-6 seconds.
import time
import imageio
import numpy as np  # imports implied by the snippet

rimg1 = imageio.imread('test1.png').astype(np.uint8)
rimg2 = imageio.imread('test2.png').astype(np.uint8)

sum_time = 0
for offset in range(-left, right):
    rdest = np.zeros((h, w, 3)).astype(np.uint8)
    if offset == 0:
        continue
    mult = np.uint8(1.0 / (offset * multiplier / frames))
    for y in range(h):
        for x in range(0, w - backup, 1):
            slice_time = time.time()
            src = rimg2[y, x] // mult + 1
            sum_time += time.time() - slice_time
            pix = rimg1[y, x + backup]
w ~= 384 and h ~= 384.
src usually ranges from 0 to 30.
left to right is -5 to 5.
How come sum_time takes about a third of my total time?
Edit
With the help of josephjscheidt I made some changes.
mult = np.uint8(1.0 / (offset * multiplier / frames))
multArray = np.floor_divide(rimg2, mult) + 1

for y in range(h):
    pixy = rimg1[y]
    multy = multArray[y]
    for x in range(0, w - backup, 1):
        src = multy[y]
        slice_time = time.time()
        pix = pixy[x + backup]
        sum_time += time.time() - slice_time

        ox = x
        for o in range(src):
            if ox < 0:
                break
            rdest[y, ox] = pix
            ox -= 1
Using the numpy iterator for the srcArray cuts total time almost in half! The numpy operation itself seems to take negligible time.
Now most of the time taken is in rimg1 lookup
pix = rimg1[x + backup]
and the inner for loop (both taking 50% of time). Is it possible to handle this with numpy operations as well?
Edit
I would figure rewriting it could be of benefit, but somehow the following actually takes a little bit longer:
for x in range(0, w - backup, 1):
    slice_time = time.time()
    lastox = max(x - multy[y], 0)
    rdest[y, lastox:x] = pixy[x + backup]
    sum_time += time.time() - slice_time
Edit
slice_time = time.time()
depth = multy[y]
pix = pixy[x + backup]
ox = x
#for o in range(depth):
#    if ox < 0:
#        break
#
#    rdesty[ox] = pix
#    ox -= 1
# if I uncomment the above lines, and comment out the following two,
# it takes twice as long!
lastox = max(x - multy[y], 0)
rdesty[lastox:x] = pixy[x + backup]
sum_time += time.time() - slice_time
The Python interpreter is strange...
sum_time is now 2.5 seconds. In comparison, Java does it in 60 ms.
For loops are notoriously slow with numpy arrays, and you have a three-layer for loop here. The underlying concept with numpy arrays is to perform operations on the entire array at once, rather than trying to iterate over them.
Although I can't entirely interpret your code, because most of the variables are undefined in the code chunk you provided, I'm fairly confident you can refactor here and vectorize your commands to remove the loops. For instance, if you redefine offset as a one-dimensional array, then you can calculate all values of mult at once without having to invoke a for loop: mult will become a one-dimensional array holding the correct values. We can avoid dividing by zero using the out argument (setting the default output to the offset array) and where argument (performing the calculation only where offset doesn't equal zero):
mult = np.uint8(np.divide(1.0, (offset * multiplier / frames),
                          out=offset, where=(offset != 0)))
Then, to use the mult array on the rimg2 row by row, you can use a broadcasting trick (here, I'm assuming you want to add one to each element in rimg2):
src = np.floor_divide(rimg2, mult[:,None], out = rimg2, where = (mult != 0)) + 1
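To make the [:,None] broadcasting trick concrete, here is a tiny self-contained sketch of my own (the shapes and values are made up for illustration):
import numpy as np

mult = np.array([1, 2, 4], dtype=np.uint8)          # shape (3,)
rows = np.arange(12, dtype=np.uint8).reshape(3, 4)  # shape (3, 4)

# mult[:, None] has shape (3, 1), so it broadcasts against (3, 4):
# each row of `rows` is divided by the corresponding element of `mult`.
src = np.floor_divide(rows, mult[:, None]) + 1
print(src.shape)  # (3, 4)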
I found this article extremely helpful when learning how to effectively work with numpy arrays:
https://realpython.com/numpy-array-programming/
Since you are working with images, you may want to especially pay attention to the section on image feature extraction and stride_tricks. Anyway, I hope this helps you get started.
The problem I have is an "array too big" error in Matlab. The array data comes from an audio file, and I want to get the impulse response.
I first FFT the original and the recorded audio, then divide the recorded spectrum by the original, and finally inverse-FFT the result to get the impulse response. That was the plan, but I got stuck at the division part.
Stuck in Matlab, I found Python code that can do it just fine. I rewrote that code in Matlab and the problem came back. The code below is incomplete, but it is enough to show the problem.
Hope to get advice and criticism. Thanks.
What I planned to do, but it failed, so I moved on to the next code:
[y_sweep,Fs] = audioread('sweep.wav');
[y_rec,Fs] = audioread('edit_rec_sweep_laptop_1.2.wav');
fft_y1 = abs(fft(y_rec(:,1)));
fft_y2 = abs(fft(y_rec(:,2)));
fft_x = abs(fft(y_sweep));
fft_h1 = fft_y1/fft_x;
% fft_h2 = fft_y2/fft_x;
% fft_h = [fft_h1,fft_h2];
% h1 = ifft(fft1_h);
Code 'translated' from the Python version, which still failed, so I came here:
[a,fs] = audioread('sweep.wav'); % sweep
[b,fs] = audioread('rec.wav');   % rec

a = pad(a,fs*50,fs*10);
b = pad(b,fs*50,fs*10);
[m,n] = size(b);
h = zeros(m,n);

for chan = 1:2
    b1 = b(:,1);
    ffta = abs(fft(a));
    fftb = abs(fft(b1));
    ffth = fftb / ffta;
end
pad.m function (translated from python but should be correct)
function y = pad(data, t_full, t_pre)
    [row_dim,col_dim] = size(data);
    t_post = t_full - row_dim - t_pre;
    if t_post > 0
        if col_dim == 1
            y = [zeros(t_pre,1); data; zeros(t_post,1)];
            % width = [t_pre,t_post];
        else
            y1 = [zeros(t_pre,1); data(:,1); zeros(t_post,1)];
            y2 = [zeros(t_pre,1); data(:,2); zeros(t_post,1)];
            y = [y1,y2];
            % width = [[t_pre,t_post],[0,0]];
        end
    else
        if col_dim == 1
            y = [zeros(t_pre,1); data(t_full - t_pre:end,1)];
            % width = [t_pre,0];
        else
            y = [zeros(t_pre,1); data(t_full - t_pre:end,1)];
            % width = [[t_pre,0],[0,0]];
        end
    end
end
Error
Error using \
Requested 4800000x4800000 (171661.4GB) array exceeds
maximum array size preference. Creation of arrays
greater than this limit may take a long time and
cause MATLAB to become unresponsive. See array size
limit or preference panel for more information.
Error in impulseresponse (line 13)
ffth = fftb / ffta;
The forward slash is shorthand in MATLAB for mrdivide(). This is for solving systems of linear matrix equations. What I think you want is rdivide which is denoted by ./.
c = a/b is only equivalent to standard division if b is scalar.
c = a./b is element-wise division, where every element of a is divided by the corresponding element of b.
[1 2 3] ./ [2 4 9]
>> ans = [0.5, 0.5, 0.3333]
So the last active line of your "planned to do" code becomes
fft_h1 = fft_y1 ./ fft_x;
I am trying to pair objects from two datasets (one contains about 0.5 million elements, the other about 2 million) that meet certain conditions, and then save information about each pair to a file. Many variables are not involved in the pairing calculation itself, but they are important for my subsequent analysis, so I need to keep track of them and save them. If there is a way to vectorize the whole analysis it would be much faster. In the following I use random numbers as an example:
import numpy as np
from astropy import units as u
from astropy.coordinates import SkyCoord
from PyAstronomy import pyasl

RA1 = np.random.uniform(0,360,500000)
DEC1 = np.random.uniform(-90,90,500000)
d = np.random.uniform(55,2000,500000)
z = np.random.uniform(0.05,0.2,500000)
e = np.random.uniform(0.05,1.0,500000)
s = np.random.uniform(0.05,5.0,500000)

RA2 = np.random.uniform(0,360,2000000)
DEC2 = np.random.uniform(-90,90,2000000)
n = np.random.randint(10,10000,2000000)
m = np.random.randint(10,10000,2000000)

f = open('results.txt','a')

for i in range(len(RA1)):
    if i % 50000 == 0:
        print i
    ra1 = RA1[i]
    dec1 = DEC1[i]
    c1 = SkyCoord(ra=ra1*u.degree, dec=dec1*u.degree)
    for j in range(len(RA2)):
        ra2 = RA2[j]
        dec2 = DEC2[j]
        c2 = SkyCoord(ra=ra2*u.degree, dec=dec2*u.degree)
        ang = c1.separation(c2)
        sep = d[i] * ang.radian
        pa = pyasl.positionAngle(ra1, dec1, ra2, dec2)
        if sep < 1.5:
            np.savetxt(f, np.c_[ra1,dec1,sep,z[i],e[i],s[i],n[j],m[j]], fmt='%1.4f %1.4f %1.4f %1.4f %1.4f %1.4f %i %i')
The basic question you need to ask yourself is: Can you reduce the dataset?
If not I have some bad news: 500000 * 2000000 is 1e12. That means you're trying to do one trillion operations.
The angular separation involves some trigonometric functions (I think cos, sin and sqrt are involved here), so it will be roughly on the order of hundreds of nanoseconds up to microseconds per operation. Assuming each operation takes 1 µs, you'll still need about 12 days to complete this. And this assumes you don't have any Python loop or I/O overhead, and I think 1 µs is reasonable for these kinds of operations.
But there are certainly ways to optimize it: SkyCoord supports vectorized inputs, but only 1D:
# Create the SkyCoord for the longer array once
c2 = SkyCoord(ra=RA2*u.degree, dec=DEC2*u.degree)
# and calculate the separation from each coordinate of the shorter list
for idx, (ra, dec) in enumerate(zip(RA1, DEC1)):
    c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
    # x will be the angular separation, with the length of your RA2 and DEC2 arrays
    x = c1.separation(c2)
This will already yield a speedup of several orders of magnitude:
# note that I made these MUCH shorter
RA1 = np.random.uniform(0,360,5)
DEC1 = np.random.uniform(-90,90,5)
RA2 = np.random.uniform(0,360,10)
DEC2 = np.random.uniform(-90,90,10)

def test(RA1, DEC1, RA2, DEC2):
    """Version with vectorized inner loop."""
    c2 = SkyCoord(ra=RA2*u.degree, dec=DEC2*u.degree)
    for idx, (ra, dec) in enumerate(zip(RA1, DEC1)):
        c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
        x = c1.separation(c2)

def test2(RA1, DEC1, RA2, DEC2):
    """Double loop."""
    for ra, dec in zip(RA1, DEC1):
        c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
        for ra, dec in zip(RA2, DEC2):
            c2 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
            x = c1.separation(c2)

%timeit test(RA1, DEC1, RA2, DEC2)   # 1 loop, best of 3: 225 ms per loop
%timeit test2(RA1, DEC1, RA2, DEC2)  # 1 loop, best of 3: 2.71 s per loop
This is already 10 times as fast and it scales MUCH better:
RA1 = np.random.uniform(0,360,5)
DEC1 = np.random.uniform(-90,90,5)
RA2 = np.random.uniform(0,360,2000000)
DEC2 = np.random.uniform(-90,90,2000000)
%timeit test(RA1, DEC1, RA2, DEC2) # 1 loop, best of 3: 2.8 s per loop
# test2 scales so bad I only use 50 elements here
RA2 = np.random.uniform(0,360,50)
DEC2 = np.random.uniform(-90,90,50)
%timeit test2(RA1, DEC1, RA2, DEC2) # 1 loop, best of 3: 11.4 s per loop
Note that by vectorizing the inner loop I was able to calculate 40000 times more elements in 1/4 of the time. So by vectorizing the inner loop you should be roughly 200k times faster.
Here we calculated 5 times 2 million separations in about 3 seconds, so it will be roughly 300 ns per operation. At this speed you'd need 3 days to complete this task.
Even if you could also vectorize the remaining loop away I don't think that would yield any great speedups because at that level the loop overhead is much less than the computation time in each loop. Using line-profiler supports this statement:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
11 def test(RA1, DEC1, RA2, DEC2):
12 1 216723 216723.0 2.6 c2 = SkyCoord(ra=RA2*u.degree, dec=DEC2*u.degree)
13 6 222 37.0 0.0 for idx, (ra, dec) in enumerate(zip(RA1, DEC1)):
14 5 206796 41359.2 2.5 c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
15 5 7847321 1569464.2 94.9 x = c1.separation(c2)
If it's not obvious from the Hits: that's from the 5 x 2,000,000 run, and for comparison here is the one from a 5 x 20 run of test2:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
17 def test2(RA1, DEC1, RA2, DEC2):
18 6 80 13.3 0.0 for ra, dec in zip(RA1, DEC1):
19 5 195030 39006.0 0.6 c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
20 105 1737 16.5 0.0 for ra, dec in zip(RA2, DEC2):
21 100 3871427 38714.3 11.8 c2 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
22 100 28870724 288707.2 87.6 x = c1.separation(c2)
The reason why test2 scales worse is that the c2 = SkyCoord part takes 12% of the total time instead of just 2.5%, and each single call to separation has some significant overhead. So it's not really the Python loop overhead that makes it slow, but the SkyCoord constructor and the static parts of separation.
You obviously need to vectorize the pa calculation and the saving to file as well (I haven't worked with PyAstronomy or numpy.savetxt, so I can't advise there).
But there is still the problem that it's simply not feasible to do one trillion trigonometric operations on a normal computer.
Some additional ideas for how to reduce the time:
Use multiprocessing so each core of your computer works in parallel; in theory this could speed it up by the number of your cores. In practice this won't be reachable, and I would recommend it only if you have >= 8 cores or a cluster available. Otherwise the time spent on getting multiprocessing to work correctly might exceed the single-core running time, especially because multiprocessing might not work correctly on the first try and then you have to rerun the calculation. (See the sketch after this list.)
Preprocess the elements: remove items where the RA or DEC difference alone makes it impossible to find matches. However, if this cannot remove a significant fraction of the elements, the additional subtractions and comparisons might actually slow things down.
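A minimal sketch of the multiprocessing idea (my own illustration; it assumes the RA1/DEC1/RA2/DEC2 arrays from the question and a fork-based platform such as Linux, so the worker processes inherit the global arrays):
from multiprocessing import Pool

import numpy as np
from astropy import units as u
from astropy.coordinates import SkyCoord

def process_chunk(args):
    # compute the angular separations for one chunk of the shorter catalogue
    ra1_chunk, dec1_chunk = args
    c2 = SkyCoord(ra=RA2 * u.degree, dec=DEC2 * u.degree)
    seps = []
    for ra, dec in zip(ra1_chunk, dec1_chunk):
        c1 = SkyCoord(ra=ra * u.degree, dec=dec * u.degree)
        seps.append(c1.separation(c2).radian)
    return seps

n_chunks = 8  # arbitrary choice
chunks = list(zip(np.array_split(RA1, n_chunks), np.array_split(DEC1, n_chunks)))
pool = Pool(processes=4)  # arbitrary pool size
results = pool.map(process_chunk, chunks)  # list of per-chunk separation lists
pool.close()
pool.join()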
Here is an implementation using an in-memory buffer to reduce I/O. Note: I prefer using the io module for file input/output in order to be more compatible with Python 3. I think it is a best practice, and you won't get lower performance with it.
import io

with io.open('results.txt', 'a') as f:
    buf = io.BytesIO()
    for i in xrange(len(RA1)):
        if i % 50000 == 0:
            print(i)
            f.write(buf.getvalue())
            buf.truncate(0)
        ra1 = RA1[i]
        dec1 = DEC1[i]
        c1 = SkyCoord(ra=ra1 * u.degree, dec=dec1 * u.degree)
        for j in xrange(len(RA2)):
            ra2 = RA2[j]
            dec2 = DEC2[j]
            c2 = SkyCoord(ra=ra2 * u.degree, dec=dec2 * u.degree)
            ang = c1.separation(c2)
            sep = d[i] * ang.radian
            pa = pyasl.positionAngle(ra1, dec1, ra2, dec2)
            if sep < 1.5:
                np.savetxt(buf, np.c_[ra1, dec1, sep, z[i], e[i], s[i], n[j], m[j]],
                           fmt='%1.4f %1.4f %1.4f %1.4f %1.4f %1.4f %i %i')
    f.write(buf.getvalue())
Note: in Python 2, I use xrange instead of range to reduce memory usage.
The buf.truncate(0) could also be replaced by a new instance, buf = io.BytesIO(), which might be more efficient. (Note that truncate(0) does not move the stream position back to the start, so a buf.seek(0) would be needed as well.)
First way to speed it up: c2 = SkyCoord is recalculated for every pair (ra2, dec2), len(RA1) times. You can speed it up by building buffer arrays of SkyCoord objects:
f = open('results.txt','a')

C1 = [SkyCoord(ra=ra1*u.degree, dec=DEC1[i]*u.degree)
      for i, ra1 in enumerate(RA1)]
C2 = [SkyCoord(ra=ra2*u.degree, dec=DEC2[i]*u.degree)
      for i, ra2 in enumerate(RA2)]  # buffer coords

for i, c1 in enumerate(C1):  # we only need enumerate() to get i
    for j, c2 in enumerate(C2):
        ang = c1.separation(c2)  # note we don't have to construct c2 here
        if d[i] < 1.5 / ang.radian:
            # now we don't have to multiply on every iteration;
            # the right-hand side plays the role of the constant threshold
            # the next line is only executed if the objects are close enough
            pa = pyasl.positionAngle(RA1[i], DEC1[i], RA2[j], DEC2[j])
            np.savetxt('...whatever')
You can speed it up even more by reading the SkyCoord.separation code and vectorizing it to replace SkyCoord, but I'm too lazy to do it myself. I assume that if we had two 2xN coordinate matrices x1, x2 it would look similar to (Matlab/Octave):
cos = pdist2(x1, x2) / (sqrt(dot(x1, x1)) * sqrt(dot(x2, x2)))
Assuming you want to reduce your dataset to < 2 degree differences (as per your comment), you can make a mask by broadcasting (you may need to do it in chunks, but the method is the same):
aMask=(abs(RA1[:,None]-RA2[None,:])<2)&(abs(DEC1[:,None]-DEC2[None,:])<2)
In some smaller-scale testing, this reduces the dataset to about 1/5000 of its original size. Then make a location array from the mask:
locs=np.where(aMask)
(array([ 0, 2, 4, ..., 4998, 4999, 4999], dtype=int32),
array([3575, 1523, 1698, ..., 4869, 1801, 2792], dtype=int32))
(from my 5k x 5k test). Pass all your other variables through, for example, d[locs[0]], to create 1D arrays that you can push through SkyCoord as per MSeifert's answer.
When you get your outputs and compare them to 1.5, you'll get a boolean mask bmask; you can then take outlocs = locs[0][bmask] and output RA1[outlocs], etc.
I've done similar things trying to do spatial derivatives on shells for FEM analysis, where taking the full rank of comparison between all the datapoints is similarly inefficient.
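A rough end-to-end sketch of this masking approach (my own summary of the steps above; it assumes the arrays from the question and that the full mask fits in memory, otherwise process RA1 in chunks):
aMask = (abs(RA1[:, None] - RA2[None, :]) < 2) & (abs(DEC1[:, None] - DEC2[None, :]) < 2)
locs = np.where(aMask)

# build vectorized SkyCoords only for the candidate pairs
c1 = SkyCoord(ra=RA1[locs[0]] * u.degree, dec=DEC1[locs[0]] * u.degree)
c2 = SkyCoord(ra=RA2[locs[1]] * u.degree, dec=DEC2[locs[1]] * u.degree)

sep = d[locs[0]] * c1.separation(c2).radian  # physical separation for each candidate pair
bmask = sep < 1.5
outlocs = locs[0][bmask]  # indices into the first catalogue that pass the cut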
savetxt used this way is essentially
astr = fmt % (ra1,dec1,sep,z[i],e[i],s[i],n[j],m[j])
astr += '\n' # or include in fmt
f.write(astr)
that is, just writing a formatted line to the file
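So, as a sketch (using the same variables as in the question), the np.savetxt call could be replaced by a plain formatted write, which avoids the per-call overhead of savetxt:
fmt = '%1.4f %1.4f %1.4f %1.4f %1.4f %1.4f %i %i\n'
if sep < 1.5:
    f.write(fmt % (ra1, dec1, sep, z[i], e[i], s[i], n[j], m[j]))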
The following code is written in Python and it works, i.e. it returns the expected result. However, it is very slow, and I believe it can be optimized.
import math
import numpy
import numpy.matlib

G_tensor = numpy.matlib.identity(N_particles*3, dtype=complex)

for i in range(N_particles):
    for j in range(i, N_particles):
        if i != j:
            # Do lots of things; here is shown an example.
            # However, you should not be scared, because
            # it only fills the G_tensor.
            R = numpy.linalg.norm(numpy.array(positions[i])-numpy.array(positions[j]))
            rx = numpy.array(positions[i][0])-numpy.array(positions[j][0])
            ry = numpy.array(positions[i][1])-numpy.array(positions[j][1])
            rz = numpy.array(positions[i][2])-numpy.array(positions[j][2])
            krq = (k*R)**2
            pf = -k**2*alpha*numpy.exp(1j*k*R)/(4*math.pi*R)
            a = 1.+(1j*k*R-1.)/(krq)
            b = (3.-3.*1j*k*R-krq)/(krq)

            G_tensor[3*i+0,3*j+0] = pf*(a + b * (rx*rx)/(R**2))  #Gxx
            G_tensor[3*i+1,3*j+1] = pf*(a + b * (ry*ry)/(R**2))  #Gyy
            G_tensor[3*i+2,3*j+2] = pf*(a + b * (rz*rz)/(R**2))  #Gzz
            G_tensor[3*i+0,3*j+1] = pf*(b * (rx*ry)/(R**2))      #Gxy
            G_tensor[3*i+0,3*j+2] = pf*(b * (rx*rz)/(R**2))      #Gxz
            G_tensor[3*i+1,3*j+0] = pf*(b * (ry*rx)/(R**2))      #Gyx
            G_tensor[3*i+1,3*j+2] = pf*(b * (ry*rz)/(R**2))      #Gyz
            G_tensor[3*i+2,3*j+0] = pf*(b * (rz*rx)/(R**2))      #Gzx
            G_tensor[3*i+2,3*j+1] = pf*(b * (rz*ry)/(R**2))      #Gzy

            G_tensor[3*j+0,3*i+0] = pf*(a + b * (rx*rx)/(R**2))  #Gxx
            G_tensor[3*j+1,3*i+1] = pf*(a + b * (ry*ry)/(R**2))  #Gyy
            G_tensor[3*j+2,3*i+2] = pf*(a + b * (rz*rz)/(R**2))  #Gzz
            G_tensor[3*j+0,3*i+1] = pf*(b * (rx*ry)/(R**2))      #Gxy
            G_tensor[3*j+0,3*i+2] = pf*(b * (rx*rz)/(R**2))      #Gxz
            G_tensor[3*j+1,3*i+0] = pf*(b * (ry*rx)/(R**2))      #Gyx
            G_tensor[3*j+1,3*i+2] = pf*(b * (ry*rz)/(R**2))      #Gyz
            G_tensor[3*j+2,3*i+0] = pf*(b * (rz*rx)/(R**2))      #Gzx
            G_tensor[3*j+2,3*i+1] = pf*(b * (rz*ry)/(R**2))      #Gzy
Do you know how I can parallelize it? Note that the two loops are not symmetric.
Edit one: A numpythonic solution was proposed (see the answer below), and I compared the C++ implementation, my Python loop version, and the numpythonic version. The results are the following:
- C++ = 0.14 s
- numpythonic version = 1.39 s
- Python loop version = 46.56 s
The results could probably be improved by using the Intel build of numpy.
Here is a proposition that should now work (I corrected a few mistakes), and that should give you the general idea of how vectorization can be applied to your code in order to make efficient use of numpy arrays. Everything is built in "one pass" (i.e. without any for loops), which is the "numpythonic" way:
import numpy as np
import math
N=2
k,alpha=1,1
G = np.zeros((N,3,N,3),dtype=complex)
# np.ogrid gives convenient (open) arrays of indices that
# can be used to write readable code
i,x_i,j,x_j = np.ogrid[0:N,0:3,0:N,0:3]
# A quick demo on how we can make the identity tensor with it
G[np.where((i == j) & (x_i == x_j))] = 1
#print(G.reshape(N*3,N*3))
positions=np.random.rand(N,3)
# Here I assumed position has shape [N,3]
# I build arr[i,j]=position[i] - position[j] using broadcasting
# by turning position into a column and a row
R = np.linalg.norm(positions[None,:,:]-positions[:,None,:],axis=-1)
# R is now a N,N matrix of all the distances
# we reshape R to (N,1,N,1) so that it can be broadcast to (N,3,N,3)
R=R.reshape(N,1,N,1)
r=positions[None,:,:]-positions[:,None,:]
krq = (k*R)**2
pf = -k**2*alpha*np.exp(1j*k*R)/(4*math.pi*R)
a = 1.+(1j*k*R-1.)/(krq)
b = (3.-3.*1j*k*R-krq)/(krq)
#print(np.isnan(pf[:,0,:,0]))
# here we build all the combination rx*rx rx*ry etc...
comb_r=(r[:,:,:,None]*r[:,:,None,:]).transpose([0,2,1,3])
#we compute G without the pf*A term
G = pf*(b * comb_r/(R**2))
#we add pf*a term where it is due
G[np.where(x_i == x_j)] = (G + pf*a)[np.where(x_i == x_j)]
# we didn't bother with the identity or condition i!=j so we enforce it here
G[np.where(i == j)] = 0
G[np.where((i == j) & (x_i == x_j))] = 1
print(G.reshape(N*3,N*3))
Python is not a fast language. Number crunching in Python should use code written in a compiled language for the time-critical parts. Compiling down to the CPU level can speed up the code by a factor of up to 100, and only then should you go for parallelization. So I would not look at using more cores to do inefficient work, but rather at working more efficiently. I see the following ways to speed up the code:
1) Better use of numpy: can you do your calculations directly at the vector/matrix level instead of at the scalar level? E.g. rx = positions[:,0]-positions[0,:] (not checked if that is correct), but something along those lines (see the sketch after this list).
If that is not possible with your kind of calculations, then you can go for option 2 or 3.
2) Use Cython. Cython compiles Python code to C, which is then compiled for your CPU. By using static typing in the right places you can make your code much faster; see the Cython tutorials, e.g.: http://cython.readthedocs.io/en/latest/src/quickstart/cythonize.html
3) If you are familiar with Fortran, it might be a good idea to write just this part in Fortran and then call it from Python using f2py. In fact, your code looks a lot like Fortran anyway. For C and C++, SWIG is one great tool for making compiled code available in Python, but there are plenty of other techniques (Cython, Boost::Python, ctypes, numba, etc.).
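As a small sketch of idea 1) (my own illustration, not code from the question): with broadcasting, the pairwise differences and distances for all particles can be computed without any Python loop, assuming positions is an (N, 3) array:
import numpy as np

positions = np.asarray(positions)  # shape (N, 3)
# diff[i, j] = positions[i] - positions[j], shape (N, N, 3)
diff = positions[:, None, :] - positions[None, :, :]
rx, ry, rz = diff[..., 0], diff[..., 1], diff[..., 2]  # each of shape (N, N)
R = np.linalg.norm(diff, axis=-1)                      # (N, N) matrix of distances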
When you have done this and it is still too slow, using GPU power with PyCUDA, or parallelization with mpi4py or multiprocessing, might be an option.