parallelize (not symmetric) loops in python

parallelize (not symmetric) loops in python - python

The following code is written in python and it works, i.e. returns the expected result. However, it is very slow and I believe that can be optimized.
G_tensor = numpy.matlib.identity(N_particles*3,dtype=complex)
for i in range(N_particles):
for j in range(i, N_particles):
if i != j:
#Do lots of things, here is shown an example.
# However you should not be scared because
#it only fills the G_tensor
R = numpy.linalg.norm(numpy.array(positions[i])-numpy.array(positions[j]))
rx = numpy.array(positions[i][0])-numpy.array(positions[j][0])
ry = numpy.array(positions[i][1])-numpy.array(positions[j][1])
rz = numpy.array(positions[i][2])-numpy.array(positions[j][2])
krq = (k*R)**2
pf = -k**2*alpha*numpy.exp(1j*k*R)/(4*math.pi*R)
a = 1.+(1j*k*R-1.)/(krq)
b = (3.-3.*1j*k*R-krq)/(krq)
G_tensor[3*i+0,3*j+0] = pf*(a + b * (rx*rx)/(R**2)) #Gxx
G_tensor[3*i+1,3*j+1] = pf*(a + b * (ry*ry)/(R**2)) #Gyy
G_tensor[3*i+2,3*j+2] = pf*(a + b * (rz*rz)/(R**2)) #Gzz
G_tensor[3*i+0,3*j+1] = pf*(b * (rx*ry)/(R**2)) #Gxy
G_tensor[3*i+0,3*j+2] = pf*(b * (rx*rz)/(R**2)) #Gxz
G_tensor[3*i+1,3*j+0] = pf*(b * (ry*rx)/(R**2)) #Gyx
G_tensor[3*i+1,3*j+2] = pf*(b * (ry*rz)/(R**2)) #Gyz
G_tensor[3*i+2,3*j+0] = pf*(b * (rz*rx)/(R**2)) #Gzx
G_tensor[3*i+2,3*j+1] = pf*(b * (rz*ry)/(R**2)) #Gzy
G_tensor[3*j+0,3*i+0] = pf*(a + b * (rx*rx)/(R**2)) #Gxx
G_tensor[3*j+1,3*i+1] = pf*(a + b * (ry*ry)/(R**2)) #Gyy
G_tensor[3*j+2,3*i+2] = pf*(a + b * (rz*rz)/(R**2)) #Gzz
G_tensor[3*j+0,3*i+1] = pf*(b * (rx*ry)/(R**2)) #Gxy
G_tensor[3*j+0,3*i+2] = pf*(b * (rx*rz)/(R**2)) #Gxz
G_tensor[3*j+1,3*i+0] = pf*(b * (ry*rx)/(R**2)) #Gyx
G_tensor[3*j+1,3*i+2] = pf*(b * (ry*rz)/(R**2)) #Gyz
G_tensor[3*j+2,3*i+0] = pf*(b * (rz*rx)/(R**2)) #Gzx
G_tensor[3*j+2,3*i+1] = pf*(b * (rz*ry)/(R**2)) #Gzy
Do you know how can I parallelize it? You should note that the two loops are not symmetric.
Edit one: A numpythonic solution was presented above and I made a comparison between the c++ implementation, my loop version in python and thr numpythonic. Results are the following:
- c++ = 0.14seg
- numpythonic version = 1.39seg
- python loop version = 46.56seg
Probably results can get better if we use the intel version of numpy.

Here is a proposition that should now work (I corrected a few mistakes) but that nonetheless sould give you the general idea of how verctorization can be applied to your code in order to make efficient use of numpy arrays. Everything is build in "one-pass" (ie without any for-loops) which is the "numpythonic" way:
import numpy as np
import math
N=2
k,alpha=1,1
G = np.zeros((N,3,N,3),dtype=complex)
# np.mgrid gives convenient arrays of indices that
# can be used to write readable code
i,x_i,j,x_j = np.ogrid[0:N,0:3,0:N,0:3]
# A quick demo on how we can make the identity tensor with it
G[np.where((i == j) & (x_i == x_j))] = 1
#print(G.reshape(N*3,N*3))
positions=np.random.rand(N,3)
# Here I assumed position has shape [N,3]
# I build arr[i,j]=position[i] - position[j] using broadcasting
# by turning position into a column and a row
R = np.linalg.norm(positions[None,:,:]-positions[:,None,:],axis=-1)
# R is now a N,N matrix of all the distances
#we reshape R to N,1,N,1 so that it can be broadcated to N,3,N,3
R=R.reshape(N,1,N,1)
r=positions[None,:,:]-positions[:,None,:]
krq = (k*R)**2
pf = -k**2*alpha*np.exp(1j*k*R)/(4*math.pi*R)
a = 1.+(1j*k*R-1.)/(krq)
b = (3.-3.*1j*k*R-krq)/(krq)
#print(np.isnan(pf[:,0,:,0]))
# here we build all the combination rx*rx rx*ry etc...
comb_r=(r[:,:,:,None]*r[:,:,None,:]).transpose([0,2,1,3])
#we compute G without the pf*A term
G = pf*(b * comb_r/(R**2))
#we add pf*a term where it is due
G[np.where(x_i == x_j)] = (G + pf*a)[np.where(x_i == x_j)]
# we didn't bother with the identity or condition i!=j so we enforce it here
G[np.where(i == j)] = 0
G[np.where((i == j) & (x_i == x_j))] = 1
print(G.reshape(N*3,N*3))

Python is not a fast language. Number crunching with python should always use for time critical parts code written in a compiled language. With compilation down to the CPU level you can speed up the code by a factor up to 100 and then still go for parallelization. So I would not look down to using more cores doing inefficient stuff, but to work more efficient. I see the following ways to speed up the code:
1) Better use of numpy: Can you do your calculations instead on scalar level directly on vector/matrix level? eg. rx = positions[:,0]-positions[0,:] (not checked if that is correct) but something along those lines.
If that is not possible with your kind of calculations, than you can go for option 2 or 3
2) Use cython. Cython compiles Python code to C, which is then compiled to your CPU. By using static typing at the right places you can make your code much faster, see cython tutorials eg.: http://cython.readthedocs.io/en/latest/src/quickstart/cythonize.html
3) If you are familiar with FORTRAN, it might be a good idea to write just this part in FORTRAN and then call it from Python using f2py. In fact, your code looks a lot like FORTRAN anyway. For C and C++ SWIG is one great tool to make compiled code available in Python, but there are plenty of other techniques (cython, Boost::Python, ctypes, numba etc.)
When you have done this, and it is still to slow, using GPU power with pyCUDA or parallelization with mpi4py or multiprocessing might be an option.

Related

How to speed up this numpy.arange loop?

In a python program, the following function is called about 20,000 times from another function that is called about 1000 times from yet another function that executes 30 times. Thus the total number of times this particular function is called is about 600,000,000. In python it takes more than two hours (perhaps much longer; I aborted the program without waiting for it to finish), while essentially the same task coded in Java takes less than 5 minutes. If I change the 20,000 above to 400 (keeping everything else in the rest of the program untouched), the total time drops to about 4 minutes (this means this particular function is the culprit). What can I do to speed up the Python version, or is it just not possible? No lists are manipulated inside this function (there are lists elsewhere in the whole program, but in those places I tried to use numpy arrays as far as possible). I understand that replacing python lists with numpy arrays speeds things up, but there are cases in my program (not in this particular function) where I must build a list iteratively, using append; and those must-have lists are lists of objects (not floats or ints), so numpy would be of little help even if I converted those lists of objects to numpy arrays.
def compute_something(arr):
'''
arr is received as a numpy array of ints and floats (I think python upcasts them to all floats,
doesn’t it?).
Inside this function, elements of arr are accessed using indexing (arr[0], arr[1], etc.), because
each element of the array has its own unique use. It’s not that I need the array as a whole (as in
arr**2 or sum(arr)).
The arr elements are used in several simple arithmetic operations involving nothing costlier than
+, -, *, /, and numpy.log(). There is no other loop inside this function; there are a few if’s though.
Inside this function, use is made of constants imported from other modules (I doubt the
importing, as in AnotherModule.x is expensive).
'''
for x in numpy.arange(float1, float2, float3):
do stuff
return a, b, c # Return a tuple of three floats
Edit:
Thanks for all the comments. Here’s the inside of the function (I made the variable names short for convenience). The ndarray array arr has only 3 elements in it. Can you please suggest any improvement?
def compute_something(arr):
a = Mod.b * arr[1] * arr[2] + Mod.c
max = 0.0
for c in np.arange(a, arr[1] * arr[2] * (Mod.d – Mod.e), Mod.f):
i = c / arr[2]
m1 = Mod.A * np.log( (i / (arr[1] *Mod.d)) + (Mod.d/Mod.e))
m2 = -Mod.B * np.log(1.0 - (i/ (arr[1] *Mod.d)) - (Mod.d /
Mod.e))
V = arr[0] * (Mod.E - Mod.r * i / arr[1] - Mod.r * Mod.d -
m1 – m2)
p = c * V /1000.0
if p > max:
max = p
vmp = V
pen = Mod.COEFF1 * (Mod.COEFF2 - max) if max < Mod.CONST else 0.0
wo = Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) +
Mod.COEFF6 * arr[2]
w = wo + pen
return vmp, max, w

Python supports profiling of code. (module cProfile). Also there is option to use line_profiler to find most expensive part of code tool here.
So you do not need to guessing which part of code is most expensive.
In this code which you presten the problem is in usage for loop which generates many conversion between types of objects. If you use numpy you can vectorize your calculation.
I try to rewrite your code to vectorize your operation. You do not provide information what is Mod object, but I have hope it will work.
def compute_something(arr):
a = Mod.b * arr[1] * arr[2] + Mod.c
# start calculation on vectors instead of for lop
c_arr = np.arange(a, arr[1] * arr[2] * (Mod.d – Mod.e), Mod.f)
i_arr = c_arr/arr[2]
m1_arr = Mod.A * np.log( (i_arr / (arr[1] *Mod.d)) + (Mod.d/Mod.e))
m2_arr = -Mod.B * np.log(1.0 - (i_arr/ (arr[1] *Mod.d)) - (Mod.d /
Mod.e))
V_arr = arr[0] * (Mod.E - Mod.r * i_arr / arr[1] - Mod.r * Mod.d -
m1_arr – m2_arr)
p = c_arr * V_arr / 1000.0
max_val = p.max() # change name to avoid conflict with builtin function
max_ind = np.nonzero(p == max_val)[0][0]
vmp = V_arr[max_ind]
pen = Mod.COEFF1 * (Mod.COEFF2 - max_val) if max_val < Mod.CONST else 0.0
wo = Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) +
Mod.COEFF6 * arr[2]
w = wo + pen
return vmp, max_val, w

I would suggest to use range as it is approximately 2 times faster:
def python():
for i in range(100000):
pass
def numpy():
for i in np.arange(100000):
pass
from timeit import timeit
print(timeit(python, number=1000))
print(timeit(numpy, number=1000))
Output:
5.59282787179696
10.027646953771665

accelerated FFT to be invoked from Python Numba CUDA kernel

I need to calculate the Fourier transform of a 256 element float64 signal. The requirement is as such that I need to invoke these FFTs from inside a cuda.jitted section and it must be completed within 25usec. Alas cuda.jit-compiled functions do not allow to invoke external libraries => I wrote my own. Alas my single-core code is still way too slow (~250usec on a Quadro P4000). Is there a better way?
I created a single core FFT-function that gives correct results, but is alas 10x too slow. I don't understand how to make good use of multiple cores.
---fft.py
from numba import cuda, boolean, void, int32, float32, float64, complex128
import math, sys, cmath
def _transform_radix2(vector, inverse, out):
n = len(vector)
levels = int32(math.log(float32(n))/math.log(float32(2)))
assert 2**levels==n # error: Length is not a power of 2
#uncomment either Numba.Cuda or Numpy memory allocation, (intelligent conditional compileation??)
exptable = cuda.local.array(1024, dtype=complex128)
#exptable = np.zeros(1024, np.complex128)
assert (n // 2) <= len(exptable) # error: FFT length > MAXFFTSIZE
coef = complex128((2j if inverse else -2j) * math.pi / n)
for i in range(n // 2):
exptable[i] = cmath.exp(i * coef)
for i in range(n):
x = i
y = 0
for j in range(levels):
y = (y << 1) | (x & 1)
x >>= 1
out[i] = vector[y]
size = 2
while size <= n:
halfsize = size // 2
tablestep = n // size
for i in range(0, n, size):
k = 0
for j in range(i, i + halfsize):
temp = out[j + halfsize] * exptable[k]
out[j + halfsize] = out[j] - temp
out[j] += temp
k += tablestep
size *= 2
scale=float64(n if inverse else 1)
for i in range(n):
out[i]=out[i]/scale # the inverse requires a scaling
# now create the Numba.cuda version to be called by a GPU
gtransform_radix2 = cuda.jit(device=True)(_transform_radix2)
---test.py
from numba import cuda, void, float64, complex128, boolean
import cupy as cp
import numpy as np
import timeit
import fft
#cuda.jit(void(float64[:],boolean, complex128[:]))
def fftbench(y, inverse, FT):
Y = cuda.local.array(256, dtype=complex128)
for i in range(len(y)):
Y[i]=complex128(y[i])
fft.gtransform_radix2(Y, False, FT)
str='\nbest [%2d/%2d] iterations, min:[%9.3f], max:[%9.3f], mean:[%9.3f], std:[%9.3f] usec'
a=[127.734375 ,130.87890625 ,132.1953125 ,129.62109375 ,118.6015625
,110.2890625 ,106.55078125 ,104.8203125 ,106.1875 ,109.328125
,113.5 ,118.6640625 ,125.71875 ,127.625 ,120.890625
,114.04296875 ,112.0078125 ,112.71484375 ,110.18359375 ,104.8828125
,104.47265625 ,106.65625 ,109.53515625 ,110.73828125 ,111.2421875
,112.28125 ,112.38671875 ,112.7734375 ,112.7421875 ,113.1328125
,113.24609375 ,113.15625 ,113.66015625 ,114.19921875 ,114.5
,114.5546875 ,115.09765625 ,115.2890625 ,115.7265625 ,115.41796875
,115.73828125 ,116. ,116.55078125 ,116.5625 ,116.33984375
,116.63671875 ,117.015625 ,117.25 ,117.41015625 ,117.6640625
,117.859375 ,117.91015625 ,118.38671875 ,118.51171875 ,118.69921875
,118.80859375 ,118.67578125 ,118.78125 ,118.49609375 ,119.0078125
,119.09375 ,119.15234375 ,119.33984375 ,119.31640625 ,119.6640625
,119.890625 ,119.80078125 ,119.69140625 ,119.65625 ,119.83984375
,119.9609375 ,120.15625 ,120.2734375 ,120.47265625 ,120.671875
,120.796875 ,120.4609375 ,121.1171875 ,121.35546875 ,120.94921875
,120.984375 ,121.35546875 ,120.87109375 ,120.8359375 ,121.2265625
,121.2109375 ,120.859375 ,121.17578125 ,121.60546875 ,121.84375
,121.5859375 ,121.6796875 ,121.671875 ,121.78125 ,121.796875
,121.8828125 ,121.9921875 ,121.8984375 ,122.1640625 ,121.9375
,122. ,122.3515625 ,122.359375 ,122.1875 ,122.01171875
,121.91015625 ,122.11328125 ,122.1171875 ,122.6484375 ,122.81640625
,122.33984375 ,122.265625 ,122.78125 ,122.44921875 ,122.34765625
,122.59765625 ,122.63671875 ,122.6796875 ,122.6171875 ,122.34375
,122.359375 ,122.7109375 ,122.83984375 ,122.546875 ,122.25390625
,122.06640625 ,122.578125 ,122.7109375 ,122.83203125 ,122.5390625
,122.2421875 ,122.06640625 ,122.265625 ,122.13671875 ,121.8046875
,121.87890625 ,121.88671875 ,122.2265625 ,121.63671875 ,121.14453125
,120.84375 ,120.390625 ,119.875 ,119.34765625 ,119.0390625
,118.4609375 ,117.828125 ,117.1953125 ,116.9921875 ,116.046875
,115.16015625 ,114.359375 ,113.1875 ,110.390625 ,108.41796875
,111.90234375 ,117.296875 ,127.0234375 ,147.58984375 ,158.625
,129.8515625 ,120.96484375 ,124.90234375 ,130.17578125 ,136.47265625
,143.9296875 ,150.24609375 ,141. ,117.71484375 ,109.80859375
,115.24609375 ,118.44140625 ,120.640625 ,120.9921875 ,111.828125
,101.6953125 ,111.21484375 ,114.91015625 ,115.2265625 ,118.21875
,125.3359375 ,139.44140625 ,139.76953125 ,135.84765625 ,137.3671875
,141.67578125 ,139.53125 ,136.44921875 ,135.08203125 ,135.7890625
,137.58203125 ,138.7265625 ,154.33203125 ,172.01171875 ,152.24609375
,129.8046875 ,125.59375 ,125.234375 ,127.32421875 ,132.8984375
,147.98828125 ,152.328125 ,153.7734375 ,155.09765625 ,156.66796875
,159.0546875 ,151.83203125 ,138.91796875 ,138.0546875 ,140.671875
,143.48046875 ,143.99609375 ,146.875 ,146.7578125 ,141.15234375
,141.5 ,140.76953125 ,140.8828125 ,145.5625 ,150.78125
,148.89453125 ,150.02734375 ,150.70703125 ,152.24609375 ,148.47265625
,131.95703125 ,125.40625 ,123.265625 ,123.57421875 ,129.859375
,135.6484375 ,144.51171875 ,155.05078125 ,158.4453125 ,140.8125
,100.08984375 ,104.29296875 ,128.55078125 ,139.9921875 ,143.38671875
,143.69921875 ,137.734375 ,124.48046875 ,116.73828125 ,114.84765625
,113.85546875 ,117.45703125 ,122.859375 ,125.8515625 ,133.22265625
,139.484375 ,135.75 ,122.69921875 ,115.7734375 ,116.9375
,127.57421875]
y1 =cp.zeros(len(a), cp.complex128)
FT1=cp.zeros(len(a), cp.complex128)
for i in range(len(a)):
y1[i]=a[i] #convert to complex to feed the FFT
r=1000
series=sorted(timeit.repeat("fftbench(y1, False, FT1)", number=1, repeat=r, globals=globals()))
series=series[0:r-5]
print(str % (len(series), r, 1e6*np.min(series), 1e6*np.max(series), 1e6*np.mean(series), 1e6*np.std(series)));
a faster implementation t<<25usec

The drawback of your algorithm is that even on GPU it runs on a single-core.
In order to understand how to design algorithms on Nvidia GPGPU I recommend to look at :
the CUDA C Programming guide and to the numba documentation to apply the code in python.
Moreover to understand what's wrong with your code, I recommend to use Nvidia profiler.
The following parts of the answer will explained how to apply the basics on your example.
Run multiples threads
To improve performances, you will first need to launch multiples threads that can run in parallel, CUDA handle threads as follow:
Threads are grouped into blocs of n threads (n < 1024)
Each thread withing the same bloc can be synchronized and have access to a (fast) common memory space called "shared memory".
You can run multiples blocs in parallel in a "grid" but you will lose the synchronization mechanism.
The syntax to run multiples threads is the following:
fftbench[griddim, blockdim](y1, False, FT1)
to simplify, I will use only one bloc of size 256:
fftbench[1, 256](y1, False, FT1)
Memory
To improve GPU performances it's important to look where the data will be stored, their is three main spaces:
global memory: it's the "RAM" of your GPU, it's slow and have a high latency, this is where all your array are placed when you send them to the GPU.
shared memory: it's a little fast access memory, all the thread of a bloc have access to the same shared memory.
local memory: physically it's the same that global memory, but each thread access its own local memory.
Typically, if you use multiples times the sames data, you should try store them in shared memory to prevent latency from the global memory.
In your code, you can store exptable in shared memory:
exptable = cuda.shared.array(1024, dtype=complex128)
and if n is not too big, you may want to use a working instead of using out:
working = cuda.shared.array(256, dtype=complex128)
Assign tasks to each thread
Of course if you don't change your function, all thread will do the same job and it will just slow down your program.
In this example we will assign each thread to one cell of the array. To do so, we have to get the unique id of thread withing a bloc:
idx = cuda.threadIdx.x
Now we will be able to speed up the for loops, lets handle them one by one:
exptable = cuda.shared.array(1024, dtype=complex128)
...
for i in range(n // 2):
exptable[i] = cmath.exp(i * coef)
Here is the goal: we will want the n/2 first threads to fill this array, then all the thread will be able to use it.
So in this case just replace the for loop by a condition on the thread idx's:
if idx < n // 2:
exptable[idx] = cmath.exp(idx * coef)
For the two last loops it's easier, each thread will deal with one cell of the array:
for i in range(n):
x = i
y = 0
for j in range(levels):
y = (y << 1) | (x & 1)
x >>= 1
out[i] = vector[y]
become
x = idx
y = 0
for j in range(levels):
y = (y << 1) | (x & 1)
x >>= 1
working[idx] = vector[y]
and
for i in range(n):
out[i]=out[i]/scale # the inverse requires a scaling
become
out[idx]=working[idx]/scale # the inverse requires a scaling
I use the shared array working but you can replace it by out if you want to use global memory.
Now, lets look at the while loop, we said that we want each thread to only deal with one cell of the array. So we can try to parallelize the two for loops inside.
...
for i in range(0, n, size):
k = 0
for j in range(i, i + halfsize):
temp = out[j + halfsize] * exptable[k]
out[j + halfsize] = out[j] - temp
out[j] += temp
k += tablestep
...
To simplify I will only use half of the threads, we will take the 128 first threads and determine j as follow:
...
if idx < 128:
j = (idx%halfsize) + size*(idx//halfsize)
...
k is:
k = tablestep*(idx%halfsize)
so we got the loop:
size = 2
while size <= n:
halfsize = size // 2
tablestep = n // size
if idx < 128:
j = (idx%halfsize) + size*(idx//halfsize)
k = tablestep*(idx%halfsize)
temp = working[j + halfsize] * exptable[k]
working[j + halfsize] = working[j] - temp
working[j] += temp
size *= 2
Synchronization
Last but not least, we need to synchronize all theses threads. In fact the program will not work if we do not synch. On the GPU thread may not run at the same time so you can get issues when data are produced by one thread and used by another one, for example:
exptable[0] is used by thread_2 before thread_0 fill store its value
working[j + halfsize] is moddified by another thread before you store it in temp
to prevent this we can use the function:
cuda.syncthreads()
All the threads in the same bloc will finish this line before execution the rest of the code.
In this example, you need to synchronize at two point, after the working initialization and after each iteration of the while loop.
then your code look like:
def _transform_radix2(vector, inverse, out):
n = len(vector)
levels = int32(math.log(float32(n))/math.log(float32(2)))
assert 2**levels==n # error: Length is not a power of 2
exptable = cuda.shared.array(1024, dtype=complex128)
working = cuda.shared.array(256, dtype=complex128)
assert (n // 2) <= len(exptable) # error: FFT length > MAXFFTSIZE
coef = complex128((2j if inverse else -2j) * math.pi / n)
if idx < n // 2:
exptable[idx] = cmath.exp(idx * coef)
x = idx
y = 0
for j in range(levels):
y = (y << 1) | (x & 1)
x >>= 1
working[idx] = vector[y]
cuda.syncthreads()
size = 2
while size <= n:
halfsize = size // 2
tablestep = n // size
if idx < 128:
j = (idx%halfsize) + size*(idx//halfsize)
k = tablestep*(idx%halfsize)
temp = working[j + halfsize] * exptable[k]
working[j + halfsize] = working[j] - temp
working[j] += temp
size *= 2
cuda.syncthreads()
scale=float64(n if inverse else 1)
out[idx]=working[idx]/scale # the inverse requires a scaling
I feel like your question is a good way to introduce some basics about GPGPU computing and I try to answer it in a didactic way. The final code is far from perfect and can be optimized a lot, I highly recommend you to read this Programming guide if you want to learn more about GPU optimizations.

Fastest way to add/multiply two floating point scalar numbers in python

I'm using python and apparently the slowest part of my program is doing simple additions on float variables.
It takes about 35seconds to do around 400,000,000 additions/multiplications.
I'm trying to figure out what is the fastest way I can do this math.
This is how the structure of my code looks like.
Example (dummy) code:
def func(x, y, z):
loop_count = 30
a = [0,1,2,3,4,5,6,7,8,9,10,11,12,...35 elements]
b = [0,11,22,33,44,55,66,77,88,99,1010,1111,1212,...35 elements]
p = [0,0,0,0,0,0,0,0,0,0,0,0,0,...35 elements]
for i in range(loop_count - 1):
c = p[i-1]
d = a[i] + c * a[i+1]
e = min(2, a[i]) + c * b[i]
f = e * x
g = y + d * c
.... and so on
p[i] = d + e + f + s + g5 + f4 + h7 * t5 + y8
return sum(p)
func() is called about 200k times. The loop_count is about 30. And I have ~20 multiplications and ~45 additions and ~10 uses of min/max
I was wondering if there is a method for me to declare all these as ctypes.c_float and do addition in C using stdlib or something similar ?
Note that the p[i] calculated at the end of the loop is used as c in the next loop iteration. For iteration 0, it just uses p[-1] which is 0 in this case.
My constraints:
I need to use python. While I understand plain math would be faster in C/Java/etc. I cannot use it due to a bunch of other things I do in python which cannot be done in C in this same program.
I tried writing this with cython, but it caused a bunch of issues with the environment I need to run this in. So, again - not an option.

I think you should consider using numpy. You did not mention any constraint.
Example case of a simple dot operation (x.y)
import datetime
import numpy as np
x = range(0,10000000,1)
y = range(0,20000000,2)
for i in range(0, len(x)):
x[i] = x[i] * 0.00001
y[i] = y[i] * 0.00001
now = datetime.datetime.now()
z = 0
for i in range(0, len(x)):
z = z+x[i]*y[i]
print "handmade dot=", datetime.datetime.now()-now
print z
x = np.arange(0.0, 10000000.0*0.00001, 0.00001)
y = np.arange(0.0, 10000000.0*0.00002, 0.00002)
now = datetime.datetime.now()
z = np.dot(x,y)
print 'numpy dot =',datetime.datetime.now()-now
print z
outputs
handmade dot= 0:00:02.559000
66666656666.7
numpy dot = 0:00:00.019000
66666656666.7
numpy is more than 100x times faster.
The reason is that numpy encapsulates a C library that does the dot operation with compiled code. In the full python you have a list of potentially generic objects, casting, ...

Can a Cython/Numba compiled function improve on numpy.max(numpy.abs(a-b))?

I am optimizing a bottleneck section of my code--iterating on a function a' = f(a), where a and a' are N by 1 vectors, until max(abs(a' - a)) is sufficiently small.
I have put a Numba wrapper on f(a), and got a nice speedup over the most optimized pure NumPy version I was able to produe (cut runtime by about 50%).
I tried writing a C-compatible version of numpy.max(numpy.abs(aprime - a)), but it turns out this is slower! I actually lose back ALL of the gains I got from Numba-fying the first portion of the iteration!
Is there likely to be a way for Numba or Cython to improve upon numpy.max(numpy.abs(aprime - a))? I reproduce my code below for reference, where a is P0 and a' is Pprime:
EDIT: For me, it seems that it is important to "flatten()" the inputs to "maxabs()". When I do this, the performance is no worse than NumPy. Then, when I do a "dry run" of the function outside the timing brackets as JoshAdel suggested, the loop with "maxabs" does slightly better than the loop with numpy.max(numpy.abs()).
from numba import jit
import numpy as np
### Preliminaries, to make the working example fully functional
n = 1200
Gammer = np.exp(-np.random.rand(n,n))
alpher = np.ones((n,1))
xxer = 10000*np.random.rand(n,1)
chii = 6.5
varkappa = 6.5
phi3 = 1.5
A = .5
sig = .2
mmer = np.dot(Gammer,xxer**phi3)
totalprod = A*alpher + (1-A)*mmer
Gammerchii = Gammer**chii
Gammerrats = Gammerchii[:,0].flatten()/Gammerchii[0,:].flatten()
Gammerrats[(Gammerchii[0,:].flatten() == 0) | (Gammerchii[:,0].flatten() == 0)] = 1.
P0 = (Gammerrats*(xxer[0]/totalprod[0])*(totalprod/xxer).flatten())**(1/(1+2*chii))
P0 *= n/np.sum(P0)
### End of preliminaries
### This is the function to produce a' = f(a)
#jit
def Piteration(P0, chii, sig, n, xxer, totalprod, Gammerrats, Gammerchii):
Mac = np.zeros((n,))
Pprime = np.zeros((n,))
themacpow = 1-(1/chii)*(sig/(1-sig))
specialchiipow = 1/(1+2*chii)
Psum = 0.
for i in range(n):
for j in range(n):
Mac[j] += ((P0[i]/P0[j])**chii)*Gammerchii[i,j]*totalprod[j]
for i in range(n):
Pprime[i] = (Gammerrats[i]*(xxer[0]/totalprod[0])*(totalprod[i]/xxer[i])*((Mac[i]/Mac[0])**themacpow))**specialchiipow
Psum += Pprime[i]
Psum = n/Psum
for i in range(n):
Pprime[i] *= Psum
return Pprime
### This is the function to find max(abs(aprime - a))
#jit
def maxabs(vec1,vec2,n):
themax = 0.
curdiff = 0.
for i in range(n):
curdiff = vec1[i] - vec2[i]
if curdiff < 0:
curdiff *= -1
if curdiff > themax:
themax = curdiff
return themax
### This is the main loop
diff = 1000.
while diff > 1e-2:
Pprime = Piteration(P0.flatten(), chii, sig, n, xxer.flatten(), totalprod.flatten(), Gammerrats.flatten(), Gammerchii)
diff = maxabs(P0.flatten(),Pprime.flatten(),n)
P0 = 1.*Pprime

When I time your maxabs function vs np.max(np.abs(vec1 - vec2)) for an array of shape (1200,), the numba version is ~2.6x faster using numba 0.32.0.
When you time the code, make sure you run your function once before you time it so that you don't include the time it takes to jit the code, which you only pay the first time. In general using timeit and running multiple times takes care of this. I'm not sure how you did the timing though since I see almost no difference in using maxabs vs the numpy call, most of the runtime seems to be in the call to Piteration.

Python: optimising loops

I wish to optimise some python code consisting of two nested loops. I am not so familar with numpy, but I understand it should enable me to improve the efficiency for such a task. Below is a test code I wrote that reflects what happens in the actual code. Currently using the numpy range and iterator is slower than the usual python one. What am I doing wrong? What is the best solution to this problem?
Thanks for your help!
import numpy
import time
# setup a problem analagous to that in the real code
npoints_per_plane = 1000
nplanes = 64
naxis = 1000
npoints3d = naxis + npoints_per_plane * nplanes
npoints = naxis + npoints_per_plane
specres = 1000
# this is where the data is being mapped to
sol = dict()
sol["ems"] = numpy.zeros(npoints3d)
sol["abs"] = numpy.zeros(npoints3d)
# this would normally be non-random input data
data = dict()
data["ems"] = numpy.zeros((npoints,specres))
data["abs"] = numpy.zeros((npoints,specres))
for ip in range(npoints):
data["ems"][ip,:] = numpy.random.random(specres)[:]
data["abs"][ip,:] = numpy.random.random(specres)[:]
ems_mod = numpy.random.random(1)[0]
abs_mod = numpy.random.random(1)[0]
ispec = numpy.random.randint(specres)
# this the code I want to optimize
t0 = time.time()
# usual python range and iterator
for ip in range(npoints_per_plane):
jp = naxis + ip
for ipl in range(nplanes):
ip3d = jp + npoints_per_plane * ipl
sol["ems"][ip3d] = data["ems"][jp,ispec] * ems_mod
sol["abs"][ip3d] = data["abs"][jp,ispec] * abs_mod
t1 = time.time()
# numpy ranges and iterator
ip_vals = numpy.arange(npoints_per_plane)
ipl_vals = numpy.arange(nplanes)
for ip in numpy.nditer(ip_vals):
jp = naxis + ip
for ipl in numpy.nditer(ipl_vals):
ip3d = jp + npoints_per_plane * ipl
sol["ems"][ip3d] = data["ems"][jp,ispec] * ems_mod
sol["abs"][ip3d] = data["abs"][jp,ispec] * abs_mod
t2 = time.time()
print "plain python: %0.3f seconds" % ( t1 - t0 )
print "numpy: %0.3f seconds" % ( t2 - t1 )
edit: put "jp = naxis + ip" in the first for loop only
additional note:
I worked out how to get numpy to quickly do the inner loop, but not the outer loop:
# numpy vectorization
for ip in xrange(npoints_per_plane):
jp = naxis + ip
sol["ems"][jp:jp+npoints_per_plane*nplanes:npoints_per_plane] = data["ems"][jp,ispec] * ems_mod
sol["abs"][jp:jp+npoints_per_plane*nplanes:npoints_per_plane] = data["abs"][jp,ispec] * abs_mod
Joe's solution below shows how to do both together, thanks!

The best way of writing loops in numpy is not writing loops and instead using vectorized operations. For example:
c = 0
for i in range(len(a)):
c += a[i] + b[i]
becomes
c = np.sum(a + b, axis=0)
For a and b with a shape of (100000, 100) this takes 0.344 seconds in the first variant, and 0.062 seconds in the second.
In the case presented in your question the following does what you want:
sol['ems'][naxis:] = numpy.ravel(
numpy.repeat(
data['ems'][naxis:,ispec,numpy.newaxis] * ems_mod,
nplanes,
axis=1
),
order='F'
)
This could be further optimized with some tricks, but that would reduce clarity and is probably premature optimization because:
plain python: 0.064 seconds
numpy: 0.002 seconds
The solution works as follows:
Your original version contains jp = naxis + ip which merely skips the first naxis elements [naxis:] selects all but the first naxis elements. Your inner loop repeats the value of data[jp,ispec] for nplanes times and writes it to multiple locations ip3d = jp + npoints_per_plane * ipl which is equivalent to a flattened 2D array offset by naxis. Therefore a second dimension is added via numpy.newaxis to the (previously 1D) data['ems'][naxis:, ispec], the values are repeated nplanes times along this new dimension via numpy.repeat. The resulting 2D array is then flattened again via numpy.ravel (in Fortran order, i.e., with the lowest axis having the smallest stride) and written to the appropriate subarray of sol['ems']. If the target array was actually 2D, the repeat could be skipped by using automatic array broadcasting.
If you run into a situation where you cannot avoid using loops, you could use Cython (which supports efficient buffer views on numpy arrays).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

parallelize (not symmetric) loops in python - python

Related

How to speed up this numpy.arange loop?

accelerated FFT to be invoked from Python Numba CUDA kernel

Fastest way to add/multiply two floating point scalar numbers in python

Can a Cython/Numba compiled function improve on numpy.max(numpy.abs(a-b))?

Python: optimising loops

Categories

Resources