How to speed up this numpy.arange loop?

How to speed up this numpy.arange loop? - python

In a python program, the following function is called about 20,000 times from another function that is called about 1000 times from yet another function that executes 30 times. Thus the total number of times this particular function is called is about 600,000,000. In python it takes more than two hours (perhaps much longer; I aborted the program without waiting for it to finish), while essentially the same task coded in Java takes less than 5 minutes. If I change the 20,000 above to 400 (keeping everything else in the rest of the program untouched), the total time drops to about 4 minutes (this means this particular function is the culprit). What can I do to speed up the Python version, or is it just not possible? No lists are manipulated inside this function (there are lists elsewhere in the whole program, but in those places I tried to use numpy arrays as far as possible). I understand that replacing python lists with numpy arrays speeds things up, but there are cases in my program (not in this particular function) where I must build a list iteratively, using append; and those must-have lists are lists of objects (not floats or ints), so numpy would be of little help even if I converted those lists of objects to numpy arrays.
def compute_something(arr):
'''
arr is received as a numpy array of ints and floats (I think python upcasts them to all floats,
doesn’t it?).
Inside this function, elements of arr are accessed using indexing (arr[0], arr[1], etc.), because
each element of the array has its own unique use. It’s not that I need the array as a whole (as in
arr**2 or sum(arr)).
The arr elements are used in several simple arithmetic operations involving nothing costlier than
+, -, *, /, and numpy.log(). There is no other loop inside this function; there are a few if’s though.
Inside this function, use is made of constants imported from other modules (I doubt the
importing, as in AnotherModule.x is expensive).
'''
for x in numpy.arange(float1, float2, float3):
do stuff
return a, b, c # Return a tuple of three floats
Edit:
Thanks for all the comments. Here’s the inside of the function (I made the variable names short for convenience). The ndarray array arr has only 3 elements in it. Can you please suggest any improvement?
def compute_something(arr):
a = Mod.b * arr[1] * arr[2] + Mod.c
max = 0.0
for c in np.arange(a, arr[1] * arr[2] * (Mod.d – Mod.e), Mod.f):
i = c / arr[2]
m1 = Mod.A * np.log( (i / (arr[1] *Mod.d)) + (Mod.d/Mod.e))
m2 = -Mod.B * np.log(1.0 - (i/ (arr[1] *Mod.d)) - (Mod.d /
Mod.e))
V = arr[0] * (Mod.E - Mod.r * i / arr[1] - Mod.r * Mod.d -
m1 – m2)
p = c * V /1000.0
if p > max:
max = p
vmp = V
pen = Mod.COEFF1 * (Mod.COEFF2 - max) if max < Mod.CONST else 0.0
wo = Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) +
Mod.COEFF6 * arr[2]
w = wo + pen
return vmp, max, w

Python supports profiling of code. (module cProfile). Also there is option to use line_profiler to find most expensive part of code tool here.
So you do not need to guessing which part of code is most expensive.
In this code which you presten the problem is in usage for loop which generates many conversion between types of objects. If you use numpy you can vectorize your calculation.
I try to rewrite your code to vectorize your operation. You do not provide information what is Mod object, but I have hope it will work.
def compute_something(arr):
a = Mod.b * arr[1] * arr[2] + Mod.c
# start calculation on vectors instead of for lop
c_arr = np.arange(a, arr[1] * arr[2] * (Mod.d – Mod.e), Mod.f)
i_arr = c_arr/arr[2]
m1_arr = Mod.A * np.log( (i_arr / (arr[1] *Mod.d)) + (Mod.d/Mod.e))
m2_arr = -Mod.B * np.log(1.0 - (i_arr/ (arr[1] *Mod.d)) - (Mod.d /
Mod.e))
V_arr = arr[0] * (Mod.E - Mod.r * i_arr / arr[1] - Mod.r * Mod.d -
m1_arr – m2_arr)
p = c_arr * V_arr / 1000.0
max_val = p.max() # change name to avoid conflict with builtin function
max_ind = np.nonzero(p == max_val)[0][0]
vmp = V_arr[max_ind]
pen = Mod.COEFF1 * (Mod.COEFF2 - max_val) if max_val < Mod.CONST else 0.0
wo = Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) +
Mod.COEFF6 * arr[2]
w = wo + pen
return vmp, max_val, w

I would suggest to use range as it is approximately 2 times faster:
def python():
for i in range(100000):
pass
def numpy():
for i in np.arange(100000):
pass
from timeit import timeit
print(timeit(python, number=1000))
print(timeit(numpy, number=1000))
Output:
5.59282787179696
10.027646953771665

Related

How to use sympy solve in for loop

I'm using sympy to solve an equation in a for loop in which at each interaction a variable (kp) multiples the function. But in each interaction the length of the output increases. I have an array k and kp is selected from k
k = [2,4,5,7,9]
for kp in k:
didt = beta * kp * teta
dteta = integrate(1/((kp-1) * pk * didt / avgk),teta)
dt = integrate(1,(t,0 ,1))
teta2 = solve(dteta - dt ,teta)
#print(solve(dteta - dt ,teta))
didt2 = beta * solve(dteta - dt ,teta) *kp
print(didt2)
Also, the output for didt2 for 1st iteration is
[1.49182469764127, 1.49182469764127]
for the second one is
[11.0231763806416, 11.0231763806416, 11.0231763806416, 11.0231763806416]
for the 3rd one is [54.5981500331442, 54.5981500331442, 54.5981500331442, 54.5981500331442, 54.5981500331442]
I'm just wondering, why the length of didt2 increases at each interaction?

It looks like solve() returns a list. This means that beta * solve(dteta - dt ,teta) *kp doesn't do what you think. Rather than multiplying the result, you are duplicating the elements of the returned list. For a simple example, try to see what the output is:
[0] * 10
In your case, kp takes on the values of 2, 4, and 5 on each iteration of the list, so the output you see is the result of doing
[1.49182469764127] * 2
[11.0231763806416] * 4
[54.5981500331442] * 5
These all result in lists with the exact length of the value of kp. It does not do numeric multiplication.

Fastest way to add/multiply two floating point scalar numbers in python

I'm using python and apparently the slowest part of my program is doing simple additions on float variables.
It takes about 35seconds to do around 400,000,000 additions/multiplications.
I'm trying to figure out what is the fastest way I can do this math.
This is how the structure of my code looks like.
Example (dummy) code:
def func(x, y, z):
loop_count = 30
a = [0,1,2,3,4,5,6,7,8,9,10,11,12,...35 elements]
b = [0,11,22,33,44,55,66,77,88,99,1010,1111,1212,...35 elements]
p = [0,0,0,0,0,0,0,0,0,0,0,0,0,...35 elements]
for i in range(loop_count - 1):
c = p[i-1]
d = a[i] + c * a[i+1]
e = min(2, a[i]) + c * b[i]
f = e * x
g = y + d * c
.... and so on
p[i] = d + e + f + s + g5 + f4 + h7 * t5 + y8
return sum(p)
func() is called about 200k times. The loop_count is about 30. And I have ~20 multiplications and ~45 additions and ~10 uses of min/max
I was wondering if there is a method for me to declare all these as ctypes.c_float and do addition in C using stdlib or something similar ?
Note that the p[i] calculated at the end of the loop is used as c in the next loop iteration. For iteration 0, it just uses p[-1] which is 0 in this case.
My constraints:
I need to use python. While I understand plain math would be faster in C/Java/etc. I cannot use it due to a bunch of other things I do in python which cannot be done in C in this same program.
I tried writing this with cython, but it caused a bunch of issues with the environment I need to run this in. So, again - not an option.

I think you should consider using numpy. You did not mention any constraint.
Example case of a simple dot operation (x.y)
import datetime
import numpy as np
x = range(0,10000000,1)
y = range(0,20000000,2)
for i in range(0, len(x)):
x[i] = x[i] * 0.00001
y[i] = y[i] * 0.00001
now = datetime.datetime.now()
z = 0
for i in range(0, len(x)):
z = z+x[i]*y[i]
print "handmade dot=", datetime.datetime.now()-now
print z
x = np.arange(0.0, 10000000.0*0.00001, 0.00001)
y = np.arange(0.0, 10000000.0*0.00002, 0.00002)
now = datetime.datetime.now()
z = np.dot(x,y)
print 'numpy dot =',datetime.datetime.now()-now
print z
outputs
handmade dot= 0:00:02.559000
66666656666.7
numpy dot = 0:00:00.019000
66666656666.7
numpy is more than 100x times faster.
The reason is that numpy encapsulates a C library that does the dot operation with compiled code. In the full python you have a list of potentially generic objects, casting, ...

Series Expansion of cos with Python

So, I'm trying to find the value of cos(x), where x=1.2. I feel the script I have written should be fine, however, the value I get out isn't correct. That is; cos(1.2)=0.6988057880877979, for 25 terms, when I should get out: cos(1.2)=0.36235775.
I have created a similar program for calculating sin(1.2) which works fine.
Calculating sin(1.2):
import math as m
x=1.2
k=1
N=25
s=x
sign=1.0
while k<N:
sign=-sign
k=k+2
term=sign*x**k/m.factorial(k)
s=s+term
print('sin(%g) = %g (approximation with %d terms)' % (x,s,N))
Now trying to calculate cos(1.2):
import math as m
x=1.2
k=1
N=25
s=x
sign=1.0
while k<N:
sign=-sign
k=k+1
term=sign*x**k/m.factorial(k)
s=s+term
print(s)

You shouldn't be setting your initial sum to 1.2, and your representation of the expansion
is a bit off - we need to account for the even-ness of the function, so increment k by 2. Also, without modifying your program structure, you'd have to set the initial variables so they are correctly put to their starting values at the beginning of the first loop. Re-ordering your loop control flow a bit, we have
import math as m
x=1.2
k=0
N=25
s=0
sign=1.0
while k<N:
term=sign*x**(k)/m.factorial(k)
s=s+term
k += 2
sign = -sign
print(s)
Gives
0.3623577544766735

I think you're using the wrong series for the cosine, the correct formula would be (I highlighted the important differences with ^):
sum_over_n [(-1)**n * x ** (2 * n) / (math.factorial(2 * n))]
# ^^^^ ^^^^
that means to add n-terms you have something like:
def cosine_by_series(x, terms):
cos = 0
for n in range(terms):
cos += ((-1)**n) * (x ** (2*n)) / (math.factorial(2 * n))
return cos
# or simply:
# return sum(((-1)**n) * (x ** (2*n)) / (math.factorial(2 * n)) for n in range(terms)
which gives:
>>> cosine_by_series(1.2, 30)
0.3623577544766735

parallelize (not symmetric) loops in python

The following code is written in python and it works, i.e. returns the expected result. However, it is very slow and I believe that can be optimized.
G_tensor = numpy.matlib.identity(N_particles*3,dtype=complex)
for i in range(N_particles):
for j in range(i, N_particles):
if i != j:
#Do lots of things, here is shown an example.
# However you should not be scared because
#it only fills the G_tensor
R = numpy.linalg.norm(numpy.array(positions[i])-numpy.array(positions[j]))
rx = numpy.array(positions[i][0])-numpy.array(positions[j][0])
ry = numpy.array(positions[i][1])-numpy.array(positions[j][1])
rz = numpy.array(positions[i][2])-numpy.array(positions[j][2])
krq = (k*R)**2
pf = -k**2*alpha*numpy.exp(1j*k*R)/(4*math.pi*R)
a = 1.+(1j*k*R-1.)/(krq)
b = (3.-3.*1j*k*R-krq)/(krq)
G_tensor[3*i+0,3*j+0] = pf*(a + b * (rx*rx)/(R**2)) #Gxx
G_tensor[3*i+1,3*j+1] = pf*(a + b * (ry*ry)/(R**2)) #Gyy
G_tensor[3*i+2,3*j+2] = pf*(a + b * (rz*rz)/(R**2)) #Gzz
G_tensor[3*i+0,3*j+1] = pf*(b * (rx*ry)/(R**2)) #Gxy
G_tensor[3*i+0,3*j+2] = pf*(b * (rx*rz)/(R**2)) #Gxz
G_tensor[3*i+1,3*j+0] = pf*(b * (ry*rx)/(R**2)) #Gyx
G_tensor[3*i+1,3*j+2] = pf*(b * (ry*rz)/(R**2)) #Gyz
G_tensor[3*i+2,3*j+0] = pf*(b * (rz*rx)/(R**2)) #Gzx
G_tensor[3*i+2,3*j+1] = pf*(b * (rz*ry)/(R**2)) #Gzy
G_tensor[3*j+0,3*i+0] = pf*(a + b * (rx*rx)/(R**2)) #Gxx
G_tensor[3*j+1,3*i+1] = pf*(a + b * (ry*ry)/(R**2)) #Gyy
G_tensor[3*j+2,3*i+2] = pf*(a + b * (rz*rz)/(R**2)) #Gzz
G_tensor[3*j+0,3*i+1] = pf*(b * (rx*ry)/(R**2)) #Gxy
G_tensor[3*j+0,3*i+2] = pf*(b * (rx*rz)/(R**2)) #Gxz
G_tensor[3*j+1,3*i+0] = pf*(b * (ry*rx)/(R**2)) #Gyx
G_tensor[3*j+1,3*i+2] = pf*(b * (ry*rz)/(R**2)) #Gyz
G_tensor[3*j+2,3*i+0] = pf*(b * (rz*rx)/(R**2)) #Gzx
G_tensor[3*j+2,3*i+1] = pf*(b * (rz*ry)/(R**2)) #Gzy
Do you know how can I parallelize it? You should note that the two loops are not symmetric.
Edit one: A numpythonic solution was presented above and I made a comparison between the c++ implementation, my loop version in python and thr numpythonic. Results are the following:
- c++ = 0.14seg
- numpythonic version = 1.39seg
- python loop version = 46.56seg
Probably results can get better if we use the intel version of numpy.

Here is a proposition that should now work (I corrected a few mistakes) but that nonetheless sould give you the general idea of how verctorization can be applied to your code in order to make efficient use of numpy arrays. Everything is build in "one-pass" (ie without any for-loops) which is the "numpythonic" way:
import numpy as np
import math
N=2
k,alpha=1,1
G = np.zeros((N,3,N,3),dtype=complex)
# np.mgrid gives convenient arrays of indices that
# can be used to write readable code
i,x_i,j,x_j = np.ogrid[0:N,0:3,0:N,0:3]
# A quick demo on how we can make the identity tensor with it
G[np.where((i == j) & (x_i == x_j))] = 1
#print(G.reshape(N*3,N*3))
positions=np.random.rand(N,3)
# Here I assumed position has shape [N,3]
# I build arr[i,j]=position[i] - position[j] using broadcasting
# by turning position into a column and a row
R = np.linalg.norm(positions[None,:,:]-positions[:,None,:],axis=-1)
# R is now a N,N matrix of all the distances
#we reshape R to N,1,N,1 so that it can be broadcated to N,3,N,3
R=R.reshape(N,1,N,1)
r=positions[None,:,:]-positions[:,None,:]
krq = (k*R)**2
pf = -k**2*alpha*np.exp(1j*k*R)/(4*math.pi*R)
a = 1.+(1j*k*R-1.)/(krq)
b = (3.-3.*1j*k*R-krq)/(krq)
#print(np.isnan(pf[:,0,:,0]))
# here we build all the combination rx*rx rx*ry etc...
comb_r=(r[:,:,:,None]*r[:,:,None,:]).transpose([0,2,1,3])
#we compute G without the pf*A term
G = pf*(b * comb_r/(R**2))
#we add pf*a term where it is due
G[np.where(x_i == x_j)] = (G + pf*a)[np.where(x_i == x_j)]
# we didn't bother with the identity or condition i!=j so we enforce it here
G[np.where(i == j)] = 0
G[np.where((i == j) & (x_i == x_j))] = 1
print(G.reshape(N*3,N*3))

Python is not a fast language. Number crunching with python should always use for time critical parts code written in a compiled language. With compilation down to the CPU level you can speed up the code by a factor up to 100 and then still go for parallelization. So I would not look down to using more cores doing inefficient stuff, but to work more efficient. I see the following ways to speed up the code:
1) Better use of numpy: Can you do your calculations instead on scalar level directly on vector/matrix level? eg. rx = positions[:,0]-positions[0,:] (not checked if that is correct) but something along those lines.
If that is not possible with your kind of calculations, than you can go for option 2 or 3
2) Use cython. Cython compiles Python code to C, which is then compiled to your CPU. By using static typing at the right places you can make your code much faster, see cython tutorials eg.: http://cython.readthedocs.io/en/latest/src/quickstart/cythonize.html
3) If you are familiar with FORTRAN, it might be a good idea to write just this part in FORTRAN and then call it from Python using f2py. In fact, your code looks a lot like FORTRAN anyway. For C and C++ SWIG is one great tool to make compiled code available in Python, but there are plenty of other techniques (cython, Boost::Python, ctypes, numba etc.)
When you have done this, and it is still to slow, using GPU power with pyCUDA or parallelization with mpi4py or multiprocessing might be an option.

Writing a while loop for error function erf(x)?

I understand there is a erf (Wikipedia) function in python. But in this assignment we are specifically asked to write the error function as if it was not already implemented in python while using a while loop.
erf (x) is simplified already as : (2/ (sqrt(pi)) (x - x^3/3 + x^5/10 - x^7/42...)
Terms in the series must be added until the absolute total is less than 10^-20.

First of all - SO is not the place when people code for you, here people help you to solve particular problem not the whole task
Any way:
It's not too hard to implement wikipedia algorithm:
import math
def erf(x):
n = 1
res = 0
res1 = (2 / math.sqrt(math.pi)) * x
diff = 1
s = x
while diff > math.pow(10, -20):
dividend = math.pow((-1), n) * math.pow(x, 2 * n + 1)
divider = math.factorial(n) * (2 * n + 1)
s += dividend / divider
res = ((2 / math.sqrt(math.pi)) * s)
diff = abs(res1 - res)
res1 = res
n += 1
return res
print(erf(1))
Please read the source code carefully and post all questions that you don't understand.
Also you may check python sources and see how erf is implemented

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to speed up this numpy.arange loop? - python

I would suggest to use range as it is approximately 2 times faster: def python(): for i in range(100000): pass def numpy(): for i in np.arange(100000): pass from timeit import timeit print(timeit(python, number=1000)) print(timeit(numpy, number=1000)) Output: 5.59282787179696 10.027646953771665

Related

How to use sympy solve in for loop

Fastest way to add/multiply two floating point scalar numbers in python

Series Expansion of cos with Python

parallelize (not symmetric) loops in python

Writing a while loop for error function erf(x)?

Categories

Resources