In a Python program, the following function is called about 20,000 times from another function that is called about 1,000 times from yet another function that executes 30 times, so this particular function is called about 600,000,000 times in total. In Python the program takes more than two hours (perhaps much longer; I aborted it without waiting for it to finish), while essentially the same task coded in Java takes less than 5 minutes. If I change the 20,000 above to 400 (keeping everything else untouched), the total time drops to about 4 minutes, which tells me this particular function is the culprit. What can I do to speed up the Python version, or is it just not possible?
No lists are manipulated inside this function (there are lists elsewhere in the program, but in those places I tried to use numpy arrays as far as possible). I understand that replacing Python lists with numpy arrays speeds things up, but there are cases in my program (not in this particular function) where I must build a list iteratively, using append; and those must-have lists are lists of objects (not floats or ints), so numpy would be of little help even if I converted them to numpy arrays.
def compute_something(arr):
    '''
    arr is received as a numpy array of ints and floats (I think numpy
    upcasts them all to floats, doesn't it?).

    Inside this function, elements of arr are accessed using indexing
    (arr[0], arr[1], etc.), because each element of the array has its own
    unique use. It's not that I need the array as a whole (as in arr**2
    or sum(arr)).

    The arr elements are used in several simple arithmetic operations
    involving nothing costlier than +, -, *, /, and numpy.log(). There is
    no other loop inside this function; there are a few if's though.

    Inside this function, use is made of constants imported from other
    modules (I doubt that the importing, as in AnotherModule.x, is
    expensive).
    '''
    for x in numpy.arange(float1, float2, float3):
        do stuff
    return a, b, c  # return a tuple of three floats
Edit:
Thanks for all the comments. Here's the inside of the function (I shortened the variable names for convenience). The ndarray arr has only 3 elements in it. Can you suggest any improvements?
def compute_something(arr):
    a = Mod.b * arr[1] * arr[2] + Mod.c
    max = 0.0
    for c in np.arange(a, arr[1] * arr[2] * (Mod.d - Mod.e), Mod.f):
        i = c / arr[2]
        m1 = Mod.A * np.log((i / (arr[1] * Mod.d)) + (Mod.d / Mod.e))
        m2 = -Mod.B * np.log(1.0 - (i / (arr[1] * Mod.d)) - (Mod.d / Mod.e))
        V = arr[0] * (Mod.E - Mod.r * i / arr[1] - Mod.r * Mod.d - m1 - m2)
        p = c * V / 1000.0
        if p > max:
            max = p
            vmp = V
    pen = Mod.COEFF1 * (Mod.COEFF2 - max) if max < Mod.CONST else 0.0
    wo = (Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) +
          Mod.COEFF6 * arr[2])
    w = wo + pen
    return vmp, max, w
Python supports profiling of code (module cProfile). There is also the line_profiler tool for finding the most expensive individual lines. So you do not need to guess which part of the code is the most expensive.
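For example, a minimal sketch (main() and the file names here are placeholders for your own entry point and paths):

import cProfile
import pstats

# Run the program under the profiler and save the statistics to a file.
cProfile.run('main()', 'profile.out')

# Print the ten functions with the largest cumulative time.
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)

The same can be done without modifying the code by running: python -m cProfile -s cumulative myscript.py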
In the code you present, the problem is the for loop, which generates many conversions between object types. If you use numpy you can vectorize the calculation.
I have tried to rewrite your code to vectorize the operations. You do not provide information about what the Mod object is, but I hope it will work.
def compute_something(arr):
    a = Mod.b * arr[1] * arr[2] + Mod.c
    # start the calculation on vectors instead of a for loop
    c_arr = np.arange(a, arr[1] * arr[2] * (Mod.d - Mod.e), Mod.f)
    i_arr = c_arr / arr[2]
    m1_arr = Mod.A * np.log((i_arr / (arr[1] * Mod.d)) + (Mod.d / Mod.e))
    m2_arr = -Mod.B * np.log(1.0 - (i_arr / (arr[1] * Mod.d)) - (Mod.d / Mod.e))
    V_arr = arr[0] * (Mod.E - Mod.r * i_arr / arr[1] - Mod.r * Mod.d -
                      m1_arr - m2_arr)
    p = c_arr * V_arr / 1000.0
    max_val = p.max()  # renamed to avoid conflict with the builtin max
    max_ind = np.nonzero(p == max_val)[0][0]
    vmp = V_arr[max_ind]
    pen = Mod.COEFF1 * (Mod.COEFF2 - max_val) if max_val < Mod.CONST else 0.0
    wo = (Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) +
          Mod.COEFF6 * arr[2])
    w = wo + pen
    return vmp, max_val, w
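As a small refinement (my addition, not part of the answer above): NumPy can locate the maximum in a single pass with argmax, so the two lines computing max_val and max_ind could be replaced by

max_ind = p.argmax()   # index of the largest element of p
max_val = p[max_ind]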
I would suggest using range instead, as it is approximately twice as fast:
import numpy as np
from timeit import timeit

def python():
    for i in range(100000):
        pass

def numpy():
    for i in np.arange(100000):
        pass

print(timeit(python, number=1000))
print(timeit(numpy, number=1000))
Output:
5.59282787179696
10.027646953771665
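If you do need the float values of an existing array inside a plain Python loop, converting the array to a list first is usually faster than iterating over the array directly, because the conversion to Python floats then happens in a single C call (a variation I am adding for comparison; exact timings vary by machine):

def numpy_tolist():
    for i in np.arange(100000).tolist():
        pass

print(timeit(numpy_tolist, number=1000))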
I'm using Python, and apparently the slowest part of my program is doing simple additions on float variables. It takes about 35 seconds to do around 400,000,000 additions/multiplications. I'm trying to figure out the fastest way I can do this math.
This is what the structure of my code looks like.
Example (dummy) code:
def func(x, y, z):
    loop_count = 30
    a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...]   # 35 elements
    b = [0, 11, 22, 33, 44, 55, 66, 77, 88, 99, 1010, 1111, 1212, ...]   # 35 elements
    p = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]   # 35 elements
    for i in range(loop_count - 1):
        c = p[i-1]
        d = a[i] + c * a[i+1]
        e = min(2, a[i]) + c * b[i]
        f = e * x
        g = y + d * c
        # ... and so on
        p[i] = d + e + f + s + g5 + f4 + h7 * t5 + y8
    return sum(p)
func() is called about 200k times, and loop_count is about 30. Per iteration I have ~20 multiplications, ~45 additions, and ~10 uses of min/max.
I was wondering if there is a method for me to declare all of these as ctypes.c_float and do the addition in C using stdlib or something similar.
Note that the p[i] calculated at the end of the loop is used as c in the next loop iteration. For iteration 0, it just uses p[-1], which is 0 in this case.
My constraints:
I need to use Python. While I understand plain math would be faster in C/Java/etc., I cannot use them due to a bunch of other things I do in Python that cannot be done in C in this same program.
I tried writing this with Cython, but it caused a bunch of issues with the environment I need to run it in. So, again, not an option.
I think you should consider using numpy; nothing in your constraints seems to rule it out.
Here is an example case of a simple dot product (x·y):
import datetime
import numpy as np

# Plain Python lists of floats
x = [i * 0.00001 for i in range(0, 10000000, 1)]
y = [i * 0.00001 for i in range(0, 20000000, 2)]

now = datetime.datetime.now()
z = 0
for i in range(0, len(x)):
    z = z + x[i] * y[i]
print("handmade dot=", datetime.datetime.now() - now)
print(z)

x = np.arange(0.0, 10000000.0 * 0.00001, 0.00001)
y = np.arange(0.0, 10000000.0 * 0.00002, 0.00002)
now = datetime.datetime.now()
z = np.dot(x, y)
print('numpy dot =', datetime.datetime.now() - now)
print(z)
outputs
handmade dot= 0:00:02.559000
66666656666.7
numpy dot = 0:00:00.019000
66666656666.7
numpy is more than 100 times faster.
The reason is that numpy wraps a C library that performs the dot operation in compiled code. In pure Python you have a list of potentially generic objects, so every operation involves type checks, casting, and boxed arithmetic.
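One rough way to see the per-element overhead (my illustration, not from the original answer):

import sys
import numpy as np

xs = [0.1] * 1000
arr = np.array(xs)
print(sys.getsizeof(0.1))   # ~24 bytes for a single boxed Python float
print(arr.itemsize)         # 8 bytes per float64 element of the array

On top of the boxed floats, the list itself stores an 8-byte pointer to each element, and every addition or multiplication on the boxed objects goes through dynamic type dispatch; the numpy loop runs directly over the raw 8-byte doubles.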
I am working on solving and analyzing a system of differential equations in Python. First I solved the system with the help of scipy.integrate's dopri5 and scipy's odeint, which worked out fine. Then I tried to solve the equations with Euler's method. The equations and code are as follows,
dj = -mu*(J**3 - (C - C0)*J - F)
dc = J + C*F + a*J**2
df = J*F - C
import numpy as np

T = 100
dt = 0.001
t = np.linspace(0, T, int(T/dt) + 1)

j = np.zeros(len(t))
c = np.zeros(len(t))
f = np.zeros(len(t))

# Initial condition
j[0] = 0.1
c[0] = -0.5
f[0] = 0.1

a = 0.3025
C0 = 0.5
mu = 50

for i in range(len(t)):
    j[i] = j[i-1] + (-mu * (j[i-1]**3 - (c[i-1] - C0)*j[i-1] - f[i-1]))*dt
    c[i] = c[i-1] + (j[i-1] + c[i-1] * f[i-1] + (a * j[i-1])**2)*dt
    f[i] = f[i-1] + (j[i-1] * f[i-1] - c[i-1])*dt
Is there any reason why Euler's method should not work when both of the other two do?
In the first iteration, i is 0, and your first line of the loop essentially is:
j[0] = j[-1] + (-mu * (j[-1]**3 - (c[-1] - C0)*j[-1] - f[-1]))*dt
j[-1] is the last element of j, just like c[-1] is the last element of c, etc. Initially they are all zeros, so j[0] becomes 0 too, which overwrites the initial conditions. To fix this problem, change range(len(t)) to range(1, len(t)). (The model diverges after the first 9200 steps anyway.)
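A quick standalone demonstration of that behaviour (my addition):

import numpy as np

j = np.zeros(4)
j[0] = 0.1          # the initial condition
print(j[-1])        # 0.0 -- negative indices count from the end of the array
j[0] = j[-1] + 1.0  # this overwrites the initial condition
print(j[0])         # 1.0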
As DYZ says, your calculation is incorrect on the first loop iteration because j[-1] is the last element of j, which you've initialised to zero.
However, your code wastes a lot of RAM. I assume you just want arrays containing T results, plus the initial values, rather than the results calculated on every step. The code below achieves that via a double for loop. We aren't really getting any benefit from Numpy in this code, so I don't bother importing it.
Note that Euler integration is not very accurate, and you generally need to use a much smaller step size than what's required by more sophisticated integration algorithms. As DYZ mentions, with your current step size the calculation diverges before the loop finishes.
Here's a modified version of your code using a smaller step size.
T = 100
dt = 0.00001
steps = int(T / dt)
substeps = int(steps / T)
# Recalculate `dt` to compensate for possible truncation
# in the `steps` and `substeps` calculations
dt = 1.0 / substeps
print('steps, substeps, dt:', steps, substeps, dt)

a = 0.3025
C0 = 0.5
mu = 50

#dj = -mu*(J**3 - (C - C0)*J - F)
#dc = J + C*F + a*J**2
#df = J*F - C

# Initial condition
j = 0.1
c = -0.5
f = 0.1

jlst, clst, flst = [j], [c], [f]
for i in range(T):
    for _ in range(substeps):
        j1 = j + (-mu * (j**3 - (c - C0)*j - f))*dt
        c1 = c + (j + c * f + (a * j)**2)*dt
        f1 = f + (j * f - c)*dt
        j, c, f = j1, c1, f1
    jlst.append(j)
    clst.append(c)
    flst.append(f)

def round_seq(seq, places=6):
    return [round(u, places) for u in seq]

print('j:', round_seq(jlst), end='\n\n')
print('c:', round_seq(clst), end='\n\n')
print('f:', round_seq(flst), end='\n\n')
output
steps, substeps, dt: 10000000 100000 1e-05
j: [0.1, 0.585459, 1.26718, 3.557956, -1.311867, -0.647698, -0.133683, 0.395812, 0.964856, 3.009683, -2.025674, -1.047722, -0.48872, 0.044296, 0.581284, 1.245423, 14.725407, -1.715456, -0.907364, -0.372118, 0.167733, 0.705257, 1.511711, -3.588555, -1.476817, -0.778593, -0.253874, 0.289294, 0.837128, 1.985792, -2.652462, -1.28088, -0.657113, -0.132971, 0.409071, 0.983504, 3.229393, -2.1809, -1.113977, -0.539586, -0.009829, 0.528546, 1.156086, 8.23469, -1.838582, -0.967078, -0.423261, 0.113883, 0.650319, 1.381138, 12.045565, -1.575015, -0.833861, -0.305952, 0.23632, 0.778052, 1.734888, -2.925769, -1.362437, -0.709641, -0.186249, 0.356775, 0.917051, 2.507782, -2.367126, -1.184147, -0.590753, -0.063942, 0.476121, 1.07614, 5.085211, -1.976542, -1.029395, -0.474206, 0.059772, 0.596505, 1.273214, 17.083466, -1.682855, -0.890842, -0.357555, 0.182944, 0.721096, 1.554496, -3.331861, -1.450497, -0.763182, -0.239007, 0.30425, 0.85435, 2.076595, -2.584081, -1.258788, -0.642362, -0.117774, 0.423883, 1.003181, 3.521072, -2.132709, -1.094792, -0.525123]
c: [-0.5, -0.302644, 0.847742, 12.886781, 0.177404, -0.423405, -0.569541, -0.521669, -0.130084, 7.97828, -0.109606, -0.363033, -0.538874, -0.61005, -0.506872, 0.05076, 216.678959, -0.198445, -0.408569, -0.566869, -0.603713, -0.451729, 0.58959, 2.252504, -0.246645, -0.451, -0.588697, -0.587898, -0.375758, 2.152898, -0.087229, -0.295185, -0.49006, -0.603411, -0.562389, -0.263696, 8.901196, -0.132332, -0.342969, -0.525087, -0.609991, -0.526417, -0.077251, 67.082608, -0.177771, -0.389092, -0.555341, -0.607658, -0.47794, 0.293664, 147.817033, -0.225425, -0.432796, -0.579951, -0.595996, -0.412269, 1.235928, -0.037058, -0.273963, -0.473412, -0.597912, -0.574782, -0.318837, 4.581828, -0.113301, -0.3222, -0.51029, -0.608168, -0.543547, -0.172371, 24.718184, -0.157526, -0.369151, -0.542732, -0.609811, -0.500922, 0.09504, 291.915024, -0.204371, -0.414, -0.56993, -0.602265, -0.443622, 0.700005, 0.740665, -0.25268, -0.456048, -0.590933, -0.585265, -0.36427, 2.528225, -0.093699, -0.301181, -0.494644, -0.60469, -0.558516, -0.245806, 10.941068, -0.137816, -0.348805, -0.52912]
f: [0.1, 0.68085, 1.615135, 1.01107, -2.660947, -0.859348, -0.134789, 0.476782, 1.520241, 4.892319, -9.514924, -2.041217, -0.61413, 0.060247, 0.792463, 2.510586, 11.393914, -6.222736, -1.559576, -0.438133, 0.200729, 1.033274, 3.348756, -39.664752, -4.304545, -1.201378, -0.282146, 0.349631, 1.331995, 4.609547, -20.169056, -3.104072, -0.923759, -0.138225, 0.513633, 1.716341, 6.739864, -11.717002, -2.307614, -0.699883, 7.4e-05, 0.700823, 2.22957, 11.017447, -7.434886, -1.751919, -0.512171, 0.138566, 0.922012, 2.9434, -30.549886, -5.028825, -1.346261, -0.348547, 0.282981, 1.19254, 3.987366, -26.554232, -3.566328, -1.0374, -0.200198, 0.439487, 1.535198, 5.645421, -14.674838, -2.619369, -0.792589, -0.060175, 0.615387, 1.985246, 8.779969, -8.991742, -1.972575, -0.590788, 0.077534, 0.820118, 2.599728, 8.879606, -5.928246, -1.509453, -0.417854, 0.218635, 1.066761, 3.477148, -36.053938, -4.124934, -1.163178, -0.263755, 0.369033, 1.37438, 4.811848, -18.741635, -2.987496, -0.893457, -0.120864, 0.535433, 1.771958, 7.117055, -11.027021, -2.227847, -0.674889]
That takes about 75 seconds on my old 2GHz machine.
Using dt = 0.000005 (which takes almost 2 minutes on this machine) the final values of j, c, and f are -0.524774, -0.529217, -0.674293, respectively, so it looks like we're beginning to get convergence.
Thanks to LutzL for pointing out that dt may need adjusting because of the rounding in the steps and substeps calculations.
I need to calculate the arcsine of small values that are in the form of mpmath's "mpf" floating-point bignums.
What I call a "small" value is for example e/4/(10**7) = 0.000000067957045711476130884...
Here is a result of a test on my machine with mpmath's built-in asin function:
import gmpy2
from mpmath import *
from time import time

mp.dps = 10**6
val = e/4/(10**7)
print("ready")

start = time()
temp = asin(val)
print("mpmath asin: " + str(time() - start) + " seconds")

This prints:

mpmath asin: 155.108999968 seconds
This is a particular case: I work with somewhat small numbers, so I'm asking myself whether there is a way to calculate it in Python that actually beats mpmath for this particular case (i.e. for small values).
Taylor series are actually a good choice here because they converge very fast for small arguments. But I still need to accelerate the calculations further somehow.
Actually there are some problems:
1) Binary splitting is ineffective here because it shines only when you can write the argument as a small fraction; here a full-precision float is given.
2) arcsin is a non-alternating series, thus Van Wijngaarden or sumalt transformations are ineffective too (unless there is a way I'm not aware of to generalize them to non-alternating series):
https://en.wikipedia.org/wiki/Van_Wijngaarden_transformation
The only remaining acceleration I can think of is Chebyshev polynomials. Can Chebyshev polynomials be applied to the arcsin function? How?
Can you use the mpfr type that is included in gmpy2?
>>> import gmpy2
>>> gmpy2.get_context().precision = 3100000
>>> val = gmpy2.exp(1)/4/10**7
>>> from time import time
>>> start = time(); r = gmpy2.asin(val); print(time() - start)
3.36188197136
In addition to supporting the GMP library, gmpy2 also supports the MPFR and MPC multiple-precision libraries.
Disclaimer: I maintain gmpy2.
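If the rest of your program has to stay in mpmath, one possible bridge (my sketch, round-tripping through decimal strings; simple, though probably not the fastest route) is:

import gmpy2
from mpmath import mp, mpf, nstr

mp.dps = 50                          # mpmath working precision, in digits
gmpy2.get_context().precision = 170  # roughly the same precision, in bits

x = mpf('0.000000067957045711476130884')  # an mpmath value
y = gmpy2.mpfr(nstr(x, 50))               # mpmath -> gmpy2 via a decimal string
r = gmpy2.asin(y)                         # fast asin from MPFR
back = mpf(str(r))                        # gmpy2 -> mpmath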
Actually, binary splitting does work very well if combined with iterated argument reduction to balance the number of terms against the size of the numerators and denominators (this is known as the bit-burst algorithm).
Here is a binary splitting implementation for mpmath based on repeated application of the formula atan(t) = atan(p/2^q) + atan((t*2^q-p) / (2^q+p*t)). This formula was suggested recently by Richard Brent (in fact mpmath's atan already uses a single invocation of this formula at low precision, in order to look up atan(p/2^q) from a cache). If I remember correctly, MPFR also uses the bit-burst algorithm to evaluate atan, but it uses a slightly different formula, which is possibly more efficient (instead of evaluating several different arctangent values, it does analytic continuation using the arctangent differential equation).
from mpmath import mp, mpf, sqrt
from mpmath.libmp import MPZ, bitcount

def bsplit(p, q, a, b):
    if b - a == 1:
        if a == 0:
            P = p
            Q = q
        else:
            P = p * p
            Q = q * 2
        B = MPZ(1 + 2 * a)
        if a % 2 == 1:
            B = -B
        T = P
        return P, Q, B, T
    else:
        m = a + (b - a) // 2
        P1, Q1, B1, T1 = bsplit(p, q, a, m)
        P2, Q2, B2, T2 = bsplit(p, q, m, b)
        T = ((T1 * B2) << Q2) + T2 * B1 * P1
        P = P1 * P2
        B = B1 * B2
        Q = Q1 + Q2
        return P, Q, B, T

def atan_bsplit(p, q, prec):
    """computes atan(p/2^q) as a fixed-point number"""
    if p == 0:
        return MPZ(0)
    # FIXME
    nterms = (-prec / (bitcount(p) - q) - 1) * 0.5
    nterms = int(nterms) + 1
    if nterms < 1:
        return MPZ(0)
    P, Q, B, T = bsplit(p, q, 0, nterms)
    if prec >= Q:
        return (T << (prec - Q)) // B
    else:
        return T // (B << (Q - prec))

def atan_fixed(x, prec):
    t = MPZ(x)
    s = MPZ(0)
    q = 1
    while t:
        q = min(q, prec)
        p = t >> (prec - q)
        if p:
            s += atan_bsplit(p, q, prec)
            u = (t << q) - (p << prec)
            v = (MPZ(1) << (q + prec)) + p * t
            t = (u << prec) // v
        q *= 2
    return s

def atan1(x):
    prec = mp.prec
    man = x.to_fixed(prec)
    return mp.mpf((atan_fixed(man, prec), -prec))

def asin1(x):
    x = mpf(x)
    return atan1(x/sqrt(1-x**2))
With this code, I get:
>>> from mpmath import *
>>> mp.dps = 1000000
>>> val = e/4/(10**7)
>>> from time import time
>>> start = time(); y1 = asin(val); print(time() - start)
58.8485069275
>>> start = time(); y2 = asin1(val); print(time() - start)
8.26498985291
>>> nprint(y2 - y1)
-2.31674e-1000000
Warning: atan1 assumes 0 <= x < 1/2, and the determination of the number of terms might not be optimal or correct (fixing these issues is left as an exercise to the reader).
A fast way is to use a pre-calculated look-up table.
But if you look at, e.g., a Taylor series for asin:
def asin(x):
    # Maclaurin series for asin, truncated after the x**19 term
    rv = (x + 1/6.0*x**3 + 3/40.0*x**5 + 15/336.0*x**7 + 105/3456.0*x**9 +
          945/42240.0*x**11 + 10395/599040.0*x**13 +
          135135/9676800.0*x**15 + 2027025/175472640.0*x**17 +
          34459425/3530096640.0*x**19)
    return rv
You'll see that for small values of x, asin(x) ≈ x.
In [19]: asin(1e-7)
Out[19]: 1.0000000000000017e-07
In [20]: asin(1e-9)
Out[20]: 1e-09
In [21]: asin(1e-11)
Out[21]: 1e-11
In [22]: asin(1e-12)
Out[22]: 1e-12
E.g. for the value you used:
In [23]: asin(0.000000067957045711476130884)
Out[23]: 6.795704571147618e-08
In [24]: asin(0.000000067957045711476130884)/0.000000067957045711476130884
Out[24]: 1.0000000000000009
Of course it depends on whether this difference is relevant to you.
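Since these are ordinary double-precision floats, you can sanity-check the truncated series against the standard library (my addition):

import math

# The C library routine agrees with the truncated series for small x;
# the last digit may differ by one unit depending on the platform's libm.
print(math.asin(1e-7))         # ~1.0000000000000017e-07
print(math.asin(1e-7) - 1e-7)  # the tiny x**3/6 correction term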