Fastest way to add/multiply two floating point scalar numbers in Python

I'm using Python, and apparently the slowest part of my program is doing simple additions on float variables: it takes about 35 seconds to do around 400,000,000 additions/multiplications. I'm trying to figure out the fastest way to do this math.
This is what the structure of my code looks like.
Example (dummy) code:
def func(x, y, z):
    loop_count = 30
    a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...]                 # 35 elements
    b = [0, 11, 22, 33, 44, 55, 66, 77, 88, 99, 1010, 1111, 1212, ...]  # 35 elements
    p = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]                    # 35 elements
    for i in range(loop_count - 1):
        c = p[i-1]
        d = a[i] + c * a[i+1]
        e = min(2, a[i]) + c * b[i]
        f = e * x
        g = y + d * c
        # ... and so on
        p[i] = d + e + f + s + g5 + f4 + h7 * t5 + y8
    return sum(p)
func() is called about 200k times, loop_count is about 30, and each call has roughly 20 multiplications, 45 additions, and 10 uses of min/max.
I was wondering whether there is a way for me to declare all these as ctypes.c_float and do the arithmetic in C using stdlib or something similar?
Note that the p[i] calculated at the end of the loop is used as c in the next loop iteration. For iteration 0, it just uses p[-1] which is 0 in this case.
My constraints:
I need to use Python. While I understand plain math would be faster in C/Java/etc., I cannot use them due to a bunch of other things I do in Python which cannot be done in C in this same program.
I tried writing this with cython, but it caused a bunch of issues with the environment I need to run this in. So, again - not an option.

I think you should consider using numpy: it keeps you in Python while pushing the heavy arithmetic into compiled code.
Here is an example case of a simple dot product (x·y):
# (Python 2 syntax, as in the original answer)
import datetime
import numpy as np

x = range(0, 10000000, 1)
y = range(0, 20000000, 2)
for i in range(0, len(x)):
    x[i] = x[i] * 0.00001
    y[i] = y[i] * 0.00001

now = datetime.datetime.now()
z = 0
for i in range(0, len(x)):
    z = z + x[i] * y[i]
print "handmade dot=", datetime.datetime.now() - now
print z

x = np.arange(0.0, 10000000.0 * 0.00001, 0.00001)
y = np.arange(0.0, 10000000.0 * 0.00002, 0.00002)
now = datetime.datetime.now()
z = np.dot(x, y)
print 'numpy dot =', datetime.datetime.now() - now
print z
outputs
handmade dot= 0:00:02.559000
66666656666.7
numpy dot = 0:00:00.019000
66666656666.7
numpy is more than 100× faster here.
The reason is that numpy wraps a compiled C routine that performs the dot product, whereas in pure Python you have a list of potentially generic objects, with type checks and casting on every operation.
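To apply this to your problem: the recurrence in your loop (p[i] feeds the next iteration as c) prevents vectorizing the 30 iterations themselves, but you can vectorize across the ~200k calls instead, making x, y, z arrays that hold the inputs of every call and running the loop once on whole vectors. A minimal sketch under that assumption (func_batched is a hypothetical name, and the update line is a simplified stand-in for your real arithmetic):

import numpy as np

def func_batched(x, y, z, a, b, loop_count=30):
    # x, y, z: arrays of shape (n_calls,); a, b: the 35-element coefficient lists
    n = len(x)
    p = np.zeros((loop_count, n))
    c = np.zeros(n)                       # plays the role of p[i-1], initially 0
    for i in range(loop_count - 1):
        d = a[i] + c * a[i+1]             # scalar a[i] broadcasts over all calls
        e = np.minimum(2, a[i]) + c * b[i]
        f = e * x
        g = y + d * c
        p[i] = d + e + f + g              # simplified stand-in for the real update
        c = p[i]
    return p.sum(axis=0)                  # one result per original call

Each numpy operation then processes all ~200k floats in one C loop instead of being repeated per call.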

Related

Vectorizing a for loop in Python that includes a cumsum() and iterative slicing

I have the following for loop that operates over three numpy arrays of the same length:
import numpy as np

n = 100
a = np.random.random(n)
b = np.random.random(n)
c = np.random.random(n)

valid = np.empty(n)
for i in range(n):
    valid[i] = np.any(a[i] > b[i:] + c[i:].cumsum())
Is there a way to replace this for loop with some vectorized numpy operations?
For example, because I only care if a[i] is larger than any value in b[i:], I can do np.minimum.accumulate(b[::-1])[::-1] which gets the smallest value of b at every index and onwards, and then compare it to a like this:
a > np.minimum.accumulate(b[::-1])[::-1]
but I still would need a way to vectorize the c[i:].cumsum() into a single array calculation.
Your goal is to find the minimum of b[i:] + c[i:].cumsum() for each i. Clearly you can compare that to a directly.
You can write the elements of c[i:].cumsum() as the upper triangle of a matrix. Let's look at a toy case with n = 3:
c = [c1, c2, c3]
s1 = c.cumsum()
s0 = np.r_[0, s1[:-1]]
You can write the elements of the cumulative sum as
c1,  c1 + c2,  c1 + c2 + c3      s1[0:]                  s1[0:] - s0[0]
     c2,       c2 + c3        =  s1[1:] - c1          =  s1[1:] - s0[1]
               c3                s1[2:] - (c1 + c2)      s1[2:] - s0[2]
You can use np.triu_indices to construct these sums as a raveled array (the column index is named col here to avoid shadowing the array c):
r, col = np.triu_indices(n)
diff = s1[col] - s0[r] + b[col]
Since np.minimum is a ufunc, you can accumulate diff for each run defined by r using minimum.reduceat. The locations are given roughly by np.flatnonzero(np.diff(r)) + 1, but you can generate them faster with np.arange:
m = np.minimum.reduceat(diff, np.r_[0, np.arange(n, 1, -1).cumsum()])
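To see what reduceat is doing, here is a toy check for n = 3 (my example, not from the original answer): the start indices come out as [0, 3, 5], and each output is the minimum of one run:

import numpy as np

d = np.array([5., 2., 7., 1., 4., 3.])            # runs of length 3, 2, 1
starts = np.r_[0, np.arange(3, 1, -1).cumsum()]   # -> [0, 3, 5]
print(np.minimum.reduceat(d, starts))             # -> [2. 1. 3.]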
So finally, you have:
valid = a > m
TL;DR
s1 = c.cumsum()
s0 = np.r_[0, s1[:-1]]
r, col = np.triu_indices(n)
valid = a > np.minimum.reduceat(s1[col] - s0[r] + b[col], np.r_[0, np.arange(n, 1, -1).cumsum()])
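A quick end-to-end check of the vectorized version against the original loop (my sketch; the two versions sum c in different orders, so they agree up to floating-point rounding, and exact ties are vanishingly unlikely with random data):

import numpy as np

n = 100
rng = np.random.default_rng(0)
a, b, c = rng.random(n), rng.random(n), rng.random(n)

valid_loop = np.empty(n, dtype=bool)
for i in range(n):
    valid_loop[i] = np.any(a[i] > b[i:] + c[i:].cumsum())

s1 = c.cumsum()
s0 = np.r_[0, s1[:-1]]
r, col = np.triu_indices(n)
m = np.minimum.reduceat(s1[col] - s0[r] + b[col],
                        np.r_[0, np.arange(n, 1, -1).cumsum()])
valid_vec = a > m

assert np.array_equal(valid_loop, valid_vec)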
I assume you want to vectorize it to decrease the running time. Since you are only using pure NumPy operations, you can use Numba: see the 5 Minutes Guide to Numba.
It will look something like this:
import numba
import numpy as np

@numba.njit()
def valid_for_single_idx(idx, a, b, c):
    return np.any(a[idx] > b[idx:] + c[idx:].cumsum())

valid = np.empty(n)
for i in range(n):
    valid[i] = valid_for_single_idx(i, a, b, c)
So far this isn't really vectorization (the loop still happens), but it translates the NumPy line into LLVM-compiled machine code, so each call runs about as fast as possible.
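Note that njit compiles the function on its first call, so warm it up before timing (a sketch, assuming a, b, c, n, and valid are defined as above):

import time

valid_for_single_idx(0, a, b, c)      # first call triggers JIT compilation
start = time.perf_counter()
for i in range(n):
    valid[i] = valid_for_single_idx(i, a, b, c)
print(time.perf_counter() - start)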
Although it doesn't increase the speed, you can use map if you find it a bit nicer to read:
import numba
import numpy as np
from functools import partial

@numba.njit()
def valid_for_single_idx(idx, a, b, c):
    return np.any(a[idx] > b[idx:] + c[idx:].cumsum())

valid = list(map(partial(valid_for_single_idx, a=a, b=b, c=c), range(n)))

How to speed up this numpy.arange loop?

In a Python program, the following function is called about 20,000 times from another function that is called about 1,000 times from yet another function that executes 30 times. Thus the total number of times this particular function is called is about 600,000,000. In Python it takes more than two hours (perhaps much longer; I aborted the program without waiting for it to finish), while essentially the same task coded in Java takes less than 5 minutes. If I change the 20,000 above to 400 (keeping everything else in the rest of the program untouched), the total time drops to about 4 minutes, which means this particular function is the culprit. What can I do to speed up the Python version, or is it just not possible?

No lists are manipulated inside this function (there are lists elsewhere in the whole program, but in those places I tried to use numpy arrays as far as possible). I understand that replacing Python lists with numpy arrays speeds things up, but there are cases in my program (not in this particular function) where I must build a list iteratively, using append; and those must-have lists are lists of objects (not floats or ints), so numpy would be of little help even if I converted those lists of objects to numpy arrays.
def compute_something(arr):
    '''
    arr is received as a numpy array of ints and floats (I think python upcasts them
    all to floats, doesn't it?).

    Inside this function, elements of arr are accessed using indexing (arr[0], arr[1],
    etc.), because each element of the array has its own unique use. It's not that I
    need the array as a whole (as in arr**2 or sum(arr)).

    The arr elements are used in several simple arithmetic operations involving
    nothing costlier than +, -, *, /, and numpy.log(). There is no other loop inside
    this function; there are a few if's though.

    Use is made of constants imported from other modules (I doubt the importing, as
    in AnotherModule.x, is expensive).
    '''
    for x in numpy.arange(float1, float2, float3):
        ...  # do stuff
    return a, b, c  # return a tuple of three floats
Edit:
Thanks for all the comments. Here's the inside of the function (I made the variable names short for convenience). The ndarray arr has only 3 elements in it. Can you suggest any improvement?
def compute_something(arr):
    a = Mod.b * arr[1] * arr[2] + Mod.c
    max = 0.0
    for c in np.arange(a, arr[1] * arr[2] * (Mod.d - Mod.e), Mod.f):
        i = c / arr[2]
        m1 = Mod.A * np.log((i / (arr[1] * Mod.d)) + (Mod.d / Mod.e))
        m2 = -Mod.B * np.log(1.0 - (i / (arr[1] * Mod.d)) - (Mod.d / Mod.e))
        V = arr[0] * (Mod.E - Mod.r * i / arr[1] - Mod.r * Mod.d - m1 - m2)
        p = c * V / 1000.0
        if p > max:
            max = p
            vmp = V
    pen = Mod.COEFF1 * (Mod.COEFF2 - max) if max < Mod.CONST else 0.0
    wo = (Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) +
          Mod.COEFF6 * arr[2])
    w = wo + pen
    return vmp, max, w
Python supports profiling (module cProfile), and the line_profiler tool can find the most expensive individual lines, so you do not need to guess which part of the code is slowest.
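For example, a minimal cProfile session could look like this (a sketch; compute_something and arr stand for your own function and data, which must be visible at module level):

import cProfile
import pstats

cProfile.run('compute_something(arr)', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)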
In the code you presented, the problem is the for loop, which causes many conversions between object types. If you use numpy you can vectorize the calculation.
I tried to rewrite your code to vectorize the operations. You did not say what the Mod object is, but I hope this works:
def compute_something(arr):
    a = Mod.b * arr[1] * arr[2] + Mod.c
    # start the calculation on vectors instead of a for loop
    c_arr = np.arange(a, arr[1] * arr[2] * (Mod.d - Mod.e), Mod.f)
    i_arr = c_arr / arr[2]
    m1_arr = Mod.A * np.log((i_arr / (arr[1] * Mod.d)) + (Mod.d / Mod.e))
    m2_arr = -Mod.B * np.log(1.0 - (i_arr / (arr[1] * Mod.d)) - (Mod.d / Mod.e))
    V_arr = arr[0] * (Mod.E - Mod.r * i_arr / arr[1] - Mod.r * Mod.d - m1_arr - m2_arr)
    p = c_arr * V_arr / 1000.0

    max_val = p.max()  # renamed to avoid shadowing the builtin max
    max_ind = np.nonzero(p == max_val)[0][0]
    vmp = V_arr[max_ind]

    pen = Mod.COEFF1 * (Mod.COEFF2 - max_val) if max_val < Mod.CONST else 0.0
    wo = (Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) +
          Mod.COEFF6 * arr[2])
    w = wo + pen
    return vmp, max_val, w
I would suggest using range, as it is approximately twice as fast:
import numpy as np
from timeit import timeit

def python():
    for i in range(100000):
        pass

def numpy():
    for i in np.arange(100000):
        pass

print(timeit(python, number=1000))
print(timeit(numpy, number=1000))
Output:
5.59282787179696
10.027646953771665
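The slowdown comes from the fact that iterating over a NumPy array yields a freshly boxed NumPy scalar object for every element, which is more expensive to create and use than the plain ints that range produces. A quick way to see the difference (my illustration):

import numpy as np

for v in np.arange(2):
    print(type(v))   # e.g. <class 'numpy.int64'> -- a fresh NumPy scalar per element

for v in range(2):
    print(type(v))   # <class 'int'>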

Solving system of nonlinear equations with Python, Euler's formula

I am working on solving and analyzing a system of differential equations in Python. First I solved it with the help of scipy.integrate's dopri5 and odeint, which worked out fine. Then I tried to solve the equations with Euler's method. The equations and code are as follows:
dj = -mu*(J**3 - (C - C0)*J - F)
dc = J + C*F + a*J**2
df = J*F - C
import numpy as np

T = 100
dt = 0.001
t = np.linspace(0, T, int(T/dt) + 1)

j = np.zeros(len(t))
c = np.zeros(len(t))
f = np.zeros(len(t))

# Initial conditions
j[0] = 0.1
c[0] = -0.5
f[0] = 0.1

a = 0.3025
C0 = 0.5
mu = 50

for i in range(len(t)):
    j[i] = j[i-1] + (-mu * (j[i-1]**3 - (c[i-1] - C0)*j[i-1] - f[i-1]))*dt
    c[i] = c[i-1] + (j[i-1] + c[i-1]*f[i-1] + (a*j[i-1])**2)*dt
    f[i] = f[i-1] + (j[i-1]*f[i-1] - c[i-1])*dt
Is there any reason why Euler's method should not work when both of the others do?
In the first iteration, i is 0, and your first line of the loop essentially is:
j[0] = j[-1] + (-mu * (j[-1]**3 - (c[-1] - C0)*j[-1] - f[-1]))*dt
j[-1] is the last element of j, just like c[-1] is the last element of c, etc. Initially they are all zeros, so j[0] becomes a 0, too, which overwrites the initial conditions. To fix this problem, change range(len(t)) to range(1,len(t)). (The model diverges after the first 9200 steps, anyway.)
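Concretely, the fixed loop would be (a sketch of that one-line change, keeping the original update bodies):

for i in range(1, len(t)):
    j[i] = j[i-1] + (-mu * (j[i-1]**3 - (c[i-1] - C0)*j[i-1] - f[i-1]))*dt
    c[i] = c[i-1] + (j[i-1] + c[i-1]*f[i-1] + (a*j[i-1])**2)*dt
    f[i] = f[i-1] + (j[i-1]*f[i-1] - c[i-1])*dt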
As DYZ says, your calculation is incorrect on the first loop iteration because j[-1] is the last element of j, which you've initialised to zero.
However, your code wastes a lot of RAM. I assume you just want arrays containing T results, plus the initial values, rather than the results calculated on every step. The code below achieves that via a double for loop. We aren't really getting any benefit from Numpy in this code, so I don't bother importing it.
Note that Euler integration is not very accurate, and you generally need to use a much smaller step size than what's required by more sophisticated integration algorithms. As DYZ mentions, with your current step size the calculation diverges before the loop finishes.
Here's a modified version of your code using a smaller step size.
T = 100
dt = 0.00001
steps = int(T / dt)
substeps = int(steps / T)
# Recalculate `dt` to compensate for possible truncation
# in the `steps` and `substeps` calculations
dt = 1.0 / substeps
print('steps, substeps, dt:', steps, substeps, dt)

a = 0.3025
C0 = 0.5
mu = 50

# dj = -mu*(J**3 - (C - C0)*J - F)
# dc = J + C*F + a*J**2
# df = J*F - C

# Initial conditions
j = 0.1
c = -0.5
f = 0.1

jlst, clst, flst = [j], [c], [f]
for i in range(T):
    for _ in range(substeps):
        j1 = j + (-mu * (j**3 - (c - C0)*j - f))*dt
        c1 = c + (j + c*f + (a*j)**2)*dt
        f1 = f + (j*f - c)*dt
        j, c, f = j1, c1, f1
    jlst.append(j)
    clst.append(c)
    flst.append(f)

def round_seq(seq, places=6):
    return [round(u, places) for u in seq]

print('j:', round_seq(jlst), end='\n\n')
print('c:', round_seq(clst), end='\n\n')
print('f:', round_seq(flst), end='\n\n')
output
steps, substeps, dt: 10000000 100000 1e-05
j: [0.1, 0.585459, 1.26718, 3.557956, -1.311867, -0.647698, -0.133683, 0.395812, 0.964856, 3.009683, -2.025674, -1.047722, -0.48872, 0.044296, 0.581284, 1.245423, 14.725407, -1.715456, -0.907364, -0.372118, 0.167733, 0.705257, 1.511711, -3.588555, -1.476817, -0.778593, -0.253874, 0.289294, 0.837128, 1.985792, -2.652462, -1.28088, -0.657113, -0.132971, 0.409071, 0.983504, 3.229393, -2.1809, -1.113977, -0.539586, -0.009829, 0.528546, 1.156086, 8.23469, -1.838582, -0.967078, -0.423261, 0.113883, 0.650319, 1.381138, 12.045565, -1.575015, -0.833861, -0.305952, 0.23632, 0.778052, 1.734888, -2.925769, -1.362437, -0.709641, -0.186249, 0.356775, 0.917051, 2.507782, -2.367126, -1.184147, -0.590753, -0.063942, 0.476121, 1.07614, 5.085211, -1.976542, -1.029395, -0.474206, 0.059772, 0.596505, 1.273214, 17.083466, -1.682855, -0.890842, -0.357555, 0.182944, 0.721096, 1.554496, -3.331861, -1.450497, -0.763182, -0.239007, 0.30425, 0.85435, 2.076595, -2.584081, -1.258788, -0.642362, -0.117774, 0.423883, 1.003181, 3.521072, -2.132709, -1.094792, -0.525123]
c: [-0.5, -0.302644, 0.847742, 12.886781, 0.177404, -0.423405, -0.569541, -0.521669, -0.130084, 7.97828, -0.109606, -0.363033, -0.538874, -0.61005, -0.506872, 0.05076, 216.678959, -0.198445, -0.408569, -0.566869, -0.603713, -0.451729, 0.58959, 2.252504, -0.246645, -0.451, -0.588697, -0.587898, -0.375758, 2.152898, -0.087229, -0.295185, -0.49006, -0.603411, -0.562389, -0.263696, 8.901196, -0.132332, -0.342969, -0.525087, -0.609991, -0.526417, -0.077251, 67.082608, -0.177771, -0.389092, -0.555341, -0.607658, -0.47794, 0.293664, 147.817033, -0.225425, -0.432796, -0.579951, -0.595996, -0.412269, 1.235928, -0.037058, -0.273963, -0.473412, -0.597912, -0.574782, -0.318837, 4.581828, -0.113301, -0.3222, -0.51029, -0.608168, -0.543547, -0.172371, 24.718184, -0.157526, -0.369151, -0.542732, -0.609811, -0.500922, 0.09504, 291.915024, -0.204371, -0.414, -0.56993, -0.602265, -0.443622, 0.700005, 0.740665, -0.25268, -0.456048, -0.590933, -0.585265, -0.36427, 2.528225, -0.093699, -0.301181, -0.494644, -0.60469, -0.558516, -0.245806, 10.941068, -0.137816, -0.348805, -0.52912]
f: [0.1, 0.68085, 1.615135, 1.01107, -2.660947, -0.859348, -0.134789, 0.476782, 1.520241, 4.892319, -9.514924, -2.041217, -0.61413, 0.060247, 0.792463, 2.510586, 11.393914, -6.222736, -1.559576, -0.438133, 0.200729, 1.033274, 3.348756, -39.664752, -4.304545, -1.201378, -0.282146, 0.349631, 1.331995, 4.609547, -20.169056, -3.104072, -0.923759, -0.138225, 0.513633, 1.716341, 6.739864, -11.717002, -2.307614, -0.699883, 7.4e-05, 0.700823, 2.22957, 11.017447, -7.434886, -1.751919, -0.512171, 0.138566, 0.922012, 2.9434, -30.549886, -5.028825, -1.346261, -0.348547, 0.282981, 1.19254, 3.987366, -26.554232, -3.566328, -1.0374, -0.200198, 0.439487, 1.535198, 5.645421, -14.674838, -2.619369, -0.792589, -0.060175, 0.615387, 1.985246, 8.779969, -8.991742, -1.972575, -0.590788, 0.077534, 0.820118, 2.599728, 8.879606, -5.928246, -1.509453, -0.417854, 0.218635, 1.066761, 3.477148, -36.053938, -4.124934, -1.163178, -0.263755, 0.369033, 1.37438, 4.811848, -18.741635, -2.987496, -0.893457, -0.120864, 0.535433, 1.771958, 7.117055, -11.027021, -2.227847, -0.674889]
That takes about 75 seconds on my old 2GHz machine.
Using dt = 0.000005 (which takes almost 2 minutes on this machine) the final values of j, c, and f are -0.524774, -0.529217, -0.674293, respectively, so it looks like we're beginning to get convergence.
Thanks to LutzL for pointing out that dt may need adjusting because of the rounding in the steps and substeps calculations.
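As a cross-check against the SciPy integrators the question mentions, something like this should reproduce those results (my sketch; it follows the question's code, including the (a*j)**2 term, with tight tolerances since mu = 50 makes the system somewhat stiff):

from scipy.integrate import solve_ivp

a, C0, mu = 0.3025, 0.5, 50

def rhs(t, u):
    j, c, f = u
    return [-mu * (j**3 - (c - C0)*j - f),
            j + c*f + (a*j)**2,
            j*f - c]

sol = solve_ivp(rhs, (0, 100), [0.1, -0.5, 0.1], rtol=1e-9, atol=1e-12)
print(sol.y[:, -1])   # compare with the Euler results above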

Writing a function for x * sin(3/x) in python

I have to write a function, s(x) = x * sin(3/x) in python that is capable of taking single values or vectors/arrays, but I'm having a little trouble handling the cases when x is zero (or has an element that's zero). This is what I have so far:
from numpy import zeros, size, sin

def s(x):
    result = zeros(size(x))
    for a in range(0, size(x)):
        if x[a] == 0:
            result[a] = 0
        else:
            result[a] = float(x[a] * sin(3.0 / x[a]))
    return result
Which...doesn't work for x = 0. And it's kinda messy. Even worse, I'm unable to use sympy's integrate function on it, or use it in my own simpson/trapezoidal rule code. Any ideas?
When I use integrate() on this function, I get the following error message: "Symbol" object does not support indexing.
This takes about 30 seconds per integrate call:
import sympy as sp
x = sp.Symbol('x')
int2 = sp.integrate(x*sp.sin(3./x),(x,0.000001,2)).evalf(8)
print int2
int1 = sp.integrate(x*sp.sin(3./x),(x,0,2)).evalf(8)
print int1
The results are:
1.0996940
-4.5*Si(zoo) + 8.1682775
Clearly you want to start the integration from a small positive number to avoid the problem at x = 0.
You can also assign x*sin(3./x) to a variable, e.g.:
s = x*sin(3./x)
int1 = sp.integrate(s, (x, 0.00001, 2))
My original answer using scipy to compute the integral:
import scipy.integrate
import math

def s(x):
    if abs(x) < 0.00001:
        return 0
    else:
        return x * math.sin(3.0/x)

s_exact = scipy.integrate.quad(s, 0, 2)
print s_exact
See the scipy docs for more integration options.
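The question also asked for a version that accepts both scalars and arrays; here is a minimal NumPy sketch of that part (my addition, not from the original answers, using np.where with a guarded divisor so the masked branch never divides by zero):

import numpy as np

def s(x):
    x = np.asarray(x, dtype=float)
    safe = np.where(x == 0.0, 1.0, x)   # dummy divisor where x == 0
    return np.where(x == 0.0, 0.0, x * np.sin(3.0 / safe))

print(s(0.0))                    # 0.0
print(s([0.0, 0.5, 1.0, 2.0]))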
If you want to use SymPy's integrate, you need a symbolic function. A wrong value at a point doesn't really matter for integration (at least mathematically), so you shouldn't worry about it.
It seems there is a bug in SymPy that gives an answer in terms of zoo at 0, because it isn't using limit correctly. You'll need to compute the limits manually. For example, the integral from 0 to 1:
In [14]: res = integrate(x*sin(3/x), x)
In [15]: ans = limit(res, x, 1) - limit(res, x, 0)
In [16]: ans
Out[16]: -9*pi/4 + 3*cos(3)/2 + sin(3)/2 + 9*Si(3)/2
In [17]: ans.evalf()
Out[17]: -0.164075835450162

How do you make this code more pythonic?

Could you guys please tell me how I can make the following code more pythonic?
The code is correct. Full disclosure: it's problem 1b in Handout #4 of this machine learning course. I'm supposed to use Newton's algorithm on the two data sets for fitting a logistic hypothesis, but they use Matlab and I'm using scipy.
E.g., one question I have: the matrices kept rounding to integers until I initialized one value to 0.0. Is there a better way?
Thanks
import os.path
import math
from numpy import matrix
from scipy.linalg import inv  #, det, eig

x = matrix('0.0;0;1')
y = 11
grad = matrix('0.0;0;0')
hess = matrix('0.0,0,0;0,0,0;0,0,0')
theta = matrix('0.0;0;0')

# run until convergence = 6 or 7 iterations
for i in range(1, 6):
    # reset
    grad = matrix('0.0;0;0')
    hess = matrix('0.0,0,0;0,0,0;0,0,0')
    xfile = open("q1x.dat", "r")
    yfile = open("q1y.dat", "r")
    # over the whole set = 99 items
    for i in range(1, 100):
        xline = xfile.readline()
        s = xline.split(" ")
        x[0] = float(s[1])
        x[1] = float(s[2])
        y = float(yfile.readline())
        hypoth = 1 / (1 + math.exp(-(theta.transpose() * x)))
        for j in range(0, 3):
            grad[j] = grad[j] + (y - hypoth) * x[j]
            for k in range(0, 3):
                hess[j,k] = hess[j,k] - (hypoth * (1 - hypoth) * x[j] * x[k])
    theta = theta - inv(hess) * grad  # update theta after construction
    xfile.close()
    yfile.close()

print "done"
print theta
One obvious change is to get rid of the "for i in range(1, 100):" and just iterate over the file lines. To iterate over both files (xfile and yfile), zip them, i.e. replace that block with something like:
import itertools

for xline, yline in itertools.izip(xfile, yfile):
    s = xline.split(" ")
    x[0] = float(s[1])
    x[1] = float(s[2])
    y = float(yline)
    ...
(This assumes the file is 100 lines, i.e. you want the whole file.) If you're deliberately restricting to the first 100 lines, you could use something like:
for i, xline, yline in itertools.izip(range(100), xfile, yfile):
However, it's also inefficient to iterate over the same file six times; better to load it into memory in advance and loop over it there. Outside your loop, have:
xfile = open("q1x.dat", "r")
yfile = open("q1y.dat", "r")
data = zip([line.split(" ")[1:3] for line in xfile], map(float, yfile))
And inside just:
for (x1, x2), y in data:
    x[0] = x1
    x[1] = x2
    ...
import math
from numpy import matrix, zeros
from scipy.linalg import inv

x = matrix([[0.], [0], [1]])
theta = matrix(zeros([3, 1]))
for i in range(5):
    grad = matrix(zeros([3, 1]))
    hess = matrix(zeros([3, 3]))
    [xfile, yfile] = [open('q1' + a + '.dat', 'r') for a in 'xy']
    for xline, yline in zip(xfile, yfile):
        x.transpose()[0, :2] = [map(float, xline.split(" ")[1:3])]
        y = float(yline)
        hypoth = 1 / (1 + math.exp(-(theta.transpose() * x)))  # minus sign matches the question's hypothesis
        grad += (y - hypoth) * x
        hess -= hypoth * (1 - hypoth) * x * x.transpose()
    theta -= inv(hess) * grad  # same update as the original: theta <- theta - hess^-1 * grad
print "done"
print theta
the matrices kept rounding to integers until I initialized one value to 0.0. Is there a better way?
At the top of your code:
from __future__ import division
In Python 2.6 and earlier, dividing two integers always returns an integer unless at least one operand is a float. In Python 3.0 (and with future division enabled in 2.6), division works more like how we humans might expect it to.
If you want integer division to return an integer after importing future division, use a double slash, //. That is:
from __future__ import division
print 1//2 # prints 0
print 5//2 # prints 2
print 1/2 # prints 0.5
print 5/2 # prints 2.5
You could make use of the with statement, for example:
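A sketch of the file-handling portion (Python 2.7+ syntax for multiple context managers; the loop body is abbreviated):

import itertools

with open("q1x.dat") as xfile, open("q1y.dat") as yfile:
    for xline, yline in itertools.izip(xfile, yfile):
        s = xline.split(" ")
        # ... same body as in the earlier snippet ...

Both files are then closed automatically, even if an exception is raised inside the loop.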
The code that reads the files into lists could be drastically simpler:
for line in open("q1x.dat", "r"):
    x = map(float, line.split(" ")[1:])
y = map(float, open("q1y.dat", "r").readlines())
