I have the following Python code:
import itertools
import numpy as np

H1 = [[0.04,0.03,0.01,0.002],[0.02,0.04,0.001,0.5]]
H2 = [[0.06,0.02,0.02,0.004],[0.8,0.09,0.6,0.1]]
D1 = [0.01,0.02,0.1,0.01]
D2 = [0.1,0.3,0.01,0.4]
Tp = np.sum(D1)
Tn = np.sum(D2)
T = []
append2 = T.append
E = []
append3 = E.append
for h1,h2 in itertools.izip(H1,H2):
    Err = []
    append1 = Err.append
    for v in h1:
        L1 = [1 if i>=v else 0 for i in h1]
        L2 = [1 if i>=v else 0 for i in h2]
        Sp = np.dot(D1,L1)
        Sn = np.dot(D2,L2)
        err = min(Sp+Tn-Sn, Sn+Tp-Sp)
        append1(err)
    b = np.argmin(Err)
    append2(h1[b])
    append3(Err[b])
This is just example code. I need to run the inner for loop about 20,000 times (here it runs just twice), but it takes so much time that it is impractical to use.
The line profiler shows that the lines Sp = np.dot(D1,L1), Sn = np.dot(D2,L2) and b = np.argmin(Err) are the most time consuming.
How can I reduce the time taken by the above code?
Any help will be much appreciated.
Thanks!
You can get a pretty big speed boost if you use numpy functions with numpy arrays instead of lists. Most numpy functions will convert lists to arrays internally and that adds a lot of overhead to the run time. Here is a simple example:
In [16]: a = range(10)
In [17]: b = range(10)
In [18]: aa = np.array(a)
In [19]: bb = np.array(b)
In [20]: %timeit np.dot(a, b)
10000 loops, best of 3: 54 us per loop
In [21]: %timeit np.dot(aa, bb)
100000 loops, best of 3: 3.4 us per loop
numpy.dot runs 16x faster when called with arrays in this case. Also, when you use numpy arrays you'll be able to simplify some of your code, which should also help it run faster. For example, if h1 is an array, L1 = [1 if i>=v else 0 for i in h1] can be written as h1 >= v, which returns an array and should also run faster. Below I've gone ahead and replaced your lists with arrays so you can see what it would look like.
import numpy as np

H1 = np.array([[0.04,0.03,0.01,0.002],[0.02,0.04,0.001,0.5]])
H2 = np.array([[0.06,0.02,0.02,0.004],[0.8,0.09,0.6,0.1]])
D1 = np.array([0.01,0.02,0.1,0.01])
D2 = np.array([0.1,0.3,0.01,0.4])
Tp = np.sum(D1)
Tn = np.sum(D2)
T = np.zeros(H1.shape[0])
E = np.zeros(H1.shape[0])
for i in range(len(H1)):
    h1 = H1[i]
    h2 = H2[i]
    Err = np.zeros(len(h1))
    for j in range(len(h1)):
        v = h1[j]
        L1 = h1 >= v
        L2 = h2 >= v
        Sp = np.dot(D1, L1)
        Sn = np.dot(D2, L2)
        err = min(Sp+Tn-Sn, Sn+Tp-Sp)
        Err[j] = err
    b = np.argmin(Err)
    T[i] = h1[b]
    E[i] = Err[b]
Once you're more comfortable with numpy arrays you might want to look into expressing at least your inner loop using broadcasting. For some applications, broadcasting can be much more efficient than Python loops.
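For example, the whole inner loop above can be collapsed into a couple of matrix operations. This is only a sketch of the idea: row j of each boolean comparison matrix corresponds to one threshold v = h1[j], so the two dot products and the argmin are computed for all thresholds at once.

import numpy as np

H1 = np.array([[0.04, 0.03, 0.01, 0.002], [0.02, 0.04, 0.001, 0.5]])
H2 = np.array([[0.06, 0.02, 0.02, 0.004], [0.8, 0.09, 0.6, 0.1]])
D1 = np.array([0.01, 0.02, 0.1, 0.01])
D2 = np.array([0.1, 0.3, 0.01, 0.4])
Tp = D1.sum()
Tn = D2.sum()

T = np.zeros(H1.shape[0])
E = np.zeros(H1.shape[0])
for i in range(len(H1)):
    h1, h2 = H1[i], H2[i]
    # Row j compares every element against the threshold v = h1[j],
    # so each dot product evaluates all thresholds in one shot.
    Sp = (h1[None, :] >= h1[:, None]).dot(D1)
    Sn = (h2[None, :] >= h1[:, None]).dot(D2)
    Err = np.minimum(Sp + Tn - Sn, Sn + Tp - Sp)
    b = np.argmin(Err)
    T[i] = h1[b]
    E[i] = Err[b]

Good luck, hope that helps.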
You need to keep the data in ndarray types. When you do a numpy operation on a list, numpy has to construct a new array each time. I modified your code to run a variable number of times and found it took ~1s for 10,000 iterations. Changing the data types to ndarrays reduced that by about a factor of two, and I think there is still some improvement to be made (the first version of this had a bug that made it execute too fast).
import itertools
import numpy as np

N = 10000
H1 = [np.array([0.04,0.03,0.01,0.002])] * N
H2 = [np.array([0.06,0.02,0.02,0.004])] * N
D1 = np.array([0.01,0.02,0.1,0.01])
D2 = np.array([0.1,0.3,0.01,0.4])
Tp = np.sum(D1)
Tn = np.sum(D2)
T = []
append2 = T.append
E = []
append3 = E.append
for h1,h2 in itertools.izip(H1,H2):
    Err = []
    append1 = Err.append
    for v in h1:
        #L1 = [1 if i>=v else 0 for i in h1]
        #L2 = [1 if i>=v else 0 for i in h2]
        L1 = h1 >= v
        L2 = h2 >= v
        Sp = np.dot(D1,L1)
        Sn = np.dot(D2,L2)
        err = min(Sp+Tn-Sn, Sn+Tp-Sp)
        append1(err)
    b = np.argmin(Err)
    append2(h1[b])
    append3(Err[b])
There's some low-hanging fruit in your list comprehensions:
L1 = [1 if i>=v else 0 for i in h1]
L2 = [1 if i>=v else 0 for i in h2]
The above could be written as:
L1 = [i>=v for i in h1]
L2 = [i>=v for i in h2]
Because Booleans are a subclass of integers, True and False are already 1 and 0, just wearing fancy clothes.
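You can see this in the interpreter:

>>> isinstance(True, int)
True
>>> True + True
2
>>> sum([True, False, True])
2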
err = min(Sp+Tn-Sn, Sn+Tp-Sp)
append1(err)
You could combine the above two lines to avoid the variable assignment and access.
If you put the code in a function, all local variable usage will be slightly faster. Also, any global functions or methods you use (e.g. min, np.dot) can be converted to locals in the function signature using default arguments. np.dot is an especially slow call to make (outside of how long the operation itself takes) because it involves an attribute lookup. This would be similar to the optimization you already make with the list append methods.
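A minimal sketch of what that might look like, assuming h1 and h2 are already numpy arrays (compute_err is a hypothetical wrapper around your inner loop, not part of your code):

import numpy as np

def compute_err(D1, D2, Tp, Tn, h1, h2, min=min, dot=np.dot):
    # min and np.dot are bound as local names via default arguments,
    # avoiding a global lookup (and an attribute lookup for np.dot)
    # on every iteration.
    Err = []
    append1 = Err.append
    for v in h1:
        Sp = dot(D1, h1 >= v)
        Sn = dot(D2, h2 >= v)
        append1(min(Sp + Tn - Sn, Sn + Tp - Sp))
    return Err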
Now I imagine none of this will really affect performance much, since your question really seems to be "how can I make NumPy faster?" (which others are on top of for you) but they might have some impact and be worth doing.
If I have correctly understood what np.dot() does on two one-dimensional lists, it seems to me that the following code should do the same as yours.
Could you test its speed, please?
Its principle is to work on indices instead of the elements of the lists, and to use the peculiarity of a list defined as a default value of a function (it is bound once, at definition time).
import numpy as np

H1 = [[0.04,0.03,0.01,0.002],[0.02,0.04,0.001,0.5]]
H2 = [[0.06,0.02,0.02,0.004],[0.8,0.09,0.6,0.1]]
D1 = [0.01,0.02,0.1,0.01]
D2 = [0.1,0.3,0.01,0.4]
Tp = np.sum(D1)
Tn = np.sum(D2)
T,E = [],[]
append2 = T.append
append3 = E.append
ONE,TWO = [],[]

def zoui(v, ONE=ONE, TWO=TWO,
         D1=D1, D2=D2, Tp=Tp, Tn=Tn, tu0123=(0,1,2,3)):
    diff = sum(D1[i] if ONE[i]>=v else 0 for i in tu0123)\
           - sum(D2[i] if TWO[i]>=v else 0 for i in tu0123)
    #or maybe
    #diff = sum(D1[i] * (ONE[i]>=v) for i in tu0123)\
    #       - sum(D2[i] * (TWO[i]>=v) for i in tu0123)
    return min(Tn+diff, Tp-diff)

for n in xrange(len(H1)):
    ONE[:] = H1[n]
    TWO[:] = H2[n]
    Err = map(zoui, ONE)
    b = np.argmin(Err)
    append2(ONE[b])
    append3(Err[b])
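For reference, a quick way to time it could look like this (only a sketch; it assumes the loop above has run at least once, so that ONE and TWO are populated):

import timeit

print(timeit.timeit(lambda: map(zoui, ONE), number=10000))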
Related
I want to replicate the following Matlab behaviour in Python. In Matlab, the code is:
A = [];
for ii = 0:9
    B = [ii, ii+1, ii^2];
    C = [ii+ii^2, ii-5];
    A = [A, B, C];
end
But in Python, with np.hstack or np.concatenate, the ndarrays must have the same number of dimensions, and if A is empty in the first loop the code raises an error, so I have to special-case the first iteration as follows:
for ii in range(10):
    B = np.array([ii, ii+1, ii**2])
    C = np.array([ii+ii**2, ii-5])
    if ii == 0:
        A = np.hstack([B, C])
    else:
        A = np.hstack([A, B, C])
That is my Python code; B and C change on every iteration (I'm not just repeating the same ndarray), so please don't close my question!
But I think it is a little clumsy and unreadable.
How can I rewrite it? (Ideally using only one line of code.)
Without knowing what the result should be, I think this is close:
import numpy as np

q = np.arange(10)
bs = np.vstack((q, q+1, q**2)).T      # rows of B for every ii
cs = np.vstack((q + q**2, q - 5)).T   # rows of C for every ii
a = np.hstack((bs, cs))
Or maybe:
a = np.hstack((bs,cs)).ravel()
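If you want a literal one-liner, the whole Matlab loop can also be collapsed into a single concatenate call (a sketch; it builds each [B, C] chunk in a list comprehension):

import numpy as np

# one [B, C] chunk per ii, flattened exactly like the Matlab A = [A, B, C]
A = np.concatenate([[ii, ii + 1, ii**2, ii + ii**2, ii - 5] for ii in range(10)])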
I have written the following code:
test.py
import numpy

nc = 1; nb = 20; ni = 6; nc = 2; ia = 20; ib = 20; ic = 0
U1 = numpy.array((1,2,0,0,0,3))
U2 = numpy.array((2,2,1,0,0,1))
U3 = numpy.array((2,1,1,0,0,2))
U4 = numpy.array((2,1,0,1,0,2))
U5 = numpy.array((2,1,1,1,0,3))
for n in range(ni):
    a = nc*(nb*nc*ia+nc*ib+ic)+U1[n]
    a2 = ia + U1[n]
    b2 = ib + U3[n]
    c2 = ic + U4[n]
    b = nc*(nb*nc*a2+nc*b2+c2)+U2[n]
    A = str(numpy.array((a,b,U5[n])))
    print(A)
    with open("test.txt", 'w') as out:
        for o in A:
            out.write(o)
test.txt gives me the following:
[1683 1933 3]
But when I run test.py, print(A) gives me this:
[1681 1774 2]
[1682 1848 1]
[1680 1685 1]
[1680 1682 1]
[1680 1680 0]
[1683 1933 3]
How can I write the whole printed output to test.txt? I assume I need to do something like this:
ol = []
ol.append(o)
The basic problem is that you are opening the same file again and again, overwriting it on every iteration of the for loop.
Use:
with open("test.txt", 'w') as out:
for n in range(ni):
a = nc*(nb*nc*ia+nc*ib+ic)+U1[n]
a2 = ia + U1[n]
b2 = ib + U3[n]
c2 = ic + U4[n]
b = nc*(nb*nc*a2+nc*b2+c2)+U2[n]
A = str(numpy.array((a,b,U5[n])))
out.write(f"{A}\n")
Now test.txt will contain:
[1681 1774 2]
[1682 1848 1]
[1680 1685 1]
[1680 1682 1]
[1680 1680 0]
[1683 1933 3]
The best solution - Fast and efficient
import numpy
nc = 1; nb = 20; ni = 6; nc = 2; ia = 20; ib = 20; ic = 0
U1 = numpy.array((1,2,0,0,0,3))
U2 = numpy.array((2,2,1,0,0,1))
U3 = numpy.array((2,1,1,0,0,2))
U4 = numpy.array((2,1,0,1,0,2))
U5 = numpy.array((2,1,1,1,0,3))
# No for loop here, we are using NumPy broadcasting features
a = nc*(nb*nc*ia+nc*ib+ic)+U1
a2 = ia + U1
b2 = ib + U3
c2 = ic + U4
b = nc * (nb * nc * a2 + nc * b2 + c2) + U2
# Transpose the matrix to get the result wanted in your case
A = numpy.array((a, b, U5)).T
with open(file="res.txt", mode="w") as f:   # don't reuse b, it holds the array
    f.write(numpy.array2string(A))
Remarks about your code written in the question
In most cases, using NumPy broadcasting (removing loop over arrays) makes the code faster. It can also be easier to read.
Writing into a file from inside a loop is bad practice. Even if you keep the file open using the with open context manager, performance is poor.
Better to build your array, convert it to a string, and then write the whole thing into the file in one go.
Other solution using NumPy built-in functions
Disclaimer: use this only if the number of rows of your array is small (<500).
Dumping a NumPy array into a text file is a built-in NumPy feature.
Look at this: API doc | numpy.savetxt
However, if you look at the source code of this function, you will see that it iterates over the array's rows, which hurts performance a lot as the number of rows increases (thanks to @hpaulj for the remark).
In your case, you could replace the two last lines of the snippet above with:
numpy.savetxt("a.txt", A) # just see the doc to add some formatting options
In each iteration of the outer loop, you are asking the file system for a fresh, empty copy of "test.txt". So of course the final version only contains the content of the last iteration.
Open with mode "a" for "(write-and-)append", or, as in the other answer and more efficiently, open the file once outside of the loop.
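For example, with a toy loop:

# mode "a" appends on every open instead of truncating the file
for n in range(3):
    with open("test.txt", "a") as out:
        out.write("line %d\n" % n)
# test.txt now ends with all three lines, not just the last one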
I'm using python and apparently the slowest part of my program is doing simple additions on float variables.
It takes about 35 seconds to do around 400,000,000 additions/multiplications.
I'm trying to figure out the fastest way I can do this math.
This is what the structure of my code looks like.
Example (dummy) code:
def func(x, y, z):
    loop_count = 30
    a = [0,1,2,3,4,5,6,7,8,9,10,11,12, ...]  # 35 elements
    b = [0,11,22,33,44,55,66,77,88,99,1010,1111,1212, ...]  # 35 elements
    p = [0,0,0,0,0,0,0,0,0,0,0,0,0, ...]  # 35 elements
    for i in range(loop_count - 1):
        c = p[i-1]
        d = a[i] + c * a[i+1]
        e = min(2, a[i]) + c * b[i]
        f = e * x
        g = y + d * c
        # ... and so on; s, g5, f4, h7, t5, y8 come from the omitted lines
        p[i] = d + e + f + s + g5 + f4 + h7 * t5 + y8
    return sum(p)
func() is called about 200k times. loop_count is about 30, and I have ~20 multiplications, ~45 additions, and ~10 uses of min/max.
I was wondering if there is a way for me to declare all of these as ctypes.c_float and do the addition in C using stdlib or something similar?
Note that the p[i] calculated at the end of the loop is used as c in the next loop iteration. For iteration 0, it just uses p[-1] which is 0 in this case.
My constraints:
I need to use Python. While I understand plain math would be faster in C/Java/etc., I cannot use them due to a bunch of other things I do in Python which cannot be done in C in this same program.
I tried writing this with cython, but it caused a bunch of issues with the environment I need to run this in. So, again - not an option.
I think you should consider using numpy; you did not mention any constraint that rules it out.
Example case of a simple dot product (x·y):
import datetime
import numpy as np

x = range(0,10000000,1)
y = range(0,20000000,2)
for i in range(0, len(x)):
    x[i] = x[i] * 0.00001
    y[i] = y[i] * 0.00001

now = datetime.datetime.now()
z = 0
for i in range(0, len(x)):
    z = z + x[i]*y[i]
print "handmade dot=", datetime.datetime.now()-now
print z

x = np.arange(0.0, 10000000.0*0.00001, 0.00001)
y = np.arange(0.0, 10000000.0*0.00002, 0.00002)
now = datetime.datetime.now()
z = np.dot(x,y)
print 'numpy dot =', datetime.datetime.now()-now
print z
outputs
handmade dot= 0:00:02.559000
66666656666.7
numpy dot = 0:00:00.019000
66666656666.7
numpy is more than 100x faster here.
The reason is that numpy wraps a C library that performs the dot product in compiled code. In pure Python you have a list of potentially generic objects, casting, and so on.
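Applied to your case: since p[i] depends on p[i-1], the inner loop itself can't be vectorised, but the ~200k calls to func() can be batched by passing x, y, z as arrays. This is only a hypothetical sketch; the coefficients a, b and the loop body are placeholders standing in for the dummy code in your question.

import numpy as np

def func_batched(x, y, z, loop_count=30):
    a = np.arange(35, dtype=float)   # stand-ins for the 35 constants
    b = 11.0 * np.arange(35)
    p = np.zeros((loop_count, len(x)))
    c = np.zeros(len(x))             # plays the role of p[i-1], 0 at the start
    for i in range(loop_count - 1):
        d = a[i] + c * a[i + 1]      # scalar coefficient, array state
        e = np.minimum(2.0, a[i]) + c * b[i]
        f = e * x
        g = y + d * c
        p[i] = d + e + f + g         # the "... and so on" terms are omitted
        c = p[i]
    return p.sum(axis=0)             # one result per original call

# one batched call replaces n scalar calls:
# totals = func_batched(xs, ys, zs)  # xs, ys, zs: arrays of length ~200000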
I've been checking out how to vectorize an outer and inner for loop. These have some calculations and also a delete inside them, which seems to make vectorization much less straightforward.
How would this be vectorized best?
import numpy as np

flattenedArray = np.ndarray.tolist(someNumpyArray)
# flattenedArray is a python list of lists.
c = flattenedArray[:]
for a in range(len(flattenedArray)):
    for b in range(a+1, len(flattenedArray)):
        if a == b:
            continue
        i0 = flattenedArray[a][0]
        j0 = flattenedArray[a][1]
        z0 = flattenedArray[a][2]
        i1 = flattenedArray[b][0]
        j1 = flattenedArray[b][1]
        z1 = flattenedArray[b][2]
        if np.square(z0-z1) <= np.square(i0-i1) + np.square(j0-j1):
            if np.square(i0-i1) + np.square(j0-j1) <= np.square(z0+z1):
                c.remove(flattenedArray[b])
@MSeifert is, of course, as so often, right. So the following full vectorisation is only to show "how it's done":
import numpy as np

N = 4
data = np.random.random((N, 3))

# vectorised code
# chose tril over triu to have contiguous columns, useful later
j, i = np.tril_indices(N, -1)
sqsum = np.square(data[i,0]-data[j,0]) + np.square(data[i,1]-data[j,1])
cond = np.square(data[i, 2] + data[j, 2]) >= sqsum
cond &= np.square(data[i, 2] - data[j, 2]) <= sqsum
# because equal 'b's are grouped together we can use reduceat:
cond = np.r_[False, np.logical_or.reduceat(
    cond, np.add.accumulate(np.arange(N-1)))]
left = data[~cond, :]

# original code (modified to make it run)
flattenedArray = np.ndarray.tolist(data)
# flattenedArray is a python list of lists.
c = flattenedArray[:]
for a in range(len(flattenedArray)):
    for b in range(a+1, len(flattenedArray)):
        if a == b:
            continue
        i0 = flattenedArray[a][0]
        j0 = flattenedArray[a][1]
        z0 = flattenedArray[a][2]
        i1 = flattenedArray[b][0]
        j1 = flattenedArray[b][1]
        z1 = flattenedArray[b][2]
        if np.square(z0-z1) <= np.square(i0-i1) + np.square(j0-j1):
            if np.square(i0-i1) + np.square(j0-j1) <= np.square(z0+z1):
                try:
                    c.remove(flattenedArray[b])
                except ValueError:
                    pass

# check they are the same
print(np.alltrue(c == left))
Vectorizing the inner loop isn't much of a problem if you work with a mask:
import numpy as np

# I'm using a random array
flattenedArray = np.random.randint(0, 100, (10, 3))

mask = np.zeros(flattenedArray.shape[0], bool)
for idx, row in enumerate(flattenedArray):
    # Calculate the broadcasted elementwise addition/subtraction of this row
    # with all following
    added_squared = np.square(row[None, :] + flattenedArray[idx+1:])
    subtracted_squared = np.square(row[None, :] - flattenedArray[idx+1:])
    # Check the conditions
    col1_col2_added = subtracted_squared[:, 0] + subtracted_squared[:, 1]
    cond1 = subtracted_squared[:, 2] <= col1_col2_added
    cond2 = col1_col2_added <= added_squared[:, 2]
    # Update the mask
    mask[idx+1:] |= cond1 & cond2

# Apply the mask
flattenedArray[mask]
If you also want to vectorize the outer loop, you have to do it with broadcasting; however, that will use O(n**2) memory instead of O(n). Given that the critical inner loop is already vectorized, there won't be a lot of speedup from vectorizing the outer loop as well.
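For reference, a fully broadcast version might look like this sketch (same semantics as the masked loop above, but with (n, n) intermediates):

import numpy as np

flattenedArray = np.random.randint(0, 100, (10, 3))
n = len(flattenedArray)
# pairwise differences/sums; index [a, b] corresponds to rows a and b
sub2 = np.square(flattenedArray[:, None, :] - flattenedArray[None, :, :])
add2 = np.square(flattenedArray[:, None, 2] + flattenedArray[None, :, 2])
d12 = sub2[..., 0] + sub2[..., 1]
cond = (sub2[..., 2] <= d12) & (d12 <= add2)
cond &= np.triu(np.ones((n, n), dtype=bool), k=1)  # only pairs with b > a
mask = cond.any(axis=0)  # row b is removed if some a < b satisfies both conditions
flattenedArray[~mask]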
I'm gradually moving from Matlab to Python and would like to get some advice on optimising an iterative loop.
This is how I am currently running the loop, and for info I've included the code that defines the variables.
nh = 2000
h = np.array(range(nh))
nt = 10000
wmin = 1
wmax = 10
hw = np.array(wmin + (wmax-wmin)*invlogit(randn(1,nh)))
sl = np.array(zeros((nh,1))+radians(40))
fa = np.array(zeros((nh,1))+radians(35))
c = np.array(zeros((nh,1))+4.4)
y = np.array(zeros((nh,1))+17.6)
yw = np.array(zeros((nh,1))+9.81)
ir = 0.028
m = np.array(zeros((nh,nt)))
m[:,49] = 0.1
z = np.array(zeros((nh,nt)))
z[:,0] = 0+(3.0773-0)*rand(nh,1).T
reset = np.array(zeros((nh,nt)))
fs = np.array(zeros((nh,nt)))

for t in xrange(0, nt-1):
    fs[:,t] = (c.T+(y.T-m[:,t]*yw.T)*z[:,t]*(np.cos(sl.T)**2)*np.tan(fa.T))/(y.T*z[:,t]*np.sin(sl.T)*np.cos(sl.T))
    reset[fs[:,t]<=1,t+1] = 1
    z[fs[:,t]<=1,t+1] = 0
    z[fs[:,t]>1,t+1] = z[fs[:,t]>1,t]+(ir/hw[0,fs[:,t]>1]).T
This is how I would optimise the code in Matlab; however, it runs fairly slowly in Python. I suspect there is a more pythonic way of doing this and would really appreciate a nudge in the right direction.
Many thanks!
Not specifically about the loop, but you're doing a ton of extra work in calls that look like:
np.array(zeros((nh,nt)))
Just use:
np.zeros((nh,nt))
in its place. Additionally, you could replace:
h = np.array(range(nh))
with:
h = np.arange(nh)
Other comments:
You're calling np.sin(sl.T)*np.cos(sl.T) in every loop iteration, although sl does not appear to change at all. Just calculate it once and assign it to a variable that you use in the loop. You do this in a bunch of your trig calls.
The expression
(c.T+(y.T-m[:,t]*yw.T)*z[:,t]*(np.cos(sl.T)**2)*np.tan(fa.T))/(y.T*z[:,t]*np.sin(sl.T)*np.cos(sl.T))
uses c, y, m, yw, sl, fa that do not change inside the loop. You could compute several subexpressions before the loop.
Also, most of those arrays contain one repeated value. You could compute with scalars instead:
sl = radians(40)
fa = radians(35)
c = 4.4
y = 17.6
yw = 9.81
Then, with precomputed subexpressions:
A = cos(sl)**2 * tan(fa) * (y - m*yw)
B = y*sin(sl)*cos(sl)

for t in xrange(0, nt-1):
    fs[:,t] = (c + A[:,t]*z[:,t]) / (B*z[:,t])
    less = fs[:,t] <= 1
    more = np.logical_not(less)
    reset[less,t+1] = 1
    z[less,t+1] = 0
    z[more,t+1] = z[more,t] + (ir/hw[0,more]).T