I need help speeding up this loop and I am not sure how to go about it
import numpy as np
import pandas as pd
import timeit
n = 1000
df = pd.DataFrame({0:np.random.rand(n),1:np.random.rand(n)})
def loop():
    result = pd.DataFrame(index=df.index,columns=['result'])
    for i in df.index:
        last_index_to_consider = df.index.values[::-1][i]
        tdf = df.loc[:last_index_to_consider] - df.shift(-i).loc[:last_index_to_consider]
        tdf = tdf.apply(lambda x: x**2)
        tsumdf = tdf.sum(axis=1)
        result.loc[i,'result'] = tsumdf.mean()
    return result
print(timeit.timeit(loop, number=10))
Is it possible to tweak the for loop to make it faster? Are there options using numba, or could I use multiple threads to speed this loop up?
What would be the most sensible way to get more performance than simply running this code as it is?
There's a lot of compute happening per iteration. Keeping the loop, we can work on the underlying array data and use np.einsum for the squared-sum reductions, which brings a sizeable speedup. Here's an implementation along those lines -
def array_einsum_loop(df):
    a = df.values
    l = len(a)
    out = np.empty(l)
    for i in range(l):
        d = a[:l-i] - a[i:]
        out[i] = np.einsum('ij,ij->',d,d)
    df_out = pd.DataFrame({'result':out/np.arange(l,0,-1)})
    return df_out
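As a sanity check on the idea above: np.einsum('ij,ij->', d, d) multiplies d by itself elementwise and sums everything, so it should agree with (d**2).sum() while skipping the temporary d**2. Here is a minimal sketch on made-up data; the helper names lag_msd_reference and lag_msd_einsum are hypothetical and not part of the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((6, 2))  # small made-up stand-in for df.values

def lag_msd_reference(a):
    """For each lag i: mean over rows of the summed squared differences."""
    l = len(a)
    out = np.empty(l)
    for i in range(l):
        d = a[:l - i] - a[i:]
        out[i] = (d ** 2).sum(axis=1).mean()
    return out

def lag_msd_einsum(a):
    """Same quantity, with einsum doing the square-and-sum in one call."""
    l = len(a)
    out = np.empty(l)
    for i in range(l):
        d = a[:l - i] - a[i:]
        out[i] = np.einsum('ij,ij->', d, d)
    # Lag i has l - i rows, so divide by l, l-1, ..., 1.
    return out / np.arange(l, 0, -1)

ref = lag_msd_reference(a)
fast = lag_msd_einsum(a)
```

Both versions still loop over the lags; the einsum form only removes the per-iteration temporaries.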
Runtime test -
In [153]: n = 1000
...: df = pd.DataFrame({0:np.random.rand(n),1:np.random.rand(n)})
In [154]: %timeit loop(df)
1 loop, best of 3: 1.43 s per loop
In [155]: %timeit array_einsum_loop(df)
100 loops, best of 3: 5.61 ms per loop
In [156]: 1430/5.61
Out[156]: 254.9019607843137
Not bad for a 250x+ speedup without breaking any loop or bank!
Just for fun, the ultimate speedup with numba:
import numba
@numba.njit
def numba(d0,d1):
    n=len(d0)
    result=np.empty(n,np.float64)
    for i in range(n):
        s=0
        k=i
        for j in range(n-i):
            u = d0[j]-d0[k]
            v = d1[j]-d1[k]
            k+=1
            s += u*u + v*v
        result[i] = s/(j+1)
    return result
def loop2(df):
    return pd.DataFrame({'result':numba(*df.values.T)})
That gives a speedup of roughly 2500x over the original loop:
In [519]: %timeit loop2(df)
583 µs ± 5.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For my class I need to write more optimized math functions using NumPy. The problem is that my NumPy solutions are slower than native Python.
A function which cubes all the elements of an array and sums them
Python:
def cube(x):
    result = 0
    for i in range(len(x)):
        result += x[i] ** 3
    return result
Mine, using NumPy (15-30% slower):
def cube(x):
    it = numpy.nditer([x, None])
    for a, b in it:
        b[...] = a*a*a
    return numpy.sum(it.operands[1])
Some random calculation function
Python:
def calc(x):
    m = sum(x) / len(x)
    result = 0
    for i in range(len(x)):
        result += (x[i] - m)**4
    return result / len(x)
NumPy (>10x slower):
def calc(x):
    m = numpy.mean(x)
    result = 0
    for i in range(len(x)):
        result += numpy.power((x[i] - m), 4)
    return result / len(x)
I don't know how to approach this; so far I have tried random functions from NumPy.
To elaborate on what has been said in comments:
NumPy's power comes from being able to do all the looping in fast C/Fortran rather than slow Python looping. For example, if you have an array x and you want to calculate the square of every value in that array, you could do
y = []
for value in x:
    y.append(value**2)
or even (with a list comprehension)
y = [value**2 for value in x]
but it will be much faster if you can do all the looping inside numpy with
y = x**2
(assuming x is already a numpy array).
So for your examples, the proper way to do it in numpy would be
1.
def sum_of_cubes(x):
    result = 0
    for i in range(len(x)):
        result += x[i] ** 3
    return result
def sum_of_cubes_numpy(x):
    return (x**3).sum()
2.
def calc(x):
    m = sum(x) / len(x)
    result = 0
    for i in range(len(x)):
        result += (x[i] - m)**4
    return result / len(x)
def calc_numpy(x):
    m = numpy.mean(x) # or just x.mean()
    return numpy.sum((x - m)**4) / len(x)
Note that I've assumed that the input x is already a numpy array, not a regular Python list: if you have a list lst, you can create an array from it with arr = numpy.array(lst).
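To make the list-versus-array point concrete, here is a small sketch. It assumes it is acceptable to convert inside the function with np.asarray, which is my addition and not part of the answer; the input values are made up.

```python
import numpy as np

def calc_python(x):
    """Pure-Python fourth central moment, as in the question."""
    m = sum(x) / len(x)
    result = 0
    for i in range(len(x)):
        result += (x[i] - m) ** 4
    return result / len(x)

def calc_numpy(x):
    """Whole-array version; np.asarray lets it accept a list or an array."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    return np.sum((x - m) ** 4) / len(x)

lst = [1.0, 2.0, 3.0, 4.0]   # made-up input
arr = np.array(lst)

from_list = calc_numpy(lst)
from_array = calc_numpy(arr)
from_python = calc_python(lst)
```

Without the conversion, an expression like x - m raises a TypeError when x is a plain list, which is why the answer asks for an array up front.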
In [337]: def cube(x):
...:     result = 0
...:     for i in range(len(x)):
...:         result += x[i] ** 3
...:     return result
...:
nditer is not a good numpy iterator, at least not when used in Python-level code. It's really just a stepping stone toward writing compiled code. Its docs need a better disclaimer.
In [338]: def cube1(x):
...:     it = numpy.nditer([x, None])
...:     for a, b in it:
...:         b[...] = a*a*a
...:     return numpy.sum(it.operands[1])
...:
In [339]: cube(list(range(10)))
Out[339]: 2025
In [340]: cube1(list(range(10)))
Out[340]: 2025
In [341]: cube1(np.arange(10))
Out[341]: 2025
A more direct numpy iteration:
In [342]: def cube2(x):
...:     it = [a*a*a for a in x]
...:     return numpy.sum(it)
...:
The better whole-array code. Just as sum can work with the whole array, the power operator also applies to the whole array.
In [343]: def cube3(x):
...:     return numpy.sum(x**3)
...:
In [344]: cube2(np.arange(10))
Out[344]: 2025
In [345]: cube3(np.arange(10))
Out[345]: 2025
Doing some timings:
The list reference:
In [346]: timeit cube(list(range(1000)))
438 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The slow nditer:
In [348]: timeit cube1(np.arange(1000))
2.8 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The partial numpy:
In [349]: timeit cube2(np.arange(1000))
520 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I can improve its time by passing a list instead of an array. Iteration on lists is faster.
In [352]: timeit cube2(list(range(1000)))
229 µs ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But the time for a 'pure' numpy version blows all of those out of the water:
In [350]: timeit cube3(np.arange(1000))
23.6 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The general rule is that numpy methods applied to a numpy array are fastest. But if you must loop, it's usually better to use lists.
Sometimes the pure numpy approach creates very large temporary arrays. Then memory management complexities can reduce performance. In such cases a modest number of iterations on a complex task may be best.
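One way to read "a modest number of iterations on a complex task" is to process the array in blocks, so each temporary stays small. The sketch below is only an illustration under that assumption; sum_of_cubes_chunked and its chunk_size parameter are hypothetical names, and no timing claim is made.

```python
import numpy as np

def sum_of_cubes_chunked(x, chunk_size=100_000):
    """Sum x**3 block by block; each temporary holds at most chunk_size elements."""
    total = 0
    for start in range(0, len(x), chunk_size):
        block = x[start:start + chunk_size]
        total += (block ** 3).sum()   # whole-array work inside each block
    return total

x = np.arange(10)
chunked = sum_of_cubes_chunked(x, chunk_size=3)
whole = (x ** 3).sum()
```

The Python loop runs only len(x) / chunk_size times, so the per-iteration overhead stays negligible as long as the blocks are reasonably large.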
I want to apply outer addition of multiple vectors/matrices. Let's say four times:
import numpy as np
x = np.arange(100)
B = np.add.outer(x,x)
B = np.add.outer(B,x)
B = np.add.outer(B,x)
Ideally the number of additions would be a variable, like a=4 --> 4 times the addition. Is this possible?
Approach #1
Here's one with array-initialization -
n = 4 # number of iterations to add outer versions
l = len(x)
out = np.zeros([l]*n,dtype=x.dtype)
for i in range(n):
    out += x.reshape(np.insert([1]*(n-1),i,l))
Why this approach and not iterative addition to create new arrays at each iteration?
Iteratively creating new arrays at each iteration would require more memory, and hence more memory overhead. With array initialization, we add the elements of x into an already initialized array, which keeps it memory-efficient.
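The array-initialization idea can be wrapped in a function and checked against chained np.add.outer calls on a small input. This is a sketch only; the name outer_sum is hypothetical, and it builds the reshape target as a plain list instead of using np.insert.

```python
import numpy as np

def outer_sum(x, n):
    """Add n broadcast views of 1-D x, one per axis, into one preallocated array."""
    l = len(x)
    out = np.zeros([l] * n, dtype=x.dtype)
    for axis in range(n):
        shape = [1] * n
        shape[axis] = l            # x varies along this axis only
        out += x.reshape(shape)    # in-place add, no new n-D array per step
    return out

x = np.arange(5)
result = outer_sum(x, 3)
expected = np.add.outer(np.add.outer(x, x), x)
```

Each entry result[i, j, k] is x[i] + x[j] + x[k], the same as the nested outer additions.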
Alternative #1
We can remove one iteration by initializing with x. Hence, the changes would be -
out = np.broadcast_to(x,[l]*n).copy()
for i in range(n-1):
Approach #2: With np.add.reduce -
Another way would be with np.add.reduce, which again doesn't create any intermediate arrays, but being a reduction method might be better here as that's what it's implemented for -
l = len(x); n = 4
np.add.reduce([x.reshape(np.insert([1]*(n-1),i,l)) for i in range(n)])
Timings -
In [17]: x = np.arange(100)
In [18]: %%timeit
...: n = 4 # number of iterations to add outer versions
...: l = len(x)
...: out = np.zeros([l]*n,dtype=x.dtype)
...: for i in range(n):
...:     out += x.reshape(np.insert([1]*(n-1),i,l))
829 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: l = len(x); n = 4
In [20]: %timeit np.add.reduce([x.reshape(np.insert([1]*(n-1),i,l)) for i in range(n)])
183 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I don't think there's a builtin argument to repeat this procedure several times, but you can define a custom function for it fairly easily
def recursive_outer_add(arr, num):
    # num is the number of operands, so the result has num dimensions
    if num == 1:
        return arr
    x = np.add.outer(arr, arr)
    for i in range(num - 2):
        x = np.add.outer(x, arr)
    return x
Just as a warning: the array gets really big really fast
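To put a number on that warning: the result has len(x)**n elements, so the memory needed can be estimated before building anything. A small sketch, assuming 64-bit integers (set explicitly here, since the default integer size can differ by platform); outer_add_nbytes is a hypothetical helper.

```python
import numpy as np

def outer_add_nbytes(x, n):
    """Bytes needed for the n-dimensional result of repeated outer addition of 1-D x."""
    return len(x) ** n * x.dtype.itemsize

x = np.arange(100, dtype=np.int64)
sizes = {n: outer_add_nbytes(x, n) for n in range(1, 6)}
# n = 4 already needs 800 MB; n = 5 would need 80 GB.
```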
Short and reasonably fast:
from timeit import timeit
n = 4
l = 10
x = np.arange(l)
sum(np.ix_(*n*(x,)))
timeit(lambda:sum(np.ix_(*n*(x,))),number=1000)
# 0.049082988989539444
We can speed this up a little by going back to front:
timeit(lambda:sum(reversed(np.ix_(*n*(x,)))),number=1000)
# 0.03847671199764591
We can also build our own reversed np.ix_:
from operator import getitem
from itertools import accumulate,chain,repeat
sum(accumulate(chain((x,),repeat((slice(None),None),n-1)),getitem))
timeit(lambda:sum(accumulate(chain((x,),repeat((slice(None),None),n-1)),getitem)),number=1000)
# 0.02427654700295534
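For readers unfamiliar with np.ix_, the trick above relies on it returning one reshaped view of x per axis, which the built-in sum then broadcasts together. A short sketch with made-up sizes shows the shapes involved:

```python
import numpy as np

n = 3
x = np.arange(4)

grids = np.ix_(*n * (x,))           # one array per axis
shapes = [g.shape for g in grids]   # (4, 1, 1), (1, 4, 1), (1, 1, 4)
total = sum(grids)                  # broadcasting expands to (4, 4, 4)

expected = np.add.outer(np.add.outer(x, x), x)
```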
I have the following data for a Python program.
import numpy as np
np.random.seed(28)
n = 100000
d = 60
S = np.random.rand(n)
O = np.random.rand(n, d, d)
p = np.random.rand()
mask = np.where(S < 0.5)
And I want to run the following algorithm:
def method1():
    sum_p = np.zeros([d, d])
    sum_m = np.zeros([d, d])
    for k in range(n):
        s = S[k] * O[k]
        sum_p += s
        if(S[k] < 0.5):
            sum_m -= s
    return p * sum_p + sum_m
This is a minimal example, but the code in method1() is supposed to be run many times in my project, so I would like to rewrite it in a more pythonic way, to make it as efficient as possible. I have tried with the following method:
def method2():
    sall = S[:, None, None] * O
    return p * sall.sum(axis=0) - sall[mask].sum(axis=0)
But, although this method performs better with low values of d, when d=60 it does not provide good times:
# To check that both methods provide the same result.
In [1]: np.sum(method1() == method2()) == d*d
Out[1]: True
In [2]: %timeit method1()
Out[2]: 801 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit method2()
Out[3]: 1.91 s ± 6.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Do you have any other ideas to optimize this method?
(As additional information, the variable mask is supposed to be used in other parts of my final code, so I don't need to consider it inside the code of method2 for the time computation.)
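One direction that might be worth trying: method1 computes a weighted sum of the O[k] with scalar weight S[k] * (p - 1) when S[k] < 0.5 and S[k] * p otherwise, so the weights can be built first and the sum done as a single contraction, without the (n, d, d) temporary that method2 creates. The sketch below uses much smaller made-up sizes than the question and the hypothetical name method_weights; whether it is actually faster at n = 100000, d = 60 is not measured here.

```python
import numpy as np

rng = np.random.default_rng(28)
n, d = 200, 5                      # far smaller than the question's sizes
S = rng.random(n)
O = rng.random((n, d, d))
p = rng.random()

def method1():
    sum_p = np.zeros([d, d])
    sum_m = np.zeros([d, d])
    for k in range(n):
        s = S[k] * O[k]
        sum_p += s
        if S[k] < 0.5:
            sum_m -= s
    return p * sum_p + sum_m

def method_weights():
    # Per-sample scalar weight: p * S[k], minus S[k] where S[k] < 0.5.
    w = S * (p - (S < 0.5).astype(float))
    # Contract the sample axis of w with the first axis of O, giving (d, d).
    return np.tensordot(w, O, axes=1)

looped = method1()
contracted = method_weights()
```

Comparing with np.allclose is safer than exact equality here, because the two versions add the terms in a different order.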
Given S being an n x m matrix, as a numpy array, I want to call function f on pairs of (S[i], S[j]) to calculate a particular value of interest, stored in a matrix with dimensions n x n. In my particular case the function f is commutative so f(x,y) = f(y,x).
With all this in mind I am wondering if I can do any tricks to speed this up as much as I can, n can be fairly large.
When I time the function f, it's around a couple of microseconds, which is as expected. It's a pretty straightforward calculation. Below I show you the timings I got, compared with max() and sum() for reference.
In [19]: %timeit sum(s[11])
4.68 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [20]: %timeit max(s[11])
3.61 µs ± 64.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [21]: %timeit f(s[11], s[21], 50, 10, 1e-5)
1.23 µs ± 7.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: %timeit f(s[121], s[321], 50, 10, 1e-5)
1.26 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
However when I time the overall processing time for a 500x50 sample data (resulting in 500 x 500 /2 = 125K comparisons), the overall time blows up significantly (into minutes). I would have expected something like 0.2-0.3 seconds (1.25E5 * 2E-6 sec/calc).
In [12]: @jit
...: def testf(s, n, m, p):
...:     tol = 1e-5
...:     sim = np.zeros((n,n))
...:     for i in range(n):
...:         for j in range(n):
...:             if i > j:
...:                 delta = p[i] - p[j]
...:                 if delta < 0:
...:                     res = f(s[i], s[j], m, abs(delta), tol) # <-- max(s[i])
...:                 else:
...:                     res = f(s[j], s[i], m, delta, tol) # <-- sum(s[j])
...:                 sim[i][j] = res
...:                 sim[j][i] = res
...:     return sim
In the code above I replaced the lines where res is assigned with max() and sum() (shown in the trailing comments) for testing, and the code executes approximately 100 times faster, even though those functions are individually slower than my function f().
Which brings me to my questions:
Can I avoid the double loop to speed this up? Ideally I want to be able to run this for matrices where n = 1E5. (Comment: since the max and sum functions work considerably faster, my guess is that the for loops aren't the bottleneck here, but it would still be good to know if there is a better way.)
What may cause the severe slowdown with my function, if it's not the double for loop?
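On the first question, a small sketch of one common pattern: since each unordered pair is only needed once, the inner loop can run over j < i directly instead of scanning all j and testing i > j. The function count_close below is a hypothetical commutative stand-in for f, and the data is made up. This does not address the cause of the slowdown in the second question.

```python
import numpy as np

def pairwise_symmetric(s, func):
    """Fill an n x n matrix from func(s[i], s[j]), evaluating each unordered pair once."""
    n = len(s)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i):        # only j < i, so no `if i > j` test is needed
            res = func(s[i], s[j])
            sim[i, j] = res       # one tuple index instead of chained sim[i][j]
            sim[j, i] = res
    return sim

def count_close(a, b, tol=1e-5):
    """Hypothetical stand-in for f: count positions where two rows nearly agree."""
    return np.count_nonzero(np.abs(a - b) < tol)

s = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.5, 3.0],
              [0.0, 2.0, 3.0]])
sim = pairwise_symmetric(s, count_close)
```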
EDIT
Some comments asked for the specifics of the function f. It iterates over two arrays and counts the values in the two arrays that are "close enough". I removed the comments and changed some variable names, but the logic is as shown below. It was interesting to note that math.isclose(x,y,rel_tol), which is equivalent to the if-statements I have below, makes the code significantly slower, probably due to the library call?
from numba import jit
@jit
def f(arr1, arr2, n, d, rel_tol):
    counter = 0
    i,j,k = 0,0,0
    while (i < n and j < n and k < n):
        val = arr1[j] + d
        if abs(arr1[i] - arr2[k]) < rel_tol * max(arr1[i], arr2[k]):
            counter += 1
            i += 1
            k += 1
        elif abs(val - arr2[k]) < rel_tol * max(val, arr2[k]):
            counter += 1
            j += 1
            k += 1
        else:
            # increment the index corresponding to the lightest
            if arr1[i] <= arr2[k] and arr1[i] <= val:
                if i < n:
                    i += 1
            elif val <= arr1[i] and val <= arr2[k]:
                if j < n:
                    j += 1
            else:
                k += 1
    return counter
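For reference, the hand-written tolerance check and math.isclose agree on ordinary positive inputs but are not identical: math.isclose takes rel_tol as a keyword-only argument, compares with <= and uses the absolute values of its inputs. A plain-Python sketch (no numba, made-up values, hypothetical helper names):

```python
import math

def close_handwritten(a, b, rel_tol):
    # The check used in f: strict "<" and max() of the raw values.
    return abs(a - b) < rel_tol * max(a, b)

def close_stdlib(a, b, rel_tol):
    # rel_tol must be passed by keyword; abs_tol defaults to 0.0.
    return math.isclose(a, b, rel_tol=rel_tol)

pairs = [(100.0, 100.0005), (100.0, 100.01), (1.0, 1.0)]
results = [(close_handwritten(a, b, 1e-5), close_stdlib(a, b, 1e-5))
           for a, b in pairs]

# With negative inputs the two differ: max(a, b) is negative, so the
# hand-written bound can never be satisfied.
negative_case = (close_handwritten(-1.0, -1.0, 1e-5),
                 close_stdlib(-1.0, -1.0, 1e-5))
```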
def nonzero(a):
    row,colum = a.shape
    nonzero_row = np.array([],dtype=int)
    nonzero_col = np.array([],dtype=int)
    for i in range(0,row):
        for j in range(0,colum):
            if a[i,j] != 0:
                nonzero_row = np.append(nonzero_row,i)
                nonzero_col = np.append(nonzero_col,j)
    return (nonzero_row,nonzero_col)
The above code is much slower compared to
(row,col) = np.nonzero(edges_canny)
It would be great to get some direction on how to increase the speed, and on why numpy functions are so much faster.
There are 2 reasons why NumPy functions can outperform Python's types:
The values inside the array are native types, not Python types. This means NumPy doesn't need to go through the abstraction layer that Python has.
NumPy functions are (mostly) written in C. That actually only matters in some cases because a lot of Python functions are also written in C, for example sum.
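The first point can be seen directly by comparing what an array stores with what a list stores. A small sketch; the exact size of a Python int object is a CPython implementation detail, so only the comparison matters here:

```python
import sys
import numpy as np

arr = np.arange(1000, dtype=np.int64)
lst = arr.tolist()

bytes_per_array_element = arr.itemsize         # raw 8-byte integers, stored contiguously
bytes_per_python_int = sys.getsizeof(lst[1])   # a full Python int object

element_from_array = arr[0]   # a NumPy scalar wrapping the native value
element_from_list = lst[0]    # an ordinary Python int
```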
In your case you also do something really inefficient: You append to an array. That's one really expensive operation in the middle of a double loop. That's an obvious (and unnecessary) bottleneck right there. You would get amazing speedups just by using lists as nonzero_row and nonzero_col and only convert them to array just before you return:
def nonzero_list_based(a):
    row,colum = a.shape
    a = a.tolist()
    nonzero_row = []
    nonzero_col = []
    for i in range(0,row):
        for j in range(0,colum):
            if a[i][j] != 0:
                nonzero_row.append(i)
                nonzero_col.append(j)
    return (np.array(nonzero_row), np.array(nonzero_col))
The timings:
import numpy as np
def nonzero_original(a):
    row,colum = a.shape
    nonzero_row = np.array([],dtype=int)
    nonzero_col = np.array([],dtype=int)
    for i in range(0,row):
        for j in range(0,colum):
            if a[i,j] != 0:
                nonzero_row = np.append(nonzero_row,i)
                nonzero_col = np.append(nonzero_col,j)
    return (nonzero_row,nonzero_col)
arr = np.random.randint(0, 10, (100, 100))
%timeit np.nonzero(arr)
# 315 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit nonzero_original(arr)
# 759 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nonzero_list_based(arr)
# 13.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even though it's about 40 times slower than the NumPy operation, it's still almost 60 times faster than your approach. There's an important lesson here: avoid np.append whenever possible!
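The reason np.append is so costly in a loop is that it never grows an array in place: every call allocates a new array and copies the old contents. The sketch below shows that, plus one alternative that stays with arrays by preallocating to an upper bound and trimming at the end. The name nonzero_preallocated is hypothetical and the variant is not timed here.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)        # returns a new array; a is left unchanged

def nonzero_preallocated(m):
    """Loop-based nonzero that fills preallocated index arrays, then trims them."""
    rows = np.empty(m.size, dtype=int)   # m.size is an upper bound on the count
    cols = np.empty(m.size, dtype=int)
    count = 0
    for i in range(m.shape[0]):
        for j in range(m.shape[1]):
            if m[i, j] != 0:
                rows[count] = i
                cols[count] = j
                count += 1
    return rows[:count], cols[:count]

m = np.array([[0, 5, 0],
              [7, 0, 9]])
rows, cols = nonzero_preallocated(m)
ref_rows, ref_cols = np.nonzero(m)
```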
Another reason why NumPy outperforms alternative approaches is that its functions (mostly) use state-of-the-art algorithms (or "import" them, i.e. BLAS/LAPACK/ATLAS/MKL) to solve the problems. These algorithms have been optimized for correctness and speed over years (if not decades). You shouldn't expect to find a faster or even comparable solution.