Multiple arithmetic operations efficiently on NumPy

Multiple arithmetic operations efficiently on NumPy - python

I have the following data for a Python program.
import numpy as np
np.random.seed(28)
n = 100000
d = 60
S = np.random.rand(n)
O = np.random.rand(n, d, d)
p = np.random.rand()
mask = np.where(S < 0.5)
And I want to run the following algorithm:
def method1():
sum_p = np.zeros([d, d])
sum_m = np.zeros([d, d])
for k in range(n):
s = S[k] * O[k]
sum_p += s
if(S[k] < 0.5):
sum_m -= s
return p * sum_p + sum_m
This is a minimal example, but the code in method1() is supposed to be run many times in my project, so I would like to rewrite it in a more pythonic way, to make it as efficient as possible. I have tried with the following method:
def method2():
sall = S[:, None, None] * O
return p * sall.sum(axis=0) - sall[mask].sum(axis=0)
But, although this method performs better with low values of d, when d=60 it does not provide good times:
# To check that both methods provide the same result.
In [1]: np.sum(method1() == method2()) == d*d
Out[1]: True
In [2]: %timeit method1()
Out[2]: 801 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit method2()
Out[3]: 1.91 s ± 6.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Do you have any other ideas to optimize this method?
(As additional information, the variable mask is supposed to be used in other parts of my final code, so I don't need to consider it inside the code of method2 for the time computation.)

Related

My Python code TimeLimitExceeded. How do I optimize and reduce the number of operations in my code?

Here is my code:
def seperateInputs(inp):
temp = inp.split()
n = int(temp[0])
wires = []
temp[1] = temp[1].replace('),(', ') (')
storeys = temp[1][1:len(temp[1])-1].split()
for each in storeys:
each = each[1:len(each)-1]
t = each.split(',')
wires.append((int(t[0]), int(t[1])))
return n, wires
def findCrosses(n, wires):
cross = 0
for i in range(len(wires)-1):
for j in range(i+1, len(wires)):
if (wires[i][0] < wires[j][0] and wires[i][1] > wires[j][1]) or (wires[i][0] > wires[j][0] and wires[i][1] < wires[j][1]):
cross += 1
return cross
def main():
m = int(input())
for i in range(m):
inp = input()
n, wires = seperateInputs(inp)
print(findCrosses(n, wires))
main()
The question asks:
I also tested my own sample input which got me the output that is correct:
Sample input:
3
20 [(1,8),(10,18),(17,19),(13,16),(4,1),(8,17),(2,10),(11,0),(3,2),(12,3),(18,14),(7,7),(19,5),(0,6)]
20 [(3,4),(10,7),(6,11),(7,17),(13,9),(15,19),(19,12),(16,14),(12,8),(0,3),(8,15),(4,18),(18,6),(5,5),(9,13),(17,1),(1,0)]
20 [(15,8),(0,14),(1,4),(6,5),(3,0),(13,15),(7,10),(5,9),(19,7),(17,13),(10,3),(16,16),(14,2),(11,11),(8,18),(9,12),(4,1)]
Sample output:
38
57
54
However although small input worked but medium to large input gives me TimeLimitExceeded error:
How do I optimize this? Is there a way to have much less operations than what I already have? TIA.

There are a handful of things you can do.
First, things are easier to compute if you sort the list first by the left building. This costs a little up-front, but makes things easier and fast as you process because you only need to compare how many lower second elements you've seen so far. The code is nice and simple for this:
l = [(3,4),(10,7),(6,11),(7,17),(13,9),(15,19),(19,12),(16,14),(12,8),(0,3),(8,15),(4,18),(18,6),(5,5),(9,13),(17,1),(1,0)]
def count_crossings(l):
s = sorted(l, key=lambda p: p[0])
endpoints = []
count = 0
for i, j in s:
count += sum(e > j for e in endpoints)
endpoints.append(j)
return count
count_crossings(l)
# 57
This is a little inefficient because you are looping through endpoints for every point. If you could also keep endpoints sorted, you would only need to count the number less than the given right hand boiling floor. Anytime you thing of keeping a list sorted, you should consider the amazing built-in library bisect. This will make things an order of magnitude faster:
import bisect
def count_crossings_b(l):
s = sorted(l, key=lambda p: p[0])
endpoints = []
count = 0
for i, j in s:
bisect.insort_left(endpoints, j)
count += len(endpoints) - bisect.bisect_right(endpoints, j)
return count
count_crossings_b(l)
# 57
The various timings on my laptop look like:
l = [(random.randint(1, 200), random.randint(1, 200)) for _ in range(1000)]
%timeit findCrosses(l) # original
# 179 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit count_crossings(l)
# 38.1 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit count_crossings_b(l)
# 1.08 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Python performace on class instance vs local (numpy) variables

I've read other posts on how python speed/performance should be relatively unaffected by whether code being run is just in main, in a function or defined as a class attribute, but these do not explain the very large differences in performance that I see when using class vs local variables, especially when using the numpy library. To be more clear, I made an script example below.
import numpy as np
import copy
class Test:
def __init__(self, n, m):
self.X = np.random.rand(n,n,m)
self.Y = np.random.rand(n,n,m)
self.Z = np.random.rand(n,n,m)
def matmul1(self):
self.A = np.zeros(self.X.shape)
for i in range(self.X.shape[2]):
self.A[:,:,i] = self.X[:,:,i] # self.Y[:,:,i] # self.Z[:,:,i]
return
def matmul2(self):
self.A = np.zeros(self.X.shape)
for i in range(self.X.shape[2]):
x = copy.deepcopy(self.X[:,:,i])
y = copy.deepcopy(self.Y[:,:,i])
z = copy.deepcopy(self.Z[:,:,i])
self.A[:,:,i] = x # y # z
return
t1 = Test(300,100)
%%timeit
t1.matmul1()
#OUTPUT: 20.9 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
t1.matmul2()
#OUTPUT: 516 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In this script I define a class with attributes X, Y and Z as 3-way arrays. I also have two function attributes (matmul1 and matmul2) which loop through the 3rd index of the arrays and matrix multiply each of the 3 slices to populate an array, A. matmul1 just loops through class variables and matrix multiplies, whereas matmul2 creates local copies for each matrix multiplication within the loop. Matmul1 is ~40X slower than matmul2. Can someone explain why this is happening? Maybe I am thinking about how to use classes incorrectly, but I also wouldn't assume that variables should be deep copied all the time. Basically, what is it about deep copying that affects my performance so significantly, and is this unavoidable when using class attributes/variables? It seems like its more than just the overhead of calling class attributes as discussed here. Any input is appreciated, thanks!
Edit: My real question is why do copies of, instead of views of subarrays of class instance variables, result in much better performance for these types of methods.

If you put the m dimension first, you could do this product without iteration:
In [146]: X1,Y1,Z1 = X.transpose(2,0,1), Y.transpose(2,0,1), Z.transpose(2,0,1)
In [147]: A1 = X1#Y1#Z1
In [148]: np.allclose(A, A1.transpose(1,2,0))
Out[148]: True
However sometimes, working with very large arrays is slower, due to memory management complexities.
It might worth testing
A1[i] = X1[i] # Y1[i] # Z1[i]
where the iteration is on the outermost dimension.
My computer is too small to do good timings on these array sizes.
edit
I added these alternatives to your class, and tested with a smaller case:
In [67]: class Test:
...: def __init__(self, n, m):
...: self.X = np.random.rand(n,n,m)
...: self.Y = np.random.rand(n,n,m)
...: self.Z = np.random.rand(n,n,m)
...: def matmul1(self):
...: A = np.zeros(self.X.shape)
...: for i in range(self.X.shape[2]):
...: A[:,:,i] = self.X[:,:,i] # self.Y[:,:,i] # self.Z[:,:,i]
...: return A
...: def matmul2(self):
...: A = np.zeros(self.X.shape)
...: for i in range(self.X.shape[2]):
...: x = self.X[:,:,i].copy()
...: y = self.Y[:,:,i].copy()
...: z = self.Z[:,:,i].copy()
...: A[:,:,i] = x # y # z
...: return A
...: def matmul3(self):
...: x = self.X.transpose(2,0,1).copy()
...: y = self.Y.transpose(2,0,1).copy()
...: z = self.Z.transpose(2,0,1).copy()
...: return (x#y#z).transpose(1,2,0)
...: def matmul4(self):
...: x = self.X.transpose(2,0,1).copy()
...: y = self.Y.transpose(2,0,1).copy()
...: z = self.Z.transpose(2,0,1).copy()
...: A = np.zeros(x.shape)
...: for i in range(x.shape[0]):
...: A[i] = x[i]#y[i]#z[i]
...: return A.transpose(1,2,0)
In [68]: t1=Test(100,50)
In [69]: np.max(np.abs(t1.matmul2()-t1.matmul4()))
Out[69]: 0.0
In [70]: np.allclose(t1.matmul3(),t1.matmul2())
Out[70]: True
The view iteration is 10x slower:
In [71]: timeit t1.matmul1()
252 ms ± 424 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: timeit t1.matmul2()
26 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The additions are about the same:
In [73]: timeit t1.matmul3()
30.8 ms ± 4.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [74]: timeit t1.matmul4()
27.3 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Without the copy(), the transpose produces a view, and times are similar to matmul1 (250ms).
My guess is that with "fresh" copies, matmul is able to pass them to the best BLAS function by reference. With views, as in matmul1, it has to take some sort of slower route.
But if I use dot instead of matmul, I get the faster time, even with the matmul1 iteation.
In [77]: %%timeit
...: A = np.zeros(X.shape)
...: for i in range(X.shape[2]):
...: A[:,:,i] = X[:,:,i].dot(Y[:,:,i]).dot(Z[:,:,i])
25.2 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It sure looks like matmul with views is taking some suboptimal calculation choice.

More optimized math function using NumPy

for my class I need to write more optimized math function using NumPy. Problem is, when using NumPy my solutions are slower when native Python.
function which cubes all the elements of an array and sum them
Python:
def cube(x):
result = 0
for i in range(len(x)):
result += x[i] ** 3
return result
My, using NumPy (15-30% slower):
def cube(x):
it = numpy.nditer([x, None])
for a, b in it:
b[...] = a*a*a
return numpy.sum(it.operands[1])
Some random calculation function
Python:
def calc(x):
m = sum(x) / len(x)
result = 0
for i in range(len(x)):
result += (x[i] - m)**4
return result / len(x)
NumPy (>10x slower):
def calc(x):
m = numpy.mean(x)
result = 0
for i in range(len(x)):
result += numpy.power((x[i] - m), 4)
return result / len(x)
I don't know how to approatch this, so far I have tried random functions from NumPy

To elaborate on what has been said in comments:
Numpy's power comes from being able to do all the looping in fast c/fortran rather than slow Python looping. For example, if you have an array x and you want to calculate the square of every value in that array, you could do
y = []
for value in x:
y.append(value**2)
or even (with a list comprehension)
y = [value**2 for value in x]
but it will be much faster if you can do all the looping inside numpy with
y = x**2
(assuming x is already a numpy array).
So for your examples, the proper way to do it in numpy would be
1.
def sum_of_cubes(x):
result = 0
for i in range(len(x)):
result += x[i] ** 3
return result
def sum_of_cubes_numpy(x):
return (x**3).sum()
def calc(x):
m = sum(x) / len(x)
result = 0
for i in range(len(x)):
result += (x[i] - m)**4
return result / len(x)
def calc_numpy(x):
m = numpy.mean(x) # or just x.mean()
return numpy.sum((x - m)**4) / len(x)
Note that I've assumed that the input x is already a numpy array, not a regular Python list: if you have a list lst, you can create an array from it with arr = numpy.array(lst).

In [337]: def cube(x):
...: result = 0
...: for i in range(len(x)):
...: result += x[i] ** 3
...: return result
...:
nditer is not a good numpy iterator, at least not when used in Python level code. It's really just a stepping stone toward writing compiled code. It's docs need a better disclaimer.
In [338]: def cube1(x):
...: it = numpy.nditer([x, None])
...: for a, b in it:
...: b[...] = a*a*a
...: return numpy.sum(it.operands[1])
...:
In [339]: cube(list(range(10)))
Out[339]: 2025
In [340]: cube1(list(range(10)))
Out[340]: 2025
In [341]: cube1(np.arange(10))
Out[341]: 2025
A more direct numpy iteration:
In [342]: def cube2(x):
...: it = [a*a*a for a in x]
...: return numpy.sum(it)
...:
The better whole-array code. Just as sum can work with the whole array, the power also applies the whole.
In [343]: def cube3(x):
...: return numpy.sum(x**3)
...:
In [344]: cube2(np.arange(10))
Out[344]: 2025
In [345]: cube3(np.arange(10))
Out[345]: 2025
Doing some timings:
The list reference:
In [346]: timeit cube(list(range(1000)))
438 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The slow nditer:
In [348]: timeit cube1(np.arange(1000))
2.8 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The partial numpy:
In [349]: timeit cube2(np.arange(1000))
520 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I can improve its time by passing a list instead of an array. Iteration on lists is faster.
In [352]: timeit cube2(list(range(1000)))
229 µs ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But the time for a 'pure' numpy version blows all of those out of the water:
In [350]: timeit cube3(np.arange(1000))
23.6 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The general rule is that numpy methods applied to a numpy array are fastest. But if you must loop, it's usually better to use lists.
Sometimes the pure numpy approach creates very large temporary array. Then memory management complexities can reduce performance. In such cases a modest of number of iterations on a complex task may be best.

How to effectively work on combinations along one dimension in np array

Given S being an n x m matrix, as a numpy array, I want to call function f on pairs of (S[i], S[j]) to calculate a particular value of interest, stored in a matrix with dimensions n x n. In my particular case the function f is commutative so f(x,y) = f(y,x).
With all this in mind I am wondering if I can do any tricks to speed this up as much as I can, n can be fairly large.
When I time the function f, it's around a couple of microseconds, which is as expected. It's a pretty straightforward calculation. Below I show you the timings I got, compared with max() and sum() for reference.
In [19]: %timeit sum(s[11])
4.68 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [20]: %timeit max(s[11])
3.61 µs ± 64.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [21]: %timeit f(s[11], s[21], 50, 10, 1e-5)
1.23 µs ± 7.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: %timeit f(s[121], s[321], 50, 10, 1e-5)
1.26 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
However when I time the overall processing time for a 500x50 sample data (resulting in 500 x 500 /2 = 125K comparisons), the overall time blows up significantly (into minutes). I would have expected something like 0.2-0.3 seconds (1.25E5 * 2E-6 sec/calc).
In [12]: #jit
...: def testf(s, n, m, p):
...: tol = 1e-5
...: sim = np.zeros((n,n))
...: for i in range(n):
...: for j in range(n):
...: if i > j:
...: delta = p[i] - p[j]
...: if delta < 0:
...: res = f(s[i], s[j], m, abs(delta), tol) # <-- max(s[i])
...: else:
...: res = f(s[j], s[i], m, delta, tol) # <-- sum(s[j])
...: sim[i][j] = res
...: sim[j][i] = res
...: return sim
In code above I have changed the lines where res is assigned to max() and sum() (commented out parts) for testing and the code executes approx 100 times faster, even though the functions themselves are slower compared to my function f()
Which brings me to my questions:
Can I avoid the double loop to speed this up? Ideally I want to be able to run this for matrices where n = 1E5 size. (Comment: since the max and sum, functions work considerably faster, my guess is that the for loops isn't the bottleneck here, but still good to know if there is a better way)
What may cause the severe slowdown with my function, if it's not the double for loop?
EDIT
The specifics of the function f was asked, by some comments. It's iterating over two arrays and checks the number of values in the two arrays that are "close enough". I removed the comments and changes some variable names but the logic is as shown below. It was interesting to note that math.isclose(x,y,rel_tol) which is equivalent to the if-statements i have below, makes the code significantly slower, probably due to library call?
from numba import jit
#jit
def f(arr1, arr2, n, d, rel_tol):
counter = 0
i,j,k = 0,0,0
while (i < n and j < n and k < n):
val = arr1[j] + d
if abs(arr1[i] - arr2[k]) < rel_tol * max(arr1[i], arr2[k]):
counter += 1
i += 1
k += 1
elif abs(val - arr2[k]) < rel_tol * max(val, arr2[k]):
counter += 1
j += 1
k += 1
else:
# incremenet the index corresponding to the lightest
if arr1[i] <= arr2[k] and arr1[i] <= val:
if i < n:
i += 1
elif val <= arr1[i] and val <= arr2[k]:
if j < n:
j += 1
else:
k += 1
return counter

Looking to speed up loop including pandas operations

I need help speeding up this loop and I am not sure how to go about it
import numpy as np
import pandas as pd
import timeit
n = 1000
df = pd.DataFrame({0:np.random.rand(n),1:np.random.rand(n)})
def loop():
result = pd.DataFrame(index=df.index,columns=['result'])
for i in df.index:
last_index_to_consider = df.index.values[::-1][i]
tdf = df.loc[:last_index_to_consider] - df.shift(-i).loc[:last_index_to_consider]
tdf = tdf.apply(lambda x: x**2)
tsumdf = tdf.sum(axis=1)
result.loc[i,'result'] = tsumdf.mean()
return result
print(timeit.timeit(loop, number=10))
Is it possible to tweak the for loop to make it faster or are there options using numba or can I go ahead and use multiple threads to speed this loop up?
What would be the most sensible way to get more performance than just simply evaluating this code straight away?

There's a lot of compute happening per iteration. Keeping it that way, we could leverage underlying array data alongwith np.einsum for the squared-sum-reductions could bring about speedups. Here's an implementation that goes along those lines -
def array_einsum_loop(df):
a = df.values
l = len(a)
out = np.empty(l)
for i in range(l):
d = a[:l-i] - a[i:]
out[i] = np.einsum('ij,ij->',d,d)
df_out = pd.DataFrame({'result':out/np.arange(l,0,-1)})
return df_out
Runtime test -
In [153]: n = 1000
...: df = pd.DataFrame({0:np.random.rand(n),1:np.random.rand(n)})
In [154]: %timeit loop(df)
1 loop, best of 3: 1.43 s per loop
In [155]: %timeit array_einsum_loop(df)
100 loops, best of 3: 5.61 ms per loop
In [156]: 1430/5.61
Out[156]: 254.9019607843137
Not bad for a 250x+ speedup without breaking any loop or bank!

Just for fun, the ultimate speedup with numba :
import numba
#numba.njit
def numba(d0,d1):
n=len(d0)
result=np.empty(n,np.float64)
for i in range(n):
s=0
k=i
for j in range(n-i):
u = d0[j]-d0[k]
v = d1[j]-d1[k]
k+=1
s += u*u + v*v
result[i] = s/(j+1)
return result
def loop2(df):
return pd.DataFrame({'result':numba(*df.values.T)})
For a 2500x+ factor.
In [519]: %timeit loop2(df)
583 µs ± 5.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Multiple arithmetic operations efficiently on NumPy - python

Related

My Python code TimeLimitExceeded. How do I optimize and reduce the number of operations in my code?

Python performace on class instance vs local (numpy) variables

More optimized math function using NumPy

How to effectively work on combinations along one dimension in np array

Looking to speed up loop including pandas operations

Categories

Resources