More optimized math functions using NumPy - Python

For my class I need to write more optimized math functions using NumPy. The problem is, when using NumPy my solutions are slower than in native Python.
A function which cubes all the elements of an array and sums them:
Python:
def cube(x):
    result = 0
    for i in range(len(x)):
        result += x[i] ** 3
    return result
My version, using NumPy (15-30% slower):
def cube(x):
    it = numpy.nditer([x, None])
    for a, b in it:
        b[...] = a*a*a
    return numpy.sum(it.operands[1])
Some random calculation function
Python:
def calc(x):
    m = sum(x) / len(x)
    result = 0
    for i in range(len(x)):
        result += (x[i] - m)**4
    return result / len(x)
NumPy (>10x slower):
def calc(x):
    m = numpy.mean(x)
    result = 0
    for i in range(len(x)):
        result += numpy.power((x[i] - m), 4)
    return result / len(x)
I don't know how to approach this; so far I have just tried random functions from NumPy.

To elaborate on what has been said in comments:
NumPy's power comes from being able to do all the looping in fast C/Fortran rather than in slow Python loops. For example, if you have an array x and you want to calculate the square of every value in that array, you could do
y = []
for value in x:
    y.append(value**2)
or even (with a list comprehension)
y = [value**2 for value in x]
but it will be much faster if you can do all the looping inside numpy with
y = x**2
(assuming x is already a numpy array).
So for your examples, the proper way to do it in numpy would be
1.
def sum_of_cubes(x):
    result = 0
    for i in range(len(x)):
        result += x[i] ** 3
    return result

def sum_of_cubes_numpy(x):
    return (x**3).sum()
2.
def calc(x):
    m = sum(x) / len(x)
    result = 0
    for i in range(len(x)):
        result += (x[i] - m)**4
    return result / len(x)

def calc_numpy(x):
    m = numpy.mean(x)  # or just x.mean()
    return numpy.sum((x - m)**4) / len(x)
Note that I've assumed that the input x is already a numpy array, not a regular Python list: if you have a list lst, you can create an array from it with arr = numpy.array(lst).
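For example, a quick sanity check, assuming the definitions above, that the list and numpy versions agree:
import numpy

lst = list(range(100))
arr = numpy.array(lst)

# sum of cubes of 0..99 is (99*100/2)**2 = 24502500
assert sum_of_cubes(lst) == sum_of_cubes_numpy(arr) == 24502500
# the two calc versions agree up to floating-point rounding
assert numpy.isclose(calc(lst), calc_numpy(arr))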

In [337]: def cube(x):
     ...:     result = 0
     ...:     for i in range(len(x)):
     ...:         result += x[i] ** 3
     ...:     return result
     ...:
nditer is not a good numpy iterator, at least not when used in Python-level code. It's really just a stepping stone toward writing compiled code. Its docs need a better disclaimer.
In [338]: def cube1(x):
     ...:     it = numpy.nditer([x, None])
     ...:     for a, b in it:
     ...:         b[...] = a*a*a
     ...:     return numpy.sum(it.operands[1])
     ...:
In [339]: cube(list(range(10)))
Out[339]: 2025
In [340]: cube1(list(range(10)))
Out[340]: 2025
In [341]: cube1(np.arange(10))
Out[341]: 2025
A more direct numpy iteration:
In [342]: def cube2(x):
     ...:     it = [a*a*a for a in x]
     ...:     return numpy.sum(it)
     ...:
The better whole-array code: just as sum can work with the whole array, the power operation also applies to the whole array at once.
In [343]: def cube3(x):
     ...:     return numpy.sum(x**3)
     ...:
In [344]: cube2(np.arange(10))
Out[344]: 2025
In [345]: cube3(np.arange(10))
Out[345]: 2025
Doing some timings:
The list reference:
In [346]: timeit cube(list(range(1000)))
438 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The slow nditer:
In [348]: timeit cube1(np.arange(1000))
2.8 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The partial numpy:
In [349]: timeit cube2(np.arange(1000))
520 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I can improve its time by passing a list instead of an array. Iteration on lists is faster.
In [352]: timeit cube2(list(range(1000)))
229 µs ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But the time for a 'pure' numpy version blows all of those out of the water:
In [350]: timeit cube3(np.arange(1000))
23.6 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The general rule is that numpy methods applied to a numpy array are fastest. But if you must loop, it's usually better to use lists.
Sometimes the pure numpy approach creates very large temporary arrays; the resulting memory-management overhead can then reduce performance. In such cases a modest number of iterations over a complex task may be best.
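For example, a sum of cubes over a huge array can be computed in chunks so that each temporary stays small. A minimal sketch (the chunk size here is an arbitrary choice, not a tuned value):
import numpy as np

def sum_of_cubes_chunked(x, chunk=1_000_000):
    # loop over slices so the x**3 temporary never holds
    # more than `chunk` elements at a time
    total = 0.0
    for start in range(0, len(x), chunk):
        total += np.sum(x[start:start + chunk] ** 3)
    return total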


Repeating indexes a given number of times

Basically, I have :
An array giving indexes "I", e.g. (1, 2),
And a list of the same length giving the corresponding number of repetitions "N", e.g. [1, 3]
And I want to create an array containing the indexes I repeated N times, i.e. (1, 2, 2, 2) here, where 1 is repeated one time and 2 is repeated 3 times.
The best solution I've come up with uses the np.repeat and np.concatenate functions:
import numpy as np

list_index = np.arange(2)
list_no_repetition = [1, 3]
result = np.concatenate([np.repeat(index, no_repetition)
                         for index, no_repetition in zip(list_index, list_no_repetition)])
print(result)
I wonder if there is a "prettier"/"more efficient solution".
Thank you for your help.
Not sure about prettier, but you could solve it completely with list comprehension:
[x for i,l in zip(list_index, list_no_repetition) for x in [i]*l]
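For the example data from the question this gives:
list_index = [1, 2]
list_no_repetition = [1, 3]
print([x for i, l in zip(list_index, list_no_repetition) for x in [i]*l])
# [1, 2, 2, 2]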
Hello, this is the alternative that I propose:
import numpy as np

list_index = np.arange(2)
list_no_repetition = [1, 3]
result = np.array([])
for i in range(len(list_index)):
    tempA = np.empty(list_no_repetition[i])
    tempA.fill(list_index[i])
    result = np.concatenate([result, tempA])
result
You could also use a dictionary, with the key as the index and the value as the number of times it is repeated. I think that Andreas had it right with the list comprehension.
import numpy as np

repeatdict = {
    1: 1,
    2: 3,
    3: 6
}
result = [x for key, value in repeatdict.items() for x in [key]*value]
print(result)
If by "efficiency" you mean speed, you can use timeit. Here are some results for some arbitrary, larger data.
First, define the functions and data:
# generate some data (list values/indices and number of reps)
N = 1000
li_2 = np.arange(N)
lnr_2 = np.random.randint(low=0, high=10, size=N)

# three functions produce the same result
def by_range(items, rep_cts):
    x = np.full(sum(rep_cts), np.nan)
    i = 0
    for val, reps in zip(items, rep_cts):
        x[i:i + reps] = val
        i = i + reps
    return x

def by_comp(items, reps):
    return np.array([val for val, rep in zip(items, reps) for i in range(rep)])

def by_cat(list_index, list_no_repetition):
    return np.concatenate([np.repeat(index, no_repetition)
                           for index, no_repetition in zip(list_index, list_no_repetition)])
About the same speed: first allocating an array and then filling it in, vs. doing a one-line double-for comprehension.
# 820 µs ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit by_range(li_2, lnr_2)
# 829 µs ± 4.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit by_comp(li_2, lnr_2)
Original method of concatenation is slightly slower:
# 2.19 ms ± 98.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit by_cat(li_2, lnr_2)
Note that the results will differ depending on where/how you run this, and the specific data you're dealing with.
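Worth noting: np.repeat itself accepts an array of per-element counts, which removes the Python-level loop and the concatenation entirely. A minimal sketch with the question's toy data:
import numpy as np

list_index = np.arange(2)          # [0, 1]
list_no_repetition = [1, 3]
print(np.repeat(list_index, list_no_repetition))
# [0 1 1 1]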

Python performance on class instance vs local (numpy) variables

I've read other posts on how Python speed/performance should be relatively unaffected by whether the code being run is just in main, in a function, or defined as a class attribute, but these do not explain the very large differences in performance that I see when using class vs local variables, especially when using the numpy library. To be clearer, I made a script example below.
import numpy as np
import copy

class Test:
    def __init__(self, n, m):
        self.X = np.random.rand(n,n,m)
        self.Y = np.random.rand(n,n,m)
        self.Z = np.random.rand(n,n,m)

    def matmul1(self):
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            self.A[:,:,i] = self.X[:,:,i] @ self.Y[:,:,i] @ self.Z[:,:,i]
        return

    def matmul2(self):
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            x = copy.deepcopy(self.X[:,:,i])
            y = copy.deepcopy(self.Y[:,:,i])
            z = copy.deepcopy(self.Z[:,:,i])
            self.A[:,:,i] = x @ y @ z
        return

t1 = Test(300,100)
%%timeit
t1.matmul1()
#OUTPUT: 20.9 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
t1.matmul2()
#OUTPUT: 516 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In this script I define a class with attributes X, Y and Z as 3-way arrays. I also have two methods (matmul1 and matmul2) which loop through the 3rd index of the arrays and matrix-multiply each of the 3 slices to populate an array, A. matmul1 just loops through class variables and matrix-multiplies, whereas matmul2 creates local copies for each matrix multiplication within the loop. matmul1 is ~40X slower than matmul2. Can someone explain why this is happening? Maybe I am thinking about how to use classes incorrectly, but I also wouldn't assume that variables should be deep-copied all the time. Basically, what is it about deep copying that affects my performance so significantly, and is this unavoidable when using class attributes/variables? It seems like it's more than just the overhead of calling class attributes as discussed here. Any input is appreciated, thanks!
Edit: My real question is why copies of subarrays of class instance variables, rather than views of them, result in much better performance for these types of methods.
If you put the m dimension first, you could do this product without iteration:
In [146]: X1,Y1,Z1 = X.transpose(2,0,1), Y.transpose(2,0,1), Z.transpose(2,0,1)
In [147]: A1 = X1 @ Y1 @ Z1
In [148]: np.allclose(A, A1.transpose(1,2,0))
Out[148]: True
However sometimes, working with very large arrays is slower, due to memory management complexities.
It might be worth testing
A1[i] = X1[i] @ Y1[i] @ Z1[i]
where the iteration is on the outermost dimension.
My computer is too small to do good timings on these array sizes.
edit
I added these alternatives to your class, and tested with a smaller case:
In [67]: class Test:
    ...:     def __init__(self, n, m):
    ...:         self.X = np.random.rand(n,n,m)
    ...:         self.Y = np.random.rand(n,n,m)
    ...:         self.Z = np.random.rand(n,n,m)
    ...:     def matmul1(self):
    ...:         A = np.zeros(self.X.shape)
    ...:         for i in range(self.X.shape[2]):
    ...:             A[:,:,i] = self.X[:,:,i] @ self.Y[:,:,i] @ self.Z[:,:,i]
    ...:         return A
    ...:     def matmul2(self):
    ...:         A = np.zeros(self.X.shape)
    ...:         for i in range(self.X.shape[2]):
    ...:             x = self.X[:,:,i].copy()
    ...:             y = self.Y[:,:,i].copy()
    ...:             z = self.Z[:,:,i].copy()
    ...:             A[:,:,i] = x @ y @ z
    ...:         return A
    ...:     def matmul3(self):
    ...:         x = self.X.transpose(2,0,1).copy()
    ...:         y = self.Y.transpose(2,0,1).copy()
    ...:         z = self.Z.transpose(2,0,1).copy()
    ...:         return (x @ y @ z).transpose(1,2,0)
    ...:     def matmul4(self):
    ...:         x = self.X.transpose(2,0,1).copy()
    ...:         y = self.Y.transpose(2,0,1).copy()
    ...:         z = self.Z.transpose(2,0,1).copy()
    ...:         A = np.zeros(x.shape)
    ...:         for i in range(x.shape[0]):
    ...:             A[i] = x[i] @ y[i] @ z[i]
    ...:         return A.transpose(1,2,0)
In [68]: t1=Test(100,50)
In [69]: np.max(np.abs(t1.matmul2()-t1.matmul4()))
Out[69]: 0.0
In [70]: np.allclose(t1.matmul3(),t1.matmul2())
Out[70]: True
The view iteration is 10x slower:
In [71]: timeit t1.matmul1()
252 ms ± 424 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: timeit t1.matmul2()
26 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The added methods, matmul3 and matmul4, take about the same time:
In [73]: timeit t1.matmul3()
30.8 ms ± 4.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [74]: timeit t1.matmul4()
27.3 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Without the copy(), the transpose produces a view, and times are similar to matmul1 (250ms).
My guess is that with "fresh" copies, matmul is able to pass them to the best BLAS function by reference. With views, as in matmul1, it has to take some sort of slower route.
But if I use dot instead of matmul, I get the faster time, even with the matmul1-style iteration.
In [77]: %%timeit
    ...: A = np.zeros(X.shape)
    ...: for i in range(X.shape[2]):
    ...:     A[:,:,i] = X[:,:,i].dot(Y[:,:,i]).dot(Z[:,:,i])
    ...:
25.2 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It sure looks like matmul with views is taking some suboptimal calculation choice.
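One way to probe that guess without restructuring the class is to hand matmul explicitly contiguous copies of the slices. A sketch (matmul1_contig is a hypothetical extra method, and whether it matches matmul2's speed will depend on the BLAS build):
import numpy as np

def matmul1_contig(self):
    A = np.zeros(self.X.shape)
    for i in range(self.X.shape[2]):
        # these (n, n) slices of an (n, n, m) array are strided views;
        # np.ascontiguousarray copies them into C-contiguous buffers
        x = np.ascontiguousarray(self.X[:, :, i])
        y = np.ascontiguousarray(self.Y[:, :, i])
        z = np.ascontiguousarray(self.Z[:, :, i])
        A[:, :, i] = x @ y @ z
    return A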

Building a numpy array as a function of previous element

I would like to create a numpy array where the first element is a defined constant, and every next element is defined as the function of the previous element in the following way:
import numpy as np

def build_array_recursively(length, V_0, function):
    returnList = np.empty(length)
    returnList[0] = V_0
    for i in range(1, length):
        returnList[i] = function(returnList[i-1])
    return returnList

d_t = 0.05
print(build_array_recursively(20, 0.3, lambda x: x-x*d_t+x*x/2*d_t*d_t-x*x*x/6*d_t*d_t*d_t))
The print method above outputs
[0.3 0.28511194 0.27095747 0.25750095 0.24470843 0.23254756 0.22098752
0.20999896 0.19955394 0.18962586 0.18018937 0.17122037 0.16269589
0.15459409 0.14689418 0.13957638 0.13262186 0.1260127 0.11973187 0.11376316]
Is there a fast way of doing this in numpy without a for loop?
If so is there a way to handle two elements before the current one, e.g. can a Fibonacci array be constructed similarly?
I found a similar question here
Is it possible to vectorize recursive calculation of a NumPy array where each element depends on the previous one?
but was not answered in general. In my example, the difference equation is difficult to solve manually.
This is faster for what you want to do. You don't have to use recursion for the function: calculate each element from the previous one, append it to a list, and then convert the list to a numpy array at the end.
def method2(length, V_0, d_t):
    k = [V_0]
    x = V_0
    for i in range(1, length):
        x = x - x * d_t + x * x / 2 * d_t * d_t - x * x * x / 6 * d_t * d_t * d_t
        k.append(x)
    return np.asarray(k)

print(method2(20, 0.3, 0.05))
Running your existing method 10000 times takes 0.438 seconds, while method2 takes 0.097 seconds.
Using a function to make the code clearer (instead of the inline lambda):
def fn(x):
    return x - x*d_t + x*x/2*d_t*d_t - x*x*x/6*d_t*d_t*d_t
And a function that combines elements of build_array_recursively and method2:
def foo1(length, V_0, function):
    returnList = np.empty(length)
    returnList[0] = x = V_0
    for i in range(1, length):
        returnList[i] = x = function(x)
    return returnList
In [887]: timeit build_array_recursively(20,0.3, fn);
61.4 µs ± 63 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [888]: timeit method2(20,0.3, fn);
16.9 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [889]: timeit foo1(20,0.3, fn);
13 µs ± 29.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The main time saver in method2 and foo1 is carrying over x, the last value, from one iteration to the next, rather than indexing with returnList[i-1].
The accumulation method, assigning to a preallocated array, or list append, is less important. Performance is usually similar.
Here the calculation is simple enough that the details of what you do in the loop make a big difference in the overall time.
All of these are loops. Some ufuncs have a reduce (and accumulate) method that can apply the function repeatedly to the elements of the input array; np.sum, np.cumsum, etc. make use of this. But you can't do that with a general Python function.
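For instance, np.add.accumulate performs exactly this kind of carry-forward loop in compiled code:
import numpy as np

x = np.arange(5)
print(np.add.accumulate(x))   # [ 0  1  3  6 10], same as np.cumsum(x)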
You have to use some sort of compilation tool like numba to perform this sort of loop much faster.
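A minimal numba sketch (assuming numba is installed; the update rule is inlined in the jitted function, since a jitted loop can't call an arbitrary Python lambda at compiled speed):
import numpy as np
from numba import njit

@njit
def build_array_nb(length, V_0, d_t):
    out = np.empty(length)
    x = V_0
    out[0] = x
    for i in range(1, length):
        # the update rule from the question, inlined so numba compiles it
        x = x - x*d_t + x*x/2*d_t*d_t - x*x*x/6*d_t*d_t*d_t
        out[i] = x
    return out

print(build_array_nb(20, 0.3, 0.05))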

How to effectively work on combinations along one dimension in np array

Given S being an n x m matrix, as a numpy array, I want to call function f on pairs of (S[i], S[j]) to calculate a particular value of interest, stored in a matrix with dimensions n x n. In my particular case the function f is commutative so f(x,y) = f(y,x).
With all this in mind I am wondering if I can do any tricks to speed this up as much as I can, n can be fairly large.
When I time the function f, it's around a couple of microseconds, which is as expected. It's a pretty straightforward calculation. Below I show you the timings I got, compared with max() and sum() for reference.
In [19]: %timeit sum(s[11])
4.68 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [20]: %timeit max(s[11])
3.61 µs ± 64.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [21]: %timeit f(s[11], s[21], 50, 10, 1e-5)
1.23 µs ± 7.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: %timeit f(s[121], s[321], 50, 10, 1e-5)
1.26 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
However when I time the overall processing time for a 500x50 sample data (resulting in 500 x 500 /2 = 125K comparisons), the overall time blows up significantly (into minutes). I would have expected something like 0.2-0.3 seconds (1.25E5 * 2E-6 sec/calc).
In [12]: @jit
    ...: def testf(s, n, m, p):
    ...:     tol = 1e-5
    ...:     sim = np.zeros((n,n))
    ...:     for i in range(n):
    ...:         for j in range(n):
    ...:             if i > j:
    ...:                 delta = p[i] - p[j]
    ...:                 if delta < 0:
    ...:                     res = f(s[i], s[j], m, abs(delta), tol)  # <-- max(s[i])
    ...:                 else:
    ...:                     res = f(s[j], s[i], m, delta, tol)  # <-- sum(s[j])
    ...:                 sim[i][j] = res
    ...:                 sim[j][i] = res
    ...:     return sim
In the code above I changed the lines where res is assigned to use max() and sum() instead (shown in the trailing comments) for testing, and the code executes approx. 100 times faster, even though those functions are themselves slower than my function f().
Which brings me to my questions:
Can I avoid the double loop to speed this up? Ideally I want to be able to run this for matrices of size n = 1E5. (Comment: since the max and sum functions work considerably faster, my guess is that the for loops aren't the bottleneck here, but it is still good to know if there is a better way.)
What may cause the severe slowdown with my function, if it's not the double for loop?
EDIT
The specifics of the function f were asked about in some comments. It iterates over two arrays and counts the number of values in the two arrays that are "close enough". I removed the comments and changed some variable names, but the logic is as shown below. It was interesting to note that math.isclose(x, y, rel_tol), which is equivalent to the if-statements I have below, makes the code significantly slower, probably due to the overhead of the library call.
from numba import jit

@jit
def f(arr1, arr2, n, d, rel_tol):
    counter = 0
    i, j, k = 0, 0, 0
    while (i < n and j < n and k < n):
        val = arr1[j] + d
        if abs(arr1[i] - arr2[k]) < rel_tol * max(arr1[i], arr2[k]):
            counter += 1
            i += 1
            k += 1
        elif abs(val - arr2[k]) < rel_tol * max(val, arr2[k]):
            counter += 1
            j += 1
            k += 1
        else:
            # increment the index corresponding to the smallest value
            if arr1[i] <= arr2[k] and arr1[i] <= val:
                if i < n:
                    i += 1
            elif val <= arr1[i] and val <= arr2[k]:
                if j < n:
                    j += 1
            else:
                k += 1
    return counter
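One thing worth checking with code like the above: both the outer driver and f need to compile in nopython mode, because a jitted loop that falls back to object mode (or calls back into interpreted Python) loses most of the benefit. A toy sketch of the shape of the fix (f_toy is a stand-in, not the real f):
import numpy as np
from numba import njit

@njit
def f_toy(a, b, rel_tol):
    # stand-in for the real f: count elementwise near-matches
    count = 0
    for i in range(a.size):
        if abs(a[i] - b[i]) < rel_tol * max(a[i], b[i]):
            count += 1
    return count

@njit
def testf_toy(s, rel_tol):
    n = s.shape[0]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            res = f_toy(s[i], s[j], rel_tol)  # njit-to-njit call stays compiled
            sim[i, j] = res
            sim[j, i] = res
    return sim

sim = testf_toy(np.random.rand(50, 20), 1e-5)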

Why is numpy faster at finding non-zero elements in a matrix?

def nonzero(a):
    row, colum = a.shape
    nonzero_row = np.array([], dtype=int)
    nonzero_col = np.array([], dtype=int)
    for i in range(0, row):
        for j in range(0, colum):
            if a[i, j] != 0:
                nonzero_row = np.append(nonzero_row, i)
                nonzero_col = np.append(nonzero_col, j)
    return (nonzero_row, nonzero_col)
The above code is much slower compared to
(row,col) = np.nonzero(edges_canny)
It would be great if I could get some direction on how to increase the speed, and an explanation of why numpy functions are so much faster.
There are 2 reasons why NumPy functions can outperform Python's types:
The values inside the array are native types, not Python types. This means NumPy doesn't need to go through the abstraction layer that Python has.
NumPy functions are (mostly) written in C. That actually only matters in some cases because a lot of Python functions are also written in C, for example sum.
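As a quick way to see the difference for yourself, you can time Python's C-implemented sum over a list against NumPy's sum over an array (a sketch; absolute numbers will vary by machine):
import timeit
import numpy as np

arr = np.arange(100_000)
lst = arr.tolist()

print(timeit.timeit(lambda: sum(lst), number=100))   # boxed Python ints
print(timeit.timeit(lambda: arr.sum(), number=100))  # native machine ints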
In your case you also do something really inefficient: you append to an array. That's one really expensive operation in the middle of a double loop, and an obvious (and unnecessary) bottleneck right there. You would get amazing speedups just by using lists for nonzero_row and nonzero_col and only converting them to arrays just before you return:
def nonzero_list_based(a):
    row, colum = a.shape
    a = a.tolist()
    nonzero_row = []
    nonzero_col = []
    for i in range(0, row):
        for j in range(0, colum):
            if a[i][j] != 0:
                nonzero_row.append(i)
                nonzero_col.append(j)
    return (np.array(nonzero_row), np.array(nonzero_col))
The timings:
import numpy as np

def nonzero_original(a):
    row, colum = a.shape
    nonzero_row = np.array([], dtype=int)
    nonzero_col = np.array([], dtype=int)
    for i in range(0, row):
        for j in range(0, colum):
            if a[i, j] != 0:
                nonzero_row = np.append(nonzero_row, i)
                nonzero_col = np.append(nonzero_col, j)
    return (nonzero_row, nonzero_col)

arr = np.random.randint(0, 10, (100, 100))

%timeit np.nonzero(arr)
# 315 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit nonzero_original(arr)
# 759 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nonzero_list_based(arr)
# 13.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even though it's roughly 40 times slower than the NumPy operation, it's still nearly 60 times faster than your approach. There's an important lesson here: avoid np.append whenever possible!
One additional reason NumPy outperforms alternative approaches is that it (mostly) uses state-of-the-art algorithms (or "imports" them, e.g. from BLAS/LAPACK/ATLAS/MKL) to solve the problems. These algorithms have been optimized for correctness and speed over years (if not decades). You shouldn't expect to find a faster or even comparable solution.
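Related helpers, also implemented in C, in case they fit better: np.argwhere packs the same indices into a single (N, 2) array, and np.flatnonzero returns indices into the flattened array:
import numpy as np

arr = np.array([[0, 3],
                [7, 0]])
print(np.nonzero(arr))      # (array([0, 1]), array([1, 0]))
print(np.argwhere(arr))     # [[0 1], [1 0]] as one 2-D array
print(np.flatnonzero(arr))  # [1 2]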
