GEMM using Numpy einsum - python

Can a single numpy einsum statement replicate GEMM functionality? Scalar and matrix multiplication seem straightforward, but I haven't found how to get the "+" working. In case it's simpler, D = alpha * A * B + beta * C would be acceptable (preferable, actually):
alpha = 2
beta = 3
A = np.arange(9).reshape(3, 3)
B = A + 1
C = B + 1
left_part = alpha*np.dot(A, B)
print(left_part)
left_part = np.einsum(',ij,jk->ik', alpha, A, B)
print(left_part)

There seems to be some confusion here: np.einsum handles operations that can be cast in the following form: broadcast–multiply–reduce. Element-wise summation is not part of its scope.
The reason why you need einsum (or dot) for the multiplication part is that writing these operations out "naively" may quickly exceed memory or computing resources. Consider, for example, matrix multiplication:
import numpy as np
x, y = np.ones((2, 2000, 2000))
# explicit loop - ridiculously slow
a = sum(x[:,j,np.newaxis] * y[j,:] for j in range(2000))
# explicit broadcast-multiply-reduce: throws MemoryError
a = (x[:,:,np.newaxis] * y[:,np.newaxis,:]).sum(1)
# einsum or dot: fast and memory-saving
a = np.einsum('ij,jk->ik', x, y)
Einstein summation does not cover the element-wise sum itself, but addition distributes over the two terms, so you can write your BLAS-like problem simply as:
d = np.einsum(',ij,jk->ik', alpha, a, b) + np.einsum(',ik', beta, c)
with minimal memory overhead (you can rewrite most of it as in-place operations if you are really concerned about memory) and constant runtime overhead (the cost of two python-to-C calls).
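For instance, a sketch of a lower-overhead variant of the same idea (assuming it is acceptable to allocate one temporary for beta * c, or to overwrite c):
import numpy as np

alpha, beta = 2.0, 3.0
a = np.arange(9, dtype=float).reshape(3, 3)
b = a + 1
c = b + 1

d = np.einsum(',ij,jk->ik', alpha, a, b)
d += beta * c   # or: c *= beta; d += c, to avoid the temporary
print(d)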
So regarding performance, this seems, respectfully, like a case of premature optimization to me: have you actually verified that the split of GEMM-like operations into two separate numpy calls is a bottleneck in your code? If it indeed is, then I suggest the following (in order of increasing involvedness):
Try, carefully, scipy.linalg.blas.dgemm (see the sketch after this list). I would be surprised if you got significantly better performance, since GEMM calls are usually only building blocks themselves.
Try an expression compiler (essentially you are proposing
such a thing) like Theano.
Write your own generalised ufunc using Cython or C.
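For reference, a minimal sketch of the dgemm route (the call follows scipy.linalg.blas.dgemm's signature dgemm(alpha, a, b, beta, c); the "carefully" above presumably refers to memory-layout details, since BLAS prefers Fortran-ordered arrays and the wrapper may copy otherwise):
import numpy as np
from scipy.linalg import blas

alpha, beta = 2.0, 3.0
A = np.arange(9, dtype=float).reshape(3, 3)
B = A + 1
C = B + 1

# D = alpha * A @ B + beta * C in a single BLAS call
D = blas.dgemm(alpha, A, B, beta=beta, c=C)
print(np.allclose(D, alpha * A @ B + beta * C))  # True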

Most efficient way to map function over numpy array

What is the most efficient way to map a function over a numpy array? I am currently doing:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
# Obtain array of square of each element in x
squarer = lambda t: t ** 2
squares = np.array([squarer(xi) for xi in x])
However, this is probably very inefficient, since I am using a list comprehension to construct the new array as a Python list before converting it back to a numpy array. Can we do better?
I've tested all suggested methods plus np.array(list(map(f, x))) with perfplot (a small project of mine).
Message #1: If you can use numpy's native functions, do that.
If the function you're trying to vectorize already is vectorized (like the x**2 example in the original post), using that is much faster than anything else (note the log scale):
If you actually need vectorization, it doesn't really matter much which variant you use.
Code to reproduce the plots:
import numpy as np
import perfplot
import math

def f(x):
    # return math.sqrt(x)
    return np.sqrt(x)

vf = np.vectorize(f)

def array_for(x):
    return np.array([f(xi) for xi in x])

def array_map(x):
    return np.array(list(map(f, x)))

def fromiter(x):
    return np.fromiter((f(xi) for xi in x), x.dtype)

def vectorize(x):
    return np.vectorize(f)(x)

def vectorize_without_init(x):
    return vf(x)

b = perfplot.bench(
    setup=np.random.rand,
    n_range=[2 ** k for k in range(20)],
    kernels=[
        f,
        array_for,
        array_map,
        fromiter,
        vectorize,
        vectorize_without_init,
    ],
    xlabel="len(x)",
)
b.save("out1.svg")
b.show()
How about using numpy.vectorize?
import numpy as np
x = np.array([1, 2, 3, 4, 5])
squarer = lambda t: t ** 2
vfunc = np.vectorize(squarer)
vfunc(x)
# Output : array([ 1, 4, 9, 16, 25])
TL;DR
As noted by @user2357112, a "direct" method of applying the function is always the fastest and simplest way to map a function over Numpy arrays:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
f = lambda x: x ** 2
squares = f(x)
Generally avoid np.vectorize, as it does not perform well, and has (or had) a number of issues. If you are handling other data types, you may want to investigate the other methods shown below.
Comparison of methods
Here are some simple tests to compare a few ways to map a function; this example uses Python 3.6 and NumPy 1.15.4. First, the set-up functions for testing:
import timeit
import numpy as np

f = lambda x: x ** 2
vf = np.vectorize(f)

def test_array(x, n):
    t = timeit.timeit(
        'np.array([f(xi) for xi in x])',
        'from __main__ import np, x, f', number=n)
    print('array: {0:.3f}'.format(t))

def test_fromiter(x, n):
    t = timeit.timeit(
        'np.fromiter((f(xi) for xi in x), x.dtype, count=len(x))',
        'from __main__ import np, x, f', number=n)
    print('fromiter: {0:.3f}'.format(t))

def test_direct(x, n):
    t = timeit.timeit(
        'f(x)',
        'from __main__ import x, f', number=n)
    print('direct: {0:.3f}'.format(t))

def test_vectorized(x, n):
    t = timeit.timeit(
        'vf(x)',
        'from __main__ import x, vf', number=n)
    print('vectorized: {0:.3f}'.format(t))
Testing with five elements (sorted from fastest to slowest):
x = np.array([1, 2, 3, 4, 5])
n = 100000
test_direct(x, n) # 0.265
test_fromiter(x, n) # 0.479
test_array(x, n) # 0.865
test_vectorized(x, n) # 2.906
With 100s of elements:
x = np.arange(100)
n = 10000
test_direct(x, n) # 0.030
test_array(x, n) # 0.501
test_vectorized(x, n) # 0.670
test_fromiter(x, n) # 0.883
And with 1000s of array elements or more:
x = np.arange(1000)
n = 1000
test_direct(x, n) # 0.007
test_fromiter(x, n) # 0.479
test_array(x, n) # 0.516
test_vectorized(x, n) # 0.945
Different versions of Python/NumPy and compiler optimization will have different results, so do a similar test for your environment.
There are numexpr, numba and cython around; the goal of this answer is to take these possibilities into consideration.
But first let's state the obvious: no matter how you map a Python function onto a numpy array, it stays a Python function, which means for every evaluation:
the numpy array element must be converted to a Python object (e.g. a Python float);
all calculations are done with Python objects, which incurs the overhead of the interpreter, dynamic dispatch and immutable objects.
So which machinery is used to actually loop through the array doesn't play a big role because of the overhead mentioned above: it stays much slower than using numpy's built-in functionality.
Let's take a look at the following example:
# numpy-functionality
def f(x):
    return x + 2*x*x + 4*x*x*x

# python-function as ufunc
import numpy as np
vf = np.vectorize(f)
vf.__name__ = "vf"
np.vectorize is picked as a representative of the pure-python function class of approaches. Using perfplot (see code in the appendix of this answer) we get the following running times:
We can see that the numpy approach is 10x-100x faster than the pure Python version. The drop in performance for bigger array sizes is probably because the data no longer fits the cache.
It is also worth mentioning that vectorize uses a lot of memory, so memory usage is often the bottleneck (see the related SO question). Also note that numpy's documentation on np.vectorize states that it is "provided primarily for convenience, not for performance".
When performance is desired, other tools should be used; besides writing a C extension from scratch, there are the following possibilities.
One often hears that numpy's performance is as good as it gets because it is pure C under the hood. Yet there is a lot of room for improvement!
The vectorized numpy version uses a lot of additional memory and memory accesses. The numexpr library tries to tile the numpy arrays and thus achieve better cache utilization:
# less cache misses than numpy-functionality
import numexpr as ne
def ne_f(x):
    return ne.evaluate("x+2*x*x+4*x*x*x")
Leads to the following comparison:
I cannot explain everything in the plot above: we can see bigger overhead for the numexpr library at the beginning, but because it utilizes the cache better it is about 10 times faster for bigger arrays!
Another approach is to jit-compile the function and thus get a real pure-C ufunc. This is numba's approach:
# runtime generated C-function as ufunc
import numba as nb
@nb.vectorize(target="cpu")
def nb_vf(x):
    return x + 2*x*x + 4*x*x*x
It is 10 times faster than the original numpy-approach:
However, the task is embarrassingly parallel, so we could also use prange to calculate the loop in parallel:
@nb.njit(parallel=True)
def nb_par_jitf(x):
    y = np.empty(x.shape)
    for i in nb.prange(len(x)):
        y[i] = x[i] + 2*x[i]*x[i] + 4*x[i]*x[i]*x[i]
    return y
As expected, the parallel function is slower for smaller inputs, but faster (almost factor 2) for larger sizes:
While numba specializes on optimizing operations with numpy-arrays, Cython is a more general tool. It is more complicated to extract the same performance as with numba - often it is down to llvm (numba) vs local compiler (gcc/MSVC):
%%cython -c=/openmp -a
import numpy as np
import cython
# single core:
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_f(double[::1] x):
    y_out = np.empty(len(x))
    cdef Py_ssize_t i
    cdef double[::1] y = y_out
    for i in range(len(x)):
        y[i] = x[i] + 2*x[i]*x[i] + 4*x[i]*x[i]*x[i]
    return y_out

# parallel:
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_par_f(double[::1] x):
    y_out = np.empty(len(x))
    cdef double[::1] y = y_out
    cdef Py_ssize_t i
    cdef Py_ssize_t n = len(x)
    for i in prange(n, nogil=True):
        y[i] = x[i] + 2*x[i]*x[i] + 4*x[i]*x[i]*x[i]
    return y_out
Cython results in somewhat slower functions:
Conclusion
Obviously, testing only one function doesn't prove anything. Also one should keep in mind that, for the chosen example function, memory bandwidth was the bottleneck for sizes larger than 10^5 elements, which is why numba, numexpr and cython showed the same performance in this region.
In the end, the ultimate answer depends on the type of function, hardware, Python distribution and other factors. For example, the Anaconda distribution uses Intel's VML for numpy's functions and thus outperforms numba (unless it uses SVML, see this SO post) easily for transcendental functions like exp, sin, cos and similar; see e.g. the following SO post.
Yet from this investigation and from my experience so far, I would say that numba seems to be the easiest tool with the best performance, as long as no transcendental functions are involved.
Plotting running times with perfplot-package:
import perfplot
perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2**k for k in range(0, 24)],
    kernels=[
        f,
        vf,
        ne_f,
        nb_vf, nb_par_jitf,
        cy_f, cy_par_f,
    ],
    logx=True,
    logy=True,
    xlabel='len(x)',
)
Just use the function on the array directly:
squares = squarer(x)
Arithmetic operations on arrays are automatically applied elementwise, with efficient C-level loops that avoid all the interpreter overhead that would apply to a Python-level loop or comprehension.
Most of the functions you'd want to apply to a NumPy array elementwise will just work, though some may need changes. For example, a function containing an if statement doesn't work elementwise. You'd want to convert such functions to use constructs like numpy.where:
def using_if(x):
    if x < 5:
        return x
    else:
        return x**2
becomes
def using_where(x):
    return numpy.where(x < 5, x, x**2)
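For example, a quick check of the rewritten function on a small array (the definition is restated so the snippet runs on its own):
import numpy

def using_where(x):
    return numpy.where(x < 5, x, x**2)

x = numpy.array([1, 3, 5, 7, 9])
print(using_where(x))  # [ 1  3 25 49 81]: values below 5 are kept, the rest squared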
It seems that no one has mentioned a built-in factory method for producing a ufunc in the numpy package: np.frompyfunc. I have tested it against np.vectorize and it outperformed np.vectorize by about 20-30%. Of course it will not perform as well as prescribed C code or even numba (which I have not tested), but it can be a better alternative than np.vectorize:
f = lambda x, y: x * y
f_arr = np.frompyfunc(f, 2, 1)
vf = np.vectorize(f)
arr = np.linspace(0, 1, 10000)
%timeit f_arr(arr, arr) # 307ms
%timeit vf(arr, arr) # 450ms
I have also tested larger samples, and the improvement is proportional. See also the documentation.
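One caveat worth noting (a detail not covered by the timing above, but documented behaviour of np.frompyfunc): the resulting ufunc returns arrays with dtype=object, so you may want to cast the result back to a numeric dtype:
import numpy as np

f = lambda x, y: x * y
f_arr = np.frompyfunc(f, 2, 1)

arr = np.linspace(0, 1, 10)
out = f_arr(arr, arr)          # dtype=object
out = out.astype(np.float64)   # cast back to a numeric dtype if needed
print(out.dtype)               # float64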
Edit: the original answer was misleading; np.sqrt was applied directly to the array, just with a small overhead.
In multidimensional cases where you want to apply a builtin function that operates on a 1d array, numpy.apply_along_axis is a good choice, also for more complex function compositions from numpy and scipy.
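For instance, a small sketch (np.linalg.norm here is just an arbitrary function that consumes a 1d array, picked for illustration):
import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)

# apply a function that expects a 1d array to every row of A
row_norms = np.apply_along_axis(np.linalg.norm, 1, A)
print(row_norms)  # one value per row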
Previous misleading statement:
Adding the method:
def along_axis(x):
    return np.apply_along_axis(f, 0, x)
to the perfplot code gives performance results close to np.sqrt.
I believe that in newer versions of numpy (I use 1.13) you can simply call the function by passing the numpy array to the function you wrote for scalar types; it will automatically apply the function call to each element of the numpy array and return another numpy array:
>>> import numpy as np
>>> squarer = lambda t: t ** 2
>>> x = np.array([1, 2, 3, 4, 5])
>>> squarer(x)
array([ 1, 4, 9, 16, 25])
As mentioned in this post, just use generator expressions like so:
numpy.fromiter((<some_func>(x) for x in <something>),<dtype>,<size of something>)
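For example, with the squaring function from the question (a minimal concrete instance of the template above):
import numpy as np

x = np.array([1, 2, 3, 4, 5])
squares = np.fromiter((xi ** 2 for xi in x), dtype=x.dtype, count=len(x))
print(squares)  # [ 1  4  9 16 25]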
All the answers above compare well, but if you need to use a custom function for mapping, have a numpy.ndarray, and need to retain the shape of the array, the following may help.
I have compared just two approaches, but both retain the shape of the ndarray. I used an array with 1 million entries for the comparison. Here I use the square function, which is also built into numpy and gives a great performance boost; since something was needed for the example, you can use a function of your choice.
import numpy, time
def timeit():
    y = numpy.arange(1000000)
    now = time.time()
    numpy.array([x * x for x in y.reshape(-1)]).reshape(y.shape)
    print(time.time() - now)
    now = time.time()
    numpy.fromiter((x * x for x in y.reshape(-1)), y.dtype).reshape(y.shape)
    print(time.time() - now)
    now = time.time()
    numpy.square(y)
    print(time.time() - now)
Output
>>> timeit()
1.162431240081787 # list comprehension and then building numpy array
1.0775556564331055 # from numpy.fromiter
0.002948284149169922 # using inbuilt function
Here you can clearly see that numpy.fromiter works well compared to the simple approach, and if a built-in function is available, please use that.
Use numpy.fromfunction(function, shape, **kwargs)
See "https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfunction.html"

Making a matrix symmetric, in-place vs. out-of-place

I have a matrix which is supposed to be symmetric (it is the inverse of a symmetric matrix), but it is not exactly symmetric due to numerical errors in the inversion, etc.
So I add a step of making the matrix symmetric (by a = 0.5*(a + a')), and I see a numerical disaster if I do it in-place (out-of-place is OK). Code:
import numpy as np
def check_sym(x):
    print("||a-a'||^2 = %e" % np.sum((x - x.T)**2))
# make a symmetric matrix
dim = 100
a = np.random.randn(dim,dim)
a = np.matmul(a, a.T)
b = a.copy()
check_sym(a)
print("symmetrizing in-place")
a += a.T
a *= .5
check_sym(a)
print("symmetrizing out-of-place")
b = .5 * (b + b.T)
check_sym(b)
And the output is:
||a-a'||^2 = 1.184044e-26
symmetrizing in-place
||a-a'||^2 = 7.313593e+04
symmetrizing out-of-place
||a-a'||^2 = 0.000000e+00
Note that for lower dimensions (e.g. dim=10) the problem does not appear.
EDIT: More information can be obtained by looking at a - a' after the in-place version.
The error comes from the line a += a.T. It is a known problem with in-place operations on views (I cannot find the proper piece of documentation right now), but to quote the scipy lecture notes:
The transposition is a view.
As a result, the following code is wrong and will not make a matrix symmetric:
a += a.T
It will work for small arrays (because of buffering) but fail for large ones, in unpredictable ways.
The reason is that while a is being updated with a.T, a.T itself is changing (since it is a memory view of a), and thus some coordinates of a are updated incorrectly.
If you want to symmetrize a matrix in-place, you could do the following:
a = np.random.rand(4,4)
a[np.tril_indices_from(a)] = a.T[np.tril_indices_from(a)]
Or, if you want to stick to your notation:
a += a.T.copy()
since copy will create a temporary copy of a.T which is not going to be updated.
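To see why the in-place version misbehaves, note that the transpose is only a view onto the same memory; a quick check (np.shares_memory is available in NumPy 1.11+):
import numpy as np

a = np.random.randn(4, 4)
print(np.shares_memory(a, a.T))         # True: a.T is a view, not a copy
print(np.shares_memory(a, a.T.copy()))  # False: the copy is safe to add in-place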

Addition speed of numpy arrays with different contiguous-type

Numpy arrays are stored with different contiguous types (C- and F-). When using numpy.swapaxes(), the contiguous type gets changed. I need to add two multidimensional arrays (3d to be more specific), one of which comes from another array with swapped axes. What I've noticed is that when the first axis gets swapped with the last axis, in the case of a 3d array, the contiguous type changes from C- to F-. And adding two arrays with different contiguous type is extremely slow (~6 times slower than adding two C-contiguous arrays). However, if other axes are swapped (0-1 or 1-2), the resulting array would have false flags for both C- and F- contiguous (non-contiguous). The weird thing to me is that adding one array of C-configuous and one array neither C- nor F- contiguous, is in fact only slightly slower than adding two arrays of same type. Here are my two questions:
Why does the behavior seem to be different for C- & F-contiguous array addition versus C- & non-contiguous array addition? Is it caused by a different rearranging mechanism, or simply because the rearranging distance between C- and F-contiguous is the longest of all possible axis orders?
If I have to add a C-contiguous array and an F-contiguous/non-contiguous array, what is the best way to speed this up?
Below is a minimal example of what I encountered. The four printed durations on my computer are 2.0s (C-contiguous + C-contiguous), 12.4s (C-contiguous + F-contiguous), 3.4s (C-contiguous + non-contiguous) and 3.3s (C-contiguous + non-contiguous).
import numpy as np
import time
np.random.seed(1234)
a = np.random.random((300, 400, 500)) # C-contiguous
b = np.swapaxes(np.random.random((500, 400, 300)), 0, 2) # F-contiguous
c = np.swapaxes(np.random.random((300, 500, 400)), 1, 2) # Non-contiguous
d = np.swapaxes(np.random.random((400, 300, 500)), 0, 1) # Non-contiguous
t = time.time()
for n in range(10):
    result = a + a
print(time.time() - t)

t = time.time()
for n in range(10):
    result = a + b
print(time.time() - t)

t = time.time()
for n in range(10):
    result = a + c
print(time.time() - t)

t = time.time()
for n in range(10):
    result = a + d
print(time.time() - t)
These types (C and F) denote whether an array is stored in row-major order (C, as in the C language, which uses row-major storage) or column-major order (F, as in the Fortran language, which uses column-major storage).
By themselves the two layouts do not differ in speed; it is just a storage convention, and either one, used consistently, performs the same.
However, what makes an enormous difference is whether the arrays you combine are contiguous or not. If they are contiguous you will get good timings because of caching effects, vectorization and the other optimizations that the compiler might apply.
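Regarding the second question, one practical option is to convert the non-C-contiguous operand to C order first (a sketch, not a guaranteed win: it only pays off if the converted array is reused or if the one-off copy is cheaper than the repeated slow mixed-order additions):
import numpy as np

a = np.random.random((300, 400, 500))                     # C-contiguous
b = np.swapaxes(np.random.random((500, 400, 300)), 0, 2)  # F-contiguous view

b_c = np.ascontiguousarray(b)  # one explicit copy into C order
result = a + b_c               # now both operands are C-contiguous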

Iterating with numpy with different indexes

Say I have a for loop using range as shown below. Is there a good way to eliminate the for loop and use numpy arrays only?
y = [146, 96, 59, 133, 192, 127, 79, 186, 272, 155, 98, 219]
At = 3
Bt = 2
Aindex = []
Bindex = []
for i in range(len(y)-1):
    A = At
    B = Bt
    At = y[i] / y[i] + 5 * (A + B)
    Aindex.append(At)
    Bt = (At - A) + y[i+1] * B
    Bindex.append(Bt)
I would use something like
c=len(y)-1
Aindex=y[:c]/y[:c]+5* (A + B)
But A and B update in the loop. I also do not know how to vectorize y[i+1] in the Bt equation.
You asked something similar in Iterating over a numpy array with enumerate like function, except there A and B did not change.
Strictly speaking you can't vectorize this case, because of that running update of At and Bt. This is an iterative problem, where the i+1 term depends on the i term, and most numpy vector operations operate (effectively) on all terms at once.
Could you rework the problem so it makes use of cumsum and/or cumprod? Those are built-in methods that step through a vector (or an axis of an array), calculating a cumulative sum or product. numpy's generalization of this is ufunc.accumulate (see the small sketch after the link below).
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.accumulate.html
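For illustration, here is how accumulate expresses a cumulative recurrence (a sketch with np.add and np.multiply; whether your particular recurrence can be recast this way depends on its exact form):
import numpy as np

y = np.array([146, 96, 59, 133, 192, 127, 79, 186, 272, 155, 98, 219], dtype=float)

# cumsum / cumprod are special cases of ufunc.accumulate
print(np.cumsum(y))               # same result as np.add.accumulate(y)
print(np.multiply.accumulate(y))  # running product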
In the meantime, I'd suggest making more use of arrays
y = np.array(y)
At = np.zeros(y.shape)
Bt = np.zeros(y.shape)
At[0] = 3
Bt[0] = 2
for i in range(len(y)-1):
    A, B = At[i], Bt[i]
    At[i+1] = y[i] / y[i] + 5 * (A + B)
    Bt[i+1] = (At[i+1] - A) + y[i+1] * B
numpy uses an nditer to step through several arrays (including an output array) together: http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html. I suspect it is more useful when working with multidimensional arrays; for your 1d arrays it is probably overkill. Still, if speed becomes essential, you could work through that documentation and implement the problem in cython.

Speed up python code for computing matrix cofactors

As part of a complex task, I need to compute matrix cofactors. I did this in a straightforward way using this nice code for computing matrix minors. Here is my code:
import numpy as np

def matrix_cofactor(matrix):
    C = np.zeros(matrix.shape)
    nrows, ncols = C.shape
    for row in range(nrows):
        for col in range(ncols):
            minor = matrix[np.array(list(range(row)) + list(range(row+1, nrows)))[:, np.newaxis],
                           np.array(list(range(col)) + list(range(col+1, ncols)))]
            C[row, col] = (-1)**(row+col) * np.linalg.det(minor)
    return C
It turns out that this matrix cofactor code is the bottleneck, and I would like to optimize the code snippet above. Any ideas as to how to do this?
If your matrix is invertible, the cofactor is related to the inverse:
def matrix_cofactor(matrix):
    return np.linalg.inv(matrix).T * np.linalg.det(matrix)
This gives large speedups (~1000x for 50x50 matrices). The main reason is fundamental: this is an O(n^3) algorithm, whereas the minor-det-based one is O(n^5) (there are n^2 minors, and each determinant costs O(n^3)).
This probably means that also for non-invertible matrices there is some clever way to calculate the cofactor (i.e., not the mathematical formula that you use above, but some other equivalent definition).
If you stick with the det-based approach, what you can do is the following:
The majority of the time seems to be spent inside det. (Check out line_profiler to find this out yourself.) You can try to speed that part up by linking Numpy with the Intel MKL, but other than that, there is not much that can be done.
You can speed up the other part of the code like this:
minor = np.zeros([nrows-1, ncols-1])
for row in range(nrows):
    for col in range(ncols):
        minor[:row, :col] = matrix[:row, :col]
        minor[row:, :col] = matrix[row+1:, :col]
        minor[:row, col:] = matrix[:row, col+1:]
        minor[row:, col:] = matrix[row+1:, col+1:]
        ...
This gains some 10-50% of total runtime depending on the size of your matrices. The original code has Python range and list manipulations, which are slower than direct slice indexing. You could also try to be more clever and copy only the parts of the minor that actually change; however, already after the above change, close to 100% of the time is spent inside numpy.linalg.det, so further optimization of the other parts does not make much sense.
The calculation of np.array(range(row) + range(row+1, nrows))[:, np.newaxis] does not depend on col, so you could move it outside the inner loop and cache the value. Depending on the number of columns you have, this might give a small optimization.
Instead of using the inverse and determinant, I'd suggest using the SVD:
def cofactors(A):
    U, sigma, Vt = np.linalg.svd(A)
    N = len(sigma)
    g = np.tile(sigma, N)
    g[::(N+1)] = 1
    G = np.diag(-(-1)**N * np.prod(np.reshape(g, (N, N)), 1))
    return U @ G @ Vt
With sympy, the cofactor matrix is the transpose of the adjugate:
from sympy import *
A = Matrix([[1, 2, 0], [0, 3, 0], [0, 7, 1]])
A.adjugate().T
And the output (which is the cofactor matrix) is:
Matrix([
[ 3, 0, 0],
[-2, 1, -7],
[ 0, 0, 3]])
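As a cross-check, the inverse-based identity from the earlier answer gives the same result with plain numpy (this assumes the matrix is invertible):
import numpy as np

A = np.array([[1, 2, 0], [0, 3, 0], [0, 7, 1]], dtype=float)

# cofactor matrix via C = det(A) * inv(A).T
C = np.linalg.det(A) * np.linalg.inv(A).T
print(np.round(C))  # matches the sympy result above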
