Speeding up a large number of small matrix multiplications

Speeding up a large number of small matrix multiplications - python

I have a large number of 2x2 matrices that I need multiplied together.
I can initiate the general shape of the problem as
import numpy as np
import time
A_dim = 6*6
B_dim = 2**8
C_dim = B_dim
A = np.random.rand(A_dim,A_dim,2,2)
B = np.random.rand(B_dim,2,2)
C = np.random.rand(C_dim,2,2)
tic = time.perf_counter()
X = A[None,None,:,:,:,:] # B[:,None,None,None,:,:] # A[None,None,:,:,:,:] # C[None,:,None,None,:,:]
toc = time.perf_counter()
print(f"matrix multiplication took {toc - tic:0.4f} seconds")
where I get
matrix multiplication took 14.4403 seconds
This seems to be a vectorized implementation, but is there anything I can do to speed this up? My native Numpy library runs this on only one core, so possibly if I figured out how to get OpenBLAS to work, this would be faster? The problem is that each matrix operation takes a very small amount of time to complete. Is there a better way to construct such a multidimensional array?

Related

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

I'm looking to parallelize multiple 1d FFTs using CUDA. I'm working on a GTX 1050Ti with CUDA 6.1.
For instance in the code I attached, I have a 3d input array 'data', and I want to do 1d FFTs over the second dimension of this array. The purpose is, of course, to speed up the execution time by an order of magnitude.
I'm able to use Python's scikit-cuda's cufft package to run a batch of 1 1d FFT and the results match with NumPy's FFT. The problem comes when I go to a real batch size. There, I'm not able to match the NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the code attached, parameter 'singleFFT' controls whether we schedule a batch of 1 or many. Help in correcting the output FFT and also speeding up execution further (if possible) will be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray
# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0
if (1 == singleFFT):
data_t = data[0,:,0]
fftAxis = 0
BATCH = np.int32(1)
else:
data_t = data
fftAxis = 1
BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)
data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line in the code matches the result of NumPy's FFT with cuFFT. It could be seen with singleFFT=1, the result is True, while for singleFFT=0 (i.e. batch of many 1d FFTs), the result is False.

Post my attempts, I would want to conclude that:
Using cufft library from skcuda is a bit tricky and to get to the correct FFT output might take a long time, in development. I also noticed that there wasn't an order of magnitude difference in execution time between NumPy's FFT and cufft's FFT (from skcuda)
Using CuPy and arranging your data in a format so that the FFT dimension is laid out in contiguous memory gives an order of magnitude improvement in the FFT compute time. For my case, the order was a little better than 10!
Using CuPy for FFTs is a great option if one wants to stick to Py-based development only. Also the to and fro from C to Python when writing C GPU kernels is an added overhead which is very conveniently resolved with CuPy. Though CuPy itself calls laying out the plan and calling the FFT exec engine internally.

Optimize large numpy array multiplication with sparse matrices

I am doing a statistical calculation of the following form:
where
d (data) is a matrix of size [i=30, 100x100]
m (model) is a matrix of size [i=30, 100x100]
C-1 (covariance) is a 1002 by 1002 symmetric matrix
d and C-1 are constants and do not have any structure apart from C-1 being symmetric. m changes, but is always sparse. The output of the calculation is just a floating-point number.
I need to perform this calculation many times in a Monte Carlo simulation, so speed is of the essence. Using sparse array multiplication techniques with m speeds things up considerably over a naive matrix dot product. However, each iteration of the slow function below still takes about 0.1 seconds to run. The vast majority of time (>98%) is spent in the matrix multiplications, and not the model generation function (generate_model). I would like to speed this up by an order of magnitude if possible.
The code and output are pasted below.
Things that do not work include:
Upgrading to the Intel MKL linear algebra routines (few percent speedup, surprisingly small)
Using numpy.linalg.multi_dot
Taking advantage of the fact that C-1 is symmetric (this does not work even in principle, see this mathoverflow question)
Things that sort of work include:
Precalculating C-1d, gives a ~40% speedup
How can I speed up this code? Solutions relying on packages like cython, numba, etc are most welcome as well as "standard" scipy/numpy solutions. Thanks in advance!
from __future__ import division
import numpy as np
import scipy.sparse
import sys
import timeit
def generate_model(n, size, hw = 8):
#model for the data--squares at random locations
output = np.zeros((n, size, size))
for i in range(n):
randx = np.random.randint(hw, size-hw)
randy = np.random.randint(hw, size-hw)
output[i,(randx-hw):(randx+hw), (randy-hw):(randy+hw)]=np.random.random((hw*2, hw*2))
return output
def slow_function(datacube, invcovmatrix, size):
model = generate_model(30, size)
output = 0
for i in range(model.shape[0]):
data = datacube[i,:,:].flatten()
mu = model[i,:,:].flatten()
sparsemu = scipy.sparse.csr_matrix(mu)
output += -0.5* (
np.float(-2.0*sparsemu.dot(invcovmatrix).dot(data)) +
np.float(sparsemu.dot(sparsemu.dot(invcovmatrix).T))
)
return output
def wrapper(func, *args, **kwargs):
def wrapped():
return func(*args, **kwargs)
return wrapped
if __name__ == "__main__":
size = 100
invcovmat = np.random.random((size**2, size**2))
#make symmetric for consistency
invcovmat = (invcovmat+invcovmat.T)/2
datacube = np.random.random((30, size, size))
#TIMING
wrapped = wrapper(slow_function, datacube, invcovmat, size)
times = []
for i in range(20):
print i
times.append(timeit.timeit(wrapped, number = 1))
times.sort()
print '\n', np.mean(times[0:3]), ' s/iteration; best of 3'
Output:
0.10408163070678711 s/iteration; best of 3

Speeding up Evaluation of Sympy Symbolic Expressions

A Python program I am currently working on (Gaussian process classification) is bottlenecking on evaluation of Sympy symbolic matrices, and I can't figure out what I can, if anything, do to speed it up. Other parts of the program I've already ensured are typed properly (in terms of numpy arrays) so calculations between them are properly vectorised, etc.
I looked into Sympy's codegen functions a bit (autowrap, binary_function) in particular, but because my within my ImmutableMatrix object itself are partial derivatives over elements of a symbolic matrix, there is a long list of 'unhashable' things which prevent me from using the codegen functionality.
Another possibility I looked into was using Theano - but after some initial benchmarks, I found that while it build the initial partial derivative symbolic matrices much quicker, it seemed to be a few orders of magnitude slower at evaluation, the opposite of what I was seeking.
Below is a working, extracted snippet of the code I am currently working on.
import theano
import sympy
from sympy.utilities.autowrap import autowrap
from sympy.utilities.autowrap import binary_function
import numpy as np
import math
from datetime import datetime
# 'Vectorized' cdist that can handle symbols/arbitrary types - preliminary benchmarking put it at ~15 times faster than python list comprehension, but still notably slower (forgot at the moment) than cdist, of course
def sqeucl_dist(x, xs):
m = np.sum(np.power(
np.repeat(x[:,None,:], len(xs), axis=1) -
np.resize(xs, (len(x), xs.shape[0], xs.shape[1])),
2), axis=2)
return m
def build_symbolic_derivatives(X):
# Pre-calculate derivatives of inverted matrix to substitute values in the Squared Exponential NLL gradient
f_err_sym, n_err_sym = sympy.symbols("f_err, n_err")
# (1,n) shape 'matrix' (vector) of length scales for each dimension
l_scale_sym = sympy.MatrixSymbol('l', 1, X.shape[1])
# K matrix
print("Building sympy matrix...")
eucl_dist_m = sqeucl_dist(X/l_scale_sym, X/l_scale_sym)
m = sympy.Matrix(f_err_sym**2 * math.e**(-0.5 * eucl_dist_m)
+ n_err_sym**2 * np.identity(len(X)))
# Element-wise derivative of K matrix over each of the hyperparameters
print("Getting partial derivatives over all hyperparameters...")
pd_t1 = datetime.now()
dK_df = m.diff(f_err_sym)
dK_dls = [m.diff(l_scale_sym) for l_scale_sym in l_scale_sym]
dK_dn = m.diff(n_err_sym)
print("Took: {}".format(datetime.now() - pd_t1))
# Lambdify each of the dK/dts to speed up substitutions per optimization iteration
print("Lambdifying ")
l_t1 = datetime.now()
dK_dthetas = [dK_df] + dK_dls + [dK_dn]
dK_dthetas = sympy.lambdify((f_err_sym, l_scale_sym, n_err_sym), dK_dthetas, 'numpy')
print("Took: {}".format(datetime.now() - l_t1))
return dK_dthetas
# Evaluates each dK_dtheta pre-calculated symbolic lambda with current iteration's hyperparameters
def eval_dK_dthetas(dK_dthetas_raw, f_err, l_scales, n_err):
l_scales = sympy.Matrix(l_scales.reshape(1, len(l_scales)))
return np.array(dK_dthetas_raw(f_err, l_scales, n_err), dtype=np.float64)
dimensions = 3
X = np.random.rand(50, dimensions)
dK_dthetas_raw = build_symbolic_derivatives(X)
f_err = np.random.rand()
l_scales = np.random.rand(3)
n_err = np.random.rand()
t1 = datetime.now()
dK_dthetas = eval_dK_dthetas(dK_dthetas_raw, f_err, l_scales, n_err) # ~99.7%
print(datetime.now() - t1)
In this example, 5 50x50 symbolic matrices are evaluated, i.e. only 12,500 elements, taking 7 seconds. I've spent quite some time looking for resources on speeding operations like this up, and trying to translate it into Theano (at least until I found its evaluation slower in my case) and having no luck there either.
Any help greatly appreciated!

Numba slower than pure Python in frequency counting

Given a data matrix with discrete entries represented as a 2D numpy array, I'm trying to compute the observed frequencies of some features (the columns) only looking at some instances (the rows of the matrix).
I can do that quite easily with numpy using bincount applied to each slice after having done some fancy slicing. Doing that in pure Python, using an external data structure as a count accumulator, is a double loop in C-style.
import numpy
import numba
try:
from time import perf_counter
except:
from time import time
perf_counter = time
def estimate_counts_numpy(data,
instance_ids,
feature_ids):
"""
WRITEME
"""
#
# slicing the data array (probably memory consuming)
curr_data_slice = data[instance_ids, :][:, feature_ids]
estimated_counts = []
for feature_slice in curr_data_slice.T:
counts = numpy.bincount(feature_slice)
#
# checking just for the all 0 case:
# this is not stable for not binary datasets TODO: fix it
if counts.shape[0] < 2:
counts = numpy.append(counts, [0], 0)
estimated_counts.append(counts)
return estimated_counts
#numba.jit(numba.types.int32[:, :](numba.types.int8[:, :],
numba.types.int32[:],
numba.types.int32[:],
numba.types.int32[:],
numba.types.int32[:, :]))
def estimate_counts_numba(data,
instance_ids,
feature_ids,
feature_vals,
estimated_counts):
"""
WRITEME
"""
#
# actual counting
for i, feature_id in enumerate(feature_ids):
for instance_id in instance_ids:
estimated_counts[i][data[instance_id, feature_id]] += 1
return estimated_counts
if __name__ == '__main__':
#
# creating a large synthetic matrix, testing for performance
rand_gen = numpy.random.RandomState(1337)
n_instances = 2000
n_features = 2000
large_matrix = rand_gen.binomial(1, 0.5, (n_instances, n_features))
#
# random indexes too
n_sample = 1000
rand_instance_ids = rand_gen.choice(n_instances, n_sample, replace=False)
rand_feature_ids = rand_gen.choice(n_features, n_sample, replace=False)
binary_feature_vals = [2 for i in range(n_features)]
#
# testing
numpy_start_t = perf_counter()
e_counts_numpy = estimate_counts_numpy(large_matrix,
rand_instance_ids,
rand_feature_ids)
numpy_end_t = perf_counter()
print('numpy done in {0} secs'.format(numpy_end_t - numpy_start_t))
binary_feature_vals = numpy.array(binary_feature_vals)
#
#
curr_feature_vals = binary_feature_vals[rand_feature_ids]
#
# creating a data structure to hold the slices
# (with numba I cannot use list comprehension?)
# e_counts_numba = [[0 for val in range(feature_val)]
# for feature_val in
# curr_feature_vals]
e_counts_numba = numpy.zeros((n_sample, 2), dtype='int32')
numba_start_t = perf_counter()
estimate_counts_numba(large_matrix,
rand_instance_ids,
rand_feature_ids,
binary_feature_vals,
e_counts_numba)
numba_end_t = perf_counter()
print('numba done in {0} secs'.format(numba_end_t - numba_start_t))
These are the times I get while running the above code:
numpy done in 0.2863295429997379 secs
numba done in 11.55551904299864 secs
My point here is that my implementation is even slower when I try to apply a jit with numba, so I highly suspect I am messing things up.

The reason your function is slow is because Numba has fallen back to object mode to compile the loop.
There are two problems:
Numba doesn't yet support chained indexing of multidimensional arrays, so you need to rewrite this:
estimated_counts[i][data[instance_id, feature_id]]
into this:
estimated_counts[i, data[instance_id, feature_id]]
Your explicit type signature is incorrect. All of your input arrays are actually int64, rather than int8/int32. Rather than fix your signature, you can rely on Numba's automatic JIT to detect the argument types and compile the right version. All you have to do is change the decorator to just #numba.jit. Just make sure you call the function once before you benchmark if you don't want to include compilation time.
With these changes, I benchmark Numba to be about 15% faster than NumPy for this function.

Rewriting a for loop in pure NumPy to decrease execution time

I recently asked about trying to optimise a Python loop for a scientific application, and received an excellent, smart way of recoding it within NumPy which reduced execution time by a factor of around 100 for me!
However, calculation of the B value is actually nested within a few other loops, because it is evaluated at a regular grid of positions. Is there a similarly smart NumPy rewrite to shave time off this procedure?
I suspect the performance gain for this part would be less marked, and the disadvantages would presumably be that it would not be possible to report back to the user on the progress of the calculation, that the results could not be written to the output file until the end of the calculation, and possibly that doing this in one enormous step would have memory implications? Is it possible to circumvent any of these?
import numpy as np
import time
def reshape_vector(v):
b = np.empty((3,1))
for i in range(3):
b[i][0] = v[i]
return b
def unit_vectors(r):
return r / np.sqrt((r*r).sum(0))
def calculate_dipole(mu, r_i, mom_i):
relative = mu - r_i
r_unit = unit_vectors(relative)
A = 1e-7
num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i)
den = np.sqrt(np.sum(relative*relative, 0))**3
B = np.sum(num/den, 1)
return B
N = 20000 # number of dipoles
r_i = np.random.random((3,N)) # positions of dipoles
mom_i = np.random.random((3,N)) # moments of dipoles
a = np.random.random((3,3)) # three basis vectors for this crystal
n = [10,10,10] # points at which to evaluate sum
gamma_mu = 135.5 # a constant
t_start = time.clock()
for i in range(n[0]):
r_frac_x = np.float(i)/np.float(n[0])
r_test_x = r_frac_x * a[0]
for j in range(n[1]):
r_frac_y = np.float(j)/np.float(n[1])
r_test_y = r_frac_y * a[1]
for k in range(n[2]):
r_frac_z = np.float(k)/np.float(n[2])
r_test = r_test_x +r_test_y + r_frac_z * a[2]
r_test_fast = reshape_vector(r_test)
B = calculate_dipole(r_test_fast, r_i, mom_i)
omega = gamma_mu*np.sqrt(np.dot(B,B))
# write r_test, B and omega to a file
frac_done = np.float(i+1)/(n[0]+1)
t_elapsed = (time.clock()-t_start)
t_remain = (1-frac_done)*t_elapsed/frac_done
print frac_done*100,'% done in',t_elapsed/60.,'minutes...approximately',t_remain/60.,'minutes remaining'

One obvious thing you can do is replace the line
r_test_fast = reshape_vector(r_test)
with
r_test_fast = r_test.reshape((3,1))
Probably won't make any big difference in performance, but in any case it makes sense to use the numpy builtins instead of reinventing the wheel.
Generally speaking, as you probably have noticed by now, the trick with optimizing numpy is to express the algorithm with the help of numpy whole-array operations or at least with slices instead of iterating over each element in python code. What tends to prevent this kind of "vectorization" is so-called loop-carried dependencies, i.e. loops where each iteration is dependent on the result of a previous iteration. Looking briefly at your code, you have no such thing, and it should be possible to vectorize your code just fine.
EDIT: One solution
I haven't verified this is correct, but should give you an idea of how to approach it.
First, take the cartesian() function, which we'll use. Then
def calculate_dipole_vect(mus, r_i, mom_i):
# Treat each mu sequentially
Bs = []
omega = []
for mu in mus:
rel = mu - r_i
r_norm = np.sqrt((rel * rel).sum(1))
r_unit = rel / r_norm[:, np.newaxis]
A = 1e-7
num = A*(3*np.sum(mom_i * r_unit, 0)*r_unit - mom_i)
den = r_norm ** 3
B = np.sum(num / den[:, np.newaxis], 0)
Bs.append(B)
omega.append(gamma_mu * np.sqrt(np.dot(B, B)))
return Bs, omega
# Transpose to get more "natural" ordering with row-major numpy
r_i = r_i.T
mom_i = mom_i.T
t_start = time.clock()
r_frac = cartesian((np.arange(n[0]) / float(n[0]),
np.arange(n[1]) / float(n[1]),
np.arange(n[2]) / float(n[2])))
r_test = np.dot(r_frac, a)
B, omega = calculate_dipole_vect(r_test, r_i, mom_i)
print 'Total time for vectorized: %f s' % (time.clock() - t_start)
Well, in my testing, this is in fact slightly slower than the loop-based approach I started from. The thing is, in the original version in the question, it was already vectorized with whole-array operations over arrays of shape (20000, 3), so any further vectorization doesn't really bring much further benefit. In fact, it may worsen the performance, as above, maybe due to big temporary arrays.

If you profile your code, you'll see that 99% of the running time is in calculate_dipole so reducing the time for this looping really won't give a noticeable reduction in execution time. You still need to focus on calculate_dipole if you want to make this faster. I tried my Cython code for calculate_dipole on this and got a reduction by about a factor of 2 in the overall time. There might be other ways to improve the Cython code too.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Speeding up a large number of small matrix multiplications - python

Related

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

Optimize large numpy array multiplication with sparse matrices

Speeding up Evaluation of Sympy Symbolic Expressions

Numba slower than pure Python in frequency counting

Rewriting a for loop in pure NumPy to decrease execution time

Categories

Resources