Diagonal of a numpy matrix without computing the entire matrix - python

I have a simple algebraic problem and I would like to solve it with numpy (of course I could solve it easily with numba, but that is not the point).
Let us consider a first random matrix A with size (m x n), with m a large value, and a second random matrix B with size (n x n).
A = np.random.random((1_000_000, 100))
B = np.random.random((100, 100))
We want to compute the following expression:
np.diag(np.dot(np.dot(A,B),B.T))
The problem is that the entire matrix is computed and held in memory, and only then is the diagonal extracted. Is it possible to do this operation in a more efficient way?

This is how I would approach it from your starting expression
np.diag(np.dot(np.dot(A,B),B.T))
You can start by grouping terms:
np.diag(np.dot(A, np.dot(B,B.T)))
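then only use the first relevant (square) part of A. The product is (m x n), so np.diag returns only n entries, and only the first n rows of A contribute to them: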
np.diag(np.dot(A[:B.shape[0], :], np.dot(B,B.T)))
and then avoid the extra multiplications (those that fall outside the diagonal) by doing the element-wise multiplications yourself:
np.sum( np.multiply(A[:B.shape[0], :].T, np.dot(B,B.T)), 0)
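As a sanity check, the rewrites above can be compared on a small example. This is only a sketch: the sizes (m = 50, n = 4) and the fixed seed are arbitrary choices for illustration, and the einsum line is an alternative spelling that does not appear in the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
m, n = 50, 4                     # small illustrative sizes (assumption)
A = rng.random((m, n))
B = rng.random((n, n))

# Reference: build the full (m x n) product, then take its diagonal.
full = np.diag(np.dot(np.dot(A, B), B.T))

# Step-by-step rewrites from the answer above.
G = np.dot(B, B.T)                                   # (n x n), symmetric
grouped = np.diag(np.dot(A, G))
square = np.diag(np.dot(A[:n, :], G))
elementwise = np.sum(np.multiply(A[:n, :].T, G), 0)

# Alternative (not in the original answer): einsum spells out the same sum,
# diag[i] = sum_k A[i, k] * G[k, i], without building any off-diagonal entry.
via_einsum = np.einsum('ik,ki->i', A[:n, :], G)

for candidate in (grouped, square, elementwise, via_einsum):
    assert np.allclose(full, candidate)
```

The last two variants never form the (n x n) product A[:n] · G, so they only compute the n values that end up on the diagonal.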

Changed (A*B)*B.T to A*(B*B.T).
Multiplied only the part of A (A[:B.shape[0]]) that contributes to the diagonal of the result.
import numpy as np
import time
A = np.random.random((1_000_000, 100))
B = np.random.random((100, 100))
# Baseline: build the full product, then take the diagonal.
start_time = time.time()
result = np.diag(np.dot(np.dot(A, B), B.T))
print('Baseline: ', time.time() - start_time)
# Optimized: averaged over 100 runs, since a single run is very short.
start_time = time.time()
for i in range(100):
    result2 = np.diag(np.dot(A[:B.shape[0]], np.dot(B, B.T)))
print('Optimized: ', (time.time() - start_time) / 100)
assert np.allclose(result, result2)
Baseline: 1.7957241535186768
Optimized: 0.00016015291213989258

Yes.
A = np.random.random((1_000_000, 100))
B = np.random.random((100, 100))
n = B.shape[0]  # the product A·B·B.T is (m x n), so its diagonal has n entries
result = np.empty(n)
for i in range(n):
    # K[i, i] = (row i of A·B) dotted with column i of B.T, which is row i of B
    result[i] = np.dot(np.dot(A[i, :], B), B[i, :])
Explanation:
Say we have K = np.dot(np.dot(A, B), B.T).
Let X = np.dot(A[0, :], B), which is row 0 of np.dot(A, B).
Then K[0, 0] = np.dot(X, B.T[:, 0]), the product of that row with column 0 of B.T.
Since B.T[:, 0] is the same as B[0, :], K[0, 0] = np.dot(np.dot(A[0, :], B), B[0, :]).
We can generalize this result to K[i, i] = np.dot(np.dot(A[i, :], B), B[i, :]).

Related

Is there any way to optimize a triple loop in Python by using numpy or other resources?

I'm having trouble finding a way to optimize a triple loop in Python. I will give the code directly, since it is the simplest representation of what I have to compute:
Given two 2-D arrays named samples (M x N) and D (N x N), along with the output results (N x N):
for sigma in range(M):
    for i in range(N):
        for j in range(N):
            results[i, j] += (1/N) * (samples[sigma, i]*samples[sigma, j]
                                      - samples[sigma, i]*D[j, i]
                                      - samples[sigma, j]*D[i, j])
return results
It does the job but is not efficient at all in Python. I tried to unroll the for i / for j loops, but I cannot compute it correctly with sigma in the way.
Does someone have an idea on how to optimize those few lines? Any suggestions are welcome, such as numpy, numexpr, etc.
One way I found to improve your code (i.e. reduce the number of loops) is by using np.meshgrid.
Here is the improvement I found. It took some fiddling, but it gives the same output as your triple-loop code. I kept the same code structure so you can see which parts correspond to which. I hope this is of use to you!
for sigma in range(M):
    xx, yy = np.meshgrid(samples[sigma], samples[sigma])
    results += (1/N) * (xx * yy
                        - yy * D.T
                        - xx * D)
print(results) # or return results
Edit: Here's a small script to verify that the results are as expected:
import numpy as np
M, N = 3, 4
rng = np.random.default_rng(seed=42)
samples = rng.random((M, N))
D = rng.random((N, N))
results = rng.random((N, N))
results_old = results.copy()
results_new = results.copy()
for sigma in range(M):
    for i in range(N):
        for j in range(N):
            results_old[i, j] += (1/N) * (samples[sigma, i]*samples[sigma, j]
                                          - samples[sigma, i]*D[j, i]
                                          - samples[sigma, j]*D[i, j])
print('\n\nresults_old', results_old, sep='\n')
for sigma in range(M):
    xx, yy = np.meshgrid(samples[sigma], samples[sigma])
    results_new += (1/N) * (xx * yy
                            - yy * D.T
                            - xx * D)
print('\n\nresults_new', results_new, sep='\n')
Edit 2: Entirely getting rid of loops: it is a bit convoluted but it essentially does the same thing.
M, N = samples.shape
xxx, yyy = np.meshgrid(samples, samples)
split_x = np.array(np.hsplit(np.vsplit(xxx, M)[0], M))
split_y = np.array(np.vsplit(np.hsplit(yyy, M)[0], M))
results += np.sum(
    (1/N) * (split_x*split_y
             - split_y*D.T
             - split_x*D), axis=0)
print(results) # or return results
In order to vectorize for loops, we can make use of broadcasting and then reducing along any axes that are not reflected by the output array. To do so, we can "assign" one axis to each of the for loop indices (as a convention). For your example this means that all input arrays can be reshaped to have dimension 3 (i.e. len(a.shape) == 3); the axes correspond then to sigma, i, j respectively. Then we can perform all operations with the broadcasted arrays and finally reduce (sum) the result along the sigma axis (since only i, j are reflected in the result):
# Ordering of axes: (sigma, i, j)
samples_i = samples[:, :, np.newaxis]
samples_j = samples[:, np.newaxis, :]
D_ij = D[np.newaxis, :, :]
D_ji = D.T[np.newaxis, :, :]
return (samples_i*samples_j - samples_i*D_ji - samples_j*D_ij).sum(axis=0) / N
The following is a complete example that compares the reference code (using for loops) with the above version; note that I've removed the 1/N part in order to keep computations in the domain of integers and thus make the array equality test exact.
import time
import numpy as np
def timeit(func):
    def wrapper(*args):
        t_start = time.process_time()
        res = func(*args)
        t_total = time.process_time() - t_start
        print(f'{func.__name__}: {t_total:.3f} seconds')
        return res
    return wrapper
rng = np.random.default_rng()
M, N = 100, 200
samples = rng.integers(0, 100, size=(M, N))
D = rng.integers(0, 100, size=(N, N))
@timeit
def reference(samples, D):
    results = np.zeros(shape=(N, N))
    for sigma in range(M):
        for i in range(N):
            for j in range(N):
                results[i, j] += (samples[sigma, i]*samples[sigma, j]
                                  - samples[sigma, i]*D[j, i]
                                  - samples[sigma, j]*D[i, j])
    return results
@timeit
def new(samples, D):
    # Ordering of axes: (sigma, i, j)
    samples_i = samples[:, :, np.newaxis]
    samples_j = samples[:, np.newaxis, :]
    D_ij = D[np.newaxis, :, :]
    D_ji = D.T[np.newaxis, :, :]
    return (samples_i*samples_j - samples_i*D_ji - samples_j*D_ij).sum(axis=0)
assert np.array_equal(reference(samples, D), new(samples, D))
This gives me the following benchmark results:
reference: 6.465 seconds
new: 0.133 seconds
I found it easier to break the problem into smaller steps and work on each one until we have a single expression.
Going from your original formulation:
for sigma in range(M):
    for i in range(N):
        for j in range(N):
            results[i, j] += (1/N) * (samples[sigma, i]*samples[sigma, j]
                                      - samples[sigma, i]*D[j, i]
                                      - samples[sigma, j]*D[i, j])
The first thing is to eliminate the j index in the innermost loop. For this we start working with vectors instead of single elements:
for sigma in range(M):
    for i in range(N):
        results[i, :] += (1/N) * (samples[sigma, i]*samples[sigma, :] - samples[sigma, i]*D[:, i] - samples[sigma, :]*D[i, :])
Then, we eliminate the second loop, the one with i index. In this step we start to think in matrices. Therefore, each loop is the direct summation of "sigma matrices".
for sigma in range(M):
    results += (1/N) * (samples[sigma, :, np.newaxis] * samples[sigma] - samples[sigma, :, np.newaxis] * D.T - samples[sigma, :] * D)
I strongly recommend using this step as the solution, since vectorizing further builds an intermediate (M x N x N) array, which requires too much memory for a large M. But, just for knowledge...
think of the matrices as 3-dimensional objects. We do the calculations and sum at the end over axis zero:
results = (1/N) * (samples[:, :, np.newaxis] * samples[:,np.newaxis] - samples[:, :, np.newaxis] * D.T - samples[:, np.newaxis, :] * D).sum(axis=0)
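A further simplification, which does not appear in the answers above, is to carry out the sum over sigma algebraically before building any 3-D array: the first term becomes a matrix product and the other two only need the column sums of samples. The sketch below assumes results starts at zero; the helper names loop_reference and summed_first, the small sizes and the seed are hypothetical choices for illustration.

```python
import numpy as np

def loop_reference(samples, D):
    """Triple loop from the question, used only as a reference."""
    M, N = samples.shape
    results = np.zeros((N, N))
    for sigma in range(M):
        for i in range(N):
            for j in range(N):
                results[i, j] += (1 / N) * (samples[sigma, i] * samples[sigma, j]
                                            - samples[sigma, i] * D[j, i]
                                            - samples[sigma, j] * D[i, j])
    return results

def summed_first(samples, D):
    """Sum over sigma first, so no (M x N x N) intermediate is created."""
    N = samples.shape[1]
    col_sums = samples.sum(axis=0)        # shape (N,): sum over sigma
    outer = samples.T @ samples           # sum_sigma s[sigma, i] * s[sigma, j]
    term_i = col_sums[:, np.newaxis] * D.T   # sum_sigma s[sigma, i] * D[j, i]
    term_j = col_sums[np.newaxis, :] * D     # sum_sigma s[sigma, j] * D[i, j]
    return (outer - term_i - term_j) / N

rng = np.random.default_rng(0)            # fixed seed for reproducibility
samples = rng.random((5, 4))              # small illustrative sizes
D = rng.random((4, 4))
assert np.allclose(loop_reference(samples, D), summed_first(samples, D))
```

With this form the memory use is independent of M apart from samples itself, which addresses the memory concern raised for the fully broadcast version.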

Multiplying a vector and a matrix using nested for loops

I want to write a function matvec_row_variant_scalar(A,x) that implements the scalar-wise, row-variant of the matrix-vector multiplication, where A is a 2D array, and x is a 1D array. It MUST use two nested loops and scalar-wise access to the entries of 𝐴 and 𝑥 .
This is what I have tried.
def matvec_row_variant_scalar(A,x):
    y = np.zeros(x.shape)
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            A[i,j] =int(x[j])*A[i,j]
        y[j] = A[i,:].sum()
    return y
A= np.array([[1,0,0],[0,1,0],[0,0,1]])
x= np.array([[1], [2], [3]])
print(matvec_row_variant_scalar(A,x))
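I understand you want a column vector b = A * x where b[i] = (A[i, :] * x[i]).sum() and x is also a column vector. (This scales row i of A by x[i]; it matches the standard matrix-vector product, b[i] = (A[i, :] * x[:, 0]).sum(), for the diagonal A used here, but not for a general A.) You don't really need a loop for this, because numpy can do it for you without loops and much faster.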
A = np.array([[1,0,0],[0,1,0],[0,0,1]])
x = np.array([[1], [2], [3]])
b = (A * x).sum(axis=1, keepdims=True)
# b = array([[1],
#            [2],
#            [3]])
Alternatively, you can simplify your problem before calculating it:
(A[i, :] * x[i]).sum() is the same as A[i, :].sum() * x[i], so you can sum across the rows of A first and then multiply by x to get the same result.
b = A.sum(axis=1, keepdims=True) * x
However, since you must use two loops, here is the loop version. (Note that it will be much slower, especially for larger arrays.)
y = np.zeros(x.shape)
for i in range(A.shape[0]): # iterate over rows
    for j in range(A.shape[1]): # iterate over columns
        y[i, 0] = y[i, 0] + x[i, 0] * A[i,j] # Keep adding x[i] * A[i, j] to y[i]
# y = array([[1.],
#            [2.],
#            [3.]])
Speed comparison:
import timeit
import numpy as np
from matplotlib import pyplot as plt
sizes = [1, 5, 10, 50, 100, 500, 1000, 5000, 10000]
def mult1(A, x):
    return (A * x).sum(axis=1, keepdims=True)
def mult2(A, x):
    return A.sum(axis=1, keepdims=True) * x
def mult3(A, x):
    y = np.zeros(x.shape)
    for i in range(A.shape[0]): # iterate over rows
        for j in range(A.shape[1]): # iterate over columns
            y[i, 0] = y[i, 0] + x[i, 0] * A[i,j] # Keep adding x[i] * A[i, j] to y[i]
    return y
time_vals = np.zeros((len(sizes), 3)) # one row per size, one column per function
for i, size in enumerate(sizes):
    reps = 10 if size < 1000 else 1
    print(size, reps)
    A = np.random.random((size, size))
    x = np.random.random((size, 1))
    time_vals[i, 0] = timeit.timeit("mult1(A, x)", setup="from __main__ import A, x, mult1", number=reps) / reps
    time_vals[i, 1] = timeit.timeit("mult2(A, x)", setup="from __main__ import A, x, mult2", number=reps) / reps
    time_vals[i, 2] = timeit.timeit("mult3(A, x)", setup="from __main__ import A, x, mult3", number=reps) / reps
plt.plot(sizes, time_vals[:, 0], label="mult1")
plt.plot(sizes, time_vals[:, 1], label="mult2")
plt.plot(sizes, time_vals[:, 2], label="mult3")
plt.legend()
plt.show()
Gives this plot. The iterative approach is consistently 1+ order of magnitude slower than the vectorized approach for arrays of significant size.
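For comparison, the question asks for the standard matrix-vector product, y[i] = sum over j of A[i, j] * x[j], with a 1D x. A minimal sketch of that row variant with two nested loops and scalar access might look like the following; the non-diagonal test matrix is an assumption chosen so that a mix-up between x[i] and x[j] would show up, and the check against A @ x is only there for verification.

```python
import numpy as np

def matvec_row_variant_scalar(A, x):
    """Row-variant matrix-vector product using scalar access only."""
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):          # one output entry per row of A
        for j in range(n):      # accumulate the dot product of row i with x
            y[i] += A[i, j] * x[j]
    return y

# Non-diagonal example matrix (assumption, for illustration only).
A = np.array([[1., 2., 0.],
              [0., 1., 0.],
              [3., 0., 1.]])
x = np.array([1., 2., 3.])      # 1D vector, as the question specifies

y = matvec_row_variant_scalar(A, x)
assert np.allclose(y, A @ x)    # agrees with NumPy's built-in product
```

The outer loop walks over rows and the inner loop over columns, which is what makes this the row variant; swapping the two loops would give the column variant with the same result.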

Summation within iteration over two variables with matrix operations

I have the following matrices: Q, P, q and y with shapes (100,100), (100,100), (100,100) and (100,2) respectively.
For every i, I want to compute the following:
$$\mathrm{grad}_i = 4 \sum_{j} (P_{ij} - Q_{ij})\, q_{ij}\, (y_i - y_j)$$
This is what I've tried so far. It appears to work, but I know this is bad practice and painfully slow.
grad = np.zeros((100, 2))  # the shape must be passed as a tuple
for i in range(100):
    tmp = 0
    for j in range(100):
        tmp += ((P[i, j] - Q[i, j]) * q[i, j] * (y[i, :] - y[j, :]))
    grad[i, :] = tmp * 4
My question is how can I compute this using matrix operations instead of nested loops?
From your notation, try broadcasting:
grad = 4 * (((P-Q)*q)[...,None]*(y[:,None,:]-y[None])).sum(axis=1)
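To check the broadcast expression against the nested loops, a small sketch such as the following can be used. The size n = 6 and the seed are arbitrary (the question uses 100), and the final matrix-product form is an alternative that is not part of the original answer; it avoids the (n x n x 2) intermediate by splitting the sum into two terms.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility
n = 6                                   # small size for the check (assumption)
P, Q, q = rng.random((3, n, n))         # three (n x n) matrices
y = rng.random((n, 2))

# Nested-loop reference, following the question's code.
grad_loop = np.zeros((n, 2))
for i in range(n):
    tmp = 0
    for j in range(n):
        tmp += (P[i, j] - Q[i, j]) * q[i, j] * (y[i, :] - y[j, :])
    grad_loop[i, :] = tmp * 4

# Broadcast version: weights (n, n, 1) times pairwise differences (n, n, 2).
weights = ((P - Q) * q)[..., None]
diffs = y[:, None, :] - y[None]
grad_vec = 4 * (weights * diffs).sum(axis=1)

# Alternative (not in the original answer):
# sum_j W[i, j] * (y[i] - y[j]) = (sum_j W[i, j]) * y[i] - (W @ y)[i]
W = (P - Q) * q
grad_mat = 4 * (W.sum(axis=1, keepdims=True) * y - W @ y)

assert np.allclose(grad_loop, grad_vec)
assert np.allclose(grad_loop, grad_mat)
```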

How can these 2 loops be vectorized in Python?

I'm retrieving close to 400k values into values, which is pretty slow by itself (that code is not shown), and then I try to predict those values with a Kalman filter. The first loop takes a little over a minute to run and the second around two and a half minutes. I think the first can be vectorized, but I'm not sure how, especially the window_sma part. For the second loop, I'm not sure how to deal with x growing on every iteration i (x = np.append(x, new_x_col, axis=1)).
This is the first one, which tries to make a prediction based on the values from SMA, using polyfit and polyval:
window_sma = 200
sma_index = 500
offset = 50
SMA = talib.SMA(values, timeperiod = window_sma)
vector_X = [1, 2, 3, 15]
sma_predicted = []
start_time = time.time()
for i in range (sma_index, len(SMA)):
    j = int(i - offset)
    k = int(i - offset / 2)
    window_sma = [SMA[j], SMA[k], SMA[i]]
    polyfit = np.polyfit([1, 2, 3], window_sma, 2)
    y_hat = np.polyval(polyfit, vector_X)
    sma_predicted.append(y_hat[-1])
And the second one, which attempts to filter the output of the first for loop to get a better prediction of the values I got from SMA:
# Kalman Filter
km = KalmanFilter(dim_x = 2, dim_z = 1)
# state transition matrix
km.F = np.array([[1.,1.],
                 [0.,1.]])
# Measurement function
km.H = np.array([[1.,0.]])
# Change in time
dt = 0.0001
a = 1.5
# Covariance Matrix
km.Q = np.power(a, 2) * \
    np.array([[np.power(dt,4)/4, np.power(dt,3)/2],
              [np.power(dt,3)/2, np.power(dt,2)]])
# Variance
km.R = 1000
# Identity Matrix
I = np.array([[1, 0], [0, 1]])
# Measurement Matrix
km.Z = np.array(sma_predicted)
# Initial state
x = np.zeros((2,1))
x = np.array([[sma_predicted[0]], [0]])
# Initial distribution state's covariance matrix
km.P = np.array([[1000, 0], [0, 1000]])
for i in range (0, len(sma_predicted) - 1):
    # Prediction
    new_x_col = np.dot(km.F, x[:, i]).reshape(2, 1)
    x = np.append(x, new_x_col, axis=1)
    km.P = km.F * km.P * km.F.T + km.Q
    # Correction
    K = np.dot(km.P, km.H.T) / (np.dot(np.dot(km.H, km.P), km.H.T) + km.R)
    x[:, -1] = x[:, -1] + np.dot(K, (km.Z[i + 1] - np.dot(km.H, x[:, -1])))
    #x[:, -1] = (x[:, -1] + K * (km.Z[i + 1] - km.H * x[:, -1])).reshape(2, i + 2)
    km.P = (I - K * km.H) * km.P
Thanks!
The second one is worth attacking first, so I'll just do that.
You have this:
x = np.array([[sma_predicted[0]], [0]])
for i in range (0, len(sma_predicted) - 1):
    new_x_col = np.dot(km.F, x[:, i]).reshape(2, 1)
    x = np.append(x, new_x_col, axis=1)
    # ...
Repeatedly appending to the same array is bad practice in NumPy, because every np.append call allocates a new array and copies all existing data into it. Start with something like this instead:
x = np.zeros((2, len(sma_predicted)))
x[0, 0] = sma_predicted[0]
for i in range(len(sma_predicted) - 1):
    x[:, i+1] = np.dot(km.F, x[:, i])
    # ...
Note that reshape(2, 1) is no longer needed: np.dot(km.F, x[:, i]) returns a 1-D array of length 2, which can be assigned directly to the column x[:, i+1].
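A small sketch can confirm that the preallocated loop produces the same columns as the append-based one. It covers the prediction step only (the correction step is left out), and the initial state x0 and the number of steps are hypothetical values picked for illustration; F is the state transition matrix from the question.

```python
import numpy as np

F = np.array([[1., 1.],
              [0., 1.]])         # state transition matrix from the question
n_steps = 5                      # illustrative length (assumption)
x0 = np.array([10., 0.5])        # hypothetical initial state

# Append-based version: copies the whole array on every iteration.
x_append = x0.reshape(2, 1)
for i in range(n_steps - 1):
    new_col = np.dot(F, x_append[:, i]).reshape(2, 1)
    x_append = np.append(x_append, new_col, axis=1)

# Preallocated version: one allocation, columns filled in place.
x_pre = np.zeros((2, n_steps))
x_pre[:, 0] = x0
for i in range(n_steps - 1):
    x_pre[:, i + 1] = np.dot(F, x_pre[:, i])

assert np.allclose(x_append, x_pre)
```

The append version does work proportional to the current length on every iteration, so its total cost grows quadratically with the number of steps, while the preallocated version stays linear.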
I realize this does not answer all of your implicit questions, but perhaps it gets you started.
It would be nice if dot were a ufunc so we could do something like np.dot.outer(km.F, x.T), but it isn't (see this from 2009), so we can't. You could implement more speedups using Numba (with the append() removed as I showed, your code is a good candidate for Numba).
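As for the first loop, one possible way to vectorize it (not covered in the answer above) relies on the fact that a degree-2 fit through exactly three points is an exact interpolation. Its value at x = 15 is therefore a fixed linear combination of the three samples, with weights given by the Lagrange basis polynomials for the nodes 1, 2, 3. The sketch below uses random data as a stand-in for the talib.SMA output and assumes offset is even, so that int(i - offset / 2) equals i - offset // 2.

```python
import numpy as np

rng = np.random.default_rng(0)
SMA = rng.random(600)            # stand-in for the talib.SMA output (assumption)
sma_index, offset = 500, 50

# Loop from the question, kept as the reference.
reference = []
for i in range(sma_index, len(SMA)):
    j = int(i - offset)
    k = int(i - offset / 2)
    coeffs = np.polyfit([1, 2, 3], [SMA[j], SMA[k], SMA[i]], 2)
    reference.append(np.polyval(coeffs, 15))

# Lagrange basis polynomials for nodes 1, 2, 3, evaluated at x = 15.
w_j = (15 - 2) * (15 - 3) / ((1 - 2) * (1 - 3))   # 78
w_k = (15 - 1) * (15 - 3) / ((2 - 1) * (2 - 3))   # -168
w_i = (15 - 1) * (15 - 2) / ((3 - 1) * (3 - 2))   # 91

# One vectorized expression over all indices, with no polyfit calls.
idx = np.arange(sma_index, len(SMA))
half = offset // 2
vectorized = w_j * SMA[idx - offset] + w_k * SMA[idx - half] + w_i * SMA[idx]

assert np.allclose(reference, vectorized)
```

The large weights (78, -168, 91) also show that extrapolating to x = 15 amplifies small differences between the three samples considerably.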

Improve performance of function without parallelization

Some weeks ago I posted a question (Speed up nested for loop with elements exponentiation) which got a very good answer by abarnert. This question is related to that one since it makes use of the performance improvements suggested by said user.
I need to improve the performance of a function that involves calculating three factors and then applying an exponential on them.
Here's a MWE of my code:
import numpy as np
import timeit
def random_data(N):
    # Generate some random data.
    return np.random.uniform(0., 10., N)
# Data lists.
array1 = np.array([random_data(4) for _ in range(1000)])
array2 = np.array([random_data(3) for _ in range(2000)])
# Function.
def func():
    # Empty list that holds all values obtained in for loop.
    lst = []
    for elem in array1:
        # Avoid numeric errors if one of these values is 0.
        e_1, e_2 = max(elem[0], 1e-10), max(elem[1], 1e-10)
        # Obtain three parameters.
        A = 1./(e_1*e_2)
        B = -0.5*((elem[2]-array2[:,0])/e_1)**2
        C = -0.5*((elem[3]-array2[:,1])/e_2)**2
        # Apply exponential.
        value = A*np.exp(B+C)
        # Store value in list.
        lst.append(value)
    return lst
# time function.
func_time = timeit.timeit(func, number=100)
print(func_time)
Is it possible to speed up func without having to resort to parallelization?
Here's what I have so far. My approach is to do as much of the math as possible across numpy arrays.
Optimizations:
Calculate As within numpy
Re-factor calculation of B and C by splitting them into factors, some of which can be computed within numpy
Code:
def optfunc():
    e0 = array1[:, 0]
    e1 = array1[:, 1]
    e2 = array1[:, 2]
    e3 = array1[:, 3]
    ar0 = array2[:, 0]
    ar1 = array2[:, 1]
    As = 1./(e0 * e1)
    Bfactors = -0.5 * (1 / e0**2)
    Cfactors = -0.5 * (1 / e1**2)
    lst = []
    for i, elem in enumerate(array1):
        B = ((elem[2] - ar0) ** 2) * Bfactors[i]
        C = ((elem[3] - ar1) ** 2) * Cfactors[i]
        value = As[i]*np.exp(B+C)
        lst.append(value)
    return lst
print(np.allclose(optfunc(), func()))
# time function.
func_time = timeit.timeit(func, number=10)
opt_func_time = timeit.timeit(optfunc, number=10)
print("%.3fs --> %.3fs" % (func_time, opt_func_time))
Result:
True
0.759s --> 0.485s
At this point I'm stuck. I managed to do it entirely without python for loops, but it is slower than the above version for a reason I do not yet understand:
def optfunc():
    x = array1
    y = array2
    x0 = x[:, 0]
    x1 = x[:, 1]
    x2 = x[:, 2]
    x3 = x[:, 3]
    y0 = y[:, 0]
    y1 = y[:, 1]
    A = 1./(x0 * x1)
    Bfactors = -0.5 * (1 / x0**2)
    Cfactors = -0.5 * (1 / x1**2)
    B = (np.transpose([x2]) - y0)**2 * np.transpose([Bfactors])
    C = (np.transpose([x3]) - y1)**2 * np.transpose([Cfactors])
    return np.transpose([A]) * np.exp(B + C)
Result:
True
0.780s --> 0.558s
However, note that the latter returns a single np.array whereas the former only returns a Python list of arrays. This might account for the difference, but I'm not sure.
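For reference, the loop-free variant can also be written with np.newaxis instead of np.transpose([...]), and with the question's guard against zero values kept via np.maximum. This is only a sketch: the array sizes are reduced and the seed is fixed for a quick check, the function names func_loop and func_broadcast are hypothetical, and no claim is made here about which version is faster.

```python
import numpy as np

rng = np.random.default_rng(0)
array1 = rng.uniform(0., 10., (50, 4))   # smaller than the question's arrays
array2 = rng.uniform(0., 10., (80, 3))

def func_loop():
    """Loop from the question, returning a list of 1-D arrays."""
    lst = []
    for elem in array1:
        e_1, e_2 = max(elem[0], 1e-10), max(elem[1], 1e-10)
        A = 1. / (e_1 * e_2)
        B = -0.5 * ((elem[2] - array2[:, 0]) / e_1) ** 2
        C = -0.5 * ((elem[3] - array2[:, 1]) / e_2) ** 2
        lst.append(A * np.exp(B + C))
    return lst

def func_broadcast():
    """Same computation on a (len(array1), len(array2)) grid."""
    # Column vectors of shape (len(array1), 1) broadcast against rows of array2.
    e_1 = np.maximum(array1[:, 0], 1e-10)[:, np.newaxis]
    e_2 = np.maximum(array1[:, 1], 1e-10)[:, np.newaxis]
    B = -0.5 * ((array1[:, 2, np.newaxis] - array2[:, 0]) / e_1) ** 2
    C = -0.5 * ((array1[:, 3, np.newaxis] - array2[:, 1]) / e_2) ** 2
    return np.exp(B + C) / (e_1 * e_2)

out = func_broadcast()
assert np.allclose(np.array(func_loop()), out)
```

The result is a single 2-D array whose row i corresponds to element i of the list returned by the loop version.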
