Multiplication of two huge dense matrices Hadamard-multiplied by a sparse matrix - python

I have two dense matrices A and B, each of size 3e5x100, and a sparse binary matrix C of size 3e5x3e5. I want to compute C ∘ (AB'), where ∘ is the Hadamard (i.e., element-wise) product and B' is the transpose of B. Explicitly calculating AB' would require a crazy amount of memory (~500GB). Since the end result doesn't need the whole of AB', it is sufficient to calculate A_iB_j' only where C_ij != 0, where A_i is row i of matrix A, B_j is row j of matrix B, and C_ij is the element at location (i,j) of matrix C. A suggested approach would be something like the algorithm below:
from scipy import sparse
result = sparse.lil_matrix(C.shape)  # empty sparse matrix with the shape of C
rows, cols = C.nonzero()  # indices (i, j) of the nonzero entries of C
for i, j in zip(rows, cols):
    result[i, j] = A[i] @ B[j]  # dot product of row i of A with row j of B
This algorithm, however, takes too much time. Is there any way to improve it using LAPACK/BLAS routines? I am coding in Python, so I think numpy can be a more human-friendly wrapper for LAPACK/BLAS.

You can do this computation using the following, assuming C is stored as a scipy.sparse matrix:
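from scipy import sparse
C = C.tocoo()  # COO format exposes the .row, .col and .data arrays
# one dot product A[i] . B[j] per stored entry (i, j) of C
result_data = C.data * (A[C.row] * B[C.col]).sum(1)
result = sparse.coo_matrix((result_data, (C.row, C.col)), shape=C.shape)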
Here we show that the result matches the naive algorithm for some smaller inputs:
import numpy as np
from scipy import sparse
N = 300
M = 10
def make_C(N, nnz=1000):
    data = np.random.rand(nnz)
    row = np.random.randint(0, N, nnz)
    col = np.random.randint(0, N, nnz)
    return sparse.coo_matrix((data, (row, col)), shape=(N, N))
A = np.random.rand(N, M)
B = np.random.rand(N, M)
C = make_C(N)
def f_naive(C, A, B):
    return C.multiply(np.dot(A, B.T))
def f_efficient(C, A, B):
    C = C.tocoo()
    result_data = C.data * (A[C.row] * B[C.col]).sum(1)
    return sparse.coo_matrix((result_data, (C.row, C.col)), shape=C.shape)
np.allclose(
f_naive(C, A, B).toarray(),
f_efficient(C, A, B).toarray()
)
# True
And here we see that it works for the full input size:
N = 300000
M = 100
A = np.random.rand(N, M)
B = np.random.rand(N, M)
C = make_C(N)
out = f_efficient(C, A, B)
print(out.shape)
# (300000, 300000)
print(out.nnz)
# 1000
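One caveat with the approach above: A[C.row] and B[C.col] are temporary dense arrays of shape (nnz, M), so when C has many nonzeros they can become large themselves. A minimal sketch of a chunked variant follows; the function name masked_product_chunked and the chunk parameter are hypothetical names chosen for illustration, and np.einsum is used here in place of the multiply-and-sum to avoid one intermediate array.

```python
import numpy as np
from scipy import sparse

def masked_product_chunked(C, A, B, chunk=100_000):
    """Compute C ∘ (A @ B.T) while touching only the stored entries of C.

    Hypothetical helper: processes the nonzeros of C in blocks of `chunk`
    entries so the temporary (chunk, M) arrays stay small.
    """
    C = C.tocoo()
    out = np.empty(C.nnz, dtype=np.result_type(C.dtype, A.dtype, B.dtype))
    for start in range(0, C.nnz, chunk):
        stop = start + chunk
        r = C.row[start:stop]
        c = C.col[start:stop]
        # row-wise dot products A[r[k]] . B[c[k]] for this block
        dots = np.einsum("ij,ij->i", A[r], B[c])
        out[start:stop] = C.data[start:stop] * dots
    return sparse.coo_matrix((out, (C.row, C.col)), shape=C.shape)
```

A smaller chunk lowers peak memory at the cost of more Python-level loop iterations; the result should match f_efficient for any chunk size.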

Related

Computing derivatives using numpy

I'm trying to implement a differential in python via numpy that can accept a scalar, a vector, or a matrix.
import numpy as np
def foo_scalar(x):
    f = x * x
    df = 2 * x
    return f, df
def foo_vector(x):
    f = x * x
    n = x.size
    df = np.zeros((n, n))
    for mu in range(n):
        for i in range(n):
            if mu == i:
                df[mu, i] = 2 * x[i]
    return f, df
def foo_matrix(x):
    f = x * x
    m, n = x.shape
    df = np.zeros((m, n, m, n))
    for mu in range(m):
        for nu in range(n):
            for i in range(m):
                for j in range(n):
                    if (mu == i) and (nu == j):
                        df[mu, nu, i, j] = 2 * x[i, j]
    return f, df
This works fine, but it seems like there should be a way to do this in a single function, and let numpy "figure out" the correct dimensions. I could force everything into a 2-D array form with something like
x = np.array(x)
if len(x.shape) == 0:
    x = x.reshape(1, 1)
elif len(x.shape) == 1:
    x = x.reshape(-1, 1)
if len(f.shape) == 0:
    f = f.reshape(1, 1)
elif len(f.shape) == 1:
    f = f.reshape(-1, 1)
and always have 4 nested for loops, but this doesn't scale if I need to generalize to higher-order tensors.
Is what I'm trying to do possible, and if so, how?
I highly doubt NumPy has a function that generates the second value returned by your functions. That said, you can vectorize the computation to make it faster: first generate the indices, then allocate the target array and assign into it. Note that operating on generic N-dimensional arrays tends to be slow and tricky in non-trivial cases. The * unpacking operator is used to pass N arguments.
def foo_generic(x):
    f = x ** 2
    idx = np.stack(np.meshgrid(*[np.arange(e) for e in x.shape], indexing='ij'))
    idx = tuple(np.concatenate((idx, idx)).reshape(2*x.ndim, -1))
    df = np.zeros([*x.shape, *x.shape])
    df[idx] = 2 * x.ravel()
    return f, df
Note that foo_generic does not support scalars, and it would be very inefficient for that case anyway, but you can add a condition to handle this special case separately.
The df array quickly becomes huge for higher orders, so I strongly advise against using dense arrays here: already in the matrix case the zeros vastly outnumber the values. Sparse matrices fix this. In fact, for a 5x5 matrix, more than 95% of the entries of df are zero, and filling a huge array with zeros is not efficient.
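As a rough illustration of the sparse suggestion: for an element-wise function like x * x, the derivative with respect to the flattened input is a diagonal matrix, so it can be stored with one value per input element. The sketch below assumes you are happy to work with a 2-D (x.size, x.size) sparse Jacobian instead of the (m, n, m, n) array; the name foo_sparse is hypothetical.

```python
import numpy as np
from scipy import sparse

def foo_sparse(x):
    """Return f = x**2 and its Jacobian as a sparse diagonal matrix.

    Hypothetical helper: the Jacobian is taken with respect to x.ravel(),
    so entry (mu*n + nu, i*n + j) corresponds to df[mu, nu, i, j] in the
    dense matrix version. Scalars and vectors work the same way.
    """
    x = np.asarray(x, dtype=float)
    f = x ** 2
    df = sparse.diags([2 * x.ravel()], [0])  # shape (x.size, x.size)
    return f, df
```

If the dense layout is ever needed for a small input, df.toarray().reshape(*x.shape, *x.shape) recovers it.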

In PyTorch calc Euclidean distance instead of matrix multiplication

Let's say we have 2 matrices:
mat = torch.randn([20, 7]) * 100
mat2 = torch.randn([7, 20]) * 100
n, m = mat.shape
The simplest usual matrix multiplication looks like this:
def mat_vec_dot_product(mat, vect):
    n, m = mat.shape
    res = torch.zeros([n])
    for i in range(n):
        for j in range(m):
            res[i] += mat[i][j] * vect[j]
    return res
res = torch.zeros([n, n])
for k in range(n):
    res[:, k] = mat_vec_dot_product(mat, mat2[:, k])
But what if I need to apply the L2 norm instead of the dot product? The code is as follows:
def mat_vec_l2_mult(mat, vect):
    n, m = mat.shape
    res = torch.zeros([n])
    for i in range(n):
        for j in range(m):
            res[i] += (mat[i][j] - vect[j]) ** 2
    res = res.sqrt()
    return res
for k in range(n):
    res[:, k] = mat_vec_l2_mult(mat, mat2[:, k])
Can we do this in a more efficient way using Torch or any other library? The naive O(n^3) Python code is really slow.
Use torch.cdist for L2 norm - euclidean distance
res = torch.cdist(mat, mat2.permute(1,0), p=2)
Here, I have used permute to swap dim of mat2 from 7,20 to 20,7
First of all, matrix multiplication in PyTorch has a built-in operator: @.
So, to multiply mat and mat2 you simply do:
mat @ mat2
(should work, assuming dimensions agree).
Now, to compute the Sum of Squared Differences(SSD, or L2-norm of differences) which you seem to compute in your second block, you can do a simple trick.
The squared L2-norm ||m_i - v||^2 (where m_i is the i'th row of matrix M and v is the vector) is equal to the dot product <m_i - v, m_i - v>. By linearity of the dot product this expands to <m_i,m_i> - 2<m_i,v> + <v,v>, so you can compute the SSD of each row of M from the vector v by computing once the squared L2-norm of each row, once the dot product between each row and the vector, and once the squared L2-norm of the vector. This can be done in O(n^2).
However, for the SSD between 2 matrices you will still get O(n^3). Improvements can be made though by vectorizing the operations instead of using loops.
Here is a simple implementation for 2 matrices:
import torch
def mat_mat_l2_mult(mat, mat2):
    # squared norm of each row of mat, tiled across the columns
    rows_norm = (torch.norm(mat, dim=1, p=2, keepdim=True)**2).repeat(1, mat2.shape[1])
    # squared norm of each column of mat2, tiled across the rows
    cols_norm = (torch.norm(mat2, dim=0, p=2, keepdim=True)**2).repeat(mat.shape[0], 1)
    rows_cols_dot_product = mat @ mat2
    ssd = rows_norm - 2*rows_cols_dot_product + cols_norm
    return ssd.sqrt()
mat = torch.randn([20, 7])
mat2 = torch.randn([7, 20])
print(mat_mat_l2_mult(mat, mat2))
The resulting matrix will have at each cell i,j the L2-norm of the difference between each row i in mat and each column j in mat2.
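One numerical detail worth keeping in mind with the expansion above: rounding can leave the squared distance slightly negative when a row and a column are nearly identical, and the square root of that is NaN. The sketch below checks the identity against a brute-force computation and clamps at zero before the square root. It is written with NumPy rather than PyTorch purely so it runs without extra dependencies (that substitution is my assumption, not part of the answers above); in PyTorch the clamping step would be torch.clamp(ssd, min=0). The name pairwise_l2 is hypothetical.

```python
import numpy as np

def pairwise_l2(mat, mat2):
    """L2 distance between every row of mat (n, m) and every column of mat2 (m, k)."""
    rows_sq = (mat ** 2).sum(axis=1, keepdims=True)    # shape (n, 1)
    cols_sq = (mat2 ** 2).sum(axis=0, keepdims=True)   # shape (1, k)
    ssd = rows_sq - 2 * (mat @ mat2) + cols_sq         # broadcasts to (n, k)
    return np.sqrt(np.maximum(ssd, 0))                 # clamp tiny negatives from rounding

rng = np.random.default_rng(0)
mat = rng.standard_normal((20, 7))
mat2 = rng.standard_normal((7, 20))
# brute force: explicit differences between each row and each column
brute = np.sqrt(((mat[:, None, :] - mat2.T[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(pairwise_l2(mat, mat2), brute))
```

Broadcasting also removes the need for the explicit repeat calls, since the (n, 1) and (1, k) arrays expand automatically.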

Python: Speeding up large double sum with elements precalculated

I need to calculate a double sum of the form:
wignersum{ell} = sum_{ell1} sum_{ell2} (2*ell1+1)(2*ell2+1) * W{ell,ell1,ell2}^2 * C1(ell1) * C2(ell2)
where wignersum is an array indexed by ell, and ell, ell1, and ell2 all run from 0 to ellmax. The W{ell,ell1,ell2}^2 are a set of known coefficients that I've already calculated (called w3j), stored in an array of shape (ellmax, ellmax, ellmax) as a global variable to be called by this function. (These coefficients are time intensive to calculate and I've found it faster to load them from a numpy file). The C1 and C2 are arrays of coefficients of shape (ellmax).
I have successfully calculated this sum by making use of a double for loop, grabbing the appropriate elements from each preexisting array and updating the wignersum array in each iteration. I assume there is a better way to vectorize this problem to speed up the calculation. I thought about making the C1 and C2 arrays into arrays of the same shape as the w3j array, then multiplying these arrays elementwise before using np.sum on the ell1 and ell2 axes. I'm unsure whether this is in fact a good method of vectorizing, and if it is, how to actually do this.
The code as it stands is something like
import numpy as np
ell_max = 400
w3j = np.ones((ell_max, ell_max, ell_max))
C1 = np.arange(ell_max)
C2 = np.arange(ell_max)
def function(ell_max):
    ells = np.arange(ell_max)
    wignersum = np.zeros(ell_max)
    factor = np.array([2*i+1 for i in range(ell_max)])  # (2*ell + 1) weights
    for ell1 in ells:
        A = factor[ell1]
        B = C1[ell1]
        for ell2 in ells:
            D = factor[ell2] * C2[ell2] * w3j[:,ell1,ell2]
            wignersum += A * B * D
    return wignersum
(Note that in actuality C1 and C2 are not global variables but local variables that must be calculated from a set of parameters fed to function. This is not the limiting factor in the code speed, however.)
With the double for loop this takes ~1.5 seconds to run for ell_max~400 which is too long for the purposes I'm using it for. I'd like to vectorize this as much as possible to improve speed.
You can use either einsum or matrix multiplication for a ~20x speedup:
import numpy as np
ell_max = 400
w3j = np.random.randint(1,10,(ell_max, ell_max, ell_max))
C1 = np.random.randint(1,10,ell_max)
C2 = np.random.randint(1,10,ell_max)
def function(ell_max):
    ells = np.arange(ell_max)
    wignersum = np.zeros(ell_max)
    factor = np.array([2*i+1 for i in range(ell_max)])
    for ell1 in ells:
        A = factor[ell1]
        B = C1[ell1]
        for ell2 in ells:
            D = factor[ell2] * C2[ell2] * w3j[:,ell1,ell2]
            wignersum += A * B * D
    return wignersum
def pp_es(l_mx):
    l = np.arange(l_mx)
    f = 2*l+1
    return np.einsum("i,i,j,j,kij",f,C1,f,C2,w3j,optimize=True)
def pp_mm(l_mx):
    l = np.arange(l_mx)
    f = 2*l+1
    return w3j.reshape(l_mx,-1)@np.outer(f*C1,f*C2).ravel()
from timeit import timeit
print(timeit(lambda:pp_es(400),number=10))
print(timeit(lambda:pp_mm(400),number=10))
print(timeit(lambda:function(400),number=10))
print((pp_mm(400)==pp_es(400)).all())
print((function(400)==pp_mm(400)).all())
Sample run:
0.6061844169162214 # einsum
0.6111843499820679 # matrix x vector
12.233918005018495 # OP
True # einsum == matrix x vector
True # OP == matrix x vector
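Since the question notes that C1 and C2 are really local values rather than globals, here is a minimal sketch of the same contraction with the arrays passed in as arguments. It contracts one index at a time with two matrix-vector style products instead of forming the outer product; the function name wigner_sum is hypothetical, and I have not timed it against the two variants above.

```python
import numpy as np

def wigner_sum(w3j, C1, C2):
    """sum_{ell1, ell2} (2*ell1+1)(2*ell2+1) * w3j[ell, ell1, ell2] * C1[ell1] * C2[ell2]."""
    ell = np.arange(w3j.shape[0])
    f = 2 * ell + 1
    inner = w3j @ (f * C2)   # contracts ell2, leaving shape (ell_max, ell_max)
    return inner @ (f * C1)  # contracts ell1, leaving shape (ell_max,)
```

With floating-point inputs the result agrees with the double loop only up to rounding, so compare with np.allclose rather than ==.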

Fast way for matrix multiplication in Python

Does anybody know a fast way to compute matrices such as:
$$ Z_{i,j} = \sum_{p,k,l,q} \frac{A_{ip} B_{pk} C_{kl} D_{lq} E_{qj}}{a_p - b_q - c} $$
For normal matrix multiplication I would use numpy.dot(a,b), but now I got to divide the elements by $a_p$ and $b_q$.
Any suggestions?
Any suggestions on how to compute
$$ C_{i,j} = \sum_p \frac{E_{i,p} B_{p,j}}{m_p} $$
will be of great help as well.
Note that (E[i, p] * B[p, j]) / m[p] is equal to E[i, p] * (B[p, j] / m[p]), so you can simply divide m into B before calling np.dot.
def f(E, B, m):
    B = np.asarray(B)  # matrix
    m = np.asarray(m).reshape((B.shape[0], 1))  # column vector, one entry per row of B
    return np.dot(E, B / m)  # m is broadcast across the columns of B
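The same idea seems to extend to the first formula in the question. The denominator a_p - b_q - c depends only on p and q, while the sums over k and l involve only B, C and D, so the product BCD can be formed first, divided element-wise by the (p, q) grid of denominators, and then multiplied by A and E on either side. A minimal sketch under those assumptions (a and b are 1-D arrays, c is a scalar, and no denominator is zero); the name z_matrix is hypothetical:

```python
import numpy as np

def z_matrix(A, B, C, D, E, a, b, c):
    """Z[i, j] = sum_{p,k,l,q} A[i,p] B[p,k] C[k,l] D[l,q] E[q,j] / (a[p] - b[q] - c)."""
    inner = B @ C @ D                     # shape (P, Q); the sums over k and l
    denom = a[:, None] - b[None, :] - c   # shape (P, Q); a_p - b_q - c for every pair
    return A @ (inner / denom) @ E        # the remaining sums over p and q
```

This replaces the four-index sum with ordinary matrix products plus one element-wise division.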

Inverse of numpy.dot

I can easily calculate something like:
R = numpy.column_stack([A,np.ones(len(A))])
M = numpy.dot(R,[k,m0])
where A is a simple array and k,m0 are known values.
I want something different. Having fixed R, M and k, I need to obtain m0.
Is there a way to calculate this by an inverse of the function numpy.dot()?
Or it is only possible by rearranging the matrices?
M = numpy.dot(R,[k,m0])
is performing matrix multiplication. M = R * x.
So to compute the inverse, you could use np.linalg.lstsq(R, M):
import numpy as np
A = np.random.random(5)
R = np.column_stack([A,np.ones(len(A))])
k = np.random.random()
m0 = np.random.random()
M = R.dot([k,m0])
(k_inferred, m0_inferred), residuals, rank, s = np.linalg.lstsq(R, M, rcond=None)
assert np.allclose(m0, m0_inferred)
assert np.allclose(k, k_inferred)
Note that both k and m0 are determined, given M and R (assuming len(M) >= 2).
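If k really is known and only m0 is wanted, as the question states, the problem reduces to a single unknown: each row gives M_i = k * A_i + m0, so m0 = M_i - k * A_i, and averaging over the rows gives the least-squares estimate when M is noisy. A minimal sketch, assuming the first column of R holds A and the second column is all ones as in the question; the name solve_m0 is hypothetical:

```python
import numpy as np

def solve_m0(R, M, k):
    """Recover m0 from M = R @ [k, m0] when k is already known."""
    A = R[:, 0]                  # first column of R is the original array A
    return np.mean(M - k * A)    # least-squares estimate of the constant offset

A = np.random.random(5)
R = np.column_stack([A, np.ones(len(A))])
k = np.random.random()
m0 = np.random.random()
M = R.dot([k, m0])
assert np.allclose(solve_m0(R, M, k), m0)
```

With exact data any single row would do; the mean only matters when M contains noise.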
