I have to create N large matrices, of size M x M, with M = 100'000, on a cluster. I can create them one by one. Usually I would first define a tensor
mat_all = torch.zeros((N,M,M))
And then I would fill mat_all as follows:
for i in range(N):
    tmp = create_matrix(M, M)
    mat_all[i, :, :] = tmp
where the function create_matrix creates a square matrix of size M x M.
My problem is: if I do that, I run into memory issues when creating the big tensor mat_all with torch.zeros. I do not have these issues when I create the matrices one by one with create_matrix.
I was wondering if there is a way to have a tensor like mat_all that holds N matrices of size M x M without running into memory issues.
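One possible workaround (a sketch, not from the original post) is to back mat_all with a disk-based memory map, so the full (N, M, M) block never has to be resident in RAM at once. The file name, float32 dtype, small demo sizes, and create_matrix stand-in below are illustrative assumptions:
import numpy as np
import torch

N, M = 4, 1000  # small demo sizes; the real case has M = 100'000

def create_matrix(h, w):
    # stand-in for the question's create_matrix
    return torch.randn(h, w)

# file-backed storage instead of an in-RAM (N, M, M) tensor
storage = np.memmap('mat_all.dat', dtype=np.float32, mode='w+', shape=(N, M, M))
mat_all = torch.from_numpy(storage)

for i in range(N):
    mat_all[i] = create_matrix(M, M)  # writes through to disk, one slice at a time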
Related
I'm trying to perform an outer concatenation in TensorFlow, combining two 2D tensors into a third, so that two m by n tensors combine into an m by m by n^2 tensor. In the past, when I've made a new tensor, I've used the entire data set and preallocated the space, so here I would have a tensor of the total number of samples, S, times the other dimensions (S by m by m by n^2). This takes up too much memory. What are my options for computing an outer concatenation without overloading my RAM? Should I be making this an individual layer, for instance, and if so, how? Thank you for any advice.
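For illustration only, here is one possible reading of those shapes in plain NumPy (my addition, not from the post): each pairing of a row of x with a row of y contributes the outer product of their feature vectors, and streaming one sample at a time avoids ever allocating the full (S, m, m, n**2) array:
import numpy as np

m, n, S = 8, 3, 100  # illustrative sizes

def outer_concat(x, y):
    # (m, n) x (m, n) -> (m, m, n**2): outer product over rows and features
    return np.einsum('ia,jb->ijab', x, y).reshape(x.shape[0], y.shape[0], -1)

for _ in range(S):  # process samples one at a time instead of preallocating
    x = np.random.randn(m, n)
    y = np.random.randn(m, n)
    z = outer_concat(x, y)  # consume or reduce z here, then let it be freed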
I'm trying to make a matrix/vector multiplication, but my matrix is stored in a way that the @ operator cannot be used.
My matrix Z is actually a list of size N containing the columns of the matrix, which are all PETSc4py.Vec of size NN, where NN≫N (e.g. NN=10000 and N=10). As N is small, I can loop over it; for instance, if I want to compute r = Z.T @ u with u a vector of size NN, I do
r = np.zeros(N)
for i, z in enumerate(Z):
    r[i] = u * z  # scalar product
Now I have a vector u of size N and I want to compute the multiplication w = Z @ u. I can't apply the same method, because it would involve a loop of size NN, which I'm trying to avoid.
I could convert my "matrix" Z to a NumPy matrix, but I'm also trying to avoid that...
I represented in figure 1 the way the matrix is stored. A red line represents a vector that should be read for the matrix-vector multiplication.
Is there a mathematical way (or a magic trick!) to compute this operation without the big loop?
Thanks
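For what it's worth, a sketch of one way around the big loop (my addition, assuming petsc4py's standard Vec API): w = Z @ u is just a linear combination of the N columns, so it can be accumulated with N axpy calls:
w = Z[0].duplicate()   # new Vec with the same layout as the columns
w.zeroEntries()
for i, z in enumerate(Z):
    w.axpy(u[i], z)    # w += u[i] * z, so the loop has size N, not NN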
My goal here is to build a sparse CSR matrix very fast. It is currently the main bottleneck in my process, and I've already optimized it by constructing the COO matrix relatively quickly and then using tocsr().
However, I would imagine that constructing the CSR matrix directly must be faster?
I have a very specific format of sparse matrix that is also large (i.e. on the order of 100000x50000). I've looked online at these other answers, but most do not address the question I have.
Efficiently construct FEM/FVM matrix
Looks at constructing a very specifically formatted sparse matrix vs. using COO, which led to a SciPy merge improving the speed of tocsr().
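For reference, a minimal sketch of the COO-then-tocsr() baseline being discussed (the triplets below are random placeholders, not the actual pattern from this question):
import numpy as np
from scipy.sparse import coo_matrix

rows = np.random.randint(0, 1000, size=5000)
cols = np.random.randint(0, 500, size=5000)
vals = np.random.randn(5000)
# duplicate (row, col) pairs are summed during the conversion
H = coo_matrix((vals, (rows, cols)), shape=(1000, 500)).tocsr()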
Sparse Matrix Structure:
The sparse matrix, H, comprises W lists of size N, or is built from an initial array of size NxW; let's call it A. Along the diagonal, lists of size N are repeated N times. So the first N rows of H contain A[:,0] repeated, sliding along N steps for each row.
Comparison to COO.tocsr()
When I scale up N and W and build the COO matrix first, then run tocsr(), it is actually faster than just building the CSR matrix directly. I'm not sure why this would be the case. I am wondering if I can take advantage of the structure of my sparse matrix H in some way, since there are many repeating elements in there.
Code Sample
Here is a code sample to visualize what is going on for a small sample size:
from scipy.sparse import csr_matrix
import numpy as np
import matplotlib.pyplot as plt

def test_csr(testdata):
    # column indices: each row reads N consecutive entries of range(N**2)
    indices = [x for _ in range(W - 1) for x in range(N**2)]
    # row pointers: every row stores exactly N values
    ptrs = [N * i for i in range(N * (W - 1))]
    ptrs.append(len(indices))
    data = []
    # loop along the first axis
    for i in range(W - 1):
        vec = testdata[:, i].squeeze()
        # repeat the vector N times
        for _ in range(N):
            data.extend(vec)
    Hshape = (N * (W - 1), N**2)
    H = csr_matrix((data, indices, ptrs), shape=Hshape)
    return H

N = 4
W = 8
testdata = np.random.randn(N, W)
print(testdata.shape)
H = test_csr(testdata)
plt.imshow(H.toarray(), cmap='jet')
plt.show()
It looks like your output uses only the first W-1 columns of testdata. I'm not sure if this is intentional. My solution assumes you want to use all of testdata.
When you construct the COO matrix are you also constructing the indices and data in a similar way?
One thing which might speed up constructing the csr_matrix is to use built-in NumPy functions to generate the data for the csr_matrix rather than Python loops and lists. I would expect this to improve the speed of generating the indices significantly. You can adjust the dtype to a different integer type depending on the size of your matrix.
import numpy as np
from scipy.sparse import csr_matrix

N = 4
W = 8
testdata = np.random.randn(N, W)

# every row stores exactly N values
ptrs = N * np.arange(W * N + 1, dtype='int')
# column indices: range(N*N) tiled W times
indices = np.tile(np.arange(N * N, dtype='int'), W)
# values: testdata tiled N times along the columns, flattened row-major
data = np.tile(testdata, N).flatten()
Hshape = (N * W, N**2)
H = csr_matrix((data, indices, ptrs), shape=Hshape)
Another possibility is to first construct the large sparse array and then assign each of the N vertical column blocks at once. This means you don't need to make a ton of copies of the original data before putting it into the sparse matrix. However, converting the matrix type at the end may be slow.
import numpy as np
from scipy.sparse import lil_matrix

N = 4
W = 8
testdata = np.random.randn(N, W)

Hshape = (N * W, N**2)
# LIL format supports efficient incremental assignment
H = lil_matrix(Hshape)
for j in range(N):
    # fill the j-th vertical column block in one fancy-indexed assignment
    H[N * np.arange(W) + j, N * j:N * (j + 1)] = testdata.T
H = H.tocsr()
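To decide between the two, a rough timing harness along these lines (my addition; sizes are illustrative) can compare the direct construction against the LIL route:
import timeit
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix

N, W = 50, 100  # illustrative sizes; scale up to match the real problem
testdata = np.random.randn(N, W)

def build_direct():
    ptrs = N * np.arange(W * N + 1, dtype='int')
    indices = np.tile(np.arange(N * N, dtype='int'), W)
    data = np.tile(testdata, N).flatten()
    return csr_matrix((data, indices, ptrs), shape=(N * W, N**2))

def build_lil():
    H = lil_matrix((N * W, N**2))
    for j in range(N):
        H[N * np.arange(W) + j, N * j:N * (j + 1)] = testdata.T
    return H.tocsr()

print('direct CSR:', timeit.timeit(build_direct, number=10))
print('via LIL:   ', timeit.timeit(build_lil, number=10))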
In Python I have two three-dimensional arrays:
T of size (n,n,n)
U of size (k,n,n)
T and U can be seen as many 2-D arrays stacked one next to the other. I need to multiply all those matrices, i.e. I have to perform the following operation:
H = np.zeros((k, k, n))  # preallocate the result
for i in range(n):
    H[:, :, i] = U[:, :, i].dot(T[:, :, i]).dot(U[:, :, i].T)
As n might be very big, I am wondering if this operation could somehow be sped up with NumPy.
Carefully looking into the iterators and how they are involved in those dot-product reductions, we could translate all of that into one np.einsum call, like so -
H = np.einsum('ijk,jlk,mlk->imk',U,T,U)
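As a quick sanity check (my addition; illustrative sizes), the einsum result can be compared against the explicit loop:
import numpy as np

n, k = 5, 3
T = np.random.randn(n, n, n)
U = np.random.randn(k, n, n)

# explicit loop version
H_loop = np.zeros((k, k, n))
for i in range(n):
    H_loop[:, :, i] = U[:, :, i].dot(T[:, :, i]).dot(U[:, :, i].T)

# single-call einsum version
H_einsum = np.einsum('ijk,jlk,mlk->imk', U, T, U)
assert np.allclose(H_loop, H_einsum)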
I am facing a memory and speed problem using NumPy but my issue is quite simple.
A is a large NumPy array of H * W integers.
V is a list containing N views of the large array A; each view has the same (Hv, Wv) shape.
K is another list containing N float weights corresponding to the views.
Hv and Wv are almost equal to H and W, but smaller. As NumPy views are not copies, this is nice for memory management, even if N is big.
Now, I want to compute a new array using broadcasting for speed: B = V1*K1 + ... + VN*KN
This will result in a new Hv x Wv weighted array.
The issue is that I do not know how to perform such an operation without creating intermediate arrays in memory (which is what happens when a view is multiplied by its weight) while still benefiting from broadcast operations.
import numpy as np
H = W = 1000
Hv = Wv = 900
N = 100
A = np.arange(H * W).reshape(H, W)
V = [A[i:Hv + i, i:Wv + i] for i in range(N)]
K = np.random.rand(N)
# This neither benefits from fast broadcasting nor keeps memory low!
B = sum(v*k for v, k in zip(V, K))
Could someone help me to make a smart use of NumPy, please?
I am assuming V is given as a list and we don't have access to optimize creating it, or just don't need to. So A is out of the equation, and we are left with V and K to get to the final output B; thus, we are left with optimizing the last step.
To solve it, we can just use np.tensordot to replace the last step of sum-reduction, as that's basically the sum-reduction of a matrix multiplication. In our case, we are reducing the first axis of K and along the length of the input list V. Internally, NumPy would convert the list to a tensor array, and that length would become the first axis of its array version. Thus, we would be reducing the first axis of both these inputs, and therefore the implementation would be -
B = np.tensordot(K,V,axes=[0,0]) # `axes` indicates the axes to be sum-reduced
Please note that the internal conversion of the list to a NumPy array might not be inexpensive, so it would make more sense to create V as a NumPy array upfront, rather than with a list comprehension that results in a list.
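A quick check that the two formulations agree (my addition, with smaller illustrative sizes than the question's):
import numpy as np

H = W = 50
Hv = Wv = 40
N = 10
A = np.arange(H * W, dtype=float).reshape(H, W)
V = [A[i:Hv + i, i:Wv + i] for i in range(N)]
K = np.random.rand(N)

B_loop = sum(v * k for v, k in zip(V, K))          # naive weighted sum
B_dot = np.tensordot(K, V, axes=[0, 0])            # sum-reduction via tensordot
assert np.allclose(B_loop, B_dot)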