I am trying to implement a spectral clustering algorithm for a community detection problem on a graph.
I have a very large matrix (more than 1M x 1M) whose eigenvectors I need to compute.
NumPy and SciPy need the matrix to be in memory to compute them, which is impossible in my case.
Is there any other library or package that calculates eigenvectors and eigenvalues on disk instead of in memory (just like HDF5 allows us to store and manipulate data on disk)?
Or is there any other solution you can suggest?
Increase the size of your swap file.
See:
What is virtual memory?
Creating a swap space
Using a swap space
Systems also typically report swap usage in real time in the resource monitor.
[Ubuntu resource monitor screenshot]
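If you want to check swap from Python itself rather than from a GUI monitor, one option is the psutil package (an assumption about available tooling, not part of the links above):

import psutil

# Total, used, and free swap in bytes, plus percent used
print(psutil.swap_memory())
# Compare against physical RAM to gauge how much will spill to swap
print(psutil.virtual_memory())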
I'm solving a problem that involves a sparse matrix. It has the three main diagonals, and a bunch of other subdiagonals. The full size of the matrix is (2048000, 2048000), but as it is quite sparse, it has only 525312000 stored elements, corresponding to about 4 GB of memory for double precision. When I create this matrix, Activity Monitor and top on my Mac both report a memory use of about 4 GB, as expected.
Next, I create an incomplete LU factorisation, to use as a preconditioner when solving the matrix system with BiCGStab. I use the following code:
from scipy.sparse import csc_matrix  # needed for the conversion below
from scipy.sparse.linalg import spilu

ILU = spilu(csc_matrix(L + Lr))
Here, Lr is the matrix I mentioned above, and L is another sparse matrix that is purely tridiagonal, and thus much smaller.
The variable ILU is of type SuperLU, and according to ILU.nnz it contains only 20384063 stored elements, which means it should take about 150 MB of memory, yet Activity Monitor and top both claim that I am now using about 8 GB of memory, where previously I was using about 4 GB. So what happened to all of that memory?
Not a particularly satisfying answer, but I posted an issue at the SciPy repo, and from the discussion it seems this is an issue on Mac only, and that there isn't much to be done about it.
https://github.com/scipy/scipy/issues/13827
I am trying to learn ML using Kaggle datasets. In one of the problems (using logistic regression), the input and parameter matrices are of size (1110001, 8) and (2122640, 8) respectively.
I am getting a memory error when multiplying them in Python. I guess this would be the same in any language, since the result is just too big. My question is: how do real-life ML implementations multiply matrices this large?
Things bugging me:
Some people on SO have suggested calculating the dot product in parts and then combining them. But even then the result would still be too big for RAM (9.42 TB? in this case).
And if I write it to a file, wouldn't it be too slow for optimization algorithms to read from the file while minimizing the function?
Even if I do write it to a file, how would fmin_bfgs (or any optimization function) read from it?
Also, the Kaggle notebook shows only 1 GB of storage available. I don't think anyone would allow TBs of storage space.
In my input matrix, many rows have similar values for some columns. Can I use that to my advantage to save space (like a sparse matrix does for zeros)?
Can anyone point me to a real-life sample implementation of such a case? Thanks!
I have tried many things. I will mention them here in case anyone needs them in the future:
I had already cleaned up the data, e.g. removing duplicates and records irrelevant to the given problem.
I stored the large matrices, which hold mostly 0s, as sparse matrices.
I implemented gradient descent using the mini-batch method instead of the plain old batch method (theta.T dot X); see the sketch below.
Now everything is working fine.
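For future readers, here is a minimal sketch of that approach, assuming binary-label logistic regression; the sizes, density, learning rate, and batch size are illustrative, and the random data merely stands in for the real Kaggle inputs:

import numpy as np
from scipy.sparse import random as sparse_random

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in data: a mostly-zero design matrix kept in CSR form
n_samples, n_features = 1110001, 8
X = sparse_random(n_samples, n_features, density=0.3, format="csr")
y = np.random.randint(0, 2, n_samples)

theta = np.zeros(n_features)
lr, batch = 0.1, 10000

for epoch in range(5):
    order = np.random.permutation(n_samples)
    for start in range(0, n_samples, batch):
        idx = order[start:start + batch]
        Xb, yb = X[idx], y[idx]  # only one small slice in RAM at a time
        grad = Xb.T @ (sigmoid(Xb @ theta) - yb) / len(idx)
        theta -= lr * grad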
Currently I'm implementing this paper for my undergraduate thesis in Python, but I only use the Mahalanobis metric learning (in case you're curious).
In short, I hit a problem when I need to learn a 67K x 67K matrix of integers, computed simply as numpy.dot(A.T, A) where A is a random vector of shape (1, 67K). Doing that throws a MemoryError, since my PC has only 8 GB of RAM and a rough calculation says about 16 GB is needed just to initialize it. So I searched for an alternative and found Dask.
I moved on to Dask with dask.array.dot(A.T, A) and that worked. But then I need to apply a whitening transformation to that matrix, which in Dask I can achieve by taking the SVD. Every time I compute that SVD, though, the IPython kernel dies (I assume due to lack of memory).
This is what I do, from initialization until the kernel dies:
import dask.array as da

fv_length = 512 * 2 * 66  # 67584

W = da.random.randint(10, 20, size=(fv_length,), chunks=(1000,))
W = da.reshape(W, (1, fv_length))
W_T = W.T
Wt = da.dot(W_T, W)  # (fv_length, fv_length) outer product
del W, W_T
# reshape into a tall-and-skinny matrix, as dask's SVD requires
Wt = da.reshape(Wt, (fv_length * fv_length // 2, 2))
U, S, Vt = da.linalg.svd(Wt)
del Wt
I never get U, S, and Vt.
Is my memory simply not enough for this sort of thing, even when I'm using Dask?
Or is this not a spec problem, but bad memory management on my part?
Or something else?
At this point I'm desperate enough to try a bigger machine, so I am planning to rent a bare-metal server with 32 GB of RAM. Even if I do so, is that enough?
Generally speaking, dask.array does not guarantee out-of-core operation for all computations. A square matrix-matrix multiply (or any L3 BLAS operation) is more or less impossible to do efficiently in limited memory.
You can ask Dask to use an on-disk cache for intermediate values; see the FAQ entry "My computation fills memory, how do I spill to disk?". However, this will be limited by disk-writing speed, which is generally fairly slow.
A large-memory machine and NumPy is probably the simplest way to resolve this problem. Alternatively, you could try to find a different formulation of your problem.
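As an illustration of the spill-to-disk route, here is one way to get that behavior with the distributed scheduler; this is a sketch under the assumption that a local cluster is acceptable (the worker count and memory limit are arbitrary), not the exact mechanism the FAQ describes:

import dask.array as da
from dask.distributed import Client, LocalCluster

# Workers spill intermediate chunks to disk as they approach memory_limit
cluster = LocalCluster(n_workers=2, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)

x = da.random.random((20000, 20000), chunks=(2000, 2000))
y = (x @ x.T).sum()
print(y.compute())  # slower than in-memory, bounded by disk speed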
I have a large (1000x1000x5000) 3D numpy array on which I need to perform many 3D rotations and then compute an asymmetric distance transform. The distance transform is trivially parallelizable, but I need a way to also perform the rotation itself using a computing cluster, which doesn't have much memory per core (e.g. 2 GB). What's a good strategy to efficiently exploit the computing cluster? (It has no GPUs or other specialized hardware, for that matter.)
And yes, I need the rotated volume, meaning that I cannot simply relabel the coordinates, as the asymmetric distance transform will overwrite the dataset several times.
The software I'm using on the cluster: python3.4.2 with scipy, numpy and mpi4py.
Thanks!
If you want to do matrix operations (e.g. a rotation that you could express as a matrix multiplication) in parallel on a cluster, what I would do is:
Compile numpy with a multi-threaded BLAS (e.g. OpenBLAS) so the matrix multiplication is multi-threaded on a node. The advantage is that you know this has been extensively tested and optimized, and you don't need to worry about parallel scaling.
Assuming that each node has, say, 32 cores (i.e. 2 GB x 32 = 64 GB of RAM in total), I would run ~4 MPI tasks per node with 8 threads per MPI task (so the available RAM per task is 16 GB, thus removing the low-RAM constraint).
Do a domain decomposition of your array among MPI tasks. For instance, this code (see the _mprotate function) computes rotations with scipy.ndimage using multiprocessing; you could do something similar, but with mpi4py.
The problem, though, is that unless I'm mistaken, scipy.ndimage.interpolation.rotate does not use matrix operations with BLAS; it is a pure C implementation that in the end calls the NI_GeometricTransform function. So, unless you use a different algorithm, the above approach won't work. You would then have to run as many MPI tasks as you have cores and do the domain decomposition among them (see the mpi4py tutorials and the sketch below).
This does not fully answer your question but hope it helps.
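As a hedged illustration of that decomposition, here is a minimal mpi4py sketch. It assumes the rotation axis coincides with the axis you decompose along (so each slab rotates independently), and the random slab merely stands in for data each rank would load itself:

import numpy as np
from mpi4py import MPI
from scipy.ndimage import rotate

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nz = 5000  # total slices along the decomposition axis
lo = rank * nz // size  # this rank's slice range
hi = (rank + 1) * nz // size

# Placeholder: each rank would load only its own slab, e.g. from HDF5
slab = np.random.rand(hi - lo, 1000, 1000).astype(np.float32)

# Rotate each slab in the (y, x) plane; reshape=False keeps the shape
rotated = rotate(slab, angle=30.0, axes=(1, 2), reshape=False, order=1)

# ...then run the distance transform on each slab and write the results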
I am using scipy.sparse.linalg.eigsh to solve the generalized eigenvalue problem for a very sparse matrix and am running into memory problems. The matrix is a square matrix with 1 million rows/columns, but each row has only about 25 non-zero entries. Is there a way to solve the problem without reading the entire matrix into memory, i.e. working with only blocks of the matrix in memory at a time?
It's OK if the solution involves using a different library, in Python or in Java.
For ARPACK, you only need to code up a routine that computes certain matrix-vector products. This can be implemented in any way you like, for instance reading the matrix from the disk.
from scipy.sparse.linalg import LinearOperator, eigsh

def my_matvec(x):
    # compute the matrix-vector product y = A @ x here,
    # e.g. by streaming blocks of A from disk
    y = ...
    return y

A = LinearOperator(matvec=my_matvec, shape=(1000000, 1000000))
eigsh(A)
Check the scipy.sparse.linalg.eigsh documentation for what is needed in the generalized eigenvalue problem case.
The SciPy ARPACK interface exposes more or less the complete ARPACK interface, so I doubt you will gain much by switching to Fortran or some other way of accessing ARPACK.
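To make the template above concrete, here is a hedged sketch of an out-of-core matvec that streams a CSR matrix from an HDF5 file in row blocks; the file layout (data, indices, indptr datasets) and the block size are my own assumptions, not something ARPACK prescribes:

import h5py
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

f = h5py.File("matrix.h5", "r")  # assumed layout: CSR components as datasets
indptr = f["indptr"][:]          # small: n+1 integers fit in memory
n = len(indptr) - 1
block = 100000                   # rows per block, tuned to available RAM

def my_matvec(x):
    y = np.empty(n)
    for start in range(0, n, block):
        stop = min(start + block, n)
        s, e = indptr[start], indptr[stop]
        # load only this block's nonzeros from disk
        sub = csr_matrix(
            (f["data"][s:e], f["indices"][s:e], indptr[start:stop + 1] - s),
            shape=(stop - start, n),
        )
        y[start:stop] = sub @ x
    return y

A = LinearOperator(matvec=my_matvec, shape=(n, n))
vals, vecs = eigsh(A, k=6)  # e.g. six extremal eigenpairs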