I would like to reproduce the SVD method mentioned in a Stanford lecture on my own dataset. The slide from the lecture is as follows.
My dataset is of the same type: a word co-occurrence matrix M of size
<13840x13840 sparse matrix of type '<type 'numpy.int64'>'
with 597828 stored elements in Compressed Sparse Column format>
generated and processed with CountVectorizer(). Note that this is a symmetric matrix.
However, when I tried to extract features via SVD, none of the following code worked.
1st try:
scipy.linalg.svd(M)
I tried converting the sparse csr matrix with todense() and toarray(); my computer churned for quite a few minutes and then the kernel stopped. I also played around with different parameter settings.
2nd try:
scipy.sparse.linalg.svds(M)
I have also tried changing the matrix dtype from int64 to float64; however, the kernel died after 30 seconds or so.
Could anyone suggest a way to perform SVD on this matrix?
Thank you so much
It seems the matrix is too demanding for your memory. You have several options:
Perform an adaptive SVD,
Use modred,
Use the SVD from dask.
The latter two should work out of the box.
All these options will load only what fits in memory.
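Before reaching for an out-of-core library, it is worth noting that scipy's own truncated SVD often succeeds where a full decomposition fails, provided you cast to float and only ask for a small number of components. A minimal sketch on a small random stand-in matrix (the sizes and k are illustrative, not the real 13840x13840 data):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Small stand-in for the word co-occurrence matrix
M = sparse_random(500, 500, density=0.01, format='csc', random_state=0)

# The real matrix holds int64 counts; svds wants a float matrix
M = M.astype(np.float64)

# Ask only for the k largest singular triplets -- far cheaper
# in time and memory than a full SVD
k = 20
U, s, Vt = svds(M, k=k)

print(U.shape, s.shape, Vt.shape)   # (500, 20) (20,) (20, 500)
```

The key point is that memory scales with n*k rather than n*n, so a small k keeps the workspace manageable even for a 13840x13840 input.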
Related
I'm comparing, in Python, the time to read a row of a matrix stored first in dense and then in sparse format.
The "extraction" of a row from a dense matrix costs around 3.6e-05 seconds
For the sparse format I tried both csr_matrix and lil_matrix, but reading a row took around 1e-04 seconds for both.
I would expect the sparse format to give the best performance; can anyone help me understand this?
arr[i,:] for a dense array produces a view, so its execution time is independent of arr.shape. If you don't understand the distinction between view and copy, you need to do more reading about numpy basics.
csr and lil formats allow indexing that looks a lot like ndarray's, but there are key differences. For the most part the concept of a view does not apply. There is one exception. M.getrowview(i) takes advantage of the unique data structure of a lil to produce a view. (Read its doc and code)
Some indexing of a csr format actually uses matrix multiplication, using a specially constructed 'extractor' matrix.
In all cases where sparse indexing produces a sparse matrix, actually constructing the new matrix from the data takes time. Sparse does not use compiled code nearly as much as numpy. Its strong point, relative to numpy, is matrix multiplication of matrices with 10% density or less.
In the simplest format (to understand), coo, each nonzero element is represented by three values: data, row, and col. Those are stored in three 1d arrays. So the matrix has to have a density below roughly 30% to even break even on memory use. coo does not implement indexing.
I use a Scipy CSR representation of an 800,000x350,000 matrix; call it M. I want to calculate the dot product M * M[0:x].T. Depending on the value of x, the memory consumption grows: x=1 is not noticeable, but with x=2000 the multiplication takes around 8 gigabytes of RAM.
I wonder what happens when I calculate this product, and why it takes so much memory compared to storing the sparse matrix (around 30 MB). Is the matrix expanded for the multiplication?
By investigating the results and the memory consumption over time, after each operation, I found the cause: the result of the sparse matrix multiplication. Indeed there are many zero values in M, but the result of M*M.T is a matrix containing only 50% zeros, so the result consumes a lot of memory.
Example: assume each row vector of M has a non-zero entry at the same index but is otherwise sparse. Then the result of M*M.T would not be sparse at all (no zero values).
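This fill-in effect is easy to verify by comparing densities before and after the product. A small sketch with a random sparse matrix (the sizes and density here are illustrative):

```python
from scipy.sparse import random as sparse_random

# Hypothetical sparse matrix: ~1% of entries are non-zero
M = sparse_random(1000, 400, density=0.01, format='csr', random_state=0)

P = M @ M.T   # 1000 x 1000 product

density_M = M.nnz / (M.shape[0] * M.shape[1])
density_P = P.nnz / (P.shape[0] * P.shape[1])

# Any pair of rows sharing even one non-zero column contributes a
# non-zero entry to P, so the product is considerably denser
print(density_M, density_P)
```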
Nonetheless, thanks for helping.
The compiled core of csr matrix multiplication is found at
https://github.com/scipy/scipy/blob/0cff7a5fe6226668729fc2551105692ce114c2b3/scipy/sparse/sparsetools/csr.h
starting around line 500, in the csr_matmat... functions. They include a reference to the math paper the algorithm is based on.
The Python code called by A*B is __mul__. Look at the version for your csr matrix to make sure it calls self._mul_sparse_matrix, which you will see ends up calling self.format + '_matmat_pass1' (and pass2).
Assuming it isn't resorting to the dense versions, you'll have to study the underlying algorithm to understand whether this memory use is realistic or not.
I need to form a 2D matrix with total size 2,886 x 2,003,817. I tried using numpy.zeros to create a 2D zero matrix and then calculate and assign each element (most of them are zero, so I only need to replace a few).
But when I try numpy.zeros to initialize my matrix, I get the following memory error:
C = numpy.zeros((2886, 2003817))   # raises MemoryError
I also tried to build the matrix without initialization. Basically, I calculate the elements of each row in each iteration of my algorithm and then do
C = numpy.concatenate((C, [A]), axis=0)
in which C is my final matrix and A is the row calculated in the current iteration. But I found this method takes a lot of time; I am guessing it is because of numpy.concatenate(?)
Could you please let me know if there is a way to avoid the memory error when initializing my matrix, or is there a better method to form a matrix of this size?
Thanks,
Amir
If your data has a lot of zeros in it, you should use scipy.sparse matrix.
It is a special data structure designed to save memory for matrices that have a lot of zeros. However, if your matrix is not that sparse, a sparse matrix can take up more memory than a dense one. There are many kinds of sparse matrices, and each is efficient at some operations and inefficient at others, so choose carefully.
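A minimal sketch of that approach for the shape in the question, using lil_matrix for cheap incremental assignment and converting to CSR once at the end (the assigned indices here are illustrative):

```python
from scipy.sparse import lil_matrix

# lil_matrix supports cheap element assignment; the dense equivalent
# (2886 x 2003817 float64) would need roughly 46 GB
C = lil_matrix((2886, 2003817))

# Assign the few non-zero entries as they are computed
# (row/column indices are made up for illustration)
C[0, 5] = 1.2
C[10, 1000000] = 3.4

# Convert once at the end for fast arithmetic and row slicing
C = C.tocsr()
print(C.shape, C.nnz)
```

Building rows into a lil matrix one at a time avoids both the MemoryError and the quadratic cost of repeated numpy.concatenate calls.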
I'm currently trying to classify text. My dataset is too big, and as suggested here, I need to use a sparse matrix. My question is: what is the right way to add an element to a sparse matrix? Say, for example, I have a matrix X which is my input.
X = np.random.randint(2, size=(6, 100))
Now this matrix X looks like an ndarray of ndarrays (or something like that).
If I do
X2 = csr_matrix(X)
I have the sparse matrix, but how can I add another element to the sparse matrix?
For example, how do I add this dense row [1,0,0,0,1,1,1,0,...,0,1,0] to the sparse input matrix?
(By the way, I'm very new to python, scipy, numpy, scikit ... everything.)
Scikit-learn has great documentation, with tutorials that you really should read before trying to invent things yourself. This one is the first to read; it explains how to classify text step by step, and this one is a detailed example of text classification using a sparse representation.
Pay extra attention to the parts where they talk about sparse representations, in this section. In general, if you want to use an SVM with a linear kernel and you have a large amount of data, LinearSVC (which is based on Liblinear) is better.
Regarding your question - I'm sure there are many ways to concatenate two sparse matrices (by the way, this is what you should search for on Google to find other ways of doing it). Here is one, but you'll have to convert from csr_matrix to coo_matrix, which is another type of sparse matrix: Is there an efficient way of concatenating scipy.sparse matrices?.
EDIT: When concatenating two matrices (or a matrix and an array, which is a 1-dimensional matrix), the general idea is to concatenate X1.data and X2.data and adjust their indices and indptr arrays (or row and col in the case of coo_matrix) to point to the correct places. Some sparse representations are better for specific operations and more awkward for others; you should read about csr_matrix and see whether it is the best representation. But I really urge you to start with the tutorials posted above.
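For the concrete case in the question, scipy.sparse.vstack handles the index bookkeeping for you when appending a new row. A minimal sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

X = np.random.randint(2, size=(6, 100))
X2 = csr_matrix(X)

# New dense sample to append (must have the same number of columns)
new_row = np.random.randint(2, size=(1, 100))

# vstack stacks sparse blocks row-wise and returns the requested format
X3 = vstack([X2, csr_matrix(new_row)], format='csr')
print(X3.shape)    # (7, 100)
```

If you append many rows, collect them in a list and call vstack once at the end rather than growing the matrix row by row.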
I am using scipy.sparse.linalg.eigsh to solve a generalized eigenvalue problem for a very sparse matrix, and I am running into memory problems. The matrix is square with 1 million rows/columns, but each row has only about 25 non-zero entries. Is there a way to solve the problem without reading the entire matrix into memory, i.e., working with only blocks of the matrix in memory at a time?
It's ok if the solution involves using a different library in python or in java.
For ARPACK, you only need to code a routine that computes certain matrix-vector products. This can be implemented any way you like, for instance by reading the matrix from disk.
from scipy.sparse.linalg import LinearOperator, eigsh

def my_matvec(x):
    # compute y = A @ x any way you like, e.g. reading blocks of A from disk
    y = ...
    return y

A = LinearOperator(matvec=my_matvec, shape=(1000000, 1000000))
eigsh(A)
Check the scipy.sparse.linalg.eigsh documentation for what is needed in the generalized eigenvalue problem case.
The Scipy ARPACK interface exposes more or less the complete ARPACK interface, so I doubt you will gain much by switching to Fortran or some other way of accessing ARPACK.
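A runnable version of the sketch above, using a small in-memory sparse matrix in place of the real on-disk data (the tridiagonal matrix is purely illustrative; in practice my_matvec would stream blocks of A from disk):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import LinearOperator, eigsh

n = 1000
# Stand-in for the on-disk matrix: a symmetric tridiagonal matrix
A_sparse = diags([np.ones(n - 1), 2 * np.ones(n), np.ones(n - 1)],
                 [-1, 0, 1])

def my_matvec(x):
    # ARPACK only ever asks for the product A @ x
    return A_sparse @ x

# eigsh never needs A itself, only the matvec; dtype must be given
A = LinearOperator(matvec=my_matvec, shape=(n, n), dtype=np.float64)

vals, vecs = eigsh(A, k=6)
print(vals.shape, vecs.shape)   # (6,) (1000, 6)
```

Because only matrix-vector products are required, peak memory is dominated by a handful of length-n vectors rather than the full matrix.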