SciPy, NumPy and scikit-learn: create a sparse matrix - Python

I'm currently trying to classify text. My dataset is too big and, as suggested here, I need to use a sparse matrix. My question now is: what is the right way to add an element to a sparse matrix? For example, say I have a matrix X which is my input.
X = np.random.randint(2, size=(6, 100))
Now this matrix X looks like an ndarray of ndarrays (or something like that).
If I do
X2 = csr_matrix(X)
I have the sparse matrix, but how can I add another element to the sparse matrix?
For example, given a dense element like [1,0,0,0,1,1,1,0,...,0,1,0] as a sparse vector, how do I add it to the sparse input matrix?
(By the way, I'm very new to Python, SciPy, NumPy, scikit-learn ... everything.)

Scikit-learn has great documentation, with great tutorials that you really should read before trying to invent things yourself. This one is the first to read; it explains how to classify text step by step, and this one is a detailed example of text classification using a sparse representation.
Pay extra attention to the parts where they talk about sparse representations, in this section. In general, if you want to use an SVM with a linear kernel and you have a large amount of data, LinearSVC (which is based on liblinear) is the better choice.
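For orientation, a minimal sketch of that kind of pipeline (a sketch only: the two categories and the test sentence are arbitrary, and it downloads the 20 newsgroups data used in those tutorials). The vectorizer produces a scipy CSR matrix, which LinearSVC consumes directly:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Two arbitrary categories just to keep the example small.
train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

# TfidfVectorizer outputs a sparse CSR matrix; LinearSVC accepts it as-is.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train.data, train.target)

print(clf.predict(["the rocket launch was delayed again"]))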
Regarding your question - I'm sure there are many ways to concatenate two sparse matrices (by the way, "concatenate sparse matrices" is what you should search for to find other ways of doing it). Here is one, but you'll have to convert from csr_matrix to coo_matrix, which is another type of sparse matrix: Is there an efficient way of concatenating scipy.sparse matrices?
EDIT: When concatenating two matrices (or a matrix and an array, which is a one-dimensional matrix), the general idea is to concatenate X1.data and X2.data and adjust their indices and indptr arrays (or row and col in the case of coo_matrix) so they point to the correct places. Some sparse representations are better for certain operations and more awkward for others; you should read about csr_matrix and see whether it is the best representation here. But I really urge you to start from those tutorials I posted above.
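For the concrete case in the question - appending one dense row to a sparse input matrix - here is a minimal sketch using scipy.sparse.vstack (the shapes mirror the example above):

import numpy as np
from scipy.sparse import csr_matrix, vstack

# Existing sparse input matrix: 6 samples x 100 features.
X = csr_matrix(np.random.randint(2, size=(6, 100)))

# A new dense sample; it must have the same number of columns as X.
new_row = np.random.randint(2, size=(1, 100))

# vstack concatenates along the rows and returns a sparse matrix again.
X = vstack([X, csr_matrix(new_row)], format="csr")
print(X.shape)  # (7, 100)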

Related

dense matrix vs sparse matrix in python

I'm comparing, in Python, the time to read a row of a matrix, first in dense and then in sparse format.
The "extraction" of a row from a dense matrix costs around 3.6e-05 seconds.
For the sparse formats I tried both csr_matrix and lil_matrix, but in both cases reading a row took around 1e-04 seconds.
I would expect the sparse format to give the best performance; can anyone help me understand this?
arr[i,:] for a dense array produces a view, so its execution time is independent of arr.shape. If you don't understand the distinction between view and copy, you need to do more reading about numpy basics.
csr and lil formats allow indexing that looks a lot like an ndarray's, but there are key differences. For the most part the concept of a view does not apply. There is one exception: M.getrowview(i) takes advantage of the unique data structure of lil to produce a view. (Read its docs and code.)
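A small sketch of that distinction (the shapes and the 1% density are arbitrary; getrowview is the lil-only exception mentioned above):

import numpy as np
from scipy import sparse

# Dense case: basic slicing returns a view, so no data is copied.
arr = np.random.rand(2000, 2000)
print(arr[42, :].base is arr)        # True

# lil case: getrowview() is the one sparse exception that returns a view.
M_lil = sparse.random(2000, 2000, density=0.01, format="lil")
rv = M_lil.getrowview(42)
rv[0, 7] = 99.0
print(M_lil[42, 7])                  # 99.0: the change is visible in M_lil

# csr case: row indexing builds a brand-new sparse matrix (a copy).
M_csr = M_lil.tocsr()
row = M_csr[42, :]
row[0, 7] = -1.0
print(M_csr[42, 7])                  # still 99.0: the copy did not write back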
Some indexing of a csr format actually uses matrix multiplication, using a specially constructed 'extractor' matrix.
In all cases where sparse indexing produces a sparse matrix, actually constructing the new matrix from the data takes time. Sparse code does not use compiled routines nearly as much as numpy does. Its strong point, relative to numpy, is matrix multiplication of matrices whose density is around 10% or less.
In the simplest format (to understand), coo, each nonzero element is represented by 3 values - data, row, col - stored in three 1-D arrays. So the density has to be below roughly 30% for it to even break even with a dense array in terms of memory use. coo does not implement indexing.
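To make the coo layout concrete, a tiny sketch of the three parallel arrays:

import numpy as np
from scipy.sparse import coo_matrix

# Three parallel 1-D arrays: value, row index and column index of each nonzero.
data = np.array([4.0, 5.0, 7.0])
row = np.array([0, 1, 2])
col = np.array([3, 0, 2])

M = coo_matrix((data, (row, col)), shape=(3, 4))
print(M.toarray())
# Each nonzero costs three stored numbers, hence the ~30% break-even point;
# coo itself does not support M[i, j] indexing (convert to csr/csc for that).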

python kernel dead when performing SVD on a sparse symmetrical matrix

I would like to reproduce the SVD method mentioned in a Stanford lecture on my own dataset. The slide from the lecture is as follows:
My dataset is of the same type, which is a word co-occurrence matrix M with a size of
<13840x13840 sparse matrix of type '<type 'numpy.int64'>'
with 597828 stored elements in Compressed Sparse Column format>
generated and processed with CountVectorizer(); note that this is a symmetric matrix.
However, when I try to extract features with SVD, none of the following code works.
1st try:
scipy.linalg.svd(M)
I tried this on the matrix converted from sparse CSR with todense() and toarray(); my computer works for quite a few minutes and then reports that the kernel has stopped. I also played around with different parameter settings.
2nd try:
scipy.sparse.linalg.svds(M)
I also tried changing the matrix dtype from int64 to float64; however, the kernel died after 30 seconds or so.
Could anyone suggest a way to perform SVD on this matrix?
Thank you so much.
It seems the matrix is too demanding for your memory. You have several options:
Perform an adaptive SVD,
Use modred,
Use the SVD from dask.
The latter two should work out of the box.
All of these options load only what fits in memory.
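As a rough illustration of the dask route (a sketch only: the density, chunk size and k=100 are arbitrary choices, and the toarray() step still needs roughly 1.5 GB for a 13840x13840 float64 matrix):

import dask
import dask.array as da
from scipy.sparse import random as sparse_random

# Stand-in for the real 13840x13840 co-occurrence matrix from the question.
M = sparse_random(13840, 13840, density=0.003, format="csc", dtype="float64")

# Work on a chunked dense view so dask can process one block at a time.
A = da.from_array(M.toarray(), chunks=(2000, 2000))

# Randomized, truncated SVD keeping only the top 100 components.
u, s, v = da.linalg.svd_compressed(A, k=100)
u, s, v = dask.compute(u, s, v)
print(s[:5])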

Scipy sparse matrices - purpose and usage of different implementations

Scipy has many different types of sparse matrices available. What are the most important differences between these types, and what is the difference in their intended usage?
I'm developing code in Python based on a sample code [1] in Matlab. One section of the code uses sparse matrices - which seem to have a single (annoying) type in Matlab - and I'm trying to figure out which type I should use [2] in Python.
[1]: This is for a class. Most people are doing the project in Matlab, but I like to create unnecessary work and confusion --- apparently.
[2]: This is an academic question: I have the code working properly with the CSR format, but I'm interested in knowing what the optimal usages are.
Sorry if I'm not answering this completely enough, but hopefully I can provide some insight.
CSC (Compressed Sparse Column) and CSR (Compressed Sparse Row) are more compact and efficient, but difficult to construct "from scratch". COO (Coordinate) and DOK (Dictionary of Keys) are easier to construct, and can then be converted to CSC or CSR via matrix.tocsc() or matrix.tocsr().
CSC is generally more efficient for accessing column vectors or doing column operations, as it is stored as arrays of columns with their values at each row.
CSR matrices are the opposite: stored as arrays of rows with their values at each column, they are more efficient for accessing row vectors or doing row operations.
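A minimal sketch of that workflow (the shapes and entries are arbitrary): build incrementally in a cheap-to-modify format, then convert once for fast arithmetic.

import numpy as np
from scipy.sparse import dok_matrix

# DOK is convenient for incremental construction...
M = dok_matrix((1000, 1000), dtype=np.float64)
M[0, 5] = 1.0
M[42, 42] = 2.5

# ...then convert once to CSR (fast row access) or CSC (fast column access).
M_csr = M.tocsr()
M_csc = M.tocsc()

row_slice = M_csr[42, :]   # cheap on CSR
col_slice = M_csc[:, 5]    # cheap on CSC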

How to assemble large sparse matrices effectively in python/scipy

I am working on an FEM project using SciPy. My problem is that the assembly of the sparse matrices is too slow. I compute the contribution of every element in a small dense matrix (one per element). For the assembly of the global matrices I loop over all the small dense matrices and set the matrix entries the following way:
[i,j] = someList[k][l]
Mglobal[i,j] = Mglobal[i,j] + Mlocal[k,l]
Mglobal is a lil_matrix of appropriate size, and someList maps the indexing variables.
Of course this is rather slow and consumes most of the matrix assembly time. Is there a better way to assemble a large sparse matrix from many small dense matrices? I tried scipy.weave, but it doesn't seem to work with sparse matrices.
I posted my response to the scipy mailing list; Stack Overflow is a bit easier to access, so I will post it here as well, albeit in a slightly improved version.
The trick is to use the IJV storage format. This is a set of three arrays where the first contains the row indices, the second the column indices, and the third the values of the matrix at those locations. This is the best way to build finite element matrices (or any sparse matrix, in my opinion) because access to this format is really fast (you are just filling arrays).
In scipy this is called coo_matrix; the class takes the three arrays as an argument. It is really only useful for converting to another format (CSR or CSC) for fast linear algebra.
For finite elements, you can estimate the size of the three arrays with something like
size = number_of_elements * number_of_basis_functions**2
so if you have 2D quadratics you would do number_of_elements * 36, for example. This approach is convenient because if you have the local matrices you definitely have the global numbers and entry values: exactly what you need for building the three IJV arrays. Scipy is smart enough to throw out zero entries, so overestimating is fine.
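A minimal sketch of that assembly pattern (the element list here is a toy stand-in for whatever produces the global dof numbers and local dense matrices in the real code):

import numpy as np
from scipy.sparse import coo_matrix

n_dofs = 8
# Toy stand-in for the per-element data: (global dof numbers, local dense matrix).
elements = [
    (np.array([0, 1, 2]), np.ones((3, 3))),
    (np.array([2, 3, 4]), 2.0 * np.ones((3, 3))),
]

rows, cols, vals = [], [], []
for global_dofs, M_local in elements:
    for a, i in enumerate(global_dofs):
        for b, j in enumerate(global_dofs):
            rows.append(i)
            cols.append(j)
            vals.append(M_local[a, b])

# Duplicate (i, j) pairs are summed on conversion, which is exactly the
# assembly step; the shared dof (2, 2) ends up with 1.0 + 2.0 = 3.0.
M_global = coo_matrix((vals, (rows, cols)), shape=(n_dofs, n_dofs)).tocsr()
print(M_global[2, 2])   # 3.0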

How to efficiently add sparse matrices in Python

I want to know how to efficiently add sparse matrices in Python.
I have a program that breaks a big task into subtasks and distributes them across several CPUs. Each subtask yields a result (a scipy sparse matrix in lil_matrix format).
The sparse matrix dimensions are 100000x500000, which is quite huge, so I really need the most efficient way to sum all the resulting sparse matrices into a single sparse matrix, preferably using some C-compiled method.
Have you tried timing the simplest method?
matrix_result = matrix_a + matrix_b
The documentation warns this may be slow for LIL matrices, suggesting the following may be faster:
matrix_result = (matrix_a.tocsr() + matrix_b.tocsr()).tolil()
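For many partial results, the same idea extends naturally: convert each lil_matrix to CSR once and let scipy's compiled addition do the accumulation. A small sketch (the shapes and density are scaled down from the question, and partials is a hypothetical stand-in for the per-CPU results):

from functools import reduce
from scipy.sparse import random as sparse_random

# Hypothetical stand-ins for the per-subtask results (lil_matrix format).
partials = [sparse_random(1000, 5000, density=0.001, format="lil")
            for _ in range(8)]

# Convert each to CSR once, then sum with scipy's compiled sparse addition.
total = reduce(lambda a, b: a + b, (m.tocsr() for m in partials))
print(total.shape, total.nnz)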
