Scipy has many different types of sparse matrices available. What are the most important differences between these types, and what is the difference in their intended usage?
I'm developing code in Python based on sample code[1] in Matlab. One section of the code uses sparse matrices, which seem to have a single (annoying) type in Matlab, and I'm trying to figure out which type I should use[2] in Python.
[1]: This is for a class. Most people are doing the project in Matlab, but I like to create unnecessary work and confusion, apparently.
[2]: This is an academic question: I have the code working properly with the CSR format, but I'm interested in knowing what the optimal usage of each format is.
Sorry if I'm not answering this completely enough, but hopefully I can provide some insight.
CSC (Compressed Sparse Column) and CSR (Compressed Sparse Row) are more compact and efficient, but difficult to construct "from scratch". COO (Coordinate) and DOK (Dictionary of Keys) are easier to construct, and can then be converted to CSC or CSR via matrix.tocsc() or matrix.tocsr().
CSC is generally more efficient for column slicing and column operations, since the nonzero values are stored column by column (with their row index and value).
CSR matrices are the opposite: the nonzero values are stored row by row, so they are more efficient for row slicing and row operations.
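A minimal sketch of that workflow (the shape and values here are arbitrary): build incrementally in DOK or COO form, then convert once for fast arithmetic.

import numpy as np
from scipy import sparse

# DOK supports item assignment, so it is handy for incremental construction
M = sparse.dok_matrix((5, 5))
M[0, 1] = 2.0
M[3, 4] = 7.5

# COO is built directly from (row, col, value) triplets
row = np.array([0, 3])
col = np.array([1, 4])
data = np.array([2.0, 7.5])
C = sparse.coo_matrix((data, (row, col)), shape=(5, 5))

# convert to CSR/CSC once construction is done
A = M.tocsr()
B = C.tocsc()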
Related
I'm comparing, in Python, the time it takes to read a row of a matrix, first in dense and then in sparse format.
The "extraction" of a row from a dense matrix costs around 3.6e-05 seconds
For the sparse format I tried both csr_mtrix and lil_matrix, but both for the row-reading spent around 1-e04 seconds
I would expect the sparse format to give the best performance, can anyone help me understand this ?
arr[i,:] for a dense array produces a view, so its execution time is independent of arr.shape. If you don't understand the distinction between view and copy, you need to do more reading about numpy basics.
csr and lil formats allow indexing that looks a lot like ndarray's, but there are key differences. For the most part the concept of a view does not apply, with one exception: M.getrowview(i) takes advantage of the unique data structure of a lil (a list of lists) to produce a view. (Read its docs and code.)
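A quick way to see the difference (the array sizes are arbitrary; this just checks what shares memory with the original):

import numpy as np
from scipy import sparse

arr = np.zeros((1000, 1000))
arr[5, 7] = 3.0
row = arr[5, :]               # basic slicing of a dense array gives a view
print(row.base is arr)        # True: no data was copied

M = sparse.lil_matrix(arr)
row_copy = M[5, :]            # regular sparse indexing builds a new sparse matrix (a copy)
row_view = M.getrowview(5)    # lil-specific: shares M's per-row lists
row_view[0, 0] = 99.0
print(M[5, 0])                # 99.0, because the view writes through to M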
Some indexing of a csr format actually uses matrix multiplication, using a specially constructed 'extractor' matrix.
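For example, extracting row i from a csr matrix can be expressed as a product with a 1-row "extractor" that has a single 1 in column i. This is a sketch of the idea, not scipy's actual internal code:

from scipy import sparse

M = sparse.random(1000, 1000, density=0.01, format='csr')
i = 42
extractor = sparse.csr_matrix(([1], ([0], [i])), shape=(1, M.shape[0]))
row = extractor @ M           # same result as M[i, :], computed as a matrix product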
In all cases where sparse indexing produces a sparse matrix, actually constructing the new matrix from the data takes time. Sparse does not use compiled code nearly as much as numpy. Its strong point, relative to numpy, is matrix multiplication of matrices whose nonzero fraction is around 10% or less.
In the simplest format to understand, coo, each nonzero element is represented by 3 values - data, row, col - stored in three 1d arrays. Since every nonzero costs three stored numbers instead of one, fewer than about a third of the entries should be nonzero just to break even on memory use. coo does not implement indexing.
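To make the coo layout concrete (a toy matrix):

import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 0]])
C = sparse.coo_matrix(dense)
print(C.row)    # [0 1 2]
print(C.col)    # [2 0 1]
print(C.data)   # [3 4 5]
# coo has no indexing; convert first if you need element access
print(C.tocsr()[0, 2])   # 3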
I am working with a sparse matrix of about 260k rows, 3M columns and 26M non-zero values (stored in Matrix Market format). I also have JSON files that describe the metadata for each row and column. I need to perform matrix operations over this matrix, mainly matrix products, although other operations will most surely be required. So far I have been working with my matrix and with my dictionaries (built from the JSONs), where each dictionary links a row/col index to its metadata value. It is not ideal, although it works.
I do wonder: is there a better option out there? I am aware of Pandas/Dato dataframes, but it seems to me that the matrix part (and its operations) takes a back seat there. I have also been loosely following the Blaze project (Dask, xray, mainly those out-of-core technologies). I want to know the standard (or most adequate) way to deal with this scenario.
Any insight is greatly appreciated. Thanks.
The latest version of pandas has "sparse" data structures, including DataFrame, Series, and Panel, which can be compressed on any common value, including NaN, not just 0. Pandas is supported behind the scenes by numpy and optionally by scipy, which has the scipy.sparse module for directly working with mathematically sparse (mainly 0-filled) matrices. "Sparse" Pandas objects also have an experimental API to convert to scipy.sparse objects.
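A rough sketch of that conversion, assuming a pandas version that provides the SparseDtype and .sparse accessor (older releases used SparseDataFrame instead):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.eye(4))                           # mostly-zero toy frame
sdf = df.astype(pd.SparseDtype(float, fill_value=0.0))
print(sdf.sparse.density)                              # fraction of non-fill values
coo = sdf.sparse.to_coo()                              # a scipy.sparse.coo_matrix
csr = coo.tocsr()                                      # convert for fast linear algebra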
I'm currently trying to classify text. My dataset is too big and, as suggested here, I need to use a sparse matrix. My question is now: what is the right way to add an element to a sparse matrix? Let's say, for example, I have a matrix X which is my input.
import numpy as np
X = np.random.randint(2, size=(6, 100))
Now this matrix X looks like an ndarray of an ndarray (or something like that).
If I do
from scipy.sparse import csr_matrix
X2 = csr_matrix(X)
I have the sparse matrix, but how can I add another element to the sparse matrix?
For example, how do I convert this dense element: [1,0,0,0,1,1,1,0,...,0,1,0] to a sparse vector and add it to the sparse input matrix?
(btw, I'm very new at python, scipy,numpy,scikit ... everything)
Scikit-learn has great documentation, with tutorials that you really should read before trying to invent this yourself. This one is the first one to read; it explains how to classify text step by step, and this one is a detailed example of text classification using a sparse representation.
Pay extra attention to the parts where they talk about sparse representations, in this section. In general, if you want to use an SVM with a linear kernel and you have a large amount of data, LinearSVC (which is based on Liblinear) is better.
Regarding your question - I'm sure there are many ways to concatenate two sparse matrices (btw, "concatenate" is what you should search for in Google to find other ways of doing it). Here is one, but you'll have to convert from csr_matrix to coo_matrix, which is another type of sparse matrix: Is there an efficient way of concatenating scipy.sparse matrices?.
EDIT: When concatenating two matrices (or a matrix and an array, which is a 1-dimensional matrix), the general idea is to concatenate X1.data and X2.data and manipulate their indices and indptr arrays (or row and col in the case of coo_matrix) to point to the correct places. Some sparse representations are better for specific operations and more awkward for others; you should read about csr_matrix and see whether it is the best representation. But I really urge you to start from those tutorials I posted above.
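In practice you rarely need to touch data/indices/indptr by hand; scipy.sparse.vstack does the bookkeeping for you. A minimal sketch (the new row below is just a random stand-in for a real sample):

import numpy as np
from scipy import sparse

X = np.random.randint(2, size=(6, 100))
X2 = sparse.csr_matrix(X)

new_row = np.random.randint(2, size=(1, 100))                      # dense row to append
X2 = sparse.vstack([X2, sparse.csr_matrix(new_row)], format='csr')  # stack it as one more row
print(X2.shape)                                                      # (7, 100)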
I am working on an FEM project using Scipy. Now my problem is that the assembly of the sparse matrices is too slow. I compute the contribution of every element in small dense matrices (one for each element). For the assembly of the global matrices I loop over all the small dense matrices and set the matrix entries the following way:
i, j = someList[k][l]                          # map local indices (k, l) to global indices (i, j)
Mglobal[i, j] = Mglobal[i, j] + Mlocal[k, l]
Mglobal is a lil_matrix of appropriate size, and someList maps the indexing variables.
Of course this is rather slow and consumes most of the matrix assembly time. Is there a better way to assemble a large sparse matrix from many small dense matrices? I tried scipy.weave but it doesn't seem to work with sparse matrices.
I posted my response to the scipy mailing list; stack overflow is a bit easier to access so I will post it here as well, albeit a slightly improved version.
The trick is to use the IJV storage format. This is a trio of three arrays, where the first one contains row indices, the second has column indices, and the third has the value of the matrix at that location. This is the best way to build finite element matrices (or any sparse matrix, in my opinion) as access to this format is really fast (just filling an array).
In scipy this is called coo_matrix; the class takes the three arrays as an argument. It is really only useful for converting to another format (CSR or CSC) for fast linear algebra.
For finite elements, you can estimate the size of the three arrays by something like
size = number_of_elements * number_of_basis_functions**2
so if you have 2D quadratics you would do number_of_elements * 36, for example.
This approach is convenient because if you have the local matrices you definitely have the global numbers and entry values: exactly what you need for building the three IJV arrays. When the coo matrix is converted to CSR, duplicate (i, j) entries are summed, which is exactly the accumulation that assembly needs. Scipy is smart enough to throw out zero entries, so overestimating is fine.
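A rough sketch of that assembly pattern; the connectivity and local matrices below are random placeholders for whatever your FEM code produces:

import numpy as np
from scipy import sparse

n_dof = 1000                                   # global degrees of freedom (example value)
n_elem = 500                                   # number of elements (example value)
n_basis = 6                                    # e.g. 2D quadratic triangles

element_dofs = np.random.randint(0, n_dof, size=(n_elem, n_basis))  # placeholder connectivity
local_mats = np.random.rand(n_elem, n_basis, n_basis)               # placeholder element matrices

# preallocate the IJV arrays: one slot per local-matrix entry
size = n_elem * n_basis**2
I = np.empty(size, dtype=np.int32)
J = np.empty(size, dtype=np.int32)
V = np.empty(size)

ptr = 0
for e in range(n_elem):
    dofs = element_dofs[e]
    block = n_basis**2
    I[ptr:ptr + block] = np.repeat(dofs, n_basis)   # global row index of each local entry
    J[ptr:ptr + block] = np.tile(dofs, n_basis)     # global column index of each local entry
    V[ptr:ptr + block] = local_mats[e].ravel()      # local values, row-major
    ptr += block

# duplicate (i, j) pairs are summed on conversion, which is the accumulation assembly needs
Mglobal = sparse.coo_matrix((V, (I, J)), shape=(n_dof, n_dof)).tocsr()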
I'm working with large, sparse matrices (document-feature matrices generated from text) in Python. It's taking quite a bit of processing time and memory to chew through these, and I imagine that storing them in a proper sparse-matrix format could offer some improvements. But I'm worried that using a sparse matrix library is going to make it harder to plug into other Python (and R, through rpy2) modules.
Can people who've crossed this bridge already offer some advice? What are the pros and cons of using sparse matrices in python/R, in terms of performance, scalability, and compatibility?
Using sparse matrices in Python might not be a great idea in itself.
Have you checked out sparse matrices in numpy / scipy?
Numpy brings the immense benefit of using mainly C code to provide performance gains in Python.
From my limited experience of doing text processing in R, the performance makes it pretty much unusable for anything beyond exploratory data analysis.
Regardless, you shouldn't be using vanilla Python lists for sparse matrices; it will (understandably) take a while to chew through them.
There are several ways to represent sparse matrices (the documentation for the R SparseM package reports 20 different ways to store sparse matrix data), so complete compatibility with all solutions is probably out of the question. The number of options also suggests that there is no best-in-all-situations solution.
Pick either the numpy/scipy sparse matrices or R's SparseM (through rpy2) according to where your heavy number-crunching routines on those matrices are found (Python or R).