How to create a huge sparse matrix in scipy - python

I am trying to create a very huge sparse matrix which has a shape (447957347, 5027974).
And, it contains 3,289,288,566 elements.
But, when i create a csr_matrix using scipy.sparse, it return something like this:
<447957346x5027974 sparse matrix of type '<type 'numpy.uint32'>'
with -1005678730 stored elements in Compressed Sparse Row format>
The source code for creating matrix is:
indptr = np.array(a, dtype=np.uint32) # a is a python array('L') contain row index information
indices = np.array(b, dtype=np.uint32) # b is a python array('L') contain column index information
data = np.ones((len(indices),), dtype=np.uint32)
test = csr_matrix((data,indices,indptr), shape=(len(indptr)-1, 5027974), dtype=np.uint32)
And, I also found when I convert an 3 billion length python array to numpy array, it will raise an error:
ValueError:setting an array element with a sequence
But, when I create three 1 billion length python arrays, and convert them to numpy array, then append them. It works fine.
I'm confused.

You are using an older version of SciPy. In the original implementation of sparse matrices, indices where stored in an int32 variable, even on 64 bit systems. Even if you define them to be uint32, as you did, they get casted. So whenever your matrix has more than 2^31 - 1 nonzero entries, as is your case, the indexing overflows and lots of bad things happen. Note that in your case the weird negative number of elements is explained by:
>>> np.int32(np.int64(3289288566))
-1005678730
The good news is that this has already been figured out. I think this is the relevant PR, although there were some more fixes after that one. In any case, if you use the latest release candidate for SciPy 0.14, your problem should be gone.

Related

Why is each element in a sparse csc matrix 8 bytes?

For example, if I initially have a dense matrix:
A = numpy.array([[0, 0],[0, 1]])
and then convert it to a csc sparse matrix using csc_matrix(A). The matrix is then stored as:
(1, 1) 1
#(row, column) val
which comprises of three values. Why is the size of the sparse matrix only 8 bytes, even though the computer is essentially storing 3 values? Surely the size of the matrix would be a least 12 bytes, since an integer usually holds 4 bytes.
I don't agree that the size of the sparse matrix is 8 bytes. I may be missing something, but if I do this, I get a very different answer:
>>> import sys
>>> import numpy
>>> from scipy import sparse
>>> A = numpy.array([[0, 0],[0, 1]])
>>> s = sparse.csc_matrix(A)
>>> s
<2x2 sparse matrix of type '<class 'numpy.int32'>'
with 1 stored elements in Compressed Sparse Column format>
>>> sys.getsizeof(s)
56
This is the size of the data structure in memory and I assure you that it is accurate. Python must know how big it is, because it does the memory allocation.
If, on the other hand, you use s.data.nbytes:
>>> s.data.nbytes
4
This gives the expected answer of 4. It is expected because s reports itself as having one stored element of type int32. The value returned, according to the docs,
does not include memory consumed by non-element attributes of the array object.
This is not a more accurate result, just an answer to a different question, as 35421869 makes clear.
I can't explain why you report a value of 8 bytes when the result 4 is clearly correct. One possibility is that numpy.array([[0, 0],[0, 1]]) is not in fact what was actually converted to the sparse array. Where did the value 5 come from? The value of 8 is consistent with a beginning value of numpy.array([[0, 0],[0, 5.0]]).
Your figure of 12 bytes is based on two unmet expectations.
It is possible to represent a sparse matrix as a list of triples (row, column, value). And that is in fact how a COO-matrix is stored, at least in principle. But CSC stands for compressed sparse column and so there are fewer explicit column indexes than in a COO-matrix. This Wikipedia article provides a lucid explanation of how the storage actually works.
nbytes does not report the total memory cost of storing the elements of the matrix. It reports a numpy invariant (over many different kinds of matrix) x.nbytes == np.prod(x.shape) * x.itemsize. This is an important quantity because the explicitly stored elements of the matrix form its biggest subsidiary data structure and must be allocated in contiguous memory.

Update numpy array with sparse indices and values

I have 1-dimensional numpy array and want to store sparse updates of it.
Say I have array of length 500000 and want to do 100 updates of 100 elements. Updates are either adds or just changing the values (I do not think it matters).
What is the best way to do it using numpy?
I wanted to just store two arrays: indices, values_to_add and therefore have two objects: one stores dense matrix and other just keeps indices and values to add, and I can just do something like this with the dense matrix:
dense_matrix[indices] += values_to_add
And if I have multiple updates, I just concat them.
But this numpy syntax doesn't work fine with repeated elements: they are just ignored.
Updating pair when we have an update that repeats index is O(n). I thought of using dict instead of array to store updates, which looks fine from the point of view of complexity, but it doesn't look good numpy style.
What is the most expressive way to achieve this? I know about scipy sparse objects, but (1) I want pure numpy because (2) I want to understand the most efficient way to implement it.
If you have repeated indices you could use at, from the documentation:
Performs unbuffered in place operation on operand ‘a’ for elements
specified by ‘indices’. For addition ufunc, this method is equivalent
to a[indices] += b, except that results are accumulated for elements
that are indexed more than once.
Code
a = np.arange(10)
indices = [0, 2, 2]
np.add.at(a, indices, [-44, -55, -55])
print(a)
Output
[ -44 1 -108 3 4 5 6 7 8 9]

How this mixed scipy.sparse / numpy program should be handled

I am currently trying to use numpy as well a scipy in order to handle sparse matrices, but, in the process of evaluating sparsity of a matrix, I had trouble, and I don't know how the following behaviour should be understood:
import numpy as np
import scipy.sparse as sp
a=sp.csc.csc_matrix(np.ones((3,3)))
a
np.count_nonzero(a)
When evaluating a, and non zero count, using the above code, I saw this output in ipython:
Out[9]: <3x3 sparse matrix of type '' with 9
stored elements in Compressed Sparse Column format>
Out[10]: 1
I think there is something I don't understand here.
A 3*3 matrix full of 1, should have 9 non-zero term, and this is the answer I get if I use the toarray method from scipy.
I may be using numpy and scipy the wrong way ?
The nonzero count is available as an attribute:
In [295]: a=sparse.csr_matrix(np.arange(9).reshape(3,3))
In [296]: a
Out[296]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in Compressed Sparse Row format>
In [297]: a.nnz
Out[297]: 8
As Warren commented, you can't count on numpy functions working on sparse. Use sparse functions and methods. Sometimes numpy functions are written in a way that invokes the arrays own method, in which the function call might work. But that is true only on a case by case basis.
In Ipython I make heavy use of the a.<tab> to get a list of completions (attributes and methods). I also use the function?? to look at the code.
In the case of np.count_nonzero I see no code - it is compiled, and only works on np.ndarray objects.
np.nonzero(a) works. Look at its code, and see that it looks for the array's method: nonzero = a.nonzero
The sparse nonzero method code is:
def nonzero(self):
...
# convert to COOrdinate format
A = self.tocoo()
nz_mask = A.data != 0
return (A.row[nz_mask],A.col[nz_mask])
The A.data !=0 line is there because it is possible to construct a matrix with 0 data elements, particularly if you use the coo (data,(i,j)) format. So apart from that caution, the nnz attribute gives a reliable count.
Doing a.<tab> I also see a.getnnz and a.eleminate_zeros methods, which may be helpful if you are worried about sneaky zeros.
Sometimes it is useful to work directly with the data attributes of a sparse matrix. It's safer to access them than to modify them. But each sparse format has different attributes. In the csr case you can do:
In [306]: np.count_nonzero(a.data)
Out[306]: 8

Indices of scipy sparse csr_matrix

I have two scipy sparse csr matrices with the exact same shape but potentially different data values and nnz value. I now want to get the top 10 elements of one matrix and increase the value on the same indices on the other matrix. My current approach is as follows:
idx = a.data.argpartition(-10)[-10:]
i, j = matrix.nonzero()
i_idx = i[idx]
j_idx = j[idx]
b[i_idx, j_idx] += 1
The reason I have to go this way is that a.data and b.data do not necessarily have the same number of elements and hence the indices would differ.
My question now is whether I can improve this in some way. As far as I know the nonzero procedure is not elegant as I have to allocate two new arrays and I am very tough on memory already. I can get the j_indices via csr_matrix.indices but what about the i_indices? Can I use the indptr in a nice way for that?
Happy for any hints.
I'm not sure what the "top 10 elements" means. I assume that if you have matrices A and B you want to set B[i, j] += 1 if A[i, j] is within the first 10 nonzero entries of A (in CSR format).
I also assume that B[i, j] could be zero, which is the worst case performance wise, since you need to modify the sparsity structure of your matrix.
CSR is not an ideal format to use for changing sparsity structure. This is because every new insertion/deletion is O(nnzs) complexity (assuming the CSR storage is backed by an array - and it usually is).
You could use the DOK format for your second matrix (the one you are modifying), which provides O(1) access to elements.
I haven't benchmarked this but I suppose your option is 10 * O(nnzs) at worst, when you are adding 10 new nonzero values, whereas the DOK version should need O(nnzs) to build the matrix, then O(1) for each insertion and, finally, O(nnzs) to convert it back to CSR (assuming this is needed).

how to take a matrix in python?

i want to create a matrix of size 1234*5678 with it being filled with 1 to 5678 in row major order?>..!!
I think you will need to use numpy to hold such a big matrix efficiently , not just computation. You have ~5e6 items of 4/8 bytes means 20/40 Mb in pure C already, several times of that in python without an efficient data structure (a list of rows, each row a list).
Now, concerning your question:
import numpy as np
a = np.empty((1234, 5678), dtype=np.int)
a[:] = np.linspace(1, 5678, 5678)
You first create an array of the requested size, with type int (I assume you know you want 4 bytes integer, which is what np.int will give you on most platforms). The 3rd line uses broadcasting so that each row (a[0], a[1], ... a[1233]) is assigned the values of the np.linspace line (which gives you an array of [1, ....., 5678]). If you want F storage, that is column major:
a = np.empty((1234, 4567), dtype=np.int, order='F')
...
The matrix a will takes only a tiny amount of memory more than an array in C, and for computation at least, the indexing capabilities of arrays are much better than python lists.
A nitpick: numeric is the name of the old numerical package for python - the recommended name is numpy.
Or just use Numerical Python if you want to do some mathematical stuff on matrix too (like multiplication, ...). If they use row major order for the matrix layout in memory I can't tell you but it gets coverd in their documentation
Here's a forum post that has some code examples of what you are trying to achieve.

Categories

Resources