Why is each element in a sparse csc matrix 8 bytes? - python

For example, if I initially have a dense matrix:
A = numpy.array([[0, 0],[0, 1]])
and then convert it to a csc sparse matrix using csc_matrix(A). The matrix is then stored as:
(1, 1) 1
#(row, column) val
which comprises three values. Why is the size of the sparse matrix only 8 bytes, even though the computer is essentially storing three values? Surely the size of the matrix would be at least 12 bytes, since an integer usually takes 4 bytes.

I don't agree that the size of the sparse matrix is 8 bytes. I may be missing something, but if I do this, I get a very different answer:
>>> import sys
>>> import numpy
>>> from scipy import sparse
>>> A = numpy.array([[0, 0],[0, 1]])
>>> s = sparse.csc_matrix(A)
>>> s
<2x2 sparse matrix of type '<class 'numpy.int32'>'
with 1 stored elements in Compressed Sparse Column format>
>>> sys.getsizeof(s)
56
This is the size of the data structure in memory and I assure you that it is accurate. Python must know how big it is, because it does the memory allocation.
If, on the other hand, you use s.data.nbytes:
>>> s.data.nbytes
4
This gives the expected answer of 4. It is expected because s reports itself as having one stored element of type int32. The value returned, according to the docs,
does not include memory consumed by non-element attributes of the array object.
This is not a more accurate result, just an answer to a different question, as SO question 35421869 makes clear.
I can't explain why you report a value of 8 bytes when the result of 4 is clearly correct. One possibility is that numpy.array([[0, 0],[0, 1]]) is not in fact what was actually converted to the sparse matrix. Where did the value 5 come from? A value of 8 is consistent with a float64 starting array such as numpy.array([[0, 0],[0, 5.0]]), whose itemsize is 8 bytes.
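You can check that hypothesis directly, continuing the session above; a float64 starting array yields one stored element of 8 bytes:
>>> t = sparse.csc_matrix(numpy.array([[0, 0], [0, 5.0]]))
>>> t.data.dtype
dtype('float64')
>>> t.data.nbytes
8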
Your figure of 12 bytes is based on two unmet expectations:

1. It is possible to represent a sparse matrix as a list of triples (row, column, value), and that is in fact how a COO matrix is stored, at least in principle. But CSC stands for compressed sparse column, so there are fewer explicit column indexes than in a COO matrix; see the sketch after this list. The Wikipedia article on sparse matrices provides a lucid explanation of how the storage actually works.
2. nbytes does not report the total memory cost of storing the elements of the matrix. It reports a numpy invariant (one that holds over many different kinds of array): x.nbytes == np.prod(x.shape) * x.itemsize. This is an important quantity because the explicitly stored elements of the matrix form its biggest subsidiary data structure and must be allocated in contiguous memory.
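To make the compressed layout concrete, here are the three arrays the CSC matrix above actually keeps. With int32 data and indices, as on the asker's setup, the element storage totals 4 + 4 + 12 = 20 bytes (the exact counts depend on your platform's integer dtypes):
>>> print(s.data)     # the stored nonzero values
[1]
>>> print(s.indices)  # the row index of each stored value
[1]
>>> print(s.indptr)   # column pointers: column j spans indptr[j]:indptr[j+1]
[0 0 1]
>>> s.data.nbytes + s.indices.nbytes + s.indptr.nbytes
20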

Related

How to create huge sparse matrix with dtype=float16?

I've tried all of these and got either a MemoryError or some other kind of error.
Matrix1 = csc_matrix((130000,130000)).todense()
Matrix1 = csc_matrix((130000,130000), dtype=float_).todense()
Matrix1 = csc_matrix((130000,130000), dtype=float16).todense()
How can I create a huge sparse matrix with float type of data?
To create a huge sparse matrix, just do exactly what you're doing:
Matrix1 = csc_matrix((130000,130000), dtype=float16)
… without calling todense() at the end. This succeeds, and takes a tiny amount of memory.[1]
When you add todense(), that successfully creates a huge sparse array that takes a tiny amount of memory, and then tries to convert that to a dense array that takes a huge amount of memory, which fails with a MemoryError. But the solution to that is just… don't do that.
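The arithmetic makes the failure obvious; a quick back-of-the-envelope sketch, assuming float16 at 2 bytes per element:
rows, cols = 130000, 130000
dense_bytes = rows * cols * 2    # float16 takes 2 bytes per element
print(dense_bytes / 1024**3)     # about 31.5 GiB for the dense buffer alone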
And likewise, if you use dtype=float_ instead of dtype=float16, you get float64 values (which aren't what you want, and take 4x the memory), but again, the solution is just… don't do that.
[1] sys.getsizeof(m) gives 56 bytes for the sparse array handle, sys.getsizeof(m.data) gives 96 bytes for the internal storage handle, and m.data.nbytes gives 0 bytes for the actual storage, for a grand total of 152 bytes. Which is unlikely to raise a MemoryError.

Is there any performance reason to use ndim 1 or 2 vectors in numpy?

This seems like a pretty basic question, but I didn't find anything related to it on Stack Overflow. Apologies if I missed an existing question.
I've seen some mathematical/linear algebraic reasons why one might want to use numpy vectors "proper" (i.e. ndim 1), as opposed to row/column vectors (i.e. ndim 2).
But now I'm wondering: are there any (significant) efficiency reasons why one might pick one over the other? Or is the choice pretty much arbitrary in that respect?
(edit) To clarify: By "ndim 1 vs ndim 2 vectors" I mean representing a vector that contains, say, numbers 3 and 4 as either:
np.array([3, 4]) # ndim 1
np.array([[3, 4]]) # ndim 2
The numpy documentation seems to lean towards the first case as the default, but like I said, I'm wondering if there's any performance difference.
If you use numpy properly, then no - it is not a consideration.
If you look at the numpy internals documentation, you can see that
Numpy arrays consist of two major components, the raw array data (from now on, referred to as the data buffer), and the information about the raw array data. The data buffer is typically what people think of as arrays in C or Fortran, a contiguous (and fixed) block of memory containing fixed sized data items. Numpy also contains a significant set of data that describes how to interpret the data in the data buffer.
So, irrespective of the dimensions of the array, all data is stored in a contiguous buffer. Now consider
a = np.array([1, 2, 3, 4])
and
b = np.array([[1, 2], [3, 4]])
It is true that accessing a[1] requires (slightly) fewer operations than b[1, 1] (as translating 1, 1 to a flat index requires some calculation), but for high performance, vectorized operations are required anyway.
If you want to sum all elements in the arrays, then in both cases you would use the same thing, a.sum() and b.sum(), and the sum would run over elements in contiguous memory anyway. Conversely, if the data is inherently 2d, then you can do things like b.sum(axis=1) to sum over rows. Doing this yourself on a 1d array would be error prone, and no more efficient.
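A minimal sketch of the point (the names x and y are just for illustration; reshape returns a view, so both layouts reduce over the same contiguous buffer):
import numpy as np
x = np.arange(1_000_000)      # ndim 1
y = x.reshape(1000, 1000)     # ndim 2 view of the same contiguous buffer
print(x.sum() == y.sum())     # True: both sums run over identical memory
print(y.sum(axis=1)[:3])      # per-row sums, extra functionality for free
print(y.base is x)            # True: no data was copied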
So, basically, a 2d array, if it is natural for the problem, just gives greater functionality, with zero or negligible overhead.

How this mixed scipy.sparse / numpy program should be handled

I am currently trying to use numpy as well as scipy to handle sparse matrices, but in the process of evaluating the sparsity of a matrix I ran into trouble, and I don't know how the following behaviour should be understood:
import numpy as np
import scipy.sparse as sp
a = sp.csc_matrix(np.ones((3, 3)))
a
np.count_nonzero(a)
When evaluating a and the nonzero count using the above code, I saw this output in IPython:
Out[9]: <3x3 sparse matrix of type '<class 'numpy.float64'>'
        with 9 stored elements in Compressed Sparse Column format>
Out[10]: 1
I think there is something I don't understand here.
A 3x3 matrix full of ones should have 9 nonzero terms, and that is indeed the answer I get if I use the toarray method from scipy.
Am I using numpy and scipy the wrong way?
The nonzero count is available as an attribute:
In [295]: a=sparse.csr_matrix(np.arange(9).reshape(3,3))
In [296]: a
Out[296]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in Compressed Sparse Row format>
In [297]: a.nnz
Out[297]: 8
As Warren commented, you can't count on numpy functions working on sparse matrices. Use sparse functions and methods. Sometimes numpy functions are written in a way that invokes the array's own method, in which case the function call might work. But that is true only on a case by case basis.
In IPython I make heavy use of a.<tab> to get a list of completions (attributes and methods). I also use function?? to look at the code.
In the case of np.count_nonzero I see no code - it is compiled, and it only works on np.ndarray objects.
np.nonzero(a) works. Look at its code, and see that it looks for the array's method: nonzero = a.nonzero
The sparse nonzero method code is:
def nonzero(self):
    ...
    # convert to COOrdinate format
    A = self.tocoo()
    nz_mask = A.data != 0
    return (A.row[nz_mask], A.col[nz_mask])
The A.data != 0 line is there because it is possible to construct a matrix with explicit 0 data elements, particularly if you use the coo (data, (i, j)) format. So apart from that caution, the nnz attribute gives a reliable count.
Doing a.<tab> I also see a.getnnz and a.eliminate_zeros methods, which may be helpful if you are worried about sneaky zeros.
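For example, here is how such a sneaky zero arises and how eliminate_zeros removes it (a small made-up matrix):
import numpy as np
from scipy import sparse
data = np.array([1, 0, 2])    # note the explicit 0
row = np.array([0, 1, 2])
col = np.array([0, 1, 2])
m = sparse.coo_matrix((data, (row, col)), shape=(3, 3)).tocsr()
print(m.nnz)                     # 3: stored entries, including the explicit 0
print(np.count_nonzero(m.data))  # 2: truly nonzero values
m.eliminate_zeros()              # drop the stored zero in place
print(m.nnz)                     # 2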
Sometimes it is useful to work directly with the data attributes of a sparse matrix. It's safer to access them than to modify them. But each sparse format has different attributes. In the csr case you can do:
In [306]: np.count_nonzero(a.data)
Out[306]: 8

How to create a huge sparse matrix in scipy

I am trying to create a very large sparse matrix with shape (447957347, 5027974).
It contains 3,289,288,566 elements.
But when I create a csr_matrix using scipy.sparse, it returns something like this:
<447957346x5027974 sparse matrix of type '<type 'numpy.uint32'>'
with -1005678730 stored elements in Compressed Sparse Row format>
The source code for creating matrix is:
indptr = np.array(a, dtype=np.uint32)   # a is a Python array('L') containing row index information
indices = np.array(b, dtype=np.uint32)  # b is a Python array('L') containing column index information
data = np.ones((len(indices),), dtype=np.uint32)
test = csr_matrix((data, indices, indptr), shape=(len(indptr)-1, 5027974), dtype=np.uint32)
I also found that when I convert a 3-billion-element Python array to a numpy array, it raises an error:
ValueError: setting an array element with a sequence
But when I create three 1-billion-element Python arrays, convert each to a numpy array, and then concatenate them, it works fine.
I'm confused.
You are using an older version of SciPy. In the original implementation of sparse matrices, indices were stored in int32 variables, even on 64-bit systems. Even if you define them to be uint32, as you did, they get cast. So whenever your matrix has more than 2^31 - 1 nonzero entries, as is your case, the indexing overflows and lots of bad things happen. Note that in your case the weird negative number of elements is explained by:
>>> np.int32(np.int64(3289288566))
-1005678730
The good news is that this has already been figured out. I think this is the relevant PR, although there were some more fixes after that one. In any case, if you use the latest release candidate for SciPy 0.14, your problem should be gone.
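As a sanity check before building such a matrix, you can compare the expected element count against the int32 ceiling; a minimal sketch:
import numpy as np
nnz = np.int64(3289288566)
print(np.iinfo(np.int32).max)        # 2147483647, i.e. 2**31 - 1
print(nnz > np.iinfo(np.int32).max)  # True: 32-bit indices would overflow
print(nnz.astype(np.int32))          # -1005678730, the wrapped count seen above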

How to create a matrix in python?

I want to create a matrix of size 1234*5678, filled with 1 to 5678 in row-major order.
I think you will need numpy to hold such a big matrix efficiently, not just for computation. You have ~7e6 items of 4/8 bytes each, which means 28/56 MB in pure C already, and several times that in pure Python without an efficient data structure (a list of rows, each row itself a list).
Now, concerning your question:
import numpy as np
a = np.empty((1234, 5678), dtype=np.int32)
a[:] = np.linspace(1, 5678, 5678)
You first create an array of the requested size, with type int32 (I assume you know you want a 4-byte integer, which is what np.int32 gives you on any platform; the bare np.int alias has been removed from recent numpy). The 3rd line uses broadcasting so that each row (a[0], a[1], ... a[1233]) is assigned the values of the np.linspace line (which gives you an array of [1, ....., 5678]). If you want F storage, that is column major:
a = np.empty((1234, 5678), dtype=np.int32, order='F')
...
The matrix a will take only a tiny amount of memory more than an array in C, and for computation at least, the indexing capabilities of arrays are much better than those of python lists.
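A quick check that the broadcast filled every row identically (continuing the snippet above):
print(a[0, :4])                  # [1 2 3 4]
print(a[1233, -3:])              # [5676 5677 5678]
print((a[0] == a[1233]).all())   # True: every row holds the same sequence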
A nitpick: Numeric is the name of the old numerical package for python - the recommended package is numpy.
Or just use Numerical Python if you want to do some mathematical stuff on the matrix too (like multiplication, ...). Whether they use row-major order for the matrix layout in memory I can't tell you, but it gets covered in their documentation.
Here's a forum post that has some code examples of what you are trying to achieve.
