How should this mixed scipy.sparse / numpy program be handled? - python

I am currently trying to use numpy as well as scipy to handle sparse matrices, but in the process of evaluating the sparsity of a matrix I ran into trouble, and I don't know how the following behaviour should be understood:
import numpy as np
import scipy.sparse as sp
a=sp.csc.csc_matrix(np.ones((3,3)))
a
np.count_nonzero(a)
When evaluating a, and non zero count, using the above code, I saw this output in ipython:
Out[9]: <3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 9 stored elements in Compressed Sparse Column format>
Out[10]: 1
I think there is something I don't understand here.
A 3x3 matrix full of ones should have 9 non-zero terms, and that is indeed the answer I get if I use the toarray method from scipy first.
Am I using numpy and scipy the wrong way?

The nonzero count is available as an attribute:
In [295]: a=sparse.csr_matrix(np.arange(9).reshape(3,3))
In [296]: a
Out[296]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in Compressed Sparse Row format>
In [297]: a.nnz
Out[297]: 8
As Warren commented, you can't count on numpy functions working on sparse matrices. Use sparse functions and methods instead. Some numpy functions are written in a way that invokes the array's own method, in which case the call might work, but that is true only on a case-by-case basis.
In IPython I make heavy use of a.<tab> to get a list of completions (attributes and methods). I also use function?? to look at the code.
In the case of np.count_nonzero I see no code - it is compiled, and only works on np.ndarray objects.
np.nonzero(a) works. Look at its code and you'll see that it looks for the array's own method: nonzero = a.nonzero.
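For example, with the a defined above (an illustrative continuation of the session, not from the original):
In [298]: np.nonzero(a)   # delegates to a.nonzero()
Out[298]:
(array([0, 0, 1, 1, 1, 2, 2, 2], dtype=int32),
 array([1, 2, 0, 1, 2, 0, 1, 2], dtype=int32))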
The sparse nonzero method code is:
def nonzero(self):
    ...
    # convert to COOrdinate format
    A = self.tocoo()
    nz_mask = A.data != 0
    return (A.row[nz_mask], A.col[nz_mask])
The A.data != 0 line is there because it is possible to construct a matrix with explicitly stored 0 data elements, particularly if you use the coo (data, (i, j)) format. So apart from that caution, the nnz attribute gives a reliable count.
Doing a.<tab> I also see a.getnnz and a.eliminate_zeros methods, which may be helpful if you are worried about sneaky zeros.
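A small illustration of the sneaky-zero case (not from the original answer): an explicitly stored 0 counts toward nnz, is filtered out by nonzero, and can be removed for good with eliminate_zeros:
In [300]: M = sparse.coo_matrix(([0, 1, 2], ([0, 1, 2], [0, 1, 2])))
In [301]: M.nnz, len(M.nonzero()[0])
Out[301]: (3, 2)
In [302]: M2 = M.tocsr()
In [303]: M2.eliminate_zeros()
In [304]: M2.nnz
Out[304]: 2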
Sometimes it is useful to work directly with the data attributes of a sparse matrix. It's safer to access them than to modify them. But each sparse format has different attributes. In the csr case you can do:
In [306]: np.count_nonzero(a.data)
Out[306]: 8

Related

SciPy sparse matrix ".multiply" not returning expected results?

So I have a COO matrix "coo_mat" created using scipy.sparse, with the first three non-zero elements being:
coo_mat.data[:5]
>>> array([0.61992174, 1.30911574, 1.48995508])
I wish to multiply the matrix by 2, and I understand that I can simply do:
(coo_mat*2).data[:5]
>>> array([1.23984347, 2.61823147, 2.97991015])
However, I don't understand why the results are not consistent when I try:
coo_mat.multiply(2).data[:5]
>>> array([2.04156392, 1.54042948, 2.3306947 ])
I've used this element-wise multiplication method in other analyses and it worked as I expected. Is there something I'm missing when using sparse.coo_matrix.multiply()?
SciPy doesn't promise anything about the output format of most sparse matrix operations. It can reorder the elements of a COO matrix, or even switch formats to CSR or CSC or something. Here, coo_mat.multiply(2) is returning a CSR matrix with a completely different element representation and element layout:
In [11]: x = scipy.sparse.coo_matrix([[1]])
In [12]: type(x.multiply(2))
Out[12]: scipy.sparse.csr.csr_matrix
scipy.sparse.coo_matrix inherits its multiply method from the scipy.sparse.spmatrix base class, which implements multiply as
def multiply(self, other):
    """Point-wise multiplication by another matrix
    """
    return self.tocsr().multiply(other)
There's no optimization for COO in that method.
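Despite the different storage order, the two results agree element-wise. A minimal check (assuming the coo_mat from the question):
left = (coo_mat * 2).toarray()
right = coo_mat.multiply(2).toarray()
np.allclose(left, right)   # True: the values match, only the element layout differs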

Set data members directly in a scipy sparse matrix

I'm building a large CSR sparse matrix which uses quite a lot of memory even in sparse format, so I want to avoid a copy when I create the matrix. The most efficient way I found is to build the compressed sparse row representation directly. However, the class initializer copies the arrays passed to it, so I set the data members directly instead. Example:
import numpy as np
from scipy import sparse

m = sparse.csr_matrix((5, 5))
m.data = np.arange(5)
m.indices = np.arange(5)
m.indptr = np.arange(6)
This appears to work, but I didn't find it in the documentation, so I'd like to know whether it is supported and whether it breaks something I haven't tried.
Also, it would be useful to know whether I can use memmapped arrays without quirks, or use different integer dtypes for the indices.
Edit:
The accepted answer shows that no copy happens provided the index dtypes are correct. I have checked __init__ and, even when it doesn't copy indices and indptr, it scans both of them twice to find their minimum and maximum values; when the inputs are well-formed it effectively does nothing more than set the data, indices and indptr members. So, for performance, what I'm doing now is:
# [...] get shape and data from somewhere
m = sparse.csr_matrix(shape, dtype=data.dtype)
indices = np.empty(..., dtype=m.indices.dtype)
indptr = np.empty(..., dtype=m.indptr.dtype)
# [...] fill indices and indptr
m.data = data
m.indices = indices
m.indptr = indptr
# Possibly also do one or both of the following:
m.has_sorted_indices = True
m.has_canonical_format = True
Here's an example of making a sparse matrix without copying the definition arrays:
In [191]: data=np.arange(5)
...: indices=np.arange(5).astype('int32')
...: indptr=np.arange(6).astype('int32')
In [192]: M = sparse.csr_matrix((data,indices,indptr))
In [193]: data.__array_interface__['data'], M.data.__array_interface__['data']
Out[193]: ((55897168, False), (55897168, False))
In [194]: indices.__array_interface__['data'], M.indices.__array_interface__['data']
Out[194]: ((70189040, False), (70189040, False))
In [195]: indptr.__array_interface__['data'], M.indptr.__array_interface__['data']
Out[195]: ((56184432, False), (56184432, False))
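On a reasonably recent numpy, np.shares_memory gives a more readable version of the same check (illustrative, assuming numpy >= 1.11):
In [196]: np.shares_memory(data, M.data), np.shares_memory(indices, M.indices)
Out[196]: (True, True)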
https://github.com/scipy/scipy/blob/v1.4.1/scipy/sparse/compressed.py
I wrote that with the __init__ in mind. Look also at the check_format method to see what it checks for consistency.

Why is each element in a sparse csc matrix 8 bytes?

For example, if I initially have a dense matrix:
A = numpy.array([[0, 0],[0, 1]])
and then convert it to a csc sparse matrix using csc_matrix(A). The matrix is then stored as:
(1, 1) 1
#(row, column) val
which comprises three values. Why is the size of the sparse matrix only 8 bytes, even though the computer is essentially storing 3 values? Surely the size of the matrix would be at least 12 bytes, since an integer usually takes 4 bytes.
I don't agree that the size of the sparse matrix is 8 bytes. I may be missing something, but if I do this, I get a very different answer:
>>> import sys
>>> import numpy
>>> from scipy import sparse
>>> A = numpy.array([[0, 0],[0, 1]])
>>> s = sparse.csc_matrix(A)
>>> s
<2x2 sparse matrix of type '<class 'numpy.int32'>'
with 1 stored elements in Compressed Sparse Column format>
>>> sys.getsizeof(s)
56
This is the size of the data structure in memory and I assure you that it is accurate. Python must know how big it is, because it does the memory allocation.
If, on the other hand, you use s.data.nbytes:
>>> s.data.nbytes
4
This gives the expected answer of 4. It is expected because s reports itself as having one stored element of type int32. The value returned, according to the docs,
does not include memory consumed by non-element attributes of the array object.
This is not a more accurate result, just an answer to a different question, as 35421869 makes clear.
I can't explain why you report a value of 8 bytes when the result 4 is clearly correct for int32 data. One possibility is that numpy.array([[0, 0],[0, 1]]) is not in fact what was actually converted to the sparse matrix: a value of 8 would be consistent with a 64-bit element type, for example a starting array of numpy.array([[0, 0],[0, 5.0]]), whose float64 data takes 8 bytes per element.
Your figure of 12 bytes is based on two unmet expectations.
It is possible to represent a sparse matrix as a list of triples (row, column, value), and that is in fact how a COO matrix is stored, at least in principle. But CSC stands for Compressed Sparse Column, so there are fewer explicit column indices than in a COO matrix. The Wikipedia article on sparse matrices provides a lucid explanation of how the storage actually works.
nbytes does not report the total memory cost of storing the elements of the matrix. It reports a numpy invariant (over many different kinds of array): x.nbytes == np.prod(x.shape) * x.itemsize. This is an important quantity because the explicitly stored elements of the matrix form its biggest subsidiary data structure and must be allocated in contiguous memory.
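Also, nbytes only covers s.data; a CSC matrix stores indices and indptr arrays as well. A small sketch of the full breakdown (byte counts assume int32 data and indices, as in the session above, and vary with platform defaults):
>>> s.data.nbytes, s.indices.nbytes, s.indptr.nbytes
(4, 4, 12)
>>> s.data.nbytes + s.indices.nbytes + s.indptr.nbytes
20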

Is it possible to translate this Python code to Cython?

I'm actually looking to speed up part #2 of this code as much as possible, so I thought that it might be useful to try Cython. However, I'm not sure how to work with a sparse matrix in Cython. Can somebody show how to wrap it in Cython, or perhaps Julia, to make it faster?
#1) This part computes the u_dict dictionary filled with unique strings and then enumerates them.
import scipy.sparse as sp
import numpy as np
from scipy.sparse import csr_matrix

full_dict = set(train1.values.ravel().tolist() + test1.values.ravel().tolist() +
                train2.values.ravel().tolist() + test2.values.ravel().tolist())
print len(full_dict)
u_dict = dict()
for i, q in enumerate(full_dict):
    u_dict[q] = i

shape = (len(full_dict), len(full_dict))
H = sp.lil_matrix(shape, dtype=np.int8)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

#2) I need to speed up this part
# train_full is a pandas dataframe with two columns w1 and w2 filled with strings
H = load_sparse_csr('matrix.npz')
correlation_train = []
for idx, row in train_full.iterrows():
    if idx % 1000 == 0: print idx
    id_1 = u_dict[row['w1']]
    id_2 = u_dict[row['w2']]
    a_vec = H[id_1].toarray()  # these vectors are of length < 3 mil.
    b_vec = H[id_2].toarray()
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0][1])
While I contributed to "How to properly pass a scipy.sparse CSR matrix to a cython function?" quite some time ago, I doubt that cython is the way to go here, especially if you don't already have experience with numpy and cython. Cython gives the biggest speedup when you replace iterative calculations with code that it can translate to C without calling numpy or other Python code. Throw pandas into the mix and you have an even bigger learning curve.
And important parts of the sparse code are already written in cython.
Without touching the cython issue I see a couple of problems.
H is defined twice:
H = sp.lil_matrix(shape, dtype=np.int8)
H = load_sparse_csr('matrix.npz')
That's either an oversight, or a failure to understand how Python variables are created and assigned. The 2nd assignment replaces the first, so the first does nothing. In addition, the first just makes an empty lil matrix; such a matrix could be filled iteratively, and while not fast, that is the intended use of the lil format.
The 2nd expression creates a new matrix from data saved in an npz file. That involves the numpy npz file loader as well as the basic csr matrix creation code. And since the attributes are already in csr format, there's nothing for cython to touch.
You do have an iteration here - but over a Pandas dataframe:
for idx, row in train_full.iterrows():
    id_1 = u_dict[row['w1']]
    a_vec = H[id_1].toarray()
Looks like you are picking a particular row of H based on a dictionary/array lookup. Sparse matrix indexing is slow compared to dense matrix indexing. That is, if Ha = H.toarray() fits in your memory, then
a_vec = Ha[id_1,:]
will be a lot faster.
Faster selection of rows (or columns) from a sparse matrix has been asked before. If you could work directly with the sparse data of a row I could recommend something more direct. But you want a dense array that you can pass to np.corrcoef, so we'd have to implement the toarray step as well.
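As a sketch of that dense-lookup variant (assuming H, u_dict and train_full as defined in the question, and enough RAM for the dense array):
Ha = H.toarray()   # one-time conversion; only viable if the dense matrix fits in memory
correlation_train = []
for idx, row in train_full.iterrows():
    a_vec = Ha[u_dict[row['w1']], :]   # plain ndarray row indexing - much faster than H[id]
    b_vec = Ha[u_dict[row['w2']], :]
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0, 1])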
How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?

How to create a huge sparse matrix in scipy

I am trying to create a huge sparse matrix with shape (447957347, 5027974).
It contains 3,289,288,566 elements.
But when I create a csr_matrix using scipy.sparse, it returns something like this:
<447957346x5027974 sparse matrix of type '<type 'numpy.uint32'>'
with -1005678730 stored elements in Compressed Sparse Row format>
The source code for creating the matrix is:
indptr = np.array(a, dtype=np.uint32)   # a is a python array('L') containing row index information
indices = np.array(b, dtype=np.uint32)  # b is a python array('L') containing column index information
data = np.ones((len(indices),), dtype=np.uint32)
test = csr_matrix((data,indices,indptr), shape=(len(indptr)-1, 5027974), dtype=np.uint32)
I also found that when I convert a 3-billion-element python array to a numpy array, it raises an error:
ValueError: setting an array element with a sequence
But when I create three 1-billion-element python arrays, convert each to a numpy array, and then append them, it works fine.
I'm confused.
You are using an older version of SciPy. In the original implementation of sparse matrices, indices were stored in int32 variables, even on 64-bit systems. Even if you define them to be uint32, as you did, they get cast. So whenever your matrix has more than 2^31 - 1 non-zero entries, as is your case, the indexing overflows and lots of bad things happen. Note that in your case the weird negative number of elements is explained by:
>>> np.int32(np.int64(3289288566))
-1005678730
The good news is that this has already been figured out. I think this is the relevant PR, although there were some more fixes after that one. In any case, if you use the latest release candidate for SciPy 0.14, your problem should be gone.
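As a hedged sketch of the same construction on a fixed version, using index arrays wide enough for 3.3 billion entries (recent SciPy versions also upcast the index dtype automatically when nnz overflows int32):
indptr = np.array(a, dtype=np.int64)    # cumulative counts can exceed 2**31 - 1
indices = np.array(b, dtype=np.int64)
data = np.ones(len(indices), dtype=np.uint32)
test = csr_matrix((data, indices, indptr), shape=(len(indptr) - 1, 5027974))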
