Say I have a huge numpy matrix A taking up tens of gigabytes. It takes a non-negligible amount of time to allocate this memory.
Let's say I also have a collection of scipy sparse matrices with the same dimensions as the numpy matrix. Sometimes I want to convert one of these sparse matrices into a dense matrix to perform some vectorized operations.
Can I load one of these sparse matrices into A rather than re-allocate space each time I want to convert a sparse matrix into a dense matrix? The .toarray() method, which is available on scipy sparse matrices, does not seem to take an optional dense output argument, but maybe there is some other way to do this.
If the sparse matrix is in the COO format:
def assign_coo_to_dense(sparse, dense):
    # COO stores explicit row and column index arrays, so assign directly.
    dense[sparse.row, sparse.col] = sparse.data
If it is in the CSR format:
import numpy as np

def assign_csr_to_dense(sparse, dense):
    # Expand indptr into one row index per stored value, then assign.
    rows = np.repeat(np.arange(sparse.shape[0]), np.diff(sparse.indptr))
    dense[rows, sparse.indices] = sparse.data
To be safe, you might want to add the following lines to the beginning of each of the functions above:
    assert sparse.shape == dense.shape
    dense[:] = 0
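For example, with a preallocated buffer (a minimal sketch; the shape and density here are illustrative):

import numpy as np
from scipy import sparse

A = np.zeros((20000, 20000))  # allocated once, reused many times
S = sparse.rand(20000, 20000, density=1e-4, format='coo')

A[:] = 0                      # clear the previous contents
assign_coo_to_dense(S, A)     # fill in the nonzeros in place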
It does seem like there should be a better way to do this (and I haven't scoured the documentation), but you could always loop over the elements of the sparse array and assign to the dense array (probably zeroing out the dense array first). If this ends up too slow, that seems like an easy C extension to write....
I am trying to find the eigenvalues of many small matrices without using a loop, with the intent to use CuPy later on.
Thus, I have tried to set up a large matrix that has the matrices I want to solve as blocks on its diagonal. This matrix contains a lot of unnecessary zeros, so I use scipy.sparse.
All works well until I want to find the eigenvalues, where the eigs() function computes the full (dense) eigenvectors of the problem, even though most of their entries should be zero.
import numpy as np
from scipy import sparse as sp
from scipy.sparse.linalg import spsolve, eigs
sigx=np.array([[0, 1],[1, 0]], dtype=np.complex128) # a 2x2 Pauli matrix
karray=np.arange(-np.pi, np.pi, np.pi/100) #200 elements
H_sci=sp.kron(sp.diags(karray), sigx) #The sparse matrix I want to find the eigenvalues to
H_reg=H_sci.toarray() #Converted into a regular numpy array to see the memory difference
print(H_sci.data.nbytes) #12800 = 2*2*200*16 (complex128 is 16 bytes) --> stores 4 arrays of length 200
print(H_reg.nbytes) #2560000 = 2*2*200*200*16 --> stores the entire matrix
E_sci=eigs(H_sci, k=398) #throws an error for k=400 and 399, even though I should have 400 eigenvalues?
print(E_sci[1].data.nbytes) #2547200 --> as much as H_reg
Am I doing something wrong? Is there an alternative approach to solving many matrices (here 2x2, for example) in parallel? I have used Numba for looping over the matrices before, but I would like to try my GPU to see whether I can speed this problem up, because I do not see why I should solve these matrices one after another.
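For reference, np.linalg.eigvals broadcasts over leading dimensions, so the 2x2 blocks can be solved in one vectorized call without assembling a block-diagonal matrix at all. A minimal sketch, assuming the blocks are the k*sigx matrices built above (cupy.linalg.eigvalsh provides the same batching on the GPU for Hermitian matrices, which these are):

import numpy as np

karray = np.arange(-np.pi, np.pi, np.pi/100)  # 200 values of k
sigx = np.array([[0, 1], [1, 0]], dtype=np.complex128)

# Stack of shape (200, 2, 2): one 2x2 matrix k*sigx per value of k.
blocks = karray[:, None, None] * sigx

E = np.linalg.eigvals(blocks)  # shape (200, 2): eigenvalues of each block
print(E.shape)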
I have a csr_matrix, let's say I called:
import scipy.sparse as ss
mat = ss.csr_matrix((50, 100))
Now I want to modify some of the values on this matrix. I call:
mat[0, 1] += 1
And I get:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
I only need to set a few values (few relative to the size of the matrix) just after the matrix is created. Later on I will only read columns or do element-wise operations on the whole matrix (like .log1p()).
What would be the correct way to do that? Currently I just ignore the warning, but there may be a better way that doesn't yield one.
You can control the appearance of warnings. The default is to show them once during a run, and then be silent. You can change that to raise an error, be completely silent, or issue the warning every time.
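For instance (a sketch using the standard warnings module; SparseEfficiencyWarning is importable from scipy.sparse):

import warnings
from scipy.sparse import SparseEfficiencyWarning

# Silence this warning entirely:
warnings.simplefilter('ignore', SparseEfficiencyWarning)

# Or turn it into an error, to catch accidental structure changes:
# warnings.simplefilter('error', SparseEfficiencyWarning)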
A common way of creating a sparse matrix is to create the 3 coo style arrays, with all nonzero values. Then make a coo matrix, or csr directly (it takes the same style of input).
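For example (a sketch; the indices and values here are made up):

import numpy as np
import scipy.sparse as ss

row  = np.array([0, 0, 1, 3])
col  = np.array([1, 3, 2, 0])
data = np.array([1.0, 2.0, 3.0, 4.0])

# coo_matrix and csr_matrix accept the same (data, (row, col)) input:
mat = ss.csr_matrix((data, (row, col)), shape=(50, 100))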
coo format doesn't have indexing, so you can't do M[i,j] = 1 anyway. But csr does implement it. I think the warning is there to discourage many changes (in a loop), not one or two.
Changing the sparsity of a csr matrix requires recalculating the whole set of attributes (data and index pointers). That's why it's expensive. I haven't done timings, but it may be almost as expensive as making the array fresh.
lil is supposed to be better for incremental assignment. It keeps its data in lists of lists, and inserting values into lists is fast. But converting csr to lil and back takes time, so I wouldn't do it for just a few additions.
Instead of:
from scipy.sparse import csr_matrix
# Create sparse matrix.
graph = csr_matrix((10, 10))
# Change sparse matrix.
graph[(1, 1)] = 0 # --- SLOW --- ^1
# Do some calculations.
graph += graph
Or:
from scipy.sparse import lil_matrix
# Create sparse matrix.
graph = lil_matrix((10, 10))
# Change sparse matrix.
graph[(1, 1)] = 0
# Do some calculations.
graph += graph # --- SLOW --- ^2
Combine the strengths of both:
from scipy.sparse import csr_matrix, lil_matrix
# Create sparse matrix.
graph = lil_matrix((10, 10))
# Change sparse matrix.
graph[(1, 1)] = 0
# Done with changes to graph. Convert to csr.
graph = csr_matrix(graph)
# Do some calculations.
graph += graph
Don't take "--- SLOW ---" as a one-size fits all commandment! It's just a warning that in with some data sets you should be aware that there may be faster, more efficient ways, of doing things. For other data sets this would only make your code harder to read and maintain without any performance benefit.
1: "SLOW" as per warning:
/venv/lib/python3.8/site-packages/scipy/sparse/_index.py:82: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
2: "SLOW" as per warning in docs:
Disadvantages of the LIL format:
arithmetic operations LIL + LIL are slow (consider CSR or CSC)
I have a large sparse matrix in the form of a scipy coo_matrix (about 5 GB in size). I have to make use of the non-zero entries of the matrix and do some further processing.
What would be the best way to access the elements of the matrix? Should I convert the matrix to other formats or use it as it is? Also, could you please tell me the exact syntax for accessing an element of a coo_matrix? I got a bit confused since it doesn't allow slicing.
First let's build a random COO matrix:
import numpy as np
from scipy import sparse
x = sparse.rand(10000, 10000, format='coo')
The non-zero values are found in the .data attribute of the matrix, and you can get their corresponding row/column indices using x.nonzero():
v = x.data
r, c = x.nonzero()
print(np.all(x.todense()[r, c] == v))
# True
With a COO matrix it's possible to index a single row or column (as a sparse vector) using the getrow()/getcol() methods. If you want to do slicing or fancy indexing of particular elements then you need to convert it to another format such as lil_matrix, for example using the .tolil() method.
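A short sketch of both options, continuing from the x built above:

row5 = x.getrow(5)      # 1x10000 sparse row vector
col3 = x.getcol(3)      # 10000x1 sparse column vector

xl = x.tolil()
block = xl[:100, :100]  # slicing and fancy indexing work on the LIL form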
You should really read the scipy.sparse docs for more information about the features of the different sparse array formats - the appropriate choice of format really depends on what you plan on doing with your array.
I researched a lot on this but couldn't find a practical solution to this problem. I am using scipy to create a csr sparse matrix and want to subtract this matrix from an equivalent matrix of all ones. In scipy and numpy notation, if the matrix is not sparse, we can do so by simply writing 1 - MatrixVariable. However, this operation is not implemented if the matrix is sparse. The only obvious solution I could think of is the following:
Iterate through the entire sparse matrix, set all zero elements to 1 and all non-zero elements to 0.
But this would create a matrix where most elements are 1 and only a few are 0, which is no longer sparse and due its huge size could not be converted to dense.
What could be an alternative and effective way of doing this?
Thanks.
Your new matrix will not be sparse, because it will have 1s everywhere, so you will need a dense array to hold it:
new_mat = np.ones(sps_mat.shape, sps_mat.dtype) - sps_mat.todense()
This requires that your matrix fits in memory. It actually requires that it fits in memory 3 times. If that is an issue, you can get it to be more efficient doing something like:
new_mat = sps_mat.todense()  # the only dense allocation
new_mat *= -1                # negate in place
new_mat += 1                 # now new_mat equals 1 - sps_mat
You can access the data from your sparse matrix as a 1D array so that:
ss.data *= -1
ss.data += 1
will work like 1 - ss, for all non-zero elements in your sparse matrix.
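A quick check of that caveat (a sketch; sp.random and the density are illustrative). The implicit zeros are untouched, so the result agrees with 1 - ss only at the stored positions:

import numpy as np
import scipy.sparse as sp

ss = sp.random(5, 5, density=0.3, format='csr')
expected = 1 - ss.toarray()  # dense reference result

r, c = ss.nonzero()          # stored positions, before the update
ss.data *= -1
ss.data += 1

print(np.allclose(ss.toarray()[r, c], expected[r, c]))  # True at stored positions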
I have a loop that in each iteration gives me a column c of a sparse matrix N.
To assemble/grow/accumulate N column by column I thought of using
N = scipy.sparse.hstack([N, c])
To do this it would be nice to initialize the matrix with rows of length 0. However,
N = scipy.sparse.csc_matrix((4,0))
raises a ValueError: invalid shape.
Any suggestions, how to do this right?
You can't. Sparse matrices are restricted compared to NumPy arrays and in particular don't allow 0 for any axis. All sparse matrix constructors check for this, so if and when you do manage to build such a matrix, you're exploiting a SciPy bug and your script is likely to break when you upgrade SciPy.
That being said, I don't see why you'd need an n × 0 sparse matrix since an n × 0 NumPy array is allowed and takes practically no storage space.
Turns out sparse.hstack cannot handle a NumPy array with a zero axis, so disregard my previous comment. However, what I think you should do is collect all the columns in a list, then hstack them in one call. That's better than your loop since append'ing to a list takes amortized constant time, while hstack takes linear time. So your proposed algorithm takes quadratic time while it could be linear.
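A minimal sketch of that pattern (make_column and n_iterations are placeholders for whatever your loop actually does):

import scipy.sparse as sp

cols = []
for i in range(n_iterations):
    c = make_column(i)       # hypothetical: returns a (4, 1) sparse column
    cols.append(c)           # amortized O(1) per append

N = sp.hstack(cols).tocsc()  # one hstack call: linear instead of quadratic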
You must use at least 1 in your shape.
N = scipy.sparse.csc_matrix((4,1))
Which you can stack:
print(scipy.sparse.hstack((N, N)))
#<4x2 sparse matrix of type '<class 'numpy.float64'>'
#	with 0 stored elements in COOrdinate format>