Efficient incremental sparse matrix in python / scipy / numpy

Efficient incremental sparse matrix in python / scipy / numpy - python

Is there a way in Python to have an efficient incremental update of sparse matrix?
H = lil_matrix((n,m))
for (i,j) in zip(A,B):
h(i,j) += compute_something
It seems that such a way to build a sparse matrix is quite slow (lil_matrix is the fastest sparse matrix type for that).
Is there a way (like using dict of dict or other kind of approaches) to efficiently build the sparse matrix H?

In https://stackoverflow.com/a/27771335/901925 I explore incremental matrix assignment.
lol and dok are the recommended formats if you want to change values. csr will give you an efficiency warning, and coo does not allow indexing.
But I also found that dok indexing is slow compared to regular dictionary indexing. So for many changes it is better to build a plain dictionary (with the same tuple indexing), and build the dok matrix from that.
But if you can calculate the H data values with a fast numpy vector operation, as opposed to iteration, it is best to do so, and construct the sparse matrix from that (e.g. coo format). In fact even with iteration this would be faster:
h = np.zeros(A.shape)
for k, (i,j) in enumerate(zip(A,B)):
h[k] = compute_something
H = sparse.coo_matrix((h, (A, B)), shape=(n,m))
e.g.
In [780]: A=np.array([0,1,1,2]); B=np.array([0,2,2,1])
In [781]: h=np.zeros(A.shape)
In [782]: for k, (i,j) in enumerate(zip(A,B)):
h[k] = i+j+k
.....:
In [783]: h
Out[783]: array([ 0., 4., 5., 6.])
In [784]: M=sparse.coo_matrix((h,(A,B)),shape=(4,4))
In [785]: M
Out[785]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in COOrdinate format>
In [786]: M.A
Out[786]:
array([[ 0., 0., 0., 0.],
[ 0., 0., 9., 0.],
[ 0., 6., 0., 0.],
[ 0., 0., 0., 0.]])
Note that the (1,2) value is the sum 4+5. That's part of the coo to csr conversion.
In this case I could have calculated h with:
In [791]: A+B+np.arange(A.shape[0])
Out[791]: array([0, 4, 5, 6])
so there's no need for iteration.

Nope, do not use csr_matrix or csc_matrix, as they are going to be even more slower than lil_matrix, if you construct them incrementally. The Dictionary of Key based sparse matrix is exactly what you are looking for
from scipy.sparse import dok_matrix
S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
for j in range(5):
S[i,j] = i+j # Update elements

A faster way would be:
H_ij = compute_something_vectorized()
H = coo_matrix((H_ij, (A, B))).tocsr()
The data for duplicate coordinates are then summed, see the docs for coo_matrix.

Related

Using np.diag() to construct a 3D array [duplicate]

This question already has answers here:
What's the best way to create a "3D identity matrix" in Numpy?
(3 answers)
Closed 2 years ago.
The standard usage of the np.diag(a) function when given a 1D array a is to create a 2D array with the diagonal entries being the elements of a. In my case, a is a 2D array with size n x m. My goal is to generate an n x n x m array in a manner similar to the np.diag() function, where each n x n slice is a matrix of zeros with the m'th row of a in the diagonal. What is the best way of doing this? Clearly it can be done with the np.diag() function and a for loop, but I am wondering whether a vectorized version of this exists with numpy.

One way to accomplish this is to use the function np.broadcast_to, which broadcasts a given array to a new shape. I had trouble broadcasting the m dimension to the end of the array, but broadcasting it as the first dimension and then transposing along the first and last dimensions also seemed to work just fine.
Please see the code snippet below:
# Specify dimensions
n = 4
m = 3
# Create diagonal matrix
D = np.eye(n)
# Broadcast diagonal and transpose
B = np.transpose(np.broadcast_to(D, (m,) + D.shape), (2, 1, 0))
# Verify shape
print(B.shape)
--> (4, 4, 3)
# Verify correct slice
print(B[:, :, 0])
--> array([[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.]])
Hope this helps!

Theano: Operate on nonzero elements of sparse matrix

I'm trying to take the exp of nonzero elements in a sparse theano variable. I have the current code:
A = T.matrix("Some matrix with many zeros")
A_sparse = theano.sparse.csc_from_dense(A)
I'm trying to do something that's equivalent to the following numpy syntax:
mask = (A_sparse != 0)
A_sparse[mask] = np.exp(A_sparse[mask])
but Theano doesn't support != masks yet. (And (A_sparse > 0) | (A_sparse < 0) doesn't seem to work either.)
How can I achieve this?

The support for sparse matrices in Theano is incomplete, so some things are tricky to achieve. You can use theano.sparse.structured_exp(A_sparse) in that particular case, but I try to answer your question more generally below.
Comparison
In Theano one would normally use the comparison operators described here: http://deeplearning.net/software/theano/library/tensor/basic.html
For example, instead of A != 0, one would write T.neq(A, 0). With sparse matrices one has to use the comparison operators in theano.sparse. Both operators have to be sparse matrices, and the result is also a sparse matrix:
mask = theano.sparse.neq(A_sparse, theano.sparse.sp_zeros_like(A_sparse))
Modifying a Subtensor
In order to modify part of a matrix, one can use theano.tensor.set_subtensor. With dense matrices this would work:
indices = mask.nonzero()
A = T.set_subtensor(A[indices], T.exp(A[indices]))
Notice that Theano doesn't have a separated boolean type—the mask is zeros and ones—so nonzero() has to be called first to take the indices of the nonzero elements. Furthermore, this is not implemented for sparse matrices.
Operating on Nonzero Sparse Elements
Theano provides sparse operations that are said to be structured and operate only on the nonzero elements. See:
http://deeplearning.net/software/theano/tutorial/sparse.html#structured-operation
More precisely, they operate on the data attribute of a sparse matrix, independent of the indices of the elements. Such operations are straightforward to implement. Note that the structured operations will operate on all the values in the data array, also those that are explicitly set to zero.

Here's a way of doing this with the scipy.sparse module. I don't know how theano implements its sparse. It's likely to be based on similar ideas (since it uses name like csc)
In [224]: A=sparse.csc_matrix([[1.,0,0,2,0],[0,0,3,0,0],[0,1,1,2,0]])
In [225]: A.A
Out[225]:
array([[ 1., 0., 0., 2., 0.],
[ 0., 0., 3., 0., 0.],
[ 0., 1., 1., 2., 0.]])
In [226]: A.data
Out[226]: array([ 1., 1., 3., 1., 2., 2.])
In [227]: A.data[:]=np.exp(A.data)
In [228]: A.A
Out[228]:
array([[ 2.71828183, 0. , 0. , 7.3890561 , 0. ],
[ 0. , 0. , 20.08553692, 0. , 0. ],
[ 0. , 2.71828183, 2.71828183, 7.3890561 , 0. ]])
The main attributes of the csc format at data, indices, indptr. It's possible that data has some 0 values if you fiddle with them after creation, but a freshly created matrix shouldn't.
The matrix also has a nonzero method modeled on the numpy one. In practice it converts the matrix to coo format, filters out any zero values, and returns the row and col attributes:
In [229]: A.nonzero()
Out[229]: (array([0, 0, 1, 2, 2, 2]), array([0, 3, 2, 1, 2, 3]))
And the csc format allows indexing just as a dense numpy array:
In [230]: A[A.nonzero()]
Out[230]:
matrix([[ 2.71828183, 7.3890561 , 20.08553692, 2.71828183,
2.71828183, 7.3890561 ]])

T.where works.
A_sparse = T.where(A_sparse == 0, 0, T.exp(A_sparse))
#Seppo Envari's answer seems faster though. So I'll accept his answer.

Create a numpy array according to another array along with indices array

I have a numpy array(eg., a = np.array([ 8., 2.])), and another array which stores the indices I would like to get from the former array. (eg., b = np.array([ 0., 1., 1., 0., 0.]).
What I would like to do is to create another array from these 2 arrays, in this case, it should be: array([ 8., 2., 2., 8., 8.])
of course, I can always use a for loop to achieve this goal:
for i in range(5):
c[i] = a[b[i]]
I wonder if there is a more elegant method to create this array. Something like c = a[b[0:5]] (well, this apparently doesn't work)

Only integer arrays can be used for indexing, and you've created b as a float64 array. You can get what you're looking for if you explicitly convert to integer:
bi = np.array(b, dtype=int)
c = a[bi[0:5]]

What are the advantages of using numpy.identity over numpy.eye?

Having looked over the man pages for numpy's eye and identity, I'd assumed that identity was a special case of eye, since it has fewer options (e.g. eye can fill shifted diagonals, identity cannot), but could plausibly run more quickly. However, this isn't the case on either small or large arrays:
>>> np.identity(3)
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
>>> np.eye(3)
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
>>> timeit.timeit("import numpy; numpy.identity(3)", number = 10000)
0.05699801445007324
>>> timeit.timeit("import numpy; numpy.eye(3)", number = 10000)
0.03787708282470703
>>> timeit.timeit("import numpy", number = 10000)
0.00960087776184082
>>> timeit.timeit("import numpy; numpy.identity(1000)", number = 10000)
11.379066944122314
>>> timeit.timeit("import numpy; numpy.eye(1000)", number = 10000)
11.247124910354614
What, then, is the advantage of using identity over eye?

identity just calls eye so there is no difference in how the arrays are constructed. Here's the code for identity:
def identity(n, dtype=None):
from numpy import eye
return eye(n, dtype=dtype)
As you say, the main difference is that with eye the diagonal can may be offset, whereas identity only fills the main diagonal.
Since the identity matrix is such a common construct in mathematics, it seems the main advantage of using identity is for its name alone.

To see the difference in an example, run the below codes:
import numpy as np
#Creates an array of 4 x 4 with the main diagonal of 1
arr1 = np.eye(4)
print(arr1)
print("\n")
#or you can change the diagonal position
arr2 = np.eye(4, k=1) # or try with another number like k= -2
print(arr2)
print("\n")
#but you can't change the diagonal in identity
arr3 = np.identity(4)
print(arr3)

np.identity returns a square matrix (special case of a 2D-array) which is an identity matrix with the main diagonal (i.e. 'k=0') as 1's and the other values as 0's. you can't change the diagonal k here.
np.eye returns a 2D-array, which fills the diagonal, i.e. 'k' which can be set, with 1's and rest with 0's.
So, the main advantage depends on the requirement. If you want an identity matrix, you can go for identity right away, or can call the np.eye leaving the rest to defaults.
But, if you need a 1's and 0's matrix of a particular shape/size or have a control over the diagonal you can go for eye method.
Just like how a matrix is a special case of an array, np.identity is a special case of np.eye.
Additional references:
Eye and Identity - HackerRank

Symmetric matrices in numpy?

I wish to initiate a symmetric matrix in python and populate it with zeros.
At the moment, I have initiated an array of known dimensions but this is unsuitable for subsequent input into R as a distance matrix.
Are there any 'simple' methods in numpy to create a symmetric matrix?
Edit
I should clarify - creating the 'symmetric' matrix is fine. However I am interested in only generating the lower triangular form, ie.,
ar = numpy.zeros((3, 3))
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
I want:
array([[ 0],
[ 0, 0 ],
[ 0., 0., 0.]])
Is this possible?

I don't think it's feasible to try work with that kind of triangular arrays.
So here is for example a straightforward implementation of (squared) pairwise Euclidean distances:
def pdista(X):
"""Squared pairwise distances between all columns of X."""
B= np.dot(X.T, X)
q= np.diag(B)[:, None]
return q+ q.T- 2* B
For performance wise it's hard to beat it (in Python level). What would be the main advantage of not using this approach?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.