Numpy (sparse) repeated index increment - python

a, b are 1D numpy ndarray of same size with integer data type.
C is a 2D scipy.sparse.lil_matrix.
If the indexing [a, b] contains repeated index, does C[a, b] += np.array([1]) always increment C exactly once for each unique indexing of C by [a, b]?
Did the documentation mention this?
Example:
import scipy.sparse as ss
import numpy as np
C = ss.lil_matrix((3,2), dtype=int)
a = np.array([0, 1, 2] * 4)
b = np.array([0, 1] * 6)
C[a, b] += np.array([1])
print(C.todense(), '\n')
C[a, b] += np.array([1])
print(C.todense())
Result:
[[1 1]
[1 1]
[1 1]]
[[2 2]
[2 2]
[2 2]]

I don't know that it's documented
It's well known that dense arrays are set just once per unique index due to buffering. We have to use add.at to get unbuffered addition.
In [966]: C=sparse.lil_matrix((3,2),dtype=int)
In [967]: Ca=C.A
In [968]: Ca += 1
In [969]: Ca
Out[969]:
array([[1, 1],
[1, 1],
[1, 1]])
In [970]: Ca=C.A
In [973]: np.add.at(Ca,(a,b),1)
In [974]: Ca
Out[974]:
array([[2, 2],
[2, 2],
[2, 2]])
Your example shows that the lil indexed setting behaves in the buffered sense as well. But I'd have to look at the code to see exactly why.
It is documented that coo style inputs are summed across duplicates.
In [975]: M=sparse.coo_matrix((np.ones_like(a),(a,b)), shape=(3,2))
In [976]: print(M)
(0, 0) 1
(1, 1) 1
(2, 0) 1
(0, 1) 1
(1, 0) 1
(2, 1) 1
(0, 0) 1
(1, 1) 1
(2, 0) 1
(0, 1) 1
(1, 0) 1
(2, 1) 1
In [977]: M.A
Out[977]:
array([[2, 2],
[2, 2],
[2, 2]])
In [978]: M
Out[978]:
<3x2 sparse matrix of type '<class 'numpy.int32'>'
with 12 stored elements in COOrdinate format>
In [979]: M.tocsr()
Out[979]:
<3x2 sparse matrix of type '<class 'numpy.int32'>'
with 6 stored elements in Compressed Sparse Row format>
In [980]: M.sum_duplicates()
In [981]: M
Out[981]:
<3x2 sparse matrix of type '<class 'numpy.int32'>'
with 6 stored elements in COOrdinate format>
points are stored in coo format as entered, but when used for display or calculation (csr format) duplicates are summed.

Related

Create an sparse matrix from a list of tuples having the indexes of the column where is a 1

Problem:
I have a list of tuples, which each tuple represents a column of a 2D-array and each element of the tuple represents the index of that column of the array that is a 1; the other entries that aren't in that tuple, are 0.
I want to create an sparse matrix with this list of tuples in an efficient way (trying to not use for loops).
Example:
# init values
list_tuples = [
(0, 2, 4),
(0, 2, 3),
(1, 3, 4)
]
n = length(list_tuples) + 1
m = 5 # arbritrary, however n >= max([ei for ei in list_tuples]) + 1
# what I need is a function which accepts this tuples and give the shape of the array
# (at least the row size, because the column size can be infered from the list of tuples)
A = some_function(list_tuples, array_shape = (m, n))
Then what I expect to have is an array of the form:
[
[1, 1, 0]
[0, 0, 1]
[1, 1, 0]
[0, 1, 1]
[1, 0, 1]
]
Your values are the indices that are required for the compressed sparse column format. You'll also need the indptr array, which for your data is the cumulative sum of the lengths of the tuples (prepended with 0). The data array would be an array of ones with the same length as the sum of the lengths of the tuples, which you can get from the last element of the cumulative sum. Here's how that looks with your example:
In [45]: from scipy.sparse import csc_matrix
In [46]: list_tuples = [
...: (0, 2, 4),
...: (0, 2, 3),
...: (1, 3, 4)
...: ]
In [47]: indices = sum(list_tuples, ()) # Flatten the tuples into one sequence.
In [48]: indptr = np.cumsum([0] + [len(t) for t in list_tuples])
In [49]: a = csc_matrix((np.ones(indptr[-1], dtype=int), indices, indptr))
In [50]: a
Out[50]:
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Column format>
In [51]: a.A
Out[51]:
array([[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
[0, 1, 1],
[1, 0, 1]])
Note that csc_matrix inferred the number of rows from the maximum that it found in the indices. You can use the shape parameter to override that, e.g.
In [52]: b = csc_matrix((np.ones(indptr[-1], dtype=int), indices, indptr), shape=(7, len(list_tuples)))
In [53]: b
Out[53]:
<7x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Column format>
In [54]: b.A
Out[54]:
array([[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
[0, 1, 1],
[1, 0, 1],
[0, 0, 0],
[0, 0, 0]])
You can also generate a coo_matrix pretty easily. The flattened list_tuples gives the row indices, and np.repeat can be used to create the column indices:
In [63]: from scipy.sparse import coo_matrix
In [64]: i = sum(list_tuples, ()) # row indices
In [65]: j = np.repeat(range(len(list_tuples)), [len(t) for t in list_tuples])
In [66]: c = coo_matrix((np.ones(len(i), dtype=int), (i, j)))
In [67]: c
Out[67]:
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in COOrdinate format>
In [68]: c.A
Out[68]:
array([[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
[0, 1, 1],
[1, 0, 1]])

Understand the csr format

I am trying to undersand how scipy CSR works.
https://docs.scipy.org/doc/scipy/reference/sparse.html
For example, of the following matrix on https://en.wikipedia.org/wiki/Sparse_matrix
( 0 0 0 0 )
( 5 8 0 0 )
( 0 0 3 0 )
( 0 6 0 0 )
it says the CSR representation is the following.
Must V list one row after another with non-zero elements in a row list from left to right?
I can understand COL_INDEX is the column index (column 1 is indexed as 0) corresponding to elements in V.
I don't understand ROW_INDEX. Could anybody show me how the ROW_INDEX was created from the original matrix? Thanks.
V = [ 5 8 3 6 ]
COL_INDEX = [ 0 1 2 1 ]
ROW_INDEX = [ 0 0 2 3 4 ]
From the scipy manual:
csr_matrix((data, indices, indptr), [shape=(M, N)]) is the standard
CSR representation where the column indices for row i are stored in
indices[indptr[i]:indptr[i+1]] and their corresponding values are
stored in data[indptr[i]:indptr[i+1]]. If the shape parameter is not
supplied, the matrix dimensions are inferred from the index arrays.
indptr is the same as ROW_INDEX and indicies is the same as COL_INDEX.
Here is an example of a naive way to create the indices and value array. Essentially ROW_INDICES[i + 1] is the total number of non-zero entires from row 0 to i inclusive with the last entry being the total number of non-zero entries.
ROW_INDICES = [0]
COL_INDICES = []
VALS = []
for i in range(num_rows):
ROW_INDICES.append(ROW_INDICES[i])
for j in range(num_cols):
if m[i, j] > 0:
ROW_INDICES[i + 1] += 1
COL_INDICES.append(j)
VALS.append(m[i, j])
coo format
I think it's best to start with the coo definition. It's easier to understand, and widely used:
In [90]: A = np.array([[0,0,0,0],[5,8,0,0],[0,0,3,0],[0,6,0,0]])
In [91]: M = sparse.coo_matrix(A)
The values are stored in 3 attributes:
In [92]: M.row
Out[92]: array([1, 1, 2, 3], dtype=int32)
In [93]: M.col
Out[93]: array([0, 1, 2, 1], dtype=int32)
In [94]: M.data
Out[94]: array([5, 8, 3, 6])
We can make a new matrix from those 3 arrays:
In [95]: sparse.coo_matrix((_94, (_92, _93))).A
Out[95]:
array([[0, 0, 0],
[5, 8, 0],
[0, 0, 3],
[0, 6, 0]])
oops, I need to add a shape, since one column is all 0s:
In [96]: sparse.coo_matrix((_94, (_92, _93)), shape=(4,4)).A
Out[96]:
array([[0, 0, 0, 0],
[5, 8, 0, 0],
[0, 0, 3, 0],
[0, 6, 0, 0]])
Another way to display this matrix:
In [97]: print(M)
(1, 0) 5
(1, 1) 8
(2, 2) 3
(3, 1) 6
np.where(A) gives the same non-zero coordinates.
In [108]: np.where(A)
Out[108]: (array([1, 1, 2, 3]), array([0, 1, 2, 1]))
conversion to csr
Once we have coo, we can easily convert it to csr. In fact sparse often does that for us:
In [98]: Mr = M.tocsr()
In [99]: Mr.data
Out[99]: array([5, 8, 3, 6], dtype=int64)
In [100]: Mr.indices
Out[100]: array([0, 1, 2, 1], dtype=int32)
In [101]: Mr.indptr
Out[101]: array([0, 0, 2, 3, 4], dtype=int32)
Sparse does several things - it sorts the indices, sums duplicates, and replaces the row with a indptr array. Here it is actually longer than the original, but in general it will be shorter, since it has just one value per row (plus 1). But perhaps more important, most of the fast calculation routines, especially matrix multiplication, have been written using the csr format.
I've used this package a lot. MATLAB as well, where the default definition is in the coo style, but the internal storage is csc (but not as exposed to users as in scipy). But I've never tried to derive indptr from scratch. I could, but I don't need to.
csr_matrix accepts inputs in the coo format, but also in the indptr etc format. I wouldn't recommend it, unless you already have those inputs calculated (say from another matrix). It's more error prone, and probably not much faster.
Iteration with indptr
However sometimes it is useful to iterate on intptr, and perform calculations directly on the data. Often this is faster than working with the provided methods.
For example we can list the nonzero values by row:
In [104]: for i in range(Mr.shape[0]):
...: pt = slice(Mr.indptr[i], Mr.indptr[i+1])
...: print(i, Mr.indices[pt], Mr.data[pt])
...:
0 [] []
1 [0 1] [5 8]
2 [2] [3]
3 [1] [6]
Keeping the initial 0 makes this iteration easier. When the matrix is (10000,90000) there's not much incentive to reduces the size of indptr by 1.
lil format
The lil format stores the matrix in a similar manner:
In [105]: Ml = M.tolil()
In [106]: Ml.data
Out[106]: array([list([]), list([5, 8]), list([3]), list([6])], dtype=object)
In [107]: Ml.rows
Out[107]: array([list([]), list([0, 1]), list([2]), list([1])], dtype=object)
In [110]: for i,(r,d) in enumerate(zip(Ml.rows, Ml.data)):
...: print(i, r, d)
...:
0 [] []
1 [0, 1] [5, 8]
2 [2] [3]
3 [1] [6]
Because of how rows are stored, lil actually allows us to fetch a view:
In [167]: Ml.getrowview(2)
Out[167]:
<1x4 sparse matrix of type '<class 'numpy.longlong'>'
with 1 stored elements in List of Lists format>
In [168]: for i in range(Ml.shape[0]):
...: print(Ml.getrowview(i))
...:
(0, 0) 5
(0, 1) 8
(0, 2) 3
(0, 1) 6

Adding values to non zero elements in a Sparse Matrix

I have a sparse matrix in which I want to increment all the values of non-zero elements by one. However, I cannot figure it out. Is there a way to do it using standard packages in python? Any help will be appreciated.
I cannot comment on it's performance but you can do (Scipy 1.1.0);
>>> from scipy.sparse import csr_matrix
>>> a = csr_matrix([[0, 2, 0], [1, 0, 0]])
>>> print(a)
(0, 1) 2
(1, 0) 1
>>> a[a.nonzero()] = a[a.nonzero()] + 1
>>> print(a)
(0, 1) 3
(1, 0) 2
If your matrix have 2 dimensions, you can do the following:
sparse_matrix = [[element if element==0 else element+1 for element in row ]for row in sparse_matrix]
It will iterate over every element of your matrix and return the element without any change if it is equals to zero, else it add 1 to the element and return it.
More about conditionals in list comprehension in the answer for this question.
You can use the package numpy which has efficient functions for dealing with n-dimensional arrays. What you need is:
array[array>0] += 1
where array is the numpy array of your matrix. Example here:
`
import numpy as np
my_matrix = [[2,0,0,0,7],[0,0,0,4,0]]
array = np.array(my_matrix);
print("Matrix before incrementing values: \n", array)
array[array>0] += 1
print("Matrix after incrementing values: \n", array)`
Outputs:
Matrix before incrementing values:
[[2 0 0 0 7]
[0 0 0 4 0]]
Matrix after incrementing values:
[[3 0 0 0 8]
[0 0 0 5 0]]
Hope this helps!
When you have a scipy sparse matrix (scipy.sparse) is:
import scipy.sparse as sp
my_matrix = [[2,0,0,0,7],[0,0,0,4,0]]
my_matrix = sp.csc_matrix(my_matrix)
my_matrix.data += 1
my_matrix.todense()
Returns:
[[3, 0, 0, 0, 8], [0, 0, 0, 5, 0]]

CSR scipy matrix does not update after updating its values

I have the following code in python:
import numpy as np
from scipy.sparse import csr_matrix
M = csr_matrix(np.ones([2, 2],dtype=np.int32))
print(M)
print(M.data.shape)
for i in range(np.shape(mat)[0]):
for j in range(np.shape(mat)[1]):
if i==j:
M[i,j] = 0
print(M)
print(M.data.shape)
The output of the first 2 prints is:
(0, 0) 1
(0, 1) 1
(1, 0) 1
(1, 1) 1
(4,)
The code is changing the value of the same index (i==j) and setting the value to zero.
After executing the loops then the output of the last 2 prints is:
(0, 0) 0
(0, 1) 1
(1, 0) 1
(1, 1) 0
(4,)
If I understand the concept of sparse matrices correctly, it should not be the case. It should not show me the zero values and the output of last 2 prints should be like this:
(0, 1) 1
(1, 0) 1
(2,)
Does anyone have explanation for this? Am I doing something wrong?
Yes, you are trying to change elements of the matrix one by one. :)
Ok, it does work that way, though if you changed things the other way (setting a 0 to nonzero) you will get an Efficiency warning.
To keep your kind of change fast, it only changes the value in the M.data array, and does not recalculate the indices. You have to invoke a separate csr_matrix.eliminate_zeros method the clean up the matrix. To get best speed call this once at the end of the loop.
There is a csr_matrix.setdiag method that lets you set the whole diagonal with one call. It still needs the cleanup.
In [1633]: M=sparse.csr_matrix(np.arange(9).reshape(3,3))
In [1634]: M
Out[1634]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in Compressed Sparse Row format>
In [1635]: M.A
Out[1635]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]], dtype=int32)
In [1636]: M.setdiag(0)
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [1637]: M
Out[1637]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 9 stored elements in Compressed Sparse Row format>
In [1638]: M.A
Out[1638]:
array([[0, 1, 2],
[3, 0, 5],
[6, 7, 0]])
In [1639]: M.data
Out[1639]: array([0, 1, 2, 3, 0, 5, 6, 7, 0])
In [1640]: M.eliminate_zeros()
In [1641]: M
Out[1641]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 6 stored elements in Compressed Sparse Row format>
In [1642]: M.data
Out[1642]: array([1, 2, 3, 5, 6, 7])

numpy/scipy build adjacency matrix from weighted edgelist

I'm reading a weighted egdelist / numpy array like:
0 1 1
0 2 1
1 2 1
1 0 1
2 1 4
where the columns are 'User1','User2','Weight'. I'd like to perform a DFS algorithm with scipy.sparse.csgraph.depth_first_tree, which requires a N x N matrix as input. How can I convert the previous list into a square matrix as:
0 1 1
1 0 1
0 4 0
within numpy or scipy?
Thanks for your help.
EDIT:
I've been working with a huge (150 million nodes) network, so I'm looking for a memory efficient way to do that.
You could use a memory-efficient scipy.sparse matrix:
import numpy as np
import scipy.sparse as sparse
arr = np.array([[0, 1, 1],
[0, 2, 1],
[1, 2, 1],
[1, 0, 1],
[2, 1, 4]])
shape = tuple(arr.max(axis=0)[:2]+1)
coo = sparse.coo_matrix((arr[:, 2], (arr[:, 0], arr[:, 1])), shape=shape,
dtype=arr.dtype)
print(repr(coo))
# <3x3 sparse matrix of type '<type 'numpy.int64'>'
# with 5 stored elements in COOrdinate format>
To convert the sparse matrix to a dense numpy array, you could use todense:
print(coo.todense())
# [[0 1 1]
# [1 0 1]
# [0 4 0]]
Try something like the following:
import numpy as np
import scipy.sparse as sps
A = np.array([[0, 1, 1],[0, 2, 1],[1, 2, 1],[1, 0, 1],[2, 1, 4]])
i, j, weight = A[:,0], A[:,1], A[:,2]
# find the dimension of the square matrix
dim = max(len(set(i)), len(set(j)))
B = sps.lil_matrix((dim, dim))
for i,j,w in zip(i,j,weight):
B[i,j] = w
print B.todense()
>>>
[[ 0. 1. 1.]
[ 1. 0. 1.]
[ 0. 4. 0.]]

Categories

Resources