Efficiently inversing a `csr_matrix` in `scipy` element-wise - python

Let $A$ be a csr_matrix representing the connectivity matrix for a graph where $A_{ij}$ is the weight of an edge. Now, I need to inverse each non-zero element of the matrix in an efficient way. The way I'm doing this right now is
B = 1.0 / A.toarray()
B[B == np.inf] = 0
This has two down-sides:
memory usage increases by converting a csr_matrix to an array.
a division by zero happens
Are there any suggestions to do this more efficient?

One way you could do this is to create a new matrix from the data, indices and indptr of A: B = csr_matrix((1/A.data, A.indices, A.indptr)).
(This assumes that there are no explicitly stored zeros in A, so 1/A.data doesn't result in some values being inf.)
For example,
In [108]: A
Out[108]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [109]: A.A
Out[109]:
array([[0. , 1. , 2.5, 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 4. ],
[2. , 0. , 0. , 0. ]])
In [110]: B = csr_matrix((1/A.data, A.indices, A.indptr))
In [111]: B
Out[111]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [112]: B.A
Out[112]:
array([[0. , 1. , 0.4 , 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0.25],
[0.5 , 0. , 0. , 0. ]])

csr has a power method:
In [598]: M = sparse.csr_matrix([[0,3,2],[.5,0,10]])
In [599]: M
Out[599]:
<2x3 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [600]: M.A
Out[600]:
array([[ 0. , 3. , 2. ],
[ 0.5, 0. , 10. ]])
In [601]: x = M.power(-1)
In [602]: x
Out[602]:
<2x3 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [603]: x.A
Out[603]:
array([[0. , 0.33333333, 0.5 ],
[2. , 0. , 0.1 ]])

Related

Numpy covariance command returning matrix with more dimensions than input

I have an arbitrary row vector "u" and an arbitrary matrix "e" as follows:
u = np.resize(np.array([8,3]),[1,2])
e = np.resize(np.array([[2,2,5,5],[1, 6, 7, 4]]),[4,2])
np.cov(u,e)
array([[ 12.5, 0. , 0. , -12.5, 7.5],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[-12.5, 0. , 0. , 12.5, -7.5],
[ 7.5, 0. , 0. , -7.5, 4.5]])
The matrix that this returns is 5x5. This is confusing to me because the largest dimension of the inputs is only 4.
Thus, this may be less of a numpy question and more of a math question...not sure...
Please refer to the official numpy documentation (https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.cov.html) and check whether you usage of the numpy.cov function is consistent with what you are trying to achieve and you understand what you are trying to do.
When looking at the signature
numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)
m : array_like
A 1-D or 2-D array containing multiple variables and observations.
Each row of m represents a variable, and each column a single observation > > of all those variables. Also see rowvar below.
y : array_like, optional
An additional set of variables and observations. y has the same form as that of m.
Note how m and y are combined as shown in the last example on the page
>>> x = [-2.1, -1, 4.3]
>>> y = [3, 1.1, 0.12]
>>> X = np.stack((x, y), axis=0)
>>> print(np.cov(X))
[[ 11.71 -4.286 ]
[ -4.286 2.14413333]]
>>> print(np.cov(x, y))
[[ 11.71 -4.286 ]
[ -4.286 2.14413333]]
>>> print(np.cov(x))
11.71

Broadcast through numpy array with list of arrays

Given an adjacency list:
adj_list = [array([0,1]),array([0,1,2]),array([0,2])]
And an array of indices,
ind_arr = array([0,1,2])
Goal:
A = np.zeros((3,3))
for i in ind_arr:
A[i,list(adj_list[x])] = 1.0/float(adj_list[x].shape[0])
Currently, I have written:
A[ind_list[:],adj_list[:]] = 1. / len(adj_list[:])
And tried various configurations of indexing within this scaffold.
Here's one approach -
lens = np.array([len(i) for i in adj_list])
col_idx = np.concatenate(adj_list)
out = np.zeros((len(lens), col_idx.max()+1))
row_idx = np.repeat(np.arange(len(lens)), lens)
vals = np.repeat(1.0/lens, lens)
out[row_idx, col_idx] = vals
Sample input, output -
In [494]: adj_list = [np.array([0,2]),np.array([0,1,4])]
In [496]: out
Out[496]:
array([[ 0.5 , 0. , 0.5 , 0. , 0. ],
[ 0.33333333, 0.33333333, 0. , 0. , 0.33333333]])
Sparse matrix as output
Additionally, if you want to save memory and create a sparse matrix instead, that's an easy extension -
In [506]: from scipy.sparse import csr_matrix
In [507]: csr_matrix((vals, (row_idx, col_idx)), shape=(len(lens), col_idx.max()+1))
Out[507]:
<2x5 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [508]: _.toarray()
Out[508]:
array([[ 0.5 , 0. , 0.5 , 0. , 0. ],
[ 0.33333333, 0.33333333, 0. , 0. , 0.33333333]])
I don't think you can completely eliminate loops due to the mixed data types, but you can reduce the nested double for loops to a single one:
A = np.zeros((2, 3))
for i, arr in enumerate(adj_list):
arr_size = len(arr)
A[i, :arr_size] = 1./arr_size
A
# array([[ 0.5 , 0.5 , 0. ],
# [ 0.33333333, 0.33333333, 0.33333333]])
Or if the numbers in the arrays are actually columns positions:
A = np.zeros((2, 3))
for i, arr in enumerate(adj_list):
A[i, arr] = 1./len(arr)
A
# array([[ 0.5 , 0.5 , 0. ],
# [ 0.33333333, 0.33333333, 0.33333333]])
Another option using MultiLabelBinarizer from sklearn(but may not be as efficient):
from sklearn.preprocessing import MultiLabelBinarizer
​
mlb = MultiLabelBinarizer()
adj_list = [np.array([0,1]),np.array([0,1,2])]
​
sizes = np.fromiter(map(len, adj_list), dtype=int)
mlb.fit_transform(adj_list)/sizes[:,None]
# array([[ 0.5 , 0.5 , 0. ],
# [ 0.33333333, 0.33333333, 0.33333333]])

Python sparse matrix remove duplicate indices except one?

I am computing the cosine similarity between matrix of vectors, and I get the result in a sparse matrix like this:
(0, 26) 0.359171459261
(0, 25) 0.121145761751
(0, 24) 0.316922015914
(0, 23) 0.157622038039
(0, 22) 0.636466644041
(0, 21) 0.136216495731
(0, 20) 0.243164535496
(0, 19) 0.348272617805
(0, 18) 0.636466644041
(0, 17) 1.0
But there are duplicates for example:
(0, 24) 0.316922015914 and (24, 0) 0.316922015914
What I want to do is to remove them by indice and be (if I have (0,24) then I don't need (24, 0) because it is the same) left with only one of this and remove the second, for all vectors in the matrix.
Currently I have the following code to create the matrix:
vectorized_words = sparse.csr_matrix(vectorize_words(nostopwords,glove_dict))
cos_similiarity = cosine_similarity(vectorized_words,dense_output=False)
So to summarize I don't want to remove all duplicates, I want to be left with only one of them using the pythonic way.
Thank you in advance !
I think it is easiest to get the upper-triangle of a coo format matrix:
First make a small symmetric matrix:
In [876]: A = sparse.random(5,5,.3,'csr')
In [877]: A = A+A.T
In [878]: A
Out[878]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 11 stored elements in Compressed Sparse Row format>
In [879]: A.A
Out[879]:
array([[ 0. , 0. , 0.81388978, 0. , 0. ],
[ 0. , 0. , 0.73944395, 0.20736975, 0.98968617],
[ 0.81388978, 0.73944395, 0. , 0. , 0. ],
[ 0. , 0.20736975, 0. , 0.05581152, 0.04448881],
[ 0. , 0.98968617, 0. , 0.04448881, 0. ]])
Convert to coo, and set the lower-triangle data values to 0
In [880]: Ao = A.tocoo()
In [881]: mask = (Ao.row>Ao.col)
In [882]: mask
Out[882]:
array([False, False, False, False, True, True, True, False, False,
True, True], dtype=bool)
In [883]: Ao.data[mask]=0
Convert back to 0, and use eliminate_zeros to prune the matrix.
In [890]: A1 = Ao.tocsr()
In [891]: A1
Out[891]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 11 stored elements in Compressed Sparse Row format>
In [892]: A1.eliminate_zeros()
In [893]: A1
Out[893]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [894]: A1.A
Out[894]:
array([[ 0. , 0. , 0.81388978, 0. , 0. ],
[ 0. , 0. , 0.73944395, 0.20736975, 0.98968617],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0.05581152, 0.04448881],
[ 0. , 0. , 0. , 0. , 0. ]])
Both the coo and csr formats have a in-place eliminate_zeros method.
def eliminate_zeros(self):
"""Remove zero entries from the matrix
This is an *in place* operation
"""
mask = self.data != 0
self.data = self.data[mask]
self.row = self.row[mask]
self.col = self.col[mask]
Instead of using Ao.data[mask]=0 you could this code as a model for eliminating just the lower_triangle values.

Numpy - Modal matrix and diagonal Eigenvalues

I wrote a simple Linear Algebra code in Python Numpy to calculate the Diagonal of EigenValues by calculating $M^{-1}.A.M$ (M is the Modal Matrix) and it's working strange.
Here's the Code :
import numpy as np
array = np.arange(16)
array = array.reshape(4, -1)
print(array)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
eigenvalues, eigenvectors = np.linalg.eig(array)
print eigenvalues
[ 3.24642492e+01 -2.46424920e+00 1.92979794e-15 -4.09576009e-16]
print eigenvectors
[[-0.11417645 -0.7327781 0.54500164 0.00135151]
[-0.3300046 -0.28974835 -0.68602671 0.40644504]
[-0.54583275 0.15328139 -0.2629515 -0.8169446 ]
[-0.76166089 0.59631113 0.40397657 0.40914805]]
inverseEigenVectors = np.linalg.inv(eigenvectors) #M^(-1)
diagonal= inverseEigenVectors.dot(array).dot(eigenvectors) #M^(-1).A.M
print(diagonal)
[[ 3.24642492e+01 -1.06581410e-14 5.32907052e-15 0.00000000e+00]
[ 7.54951657e-15 -2.46424920e+00 -1.72084569e-15 -2.22044605e-16]
[ -2.80737213e-15 1.46768503e-15 2.33547852e-16 7.25592561e-16]
[ -6.22319863e-15 -9.69656080e-16 -1.38050658e-30 1.97215226e-31]]
the final 'diagonal' matrix should be a diagonal matrix with EigenValues on the main diagonal and zeros elsewhere. but it's not... the two first main diagonal values ARE eigenvalues but the two second aren't (although just like the two second eigenvalues, they are nearly zero).
and by the way a number like $-1.06581410e-14$ is literally zero so how can I make numpy show them as zero?
What am I doing wrong?
Thanks...
Just round the final result to the desired digits :
print(diagonal.round(5))
array([[ 32.46425, 0. , 0. , 0. ],
[ 0. , -2.46425, 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
Don't confuse precision of computation and printing policies.
>>> diagonal[np.abs(diagonal)<0.0000000001]=0
>>> print diagonal
[[ 32.4642492 0. 0. 0. ]
[ 0. -2.4642492 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]
>>>

Build diagonal matrix without using for loop

I am trying to build the following matrix in Python without using a for loop:
A
[[ 0.1 0.2 0. 0. 0. ]
[ 1. 2. 3. 0. 0. ]
[ 0. 1. 2. 3. 0. ]
[ 0. 0. 1. 2. 3. ]
[ 0. 0. 0. 4. 5. ]]
I tried the fill_diagonal method in NumPy (see matrix B below) but it does not give me the same matrix as shown in matrix A:
B
[[ 1. 0.2 0. 0. 0. ]
[ 0. 2. 0. 0. 0. ]
[ 0. 0. 3. 0. 0. ]
[ 0. 0. 0. 1. 0. ]
[ 0. 0. 0. 4. 5. ]]
Here is the Python code that I used to construct the matrices:
import numpy as np
import scipy.linalg as sp # maybe use scipy to build diagonal matrix?
#---- build diagonal square array using "for" loop
m = 5
A = np.zeros((m, m))
A[0, 0] = 0.1
A[0, 1] = 0.2
for i in range(1, m-1):
A[i, i-1] = 1 # m-1
A[i, i] = 2 # m
A[i, i+1] = 3 # m+1
A[m-1, m-2] = 4
A[m-1, m-1] = 5
print('A \n', A)
#---- build diagonal square array without loop
B = np.zeros((m, m))
B[0, 0] = 0.1
B[0, 1] = 0.2
np.fill_diagonal(B, [1, 2, 3])
B[m-1, m-2] = 4
B[m-1, m-1] = 5
print('B \n', B)
So is there a way to construct a diagonal matrix like the one shown by matrix A without using a for loop?
There are functions for this in scipy.sparse, e.g.:
from scipy.sparse import diags
C = diags([1,2,3], [-1,0,1], shape=(5,5), dtype=float)
C = C.toarray()
C[0, 0] = 0.1
C[0, 1] = 0.2
C[-1, -2] = 4
C[-1, -1] = 5
Diagonal matrices are generally very sparse, so you could also keep it as a sparse matrix. This could even have large efficiency benefits, depending on the application.
The efficiency gains sparse matrices could give you depend very much on matrix size. For a 5x5 array you can't really be bothered I guess. But for larger matrices creating the array could be a lot faster with sparse matrices, illustrated by the following example with an identity matrix:
%timeit np.eye(3000)
# 100 loops, best of 3: 3.12 ms per loop
%timeit sparse.eye(3000)
# 10000 loops, best of 3: 79.5 µs per loop
But the real strength of the sparse matrix data type is shown when you need to do mathematical operations on arrays that are sparse:
%timeit np.eye(3000).dot(np.eye(3000))
# 1 loops, best of 3: 2.8 s per loop
%timeit sparse.eye(3000).dot(sparse.eye(3000))
# 1000 loops, best of 3: 1.11 ms per loop
Or when you need to work with some very large but sparse array:
np.eye(1E6)
# ValueError: array is too big.
sparse.eye(1E6)
# <1000000x1000000 sparse matrix of type '<type 'numpy.float64'>'
# with 1000000 stored elements (1 diagonals) in DIAgonal format>
Notice that the number of 0 is always 3 (or a constant whenever you want to have a diagonal matrix like this):
In [10]:
import numpy as np
A1=[0.1, 0.2]
A2=[1,2,3]
A3=[4,5]
SPC=[0,0,0] #=or use np.zeros #spacing zeros
np.hstack((A1,SPC,A2,SPC,A2,SPC,A2,SPC,A3)).reshape(5,5)
Out[10]:
array([[ 0.1, 0.2, 0. , 0. , 0. ],
[ 1. , 2. , 3. , 0. , 0. ],
[ 0. , 1. , 2. , 3. , 0. ],
[ 0. , 0. , 1. , 2. , 3. ],
[ 0. , 0. , 0. , 4. , 5. ]])
In [11]:
import itertools #A more general way of doing it
np.hstack(list(itertools.chain(*[(item, SPC) for item in [A1, A2, A2, A2, A3]]))[:-1]).reshape(5,5)
Out[11]:
array([[ 0.1, 0.2, 0. , 0. , 0. ],
[ 1. , 2. , 3. , 0. , 0. ],
[ 0. , 1. , 2. , 3. , 0. ],
[ 0. , 0. , 1. , 2. , 3. ],
[ 0. , 0. , 0. , 4. , 5. ]])

Categories

Resources