Calculate Similarity of Sparse Matrix - python

I am using Python with the numpy, scipy and scikit-learn modules.
I'd like to classify the arrays (rows) of a very big sparse matrix (100,000 x 100,000).
The values in the matrix are either 0 or 1. The only thing I have is the indices of the entries equal to 1.
a = [1,3,5,7,9]
b = [2,4,6,8,10]
which means
a = [0,1,0,1,0,1,0,1,0,1,0]
b = [0,0,1,0,1,0,1,0,1,0,1]
How can I convert the index arrays to a sparse array in scipy?
How can I classify those arrays quickly?
Thank you very much.

If you choose coo_matrix, you can create it by passing the indices directly:
from scipy.sparse import coo_matrix
import numpy as np  # scipy.array and scipy.ones were NumPy aliases that newer SciPy releases removed
nrows = 100000
ncols = 100000
# row[k] and col[k] are the coordinates of the k-th entry equal to 1
row = np.array([1,3,5,7,9])
col = np.array([2,4,6,8,10])
values = np.ones(col.size)
m = coo_matrix((values, (row,col)), shape=(nrows, ncols), dtype=float)
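In the question, each list holds the column positions of the ones in a single row. A minimal sketch of building a CSR matrix from such lists follows, assuming the lists are collected in a Python list called `index_lists` (a name chosen here for illustration) and using a tiny `ncols` so the result can be printed. The product with the transpose is one possible starting point for a similarity measure, not necessarily what the asker needs.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical input: one list of column indices per row.
index_lists = [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
ncols = 11  # 100000 in the question; small here for display

# CSR layout: all column indices concatenated, plus row boundaries.
indices = np.concatenate(index_lists)
indptr = np.concatenate(([0], np.cumsum([len(ix) for ix in index_lists])))
data = np.ones(indices.size, dtype=np.int32)

m = csr_matrix((data, indices, indptr), shape=(len(index_lists), ncols))
print(m.toarray())

# Entry (i, j) counts the positions where rows i and j both hold a 1.
overlap = (m @ m.T).toarray()
print(overlap)
```

For the full 100,000 x 100,000 case the overlap matrix can itself become large, so it may be preferable to keep it sparse rather than calling `toarray()`.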

Related

Avoid stacking of sparse matrices for elementwise multiplication

I want to perform an elementwise multiplication of two (scipy) sparse matrices: A.shape = B.shape = (m,n). However, matrix B consists of a smaller matrix B_base that is stacked horizontally several times. Obviously, this is not memory-efficient. Thus, the question: How can I efficiently multiply A and B_base elementwise without stacking?
Below find a MWE using sparse.hstack:
from scipy import sparse
A = sparse.random(m=1000, n=10000, density=0.1, format="csc")
B = sparse.random(m=1000, n=1000, density=0.1, format="csc")
factor_matrix = sparse.hstack([B for i in range(10)], format="csc")
result = A.multiply(factor_matrix)
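One possible way to avoid building the stacked factor matrix is to multiply each column block of A by the base matrix and stack only the results. This is a sketch under the assumption that the width of A is an exact multiple of the width of the base matrix; the names `B_base`, `block_width` and `n_blocks` are introduced here for illustration, and the sizes are reduced to keep the example quick.

```python
from scipy import sparse

A = sparse.random(50, 200, density=0.1, format="csc", random_state=0)
B_base = sparse.random(50, 20, density=0.1, format="csc", random_state=1)

block_width = B_base.shape[1]
n_blocks = A.shape[1] // block_width  # assumes an exact multiple

# Multiply every column block of A with B_base; CSC makes column slicing cheap.
blocks = [
    A[:, k * block_width:(k + 1) * block_width].multiply(B_base)
    for k in range(n_blocks)
]
result = sparse.hstack(blocks, format="csc")
```

Only the product blocks are stacked, and those are at most as dense as the sparser of the two factors.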

How to concatenate more sparse matrices into one in python

I have a problem in Python where I would like to merge some sparse matrices into one. The sparse matrices are of csr_matrix type and have the same number of rows. When I use hstack to stack them together I obtain an array of matrices, but I would like to obtain a single matrix with the same number of rows and with as many columns as all the input matrices combined.
Thanks for support.
You can do this using scipy.sparse.hstack. For example:
import numpy as np
from scipy import sparse
x = sparse.csr_matrix(np.random.randint(0, 2, size=(10, 10)))
y = sparse.csr_matrix(np.random.randint(0, 2, size=(10, 10)))
xy = sparse.hstack([x, y])
print(xy.shape)
# (10, 20)
print(type(xy))
# <class 'scipy.sparse.coo.coo_matrix'>
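As the output above shows, the stacked result comes back in COO format by default. If a CSR result is wanted, `scipy.sparse.hstack` accepts a `format` argument. A short sketch with matrices of different widths:

```python
import numpy as np
from scipy import sparse

x = sparse.csr_matrix(np.random.randint(0, 2, size=(10, 10)))
y = sparse.csr_matrix(np.random.randint(0, 2, size=(10, 15)))

# Ask for CSR directly instead of converting afterwards.
xy = sparse.hstack([x, y], format="csr")
print(xy.shape)
# (10, 25)
print(xy.format)
# csr
```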

Python scipy.sparse: how to efficiently set a set of entries to 0?

Let a be a big scipy.sparse matrix and IJ={(i0,j0),(i1,j1),...} a set of positions. How can I efficiently set all the entries in a in positions IJ to 0? Something like a[IJ]=0.
In Mathematica, I would create a new sparse matrix b with background value 1 (instead of 0) and all entries in IJ. Then, I would use a=a*b (entry-wise multiplication). That does not seem to be an option here.
A toy example:
import scipy.sparse as sp
import numpy as np
np.set_printoptions(linewidth=200,edgeitems=5,precision=4)
m=n=10**1;
a=sp.random(m,n,4/m,format='csr'); print(a.toarray())
IJ=np.array([range(0,n,2),range(0,n,2)]); print(IJ) #every second diagonal
You are almost there. To go by your definitions, all you'd need to do is:
a[IJ[0],IJ[1]] = 0
Note that scipy will warn you:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
You can read more about that here.
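Assigning zeros this way can leave explicitly stored zeros in the matrix, so it may be worth calling `eliminate_zeros()` afterwards. The sketch below reuses the toy setup from the question with a fixed `random_state` (an addition made here for reproducibility) and silences the efficiency warning locally.

```python
import warnings
import numpy as np
import scipy.sparse as sp

n = 10
a = sp.random(n, n, density=0.4, format="csr", random_state=0)
IJ = np.array([range(0, n, 2), range(0, n, 2)])  # every second diagonal entry

with warnings.catch_warnings():
    # The assignment may change the sparsity structure and trigger the warning.
    warnings.simplefilter("ignore", sp.SparseEfficiencyWarning)
    a[IJ[0], IJ[1]] = 0

# Drop entries that are stored but equal to zero.
a.eliminate_zeros()
```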
The scipy sparse matrices can't have a non-zero background value. While it is possible to make a "sparse" matrix with lots of non-zero values, the performance (speed and memory) would be far worse than dense matrix multiplication.
A possible workaround is to rewrite every sparse matrix so that its default value is zero. For example, if matrix Y' contains mostly ones, I can replace Y' by I - Y, where Y = I - Y' is sparse and I is the all-ones matrix, which acts as the identity for elementwise multiplication.
import scipy.sparse as sp
import numpy as np
size = (100, 100)
x = np.random.uniform(-1, 1, size=size)
y = sp.random(*size, 0.001, format='csr')
# Elementwise: Z = (I - Y) * X = X - Y * X, with I the all-ones matrix
z = x - y.multiply(x)
# Elementwise multiplication commutes, so A = X * (I - Y) = X - X * Y is the same matrix
a = x - y.multiply(x)
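Applied to the original question, the same idea might look like the sketch below: build a sparse mask holding a 1 at every position to be cleared, then subtract the masked entries from the matrix. The names `mask` and `a_zeroed` are illustrative, and the positions are assumed to be unique, since duplicate coordinates would be summed when the mask is built.

```python
import numpy as np
import scipy.sparse as sp

n = 10
a = sp.random(n, n, density=0.4, format="csr", random_state=0)
rows = np.arange(0, n, 2)
cols = np.arange(0, n, 2)

# Sparse mask with ones exactly at the positions to clear.
mask = sp.csr_matrix((np.ones(rows.size), (rows, cols)), shape=a.shape)

# a * (1 - mask) elementwise, written as a - a * mask so everything stays sparse.
a_zeroed = a - a.multiply(mask)
a_zeroed.eliminate_zeros()
```

This never assigns into the CSR structure, so it does not trigger the SparseEfficiencyWarning.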

How to generate a Random Sparse Hermitian Matrix in python?

I would like to generate a Random Sparse Hermitian Matrix of a given shape in python. How can I do it efficiently? Is there any built-in python function for this task?
I have found a solution for the random sparse matrix, but I want the matrix to be Hermitian too. Here is the solution for the random sparse matrix that I found
import numpy as np
import scipy.stats as stats
import scipy.sparse as sparse
import matplotlib.pyplot as plt
np.random.seed((3,14159))
def sprandsym(n, density):
    rvs = stats.norm().rvs
    X = sparse.random(n, n, density=density, data_rvs=rvs)
    upper_X = sparse.triu(X)
    result = upper_X + upper_X.T - sparse.diags(X.diagonal())
    return result
M = sprandsym(5000, 0.01)
print(repr(M))
# <5000x5000 sparse matrix of type '<class 'numpy.float64'>'
# with 249909 stored elements in Compressed Sparse Row format>
# check that the matrix is symmetric. The difference should have no non-zero elements
assert (M - M.T).nnz == 0
statistic, pval = stats.kstest(M.data, 'norm')
# The null hypothesis is that M.data was drawn from a normal distribution.
# A small p-value (say, below 0.05) would indicate reason to reject the null hypothesis.
# Since `pval` below is > 0.05, kstest gives no reason to reject the hypothesis
# that M.data is normally distributed.
print(statistic, pval)
# 0.0015998040114 0.544538788914
fig, ax = plt.subplots(nrows=2)
ax[0].hist(M.data, density=True, bins=50)  # `normed` was removed in newer matplotlib; `density` replaces it
stats.probplot(M.data, dist='norm', plot=ax[1])
plt.show()
We know that a matrix plus its Hermitian conjugate (conjugate transpose) is a Hermitian matrix. So to ensure your final matrix B is Hermitian, just do
B = A + A.conj().T
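Put together, this could look like the following sketch. The helper name `sprandherm` and its `seed` parameter are hypothetical, and the complex entries are assembled from two real random sparse matrices, which is one choice among several. Note that the result is denser than the `density` argument suggests, because the real part, the imaginary part and the transpose all contribute entries.

```python
import numpy as np
import scipy.sparse as sparse

def sprandherm(n, density, seed=0):
    """Hypothetical helper: random sparse Hermitian n x n matrix."""
    rng = np.random.RandomState(seed)
    real = sparse.random(n, n, density=density, random_state=rng)
    imag = sparse.random(n, n, density=density, random_state=rng)
    A = real + 1j * imag
    # A plus its conjugate transpose is Hermitian by construction.
    return (A + A.conj().T).tocsr()

H = sprandherm(200, 0.01)

# The difference to the conjugate transpose should vanish.
residual = H - H.conj().T
print(abs(residual).max())
```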

SciPy/NumPy: Normalize a csr_matrix

I'm trying to normalize a csr_matrix:
<5400x6845 sparse matrix of type '<type 'numpy.float64'>' with 91833 stored elements in Compressed Sparse Row format>
What I tried was this:
import numpy as np
from scipy import sparse
# ve is my csr_matrix
ve_sum = ve.sum(axis=1)
ve_sums = sparse.csr_matrix(np.tile(ve_sum, (1, ve.shape[1]))) # <-- here I get MemoryError
n_ve = ve/ve_sums
This is obviously not the correct way of doing this kind of easy normalization.
What is the correct way?
# Normalize the rows of ve.
row_sums = np.array(ve.sum(axis=1))[:,0]
row_indices, col_indices = ve.nonzero()
ve.data /= row_sums[row_indices]
A quick google search reveals this also.
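An alternative that avoids touching `ve.data` directly is to multiply from the left by a diagonal matrix of inverse row sums. This is a sketch, with a small random matrix standing in for `ve` and with rows whose sum is zero left unscaled (an assumption about the desired behaviour):

```python
import numpy as np
from scipy import sparse

# Stand-in for the csr_matrix `ve` from the question.
ve = sparse.random(6, 8, density=0.3, format="csr", random_state=0)

row_sums = np.asarray(ve.sum(axis=1)).ravel()

# Scale factor per row; rows summing to zero keep a factor of 1.
scale = np.ones_like(row_sums)
nonzero_rows = row_sums != 0
scale[nonzero_rows] = 1.0 / row_sums[nonzero_rows]

# Left-multiplying by a diagonal matrix rescales each row and stays sparse.
n_ve = sparse.diags(scale) @ ve
```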
