Python: how to use Python to generate a random sparse symmetric matrix?

Python: how to use Python to generate a random sparse symmetric matrix? - python

How to use python to generate a random sparse symmetric matrix ?
In MATLAB, we have a function "sprandsym (size, density)"
But how to do that in Python?

If you have scipy, you could use sparse.random. The sprandsym function below generates a sparse random matrix X, takes its upper triangular half, and adds its transpose to itself to form a symmetric matrix. Since this doubles the diagonal values, the diagonals are subtracted once.
The non-zero values are normally distributed with mean 0 and standard deviation
of 1. The Kolomogorov-Smirnov test is used to check that the non-zero values is
consistent with a drawing from a normal distribution, and a histogram and
QQ-plot is generated too to visualize the distribution.
import numpy as np
import scipy.stats as stats
import scipy.sparse as sparse
import matplotlib.pyplot as plt
np.random.seed((3,14159))
def sprandsym(n, density):
rvs = stats.norm().rvs
X = sparse.random(n, n, density=density, data_rvs=rvs)
upper_X = sparse.triu(X)
result = upper_X + upper_X.T - sparse.diags(X.diagonal())
return result
M = sprandsym(5000, 0.01)
print(repr(M))
# <5000x5000 sparse matrix of type '<class 'numpy.float64'>'
# with 249909 stored elements in Compressed Sparse Row format>
# check that the matrix is symmetric. The difference should have no non-zero elements
assert (M - M.T).nnz == 0
statistic, pval = stats.kstest(M.data, 'norm')
# The null hypothesis is that M.data was drawn from a normal distribution.
# A small p-value (say, below 0.05) would indicate reason to reject the null hypothesis.
# Since `pval` below is > 0.05, kstest gives no reason to reject the hypothesis
# that M.data is normally distributed.
print(statistic, pval)
# 0.0015998040114 0.544538788914
fig, ax = plt.subplots(nrows=2)
ax[0].hist(M.data, normed=True, bins=50)
stats.probplot(M.data, dist='norm', plot=ax[1])
plt.show()
PS. I used
upper_X = sparse.triu(X)
result = upper_X + upper_X.T - sparse.diags(X.diagonal())
instead of
result = (X + X.T)/2.0
because I could not convince myself that the non-zero elements in (X + X.T)/2.0 have the right distribution. First, if X were dense and normally distributed with mean 0 and variance 1, i.e. N(0, 1), then (X + X.T)/2.0 would be N(0, 1/2). Certainly we could fix this by using
result = (X + X.T)/sqrt(2.0)
instead. Then result would be N(0, 1). But there is yet another problem: If X is sparse, then at nonzero locations, X + X.T would often be a normally distributed random variable plus zero. Dividing by sqrt(2.0) will squash the normal distribution closer to 0 giving you a more tightly spiked distribution. As X becomes sparser, this may be less and less like a normal distribution.
Since I didn't know what distribution (X + X.T)/sqrt(2.0) generates, I opted for copying the upper triangular half of X (thus repeating what I know to be normally distributed non-zero values).

The matrix needs to be symmetric too, which seems to be glossed over by the two answers here;
def sparseSym(rank, density=0.01, format='coo', dtype=None, random_state=None):
density = density / (2.0 - 1.0/rank)
A = scipy.sparse.rand(rank, rank, density=density, format=format, dtype=dtype, random_state=random_state)
return (A + A.transpose())/2
This will create a sparse matrix, and then adds it's transpose to itself to make it symmetric.
It takes into account the fact that the density will increase as you add the two together, and the fact that there is no additional increase in density from the diagonal terms.

unutbu's answer is the best one for performance and extensibility - numpy and scipy, together, have a lot of the functionality from matlab.
If you can't use them for whatever reason, or you're looking for a pure python solution, you could try
from random import randgauss, randint
sparse = [ [0 for i in range(N)] for j in range(N)]
# alternatively, if you have numpy but not scipy:
# sparse = numpy.zeros(N,N)
for _ in range(num_terms):
(i,j) = (randint(0,n),randint(0,n))
x = randgauss(0,1)
sparse[i][j] = x
sparse[j][i] = x
Although it might give you a little more control than unutbu's solution, you should expect it to be significantly slower; scipy is a dependency you probably don't want to avoid

Related

Pytorch: Efficiently compute unbiased estimator of mean to the power of four

Let w, x, y, z be torch tensors of shape (m, n) and we wish to compute the following unbiased estimator row-wise efficiently (without for loops), where I want to compute for every row 1, ..., m:
In case of only the unbiased estimator of the square of means, i.e., for :
this is possible, e.g., using torch.einsum:
batch_outer = torch.einsum('bi, bj -> bij', x, y)
zero_diag = 1-torch.eye(batch_outer.shape[1])
return (batch_outer * zero_diag).sum(dim=2).sum(dim=1) / (n * (n-1))
However, for the case to the power of four this is not so easy doable, mostly because these are not squared tensors and in particular, because the zeroing out of the diagonals becomes very tedious.
My questions:
1.) How can this be implemented efficiently ommitting any for loops?
2.) Which time and memory complexity would that solution have in big O notation?
3.) Can this solution also be used to do it with four 3D tensors of shape (m, k, n), where again we only want to do the computations along the axes of length n (dim=2)?
4.) If I want to do it in log-space for numerical stability, i.e., to use logsumexp for summations and sums for multiplications (because log(xy)= log(x)+log(y)), any solution with einsum wouldnt work anymore. How could that computation then be done in log space?

1 This implementation seems to work if I didn't make mess with the diagonal dimensions.
import numpy as np
import torch as th
x = np.array([1,4,5,3])
y = np.array([5,2,4,5])[np.newaxis]
z = np.array([5,7,4,5])[np.newaxis][np.newaxis]
w = np.array([3,9,5,1])[np.newaxis][np.newaxis][np.newaxis]
xth = th.Tensor(x)
yth = th.Tensor(y)
zth = th.Tensor(z)
wth = th.Tensor(w)
tensor = xth*th.transpose(yth, 0, 1)*th.transpose(zth,0,2)*th.transpose(wth,0,3)
diag = th.diagonal(tensor, dim1 = -2, dim2 = -1)
result = th.sum(tensor) - th.sum(diag)
result /= np.math.factorial(len(x))
print(result)
The order is between O(n^2.37..) - O(n^3), depending on the pytorch implementation of the matrix multiplication.
I don't see why not, just choose properly the dimensions to transpose and take the diagonal.
I don't see why would this solution won't work in a log-space.
pd: my knowledge in pytorch is quite limited, but I'm sure you can define x,y,z,w in a more elegant way.

How to remove discontinuities from complex angle of NumPy eigenvector components?

I am using NumPy's linalg.eig on square matrices. My square matrices are a function of a 2D domain, and I am looking at its eigenvectors' complex angles along a parameterized circle on this domain. As long as the path I am considering is smooth, I expect the complex angles of each eigenvector's components to be smooth. However, for some cases, this is not the case with Python (although it is with other programming languages). For the parameter M=0 (some argument in my matrix that appears on its diagonal), I have components that look like:
when they should ideally look like (M=0.1):
What I have tried:
I verified that the matrices are Hermitian in both cases.
When I use linalg.eigh, M=0.1 becomes discontinuous while M=0 sometimes becomes continuous.
Using np.unwrap did nothing.
The difference between component phases (i.e. np.angle(v1-v2) for eigenvector v=[[v1],[v2]]) is smooth/continuous, but this is not what I want.
Fixing the NumPy seed before solving did nothing for different values of the seed. For example: np.random.seed(1).
What else can I do? I am trying to use Sympy's eigenvects just because I am running out of options, and I asked another question asking about another potential approach here: How do I force first component of NumPy eigenvectors to be real? . But, I do not know what else I can try.
Here is a minimal working example that works nicely in a Jupyter notebook:
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt
M = 0.01; # nonzero M is okay
M = 0.0; # M=0 causes problems
def matrix_generator(kx,ky,M):
a = 2.46; t = 1; k = np.array((kx,ky));
d1 = (a/2)*np.array((1,np.sqrt(3)));d2 = (a/2)*np.array((1,-np.sqrt(3)));d3 = -a*np.array((1,0));
sx = np.matrix([[0,1],[1,0]]);sy = np.matrix([[0,-1j],[1j,0]]);sz = np.matrix([[1,0],[0,-1]]);
hx = np.cos(k#d1)+np.cos(k#d2)+np.cos(k#d3);hy = np.sin(k#d1)+np.sin(k#d2)+np.sin(k#d3);
return -t*(hx*sx - hy*sy + M*sz)
n_segs = 200; #number of segments in (kx,ky) loop
evecs_along_loop = np.zeros((n_segs,2,2),dtype=float)
# parameterize circular loop
kx0 = 0.5; ky0 = 1; r1=0.2; r2=0.2;
a = np.linspace(0.0, 2*np.pi, num=n_segs+2)
kloop=np.zeros((n_segs+2,2))
for i in range(n_segs+2):
kloop[i,:]=np.array([kx0 + r1*np.cos(a[i]), ky0 + r2*np.sin(a[i])])
# assign eigenvector complex angles
for j in np.arange(n_segs):
np.random.seed(2)
H = matrix_generator(kloop[j][0],kloop[j][1],M)
eval0, psi0 = LA.eig(H)
evecs_along_loop[j,:,:] = np.angle(psi0)
# plot eigenvector complex angles
for p in np.arange(2):
for q in np.arange(2):
print(f"Phase for eigenvector element {p},{q}:")
fig = plt.figure()
ax = plt.axes()
ax.plot((evecs_along_loop[:,p,q]))
plt.show()
Clarification for anon01's comment:
For M=0, a sample matrix at some value of (kx,ky) would look like:
a = np.matrix([[0.+0.j, 0.99286437+1.03026667j],
[0.99286437-1.03026667j, 0.+0.j]])
For M =/= 0, the diagonal will be non-zero (but real).

I think that in general this is a tough problem. The fundamental issue is that eigenvectors (unlike eigenvalues) are not unambiguously defined. An eigenvector v of M with eigenvalue c is any non-zero vector for which
M*v = c*v
In particular for any non zero scalar s, multiplying an eigenvector by s yields an eigenvector, and even if you demand (as usual) that eigenvectors have length 1, we are still free to multiply by any scalar of absolute value 1. Even worse, if v1,..vd are orthogonal eigenvectors for c, then any non-zero linear combination of the v's is also an eigenvector for c.
Different eigendecomposition routines might well, therefore, come up with very different eigenvectors and still be doing their job. Moreover some routines might produce eigenvectors that are far apart for matrices that are close together.
A simple tractable case is where you know that all your eigenvalues are non-degenerate (i.e. each eigenspace is of dimension 1) and you happen to know that for a particular i, the i'th component of each eigenvector will be non zero. Then you could multiply the eigenvector v by a scalar, of absolute value 1, chosen so that after the multiplication v[i] is a positive real number. In C
s = conj(v[i])/cabs(v[i])
where
conj(z) is the complex conjugate of the complex number z,
and cabs(z) is the absolute value of the complex number z
Note that the above supposes that we are using the same index for every eigenvector, though the factor s varies from eigenvector to eigenvector.
This would impose a uniqueness on the eigenvectors, and, one would hope, mean that they varied continuously with the parameters of your matrix.

Matrix inversion for matrix with large values in python

I'm doing matrix inversion in python, and I found it very weird that the result differs by the data scale.
In the code below, it is expected that A_inv/B_inv = B/A. However, it shows that the difference between A_inv/B_inv and B/A becomes larger and larger depend on the data scale... Is this because Python cannot compute matrix inverse precisely for matrix with large values?
Also, I checked the condition number for B, which is a constant ~3.016 no matter the scale is.
Thanks!!!
import numpy as np
from matplotlib import pyplot as plt
D = 30
N = 300
np.random.seed(10)
original_data = np.random.sample([D, N])
A = np.cov(original_data)
A_inv = np.linalg.inv(A)
B_cond = []
diff = []
for k in xrange(1,10):
B = A * np.power(10,k)
B_cond.append(np.linalg.cond(B))
B_inv = np.linalg.inv(B)
### Two measurements of difference are used
diff.append(np.log(np.linalg.norm(A_inv/B_inv - B/A)))
#diff.append(np.max(np.abs(A_inv/B_inv - B/A)))
# print B_cond
plt.figure()
plt.plot(xrange(1,10), diff)
plt.xlabel('data(B) / data(A)')
plt.ylabel('log(||A_inv/B_inv - B/A||)')
plt.savefig('Inversion for large matrix')

I may be wrong, but I think it comes from number representation in machine.
When you are dealing with great numbers, your inverse matrix is going to have very little number in magnitude (close to zero). And clsoe to zero, the representation of the floating number is not precise enough, I guess...
https://en.wikipedia.org/wiki/Floating-point_arithmetic

There is no reason that you should expect np.linalg.norm(A_inv/B_inv - B/A) to be equal to anything special. Instead, you can check the quality of the inverse calculation by multiplying the original matrix by its inverse and checking the determinant, np.linalg.det(A.dot(A_inv)), which should be equal to 1.

Pearson's correlation coefficient between all pairs of rows from two 2D arrays using scipy.stats.pearsonr vs. numpy.corrcoeff in python 3.5

I tried to calculate the Pearson's correlation coefficients between every pairs of rows from two 2D arrays. Then, sort the rows/columns of the correlation matrix based on its diagonal elements. First, the correlation coefficient matrix (i.e., 'ccmtx') was calculated from one random matrix (i.e., 'randmtx') in the following code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
def correlation_map(x, y):
n_row_x = x.shape[0]
n_row_y = x.shape[0]
ccmtx_xy = np.empty((n_row_x, n_row_y))
for n in range(n_row_x):
for m in range(n_row_y):
ccmtx_xy[n, m] = pearsonr(x[n, :], y[m, :])[0]
return ccmtx_xy
randmtx = np.random.randn(100, 1000) # generating random matrix
#ccmtx = np.corrcoef(randmtx, randmtx) # cc matrix based on numpy.corrcoef
ccmtx = correlation_map(randmtx, randmtx) # cc matrix based on scipy pearsonr
#
ccmtx_diag = np.diagonal(ccmtx)
#
ids, vals = np.argsort(ccmtx_diag, kind = 'mergesort'), np.sort(ccmtx_diag, kind = 'mergesort')
#ids, vals = np.argsort(ccmtx_diag, kind = 'quicksort'), np.sort(ccmtx_diag, kind = 'quicksort')
plt.plot(ids)
plt.show()
plt.plot(ccmtx_diag[ids])
plt.show()
vals[0]
The issue here is when the 'pearsonr' was used, the diagonal elements of 'ccmtx' are exactly 1.0 which makes sense. However, the 'corrcoef' was used, the diagonal elements of 'ccmtrix' are not exactly one (and slightly less than 1 for some diagonals) seemingly due to a precision error of floating point numbers.
I found to be annoying that the auto-correlation matrix of a single matrix have diagnoal elements not being 1.0 since this resulted in the shuffling of rows/columes of the correlation matrix when the matrix is sorted based on the diagonal elements.
My questions are:
[1] is there any good way to accelerate the computation time when I stick to use the 'pearsonr' function? (e.g., vectorized pearsonr?)
[2] Is there any good way/practice to prevent this precision error when using the 'corrcoef' in numpy? (e.g. 'decimals' option in np.around?)
I have searched the correlation coefficient calculations between all pairs of rows or columns from two matrices. However, as the algorithms containe some sort of "cov / variance" operation, this kind of precision issue seems always existing.
Minor point: the 'mergesort' option seems to provide reliable results than the 'quicksort' as the quicksort shuffled 1d array with exactly 1 to random order.
Any thoughts/comments would be greatly appreciated!

For question 1 vectorized pearsonr see the comments to the question.
I will answer only question 2: how to improve the precision of np.corrcoef.
The correlation matrix R is computed from the covariance matrix C according to
.
The implementation is optimized for performance and memory usage. It computes the covariance matrix, and then performs two divisions by sqrt(C_ii) and by sqrt(Cjj). This separate square-rooting is where the imprecision comes from. For example:
np.sqrt(3 * 3) - 3 == 0.0
np.sqrt(3) * np.sqrt(3) - 3 == -4.4408920985006262e-16
We can fix this by implementing our own simple corrcoef routine:
def corrcoef(a, b):
c = np.cov(a, b)
d = np.diag(c)
return c / np.sqrt(d[:, None] * d[None, :])
Note that this implementation requires more memory than the numpy implementation because it needs to store a temporary matrix with size n * n and it is slightly slower because it needs to do n^2 square roots instead of only 2 n.

Correlation coefficients for sparse matrix in python?

Does anyone know how to compute a correlation matrix from a very large sparse matrix in python? Basically, I am looking for something like numpy.corrcoef that will work on a scipy sparse matrix.

You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:
import numpy as np
from scipy import sparse
def sparse_corrcoef(A, B=None):
if B is not None:
A = sparse.vstack((A, B), format='csr')
A = A.astype(np.float64)
n = A.shape[1]
# Compute the covariance matrix
rowsum = A.sum(1)
centering = rowsum.dot(rowsum.T.conjugate()) / n
C = (A.dot(A.T.conjugate()) - centering) / (n - 1)
# The correlation coefficients are given by
# C_{i,j} / sqrt(C_{i} * C_{j})
d = np.diag(C)
coeffs = C / np.sqrt(np.outer(d, d))
return coeffs
Check that it works OK:
# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')
coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())
print(np.allclose(coeffs1, coeffs2))
# True
Be warned:
The amount of memory required for computing the covariance matrix C will be heavily dependent on the sparsity structure of A (and B, if given). For example, if A is an (m, n) matrix containing just a single column of non-zero values then C will be an (n, n) matrix containing all non-zero values. If n is large then this could be very bad news in terms of memory consumption.

You do not need to introduce a large dense matrix. Just keep it sparse using Numpy:
import numpy as np
def sparse_corr(A):
N = A.shape[0]
C=((A.T*A -(sum(A).T*sum(A)/N))/(N-1)).todense()
V=np.sqrt(np.mat(np.diag(C)).T*np.mat(np.diag(C)))
COR = np.divide(C,V+1e-119)
return COR
Testing the performance:
A = sparse.rand(1000000, 100, density=0.1, format='csr')
sparse_corr(A)

I present an answer for a scipy sparse matrix which runs in parallel. Rather than returning a giant correlation matrix, this returns a feature mask of fields to keep after checking all fields for both positive and negative Pearson correlations.
I also try to minimize calculations using the following strategy:
Process each column
Start at the current column + 1 and calculate correlations moving to the right.
For any abs(correlation) >= threshold, mark the current column for removal and calculate no further correlations.
Perform these steps for each column in the dataset except the last.
This might be sped up further by keeping a global list of columns marked for removal and skipping further correlation calculations for such columns, since columns will execute out of order. However, I do not know enough about race conditions in python to implement this tonight.
Returning a column mask will obviously allow the code to handle much larger datasets than returning the entire correlation matrix.
Check each column using this function:
def get_corr_row(idx_num, sp_mat, thresh):
# slice the column at idx_num
cols = sp_mat.shape[1]
x = sp_mat[:,idx_num].toarray().ravel()
start = idx_num + 1
# Now slice each column to the right of idx_num
for i in range(start, cols):
y = sp_mat[:,i].toarray().ravel()
# Check the pearson correlation
corr, pVal = pearsonr(x,y)
# Pearson ranges from -1 to 1.
# We check both positive and negative correlations >= thresh using abs(corr)
if abs(corr) >= thresh:
# stop checking after finding the 1st correlation > thresh
return False
# Mark column at idx_num for removal in the mask
return True
Run the column level correlation checks in parallel:
from joblib import Parallel, delayed
import multiprocessing
def Get_Corr_Mask(sp_mat, thresh, n_jobs=-1):
# we must make sure the matrix is in csc format
# before we start doing all these column slices!
sp_mat = sp_mat.tocsc()
cols = sp_mat.shape[1]
if n_jobs == -1:
# Process the work on all available CPU cores
num_cores = multiprocessing.cpu_count()
else:
# Process the work on the specified number of CPU cores
num_cores = n_jobs
# Return a mask of all columns to keep by calling get_corr_row()
# once for each column in the matrix
return Parallel(n_jobs=num_cores, verbose=5)(delayed(get_corr_row)(i, sp_mat, thresh)for i in range(cols))
General Usage:
#Get the mask using your sparse matrix and threshold.
corr_mask = Get_Corr_Mask(X_t_fpr, 0.95)
# Remove features that are >= 95% correlated
X_t_fpr_corr = X_t_fpr[:,corr_mask]

Unfortunately, Alt's answer didn't work out for me. The values given to the np.sqrt function where mostly negative, so the resulting covariance values were nan.
I wasn't able to use ali_m's answer as well, because my matrix was too large that I couldn't fit the centering = rowsum.dot(rowsum.T.conjugate()) / n matrix in my memory (My matrix's dimensions are: 3.5*10^6 x 33)
Instead, I used scikit-learn's StandardScaler to compute the standard sparse matrix and then used a multiplication to obtain the correlation matrix.
from sklearn.preprocessing import StandardScaler
def compute_sparse_correlation_matrix(A):
scaler = StandardScaler(with_mean=False)
scaled_A = scaler.fit_transform(A) # Assuming A is a CSR or CSC matrix
corr_matrix = (1/scaled_A.shape[0]) * (scaled_A.T # scaled_A)
return corr_matrix
I believe that this approach is faster and more robust than the other mentioned approaches. Moreover, it also preserves the sparsity pattern of the input matrix.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.