Correlation coefficients for sparse matrix in python?

Does anyone know how to compute a correlation matrix from a very large sparse matrix in python? Basically, I am looking for something like numpy.corrcoef that will work on a scipy sparse matrix.

You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:
import numpy as np
from scipy import sparse
def sparse_corrcoef(A, B=None):
    if B is not None:
        A = sparse.vstack((A, B), format='csr')
    A = A.astype(np.float64)
    n = A.shape[1]
    # Compute the covariance matrix
    rowsum = A.sum(1)
    centering = rowsum.dot(rowsum.T.conjugate()) / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)
    # The correlation coefficients are given by
    # C_{i,j} / sqrt(C_{i,i} * C_{j,j})
    d = np.diag(C)
    coeffs = C / np.sqrt(np.outer(d, d))
    return coeffs
Check that it works OK:
# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')
coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())
print(np.allclose(coeffs1, coeffs2))
# True
Be warned:
The amount of memory required for computing the covariance matrix C will be heavily dependent on the sparsity structure of A (and B, if given). For example, if A is an (m, n) matrix containing just a single column of non-zero values then C will be an (m, m) matrix containing all non-zero values. If m is large then this could be very bad news in terms of memory consumption.
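One way to bound that memory is to correlate only a block of rows against all rows, so that a single (block, m) dense array exists at a time. The following is a minimal sketch, assuming rows are variables and columns are observations as in sparse_corrcoef; the name corr_block and its arguments are hypothetical, and rows with zero variance are not handled.

```python
import numpy as np
from scipy import sparse

def corr_block(A, rows):
    """Correlate the rows A[rows] with every row of A.

    A is a sparse (m, n) matrix: m variables, n observations.
    rows is a slice or index array; the result is a dense (len(block), m) array.
    """
    A = sparse.csr_matrix(A, dtype=np.float64)
    n = A.shape[1]
    mean = np.asarray(A.mean(axis=1)).ravel()                 # E[x_i]
    mean_sq = np.asarray(A.multiply(A).mean(axis=1)).ravel()  # E[x_i^2]
    std = np.sqrt(mean_sq - mean ** 2)                        # population std of each row
    cross = (A[rows] @ A.T).toarray() / n                     # E[x_i x_j] for the block only
    cov = cross - np.outer(mean[rows], mean)
    # The 1/n versus 1/(n-1) normalisation cancels in the ratio
    return cov / np.outer(std[rows], std)

A = sparse.random(20, 500, density=0.1, format='csr', random_state=0)
block = corr_block(A, slice(0, 5))
```

Looping over blocks and keeping only the entries of interest (for example, those above a threshold) avoids ever holding the full (m, m) result.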

You do not need to introduce a large dense matrix. Just keep it sparse using Numpy:
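import numpy as np
from scipy import sparse
def sparse_corr(A):
    N = A.shape[0]
    # Covariance of the columns, kept sparse until the final todense()
    C = ((A.T*A - (sum(A).T*sum(A)/N))/(N-1)).todense()
    # V[i, j] = sqrt(C[i, i] * C[j, j])
    d = np.diag(C)
    V = np.sqrt(np.outer(d, d))
    # The tiny constant guards against division by zero
    COR = np.divide(C, V + 1e-119)
    return COR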
Testing the performance:
A = sparse.rand(1000000, 100, density=0.1, format='csr')
sparse_corr(A)

I present an answer for a scipy sparse matrix which runs in parallel. Rather than returning a giant correlation matrix, this returns a feature mask of fields to keep after checking all fields for both positive and negative Pearson correlations.
I also try to minimize calculations using the following strategy:
Process each column
Start at the current column + 1 and calculate correlations moving to the right.
For any abs(correlation) >= threshold, mark the current column for removal and calculate no further correlations.
Perform these steps for each column in the dataset except the last.
This might be sped up further by keeping a global list of columns marked for removal and skipping further correlation calculations for such columns, since columns will execute out of order. However, I do not know enough about race conditions in python to implement this tonight.
Returning a column mask will obviously allow the code to handle much larger datasets than returning the entire correlation matrix.
Check each column using this function:
from scipy.stats import pearsonr
def get_corr_row(idx_num, sp_mat, thresh):
    # slice the column at idx_num
    cols = sp_mat.shape[1]
    x = sp_mat[:, idx_num].toarray().ravel()
    start = idx_num + 1
    # Now slice each column to the right of idx_num
    for i in range(start, cols):
        y = sp_mat[:, i].toarray().ravel()
        # Check the pearson correlation
        corr, pVal = pearsonr(x, y)
        # Pearson ranges from -1 to 1.
        # We check both positive and negative correlations >= thresh using abs(corr)
        if abs(corr) >= thresh:
            # stop checking after finding the 1st correlation >= thresh
            # and mark the column at idx_num for removal in the mask
            return False
    # No correlated column was found: keep the column at idx_num
    return True
Run the column level correlation checks in parallel:
from joblib import Parallel, delayed
import multiprocessing
def Get_Corr_Mask(sp_mat, thresh, n_jobs=-1):
    # we must make sure the matrix is in csc format
    # before we start doing all these column slices!
    sp_mat = sp_mat.tocsc()
    cols = sp_mat.shape[1]
    if n_jobs == -1:
        # Process the work on all available CPU cores
        num_cores = multiprocessing.cpu_count()
    else:
        # Process the work on the specified number of CPU cores
        num_cores = n_jobs
    # Return a mask of all columns to keep by calling get_corr_row()
    # once for each column in the matrix
    return Parallel(n_jobs=num_cores, verbose=5)(
        delayed(get_corr_row)(i, sp_mat, thresh) for i in range(cols))
General Usage:
#Get the mask using your sparse matrix and threshold.
corr_mask = Get_Corr_Mask(X_t_fpr, 0.95)
# Remove features that are >= 95% correlated
X_t_fpr_corr = X_t_fpr[:,corr_mask]
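When the number of columns is small enough that a dense correlation matrix does fit in memory, the same keep/drop rule can be read off its upper triangle. This is only a sketch for comparison, not part of the parallel code above; the name corr_mask_from_matrix is hypothetical.

```python
import numpy as np

def corr_mask_from_matrix(corr, thresh):
    """Keep column i unless some column j > i has |corr[i, j]| >= thresh."""
    # Only entries strictly above the diagonal are "columns to the right"
    too_close = np.triu(np.abs(corr) >= thresh, k=1)
    return ~too_close.any(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = -X[:, 1]                      # column 3 is perfectly anti-correlated with column 1
corr = np.corrcoef(X, rowvar=False)
mask = corr_mask_from_matrix(corr, 0.95)
```

As in get_corr_row, the earlier column of a correlated pair is the one dropped, and the last column is always kept.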

Unfortunately, Alt's answer didn't work out for me. The values passed to the np.sqrt function were mostly negative, so the resulting correlation values were nan.
I wasn't able to use ali_m's answer either, because my matrix was so large that the centering = rowsum.dot(rowsum.T.conjugate()) / n matrix did not fit in memory (my matrix's dimensions are 3.5*10^6 x 33).
Instead, I used scikit-learn's StandardScaler to compute the standardized sparse matrix and then used a multiplication to obtain the correlation matrix.
from sklearn.preprocessing import StandardScaler
def compute_sparse_correlation_matrix(A):
    scaler = StandardScaler(with_mean=False)
    scaled_A = scaler.fit_transform(A)  # Assuming A is a CSR or CSC matrix
    corr_matrix = (1/scaled_A.shape[0]) * (scaled_A.T @ scaled_A)
    return corr_matrix
I believe that this approach is faster and more robust than the other mentioned approaches. Moreover, it also preserves the sparsity pattern of the input matrix.
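One caveat: StandardScaler(with_mean=False) divides by the standard deviation but does not subtract the mean, so the product above equals the Pearson correlation only when the column means are zero. A minimal sketch of applying the mean correction after the sparse product, assuming the number of columns is small enough for a dense (n_features, n_features) result (as with 33 columns); the name sparse_corr_columns is hypothetical.

```python
import numpy as np
from scipy import sparse

def sparse_corr_columns(A):
    """Pearson correlation between the columns of a sparse (n_samples, n_features) matrix."""
    A = sparse.csr_matrix(A, dtype=np.float64)
    n = A.shape[0]
    mean = np.asarray(A.mean(axis=0)).ravel()   # column means
    second = (A.T @ A).toarray() / n            # E[x_i x_j], computed without densifying A
    cov = second - np.outer(mean, mean)         # population covariance
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

A = sparse.random(1000, 6, density=0.3, format='csr', random_state=0)
C = sparse_corr_columns(A)
```

Only the small feature-by-feature matrix is ever dense; the input itself is never centered, so its sparsity is untouched.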

Related

Pytorch: Efficiently compute unbiased estimator of mean to the power of four

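Let w, x, y, z be torch tensors of shape (m, n). We wish to compute the following unbiased estimator row-wise and efficiently (without for loops), i.e., for every row r = 1, ..., m:
$$\frac{1}{n(n-1)(n-2)(n-3)} \sum_{i, j, k, l \ \text{pairwise distinct}} w_{r,i}\, x_{r,j}\, y_{r,k}\, z_{r,l}$$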
In the case of only the unbiased estimator of the square of the mean, i.e., for
$$\frac{1}{n(n-1)} \sum_{i \neq j} x_{r,i}\, y_{r,j},$$
this is possible, e.g., using torch.einsum:
import torch
def unbiased_mean_product(x, y):
    n = x.shape[1]
    batch_outer = torch.einsum('bi, bj -> bij', x, y)
    zero_diag = 1 - torch.eye(n)
    return (batch_outer * zero_diag).sum(dim=2).sum(dim=1) / (n * (n - 1))
However, for the power-of-four case this is not so easily doable, mostly because these are not squared tensors and, in particular, because zeroing out the diagonals becomes very tedious.
My questions:
1.) How can this be implemented efficiently, omitting any for loops?
2.) Which time and memory complexity would that solution have in big O notation?
3.) Can this solution also be used to do it with four 3D tensors of shape (m, k, n), where again we only want to do the computations along the axes of length n (dim=2)?
4.) If I want to do it in log-space for numerical stability, i.e., to use logsumexp for summations and sums for multiplications (because log(xy) = log(x) + log(y)), any solution with einsum wouldn't work anymore. How could that computation then be done in log space?
1.) This implementation seems to work, provided I didn't mix up the diagonal dimensions.
import math
import numpy as np
import torch as th
x = np.array([1,4,5,3])
y = np.array([5,2,4,5])[np.newaxis]
z = np.array([5,7,4,5])[np.newaxis][np.newaxis]
w = np.array([3,9,5,1])[np.newaxis][np.newaxis][np.newaxis]
xth = th.Tensor(x)
yth = th.Tensor(y)
zth = th.Tensor(z)
wth = th.Tensor(w)
# Broadcast to a 4D tensor holding every product of one entry from each vector
tensor = xth*th.transpose(yth, 0, 1)*th.transpose(zth,0,2)*th.transpose(wth,0,3)
diag = th.diagonal(tensor, dim1 = -2, dim2 = -1)
result = th.sum(tensor) - th.sum(diag)
# np.math was removed in NumPy 2.0; use the standard library instead
result /= math.factorial(len(x))
print(result)
2.) The order is between O(n^2.37..) and O(n^3), depending on the pytorch implementation of the matrix multiplication.
3.) I don't see why not; just choose the dimensions to transpose properly and take the diagonal.
4.) I don't see why this solution wouldn't work in log-space.
PS: my knowledge of pytorch is quite limited, but I'm sure you can define x, y, z, w in a more elegant way.
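A possible loop-free alternative, sketched under the assumption that the estimator in the question is the sum of w_i x_j y_k z_l over pairwise-distinct indices divided by n(n-1)(n-2)(n-3). Note that a single diagonal over the last two dimensions covers only one of the six ways two of the four indices can coincide. Inclusion-exclusion over set partitions handles all coincidences using only elementwise products and sums along the last axis, so time and memory are O(m n) for a fixed number of tensors, and arrays of shape (m, k, n) work unchanged. The function names are hypothetical, and the sketch uses NumPy so that it can be checked against a brute-force loop; the same products and sums are available on torch tensors.

```python
import math
from itertools import permutations
import numpy as np

def set_partitions(items):
    """Yield every partition of the list `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        yield [[first]] + part                      # `first` in a block of its own
        for i in range(len(part)):                  # or added to an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def distinct_index_mean(*arrays):
    """Mean of a_1[..., i1] * ... * a_k[..., ik] over pairwise-distinct indices."""
    k, n = len(arrays), arrays[0].shape[-1]
    total = 0.0
    for part in set_partitions(list(range(k))):
        term = 1.0
        for block in part:
            # Power sum of the elementwise product of the arrays in this block
            power_sum = np.prod([arrays[a] for a in block], axis=0).sum(axis=-1)
            # Moebius weight of a block of size s: (-1)^(s-1) * (s-1)!
            weight = (-1) ** (len(block) - 1) * math.factorial(len(block) - 1)
            term = term * weight * power_sum
        total = total + term
    return total / math.perm(n, k)

def brute_force(*arrays):
    """Reference implementation with explicit loops, for small inputs only."""
    k, n = len(arrays), arrays[0].shape[-1]
    out = []
    for r in range(arrays[0].shape[0]):
        s = sum(math.prod(a[r, i] for a, i in zip(arrays, idx))
                for idx in permutations(range(n), k))
        out.append(s / math.perm(n, k))
    return np.array(out)

rng = np.random.default_rng(0)
w, x, y, z = rng.normal(size=(4, 2, 5))
fast = distinct_index_mean(w, x, y, z)
slow = brute_force(w, x, y, z)
```

For two arrays the same function reduces to (sum(x) * sum(y) - sum(x * y)) / (n * (n - 1)), which is what the einsum version in the question computes.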

numpy covariance between each column of a matrix and a vector

Based on this post, I can get the covariance between two vectors using np.cov((x,y), rowvar=0). I have an MxN matrix and an Mx1 vector. I want to find the covariance between each column of the matrix and the given vector. I know that I can write a for loop for this, but I was wondering if I can somehow use np.cov() to get the result directly.
As Warren Weckesser said, the numpy.cov(X, Y) is a poor fit for the job because it will simply join the arrays in one M by (N+1) array and find the huge (N+1) by (N+1) covariance matrix. But we'll always have the definition of covariance and it's easy to use:
A = np.sqrt(np.arange(12).reshape(3, 4)) # some 3 by 4 array
b = np.array([[2], [4], [5]]) # some 3 by 1 vector
cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0]-1)
This returns the covariances of each column of A with b.
array([[ 2.21895142, 1.53934466, 1.3379221 , 1.20866607]])
The formula I used is for sample covariance (which is what numpy.cov computes, too), hence the division by (b.shape[0] - 1). If you divide by b.shape[0] you get the unadjusted population covariance.
For comparison, the same computation using np.cov:
import numpy as np
A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])
np.cov(A, b, rowvar=False)[-1, :-1]
Same output, but it takes about twice as long (and for large matrices, the difference will be much larger). The slicing at the end is because np.cov computes a 5 by 5 matrix, in which only the first 4 entries of the last row are what you wanted. The rest is the covariance of A with itself, or of b with itself.
Correlation coefficient
The correlation coefficient is obtained by dividing by the square roots of the variances. Watch out for the -1 adjustment mentioned earlier: numpy.var does not make it by default; to make it happen you need the ddof=1 parameter.
corr = cov / np.sqrt(np.var(b, ddof=1) * np.var(A, axis=0, ddof=1))
Check that the output is the same as the less efficient version
np.corrcoef(A, b, rowvar=False)[-1, :-1]

Pearson's correlation coefficient between all pairs of rows from two 2D arrays using scipy.stats.pearsonr vs. numpy.corrcoef in python 3.5

I tried to calculate the Pearson's correlation coefficients between every pair of rows from two 2D arrays, and then sort the rows/columns of the correlation matrix based on its diagonal elements. First, the correlation coefficient matrix (i.e., 'ccmtx') was calculated from one random matrix (i.e., 'randmtx') in the following code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
def correlation_map(x, y):
    n_row_x = x.shape[0]
    n_row_y = y.shape[0]
    ccmtx_xy = np.empty((n_row_x, n_row_y))
    for n in range(n_row_x):
        for m in range(n_row_y):
            ccmtx_xy[n, m] = pearsonr(x[n, :], y[m, :])[0]
    return ccmtx_xy
randmtx = np.random.randn(100, 1000) # generating random matrix
#ccmtx = np.corrcoef(randmtx, randmtx) # cc matrix based on numpy.corrcoef
ccmtx = correlation_map(randmtx, randmtx) # cc matrix based on scipy pearsonr
#
ccmtx_diag = np.diagonal(ccmtx)
#
ids, vals = np.argsort(ccmtx_diag, kind = 'mergesort'), np.sort(ccmtx_diag, kind = 'mergesort')
#ids, vals = np.argsort(ccmtx_diag, kind = 'quicksort'), np.sort(ccmtx_diag, kind = 'quicksort')
plt.plot(ids)
plt.show()
plt.plot(ccmtx_diag[ids])
plt.show()
vals[0]
The issue here is that when 'pearsonr' was used, the diagonal elements of 'ccmtx' are exactly 1.0, which makes sense. However, when 'corrcoef' was used, the diagonal elements of 'ccmtx' are not exactly one (some are slightly less than 1), seemingly due to floating point precision error.
I found it annoying that the auto-correlation matrix of a single matrix has diagonal elements that are not 1.0, since this resulted in shuffling of the rows/columns of the correlation matrix when the matrix is sorted based on the diagonal elements.
My questions are:
[1] is there any good way to accelerate the computation time when I stick to use the 'pearsonr' function? (e.g., vectorized pearsonr?)
[2] Is there any good way/practice to prevent this precision error when using the 'corrcoef' in numpy? (e.g. 'decimals' option in np.around?)
I have searched for ways to calculate correlation coefficients between all pairs of rows or columns from two matrices. However, as the algorithms contain some sort of "cov / variance" operation, this kind of precision issue always seems to be present.
Minor point: the 'mergesort' option seems to provide more reliable results than 'quicksort', as quicksort shuffled a 1d array of values exactly equal to 1 into random order.
Any thoughts/comments would be greatly appreciated!
For question 1 vectorized pearsonr see the comments to the question.
I will answer only question 2: how to improve the precision of np.corrcoef.
The correlation matrix R is computed from the covariance matrix C according to
R_ij = C_ij / sqrt(C_ii * C_jj).
The implementation is optimized for performance and memory usage. It computes the covariance matrix, and then performs two divisions, by sqrt(C_ii) and by sqrt(C_jj). This separate square-rooting is where the imprecision comes from. For example:
np.sqrt(3 * 3) - 3 == 0.0
np.sqrt(3) * np.sqrt(3) - 3 == -4.4408920985006262e-16
We can fix this by implementing our own simple corrcoef routine:
def corrcoef(a, b):
    c = np.cov(a, b)
    d = np.diag(c)
    return c / np.sqrt(d[:, None] * d[None, :])
Note that this implementation requires more memory than the numpy implementation because it needs to store a temporary matrix with size n * n and it is slightly slower because it needs to do n^2 square roots instead of only 2 n.
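For question 1, one common way to vectorize the pairwise computation is to center and normalize each row once and then take a single matrix product. This is a minimal sketch, not the approach from the comments referred to above; the name rowwise_pearson is hypothetical.

```python
import numpy as np

def rowwise_pearson(x, y):
    """Pearson correlation between every row of x and every row of y."""
    # Center each row, then scale it to unit Euclidean length
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    xc = xc / np.linalg.norm(xc, axis=1, keepdims=True)
    yc = yc / np.linalg.norm(yc, axis=1, keepdims=True)
    # Dot products of unit vectors are the correlations; clip rounding overshoot
    return np.clip(xc @ yc.T, -1.0, 1.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 100))
y = rng.normal(size=(3, 100))
cc = rowwise_pearson(x, y)
```

The diagonal of rowwise_pearson(x, x) is still the result of floating point arithmetic, so it may differ from 1.0 in the last bits. If exact ones matter for sorting a self-correlation matrix, setting them explicitly with np.fill_diagonal is one option.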

Optimize Scipy Sparse Matrix

I have a sparse matrix where I'm currently enumerating over each row and performing some calculations based on the information from each row. Each row is completely independent of the others. However, for large matrices, this code is extremely slow (takes about 2 hours) and I can't convert the matrix to a dense one either (limited to 8GB RAM).
import scipy.sparse
import numpy as np
def process_row(a, b):
    """
    a - contains the row indices for a sparse matrix
    b - contains the column indices for a sparse matrix
    Returns a new vector of length(a)
    """
    return
def assess(mat):
    """
    """
    mat_csr = mat.tocsr()
    nrows, ncols = mat_csr.shape
    a = np.arange(ncols, dtype=np.int32)
    b = np.empty(ncols, dtype=np.int32)
    result = []
    for i, row in enumerate(mat_csr):
        # Process one row at a time
        b.fill(i)
        result.append(process_row(b, a))
    return result
if __name__ == '__main__':
    row = np.array([8,2,7,4])
    col = np.array([1,3,2,1])
    data = np.array([1,1,1,1])
    mat = scipy.sparse.coo_matrix((data, (row, col)))
    print(assess(mat))
I am looking to see if there's any way to design this better so that it performs much faster. Essentially, the process_row function takes (row, col) index pairs (from a, b) and does some math using another sparse matrix and returns a result. I don't have the option to change this function but it can actually process different row/col pairs and is not restricted to processing everything from the same row.
Your problem looks similar to this other recent SO question:
Calculate the euclidean distance in scipy csr matrix
In my answer I sketched a way of iterating over the rows of a sparse matrix. I think it is faster to convert the array to lil, and construct the dense rows directly from its sublists. This avoids the overhead of creating a new sparse matrix for each row. But I haven't done time tests.
https://stackoverflow.com/a/36559702/901925
Maybe this applies to your case.
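A related option, different from the lil approach described above, is to read each row's stored column indices and values straight from the CSR arrays indptr, indices and data, which also avoids building a new sparse matrix per row. A minimal sketch; the generator name iter_csr_rows is hypothetical.

```python
import numpy as np
from scipy import sparse

def iter_csr_rows(mat):
    """Yield (row_index, column_indices, values) for each row of a sparse matrix."""
    csr = mat.tocsr()
    for i in range(csr.shape[0]):
        # Row i occupies the half-open range indptr[i]:indptr[i + 1]
        start, end = csr.indptr[i], csr.indptr[i + 1]
        yield i, csr.indices[start:end], csr.data[start:end]

row = np.array([8, 2, 7, 4])
col = np.array([1, 3, 2, 1])
data = np.array([1, 1, 1, 1])
mat = sparse.coo_matrix((data, (row, col)))
# Collect the stored column indices of every non-empty row
rows = {i: cols.tolist() for i, cols, vals in iter_csr_rows(mat) if len(cols)}
```

The slices are views into the CSR arrays, so iterating this way copies no data.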

Python: how to use Python to generate a random sparse symmetric matrix?

How to use python to generate a random sparse symmetric matrix ?
In MATLAB, we have a function "sprandsym (size, density)"
But how to do that in Python?
If you have scipy, you could use sparse.random. The sprandsym function below generates a sparse random matrix X, takes its upper triangular half, and adds its transpose to itself to form a symmetric matrix. Since this doubles the diagonal values, the diagonals are subtracted once.
The non-zero values are normally distributed with mean 0 and standard deviation of 1. The Kolmogorov-Smirnov test is used to check that the non-zero values are consistent with a drawing from a normal distribution, and a histogram and QQ-plot are generated too to visualize the distribution.
import numpy as np
import scipy.stats as stats
import scipy.sparse as sparse
import matplotlib.pyplot as plt
np.random.seed((3,14159))
def sprandsym(n, density):
    rvs = stats.norm().rvs
    X = sparse.random(n, n, density=density, data_rvs=rvs)
    upper_X = sparse.triu(X)
    result = upper_X + upper_X.T - sparse.diags(X.diagonal())
    return result
M = sprandsym(5000, 0.01)
print(repr(M))
# <5000x5000 sparse matrix of type '<class 'numpy.float64'>'
# with 249909 stored elements in Compressed Sparse Row format>
# check that the matrix is symmetric. The difference should have no non-zero elements
assert (M - M.T).nnz == 0
statistic, pval = stats.kstest(M.data, 'norm')
# The null hypothesis is that M.data was drawn from a normal distribution.
# A small p-value (say, below 0.05) would indicate reason to reject the null hypothesis.
# Since `pval` below is > 0.05, kstest gives no reason to reject the hypothesis
# that M.data is normally distributed.
print(statistic, pval)
# 0.0015998040114 0.544538788914
fig, ax = plt.subplots(nrows=2)
ax[0].hist(M.data, density=True, bins=50)
stats.probplot(M.data, dist='norm', plot=ax[1])
plt.show()
PS. I used
upper_X = sparse.triu(X)
result = upper_X + upper_X.T - sparse.diags(X.diagonal())
instead of
result = (X + X.T)/2.0
because I could not convince myself that the non-zero elements in (X + X.T)/2.0 have the right distribution. First, if X were dense and normally distributed with mean 0 and variance 1, i.e. N(0, 1), then (X + X.T)/2.0 would be N(0, 1/2). Certainly we could fix this by using
result = (X + X.T)/sqrt(2.0)
instead. Then result would be N(0, 1). But there is yet another problem: If X is sparse, then at nonzero locations, X + X.T would often be a normally distributed random variable plus zero. Dividing by sqrt(2.0) will squash the normal distribution closer to 0 giving you a more tightly spiked distribution. As X becomes sparser, this may be less and less like a normal distribution.
Since I didn't know what distribution (X + X.T)/sqrt(2.0) generates, I opted for copying the upper triangular half of X (thus repeating what I know to be normally distributed non-zero values).
The matrix needs to be symmetric too, which seems to be glossed over by the two answers here;
import scipy.sparse
def sparseSym(rank, density=0.01, format='coo', dtype=None, random_state=None):
    density = density / (2.0 - 1.0/rank)
    A = scipy.sparse.rand(rank, rank, density=density, format=format, dtype=dtype, random_state=random_state)
    return (A + A.transpose())/2
This will create a sparse matrix, and then add its transpose to itself to make it symmetric.
It takes into account the fact that the density will increase as you add the two together, and the fact that there is no additional increase in density from the diagonal terms.
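The density correction can be derived as follows, assuming the non-zeros of A and of its transpose rarely land on the same off-diagonal position (a reasonable approximation at low density). Here n stands for rank, d for the density passed to scipy.sparse.rand, and d_target for the density requested by the caller; these symbols are introduced only for this sketch.

```latex
\begin{aligned}
\mathrm{nnz}(A + A^{T}) &\approx \underbrace{2\,d\,(n^{2} - n)}_{\text{off-diagonal}} + \underbrace{d\,n}_{\text{diagonal}} = d\,n\,(2n - 1),\\
d_{\text{target}} = \frac{\mathrm{nnz}(A + A^{T})}{n^{2}} &\approx d\left(2 - \frac{1}{n}\right)
\quad\Longrightarrow\quad
d = \frac{d_{\text{target}}}{2 - 1/n}.
\end{aligned}
```

The last expression is the rescaling applied on the first line of sparseSym.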
unutbu's answer is the best one for performance and extensibility - numpy and scipy, together, have a lot of the functionality from matlab.
If you can't use them for whatever reason, or you're looking for a pure python solution, you could try
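from random import gauss, randrange
N = 100          # matrix size (example value, not from the question)
num_terms = 50   # number of random entries to place (example value)
sparse = [[0 for i in range(N)] for j in range(N)]
# alternatively, if you have numpy but not scipy:
# sparse = numpy.zeros((N, N))
for _ in range(num_terms):
    # randrange(N) stays within the valid indices 0..N-1
    i, j = randrange(N), randrange(N)
    x = gauss(0, 1)
    sparse[i][j] = x
    sparse[j][i] = x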
Although it might give you a little more control than unutbu's solution, you should expect it to be significantly slower; scipy is a dependency you probably don't want to avoid.
