Correlation matrix in NumPy with NaNs - python

I have an n x m matrix in which row i represents the time series of the variable V_i. I would like to compute the n x n correlation matrix M, where M_{i,j} contains the correlation coefficient (Pearson's r) between V_i and V_j.
However, when I try the following in numpy:
numpy.corrcoef(numpy.matrix('5 6 7; 1 1 1'))
I get the following output:
array([[  1.,  nan],
       [ nan,  nan]])
It seems that numpy.corrcoef doesn't like constant vectors, because if I change the second row to 7 6 5, I get the expected result:
array([[ 1., -1.],
       [-1.,  1.]])
What is the reason for this kind of behavior of numpy.corrcoef?

leewangzhong (in the comments) is correct: Pearson's r is not defined for a constant time series, because its standard deviation is zero. Thanks!
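For illustration, here is a minimal sketch of why the nan appears (my own example, not from the thread): Pearson's r divides the covariance by the product of the standard deviations, and a constant series has standard deviation zero, so the division is 0/0.

import numpy as np

x = np.array([5., 6., 7.])
y = np.array([1., 1., 1.])   # constant series

# r = cov(x, y) / (std(x) * std(y)); y.std() == 0 makes the denominator zero
cov = np.mean((x - x.mean()) * (y - y.mean()))
denom = x.std() * y.std()
print(cov / denom)           # nan, with a RuntimeWarning about invalid division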

Outer minimum vectorization in numpy follow-up

This is a follow-up to my previous question.
Given an NxM matrix A, I want to efficiently obtain the NxN matrix whose ith row is the sum along the 2nd axis of the result of applying np.minimum between A and the ith row of A.
Using a for loop,
> A = np.array([[1, 2], [3, 4], [5, 6]])
> output = np.zeros(shape=(A.shape[0], A.shape[0]))
> for i in range(A.shape[0]):
>     output[i] = np.sum(np.minimum(A, A[i]), axis=1)
> output
array([[ 3.,  3.,  3.],
       [ 3.,  7.,  7.],
       [ 3.,  7., 11.]])
Is it possible to optimize this further without the for loop?
Edit: I would also like to do it without allocating an NxNxM tensor because of memory constraints.
You can use broadcasting instead of a for loop. Using the NumPy minimum and sum functions, you can compute the desired matrix output as follows:
output = np.sum(np.minimum(A[:, None], A), axis=2)
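Note that the broadcasted np.minimum(A[:, None], A) materializes the full NxNxM intermediate, which the edit rules out. A possible compromise, sketched below under the assumption that a block of rows at a time fits in memory, is to chunk the first axis so the peak intermediate is only chunk_size x N x M (chunk_size is a made-up tuning knob):

import numpy as np

def min_sum_chunked(A, chunk_size=64):
    # Same result as np.sum(np.minimum(A[:, None], A), axis=2), but the
    # broadcasted intermediate is only (chunk_size, N, M) at any time.
    N = A.shape[0]
    output = np.empty((N, N), dtype=np.result_type(A.dtype, float))
    for start in range(0, N, chunk_size):
        chunk = A[start:start + chunk_size]        # (c, M)
        pairwise = np.minimum(chunk[:, None], A)   # (c, N, M)
        output[start:start + chunk_size] = pairwise.sum(axis=2)
    return output

A = np.array([[1, 2], [3, 4], [5, 6]])
print(min_sum_chunked(A, chunk_size=2))
# [[ 3.  3.  3.]
#  [ 3.  7.  7.]
#  [ 3.  7. 11.]]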

scipy `SparseEfficiencyWarning` when dividing rows of a csr_matrix

Suppose I already had a csr_matrix:
import numpy as np
from scipy.sparse import csr_matrix
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1., 2., 3., 4., 5., 6.])
mat = csr_matrix((data, indices, indptr), shape=(3, 3))
print(mat.A)
[[1. 0. 2.]
 [0. 0. 3.]
 [4. 5. 6.]]
It's simple if I want to divide a single row of this csr_matrix:
mat[0] /= 2
print(mat.A)
[[0.5 0.  1. ]
 [0.  0.  3. ]
 [4.  5.  6. ]]
However, if I want to change multiple rows, it throws a warning:
mat[np.array([0,1])]/=np.array([[1],[2]])
print(mat.A)
[[1.  0.  2. ]
 [0.  0.  1.5]
 [4.  5.  6. ]]
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_arrayXarray(i, j, x)
How come division on multiple rows changes the sparsity? The warning suggests switching to lil_matrix, but when I checked the code of tolil():
def tolil(self, copy=False):
    lil = self._lil_container(self.shape, dtype=self.dtype)
    self.sum_duplicates()
    ptr, ind, dat = self.indptr, self.indices, self.data
    rows, data = lil.rows, lil.data
    for n in range(self.shape[0]):
        start = ptr[n]
        end = ptr[n + 1]
        rows[n] = ind[start:end].tolist()
        data[n] = dat[start:end].tolist()
    return lil
which basically loops over all rows; I don't think that's necessary in my case. What is the correct way if I simply want to divide a few rows of a csr_matrix? Thanks!
Your matrix:
In [208]: indptr = np.array([0, 2, 3, 6])
...: indices = np.array([0, 2, 2, 0, 1, 2])
...: data = np.array([1., 2., 3., 4., 5., 6.])
...: mat = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [209]: mat
Out[209]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [210]: mat.A
Out[210]:
array([[1., 0., 2.],
       [0., 0., 3.],
       [4., 5., 6.]])
Simple division just changes the mat.data values, in-place:
In [211]: mat /= 3
In [212]: mat
Out[212]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [213]: mat *= 3
In [214]: mat.A
Out[214]:
array([[1., 0., 2.],
       [0., 0., 3.],
       [4., 5., 6.]])
The RHS of your case produces an np.matrix object:
In [215]: mat[np.array([0,1])]/np.array([[1],[2]])
Out[215]:
matrix([[1. , 0. , 2. ],
        [0. , 0. , 1.5]])
Assigning that to the mat subset produces the warning:
In [216]: mat[np.array([0,1])] = _
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_arrayXarray(i, j, x)
Your warning and mine occur in the set step:
self._set_arrayXarray(i, j, x)
If I divide again I don't get the warning:
In [217]: mat[np.array([0,1])]/=np.array([[1],[2]])
In [218]: mat
Out[218]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 9 stored elements in Compressed Sparse Row format>
Why? Because after the first assignment, mat has 9 nonzero terms, not the original 6, so [217] doesn't change the sparsity.
Strip the explicit zeros so mat is back to 6 stored elements, and we get the warning again:
In [219]: mat.eliminate_zeros()
In [220]: mat
Out[220]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [221]: mat[np.array([0,1])]/=np.array([[1],[2]])
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_arrayXarray(i, j, x)
and a change in sparsity:
In [222]: mat
Out[222]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 9 stored elements in Compressed Sparse Row format>
Assigning a sparse version of the [215] result doesn't trigger the warning:
In [223]: mat.eliminate_zeros()
In [224]: m1=sparse.csr_matrix(mat[np.array([0,1])]/np.array([[1],[2]]))
In [225]: m1
Out[225]:
<2x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
In [226]: mat[np.array([0,1])]=m1
In [227]: mat
Out[227]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
===
The [215] division is best seen as an ndarray action, not a sparse one:
In [232]: mat[np.array([0,1])]/np.array([[1],[2]])
Out[232]:
matrix([[1. , 0. , 2. ],
        [0. , 0. , 0.09375]])
In [233]: mat[np.array([0,1])].todense()/np.array([[1],[2]])
Out[233]:
matrix([[1. , 0. , 2. ],
        [0. , 0. , 0.09375]])
The details of this division are found in sparse._base.py, in mat._divide, with different actions depending on whether the other operand is a scalar, a dense array, or a sparse matrix. Sparse matrix division does not implement broadcasting.
As a general rule, matrix multiplication is the most efficient sparse calculation. In fact actions like row or column sum are implemented with it. And so are some forms of indexing. Element-wise calculations are ok if they can be applied to the M.data array without regard to row or column indices (e.g. square, power, scalar multiplication). M.multiply is element-wise, but without the full broadcasting power of dense arrays. Sparse division is even more limited.
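To make the "row sums are implemented with matrix multiplication" point concrete, here is a small illustration (my own example, not from the answer):

import numpy as np
from scipy import sparse

# Summing the rows of a sparse matrix is equivalent to multiplying
# by a column vector of ones.
M = sparse.csr_matrix(np.array([[1., 0., 2.],
                                [0., 0., 3.],
                                [4., 5., 6.]]))
ones = np.ones((3, 1))
print(M @ ones)          # [[ 3.] [ 3.] [15.]]
print(M.sum(axis=1))     # same values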
edit
sklearn has some utilities to perform certain kinds of sparse actions that it needs, like scaling and normalizing.
In [274]: from sklearn.utils import sparsefuncs
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.utils
https://scikit-learn.org/stable/modules/generated/sklearn.utils.sparsefuncs.inplace_row_scale.html#sklearn.utils.sparsefuncs.inplace_row_scale
With the sample mat:
In [275]: mat
Out[275]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [276]: mat.A
Out[276]:
array([[1. , 0. , 2. ],
       [0. , 0. , 0.1875],
       [4. , 5. , 6. ]])
Applying the row scaling:
In [277]: sparsefuncs.inplace_row_scale(mat,np.array([10,20,1]))
In [278]: mat
Out[278]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [279]: mat.A
Out[279]:
array([[10. ,  0. , 20. ],
       [ 0. ,  0. ,  3.75],
       [ 4. ,  5. ,  6. ]])
The scaling array has to match in length. In your case you'd need to take the inverse of your [[1],[2]], and pad it with 1 to act on all rows.
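Concretely, applying that suggestion to the original matrix (a sketch; the reciprocal scale vector padded with 1 is my own construction):

import numpy as np
from scipy import sparse
from sklearn.utils import sparsefuncs

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1., 2., 3., 4., 5., 6.])
mat = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

# Invert the divisors [[1], [2]] and pad with 1 for the untouched third row.
scale = np.array([1 / 1, 1 / 2, 1.0])
sparsefuncs.inplace_row_scale(mat, scale)
print(mat.A)
# [[1.  0.  2. ]
#  [0.  0.  1.5]
#  [4.  5.  6. ]]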
Looking at the source I see it uses sparsefuncs.inplace_csr_row_scale. That in turn does:
X.data *= np.repeat(scale, np.diff(X.indptr))
The details of this action are:
In [283]: mat.indptr
Out[283]: array([0, 2, 3, 6], dtype=int32)
In [284]: np.diff(mat.indptr)
Out[284]: array([2, 1, 3], dtype=int32)
In [285]: np.repeat(np.array([10,20,1]), _)
Out[285]: array([10, 10, 20, 1, 1, 1])
In [286]: mat.data
Out[286]: array([100., 200., 75., 4., 5., 6.])
So it converts the scale array into an array that matches the data in shape. Then the in-place *= array multiplication is easy.
It seems that the problem lies in the way the matrix data is accessed:
mat[np.array([0,1])]/=np.array([[1],[2]])
Direct access to the rows seems to be the problem. I did not find any authoritative explanation of why this is the case, but I think it forces the construction of the zero entries in those rows, which the CSR format would normally skip.
A way to mitigate the problem would be to first rewrite the divisors np.array([[1],[2]]) as reciprocals 1/x to get [[1],[0.5]], and then create a diagonal csr_matrix from them, so that we have
mat_x = csr_matrix(([1,0.5,1], [0,1,2], [0,1,2,3]), shape=(3, 3))
If we then multiply this matrix from the left to act on rows, we get the desired output without any warnings:
mat = mat_x * mat
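Generalizing that trick (a sketch; the divisors array and the use of scipy.sparse.diags are my own additions), you can build the diagonal of reciprocals for any set of rows:

import numpy as np
from scipy import sparse

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1., 2., 3., 4., 5., 6.])
mat = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

# Divide row 0 by 1 and row 1 by 2; rows not being divided get factor 1.
divisors = np.array([1., 2., 1.])
mat_x = sparse.diags(1.0 / divisors).tocsr()
mat = mat_x @ mat          # no SparseEfficiencyWarning
print(mat.A)
# [[1.  0.  2. ]
#  [0.  0.  1.5]
#  [4.  5.  6. ]]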

numpy slices using a column-dependent end index from an integer array

If I have an array and I apply summation
arr = np.array([[1.,1.,2.],[2.,3.,4.],[4.,5.,6]])
np.sum(arr,axis=1)
I get the total along the three rows ([4.,9.,15.])
My complication is that arr contains data that may be bad after a certain column index. I have an integer array that tells me how many "good" values I have in each row and I want to sum/average over the good values. Say:
ngoodcols=np.array([0,1,2])
np.sum(arr[:,0:ngoodcols],axis=1) # not legit but this is the idea
It is clear how to do this in a loop, but is there a way to sum only that many, producing [0.,2.,9.], without resorting to looping? Equivalently, I could use nansum if I knew how to set the elements at column indexes at or beyond the corresponding ngoodcols entry to np.nan, but that is a nearly equivalent problem as far as slicing is concerned.
One possibility is to use masked arrays:
import numpy as np
arr = np.array([[1., 1., 2.], [2., 3., 4.], [4., 5., 6]])
ngoodcols = np.array([0, 1, 2])
mask = ngoodcols[:, np.newaxis] <= np.arange(arr.shape[1])
arr_masked = np.ma.masked_array(arr, mask)
print(arr_masked)
# [[-- -- --]
# [2.0 -- --]
# [4.0 5.0 --]]
print(arr_masked.sum(1))
# [-- 2.0 9.0]
Note that here, when there are no good values, you get a "missing" value as a result, which may or may not be useful for you. A masked array also lets you easily do other operations that only apply to valid values (mean, etc.).
Another simple option is to just multiply by the mask:
import numpy as np
arr = np.array([[1., 1., 2.], [2., 3., 4.], [4., 5., 6]])
ngoodcols = np.array([0, 1, 2])
mask = ngoodcols[:, np.newaxis] <= np.arange(arr.shape[1])
print((arr * ~mask).sum(1))
# [0. 2. 9.]
Here, when there are no good values, you just get zero.
Here is one way using Boolean indexing. It sets elements whose column index is at or beyond the corresponding ngoodcols value to np.nan and then uses np.nansum:
import numpy as np
arr = np.array([[1.,1.,2.],[2.,3.,4.],[4.,5.,6]])
ngoodcols = np.array([0,1,2])
arr[np.asarray(ngoodcols)[:,None] <= np.arange(arr.shape[1])] = np.nan
print(np.nansum(arr, axis=1))
# [ 0. 2. 9.]

Remove nan rows in a scipy sparse matrix

I am given a (normalized) sparse adjacency matrix and a list of labels for the respective matrix rows. Because some nodes have been removed by another sanitization function, there are some rows containing NaNs in the matrix. I want to find these rows and remove them as well as their respective labels. Here is the function I wrote:
def sanitize_nan_rows(adj, labels):
    # convert to numpy array and keep dimension
    adj = np.array(adj, ndmin=2)
    for i, row in enumerate(adj):
        # check if row is all NaNs
        if np.all(np.isnan(row)):
            # print("Removing nan row label in %s" % i)
            # remove row index from labels
            del labels[i]
    # remove all-nan rows
    adj = adj[~np.all(np.isnan(adj), axis=1)]
    # return sanitized adj and labels_clean
    return adj, labels
labels is a simple Python list and adj has the type <class 'scipy.sparse.lil.lil_matrix'> (containing elements of type <class 'numpy.float64'>), which are both the result of
adj, labels = nx.attr_sparse_matrix(infected, normalized=True)
On execution I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-503-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)
<ipython-input-502-ead99efec677> in sanitize_nans(adj, labels)
6 for i, row in enumerate(adj):
7 # check if row all nans
----> 8 if np.all(np.isnan(row)):
9 print("Removing nan row label in %s" % i)
10 # remove row index from labels
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
So I thought that SciPy NaNs were different from NumPy NaNs. After that I tried to convert the sparse matrix into a numpy array (taking the risk of flooding my RAM, because the matrix has about 40k rows and columns). When running it, the error stays the same, however. It seems that the np.array() call just wrapped the sparse matrix and didn't convert it, as type(row) inside the for loop still outputs <class 'scipy.sparse.lil.lil_matrix'>.
So my question is how to resolve this issue and whether there is a better approach that gets the job done. I am fairly new to numpy and scipy (as used in networkx), so I'd appreciate an explanation. Thank you!
EDIT: After changing the conversion to what hpaulj proposed, I'm getting a MemoryError:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-519-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)
<ipython-input-518-44201f4ff35c> in sanitize_nans(adj, labels)
1 def sanitize_nans(adj, labels):
----> 2 adj = adj.toarray()
3
4 for i, row in enumerate(adj):
5 # check if row all nans
/usr/lib/python3/dist-packages/scipy/sparse/lil.py in toarray(self, order, out)
348 def toarray(self, order=None, out=None):
349 """See the docstring for `spmatrix.toarray`."""
--> 350 d = self._process_toarray_args(order, out)
351 for i, row in enumerate(self.rows):
352 for pos, j in enumerate(row):
/usr/lib/python3/dist-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
697 return out
698 else:
--> 699 return np.zeros(self.shape, dtype=self.dtype, order=order)
700
701
MemoryError:
So apparently I'll have to stick with the sparse matrix to save RAM.
If I make a sample array:
In [328]: A=np.array([[1,0,0,np.nan],[0,np.nan,np.nan,0],[1,0,1,0]])
In [329]: A
Out[329]:
array([[  1.,   0.,   0.,  nan],
       [  0.,  nan,  nan,   0.],
       [  1.,   0.,   1.,   0.]])
In [331]: M=sparse.lil_matrix(A)
This lil sparse matrix is stored in 2 arrays:
In [332]: M.data
Out[332]: array([[1.0, nan], [nan, nan], [1.0, 1.0]], dtype=object)
In [333]: M.rows
Out[333]: array([[0, 3], [1, 2], [0, 2]], dtype=object)
With your function, no rows will be removed, even though the stored data of the middle row contains only nan:
In [334]: A[~np.all(np.isnan(A), axis=1)]
Out[334]:
array([[  1.,   0.,   0.,  nan],
       [  0.,  nan,  nan,   0.],
       [  1.,   0.,   1.,   0.]])
I could test the rows of M for nan, and identify the ones that only contain nan (besides 0s). But it's probably easier to collect the ones that we want to keep.
In [346]: ll = [i for i,row in enumerate(M.data) if not np.all(np.isnan(row))]
In [347]: ll
Out[347]: [0, 2]
In [348]: M[ll,:]
Out[348]:
<2x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in LInked List format>
In [349]: _.A
Out[349]:
array([[  1.,   0.,   0.,  nan],
       [  1.,   0.,   1.,   0.]])
A row of M.data is a list, but np.isnan(row) will convert it to an array and do its array test.
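Putting that together (a sketch of the full sanitizer; the empty-row guard and the label filtering are my own additions):

import numpy as np
from scipy import sparse

def sanitize_nan_rows(adj, labels):
    # Keep rows whose stored data is not all-nan; rows with no stored
    # entries at all (all-zero rows) are kept too, hence the len() guard.
    keep = [i for i, row in enumerate(adj.data)
            if not (len(row) and np.all(np.isnan(row)))]
    return adj[keep, :], [labels[i] for i in keep]

A = np.array([[1, 0, 0, np.nan], [0, np.nan, np.nan, 0], [1, 0, 1, 0]])
M = sparse.lil_matrix(A)
M_clean, labels_clean = sanitize_nan_rows(M, ['a', 'b', 'c'])
print(M_clean.A)       # rows 0 and 2 survive
print(labels_clean)    # ['a', 'c']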

numpy classification comparison with 3d array

I'm trying to do some basic classification of numpy arrays...
I want to compare a 2d array against a 3d array, along the 3rd dimension, and make a classification based on the corresponding z-axis values.
so given 3 arrays that are stacked into a 3d array:
import numpy as np
a1 = np.array([[1,1,1],[1,1,1],[1,1,1]])
a2 = np.array([[3,3,3],[3,3,3],[3,3,3]])
a3 = np.array([[5,5,5],[5,5,5],[5,5,5]])
a3d = np.dstack((a1,a2,a3))
and another 2d array
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
I want to be able to compare a2d against a3d, and return a 2d array of which level of a3d is closest. (Or, I suppose, any custom function that can compare each value along the z-axis and return a value based on that comparison.)
EDIT
I modified my arrays to more closely match my data. a1 would be the minimum values, a2 the average values, and a3 the maximum values. So I want to output whether each a2d value is closer to a1 (classed "1"), a2 (classed "2"), or a3 (classed "3"). I'm doing it as a 3d array because in the real data it won't be a simple 3-array choice, but for SO purposes it helps to keep it simple. We can assume that in the case of a tie we'll take the lower, so 2 would be classed as level "1" and 4 as level "2".
You can use the following list comprehension:
>>> [sum(sum(abs(i-j)) for i,j in z) for z in [zip(i,a2d) for i in a3d]]
[30.0, 22.5, 30.0]
In the preceding code I create the following list with zip, which pairs each sub-array of your 3d array with a2d; then all you need is to calculate the sum of the absolute differences within each pair, and then sum those again:
>>> [zip(i,a2d) for i in a3d]
[[(array([ 1., 3., 1.]), array([1, 2, 1])), (array([ 2., 2., 1.]), array([5, 5, 4])), (array([ 3., 1., 1.]), array([9, 8, 8]))], [(array([ 4., 6., 4.]), array([1, 2, 1])), (array([ 5. , 6.5, 4. ]), array([5, 5, 4])), (array([ 6., 4., 4.]), array([9, 8, 8]))], [(array([ 7., 9., 7.]), array([1, 2, 1])), (array([ 8., 8., 7.]), array([5, 5, 4])), (array([ 9., 7., 7.]), array([9, 8, 8]))]]
Then for all of your sub-arrays you'll have the following list:
[30.0, 22.5, 30.0]
which for each sub-array shows its level of difference from the 2d array. You can then get the closest sub-array from a3d as follows (where l is the list computed above):
>>> a3d[l.index(min(l))]
array([[ 4. ,  6. ,  4. ],
       [ 5. ,  6.5,  4. ],
       [ 6. ,  4. ,  4. ]])
Also you can put it in a function:
>>> def find_nearest(sub,main):
... l=[sum(sum(abs(i-j)) for i,j in z) for z in [zip(i,sub) for i in main]]
... return main[l.index(min(l))]
...
>>> find_nearest(a2d,a3d)
array([[ 4. ,  6. ,  4. ],
       [ 5. ,  6.5,  4. ],
       [ 6. ,  4. ,  4. ]])
You might consider a different approach using numpy.vectorize which lets you efficiently apply a python function to each element of your array.
In this case, your python function could just classify each pixel with whatever breaks you define:
import numpy as np
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
def classify(x):
    if x >= 4:
        return 3
    elif x >= 2:
        return 2
    elif x > 0:
        return 1
    else:
        return 0
vclassify = np.vectorize(classify)
result = vclassify(a2d)
Thanks to @perrygeo and @Kasra - they got me thinking in a good direction.
Since I want a classification of the closest 3d array's z value, I couldn't do simple math - I needed the (z)index of the closest value.
I did it by enumerating both axes of the 2d array, and doing a proximity compare against the corresponding (z)index of the 3d array.
There might be a way to do this without iterating the 2d array, but at least I'm avoiding iterating the 3d.
import numpy as np
a1 = np.array([[1,1,1],[1,1,1],[1,1,1]])
a2 = np.array([[3,3,3],[3,3,3],[3,3,3]])
a3 = np.array([[5,5,5],[5,5,5],[5,5,5]])
a3d = np.dstack((a1,a2,a3))
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
classOut = np.empty_like(a2d)
def find_nearest_idx(array, value):
    idx = (np.abs(array - value)).argmin()
    return idx
# enumerate to get indices
for i, a in enumerate(a2d):
    for ii, v in enumerate(a):
        valStack = a3d[i, ii]
        nearest = find_nearest_idx(valStack, v)
        classOut[i, ii] = nearest
print(classOut)
which gets me
[[0 0 1]
 [2 2 0]
 [0 1 1]]
This tells me that (for example) a2d[0,0] is closest to the 0-index of a3d[0,0], which in my case means it is closest to the min value for that 2d position. a2d[1,1] is closest to the 2-index, which in my case means closer to the max value for that 2d position.
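For what it's worth, the 2d iteration can also be removed entirely with broadcasting (a sketch; note that np.argmin returns the first index on ties, which matches the "take the lower" rule from the question):

import numpy as np

a1 = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]])
a2 = np.array([[3, 3, 3], [3, 3, 3], [3, 3, 3]])
a3 = np.array([[5, 5, 5], [5, 5, 5], [5, 5, 5]])
a3d = np.dstack((a1, a2, a3))
a2d = np.array([[1, 2, 4], [5, 5, 2], [2, 3, 3]])

# Broadcast a2d against the z-axis of a3d and pick the index of the
# smallest absolute difference along that axis.
classOut = np.abs(a3d - a2d[:, :, None]).argmin(axis=2)
print(classOut)
# [[0 0 1]
#  [2 2 0]
#  [0 1 1]]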
