Remove nan rows in a scipy sparse matrix

I am given a (normalized) sparse adjacency matrix and a list of labels for the respective matrix rows. Because some nodes have been removed by another sanitization function, there are some rows containing NaNs in the matrix. I want to find these rows and remove them as well as their respective labels. Here is the function I wrote:
def sanitize_nan_rows(adj, labels):
    # convert to numpy array and keep dimension
    adj = np.array(adj, ndmin=2)
    for i, row in enumerate(adj):
        # check if row all nans
        if np.all(np.isnan(row)):
            # print("Removing nan row label in %s" % i)
            # remove row index from labels
            del labels[i]
    # remove all nan rows
    adj = adj[~np.all(np.isnan(adj), axis=1)]
    # return sanitized adj and labels_clean
    return adj, labels
labels is a simple Python list and adj has the type <class 'scipy.sparse.lil.lil_matrix'> (containing elements of type <class 'numpy.float64'>), which are both the result of
adj, labels = nx.attr_sparse_matrix(infected, normalized=True)
On execution I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-503-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)
<ipython-input-502-ead99efec677> in sanitize_nans(adj, labels)
6 for i, row in enumerate(adj):
7 # check if row all nans
----> 8 if np.all(np.isnan(row)):
9 print("Removing nan row label in %s" % i)
10 # remove row index from labels
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
So I thought that SciPy NaNs were different from numpy NaNs. After that I tried to convert the sparse matrix into a numpy array (taking the risk of flooding my RAM, because the matrix has about 40k rows and columns). When running it, the error stays the same however. It seems that the np.array() call just wrapped the sparse matrix and didn't convert it, as type(row) inside the for loop still outputs <class 'scipy.sparse.lil.lil_matrix'>
So my question is how to resolve this issue and whether there is a better approach that gets the job done. I am fairly new to numpy and scipy (as used in networkx), so I'd appreciate an explanation. Thank you!
EDIT: After changing the conversion to what hpaulj proposed, I'm getting a MemoryError:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-519-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)
<ipython-input-518-44201f4ff35c> in sanitize_nans(adj, labels)
1 def sanitize_nans(adj, labels):
----> 2 adj = adj.toarray()
3
4 for i, row in enumerate(adj):
5 # check if row all nans
/usr/lib/python3/dist-packages/scipy/sparse/lil.py in toarray(self, order, out)
348 def toarray(self, order=None, out=None):
349 """See the docstring for `spmatrix.toarray`."""
--> 350 d = self._process_toarray_args(order, out)
351 for i, row in enumerate(self.rows):
352 for pos, j in enumerate(row):
/usr/lib/python3/dist-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
697 return out
698 else:
--> 699 return np.zeros(self.shape, dtype=self.dtype, order=order)
700
701
MemoryError:
So apparently I'll have to stick with the sparse matrix to save RAM.

If I make a sample array:
In [328]: A=np.array([[1,0,0,np.nan],[0,np.nan,np.nan,0],[1,0,1,0]])
In [329]: A
Out[329]:
array([[ 1., 0., 0., nan],
[ 0., nan, nan, 0.],
[ 1., 0., 1., 0.]])
In [331]: M=sparse.lil_matrix(A)
This lil sparse matrix is stored in 2 arrays:
In [332]: M.data
Out[332]: array([[1.0, nan], [nan, nan], [1.0, 1.0]], dtype=object)
In [333]: M.rows
Out[333]: array([[0, 3], [1, 2], [0, 2]], dtype=object)
With your function, no rows will be removed, even though the middle row of the sparse matrix only contains nan.
In [334]: A[~np.all(np.isnan(A), axis=1)]
Out[334]:
array([[ 1., 0., 0., nan],
[ 0., nan, nan, 0.],
[ 1., 0., 1., 0.]])
I could test the rows of M for nan, and identify the ones that only contain nan (besides 0s). But it's probably easier to collect the ones that we want to keep.
In [346]: ll = [i for i,row in enumerate(M.data) if not np.all(np.isnan(row))]
In [347]: ll
Out[347]: [0, 2]
In [348]: M[ll,:]
Out[348]:
<2x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in LInked List format>
In [349]: _.A
Out[349]:
array([[ 1., 0., 0., nan],
[ 1., 0., 1., 0.]])
A row of M.data is a list, but np.isnan(row) will convert it to an array and do its array test.
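For a matrix too large to loop over comfortably, the same idea can be sketched without Python-level iteration by counting stored entries per row in COO form. This is only an illustration: drop_all_nan_rows is a hypothetical helper, and it assumes that rows with no stored entries (all zeros) should be kept and that only rows whose stored entries are all NaN should be dropped.

```python
import numpy as np
from scipy import sparse


def drop_all_nan_rows(adj, labels):
    """Hypothetical helper: drop rows whose stored entries are all NaN.

    Assumption: rows without any stored entry (all zeros) are kept.
    The matrix is never converted to a dense array.
    """
    coo = sparse.coo_matrix(adj)
    n_rows = coo.shape[0]
    # Number of stored entries per row.
    stored = np.bincount(coo.row, minlength=n_rows)
    # Number of stored entries per row that are NaN.
    nans = np.bincount(coo.row[np.isnan(coo.data)], minlength=n_rows)
    # Keep every row except those that store something and store only NaN.
    keep = np.flatnonzero(~((stored > 0) & (nans == stored)))
    cleaned = sparse.csr_matrix(adj)[keep]
    return cleaned, [labels[i] for i in keep]


A = np.array([[1, 0, 0, np.nan], [0, np.nan, np.nan, 0], [1, 0, 1, 0]])
cleaned, kept_labels = drop_all_nan_rows(sparse.lil_matrix(A), ["a", "b", "c"])
print(cleaned.shape, kept_labels)
```

On the sample matrix this keeps rows 0 and 2 together with their labels, and it returns a new label list instead of deleting from the original while iterating.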

Related

Outer minimum vectorization in numpy follow up

This is a follow-up to my previous question.
Given an NxM matrix A, I want to efficiently obtain the NxN matrix whose ith row is the sum along the 2nd axis of the result of applying np.minimum between A and the ith row of A.
Using a for loop,
> A = np.array([[1, 2], [3, 4], [5,6]])
> output = np.zeros(shape=(A.shape[0], A.shape[0]))
> for i in range(A.shape[0]):
      output[i] = np.sum(np.minimum(A, A[i]), axis=1)
> output
array([[ 3., 3., 3.],
       [ 3., 7., 7.],
       [ 3., 7., 11.]])
Is it possible to optimize this further without the for loop?
Edit: I would also like to do it without allocating an NxNxM tensor because of memory constraints.

You can use broadcasting instead of a for loop. Using the NumPy minimum and sum functions, you can compute the desired matrix output as follows:
output = np.sum(np.minimum(A[:, None], A), axis=2)
Note that this builds an NxNxM intermediate array before summing.
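If the full NxNxM intermediate is too large, one possible compromise is to process A in row blocks so that only a (block, N, M) array exists at any time. This is a sketch, not a tuned implementation: outer_min_sum and its block parameter are hypothetical names, and the block size would need to be chosen to fit the available memory.

```python
import numpy as np


def outer_min_sum(A, block=256):
    """Hypothetical helper: output[i, j] = sum_k min(A[i, k], A[j, k]).

    Rows are handled `block` at a time, so the largest temporary has
    shape (block, N, M) instead of (N, N, M).
    """
    n = A.shape[0]
    out = np.empty((n, n), dtype=A.dtype)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Broadcast a block of rows against all rows, then sum over columns.
        out[start:stop] = np.minimum(A[start:stop, None, :], A[None, :, :]).sum(axis=2)
    return out


A = np.array([[1, 2], [3, 4], [5, 6]])
result = outer_min_sum(A, block=2)
print(result)
```

With block=1 this degenerates to the original loop, and with block >= N it is the single broadcast expression, so the parameter trades memory for fewer Python-level iterations.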

numpy slices using column dependent end index from an integer array

If I have an array and I apply summation
arr = np.array([[1.,1.,2.],[2.,3.,4.],[4.,5.,6]])
np.sum(arr,axis=1)
I get the total along the three rows ([4.,9.,15.])
My complication is that arr contains data that may be bad after a certain column index. I have an integer array that tells me how many "good" values I have in each row and I want to sum/average over the good values. Say:
ngoodcols=np.array([0,1,2])
np.sum(arr[:,0:ngoodcols],axis=1) # not legit but this is the idea
It is clear how to do this in a loop, but is there a way to sum only that many, producing [0.,2.,9.] without resorting to looping? Equivalently, I could use nansum if I knew how to set the elements in column indexes at or beyond the corresponding ngoodcols value equal to np.nan, but this is a nearly equivalent problem as far as slicing is concerned.

One possibility is to use masked arrays:
import numpy as np
arr = np.array([[1., 1., 2.], [2., 3., 4.], [4., 5., 6]])
ngoodcols = np.array([0, 1, 2])
mask = ngoodcols[:, np.newaxis] <= np.arange(arr.shape[1])
arr_masked = np.ma.masked_array(arr, mask)
print(arr_masked)
# [[-- -- --]
# [2.0 -- --]
# [4.0 5.0 --]]
print(arr_masked.sum(1))
# [-- 2.0 9.0]
Note that here, when a row has no good values, you get a "missing" value as the result, which may or may not be useful for you. A masked array also lets you easily do other operations that only apply to valid values (mean, etc.).
Another simple option is to just multiply by the mask:
import numpy as np
arr = np.array([[1., 1., 2.], [2., 3., 4.], [4., 5., 6]])
ngoodcols = np.array([0, 1, 2])
mask = ngoodcols[:, np.newaxis] <= np.arange(arr.shape[1])
print((arr * ~mask).sum(1))
# [0. 2. 9.]
Here when there are no good values you just get zero.
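One caveat with multiplying by the mask: if the "bad" entries can be NaN or inf, multiplying them by zero still gives NaN. A variation using np.where avoids that and also gives per-row means. This is a sketch under assumptions: the sample arr below deliberately contains NaN and inf in the bad positions, and rows with no good values are assumed to get a mean of NaN.

```python
import numpy as np

# Sample data with NaN/inf placed only in the "bad" positions.
arr = np.array([[1., 1., np.nan], [2., np.inf, 4.], [4., 5., 6.]])
ngoodcols = np.array([0, 1, 2])

# valid[i, j] is True for the first ngoodcols[i] columns of row i.
valid = np.arange(arr.shape[1]) < ngoodcols[:, np.newaxis]

# Replace invalid entries by 0 instead of multiplying, so NaN/inf cannot leak in.
sums = np.where(valid, arr, 0.0).sum(axis=1)
counts = valid.sum(axis=1)

# Divide only where there is at least one good value; other rows stay NaN.
means = np.divide(sums, counts, out=np.full_like(sums, np.nan), where=counts > 0)
print(sums, means)
```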

Here is one way using Boolean indexing. This sets the elements at column indexes at or beyond the values in ngoodcols to np.nan and then uses np.nansum:
import numpy as np
arr = np.array([[1.,1.,2.],[2.,3.,4.],[4.,5.,6]])
ngoodcols = np.array([0,1,2])
arr[np.asarray(ngoodcols)[:,None] <= np.arange(arr.shape[1])] = np.nan
print(np.nansum(arr, axis=1))
# [ 0. 2. 9.]

Combine numpy arrays to form a matrix

This seems like it should be straightforward, but I can't figure it out.
Data source is a two column, comma delimited input file with these contents:
6,10
5,9
8,13
...
And my code is:
import numpy as np
data = np.loadtxt("data.txt", delimiter=",")
m = len(data)
x = np.reshape(data[:,0], (m,1))
y = np.ones((m,1))
z = np.matrix([x,y])
Which gives me this error:
Users/acpigeon/.virtualenvs/ipynb/lib/python2.7/site-packages/numpy-1.9.0.dev_297f54b-py2.7-macosx-10.9-intel.egg/numpy/matrixlib/defmatrix.pyc in __new__(subtype, data, dtype, copy)
270 shape = arr.shape
271 if (ndim > 2):
--> 272 raise ValueError("matrix must be 2-dimensional")
273 elif ndim == 0:
274 shape = (1, 1)
ValueError: matrix must be 2-dimensional
No amount of reshaping seems to get this to work, so I'm either missing something really simple or there's a better way to do this.
EDIT:
Would have been helpful to specify the output I am looking for. Here is a line of code that generates the desired result:
In [1]: np.matrix([[5,1],[6,1],[8,1]])
Out[1]:
matrix([[5, 1],
[6, 1],
[8, 1]])

The desired output can be generated this way:
In [12]: np.array((data[:, 0], np.ones(m))).transpose()
Out[12]:
array([[ 6., 1.],
[ 5., 1.],
[ 8., 1.]])
The above is copied from ipython and so has ipython style prompts.
Answer to previous version
To eliminate the error, replace:
x = np.reshape(data[:, 0], (m, 1))
with:
x = data[:, 0]
The former line produces a 2-D array of shape (m, 1); putting two such arrays in a list hands np.matrix 3-D input, and that is what causes the error message. The latter produces a 1-D array with the same data.

Or how about first turning the array into a matrix, and then changing the last column to 1?
In [2]: data=np.loadtxt('stack23859379.txt',delimiter=',')
In [3]: np.matrix(data)
Out[3]:
matrix([[ 6., 10.],
[ 5., 9.],
[ 8., 13.]])
In [4]: z = np.matrix(data)
In [5]: z[:,1]=1
In [6]: z
Out[6]:
matrix([[ 6., 1.],
[ 5., 1.],
[ 8., 1.]])
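Another way to get the same result, sketched here with np.column_stack, is to stack the 1-D column and a 1-D column of ones side by side. The inline data array is a stand-in for the np.loadtxt call in the question, and the result is a plain ndarray rather than an np.matrix.

```python
import numpy as np

# Stand-in for: data = np.loadtxt("data.txt", delimiter=",")
data = np.array([[6., 10.], [5., 9.], [8., 13.]])
m = len(data)

# column_stack treats each 1-D input as one column of the 2-D result.
z = np.column_stack((data[:, 0], np.ones(m)))
print(z)
```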

Correlation matrix in NumPy with NaN's

I have an n x m matrix in which row i represents the timeseries of the variable V_i. I would like to compute the n x n correlation matrix M, where M_{i,j} contains the correlation coefficient (Pearson's r) between V_i and V_j.
However, when I try the following in numpy:
numpy.corrcoef(numpy.matrix('5 6 7; 1 1 1'))
I get the following output:
array([[ 1., nan],
[ nan, nan]])
It seems that numpy.corrcoef doesn't like constant vectors, because if I change the second row to 7 6 5, I get the expected result:
array([[ 1., -1.],
[ -1., 1.]])
What is the reason for this kind of behavior of numpy.corrcoef?

leewangzhong (in the comment) is correct: Pearson's r is not defined for a constant timeseries, because its standard deviation is zero and r divides by the product of the two standard deviations. Thanks!
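The zero denominator can be seen by computing r by hand for the two rows from the question. This is an illustrative sketch using the population form of the formula, not a reproduction of how numpy.corrcoef is implemented internally.

```python
import numpy as np

x = np.array([5., 6., 7.])
c = np.array([1., 1., 1.])  # constant series

# r = cov(x, c) / (std(x) * std(c)); both cov and std(c) are 0 here.
with np.errstate(invalid="ignore", divide="ignore"):
    cov = np.mean((x - x.mean()) * (c - c.mean()))
    r = cov / (x.std() * c.std())

print(c.std(), r)
```

The division is 0/0, which NumPy evaluates to nan, matching the nan entries in the corrcoef output above.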

Unsuccessful append to an empty NumPy array

I am trying to fill an empty (not np.empty!) array with values using append, but I am getting an error.
My code is as follows:
import numpy as np
result=np.asarray([np.asarray([]),np.asarray([])])
result[0]=np.append([result[0]],[1,2])
And I am getting:
ValueError: could not broadcast input array from shape (2) into shape (0)

I might understand the question incorrectly, but if you want to declare an array of a certain shape but with nothing inside, the following might be helpful:
Initialise empty array:
>>> a = np.zeros((0,3)) #or np.empty((0,3)) or np.array([]).reshape(0,3)
>>> a
array([], shape=(0, 3), dtype=float64)
Now you can use this array to append rows of similar shape to it. Remember that a numpy array cannot be resized in place, so a new array is created for each iteration:
>>> for i in range(3):
...     a = np.vstack([a, [i,i,i]])
...
>>> a
array([[ 0., 0., 0.],
[ 1., 1., 1.],
[ 2., 2., 2.]])
np.vstack and np.hstack are the most common methods for combining numpy arrays, but coming from Matlab I prefer np.r_ and np.c_:
Concatenate 1d:
>>> a = np.zeros(0)
>>> for i in range(3):
...     a = np.r_[a, [i, i, i]]
...
>>> a
array([ 0., 0., 0., 1., 1., 1., 2., 2., 2.])
Concatenate rows:
>>> a = np.zeros((0,3))
>>> for i in range(3):
...     a = np.r_[a, [[i,i,i]]]
...
>>> a
array([[ 0., 0., 0.],
       [ 1., 1., 1.],
       [ 2., 2., 2.]])
Concatenate columns:
>>> a = np.zeros((3,0))
>>> for i in range(3):
...     a = np.c_[a, [[i],[i],[i]]]
...
>>> a
array([[ 0., 1., 2.],
       [ 0., 1., 2.],
       [ 0., 1., 2.]])

numpy.append is pretty different from list.append in Python. I know that's thrown off a few programmers new to numpy. numpy.append is more like concatenate: it makes a new array and fills it with the values from the old array and the new value(s) to be appended. For example:
import numpy
old = numpy.array([1, 2, 3, 4])
new = numpy.append(old, 5)
print(old)
# [1 2 3 4]
print(new)
# [1 2 3 4 5]
new = numpy.append(new, [6, 7])
print(new)
# [1 2 3 4 5 6 7]
I think you might be able to achieve your goal by doing something like:
result = numpy.zeros((10,))
result[0:2] = [1, 2]
# Or
result = numpy.zeros((10, 2))
result[0, :] = [1, 2]
Update:
If you need to create a numpy array using a loop, and you don't know ahead of time what the final size of the array will be, you can do something like:
import random
import numpy as np
a = np.array([0., 1.])
b = np.array([2., 3.])
# collect the rows in a plain list and convert once at the end
temp = []
while True:
    rnd = random.randint(0, 100)
    if rnd > 50:
        temp.append(a)
    else:
        temp.append(b)
    if rnd == 0:
        break
result = np.array(temp)
In my example result will be an (N, 2) array, where N is the number of times the loop ran, but obviously you can adjust it to your needs.
new update
The error you're seeing has nothing to do with types; it has to do with the shapes of the numpy arrays involved. Without an axis argument, np.append(a, b) flattens both inputs, so np.append([result[0]], [1, 2]) succeeds and returns a (2,) array. The error comes from the assignment: your code is trying to store that (2,) array in result[0], which has shape (0,). Those shapes don't match, so you get an error.
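The two steps can be separated to see which one fails. This is a small sketch that reuses the shapes from the question; error_message is just an illustrative variable for capturing the exception text.

```python
import numpy as np

result = np.zeros((2, 0))  # same shape as the array built in the question

# Step 1: the append itself works and returns a new (2,) array.
appended = np.append([result[0]], [1, 2])
print(appended.shape)

# Step 2: assigning a (2,) array into the (0,) row is what raises.
error_message = None
try:
    result[0] = appended
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```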

This error arises from the fact that you are trying to assign an object of shape (2,) to a slot of shape (0,). If you append what you want without forcing it to be equal to result[0], there is no issue:
b = np.append([result[0]], [1,2])
But when you write result[0] = b you are equating objects of different shapes, and you cannot do this. What are you trying to do?

Here's the result of running your code in IPython. Note that result is a (2,0) array: 2 rows, 0 columns, 0 elements. The append produces a (2,) array. result[0] is a (0,) array. Your error message has to do with trying to assign that 2 item array into a size 0 slot. Since result is dtype=float64, only scalars can be assigned to its elements.
In [65]: result=np.asarray([np.asarray([]),np.asarray([])])
In [66]: result
Out[66]: array([], shape=(2, 0), dtype=float64)
In [67]: result[0]
Out[67]: array([], dtype=float64)
In [68]: np.append(result[0],[1,2])
Out[68]: array([ 1., 2.])
np.array is not a Python list. All elements of an array are the same type (as specified by the dtype). Notice also that result is not an array of arrays.
Result could also have been built as
ll = [[],[]]
result = np.array(ll)
With the list, this assignment works:
ll[0] = [1,2]
# ll = [[1,2],[]]
but the same is not true for result.
np.zeros((2,0)) also produces your result.
Actually there's another quirk to result.
result[0] = 1
does not change the values of result. It accepts the assignment, but since it has 0 columns, there is no place to put the 1. This assignment would work if result were created as np.zeros((2,1)). But that still can't accept a 2 element list.
But if result has 2 columns, then you can assign a 2 element list to one of its rows.
result = np.zeros((2,2))
result[0] # == [0,0]
result[0] = [1,2]
What exactly do you want result to look like after the append operation?

numpy.append always copies the array before appending the new values. Your code is equivalent to the following:
import numpy as np
result = np.zeros((2,0))
new_result = np.append([result[0]],[1,2])
result[0] = new_result # ERROR: result[0] has shape (0,), new_result has shape (2,)
Perhaps you mean to do this?
import numpy as np
result = np.zeros((2,0))
result = np.append([result[0]],[1,2])

SO thread 'Multiply two arrays element wise, where one of the arrays has arrays as elements' has an example of constructing an array from arrays. If the subarrays are the same size, numpy makes a 2d array. But if they differ in length, it makes an array with dtype=object, and the subarrays retain their identity. (Recent NumPy versions require dtype=object to be passed explicitly for such ragged input.)
Following that, you could do something like this:
In [5]: result=np.array([np.zeros((1)),np.zeros((2))], dtype=object)
In [6]: result
Out[6]: array([array([ 0.]), array([ 0., 0.])], dtype=object)
In [7]: np.append([result[0]],[1,2])
Out[7]: array([ 0., 1., 2.])
In [8]: result[0]
Out[8]: array([ 0.])
In [9]: result[0]=np.append([result[0]],[1,2])
In [10]: result
Out[10]: array([array([ 0., 1., 2.]), array([ 0., 0.])], dtype=object)
However, I don't offhand see what advantages this has over a pure Python list of lists. It does not work like a 2d array. For example, I have to use result[0][1], not result[0,1]. If the subarrays are all the same length, I have to use np.array(result.tolist()) to produce a 2d array.
