I have a sparse matrix produced by sklearn's TfidfVectorizer. I believe that some of its rows are all-zero rows, and I want to remove them. However, as far as I know, the existing built-in functions, e.g. nonzero() and eliminate_zeros(), focus on zero entries rather than on rows.
Is there any easy way to remove all-zero rows from a sparse matrix?
Example:
What I have now (actually in sparse format):
[ [0, 0, 0]
[1, 0, 2]
[0, 0, 1] ]
What I want to get:
[ [1, 0, 2]
[0, 0, 1] ]
Slicing + getnnz() does the trick:
M = M[M.getnnz(1)>0]
Works directly on csr_array.
You can also remove all 0 columns without changing formats:
M = M[:,M.getnnz(0)>0]
However if you want to remove both you need
M = M[M.getnnz(1)>0][:,M.getnnz(0)>0] #GOOD
I am not sure why but
M = M[M.getnnz(1)>0, M.getnnz(0)>0] #BAD
does not work.
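A likely explanation, by analogy with NumPy: when two index arrays are passed together, they are paired up element by element instead of selecting a rows-by-columns block. Here is a small sketch of the chained form next to the NumPy analogue of the paired form. The helper name drop_empty_rows_and_cols is made up for illustration, and I am assuming the input has no explicitly stored zeros (getnnz counts stored entries).

```python
import numpy as np
from scipy.sparse import csr_matrix

def drop_empty_rows_and_cols(M):
    """Hypothetical helper: drop all-zero rows and all-zero columns."""
    M = csr_matrix(M)
    row_mask = M.getnnz(axis=1) > 0  # True for rows with stored entries
    col_mask = M.getnnz(axis=0) > 0  # True for columns with stored entries
    # Index the rows first, then the columns, as two separate steps.
    return M[row_mask][:, col_mask]

dense = np.array([[0, 0, 0],
                  [1, 0, 2],
                  [0, 0, 1]])
trimmed = drop_empty_rows_and_cols(dense)

# NumPy analogue of the paired form: the masks select (1, 0) and (2, 2),
# so the result is a 1-D array of two elements, not a 2x2 block.
paired = dense[np.array([False, True, True]), np.array([True, False, True])]
```

If the two masks select a different number of entries, the paired form cannot even be lined up, which would explain an error rather than a wrong result.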
There aren't existing functions for this, but it's not too bad to write your own:
def remove_zero_rows(M):
    M = scipy.sparse.csr_matrix(M)
First, convert the matrix to CSR (compressed sparse row) format. This is important because CSR matrices store their data as a triple of (data, indices, indptr), where data holds the nonzero values, indices stores column indices, and indptr holds row index information. The docs explain better:
the column indices for row i are stored in
indices[indptr[i]:indptr[i+1]] and their corresponding values are
stored in data[indptr[i]:indptr[i+1]].
So, to find rows without any nonzero values, we can just look at successive values of M.indptr. Continuing our function from above:
    num_nonzeros = np.diff(M.indptr)
    return M[num_nonzeros != 0]
The second benefit of CSR format here is that it's relatively cheap to slice rows, which simplifies the creation of the resulting matrix.
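To make the (data, indices, indptr) triple concrete, here is a small sketch using the 3x3 example from the question. The variable names are only for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

M = csr_matrix(np.array([[0, 0, 0],
                         [1, 0, 2],
                         [0, 0, 1]]))

data = M.data.tolist()        # nonzero values, row by row
indices = M.indices.tolist()  # column index of each stored value
indptr = M.indptr.tolist()    # row i owns entries indptr[i]:indptr[i+1]

# Successive differences of indptr give the number of stored entries per row.
row_counts = np.diff(M.indptr)

# Keep only the rows that own at least one stored entry.
kept = M[row_counts != 0]
```

Row 0 owns the empty range indptr[0]:indptr[1], which is why it shows up as a zero in row_counts and gets dropped.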
Thanks for your reply, @perimosocordiae
I just found another solution myself. I am posting it here in case someone needs it in the future.
import numpy

def remove_zero_rows(X):
    # X is a scipy sparse matrix. We want to remove all zero rows from it
    nonzero_row_indice, _ = X.nonzero()
    unique_nonzero_indice = numpy.unique(nonzero_row_indice)
    return X[unique_nonzero_indice]
Related
In scipy, to create a sparse matrix from triple format data (row, col and data arrays), the default behavior is to sum the data values for all duplicates. Can I change this behavior to overwrite (or do nothing) instead?
For example:
import scipy.sparse as sparse
rows = [0, 0]
cols = [0, 0]
data = [1, 1]
S = sparse.coo_matrix((data, (rows, cols)))
Here, S.todense() is equal to matrix([[2]]) but I would wish it to be matrix([[1]]).
In the documentation of sparse.coo_matrix, it reads
By default when converting to CSR or CSC format, duplicate (i,j)
entries will be summed together. This facilitates efficient
construction of finite element matrices and the like.
It appears from that formulation that there might be other options than the default.
I've seen discussion on the scipy github about giving more control over this summing, but I don't know of any production changes. As the docs indicate, there's a long standing tradition over summing the duplicates.
As created, the coo matrix does not sum; it just assigns your parameters to its attributes:
In [697]: S = sparse.coo_matrix((data, (rows, cols)))
In [698]: S.data
Out[698]: array([1, 1])
In [699]: S.row
Out[699]: array([0, 0], dtype=int32)
In [700]: S.col
Out[700]: array([0, 0], dtype=int32)
Converting to dense (or to csr/csc) does sum - but doesn't change S itself:
In [701]: S.A
Out[701]: array([[2]])
In [702]: S.data
Out[702]: array([1, 1])
You can perform the summing in place with:
In [703]: S.sum_duplicates()
In [704]: S.data
Out[704]: array([2], dtype=int32)
I don't know of a way of either removing the duplicates or bypassing that action. I may look up the relevant issue.
=================
S.todok() does an inplace sum (that is, changes S). Looking at that code I see that it calls self.sum_duplicates. The following replicates that without the sum:
In [727]: dok=sparse.dok_matrix((S.shape),dtype=S.dtype)
In [728]: dok.update(zip(zip(S.row,S.col),S.data))
In [729]: dok
Out[729]:
<1x1 sparse matrix of type '<class 'numpy.int32'>'
with 1 stored elements in Dictionary Of Keys format>
In [730]: print(dok)
(0, 0) 1
In [731]: S
Out[731]:
<1x1 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in COOrdinate format>
In [732]: dok.A
Out[732]: array([[1]])
It's a dictionary update, so the final value is the last of the duplicates. I found elsewhere that dok.update is a pretty fast way of adding values to a sparse matrix.
tocsr inherently does the sum; tolil uses tocsr; so this todok approach may be simplest.
If you want only values of 1:
S.sum_duplicates()
S.data[:]=1
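If you would rather drop the duplicates before the matrix is ever built, one option is to deduplicate the triplets in NumPy first. This is only a sketch: the helper name coo_keep_last is made up, and it assumes you want the last value of each duplicate (i, j) to win, which mirrors the dictionary-update behaviour above.

```python
import numpy as np
import scipy.sparse as sparse

def coo_keep_last(rows, cols, data, shape):
    """Hypothetical helper: build a COO matrix where the last duplicate wins."""
    rows = np.asarray(rows, dtype=np.int64)
    cols = np.asarray(cols, dtype=np.int64)
    data = np.asarray(data)
    # Collapse each (row, col) pair into a single linear index.
    flat = rows * shape[1] + cols
    # np.unique reports the first occurrence, so search the reversed array
    # to find the last occurrence of each linear index.
    _, first_in_reversed = np.unique(flat[::-1], return_index=True)
    keep = len(flat) - 1 - first_in_reversed
    return sparse.coo_matrix((data[keep], (rows[keep], cols[keep])), shape=shape)

S_last = coo_keep_last([0, 0, 1], [0, 0, 1], [1, 5, 7], shape=(2, 2))
S_one = coo_keep_last([0, 0], [0, 0], [1, 1], shape=(1, 1))
```

Because no duplicates remain, converting the result to CSR or to dense has nothing left to sum.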
I'm trying to figure out the best way to turn my data into a numpy/scipy sparse matrix. I don't need to do any heavy computation in this format. I just need to be able to convert data from a dense, too-large-for-memory csv to something I can pass into an sklearn estimator. My theory is that the sparse-ified data should fit in memory.
Because all of the features are categorical, I'm using a generator to iterate over the file and the hashing trick to one hot encode everything:
def get_data(train=True):
    if train:
        path = '../originalData/train_rev1_short_short.csv'
    else:
        path = '../originalData/test_rev1_short.csv'
    it = enumerate(open(path))
    next(it)  # burn the header row
    x = [0] * 27  # initialize row container
    for ix, line in it:
        for ixx, f in enumerate(line.strip().split(',')):
            # Record sample id
            if ixx == 0:
                sample_id = f
            # If this is the training data, record output class
            elif ixx == 1 and train:
                c = f
            # Use the hashing trick to one hot encode categorical features
            else:
                x[ixx] = abs(hash(str(ixx) + '_' + f)) % (2 ** 20)
        yield (sample_id, x, c) if train else (sample_id, x)
The results are rows like this:
10000222510487979663 [1, 3, 66642, 433470, 960966, ..., 802612, 319257, 80942]
10000335031004381249 [1, 2, 87543, 394759, 183945, ..., 773845, 219833, 64573]
Where the first value is the sample ID and the list is the index values of the columns that have a '1' value.
What is the most efficient way to turn this into a numpy/scipy sparse matrix? My only requirements are fast row-wise write/read and sklearn compatibility. Based on the scipy documentation, it seems like the CSR matrix is what I need, but I'm having some trouble figuring out how to convert the data I have while using the generator construct.
Any advice? Open also to alternate approaches, I'm relatively new to problems like this.
Your data format is almost the internal structure of a scipy.sparse.lil_matrix (list of lists). You should first generate one of those, and then call .tocsr() on it to obtain the desired csr matrix.
A small example on how to populate these:
from scipy.sparse import lil_matrix
positions = [[1, 2, 10], [], [5, 6, 2]]
data = [[1, 1, 1], [], [1, 1, 1]]
l = lil_matrix((3, 11))
l.rows = positions
l.data = data
c = l.tocsr()
where data is just a list of lists of ones mirroring the structure of positions, and positions corresponds to your feature indices. As you can see, the attributes l.rows and l.data are real lists here, so you can append data as it comes. In that case you need to be careful with the shape, though. When scipy generates a lil_matrix from other data, it stores arrays of dtype object in these attributes, but those behave almost like lists, too.
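An alternative that skips the LIL step is to accumulate the CSR arrays directly while consuming the generator. This is a sketch under assumptions: rows_to_csr is a made-up helper, every stored value is 1, and no column index repeats within a row.

```python
import numpy as np
from scipy.sparse import csr_matrix

def rows_to_csr(index_rows, n_cols):
    """Hypothetical helper: build a 0/1 CSR matrix from rows of column indices."""
    indices = []   # column indices of all stored entries, row after row
    indptr = [0]   # indptr[i]:indptr[i+1] delimits the entries of row i
    for cols in index_rows:
        indices.extend(cols)
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.int8)
    X = csr_matrix(
        (data,
         np.asarray(indices, dtype=np.int64),
         np.asarray(indptr, dtype=np.int64)),
        shape=(len(indptr) - 1, n_cols),
    )
    X.sort_indices()  # column indices may arrive unsorted within a row
    return X

positions = [[1, 2, 10], [], [5, 6, 2]]
X = rows_to_csr(iter(positions), n_cols=11)
```

With the hashing scheme from the question, n_cols would presumably be 2 ** 20, and the sample ids and classes would be collected into separate lists while iterating.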
I am having trouble figuring out how to do this operation.
I have a variable index, which is a 1xN sparse binary array, and a 2-D array samples (NxM). I want to use index to select specific rows of samples and get a 2-D array.
I have tried stuff like:
idx = index.todense() == 1
samples[idx.T,:]
but nothing.
So far I have made it work doing this:
idx = test_x.todense() == 1
selected_samples = samples[np.array(idx.flat)]
But there should be a cleaner way.
To give an idea using a fraction of the data:
print(idx.shape) # (1, 22360)
print(samples.shape) # (22360, 200)
The short answer:
selected_samples = samples[index.nonzero()[1]]
The long answer:
The first problem is that your index matrix is 1xN while your sample ndarray is NxM. (See the mismatch?) This is why you needed to call .flat.
That's not a big deal, though, because we just need the nonzero entries in the sparse vector. Get those with index.nonzero(), which returns a tuple of (row indices, column indices). We only care about the column indices, so we use index.nonzero()[1] to get those by themselves.
Then, simply index with the array of nonzero column indices and you're done.
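A tiny sketch of the short answer on made-up data, with 4 samples instead of 22360:

```python
import numpy as np
from scipy.sparse import csr_matrix

# 1xN sparse binary selector and an NxM dense array (toy sizes).
index = csr_matrix(np.array([[0, 1, 0, 1]]))
samples = np.arange(12).reshape(4, 3)

# nonzero() returns (row indices, column indices); the column indices
# of the 1xN selector are the row numbers we want from samples.
row_ids = index.nonzero()[1]
selected_samples = samples[row_ids]
```

Here row_ids picks rows 1 and 3, so selected_samples is a 2x3 array.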
By binary matrix, I mean every element in the matrix is either 0 or 1, and I use the Matrix class in numpy for this.
First of all, is there a specific type of matrix in numpy for it, or do we simply use a matrix that is populated with 0s and 1s?
Second, what is the quickest way for creating a square matrix full of 0s given its dimension with the Matrix class? Note: numpy.zeros((dim, dim)) is not what I want, as it creates a 2-D array with float 0.
Third, I want to get and set any given row of the matrix frequently. For get, I can think of using row = my_matrix.A[row_index].tolist(), which will return a list representation of the given row. For set, it seems that I can just do my_matrix[row_index] = row_list, with row_list being a list of the same length as the given row. Again, I wonder whether they are the most efficient methods for doing the jobs.
To make a numpy array whose elements can be either 0 or 1, use the dtype = 'bool' parameter:
arr = np.zeros((dim,dim), dtype = 'bool')
Or, to convert arr to a numpy matrix:
arr = np.matrix(arr)
To access a row:
arr[row_num]
and to set a row:
arr[row_num] = new_row
is the quickest way.
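Put together, a minimal sketch looks like this (dim and the row values are arbitrary; use dtype=np.uint8 instead if you want integer 0s and 1s rather than booleans):

```python
import numpy as np

dim = 4
arr = np.zeros((dim, dim), dtype=bool)  # square array of False

# Setting a row: the assigned values are cast to bool.
arr[2] = [1, 0, 1, 1]

# Getting a row back as a plain Python list.
row = arr[2].tolist()
```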
I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:
Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.
Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.
For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.
The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.
The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.
ANOVAInputMatrixValuesArray=zeros([len(TestIDs),9],float)
j=0
for j in range(0,len(TestIDs)):
    TestID=str(TestIDs[j])
    ReadOrWrite='Read'
    fileName=inputFileName
    directory=GetCurrentDirectory(arguments that return correct directory)
    inputfile=open(directory,'r')
    reader=csv.reader(inputfile)
    m=0
    for row in reader:
        if m<9:
            if row[0]!='TestID':
                ANOVAInputMatrixValuesArray[(j-1),m]=row[2]
                m+=1
    inputfile.close()
IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape)
locs = IndicesOfZeros[:,ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))
Can anyone help me with this?
import numpy as np
a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])
columns = (a != 0).sum(0)
rows = (a != 0).sum(1)
The variable (a != 0) is an array of the same shape as original a and it contains True for all non-zero elements.
The .sum(x) function sums the elements over the axis x. Sum of True/False elements is the number of True elements.
The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:
columns = np.array([2, 1, 3])
rows = np.array([2, 3, 1])
EDIT: The whole code could look like this (with a few simplifications in your original code):
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j,:] = loadtxt(csvfile, comments='TestId', delimiter=';', usecols=(2,))[:9]
nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)
EDIT 2:
To get the mean value of all columns/rows, use the following:
colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)
What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
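One possible way to handle that case is to return NaN wherever a row or column has no non-zero elements. This is only a sketch of that choice; the helper name nonzero_mean is made up.

```python
import numpy as np

def nonzero_mean(a, axis):
    """Hypothetical helper: mean over non-zero elements, NaN where there are none."""
    a = np.asarray(a, dtype=float)
    counts = (a != 0).sum(axis)   # number of non-zero elements along the axis
    sums = a.sum(axis)
    out = np.full(sums.shape, np.nan)
    # Divide only where the count is positive; other slots stay NaN.
    np.divide(sums, counts, out=out, where=counts > 0)
    return out

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 0]])
row_means = nonzero_mean(a, 1)
col_means = nonzero_mean(a, 0)
```

The last row is all zeros, so its entry in row_means is NaN instead of triggering a division-by-zero warning.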
A fast way to count nonzero elements per row in a scipy sparse matrix m is:
np.diff(m.tocsr().indptr)
The indptr attribute of a CSR matrix indicates the indices within the data corresponding to the boundaries between rows. So calculating the difference between each entry will provide the number of non-zero elements in each row.
Similarly, for the number of nonzero elements in each column, use:
np.diff(m.tocsc().indptr)
If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) in Marat and Finn's solutions.
If you need both row and column nonzero counts, and, say, m is already a CSR, you might use:
row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])
which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details.
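A quick check of both counts on a toy matrix. Note that np.bincount needs minlength set to the number of columns here; otherwise trailing all-zero columns are silently left out of the result.

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 1, 1, 0],
                         [0, 1, 0, 0]]))

# Stored entries per row: differences between consecutive row boundaries.
row_nonzeros = np.diff(m.indptr)

# Stored entries per column: count how often each column index occurs.
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])

# Without minlength the last (empty) column is missing from the counts.
col_nonzeros_short = np.bincount(m.indices)
```

Both approaches count stored entries, so explicitly stored zeros would be included unless eliminate_zeros() is called first.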
The faster way is to clone your matrix with ones instead of real values. Then just sum up by rows or columns:
X_clone = X.tocsc()
X_clone.data = np.ones( X_clone.data.shape )
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)
That worked 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53)
edit:
Perhaps you will need to translate NumNonZeroElementsByColumn into 1-dimensional array by
np.array(NumNonZeroElementsByColumn)[0]
For sparse matrices, use the getnnz() function supported by CSR/CSC matrix.
E.g.
a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
a.getnnz(axis=0)
array([0, 2, 1])
(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.
For sparse matrices I did:
(i,j) = X.nonzero()
column_sums = np.zeros(X.shape[1])
for n in np.asarray(j).ravel():
column_sums[n] += 1.
I wonder if there is a more elegant way.
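One way to avoid the Python loop, sketched on a small made-up lil_matrix, is to count the indices returned by nonzero() with np.bincount:

```python
import numpy as np
from scipy.sparse import lil_matrix

X = lil_matrix((3, 4))
X[0, 1] = 5
X[2, 1] = 1
X[2, 3] = 2

i, j = X.nonzero()  # row and column indices of the non-zero entries

# Count occurrences of each index; minlength keeps empty rows/columns.
column_counts = np.bincount(j, minlength=X.shape[1])
row_counts = np.bincount(i, minlength=X.shape[0])
```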