Sum over rows in scipy.sparse.csr_matrix - python

I have a big csr_matrix and I want to add over rows and obtain a new csr_matrix with the same number of columns but reduced number of rows. (Context: The matrix is a document-term matrix obtained from sklearn CountVectorizer and I want to be able to quickly combine documents according to codes associated with these documents)
For a minimal example, this is my matrix:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print A.toarray()
[[1 0 0 0 0]
[0 0 3 0 0]
[0 5 0 0 0]
[4 0 0 0 0]
[0 0 2 0 0]]
No let's say I want a new matrix B in which rows (1, 4) and (2, 3, 5) are combined by summing them, which would look something like this:
[[5 0 0 0 0]
[0 5 5 0 0]]
And should be again in sparse format (because the real data I'm working with is large). I tried to sum over slices of the matrix and then stack it:
idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))
But this gives me the summed up values just for the non-zero columns in the slice, so I can't combine it with the other slices because the number of columns in the summed slices are different.
I feel like there must be an easy way to do this. But I couldn't find any discussion of this online or in the documentation. What am I missing?
Thank you for your help

Note that you can do this by carefully constructing another matrix. Here's how it would work for a dense matrix:
>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
>>>
The sparse version is only a little more complicated. The information about which rows should be summed together is encoded in row:
col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
result = S * A
# check that the result is another sparse matrix
print type(result)
# check that the values are the ones we want
print result.toarray()
Output:
<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
[0 5 5 0 0]]
You can handle more rows in your output by including higher values in row and extending the shape of S accordingly.

The indexing should be:
idx1 = [0, 3] # rows 1 and 4
idx2 = [1, 2, 4] # rows 2,3 and 5
Then you need to keep A_sub1 and A_sub2 in sparse format and use axis=0:
A_sub1 = csr_matrix(A[idx1, :].sum(axis=0))
A_sub2 = csr_matrix(A[idx2, :].sum(axis=0))
B = vstack((A_sub1, A_sub2))
B.toarray()
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
Note, I think the A[idx, :].sum(axis=0) operations involve conversion from sparse matrices - so #Mr_E's answer is probably better.
Alternatively, it works when you use axis=0 and np.vstack (as opposed to scipy.sparse.vstack):
A_sub1 = A[idx1, :].sum(axis=0)
A_sub2 = A[idx2, :].sum(axis=0)
np.vstack((A_sub1, A_sub2))
Giving:
matrix([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])

Related

Efficiently convert Numpy 2D array of counts to zero-padded 2D array of indices?

I have a numpy 2D array of n rows (observations) X m columns (features), where each element is the count of times that feature was observed. I need to convert it to a zero-padded 2D array of feature_indices, where each feature_index is repeated a number of times corresponding to the 'count' in the original 2D array.
This seems like it should be a simple combo of np.where with np.repeat or just expansion using indexing, but I'm not seeing it. Here's a very slow, loopy solution (way too slow to use in practice):
# Loopy solution (way too slow!)
def convert_2Dcountsarray_to_zeropaddedindices(countsarray2D):
rowsums = np.sum(countsarray2D,1)
max_rowsum = np.max(rowsums)
out = []
for row_idx, row in enumerate(countsarray2D):
out_row = [0]*int(max_rowsum - rowsums[row_idx]) #Padding zeros so all out_rows same length
for ele_idx in range(len(row)):
[out_row.append(x) for x in np.repeat(ele_idx, row[ele_idx]) ]
out.append(out_row)
return np.array(out)
# Working example
countsarray2D = np.array( [[1,2,0,1,3],
[0,0,0,0,3],
[0,1,1,0,0]] )
# Shift all features up by 1 (i.e. add a dummy feature 0 we will use for padding)
countsarray2D = np.hstack( (np.zeros((len(countsarray2D),1)), countsarray2D) )
print(convert_2Dcountsarray_to_zeropaddedindices(countsarray2D))
# Desired result:
array([[1 2 2 4 5 5 5]
[0 0 0 0 5 5 5]
[0 0 0 0 0 2 3]])
One solution would be to flatten the array and use np.repeat.
This solution requires first adding the number of zeros to use as padding for each row to countsarray2D. This can be done as follows:
counts = countsarray2D.sum(axis=1)
max_count = max(counts)
zeros_to_add = max_count - counts
countsarray2D = np.c_[zeros_to_add, countsarray2D]
The new countsarray2D is then:
array([[0, 1, 2, 0, 1, 3],
[4, 0, 0, 0, 0, 3],
[5, 0, 1, 1, 0, 0]])
Now, we can flatten the array and use np.repeat. An index array A is used as the input array while countsarray2D determines the number of times each index value should be repeated.
n_rows, n_cols = countsarray2D.shape
A = np.tile(np.arange(n_cols), (n_rows, 1))
np.repeat(A, countsarray2D.flatten()).reshape(n_rows, -1)
Final result:
array([[1, 2, 2, 4, 5, 5, 5],
[0, 0, 0, 0, 5, 5, 5],
[0, 0, 0, 0, 0, 2, 3]])

Reorder a square array using a sorted 1D array

Let's say I have a symmetric n-by-n array A and a 1D array x of length n, where the rows/columns of A correspond to the entries of x, and x is ordered. Now assume both A and x are randomly rearranged, so that the rows/columns still correspond but they're no longer in order. How can I manipulate A to recover the correct order?
As an example: x = array([1, 3, 2, 0]) and
A = array([[1, 3, 2, 0],
[3, 9, 6, 0],
[2, 6, 4, 0],
[0, 0, 0, 0]])
so the mapping from x to A in this example is A[i][j] = x[i]*x[j]. x should be sorted like array([0, 1, 2, 3]) and I want to arrive at
A = array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
I guess that OP is looking for a flexible way to use indices that sorts both rows and columns of his mapping at once. What is more, OP might be interested in doing it in reverse, i.e. find and initial view of mapping if it's lost.
def mapping(x, my_map, return_index=True, return_inverse=True):
idx = np.argsort(x)
out = my_map(x[idx], x[idx])
inv = np.empty_like(idx)
inv[idx] = np.arange(len(idx))
return out, idx, inv
x = np.array([1, 3, 2, 0])
a, idx, inv = mapping(x, np.multiply.outer) #sorted mapping
b = np.multiply.outer(x, x) #straight mapping
print(b)
>>> [[1 3 2 0]
[3 9 6 0]
[2 6 4 0]
[0 0 0 0]]
print(a)
>>> [[0 0 0 0]
[0 1 2 3]
[0 2 4 6]
[0 3 6 9]]
np.array_equal(b, a[np.ix_(inv, inv)]) #sorted to straight
>>> True
np.array_equal(a, b[np.ix_(idx, idx)]) #straight to sorted
>>> True
A simple implementation would be
idx = np.argsort(x)
A = A[idx, :]
A = A[:, idx]
Another possibility would be (all credit to #mathfux):
A[np.ix_(idx, idx)]
You can use argsort and fancy indexing:
idx = np.argsort(x)
A2 = A[idx[None], idx[:,None]]
output:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])

Transform an array to 1s and 0s using another array

I have two arrays
arr1 = np.array([[4, 1, 3, 2, 5], [5, 2, 4, 1, 3]])
arr2 = np.array([[2], [1]])
I want to transform array 1 to a binary array using the elements of the array 2 in the following way
For row 1 of array 1, I want to use the row 1 of array 2 i.e. 2 - to make the top 2 values of array 1 as 1s and the rest as 0s
Similarly for row 2 of array 1, I want to use the row 2 of array 2 i.e. 1 - to make the top 1 value of array 1 as 1s and the rest as 0s
So arr1 would get transformed as follows
arr1_transformed = np.array([[1, 0, 0, 0, 1], [1, 0, 0, 0, 0]])
Here is what I tried.
arr1_sorted_indices = np.argosrt(-arr1)
This gave me the indices of the sorted array
array([[1, 3, 2, 0, 4],
[3, 1, 4, 2, 0]])
Now I think I need to mask this array with the help of arr2 to get the desired output and I'm not sure how to do it.
this should do the job in the mentioned case:
def trasform_arr(arr1,arr2):
for i in range(0,len(arr1)):
if i >= len(arr2):
arr1[i] = [0 for x in arr1[i]]
else:
sorted_arr = sorted(arr1[i])[-arr2[i][0]:]
arr1[i] = [1 if x in sorted_arr else 0 for x in arr1[i]]
arr1 = [[4, 1, 3, 2, 5], [5, 2, 4, 1, 3]]
arr2 = [[2], [1]]
trasform_arr(arr1,arr2)
print(arr1)
You can try the following:
import numpy as np
arr1 = np.array([[4, 1, 3, 2, 5], [5, 2, 4, 1, 3]])
arr2 = np.array([[2],[1]])
r, c = arr1.shape
s = np.argsort(np.argsort(-arr1))
out = (np.arange(c) < arr2)[np.c_[0:r], s] * 1
print(out)
It gives:
[[1 0 0 0 1]
[1 0 0 0 0]]

Python/NumPy: find the first index of zero, then replace all elements with zero after that for each row

I have an numpy array like this:
a = np.array([[1, 0, 1, 1, 1],
[1, 1, 1, 1, 0],
[1, 0, 0, 1, 1],
[1, 0, 1, 0, 1]])
Question 1:
As shown in the title, I want to replace all elements with zero after the first zero appeared. The result should be like this :
a = np.array([[1, 0, 0, 0, 0],
[1, 1, 1, 1, 0],
[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0]])
Question 2: how to slice different columns for each row like this example?
As I am dealing with an array with large size. If any one could find an efficient way to solve this please. Thank you very much.
One way to accomplish question 1 is to use numpy.cumprod
>>> np.cumprod(a, axis=1)
array([[1, 0, 0, 0, 0],
[1, 1, 1, 1, 0],
[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0]])
Question 1:
You could iterate over the array like so:
for i in range(a.shape[0]):
j = 0
row = a[i]
while row[j]>0:
j += 1
row[j+1:] = 0
This will change the array in-place. If you are interested in very high performance, the answers to this question could be of use to find the first zero faster. np.where scans the entire array for this and therefore is not optimal for the task.
Actually, the fastest solution will depend a bit on the distribution of your array entries: If there are many floats in there and rarely is there ever a zero, the while loops in the code above will interrupt late on average, requiring to write only "a few" zeros. If however there are only two possible entries like in your sample array and these occur with a similar probability (i.e. ~50%), there would be a lot of zeros to be written to a, and the following will be faster:
b = np.zeros(a.shape)
for i in range(a.shape[0]):
j = 0
a_row = a[i]
b_row = b[i]
while a_row[j]>0:
b_row[j] = a_row[j]
j += 1
Question 2:
If you mean to slice each row individually on a similar criterion dealing with a first occurence of some kind, you could simply adapt this iteration pattern. If the criterion is more global (like finding the maximum of the row, for example) built-in methods like np.where exist that will be more efficient, but it probably would depend a bit on the criterion itself which choice is best.
Question 1: An efficient way to do this would be the following.
import numpy as np
a = np.array([[1, 0, 1, 1, 1],
[1, 1, 1, 1, 0],
[1, 0, 0, 1, 1],
[1, 0, 1, 0, 1]])
for row in a:
zeros = np.where(row == 0)[0]
if (len(zeros)):# Check if zero exists
row[zeros[0]:] = 0
print(a)
Output:
[[1 0 0 0 0]
[1 1 1 1 0]
[1 0 0 0 0]
[1 0 0 0 0]]
Question 2: Using the same array, for each row rowIdx, you can have a array of columns colIdxs that you want to extract from.
rowIdx = 2
colIdxs = [1, 3, 4]
print(a[rowIdx, colIdxs])
Output:
[0 1 1]
I prefer Ayrat's creative answer for the first question, but if you need to slice different columns for different rows in large size, this could help you:
indexer = tuple(np.s_[i:a.shape[1]] for i in (a==0).argmax(axis=1))
for i,j in enumerate(indexer):
a[i,j]=0
indexer:
(slice(1, 5, None), slice(4, 5, None), slice(1, 5, None), slice(1, 5, None))
or:
indexer = (a==0).argmax(axis=1)
for i in range(a.shape[0]):
a[i,indexer[i]:]=0
indexer:
[1 4 1 1]
output:
[[1 0 0 0 0]
[1 1 1 1 0]
[1 0 0 0 0]
[1 0 0 0 0]]

Numpy assign an array value based on the values of another array with column selected based on a vector

I have a 2 dimensional array
X
array([[2, 3, 3, 3],
[3, 2, 1, 3],
[2, 3, 1, 2],
[2, 2, 3, 1]])
and a 1 dimensional array
y
array([1, 0, 0, 1])
For each row of X, i want to find the column index where X has the lowest value and y has a value of 1, and set the corresponding row column pair in a third matrix to 1
For example, in case of the first row of X, the column index corresponding to the minimum X value (for the first row only) and y = 1 is 0, then I want Z[0,0] = 1 and all other Z[0,i] = 0.
Similarly, for the second row, column index 0 or 3 gives the lowest X value with y = 1. Then i want either Z[1,0] or Z[1,3] = 1 (preferably Z[1,0] = 1 and all other Z[1,i] = 0, since 0 column is the first occurance)
My final Z array will look like
Z
array([[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 0, 1]])
One way to do this is using masked arrays.
import numpy as np
X = np.array([[2, 3, 3, 3],
[3, 2, 1, 3],
[2, 3, 1, 2],
[2, 2, 3, 1]])
y = np.array([1, 0, 0, 1])
#get a mask in the shape of X. (True for places to ignore.)
y_mask = np.vstack([y == 0] * len(X))
X_masked = np.ma.masked_array(X, y_mask)
out = np.zeros_like(X)
mins = np.argmin(X_masked, axis=0)
#Output: array([0, 0, 0, 3], dtype=int64)
#Now just set the indexes to 1 on the minimum for each axis.
out[np.arange(len(out)), mins] = 1
print(out)
[[1 0 0 0]
[1 0 0 0]
[1 0 0 0]
[0 0 0 1]]
you can use numpy.argmin(), to get the indexes of the min value at each row of X. For example:
import numpy as np
a = np.arange(6).reshape(2,3) + 10
ids = np.argmin(a, axis=1)
Similarly, you can the indexes where y is 1 by either numpy.nonzero or numpy.where.
Once you have the two index arrays setting the values in third array should be quite easy.

Categories

Resources