Consecutive elements in a Sparse matrix row - python

I am working on a sparse matrix stored in COO format. What would be the fastest way to get the number of groups of consecutive non-zero elements in each row?
For example consider the following matrix:
a = [[0,1,2,0],[1,0,0,2],[0,0,0,0],[1,0,1,0]]
Its COO representation would be
(0, 1) 1
(0, 2) 2
(1, 0) 1
(1, 3) 2
(3, 0) 1
(3, 2) 1
I need the result to be [1,2,0,2]. The first row contains two non-zero elements that lie next to each other, so they form one group. The second row has two non-zero elements, but they are not adjacent, so they form two groups. The third row has no non-zero elements and hence no groups. The fourth row again has two non-zero elements separated by a zero, so we count two groups. In other words, I want the number of clusters per row. Iterating through the rows is an option, but only if there is no faster solution. Any help in this regard is appreciated.
Another simple example: consider the following row:
[1,2,3,0,0,0,2,0,0,8,7,6,0,0]
The above row should return [3] since there are three groups of non-zero elements separated by zeros.

Convert it to a dense array and apply your logic row by row, because:
You want the number of groups per row.
Zeros count when defining groups.
Row iteration is faster with arrays.
In coo format your matrix looks like:
In [623]: M=sparse.coo_matrix(a)
In [624]: M.data
Out[624]: array([1, 2, 1, 2, 1, 1])
In [625]: M.row
Out[625]: array([0, 0, 1, 1, 3, 3], dtype=int32)
In [626]: M.col
Out[626]: array([1, 2, 0, 3, 0, 2], dtype=int32)
This format does not implement row indexing; csr and lil do:
In [627]: M.tolil().data
Out[627]: array([[1, 2], [1, 2], [], [1, 1]], dtype=object)
In [628]: M.tolil().rows
Out[628]: array([[1, 2], [0, 3], [], [0, 2]], dtype=object)
So the sparse information for the 1st row is a list of nonzero data values, [1,2], and list of their column numbers, [1,2]. Compare that with the row of the dense array, [0, 1, 2, 0]. Which is easier to analyze?
Your first task is to write a function that analyzes one row. I haven't studied your logic enough to say whether the dense form is better than the sparse one or not. It is easy to get the column information from the dense form with M.A[0,:].nonzero().
In your last example, I can get the nonzero indices:
In [631]: np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])
Out[631]: (array([ 0, 1, 2, 6, 9, 10, 11], dtype=int32),)
In [632]: idx=np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])[0]
In [633]: idx
Out[633]: array([ 0, 1, 2, 6, 9, 10, 11], dtype=int32)
In [634]: np.diff(idx)
Out[634]: array([1, 1, 4, 3, 1, 1], dtype=int32)
We may be able to get the desired count from the number of diff values >1, though I'd have to look at more examples to define the details.
Extension of the analysis to multiple rows depends on first thoroughly understanding the single row case.
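The single-row idea above can be sketched as follows. This is only a minimal illustration, assuming a group is a run of adjacent non-zero columns, so that every diff value greater than 1 starts a new group; the helper name groups_in_row is made up for this example.

```python
import numpy as np

def groups_in_row(row):
    """Count runs of consecutive non-zero entries in a single dense row."""
    idx = np.nonzero(row)[0]          # column positions of the non-zero entries
    if idx.size == 0:
        return 0                      # an all-zero row has no groups
    # The first non-zero entry opens a group; each gap larger than 1 opens another.
    return 1 + int(np.count_nonzero(np.diff(idx) > 1))

print(groups_in_row([1, 2, 3, 0, 0, 0, 2, 0, 0, 8, 7, 6, 0, 0]))  # 3
```

Applied to the rows of the matrix a from the question, this gives 1, 2, 0 and 2.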

With the help of @hpaulj's comment I came up with the following snippet to do this:
# m is the sparse matrix; cluster is a list defined elsewhere
M = m.tolil()
r = []
for i in range(M.shape[0]):
    idx = M.rows[i]  # sorted column indices of the non-zeros in row i
    if len(idx) > 0:
        # one group, plus one more for every gap between neighbouring indices
        counts = 1 + int(np.sum(np.diff(idx) > 1))
    else:
        counts = 0
    r.append(counts)
# average number of groups per row
tempcluster = np.sum(r) / float(M.shape[0])
cluster.append(tempcluster)
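For large matrices the Python loop over rows can be avoided. The following is a hedged sketch rather than a benchmarked solution: it assumes the input is any scipy.sparse matrix, and the function name groups_per_row is invented for illustration. It marks every stored entry that starts a new group and counts those marks per row with np.bincount.

```python
import numpy as np
from scipy import sparse

def groups_per_row(m):
    """Number of runs of adjacent non-zero columns in each row of a sparse matrix."""
    csr = sparse.csr_matrix(m, copy=True)   # work on a copy, leave the input untouched
    csr.sum_duplicates()                     # merge repeated (row, col) entries
    csr.eliminate_zeros()                    # explicitly stored zeros must not count
    csr.sort_indices()                       # columns ascending within each row
    coo = csr.tocoo()
    row, col = coo.row, coo.col
    if row.size == 0:
        return np.zeros(csr.shape[0], dtype=int)
    # An entry starts a group unless the previous stored entry sits in the
    # same row and in the column immediately to its left.
    starts = np.ones(row.size, dtype=bool)
    starts[1:] = (row[1:] != row[:-1]) | (col[1:] != col[:-1] + 1)
    return np.bincount(row[starts], minlength=csr.shape[0])

a = [[0, 1, 2, 0], [1, 0, 0, 2], [0, 0, 0, 0], [1, 0, 1, 0]]
print(groups_per_row(sparse.coo_matrix(a)))  # [1 2 0 2]
```

The mean of the returned array corresponds to the tempcluster value computed in the loop version.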

Related

Sort matrix columns based on the values in the first row

I am currently trying some beginner matrix handling exercises, but I am unsure how to sort the columns of an n×n matrix by each column's first element.
It should be a method that could work on any size matrix, as it will not be the same size matrix every time.
Anyone who has any good suggestions?
The implementation here can be very simple depending on how the data, i.e. the matrix, is represented. If it is given as a list of column-lists, it just needs a sort. For the given example:
>>> m = [[2, 3, 7], [-1, -2, 5.2], [0, 1, 4], [2, 4, 5]]
>>> y = sorted(m, key=lambda x: x[0])
>>> y
[[-1, -2, 5.2], [0, 1, 4], [2, 3, 7], [2, 4, 5]]
Other representations might need a more complex approach. For example, if the matrix is given as a list of rows:
>>> m = [[2, -1, 0, 2], [3, -2, 1, 4], [7, 5.2, 4, 5]]
>>> order = sorted(range(len(m[0])), key=lambda x: m[0][x])
>>> order
[1, 2, 0, 3]
>>> y = [[row[x] for x in order] for row in m]
>>> y
[[-1, 0, 2, 2], [-2, 1, 3, 4], [5.2, 4, 7, 5]]
The idea here is that first, we will get the order the elements are going to be in based on the first row. We can do that by sorting range(4), so [0, 1, 2, 3] with the sorting key (the value used for sorting) being the i-th value of the first row.
The result is that we get [1, 2, 0, 3] which says: Column index 1 is first, then index 2, 0 and finally 3.
Now we want to create a new matrix where every row follows that order which we can do with a list comprehension over the original matrix, where for each row, we create a new list that has the elements of the row according to the order we determined before.
Note that this approach creates new lists for the whole matrix - if you're dealing with large matrices, you probably want to use the appropriate primitives from numpy and swap the elements around in place.
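As a rough illustration of the numpy route mentioned above, the sketch below sorts the columns with np.argsort and fancy indexing. It assumes the matrix is stored as a 2-D array of rows; note that fancy indexing still returns a new array rather than reordering in place, and kind='stable' is chosen here so that ties keep their original column order.

```python
import numpy as np

m = np.array([[2, -1, 0, 2],
              [3, -2, 1, 4],
              [7, 5.2, 4, 5]])

# Column order that sorts the first row; a stable sort keeps tied columns in place.
order = np.argsort(m[0], kind='stable')
sorted_m = m[:, order]   # reorder every row's columns in one step

print(order)     # [1 2 0 3]
print(sorted_m)
```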
If matrix is your input, you can do:
result = list(zip(*sorted(zip(*matrix))))
So working from inside out, this expression does:
zip: iterates over the transpose of the matrix (rows become columns and vice versa).
sorted: sorts the transposed matrix. No need to provide a custom key; the sorting is by the first element of each row of the transpose (a column of the original matrix). If there is a tie, it falls back to the second element, and so on.
zip: iterates over the transpose of the transposed matrix, i.e. transposes it back to its original orientation.
list: turns the iterable into a list (a matrix whose rows are tuples).
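A small worked example of the zip approach, using the row-based matrix from the other answer as assumed input:

```python
matrix = [[2, -1, 0, 2],
          [3, -2, 1, 4],
          [7, 5.2, 4, 5]]

# Transpose, sort the columns (now rows) lexicographically, transpose back.
result = list(zip(*sorted(zip(*matrix))))

print(result)
# [(-1, 0, 2, 2), (-2, 1, 3, 4), (5.2, 4, 7, 5)]
```

Unlike sorting with a key on the first element only, ties in the first row are broken here by the values further down each column.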

How to get indexes of all maximum values in python multidimensional np array

I would like to get all indexes of maximum values for each row from ndarray.
For example, I have
arr = np.array([[1, 3, 3], [1, 5, 4]])
And I would like to get the indexes of all 3's from the first row, and all 5's from the second row. I tried:
np.where(((arr == arr[0].max()) | (arr == arr[1].max())))
which returns
(array([0, 0, 1], dtype=int64), array([1, 2, 1], dtype=int64))
I want something like that, but more universal, for any number of rows.
np.where(arr == arr.argmax()) doesn't work the way I want it to. It only returns the index of the first maximum that it meets in each row.
@Paul's answer in the comments might be the best you can find. Writing it up for readers: arr.max(1) finds the max in each row, and arr==arr.max(1,keepdims=True) marks all elements in each row that are equal to the corresponding max in that row. Finally, nonzero returns the indices of those elements:
(arr==arr.max(axis=1,keepdims=True)).nonzero()
output for OP's example:
(array([0, 0, 1]), array([1, 2, 1]))
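If the column indexes are wanted grouped per row rather than as two flat arrays, one possible follow-up (an assumption about the desired output shape, not something the question states) is to split the column array at the row boundaries:

```python
import numpy as np

arr = np.array([[1, 3, 3], [1, 5, 4]])

mask = arr == arr.max(axis=1, keepdims=True)   # True where an element equals its row max
rows, cols = mask.nonzero()                    # flat (row, col) index arrays

# nonzero() returns entries in row-major order, so cols can be cut
# into one chunk per row using the number of maxima in each row.
per_row = np.split(cols, np.cumsum(mask.sum(axis=1))[:-1])

print([c.tolist() for c in per_row])  # [[1, 2], [1]]
```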

Indexing and replacing values in sparse CSC matrix (Python)

I have a sparse CSC matrix, "A", in which I want to replace the first row with a vector that is all zeros, except for the first entry which is 1.
So far I am doing the inefficient version, e.g.:
import numpy as np
from scipy.sparse import csc_matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, (row, col)), shape=(3, 3))
replace = np.zeros(3)
replace[0] = 1
A[0,:] = replace
A.eliminate_zeros()
But I'd like to do it with .indptr, .data, etc. As it is a CSC, I am guessing that this might be inefficient as well? In my exact problem, the matrix is 66000 X 66000.
For a CSR sparse matrix I've seen it done as
A.data[1:A.indptr[1]] = 0
A.data[0] = 1.0
A.indices[0] = 0
A.eliminate_zeros()
So, basically I'd like to do the same for a CSC sparse matrix.
Expected result: To do exactly the same as above, just more efficiently (applicable to very large sparse matrices).
That is, start with:
[1, 0, 4],
[0, 0, 5],
[2, 3, 6]
and replace the upper row with a vector that is as long as the matrix, is all zeros except for 1 at the beginning. As such, one should end with
[1, 0, 0],
[0, 0, 5],
[2, 3, 6]
And be able to do it for large sparse CSC matrices efficiently.
Thanks in advance :-)
You can do it with indptr and indices. The same matrix can be constructed directly from the indptr and indices parameters:
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, indices, indptr), shape=(3,3))
To zero the first row, set the data values to zero wherever indices is zero (in CSC, indices holds the row number of each stored value). In other words:
data[indices == 0] = 0
The above line sets all the elements of the first row to 0. To avoid setting the first element to zero we can do the following:
indices_tmp = indices == 0
indices_tmp[0] = False # to avoid removing the first element in row 0.
data[indices_tmp] = 0
A = csc_matrix((data, indices, indptr), shape=(3,3))
Note that this relies on the first stored value being the (0, 0) entry, and it keeps that entry's existing value (already 1 in this example) rather than writing 1 into it.
Hope it helps.
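A slightly more general sketch, for the case where the (0, 0) entry may be missing or may not already equal 1: zero every stored value in row 0, drop the explicit zeros, then add a matrix holding a single 1 at (0, 0). The function name set_first_row_to_e0 is hypothetical, and this has not been timed on a 66000 x 66000 matrix.

```python
import numpy as np
from scipy.sparse import csc_matrix

def set_first_row_to_e0(A):
    """Return a CSC copy of A whose first row is [1, 0, 0, ...]."""
    A = A.tocsc(copy=True)
    A.data[A.indices == 0] = 0        # in CSC, indices holds row numbers
    A.eliminate_zeros()               # remove the now-zero stored entries
    # Single-entry matrix with a 1 at position (0, 0).
    e00 = csc_matrix(([1], ([0], [0])), shape=A.shape, dtype=A.dtype)
    return (A + e00).tocsc()

row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, (row, col)), shape=(3, 3))

B = set_first_row_to_e0(A)
print(B.toarray())
```

Adding two sparse matrices avoids the SparseEfficiencyWarning that assigning A[0, 0] = 1 can trigger when the entry is not already stored.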

numpy argmax when values are equal

I got a numpy matrix and I want to get the index of the maximum value in each row. E.g.
[[1,2,3],[1,3,2],[3,2,1]]
will return
[2,1,0]
However, when a row contains more than one maximum value, numpy.argmax only returns the smallest index. E.g.
[[0,0,0],[0,0,0],[0,0,0]]
will return
[0,0,0]
Can I change the default (smallest index) to some other values? E.g. when there're equal maximum values, return 1 or None, so that the above result will be
[1,1,1]
or
[None, None, None]
If I can do this in TensorFlow that'll be better.
Thanks!
You can use np.partition to find the two largest values and check if they are equal, and then use that as a mask in np.where to set the default value:
In [228]: a = np.array([[1, 2, 3, 2], [3, 1, 3, 2], [3, 5, 2, 1]])
In [229]: twomax = np.partition(a, -2)[:, -2:].T
In [230]: default = -1
In [231]: argmax = np.where(twomax[0] != twomax[1], np.argmax(a, -1), default)
In [232]: argmax
Out[232]: array([ 2, -1, 1])
A convenient value of "default" is -1, as argmax will not return that on its own. None does not fit in an integer array. A masked array is also an option, but I didn't go that far.
Here is a NumPy implementation:
def my_argmax(a):
    rows = np.where(a == a.max(axis=1)[:, None])[0]
    rows_multiple_max = rows[:-1][rows[:-1] == rows[1:]]
    my_argmax = a.argmax(axis=1)
    my_argmax[rows_multiple_max] = -1
    return my_argmax
Example of use:
import numpy as np
a = np.array([[0, 0, 0], [4, 5, 3], [3, 4, 4], [6, 2, 1]])
my_argmax(a) # array([-1, 1, -1, 0])
Explanation: where selects the indexes of all maximal elements in each row. If a row has multiple maxima, the row number will appear more than once in rows array. Since this array is already sorted, such repetition is detected by comparing consecutive elements. This identifies the rows with multiple maxima, after which they are masked in the output of NumPy's argmax method.
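The same idea can also be written by counting the maxima in each row directly. This is an alternative sketch, not the answer's own code; the name argmax_or_default and the default parameter are made up for illustration.

```python
import numpy as np

def argmax_or_default(a, default=-1):
    """Row-wise argmax, replaced by `default` where the row maximum is not unique."""
    a = np.asarray(a)
    is_max = a == a.max(axis=1, keepdims=True)   # marks every maximal element per row
    has_tie = is_max.sum(axis=1) > 1             # rows with more than one maximum
    return np.where(has_tie, default, a.argmax(axis=1))

a = np.array([[0, 0, 0], [4, 5, 3], [3, 4, 4], [6, 2, 1]])
print(argmax_or_default(a))  # [-1  1 -1  0]
```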

Adding sparse matrices with explicit column and row order and different shapes

Let's say I have two sparse matrices, scipy.sparse.csr_matrix to be precise, that I would like to add element-wise, with the added problem that they have an ID to each row and column, corresponding to a word.
For instance, one matrix might have columns and rows that correspond to ['cat', 'hat'], in that particular order. Another matrix could then have columns and rows that correspond to ['cat', 'mat', 'hat']. This means that when adding these matrices, I need to take into account the following things:
The matrices might have corresponding columns and rows in different orders.
The matrices might not be of the same shape.
Some columns and rows in one matrix might not be present in the other.
I have trouble coming up with a solution to this merging problem, and would hope that you could help me come up with an answer.
For added clarity, here's an example:
import scipy.sparse as sp
mat1_id2column = ['cat', 'hat']
mat1_id2row = ['cat', 'hat']
mat2_id2column = ['cat', 'mat', 'hat']
mat2_id2row = ['cat', 'mat', 'hat']
mat1 = sp.csr_matrix([[1, 0], [0, 1]])
mat2 = sp.csr_matrix([[1, 0, 1], [1, 0, 0], [0, 0, 1]])
merge(mat1, mat2)
#Expected output:
id2column = ['cat', 'hat', 'mat']
id2row = ['cat', 'hat', 'mat']
merged = sp.csr_matrix([[2, 1, 0], [0, 2, 0], [1, 0, 0]])
Any help is much appreciated!
Start by building a row id index that will be the union of the row ids of the 2 matrices. Then do the same for columns. Using this, you can now translate from coordinates in the old matrices to coordinates in the new result matrix.
Do you see how to finish it from there or should I be more explicit?
One way or another, you have to work out a unique mapping between your strings and the row/column indexes.
A start using dictionaries is:
from collections import defaultdict
def foo(dd, mat):
    # mat is a (matrix, row_ids, col_ids) tuple
    for ij, v in mat[0].todok().items():
        dd[(mat[1][ij[0]], mat[2][ij[1]])] += v

dd = defaultdict(int)
foo(dd, (mat1, mat1_id2row, mat1_id2column))
foo(dd, (mat2, mat2_id2row, mat2_id2column))
print(dd)
produces something like (key order and number formatting may vary)
defaultdict(<class 'int'>, {('cat', 'hat'): 1,
('hat', 'hat'): 2,
('mat', 'cat'): 1,
('cat', 'cat'): 2})
dd could then be turned back into a dok.
A different approach would take advantage of the way coo_matrix handles duplicates - they are added together when it is converted to a csr.
In this example take ['cat', 'mat', 'hat'] as the master index.
The three defining arrays for mat2 are then
data: array([1, 1, 1, 1])
row : array([0, 0, 1, 2])
col : array([0, 2, 0, 2])
For mat1 they would be (I haven't worked out the code to do this yet)
data: array([1, 1])
row : array([0, 2])
col : array([0, 2])
Concatenate the respective arrays, and create a new coo matrix, merged
data: array([1, 1, 1, 1, 1, 1])
row : array([0, 0, 1, 2, 0, 2])
col : array([0, 2, 0, 2, 0, 2])
merged.A would be
array([[2, 0, 1],
[1, 0, 0],
[0, 0, 2]])
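The COO approach described above could be filled in roughly as follows. This is a sketch under assumptions: the master index is built as the union of the ids in first-seen order (so it comes out as cat, hat, mat rather than the cat, mat, hat order used in the arrays above), and the helper names build_index and merge are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def build_index(*id_lists):
    """Map each distinct id to a position, in order of first appearance."""
    index = {}
    for ids in id_lists:
        for word in ids:
            index.setdefault(word, len(index))
    return index

def merge(mat_a, rows_a, cols_a, mat_b, rows_b, cols_b):
    row_index = build_index(rows_a, rows_b)
    col_index = build_index(cols_a, cols_b)
    data, row, col = [], [], []
    for mat, rids, cids in ((mat_a, rows_a, cols_a), (mat_b, rows_b, cols_b)):
        coo = mat.tocoo()
        rmap = np.array([row_index[w] for w in rids])   # old row number -> new row number
        cmap = np.array([col_index[w] for w in cids])   # old col number -> new col number
        data.append(coo.data)
        row.append(rmap[coo.row])
        col.append(cmap[coo.col])
    shape = (len(row_index), len(col_index))
    merged = sp.coo_matrix(
        (np.concatenate(data), (np.concatenate(row), np.concatenate(col))),
        shape=shape,
    ).tocsr()                                           # duplicates are summed here
    return merged, list(row_index), list(col_index)

mat1 = sp.csr_matrix([[1, 0], [0, 1]])
mat2 = sp.csr_matrix([[1, 0, 1], [1, 0, 0], [0, 0, 1]])
merged, id2row, id2column = merge(mat1, ['cat', 'hat'], ['cat', 'hat'],
                                  mat2, ['cat', 'mat', 'hat'], ['cat', 'mat', 'hat'])
print(id2row)            # ['cat', 'hat', 'mat']
print(merged.toarray())
```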
Another option is to use matrix multiplication to map the arrays onto larger ones that can be added. Again, I'm leaving the details of how to generate the mapping unspecified. You'd have to generate a separate T for each different sequence of words. That may require the same amount of iterative work as the other approaches.
import numpy as np
T1 = sp.csr_matrix(np.array([[1,0],[0,0],[0,1]]))
T2 = T1.T # same mapping for row and cols of mat1
T1.dot(mat1).dot(T2) + mat2
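One way the mapping matrix T might be generated from the word lists is sketched below. It assumes every id of the smaller matrix appears in the master list, and mapping_matrix is a made-up helper name.

```python
import numpy as np
import scipy.sparse as sp

def mapping_matrix(ids, master):
    """Sparse (len(master) x len(ids)) matrix with a 1 where ids[j] == master[i]."""
    position = {word: i for i, word in enumerate(master)}
    rows = [position[word] for word in ids]      # target position of each id
    cols = list(range(len(ids)))                 # source position of each id
    ones = np.ones(len(ids), dtype=int)
    return sp.csr_matrix((ones, (rows, cols)), shape=(len(master), len(ids)))

master = ['cat', 'mat', 'hat']
mat1 = sp.csr_matrix([[1, 0], [0, 1]])
mat2 = sp.csr_matrix([[1, 0, 1], [1, 0, 0], [0, 0, 1]])

T1 = mapping_matrix(['cat', 'hat'], master)
total = T1 @ mat1 @ T1.T + mat2      # lift mat1 to 3x3, then add
print(T1.toarray())
print(total.toarray())
```

With cat, mat, hat as the master order, T1 reproduces the hand-written matrix above.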
