Question
I have a CSR matrix, and I want to be able to retrieve the column indices and the values stored.
Data
For different reasons I'm not allowed to share my data, but here's a look (the numpy library is imported as np):
print(type(data) == type(ind) == list) # data and ind are lists
# OUT: True
print(len(data) == len(ind) == 134464) # data and ind have a size of 134,464
# OUT: True
print(np.alltrue([type(subarray) == np.ndarray for subarray in data])) # data (and ind) contains ndarray
# OUT: True
print(np.alltrue([len(data[i]) == len(ind[i]) for i in range(len(data))])) # each ndarray of data has the same length as the corresponding ndarray of ind
# OUT: True
print(min([len(data[i]) for i in range(len(data))]) >= 1) # each subarray of data (and of ind) has at least a length of 1
# OUT: True
print(np.alltrue([subarray.dtype == np.float64 for subarray in data])) # each subarray of data (and of ind) contains floats
# OUT: True
Code
Here is how I create the matrix (using csr_matrix from scipy.sparse):
indptr = np.empty(nbr_of_rows + 1) # nbr_of_rows = 134,464 = len(data)
indptr[0] = 0
for i in range(1, len(indptr)):
    indptr[i] = indptr[i-1] + len(data[i-1])
data = np.concatenate(data) # now I have type(data) = np.ndarray, data.dtype = np.float64 and len(data) = 2,821,574
ind = np.concatenate(ind) # same as above
X = csr_matrix((data, ind, indptr), shape=(nbr_of_rows, nbr_of_columns)) # nbr_of_columns = 3,991 = max(ind) + 1 (since min(ind) = 0)
print(f"The matrix has a shape of {X.shape} and a sparsity of {(1 - (X.nnz / (X.shape[0] * X.shape[1]))): .2%}.")
# OUT: The matrix has a shape of (134464, 3991) and a sparsity of 99.47%.
So far so good (at least I think so). But now, even though I manage to retrieve the column indices, I can’t successfully retrieve the values:
print(np.alltrue(ind == X.nonzero()[1])) # Retrieving the columns indices
# OUT: True
print(np.alltrue(data == X[X.nonzero()])) # Trying to retrieve the values
# OUT: False
print(np.alltrue(np.sort(data) == np.sort(X[X.nonzero()]))) # Seeing if the values are at least the same
# OUT: False
print(np.sum(data) == np.sum(X[X.nonzero()])) # Seeing if the values add up to the same total
# OUT: False
When I look deeper, I find that I get almost all the values (only a small number of mismatches):
print(len(data) == len(X[X.nonzero()].tolist()[0]))
# OUT: True
print(len(np.argwhere((data != X[X.nonzero()]))))
# OUT: 2184
So I get "only" 2,184 wrong values out of 2,821,574 total values.
Can someone please help me in getting all the correct values from my CSR matrix?
EDIT
I now know, thanks to @hpaulj, that I can use the class attributes X.indices and X.data to retrieve the CSR format index array and the CSR format data array of the matrix. However, I would still like to know why, in my case, np.alltrue(X[X.nonzero()] == X.data) does not hold.
Without your data I can't replicate your problem, and probably wouldn't want to do so even with such a large array.
But I'll try to illustrate what I expect to happen when constructing a matrix this way. From another question I have a small matrix in an IPython session:
In [60]: Mx
Out[60]:
<1x3 sparse matrix of type '<class 'numpy.intc'>'
with 2 stored elements in Compressed Sparse Row format>
In [61]: Mx.A
Out[61]: array([[0, 1, 2]], dtype=int32)
nonzero returns the COO-format indices (row, col):
In [62]: Mx.nonzero()
Out[62]: (array([0, 0], dtype=int32), array([1, 2], dtype=int32))
The csr attributes are:
In [63]: Mx.data,Mx.indices,Mx.indptr
Out[63]:
(array([1, 2], dtype=int32),
array([1, 2], dtype=int32),
array([0, 2], dtype=int32))
Now let's make a new matrix, using the attributes of Mx. Assuming you constructed your indptr, indices, and data correctly, this should imitate what you've done:
In [64]: newM = sparse.csr_matrix((Mx.data, Mx.indices, Mx.indptr))
In [65]: newM.A
Out[65]: array([[0, 1, 2]], dtype=int32)
data matches between the two matrices:
In [68]: Mx.data==newM.data
Out[68]: array([ True, True])
The ids of the two data arrays don't match, but their bases do. See my recent answer for why this is relevant:
https://stackoverflow.com/a/74543855/901925
In [75]: id(Mx.data.base), id(newM.data.base)
Out[75]: (2255407394864, 2255407394864)
That means changes to newM will appear in Mx:
In [77]: newM[0,1] = 100
In [78]: newM.A
Out[78]: array([[ 0, 100, 2]], dtype=int32)
In [79]: Mx.A
Out[79]: array([[ 0, 100, 2]], dtype=int32)
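If that sharing is not wanted, the constructor's copy argument can be used. The following is a minimal sketch, assuming a small stand-in for Mx (its values are chosen here for illustration) and modifying the data attribute directly rather than through item assignment:

```python
import numpy as np
from scipy import sparse

# Small CSR matrix standing in for Mx (illustrative values).
Mx = sparse.csr_matrix(np.array([[0, 1, 2]], dtype=np.int32))

# copy=True asks the constructor to copy the three input arrays
# instead of reusing them.
newM = sparse.csr_matrix((Mx.data, Mx.indices, Mx.indptr),
                         shape=Mx.shape, copy=True)

newM.data[0] = 100  # change the stored value at (0, 1) in the copy only

shares = np.shares_memory(Mx.data, newM.data)
original_dense = Mx.toarray()
copy_dense = newM.toarray()
print(shares)
print(original_dense)
print(copy_dense)
```

With the copy, the change should stay confined to newM and leave Mx untouched.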
fuller test
Let's try a small scale test of your code:
In [92]: data = np.array([[1.23,2],[3],[]],object); ind = np.array([[1,2],[3],[]],object)
...: indptr = np.empty(4)
...: indptr[0] = 0
...: for i in range(1, 4):
...:     indptr[i] = indptr[i-1] + len(data[i-1])
...: data = np.concatenate(data).ravel()
...: ind = np.concatenate(ind).ravel() # same as above
In [93]: data,ind,indptr
Out[93]: (array([1.23, 2. , 3. ]), array([1., 2., 3.]), array([0., 2., 3., 3.]))
And the sparse matrix:
In [94]: X = sparse.csr_matrix((data, ind, indptr), shape=(3,3))
In [95]: X
Out[95]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
data matches:
In [96]: X.data
Out[96]: array([1.23, 2. , 3. ])
In [97]: data == X.data
Out[97]: array([ True, True, True])
and is in fact a view:
In [98]: data[1]+=.23; data
Out[98]: array([1.23, 2.23, 3. ])
In [99]: X.A
Out[99]:
array([[0. , 1.23, 2.23],
[0. , 0. , 0. ],
[3. , 0. , 0. ]])
oops
I made an error in specifying the X shape:
In [110]: X = sparse.csr_matrix((data, ind, indptr), shape=(3,4))
In [111]: X.A
Out[111]:
array([[0. , 1.23, 2.23, 0. ],
[0. , 0. , 0. , 3. ],
[0. , 0. , 0. , 0. ]])
In [112]: X.data
Out[112]: array([1.23, 2.23, 3. ])
In [113]: X.nonzero()
Out[113]: (array([0, 0, 1], dtype=int32), array([1, 2, 3], dtype=int32))
In [114]: X[X.nonzero()]
Out[114]: matrix([[1.23, 2.23, 3. ]])
In [115]: data
Out[115]: array([1.23, 2.23, 3. ])
In [116]: data == X[X.nonzero()]
Out[116]: matrix([[ True, True, True]])
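One situation in which X[X.nonzero()] can differ from X.data, and which cannot be confirmed without the actual data, is a row that stores the same column index more than once: element indexing adds the duplicates together, while X.data still holds them separately. NaN values would also break an == comparison, since NaN != NaN. The sketch below uses made-up arrays to show both checks:

```python
import numpy as np
from scipy import sparse

# Hypothetical data: row 0 stores column 1 twice.
data = np.array([1.0, 2.0, 5.0])
ind = np.array([1, 1, 0], dtype=np.int32)
indptr = np.array([0, 2, 3], dtype=np.int32)
X = sparse.csr_matrix((data, ind, indptr), shape=(2, 3))

stored = X.data.copy()   # still the three values that were passed in
element = X[0, 1]        # indexing adds the duplicates: 1.0 + 2.0
has_nan = bool(np.isnan(stored).any())  # NaN would also defeat an == test

# Count (row, column) pairs that occur more than once.
rows = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))
pairs = np.stack([rows, X.indices], axis=1)
n_duplicates = len(pairs) - len(np.unique(pairs, axis=0))

X.sum_duplicates()       # merges duplicate entries in place
merged = X.data.copy()
print(element, has_nan, n_duplicates, merged)
```

If n_duplicates is nonzero for the real matrix, the mismatching positions would be expected to be exactly those duplicated entries.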
Depending on the type of the values you store in the matrix, numpy.float64 or numpy.int64, perhaps, the following post might answer your question: https://github.com/scipy/scipy/issues/13329#issuecomment-753541268
In particular, the comment "Apparently I don't get an error when data is a numpy array rather than a list." suggests that having data as numpy.array rather than a list could solve your problem.
Hopefully, this at least sets you on the right track.
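As a sketch of what passing NumPy arrays with explicit dtypes could look like, the row pointer can be built with np.cumsum and kept as integers. The per-row inputs below are invented stand-ins for the lists in the question, and whether this resolves the mismatch there is not something that can be verified here:

```python
import numpy as np
from scipy import sparse

# Hypothetical per-row inputs standing in for the lists in the question.
data_rows = [np.array([1.23, 2.0]), np.array([3.0]), np.array([4.5])]
ind_rows = [np.array([1, 2]), np.array([3]), np.array([0])]

# Row pointer: cumulative count of stored entries, kept as integers.
lengths = np.array([len(r) for r in data_rows], dtype=np.int64)
indptr = np.concatenate(([0], np.cumsum(lengths)))

data = np.concatenate(data_rows).astype(np.float64)
ind = np.concatenate(ind_rows).astype(np.int64)

X = sparse.csr_matrix((data, ind, indptr), shape=(len(data_rows), 4))
print(X.indptr, X.indices, X.data)
```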
Related
I am trying to find both the location and the value of the minimum element of a sparse matrix for each row. A toy example for the question is given below:
Here, we have a 3x7 sparse matrix "M".
H = np.array([[1, 2, 3, 0, 4, 0 ,0],
              [0, 5, 0, 6, 0, 0 ,0],
              [0, 0, 0, 7, 0, 0 ,8]], dtype = np.float32)
M = scipy.sparse.csr_matrix(H)
Then, what I would like to obtain is the nonzero minimum elements of each row.
For the example above:
min_elements = some_function(M, axis=1)
and receiving the return as min_elements = [1,5,7]. The method M.min(axis=1) does not work for my case since the minimum element of each row is zero, so it returns an all-zeros array.
Is there a computationally efficient way to implement such a function for a sparse matrix? In my general case the sparse matrices will be quite large and involve a lot of additional computation, so performance is my main concern.
Thank you!
In [333]: from scipy import sparse
In [334]: M = sparse.csr_matrix(H)
In [335]: M
Out[335]:
<3x7 sparse matrix of type '<class 'numpy.float32'>'
with 8 stored elements in Compressed Sparse Row format>
M is stored as:
In [336]: M.indptr
Out[336]: array([0, 4, 6, 8], dtype=int32)
In [337]: M.data
Out[337]: array([1., 2., 3., 4., 5., 6., 7., 8.], dtype=float32)
In [338]: M.indices
Out[338]: array([0, 1, 2, 4, 1, 3, 3, 6], dtype=int32)
We can iterate on the slices defined by indptr, and take the min:
In [340]: for i in range(M.shape[0]):
...:     sl = slice(M.indptr[i],M.indptr[i+1])
...:     x, y = M.data[sl], M.indices[sl]
...:     m = np.argmin(x)
...:     print(y[m], x[m])
...:
0 1.0
1 5.0
3 7.0
This can be streamlined a bit, but it gives the basic idea.
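One possible way to streamline it, at least for the values, is np.minimum.reduceat over the segments defined by indptr. This is a sketch rather than a tuned implementation: the helper name row_min_of_stored is made up, it returns values only (not column locations), it treats explicitly stored zeros like any other stored value, and it reports NaN for rows without stored entries.

```python
import numpy as np
from scipy import sparse

def row_min_of_stored(M):
    """Minimum stored value per row of a CSR matrix (NaN for empty rows)."""
    M = sparse.csr_matrix(M)
    out = np.full(M.shape[0], np.nan)
    nonempty = np.diff(M.indptr) > 0
    if nonempty.any():
        # Start offsets of the non-empty rows; each reduceat segment runs
        # up to the next start, which is exactly one row's stored data.
        starts = M.indptr[:-1][nonempty]
        out[nonempty] = np.minimum.reduceat(M.data, starts)
    return out

H = np.array([[1, 2, 3, 0, 4, 0, 0],
              [0, 5, 0, 6, 0, 0, 0],
              [0, 0, 0, 7, 0, 0, 8]], dtype=np.float32)
mins = row_min_of_stored(H)
with_empty_row = row_min_of_stored(np.array([[0.0, 0.0], [3.0, 2.0]]))
print(mins)
print(with_empty_row)
```

Skipping the empty rows before calling reduceat matters, because reduceat does not return an identity value for an empty segment.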
It may be easier to picture what's going on in the lil format:
In [341]: Ml = M.tolil()
In [342]: Ml.data
Out[342]:
array([list([1.0, 2.0, 3.0, 4.0]), list([5.0, 6.0]), list([7.0, 8.0])],
dtype=object)
In [343]: Ml.rows
Out[343]: array([list([0, 1, 2, 4]), list([1, 3]), list([3, 6])], dtype=object)
In [344]: for d,r in zip(Ml.data, Ml.rows):
...:     m = np.argmin(d)
...:     print(r[m], d[m])
...:
0 1.0
1 5.0
3 7.0
Previous SO questions have asked for things like the smallest (or largest) N values by row.
Sparse is best for things that can be expressed as some sort of matrix multiplication. That includes row (or column) sums. Even csr indexing is done with matrix multiplication. Other row-by-row operations aren't as easy.
You could invert all your data (take reciprocals) and find the maximum. This assumes all your data is positive, as in the example.
M_inv = M.copy()
M_inv.data = 1/M.data
one_over_min_M = M_inv.max(axis=1)
min_M = 1/one_over_min_M.toarray()
On your example I get the output
[[1. ]
[5. ]
[6.9999995]]
There is some horrible numerical error there, but if you're happy to round your answer...
Edit: This approach might be redeemed if you're after the indices and want to do M_inv.argmax(axis=1), otherwise it's probably not the best.
I have a program that created a numpy array and the array is
array([[0.0543275 , 0.51249827, 0.43317423],
[0.07144389, 0.51152126, 0.41703486],
[0.0776112 , 0.48593384, 0.43645496]])
I used the following code for finding the maximum in a row but it is not working for float values
for row in a:
    maxi = np.argmax(np.max(row, axis=0))
    float(maxi)
    print(maxi)
I want something like this
array([[0 , 1 , 0],
[0 , 1 , 0],
[0 , 1 , 0]])
Update: this was originally wrong; now it is just the essence of the previous correct answer:
a = np.array([[0.0543275 , 0.51249827, 0.43317423],
[0.07144389, 0.51152126, 0.41703486],
[0.0776112 , 0.48593384, 0.43645496]])
b = np.zeros_like(a)
b[np.arange(a.shape[0]), np.argmax(a, axis=1)] = 1
Since np.argmax() gives us indices of the max elements, we just use them for indexing directly. Now b contains desired output:
array([[0., 1., 0.],
[0., 1., 0.],
[0., 1., 0.]])
You can also use b.astype(int) to convert the result to integers.
Here is an option that works
for e, i in enumerate(a):
    for f, j in enumerate(i):
        if j == max(i):
            a[e][f] = 1
        else:
            a[e][f] = 0
This will convert the array that you use to the desired form:
<class 'numpy.ndarray'>
[[0. 1. 0.]
[0. 1. 0.]
[0. 1. 0.]]
In [41]: arr = np.array([[0.0543275 , 0.51249827, 0.43317423], [0.07144389, 0.51152126, 0.41703486], [0.0776112 , 0.48593384, 0.43645496]])
In [42]: arr
Out[42]:
array([[0.0543275 , 0.51249827, 0.43317423],
[0.07144389, 0.51152126, 0.41703486],
[0.0776112 , 0.48593384, 0.43645496]])
The maximum in each row is:
In [47]: np.max(arr, axis=1)
Out[47]: array([0.51249827, 0.51152126, 0.48593384])
Its column index within each row is:
In [48]: np.argmax(arr, axis=1)
Out[48]: array([1, 1, 1])
We can map that argmax array onto an array with the same shape with:
In [52]: x = np.zeros(arr.shape, int)
In [53]: x[np.arange(3),_48] = 1
In [54]: x
Out[54]:
array([[0, 1, 0],
[0, 1, 0],
[0, 1, 0]])
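An equivalent formulation, shown here only as an alternative sketch, uses np.put_along_axis so that the row indices do not have to be spelled out with np.arange:

```python
import numpy as np

arr = np.array([[0.0543275, 0.51249827, 0.43317423],
                [0.07144389, 0.51152126, 0.41703486],
                [0.0776112, 0.48593384, 0.43645496]])

idx = np.argmax(arr, axis=1)                   # column of the maximum in each row
x = np.zeros(arr.shape, dtype=int)
np.put_along_axis(x, idx[:, None], 1, axis=1)  # write 1 at (row, idx[row])
print(x)
```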
I want to make a function that, when fed an array, returns an array of the same shape but with all zeros except for one value, the maximum. For example, with an array like this:
my_array = np.arange(9).reshape((3,3))
[[ 0. 1. 2.]
[ 3. 4. 5.]
[ 6. 7. 8.]]
when passed in the function I want it out like this:
[[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 8.]]
Exception:
When several values are equal to the maximum, I only want one of them kept and the rest zeroed out (the order doesn't matter).
I am honestly clueless as to how to do this in a way that is both elegant and efficient. How would you do it?
For efficiency, use array-initialization and argmax to get the max index (first one linearly indexed if more than one) -
def app_flat(my_array):
    out = np.zeros_like(my_array)
    idx = my_array.argmax()
    out.flat[idx] = my_array.flat[idx]
    return out
We can also use ndarray.ravel() in place of ndarray.flat, and I would expect the performance numbers to be comparable.
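A sketch of that ravel() variant follows. The function name app_ravel is made up, and the output is allocated with np.zeros (rather than np.zeros_like) on the assumption that a C-contiguous array is needed so that ravel() returns a view and the assignment reaches the output:

```python
import numpy as np

def app_ravel(my_array):
    # np.zeros gives a C-contiguous array, so ravel() below is a view.
    out = np.zeros(my_array.shape, dtype=my_array.dtype)
    idx = my_array.argmax()  # flat (C-order) index of the first maximum
    out.ravel()[idx] = my_array.ravel()[idx]
    return out

result = app_ravel(np.array([[0, 1, 2],
                             [3, 4, 5],
                             [8, 7, 8]]))
print(result)
```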
Since this output is sparse, you might want to use sparse matrices to gain memory efficiency and hence performance, especially for large arrays. For sparse matrix output, an alternative would look like this -
from scipy.sparse import coo_matrix
def app_sparse(my_array):
    idx = my_array.argmax()
    r,c = np.unravel_index(idx, my_array.shape)
    return coo_matrix(([my_array[r,c]],([r],[c])),shape=my_array.shape)
Sample run -
In [336]: my_array
Out[336]:
array([[0, 1, 2],
[3, 4, 5],
[8, 7, 8]])
In [337]: app_flat(my_array)
Out[337]:
array([[0, 0, 0],
[0, 0, 0],
[8, 0, 0]])
In [338]: app_sparse(my_array)
Out[338]:
<3x3 sparse matrix of type '<type 'numpy.int64'>'
with 1 stored elements in COOrdinate format>
In [339]: app_sparse(my_array).toarray() # just to confirm values
Out[339]:
array([[0, 0, 0],
[0, 0, 0],
[8, 0, 0]])
Runtime test on bigger array -
In [340]: my_array = np.random.randint(0,1000,(5000,5000))
In [341]: %timeit app_flat(my_array)
10 loops, best of 3: 34.9 ms per loop
In [342]: %timeit app_sparse(my_array) # sparse matrix output
100 loops, best of 3: 17.2 ms per loop
With a few lines:
my_array = np.arange(9).reshape((3,3))
my_array2 = np.zeros(len(my_array.ravel()))
my_array2[np.argmax(my_array)] = np.max(my_array)
my_array2 = my_array2.reshape(my_array.shape)
I have a large 2d numpy array and two 1d arrays that represent x/y indexes within the 2d array. I want to use these 1d arrays to perform an operation on the 2d array.
I can do this with a for loop, but it's very slow when working on a large array. Is there a faster way? I tried using the 1d arrays simply as indexes but that didn't work. See this example:
import numpy as np
# Two example 2d arrays
cnt_a = np.zeros((4,4))
cnt_b = np.zeros((4,4))
# 1d arrays holding x and y indices
xpos = [0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3]
ypos = [3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0]
# This method works, but is very slow for a large array
for i in range(0,len(xpos)):
    cnt_a[xpos[i],ypos[i]] = cnt_a[xpos[i],ypos[i]] + 1
# This method is fast, but gives incorrect answer
cnt_b[xpos,ypos] = cnt_b[xpos,ypos]+1
# Print the results
print('Good:')
print(cnt_a)
print('')
print('Bad:')
print(cnt_b)
The output from this is:
Good:
[[ 2. 1. 2. 1.]
[ 0. 3. 1. 2.]
[ 1. 1. 1. 1.]
[ 1. 0. 0. 0.]]
Bad:
[[ 1. 1. 1. 1.]
[ 0. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 0. 0. 0.]]
For the cnt_b array numpy is obviously not summing correctly, but I'm unsure how to fix this without resorting to the (v. inefficient) for loop used to calculate cnt_a.
Another approach, using 1D indexing (suggested by @Shai) and extended to answer the actual question:
>>> out = np.zeros((4, 4))
>>> idx = np.ravel_multi_index((xpos, ypos), out.shape) # extract 1D indexes
>>> x = np.bincount(idx, minlength=out.size)
>>> out.flat += x
np.bincount counts how many times each flat index occurs among the (xpos, ypos) pairs and stores the counts in x.
Or, as suggested by @Divakar:
>>> out.flat += np.bincount(idx, minlength=out.size)
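As a self-contained check of the bincount approach on the arrays from the question, the counts can also be reshaped directly instead of being added into a pre-existing array. This is only a sketch; the shape (4, 4) is taken from the question's example.

```python
import numpy as np

xpos = np.array([0, 0, 1, 2, 1, 2, 1, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3])
ypos = np.array([3, 2, 1, 1, 3, 0, 1, 0, 0, 1, 2, 1, 2, 3, 3, 2, 0])

shape = (4, 4)
idx = np.ravel_multi_index((xpos, ypos), shape)  # flat index of each pair
counts = np.bincount(idx, minlength=shape[0] * shape[1]).reshape(shape)
print(counts)
```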
We could compute the linear indices, then accumulate into zeros-initialized output array with np.add.at. Thus, with xpos and ypos as arrays, here's one implementation -
m,n = xpos.max()+1, ypos.max()+1
out = np.zeros((m,n),dtype=int)
np.add.at(out.ravel(), xpos*n+ypos, 1)
Sample run -
In [95]: # 1d arrays holding x and y indices
...: xpos = np.array([0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3])
...: ypos = np.array([3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0])
...:
In [96]: cnt_a = np.zeros((4,4))
In [97]: # This method works, but is very slow for a large array
...: for i in range(0,len(xpos)):
...:     cnt_a[xpos[i],ypos[i]] = cnt_a[xpos[i],ypos[i]] + 1
...:
In [98]: m,n = xpos.max()+1, ypos.max()+1
...: out = np.zeros((m,n),dtype=int)
...: np.add.at(out.ravel(), xpos*n+ypos, 1)
...:
In [99]: cnt_a
Out[99]:
array([[ 2., 1., 2., 1.],
[ 0., 3., 1., 2.],
[ 1., 1., 1., 1.],
[ 1., 0., 0., 0.]])
In [100]: out
Out[100]:
array([[2, 1, 2, 1],
[0, 3, 1, 2],
[1, 1, 1, 1],
[1, 0, 0, 0]])
You can iterate over both lists at once and increment for each pair (if you are not used to it, zip combines lists):
for x, y in zip(xpos, ypos):
    cnt_b[x][y] += 1
But this will be about the same speed as your solution A.
If your lists xpos/ypos are of length n, I don't see how you can update your matrix in less than O(n), since you have to look at each pair one way or another.
Another solution: you could count the identical index pairs (e.g. (0, 3)), possibly with collections.Counter, and update the matrix with the count values. But I doubt it would be much faster, since the time gained on updating the matrix would be lost on counting the repeated occurrences.
Maybe I am totally wrong though, in which case I'd be curious to see an answer that is not O(n).
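The Counter idea could be sketched as follows, using the index lists from the question. No speed claim is intended; it is still a Python-level loop, only over distinct pairs.

```python
from collections import Counter

import numpy as np

xpos = [0, 0, 1, 2, 1, 2, 1, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
ypos = [3, 2, 1, 1, 3, 0, 1, 0, 0, 1, 2, 1, 2, 3, 3, 2, 0]

cnt = np.zeros((4, 4))
pair_counts = Counter(zip(xpos, ypos))  # (x, y) -> number of occurrences
for (x, y), n in pair_counts.items():
    cnt[x, y] += n                      # one update per distinct pair
print(cnt)
```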
I think you are looking for the ravel_multi_index function:
lidx = np.ravel_multi_index((xpos, ypos), cnt_a.shape)
This converts the (x, y) pairs to flat 1D indices into cnt_a and cnt_b, which can then be accumulated on a flattened view of the array:
np.add.at(cnt_b.ravel(), lidx, 1)
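np.add.at also accepts a tuple of index arrays, so the 2D array can be updated without computing flat indices first. A minimal sketch on the question's data (the name cnt is used here for a fresh array):

```python
import numpy as np

xpos = np.array([0, 0, 1, 2, 1, 2, 1, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3])
ypos = np.array([3, 2, 1, 1, 3, 0, 1, 0, 0, 1, 2, 1, 2, 3, 3, 2, 0])

cnt = np.zeros((4, 4))
# Unbuffered in-place add: repeated (x, y) pairs are each counted,
# unlike cnt[xpos, ypos] += 1, which applies only one update per pair.
np.add.at(cnt, (xpos, ypos), 1)
print(cnt)
```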
I'm doing a project and I'm doing a lot of matrix computation in it.
I'm looking for a smart way to speed up my code. In my project, I'm dealing with a sparse matrix of size 100Mx1M with around 10M non-zero values. The example below is just to illustrate my point.
Let's say I have:
A vector v of size (2)
A vector c of size (3)
A sparse matrix X of size (2,3)
v = np.asarray([10, 20])
c = np.asarray([ 2, 3, 4])
data = np.array([1, 1, 1, 1])
row = np.array([0, 0, 1, 1])
col = np.array([1, 2, 0, 2])
X = coo_matrix((data,(row,col)), shape=(2,3))
X.todense()
# matrix([[0, 1, 1],
# [1, 0, 1]])
Currently I'm doing:
result = np.zeros_like(v)
d = scipy.sparse.lil_matrix((v.shape[0], v.shape[0]))
d.setdiag(v)
tmp = d * X
print(tmp.todense())
#matrix([[ 0., 10., 10.],
# [ 20., 0., 20.]])
# At this point tmp is csr sparse matrix
for i in range(tmp.shape[0]):
    x_i = tmp.getrow(i)
    result += x_i.data * ( c[x_i.indices] - x_i.data)
    # I only want to do the subtraction on non-zero elements
print(result)
# array([-430, -380])
And my problem is the for loop and especially the subtraction.
I would like to find a way to vectorize this operation by subtracting only on the non-zero elements.
Something to get directly the sparse matrix on the subtraction:
matrix([[ 0., -7., -6.],
[ -18., 0., -16.]])
Is there a way to do this smartly ?
You don't need to loop over the rows to do what you are already doing. And you can use a similar trick to perform the multiplication of the rows by the first vector:
import scipy.sparse as sps
# number of nonzero entries per row of X
nnz_per_row = np.diff(X.indptr)
# multiply every row by the corresponding entry of v
# You could do this in-place as:
# X.data *= np.repeat(v, nnz_per_row)
Y = sps.csr_matrix((X.data * np.repeat(v, nnz_per_row), X.indices, X.indptr),
                   shape=X.shape)
# subtract from the non-zero entries the corresponding column value in c...
Y.data -= np.take(c, Y.indices)
# ...and multiply by -1 to get the value you are after
Y.data *= -1
To see that it works, set up some dummy data
rows, cols = 3, 5
v = np.random.rand(rows)
c = np.random.rand(cols)
X = sps.rand(rows, cols, density=0.5, format='csr')
and after running the code above:
>>> x = X.toarray()
>>> mask = x == 0
>>> x *= v[:, np.newaxis]
>>> x = c - x
>>> x[mask] = 0
>>> x
array([[ 0.79935123, 0. , 0. , -0.0097763 , 0.59901243],
[ 0.7522559 , 0. , 0.67510109, 0. , 0.36240006],
[ 0. , 0. , 0.72370725, 0. , 0. ]])
>>> Y.toarray()
array([[ 0.79935123, 0. , 0. , -0.0097763 , 0.59901243],
[ 0.7522559 , 0. , 0.67510109, 0. , 0.36240006],
[ 0. , 0. , 0.72370725, 0. , 0. ]])
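The same steps can also be checked against the small example from the question. In this sketch the COO matrix is converted with tocsr() first, since the approach relies on the indptr and indices attributes of the CSR format:

```python
import numpy as np
import scipy.sparse as sps

v = np.asarray([10, 20])
c = np.asarray([2, 3, 4])
data = np.array([1, 1, 1, 1])
row = np.array([0, 0, 1, 1])
col = np.array([1, 2, 0, 2])
X = sps.coo_matrix((data, (row, col)), shape=(2, 3)).tocsr()

nnz_per_row = np.diff(X.indptr)                 # stored entries per row
Y = sps.csr_matrix((X.data * np.repeat(v, nnz_per_row), X.indices, X.indptr),
                   shape=X.shape)               # rows scaled by v
Y.data -= np.take(c, Y.indices)                 # subtract c at stored columns
Y.data *= -1                                    # flip sign: c - v * X
result = Y.toarray()
print(result)
```

The dense result should match the matrix the question asks for, with the subtraction applied only at the stored positions.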
The way you are accumulating your result requires that there are the same number of non-zero entries in every row, which seems a pretty weird thing to do. Are you sure that is what you are after? If that's really what you want you could get that value with something like:
result = np.sum(Y.data.reshape(Y.shape[0], -1), axis=0)
but I have trouble believing that is really what you are after...