scipy `SparseEfficiencyWarning` when dividing rows of a csr_matrix - python

Suppose I already had a csr_matrix:
import numpy as np
from scipy.sparse import csr_matrix
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1., 2., 3., 4., 5., 6.])
mat = csr_matrix((data, indices, indptr), shape=(3, 3))
print(mat.A)
[[1. 0. 2.]
 [0. 0. 3.]
 [4. 5. 6.]]
it's simple if I want to divide a single row of this csr_matrix:
mat[0] /= 2
print(mat.A)
[[0.5 0.  1. ]
 [0.  0.  3. ]
 [4.  5.  6. ]]
However, if I want to change multiple rows, it throws a warning:
mat[np.array([0,1])]/=np.array([[1],[2]])
print(mat.A)
[[1.  0.  2. ]
 [0.  0.  1.5]
 [4.  5.  6. ]]
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_arrayXarray(i, j, x)
Why does division on multiple rows change the sparsity? It suggests I switch to lil_matrix, but when I checked the code of tolil():
def tolil(self, copy=False):
    lil = self._lil_container(self.shape, dtype=self.dtype)
    self.sum_duplicates()
    ptr, ind, dat = self.indptr, self.indices, self.data
    rows, data = lil.rows, lil.data
    for n in range(self.shape[0]):
        start = ptr[n]
        end = ptr[n+1]
        rows[n] = ind[start:end].tolist()
        data[n] = dat[start:end].tolist()
    return lil
which basically loops over all rows; I don't think that's necessary in my case. What is the correct way to divide just a few rows of a csr_matrix? Thanks!

Your matrix:
In [208]: indptr = np.array([0, 2, 3, 6])
...: indices = np.array([0, 2, 2, 0, 1, 2])
...: data = np.array([1., 2., 3., 4., 5., 6.])
...: mat = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [209]: mat
Out[209]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [210]: mat.A
Out[210]:
array([[1., 0., 2.],
       [0., 0., 3.],
       [4., 5., 6.]])
Simple division just changes the mat.data values, in-place:
In [211]: mat/= 3
In [212]: mat
Out[212]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [213]: mat *= 3
In [214]: mat.A
Out[214]:
array([[1., 0., 2.],
       [0., 0., 3.],
       [4., 5., 6.]])
The RHS of your case produces a np.matrix object:
In [215]: mat[np.array([0,1])]/np.array([[1],[2]])
Out[215]:
matrix([[1. , 0. , 2. ],
        [0. , 0. , 1.5]])
Assigning that to the mat subset produces the warning:
In [216]: mat[np.array([0,1])] = _
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_arrayXarray(i, j, x)
Your warning and mine occur in the set step:
self._set_arrayXarray(i, j, x)
If I divide again I don't get the warning:
In [217]: mat[np.array([0,1])]/=np.array([[1],[2]])
In [218]: mat
Out[218]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 9 stored elements in Compressed Sparse Row format>
Why? Because after the first assignment, mat has 9 stored elements, not the original six: the dense assignment wrote every position of the indexed rows, zeros included, so those zeros became explicitly stored values. Hence [217] doesn't change the sparsity structure.
Prune mat back to 6 stored elements by removing the explicit zeros, and we get the warning again:
In [219]: mat.eliminate_zeros()
In [220]: mat
Out[220]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [221]: mat[np.array([0,1])]/=np.array([[1],[2]])
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_arrayXarray(i, j, x)
and a change in sparsity:
In [222]: mat
Out[222]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 9 stored elements in Compressed Sparse Row format>
Assigning a sparse version of the [215] division result doesn't trigger the warning:
In [223]: mat.eliminate_zeros()
In [224]: m1=sparse.csr_matrix(mat[np.array([0,1])]/np.array([[1],[2]]))
In [225]: m1
Out[225]:
<2x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
In [226]: mat[np.array([0,1])]=m1
In [227]: mat
Out[227]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
===
The [215] division is best seen as an ndarray action, not a sparse one:
In [232]: mat[np.array([0,1])]/np.array([[1],[2]])
Out[232]:
matrix([[1.     , 0.     , 2.     ],
        [0.     , 0.     , 0.09375]])
In [233]: mat[np.array([0,1])].todense()/np.array([[1],[2]])
Out[233]:
matrix([[1.     , 0.     , 2.     ],
        [0.     , 0.     , 0.09375]])
The details of this division are found in sparse._base.py, mat._divide, with different actions depending on whether the other operand is a scalar, a dense array, or a sparse matrix. Sparse matrix division does not implement broadcasting.
As a general rule, matrix multiplication is the most efficient sparse calculation. In fact actions like row or column sum are implemented with it. And so are some forms of indexing. Element-wise calculations are ok if they can be applied to the M.data array without regard to row or column indices (e.g. square, power, scalar multiplication). M.multiply is element-wise, but without the full broadcasting power of dense arrays. Sparse division is even more limited.
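For instance, here is a minimal sketch (using the sample mat from the question; row_sums and D are just illustrative names) of how a row sum reduces to a matrix-vector product, and how row scaling can be phrased as left-multiplication by a sparse diagonal matrix:
import numpy as np
from scipy import sparse
mat = sparse.csr_matrix((np.array([1., 2., 3., 4., 5., 6.]),
                         np.array([0, 2, 2, 0, 1, 2]),
                         np.array([0, 2, 3, 6])), shape=(3, 3))
# A row sum is just a product with a vector of ones; this is essentially
# what mat.sum(axis=1) does internally.
row_sums = mat @ np.ones(mat.shape[1])   # array([ 3.,  3., 15.])
# Row scaling as left-multiplication by a diagonal matrix; the sparsity
# structure of the result is unchanged, so no SparseEfficiencyWarning.
D = sparse.diags([1.0, 0.5, 1.0])        # divide row 1 by 2
print((D @ mat).toarray())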
edit
sklearn has some utilities to perform certain kinds of sparse actions that it needs, like scaling and normalizing.
In [274]: from sklearn.utils import sparsefuncs
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.utils
https://scikit-learn.org/stable/modules/generated/sklearn.utils.sparsefuncs.inplace_row_scale.html#sklearn.utils.sparsefuncs.inplace_row_scale
With the sample mat:
In [275]: mat
Out[275]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [276]: mat.A
Out[276]:
array([[1.    , 0.    , 2.    ],
       [0.    , 0.    , 0.1875],
       [4.    , 5.    , 6.    ]])
Applying the row scaling:
In [277]: sparsefuncs.inplace_row_scale(mat,np.array([10,20,1]))
In [278]: mat
Out[278]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [279]: mat.A
Out[279]:
array([[10.  ,  0.  , 20.  ],
       [ 0.  ,  0.  ,  3.75],
       [ 4.  ,  5.  ,  6.  ]])
The scaling array's length has to match the number of rows. In your case you'd take the reciprocal of your [[1],[2]], flatten it, and pad it with a 1 so it acts on all three rows.
Looking at the source I see it uses sparsefuncs.inplace_csr_row_scale. That in turn does:
X.data *= np.repeat(scale, np.diff(X.indptr))
The details of this action are:
In [283]: mat.indptr
Out[283]: array([0, 2, 3, 6], dtype=int32)
In [284]: np.diff(mat.indptr)
Out[284]: array([2, 1, 3], dtype=int32)
In [285]: np.repeat(np.array([10,20,1]), _)
Out[285]: array([10, 10, 20, 1, 1, 1])
In [286]: mat.data
Out[286]: array([100., 200., 75., 4., 5., 6.])
So it converts the scale array into an array that matches the data in shape. Then the inplace *= array multiplication is easy.
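If you'd rather not pull in sklearn, the same trick works directly on the csr attributes. A minimal sketch with the original sample matrix, using per-row divisors padded with 1 for the untouched row (divisors is an illustrative name):
import numpy as np
from scipy.sparse import csr_matrix
mat = csr_matrix((np.array([1., 2., 3., 4., 5., 6.]),
                  np.array([0, 2, 2, 0, 1, 2]),
                  np.array([0, 2, 3, 6])), shape=(3, 3))
divisors = np.array([1., 2., 1.])   # divide row 0 by 1, row 1 by 2, row 2 by 1
# Expand the per-row divisors to one entry per stored value, then divide in place.
mat.data /= np.repeat(divisors, np.diff(mat.indptr))
print(mat.toarray())
# [[1.  0.  2. ]
#  [0.  0.  1.5]
#  [4.  5.  6. ]]
This touches only mat.data, so the sparsity structure is unchanged and no warning is raised.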

It seems that the problem lies in the way the matrix data is accessed:
mat[np.array([0,1])]/=np.array([[1],[2]])
The direct, fancy-indexed access to the rows seems to be the problem. I did not find any authoritative documentation on why this is the case, but I think it forces construction of the rows including the 0 values that the CSR format would normally skip.
A way to mitigate the problem is to first rewrite the divisors np.array([[1],[2]]) as reciprocals 1/x, giving [[1],[0.5]], pad with 1 for the unchanged row, and then create a diagonal csr_matrix from them, so that we have
mat_x = csr_matrix(([1,0.5,1], [0,1,2], [0,1,2,3]), shape=(3, 3))
If we then multiply this matrix from the left to affect rows, we get the desired output without any warnings:
mat = mat_x * mat
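A runnable version of this idea, as a sketch with the sample matrix from the question:
import numpy as np
from scipy.sparse import csr_matrix
mat = csr_matrix((np.array([1., 2., 3., 4., 5., 6.]),
                  np.array([0, 2, 2, 0, 1, 2]),
                  np.array([0, 2, 3, 6])), shape=(3, 3))
# Diagonal of reciprocals: rows 0 and 2 are left alone, row 1 is halved.
mat_x = csr_matrix(([1, 0.5, 1], [0, 1, 2], [0, 1, 2, 3]), shape=(3, 3))
mat = mat_x * mat   # no SparseEfficiencyWarning
print(mat.toarray())
# [[1.  0.  2. ]
#  [0.  0.  1.5]
#  [4.  5.  6. ]]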

Related

Retrieving values of a CSR matrix

Question
I have a CSR matrix, and I want to be able to retrieve the column indices and the values stored.
Data
For different reasons I'm not allowed to share my data, but here's a look (the numpy library is imported as np):
print(type(data) == type(ind) == list) # data and ind are lists
# OUT: True
print(len(data) == len(ind) == 134464) # data and ind have a size of 134,464
# OUT: True
print(np.alltrue([type(subarray) == np.ndarray for subarray in data])) # data (and ind) contain ndarrays
# OUT: True
print(np.alltrue([len(data[i]) == len(ind[i]) for i in range(len(data))])) # each ndarray of data has the same length as the corresponding ndarray of ind
# OUT: True
print(min([len(data[i]) for i in range(len(data))]) >= 1) # each subarray of data (and of ind) has at least a length of 1
# OUT: True
print(np.alltrue([subarray.dtype == np.float64 for subarray in data])) # each subarray of data (and of ind) contains floats
# OUT: True
Code
Here is how I create the matrix (using csr_matrix from scipy.sparse):
indptr = np.empty(nbr_of_rows + 1) # nbr_of_rows = 134,464 = len(data)
indptr[0] = 0
for i in range(1, len(indptr)):
    indptr[i] = indptr[i-1] + len(data[i-1])
data = np.concatenate(data) # now I have type(data) = np.ndarray, data.dtype = np.float64 and len(data) = 2,821,574
ind = np.concatenate(ind) # same as above
X = csr_matrix((data, ind, indptr), shape=(nbr_of_rows, nbr_of_columns)) # nbr_of_columns = 3,991 = max(ind) + 1 (since min(ind) = 0)
print(f"The matrix has a shape of {X.shape} and a sparsity of {(1 - (X.nnz / (X.shape[0] * X.shape[1]))): .2%}.")
# OUT: The matrix has a shape of (134464, 3991) and a sparsity of 99.47%.
So far so good (at least I think so). But now, even though I manage to retrieve the column indices, I can’t successfully retrieve the values:
print(np.alltrue(ind == X.nonzero()[1])) # Retrieving the column indices
# OUT: True
print(np.alltrue(data == X[X.nonzero()])) # Trying to retrieve the values
# OUT: False
print(np.alltrue(np.sort(data) == np.sort(X[X.nonzero()]))) # Seeing if the values are at least the same
# OUT: False
print(np.sum(data) == np.sum(X[X.nonzero()])) # Seeing if the values add up to the same total
# OUT: False
When I look deeper, I find that I get almost all the values (only a small number of mismatches):
print(len(data) == len(X[X.nonzero()].tolist()[0]))
# OUT: True
print(len(np.argwhere((data != X[X.nonzero()]))))
# OUT: 2184
So I get "only" 2,184 wrong values out of 2,821,574 total values.
Can someone please help me in getting all the correct values from my CSR matrix?
EDIT
I know now thanks to @hpaulj that I can use the class attributes X.indices and X.data to retrieve the CSR format index array and the CSR format data array of the matrix. However, I still would like to know why, in my case, I don't have np.alltrue(X[X.nonzero()] == X.data).
Without your data I can't replicate your problem, and probably wouldn't want to do so even with such a large array.
But I'll try to illustrate what I expect to happen when constructing a matrix this way. From another question I have a small matrix in an IPython session:
In [60]: Mx
Out[60]:
<1x3 sparse matrix of type '<class 'numpy.intc'>'
with 2 stored elements in Compressed Sparse Row format>
In [61]: Mx.A
Out[61]: array([[0, 1, 2]], dtype=int32)
nonzero returns the coo format indices, row, col
In [62]: Mx.nonzero()
Out[62]: (array([0, 0], dtype=int32), array([1, 2], dtype=int32))
The csr attributes are:
In [63]: Mx.data,Mx.indices,Mx.indptr
Out[63]:
(array([1, 2], dtype=int32),
 array([1, 2], dtype=int32),
 array([0, 2], dtype=int32))
Now let's make a new matrix, using the attributes of Mx. Assuming you constructed your indptr, indices, and data correctly, this should imitate what you've done:
In [64]: newM = sparse.csr_matrix((Mx.data, Mx.indices, Mx.indptr))
In [65]: newM.A
Out[65]: array([[0, 1, 2]], dtype=int32)
data matches between the two matrices:
In [68]: Mx.data==newM.data
Out[68]: array([ True, True])
The ids of the data arrays don't match, but their bases do. See my recent answer for why this is relevant:
https://stackoverflow.com/a/74543855/901925
In [75]: id(Mx.data.base), id(newM.data.base)
Out[75]: (2255407394864, 2255407394864)
That means changes to newM will appear in Mx:
In [77]: newM[0,1] = 100
In [78]: newM.A
Out[78]: array([[ 0, 100, 2]], dtype=int32)
In [79]: Mx.A
Out[79]: array([[ 0, 100, 2]], dtype=int32)
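If that sharing is unwanted, the constructor's copy flag makes the new matrix own its arrays. A sketch continuing the session above (newM2 is just an illustrative name):
# copy=True copies data/indices/indptr, so edits to newM2 leave Mx untouched.
newM2 = sparse.csr_matrix((Mx.data, Mx.indices, Mx.indptr), copy=True)
newM2[0, 1] = 999   # Mx.A is still [[  0, 100,   2]]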
fuller test
Let's try a small scale test of your code:
In [92]: data = np.array([[1.23,2],[3],[]],object); ind = np.array([[1,2],[3],[]],object)
    ...: indptr = np.empty(4)
    ...: indptr[0] = 0
    ...: for i in range(1, 4):
    ...:     indptr[i] = indptr[i-1] + len(data[i-1])
    ...: data = np.concatenate(data).ravel()
    ...: ind = np.concatenate(ind).ravel()  # same as above
In [93]: data,ind,indptr
Out[93]: (array([1.23, 2. , 3. ]), array([1., 2., 3.]), array([0., 2., 3., 3.]))
And the sparse matrix:
In [94]: X = sparse.csr_matrix((data, ind, indptr), shape=(3,3))
In [95]: X
Out[95]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
data matches:
In [96]: X.data
Out[96]: array([1.23, 2. , 3. ])
In [97]: data == X.data
Out[97]: array([ True, True, True])
and is in fact a view:
In [98]: data[1]+=.23; data
Out[98]: array([1.23, 2.23, 3. ])
In [99]: X.A
Out[99]:
array([[0.  , 1.23, 2.23],
       [0.  , 0.  , 0.  ],
       [3.  , 0.  , 0.  ]])
oops
I made an error in specifying the X shape:
In [110]: X = sparse.csr_matrix((data, ind, indptr), shape=(3,4))
In [111]: X.A
Out[111]:
array([[0.  , 1.23, 2.23, 0.  ],
       [0.  , 0.  , 0.  , 3.  ],
       [0.  , 0.  , 0.  , 0.  ]])
In [112]: X.data
Out[112]: array([1.23, 2.23, 3. ])
In [113]: X.nonzero()
Out[113]: (array([0, 0, 1], dtype=int32), array([1, 2, 3], dtype=int32))
In [114]: X[X.nonzero()]
Out[114]: matrix([[1.23, 2.23, 3. ]])
In [115]: data
Out[115]: array([1.23, 2.23, 3. ])
In [116]: data == X[X.nonzero()]
Out[116]: matrix([[ True, True, True]])
Depending on the type of the values you store in the matrix (numpy.float64 or numpy.int64, perhaps), the following post might answer your question: https://github.com/scipy/scipy/issues/13329#issuecomment-753541268
In particular, the comment "Apparently I don't get an error when data is a numpy array rather than a list." suggests that having data as a numpy array rather than a list could solve your problem.
Hopefully, this at least sets you on the right track.

Outer minimum vectorization in numpy follow up

This is a follow-up to my previous question.
Given an NxM matrix A, I want to efficiently obtain the NxN matrix whose ith row is the sum along the 2nd axis of the result of applying np.minimum between A and the ith row of A.
Using a for loop,
>>> A = np.array([[1, 2], [3, 4], [5, 6]])
>>> output = np.zeros(shape=(A.shape[0], A.shape[0]))
>>> for i in range(A.shape[0]):
...     output[i] = np.sum(np.minimum(A, A[i]), axis=1)
...
>>> output
array([[ 3.,  3.,  3.],
       [ 3.,  7.,  7.],
       [ 3.,  7., 11.]])
Is it possible to optimize this further without the for loop?
Edit: I would also like to do it without allocating an NxNxM tensor because of memory constraints.
You can use broadcasting instead of a for loop. Using the NumPy minimum and sum functions, you can compute the desired matrix output as follows:
output = np.sum(np.minimum(A[:, None], A), axis=2)
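As a sketch, checking the one-liner against the loop output from the question:
import numpy as np
A = np.array([[1, 2], [3, 4], [5, 6]])
# A[:, None] has shape (N, 1, M); broadcasting against A (N, M) yields an
# (N, N, M) array of pairwise minima, which is then summed over the last axis.
output = np.sum(np.minimum(A[:, None], A), axis=2)
print(output)
# [[ 3  3  3]
#  [ 3  7  7]
#  [ 3  7 11]]
Note that this materializes the full NxNxM array of pairwise minima, so it does not address the memory constraint from the edit; for that you would have to process the rows in chunks.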

Efficiently index 2d numpy array using two 1d arrays

I have a large 2d numpy array and two 1d arrays that represent x/y indexes within the 2d array. I want to use these 1d arrays to perform an operation on the 2d array.
I can do this with a for loop, but it's very slow when working on a large array. Is there a faster way? I tried using the 1d arrays simply as indexes but that didn't work. See this example:
import numpy as np
# Two example 2d arrays
cnt_a = np.zeros((4,4))
cnt_b = np.zeros((4,4))
# 1d arrays holding x and y indices
xpos = [0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3]
ypos = [3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0]
# This method works, but is very slow for a large array
for i in range(0, len(xpos)):
    cnt_a[xpos[i], ypos[i]] = cnt_a[xpos[i], ypos[i]] + 1
# This method is fast, but gives an incorrect answer
cnt_b[xpos, ypos] = cnt_b[xpos, ypos] + 1
# Print the results
print('Good:')
print(cnt_a)
print('')
print('Bad:')
print(cnt_b)
The output from this is:
Good:
[[ 2.  1.  2.  1.]
 [ 0.  3.  1.  2.]
 [ 1.  1.  1.  1.]
 [ 1.  0.  0.  0.]]
Bad:
[[ 1.  1.  1.  1.]
 [ 0.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  0.  0.  0.]]
For the cnt_b array numpy is obviously not summing correctly, but I'm unsure how to fix this without resorting to the (v. inefficient) for loop used to calculate cnt_a.
Another approach by using 1D indexing (suggested by @Shai) extended to answer the actual question:
>>> out = np.zeros((4, 4))
>>> idx = np.ravel_multi_index((xpos, ypos), out.shape) # extract 1D indexes
>>> x = np.bincount(idx, minlength=out.size)
>>> out.flat += x
The fancy-indexed version fails because cnt_b[xpos, ypos] + 1 is computed from the original values before the assignment, so repeated index pairs only count once. np.bincount instead calculates how many times each index is present in (xpos, ypos) and stores the counts in x.
Or, as suggested by @Divakar:
>>> out.flat += np.bincount(idx, minlength=out.size)
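For reference, a self-contained run with the question's data (a sketch):
import numpy as np
xpos = np.array([0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3])
ypos = np.array([3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0])
out = np.zeros((4, 4))
idx = np.ravel_multi_index((xpos, ypos), out.shape)   # one 1D index per (x, y) pair
out.flat += np.bincount(idx, minlength=out.size)      # add the duplicate counts
print(out)   # matches the cnt_a ("Good") result above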
We could compute the linear indices, then accumulate into zeros-initialized output array with np.add.at. Thus, with xpos and ypos as arrays, here's one implementation -
m,n = xpos.max()+1, ypos.max()+1
out = np.zeros((m,n),dtype=int)
np.add.at(out.ravel(), xpos*n+ypos, 1)
Sample run -
In [95]: # 1d arrays holding x and y indices
...: xpos = np.array([0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3])
...: ypos = np.array([3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0])
...:
In [96]: cnt_a = np.zeros((4,4))
In [97]: # This method works, but is very slow for a large array
    ...: for i in range(0, len(xpos)):
    ...:     cnt_a[xpos[i], ypos[i]] = cnt_a[xpos[i], ypos[i]] + 1
...:
In [98]: m,n = xpos.max()+1, ypos.max()+1
...: out = np.zeros((m,n),dtype=int)
...: np.add.at(out.ravel(), xpos*n+ypos, 1)
...:
In [99]: cnt_a
Out[99]:
array([[ 2.,  1.,  2.,  1.],
       [ 0.,  3.,  1.,  2.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  0.,  0.,  0.]])
In [100]: out
Out[100]:
array([[2, 1, 2, 1],
       [0, 3, 1, 2],
       [1, 1, 1, 1],
       [1, 0, 0, 0]])
You can iterate over both lists at once and increment for each pair (if you are not used to it, zip combines lists):
for x, y in zip(xpos, ypos):
    cnt_b[x][y] += 1
But this will be about the same speed as your cnt_a loop.
If your lists xpos/ypos are of length n, I don't see how you can update your matrix in less than O(n), since you'll have to check each pair one way or another.
Other solution: you could count the duplicate index pairs (with collections.Counter, possibly) (e.g. (0, 3), etc.) and update the matrix with the count values. But I doubt it would be much faster, since the time gained updating the matrix would be lost counting the occurrences.
Maybe I am totally wrong though, in which case I'd be curious to see a sub-O(n) answer.
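A sketch of that Counter variant:
from collections import Counter
import numpy as np
xpos = [0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3]
ypos = [3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0]
cnt_b = np.zeros((4, 4))
# Count duplicate (x, y) pairs once, then write each count in a single update.
for (x, y), c in Counter(zip(xpos, ypos)).items():
    cnt_b[x, y] += c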
I think you are looking for the ravel_multi_index function:
lidx = np.ravel_multi_index((xpos, ypos), cnt_a.shape)
This converts the (x, y) pairs to "flattened" 1D indices into cnt_a and cnt_b. Then accumulate with np.add.at, applied to the raveled array since lidx indexes the flattened matrix:
np.add.at(cnt_b.ravel(), lidx, 1)

numpy classification comparison with 3d array

I'm trying to do some basic classification of numpy arrays...
I want to compare a 2d array against a 3d array, along the 3rd dimension, and make a classification based on the corresponding z-axis values.
so given 3 arrays that are stacked into a 3d array:
import numpy as np
a1 = np.array([[1,1,1],[1,1,1],[1,1,1]])
a2 = np.array([[3,3,3],[3,3,3],[3,3,3]])
a3 = np.array([[5,5,5],[5,5,5],[5,5,5]])
a3d = np.dstack((a1,a2,a3))
and another 2d array
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
I want to be able to compare a2d against a3d, and return a 2d array of which level of a3d is closest. (Or, I suppose, any custom function that can compare each value along the z-axis and return a value based on that comparison.)
EDIT
I modified my arrays to more closely match my data. a1 would be the minimum values, a2 the average values, and a3 the maximum values. So I want to output whether each a2d value is closer to a1 (classed "1"), a2 (classed "2"), or a3 (classed "3"). I'm doing it as a 3d array because in the real data it won't be a simple 3-array choice, but for SO purposes it helps to keep it simple. We can assume that in the case of a tie, we'll take the lower, so 2 would be classed as level "1", 4 as level "2".
You can use the following list comprehension:
>>> [sum(sum(abs(i-j)) for i,j in z) for z in [zip(i,a2d) for i in a3d]]
[30.0, 22.5, 30.0]
In the preceding code I create the following list with zip, i.e. the zip of each sub-array of your 3d array with a2d; then all you need is to take the sum of the absolute differences within each pair, and then sum those again:
>>> [zip(i,a2d) for i in a3d]
[[(array([ 1., 3., 1.]), array([1, 2, 1])), (array([ 2., 2., 1.]), array([5, 5, 4])), (array([ 3., 1., 1.]), array([9, 8, 8]))], [(array([ 4., 6., 4.]), array([1, 2, 1])), (array([ 5. , 6.5, 4. ]), array([5, 5, 4])), (array([ 6., 4., 4.]), array([9, 8, 8]))], [(array([ 7., 9., 7.]), array([1, 2, 1])), (array([ 8., 8., 7.]), array([5, 5, 4])), (array([ 9., 7., 7.]), array([9, 8, 8]))]]
Then for each of your sub-arrays you'll have the following list:
[30.0, 22.5, 30.0]
which shows, for each sub-array, its level of difference from the 2d array. With that list stored in l, you can then get the closest sub-array from a3d like the following:
>>> a3d[l.index(min(l))]
array([[ 4. ,  6. ,  4. ],
       [ 5. ,  6.5,  4. ],
       [ 6. ,  4. ,  4. ]])
Also you can put it in a function:
>>> def find_nearest(sub,main):
... l=[sum(sum(abs(i-j)) for i,j in z) for z in [zip(i,sub) for i in main]]
... return main[l.index(min(l))]
...
>>> find_nearest(a2d,a3d)
array([[ 4. ,  6. ,  4. ],
       [ 5. ,  6.5,  4. ],
       [ 6. ,  4. ,  4. ]])
You might consider a different approach using numpy.vectorize, which lets you apply a python function to each element of your array.
In this case, your python function could just classify each pixel with whatever breaks you define:
import numpy as np

a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])

def classify(x):
    if x >= 4:
        return 3
    elif x >= 2:
        return 2
    elif x > 0:
        return 1
    else:
        return 0

vclassify = np.vectorize(classify)
result = vclassify(a2d)
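For the sample a2d this yields:
print(result)
# [[1 2 3]
#  [3 3 2]
#  [2 2 2]]
(Keep in mind that np.vectorize is a convenience wrapper around a Python loop, not true vectorization, so it mainly buys readability rather than speed.)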
Thanks to @perrygeo and @Kasra - they got me thinking in a good direction.
Since I want a classification of the closest 3d array's z value, I couldn't do simple math - I needed the (z)index of the closest value.
I did it by enumerating both axes of the 2d array, and doing a proximity compare against the corresponding (z)index of the 3d array.
There might be a way to do this without iterating the 2d array, but at least I'm avoiding iterating the 3d.
import numpy as np

a1 = np.array([[1,1,1],[1,1,1],[1,1,1]])
a2 = np.array([[3,3,3],[3,3,3],[3,3,3]])
a3 = np.array([[5,5,5],[5,5,5],[5,5,5]])
a3d = np.dstack((a1, a2, a3))
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
classOut = np.empty_like(a2d)

def find_nearest_idx(array, value):
    idx = (np.abs(array - value)).argmin()
    return idx

# enumerate to get indices
for i, a in enumerate(a2d):
    for ii, v in enumerate(a):
        valStack = a3d[i, ii]
        nearest = find_nearest_idx(valStack, v)
        classOut[i, ii] = nearest

print(classOut)
which gets me
[[0 0 1]
 [2 2 0]
 [0 1 1]]
This tells me that (for example) a2d[0,0] is closest to the 0-index of a3d[0,0], which in my case means it is closest to the min value for that 2d position. a2d[1,1] is closest to the 2-index, which in my case means closer to the max value for that 2d position.
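For what it's worth, the 2d iteration can be avoided as well: broadcasting a2d against the z-axis and taking the argmin of the absolute differences gives the same classification. A sketch (ties resolve to the lower index, matching the rule in the EDIT):
import numpy as np
a3d = np.dstack((np.full((3, 3), 1), np.full((3, 3), 3), np.full((3, 3), 5)))
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
# |a3d - a2d[..., None]| has shape (3, 3, 3); argmin over the z-axis picks
# the index of the nearest level at every 2d position.
classOut = np.abs(a3d - a2d[:, :, None]).argmin(axis=2)
print(classOut)
# [[0 0 1]
#  [2 2 0]
#  [0 1 1]]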

Substitute for numpy broadcasting using scipy.sparse.csc_matrix

I have in my code the following expression:
a = (b / x[:, np.newaxis]).sum(axis=1)
where b is an ndarray of shape (M, N), and x is an ndarray of shape (M,). Now, b is actually sparse, so for memory efficiency I would like to substitute in a scipy.sparse.csc_matrix or csr_matrix. However, broadcasting in this way is not implemented (even though division or multiplication is guaranteed to maintain sparsity here, since the entries of x are non-zero); it raises a NotImplementedError instead. Is there a sparse function I'm not aware of that would do what I want? (dot() would sum along the wrong axis.)
If b is in CSC format, then b.data has the non-zero entries of b, and b.indices has the row index of each of the non-zero entries, so you can do your division as:
b.data /= np.take(x, b.indices)
It's hackier than Warren's elegant solution, but it will probably also be faster in most settings:
b = sps.rand(1000, 1000, density=0.01, format='csc')
x = np.random.rand(1000)

def row_divide_col_reduce(b, x):
    data = b.data.copy() / np.take(x, b.indices)
    ret = sps.csc_matrix((data, b.indices.copy(), b.indptr.copy()),
                         shape=b.shape)
    return ret.sum(axis=1)

def row_divide_col_reduce_bis(b, x):
    d = sps.spdiags(1.0/x, 0, len(x), len(x))
    return (d * b).sum(axis=1)
In [2]: %timeit row_divide_col_reduce(b, x)
1000 loops, best of 3: 210 us per loop
In [3]: %timeit row_divide_col_reduce_bis(b, x)
1000 loops, best of 3: 697 us per loop
In [4]: np.allclose(row_divide_col_reduce(b, x),
...: row_divide_col_reduce_bis(b, x))
Out[4]: True
You can cut the time almost in half in the above example if you do the division in-place, i.e.:
def row_divide_col_reduce(b, x):
    b.data /= np.take(x, b.indices)
    return b.sum(axis=1)
In [2]: %timeit row_divide_col_reduce(b, x)
10000 loops, best of 3: 131 us per loop
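Note that the in-place version overwrites b.data, so only use it if you no longer need the original matrix.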
To implement a = (b / x[:, np.newaxis]).sum(axis=1), you can use a = b.sum(axis=1).A1 / x. The A1 attribute returns the 1D ndarray, so the result is a 1D ndarray, not a matrix. This concise expression works because dividing every element of row i by x[i] and then summing the row is the same as summing the row first and then dividing by x[i]. For example:
In [190]: b
Out[190]:
<3x3 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [191]: b.A
Out[191]:
array([[ 1.,  0.,  2.],
       [ 0.,  3.,  0.],
       [ 4.,  0.,  5.]])
In [192]: x
Out[192]: array([ 2., 3., 4.])
In [193]: b.sum(axis=1).A1 / x
Out[193]: array([ 1.5 , 1. , 2.25])
More generally, if you want to scale the rows of a sparse matrix with a vector x, you could multiply b on the left with a sparse matrix containing 1.0/x on the diagonal. The function scipy.sparse.spdiags can be used to create such a matrix. For example:
In [71]: from scipy.sparse import csc_matrix, spdiags
In [72]: b = csc_matrix([[1,0,2],[0,3,0],[4,0,5]], dtype=np.float64)
In [73]: b.A
Out[73]:
array([[ 1.,  0.,  2.],
       [ 0.,  3.,  0.],
       [ 4.,  0.,  5.]])
In [74]: x = array([2., 3., 4.])
In [75]: d = spdiags(1.0/x, 0, len(x), len(x))
In [76]: d.A
Out[76]:
array([[ 0.5       ,  0.        ,  0.        ],
       [ 0.        ,  0.33333333,  0.        ],
       [ 0.        ,  0.        ,  0.25      ]])
In [77]: p = d * b
In [78]: p.A
Out[78]:
array([[ 0.5 ,  0.  ,  1.  ],
       [ 0.  ,  1.  ,  0.  ],
       [ 1.  ,  0.  ,  1.25]])
In [79]: a = p.sum(axis=1)
In [80]: a
Out[80]:
matrix([[ 1.5 ],
        [ 1.  ],
        [ 2.25]])
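Putting the two pieces together as a drop-in replacement for the dense expression, as a sketch:
from scipy.sparse import csc_matrix, spdiags
import numpy as np
b = csc_matrix([[1., 0, 2], [0, 3, 0], [4, 0, 5]])
x = np.array([2., 3., 4.])
# Scale rows by 1/x via a diagonal matrix, sum along axis 1, flatten to 1D.
a = (spdiags(1.0/x, 0, len(x), len(x)) * b).sum(axis=1).A1
print(a)   # [1.5  1.   2.25]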
