For creating a scipy sparse matrix, I have an array of row and column indices I and J along with a data array V. I use those to construct a matrix in COO format and then convert it to CSR:
matrix = sparse.coo_matrix((V, (I, J)), shape=(n, n))
matrix = matrix.tocsr()
I have a set of row indices for which the only entry should be a 1.0 on the diagonal. So far, I go through I, find all indices that need wiping, and do just that:
def find(lst, a):
    # From <http://stackoverflow.com/a/16685428/353337>
    return [i for i, x in enumerate(lst) if x in a]
# wipe_rows = [1, 55, 32, ...] # something something
indices = find(I, wipe_rows) # takes too long
I = numpy.delete(I, indices).tolist()
J = numpy.delete(J, indices).tolist()
V = numpy.delete(V, indices).tolist()
# Add entry 1.0 to the diagonal for each wipe row
I.extend(wipe_rows)
J.extend(wipe_rows)
V.extend(numpy.ones(len(wipe_rows)))
# construct matrix via coo
That works alright, but find tends to take a while.
Any hints on how to speed this up? (Perhaps wiping the rows in COO or CSR format is a better idea.)
If you intend to clear multiple rows at once, this
def _wipe_rows_csr(matrix, rows):
    assert isinstance(matrix, sparse.csr_matrix)
    # Zero out all stored entries of the given rows
    for i in rows:
        matrix.data[matrix.indptr[i]:matrix.indptr[i+1]] = 0.0
    # Set the diagonal
    d = matrix.diagonal()
    d[rows] = 1.0
    matrix.setdiag(d)
    return
is by far the fastest method. It doesn't actually remove the rows, but sets all their entries to zero, then fiddles with the diagonal.
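For reference, a minimal usage sketch; the small matrix and the wiped rows here are made-up example data, not from the question:
import numpy as np
from scipy import sparse

n = 4
I = np.array([0, 1, 1, 2, 3])
J = np.array([0, 0, 1, 2, 3])
V = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
matrix = sparse.coo_matrix((V, (I, J)), shape=(n, n)).tocsr()

_wipe_rows_csr(matrix, np.array([1, 2]))
print(matrix.toarray())  # rows 1 and 2 now hold only a 1.0 on the diagonal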
If the entries are actually to be removed, one has to do some array manipulation. This can be quite costly, but if speed is no issue, the following function
def _wipe_row_csr(A, i):
    '''Wipes a row of a matrix in CSR format and puts 1.0 on the diagonal.
    '''
    assert isinstance(A, sparse.csr_matrix)
    # number of stored entries in row i; the row shrinks by n-1 entries
    n = A.indptr[i+1] - A.indptr[i]
    assert n > 0
    new_len = len(A.data) - (n - 1)
    # shift the entries of all later rows left, keeping one slot for the 1.0
    # (an explicit end index instead of -n+1, which breaks for n == 1)
    A.data[A.indptr[i]+1:new_len] = A.data[A.indptr[i+1]:]
    A.data[A.indptr[i]] = 1.0
    A.data = A.data[:new_len]
    # same shift for the column indices
    A.indices[A.indptr[i]+1:new_len] = A.indices[A.indptr[i+1]:]
    A.indices[A.indptr[i]] = i
    A.indices = A.indices[:new_len]
    # every later row now starts n-1 entries earlier
    A.indptr[i+1:] -= n - 1
    return
replaces a given row i of the matrix A by the entry 1.0 on the diagonal.
np.in1d should be a faster way of finding the indices:
In [322]: I # from a np.arange(12).reshape(4,3) matrix
Out[322]: array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int32)
In [323]: indices=[i for i, x in enumerate(I) if x in [1,2]]
In [324]: indices
Out[324]: [2, 3, 4, 5, 6, 7]
In [325]: ind1=np.in1d(I,[1,2])
In [326]: ind1
Out[326]:
array([False, False, True, True, True, True, True, True, False,
False, False], dtype=bool)
In [327]: np.where(ind1) # same as indices
Out[327]: (array([2, 3, 4, 5, 6, 7], dtype=int32),)
In [328]: I[~ind1] # same as the delete
Out[328]: array([0, 0, 3, 3, 3], dtype=int32)
Direct manipulation of the coo inputs like this is often a good way. But another is to take advantage of the csr math abilities. You should be able to construct a diagonal matrix that zeros out the correct rows, and then adds the ones back in.
Here's what I have in mind:
In [357]: A=np.arange(16).reshape(4,4)
In [358]: M=sparse.coo_matrix(A)
In [359]: M.A
Out[359]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [360]: d1=sparse.diags([(1,0,0,1)],[0],(4,4))
In [361]: d2=sparse.diags([(0,1,1,0)],[0],(4,4))
In [362]: (d1*M+d2).A
Out[362]:
array([[ 0., 1., 2., 3.],
[ 0., 1., 0., 0.],
[ 0., 0., 1., 0.],
[ 12., 13., 14., 15.]])
In [376]: x=np.ones((4,),bool);x[[1,2]]=False
In [378]: d1=sparse.diags([x],[0],(4,4),dtype=int)
In [379]: d2=sparse.diags([~x],[0],(4,4),dtype=int)
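Wrapped up as a function (a sketch, assuming a square sparse matrix M; wipe_rows_via_diag is just a name used here):
import numpy as np
from scipy import sparse

def wipe_rows_via_diag(M, rows):
    # one diagonal keeps the untouched rows, the other re-inserts
    # 1.0 in the wiped rows; combine them with sparse math
    keep = np.ones(M.shape[0], dtype=bool)
    keep[rows] = False
    d1 = sparse.diags([keep.astype(float)], [0], M.shape)
    d2 = sparse.diags([(~keep).astype(float)], [0], M.shape)
    return d1 * M + d2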
Doing this with lil format looks easy:
In [592]: wipe=[1,2]
In [593]: Ml=M.tolil()
In [594]: Ml.data[wipe]=[[1]]*len(wipe)
In [595]: Ml.rows[wipe]=[[i] for i in wipe]
In [596]: Ml.A
Out[596]:
array([[ 0, 1, 2, 3],
[ 0, 1, 0, 0],
[ 0, 0, 1, 0],
[12, 13, 14, 15]], dtype=int32)
It's sort of what you are doing with the csr format, but here it's easy to replace each row's lists with the appropriate [1] and [i]. But conversion times (tolil etc.) can hurt run times.
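For completeness, the lil variant packaged as a function (a sketch; it loops row by row to keep the object-array assignment simple):
def wipe_rows_lil(M, wipe):
    Ml = M.tolil()
    for i in wipe:
        # replace each row's index/value lists with [i] and [1.0]
        Ml.rows[i] = [i]
        Ml.data[i] = [1.0]
    return Ml.tocsr()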
Related
I am trying to find both the location and the value of the minimum element of a sparse matrix for each row. A toy example for the question is given below:
Here, we have a 3x7 sparse matrix "M".
H = np.array([[1, 2, 3, 0, 4, 0, 0],
              [0, 5, 0, 6, 0, 0, 0],
              [0, 0, 0, 7, 0, 0, 8]], dtype=np.float32)
M = scipy.sparse.csr_matrix(H)
Then, what I would like to obtain is the nonzero minimum elements of each row.
For the example above:
min_elements = some_function(M, axis=1)
and receiving the return as min_elements = [1, 5, 7]. The method M.min(axis=1) does not work for my case since the minimum element of each row is zero, and it therefore returns an all-zeros array.
Thus, is there an efficient way of implementing such a function using sparse matrices? In my general case, the sparse matrices will be quite large and require lots of additional computation. For this reason, performance/speed is the main benchmark for me.
Thank you!
In [333]: from scipy import sparse
In [334]: M = sparse.csr_matrix(H)
In [335]: M
Out[335]:
<3x7 sparse matrix of type '<class 'numpy.float32'>'
with 8 stored elements in Compressed Sparse Row format>
M is stored as:
In [336]: M.indptr
Out[336]: array([0, 4, 6, 8], dtype=int32)
In [337]: M.data
Out[337]: array([1., 2., 3., 4., 5., 6., 7., 8.], dtype=float32)
In [338]: M.indices
Out[338]: array([0, 1, 2, 4, 1, 3, 3, 6], dtype=int32)
We can iterate on the slices defined by indptr, and take the min:
In [340]: for i in range(M.shape[0]):
...: sl = slice(M.indptr[i],M.indptr[i+1])
...: x, y = M.data[sl], M.indices[sl]
...: m = np.argmin(x)
...: print(y[m], x[m])
...:
0 1.0
1 5.0
3 7.0
This can be streamlined a bit, but it gives the basic idea.
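Packaged up, that loop might look like this (a sketch; row_nonzero_min is just a name used here):
import numpy as np

def row_nonzero_min(M):
    # column index and value of the smallest stored entry of each row
    # of a CSR matrix; assumes every row has at least one stored entry
    cols = np.empty(M.shape[0], dtype=M.indices.dtype)
    vals = np.empty(M.shape[0], dtype=M.data.dtype)
    for i in range(M.shape[0]):
        sl = slice(M.indptr[i], M.indptr[i+1])
        m = np.argmin(M.data[sl])
        cols[i] = M.indices[sl][m]
        vals[i] = M.data[sl][m]
    return cols, vals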
It may be easier to picture what's going on in the lil format:
In [341]: Ml = M.tolil()
In [342]: Ml.data
Out[342]:
array([list([1.0, 2.0, 3.0, 4.0]), list([5.0, 6.0]), list([7.0, 8.0])],
dtype=object)
In [343]: Ml.rows
Out[343]: array([list([0, 1, 2, 4]), list([1, 3]), list([3, 6])], dtype=object)
In [344]: for d,r in zip(Ml.data, Ml.rows):
...: m = np.argmin(d)
...: print(r[m], d[m])
...:
0 1.0
1 5.0
3 7.0
Previous SO questions have asked for things like the smallest (or largest) N values by row.
Sparse is best for things that can be expressed as some sort of matrix multiplication. That includes row (or column) sums. Even csr indexing is done with matrix multiplication. Other row-by-row operations aren't as easy.
You could take the reciprocal of all your data and find the maximum. This assumes all your data is positive, as in the example.
M_inv = M.copy()
M_inv.data = 1 / M.data
one_over_min_M = M_inv.max(axis=1)
min_M = 1 / one_over_min_M.toarray()
On your example I get the output
[[1. ]
[5. ]
[6.9999995]]
There is some horrible numerical error there, but if you're happy to round your answer...
Edit: This approach might be redeemed if you're after the indices and want to do M_inv.argmax(axis=1), otherwise it's probably not the best.
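A sketch of that index-based variant, reusing M from above (assumes strictly positive stored values and a scipy recent enough to have sparse argmax):
import numpy as np

M_inv = M.copy()
M_inv.data = 1 / M.data
min_cols = np.asarray(M_inv.argmax(axis=1)).ravel()  # column of each row's minimum
min_vals = np.asarray(M[np.arange(M.shape[0]), min_cols]).ravel()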
I'm somewhat new to numpy so this might be a dumb question, but here goes:
Let's say I have a tensor of any shape and size, say (100,5,5) or (3,3,10,15,4). I have a randomly generated list of indices for points I want to replace with np.nan. For a (3,3,3) test case, it would be as follows:
>> data = np.random.randn(3,3,3)
>> data
array([[[ 0.21368315, -1.42814113, 1.23021783],
[ 0.25835315, 0.44775156, -1.20489094],
[ 0.25928972, 0.39486046, -1.79189447]],
[[ 2.24080908, -0.89617961, -0.29550817],
[ 0.21756087, 1.33996913, -1.24418745],
[-0.63617598, 0.56848439, 0.8175564 ]],
[[ 0.61367002, -1.16104071, -0.53488283],
[ 1.0363354 , -0.76888041, 1.24524786],
[-0.84329375, -0.61744489, 1.50502058]]])
>> idxs = np.argwhere(np.isfinite(data))
>> dropidxs = idxs[np.random.choice(idxs.shape[0], 3, replace=False)]
>> dropidxs
array([[1, 1, 1],
[2, 0, 2],
[2, 1, 0]])
How do I replace the corresponding values? Previously, when I was only dealing with the 3D case, I did it using the following.
for idx in dropidxs:
    i, j, k = idx
    missingCube[i, j, k] = np.nan
But now, I want the function to be able to handle tensors of any size.
I've tried
for idx in dropidxs:
    missingCube[idx] = np.nan
and
missingCube[dropidxs] = np.nan
But both (unsurprisingly) end up removing a corresponding slice along axis=0. How should I approach this? Is there an easier way to achieve what I'm trying to do?
In [486]: data = np.random.randn(3,3,3)
With this creation all terms are finite, so nonzero returns a tuple of (27,) arrays:
In [487]: idx = np.nonzero(np.isfinite(data))
In [488]: len(idx)
Out[488]: 3
In [489]: idx[0].shape
Out[489]: (27,)
argwhere produces the same numbers, but in a 2d array:
In [490]: idxs = np.argwhere(np.isfinite(data))
In [491]: idxs.shape
Out[491]: (27, 3)
So you select a subset.
In [492]: dropidxs = idxs[np.random.choice(idxs.shape[0], 3, replace=False)]
In [493]: dropidxs.shape
Out[493]: (3, 3)
In [494]: dropidxs
Out[494]:
array([[1, 1, 0],
[2, 1, 2],
[2, 1, 1]])
We could have generated the same subset with x = np.random.choice(...), applying that x to the arrays in the idx tuple. But in this case, the argwhere array is easier to work with.
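For example, something along these lines (a sketch):
x = np.random.choice(idx[0].shape[0], 3, replace=False)
tup = tuple(a[x] for a in idx)  # the same kind of index tuple as below
data[tup] = np.nan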
But to apply that array to indexing we still need a tuple of arrays:
In [495]: tup = tuple([dropidxs[:,i] for i in range(3)])
In [496]: tup
Out[496]: (array([1, 2, 2]), array([1, 1, 1]), array([0, 2, 1]))
In [497]: data[tup]
Out[497]: array([-0.27965058, 1.2981397 , 0.4501406 ])
In [498]: data[tup]=np.nan
In [499]: data
Out[499]:
array([[[-0.4899279 , 0.83352547, -1.03798762],
[-0.91445783, 0.05777183, 0.19494065],
[ 0.6835925 , -0.47846423, 0.13513958]],
[[-0.08790631, 0.30224828, -0.39864576],
[ nan, -0.77424244, 1.4788093 ],
[ 0.41915952, -0.09335664, -0.47359613]],
[[-0.40281937, 1.64866377, -0.40354504],
[ 0.74884493, nan, nan],
[ 0.13097487, -1.63995208, -0.98857852]]])
Or we could index with:
In [500]: data[dropidxs[:,0],dropidxs[:,1],dropidxs[:,2]]
Out[500]: array([nan, nan, nan])
Actually, a transpose of dropidxs might be more convenient:
In [501]: tdrop = dropidxs.T
In [502]: tuple(tdrop)
Out[502]: (array([1, 2, 2]), array([1, 1, 1]), array([0, 2, 1]))
In [503]: data[tuple(tdrop)]
Out[503]: array([nan, nan, nan])
Sometimes we can use * to expand a list/array into a tuple, but not when indexing:
In [504]: data[*tdrop]
File "<ipython-input-504-cb619d907adb>", line 1
data[*tdrop]
^
SyntaxError: invalid syntax
but we can create the tuple with:
In [506]: data[(*tdrop,)]
Out[506]: array([nan, nan, nan])
Is this what you're searching for:
import numpy as np
x = np.random.randn(10, 3, 3, 3)
new_value = 0
x[x < 0] = new_value
or
x[x == -np.inf] = 0
You can choose from the flattened indices and convert back to data indices to set elements to np.nan. Here a seed of 41 makes the results reproducible, choosing 3 elements.
import numpy as np
data = np.random.randn(3,3,3)
rng = np.random.default_rng(41)
idx = rng.choice(np.arange(data.size), 3, replace=False)
data[np.unravel_index(idx, data.shape)] = np.nan
data
Output
array([[[ 0.13180452, -0.81228319, -0.04456739],
[ 0.53060077, -0.2246579 , 1.83926463],
[-0.38670047, -0.53703577, 0.49275628]],
[[ 0.36671354, 1.44012848, -0.57209412],
[ 0.53960111, -1.06578638, 1.10669842],
[ 1.1772824 , nan, -0.82792041]],
[[-0.03352594, 0.29351109, 0.57021538],
[-0.33291872, nan, 0.04675677],
[ nan, 2.59450517, -1.9579655 ]]])
I'm looking for a solution to sum per column in a 2D array ("a" in the example below), starting from a cell position as defined in a separate 1D array ("ref" in the example below).
I have tried the following:
import numpy as np
a = np.arange(20).reshape(5, 4)
print(a) # representing an original large 2D array
ref = np.array([0, 2, 4, 1]) # reference array for defining start of sum
s = a.sum(axis=0)
print(s) # Works: sums all elements per column
s = a[2:].sum(axis=0)
print(s) # Works as well: sum from the third element till end per column
# This is what I look for: sum per column starting at element defined by ref[]
s = np.zeros(4).astype(int)  # zero-filled 1D array
for i in np.arange(4):  # for each column
    for j in np.arange(ref[i], 5):
        s[i] += a[j, i]  # sums all elements from ref[i] to the end (i.e. 5)
print(s) # This is the desired outcome
for i in np.arange(4):
    s = a[ref[i]:].sum(axis=0)
print(s)  # No good; same as a[ref[3]:].sum(axis=0), and here ref[3] = 1
s = np.zeros(4).astype(int)  # zero-filled 1D array
for i in np.arange(4):
    s[i] = np.sum(a[ref[i]:, i])
print(s)  # Yes; this is also the desired outcome
Is it possible to realize this without using a for loop?
Does numpy have functions for doing this in a single step?
s = a[ref:].sum(axis=0)
This would be nice, but is not working.
Thank you for your time!
A basic solution based on np.cumsum:
In [1]: a = np.arange(15).reshape(5, 3)
In [2]: res = np.array([0, 2, 3])
In [3]: b = np.cumsum(a, axis=0)
In [4]: b
Out[4]:
array([[ 0, 1, 2],
[ 3, 5, 7],
[ 9, 12, 15],
[18, 22, 26],
[30, 35, 40]])
In [5]: a
Out[5]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
In [6]: b[res, np.arange(a.shape[1])]
Out[6]: array([ 0, 12, 26])
In [7]: b[-1, :] - b[res, np.arange(a.shape[1])]
Out[7]: array([30, 23, 14])
so it does not give us the result we want: we need to prepend a row of zeros to b:
In [13]: b = np.vstack([np.zeros((1, a.shape[1])), b])
In [14]: b
Out[14]:
array([[ 0., 0., 0.],
[ 0., 1., 2.],
[ 3., 5., 7.],
[ 9., 12., 15.],
[ 18., 22., 26.],
[ 30., 35., 40.]])
In [17]: b[-1, :] - b[res, np.arange(a.shape[1])]
Out[17]: array([ 30., 30., 25.])
which is, I believe, the desired output.
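Wrapped into a function and applied to the arrays from the question (a sketch; colsum_from is just a name used here):
import numpy as np

def colsum_from(a, ref):
    # per-column sum from row ref[j] to the end, via the cumsum trick
    b = np.vstack([np.zeros((1, a.shape[1]), dtype=a.dtype), np.cumsum(a, axis=0)])
    return b[-1, :] - b[ref, np.arange(a.shape[1])]

a = np.arange(20).reshape(5, 4)
ref = np.array([0, 2, 4, 1])
print(colsum_from(a, ref))  # [40 39 18 52], same as the loop versions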
I want to count how many times each row of values occurs in an integer array, accumulating the counts into a multi-dimensional array indexed by those values. For example, given:
import numpy as np
data = np.array(
[[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 0, 1],
[0, 1, 1],
[0, 0, 0]])
I want to get a 3-dimensional array, looking like:
result = array([[[ 2., 0.],
[ 0., 2.]],
[[ 0., 2.],
[ 0., 0.]]])
One way is:
for row in data:
    newArray[row[0]][row[1]][row[2]] += 1
What I'm trying to do is the following:
for i in dimension1:
    for j in dimension2:
        for k in dimension3:
            result[i, j, k] = (data[data[data[:, 0] == i, 1] == j, 2] == k).sum()
This doesn't seem to work and I would like to achieve the desired result by sticking to my implementation rather than the one mentioned in the beginning (or using any extra imports, eg counter).
Thanks.
You can also use numpy.histogramdd for this:
>>> np.histogramdd(data, bins=(2, 2, 2))[0]
array([[[ 2., 0.],
[ 0., 2.]],
[[ 0., 2.],
[ 0., 0.]]])
The problem is that data[data[data[:,0]==i, 1]==j, 2]==k is not what you expect it to be.
Let's take this apart for the case (i, j, k) == (0, 0, 0).
data[:,0]==0 is [True, True, False, False, True, True], and data[data[:,0]==0] correctly gives us the rows where the first number is 0.
Now from those rows we get the ones where the second number is 0: data[data[:,0]==0, 1]==0, which gives us [True, False, False, True]. And this is the problem. Because if we take those indices from data, i.e., data[data[data[:,0]==0, 1]==0], we do not get the rows where the first and second numbers are 0, but the 0th and 3rd rows instead:
In [51]: data[data[data[:,0]==0, 1]==0]
Out[51]: array([[0, 0, 0],
[1, 0, 1]])
And if we now filter for the rows where the third number is 0, we get the wrong result w.r.t. the original data.
And that's why your approach does not work. For better methods, see the other answers.
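If you want to keep the triple-loop shape, the fix is to combine all three conditions on the full, unfiltered rows; a sketch with the data from the question:
import numpy as np

data = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1],
                 [1, 0, 1], [0, 1, 1], [0, 0, 0]])
result = np.zeros((2, 2, 2))
for i in range(2):
    for j in range(2):
        for k in range(2):
            # each mask indexes the same rows of data, so they can be ANDed
            result[i, j, k] = ((data[:, 0] == i) &
                               (data[:, 1] == j) &
                               (data[:, 2] == k)).sum()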
You can do something like the following
# Get output dimensions and construct the output array.
>>> dshape = tuple(data.max(axis=0) + 1)
>>> dshape
(2, 2, 2)
>>> out = np.zeros(dshape)
If you have numpy 1.8+:
>>> np.add.at(out, tuple(data.T), 1)
Else:
#Get indices and unique the resulting array
>>> inds = np.ravel_multi_index(data.T, dshape)
>>> inds, inverse = np.unique(inds, return_inverse=True)
>>> values = np.bincount(inverse)
>>> values
array([2, 2, 2])
>>> out.flat[inds] = values
>>> out
array([[[ 2., 0.],
[ 0., 2.]],
[[ 0., 2.],
[ 0., 0.]]])
Numpy versions before 1.8 do not have the add.at method, and the top code will not work without it. As ravel_multi_index may not be the fastest algorithm ever, you can look into taking the unique rows of a numpy array instead. In effect these two operations should be equivalent.
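For what it's worth, the unique/bincount branch can also be collapsed into a single bincount over the raveled indices (a sketch):
out = np.bincount(np.ravel_multi_index(data.T, dshape),
                  minlength=np.prod(dshape)).reshape(dshape)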
Don't fear the imports. They're what make Python awesome.
This assumes that you already have the result matrix.
import numpy as np
data = np.array(
[[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 0, 1],
[0, 1, 1],
[0, 0, 0]]
)
result = np.zeros((2,2,2))
# range of each dim, aka allowable values for each dim
dim_ranges = list(zip(np.zeros(result.ndim), np.array(result.shape)-1))
dim_ranges
# Out[]:
# [(0.0, 1), (0.0, 1), (0.0, 1)]
# Multidimensional histogram will effectively "count" along each dim
sums,_ = np.histogramdd(data,bins=result.shape,range=dim_ranges)
result += sums
result
# Out[]:
# array([[[ 2., 0.],
# [ 0., 2.]],
#
# [[ 0., 2.],
# [ 0., 0.]]])
This solution solves for any "result" ndarray, no matter what the shape. Additionally, it works fine even if your "data" ndarray has indices which are out-of-bounds for your result matrix.
How can I create a numpy matrix with its elements being a function of its indices?
For example, a multiplication table: a[i,j] = i*j
An un-numpy and un-pythonic way would be to create an array of zeros and then loop through it.
There is no doubt that there is a better way to do this, without a loop.
However, even better would be to create the matrix straight-away.
A generic solution would be to use np.fromfunction()
From the doc:
numpy.fromfunction(function, shape, **kwargs)
Construct an array by executing a function over each coordinate. The resulting array therefore has a value fn(x, y, z) at coordinate (x, y, z).
The below snippet should provide the required matrix.
import numpy as np
np.fromfunction(lambda i, j: i*j, (5,5))
Output:
array([[ 0., 0., 0., 0., 0.],
[ 0., 1., 2., 3., 4.],
[ 0., 2., 4., 6., 8.],
[ 0., 3., 6., 9., 12.],
[ 0., 4., 8., 12., 16.]])
The first parameter to the function is a callable which is executed for each of the coordinates. If foo is a function that you pass as the first argument, foo(i,j) will be the value at (i,j). This holds for higher dimensions too. The shape of the coordinate array can be modified using the shape parameter.
Edit:
Based on the comment on using custom functions like lambda x,y: 2*x if x > y else y/2, the following code works:
import numpy as np
def generic_f(shape, elementwise_f):
fv = np.vectorize(elementwise_f)
return np.fromfunction(fv, shape)
def elementwise_f(x , y):
return 2*x if x > y else y/2
print(generic_f( (5,5), elementwise_f))
Output:
[[0. 0.5 1. 1.5 2. ]
[2. 0.5 1. 1.5 2. ]
[4. 4. 1. 1.5 2. ]
[6. 6. 6. 1.5 2. ]
[8. 8. 8. 8. 2. ]]
The user is expected to pass a scalar function that defines the elementwise operation. np.vectorize is used to vectorize the user-defined scalar function and is passed to np.fromfunction().
Here's one way to do that:
>>> indices = numpy.indices((5, 5))
>>> a = indices[0] * indices[1]
>>> a
array([[ 0, 0, 0, 0, 0],
[ 0, 1, 2, 3, 4],
[ 0, 2, 4, 6, 8],
[ 0, 3, 6, 9, 12],
[ 0, 4, 8, 12, 16]])
To further explain, numpy.indices((5, 5)) generates two arrays containing the x and y indices of a 5x5 array like so:
>>> numpy.indices((5, 5))
array([[[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2],
[3, 3, 3, 3, 3],
[4, 4, 4, 4, 4]],
[[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]]])
When you multiply these two arrays, numpy multiplies the value of the two arrays at each position and returns the result.
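A memory-friendlier variant of the same idea uses numpy.ogrid, which returns broadcastable (5, 1) and (1, 5) index arrays instead of two full 5x5 arrays:
>>> i, j = numpy.ogrid[:5, :5]
>>> i * j
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16]])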
For the multiplication
np.multiply.outer(np.arange(5), np.arange(5)) # a_ij = i * j
and in general
np.frompyfunc(
lambda i, j: f(i, j), 2, 1
).outer(
np.arange(5),
np.arange(5),
).astype(np.float64) # a_ij = f(i, j)
basically you create an np.ufunc via np.frompyfunc and then outer it with the indices.
Edit
Speed comparison between the different solutions.
Small matrices:
Eyy![1]: %timeit np.multiply.outer(np.arange(5), np.arange(5))
100000 loops, best of 3: 4.97 µs per loop
Eyy![2]: %timeit np.array( [ [ i*j for j in xrange(5)] for i in xrange(5)] )
100000 loops, best of 3: 5.51 µs per loop
Eyy![3]: %timeit indices = np.indices((5, 5)); indices[0] * indices[1]
100000 loops, best of 3: 16.1 µs per loop
Bigger matrices:
Eyy![4]: %timeit np.multiply.outer(np.arange(4096), np.arange(4096))
10 loops, best of 3: 62.4 ms per loop
Eyy![5]: %timeit indices = np.indices((4096, 4096)); indices[0] * indices[1]
10 loops, best of 3: 165 ms per loop
Eyy![6]: %timeit np.array( [ [ i*j for j in xrange(4096)] for i in xrange(4096)] )
1 loops, best of 3: 1.39 s per loop
I'm away from my python at the moment, but does this one work?
array( [ [ i*j for j in xrange(5)] for i in xrange(5)] )
Just wanted to add that @Senderle's response can be generalized for any function and dimension:
dims = (3,3,3) #i,j,k
ii = np.indices(dims)
You could then calculate a[i,j,k] = i*j*k as
a = np.prod(ii,axis=0)
or a[i,j,k] = (i-1)*j*k:
a = (ii[0,...]-1)*ii[1,...]*ii[2,...]
etc.