I'm somewhat new to numpy so this might be a dumb question, but here goes:
Let's say I have a tensor of any shape and size, say (100,5,5) or (3,3,10,15,4). I have a randomly generated list of indices for points I want to replace with np.nan. For a (3,3,3) test case, it would be as follows:
>> data = np.random.randn(3,3,3)
>> data
array([[[ 0.21368315, -1.42814113, 1.23021783],
[ 0.25835315, 0.44775156, -1.20489094],
[ 0.25928972, 0.39486046, -1.79189447]],
[[ 2.24080908, -0.89617961, -0.29550817],
[ 0.21756087, 1.33996913, -1.24418745],
[-0.63617598, 0.56848439, 0.8175564 ]],
[[ 0.61367002, -1.16104071, -0.53488283],
[ 1.0363354 , -0.76888041, 1.24524786],
[-0.84329375, -0.61744489, 1.50502058]]])
>> idxs = np.argwhere(np.isfinite(data))
>> dropidxs = idxs[np.random.choice(idxs.shape[0], 3, replace=False)]
>> dropidxs
array([[1, 1, 1],
[2, 0, 2],
[2, 1, 0]])
How do I replace the corresponding values? Previously, when I was only dealing with the 3D case, I did it using the following.
for idx in dropidxs:
    i, j, k = idx
    missingCube[i, j, k] = np.nan
But now, I want the function to be able to handle tensors of any size.
I've tried
for idx in dropidxs:
    missingCube[idx] = np.nan
and
missingCube[dropidxs] = np.nan
But both (unsurprisingly) end up removing a corresponding slice along axis=0. How should I approach this? Is there an easier way to achieve what I'm trying to do?
In [486]: data = np.random.randn(3,3,3)
With this creation all terms are finite, so nonzero returns a tuple of (27,) arrays:
In [487]: idx = np.nonzero(np.isfinite(data))
In [488]: len(idx)
Out[488]: 3
In [489]: idx[0].shape
Out[489]: (27,)
argwhere produces the same numbers, but in a 2d array:
In [490]: idxs = np.argwhere(np.isfinite(data))
In [491]: idxs.shape
Out[491]: (27, 3)
So you select a subset.
In [492]: dropidxs = idxs[np.random.choice(idxs.shape[0], 3, replace=False)]
In [493]: dropidxs.shape
Out[493]: (3, 3)
In [494]: dropidxs
Out[494]:
array([[1, 1, 0],
[2, 1, 2],
[2, 1, 1]])
We could have generated the same subset with x = np.random.choice(...) and applied that x to each of the arrays in the idx tuple. But in this case, the argwhere array is easier to work with.
But to apply that array to indexing we still need a tuple of arrays:
In [495]: tup = tuple([dropidxs[:,i] for i in range(3)])
In [496]: tup
Out[496]: (array([1, 2, 2]), array([1, 1, 1]), array([0, 2, 1]))
In [497]: data[tup]
Out[497]: array([-0.27965058, 1.2981397 , 0.4501406 ])
In [498]: data[tup]=np.nan
In [499]: data
Out[499]:
array([[[-0.4899279 , 0.83352547, -1.03798762],
[-0.91445783, 0.05777183, 0.19494065],
[ 0.6835925 , -0.47846423, 0.13513958]],
[[-0.08790631, 0.30224828, -0.39864576],
[ nan, -0.77424244, 1.4788093 ],
[ 0.41915952, -0.09335664, -0.47359613]],
[[-0.40281937, 1.64866377, -0.40354504],
[ 0.74884493, nan, nan],
[ 0.13097487, -1.63995208, -0.98857852]]])
Or we could index with:
In [500]: data[dropidxs[:,0],dropidxs[:,1],dropidxs[:,2]]
Out[500]: array([nan, nan, nan])
Actually, a transpose of dropidxs might be more convenient:
In [501]: tdrop = dropidxs.T
In [502]: tuple(tdrop)
Out[502]: (array([1, 2, 2]), array([1, 1, 1]), array([0, 2, 1]))
In [503]: data[tuple(tdrop)]
Out[503]: array([nan, nan, nan])
Sometimes we can use * to expand a list/array into a tuple, but not when indexing (this is a syntax error in Python versions before 3.11):
In [504]: data[*tdrop]
File "<ipython-input-504-cb619d907adb>", line 1
data[*tdrop]
^
SyntaxError: invalid syntax
but we can create the tuple with:
In [506]: data[(*tdrop,)]
Out[506]: array([nan, nan, nan])
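Putting the pieces above together, here is a minimal sketch of a shape-agnostic helper. The name drop_random and its parameters are hypothetical (they do not appear in the question), and the sketch assumes a float array, since integer arrays cannot hold NaN.

```python
import numpy as np

def drop_random(data, n_drop, rng=None):
    """Return a copy of `data` with `n_drop` finite entries set to NaN."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(data, dtype=float)        # copy; float so NaN is representable
    idxs = np.argwhere(np.isfinite(out))     # shape (n_finite, out.ndim)
    rows = rng.choice(idxs.shape[0], n_drop, replace=False)
    dropidxs = idxs[rows]                    # shape (n_drop, out.ndim)
    out[tuple(dropidxs.T)] = np.nan          # one index array per axis
    return out

cube = drop_random(np.random.randn(3, 3, 3), 3)
tensor = drop_random(np.random.randn(3, 3, 10, 15, 4), 7)
```

Because the index tuple is built from dropidxs.T, nothing in the function depends on the number of dimensions.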
Is this what you're looking for?
import numpy as np
x = np.random.randn(10, 3, 3, 3)
new_value = 0
x[x < 0] = new_value
or
x[x == -np.inf] = 0
You can choose from the flattened indices and convert them back to multi-dimensional indices with np.unravel_index to set elements to np.nan. Here the generator is seeded with 41 so that the choice of the 3 positions is reproducible.
import numpy as np
data = np.random.randn(3,3,3)
rng = np.random.default_rng(41)
idx = rng.choice(np.arange(data.size), 3, replace=False)
data[np.unravel_index(idx, data.shape)] = np.nan
data
Output
array([[[ 0.13180452, -0.81228319, -0.04456739],
[ 0.53060077, -0.2246579 , 1.83926463],
[-0.38670047, -0.53703577, 0.49275628]],
[[ 0.36671354, 1.44012848, -0.57209412],
[ 0.53960111, -1.06578638, 1.10669842],
[ 1.1772824 , nan, -0.82792041]],
[[-0.03352594, 0.29351109, 0.57021538],
[-0.33291872, nan, 0.04675677],
[ nan, 2.59450517, -1.9579655 ]]])
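To see what np.unravel_index does with the flat indices, here is a small illustration. The (2, 4) shape and the index values are chosen only for this example.

```python
import numpy as np

shape = (2, 4)
flat = np.array([5, 7])
# Row-major (C order): flat index = row * 4 + col
coords = np.unravel_index(flat, shape)   # (array([1, 1]), array([1, 3]))

data = np.zeros(shape)
data[coords] = np.nan                    # same effect as data.flat[flat] = np.nan
```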
I know something similar to this question has been asked many times over already, but all answers given to similar questions only seem to work for arrays with 2 dimensions.
My understanding of np.argsort() is that np.sort(array) == array[np.argsort(array)] should be True.
I have found that this is indeed correct if np.ndim(array) == 1, but it gives different results if np.ndim(array) > 1.
Example:
>>> array = np.array([[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]],
[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]]])
>>> np.sort(array)
array([[[ 0.62078744, 0.81774634],
[ 0.29718462, 0.43912609]],
[[ 0.1266578 , 0.82282054],
[ 0.79134389, 0.98180375]]])
>>> array.argsort()
array([[[1, 0],
[1, 0]],
[[0, 1],
[1, 0]]])
>>> array[array.argsort()]
array([[[[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]],
[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]]],
[[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]],
[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]]]],
[[[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]],
[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]]],
[[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]],
[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]]]]])
So, can anybody explain to me how exactly np.argsort() can be used as the indices to obtain the sorted array?
The only way I can come up with is:
args = np.argsort(array)
array_sort = np.zeros_like(array)
for i in range(array.shape[0]):
    for j in range(array.shape[1]):
        array_sort[i, j] = array[i, j, args[i, j]]
which is extremely tedious and cannot be generalized for any given number of dimensions.
Here is a general method:
import numpy as np
array = np.array([[[ 0.81774634, 0.62078744],
                   [ 0.43912609, 0.29718462]],
                  [[ 0.1266578 , 0.82282054],
                   [ 0.98180375, 0.79134389]]])
a = 1 # or 0 or 2
order = array.argsort(axis=a)
# list(...) so that one entry can be replaced below, whatever ogrid returns
idx = list(np.ogrid[tuple(map(slice, array.shape))])
# if you don't need full ND generality: in 3D this can be written
# much more readably as
# m, n, k = array.shape
# idx = list(np.ogrid[:m, :n, :k])
idx[a] = order
# index with a tuple: one (broadcastable) index array per axis
print(np.all(array[tuple(idx)] == np.sort(array, axis=a)))
Output:
True
Explanation: We must specify for each element of the output array the complete index of the corresponding element of the input array. Thus each index into the input array has the same shape as the output array or must be broadcastable to that shape.
The indices for the axes along which we do not sort/argsort stay in place. We therefore need to pass a broadcastable range(array.shape[i]) for each of those. The easiest way is to use ogrid to create such a range for all dimensions (if we used this directly, the array would come back unchanged) and then replace the index corresponding to the sort axis with the output of argsort.
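The claim that the unmodified ogrid indices return the array unchanged can be checked directly. This is only an illustrative sketch; the (2, 3, 4) test array is an assumption made for the example.

```python
import numpy as np

array = np.arange(24).reshape(2, 3, 4)
idx = np.ogrid[tuple(map(slice, array.shape))]

# Each index array spans one axis and has length 1 along the others,
# so together they broadcast to the full (2, 3, 4) shape.
shapes = [ix.shape for ix in idx]        # [(2, 1, 1), (1, 3, 1), (1, 1, 4)]

# Indexing with all of them is the identity operation.
same = np.array_equal(array[tuple(idx)], array)
```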
UPDATE March 2019:
Numpy is becoming more strict in enforcing that multi-axis indices are passed as tuples. At the time of this update, array[idx] with a list triggered a deprecation warning (later versions no longer accept it as a multi-axis index), so use array[tuple(idx)] instead. (Thanks @Nathan)
Or use numpy's new (version 1.15.0) convenience function take_along_axis:
np.take_along_axis(array, order, a)
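As a quick check that take_along_axis reproduces np.sort for every axis, here is a small sketch. The 4D random test array and the seed are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
array = rng.random((2, 3, 4, 5))

results = []
for axis in range(array.ndim):
    order = np.argsort(array, axis=axis)
    sorted_via_take = np.take_along_axis(array, order, axis=axis)
    results.append(np.array_equal(sorted_via_take, np.sort(array, axis=axis)))
```

Every entry of results should be True, whichever axis is sorted.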
@Hameer's answer works, though it could use some simplification and explanation.
sort and argsort are working on the last axis. argsort returns a 3d array, same shape as the original. The values are the indices on that last axis.
In [17]: np.argsort(arr, axis=2)
Out[17]:
array([[[1, 0],
[1, 0]],
[[0, 1],
[1, 0]]], dtype=int32)
In [18]: _.shape
Out[18]: (2, 2, 2)
In [19]: idx=np.argsort(arr, axis=2)
To use this we need to construct indices for the other dimensions that broadcast to the same (2,2,2) shape. ix_ is a handy tool for this.
Just using idx as one of the ix_ inputs doesn't work:
In [20]: np.ix_(range(2),range(2),idx)
....
ValueError: Cross index must be 1 dimensional
Instead I use the last range, and then ignore it. @Hameer instead constructs the 2d ix_, and then expands them.
In [21]: I,J,K=np.ix_(range(2),range(2),range(2))
In [22]: arr[I,J,idx]
Out[22]:
array([[[ 0.62078744, 0.81774634],
[ 0.29718462, 0.43912609]],
[[ 0.1266578 , 0.82282054],
[ 0.79134389, 0.98180375]]])
So the indices for the other dimensions work with the (2,2,2) idx array:
In [24]: I.shape
Out[24]: (2, 1, 1)
In [25]: J.shape
Out[25]: (1, 2, 1)
Those are the basics of constructing the other indices when you are given a multidimensional index for one dimension.
@Paul constructs the same indices with ogrid:
In [26]: np.ogrid[slice(2),slice(2),slice(2)] # np.ogrid[:2,:2,:2]
Out[26]:
[array([[[0]],
[[1]]]), array([[[0],
[1]]]), array([[[0, 1]]])]
In [27]: _[0].shape
Out[27]: (2, 1, 1)
ogrid as a class works with slices, while ix_ requires a list/array/range.
argsort for a multidimensional ndarray (from 2015) works with a 2d array, but the same logic applies (find a range index(s) that broadcasts with the argsort).
Here's a vectorized implementation. It should be N-dimensional and quite a bit faster than what you're doing.
import numpy as np

def sort1(array, args):
    array_sort = np.zeros_like(array)
    for i in range(array.shape[0]):
        for j in range(array.shape[1]):
            array_sort[i, j] = array[i, j, args[i, j]]
    return array_sort

def sort2(array, args):
    shape = array.shape
    idx = np.ix_(*tuple(np.arange(l) for l in shape[:-1]))
    idx = tuple(ar[..., None] for ar in idx)
    array_sorted = array[idx + (args,)]
    return array_sorted

if __name__ == '__main__':
    array = np.random.rand(5, 6, 7)
    idx = np.argsort(array)
    result1 = sort1(array, idx)
    result2 = sort2(array, idx)
    print(np.array_equal(result1, result2))
I have an n x m matrix X and an n x p matrix Y, where Y is binary data. In the end I want a p x m matrix Z, where each entry of Z is a function of a column of X, restricted to the entries where the corresponding column of Y is 1.
For example
>>> X
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> Y
array([[1, 0],
[1, 0],
[0, 1]])
n_x, m = X.shape
n_y, p = Y.shape
Z = np.zeros([p, m])
for i in range(m):
    col = X[:,[i]]
    for j in range(p):
        # this is where I subset col with Y[:,[j]]
        Z[j][i] = my_func(subsetted_column)
The iterations would produce
i=0, j=0: subsetted_column = [[1],[4]]
i=0, j=1: subsetted_column = [[7]]
i=1, j=0: subsetted_column = [[2],[5]]
i=1, j=1: subsetted_column = [[8]]
i=2, j=0: subsetted_column = [[3],[6]]
i=2, j=1: subsetted_column = [[9]]
I assume there is some way to do that nested loop in a single list comprehension. The function my_func also takes a long time, so it would be nice to parallelize that somehow.
Edit: I could do something like
for i in range(m):
    for j in range(p):
        subsetted_column = np.trim_zeros(np.multiply(X[:,i], Y[:,j]))
        Z[j][i] = my_func(subsetted_column)
But I still believe there is an easier solution.
Does this do what you want?
import numpy as np
N, M, P = 4, 3, 2
a = np.random.random((N, M))
b = np.random.randint(2, size=(N, P)).astype(bool)
your_func = lambda x: x # insert proper function here
flat = [your_func(ai[bj]) for bj in b.T for ai in a.T]
out = np.empty((P, M), dtype=object)
out.ravel()[:] = flat
print(a)
print(b)
print(out)
Remarks:
It is easiest to convert your masking array to dtype bool because this allows you to use logical indexing.
If your_func returns just a number it's better not to use dtype=object for out.
If you want to parallelise, a list comprehension is perhaps not the best thing to do, but I'm no expert on that. It's just that the loop looks like an obvious parallelisation target, since the order of iterations is irrelevant.
Sample output:
[[ 0.62739382 0.85774837 0.81958524]
[ 0.99690996 0.71202879 0.97636715]
[ 0.89235107 0.91739852 0.39537849]
[ 0.0413107 0.11662271 0.72419308]]
[[False True]
[ True True]
[False False]
[ True True]]
[[array([ 0.99690996, 0.0413107 ]) array([ 0.71202879, 0.11662271])
array([ 0.97636715, 0.72419308])]
[array([ 0.62739382, 0.99690996, 0.0413107 ])
array([ 0.85774837, 0.71202879, 0.11662271])
array([ 0.81958524, 0.97636715, 0.72419308])]]
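The parallelisation remark above could be sketched with the standard library's concurrent.futures. This is only a rough sketch: my_func here is a stand-in (a plain sum) for the expensive function from the question, and a thread pool is an assumption that only pays off if the real function releases the GIL (as many NumPy routines do); for pure-Python work a ProcessPoolExecutor would be the usual substitute.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def my_func(column):
    # placeholder for the expensive function from the question
    return column.sum()

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Y = np.array([[1, 0], [1, 0], [0, 1]]).astype(bool)

# One task per (j, i) pair, in the same order as the list comprehension above
tasks = [xi[yj] for yj in Y.T for xi in X.T]
with ThreadPoolExecutor() as pool:
    flat = list(pool.map(my_func, tasks))   # map preserves input order

Z = np.array(flat).reshape(Y.shape[1], X.shape[1])
```

Since pool.map returns results in input order, the reshape to (p, m) puts each value in the right cell.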
It may help to perform the subsetting in a pre-processing loop
In [112]: xs = [X[y,:] for y in Y.astype(bool).T]
In [113]: xs
Out[113]:
[array([[1, 2, 3],
[4, 5, 6]]),
array([[7, 8, 9]])]
(.T is used to iterate on columns in the list comprehension; bool allows 'masked' selection)
Let's say, for example that my_func takes the mean on axis=0 for the subsets
In [116]: [np.mean(s, axis=0) for s in xs]
Out[116]: [array([ 2.5, 3.5, 4.5]), array([ 7., 8., 9.])]
In [117]: np.array(_)
Out[117]:
array([[ 2.5, 3.5, 4.5],
[ 7. , 8. , 9. ]])
I could combine it into one loop, but it's harder to think about:
np.array([np.mean(X[y,:],axis=0) for y in Y.astype(bool).T])
With this xs list, you can focus your efforts on applying my_func efficiently to all the columns of xs[i] as np.mean(xs[i], axis=0) does.
The double loop version of this mean
In [121]: p=np.zeros((2,3))
In [122]: for i in range(2):
     ...:     for j in range(3):
     ...:         p[i,j] = np.mean(xs[i][:,j])
     ...:
In [123]: p
Out[123]:
array([[ 2.5, 3.5, 4.5],
[ 7. , 8. , 9. ]])
Equivalent double list comprehension
In [125]: [[np.mean(i) for i in j.T] for j in xs]
Out[125]: [[2.5, 3.5, 4.5], [7.0, 8.0, 9.0]]
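The pre-processing idea can be wrapped into one helper that works for any per-column function. The name apply_by_mask is hypothetical, and the sketch assumes func takes a 1-D array and returns a scalar.

```python
import numpy as np

def apply_by_mask(X, Y, func):
    """Z[j, i] = func(column i of X, restricted to rows where Y[:, j] is 1)."""
    masks = np.asarray(Y, dtype=bool).T      # one boolean row mask per column of Y
    return np.array([[func(col) for col in X[mask].T] for mask in masks])

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Y = np.array([[1, 0], [1, 0], [0, 1]])
Z = apply_by_mask(X, Y, np.mean)
```

With np.mean as func, this gives the same (2, 3) result as the sessions above.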
For creating a scipy sparse matrix, I have arrays of row and column indices I and J along with a data array V. I use those to construct a matrix in COO format and then convert it to CSR,
matrix = sparse.coo_matrix((V, (I, J)), shape=(n, n))
matrix = matrix.tocsr()
I have a set of row indices for which the only entry should be a 1.0 on the diagonal. So far, I go through I, find all indices that need wiping, and do just that:
def find(lst, a):
    # From <http://stackoverflow.com/a/16685428/353337>
    return [i for i, x in enumerate(lst) if x in a]
# wipe_rows = [1, 55, 32, ...] # something something
indices = find(I, wipe_rows) # takes too long
I = numpy.delete(I, indices).tolist()
J = numpy.delete(J, indices).tolist()
V = numpy.delete(V, indices).tolist()
# Add entry 1.0 to the diagonal for each wipe row
I.extend(wipe_rows)
J.extend(wipe_rows)
V.extend(numpy.ones(len(wipe_rows)))
# construct matrix via coo
That works alright, but find tends to take a while.
Any hints on how to speed this up? (Perhaps wiping the rows in COO or CSR format is a better idea.)
If you intend to clear multiple rows at once, this
def _wipe_rows_csr(matrix, rows):
    assert isinstance(matrix, sparse.csr_matrix)
    # delete rows
    for i in rows:
        matrix.data[matrix.indptr[i]:matrix.indptr[i+1]] = 0.0
    # Set the diagonal
    d = matrix.diagonal()
    d[rows] = 1.0
    matrix.setdiag(d)
    return
is by far the fastest method. It doesn't really remove the rows' entries, but sets them all to zero, then fiddles with the diagonal.
If the entries are actually to be removed, one has to do some array manipulation. This can be quite costly, but if speed is no issue, this
def _wipe_row_csr(A, i):
    '''Wipes a row of a matrix in CSR format and puts 1.0 on the diagonal.
    '''
    assert isinstance(A, sparse.csr_matrix)
    n = A.indptr[i+1] - A.indptr[i]
    assert n > 0
    # Length of data/indices once n-1 entries are dropped; an explicit end
    # index avoids the slice bound -n+1 becoming 0 when n == 1.
    end = len(A.data) - (n - 1)
    A.data[A.indptr[i]+1:end] = A.data[A.indptr[i+1]:]
    A.data[A.indptr[i]] = 1.0
    A.data = A.data[:end]
    A.indices[A.indptr[i]+1:end] = A.indices[A.indptr[i+1]:]
    A.indices[A.indptr[i]] = i
    A.indices = A.indices[:end]
    A.indptr[i+1:] -= n-1
    return
replaces a given row i of the matrix A by the entry 1.0 on the diagonal.
np.in1d (or np.isin in newer NumPy versions) should be a faster way of finding the indices:
In [322]: I # from a np.arange(12).reshape(4,3) matrix
Out[322]: array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int32)
In [323]: indices=[i for i, x in enumerate(I) if x in [1,2]]
In [324]: indices
Out[324]: [2, 3, 4, 5, 6, 7]
In [325]: ind1=np.in1d(I,[1,2])
In [326]: ind1
Out[326]:
array([False, False, True, True, True, True, True, True, False,
False, False], dtype=bool)
In [327]: np.where(ind1) # same as indices
Out[327]: (array([2, 3, 4, 5, 6, 7], dtype=int32),)
In [328]: I[~ind1] # same as the delete
Out[328]: array([0, 0, 3, 3, 3], dtype=int32)
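Applied to the question's I, J, V inputs, the mask approach might look like the sketch below. The function name wipe_rows_coo and the sample J and V values are assumptions made for illustration, and np.isin is used in place of np.in1d.

```python
import numpy as np

def wipe_rows_coo(I, J, V, wipe_rows):
    """Drop all entries in `wipe_rows`, then add a 1.0 on the diagonal of each."""
    I, J, V = np.asarray(I), np.asarray(J), np.asarray(V, dtype=float)
    wipe_rows = np.asarray(wipe_rows)
    keep = ~np.isin(I, wipe_rows)            # True for entries in untouched rows
    I_new = np.concatenate([I[keep], wipe_rows])
    J_new = np.concatenate([J[keep], wipe_rows])
    V_new = np.concatenate([V[keep], np.ones(len(wipe_rows))])
    return I_new, J_new, V_new

I = np.array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
J = np.array([1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])
V = np.arange(1.0, 12.0)
I2, J2, V2 = wipe_rows_coo(I, J, V, [1, 2])
```

The returned arrays can be passed to sparse.coo_matrix((V2, (I2, J2)), shape=...) as in the question, with no Python-level loop over I.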
Direct manipulation of the coo inputs like this is often a good way. But another is to take advantage of the csr math abilities. You should be able to construct a diagonal matrix that zeros out the correct rows, and then adds the ones back in.
Here's what I have in mind:
In [357]: A=np.arange(16).reshape(4,4)
In [358]: M=sparse.coo_matrix(A)
In [359]: M.A
Out[359]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [360]: d1=sparse.diags([(1,0,0,1)],[0],(4,4))
In [361]: d2=sparse.diags([(0,1,1,0)],[0],(4,4))
In [362]: (d1*M+d2).A
Out[362]:
array([[ 0., 1., 2., 3.],
[ 0., 1., 0., 0.],
[ 0., 0., 1., 0.],
[ 12., 13., 14., 15.]])
In [376]: x=np.ones((4,),bool);x[[1,2]]=False
In [378]: d1=sparse.diags([x],[0],(4,4),dtype=int)
In [379]: d2=sparse.diags([~x],[0],(4,4),dtype=int)
Doing this with lil format looks easy:
In [593]: Ml=M.tolil()
In [594]: Ml.data[wipe]=[[1]]*len(wipe)
In [595]: Ml.rows[wipe]=[[i] for i in wipe]
In [596]: Ml.A
Out[596]:
array([[ 0, 1, 2, 3],
[ 0, 1, 0, 0],
[ 0, 0, 1, 0],
[12, 13, 14, 15]], dtype=int32)
It's sort of what you are doing with csr format, but it's easy to replace each row list with the appropriate [1] and [i] list. But conversion times (tolil etc) can hurt run times.
The title looks complicated, but the problem is not that hard. I have 2 matrices: data_X and data_Y. I have to construct a new matrix based on data_X, which will consist of all the rows of data_X where the corresponding value in the column with index column in data_Y is not equal to someNumber. The same goes for data_Y. For example, here is a 5 by 2 data_X matrix and a 5 by 1 data_Y matrix, where column is 0 and someNumber = -1.
[[ 0.09580361 0.11221975]
[ 0.71409124 0.24583188]
[ 0.67346718 0.72550385]
[ 0.40641294 0.01172211]
[ 0.89974846 0.70378831]] # data_X
and data_Y = np.array([[5], [-1], [4], [2], [-1]]).
The result would be:
[[ 0.09580361 0.11221975]
[ 0.67346718 0.72550385]
[ 0.40641294 0.01172211]]
[5 4 2]
It is not hard to see that this can be achieved by the following:
data_x, data_y = [], []
for i in range(len(data_Y)):
    if data_Y[i][column] != someNumber:
        data_y.append(data_Y[i][column])
        data_x.append(data_X[i])
But I believe there is a much easier way (like 2 or 3 numpy operations) to get the results I need.
Use boolean indexing -
In [228]: X
Out[228]:
array([[ 0.09580361, 0.11221975],
[ 0.71409124, 0.24583188],
[ 0.67346718, 0.72550385],
[ 0.40641294, 0.01172211],
[ 0.89974846, 0.70378831]])
In [229]: Y
Out[229]:
array([[ 5],
[-1],
[ 4],
[ 2],
[-1]])
In [230]: mask = Y!=-1 # Create mask for boolean indexing
In [231]: X[mask.ravel()]
Out[231]:
array([[ 0.09580361, 0.11221975],
[ 0.67346718, 0.72550385],
[ 0.40641294, 0.01172211]])
In [232]: Y[mask]
Out[232]: array([5, 4, 2])
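For the general case with an arbitrary column and someNumber, the same boolean indexing can be wrapped in a small helper. The function name filter_rows and the sample data_X values are assumptions made for this sketch.

```python
import numpy as np

def filter_rows(data_X, data_Y, column, some_number):
    """Keep the rows where data_Y[:, column] differs from some_number."""
    mask = data_Y[:, column] != some_number  # 1-D boolean mask, one entry per row
    return data_X[mask], data_Y[mask, column]

data_X = np.arange(10.0).reshape(5, 2)
data_Y = np.array([[5], [-1], [4], [2], [-1]])
new_X, new_Y = filter_rows(data_X, data_Y, column=0, some_number=-1)
```

Selecting the column first makes the mask 1-D, so no ravel is needed before indexing data_X.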
I have two numpy arrays that contains NaNs:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([ 1 , 2, 3 , 4, np.nan])
Is there a smart way, using numpy, to remove the NaNs in both arrays and also remove whatever is at the corresponding index in the other array?
Making it look like this:
A = array([ 2, 3, ])
B = array([ 2, 4, ])
What you could do is add the 2 arrays together; the sum is NaN wherever either array has a NaN. Then use this to generate an index and use it to index into your original numpy arrays:
In [193]:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([ 1 , 2, 3 , 4, np.nan])
idx = np.where(~np.isnan(A+B))
idx
print(A[idx])
print(B[idx])
[ 2. 3.]
[ 2. 4.]
output from A+B:
In [194]:
A+B
Out[194]:
array([ nan, 4., nan, 7., nan])
EDIT
As @Oliver W. has correctly pointed out, the np.where is unnecessary, as ~np.isnan(...) already produces a boolean mask that you can use to index into the arrays:
In [199]:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([ 1 , 2, 3 , 4, np.nan])
idx = (~np.isnan(A+B))
print(A[idx])
print(B[idx])
[ 2. 3.]
[ 2. 4.]
A[~(np.isnan(A) | np.isnan(B))]
B[~(np.isnan(A) | np.isnan(B))]
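The OR-based mask extends naturally to more than two arrays. The helper name drop_shared_nans is hypothetical, and the sketch assumes all arrays are float arrays of the same shape.

```python
import numpy as np

def drop_shared_nans(*arrays):
    """Drop every position at which at least one of the arrays is NaN."""
    bad = np.zeros(arrays[0].shape, dtype=bool)
    for a in arrays:
        bad |= np.isnan(a)                   # accumulate NaN positions
    return tuple(a[~bad] for a in arrays)

A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([1, 2, 3, 4, np.nan])
A_clean, B_clean = drop_shared_nans(A, B)
```

Combining the isnan masks rather than testing A+B also avoids a false positive when one array holds inf and the other -inf at the same position, since their sum is NaN.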