I need to remove values along one axis of a NumPy array based on a condition.
For example, I want to remove [:,2] (the values at index 2 on axis 1) if the first value in the row == 0, and otherwise remove [:,3].
Input:
[[0,1,2,3],[0,2,3,4],[1,3,4,5]]
Output:
[[0,1,3],[0,2,4],[1,3,4]]
So now my output has one less value along axis 1, and which value is dropped depends on whether the row met the condition.
I know I can isolate and manipulate this based on
array[np.where(array[:,0] == 0)] but then I would have to deal with each condition separately, and it's very important for me to preserve the order of this array.
I am dealing with 3D arrays and am hoping to be able to calculate all of this simultaneously while preserving the order.
Any help is much appreciated!
A possible solution:
a = np.array([[0,1,2,3],[0,2,3,4],[1,3,4,5]])
b = np.arange(a.shape[1])
np.apply_along_axis(
    lambda x: x[np.where(x[0] == 0, np.delete(b, 2), np.delete(b, 3))], 1, a)
Output:
array([[0, 1, 3],
       [0, 2, 4],
       [1, 3, 4]])
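A fully vectorized alternative (a sketch, not part of the original answer): build the per-row column indices with np.where and gather them with np.take_along_axis, avoiding the per-row Python call that apply_along_axis makes:

import numpy as np

a = np.array([[0, 1, 2, 3], [0, 2, 3, 4], [1, 3, 4, 5]])
b = np.arange(a.shape[1])

# Rows starting with 0 keep columns [0, 1, 3]; all others keep [0, 1, 2].
cols = np.where((a[:, 0] == 0)[:, None], np.delete(b, 2), np.delete(b, 3))
out = np.take_along_axis(a, cols, axis=1)
# array([[0, 1, 3],
#        [0, 2, 4],
#        [1, 3, 4]])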
Since you are starting and ending with a list, a straightforward iteration is a good solution:
In [261]: alist =[[0,1,2,3],[0,2,3,4],[1,3,4,5]]
In [262]: for row in alist:
     ...:     if row[0]==0: row.pop(2)
     ...:     else: row.pop(3)
     ...:
In [263]: alist
Out[263]: [[0, 1, 3], [0, 2, 4], [1, 3, 4]]
A possible array approach:
In [273]: arr = np.array([[0,1,2,3],[0,2,3,4],[1,3,4,5]])
In [274]: mask = np.ones(arr.shape, bool)
In [275]: mask[np.arange(3),np.where(arr[:,0]==0,2,3)]=False
In [276]: mask
Out[276]:
array([[ True,  True, False,  True],
       [ True,  True, False,  True],
       [ True,  True,  True, False]])
arr[mask] will be 1d, but since we are deleting the same number of elements from each row, we can reshape it:
In [277]: arr[mask].reshape(arr.shape[0],-1)
Out[277]:
array([[0, 1, 3],
       [0, 2, 4],
       [1, 3, 4]])
I expect the list approach will be faster for small cases, but the array approach should scale better. I don't know where the trade-off is.
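Since the question mentions 3D arrays, here is a sketch of the same mask idea extended to a 3D input (my own extension, assuming the condition is tested on the first value of each innermost row):

import numpy as np

# Hypothetical 3D input: two stacked copies of the 2D example.
arr3 = np.array([[[0,1,2,3],[0,2,3,4],[1,3,4,5]],
                 [[0,1,2,3],[0,2,3,4],[1,3,4,5]]])

mask = np.ones(arr3.shape, bool)
i, j = np.indices(arr3.shape[:2])
mask[i, j, np.where(arr3[:, :, 0] == 0, 2, 3)] = False

# Exactly one element is removed per innermost row, so the result reshapes cleanly.
out = arr3[mask].reshape(arr3.shape[0], arr3.shape[1], -1)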
Related
Say I have a numpy array as follows:
arr = np.array([[[1, 7], [5, 1]], [[5, 7], [6, 7]]])
where each of the innermost sub-arrays is an element. So, for example, [1, 7] and [5, 1] are both considered elements.
... and I would like to find all the elements which satisfy: [<=5, >=7]. So, a truthy result array for the above example would look as follows:
arr_truthy = [[True, False], [True, False]]
... since for each True element in arr the first value is <=5 and the second is >=7.
I can solve this easily by iterating over each of the axes in arr:
for x in range(arr.shape[0]):
    for y in range(arr.shape[1]):
        # test values, report if true.
.. but this method is slow and I'm hoping there's a more numpy way to do it. I've tried np.where, but I can't work out how to do the multi-sub-element conditional.
I'm effectively trying to test an independent conditional on each number in the elements.
Can anyone point me in the right direction?
I would do it like this: initialize the output to the same shape, then just do the comparison on the relevant positions.
out = np.full(arr.shape, False)
out[:,:,0] = arr[:,:,0] <= 5
out[:,:,1] = arr[:,:,1] >= 7
Output:
array([[[ True,  True],
        [ True, False]],

       [[ True,  True],
        [False,  True]]])
EDIT: After your edit I think you just need a final np.all along the last axis:
np.all(out, axis=-1)
Returns:
array([[ True, False],
       [ True, False]])
Are you looking for
(arr[:,:,0] <= 5) & (arr[:,:,1] >= 7)
? You can perform a broadcasted comparison.
Output:
array([[ True, False],
       [ True, False]])
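If the per-position bounds ever change or grow, one generalization (my own sketch, with hypothetical upper/lower bound arrays) is to broadcast the bounds against the last axis and reduce with all:

import numpy as np

arr = np.array([[[1, 7], [5, 1]], [[5, 7], [6, 7]]])

# Hypothetical per-position bounds: first value <= 5, second value >= 7.
upper = np.array([5, np.inf])    # no upper bound on the second value
lower = np.array([-np.inf, 7])   # no lower bound on the first value

result = ((arr <= upper) & (arr >= lower)).all(axis=-1)
# array([[ True, False],
#        [ True, False]])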
In your example the second pair ([5, 1]) matches your rule (the first value is >=5 and the second is <=7), but in the result (arr_truthy) its value is False. My code works if this was a mistake; otherwise, please clarify the condition.
arr = np.array([[[1, 7], [5, 1]], [[5, 6], [6, 7]], [[1, 9], [9, 1]]])
# Create True/False arrays for the first/second element of the input
first = np.zeros_like(arr, dtype=bool)
first = first.flatten()
first[::2] = 1
first = first.reshape(arr.shape)
second = np.invert(first)
# The actual logic:
out = np.logical_or(np.where(arr >= 5, first, False), np.where(arr <= 7, second, False))
# Both the condition for the first and for the second element of each pair have to be met
out = out.all(axis=-1)
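As a side note (my own simplification, not part of the answer above), the first/second masks can also be built without the flatten/reshape round trip by indexing the last axis directly:

first = np.zeros(arr.shape, dtype=bool)
first[..., 0] = True   # mark the first value of every pair
second = ~first        # mark the second value of every pair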
I am trying to find the index of the minimum value in each row, and I am using the code below.
#code
import numpy as np
C = np.array([[1,2,4],[2,2,5],[4,3,3]])
ind = np.where(C == C.min(axis=1).reshape(len(C),1))
ind
#output
(array([0, 1, 1, 2, 2], dtype=int64), array([0, 0, 1, 1, 2], dtype=int64))
But the problem is that it returns all the indices of the minimum values in each row, while I want only the first occurrence of the minimum in each row, like:
(array([0, 1, 2], dtype=int64), array([0, 0, 1], dtype=int64))
If you want to use comparison against the minimum value, we need to use np.min and keep the dimensions with keepdims set to True to give us a boolean array/mask. To select the first occurrence, we can use argmax along each row of the mask and thus have our desired output.
Thus, the implementation to get the corresponding column indices would be -
(C==C.min(1, keepdims=True)).argmax(1)
Sample step-by-step run -
In [114]: C   # Input array
Out[114]:
array([[1, 2, 4],
       [2, 2, 5],
       [4, 3, 3]])
In [115]: C==C.min(1, keepdims=1) # boolean array of min values
Out[115]:
array([[ True, False, False],
       [ True,  True, False],
       [False,  True,  True]], dtype=bool)
In [116]: (C==C.min(1, keepdims=True)).argmax(1) # argmax to get first occurrences
Out[116]: array([0, 0, 1])
The first output of row indices would simply be a range array -
np.arange(C.shape[0])
To achieve the same column indices of the first occurrence of minimum values, a direct way would be to use np.argmin -
C.argmin(axis=1)
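Putting the two together (a small usage sketch), the row/column index pairs can be checked against the row minima:

import numpy as np

C = np.array([[1, 2, 4], [2, 2, 5], [4, 3, 3]])
rows = np.arange(C.shape[0])   # array([0, 1, 2])
cols = C.argmin(axis=1)        # array([0, 0, 1]), first occurrence per row
print(C[rows, cols])           # [1 2 3], the row minima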
I want to remove features with low variance from my array of data. Using scikit-learn, the code looks like this:
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
>>> selector = VarianceThreshold()
>>> selector.fit_transform(X)
array([[2, 0],
       [1, 4],
       [1, 1]])
My question is how to catch the column indexes that have been deleted. Let's say I want to use them to delete the same columns from another array (the 0th and 3rd columns in the example above).
Any idea?
selector.get_support() will return an array which shows which columns are kept and which are removed. In the above case:
selector.get_support()
will return
array([False, True, True, False], dtype=bool)
which means the first and last columns of the original input (X) are removed.
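To apply the same selection to a second array, the boolean mask from get_support can be used directly as a column index (a sketch with a hypothetical array Y of the same width; selector.transform(Y) does the same thing):

import numpy as np

kept = selector.get_support()   # array([False,  True,  True, False])

# Hypothetical second array with the same number of columns as X:
Y = np.arange(8).reshape(2, 4)
Y_reduced = Y[:, kept]          # keeps the same columns that survived in X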
I have a matrix of distances to the k nearest neighbours of a set of points, and a matrix of class labels of those nearest neighbours (both are N-by-k matrices).
What is the best way, WITHOUT an explicit Python loop (I actually want to implement this in Theano, where such loops won't work), to build an N-by-#classes matrix whose (i,j) element is the sum of the distances from the i-th point to those of its k-NN points that have class label j?
Example:
# N = 2
# k = 5
# number of classes = 3
K_val = np.array([[1,2,3,4,6],
                  [2,4,5,5,7]])
l_val = np.array([[0,1,2,0,1],
                  [2,0,1,2,0]])
"""
result -> [[5,8,3],
           [11,5,7]]
"""
You can compute this with numpy.bincount. It has a weights parameter which allows you to count the items in l_val but weight them according to K_val.
The only little snag is that bincount works on flat arrays, while each row of K_val and l_val needs to be treated independently. So add a shift to l_val so that each row's values are distinct from every other row's.
import numpy as np

num_classes = 3
K_val = np.array([[1,2,3,4,6],
                  [2,4,5,5,7]])
l_val = np.array([[0,1,2,0,1],
                  [2,0,1,2,0]])

def label_distance(l_val, K_val):
    nrows, ncols = l_val.shape
    shift = (np.arange(nrows)*num_classes)[:, np.newaxis]
    result = (np.bincount((l_val+shift).ravel(), weights=K_val.ravel(),
                          minlength=num_classes*nrows)
              .reshape(nrows, num_classes))
    return result

print(label_distance(l_val, K_val))
yields
[[  5.   8.   3.]
 [ 11.   5.   7.]]
Although senderle's method is really elegant, using bincount is faster:
def using_extradim(l_val, K_val):
    return (K_val[:,:,None] * (l_val[:,:,None] == numpy.arange(3)[None,None,:])).sum(axis=1)
In [34]: K2 = np.tile(K_val, (1000,1))
In [35]: L2 = np.tile(l_val, (1000,1))
In [36]: %timeit using_extradim(L2, K2)
1000 loops, best of 3: 584 µs per loop
In [40]: %timeit label_distance(L2, K2)
10000 loops, best of 3: 67.7 µs per loop
Here's a way to calculate the values directly. As unutbu's tests show, using bincount is much faster for large datasets, but I think it's worth knowing how to do this using vanilla broadcasting as well:
>>> (K_val[:,:,None] * (l_val[:,:,None] == numpy.arange(3)[None,None,:])).sum(axis=1)
array([[ 5,  8,  3],
       [11,  5,  7]])
That's a bit hairy, so I'll step through it slowly. It's probably best to do it this way in code you want to be able to read later! There are four steps:
labels = numpy.arange(3)
l_selector = l_val[:,:,None] == labels[None,None,:]
distances = (K_val[:,:,None] * l_selector)
result = distances.sum(axis=1)
First we create a list of labels (labels above). Then we create a boolean index array:
>>> l_selector = l_val[:,:,None] == labels[None,None,:]
This expands l_val and labels into arrays that can be broadcast together. The None values (equivalent to np.newaxis) add new empty dimensions:
>>> l_val[:,:,None].shape
(2, 5, 1)
>>> labels[None,None,:].shape
(1, 1, 3)
The dimensions are aligned, so both arrays can be expanded (by repeating the values) along their empty dimensions:
>>> l_selector.shape
(2, 5, 3)
Now we have a (n_points, n_neighbors, n_labels) array, where each column corresponds to a label. (See how each row has only one True value?)
>>> l_selector
array([[[ True, False, False],
        [False,  True, False],
        [False, False,  True],
        [ True, False, False],
        [False,  True, False]],

       [[False, False,  True],
        [ True, False, False],
        [False,  True, False],
        [False, False,  True],
        [ True, False, False]]], dtype=bool)
So now we can use this to separate out the distances for each of the three labels. But again, we have to make sure that our arrays are broadcastable, hence the K_val[:,:,None] here:
>>> distances = (K_val[:,:,None] * l_selector)
>>> distances
array([[[1, 0, 0],
        [0, 2, 0],
        [0, 0, 3],
        [4, 0, 0],
        [0, 6, 0]],

       [[0, 0, 2],
        [4, 0, 0],
        [0, 5, 0],
        [0, 0, 5],
        [7, 0, 0]]])
Now all we have to do is sum over the columns.
>>> result = distances.sum(axis=1)
>>> result
array([[ 5,  8,  3],
       [11,  5,  7]])
You might also consider the transposed approach, which requires a little bit less reshaping:
>>> labels = numpy.arange(3)
>>> l_selector = l_val[None,:,:] == labels[:,None,None]
>>> distances = K_val * l_selector
>>> distances.sum(axis=-1)
array([[ 5, 11],
       [ 8,  5],
       [ 3,  7]])
>>> distances.sum(axis=-1).T
array([[ 5,  8,  3],
       [11,  5,  7]])
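The same one-hot idea can also be written as a single einsum contraction over the neighbour axis (a compact variant of my own, not from either answer):

import numpy as np

K_val = np.array([[1, 2, 3, 4, 6],
                  [2, 4, 5, 5, 7]])
l_val = np.array([[0, 1, 2, 0, 1],
                  [2, 0, 1, 2, 0]])

one_hot = l_val[:, :, None] == np.arange(3)       # shape (N, k, n_classes)
result = np.einsum('nk,nkc->nc', K_val, one_hot)  # sum distances per class
# array([[ 5,  8,  3],
#        [11,  5,  7]])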
I want to check that a sequence of N numpy vectors of integers is lexicographically ordered. All the vectors in the sequence have shape 1 × 2. (The value of N is big, so I want to avoid sorting this sequence if it is already sorted.)
Does Python, or numpy, already offer a predicate to perform such a test?
(It would not be hard to roll my own, but I prefer to use built-in tools if they exist.)
You can use np.diff and np.all:
A = np.array([[1,2,3], [2,3,1], [3, 4, 5]])
diff = np.diff(A, axis=0)
print(np.all(diff >= 0, axis=0))
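Note that this checks each column for being non-decreasing independently, which is not quite a lexicographic test of the rows. For N x 2 data, a row-wise lexicographic check can be written directly (a sketch of my own): consecutive rows are in order iff the first components strictly increase, or tie while the second components do not decrease:

import numpy as np

def is_lex_sorted(a):
    first, second = a[:-1], a[1:]
    return bool(np.all((first[:, 0] < second[:, 0]) |
                       ((first[:, 0] == second[:, 0]) &
                        (first[:, 1] <= second[:, 1]))))

print(is_lex_sorted(np.array([[0, 2], [1, 2], [1, 3], [3, 1]])))  # True
print(is_lex_sorted(np.array([[1, 3], [1, 2]])))                  # False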
To have an issorted predicate you need a well-defined sort, or at least a clear method of comparing items.
To follow up on my question about the nature of your data: it sounds as though you have something like this:
In [130]: x=[[1,3],[3,4],[1,2],[3,1],[0,2],[6,5]]
In [131]: x1=[np.array(i).reshape(1,2) for i in x]
In [132]: x1
Out[132]:
[array([[1, 3]]),
 array([[3, 4]]),
 array([[1, 2]]),
 array([[3, 1]]),
 array([[0, 2]]),
 array([[6, 5]])]
The Python sort is lexicographic - that is, it sorts on the 1st element of the sublists, and then on the 2nd element.
In [137]: sorted(x)
Out[137]: [[0, 2], [1, 2], [1, 3], [3, 1], [3, 4], [6, 5]]
numpy sorts don't preserve the pairs - depending on the axis specification it sorts by column, or by row (or flat). But the np.sort doc does say that complex numbers are sorted lexicographically:
In [157]: xj = np.dot(x,[1,1j])
In [158]: xj
Out[158]: array([ 1.+3.j, 3.+4.j, 1.+2.j, 3.+1.j, 0.+2.j, 6.+5.j])
In [159]: np.sort(xj)
Out[159]: array([ 0.+2.j, 1.+2.j, 1.+3.j, 3.+1.j, 3.+4.j, 6.+5.j])
This matches the Python list sort.
If my guess as to your data type is correct, a comparison-based test would use something like:
In [167]: [i.__lt__(j) for i,j in zip(x[:-1],x[1:])]
Out[167]: [True, False, True, False, True]
In [168]: xs=sorted(x)
In [169]: [i.__lt__(j) for i,j in zip(xs[:-1],xs[1:])]
Out[169]: [True, True, True, True, True]
That also works for the complex array:
In [173]: xjs=np.sort(xj)
In [174]: [i.__lt__(j) for i,j in zip(xjs[:-1],xjs[1:])]
Out[174]: [True, True, True, True, True]
For large lists I'd try one of the itertools for short-circuiting iteration.
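For instance (my own sketch), a generator fed to all() stops at the first out-of-order pair, so badly unsorted data is rejected early:

def issorted(seq):
    # all() short-circuits at the first False, i.e. the first out-of-order pair
    return all(a <= b for a, b in zip(seq, seq[1:]))

print(issorted(sorted(x)))   # True
print(issorted(x))           # False for the unsorted example above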
But when applied to the arrays themselves (rather than the lists), it is clear that the question of whether the sequence is sorted or not needs further specification:
In [172]: [i.__lt__(j) for i,j in zip(x1[:-1],x1[1:])]
Out[172]:
[array([[ True,  True]], dtype=bool),
 array([[False, False]], dtype=bool),
 array([[ True, False]], dtype=bool),
 array([[False,  True]], dtype=bool),
 array([[ True,  True]], dtype=bool)]
By the way, a list of (1,2) arrays would look something like this:
[np.array(i).reshape(1,2) for i in x]
[array([[1, 3]]),
 array([[3, 4]]),
 array([[1, 2]]),
 array([[3, 1]]),
 array([[0, 2]]),
 array([[6, 5]])]
which if turned into an array would have a (6,1,2) shape. Or did you want a (6,2) array?
In [179]: np.array(x)
Out[179]:
array([[1, 3],
       [3, 4],
       [1, 2],
       [3, 1],
       [0, 2],
       [6, 5]])
numpy has lexsort, but this does a sort, not a test of whether the data is sorted. Nonetheless, running it on sorted data is about twice as fast as on unsorted data.
import numpy as np
import timeit

def data(N):
    return np.random.randint(0, 10, (N, 2))

def get_sorted(x):
    return x[np.lexsort(x.T)]

x = data(5)
y = get_sorted(x)
print(x)   # to verify lex sorting
print()
print(y)
print()

x = data(1000)
y = get_sorted(x)

# to test the time for sorted vs unsorted data
print(timeit.timeit("np.lexsort(x.T)", "from __main__ import np, x", number=1000))
print(timeit.timeit("np.lexsort(y.T)", "from __main__ import np, y", number=1000))
And here are the results:
[[6 7]   # unsorted
 [4 3]
 [6 7]
 [9 2]
 [7 3]]

[[9 2]   # sorted by the second column first
 [4 3]
 [7 3]
 [6 7]
 [6 7]]
0.0788 # time to lex sort 1000x2 unsorted data values
0.0381 # time to lex sort 1000x2 pre-sorted data values
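Since lexsort is stable, an issorted predicate can be built on top of it (a sketch of my own, using the same second-column-first ordering as get_sorted above): already-sorted input yields the identity permutation:

def is_lexsorted(x):
    # stable sort: sorted input comes back as 0, 1, 2, ...
    return np.array_equal(np.lexsort(x.T), np.arange(len(x)))

print(is_lexsorted(get_sorted(data(1000))))   # True
print(is_lexsorted(data(1000)))               # almost certainly False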
Note also that the speed of python vs numpy will depend on the list, because python can sometimes short-circuit its tests. So if you think that your list will generally be unsorted, a pure python solution could figure this out within the first few values, which could be much faster; whereas numpy solutions will generally work through the entire array.