I have a boolean array of shape (n_samples, n_items) which represents a set: my_set[i, j] tells whether sample i contains item j.
To populate it, the array is initialized as zeros and then receives another array of integers, with shape (n_samples, 3), telling for each sample three items that belong to it, for instance:
my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1,3,4], [0,1,2]], dtype=np.int64)
So, I need to fill my_set with ones in row 0, columns 1, 3, 4 and in row 1, columns 0, 1, 2.
init_values contains only valid values in the appropriate range (that is, in [0, n_items)), and no row contains duplicate items.
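In other words, for this example the filled my_set should end up as:
array([[False,  True, False,  True,  True],
       [ True,  True,  True, False, False]])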
Some failed approaches:
I know that a list of integers (or an array) can be used as an index, so I tried to use init_values directly as an index, but it failed:
my_set[init_values] = 1
File "<ipython-input-9-9b2c4d19f4f6>", line 1, in <cell line: 1>
my_set[init_values] = 1
IndexError: index 3 is out of bounds for axis 0 with size 2
I don't know why the 3 is indexing over the first axis, so I tried a second approach: "pick all rows and index only the desired columns", using a mix of slicing and integer indexing. It didn't throw an error, but it didn't work as expected either: check out the shape, I expected it to be (2, 3), however...
my_set[:, init_values].shape
Out[11]: (2, 2, 3)
Not sure why it didn't work, but at least the first axis looks correct, so I tried to pick only the first column, which is a list of integers and therefore "more natural"... once again, it didn't work:
my_set[:, init_values[:,0]].shape
Out[12]: (2, 2)
I expected this shape to be (2, 1), since I wanted all rows with a single column each, corresponding to the indices given in init_values.
I decided to go back to the integer-index approach for the first axis... and it worked:
my_set[np.arange(len(my_set)), init_values[:,0]].shape
Out[13]: (2,)
However, it only works for one column, so I need to iterate over the columns to make it really work, but it looks like a good initial workaround.
Current solution
So, to solve my original problem, I wrote this:
for c in range(init_values.shape[1]):
    my_set[np.arange(len(my_set)), init_values[:, c]] = 1
# now let's check my_set is properly filled
print(my_set)
Out[14]: [[False True False True True]
[ True True True False False]]
which is exactly what I need.
Question(s):
That said, here goes my main question:
Is there a more efficient way to do this? I see it quite inefficient as the number of elements grows (for this example I used 3 but I actually need larger values).
In addition to this, I'd like to understand why using np.arange for the first index behaves differently from slicing it with ":". I didn't expect this behavior.
Any other comments that help me understand why the previous approaches failed are also welcome.
You only have column indices, so you also need to create their corresponding row indices:
>>> my_set[np.arange(len(my_set))[:, None], init_values] = 1
>>> my_set
array([[False, True, False, True, True],
[ True, True, True, False, False]])
[:, None] is used to convert the row-index vector into a column vector, so that the row and column indices have compatible shapes for broadcasting:
>>> np.arange(len(my_set))[:, None]
array([[0],
[1]])
>>> np.broadcast_arrays(np.arange(len(my_set))[:, None], init_values)
[array([[0, 0, 0],
[1, 1, 1]]),
array([[1, 3, 4],
[0, 1, 2]], dtype=int64)]
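Alternatively, if your NumPy is new enough to have np.put_along_axis (added in 1.15), the same assignment can be written without constructing the row indices by hand. A sketch of that variant:
>>> np.put_along_axis(my_set, init_values, True, axis=1)
>>> my_set
array([[False,  True, False,  True,  True],
       [ True,  True,  True, False, False]])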
The essence of slicing is that the indices for the other dimensions are applied to every index within the sliced range of this dimension. Here is a simple test. The matrix to be indexed is as follows:
>>> ar = np.arange(4).reshape(2, 2)
>>> ar
array([[0, 1],
[2, 3]])
If you want to get the elements with indices 0 and 1 in row 0, and the elements with indices 1 and 0 in row 1, but you use the combination of column indices [[0, 1], [1, 0]] and a slice, you will get:
>>> ar[:, [[0, 1], [1, 0]]]
array([[[0, 1],
[1, 0]],
[[2, 3],
[3, 2]]])
This is equivalent to combining each row index, 0 and 1, with the column indices separately:
>>> ar[0, [[0, 1], [1, 0]]]
array([[0, 1],
[1, 0]])
>>> ar[1, [[0, 1], [1, 0]]]
array([[2, 3],
[3, 2]])
In fact, broadcasting is used implicitly here. The actual indices are:
>>> np.broadcast_arrays(0, [[0, 1], [1, 0]])
[array([[0, 0],
[0, 0]]),
array([[0, 1],
[1, 0]])]
>>> np.broadcast_arrays(1, [[0, 1], [1, 0]])
[array([[1, 1],
[1, 1]]),
array([[0, 1],
[1, 0]])]
This is not the same as the indices you actually need. Therefore, you need to manually generate the correct row indices for broadcasting:
>>> ar[[[0], [1]], [[0, 1], [1, 0]]]
array([[0, 1],
[3, 2]])
>>> np.broadcast_arrays([[0], [1]], [[0, 1], [1, 0]])
[array([[0, 0],
[1, 1]]),
array([[0, 1],
[1, 0]])]
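To double-check the broadcasting explanation, the same selection can be written as a plain Python loop over those broadcast (row, column) pairs (just an illustrative sketch):
>>> rows = [[0, 0], [1, 1]]   # the broadcast row indices shown above
>>> cols = [[0, 1], [1, 0]]   # the column indices
>>> np.array([[ar[i, j] for i, j in zip(r, c)] for r, c in zip(rows, cols)])
array([[0, 1],
       [3, 2]])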
Related
I have a 2D array:
[[0,0], [0,1], [1,0], [1,1]]
I want to delete the [0,1] element without knowing its position within the array (as the elements may be shuffled).
Result should be:
[[0,0], [1,0], [1,1]]
I've tried using numpy.delete but keep getting back a flattened array:
>>> arr = np.array([[0,0], [0,1], [1,0], [1,1]])
>>> arr
array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])
>>> np.delete(arr, [0,1])
array([0, 1, 1, 0, 1, 1])
Specifying the axis removes rows 0 and 1 rather than searching for the element (which makes sense):
>>> np.delete(arr, [0,1], axis=0)
array([[1, 0],
[1, 1]])
And trying to find the location (as has been suggested) seems equally problematic:
>>> np.where(arr==[0,1])
(array([0, 1, 1, 3]), array([0, 0, 1, 1]))
(Where did that 3 come from?!?)
Here we find all of the rows that match the candidate [0, 1]
>>> (arr == [0, 1]).all(axis=1)
array([False, True, False, False])
Or alternatively, the rows that do not match the candidate
>>> ~(arr == [0, 1]).all(axis=1)
array([ True, False, True, True])
So, to select all those rows that do not match [0, 1]
>>> arr[~(arr == [0, 1]).all(axis=1)]
array([[0, 0],
[1, 0],
[1, 1]])
Note that this will create a new array.
mask = (arr == np.array([0, 1])).all(axis=1)
arr1 = arr[~mask, :]
Look at mask. It should be [False, True, ...].
From the documentation:
numpy.delete(arr, obj, axis=None)
axis : int, optional
The axis along which to delete the subarray defined by obj. If axis
is None, obj is applied to the flattened array
If you don't specify the axis (i.e. it is None), it will automatically flatten your array; you just need to specify the axis parameter, in your case np.delete(arr, [0,1], axis=0).
However, just like in the example above, [0,1] is then interpreted as a list of indices; you must provide the indices/locations of the row you want to remove (you can find them with np.where, for example).
Here you have a working example:
my_array = np.array([[0, 1],
[1, 0],
[1, 1],
[0, 0]])
row_index, = np.where(np.all(my_array == [0, 1], axis=1))
my_array = np.delete(my_array, row_index, axis=0)
print(my_array)
#Output is below
[[1 0]
[1 1]
[0 0]]
I was wondering if there is a way to do the following in TensorFlow, using gather_nd or something similar.
I have two tensors:
values with shape [128, 100],
indices with shape [128, 3],
where each row of indices contains indices along the second dimension of values (for that same row). I want to index values using indices. For example, I want something that does this (using loose notation to represent tensors):
values = [[0, 0, 0, 1, 1, 0, 1],
[1, 1, 0, 0, 1, 0, 0]]
indices = [[2, 3, 6],
[0, 2, 3]]
batched_gather(values, indices) = [[0, 1, 1], [1, 0, 0]]
This op will go through each row of values and indices and perform a gather on the values row using the indices row.
Is there a simple way to do this in TensorFlow?
Thanks!
Not sure if this qualifies as "simple", but you can use gather_nd for this:
def batched_gather(values, indices):
    row_indices = tf.range(0, tf.shape(values)[0])[:, tf.newaxis]
    row_indices = tf.tile(row_indices, [1, tf.shape(indices)[-1]])
    indices = tf.stack([row_indices, indices], axis=-1)
    return tf.gather_nd(values, indices)
Explanation: The idea is to construct index vectors such as [0, 1] meaning "the value in the 0th row and 1st column".
The column indices are already given in the indices argument to the function.
The row indices are a simple progression from 0 to e.g. 128 (in your example), but are repeated (tiled) in accordance with the number of column indices for each row (3 in your example; could hardcode this instead of using tf.shape if this number is fixed).
The row and column indices are then stacked to produce the index vectors. In your example, the resulting indices would be
array([[[0, 2],
[0, 3],
[0, 6]],
[[1, 0],
[1, 2],
[1, 3]]])
and gather_nd produces the desired result.
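As a side note (my addition, assuming a reasonably recent TensorFlow, where tf.gather grew a batch_dims argument): the same batched gather can also be expressed as
result = tf.gather(values, indices, batch_dims=1)
which for the example above yields [[0, 1, 1], [1, 0, 0]], the same as batched_gather.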
I have the following numpy array, matrix,
matrix = np.zeros((3,5), dtype = int)
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
Suppose I have this numpy array, indices, as well
indices = np.array([[1,3], [2,4], [0,4]])
array([[1, 3],
[2, 4],
[0, 4]])
Question: How can I assign 1s to the elements in matrix whose indices are specified by the indices array? A vectorized implementation is expected.
For more clarity, the output should look like:
array([[0, 1, 0, 1, 0], #[1,3] elements are changed
[0, 0, 1, 0, 1], #[2,4] elements are changed
[1, 0, 0, 0, 1]]) #[0,4] elements are changed
Here's one approach using NumPy's fancy-indexing -
matrix[np.arange(matrix.shape[0])[:,None],indices] = 1
Explanation
We create the row indices with np.arange(matrix.shape[0]) -
In [16]: idx = np.arange(matrix.shape[0])
In [17]: idx
Out[17]: array([0, 1, 2])
In [18]: idx.shape
Out[18]: (3,)
The column indices are already given as indices -
In [19]: indices
Out[19]:
array([[1, 3],
[2, 4],
[0, 4]])
In [20]: indices.shape
Out[20]: (3, 2)
Let's make a schematic diagram of the shapes of row and column indices, idx and indices -
idx (row) : 3
indices (col) : 3 x 2
For using the row and column indices for indexing into input array matrix, we need to make them broadcastable against each other. One way would be to introduce a new axis into idx, making it 2D by pushing the elements into the first axis and allowing a singleton dim as the last axis with idx[:,None], as shown below -
idx (row) : 3 x 1
indices (col) : 3 x 2
Internally, idx would be broadcasted, like so -
In [22]: idx[:,None]
Out[22]:
array([[0],
[1],
[2]])
In [23]: indices
Out[23]:
array([[1, 3],
[2, 4],
[0, 4]])
In [24]: np.repeat(idx[:,None],2,axis=1) # indices has length of 2 along cols
Out[24]:
array([[0, 0], # Internally broadcasting would be like this
[1, 1],
[2, 2]])
Thus, the broadcasted elements from idx would be used as the row indices and indices would supply the column indices for indexing into matrix and setting elements in it. Since we had -
idx = np.arange(matrix.shape[0]),
we would end up with -
matrix[np.arange(matrix.shape[0])[:,None], indices] = 1 for setting the elements.
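Putting it together with the question's matrix and indices, just to confirm we get the expected output:
In [25]: matrix[np.arange(matrix.shape[0])[:,None], indices] = 1
In [26]: matrix
Out[26]:
array([[0, 1, 0, 1, 0],
       [0, 0, 1, 0, 1],
       [1, 0, 0, 0, 1]])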
This involves a loop and hence may not be very efficient for large arrays:
for i in range(len(indices)):
    matrix[i, indices[i]] = 1
> matrix
Out[73]:
array([[0, 1, 0, 1, 0],
[0, 0, 1, 0, 1],
[1, 0, 0, 0, 1]])
I have a list of numpy arrays. How can I can remove duplicate arrays from the list?
I tried set(arrays) but got the error "TypeError: unhashable type: 'numpy.ndarray'".
Example with 2d arrays (mine are actually 3d). Here the starting list is length 10. The output list of distinct arrays should be length 8, because the elements at indexes 0, 5, 9 are all equal.
>>> import numpy
>>> numpy.random.seed(0)
>>> arrays = [numpy.random.randint(2, size=(2,2)) for i in range(10)]
>>> numpy.array_equal(arrays[0], arrays[5])
True
>>> numpy.array_equal(arrays[5], arrays[9])
True
You can start off by collecting all those arrays from the input list into one NumPy array. Then, lex-sort it, which brings all duplicate rows into consecutive order. Then, take differences along the rows, which gives all zeros for rows that repeat the previous one; (numpy.diff(sortedA, axis=0) == 0).all(1) flags those duplicates, and negating it (with a True prepended for the first row) gives a mask of the rows to keep, which could be used to select elements from the concatenated array. Finally, the selected elements are reshaped and sent back to a list-of-arrays format by splitting along the first axis. Thus, you would have a vectorized implementation, like so -
A = numpy.concatenate((arrays)).reshape(-1,arrays[0].size)
sortedA = A[numpy.lexsort(A.T)]
idx = numpy.append(True,~(numpy.diff(sortedA,axis=0)==0).all(1))
out = numpy.vsplit((A.reshape((len(arrays),) + arrays[0].shape))[idx],idx.sum())
Sample input, output -
In [238]: arrays
Out[238]:
[array([[0, 1],
[1, 0]]), array([[1, 1],
[1, 1]]), array([[1, 1],
[1, 0]]), array([[0, 1],
[0, 0]]), array([[0, 0],
[0, 1]]), array([[0, 1],
[1, 0]]), array([[0, 1],
[1, 1]]), array([[1, 0],
[1, 0]]), array([[1, 0],
[1, 1]]), array([[0, 1],
[1, 0]])]
In [239]: out
Out[239]:
[array([[[0, 1],
[1, 0]]]), array([[[1, 1],
[1, 1]]]), array([[[1, 1],
[1, 0]]]), array([[[0, 1],
[1, 0]]]), array([[[0, 1],
[1, 1]]]), array([[[1, 0],
[1, 0]]]), array([[[1, 0],
[1, 1]]]), array([[[0, 1],
[1, 0]]])]
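As an alternative (my sketch, assuming NumPy >= 1.13 and that the original ordering of the arrays need not be preserved), numpy.unique with axis=0 on a flattened view does the deduplication directly:
flat = numpy.stack(arrays).reshape(len(arrays), -1)        # one flattened array per row
unique_rows = numpy.unique(flat, axis=0)                   # unique rows (lexicographically sorted)
unique_arrays = [row.reshape(arrays[0].shape) for row in unique_rows]  # back to a list of 2x2 arrays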
In the end, looped over the list comparing with numpy.array_equal
distinct = list()
for M in arrays:
    if any(numpy.array_equal(M, N) for N in distinct):
        continue
    distinct.append(M)
It's O(n**2) but what the hey.
You can use tostring and fromstring to convert to and from hashable items (byte strings). You can put them in a set:
>>> arrs = [np.random.random(10) for _ in range(10)]
>>> arrs += arrs # create duplicate items
>>>
>>> darrs = set((arr.tostring(), arr.dtype) for arr in arrs)
>>> uniq_arrs = [np.fromstring(arr, dtype=dtype) for arr, dtype in darrs]
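A side note (my addition, not part of the original answer): on newer NumPy versions tostring/fromstring are deprecated in favour of tobytes/frombuffer, so the same idea would look like:
>>> darrs = set((arr.tobytes(), arr.dtype) for arr in arrs)
>>> uniq_arrs = [np.frombuffer(buf, dtype=dtype) for buf, dtype in darrs]
Note that np.frombuffer returns read-only arrays, so call .copy() on them if you need to modify the results.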
I am trying to extract the full set of indices into an N-dimensional cube, and it seems like np.mgrid is just what I need for that. For example, np.mgrid[0:4,0:4] produces a 4 by 4 matrix containing all the indices into an array of the same shape.
The problem is that I want to do this in an arbitrary number of dimensions, based on the shape of another array. I.e. if I have an array a of arbitrary dimension, I want to do something like idx = np.mgrid[0:a.shape], but that syntax is not allowed.
Is it possible to construct the slice I need for np.mgrid to work? Or is there perhaps some other, elegant way of doing this? The following expression does what I need, but it is rather complicated and probably not very efficient:
np.reshape(np.array(list(np.ndindex(a.shape))),list(a.shape)+[len(a.shape)])
I usually use np.indices:
>>> a = np.arange(2*3).reshape(2,3)
>>> np.mgrid[:2, :3]
array([[[0, 0, 0],
[1, 1, 1]],
[[0, 1, 2],
[0, 1, 2]]])
>>> np.indices(a.shape)
array([[[0, 0, 0],
[1, 1, 1]],
[[0, 1, 2],
[0, 1, 2]]])
>>> a = np.arange(2*3*5).reshape(2,3,5)
>>> (np.mgrid[:2, :3, :5] == np.indices(a.shape)).all()
True
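A small addition on my part (assuming NumPy >= 1.17): np.indices also accepts sparse=True, which returns the open grid (like np.ogrid) instead of the dense one, saving memory for large shapes:
>>> np.indices((2, 3), sparse=True)
(array([[0],
       [1]]), array([[0, 1, 2]]))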
I believe the following does what you're asking:
>>> a = np.random.random((1, 2, 3))
>>> np.mgrid[map(slice, a.shape)]
array([[[[0, 0, 0],
[0, 0, 0]]],
[[[0, 0, 0],
[1, 1, 1]]],
[[[0, 1, 2],
[0, 1, 2]]]])
It produces exactly the same result as np.mgrid[0:1, 0:2, 0:3], except that it uses a's shape instead of hard-coded dimensions.
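One caveat (my note, assuming Python 3): map returns an iterator there rather than a list, so you may need to wrap it in a tuple for the indexing to work:
>>> np.mgrid[tuple(map(slice, a.shape))].shape
(3, 1, 2, 3)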