How to remove duplicate elements from list of numpy arrays? - python

I have a list of numpy arrays. How can I remove duplicate arrays from the list?
I tried set(arrays) but got the error "TypeError: unhashable type: 'numpy.ndarray'"
Example with 2d arrays (mine are actually 3d). Here the starting list is length 10. The output list of distinct arrays should be length 8, because the elements at indexes 0, 5, 9 are all equal.
>>> import numpy
>>> numpy.random.seed(0)
>>> arrays = [numpy.random.randint(2, size=(2,2)) for i in range(10)]
>>> numpy.array_equal(arrays[0], arrays[5])
True
>>> numpy.array_equal(arrays[5], arrays[9])
True

You can start off by collecting all those arrays from the input list into one 2D NumPy array, with each input array flattened into a row. Then lex-sort the rows, which brings all the duplicate rows into consecutive order. Next, differentiate along the rows, which gives all zeros for any row that duplicates the one before it; those rows are detected with (numpy.diff(sortedA,axis=0)==0).all(1). Negating that and prepending True gives a mask of the first occurrence of each distinct row, which is used to select rows from the sorted array (not from the original, unsorted one, since the mask follows the sorted order). Finally, the selected rows are reshaped and sent back to a list of arrays format by splitting along the first axis. Thus, you would have a vectorized implementation, like so -
A = numpy.concatenate((arrays)).reshape(-1,arrays[0].size)
sortedA = A[numpy.lexsort(A.T)]
idx = numpy.append(True,~(numpy.diff(sortedA,axis=0)==0).all(1))
out = numpy.vsplit(sortedA[idx].reshape((-1,) + arrays[0].shape),idx.sum())
Note that out comes back in lex-sorted order rather than in the input order.
Sample input, output -
In [238]: arrays
Out[238]:
[array([[0, 1],
[1, 0]]), array([[1, 1],
[1, 1]]), array([[1, 1],
[1, 0]]), array([[0, 1],
[0, 0]]), array([[0, 0],
[0, 1]]), array([[0, 1],
[1, 0]]), array([[0, 1],
[1, 1]]), array([[1, 0],
[1, 0]]), array([[1, 0],
[1, 1]]), array([[0, 1],
[1, 0]])]
In [239]: out
Out[239]:
[array([[[0, 1],
[0, 0]]]), array([[[1, 0],
[1, 0]]]), array([[[0, 1],
[1, 0]]]), array([[[1, 1],
[1, 0]]]), array([[[0, 0],
[0, 1]]]), array([[[1, 0],
[1, 1]]]), array([[[0, 1],
[1, 1]]]), array([[[1, 1],
[1, 1]]])]
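If all the arrays in the list share one shape, a shorter route may be numpy.unique with the axis argument. The following is only a sketch under that assumption; it rebuilds the sample list from this question by hand, and the names stacked, unique_stack and distinct are made up for illustration. The result comes back sorted, not in input order.

```python
import numpy as np

# Sample list from the question: ten 2x2 arrays, with indexes 0, 5 and 9 equal.
arrays = [np.array(a) for a in (
    [[0, 1], [1, 0]], [[1, 1], [1, 1]], [[1, 1], [1, 0]], [[0, 1], [0, 0]],
    [[0, 0], [0, 1]], [[0, 1], [1, 0]], [[0, 1], [1, 1]], [[1, 0], [1, 0]],
    [[1, 0], [1, 1]], [[0, 1], [1, 0]],
)]

# Stack into one (10, 2, 2) array and deduplicate along the first axis.
stacked = np.stack(arrays)
unique_stack = np.unique(stacked, axis=0)

# Back to a list of (2, 2) arrays.
distinct = list(unique_stack)
print(len(distinct))  # 8
```

The same call should work for 3d arrays, since axis=0 treats each stacked array as one item.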

In the end, I looped over the list, comparing with numpy.array_equal:
distinct = list()
for M in arrays:
    if any(numpy.array_equal(M, N) for N in distinct):
        continue
    distinct.append(M)
It's O(n**2) but what the hey.

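You can use tobytes and frombuffer to convert to and from hashable items (byte strings), which you can put in a set. (tostring is a deprecated alias of tobytes, and fromstring on binary data is deprecated in favour of frombuffer.) The byte string does not record the shape, so keep arr.shape in the key as well and reshape afterwards. The arrays returned by frombuffer are read-only; call .copy() on them if you need to modify them:
>>> import numpy as np
>>> arrs = [np.random.random(10) for _ in range(10)]
>>> arrs += arrs # create duplicate items
>>>
>>> darrs = set((arr.tobytes(), arr.dtype, arr.shape) for arr in arrs)
>>> uniq_arrs = [np.frombuffer(buf, dtype=dtype).reshape(shape) for buf, dtype, shape in darrs]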
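A set loses the input order. If the order matters, the same byte-string idea can be used as a "seen" key while looping once over the list. This is a rough sketch, not the answerer's code; unique_arrays is a hypothetical helper name, and it treats arrays with equal values but different dtype or shape as distinct.

```python
import numpy as np

def unique_arrays(arrays):
    """Return the distinct arrays in first-seen order (hypothetical helper)."""
    seen = set()
    distinct = []
    for arr in arrays:
        # Shape and dtype are part of the key because raw bytes alone
        # cannot tell a (2, 4) array from a (4, 2) one.
        key = (arr.shape, arr.dtype.str, arr.tobytes())
        if key not in seen:
            seen.add(key)
            distinct.append(arr)
    return distinct

a = np.arange(8).reshape(2, 2, 2)
result = unique_arrays([a, a.copy(), a + 1, a.reshape(4, 2)])
print(len(result))  # 3
```

Each array is hashed once, so this avoids the pairwise comparisons of the array_equal loop.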

Related

using integer as index for multidimensional numpy array

I have a boolean array of shape (n_samples, n_items) which represents a set: my_set[i, j] tells if sample i contains item j.
To populate it, the array is initialized as zeros, and I receive another array of integers, with shape (n_samples, 3), that gives, for each sample, three items that belong to it. For instance:
my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1,3,4], [0,1,2]], dtype=np.int64)
So, I need to fill my_set with ones in row 0, columns 1, 3, 4, and in row 1, columns 0, 1, 2.
init_values contains valid values in the appropriate range (that is, in [0, n_items)), and no row contains duplicated items.
Some failed approaches:
I know that a list of integers (or an array) can be used as an index, so I tried to use init_values directly as the index, but it failed:
my_set[init_values] = 1
File "<ipython-input-9-9b2c4d19f4f6>", line 1, in <cell line: 1>
my_set[init_values] = 1
IndexError: index 3 is out of bounds for axis 0 with size 2
I don't know why the 3 is indexing over the first axis, so I tried a second approach: "pick up all rows and index only the desired columns", using a mix of slicing and integer indexing. It didn't throw an error, but it didn't work as expected. Check out the shape: I expected it to be (2, 3), however...
my_set[:, init_values].shape
Out[11]: (2, 2, 3)
Not sure why it didn't work, but at least the first axis looks correct, so I tried to pick up only the first column, which is a list of integers, and therefore it is "more natural"... once again, it didn't work:
my_set[:, init_values[:,0]].shape
Out[12]: (2, 2)
I expected this shape to be (2, 1) since I wanted all rows with a single column on each, corresponding to the indexes given in init_values.
I decided to go back to integer index approach for the first axis.... and it worked:
my_set[np.arange(len(my_set)), init_values[:,0]].shape
Out[13]: (2,)
However, it only works for one column, so I need to iterate over the columns to make it really work, but it looks like a good initial workaround.
Current solution
So, to solve my original problem, I wrote this:
for c in range(init_values.shape[1]):
    my_set[np.arange(len(my_set)), init_values[:,c]] = 1
# now let's check my_set is properly filled
print(my_set)
Out[14]: [[False True False True True]
[ True True True False False]]
which is exactly what I need.
Question(s):
That said, here goes my main question:
Is there a more efficient way to do this? The loop looks quite inefficient as the number of elements grows (for this example I used 3, but I actually need larger values).
In addition to this, I'd like to understand why using np.arange on the first index behaves differently from slicing it with a plain colon; I didn't expect this behavior.
Any other comments that help explain why the previous approaches failed are also welcome.
You only have column indices, so you also need to create their corresponding row indices:
>>> my_set[np.arange(len(my_set))[:, None], init_values] = 1
>>> my_set
array([[False, True, False, True, True],
[ True, True, True, False, False]])
[:, None] turns the 1D array of row indices, of shape (2,), into a column vector of shape (2, 1), so that the row and column indices have compatible shapes for broadcasting:
>>> np.arange(len(my_set))[:, None]
array([[0],
[1]])
>>> np.broadcast_arrays(np.arange(len(my_set))[:, None], init_values)
[array([[0, 0, 0],
[1, 1, 1]]),
array([[1, 3, 4],
[0, 1, 2]], dtype=int64)]
In essence, a slice applies the indices given for the other dimensions to every position in the sliced range of its own dimension. Here is a simple test. The matrix to be indexed is as follows:
>>> ar = np.arange(4).reshape(2, 2)
>>> ar
array([[0, 1],
[2, 3]])
If you want to get the elements with indices 0 and 1 in row 0, and the elements with indices 1 and 0 in row 1, but you combine the column indices [[0, 1], [1, 0]] with a slice, you will get:
>>> ar[:, [[0, 1], [1, 0]]]
array([[[0, 1],
[1, 0]],
[[2, 3],
[3, 2]]])
This is equivalent to combining the row index from 0 to 1 with the column indices respectively:
>>> ar[0, [[0, 1], [1, 0]]]
array([[0, 1],
[1, 0]])
>>> ar[1, [[0, 1], [1, 0]]]
array([[2, 3],
[3, 2]])
In fact, broadcasting is used secretly here. The actual indices are:
>>> np.broadcast_arrays(0, [[0, 1], [1, 0]])
[array([[0, 0],
[0, 0]]),
array([[0, 1],
[1, 0]])]
>>> np.broadcast_arrays(1, [[0, 1], [1, 0]])
[array([[1, 1],
[1, 1]]),
array([[0, 1],
[1, 0]])]
This is not the same as the indices you actually need. Therefore, you need to manually generate the correct row indices for broadcasting:
>>> ar[[[0], [1]], [[0, 1], [1, 0]]]
array([[0, 1],
[3, 2]])
>>> np.broadcast_arrays([[0], [1]], [[0, 1], [1, 0]])
[array([[0, 0],
[1, 1]]),
array([[0, 1],
[1, 0]])]
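For the original my_set problem, another option that may read more directly is np.put_along_axis, which pairs each row with its own row of column indices. This is only a sketch of an alternative, not something the answer above proposes; it reuses the question's my_set and init_values.

```python
import numpy as np

my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1, 3, 4], [0, 1, 2]], dtype=np.int64)

# For each row i, set my_set[i, init_values[i, k]] = True for every k.
np.put_along_axis(my_set, init_values, True, axis=1)
print(my_set)
```

It modifies my_set in place and should give the same array as the broadcasted row-index version.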

How to set individual indices in Numpy arrays

I am trying to use arrays to set values in other arrays. Unfortunately instead of setting a value it is somehow overwriting a bunch of values. What is going on, and how can I achieve what I want?
>>> target = np.array( [ [0,1],[1,2],[2,3] ])
>>> target
array([[0, 1],
[1, 2],
[2, 3]])
>>> actions = np.array([0,0,0])
>>> target[actions] #The first row, 3 times
array([[0, 1],
[0, 1],
[0, 1]])
>>> target[:,actions] #The first column, 3 times
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
>>> values = np.array([7,8,9])
>>> target[:,actions] = values #why isnt this working?
>>> target
array([[9, 1],
[9, 2],
[9, 3]])
#Actually want
#array([[7, 1],
# [8, 2],
# [9, 3]])
>>> target = np.array( [ [0,1],[1,2],[2,3] ]) #reset to original value
>>> actions = np.array([0,1,0])
>>> target[:,actions] = values.reshape(3, 1)
>>> target
array([[7, 7],
[8, 8],
[9, 9]])
#Actually want
#array([[7, 1],
# [1, 8],
# [9, 3]])
target[:,actions] selects the same column of target thrice.
When you say target[:,actions] = values, what you are doing is:
Assign 7 to all the values in the column, three times.
Assign 8 to all the values in the column, three times.
Assign 9 to all the values in the column, three times.
So you end up with 9 in all the values in the column.
If you insist on this awkward triple-writing of data, you can fix it by transposing the write:
target[:,actions] = values.reshape(3, 1)
This will write [7,8,9] to the column, three times. Obviously that's wasteful, and you could do this instead:
target[:,actions[-1]] = values
The effect should be the same, and it saves computation.
2 ways to write [7,8,9] to the first column:
basic indexing (with slice):
In [396]: target[:,0] = [7,8,9] # all rows, 1st column
In [397]: target
Out[397]:
array([[7, 1],
[8, 2],
[9, 3]])
Advanced indexing (with 2 lists):
In [398]: target[[0,1,2],[0,0,0]] = [7,8,9] # pair [0,0],[1,0],[2,0]
In [399]: target
Out[399]:
array([[7, 1],
[8, 2],
[9, 3]])
The 2nd method also works for a mix of columns:
In [400]: target = np.array( [ [0,1],[1,2],[2,3] ])
In [401]: target[[0,1,2],[0,1,0]] = [7,8,9]
In [402]: target
Out[402]:
array([[7, 1],
[1, 8],
[9, 3]])
Broadcasting comes into play. In a case like this there are 3 potential arrays to broadcast - the index arrays for the 2 dimensions and the source array.
Advanced indexing like this produces a 1d array. So the source array has to match:
In [403]: target[[0,1,2],[0,1,0]]
Out[403]: array([7, 8, 9])
A (1,3) can broadcast to (3,), but a (3,1) can't:
In [404]: target[[0,1,2],[0,1,0]] = np.array([[7,8,9]])
In [405]: target[[0,1,2],[0,1,0]] = np.array([[7,8,9]]).T
...
ValueError: shape mismatch: value array of shape (3,1) could not be broadcast to indexing result of shape (3,)
This sort of indexing is unusual. Note that the result is (3,3).
In [412]: target[:,[0,0,0]]
Out[412]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
A (3,1) source:
In [413]: np.array([[7,8,9]]).T
Out[413]:
array([[7],
[8],
[9]])
In [414]: target[:,[0,0,0]] = _
In [415]: target
Out[415]:
array([[7, 1],
[8, 2],
[9, 3]])
The (3,1) can broadcast to (3,3). It works, but ends up assigning [7,8,9] 3 times, all to the same 0 column.
Another way of assigning the 1st column:
In [423]: target[np.ix_([0,1,2],[0,0,0])]
Out[423]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
Again a (3,3), which accepts a (3,1):
In [424]: target[np.ix_([0,1,2],[0,0,0])] = np.array([[7,8,9]]).T
In [425]: target
Out[425]:
array([[7, 1],
[8, 2],
[9, 3]])
ix_ makes 2 arrays that can broadcast against each other, in this case a column vector and a row one:
In [426]: np.ix_([0,1,2],[0,0,0])
Out[426]:
(array([[0],
[1],
[2]]), array([[0, 0, 0]]))
I can select all elements of target with:
In [430]: target[np.ix_([0,1,2],[0,1])]
Out[430]:
array([[0, 1],
[1, 2],
[2, 3]])
and in a jumbled order:
In [431]: target[np.ix_([2,0,1],[1,0])]
Out[431]:
array([[3, 2],
[1, 0],
[2, 1]])
I couldn't get it to work using : indexing; however, the following works by using an array of row indices. I'm not sure why the : method is not working; if someone can come up with a way to fix that, I will accept it instead.
>>> target = np.array( [ [0,1],[1,2],[2,3] ])
>>> rows = np.arange(target.shape[0])
>>> actions = np.array([0,1,0])
>>> values = np.array([7,8,9])
>>> target[rows,actions] = values
>>> target
array([[7, 1],
[1, 8],
[9, 3]])
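As to why the : version behaves differently, comparing the shapes of the two selections may help. The snippet below is a small illustration using the same target and actions; block and picked are names invented here.

```python
import numpy as np

target = np.array([[0, 1], [1, 2], [2, 3]])
rows = np.arange(target.shape[0])
actions = np.array([0, 1, 0])

# Slice + index array: every row is paired with every entry of actions.
block = target[:, actions]
# Two index arrays: rows[k] is paired with actions[k] only.
picked = target[rows, actions]

print(block.shape)   # (3, 3)
print(picked.shape)  # (3,)
```

So a length-3 values array lines up one-to-one with picked, whereas with block it has to be broadcast over a 3 by 3 selection.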

Insert a list into numpy-based matrix

I want to insert a list into numpy-based matrix in a specific index. For instance, the following code (python 2.7) is supposed to insert the list [5,6,7] into M in the second place:
M = [[0, 0], [0, 1], [1, 0], [1, 1]]
M = np.asarray(M)
X = np.insert(M, 1, [5,6,7])
print(X)
This, however, does not output what I would like. It messes up the matrix M by merging all the lists into one single list. How can I add any list at any position of a numpy-based matrix?
Thank you
In [80]: M = [[0, 0], [0, 1], [1, 0], [1, 1]]
...: M1 = np.asarray(M)
...:
List insert:
In [81]: M[1:2] = [[5,6,7]]
In [82]: M
Out[82]: [[0, 0], [5, 6, 7], [1, 0], [1, 1]]
Contrast the array made from the original M and the modified one:
In [83]: M1
Out[83]:
array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])
In [84]: np.array(M)
Out[84]:
array([list([0, 0]), list([5, 6, 7]), list([1, 0]), list([1, 1])],
dtype=object)
The second one is not a 2d array.
np.insert without an axis ravels things (check the docs):
In [85]: np.insert(M1,1,[5,6,7])
Out[85]: array([0, 5, 6, 7, 0, 0, 1, 1, 0, 1, 1])
If I specify an axis it complains about a mismatch in shapes:
In [86]: np.insert(M1,1,[5,6,7],axis=0)
...
5071 new[slobj] = arr[slobj]
5072 slobj[axis] = slice(index, index+numnew)
-> 5073 new[slobj] = values
5074 slobj[axis] = slice(index+numnew, None)
5075 slobj2 = [slice(None)] * ndim
ValueError: could not broadcast input array from shape (1,3) into shape (1,2)
It creates a (1,2) shape slot to receive the new value, but [5,6,7] won't fit.
In [87]: np.insert(M1,1,[5,6],axis=0)
Out[87]:
array([[0, 0],
[5, 6],
[0, 1],
[1, 0],
[1, 1]])
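With axis=0 and a scalar position, np.insert should also accept several rows at once, as long as each new row has the same width as the matrix. A short sketch, where new_rows is a made-up name and the values are arbitrary:

```python
import numpy as np

M1 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
new_rows = [[5, 6], [7, 8]]  # each row must have M1.shape[1] entries

# Insert both rows before index 1; M1 itself is left unchanged.
X = np.insert(M1, 1, new_rows, axis=0)
print(X)
```

A row of a different length, such as [5, 6, 7], cannot be stored in the same 2d array; a plain list of lists is the simpler container for ragged rows.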
arr = numpy.array([input().split() for i in range(int(input().split()[0]))])
print(arr)
INPUT:
2
1 2 3 4
5 6 7 8
OUTPUT:
[['1' '2' '3' '4']
['5' '6' '7' '8']]

How to convert a 'for' loop into a matricial expression for a list of lists using python3?

I need to convert a for loop into an expression using matrix form. I have a list of lists, a list of indices, and a matrix of shape (4,2) named 'toSave':
import numpy as np
M = [list() for i in range(3)]
indices= [1,1,0,1]
toSave = np.array([[0, 0],
[0, 1],
[0, 2],
[0, 3]])
For each index i in indices, I want to save the row corresponding to the position of index i in indices:
for n, i in enumerate(indices):
    M[i].append(toSave[n])
the result is:
M=[[[0, 2]], [[0, 0], [0, 1], [0, 3]], []]
Is it possible to use a matricial expression instead of a for loop, something like M[indices].append(toSave[range(4)])?
Here's one approach -
sidx = np.argsort(indices)
s_indx = np.take(indices, sidx)
split_idx = np.flatnonzero(s_indx[1:] != s_indx[:-1])+1
out = np.split(toSave[sidx], split_idx, axis=0)
Sample run -
# Given inputs
In [67]: M=[[] for i in range(3)]
...: indices= [1,1,0,1]
...: toSave=np.array([[0, 0],
...: [0, 1],
...: [0, 2],
...: [0, 3]])
...:
# Using loopy solution
In [68]: for n, i in enumerate(indices):
...:     M[i].append(toSave[n])
...:
In [69]: M
Out[69]: [[array([0, 2])], [array([0, 0]), array([0, 1]), array([0, 3])], []]
# Using proposed solution
In [70]: out
Out[70]:
[array([[0, 2]]), array([[0, 0],
[0, 1],
[0, 3]])]
Performance boost
A faster way would be to avoid np.split and do the splitting with slicing, like so -
sorted_toSave = toSave[sidx]
idx = np.concatenate(( [0], split_idx, [toSave.shape[0]] ))
out = [sorted_toSave[i:j] for i,j in zip(idx[:-1],idx[1:])]
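The sort-and-split output has no entry for a group that received no rows (index 2 in the sample), whereas the loop keeps an empty list there. If the empty groups are needed, one possible variation is to take the split points from np.bincount. This is a sketch, not part of the answer above; group_rows and n_groups are hypothetical names, and a stable sort is requested so that rows keep their input order within a group.

```python
import numpy as np

def group_rows(to_save, indices, n_groups):
    """Split the rows of to_save into n_groups arrays by group index."""
    indices = np.asarray(indices)
    order = np.argsort(indices, kind="stable")
    # Rows per group, including zero counts for unused groups.
    counts = np.bincount(indices, minlength=n_groups)
    # Cumulative counts (minus the last) are the split positions.
    bounds = np.cumsum(counts)[:-1]
    return np.split(to_save[order], bounds)

toSave = np.array([[0, 0], [0, 1], [0, 2], [0, 3]])
groups = group_rows(toSave, [1, 1, 0, 1], n_groups=3)
print([g.tolist() for g in groups])
# [[[0, 2]], [[0, 0], [0, 1], [0, 3]], []]
```

The last group comes back as an empty array of shape (0, 2) rather than an empty list.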

Numpy slice with array as index

I am trying to extract the full set of indices into an N-dimensional cube, and it seems like np.mgrid is just what I need for that. For example, np.mgrid[0:4,0:4] produces a 4 by 4 matrix containing all the indices into an array of the same shape.
The problem is that I want to do this in an arbitrary number of dimensions, based on the shape of another array. I.e. if I have an array a of arbitrary dimension, I want to do something like idx = np.mgrid[0:a.shape], but that syntax is not allowed.
Is it possible to construct the slice I need for np.mgrid to work? Or is there perhaps some other, elegant way of doing this? The following expression does what I need, but it is rather complicated and probably not very efficient:
np.reshape(np.array(list(np.ndindex(a.shape))),list(a.shape)+[len(a.shape)])
I usually use np.indices:
>>> a = np.arange(2*3).reshape(2,3)
>>> np.mgrid[:2, :3]
array([[[0, 0, 0],
[1, 1, 1]],
[[0, 1, 2],
[0, 1, 2]]])
>>> np.indices(a.shape)
array([[[0, 0, 0],
[1, 1, 1]],
[[0, 1, 2],
[0, 1, 2]]])
>>> a = np.arange(2*3*5).reshape(2,3,5)
>>> (np.mgrid[:2, :3, :5] == np.indices(a.shape)).all()
True
I believe the following does what you're asking (the tuple() is needed on Python 3, where map returns an iterator rather than a list):
>>> a = np.random.random((1, 2, 3))
>>> np.mgrid[tuple(map(slice, a.shape))]
array([[[[0, 0, 0],
[0, 0, 0]]],
[[[0, 0, 0],
[1, 1, 1]]],
[[[0, 1, 2],
[0, 1, 2]]]])
It produces exactly the same result as np.mgrid[0:1,0:2,0:3], except that it uses a's shape instead of hard-coded dimensions.
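To check that building the slices from a's shape agrees with np.indices for an arbitrary number of dimensions, something like the following could be used. It is a sketch; full_index_grid is a hypothetical helper name.

```python
import numpy as np

def full_index_grid(a):
    """Index grid for every position of a, built with np.mgrid."""
    # One slice(0, n) per dimension, passed to mgrid as a tuple.
    return np.mgrid[tuple(slice(0, n) for n in a.shape)]

a = np.zeros((1, 2, 3))
grid = full_index_grid(a)
same = np.array_equal(grid, np.indices(a.shape))
print(grid.shape, same)  # (3, 1, 2, 3) True
```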
