Removing arrays which contain duplicate elements in Numpy [duplicate]

I have an (N, 3) array of numpy values:
>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> vals
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 7],
[0, 4, 5],
[2, 2, 1],
[0, 0, 0],
[5, 4, 3]])
I'd like to remove rows from the array that have a duplicate value. For example, the result for the above array should be:
>>> duplicates_removed
array([[1, 2, 3],
[4, 5, 6],
[0, 4, 5],
[5, 4, 3]])
I'm not sure how to do this efficiently with numpy without looping (the array could be quite large). Anyone know how I could do this?

This is an option:
import numpy
vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
a = (vals[:,0] == vals[:,1]) | (vals[:,1] == vals[:,2]) | (vals[:,0] == vals[:,2])
vals = numpy.delete(vals, numpy.where(a), axis=0)
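For the sample vals above, the mask and the result work out as follows (this follows directly from the three hardcoded column comparisons):
>>> a
array([False, False,  True, False,  True,  True, False])
>>> vals
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])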

Here's an approach that handles a generic number of columns and is still vectorized -
def rows_uniq_elems(a):
    a_sorted = np.sort(a, axis=-1)
    return a[(a_sorted[..., 1:] != a_sorted[..., :-1]).all(-1)]
Steps:
Sort along each row.
Look for differences between consecutive elements in each row; any row with at least one zero difference contains a duplicate element. Use this to build a mask of valid rows.
Finally, select the valid rows from the input array using the mask; the intermediates are sketched just below.
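Here are those intermediates for the original three-column vals (the values follow directly from the sample input):
>>> a_sorted = np.sort(vals, axis=-1)
>>> a_sorted
array([[1, 2, 3],
       [4, 5, 6],
       [7, 7, 8],
       [0, 4, 5],
       [1, 2, 2],
       [0, 0, 0],
       [3, 4, 5]])
>>> (a_sorted[..., 1:] != a_sorted[..., :-1]).all(-1)  # True = all elements in the row are unique
array([ True,  True, False,  True, False, False,  True])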
Sample run -
In [49]: a
Out[49]:
array([[1, 2, 3, 7],
[4, 5, 6, 7],
[7, 8, 7, 8],
[0, 4, 5, 6],
[2, 2, 1, 1],
[0, 0, 0, 3],
[5, 4, 3, 2]])
In [50]: rows_uniq_elems(a)
Out[50]:
array([[1, 2, 3, 7],
[4, 5, 6, 7],
[0, 4, 5, 6],
[5, 4, 3, 2]])

numpy.array([v for v in vals if len(set(v)) == len(v)])
Mind you, this still loops behind the scenes. You can't avoid that. But it should work fine even for millions of rows.

It's six years on, but this question helped me, so I ran a speed comparison of the answers given by Divakar, Benjamin, Marcelo Cantos and Curtis Patrick.
import numpy as np
vals = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
def rows_uniq_elems1(a):
    idx = a.argsort(1)
    a_sorted = a[np.arange(idx.shape[0])[:,None], idx]
    return a[(a_sorted[:,1:] != a_sorted[:,:-1]).all(-1)]

def rows_uniq_elems2(a):
    b = (a[:,0] == a[:,1]) | (a[:,1] == a[:,2]) | (a[:,0] == a[:,2])
    return np.delete(a, np.where(b), axis=0)

def rows_uniq_elems3(a):
    return np.array([v for v in a if len(set(v)) == len(v)])

def rows_uniq_elems4(a):
    return np.array([v for v in a if len(np.unique(v)) == len(v)])
Results:
%timeit rows_uniq_elems1(vals)
10000 loops, best of 3: 67.9 µs per loop
%timeit rows_uniq_elems2(vals)
10000 loops, best of 3: 156 µs per loop
%timeit rows_uniq_elems3(vals)
1000 loops, best of 3: 59.5 µs per loop
%timeit rows_uniq_elems4(vals)
10000 loops, best of 3: 268 µs per loop
It seems that using set beats numpy.unique. In my case I needed to do this over a much larger array:
bigvals = np.random.randint(0,10,3000).reshape([3,1000])
%timeit rows_uniq_elems1(bigvals)
10000 loops, best of 3: 276 µs per loop
%timeit rows_uniq_elems2(bigvals)
10000 loops, best of 3: 192 µs per loop
%timeit rows_uniq_elems3(bigvals)
10000 loops, best of 3: 6.5 ms per loop
%timeit rows_uniq_elems4(bigvals)
10000 loops, best of 3: 35.7 ms per loop
The methods without list comprehensions are much faster. However, rows_uniq_elems2 hard-codes the number of columns and is difficult to extend beyond three columns, so in my case at least the list comprehension with the set is the best answer.
EDITED because I confused rows and columns in bigvals

Identical to Marcelo, but I think using numpy.unique() instead of set() may get across exactly what you are shooting for.
numpy.array([v for v in vals if len(numpy.unique(v)) == len(v)])

Related

Can NumPy take care that an array is (nonstrictly) increasing along one axis?

Is there a function in numpy to guarantee or rather fix an array such that it is (nonstrictly) increasing along one particular axis?
For example, I have the following 2D array:
X = array([[1, 2, 1, 4, 5],
[0, 3, 1, 5, 4]])
the output of np.foobar(X) should return
array([[1, 2, 2, 4, 5],
[0, 3, 3, 5, 5]])
Does foobar exist or do I need to do that manually by using something like np.diff and some smart indexing?
Use np.maximum.accumulate for a running (accumulated) max value along that axis, which enforces the nonstrictly increasing (non-decreasing) criterion -
np.maximum.accumulate(X,axis=1)
Sample run -
In [233]: X
Out[233]:
array([[1, 2, 1, 4, 5],
[0, 3, 1, 5, 4]])
In [234]: np.maximum.accumulate(X,axis=1)
Out[234]:
array([[1, 2, 2, 4, 5],
[0, 3, 3, 5, 5]])
For memory efficiency, we can write the result back into the input array in place by using its out argument.
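A minimal sketch of the in-place form (assuming X may be overwritten):
np.maximum.accumulate(X, axis=1, out=X)  # X now holds its running row-wise maximum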
Runtime tests
Case #1 : Array as input
In [254]: X = np.random.rand(1000,1000)
In [255]: %timeit np.maximum.accumulate(X,axis=1)
1000 loops, best of 3: 1.69 ms per loop
# @cᴏʟᴅsᴘᴇᴇᴅ's pandas soln using df.cummax
In [256]: %timeit pd.DataFrame(X).cummax(axis=1).values
100 loops, best of 3: 4.81 ms per loop
Case #2 : Dataframe as input
In [257]: df = pd.DataFrame(np.random.rand(1000,1000))
In [258]: %timeit np.maximum.accumulate(df.values,axis=1)
1000 loops, best of 3: 1.68 ms per loop
# @cᴏʟᴅsᴘᴇᴇᴅ's pandas soln using df.cummax
In [259]: %timeit df.cummax(axis=1)
100 loops, best of 3: 4.68 ms per loop
pandas offers you the df.cummax function:
import pandas as pd
pd.DataFrame(X).cummax(axis=1).values
array([[1, 2, 2, 4, 5],
[0, 3, 3, 5, 5]])
It's useful to know that there's a first-class function on hand in case your data is already loaded into a dataframe.
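As a quick sanity check that the NumPy and pandas routes above agree (pd is assumed to be imported as pandas, X as in the question):
np.array_equal(np.maximum.accumulate(X, axis=1),
               pd.DataFrame(X).cummax(axis=1).values)  # True for this X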

Convert Python sequence to NumPy array, filling missing values

The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.
v = [[1], [1, 2]]
np.array(v)
>>> array([[1], [1, 2]], dtype=object)
Trying to force another type will cause an exception:
np.array(v, dtype=np.int32)
ValueError: setting an array element with a sequence.
What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?
From my sample sequence v, I would like to get something like this, if 0 is the placeholder
array([[1, 0], [1, 2]], dtype=int32)
You can use itertools.zip_longest:
import itertools
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out:
array([[1, 0],
[1, 2]])
Note: For Python 2, it is itertools.izip_longest.
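Wrapped up as a small helper that also casts to the int32 dtype the question asks for (a sketch; the name pad_with_zip and its defaults are my own):
import itertools
import numpy as np

def pad_with_zip(seq, fillvalue=0, dtype=np.int32):
    # zip_longest pads the shorter lists; the transpose restores the original row order
    return np.array(list(itertools.zip_longest(*seq, fillvalue=fillvalue)), dtype=dtype).T

pad_with_zip([[1], [1, 2]])
# array([[1, 0],
#        [1, 2]], dtype=int32)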
Here's an almost* vectorized boolean-indexing based approach that I have used in several other posts -
def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=int)
    out[mask] = np.concatenate(v)
    return out
Sample run
In [27]: v
Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]
In [28]: out
Out[28]:
array([[1, 0, 0, 0, 0],
[1, 2, 0, 0, 0],
[3, 6, 7, 8, 9],
[4, 0, 0, 0, 0]])
*Note that this is called almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. That part is not computationally demanding, so it should have minimal effect on the total runtime.
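The question asks for an arbitrary placeholder; here is a small variation that takes the fill value as a parameter (a sketch, simply swapping np.zeros for np.full):
def boolean_indexing_fill(v, fillvalue=0):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.full(mask.shape, fillvalue, dtype=np.int32)  # placeholder everywhere
    out[mask] = np.concatenate(v)                         # then copy the real values in
    return out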
Runtime test
In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, against the boolean-indexing based one from this post, on a relatively larger dataset with three levels of size variation across the list elements.
Case #1 : Larger size variation
In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]
In [45]: v = v*1000
In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 9.82 ms per loop
In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
100 loops, best of 3: 5.11 ms per loop
In [48]: %timeit boolean_indexing(v)
100 loops, best of 3: 6.88 ms per loop
Case #2 : Lesser size variation
In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]
In [50]: v = v*1000
In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 3.12 ms per loop
In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1000 loops, best of 3: 1.55 ms per loop
In [53]: %timeit boolean_indexing(v)
100 loops, best of 3: 5 ms per loop
Case #3 : Larger number of elements (100 max) per list element
In [139]: # Setup inputs
...: N = 10000 # Number of elems in list
...: maxn = 100 # Max. size of a list element
...: lens = np.random.randint(0,maxn,(N))
...: v = [list(np.random.randint(0,9,(L))) for L in lens]
...:
In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
1 loops, best of 3: 292 ms per loop
In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1 loops, best of 3: 264 ms per loop
In [142]: %timeit boolean_indexing(v)
10 loops, best of 3: 95.7 ms per loop
To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner; it would have to be taken on a case-by-case basis.
Pandas and its DataFrames deal beautifully with missing data.
import numpy as np
import pandas as pd
v = [[1], [1, 2]]
print(pd.DataFrame(v).fillna(0).values.astype(np.int32))
# array([[1, 0],
# [1, 2]], dtype=int32)
max_len = max(len(sub_list) for sub_list in v)
result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])
>>> result
array([[1, 0],
[1, 2]])
>>> type(result)
numpy.ndarray
Here is a general way:
>>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
>>> max_len = max(len(i) for i in v)  # length of the longest sub-list
>>> np.hstack(np.insert(v, range(1, len(v)+1),[[0]*(max_len-len(i)) for i in v])).astype('int32').reshape(len(v), max_len)
array([[ 1, 0, 0, 0],
[ 2, 3, 4, 0],
[ 5, 6, 0, 0],
[ 7, 8, 9, 10],
[11, 12, 0, 0]], dtype=int32)
You can convert to a pandas DataFrame first, and then convert that to a NumPy array:
ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
df = pd.DataFrame(ll)
print(df)
# 0 1 2 3
# 0 1 2 3.0 NaN
# 1 4 5 NaN NaN
# 2 6 7 8.0 9.0
npl = df.to_numpy()
print(npl)
# [[ 1. 2. 3. nan]
# [ 4. 5. nan nan]
# [ 6. 7. 8. 9.]]
I was getting a NumPy broadcast error with Alexander's answer, so I added a small variation using numpy.pad:
pad = len(max(X, key=len))
result = np.array([np.pad(i, (0, pad-len(i)), 'constant') for i in X])
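Applied to the sample sequence from the question (substituting v for X; the output follows directly from the code above):
v = [[1], [1, 2]]
pad = len(max(v, key=len))
np.array([np.pad(i, (0, pad - len(i)), 'constant') for i in v])
# array([[1, 0],
#        [1, 2]])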
If you want to extend the same logic to deeper levels (lists of lists of lists, ...), you can use TensorFlow ragged tensors and convert them to tensors/arrays. For example:
import tensorflow as tf
v = [[1], [1, 2]]
padded_v = tf.ragged.constant(v).to_tensor(0)
This creates an array padded with 0.
Or, for a deeper example:
w = [[[1]], [[2],[1, 2]]]
padded_w = tf.ragged.constant(w).to_tensor(0)

dot product of two 1D vectors in numpy

I'm working with numpy in python to calculate a vector multiplication.
I have a vector x of dimensions n x 1 and I want to calculate x*x_transpose.
This gives me problems because x.T or x.transpose() has no effect on a 1-dimensional array (numpy represents vertical and horizontal vectors the same way).
But how do I calculate a (n x 1) x (1 x n) vector multiplication in numpy?
numpy.dot(x,x.T) gives a scalar, not a 2D matrix as I want.
You are essentially computing an Outer Product.
You can use np.outer.
In [15]: a=[1,2,3]
In [16]: np.outer(a,a)
Out[16]:
array([[1, 2, 3],
[2, 4, 6],
[3, 6, 9]])
While np.outer is the simplest way to do this, I thought I'd mention how you might manipulate the (N,)-shaped array to do it yourself:
In [17]: a = np.arange(4)
In [18]: np.dot(a[:,None], a[None,:])
Out[18]:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
In [19]: np.outer(a,a)
Out[19]:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
Where you could alternatively replace None with np.newaxis.
Another more exotic way to do this is with np.einsum:
In [20]: np.einsum('i,j', a, a)
Out[20]:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
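The timings below also include plain broadcasting, which wasn't spelled out above, so here is a quick sketch of it:
a[:, None] * a  # broadcasts (N, 1) against (N,) to give the (N, N) outer product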
And just for fun, some timings, which are likely going to vary based on hardware and numpy version/compilation:
Small-ish vector
In [36]: a = np.arange(5, dtype=np.float64)
In [37]: %timeit np.outer(a,a)
100000 loops, best of 3: 17.7 µs per loop
In [38]: %timeit np.dot(a[:,None],a[None,:])
100000 loops, best of 3: 11 µs per loop
In [39]: %timeit np.einsum('i,j', a, a)
1 loops, best of 3: 11.9 µs per loop
In [40]: %timeit a[:, None] * a
100000 loops, best of 3: 9.68 µs per loop
And something a little larger
In [42]: a = np.arange(500, dtype=np.float64)
In [43]: %timeit np.outer(a,a)
1000 loops, best of 3: 605 µs per loop
In [44]: %timeit np.dot(a[:,None],a[None,:])
1000 loops, best of 3: 1.29 ms per loop
In [45]: %timeit np.einsum('i,j', a, a)
1000 loops, best of 3: 359 µs per loop
In [46]: %timeit a[:, None] * a
1000 loops, best of 3: 597 µs per loop
If you want an inner product, use numpy.dot(x, x); for an outer product, use numpy.outer(x, x).
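A quick illustration of the difference (x here is the same [1, 2, 3] vector used earlier in this thread):
x = np.array([1, 2, 3])
np.dot(x, x)    # 14 -- a scalar (the inner product)
np.outer(x, x)  # the 3x3 array shown above (the outer product)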
Another alternative is to define the row / column vector with 2-dimensions, e.g.
a = np.array([1, 2, 3], ndmin=2)
np.dot(a.T, a)
array([[1, 2, 3],
[2, 4, 6],
[3, 6, 9]])
Another alternative is to use numpy.matrix
>>> a = np.matrix([1,2,3])
>>> a
matrix([[1, 2, 3]])
>>> a.T * a
matrix([[1, 2, 3],
[2, 4, 6],
[3, 6, 9]])
Generally, use of numpy.array is preferred. However, using numpy.matrix can be more readable for long expressions.

Mask numpy array based on index

How do I mask an array based on the actual index values?
That is, if I have a 10 x 10 x 30 matrix and I want to mask the array when the first and second index equal each other.
For example, [1, 1 , :] should be masked because 1 and 1 equal each other but [1, 2, :] should not because they do not.
I'm only asking this with the third dimension because it resembles my current problem and may complicate things. But my main question is, how to mask arrays based on the value of the indices?
In general, to access the value of the indices, you can use np.meshgrid:
i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
m.mask = (i == j)
The advantage of this method is that it works for arbitrary boolean functions on i, j, and k. It is a bit slower than the use of the identity special case.
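For example, the same pattern handles conditions that mix all three indices (a sketch; the particular condition i + j > k is just an arbitrary illustration, not from the question):
i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
m.mask = (i + j > k)  # mask wherever the first two indices sum to more than the third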
In [56]: %%timeit
....: i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
....: i == j
10000 loops, best of 3: 96.8 µs per loop
As @Jaime points out, meshgrid supports a sparse option, which avoids much of the duplication, but requires a bit more care in some cases because the outputs are not expanded to the full shape. It will save memory and speed things up a little. For example,
In [77]: x = np.arange(5)
In [78]: np.meshgrid(x, x)
Out[78]:
[array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]]),
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2],
[3, 3, 3, 3, 3],
[4, 4, 4, 4, 4]])]
In [79]: np.meshgrid(x, x, sparse=True)
Out[79]:
[array([[0, 1, 2, 3, 4]]),
array([[0],
[1],
[2],
[3],
[4]])]
So, you can use the sparse version as he says, but you must expand the comparison result to the full shape yourself:
i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
m.mask = np.repeat(i==j, k.size, axis=2)
And the speedup:
In [84]: %%timeit
....: i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
....: np.repeat(i==j, k.size, axis=2)
10000 loops, best of 3: 73.9 µs per loop
In your special case of wanting to mask the diagonals, you can use the np.identity() function, which returns ones along the diagonal. Since you have the third dimension, we have to add that third dimension to the identity matrix:
m.mask = np.identity(10)[...,None]*np.ones((1,1,30))
There might be a better way of constructing that array, but it is basically stacking 30 of the np.identity(10) array. For example, this is equivalent:
np.dstack((np.identity(10),)*30)
but slower:
In [30]: timeit np.identity(10)[...,None]*np.ones((1,1,30))
10000 loops, best of 3: 40.7 µs per loop
In [31]: timeit np.dstack((np.identity(10),)*30)
1000 loops, best of 3: 219 µs per loop
And @Ophion's suggestions:
In [33]: timeit np.tile(np.identity(10)[...,None], 30)
10000 loops, best of 3: 63.2 µs per loop
In [71]: timeit np.repeat(np.identity(10)[...,None], 30, axis=2)
10000 loops, best of 3: 45.3 µs per loop
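A memory-lighter alternative (not from the answers above, just a sketch) is to broadcast the 2D identity pattern instead of materialising 30 copies; the broadcast view is read-only, so copy it if the mask must be writable:
diag2d = np.identity(10, dtype=bool)[..., None]         # shape (10, 10, 1)
m.mask = np.broadcast_to(diag2d, (10, 10, 30)).copy()   # expanded to (10, 10, 30)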
