Identify if list has consecutive elements that are equal - python

I'm trying to identify if a large list has consecutive elements that are the same.
So let's say:
lst = [1, 2, 3, 4, 5, 5, 6]
In this case, I would return True, since the two consecutive elements lst[4] and lst[5] have the same value.
I know this could probably be done with some sort of combination of loops, but I was wondering if there were a more efficient way to do this?

You can use itertools.groupby() and a generator expression within any()*:
>>> from itertools import groupby
>>> any(sum(1 for _ in g) > 1 for _, g in groupby(lst))
True
Or, as a more Pythonic way, you can use zip() in order to check whether there are at least two equal consecutive items in your list:
>>> any(i == j for i, j in zip(lst, lst[1:]))  # In Python 2.x, use itertools.izip() to avoid creating a list of all the pairs instead of an iterator
True
Note: the first approach is good when you want to check if there are more than 2 consecutive equal items; otherwise, in this case the second one takes the cake!
* Using sum(1 for _ in g) instead of len(list(g)) is much better in terms of memory use (it does not read the whole group into memory at once), but the latter is slightly faster.
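For the record, here is a minimal sketch of how the groupby() approach generalizes to runs of any minimum length (the helper name and the min_len parameter are my own, not part of the answer above):
from itertools import groupby

def has_run(lst, min_len=2):
    # True if any value repeats at least `min_len` times in a row
    return any(sum(1 for _ in group) >= min_len for _, group in groupby(lst))

has_run([1, 2, 3, 4, 5, 5, 6])           # True
has_run([1, 2, 2, 2, 3], min_len=3)      # True
has_run([1, 2, 2, 3], min_len=3)         # False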

You can use a simple any condition:
lst = [1, 2, 3, 4, 5, 5, 6]
any(lst[i]==lst[i+1] for i in range(len(lst)-1))
#outputs:
True
any() returns True if any of the elements of the iterable are truthy.

If you're looking for an efficient way of doing this and the lists are numerical, you would probably want to use numpy and apply the diff (difference) function:
>>> numpy.diff([1,2,3,4,5,5,6])
array([1, 1, 1, 1, 0, 1])
Then, to get a single result indicating whether there are any equal consecutive elements:
>>> numpy.any(~numpy.diff([1,2,3,4,5,5,6]).astype(bool))
True
This first performs the diff, converts it to boolean and inverts it (so zero differences become True), and then checks whether any of the resulting elements are True.
Similarly,
>>> 0 in numpy.diff([1, 2, 3, 4, 5, 5, 6])
True
also works well and is similar in speed to the np.any approach (credit for this last version to heracho).
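As a minimal sketch (the helper name is my own), the diff-based check can be wrapped up like this, assuming numeric input:
import numpy as np

def has_consecutive_duplicates(seq):
    # True if any two neighbouring elements are equal (zero difference)
    return bool(np.any(np.diff(np.asarray(seq)) == 0))

has_consecutive_duplicates([1, 2, 3, 4, 5, 5, 6])   # True
has_consecutive_duplicates([1, 2, 3])               # False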

Here is a more general numpy one-liner:
number = 7
n_consecutive = 3
arr = np.array([3, 3, 6, 5, 8, 7, 7, 7, 4, 5])
#                              ^  ^  ^
np.any(np.convolve(arr == number, v=np.ones(n_consecutive), mode='valid') == n_consecutive)
This method always searches the whole array, while the approach from @Kasramvd's answer ends as soon as the condition is first met. So which method is faster depends on how sparse the cases of consecutive numbers are.
If you are interested in the positions of the consecutive numbers, and therefore have to look at all elements of the array anyway, this approach should be faster (for larger arrays and/or longer sequences).
idx = np.nonzero(np.convolve(arr == number, v=np.ones(n_consecutive), mode='valid') == n_consecutive)
# idx = i: all(arr[i:i+n_consecutive] == number)
If you are not interested in a specific value but in consecutive equal numbers in general, use a slight variation of @jmetz's answer:
np.any(np.convolve(np.abs(np.diff(arr)), v=np.ones(n_consecutive-1), mode='valid') == 0)
#                  ^^^^^^
# EDIT: see djvg's comment
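To make the two one-liners above concrete, here is how they evaluate for the example array (my own quick check, reusing the names defined above):
import numpy as np

number = 7
n_consecutive = 3
arr = np.array([3, 3, 6, 5, 8, 7, 7, 7, 4, 5])

# windowed count of matches for `number`:
np.convolve(arr == number, v=np.ones(n_consecutive), mode='valid')
# array([0., 0., 0., 1., 2., 3., 2., 1.])

# positions where a full run of `number` starts:
np.nonzero(np.convolve(arr == number, v=np.ones(n_consecutive), mode='valid') == n_consecutive)
# (array([5]),)  ->  arr[5:8] is the run of three 7s

# value-agnostic variant:
np.any(np.convolve(np.abs(np.diff(arr)), v=np.ones(n_consecutive - 1), mode='valid') == 0)
# True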

Starting in Python 3.10, the new itertools.pairwise function provides a way to slide through pairs of consecutive elements, so that we can test the equality of consecutive elements:
from itertools import pairwise
any(x == y for (x, y) in pairwise([1, 2, 3, 4, 5, 5, 6]))
# True
The intermediate result of pairwise:
pairwise([1, 2, 3, 4, 5, 5, 6])
# [(1, 2), (2, 3), (3, 4), (4, 5), (5, 5), (5, 6)]

A simple for loop should do it:
def check(lst):
    last = lst[0]
    for num in lst[1:]:
        if num == last:
            return True
        last = num
    return False

lst = [1, 2, 3, 4, 5, 5, 6]
print(check(lst))  # Prints True
Here, in each loop, I check if the current element is equal to the previous element.

The convolution approach suggested in scleronomic's answer is very promising, especially if you're looking for more than two consecutive elements.
However, the implementation presented in that answer might not be the most efficient, because it consists of two steps: diff() followed by convolve().
Alternative implementation
If we consider that the diff() can also be calculated using convolution, we can combine the two steps into a single convolution.
The following alternative implementation only requires a single convolution of the full signal, which is advantageous if the signal has many elements.
Note that we cannot take the absolute values of the diff (to prevent false positives, as mentioned in this comment), so we add some random noise to the unit kernel instead.
# example signal
signal = numpy.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0])
# minimum number of consecutive elements
n_consecutive = 3
# convolution kernel for weighted moving sum (with small random component)
rng = numpy.random.default_rng()
random_kernel = 1 + 0.01 * rng.random(n_consecutive - 1)
# convolution kernel for first-order difference (similar to numpy.diff)
diff_kernel = [1, -1]
# combine the kernels so we only need to do one convolution with the signal
combined_kernel = numpy.convolve(diff_kernel, random_kernel, mode='full')
# convolve the signal to get the moving weighted sum of differences
moving_sum_of_diffs = numpy.convolve(signal, combined_kernel, mode='valid')
# check if moving sum is zero anywhere
result = numpy.any(moving_sum_of_diffs == 0)
See the DSP guide for a detailed discussion of convolution.
Timing
The difference between the two implementations boils down to this:
def original(signal, unit_kernel):
    return numpy.convolve(numpy.abs(numpy.diff(signal)), unit_kernel, mode='valid')

def alternative(signal, combined_kernel):
    return numpy.convolve(signal, combined_kernel, mode='valid')
where unit_kernel = numpy.ones(n_consecutive - 1) and combined_kernel is defined above.
Comparison of these two functions, using timeit, shows that alternative() can be several times faster, for small kernel sizes (i.e. small value of n_consecutive). However, for large kernel sizes the advantage becomes negligible, because the convolution becomes dominant (compared to the diff).
Notes:
For large kernel sizes I would prefer the original two-step approach, as I think it is easier to understand.
Due to numerical issues it may be necessary to replace numpy.any(moving_sum_of_diffs == 0) by numpy.any(numpy.abs(moving_sum_of_diffs) < very_small_number), see e.g. here.
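For instance, a minimal sketch of such a tolerance check (the tolerance value here is arbitrary; numpy.isclose works just as well):
# tolerant version of the zero test, to guard against floating-point error
tolerance = 1e-9
result = numpy.any(numpy.abs(moving_sum_of_diffs) < tolerance)
# or, equivalently:
result = numpy.any(numpy.isclose(moving_sum_of_diffs, 0.0, atol=tolerance))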

Here is my solution if you want to find out whether 3 consecutive values are equal to 7. For example, with a tuple intList = (7, 7, 7, 8, 9, 1):
def three_consecutive_sevens(intList):
    for i in range(len(intList) - 2):
        if intList[i] == 7 and intList[i + 1] == 7 and intList[i + 2] == 7:
            return True
    return False

Related

Python/Numpy fast way to select every nth chunk in a list

Edited to address the confusion in the problem; thanks for the answers!
My original problem was that I have a list [1,2,3,4,5,6,7,8], and I want to select every chunk of size x with a gap of one element between chunks. So if I want to select every other chunk of size 2, the outcome would be [1,2,4,5,7,8]. A chunk size of three would give me [1,2,3,5,6,7].
I've searched a lot on slicing and I couldn't find a way to select chunks instead of single elements. Making multiple slice operations and then joining and sorting seems a little too expensive. The input can be either a Python list or a numpy ndarray. Thanks in advance.
To me it seems you want to skip one element between chunks, until the end of the input list or array.
Here's one approach based on np.delete that deletes the single elements squeezed between chunks -
out = np.delete(A, np.arange(len(A)//(x+1))*(x+1)+x)
Here's another approach based on boolean-indexing -
L = len(A)
avoid_idx = np.arange(L//(x+1))*(x+1)+x
out = np.array(A)[~np.in1d(np.arange(L),avoid_idx)]
Sample run -
In [98]: A = [51,42,13,34,25,68,667,18,55,32] # Input list
In [99]: x = 2
# Thus, [51,42,13,34,25,68,667,18,55,32]
#              ^        ^          ^      # Skip these
In [100]: np.delete(A, np.arange(len(A)//(x+1))*(x+1)+x)
Out[100]: array([ 51, 42, 34, 25, 667, 18, 32])
In [101]: L = len(A)
     ...: avoid_idx = np.arange(L//(x+1))*(x+1)+x
     ...: out = np.array(A)[~np.in1d(np.arange(L),avoid_idx)]
     ...:
In [102]: out
Out[102]: array([ 51, 42, 34, 25, 667, 18, 32])
First off, you can create an array of indices, then use the np.in1d() function to extract the indices that should be omitted, then with a simple not operator (~) get the indices that must be preserved. Finally, pick them up using simple boolean indexing:
>>> a = np.array([1,2,3,4,5,6,7,8])
>>> range_arr = np.arange(a.size)
>>>
>>> a[~np.in1d(range_arr,range_arr[2::3])]
array([1, 2, 4, 5, 7, 8])
General approach:
>>> range_arr = np.arange(np_array.size)
>>> np_array[~np.in1d(range_arr,range_arr[chunk::chunk+1])]
Using a pure python solution:
This assumes the desired items are: [yes, yes, no, yes, yes, no, ...]
Quicker to code, slower to run:
data = [1, 2, 3, 4, 5, 6, 7, 8]
filtered = [item for i, item in enumerate(data) if i % 3 != 2]
assert filtered == [1, 2, 4, 5, 7, 8]
Slightly slower to write, but faster to run:
from itertools import cycle, compress
data = [1, 2, 3, 4, 5, 6, 7, 8]
selection_criteria = [True, True, False]
filtered = list(compress(data, cycle(selection_criteria)))
assert filtered == [1, 2, 4, 5, 7, 8]
The second example runs in 66% of the time the first example does, and it also makes the selection criteria clearer and easier to change.
A simple list solution
>>> import itertools
>>> ll = [1,2,3,4,5,6,7,8]
>>> list(itertools.chain(*zip(ll[::3],ll[1::3])))
[1, 2, 4, 5, 7, 8]
At least for this case of chunks of size 2, skipping one value between chunks. The number of ll[...] slicings determines the chunk size, and the slicing step determines the chunk spacing.
As I commented there is some ambiguity in the problem description, so I hesitate to generalize this solution more until that is cleared up.
It may be easier to generalize the numpy solutions, but they aren't necessarily faster. Conversion to arrays has a time overhead.
list(itertools.chain(*zip(*[ll[i::6] for i in range(3)])))
produces chunks of length 3, skipping 3 elements.
zip(*) is an idiomatic way of 'transposing' a list of lists
itertools.chain(*...) is an idiomatic way of flattening a list of lists.
Another option is a list comprehension with a condition based on item count
[v for i,v in enumerate(ll) if i%3]
handily skips every 3rd item, the same as your example. A condition of (0 < (i % 6) < 4) keeps 3 and skips 3.
This should do the trick:
step = 3
size = 2
chunks = len(input) // step
input = np.asarray(input)
result = input[:chunks*step].reshape(chunks, step)[:, :size]
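A quick check of this snippet on the example list (my own illustration, with hypothetical variable names): note that the result is 2-D and that a trailing partial chunk is dropped.
import numpy as np

step = 3    # chunk size + gap
size = 2    # chunk size
data = [1, 2, 3, 4, 5, 6, 7, 8]
chunks = len(data) // step                       # 2 full "chunk + gap" groups
arr = np.asarray(data)
result = arr[:chunks * step].reshape(chunks, step)[:, :size]
# result -> array([[1, 2],
#                  [4, 5]])
result.ravel()                                   # array([1, 2, 4, 5]); the partial chunk [7, 8] is lost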
A simple list comprehension can do the job:
[ L[i] for i in range(len(L)) if i%3 != 2 ]
For chunks of size n
[ L[i] for i in range(len(L)) if i%(n+1) != n ]

Most efficient way to implement numpy.in1d for multiple arrays

What is the best way to implement a function which takes an arbitrary number of 1d arrays and returns a tuple containing the indices of the matching values (if any)?
Here is some pseudo-code of what I want to do:
a = np.array([1, 0, 4, 3, 2])
b = np.array([1, 2, 3, 4, 5])
c = np.array([4, 2])
(ind_a, ind_b, ind_c) = return_equals(a, b, c)
# ind_a = [2, 4]
# ind_b = [1, 3]
# ind_c = [0, 1]
(ind_a, ind_b, ind_c) = return_equals(a, b, c, sorted_by=a)
# ind_a = [2, 4]
# ind_b = [3, 1]
# ind_c = [0, 1]
def return_equals(*args, sorted_by=None):
    ...
You can use numpy.intersect1d with reduce for this:
from functools import reduce  # reduce lives in functools in Python 3
import numpy as np

def return_equals(*arrays):
    matched = reduce(np.intersect1d, arrays)
    return np.array([np.where(np.in1d(array, matched))[0] for array in arrays])
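For the example arrays from the question, this gives (my own quick check):
a = np.array([1, 0, 4, 3, 2])
b = np.array([1, 2, 3, 4, 5])
c = np.array([4, 2])
return_equals(a, b, c)
# array([[2, 4],
#        [1, 3],
#        [0, 1]])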
reduce may be a little slow here because we are creating intermediate NumPy arrays (for a large number of inputs it may be very slow); we can prevent this if we use Python's set and its .intersection() method:
matched = np.array(list(set(arrays[0]).intersection(*arrays[1:])))
Related GitHub ticket: n-array versions of set operations, especially intersect1d
This solution basically concatenates all input 1D arrays into one big 1D array, with the intention of performing the required operations in a vectorized manner. The only place where it uses a loop is at the start, where it gets the lengths of the input arrays, which should have minimal runtime cost.
Here's the function implementation -
import numpy as np

def return_equals(*argv):
    # Concatenate input arrays into one big array for vectorized processing
    A = np.concatenate((argv[:]))
    # Lengths of input arrays
    narr = len(argv)
    lens = np.zeros((1, narr), int).ravel()
    for i in range(narr):
        lens[i] = len(argv[i])
    N = A.size
    # Start indices of each group of identical elements from different input arrays
    # in a sorted version of the huge concatenated input array
    start_idx = np.where(np.append([True], np.diff(np.sort(A)) != 0))[0]
    # Run lengths of islands of identical elements
    runlens = np.diff(np.append(start_idx, N))
    # Starting and all indices of the positions in the concatenated array that has
    # islands of identical elements which are present across all input arrays
    good_start_idx = start_idx[runlens == narr]
    good_all_idx = good_start_idx[:, None] + np.arange(narr)
    # Get offsetted indices and sort them to get the desired output
    idx = np.argsort(A)[good_all_idx] - np.append([0], lens[:-1].cumsum())
    return np.sort(idx.T, 1)
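The same sanity check as before (my own), using the arrays from the question, matches the reduce-based version:
a = np.array([1, 0, 4, 3, 2])
b = np.array([1, 2, 3, 4, 5])
c = np.array([4, 2])
return_equals(a, b, c)
# array([[2, 4],
#        [1, 3],
#        [0, 1]])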
In Python:
def return_equal(*args):
    rtr = []
    for i, arr in enumerate(args):
        rtr.append([j for j, e in enumerate(arr) if
                    all(e in a for a in args[0:i]) and
                    all(e in a for a in args[i+1:])])
    return rtr
>>> return_equal(a,b,c)
[[2, 4], [1, 3], [0, 1]]
For a start, I'd try:
def return_equals(*args):
    x = []
    c = args[-1]
    for a in args:
        x.append(np.nonzero(np.in1d(a, c))[0])
    return x
If I add a d=np.array([1,0,4,3,0]) (it has only 1 match; what if there are no matches?)
then
return_equals(a,b,d,c)
produces:
[array([2, 4], dtype=int32),
 array([1, 3], dtype=int32),
 array([2], dtype=int32),
 array([0, 1], dtype=int32)]
Since the length of both the input and the returned arrays can differ, you really can't vectorize the problem. That is, it takes some special gymnastics to perform the operation across all inputs at once. And if the number of arrays is small compared to their typical length, I wouldn't worry about speed. Iterating a few times is not expensive; it's iterating over hundreds of values that gets expensive.
You could, of course, pass the keyword arguments on to in1d.
It's not clear what you are trying to do with the sorted_by parameter. Is that something that you could just as easily apply to the arrays before you pass them to this function?
List comprehension version of this iteration:
[np.nonzero(np.in1d(x,c))[0] for x in [a,b,d,c]]
I can imagine concatenating the arrays into one longer one, applying in1d, and then splitting it up into subarrays. There is a np.split, but it requires that you tell it how many elements to put in each sublist. That means, somehow, determining how many matches there are for each argument. Doing that without looping could be tricky.
The pieces for this (that still need to be packed into a function) are:
args=[a,b,d,c]
lens=[len(x) for x in args]
abc=np.concatenate(args)
C=np.cumsum(lens)
I=np.nonzero(np.in1d(abc,c))[0]
S=np.split(I,(2,4,5))
[S[0],S[1]-C[0],S[2]-C[1],S[3]-C[2]]
I
# array([ 2, 4, 6, 8, 12, 15, 16], dtype=int32)
C
# array([ 5, 10, 15, 17], dtype=int32)
The (2,4,5) are the split points: the cumulative counts of elements of I that fall before successive values of C, i.e. the running totals of matches for a, b, ...
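As a hedged sketch (the function name and the use of np.searchsorted to derive the split points are my own, not from the answer), those pieces could be packed into a function like this:
import numpy as np

def indices_in_ref(*arrays):
    # For each input array, indices of the elements that also occur in the
    # last array (mirroring the snippets above, which match everything against c)
    ref = arrays[-1]
    lens = [len(x) for x in arrays]
    concat = np.concatenate(arrays)
    C = np.cumsum(lens)                          # end offset of each array inside concat
    I = np.nonzero(np.in1d(concat, ref))[0]      # matching positions in the big array
    split_points = np.searchsorted(I, C[:-1])    # cumulative match counts, e.g. (2, 4, 5)
    offsets = np.append(0, C[:-1])
    return [part - off for part, off in zip(np.split(I, split_points), offsets)]

# indices_in_ref(a, b, d, c) reproduces the four arrays shown above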

numpy.subtract but only until difference reaches threshold - replace numbers smaller than that with threshold

I want to subtract a given value from each element in my numpy array.
For example, if I have a numpy array called a_q, and variable called subtract_me, then I can simply do this:
result = np.subtract(a_q,subtract_me)
That's fine. But I don't want it to simply subtract blindly from every element. If the difference is lower than a threshold, then I don't want the subtraction to happen. Instead, I want that element of the array to be replaced by that threshold.
What's the most efficient way to do this? I could simply iterate through the array and subtract from each element and put a check condition on whether the threshold has been reached or not, and build a new array out of the results (as below) - but is there a better or more efficient way to do it?
threshold = 3  # in my real program, the threshold is the
               # lowest non-infinity number that python can handle
subtract_me = 6
a_q = []
for i in range(10):
    val = i - subtract_me
    if val < threshold:
        val = threshold
    a_q.append(val)
myarr = np.array(a_q)
print myarr
Vectorised methods are typically most efficient with NumPy arrays so here's one way which is likely to be more efficient than iterating over an array one element at a time:
>>> threshold = 3
>>> subtract_me = 6
>>> a_q = np.arange(10)
>>> arr = a_q - subtract_me  # take away the subtract_me value
>>> arr
array([-6, -5, -4, -3, -2, -1, 0, 1, 2, 3])
>>> arr[arr < threshold] = threshold  # replace any value less than threshold
>>> arr
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
EDIT: since np.clip was mentioned in the comments below the question, I may as well absorb it into my answer for completeness ;-)
Here's one way you could use it to get the desired result:
>>> np.clip((a_q - subtract_me), threshold, np.max(a_q))
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
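As a side note (my own addition, not part of the answer above): if there is no meaningful upper bound, numpy.maximum gives the same clamping without having to supply one:
>>> np.maximum(a_q - subtract_me, threshold)
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])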

Return tuple of any two items in list, which if summed equal given int

For example, assume a given list of ints:
int_list = list(range(-10,10))
[-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
What is the most efficient way to find if any given two values in int_list sum to equal a given int, say 2?
I was asked this in a technical phone interview this morning on how to efficiently handle this scenario with an int_list of say, 100 million items (I rambled and had no good answer :/).
My first idea was:
from itertools import combinations
int_list = list(range(-10,10))
combo_list = list(combinations(int_list, 2))
desired_int = 4
filtered_tuples = list(filter(lambda x: sum(x) == desired_int, combo_list))
filtered_tuples
[(-5, 9), (-4, 8), (-3, 7), (-2, 6), (-1, 5), (0, 4), (1, 3)]
Which doesn't even work with a range of only range(-10000, 10000)
Also, does anyone know of a good online Python performance testing tool?
For any integer A there is at most one integer B that will sum with it to equal a given integer N. It seems easier to go through the list, do the arithmetic, and do a membership test to see whether B is in the set.
int_list = set(range(-500000, 500000))
TARGET_NUM = 2

def filter_tuples(int_list, target):
    for int_ in int_list:
        other_num = target - int_
        if other_num in int_list:
            yield (int_, other_num)

filtered_tuples = filter_tuples(int_list, TARGET_NUM)
Note that this will duplicate results. E.g. (-2, 4) is a separate response from (4, -2). You can fix this by changing your function:
def filter_tuples(int_list, target):
    for int_ in list(int_list):        # iterate over a copy so the set can be modified
        if int_ not in int_list:       # already consumed as part of an earlier pair
            continue
        other_num = target - int_
        if other_num in int_list:
            int_list.discard(int_)
            int_list.discard(other_num)
            yield (int_, other_num)
EDIT: See my other answer for an even better approach (with caveats).
What is the most efficient way to find if any given two values in int_list sum to equal a given int, say 2?
My first inclination was to do it with the itertools module's combinations and the short-cutting power of any, but it could be quite slower than Adam's approach:
>>> import itertools
>>> int_list = list(range(-10,10))
>>> any(i + j == 2 for i, j in itertools.combinations(int_list, 2))
True
Seems to be fairly responsive for larger ranges:
>>> any(i + j == 2 for i, j in itertools.combinations(xrange(-10000,10000), 2))
True
>>> any(i + j == 2 for i, j in itertools.combinations(xrange(-1000000,1000000), 2))
True
Takes about 10 seconds on my machine:
>>> any(i + j == 2 for i, j in itertools.combinations(xrange(-10000000,10000000), 2))
True
A more literal approach using math:
Assume a given list of ints:
int_list = list(range(-10,10)) ... [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
What is the most efficient way to find if any given two values in int_list sum to equal a given int, say 2? ... how to efficiently handle this scenario with an int_list of say, 100 million items.
From the requirements it's clear that the input can be described by a single parameter, n, for a range of integers of the form range(-n, n), which means every integer from negative n up to but not including positive n. From there, the requirement is simply to decide whether some number, x, is a sum of any two integers in that range.
Any such range can be trivially shown to contain a pair that sum to any number in that range and n-1 beyond it, so it's a waste of computing power to search for it.
def x_is_sum_of_2_diff_numbers_in_range(x, n):
    if isinstance(x, int) and isinstance(n, int):
        return -(n*2) < x < (n - 1)*2
    else:
        raise ValueError('args x and n must be ints')
Computes nearly instantly:
>>> x_is_sum_of_2_diff_numbers_in_range(2, 1000000000000000000000000000)
True
Testing the edge-cases:
def main():
    print x_is_sum_of_2_diff_numbers_in_range(x=5, n=4)   # True
    print x_is_sum_of_2_diff_numbers_in_range(x=6, n=4)   # False
    print x_is_sum_of_2_diff_numbers_in_range(x=-7, n=4)  # True
    print x_is_sum_of_2_diff_numbers_in_range(x=-8, n=4)  # False
EDIT:
Since I can see that a more generalized version of this problem (where the list could contain any given numbers) is a common one, I can see why some people have a preconceived approach to this, but I still stand by my interpretation of this question's requirements, and I consider this answer the best approach for this more specific case.
I would have thought that any solution that depends on a doubly nested iteration over the list (albeit having the inner loop concealed by a nifty Python function) is O(n^2).
It is worth considering sorting the input. For any reasonable comparison-based sort, this will be O(n.lg(n)), which is already better than O(n^2). You might do better with a radix sort or pre-sort (making something like a bucket sort) depending on the range of the input list.
Having sorted the input, it is an O(n) operation to find a pair of numbers that sum to any given number, so your overall complexity is O(n.lg(n)).
In practice, it's an open question whether, for the stipulated “large number” of elements, a brute-force O(n^2) algorithm with nice cache behavior (zipping through arrays in order) would outperform the asymptotically better algorithm that moves a lot of data around, but eventually the one with the lower asymptotic complexity will win.
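For illustration, a minimal sketch of that O(n) scan over a sorted list (the standard two-pointer technique; the function name is mine, and it assumes the pair must come from two distinct positions):
def has_pair_with_sum(sorted_nums, target):
    # walk inward from both ends of the sorted list
    lo, hi = 0, len(sorted_nums) - 1
    while lo < hi:
        pair_sum = sorted_nums[lo] + sorted_nums[hi]
        if pair_sum == target:
            return True
        elif pair_sum < target:
            lo += 1
        else:
            hi -= 1
    return False

has_pair_with_sum(sorted(range(-10, 10)), 2)   # True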

Selecting unique random values from the third column of an array in python

I have a 41000x3 numpy array that I call "sortedlist" in the function below. The third column has a bunch of values, some of which are duplicates, others which are not. I'd like to take a sample of unique values (no duplicates) from the third column, which is sortedlist[:,2]. I think I can do this easily with numpy.random.sample(sortedlist[:,2], sample_size). The problem is I'd like to return, not only those values, but all three columns where, in the last column, there are the randomly chosen values that I get from numpy.random.sample.
EDIT: By unique values I mean I want to choose random values which appear only once. So if I had an array:
array = [[0, 6, 2],
         [5, 3, 9],
         [3, 7, 1],
         [5, 3, 2],
         [3, 1, 1],
         [5, 2, 8]]
And I wanted to choose 4 values of the third column, I want to get something like new_array_1 out:
new_array_1 = [[5, 3, 9],
               [3, 7, 1],
               [5, 3, 2],
               [5, 2, 8]]
But I don't want something like new_array_2, where two values in the 3rd column are the same:
new_array_2 = [[5, 3, 9],
               [3, 7, 1],
               [5, 3, 2],
               [3, 1, 1]]
I have the code to choose random values but without the criterion that they shouldn't be duplicates in the third column.
sample_size = 100
rand_sortedlist = sortedlist[np.random.randint(len(sortedlist), size=sample_size), :]
I'm trying to enforce this criterion by doing something like this
array_index = where( array[:,2] == sample(SelectionWeight, sample_size) )
But I'm not sure if I'm on the right track. Any help would be greatly appreciated!
I can't think of a clever numpythonic way to do this that doesn't involve multiple passes over the data. (Sometimes numpy is so much faster than pure Python that's still the fastest way to go, but it never feels right.)
In pure Python, I'd do something like
import random

def draw_unique(vec, n):
    # group indices by value
    d = {}
    for i, x in enumerate(vec):
        d.setdefault(x, []).append(i)
    # pick n distinct values, then one index for each chosen value
    drawn = [random.choice(d[k]) for k in random.sample(list(d), n)]
    return drawn
which would give
>>> a = np.random.randint(0, 10, (41000, 3))
>>> drawn = draw_unique(a[:,2], 3)
>>> drawn
[4219, 6745, 25670]
>>> a[drawn]
array([[5, 6, 0],
       [8, 8, 1],
       [5, 8, 3]])
I can think of some tricks with np.bincount and scipy.stats.rankdata, but they hurt my head, and there always winds up being one step at the end that I can't see how to vectorize. And if I'm not vectorizing the whole thing, I might as well use the above, which at least is simple.
I believe this will do what you want. Note that the running time will almost certainly be dominated by whatever method you use to generate your random numbers. (An exception is if the dataset is gigantic but you only need a small number of rows, in which case very few random numbers need to be drawn.) So I'm not sure this will run much faster than a pure python method would.
# arrayify your list of lists
# please don't use `array` as a variable name!
a = np.asarray(arry)

# sort the list ... always the first step for efficiency
a2 = a[np.argsort(a[:, 2])]

# identify rows that are duplicates (3rd column value repeats after sorting)
# Note this has length one less than a2
duplicate_rows = np.diff(a2[:, 2]) == 0

# if duplicate_rows[N], then we want to remove rows N and N+1
keep_mask = np.ones(len(a2), dtype=bool)  # all True
keep_mask[:-1][duplicate_rows] = False    # remove row N
keep_mask[1:][duplicate_rows] = False     # remove row N + 1

# now actually slice the array
a3 = a2[keep_mask]

# select rows from a3 using your preferred random number generator
# I actually prefer `random` over numpy.random for sampling w/o replacement
import random
result = a3[random.sample(range(len(a3)), DESIRED_NUMBER_OF_ROWS)]
