Removing duplicate sets of items in a sequence - python

I have a list, for example data = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6], and I need to remove sets of items from it (max. length of a set is k = 3), but only when the sets immediately follow each other. data includes three such cases: [4, 4], [5, 8, 5, 8], and [1, 5, 6, 1, 5, 6], so the cleaned-up list should look like [0, 4, 2, 5, 8, 7, 1, 5, 6].
I tried this code and it works:
import numpy as np

data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])
for k in range(1, 3):
    kth_difference = data[k:] - data[:-k]
    ids = np.where(kth_difference)
    data = data[ids]
But if I change the input list to something like data = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5] (breaking the last set), the new output list is [0, 4, 2, 5, 8, 7, 1, 5], which loses the 6 at the end.
What is a good solution for this task, and how can it be made to work for any k?

You added a numpy tag, so let's use that to our advantage. Start with an array:
data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])
It's easy to make a mask of elements that repeat at each offset up to k = 3:
mask_1 = data[1:] == data[:-1]
mask_2 = data[2:] == data[:-2]
mask_3 = data[3:] == data[:-3]
The first mask has a True at each location where the next element is the same. The second mask has a True wherever an element matches the one two positions ahead, so for a repeated pair you need to find runs of at least two consecutive True values; the same applies to the third mask with runs of three. Filtering the mask also needs to account for partial matches at the end, which you can handle by effectively extending the mask with k - 1 elements:
delta = np.diff(np.r_[False, mask_3, np.ones(2, dtype=bool), False].view(np.int8))
edges = np.flatnonzero(delta).reshape(-1, 2)
lengths = edges[:, 1] - edges[:, 0]
delta[edges[lengths < 3, :]] = 0
mask = delta[:-3].cumsum(dtype=np.int8).view(bool)  # here k = 3
In this arrangement, mask flags the duplicated elements that make up a repeated group of length three. A flagged group may contain fewer than three elements if the repeated portion is truncated at the end, which is what lets you keep all the elements of partial duplicates.
For this exercise, I will assume that you don't have strange overlaps between different levels. I.e., each part of the array that belongs to a repeated segment belongs to exactly one possible repeated segment. Otherwise, the mask processing becomes much more complex.
Here is a function to wrap all this together:
def clean_mask(mask, k):
    delta = np.diff(np.r_[False, mask, np.ones(k - 1, bool), False].view(np.int8))
    edges = np.flatnonzero(delta).reshape(-1, 2)
    lengths = edges[:, 1] - edges[:, 0]
    delta[edges[lengths < k, :]] = 0
    return delta[:-k].cumsum(dtype=np.int8).view(bool)
def dedup(data, kmax):
    data = np.asarray(data)
    kmax = min(kmax, data.size // 2)
    remove = np.zeros(data.shape, dtype=bool)
    for k in range(kmax, 0, -1):
        remove[k:] |= clean_mask(data[k:] == data[:-k], k)
    return data[~remove]
Outputs for the two test cases you show in the question:
>>> dedup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6], 3)
array([0, 4, 2, 5, 8, 7, 1, 5, 6])
>>> dedup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5], 3)
array([0, 4, 2, 5, 8, 7, 1, 5, 6])
Timing
A quick benchmark shows that the numpy solution is also much faster than pure python:
for n in range(2, 7):
    x = np.random.randint(0, 10, 10**n)
    y = list(x)
    %timeit dedup(x, 3)
    %timeit remdup(y)
Results:
# 100 elements
dedup: 332 µs ± 5.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
remdup: 36.9 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 1000 elements
dedup: 412 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
remdup: > 1 minute
Caveats
This solution makes no attempt to cover corner cases. For example: data = [2, 2, 2, 2, 2, 2, 2] or similar, where multiple levels of k can overlap.

Here is an attempt, which is also my very first time using break and for/else:
def remdup(l):
    while True:
        for (i, j) in ((i, j) for i in range(0, len(l)) for j in range(i + 1, len(l) + 1)):
            if l[i:j] == l[j:j + (j - i)]:
                l = l[:j] + l[j + j - i:]
                break  # restart
        else:  # if no duplicate was found
            break  # halt
    return l
print(remdup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6]))
# [0, 4, 2, 5, 8, 7, 1, 5, 6]
How this works:
iterate over all substrings l[i:j];
if l[i:j] is a duplicate of the next substring of the same length, l[j:j+(j-i)]:
remove l[j:j+(j-i)], then
break the iteration and restart it, because the list has changed;
if no duplicate was found, return the list.
I recommend avoiding break and for/else; they're ugly and make for deceptive code.
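If you'd rather not use them, here is a rough equivalent (my sketch, only checked against the example above) that replaces break and for/else with a helper returning the first duplicated span:
def remdup_no_break(l):
    # Helper: return the first (i, j) such that l[i:j] is immediately repeated, or None
    def find_dup(l):
        for i in range(len(l)):
            for j in range(i + 1, len(l) + 1):
                if l[i:j] == l[j:j + (j - i)]:
                    return i, j
        return None

    span = find_dup(l)
    while span is not None:
        i, j = span
        l = l[:j] + l[j + (j - i):]  # drop the second copy
        span = find_dup(l)
    return l

print(remdup_no_break([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6]))
# [0, 4, 2, 5, 8, 7, 1, 5, 6]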


Efficient vectorized version of this numpy for loop

Short intro
I have two paired lists of 2D numpy arrays (see below), paired in the sense that index 0 in arr1 corresponds to index 0 in arr2. For each pair I want to get all combinations of all rows in the two 2D arrays, as answered by Divakar here.
Array example
import numpy as np

arr1 = [
    np.vstack([[1, 6, 3, 9], [8, 5, 6, 7]]),
    np.vstack([[1, 6, 3, 9]]),
    np.vstack([[1, 6, 3, 9], [8, 5, 6, 7], [8, 5, 6, 7]])
]
arr2 = [
    np.vstack([[8, 8, 8, 8]]),
    np.vstack([[8, 8, 8, 8]]),
    np.vstack([[1, 6, 3, 9], [8, 5, 6, 7], [8, 5, 6, 7]])
]
Working code
Note: unlike the linked answer, my number of columns is fixed (always 4), so I replaced the use of shape with the hardcoded value 4 (or 8 in np.zeros).
def merge(a1, a2):
    # From: https://stackoverflow.com/questions/47143712/combination-of-all-rows-in-two-numpy-arrays
    m1 = a1.shape[0]
    m2 = a2.shape[0]
    out = np.zeros((m1, m2, 8), dtype=int)
    out[:, :, :4] = a1[:, None, :]
    out[:, :, 4:] = a2
    out.shape = (m1 * m2, -1)
    return out

total = np.concatenate([merge(arr1[i], arr2[i]) for i in range(len(arr1))])
print(total)
Question
While the above works fine, it looks inefficient to me because it:
involves looping through the arrays,
"appends" (in the list comprehension) to the total array, requiring it to allocate memory each time,
creates multiple zero arrays (in the merge function), whereas I could create a single empty one at the start (related to the point above).
I perform this operation thousands of times on arrays with millions of elements, so any suggestions on how to transform this code into something more efficient?
To be honest, this seems pretty hard to optimize. Each step in the loop has a different size, so there likely isn't any purely vectorized way of doing these things. You can try pre-allocating the memory and writing in place, rather than allocating many pieces and concatenating the results at the end, but I'd bet that doesn't help you much (unless you are under such constrained conditions that you don't have enough RAM to store everything twice, of course).
Feel free to try the following approach on your larger data, but I'd be surprised if you get any significant speedup (or even that you don't get slower results!).
# Use the scalar product to get the final size
result = np.zeros((np.dot([len(x) for x in arr1], [len(x) for x in arr2]), 8), dtype=int)
start = 0
for a1, a2 in zip(arr1, arr2):
    end = start + len(a1) * len(a2)
    result[start:end, :4] = np.repeat(a1, len(a2), axis=0)
    result[start:end, 4:] = np.tile(a2, (len(a1), 1))
    start = end
This is what I wanted to see - the list and the merge results:
In [60]: arr1
Out[60]:
[array([[1, 6, 3, 9],
[8, 5, 6, 7]]),
array([[1, 6, 3, 9]]),
array([[1, 6, 3, 9],
[8, 5, 6, 7],
[8, 5, 6, 7]])]
In [61]: arr2
Out[61]:
[array([[8, 8, 8, 8]]),
array([[8, 8, 8, 8]]),
array([[1, 6, 3, 9],
[8, 5, 6, 7],
[8, 5, 6, 7]])]
In [63]: merge(arr1[0],arr2[0]) # a (2,4) with (1,4) => (2,8)
Out[63]:
array([[1, 6, 3, 9, 8, 8, 8, 8],
[8, 5, 6, 7, 8, 8, 8, 8]])
In [64]: merge(arr1[1],arr2[1]) # a (1,4) with (1,4) => (1,8)
Out[64]: array([[1, 6, 3, 9, 8, 8, 8, 8]])
In [65]: merge(arr1[2],arr2[2]) # a (3,4) with (3,4) => (9,8)
Out[65]:
array([[1, 6, 3, 9, 1, 6, 3, 9],
[1, 6, 3, 9, 8, 5, 6, 7],
[1, 6, 3, 9, 8, 5, 6, 7],
[8, 5, 6, 7, 1, 6, 3, 9],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 1, 6, 3, 9],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 8, 5, 6, 7]])
And total is (12, 8), combining all the "rows".
The list comprehension is, more cleanly stated:
[merge(a,b) for a,b in zip(arr1,arr2)]
The lists, while the same length, have arrays with different numbers of rows, and the merge is also different.
People often ask about building an array iteratively, and we consistently say: collect the results in a list and do one concatenate(-like) construction at the end. The equivalent loop is:
In [70]: alist = []
...: for a,b in zip(arr1,arr2):
...: alist.append(merge(a,b))
This is usually competitive with predefining the total array, and assigning rows. And in your case to get the final shape of total you'd have to iterate through the lists and record the number of rows, etc.
Unless the computation is trivial, the iteration mechanism is a minor part of the total time. I'm pretty sure that here, it's calling merge 3 times that's taking most of the time. For a task like this I wouldn't worry too much about memory use, including the creation of the zeros. You have to, in one way or other use memory for a (12,8) final result. Building that from a (2,8),(1,8), and (9,8) isn't a big issue.
The list comprehension with concatenate and without:
In [72]: timeit total = np.concatenate([merge(a,b) for a,b in zip(arr1,arr2)])
22.4 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [73]: timeit [merge(a,b) for a,b in zip(arr1,arr2)]
16.3 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Calling merge 3 times with any of the pairs takes about the same time.
Oh, one more thing: don't try to 'reuse' the out array across merge calls. When accumulating results in a list like this, reusing arrays is dangerous; each merge call must return its own array, not a "recycled" one.
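A minimal illustration (not from the original answer) of why reuse goes wrong: the list ends up holding several references to the same buffer, so every entry reflects only the last write.
import numpy as np

buf = np.zeros(3, dtype=int)   # one "reused" output buffer
collected = []
for k in range(3):
    buf[:] = k                 # write this iteration's result in place
    collected.append(buf)      # appends a reference, not a copy
print(collected)
# [array([2, 2, 2]), array([2, 2, 2]), array([2, 2, 2])]  <- earlier results are gone

collected = [np.full(3, k) for k in range(3)]   # a fresh array per step, like merge does
print(collected)
# [array([0, 0, 0]), array([1, 1, 1]), array([2, 2, 2])]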

Flip every 2nd pair in a list

What would be the fastest way to break a list of random numbers into sets of two, alternately flipping every pair? For example:
pleatedTuple=(0, 1, 3, 2, 4, 5, 7, 6, 8, 9)
What I want in one operation:
flatPairs=[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
Items will be random single digits; I only made them sequential for readability. I need to do thousands of these per run, so speed is a priority. Python 3.6.4.
Thank you for any ideas, I’m stumped by this one.
Option 1
As long as this is pairs we're talking about, let's try a list comprehension:
flatPairs = [
    [x, y] if i % 2 == 0 else [y, x]
    for i, (x, y) in enumerate(zip(pleatedTuple[::2], pleatedTuple[1::2]))
]
You can also build this from scratch using a loop:
flatPairs = []
for i, (x, y) in enumerate(zip(pleatedTuple[::2], pleatedTuple[1::2])):
    if i % 2 == 0:
        flatPairs.append([x, y])
    else:
        flatPairs.append([y, x])
print(flatPairs)
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
Option 2
Use Ned Batchelder's chunking subroutine chunks and flip every alternate sublist:
# https://stackoverflow.com/a/312464/4909087
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
Call chunks and exhaust the returned generator to get a list of pairs:
flatPairs = list(chunks(pleatedTuple, n=2))
Now, reverse every other pair with a loop.
for i in range(1, len(flatPairs), 2):
    flatPairs[i] = flatPairs[i][::-1]
print(flatPairs)
[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
Note that in this case, the result is a list of tuples.
Performance
(of my answers only)
I'm interested in performance, so I've decided to time my answers:
# Setup
pleatedTuple = tuple(range(100000))
# List comp
21.1 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Loop
20.8 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# chunks
26 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For more performance, you may replace the chunks generator with a more performant alternative:
flatPairs = list(zip(pleatedTuple[::2], pleatedTuple[1::2]))
And then reverse with a loop as required. This brings the time down considerably:
13.1 ms ± 994 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A 2x speedup, phew! Beware though, this isn't nearly as memory efficient as the generator would be...
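Putting the zip-based chunking together with the reversal loop, the whole thing (a sketch; like the chunks version it yields tuples rather than lists) would be:
flatPairs = list(zip(pleatedTuple[::2], pleatedTuple[1::2]))
for i in range(1, len(flatPairs), 2):
    flatPairs[i] = flatPairs[i][::-1]
print(flatPairs)
# [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]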
You can use the standard grouping idiom and zip it with the length:
>>> by_pairs_index = zip(range(len(pleatedTuple)), *[iter(pleatedTuple)]*2)
>>> [[b, a] if i%2 else [a,b] for i,a,b in by_pairs_index]
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
If performance is critical, you may consider other approaches.
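In case the *[iter(pleatedTuple)]*2 part looks mysterious: zip pulls from the same iterator twice per step, which is what groups consecutive elements into pairs. A small illustration (not from the original answer):
pleatedTuple = (0, 1, 3, 2, 4, 5, 7, 6, 8, 9)
it = iter(pleatedTuple)
print(list(zip(it, it)))
# [(0, 1), (3, 2), (4, 5), (7, 6), (8, 9)]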
You can use iter with list slicing:
pleatedTuple=(0, 1, 3, 2, 4, 5, 7, 6, 8, 9)
new_data = [
    list(pleatedTuple[i:i + 2][::-1]) if c % 2 != 0 else list(pleatedTuple[i:i + 2])
    for i, c in zip(range(0, len(pleatedTuple), 2), range(len(pleatedTuple)))
]
Output:
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
Option 1
Using map and reversed and slice assignment.
p = list(map(list, zip(pleatedTuple[::2], pleatedTuple[1::2])))
p[1::2] = map(list, map(reversed, p[1::2]))
p
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
Slight variation
p = list(map(list, zip(pleatedTuple[::2], pleatedTuple[1::2])))
p[1::2] = (x[::-1] for x in p[1::2])
p
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
Option 2
def weird(p):
    return [[p[2 * i + i % 2], p[2 * i + (i + 1) % 2]] for i in range(len(p) // 2)]
weird(pleatedTuple)
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
More generic
def weird(p, k):
    return [list(p[i * k:(i + 1) * k][::(i - 1) % 2 * 2 - 1]) for i in range(len(p) // k)]
weird(pleatedTuple, 2)
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
weird(pleatedTuple * 3, 3)
[[0, 1, 3],
[5, 4, 2],
[7, 6, 8],
[1, 0, 9],
[3, 2, 4],
[6, 7, 5],
[8, 9, 0],
[2, 3, 1],
[4, 5, 7],
[9, 8, 6]]
You can do this in numpy:
>>> pleatedTuple=(0, 1, 3, 2, 4, 5, 7, 6, 8, 9)
>>> pleatedArray = np.array(pleatedTuple)
>>> flat2D = pleatedArray.reshape(5,2)
>>> flat2D[1::2] = np.flip(flat2D[1::2], axis=1)
Of course this is probably going to waste as much time converting between tuples and arrays as it saves doing a tiny loop in numpy instead of Python. (From a quick test, it takes about twice as long as Coldspeed's Option 2 at the example size, and doesn't catch up until you get to much, much longer tuples, and you have a bunch of little tuples, not a few giant ones.)
But if you're concerned with speed, the obvious thing to do is put all of these thousands of pleated tuples into one giant numpy array and do them all at once, and then it will probably be a lot faster. (Still, we're probably talking about saving milliseconds for thousands of these.)
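As a hedged sketch of that idea (assuming all of your pleated tuples have the same length): stack them into one 2D array and flip every second pair of the whole batch with a single slice assignment.
import numpy as np

pleated_tuples = [(0, 1, 3, 2, 4, 5, 7, 6, 8, 9)] * 1000   # hypothetical batch

batch = np.array(pleated_tuples)               # shape (n_tuples, 10)
pairs = batch.reshape(len(batch), -1, 2)       # shape (n_tuples, 5, 2)
pairs[:, 1::2] = pairs[:, 1::2, ::-1].copy()   # flip every 2nd pair across the batch
print(pairs[0].tolist())
# [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]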
What I use is:
pleatedTuple = (0, 1, 3, 2, 4, 5, 7, 6, 8, 9)
for i in range(0, len(pleatedTuple), 2):
    print(pleatedTuple[i:i+2])
Is this what you are looking for?
(0, 1)
(3, 2)
(4, 5)
(7, 6)
(8, 9)

In place insertion into list (or array)

I'm running a script in Python, where I need to insert new numbers into an array (or list) at certain index locations. The problem is that obviously as I insert new numbers, the index locations are invalidated. Is there a clever way to insert the new values at the index locations all at once? Or is the only solution to increment the index number (first value of the pair) as I add?
Sample test code snippets:
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
pairs = [(insertion_indices[i], new_numbers[i]) for i in range(len(insertion_indices))]
for pair in pairs:
    original_list.insert(pair[0], pair[1])
Results in:
[0, 8, 1, 2, 9, 10, 3, 4, 5, 6, 7]
whereas I want:
[0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
Insert the values in backwards order, like so:
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
new = sorted(zip(insertion_indices, new_numbers), reverse=True)
for i, x in new:
    original_list.insert(i, x)
The reason this works is based on the following observation: inserting a value at the beginning of the list offsets the indexes of all later values by 1, while inserting a value at the end leaves the other indexes unchanged. As a consequence, if you start by inserting the value with the largest index (10) and continue "backwards", you never have to update any indexes.
Since the question is tagged NumPy and the input is described as a list/array, you can simply use the built-in numpy.insert -
np.insert(original_list, insertion_indices, new_numbers)
To roll this out as a custom implementation (mostly for performance), we could use a mask, like so -
def insert_numbers(original_list, insertion_indices, new_numbers):
    # Length of output array
    n = len(original_list) + len(insertion_indices)

    # Setup mask array to select between new and old numbers
    mask = np.ones(n, dtype=bool)
    mask[insertion_indices + np.arange(len(insertion_indices))] = 0

    # Setup output array and assign values from the old and new lists/arrays
    # using the mask and its inverse
    out = np.empty(n, dtype=int)
    out[mask] = original_list
    out[~mask] = new_numbers
    return out
For list output, append .tolist().
Sample run -
In [83]: original_list = [0, 1, 2, 3, 4, 5, 6, 7]
...: insertion_indices = [1, 4, 5]
...: new_numbers = [8, 9, 10]
...:
In [85]: np.insert(original_list, insertion_indices, new_numbers)
Out[85]: array([ 0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7])
In [86]: np.insert(original_list, insertion_indices, new_numbers).tolist()
Out[86]: [0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
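To see why the custom insert_numbers offsets the indices with np.arange: the i-th inserted value ends up i positions further right than its original index says, because i new values have already been placed before it (my illustration, not part of the original answer):
import numpy as np

insertion_indices = np.array([1, 4, 5])
print(insertion_indices + np.arange(len(insertion_indices)))
# [1 5 7] -> positions of 8, 9 and 10 in [0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]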
Runtime test on a 10000x scaled dataset -
In [184]: original_list = range(70000)
...: insertion_indices = np.sort(np.random.choice(len(original_list), 30000, replace=0)).tolist()
...: new_numbers = np.random.randint(0,10, len(insertion_indices)).tolist()
...: out1 = np.insert(original_list, insertion_indices, new_numbers)
...: out2 = insert_numbers(original_list, insertion_indices, new_numbers)
...: print np.allclose(out1, out2)
True
In [185]: %timeit np.insert(original_list, insertion_indices, new_numbers)
100 loops, best of 3: 5.37 ms per loop
In [186]: %timeit insert_numbers(original_list, insertion_indices, new_numbers)
100 loops, best of 3: 4.8 ms per loop
Let's test out with arrays as inputs -
In [190]: original_list = np.arange(70000)
...: insertion_indices = np.sort(np.random.choice(len(original_list), 30000, replace=0))
...: new_numbers = np.random.randint(0,10, len(insertion_indices))
...: out1 = np.insert(original_list, insertion_indices, new_numbers)
...: out2 = insert_numbers(original_list, insertion_indices, new_numbers)
...: print np.allclose(out1, out2)
True
In [191]: %timeit np.insert(original_list, insertion_indices, new_numbers)
1000 loops, best of 3: 1.48 ms per loop
In [192]: %timeit insert_numbers(original_list, insertion_indices, new_numbers)
1000 loops, best of 3: 1.07 ms per loop
The performance improves further here because there's no overhead from converting back to a list.
Add this before your for loop:
for i in range(len(insertion_indices)):
    insertion_indices[i] += i
Offset each insertion index by the number of items already inserted before it:
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
for i in range(len(insertion_indices)):
    original_list.insert(insertion_indices[i] + i, new_numbers[i])
print(original_list)
Output
[0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
#Required list
[0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
Less elegant, but it works too: use a numpy ndarray and increment the stored indices after each insert:
import numpy as np

original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]

pairs = np.array([[insertion_indices[i], new_numbers[i]] for i in range(len(insertion_indices))])
for pair in pairs:
    original_list.insert(pair[0], pair[1])
    pairs[:, 0] += 1

Generate 1D NumPy array of concatenated ranges

I want to generate the following array a:
nv = np.random.randint(3, 10+1, size=(1000000,))
a = np.concatenate([np.arange(1,i+1) for i in nv])
Thus, the output would be something like:
[1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, ...]
Is there a better way to do it?
Here's a vectorized approach using cumulative summation -
def ranges(nv, start=1):
    shifts = nv.cumsum()
    id_arr = np.ones(shifts[-1], dtype=int)
    id_arr[shifts[:-1]] = -nv[:-1] + 1
    id_arr[0] = start  # Skip if we know the start of ranges is 1 already
    return id_arr.cumsum()
Sample runs -
In [23]: nv
Out[23]: array([3, 2, 5, 7])
In [24]: ranges(nv, start=0)
Out[24]: array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6])
In [25]: ranges(nv, start=1)
Out[25]: array([1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7])
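To make the cumulative-summation trick more concrete, here is a walkthrough of the intermediates for the sample nv above (my annotation, not part of the original answer):
import numpy as np

nv = np.array([3, 2, 5, 7])
shifts = nv.cumsum()                   # [ 3  5 10 17] -> where each range ends
id_arr = np.ones(shifts[-1], dtype=int)
id_arr[shifts[:-1]] = -nv[:-1] + 1     # drop back to 1 at every range boundary
id_arr[0] = 1                          # start value of the first range
print(id_arr)
# [ 1  1  1 -2  1 -1  1  1  1  1 -4  1  1  1  1  1  1]
print(id_arr.cumsum())
# [1 2 3 1 2 1 2 3 4 5 1 2 3 4 5 6 7]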
Runtime test -
In [62]: nv = np.random.randint(3, 10+1, size=(100000,))
In [63]: %timeit your_func(nv) # #MSeifert's solution
10 loops, best of 3: 129 ms per loop
In [64]: %timeit ranges(nv)
100 loops, best of 3: 5.54 ms per loop
Instead of doing this with numpy methods you could use normal python ranges and just convert the result to an array:
from itertools import chain
import numpy as np

def your_func(nv):
    ranges = (range(1, i + 1) for i in nv)
    flattened = list(chain.from_iterable(ranges))
    return np.array(flattened)
This doesn't rely on hard-to-understand numpy slicing and index tricks. A sample case:
import random
>>> nv = [random.randint(1, 10) for _ in range(5)]
>>> print(nv)
[4, 2, 10, 5, 3]
>>> print(your_func(nv))
[ 1 2 3 4 1 2 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 1 2 3]
Why two steps?
a = np.concatenate([np.arange(0,np.random.randint(3,11)) for i in range(1000000)])

vectorize numpy unique for subarrays

I have a numpy array data of shape (N, 20, 20) with N being some very large number.
I want to get the number of unique values in each of the 20x20 sub-arrays.
With a loop that would be:
values = []
for i in data:
    values.append(len(np.unique(i)))
How could I vectorize this loop? Speed is a concern.
If I try np.unique(data) I get the unique values for the whole data array not the individual 20x20 blocks, so that's not what I need.
First, you can work with data.reshape(N, -1), since you only care about the values in the last two dimensions, not their layout.
An easy way to get the number of unique values for each row is to dump each row into a set and let it handle the duplicates:
[len(set(i)) for i in data.reshape(data.shape[0], -1)]
But this is an iteration, though probably a fast one.
A problem with 'vectorizing' is that the set or list of unique values in each row will differ in length. 'rows with differing length' is a red flag when it comes to 'vectorizing'. You no longer have the 'rectangular' data layout that makes most vectorizing possible.
You could sort each row:
np.sort(data.reshape(N,-1))
array([[1, 2, 2, 3, 3, 5, 5, 5, 6, 6],
[1, 1, 1, 2, 2, 2, 3, 3, 5, 7],
[0, 0, 2, 3, 4, 4, 4, 5, 5, 9],
[2, 2, 3, 3, 4, 4, 5, 7, 8, 9],
[0, 2, 2, 2, 2, 5, 5, 5, 7, 9]])
But how do you identify the unique values in each row without iterating? Counting the number of nonzero differences might just do the trick:
In [530]: data=np.random.randint(10,size=(5,10))
In [531]: [len(set(i)) for i in data.reshape(data.shape[0],-1)]
Out[531]: [7, 6, 6, 8, 6]
In [532]: sdata=np.sort(data,axis=1)
In [533]: (np.diff(sdata)>0).sum(axis=1)+1
Out[533]: array([7, 6, 6, 8, 6])
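Wrapped up as a function (my sketch, not part of the original transcript), the sort-and-count-differences idea for the (N, 20, 20) case looks like:
import numpy as np

def count_unique_per_block(data):
    flat = data.reshape(data.shape[0], -1)             # one row per 20x20 block
    sflat = np.sort(flat, axis=1)                      # sort within each row
    return (np.diff(sflat, axis=1) > 0).sum(axis=1) + 1

data = np.random.randint(10, size=(5, 20, 20))
print(count_unique_per_block(data))   # e.g. [10 10 10 10 10] for this value range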
I was going to add a warning about floats, but if np.unique is working for your data, my approach should work just as well.
[(np.bincount(i)>0).sum() for i in data]
This is an iterative solution that is clearly faster than my len(set(i)) version, and is competitive with the diff...sort.
In [585]: data.shape
Out[585]: (10000, 400)
In [586]: timeit [(np.bincount(i)>0).sum() for i in data]
1 loops, best of 3: 248 ms per loop
In [587]: %%timeit
sdata=np.sort(data,axis=1)
(np.diff(sdata)>0).sum(axis=1)+1
.....:
1 loops, best of 3: 280 ms per loop
I just found a faster way to use bincount: np.count_nonzero
In [715]: timeit np.array([np.count_nonzero(np.bincount(i)) for i in data])
10 loops, best of 3: 59.6 ms per loop
I was surprised at the speed improvement. But then I recalled that count_nonzero is used in other functions (e.g. np.nonzero) to allocate space for their return results. So it makes sense that this function would be coded for maximum speed. (It doesn't help in the diff...sort case because it does not take an axis parameter).
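Keep in mind that bincount only accepts non-negative integers. Wrapped as a function (again my sketch, not from the original answer), the count_nonzero variant would be:
import numpy as np

def count_unique_per_block_bincount(data):
    flat = data.reshape(data.shape[0], -1)   # requires non-negative integer data
    return np.array([np.count_nonzero(np.bincount(row)) for row in flat])

data = np.random.randint(10, size=(5, 20, 20))
print(count_unique_per_block_bincount(data))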
