Efficient vectorized version of this numpy for loop - python

Short intro
I have two paired lists of 2D numpy arrays (see below) - paired in the sense that index 0 in array1 corresponds to index 0 in array2. For each of the pairs I want to get all the combinations of all rows in the 2D numpy arrays, like answered by Divakar here.
Array example
arr1 = [
np.vstack([[1,6,3,9], [8,5,6,7]]),
np.vstack([[1,6,3,9]]),
np.vstack([[1,6,3,9], [8,5,6,7],[8,5,6,7]])
]
arr2 = [
np.vstack([[8,8,8,8]]),
np.vstack([[8,8,8,8]]),
np.vstack([[1,6,3,9], [8,5,6,7],[8,5,6,7]])
]
Working code
Note, unlike the linked answer my columns are fixed (always 4) hence I replaced using shape by the hardcode value 4 (or 8 in np.zeros).
def merge(a1, a2):
# From: https://stackoverflow.com/questions/47143712/combination-of-all-rows-in-two-numpy-arrays
m1 = a1.shape[0]
m2 = a2.shape[0]
out = np.zeros((m1, m2, 8), dtype=int)
out[:, :, :4] = a1[:, None, :]
out[:, :, 4:] = a2
out.shape = (m1 * m2, -1)
return out
total = np.concatenate([merge(arr1[i], arr2[i]) for i in range(len(arr1))])
print(total)
Question
While the above works fine, it looks inefficient to me as it:
involves looping through the arrays
"appends" (in list list comprehsion) to the total array, requiring it to allocate memory each time
creates multiple zero arrays (in the merge function), whereas I could create an empty one at the start? related to the point above
I perform this operation thousands of times on arrays with millions of elements, so any suggestions on how to transform this code into something more efficient?

To be honest, this seems pretty hard to optimize. Each step in the loop has a different size, so likely there isn't any purely vectorized way of doing these things. You can try pre-allocating the memory and writing in place, rather than allocating many pieces and finally concatenating the results, but I'd bet that doesn't help you much (unless you are under such restrained conditions that you don't have enough RAM to store everything twice, of course).
Feel free to try the following approach on your larger data, but I'd be surprised if you get any significant speedup (or even that you don't get slower results!).
# Use scalar product to get the final size
result = np.zeros((np.dot([len(x) for x in arr1], [len(x) for x in arr2]), 8), dtype=int)
start = 0
for a1, a2 in zip(arr1, arr2):
end = start + len(a1) * len(a2)
result[start:end, :4] = np.repeat(a1, len(a2), axis=0)
result[start:end, 4:] = np.tile(a2, (len(a1), 1))
start = end

This is what I wanted to see - the list and the merge results:
In [60]: arr1
Out[60]:
[array([[1, 6, 3, 9],
[8, 5, 6, 7]]),
array([[1, 6, 3, 9]]),
array([[1, 6, 3, 9],
[8, 5, 6, 7],
[8, 5, 6, 7]])]
In [61]: arr2
Out[61]:
[array([[8, 8, 8, 8]]),
array([[8, 8, 8, 8]]),
array([[1, 6, 3, 9],
[8, 5, 6, 7],
[8, 5, 6, 7]])]
In [63]: merge(arr1[0],arr2[0]) # a (2,4) with (1,4) => (2,8)
Out[63]:
array([[1, 6, 3, 9, 8, 8, 8, 8],
[8, 5, 6, 7, 8, 8, 8, 8]])
In [64]: merge(arr1[1],arr2[1]) # a (1,4) with (1,4) => (1,8)
Out[64]: array([[1, 6, 3, 9, 8, 8, 8, 8]])
In [65]: merge(arr1[2],arr2[2]) # a (3,4) with (3,4) => (9,8)
Out[65]:
array([[1, 6, 3, 9, 1, 6, 3, 9],
[1, 6, 3, 9, 8, 5, 6, 7],
[1, 6, 3, 9, 8, 5, 6, 7],
[8, 5, 6, 7, 1, 6, 3, 9],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 1, 6, 3, 9],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 8, 5, 6, 7]])
And total is (12,8), combing all "rows".
The list comprehension is, more cleanly stated:
[merge(a,b) for a,b in zip(arr1,arr2)]
The lists, while the same length, have arrays with different numbers of rows, and the merge is also different.
People often ask about making an array iteratively, and we consistently say, collect the results in a list, and do one concatenate (like) construction at the end. The equivalent loop is:
In [70]: alist = []
...: for a,b in zip(arr1,arr2):
...: alist.append(merge(a,b))
This is usually competitive with predefining the total array, and assigning rows. And in your case to get the final shape of total you'd have to iterate through the lists and record the number of rows, etc.
Unless the computation is trivial, the iteration mechanism is a minor part of the total time. I'm pretty sure that here, it's calling merge 3 times that's taking most of the time. For a task like this I wouldn't worry too much about memory use, including the creation of the zeros. You have to, in one way or other use memory for a (12,8) final result. Building that from a (2,8),(1,8), and (9,8) isn't a big issue.
The list comprehension with concatenate and without:
In [72]: timeit total = np.concatenate([merge(a,b) for a,b in zip(arr1,arr2)])
22.4 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [73]: timeit [merge(a,b) for a,b in zip(arr1,arr2)]
16.3 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Calling merge 3 times with any of the pairs takes about the same time.
Oh, another thing, don't try to 'reuse' the out array across merge calls. When accumulating results like this in a list, reuse of the arrays is dangerous. Each merge call must return its own array, not a "recycled" one.

Related

Removing duplicate sets of items in a sequence

I have a list, for example data = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6], and I need to remove sets of items from it (max. lenght of set k = 3), but only when the sets follow each other. data includes three such cases: [4, 4], [5, 8, 5, 8], and [1, 5, 6, 1, 5, 6], so the cleaned up list should look like [0, 4, 2, 5, 8, 7, 1, 5, 6].
I tried this code and it works:
data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])
for k in range(1, 3):
kth_difference = data[k:] - data[:-k]
ids = np.where(kth_difference)
data = data[ids]
But if I change input list to something like data = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5] (broke the last set), the new output list is [0, 4, 2, 5, 8, 7, 1, 5], which lost 6 at the end.
What a solution for this task? How to make this solution workable for any k?
You added a numpy tag, so let's use that to our advantage. Start with an array:
data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])
It's easy to make a mask of elements up to length n that follow each other:
mask_1 = data[1:] == data[:-1]
mask_2 = data[2:] == data[:-2]
mask_3 = data[3:] == data[:-3]
The first mask has ones at each location where the next element is the same. The second mask will have a one wherever an element is the same as something two elements ahead, so you need to find runs of 2 elements at a time. The same applies to the third mask. Filtering of the mask needs to take into account that you want to include the possibility of partial matches at the end. You can effectively extend the mask with k-1 elements to accomplish this:
delta = np.diff(np.r_[False, mask_3, np.ones(2, dtype=bool), False].view(np.int8))
edges = np.flatnonzero(delta).reshape(-1, 2)
lengths = edges[:, 1] - edges[:, 0]
delta[edges[lengths < 3, :]] = 0
mask = delta[-k:].cumsum(dtype=np.int8).view(bool)
In this arrangement, mask masks the duplicated three elements that constitute a duplicated group. It may contain fewer than three elements if the replicated portion is truncated. That ensures that you get to keep all the elements of partial duplicates.
For this exercise, I will assume that you don't have strange overlaps between different levels. I.e., each part of the array that belongs to a repeated segment belongs to exactly one possible repeated segment. Otherwise, the mask processing becomes much more complex.
Here is a function to wrap all this together:
def clean_mask(mask, k):
delta = np.diff(np.r_[False, mask, np.ones(k - 1, bool), False].view(np.int8))
edges = np.flatnonzero(delta).reshape(-1, 2)
lengths = edges[:, 1] - edges[:, 0]
delta[edges[lengths < k, :]] = 0
return delta[:-k].cumsum(dtype=np.int8).view(bool)
def dedup(data, kmax):
data = np.asarray(data)
kmax = min(kmax, data.size // 2)
remove = np.zeros(data.shape, dtype=np.bool)
for k in range(kmax, 0, -1):
remove[k:] |= clean_mask(data[k:] == data[:-k], k)
return data[~remove]
Outputs for the two test cases you show in the question:
>>> dedup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6], 3)
array([0, 4, 2, 5, 8, 7, 1, 5, 6])
>>> dedup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5], 3)
array([0, 4, 2, 5, 8, 7, 1, 5, 6])
Timing
A quick benchmark shows that the numpy solution is also much faster than pure python:
for n in range(2, 7):
x = np.random.randint(0, 10, 10**n)
y = list(x)
%timeit dedup(x, 3)
%timeit remdup(y)
Results:
# 100 elements
dedup: 332 µs ± 5.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
remdup: 36.9 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 1000 elements
dedup: 412 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
remdup: > 1 minute
Caveats
This solution makes no attempt to cover corner cases. For example: data = [2, 2, 2, 2, 2, 2, 2] or similar, where multiple levels of k can overlap.
Here is an attempt, which is also my very first time using break and for/else:
def remdup(l):
while True:
for (i,j) in ((i,j) for i in range(0,len(l)) for j in range(i+1, len(l)+1)):
if l[i:j] == l[j:j+(j-i)]:
l = l[:j] + l[j+j-i:]
break # restart
else: # if no duplicate was found
break # halt
return l
print(remdup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6]))
# [0, 4, 2, 5, 8, 7, 1, 5, 6]
How this works:
iterate on all substrings l[i:j]:
if l[i:j] is duplicate with the next substring of same length, l[j,j+j-i]:
remove l[j,j+j-i]
break the iteration and restart it because the list has changed
if no duplicate was found, return the list
I recommend avoiding using break and for/else. They're ugly and make deceptive code.

Accessing chunks at once in a numpy array

Provided a numpy array:
arr = np.array([0,1,2,3,4,5,6,7,8,9,10,11,12])
I wonder how access chosen size chunks with chosen separation, both concatenated and in slices:
E.g.: obtain chunks of size 3 separated by two values:
arr_chunk_3_sep_2 = np.array([0,1,2,5,6,7,10,11,12])
arr_chunk_3_sep_2_in_slices = np.array([[0,1,2],[5,6,7],[10,11,12])
Wha is the most efficient way to do it? If possible, I would like to avoid copying or creating new objects as much as possible. Maybe Memoryviews could be of help here?
Approach #1
Here's one with masking -
def slice_grps(a, chunk, sep):
N = chunk + sep
return a[np.arange(len(a))%N < chunk]
Sample run -
In [223]: arr
Out[223]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
In [224]: slice_grps(arr, chunk=3, sep=2)
Out[224]: array([ 0, 1, 2, 5, 6, 7, 10, 11, 12])
Approach #2
If the input array is such that the last chunk would have enough runway, we could , we could leverage np.lib.stride_tricks.as_strided, inspired by this post to select m elements off each block of n elements -
# https://stackoverflow.com/a/51640641/ #Divakar
def skipped_view(a, m, n):
s = a.strides[0]
strided = np.lib.stride_tricks.as_strided
shp = ((a.size+n-1)//n,n)
return strided(a,shape=shp,strides=(n*s,s), writeable=False)[:,:m]
out = skipped_view(arr,chunk,chunk+sep)
Note that the output would be a view into the input array and as such no extra memory overhead and virtually free!
Sample run to make things clear -
In [255]: arr
Out[255]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
In [256]: chunk = 3
In [257]: sep = 2
In [258]: skipped_view(arr,chunk,chunk+sep)
Out[258]:
array([[ 0, 1, 2],
[ 5, 6, 7],
[10, 11, 12]])
# Let's prove that the output is a view indeed
In [259]: np.shares_memory(arr, skipped_view(arr,chunk,chunk+sep))
Out[259]: True
How about a reshape and slice?
In [444]: arr = np.array([0,1,2,3,4,5,6,7,8,9,10,11,12])
In [445]: arr.reshape(-1,5)
...
ValueError: cannot reshape array of size 13 into shape (5)
Ah a problem - your array isn't big enough for this reshape - so we have to pad it:
In [446]: np.concatenate((arr,np.zeros(2,int))).reshape(-1,5)
Out[446]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 0, 0]])
In [447]: np.concatenate((arr,np.zeros(2,int))).reshape(-1,5)[:,:-2]
Out[447]:
array([[ 0, 1, 2],
[ 5, 6, 7],
[10, 11, 12]])
as_strided can get a way with this by including bytes outside the databuffer. Usually that's seen as a bug, though here it can be an asset - provided you really do throw that garbage away.
Or throwing away the last incomplete line:
In [452]: arr[:-3].reshape(-1,5)[:,:3]
Out[452]:
array([[0, 1, 2],
[5, 6, 7]])

Sort matrix based on its diagonal entries

First of all I would like to point out that my question is different than this one: Sort a numpy matrix based on its diagonal
The question is as follow:
Suppose I have a numpy matrix
A=
5 7 8
7 2 9
8 9 3
I would like to sort the matrix based on its diagonal and then re-arrange the matrix element based on it. Such that now
sorted_A:
2 9 7
9 3 8
7 8 5
Note that:
(1). The diagonal is sorted
(2). The other elements (non-diagonal) re-adjusted by it. How?
because diag(A)= [5,2,3] & diag(sorted_A)=[2,3,5]
so row/column indices A=[0,1,2] become [1,2,0] in sorted_A.
So far I use brute force where I extract the diagonal elements, get the indices O(N²) and then re-arrange the matrix (another O(N²)). I wonder if there is any efficient/elegant way to do this. I appreciate all the help I can get.
Sorting the rows based on the diagonal values is easy:
In [192]: A=np.array([[5,7,8],[7,2,9],[8,9,3]])
In [193]: A
Out[193]:
array([[5, 7, 8],
[7, 2, 9],
[8, 9, 3]])
In [194]: np.diag(A)
Out[194]: array([5, 2, 3])
In [195]: idx=np.argsort(np.diag(A))
In [196]: idx
Out[196]: array([1, 2, 0], dtype=int32)
In [197]: A[idx,:]
Out[197]:
array([[7, 2, 9],
[8, 9, 3],
[5, 7, 8]])
Rearranging the elements in each row to the original diagonals are back on the diagonal will take some experimenting - trial and error. We probably have to 'roll' each row based on some value related to the sorting idx. I don't recall if there is a function to roll each row separately or if we have to iterate over the rows to do that.
In [218]: A1=A[idx,:]
In [219]: [np.roll(a,-i) for a,i in zip(A1,[1,1,1])]
Out[219]: [array([2, 9, 7]), array([9, 3, 8]), array([7, 8, 5])]
In [220]: np.array([np.roll(a,-i) for a,i in zip(A1,[1,1,1])])
Out[220]:
array([[2, 9, 7],
[9, 3, 8],
[7, 8, 5]])
So roll with [1,1,1] does the job. But off hand I don't see how that can be derived. I suspect we need to generate several more test cases, possibly larger ones, and look for a pattern.
That roll probably has something to do with how much the row has moved, the difference between the original position and the new one. Let's try:
np.arange(3)-idx
In [222]: np.array([np.roll(a,i) for a,i in zip(A1,np.arange(3)-idx)])
Out[222]:
array([[2, 9, 7],
[9, 3, 8],
[7, 8, 5]])
Applying the sorting idx to both rows and columns seems to do the trick as well:
In [227]: A[idx,:][:,idx]
Out[227]:
array([[2, 9, 7],
[9, 3, 8],
[7, 8, 5]])
In [229]: A[idx[:,None],idx]
Out[229]:
array([[2, 9, 7],
[9, 3, 8],
[7, 8, 5]])
Here I simplify a straightforward solution that has been stated before but is hard to get your heads around.
This is useful if you want to sort a table (e.g. confusion matrix by its diagonal magnitude and arrange rows and columns accordingly.
>>> A=np.array([[5,1,4],[7,2,9],[8,0,3]])
>>> A
array([[5, 1, 4],
[7, 2, 9],
[8, 0, 3]])
>>> diag = np.diag(A)
>>> diag
array([5, 2, 3])
>>> idx=np.argsort(diag) # get the order of items that are in diagon
>>> A[idx,:][:,idx] # reorder rows and arrows based on the order of items on diagon
array([[2, 9, 7],
[0, 3, 8],
[1, 4, 5]])
if you want to sort in descending order just add idx = idx[::-1] # reverse order

vectorize numpy unique for subarrays

I have a numpy array data of shape (N, 20, 20) with N being some very large number.
I want to get the number of unique values in each of the 20x20 sub-arrays.
with a loop that would be:
values = []
for i in data:
values.append(len(np.unique(i)))
How could I vectorize this loop? speed is a concern.
If I try np.unique(data) I get the unique values for the whole data array not the individual 20x20 blocks, so that's not what I need.
First, you can work with data.reshape(N,-1), since you are interested in sorting the last 2 dimensions.
An easy way to get the number of unique values for each row is to dump each row into a set and let it do the sorting:
[len(set(i)) for i in data.reshape(data.shape[0],-1)]
But this is an iteration, through probably a fast one.
A problem with 'vectorizing' is that the set or list of unique values in each row will differ in length. 'rows with differing length' is a red flag when it comes to 'vectorizing'. You no longer have the 'rectangular' data layout that makes most vectorizing possible.
You could sort each row:
np.sort(data.reshape(N,-1))
array([[1, 2, 2, 3, 3, 5, 5, 5, 6, 6],
[1, 1, 1, 2, 2, 2, 3, 3, 5, 7],
[0, 0, 2, 3, 4, 4, 4, 5, 5, 9],
[2, 2, 3, 3, 4, 4, 5, 7, 8, 9],
[0, 2, 2, 2, 2, 5, 5, 5, 7, 9]])
But how do you identify the unique values in each row without iterating? Counting the number of nonzero differences might just do the trick:
In [530]: data=np.random.randint(10,size=(5,10))
In [531]: [len(set(i)) for i in data.reshape(data.shape[0],-1)]
Out[531]: [7, 6, 6, 8, 6]
In [532]: sdata=np.sort(data,axis=1)
In [533]: (np.diff(sdata)>0).sum(axis=1)+1
Out[533]: array([7, 6, 6, 8, 6])
I was going to add a warning about floats, but if np.unique is working for your data, my approach should work just as well.
[(np.bincount(i)>0).sum() for i in data]
This is an iterative solution that is clearly faster than my len(set(i)) version, and is competitive with the diff...sort.
In [585]: data.shape
Out[585]: (10000, 400)
In [586]: timeit [(np.bincount(i)>0).sum() for i in data]
1 loops, best of 3: 248 ms per loop
In [587]: %%timeit
sdata=np.sort(data,axis=1)
(np.diff(sdata)>0).sum(axis=1)+1
.....:
1 loops, best of 3: 280 ms per loop
I just found a faster way to use bincount, np.count_nonzero
In [715]: timeit np.array([np.count_nonzero(np.bincount(i)) for i in data])
10 loops, best of 3: 59.6 ms per loop
I was surprised at the speed improvement. But then I recalled that count_nonzero is used in other functions (e.g. np.nonzero) to allocate space for their return results. So it makes sense that this function would be coded for maximum speed. (It doesn't help in the diff...sort case because it does not take an axis parameter).

Conditional index in 2d array in python

I have a 2D array, g, like so:
np.array([
[1 2 3 4],
[5 6 7 8],
[9 10 11 12]
])
So g[0] returns the first row, in other words when I give an index of 0, I get the first row. When I use an index of 1, I get the second row:
g[1] = [5 6 7 8]
and so on.
But I want to return all rows where the index of g is NOT a certain value.
Eg. I want to return g[x] for all x where x != 1.
I know how to use conditional indexing with 1D arrays, but what about 2D arrays? I'm confused here because I'm not putting conditions on what indices to retrieve according to the values, but I need a condition dependent on the indices themselves.
You could use np.arange(len(g)) != 1 to create a boolean index:
In [137]: g
Out[137]:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
In [138]: g[np.arange(len(g)) != 1]
Out[138]:
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
If you really want to eliminate just one row, you could, alternatively, use np.concatenate to join two basic slices:
In [143]: np.concatenate([g[:1], g[2:]])
Out[143]:
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
For large arrays, the first method appears to be faster, however:
In [150]: g2 = np.tile(g, (10000,1))
In [153]: %timeit g2[np.arange(len(g)) != 1]
100000 loops, best of 3: 6.9 µs per loop
In [152]: %timeit np.concatenate([g2[:1], g2[2:]])
10000 loops, best of 3: 51.8 µs per loop
unutbu's answer works, but I find placing the computation in the indices... icky. :/
I would do something like this:
rowsidontwant = [1, 3]
listofrows = [ g[i] for i in filter(lambda x: not in rowsidontwant, xrange(len(g))) ]
It's a a little more... general. The list of rows may not be what you want, but you can put the data in whatever form you like after that.

Categories

Resources