vectorize numpy unique for subarrays - python

I have a numpy array data of shape (N, 20, 20) with N being some very large number.
I want to get the number of unique values in each of the 20x20 sub-arrays.
with a loop that would be:
values = []
for i in data:
    values.append(len(np.unique(i)))
How could I vectorize this loop? speed is a concern.
If I try np.unique(data) I get the unique values for the whole data array not the individual 20x20 blocks, so that's not what I need.

First, you can work with data.reshape(N, -1), since you only care about the values within the last two dimensions, not their 20x20 layout.
An easy way to get the number of unique values for each row is to dump each row into a set and let it do the sorting:
[len(set(i)) for i in data.reshape(data.shape[0],-1)]
But this is an iteration, though probably a fast one.
A problem with 'vectorizing' is that the set or list of unique values in each row will differ in length. 'rows with differing length' is a red flag when it comes to 'vectorizing'. You no longer have the 'rectangular' data layout that makes most vectorizing possible.
You could sort each row:
np.sort(data.reshape(N,-1))
array([[1, 2, 2, 3, 3, 5, 5, 5, 6, 6],
[1, 1, 1, 2, 2, 2, 3, 3, 5, 7],
[0, 0, 2, 3, 4, 4, 4, 5, 5, 9],
[2, 2, 3, 3, 4, 4, 5, 7, 8, 9],
[0, 2, 2, 2, 2, 5, 5, 5, 7, 9]])
But how do you identify the unique values in each row without iterating? Counting the number of nonzero differences might just do the trick:
In [530]: data=np.random.randint(10,size=(5,10))
In [531]: [len(set(i)) for i in data.reshape(data.shape[0],-1)]
Out[531]: [7, 6, 6, 8, 6]
In [532]: sdata=np.sort(data,axis=1)
In [533]: (np.diff(sdata)>0).sum(axis=1)+1
Out[533]: array([7, 6, 6, 8, 6])
I was going to add a warning about floats, but if np.unique is working for your data, my approach should work just as well.
[(np.bincount(i)>0).sum() for i in data]
This is an iterative solution that is clearly faster than my len(set(i)) version, and is competitive with the diff...sort.
In [585]: data.shape
Out[585]: (10000, 400)
In [586]: timeit [(np.bincount(i)>0).sum() for i in data]
1 loops, best of 3: 248 ms per loop
In [587]: %%timeit
sdata=np.sort(data,axis=1)
(np.diff(sdata)>0).sum(axis=1)+1
.....:
1 loops, best of 3: 280 ms per loop
I just found a faster way to use bincount: np.count_nonzero.
In [715]: timeit np.array([np.count_nonzero(np.bincount(i)) for i in data])
10 loops, best of 3: 59.6 ms per loop
I was surprised at the speed improvement. But then I recalled that count_nonzero is used in other functions (e.g. np.nonzero) to allocate space for their return results. So it makes sense that this function would be coded for maximum speed. (It doesn't help in the diff...sort case because it does not take an axis parameter).
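Putting the pieces together for the original (N, 20, 20) data, here is a minimal sketch (the function name is illustrative) of the fully vectorized sort/diff approach described above:
import numpy as np

def count_unique_per_block(data):
    # Flatten each 20x20 block to a row, sort it, then count the
    # nonzero steps between consecutive sorted values (+1 for the first).
    flat = data.reshape(data.shape[0], -1)
    sflat = np.sort(flat, axis=1)
    return (np.diff(sflat, axis=1) > 0).sum(axis=1) + 1

data = np.random.randint(10, size=(1000, 20, 20))
counts = count_unique_per_block(data)   # shape (1000,), one count per block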

Related

Efficient vectorized version of this numpy for loop

Short intro
I have two paired lists of 2D numpy arrays (see below) - paired in the sense that index 0 in array1 corresponds to index 0 in array2. For each of the pairs I want to get all the combinations of all rows in the 2D numpy arrays, as answered by Divakar here.
Array example
arr1 = [
    np.vstack([[1,6,3,9], [8,5,6,7]]),
    np.vstack([[1,6,3,9]]),
    np.vstack([[1,6,3,9], [8,5,6,7], [8,5,6,7]])
]
arr2 = [
    np.vstack([[8,8,8,8]]),
    np.vstack([[8,8,8,8]]),
    np.vstack([[1,6,3,9], [8,5,6,7], [8,5,6,7]])
]
Working code
Note, unlike the linked answer, my columns are fixed (always 4), hence I replaced using shape with the hardcoded value 4 (or 8 in np.zeros).
def merge(a1, a2):
    # From: https://stackoverflow.com/questions/47143712/combination-of-all-rows-in-two-numpy-arrays
    m1 = a1.shape[0]
    m2 = a2.shape[0]
    out = np.zeros((m1, m2, 8), dtype=int)
    out[:, :, :4] = a1[:, None, :]
    out[:, :, 4:] = a2
    out.shape = (m1 * m2, -1)
    return out
total = np.concatenate([merge(arr1[i], arr2[i]) for i in range(len(arr1))])
print(total)
Question
While the above works fine, it looks inefficient to me as it:
involves looping through the arrays
"appends" (in list list comprehsion) to the total array, requiring it to allocate memory each time
creates multiple zero arrays (in the merge function), whereas I could create an empty one at the start? related to the point above
I perform this operation thousands of times on arrays with millions of elements, so any suggestions on how to transform this code into something more efficient?
To be honest, this seems pretty hard to optimize. Each step in the loop has a different size, so there likely isn't any purely vectorized way of doing these things. You can try pre-allocating the memory and writing in place, rather than allocating many pieces and concatenating the results at the end, but I'd bet that doesn't help you much (unless you are under such constrained conditions that you don't have enough RAM to store everything twice, of course).
Feel free to try the following approach on your larger data, but I'd be surprised if you get any significant speedup (or even that you don't get slower results!).
# Use a scalar product to get the final size
result = np.zeros((np.dot([len(x) for x in arr1], [len(x) for x in arr2]), 8), dtype=int)
start = 0
for a1, a2 in zip(arr1, arr2):
    end = start + len(a1) * len(a2)
    result[start:end, :4] = np.repeat(a1, len(a2), axis=0)
    result[start:end, 4:] = np.tile(a2, (len(a1), 1))
    start = end
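As a quick sanity check (assuming the question's merge, arr1 and arr2 are still in scope), the pre-allocated result should match the concatenate-based total row for row:
total = np.concatenate([merge(a, b) for a, b in zip(arr1, arr2)])
assert np.array_equal(result, total)   # same rows, same order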
This is what I wanted to see - the list and the merge results:
In [60]: arr1
Out[60]:
[array([[1, 6, 3, 9],
[8, 5, 6, 7]]),
array([[1, 6, 3, 9]]),
array([[1, 6, 3, 9],
[8, 5, 6, 7],
[8, 5, 6, 7]])]
In [61]: arr2
Out[61]:
[array([[8, 8, 8, 8]]),
array([[8, 8, 8, 8]]),
array([[1, 6, 3, 9],
[8, 5, 6, 7],
[8, 5, 6, 7]])]
In [63]: merge(arr1[0],arr2[0]) # a (2,4) with (1,4) => (2,8)
Out[63]:
array([[1, 6, 3, 9, 8, 8, 8, 8],
[8, 5, 6, 7, 8, 8, 8, 8]])
In [64]: merge(arr1[1],arr2[1]) # a (1,4) with (1,4) => (1,8)
Out[64]: array([[1, 6, 3, 9, 8, 8, 8, 8]])
In [65]: merge(arr1[2],arr2[2]) # a (3,4) with (3,4) => (9,8)
Out[65]:
array([[1, 6, 3, 9, 1, 6, 3, 9],
[1, 6, 3, 9, 8, 5, 6, 7],
[1, 6, 3, 9, 8, 5, 6, 7],
[8, 5, 6, 7, 1, 6, 3, 9],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 1, 6, 3, 9],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 8, 5, 6, 7]])
And total is (12,8), combining all the "rows".
The list comprehension is, more cleanly stated:
[merge(a,b) for a,b in zip(arr1,arr2)]
The lists, while the same length, have arrays with different numbers of rows, and the merge is also different.
People often ask about building an array iteratively, and we consistently say: collect the results in a list, and do one concatenate(-like) construction at the end. The equivalent loop is:
In [70]: alist = []
...: for a,b in zip(arr1,arr2):
...:     alist.append(merge(a,b))
This is usually competitive with predefining the total array, and assigning rows. And in your case to get the final shape of total you'd have to iterate through the lists and record the number of rows, etc.
Unless the computation is trivial, the iteration mechanism is a minor part of the total time. I'm pretty sure that here, it's calling merge 3 times that's taking most of the time. For a task like this I wouldn't worry too much about memory use, including the creation of the zeros. You have to, in one way or other use memory for a (12,8) final result. Building that from a (2,8),(1,8), and (9,8) isn't a big issue.
The list comprehension with concatenate and without:
In [72]: timeit total = np.concatenate([merge(a,b) for a,b in zip(arr1,arr2)])
22.4 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [73]: timeit [merge(a,b) for a,b in zip(arr1,arr2)]
16.3 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Calling merge 3 times with any of the pairs takes about the same time.
Oh, another thing: don't try to 'reuse' the out array across merge calls. When accumulating results like this in a list, reusing arrays is dangerous. Each merge call must return its own array, not a "recycled" one.
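A minimal sketch of that hazard (the names here are illustrative, not the question's code) - appending a shared buffer to a list stores references, so every entry ends up reflecting the last write:
import numpy as np

buf = np.zeros(3, dtype=int)
collected = []
for k in range(1, 4):
    buf[:] = k                # overwrite the shared buffer in place
    collected.append(buf)     # appends a reference, not a copy

print(collected)
# [array([3, 3, 3]), array([3, 3, 3]), array([3, 3, 3])]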

Can NumPy take care that an array is (nonstrictly) increasing along one axis?

Is there a function in numpy to guarantee or rather fix an array such that it is (nonstrictly) increasing along one particular axis?
For example, I have the following 2D array:
X = array([[1, 2, 1, 4, 5],
[0, 3, 1, 5, 4]])
the output of np.foobar(X) should return
array([[1, 2, 2, 4, 5],
[0, 3, 3, 5, 5]])
Does foobar exist or do I need to do that manually by using something like np.diff and some smart indexing?
Use np.maximum.accumulate for a running (accumulated) max value along that axis to enforce the nonstrictly increasing criterion -
np.maximum.accumulate(X,axis=1)
Sample run -
In [233]: X
Out[233]:
array([[1, 2, 1, 4, 5],
[0, 3, 1, 5, 4]])
In [234]: np.maximum.accumulate(X,axis=1)
Out[234]:
array([[1, 2, 2, 4, 5],
[0, 3, 3, 5, 5]])
For memory efficiency, we can assign it back to the input for in-situ changes with its out argument.
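For example, a small sketch (assuming X is a writable array as above):
np.maximum.accumulate(X, axis=1, out=X)   # overwrites X in place, no extra copy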
Runtime tests
Case #1 : Array as input
In [254]: X = np.random.rand(1000,1000)
In [255]: %timeit np.maximum.accumulate(X,axis=1)
1000 loops, best of 3: 1.69 ms per loop
# @cᴏʟᴅsᴘᴇᴇᴅ's pandas soln using df.cummax
In [256]: %timeit pd.DataFrame(X).cummax(axis=1).values
100 loops, best of 3: 4.81 ms per loop
Case #2 : Dataframe as input
In [257]: df = pd.DataFrame(np.random.rand(1000,1000))
In [258]: %timeit np.maximum.accumulate(df.values,axis=1)
1000 loops, best of 3: 1.68 ms per loop
# @cᴏʟᴅsᴘᴇᴇᴅ's pandas soln using df.cummax
In [259]: %timeit df.cummax(axis=1)
100 loops, best of 3: 4.68 ms per loop
pandas offers you the df.cummax function:
import pandas as pd
pd.DataFrame(X).cummax(axis=1).values
array([[1, 2, 2, 4, 5],
[0, 3, 3, 5, 5]])
It's useful to know that there's a first class function on hand in case your data is already loaded into a dataframe.

Create a list with items from another list at indices specified in a third list

Consider two lists:
a = [2, 4, 5]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
I want a resulting list c where
c = [0, 0, 2, 0, 4, 5, 0, 0, 0, 0]
is a list of length len(b) with values taken from b defined by indices specified in a and zeros elsewhere.
What is the most elegant way of doing this?
Use a list comprehension with the conditional expression and enumerate.
This list comprehension iterates over the index and value of the list b: if the index i is found in a, it keeps the element v, otherwise it sets it to 0.
a = [2, 4, 5]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [v if i in a else 0 for i, v in enumerate(b)]
print(c)
# [0, 0, 2, 0, 4, 5, 0, 0, 0, 0]
Note: If a is large then you may be better off converting it to a set first, before using in. The time complexity of in for a list is O(n), whereas for a set it is O(1) (in the average case for both).
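For instance, a small sketch of that tweak:
a_set = set(a)   # O(1) average-case membership tests instead of O(n) list scans
c = [v if i in a_set else 0 for i, v in enumerate(b)]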
The list comprehension is roughly equivalent to the following code (for explanation):
c = []
for i, v in enumerate(b):
    if i in a:
        c.append(v)
    else:
        c.append(0)
As you have the option of using numpy, I've included a simple method below which initialises an array filled with zeros and then uses fancy (array) indexing to replace the elements.
import numpy as np
a2 = np.array(a)
b2 = np.array(b)
c = np.zeros(len(b2), dtype=b2.dtype)  # zeros with b's integer dtype
c[a2] = b2[a2]  # fancy indexing: copy b's values at the indices in a
When timing the three methods (my list comp, my numpy, and Jon's method) the following results are given for N = 1000, a = list(range(0, N, 10)), and b = list(range(N)).
In [170]: %timeit lc_func(a,b)
100 loops, best of 3: 3.56 ms per loop
In [171]: %timeit numpy_func(a2,b2)
100000 loops, best of 3: 14.8 µs per loop
In [172]: %timeit jon_func(a,b)
10000 loops, best of 3: 22.8 µs per loop
This is to be expected. The numpy function is fastest, but both Jon's function and the numpy one are much faster than the list comprehension. If I increase the number of elements to 100,000 then the gap between numpy and Jon's method gets even larger.
Interestingly enough though, for small N Jon's function is the best! I suspect this is because, at small sizes, the overhead of creating numpy arrays outweighs their speed advantage over plain lists.
Moral of the story: large N? Go with numpy. Small N? Go with Jon.
The other option is to pre-initialise the target list with 0s - a fast operation - and then overwrite the value at each suitable index, eg:
a = [2, 4, 5]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [0] * len(b)
for el in a:
    c[el] = b[el]
# [0, 0, 2, 0, 4, 5, 0, 0, 0, 0]

Conditional index in 2d array in python

I have a 2D array, g, like so:
g = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])
So g[0] returns the first row, in other words when I give an index of 0, I get the first row. When I use an index of 1, I get the second row:
g[1] = [5 6 7 8]
and so on.
But I want to return all rows where the index of g is NOT a certain value.
Eg. I want to return g[x] for all x where x != 1.
I know how to use conditional indexing with 1D arrays, but what about 2D arrays? I'm confused here because I'm not putting conditions on what indices to retrieve according to the values, but I need a condition dependent on the indices themselves.
You could use np.arange(len(g)) != 1 to create a boolean index:
In [137]: g
Out[137]:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
In [138]: g[np.arange(len(g)) != 1]
Out[138]:
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
If you really want to eliminate just one row, you could, alternatively, use np.concatenate to join two basic slices:
In [143]: np.concatenate([g[:1], g[2:]])
Out[143]:
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
For large arrays, the first method appears to be faster, however:
In [150]: g2 = np.tile(g, (10000,1))
In [153]: %timeit g2[np.arange(len(g)) != 1]
100000 loops, best of 3: 6.9 µs per loop
In [152]: %timeit np.concatenate([g2[:1], g2[2:]])
10000 loops, best of 3: 51.8 µs per loop
unutbu's answer works, but I find placing the computation in the indices... icky. :/
I would do something like this:
rowsidontwant = [1, 3]
listofrows = [g[i] for i in filter(lambda x: x not in rowsidontwant, range(len(g)))]
It's a little more... general. The list of rows may not be what you want, but you can put the data in whatever form you like after that.
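If you prefer the boolean-mask style but need to drop several rows, something like np.isin (np.in1d on older NumPy versions) can build the mask - a small sketch:
import numpy as np

rowsidontwant = [1, 3]
mask = ~np.isin(np.arange(len(g)), rowsidontwant)   # True for rows to keep
kept = g[mask]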

How to split an array according to a condition in numpy?

For example, I have a ndarray that is:
a = np.array([1, 3, 5, 7, 2, 4, 6, 8])
Now I want to split a into two parts, one is all numbers <5 and the other is all >=5:
[array([1,3,2,4]), array([5,7,6,8])]
Certainly I can traverse a and create two new arrays, but I want to know whether numpy provides a better way.
Similarly, for multidimensional array, e.g.
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[2, 4, 7]])
I want to split it according to whether the first column is <3 or >=3, with this result:
[array([[1, 2, 3],
[2, 4, 7]]),
array([[4, 5, 6],
[7, 8, 9]])]
Are there any better ways than traversing it? Thanks.
import numpy as np

def split(arr, cond):
    return [arr[cond], arr[~cond]]

a = np.array([1, 3, 5, 7, 2, 4, 6, 8])
print(split(a, a < 5))

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [2, 4, 7]])
print(split(a, a[:, 0] < 3))
This produces the following output:
[array([1, 3, 2, 4]), array([5, 7, 6, 8])]
[array([[1, 2, 3],
[2, 4, 7]]), array([[4, 5, 6],
[7, 8, 9]])]
This might be a quick solution:
a = np.array([1,3,5,7])
b = a >= 3 # variable with condition
a[b] # to slice the array
len(a[b]) # count the elements in sliced array
1d array:
a = numpy.array([2, 3, 4, ...])
a_new = a[(a < 4)]  # get the elements less than 4
2d array, based on a column (say the value of column i should be less than 5):
a = numpy.array([[1, 2], [5, 6], ...])
a = a[(a[:, i] < 5)]
If your condition involves multiple columns, you can apply a condition to each of those columns, combine the resulting masks, and index with the combined mask as above.
Note that whatever I have written inside the (), produces a boolean mask that is used to index the array.
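A small sketch of a multi-column condition (the values and thresholds here are illustrative): each per-column comparison produces a boolean mask, and the masks can be combined element-wise with & or |:
import numpy as np

a = np.array([[1, 2], [5, 6], [3, 9], [7, 1]])
mask = (a[:, 0] < 5) & (a[:, 1] > 1)   # column 0 below 5 AND column 1 above 1
print(a[mask])
# [[1 2]
#  [3 9]]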
