Numpy group by multiple vectors, get group indices - python

I have several numpy arrays; I want to build a groupby method that would have group ids for these arrays. It will then allow me to index these arrays on the group id to perform operations on the groups.
For an example:
import numpy as np
import pandas as pd
a = np.array([1,1,1,2,2,3])
b = np.array([1,2,2,2,3,3])
def group_np(groupcols):
    groupby = np.array([''.join([str(b) for b in bs]) for bs in zip(*[c for c in groupcols])])
    _, groupby = np.unique(groupby, return_inverse=True)
    return groupby
def group_pd(groupcols):
    df = pd.DataFrame(groupcols[0])
    for i in range(1, len(groupcols)):
        df[i] = groupcols[i]
    for i in range(len(groupcols)):
        df[i] = df[i].fillna(-1)
    return df.groupby(list(range(len(groupcols)))).grouper.group_info[0]
Outputs:
group_np([a,b]) -> [0, 1, 1, 2, 3, 4]
group_pd([a,b]) -> [0, 1, 1, 2, 3, 4]
Is there a more efficient way of implementing it, ideally in pure numpy? The bottleneck currently seems to be building a vector that would have unique values for each group - at the moment I am doing that by concatenating the values for each vector as strings.
I want this to work for any number of input vectors, which can have millions of elements.
Edit: here is another testcase:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
Here, elements 2, 3, 4 and 7 should all end up in the same group.
Edit2: adding some benchmarks.
a = np.random.randint(1, 1000, 30000000)
b = np.random.randint(1, 1000, 30000000)
c = np.random.randint(1, 1000, 30000000)
def group_np2(groupcols):
    _, groupby = np.unique(np.stack(groupcols), return_inverse=True, axis=1)
    return groupby
%timeit group_np2([a,b,c])
# 25.1 s +/- 1.06 s per loop (mean +/- std. dev. of 7 runs, 1 loop each)
%timeit group_pd([a,b,c])
# 21.7 s +/- 646 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

After using np.stack on the arrays a and b, setting the parameter return_inverse to True in np.unique (along axis=1) gives the output you are looking for:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
_, inv = np.unique(np.stack([a,b]), axis=1, return_inverse=True)
print (inv)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
and you can replace [a,b] in np.stack by a list of all the vectors.
Edit: a faster solution is to use np.unique on the sum of the arrays, each multiplied by the cumulative product (np.cumprod) of the max plus 1 of all previous arrays in groupcols, such as:
def group_np_sum(groupcols):
    groupcols_max = np.cumprod([ar.max()+1 for ar in groupcols[:-1]])
    return np.unique(sum([groupcols[0]] +
                         [ar*m for ar, m in zip(groupcols[1:], groupcols_max)]),
                     return_inverse=True)[1]
To check:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print (group_np_sum([a,b]))
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
Note: the number associated with each group may not be the same (here I changed the first element of a to 3)
a = np.array([3,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print(group_np2([a,b]))
print (group_np_sum([a,b]))
array([3, 1, 0, 0, 0, 2, 4, 0], dtype=int64)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
but groups themselves are the same.
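If you want to verify programmatically that two labelings describe the same partition, a minimal sketch (my own, not part of the answer) is to count the distinct label pairs:
import numpy as np
def same_grouping(g1, g2):
    # Two labelings define the same groups if the pairs (g1[i], g2[i])
    # contain exactly as many distinct values as each labeling alone.
    pairs = np.unique(np.stack([g1, g2]), axis=1)
    return pairs.shape[1] == len(np.unique(g1)) == len(np.unique(g2))
print(same_grouping(np.array([3,1,0,0,0,2,4,0]),
                    np.array([0,2,1,1,1,3,4,1])))  # True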
Now to check for timing:
a = np.random.randint(1, 100, 30000)
b = np.random.randint(1, 100, 30000)
c = np.random.randint(1, 100, 30000)
groupcols = [a,b,c]
%timeit group_pd(groupcols)
#13.7 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit group_np2(groupcols)
#34.2 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit group_np_sum(groupcols)
#3.63 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The numpy_indexed package (disclaimer: I am its author) covers this type of use case:
import numpy_indexed as npi
npi.group_by((a, b))
Passing a tuple of index-arrays like this avoids creating a copy; but if you don't mind making the copy you can use stacking as well:
npi.group_by(np.stack([a, b], axis=-1))
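Another pure-numpy route, as a rough sketch of my own rather than part of the answers above: lexsort the key vectors and label runs of identical rows. This avoids both string keys and the potential integer overflow of the sum/cumprod encoding (the function name group_np_lexsort is just illustrative):
import numpy as np
def group_np_lexsort(groupcols):
    # Sort positions so that equal (groupcols[0][i], groupcols[1][i], ...) tuples are adjacent.
    order = np.lexsort(groupcols[::-1])
    stacked = np.stack(groupcols, axis=1)[order]
    # A new group starts wherever the sorted row differs from the previous one.
    new_group = np.empty(len(order), dtype=bool)
    new_group[0] = True
    new_group[1:] = (stacked[1:] != stacked[:-1]).any(axis=1)
    # Cumulative sum of the flags gives 0-based group ids in sorted order;
    # scatter them back to the original positions.
    ids = np.empty(len(order), dtype=np.int64)
    ids[order] = np.cumsum(new_group) - 1
    return ids
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print(group_np_lexsort([a, b]))  # [0 2 1 1 1 3 4 1]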

Related

Boolean indexing array through array of boolean indexes without loop

I want to index an array with a boolean mask through multiple boolean arrays without a loop.
This is what I want to achieve but without a loop and only with numpy.
import numpy as np
a = np.array([[0, 1],[2, 3]])
b = np.array([[[1, 0], [1, 0]], [[0, 0], [1, 1]]], dtype=bool)
r = []
for x in b:
    print(a[x])
    r.extend(a[x])
# => array([0, 2])
# => array([2, 3])
print(r)
# => [0, 2, 2, 3]
# what I would like to do is something like this
r = some_fancy_indexing_magic_with_b_and_a
print(r)
# => [0, 2, 2, 3]
Approach #1
Simply broadcast a to b's shape with np.broadcast_to and then mask it with b -
In [15]: np.broadcast_to(a,b.shape)[b]
Out[15]: array([0, 2, 2, 3])
Approach #2
Another would be to get all the flat indices of True values in b, take them modulo the size of a (which is also the size of each 2D block in b), and then index into flattened a -
a.ravel()[np.flatnonzero(b)%a.size]
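The modulo step works because b is a stack of blocks, each shaped like a, so flat position i in b corresponds to element i % a.size of a; a quick check of my own (using the a and b from the question) that the first two approaches agree:
assert np.array_equal(np.broadcast_to(a, b.shape)[b],
                      a.ravel()[np.flatnonzero(b) % a.size])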
Approach #3
On the same lines as App#2, but keeping the 2D format and using non-zero indices along the last two axes of b -
_,r,c = np.nonzero(b)
out = a[r,c]
Timings on large arrays (given sample shapes scaled up by 100x) -
In [50]: np.random.seed(0)
...: a = np.random.rand(200,200)
...: b = np.random.rand(200,200,200)>0.5
In [51]: %timeit np.broadcast_to(a,b.shape)[b]
45.5 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [52]: %timeit a.ravel()[np.flatnonzero(b)%a.size]
94.6 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [53]: %%timeit
...: _,r,c = np.nonzero(b)
...: out = a[r,c]
128 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

How do I sort a 2D numpy array in this specific way

I realize there are quite a number of 'how to sort numpy array'-questions on here already. But I could not find how to do it in this specific way.
I have an array similar to this:
array([[1, 0, 1],
       [0, 0, 1],
       [1, 1, 1],
       [1, 1, 0]])
I want to sort the rows, keeping the order within the rows the same. So I expect the following output:
array([[0, 0, 1],
       [1, 0, 1],
       [1, 1, 0],
       [1, 1, 1]])
You can use dot and argsort:
a[a.dot(2**np.arange(a.shape[1])[::-1]).argsort()]
# array([[0, 0, 1],
# [1, 0, 1],
# [1, 1, 0],
# [1, 1, 1]])
The idea is to convert the rows into integers.
a.dot(2**np.arange(a.shape[1])[::-1])
# array([5, 1, 7, 6])
Then, find the sorted indices and use that to reorder a:
a.dot(2**np.arange(a.shape[1])[::-1]).argsort()
# array([1, 0, 3, 2])
My tests show this is slightly faster than lexsort.
a = a.repeat(1000, axis=0)
%timeit a[np.lexsort(a.T[::-1])]
%timeit a[a.dot(2**np.arange(a.shape[1])[::-1]).argsort()]
230 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
192 µs ± 4.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Verify correctness:
np.array_equal(a[a.dot(2**np.arange(a.shape[1])[::-1]).argsort()],
a[np.lexsort(a.T[::-1])])
# True

Numpy: Can you use broadcasting to replace values by row?

I have an M x N matrix X and a 1 x N matrix Y. What I would like to do is replace any 0-entry in X with the appropriate value from Y based on its column.
So if
X = np.array([[0, 1, 2], [3, 0, 5]])
and
Y = np.array([10, 20, 30])
The desired end result would be [[10, 1, 2], [3, 20, 5]].
This can be done straightforwardly by generating an M x N matrix where every row is Y and then using filter arrays:
Y = np.ones((X.shape[0], 1)) * Y.reshape(1, -1)
X[X==0] = Y[X==0]
But could this be done using numpy's broadcasting functionality?
Sure. Instead of physically repeating Y, create a broadcasted view of Y with the shape of X, using numpy.broadcast_to:
expanded = numpy.broadcast_to(Y, X.shape)
mask = X == 0
X[mask] = expanded[mask]
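Since np.where broadcasts its arguments too, an even shorter variant (a sketch of mine; it returns a new array instead of modifying X in place) is:
import numpy as np
X = np.array([[0, 1, 2], [3, 0, 5]])
Y = np.array([10, 20, 30])
# Y (shape (3,)) is broadcast against X (shape (2, 3)) column-wise.
result = np.where(X == 0, Y, X)
print(result)  # [[10  1  2] [ 3 20  5]]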
Expand X to make it a bit more general:
In [306]: X = np.array([[0, 1, 2], [3, 0, 5],[0,1,0]])
where identifies the 0s; the 2nd array identifies the columns
In [307]: idx = np.where(X==0)
In [308]: idx
Out[308]: (array([0, 1, 2, 2]), array([0, 1, 0, 2]))
In [309]: Z = X.copy()
In [310]: Z[idx]
Out[310]: array([0, 0, 0, 0]) # flat list of where to put the values
In [311]: Y[idx[1]]
Out[311]: array([10, 20, 10, 30]) # matching list of values by column
In [312]: Z[idx] = Y[idx[1]]
In [313]: Z
Out[313]:
array([[10, 1, 2],
[ 3, 20, 5],
[10, 1, 30]])
Not doing broadcasting, but reasonably clean numpy.
Times compared to broadcast_to approach
In [314]: %%timeit
...: idx = np.where(X==0)
...: Z[idx] = Y[idx[1]]
...:
9.28 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [315]: %%timeit
...: exp = np.broadcast_to(Y,X.shape)
...: mask=X==0
...: Z[mask] = exp[mask]
...:
19.5 µs ± 513 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Faster, though the sample size is small.
Another way to make the expanded Y, is with repeat:
In [319]: %%timeit
...: exp = np.repeat(Y[None,:],3,0)
...: mask=X==0
...: Z[mask] = exp[mask]
...:
10.8 µs ± 55.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Whose time is close to my where. It turns out that broadcast_to is relatively slow:
In [321]: %%timeit
...: exp = np.broadcast_to(Y,X.shape)
...:
10.5 µs ± 52.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [322]: %%timeit
...: exp = np.repeat(Y[None,:],3,0)
...:
3.76 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We'd have to do more tests to see whether that is just due to a setup cost, or if the relative times still apply with much larger arrays.
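To check that, one could rerun just the expansion step at a larger scale; a sketch of such a comparison (array sizes are my own arbitrary choice, no results claimed):
import numpy as np
from timeit import timeit
X = np.random.randint(0, 5, (2000, 3000))
Y = np.random.randint(1, 100, 3000)
# Compare only the "expand Y" step at this scale.
print('broadcast_to:', timeit(lambda: np.broadcast_to(Y, X.shape), number=1000))
print('repeat:      ', timeit(lambda: np.repeat(Y[None, :], X.shape[0], 0), number=1000))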

What is a vectorized way to create multiple powers of a NumPy array?

I have a NumPy array:
arr = [[1, 2],
       [3, 4]]
I want to create a new array that contains powers of arr up to a power order:
# arr_new = [arr^0, arr^1, arr^2, arr^3,...arr^order]
arr_new = [[1, 1, 1, 2, 1, 4, 1, 8],
           [1, 1, 3, 4, 9, 16, 27, 64]]
My current approach uses for loops:
# Pre-allocate an array for powers
arr = np.array([[1, 2], [3, 4]])
order = 3
rows, cols = arr.shape
arr_new = np.zeros((rows, (order+1) * cols))
# Iterate over each exponent
for i in range(order + 1):
    arr_new[:, (i * cols) : (i + 1) * cols] = arr**i
print(arr_new)
Is there a faster (i.e. vectorized) approach to creating powers of an array?
Benchmarking
Thanks to @hpaulj, @Divakar and @Paul Panzer for the answers. I benchmarked the loop-based and broadcasting-based operations on the following test arrays.
arr = np.array([[1, 2],
                [3, 4]])
order = 3
arrLarge = np.random.randint(0, 10, (100, 100)) # 100 x 100 array
orderLarge = 10
The loop_based function is:
def loop_based(arr, order):
    # pre-allocate an array for powers
    rows, cols = arr.shape
    arr_new = np.zeros((rows, (order+1) * cols))
    # iterate over each exponent
    for i in range(order + 1):
        arr_new[:, (i * cols) : (i + 1) * cols] = arr**i
    return arr_new
The broadcast_based function using hstack is:
def broadcast_based_hstack(arr, order):
    # Create a 3D exponent array for a 2D input array to force broadcasting
    powers = np.arange(order + 1)[:, None, None]
    # Generate values (third axis contains array at various powers)
    exponentiated = arr ** powers
    # Reshape and return array
    return np.hstack(exponentiated)  # <== using hstack function
The broadcast_based function using reshape is:
def broadcast_based_reshape(arr, order):
    # Create a 3D exponent array for a 2D input array to force broadcasting
    powers = np.arange(order + 1)[:, None]
    # Generate values (3rd axis contains array at various powers)
    exponentiated = arr[:, None] ** powers
    # Reshape and return array
    return exponentiated.reshape(arr.shape[0], -1)  # <== using reshape function
The broadcast_based function using cumulative product cumprod and reshape:
def broadcast_cumprod_reshape(arr, order):
    rows, cols = arr.shape
    # Create 3D empty array where the middle dimension is
    # the array at powers 0 through order
    out = np.empty((rows, order + 1, cols), dtype=arr.dtype)
    out[:, 0, :] = 1  # 0th power is always 1
    a = np.broadcast_to(arr[:, None], (rows, order, cols))
    # Cumulatively multiply arrays so each multiplication produces the next order
    np.cumprod(a, axis=1, out=out[:, 1:, :])
    return out.reshape(rows, -1)
In a Jupyter notebook, I used the %timeit magic and got these results:
Small arrays (2x2):
%timeit -n 100000 loop_based(arr, order)
7.41 µs ± 174 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit -n 100000 broadcast_based_hstack(arr, order)
10.1 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit -n 100000 broadcast_based_reshape(arr, order)
3.31 µs ± 61.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit -n 100000 broadcast_cumprod_reshape(arr, order)
11 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Large arrays (100x100):
%timeit -n 1000 loop_based(arrLarge, orderLarge)
261 µs ± 5.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 broadcast_based_hstack(arrLarge, orderLarge)
225 µs ± 4.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 broadcast_based_reshape(arrLarge, orderLarge)
223 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 broadcast_cumprod_reshape(arrLarge, orderLarge)
157 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Conclusions:
It seems that the broadcast based approach using reshape is faster for smaller arrays. However, for large arrays, the cumprod approach scales better and is faster.
Extend arrays to higher dims and let broadcasting do its magic with some help from reshaping -
In [16]: arr = np.array([[1, 2],[3,4]])
In [17]: order = 3
In [18]: (arr[:,None]**np.arange(order+1)[:,None]).reshape(arr.shape[0],-1)
Out[18]:
array([[ 1, 1, 1, 2, 1, 4, 1, 8],
[ 1, 1, 3, 4, 9, 16, 27, 64]])
Note that arr[:,None] is essentially arr[:,None,:], but we can skip the trailing ellipsis for brevity.
Timings on a bigger dataset -
In [40]: np.random.seed(0)
...: arr = np.random.randint(0,9,(100,100))
...: order = 10
# #hpaulj's soln with broadcasting and stacking
In [41]: %timeit np.hstack(arr **np.arange(order+1)[:,None,None])
1000 loops, best of 3: 734 µs per loop
In [42]: %timeit (arr[:,None]**np.arange(order+1)[:,None]).reshape(arr.shape[0],-1)
1000 loops, best of 3: 401 µs per loop
That reshaping part is practically free and that's where we gain performance here, along with the broadcasting part of course, as seen in the breakdown below -
In [52]: %timeit (arr[:,None]**np.arange(order+1)[:,None])
1000 loops, best of 3: 390 µs per loop
In [53]: %timeit (arr[:,None]**np.arange(order+1)[:,None]).reshape(arr.shape[0],-1)
1000 loops, best of 3: 401 µs per loop
Use broadcasting to generate the values, and reshape or rearrange the values as desired:
In [34]: arr **np.arange(4)[:,None,None]
Out[34]:
array([[[ 1, 1],
[ 1, 1]],
[[ 1, 2],
[ 3, 4]],
[[ 1, 4],
[ 9, 16]],
[[ 1, 8],
[27, 64]]])
In [35]: np.hstack(_)
Out[35]:
array([[ 1, 1, 1, 2, 1, 4, 1, 8],
[ 1, 1, 3, 4, 9, 16, 27, 64]])
Here is a solution using cumulative multiplication which scales better than power based approaches, especially if the input array is of float dtype:
import numpy as np
def f_mult(a, k):
    m, n = a.shape
    out = np.empty((m, k, n), dtype=a.dtype)
    out[:, 0, :] = 1
    a = np.broadcast_to(a[:, None], (m, k-1, n))
    a.cumprod(axis=1, out=out[:, 1:])
    return out.reshape(m, -1)
Timings:
int up to power 9
divakar: 0.4342731796205044 ms
hpaulj: 0.794165057130158 ms
pp: 0.20520629966631532 ms
float up to power 39
divakar: 29.056487752124667 ms
hpaulj: 31.773792404681444 ms
pp: 1.0329263447783887 ms
Code for the timings, thanks @Divakar:
def f_divakar(a, k):
    return (a[:,None]**np.arange(k)[:,None]).reshape(a.shape[0], -1)
def f_hpaulj(a, k):
    return np.hstack(a**np.arange(k)[:,None,None])
from timeit import timeit
np.random.seed(0)
a = np.random.randint(0,9,(100,100))
k = 10
print('int up to power 9')
print('divakar:', timeit(lambda: f_divakar(a, k), number=1000), 'ms')
print('hpaulj: ', timeit(lambda: f_hpaulj(a, k), number=1000), 'ms')
print('pp: ', timeit(lambda: f_mult(a, k), number=1000), 'ms')
a = np.random.uniform(0.5,2.0,(100,100))
k = 40
print('float up to power 39')
print('divakar:', timeit(lambda: f_divakar(a, k), number=1000), 'ms')
print('hpaulj: ', timeit(lambda: f_hpaulj(a, k), number=1000), 'ms')
print('pp: ', timeit(lambda: f_mult(a, k), number=1000), 'ms')
You are creating a Vandermonde matrix with a reshape, so it is probably best to use numpy.vander to make it, and let someone else take care of the best algorithm.
This way your code is just:
np.vander(arr.ravel(), order + 1, increasing=True).reshape((arr.shape[0], -1))
That said, it seems like they use something like Paul Panzer's cumprod method under the hood so it should scale well.
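One caveat (my addition): np.vander needs order + 1 columns and increasing=True to cover powers 0 through order, and it groups all the powers of one element together, whereas the question groups elements by power. A small sketch that reorders the vander output to match the question's layout:
import numpy as np
arr = np.array([[1, 2], [3, 4]])
order = 3
rows, cols = arr.shape
# increasing=True gives [x**0, x**1, ..., x**order] for each element.
v = np.vander(arr.ravel(), order + 1, increasing=True)  # shape (rows*cols, order+1)
# Regroup so all elements at power 0 come first, then power 1, etc.
arr_new = v.reshape(rows, cols, order + 1).transpose(0, 2, 1).reshape(rows, -1)
print(arr_new)
# [[ 1  1  1  2  1  4  1  8]
#  [ 1  1  3  4  9 16 27 64]]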

Numpy: Get random set of rows from 2D array

I have a very large 2D array which looks something like this:
a = [[a1, b1, c1],
     [a2, b2, c2],
     ...,
     [an, bn, cn]]
Using numpy, is there an easy way to get a new 2D array with, e.g., 2 random rows from the initial array a (without replacement)?
e.g.
b = [[a4, b4, c4],
     [a99, b99, c99]]
>>> A = np.random.randint(5, size=(10,3))
>>> A
array([[1, 3, 0],
[3, 2, 0],
[0, 2, 1],
[1, 1, 4],
[3, 2, 2],
[0, 1, 0],
[1, 3, 1],
[0, 4, 1],
[2, 4, 2],
[3, 3, 1]])
>>> idx = np.random.randint(10, size=2)
>>> idx
array([7, 6])
>>> A[idx,:]
array([[0, 4, 1],
[1, 3, 1]])
Putting it together for a general case:
A[np.random.randint(A.shape[0], size=2), :]
For non replacement (numpy 1.7.0+):
A[np.random.choice(A.shape[0], 2, replace=False), :]
I do not believe there is a good way to generate a random sample without replacement before 1.7. Perhaps you can set up a small definition that ensures the two values are not the same.
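For completeness, such a helper for older numpy could be as simple as drawing a permutation of the row indices and keeping the first few; a sketch of mine (the function name is just illustrative):
import numpy as np
def sample_rows_without_replacement(A, n):
    # Shuffling the row indices guarantees no row is picked twice;
    # np.random.permutation predates numpy 1.7.
    idx = np.random.permutation(A.shape[0])[:n]
    return A[idx, :]
print(sample_rows_without_replacement(np.arange(30).reshape(10, 3), 2))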
This is an old post, but this is what works best for me:
A[np.random.choice(A.shape[0], num_rows_2_sample, replace=False)]
change the replace=False to True to get the same thing, but with replacement.
Another option is to create a random mask if you just want to down-sample your data by a certain factor. Say I want to down-sample to 25% of my original data set, which is currently held in the array data_arr:
# generate random boolean mask the length of data
# use p 0.75 for False and 0.25 for True
mask = numpy.random.choice([False, True], len(data_arr), p=[0.75, 0.25])
Now you can call data_arr[mask] and return ~25% of the rows, randomly sampled.
This is a similar answer to the one Hezi Rasheff provided, but simplified so newer Python users understand what's going on (I noticed many new data science students fetch random samples in the weirdest ways because they don't know what they are doing in Python).
You can get a number of random indices from your array by using:
indices = np.random.choice(A.shape[0], number_of_samples, replace=False)
You can then use fancy indexing with your numpy array to get the samples at those indices:
A[indices]
This will get you the specified number of random samples from your data.
I see permutation has been suggested. In fact it can be made into one line:
>>> A = np.random.randint(5, size=(10,3))
>>> np.random.permutation(A)[:2]
array([[0, 3, 0],
[3, 1, 2]])
If you want to generate multiple random subsets of rows, for example if you're doing RANSAC:
num_pop = 10
num_samples = 2
pop_in_sample = 3
rows_to_sample = np.random.random([num_pop, 5])
random_numbers = np.random.random([num_samples, num_pop])
samples = np.argsort(random_numbers, axis=1)[:, :pop_in_sample]
# will be shape [num_samples, pop_in_sample, 5]
row_subsets = rows_to_sample[samples, :]
An alternative way of doing it is by using the choice method of the Generator class, https://github.com/numpy/numpy/issues/10835
import numpy as np
# generate the random array
A = np.random.randint(5, size=(10,3))
# use the choice method of the Generator class
rng = np.random.default_rng()
A_sampled = rng.choice(A, 2)
leading to sampled data such as:
array([[1, 3, 2],
[1, 2, 1]])
The running time is also profiled and compared as follows:
%timeit rng.choice(A, 2)
15.1 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.random.permutation(A)[:2]
4.22 µs ± 83.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit A[np.random.randint(A.shape[0], size=2), :]
10.6 µs ± 418 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But when the array grows large, e.g. A = np.random.randint(10, size=(1000, 300)), working on the index is the best way:
%timeit A[np.random.randint(A.shape[0], size=50), :]
17.6 µs ± 657 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit rng.choice(A, 50)
22.3 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.random.permutation(A)[:50]
143 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So the permutation method seems to be the most efficient one when your array is small, while working on the index is the optimal solution when your array grows large.
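Combining the two observations, a sketch of mine using the Generator API but sampling indices rather than rows directly (so only the selected rows are copied):
import numpy as np
rng = np.random.default_rng()
A = np.random.randint(10, size=(1000, 300))
# Draw 50 distinct row indices, then gather only those rows.
idx = rng.choice(A.shape[0], size=50, replace=False)
sample = A[idx]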
If you need the same rows but just a random sample then:
import random
new_array = random.sample(list(old_array), x)
Here x has to be an int defining the number of rows you want to randomly pick (note that random.sample returns a Python list of rows rather than a 2D array).
One can generate a random sample from a given array with a random number generator:
rng = np.random.default_rng()
b = rng.choice(a, 2, replace=False)
b
>>> [[a4, b4, c4],
[a99, b99, c99]]
I am quite surprised that this much easier to read solution has not been proposed after more than 10 years:
import random
b = np.array(random.choices(a, k=2))
(Note that random.choices samples with replacement; use random.sample on a list of rows if you need the sample without replacement.)
Edit: Ah, maybe because it was only introduced in Python 3.6, but still…
