I have 3 NumPy arrays, and I want to create tuples of the i-th element of each list. These tuples represent keys for a dictionary I had previously defined.
Ex:
List 1: [1, 2, 3, 4, 5]
List 2: [6, 7, 8, 9, 10]
List 3: [11, 12, 13, 14, 15]
Desired output: [mydict[(1,6,11)],mydict[(2,7,12)],mydict[(3,8,13)],mydict[(4,9,14)],mydict[(5,10,15)]]
These tuples represent keys of a dictionary I have previously defined (essentially, as input variables to a previously calculated function). I had read that this is the best way to store function values for lookup.
My current method of doing this is as follows:
[mydict[x] for x in zip(l1, l2, l3)]
This works, but is obviously slow. Is there a way to vectorize this operation, or make it faster in any way? I'm open to changing the way I've stored the function values as well, if that is necessary.
EDIT: My apologies for the question being unclear. I do, in fact, have NumPy arrays. My mistake for referring to them as lists and displaying them as such. They are all the same length.
Your question is a bit confusing, since you're calling these NumPy arrays, and asking for a way to vectorize things, but then showing lists, and labeling them as lists in your example, and using list in the title. I'm going to assume you do have arrays.
>>> l1 = np.array([1, 2, 3, 4, 5])
>>> l2 = np.array([6, 7, 8, 9, 10])
>>> l3 = np.array([11, 12, 13, 14, 15])
If so, you can stack these up in a 2D array:
>>> ll = np.stack((l1, l2, l3))
And then you can just transpose that:
>>> lt = ll.T
This is better than vectorized; it's constant-time. NumPy is just creating another view of the same data, with different striding so it reads in column order instead of row order.
>>> lt
array([[ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14],
       [ 5, 10, 15]])
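A quick check (not in the original answer) that the transpose really is just a view with different strides; the exact stride values shown assume 8-byte integers:
>>> lt.base is ll            # lt shares ll's memory
True
>>> ll.strides, lt.strides   # same buffer, read in row order vs. column order
((40, 8), (8, 40))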
As miradulo points out, you can do both of these in one step with column_stack:
>>> lt = np.column_stack((l1, l2, l3))
But I suspect you're actually going to want ll as a value in its own right. (Although I admit I'm just guessing here at what you're trying to do…)
And of course if you want to loop over these rows as 1D arrays instead of doing further vectorized work, you can:
>>> for row in lt:
...     print(row)
[ 1 6 11]
[ 2 7 12]
[ 3 8 13]
[ 4 9 14]
[ 5 10 15]
Of course, you can convert them from 1D arrays to tuples just by calling tuple on each row. Or you can build whatever that mydict is supposed to be (it doesn't look like a dictionary; there are no key-value pairs, just values) from each row.
>>> mydict = collections.namedtuple('mydict', list('abc'))
>>> tups = [mydict(*row) for row in lt]
>>> tups
[mydict(a=1, b=6, c=11),
mydict(a=2, b=7, c=12),
mydict(a=3, b=8, c=13),
mydict(a=4, b=9, c=14),
mydict(a=5, b=10, c=15)]
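And if the real goal is the dictionary lookup from the question, a minimal sketch would be the following, where func_vals is a made-up name for the dict of precomputed values, assumed to be keyed by 3-tuples (NumPy integer scalars hash like plain ints, so the tuples work as keys):
>>> vals = [func_vals[tuple(row)] for row in lt]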
If you're worried about the time to look up a tuple of keys in a dict, itemgetter in the operator module has a C-accelerated version. If keys is a np.array, or a tuple, or whatever, you can do this:
import operator

for row in lt:
    myvals = operator.itemgetter(*row)(mydict)
    # do stuff with myvals
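If the dictionary is keyed by the whole tuples (as in the question) rather than by the individual elements, a hedged variant is to build all the tuple keys first and fetch everything in a single itemgetter call (again using the made-up func_vals name):
keys = [tuple(row) for row in lt]                # one tuple key per row
myvals = operator.itemgetter(*keys)(func_vals)   # all lookups in one C-level call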
Meanwhile, I decided to slap together a C extension that should be as fast as possible (with no error handling, because I'm lazy, and it should be a tiny bit faster that way; this code will probably segfault if you give it anything but a dict and a tuple or list):
static PyObject *
itemget_itemget(PyObject *self, PyObject *args) {
    PyObject *d;
    PyObject *keys;
    PyArg_ParseTuple(args, "OO", &d, &keys);
    PyObject *seq = PySequence_Fast(keys, "keys must be an iterable");
    PyObject **arr = PySequence_Fast_ITEMS(seq);
    int seqlen = PySequence_Fast_GET_SIZE(seq);
    PyObject *result = PyTuple_New(seqlen);
    PyObject **resarr = PySequence_Fast_ITEMS(result);
    for (int i = 0; i != seqlen; ++i) {
        resarr[i] = PyDict_GetItem(d, arr[i]);
        Py_INCREF(resarr[i]);
    }
    return result;
}
Times for looking up 100 random keys out of a 10000-key dictionary on my laptop with python.org CPython 3.7 on macOS:
itemget.itemget: 1.6µs
operator.itemgetter: 1.8µs
comprehension: 3.4µs
pure-Python operator.itemgetter: 6.7µs
So, I'm pretty sure anything you do is going to be fast enough; that's only 34ns/key that we're trying to optimize. But if that really is too slow, operator.itemgetter does a good enough job moving the loop to C and cuts it roughly in half, which is pretty close to the best possible result you could expect. (It's hard to imagine looking up a bunch of boxed-value keys in a hash table in much less than 16ns/key, after all.)
Define your 3 lists. You mention 3 arrays, but show lists (and call them that as well):
In [112]: list1,list2,list3 = list(range(1,6)),list(range(6,11)),list(range(11,16))
Now create a dictionary with tuple keys:
In [114]: dd = {x:i for i,x in enumerate(zip(list1,list2,list3))}
In [115]: dd
Out[115]: {(1, 6, 11): 0, (2, 7, 12): 1, (3, 8, 13): 2, (4, 9, 14): 3, (5, 10, 15): 4}
Accessing elements from that dictionary with your code:
In [116]: [dd[x] for x in zip(list1,list2,list3)]
Out[116]: [0, 1, 2, 3, 4]
In [117]: timeit [dd[x] for x in zip(list1,list2,list3)]
1.62 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Now for an array equivalent - turn the lists into a 2d array:
In [118]: arr = np.array((list1,list2,list3))
In [119]: arr
Out[119]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])
Access the same dictionary elements. If I had used column_stack I could have omitted the .T, but building the array that way is slower (array transpose is fast):
In [120]: [dd[tuple(x)] for x in arr.T]
Out[120]: [0, 1, 2, 3, 4]
In [121]: timeit [dd[tuple(x)] for x in arr.T]
15.7 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Notice that this is substantially slower. Iteration over an array is slower than iteration over a list. You can't access elements of a dictionary in any sort of numpy 'vectorized' fashion - you have to use a Python iteration.
I can improve on the array iteration by first turning it into a list:
In [124]: arr.T.tolist()
Out[124]: [[1, 6, 11], [2, 7, 12], [3, 8, 13], [4, 9, 14], [5, 10, 15]]
In [125]: timeit [dd[tuple(x)] for x in arr.T.tolist()]
3.21 µs ± 9.67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Array construction times:
In [122]: timeit arr = np.array((list1,list2,list3))
3.54 µs ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [123]: timeit arr = np.column_stack((list1,list2,list3))
18.5 µs ± 11.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With the pure Python itemgetter (from v3.6.3) there's no savings:
In [149]: timeit operator.itemgetter(*[tuple(x) for x in arr.T.tolist()])(dd)
3.51 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and if I move the getter definition out of the time loop:
In [151]: %%timeit idx = operator.itemgetter(*[tuple(x) for x in arr.T.tolist()])
     ...: idx(dd)
     ...:
482 ns ± 1.85 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Just wondering if there is a way to construct a tumbling window in Python. For example, if I have a list/ndarray, listA = [3,2,5,9,4,6,3,8,7,9], how could I find the maximum of the first 3 items (3,2,5) -> 5, then the next 3 items (9,4,6) -> 9, and so on? Sort of like breaking it up into sections and finding the max of each section. So the final result would be the list [5,9,8,9].
Approach #1: One-liner for windowed-max using np.maximum.reduceat -
In [118]: np.maximum.reduceat(listA,np.arange(0,len(listA),3))
Out[118]: array([5, 9, 8, 9])
Becomes more compact with np.r_ -
np.maximum.reduceat(listA,np.r_[:len(listA):3])
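For anyone unfamiliar with np.r_'s slice syntax, a quick check that the two index constructions agree:
np.r_[:len(listA):3]   # array([0, 3, 6, 9]), same as np.arange(0, len(listA), 3)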
Approach #2: Generic ufunc way
Here's a function that works for generic ufuncs, with the window length as a parameter -
def windowed_ufunc(a, ufunc, W):
    a = np.asarray(a)
    n = len(a)
    L = W*(n//W)
    out = ufunc(a[:L].reshape(-1,W), axis=1)
    if n > L:
        out = np.hstack((out, ufunc(a[L:])))
    return out
Sample run -
In [81]: a = [3,2,5,9,4,6,3,8,7,9]
In [82]: windowed_ufunc(a, ufunc=np.max, W=3)
Out[82]: array([5, 9, 8, 9])
On other ufuncs -
In [83]: windowed_ufunc(a, ufunc=np.min, W=3)
Out[83]: array([2, 4, 3, 9])
In [84]: windowed_ufunc(a, ufunc=np.sum, W=3)
Out[84]: array([10, 19, 18, 9])
In [85]: windowed_ufunc(a, ufunc=np.mean, W=3)
Out[85]: array([3.33333333, 6.33333333, 6. , 9. ])
Benchmarking
Timings on NumPy solutions on array data with sample data scaled up by 10000x -
In [159]: a = [3,2,5,9,4,6,3,8,7,9]
In [160]: a = np.tile(a, 10000)
# #yatu's soln
In [162]: %timeit moving_maxima(a, w=3)
435 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# From this post - app#1
In [167]: %timeit np.maximum.reduceat(a,np.arange(0,len(a),3))
353 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# From this post - app#2
In [165]: %timeit windowed_ufunc(a, ufunc=np.max, W=3)
379 µs ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you want a one-liner, you can use list comprehension:
listA = [3,2,5,9,4,6,3,8,7,9]
listB = [max(listA[i:i+3]) for i in range(0, len(listA), 3)]
print(listB)
It returns:
[5, 9, 8, 9]
Of course, the code can be written more dynamically: if you want a different window size, just change 3 to any integer.
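For instance, a small wrapper with the window size as a parameter (the function name here is mine):
def windowed_max(lst, w):
    # max of each consecutive chunk of length w; the last chunk may be shorter
    return [max(lst[i:i+w]) for i in range(0, len(lst), w)]

windowed_max(listA, 3)   # [5, 9, 8, 9]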
Using numpy, you can extend the list with zeroes so its length is divisible by the window size, then reshape and compute the max along the second axis:
def moving_maxima(a, w):
    mod = len(a) % w
    d = w if mod else mod
    x = np.r_[a, [0]*(d-mod)]       # pad with zeros up to a multiple of w
    return x.reshape(-1, w).max(1)
Some examples:
moving_maxima(listA,2)
# array([3., 9., 6., 8., 9.])
moving_maxima(listA,3)
#array([5, 9, 8, 9])
moving_maxima(listA,4)
#array([9, 8, 9])
So say I have an array:
arr = np.arange(12)
And at the end I want this array:
arr2 = [0,1,2,6,7,8]
So I want a jumping mulitple slice, something like:
arr2 = arr[(0:2):-1:6]
where the second array is a slice of three that jumps 6 every time.
Is this possible in numpy?
My actual case is more complex: one piece of math is applied to the slice (0:2) that jumps by 6, and other math is applied to the slice (3:5), with the goal of writing it in one line, i.e. without a for-loop.
Sorry if this question has been asked before. I'm having trouble finding documentation on this and I think I might just be googling the wrong thing. Thanks!
You can't do this with slice notation, at least not directly.
But with some reshaping:
In [74]: arr = np.arange(12)
In [75]: arr.reshape(-1,3)
Out[75]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [76]: arr.reshape(-1,3)[::2,:]
Out[76]:
array([[0, 1, 2],
       [6, 7, 8]])
In [77]: _.reshape(-1)
Out[77]: array([0, 1, 2, 6, 7, 8])
Individually, slicing and reshaping make views, but somewhere in this sequence a copy has to be made. So the speed advantage over the advanced indexing that Divakar suggests is, at best, modest:
In [86]: timeit arr.reshape(-1,3)[::2,:].reshape(-1)
3.99 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [87]: timeit arr[(np.arange(len(arr))%6)<3]
8.91 µs ± 89.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
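Wrapping the reshape-and-slice idea into a generic helper (my own sketch; the name is made up, and it assumes the array length is a multiple of the block size):
def take_first_of_blocks(a, take, block):
    # keep the first `take` elements out of every `block` elements
    return a.reshape(-1, block)[:, :take].reshape(-1)

take_first_of_blocks(np.arange(12), 3, 6)   # array([0, 1, 2, 6, 7, 8])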
I am calling an object several times that is returning a numpy list:
for x in range(0,100):
    d = simulation3()
    # successive calls return lists like:
    # d = [0, 1, 2, 3]
    # d = [4, 5, 6, 7]
    # ...and many more
I want to take each list and append it to a 2D array.
final_array = [[0, 1, 2, 3],[4, 5, 6, 7]...and so forth]
I tried creating an array of zeros (final_array = np.zeros((4,4))) and appending to it, but the values get appended after the 4x4 matrix of zeros instead of filling it.
Can anyone help me with this? thank you!
You can use np.fromiter to create an array from an iterable. Since, by default, this function only works with scalars, you can use itertools.chain to help:
import numpy as np
from itertools import chain

np.random.seed(0)

def simulation3():
    return np.random.randint(0, 10, 4)

n = 5
d = np.fromiter(chain.from_iterable(simulation3() for _ in range(n)), dtype='i')
d.shape = n, 4
print(d)
array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6],
       [8, 8, 1, 6],
       [7, 7, 8, 1]], dtype=int32)
But this is relatively inefficient. NumPy performs best with fixed size arrays. If you know the size of your array in advance, you can define an empty array and update rows sequentially. See the alternatives described by #norok2.
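A minimal sketch of that preallocation idea, assuming you know the number of simulations and the row length up front:
n_runs, row_len = 100, 4
final_array = np.empty((n_runs, row_len), dtype=int)
for i in range(n_runs):
    final_array[i] = simulation3()   # each call fills one row in place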
There are multiple ways to do it in NumPy; the easiest is to use np.vstack. For example:
import numpy as np

# the arrays you want to concatenate
d1 = [0, 1, 2, 3]
d2 = [4, 5, 6, 7]
d3 = [4, 5, 6, 7]

# initialize your variable with zero rows
X = np.zeros((0, 4))

# then each time you call your function, use np.vstack like this:
X = np.vstack((np.array(d1), X))
X = np.vstack((np.array(d2), X))
X = np.vstack((np.array(d3), X))

# and finally you have your array like below:
# array([[4., 5., 6., 7.],
#        [4., 5., 6., 7.],
#        [0., 1., 2., 3.]])
The optimal solution depends on the numbers / sizes you are dealing with.
My favorite solution (which only works if you already know the size of the final result) is to initialize the array that will contain your results and then fill it row by row using views.
This is the most memory-efficient solution.
If you do not know the size of the final result, then you are better off by generating a list of lists, which can be converted (or stacked) as a NumPy array at the end of the process.
Here are some examples, where gen_1d_list() is used to generate some random numbers to mimic the result of simulation3() (meaning that in the following code, you should replace gen_1d_list(n, dtype) with simulation3()):
stacking1() implements the filling using views
stacking2() implements the list generation and converting to NumPy array
stacking3() implements the list generation and stacking to NumPy array
stacking4() implements the dynamic modification of a NumPy array using vstack() as proposed earlier.
import numpy as np

def gen_1d_list(n, dtype=int):
    return list(np.random.randint(1, 100, n, dtype))

def stacking1(n, m, dtype=int):
    arr = np.empty((n, m), dtype=dtype)
    for i in range(n):
        arr[i] = gen_1d_list(m, dtype)
    return arr

def stacking2(n, m, dtype=int):
    items = [gen_1d_list(m, dtype) for i in range(n)]
    arr = np.array(items)
    return arr

def stacking3(n, m, dtype=int):
    items = [gen_1d_list(m, dtype) for i in range(n)]
    arr = np.stack(items)
    return arr

def stacking4(n, m, dtype=int):
    arr = np.zeros((0, m), dtype=dtype)
    for i in range(n):
        arr = np.vstack((gen_1d_list(m, dtype), arr))
    return arr
Time-wise, stacking1() and stacking2() are more or less equally fast, while stacking3() and stacking4() are slower (and, in proportion, much slower for small size inputs).
Some numbers, for small size inputs:
n, m = 4, 10
%timeit stacking1(n, m)
# 15.7 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit stacking2(n, m)
# 14.2 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit stacking3(n, m)
# 22.7 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit stacking4(n, m)
# 31.8 µs ± 270 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and for larger size inputs:
n, m = 4, 1000000
%timeit stacking1(n, m)
# 344 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stacking2(n, m)
# 350 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stacking3(n, m)
# 370 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stacking4(n, m)
# 369 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Given input
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
Need output:
array([[2, 3],
       [4, 6],
       [7, 8]])
It is easy to do this with iteration or a loop, but there should be a neat way to do it without loops. Thanks.
Approach #1
One approach with masking -
A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)
Sample run -
In [395]: A
Out[395]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
In [396]: A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)
Out[396]:
array([[2, 3],
       [4, 6],
       [7, 8]])
Approach #2
Using the regular pattern of the non-diagonal elements, which can be traced with broadcasted additions of range arrays -
m = A.shape[0]
idx = (np.arange(1,m+1) + (m+1)*np.arange(m-1)[:,None]).reshape(m,-1)
out = A.ravel()[idx]
Approach #3 (Strides Strikes!)
Abusing the regular pattern of non-diagonal elements from the previous approach, we can introduce np.lib.stride_tricks.as_strided and some slicing help, like so -
m = A.shape[0]
strided = np.lib.stride_tricks.as_strided
s0,s1 = A.strides
out = strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1)).reshape(m,-1)
Runtime test
Approaches as funcs :
def skip_diag_masking(A):
    return A[~np.eye(A.shape[0], dtype=bool)].reshape(A.shape[0], -1)

def skip_diag_broadcasting(A):
    m = A.shape[0]
    idx = (np.arange(1, m+1) + (m+1)*np.arange(m-1)[:, None]).reshape(m, -1)
    return A.ravel()[idx]

def skip_diag_strided(A):
    m = A.shape[0]
    strided = np.lib.stride_tricks.as_strided
    s0, s1 = A.strides
    return strided(A.ravel()[1:], shape=(m-1, m), strides=(s0+s1, s1)).reshape(m, -1)
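A quick sanity check (my addition) that the three functions agree on the small example before timing them:
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
assert np.array_equal(skip_diag_masking(A), skip_diag_broadcasting(A))
assert np.array_equal(skip_diag_masking(A), skip_diag_strided(A))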
Timings -
In [528]: A = np.random.randint(11,99,(5000,5000))
In [529]: %timeit skip_diag_masking(A)
...: %timeit skip_diag_broadcasting(A)
...: %timeit skip_diag_strided(A)
...:
10 loops, best of 3: 56.1 ms per loop
10 loops, best of 3: 82.1 ms per loop
10 loops, best of 3: 32.6 ms per loop
I know I'm late to this party, but I have what I believe is a simpler solution. So you want to remove the diagonal? Okay cool:
replace it with NaN
keep everything that is not NaN (this flattens to one dimension, since it can't assume the result will be square)
reset the dimensionality
arr = np.array([[1,2,3],[4,5,6],[7,8,9]]).astype(float)
np.fill_diagonal(arr, np.nan)
arr[~np.isnan(arr)].reshape(arr.shape[0], arr.shape[1] - 1)
Solution steps:
Flatten your array
Delete the diagonal elements, which sit at the positions range(0, len(x_no_diag), len(x) + 1)
Reshape your array to (num_rows, num_columns - 1)
The function:
import numpy as np

def remove_diag(x):
    x_no_diag = np.ndarray.flatten(x)
    x_no_diag = np.delete(x_no_diag, range(0, len(x_no_diag), len(x) + 1), 0)
    x_no_diag = x_no_diag.reshape(len(x), len(x) - 1)
    return x_no_diag
Example:
>>> x = np.random.randint(5, size=(3,3))
>>> x
array([[0, 2, 3],
       [3, 4, 1],
       [2, 4, 0]])
>>> remove_diag(x)
array([[2, 3],
       [3, 1],
       [2, 4]])
Just with numpy, assuming a square matrix:
new_A = numpy.delete(A,range(0,A.shape[0]**2,(A.shape[0]+1))).reshape(A.shape[0],(A.shape[1]-1))
If you do not mind creating a new array, then you can use list comprehension.
A = np.array([A[i][A[i] != A[i][i]] for i in range(len(A))])
Rerunning the same methods as #Divakar, plus the list comprehension above (skip_diag_list_comp):
A = np.random.randint(11,99,(5000,5000))
skip_diag_masking
85.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_broadcasting
163 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_strided
52.5 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_list_comp
101 ms ± 347 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Perhaps the cleanest way, based on Divakar's first solution but using len(array) instead of array.shape[0], is:
array_without_diagonal = array[~np.eye(len(array), dtype=bool)].reshape(len(array), -1)
I love all the answers here, but would like to add one in case your numpy object has more than 2 dimensions. In that case, you can use the following adjustment of Divakar's approach #1:
def remove_diag(A):
    removed = A[~np.eye(A.shape[0], dtype=bool)].reshape(A.shape[0], int(A.shape[0]) - 1, -1)
    return np.squeeze(removed)
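For example (my own quick check with a small 3x3x2 array, i.e. a 2-vector at each cell of a 3x3 grid):
A3 = np.arange(18).reshape(3, 3, 2)
remove_diag(A3).shape   # (3, 2, 2): the diagonal cells are gone, the last axis is kept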
Another approach is to use numpy.delete(). Assuming a square matrix, you can use:
numpy.delete(A, range(0, A.shape[0]**2, A.shape[0] + 1)).reshape(A.shape[0], A.shape[1] - 1)
I have a very large 2D array which looks something like this:
a=
[[a1, b1, c1],
[a2, b2, c2],
...,
[an, bn, cn]]
Using numpy, is there an easy way to get a new 2D array with, e.g., 2 random rows from the initial array a (without replacement)?
e.g.
b=
[[a4, b4, c4],
[a99, b99, c99]]
>>> A = np.random.randint(5, size=(10,3))
>>> A
array([[1, 3, 0],
       [3, 2, 0],
       [0, 2, 1],
       [1, 1, 4],
       [3, 2, 2],
       [0, 1, 0],
       [1, 3, 1],
       [0, 4, 1],
       [2, 4, 2],
       [3, 3, 1]])
>>> idx = np.random.randint(10, size=2)
>>> idx
array([7, 6])
>>> A[idx,:]
array([[0, 4, 1],
       [1, 3, 1]])
Putting it together for a general case:
A[np.random.randint(A.shape[0], size=2), :]
For non replacement (numpy 1.7.0+):
A[np.random.choice(A.shape[0], 2, replace=False), :]
I do not believe there is a good way to generate a random list without replacement before 1.7. Perhaps you can set up a small definition that ensures the two values are not the same.
This is an old post, but this is what works best for me:
A[np.random.choice(A.shape[0], num_rows_2_sample, replace=False)]
Change replace=False to True to get the same thing, but with replacement.
Another option is to create a random mask if you just want to down-sample your data by a certain factor. Say I want to down-sample to 25% of my original data set, which is currently held in the array data_arr:
# generate random boolean mask the length of data
# use p 0.75 for False and 0.25 for True
mask = numpy.random.choice([False, True], len(data_arr), p=[0.75, 0.25])
Now you can call data_arr[mask] and return ~25% of the rows, randomly sampled.
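A hedged end-to-end sketch, with data_arr as a stand-in for whatever array you actually have:
import numpy

data_arr = numpy.random.rand(1000, 3)   # stand-in data
mask = numpy.random.choice([False, True], len(data_arr), p=[0.75, 0.25])
sampled = data_arr[mask]                # roughly 25% of the rows, in their original order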
This is a similar answer to the one Hezi Rasheff provided, but simplified so newer Python users understand what's going on (I noticed many new data science students fetch random samples in the weirdest ways because they don't know what they are doing in Python).
You can get a number of random indices from your array by using:
indices = np.random.choice(A.shape[0], number_of_samples, replace=False)
You can then use fancy indexing with your numpy array to get the samples at those indices:
A[indices]
This will get you the specified number of random samples from your data.
I see permutation has been suggested. In fact it can be made into one line:
>>> A = np.random.randint(5, size=(10,3))
>>> np.random.permutation(A)[:2]
array([[0, 3, 0],
       [3, 1, 2]])
If you want to generate multiple random subsets of rows, for example if you're doing RANSAC:
num_pop = 10
num_samples = 2
pop_in_sample = 3
rows_to_sample = np.random.random([num_pop, 5])
random_numbers = np.random.random([num_samples, num_pop])
samples = np.argsort(random_numbers, axis=1)[:, :pop_in_sample]
# will be shape [num_samples, pop_in_sample, 5]
row_subsets = rows_to_sample[samples, :]
An alternative way of doing it is by using the choice method of the Generator class, https://github.com/numpy/numpy/issues/10835
import numpy as np
# generate the random array
A = np.random.randint(5, size=(10,3))
# use the choice method of the Generator class
rng = np.random.default_rng()
A_sampled = rng.choice(A, 2)
leading to sampled data like:
array([[1, 3, 2],
       [1, 2, 1]])
The running time is also profiled compared as follows,
%timeit rng.choice(A, 2)
15.1 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.random.permutation(A)[:2]
4.22 µs ± 83.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit A[np.random.randint(A.shape[0], size=2), :]
10.6 µs ± 418 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But when the array gets big, e.g. A = np.random.randint(10, size=(1000,300)), working on the index is the best way.
%timeit A[np.random.randint(A.shape[0], size=50), :]
17.6 µs ± 657 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit rng.choice(A, 50)
22.3 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.random.permutation(A)[:50]
143 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So the permutation method seems to be the most efficient when your array is small, while working on the index is the optimal solution when your array gets big.
If you just need a random sample of the rows, then:
import random
new_array = random.sample(old_array,x)
Here x has to be an int defining the number of rows you want to randomly pick.
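Note that random.sample returns a plain Python list of rows; a small sketch (with a made-up old_array) that converts the result back to a NumPy array:
import random
import numpy as np

old_array = np.arange(12).reshape(4, 3)    # stand-in data
rows = random.sample(list(old_array), 2)   # two distinct rows, as a Python list
new_array = np.array(rows)                 # back to a 2-D NumPy array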
One can generate a random sample from a given array with a random number generator:
rng = np.random.default_rng()
b = rng.choice(a, 2, replace=False)
b
>>> [[a4, b4, c4],
     [a99, b99, c99]]
I am quite surprised that this much easier-to-read solution has not been proposed after more than 10 years:
import random

b = np.array(
    random.choices(a, k=2)
)
Edit: Ah, maybe because it was only introduced in Python 3.6, but still…