python: mean of a variable-length 2D matrix

Consider the following variable length 2D array
[
[1, 2, 3],
[4, 5],
[6, 7, 8, 9]
]
How can I find the mean of the values along each column?
I want something like [(1+4+6)/3,(2+5+7)/3, (3+8)/2, 9/1]
So the end result would be [3.667, 4.667, 5.5, 9]
Is this possible using numpy?
I tried np.mean(x, axis=0), but numpy expects arrays of the same length.
Right now, I am popping the elements of each column and finding the mean. Is there a better way to achieve the result?

You could use pandas:
import pandas as pd
a = [[1, 2, 3],
     [4, 5],
     [6, 7, 8, 9]]
df = pd.DataFrame(a)
#    0  1    2    3
# 0  1  2    3  NaN
# 1  4  5  NaN  NaN
# 2  6  7    8    9
df.mean()
# 0    3.666667
# 1    4.666667
# 2    5.500000
# 3    9.000000
# dtype: float64
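If a plain Python list is needed rather than a Series, the result converts back with tolist() (a small follow-up, not part of the original answer):
df.mean().tolist()
# [3.6666666666666665, 4.666666666666667, 5.5, 9.0]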
Here is another solution that only uses numpy:
import numpy as np
nrows = len(a)
ncols = max(len(row) for row in a)
arr = np.zeros((nrows, ncols))
arr.fill(np.nan)
for jrow, row in enumerate(a):
    for jcol, col in enumerate(row):
        arr[jrow, jcol] = col
print(np.nanmean(arr, axis=0))
# array([ 3.66666667, 4.66666667, 5.5 , 9. ])
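Equivalently, the NaN-padded array can be built with np.full and row slicing; a small variant sketch of the same idea:
arr = np.full((nrows, ncols), np.nan)   # NaN-filled array in one step
for i, row in enumerate(a):
    arr[i, :len(row)] = row             # copy each ragged row into place
print(np.nanmean(arr, axis=0))
# array([ 3.66666667,  4.66666667,  5.5       ,  9.        ])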

A very simple alternative approach uses itertools.izip_longest() (Python 2):
>>> from itertools import izip_longest
>>> mean_list = []
>>> for sub_list in izip_longest(*my_list):
...     # keep only real values; filter(None, ...) would also drop legitimate zeros
...     filtered_list = [v for v in sub_list if v is not None]
...     mean_list.append(sum(filtered_list) / (len(filtered_list) * 1.0))
...
>>> mean_list
[3.6666666666666665, 4.666666666666667, 5.5, 9.0]
where my_list is:
[
[1, 2, 3],
[4, 5],
[6, 7, 8, 9]
]

Listed in this post is an almost vectorized approach using NumPy. We assign each element an ID based on its position within its own row (i.e. its column index). These IDs can then be fed to np.bincount, which performs ID-based summations. Finally, we divide the sums by the count of each ID to get the final average values.
Thus, we would have an implementation like so -
def variable_mean(a):
    vals = np.concatenate(a)
    lens = np.array([len(row) for row in a])  # list comprehension so it also works on Python 3
    id_arr = np.ones(vals.size, dtype=int)
    id_arr[0] = 0
    id_arr[lens.cumsum()[:-1]] = -lens[:-1] + 1
    IDs = id_arr.cumsum()
    return np.bincount(IDs, vals) / np.bincount(IDs)
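As a quick sanity check, calling it on the sample input from the question reproduces the expected result:
variable_mean([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
# array([ 3.66666667,  4.66666667,  5.5       ,  9.        ])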
Runtime test -
In [298]: # Setup input
...: N = 1000 # number of elems in input list
...: minL = 3 # min len of an element (list) in input list
...: maxL = 10 # max len of an element (list) in input list
...: a = [list(np.random.randint(0,9,(i))) \
...: for i in np.random.randint(minL,maxL,(N))]
...:
In [299]: %timeit pd.DataFrame(a).mean()  # @Julien Spronck's pandas soln
100 loops, best of 3: 3.33 ms per loop
In [300]: %timeit variable_mean(a)
100 loops, best of 3: 2.36 ms per loop
In [301]: # Setup input
...: N = 1000 # number of elems in input list
...: minL = 3 # min len of an element (list) in input list
...: maxL = 100 # max len of an element (list) in input list
...: a = [list(np.random.randint(0,9,(i))) \
...: for i in np.random.randint(minL,maxL,(N))]
...:
In [302]: %timeit pd.DataFrame(a).mean()  # @Julien Spronck's pandas soln
10 loops, best of 3: 27.1 ms per loop
In [303]: %timeit variable_mean(a)
100 loops, best of 3: 9.58 ms per loop

If you want to do it manually, what I would do:
Figure out the max array length:
max_length = 0
for array in arrays:
    if len(array) > max_length:
        max_length = len(array)
Pad all arrays to that length with None:
for array in arrays:
    while len(array) < max_length:
        array.append(None)
Zip will group the columns:
columns = zip(*arrays)
columns == [(1, 4, 6), (2, 5, 7), (3, None, 8), (None, None, 9)]
Calculate the average as you would for any list:
for col in columns:
    count = 0
    total = 0.0
    for num in col:
        if num is not None:
            count += 1
            total += float(num)
    print("%s: Avg %s" % (col, total / count))
Or as a list comprehension after padding the arrays (testing is not None so that legitimate zeros are kept):
[sum(x for x in col if x is not None) / float(sum(x is not None for x in col)) for col in zip(*arrays)]
Output:
(1, 4, 6): Avg 3.66666666667
(2, 5, 7): Avg 4.66666666667
(3, None, 8): Avg 5.5
(None, None, 9): Avg 9.0

In Py3, zip_longest takes a fillvalue parameter:
In [1208]: ll=[
...: [1, 2, 3],
...: [4, 5],
...: [6, 7, 8, 9]
...: ]
In [1209]: list(itertools.zip_longest(*ll, fillvalue=np.nan))
Out[1209]: [(1, 4, 6), (2, 5, 7), (3, nan, 8), (nan, nan, 9)]
By filling with nan, I can use np.nanmean to take the mean ignoring the nan. nanmean turns its input (here _ from the previous line) into an array:
In [1210]: np.nanmean(_, axis=1)
Out[1210]: array([ 3.66666667, 4.66666667, 5.5 , 9. ])
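Putting the two steps together as a small helper; a sketch assuming Python 3 (the function name is just illustrative):
import itertools
import numpy as np

def ragged_column_means(rows):
    # transpose with NaN padding, then take NaN-ignoring means per column
    padded = list(itertools.zip_longest(*rows, fillvalue=np.nan))
    return np.nanmean(padded, axis=1)

ragged_column_means([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
# array([ 3.66666667,  4.66666667,  5.5       ,  9.        ])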

Related

Index array based on value limits of another

Let's say I have an array (or even a list) that looks like:
tmp_data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
And then I have another array of distance values:
dist_data = [15.625, 46.875, 78.125, 109.375, 140.625, 171.875, 203.125, 234.375, 265.625, 296.875]
Now, say I want to group the data by distance and perform an operation on the corresponding elements of tmp_data; for this example, let's just take the max value, and let's set the threshold distance to 100. That is, for every block of 100 distance units I want to take the corresponding elements of tmp_data and replace them all with the maximum value in that small array. For example, I would want the final output to be
max_tmp_data_100 = [2,2,2,5,5,5,8,8,8,9]
This is because the first 3 elements in dist_data are below 100, so we take the first three elements of tmp_data (0, 1, 2), find their maximum, and replace all of them with that value, 2.
Then the next set of data, falling below the next 100-unit boundary, would be
tmp_dist_array_100 = [109.375 140.625 171.875]
tmp_data_100 = [3,4,5]
max_tmp_data_100 = [5,5,5]
(append to [2,2,2])
I have come up with the following:
# Initialize
final_res = 100  # the threshold distance
final_array = []
d_array = []
idx = 1
for i in range(0, 10):
    if dist_data[i] < idx * final_res:
        d_array.append(tmp_data[i])
    elif dist_data[i] > idx * final_res:
        # Now get the values
        max_val = np.amax(d_array)
        new_array = np.ones(len(d_array)) * max_val
        final_array.extend(new_array)
        idx = idx + 1
But the outcome is
[2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0]
When it should be [2,2,2,5,5,5,8,8,8,9]
With numpy:
import numpy as np
dist_data = [15.625, 46.875, 78.125, 109.375, 140.625, 171.875, 203.125, 234.375, 265.625, 296.875]
cut = 100
a = np.array(dist_data)
# index of the last element in each block of `cut` distance units
vals = np.searchsorted(a, np.r_[cut:a.max() + cut:cut]) - 1
# map each element to its block's value
print(vals[(a/cut).astype(int)])
It gives:
[2 2 2 5 5 5 9 9 9 9]
You can do it with groupby:
from itertools import groupby
dist_data = [15.625, 46.875, 78.125, 109.375, 140.625, 171.875, 203.125, 234.375, 265.625, 296.875]
tmp_data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
result = []
index_list = [[dist_data.index(i) for i in l]
              for k, l in groupby(dist_data, key=lambda x: x // 100)]
for i in tmp_data:
    for lst in index_list:
        if i in lst:
            result.append(max(lst))
print(result)
# [2, 2, 2, 5, 5, 5, 9, 9, 9, 9]
As per your requirements, the last 4 elements fall under the next threshold value, and the max of those 4 elements is 9.
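If a fully vectorized route is preferred, np.maximum.reduceat can produce the same result; a sketch (not from the answers above) that assumes the same bucketing by dist // 100:
import numpy as np

dist = np.array([15.625, 46.875, 78.125, 109.375, 140.625,
                 171.875, 203.125, 234.375, 265.625, 296.875])
tmp = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
cut = 100

bucket = (dist // cut).astype(int)                                # bucket id per element
starts = np.flatnonzero(np.r_[True, bucket[1:] != bucket[:-1]])   # first index of each bucket
group_max = np.maximum.reduceat(tmp, starts)                      # max of tmp within each bucket
sizes = np.diff(np.r_[starts, len(tmp)])                          # number of elements per bucket
print(np.repeat(group_max, sizes))
# [2 2 2 5 5 5 9 9 9 9]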

Find consecutive sequences based on Boolean array

I'm trying to extract sequences from an array b, using a boolean array a as the index (len(a) >= len(b), but a.sum() == len(b), i.e. there are exactly as many True values in a as there are elements in b). Each run of consecutive True values in a should be represented in the result by the start and end index of that run in a.
For instance, for the following arrays of a and b
a = np.asarray([True, True, False, False, False, True, True, True, False])
b = [1, 2, 3, 4, 5]
the result should be [((0, 1), [1, 2]), ((5, 7), [3, 4, 5])], i.e. one entry per run of True values, where each entry contains the start and end index from a and the values these relate to from b.
So for the above:
[
((0, 1), [1, 2]), # first true sequence: starting at index=0 (in a), ending at index=1, mapping to the values [1, 2] in b
((5, 7), [3, 4, 5]) # second true sequence: starting at index=5, ending at index=7, with values in b=[3, 4, 5]
]
How can this be done efficiently in numpy?
Here's a NumPy-based one inspired by this post -
def func1(a, b):
    # "Enclose" mask with sentinels to catch shifts later on
    mask = np.r_[False, a, False]
    # Get the shifting indices
    idx = np.flatnonzero(mask[1:] != mask[:-1])
    s0, s1 = idx[::2], idx[1::2]
    idx_b = np.r_[0, (s1 - s0).cumsum()]
    out = []
    for (i, j, k, l) in zip(s0, s1 - 1, idx_b[:-1], idx_b[1:]):
        out.append(((i, j), b[k:l]))
    return out
Sample run -
In [104]: a
Out[104]: array([ True, True, False, False, False, True, True, True, False])
In [105]: b
Out[105]: [1, 2, 3, 4, 5]
In [106]: func1(a,b)
Out[106]: [((0, 1), [1, 2]), ((5, 7), [3, 4, 5])]
Timings -
In [156]: # Using given sample data and tiling it 1000x
...: a = np.asarray([True, True, False, False, False, True, True, True, False])
...: b = [1, 2, 3, 4, 5]
...: a = np.tile(a,1000)
...: b = np.tile(b,1000)
# @Chris's soln
In [157]: %%timeit
     ...: res = []
     ...: gen = (i for i in b)
     ...: for k, g in itertools.groupby(enumerate(a), lambda x: x[1]):
     ...:     if k:
     ...:         ind, bools = list(zip(*g))
     ...:         res.append((ind[0::len(ind)-1], list(itertools.islice(gen, len(bools)))))
100 loops, best of 3: 13.8 ms per loop
In [158]: %timeit func1(a,b)
1000 loops, best of 3: 1.29 ms per loop
Using itertools.groupby and itertools.islice:
import itertools
res = []
gen = (i for i in b)
for k, g in itertools.groupby(enumerate(a), lambda x: x[1]):
    if k:
        ind, bools = list(zip(*g))
        res.append((ind[0::len(ind)-1], list(itertools.islice(gen, len(bools)))))
Output
[((0, 1), [1, 2]), ((5, 7), [3, 4, 5])]
Insights:
itertools.groupby groups consecutive runs of True and False values.
lst[0::len(lst)-1] returns the first and last elements of lst.
Since b has exactly as many elements as there are True values, make b a generator and grab as many elements as there are True values in each group.
Time taken:
def itertool_version():
    res = []
    gen = (i for i in b)
    for k, g in itertools.groupby(enumerate(a), lambda x: x[1]):
        if k:
            ind, bools = list(zip(*g))
            res.append((ind[0::len(ind)-1], list(itertools.islice(gen, len(bools)))))
    return res

%timeit itertool_version()
7.11 µs ± 313 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I don't know about a solution using numpy, but maybe the following for-loop solution will help you (or others) find a different, more efficient solution:
import numpy as np
a = np.asarray([True, True, False, False, False, True, True, True, False])
b = []
temp_list = []
count = 0
for val in a:
    if val:
        count += 1
        # Only add the first 'True' value in a sequence
        if len(temp_list) == 0:
            temp_list.append(count)
    # Code only reached if val is not true -> append to b if temp_list has an entry
    elif len(temp_list) > 0:
        temp_list.append(count)  # Add the last true value in a sequence
        b.append(temp_list)
        temp_list = []
print(b)
>>> [[1, 2], [3, 5]]
Here is my two cents. Hope it helps. [EDITED]
# Get Data
import numpy as np
a = np.asarray([True, True, False, False, False, True, True, True, False])
b = [1, 2, 3, 4, 5]
# Write the values of b into the True positions
ac = a.astype(float)
ac[ac == 1] = b
# Select edges
ac[(np.roll(ac, 1) != 0) & (np.roll(ac, -1) != 0)] = 0 # Clear out intermediates
indices = ac[ac != 0] # Select only edges
indices.reshape(2, int(indices.shape[0]/2)) # group in pairs
Output
>> [[1, 2], [3, 5]]
This solution uses numpy's where():
result = []
f = np.where(a)[0]
m = 1
for j in list(create(f)):
    lo = j[1] - j[0] + 1
    result.append((j, [*range(m, m + lo)]))
    m += lo
print(result)
#OUTPUT: [((0, 1), [1, 2]), ((5, 7), [3, 4, 5])]
And here is a helper method to split an index array [0 1 5 6 7] into runs --> [(0, 1), (5, 7)]:
def create(k):
    le = len(k)
    i = 0
    while i < le:
        left = k[i]
        while i < le - 1 and k[i] + 1 == k[i + 1]:
            i += 1
        right = k[i]
        if right - left >= 1:
            yield (left, right)
        elif right - left == 1:
            yield (left, )
            yield (right, )
        else:
            yield (left, )
        i += 1

Stepping with multiple values while slicing an array in Python

I am trying to take the first m values out of every n elements of an array. For example, for m = 2 and n = 5, and given
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
I want to retrieve
b = [1, 2, 6, 7]
Is there a way to do this using slicing? I can do this using a nested list comprehension, but I was wondering if there was a way to do this using the indices only. For reference, the list comprehension way is:
b = [k for j in [a[i:i+2] for i in range(0,len(a),5)] for k in j]
I agree with wim that you can't do it with just slicing. But you can do it with just one list comprehension:
>>> [x for i,x in enumerate(a) if i%n < m]
[1, 2, 6, 7]
No, that is not possible with slicing. Slicing only supports start, stop, and step - there is no way to represent stepping with "groups" of size larger than 1.
In short, no, you cannot. But you can use itertools to remove the need for intermediary lists:
from itertools import chain, islice
res = list(chain.from_iterable(islice(a, i, i+2) for i in range(0, len(a), 5)))
print(res)
[1, 2, 6, 7]
Borrowing @Kevin's logic, if you want a vectorised solution to avoid a for loop, you can use the 3rd-party library numpy:
import numpy as np
m, n = 2, 5
a = np.array(a) # convert to numpy array
res = a[np.where(np.arange(a.shape[0]) % n < m)]
There are other ways to do it, which all have advantages for some cases, but none are "just slicing".
The most general solution is probably to group your input, slice the groups, then flatten the slices back out. One advantage of this solution is that you can do it lazily, without building big intermediate lists, and you can do it to any iterable, including a lazy iterator, not just a list.
import itertools

# from itertools recipes in the docs
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)
groups = grouper(a, 5)
truncated = (group[:2] for group in groups)
b = [elem for group in truncated for elem in group]
And you can convert that into a pretty simple one-liner, although you still need the grouper function:
b = [elem for group in grouper(a, 5) for elem in group[:2]]
Another option is to build a list of indices, and use itemgetter to grab all the values. This might be more readable for a more complicated function than just "the first 2 of every 5", but it's probably less readable for something as simple as your use:
import operator
indices = [i for i in range(len(a)) if i % 5 < 2]
b = operator.itemgetter(*indices)(a)
… which can be turned into a one-liner:
b = operator.itemgetter(*[i for i in range(len(a)) if i%5 < 2])(a)
And you can combine the advantages of the two approaches by writing your own version of itemgetter that takes a lazy index iterator—which I won't show, because you can go even better by writing one that takes an index filter function instead:
def indexfilter(pred, a):
    return [elem for i, elem in enumerate(a) if pred(i)]

b = indexfilter((lambda i: i % 5 < 2), a)
(To make indexfilter lazy, just replace the brackets with parens.)
… or, as a one-liner:
b = [elem for i, elem in enumerate(a) if i%5<2]
I think this last one might be the most readable. And it works with any iterable rather than just lists, and it can be made lazy (again, just replace the brackets with parens). But I still don't think it's simpler than your original comprehension, and it's not just slicing.
The question says array, and if we are talking about NumPy arrays, we can use a few obvious NumPy tricks and a few not-so-obvious ones. Under certain conditions we can use slicing to get a 2D view into the input.
Now, based on the array length (call it l) and the values of m and n, we have three scenarios:
Scenario #1 : l is divisible by n
We can use slicing and reshaping to get a view into the input array and hence get constant runtime.
Verify the view concept :
In [108]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
In [109]: m = 2; n = 5
In [110]: a.reshape(-1,n)[:,:m]
Out[110]:
array([[1, 2],
[6, 7]])
In [111]: np.shares_memory(a, a.reshape(-1,n)[:,:m])
Out[111]: True
Check timings on a very large array to back up the constant-runtime claim:
In [118]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
In [119]: %timeit a.reshape(-1,n)[:,:m]
1000000 loops, best of 3: 563 ns per loop
In [120]: a = np.arange(10000000)
In [121]: %timeit a.reshape(-1,n)[:,:m]
1000000 loops, best of 3: 564 ns per loop
To get a flattened version:
If we need a flattened array as output, we just apply .ravel(), like so -
In [127]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
In [128]: m = 2; n = 5
In [129]: a.reshape(-1,n)[:,:m].ravel()
Out[129]: array([1, 2, 6, 7])
Timings show that it's not too bad when compared with the other looping and vectorized numpy.where versions from other posts -
In [143]: a = np.arange(10000000)
# @Kevin's soln
In [145]: %timeit [x for i,x in enumerate(a) if i%n < m]
1 loop, best of 3: 1.23 s per loop
# @jpp's soln
In [147]: %timeit a[np.where(np.arange(a.shape[0]) % n < m)]
10 loops, best of 3: 145 ms per loop
In [144]: %timeit a.reshape(-1,n)[:,:m].ravel()
100 loops, best of 3: 16.4 ms per loop
Scenario #2 : l is not divisible by n, but the last, incomplete group still has at least m elements
We turn to the non-obvious np.lib.stride_tricks.as_strided, which allows us to go beyond the memory block bounds (so we need to be careful not to write into the view) and thereby facilitates a slicing-based solution. The implementation would look something like this -
def select_groups(a, m, n):
    a = np.asarray(a)
    strided = np.lib.stride_tricks.as_strided
    # Get params defining the lengths for slicing and output array shape
    nrows = len(a)//n
    add0 = len(a)%n
    s = a.strides[0]
    out_shape = nrows + int(add0 != 0), m
    # Finally stride to get the 2D view into the input
    return strided(a, shape=out_shape, strides=(s*n, s))
A sample run to verify that the output is a view -
In [151]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
In [152]: m = 2; n = 5
In [153]: select_groups(a, m, n)
Out[153]:
array([[ 1, 2],
[ 6, 7],
[11, 12]])
In [154]: np.shares_memory(a, select_groups(a, m, n))
Out[154]: True
To get a flattened version, append .ravel().
Let's get some timings comparison -
In [158]: a = np.arange(10000003)
In [159]: m = 2; n = 5
# @Kevin's soln
In [161]: %timeit [x for i,x in enumerate(a) if i%n < m]
1 loop, best of 3: 1.24 s per loop
# @jpp's soln
In [162]: %timeit a[np.where(np.arange(a.shape[0]) % n < m)]
10 loops, best of 3: 148 ms per loop
In [160]: %timeit select_groups(a, m=m, n=n)
100000 loops, best of 3: 5.8 µs per loop
If we need a flattened version, it's still not too bad -
In [163]: %timeit select_groups(a, m=m, n=n).ravel()
100 loops, best of 3: 16.5 ms per loop
Scenario #3 : l is not divisible by n, and the last group has fewer than m elements
For this case, we would need an extra slicing at the end on top of what we had in the previous method, like so -
def select_groups_generic(a, m, n):
    a = np.asarray(a)
    strided = np.lib.stride_tricks.as_strided
    # Get params defining the lengths for slicing and output array shape
    nrows = len(a)//n
    add0 = len(a)%n
    lim = m*nrows + add0
    s = a.strides[0]
    out_shape = nrows + int(add0 != 0), m
    # Finally stride, flatten with reshape and slice
    return strided(a, shape=out_shape, strides=(s*n, s)).reshape(-1)[:lim]
Sample run -
In [166]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [167]: m = 2; n = 5
In [168]: select_groups_generic(a, m, n)
Out[168]: array([ 1, 2, 6, 7, 11])
Timings -
In [170]: a = np.arange(10000001)
In [171]: m = 2; n = 5
# @Kevin's soln
In [172]: %timeit [x for i,x in enumerate(a) if i%n < m]
1 loop, best of 3: 1.23 s per loop
# @jpp's soln
In [173]: %timeit a[np.where(np.arange(a.shape[0]) % n < m)]
10 loops, best of 3: 145 ms per loop
In [174]: %timeit select_groups_generic(a, m, n)
100 loops, best of 3: 12.2 ms per loop
I realize that recursion isn't popular, but would something like this work? Also, uncertain if adding recursion to the mix counts as just using slices.
def get_elements(A, m, n):
    if len(A) < m:
        return A
    else:
        return A[:m] + get_elements(A[n:], m, n)
A is the array, m and n are defined as in the question. The if covers the base case, where you have an array with length less than the number of elements you're trying to retrieve, and the else branch is the recursive case. I'm somewhat new to Python, please forgive my poor understanding of the language if this doesn't work properly, though I tested it and it seems to work fine.
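For the sample input this reproduces the expected selection (a quick check, assuming A is a plain Python list so that + concatenates):
get_elements([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2, 5)
# [1, 2, 6, 7]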
With itertools you could get an iterator with:
from itertools import compress, cycle
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = 5
m = 2
it = compress(a, cycle([1, 1, 0, 0, 0]))
res = list(it)
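The hard-coded pattern [1, 1, 0, 0, 0] can be generalized to any m and n by building the selector from them; a small sketch (the helper name is just illustrative):
from itertools import compress, cycle

def step_select(a, m, n):
    # keep the first m elements of every block of n
    return compress(a, cycle([1] * m + [0] * (n - m)))

list(step_select([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2, 5))
# [1, 2, 6, 7]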

Pandas - assign histogram bucket to each row

Here is my dataframe:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0,3),(3,5),(5,9)]
I also have the histogram buckets stated above. Now I would like to assign each row of the dataframe to its bucket index, so I would like to get a new column with the following info:
df['buckets_index'] = [0,0,0,1,2,1,0,0,2]
Of course, I can do it with loops, but I have a fairly big dataframe (2.5 million rows), so I need to get it done quickly.
Any thoughts?
You can use pd.cut, with labels=False if you only want the index:
buckets = [0,3,5,9]
df['bucket'] = pd.cut(df['A'], bins=buckets)
df['bucket_idx'] = pd.cut(df['A'], bins=buckets, labels=False)
The resulting output:
   A  bucket  bucket_idx
0  1  (0, 3]           0
1  2  (0, 3]           0
2  3  (0, 3]           0
3  4  (3, 5]           1
4  6  (5, 9]           2
5  4  (3, 5]           1
6  3  (0, 3]           0
7  2  (0, 3]           0
8  7  (5, 9]           2
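If the buckets are given as (low, high) tuples as in the question, the bin edges for pd.cut can be derived from them; a small sketch assuming the buckets are sorted and contiguous:
buckets = [(0, 3), (3, 5), (5, 9)]
bins = [buckets[0][0]] + [hi for lo, hi in buckets]   # [0, 3, 5, 9]
df['bucket_idx'] = pd.cut(df['A'], bins=bins, labels=False)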
You could use np.searchsorted -
df['buckets_index'] = np.asarray(buckets)[:,1].searchsorted(df.A.values)
Runtime test -
In [522]: df = pd.DataFrame({'A': np.random.randint(1,8,(10000))})
In [523]: buckets = [0,3,5,9]
In [524]: %timeit pd.cut(df['A'], bins=buckets, labels=False)
1000 loops, best of 3: 460 µs per loop  # @root's soln
In [525]: buckets = [(0,3),(3,5),(5,9)]
In [526]: %timeit np.asarray(buckets)[:,1].searchsorted(df.A.values)
10000 loops, best of 3: 166 µs per loop
Outside-limits cases: for values beyond the last bucket edge, we need to clip, like so -
np.asarray(buckets)[:,1].searchsorted(df.A.values).clip(max=len(buckets)-1)
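For reference, on the sample frame from the question the searchsorted approach reproduces the expected column (a quick check, relying on searchsorted's default side='left'):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0, 3), (3, 5), (5, 9)]

print(np.asarray(buckets)[:, 1].searchsorted(df.A.values).tolist())
# [0, 0, 0, 1, 2, 1, 0, 0, 2]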

Randomly selecting rows from numpy array

I want to randomly select rows from a numpy array. Say I have this array-
A = [[1, 3, 0],
[3, 2, 0],
[0, 2, 1],
[1, 1, 4],
[3, 2, 2],
[0, 1, 0],
[1, 3, 1],
[0, 4, 1],
[2, 4, 2],
[3, 3, 1]]
To randomly select say 6 rows, I am doing this:
B = A[np.random.choice(A.shape[0], size=6, replace=False), :]
I want another array C which has the rows which were not selected in B.
Is there some in-built method to do this or do I need to do a brute-force, checking rows of B with rows of A?
You can make any number of row-wise random partitions of A by slicing a shuffled sequence of row indices:
ind = numpy.arange( A.shape[ 0 ] )
numpy.random.shuffle( ind )
B = A[ ind[ :6 ], : ]
C = A[ ind[ 6: ], : ]
If you don't want to change the order of the rows in each subset, you can sort each slice of the indices:
B = A[ sorted( ind[ :6 ] ), : ]
C = A[ sorted( ind[ 6: ] ), : ]
(Note that the solution provided by @MaxNoe also preserves row order.)
Solution
This gives you the indices for the selection:
sel = np.random.choice(A.shape[0], size=6, replace=False)
and this B:
B = A[sel]
Get all not selected indices:
unsel = list(set(range(A.shape[0])) - set(sel))
and use them for C:
C = A[unsel]
Variation with NumPy functions
Instead of using set and list, you can use this:
unsel2 = np.setdiff1d(np.arange(A.shape[0]), sel)
For the example array the pure Python version:
%%timeit
unsel1 = list(set(range(A.shape[0])) - set(sel))
100000 loops, best of 3: 8.42 µs per loop
is faster than the NumPy version:
%%timeit
unsel2 = np.setdiff1d(np.arange(A.shape[0]), sel)
10000 loops, best of 3: 77.5 µs per loop
For larger A the NumPy version is faster:
A = np.random.random((int(1e4), 3))
sel = np.random.choice(A.shape[0], size=6, replace=False)
%%timeit
unsel1 = list(set(range(A.shape[0])) - set(sel))
1000 loops, best of 3: 1.4 ms per loop
%%timeit
unsel2 = np.setdiff1d(np.arange(A.shape[0]), sel)
1000 loops, best of 3: 315 µs per loop
You can use boolean masks and draw random indices from an integer array which is as long as yours. The ~ is an elementwise not:
idx = np.arange(A.shape[0])
mask = np.zeros_like(idx, dtype=bool)
selected = np.random.choice(idx, 6, replace=False)
mask[selected] = True
B = A[mask]
C = A[~mask]
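For newer NumPy (1.17+), the same shuffle-and-split idea can be written with the Generator API; a sketch, not from the answers above:
import numpy as np

rng = np.random.default_rng()
perm = rng.permutation(A.shape[0])   # a random ordering of all row indices
B, C = A[perm[:6]], A[perm[6:]]      # selected rows and the remaining rows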
