Generate random subsample of fixed size of numpy array

I have searched for this and couldn't find an answer. Suppose I have a numpy array of size N and want to draw a random subsample of size M from it, i.e. M randomly chosen elements, with N >= M. How can I do this?

Use np.random.choice():
>>> import numpy as np
>>> N = 100; M = 10
>>> a = np.arange(0, N)
>>> np.random.choice(a, M, replace=False)
array([22, 81, 63, 7, 10, 52, 30, 33, 18, 41])
With replace=False you get no repetitions, and in that case M must be <= N.
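The same draw can also be written with NumPy's newer Generator interface. This is a minimal sketch, not part of the original answer; the seed value is an arbitrary assumption added only to make the result reproducible.

```python
import numpy as np

# Seeded generator so the draw is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(seed=0)

N, M = 100, 10
a = np.arange(N)

# replace=False guarantees M distinct elements (requires M <= N).
subsample = rng.choice(a, size=M, replace=False)
print(subsample)
```

Asking for more elements than the array holds with replace=False raises a ValueError, which matches the M <= N requirement noted above.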
Edit: 2d case:
>>> a = np.arange(0, 120).reshape(10, 12)
>>> a
array([[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11],
       [ 12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23],
       [ 24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35],
       [ 36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47],
       [ 48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59],
       [ 60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71],
       [ 72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83],
       [ 84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95],
       [ 96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107],
       [108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119]])
>>> idx = np.arange(0, 10)
>>> rand_idx = np.random.choice(idx, 5, replace=False)
>>> a[rand_idx]
array([[24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],
       [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47],
       [84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95],
       [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
       [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]])
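For the 2d case, the Generator interface can sample whole rows directly through its axis argument, without building an index array first. A rough sketch under the same assumptions as the answer (a 10 x 12 array, 5 rows wanted); the seed is again an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

a = np.arange(120).reshape(10, 12)

# axis=0 samples along the first axis, so each pick is a complete row.
rows = rng.choice(a, size=5, replace=False, axis=0)
print(rows.shape)
```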

Related

Adding each value in an RDD to its partition number

I'm quite new to PySpark, so this might be simple. I have an RDD that holds the values 0 to 99 and has 4 partitions.
A = sc.parallelize(range(100), 4)
I need to return another RDD in which each value has its partition number added to it. The expected result would be:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 100, 101, 102]
I would like to know how to complete the following code to get the desired result.
A = sc.parallelize(range(100), 4)
B =
print(B.collect())
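One possible direction, sketched here without a running Spark cluster: PySpark's RDD.mapPartitionsWithIndex calls a function with the partition index and an iterator over that partition's values. The helper name add_partition_index is hypothetical, and the four 25-value partitions below are a local stand-in that assumes the 100 values are split evenly.

```python
def add_partition_index(index, iterator):
    """Yield each value of one partition plus that partition's index."""
    for value in iterator:
        yield value + index

# Local stand-in for the RDD: four partitions of 25 values each. This even
# split is an assumption; no Spark is needed to run this part.
partitions = [range(start, start + 25) for start in range(0, 100, 25)]

result = []
for index, part in enumerate(partitions):
    result.extend(add_partition_index(index, part))

print(result)
```

With a live SparkContext, the same function could presumably be passed as A.mapPartitionsWithIndex(add_partition_index) to build the new RDD.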

Keep remaining numbers in range 100 except numbers in the array

Array
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
How can I print the numbers in the range 0-100 that are not in a?

You can use sets and subtract a from the range of numbers 0 - 100:
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
print(set(range(101)) - set(a))
Prints:
{1, 2, 4, 6, 7, 9, 10, 13, 16, 17, 19, 23, 24, 25, 27, 29, 30, 31, 32, 34, 35, 36, 37, 39, 40, 43, 44, 45, 46, 47, 48, 49, 50, 53, 54, 55, 56, 57, 58, 59, 60, 63, 66, 68, 70, 71, 72, 74, 75, 77, 78, 80, 81, 87, 88, 89, 90, 91, 92, 95, 98, 99, 100}
If order is crucial, you can filter the range by removing items in a -- still using set(a) to make it efficient.
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
s_a = set(a)
filtered = [n for n in range(101) if n not in s_a]
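Sets do not guarantee any iteration order, so if a sorted result is wanted from the set-difference version it is safer to sort it explicitly. A small sketch comparing both approaches on the tuple from the question; the variable names are illustrative only.

```python
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41,
     42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85,
     86, 93, 94, 96, 97)

present = set(a)  # duplicates collapse; membership tests are O(1) on average

# Order-preserving filter over the range.
missing_ordered = [n for n in range(101) if n not in present]

# Set difference, sorted explicitly because sets are unordered.
missing_sorted = sorted(set(range(101)) - present)

print(missing_ordered == missing_sorted)
```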

Split integer into equal chunks

What is the most efficient and reliable way in Python to split sectors up like this:
number: 101 (may vary of course)
chunk1: 1 to 30
chunk2: 31 to 61
chunk3: 62 to 92
chunk4: 93 to 101
Flow:
copy sectors 1 to 30
skip sectors in chunk 1 and copy 30 sectors starting from sector 31.
and so on...
I have solved this in a "manual" way using modulo and basic math, but there has to be a function for this?
Thank you.

I assume you have the numbers as a list. If you want fixed-size runs of a number sequence and you know where each run should end, slicing by index is the simplest approach, since each chunk is taken directly rather than searched for. You can wrap this in a small function for reuse, something like:
import numpy as np

def sectors(num_seq, chunk_size=30):
    # Number of chunks, rounded up so a short final chunk is counted.
    n_chunks = int(np.ceil(len(num_seq) / float(chunk_size)))
    for i in range(n_chunks):
        if i < n_chunks - 1:
            # Every chunk except the last has exactly chunk_size items.
            print(num_seq[chunk_size * i:chunk_size * (i + 1)])
        else:
            # The last chunk takes whatever remains.
            print(num_seq[chunk_size * i:])
You can reuse this whenever you need the same thing; it is efficient because each chunk is selected by index rather than by searching the list.
Here is the output:
x = list(range(1, 101))
sectors(x)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
[31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]
[61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
[91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
Please let me know if this meets your requirement.

Easy and fast (a single pass):
>>> data = list(range(1, 102))
>>> n = 30
>>> output = [data[i:i+n] for i in range(0, len(data), n)]
>>> output
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60], [61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90], [91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101]]
Another very simple way, wrapping the same comprehension in a function:
>>> f = lambda x, y: [x[i:i+y] for i in range(0, len(x), y)]
>>> f(list(range(1, 102)), 30)
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60], [61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90], [91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101]]
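If only the sector boundaries are needed (as in the "chunk1: 1 to 30" listing in the question), the full lists do not have to be built. A minimal sketch; the function name sector_ranges is hypothetical and sectors are assumed to be numbered from 1.

```python
def sector_ranges(total, chunk_size):
    """Yield inclusive (first, last) sector numbers, counting from 1."""
    for first in range(1, total + 1, chunk_size):
        # The last chunk is clipped so it never runs past `total`.
        yield first, min(first + chunk_size - 1, total)

for first, last in sector_ranges(101, 30):
    print(f"{first} to {last}")
```

For 101 sectors and a chunk size of 30 this gives 1 to 30, 31 to 60, 61 to 90 and 91 to 101.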

You can try numpy.histogram if you're looking to split a number into equal-sized bins (sectors).
It returns the per-bin counts together with an array of bin edges that demarcate each bin boundary:
import numpy as np
number = 101
values = np.arange(number, dtype=int)
counts, bin_edges = np.histogram(values, bins='auto')
print(bin_edges)
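If the goal is a fixed number of near-equal chunks rather than a fixed chunk size, plain integer arithmetic is enough. The sketch below uses divmod instead of numpy.histogram; the function name equal_chunks is hypothetical, and it assumes 1-based sector numbers and parts <= total.

```python
def equal_chunks(total, parts):
    """Split 1..total into `parts` inclusive ranges whose sizes differ by at most 1."""
    base, extra = divmod(total, parts)
    ranges = []
    first = 1
    for i in range(parts):
        # The first `extra` chunks absorb the remainder, one item each.
        size = base + (1 if i < extra else 0)
        ranges.append((first, first + size - 1))
        first += size
    return ranges

print(equal_chunks(101, 4))
```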

formatting dictionary printing output

I have a dictionary called d that stores several lists. Printing it gives this hard-to-read output:
{'Patch(0,8)': [28, 56, 75], 'Patch(0,6)': [1, 11, 17, 19, 20, 23, 28, 30, 44, 45, 49, 56, 60, 63, 75, 81, 91, 99],
'Patch(4,0)': [2, 5, 6, 8, 19, 22, 23, 27, 31, 34, 35, 36, 41, 45, 51, 52, 53, 55, 56, 59, 60, 61, 62, 64, 66, 67, 68, 70, 73, 75, 76, 77, 79, 85, 87, 91, 94, 96],
'Patch(4,6)': [19, 23, 45, 56, 60, 75, 91], 'Patch(0,0)': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], 'Patch(8,0)': [2, 22, 23, 27, 34, 52
, 55, 60, 85], 'Patch(0,2)': [0, 1, 2, 3, 4, 6, 7, 10, 11, 13, 15, 16, 17, 18, 19, 20, 22, 23, 25, 26, 28, 29, 30, 32, 34, 36, 37, 38, 40, 43, 44, 45, 46, 47,
49, 50, 51, 52, 53, 54, 56, 58, 59, 60, 61, 62, 63, 64, 66, 70, 71, 74, 75, 76, 77, 78, 80, 81, 83, 85, 90, 91, 92, 93, 94, 96, 98, 99], 'Patch(2,8)': [28, 56, 75], 'Patch(4,8)': [56, 75]}
I just want to print each Patch and its corresponding data on a new line:
{'Patch(0,8)': [28, 56, 75],
'Patch(0,6)': [1, 11, 17, 19, 20, 23, 28, 30, 44, 45, 49, 56, 60, 63, 75, 81, 91, 99],
I tried pprint after seeing the suggestions in this answer :
pprint.pprint(d, width=1)
I get this :
{'Patch(0,8)': [28,
56,
75], and so on
What am I missing here ?

Just pass in width that is big enough to hold every value in the dict:
>>> pprint.pprint(d, width=1000)
{'Patch(0,0)': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
'Patch(0,2)': [0, 1, 2, 3, 4, 6, 7, 10, 11, 13, 15, 16, 17, 18, 19, 20, 22, 23, 25, 26, 28, 29, 30, 32, 34, 36, 37, 38, 40, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 54, 56, 58, 59, 60, 61, 62, 63, 64, 66, 70, 71, 74, 75, 76, 77, 78, 80, 81, 83, 85, 90, 91, 92, 93, 94, 96, 98, 99],
'Patch(0,6)': [1, 11, 17, 19, 20, 23, 28, 30, 44, 45, 49, 56, 60, 63, 75, 81, 91, 99],
'Patch(0,8)': [28, 56, 75],
'Patch(2,8)': [28, 56, 75],
'Patch(4,0)': [2, 5, 6, 8, 19, 22, 23, 27, 31, 34, 35, 36, 41, 45, 51, 52, 53, 55, 56, 59, 60, 61, 62, 64, 66, 67, 68, 70, 73, 75, 76, 77, 79, 85, 87, 91, 94, 96],
'Patch(4,6)': [19, 23, 45, 56, 60, 75, 91],
'Patch(4,8)': [56, 75],
'Patch(8,0)': [2, 22, 23, 27, 34, 52, 55, 60, 85]}

I usually print dicts as JSON to give them structure and formatting I can easily read.
import json
print(json.dumps(dict(a=1, b=2), indent=2))

You can also print it with a simple loop. See the official docs for dict.items (dict.iteritems in Python 2).
for key, value in d.items():
    print(key + " - " + str(value))
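To get the exact layout asked for (one key and its whole list per line, inside a single pair of braces), the lines can also be assembled by hand. A minimal sketch; format_one_per_line is a hypothetical helper and the small d below only stands in for the dictionary from the question.

```python
def format_one_per_line(d):
    """Return a dict repr with exactly one key/value pair per line."""
    items = [f"{key!r}: {value!r}" for key, value in sorted(d.items())]
    return "{" + ",\n ".join(items) + "}"

# Small stand-in for the dictionary d from the question.
d = {"Patch(0,8)": [28, 56, 75], "Patch(0,6)": [1, 11, 17]}
print(format_one_per_line(d))
```

Unlike the pprint approach, this does not depend on choosing a width large enough for the longest list.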

How to generate groups of 10 consecutive numbers in a list?

I am trying to generate a list of consecutive numbers in groups of ten. For example, let's start with a list of 109 numbers:
mylist = range(1,110,1)
I know that I can generate a list of intervals of 10 by using range(1,110,10), which yields:
[1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101]
How can I generate a list of consecutive numbers in groups of ten like the following?
[[1,2,3,4,5,6,7,8,9,10],[11,12,13,14,15,16,17,18,19,20], ...]

You can use a list comprehension:
[list(range(i, i + 10)) for i in range(1, 102, 10)]
Demo:
>>> from pprint import pprint
>>> [list(range(i, i + 10)) for i in range(1, 102, 10)]
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [11, 12, 13, 14, 15, 16, 17, 18, 19, 20], [21, 22, 23, 24, 25, 26, 27, 28, 29, 30], [31, 32, 33, 34, 35, 36, 37, 38, 39, 40], [41, 42, 43, 44, 45, 46, 47, 48, 49, 50], [51, 52, 53, 54, 55, 56, 57, 58, 59, 60], [61, 62, 63, 64, 65, 66, 67, 68, 69, 70], [71, 72, 73, 74, 75, 76, 77, 78, 79, 80], [81, 82, 83, 84, 85, 86, 87, 88, 89, 90], [91, 92, 93, 94, 95, 96, 97, 98, 99, 100], [101, 102, 103, 104, 105, 106, 107, 108, 109, 110]]
>>> pprint(_)
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
[31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
[41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60],
[61, 62, 63, 64, 65, 66, 67, 68, 69, 70],
[71, 72, 73, 74, 75, 76, 77, 78, 79, 80],
[81, 82, 83, 84, 85, 86, 87, 88, 89, 90],
[91, 92, 93, 94, 95, 96, 97, 98, 99, 100],
[101, 102, 103, 104, 105, 106, 107, 108, 109, 110]]

You can use nested list comprehensions to generate lists like this.
[[10*i + j for j in range(1,11)] for i in range(10)]
Output
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
[31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
[41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60],
[61, 62, 63, 64, 65, 66, 67, 68, 69, 70],
[71, 72, 73, 74, 75, 76, 77, 78, 79, 80],
[81, 82, 83, 84, 85, 86, 87, 88, 89, 90],
[91, 92, 93, 94, 95, 96, 97, 98, 99, 100]]

Alternatively, you can group them together.
def grouper(iterable, n):
    # From the itertools recipes. zip() stops at the shortest input, so a
    # trailing group with fewer than n items is dropped.
    return zip(*[iter(iterable)] * n)

full_range = range(1, 101)
grouped_list = list(grouper(full_range, 10))
Which results in:
[(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
(11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
(21, 22, 23, 24, 25, 26, 27, 28, 29, 30),
(31, 32, 33, 34, 35, 36, 37, 38, 39, 40),
(41, 42, 43, 44, 45, 46, 47, 48, 49, 50),
(51, 52, 53, 54, 55, 56, 57, 58, 59, 60),
(61, 62, 63, 64, 65, 66, 67, 68, 69, 70),
(71, 72, 73, 74, 75, 76, 77, 78, 79, 80),
(81, 82, 83, 84, 85, 86, 87, 88, 89, 90),
(91, 92, 93, 94, 95, 96, 97, 98, 99, 100)]
# a list of tuples, if you need it to be a list of lists:
# [list(group) for group in grouper(full_range, 10)]
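For the 109-number list from the question, a zip-based grouper would silently discard the last nine numbers. A sketch of an alternative built on itertools.islice that keeps the short final group; the function name chunked is an assumption, not something from the answers above.

```python
from itertools import islice

def chunked(iterable, n):
    """Yield lists of up to n consecutive items; the last list may be shorter."""
    it = iter(iterable)
    while True:
        group = list(islice(it, n))
        if not group:
            return  # iterator exhausted
        yield group

groups = list(chunked(range(1, 110), 10))
print(len(groups), groups[-1])
```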
