Selecting a random sample from a very large generator

Selecting a random sample from a very large generator - python

I am trying to test some strategies for a game, which can be defined by 10 non-negative integers that add up to 100. There are 109 choose 9, or roughly 10^12 of these, so comparing them all is not practical. I would like to take a random sample of about 1,000,000 of these.
I have tried the methods from the answers to this question, and this one, but all still seem far too slow to work. The quickest method seems like it will take about 180 hours on my machine.
This is how I've tried to make the generator (adapted from a previous SE answer). For some reason, changing prob does not seem to impact the run time of turning it into a list.
def tuples_sum_sample(nbval,total, prob, order=True) :
"""
Generate all the tuples L of nbval positive or nul integer
such that sum(L)=total.
The tuples may be ordered (decreasing order) or not
"""
if nbval == 0 and total == 0 : yield tuple() ; raise StopIteration
if nbval == 1 : yield (total,) ; raise StopIteration
if total==0 : yield (0,)*nbval ; raise StopIteration
for start in range(total,0,-1) :
for qu in tuples_sum(nbval-1,total-start) :
if qu[0]<=start :
sol=(start,)+qu
if order :
if random.random() <prob:
yield sol
else :
l=set()
for p in permutations(sol,len(sol)) :
if p not in l :
l.add(p)
if random.random()<prob:
yield p
Rejection sampling seems like it would take about 3 million years, so this is out as well.
randsample = []
while len(randsample)<1000000:
x = (random.randint(0,100),random.randint(0,100),random.randint(0,100),random.randint(0,100),random.randint(0,100),random.randint(0,100),random.randint(0,100),random.randint(0,100),random.randint(0,100),random.randint(0,100))
if sum(x) == 100:
randsample.append(x)
randsample
Can anyone think of another way to do this?
Thanks

A couple of frame-challenging questions:
Is there any reason you must generate the entire population, then sample that population?
Why do you need to check if your numbers sum to 100?
You can generate a set of numbers that sum to a value. Check out the first answer here:
Random numbers that add to 100: Matlab
Then generate the number of such sets you desire (1,000,000 in this case).
import numpy as np
def set_sum(number=10, total=100):
initial = np.random.random(number-1) * total
sort_list = np.append(initial, [0, total]).astype(int)
sort_list.sort()
set_ = np.diff(sort_list)
return set_
if __name__ == '__main__':
import timeit
a = set_sum()
n = 1000000
sample = [set_sum() for i in range(n)]

Numpy to the rescue!
Specifically, you need a multinomial distribution:
import numpy as np
desired_sum = 100
n = 10
np.random.multinomial(desired_sum, np.ones(n)/n, size=1000000)
It outputs a matrix with a million rows of 10 random integers in a few seconds. Each row sums up to 100.
Here's a smaller example:
np.random.multinomial(desired_sum, np.ones(n)/n, size=10)
which outputs:
array([[ 8, 7, 12, 11, 11, 9, 9, 10, 11, 12],
[ 7, 11, 8, 9, 9, 10, 11, 14, 11, 10],
[ 6, 10, 11, 13, 8, 10, 14, 12, 9, 7],
[ 6, 11, 6, 7, 8, 10, 8, 18, 13, 13],
[ 7, 7, 13, 11, 9, 12, 13, 8, 8, 12],
[10, 11, 13, 9, 6, 11, 7, 5, 14, 14],
[12, 5, 9, 9, 10, 8, 8, 16, 9, 14],
[14, 8, 14, 9, 11, 6, 10, 9, 11, 8],
[12, 10, 12, 9, 12, 10, 7, 10, 8, 10],
[10, 7, 10, 19, 8, 5, 11, 8, 8, 14]])
The sums appear to be correct:
sum(np.random.multinomial(desired_sum, np.ones(n)/n, size=10).T)
# array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100])
Python only
You could also start with a list on 10 zeroes, iterate 100 times and increment a random cell each time :
import random
desired_sum = 100
n = 10
row = [0] * n
for _ in range(desired_sum):
row[random.randrange(n)] += 1
row
# [16, 7, 9, 7, 10, 11, 4, 19, 4, 13]
sum(row)
# 100

Related

Add new element in the next sublist depending in if it has been added or not (involves also a dictionary problem) python

Community of Stackoverflow:
I'm trying to create a list of sublists with a loop based on a random sampling of values of another list; and each sublist has the restriction of not having a duplicate or a value that has already been added to a prior sublist.
Let's say (example) I have a main list:
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
#I get:
[[1,13],[4,1],[8,13]]
#I WANT:
[[1,13],[4,9],[8,14]] #(no duplicates when checking previous sublists)
The real code that I thought it would work is the following (as a draft):
matrixvals=list(matrix.index.values) #list where values are obtained
lists=[[]for e in range(0,3)] #list of sublists that I want to feed
vls=[] #stores the values that have been added to prevent adding them again
for e in lists: #initiate main loop
for i in range(0,5): #each sublist will contain 5 different random samples
x=random.sample(matrixvals,1) #it doesn't matter if the samples are 1 or 2
if any(x) not in vls: #if the sample isn't in the evaluation list
vls.extend(x)
e.append(x)
else: #if it IS, then do a sample but without those already added values (line below)
x=random.sample([matrixvals[:].remove(x) for x in vls],1)
vls.extend(x)
e.append(x)
print(lists)
print(vls)
It didn't work as I get the following:
[[[25], [16], [15], [31], [17]], [[4], [2], [13], [42], [13]], [[11], [7], [13], [17], [25]]]
[25, 16, 15, 31, 17, 4, 2, 13, 42, 13, 11, 7, 13, 17, 25]
As you can see, number 13 is repeated 3 times, and I don't understand why
I would want:
[[[25], [16], [15], [31], [17]], [[4], [2], [13], [42], [70]], [[11], [7], [100], [18], [27]]]
[25, 16, 15, 31, 17, 4, 2, 13, 42, 70, 11, 7, 100, 18, 27] #no dups
In addition, is there a way to convert the sample.random results as values instead of lists? (to obtain):
[[25,16,15,31,17]], [4, 2, 13, 42,70], [11, 7, 100, 18, 27]]
Also, the final result in reality isn't a list of sublists, actually is a dictionary (the code above is a draft attempt to solve the dict problem), is there a way to obtain that previous method in a dict? With my present code I got the next results:
{'1stkey': {'1stsubkey': {'list1': [41,
40,
22,
28,
26,
14,
41,
15,
40,
33],
'list2': [41, 40, 22, 28, 26, 14, 41, 15, 40, 33],
'list3': [41, 40, 22, 28, 26, 14, 41, 15, 40, 33]},
'2ndsubkey': {'list1': [21,
7,
31,
12,
8,
22,
27,...}
Instead of that result, I would want the following:
{'1stkey': {'1stsubkey': {'list1': [41,40,22],
'list2': [28, 26, 14],
'list3': [41, 15, 40, 33]},
'2ndsubkey': {'list1': [21,7,31],
'list2':[12,8,22],
'list3':[27...,...}#and so on
Is there a way to solve both list and dict problem? Any help will be very appreciated; I can made some progress even only with the list problem
Thanks to all

I realize you may be more interested in finding out why your particular approach isn't working. However, if I've understood your desired behavior, I may be able to offer an alternative solution. After posting my answer, I will take a look at your attempt.
random.sample lets you sample k number of items from a population (collection, list, whatever.) If there are no repeated elements in the collection, then you're guaranteed to have no repeats in your random sample:
from random import sample
pool = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
num_samples = 4
print(sample(pool, k=num_samples))
Possible output:
[9, 11, 8, 7]
>>>
It doesn't matter how many times you run this snippet, you will never have repeated elements in your random sample. This is because random.sample doesn't generate random objects, it just randomly picks items which already exist in a collection. This is the same approach you would take when drawing random cards from a deck of cards, or drawing lottery numbers, for example.
In your case, pool is the pool of possible unique numbers to choose your sample from. Your desired output seems to be a list of three lists, where each sublist has two samples in it. Rather than calling random.sample three times, once for each sublist, we should call it once with k=num_sublists * num_samples_per_sublist:
from random import sample
pool = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
num_sublists = 3
samples_per_sublist = 2
num_samples = num_sublists * samples_per_sublist
assert num_samples <= len(pool)
print(sample(pool, k=num_samples))
Possible output:
[14, 10, 1, 8, 6, 3]
>>>
OK, so we have six samples rather than four. No sublists yet. Now you can simply chop this list of six samples up into three sublists of two samples each:
from random import sample
pool = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
num_sublists = 3
samples_per_sublist = 2
num_samples = num_sublists * samples_per_sublist
assert num_samples <= len(pool)
def pairwise(iterable):
yield from zip(*[iter(iterable)]*samples_per_sublist)
print(list(pairwise(sample(pool, num_samples))))
Possible output:
[(4, 11), (12, 13), (8, 15)]
>>>
Or if you really want sublists, rather than tuples:
def pairwise(iterable):
yield from map(list, zip(*[iter(iterable)]*samples_per_sublist))
EDIT - just realized that you don't actually want a list of lists, but a dictionary. Something more like this? Sorry I'm obsessed with generators, and this isn't really easy to read:
keys = ["1stkey"]
subkeys = ["1stsubkey", "2ndsubkey"]
num_lists_per_subkey = 3
num_samples_per_list = 5
num_samples = num_lists_per_subkey * num_samples_per_list
min_sample = 1
max_sample = 50
pool = list(range(min_sample, max_sample + 1))
def generate_items():
def generate_sub_items():
from random import sample
samples = sample(pool, k=num_samples)
def generate_sub_sub_items():
def chunkwise(iterable, n=num_samples_per_list):
yield from map(list, zip(*[iter(iterable)]*n))
for list_num, chunk in enumerate(chunkwise(samples), start=1):
key = f"list{list_num}"
yield key, chunk
for subkey in subkeys:
yield subkey, dict(generate_sub_sub_items())
for key in keys:
yield key, dict(generate_sub_items())
print(dict(generate_items()))
Possible output:
{'1stkey': {'1stsubkey': {'list1': [43, 20, 4, 27, 2], 'list2': [49, 44, 18, 8, 37], 'list3': [19, 40, 9, 17, 6]}, '2ndsubkey': {'list1': [43, 20, 4, 27, 2], 'list2': [49, 44, 18, 8, 37], 'list3': [19, 40, 9, 17, 6]}}}
>>>

Sample irregular list of numbers with a set delta

Is there a simpler way, using e.g. numpy, to get samples for a given X and delta than the below code?
>>> X = [1, 4, 5, 6, 11, 13, 15, 20, 21, 22, 25, 30]
>>> delta = 5
>>> samples = [X[0]]
>>> for x in X:
... if x - samples[-1] >= delta:
... samples.append(x)
>>> samples
[1, 6, 11, 20, 25, 30]

If you are aiming to "vectorize" the process for performance reasons (e.g. using numpy), you could compute the number of elements that are less than each element plus the delta. This will give you indices for the items to select with the items that need to be skipped getting the same index as the preceding ones to be kept.
import numpy as np
X = np.array([1, 4, 5, 6, 11, 13, 15, 20, 21, 22, 25, 30])
delta = 5
i = np.sum(X<X[:,None]+delta,axis=1) # index of first to keep
i = np.insert(i[:-1],0,0) # always want the first, never the last
Y = X[np.unique(i)] # extract values as unique indexes
print(Y)
[ 1 6 11 20 25 30]
This assumes that the numbers are in ascending order
[EDIT]
As indicated in my comment, the above solution is flawed and will only work some of the time. Although vectorizing a python function does not fully leverage the parallelism (and is slower than the python loop), it is possible to implement the filter like this
X = np.array([1, 4, 5, 6, 10,11,12, 13, 15, 20, 21, 22, 25, 30])
delta = 5
fdelta = np.frompyfunc(lambda a,b:a if a+delta>b else b,2,1)
Y = X[X==fdelta.accumulate(X,dtype=np.object)]
print(Y)
[ 1 6 11 20 25 30]

A variation of the knapsack problem where some items must be included but don't count towards the objective

I am trying to solve a variation of the 0/1 knapsack problem where n items must be picked, but only k < n items' values are counted towards the objective function.
My idea was to set up two vectors of binary variables, x and y - x denoting which n items are picked, and y denoting which k items are counted towards the objective - however my problem is ensuring that y is a subset of x.
I am using the python mip library, and here is the code I have so far (a slightly modified version of the knapsack example in the mip documentation):
from mip import Model, xsum, maximize, BINARY
values = [10, 13, 18, 31, 7, 15, 8, 11, 3, 9, 13, 12, 11, 6, 18, 11, 18, 13, 12, 11]
weights = [11, 15, 20, 24, 9, 16, 12, 3, 6, 9, 17, 13, 20, 9, 32, 14, 19, 20, 12, 13]
max_weight = 200
I = range(len(weights))
m = Model("knapsack")
x = [m.add_var(var_type=BINARY) for i in I]
y = [m.add_var(var_type=BINARY) for i in I]
m += xsum(x[i] for i in I) == 15 # n
m += xsum(y[i] for i in I) == 11 # k
m += xsum(weights[i] * x[i] for i in I) <= max_weight
m.objective = maximize(xsum(values[i] * y[i] for i in I))
# `m += xsum(x[i] * y[i] for i in I) == 11` doesn't work
m.optimize()
selected_x = [i for i in I if x[i].x >= 0.99]
selected_y = [i for i in I if y[i].x >= 0.99]
print("selected items: {}".format(selected_x))
print("selected items: {}".format(selected_y))
#Output:
# selected items: [0, 1, 2, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18, 19]
# selected items: [1, 2, 3, 5, 10, 11, 14, 16, 17, 18, 19]
Any help would be great, thank you.
edit: for anyone finding this in the future, simply adding
for i in I:
m += x[i] >= y[i]
works perfectly.

Find the longest arithmetic progression inside a sequence

Suppose I have a sequence of increasing numbers, and I want to find the length of longest arithmetic progression within the sequence. Longest arithmetic progression means an increasing sequence with common difference, such as [2, 4, 6, 8] or [3, 6, 9, 12].
For example,
for [5, 10, 14, 15, 17], [5, 10, 15] is the longest arithmetic progression, with length 3;
for [10, 12, 13, 20, 22, 23, 30], [10, 20, 30] is the longest arithmetic progression with length 3;
for [7, 10, 12, 13, 15, 20, 21], [10, 15, 20] or [7, 10, 13] are the longest arithmetic progressions with length 3.
This site
https://prismoskills.appspot.com/lessons/Dynamic_Programming/Chapter_22_-_Longest_arithmetic_progression.jsp
offers some insight into the problem, i.e. by looping around j and consider
every 3 elements. I intend to use this algorithm in Python, and my code is as follows:
def length_of_AP(L):
n = len(L)
Table = [[0 for _ in range(n)] for _ in range(n)]
length_of_AP = 2
# initialise the last column of the table as all i and (n-1) pairs have lenth 2
for i in range(n):
Table[i][n-1] =2
# loop around the list and i, k such that L[i] + L[k] = 2 * L[j]
for j in range(n - 2, 0, -1):
i = j - 1
k = j + 1
while i >= 0 and k < n:
difference = (L[i] + L[k]) - 2 * L[j]
if difference < 0:
k = k + 1
else:
if difference > 0:
i = i - 1
else:
Table[i][j] = Table[j][k] + 1
length_of_AP = max(length_of_AP, Table[i][j])
k = k + 1
i = i - 1
return length_of_AP
This function works fine with [1, 3, 4, 5, 7, 8, 9], but it doesn't work for [5, 10, 14, 15, 20, 25, 26, 27, 28, 30, 31], where I am supposed to get 6 but I got 4. I can see the reason being that 25, 26, 27, 28 inside the list may be a distracting factor for my function. How do I change my function so that it gives me the result desired.
Any help may be appreciated.

Following your link and running second sample, it looks like the code actually find proper LAP
5, 10, 15, 20, 25, 30,
but fails to find proper length. I didn't spend too much time analyzing the code but the piece
// Any 2-letter series is an AP
// Here we initialize only for the last column of lookup because
// all i and (n-1) pairs form an AP of size 2
for (int i=0; i<n; i++)
lookup[i][n-1] = 2;
looks suspicious to me. It seems that you need to initialize whole lookup table with 2 instead of just last column and if I do so, it starts to get correct length on your sample as well.
So get rid of the "initialise" loop and change your 3rd line to following code:
# initialise whole table with 2 as all (i, j) pairs have length 2
Table = [[2 for _ in range(n)] for _ in range(n)]
Moreover their
Sample Execution:
Max AP length = 6
3, 5, 7, 9, 11, 13, 15, 17,
Contains this bug as well and actually prints correct sequence only because of sheer luck. If I modify the sortedArr to
int sortedArr[] = new int[] {3, 4, 5, 7, 8, 9, 11, 13, 14, 15, 16, 17, 18, 112, 113, 114, 115, 116, 117, 118};
I get following output
Max AP length = 7
112, 113, 114, 115, 116, 117, 118,
which is obviously wrong as original 8-items long sequence 3, 5, 7, 9, 11, 13, 15, 17, is still there.

Did you try it?
Here's a quick brute force implementation, for small datasets it should run fast enough:
def gen(seq):
diff = ((b-a, a) for a, b in it.combinations(sorted(seq), 2))
for d, n in diff:
k = []
while n in seq:
k.append(n)
n += d
yield (d, k)
def arith(seq):
return max(gen(seq), key=lambda x: len(x[1]))
In [1]: arith([7, 10, 12, 13, 15, 20, 21])
Out[1]: (3, [7, 10, 13])
In [2]: %timeit arith([7, 10, 12, 13, 15, 20, 21])
10000 loops, best of 3: 23.6 µs per loop
In [3]: seq = {random.randrange(1000) for _ in range(100)}
In [4]: arith(seq)
Out[4]: (171, [229, 400, 571, 742, 913])
In [5]: %timeit arith(seq)
100 loops, best of 3: 3.79 ms per loop
In [6]: seq = {random.randrange(1000000) for _ in range(1000)}
In [7]: arith(seq)
Out[7]: (81261, [821349, 902610, 983871])
In [8]: %timeit arith(seq)
1 loop, best of 3: 434 ms per loop

python numpy polyfit function

I'm new to python and haven't found an answer on this site so far.
I'm using numpy.polyfit in a loop and getting an error as below and don't understand as when I run the code in debug everything works fine and the len of arrays going into the function are the same:
Error Runtime exception: TypeError: expected x and y to have same length
My code is below:
import numpy as np
from collections import defaultdict
bb = [ 10, 11, 12, 22, 10, 11, 12, 11, 10, 11, 12, 22, 10, 11, 12, 11, 10, 11, 12, 22, 10, 11, 12, 11, 10, 11, 12, 22, 10, 11, 12, 11, 10 ]
i = 0
b = -3
bb_gradient = defaultdict(dict)
while ( b <= 0 ):
print i
print len(range(3))
print len(bb[b-3:b])
bb_gradient[i][0], _ = np.polyfit( range(3), weekly_bb_lower[b-3:b], 1 )
i += 1
b += 1
What am I doing wrong?
Thanks in anticipation.

I am assuming bb is weekly_bb_lower. Change while ( b <= 0 ) to while ( b < 0 ). because when b becomes 0, weekly_bb_lower[-3:0] will return an empty list. a list[-n:0] is supposed to be empty.

You can avoid referencing an empty list by moving the last three elements to the start of your list:
import numpy as np
from collections import defaultdict
bb = [ 10, 11, 12, 22, 10, 11, 12, 11, 10, 11, 12, 22, 10, 11, 12, 11, 10, 11, 12, 22, 10, 11, 12, 11, 10, 11, 12, 22, 10, 11, 12, 11, 10 ]
bb = bb[-3:] + bb[:-3] # moves the last three elements of the list to the start prior to looping
bb_gradient = defaultdict(dict)
for i in range(3):
bb_gradient[i][0], _ = np.polyfit( range(3) , bb[i:i+3], 1 )
Prashanth's explanation is correct.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Selecting a random sample from a very large generator - python

Related

Add new element in the next sublist depending in if it has been added or not (involves also a dictionary problem) python

Sample irregular list of numbers with a set delta

A variation of the knapsack problem where some items must be included but don't count towards the objective

Find the longest arithmetic progression inside a sequence

python numpy polyfit function

Categories

Resources