I have a list of tuples formed by 1000 object ids and their scores, i.e.:
scored_items = [('14',534.9),('4',86.0),('78',543.21),....].
Let T be the aggregated score of the top 20 highest scoring items.
That's easy. Using python:
top_20 = sorted(scored_items, key=lambda k: k[1], reverse=True)[:20]
T = sum(n for _, n in top_20)
Next, let t equal a quarter of T. I.e. in python: t = math.ceil(T/4)
My question is: what's the most efficient way to randomly select 20 items (without replacement) from scored_items such that their aggregated score is equal to or greater than (but never lower than) t? They may or may not include items from top_20.
Would prefer an answer in Python, and would prefer to not rely on external libraries much
Background: This is an item-ranking algorithm that is strategy proof according to an esoteric - but useful - Game Theory theorem. Source: section 2.5 in this paper, or just read footnote 18 on page 11 of this same link. Btw strategy proof essentially means it's tough to game it.
I'm a neophyte python programmer and have been mulling how to solve this problem for a while now, but just can't seem to wrap my head around it. Would be great to know how the experts would approach and solve this.
I suppose the most simplistic (and least performant perhaps) way is to keep randomly generating sets of 20 items till their scores' sum exceeds or equals t.
But there has to be a better way to do this right?
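For reference, here is a minimal sketch of that brute-force idea (my own illustration, not taken from the answers below; it assumes scored_items and t are defined as above and that a qualifying sample exists):

import random

def naive_sample(scored_items, t, k=20):
    # Keep drawing random size-k samples until the aggregated score reaches t.
    # (This loops forever if no qualifying sample exists.)
    while True:
        candidate = random.sample(scored_items, k)
        if sum(score for _, score in candidate) >= t:
            return candidate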
Here is an implementation of what I mentioned in the comments.
Since we want items such that the sum of the scores is large, we can weight the choice so that we are more likely to pick samples with large scores.
import numpy as np
import math

def normalize(p):
    return p / sum(p)

def get_sample(scored_items, N=20, max_iter=1000):
    topN = sorted(scored_items, key=lambda k: k[1], reverse=True)[:N]
    T = sum(n for _, n in topN)
    t = math.ceil(T / 4)
    i = 0
    scores = np.array([x[1] for x in scored_items])
    p = normalize(scores)
    while i < max_iter:
        sample_indexes = np.random.choice(a=range(len(scored_items)), size=N, replace=False, p=p)
        sample = [scored_items[x] for x in sample_indexes]
        if sum(n for _, n in sample) >= t:
            print("Found a solution at iteration %d" % i)
            return sample
        i += 1
    print("Could not find a solution after %d iterations" % max_iter)
    return None
An example of how to use it:
np.random.seed(0)
ids = range(1000)
scores = 10000*np.random.random_sample(size=len(ids))
scored_items = list(zip(map(str, ids), scores))
sample = get_sample(scored_items, 20)
#Found a solution at iteration 0
print(sum(n for _, n in sample))
#139727.1229832652
Though this is not guaranteed to get a solution, I ran this in a loop 100 times and each time a distinct solution was found on the first iteration.
Though I do not know of an efficient way for huge lists, something like this works even for 1000 or so items. You can do a bit better if you don't need true randomness:
import random

testList = [x for x in range(1, 1000)]
t = sum(range(975, 1000)) / 4
while True:
    rs = random.sample(testList, 15)
    if sum(rs) >= t:
        break
print(rs)
I'm trying to solve Project Euler problem 240:
In how many ways can twenty 12-sided dice (sides numbered 1 to 12) be rolled so that the top ten sum to 70?
I've come up with code to solve this, but it really takes a lot of time to compute. I know this approach is pretty bad. Can someone suggest how I can fix this code to perform better?
import itertools

def check(a, b):  # check that all the elements in list a are less than or equal to value b
    chk = 1
    for x in a:
        if x > b:
            chk = 0
            break
    return chk

lst = []
count = 0
for x in itertools.product(range(1, 13), repeat=20):
    a = sorted([x[y] for y in range(20)])
    if sum(a[-10:]) == 70 and check(a[:10], min(a[-10:])):
        count += 1
The code below is for the smaller example given in the problem's description. It works perfectly and gives the exact answer:
import itertools

def check(a, b):
    chk = 1
    for x in a:
        if x > b:
            chk = 0
            break
    return chk

count = 0
for x in itertools.product(range(1, 7), repeat=5):
    a = sorted([x[y] for y in range(5)])
    if sum(a[-3:]) == 15 and check(a[:2], min(a[-3:])):
        count += 1
It's no good iterating over all possibilities, because there are 12^20 = 3,833,759,992,447,475,122,176 ways to roll 20 twelve-sided dice, and at, say, a million rolls per second, that would take millions of years to complete.
The way to solve this kind of problem is to use dynamic programming. Find some way to split up your problem into the sum of several smaller problems, and build up a table of the answers to these sub-problems until you can compute the result you need.
For example, let T(n, d, k, t) be the number of ways to roll n d-sided dice so that the top k of them sum to t. How can we split this up into sub-problems? Well, we could consider the number of dice, i, that roll d exactly. There are C(n, i) ways to choose these i dice, and T(n − i, d − 1, ...) ways to choose the n − i remaining dice, which must roll at most d − 1. (For some suitable choice of parameters for k and t which I've elided.)
Take the product of these, and sum it up for all suitable values of i and you're done. (Well, not quite done: you have to specify the base cases, but that should be easy.)
The number of sub-problems that you need to compute will be at most (n + 1)(d + 1)(k + 1)(t + 1), which in the Project Euler case (n = 20, d = 12, k = 10, t = 70) is at most 213,213. (In practice, it's much less than this, because many branches of the tree reach base cases quickly: in my implementation it turns out that the answers to just 791 sub-problems are sufficient to compute the answer.)
To write a dynamic program, it's usually easiest to express it recursively and use memoization to avoid re-computing the answer to sub-problems. In Python you could use the @functools.lru_cache decorator.
So the skeleton of your program could look like this. I've replaced the crucial details by ??? so as not to deprive you of the pleasure of working it out for yourself. Work with small examples (e.g. "two 6-sided dice, the top 1 of which sums to 6") to check that your logic is correct, before trying bigger cases.
from functools import lru_cache

def combinations(n, k):
    """Return C(n, k), the number of combinations of k out of n."""
    c = 1
    k = min(k, n - k)
    for i in range(1, k + 1):
        c *= (n - k + i)
        c //= i
    return c

@lru_cache(maxsize=None)
def T(n, d, k, t):
    """Return the number of ways n distinguishable d-sided dice can be
    rolled so that the top k dice sum to t.
    """
    # Base cases
    if ???: return 1
    if ???: return 0
    # Divide and conquer. Let N be the maximum number of dice that
    # can roll exactly d.
    N = ???
    return sum(combinations(n, i)
               * T(n - i, d - 1, ???)
               for i in range(N + 1))
With appropriate choices for all the ???, this answers the Project Euler problem in a few milliseconds:
>>> from timeit import timeit
>>> timeit(lambda:T(20, 12, 10, 70), number=1)
0.008017531014047563
>>> T.cache_info()
CacheInfo(hits=1844, misses=791, maxsize=None, currsize=791)
This solution should work - not sure how long it will take on your system.
from itertools import product

lg = (p for p in product(xrange(1, 13, 1), repeat=10) if sum(p) == 70)
results = {}
for l in lg:
    results[l] = [p for p in product(xrange(1, min(l), 1), repeat=10)]
What it does is create the "top ten" first, then add to each "top ten" a list of the possible "next ten" items, where the max value is capped at the minimum item in the "top ten".
results is a dict where the key is the "top ten" and the value is a list of the possible "next ten".
The solution (the number of combinations that fit the requirements) would be to count the number of lists across the whole results dict, like this:
count = 0
for k, v in results.items():
    count += len(v)
and then count will be the result.
Update
Okay, I have thought of a slightly better way of doing this.
from itertools import product
import math

def calc_ways(dice, sides, top, total):
    top_dice = (p for p in product(xrange(1, sides + 1, 1), repeat=top) if sum(p) == total)
    n_count = dict((n, math.pow(n, dice - top)) for n in xrange(1, sides + 1, 1))
    count = 0
    for l in top_dice:
        count += n_count[min(l)]
    return count
Since I'm only counting the length of the "next ten", I figured I would just pre-calculate the number of options for each 'lowest' number in the "top ten", so I created a dictionary which does that. The above code will run much more smoothly, as it consists only of a small dictionary, a counter, and a generator. As you can imagine, it will probably still take a lot of time... but I ran it for the first 1 million results in under a minute, so I'm sure it's within the feasible range.
Good luck :)
Update 2
After another comment by you, I understood what I was doing wrong and tried to correct it.
from itertools import product, combinations_with_replacement, permutations
import math

def calc_ways(dice, sides, top, total):
    top_dice = (p for p in product(xrange(1, sides + 1, 1), repeat=top) if sum(p) == total)
    n_dice = dice - top
    n_sets = len(set([p for p in permutations(range(n_dice) + ['x'] * top)]))
    n_count = dict((n, n_sets * len([p for p in combinations_with_replacement(range(1, n + 1, 1), n_dice)])) for n in xrange(1, sides + 1, 1))
    count = 0
    for l in top_dice:
        count += n_count[min(l)]
    return count
As you can imagine, it is quite a disaster, and does not even give the right answer. I think I am going to leave this one for the mathematicians, since my way of solving this would simply be:
def calc_ways1(dice, sides, top, total):
    return len([p for p in product(xrange(1, sides + 1, 1), repeat=dice) if sum(sorted(p)[-top:]) == total])
which is an elegant one-line solution, and provides the right answer for calc_ways1(5,6,3,15), but takes forever for the calc_ways1(20,12,10,70) problem.
Anyway, math sure seems like the way to go on this, not my silly ideas.
I've been using the random_element() function provided by SAGE to generate random integer partitions for a given integer (N) that are a particular length (S). I'm trying to generate unbiased random samples from the set of all partitions for given values of N and S. SAGE's function quickly returns random partitions for N (i.e. Partitions(N).random_element()).
However, it slows immensely when adding S (i.e. Partitions(N,length=S).random_element()). Likewise, filtering out random partitions of N that are of length S is incredibly slow.
However, and I hope this helps someone, I've found that when the function returns a partition of N not matching the length S, the conjugate partition is often of length S. That is:
S = 10
N = 100
part = list(Partitions(N).random_element())
if len(part) != S:
    SAD = list(Partition(part).conjugate())
    if len(SAD) != S:
        continue  # this fragment sits inside a sampling loop; keep drawing
This increases the rate at which partitions of length S are found and appears to produce unbiased samples (I've examined the results against entire sets of partitions for various values of N and S).
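For illustration, here is a hedged sketch of how that fragment might sit inside a complete sampling loop (my reconstruction, assuming the same SAGE environment; max_tries is a cap I've added):

def random_partition_of_length(N, S, max_tries=10000):
    # Draw random partitions of N and accept one of length S,
    # also trying the conjugate partition as described above.
    for _ in range(max_tries):
        part = list(Partitions(N).random_element())
        if len(part) == S:
            return part
        SAD = list(Partition(part).conjugate())
        if len(SAD) == S:
            return SAD
    return None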
However, I'm using values of N (e.g. 10,000) and S (e.g. 300) that make even this approach impractically slow. The comment associated with SAGE's random_element() function admits there is plenty of room for optimization. So, is there a way to more quickly generate unbiased (i.e. random uniform) samples of integer partitions matching given values of N and S, perhaps, by not generating partitions that do not match S? Additionally, using conjugate partitions works well in many cases to produce unbiased samples, but I can't say that I precisely understand why.
Finally, I have a definitively unbiased method that has a zero rejection rate. Of course, I've tested it to make sure the results are representative samples of entire feasible sets. It's very fast and totally unbiased. Enjoy.
from sage.all import *
import random
First, a function to find the smallest maximum addend for a partition of n with s parts
def min_max(n, s):
    _min = int(floor(float(n) / float(s)))
    if int(n % s) > 0:
        _min += 1
    return _min
Next, a function that uses a cache and memoization to find the number of partitions of n with s parts having x as the largest part. This is fast, but I think there's a more elegant solution to be had, e.g. often P(N, S, max=K) = P(N-K, S-1). Thanks to ante (https://stackoverflow.com/users/494076/ante) for helping me with this: Finding the number of integer partitions given a total, a number of parts, and a maximum summand.
D = {}
def P(n, s, x):
    if n > s * x or x <= 0: return 0
    if n == s * x: return 1
    if (n, s, x) not in D:
        D[(n, s, x)] = sum(P(n - i * x, s - i, x - 1) for i in xrange(s))
    return D[(n, s, x)]
Finally, a function to find uniform random partitions of n with s parts, with no rejection rate! Each randomly chosen number codes for a specific partition of n having s parts.
def random_partition(n, s):
    S = s
    partition = []
    _min = min_max(n, S)
    _max = n - S + 1
    total = number_of_partitions(n, S)
    which = random.randrange(1, total + 1)  # random number
    while n:
        for k in range(_min, _max + 1):
            count = P(n, S, k)
            if count >= which:
                count = P(n, S, k - 1)
                break
        partition.append(k)
        n -= k
        if n == 0: break
        S -= 1
        which -= count
        _min = min_max(n, S)
        _max = k
    return partition
I ran into a similar problem when I was trying to calculate the probability of the strong birthday problem.
First off, the partition function explodes when given even modest numbers. You'll be returning a LOT of information. No matter which method you're using, N = 10000 and S = 300 will generate ridiculous amounts of data. It will be slow. Chances are any pure Python implementation you use will be equally slow or slower. Look to writing a C module.
If you want to try Python, the approach I took was a combination of itertools and generators to keep memory usage down. I don't seem to have my code handy anymore, but here's a good implementation:
http://wordaligned.org/articles/partitioning-with-python
EDIT:
Found my code:
def partition(a, b=-1, limit=365):
    if (b == -1):
        b = a
    if (a == 2 or a == 3):
        if (b >= a and limit):
            yield [a]
        else:
            return
    elif (a > 3):
        if (a <= b):
            yield [a]
        c = 0
        if b > a - 2:
            c = a - 2
        else:
            c = b
        for i in xrange(c, 1, -1):
            if (limit):
                for j in partition(a - i, i, limit - 1):
                    yield [i] + j
Simple approach: randomly assign the integers:
import random

def random_partition(n, s):
    partition = [0] * s
    for x in range(n):
        partition[random.randrange(s)] += 1
    return partition
I am trying to write an algorithm that would pick N distinct items from a sequence at random, without knowing the size of the sequence in advance, and where it is expensive to iterate over the sequence more than once. For example, the elements of the sequence might be the lines of a huge file.
I have found a solution when N=1 (that is, "pick exactly one element at random from a huge sequence"):
import random

items = range(1, 10)  # Imagine this is a huge sequence of unknown length
count = 1
selected = None
for item in items:
    if random.random() * count < 1:
        selected = item
    count += 1
But how can I achieve the same thing for other values of N (say, N=3)?
If your sequence is short enough that reading it into memory and randomly sorting it is acceptable, then a straightforward approach would be to just use random.shuffle:
import random
arr=[1,2,3,4]
# In-place shuffle
random.shuffle(arr)
# Take the first 2 elements of the now randomized array
print arr[0:2]
[1, 3]
Depending upon the type of your sequence, you may need to convert it to a list by calling list(your_sequence) on it, but this will work regardless of the types of the objects in your sequence.
Naturally, if you can't fit your sequence into memory or the memory or CPU requirements of this approach are too high for you, you will need to use a different solution.
Use reservoir sampling. It's a very simple algorithm that works for any N.
Here is one Python implementation, and here is another.
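Since the linked implementations aren't reproduced here, here is a minimal sketch of reservoir sampling (Algorithm R), written for illustration rather than taken from either link:

import random

def reservoir_sample(iterable, n):
    """Pick n items uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < n:
            reservoir.append(item)    # fill the reservoir first
        else:
            j = random.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = item   # replace a random reservoir slot
    return reservoir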
Simplest I've found is this answer in SO, improved a bit below:
import random
my_list = [1, 2, 3, 4, 5]
how_big = 2
new_list = random.sample(my_list, how_big)
# To preserve the order of the list, you could do:
randIndex = random.sample(range(len(my_list)), how_big)
randIndex.sort()
new_list = [my_list[i] for i in randIndex]
If you have Python 3.6+ you can use choices:
from random import choices
items = range(1, 10)
new_items = choices(items, k = 3)
print(new_items)
[6, 3, 1]
@NPE is correct, but the implementations that are being linked to are sub-optimal and not very "pythonic". Here's a better implementation:
import random

def sample(iterator, k):
    """
    Samples k elements from an iterable object.

    :param iterator: an object that is iterable
    :param k: the number of items to sample
    """
    # fill the reservoir to start
    result = [next(iterator) for _ in range(k)]
    n = k - 1
    for item in iterator:
        n += 1
        s = random.randint(0, n)
        if s < k:
            result[s] = item
    return result
Edit: As @panda-34 pointed out, the original version was flawed, but not because I was using randint vs randrange. The issue is that my initial value for n didn't account for the fact that randint is inclusive on both ends of the range. Taking this into account fixes the issue. (Note: you could also use randrange since it's inclusive on the minimum value and exclusive on the maximum value.)
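A quick usage note (mine, not part of the original answer): since the function calls next() on its argument, pass it an iterator or generator rather than a plain list:

nums = (x * x for x in range(1000))  # any iterator/generator of unknown length
print(sample(nums, 5))               # prints 5 items chosen uniformly at random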
The following will give you N random items from an array X:
import random
list(map(lambda _: random.choice(X), range(N)))
It should be enough to accept or reject each new item just once, and, if you accept it, throw out a randomly chosen old item.
Suppose you have selected N items out of the K seen so far, uniformly at random, and you then see the (K+1)th item. Accept it with probability N/(K+1), and its probabilities are OK. The current items got in with probability N/K, and get thrown out with probability (N/(K+1))·(1/N) = 1/(K+1), so they survive with probability (N/K)·(K/(K+1)) = N/(K+1), so their probabilities are OK too.
And yes I see somebody has pointed you to reservoir sampling - this is one explanation of how that works.
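A minimal sketch of exactly this accept-and-evict scheme (my own illustration, assuming Python 3):

import random

def accept_evict_sample(iterable, n):
    # Keep n items; accept the (k+1)th item with probability n/(k+1)
    # and evict a uniformly chosen current item, as described above.
    reservoir = []
    for k, item in enumerate(iterable):
        if k < n:
            reservoir.append(item)
        elif random.random() < n / (k + 1):
            reservoir[random.randrange(n)] = item
    return reservoir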
As aix mentioned, reservoir sampling works. Another option is to generate a random number for every number you see and select the k numbers with the largest random keys.
To do it iteratively, maintain a min-heap of k (random number, number) pairs, and whenever you see a new number, insert it into the heap if its random number is greater than the smallest one in the heap (evicting that smallest entry).
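A short sketch of that heap-based idea (my own illustration; assumes Python 3 and that ties between random keys are negligible):

import heapq
import random

def sample_via_heap(iterable, k):
    # Keep the k items whose random keys are largest; the min-heap makes
    # "is this key bigger than the current smallest?" an O(log k) check.
    heap = []  # min-heap of (random_key, item)
    for item in iterable:
        key = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]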
This was my answer to a duplicate question (closed before I could post) that was somewhat related ("generating random numbers without any duplicates"). Since it is a different approach from the other answers, I'll leave it here in case it provides additional insight.
from random import randint

random_nums = []
N = ...  # whatever number of random numbers you want
r = ...  # lower bound of number range
R = ...  # upper bound of number range
x = 0
while x < N:
    random_num = randint(r, R)  # inclusive range
    if random_num in random_nums:
        continue
    else:
        random_nums.append(random_num)
        x += 1
The reason for the while loop rather than a for loop is that it makes it easier not to skip over duplicates (i.e. if you hit 3 duplicates, you still end up with N numbers rather than N-3).
There's an implementation in the numpy library.
Assuming that N is smaller than the length of the array, you'd have to do the following:
import numpy as np

# my_array is the array to be sampled from
assert N <= len(my_array)
indices = np.random.permutation(N)  # Generates shuffled indices from 0 to N-1
sampled_array = my_array[indices]
If you need to sample the whole array and not just the first N positions, then you can use:
import random
sampled_array = my_array[random.sample(range(len(my_array)), N)]
The range for x and y is from 0 to 99.
I am currently doing it like this:
excludeFromTrainingSet = []
while len(excludeFromTrainingSet) < 4000:
    tempX = random.randint(0, 99)
    tempY = random.randint(0, 99)
    if [tempX, tempY] not in excludeFromTrainingSet:
        excludeFromTrainingSet.append([tempX, tempY])
But it takes ages and I really need to speed this up.
Any ideas?
Vincent Savard has an answer that's almost twice as fast as the first solution offered here.
Here's my take on it. It requires tuples instead of lists for hashability:
def method2(size):
    ret = set()
    while len(ret) < size:
        ret.add((random.randint(0, 99), random.randint(0, 99)))
    return ret
Just make sure that the limit is sane, as other answerers have pointed out. For sane input, this is algorithmically better: O(n) as opposed to O(n^2), because of the set instead of the list. Also, Python is much more efficient about loading locals than globals, so always put this stuff in a function.
EDIT: Actually, I'm not sure that they're O(n) and O(n^2) respectively because of the probabilistic component, but the estimations are correct if n is taken as the number of unique elements that they see. They'll both be slower as they approach the total number of available spaces. If you want a number of points which approaches the total number available, then you might be better off using:
import random
import itertools

def method2(size, min_, max_):
    range_ = range(min_, max_)
    points = itertools.product(range_, range_)
    return random.sample(list(points), size)
This will be a memory hog, but it is sure to be faster as the density of points increases, because it avoids looking at the same point more than once. Another option worth profiling (probably better than the last one) would be:
def method3(size, min_, max_):
    range_ = range(min_, max_)
    points = list(itertools.product(range_, range_))
    N = (max_ - min_)**2
    L = N - size
    i = 1
    while i <= L:
        del points[random.randint(0, N - i)]
        i += 1
    return points
My suggestion:
def method2(size):
    randints = range(0, 100)
    excludeFromTrainingSet = set()
    while len(excludeFromTrainingSet) < size:
        excludeFromTrainingSet.add((random.choice(randints), random.choice(randints)))
    return excludeFromTrainingSet
Instead of generating 2 random numbers every time, you first generate the list of numbers from 0 to 99, then you choose 2 and add the pair to the set. As others pointed out, there are only 10,000 possibilities, so you can't loop until you get 40,000, but you get the point.
I'm sure someone is going to come in here with a usage of numpy, but how about using a set and tuple?
E.g.:
excludeFromTrainingSet = set()
while len(excludeFromTrainingSet) < 40000:
    temp = (random.randint(0, 99), random.randint(0, 99))
    if temp not in excludeFromTrainingSet:
        excludeFromTrainingSet.add(temp)
EDIT: Isn't this an infinite loop since there are only 100^2 = 10000 POSSIBLE results, and you're waiting until you get 40000?
Make a list of all possible (x,y) values:
allpairs = list((x, y) for x in xrange(100) for y in xrange(100))

# or with Py2.6 or later:
from itertools import product
allpairs = list(product(xrange(100), xrange(100)))

# or even taking DRY to the extreme
allpairs = list(product(*[xrange(100)] * 2))
Shuffle the list:
from random import shuffle
shuffle(allpairs)
Read off the first 'n' values:
n = 4000
trainingset = allpairs[:n]
This runs pretty snappily on my laptop.
You could make a lookup table of random values... make a random index into that lookup table, and then step through it with a static increment counter...
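One possible reading of this suggestion (my interpretation, not the answerer's code): pre-shuffle all 10,000 possible points once into a lookup table, pick a random starting index, and step through it with a fixed increment:

import itertools
import random

# Pre-shuffled lookup table of every possible (x, y) point
lookup = list(itertools.product(range(100), range(100)))
random.shuffle(lookup)

start = random.randrange(len(lookup))  # random index into the table
points = [lookup[(start + i) % len(lookup)] for i in range(4000)]  # static increment of 1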
Generating 40 thousand numbers inevitably will take a while. But you are performing an O(n) linear search on excludeFromTrainingSet, which takes quite a while, especially later in the process. Use a set instead. You could also consider generating a number of coordinate sets, e.g. overnight, and pickling them, so you don't have to generate new data for each test run (I don't know what you're doing, so this might or might not help). Using tuples, as someone noted, is not only the semantically correct choice, it might also help with performance (tuple creation is faster than list creation).
Edit: Silly me, using tuples is required when using sets, since set members must be hashable and lists are unhashable.
But in your case, your loop isn't terminating because 0..99 is 100 numbers and two-tuples of them have only 100^2 = 10000 unique combinations. Fix that, then apply the above.
Taking Vince Savard's code:
>>> from random import choice
>>> def method2(size):
...     randints = range(0, 100)
...     excludeFromTrainingSet = set()
...     while True:
...         x = size - len(excludeFromTrainingSet)
...         if not x:
...             break
...         else:
...             excludeFromTrainingSet.update((choice(randints), choice(randints)) for _ in range(x))
...     return excludeFromTrainingSet
...
>>> s = method2(4000)
>>> len(s)
4000
This is not a great algorithm because it has to deal with collisions, but the tuple-generation makes it tolerable. This runs in about a second on my laptop.
## for py 3.0+
## generate 4000 points in 2D
##
import random

maxn = 10000
goodguys = 0
excluded = [0 for excl in range(0, maxn)]
for ntimes in range(0, maxn):
    alea = random.randint(0, maxn - 1)
    excluded[alea] += 1
    if excluded[alea] > 1:
        continue
    goodguys += 1
    if goodguys > 4000:
        break
    two_num = divmod(alea, 100)  ## Unfold the 2 numbers
    print(two_num)