Random sample without repetition but with probabilities - Python

I seem to be missing a function in Python that is a combination of two I know.
I have a list of numbers and probabilities for them and want to choose n of them, without repetition.
random.sample can choose from a list without repetition, but does not allow probabilities:
l = [5,124,6,2,7,1]
sample(l,k=5)
On the other hand, random.choices allows me to use weights, but samples with repetition:
choices(l,k=2,weights=[0.5,0.25,0.25,0.125,0,0.125])
Is there any way to do both in combination?
Until now I have been running a while loop that calls choices repeatedly until the number of uniquely chosen elements reaches k. But this is quite inefficient, in particular if one element has a large probability.
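For reference, the rejection loop described above might look roughly like this (a sketch of the workaround being described, not a recommended solution; the function name is made up):
import random

def sample_by_rejection(population, weights, k):
    # keep calling choices() until k distinct elements have been seen,
    # then return k of them; inefficient if one weight dominates
    chosen = set()
    while len(chosen) < k:
        chosen.update(random.choices(population, weights=weights, k=k))
    return list(chosen)[:k]

print(sample_by_rejection([5, 124, 6, 2, 7, 1], [0.5, 0.25, 0.25, 0.125, 0, 0.125], 3))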

numpy.random.choice works. Use:
import numpy as np
l = [5,124,6,2,7,1]
weights=[0.5,0.25,0.25,0.125,0,0.125]
weights = [w/sum(weights) for w in weights]
np.random.choice(l, size=5, replace=False, p=weights)
Edited to make probabilities sum to 1
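If numpy is not an option, a plain-Python sketch of the same idea is to draw one element at a time with random.choices and drop it (and its weight) before the next draw. The helper name below is made up, and this successive-draw scheme is not guaranteed to match numpy's exact algorithm:
import random

def weighted_sample_without_replacement(population, weights, k):
    # draw one element at a time; remove it and its weight before the next draw
    population = list(population)
    weights = list(weights)
    result = []
    for _ in range(k):
        idx = random.choices(range(len(population)), weights=weights, k=1)[0]
        result.append(population.pop(idx))
        weights.pop(idx)
    return result

l = [5, 124, 6, 2, 7, 1]
weights = [0.5, 0.25, 0.25, 0.125, 0, 0.125]
print(weighted_sample_without_replacement(l, weights, k=3))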


How to choose distinct weighted numbers in Python?

import random
number = list(range(1, 10))
weighted = [1]*2 + [2]*2 + [3]*2 + [4]*2 + [5]*2
number_weighted = random.choices(number, weights=weighted, k=1)  # if k=4 then the same number is sometimes chosen
I want to use a loop three times so that I choose one number each time.
I want the chosen numbers to be distinct (not the same), but weighted.
If you know how to solve this problem in Python, I would appreciate it if you taught me.
For example,
number=[1,2,3,4,5]
weighted=[0.1,0.1,0.4,0.3,0.1]
then choose two numbers.
I want 3 and 4 (the most probable ones),
but the random.choices function sometimes selects 1, 1.
So I think:
I take one number (suppose number 3), then
number=[1,2,4,5]
weighted=[0.1,0.1,0.3,0.1]
and I take another number (suppose number 4), using a loop.
Your question isn't quite clear, so comment if this doesn't solve your problem.
Define a function which returns a random element from the list for a given weight, and another function that makes sure you get n random elements from different weights.
Also, your weights and your number list had different lengths; I hope that was an error.
import random

def get_rand(num_list, weight_list, weight):
    selection_from = [i for i, v in enumerate(weight_list) if v == weight]
    print(selection_from)
    rand_index = random.choice(selection_from)
    return num_list[rand_index]

def get_n_rand(num_list, weight_list, n):
    weights = list(set(weight_list))
    random.shuffle(weights)
    final_list = []
    # if you don't want numbers from the same weight
    for weight in weights[:n]:
        final_list.append(get_rand(num_list, weight_list, weight))
    # if the same weight is also fine, use this instead:
    # for i in range(n):
    #     weight = random.choice(weights)
    #     final_list.append(get_rand(num_list, weight_list, weight))
    return final_list

number = list(range(1, 10))
weighted = [1]*2 + [2]*2 + [3]*2 + [4]*2 + [5]*1
assert(len(number) == len(weighted))
rand = get_n_rand(number, weighted, 3)
print("selected numbers:", rand)
print("their weights:", [weighted[number.index(i)] for i in rand])
Since you had a hard time understanding,
selection_from = [i for i, v in enumerate(weight_list) if v == weight]
is equivalent to:
selection_from = []
for i in range(len(weight_list)):
    v = weight_list[i]
    if v == weight:
        selection_from.append(i)

Mimicking random.sample() for non-uniform distributions

I want to emulate the functionality of random.sample() in python, but with a non-uniform (triangular, in this case) distribution of choices. Important for this is that a single item is not chosen twice (as described in the random.sample docs). Here's what I have:
...
def tri_sample(population, k, mode=0):
    """
    Mimics the functionality of random.sample() but with a triangular
    distribution over the length of the sequence.

    Mode defaults to 0, which favors lower indices.
    """
    psize = len(population)
    if k > psize:
        raise ValueError("k must be less than the number of items in population.")
    if mode > psize:
        raise ValueError("mode must be less than the number of items in population.")

    indices_chosen = []
    sample = []

    for i in range(k):
        # This ensures unique selections
        while True:
            choice = math.floor(random.triangular(0, psize, mode))
            if choice not in indices_chosen:
                break

        indices_chosen.append(choice)
        sample.append(population[choice])

    return sample
...
My suspicion is that this is not an ideal way of preventing duplicate items being pulled. My first thought when designing this was to make a duplicate of population and .pop() the items as they're sampled to prevent choosing the same item twice, but I saw two problems with that:
If population is a list of objects, there could be some difficulty with duplicating the list while still ensuring that the items in sample point to the same objects in population.
Using .pop() on the population would change the size of the population, altering the distribution each time. Ideally, the distribution (not sure if I'm using the term correctly--the probability of each item being called) would be the same no matter what order the items are chosen in.
Is there a more efficient way of taking a non-uniform random sample from a population?
You can achieve what you want by using numpy.random.choice.
The input to this function is as follows:
numpy.random.choice(a, size=None, replace=True, p=None)
so you could specify the weight vector p to be your desired probability distribution, and also choose replace=False, so that samples would not be repeated.
Alternatively, you could sample directly from the triangular distribution using numpy.random.triangular. You can do that in a loop, and add the new result to the list only if it did not appear there before.
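For illustration, one way to combine both suggestions is to build a triangular-shaped weight vector over the indices and hand it to numpy.random.choice with replace=False. The tent-shaped weighting below is an assumption for the sketch, not an exact replica of random.triangular's density:
import numpy as np

def tri_sample_np(population, k, mode=0):
    # weight index i with a piecewise-linear "tent" that peaks at `mode`,
    # then let numpy.random.choice draw k distinct indices with those weights
    psize = len(population)
    idx = np.arange(psize, dtype=float)
    w = np.where(idx <= mode,
                 1.0 - (mode - idx) / psize,
                 1.0 - (idx - mode) / psize)
    p = w / w.sum()                        # normalise to a probability vector
    chosen = np.random.choice(psize, size=k, replace=False, p=p)
    return [population[i] for i in chosen]

print(tri_sample_np(list("abcdefgh"), 3, mode=0))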

Fast way to obtain a random index from an array of weights in python

I regularly find myself in the position of needing a random index to an array or a list, where the probabilities of indices are not uniformly distributed, but according to certain positive weights. What's a fast way to obtain them? I know I can pass weights to numpy.random.choice as the optional argument p, but the function seems quite slow, and building an arange to pass it is not ideal either. The sum of weights can be an arbitrary positive number and is not guaranteed to be 1, which rules out the approach of generating a random number in (0,1] and then subtracting weight entries until the result is 0 or less.
While there are answers on how to implement similar things (mostly not about obtaining the array index, but the corresponding element) in a simple manner, such as Weighted choice short and simple, I'm looking for a fast solution, because the appropriate function is executed very often. My weights change frequently, so the overhead of building something like an alias mask (a detailed introduction can be found on http://www.keithschwarz.com/darts-dice-coins/) should be considered part of the calculation time.
Cumulative summing and bisect
In any generic case, it seems advisable to calculate the cumulative sum of weights, and use bisect from the bisect module to find a random point in the resulting sorted array
def weighted_choice(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])
if speed is a concern. A more detailed analysis is given below.
Note: If the array is not flat, numpy.unravel_index can be used to transform a flat index into a shaped index, as seen in https://stackoverflow.com/a/19760118/1274613
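As a quick illustration of that note, with a made-up 2-D weight array:
import bisect
import numpy as np

weights_2d = np.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])
cs = np.cumsum(weights_2d.ravel())                       # flatten, then cumulative sum
flat_idx = bisect.bisect(cs, np.random.random() * cs[-1])
row, col = np.unravel_index(flat_idx, weights_2d.shape)  # back to a 2-D index
print(row, col)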
Experimental Analysis
There are four more or less obvious solutions using numpy builtin functions. Comparing all of them using timeit gives the following result:
import timeit

weighted_choice_functions = [
"""import numpy
wc = lambda weights: numpy.random.choice(
    range(len(weights)),
    p=weights/weights.sum())
""",
"""import numpy
# Adapted from https://stackoverflow.com/a/19760118/1274613
def wc(weights):
    cs = numpy.cumsum(weights)
    return cs.searchsorted(numpy.random.random() * cs[-1], 'right')
""",
"""import numpy, bisect
# Using bisect mentioned in https://stackoverflow.com/a/13052108/1274613
def wc(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])
""",
"""import numpy
wc = lambda weights: numpy.random.multinomial(
    1,
    weights/weights.sum()).argmax()
"""]

for setup in weighted_choice_functions:
    for ps in ["numpy.ones(40)",
               "numpy.arange(10)",
               "numpy.arange(200)",
               "numpy.arange(199,-1,-1)",
               "numpy.arange(4000)"]:
        print(timeit.timeit("wc(%s)" % ps, setup=setup))
    print()
The resulting output is
178.45797914802097
161.72161589498864
223.53492237901082
224.80936180002755
1901.6298267539823

15.197789980040397
19.985687876993325
20.795070077001583
20.919113760988694
41.6509403079981

14.240949985047337
17.335801470966544
19.433710905024782
19.52205040602712
35.60536142199999

26.6195822560112
20.501282756973524
31.271995796996634
27.20013752405066
243.09768892999273
This means that numpy.random.choice is surprisingly very slow, and even the dedicated numpy searchsorted method is slower than the type-naive bisect variant. (These results were obtained using Python 3.3.5 with numpy 1.8.1, so things may be different for other versions.) The function based on numpy.random.multinomial is less efficient for large weights than the methods based on cumulative summing. Presumably the fact that argmax has to iterate over the whole array and run comparisons at each step plays a significant role, as can also be seen from the four-second difference between an increasing and a decreasing weight list.

Should this be solved using a subset sum algorithm?

The problem asks me to find all possible subsets of a list that, added together (in pairs, alone, or several of them), will equal a given number. I have been reading a lot about subset sum problems and am not sure whether this applies here.
To explain the problem more, I have a max weight of candy that I am allowed to purchase.
I know the weight of ten pieces of different candy that I have stored in a list
candy = [ [snickers, 150.5], [mars, 130.3], ......]
I can purchase at most max_weight = 740.5 grams EXACTLY.
Thus I have to find all possible combinations of candy that will equal exactly the max_weight. I will be programming in Python. I don't need the exact code, just whether or not it is a subset sum problem, and possible suggestions on how to proceed.
OK, here's a brute-force approach exploiting numpy's index magic:
from itertools import combinations
import numpy as np

candy = [["snickers", 150.5], ["mars", 130.3], ["choc", 10.0]]
n = len(candy)

ww = np.array([c[1] for c in candy])  # extract the weights of the candies
idx = np.arange(n)                    # list of indices

iidx, sums = [], []
# generate all possible sums together with the index lists that produce them
for k in range(n):
    for ii in combinations(idx, k+1):
        ii = list(ii)  # convert the tuple to a list so it can be used as an index array
        sums.append(np.sum(ww[ii]))
        iidx.append(ii)
sums = np.asarray(sums)

ll = np.where(np.abs(sums - 160.5) < 1e-9)[0]  # filter out combinations that match 160.5
# print results
for le in ll:
    print([candy[e] for e in iidx[le]])
This is exactly the subset sum problem. You could use a dynamic programming approach to solve it.
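A rough sketch of that dynamic-programming idea, assuming the gram weights can be scaled to integers (here by 10, since they have one decimal place); the function name and structure are illustrative, not a reference implementation:
def subset_sums_dp(weights, target, scale=10):
    # dp maps each reachable (scaled) sum to the index subsets that produce it
    w = [round(x * scale) for x in weights]
    t = round(target * scale)
    dp = {0: [[]]}
    for i, wi in enumerate(w):
        new = {}
        for s, subsets in dp.items():          # only sums reachable without item i
            ns = s + wi
            if ns <= t:
                new.setdefault(ns, []).extend(sub + [i] for sub in subsets)
        for s, subsets in new.items():         # merge so item i is used at most once
            dp.setdefault(s, []).extend(subsets)
    return dp.get(t, [])

candy = [["snickers", 150.5], ["mars", 130.3], ["choc", 10.0]]
for subset in subset_sums_dp([c[1] for c in candy], 160.5):
    print([candy[i][0] for i in subset])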

Python list reordering, remember original order?

I'm working on a Bayesian probability project, in which I need to adjust probabilities based on new information. I have yet to find an efficient way to do this. What I'm trying to do is start with an equal probability list for distinct scenarios. Ex.
There are 6 people: E, T, M, Q, L, and Z, and their initial respective probabilities of being chosen are represented in
myList=[.1667, .1667, .1667, .1667, .1667, .1667]
New information surfaces that people in the first third alphabetically have a collective 70% chance of being chosen. A new list is made, sorted alphabetically by name (E, L, M, Q, T, Z), that just includes the new information. (.7/.333=2.33, .3/.667=.45)
newList=[2.33, 2.33, .45, .45, .45, .45]
I need a way to order newList the same as myList so that I can multiply the right values in a list comprehension and reach the adjusted probabilities. Having a single consistent order is important because the process will be repeated several times, each with different criteria (vowels, closest to P, etc.), and on a list with about 1000 items.
Each newList could instead be a newDictionary, and then once the adjustment criteria are created they could be ordered into a list, but transforming multiple dictionaries seems inefficient. Is it? Is there a simple way to do this I'm entirely missing?
Thanks!
For what it's worth, the best thing you can do for the speed of your methods in Python is to use numpy instead of the standard types (you'll thus be using pre-compiled C code to perform arithmetic operations). This will lead to a dramatic speed increase. Numpy arrays have fixed orderings anyway, and syntax is more directly applicable to mathematical operations. You just need to consider how to express the operations as matrix operations. E.g. your example:
import numpy as np

myList = np.ones(6) / 6.
newInfo = np.array([.7/2, .7/2, .3/4, .3/4, .3/4, .3/4])
result = myList * newInfo
Since both vectors have unit sum there's no need to normalise (I'm not sure what you were doing in your example, I confess, so if there's a subtlety I've missed let me know), but if you do need to it's trivial:
result /= np.sum(result)
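To keep every update aligned with the original order, one option (sketched here with the names from the question; the dictionary keyed by person is an assumption) is to fix the person order once and build each adjustment vector by indexing into that order:
import numpy as np

people = ['E', 'T', 'M', 'Q', 'L', 'Z']       # fixed original order
probs = np.ones(len(people)) / len(people)

# new information arrives keyed by name, in whatever order it was produced
adjustment_by_name = {'E': 2.33, 'L': 2.33, 'M': .45, 'Q': .45, 'T': .45, 'Z': .45}

# build the adjustment vector in the original order, then update and renormalise
adjust = np.array([adjustment_by_name[p] for p in people])
probs = probs * adjust
probs /= probs.sum()
print(dict(zip(people, probs.round(4))))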
Try storing your info as a list of tuples:
bayesList = [('E', 0.1667), ('M', 0.1667), ...]
your list comprehension can be along the lines of
newBayes = [(person, prob * normalizeFactor) for person, prob in bayesList]
where normalizeFactor was calculated before setting up your list comprehension.
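A slightly fuller sketch of that idea, reusing the numbers from the question (the firstThird set is an assumption standing in for whatever the current criterion is):
bayesList = [('E', 0.1667), ('T', 0.1667), ('M', 0.1667),
             ('Q', 0.1667), ('L', 0.1667), ('Z', 0.1667)]

# apply the criterion by name, so the original list order never changes
firstThird = {'E', 'L'}
adjusted = [(person, prob * (2.33 if person in firstThird else .45))
            for person, prob in bayesList]

# normalizeFactor rescales the adjusted values back into a proper distribution
normalizeFactor = 1 / sum(prob for _, prob in adjusted)
newBayes = [(person, prob * normalizeFactor) for person, prob in adjusted]
print(newBayes)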
