Mimicking random.sample() for non-uniform distributions - python

I want to emulate the functionality of random.sample() in Python, but with a non-uniform (in this case, triangular) distribution of choices. What's important here is that a single item is not chosen twice (as described in the random.sample docs). Here's what I have:
...
import math
import random

def tri_sample(population, k, mode=0):
    """
    Mimics the functionality of random.sample() but with a triangular
    distribution over the length of the sequence.

    Mode defaults to 0, which favors lower indices.
    """
    psize = len(population)
    if k > psize:
        raise ValueError("k must not exceed the number of items in population.")
    if mode > psize:
        raise ValueError("mode must not exceed the number of items in population.")

    indices_chosen = []
    sample = []
    for i in range(k):
        # This ensures unique selections
        while True:
            # min() guards against the rare case where triangular()
            # returns exactly psize, which floor() would not reduce
            choice = min(math.floor(random.triangular(0, psize, mode)), psize - 1)
            if choice not in indices_chosen:
                break
        indices_chosen.append(choice)
        sample.append(population[choice])
    return sample
...
My suspicion is that this is not an ideal way of preventing duplicate items being pulled. My first thought when designing this was to make a duplicate of population and .pop() the items as they're sampled to prevent choosing the same item twice, but I saw two problems with that:
If population is a list of objects, there could be some difficulty with duplicating the list while still ensuring that the items in sample point to the same objects in population.
Using .pop() on the population would change the size of the population, altering the distribution each time. Ideally, the distribution (not sure if I'm using the term correctly--the probability of each item being called) would be the same no matter what order the items are chosen in.
Is there a more efficient way of taking a non-uniform random sample from a population?

You can achieve what you want by using numpy.random.choice
The input to this function is as follows:
numpy.random.choice(a, size=None, replace=True, p=None)
so you can specify the weight vector p to be your desired probability distribution, and pass replace=False so that samples are not repeated.
Alternatively, you could sample directly from the triangular distribution using numpy.random.triangular. You can do that in a loop, adding each new result to the list only if it did not appear there before.
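A minimal sketch of the first suggestion. The population and the linearly decreasing weight vector are assumptions here; a decreasing linear ramp is what a triangular distribution with mode 0 looks like over the indices. Note that with replace=False the remaining weights are effectively renormalized after each draw, so later picks are conditioned on earlier ones, which is the same distribution-shift concern the question raises about .pop().

import numpy as np

population = list("abcdefghij")
n = len(population)

# Triangular-shaped weights over the indices (mode at index 0),
# normalized to sum to 1 as required by the p argument.
weights = np.arange(n, 0, -1, dtype=float)
weights /= weights.sum()

# Four distinct indices, lower indices favored.
idx = np.random.choice(n, size=4, replace=False, p=weights)
sample = [population[i] for i in idx]
print(sample)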

Related

How to generate unique(!) arrays/lists/sequences of uniformly distributed random numbers

Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations no array is equal to another?
In one array, the elements can be duplicates. The array just has to differ from the other arrays with at least one different element from all its elements.
Is there any numpy method for this? Is there some special algorithm which works differently by exploring some space for the random generation? I don’t know.
One easy answer would be to write the arrays to a file and check if they were generated already, but the I/O operations on an ever-growing file take far too much time.
This is a difficult request, since one of the properties of a RNG is that it should repeat sequences randomly.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
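If the set of hashes outgrows RAM, one ready-made disk-backed search structure is SQLite: a table with a primary key is backed by an on-disk B-tree, so membership tests stay fast without holding everything in memory. A minimal sketch; the file name, the hashing scheme, and the pack shape are all assumptions:

import hashlib
import random
import sqlite3

db = sqlite3.connect("packs.db")
db.execute("CREATE TABLE IF NOT EXISTS seen (h BLOB PRIMARY KEY)")

def register_if_new(pack):
    """Record the pack's hash; return True only if it was never seen before."""
    h = hashlib.sha256(repr(pack).encode()).digest()
    try:
        with db:   # the connection as context manager commits or rolls back
            db.execute("INSERT INTO seen VALUES (?)", (h,))
        return True
    except sqlite3.IntegrityError:   # primary key already present
        return False

pack = [random.randint(0, 9) for _ in range(10)]
while not register_if_new(pack):
    pack = [random.randint(0, 9) for _ in range(10)]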
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions", the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64-bit (DES) or 128-bit (AES) output, you will need some sort of format-preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs; see the limits highlighted by #Prune.
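To make the encrypt-a-counter idea concrete, here is a toy sketch: a balanced Feistel network over 34-bit integers (the smallest even width covering the 10**10 ten-digit packs) with cycle-walking to keep the output inside the pack range. This is an illustration only, not a vetted format-preserving scheme such as FF1, and the key handling and round count are assumptions:

import hashlib

BITS = 34            # 2**34 >= 10**10, the number of 10-digit packs
HALF = BITS // 2
MASK = (1 << HALF) - 1

def feistel(x, key, rounds=4):
    # A keyed permutation of [0, 2**BITS); the round function is
    # a truncated SHA-256 of (key, round, right half).
    left, right = x >> HALF, x & MASK
    for r in range(rounds):
        digest = hashlib.sha256(f"{key}:{r}:{right}".encode()).digest()
        left, right = right, left ^ (int.from_bytes(digest[:4], "big") & MASK)
    return (left << HALF) | right

def nth_pack(index, key, digits=10, base=10):
    # Encrypt the counter, cycle-walking until the value lands in
    # [0, base**digits); this keeps the mapping a bijection on that range.
    n = base ** digits
    x = feistel(index, key)
    while x >= n:
        x = feistel(x, key)
    # Convert to a `digits`-digit base-`base` pack, most significant first.
    pack = []
    for _ in range(digits):
        x, d = divmod(x, base)
        pack.append(d)
    return pack[::-1]

print(nth_pack(0, key="secret"))
print(nth_pack(1, key="secret"))

Distinct counter values below base**digits are guaranteed to map to distinct packs, so the only state you need to keep is how many packs you have generated so far.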
Note that as the number of requested packs approaches the number of unique packs this takes longer and longer to find a pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random

## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    existing_packs = set()   # stores hashes, so a rare hash collision could
                             # reject a genuinely new pack
    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)
    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)

## -----------------------
for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
for a given amount of state and number of iterations what's the probability of seeing at least one collision
for a given amount of state how many iterations can you generate to stay below a given probability of seeing at least one collision
for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p

def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2*-log1p(-p))

log2(birthday_state_size(1e12, 1e-6))  # => ~100
So as long as you have more than 100 uniform bits of state in each pack everything should be fine. For example, two or more Python floats is OK (2 * 53), as is 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.
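For reference, option 1 from the list above is the same bound rearranged; a small sketch using the standard approximation P ≈ 1 - exp(-n²/(2d)) from the birthday attack page:

from math import expm1

def collision_probability(iterations, state_bits):
    # P(at least one duplicate) ~= 1 - exp(-n**2 / (2*d))
    # for n draws from d equally likely values.
    d = 2.0 ** state_bits
    return -expm1(-iterations**2 / (2 * d))

collision_probability(1e12, 100)   # => ~4e-7, consistent with the ~100-bit figure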

Selecting random elements in a list conditional on attribute

import random

class Agent:
    def __init__(self, state):
        self.state = state

# initialize values
state_0_agents = 10
state_1_agents = 10
numberofselections = 2  # number of agents who can choose to transition to the higher plane

# list of agents
agents = [Agent(0) for i in range(state_0_agents)]
agents.extend(Agent(1) for i in range(state_1_agents))

random.choice(agents)
I want to randomly select a couple of agents from this Agents list whose state I will end up changing to 1. Unfortunately the random.choice function selects among all the elements. However I want to randomly select only among those whose state is 0.
I would prefer if this could occur without creating a new list.
I see 3 options here:
Create a list anyway, you can do so with a list comprehension:
random.choice([a for a in agents if a.state == 0])
Put the random.choice() call in a loop, keep trying until you get one that matches the criteria:
while True:
    agent = random.choice(agents)
    if agent.state == 0:
        break
Index your agents list, then pick from that index; these are really just lists still:
agent_states_index = {}
for index, agent in enumerate(agents):
    agent_states_index.setdefault(agent.state, []).append(index)

agent_index = random.choice(agent_states_index[0])
agent = agents[agent_index]
There are four algorithms I know of for this.
The first is detailed in this answer. Iterate through the array; each time you come across an element that satisfies the condition, select it if a uniform random number in [0, 1) is less than 1/(the number of satisfying elements seen so far). A sketch follows below.
The second is to iterate through your array, adding to a new array elements that fulfill the condition, then randomly pick one out of that list.
Both of these algorithms run in O(n) time, where n is the size of the array. They are guaranteed to find an element if it is there and satisfies the condition.
There are another two algorithms that are much faster. They both run in O(1) time but have some major weaknesses.
The first is to keep picking indexes randomly until you hit one that satisfies the condition. This has potentially unbounded running time but is O(1) in practice. (If very few elements satisfy the condition and the array is very large, say 1 in 10000 elements, it becomes slow.) It also cannot prove that no element satisfies the condition: if there is none, you either loop forever, or cap the number of guesses and risk missing an element that is actually there.
The second is to pick a random index, then keep incrementing it until you find an index that satisfies the condition. It is guaranteed to either find an acceptable index or look through all of the indexes without entering an infinite loop. Its downside is that it is not completely random. Obviously, if you increment the index by 1 every time, the result will be very non-random (if acceptable indexes come in clumps). However, if you choose the increment randomly from a handful of numbers that are coprime to the number of elements of the array, the result is still not perfectly fair and random, but reasonably so, and it is guaranteed to succeed.
Again, these last 2 algorithms are very fast but are either not guaranteed to work or not guaranteed to be completely random. I don't know of an algorithm that is both fast, guaranteed to work, and completely fair and random.
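Here is a sketch of the first algorithm, which is reservoir sampling with a reservoir of size 1; the helper name choice_where is made up for this example:

import random

def choice_where(items, pred):
    # Single pass: the i-th matching element replaces the current pick
    # with probability 1/i, leaving every match equally likely overall.
    chosen = None
    matches = 0
    for item in items:
        if pred(item):
            matches += 1
            if random.randrange(matches) == 0:
                chosen = item
    return chosen   # None if nothing matched

# e.g. choice_where(agents, lambda a: a.state == 0)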
Use numpy.where:
import random
import numpy as np

class Agent:
    def __init__(self, state):
        self.state = state

# initialize values
state_0_agents = 10
state_1_agents = 10

# list of agent states
agents = [0]*state_0_agents
agents += [1]*state_1_agents

# np.where returns a tuple of index arrays, one per dimension; take the first
selected_agent_idx = random.choice(np.where(np.array(agents) == 0)[0])
You can also use numpy's nonzero function, which returns the indices of all non-zero elements of an array. You can then combine it with random.choice to change the value of a randomly chosen element from that index list:
import numpy as np

agents = np.array(agents)   # the == comparison needs an array, not a list
index_agent0 = np.nonzero(agents == 0)[0]
agents[np.random.choice(index_agent0)] = 1

Sorting points on multiple lines

Given that we have two lines on a graph (I just noticed that I inverted the numbers on the Y axis, this was a mistake, it should go from 11-1)
And we only care about whole number X axis intersections
We need to order these points from highest Y value to lowest Y value regardless of their position on the X axis (Note I did these pictures by hand so they may not line up perfectly).
I have a couple of questions:
1) I have to assume this is a known problem, but does it have a particular name?
2) Is there a known optimal solution when dealing with tens of billions (or hundreds of millions) of lines? Our current process of manually calculating each point and then comparing it to a giant list requires hours of processing. Even though we may have a hundred million lines, we typically only want the top 100 or 50,000 results; some lines are so far "below" the others that calculating their points is unnecessary.
Your data structure is a set of tuples
lines = {(y0, Δy0), (y1, Δy1), ...}
You need only the ntop points, hence build a set containing only
the top ntop yi values, with a single pass over the data. To choose the
ntop we also have to keep track of the smallest of them, and this is
interesting info, so let's have choose return that value as well, and
initialize decremented
top_points, smallest = choose(lines, ntop)
decremented = top_points
and start a loop...
while True:
    # generate a set of decremented values, dropping any that fall
    # below the smallest of the current top points
    decremented = {(y - Δy, Δy) for y, Δy in decremented if y > smallest}
    if not decremented:   # an empty set is falsy; note that the test
        break             # `decremented == {}` compares against an empty dict
    # generate a set of candidates
    candidates = top_points.union(decremented)
    # generate a new set of top points
    top_points, smallest = choose(candidates, ntop)
(An earlier version of this loop also checked whether the new set of top points equaled the old one and broke out of the loop, but that is no longer necessary once decremented values below smallest are filtered out.)
The difficult part is the choose function, but I think that this
answer to the question
How can I sort 1 million numbers, and only print the top 10 in Python?
could help you.
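That linked answer likely comes down to heapq.nlargest; here is a possible choose, assuming each point is a (y, Δy) tuple as above:

import heapq

def choose(points, ntop):
    # Keep the ntop points with the largest y, in O(n log ntop) time.
    top = heapq.nlargest(ntop, points, key=lambda p: p[0])
    smallest = top[-1][0]     # smallest y among the kept points
    return set(top), smallest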
It's not a really complicated thing, just a "normal" sorting problem.
Usually sorting requires a large amount of computing time. But your case is one where you don't need to use complex sorting techniques.
The values on both graphs grow or fall monotonically; there are no "jumps". You can use this to your advantage. The basic algorithm:
identify if a graph is growing or falling.
write a generator that yields the values: from left to right if rising, from right to left if falling.
get the first value from both graphs
insert the lower one into the result list
get a new value from the graph that had the lower value
repeat the last two steps until one generator is "empty"
append the leftover items from the other generator (see the sketch after this list).
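A sketch of that merge with two hypothetical lines; heapq.merge performs steps 3 to 7 of the list above, including draining the leftover generator:

import heapq

def rising(y0, dy, n):
    # left to right: values come out ascending
    return (y0 + dy * x for x in range(n))

def falling(y0, dy, n):
    # right to left: values come out ascending as well
    return (y0 - dy * x for x in range(n - 1, -1, -1))

# merge takes the lower of the two stream fronts each time
ascending = heapq.merge(rising(1, 1, 11), falling(11, 1, 11))
print(list(ascending)[::-1])   # highest Y value first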

random.sample and "selection order"

help(random.sample)
says "The resulting list is in selection order so that all sub-slices will also be valid random samples"
What does selection order mean? If there were no requirement for selection order, how would resulting list look like? How could sub-slice not be a valid random sample?
Update: As far as I understand it, it means that the results will probably not be sorted in any way.
random.sample(population, k)
Given a population sequence, it returns a list of length k with elements chosen (or selected) from the population. Selection order refers to the order in which the elements were selected (at random). The list is thus ordered not by the elements' indexes in population but by how the selection was made. Thus any sub-slice of the returned list is also a random sample of the population.
Example -
>>> import random
>>> population=[1,2,3,4,5,6,7,8,9,10,11,12,]
>>> ls=random.sample(population,5)
>>> ls
[1, 11, 7, 12, 6]
The returned list has elements in the order they were selected. So you can use sub-slicing on ls and not lose randomness
>>> ls[:3]
[1, 11, 7]
If selection ordering was not enforced, you could have ls look like
[1,6,7,11,12]
The sub-slice would then not be completely random but constrained by the length of the slice. E.g. the greatest value could never occur in a sub-slice of length 3 (here that slice would always be [1, 6, 7]).
The full help string is:
sample(self, population, k) method of random.Random instance
    Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use xrange as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(xrange(10000000), 60)
So taking the example of a raffle; all the tickets rolling around inside the drum are the population, and k is the number of tickets drawn. The set of all the tickets drawn is the result of the random sample.
The sample is not sorted, nor altered in any way, it is in the order it is drawn. If you imagine that you went to a raffle, and they drew 100 tickets first, and discarded them, and then started drawing the actual tickets, the set of winning tickets would still be a random sample of the population. This is equivalent to taking slices of the first larger sample.
What it's saying, is that any sub slice of any sample, is still a valid random sample.
To answer your questions:
selection order is just the order in which the values are drawn to make up the sample.
without ensuring selection order, the sample may be sorted somehow.
You can imagine the following code creating a random sample while ensuring selection order:
import random

def sample(population, k):
    sample = []
    popsize = len(population) - 1
    while len(sample) < k:   # `<= k` would return k + 1 elements
        r = population[random.randint(0, popsize)]
        if r not in sample:
            sample.append(r)
    return sample

Retrieve List Position Of Highest Value?

There is a list with float values, which can differ or not. How can I find the randomly chosen list-index of one of the highest values in this list?
If the context is interesting to you:
I am trying to write a solver for the pen&paper game Battleship. I attempt to calculate the probability of a hit for each of the fields, and then want the solver to shoot at one of the most likely spots: that means retrieving the index of the highest likelihood in my likelihood list and passing that index to the game engine as my choice. Already the first move shows that many fields can share the same likelihood. In this case it makes sense to choose one of them at random (and not always take the first one, or anything like that).
Find the maximum using How to find all positions of the maximum value in a list?, then pick a random position from that list using random.choice:
>>> import random
>>> a = [0.2, 0.9, 0.5, 0.9]   # example likelihoods (an assumption)
>>> m = max(a)
>>> max_pos = [i for i, j in enumerate(a) if j == m]
>>> random.choice(max_pos)
