Random.choices cum_weights - python

please I need more clarity on this, I really do not understand it well. Using this as an example
import random

my_list = [9999, 45, 63, 19, 89, 5, 72]
cum_w = [1, 9, 10, 9, 2, 12, 7]
d_rand = random.choices(my_list, cum_weights=cum_w, k=7)
sum = 0
for idx, i in enumerate(cum_w):
    if idx == 0:
        for i in cum_w: sum += i
    print(f"cum_weight for {my_list[idx]}\t= {i/sum}\tRandom={random.choices(my_list, cum_weights=cum_w, k=7)}")
Below is the output
cum_weight for 9999 = 0.14 Random=[45, 45, 9999, 45, 45, 9999, 45]
cum_weight for 45 = 0.18 Random=[45, 45, 45, 45, 9999, 45, 45]
cum_weight for 63 = 0.2 Random=[45, 45, 45, 9999, 9999, 9999, 45]
cum_weight for 19 = 0.18 Random=[45, 45, 45, 45, 45, 45, 9999]
cum_weight for 89 = 0.04 Random=[9999, 45, 45, 45, 45, 9999, 45]
cum_weight for 5 = 0.24 Random=[45, 45, 45, 45, 45, 45, 45]
cum_weight for 72 = 0.14 Random=[45, 45, 9999, 45, 45, 45, 45]
The probability for a weight of 9 (cum_w[1] and cum_w[3]) is 0.18.
Why does 45 (weight 9) occur so often?
I've read the random.choices documentation and it doesn't really get through to me.
How does cum_weights work?
I would really appreciate an in-depth explanation.

You asked "Why does 45(9) occur so often?" and "How do the cum_weights work?" Addressing the second question will explain the first. Note that what follows is an implementation of one approach used for this kind of problem. I'm not claiming that this is python's implementation, it is intended to illustrate the concepts involved.
Let's start by looking at how values can be generated if you use cumulative weights, i.e., a list where at each index the entry is the sum of all weights up to and including the current index.
import random

# Given cumulative weights, convert them to proportions, then generate U ~ Uniform(0,1)
# random values to use in a linear search to generate values in the correct proportions.
# This is based on the well-known probability result that P{a <= U <= b} = (b - a) for
# 0 <= a < b <= 1.
def gen_cumulative_weighted(values, c_weights):  # values and c_weights must be lists of the same length
    # Convert cumulative weights to probabilities/proportions by dividing by the last value.
    # This yields a list of non-decreasing values between 0 and 1. Note that the last entry
    # is always 1, so a Uniform(0, 1) random number will *always* be less than or equal to
    # some entry in the list.
    p = [c_weights[i] / c_weights[-1] for i in range(len(c_weights))]
    while True:
        index = 0  # starting from the beginning of the list
        # The following three lines find the first index having the property u <= p[index].
        u = random.random()
        while u > p[index]:
            index += 1
        yield values[index]  # yield the corresponding value
As the comments point out, the weights are scaled by the last (and largest) value to map them to a set of values in the range (0,1). These can be thought of as the right-most endpoints of non-overlapping subranges, each of which has a length equal to the corresponding scaled weight. (Sketch it out on paper if this is unclear; you should see it pretty quickly.) A generated Uniform(0,1) value will fall in one of those subranges, and the probability it does so is equal to the length of the subrange according to a well-known result from probability.
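For instance (a quick sketch, not part of the generator above), the asker's raw weights accumulate to the following right-most endpoints, which is the form gen_cumulative_weighted expects:
from itertools import accumulate

raw = [1, 9, 10, 9, 2, 12, 7]
cum = list(accumulate(raw))             # [1, 10, 20, 29, 31, 43, 50]
endpoints = [c / cum[-1] for c in cum]  # [0.02, 0.2, 0.4, 0.58, 0.62, 0.86, 1.0]
# e.g. the subrange for 45 is (0.02, 0.2], whose length is 9/50 = 0.18
print(endpoints)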
If we have the raw weights rather than the cumulative weights, all we have to do is convert them to cumulative and then pass the work off to the cumulative weighted version of the generator:
def gen_weighted(values, weights):  # values and weights must be lists of the same length
    cumulative_w = [sum(weights[:i+1]) for i in range(len(weights))]
    return gen_cumulative_weighted(values, cumulative_w)
Now we're ready to use the generators:
my_values = [9999, 45, 63, 19, 89, 5, 72]
my_weights = [1, 9, 10, 9, 2, 12, 7]
good_gen = gen_weighted(my_values, my_weights)
print('Passing raw weights to the weighted implementation:')
print([next(good_gen) for _ in range(20)])
which will produce results such as:
Passing raw weights to the weighted implementation:
[63, 5, 63, 63, 72, 19, 63, 5, 45, 63, 72, 19, 5, 89, 72, 63, 63, 19, 89, 45]
Okay, so what happens if we pass raw weights to the cumulative weighted version of the algorithm? Your raw weights of [1, 9, 10, 9, 2, 12, 7] get scaled by dividing by the last value, and become [1/7, 9/7, 10/7, 9/7, 2/7, 12/7, 1]. When we generate u ~ Uniform(0, 1) and use it to search linearly through the scaled weights, it will yield index zero => 9999 with probability 1/7, and index one => 45 with probability 6/7! This happens because u is always ≤ 1, and therefore always less than 9/7. As a result, the linear search will never get past any scaled weight ≥ 1, which for your inputs means it can only generate the first two values and does so with the wrong weighting.
print('Passing raw weights to the cumulative weighted implementation:')
bad_gen = gen_cumulative_weighted(my_values, my_weights)
print([next(bad_gen) for _ in range(20)])
produces results such as:
Passing raw weights to the cumulative weighted implementation:
[45, 45, 45, 45, 45, 45, 45, 9999, 45, 9999, 45, 45, 45, 45, 45, 9999, 45, 9999, 45, 45]
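To tie this back to the original question: random.choices behaves correctly if you pass the raw weights via the weights parameter, or genuinely cumulative weights via cum_weights. A quick sketch (my example, not part of the answer above); these two calls draw from the same distribution:
import random
from itertools import accumulate

my_list = [9999, 45, 63, 19, 89, 5, 72]
weights = [1, 9, 10, 9, 2, 12, 7]

print(random.choices(my_list, weights=weights, k=7))
print(random.choices(my_list, cum_weights=list(accumulate(weights)), k=7))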

Related

Generating discrete random numbers under constraints

I have a problem that I'm not sure how to solve properly.
Suppose we have to generate 1 <= n <= 40 numbers: X[1], X[2], ..., X[n].
For each number, we have some discrete space we can draw a number from. This space is not always a range and can be quite large (thousands/millions of numbers).
Another constraint is that the resulting array of numbers should be sorted in ascending order: X[1] <= X[2] <= ... <= X[n].
As an example for three numbers:
X[1] in {8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
X[2] in {10, 20, 30, 50}
X[3] in {1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003}
Examples of valid outputs for this test: [9, 20, 2001], [18, 30, 1995]
Example of invalid outputs for this test: [25, 10, 1998] (not increasing order)
I already tried different methods, but what I'm not satisfied with is that they all yield results that are not uniformly distributed, i.e. there is a strong bias in all my solutions and some samples are underrepresented.
One of the methods is to try to randomly generate numbers one by one and at each iteration reduce the space for the upcoming numbers to satisfy the increasing order condition. This is bad because this solution always biases the last numbers towards the higher end of their possible range.
I already gave up on looking for an exact solution that could yield samples uniformly. I would really appreciate any reasonable solution (preferably, on Python, but anything will do, really).
I won't code it for you but here's the logic for the non-brute-force approach:
Let's define N(i,x) as the number of possible samples of X[1],...,X[i] where X[i]=x, and S(i) as the set of possible values for X[i]. You have the recursion formula N(i,x) = sum over y in S(i-1) with y <= x of N(i-1,y). This allows you to very quickly compute all N(i,x). It is then easy to build up your sample from the end:
Knowing all N(n,x), you can draw X[n] from S(n) with probability N(n,X[n]) / (sum over x in S(n) of N(n,x)).
Then you keep building down: given you have already drawn X[n],X[n-1],...,X[i+1], you draw X[i] from S(i) with X[i]<=X[i+1], with probability N(i,X[i]) / (sum over x in S(i) with x<=X[i+1] of N(i,x)).
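Here is a minimal sketch of that recursion (an illustration of the logic above, not the answerer's code), assuming each S(i) is given as an iterable of integers:
import random
from bisect import bisect_right

def sample_uniform(sets):
    sets = [sorted(s) for s in sets]
    # Forward pass: counts[i][j] holds N for the j-th value of S(i+1), i.e. the
    # number of valid prefixes X[1..i+1] ending with that value.
    counts = [[1] * len(sets[0])]
    for i in range(1, len(sets)):
        prev_vals, prev_counts = sets[i - 1], counts[i - 1]
        prefix = [0]                      # prefix sums of the previous counts
        for c in prev_counts:
            prefix.append(prefix[-1] + c)
        # N(i, x) = sum of N(i-1, y) over y in S(i-1) with y <= x
        counts.append([prefix[bisect_right(prev_vals, x)] for x in sets[i]])
    # Backward pass: draw X[n] proportionally to N(n, .), then X[n-1] <= X[n], etc.
    result, upper = [], float('inf')
    for vals, cnts in zip(reversed(sets), reversed(counts)):
        k = bisect_right(vals, upper)     # keep only values <= the previously drawn X
        x = random.choices(vals[:k], weights=cnts[:k], k=1)[0]
        result.append(x)
        upper = x
    return result[::-1]

sets = [range(8, 32), [10, 20, 30, 50], range(1995, 2004)]
print(sample_uniform(sets))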
Here is an implementation of the heuristic I suggested in the comments:
import random

def rand_increasing(sets):
    # assume: sets is a list of sets
    sets = [s.copy() for s in sets]
    n = len(sets)
    indices = list(range(n))
    random.shuffle(indices)
    chosen = [0] * n
    for i, k in enumerate(indices):
        chosen[k] = random.choice(list(sets[k]))
        for j in indices[(i + 1):]:
            if j > k:
                sets[j] = {x for x in sets[j] if x > chosen[k]}
            else:
                sets[j] = {x for x in sets[j] if x < chosen[k]}
    return chosen

# test:
sets = [{8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31},
        {10, 20, 30, 50},
        {1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003}]

for _ in range(10):
    print(rand_increasing(sets))
Typical output:
[24, 50, 1996]
[26, 30, 2001]
[17, 30, 1995]
[11, 20, 2000]
[12, 20, 1996]
[11, 50, 2003]
[14, 20, 2002]
[9, 10, 2001]
[8, 30, 1999]
[8, 10, 1998]
Of course, if you can get uniform sampling with Julien's approach, that is preferable. (This heuristic might give uniform results, but that would require proof.) Also note that poor choices in the earlier stages might drive some of the later sets in the permutation to being empty, raising an error. The function could be called in a loop with proper error trapping, yielding a hit-or-miss approach, as sketched below.
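A minimal retry wrapper along those lines (a sketch, assuming rand_increasing as defined above; random.choice raises IndexError on an empty sequence):
def rand_increasing_retry(sets, max_tries=1000):
    for _ in range(max_tries):
        try:
            return rand_increasing(sets)
        except IndexError:  # a set was emptied by earlier choices; try again
            continue
    raise RuntimeError("no valid sample found in max_tries attempts")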

Finding max and min indices in lists in Python

I have a list that looks like:
trial_lst = [0.5, 3, 6, 40, 90, 130.8, 129, 111, 8, 9, 0.01, 9, 40, 90, 130.1, 112, 108, 90, 77, 68, 0.9, 8, 40, 90, 92, 130.4]
The list represents a series of experiments, each with a minimum and a maximum index. For example, in the list above, the minimum and maximum would be as follows:
Experiment 1:
Min: 0.5
Max: 130.8
Experiment 2:
Min: 0.01
Max: 130.1
Experiment 3:
Min: 0.9
Max: 130.4
I obtained the values for each experiment above because I know that each experiment starts at around zero (such as 0.4, 0.001, 0.009, etc.) and ends at around 130 (130, 131.2, 130.009, etc.). You can imagine a nozzle turning on and off. When it turns on, the pressure rises, and as it's turned off, the pressure dips. I am trying to calculate the minimum and maximum values for each experiment.
What I've tried so far is iterating through the list to first mark each index as max, but I can't seem to get that right.
Here is my code. Any suggestions on how I can change it?
result = []
for idx, item in enumerate(trial_lst):
    if idx > 0:
        prev = trial_lst[idx - 1]
        curr = item
        if prev > curr:
            result.append((curr, "max"))
        else:
            result.append((curr, ""))
I am looking for a manual way to do this, no libraries.
Use the easiest way (sort your list or array first):
trial_lst = [0.5, 3, 6, 40, 90, 130.8, 129, 111, 8, 9, 0.01, 9, 40, 90, 130.1, 112, 108, 90, 77, 68, 0.9, 8, 40, 90, 92, 130.4]
trial_lst.sort(key=float)
for count, items in enumerate(trial_lst):
    counter = count + 1
    last_object = (counter, trial_lst[count], trial_lst[(len(trial_lst) - 1) - count])
    print(last_object)
You can easily get the index of the minimum value using the following:
my_list.index(min(my_list))
Here is an interactive demonstration which may help:
>>> trial_lst = [0.5, 3, 6, 40, 90, 130.8, 129, 111, 8, 9, 0.01, 9, 40, 90, 130.1, 112, 108, 90, 77, 68, 0.9, 8, 40, 90, 92, 130.4]
Use values below 1 to identify where one experiment ends and another begins:
>>> indices = [x[0] for x in enumerate(map(lambda x: x < 1, trial_lst)) if x[1]]
Break the list into sublists at those values:
>>> sublists = [trial_lst[i:j] for i, j in list(zip([0] + indices, indices + [None]))[1:]]
Compute the max/min for each sublist:
>>> for i, l in enumerate(sublists):
...     print("Experiment", i + 1)
...     print("Min", min(l))
...     print("Max", max(l))
...     print()
...
Experiment 1
Min 0.5
Max 130.8
Experiment 2
Min 0.01
Max 130.1
Experiment 3
Min 0.9
Max 130.4
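Since the question asks for a manual, no-libraries way, here is one more single-pass sketch (my addition, not from the answers above): start a new experiment whenever a reading drops below 1, and track a running min and max per experiment.
experiments = []  # each entry is [current_min, current_max]
for value in trial_lst:
    if value < 1:  # a near-zero reading marks the start of a new experiment
        experiments.append([value, value])
    else:
        if value < experiments[-1][0]:
            experiments[-1][0] = value
        if value > experiments[-1][1]:
            experiments[-1][1] = value

for i, (lo, hi) in enumerate(experiments, 1):
    print(f"Experiment {i}: Min {lo}, Max {hi}")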

Python how to create bins with equal amount of data and plot them?

I have 6056 volume & price data points, and I want to create 20 bins (by volume) with an equal amount of data in each bin, find the average volume and price within each bin, and plot a graph of volume (x-axis) against price (y-axis).
I am trying to modify my code to change the intervals so that each contains the same number of data points.
The intervals don't need to be equally spaced. I want to have the same number of data points in each interval, determine the range of each interval, then find the average value of the data within each interval and plot it.
My current code is:
import numpy as np
import pandas as pd

dat = df['Volume_norm']

def discretize(data, bins):
    split = np.array_split(np.sort(data), bins)
    cutoffs = [x[-1] for x in split]
    cutoffs = cutoffs[:-1]
    discrete = np.digitize(data, cutoffs, right=True)
    return discrete, cutoffs

discrete_dat, cutoff = discretize(dat, 50)

df = pd.DataFrame({'X': TradeN['Volume_norm'], 'Y': TradeN['dMidP']})  # build a dataframe from the data
data_cut = pd.cut(dat, cutoff)  # cut the data following the bins
grp = df.groupby(by=data_cut)   # group the data by the cut
ret = grp.aggregate(np.mean)
However, when I count the data points per bin, this returns:
Counter({Interval(0.376, 0.46400000000000002, closed='right'): 2065,
Interval(0.83899999999999997, 0.92800000000000005, closed='right'): 563,
Interval(0.046399999999999997, 0.0557, closed='right'): 63,
Interval(0.56100000000000005, 0.67200000000000004, closed='right'): 121,
Interval(0.46400000000000002, 0.51000000000000001, closed='right'): 145,
Interval(0.11600000000000001, 0.14399999999999999, closed='right'): 105,
Interval(0.013899999999999999, 0.023199999999999998, closed='right'): 144,
Interval(0.14399999999999999, 0.186, closed='right'): 119,
Interval(0.186, 0.23200000000000001, closed='right'): 134,
which means the number of data points in each range is still different.
You will need to partition the data into collections of equal cardinality:
data = [...]  # your collection of data points
num_bins = 12
data_points_per_bin = len(data) // num_bins
bins = [data[_ * data_points_per_bin: (_ + 1) * data_points_per_bin] for _ in range(num_bins)]
This last line is a list comprehension that creates a list of lists (the bins) containing the data points. It iterates over all the data, slices it in groups of equal sizes, and stores it.
You will probably need to choose a num_bins that is a divisor of the number of data points, and is the closest to the appropriate number, or decide what to do with the data not allocated in a full bin.
for instance:
data = list(range(48))
num_bins = 12
data_points_per_bin = len(data) // num_bins
bins = [data[_ * data_points_per_bin: (_ + 1) * data_points_per_bin] for _ in range(num_bins)]
the output is:
[[0, 1, 2, 3],
[4, 5, 6, 7],
[8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31],
[32, 33, 34, 35],
[36, 37, 38, 39],
[40, 41, 42, 43],
[44, 45, 46, 47]]
Once the data is allocated to each bin, you can plot it.
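As an aside (a sketch, not part of the original answer), pandas can produce equal-count bins directly with pd.qcut, which would replace the manual cutoff computation; illustrated here on made-up data:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # made-up data for illustration
df = pd.DataFrame({'volume': rng.lognormal(size=6056),
                   'price': rng.normal(100, 5, size=6056)})

bins = pd.qcut(df['volume'], q=20)            # 20 bins with ~equal counts
ret = df.groupby(bins, observed=True).mean()  # average volume and price per bin
ret.plot(x='volume', y='price', marker='o')   # plotting requires matplotlib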

one liner to generate a random number between 0 and 100 but last digit ends in 0, 1, 2, 3, 4

Without concatenating strings or using if-conditionals and such, can this be done with an arithmetic one-liner? Preferably written in Python, please.
Background of the story: a legacy system requires that a listening port's last digit end in [0-4], so [5-9] is reserved for a mirror TCP port.
I wrote a deploy script which generates an idempotent random number, but I'm having trouble guaranteeing that the last digit is in [0-4]. Since the Ansible & Jinja2 template language is limited, I don't want the code to rely too much on hairy string operations and if-conditions.
Once an idempotent random number is generated, I need an arithmetic function to project the number to another integer whose last digit is between 0 and 4, while still guaranteeing idempotence.
After some discussion in the comments, if you've randomly selected a number x in range(0,50), you can map it to {0,1,2,3,4,10,11,...} like this:
y = 10*(x//5) + x % 5
For example:
In [8]: out = [10 * (x//5) + x % 5 for x in range(50)]
In [9]: out[:10]
Out[9]: [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
In [10]: out[-10:]
Out[10]: [80, 81, 82, 83, 84, 90, 91, 92, 93, 94]
This will turn any number you've generated in the [0,50) range into one satisfying your < 5 mod 10 criterion.
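A quick sanity check (my addition, not part of the original answer) that this mapping hits all 50 allowed values exactly once:
out = [10 * (x // 5) + x % 5 for x in range(50)]
assert len(set(out)) == 50                            # all 50 results are distinct
assert all(0 <= y < 100 and y % 10 < 5 for y in out)  # each ends in 0-4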
Here's one way to do it with random.sample and mod:
>>> import random
>>> [i for i in random.sample(range(100), 100) if i%10<5]
[24, 23, 62, 90, 80, 12, 4, 30, 43, 92, 21, 33, 41, 63, 52, 44, 81, 61, 31, 70, 73, 20, 0, 74, 2, 84, 11, 53, 13, 42, 50, 64, 60, 32, 71, 34, 72, 51, 1, 22, 91, 94, 40, 14, 82, 93, 3, 83, 54, 10]
You can use next and a generator expression to generate a single number:
>>> gen = (i for i in random.sample(range(101), 101) if i%10<5) # include 100 in sample
>>> next(gen)
72
>>> next(gen)
84
>>> next(gen)
51
>>> next(gen)
71
>>> next(gen)
34
>>> next(gen)
40
If you don't mind disallowing 100, you can simply generate two random numbers, one between 0 and 9, the other between 0 and 4, and combine them arithmetically.
port = 10 * random.randint(0,9) + random.randint(0,4)
This chooses any of the 50 such ports uniformly.
The simplest way to add 100 to the mix is to simply generate a random number between 0 and 50 (inclusive), and treat one of them as a proxy for 100, with the rest indicating that another random number be generated using the scheme above.
port = 100 if random.randint(0,50) == 50 else 10 * random.randint(0,9) + random.randint(0,4)
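A quick empirical check (an illustration, not part of the original answer) that this scheme is uniform over the 51 allowed ports:
from collections import Counter
import random

counts = Counter(
    100 if random.randint(0, 50) == 50
    else 10 * random.randint(0, 9) + random.randint(0, 4)
    for _ in range(510_000)
)
print(len(counts))                                 # expect 51 distinct ports
print(min(counts.values()), max(counts.values()))  # each near 510_000 / 51 = 10_000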

Python, neighbors on a regular grid

Let's suppose I have a set of 2D coordinates that represent the centers of cells of a 2D regular mesh. I would like to find, for each cell in the grid, the two closest neighbors along each axis.
The problem is quite straightforward if one assigns to each cell an index defined as follows:
idx_cell = idx + N*idy
where N is the number of cells per row of the grid, idx = x/dx and idy = y/dx, with x and y being the x-coordinate and the y-coordinate of a cell and dx its size.
For example, the neighboring cells for a cell with idx_cell=5 are the cells with idx_cell equal to 4,6 (for the x-axis) and 5+N,5-N (for the y-axis).
The problem that I have is that my implementation of the algorithm is quite slow for large (N>1e6) data sets.
For instance, to get the neighbors of the x-axis I do
[x[(idx_cell==idx_cell[i]-1)|(idx_cell==idx_cell[i]+1)] for i in cells]
Do you think there's a faster way to implement this algorithm?
You are basically reinventing the indexing scheme of a multidimensional array. It is relatively easy to code, but you can use the two NumPy functions unravel_index and ravel_multi_index to your advantage here.
If your grid is of M rows and N columns, to get the idx and idy of a single item you could do:
>>> import numpy as np
>>> M, N = 12, 10
>>> np.unravel_index(4, (M, N))
(0, 4)
This also works if, instead of a single index, you provide an array of indices:
>>> np.unravel_index([15, 28, 32, 97], (M, N))
(array([1, 2, 3, 9], dtype=int64), array([5, 8, 2, 7], dtype=int64))
So if cells has the indices of several cells you want to find neighbors to:
>>> cells = np.array([15, 28, 32, 44, 87])
You can get their neighbors as:
>>> idy, idx = np.unravel_index(cells, (M, N))
>>> neigh_idx = np.vstack((idx-1, idx+1, idx, idx))
>>> neigh_idy = np.vstack((idy, idy, idy-1, idy+1))
>>> np.ravel_multi_index((neigh_idy, neigh_idx), (M, N))
array([[14, 27, 31, 43, 86],
[16, 29, 33, 45, 88],
[ 5, 18, 22, 34, 77],
[25, 38, 42, 54, 97]], dtype=int64)
Or, if you prefer it like that:
>>> np.ravel_multi_index((neigh_idy, neigh_idx), (M, N)).T
array([[14, 16, 5, 25],
[27, 29, 18, 38],
[31, 33, 22, 42],
[43, 45, 34, 54],
[86, 88, 77, 97]], dtype=int64)
The nicest thing about going this way is that ravel_multi_index has a mode keyword argument you can use to handle items on the edges of your lattice; see the docs.
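For instance (a quick sketch; the output shown assumes M, N = 12, 10 as above), mode='wrap' maps out-of-range neighbor indices back onto the lattice periodically:
>>> np.ravel_multi_index(([0, -1], [9, 10]), (M, N), mode='wrap')
array([  9, 110], dtype=int64)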
