Python simulation of the actual number of occurrences given theoretical probabilities

The goal is to simulate the actual number of occurrences given theoretical probabilities.
For example, consider a biased six-faced die whose probabilities of landing on (1, 2, 3, 4, 5, 6) are (0.1, 0.2, 0.15, 0.25, 0.1, 0.2).
Roll the die 1000 times and output the simulated count for each face.
I know numpy.random.choice can generate the individual rolls, but I also need a summary of how many times each face landed.
What is a good Python script for this?

NumPy can be used to do that easily and very efficiently:
import numpy as np

faces = np.arange(1, 7)                              # die faces 1..6
faceProbs = [0.1, 0.2, 0.15, 0.25, 0.1, 0.2]         # face probabilities
v = np.random.choice(faces, p=faceProbs, size=1000)  # roll the die 1000 times
counts = np.bincount(v, minlength=7)[1:]             # occurrences of faces 1..6
prob = counts / len(v)                               # empirical probability of each face

It can be done without NumPy too, using the standard library:
import random
rolls = random.choices([1, 2, 3, 4, 5, 6], weights=[0.1, 0.2, 0.15, 0.25, 0.1, 0.2], k=1000)
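To turn those rolls into the per-face summary the question asks for, one option (a small addition, not part of the original answer) is collections.Counter:
import random
from collections import Counter

rolls = random.choices([1, 2, 3, 4, 5, 6], weights=[0.1, 0.2, 0.15, 0.25, 0.1, 0.2], k=1000)
counts = Counter(rolls)
for face in range(1, 7):
    print(face, counts[face])   # simulated number of landings for each face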

Related

Does np.random.poisson work with very small numbers?

I'm trying to simulate a system of reactions over time. In order to do this I have to multiply the probability of a reaction occurring by a pre-calculated time step in which it can occur, save this result in a new variable, and use the new variable to sample from the Poisson distribution.
This is a snippet of my code:
lam = (evaluate_propensity*delta_t)
rxn_vector = np.random.poisson(lam) # probability of a reaction firing in the given time period
I've written a function to calculate the value of delta_t based on system-specific parameters. The value calculated is very small (0.00014970194372884217), and I think this is affecting the np.random.poisson function.
The evaluate_propensity variable is an array that details the probability of a reaction occurring based on the number of molecules in the system and the ratios between molecules in a reaction. This is calculated dynamically and changes after each iteration as the molecule numbers change, but the values for the first iteration are:
evaluate_propensity = np.array([1.0, 0.002, 0.0, 0.0])
The documentation states that lam must be >= 0, and mine is (just), but rxn_vector always comes back as an array of zeros.
rxn_vector = [0 0 0 0]
I know that the last two elements of the array will evaluate to zero. But didn't think that the first two would as well. Is there a way to make it more sensitive or amplify my results somehow or am I doing something wrong?
Cheers
The probability of drawing a non-zero number for lambda = 1.5e-4 is tiny: P(k > 0) = 1 - P(k = 0) = 1 - exp(-lambda) ≈ 1.5e-4. On average you would need far more than four samples to see a non-zero value: about 1 / 1.5e-4 ≈ 6667 samples for propensity = 1. For smaller propensities the number of necessary samples is obviously even larger.
You can confirm this with scipy.stats:
from scipy.stats import poisson
pdist = poisson(1.5e-4)
prob = 1 - pdist.pmf(0)
print(prob) # 0.00014998875056249084
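To see the same thing empirically (an illustrative check, not part of the original answer), draw a large batch of Poisson samples at this rate and count the non-zero ones:
import numpy as np

lam = 1.5e-4
samples = np.random.poisson(lam, size=100000)
print((samples > 0).sum())   # roughly lam * 100000, i.e. about 15 non-zero draws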

Choosing random number where probability is random in Python

While I can find decent information on how to generate numbers according to given probabilities with numpy.random.choice, e.g.:
np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
which picks 0 with probability p = 0.1, 1 with p = 0, 2 with p = 0.3, 3 with p = 0.6, and 4 with p = 0.
What I would like to know is: what function is there that will vary the probabilities? For example, one time I might have the probability distribution above, and the next time maybe p = [0.25, 0.1, 0.18, 0.2, 0.27]. So I would like to generate probability distributions on the fly. Is there a Python library that does this?
What I am wanting to do is to generate arrays, each of length n with numbers from some probability distribution, such as above.
One good option is the Dirichlet distribution: each sample from it lies on the K-dimensional simplex, i.e. it is a valid probability vector for a K-outcome (multinomial) distribution.
Naturally there's a convenient numpy function for generating as many such random distributions as you'd like:
# 10 length-4 probability distributions:
np.random.dirichlet((1,1,1,3),size = 10)
And these would get fed to the p= argument in your np.random.choice call.
You can consult Wikipedia for more info about how the tuple parameter affects the sampled multinomial distributions.
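Putting the two together, a minimal sketch (my illustration, not the original answer's code) draws a fresh length-5 probability vector for each array and then samples with it:
import numpy as np

# ten random probability vectors over five outcomes
probs = np.random.dirichlet((1, 1, 1, 1, 1), size=10)
for p in probs:
    sample = np.random.choice(5, 3, p=p)   # 3 draws from {0, ..., 4} with weights p
    print(p.round(3), sample)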
AFAIK there's no built-in way to do this. You can use roulette wheel selection, which should accomplish what you want.
The basic idea is simple:
import random

def roulette(weights):
    total = sum(weights)
    mark = random.random() * total
    runner = 0
    for index, val in enumerate(weights):
        runner += val
        if runner >= mark:
            return index
You can read more at https://en.wikipedia.org/wiki/Fitness_proportionate_selection
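As a quick usage check (my addition, assuming the roulette function above), repeated calls should reproduce the weights approximately:
from collections import Counter

weights = [0.1, 0.0, 0.3, 0.6, 0.0]
draws = Counter(roulette(weights) for _ in range(10000))
print({index: count / 10000 for index, count in sorted(draws.items())})   # close to the weights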

numpy - Given a number, find numbers that sum to it, with fuzzy weights

Suppose you have a number that you want to represent a total -- let's say it's 123,456,789.
Now, suppose you want to generate some numbers that add up to that number, but with fuzzy weights.
For instance, suppose I want to generate three numbers. The first should be around 60% of the total, but with some small level of variance. The second should be 30% of the total, again with some variance. And the third would end up being about 10%, depending on the other two.
I tried doing it this way:
percentages = [0.6, 0.3]
total = 123456789
still_need = total
values = []
for i in range(2):
x = int(total * (percentages[i] + np.random.normal(scale=0.05)))
values.append(x)
still_need = still_need - x
values.append(still_need)
But that doesn't seem very elegant.
Is there a better way?
A clean way to do it would be to draw from a multinomial distribution:
import numpy as np

total = 123456789
percentages = [0.6, 0.3, 0.1]
values = np.random.multinomial(total, percentages)
In this case, the multinomial distribution models rolling a 3-sided die 123456789 times, where the probability of each face turning up is [0.6, 0.3, 0.1]. Calling multinomial() is like running a single trial of this experiment. It returns 3 random integers that sum to 123456789; they represent the number of times each face of the die turned up. If you want multiple draws, you can use the size parameter.
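For example, passing size draws several such triples at once, each summing to the total (a small illustration, not from the original answer):
import numpy as np

total = 123456789
draws = np.random.multinomial(total, [0.6, 0.3, 0.1], size=5)
print(draws)               # 5 rows of 3 integers
print(draws.sum(axis=1))   # each row sums to 123456789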

Create random numbers with left skewed probability distribution

I would like to pick a number randomly between 1-100 such that the probability of getting numbers 60-100 is higher than 1-59.
I would like the probabilities over 1-100 to follow a left-skewed distribution, i.e. one with a long tail and a peak.
Something along the lines:
pers = np.arange(1,101,1)
prob = <left-skewed distribution>
number = np.random.choice(pers, 1, p=prob)
I do not know how to generate a left-skewed discrete probability function. Any ideas? Thanks!
You can do this with the SciPy function skewnorm, which can generate samples skewed either to the left or to the right:
from scipy.stats import skewnorm
import matplotlib.pyplot as plt

numValues = 10000
maxValue = 100
skewness = -5   # negative values are left-skewed, positive values are right-skewed

random = skewnorm.rvs(a=skewness, loc=maxValue, size=numValues)  # draw from the skew-normal distribution
random = random - min(random)   # shift the set so the minimum value is equal to zero
random = random / max(random)   # standardize all the values between 0 and 1
random = random * maxValue      # multiply the standardized values by the maximum value

# Plot a histogram to check the skewness
plt.hist(random, 30, density=True, color='red', alpha=0.1)
plt.show()
Please reference the documentation here:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skewnorm.html
The code produces a histogram of a left-skewed distribution.
Like you described, just make sure your skewed distribution adds up to 1.0:
import numpy as np

pers = np.arange(1, 101)
# Make each of the last 41 elements (60-100) 5x more likely
prob = np.array([1.0] * (len(pers) - 41) + [5.0] * 41)
# Normalise to 1.0
prob /= prob.sum()
number = np.random.choice(pers, 1, p=prob)
The p argument of np.random.choice is the probability associated with each element in the array in the first argument. So something like:
np.random.choice(pers, 1, p=[0.01, 0.01, 0.01, 0.01, ..... , 0.02, 0.02])
Here 0.01 is the lower probability for 1-59 and 0.02 is the higher probability for 60-100; whatever values you choose, they must sum to 1.
The numpy.random.choice documentation has some useful examples.
http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.random.choice.html
EDIT:
You might also try this link and look for a distribution (about half way down the page) that fits the model you are looking for.
http://docs.scipy.org/doc/scipy/reference/stats.html
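Combining the scipy.stats suggestion above with np.random.choice, one hedged sketch (my own illustrative distribution and parameters, not from the original answers) evaluates a left-skewed density on 1-100, normalises it into a probability vector, and samples from it:
import numpy as np
from scipy import stats

pers = np.arange(1, 101)
# Evaluate a left-skewed density on 1..100 and normalise it so it sums to 1
weights = stats.skewnorm.pdf(pers, a=-5, loc=80, scale=20)
prob = weights / weights.sum()
number = np.random.choice(pers, 1, p=prob)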

Plotting confidence intervals for Maximum Likelihood Estimate

I am trying to write code to produce confidence intervals for the number of different books in a library (as well as produce an informative plot).
My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time.
Say the true number of books in the library is N and the teacher picks one uniformly at random (with replacement) to give to you each week. If at week t the number of occasions on which you have received a book you have read is x, then I can produce a maximum likelihood estimate for the number of books in the library following https://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library .
Example: Consider a library with five books A, B, C, D, and E. If you receive books [A, B, A, C, B, B, D] in seven successive weeks, then the value for x (the number of duplicates) will be [0, 0, 1, 1, 2, 3, 3] after each of those weeks, meaning after seven weeks, you have received a book you have already read on three occasions.
To visualise the likelihood function (assuming I have understood what one is correctly), I have written the following code, which I believe plots it. The maximum is around 135, which is indeed the maximum likelihood estimate according to the Math.SE link above.
from __future__ import division
import random
import matplotlib.pyplot as plt
import numpy as np

# N is the true number of books. t is the number of weeks. unk is the true number of repeats found
t = 30
unk = 3

def numberrepeats(N, t):
    return t - len(set([random.randint(0, N) for i in xrange(t)]))

iters = 1000
ydata = []
for N in xrange(10, 500):
    sampledunk = [numberrepeats(N, t) for i in xrange(iters)].count(unk)
    ydata.append(sampledunk / iters)

print "MLE is", np.argmax(ydata)

xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata, ydata)
plt.show()
The output is a noisy curve that peaks around N = 135 (plot not reproduced here).
My questions are these:
Is there an easy way to get a 95% confidence interval and plot it on the diagram?
How can you superimpose a smoothed curve over the plot?
Is there a better way my code should have been written? It isn't very elegant and is also quite slow.
Finding the 95% confidence interval means finding the range of the x axis such that 95% of the time the empirical maximum likelihood estimate we get by sampling (which should theoretically be 135 in this example) will fall within it. The answer @mbatchkarov has given does not currently do this correctly.
There is now a mathematical answer at https://math.stackexchange.com/questions/656101/how-to-find-a-confidence-interval-for-a-maximum-likelihood-estimate .
Looks like you're ok on the first part, so I'll tackle your second and third points.
There are plenty of ways to fit smooth curves, with scipy.interpolate and splines, or with scipy.optimize.curve_fit. Personally, I prefer curve_fit, because you can supply your own function and let it fit the parameters for you.
Alternatively, if you don't want to learn a parametric function, you could do simple rolling-window smoothing with numpy.convolve.
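For the rolling-window option, a minimal sketch (my illustration, using synthetic data rather than the answer's variables) might look like this:
import numpy as np

# Boxcar smoothing of a noisy curve, as you could apply to your simulated ydata
noisy = np.sin(np.linspace(0, 3, 200)) + 0.1 * np.random.randn(200)
window = 10
smoothed = np.convolve(noisy, np.ones(window) / window, mode='same')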
As for code quality: you're not taking advantage of numpy's speed, because you're doing things in pure python. I would write your (existing) code like this:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt

# N is the true number of books.
# t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3

def numberrepeats(N, t, iters):
    # one row per trial, each consisting of t draws
    rand = np.random.randint(0, N, size=(iters, t))
    return t - np.array([len(set(r)) for r in rand])

iters = 1000
ydata = np.empty(500 - 10)
for N in xrange(10, 500):
    sampledunk = np.count_nonzero(numberrepeats(N, t, iters) == unk)
    ydata[N - 10] = sampledunk / iters

print "MLE is", np.argmax(ydata)

xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata, ydata)
plt.show()
It's probably possible to optimize this even more, but this change brings your code's runtime from ~30 seconds to ~2 seconds on my machine.
A simple (numerical) way to get a confidence interval is to run your script many times and see how much your estimate varies. You can use the standard deviation of those estimates to calculate the confidence interval.
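As a rough sketch of that idea (my own illustration, with a hypothetical helper name and deliberately small settings so it runs in reasonable time), wrap the estimation in a function, repeat it, and take percentiles of the estimates:
import numpy as np

def mle_once(t=30, unk=3, iters=200, Ns=np.arange(10, 500)):
    # one full run of the simulation: return the N that maximises
    # the estimated probability of seeing exactly unk repeats
    ydata = []
    for N in Ns:
        draws = np.random.randint(0, N, size=(iters, t))
        repeats = t - np.array([len(set(r)) for r in draws])
        ydata.append(np.mean(repeats == unk))
    return Ns[np.argmax(ydata)]

estimates = np.array([mle_once() for _ in range(20)])
low, high = np.percentile(estimates, [2.5, 97.5])   # rough 95% interval
print(low, high)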
In the interest of time, another option is to run a bunch of trials at each value of N (I used 2000), and then use random subsampling of those trials to get an estimate of the estimator standard deviation. Basically, this involves selecting a subset of the trials, generating your likelihood curve using that subset, then finding the maximum of that curve to get your estimator. You do this over many subsets and this gives you a bunch of estimators, which you can use to find a confidence interval on your estimator. My full script is as follows:
import numpy as np

t = 30
k = 3

def trial(N):
    return t - len(np.unique(np.random.randint(0, N, size=t)))

def trials(N, n_trials):
    return np.asarray([trial(N) for i in xrange(n_trials)])

n_trials = 2000
Ns = np.arange(1, 501)
results = np.asarray([trials(N, n_trials=n_trials) for N in Ns])

def likelihood(results):
    L = (results == 3).mean(-1)
    # boxcar filtering
    n = 10
    L = np.convolve(L, np.ones(n) / float(n), mode='same')
    return L

def max_likelihood_estimate(Ns, results):
    i = np.argmax(likelihood(results))
    return Ns[i]

def max_likelihood(Ns, results):
    # calculate mean from all trials
    mean = max_likelihood_estimate(Ns, results)
    # randomly subsample results to estimate std
    n_samples = 100
    sample_frac = 0.25
    estimates = np.zeros(n_samples)
    for i in xrange(n_samples):
        mask = np.random.uniform(size=results.shape[1]) < sample_frac
        estimates[i] = max_likelihood_estimate(Ns, results[:, mask])
    std = estimates.std()
    sterr = std * np.sqrt(sample_frac)  # is this mathematically sound?
    ci = (mean - 1.96 * sterr, mean + 1.96 * sterr)
    return mean, std, sterr, ci

mean, std, sterr, ci = max_likelihood(Ns, results)
print "Max likelihood estimate: ", mean
print "Max likelihood 95% ci: ", ci
There are two drawbacks to this method. One is that, since you're taking many subsamples from the same set of trials, your estimates are not independent. To limit the effect of this, I only used 25% of the results for each subset. Another drawback is that each subsample is only a fraction of your data, so estimates derived from these subsets will have more variance than estimates derived from running the full script many times. To account for this, I computed the standard error as the standard deviation divided by the square root of 4, since I had four times as much data in my full data set than in one of the subsamples. However, I'm not familiar enough with Monte Carlo theory to know if this is mathematically sound. Running my script a number of times did seem to indicate that my results were reasonable.
Lastly, I did use a boxcar filter on the likelihood curves to smooth them out a bit. Ideally, this should improve results, but even with the filtering there was still a considerable amount of variability in the results. When calculating the value for the overall estimator, I wasn't sure if it would be better to compute one likelihood curve from all the results and use the max of that (this is what I ended up doing), or to use the mean of all the subset estimators. Using the mean of the subset estimators might help cancel out some of the roughness in the curves that remains after filtering, but I'm not sure about this.
Here is an answer to your first question and a pointer to a solution for the second:
import numpy as np
import matplotlib.pyplot as plt

# xdata and ydata come from the simulation code above
plt.plot(xdata, ydata)
# calculate the (normalised) cumulative distribution function
cdf = np.cumsum(ydata) / np.sum(ydata)
# get the left and right boundary of the interval that contains 95% of the probability mass
right = np.argmax(cdf > 0.975)
left = np.argmax(cdf > 0.025)
# indicate the confidence interval with vertical lines
plt.vlines(xdata[left], 0, ydata[left])
plt.vlines(xdata[right], 0, ydata[right])
# hatch the confidence interval
plt.fill_between(xdata[left:right], ydata[left:right], facecolor='blue', alpha=0.5)
plt.show()
This produces a figure of the likelihood curve with the 95% interval marked by vertical lines and shaded in blue.
I'll try to answer question 3 when I have more time :)
