Numpy Trapezoidal Distribution for Age Distribution - python

I'm trying to create a rough model of US population distribution to generate random ages for a sample population, with the following image as a source, of sorts.
I feel that this could be most simply modeled by a trapezoidal distribution that remains uniform until dropping off at around the age of 50. However, it seems that numpy does not offer this distribution. Because of this, I was wondering if it is possible to "combine" two distribution functions (in this case, a uniform distribution with a maximum value of 50, and a triangular distribution with a minimum of 51 and a maximum of 100). Is this possible, and is there a way to directly express a trapezoidal distribution function in python?

Yes, you can combine the samples arbitrarily. Just use np.concatenate
import numpy as np
import matplotlib.pyplot as p
%matplotlib inline

def agedistro(turn, end, size):
    totarea = turn + (end - turn) / 2   # e.g. 50 + (90-50)/2
    areauptoturn = turn                 # say 50
    areasloped = (end - turn) / 2       # (90-50)/2
    size1 = int(size * areauptoturn / totarea)
    size2 = size - size1
    s1 = np.random.uniform(low=0, high=turn, size=size1)                    # uniform(low, high, size)
    s2 = np.random.triangular(left=turn, mode=turn, right=end, size=size2)  # triangular(left, mode, right, size)
    # mode: scalar - the value where the peak of the distribution occurs.
    # The value should fulfill the condition left <= mode <= right.
    s3 = np.concatenate((s1, s2))  # don't use add, it would add the numbers elementwise
    return s3

s3 = agedistro(turn=50, end=90, size=1000000)
p.hist(s3, bins=50)
p.show()
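As for expressing a trapezoidal distribution directly in Python: numpy itself does not provide one, but scipy does, as scipy.stats.trapezoid (named trapz in older SciPy releases). Here is a minimal sketch, assuming ages run from 0 to 90 with the flat part of the density ending at 50 (these parameter choices are mine, not from the answer above):
from scipy import stats

# scipy's trapezoidal distribution lives on [loc, loc + scale]; the density is
# flat between loc + c*scale and loc + d*scale and slopes linearly outside.
# Flat from age 0 to 50, tapering to zero at 90  ->  c = 0, d = 50/90.
ages = stats.trapezoid.rvs(c=0, d=50/90, loc=0, scale=90, size=1000000)

p.hist(ages, bins=50)  # p is matplotlib.pyplot, imported above
p.show()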

Related

Random numbers with user-defined continuous probability distribution

I would like to simulate something on the subject of photon-photon interaction. In particular, there is Halpern scattering. Here is the German Wikipedia entry on it: Halpern-Streuung. There the differential cross section has an angular dependence of (3+(cos(theta))^2)^2.
I would like to have a generator of random numbers between 0 and 2*Pi that follows the density function ((3+(cos(theta))^2)^2)*(1/(99*Pi/4)), so the values around 0, Pi and 2*Pi should occur a little more often than the values around Pi/2 and 3*Pi/2.
I have already found that there is a function for randomly outputting discrete values with user-defined probabilities, numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2]). I could work with that in an emergency, should there be nothing else. But what I actually want here is a continuous probability distribution.
I know that even if there is such a Python command where you can enter a mathematical distribution function, it ultimately only produces a discrete set of values, since irrational numbers cannot be represented exactly with 1s and 0s. Still, a command that takes a continuous function would be more elegant.
Assuming the density function you have is proportional to a probability density function (PDF), you can use the rejection sampling method: draw a point uniformly inside a bounding box and accept its x-coordinate only if the point falls under the density function, repeating until a point is accepted. It works for any bounded density function with a closed and bounded domain, as long as you know what the domain and bound are (the bound is the maximum value of f in the domain). In this case, the bound is 64/(99*math.pi) and the algorithm works as follows:
import math
import random

def sample():
    mn = 0                       # Lowest value of domain
    mx = 2 * math.pi             # Highest value of domain
    bound = 64 / (99 * math.pi)  # Upper bound of PDF value
    while True:  # Do the following until a value is returned
        # Choose an X inside the desired sampling domain.
        x = random.uniform(mn, mx)
        # Choose a Y between 0 and the maximum PDF value.
        y = random.uniform(0, bound)
        # Calculate the PDF at x.
        pdf = ((3 + math.cos(x) ** 2) ** 2) * (1 / (99 * math.pi / 4))
        # Does (x, y) fall under the PDF?
        if y < pdf:
            # Yes, so return x.
            return x
        # No, so loop.
See also the section "Sampling from an Arbitrary Distribution" in my article on randomization.
The following shows the method's correctness by showing the probability that the returned sample is less than π/8. For correctness, the probability should be close to 0.0788:
print(sum(1 if sample()<math.pi/8 else 0 for _ in range(1000000))/1000000)
I had two suggestions in mind: the inverse transform sampling method and the "deletion method" (I'll just call it that). With the inverse transform sampling method there is an inverse function to my distribution, but I ran into problems in several places with the math functions because of the domain, e.g. math.sqrt(-1). You would still have to trick around with if-statements here. That's why I decided to use Peter's suggestion.
And if you collect values in a loop and plot them in a histogram, it also looks quite good. Here with 40000 values and 100 bins
Here is the whole code for someone who is interested
import numpy as np
import math
import random
import matplotlib.pyplot as plt

N = 40000
bins = 100

def Deletion_method():
    x = None
    while x is None:
        mn = 0                       # Lowest value of domain
        mx = 2 * math.pi             # Highest value of domain
        bound = 64 / (99 * math.pi)  # Upper bound of PDF value
        # Choose an X inside the desired sampling domain.
        xrad = random.uniform(mn, mx)
        # Choose a Y between 0 and the maximum PDF value.
        y = random.uniform(0, bound)
        # Calculate the PDF at xrad.
        P = ((3 + math.cos(xrad) ** 2) ** 2) * (1 / (99 * math.pi / 4))
        # Does (xrad, y) fall under the PDF?
        if y < P:
            x = xrad
    return x

Values = []
for k in range(0, N):
    Values = np.append(Values, [Deletion_method()])

plt.hist(Values, bins)
plt.show()
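Side note: the accept/reject loop above can also be vectorized with numpy, which avoids growing the array one value at a time with np.append. A rough sketch of the same rejection idea (the oversampling factor of 2 per round is just a guess to keep the number of retry rounds small):
import numpy as np

def deletion_method_vectorized(n, rng=np.random.default_rng()):
    # Draw n samples from the Halpern density by vectorized rejection sampling.
    bound = 64 / (99 * np.pi)  # maximum of the PDF on [0, 2*pi]
    out = np.empty(0)
    while out.size < n:
        # Propose a batch of candidate points in the bounding box.
        x = rng.uniform(0, 2 * np.pi, size=2 * n)
        y = rng.uniform(0, bound, size=2 * n)
        pdf = (3 + np.cos(x) ** 2) ** 2 / (99 * np.pi / 4)
        out = np.concatenate((out, x[y < pdf]))  # keep accepted candidates
    return out[:n]

samples = deletion_method_vectorized(40000)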

How to calculate one-sided tolerance interval with scipy

I would like to calculate a one-sided tolerance bound based on the normal distribution given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but calculate it as a one-sided bound (95% of values are above, or below, the bound x)?
I assume that you are interested in computing a one-sided tolerance bound using the normal distribution (based on the fact that you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper, whose paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. There are repositories on GitHub containing such calculators that you can take inspiration from, for example this one, from which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n=1000
# Percentile for the TI to estimate
p=0.9
# confidence level
g = 0.95
# a demo sample
x = np.array([np.random.normal(100) for k in range(n)])
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
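As a quick sanity check of the k factor (my own addition, not from the referenced paper), one can verify by simulation that the upper bound mu_est + k*sigma_est covers the true 90th percentile of the population in roughly 95% of repeated samples:
# Reusing n, p, k and norm from the snippet above: the population is N(100, 1),
# so the bound should exceed its true 90th percentile in about 95% of trials.
true_p90 = norm.ppf(p, loc=100, scale=1)
n_trials = 2000
hits = 0
for _ in range(n_trials):
    xs = np.random.normal(100, 1, size=n)
    hits += (xs.mean() + k * xs.std(ddof=1)) >= true_p90
print(hits / n_trials)  # should be close to g = 0.95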
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph

How to sample from a custom distribution when parameters are known?

The target is to get samples from a distribution whose parameters are known.
For example, the self-defined distribution is p(X|theta), where theta is the parameter vector of K dimensions and X is the random vector of N dimensions.
Now we know that (1) theta is known; (2) p(X|theta) is NOT known, but I know p(X|theta) ∝ f(X,theta), and f is a known function.
Can pymc3 do such sampling from p(X|theta), and how?
The purpose is not to sample from the posterior distribution of the parameters, but to sample from a self-defined distribution.
Starting from a simple example of sampling from a Bernoulli distribution, I did the following:
import pymc3 as pm
import numpy as np
import scipy.stats as stats
import pandas as pd
import theano.tensor as tt

with pm.Model() as model1:
    p = 0.3
    density = pm.DensityDist('density',
        lambda x1: tt.switch(x1, tt.log(p), tt.log(1 - p)),
    )  # tt.switch(x1, tt.log(p), tt.log(1 - p)) is the log-likelihood from the pymc3 source code

with model1:
    step = pm.Metropolis()
    samples = pm.sample(1000, step=step)
I expect the result to be 1000 binary digits, with the proportion of 1s being about 0.3. However, I got strange results where very large numbers occur in the output.
I know something is wrong. Please help with how to correctly write pymc3 code for such non-posterior MCMC sampling problems.
Prior predictive sampling (for which you should be using pm.sample_prior_predictive()) involves only using the RNGs provided by the RandomVariable objects in your compute graph. By default, DensityDist does not implement a RNG, but does provide the random parameter for this purpose, so you'll need to use that. The log-likelihood is only evaluated with respect to observables, so it plays no role here.
A simple way to generate a valid RNG for an arbitrary distribution is to use inverse transform sampling. In this case, one samples a uniform distribution on the unit interval and then transforms it through the inverse CDF of the desired function. For the Bernoulli case, the inverse CDF partitions the unit line based on the probability of success, assigning 0 to one part and 1 to the other.
Here is a factory-like implementation that creates a Bernoulli RNG compatible with pm.DensityDist's random parameter (i.e., accepts point and size kwargs).
def get_bernoulli_rng(p=0.5):
    def _rng(point=None, size=1):
        # Bernoulli inverse CDF, given p (prob of success)
        _icdf = lambda q: np.uint8(q < p)
        return _icdf(pm.Uniform.dist().random(point=point, size=size))
    return _rng
So, to fill out the example, it would go something like
with pm.Model() as m:
    p = 0.3
    y = pm.DensityDist('y', lambda x: tt.switch(x, tt.log(p), tt.log(1 - p)),
                       random=get_bernoulli_rng(p))
    prior = pm.sample_prior_predictive(random_seed=2019)

prior['y'].mean()  # 0.306
Obviously, this could equally be done with random=pm.Bernoulli.dist(p).random, but the above illustrates generically how one could do this with arbitrary distributions, given their inverse CDF, i.e., you only need to modify _icdf and the parameters.
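For instance, here is a sketch (my own illustration, following the same factory pattern, not an official pymc3 recipe) of an RNG for an exponential distribution, whose inverse CDF is -log(1 - q)/lam:
def get_exponential_rng(lam=1.0):
    def _rng(point=None, size=1):
        # Exponential inverse CDF with rate lam
        _icdf = lambda q: -np.log(1 - q) / lam
        return _icdf(pm.Uniform.dist().random(point=point, size=size))
    return _rng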

Randomly generate integers with a distribution that prefers low ones

I have a list ordered by some quality function, from which I'd like to take elements, preferring the good elements at the beginning of the list.
Currently, my function to generate the random indices looks essentially as follows:
import itertools
import random

def pick():
    p = 0.2
    for i in itertools.count():
        if random.random() < p:
            break
    return i
It does a good job, but I wonder:
What's the name of the generated random distribution?
Is there a built-in function in Python for that distribution?
What you are describing sounds a lot like the exponential distribution. It already exists in the random module.
Here is some code that takes just the integer part of sampling from an exponential distribution with a mean of 100 (a rate parameter of 1/100).
import random
import numpy as np
import matplotlib.pyplot as plt

d = [int(random.expovariate(1/100)) for i in range(10000)]
h, b = np.histogram(d, bins=np.arange(0, max(d)))
plt.bar(b[:-1], h, ec='none', width=1)
plt.show()
You could simulate it via the exponential distribution, but that is like making a square peg fit a round hole. As Mark said, it is the geometric distribution - discrete, and shifted by 1. And it is right there in numpy:
import numpy as np
import random
import itertools
import matplotlib.pyplot as plt

p = 0.2

def pick():
    for i in itertools.count():
        if random.random() < p:
            break
    return i

q = np.random.geometric(p, size=100000) - 1
z = [pick() for i in range(100000)]

bins = np.linspace(-0.5, 30.5, 32)
plt.hist(q, bins, alpha=0.2, label='geom')
plt.hist(z, bins, alpha=0.2, label='pick')
plt.legend(loc='upper right')
plt.show()
Output:
random.random() defaults to a uniform distribution, but there are other methods within random that would also work. For your given use case, I would suggest random.expovariate(2) (Documentation, Wikipedia). This is an exponential distribution that will heavily prefer lower values. If you google some of the other methods listed in the documentation, you can find some other built-in distributions.
Edit: Be sure to play around with the argument value for expovariate. Also note that it doesn't guarantee a value less than 1, so you might need to ensure that you only use values less than 1.
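For completeness, a small sketch of how expovariate could stand in for the original pick() when drawing indices (the list name and rate value are made up for illustration); it retries when the draw falls past the end of the list, since the exponential distribution is unbounded above:
import random

items = list(range(100))  # hypothetical quality-ordered list

def pick_index(rate=0.2, n=len(items)):
    # Take the integer part of an exponential draw; retry if it
    # falls past the end of the list.
    while True:
        i = int(random.expovariate(rate))
        if i < n:
            return i

sample = [items[pick_index()] for _ in range(10)]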

Generating random numbers with a given probability density function

I want to specify the probability density function of a distribution and then pick up N random numbers from that distribution in Python. How do I go about doing that?
In general, you want to have the inverse cumulative probability density function. Once you have that, then generating the random numbers along the distribution is simple:
import random

def sample(n):
    return [icdf(random.random()) for _ in range(n)]
Or, if you use NumPy:
import numpy as np

def sample(n):
    return icdf(np.random.random(n))
In both cases icdf is the inverse cumulative distribution function which accepts a value between 0 and 1 and outputs the corresponding value from the distribution.
To illustrate the nature of icdf, we'll take a simple uniform distribution between values 10 and 12 as an example:
the probability density function is 0.5 between 10 and 12, and zero elsewhere
the cumulative distribution function is 0 below 10 (no samples below 10), 1 above 12 (no samples above 12), and increases linearly between those values (it is the integral of the PDF)
the inverse cumulative distribution function is only defined between 0 and 1. At 0 it is 10, at 1 it is 12, and it changes linearly between those values
Of course, the difficult part is obtaining the inverse cumulative distribution function. It really depends on your distribution: sometimes you may have an analytical expression, sometimes you may want to resort to interpolation. Numerical methods can help, as numerical integration can be used to build the CDF and interpolation can be used to invert it.
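A rough sketch of that numerical route (not part of the original answer): tabulate the PDF on a grid, integrate it cumulatively with the trapezoidal rule to get the CDF, and invert the CDF by interpolation. The triangular PDF at the end is just a stand-in for whatever density you have:
import numpy as np

def make_icdf(pdf, lo, hi, n_grid=10001):
    # Build a numerical inverse CDF for a (possibly unnormalized) pdf on [lo, hi].
    # The pdf must accept numpy arrays.
    x = np.linspace(lo, hi, n_grid)
    px = pdf(x)
    # Cumulative integral of the PDF via the trapezoidal rule, normalized to end at 1.
    cdf = np.concatenate(([0.0], np.cumsum(np.diff(x) * 0.5 * (px[1:] + px[:-1]))))
    cdf /= cdf[-1]
    # Invert by interpolating x as a function of the CDF value.
    return lambda q: np.interp(q, cdf, x)

# Example: a triangular density 2*(1 - x) on [0, 1], used as a stand-in.
icdf = make_icdf(lambda x: 2 * (1 - x), 0.0, 1.0)
samples = icdf(np.random.random(100000))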
This is my function to retrieve a single random number distributed according to the given probability density function. I used a Monte-Carlo like approach. Of course n random numbers can be generated by calling this function n times.
"""
Draws a random number from given probability density function.
Parameters
----------
pdf -- the function pointer to a probability density function of form P = pdf(x)
interval -- the resulting random number is restricted to this interval
pdfmax -- the maximum of the probability density function
integers -- boolean, indicating if the result is desired as integer
max_iterations -- maximum number of 'tries' to find a combination of random numbers (rand_x, rand_y) located below the function value calc_y = pdf(rand_x).
returns a single random number according the pdf distribution.
"""
def draw_random_number_from_pdf(pdf, interval, pdfmax = 1, integers = False, max_iterations = 10000):
for i in range(max_iterations):
if integers == True:
rand_x = np.random.randint(interval[0], interval[1])
else:
rand_x = (interval[1] - interval[0]) * np.random.random(1) + interval[0] #(b - a) * random_sample() + a
rand_y = pdfmax * np.random.random(1)
calc_y = pdf(rand_x)
if(rand_y <= calc_y ):
return rand_x
raise Exception("Could not find a matching random number within pdf in " + max_iterations + " iterations.")
In my opinion this solution performs better than the other solutions if you do not have to retrieve a very large number of random variables. Another benefit is that you only need the PDF and avoid calculating the CDF, the inverse CDF, or weights.
