I'd like to generate random numbers that follow a linearly decreasing frequency distribution, say p(x) = 1 - x as an example.
The numpy library, however, seems to offer only more complex distributions.
So, it turns out you can totally use random.triangular(0,1,0) for this. See documentation here: https://docs.python.org/2/library/random.html
random.triangular(low, high, mode)
Return a random floating point number N such that low <= N <= high and with the specified mode between those bounds.
Histogram made with matplotlib:
import matplotlib.pyplot as plt
import random
bins = [0.1 * i for i in range(12)]
plt.hist([random.triangular(0, 1, 0) for i in range(2500)], bins)
plt.show()
For the unnormalized PDF with density 1 - x on the range [0, 1), the normalizing integral is 1/2, so the normalized PDF is 2*(1 - x) and the CDF is 2x - x^2. Sampling by inverting the CDF is then straightforward:
r = 1.0 - math.sqrt(random.random())
A sample program produces pretty much the same plot:
import math
import random
import matplotlib.pyplot as plt
bins = [0.1 * i for i in range(12)]
plt.hist([(1.0 - math.sqrt(random.random())) for k in range(10000)], bins)
plt.show()
UPDATE
Let's denote by S an integral, with S_a^b being the definite integral from a to b.
So:
Denormalized PDF(x) = 1-x
Normalization:
N = S_0^1 (1-x) dx = 1/2
Thus, normalized PDF
PDF(x) = 2*(1-x)
Let's compute CDF
CDF(x) = S_0^x PDF(t) dt = 2x - x*x
Checking: CDF(0) = 0, CDF(1) = 1
Sampling is via the inverse CDF method, by solving for x:
CDF(x) = U(0,1)
where U(0,1) is uniform random in [0,1)
This is a simple quadratic equation with solution
x = 1 - sqrt(1 - U(0,1)) = 1 - sqrt(U(0,1))
where the last equality uses the fact that 1 - U(0,1) is itself uniform in (0, 1].
which translates directly into Python code:
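For illustration, the same sampler written both scalar (as in the sample program above) and vectorized with numpy (the numpy variant is my addition, not from the original answer):
import math
import random
import numpy as np

x = 1.0 - math.sqrt(random.random())           # one sample from PDF(x) = 2*(1 - x)
xs = 1.0 - np.sqrt(np.random.random(10000))    # 10000 samples at once with numpy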
I want to generate some samples to estimate the expectation and variance of a random variable.
Given the probability density function: f(x) = {2x, 0 <= x <= 1; 0 otherwise}
I already found that E(X) = 2/3 and Var(X) = 1/18; my detailed solution is here: https://math.stackexchange.com/questions/4430163/simulating-expectation-of-continuous-random-variable
But here is what I get when simulating using Python:
import numpy as np
N = 100_000
X = np.random.uniform(size=N, low=0, high=1)
Y = [2*x for x in X]
np.mean(Y) # 1.00221 <- not equal to 2/3
np.var(Y) # 0.3323 <- not equal to 1/18
What am I doing wrong here? Thank you in advance.
You are generating the mean and variance of Y = 2X, when you want the mean and variance of the X's themselves. You know the density, but the CDF is more useful for random variate generation than the PDF. For your problem, the density is:
f(x) = 2x for 0 <= x <= 1, and 0 otherwise
so the CDF is:
F(x) = x^2 for 0 <= x <= 1
Given that the CDF is an easily invertible function for the range [0,1], you can use inverse transform sampling to generate X values by setting F(X) = U, where U is a Uniform(0,1) random variable, and inverting the relationship to solve for X. For your problem, this yields X = U^(1/2) = sqrt(U).
In other words, you can generate X values with
import numpy as np
N = 100_000
X = np.sqrt(np.random.uniform(size = N))
and then do anything you want with the data, such as calculate mean and variance, plot histograms, use in simulation models, or whatever.
A histogram will confirm that the generated data have the desired density:
import matplotlib.pyplot as plt
plt.hist(X, bins = 100, density = True)
plt.show()
produces a histogram rising linearly from 0 to 2 over [0, 1], matching the target density f(x) = 2x.
The mean and variance estimates can then be calculated directly from the data:
print(np.mean(X), np.var(X)) # => 0.6661509538922444 0.05556962913014367
But wait! There’s more...
Margin of error
Simulation generates random data, so estimates of mean and variance will be variable across repeated runs. Statisticians use confidence intervals to quantify the magnitude of the uncertainty in statistical estimates. When the sample size is sufficiently large to invoke the central limit theorem, an interval estimate of the mean is calculated as (x-bar ± half-width), where x-bar is the estimate of the mean. For a so-called 95% confidence interval, the half-width is 1.96 * s / sqrt(n) where:
s is the estimated standard deviation;
n is the number of samples used in the estimates of mean and standard deviation; and
1.96 is a scaling constant derived from the normal distribution and the desired level of confidence.
The half-width is a quantitative measure of the margin of error, a.k.a. precision, of the estimate. Note that as n gets larger, the estimate has a smaller margin of error and becomes more precise, but there are diminishing returns to increasing the sample size due to the square root. Increasing the precision by a factor of 2 would require 4 times the sample size if independent sampling is used.
In Python:
var = np.var(X)
print(np.mean(X), var, 1.96 * np.sqrt(var / N))
produces results such as
0.6666763186360812 0.05511848269208021 0.0014551397290634852
where the third column is the confidence interval half-width.
Improving precision
Inverse transform sampling can yield greater precision for a given sample size if we use a clever trick based on fundamental properties of expectation and variance. In intro prob/stats courses you probably were told that Var(X + Y) = Var(X) + Var(Y). The true relationship is actually Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y), where Cov(X,Y) is the covariance between X and Y. If they are independent, the covariance is 0 and the general relationship becomes the one we learn/teach in intro courses, but if they are not independent the more general equation must be used. Variance is always a positive quantity, but covariance can be either positive or negative. Consequently, it’s easy to see that if X and Y have negative covariance the variance of their sum will be less than when they are independent. Negative covariance means that when X is above its mean Y tends to be below its mean, and vice-versa.
So how does that help? It helps because we can use the inverse transform, along with a technique known as antithetic variates, to create pairs of random variables which are identically distributed but have negative covariance. If U is a random variable with a Uniform(0,1) distribution, U' = 1 - U also has a Uniform(0,1) distribution. (In fact, flipping any symmetric distribution will produce the same distribution.) As a result, X = F^(-1)(U) and X' = F^(-1)(U') are identically distributed since they're defined by the same CDF, but will have negative covariance because they fall on opposite sides of their shared median and thus strongly tend to fall on opposite sides of their mean. If we average each pair to get A = (F^(-1)(U_i) + F^(-1)(1 - U_i)) / 2, the expected value is E[A] = E[(X + X')/2] = 2E[X]/2 = E[X], while the variance is Var(A) = [Var(X) + Var(X') + 2Cov(X,X')]/4 = 2[Var(X) + Cov(X,X')]/4 = [Var(X) + Cov(X,X')]/2. In other words, we get a random variable A whose average is an unbiased estimate of the mean of X but which has less variance.
To fairly compare antithetic results head-to-head with independent sampling, we take the original sample size and allocate it with half the data being generated by the inverse transform of the U’s, and the other half generated by antithetic pairing using 1-U’s. We then average the paired values and generate statistics as before. In Python:
U = np.random.uniform(size = N // 2)
antithetic_avg = (np.sqrt(U) + np.sqrt(1.0 - U)) / 2
anti_var = np.var(antithetic_avg)
print(np.mean(antithetic_avg), anti_var, 1.96*np.sqrt(anti_var / (N / 2)))
which produces results such as
0.6667222935263972 0.0018911848781598295 0.0003811869837216061
Note that the half-width produced with independent sampling is nearly 4 times as large as the half-width produced using antithetic variates. To put it another way, we would need more than an order of magnitude more data for independent sampling to achieve the same precision.
To approximate the integral of some function of x, say, g(x), over S = [0, 1], using Monte Carlo simulation, you
1. generate N random numbers in [0, 1] (i.e. draw from the uniform distribution U[0, 1]);
2. calculate the arithmetic mean of g(x_i) over i = 1 to N, where x_i is the ith random number: i.e. (1/N) times the sum from i = 1 to N of g(x_i).
The result of step 2 is the approximation of the integral.
The expected value of continuous random variable X with pdf f(x) and set of possible values S is the integral of x * f(x) over S. The variance of X is the expected value of X-squared minus the square of the expected value of X.
Expected value: to approximate the integral of x * f(x) over S = [0, 1] (i.e. the expected value of X), set g(x) = x * f(x) and apply the method outlined above.
Variance: to approximate the integral of (x * x) * f(x) over S = [0, 1] (i.e. the expected value of X-squared), set g(x) = (x * x) * f(x) and apply the method outlined above. Subtract the result of this by the square of the estimate of the expected value of X to obtain an estimate of the variance of X.
Adapting your method:
import numpy as np
N = 100_000
X = np.random.uniform(size = N, low = 0, high = 1)
Y = [x * (2 * x) for x in X]
E = [(x * x) * (2 * x) for x in X]
# mean
print((a := np.mean(Y)))
# variance
print(np.mean(E) - a * a)
Output
0.6662016482614397
0.05554821798023696
Instead of making Y and E lists, a much better approach is
Y = X * (2 * X)
E = (X * X) * (2 * X)
In this case Y and E are numpy arrays, and the whole computation is vectorized, which is much more efficient. Try N = 100_000_000 and compare the execution times of both methods; the second should be much faster.
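For example, a rough timing sketch (my own addition, not from the answer; exact numbers depend on your machine, and a smaller N is used here to keep the slow variant tractable):
import time
import numpy as np

N = 10_000_000
X = np.random.uniform(size=N)

t0 = time.perf_counter()
Y_list = [x * (2 * x) for x in X]    # Python-level loop over the array
t1 = time.perf_counter()
Y_arr = X * (2 * X)                  # vectorized numpy computation
t2 = time.perf_counter()

print(f"list comprehension: {t1 - t0:.2f} s, vectorized: {t2 - t1:.2f} s")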
I'm building a simulation which requires random draws from the tail of a lognormal distribution. A threshold τ (tau) is chosen, and the resulting conditional distribution is given by:
G(x) = (F(x) - F(τ)) / (1 - F(τ)) for x >= τ
I need to randomly sample from that conditional distribution, where F(x) is the lognormal CDF with a chosen µ (mu) and σ (sigma), and τ (tau) is set by the user.
My inelegant solution right now is simply to sample from the lognormal, tossing out any values under τ (tau), until I have the sample size I need. But I'm sure this can be improved.
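For reference, a minimal sketch of that toss-out approach (the function and variable names are mine, not from the post), assuming numpy's lognormal generator:
import numpy as np

def sample_tail_rejection(n, mu, sigma, tau):
    # Draw lognormal variates, discard those below tau, repeat until n samples remain.
    out = np.empty(0)
    while out.size < n:
        draw = np.random.lognormal(mean=mu, sigma=sigma, size=n)
        out = np.concatenate([out, draw[draw >= tau]])
    return out[:n]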
Thanks for the help!
The easiest way is probably to leverage the truncated normal distribution as provided by Scipy.
This gives the following code, with ν (nu) as the variable of the standard Gaussian distribution, and τ (tau) mapping to ν0 on that distribution. This function returns a Numpy array containing ranCount lognormal variates:
import math
import numpy as np
from scipy.stats import truncnorm

def getMySamplesScipy(ranCount, mu, sigma, tau):
    nu0 = (math.log(tau) - mu) / sigma              # position of tau on the unit Gaussian
    xs = truncnorm.rvs(nu0, np.inf, size=ranCount)  # truncated unit normal samples
    ys = np.exp(mu + sigma * xs)                    # go back to x space
    return ys
If for some reason this is not suitable: some of the tricks commonly used for Gaussian variates, such as Box-Muller, do not work for a truncated distribution, but we can always fall back on a general principle, the Inverse Transform Sampling theorem.
So we generate cumulative probabilities for our variates by transforming uniform variates, and we rely on Scipy's inverse of the erf error function to go back from those probabilities to values in x space.
This gives something like the following Python code (without any attempt at optimization):
import math
import random
import numpy as np
import numpy.random as nprd
import scipy.special as spfn
# using the "Inverse Method":
def getMySamples(ranCount, mu, sigma, tau):
    nu0 = (math.log(tau) - mu) / sigma               # position of tau on the standard Gaussian curve
    headCP = (1/2) * (1 + spfn.erf(nu0/math.sqrt(2)))
    tailCP = 1.0 - headCP                            # probability of being in the "tail"
    uvs = np.random.uniform(0.0, 1.0, ranCount)      # uniform variates
    cps = headCP + uvs * tailCP                      # Cumulative ProbabilitieS
    nus = math.sqrt(2) * spfn.erfinv(2*cps - 1)      # positions on the standard Gaussian
    xs = np.exp(mu + sigma * nus)                    # go back to x space
    return xs
Alternatives:
We can leverage the significant amount of material related to the Truncated Gaussian distribution.
There is a relatively recent (2016) review paper on the subject by Zdravko Botev and Pierre L'Ecuyer. This paper provides a pointer to publicly available R source code. Some material is seriously old, for example the 1986 book by Luc Devroye: Non-Uniform Random Variate Generation.
For example, a possible rejection-based method: if τ (tau) maps to ν0 on the standard Gaussian curve, the unit Gaussian distribution behaves like exp(-ν^2/2). If we write ν = ν0 + δ, this is proportional to: exp(-δ^2/2) * exp(-ν0*δ).
The idea is to approximate the exact distribution beyond ν0 by an exponential one, of parameter ν0. Note that the exact distribution is constantly below the approximate one. Then we can randomly accept the relatively cheap exponential variates with a probability of exp(-δ^2/2).
We can just pick an equivalent algorithm in the literature. In the Devroye book, chapter IX page 382, there is some pseudo-code:
REPEAT
    generate independent exponential random variates X and Y
UNTIL X^2 <= 2*ν0^2*Y
RETURN R <-- ν0 + X/ν0
for which a Numpy rendition could be written like this:
def getMySamplesXpRj(rawRanCount, mu, sigma, tau):
    nu0 = (math.log(tau) - mu) / sigma          # position of tau on the standard Gaussian
    if nu0 <= 0:
        print("Error: τ (tau) too small in getMySamplesXpRj")
    rnu0 = 1.0 / nu0
    xs = nprd.exponential(1.0, rawRanCount)     # exponential "raw" variates
    ys = nprd.exponential(1.0, rawRanCount)
    allSamples = nu0 + (rnu0 * xs)
    boolArray = (xs*xs - 2*nu0*nu0*ys) <= 0.0   # acceptance test
    samples = allSamples[boolArray]
    ys = np.exp(mu + sigma * samples)           # go back to x space
    return ys
According to Table 3 in the Botev-L'Ecuyer paper, the rejection rate of this algorithm is nicely low.
Besides, if you are willing to allow for some sophistication, there is also some literature about the Ziggurat algorithm as used for truncated Gaussian distributions, for example the 2012 arXiv 1201.6140 paper by Nicolas Chopin at ENSAE-CREST.
Side note: with recent versions of Python, it seems that you can use Greek letters for your variable names directly, σ instead of sigma, τ instead of tau, just as in the statistics books:
$ python3
Python 3.9.6 (default, Jun 29 2021, 00:00:00)
>>>
>>> σ = 2
>>> τ = 7
>>>
>>> στ = σ * τ
>>>
>>> στ + 1
15
>>>
A clean way is to define a subclass of rv_continuous with an implementation of _cdf. To draw variates you may want to also define _ppf or _rvs methods.
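For illustration, here is a minimal sketch of that idea for the tail-truncated lognormal (the class name and parameterization are my own, not from the original answer; only _cdf is overridden, so rvs falls back on scipy's generic numerical inversion, which is correct but slow):
import numpy as np
from scipy.stats import rv_continuous, lognorm

class TailLognormal(rv_continuous):
    """Lognormal(mu, sigma) conditioned on X >= tau (illustrative sketch)."""
    def __init__(self, mu, sigma, tau):
        super().__init__(a=tau, name="tail_lognormal")   # a = lower bound of the support
        self._base = lognorm(s=sigma, scale=np.exp(mu))  # underlying lognormal
        self._tail = 1.0 - self._base.cdf(tau)           # P(X >= tau)

    def _cdf(self, x):
        # conditional CDF: (F(x) - F(tau)) / (1 - F(tau)) for x >= tau
        return (self._base.cdf(x) - (1.0 - self._tail)) / self._tail

dist = TailLognormal(mu=0.0, sigma=1.0, tau=2.0)
samples = dist.rvs(size=1000)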
In passing, someone suggested that I could use the half-normal distribution in Python to set min and max points from 0 to infinity:
halfnorm.rvs()
The 0 seems to handle the min, but I have no idea what to do with the infinity.
I would like a number generator from 0 to 15 following a normal distribution, but I'm having a hard time finding a function that doesn't go over the max or below the min, given that the normal distribution is unbounded.
I would try to use the beta-distribution: https://en.wikipedia.org/wiki/Beta_distribution. It's quite simple (e.g. to integrate) and capable of fitting typical reaction time distributions.
Now the question is how to sample this efficiently for fixed α and β parameters ... scipy has done it for us: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html
Edit: Motivated by the comment and by curiosity, here is an example plotting the histogram of 10 samples of 1000 values each:
from scipy.stats import beta
from numpy import histogram
import pylab
max_time = 3
min_time = 0.5
a, b = 2, 7
dist = beta(a, b)
for _ in range(10):
    sample = min_time + dist.rvs(size=1000) * (max_time - min_time)
    his, bins = histogram(sample, bins=20, density=True)
    pylab.plot(bins[:-1], his, ".")
pylab.xlabel("Reaction time [s]")
pylab.ylabel("Probability density [1/s]")
pylab.grid()
pylab.show()
I had just answered a similar question here; I'll copy the answer, as I think this question's title is much more informative:
You can use a uniform distribution with boundaries "translated" from normal space to uniform space (using the error function), then convert the samples back to a normal distribution using the inverse error function.
import matplotlib.pyplot as plt
import numpy as np
from scipy import special
mean = 0
std = 7
min_value = 0
max_value = 15
min_in_standard_domain = (min_value - mean) / std
max_in_standard_domain = (max_value - mean) / std
# erf is (up to an affine map) the CDF of N(0, 1/2), so divide the standardized
# bounds by sqrt(2) here and multiply by sqrt(2) when inverting below
min_in_erf_domain = special.erf(min_in_standard_domain / np.sqrt(2))
max_in_erf_domain = special.erf(max_in_standard_domain / np.sqrt(2))
random_uniform_data = np.random.uniform(min_in_erf_domain, max_in_erf_domain, 10000)
random_gaussianized_data = special.erfinv(random_uniform_data) * np.sqrt(2) * std + mean
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].hist(random_uniform_data, 30)
axes[1].hist(random_gaussianized_data, 30)
axes[0].set_title('uniform distribution samples')
axes[1].set_title('erfinv(uniform distribution samples)')
plt.show()
I recently ran into a similar issue.
To get around this and keep my min/max within reasonable bounds, I just created some if statements to catch any numbers that went below the real min or above the real max.
if value < 0:
    value = abs(value)
elif value > 15:
    diff = value - 15
    value = 15 - diff
This was close enough for me.
Is there any library/function in Python which allows us to generate discrete data that matches given target moments (mean, standard deviation, skewness, kurtosis)? I do not wish to necessarily enforce any specific underlying continuous distribution.
That is, I want to generate, say, 10000 numbers, such that when we calculate their first four moments using standard formulae we get something close to the target moments given as input.
Is there any known library in Python that implements such a method? Here is an example of a paper in which this specific problem is solved (as part of a larger problem):
https://link.springer.com/article/10.1023/A:1021853807313
Thanks!
Yes, although not with 100% accuracy, this is possible.
import statsmodels.sandbox.distributions.extras as extras
import scipy.interpolate as interpolate
import scipy.stats as ss
import matplotlib.pyplot as plt
import numpy as np
def generate_normal_four_moments(mu, sigma, skew, kurt, size=10000, sd_wide=10):
    f = extras.pdf_mvsk([mu, sigma, skew, kurt])
    x = np.linspace(mu - sd_wide * sigma, mu + sd_wide * sigma, num=500)
    y = [f(i) for i in x]
    yy = np.cumsum(y) / np.sum(y)
    inv_cdf = interpolate.interp1d(yy, x, fill_value="extrapolate")
    rr = np.random.rand(size)
    return inv_cdf(rr)
Next, we generate the data by using
data = generate_normal_four_moments(mu=0, sigma=1, skew=-1, kurt=3)
Let's check the moments:
np.mean(data)
np.var(data)
ss.skew(data)
ss.kurtosis(data)
gives
-0.039986656405454374
1.051375501684874
-1.071149838792561
2.9813805363255472
Is there a way to calculate (approximately) the square of a number, say 4, using a Gaussian distribution where mu is the number and sigma is 0.16, for 1000 random points?
I searched the internet a lot, but couldn't find a solution to this. Any piece of code would be very helpful as I am new to Python.
Assuming that you have your data generated, you can approximate mu (which is the square of your number) by taking the mean of the data. By the law of large numbers, the approximation becomes more accurate as the size of the data grows. Example:
import random
def generate_data(size):
    mu, sigma = 4 ** 2, 0.16
    return [random.gauss(mu, sigma) for _ in range(size)]

def mean(ls):
    return sum(ls) / len(ls)

print(mean(generate_data(10)))     # 15.976644889526114
print(mean(generate_data(100)))    # 16.004123848232233
print(mean(generate_data(1000)))   # 16.00164187802018
print(mean(generate_data(10000)))  # 16.001000022147206
You can use numpy.random.randn to generate samples from a standard Gaussian distribution, which can then be scaled and shifted as needed. From the docs:
For random samples from N(μ, σ²), use:
sigma * np.random.randn(...) + mu
which for your example,
import numpy as np
import matplotlib.pyplot as plt
N = 4.
mu = N**2
sigma = 1/N**2   # treated as a variance here, hence the square root below
dist = np.sqrt(sigma) * np.random.randn(1000) + mu
plt.hist(dist,30)
plt.show()
If you don't want to use numpy, you can also use the random module:
import random
dist = [random.normalvariate(mu, sigma) for i in range(1000)]