Generating new exponential distribution from different exponential distribution - python

I have a NumPy array of values in the range [0, 1000] drawn from an exponential distribution with rate lambda_x, and I want to transform it into an array that follows an exponential distribution with a different rate lambda_y. How can I find a function that does this mapping?
I tried to use the inverse function below, but it didn't work.
def inverse(X, lambd):
    """Inverse CDF of the exponential distribution."""
    return -np.log(1 - X) / lambd

It should be as simple as taking your original exponentials X and scaling them by λx/λy to produce the Y's.
A well-known mechanism is to generate exponentials via inverse transform sampling. The second example on that page shows that if you generate U's which are uniformly distributed between 0 and 1 (where 100*U corresponds to the percentiles of the distribution) and transform them using the formula -ln(1 - U) / λ, you get exponentials with rate λ. With λ = λx this yields your X distribution, and with λ = λy it yields the Y distribution. Hence rescaling by the ratio of the lambdas converts one distribution into the other at any given percentile.
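A minimal sketch of the scaling approach (the rates and sample size below are illustrative):
import numpy as np

rng = np.random.default_rng(0)
lambda_x, lambda_y = 2.0, 0.5              # source and target rates

# X ~ Exponential(rate=lambda_x); note that numpy's scale parameter is 1/rate
X = rng.exponential(scale=1 / lambda_x, size=100_000)

# scaling by lambda_x / lambda_y turns an Exp(lambda_x) sample into an Exp(lambda_y) sample
Y = X * (lambda_x / lambda_y)

print(X.mean(), 1 / lambda_x)              # both close to 0.5
print(Y.mean(), 1 / lambda_y)              # both close to 2.0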


Getting variance values for random samples generated from a standard normal distribution using numpy

I have a function that gives me probability distributions for each class, in terms of a matrix corresponding to mean values and another matrix corresponding to variance values. For example, if I had four classes then I would have the following outputs:
y_means = [1,2,3,4]
y_variance = [0.01,0.02,0.03,0.04]
I need to do the following calculation to the mean values to continue with the rest of my program:
y_means = np.array(y_means)
y_means = np.reshape(y_means,(y_means.size,1))
A = np.random.randn(10,y_means.size)
y_means = np.matmul(A,y_means)
Here, I have used the numpy.random.randn function to generate random samples from a standard normal distribution, and then multiplied this with the matrix of mean values to obtain a new output matrix. The dimension of the output matrix is then (10 x 1).
I need to do a similar calculation such that my output_variances will also be a (10 x 1) matrix. But it is not meaningful to multiply the variances in the same way with random samples from a standard normal distribution, because this would result in negative values as well. This is undesirable because my ultimate aim would be to create a normal distribution with these mean values and their corresponding variance values using:
torch.distributions.normal.Normal(loc=y_means, scale=y_variance)
So my question is whether there is any method by which I can get a variance value for each random sample generated by numpy.random.randn? Because then the multiplication with such a matrix would make more sense for output_variance.
Or if there is any other strategy for this that I might be unaware of, please let me know.
The problem mentioned in the question required another matrix of the same dimension as A that corresponded to a variance measure for the random samples present in A.
Taking a row-wise or column-wise variance of the matrix A using numpy.var() did not give a matching 10 x 4 matrix to multiply with y_variance.
I had solved the above problem by using the following approach:
First create a matrix with the same dimensions as A with zero entries, using the following line of code:
A_var = np.zeros_like(A)
Then, using torch.distributions, create normal distributions with the values in A as the means and the zeros as the scales:
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=torch.Tensor(A_var))
https://pytorch.org/docs/stable/distributions.html lists all the operations possible on Normal distributions in PyTorch. The sample() method can draw samples of any requested size from a given distribution. This property was exploited to first generate a sample tensor of size 10 x 10 x 4 and then calculate the variance along axis 0.
np.var(np.array(dist_A.sample((10,))), axis=0)
This would result in a variance matrix of size 10 x 4, which can be used for calculations with y_variance.

How to calculate one-sided tolerance interval with scipy

I would like to calculate a one-sided tolerance bound based on the normal distribution, given a data set with known N (sample size), standard deviation, and mean.
If the interval were two-sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but calculate it as a one-sided bound (e.g., 95% of values are above or below the bound x)?
I assume that you are interested in computing a one-sided tolerance bound using the normal distribution (based on the fact that you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper; in addition, paragraph 4.8 (page 23) of that paper provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. You can find GitHub repositories that contain such a calculator to draw inspiration from, for example that one, from which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct

# sample size
n = 1000
# percentile for the TI to estimate
p = 0.9
# confidence level
g = 0.95
# a demo sample
x = np.random.normal(100, size=n)
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n - 1, nc=np.sqrt(n) * zp)
# k factor from the Young et al. paper
k = t / np.sqrt(n)
# one-sided tolerance upper bound
conf_upper_bound = mu_est + k * sigma_est
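For a one-sided lower bound, the same k factor applies with a minus sign, by symmetry of the normal distribution (continuing the snippet above):
# one-sided tolerance lower bound (same k factor, mirrored)
conf_lower_bound = mu_est - k * sigma_est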
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph

Calculating discrete PDF from discrete CDF in python

If we have a discrete CDF for quantiles like
quantiles = array([1.000e-04, 1.000e-03, 1.000e-02, 2.000e-02, 3.000e-02, 4.000e-02,
5.000e-02, 6.000e-02, 7.000e-02, 8.000e-02, 9.000e-02, 1.000e-01,
2.000e-01, 3.000e-01, 4.000e-01, 5.000e-01, 6.000e-01, 7.000e-01,
8.000e-01, 9.000e-01, 9.100e-01, 9.200e-01, 9.300e-01, 9.400e-01,
9.500e-01, 9.600e-01, 9.700e-01, 9.800e-01, 9.900e-01, 9.990e-01,
9.999e-01])
Is it valid to create a reverse-mapping linear interpolation? That is, from the CDF quantiles we estimate the value of the random variable satisfying the CDF condition p(x < a) = p_a. Then we draw uniformly distributed values from 0 to 1 and generate the random variable in question (think of mapping from the y to the x axis on a CDF plot). Would the PDF from this be a good approximation?
from scipy import stats
from scipy.interpolate import interp1d

# interpolate from CDF probabilities back to the corresponding values (the inverse CDF)
f = interp1d(quantiles, matching_discrete_cdf, kind='linear')
uni_rv = stats.uniform.rvs(loc=quantiles.min(),
                           scale=quantiles.max() - quantiles.min(),
                           size=nof_items)
pdf = f(uni_rv)
I assume that when you write "pdf" you mean "sample" and not an actual probability density function; and when you write "matching_discrete_cdf", you mean the "percent point function" (PPF) which is the inverse of CDF. The terminological confusion aside, the idea is sound: generating a sample for a custom distribution by transforming a uniform sample by the PPF is a standard approach.
The interpolation will distort the distribution a little, and so will the fact that the quantiles 1.000e-04 and 9.999e-01 of the original distribution become the min and max of the generated numbers (the original distribution had some small chance of falling outside these limits). But this should be acceptable, and is inevitable given the data you have. Maybe use cubic interpolation instead of linear?
On the chance that you actually want the PDF and not a sample -- the PDF is the derivative of the CDF. I would use cubic spline interpolation on the CDF values (InterpolatedUnivariateSpline), and then take its derivative.
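Here is a minimal sketch of that derivative approach; the data below is illustrative (CDF values of a standard normal on a small grid) rather than the array from the question:
import numpy as np
from scipy.stats import norm
from scipy.interpolate import InterpolatedUnivariateSpline

# illustrative data: x values whose standard-normal CDF equals the given probabilities
probs = np.array([1e-4, 1e-3, 0.01, 0.05, 0.1, 0.25, 0.5,
                  0.75, 0.9, 0.95, 0.99, 0.999, 0.9999])
x_vals = norm.ppf(probs)

# cubic spline through the (x, CDF) points, then differentiate to approximate the PDF
cdf_spline = InterpolatedUnivariateSpline(x_vals, probs, k=3)
pdf_spline = cdf_spline.derivative()

x_grid = np.linspace(x_vals[0], x_vals[-1], 200)
approx_pdf = pdf_spline(x_grid)            # compare against norm.pdf(x_grid)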

Sampling random rates from poisson distribution given observed counts with Python

I want to generate rates which are consistent with observed counts according to a Poisson distribution.
It's easy to do the reverse with scipy. I can draw counts given a fixed rate
counts = scipy.stats.poisson.rvs(mu)
but I can't find an easy way that takes counts as the argument and returns random rates.
Drawing counts from scipy.stats.poisson.rvs(mu) is sampling from a Poisson distribution. If you have a set of samples (counts) from a single Poisson distribution and you want a rate, you're trying to estimate that Poisson distribution. To estimate it, compute the average of the counts; that average is your estimate of λ. Then the distribution is:
P(k) = λ^k e^(-λ) / k!
The distribution can then be used to compute the probability of observing some count (k) in an interval.
If instead each count is assumed to come from a separate Poisson distribution, then you only have one sample from each, and the best estimate of that distribution's mean λ is the sample itself.
See https://en.wikipedia.org/wiki/Poisson_distribution
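As an illustration of estimating λ from counts assumed to come from a single Poisson distribution (the true rate 4.2 below is made up):
import scipy.stats

counts = scipy.stats.poisson.rvs(4.2, size=10_000)   # samples from one Poisson(4.2)
lam_hat = counts.mean()                              # estimated rate, close to 4.2
prob_k5 = scipy.stats.poisson.pmf(5, lam_hat)        # P(k = 5) under the estimated rate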
Turns out what I was actually looking for is the Gamma distribution, which has the same functional form, but is continuous. To accomplish what I was trying to do with scipy:
mu = scipy.stats.gamma.rvs(counts+1)
The counts+1 is just because of how the power is defined in the gamma density (its exponent is the shape parameter minus one); see the SciPy docs for scipy.stats.gamma.
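A minimal sketch of that call, with an illustrative array of observed counts:
import numpy as np
import scipy.stats

counts = np.array([3, 0, 7, 2, 5])       # illustrative observed counts
# one rate per count: Gamma(shape=k + 1, scale=1) has, as a function of the rate,
# the same form as the Poisson likelihood of an observed count k
mu = scipy.stats.gamma.rvs(counts + 1)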

How can I maximize the Poissonian likelihood of a histogram given a fit curve with scipy/numpy?

I have data in a python/numpy/scipy environment that needs to be fit to a probability density function. A way to do this is to create a histogram of the data and then fit a curve to this histogram. The method scipy.optimize.leastsq does this by minimizing the sum of (y - f(x))**2, where (x,y) would in this case be the histogram's bin centers and bin contents.
In statistical terms, this least-squares fit maximizes the likelihood of obtaining that histogram under the assumption that each bin count is sampled from a Gaussian centered on the fit function at that bin's position. You can see this easily: each term (y - f(x))**2 is, up to constants, -log(gauss(y | mean=f(x))), and the sum is therefore the negative logarithm of the product of the Gaussian likelihoods over all the bins.
That is, however, not always accurate: for the type of statistical data I'm looking at, each bin count is the result of a Poissonian process, so I want to maximize (the logarithm of the product over all the bins (x, y) of) poisson(y | mean=f(x)). The Poissonian comes very close to the Gaussian distribution for large values of f(x), but if my histogram doesn't have such good statistics, the difference becomes relevant and influences the fit.
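For reference, a minimal sketch of minimizing that Poisson negative log-likelihood directly with scipy.optimize.minimize; the model, data, and all names below are illustrative, not from the original post:
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def poisson_nll(params, bin_centers, bin_counts, model):
    """Negative log-likelihood of Poisson bin counts with means model(x, *params)."""
    mu = np.clip(model(bin_centers, *params), 1e-12, None)    # guard against log(0)
    # -log prod_i [mu_i^k_i e^(-mu_i) / k_i!] = sum_i [mu_i - k_i*log(mu_i) + log(k_i!)]
    return np.sum(mu - bin_counts * np.log(mu) + gammaln(bin_counts + 1))

def model(x, amplitude, decay):
    return amplitude * np.exp(-x / decay)

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=500)
counts, edges = np.histogram(data, bins=30, range=(0, 10))
centers = 0.5 * (edges[:-1] + edges[1:])

result = minimize(poisson_nll, x0=[50.0, 1.0],
                  args=(centers, counts, model), method='Nelder-Mead')
print(result.x)        # fitted (amplitude, decay)
Replacing the returned expression with np.sum((bin_counts - mu) ** 2) recovers the least-squares objective; the two differ most in bins where the expected count is small.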
If I understood correctly, you have data and want to see whether or not some probability distribution fits your data.
Well, if that's the case, you need a QQ-plot. Take a look at this StackOverflow question and answer; however, it is about the normal distribution, and you need code for the Poisson distribution. All you need to do is create some random data according to a Poisson random function and test your samples against it. Here you can find an example of a QQ-plot for the Poisson distribution. Here's the code from that web site:
#! /usr/bin/env python
from pylab import *
p = poisson(lam=10, size=4000)
m = mean(p)
s = std(p)
n = normal(loc=m, scale=s, size=p.shape)
a = m-4*s
b = m+4*s
figure()
plot(sort(n), sort(p), 'o', color='0.85')
plot([a,b], [a,b], 'k-')
xlim(a,b)
ylim(a,b)
xlabel('Normal Distribution')
ylabel(r'Poisson Distribution with $\lambda=10$')
grid(True)
savefig('qq.pdf')
show()
