I want to generate rates which are consistent with observed counts according to a Poisson distribution.
It's easy to do the reverse with scipy. I can draw counts given a fixed rate
counts = scipy.stats.poisson.rvs(mu)
but I can't find an easy way with counts as the argument returning random rates.
Drawing counts from scipy.stats.poisson.rvs(mu) is sampling from a Poisson distribution. If you have a set of samples (counts) from a single Poisson distribution and you want a rate, you are trying to estimate that Poisson distribution. To do so, compute the average of the counts and use it as the estimate of λ. Then the distribution is:
$$P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
The distribution can then be used to compute the probability of observing some count (k) in an interval.
If instead each of the counts is assumed to come from a separate Poisson distribution, then you only have one sample from each, and the best estimate of each distribution takes that sample as its mean λ.
See https://en.wikipedia.org/wiki/Poisson_distribution
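As a minimal sketch of the estimate described above: the sample mean of the counts is used as λ, and scipy.stats.poisson.pmf evaluates P(k). The counts array is made-up illustration data, not something from the question.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, assumed to come from one Poisson distribution
counts = np.array([3, 5, 4, 6, 2, 4, 5, 3])

# The sample mean is the estimate of the rate lambda
lam_hat = counts.mean()

# Probability of observing k events in an interval, for k = 0..14
k = np.arange(0, 15)
pmf = stats.poisson.pmf(k, lam_hat)

# Probability of observing exactly 4 events
prob_k4 = stats.poisson.pmf(4, lam_hat)
```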
Turns out what I was actually looking for is the Gamma distribution, which has the same functional form, but is continuous. To accomplish what I was trying to do with scipy:
mu = scipy.stats.gamma.rvs(counts+1)
The counts+1 comes from how the power is defined in the Gamma density: with shape a it is proportional to x^(a-1) e^(-x), so a = counts+1 matches the Poisson term λ^k e^(-λ).
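A short sketch of that one-liner in use, under the assumption of unit scale as in the answer. The counts array and the seed are illustrative only. Each column holds random rates drawn for one observed count, and their average should sit near counts+1, the mean of a Gamma distribution with that shape.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical observed counts
counts = np.array([0, 3, 10, 50])

# Draw 10000 random rates per count; the shape parameter is counts + 1
rates = stats.gamma.rvs(counts + 1, size=(10000, counts.size), random_state=rng)

# The mean of Gamma(shape=a, scale=1) is a, so this is close to counts + 1
mean_rates = rates.mean(axis=0)
```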
I would like to calculate a one sided tolerance bound based on the normal distribution given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but calculate a one-sided bound (a bound x such that 95% of values lie above it, or below it)?
I assume that you are interested in computing a one-sided tolerance bound using the normal distribution (since you mention the scipy.stats.norm.interval function as the two-sided equivalent of what you need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper. In addition, paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. GitHub repositories contain such calculators that you can draw inspiration from, for example the one on which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n=1000
# Percentile for the TI to estimate
p=0.9
# confidence level
g = 0.95
# a demo sample
x = np.random.normal(loc=100, size=n)
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
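The same calculation can be wrapped in a small helper that returns either the upper or the lower bound. This is only a sketch: the function names tolerance_factor and one_sided_tolerance_bound are made up for illustration, and the lower bound assumes the symmetric form mean - k*sigma.

```python
import numpy as np
from scipy.stats import norm, nct

def tolerance_factor(n, p, g):
    """k factor for a one-sided normal tolerance bound (hypothetical helper)."""
    zp = norm.ppf(p)
    # g-th quantile of a noncentral t with n-1 degrees of freedom
    return nct.ppf(g, df=n - 1, nc=np.sqrt(n) * zp) / np.sqrt(n)

def one_sided_tolerance_bound(x, p=0.9, g=0.95, upper=True):
    """Upper (or lower) bound covering a fraction p with confidence g."""
    x = np.asarray(x, dtype=float)
    k = tolerance_factor(x.size, p, g)
    sign = 1.0 if upper else -1.0
    return x.mean() + sign * k * x.std(ddof=1)

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=1, size=30)  # demo data
upper = one_sided_tolerance_bound(sample, upper=True)
lower = one_sided_tolerance_bound(sample, upper=False)
k10 = tolerance_factor(10, 0.9, 0.95)
```

For large n the factor k approaches the plain normal quantile norm.ppf(p); for small n it is noticeably larger, which is the price of estimating the mean and standard deviation from the sample.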
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95-th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph
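The same comparison can be made numerically instead of graphically. The sketch below uses scipy rather than OpenTURNS (my substitution, not part of the answer above) to compare the 95% quantile of the standard normal with that of the Student t-distribution for a few sample sizes.

```python
from scipy.stats import norm, t

# 95% quantile of the standard normal distribution
z95 = norm.ppf(0.95)

# 95% quantile of the Student t-distribution with n-1 degrees of freedom
t95 = {n: t.ppf(0.95, df=n - 1) for n in (5, 20, 100, 1000)}

for n, q in t95.items():
    print(f"n={n:5d}  t quantile={q:.4f}  normal quantile={z95:.4f}")
```

The t quantile shrinks toward the normal one as n grows, so the gap matters mostly for small samples.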
I have a numpy array with values in the range [0, 1000] that follow an exponential distribution with rate lambda_x. I want to transform it into an array that follows a different exponential distribution with rate lambda_y. How can I find a function that does this mapping?
I tried to use the inverse function, but it didn't work.
def inverse(X, lambd):
    """Inverse of exponential distribution """
    return -np.log(1-X)/lambd
It should be as simple as taking your original exponentials X and scaling them by multiplying by λx/λy to produce Y's.
A well known mechanism is to generate exponentials via inverse transform sampling. The second example on that page shows that if you generate U's which are uniformly distributed between 0 and 1 (where 100*U corresponds to the percentiles of the distribution) and transform them using the formula -ln(1 - U) / λ, you will get exponentials with rate λ. If λ is λx it yields your X distribution, and if λ is λy it yields the Y distribution. Hence rescaling by the ratio of the lambdas will convert from one to the other for a given percentile.
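A minimal sketch of the rescaling, with made-up rates and a fixed seed for illustration. It also checks that going through the percentiles (CDF with λx, then the inverse transform with λy) gives the same values as the direct rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)
lambda_x, lambda_y = 0.5, 2.0  # illustrative rates

# Exponentials with rate lambda_x (numpy parametrizes by scale = 1/rate)
x = rng.exponential(scale=1 / lambda_x, size=100_000)

# Direct rescaling to rate lambda_y
y = x * lambda_x / lambda_y

# Same mapping via percentiles: u = CDF_x(x), then inverse transform with lambda_y
u = -np.expm1(-lambda_x * x)          # 1 - exp(-lambda_x * x)
y_via_percentiles = -np.log1p(-u) / lambda_y
```

The mean of y should be close to 1/lambda_y, and both routes agree up to floating-point error.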
I have this task:
Choose your favorite continuous distribution (the less it looks normal, the more interesting; try to choose one of the distributions we have not discussed in the course).
Generate a sample of 1000 from it, build a histogram of the sample,
and draw a theoretical density of distribution of your random value
on top of it (so that the values are on the same scale, don't forget
to set the histogram to normed=True). Your task is to estimate the
distribution of your random sample average for different sample
sizes.
3. To do this, generate 1000 samples of n volume and build
histograms of their sample averages for three or more n values (for
example, 5, 10, 50).
4. Using the information on the mean and
variance of the original distribution (easily found on wikipedia),
calculate the values of the normal distribution parameters, which,
according to the central limit theorem, approximate the distribution
of the sample averages. Note: to calculate the values of these
parameters, it is the theoretical mean and the variance of your
random value that should be used, and not their sample estimates.
5. On top of each histogram, draw the density of the corresponding
normal distribution (be careful with the parameters of the function,
it takes the input not the dispersion, but the standard deviation).
Describe the difference between the distributions obtained at different n values. How
does the accuracy of the approximation of
the distribution of sample averages change with the growth of n?
So if I want to pick the exponential distribution in Python, do I need to do something like this?
from scipy.stats import expon
import matplotlib.pyplot as plt
exdist=sc.expon(loc=2,scale=3) # loc and scale - shift and scale parameters, default values 0 and 1.
mean, var, skew, kurt = exdist.stats(moments='mvsk') # Let's see the moments of our distribution.
x = np.linspace((0,2,100))
ax.plot(x, exdist.pdf(x)) # Let's draw it
arr=exdist.rvc(size=1000) # generation of thousand of random numbers. (Is it for task 3?)
And I keep getting this error; here is a screenshot from Jupyter:
https://i.stack.imgur.com/zwUtu.png
Could you please explain to me how to write the right code? I can't figure out where to start or where I am making a mistake. Do I have to use arr.mean() to get a sample mean and plt.hist(arr,bins=) to build a histogram? I would be very grateful for an explanation.
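One possible way to set up the numerical part of the task, as a sketch rather than a definitive solution. It assumes the same expon(loc=2, scale=3) as in the snippet above, uses rvs (not rvc) to draw samples, and takes the theoretical mean and variance from the frozen distribution. The results dictionary is just an illustrative container.

```python
import numpy as np
from scipy import stats

# Frozen exponential distribution with shift 2 and scale 3
exdist = stats.expon(loc=2, scale=3)

# Theoretical mean and variance (loc + scale and scale**2 for expon)
mean, var = exdist.stats(moments='mv')
mean, var = float(mean), float(var)

rng = np.random.default_rng(0)
results = {}
for n in (5, 10, 50):
    # 1000 samples of size n, one sample per row
    samples = exdist.rvs(size=(1000, n), random_state=rng)
    sample_means = samples.mean(axis=1)
    # Normal approximation from the central limit theorem
    clt = stats.norm(loc=mean, scale=np.sqrt(var / n))
    results[n] = (sample_means.mean(), sample_means.std(ddof=1), clt.std())
```

For the plots, plt.hist(sample_means, density=True) draws the histogram on the density scale, and plt.plot(x, clt.pdf(x)) draws the normal density on top, with x from np.linspace(start, stop, num) passed as three separate arguments.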
The following model (taken from Bayesian Methods for Hackers) works with the Poisson distribution.
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
count_data = np.loadtxt("data/txtdata.csv")
n_count_data = len(count_data)
with pm.Model() as model:
    alpha = 1.0/count_data.mean() # Recall count_data is the
    # variable that holds our txt counts
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1)
    idx = np.arange(n_count_data) # Index
    lambda_ = pm.math.switch(tau >= idx, lambda_1,lambda_2)
    observation = pm.Poisson("obs", lambda_, observed=count_data)
    step = [pm.Metropolis(), pm.NUTS()]
    trace = pm.sample(10000, tune=5000,step=step)
pm.traceplot(trace, ['lambda_1', 'lambda_2', 'tau'])
plt.show()
[Figure: result with the Poisson distribution]
However, when using an Exponential random variable in this model:
observation = pm.Exponential("obs", lambda_, observed=count_data)
I get:
[Figure: result with the Exponential distribution]
The reason I want to use the Exponential distribution is that my data are not integers.
I am not sure if the problem is with the lambda_ definition or with something else (the sampler needed for this).
The Poisson distribution models counts.
It is also used in things like queueing networks, where customer arrivals are modeled as a Poisson process. Note that the expected inter-arrival time of customers is then the inverse of the rate parameter (usually lambda).
One can feed data to the Poisson process as counts per fixed sampled interval of time e.g. how many customers did you get per day.
An exponential distribution is used to model some transition time. It is not a count process. It is a continuous process with its discrete analog being the geometric distribution.
A Poisson distribution is used to model discrete data and discrete counts that have exponential distribution of time between successive counts.
They are very similar looking but are different in nature.
It is likely that the count data that was fed into the exponential distribution confused the whole process.
In other words, the exponential distribution is not an appropriate model for this count data.
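The relationship between the two distributions can be illustrated with a small simulation. This is a sketch with an arbitrary rate and seed: inter-arrival times are drawn from an exponential distribution, and counting the arrivals per unit interval gives integer counts whose mean and variance are both close to the rate, as expected for a Poisson distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 4.0  # illustrative arrivals per unit time

# Continuous waiting times between successive arrivals
gaps = rng.exponential(scale=1 / rate, size=200_000)
arrival_times = np.cumsum(gaps)

# Count arrivals in each complete unit interval [0,1), [1,2), ...
n_units = int(arrival_times[-1])
in_range = arrival_times[arrival_times < n_units]
counts = np.bincount(in_range.astype(int), minlength=n_units)

print(gaps.mean())                  # close to 1 / rate
print(counts.mean(), counts.var())  # both close to rate
```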
I have data in a python/numpy/scipy environment that needs to be fit to a probability density function. A way to do this is to create a histogram of the data and then fit a curve to this histogram. The method scipy.optimize.leastsq does this by minimizing the sum of (y - f(x))**2, where (x,y) would in this case be the histogram's bin centers and bin contents.
In statistical terms, this least-squares fit maximizes the likelihood of obtaining that histogram by sampling each bin count from a Gaussian centered on the fit function at that bin's position. You can easily see this: each term (y-f(x))**2 is, up to constants, -log(gauss(y|mean=f(x))), and the sum is the negative logarithm of the product of the Gaussian likelihoods over all the bins.
However, that is not always accurate: for the type of statistical data I'm looking at, each bin count would be the result of a Poissonian process, so I want to minimize (the negative logarithm of the product over all the bins (x,y) of) poisson(y|mean=f(x)). The Poissonian comes very close to the Gaussian distribution for large values of f(x), but if my histogram doesn't have as good statistics, the difference would be relevant and would influence the fit.
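One way such a fit could be sketched is with scipy.optimize.minimize on the Poisson negative log-likelihood, sum(f(x) - y*log(f(x))), where the log(y!) term is dropped because it does not depend on the parameters. The Gaussian-shaped model, the synthetic data, and the starting values below are all assumptions for illustration, and the model is evaluated at the bin centers as an approximation of the expected bin content.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.5, size=300)  # synthetic data

# Histogram: y are bin contents, x are bin centers
y, edges = np.histogram(data, bins=20, range=(0, 10))
x = 0.5 * (edges[:-1] + edges[1:])

def model(x, params):
    """Hypothetical fit function: expected bin content at x."""
    amp, mu, sigma = params
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def poisson_nll(params):
    """Poisson negative log-likelihood, without the constant log(y!) term."""
    f = np.clip(model(x, params), 1e-12, None)  # avoid log(0)
    return np.sum(f - y * np.log(f))

x0 = [y.max(), x[np.argmax(y)], 1.0]
result = minimize(poisson_nll, x0=x0, method="Nelder-Mead")
amp_fit, mu_fit, sigma_fit = result.x
```

Unlike least squares, empty bins still contribute here through the f term, which is where the two approaches differ most for sparse histograms.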
If I understood correctly, you have data and want to see whether or not some probability distribution fits your data.
Well, if that's the case, you need a QQ-plot. Take a look at this StackOverflow question-answer. However, that one is about the normal distribution, and you need code for the Poisson distribution. All you need to do is create some random data from a Poisson random generator and test your samples against it. Here you can find an example of a QQ-plot for the Poisson distribution. Here's the code from this web-site:
#! /usr/bin/env python
from pylab import *
p = poisson(lam=10, size=4000)
m = mean(p)
s = std(p)
n = normal(loc=m, scale=s, size=p.shape)
a = m-4*s
b = m+4*s
figure()
plot(sort(n), sort(p), 'o', color='0.85')
plot([a,b], [a,b], 'k-')
xlim(a,b)
ylim(a,b)
xlabel('Normal Distribution')
ylabel(r'Poisson Distribution with $\lambda=10$')
grid(True)
savefig('qq.pdf')
show()
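The script above plots Poisson samples against normal samples. To test a data set against a Poisson distribution directly, the sorted data can be compared with theoretical Poisson quantiles. The sketch below does this numerically. The sample is synthetic, and the plotting positions (i - 0.5)/n are one common convention that I am assuming here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.poisson(lam=10, size=4000)  # stand-in for your data

# Fit the rate, then compute theoretical quantiles at the plotting positions
lam_hat = sample.mean()
probs = (np.arange(1, sample.size + 1) - 0.5) / sample.size
theoretical = stats.poisson.ppf(probs, lam_hat)
empirical = np.sort(sample)

# Points of a QQ-plot are (theoretical[i], empirical[i]);
# for a good fit they lie close to the diagonal
mean_gap = np.mean(np.abs(empirical - theoretical))
corr = np.corrcoef(theoretical, empirical)[0, 1]
```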