I'm having trouble finding quantile functions for well-known probability distributions in Python, do they exist? In particular, is there an inverse normal distribution function? I couldn't find anything in either Numpy or Scipy.
Check the .ppf() method of any distribution class in scipy.stats.
This is the equivalent of a quantile function (also known as the percent point function or inverse CDF).
An example with the exponential distribution from scipy.stats:
# analysis libs
import scipy.stats
import numpy as np
# plotting libs
import matplotlib.pyplot as plt
# Example with the exponential distribution
c = 0      # location parameter
lamb = 2   # rate parameter; scipy parameterizes with scale = 1/lambda
# Create a frozen exponential distribution instance with the specified parameters
exp_obj = scipy.stats.expon(c, 1/lamb)
x_in = np.linspace(0, 0.99, 200)  # 200 probabilities, stopping short of 1 where ppf is infinite
y_out = exp_obj.ppf(x_in)
plt.plot(x_in, y_out)  # graphically check the results of the inverse CDF
plt.show()
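To answer the inverse normal part of the question directly: scipy.stats.norm.ppf is the inverse of the normal CDF. A quick check (values shown are approximate):
from scipy.stats import norm
print(norm.ppf(0.975))                 # ~1.96, the familiar 97.5% quantile of N(0, 1)
print(norm.ppf(0.5, loc=10, scale=2))  # 10.0, the median of N(10, 2**2)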
It seems newer, but NumPy also appears to have added a quantile function (np.quantile); maybe you can have a look (not tested).
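Note, though, that np.quantile computes empirical quantiles of a data array rather than the theoretical inverse CDF of a distribution; a small sketch of the difference:
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sample = rng.standard_normal(100000)
print(np.quantile(sample, 0.975))  # empirical quantile of the sample, close to...
print(norm.ppf(0.975))             # ...the theoretical quantile, ~1.96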
Suppose I have a grid given by
import numpy as np
grid = np.linspace(0,20,1000)
I want to get a 1000-by-1 vector p so that if one were to plot points
(grid[i], p[i]) the graph would look like the density of a lognormal distribution.
Use scipy.stats for obtaining pdfs of probability distributions!
NumPy, in most (all?) cases, only supports sampling methods, not pdf calculations. What's needed of course depends on the use case.
Often the pdf plays no role in practical sampling-only implementations, as in this case, where sampling is reduced to normal-distribution sampling (itself often reduced to uniform sampling combined with other functions) followed by the exponential function (from NumPy's C source):
double rk_lognormal(rk_state *state, double mean, double sigma)
{
return exp(rk_normal(state, mean, sigma));
}
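The same relationship can be sketched in Python, just to illustrate what the C routine above does (the shape value 0.954 is only an example):
import numpy as np

rng = np.random.default_rng(0)
mean, sigma = 0.0, 0.954
# exponentiating normal draws yields lognormal samples, mirroring rk_lognormal
samples = np.exp(rng.normal(mean, sigma, size=100000))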
Make sure to read the scipy.stats docs to learn how to use these!
Example code:
import numpy as np
import scipy.stats as spt
import matplotlib.pyplot as plt
rv = spt.lognorm(0.954) # "frozen" RV (shape-param fixed)
x_points = np.linspace(1, 20, 1000)  # 1000 evaluation points, 0 excluded
plt.scatter(x_points, rv.pdf(x_points))
plt.show()
Output: a scatter plot of the lognormal pdf evaluated over the grid.
Given the following algorithm:
import numpy as np
from matplotlib import pyplot as plt
exponent = 2
sample = np.random.random_sample(1000000)
distribution = sample ** exponent
plt.hist(distribution)
plt.show()
How can I find or interpolate the curve that best describes the plotted distribution for any exponent, given that the sample is always between 0 and 1, and how can I find the area under that to-be-found curve? I mean the curve for a sample size that approaches infinity.
Determining the type of distribution from samples of a random variable is not an easy task. You have to run validation tests to identify the distribution with the desired statistical confidence.
Scipy has test for normal distribution: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.normaltest.html
However, if you know the function (or just a set of x, y points) you can calculate the area under it using numerical methods.
For example, the trapezoidal rule: http://docs.scipy.org/doc/numpy/reference/generated/numpy.trapz.html or Simpson's rule: http://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html#scipy.integrate.simps
from scipy.integrate import simps
from numpy import trapz
print(trapz([1,2,3], x=[4,6,8]))
print(simps([1,2,3], x=[4,6,8]))
Output:
8.0
8.0
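For this particular transform the limiting density can also be written down analytically (a sketch, assuming the original sample is uniform on [0, 1]): if U is uniform on (0, 1) and X = U**k, then P(X <= x) = x**(1/k), so the density is f(x) = (1/k) * x**(1/k - 1), and the area under any probability density is 1. Numerically:
import numpy as np

exponent = 2
x = np.linspace(1e-6, 1, 10000)  # start just above 0, where the density diverges for exponent > 1
pdf = (1.0 / exponent) * x ** (1.0 / exponent - 1.0)
print(np.trapz(pdf, x))          # ~1.0, up to grid and truncation error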
I'm new to Bayesian stats and I'm trying to estimate the posterior of a poisson (likelihood) and gamma distribution (prior) in Python. The parameter I'm trying to estimate is the lambda variable in the poisson distribution. I think the posterior will take the form of a gamma distribution (conjugate prior?) but I don't want to leverage that. The only thing I'm given is the data (named "my_data"). Here's my code:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats
x=np.linspace(1,len(my_data),len(my_data))
lambda_estimate=np.mean(my_data)
prior = scipy.stats.gamma.pdf(x, alpha, beta)  # the parameters don't matter for now
likelihood_temp = lambda yi, a: scipy.stats.poisson.pmf(yi, a)
likelihood = lambda y, a: np.log(np.prod([likelihood_temp(data, a) for data in my_data]))
posterior=likelihood(my_data,lambda_estimate) * prior
When I try to plot the posterior I get an empty plot. I plotted the prior and it looks fine, so I think the issue is the likelihood. I took the log because the data is fairly large and I didn't want things to get unstable. Can anyone point out the issues in my code? Any help would be appreciated.
In Bayesian statistics, one goal is to calculate the posterior distribution of the parameter (lambda) given the data and the prior, over a range of possible values for lambda. In your code, you are calculating the prior over the array x, but you are using only a single value of lambda to calculate the likelihood. The posterior and likelihood should be evaluated over x as well, something like:
posterior = np.array([likelihood(my_data, lambda_i) for lambda_i in x]) * prior
(assuming you are not taking the logs of the prior and likelihood)
You might want to take a look at the PyMC3 library.
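Putting that together, here is a minimal sketch of the grid approach, assuming my_data is a 1-D array of counts and using illustrative alpha and beta values (placeholders, not from the question); it works in log space and normalizes on the grid to avoid the numerical issues you mention:
import numpy as np
import scipy.stats

my_data = np.random.poisson(4.0, size=200)  # placeholder data
alpha, beta = 2.0, 0.5                      # illustrative prior hyperparameters

lambdas = np.linspace(0.01, 20, 500)        # grid of candidate lambda values
prior = scipy.stats.gamma.pdf(lambdas, a=alpha, scale=1.0/beta)

# log-likelihood of the whole data set for each candidate lambda
log_lik = np.array([scipy.stats.poisson.logpmf(my_data, lam).sum() for lam in lambdas])

log_post = log_lik + np.log(prior)
log_post -= log_post.max()                  # avoid underflow before exponentiating
posterior = np.exp(log_post)
posterior /= np.trapz(posterior, lambdas)   # normalize so it integrates to 1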
I would recommend you to have a look at the conjugate_prior module.
You could just type:
from conjugate_prior import GammaPoisson
model = GammaPoisson(prior_a, prior_b)
model = model.update(...)
credible_interval = model.posterior(lower_bound, upper_bound)
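If you do want the closed-form conjugate result for comparison (the question mentions it, even if it prefers not to rely on it): with a Gamma(alpha, beta) prior in the rate parameterization and Poisson observations, the posterior is Gamma(alpha + sum(data), beta + n). A sketch with scipy, using the same placeholder values as above:
import numpy as np
import scipy.stats

my_data = np.random.poisson(4.0, size=200)  # placeholder data
alpha, beta = 2.0, 0.5                      # illustrative prior hyperparameters

post_alpha = alpha + my_data.sum()
post_beta = beta + len(my_data)

lambdas = np.linspace(0.01, 20, 500)
posterior_pdf = scipy.stats.gamma.pdf(lambdas, a=post_alpha, scale=1.0/post_beta)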
I've looked online and have yet to find an answer or a way to figure out the following.
I'm translating some MATLAB code to Python, where in MATLAB I'm looking to find the kernel density estimate with the function:
[p,x] = ksdensity(data)
where p is the probability at point x in the distribution.
Scipy has a function but only returns p.
Is there a way to find the probability at values of x?
Thanks!
That form of the ksdensity call automatically generates an arbitrary x. scipy.stats.gaussian_kde() returns a callable function that can be evaluated with any x of your choosing. The equivalent x would be np.linspace(data.min(), data.max(), 100).
import numpy as np
from scipy import stats

data = ...  # your 1-D data array
kde = stats.gaussian_kde(data)                # build the estimator from the data
x = np.linspace(data.min(), data.max(), 100)  # the same kind of grid ksdensity picks
p = kde(x)                                    # evaluate the density at those points
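Because kde is callable, you can also evaluate the density at any specific x values of your choosing, which is what the question asks for (the values below are just hypothetical examples):
p_at_points = kde(np.array([0.5, 2.5, 7.0]))  # density estimates at arbitrary x values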
Another option is the kernel density estimator in the Scikit-Learn Python package, sklearn.neighbors.KernelDensity
Here is a little example similar to the Matlab documentation for ksdensity for a Gaussian distribution:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
np.random.seed(12345)
# similar to MATLAB ksdensity example x = [randn(30,1); 5+randn(30,1)];
Vecvalues=np.concatenate((np.random.normal(0,1,30), np.random.normal(5,1,30)))[:,None]
Vecpoints=np.linspace(-8,12,100)[:,None]
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(Vecvalues)
logkde = kde.score_samples(Vecpoints)
plt.plot(Vecpoints,np.exp(logkde))
plt.show()
The plot this produces is a bimodal density curve matching the MATLAB ksdensity example.
Matlab is orders of magnitude faster than KernelDensity when it comes to finding the optimal bandwidth. Any idea of how to make KernelDensity faster? – Yuca Jul 16 '18 at 20:58
Hi, Yuca. MATLAB uses Scott's rule to estimate the bandwidth, which is fast but assumes the data come from a normal distribution. For more information, please see this post.
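A minimal sketch of that idea, assuming 1-D data: compute a normal-reference rule-of-thumb bandwidth (the sigma * (4/(3n))**(1/5) formula, in the spirit of what MATLAB's ksdensity defaults to) and pass it straight to KernelDensity instead of searching for it:
import numpy as np
from sklearn.neighbors import KernelDensity

data = np.concatenate((np.random.normal(0, 1, 30), np.random.normal(5, 1, 30)))  # example data

n = data.size
sigma = data.std(ddof=1)
bandwidth = sigma * (4.0 / (3.0 * n)) ** 0.2   # normal-reference rule of thumb

kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(data[:, None])
grid = np.linspace(data.min() - 3, data.max() + 3, 200)[:, None]
density = np.exp(kde.score_samples(grid))      # score_samples returns log-density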
I am using scipy.stats to fit my data.
scipy.stats.invgauss.fit(my_data_array)
scipy.stats.wald.fit(my_data_array)
From the wiki page http://en.wikipedia.org/wiki/Inverse_Gaussian_distribution it appears that the Wald distribution is just another name for the inverse Gaussian, but using the two functions above gives me different fitting parameters: scipy.stats.invgauss.fit gives me three parameters and scipy.stats.wald.fit gives two.
What is the difference between these two distributons in scipy.stats?
I was trying to find the answer here, http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wald.html and http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.invgauss.html, but really couldn't tell from those.
The link to the scipy.stats wald distribution has the answer to your question.
wald is a special case of invgauss with mu == 1.
So the following should produce the same answer:
import numpy as np
import scipy.stats as st
my_data = np.random.randn(1000)
wald_params = st.wald.fit(my_data)
invgauss_params = st.invgauss.fit(my_data, f0=1)
wald_params and invgauss_params are the same, except that invgauss has an extra 1 in front of the other two parameters; that is the shape parameter which, as the docs state, is fixed at one in the wald distribution (I fixed it with the argument f0=1).
Hope that helps.
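A quick way to see the relationship directly (a small sketch, not part of the original answer) is to compare the two pdfs with the invgauss shape parameter fixed at 1:
import numpy as np
import scipy.stats as st

x = np.linspace(0.1, 5, 50)
print(np.allclose(st.wald.pdf(x), st.invgauss.pdf(x, mu=1)))  # True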