I am using SciPy's norm object, and I have a normal distribution with a mean of 100 and a standard deviation of 20:
from scipy.stats import norm
dist = norm(loc=100., scale=20.)
I want to get the probability of a new instance being at a location such as 70 or 120. How can I retrieve this probability using the norm object?
The norm object has a few methods such as norm.pdf, norm.cdf, norm.ppf, etc. I am not sure which one to use for this task.
Thank you
First of all, you are talking about a normal distribution, which is a continuous distribution, so you cannot get the probability that a new instance is at an exact location (that would be 0 by definition).
In your example you can get the probability that the observation is, for example, > 70 or < 70 (the strict inequality makes no difference for continuous distributions, so >= and > are the same).
You need dist.cdf(70) to get P(X <= 70) and 1 - dist.cdf(70) to get P(X > 70).
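For illustration, here is a minimal sketch using the dist object from the question (the 70-to-120 interval probability is my own addition, showing how two cdf calls combine):
from scipy.stats import norm
dist = norm(loc=100., scale=20.)
p_below_70 = dist.cdf(70)                    # P(X <= 70)
p_above_70 = 1 - dist.cdf(70)                # P(X > 70), equivalently dist.sf(70)
p_70_to_120 = dist.cdf(120) - dist.cdf(70)   # P(70 < X <= 120)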
So let's imagine I have an array of sample data which is normally distributed. What I want is to compute the probability of another sample being less than -3 and provide a bootstrapped confidence interval for that probability. After doing some research, I found the bootstrapped Python library which I want to use to find the CI.
So I have:
import numpy as np
import bootstrapped.bootstrap as bs
import bootstrapped.stats_functions as bs_stats
mu, sigma = 2.5, 4 # mean and standard deviation
samples = np.random.normal(mu, sigma, 1000)
bs.bootstrap(samples, stat_func= ???)
What should I write for stat_func? I tried writing a lambda function to compute the probability of -3, but it did not work. I know how to compute the probability of a sample being less than -3; it's the CI that I am having a hard time dealing with.
I followed the example of stat_functions.mean from the bootstrapped package. Below it is wrapped in a 'factory' so that you can specify the level at which you want to calculate the frequency (sadly, you cannot pass it as an optional argument to the functions that bootstrap() expects). Basically, prob_less_func_factory(level) returns a function that calculates the proportion of your sample that is less than that level. It can be used for matrices, just like the example I followed.
def prob_less_func_factory(level=-3.0):
    def prob_less_func(values, axis=1):
        '''Returns the proportion of samples in each row of a matrix that are less than 'level'.'''
        return np.mean(np.asmatrix(values) < level, axis=axis).A1
    return prob_less_func
Now you pass it in like so
level = -3
bs_res = bs.bootstrap(samples, stat_func = prob_less_func_factory(level=level))
and the result I get (yours will be slightly different because samples is random) is
0.088 (0.06999999999999999, 0.105)
so the bootstrap function estimated (well, calculated) the proportion of values in samples that are less than -3 to be 0.088, and the confidence interval around it is (0.06999999999999999, 0.105).
For checking we can calculate the theoretical value of one sample from your distribution being less than -3:
from scipy.stats import norm
print(f'Theoretical Prob(N(mean={mu},std={sigma})<{level}): {norm.cdf(level, loc=mu, scale=sigma)}')
and we get
Theoretical Prob(N(mean=2.5,std=4)<-3): 0.08456572235133569
so it all seems consistent.
I would like to calculate a one sided tolerance bound based on the normal distribution given a data set with known N (sample size), standard deviation, and mean.
If the interval were two sided I would do the following:
conf_int = stats.norm.interval(alpha, loc=mean, scale=sigma)
In my situation, I am bootstrapping samples, but if I weren't I would refer to this post on stackoverflow: Correct way to obtain confidence interval with scipy and use the following: conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
How would you do the same thing, but calculate it as a one-sided bound (e.g., 95% of values are above, or below, the bound x)?
I assume that you are interested in computing a one-sided tolerance bound using the normal distribution (based on the fact that you mention the scipy.stats.norm.interval function as the two-sided equivalent of your need).
Then the good news is that, based on the tolerance interval Wikipedia page:
One-sided normal tolerance intervals have an exact solution in terms of the sample mean and sample variance based on the noncentral t-distribution.
(FYI: Unfortunately, this is not the case for the two-sided setting)
This assertion is based on this paper. Besides, paragraph 4.8 (page 23) provides the formulas.
The bad news is that I do not think there is a ready-to-use scipy function that you can safely tweak and use for your purpose.
But you can easily calculate it yourself. You can find GitHub repositories that contain such a calculator and take inspiration from them, for example that one, from which I built the following illustrative example:
import numpy as np
from scipy.stats import norm, nct
# sample size
n = 1000
# Percentile for the TI to estimate
p = 0.9
# confidence level
g = 0.95
# a demo sample
x = np.array([np.random.normal(100) for k in range(n)])
# mean estimate based on the sample
mu_est = x.mean()
# standard deviation estimated based on the sample
sigma_est = x.std(ddof=1)
# (100*p)th percentile of the standard normal distribution
zp = norm.ppf(p)
# gth quantile of a non-central t distribution
# with n-1 degrees of freedom and non-centrality parameter np.sqrt(n)*zp
t = nct.ppf(g, df=n-1., nc=np.sqrt(n)*zp)
# k factor from Young et al paper
k = t / np.sqrt(n)
# One-sided tolerance upper bound
conf_upper_bound = mu_est + (k*sigma_est)
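For a lower one-sided bound, the same k factor applies by symmetry of the normal distribution; a minimal addition to the sketch above:
# One-sided tolerance lower bound
conf_lower_bound = mu_est - (k*sigma_est)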
Here is a one-line solution with the openturns library, assuming your data is a numpy array named sample.
import openturns as ot
ot.NormalFactory().build(sample.reshape(-1, 1)).computeQuantile(0.95)
Let us unpack this. NormalFactory is a class designed to fit the parameters of a Normal distribution (mu and sigma) on a given sample: NormalFactory() creates an instance of this class.
The method build does the actual fitting and returns an object of the class Normal which represents the normal distribution with parameters mu and sigma estimated from the sample.
The sample reshape is there to make sure that OpenTURNS understands that the input sample is a collection of one-dimension points, not a single multi-dimensional point.
The class Normal then provides the method computeQuantile to compute any quantile of the distribution (the 95th percentile in this example).
This solution does not compute the exact tolerance bound because it uses a quantile from a Normal distribution instead of a Student t-distribution. Effectively, that means that it ignores the estimation error on mu and sigma. In practice, this is only an issue for really small sample sizes.
To illustrate this, here is a comparison between the PDF of the standard normal N(0,1) distribution and the PDF of the Student t-distribution with 19 degrees of freedom (this means a sample size of 20). They can barely be distinguished.
deg_freedom = 19
graph = ot.Normal().drawPDF()
student = ot.Student(deg_freedom).drawPDF().getDrawable(0)
student.setColor('blue')
graph.add(student)
graph.setLegends(['Normal(0,1)', 't-dist k={}'.format(deg_freedom)])
graph
I'm using scikit-optimize to do a BayesSearchCV within my RandomForestClassifier hyperparameter space. One hyperparameter should also be able to take the value 0 (zero) while otherwise having a log-uniform distribution:
ccp_alpha = Real(min(ccp_alpha), max(ccp_alpha), prior='log-uniform')
Since log(0) is undefined, it is apparently impossible for the parameter to ever take the value 0.
Consequently, the following error is thrown:
ValueError: Not all points are within the bounds of the space.
Is there any way to work around this?
Note that getting a 0 from a log-uniform distribution is not well defined. How would you normalize this distribution, or in other words what would the odds be of drawing a 0?
The simplest approach would be to supply a list of values to try, generated with the distribution you want. Since the values in this list will be sampled uniformly, you can use any distribution you like. For example, with the list
reals = [0,0,0,0,x1,x2,x3,x4]
where x1 to x4 are log-uniformly distributed, you get odds of 4/8 of drawing a 0 and odds of 4/8 of drawing a log-uniformly distributed value.
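Here is a minimal sketch of that list approach, assuming scikit-optimize's Categorical dimension is used to sample uniformly from the list; the bounds 1e-4 and 1e-1 are placeholders, not values from the question:
import numpy as np
from skopt.space import Categorical

rng = np.random.default_rng(0)
# Four zeros plus four log-uniformly distributed values (placeholder bounds).
log_uniform_values = list(10 ** rng.uniform(-4, -1, size=4))
ccp_alpha = Categorical([0.0] * 4 + log_uniform_values, name='ccp_alpha')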
If you really wanted to, you could also implement a class called MyReal (probably subclassed from Real) that implements an rvs method yielding the distribution you want.
I would like to define a bounded version of a continuous random variable distribution (say, an exponential, but I might want to use others as well). The bounds are 0 and 1. I would like to
draw random variates (as done by scipy.stats.rv_continuous.rvs),
use the ppf (percentage point function) (as done by scipy.stats.rv_continuous.ppf), and possibly
use the cdf (cumulative distribution function) (as done by scipy.stats.rv_continuous.cdf)
Possible approaches I can think of:
Getting random variates in an ad hoc way is not difficult
import numpy as np
import scipy.stats
d = scipy.stats.expon(0, 3/10.)  # an exponential distribution as an example
rv = d.rvs(size=target_number_of_rv)
rv = rv[0 <= rv]
rv = rv[rv <= 1]
while len(rv) < target_number_of_rv:
    rv = np.append(rv, d.rvs(1))  # append a fresh draw
    rv = rv[0 <= rv]
    rv = rv[rv <= 1]
but 1) this is non-generic and potentially error-prone and 2) it does not help with the ppf or cdf.
Subclassing scipy.stats.rv_continuous, as is done here and here. Thereby, the ppf of scipy.stats.rv_continuous can be used. The drawback is that it requires the pdf (not just a pre-defined rv_continuous object or the pdf of the unbounded distribution and the bounds), and if this is wrong, cdf and ppf and everything else will be wrong as well.
Designing a class that takes care of applying the bounds to the random variate generation and of correcting the value of the ppf obtained from the unbounded object in scipy.stats. A drawback is that this is non-generic and error-prone as well, and that it may be difficult to correct the ppf. My feeling is that the value of the cdf of the unbounded distribution could be rescaled using the share of probability mass that lies outside the bounds (lower and upper combined), but I may be wrong. For lower and upper bounds l and u and any valid quantile x (with l <= x <= u), that would be (cdf(x) - cdf(l)) / (cdf(u) - cdf(l)). Obtaining the ppf would, however, require inverting the resulting function (see the sketch after this list).
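Here is a minimal sketch of this cdf-scaling idea, assuming a frozen scipy.stats distribution d and bounds l and u; the helper name truncated_cdf_ppf is my own, not an existing API:
import scipy.stats

def truncated_cdf_ppf(d, l, u):
    '''Return (cdf, ppf) of d truncated to [l, u], built from the unbounded cdf and ppf.'''
    mass = d.cdf(u) - d.cdf(l)  # probability mass inside the bounds
    cdf = lambda x: (d.cdf(x) - d.cdf(l)) / mass
    # Inverting the rescaled cdf only requires the unbounded ppf:
    ppf = lambda q: d.ppf(d.cdf(l) + q * mass)
    return cdf, ppf

d = scipy.stats.expon(0, 3/10.)
cdf01, ppf01 = truncated_cdf_ppf(d, 0.0, 1.0)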
My feeling is that there might be a better and more generic way to do this. Is there? Maybe with sympy? Maybe by somehow obtaining the function object of the unbounded cdf and modifying it directly?
Python is version: 3.6.2, scipy is version 0.19.1.
If the distribution is one of those that is available in scipy.stats then you can evaluate its integral between the two bounds using the cdf for that distribution. Otherwise, you can define the pdf for rv_continuous and then use its cdf to get this integral.
Now, you have, in effect, the pdf for the bounded version of the pdf you want because you have calculated the normalising constant for it, in that integral. You can proceed to use rv_continuous with the form that you have for the pdf plus the normalising constant and with the bounds.
Here's what your code might look like. The variable scale is set according to the scipy documentation. norm is the integral of the exponential pdf over [0, 1]. Only about 0.49 of the probability mass is accounted for. Therefore, to make the exponential, when truncated to the [0, 1] interval, integrate to one, we must divide its pdf by this factor.
Truncated_expon is defined as a subclass of rv_continuous, as in the documentation. By supplying its pdf we make it possible (at least for such a simple integral!) for scipy to calculate this distribution's cdf and thereby to draw random samples.
I have calculated the cdf at one as a check.
>>> from scipy import stats
>>> lamda = 2/3
>>> scale = 1/lamda
>>> norm = stats.expon.cdf(1, scale=scale)
>>> norm
0.48658288096740798
>>> from math import exp
>>> class Truncated_expon(stats.rv_continuous):
...     def _pdf(self, x, lamda):
...         return lamda*exp(-lamda*x)/0.48658288096740798
...
>>> e = Truncated_expon(a=0, b=1, shapes='lamda')
>>> e.cdf(1, lamda=lamda)
1.0
>>> e.rvs(size=20, lamda=lamda)
array([ 0.20064067, 0.67646465, 0.89118679, 0.86093035, 0.14334989,
0.10505598, 0.53488779, 0.11606106, 0.41296616, 0.33650899,
0.95126415, 0.57481087, 0.04495104, 0.00308469, 0.23585195,
0.00653972, 0.59400395, 0.34919065, 0.91762547, 0.40098409])
I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (q-q) plot.
The formulas for my 4 distributions are:
where K0 is the modified Bessel function of the second kind and zeroth order, and Γ is the gamma function.
My sample style looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7 ...), so I have multiple identical values and also gaps in between them.
I've read the instructions on this site and tried to implement them in Python. So, like in the link:
1) I sorted my data from the smallest to the largest value.
2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.
3) And this is the point I can't manage.
As far as I understand, I should now use the values I calculated beforehand (those evenly spaced values), put them in the inverse functions of my above distributions and thus compute the theoretical quantiles of my distributions.
For reference, here are the inverse functions (partly calculated with wolframalpha, and as far it was possible):
where W is the Lambert W-function and everything in brackets afterwards is the argument.
The problem is that there apparently is no analytical inverse for the first distribution. The next one would probably produce complex values (a negative value under the root, because b = 0.55 according to the fit), and the last two involve the Lambert W-function (which I am unsure how to implement in Python).
So my question is, is there a way to calculate the q-q plots without the analytical expressions of the inverse distribution functions?
I'd appreciate any help you could give me very much!
A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose the one with the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you already have.
The log likelihood is just the sum of log p(x|model), where p(x|model) is the probability density of datum x under a given model. Here "model" means the model with parameters selected by maximizing the log likelihood over the possible parameter values.
You can be more careful about this by integrating the log likelihood over the parameter space, taking into account also any prior probability assigned to each model; that would be a Bayesian approach.
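For illustration, here is a minimal sketch of such a comparison with scipy.stats; the exponential and gamma distributions and the tiny sample are placeholders for your four fitted models and your data:
import numpy as np
from scipy.stats import expon, gamma

sample = np.array([0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7])

# Fit each candidate by maximum likelihood (loc fixed at 0 for this toy example)
# and freeze it with the fitted parameters.
candidates = {
    'expon': expon(*expon.fit(sample, floc=0)),
    'gamma': gamma(*gamma.fit(sample, floc=0)),
}
# Log likelihood = sum of log pdf values of the data under each fitted model.
log_likelihoods = {name: np.log(d.pdf(sample)).sum() for name, d in candidates.items()}
best_model = max(log_likelihoods, key=log_likelihoods.get)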
It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which, despite its heavy name, is pretty simple: it is the maximum difference between the fitted cdf and the empirical cdf. That is defensible, but I think comparing log likelihoods is more conventional, and also simpler, since you need only the pdf.
It happens that there is an easier way. It's taken me a day or two to dig around until I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!
First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed just add them to the def statement and use them in the return statement as required.
>>> from scipy import stats
>>> param = 3/2
>>> from math import exp
>>> class NoName(stats.rv_continuous):
...     def _pdf(self, x, param):
...         return param*exp(-param*x)
...
Now create an instance of this object, declare the lower end of its support (i.e., the lowest value that the r.v. can assume), and what the parameters are called.
>>> noname = NoName(a=0, shapes='param')
I don't have an actual sample of values to play with. I'll create a pseudo-random sample.
>>> sample = noname.rvs(size=100, param=param)
Sort it to make it into the so-called 'empirical cdf'.
>>> empirical_cdf = sorted(sample)
The sample has 100 elements, therefore generate 100 points at which to sample the inverse cdf, or quantile function, as discussed in the paper you referenced.
>>> theoretical_points = [(_-0.5)/len(sample) for _ in range(1, 1+len(sample))]
Get the quantile function values at these points.
>>> theoretical_cdf = [noname.ppf(_, param=param) for _ in theoretical_points]
Plot it all.
>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_cdf, theoretical_cdf)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
>>> plt.show()
Here's the Q-Q plot that results.
Darn it ... Sorry, I was fixated on a slick solution that would somehow bypass the missing inverse CDF and calculate the quantiles directly (and avoid any numerical approaches). But it can also be done by simple brute force.
First, you define a fine grid of quantiles for your distributions yourself (for instance, ten times finer than the original/empirical quantiles). Then you calculate the corresponding CDF values. Then you compare these values one by one with the ones calculated in step 2 of the question. The quantiles whose CDF values show the smallest deviations are the ones you were looking for; a sketch of this brute-force search follows below.
The precision of this solution is limited by the resolution of the quantiles you defined yourself.
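Here is a minimal sketch of this brute-force inversion, assuming a generic cdf callable; the exponential, the grid bounds, and the sample size are placeholders, not the distributions from the question:
import numpy as np
from scipy.stats import expon

cdf = expon(scale=1.5).cdf              # stand-in for one of the fitted distributions
x_grid = np.linspace(0, 20, 10000)      # self-defined fine grid of candidate quantiles
cdf_grid = cdf(x_grid)

n = 100                                 # sample size
target_probs = (np.arange(1, n + 1) - 0.5) / n  # evenly spaced points from step 2

# For each target probability, pick the grid point whose cdf value is closest.
idx = np.abs(cdf_grid[None, :] - target_probs[:, None]).argmin(axis=1)
theoretical_quantiles = x_grid[idx]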
But maybe I'm wrong and there is a more elegant way to solve this problem, then I would be happy to hear it!